From: AAAI-94 Proceedings. Copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Acting Optimally in Partially Observable Stochastic Domains

Anthony R. Cassandra*, Leslie Pack Kaelbling†, and Michael L. Littman‡
Department of Computer Science
Brown University
Providence, RI 02912
{arc,lpk,mll}@cs.brown.edu

*Anthony Cassandra's work was supported in part by National Science Foundation Award IRI-9257592.
†Leslie Kaelbling's work was supported in part by a National Science Foundation National Young Investigator Award IRI-9257592 and in part by ONR Contract N0001491-4052, ARPA Order 8225.
‡Michael Littman's work was supported by Bellcore.
Abstract

In this paper, we describe the partially observable Markov decision process (POMDP) approach to finding optimal or near-optimal control strategies for partially observable stochastic environments, given a complete model of the environment. The POMDP approach was originally developed in the operations research community and provides a formal basis for planning problems that have been of interest to the AI community. We found the existing algorithms for computing optimal control strategies to be highly computationally inefficient and have developed a new algorithm that is empirically more efficient. We sketch this algorithm and present preliminary results on several small problems that illustrate important properties of the POMDP approach.
Introduction
Agents that act in real environments,
whether physical
or virtual, rarely have complete information
about the
state of the environment
in which they are working. It
is necessary for them to choose their actions in partial
ignorance and often it is helpful for them to take explicit steps to gain information
to achieve their goals
most efficiently.
This problem has been addressed in the artificial intelligence (AI) community using formalisms of epistemic logic and by incorporating knowledge preconditions and effects into their planners (Moore 1985). These solutions are applicable to fairly high-level problems in which the environment is assumed to be completely deterministic, an assumption that often fails in low-level control problems.
Domains in which actions have probabilistic results and the agent has direct access to the state of the environment can be formalized as Markov decision processes (MDPs) (Howard 1960).
An important
aspect of the MDP
model is that it provides the basis for algorithms
that provably find optimal policies (mappings from environmental
states to actions)
given a stochastic
model of the environment
and a
goal. MDP models play an important
role in current
AI research on planning (Dean et al. 1993; Sutton
1990) and learning (Barto,
Bradtke,
& Singh 1991;
Watkins & Dayan 1992), but the assumption
of complete observability
provides a significant
obstacle to
their application to real-world problems.
This paper explores an extension of the MDP model to partially observable Markov decision processes (POMDPs) (Monahan 1982; Lovejoy 1991), which, like MDPs, were developed within the context of operations research. The POMDP model provides an elegant solution to the problem of acting in partially observable domains, treating actions that affect the environment and actions that only affect the agent's state of information uniformly.
We begin by explaining the basic POMDP
formalism;
next we present an algorithm
for finding arbitrarily good approximations
to optimal
policies and a method for the compact representation
of many such policies; finally, we conclude with examples that illustrate generalization
in the policy representation and taking action to gain information.
Partially Observable Markov Decision Processes

Markov Decision Processes
An MDP is defined by the tuple (S, A, T, R), where S is a finite set of environmental states that can be reliably identified by the agent; A is a finite set of actions; T is a state transition model of the environment, which is a function mapping elements of S × A into discrete probability distributions over S; and R is a reward function mapping S × A to the real numbers that specify the instantaneous reward that the agent derives from taking an action in a state. We write T(s, a, s') for the probability that the environment will make a transition from state s to state s' when action a is taken, and we write R(s, a) for the immediate reward to the agent for taking action a in state s. A policy, π, is a mapping from S to A, specifying an action to be taken in each situation.

[Figure 1: Controller for a POMDP]
Adding Partial Observability
When the state is not completely observable, we must add a model of observation. This includes a finite set, Ω, of possible observations, and an observation function, O, mapping A × S into discrete probability distributions over Ω. We write O(a, s, o) for the probability of making observation o from state s after having taken action a.
One might simply take the set of observations
to be
the set of states and treat a POMDP
as if it were an MDP.
The problem is that the process would not necessarily
be Markov since there could be multiple states in the
environment
that require different actions but appear
identical.
As a result, even an optimal policy of this
form can have arbitrarily
poor performance.
Instead, we introduce a kind of internal state for the
agent. A belief state is a discrete probability
distribution over the set of environmental
states, S, representing for each state the probability that the environment
is currently in that state.
Let B be the set of belief
states.
We write b(s) for the probability
assigned to
state s when the agent’s belief state is b.
Now, we can decompose the problem of acting in a
partially observable environment
as shown in Figure 1.
The component labeled “SE” is the state estimator. It
takes as input the last belief state, the most recent
action and the most recent observation,
and returns
an updated belief state. The second component is the
policy, which now maps belief states into actions.
The state estimator can be constructed from T and O by straightforward application of Bayes' rule. The output of the state estimator is a belief state, which can be represented as a vector of probabilities, one for each environmental state, that sums to 1. The component corresponding to state s', written SE_{s'}(b, a, o), can be determined from the previous belief state, b, the previous action, a, and the current observation, o, as follows:

    SE_{s'}(b, a, o) = Pr(s' | a, o, b)
                     = Pr(o | s', a, b) Pr(s' | a, b) / Pr(o | a, b)
                     = O(a, s', o) Σ_{s∈S} T(s, a, s') b(s) / Pr(o | a, b),

where Pr(o | a, b) is a normalizing factor defined as

    Pr(o | a, b) = Σ_{s'∈S} O(a, s', o) Σ_{s∈S} T(s, a, s') b(s).

The resulting function will ensure that our current belief accurately summarizes all available information.

[Figure 2: A simple POMDP environment]
Example
A simple example of a POMDP
is shown in
Figure 2. It has four states, one of which (state 2)
is designated
as the goal state.
An agent is in one
of the states at all times; it has two actions, left and
right, that move it one state in either direction.
If it
moves into a wall, it stays in the state it was in. If the
agent reaches the goal state, no matter what action it
takes, it is moved with equal probability
into state 0,
1, or 3 and receives reward 1. This problem is trivial if
the agent can observe what state it is in, but is more
difficult when it can only observe whether or not it is
currently at the goal state.
When the agent cannot observe its true state, it can
represent its belief of where it is with a probability
vector. For example, after leaving the goal, the agent
moves to one of the other states with equal probability. This is represented by a belief state of (1/3, 1/3, 0, 1/3). After taking action "right" and not observing the goal, there are only two states from which the agent could have moved: 0 and 3. Hence, the agent's new belief vector is (0, 1/2, 0, 1/2). If it moves "right" once again without seeing the goal, the agent can be sure it is now in state 3 with belief state (0, 0, 0, 1). Because the
actions are deterministic
in this example, the agent’s
uncertainty
shrinks on each step; in general, some actions in some situations
will decrease the uncertainty
while others will increase it.
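As an illustration, the sketch below (our reading of the example, not published code) encodes the four-state environment of Figure 2 under the array conventions introduced earlier and applies the Bayes-rule state estimator to reproduce the belief sequence just described.

```python
import numpy as np

N = 4            # states 0..3; state 2 is the goal
GOAL = 2
LEFT, RIGHT = 0, 1

# T[a, s, s2] = Pr(s2 | s, a)
T = np.zeros((2, N, N))
for s in range(N):
    for a, step in ((LEFT, -1), (RIGHT, +1)):
        if s == GOAL:                           # any action at the goal resets the agent
            for s2 in (0, 1, 3):
                T[a, s, s2] = 1.0 / 3.0
        else:
            s2 = min(max(s + step, 0), N - 1)   # bumping into a wall: stay put
            T[a, s, s2] = 1.0

# O[a, s2, o] = Pr(o | a, s2); o = 1 means "at the goal", o = 0 means "not at the goal"
O = np.zeros((2, N, 2))
for a in (LEFT, RIGHT):
    for s2 in range(N):
        O[a, s2, 1 if s2 == GOAL else 0] = 1.0

def belief_update(b, a, o):
    """SE_s'(b, a, o) = O(a, s', o) * sum_s T(s, a, s') b(s) / Pr(o | a, b)."""
    unnorm = O[a, :, o] * (b @ T[a])            # (b @ T[a])[s'] = sum_s b(s) T(s, a, s')
    return unnorm / unnorm.sum()

b = np.array([1/3, 1/3, 0.0, 1/3])             # belief just after leaving the goal
b = belief_update(b, RIGHT, 0)                  # -> (0, 1/2, 0, 1/2)
b = belief_update(b, RIGHT, 0)                  # -> (0, 0, 0, 1)
print(b)
```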
Constructing Optimal Policies

Constructing an optimal policy can be quite difficult. Even specifying a policy at every point in the uncountable state space is challenging. One simple method is to find the optimal state-action value function, Q*, for the completely observable MDP (S, A, T, R) (Watkins & Dayan 1992); then, given belief state b as input, generate action argmax_{a∈A} Σ_{s∈S} b(s) Q*(s, a). That is, act as if the uncertainty will be present for one action step, but that
the environment
will be completely observable thereafter.
This approach, similar to one used by Chrisman (Chrisman
1992), leads to policies that do not
take actions to gain information
and will therefore be
suboptimal
in many environments.
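A minimal sketch of this one-step heuristic under the array conventions above: solve the underlying fully observable MDP for Q* by standard value iteration, then weight Q* by the current belief. The function names are ours, and this is the baseline the paper argues against, not the Witness algorithm.

```python
import numpy as np

def q_star(T, R, gamma, n_iters=500):
    """Q*(s, a) for the underlying fully observable MDP, by value iteration.
    T[a, s, s2] = Pr(s2 | s, a); R[s, a] = immediate reward."""
    n_actions, n_states, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                              # V(s) = max_a Q(s, a)
        Q = R + gamma * np.einsum('ast,t->sa', T, V)   # one Bellman backup
    return Q

def one_step_lookahead_action(b, Q):
    """argmax_a sum_s b(s) Q*(s, a): act as if the state becomes fully
    observable after one more step."""
    return int(np.argmax(b @ Q))
```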
The key to finding truly optimal policies in the partially observable case is to cast the problem as a completely observable continuous-space MDP. The state set of this "belief MDP" is B and the action set is A. Given a current belief state b and action a, there are only |Ω| possible successor belief states b', so the new state transition function, τ, can be defined as

    τ(b, a, b') = Σ_{o : SE(b, a, o) = b'} Pr(o | a, b),

where Pr(o | a, b) is defined above. If the new belief state, b', cannot be generated by the state estimator from b, a, and some observation, then the probability of that transition is 0. The reward function, ρ, is constructed from R by taking expectations according to the belief state; that is,

    ρ(b, a) = Σ_{s∈S} b(s) R(s, a).
At first, this may seem strange; it appears the agent is
rewarded simply for believing it is in good states. Because of the way the state estimation
module is constructed,
it is not possible for the agent to purposely
delude itself into believing that it is in a good state
when it is not.
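In code, τ and ρ reduce to enumerating the |Ω| reachable successor beliefs with their probabilities and taking a belief-weighted expectation of R. A sketch under the same array conventions (the function names are ours):

```python
import numpy as np

def successor_beliefs(b: np.ndarray, a: int, T: np.ndarray, O: np.ndarray):
    """Enumerate the possible next belief states and their probabilities.

    Returns a list of (Pr(o | a, b), b') pairs; this is the belief-MDP
    transition function tau restricted to the successors reachable from b."""
    out = []
    predicted = b @ T[a]                       # Pr(s' | a, b) before observing
    for o in range(O.shape[2]):
        p_o = float(predicted @ O[a, :, o])    # Pr(o | a, b), the normalizer
        if p_o > 0:
            out.append((p_o, predicted * O[a, :, o] / p_o))
    return out

def rho(b: np.ndarray, R: np.ndarray):
    """Belief-MDP reward rho(b, a) = sum_s b(s) R(s, a), for every action a."""
    return b @ R
```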
The belief MDP is Markov (Astrom 1965); that is, having information about previous belief states cannot improve the choice of action. Most importantly, if an agent adopts the optimal policy for the belief MDP, the resulting behavior will be optimal for the partially observable process. The remaining difficulty is that the
belief space is continuous;
the established
algorithms
for finding optimal policies in MDPS work only in finite
state spaces. In the following sections, we discuss the
method of value iteration for finding optimal policies.
Value Iteration
Value iteration (Howard 1960) was
developed for finding optimal policies for MDPS. Since
we have formulated the partially observable problem as
an MDP over belief states, we can find optimal policies
for POMDPs
in an analogous manner.
The agent moves through the world according to its policy, collecting reward. Although there are many criteria possible for choosing one policy over another, we here focus on policies that maximize the infinite expected sum of discounted rewards from all states. In such infinite horizon problems, we seek to maximize E[r(0) + Σ_{t=1}^∞ γ^t r(t)], where 0 ≤ γ < 1 is a discount factor and r(t) is the reward received at time t. If γ is zero, the agent seeks to maximize the reward for only the next time step with no regard for future consequences. As γ increases, future rewards play a larger role in the decision process.
The optimal value of any belief state b is the infinite
expected sum of discounted rewards starting in state b
and executing the optimal policy. The value function,
V*(b), can be expressed as a system of simultaneous
equations as follows:
    V*(b) = max_{a∈A} [ρ(b, a) + γ Σ_{b'∈B} τ(b, a, b') V*(b')].    (1)
The value of a state is its instantaneous
reward plus
the discounted value of the next state after taking the
action that maximizes this value.
One could also consider a policy that maximizes reward over a finite number of time steps, t. The essence
of value iteration is that optimal t-horizon solutions approach the optimal infinite horizon solution as t tends
toward infinity. More precisely, it can be shown that
the maximum difference between the value function of
the optimal infinite horizon policy, V*, and the analogously defined value function for the optimal t-horizon
policy, Vt*, goes to zero as t goes to infinity.
This property leads to the following value iteration algorithm:

    Let V_0(b) = 0 for all b ∈ B
    Let t = 0
    Loop
        t := t + 1
        For all b ∈ B
            V_t(b) = max_{a∈A} [ρ(b, a) + γ Σ_{b'∈B} τ(b, a, b') V_{t-1}(b')]
    Until |V_t(b) − V_{t-1}(b)| < ε for all b ∈ B
This algorithm is guaranteed
to converge in a finite
number of iterations
and results in a policy that is
within 2γε/(1 − γ) of the optimal policy (Bellman 1957;
of the optimal policy (Bellman 1957;
Lovejoy 1991).
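For illustration only, here is one naive way to run the loop above: restrict attention to a finite set of belief points and evaluate successor beliefs by nearest neighbor among them. This crude discretization is not the paper's method; the exact treatment uses the piecewise-linear representation described next. All names and the stopping tolerance are ours.

```python
import numpy as np

def belief_value_iteration(beliefs, T, R, O, gamma, eps=1e-6):
    """Value iteration over a *finite* set of belief points (rows of `beliefs`),
    approximating each successor belief by its nearest stored belief point."""
    B = np.array(beliefs)                     # shape (n_points, |S|)
    n_actions = T.shape[0]
    V = np.zeros(len(B))
    while True:
        V_new = np.empty_like(V)
        for i, b in enumerate(B):
            best = -np.inf
            for a in range(n_actions):
                predicted = b @ T[a]
                value = float(b @ R[:, a])                       # rho(b, a)
                for o in range(O.shape[2]):
                    p_o = float(predicted @ O[a, :, o])          # Pr(o | a, b)
                    if p_o == 0.0:
                        continue
                    b_next = predicted * O[a, :, o] / p_o        # SE(b, a, o)
                    j = int(np.argmin(np.abs(B - b_next).sum(axis=1)))
                    value += gamma * p_o * V[j]
                best = max(best, value)
            V_new[i] = best
        if np.max(np.abs(V_new - V)) < eps:
            return V_new
        V = V_new
```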
In finite state MDPS, value functions can be represented as tables.
For this continuous space, however,
we need to make use of special properties of the belief MDP to represent it finitely.
First of all, any finite horizon value function is piecewise linear and convex (Sondik 1971; Smallwood & Sondik 1973). In addition, for the infinite horizon, the value function can be
approximated arbitrarily closely by a convex piecewise-linear function (Sondik 1971).
A representation that makes use of these properties was introduced by Sondik (Sondik 1971). Let V_t be a set of |S|-dimensional vectors of real numbers. The optimal t-horizon value function can be written as

    V_t*(b) = max_{α∈V_t} α · b

for some set V_t. Any piecewise-linear convex function can be expressed this way, but the vectors in V_t can also be viewed as the values associated with different choices in the optimal policy, analogous to Watkins' Q-values (Watkins & Dayan 1992); see (Cassandra, Kaelbling, & Littman 1994; Sondik 1971).
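Evaluating a value function in this representation is just a max of dot products; a two-line sketch (ours):

```python
def pwlc_value(b, alphas):
    """V_t(b) = max over alpha in V_t of alpha . b,
    with `alphas` a list of |S|-dimensional NumPy vectors."""
    return max(float(alpha @ b) for alpha in alphas)
```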
The Witness Algorithm
The task at each step in the value iteration algorithm is to find the set V_t that represents V_t* given V_{t-1}.
Detailed algorithms
have
been developed for this problem (Smallwood & Sondik
1973; Monahan 1982; Cheng 1988) but are extremely
inefficient.
We describe a new algorithm,
inspired by
Cheng’s linear support algorithm (Cheng 1988), which
both in theory and in practice seems to be more efficient than the others.
Many algorithms (Smallwood & Sondik 1973; Cheng 1988) construct an approximate value function, V̂_t(b) = max_{α∈V̂_t} b · α, which is successively improved by adding vectors to V̂_t ⊆ V_t. The set V̂_t is built up using a key insight. From V_{t-1} and any particular belief state, b, we can determine the α ∈ V_t that should be added to V̂_t to make V̂_t(b) = V_t*(b). The algorithmic challenge, then, is to find a b for which V̂_t(b) ≠ V_t*(b) or to prove that no such b exists (i.e., that the approximation is perfect).
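The "key insight" above corresponds to a standard point-based dynamic-programming backup: given V_{t-1} and one belief b, build the single vector of V_t that is maximal at b. The sketch below is our rendering of one standard way to construct that vector under the earlier array conventions, not the authors' implementation.

```python
import numpy as np

def backup(b, alphas_prev, T, R, O, gamma):
    """Return the vector of V_t that is maximal at belief b, built from the
    alpha-vectors of V_{t-1} (`alphas_prev`, a non-empty list of arrays)."""
    n_actions = T.shape[0]
    n_obs = O.shape[2]
    best_vec, best_val = None, -np.inf
    for a in range(n_actions):
        vec = R[:, a].astype(float)
        for o in range(n_obs):
            # M[s, s'] = T(s, a, s') O(a, s', o): forward matrix for the pair (a, o)
            M = T[a] * O[a, :, o]
            # pick the alpha of V_{t-1} that is best for b after seeing (a, o)
            alpha_ao = max(alphas_prev, key=lambda al: float(b @ (M @ al)))
            vec = vec + gamma * (M @ alpha_ao)
        if float(vec @ b) > best_val:
            best_vec, best_val = vec, float(vec @ b)
    return best_vec
```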
The Witness algorithm (Cassandra, Kaelbling, & Littman 1994) defines a linear program that returns a single point that is a "witness" to the fact that V̂_t ≠ V_t*. The process begins with an initial V̂_t populated by the vectors needed to represent the value function at the corners of the belief space (i.e., the |S| belief states consisting of all 0's and a single 1). A linear program is constructed with |S| variables used to represent the components of a belief state, b. Auxiliary variables and constraints are used to define

    v = V_t*(b) = max_a [ρ(b, a) + γ Σ_{b'∈B} τ(b, a, b') max_{α∈V_{t-1}} α · b']

and

    v̂ = V̂_t(b) = max_{α∈V̂_t} α · b.

A final constraint insists that v̂ ≠ v, and thus the program either returns a witness or fails if V̂_t = V_t*. If a witness is found, it is used to determine a new vector to include in V̂_t and the process repeats. Only one linear program is solved for each vector in V̂_t.

In the current formulation, a tolerance factor, δ, must be defined for the linear program to be effective. Thus the algorithm can terminate even though V_t* ≠ V̂_t as long as the difference at any point is no more than δ. This differentiates the Witness algorithm from the other approaches mentioned, which find exact solutions. Although the Witness algorithm only constructs approximations, in conjunction with value iteration it can construct policies arbitrarily close to optimal by making δ small enough. Unfortunately, extremely small values of δ result in numerically unstable linear programs that can be quite challenging for many linear programming implementations.
It has been shown that finding the optimal policy for a finite-horizon POMDP is PSPACE-complete (Papadimitriou & Tsitsiklis 1987), and indeed all of the algorithms mentioned take time exponential in the problem size if the specific POMDP parameters require an exponential number of vectors to represent V_t*. The
main advantage of the Witness
algorithm
is that it
appears to be the only one of the algorithms
whose
running time is guaranteed
not to be exponential
if
the number of vectors required is not. In practice, this
has resulted in vastly improved running times and the
ability to run much larger example problems than existing POMDP
algorithms.
Details of the algorithm are
outlined in a technical report (Cassandra,
Kaelbling,
& Littman 1994).
Representing Policies
When value iteration converges, we are left with a set of vectors, V_final, that constitutes an approximation to the optimal value function, V*. Each of these vectors defines a region of the belief space such that a belief state is in a vector's region if its dot product with the vector is maximum. Thus, the vectors define a partition of belief space.

It can be shown (Smallwood & Sondik 1973) that all belief states that share a partition also share an optimal action, so a policy can be specified by a set of pairs, (α*, a*), where π(b) = a* if α* · b ≥ α · b for all α ∈ V_final.
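Acting with this representation is a single argmax over the stored pairs; a small sketch (ours), with the alpha-vectors stored as NumPy arrays:

```python
def policy_action(b, pairs):
    """`pairs` is a list of (alpha, action) tuples extracted from V_final;
    pi(b) is the action attached to the alpha-vector maximizing alpha . b."""
    best_alpha, best_action = max(pairs, key=lambda p: float(p[0] @ b))
    return best_action
```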
For many problems, the partitions
have an important property that leads to a particularly
useful representation
for the optimal policy. Given the optimal
action and a resulting observation,
all belief states in
one partition will be transformed
to belief states occupying the same partition on the next step. The set of
partitions
and their corresponding
transitions
constitute a policy graph that summarizes the action choices
of the optimal policy.
Figure 3 shows the policy graph of the optimal policy
for the simple POMDP
environment
of Figure 2. Each
node in the picture corresponds to a set of belief states
over which one vector in V_final has the largest dot product and is labeled with the optimal action for that set of
belief states. Observations
label the arcs of the graph,
specifying how incoming information affects the agent’s
choice of future actions. The agent’s initial belief state
is in the node marked with the extra arrow. In the example figure we chose the uniform belief distribution
to indicate that the agent initially has no knowledge of
its situation.
Using a Policy Graph
Once computed, the policy
graph is a representation
of the optimal policy. The
agent chooses the action associated with the start node,
and then, depending on which observation
it makes,
the agent makes a transition to the appropriate node.
It executes the associated
action, follows the arc for
the next observation,
and so on.
For a reactive agent, this representation
of a policy
is ideal. The current node of the policy graph is sufficient to summarize the agent’s past experience
and
its future decisions. The arcs of the graph dictate how
new information
in the form of observations
is incorporated into the agent’s decision-making.
The graph
itself can be executed simply and efficiently.
Returning to Figure 3, we can give a concrete demonstration
of how a policy graph is used.
The policy
graph can be summarized
as “Execute
the pattern
right, right, left, stopping when the goal is encountered.
Execute the action left to reset.
Repeat.”
It
is straightforward
to verify that this strategy is indeed
optimal; no other pattern performs better.
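The execution machinery itself is tiny; the sketch below (ours) stores an action and an observation-indexed arc for each node. The commented graph is one possible encoding of the verbal "right, right, left" summary above, with observations 0 = "not at goal" and 1 = "at goal"; the actual layout of Figure 3 may differ.

```python
from dataclasses import dataclass

@dataclass
class Node:
    action: str              # action to execute while in this node
    succ: dict               # observation -> index of the next node

def run_policy_graph(nodes, start, env_step, n_steps):
    """Execute a policy graph: act, observe, follow the arc for that observation.
    env_step(action) -> observation stands in for the agent's interface to the world."""
    i = start
    for _ in range(n_steps):
        obs = env_step(nodes[i].action)
        i = nodes[i].succ[obs]

# One possible encoding of the verbal summary of Figure 3 (hypothetical layout).
graph = [
    Node("right", {0: 1, 1: 3}),   # start: first "right" of the pattern
    Node("right", {0: 2, 1: 3}),   # second "right"
    Node("left",  {0: 3, 1: 3}),   # final "left" always reaches the goal
    Node("left",  {0: 0, 1: 0}),   # at the goal: "left" resets, then repeat
]
```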
Note that the use of the state estimator,
SE, is no
longer necessary for the agent to choose actions optimally.
The policy graph has all the information
it
needs.
[Figure 3: Sample policy graph for the simple POMDP environment]

[Figure 4: Small unobservable grid and its policy graph]
Results
After experimenting
with several algorithms for solving POMDPS, we devised and implemented the Witness
algorithm
and a heuristic method for constructing
a
policy graph from V_final. Although the size of the optimal policy graph can be arbitrarily large, for most of the problems we tried the policy graph included no more than thirty nodes. The largest problem we looked at consisted of 23 states, 4 actions, and 11 observations, and our algorithm converged on a policy graph of four nodes in under half an hour.
Generalization
In the policy graph of Figure 3,
there is almost a one-to-one
correspondence
between
nodes and the belief states encountered
by the agent.
For some environments, the decisions for many different belief states are captured in a small number of nodes. This constitutes a form of generalization in that a continuum of belief states, including distinct environmental states, are handled identically.
Figure 4 shows an extremely
simple environment
consisting of a 4 by 4 grid where all cells except for the
goal in the lower right-hand
corner are indistinguishable. The optimal policy graph for this environment
consists of just two nodes, one for moving down in the
grid, and the other for moving to the right. From the
given start node, the agent will execute a down-right-down-right pattern until it reaches the goal at which
point it will start again. Note that each time it is in
the “down” node, it will have different beliefs about
what environmental
state it is in.
Acting to Gain Information
In many real-world
problems, an agent must take specific actions to gain
information
that will allow it to make more informed
decisions and achieve increased performance.
In most
planning systems, these kinds of actions are handled
differently than actions that change the state of the environment. A uniform treatment of actions of all kinds is desirable for simplicity, but also because there are many actions that have both material and informational consequences.

[Figure 5: Policy graph for the tiger problem]
To illustrate the treatment of information-gathering
actions in the POMDP model, we introduce a modified
version of a classic problem. You stand in front of two
doors: behind one door is a tiger and behind the other
is a vast reward, but you do not know which is where.
You may open either door, receiving a large penalty if
you chose the one with the tiger and a large reward if
you chose the other. You have the additional option
If the tiger is on the left, then
of simply listening.
with probability
0.85 you will hear the tiger on your
left and with probability
0.15 you will hear it on your
right; symmetrically
for the case in which the tiger is on
your right. If you listen, you will pay a small penalty. Finally, the problem is iterated, so immediately
after
you choose either of the doors, you will again be faced
with the problem of choosing a door; of course, the
tiger has been randomly repositioned.
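A sketch of this problem in the array conventions used earlier. The 0.85/0.15 listening accuracies come from the text; the numeric rewards (−1 for listening, +10 for the correct door, −100 for the tiger) are values we assume for illustration, since the paper only says "small penalty", "vast reward", and "large penalty".

```python
import numpy as np

TL, TR = 0, 1                     # states: tiger-left, tiger-right
LISTEN, OPEN_L, OPEN_R = 0, 1, 2  # actions
HL, HR = 0, 1                     # observations: hear-left, hear-right

T = np.zeros((3, 2, 2))
T[LISTEN] = np.eye(2)             # listening does not move the tiger
T[OPEN_L] = T[OPEN_R] = 0.5       # opening a door repositions the tiger at random

O = np.zeros((3, 2, 2))
O[LISTEN, TL] = [0.85, 0.15]      # hear the tiger on the correct side 85% of the time
O[LISTEN, TR] = [0.15, 0.85]
O[OPEN_L] = O[OPEN_R] = 0.5       # observations are uninformative after opening a door

R = np.array([[-1.0, -100.0,   10.0],    # tiger-left:  listen, open-left, open-right
              [-1.0,   10.0, -100.0]])   # tiger-right: listen, open-left, open-right
```

With these tables, the Bayes-rule update gives Pr(tiger-left) = 0.85 after one hear-left observation and 0.85²/(0.85² + 0.15²) ≈ 0.97 after two in a row, which is the kind of evidence accumulation the solution described next trades off against the cost of listening.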
The problem is this: How long should you stand
and listen before you choose a door?
The Witness
algorithm found the solution shown in Figure 5. If
you are beginning with no information,
then you enter
the center node, in which you listen. If you hear the
tiger on your left, then you enter the lower right node’
which encodes roughly “I’ve heard a tiger on my left
once more than I’ve heard a tiger on my right”; if you
hear the tiger on your right, then you move back to
the center node, which encodes “I’ve heard a tiger on
my left as many times as I’ve heard one on my right.”
Following this, you listen again. You continue listening
until you have heard the tiger twice more on one side
than the other, at which point you choose.
As the consequences of meeting a tiger are made less
dire, the Witness algorithm finds strategies that listen
only once before choosing, then ones that do not bother
to listen at all. As the reliability of listening is made
worse, strategies that listen more are found.
Related Work
There is an extensive discussion of POMDPs in the operations research literature.
Surveys by Monahan (Monahan 1982) and Lovejoy (Lovejoy 1991) are good starting points.
Within the AI community,
several of the
issues addressed here have also been examined by researchers working on reinforcement
learning.
Whitehead and Ballard (Whitehead & Ballard 1991) solve
problems of partial observability through access to extra perceptual data. Chrisman (Chrisman 1992) and
McCallum (McCallum 1993) describe algorithms for
inducing a POMDP
from interactions with the environment and use relatively simple approximations to the
resulting optimal value function. Other relevant work
in the AI community includes the work of Tan (Tan
1991) on inducing decision trees for performing low-cost identification of objects by selecting appropriate
sensory tests.
Future Work
The results presented in this paper are preliminary. We
intend, in the short term, to extend our algorithm to
perform policy iteration, which is likely to be more efficient. We will solve larger examples including tracking
and surveillance problems. In addition, we hope to extend this work in a number of directions such as applying stochastic dynamic programming (Barto, Bradtke,
& Singh 1991) and function approximation to derive an
optimal value function, rather than solving for it analytically. We expect that good approximate policies
may be found more quickly this way. Another aim is to
integrate the POMDP
framework with methods such as
those used by Dean et al. (Dean et al. 1993) for finding
approximately optimal policies quickly by considering
only small regions of the search space.
Acknowledgments
Thanks to Lonnie Chrisman, Tom Dean and (indirectly) Ross Schachter for introducing us to POMDPS.
References
Astrom, K. J. 1965. Optimal control of Markov decision processes with incomplete state estimation. J. Math. Anal. Appl. 10:174-205.

Barto, A. G.; Bradtke, S. J.; and Singh, S. P. 1991. Real-time learning and control using asynchronous dynamic programming. Technical Report 91-57, Department of Computer and Information Science, University of Massachusetts, Amherst, Massachusetts.

Bellman, R. 1957. Dynamic Programming. Princeton, New Jersey: Princeton University Press.

Cassandra, A. R.; Kaelbling, L. P.; and Littman, M. L. 1994. Algorithms for partially observable Markov decision processes. Technical Report 94-14, Brown University, Providence, Rhode Island.

Cheng, H.-T. 1988. Algorithms for Partially Observable Markov Decision Processes. Ph.D. Dissertation, University of British Columbia, British Columbia, Canada.

Chrisman, L. 1992. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the Tenth National Conference on Artificial Intelligence, 183-188. San Jose, California: AAAI Press.
Dean, T.; Kaelbling, L. P.; Kirman, J.; and Nicholson, A. 1993. Planning with deadlines in stochastic domains. In Proceedings of the Eleventh National Conference on Artificial Intelligence.

Howard, R. A. 1960. Dynamic Programming and Markov Processes. Cambridge, Massachusetts: The MIT Press.

Lovejoy, W. S. 1991. A survey of algorithmic methods for partially observed Markov decision processes. Annals of Operations Research 28(1):47-65.

McCallum, R. A. 1993. Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning. Amherst, Massachusetts: Morgan Kaufmann.

Monahan, G. E. 1982. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science 28(1):1-16.

Moore, R. C. 1985. A formal theory of knowledge and action. In Hobbs, J. R., and Moore, R. C., eds., Formal Theories of the Commonsense World. Norwood, New Jersey: Ablex Publishing Company.

Papadimitriou, C. H., and Tsitsiklis, J. N. 1987. The complexity of Markov decision processes. Mathematics of Operations Research 12(3):441-450.

Smallwood, R. D., and Sondik, E. J. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071-1088.

Sondik, E. J. 1971. The Optimal Control of Partially Observable Markov Processes. Ph.D. Dissertation, Stanford University, Stanford, California.

Sutton, R. S. 1990. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning. Austin, Texas: Morgan Kaufmann.

Tan, M. 1991. Cost-sensitive reinforcement learning for adaptive classification and control. In Proceedings of the Ninth National Conference on Artificial Intelligence.

Watkins, C. J. C. H., and Dayan, P. 1992. Q-learning. Machine Learning 8(3):279-292.

Whitehead, S. D., and Ballard, D. H. 1991. Learning to perceive and act by trial and error. Machine Learning 7(1):45-83.