Using Prior Knowledge to Improve Reinforcement Learning in Mobile Robotics
David L. Moreno, Carlos V. Regueiro† , Roberto Iglesias and Senén Barro
Departamento de Electrónica y Computación,
Universidad de Santiago de Compostela, 15782 Spain
† Departamento de Electrónica y Sistemas,
Universidad da Coruña, 15071 Spain
dave@dec.usc.es
Abstract
Reinforcement learning (RL) is thought to be an
appropriate paradigm for acquiring control policies in mobile robotics. However, in its standard
formulation (tabula rasa) RL must explore and
learn everything from scratch, which is neither
realistic nor effective in real-world tasks. In this
article we propose a new strategy, called Supervised Reinforcement Learning (SRL), for taking
advantage of external knowledge within this type
of learning and validate it in a wall-following behaviour.
1. Introduction
Reinforcement learning (RL) is an interesting strategy for the automatic resolution of tasks in different domains, such as game-playing (Tesauro, 1994),
robotics (Schaal and Atkeson, 1994) and even computer
networks (Boyan and Littman, 1994). One of the main
advantages of RL is that it does not need a set of inputs
with their correct answers for training, which are often
hard to produce in dynamic and unknown environments.
Instead, it only requires a measurement of the quality of the system's behaviour, the so-called reinforcement. This feature, together with its incremental nature and adaptive capabilities, makes it suitable for use in mobile robotics.
Nevertheless, tabula rasa RL has strong limitations.
The main one is the fact that RL assumes that the environment as perceived by the system is a Markov Decision Process (MDP), which implies that the system only
needs to know the current state of the process in order to
be able to predict its future behaviour. Moreover, there
must not be any perceptual aliasing; i.e., the agent cannot perceive as equal two situations that require different actions. Another limitation
is the exploitation/exploration dilemma; i.e., the need to
establish the strategy that the agent must follow in order to decide between attempting new actions or using
previously acquired knowledge.
These limitations cause lack of stability during learning, low levels of robustness in the learnt behaviours, and
slowness in convergence. These problems become even
more evident in real applications, especially in mobile
robotics (Wyatt, 1995), where environments are complex, dynamic, and usually cannot be fully modelled. However, in these systems there usually exists prior knowledge about the task, in the form of human expertise or previously developed controllers. This information can be
used to improve the learning process, as the RL agent
does not start from scratch.
In this paper, we present an application of tabula rasa RL in the field of mobile robotics, as well as a
new strategy to take advantage of prior knowledge in
this type of learning, which we have called Supervised
Reinforcement Learning (SRL) (Iglesias et al., 1998a,
Iglesias et al., 1998b). We carried out our experiments
within the Nomad 200 simulator and we selected the
wall-following behaviour as the task to be learnt.
The remainder of this paper is organized as follows:
Section 2. reviews the basics of reinforcement learning. Section 3. presents the main ideas behind the SRL
model. Sections 4. and 5. report our experimental setup
and the experiments we have carried out. Section 6.
briefly reviews related work on the use of prior knowledge to improve RL. Finally, Section 7. discusses
the contributions of this paper.
2. Reinforcement Learning
In the RL paradigm (Kaelbling et al., 1996), an agent
interacts with the environment through a set of actions.
The environment is then modified and the agent perceives the new state through its sensors. Furthermore,
at each step the agent receives an external reward signal
(see Figure 1). The objective of the RL agent is to maximize the amount of reward received in the long term.
In this learning strategy an objective is defined and
the learning process takes place through trial and error interactions in a dynamic environment.

Figure 1: Basic diagram of an RL agent.

The agent is rewarded or punished on the basis of the actions it
carries out. There are many algorithms that implement RL principles; among the most widely used are Sarsa, Dyna, Prioritized Sweeping and Q-learning. In this work we have employed the last of these (Watkins, 1989) since, owing to its simplicity and ease of implementation, it is currently the most commonly used.
We have used a tabular representation, in which for
every state-action pair a Q-value Q(s, a) is stored and
updated over the learning process. This Q-value represents the usefulness of performing action a when the
robot is in state s. Q-learning directly approaches the
optimal action-value function, independently of the policy currently being followed. Its updating rule is:
Q(s_t, a_t) \leftarrow (1 - \alpha)\, Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) \right], \qquad (1)
where s_t is the current state, a_t is the action taken at the current instant, r_t is the reward received after executing a_t, and Q(s_t, a_t) is the evaluation of action a_t in state s_t. The only parameters that need to be adjusted are the learning rate α and the discount factor γ.
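As an illustration, the following is a minimal Python sketch of this tabular update. The dictionary-based Q-table and the state/action encodings are our own assumptions and are not prescribed by the paper; the parameter values are those of Table 1.

```python
from collections import defaultdict

ALPHA = 0.2    # learning rate (Table 1)
GAMMA = 0.99   # discount factor (Table 1)

# Tabular representation: one Q-value per (state, action) pair, initialised to 0.
Q = defaultdict(float)

def q_learning_update(s, a, r, s_next, actions):
    """Apply the Q-learning update rule of Equation (1)."""
    best_next = max(Q[(s_next, a_next)] for a_next in actions)
    Q[(s, a)] = (1 - ALPHA) * Q[(s, a)] + ALPHA * (r + GAMMA * best_next)
```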
The Q-learning algorithm requires a good balance between the exploitation of the information
learnt to date and the exploration of different actions (Sutton and Barto, 1998). To achieve this we have
used the Softmax algorithm (Bridle, 1990), in which the
probability of taking the action a_i in the state s is given by the following equation:

\Pr(s, a_i) = \frac{e^{Q(s, a_i)/T(s)}}{\sum_{j=1}^{n} e^{Q(s, a_j)/T(s)}}, \qquad i = 1, \ldots, n, \qquad (2)
where {a_1, ..., a_n} is the set of possible actions in the state s. The parameter T (temperature) makes it possible to regulate the distribution of probabilities between the actions. By means of varying T we force an intense exploration at the onset of the learning phase, focusing gradually on the selection of actions with the best evaluation. Each state regulates its exploration level independently.
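A possible implementation of this Softmax selection is sketched below; the per-state temperature schedule itself is not specified here, so the `temperature` argument is left to the caller.

```python
import math
import random

def softmax_probs(values, temperature):
    """Boltzmann (Softmax) probabilities of Equation (2) over a list of values."""
    m = max(values)  # subtracting the maximum keeps the exponentials numerically stable
    exps = [math.exp((v - m) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_select(q_values, temperature):
    """Sample an action index according to the Softmax probabilities."""
    probs = softmax_probs(q_values, temperature)
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]
```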
3. Embedding prior knowledge: SRL
The objective of the SRL model is to establish a working
framework for the development of systems that will integrate prior knowledge into RL processes through focalizing the exploration of an RL algorithm. SRL comprises
several basic blocks (Figure 2): the reinforcement learning module, the prior knowledge sources (PKSs), and the
control module, which regulates the knowledge transfer
between them.
The RL module houses the RL algorithm. This cannot
be an on-policy algorithm, as in such methods the action
to be carried out must be determined by the algorithm
itself, while in the SRL model the control module makes
this decision (Figure 2). For this application we use the
Q-Learning algorithm.
The PKSs supply their advice (recommended actions)
for the current state of the RL module, s. More specifically, they produce a vector of utilities u which contains
a value u(s, a_i) ∈ [0, 1] for each action a_i that can be carried out in the current state s. This value is an
indication to the control module as to how advisable this
action is according to the PKS that supplies the vector.
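As a sketch, a PKS can be seen as a simple interface that maps the current state to a utility vector; the class below is our own illustration and not part of the paper.

```python
class PriorKnowledgeSource:
    """Illustrative PKS interface: returns a utility vector u(s, a_i) in [0, 1]."""

    def __init__(self, advice):
        # advice: dict mapping a state to {action_index: utility}
        self.advice = advice

    def utilities(self, state, n_actions):
        """Utility for every action in the given state; 0 where no advice exists."""
        u = [0.0] * n_actions
        for action_index, utility in self.advice.get(state, {}).items():
            u[action_index] = utility
        return u
```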
3.1 Control module
The control module has the task of amalgamating the
utilities that the PKSs supply for each action (the pieces
of advice) with the information learnt to date (Figure 2).
This module gives priority to the knowledge transfer over
exploration of new actions. The control module is divided into two blocks: the credit assignment block and
the decision block.
3.1.1 Credit assignment block
The credit assignment block initially decides whether the
advice obtained from the PKSs is of a sufficient quality
to be used. If this is not the case, alternative actions to
the ones recommended will need to be explored.
The inputs to the credit assignment block (Figure 2)
are the Q-values Q(s, a_i) from the RL module, and the advice from the PKSs, {u_1, ..., u_m}. The first step consists of amalgamating the different pieces of advice into one single vector and then normalizing it in order to obtain the exploitation policy, w(a), so called because following it implies exploiting the knowledge stored in the sources of prior knowledge:
w(a_i) = \frac{\sum_{j=1}^{m} u_j(a_i)}{\max_k \left\{ \sum_{j=1}^{m} u_j(a_k) \right\}}, \qquad (3)
where m is the number of PKSs. Then the exploration
policy, e(a), is constructed, thus called because it recommends taking those actions that are considered unadvisable by the PKSs:
e(a_i) = 1 - w(a_i), \qquad i = 1, \ldots, n. \qquad (4)
Figure 2: SRL block diagram.
The next step is to decide which of the two policies needs
to be followed. In order to do so, we define the function
Ω(s, x), where s is the current state and x is a generic
vector of utility, i.e., a vector that is normalized with
a value for each action of the state s. This function is
defined as:
\Omega(s, x) = \max_i \left\{ x(a_i) \cdot \left[ Q(s, a_i) - \min_a Q(s, a) \right] \right\}. \qquad (5)
The value of Ω gives us a measurement of the
compatibility of the utility vector x with those Q-values
that have been learned for state s, which tells us whether
the actions suggested by x have been evaluated well by
the learning process. In order to decide the most suitable
policy, g(a), we establish the following criterion:
g(a) = \begin{cases} e(a) & \text{if } \Omega(s, e) - \delta > \Omega(s, u_j) \;\; \forall\, j = 1, \ldots, m, \\ w(a) & \text{otherwise,} \end{cases} \qquad (6)
i.e., e(a) is chosen whenever its compatibility Ω is higher, by a margin δ, than that of all the suggestions u_j in the current state s. In this situation, the suggestions either have no information to supply, or are recommending actions that are poorly evaluated by the experience accumulated in the RL module; thus the advice is not trustworthy, and the system must search for alternatives: it must explore. The parameter δ, the exploration threshold, is a positive value that makes it possible to regulate the system's tolerance towards bad advisors.
Lastly, a vector of utilities, h(s, a), is drawn up with
a value between 0 and 1 for each action:
h(s, a_i) = \frac{g(a_i) \cdot \left[ Q(s, a_i) - \min_a Q(s, a) \right]}{\Omega(s, g)}. \qquad (7)
This vector is called the decision vector and indicates
which actions are the most suitable for the current state.
In this vector, advice that coincides with a highly valued action is reinforced; on the other hand, advice that recommends an action with a low evaluation is moderated. This vector goes to the decision block.
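The following Python sketch puts Equations (3)-(7) together; the handling of the degenerate cases (no advice at all, or Ω(s, g) = 0) is our own assumption, since those cases are not discussed here.

```python
DELTA = 0.1  # exploration threshold (Table 1)

def credit_assignment(q_values, advice_vectors, delta=DELTA):
    """Combine PKS advice with learnt Q-values and return the decision vector h(s, a)."""
    n = len(q_values)
    q_min = min(q_values)

    def omega(x):
        # Compatibility of a utility vector with the learnt Q-values, Equation (5).
        return max(x[i] * (q_values[i] - q_min) for i in range(n))

    # Exploitation policy w(a): normalised sum of the advice, Equation (3).
    sums = [sum(u[i] for u in advice_vectors) for i in range(n)]
    max_sum = max(sums) or 1.0          # assumption: avoid division by zero
    w = [s_i / max_sum for s_i in sums]

    # Exploration policy e(a), Equation (4).
    e = [1.0 - w_i for w_i in w]

    # Policy choice, Equation (6): explore only if e is clearly more compatible.
    g = e if all(omega(e) - delta > omega(u) for u in advice_vectors) else w

    # Decision vector h(s, a), Equation (7).
    omega_g = omega(g) or 1.0           # assumption: avoid division by zero
    return [g[i] * (q_values[i] - q_min) / omega_g for i in range(n)]
```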
3.1.2 Decision block
This block has the task of selecting the action to be implemented. The action is chosen taking into account the
vector h(s, a). In order to maintain a certain degree of
exploration over the policy that is finally selected we use
the Softmax algorithm. Thus the probability of taking
the action a_i in the state s is:

\Pr(s, a_i) = \frac{e^{h(s, a_i)/T(s)}}{\sum_{j=1}^{n} e^{h(s, a_j)/T(s)}}, \qquad i = 1, \ldots, n, \qquad (8)
where n is the number of actions available in the state
s. Again, each state has its own temperature.
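In code, the decision block is the same Softmax selection sketched in Section 2, applied to h(s, a) instead of the Q-values; a self-contained version, with the per-state temperature passed in by the caller, might look as follows.

```python
import math
import random

def decision_block(h, temperature):
    """Select an action index from the decision vector h(s, a), Equation (8)."""
    exps = [math.exp(h_i / temperature) for h_i in h]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(h)), weights=probs, k=1)[0]
```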
4. Application
The control of mobile robots is highly complex, and requires the use of control architectures
(Regueiro et al., 2002). The majority of these are based
on low-level behaviours that efficiently resolve a specific
task. We have chosen the wall-following behaviour as
the task to be learnt, as it is one of the most used in
mobile robotics.
4.1 RL Agent
The three elements of an RL system are the state representation, the reward function and the actions that
the robot can take in each state. Our state representation uses only information from the robot’s ultrasonic
sensors, which are divided into four overlapping groups
(Figure 3(a)) that are associated to four state variables:
right-hand distance (R), left-hand distance (L), frontal distance (F) and right-frontal distance (RF). The value of each variable is the minimum measurement of its associated sensors, which is discretized according to the values shown in Figure 3(b).

Figure 3: State representation (except orientation): a) ultrasound sensors associated with each state variable. Sensor number 0 marks the advancing direction; b) discretization of the distance variables. Values that give rise to a negative reward are shaded.

After several failed experiments, we had to define a fifth state variable: the relative orientation (O) between the robot and the wall. To obtain it, the reference wall is approximated by a straight line by means of linear regression over the latest measurements from sensors 11, 12 and 13 (Figure 3(a)). The angle between this straight line and the forward direction of the robot is discretized into 4 values: approaching, moving away, parallel and no direction. The last value is assigned when the regression is not reliable, the frontal wall is very close, or the robot is too close to or too far from the wall being followed.

Thus, the state is defined as the quintuple s = <R, RF, F, L, O>. Theoretically, there are 188 possible states, although not all of them are encountered while carrying out the task in a normal environment.

The second component of the RL system is the reward function. As in the wall-following behaviour the objective of the robot is to keep a constant distance from the right-hand wall, we define the reward function as:

r = \begin{cases} -1 & \text{if } R = \text{too close} \vee R = \text{too far} \vee F = \text{too close} \vee L = \text{close}, \\ 0 & \text{in any other situation,} \end{cases} \qquad (9)

where ∨ represents the logical operator "OR". Figure 3(b) shows the values of the state variables (shaded) that give rise to a negative reward.

The actions that the robot can carry out constitute the final component. In order to simplify the learning process we set a constant linear velocity for the robot (20 cm/s), so that it is only necessary to learn to control the angular velocity. As the maximum angular velocity of the robot is 45°/s, we have discretized the space of actions as follows:

A = {-40, -20, -10, -0.3, 0, 0.3, 10, 20, 40} (°/s). \qquad (10)

The values used for the parameters of the RL and SRL agents can be seen in Table 1.

  Parameter          Description             Value
  α                  Learning rate           0.2
  γ                  Discount factor         0.99
  δ (only in SRL)    Exploration threshold   0.1

Table 1: Learning parameters for RL and SRL.
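A compact sketch of the action set of Equation (10) and the reward function of Equation (9) is given below; the string labels used for the discretized distances are our own naming, chosen to mirror Figure 3(b).

```python
# Discretized angular velocities of Equation (10), in degrees per second.
ACTIONS = (-40, -20, -10, -0.3, 0, 0.3, 10, 20, 40)

def reward(state):
    """Reward function of Equation (9) for a state <R, RF, F, L, O>."""
    r_dist, _rf_dist, f_dist, l_dist, _orientation = state
    if r_dist in ("too close", "too far") or f_dist == "too close" or l_dist == "close":
        return -1
    return 0
```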
4.2 Prior knowledge
In our system we have used one PKS. To build it, a human expert decided on the action to be carried out in a number of representative positions within the environment. This PKS (called Ad-Hoc) supplies reasonably good advice in 12 states, corresponding to straight-wall and open-corner situations. For each advised state a single action, with maximum utility, is recommended. It is important to stress that on its
own, the advisor is not capable of completely resolving
the task.
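Structurally, the Ad-Hoc PKS can be encoded as a small advice table with a single maximum-utility action per advised state. The entries shown below are placeholders for illustration only, since the 12 advised states are not listed here.

```python
# Placeholder entries: state <R, RF, F, L, O> -> {action_index: utility}.
# The real Ad-Hoc PKS advises 12 straight-wall and open-corner states.
AD_HOC_ADVICE = {
    ("at distance", "far", "far", "far", "parallel"): {4: 1.0},     # e.g. keep straight
    ("at distance", "far", "far", "far", "approaching"): {6: 1.0},  # e.g. turn away slightly
    # ... the remaining advised states would be added in the same way
}

def ad_hoc_utilities(state, n_actions=9):
    """Utility vector of the Ad-Hoc PKS: maximum utility for the single advised action."""
    u = [0.0] * n_actions
    for action_index, utility in AD_HOC_ADVICE.get(state, {}).items():
        u[action_index] = utility
    return u
```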
5. Experimental results
5.1 Experimental procedure
In our experiments we differentiate between control cycles and learning steps. The former are carried out every 1/3 of a second; in each one the system updates the sensor measurements, determines the current state, and selects the action to be implemented. Learning steps, on the other hand, only take place when the agent changes state, which makes them asynchronous.
Our experimental procedure comprises two phases:
learning and testing. In the former, the action to be carried out in each control cycle is selected on the basis of
the probabilities given by Equation (8). Furthermore,
the Q-values from the RL module are stored periodically
for the test phase.
The test phase is used to measure the performance of the system throughout the learning process and to determine when the learning process has converged. During this phase, the stored Q-values are loaded and the best-valued action is always selected. The PKS is never used. A test finishes when the robot makes an error (collides with the environment or strays too far away from the wall being followed) or when it is considered that the task has been satisfactorily learnt.

As the measurement of the quality of what the agent has learnt, we have selected the time during which the agent carries out the task before committing an error. Given that the duration of each control cycle and the linear velocity of the robot are constant, this measurement is equivalent to the number of control cycles issued before an error occurs.

We have not used the amount of reinforcement received throughout the learning process (though this is precisely the value that the agent attempts to maximize), as in these systems it is not always possible to make out whether obtaining little reinforcement is due to a total failure or to low performance in carrying out the task. The difference is crucial for us: in the former case the task is not attained, while in the latter it is.

In our experiments, we consider the task to have been accurately learnt when the system is able to implement it over 75,000 control cycles without making an error. Our convergence criterion is that the task should be accurately learnt over 5 consecutive tests.

5.2 Convergence time

Figure 4 shows the results for the RL agent (Q-learning). This figure shows the reward obtained during the learning stage: each point represents the reinforcement accumulated in the previous 1,000 learning cycles. It can be seen how the learning process stabilizes at around 19,000 learning steps. This figure also shows the results of the test phase (control cycles without failure). As can be seen, the agent is capable of resolving the task from 16,000 learning cycles onwards, even though it still receives some negative rewards. There is a small disruption between 27,000 and 29,000 learning steps (4 negative rewards), which is due to the correct action of a critical state being devalued. The system is destabilized until it relearns the correct action, which is reflected in the test phase. These fluctuations are common in RL algorithms. Figure 5(a) shows the trajectory of the robot during the test phase after 35,000 learning steps.

Figure 4: Reward obtained by Q-learning during the learning phase and number of control cycles without error during the test phase (max. 75,000 cycles).

A comparison of the RL and SRL agents' convergence times can be seen in Figure 5(b). The incorporation of knowledge into the reinforcement learning process by means of SRL significantly accelerates learning convergence. The Ad-Hoc advisor reduces learning convergence time to within the region of 2,000 learning steps, which is an eighth of the time required by tabula rasa RL.

5.3 Robustness of the learnt behaviour
The final aim of the learning process is for the robot
to succeed in learning to follow the reference wall. But
it is also important to verify just how robust the final
learnt behaviour is, and one way of verifying this is to
study what occurs when the environment changes. If it
is robust, it will not be affected to any great degree, and
the robot will be capable of carrying out the task.
In this experiment we took the behaviours learnt by
tabula rasa RL and SRL with Ad-Hoc in the environment of Figure 5(a) and we tested them in the new environments (Figures 6(a) and 6(c)). We carried out the
tests at different stages of the learning process, and the
results obtained can be seen in Figures 6(b) and 6(d),
respectively.
The knowledge acquired by tabula rasa RL in the original environment is not sufficient to be able to resolve
the task in other much more complicated environments
(Figure 6(c)). RL learning is not sufficiently robust to
obtain a satisfactory behaviour. On the other hand, the behaviour learnt with SRL shows a great degree of robustness, carrying out the task correctly in the new environments from the onset (4,000 learning steps). Figures 6(a) and 6(c) show the path of the robot carrying out the task in the new environments with the policy acquired by means of SRL (Ad-Hoc SC) after 20,000
learning cycles in the original environment.
This lack of robustness of the RL-learnt behaviour is due to the fact that, with the same learning parameters, the selected action (the one with the highest Q-value) for the most frequent states keeps changing for a longer period of the learning process in RL than in SRL.
Hence, RL convergence is slow and weak. In the original
environment, these changes in the selected actions have
no effect on the performance after 16,000 learning steps,
but they prove critical in complex environments such
as the one shown in Figure 6(c). On the other hand,
thanks to the prior knowledge, the selected actions for
these states are settled sooner in SRL. This gives rise to a solid convergence and a much more robust behaviour.

Figure 5: RL and SRL results for the wall-following task: a) environment and robot trajectory performing the task after 35,000 RL learning steps (test phase); b) comparison of the SRL (PKS Ad-Hoc) and tabula rasa RL results.
6. Related work
Efforts aimed at including agents’ prior knowledge into
RL in the field of mobile robotics have been based on
three techniques. The first is the design of complex reinforcements (Matarić, 1994), which has the problem of
its lack of generality, and the difficulty caused by devising the functions that supply the variable reinforcement,
which will probably not be made use of in any other task.
Dynamic knowledge-orientated creation of the state
space (Hailu, 2001) is the second technique. Its main
drawbacks are that the stability of the system during
learning is not taken into account, and that there is a
great dependency on the quality of knowledge: if this is
incorrect, the agent cannot learn.
The third and most important approach is the focalization of exploration. There are various approaches, one
being to focalize exploration only at the onset of learning (del R. Millán et al., 2002). The statistical nature
of RL results in there being fluctuations in the learning process, which may lead to the devaluation of an
initially-recommended good action. In these cases, the
initial focalization of the exploration does not help to
stabilize the convergence of the RL algorithm.
Another approach is that of Lin (Lin, 1992), who proposes systems that use external knowledge expressed in
the form of sequences of states and actions that are sup-
plied by a teacher. The limitation of these systems is
that they can only make use of knowledge that can be
expressed as sequences of states and actions that are
characteristic of the system that is learning.
In a similar strategy, Clouse presents a system in which advice is administered asynchronously
(Clouse and Utgoff, 1992). Here a teacher may take part
in the learning process whenever he considers it necessary. In general, example- or teacher-based systems do
not allow the use of knowledge that is not drawn up
specifically for the system, incorrect knowledge or various simultaneous teachers.
The latest exploration focalization technique, the one
used in SRL, is to employ a control module to regulate the transfer of information between the knowledge
sources and the RL algorithm, which is located at the same level. Dixon (Dixon et al., 2000) presents a system of this type, which is similar to SRL. However, his work lacks the generality of SRL, which is designed to allow a broader range of prior knowledge sources and to deal with partial or even erroneous advice. In SRL,
the balancing of prior knowledge and learnt information,
which is carried out in the Credit Assignment Block, allows the designer to use several PKSs without the need
for a complex elaboration or any testing of their correctness. Dixon’s control modules are far simpler.
SRL also contemplates the possibility of the Control
Module imposing its decisions or assigning overall control
to a knowledge source, transforming it into a teacher.
Thus, SRL includes the aforementioned approaches at
the same time as it permits the use of a much broader set of knowledge sources. Furthermore, we believe that SRL is suitable for use on real systems in mobile robotics.

Figure 6: Robustness of the SRL-learnt wall-following behaviour: a) and c) robot trajectory performing the task with Q-values learnt after 20,000 SRL (Ad-Hoc) learning steps in the environment of Fig. 5(a); b) and d) control cycles without failure during the test phase using Q-values learnt in the environment of Fig. 5(a), SRL (Ad-Hoc) against tabula rasa RL.
7. Conclusion and future work
The use of reinforcement learning (RL) in real systems
has highlighted the limitations of these algorithms, the
main one being slowness in convergence. On the other
hand, in real systems there often exists prior knowledge
on the task being learnt that can be used to improve the
learning process.
In this paper we propose a new strategy for making use
of external knowledge within RL, which we call Supervised Reinforcement Learning (SRL). SRL is based on using prior knowledge of the task to focalize the RL
algorithm’s exploration towards the most promising areas of the state space. Thanks to SRL, knowledge can be
used to speed up convergence of RL algorithms, yielding
at the same time more robust controllers and improving
the agent’s stability during the learning process.
In order to demonstrate the viability of the proposed
methodology it has been applied to the resolution of
a basic task in mobile robotics, the wall-following behaviour, and it has been compared with a classical tabula
rasa RL algorithm (Q-learning).
Thanks to SRL a significant reduction in learning convergence times has been achieved, even using a simple,
intuitive prior knowledge source. Thus, just by advising
on the best action to be implemented in 12 of the 188
possible states, there is an 84% reduction in the SRL convergence time with respect to RL, which falls from 16,000 to 2,000 learning steps.
Furthermore, we have verified that the reduction in
convergence time does not imply a loss in robustness
in the final behaviour that is obtained. In fact, the behaviour learnt by SRL proves to be correct, even in more
complex environments than those used during the learning phase, and in which the policy learnt by RL does not
always succeed in carrying out the task safely.
The SRL model permits the existence of several prior
knowledge sources, even ones with incorrect advice. Our
next objective is to study the performance of the model
in such situations. We also aim to explore new methods
for improving knowledge transfer, concentrating on the
theoretical aspects of the model.
Lastly, we intend to apply SRL directly on a real robot,
and not on a simulator. The characteristics of SRL, with
regard to stability during the learning phase and convergence time, lead us to believe that ’on-line’ learning will
be feasible and safe.
Acknowledgment
This work was supported by CICYT's project TIC2003-09400-C04-03. David L. Moreno's research was also supported by MECD grant FPU-AP2001-3350.
References
Boyan, J. A. and Littman, M. L. (1994). Packet routing in dynamically changing networks: A reinforcement learning approach. In Cowan, J. D., Tesauro,
G., and Alspector, J., (Eds.), Advances in Neural
Information Processing Systems, volume 6, pages
671–678. Morgan Kaufmann.
Bridle, J. S. (1990). Training stochastic model recognition algorithms as networks can lead to maximum
mutual information estimation of parameters. In
Touretzky, D., (Ed.), Advances in Neural Information Processing Systems: Proc. 1989 Conf., pages
211–217. Morgan-Kaufmann.
Clouse, J. A. and Utgoff, P. E. (1992). A teaching
method for reinforcement learning. In Machine
Learning. Proc. 9th Int. Workshop (ML92), pages
92–101. Morgan Kaufmann.
del R. Millán, J., Posenato, D., and Dedieu, E. (2002).
Continuous-action q-learning. Machine Learning,
49:247–265.
Dixon, K. R., Malak, R. J., and Khosla, P. K.
(2000). Incorporating prior knowledge and previously learned information into reinforcement learning agents. Technical report, Carnegie Mellon University, Institute for Complex Engineered Systems.
Hailu, G. (2001). Symbolic structures in numeric reinforcement for learning optimum robot trajectory.
Robotics and Autonomous Systems, 37:53–68.
Iglesias, R., Regueiro, C., Correa, J., and Barro, S.
(1998a). Supervised reinforcement learning: Application to a wall following behaviour in a mobile
robot. In Tasks and Methods in Applied Artificial
Intelligence, volume 1416, pages 300–309. Springer
Verlag.
Iglesias, R., Regueiro, C., Correa, J., Sánchez, E., and
Barro, S. (1998b). Improving wall following behaviour in a mobile robot using reinforcement learning. In Proc. International ICSC Symposium on
Engineering of Intelligent Systems, volume 3, pages
531–537, Tenerife, Spain.
Kaelbling, L. P., Littman, M. L., and Moore, A. W.
(1996). Reinforcement learning: A survey. Journal
of Artificial Intelligence Research, 4:237–285.
Lin, L.-J. (1992). Self-improving reactive agents based
on reinforcement learning, planning and teaching.
Machine Learning, 8(3/4):293–321.
Matarić, M. J. (1994). Reward functions for accelerated learning. In Int. Conf. on Machine Learning,
pages 181–189.
Regueiro, C., Rodrı́guez, M., Correa, J., Moreno, D.,
Iglesias, R., and Barro, S. (2002). A control architecture for mobile robotics based on specialists. In Intelligent Systems: Technology and Applications, volume 6, pages 337–360. CRC Press.
Schaal, S. and Atkeson, C. (1994). Robot juggling:
Implementation of memory-based learning. IEEE
Control Systems, 14(1):57–71.
Sutton, R. S. and Barto, A. G. (1998). Reinforcement
Learning: An Introduction. MIT Press.
Tesauro, G. (1994). TD-Gammon, a self-teaching backgammon program, achieves master-level play.
Neural Computation, 6(2):215–219.
Watkins, C. (1989). Learning from Delayed Rewards.
PhD thesis, Cambridge University.
Wyatt, J. (1995). Issues in putting reinforcement learning onto robots. In 10th Biennial Conference of the
AISB, Sheffield, UK.