Reinforcement Learning for Mobile Robots with Continuous States
Yizheng Cai
Department of Computer Science
University of British Columbia
Vancouver, V6T 1Z4
Email: yizhengc@cs.ubc.ca
Homepage: www.cs.ubc.ca/~yizhengc
Abstract
Programming a mobile robot by hand is tedious, and a hand-programmed robot
adapts poorly to a new environment. Reinforcement learning provides a
mechanism for a mobile robot to learn how to accomplish a task in a given
environment with very little programming effort. However, traditional
reinforcement learning methods assume discrete state and action spaces,
which is not applicable to mobile robots in the real world. This project
simulated a mobile robot with continuous states and discrete actions,
using a safe approximation of the value function to learn the optimal
policy. The experimental results show that learning can be very successful
and efficient when it is bootstrapped with information provided by human
control.
1 Introduction
Programming the control of a mobile robot is tedious and time consuming, because it is
difficult to translate high-level knowledge of how to accomplish a task into low-level
control that the robot can execute. The task becomes even harder when the environment is
complex, and the resulting controller does not adapt to changes in the environment. A
better approach is to find a mechanism by which the robot can learn how to accomplish the
task itself. Reinforcement learning is exactly such a mechanism: it lets the robot learn
about its environment while accomplishing the task. However, traditional reinforcement
learning methods assume discrete state and action spaces, which does not apply to mobile
robots with real-valued states. In 1994, Boyan and Moore [1] introduced a method to safely
approximate the value function for reinforcement learning, which makes it possible to
apply reinforcement learning in continuous state spaces. Later, Smart and Kaelbling [2, 3]
adopted this safe value-function approximation in their work and used Q-learning to solve
the control problem of mobile robots in continuous space. Another important contribution
of their work is an efficient and practical learning-system construction methodology that
augments the reinforcement learning process with information that is easy for a human
expert to provide. Their experimental results demonstrate that, when bootstrapped with
information provided by human control, the learning process can be very efficient and the
learning results can be extremely good.
The motivation of this project is to take a closer look at reinforcement learning applied
to mobile robot control and to run experiments that verify its efficiency on such tasks.
The main task of the project is to simulate a mobile robot with continuous states and
discrete actions using the same approach as Smart and Kaelbling, so that their hypotheses
can be tested experimentally. One important hypothesis is the effectiveness of safe
value-function approximation, which is the basis for applying reinforcement learning in
continuous state spaces. Another is that learning bootstrapped with information provided
by human experts can significantly shorten the learning time and still produce a very
good policy for the robot to accomplish the task.
The first part of this report describes the basic idea of reinforcement learning and
Q-learning. The second part describes how the Q-learning method is implemented for mobile
robot control. The third part presents and discusses some experimental results. The last
part discusses the findings and the future work of this project.
2 Reinforcement Learning
Reinforcement learning has attracted much attention in the machine learning and
artificial intelligence communities over the past several decades. It provides a way to
teach a robot through rewards and punishments, so that the robot can learn how to
accomplish a task without hard-coding low-level control strategies.
Traditional reinforcement learning assumes that the world can be represented by a set of
discrete states, S, and that the agent has a finite set of actions, A, to take. The
interaction between the agent and the world is represented by the reward, R, and time is
discretized into time steps. The state of the agent at time t is denoted s_t and the
action it takes is a_t. After taking action a_t, the agent moves to a new state s_{t+1}
and receives a reward r_{t+1}, an immediate measure of how good the action was. The agent
therefore accumulates experience as a sequence of tuples (s_t, a_t, r_{t+1}, s_{t+1}).
Each action affects both the immediate reward and the next state the agent will be in,
which in turn affects the delayed reward. The ultimate goal of reinforcement learning is
to find an optimal policy of behaviors that performs best in the environment, or an
optimal value function that maps each state (or state-action pair) to a measure of the
long-term value of being in that state.
The method used in this project, Q-learning, finds an optimal function that maps a given
state and action to the long-term value of taking that action in that state.
2.1 Q-learning
Q-learning, introduced by Watkins and Dayan [4], is a method typically used to solve RL
problems. One of its big advantages is that it is a model-free algorithm: it does not
require any prior knowledge of the underlying MDP (Markov Decision Process) model. The
optimal value function for Q-learning is defined as follows:
Q*(s, a) = E[ R(s, a) + γ max_{a'} Q*(s', a') ]
Here, the optimal value function is the expected value, over the next state, of the
reward for taking action a in state s, ending up in state s', and performing optimally
from then on. γ is the discount factor, a measure of how much weight is given to future
reward. With the Q-function defined, the optimal policy follows directly:
 * (s)  arg max Q(s, a)
a
During the Q-learning process, the agent performs trial actions in the environment to
obtain a sequence of experience tuples and stores the mapping from state-action pairs to
values in a table. As learning proceeds, each state-action pair is visited multiple times
and the corresponding table entry is updated according to the following rule:
Q(s_t, a_t) ← Q(s_t, a_t) + α ( r_{t+1} + γ max_{a'} Q(s_{t+1}, a') − Q(s_t, a_t) )
where α is the learning rate. Watkins and Dayan [4] proved that the Q-function will
eventually converge after infinitely many visits to each state-action pair.
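As a minimal illustration of this update rule, the following Python sketch applies it to
a table of Q-values. The state and action representations, as well as the function names,
are illustrative placeholders and are not part of the original experiments.

from collections import defaultdict

Q = defaultdict(float)  # table mapping (state, action) -> Q-value, default 0.0

def q_update(Q, s_t, a_t, r_next, s_next, actions, alpha, gamma):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))."""
    best_next = max(Q[(s_next, a)] for a in actions)
    Q[(s_t, a_t)] += alpha * (r_next + gamma * best_next - Q[(s_t, a_t)])

def greedy_policy(Q, s, actions):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q[(s, a)])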
2.2 Safe Value Function Approximation
The table-based mapping described in the previous section is the traditional form of
Q-learning. In the real world, however, most mobile robots have real-valued states and
actions that cannot be discretized properly. An alternative is to approximate the
Q-function with a function approximator. A major problem with this approach is prediction
error. During Q-learning, the Q-value must be predicted for given state-action pairs, and
according to the update rule above, errors in predicting both the current Q-value and the
maximum Q-value of the next state accumulate as the process goes on and can eventually
dominate the approximation.
2.2.1 Reducing Approximation Error
To reduce the approximation error, Boyan and Moore [1] introduced a method to safely
approximate the value function. They observed that most approximation methods suffer from
hidden extrapolation: the approximator extrapolates to queries that lie outside the space
covered by the training data. To approximate the Q-function safely, they suggested that
function approximators should only interpolate the training data rather than extrapolate
from it. Ideally one would build a convex hull around the training data and only allow
prediction queries inside it. As a compromise between computational complexity and
safety, the structure adopted is the independent variable hull (IVH). For a training data
matrix X, whose rows are the training data points, the hat matrix is computed as:
V = X (X'X)⁻¹ X'
A query vector x lies within the hull only if it satisfies the following criterion:
x' (X'X)⁻¹ x ≤ max_i V_ii
where V_ii are the diagonal elements of V. The next section describes in detail how this
IVH is used for prediction.
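For concreteness, a small Python sketch of the IVH membership test might look as follows.
X is assumed to be a NumPy array of training points (one per row) and x a query vector of
matching dimension; the use of a pseudo-inverse is a choice made here for robustness, not
something prescribed by the original method.

import numpy as np

def inside_ivh(X, x):
    """Return True if x'(X'X)^(-1) x <= max_i V_ii, where V = X (X'X)^(-1) X'."""
    XtX_inv = np.linalg.pinv(X.T @ X)  # pseudo-inverse guards against near-singular X'X
    hat_diag = np.einsum('ij,jk,ik->i', X, XtX_inv, X)  # diagonal of the hat matrix V
    return float(x @ XtX_inv @ x) <= hat_diag.max()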
Another important technique for reducing prediction error is Locally Weighted Regression
(LWR) [5]. Instead of using all of the training data to approximate the function, LWR
only uses the neighbors that are close to the query point. Each of these neighbors is
assigned a weight according to its distance from the query point. Any kernel function can
be used to compute the weight; the one used in this project is a Gaussian:
ω = exp( −(x − q)² / h² )
where q is the query point, x is a neighbor of that query, and h is the bandwidth.
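A corresponding sketch of the locally weighted regression prediction, using this Gaussian
kernel, could be written as follows. Here K holds the selected neighbours, y their stored
Q-values, q the query, and h the bandwidth; the names and the intercept handling are
assumptions made for this sketch.

import numpy as np

def lwr_predict(K, y, q, h):
    """Locally weighted linear prediction at query point q."""
    w = np.exp(-np.sum((K - q) ** 2, axis=1) / h ** 2)  # Gaussian kernel weights
    A = np.hstack([K, np.ones((len(K), 1))])            # design matrix with intercept
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, y * sw.ravel(), rcond=None)  # weighted least squares
    return float(np.append(q, 1.0) @ beta)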
2.2.2 The HEDGER Algorithm
The HEDGER algorithm [2, 3] is a function approximation algorithm based on linear
regression. It combines safe value function approximation [1] with Locally Weighted
Regression [5]. The following pseudocode for HEDGER Q-value prediction is the version
used in the simulation for this project; it is based on the one given in [6], with minor
modifications for implementation purposes.
Algorithm 1 HEDGER prediction
Input:
    Set of training examples, S, with tuples of the form (s, a, q, r)
    Query state, s
    Query action, a
    LWR minimum set size, k
    LWR distance threshold, D
    LWR bandwidth, h
Output:
    Predicted Q-value, q_{s,a}

x ← (s, a)
K ← training points with distance to x smaller than D
if the number of training points in K is less than k then
    q_{s,a} = q_default
else
    Construct an IVH, H, from the training points in K
    if K'K is singular then
        q_{s,a} = q_default
    else if x is outside of H then
        q_{s,a} = q_default
    else
        q_{s,a} = LWR prediction using x, K, and h
        if q_{s,a} > q_max or q_{s,a} < q_min then
            q_{s,a} = q_default
Return q_{s,a}
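Combining these pieces, a rough Python rendering of Algorithm 1 might look like the
sketch below. It reuses the inside_ivh and lwr_predict sketches from Section 2.2.1, and
the data layout (a list of (x, q) pairs where x concatenates state and action) is an
assumption made here for illustration, not a description of the original implementation.

import numpy as np

def hedger_predict(S, s, a, k, D, h, q_default, q_min, q_max):
    """Rough rendering of Algorithm 1 (HEDGER prediction)."""
    if len(S) < k:
        return q_default
    x = np.concatenate([np.atleast_1d(s), np.atleast_1d(a)])
    X = np.array([xi for xi, _ in S])
    q = np.array([qi for _, qi in S])
    near = np.linalg.norm(X - x, axis=1) < D          # K: points within distance D of x
    if near.sum() < k:
        return q_default                              # too few neighbours
    K, qK = X[near], q[near]
    if np.linalg.matrix_rank(K.T @ K) < K.shape[1]:   # K'K singular
        return q_default
    if not inside_ivh(K, x):                          # query lies outside the IVH
        return q_default
    q_pred = lwr_predict(K, qK, x, h)
    if q_pred > q_max or q_pred < q_min:              # reject out-of-range predictions
        return q_default
    return q_pred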
The minimum size of the LWR set is easy to determine: it is simply the number of
parameters to be fit by the linear regression. The distance threshold for LWR should
ideally be large enough to include as many training points as possible; but as the size
of K grows, the computation becomes prohibitively expensive. Moreover, points far from
the query receive very small weights under the bandwidth, so their contribution can
safely be ignored and any computation spent on them is wasted. It is therefore better to
define a threshold, κ, as the minimum weight a point in K must have in order not to be
ignored. With a Gaussian kernel, the distance threshold can then be computed as:
D = h √( 2 ln(1/κ) )
where the bandwidth h is determined empirically in this project, since Smart reported in
his thesis [6] that tuning h does not bring significant benefit to overall performance.
In addition, the predicted Q-value should lie within the bounds of the minimum and
maximum possible Q-values. Because the problem in this project is infinite-horizon
discounted Q-learning, the maximum possible Q-value can be computed as follows:
Q_max = Σ_{t=0}^{∞} γ^t r_max = r_max / (1 − γ)
The minimum Q-value is computed in the same way. The bounds on the reward are updated
with each newly received reward, so that the Q-value bounds always safely contain the
actual values. The predefined default Q-value should also lie within these bounds.
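These two bookkeeping quantities are simple to maintain in code; the sketch below follows
the formulas quoted above (the threshold expression assumes the Gaussian kernel with
bandwidth h, and the function names are only for this sketch).

import math

def distance_threshold(h, kappa):
    """Distance D at which the kernel weight falls to the minimum weight kappa."""
    return h * math.sqrt(2.0 * math.log(1.0 / kappa))

def q_bounds(r_min, r_max, gamma):
    """Bounds of the infinite-horizon discounted return: r / (1 - gamma)."""
    return r_min / (1.0 - gamma), r_max / (1.0 - gamma)

In practice r_min and r_max would be updated from the rewards actually observed, as
described above.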
The HEDGER training algorithm is a modification of traditional Q-learning. The pseudocode
is presented in Algorithm 2, with some changes from the version in [6].
Algorithm 2 HEDGER training
Input:
    Set of training examples, S, with tuples of the form (s, a, q, r)
    Initial state, s_t
    Action, a_t
    Next state, s_{t+1}
    Reward, r_{t+1}
    Learning rate, α
    Discount factor, γ
    LWR minimum set size, k
    LWR distance threshold, D
Output:
    New set of training examples, S'

Update q_max and q_min based on r_{t+1} and γ
x ← (s_t, a_t)
q_{t+1} ← maximum predicted Q-value at state s_{t+1}, based on S and k
q_t ← predicted Q-value for query x, based on S and k
K ← set of training points used for the prediction of q_t
ω ← weights of the corresponding training points
q_new ← q_t + α ( r_{t+1} + γ q_{t+1} − q_t )
S' ← S ∪ { (x, q_new) }
for each point i in K
    q_i ← q_i + α ω_i ( q_new − q_i )
Return S'
In this project, due to time constraints, the robot is assumed to have a discrete action
set, so that the maximum predicted Q-value at state s_{t+1} can be obtained by comparing
the Q-values of all possible actions.
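A sketch of one training step under this assumption is given below. It reuses the
hedger_predict sketch above; for brevity the final loop of Algorithm 2 that re-weights
the stored neighbour values is omitted, and all names are illustrative rather than taken
from the original implementation.

import numpy as np

def hedger_train_step(S, s_t, a_t, r_next, s_next, actions,
                      alpha, gamma, k, D, h, q_default, q_min, q_max):
    """One Q-learning-style update with HEDGER prediction and a discrete action set."""
    # maximum predicted Q-value at the next state, taken over the discrete actions
    q_next = max(hedger_predict(S, s_next, a, k, D, h, q_default, q_min, q_max)
                 for a in actions)
    q_t = hedger_predict(S, s_t, a_t, k, D, h, q_default, q_min, q_max)
    q_new = q_t + alpha * (r_next + gamma * q_next - q_t)
    x = np.concatenate([np.atleast_1d(s_t), np.atleast_1d(a_t)])
    S.append((x, q_new))  # grow the training set with the new experience point
    return S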
3 Reinforcement Learning in Mobile Robot Control
As seen in the previous sections, Q-learning requires the agent to act randomly at the
beginning of learning in order to gather sufficient experience. In an environment with
very sparse rewards, the learning process may spend a lot of time exploring the world
without gaining any useful information. One reasonable way to address this problem is to
use a supplied controller to lead the robot to the interesting regions as soon as
possible, so that this information can bootstrap the learning process.
3.1 Two-Phase Learning Process
Smart and Kaelbling introduced a two-phase learning process [2, 3] that uses a predefined
policy, or direct control by a human expert, in the first phase to collect sufficient
experience. Meanwhile, the RL system learns from this experience passively, using the
information for value-function approximation. In the second phase, the reinforcement
learning system takes control of the robot and continues to learn from its experience,
while the supplied control no longer has any effect. The RL system does not try to learn
the demonstrated trajectory; it only uses the experience for value-function
approximation. The two-phase learning process is illustrated in the following figure [2]:
[Figure 1 shows block diagrams of the two phases: in phase 1 the supplied control policy
chooses the actions (A) while the learning system only observes the rewards (R) and
observations (O) from the environment; in phase 2 the learning system chooses the
actions.]
Figure 1. Diagram of the two learning phases
3.2 Corridor-Following Task
The task simulated in this project is the corridor-following task, which is similar to
the experiment done with a real robot by Smart and Kaelbling [2, 3]. The task is depicted
in figure 2.
[Figure 2 shows the corridor with the reward area at its end; the labelled quantities are
the position in the corridor and the distance to the end.]
Figure 2. The corridor-following task
In the corridor-following task, the state space has three dimensions: the distance to the
end of the corridor, the distance to the left wall as a fraction of the total corridor
width, and the angle to the target, as shown in figure 3.
[Figure 3 shows the robot in the corridor with the quantities labelled in the original
figure: the left wall, the corridor width w, the distance d to the target at the end, and
the angles φ, ψ, and θ.]
Figure 3. Relation between the three dimensions of the state space
According to figure 3, the relation between the three dimensions can be represented by
the following formula:
ψ = tan⁻¹( d / (θw − w/2) )
where θ denotes the fractional distance to the left wall, w the corridor width, d the
distance to the end, and ψ the angle to the target.
All three dimensions of the state are real-valued. The simulated scenario uses a sparse
reward: the robot receives a reward of 10 when it reaches the reward area and 0
everywhere else. The reason for using a sparse reward is that, intuitively, if learning
works in the sparse-reward setting, it should also work with dense rewards, since
everything else is the same but the robot reaches the interesting region more easily and
updates the approximation more quickly, making learning faster. The sparse reward is also
why the two-phase learning procedure is needed to speed up learning: with sparse rewards,
it takes the robot a long time to reach the rewarding spots by taking random actions. One
of the main purposes of this project is precisely to see how the learning strategy works
in the sparse-reward scenario.
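The sparse reward itself is trivial to implement; a sketch consistent with the
description above is given below, where the size of the reward area is a hypothetical
parameter not stated in the report.

def reward(distance_to_end, reward_area_depth=10.0):
    """Sparse reward: 10 inside the reward area at the end of the corridor, 0 elsewhere."""
    return 10.0 if distance_to_end <= reward_area_depth else 0.0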
3.3 Implementation Issues
For simulation purposes, the corridor is modeled as a 300-by-800-pixel region, and the
robot is drawn as a circle with a line from its center indicating the direction it is
facing. The goal of the task is to reach the end of the corridor. Due to the time limit
of this project, the task is simplified to some extent. The translation speed of the
robot, υ_t, is constant everywhere in the corridor. The action of the robot is a
counterclockwise rotation from its current heading; the rotation angle ranges from 30 to
360 degrees in 30-degree increments. Using 360 degrees instead of 0 avoids producing a
singular matrix in Algorithm 1. For the same reason, the angle to the target is used as
one of the state dimensions, since its value will rarely be zero.
As the formula in section 3.2 shows, even with discrete actions and constant speed, the
position and direction of the robot are real-valued. To simulate continuous states, all
elements of the state vector are stored as real numbers; only when the robot is plotted
on the screen is its position rounded to the nearest integer pixel. Thus, even though the
simulation runs on a display with discrete pixels, all state entries are treated as real
values, keeping the simulation close to the real situation.
For the parameters of the simulation, the learning rate α is set to 0.2 and the discount
factor γ to 0.99, both adopted directly from Smart's work. The LWR bandwidth is set
empirically to 0.06. The minimum LWR set size k should be at least the size of the query
vector (s, a), which has four entries, so k is set to 5. The supplied control in the
first phase is hard-coded to force the robot to always move forward, although the exact
action is random.
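For reference, the parameter values reported in this section can be collected as follows;
the variable names exist only for this sketch.

ALPHA = 0.2                          # learning rate, from Smart's work
GAMMA = 0.99                         # discount factor, from Smart's work
LWR_BANDWIDTH = 0.06                 # h, set empirically
LWR_MIN_SET_SIZE = 5                 # k, one more than the four entries of (s, a)
ACTIONS = list(range(30, 361, 30))   # counterclockwise rotations: 30, 60, ..., 360 degrees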
In addition, a real robot should avoid hitting obstacles so that its sensors and other
components are not damaged, so it needs some mechanism to avoid running into the wall. In
this simulation, the robot therefore estimates, from the sensor state and the action
generated by the action module, whether the current action would cause it to bump into
the wall. If the action leads to a bump, the action module simply generates a random
action instead.
In the second phase of the simulation, the action module controlled by the RL system does
not always follow the greedy policy that takes the best action: with some probability it
performs a random action for exploration. During evaluation, the action module also
produces noisy actions, so that the simulation is closer to reality.
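The exploration behaviour described here amounts to an ε-greedy rule; a minimal sketch is
shown below. The value of EPSILON is an assumption, since the report does not state the
probability that was used.

import random

EPSILON = 0.1  # assumed exploration probability (not specified in the report)

def select_action(q_by_action):
    """q_by_action: dict mapping each discrete action to its predicted Q-value."""
    if random.random() < EPSILON:
        return random.choice(list(q_by_action))   # explore: random action
    return max(q_by_action, key=q_by_action.get)  # exploit: greedy action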
4 Results
To evaluate the learning results, the training process is evaluated after every 5
training runs. In all training runs, the robot starts at the same position but with a
different random direction. The following figure shows the number of steps the robot
needs to reach the goal during evaluation.
[Figure 4 plots the number of steps to the goal during evaluation; the two panels
correspond to the phase-one and phase-two training runs.]
Figure 4. The steps to the goal during evaluation of the two-phase training
As figure 4 shows, after several training runs in the first phase, the number of steps to
the goal decreases rapidly. Although there are some peaks, the overall number of steps
keeps decreasing. After 30 runs of first-phase training, the system switches to RL
control. The phase change produces a peak in the plot, because the system needs to adapt
to the new data brought in by RL control and exploration. Finally, with sufficient
exploration of the world, the number of steps to the goal converges close to the optimal
value.
Unlike the results in Smart's work, the peak here appears with some delay after the phase
change. Several factors might cause this. At the early stage of second-phase learning,
there are not yet enough new data points from the RL module to affect the Q-value
prediction, so the prediction still depends heavily on the earlier data points from the
supplied control. Also, because of the exploration strategy, the action module controlled
by the RL system takes random actions with some probability, which leads the robot into
areas it is not familiar with. The robot then has to act randomly and update the Q-values
in those regions until it reaches a region it knows better; these newly created, still
unstable data points can make the robot perform poorly. The unexpected peaks in both
training phases are probably caused by the noise added to the action module at the
evaluation stage: before the dataset covers most regions of the corridor, a random action
is likely to lead the robot into an unfamiliar region, and the robot then needs many
random actions to get back to familiar territory. Once there is sufficient training data
covering most regions from the starting position to the goal, a single random action no
longer pushes the robot out of the safe region, so the total number of steps to the goal
finally converges. In this simulation, because a k-d tree was not implemented to speed up
the nearest-neighbor search over the stored experience, it takes more than one day to run
the two-phase training with more than 80 training runs. Figure 4 therefore reflects a
single complete simulation; no further simulations were run due to time constraints.
Figure 5 shows two images that plot the path of the robot in the corridor-following task.
The first image is from an early stage of first-phase training. It clearly reflects the
pattern of Q-value propagation: because the reward is sparse, the Q-values propagate
slowly backward from the reward area, so the early steps look almost random, but toward
the end the robot moves quickly to the end of the corridor. The second image is from a
late stage of second-phase training; there the Q-values are well distributed, so almost
all actions move the robot toward the target rather than being random.
Figure 5. The path of the robot at an early and a late stage of the training procedure
In conclusion, the simulation shows that safe value-function approximation works well for
RL in a continuous state space: the robot converges very close to an "optimal" policy
after a reasonable number of training runs. The two-phase training procedure also makes
the number of steps drop quickly to a reasonable value and speeds up the whole training
procedure significantly, especially in environments with very sparse rewards. The
simulation thus confirms the effectiveness of human-aided control as a practical method
for mobile robot control tasks.
5 Discussion and Future Work
The results of this simulation support the hypotheses of Smart and Kaelbling well,
especially the idea of two-phase learning. It is a promising approach that uses a simple,
high-level method to handle the troublesome problems of low-level control and information
mapping. Nevertheless, two major problems remain. First, the task in this simulation is
very simple; simulations with more complicated tasks are needed to further demonstrate
the effectiveness of the approach. Second, the action space in this simulation is
discrete, which significantly simplifies the problem. In the real world, most actions are
real-valued and difficult to discretize, so more sophisticated methods are required to
find the maximum Q-value for the next state, which amounts to finding the extremum of the
approximated Q-function within a specified region. This will probably make the policy
harder to converge to "optimal". Brent's method is the one used by Smart and Kaelbling
and has proved effective; other root-finding methods should be tried as well.
Acknowledgments
Thanks to David Poole for useful suggestions and pointers to the literature.
References
[1] Boyan, J. A. and Moore, A. W. Generalization in reinforcement learning: Safely
approximating the value function. In Advances in Neural Information Processing Systems:
Proceedings of the 1994 Conference, pp. 369-376. Cambridge, MA: MIT Press, 1995.
[2] William D. Smart and Leslie Pack Kaelbling. Effective reinforcement learning for
mobile robots. In International Conference on Robotics and Automation, 2002.
[3] William D. Smart and Leslie Pack Kaelbling. Practical reinforcement learning in
continuous spaces. In Proceedings of the Seventeenth International Conference on Machine
Learning, 2000.
[4] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, vol. 8,
pp. 279-292, 1992.
[5] Atkeson, C. G., Moore, A. W. and Schaal, S. Locally weighted learning. AI Review,
vol. 11, pp. 11-73, 1997.
[6] William D. Smart. Making Reinforcement Learning Work on Real Robots. PhD thesis,
Department of Computer Science, Brown University, 2002.