Solving Maze Problem with Abstraction Selection
Xiaodong Wang, and Chi Zhang
University of Tennessee, Knoxville, USA
{xwang33, czhang24}@utk.edu
Abstract—Abstraction has been extensively studied in the field of artificial intelligence. It is especially useful for high-dimensional continuous domains, since abstraction can reduce the large state and action spaces of such problems. In this project, we solve the multi-task maze problem by temporal difference learning under the abstraction selection framework. Simulation results show that the multi-task maze problem can be solved efficiently with the abstraction selection framework. We further compare the algorithm proposed in the existing literature with our modified algorithm. Results show that the multi-task maze problem can be solved more efficiently by using tabular-form TD learning than with the function approximation framework proposed in [1].
I. INTRODUCTION
High-dimensional, continuous domains are a class of reinforcement learning problems that remains difficult to solve. A key approach for such problems is to use an abstraction that reduces the number of state variables in the solutions. However, a single abstraction cannot be applied effectively to the whole problem. Much recent research on such problems has focused on hierarchical reinforcement learning, which divides an intrinsically high-dimensional problem into several subproblems, each of which is much easier and can be solved using only a small set of state variables. For example, in the problem of learning to drive home, the entire task can be broken into several small tasks, including getting to the parking lot, opening the car, starting the car, and driving home. We take advantage of breaking the large problem down into a series of subproblems and solve each subproblem using its own abstraction. If the agent has a library of abstractions available to it, it can select from the library and apply the selected abstraction to aid new skill learning.
In this project, we use the abstraction selection algorithm proposed in [1] to solve a multi-task maze problem. In the multi-task problem, the agent needs to go through a series of subgoals before finishing the learning task at the final goal. This problem fits well in the hierarchical reinforcement learning domain: we can break the entire multi-task maze problem apart at each subgoal. After breaking the multi-task maze problem into subtasks, we select an appropriate abstraction for each subgoal task and use temporal difference (TD) learning to solve it. The results show that the agent selects an appropriate abstraction using very little sample data and therefore significantly improves skill learning performance in large real-valued reinforcement learning domains.
The remainder of this paper is organized as follows. Section
II introduces the background of abstraction and the option
(subtask) framework. Section III elaborates on the details of
using abstraction to solve a large learning problem. Section
IV presents two application scenarios in which abstraction
selection can be applied. Section V shows the evaluation
results of using abstraction selection for the multi-task maze
problem. Section VI concludes the paper.
II. BACKGROUND
In this section, we introduce the background of abstraction selection, including the options framework and the definition of abstraction.
A. The Options Framework
The options framework [2] is a hierarchical reinforcement learning framework that provides methods for learning and planning by adding temporally extended actions (called options) to the standard reinforcement learning framework. Options are closed-loop policies for taking action over a period of time. An option o consists of three components: a policy πo, an initiation set Io, and a termination condition ςo:

πo : (s, a) → [0, 1]
Io : s → {0, 1}
ςo : (s, a) → [0, 1]        (1)
The initiation set Io is an indicator function, which is 1 for states where the option can be executed and 0 elsewhere. An option is available in state st if and only if st ∈ Io. In our multi-task maze problem, the initiation set Io can include any state in the state space, since the agent may start walking toward a specific subgoal from any state. ςo is the termination condition for the option. Option creation and termination are usually driven by the identification of subgoal states. In our multi-task maze problem, the termination condition is reaching the subgoal associated with the current option. After reaching the current subgoal, the algorithm can create a new option for the next subgoal by defining a new termination condition.
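As an illustration (not part of the appendix implementation), an option for one maze subgoal can be written in MATLAB as a struct holding these three components; the field names and the 20x20 grid below are assumptions chosen to match the maze used later:

% Minimal sketch of an option for the subgoal at (subx, suby) on a 20x20 grid.
% Field names (policy, initiation, termination) are illustrative, not from [2].
subx = 1; suby = 1;
opt.policy      = ones(20,20,4)/4;                        % pi_o(s,a): start uniform over 4 moves
opt.initiation  = @(sx,sy) 1;                             % I_o(s): any state can start the option
opt.termination = @(sx,sy) double(sx==subx && sy==suby);  % varsigma_o(s): 1 at the subgoal, 0 elsewhere
% Example queries: is the option available / finished at state (5,5)?
available = opt.initiation(5,5);
finished  = opt.termination(5,5);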
B. Abstraction
In the example of learning motor skills, although humans have many sensory inputs and degrees of freedom, which constitute a very large state space, specific sensorimotor skills almost always involve a small number of sensor features and ignore most of the sensor and motor features in the environment. This inspires us to apply abstraction to high-dimensional problems.
In reinforcement learning, the use of a smaller set of variables to solve a large problem is modeled using the notion of abstraction [3]. Instead of working in the ground state space and action space, the decision maker can find solutions much faster in the abstract state and action spaces by treating groups of states and actions as units and ignoring irrelevant information.
We define an abstraction Mi to be a pair of functions
(σi , τi ), where σi : S → S ′ is a mapping from the overall
state space S to a smaller state space S ′ , and τi : A → A′ is
a mapping from the full problem action space A to a smaller
action space A′. In addition, each abstraction has an associated vector of basis functions Φi defined over S′, which can be used to approximate value functions.
In the hierarchical reinforcement learning setting, the agent tries to build as many abstractions as it has skills. Thus an agent solving many problems over its lifetime may accumulate a library of abstractions, which can be used later to solve new problems. Combining abstractions with options, we specify that when an agent creates a new option it should create it with an accompanying abstraction. The agent can select an abstraction from a library of abstractions and refine the selected abstraction through experience.
An agent creates an option to reach a particular subgoal (state) only after the subgoal is first reached. Therefore, a set of sample interactions ends at the new subgoal, which we treat as a sample trajectory for the option. A trajectory with m steps consists of a sequence of m state-action-reward tuples. Given a library of abstractions, applying abstraction i to the sample trajectory yields

{(s_1^i, a_1^i, r_1), (s_2^i, a_2^i, r_2), ..., (s_m^i, a_m^i, r_m)}        (2)

where (s_k^i, a_k^i, r_k) = (σi(s_k), τi(a_k), r_k) is the state-action-reward tuple obtained from abstraction i describing the kth state-action pair in the trajectory.
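As an illustrative sketch (not taken from the appendix), the following MATLAB code applies one abstraction to a short ground trajectory; the choice of σi as the offset to the subgoal mirrors the deltax/deltay features used in our implementation, and the trajectory values are made up:

% Sketch: map a ground trajectory through abstraction i = (sigma_i, tau_i).
subx = 1; suby = 1;
sigma = @(s) [s(1)-subx, s(2)-suby];   % keep only the offsets to the subgoal
tau   = @(a) a;                        % the action space is unchanged in the maze
% A short ground trajectory: each row is [sx, sy, a, r].
traj = [5 5 1 -1;
        4 5 2 -1;
        4 4 1 -1];
abs_traj = cell(size(traj,1),1);
for k = 1:size(traj,1)
    sk = sigma(traj(k,1:2));           % abstract state s_k^i
    ak = tau(traj(k,3));               % abstract action a_k^i
    rk = traj(k,4);                    % reward is unchanged
    abs_traj{k} = struct('s',sk,'a',ak,'r',rk);
end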
III. SELECTING APPROPRIATE ABSTRACTION
Our goal is to break a large task into small tasks and then choose an appropriate abstraction to learn the skill for each subtask. In this section, we first introduce linear function approximation, the basic tool used in abstraction selection. We then elaborate on how to choose the appropriate abstraction for each subtask.
A. Linear Function Approximation
Value function estimation represented as a table with one entry for each state or each state-action pair is a particularly clear and instructive case, but it is limited to tasks with small numbers of states and actions. For high-dimensional state representations, the reality is very different. The problem is not just the memory needed for large tables, but also the time and data needed to fill them accurately. Function approximation provides a way to estimate the value of each state while avoiding this large resource overhead. Function approximation takes examples from a desired function (e.g., a value function) and attempts to generalize from them to construct an approximation of the entire function. One of the most important function approximation schemes is linear function approximation, which approximates V by a weighted sum of basis functions Φ:
basis functions Φ:
V̄ (s) = w · Φ (s) =
n
∑
wi ϕi (s)
(3)
i=1
where ϕi is the ith basis function.
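As a concrete illustration of Equation (3), the sketch below evaluates a linear value estimate from a weight vector and a basis vector; the four-feature cosine basis is the one that also appears in the appendix code, and the weight values are arbitrary:

% Sketch: linear value estimate V(s) = w' * Phi(s) for one maze state.
dx = 3; dy = -2;                                                    % offsets of state s to the (sub)goal
Phi = [1; cos(pi*dx*0.05); cos(pi*dy*0.05); cos(0.05*pi*(dx+dy))];  % basis vector Phi(s)
w   = [0.5; -1.2; 0.3; 0.7];                                        % example weight vector
V   = w' * Phi;                                                     % weighted sum of basis functions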
One basis that is widely used for function approximation is the Fourier basis [4]. The Fourier expansion of a multivariate function F(x) with period T in m dimensions is

F̄(x) = Σ_c [ a_c cos((2π/T) c · x) + b_c sin((2π/T) c · x) ]        (4)

where c = [c_1, ..., c_m] with each c_i ∈ {0, ..., n}. This results in 2(n+1)^m basis functions for an nth order full Fourier approximation to a value function in m dimensions, which can be reduced to (n+1)^m if we drop either the sin or cos terms for each variable.
We thus define the kth order Fourier Basis for m variables:
φ_i(x) = cos(π c^i · x)        (5)
where c^i = [c_1, ..., c_m], c_j ∈ [0, ..., k], 0 ≤ j ≤ m. Each basis
function thus has a coefficient vector c that attaches an integer
coefficient (less than or equal to k) to each variable in x; the
basis set is obtained by systematically varying the variables
in c. This basis has the benefit of being easy to compute
accurately even for high degrees, since cos is bounded in
[-1,1], and its arguments are formed by multiplication and
summation rather than exponentiation.
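The kth order Fourier basis of Equation (5) can be enumerated by sweeping the coefficient vector c over all integer combinations in [0, k]. A minimal MATLAB sketch for m variables, assuming the state is scaled to [0, 1]^m, is:

% Sketch: build all (k+1)^m coefficient vectors and evaluate phi_i(x) = cos(pi * c_i . x).
k = 3; m = 2;                          % order and number of state variables
x = [0.25, 0.6];                       % a state scaled to [0,1]^m
nfeat = (k+1)^m;
phi = zeros(nfeat,1);
for idx = 0:nfeat-1
    % decode idx into a coefficient vector c in {0,...,k}^m
    c = mod(floor(idx ./ (k+1).^(0:m-1)), k+1);
    phi(idx+1) = cos(pi * (c * x'));   % phi_i(x) = cos(pi * c . x)
end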
B. Abstraction Selection
The objective of abstraction selection is to achieve efficient skill learning. The key idea of our design is to use abstraction together with hierarchical reinforcement learning in high-dimensional continuous problems. One scenario where abstraction selection can be applied is the continuous playroom problem, which appears easy to humans but is difficult for an agent because of the large number of variables and interactions between variables (e.g., between ∆x and ∆y values for an object-effector pair) that cannot all be included in the overall task function approximation. An O(1) Fourier basis over 120 variables that does not treat each variable as independent results in 2^120 features. Thus, options and abstractions are utilized to greatly improve performance in such domains. In this work, we implemented a multi-task maze.
If we have the entire trajectory at once, we can fit a value function under each abstraction and then select the best abstraction as a model selection problem for regression. A common model selection criterion is the Bayesian Information Criterion (BIC) [5]:

ln p(D|Mi) ≈ ln p(D|θMAP, Mi) − (1/2) |Mi| ln m        (6)

where D is the data, Mi is abstraction i, p(D|θMAP, Mi) is the likelihood of D given the maximum a posteriori value function parameters θMAP for abstraction i, |Mi| is the number of parameters of abstraction i, and m is the sample size.
function SensorimotorAbstractionFit(i, ρ, η):
1. Initialization:
   Set A0, b0, z0, Rc and Rz to 0, and g to 1.
2. Iteratively handle incoming samples:
   for each incoming sample (st, at, rt):
       At = ρ At−1 + Φi(st) Φi(st)^T
       bt = ρ bt−1 + ρ rt zt−1 + rt Φi(st)
       zt = ρ zt−1 + Φi(st)
       Rc = ρ Rc + g rt² + ρ rt Rz
       Rz = ρ Rz + 2 g rt
       g = ρ g + 1
3. Compute weights, error and variance (after m samples):
   w = (Am + ηI)⁻¹ bm
   e = w^T Am w − 2 w · bm + Rc
   β = m / e
4. Compute log likelihood and BIC (quantities constant across abstractions are ignored):
   ll = −(β/2) e + (m/2) ln β
   return ll − (1/2) |Φi| ln m

Fig. 1. An incremental algorithm for computing the BIC value of an abstraction i, using weight factor ρ and regularization parameter η, given a successful sample trajectory.
In this work, we use a linear regression model as the statistical model for the data. The log likelihood of this model is
ln p(D|Mi, w, β) = −(β/2) ei + (m/2) ln(β/2π) + ((m² − m)/4) ln ρ        (7)

where β⁻¹ is the variance, w is the function approximation weight vector, and ei = Σ_{j=1}^{m} ρ^(m−j) [w · Φi(sj) − Rj]² is the summed weighted squared error.
We use the incremental algorithm given in Figure 1, following [1]. The algorithm is run simultaneously for each abstraction while the agent is interacting with the environment. Whenever an option is created by the agent, the algorithm computes the associated log likelihood for each abstraction in one step. The agent then selects the abstraction with the highest log likelihood.
More than one sample trajectory may be available, or may be required to produce a robust selection. Given p sample trajectories, we can modify the algorithm to run steps 1 and 2 separately for each trajectory and sum the resulting A, b and Rc. Steps 3 and 4 then use the summed variables to perform a fit over all p trajectories simultaneously.
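For concreteness, a compact MATLAB sketch of the incremental fit of Figure 1 is given below; it was written for this report (not taken from [1]), handles a single abstraction, and assumes the basis function is passed in as a function handle and the trajectory as a cell array:

function bic = abstraction_fit_sketch(Phi, traj, rho, eta)
% Sketch of the incremental BIC computation of Fig. 1 for one abstraction.
% Phi:  function handle returning the basis vector Phi_i(s) for a state s
% traj: cell array of samples {s_t, a_t, r_t}; rho: weight factor; eta: regularizer
d  = numel(Phi(traj{1}{1}));
A  = zeros(d,d); b = zeros(d,1); z = zeros(d,1);
Rc = 0; Rz = 0; g = 1;
for t = 1:numel(traj)
    s = traj{t}{1}; r = traj{t}{3};
    p  = Phi(s);
    A  = rho*A + p*p';
    b  = rho*b + rho*r*z + r*p;
    z  = rho*z + p;
    Rc = rho*Rc + g*r^2 + rho*r*Rz;
    Rz = rho*Rz + 2*g*r;
    g  = rho*g + 1;
end
m    = numel(traj);
w    = (A + eta*eye(d)) \ b;           % regularized weights
e    = w'*A*w - 2*(w'*b) + Rc;         % weighted squared error
beta = m/e;
ll   = -beta/2*e + m/2*log(beta);      % log likelihood (constant terms dropped)
bic  = ll - 0.5*d*log(m);              % penalize by the number of basis functions
end

The agent would run this once per abstraction on the recorded trajectory and keep the abstraction with the largest returned value.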
IV. APPLICATIONS WITH ABSTRACTION SELECTION
In this section, we introduce two applications that can be solved by using the abstraction framework: the continuous playroom and the multi-task maze problem. We also elaborate on the extension to the algorithm with function approximation introduced in the previous section.

Fig. 2. Playing room
A. Continuous Playroom
The Continuous Playroom is a real-valued version of the
Playroom domain [6]. It consists of an agent with a number
of objects: a light switch, a ball, a bell, two movable blocks
that are also buttons for turning music on and off, as well as
a toy monkey that can make sounds. The agent also has three
effectors: an eye, a hand, and a visual marker. The agent's sensors tell it what objects (if any) are under the eye, hand and marker. The agent is in a 1×1 room, and may move any of its effectors 0.05 units in one of the usual four directions. When both its eye and hand are over an object it may additionally interact with it, but only if the light is on (unless the object is the light switch). Interacting with the green button switches the music on, while the red button switches the music off. The switch toggles the light. If both the hand and the eye are on the light switch, then the action of flicking the light switch becomes available, and if both the hand and eye are on the ball, then the action of kicking the ball becomes available (the ball, when kicked, moves in a straight line to the marker).
Finally, if the agent interacts with the ball and its marker
is over the bell, then the ball hits the bell. Hitting the bell
frightens the monkey if the light is on and the music is on,
and causes it to squeak, whereupon the agent receives a reward
of 100,000 and the episode ends. All other actions cause the
agent to receive a reward of -1. Every time an object is interacted with by any effector, it relocates randomly in the room so that the objects do not overlap within an episode. Notice that if
the agent has already learned how to turn the light on and off,
how to turn music on, and how to make the bell ring, then
those learned skills would be of obvious use in simplifying
this process of engaging the toy monkey.
The authors of [1] implemented the continuous playroom with an O(3) independent Fourier basis, with learning performed using Sarsa(λ). Results show that agents learning with an abstraction start better and obtain better overall solutions. Moreover, the initial value function obtained by abstraction selection lets agents finish episodes in fewer steps compared with learning from scratch.
TABLE I
LEARNING CURVE IMPROVEMENT FOR AN OPTION WITH AN ABSTRACTION, AND WITH AN ABSTRACTION USING THE INITIAL VALUE FUNCTION FIT OBTAINED DURING RANDOM AND OPTIMAL TRAJECTORY SAMPLES, NORMALIZED TO NO ABSTRACTION [1].

Episode    Abstraction    Fit(Random)    Fit(Given)
1          78.57%         67.14%         87.86%
10         -6.67%         40.00%         44.44%
20         75%            77.5%          77.6%
30         84.21%         89.12%         73.68%

Fig. 4. Comparison between applying abstraction selection with function approximation and with tabular form to the multi-task maze problem (steps per episode over 100 episodes).

Fig. 3. Multi-task maze (Start, subgoals 1 and 2, Goal).
Table I summarizes the results reported in [1], showing that abstraction selection can improve performance greatly. Normalized to learning with no abstraction, learning with an abstraction obtains better overall solutions in almost all episodes. Moreover, the quality of the trajectory data used for the fit significantly impacts the resulting policy, with policies obtained from fitting optimal sample trajectories (Fit(Given)) performing much better than those obtained from random sample trajectories (Fit(Random)).
B. Multi-task Maze
The second application, which is our major target in this project, is the multi-task maze problem. Figure 3 shows an example of the multi-task maze. The agent starts at the Start point, and the final goal is to reach the Goal point. For the task to be accomplished, there is a specific requirement: the agent must first go through blocks 1 and 2 in sequence.
The maze problem is a special case for which the tabular form of value prediction can be applied naturally. Although function approximation with the Fourier basis, as introduced in previous sections, can reduce the state space effectively, it might not work well in the maze problem context (as we will see in the evaluation section). There are two major reasons. First, the maze problem has a discrete state space, whose value function is hard to approximate linearly with the Fourier basis. Second, with linear approximation using the Fourier basis, the predicted value is highly sensitive to small changes in the function weights, which can result in biased value prediction and thus worse learning results. Therefore, we propose to use the tabular form for solving this specific maze problem with abstraction.
Specifically, we use the abstraction selection framework by keeping a table in each abstraction and selecting an appropriate abstraction for the next option (subgoal). We then use tabular Sarsa(λ) at the option learning stage for each subgoal, as sketched below; the full implementation code is in the appendix.
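A stripped-down sketch of the tabular Sarsa(λ) update used inside each subgoal option is shown below for clarity; the transition values are illustrative, and the full version with abstraction selection is in the appendix:

% Sketch: one tabular Sarsa(lambda) update for the current option on a 20x20 grid.
% Q and E are the option's action-value and eligibility tables; (sx,sy,a) is the
% current state-action, (nsx,nsy,na) the next one, and r the observed reward.
gamma = 1; lambda = 0.9; alpha = 0.001;
Q = zeros(20,20,4); E = zeros(20,20,4);
sx = 5; sy = 5; a = 1; nsx = 4; nsy = 5; na = 2; r = -1;   % example transition
delta      = r + gamma*Q(nsx,nsy,na) - Q(sx,sy,a);         % TD error
E(sx,sy,a) = E(sx,sy,a) + 1;                               % accumulating trace
Q = Q + alpha*delta*E;                                     % update every entry
E = gamma*lambda*E;                                        % decay traces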
V. MULTI-TASK MAZE PROBLEM RESULT
In this section, we show the results of using abstraction
selection in the Multi-task Maze problem.
A. Is Fourier Basis Function Approximation Good for Maze?
As introduced in the previous section, linear function approximation is well suited to continuous state spaces. Since the maze problem has a discrete state space, function approximation might not work well in this context. In this experiment we implement both the function approximation Sarsa(λ) and the tabular-form Sarsa(λ). We see from Figure 4 that abstraction selection works better with tabular-form learning for this multi-task maze problem. The results are averaged over 100 runs for each episode. This demonstrates that abstraction selection can also work with tabular-form learning, and is especially effective for discrete-space problems.
B. Learning Performance
We perform more experiment to evaluate the impact from
the ϵ-greedy parameter choice in the option learning stage. We
change the ϵ value from 0.01 to 0.09 and explore its impact.
We see from Figure 5 that with a higher ϵ, the average number
of steps over 100 episodes is decreasing. This is because that
the learning actually explores more with a higher ϵ value,
prone to find a better solution to the problem.
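This sweep can be reproduced with a small script such as the sketch below, where run_maze_episodes is a hypothetical wrapper around the appendix code that accepts ϵ as an argument and returns the steps of each of the 100 episodes:

% Sketch of the epsilon sweep (hypothetical wrapper run_maze_episodes assumed).
epsilons  = 0.01:0.02:0.09;
avg_steps = zeros(size(epsilons));
for i = 1:numel(epsilons)
    steps        = run_maze_episodes(epsilons(i));  % returns a 1x100 vector of episode steps
    avg_steps(i) = mean(steps);
end
plot(epsilons, avg_steps, '-o'); xlabel('Epsilon'); ylabel('Average Steps');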
Fig. 5. Impact of different ϵ in the option learning stage (average steps over 100 episodes vs. ϵ).
VI. CONCLUSION
In the context of small discrete domains, acquired skill hierarchies have been shown to be beneficial, but for high-dimensional continuous domains there may be difficulties due to large state-action spaces. Abstraction selection offers a further advantage for skill acquisition in high-dimensional continuous domains, allowing an agent to exploit abstractions. In an environment where an agent may acquire many skills over its lifetime, this may represent a great potential efficiency improvement that, in conjunction with a good skill acquisition algorithm, could enable reinforcement learning agents to scale up to higher-dimensional domains. Additionally, abstraction selection opens up the possibility of abstraction transfer, where an agent that has learned a set of skills may benefit from the abstractions refined for each, even if it never uses those skills again.
In this work, we implemented abstraction selection with function approximation TD learning and with tabular-form TD learning. Results show that, under the abstraction framework, the tabular form outperforms function approximation for solving the multi-task maze problem.
REFERENCES
[1] G. Konidaris and A. Barto, "Efficient skill learning using abstraction selection," in Proceedings of the 21st International Joint Conference on Artificial Intelligence, 2009.
[2] R. Sutton, D. Precup, and S. Singh, "Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, vol. 112, pp. 181–211, 1999.
[3] L. Li, T. Walsh, and M. Littman, "Towards a unified theory of state abstraction for MDPs," in Proceedings of the Ninth International Symposium on Artificial Intelligence and Mathematics, 2006.
[4] G. Konidaris and S. Osentoski, "Value function approximation in reinforcement learning using the Fourier basis," University of Massachusetts, Amherst, Tech. Rep., 2008.
[5] G. Schwarz, "Estimating the dimension of a model," Annals of Statistics, vol. 6, no. 2, pp. 461–464, 1978.
[6] S. Singh, A. Barto, and N. Chentanez, "Intrinsically motivated reinforcement learning," in Proceedings of the 18th Annual Conference on Neural Information Processing Systems, 2004.
Appendix A
function [totalsteps] = simplified_abstraction(income)
% SIMPLIFIED_ABSTRACTION  Tabular Sarsa(lambda) with abstraction selection for
% the two-subgoal maze; returns the number of steps taken in each episode.
% clear all;
States = zeros(20,20)+65;
States(18,18) = 65;
destx =10;
desty =10;
subx = 1;
suby = 1;
up = 1;
left = 2;
down = 3;
right = 4;
gamma = 1; %1
lamda = 0.9;
alpha = 0.001;
epsilon = 0.01;
episode = 0;
record = 0;
a = randi(4,1);
Q = zeros(20,20,4,2);
policy = zeros(20,20,4,2)+0.25;
e = zeros(20,20,4,2);
% Accumulators for the incremental abstraction fit (Fig. 1), initialized here so
% their first update does not read undefined variables (mirrors the reset block
% later in the code). The weighting factor rho is not specified in the extracted
% code; a value close to 1 is assumed.
rho = 0.99;
A = zeros(4,4,2);
b = zeros(4,2);
z = zeros(4,2);
Rc = zeros(1,2);
Rz = zeros(1,2);
g = zeros(1,2);
while(episode<100)
episode = episode+1;
%     display(episode);
%     input('crap');
subfinish = 1;
m = 0;
while(1)
sx = randi(20,1);
sy = randi(20,1);
if States(sx,sy)==65
break;
end
end
if mod(episode,10) == 0
sx = 5;
sy = 5;
record = record+1;
%disp(episode);
%
%disp(record);
end
stepcount = 0;
deltax = zeros(2,1);
deltay = zeros(2,1);
nextdeltax = zeros(2,1);
nextdeltay = zeros(2,1);
while(sx~=destx||sy~=desty||subfinish~=2)
%     while(sx~=destx||sy~=desty)
stepcount = stepcount+1;
m = m+1;
deltax(2) = (sx-destx)+10;
deltay(2) = (sy-desty)+10;
deltax(1) = (sx-subx)+10;
deltay(1) = (sy-suby)+10;
% Basis features for the two abstractions: offsets to the subgoal (1) and to
% the final goal (2). The original deltaxsub/deltaysub variables are undefined
% in the extracted code; deltax(1)/deltay(1) are used here instead.
phi(:,1,a) = [1;cos(pi*deltax(1)*0.05);cos(pi*deltay(1)*0.05);cos(0.05*pi*(deltay(1)+deltax(1)))];
phi(:,2,a) = [1;cos(pi*deltax(2)*0.05);cos(pi*deltay(2)*0.05);cos(0.05*pi*(deltay(2)+deltax(2)))];
%define the next state
if a==up
nextsx=sx-1;
nextsy = sy;
if nextsx<1;
nextsx=1;
end
elseif a == left
nextsx = sx;
nextsy=sy-1;
if nextsy<1;
nextsy=1;
end
elseif a == down
nextsx=sx+1;
nextsy = sy;
if nextsx>10;
nextsx=10;
end
elseif a == right
nextsx = sx;
nextsy=sy+1;
if nextsy>10;
nextsy=10;
end
end
nextdeltax(2) = nextsx-destx+10;
nextdeltay(2) = nextsy-desty+10;
nextdeltax(1) = nextsx-subx+10;
nextdeltay(1) = nextsy-suby+10;
%find out the next reward: large reward for reaching the current option's
%goal (the subgoal, or the final goal once subfinish==2), -1 otherwise
if(subfinish == 2)
    if (nextsx == destx)&&(nextsy == desty)
        r = 10000;
    else
        r = -1;
    end
else
    if (nextsx == subx)&&(nextsy == suby)
        r = 10000;
    else
        r = -1;
    end
end
count = 0;
qmax = max(Q(nextdeltax(subfinish),nextdeltay(subfinish),:,subfinish));
for i = 1:4
    if qmax == Q(nextdeltax(subfinish),nextdeltay(subfinish),i,subfinish)
        count = count+1;
    end
end
probability = rand();
sumprob = 0;
for i = 1:4
    sumprob = sumprob+policy(nextdeltax(subfinish),nextdeltay(subfinish),i,subfinish);
    if (sumprob>probability)
        nexta = i;
        break;
    end
end
for i = 1:4
    if qmax == Q(nextdeltax(subfinish),nextdeltay(subfinish),i,subfinish)
        policy(nextdeltax(subfinish),nextdeltay(subfinish),i,subfinish) = (1-epsilon)/count;
    else
        policy(nextdeltax(subfinish),nextdeltay(subfinish),i,subfinish) = epsilon/(4-count);
    end
end
delta = r + gamma*Q(nextdeltax(subfinish),nextdeltay(subfinish),nexta,subfinish) - Q(deltax(subfinish),deltay(subfinish),a,subfinish);
e(deltax(subfinish),deltay(subfinish),a,subfinish) = e(deltax(subfinish),deltay(subfinish),a,subfinish)+1;
for i = 1:20
for j = 1:20
for k = 1:4
Q(i,j,k,subfinish)=Q(i,j,k,subfinish)+alpha*delta*e(i,j,k,subfinish);
e(i,j,k,subfinish) = gamma*lamda*e(i,j,k,subfinish);
end
end
end
%for abstraction updating
for i = 1:2
A(:,:,i) = rho*A(:,:,i)+phi(:,i,a)*phi(:,i,a)';
b(:,i)=rho*b(:,i)+rho*r*z(:,i)+r*phi(:,i,a);
z(:,i) = rho*z(:,i)+phi(:,i,a);
Rc(:,i)=rho*Rc(:,i)+g(:,i)*r*r+rho*r*Rz(:,i);
Rz(:,i) = rho*Rz(:,i)+2*g(:,i)*r;
g(:,i) = rho*g(:,i)+1;
end
%reached the first subgoal: fit each abstraction and select the one with the
%highest log likelihood (BIC), then initialize the next option from its fit
if(nextsx == subx)&&(nextsy == suby)&&subfinish==1
for i = 1:2
w(:,i) = A(:,:,i)\b(:,i);
error(:,i) = w(:,i)'*A(:,:,i)*w(:,i)-2*w(:,i)'*b(:,i)+Rc(:,i);
beta(:,i) = m/error(:,i);
likelihood(:,i) = -beta(:,i)*error(:,i)/2+m*0.5*log(beta(:,i))-0.5*2*log(m);
end
disp(m);
m = 0;
[likelihoodmax,position]=max(likelihood);
weight(:,subfinish)=w(:,position);
subfinish = 2;
for i=1:20
for j = 1:20
deltaxtemp = (i-destx);
deltaytemp = (j-desty);
phitemp = [1;cos(pi*deltaxtemp*0.05);cos(pi*deltaytemp*0.05);cos(0.05*pi*(deltaytemp+deltaxtemp))];
V(i,j) = weight(:,subfinish)'*phitemp;
end
end
for i = 1:20
for j = 1:20
Qtemp = zeros(4,1);
if(i~=destx||j~=desty)
for k = 1:4
if k==up
nexti = i-1;
nextj = j;
if nexti<1;
nexti=1;
end
if(nexti==destx&&nextj==desty)
Qtemp(k)=V(nexti,nextj)+10000;
else
Qtemp(k)=V(nexti,nextj)-1;
end
elseif k == left
nexti = i;
nextj = j-1;
if nextj<1;
nextj=1;
end
if(nexti==destx&&nextj==desty)
Qtemp(k)=V(nexti,nextj)+10000;
else
Qtemp(k)=V(nexti,nextj)-1;
end
elseif k == down
nexti=i+1;
nextj = j;
if nexti>20;
nexti=20;
end
if(nexti==destx&&nextj==desty)
Qtemp(k)=V(nexti,nextj)+10000;
else
Qtemp(k)=V(nexti,nextj)-1;
end
elseif k == right
nexti = i;
nextj = j+1;
if nextj>20;
nextj=20;
end
if(nexti==destx&&nextj==desty)
Qtemp(k)=V(nexti,nextj)+10000;
else
Qtemp(k)=V(nexti,nextj)-1;
end
end
end
[maxq,maxindex]=max(Qtemp);
for k = 1:4
if(k==maxindex)
Q(i,j,k)=(1-epsilon)*Qtemp(k);
else
Q(i,j,k)=epsilon*Qtemp(k)/3;
end
end
else
for k = 1:4
Q(i,j,k)=0;
end
end
end
end
A = zeros(4,4,2);
b = zeros(4,2);
z = zeros(4,2);
Rc = zeros(1,2);
Rz = zeros(1,2);
g = zeros(1,2);
weight = zeros(4,2);
w = zeros(4,2);
error = zeros(1,2);
beta = zeros(1,2);
likelihood = zeros(1,2);
%having problem here, what to take for the initial q value?
end
%reached the final goal after the subgoal: fit and select an abstraction for
%the goal option as well
if(nextsx == destx)&&(nextsy==desty)&&subfinish==2
for i = 1:2
w(:,i) = A(:,:,i)\b(:,i);
error(:,i) = w(:,i)'*A(:,:,i)*w(:,i)-2*w(:,i)'*b(:,i)+Rc(:,i);
beta(:,i) = m/error(:,i);
likelihood(:,i) = -beta(:,i)*error(:,i)/2+m*0.5*log(beta(:,i))-0.5*2*log(m);
end
disp(m);
m = 0;
[likelihoodmax,position]=max(likelihood);
weight(:,subfinish)=w(:,position);
subfinish = 1;
for i=1:20
for j = 1:20
deltaxtemp = (i-destx);
deltaytemp = (j-desty);
phitemp = [1;cos(pi*deltaxtemp*0.05);cos(pi*deltaytemp*0.05);cos(0.05*pi*(deltaytemp+deltaxtemp))];
V(i,j) = weight(:,subfinish)'*phitemp;
end
end
for i = 1:20
for j = 1:20
Qtemp = zeros(4,1);
for k = 1:4
if k==up
nexti = i-1;
nextj = j;
if nexti<1;
nexti=1;
end
Qtemp(k)=V(nexti,nextj)-1;
elseif k == left
nexti = i;
nextj = j-1;
if nextj<1;
nextj=1;
end
Qtemp(k)=V(nexti,nextj)-1;
elseif k == down
nexti=i+1;
nextj = j;
if nexti>20;
nexti=20;
end
Qtemp(k)=V(nexti,nextj)-1;
elseif k == right
nexti = i;
nextj = j+1;
if nextj>20;
nextj=20;
end
Qtemp(k)=V(nexti,nextj)-1;
end
end
[maxq,maxindex]=max(Qtemp);
for k = 1:4
if(k==maxindex)
Q(i,j,k)=(1-epsilon)*Qtemp(k);
else
Q(i,j,k)=epsilon*Qtemp(k)/3;
end
end
end
end
subfinish = 2;
end
sx = nextsx;
sy = nextsy;
a = nexta;
if subfinish == 1
if nextsx == subx && nextsy == suby
subfinish = 2;
disp(m);
m = 0;
a = randi(4,1);
end
else
if (nextsx == destx)&&(nextsy==desty)
disp(m);
m = 0;
end
end
end
totalsteps(episode) = stepcount;
end