ECE 517 Project 2
TD Learning with Eligibility Traces and Planning
Nicole Pennington & Alexander Saites
11/3/2011
Abstract
This paper details the implementation of a Sarsa(λ) reinforcement learning algorithm with planning to
determine an optimal policy for solving a maze. Our agent was placed at a random starting location
within the maze, which consisted of a 20x20 grid with a goal state and several obstacles, and was reset
to a new random location each time it reached the goal state. After determining appropriate parameter
values, we experimented with several TD learning techniques before settling on Sarsa(λ) with planning.
Our final algorithm consistently converges to a near-optimal maze solution in a relatively small number
of episodes.
I. Background
Temporal-difference (TD) learning is a combination of Monte Carlo and dynamic programming ideas.
Like Monte Carlo, TD learning can learn directly from experience, eliminating the need for a model. Like
dynamic programming, TD learning “bootstraps”, or updates estimates based in part on other learned
estimates without waiting for a final outcome. In this way, TD methods are implemented in an online,
fully incremental fashion. Two TD control algorithms are Sarsa and Q-learning. Sarsa is an on-policy TD
control algorithm which continually estimates the action-value function for the behavior policy while
adjusting the policy to be greedy with respect to the action-value function. Q-learning, an off-policy TD
control algorithm, directly approximates the optimal action-value function, independent of the policy
being followed. Almost any TD method can be combined with
eligibility traces to obtain a more efficient learning engine. Eligibility traces are like temporary records of
the occurrence of an event, such as taking an action. Sarsa(λ), the eligibility trace version of Sarsa, uses
traces to mark a state-action pair as eligible for undergoing learning changes. In this way, eligibility
traces are used for both prediction and control. Planning can be integrated into Sarsa(λ) by having the
agent build a model of its environment through experience, then using that model to produce simulated
experience. With planning, the agent can make fuller use of a limited amount of experience and thus
achieve a better policy with fewer environmental interactions.
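For reference, the standard tabular Sarsa(λ) update with accumulating traces, which matches the update used in our implementation (see the Appendix), can be written as

$$\delta \leftarrow r + \gamma\,Q(s', a') - Q(s, a)$$
$$e(s, a) \leftarrow e(s, a) + 1$$
$$Q(\tilde{s}, \tilde{a}) \leftarrow Q(\tilde{s}, \tilde{a}) + \alpha\,\delta\,e(\tilde{s}, \tilde{a}) \qquad \text{for all } \tilde{s}, \tilde{a}$$
$$e(\tilde{s}, \tilde{a}) \leftarrow \gamma\lambda\,e(\tilde{s}, \tilde{a}) \qquad \text{for all } \tilde{s}, \tilde{a}$$

where s' and a' are the next state and the action actually selected in it.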
II. Introduction
In this experiment, we constructed a 20x20 maze, which contained a starting state, fixed goal state,
and a reasonable number of obstacles. All states visited by the agent offer no reward except for the goal
state, which carries a reward of +1. The agent has 4 possible actions from each state in the maze: up,
down, left and right. Each action moves the agent into the state that corresponds to the direction of the
action. If the agent tries to take an illegal action, such as moving into an obstacle or into the border of
the maze, it simply remains in its current state. Our task was to train the agent to find the goal state
using TD learning with eligibility traces and planning to facilitate the learning process. The task was
treated as continuing because the agent was relocated to a new random starting state each time it
reached the goal state. To accomplish this, we experimented with Q-learning before choosing to use
Sarsa(λ) with planning, as described in the Design section.
III. Design
Objectives and Challenges
Our design objective was to create a simple, extendable program that made it easy to change both
the source code and the modeling of the environment. The primary challenge was finding a way to
meaningfully represent and display the environment and the agent's progress. By representing Q(s,a) as a
matrix and storing the positions the agent visited, we were able to implement Sarsa easily and display our
results. Our design also allowed us to quickly vary parameters, such as the number of obstacles in the
world and the number of episodes the agent performed. The result was readable code and a clear
graphical representation of the maze.
We chose to represent Q(s,a) as a 20x20x4 matrix. Each value in the matrix represents the
value of a particular direction (our action set) for a given x,y position (our state set). Similarly, our
eligibility trace is a 20x20x4 matrix indexed in the same way. This made several portions of our program
simple to write and edit. The actual Sarsa portion of the algorithm, for instance, is accomplished in a few
lines, as we can quickly extract the Q(s,a) values of the four actions along with the appropriate
eligibility trace value. Furthermore, the eligibility trace can be decayed by simply multiplying the whole
matrix by the appropriate factor. Finally, this representation allowed us to easily graph Q(s,a), showing
the values of the best actions throughout the world.
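As an illustration, the core Sarsa(λ) step of our implementation reduces to a few lines. The snippet below is condensed from the full listing in the Appendix, with the surrounding episode loop omitted:

% Condensed from the full listing in the Appendix. Q and the eligibility
% trace are both dim x dim x 4 matrices indexed by (x, y, action); the +.5
% offsets convert our cell-center coordinates to integer matrix indices.
Q  = zeros( dim, dim, 4 );
et = zeros( dim, dim, 4 );

% One Sarsa(lambda) backup, given the current transition
% (X,Y,action) -> (nextX,nextY), the reward, and the chosen next_action:
delta = reward + gamma*Q(nextX+.5,nextY+.5,next_action) - Q(X+.5,Y+.5,action);
et(X+.5,Y+.5,action) = et(X+.5,Y+.5,action) + 1;   % accumulate the trace
Q  = Q + alpha*delta*et;                           % update every eligible pair at once
et = gamma*lambda*et;                              % decay all traces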
Technical Approach
Our program works by randomly selecting the positions of a goal and several obstacles. It
initializes Q(s,a) to zeros. These values can be initialized randomly, but setting them all to zero allows
faster convergence to the optimal Q. After these values are set, it begins an episode by randomly
selecting a starting position and clearing the matrices used to record traveled positions and actions. It
then begins searching for the goal. At each step, it chooses an action using an epsilon-greedy rule. If the
resulting position is invalid, the agent simply stays where it is; if the new position is the goal, it receives a
reward of +1. The eligibility trace for the current state-action pair is incremented, Q is updated using the
Sarsa rule, and the entire eligibility trace is then decayed. If planning is enabled, the agent next enters a
planning loop in which it makes predictions based upon what it has learned about the environment. To
do this, the agent constructs a model of the environment as it moves through the maze and uses that
model to simulate experience: it updates state-action values by taking simulated greedy actions from a
set of N randomly chosen, previously observed state-action pairs. At the end of each step, the state is
updated; when the new state is the goal position, epsilon is decayed and a new episode begins. This
continues for the specified number of episodes. This process is modeled in the following flowchart:
We chose an epsilon-greedy algorithm to select our next action. This provides good on-line
performance and, since epsilon is decayed over time, converges to the optimal solution. Additionally, we
chose a Sarsa control scheme since it generally allows greater on-line performance and is easier to
implement than Q-learning with eligibility traces.
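The planning stage is a Dyna-style loop over a model built from real experience. The sketch below is condensed from the Appendix listing; the variables pX, pY, and pA are introduced here only for readability (the full listing reuses X, Y, and action directly), and xmat, ymat, and amat record the states and actions visited so far in the episode:

% Record the real transition in the model, then perform N simulated
% one-step backups from randomly chosen previously visited state-action pairs.
model(X+.5,Y+.5,action,1) = nextX;
model(X+.5,Y+.5,action,2) = nextY;
model(X+.5,Y+.5,action,3) = reward;
for k = 1:N
    prev = randi( length(xmat) );            % pick a random previous step
    pX = xmat(prev);  pY = ymat(prev);  pA = amat(prev);
    simX = model(pX+.5,pY+.5,pA,1);          % predicted next state and reward
    simY = model(pX+.5,pY+.5,pA,2);
    simR = model(pX+.5,pY+.5,pA,3);
    [~, simA] = max( Q(simX+.5,simY+.5,:) ); % greedy action in the predicted state
    Q(pX+.5,pY+.5,pA) = Q(pX+.5,pY+.5,pA) + ...
        alpha*( simR + gamma*Q(simX+.5,simY+.5,simA) - Q(pX+.5,pY+.5,pA) );
end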
The manner in which the program chooses to place obstacles and the starting position is very
simplistic. It randomly chooses a position and checks it against obstacles already placed in the world. If
an obstacle exists in that position, it randomly chooses a new position. This continues until it finds a
valid position. Although in theory this could continue for a very long time if the chosen positions were
consistently already occupied, in practice it is quick and effective. Additionally, our program makes it
easy to change the number of obstacles in the world. If that number of obstacles is greater than the
length of a dimension, there is some probability that no valid path from the starting position to the goal
will exist. This probability increases with the number of obstacles placed in the world.
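For concreteness, the placement step is simple rejection sampling, taken from the Appendix listing (Check_obstacle returns true when the candidate cell is already occupied):

% Draw candidate cell centers until one does not collide with an existing obstacle.
newObX = randi( dim ) - .5;
newObY = randi( dim ) - .5;
while Check_obstacle( newObX, newObY, obstaclesX, obstaclesY )
    newObX = randi( dim ) - .5;
    newObY = randi( dim ) - .5;
end
obstaclesX(i) = newObX;
obstaclesY(i) = newObY;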
Parameters
A. Epsilon
Epsilon represents the probability of selecting a non-greedy action from any state. We initially set
epsilon to 0.75. As a result, the epsilon-greedy algorithm will begin by choosing one of the four
directions with equal probability. Epsilon is then slowly decayed (using the factor mu) with each episode
to allow convergence to the optimal state-action values. If epsilon is initially set to a lower value, the
agent will not explore as much and will instead often take the default greedy action (in this case 'up',
since ties are broken toward the first action). As a result, it will take longer in initial episodes to find the
goal, as little learning has taken place. An epsilon value higher than 0.75 adds nothing, since with our
selection rule 0.75 already makes all four actions equally likely at the start, so exploring uniformly at
first is the best option.
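The selection rule, taken from the Appendix listing, is shown below; note that the non-greedy branch deliberately excludes the greedy action, which is why epsilon = 0.75 makes all four actions equally likely at the start:

% Epsilon-greedy action selection at cell (X, Y).
[val, ind] = max( Q(X+.5, Y+.5, :) );  % index of the greedy action
if rand() > epsilon
    action = ind;                       % exploit
else
    action = randi(4);                  % explore with a non-greedy action
    while action == ind
        action = randi(4);
    end
end
% ... and once per episode:
epsilon = epsilon*mu;                   % mu is the decay factor (e.g. 0.99)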
B. Gamma
Gamma indicates how much to discount the value of the next state when calculating the new value
of the current state. A higher gamma causes the value of the next state to have a greater influence on
the value of the current state. If the next state’s value is correct, it is desirable for it to have a greater
impact on the current state’s value. Because we chose to initialize the state-action values to zero and
the only positive reward is received at the goal state, we decided that a higher gamma would be
appropriate. Setting gamma too low would slow convergence because it would hinder the propagation
of rewards from the goal state; states too far from the goal state would be worth almost nothing.
Setting gamma too high also slows down the speed of convergence because it could easily result in
learning the wrong action values. We experimentally determined that the correct balance is a value of
approximately 0.85.
C. Alpha
Alpha discounts the influence of the delta term when updating the value of the current state-action
pair. As alpha is increased, a greater portion of the value update is relayed to the value of
the current state. While a lower alpha limits the update of each state-action value, it can help control
the negative influence of an incorrect update. Because a high alpha can cause state-action values to
oscillate, we chose a lower alpha of 0.1. This converged well and prevented incorrect updates from
overwhelming good state-action pair values.
D. Lambda
Lambda is the rate of decay (in conjunction with gamma) of the eligibility trace. This is the amount
by which the eligibility of a state is reduced on each time step in which it is not visited. A low lambda
causes less of the reward to propagate back to states farther from the goal. While this can prevent the
reinforcement of a path which is not optimal, it causes states far from the goal to receive very
little reward. This slows down convergence, because the agent spends more time searching for a path if
it starts far from the goal. Conversely, a high lambda allows more of the path to be updated with higher
rewards. This suited our implementation, because our high initial epsilon was able to correct any state
values which might have been incorrectly reinforced and create a more defined path to the goal in fewer
episodes. Because of this, we chose a final lambda of 0.9.
Experiments and Results
Q-learning
We initially experimented with Q-learning without eligibility traces or planning. This simplified the
implementation and allowed us to establish reasonable values for our parameters. Although this ran
episodes quickly and produced reasonable results, it took several thousand episodes to generate the
same level of results as our final method. An example output is shown below. As we will explain, our
Sarsa algorithm with eligibility traces and planning is able to produce better results in fewer than 26
episodes.
Figure 1: A maze result from Q-learning
This image represents the environment and the agent movements. The boxes are obstacles, the star is
the goal position, and the diamond is the start position. Black circles show places the agent has been.
Arrows indicate the maximum-valued action for each state: the arrow's direction shows which action it is
and its length shows the magnitude of its value.
Q-learning Q(s,a) after 2000 episodes
Figure 2: Q-learning takes many episodes to generate this result
Sarsa with eligibility trace and planning
The following series of graphs shows an example experiment run with our Sarsa algorithm, which uses an
eligibility trace and planning. We ran the experiment for 500 episodes with 19 obstacles and plotted
every 25 runs. A select few of those runs are shown below. For this experiment, epsilon started at 0.75
with a decay rate of 0.99. Gamma and lambda were set to 0.9 and alpha was set to 0.1. Each episode
used 40 planning iterations.
Episode 1
Figure 3: When just starting, the agent walks about randomly searching for the goal
When the agent begins the first episode, no learning has taken place, so Q(s,a) is zero for all state-action
pairs. Since epsilon is 0.75, it performs the first episode by walking randomly until it finds the goal. At
this point, Q(s,a) is updated using the eligibility trace.
Q(s,a) for Episode 1
Figure 4: The eligibility trace allows updating of several actions that led to the goal
Q(s,a) for Episode 1
Figure 5: This side view shows the curvature of Q(s,a)
These charts show the sum of Q(s,a) over all actions. As expected, the eligibility trace causes an update to
the state-action value of several actions that led to the goal.
After 26 episodes, the agent already shows improvement:
Episode 26
Figure 6: The agent finds the goal faster
Q(s,a) for Episode 26
Figure 7: The eligibility trace and planning allow quick improvement
Q(s,a) for Episode 26
Figure 8: A side view of Q(s,a)
Note that dark spots indicate a value of zero, so obstacles show up as dark spots. The goal position also
shows up as a dark spot because the agent never takes an action while in the goal position (a new
episode begins).
In fewer than 30 episodes, the agent is able to find optimal solutions from reasonable distances.
Moreover, Q(s,a) is already near optimal, as most of the arrow heads are pointing toward the goal.
Episode 176
Figure 9: The agent finds an optimal solution
Q(s,a) for Episode 176
Figure 10: Q(s,a) is near optimal
Q(s,a) for Episode 176
Figure 11: This side view shows the mountain of reward expectations leading up to the goal
The following figures show the results after 500 episodes:
Figure 12: The arrows show a near-optimal solution
Q(s,a) for Episode 500
Figure 13: The final Q(s,a)
Q(s,a) for Episode 500
Experimenting with epsilon and planning
We ran several experiments with different values for epsilon both with and without planning. We
recorded the number of steps taken before the agent reached the goal state. We compared this with
the number of spaces from the start to the goal. Recording the error in this way is fast, but not
completely accurate. If a direct path to the goal does not exist (i.e. an obstacle lies directly next to the
goal or start position and between the two), then the optimal solution is higher than this value. As a
result, an optimal solution will have a recorded error near 1.
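In code, the recorded error is simply the number of steps taken in the episode minus the Manhattan distance from the start to the goal (from the Appendix listing):

% Manhattan distance from every cell center to the goal, computed once per world.
optSol = abs(gridX - goalX) + abs(gridY - goalY);
% ... when the agent reaches the goal in episode i:
error(i) = steps - optSol(startX+.5, startY+.5);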
Using Sarsa without planning, the average error for the final 100 episodes after running 500 episodes
reaches 2.25. Epsilon was decayed by .99 after each episode. With planning, we reduce this error to 1.05
after 500 episodes. Since the error calculation assumes a direct path always exists, it is reasonable to
believe this solution is in fact optimal. The images below show the errors and state-action value
functions for these cases.
Last 100 Errors - No Planning - epsilon decay = .99
Figure 14: Errors are still high without planning
Last 100 Errors - Planning - epsilon decay = .99
Figure 15: Error rates are much smaller with planning
Q(s,a) - No Planning - epsilon decay = .99
Figure 16: The state-action value function without planning converges slowly
Q(s,a) - Planning - epsilon decay = .99
Figure 17: The state-action value function with planning shows much more learning
Unsurprisingly, Q(s,a) converges more slowly with a decay factor closer to 1 (i.e., slower epsilon decay),
as the agent explores too much in later episodes when Q(s,a) is already quite accurate. The average error over all 500
episodes without planning with a decay rate of .999 is 135.02. When we reduce the decay rate to .99,
this error falls to just 26.52 over all 500 episodes. With planning turned on and the decay rate set to
.999, the error is 134.07. With a decay rate of .99, this falls to 23.72. The errors and state-action value
functions are shown in the figures below.
Errors - No Planning - epsilon decay = .999
Figure 18: The total error rate declines more slowly than with .99
Errors - Planning - epsilon decay = .999
Figure 19: The total error rate declines more slowly than with .99
Q(s,a) - No Planning - epsilon decay = .999
Figure 20: With more exploration, Q(s,a) is updated only around the goal
Q(s,a) - Planning - epsilon decay = .999
Figure 21: With a lot of exploration and planning, Q(s,a) is updated much more. Note: dark spots show obstacles and goal
In the other direction, when the decay rate is too small, the agent does not explore enough. With
planning turned off and the decay rate set to .9, the agent performs so poorly that it does not finish in a
reasonable time. With planning turned on, the average error rises to 34.22 over all episodes (compared
with 23.72 at a decay rate of .99). While this is better than the .999 case, the error for the last 100
episodes is still 6.90, showing that strong initial exploration
is critical. These error rates are summarized in the following table.
Planning?   Epsilon decay rate   Average error for all episodes   Average error for last 100 episodes
No          .9                   N/A                              N/A
No          .99                  26.52                            2.25
No          .999                 135.02                           28.68
Yes         .9                   34.22                            6.90
Yes         .99                  23.72                            1.05
Yes         .999                 134.07                           22.24
Table 1: Comparison of error rates for various epsilon decay values with and without planning
These results also show clearly that planning reduces the number of episodes necessary to generate
optimal solutions. Experience is cheap in our simulated environment, where generating an episode takes
only a fraction of a second, but a real robot attempting to find a real goal position in a real environment
is constrained by the time and cost of physical interaction. In such a scenario, the planning stage becomes
critical, as it allows significantly faster improvement with far fewer episodes.
IV. Summary
This project provided experience with the implementation of temporal difference learning methods,
specifically Q-learning and Sarsa(λ), as well as eligibility traces and planning. Our first attempt at a
solution was a simple, one-step Q-learning algorithm which performed well and converged to an
(epsilon-) optimal solution in approximately 2000 episodes. After determining good values for the alpha,
epsilon, and gamma parameters from this model, we switched to Sarsa(λ) to implement eligibility traces.
This algorithm converged in about the same number of episodes as the Q-learning method but ran more
slowly per episode because of the extra storage and computation required. After experimentally determining a
good value for lambda, we added planning to the algorithm. Planning allowed the agent to make fuller
use of a limited amount of experience and thus achieve a better policy with fewer environmental
interactions. While still not as fast as the one-step Q-learning, our Sarsa(λ) with planning algorithm
consistently converged to a near-optimal maze solution within 50 episodes. With this implementation,
we successfully achieved an efficient TD algorithm with eligibility traces and planning which can
meaningfully display our environment and agent's progress.
V. Appendix
MATLAB Code
Sarsa
dim = 20;
num_obstacles = 19;
num_episodes = 500;
plot_freq = 25; % every $plot_freq images are plotted
save_maze = 1; % 0 = false, 1 = true
img_dir = 'images'; % image directory; where to save images
planning = 1; %0 = off, 1 = on
N = 40; %planning steps
%initialize parameters
epsilon = .75;
gamma =.9;
alpha = .1;
lambda = .9;
mu = .99;
muval = 99; %used for outputting if save_maze = 1
%initialize goal position
goalX = randi( dim ) - .5;
goalY = randi( dim ) - .5;
%goalX = 13.5;
%goalY = 12.5;
%initialize obstacles to zeros
obstaclesX = zeros( 1, num_obstacles );
obstaclesY = zeros( 1, num_obstacles );
%add goal to obstacles so randomly generated obstacles aren't in the goal
obstaclesX(1) = goalX;
obstaclesY(1) = goalY;
%set up the world to make plotting easier later
gridX = repmat( transpose(.5:1:(dim-.5)), 1, dim );
gridY = transpose( gridX );
u = zeros( dim, dim );
v = zeros( dim, dim );
%randomly generate obstacles
for i=2:num_obstacles
newObX = randi( dim ) - .5;
newObY = randi( dim ) - .5;
while Check_obstacle( newObX, newObY, obstaclesX, obstaclesY )
newObX = randi( dim ) - .5;
newObY = randi( dim ) - .5;
end
obstaclesX(i) = newObX;
obstaclesY(i) = newObY;
end
%remove goal from obstacles
obstaclesX = obstaclesX(2:end);
obstaclesY = obstaclesY(2:end);
%initialize Q(s,a) arbitrarily
Q = zeros( dim, dim, 4 );
Q( (obstaclesX+.5), (obstaclesY+.5), : ) = 0;
%eligibility trace and planning model
et = zeros( dim, dim, 4 );
if planning
model = zeros(dim, dim, 4, 3);
end
%get optimal solution for each point
%this may be slightly off, depending on obstacle positions
optSol = abs(gridX - goalX) + abs(gridY - goalY);
error = zeros( 1, num_episodes );
for i=1:num_episodes
%begin an episode
if(i == num_episodes)
%Remove epsilon-greedy
epsilon = 0;
end
%initialize start state -- don't run into obstacles and be a bit from
%the goal
X = randi(dim) - .5;
Y = randi(dim) - .5;
while (abs(X-goalX) < 2 ) || Check_obstacle(X,Y,obstaclesX,obstaclesY) || ...
        (abs(Y-goalY) < 2 )
X = randi(dim) - .5;
Y = randi(dim) - .5;
end
startX = X;
startY = Y;
%initialize the action
if (i == 1)
action = randi(4);
else
[val,action] = max(Q(X+.5,Y+.5,:));
end
%these matricies will hold the x,y positions traveled
xmat = 0;
ymat = 0;
steps = 0;
amat = 0;
%repeat for each step
while( 1 )
%save the number of steps it has taken
steps = steps + 1;
sprintf( '%u %u %u\n', steps, X, Y );
%save the x and y positions and corresponding action
xmat( steps ) = X;
ymat( steps ) = Y;
amat( steps ) = action;
%take action, observe r,s'
nextX = X;
nextY = Y;
switch action
    case 1
        nextY = Y + 1; %up
        if Check_obstacle( X, nextY, obstaclesX, obstaclesY )
            nextY = Y;
        end
    case 2
        nextX = X + 1; %right
        if Check_obstacle( nextX, Y, obstaclesX, obstaclesY )
            nextX = X;
        end
    case 3
        nextX = X - 1; %left
        if Check_obstacle( nextX, Y, obstaclesX, obstaclesY )
            nextX = X;
        end
    case 4
        nextY = Y - 1; %down
        if Check_obstacle( X, nextY, obstaclesX, obstaclesY )
            nextY = Y;
        end
end
%go back if it knocks you off the map
if nextX > dim || nextX < 0
    nextX = X;
end
if nextY > dim || nextY < 0
    nextY = Y;
end
%only reward for hitting goal
if nextX == goalX && nextY == goalY
reward = 1;
else
reward = 0;
end
%choose next action based on Q using epsilon-greedy
rannum = rand();
[val,ind] = max(Q(nextX+.5,nextY+.5,:));
if rannum > epsilon
%take greedy
next_action = ind;
else
%take non-greedy
next_action = randi(4);
while next_action == ind
next_action = randi(4);
end
end
delta = reward + gamma*Q(nextX+.5,nextY+.5,next_action) - Q(X+.5,Y+.5,action);
et(X+.5, Y+.5, action ) = et(X+.5, Y+.5, action ) + 1;
Q = Q + alpha*delta*et;
et = gamma*lambda*et;
if planning
model(X+.5,Y+.5,action,1) = nextX;
model(X+.5,Y+.5,action,2) = nextY;
model(X+.5,Y+.5,action,3) = reward;
for k=1:N
prev = randi(length(xmat)); %choose a random previous state
X = xmat(prev);
Y = ymat(prev);
action = amat(prev);
simX = model(X+.5,Y+.5,action,1);
simY = model(X+.5,Y+.5,action,2);
simR = model(X+.5,Y+.5,action,3);
[val,simA] = max(Q(simX+.5,simY+.5,:));
Q(X+.5,Y+.5,action) = Q(X+.5,Y+.5,action) + alpha*(simR + ...
    gamma*Q(simX+.5,simY+.5,simA) - Q(X+.5,Y+.5,action));
end
end
X = nextX;
Y = nextY;
action = next_action;
if X == goalX && Y == goalY
error(i) = steps - optSol(startX+.5,startY+.5);
break;
end
end
%decay epsilon with time
epsilon = epsilon*mu;
%print out maze
if( ~mod( (i-1), plot_freq ) )
if( save_maze )
close all hidden;
map = figure( 'Visible', 'off' );
else
figure;
end
%set up grid to plot Q(s,a)
for a=1:dim
for b=1:dim
[val,ind] = max( Q(a,b,:) );
switch ind
case 1
u(a,b) = 0;
v(a,b) = val;
case 2
u(a,b) = val;
v(a,b) = 0;
case 3
u(a,b) = -val;
v(a,b) = 0;
case 4
u(a,b) = 0;
v(a,b) = -val;
end
end
end
%overlay learning upon the world
quiver( gridX, gridY, u, v, 0 );
hold on;
%plot where we've been, the goal, start, and obstacles
plot( xmat, ymat, 'ko' ); %black circles
plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
grid on;
axis( [0 dim 0 dim] );
set( gca, 'YTick', 0:1:dim );
set( gca, 'XTick', 0:1:dim );
set( gca, 'GridLineStyle', '-' );
title( sprintf( 'Episode %d', i ) );
%output the images
if( save_maze )
filename = sprintf( '%s/image_%05d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( map, '-dppm', '-r200', filename );
fprintf( 'done.\n' );
close( map );
else
drawnow;
end
%output the learning
Qsum = sum( Q, 3 );
if( save_maze )
img = figure( 'Visible', 'off' );
surf( Qsum );
title( sprintf( 'Q(s,a) for Episode %d', i ) );
xlabel( 'X-Position' );
ylabel( 'Y-Position' );
colorbar;
axis( [1 dim 1 dim 0 3.5 ] );
filename = sprintf( '%s/Qsum_%05d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( img, '-dppm', '-r200', filename );
fprintf( 'done\n' );
hold on;
axis( [0 dim 0 dim 0 3.5 ] );
set( gca, 'CameraTargetMode', 'manual' )
set( gca, 'CameraPosition', [10 10 -28 ] );
set( gca, 'CameraUpVector', [1 0 0] );
axis( [1 dim 1 dim 0 3.5 ] );
set(gca, 'XAxisLocation', 'top')
set(get(gca,'XLabel'), 'Position', [-1 12 0]);
set(get(gca,'YLabel'), 'Position', [10 0 0]);
set(get(gca,'Title'), 'Position', [20.2 10 0]);
filename = sprintf( '%s/Qsum-bottom_%05d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( img, '-dppm', '-r200', filename );
fprintf( 'done\n' );
else
figure;
surf( Qsum );
end
end
end
%print the final image
if( save_maze )
close all hidden;
map = figure( 'Visible', 'off' );
else
figure;
end
%set up grid to plot Q(s,a)
for a=1:dim
for b=1:dim
[val,ind] = max( Q(a,b,:) );
switch ind
case 1
u(a,b) = 0;
v(a,b) = val;
case 2
u(a,b) = val;
v(a,b) = 0;
case 3
u(a,b) = -val;
v(a,b) = 0;
case 4
u(a,b) = 0;
v(a,b) = -val;
end
end
end
%output Q(s,a) using quiver
quiver( gridX, gridY, u, v, 0 );
hold on;
%plot where we've been, the goal, start, and obstacles
plot( xmat, ymat, 'ko' ); %black circles
plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
grid on;
axis( [0 dim 0 dim] );
set( gca, 'YTick', 0:1:dim );
set( gca, 'XTick', 0:1:dim );
set( gca, 'GridLineStyle', '-' );
if( save_maze )
filename = sprintf( '%s/image_%05d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( map, '-dppm', '-r200', filename );
fprintf( 'done\n' );
close( map );
else
drawnow;
end
Qsum = sum( Q, 3 );
if( save_maze )
img = figure( 'Visible', 'off' );
surf( Qsum );
title( sprintf( 'Q(s,a) for Episode %d', i ) );
xlabel( 'X-Position' );
ylabel( 'Y-Position' );
colorbar;
axis( [1 dim 1 dim 0 3.5 ] );
filename = sprintf( '%s/Qsum_%05d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( img, '-dppm', '-r200', filename );
fprintf( 'done\n' );
hold on;
axis( [0 dim 0 dim 0 3.5 ] );
set( gca, 'CameraTargetMode', 'manual' )
set( gca, 'CameraPosition', [10 10 32 ] );
axis( [1 dim 1 dim 0 3.5 ] );
set(gca, 'XAxisLocation', 'top')
set(get(gca,'XLabel'), 'Position', [-1 12 0]);
set(get(gca,'YLabel'), 'Position', [10 0 0]);
set(get(gca,'Title'), 'Position', [20.2 10 0]);
filename = sprintf( '%s/Qsum-bottom_%05d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( img, '-dppm', '-r200', filename );
fprintf( 'done\n' );
else
img = figure;
surf( Qsum );
end
%display average error
sum(error)/size(error,2)
last100 = error(size(error,2)-100:end);
sum(last100)/size(last100,2)
%---save error figure
if planning == 1
plan = 'planning';
else
plan = 'no';
end
img = figure('Visible','off');
plot( error );
title( sprintf('Errors over %d episodes', num_episodes) );
xlabel('Episode');
ylabel('Error');
if( save_maze )
filename = sprintf( '%s/error-%s-%dep-mu%d.fig', img_dir, plan, num_episodes, muval );
fprintf( 'saving %s...', filename );
saveas( img, filename, 'fig' );
fprintf( 'done\n' );
close all hidden;
else
img = figure('Visible','on');
drawnow;
end
Q-learning
dim = 20;
num_obstacles = 19;
num_episodes = 2000;
plot_freq = 200; % every $plot_freq images are plotted
save_maze = 0; % 0 = false, 1 = true
img_dir = 'images'; % image directory; where to save images
%initialize parameters
epsilon = .75;
gamma = .75;
alpha = .1;
lambda = .9;
mu = .999;
%initialize goal position
goalX = randi( dim ) - .5;
goalY = randi( dim ) - .5;
%goalX = 13.5
%goalY = 12.5
%initialize obstacles to zeros
obstaclesX = zeros( 1, num_obstacles );
obstaclesY = zeros( 1, num_obstacles );
%add goal to obstacles so randomly generated obstacles aren't in the goal
obstaclesX(1) = goalX;
obstaclesY(1) = goalY;
%randomly generate obstacles
for i=2:num_obstacles
newObX = randi( dim ) - .5;
newObY = randi( dim ) - .5;
while Check_obstacle( newObX, newObY, obstaclesX, obstaclesY )
newObX = randi( dim ) - .5;
newObY = randi( dim ) - .5;
end
obstaclesX(i) = newObX;
obstaclesY(i) = newObY;
end
%remove goal from obstacles
obstaclesX = obstaclesX(2:end);
obstaclesY = obstaclesY(2:end);
%initialize Q(s,a) arbitrarily
%Q = rand( [dim, dim, 4] ) * .25;
Q = zeros( dim, dim, 4 );
Q( (obstaclesX+.5), (obstaclesY+.5), : ) = 0;
%eligibility trace
%et = zeros( dim, dim, 4 );
for i=1:num_episodes
%begin an episode
%initialize start state -- don't run into obstacles and be a bit from
%the goal
X = randi(dim) - .5;
Y = randi(dim) - .5;
while (abs(X-goalX) < 2 ) || Check_obstacle(X,Y,obstaclesX,obstaclesY) || ...
        (abs(Y-goalY) < 2 )
X = randi(dim) - .5;
Y = randi(dim) - .5;
end
startX = X;
startY = Y;
%these matricies will hold the x,y positions traveled
xmat = 0;
ymat = 0;
steps = 0;
%repeat for each step
while( 1 )
%save the number of steps it has taken
steps = steps + 1;
%save the x and y positions
xmat( steps ) = X;
ymat( steps ) = Y;
%choose action based on Q using epsilon-greedy
rannum = rand();
[val,ind] = max(Q(X+.5,Y+.5,:));
if rannum > epsilon
%take greedy
action = ind;
else
%take non-greedy
action = randi(4);
while action == ind
action = randi(4);
end
end
%take action a, observe r,s'
newX = X;
newY = Y;
switch action
    case 1
        newY = Y + 1; %up
        if Check_obstacle( X, newY, obstaclesX, obstaclesY )
            newY = Y;
        end
    case 2
        newX = X + 1; %right
        if Check_obstacle( newX, Y, obstaclesX, obstaclesY )
            newX = X;
        end
    case 3
        newX = X - 1; %left
        if Check_obstacle( newX, Y, obstaclesX, obstaclesY )
            newX = X;
        end
    case 4
        newY = Y - 1; %down
        if Check_obstacle( X, newY, obstaclesX, obstaclesY )
            newY = Y;
        end
end
%go back if it knocks you off the map
if newX > dim || newX < 0
newX = X;
end
if newY > dim || newY < 0
newY = Y;
end
%only reward for hitting goal
if newX == goalX && newY == goalY
reward = 1;
else
reward = 0;
end
%Q-learning
[val,next_act] = max(Q(newX+.5,newY+.5,:));
%et( X+.5, Y+.5, action ) = et( X+.5, Y+.5, action ) + 1;
%Q(X+.5,Y+.5,action) = Q(X+.5,Y+.5,action) + et(X+.5,Y+.5,action)*alpha*( reward + (gamma*val) - Q(X+.5,Y+.5,action) );
Q(X+.5,Y+.5,action) = Q(X+.5,Y+.5,action) + alpha*( reward + ...
    (gamma*val) - Q(X+.5,Y+.5,action) );
%decay eligibility trace
%et = gamma*lambda*et;
%update the state
X = newX;
Y = newY;
if X == goalX && Y == goalY
break;
end
end
%decay epsilon with time
epsilon = epsilon*mu;
%print out maze
if( ~mod( (i-1), plot_freq ) )
if( save_maze )
map = figure( 'Visible', 'off' );
else
figure;
end
%set up grid to plot Q(s,a)
gridX = repmat( transpose(.5:1:(dim-.5)), 1, dim );
gridY = transpose( gridX );
u = zeros( dim, dim );
v = zeros( dim, dim );
for a=1:dim
for b=1:dim
[val,ind] = max( Q(a,b,:) );
switch ind
case 1
u(a,b) = 0;
v(a,b) = val;
case 2
u(a,b) = val;
v(a,b) = 0;
case 3
u(a,b) = -val;
v(a,b) = 0;
case 4
u(a,b) = 0;
v(a,b) = -val;
end
end
end
quiver( gridX, gridY, u, v, 0 ); %that's right, quiver
hold on;
%plot where we've been, the goal, start, and obstacles
plot( xmat, ymat, 'ko' ); %black circles
plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
grid on;
axis( [0 dim 0 dim] );
set( gca, 'YTick', 0:1:dim );
set( gca, 'XTick', 0:1:dim );
set( gca, 'GridLineStyle', '-' );
if( save_maze )
filename = sprintf( '%s/image_%d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( map, '-dppm', '-r200', filename );
fprintf( 'done.\n' );
close( map );
else
drawnow;
end
end
end
sum( steps ) / size( steps, 1 )
%print the final image
if( save_maze )
map = figure( 'Visible', 'off' );
else
figure;
end
%set up grid to plot Q(s,a)
gridX = repmat( transpose(.5:1:(dim-.5)), 1, dim );
gridY = transpose( gridX );
u = zeros( dim, dim );
v = zeros( dim, dim );
for a=1:dim
for b=1:dim
[val,ind] = max( Q(a,b,:) );
switch ind
case 1
u(a,b) = 0;
v(a,b) = val;
case 2
u(a,b) = val;
v(a,b) = 0;
case 3
u(a,b) = -val;
v(a,b) = 0;
case 4
u(a,b) = 0;
v(a,b) = -val;
end
end
end
%output Q(s,a) using quiver
quiver( gridX, gridY, u, v, 0 );
hold on;
%plot where we've been, the goal, start, and obstacles
plot( xmat, ymat, 'ko' ); %black circles
plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
grid on;
axis( [0 dim 0 dim] );
set( gca, 'YTick', 0:1:dim );
set( gca, 'XTick', 0:1:dim );
set( gca, 'GridLineStyle', '-' );
if( save_maze )
filename = sprintf( '%s/image_%d.ppm', img_dir, i );
fprintf( 'saving %s...', filename );
print( map, '-dppm', '-r200', filename );
fprintf( 'done\n' );
close( map );
else
drawnow;
end
Qsum = sum( Q, 3 );
if( save_maze )
img = figure( 'Visible', 'off' );
surf( Qsum );
filename = sprintf( '%s/Qsum.ppm', img_dir );
fprintf( 'saving %s...', filename );
print( img, '-dppm', '-r200', filename );
fprintf( 'done\n' );
else
figure;
surf( Qsum );
end
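Both listings call a helper function, Check_obstacle, that is not reproduced above. The original file is not included here; a minimal sketch consistent with how the helper is called (returning true when the candidate point coincides with any listed obstacle) might look like the following reconstruction:

function hit = Check_obstacle( x, y, obstaclesX, obstaclesY )
%CHECK_OBSTACLE Reconstructed sketch (not the original project file):
%returns true if the point (x, y) matches any listed obstacle position.
hit = any( obstaclesX == x & obstaclesY == y );
end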
Video Link
We created a few videos during our work to enable better visualization. Although changes were made to
our implementation details after these videos were created, they are still somewhat interesting. They
may be viewed on YouTube at the following addresses:
http://www.youtube.com/watch?v=JSxAXOI1Hmc -- Maze learning with SARSA
http://www.youtube.com/watch?v=8Fwv5YI7LU4 -- SARSA Q(s,a) Value Improvement – Top View
http://www.youtube.com/watch?v=jaa9zvCH4Jk -- SARSA Q(s,a) Value Improvement – Side View