ECE 517 Project 2
TD Learning with Eligibility Traces and Planning
Nicole Pennington & Alexander Saites
11/3/2011

Abstract
This paper details the implementation of a Sarsa(λ) reinforcement learning algorithm with planning to determine an optimal policy for solving a maze. Our agent was placed at a random starting location within the maze, which consisted of a 20x20 grid with a goal state and several obstacles, and was reset to a new random location each time it reached the goal state. After determining appropriate parameter values, we experimented with several TD learning techniques before settling on Sarsa(λ) with planning. Our final algorithm consistently converges to a near-optimal maze solution in a relatively small number of episodes.

I. Background
Temporal-difference (TD) learning combines ideas from Monte Carlo methods and dynamic programming. Like Monte Carlo methods, TD learning can learn directly from experience, eliminating the need for a model. Like dynamic programming, TD learning "bootstraps": it updates estimates based in part on other learned estimates, without waiting for a final outcome. As a result, TD methods can be implemented in an online, fully incremental fashion.

Two TD control algorithms are Sarsa and Q-learning. Sarsa is an on-policy TD control algorithm that continually estimates the action-value function for the behavior policy while adjusting that policy to be greedy with respect to the action-value function. Q-learning, an off-policy TD control algorithm, directly approximates the optimal action-value function, independent of the policy being followed.

Almost any TD method can be combined with eligibility traces to obtain a more efficient learning engine. An eligibility trace is a temporary record of the occurrence of an event, such as taking an action. Sarsa(λ), the eligibility-trace version of Sarsa, uses traces to mark a state-action pair as eligible for undergoing learning changes. In this way, eligibility traces serve both prediction and control. Planning can be integrated into Sarsa(λ) by having the agent build a model of its environment through experience and then use that model to produce simulated experience. With planning, the agent makes fuller use of a limited amount of experience and can therefore achieve a better policy with fewer environmental interactions.

II. Introduction
In this experiment, we constructed a 20x20 maze containing a starting state, a fixed goal state, and a reasonable number of obstacles. All states offer no reward except for the goal state, which carries a reward of +1. The agent has four possible actions from each state in the maze: up, down, left, and right. Each action moves the agent into the state that corresponds to the direction of the action. If the agent tries to take an illegal action, such as moving into an obstacle or off the border of the maze, it simply remains in its current state. Our task was to train the agent to find the goal state using TD learning with eligibility traces and planning to facilitate the learning process. The task was treated as continuing because the agent was relocated to a new random starting state each time it reached the goal state. To accomplish this, we experimented with Q-learning before choosing to use Sarsa(λ) with planning, as described in the Design section.
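For reference, the standard one-step updates behind Q-learning and Sarsa, and the trace-based form used by Sarsa(λ), are:

    Sarsa:      Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) ]
    Q-learning: Q(s_t, a_t) ← Q(s_t, a_t) + α[ r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t) ]

Sarsa(λ) with accumulating traces forms the TD error δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t), increments the trace of the visited pair, e(s_t, a_t) ← e(s_t, a_t) + 1, applies Q(s, a) ← Q(s, a) + α δ_t e(s, a) for all (s, a), and then decays every trace by e(s, a) ← γλ e(s, a).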
III. Design
Objectives and Challenges
Our design objective was to create a simple, extendable program that made it easy to modify both the source and the modeling of the environment. The primary challenge was finding a way to meaningfully represent and display our environment and the agent's progress. By representing Q(s,a) as a matrix and storing the positions the agent visited, we were able to implement Sarsa easily and display our results. Additionally, our design allowed us to quickly vary parameters, such as the number of obstacles in the world and the number of episodes the agent performed. The result is readable code and a logical graphical representation of the maze.

We chose to represent Q(s,a) as a 20x20x4 matrix. Each value in the matrix represents the value of a particular direction (our action set) for a given x,y position (our state set). Similarly, our eligibility trace is a 20x20x4 matrix indexed in the same way. This made several portions of our program simple to write and edit. The actual Sarsa update, for instance, is accomplished in a few lines, since we can quickly extract the Q(s,a) values of the four actions along with the appropriate eligibility trace values. Furthermore, the eligibility trace can be decayed simply by multiplying it by the appropriate factor. Finally, this representation allowed us to easily graph Q(s,a), showing the values of the best directions throughout the world.

Technical Approach
Our program works by randomly selecting the positions of a goal and several obstacles. It initializes Q(s,a) to zeros. These values could be initialized randomly, but setting them all to zero gave faster convergence to the optimal Q. After these values are set, the program begins an episode by randomly selecting a starting position and clearing the matrices used to record traveled positions. It then begins searching for the goal. To do so, it chooses a new position using an epsilon-greedy algorithm. If this new position is invalid, the agent simply stays where it is. If the new position is the goal, it receives a reward of +1. The eligibility trace for the current state-action pair is incremented and Q is then updated using Sarsa, after which the eligibility trace is decayed. If planning is enabled, the program enters a planning loop in which it makes predictions based upon what it has learned about the environment. To do this, the agent constructs a model of the environment as it moves through the maze. It then uses this model to simulate experience, updating state-action values with simulated greedy actions taken from a set of N randomly chosen previously observed states and the actions taken from those states. At the end of each step, the state is updated, and if this new state is the goal position, epsilon is decayed and a new episode begins. This continues for the number of episodes specified. This idea is modeled in the following flowchart:

We chose an epsilon-greedy algorithm to select our next action. This provides good on-line performance and, since epsilon is decayed over time, converges toward the optimal solution. Additionally, we chose a Sarsa control scheme since it generally gives better on-line performance and is easier to implement than Q-learning with eligibility traces.
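Condensed from the full listing in the appendix, the core Sarsa(λ) update amounts to only a few lines of Matlab, where Q and et are the 20x20x4 value and trace matrices described above:

    % One Sarsa(lambda) learning step (condensed from the appendix listing).
    % (X,Y,action) is the current state-action pair; (nextX,nextY,next_action)
    % is the pair selected epsilon-greedily for the next step.
    delta = reward + gamma*Q(nextX+.5, nextY+.5, next_action) - Q(X+.5, Y+.5, action);
    et(X+.5, Y+.5, action) = et(X+.5, Y+.5, action) + 1;  % accumulate trace for this pair
    Q  = Q + alpha*delta*et;                              % update all eligible pairs at once
    et = gamma*lambda*et;                                 % decay every trace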
The manner in which the program places obstacles and the starting position is very simple. It randomly chooses a position and checks it against the obstacles already placed in the world. If an obstacle occupies that position, it randomly chooses a new position, and this continues until it finds a valid position. Although in theory this could continue for a very long time if the chosen positions were consistently occupied, in practice it is quick and effective. Additionally, our program makes it easy to change the number of obstacles in the world. If the number of obstacles exceeds the length of a dimension, there is some probability that no valid path from the starting position to the goal will exist, and this probability increases with the number of obstacles placed in the world.

Parameters
A. Epsilon
Epsilon is the probability of selecting a non-greedy action from any state. We initially set epsilon to 0.75; as a result, the epsilon-greedy algorithm begins by choosing one of the four directions with equal probability (the selection rule is sketched at the end of this section). Epsilon is then slowly decayed (by the factor mu) after each episode to allow convergence to the optimal state-action values. If epsilon is initially set to a lower value, the agent does not explore as much and instead often takes the default greedy action ('up', since all state-action values start out equal). As a result, it takes longer in early episodes to find the goal, because little learning has taken place. An epsilon higher than 0.75 adds no value: at 0.75 every action is already equally likely, and exploring uniformly at first is the best option when nothing is yet known about the states.

B. Gamma
Gamma determines how much the value of the next state is discounted when calculating the new value of the current state. A higher gamma causes the value of the next state to have a greater influence on the value of the current state; if the next state's value is correct, it is desirable for it to have a greater impact. Because we chose to initialize the state-action values to zero and the only positive reward is received at the goal state, we decided that a higher gamma would be appropriate. Setting gamma too low would slow convergence because it would hinder the propagation of rewards from the goal state; states far from the goal would be worth almost nothing. Setting gamma too high also slows convergence, because it can easily reinforce the wrong action values. We experimentally determined that a value of approximately 0.85 strikes the right balance.

C. Alpha
Alpha scales the influence of the delta (TD error) term when updating the value of the current state-action pair. As alpha increases, a greater portion of the update is applied to the value of the current state-action pair. While a lower alpha limits the size of each update, it helps control the negative influence of an incorrect update. Because a high alpha can cause state-action values to oscillate, we chose a lower alpha of 0.1. This converged well and prevented incorrect updates from overwhelming good state-action values.

D. Lambda
Lambda sets the rate of decay (in conjunction with gamma) of the eligibility trace; it determines how much the eligibility of a state-action pair is reduced on each time step in which it is not visited. A low lambda causes less of the reward to propagate back to states far from the goal. While this can prevent the reinforcement of a suboptimal path, it also means a state far from the goal receives very little credit, which slows convergence because the agent spends more time searching when it starts far from the goal. Conversely, a high lambda allows more of the path to be updated with higher rewards. This suited our implementation, because our high initial epsilon was able to correct any state values that had been incorrectly reinforced, creating a more defined path to the goal in fewer episodes. Because of this, we chose a final lambda of 0.9.
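The epsilon-greedy selection rule referenced above, as it appears (lightly reformatted) in the appendix code:

    % Epsilon-greedy selection over the four moves for the cell (X, Y),
    % followed by the per-episode decay of epsilon (mu = 0.99 in our runs).
    [~, greedy] = max( Q(X+.5, Y+.5, :) );   % current greedy action
    if rand() > epsilon
        action = greedy;                     % exploit
    else
        action = randi(4);                   % explore: pick any non-greedy action
        while action == greedy
            action = randi(4);
        end
    end
    % ... after the episode ends:
    epsilon = epsilon*mu;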
Experiments and Results
Q-learning
We initially experimented with Q-learning without eligibility traces or planning. This simplified the implementation and allowed us to establish reasonable values for our parameters. Although this version ran episodes quickly and produced reasonable results, it took several thousand episodes to reach the same level of performance as our final method. An example output is shown below. As we will explain, our Sarsa algorithm with eligibility traces and planning produces better results in fewer than 26 episodes.

Figure 1: A maze result from Q-learning

This image represents the environment and the agent's movements. The boxes are obstacles, the star is the goal position, and the diamond is the start position. Black circles show places the agent has been. An arrow is drawn for the maximum-valued action in each state: its direction shows the direction of that action and its length shows the action's magnitude.

Figure 2: Q-learning takes many episodes to generate this result (Q(s,a) after 2000 episodes)

Sarsa with eligibility trace and planning
The following series of graphs shows an example experiment run with our Sarsa algorithm using an eligibility trace and planning. We ran the experiment for 500 episodes with 19 obstacles and plotted the results every 25 episodes; a select few of those plots are shown below. For this experiment, epsilon started at 0.75 with a decay rate of 0.99, gamma and lambda were set to 0.9, and alpha was set to 0.1. Forty planning updates (N = 40) were performed after each real step.

Figure 3: Episode 1. When just starting, the agent walks about randomly searching for the goal.

When the agent begins the first episode, no learning has taken place, so Q(s,a) is zero for all state-action pairs. Since epsilon is 0.75, the agent performs the first episode by walking randomly until it finds the goal. At that point, Q(s,a) is updated using the eligibility trace.

Figure 4: Q(s,a) for Episode 1. The eligibility trace allows updating of several actions that led to the goal.

Figure 5: A side view of Q(s,a) for Episode 1, showing the curvature of Q(s,a).

These charts show the sum of Q(s,a) over all actions. As expected, the eligibility trace causes an update to the state-action values of several actions that led to the goal.
After 26 episodes, the agent already shows improvement:

Figure 6: Episode 26. The agent finds the goal faster.

Figure 7: Q(s,a) for Episode 26. The eligibility trace and planning allow quick improvement.

Figure 8: A side view of Q(s,a) for Episode 26.

Note that dark spots indicate a value of zero, so obstacles show up as dark spots. The goal position also shows up as a dark spot because the agent never takes an action while in the goal position (a new episode begins). In fewer than 30 episodes, the agent is able to find optimal solutions from reasonable distances. Moreover, Q(s,a) is already near optimal, as most of the arrowheads point toward the goal.

Figure 9: Episode 176. The agent finds an optimal solution.

Figure 10: Q(s,a) for Episode 176 is near optimal.

Figure 11: This side view of Q(s,a) for Episode 176 shows the mountain of reward expectations leading up to the goal.

The following figures show the results after 500 episodes:

Figure 12: Episode 500. The arrows show a near-optimal solution.

Figure 13: The final Q(s,a) after 500 episodes (top and side views).

Experimenting with epsilon and planning
We ran several experiments with different values for epsilon, both with and without planning. For each episode we recorded the number of steps taken before the agent reached the goal state and compared this with the Manhattan distance (number of spaces) from the start to the goal. Recording the error in this way is fast but not completely accurate: if a direct path to the goal does not exist (i.e., an obstacle lies directly next to the goal or start position and between the two), then the optimal path is longer than this distance. As a result, an optimal solution will have a recorded error near 1. Using Sarsa without planning, the average error for the final 100 episodes after running 500 episodes reaches 2.25 (epsilon was decayed by a factor of 0.99 after each episode). With planning, this error falls to 1.05 after 500 episodes. Since the error calculation assumes a direct path always exists, it is reasonable to believe this solution is in fact optimal.
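The error metric itself is computed as in the appendix listing: the Manhattan distance from each cell to the goal is precomputed (ignoring obstacles), and each episode's error is the number of steps taken minus that distance for the episode's start cell.

    % Error metric used in all experiments (from the appendix code).
    optSol   = abs(gridX - goalX) + abs(gridY - goalY);   % Manhattan distance to goal, ignoring obstacles
    error(i) = steps - optSol(startX+.5, startY+.5);      % steps taken minus the direct-path length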
The images below show the errors and state-action value functions for these cases.

Figure 14: Last 100 errors, no planning, epsilon decay = .99. Errors are still high without planning.

Figure 15: Last 100 errors, planning, epsilon decay = .99. Error rates are much smaller with planning.

Figure 16: Q(s,a), no planning, epsilon decay = .99. The state-action value function without planning converges slowly.

Figure 17: Q(s,a), planning, epsilon decay = .99. The state-action value function with planning shows much more learning.

Unsurprisingly, Q(s,a) converges more slowly with a higher (slower) epsilon decay rate, as the agent explores too much in later episodes when Q(s,a) is already quite accurate. The average error over all 500 episodes without planning and with a decay rate of .999 is 135.02. When we reduce the decay rate to .99, this error falls to just 26.52 over all 500 episodes. With planning turned on and the decay rate set to .999, the error is 134.07; with a decay rate of .99, it falls to 23.72. The errors and state-action value functions are shown in the figures below.

Figure 18: Errors, no planning, epsilon decay = .999. The total error declines more slowly than with .99.

Figure 19: Errors, planning, epsilon decay = .999. The total error declines more slowly than with .99.

Figure 20: Q(s,a), no planning, epsilon decay = .999. With more exploration, Q(s,a) is updated only around the goal.

Figure 21: Q(s,a), planning, epsilon decay = .999. With a lot of exploration and planning, Q(s,a) is updated much more. Note: dark spots show obstacles and the goal.

In the other direction, when the decay rate is too small, the agent does not explore enough. With planning turned off and the decay rate set to .9, the agent performs so poorly that it does not finish in a reasonable time. With planning turned on, the error rises to 34.22 over all episodes. While this is better than .999, the error for the last 100 episodes is still 6.90, showing that strong initial exploration is critical. These error rates are summarized in the following table.

Planning?   Epsilon decay rate   Average error, all episodes   Average error, last 100 episodes
No          .9                   N/A                           N/A
No          .99                  26.52                         2.25
No          .999                 135.02                        28.68
Yes         .9                   34.22                         6.90
Yes         .99                  23.72                         1.05
Yes         .999                 134.07                        22.24

Table 1: Comparison of error rates for various epsilon decay values with and without planning

These results also show clearly that planning reduces the number of episodes necessary to generate optimal solutions. Although experience is cheap in our simulated environment, where generating an episode takes only a fraction of a second, a real robot attempting to find a real goal position in a real environment is constrained by how quickly it can act. In such a scenario, the planning stage becomes critical, as it allows significantly faster improvement with far fewer episodes.
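The planning step that provides this benefit is reproduced below from the appendix listing, lightly reformatted with the loop temporaries renamed for clarity. After every real step, the observed transition is stored in a simple table model, and N simulated one-step updates are applied at randomly chosen previously visited state-action pairs.

    % Dyna-style planning (from the appendix code): store the observed transition,
    % then perform N simulated updates at previously visited state-action pairs.
    model(X+.5, Y+.5, action, 1) = nextX;    % learned model: next x, next y, reward
    model(X+.5, Y+.5, action, 2) = nextY;
    model(X+.5, Y+.5, action, 3) = reward;
    for k = 1:N
        prev = randi(length(xmat));          % pick a random previously visited step
        pX = xmat(prev);  pY = ymat(prev);  pA = amat(prev);
        simX = model(pX+.5, pY+.5, pA, 1);   % simulate the stored outcome
        simY = model(pX+.5, pY+.5, pA, 2);
        simR = model(pX+.5, pY+.5, pA, 3);
        [~, simA] = max( Q(simX+.5, simY+.5, :) );
        Q(pX+.5, pY+.5, pA) = Q(pX+.5, pY+.5, pA) + ...
            alpha*( simR + gamma*Q(simX+.5, simY+.5, simA) - Q(pX+.5, pY+.5, pA) );
    end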
IV. Summary
This project provided experience with the implementation of temporal-difference learning methods, specifically Q-learning and Sarsa(λ), as well as eligibility traces and planning. Our first attempt at a solution was a simple one-step Q-learning algorithm, which performed well and converged to an (epsilon-) optimal solution in approximately 2000 episodes. After determining good values for the alpha, epsilon, and gamma parameters from this model, we switched to Sarsa(λ) to implement eligibility traces. This algorithm converged in roughly the same number of episodes but ran more slowly than the Q-learning method because of the extra storage and computation required. After experimentally determining a good value for lambda, we added planning to the algorithm. Planning allowed the agent to make fuller use of a limited amount of experience and thus achieve a better policy with fewer environmental interactions. While each episode still runs more slowly than one-step Q-learning, our Sarsa(λ) with planning algorithm consistently converged to a near-optimal maze solution within 50 episodes. With this implementation, we successfully achieved an efficient TD algorithm with eligibility traces and planning that can meaningfully display our environment and the agent's progress.

V. Appendix
Matlab Code

Sarsa

dim = 20;
num_obstacles = 19;
num_episodes = 500;
plot_freq = 25;     % every $plot_freq images are plotted
save_maze = 1;      % 0 = false, 1 = true
img_dir = 'images'; % image directory; where to save images
planning = 1;       % 0 = off, 1 = on
N = 40;             % planning steps

%initialize parameters
epsilon = .75;
gamma = .9;
alpha = .1;
lambda = .9;
mu = .99;
muval = 99; %used for outputting if save_maze = 1

%initialize goal position
goalX = randi( dim ) - .5;
goalY = randi( dim ) - .5;
%goalX = 13.5;
%goalY = 12.5;

%initialize obstacles to zeros
obstaclesX = zeros( 1, num_obstacles );
obstaclesY = zeros( 1, num_obstacles );

%add goal to obstacles so randomly generated obstacles aren't in the goal
obstaclesX(1) = goalX;
obstaclesY(1) = goalY;

%set up the world to make plotting easier later
gridX = repmat( transpose(.5:1:(dim-.5)), 1, dim );
gridY = transpose( gridX );
u = zeros( dim, dim );
v = zeros( dim, dim );

%randomly generate obstacles
for i=2:num_obstacles
    newObX = randi( dim ) - .5;
    newObY = randi( dim ) - .5;
    while Check_obstacle( newObX, newObY, obstaclesX, obstaclesY )
        newObX = randi( dim ) - .5;
        newObY = randi( dim ) - .5;
    end
    obstaclesX(i) = newObX;
    obstaclesY(i) = newObY;
end

%remove goal from obstacles
obstaclesX = obstaclesX(2:end);
obstaclesY = obstaclesY(2:end);

%initialize Q(s,a) arbitrarily
Q = zeros( dim, dim, 4 );
Q( (obstaclesX+.5), (obstaclesY+.5), : ) = 0;

%eligibility trace and planning model
et = zeros( dim, dim, 4 );
if planning
    model = zeros(dim, dim, 4, 3);
end

%get optimal solution for each point
%this may be slightly off, depending on obstacle positions
optSol = abs(gridX - goalX) + abs(gridY - goalY);
error = zeros( 1, num_episodes );

for i=1:num_episodes %begin an episode
    if(i == num_episodes) %remove epsilon-greedy on the final episode
        epsilon = 0;
    end

    %initialize start state -- don't run into obstacles and be a bit from
    %the goal
    X = randi(dim) - .5;
    Y = randi(dim) - .5;
    while (abs(X-goalX) < 2 ) || Check_obstacle(X,Y,obstaclesX,obstaclesY) || (abs(Y-goalY) < 2 )
        X = randi(dim) - .5;
        Y = randi(dim) - .5;
    end
    startX = X;
    startY = Y;

    %initialize the action
    if (i == 1)
        action = randi(4);
    else
        [val,action] = max(Q(X+.5,Y+.5,:));
    end

    %these matrices will hold the x,y positions traveled
    xmat = 0;
    ymat = 0;
    steps = 0;
    amat = 0;

    %repeat for each step
    while( 1 )
        %save the number of steps it has taken
        steps = steps + 1;
        sprintf( '%u %u %u\n', steps, X, Y );

        %save the x and y positions and corresponding action
        xmat( steps ) = X;
        ymat( steps ) = Y;
        amat( steps ) = action;
        %take action, observe r, s'
        nextX = X;
        nextY = Y;
        switch action
            case 1
                nextY = Y + 1; %up
                if Check_obstacle( X, nextY, obstaclesX, obstaclesY )
                    nextY = Y;
                end
            case 2
                nextX = X + 1; %right
                if Check_obstacle( nextX, Y, obstaclesX, obstaclesY )
                    nextX = X;
                end
            case 3
                nextX = X - 1; %left
                if Check_obstacle( nextX, Y, obstaclesX, obstaclesY )
                    nextX = X;
                end
            case 4
                nextY = Y - 1; %down
                if Check_obstacle( X, nextY, obstaclesX, obstaclesY )
                    nextY = Y;
                end
        end

        %go back if it knocks you off the map
        if nextX > dim || nextX < 0
            nextX = X;
        end
        if nextY > dim || nextY < 0
            nextY = Y;
        end

        %only reward for hitting goal
        if nextX == goalX && nextY == goalY
            reward = 1;
        else
            reward = 0;
        end

        %choose next action based on Q using epsilon-greedy
        rannum = rand();
        [val,ind] = max(Q(nextX+.5,nextY+.5,:));
        if rannum > epsilon %take greedy
            next_action = ind;
        else %take non-greedy
            next_action = randi(4);
            while next_action == ind
                next_action = randi(4);
            end
        end

        %Sarsa(lambda) update
        delta = reward + gamma*Q(nextX+.5,nextY+.5,next_action) - Q(X+.5,Y+.5,action);
        et(X+.5, Y+.5, action ) = et(X+.5, Y+.5, action ) + 1;
        Q = Q + alpha*delta*et;
        et = gamma*lambda*et;

        if planning
            model(X+.5,Y+.5,action,1) = nextX;
            model(X+.5,Y+.5,action,2) = nextY;
            model(X+.5,Y+.5,action,3) = reward;
            for k=1:N
                prev = randi(length(xmat)); %choose a random previous state
                X = xmat(prev);
                Y = ymat(prev);
                action = amat(prev);
                simX = model(X+.5,Y+.5,action,1);
                simY = model(X+.5,Y+.5,action,2);
                simR = model(X+.5,Y+.5,action,3);
                [val,simA] = max(Q(simX+.5,simY+.5,:));
                Q(X+.5,Y+.5,action) = Q(X+.5,Y+.5,action) + ...
                    alpha*(simR + gamma*Q(simX+.5,simY+.5,simA) - Q(X+.5,Y+.5,action));
            end
        end

        X = nextX;
        Y = nextY;
        action = next_action;
        if X == goalX && Y == goalY
            error(i) = steps - optSol(startX+.5,startY+.5);
            break;
        end
    end

    %decay epsilon with time
    epsilon = epsilon*mu;

    %print out maze
    if( ~mod( (i-1), plot_freq ) )
        if( save_maze )
            close all hidden;
            map = figure( 'Visible', 'off' );
        else
            figure;
        end

        %set up grid to plot Q(s,a)
        for a=1:dim
            for b=1:dim
                [val,ind] = max( Q(a,b,:) );
                switch ind
                    case 1
                        u(a,b) = 0;    v(a,b) = val;
                    case 2
                        u(a,b) = val;  v(a,b) = 0;
                    case 3
                        u(a,b) = -val; v(a,b) = 0;
                    case 4
                        u(a,b) = 0;    v(a,b) = -val;
                end
            end
        end

        %overlay learning upon the world
        quiver( gridX, gridY, u, v, 0 );
        hold on;

        %plot where we've been, the goal, start, and obstacles
        plot( xmat, ymat, 'ko' ); %black circles
        plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
        plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
        plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
        grid on;
        axis( [0 dim 0 dim] );
        set( gca, 'YTick', 0:1:dim );
        set( gca, 'XTick', 0:1:dim );
        set( gca, 'GridLineStyle', '-' );
        title( sprintf( 'Episode %d', i ) );

        %output the images
        if( save_maze )
            filename = sprintf( '%s/image_%05d.ppm', img_dir, i );
            fprintf( 'saving %s...', filename );
            print( map, '-dppm', '-r200', filename );
            fprintf( 'done.\n' );
            close( map );
        else
            drawnow;
        end

        %output the learning
        Qsum = sum( Q, 3 );
        if( save_maze )
            img = figure( 'Visible', 'off' );
            surf( Qsum );
            title( sprintf( 'Q(s,a) for Episode %d', i ) );
            xlabel( 'X-Position' );
            ylabel( 'Y-Position' );
            colorbar;
            axis( [1 dim 1 dim 0 3.5] );
            filename = sprintf( '%s/Qsum_%05d.ppm', img_dir, i );
            fprintf( 'saving %s...', filename );
            print( img, '-dppm', '-r200', filename );
            fprintf( 'done\n' );
            hold on;
            axis( [0 dim 0 dim 0 3.5] );
            set( gca, 'CameraTargetMode', 'manual' )
            set( gca, 'CameraPosition', [10 10 -28] );
            set( gca, 'CameraUpVector', [1 0 0] );
            axis( [1 dim 1 dim 0 3.5] );
            set( gca, 'XAxisLocation', 'top' )
            set( get(gca,'XLabel'), 'Position', [-1 12 0] );
            set( get(gca,'YLabel'), 'Position', [10 0 0] );
            set( get(gca,'Title'), 'Position', [20.2 10 0] );
            filename = sprintf( '%s/Qsum-bottom_%05d.ppm', img_dir, i );
            fprintf( 'saving %s...', filename );
            print( img, '-dppm', '-r200', filename );
            fprintf( 'done\n' );
        else
            figure;
            surf( Qsum );
        end
    end
end

%print the final image
if( save_maze )
    close all hidden;
    map = figure( 'Visible', 'off' );
else
    figure;
end

%set up grid to plot Q(s,a)
for a=1:dim
    for b=1:dim
        [val,ind] = max( Q(a,b,:) );
        switch ind
            case 1
                u(a,b) = 0;    v(a,b) = val;
            case 2
                u(a,b) = val;  v(a,b) = 0;
            case 3
                u(a,b) = -val; v(a,b) = 0;
            case 4
                u(a,b) = 0;    v(a,b) = -val;
        end
    end
end

%output Q(s,a) using quiver
quiver( gridX, gridY, u, v, 0 );
hold on;

%plot where we've been, the goal, start, and obstacles
plot( xmat, ymat, 'ko' ); %black circles
plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
grid on;
axis( [0 dim 0 dim] );
set( gca, 'YTick', 0:1:dim );
set( gca, 'XTick', 0:1:dim );
set( gca, 'GridLineStyle', '-' );
if( save_maze )
    filename = sprintf( '%s/image_%05d.ppm', img_dir, i );
    fprintf( 'saving %s...', filename );
    print( map, '-dppm', '-r200', filename );
    fprintf( 'done\n' );
    close( map );
else
    drawnow;
end

Qsum = sum( Q, 3 );
if( save_maze )
    img = figure( 'Visible', 'off' );
    surf( Qsum );
    title( sprintf( 'Q(s,a) for Episode %d', i ) );
    xlabel( 'X-Position' );
    ylabel( 'Y-Position' );
    colorbar;
    axis( [1 dim 1 dim 0 3.5] );
    filename = sprintf( '%s/Qsum_%05d.ppm', img_dir, i );
    fprintf( 'saving %s...', filename );
    print( img, '-dppm', '-r200', filename );
    fprintf( 'done\n' );
    hold on;
    axis( [0 dim 0 dim 0 3.5] );
    set( gca, 'CameraTargetMode', 'manual' )
    set( gca, 'CameraPosition', [10 10 32] );
    axis( [1 dim 1 dim 0 3.5] );
    set( gca, 'XAxisLocation', 'top' )
    set( get(gca,'XLabel'), 'Position', [-1 12 0] );
    set( get(gca,'YLabel'), 'Position', [10 0 0] );
    set( get(gca,'Title'), 'Position', [20.2 10 0] );
    filename = sprintf( '%s/Qsum-bottom_%05d.ppm', img_dir, i );
    fprintf( 'saving %s...', filename );
    print( img, '-dppm', '-r200', filename );
    fprintf( 'done\n' );
else
    img = figure;
    surf( Qsum );
end

%display average error
sum(error)/size(error,2)
last100 = error(size(error,2)-100:end);
sum(last100)/size(last100,2)

%---save error figure
if planning == 1
    plan = 'planning';
else
    plan = 'no';
end
img = figure('Visible','off');
plot( error );
title( sprintf('Errors over %d episodes', num_episodes) );
xlabel('Episode');
ylabel('Error');
if( save_maze )
    filename = sprintf( '%s/error-%s-%dep-mu%d.fig', img_dir, plan, num_episodes, muval );
    fprintf( 'saving %s...', filename );
    saveas( img, filename, 'fig' );
    fprintf( 'done\n' );
    close all hidden;
else
    img = figure('Visible','on');
    drawnow;
end

Q-learning

dim = 20;
num_obstacles = 19;
num_episodes = 2000;
plot_freq = 200;    % every $plot_freq images are plotted
save_maze = 0;      % 0 = false, 1 = true
img_dir = 'images'; % image directory; where to save images

%initialize parameters
epsilon = .75;
gamma = .75;
alpha = .1;
lambda = .9;
mu = .999;

%initialize goal position
goalX = randi( dim ) - .5;
goalY = randi( dim ) - .5;
%goalX = 13.5
%goalY = 12.5

%initialize obstacles to zeros
obstaclesX = zeros( 1, num_obstacles );
obstaclesY = zeros( 1, num_obstacles );
%add goal to obstacles so randomly generated obstacles aren't in the goal
obstaclesX(1) = goalX;
obstaclesY(1) = goalY;

%randomly generate obstacles
for i=2:num_obstacles
    newObX = randi( dim ) - .5;
    newObY = randi( dim ) - .5;
    while Check_obstacle( newObX, newObY, obstaclesX, obstaclesY )
        newObX = randi( dim ) - .5;
        newObY = randi( dim ) - .5;
    end
    obstaclesX(i) = newObX;
    obstaclesY(i) = newObY;
end

%remove goal from obstacles
obstaclesX = obstaclesX(2:end);
obstaclesY = obstaclesY(2:end);

%initialize Q(s,a) arbitrarily
%Q = rand( [dim, dim, 4] ) * .25;
Q = zeros( dim, dim, 4 );
Q( (obstaclesX+.5), (obstaclesY+.5), : ) = 0;

%eligibility trace (unused in plain Q-learning)
%et = zeros( dim, dim, 4 );

for i=1:num_episodes %begin an episode
    %initialize start state -- don't run into obstacles and be a bit from
    %the goal
    X = randi(dim) - .5;
    Y = randi(dim) - .5;
    while (abs(X-goalX) < 2 ) || Check_obstacle(X,Y,obstaclesX,obstaclesY) || (abs(Y-goalY) < 2 )
        X = randi(dim) - .5;
        Y = randi(dim) - .5;
    end
    startX = X;
    startY = Y;

    %these matrices will hold the x,y positions traveled
    xmat = 0;
    ymat = 0;
    steps = 0;

    %repeat for each step
    while( 1 )
        %save the number of steps it has taken
        steps = steps + 1;

        %save the x and y positions
        xmat( steps ) = X;
        ymat( steps ) = Y;

        %choose action based on Q using epsilon-greedy
        rannum = rand();
        [val,ind] = max(Q(X+.5,Y+.5,:));
        if rannum > epsilon %take greedy
            action = ind;
        else %take non-greedy
            action = randi(4);
            while action == ind
                action = randi(4);
            end
        end

        %take action a, observe r, s'
        newX = X;
        newY = Y;
        switch action
            case 1
                newY = Y + 1; %up
                if Check_obstacle( X, newY, obstaclesX, obstaclesY )
                    newY = Y;
                end
            case 2
                newX = X + 1; %right
                if Check_obstacle( newX, Y, obstaclesX, obstaclesY )
                    newX = X;
                end
            case 3
                newX = X - 1; %left
                if Check_obstacle( newX, Y, obstaclesX, obstaclesY )
                    newX = X;
                end
            case 4
                newY = Y - 1; %down
                if Check_obstacle( X, newY, obstaclesX, obstaclesY )
                    newY = Y;
                end
        end

        %go back if it knocks you off the map
        if newX > dim || newX < 0
            newX = X;
        end
        if newY > dim || newY < 0
            newY = Y;
        end

        %only reward for hitting goal
        if newX == goalX && newY == goalY
            reward = 1;
        else
            reward = 0;
        end

        %Q-learning update
        [val,next_act] = max(Q(newX+.5,newY+.5,:));
        %et( X+.5, Y+.5, action ) = et( X+.5, Y+.5, action ) + 1;
        %Q(X+.5,Y+.5,action) = Q(X+.5,Y+.5,action) + et(X+.5,Y+.5,action)*alpha*( reward + (gamma*val) - Q(X+.5,Y+.5,action) );
        Q(X+.5,Y+.5,action) = Q(X+.5,Y+.5,action) + alpha*( reward + (gamma*val) - Q(X+.5,Y+.5,action) );
        %decay eligibility trace
        %et = gamma*lambda*et;

        %update the state
        X = newX;
        Y = newY;
        if X == goalX && Y == goalY
            break;
        end
    end

    %decay epsilon with time
    epsilon = epsilon*mu;

    %print out maze
    if( ~mod( (i-1), plot_freq ) )
        if( save_maze )
            map = figure( 'Visible', 'off' );
        else
            figure;
        end

        %set up grid to plot Q(s,a)
        gridX = repmat( transpose(.5:1:(dim-.5)), 1, dim );
        gridY = transpose( gridX );
        u = zeros( dim, dim );
        v = zeros( dim, dim );
        for a=1:dim
            for b=1:dim
                [val,ind] = max( Q(a,b,:) );
                switch ind
                    case 1
                        u(a,b) = 0;    v(a,b) = val;
                    case 2
                        u(a,b) = val;  v(a,b) = 0;
                    case 3
                        u(a,b) = -val; v(a,b) = 0;
                    case 4
                        u(a,b) = 0;    v(a,b) = -val;
                end
            end
        end

        quiver( gridX, gridY, u, v, 0 ); %that's right, quiver
        hold on;

        %plot where we've been, the goal, start, and obstacles
        plot( xmat, ymat, 'ko' ); %black circles
        plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
        plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
        plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
        grid on;
        axis( [0 dim 0 dim] );
        set( gca, 'YTick', 0:1:dim );
        set( gca, 'XTick', 0:1:dim );
        set( gca, 'GridLineStyle', '-' );
        if( save_maze )
            filename = sprintf( '%s/image_%d.ppm', img_dir, i );
            fprintf( 'saving %s...', filename );
            print( map, '-dppm', '-r200', filename );
            fprintf( 'done.\n' );
            close( map );
        else
            drawnow;
        end
    end
end

sum( steps ) / size( steps, 1 )

%print the final image
if( save_maze )
    map = figure( 'Visible', 'off' );
else
    figure;
end

%set up grid to plot Q(s,a)
gridX = repmat( transpose(.5:1:(dim-.5)), 1, dim );
gridY = transpose( gridX );
u = zeros( dim, dim );
v = zeros( dim, dim );
for a=1:dim
    for b=1:dim
        [val,ind] = max( Q(a,b,:) );
        switch ind
            case 1
                u(a,b) = 0;    v(a,b) = val;
            case 2
                u(a,b) = val;  v(a,b) = 0;
            case 3
                u(a,b) = -val; v(a,b) = 0;
            case 4
                u(a,b) = 0;    v(a,b) = -val;
        end
    end
end

%output Q(s,a) using quiver
quiver( gridX, gridY, u, v, 0 );
hold on;

%plot where we've been, the goal, start, and obstacles
plot( xmat, ymat, 'ko' ); %black circles
plot( obstaclesX, obstaclesY, 'ks', 'MarkerSize', 10, 'MarkerFaceColor', 'k' ); %black squares
plot( startX, startY, 'bd', 'MarkerSize', 10, 'MarkerFaceColor', 'b' ); %blue diamond
plot( goalX, goalY, 'bp', 'MarkerSize', 14, 'MarkerFaceColor', 'b' ); %blue pentagram
grid on;
axis( [0 dim 0 dim] );
set( gca, 'YTick', 0:1:dim );
set( gca, 'XTick', 0:1:dim );
set( gca, 'GridLineStyle', '-' );
if( save_maze )
    filename = sprintf( '%s/image_%d.ppm', img_dir, i );
    fprintf( 'saving %s...', filename );
    print( map, '-dppm', '-r200', filename );
    fprintf( 'done\n' );
    close( map );
else
    drawnow;
end

Qsum = sum( Q, 3 );
if( save_maze )
    img = figure( 'Visible', 'off' );
    surf( Qsum );
    filename = sprintf( '%s/Qsum.ppm', img_dir );
    fprintf( 'saving %s...', filename );
    print( img, '-dppm', '-r200', filename );
    fprintf( 'done\n' );
else
    figure;
    surf( Qsum );
end

Video Links
We created a few videos during our work to enable better visualization. Although some implementation details changed after these videos were made, they are still interesting to watch. They can be viewed on YouTube at the following addresses:
http://www.youtube.com/watch?v=JSxAXOI1Hmc -- Maze learning with SARSA
http://www.youtube.com/watch?v=8Fwv5YI7LU4 -- SARSA Q(s,a) Value Improvement (Top View)
http://www.youtube.com/watch?v=jaa9zvCH4Jk -- SARSA Q(s,a) Value Improvement (Side View)