Learning in a State of Confusion: Perceptual Aliasing in Grid World Navigation Paul A. Crook and Gillian Hayes Institute of Perception, Action and Behaviour, School of Informatics, University of Edinburgh, 5 Forrest Hill, Edinburgh. EH1 2QL, UK {paulc, gmh}@dai.ed.ac.uk Abstract Due to the unavoidable fact that a robot’s sensors will be limited in some manner, it is entirely possible that it can find itself unable to distinguish between differing states of the world. This confounding of states, also referred to as perceptual aliasing, has serious effects on the ability of reinforcement learning algorithms to learn stable policies. Using simple grid world navigation problems we demonstrate experimentally these effects. Although 1-step backup reinforcement learning algorithms performed surprisingly better than expected, our results confirm that algorithms using eligibility traces should be preferred. 1. Introduction Consider a robot learning to navigate its way to a goal point from anywhere in a building. Whatever the design of the robot it can only be equipped with a finite number of sensors and will have limited computational resources with which to interpret this sensory information. Due to these limitations there is always a chance that multiple states of the world, for example two Tjunctions or two long corridors in the building, will be indistinguishable to the robot. This problem was identified in active vision systems by Whitehead and Ballard (1991) who coined the phrase perceptual aliasing. Although with the current pace of technological advance it is always possible to augment both sensory information and the processing, it would be better to have a range of techniques that the robot can use to deal with these situations. In addition, perceptual aliasing if controlled correctly could form a useful tool. If the mapping between the external world and the agent’s internal state is chosen correctly state distinctions that are irrelevant to a task could be ignored, reducing the state space that has to be explored. Perceptual aliasing is especially problematic when using reinforcement learning algorithms. Reinforcement learning algorithms learn to associate rewards and ac- tions with specific states. Perceptual aliasing, which results in the confounding of the true state of the world, therefore, makes it difficult if not impossible for algorithms to learn stable policies. Systems that contain perceptual aliasing are examples of partially observable environments. Work looking at partially observable environments has shown that reinforcement learning, when augmented with memory (Lanzi, 2000) or the ability to create models of its world (Chrisman, 1992; McCallum, 1993), can solve tasks which contain perceptually aliased states. Our long term goal is to test whether the use of active perception can provide an effective alternative to these two approaches. However, at this stage we wish to better understand the problems caused by perceptual aliasing. To study the fundamentals of the problem we consider simulated agents with fixed perception moving around deterministic grid worlds, such as Sutton’s Grid World (figure 1). Depending on the sensory input we allow the agent, it faces similar problems to those that could be encountered by a real mobile robot. This paper presents results of applying various reinforcement learning algorithms, that are commonly used in robotics, to two grid world navigation problems. These two problems can be rendered partially observable by selecting the observations that make up the agent’s state. The aim of these experiments is to test the hypotheses that: • 1-step reinforcement learning algorithms are not able to learn policies which are both stable and optimal, when the task involves perceptually aliased states; • Reinforcement algorithms that use eligibility traces can however learn optimal memoryless 1 and stable policies for the same task. . We confirm results observed by Loch and Singh (1998) that in partially observable environments, reinforcement learning algorithms which use eligibility traces are 1 see section 2. for an explanation of this term In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 preferable to those that use 1-step backup. Furthermore, by running multiple trials we gather some useful insights on the performance of the algorithms tested. 2. Background Whitehead and Ballard (1991) considered reinforcement learning of active perceptual systems, specifically active vision systems. To simulate the problems involved they considered a blocks world problem where a robotic arm has to uncover and grasp a green block from among randomly sorted piles of blocks. Their system could select two blocks to be the focus of its attention and the characteristics of these two blocks formed the state of the world as seen by the learning algorithm. The reinforcement learning algorithm had to learn to coordinate which objects were the focus of attention as well as the actions of the robotic arm. Whitehead and Ballard (1991) found that 1-step backup Q-learning failed to learn the optimal policy, only performing slightly better than selecting actions at random. This failure, they concluded, was due to the inability of Q-learning (or any reinforcement learning algorithm using 1-step backup) to learn stable policies in the presence of perceptual aliasing, the perceptual aliasing in their case being caused by the design of the active perception system. Littman (1994) considered learning state-action policies without memory in partially observable environments, i.e. environments containing perceptual aliasing. He introduced the useful concepts of satisficing and optimal memoryless policies. A memoryless policy returns an action based solely on the current sensation. Standard reinforcement learning algorithms, such as SARSA and Q-learning (Sutton and Barto, 1998, p146,p149), work on exactly this basis. A policy is satisficing if independent of its initial state an agent following this policy is guaranteed to reach the goal. The performance of a policy is measured using the total steps that the agent takes to reach the goal from all possible initial states. An optimal policy is one that achieves the minimum total steps to goal. Therefore, an optimal memoryless policy is a policy that achieves the minimum number of total steps which can be achieved by a memoryless policy. Using hill climbing and branch and bound techniques Littman (1994) showed that it is possible to find optimal memoryless policies for various grid world navigation problems including the variation on Sutton’s grid world2 presented here. Loch and Singh (1998) then showed that reinforcement learning using eligibility traces could also find optimal memoryless policies in grid world navigation problems. 47 135 41 * 39 7 7 151 144 40 0 0 148 189 41 148 41 0 0 20 157 41 20 9 128 64 36 41 4 1 16 233 224 71 2 224 224 228 226 2 149 8 0 148 225 224 244 Figure 1: Sutton’s grid world. Values indicate observations obtained by an agent with fixed perception sampling the eight surrounding squares (8 Adjacent Squares Agent). Arrows show an example optimal memoryless policy (from Loch and Singh (1998)). Filled black squares are obstacles or walls, and the goal is indicated by an asterisk (*). AGENT PERCEPTIONS Figure 2: 8 Adjacent Squares Agent — fixed perception sampling the eight surrounding squares. 2.1 Effects of Perceptual Aliasing Perceptual aliasing causes two distinct effects labelled by Whitehead (1992, (chp.5)) as local and global impairment. Local impairment — an agent that is unable to distinguish between several states of the world will sometimes select actions that are inconsistent with the true underlying state. An example can be seen in figure 1. The states in this figure are labelled to indicate the observations obtained by an agent who can only observe the eight squares adjacent to itself, such as the agent illustrated in figure 2. Such an agent believes it is in state 148 in three distinct locations. These are: four squares directly below the goal; near the middle of the obstacle which is to the left of the goal; and near the middle of the obstacle on the left hand side of the board. In one 2 The original problem was presented in Sutton (1990) In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 of these three states the optimal action is to move north (the state directly below the goal), in the second case the optimal action is to move south (the state just to the left of the goal), in the third state it does not matter if the agent moves north or south. As can be seen in figure 1, the agent, unable to distinguish between these three locations, decides it is best served by moving north. Although appropriate for two of the three occurrences of 148, it is not optimal for the state just to the left of the goal. It is because of such situations that optimal stateaction policies learnt in the absence of memory, i.e. the optimal memoryless policy, are arbitrarily worse than the optimal policy that can be achieved in the absence of perceptual aliasing (Singh et al., 1994). Global Impairment — given the bucket-brigade update employed by 1-step backup reinforcement learning algorithms, inaccurate estimates of state values that occur with perceptually aliased states can lead to errors in the state values of non-perceptually aliased states. This problem is best illustrated by considering figure 3. This world is deliberately designed such that an agent who can only observe the eight squares adjacent to itself cannot distinguish between states 2 and 5. It is able to uniquely identify all of the remaining states. In all states the optimal action is to move right. The agent when updating state values does not regard states 2 and 5 as separate states, thus it stores only one state value to represent both states and their updates are averaged. This averaging results in state 2 having a value greater than its true value. If an agent is in state 3 it might mistakenly select the action move left, towards state 2, believing state 2 to be nearer the goal than state 4. Updating state 3 on the basis of the state value of state 2 then propagates this error potentially causing other states, such as state 4, to select the action move left and further propagating the inconsistent state values. These errors in state values can end up affecting the whole of the agent’s policy. 3. Experiments We conduct our experiments using two grid world navigation problems: (i) Sutton’s Grid World; (ii) A simple 1-D example devised by Whitehead (1992, pp73–78) to illustrate the problems perceptual aliasing causes to 1-step Q learning. And two types of agent: (i) An agent whose state representation is its location in the grid world given in Cartesian coordinates, the Absolute Position Agent. (ii) An agent whose state representation is formed by observing the 8 squares adjacent to its current location, the 8 Adjacent Squares Agent, see figure 2. The importance of these two agents is not in the detail of what they can observe but that the Absolute Position Agent has a unique state representation for every location in either of the two grid worlds, while the latter, the 8 Adjacent Squares Agent, has the same state representation for multiple states in each of the two worlds. 3.1 Sutton’s Grid World Sutton’s grid world is shown in figure 1. It consists of a 9 × 6 grid containing various obstacles and a goal in the top right hand corner (indicated by an asterisk). An agent in this world can choose between four physical actions; move north, south, east and west. State transitions are deterministic and each action moves it one square in the appropriate direction. If an agent tries to move towards an obstacle or wall it is not allowed to move, i.e. location and state remain unaltered, although it receives the same reward as if the action had succeeded. The agent receives a reward of −1 for each action that does not move it directly to the goal state and a reward of 0 for moving directly to the goal state. When the agent reaches the goal state it is relocated to a uniformly random start state. For the 8 Adjacent Squares Agent there are multiple locations that appear to be the same state, e.g. state 148 (figure 1) as discussed in section 2.1 above. 3.2 Simple 1-D Example State State State State State State State 0 1 2 3 4 5 6 * WALL * GOAL STATE Figure 3: Simple 1-D example world to illustrate the problems caused by perceptual aliasing (Whitehead, 1992, pp73– 78). The simple 1-D example world consists of a 1×8 grid as shown in figure 3, with the goal at the far right hand side. An agent in this world can select between two physical actions; move east or move west. State transitions in this world are deterministic and the two actions move the agent one square in the appropriate direction. The agent is not allowed to move past either the far left hand end or far right hand end of the world; if it tries to do this its location and state remain unaltered. On reaching the goal state the agent receives a reward of 5000. Non goal states yield zero reward. The arrangement of the wall and gaps above and below play no part in the actions the agent is allowed to execute. They do however encode the state as seen by the 8 Adjacent Squares Agent. For this agent each state appears unique except states labelled State 2 and State In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 5 (figure 3) which to it appear to be one and the same. 4 3 • learning rate α = 0.1, • discount rate γ = 0.9, • The state-action values for all the learning algorithms were initiated at zero. 3.4 Evaluation We adopted the same evaluation method as used by Loch and Singh (1998). After every 1000 learning steps the policy is evaluated greedily to determine the total number of steps required to reach the goal from every possible starting state. The agent is limited to a maximum of 1000 steps to reach the goal from each starting state. Thus if a policy is evaluated in a world with N starting states and fails to reach the goal from all of them it would have a maximum total steps of N × 1000. Each run consists of a million action steps, with evaluation of the current policy occurring every 1000 steps. Each combination of agent, world, learning algorithm and value of λ was repeated 100 times giving 100 samples per data point. 4 5 x 10 Random Action SARSA Q−learning SARSA(0.9) Watkins’ Q(0.9) 2 4 3 1.5 2 1 1 0 0 0.5 1 2 3 4 5 4 x 10 2 4 6 Action−Learning Steps 8 10 5 x 10 2 4 6 Action−Learning Steps 8 10 5 x 10 4 3.5 • probability of random action ² started at 20% and decayed linearly reaching zero at the 200, 000th actionlearning step. Thereafter it remained at zero. • A range of values were tried for the eligibility trace decay rate λ from 0.001 to 0.9. 2.5 0 0 Mean Total Steps to Reach Goal A selection of learning and action selection algorithms were used: random action selection, SARSA, Q-learning, SARSA(λ) with replacement traces, Watkins’s Q(λ) with accumulating traces. For details of the learning algorithms see Sutton and Barto (1998, p146, p149, p181, p184) respectively. The random action selection algorithm, as its name suggest, selects uniformly between the available actions and provides a baseline for comparison with the other methods. All of the learning algorithms continuously update their policies. Actions are selected greedily using the current policy with a probability of (1 − ²). In cases where actions have the same value ties were broken at random. In the remaining ² cases the action executed was select randomly between all the available actions. In both cases above the random selection was uniform across all possibilities. The following values were used for the learning algorithms: Mean Total Steps to Reach Goal 3.3 Learning Algorithms & Action Selection x 10 x 10 3 2.5 2 1.5 1 0.5 0 0 Figure 4: Plot for Sutton’s Grid World of mean total steps found when policies were evaluated versus action-learning steps. Bars indicate 95% confidence intervals. To simplify plots data points are only shown for every 50,000 actionlearning steps. Top graph shows results for Absolution Position Agent which suffers no perceptual aliasing. Insert shows enlargement of first 50,000 steps which would otherwise not be visible (data plotted for every 1,000 step). Lower plot shows results for 8 Adjacent Squares Agent which aliases multiple locations. 4. Results 4.1 Sutton’s Grid World Figure 4 shows the mean total steps for all four learning algorithms. The top plot shows results with the Absolute Position Agent which experiences no perceptual aliasing. All four learning algorithms quickly converge on the optimal solution in around fifty thousand action-learning steps. This indicates that all four learning algorithms have no problem in learning this task if there is no per- In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 100 100 SARSA Number of Policies in each Class Number of Policies in each Class 60 40 20 0 0 2 4 6 8 Action−Learning Steps 100 40 20 0 0 10 2 4 6 8 Action−Learning Steps 5 10 5 x 10 100 SARSA(0.9) Watkins’ Q(0.9) 80 Number of Policies in each Class Number of Policies in each Class 60 x 10 80 60 40 Non Satisficing Other Satisficing Memoryless Optimal 20 0 0 Q−learning 80 80 2 4 6 Action−Learning Steps 8 10 5 x 10 60 40 20 0 0 2 4 6 Action−Learning Steps 8 10 5 x 10 Figure 5: Plots show categorisation of policies versus action-learning steps for the four learning algorithms. All are learning to solve Sutton’s grid world using the 8 Adjacent Squares Agent. ceptual aliasing. The lower plot shows results for the 8 Adjacent Squares Agent which aliases multiple locations in the world. The mean total steps for SARSA(λ) and Watkins’s Q(λ) with λ = 0.9 rapidly approach the optimal memoryless solution, with SARSA(λ) reaching convergence in less than one hundred thousand actionlearning steps, and Watkins’s Q(λ) in around three hundred thousand action-learning steps. The other values of λ tried were 0.001, 0.005, 0.01, 0.05, 0.1, 0.5 and 0.75 (none of which are shown) and all converged to a similar number of mean total steps as λ = 0.9. As would be expected lower values of λ take longer to converge, with the lowest value, λ = 0.001, converging after eight hundred thousand action-learning steps. The mean total steps of the policies learnt by by the 1step backup algorithms, Q-learning and SARSA, appear to be gradually converging towards a level where the majority of policies will be satisficing. This convergence is, however, extremely slow compared to SARSA(λ) and Watkins’s Q(λ). As indicated by the 95% confidence intervals there is a significant variation in the policies learnt by Q-learning and SARSA. To obtain a better idea of the quality of the policies that are being learnt we have identified five policy cat- egories and tracked the number of policies that fall into each category over time, see Figure 5. The five policy categories are optimal, better than memoryless optimal, memoryless optimal, other satisficing and nonsatisficing. We will define these categories specifically for Sutton’s grid world in terms of the total steps measured when the policy is evaluated. The optimal policy is defined as that which takes the minimum possible total steps to reach the goal from all starting positions. For Sutton’s grid world this is 404 steps. Littman (1994) showed that the optimal memoryless solution for Sutton’s grid world is 416 steps. Littman (1994)’s definition of satisficing is a policy that reaches the goal from all possible start states. Our measure of satisficing is stricter than this requirement as the agent is limited to 1,000 actions from each start state, after which the run is truncated. Accordingly, any policy who fails to reach the goal state from any start location in under 1,000 steps is classed as non-satisficing irrespective of the total steps for that policy. The remaining policies who succeed in reaching the goal from all start states are classified as: optimal if their total steps equals 404; better than memoryless optimal if their total steps lies between 404 and 416 (exclusive); memoryless optimal if the total steps equals 416; and other satisficing if the total steps exceeds In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 40 SARSA Number of Policies Changing from Satisficing to Non−Satisficing & visa−versa Number of Policies Changing from Satisficing to Non−Satisficing & visa−versa 40 30 20 10 0 2 4 6 Action−Learning Steps 8 20 10 2 4 6 Action−Learning Steps 8 10 5 x 10 40 SARSA(0.9) Number of Policies Changing from Satisficing to Non−Satisficing & visa−versa Number of Policies Changing from Satisficing to Non−Satisficing & visa−versa 30 0 10 5 x 10 40 30 20 10 0 Q−learning 2 4 6 Action−Learning Steps 8 10 5 x 10 Watkins’ Q(0.9) 30 20 10 0 2 4 6 Action−Learning Steps 8 10 5 x 10 Figure 6: Plots of stability of policy classification versus action-learning steps for the four learning algorithms. Height of bars indicate the number of policies that changed from being satisficing to non-satisficing and visa-versa since the previous policy evaluation. All are learning to solve Sutton’s grid world using the 8 Adjacent Squares Agent. Goal Reached From All Start States yes Total Physical Actions 404 yes 405 − 415 yes yes no 416 > 416 - Policy Category Optimal Better Than Memoryless Optimal Memoryless Optimal Other Satisficing Non-Satisficing Table 1: Policy categories for Sutton’s Grid World 416. These five categories are summarised in table 1. Figure 5 shows the variation in types of policies learnt against the number of action-learning steps that have been executed. Each combination of parameters and learning algorithm was repeated 100 times. The height of the shaded areas on these plots indicate the number of policies falling into each policy category as measured after a particular number of action-learning steps. Examining the top left hand plot, which shows the policies learnt using SARSA, initially all one hundred policies are non-satisficing (grey shading). At around one hundred thousand action-learning steps a small number of policies become satisficing, but their total steps exceeds 416 so they are classified as other satisficing policies (black shading). The number of policies classified as other satisficing gradually increases until after one million action learning steps 64 of the policies are other satisficing and 36 are non-satisficing. The results for Q-learning (top right) are similar with final tallies of 81 other satisficing policies and 19 non-satisficing. Neither learnt any policies that were better than other satisficing at any stage. However, the other satisficing policies found were reasonable with a mean total steps of 457 for SARSA and 487 for Q-learning. By comparison, SARSA(λ) and Watkins’s Q(λ) had in less than six thousand action-learning steps, learnt a small number of policies that were classified as memoryless optimal (white shaded areas in the lower two plots of figure 5). By the end of a million actionlearning steps the distribution of policies for SARSA(λ) were: 80 memoryless optimal; 0 other satisficing, and 20 non-satisficing. Similarly, after a million action-learning steps, the distribution of policies for Watkins’s Q(λ) was: 48 memoryless optimal; 22 other satisficing, and 30 nonsatisficing There existed the possibility for all of the learning algorithms that although a given proportion of the population of policies were continually categorised as satisficing, individual policies were not stable, switching back and forth between satisficing and non-satisficing solutions. With this possibility in mind we examined the stability In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 of the policies learnt by the four learning algorithms Plots for stability are shown in figure 6. These are derived by counting the number of individual policies that change classification between consecutive policy evaluations. A change in classification is counted when a policy changes from being non-satisficing to any of the other satisficing classifications, or visa-versa. From figure 6 both SARSA and Q-learning are very stable with no more than three policies changing classification at any one time. Much large changes in classification are seen initially for SARSA(λ) and Watkins’s Q(λ), as we would expect. The number of changes then steadies at a fairly low level for SARSA(λ), but remains relatively high for Watkins’s Q(λ). A related observation is that Watkins’s Q(λ) generates a much large number of other satisficing policies than SARSA(λ). SARSA(λ) learns either memoryless optimal or non-satisficing polices, and virtually zero other satisficing policies. A quick investigation reveals that the memoryless optimal policies generated by Watkins’s Q(λ) are reasonably stable, i.e. the number of changes between memoryless optimal policies and any other classification are comparable to the figures for SARSA(λ). However, a large number of polices, on average 11.6, flip between other satisficing and any other classification. This accounts for most of the changes reported on the Watkins’s Q(λ) plot in figure 6. It thus appears that the policy update rule used by Watkins’s Q(λ) learns a significant proportion of unstable, non memoryless optimal policies. 4.2 Simple 1-D Example Results for the simple 1-D example world are shown in figure 7. The top two plots show mean total steps with 95% confidence intervals for the four learning algorithms and also for random action selection. The left hand plot shows results for the Absolute Position Agent which does not experience any perceptual aliasing. In the absence of any state aliasing all four learning algorithms learn the optimal solution to this world. The 8 Adjacent Squares Agent (top right hand plot) aliases State 2 and State 5 (figure 3) in this simple 1D world. Using this agent both SARSA and Q-learning perform worse than an agent selecting actions at random, though the large confidence intervals suggest that it is worth investigating what is occurring with the individual policies. SARSA(λ) and Watkins’s Q(λ), λ = 0.9, both learn the optimal solution in less than 50,000 actionlearning steps. The total steps for an optimal solution to this problem is 28. Due to the very simple nature of this world the total steps for an optimal memoryless policy is also 28. Because these two types of policy are identical we reduce the number of categories used to just three, see table 2. Even though there appears to be only one solu- Goal Reached From All Start States yes yes no Total Physical Actions 28 > 28 - Policy Category Optimal Other Satisficing Non-Satisficing Table 2: Policy categories for Simple 1-D Example tion, move east in all states, and policies are evaluated greedily, other satisficing policies could still exist. Ties where actions have the same value are broke at random. Thus it is possible to image a policy which, in one or more states, has no preference between moving east or moving west, such that an agent following this policy performs a limited random walk before ultimately reaching the goal. Such a policy, if it reached the goal in less than 1,000 steps from each starting state, would still be satisficing but would exceed 28 total steps. In practice this never occurred and only two categories are shown on the plots in figure 7. The middle two plots (figure 7) show the number of policies falling into each category for SARSA and Qlearning. We see that both SARSA and Q-learning reach a plateau with just over 65% of the policies learnt being classified as optimal after just 300,000 action-learning steps. For these two learning algorithms we again plot the change in classification of the policies to test that the policies are stable. The two lower plots in figure 7 suggest that the optimal policies learnt are indeed stable. Plots of the categorisation of policies learn by SARSA(λ) and Watkins’s Q(λ) are not shown as all the policies were classified as optimal after 1000 actionlearning steps and there is very little variation from this initial level for the remainder of the one million actionlearning steps. 5. Discussion & Conclusions The results successfully replicates those of Loch and Singh (1998) showing that SARSA(λ) can find optimal memoryless solutions to tasks containing perceptual aliasing. In fact this result generalises to Watkins’s Q(λ) suggesting that any method that uses eligibility traces can find optimal memoryless solutions. The surprise result was that SARSA and Q-learning could learn satisficing policies to Sutton’s Grid World, and optimal policies for the simple 1-D example. The latter is even more remarkable as Whitehead (1992) presented the example in order to illustrate the extent to which perceptual aliasing can interfere with Q-learning and claimed that “1-step Q-learning cannot learn the optimal policy for this task” (p.73). In both instances the policies learnt appear to be stable. Further examination of this issue indicates that ex- In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 350 3000 2500 Mean Total Steps to Goal Mean Total Steps to Goal 300 250 200 Random Action SARSA Q−learning SARSA(0.9) Watkins’ Q(0.9) 150 100 2000 1500 1000 500 50 0 0 2 4 6 Action−Learning Steps 8 0 0 10 5 x 10 100 2 4 6 Action−Learning Steps Q−learning Number of Policies in each Class 80 60 40 20 Non Satisficing Optimal 0 0 2 4 6 8 Action−Learning Steps 20 10 4 6 Action−Learning Steps 20 0 0 2 4 8 10 5 x 10 6 8 Action−Learning Steps x 10 30 2 40 5 40 0 60 10 SARSA 50 80 Number of Policies Changing from Satisficing to Non−Satisficing & visa−versa Number of Policies in each Class 10 5 x 10 100 SARSA Number of Policies Changing from Satisficing to Non−Satisficing & visa−versa 8 10 5 x 10 Q−learning 50 40 30 20 10 0 2 4 6 Action−Learning Steps 8 10 5 x 10 Figure 7: Top plots are mean total steps found when policies were evaluated versus action-learning steps. To simplify plots data points are only shown for every 50,000 action-learning steps. Left hand graph shows results for Absolution Position Agent which suffers no perceptual aliasing. Right hand plot shows results for 8 Adjacent Squares Agent which aliases states 2 and 5. Middle plots show categorisation of policies versus action-learning steps for SARSA and Q-learning. Bottom plots show stability of policy classification versus action-learning steps for SARSA and Q-learning. Height of bars indicate the number of policies that changed from being optimal to non-satisficing and visa-versa since the previous policy evaluation. All plots are for the Simple 1-D Example grid world and (with the exception of top left) the 8 Adjacent Squares Agent. In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 ploration is important in determining whether reinforcement learning algorithms which use 1-step backup, such as SARSA and Q-learning, can learn policies that are both stable and satisficing in partially observable environments. In the experiments presented above the probability of selecting an exploratory action (²) starts at 20% but reaches zero after two hundred thousand actionlearning steps. For the remaining eight hundred thousand steps the agent always follows the current policy without trying any exploratory actions. The lack of exploration appears to avoid the destructive effects of global impairment allowing policies to achieve stable solutions. The effect of exploration is nicely illustrated by figure 8 which shows the categorisation of policies learnt for the Simple 1-D Example world and 8 Adjacent Squares Agent, using Q-learning with ² fixed at 0.01. With a fixed value for ² the policies are not stable, and continuous oscillations are seen in the number of optimal policies. This is in contrast to the plateau seen in figure 7. A secondary point of note is that in figure 7 the number of optimal policies ramps up slowly as ² decreases from 0.2 to zero. This contrasts with figure 8 where, with a fixed, but initially lower, value of ², the number of optimal policies learnt increases more rapidly. The observed oscillations reinforce Whitehead (1992, p.78)’s argument that Q-learning (or any 1-step backup algorithm) is unable to converge on stable solutions in partially observable environments, provided there is some possibility of selecting an exploratory action. However, once exploration has ceased, it is possible for 1-step backup algorithms to converge on satisficing policies. Number of Policies in each Class 100 Q−learning, ε = 0.01 Watkins’s Q(λ) appear to be flipping between satisficing and non-satisficing, see figure 6. Our first hypothesis “1-step reinforcement learning algorithms are not able to learn policies which are both stable and optimal, when the task involves perceptually aliased states,” needs to be modified in light of the above discussion to reflect the importance of selecting exploratory actions. Although this is an interesting result, it is apparent that policies learnt using SARSA and Q-learning converge on satisficing solutions very slowly. The main aim of the experiments presented above is to illuminate the problems that occur when applying reinforcement learning to partially observable environments. We are interested in doing this in order to clear the ground before moving on to look at whether active perception can be used to address these issues. These results are, however, of interest in their own right as reinforcement learning is often used in robotics, and real limited sensor arrays certainly create the possibility of perceptual aliasing of states. An important observation, therefore, is that if there exists the possibility of state aliasing, then it either needs to be designed out of the task, or careful selection should be made of the learning algorithm. For example, it is probably worth avoiding reinforcement learning algorithms that do 1-step backup. In fact any reinforcement learning algorithm that uses truncated returns will be subject to some detrimental effects of global impairment (Whitehead, 1992, p.80). However, as demonstrated by the above results, reinforcement learning algorithms that use eligibility traces can quickly learn reasonable solutions. 6. Future Work 80 60 40 Non Satisficing Optimal 20 0 0 2 4 6 Action−Learning Steps 8 10 5 x 10 Figure 8: Plot shows categorisation of policies versus actionlearning steps for Q-learning with ² fixed at 0.01, for the Simple 1-D Example world and 8 Adjacent Squares Agent. Of our initial hypotheses we have confirmed the second, that reinforcement algorithms that use eligibility traces can learn optimal memoryless policies, though a question can be raised as to the stability of the solutions given that around 10% of the policies learnt by The main focus of our future work is to test the conjecture that active perception can allow reinforcement learning algorithms which are not enhanced using memory or internal world models, to find optimal solutions to navigation problems which involve perceptual aliasing. This we plan to investigate initially by equipping grid world agents with some form of self directed, perceptual system. Ultimately, we would like to prove our approach using a mobile robot navigating a building, the robot’s input state being formed from the images captured by an on board camera, which can pan, tilt and zoom, and over which the robot’s learning algorithm has direct control. Acknowledgements Paul Crook would like to thank EPSRC without whose funding this work would not be possible. We would also like to thank the reviewers for their useful comments. In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003 References Chrisman, L. (1992). Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Tenth National Conference on Artificial Intelligence, pages 183–188. AAAI/MIT Press. Lanzi, P. L. (2000). Adaptive agents with reinforcement learning and internal memory. In Meyer, J.-A. et al., (Eds.), From Animals to Animats 6: Proceedings of the Sixth International Conference on the Simulation of Adaptive Behavior (SAB’2000), pages 333–342. The MIT Press, Cambridge, MA. Littman, M. L. (1994). Memoryless policies: Theoretical limitations and practical results. In Cliff, D. et al., (Eds.), From Animals to Animats 3: Proceedings of the Third International Conference on Simulation of Adaptive Behavior (SAB’94), pages 238–245. The MIT Press, Cambridge, MA. Loch, J. and Singh, S. (1998). Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML’98), pages 323–331. Morgan Kaufmann, San Francisco, CA. McCallum, A. K. (1993). Overcoming incomplete perception with utile distinction memory. In Proceedings of the Tenth International Conference on Machine Learning (ICML’93), pages 190–196. Morgan Kaufmann, San Francisco, CA. Singh, S. P., Jaakkola, T., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning (ICML’94), pages 284–292. Morgan Kaufmann, San Francisco, CA. Sutton, R. S. (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (ICML’90), pages 216–224. Morgan Kaufmann, San Francisco, CA. Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA. Whitehead, S. D. (1992). Reinforcement Learning for the Adaptive Control of Perception and Action. PhD thesis, University of Rochester, Department of Computer Science, Rochester, New York. Whitehead, S. D. and Ballard, D. H. (1991). Learning to perceive and act by trial and error. Machine Learning, 7(1):45–83. In Proceedings of Towards Intelligent Mobile Robots (TIMR 2003) 4th British Conference on (Mobile) Robotics, UWE, Bristol 2003