CS573x Project -- The Behavior of Q-Learning in a Dramatically Dynamic Environment

Chris Strasburg

May 8, 2003

Contents

Introduction
Theoretical Examination
Test Environment
Results
Discussion
Bibliography

Abstract

Many reinforcement learning algorithms have been applied to domains which have both probabilistic state transitions and probabilistic rewards. Many of these also perform well in a slowly changing environment. However, there may be instances where an agent's ability to adapt to a dramatic environment change would give it an advantage. This paper explores one possible modification to the Q-Learning algorithm that allows an agent to detect and adapt to one or more dramatic changes to the environment.

Introduction

Most reinforcement learners employ some notion of converging toward an optimal policy. Convergence often relies on a decreasing learning parameter to prevent noise from continually throwing the learner off track. Additionally, one suggestion for an exploration-versus-exploitation policy uses a decaying parameter. This works nicely in a static or nearly static environment, but when the environment can change dramatically with little or no warning, an 'established' learner will have difficulty adapting.

The purpose of this paper is to explore the potential benefits of 'adaptivity' in the Q-Learning reinforcement learning algorithm. The discussion also covers the tradeoffs involved in making agents more adaptive.

Theoretical Examination

Deterministic Q-Learning

The Q-Learning algorithm is a well-known special case of the Temporal Difference algorithm (see below). An agent uses perceived rewards to iteratively update the Q value of each state-action pair. Eventually these Q values converge, and the agent can then greedily choose the state transition which leads to the largest Q value; this greedy policy has been proven to be optimal. The Q-value update rule is:

    Q(s, a) = r + γ max_a' Q(s', a')

where γ is the discount factor.

If the goal state jumps, the agent's estimate of the Q function (the mapping of state-action pairs to real values) deviates from the new Q function; in this test environment, the further the jump, the larger the deviation. Since the agent continues to wander in the environment until it is reset, it will eventually adjust its policy to align with the new Q function. The difference between two Q functions is measured as their Mean Squared Error, computed by summing the squared differences between corresponding elements of the two Q matrices and taking the square root of that sum.

To deal with the exploration-versus-exploitation issue, I used an exponential function with an 'exploration' constant k to select the next action. Specifically, in any state s the probability of choosing action a is:

    P(a | s) = (k^Q(s, a) / n(s, a)) / Σ_a' (k^Q(s, a') / n(s, a'))

where n(s, a) is the number of times action a has been tried from state s during the current iteration. This guarantees that all actions have a non-zero probability of being chosen in a given state (assuming the action is possible). Dividing by the number of times an action has already been tried on the current iteration encourages the agent to try a new action if it arrives at the same state twice. Larger values of k tend toward exploitation, and smaller values tend toward exploration. When k = 1, each action has an equal probability of being picked. At higher values of k, higher Q values improve the chance that an action will be picked. When 0 < k < 1, higher Q values actually reduce the probability that an action will be selected.
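To make the update rule and the action-selection scheme concrete, here is a minimal Python sketch of a deterministic Q-learner that uses them. This is an illustration rather than the project's actual code: the class and method names (DeterministicQLearner, select_action, update, end_iteration) are mine, and the +1 in the denominator of the action weights (so that untried actions do not cause a division by zero) is an assumption about how the per-iteration counts are handled.

    import random
    from collections import defaultdict

    GAMMA = 0.9  # discount factor, matching the value used in the experiments

    class DeterministicQLearner:
        def __init__(self, actions, k=1.0):
            self.actions = actions          # e.g. ['N', 'S', 'E', 'W']
            self.k = k                      # exploration constant k
            self.q = defaultdict(float)     # Q(s, a), keyed by (state, action)
            self.tried = defaultdict(int)   # times (s, a) tried this iteration

        def select_action(self, state):
            # Weight each action by k^Q(s, a) / n(s, a); counts start at 1 here
            # (an assumption) so the weight is always defined.
            weights = [self.k ** self.q[(state, a)] / (1 + self.tried[(state, a)])
                       for a in self.actions]
            total = sum(weights)
            threshold, acc = random.random() * total, 0.0
            for action, w in zip(self.actions, weights):
                acc += w
                if threshold <= acc:
                    self.tried[(state, action)] += 1
                    return action
            return self.actions[-1]

        def update(self, state, action, reward, next_state):
            # Deterministic update: Q(s, a) = r + gamma * max_a' Q(s', a')
            best_next = max(self.q[(next_state, a)] for a in self.actions)
            self.q[(state, action)] = reward + GAMMA * best_next

        def end_iteration(self):
            # Forget the within-iteration action counts when the goal is reached.
            self.tried.clear()

With k = 1 every weight reduces to 1 / (1 + n(s, a)), so actions are chosen nearly uniformly; as k grows, actions with larger Q values dominate the sum, which matches the exploitation behavior described above.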
In many scenarios it may be desirable to start with a small value of k (i.e., 1) and increase it on a schedule as time passes. This allows the agent to explore early on and exploit as it becomes more experienced. For instance, my Q-Learner starts with k = 1 and, on each successful achievement of the goal state, increments k by log(k + 1) (a fairly arbitrary choice, but one that grows more rapidly in the beginning and slows down later on).

It is precisely this scenario in which the adaptivity of the agent decreases with age. If the environment changes significantly after the k value has increased, the agent will continue along its original path with little chance of adapting to the new environment. If, however, the agent uses some method of detecting environmental change, it can modify the k value to increase exploration.

In my chosen environment, change is easy to detect. The dynamic Q-Learner simply imposes a Gaussian distribution on the number of turns it has taken to reach the goal state. If an iteration takes more than a sensitivity factor (s) times the standard deviation above the average number of turns to reach the goal, the learner reduces k by an adaptation factor: k = k * (1 - A), where a larger A (0 < A <= 1) provides a more immediate response to perceived environment change. The sensitivity factor s > 0 determines how sensitive an agent is to detecting environment change: the smaller s is, the more sensitive the agent is to anomalies.

Non-Deterministic Q-Learning

In order to deal with nondeterministic actions, the Q-Learning algorithm requires a change to reflect the fact that we now deal with the expected reward rather than the actual reward. The agent must adjust the Q values more slowly, depending on the number of past visits. Specifically, the Q function is redefined as:

    Q_n(s, a) = (1 - L_n) Q_{n-1}(s, a) + L_n [ r + γ max_a' Q_{n-1}(s', a') ]

where

    L_n = 1 / (1 + visits_n(s, a))

and γ is the discount factor. The choice of L_n follows the example in Mitchell's book; there are other valid choices. In this case, if the number of previous visits is high, the Q value will change very little because L_n will be small. If the environment has changed, however, L_n needs to be increased to allow more efficient recovery. To deal with this, I simply zeroed out the memory of how many times each state-action pair has been tried. This is reasonable because the agent cannot predict exactly how the environment has changed, and hence past experience is steeply devalued. It may not be entirely desirable if the agent erroneously detected an environment change.

Test Environment

The test environment consists of a 5 x 5 grid with one goal state yielding a reward of 1 and all other states yielding a reward of 0. The agent has four actions at its disposal: North, South, East, and West. Immediately upon arrival at the goal state, the environment is reset. The agent's home state never changes. The agent has perfect knowledge of its current state, but state transitions may be probabilistic.

The information reported by the agent/environment is: the number of iterations (completed journeys from start to goal), the number of times each state was visited, the number of turns (steps in the environment), the final policy, the turns on which the goal state jumped, and the turn on which the agent detected the jump.
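Before turning to the results, the following Python sketch shows how the change-detection and adaptation rules described above could fit together. It is a sketch under stated assumptions, not the project's code: the names ChangeDetector and adapt are mine, the learner is assumed to expose its exploration constant as k and its state-action visit counts as visits, and the check runs once per completed iteration, whereas the project's learner can apparently adjust k several times within a single iteration (see the Discussion).

    import math

    class ChangeDetector:
        """Tracks turns-per-iteration and flags iterations that look anomalous."""

        def __init__(self, sensitivity):
            self.sensitivity = sensitivity   # s: smaller values trigger more easily
            self.history = []                # turns taken on each completed iteration

        def record(self, turns):
            self.history.append(turns)

        def change_detected(self, turns):
            # Flag a change when this iteration exceeds the running average by
            # more than s standard deviations (requires a little history first).
            if len(self.history) < 2:
                return False
            mean = sum(self.history) / len(self.history)
            var = sum((t - mean) ** 2 for t in self.history) / len(self.history)
            return turns > mean + self.sensitivity * math.sqrt(var)

    def adapt(learner, detector, turns, adaptivity):
        # Apply the adaptation rule after an iteration that took `turns` steps.
        if detector.change_detected(turns):
            # k = k * (1 - A): a larger A gives a more immediate response.
            learner.k *= (1.0 - adaptivity)
            # Zero the visit counts so the nondeterministic learning rate
            # L = 1 / (1 + visits) becomes large again and recovery is faster.
            learner.visits.clear()
        detector.record(turns)

The history kept here is unbounded, matching the unbounded mean and standard deviation discussed later; bounding it (remembering only the last few iterations) is one of the variations suggested in the Discussion.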
Results

Each average was taken over 20 runs. Each dynamic learner was run with sensitivities s = 0.5, 1, 2, and 5 and adaptivities A = 0.1, 0.5, and 0.9. The dynamic worlds were run with approximately a 1 in 100 random chance of the goal state moving after each iteration. The nondeterministic worlds were run with an 80% chance that an action would behave as expected. All learners used a discount factor gamma of 0.9. The world size was 5 x 5.

Deterministic Static Environment

Static Deterministic Q Learner: 4.34 turns per iteration

Dynamic Deterministic Q Learner:

    Sensitivity   Adaptivity   Average Turns Per Iteration
    0.5           0.1          5.83
    1.0           0.1          3.79
    2.0           0.1          3.54
    5.0           0.1          3.72
    0.5           0.5          6.46
    1.0           0.5          6.33
    2.0           0.5          5.34
    5.0           0.5          4.07
    0.5           0.9          8.93
    1.0           0.9          7.69
    2.0           0.9          5.73
    5.0           0.9          4.22

Comments: It is interesting to note that some of the dynamic deterministic Q learners outperformed the static deterministic Q learner. This implies that the decay of exploration (the schedule by which k increases) may be too steep. Two general trends are interesting: 1) performance improves as sensitivity is reduced (larger s values require a larger deviation to trigger adaptation), and 2) performance decreases as adaptivity increases (changes to k become more dramatic).

Deterministic Dynamic Environment

Static Deterministic Q Learner: 5.7831 turns per iteration

Dynamic Deterministic Q Learner:

    Sensitivity   Adaptivity   Average Turns Per Iteration
    0.5           0.1          7.95
    1.0           0.1          7.46
    2.0           0.1          6.48
    5.0           0.1          6.11
    0.5           0.5          9.71
    1.0           0.5          9.25
    2.0           0.5          7.17
    5.0           0.5          6.98
    0.5           0.9          11.73
    1.0           0.9          9.26
    2.0           0.9          7.88
    5.0           0.9          6.02

Comments: The static Q learner outperforms the dynamic Q learner for all attempted values of sensitivity and adaptivity. In a deterministic dynamic environment each action is guaranteed to perform as expected, so there is much less random moving about. Still, I would not have expected this result, and currently I cannot explain it. The general trends mentioned above hold again: decreased sensitivity and decreased adaptivity yield increases in performance. However, in this case there is an interesting anomaly: the best performer (6.02 turns per iteration) combined a large adaptivity (A = 0.9) with the least sensitive setting (s = 5.0).

Nondeterministic Static Environment

Static Nondeterministic Q Learner: 6.27 turns per iteration

Dynamic Nondeterministic Q Learner:

    Sensitivity   Adaptivity   Average Turns Per Iteration
    0.5           0.1          4.64
    1.0           0.1          4.3
    2.0           0.1          3.58
    5.0           0.1          3.75
    0.5           0.5          5.54
    1.0           0.5          4.88
    2.0           0.5          3.99
    5.0           0.5          4.05
    0.5           0.9          7.04
    1.0           0.9          4.95
    2.0           0.9          3.77
    5.0           0.9          3.54

Comments: It is interesting to note that the table above nearly mirrors the trends of the deterministic dynamic environment table. Once again, even though the environment is static, the dynamic learner consistently outperforms the static learner.

Nondeterministic Dynamic Environment

Static Nondeterministic Q Learner: 4.04 turns per iteration

Dynamic Nondeterministic Q Learner:

    Sensitivity   Adaptivity   Average Turns Per Iteration
    0.5           0.1          3.13
    1.0           0.1          3.66
    2.0           0.1          3.41
    5.0           0.1          2.71
    0.5           0.5          5.4
    1.0           0.5          3.91
    2.0           0.5          2.8
    5.0           0.5          4.1
    0.5           0.9          6.01
    1.0           0.9          5.00
    2.0           0.9          3.95
    5.0           0.9          4.55

Comments: This is the most interesting case. The best two performers both have high sensitivity values (s = 5.0 and s = 2.0, i.e., the less sensitive settings), but one has a low adaptivity (A = 0.1) and the other a medium adaptivity (A = 0.5).

Discussion

Based on the above data, I found three trends rather surprising: 1) an adaptive agent outperforms a static agent in a static deterministic world, 2) an adaptive agent outperforms a static agent in a static nondeterministic world, and 3) a static agent outperforms an adaptive agent in a dynamic deterministic world. The evidence seems to support the hypothesis that in a dynamic nondeterministic environment a dynamic agent with appropriate choices for adaptivity and sensitivity has an advantage over a static agent.
I suspect that the lack of evidence to support the similar hypothesis for the dynamic deterministic environment is due more to my methods than to a problem with the theory.

In all cases, large adaptivity values (where k is reduced by an order of magnitude) caused poorer performance than moderate changes. Having observed the actual behavior of the agent, this result is unsurprising, since the adaptivity is more a measure of the change resolution than of the agent's ability to adapt: the agent can effect a modification of k multiple times in one iteration. A restriction on this, or a reduction in the likelihood of environment change, may give more of an advantage to higher values of adaptivity. At this point I expect the optimal adaptivity value to vary inversely with the frequency of environment change.

The sensitivity generally yielded better performance at higher values. This also seems reasonable, since I used an unbounded number of iterations to generate the mean and standard deviation for the number of turns per iteration. This means that the longer an agent lives, the less tolerant it is of performance variation, so larger values of sensitivity (less sensitive) result in fewer false-positive environment change detections.

While I cannot immediately explain why an adaptive agent would outperform a static one in a static environment, this result alleviates the concern that adding adaptivity carries a large penalty when it is not strictly required.

These preliminary results are encouraging. Future exploration should include many more trials to improve the reliability of the statistical results, a larger range of values for both sensitivity and adaptivity, variation in the level of determinism and in the likelihood of environment change, and a more rigorous examination of the theory underlying both sensitivity and adaptivity. Another factor worth varying is how many iterations an agent remembers when computing the mean and standard deviation used to detect environment change. A more thorough survey of the current reinforcement learning literature is also highly recommended.

Bibliography

Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research 4, 1996, pp. 237-285.

Tom M. Mitchell. Machine Learning. WCB/McGraw-Hill, 1997, pp. 367-383.