RL-CD: Dealing with Non-Stationarity in Reinforcement Learning

Bruno C. da Silva, Eduardo W. Basso, Ana L. C. Bazzan, and Paulo M. Engel
Instituto de Informática, UFRGS
Caixa Postal 15064, CEP 91.501-970, Porto Alegre, Brazil
{bcs,ewbasso,bazzan,engel}@inf.ufrgs.br

Abstract

This student abstract describes ongoing investigations regarding an approach for dealing with non-stationarity in reinforcement learning (RL) problems. We briefly propose and describe a method for managing multiple partial models of the environment and comment on previous results, which show that the proposed mechanism has better convergence times than standard RL algorithms. Current efforts include the development of a more robust approach, capable of dealing with noisy environments, as well as investigations into the possibility of using partial models to alleviate learning problems in systems with an explosive number of states.

Introduction

When implementing learning algorithms, one often faces the difficult problem of dealing with environments whose dynamics might change due to some unknown or not directly perceivable cause. Non-stationary environments affect standard reinforcement learning (RL) methods in a way that forces them to continuously relearn the policy from scratch. We are currently studying methods capable of complementing RL algorithms so that they perform well in a specific class of non-stationary environments.

The two usual approaches to reinforcement learning are called model-free and model-based. Model-free algorithms do not require that the agent have access to information about how the environment works. They are usually able to compute good policies in relatively simple environments; for complex environments, however, a great engineering effort is needed to design smart state representations. Model-based approaches, such as Prioritized Sweeping, can usually perform better than model-free systems and present a much lower convergence time, although at the cost of demanding a greater computational effort per iteration. For more details on these methods, please refer to (Kaelbling, Littman, & Moore 1996).

It is important to emphasize that both of these approaches were designed to work in stationary environments. When dealing with non-stationary environments, they have to continually readapt themselves to the changing dynamics of the environment. This causes two problems: 1) the time spent relearning how to behave makes performance drop during the readjustment phase; and 2) the system, when learning a new optimal policy, forgets the old one, and consequently makes the relearning process necessary even for dynamics which have already been experienced.

The non-stationary environments that we are interested in consist of those whose behavior is given by one among several different stationary dynamics. We call each type of dynamics a context and assume that it can only be estimated by observing the transitions and rewards. We also assume that maintaining multiple models of the environment (and their respective policies) is a good solution to this learning problem. Partial models have been used for dealing with non-stationarity by other authors, such as Choi (Choi, Yeung, & Zhang 2001) and Doya (Doya et al. 2002). However, their approaches require a fixed number of models, and thus implicitly assume that the approximate number of different environment dynamics is known a priori. Since this assumption is not always realistic, our idea is to overcome this restriction by incrementally building new models. Our main hypothesis is that the use of multiple partial models makes the learning system capable of partitioning its knowledge into models, each model automatically assuming for itself the responsibility for "understanding" one kind of environment behavior. The method we are currently investigating is called RL-CD, or Reinforcement Learning with Context Detection, and is based on the continuous evaluation of the prediction errors generated by each model.

RL-CD

The class of non-stationary environments that we are interested in is similar to the one studied by Hidden-Mode MDP researchers (Choi, Yeung, & Zhang 2001). We assume that the following properties hold: 1) environmental changes are confined to a small number of contexts, which are stationary environments with distinct dynamics; 2) the current context cannot be directly observed, but can be estimated according to the types of transitions and rewards observed; 3) environmental context changes are independent of the agent's actions; and 4) context changes are relatively infrequent.

The mechanism we are studying relies on a set of partial models for predicting the environment dynamics. A partial model $m$ contains functions which estimate transition probabilities ($\hat{T}_m$) and rewards ($\hat{R}_m$). Standard model-based RL methods such as Prioritized Sweeping and Dyna can be used to compute the locally optimal policy $\pi_m(s)$. Given an experience tuple $\varphi \equiv \langle s, a, s', r \rangle$, we update the current partial model $m$ by adjusting its models of transitions and rewards by $\Delta^{\hat{T}}_{m,\varphi}$ and $\Delta^{\hat{R}}_{m,\varphi}$, respectively. These adjustments are computed as follows:

\[
\Delta^{\hat{T}}_{m,\varphi}(\kappa) = \frac{1}{N_m(s,a)+1} \left( \tau^{s'}_{\kappa} - \hat{T}_m(s,a,\kappa) \right), \qquad \forall \kappa \in S
\]

\[
\Delta^{\hat{R}}_{m,\varphi} = \frac{1}{N_m(s,a)+1} \left( r - \hat{R}_m(s,a) \right)
\]

such that $\tau$ is the Kronecker delta:

\[
\tau^{s'}_{\kappa} =
\begin{cases}
1, & \kappa = s' \\
0, & \kappa \neq s'
\end{cases}
\]

The effect of $\tau$ is to update the transition probability $\hat{T}_m(s, a, s')$ towards 1 and all other transitions $\hat{T}_m(s, a, \kappa)$, for $\kappa \neq s'$, towards zero. The quantity $N_m(s, a)$ reflects the number of times, in model $m$, action $a$ was executed in state $s$. We compute $N_m$ considering only a truncated (finite) memory of the past $M$ experiences:

\[
N_m(s, a) = \min \left( N_m(s, a) + 1, \; M \right)
\]

The truncated value of $N_m$ acts as a learning coefficient for $\hat{T}_m$ and $\hat{R}_m$, causing the estimates to be updated quickly during the initial observations and more slowly as the agent gains experience. Having the values of $\Delta^{\hat{T}}_{m,\varphi}$ and $\Delta^{\hat{R}}_{m,\varphi}$, we update the transition probabilities:

\[
\hat{T}_m(s, a, \kappa) = \hat{T}_m(s, a, \kappa) + \Delta^{\hat{T}}_{m,\varphi}(\kappa), \qquad \forall \kappa \in S
\]

and also the model of expected rewards:

\[
\hat{R}_m(s, a) = \hat{R}_m(s, a) + \Delta^{\hat{R}}_{m,\varphi}
\]
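To make these update rules concrete, the minimal Python sketch below keeps one partial model as tabular NumPy arrays. The class name PartialModel, the array layout, and the uniform initialization of $\hat{T}_m$ are illustrative assumptions of ours rather than part of the method's description; the update logic itself follows the equations above.

```python
import numpy as np

class PartialModel:
    """One partial model m: tabular estimates of T̂_m, R̂_m and visit counts N_m."""

    def __init__(self, n_states, n_actions, M):
        self.M = M                                    # size of the truncated experience memory
        # T̂_m(s, a, ·) starts as a uniform distribution over successor states (our choice)
        self.T = np.full((n_states, n_actions, n_states), 1.0 / n_states)
        self.R = np.zeros((n_states, n_actions))      # R̂_m(s, a)
        self.N = np.zeros((n_states, n_actions))      # N_m(s, a), truncated at M
        self.E = 0.0                                  # prediction error trace E_m (used further below)

    def deltas(self, s, a, s_next, r):
        """Δ^T̂_{m,ϕ}(·) and Δ^R̂_{m,ϕ} for the experience tuple ϕ = ⟨s, a, s', r⟩."""
        tau = np.zeros(self.T.shape[2])
        tau[s_next] = 1.0                             # Kronecker delta τ^{s'}_κ
        step = 1.0 / (self.N[s, a] + 1.0)
        delta_T = step * (tau - self.T[s, a])         # pushes T̂_m(s, a, s') towards 1, the rest towards 0
        delta_R = step * (r - self.R[s, a])
        return delta_T, delta_R

    def update(self, s, a, s_next, r):
        """Apply the adjustments and truncate the experience count at M."""
        delta_T, delta_R = self.deltas(s, a, s_next, r)
        self.T[s, a] += delta_T
        self.R[s, a] += delta_R
        self.N[s, a] = min(self.N[s, a] + 1, self.M)
```

In this sketch, a single call to update(s, a, s_next, r) corresponds to processing one experience tuple $\varphi$ with the adjustments defined above.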
In order to detect context changes, the system must be able to evaluate how well the current partial model predicts the environment. To this end, an error signal is computed for each partial model. The instantaneous error is proportional to a confidence value, which reflects the number of times the agent has tried an action in a state. Given a model $m$ and an experience tuple $\varphi = \langle s, a, s', r \rangle$, we calculate the confidence $c_m(s,a)$ and the instantaneous error $e_{m,\varphi}$ as follows:

\[
c_m(s,a) = \left( \frac{N_m(s,a)}{M} \right)^2
\]

\[
e_{m,\varphi} = c_m(s,a) \left( \Omega \left( \Delta^{\hat{R}}_{m,\varphi} \right)^2 + (1 - \Omega) \sum_{\kappa \in S} \left( \Delta^{\hat{T}}_{m,\varphi}(\kappa) \right)^2 \right)
\]

where $\Omega$ specifies the relative importance of the reward and transition prediction errors for the assessment of the model's quality. Once the instantaneous error has been computed, the trace of prediction error $E_m$ of each partial model is updated:

\[
E_m = E_m + \rho \left( e_{m,\varphi} - E_m \right)
\]

where $\rho$ is the adjustment coefficient for the error. The error $E_m$ is updated after each iteration for every partial model $m$, but only the active model has its estimates adjusted with the new experience. A plasticity threshold $\lambda$ is used to specify up to which error level a partial model should keep being adjusted. When $E_m$ becomes higher than $\lambda$, the predictions made by the model are considered sufficiently different from the real observations: a context change is detected and the model with the lowest error is activated. A new partial model is created when no existing model has an error trace smaller than the plasticity threshold. The mechanism starts with a single model and then incrementally creates new partial models as they become necessary. Values for $M$, $\rho$, $\Omega$ and $\lambda$ are problem-dependent, and a more thorough analytical investigation of these parameters is still under way.
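Building on the PartialModel sketch above, the following fragment illustrates how the error traces, the plasticity threshold, and the incremental creation of models could fit together in one iteration. The function names and the default values for $\Omega$, $\rho$, $\lambda$ and $M$ are placeholders only, since these parameters are problem-dependent.

```python
import numpy as np

def instantaneous_error(model, s, a, s_next, r, omega):
    """e_{m,ϕ}: confidence-weighted mix of reward and transition prediction errors."""
    delta_T, delta_R = model.deltas(s, a, s_next, r)
    c = (model.N[s, a] / model.M) ** 2                # confidence c_m(s, a)
    return c * (omega * delta_R ** 2 + (1.0 - omega) * np.sum(delta_T ** 2))

def rl_cd_step(models, active, s, a, s_next, r,
               omega=0.5, rho=0.1, lam=0.3, n_states=10, n_actions=4, M=20):
    """One RL-CD iteration (illustrative parameter values; they are problem-dependent)."""
    # Update the error trace E_m of every partial model.
    for m in models:
        e = instantaneous_error(m, s, a, s_next, r, omega)
        m.E += rho * (e - m.E)

    # If the active model's error exceeds the plasticity threshold λ,
    # a context change is detected: activate the model with the lowest error trace.
    if models[active].E > lam:
        active = min(range(len(models)), key=lambda i: models[i].E)
        # If no existing model is below λ, incrementally create a new partial model.
        if models[active].E > lam:
            models.append(PartialModel(n_states, n_actions, M))
            active = len(models) - 1

    # Only the active model has its T̂ and R̂ estimates adjusted with the new experience.
    models[active].update(s, a, s_next, r)
    # (A model-based RL step, e.g. Prioritized Sweeping on the active model,
    #  would then recompute the locally optimal policy π_m.)
    return active
```

In a learning loop, one would start with models = [PartialModel(n_states, n_actions, M)] and active = 0, call rl_cd_step once per experience tuple, and select actions using the policy derived from the currently active model.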
Conclusions and future work

We have so far tested our method only in simple non-stationary environments. Our first experiments showed that RL-CD could successfully build one model for each different context, and that it converged in one order of magnitude fewer steps than the best model-based RL algorithm. For more information on these preliminary experiments, please refer to a detailed version of this abstract at www.inf.ufrgs.br/∼bcs/aaai/aaai-sa-detailed.pdf.

We are currently testing RL-CD in a noisy environment in which there are three different causes of non-stationarity: 1) explicit modifications in the transition probabilities; 2) partial observability of a subset of sensors; and 3) poor discretization of the sensors' readings. This scenario is much more difficult because the number of models will probably depend on a combination of the causes of non-stationarity. We are also studying the possibility of using context detection as a method for dimensionality reduction, since partial models might summarize the impact of a subset of hidden sensors which cause non-stationarity due to partial observations. Finally, current research addresses the trade-off between memory requirements and model quality in highly non-stationary environments. We hope that further analysis of the proposed approach will shed some light on methods for dealing with non-stationarity in RL.

References

Choi, S. P. M.; Yeung, D.-Y.; and Zhang, N. L. 2001. Hidden-mode Markov decision processes for nonstationary sequential decision making. In Sequence Learning: Paradigms, Algorithms, and Applications, 264–287. London, UK: Springer-Verlag.

Doya, K.; Samejima, K.; Katagiri, K.-i.; and Kawato, M. 2002. Multiple model-based reinforcement learning. Neural Computation 14(6):1347–1369.

Kaelbling, L. P.; Littman, M.; and Moore, A. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237–285.