RL-CD: Dealing with Non-Stationarity in Reinforcement Learning
Bruno C. da Silva and Eduardo W. Basso and Ana L. C. Bazzan and Paulo M. Engel
Instituto de Informática, UFRGS
Caixa Postal 15064, CEP 91.501-970
Porto Alegre, Brazil
{bcs,ewbasso,bazzan,engel}@inf.ufrgs.br
Abstract
This student abstract describes ongoing investigations regarding an approach for dealing with non-stationarity in reinforcement learning (RL) problems. We briefly propose and describe a method for managing multiple partial models of the environment and comment on previous results, which show that the proposed mechanism has better convergence times than standard RL algorithms. Current efforts include the development of a more robust approach, capable of dealing with noisy environments, as well as investigations into the possibility of using partial models to alleviate learning problems in systems with an explosive number of states.
Introduction
When implementing learning algorithms, one often faces the
difficult problem of dealing with environments whose dynamics might change due to some unknown or not directly
perceivable cause. Non-stationary environments affect standard reinforcement learning (RL) methods in a way that
forces them to continuously relearn the policy from scratch.
We are currently studying methods capable of complementing RL algorithms so that they perform well in a specific
class of non-stationary environments.
The two usual approaches for reinforcement learning are
called model-free and model-based. Model-free algorithms
do not require that the agent have access to information
about how the environment works. They are usually able
to compute good policies in relatively simple environments.
For complex environments, however, a great engineering
effort in designing smart state representations is needed.
Model-based approaches, such as Prioritized Sweeping, can
usually perform better than model-free systems and present
a much lower convergence time, although at the cost of demanding a greater computational effort per iteration. For
more details on these methods please refer to (Kaelbling,
Littman, & Moore 1996).
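As a rough illustration only (not taken from this abstract), the Python sketch below contrasts a tabular model-free update (Q-learning) with a model-based backup through estimated transition and reward functions, of the kind Dyna or Prioritized Sweeping performs; the state-space size, learning rate and discount factor are arbitrary placeholders.

```python
# Illustrative sketch only (not from the abstract): a tabular model-free
# Q-learning update versus a model-based backup through estimated T and R.
import numpy as np

n_states, n_actions = 5, 2        # placeholder sizes
alpha, gamma = 0.1, 0.95          # placeholder learning rate and discount

Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next):
    """Model-free: adjust Q(s, a) directly from one sampled transition."""
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

# Model-based methods instead maintain estimates T_hat and R_hat of the
# environment and back Q-values up through them (as Dyna or Prioritized
# Sweeping do), trading extra computation per step for faster convergence.
T_hat = np.full((n_states, n_actions, n_states), 1.0 / n_states)
R_hat = np.zeros((n_states, n_actions))

def model_based_backup(s, a):
    """One full backup of Q(s, a) through the estimated model."""
    Q[s, a] = R_hat[s, a] + gamma * (T_hat[s, a] * Q.max(axis=1)).sum()
```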
It is important to emphasize that both of these approaches
were designed to work in stationary environments. When
dealing with non-stationary environments, they have to continually readapt themselves to the changing dynamics of the environment. This causes two problems: 1) the time spent relearning how to behave makes performance drop during the readjustment phase; and 2) when learning a new optimal policy, the system forgets the old one, which makes relearning necessary even for dynamics that have already been experienced.

The non-stationary environments we are interested in in this investigation are those whose behavior is given by one among several different stationary dynamics. We call each type of dynamics a context and assume that it can only be estimated by observing the transitions and rewards. We also assume that maintaining multiple models of the environment (and their respective policies) is a good solution to this learning problem. Partial models have been used to deal with non-stationarity by other authors, such as Choi (Choi, Yeung, & Zhang 2001) and Doya (Doya et al. 2002). However, their approaches require a fixed number of models, and thus implicitly assume that the approximate number of different environment dynamics is known a priori. Since this assumption is not always realistic, our idea is to overcome this restriction by incrementally building new models.

Our main hypothesis is that the use of multiple partial models makes the learning system capable of partitioning its knowledge into models, each model automatically taking responsibility for “understanding” one kind of environment behavior. The method we are currently investigating is called RL-CD, or Reinforcement Learning with Context Detection, and is based on the continuous evaluation of the prediction errors generated by each model.
RL-CD
The class of non-stationary environments that we are interested in is similar to the one studied by Hidden-Mode MDP
researchers (Choi, Yeung, & Zhang 2001). We assume that
the following properties hold: 1) environmental changes are
confined to a small number of contexts, which are stationary
environments with distinct dynamics; 2) the current context
cannot be directly observed, but can be estimated according
to the types of transitions and rewards observed; 3) environmental context changes are independent of the agent’s
actions; and 4) context changes are relatively infrequent.
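To make these assumptions concrete, the following is a minimal sketch (ours, not part of the original abstract) of an environment of this class; the class name, switching probability and toy dynamics interface are illustrative assumptions.

```python
# Illustrative sketch only: a toy environment matching properties 1-4 above.
# The hidden context selects one of a few stationary dynamics, changes rarely
# and independently of the agent's actions, and is never exposed directly:
# the agent only observes the resulting transitions and rewards.
import random

class HiddenContextEnv:
    def __init__(self, dynamics, switch_prob=0.001, seed=0):
        # dynamics: one callable (state, action) -> (next_state, reward)
        # per stationary context
        self.dynamics = dynamics
        self.switch_prob = switch_prob
        self.rng = random.Random(seed)
        self.context = 0              # hidden from the agent
        self.state = 0

    def step(self, action):
        # Infrequent context changes, independent of the chosen action.
        if self.rng.random() < self.switch_prob:
            self.context = self.rng.randrange(len(self.dynamics))
        self.state, reward = self.dynamics[self.context](self.state, action)
        return self.state, reward     # only these are observable
```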
The mechanism we are studying relies on a set of partial
models for predicting the environment dynamics. A partial
model m contains functions which estimate transition probabilities (T̂m) and rewards (R̂m). Standard model-based RL methods such as Prioritized Sweeping and Dyna can be used to compute the locally optimal policy πm(s).

Given an experience tuple ϕ ≡ ⟨s, a, s', r⟩, we update the current partial model m by adjusting its models of transition and reward by ∆T̂m,ϕ and ∆R̂m,ϕ, respectively. These adjustments are computed as follows:

∆T̂m,ϕ(κ) = (1 / (Nm(s, a) + 1)) (τκs' − T̂m(s, a, κ)),  ∀κ ∈ S

∆R̂m,ϕ = (1 / (Nm(s, a) + 1)) (r − R̂m(s, a))

such that τ is the Kronecker delta:

τκs' = 1 if κ = s', and 0 otherwise.

The effect of τ is to update the transition probability T(s, a, s') towards 1 and all other transitions T(s, a, κ), for κ ≠ s', towards zero. The quantity Nm(s, a) reflects the number of times, in model m, action a was executed in state s. We compute Nm considering only a truncated (finite) memory of the past M experiences:

Nm(s, a) = min(Nm(s, a) + 1, M)

The truncated value of N acts like a learning coefficient for T̂m and R̂m, causing transitions to be updated quickly during the initial observations and more slowly as the agent gains experience. Having the values of ∆T̂m,ϕ and ∆R̂m,ϕ, we update the transition probabilities:

T̂m(s, a, κ) = T̂m(s, a, κ) + ∆T̂m,ϕ(κ),  ∀κ ∈ S

and also the model of expected rewards:

R̂m(s, a) = R̂m(s, a) + ∆R̂m,ϕ

In order to detect context changes, the system must be able to evaluate how well the current partial model can predict the environment. Thus, an error signal is computed for each partial model. The instantaneous error is proportional to a confidence value, which reflects the number of times the agent has tried an action in a state. Given a model m and an experience tuple ϕ = ⟨s, a, s', r⟩, we calculate the confidence cm(s, a) and the instantaneous error em,ϕ as follows:

cm(s, a) = Nm(s, a) / M

em,ϕ = cm(s, a) ( Ω (∆R̂m,ϕ)² + (1 − Ω) Σκ∈S (∆T̂m,ϕ(κ))² )

where Ω specifies the relative importance of the reward and transition prediction errors for the assessment of the model's quality. Once the instantaneous error has been computed, the trace of prediction error Em of the partial model is updated:

Em = Em + ρ (em,ϕ − Em)

where ρ is the adjustment coefficient for the error.

The error Em is updated after each iteration for every partial model m, but only the active model has its transition and reward estimates adjusted. A plasticity threshold λ determines when a partial model should stop being adjusted: when Em becomes higher than λ, the predictions made by the model are considered sufficiently different from the real observations. In this case, a context change is detected and the model with the lowest error is activated. A new partial model is created when no existing model has an error trace smaller than the plasticity threshold. The mechanism starts with only one model and incrementally creates new partial models as they become necessary. Values for M, ρ, Ω and λ are problem-dependent, and further analytical investigation of these parameters is still under way.
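To summarize the mechanism, the sketch below implements the bookkeeping described in this section for the tabular case. It is a hedged illustration rather than the authors' implementation: the initialization of T̂m and R̂m, the exact ordering of the Nm update, and the omission of the planning step (Prioritized Sweeping or Dyna would compute πm from the active model) are assumptions made here, and the default values of M, ρ, Ω and λ are placeholders.

```python
# Hedged sketch of the RL-CD bookkeeping described above (tabular case).
# Initialization, update ordering and parameter values are assumptions of
# this sketch, not specifications taken from the abstract.
import numpy as np

class PartialModel:
    def __init__(self, n_states, n_actions):
        # Assumed init: uniform transition estimates, zero rewards.
        self.T = np.full((n_states, n_actions, n_states), 1.0 / n_states)
        self.R = np.zeros((n_states, n_actions))
        self.N = np.zeros((n_states, n_actions))   # truncated visit counts
        self.E = 0.0                                # trace of prediction error

    def deltas(self, s, a, s_next, r):
        """Adjustments Delta T_hat and Delta R_hat for experience (s, a, s', r)."""
        tau = np.zeros(self.T.shape[2])
        tau[s_next] = 1.0                           # Kronecker delta
        dT = (tau - self.T[s, a]) / (self.N[s, a] + 1.0)
        dR = (r - self.R[s, a]) / (self.N[s, a] + 1.0)
        return dT, dR


class RLCD:
    def __init__(self, n_states, n_actions, M=20, rho=0.1, omega=0.5, lam=0.1):
        self.n_states, self.n_actions = n_states, n_actions
        self.M, self.rho, self.omega, self.lam = M, rho, omega, lam
        self.models = [PartialModel(n_states, n_actions)]
        self.active = 0

    def _instant_error(self, m, s, a, dT, dR):
        confidence = m.N[s, a] / self.M
        return confidence * (self.omega * dR ** 2 +
                             (1.0 - self.omega) * np.sum(dT ** 2))

    def observe(self, s, a, s_next, r):
        # 1) Update the error trace of every model; adjust only the active one.
        for i, m in enumerate(self.models):
            dT, dR = m.deltas(s, a, s_next, r)
            e = self._instant_error(m, s, a, dT, dR)
            m.E += self.rho * (e - m.E)
            if i == self.active:
                m.T[s, a] += dT
                m.R[s, a] += dR
                m.N[s, a] = min(m.N[s, a] + 1, self.M)

        # 2) Context detection: if the active model's error trace exceeds the
        #    plasticity threshold, switch to the best model or create a new one.
        if self.models[self.active].E > self.lam:
            best = min(range(len(self.models)), key=lambda i: self.models[i].E)
            if self.models[best].E < self.lam:
                self.active = best
            else:
                self.models.append(PartialModel(self.n_states, self.n_actions))
                self.active = len(self.models) - 1
        return self.active
```

In this sketch, observe() would be called once per experience tuple, and a planner would recompute or update πm for the newly active model whenever a context change is detected.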
Conclusions and future work
We have so far tested our method only in simple non-stationary environments. Our first experiments showed that RL-CD could successfully build one model for each different context, and that it converged in one order of magnitude fewer steps than the best model-based RL algorithm. For more information on these preliminary experiments, please refer to a detailed version of this abstract at www.inf.ufrgs.br/~bcs/aaai/aaai-sa-detailed.pdf.

We are currently testing RL-CD in a noisy environment in which there are three different causes of non-stationarity: 1) explicit modifications in the transition probabilities; 2) partial observability of a subset of sensors; and 3) poor discretization of the sensors' readings. This scenario is much more difficult because the number of models will probably depend on a combination of the causes of non-stationarity. We are also studying the possibility of using context detection as a method for dimensionality reduction, since partial models might summarize the impact of a subset of hidden sensors which cause non-stationarity due to partial observability. Finally, current research addresses the trade-off between memory requirements and model quality in highly non-stationary environments. We hope that further analysis of the proposed approach sheds some light on methods for dealing with non-stationarity in RL.
References
Choi, S. P. M.; Yeung, D.-Y.; and Zhang, N. L. 2001. Hidden-mode Markov decision processes for nonstationary sequential decision making. In Sequence Learning: Paradigms, Algorithms, and Applications, 264–287. London, UK: Springer-Verlag.
Doya, K.; Samejima, K.; Katagiri, K.; and Kawato, M. 2002. Multiple model-based reinforcement learning. Neural Computation 14(6):1347–1369.
Kaelbling, L. P.; Littman, M. L.; and Moore, A. W. 1996. Reinforcement learning: A survey. Journal of Artificial Intelligence Research 4:237–285.