Model Estimation Within Planning and Learning

Alborz Geramifard†
Joshua D. Redding†
Joshua Joseph∗
Jonathan P. How†
†Aerospace Controls Laboratory, MIT Cambridge, MA 02139 USA
∗Robust Robotics Group, MIT Cambridge, MA 02139 USA
AGF@MIT.EDU
JREDDING@MIT.EDU
JMJOSEPH@MIT.EDU
JHOW@MIT.EDU
Abstract
Risk and reward are fundamental concepts in the
cooperative control of unmanned systems. In
this research, we focus on developing a constructive relationship between cooperative planning and learning algorithms to mitigate the
learning risk while boosting system (planner +
learner) asymptotic performance and guaranteeing the safety of agent behavior. Our framework is an instance of the intelligent cooperative control architecture (iCCA) where the learner
incrementally improves on the output of a baseline planner through interaction and constrained
exploration. We extend previous work by extracting the embedded parameterized transition
model from within the cooperative planner and
making it adaptable and accessible to all iCCA
modules. We empirically demonstrate the advantage of using an adaptive model over a static
model and pure learning approaches in a grid
world problem. Finally, we discuss two extensions to our approach to handle cases where the true model cannot be captured through the presumed functional form.
1. Introduction
The concept of risk is common among humans, robots and
software agents alike. Amongst the latter two, risk models are routinely combined with relevant observations to
analyze potential actions for unnecessary risk or other unintended consequences. Risk mitigation is a particularly
interesting topic in the context of the intelligent cooperative control of teams of autonomous mobile robots (Kim,
2003; Weibel & Hansman, 2005).

Preliminary work presented at the Planning and Acting with Uncertain Models Workshop, ICML 2011, Bellevue, WA, USA.

Figure 1. The intelligent cooperative control architecture (iCCA) is a customizable template for tightly integrating planning and learning algorithms.

In such a multi-agent setting, cooperative planning algorithms rely on knowledge
of underlying transition models to provide guarantees on
resulting agent performance. In many situations however,
these models are based on simple abstractions of the system
and are lacking in representational power. Using simplified models may aid computational tractability and enable
quick analysis, but at the possible cost of implicitly introducing significant risk elements into cooperative plans.
Aimed at mitigating this risk, we adopt the intelligent cooperative control architecture (iCCA) as a framework for
tightly coupling cooperative planning and learning algorithms (Redding et al., 2010). Fig. 1 shows the template
iCCA framework, which comprises a cooperative planner, a learner, and a performance analysis module. The performance analysis module is implemented as a risk analyzer where actions suggested by the learner can be
overridden by the baseline cooperative planner if they are
deemed unsafe. This synergistic planner-learner relationship yields a “safe” policy in the eyes of the planner, upon
which the learner can improve.
This research focuses on developing a constructive relationship between cooperative planning and learning algorithms to reduce agent risk while boosting system (planner
+ learner) performance. We extend previous work (Geramifard et al., 2011) by extracting the parameterized transition
model from within the cooperative planner and making it
accessible to all iCCA modules. In addition, we consider
the cases where the assumed functional form of the model
is both correct and incorrect.
When the functional form is correct, we update the assumed model using an adaptive parameter estimation
scheme and demonstrate that the performance of the resulting system increases. Furthermore, we explore two
methods for handling the case when the assumed functional form cannot represent the true model (e.g., a system
with state dependent noise represented through a uniform
noise model). First, we enable the learner to be the sole
decision maker in areas with high confidence, in which it
has experienced many interactions. This extension eliminates the need for the baseline planner altogether asymptotically. Second, we use past data to estimate the reward
of the learner’s policy. While the policy used to obtain past
data may differ from the current policy of the agent, we
can still exploit the Markov assumption to piece together
trajectories the learner would have experienced given the
past history (Bowling et al., 2008). Additionally, we use
the incorrect functional form of the model to further reduce
the amount of experience required to accurately estimate
the learner’s reward using the method of control variates
(Zinkevich et al., 2006; White, 2009).
The proposed approach of integrating an adaptive model
into the planning & learning framework is shown to yield
significant improvements in sample complexity in the grid
world domain, when the assumed functional form of the
model is correct. When incorrect, we highlight the drawbacks of our approach and discuss two extensions which
can mitigate such drawbacks. The paper proceeds as follows: Section 2 provides background information and Section 3 highlights the problem of interest by defining a pedagogical scenario where planning and learning algorithms
are used to mitigate stochastic risk. Section 4 outlines the
proposed technical approach for learning to mitigate risk, and Section 5 presents empirical results in the gridworld domain. Section 6 highlights two extensions to our approach when the true model cannot be captured in the parametric form
assumed for the model. Section 7 concludes the paper.
2. Background
2.1. Markov Decision Processes
Markov decision processes (MDPs) provide a general
formulation for sequential planning under uncertainty
(Howard, 1960; Puterman, 1994; Littman et al., 1995;
Kaelbling et al., 1998; Russell & Norvig, 2003). MDPs
are a natural framework for solving multi-agent planning
problems as their versatility allows modeling of stochas-
tic system dynamics as well as inter-dependencies between
agents. An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P_{ss'}^{a}, R_{ss'}^{a}, \gamma)$, where $\mathcal{S}$ is the set of states and $\mathcal{A}$ is the set of possible actions. Taking action $a$ from state $s$ has probability $P_{ss'}^{a}$ of ending up in state $s'$ and receiving reward $R_{ss'}^{a}$. Finally, $\gamma \in [0, 1]$ is the discount factor used to prioritize early rewards against future rewards (γ can be set to 1 only for episodic tasks, where the lengths of trajectories are fixed). A trajectory of experience
is defined by sequence s0 , a0 , r0 , s1 , a1 , r1 , · · · , where the
agent starts at state s0 , takes action a0 , receives reward r0 ,
transits to state s1 , and so on. A policy π is defined as a
function from S × A to the probability space [0, 1], where
π(s, a) corresponds to the probability of taking action a
from state s. The value of each state-action pair under
policy π, Qπ (s, a), is defined as the expected sum of discounted rewards when the agent takes action a from state s
and follows policy π thereafter:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\!\left[\left.\sum_{t=0}^{\infty} \gamma^{t} r_{t} \,\right|\, s_{0} = s, a_{0} = a\right].$$

The optimal policy $\pi^{*}$ maximizes the above expectation for all state-action pairs: $\pi^{*} = \operatorname{argmax}_{a} Q^{\pi^{*}}(s, a)$.
2.2. Reinforcement Learning in MDPs
The underlying goal of the two reinforcement learning algorithms presented here is to improve performance of the
cooperative planning system over time using observed rewards by exploring new agent behaviors that may lead to
more favorable outcomes. The details of how these algorithms accomplish this goal are discussed in the following
sections.
2.2.1. Sarsa
A popular approach among MDP solvers is to find an approximation to Qπ (s, a) (policy evaluation) and update the
policy with respect to the resulting values (policy improvement). Temporal Difference learning (TD) (Sutton, 1988)
is a traditional policy evaluation method in which the current Q(s, a) is adjusted based on the difference between the
current estimate of Q and a better approximation formed by
the actual observed reward and the estimated value of the
following state. Given (st , at , rt , st+1 , at+1 ) and the current value estimates, the temporal difference (TD) error, δt ,
is calculated as:
$$\delta_t(Q) = r_t + \gamma Q^{\pi}(s_{t+1}, a_{t+1}) - Q^{\pi}(s_t, a_t).$$
The one-step TD algorithm, also known as TD(0), updates
the value estimates using:
$$Q^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) + \alpha \delta_t(Q), \qquad (1)$$
where α is the learning rate. Sarsa (state-action-reward-state-action) (Rummery & Niranjan, 1994) is basic TD for which the policy is directly derived from the Q values as:

$$\pi^{\mathrm{Sarsa}}(s, a) = \begin{cases} 1 - \epsilon & a = \operatorname{argmax}_{a'} Q(s, a') \\ \frac{\epsilon}{|\mathcal{A}|} & \text{otherwise,} \end{cases}$$

where ε is the probability of taking a random action. This policy is also known as the ε-greedy policy (ties are broken randomly if more than one action maximizes Q(s, a)).
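To make the Sarsa update concrete, the sketch below implements a minimal tabular Sarsa learner with an ε-greedy policy. It is an illustrative sketch, not the implementation used in this work: the environment interface (`reset()` and `step(a)` returning `(s', r, done)`), the hyperparameter values, and the greedy tie-breaking are all assumptions.

```python
import random
from collections import defaultdict

class SarsaAgent:
    """Minimal tabular Sarsa with an epsilon-greedy policy (illustrative sketch)."""

    def __init__(self, actions, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(float)      # Q[(state, action)] -> current value estimate
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def policy(self, s):
        # epsilon-greedy: random action with probability epsilon, greedy otherwise
        # (ties are resolved arbitrarily here rather than randomly).
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.Q[(s, a)])

    def update(self, s, a, r, s_next, a_next, done):
        # TD error: delta = r + gamma * Q(s', a') - Q(s, a); Q(s', a') is 0 at terminal states.
        target = r + (0.0 if done else self.gamma * self.Q[(s_next, a_next)])
        delta = target - self.Q[(s, a)]
        self.Q[(s, a)] += self.alpha * delta     # TD(0) update, Eq. (1)

def run_episode(env, agent):
    """Run one episode, updating the agent after every transition."""
    s = env.reset()
    a = agent.policy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = agent.policy(s_next)
        agent.update(s, a, r, s_next, a_next, done)
        s, a = s_next, a_next
```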
3. The GridWorld Domain: A Pedagogical
Example
Consider the gridworld domain shown in Fig. 2-(a), in
which the task is to navigate from the bottom-middle (•)
to one of the top corner grids (⋆), while avoiding the danger zone (◦), where the agent will be eliminated upon entrance. At each step the agent can take any action from the
set {↑, ↓, ←, →}. However, due to wind disturbances unbeknownst to the agent, there is a 20% chance the agent
will be transferred into a neighboring unoccupied grid cell
upon executing each action. The reward for reaching either of the goal regions and the danger zone are +1 and
−1, respectively, while every other action results in −0.01
reward.
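For illustration, the sketch below encodes a gridworld of this kind with uniform action noise. The grid dimensions, the exact goal and danger-zone cells, and the way noise is applied (a uniformly random action 20% of the time, rather than a push into a neighboring unoccupied cell) are simplifying assumptions of this sketch, and blocked cells are omitted; it plugs into the `run_episode` helper of the previous sketch.

```python
import random

ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

class GridWorld:
    """Illustrative gridworld with wind noise; layout details are assumed, not taken from Fig. 2."""

    def __init__(self, rows=5, cols=11, noise=0.2):
        self.rows, self.cols, self.noise = rows, cols, noise
        self.goals = {(0, 0), (0, cols - 1)}   # top corners: reward +1
        self.danger = {(2, cols // 2)}         # assumed danger-zone cell: reward -1
        self.start = (rows - 1, cols // 2)     # bottom middle

    def reset(self):
        self.pos = self.start
        return self.pos

    def step(self, action):
        # With probability `noise`, the wind replaces the chosen action with a random one.
        if random.random() < self.noise:
            action = random.choice(list(ACTIONS))
        dr, dc = ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.rows - 1)
        c = min(max(self.pos[1] + dc, 0), self.cols - 1)
        self.pos = (r, c)
        if self.pos in self.goals:
            return self.pos, +1.0, True
        if self.pos in self.danger:
            return self.pos, -1.0, True
        return self.pos, -0.01, False
```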
Let’s first consider the conservative policy shown in Fig. 2(b) designed for high values of wind noise. As expected,
the nominal path, highlighted as a gray watermark, follows
the long but safe path to the top left goal. The color of
each grid represents the true value of each state under the
policy. Green indicates positive values and white indicates zero. The values of blocked grids are shown in red.
Fig. 2-(c) depicts a policy designed to reach the right goal
corner from every location. This policy ignores the existence of the noise, hence the nominal path in this case gets
close to the danger zone. Finally Fig. 2-(d) shows the optimal solution. Notice how the nominal path avoids getting
close to the danger zone. Model-free learning techniques
such as Sarsa can find the optimal policy of the noisy environment through interaction, but require a great deal of
training examples. More critically, they may deliberately
move the agent towards dangerous regions just to gain information about those areas. Previously, we demonstrated
that when a planner (e.g., methods to generate policies in
Fig. 2-(b),(c)) is integrated with a learner, it can rule out
intentionally poor decisions, resulting in safer exploration.
Furthermore, the planner’s policy can be used as a starting
point for the learner to bootstrap on, potentially reducing
the amount of data required by the learner to master the
task (Redding et al., 2010; Geramifard et al., 2011). In our
past work, we considered the case where the model used for planning and risk analysis was static. In this paper,
we expand our framework by representing the model as a
separate entity which can be adapted through the learning
process. The focus here is on the case where the parametric form of the approximated model (T̂ ) includes the true
underlying model (T ) (e.g., assuming an unknown uniform
noise parameter for the gridworld domain). In Section 6,
we discuss drawbacks of our approach when T̂ is unable to
exactly represent T and introduce two potential extensions.
Adding a parametric model to the planning/learning
scheme is easily motivated by the case when the initial
bootstrapped policy is wrong, or built from incorrect assumptions. In such a case, it is more effective to simply
switch the underlying policy with a better one, rather than
requiring a plethora of interactions to learn from and refine
a poor initial policy. The remainder of this paper shows that
by representing the model as a separate entity which can be
adapted through the learning process, we make it possible to intelligently switch out the underlying policy, which is then refined by the learning process.
4. Technical Approach
In this section, we introduce the architecture in which this
research was carried out. First, notice the addition of
the “Models” module in the iCCA framework as implemented when compared to the template architecture of Figure 1. This module allows us to adapt the agent’s transition
model in light of actual transitions experienced. An estimated model, T̂ , is output and is used to sample successive
states when simulating trajectories. As mentioned above,
this model is assumed to be of the correct functional form
(e.g., a single uncertain parameter). Additionally, Figure 3 shows a dashed box outlining the learner and the risk-analysis modules, which are formulated together within a
Markov decision process (MDP) to enable the use of reinforcement learning algorithms in the learning module.
The Sarsa (Rummery & Niranjan, 1994) algorithm is implemented as the system learner, which uses past experience to guide exploration and then suggests behaviors that
are likely to lead to more favorable outcomes than those
suggested by the baseline planner. The performance analysis block is implemented as a risk analysis tool where actions suggested by the learner can be rejected if they are
deemed too risky. The following sections describe iCCA
blocks in further detail.
4.1. Cooperative Planner
At its fundamental level, the cooperative planning algorithm used in iCCA yields a solution to the multi-agent
path planning, task assignment or resource allocation problem, depending on the domain. This means that it seeks
to optimize an underlying, user-defined objective function.

Figure 2. The gridworld domain is shown in (a), where the task is to navigate from the bottom middle (•) to one of the top corners (⋆). The danger region (◦) is an off-limits area that the agent should avoid. The corresponding policies and value functions are depicted for (b) a conservative policy designed to reach the left corner from most states, (c) an aggressive policy which aims for the top right corner, and (d) the optimal policy.
Many existing cooperative control algorithms use observed
performance to calculate temporal-difference errors which
drive the objective function in the desired direction (Murphey & Pardalos, 2002; Bertsekas & Tsitsiklis, 1996). Regardless of how it is formulated (e.g. MILP, MDP, CBBA),
the cooperative planner, or cooperative control algorithm,
is the source for baseline plan generation within iCCA. We
assume that this module can provide safe solutions to the
problem in a reasonable amount of time.
4.2. Learning and Risk-Analysis
As discussed earlier, learning algorithms may encourage
the agent to explore dangerous situations (e.g., flying close
to the danger zone) in hope of improving the long-term performance. While some degree of exploration is necessary,
unbounded exploration can lead to undesirable scenarios
such as crashing or losing a UAV. To avoid such undesirable
outcomes, we implemented the iCCA performance analysis module as a risk analysis element where candidate ac-
tions are evaluated for safety against an adaptive estimated
transition model T̂ . Actions deemed too “risky” are replaced with the safe action suggested by the cooperative
planner. As the risk-analysis and learning modules are coupled within an MDP formulation, as shown by the dashed
box in Fig. 3, we now discuss the details of the learning and
risk-analysis algorithms.
Previous research employed a risk analysis scheme that
used the planner’s transition model, which can be stochastic, to mitigate risk (Geramifard et al., 2011). In this research, we pull this embedded model from within the planner and allow it to be updated online. This allows both
the planner and the risk-analysis module to benefit from
model updates. Algorithm 1 explains the risk analysis process, where we assume the existence of a function constrained: S → {0, 1} that indicates whether being in a particular state is allowed. We define “risk” as the
probability of visiting any of the constrained states. The
core idea is to use Monte-Carlo sampling to estimate the
risk level associated with the given state-action pair if the
Figure 3. The intelligent cooperative control architecture as implemented. The conventional reinforcement learning method
(e.g., Sarsa, Natural Actor Critic) sits in the learner box while the
performance analysis block is implemented as a risk analysis tool.
Together, the learner and the risk-analysis modules are formulated
within a Markov decision process (MDP).
Algorithm 1: safe
Input: s, a, T̂
Output: isSafe
risk ← 0
for i ← 1 to M do
    t ← 1
    s_t ∼ T̂(s, a)
    while not constrained(s_t) and not isTerminal(s_t) and t < H do
        s_{t+1} ∼ T̂(s_t, π^p(s_t))
        t ← t + 1
    risk ← risk + (1/i)(constrained(s_t) − risk)
isSafe ← (risk < ψ)
planner’s policy is applied thereafter. This is done by simulating M trajectories from the current state s. The first
action is the learner’s suggested action a, and the rest of
actions come from the planner policy, π p . The adaptive approximate model, T̂ , is utilized to sample successive states.
Each trajectory is bounded to a fixed horizon H and the risk
of taking action a from state s is estimated by the probability of a simulated trajectory reaching a “risky” (e.g., constrained) state within horizon H. If this risk is below a
given threshold, ψ, the action is deemed to be safe.
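A minimal Python rendering of this Monte-Carlo risk check is sketched below. The model interface `T_hat.sample(s, a)`, the planner policy `planner_policy(s)`, and the `constrained` and `is_terminal` predicates are assumed placeholders for whatever representations the surrounding system provides; the default parameter values are likewise illustrative.

```python
def safe(s, a, T_hat, planner_policy, constrained, is_terminal,
         num_rollouts=5, horizon=20, risk_threshold=0.1):
    """Monte-Carlo estimate of the risk of taking action `a` in state `s` and
    following the planner's policy afterwards (cf. Algorithm 1)."""
    risk = 0.0
    for i in range(1, num_rollouts + 1):
        t = 1
        s_t = T_hat.sample(s, a)                          # first step uses the learner's action
        while not constrained(s_t) and not is_terminal(s_t) and t < horizon:
            s_t = T_hat.sample(s_t, planner_policy(s_t))  # remaining steps follow the planner
            t += 1
        # incremental average of the indicator "this rollout reached a constrained state"
        risk += (float(constrained(s_t)) - risk) / i
    return risk < risk_threshold
```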
It is important to note that, though not explicitly required,
the cooperative planner should take advantage of the updated transition model by replanning. This ensures that
the risk analysis module is not overriding actions deemed
“risky” by an updated model with actions deemed “safe” by
an outdated model. Such behavior would result in convergence of the cooperative learning algorithm to the baseline
planner policy, and the system would not benefit from the iCCA framework.
For learning schemes that do not represent the policy as
a separate entity, such as Sarsa, integration within the iCCA
framework is not immediately obvious. Previously, we
presented an approach for integrating learning approaches
without an explicit actor component (Redding et al., 2010).
Our idea was motivated by the concept of the Rmax algorithm (Brafman & Tennenholtz, 2001). We illustrate our
approach through the parent-child analogy, where the planner takes the role of the parent and the learner takes the
role of the child. In the beginning, the child does not know
much about the world, hence, for the most part s/he takes
actions advised by the parent. While learning from these actions, after a while, the child feels comfortable taking a self-motivated action, having been through the same situation many times. Seeking permission from the parent, the child can take this action if the parent thinks it is safe; otherwise, the child follows the action suggested by the parent.
Our approach for safe, cooperative learning is shown in Algorithm 2. Lines 1-11 correspond to our previous cooperative method (Geramifard et al., 2011), while lines 12-17 depict the new part of the algorithm, which includes model adaptation. On every step, the learner inspects the action suggested by the planner and estimates
the knownness of the state-action pair by considering the
number of times that state-action pair has been experienced
following the planner’s suggestion (line 3). The knownness
parameter controls the shift speed from following the planner’s policy to the learner’s policy. Given the knownness
of the state-action pair, the learner probabilistically decides
to select an action from its own policy (line 4). If the action is deemed to be safe, it is executed. Otherwise, the
planner’s policy overrides the learner’s choice (lines 5-7).
If the planner’s action is selected, the knownness count of
the corresponding state-action pair is incremented (line 9).
Finally, the learner updates its parameters depending on the choice of the learning algorithm (line 11). A drawback of
Algorithm 2 is that state-action pairs explicitly forbidden by the risk analyzer will not be visited. Hence, if the model is designed poorly, it can hinder the learning process in parts of the state space for which the risk is overestimated. Moreover, the planner can take advantage of the adaptive model and revise its policy. Hence we extended the previous algorithm so that the model is adapted during the learning phase (line 12). If the change to the model used for planning crosses a predefined threshold (ξ), the planner revisits its policy and keeps a record of the new model (lines 13-15). If the policy changes, the counts of all state-action pairs are set to zero so that the learner starts watching the new policy from scratch (lines 16-17). An
important observation is that the planner’s policy should
be seen as safe through the eyes of the risk analyzer at all
times. Otherwise, most actions suggested by the learner will be deemed too risky by mistake, as they are followed by the planner's policy.

Algorithm 2: Cooperative Learning
Input: N, π^p, s, learner
Output: a
1   a ← π^p(s)
2   π^l ← learner.π
3   knownness ← min{1, count(s, a)/N}
4   if rand() < knownness then
5       a^l ∼ π^l(s, a)
6       if safe(s, a^l, T̂) then
7           a ← a^l
8   else
9       count(s, a) ← count(s, a) + 1
10  Take action a and observe r, s′
11  learner.update(s, a, r, s′)
    ⊲ Adaptive-model extension (lines 12-17)
12  T̂ ← NewModelEstimation(s, a, s′)
13  if ‖T̂^p − T̂‖ > ξ then
14      T̂^p ← T̂
15      π^p ← Planner.replan()
16      if π^p is changed then
17          reset all counts to zero
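The sketch below translates one step of Algorithm 2 into Python. The `planner`, `learner`, and `T_hat` objects, their method names, and the threshold values are assumed interfaces invented for this sketch (it reuses the `safe` helper sketched earlier); the paper's actual implementation is not specified at this level of detail.

```python
import random

def cooperative_learning_step(s, env, planner, learner, T_hat, count,
                              constrained, is_terminal, N=10, xi=0.05):
    """One interaction step of the cooperative learner (cf. Algorithm 2).
    `count` and the planner's reference model/policy are mutated in place."""
    a = planner.policy(s)                                    # baseline (safe) action
    knownness = min(1.0, count.get((s, a), 0) / N)
    if random.random() < knownness:
        a_l = learner.policy(s)                              # learner's suggested action
        if safe(s, a_l, T_hat, planner.policy, constrained, is_terminal):
            a = a_l                                          # accept it only if deemed safe
    else:
        count[(s, a)] = count.get((s, a), 0) + 1             # planner's action observed once more

    s_next, r, done = env.step(a)
    learner.update(s, a, r, s_next, done)                    # e.g., a Sarsa update

    T_hat.update(s, a, s_next)                               # adaptive model estimation (line 12)
    if T_hat.distance_to(planner.reference_model) > xi:      # model drift beyond threshold xi
        planner.reference_model = T_hat.copy()
        if planner.replan():                                 # returns True if the policy changed
            count.clear()                                    # start watching the new policy
    return s_next, done
```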
5. Experimental Results

We compare the effectiveness of the adaptive model approach combined with the iCCA framework (AM-iCCA) with respect to (i) our previous work with a static model (iCCA) and (ii) the pure learning approach. All algorithms used Sarsa for learning with the following form of learning rate:

$$\alpha_t = \alpha_0 \, \frac{N_0 + 1}{N_0 + \text{Episode}\#^{1.1}}.$$

For each algorithm, we used the best α_0 from {0.01, 0.1, 1} and N_0 from {100, 1000, 10^6}. During exploration, we used an ε-greedy policy with ε = 0.1. Value functions were represented using lookup tables. Both iCCA methods started with a noise estimate of 40%, a count weight of 100, and the conservative policy (Fig. 2-b). We used 5 Monte-Carlo simulations to evaluate risk and rejected actions for which any of the trajectories entered the danger zone.
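As a worked example of this schedule, the helper below evaluates the learning rate for a given episode; the printed values only illustrate the decay behavior and are not results from the paper.

```python
def learning_rate(alpha0, N0, episode):
    """Decaying Sarsa learning rate: alpha_t = alpha0 * (N0 + 1) / (N0 + episode**1.1)."""
    return alpha0 * (N0 + 1) / (N0 + episode ** 1.1)

# With alpha0 = 0.1 and N0 = 100 the rate starts at 0.1 and decays toward zero:
print(learning_rate(0.1, 100, 1))      # 0.1
print(learning_rate(0.1, 100, 1000))   # roughly 0.005
```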
The knownness parameter (N) was set to 10. For the AM-iCCA, the noise parameter was estimated as

$$\text{noise} = \frac{\#\text{unintended agent moves} + \text{initial weight}}{\#\text{total number of moves} + \text{initial weight}}.$$

The planner switched to the aggressive policy (Fig. 2-c) whenever the wind disturbance estimate dropped below 25%. Each algorithm was tested for 100 trials. Error bars represent 95% confidence intervals.
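The snippet below maintains this weighted noise estimate online. Note an interpretation choice: the formula above adds an "initial weight" to both numerator and denominator, and this sketch reads the numerator term as the initial noise estimate (40%) multiplied by the count weight (100), which reproduces the stated initialization; the class interface itself is an assumption of this sketch.

```python
class NoiseEstimator:
    """Weighted running estimate of the uniform noise parameter (illustrative sketch)."""

    def __init__(self, initial_noise=0.4, initial_weight=100):
        # Start from the 40% prior, weighted as if 100 moves had already been observed.
        self.unintended = initial_noise * initial_weight
        self.total = float(initial_weight)

    def update(self, intended_next_state, observed_next_state):
        # A move counts as unintended when the wind pushed the agent off its intended cell.
        self.unintended += float(observed_next_state != intended_next_state)
        self.total += 1.0

    @property
    def noise(self):
        return self.unintended / self.total

# Example: after 300 observed moves, 45 of them unintended, the estimate blends
# prior and data: (0.4 * 100 + 45) / (100 + 300) = 0.2125.
```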
Figure 4. Empirical results of the AM-iCCA, iCCA, and Sarsa algorithms in the grid world problem (cumulative return versus interaction steps; the horizontal lines correspond to the static aggressive and conservative policies).

Fig. 4 compares the cumulative return obtained in the grid world domain for Sarsa, iCCA, and AM-iCCA based on the number of interactions. The expected performance of both static policies is shown as horizontal lines. The improvement of iCCA with a static model over the pure learning approach is statistically significant in the beginning, while the improvement is less significant as more interactions are obtained.

Although initialized with the conservative policy, the adaptive approach (shown in green in Fig. 4) quickly learned that the noise in the system was much less than the initial 40% estimate and switched to (and refined) the aggressive policy. As a result of this early switch, AM-iCCA outperformed both iCCA and Sarsa. Over time, however, all methods reached the same level of performance. On that note, it is important to see that all learning methods (Sarsa, iCCA, and AM-iCCA) improved on the static baseline policies, highlighting their sub-optimality.

6. Extensions

So far, we assumed that the true model can be represented accurately within the functional form of the approximated model. What if this condition does not hold? In this section, we discuss the challenges involved in using our proposed methods for this scenario and suggest two extensions to our approach to overcome these challenges.

Figure 5. The solution to the grid world scenario where the noise is only applied in windy grid cells (∗).

Returning to the grid world domain, consider the case where the 20% noise is not applied to all states. Fig. 5 depicts such a scenario, where noise is only applied to the grid cells marked with a ∗. Notice that for any larger noise value the optimal policy remains unchanged. While passing close to the danger zone is safe, when the agent assumes the uniform noise model by mistake, it generalizes the noisy movements to
all states including the area close to the danger zone. This
can cause the AM-iCCA to converge to a suboptimal policy, as the risk analyzer filters optimal actions suggested by
the learner due to the incorrect model assumption. Even when the learner suggests optimal actions, such as taking → in the 3rd row, they are deemed too risky, as the planner's policy which is followed afterward is not safe anymore.
The root of this problem is that the risk analyzer has the
final authority in selecting the actions from the learner and
the planner, hence both of our extensions focus on revoking this authority. The first extension turns the risk analyzer mandatory only to a limited degree. Back to our parent/child analogy, the child may simply stop checking if
the parent thinks an action is safe once s/he feels comfortable taking a self-motivated action. Thus, the learner would
eventually circumvent the need for a planner altogether.
More specifically, line 6 of Algorithm 2 is changed, so
that if the knownness of a particular state reaches a certain
threshold, probing the safety of the action is not mandatory
anymore. While this approach would introduce another parameter to the framework, it guarantees that in the limit, the
learner would be the final decision maker. Hence the resulting method will converge to the same solution as Sarsa
asymptotically.
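Relative to the cooperative-learning step sketched earlier, this first extension amounts to a small change in the acceptance test for the learner's action: once the knownness of the current state-action pair exceeds a trust threshold, the risk check is skipped. The threshold name and value below are assumptions of this sketch.

```python
def accept_learner_action(s, a_l, T_hat, planner, knownness,
                          constrained, is_terminal, trust_threshold=0.9):
    """Modified safety gate (first extension): well-known pairs bypass the risk analyzer."""
    if knownness >= trust_threshold:
        return True   # learner has final authority here; behaves like plain Sarsa in the limit
    return safe(s, a_l, T_hat, planner.policy, constrained, is_terminal)
```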
Furthermore, an additional approach to dealing with an incorrect model is to use the previous experience of our agent
to estimate the reward of the learner’s policy. By accurately
estimating the learner’s policy from past data we can disregard the opinion of the risk analyzer, whose estimate of
the learner’s policy is based on an incorrect model. For a
given action from the learner a and the planner’s policy π p
we evaluate this joint policy’s return by combining two approaches. First, we estimate the reward from the joint policy by calculating the reward our agent would receive from
the training data when taking the joint policy, conditioned
on the agent’s current state. Unfortunately, this estimate
of the joint policy’s reward can suffer from high variance
if little data has been seen. To reduce the variance of this
estimate we use the method of control variates (Zinkevich
et al., 2006; White, 2009). A control variate is a random
variable y that is correlated with a random variable x and
is used to reduce the variance of the estimate of x’s expectation. Assuming we know y’s expectation we can write
$$\mathbb{E}[x] = \mathbb{E}[x - \alpha(y - B)],$$
where E[y] = B. Using the incorrect model, we can form
a control variate that is correlated with the reward experienced by our agent and can therefore be used to reduce the
variance of the joint policy’s estimated reward.
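To make the variance-reduction idea concrete, here is a small numerical sketch of a control-variate estimator. The data-generating process, the correlated variable y (standing in for a model-based prediction of the return), and the coefficient choice are invented for illustration and are not taken from the paper's experiments.

```python
import random

def control_variate_estimate(xs, ys, B):
    """Estimate E[x] from samples, using a correlated variable y with known mean B.
    Uses the standard coefficient alpha = Cov(x, y) / Var(y) (an illustrative choice)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    alpha = cov_xy / var_y if var_y > 0 else 0.0
    # E[x] = E[x - alpha * (y - B)] since E[y] = B; subtracting the correlated,
    # zero-mean term lowers the variance of the estimate of E[x].
    return mean_x - alpha * (mean_y - B)

# Toy usage: y is a noisy model-based prediction of the observed return x.
random.seed(0)
ys = [random.gauss(1.0, 0.5) for _ in range(100)]     # model-predicted returns, E[y] = 1.0
xs = [y + random.gauss(0.0, 0.1) for y in ys]         # observed returns, correlated with y
print(control_variate_estimate(xs, ys, B=1.0))
```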
7. Conclusions
In this paper, we extended our previous iCCA framework
by representing the model as a separate entity which can be
shared by the planner and the risk analyzer. Furthermore,
when the true functional form of the transition model is
known, we discussed how the new method can facilitate a
safer exploration scheme through a more accurate risk analysis. Empirical results in a grid world domain verified the
potential of the new approach in reducing the sample complexity. Finally we argued through an example that model
adaptation can hurt the asymptotic performance if the true model cannot be captured accurately. For this case, we
provided two extensions to our method in order to mitigate
the problem. For future work, we plan to apply
our algorithms to large UAV mission planning domains to
verify their scalability.
References
Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3). Athena Scientific, May 1996. ISBN
1886529108.
Bhatnagar, Shalabh, Sutton, Richard S., Ghavamzadeh,
Mohammad, and Lee, Mark. Incremental natural actor-critic algorithms. In Platt, John C., Koller, Daphne,
Singer, Yoram, and Roweis, Sam T. (eds.), NIPS, pp.
105–112. MIT Press, 2007.
Bowling, Michael, Johanson, Michael, Burch, Neil, and
Szafron, Duane. Strategy evaluation in extensive games
with importance sampling. Proceedings of the 25th
Annual International Conference on Machine Learning
(ICML), 2008.
Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX - a
general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:
213–231, 2001.
Geramifard, Alborz, Redding, Joshua, Roy, Nicholas,
and How, Jonathan P. UAV Cooperative Control with
Stochastic Risk Models. In American Control Conference (ACC), June 2011.
Howard, R A. Dynamic programming and Markov processes, 1960.
Kaelbling, Leslie Pack, Littman, Michael L., and Cassandra, Anthony R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:
99–134, 1998.
Kim, Jongrae. Discrete approximations to continuous
shortest-path: Application to minimum-risk path planning for groups of UAVs. In Proc. of the 42nd Conf. on
Decision and Control, pp. 9–12, 2003.
Littman, Michael L., Dean, Thomas L., and Kaelbling,
Leslie Pack. On the complexity of solving Markov decision problems. In Proc. of the Eleventh International
Conference on Uncertainty in Artificial Intelligence, pp.
394–402, 1995.
Murphey, R. and Pardalos, P.M. Cooperative control and
optimization. Kluwer Academic Pub, 2002.
Puterman, M. Markov Decision Processes: Discrete
Stochastic Dynamic Programming. Wiley, 1994.
Redding, J., Geramifard, A., Undurti, A., Choi, H.L., and
How, J. An intelligent cooperative control architecture. In American Control Conference (ACC), pp. 57–62,
2010.
Rummery, G. A. and Niranjan, M. Online Q-learning using connectionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). Cambridge University Engineering Department,
1994.
Russell, S. and Norvig, P. Artificial Intelligence: A Modern
Approach, 2nd Edition. Prentice-Hall, Englewood Cliffs,
NJ, 2003.
Sutton, Richard S. Learning to predict by the methods of
temporal differences. Machine Learning, 3:9–44, 1988.
Weibel, R. and Hansman, R. An integrated approach to
evaluating risk mitigation measures for UAV operational
concepts in the NAS. In AIAA Infotech@Aerospace Conference, pp. AIAA-2005-6957, 2005.
White, Martha. A general framework for reducing variance
in agent evaluation, 2009.
Zinkevich, Martin, Bowling, Michael, Bard, Nolan, Kan,
Morgan, and Billings, Darse. Optimal unbiased estimators for evaluating agent performance. In Proceedings of
the 21st National Conference on Artificial Intelligence - Volume 1, pp. 573-578. AAAI Press, 2006. ISBN 978-1-57735-281-5.