Model Estimation Within Planning and Learning

Alborz Geramifard†, Joshua D. Redding†, Joshua Joseph∗, Jonathan P. How†
†Aerospace Controls Laboratory, MIT, Cambridge, MA 02139 USA
∗Robust Robotics Group, MIT, Cambridge, MA 02139 USA
AGF@MIT.EDU, JREDDING@MIT.EDU, JMJOSEPH@MIT.EDU, JHOW@MIT.EDU

Preliminary work presented at the Planning and Acting with Uncertain Models Workshop, ICML, Bellevue, WA, USA, 2011.

Abstract

Risk and reward are fundamental concepts in the cooperative control of unmanned systems. In this research, we focus on developing a constructive relationship between cooperative planning and learning algorithms to mitigate the learning risk while boosting system (planner + learner) asymptotic performance and guaranteeing the safety of agent behavior. Our framework is an instance of the intelligent cooperative control architecture (iCCA), where the learner incrementally improves on the output of a baseline planner through interaction and constrained exploration. We extend previous work by extracting the embedded parameterized transition model from within the cooperative planner and making it adaptable and accessible to all iCCA modules. We empirically demonstrate the advantage of using an adaptive model over a static model and pure learning approaches in a grid world problem. Finally, we discuss two extensions to our approach that handle cases where the true model cannot be captured through the presumed functional form.

1. Introduction

The concept of risk is common among humans, robots and software agents alike. Amongst the latter two, risk models are routinely combined with relevant observations to analyze potential actions for unnecessary risk or other unintended consequences. Risk mitigation is a particularly interesting topic in the context of the intelligent cooperative control of teams of autonomous mobile robots (Kim, 2003; Weibel & Hansman, 2005). In such a multi-agent setting, cooperative planning algorithms rely on knowledge of underlying transition models to provide guarantees on resulting agent performance. In many situations, however, these models are based on simple abstractions of the system and are lacking in representational power. Using simplified models may aid computational tractability and enable quick analysis, but at the possible cost of implicitly introducing significant risk elements into cooperative plans.

Aimed at mitigating this risk, we adopt the intelligent cooperative control architecture (iCCA) as a framework for tightly coupling cooperative planning and learning algorithms (Redding et al., 2010). Fig. 1 shows the template iCCA framework, which is comprised of a cooperative planner, a learner, and a performance analysis module. The performance analysis module is implemented as a risk-analyzer where actions suggested by the learner can be overridden by the baseline cooperative planner if they are deemed unsafe. This synergistic planner-learner relationship yields a "safe" policy in the eyes of the planner, upon which the learner can improve.

Figure 1. The intelligent cooperative control architecture (iCCA) is a customizable template for tightly integrating planning and learning algorithms.

This research focuses on developing a constructive relationship between cooperative planning and learning algorithms to reduce agent risk while boosting system (planner + learner) performance.
We extend previous work (Geramifard et al., 2011) by extracting the parameterized transition model from within the cooperative planner and making it accessible to all iCCA modules. In addition, we consider the cases where the assumed functional form of the model is correct and where it is incorrect. When the functional form is correct, we update the assumed model using an adaptive parameter estimation scheme and demonstrate that the performance of the resulting system increases. Furthermore, we explore two methods for handling the case when the assumed functional form cannot represent the true model (e.g., a system with state-dependent noise represented through a uniform noise model). First, we enable the learner to be the sole decision maker in areas of high confidence, in which it has experienced many interactions; asymptotically, this extension eliminates the need for the baseline planner altogether. Second, we use past data to estimate the reward of the learner's policy. While the policy used to obtain past data may differ from the current policy of the agent, we can still exploit the Markov assumption to piece together trajectories the learner would have experienced given the past history (Bowling et al., 2008). Additionally, we use the incorrect functional form of the model to further reduce the amount of experience required to accurately estimate the learner's reward using the method of control variates (Zinkevich et al., 2006; White, 2009).

The proposed approach of integrating an adaptive model into the planning and learning framework is shown to yield significant improvements in sample complexity in the grid world domain when the assumed functional form of the model is correct. When it is incorrect, we highlight the drawbacks of our approach and discuss two extensions that can mitigate them.

The paper proceeds as follows: Section 2 provides background information and Section 3 highlights the problem of interest by defining a pedagogical scenario where planning and learning algorithms are used to mitigate stochastic risk. Section 4 outlines the proposed technical approach for learning to mitigate risk. Section 5 presents empirical results in the grid world domain. Section 6 highlights two extensions to our approach for when the true model cannot be captured in the parametric form assumed for the model. Section 7 concludes the paper.

2. Background

2.1. Markov Decision Processes

Markov decision processes (MDPs) provide a general formulation for sequential planning under uncertainty (Howard, 1960; Puterman, 1994; Littman et al., 1995; Kaelbling et al., 1998; Russell & Norvig, 2003). MDPs are a natural framework for solving multi-agent planning problems, as their versatility allows modeling of stochastic system dynamics as well as inter-dependencies between agents. An MDP is defined by the tuple $(\mathcal{S}, \mathcal{A}, P^{a}_{ss'}, R^{a}_{ss'}, \gamma)$, where $\mathcal{S}$ is the set of states and $\mathcal{A}$ is the set of possible actions. Taking action $a$ from state $s$ leads to state $s'$ with probability $P^{a}_{ss'}$ and yields reward $R^{a}_{ss'}$. Finally, $\gamma \in [0, 1]$ is the discount factor used to prioritize early rewards against future rewards ($\gamma$ can be set to 1 only for episodic tasks, where the length of trajectories is fixed). A trajectory of experience is defined by the sequence $s_0, a_0, r_0, s_1, a_1, r_1, \ldots$, where the agent starts at state $s_0$, takes action $a_0$, receives reward $r_0$, transitions to state $s_1$, and so on. A policy $\pi$ is defined as a function from $\mathcal{S} \times \mathcal{A}$ to the probability space $[0, 1]$, where $\pi(s, a)$ corresponds to the probability of taking action $a$ from state $s$.
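To make the notation concrete, the following toy sketch is our own illustration (not code from the paper): it encodes a small MDP as the tuple (S, A, P, R, γ), samples a trajectory s_0, a_0, r_0, s_1, ... under a fixed stochastic policy, and accumulates the discounted return.

    import random

    # Toy 3-state, 2-action MDP: P[s][a] maps next states to probabilities,
    # R[s][a][s_next] gives the transition reward, gamma discounts future rewards.
    S = [0, 1, 2]
    A = [0, 1]
    P = {0: {0: {0: 0.8, 1: 0.2}, 1: {1: 1.0}},
         1: {0: {2: 1.0},          1: {0: 0.5, 2: 0.5}},
         2: {0: {2: 1.0},          1: {2: 1.0}}}          # state 2 is absorbing
    R = {0: {0: {0: -0.01, 1: -0.01}, 1: {1: -0.01}},
         1: {0: {2: 1.0},             1: {0: -0.01, 2: 1.0}},
         2: {0: {2: 0.0},             1: {2: 0.0}}}
    gamma = 0.9

    def pi(s):
        """A fixed stochastic policy pi(s, a): uniform over actions."""
        return random.choice(A)

    def sample_trajectory(s0, horizon=20):
        """Roll out s0, a0, r0, s1, a1, r1, ... for a bounded horizon."""
        s, trajectory = s0, []
        for _ in range(horizon):
            a = pi(s)
            next_states, probs = zip(*P[s][a].items())
            s_next = random.choices(next_states, weights=probs)[0]
            trajectory.append((s, a, R[s][a][s_next]))
            s = s_next
        return trajectory

    def discounted_return(trajectory):
        """Sum of gamma^t * r_t along the sampled trajectory."""
        return sum((gamma ** t) * r for t, (_, _, r) in enumerate(trajectory))

    print(discounted_return(sample_trajectory(0)))

The state, action, and reward values here are arbitrary placeholders chosen only to exercise the definitions above.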
The value of each state-action pair under policy $\pi$, $Q^\pi(s, a)$, is defined as the expected sum of discounted rewards when the agent takes action $a$ from state $s$ and follows policy $\pi$ thereafter:

$$Q^\pi(s, a) = E_\pi\left[\sum_{t=1}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right].$$

The optimal policy $\pi^*$ maximizes the above expectation for all state-action pairs: $\pi^* = \mathrm{argmax}_a\, Q^{\pi^*}(s, a)$.

2.2. Reinforcement Learning in MDPs

The underlying goal of the two reinforcement learning algorithms presented here is to improve the performance of the cooperative planning system over time, using observed rewards to explore new agent behaviors that may lead to more favorable outcomes. The details of how these algorithms accomplish this goal are discussed in the following sections.

2.2.1. Sarsa

A popular approach among MDP solvers is to find an approximation to $Q^\pi(s, a)$ (policy evaluation) and update the policy with respect to the resulting values (policy improvement). Temporal difference learning (TD) (Sutton, 1988) is a traditional policy evaluation method in which the current $Q(s, a)$ is adjusted based on the difference between the current estimate of $Q$ and a better approximation formed by the actual observed reward and the estimated value of the following state. Given $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$ and the current value estimates, the temporal difference (TD) error, $\delta_t$, is calculated as:

$$\delta_t(Q) = r_t + \gamma Q^\pi(s_{t+1}, a_{t+1}) - Q^\pi(s_t, a_t).$$

The one-step TD algorithm, also known as TD(0), updates the value estimates using:

$$Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha\, \delta_t(Q), \qquad (1)$$

where $\alpha$ is the learning rate. Sarsa (state action reward state action) (Rummery & Niranjan, 1994) is basic TD for which the policy is directly derived from the Q values as:

$$\pi^{\mathrm{Sarsa}}(s, a) = \begin{cases} 1 - \epsilon, & a = \mathrm{argmax}_a\, Q(s, a) \\ \epsilon / |\mathcal{A}|, & \text{otherwise}, \end{cases}$$

where $\epsilon$ is the probability of taking a random action and ties are broken randomly if more than one action maximizes $Q(s, a)$. This policy is also known as the $\epsilon$-greedy policy.

3. The GridWorld Domain: A Pedagogical Example

Consider the gridworld domain shown in Fig. 2-(a), in which the task is to navigate from the bottom-middle cell (•) to one of the top corner cells (⋆), while avoiding the danger zone (◦), where the agent is eliminated upon entrance. At each step the agent can take any action from the set {↑, ↓, ←, →}. However, due to wind disturbances unbeknownst to the agent, there is a 20% chance the agent will be transferred into a neighboring unoccupied grid cell upon executing each action. The rewards for reaching either of the goal regions and the danger zone are +1 and −1, respectively, while every other action results in a −0.01 reward.

Let us first consider the conservative policy shown in Fig. 2-(b), designed for high values of wind noise. As expected, the nominal path, highlighted as a gray watermark, follows the long but safe route to the top left goal. The color of each grid cell represents the true value of each state under the policy: green indicates positive values, white indicates zero, and the values of blocked cells are shown in red. Fig. 2-(c) depicts a policy designed to reach the right goal corner from every location. This policy ignores the existence of the noise, hence its nominal path gets close to the danger zone. Finally, Fig. 2-(d) shows the optimal solution. Notice how its nominal path avoids getting close to the danger zone.
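As a concrete illustration of the TD(0) update and ε-greedy policy above, the following minimal tabular Sarsa sketch is our own (the env object, with reset() and step(s, a) returning (r, s_next, done), is an assumed stand-in for a gridworld-like environment, not the authors' implementation):

    import random
    from collections import defaultdict

    Q = defaultdict(float)                     # tabular Q[(s, a)] values
    alpha, gamma, epsilon = 0.1, 0.95, 0.1
    ACTIONS = ['up', 'down', 'left', 'right']

    def epsilon_greedy(s):
        """epsilon-greedy policy derived from the current Q values."""
        if random.random() < epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q[(s, a)])   # ties resolved arbitrarily

    def sarsa_episode(env, max_steps=100):
        """One episode of Sarsa: on-policy TD(0) over (s, a, r, s', a') tuples."""
        s = env.reset()
        a = epsilon_greedy(s)
        for _ in range(max_steps):
            r, s_next, done = env.step(s, a)           # assumed env interface
            a_next = epsilon_greedy(s_next)
            # TD error: delta = r + gamma * Q(s', a') - Q(s, a)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            Q[(s, a)] += alpha * delta                 # TD(0) update, Eq. (1)
            if done:
                break
            s, a = s_next, a_next

Repeating sarsa_episode over many episodes drives Q toward the values of the ε-greedy policy; this is the pure-learning baseline the rest of the paper compares against.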
Model-free learning techniques such as Sarsa can find the optimal policy of the noisy environment through interaction, but they require a great deal of training examples. More critically, they may deliberately move the agent towards dangerous regions simply to gain information about those areas. Previously, we demonstrated that when a planner (e.g., the methods used to generate the policies in Fig. 2-(b),(c)) is integrated with a learner, it can rule out intentionally poor decisions, resulting in safer exploration. Furthermore, the planner's policy can be used as a starting point for the learner to bootstrap on, potentially reducing the amount of data required by the learner to master the task (Redding et al., 2010; Geramifard et al., 2011). In our past work, we considered the case where the model used for planning and risk analysis was static. In this paper, we expand our framework by representing the model as a separate entity which can be adapted through the learning process. The focus here is on the case where the parametric form of the approximated model (T̂) includes the true underlying model (T) (e.g., assuming an unknown uniform noise parameter for the gridworld domain). In Section 6, we discuss the drawbacks of our approach when T̂ is unable to exactly represent T and introduce two potential extensions.

Adding a parametric model to the planning/learning scheme is easily motivated by the case when the initial bootstrapped policy is wrong, or built from incorrect assumptions. In such a case, it is more effective to simply switch the underlying policy with a better one, rather than requiring a plethora of interactions to learn from and refine a poor initial policy. The remainder of this paper shows that by representing the model as a separate entity which can be adapted through the learning process, we enable the ability to intelligently switch out the underlying policy, which is then refined by the learning process.

4. Technical Approach

In this section, we introduce the architecture in which this research was carried out. First, notice the addition of the "Models" module in the iCCA framework as implemented (Fig. 3) when compared to the template architecture of Fig. 1. This module allows us to adapt the agent's transition model in light of actual transitions experienced. An estimated model, T̂, is output and is used to sample successive states when simulating trajectories. As mentioned above, this model is assumed to be of the correct functional form (e.g., a single uncertain parameter). Additionally, Fig. 3 shows a dashed box outlining the learner and the risk-analysis modules, which are formulated together within a Markov decision process (MDP) to enable the use of reinforcement learning algorithms in the learning module. The Sarsa algorithm (Rummery & Niranjan, 1994) is implemented as the system learner, which uses past experiences to guide exploration and then suggests behaviors that are likely to lead to more favorable outcomes than those suggested by the baseline planner. The performance analysis block is implemented as a risk analysis tool where actions suggested by the learner can be rejected if they are deemed too risky. The following sections describe the iCCA blocks in further detail.

4.1. Cooperative Planner

At its fundamental level, the cooperative planning algorithm used in iCCA yields a solution to the multi-agent path planning, task assignment or resource allocation problem, depending on the domain.
Figure 2. The gridworld domain is shown in (a), where the task is to navigate from the bottom middle cell (•) to one of the top corner cells (⋆). The danger region (◦) is an off-limits area the agent should avoid. The corresponding policies and value functions are depicted for (b) a conservative policy to reach the left corner in most states, (c) an aggressive policy which aims for the top right corner, and (d) the optimal policy.

This means that it seeks to optimize an underlying, user-defined objective function. Many existing cooperative control algorithms use observed performance to calculate temporal-difference errors which drive the objective function in the desired direction (Murphey & Pardalos, 2002; Bertsekas & Tsitsiklis, 1996). Regardless of how it is formulated (e.g., MILP, MDP, CBBA), the cooperative planner, or cooperative control algorithm, is the source of baseline plan generation within iCCA. We assume that this module can provide safe solutions to the problem in a reasonable amount of time.

4.2. Learning and Risk-Analysis

As discussed earlier, learning algorithms may encourage the agent to explore dangerous situations (e.g., flying close to the danger zone) in the hope of improving long-term performance. While some degree of exploration is necessary, unbounded exploration can lead to undesirable scenarios such as crashing or losing a UAV. To avoid such outcomes, we implemented the iCCA performance analysis module as a risk analysis element where candidate actions are evaluated for safety against an adaptive estimated transition model T̂. Actions deemed too "risky" are replaced with the safe action suggested by the cooperative planner. As the risk-analysis and learning modules are coupled within an MDP formulation, as shown by the dashed box in Fig. 3, we now discuss the details of the learning and risk-analysis algorithms.

Previous research employed a risk analysis scheme that used the planner's transition model, which can be stochastic, to mitigate risk (Geramifard et al., 2011). In this research, we pull this embedded model from within the planner and allow it to be updated online. This allows both the planner and the risk-analysis module to benefit from model updates. Algorithm 1 explains the risk analysis process, where we assume the existence of the function constrained: S → {0, 1}, which indicates whether being in a particular state is allowed. We define "risk" as the probability of visiting any of the constrained states. The core idea is to use Monte-Carlo sampling to estimate the risk level associated with the given state-action pair if the planner's policy is applied thereafter.

Figure 3. The intelligent cooperative control architecture as implemented. The conventional reinforcement learning method (e.g., Sarsa, Natural Actor Critic) sits in the learner box while the performance analysis block is implemented as a risk analysis tool. Together, the learner and the risk-analysis modules are formulated within a Markov decision process (MDP).
Algorithm 1: safe
Input: s, a, T̂
Output: isSafe
1  risk ← 0
2  for i ← 1 to M do
3      t ← 1
4      s_t ∼ T̂(s, a)
5      while not constrained(s_t) and not isTerminal(s_t) and t < H do
6          s_{t+1} ∼ T̂(s_t, π^p(s_t))
7          t ← t + 1
8      risk ← risk + (1/i)(constrained(s_t) − risk)
9  isSafe ← (risk < ψ)

This is done by simulating M trajectories from the current state s. The first action is the learner's suggested action a, and the rest of the actions come from the planner policy, π^p. The adaptive approximate model, T̂, is used to sample successive states. Each trajectory is bounded by a fixed horizon H, and the risk of taking action a from state s is estimated by the probability of a simulated trajectory reaching a "risky" (i.e., constrained) state within horizon H. If this risk is below a given threshold, ψ, the action is deemed to be safe.

It is important to note that, though not explicitly required, the cooperative planner should take advantage of the updated transition model by replanning. This ensures that the risk analysis module is not overriding actions deemed "risky" by an updated model with actions deemed "safe" by an outdated model. Such behavior would result in convergence of the cooperative learning algorithm to the baseline planner policy, and the system would not benefit from the iCCA framework.
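The following Python sketch is our own rendering of the Monte-Carlo risk check in Algorithm 1; T_hat, planner_policy, constrained, and is_terminal are assumed callables standing in for T̂, π^p, the constrained indicator, and the terminal test, and the default values of M, H, and psi are illustrative only:

    def safe(s, a, T_hat, planner_policy, constrained, is_terminal,
             M=5, H=20, psi=0.2):
        """Estimate the risk of taking action a in state s and then following
        the planner's policy, using M sampled trajectories of horizon H.
        Returns True when the estimated risk is below the threshold psi."""
        risk = 0.0
        for i in range(1, M + 1):
            t = 1
            s_t = T_hat(s, a)                           # first step: learner's action
            while (not constrained(s_t) and not is_terminal(s_t) and t < H):
                s_t = T_hat(s_t, planner_policy(s_t))   # then follow the planner
                t += 1
            # incremental mean of the indicator "trajectory hit a constrained state"
            risk += (float(constrained(s_t)) - risk) / i
        return risk < psi

Because T_hat is the adaptive estimate, any update to the model immediately changes which learner suggestions pass this check.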
For learning schemes that do not represent the policy as a separate entity, such as Sarsa, integration within the iCCA framework is not immediately obvious. Previously, we presented an approach for integrating learning approaches without an explicit actor component (Redding et al., 2010). Our idea was motivated by the concept of the Rmax algorithm (Brafman & Tennenholtz, 2001). We illustrate our approach through a parent-child analogy, where the planner takes the role of the parent and the learner takes the role of the child. In the beginning, the child does not know much about the world, hence, for the most part s/he takes actions advised by the parent. While learning from such actions, after a while, the child feels comfortable about taking a self-motivated action, having been through the same situation many times. Seeking permission from the parent, the child may take the action if the parent thinks the action is safe; otherwise the child should follow the action suggested by the parent.

Our approach for safe, cooperative learning is shown in Algorithm 2. The cyan section (lines 1-11) highlights our previous cooperative method ([ACC 2011]; Geramifard et al., 2011), while the green region (lines 12-17) depicts the new version of the algorithm, which includes model adaptation.

Algorithm 2: Cooperative Learning
Input: N, π^p, s, learner
Output: a
1   a ← π^p(s)
2   π^l ← learner.π
3   knownness ← min{1, count(s, a)/N}
4   if rand() < knownness then
5       a^l ∼ π^l(s, a)
6       if safe(s, a^l, T̂) then
7           a ← a^l
8   else
9       count(s, a) ← count(s, a) + 1
10  Take action a and observe r, s′
11  learner.update(s, a, r, s′)
12  T̂ ← NewModelEstimation(s, a, s′)
13  if ‖T̂^p − T̂‖ > ξ then
14      T̂^p ← T̂
15      π^p ← Planner.replan()
16      if π^p is changed then
17          reset all counts to zero

On every step, the learner inspects the action suggested by the planner and estimates the knownness of the state-action pair by considering the number of times that pair has been experienced following the planner's suggestion (line 3). The knownness parameter N controls the speed of the shift from following the planner's policy to following the learner's policy. Given the knownness of the state-action pair, the learner probabilistically decides to select an action from its own policy (line 4). If that action is deemed to be safe, it is executed; otherwise, the planner's policy overrides the learner's choice (lines 5-7). If the planner's action is selected, the knownness count of the corresponding state-action pair is incremented (line 9). Finally, the learner updates its parameters depending on the choice of the learning algorithm (line 11).

A drawback of Algorithm 2 is that state-action pairs explicitly forbidden by the risk analyzer will not be visited. Hence, if the model is designed poorly, it can hinder the learning process in parts of the state space for which the risk is overestimated. Furthermore, the planner can take advantage of the adaptive model and revisit its policy. Hence we extended the previous algorithm to let the model be adapted during the learning phase (line 12). If the change to the model used for planning crosses a predefined threshold (ξ), the planner revisits its policy and keeps a record of the new model (lines 13-15). If the policy changes, the counts of all state-action pairs are reset to zero so that the learner starts watching the new policy from scratch (lines 16-17). An important observation is that the planner's policy should be seen as safe through the eyes of the risk analyzer at all times. Otherwise, most actions suggested by the learner will be deemed too risky by mistake, as they are followed by the planner's policy.
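The sketch below is a condensed Python rendering of one decision step of Algorithm 2, again our own sketch rather than the authors' code; the planner, learner, and model objects (with policy/replan, sample_action/update, and observe/change_since_last_plan methods) and the safe_fn and env_step callables are assumed interfaces mirroring the pseudocode:

    import random

    def cooperative_step(s, planner, learner, model, count, safe_fn, env_step,
                         N=10, xi=0.05):
        """One step of the cooperative learner (Algorithm 2 sketch): the
        planner's action is the default; the learner may override it once the
        state-action pair is sufficiently known and the override passes the
        risk check."""
        a = planner.policy(s)                                   # line 1: baseline action
        knownness = min(1.0, count.get((s, a), 0) / N)          # line 3
        if random.random() < knownness:                         # line 4
            a_learner = learner.sample_action(s)                # line 5
            if safe_fn(s, a_learner, model):                    # line 6
                a = a_learner                                   # line 7
        else:
            count[(s, a)] = count.get((s, a), 0) + 1            # line 9
        r, s_next = env_step(s, a)                              # line 10: act and observe
        learner.update(s, a, r, s_next)                         # line 11
        model.observe(s, a, s_next)                             # line 12: adapt the model
        if model.change_since_last_plan() > xi:                 # line 13
            if planner.replan(model):                           # lines 14-15 (True if policy changed)
                count.clear()                                   # lines 16-17: restart knownness counts
        return a, s_next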
5. Experimental Results

We compare the effectiveness of the adaptive model approach combined with the iCCA framework (AM-iCCA) with respect to (i) our previous work with a static model (iCCA) and (ii) the pure learning approach. All algorithms used Sarsa for learning with the following form of learning rate:

$$\alpha_t = \alpha_0\, \frac{N_0 + 1}{N_0 + \text{Episode}\#^{1.1}}.$$

For each algorithm, we used the best $\alpha_0$ from {0.01, 0.1, 1} and $N_0$ from {100, 1000, 10^6}. During exploration, we used an ε-greedy policy with ε = 0.1. Value functions were represented using lookup tables. Both iCCA methods started with a noise estimate of 40% with a count weight of 100, and the conservative policy (Fig. 2-b). We used 5 Monte-Carlo simulations to evaluate risk and rejected actions for which any of the trajectories entered the danger zone. The knownness parameter N was set to 10. For AM-iCCA, the noise parameter was estimated as

$$\text{noise} = \frac{\#\,\text{unintended agent moves} + \text{initial weight}}{\#\,\text{total number of moves} + \text{initial weight}}.$$

The planner switched to the aggressive policy (Fig. 2-c) whenever the wind disturbance estimate dropped below 25%. Each algorithm was tested for 100 trials. Error bars represent 95% confidence intervals.
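The noise estimate above can be maintained with two counters; the sketch below is one plausible reading (our assumption, not the authors' code) in which the initial weight enters as prior pseudo-counts so that the estimate starts at the 40% prior with a count weight of 100:

    class NoiseEstimator:
        """Running estimate of the uniform wind-noise parameter, interpreting
        the initial weight as prior pseudo-counts (an assumption on our part)."""

        def __init__(self, prior_noise=0.4, initial_weight=100):
            self.unintended = prior_noise * initial_weight   # pseudo-counts from the prior
            self.total = float(initial_weight)

        def observe(self, intended_next_state, actual_next_state):
            """Record one executed move; a mismatch counts as an unintended move."""
            self.total += 1
            if actual_next_state != intended_next_state:
                self.unintended += 1

        @property
        def noise(self):
            return self.unintended / self.total

    # e.g. est = NoiseEstimator(); est.observe((1, 2), (1, 3)); print(est.noise)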
Figure 4. Empirical results of the AM-iCCA, iCCA, and Sarsa algorithms in the grid world problem.

Fig. 4 compares the cumulative return obtained in the grid world domain for Sarsa, iCCA, and AM-iCCA as a function of the number of interactions. The expected performance of both static policies is shown as horizontal lines. The improvement of iCCA with a static model over the pure learning approach is statistically significant in the beginning, while the improvement becomes less significant as more interactions are obtained.

Although initialized with the conservative policy, the adaptive approach (shown in green in Fig. 4) quickly learned that the noise in the system was much less than the initial 40% estimate and switched to (and refined) the aggressive policy. As a result of this early switch, AM-iCCA outperformed both iCCA and Sarsa. Over time, however, all methods reached the same level of performance. On that note, it is important to see that all learning methods (Sarsa, iCCA, AM-iCCA) improved on the baseline static policies, highlighting their sub-optimality.

6. Extensions

So far, we have assumed that the true model can be represented accurately within the functional form of the approximated model. In this section, we discuss the challenges involved in using our proposed methods when this condition does not hold and suggest two extensions to overcome such challenges.
Returning to the grid world domain, consider the case where the 20% noise is not applied to all states. Fig. 5 depicts such a scenario, where noise is applied only to the grid cells marked with a ∗. While passing close to the danger zone is safe in this scenario, when the agent assumes the uniform noise model by mistake, it generalizes the noisy movements to all states, including the area close to the danger zone. This can cause AM-iCCA to converge to a suboptimal policy, as the risk analyzer filters optimal actions suggested by the learner due to the incorrect model assumption: even when the learner suggests optimal actions, such as taking → in the 3rd row, they are deemed too risky, as the planner's policy which is followed afterward is no longer safe.

Figure 5. The solution to the grid world scenario where the noise is only applied in windy grid cells (∗).

The root of this problem is that the risk analyzer has the final authority in selecting the actions from the learner and the planner; hence both of our extensions focus on revoking this authority. The first extension makes the risk analyzer mandatory only to a limited degree. Returning to our parent/child analogy, the child may simply stop checking whether the parent thinks an action is safe once s/he feels comfortable taking a self-motivated action. Thus, the learner would eventually circumvent the need for a planner altogether. More specifically, line 6 of Algorithm 2 is changed so that if the knownness of a particular state reaches a certain threshold, probing the safety of the action is no longer mandatory. While this approach introduces another parameter to the framework, it guarantees that, in the limit, the learner becomes the final decision maker. Hence the resulting method converges to the same solution as Sarsa asymptotically.

The second extension to deal with an incorrect model is to use the previous experience of our agent to estimate the reward of the learner's policy. By accurately estimating the reward of the learner's policy from past data, we can disregard the opinion of the risk analyzer, whose estimate of the learner's policy is based on an incorrect model. For a given action a from the learner and the planner's policy π^p, we evaluate this joint policy's return by combining two approaches. First, we estimate the reward of the joint policy by calculating the reward our agent would receive from the training data when taking the joint policy, conditioned on the agent's current state. Unfortunately, this estimate of the joint policy's reward can suffer from high variance if little data has been seen. To reduce the variance of this estimate, we use the method of control variates (Zinkevich et al., 2006; White, 2009). A control variate is a random variable y that is correlated with a random variable x and is used to reduce the variance of the estimate of x's expectation. Assuming we know y's expectation, we can write E[x] = E[x − α(y − B)], where E[y] = B. Using the incorrect model, we can form a control variate that is correlated with the reward experienced by our agent and can therefore be used to reduce the variance of the joint policy's estimated reward.
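A small numerical sketch of the control-variate idea (our own illustration, not the estimator of Zinkevich et al. or White): x is the noisy quantity whose mean we want, y is correlated side information with known expectation B (e.g., the return predicted by the approximate model), and subtracting α(y − B) leaves the mean unchanged while shrinking the variance.

    import random
    import statistics

    random.seed(1)
    B = 5.0                                  # known expectation of the control variate y
    samples_x, samples_y = [], []
    for _ in range(1000):
        y = random.gauss(B, 2.0)             # correlated side information
        x = y + random.gauss(0.0, 0.5)       # noisy quantity whose mean we want
        samples_x.append(x)
        samples_y.append(y)

    alpha = 1.0                              # in practice chosen to minimize variance
    plain = samples_x
    adjusted = [x - alpha * (y - B) for x, y in zip(samples_x, samples_y)]

    print(statistics.mean(plain), statistics.stdev(plain))        # same mean, large spread
    print(statistics.mean(adjusted), statistics.stdev(adjusted))  # same mean, small spread

Here the adjusted estimator has roughly one quarter of the raw estimator's standard deviation while remaining unbiased, which is the effect exploited when the (incorrect) model supplies the correlated quantity.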
7. Conclusions

In this paper, we extended our previous iCCA framework by representing the model as a separate entity which can be shared by the planner and the risk analyzer. When the true functional form of the transition model is known, we discussed how the new method can facilitate a safer exploration scheme through more accurate risk analysis. Empirical results in a grid world domain verified the potential of the new approach to reduce sample complexity. Finally, we argued through an example that model adaptation can hurt asymptotic performance if the true model cannot be captured accurately. For this case, we provided two extensions to our method to mitigate the problem. For future work, we plan to apply our algorithms to large UAV mission planning domains to verify their scalability.

References

Bertsekas, Dimitri P. and Tsitsiklis, John N. Neuro-Dynamic Programming (Optimization and Neural Computation Series, 3). Athena Scientific, May 1996. ISBN 1886529108.

Bhatnagar, Shalabh, Sutton, Richard S., Ghavamzadeh, Mohammad, and Lee, Mark. Incremental natural actor-critic algorithms. In Platt, John C., Koller, Daphne, Singer, Yoram, and Roweis, Sam T. (eds.), NIPS, pp. 105-112. MIT Press, 2007.

Bowling, Michael, Johanson, Michael, Burch, Neil, and Szafron, Duane. Strategy evaluation in extensive games with importance sampling. In Proceedings of the 25th Annual International Conference on Machine Learning (ICML), 2008.

Brafman, Ronen I. and Tennenholtz, Moshe. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3:213-231, 2001.

Geramifard, Alborz, Redding, Joshua, Roy, Nicholas, and How, Jonathan P. UAV cooperative control with stochastic risk models. In American Control Conference (ACC), June 2011.

Howard, R. A. Dynamic Programming and Markov Processes, 1960.

Kaelbling, Leslie Pack, Littman, Michael L., and Cassandra, Anthony R. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99-134, 1998.

Kim, Jongrae. Discrete approximations to continuous shortest-path: Application to minimum-risk path planning for groups of UAVs. In Proc. of the 42nd Conference on Decision and Control, pp. 9-12, 2003.

Littman, Michael L., Dean, Thomas L., and Kaelbling, Leslie Pack. On the complexity of solving Markov decision problems. In Proc. of the Eleventh International Conference on Uncertainty in Artificial Intelligence, pp. 394-402, 1995.

Murphey, R. and Pardalos, P. M. Cooperative Control and Optimization. Kluwer Academic Publishers, 2002.

Puterman, M. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

Redding, J., Geramifard, A., Undurti, A., Choi, H.-L., and How, J. An intelligent cooperative control architecture. In American Control Conference (ACC), pp. 57-62, 2010.

Rummery, G. A. and Niranjan, M. Online Q-learning using connectionist systems (Tech. Rep. No. CUED/F-INFENG/TR 166). Cambridge University Engineering Department, 1994.

Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, 2nd Edition. Prentice-Hall, Englewood Cliffs, NJ, 2003.

Sutton, Richard S. Learning to predict by the methods of temporal differences. Machine Learning, 3:9-44, 1988.
Weibel, R. and Hansman, R. An integrated approach to evaluating risk mitigation measures for UAV operational concepts in the NAS. In AIAA Infotech@Aerospace Conference, AIAA-2005-6957, 2005.

White, Martha. A general framework for reducing variance in agent evaluation, 2009.

Zinkevich, Martin, Bowling, Michael, Bard, Nolan, Kan, Morgan, and Billings, Darse. Optimal unbiased estimators for evaluating agent performance. In Proceedings of the 21st National Conference on Artificial Intelligence - Volume 1, pp. 573-578. AAAI Press, 2006. ISBN 978-1-57735-281-5.