Hierarchical Decomposition of Multi-agent Markov Decision Processes with Application to Health Aware Planning

by Yu Fan Chen

B.A.Sc., Aerospace Engineering, University of Toronto (2012)

Submitted to the Department of Aeronautics and Astronautics in partial fulfillment of the requirements for the degree of Master of Science in Aerospace Engineering at the Massachusetts Institute of Technology, September 2014.

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Aeronautics and Astronautics, August 21, 2014
Certified by: Jonathan P. How, Richard C. Maclaurin Professor of Aeronautics and Astronautics, Thesis Supervisor
Accepted by: Paulo C. Lozano, Associate Professor of Aeronautics and Astronautics, Chair, Graduate Program Committee

Submitted to the Department of Aeronautics and Astronautics on August 21, 2014, in partial fulfillment of the requirements for the degree of Master of Science in Aerospace Engineering.

Abstract

Multi-agent robotic systems have attracted the interest of both researchers and practitioners because they provide more capabilities and afford greater flexibility than single-agent systems. Coordination of individual agents within large teams is often challenging because of the combinatorial nature of such problems; in particular, the number of possible joint configurations is the product of that of every agent. Further, real-world applications often contain various sources of uncertainty. This thesis investigates techniques to address the scalability issue of multi-agent planning under uncertainty.

This thesis develops a novel hierarchical decomposition approach (HD-MMDP) for solving Multi-agent Markov Decision Processes (MMDPs), a natural framework for formulating stochastic sequential decision-making problems. In particular, the HD-MMDP algorithm builds a decomposition structure by exploiting coupling relationships in the reward function. A number of smaller subproblems are formed and solved individually. The planning spaces of each subproblem are much smaller than those of the original problem, which improves computational efficiency, and the solutions to the subproblems can be combined to form a solution (policy) to the original problem. The HD-MMDP algorithm is applied to a ten-agent persistent search and track (PST) mission and shows more than 35% improvement over an existing algorithm developed specifically for this domain.

This thesis also contributes to the development of the software infrastructure that enables hardware experiments involving multiple robots. In particular, the thesis presents a novel optimization-based multi-agent path planning algorithm, which was tested in simulation and in hardware (quadrotor) experiments. The HD-MMDP algorithm is also used to solve a multi-agent intruder monitoring mission implemented on real robots.

Thesis Supervisor: Jonathan P. How
Title: Richard C. Maclaurin Professor of Aeronautics and Astronautics

Acknowledgments

First and foremost, I would like to thank my advisor, Professor Jonathan How, for his guidance and support in the past two years.
Jon has the ability to accurately convey arcane concepts through simple diagrams, and always tries to draw connections to the big picture. Jon is also adamant about making good slides, which helped me profoundly in improving the quality of my talks. I would also like to thank all members of the Aerospace Controls Lab, whose support has been indispensable to my research. In particular, I am grateful to Mark Cutler; thanks for bearing with me as I flew quadrotors into the walls over and over again. He is a master of hardware and ROS, and is always willing to lend a hand. I would also like to thank Kemal Ure, with whom I worked closely on this project. He was always there to answer questions and provide feedback, and his input has been essential to this work. In addition, thanks to Jack Quindlen, Rob Grande, Sarah Ferguson, Chris Lowe, and Bobby Klein, with whom I shared memories of many late night problem sets and quals practice sessions. Also, special thanks to Beipeng Mu, for sharing the cultural heritage; to Luke Johnson, for helping me in Portland; to Trevor Campbell, for sharing the UofT spirit; and to Ali Agha, for his patience and generosity. Finally, I would like to thank my wonderful parents. Thank you mom and dad, for your unconditional support throughout the years of college and graduate school. Also, I am grateful to my twin brother, whose presence always managed to confuse and amuse my lab mates. To all my friends from UofT and MIT, thanks for being there through the ups and downs.

This research has been funded by the Boeing Company. The author acknowledges Boeing Research & Technology for support of the indoor flight facilities.

Contents

1 Introduction
  1.1 Background and Motivation
  1.2 Related Work
    1.2.1 Dynamic Programming Approaches
    1.2.2 Structural Decomposition Approaches
    1.2.3 Optimization Based Approaches
  1.3 Outline and Summary of Contributions

2 Preliminaries
  2.1 Markov Decision Processes
  2.2 Multi-agent Markov Decision Processes
  2.3 Dynamic Programming Algorithms
  2.4 Extensions to Markov Decision Processes
    2.4.1 Partially Observable Markov Decision Processes
    2.4.2 Decentralized Markov Decision Process (dec-MMDP)
  2.5 Summary

3 Hierarchical Decomposition of MMDP
  3.1 Structure of Reward Functions
    3.1.1 Construction of Reward Functions
    3.1.2 Factored Reward Function
    3.1.3 Useful Functional Forms
  3.2 Hierarchically Decomposed MMDP
    3.2.1 Decomposition
    3.2.2 Solving Leaf Tasks
    3.2.3 Ranking of Leaf Tasks
    3.2.4 Agent Assignment
    3.2.5 Summary
  3.3 Run Time Complexity Analysis
  3.4 Summary

4 Application to Health Aware Planning
  4.1 Persistent Search and Track Mission (PST)
    4.1.1 Mission Description
    4.1.2 Problem Formulation
    4.1.3 Structural Decomposition
    4.1.4 Empirical Analysis
    4.1.5 Simulation Result
  4.2 Intruder Monitoring Mission
    4.2.1 Mission Description
    4.2.2 Problem Formulation
    4.2.3 Extension to HD-MMDP
    4.2.4 Simulation Results
  4.3 Summary

5 Experimental Setup
  5.1 RAVEN Testbed
  5.2 Autonomous Vehicles
  5.3 Software Architecture
  5.4 Path Planner
    5.4.1 Potential Field
    5.4.2 Sequential Convex Program
    5.4.3 Incremental Sequential Convex Program
  5.5 Intruder Monitoring Mission
  5.6 Summary

6 Conclusions
  6.1 Summary
  6.2 Future Work
    6.2.1 Optimization on Task Tree Structure
    6.2.2 Formulation of Leaf Tasks

References

List of Figures

1-1  Applications of Multi-agent Systems

3-1  Construction of Multi-agent Reward Function
3-2  Two Agent Consensus Problem
3-3  Summation Reward Function
3-4  Multiplicative Reward Function
3-5  Example of Task Tree Structure
3-6  Solution to Leaf Tasks
3-7  Relative Importance of Leaf Tasks
3-8  Ranking of Leaf Tasks
3-9  Computation of Minimum Accepted Expected Reward
3-10 Assignment of Agents to Leaf Tasks

4-1  Persistent Search and Track Mission
4-2  Task Tree of Persistent Search and Track Mission
4-3  Three Agent PST Mission Simulation Result
4-4  Effect of Adopting Decomposition Structure
4-5  Pruning Solution Space
4-6  Simulation of Ten Agent PST Mission
4-7  Intruder Monitor Domain
4-8  Leaf Tasks in Intruder Monitoring Domain
4-9  Action Space of Intruder Monitoring Mission
4-10 State Transition Model of Intruder Monitoring Mission
4-11 Task Tree of Intruder Monitoring Mission
4-12 Simulation of Intruder Monitoring Mission

5-1  Robotic Vehicles and Experiment Environment
5-2  Schematic of RAVEN Testbed
5-3  Image Projection Capability
5-4  Autonomous Robotic Agents
5-5  Software Architecture
5-6  A Live Lock Configuration for Potential Field Algorithm
5-7  Typical Convergence Pattern of SCP
5-8  Modeling Corner Constraint
5-9  Convergence Pattern Around a Corner
5-10 Convergence Pattern for Multi-agent System
5-11 Multi-agent Path Planning Test Cases
5-12 Hardware Demonstration of iSCP Algorithm
5-13 Hardware Demonstration of Intruder Monitoring Mission

List of Tables

3.1  Local Optimal Policies can be Jointly Suboptimal
3.2  Run Time Complexity of HD-MMDP
4.1  Computational Time of Value Iteration on PST Mission

Chapter 1

Introduction

1.1 Background and Motivation

Multi-agent systems arise in many applications of interest. In the manufacturing domain, efficient resource allocation [17, 58] among a group of robotic agents is crucial for maximizing production output. Similarly, in military applications, intelligent organization of multiple unmanned vehicles [46, 18, 55] is critical for surveillance and reconnaissance missions. Also, in the network management domain, smooth coordination of multiple servers [12, 51] is essential for maximizing network traffic. In each of these scenarios, the objective is to determine an action for each agent to optimize a joint performance metric for the team.

In addition, various sources of uncertainty arise naturally in such planning problems. In the manufacturing domain, there exist process and sensor noise that affect the robots' performance. In surveillance and reconnaissance missions, the behavior patterns of targets are often uncertain and can only be modeled probabilistically. In network management, servers may fail due to internal flaws (software bugs) or malicious attacks from external sources, both of which can be modeled as stochastic events. Thus, these applications can be framed as multi-agent stochastic sequential decision problems, of which a major challenge is to hedge against future uncertainties.

Figure 1-1: Applications of Multi-agent Systems. Left sub-figure shows multiple robotic agents in a manufacturing setting; middle sub-figure shows a multi-vehicle surveillance mission; right sub-figure shows a computer network.

A Markov Decision Process (MDP) is a mathematical framework for stochastic sequential decision problems. In particular, given a probabilistic transition model, the goal is to find a policy that maximizes long term reward. A policy is a mapping that selects an appropriate action for each possible state of the system. A naive approach is to develop a policy for each agent independently. However, explicit coordination is desired when (i) a task is inherently multi-agent, such as transportation of a large object with combined effort, and (ii) a task is highly stochastic, such that each agent should plan for contingencies in case of degraded capabilities of its teammates. Another approach that explicitly accounts for coordination among a group of agents is to plan in the joint spaces. However, given the combinatorial nature of a multi-agent system, the joint planning space scales exponentially in the number of agents. In a simple surveillance domain, if the physical space is partitioned into 100 regions, each representing a possible state of a vehicle, a team comprised of ten vehicles would have $100^{10}$ joint states. It is computationally prohibitive to enumerate and iterate over such enormous planning spaces.

In short, multi-agent planning problems have great application value, but are technically challenging because of (i) the need to model agent-level and environmental uncertainties, and (ii) the exponential growth of the planning space as the number of agents increases.
The primary objective of this thesis is to develop an algorithm that can handle uncertainty models and scale favorably with the number of agents. The proposed algorithm automatically identifies a decomposition structure from the reward function, and attains computational tractability by exploiting this decomposition structure.

1.2 Related Work

Researchers have previously formulated multi-agent sequential decision problems as Markov Decision Processes (MDP) [42] and Multi-agent Markov Decision Processes (MMDP) [10]. Although an optimal policy can in theory be computed via dynamic programming (DP) [5], the exponential growth of joint planning spaces has made such approaches impractical to implement. Instead, researchers have developed various generic methods (Section 1.2.1) for solving MDPs approximately at lower computational cost. In addition, researchers have studied various structural decomposition methods (Section 1.2.2) to lower computational complexity. Further, researchers have considered solving multi-agent problems through the formulation of a mathematical program (Section 1.2.3). A precise definition of MDP and MMDP, as well as detailed descriptions of classical DP methods, will be presented in the next chapter.

1.2.1 Dynamic Programming Approaches

Asynchronous Dynamic Programming

Classical DP methods maintain a tabular value function with an entry for every state, and perform a Bellman update once for every state per iteration. Asynchronous DP [7, 35] is a class of DP algorithms that performs Bellman updates at different frequencies for different parts of the state space. The key idea is that some parts of the state space are more likely to generate a favorable policy; thus, it is desirable to focus on those parts of the state space. In the Prioritized Sweeping algorithm [39], the residual (the difference between successive Bellman updates) is used as a metric to determine the best set of states for future updates. LAO* [27] keeps track of all previously explored states, and slowly expands the set if a Bellman update suggests that an unexplored state could be useful. Real time dynamic programming (RTDP) [50, 9] simulates trajectories (Monte Carlo simulation) through the state space and performs Bellman updates on the explored trajectories.

Asynchronous DP has been shown to be very effective in planning problems with a well defined initiation set and termination set; that is, problems where the goal is to find a policy that can drive the system from one part of the state space to another. For instance, a stochastic path planning problem requires finding a policy that guides a vehicle from a starting position to a goal position. In such problems, the algorithm can focus on the parts of the state space "connecting" the initiation set and the termination set. However, in some persistent planning domains where initiation and termination sets are not specified, such as in reconnaissance missions, it is difficult to determine which states are more important than others. Further, this class of algorithms still requires explicit enumeration of a substantial portion of the joint planning spaces, which can be impractical for multi-agent applications.

Approximate Dynamic Programming

Approximate DP [14, 22] is a class of DP algorithms that uses feature vectors to aggregate states in order to lower computational complexity. In particular, given a set of feature vectors, $\phi_i(s)$, the value function can be approximated as $V(s) \approx \sum_i w_i \phi_i(s)$. The algorithms attempt to find the set of $w_i$ that best approximates the true value function.
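To make the linear architecture concrete, the following minimal Python sketch (an illustration, not from the thesis) evaluates $V(s) \approx \sum_i w_i \phi_i(s)$ for a small grid world; the feature map and the weight values are hypothetical placeholders, and in practice the weights would be fit by an approximate DP algorithm.

```python
import numpy as np

# Hypothetical 5x5 grid-world states encoded as (x, y); goal at (4, 4).
def features(s, goal=(4, 4)):
    """Feature vector phi(s): bias, Manhattan distance to goal, wall proximity."""
    x, y = s
    dist = abs(x - goal[0]) + abs(y - goal[1])
    near_wall = 1.0 if x in (0, 4) or y in (0, 4) else 0.0
    return np.array([1.0, dist, near_wall])

# Weights would normally be learned; fixed here purely for illustration.
w = np.array([10.0, -1.0, -0.5])

def v_approx(s):
    """Approximate value: V(s) ~= sum_i w_i * phi_i(s)."""
    return w @ features(s)

print(v_approx((0, 0)))  # far from the goal and next to walls -> low value
print(v_approx((3, 3)))  # close to the goal -> higher value
```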
Since the number of features ($|\Phi|$) is typically much smaller than the number of states ($|S|$), approximate DP achieves significant computational savings compared to classical DP. Unsurprisingly, since $|\Phi| \ll |S|$, the representation power of the feature space (linear combinations of feature vectors) is less than that of tabular enumeration. Thus, various error metrics have been proposed for the feature representation of the true value function. Minimizing different error metrics leads to different approximate Bellman update steps, which define different algorithms. For instance, Least Squares Temporal Difference (LSTD) [13] is a popular approximate DP algorithm that minimizes a projection error under the Bellman update. Approximate Value Iteration and approximate Policy Iteration are other examples of this class of algorithms. GA-dec-MMDP [44] is an algorithm developed specifically for multi-agent systems; by specifying a set of feature vectors, it allows each agent to model the rest of the team in an approximate, aggregate manner.

The key idea of approximate DP is that many states are similar. For instance, in a path planning domain, one feature vector could lump together all states near obstacles. One Bellman update step can then change the value of all related states simultaneously. However, it usually requires extensive domain knowledge to specify a good set of feature vectors.

1.2.2 Structural Decomposition Approaches

Structural decomposition methods [26] decompose a planning problem into a number of subproblems with smaller planning spaces. Then, each subproblem is solved individually to compute local policies. Lastly, the local policies are combined to generate a policy to the original problem.

Serial Decomposition

In serial decomposition, the planning spaces are partitioned into smaller sets. A subproblem (i.e., an MDP) can be formulated around each partitioned set. The size of the original problem is the sum of that of the subproblems; that is, the planning space of the original problem is the union of those of the subproblems. At each decision period, serial decomposition identifies one active subproblem [29], and follows that subproblem's local policy.

Consider a stochastic path planning problem inside a building with multiple rooms connected by hallways and corridors. A serial decomposition algorithm would decompose the physical space into individual rooms and hallways. Local policies can be computed inside each room to drive the vehicle toward the door, which connects the room to a hallway or to other rooms. A policy to the original problem can be found by first identifying the room in which the vehicle currently resides, and then following that room's local policy. A popular serial decomposition algorithm is MAXQ by Dietterich et al. [21].

Serial decomposition works well when there exist good heuristics for partitioning the planning space. Also, it usually assumes infrequent transitions among different partitions. Thus, these algorithms typically require domain knowledge. More importantly, serial decomposition still requires explicit enumeration of the entire state space, which can be impractical for large scale multi-agent systems.
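The following toy Python sketch (an illustration only, not any specific algorithm from the literature) shows the serial-decomposition execution rule described above: identify the active partition, then follow its local policy. The partition boundaries and local policies are hypothetical.

```python
# Hypothetical partition of a building into regions, each with a precomputed
# local policy mapping states to actions (e.g., from value iteration on the
# region's sub-MDP). All names and actions are illustrative placeholders.
local_policies = {
    "room_a":  lambda s: "go_to_door",
    "hallway": lambda s: "follow_corridor",
    "room_b":  lambda s: "go_to_goal",
}

def region_of(state):
    """Identify which partition the vehicle currently occupies."""
    x, y = state
    if x < 5:
        return "room_a"
    elif x < 10:
        return "hallway"
    return "room_b"

def serial_policy(state):
    """Serial decomposition: follow the local policy of the active subproblem."""
    return local_policies[region_of(state)](state)

print(serial_policy((2, 3)))   # inside room_a  -> "go_to_door"
print(serial_policy((7, 1)))   # in the hallway -> "follow_corridor"
```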
Parallel Decomposition

In parallel decomposition, the joint planning spaces are decomposed along different dimensions. For instance, vehicles in a surveillance mission can have a dimension specifying their geographical location and another dimension specifying their operational health status. A subproblem is formulated for each dimension (or a subset of dimensions). The size of the original problem is the product of that of the subproblems; that is, the planning space of the original problem is the Cartesian product of those of the subproblems. At each decision period, multiple subproblems are active, and the local policies are combined to form a policy to the original problem.

It is worth noting that adding an agent is equivalent to adding a dimension in a planning problem (see the next chapter for details). Thus, parallel decomposition methods have been widely applied to multi-agent systems. For instance, the coordination graph approach [25, 32] assumes that different subsets of agents are coupled through locally shared value functions. Subproblems are formed around each subset of connected agents and a policy is computed by maximizing the sum of each subproblem's value function. Other examples include Large Weakly Coupled MDPs [38], Factored MDPs [24], Concurrent MDPs [37], and Dynamically Merged MDPs [49]. Parallel decomposition methods require a user-specified decomposition structure, which relies on domain knowledge. Also, coupling between subproblems is assumed to take simple forms, such as addition of local value functions or reward functions.

1.2.3 Optimization Based Approaches

The previous two sections described solution approaches in the MDP framework. In the optimization community, researchers have studied task allocation problems for multi-agent systems [28, 2]. In particular, the objective is to optimize a joint performance metric given individual scoring functions [30],

$$\max \sum_{i=1}^{N_a} \sum_{j=1}^{N_t} F_{ij}(\mathbf{x}, \tau)\, x_{ij} \qquad (1.1)$$

where $F_{ij}(\mathbf{x}, \tau)$ is the scoring function specifying the reward of agent $i$ performing task $j$ at time $\tau$, and the $x_{ij}$ are binary decision variables. The optimization problem can also be augmented with various constraints. Exemplar algorithms for solving this optimization problem approximately are the sequential greedy algorithm and auction based algorithms. Since this class of algorithms avoids enumerating and iterating through the problem's planning space, it scales favorably with the number of agents (compared to MDP approaches).

Optimization based algorithms were originally developed for deterministic systems and have recently been extended to incorporate some aspects of uncertainty models. For instance, Campbell et al. [15] used agent level MDPs to compute the scoring functions $F_{ij}$. Yet, the structure of a task needs to be specified by the user, which relies on domain knowledge. Further, Eqn. 1.1 explicitly assumes additive utility between agents.

1.3 Outline and Summary of Contributions

This thesis proposes a novel hierarchical decomposition algorithm, which decomposes a multi-agent planning problem into a hierarchy of tasks by exploiting the structure of the reward function. Oftentimes the reward function reveals different degrees of coupling between agents, and thus can guide the decomposition procedure. Smaller planning problems are formed around the tasks at the bottom of the hierarchy. These subproblems have state spaces that are orders of magnitude smaller than that of the original problem, thus allowing for computational tractability. Lastly, the algorithm combines solutions of the subproblems to construct a policy to the original problem.
The proposed algorithm is shown to achieve better performance at lower computational cost in a persistent surveillance mission. The algorithm has also been applied to other multi-agent planning domains with more complex coupling relations. In addition to simulation, the algorithm has been run on real robots to demonstrate that it is real time implementable.

• Chapter 2 presents the preliminaries for modeling multi-agent systems under uncertainty. It reviews the formulation of Markov Decision Processes (MDP) and Multi-agent Markov Decision Processes (MMDP). It describes the mechanics of value iteration (a dynamic programming algorithm) for solving MDPs and MMDPs.

• Chapter 3 proposes the Hierarchically Decomposed Multi-agent Markov Decision Processes (HD-MMDP) algorithm for solving MMDPs [16]. It motivates the solution approach by examining different functional structures in an MMDP's reward function. Then, it describes the mechanics of the HD-MMDP algorithm for generating an approximate solution to multi-agent planning problems.

• Chapter 4 describes a persistent search and track (PST) mission and applies HD-MMDP to solve this planning problem. Then, it presents an empirical analysis of HD-MMDP. On the ten agent PST domain, simulation results show that HD-MMDP outperforms the GA-dec-MMDP algorithm by more than 35%. HD-MMDP is also applied to another planning problem, the intruder monitoring mission (IMM), and demonstrated sustainable performance. Lastly, several extensions to the HD-MMDP algorithm are proposed.

• Chapter 5 presents the hardware setup of the Real-time indoor Autonomous Vehicles ENvironment (RAVEN). It also describes a flexible software architecture that enabled conducting hardware experiments involving multiple robots. A novel multi-agent path planning algorithm is proposed and implemented in the RAVEN testbed. A hardware implementation of the intruder monitoring mission is presented to demonstrate that HD-MMDP is real time implementable.

• Chapter 6 offers concluding remarks and suggests possible directions for future work.

Chapter 2

Preliminaries

This chapter presents preliminaries for Markov Decision Processes, Multi-agent Markov Decision Processes, and Dynamic Programming solution approaches.

2.1 Markov Decision Processes

A Markov Decision Process (MDP) [42] is a flexible mathematical framework for formulating stochastic, sequential decision problems. MDPs were originally developed for discrete systems, where the number of states and actions is finite. Recent works have extended the MDP formulation to continuous systems, while this thesis focuses on discrete MDPs. Mathematically, an MDP is defined by a tuple $(S, A, P, \mathcal{R}, \gamma)$, which are the state space, action space, state transition function, reward function, and discount factor, respectively.

• State Space: $S$

The state space is the set of all possible states of a system. For instance, imagine a room being partitioned into a $5 \times 5$ grid; then there are 25 possible states, each of which uniquely identifies a configuration of the system. For some systems, a state may be comprised of multiple dimensions. Consider a vehicle navigating in the $5 \times 5$ grid. In addition to a locational dimension, it may additionally have a battery level dimension, discretized to the nearest percentage. Then, the state space of the system is the Cartesian product of both dimensions. The size of the state space is the product of that of each dimension, which is 2500 in the above example.
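A short Python sketch (illustrative only) makes this multiplicative growth explicit for the example above:

```python
from itertools import product

# 5 x 5 grid positions and a battery level discretized to the nearest percent
# (100 levels here), as in the example above.
positions = [(x, y) for x in range(5) for y in range(5)]
battery_levels = range(100)

# The joint state space is the Cartesian product of the two dimensions.
states = list(product(positions, battery_levels))
print(len(positions), len(battery_levels), len(states))  # 25 100 2500

# Each added dimension (or agent) multiplies the count again; ten vehicles
# with 100 possible locations each already give 100**10 = 1e20 joint states.
print(100 ** 10)
```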
Therefore, the state space of the system grows exponentially with the addition of new dimensions, a phenomenon referred to as the curse of dimensionality in the literature.

• Action Space: $A$

The action space is the set of all actions that a system can exert. For instance, a vehicle navigating in a $5 \times 5$ grid must choose to go {forward, backward, left, right}. In addition, the set of permissible actions may depend on the state of the system, such that the vehicle cannot go across a physical boundary.

• State Transition Function: $P$

The state transition function specifies the probability of reaching state $s'$ by taking action $a$ at state $s$, that is, $P : S \times A \times S \to [0, 1]$. The state transition function reflects the Markovian property of the problem, which means the future uncertainty distribution depends solely on the system's current state and action, and is independent of the system's past state trajectory. The state transition function also reflects the degree of stochasticity in the system. A vehicle navigating in the $5 \times 5$ grid may be a stochastic system if the floor is slippery. In particular, a vehicle driving straight can have a 90% probability of reaching its intended location in front, but also a 10% probability of ending up in adjacent locations on the left or right, due to the slippery conditions.

• Reward Function: $\mathcal{R}$

The reward function maps each state-action pair to a scalar value, that is, $\mathcal{R} : S \times A \to \mathbb{R}$. Intuitively, the reward function is designed to favor a set of desired states. For instance, a vehicle navigating in the $5 \times 5$ grid may assign a high reward to its goal position, thereby favoring a policy that drives the vehicle toward its goal. This thesis assumes that the objective is to generate a policy that maximizes expected reward. Alternatively, in some scenarios, it is more natural to think about cost, in which case the objective is to minimize expected cost.

• Discount Factor: $\gamma$

The discount factor is a scalar in the range (0, 1). Intuitively, a discount factor specifies that reward in the distant future is less valuable. Computationally, it also has some important theoretical implications.

• Policy: $\pi$

A deterministic policy, $\pi$, is a function that maps each state to an action, that is, $\pi : S \to A$. Solving an MDP is the process of computing a policy that maximizes the discounted cumulative reward, which is called the value function. For an infinite horizon MDP, the value function is defined as

$$V^{\pi}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathcal{R}\big(s_t, \pi(s_t)\big)\right] \qquad (2.1)$$

The discount factor must be less than one or the value function may otherwise diverge. Theoretical results have been developed showing that the optimal policy for an infinite horizon MDP is deterministic and stationary.

2.2 Multi-agent Markov Decision Processes

A Multi-agent Markov Decision Process (MMDP) [11] is the application of the MDP framework to model multi-agent systems. In particular, an MMDP is defined by a tuple $(n_a, S, A, P, \mathcal{R}, \gamma)$, where $n_a$ is the number of agents, and the other variables are defined similarly as for an MDP. MMDP is distinguished from MDP by its assumption of a factored action space, according to the original definition by Boutilier. There are also two other assumptions, factored state space and transition independence, that researchers often associate with MMDP. In this thesis, all three conditions are assumed to hold.

• Factored Action Space

The joint action space is assumed to be the Cartesian product of that of each agent, that is, $A = A_1 \times A_2 \times \cdots \times A_{n_a}$.
This condition implies that each agent's action space remains unchanged in the presence of its teammates, and that the joint action is specified by the actions of all agents.

• Factored State Space

The joint state space is assumed to be the Cartesian product of that of each agent, that is, $S = S_1 \times S_2 \times \cdots \times S_{n_a}$. This condition implies that each agent's state space remains unchanged in the presence of its teammates, and that the joint state is specified by the states of all agents. The factored state and action space assumptions reflect the thinking that each agent remains its own entity.

• Transition Independence

The transition function is assumed to be the Cartesian product of that of each agent, that is, $P = P_1 \times P_2 \times \cdots \times P_{n_a}$. This condition implies that each agent's transition dynamics is only a function of its own state and action, and is independent of its teammates' states and actions [4]. Low level interaction between agents, such as interference due to occupying the same state, is not explicitly modeled given this condition.

In comparison to the MDP formulation, it can be seen that adding an agent to an MMDP has a similar effect to adding a dimension in an MDP. Planning spaces is a phrase used to refer to state spaces and action spaces. Thus, the size of the planning spaces scales exponentially with the number of agents.

2.3 Dynamic Programming Algorithms

Dynamic Programming (DP) [7] is a class of algorithms that solves optimization problems by breaking the problem down into a number of overlapping simpler subproblems. The subproblems are solved in different stages, and the solution to each subproblem is maintained in memory and used for solving later subproblems.

In the context of MDPs, dynamic programming exploits the Markovian property of the problem. The best action to take at a particular state is independent of how the system has reached that state. Thus, the original problem of finding the optimal sequence of actions is equivalent to the simpler problem of finding the best action assuming the system will follow the optimal policy thereafter. This idea leads to the Bellman optimality condition,

$$V^{\pi^*}(s) = \max_{a}\left\{\mathcal{R}(s, a) + \gamma \sum_{s'} P(s, a, s')\, V^{\pi^*}(s')\right\} \qquad (2.2)$$

where $V^{\pi^*}(s)$ is the optimal value function associated with the optimal policy $\pi^*$. Eqn. 2.2 not only specifies a condition that the optimal value function should satisfy, but also suggests an iterative algorithm for computing the optimal value function. In particular, given an initialization $V^0$, one can apply

$$V^{k+1}(s) = \max_{a}\left\{\mathcal{R}(s, a) + \gamma \sum_{s'} P(s, a, s')\, V^{k}(s')\right\} \qquad (2.3)$$

to drive the next iterate closer to $V^{\pi^*}$. Eqn. 2.3 is called the Bellman update, or Bellman backup, at state $s$. Theoretical results have been developed to show that $V^{\pi^*}(s)$ is unique and that performing Bellman updates will drive any initialization of the value function toward the optimal value function asymptotically.

Algorithm 1: Value Iteration
  Input: MDP $(S, A, P, \mathcal{R}, \gamma)$, tolerance $\epsilon$
  Output: policy $\pi$
  1  initialize $V^0 \leftarrow 0$, $n \leftarrow 0$
  2  while $\max_s \mathrm{residual}(s) > \epsilon$ do
  3      foreach $s \in S$ do
  4          $V^{n+1}(s) \leftarrow \max_a\{\mathcal{R}(s, a) + \gamma \sum_{s'} P(s, a, s') V^{n}(s')\}$
  5          $\mathrm{residual}(s) = |V^{n+1}(s) - V^{n}(s)|$
  6      $n \leftarrow n + 1$
  7  return $\pi(s) \leftarrow \arg\max_a\{\mathcal{R}(s, a) + \gamma \sum_{s'} P(s, a, s') V^{n}(s')\}$

Value Iteration (VI) [36] is shown in Algorithm 1. The algorithm computes a policy $\pi$ given an MDP and a convergence tolerance $\epsilon$. The algorithm first initializes the value function in line 1. Then, a Bellman update (lines 3-5) is performed at each state $s \in S$ using the current iterate. The residual, which is the change in the value function between successive iterates, is computed and stored. Once the maximum residual falls below the tolerance $\epsilon$, it is assumed that the value function has converged (lines 2-6). Lastly, a policy is retrieved from the value function by greedy selection with respect to the Bellman update equation (line 7).
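As a concrete companion to Algorithm 1, the following is a minimal Python sketch of synchronous value iteration for a generic finite MDP (an illustration, not the thesis's implementation); the dictionary-based representation of $P$ and $\mathcal{R}$ and the toy two-state problem are assumptions made for the example.

```python
def value_iteration(S, A, P, R, gamma, eps=1e-6):
    """Minimal synchronous value iteration mirroring Algorithm 1.

    S, A    : iterables of (hashable) states and actions
    P[s][a] : dict mapping next state s2 -> probability P(s, a, s2)
    R[s][a] : scalar reward for taking action a in state s
    """
    def q(V, s, a):
        return R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())

    V = {s: 0.0 for s in S}
    while True:
        V_next = {s: max(q(V, s, a) for a in A) for s in S}   # one Bellman sweep
        residual = max(abs(V_next[s] - V[s]) for s in S)
        V = V_next
        if residual < eps:
            break
    policy = {s: max(A, key=lambda a: q(V, s, a)) for s in S}  # greedy extraction
    return policy, V


# Tiny two-state example: "go" moves from s0 to s1 with probability 0.9, and
# 1 unit of reward is collected per step while in s1.
S, A = ["s0", "s1"], ["stay", "go"]
P = {"s0": {"stay": {"s0": 1.0}, "go": {"s1": 0.9, "s0": 0.1}},
     "s1": {"stay": {"s1": 1.0}, "go": {"s1": 1.0}}}
R = {"s0": {"stay": 0.0, "go": 0.0},
     "s1": {"stay": 1.0, "go": 1.0}}
print(value_iteration(S, A, P, R, gamma=0.9))
```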
Each Bellman update step (line 4) requires iterating through every possible action and every possible next state, thus yielding run time complexity $O(|A||S|)$, where $|\cdot|$ denotes the cardinality of a set. Consequently, each Bellman update sweep (lines 3-5) yields run time complexity $O(|A||S|^2)$. Empirically, VI usually requires between 10 and 100 Bellman update sweeps to converge. Note that the run time complexity depends on the size of the planning spaces. This relates to the curse of dimensionality in that DP approaches quickly become computationally intractable with the addition of new dimensions to the planning spaces.

There also exists another popular DP approach known as policy iteration (PI) [36]. PI iterates between two steps, policy evaluation and policy update. In its original form, PI has greater run time complexity than VI. A plethora of algorithms have been developed based on PI and VI, including those reviewed in Sec. 1.2.1.

2.4 Extensions to Markov Decision Processes

2.4.1 Partially Observable Markov Decision Processes (POMDP)

A Markov Decision Process assumes full observability; that is, after taking an action, the next state will be known to the system. This model works reasonably well when measurement noise is small compared to the discretization level. Otherwise, the uncertainty distribution after each measurement needs to be explicitly modeled, which leads to the formulation of Partially Observable Markov Decision Processes (POMDP) [47], defined by the tuple $(S, B, A, P, \mathcal{R}, \gamma)$. $B$ denotes the belief space, which is a distribution over states. When solving a POMDP, one has to keep track of the evolution of the belief states.

It is extremely computationally intensive to compute the optimal policy for a POMDP. In particular, MDP is P-complete whereas POMDP is PSPACE-hard [41]. Recently, researchers have developed more scalable algorithms, such as FIRM [1], that solve POMDPs approximately.

2.4.2 Decentralized Markov Decision Process (dec-MMDP)

Multi-agent MDP, as defined in Section 2.2, is a specialization of MDP, and thus also assumes full observability. Assuming each agent can only measure its own state, MMDP requires communication between agents such that a centralized planner can gain full observation of the system. In contested environments where communication bandwidth is limited, such an assumption may be unrealistic. Thus, many researchers have studied multi-agent sequential decision problems from a decentralized perspective. One particular modeling framework is dec-MDP [23]. Dec-MDP assumes joint full observability, which means full observability can be achieved only by combining the observations of all agents. The communication limitation is modeled by a probability measure that specifies the probability of receiving an agent's observation given its state and action. Thus, the optimal policy should strike a balance between optimizing the reward function and maintaining a good degree of communication. Dec-MDP is NEXP-complete due to its joint full observability assumption [6]. This thesis assumes full observability and focuses on MDP and MMDP formulations.
2.5 Summary

This chapter reviewed the formulation of Markov Decision Processes (MDP) and Multi-agent Markov Decision Processes (MMDP). It also described dynamic programming (DP) as a class of algorithms for computing optimal solutions of MDPs and MMDPs. However, DP approaches are not practical for solving MMDPs due to the curse of dimensionality. The next chapter introduces an algorithm that solves MMDPs at lower computational complexity by exploiting structures common to MMDPs.

Chapter 3

Hierarchical Decomposition of MMDP

This chapter presents the mechanics of a novel hierarchical decomposition algorithm for solving Multi-agent Markov Decision Processes (MMDP). The chapter starts with a discussion of various structures of an MMDP's reward function, and relates different functional structures to different degrees of coupling between agents. Then, it proposes an algorithm that exploits such structures for solving MMDPs. In particular, the proposed algorithm decomposes an MMDP into a number of subproblems, solves each subproblem individually, and combines the subproblems' local policies to form a global policy to the MMDP. To distinguish between the subproblems and the original MMDP, the phrases local policy and local reward function are used to refer to the subproblems, and the phrases global policy, global reward function, and joint planning spaces are used to refer to the original MMDP. If it is clear in context, the planning spaces of a multi-agent subproblem are also referred to as joint planning spaces.

Figure 3-1: Construction of Multi-agent Reward Function. Each shape represents a task that can be accomplished by a single agent.

3.1 Structure of Reward Functions

3.1.1 Construction of Reward Functions

The reward function, as defined in Section 2.1, can be specified in a tabular form with one scalar entry per state-action pair. The scalar values can be arbitrary and there need be no correlation between any two scalars. However, this mathematical perspective is rarely useful in practice. After all, the reward function is designed by people to induce a favorable policy upon solving the MDP. Thus, the reward function is often compactly represented by algebraic functions, which focus on a small set of favorable states. More importantly, the set of favorable states is often expressed as a composition of simple rules. For instance, in a multi-agent surveillance domain, a reward can be given when (i) a certain number of agents is actively patrolling in a specific area, and (ii) each patrolling agent has an adequate amount of fuel and fully operational sensors.

The observation that reward functions usually have simple algebraic structure is formalized with the definition of the factored reward function. Before presenting the precise mathematical formulation, it is useful to provide a simple motivating example for the construction of a multi-agent problem, as shown in Fig. 3-1. Let each shape represent a point of interest in a 5 x 5 grid world; each shape is associated with a scalar reward labeled to its right. Given agent dynamics and environmental constraints,
Given these building blocks, more complicated missions can be constructed by adjoining these leaf tasks. For instance, addition ri(@) + r2 (N) is used to indicate that tasks 1 and 2 can be executed independently, and multiplication r3 () - r4 () is used to indicate that task 3 and 4 need to be executed simultaneously. We can build more complex mission by repeated composition of leaf tasks. The previous example illustrates composition of single-agent (singleton) leaf tasks. Yet, there are multi-agent tasks that cannot be decomposed into singleton tasks. For instance, consider the consensus problem shown in Fig. 3-2. The reward function for this problem is r(s1, 82) = W(S1)I(s1 = 82), where w(-) is a scalar function determining the value of each grid and I(.) is the indicator function specifying the condition that both agents must be in the same grid for collection of reward. Thus, when choosing an action, each agent must be cognizant of the other agent's state and action. Let the green and red arrows denote two actions that each agent can choose from. In this case, depending on magnitude of w(s) and state transition uncertainties, both agent should either choose the green arrows or the red arrows. Mathematically, it is difficult to decompose the indicator function into a composition of more elementary functions of si and s2 separately. 31 3.1.2 Factored Reward Function This section introduces some terminologies to formalize the observations made in section 3.1.1. Task The i-th task, T, is defined as the joint state space Si = s, x ... x s, of any set of nt agents. Task Tz is associated with local reward function rZ, which is a part of the global reward function R. Local reward function is a mapping from the joint state space to a scalar, that is, r: T' -+ R. Further, t is defined as a particular state in task T, that is, ti E T. The definition of a task does not restrict to any particular subset of agents; reward is given as long as some subset of agents reaches the desired joint state. The dimension of a task's state space is the minimum required number of agents for executing the task. More agents can be assigned to a task for adding redundancy, in which case reward is calculated based on the subset of agents that best accomplishes the task. Note that the composition of tasks forms another task. Leaf Task A leaf task is a task that cannot be decomposed into smaller sub-tasks. More precisely, a leaf task is associated with a leaf reward function, which is function that cannot be expressed as composition of simpler functions, rea(si, ... where ,s,) $ fi(C)f 2 (CC) (3.1) C c {si, ... , sni}, and C 5 0 Intuitively, leaf reward functions are the elementary building blocks of a global reward function. Leaf tasks are subproblems formed around leaf reward functions. *, * *, *, from Fig. 3-2 are examples of singleton leaf tasks. Example from Fig. 3-2 is an example of a two-agent leaf task. 32 Factored Reward Function Factored reward function is the repeated function composition of leaf reward functions, 1Z(8i, 82,... ,iS. = fl{... fA [ fm(r mi(tml') ,... ..,rh(thl),... , rmkl-(tmnk)),.. r11 (t"l,. ,rlk(tlk)], (3.2) , rhk(thk )} where h, 1, m are dummy indices showing association of different terms. f(.)'s are arbitrary functions, r(t) is leaf reward function r(-) evaluated at state t of task T. For example, consider a three agent planning problem with global reward function R(si, 82) = r3 (t)(r 2 (ti) + r3 (t3)). This can be written in form of Eqn. 
3.1.3 Useful Functional Forms

The construction of reward functions illustrates that different functional forms can be used to express different coordination specifications. More importantly, different functional forms reveal the extent to which agents are coupled. Intuitively, coupling between agents is weak if each agent can choose its own action without regard to its teammates' states and actions, and coupling is strong if an agent needs to take into account all other agents' states and actions. The strongest form of coupling is represented by a leaf reward function, for which further decomposition is difficult. This section presents two functional forms that can be decomposed into subproblems. Subtask is a phrase used to describe a subproblem that is formulated as an MDP.

Summation Reward Function: $\mathcal{R}(s_1, \ldots, s_{n_a}) = \sum_k r_k(t^k)$

This functional form represents weak coupling between agents because the subtasks are almost independent. In particular, assuming infrequent transition of individual agents between subtasks, solving each subtask optimally in parallel would yield a jointly optimal solution because

$$\max\left[\sum_{k}\sum_{t=0}^{\infty}\gamma^{t} r_{k}(t^{k})\right] = \sum_{k}\max\left[\sum_{t=0}^{\infty}\gamma^{t} r_{k}(t^{k})\right]$$

Figure 3-3: Summation Reward Function. Agent $A_1$ performs the red subtask independently of $A_2$ performing the green subtask. Note that the colors of the arrows are different, to show that each agent can perform its own subtask without concerning itself with the other agent.

Consider the example shown in Fig. 3-3, whose reward function is $r_1 + r_2$. Assuming agent $A_1$ is assigned to subtask 1 and agent $A_2$ is assigned to subtask 2, the jointly optimal policy is to have each agent execute its local optimal policy. Therefore, the two subtasks are almost independent in the sense that each agent can execute its local policy without concerning itself with the other agent's state and action.

Multiplicative Reward Function: $\mathcal{R}(s_1, \ldots, s_{n_a}) = \prod_k r_k(t^k)$

A multiplicative reward function represents a stronger degree of coupling than that of the summation functional form. In particular, solving each subtask optimally in parallel would not yield a jointly optimal solution. Intuitively, this functional form specifies that the subtasks should be executed simultaneously.

Figure 3-4: Multiplicative Reward Function. Agent $A_1$ is performing the orange subtask and agent $A_2$ is performing the blue subtask. The lasso over the two arrows shows that the two agents should explicitly coordinate when performing the subtasks.

Consider the example shown in Fig. 3-4, whose reward function is $r_3 \cdot r_4$. Let agent $A_1$ be assigned to subtask 3 and agent $A_2$ be assigned to subtask 4. For simplicity, let the transition model be deterministic. The optimal policy of each subtask, $\pi^*$, can be found via value iteration. One other policy, $\pi'$, is given for each subtask. Each policy induces a sequence of rewards (one scalar reward per decision period), which is listed in Table 3.1. Recall that the objective is to optimize the discounted cumulative reward, $\sum_t \gamma^t \mathcal{R}(s_t)$.

Table 3.1: Local Optimal Policies can be Jointly Suboptimal. Each row gives the sequence of rewards induced by a policy and the discounted cumulative reward calculated from that sequence. The better policy in each row is the one with the higher discounted cumulative reward.

  subtask $r_3$: under $\pi^*$, $(1, 0, 0, 1, 0, 0, \ldots)$ with $\sum_{i=0}^{\infty}\gamma^{3i}$; under $\pi'$, $(0, 0, 1, 0, 0, 1, \ldots)$ with $\sum_{i=0}^{\infty}\gamma^{3i+2}$
  subtask $r_4$: under $\pi^*$, $(0, 1, 0, 0, 1, 0, \ldots)$ with $\sum_{i=0}^{\infty}\gamma^{3i+1}$; under $\pi'$, $(0, 0, 1, 0, 0, 1, \ldots)$ with $\sum_{i=0}^{\infty}\gamma^{3i+2}$
  joint task $r_3 \cdot r_4$: under $\pi^*$, $(0, 0, 0, 0, 0, 0, \ldots)$ with $0$; under $\pi'$, $(0, 0, 1, 0, 0, 1, \ldots)$ with $\sum_{i=0}^{\infty}\gamma^{3i+2}$

For each subtask, it is seen that $\pi^*$ induces a higher discounted cumulative reward than $\pi'$. However, the joint policy of executing the $\pi^*$ independently induces zero discounted cumulative reward. In contrast, combining the suboptimal local policies $\pi'$ induces a discounted cumulative reward of $\sum_{i=0}^{\infty}\gamma^{3i+2}$, which is greater than zero. This example shows that, in general, combining locally optimal policies does not induce the globally optimal policy for a multiplicative reward function.
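A quick numerical check of the argument in Table 3.1 (illustrative only; the discount factor $\gamma = 0.9$ is an arbitrary choice, and any $\gamma \in (0,1)$ gives the same ordering):

```python
gamma = 0.9  # any discount factor in (0, 1) leads to the same conclusion
T = 200      # enough terms for the geometric sums to converge numerically

def discounted(seq):
    return sum(gamma ** t * r for t, r in enumerate(seq))

# Reward sequences from Table 3.1 (period-3 patterns).
opt3 = [(1, 0, 0)[t % 3] for t in range(T)]   # pi* on subtask r3
opt4 = [(0, 1, 0)[t % 3] for t in range(T)]   # pi* on subtask r4
alt  = [(0, 0, 1)[t % 3] for t in range(T)]   # pi' on either subtask

# Locally, pi* beats pi' on each subtask ...
print(discounted(opt3), discounted(alt))   # ~3.69 > ~2.99
print(discounted(opt4), discounted(alt))   # ~3.32 > ~2.99

# ... but the joint (multiplicative) reward r3 * r4 is zero under the pair of
# local optima, and positive when both agents follow pi'.
joint_opt = [a * b for a, b in zip(opt3, opt4)]
joint_alt = [a * b for a, b in zip(alt, alt)]
print(discounted(joint_opt), discounted(joint_alt))   # 0.0 < ~2.99
```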
The better policy, as revealed by discounted cumulative reward, is shown in red. Reward induced by r' subtask (1, 0,,0,1, 0, 0,. ..) (0, 0,1, 0,0, 1, ... subtask I (0,1,0, 0,1, 0, ...) K_0 joint task 7 3i+1 r3(*) r4(g) (0, 0, 0, 0 ,0, 0, ... ) __________________0 ZK= 0 73i+2 (0,0, 1, 0, 0, 1, ... K 73i+2 (0, 0, 1, 0 ,0, 1, ... ) 0 ^ 3i ) K~= ) Reward induced by r* K=___________ Zy3i+2 higher discounted cumulative reward than that of ir'. However, the joint policy of executing 7r* independently will induce zero discounted cumulative reward. In contrast, combining the suboptimal local policies 7r' will induce discounted cumulative reward of K= 0 Y3i+2, which is greater than zero. This example shows that in general, combining local optimal policy does not induce global optimal policy for a multiplicative reward function. In discussion of functional form and coupling strength, it was assumed (i) infrequent transition of agents in between different subtasks and (ii) assignment of minimum required number of agents to subtasks (i.e. one agent is assigned to served a singleton task). However, there are planning domains where such assumptions can lead to poor solution quality. For instance, if agents cycle through different states 35 reflecting different capabilities (i.e. battery level), then they should be periodically assigned to different subtasks. Also, if one subtask is much more important than all other subtasks, then more agents should be assigned to this important subtask for added redundancy. These are also important aspects that a planning algorithm should consider for developing a policy to the original planning problem. 3.2 Hierarchically Decomposed MMDP (HD-MMDP) This section develops a hierarchical decomposition algorithm [16] that exploits structure in factored reward function. This algorithm will (1) decompose the original problem into a number of subproblems, (2) solve each subproblem using classical dynamic programming (DP), (3) rank the subproblems according to some importance metric, and (4) assign each agent to one of the subproblems. Each agent will query an action from its assigned subproblem's local policy; which in turn, forms a global policy to the original MMDP. 3.2.1 Decomposition The first step is to decompose a factored reward function and to build a task tree structure. A task tree is a graphical representation of a factored reward function. The leaf nodes of a task tree are the leaf tasks of the planning problem. Recall a leaf task is a MDP formed around a leaf reward function. The internal nodes of a task tree are function nodes. The Decomposition subroutine, shown in Algorithm 2, builds a task tree structure recursively. In lines 1-3, an MDP is formed if a leaf task is identified. Otherwise, a function node (lines 4-8) is formed and each factor (argument of the function) is converted into a subtree and added as one of its child. The subroutine is invoked recursively (line 7) until all leaf tasks are identified. For example, consider the factored reward function rir2 + r3 r4 , where r is used 36 ) Algorithm 2: Decomposition Input: Agents Nj, State, Action, and Transition model < S, Aj, P > of each agent, for j = 1, ... 
, na, Discount factor 7, and Reward Function R Output: A task tree with leaf nodes as smaller MDPs ti, and function nodes at all internal nodes 1 if r(t) cannot be decomposed then 2 p +- new MDP(< Sj,AjP>,1?, 3 L return p else p +- new function node 5 6 foreach factor r' in R do 7 node +- Decomposition (Nj, < Sj, Aj, P >, r, 7) 8 _ add node as a child of p 4 9 L return p ( r r r Figure 3-5: Task Tree of Example 3.1. This tree is built automatically by invoking . decomposition subroutine on the reward function rir2 + r3 r4 as a shorthand for r(t). The decomposition subroutine first identifies an summation function node, and adds the two arguments, rir2 and r3 r4 , as children of the summation node. Then, the subroutine is invoked separately on each child. In particular, upon decomposition of rir2 , a multiplication functional node is formed, and two leaf reward functions are identified. The decomposition of r3 r4 is carried out similarly. Lastly, leaf tasks are formed around each leaf reward function. The resulting task tree is illustrated in Fig. 3-5. This example (Example 3.1) will be used throughout this section to explain the key procedures in HD-MMDP. 37 1499 2 5 T, T3 T4 T2 Figure 3-6: Solution to Leaf Tasks. Leaf tasks are formed around leaf reward functions and are solved with value iteration. The maximum of each value function, rma, is labeled above each respective leaf task. The leaf nodes are labeled T instead of r to emphasize that leaf tasks are formed around each leaf reward functions. 3.2.2 Solving Leaf Tasks Given a task tree structure, the second step is to solve each subproblem (leaf task) at the bottom of the hierarchy. Each leaf task can be viewed as a standalone MDP, and can be solved using classical dynamic programming approaches. HD-MMDP algorithm uses value iteration (VI) for computing optimal policy to each leaf task. Use of approximate algorithms for solving leaf tasks is also possible. The optimal policy and the associated optimal value function can be found upon solving each leaf task. The value function of a leaf task is the expected cumulative reward of performing the particular leaf task. Thus, the relative magnitude of each leaf task's value function reveals which leaf task is more valuable. As a heuristic, the maximum attainable expected reward (rma. = max, V(s)) is used to compare relative importance between the leaf tasks. In Example 3.1, the maximum attainable expected reward, rma, is calculated for each leaf task and labeled in Fig. 3-6. 3.2.3 Ranking of Leaf Tasks The third step is to develop an importance ordering for the set of leaf tasks. The rma values calculated in the previous step will be useful for construction of this ordering. However, it cannot be used directly. For instance, consider T2 and T4 in Fig. 3-6. Although T2 has greater rma than that of T4 , the joint task T, x T2 is less rewarding than the joint task T3 x T4 . Thus, it would be preferable to assign agents to the joint 38 Algorithm 3: rankTaskTree Input: A task tree, tree Output: An ordered list of leaf tasks, L 1 foreach leaf task T' in tree do rd+- maxti Vi(t') 2 3 foreach internal node t' in tree do L Lhildn1 child 6 return L -- ) 5 rma fi (rm. , ...,ma L +- Sort(leaf tasks, compareTask(tree)) L rm.. 19 1 9 10 1 9 72 Ti T2 1 5 T3 52 91 T4 T, T2 T3 T4 (b) armax (a) rmx Figure 3-7: Relative Importance of Leaf Tasks. To determine an execution order for leaf tasks, it requires finding which leaf task would contribute more to the joint reward function. 
The first step is to compute the maximum attainable expected reward rma at each internal node in the tree, shown in sub-figure(a). This is carried out by using rm, at the leaf nodes and percolating up to the root node in the task tree. Then, it is the to compute the degree in which each sub-task contributes the task's cumulative reward, arm., shown in sub-figure(b). task T3 x T4 . Consequently, leaf task T4 should be ranked higher than T2 despite having lower rma value. The rankTaskTree subroutine, as shown in Algorithm 3, ranks the leaf tasks by pairwise comparison. In lines 1-2, the maximum expected cumulative reward, rma, is determined from each substask's value function. In lines 3-4, rma at each internal node is calculated by percolating up from the leaf tasks. This procedure is illustrated in greater detail for Example 3.1, as shown in Fig. 3-7. It should be noted that rma is a statistic, in this case the mean of a distribution. Thus, rm, = f(rd ,. .,rmi) is only an approximation. After computing rma for each node in the task tree, the last step in rankTaskTree is to sort the leaf tasks using rm, as a heuristic (line 5 of Algorithm 3). The degree 39 Algorithm 4: compareTask Input: A task tree, tree, leaf task T and Tk Output: leaf task T with lower rank 1 g +- mostRecentCommonAncestor(T , Tk) 2 pi +- children(g) A ancestors(T') pk - children(g) A ancestors(Tk) 3 4 if Orm. > 5 rm. or (&r,. = OrmT return L o return and rP. > rm.) then Tk in which each sub-task contributes to the task's cumulative reward, is calculated by The sorting procedure relies on a pairwise comparison subroutine, compareTask, shown in Algorithm 4. Given two leaf tasks T and Tk, the subroutine = -,i ?x. first identifies their most recent common ancestor, g (line 1). Then, in lines 2-3, the algorithm finds the ancestors, p' and qk, of the two leaf tasks at one level below g. Lastly, in lines 4-6, the leaf task whose ancestor has a higher Ormx value is given a higher priority ranking. If rmx of both sub-tasks are equal, the leaf task whose ancestor with higher rm, value is ranked higher. Recall that the composition of tasks form another task with larger planning spaces. Intuitively, Algorithm 4 first identifies the smallest joint task that the contains both leaf tasks. Then, it tries to identify which leaf task is more valuable to the execution of the joint task. The application of compare Task subroutine to Example 3.1, and the resulting importance ordering, are illustrated in Fig. 3-8. 3.2.4 Agent Assignment The final step is to assign agents to the leaf tasks in the importance ordering determined in the previous step. Recall the ordering was determined by rm. = max, V(s), which is the maximum expected cumulative reward. In general, however, an agent's current state will not coincide with the best state, that is s / argmax, V(s'). Thus, it requires determining (1) how well would an agent execute a leaf task, and (2) the condition in which a leaf task can be considered completed, such that the algorithm can move onto the next task in the importance ordering. 40 19 19 19 T, T2 10 9 10 9 2 5 T3 T4 25 19 T, Siii T2 T3 T4 iv i ii Higher Priority (b) (a) Figure 3-8: Ranking of Leaf Tasks. In sub-figure a), the algorithm compares whether leaf task 2 or leaf task 3 is more valuable to the joint reward function. First, the most recent common ancestor for the two leaf tasks is found, which is the root node in this case. 
Figure 3-8: Ranking of Leaf Tasks. In sub-figure (a), the algorithm compares whether leaf task 2 or leaf task 3 is more valuable to the joint reward function. First, the most recent common ancestor of the two leaf tasks is found, which is the root node in this case. Second, moving one level down from the most recent common ancestor, the two multiplication nodes are found. The ∂r_max and r_max values at this level are compared to determine the priority ranking between the pair of leaf tasks. In this case, the ∂r_max values are equal, so the r_max values are compared: T3 is of higher priority than T2 since 10 > 9, as shown by the colored r_max values. In sub-figure (b), each pair of leaf tasks is compared, which induces the importance ranking illustrated in the figure. Instead of comparing every pair of tasks, a depth first search algorithm can be implemented for better computational efficiency.

3.2.4 Agent Assignment

The final step is to assign agents to the leaf tasks in the importance ordering determined in the previous step. Recall that the ordering was determined by r_max = max_s V(s), the maximum expected cumulative reward. In general, however, an agent's current state will not coincide with the best state, that is, s ≠ argmax_{s'} V(s'). Thus, it is necessary to determine (1) how well an agent would execute a leaf task, and (2) the condition under which a leaf task can be considered completed, so that the algorithm can move on to the next task in the importance ordering.

The value function specifies the expected cumulative reward starting from the current state, so it is a good indication of how well an agent would execute a task. The completion criterion is implemented by specifying a set of minimum rewards {C_min^i} for the leaf tasks {T^i}. With domain knowledge, a designer can specify each value independently; in this thesis, C_min^i is chosen to be 0.8 r_max^i. The ratio C_min^i / r_max^i can be interpreted as the probability of successfully executing a task. This completion metric specifies the amount of desired redundancy in each leaf task, such that more agents (than the minimum required) will be assigned to task T^i until at least C_min^i units of reward are expected to be achieved. The computation of C_min for Example 3.1 is illustrated in Fig. 3-9. Lastly, an action is queried for each agent from the local policy of its assigned leaf task.

The assignTask subroutine, shown in Algorithm 5, details the task allocation process. For each leaf task, the expected cumulative reward R^i is initialized to zero, the boolean completion flag C^i is initialized to false, and the agent assignment list A^i is initialized as an empty list (line 1). The leaf tasks are executed (assigned agents) in order of decreasing importance (line 2). For each leaf task, the algorithm first identifies the agent, N_j, that would best execute the task (line 5). It then adds agent N_j to the agent assignment list A^i of task T^i (line 6). Recall that leaf tasks can be either singleton or multi-agent. Leaf task T^i will be executed once it has been assigned a sufficient number of agents (lines 7-9), at which point an action is queried for each agent and the expected cumulative reward R^i is updated.¹ If R^i exceeds the minimum required cumulative reward C_min^i, then task T^i is considered completed and its assigned agents are removed from the set of available agents, N (lines 10-13). The algorithm then moves on to the next leaf task, and the process terminates when every agent has been assigned a leaf task (lines 3-4). The application of the assignTask subroutine to Example 3.1 is shown in Fig. 3-10.

Algorithm 5: assignTask
Input: Agents N_j, sorted list of leaf tasks L, completion heuristics C_min^i, policies π^i and value functions V^i(·)
Output: Actions a_j for each agent j = 1, ..., n_a
1   R^i ← 0, C^i ← false, A^i ← {} for every leaf task
2   foreach leaf task T^i in L do
3       if N is empty then
4           break
5       foreach best agent N_j in N do
6           A^i ← A^i ∪ {N_j}
7           if T^i has enough agents then
8               a_j ← π^i(s_j)  ∀ N_j ∈ A^i
9               R^i ← f(V^i({s_j}))
10              if R^i > C_min^i then
11                  C^i ← true
12                  N ← N \ A^i
13                  break
14  return a_j for j = 1, ..., n_a

¹The details of the updating function f are explained in Chapter 4.

Figure 3-9: Computation of the Minimum Accepted Expected Reward C_min. For Example 3.1, the C_min values are 0.8, 7.2, 1.6, and 4 for leaf tasks T1 through T4, and the Roman numerals under each leaf indicate the importance ordering. The values of C_min reflect the degree of desired redundancy in the execution of leaf tasks and serve as a completion heuristic. They can be specified from domain knowledge, or using a heuristic approach such as multiplying r_max by a constant factor.
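The following is a simplified sketch of the task-allocation loop of Algorithm 5. The per-agent scoring, the aggregation function for the assigned agents' values, and the early commitment of agents to a task are illustrative simplifications; the thesis defers the exact update of R^i to Chapter 4.

```python
# Sketch: assign agents to leaf tasks in decreasing order of importance until
# each task's expected reward exceeds its completion threshold C_min.
# Agents are committed as soon as they are assigned (a simplification of Algorithm 5).
def assign_tasks(agents, agent_states, ranked_tasks, c_min, policies, values,
                 min_agents, aggregate):
    """policies[t][s] -> action; values[t][s] -> expected cumulative reward.
    min_agents[t]: number of agents leaf task t needs before it can execute.
    aggregate: placeholder f(list of V values) -> expected reward of the assignment."""
    available = list(agents)
    actions = {}
    for t in ranked_tasks:
        assigned = []
        while available:
            # pick the agent whose current state is most valuable for this task
            best = max(available, key=lambda j: values[t][agent_states[j]])
            available.remove(best)
            assigned.append(best)
            if len(assigned) >= min_agents[t]:
                for j in assigned:                       # query the local policy
                    actions[j] = policies[t][agent_states[j]]
                expected = aggregate([values[t][agent_states[j]] for j in assigned])
                if expected > c_min[t]:                  # task considered complete
                    break
        if not available:
            break
    return actions
```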
Figure 3-10: Assignment of Agents to Leaf Tasks. The smiley faces are symbols for agents; yellow indicates that an agent has not yet been assigned a task, and red that it has. The numbers labeled inside the leaf nodes are C_min, the minimum accepted cumulative reward, and the Roman numerals under each leaf node indicate the importance ordering. In sub-figure (a), the first agent is assigned to the highest-ranked leaf task, as shown by the green arrow. In sub-figure (b), since the first agent achieves an expected cumulative reward of 1.8, which is greater than the minimum required value of 1.6, the algorithm moves on to the second leaf task and assigns the next agent to it. In sub-figure (c), since the second agent only achieves an expected cumulative reward of 3.2, which is smaller than the minimum required value of 4, the next agent is still assigned to the second leaf task.

Algorithm 6: HD-MMDP
Input: Agents N_j; state, action, and transition model <S_j, A_j, P_j> of each agent, for j = 1, ..., n_a; discount factor γ; reward function R; and completion heuristic C_min
Output: Actions a_j for each agent j = 1, ..., n_a
1  tree ← Decomposition(<S_j, A_j, P_j>, R, γ)
2  foreach leaf task T^i in tree do
3      π^i, V^i ← ValueIterate(T^i)
4  L ← rankTaskTree(tree)
5  {a_j} ← assignTask(N_j, L, C_min, π^i, V^i(·))
6  return a_j for j = 1, ..., n_a

3.2.5 Summary

This section has developed a hierarchical decomposition (HD-MMDP) algorithm for Multi-agent Markov Decision Processes. The proposed algorithm is outlined in Algorithm 6, which summarizes its four major steps. First, exploiting structure in the factored reward function, HD-MMDP decomposes the original MMDP into a number of subproblems (line 1). Second, each subproblem is formed as an MDP and solved with value iteration (lines 2-3). Third, the subproblems are ranked using the r_max heuristic (line 4). Fourth, the subproblems are executed in decreasing order of importance (line 5), a process that determines an action for each agent (generating a joint policy).
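To make the end-to-end flow of Algorithm 6 concrete, the following sketch wires together the illustrative helpers from the previous sketches (value_iteration, rank_task_tree, assign_tasks, Node). All interfaces are assumptions made for this sketch; the decomposition step is represented by a task tree that has already been built from the factored reward function.

```python
# Sketch: the overall HD-MMDP flow (Algorithm 6) using the hypothetical helpers above.
# Decomposition and leaf-task solving run once off-line; ranking and task
# allocation are cheap enough to run at every decision step.
def hd_mmdp(agents, agent_states, root, leaves, leaf_mdps, min_agents, gamma=0.95):
    """leaf_mdps[leaf] = (states, actions, P, R) for that leaf task."""
    policies, values, c_min = {}, {}, {}
    for leaf in leaves:                                   # off-line: solve each leaf task
        states, actions, P, R = leaf_mdps[leaf]
        policies[leaf], values[leaf], leaf.r_max = value_iteration(
            states, actions, P, R, gamma)
        c_min[leaf] = 0.8 * leaf.r_max                    # completion heuristic
    ranked = rank_task_tree(root, leaves)                 # importance ordering
    return assign_tasks(agents, agent_states, ranked, c_min, policies, values,
                        min_agents, aggregate=sum)        # on-line: joint action
```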
3.3 Run Time Complexity Analysis

This section establishes the run time complexity of the HD-MMDP algorithm and shows that it compares favorably to classical DP approaches. Let n_t be the number of leaf reward functions, n_a the number of agents, |S| the size of a single agent's state space, |A| the size of a single agent's action space, and n_max the number of required agents in the largest leaf task.

1. Decomposition: O(n_t). In the structural decomposition step, each recursive call of the Decomposition subroutine expands one of the internal function nodes. Since each internal function node has at least two children, the number of internal function nodes is at most n_t − 1 (in the case of a perfect binary tree). Thus, the run time of this step scales linearly in the number of leaf tasks, n_t.

2. Solving Leaf Tasks: O(n_t |S|^{2 n_max} |A|^{n_max}). Recall from Section 2.3 that the run time complexity of value iteration is O(|S_MDP|² |A_MDP|). The sizes of the joint state and action spaces of n agents are |S|^n and |A|^n, respectively, so the largest leaf task (with n_max agents) has state and action spaces of sizes |S|^{n_max} and |A|^{n_max}. The run time complexity of solving the largest leaf task is therefore O(|S|^{2 n_max} |A|^{n_max}). Since there are at most n_t leaf tasks of this size, solving every leaf task has run time complexity O(n_t |S|^{2 n_max} |A|^{n_max}).

3. Ranking Leaf Tasks: O(n_t log n_t). Computing the maximum expected cumulative reward (r_max) for every internal function node has run time complexity O(n_t), because there are at most n_t − 1 internal function nodes and calculating r_max at each node involves a constant-time function evaluation. The sorting of the leaf tasks can be implemented as a depth first search over the task tree, which has run time complexity O(n_t log n_t).

4. Task Allocation: O(n_t n_a²). In the task allocation step (Algorithm 5), finding the best agent to execute a leaf task (line 5) requires looping through each agent, which is O(n_a). For each leaf task, this procedure may need to be repeated up to n_a times to reach the desired degree of redundancy (line 7). In addition, the subroutine loops through all n_t leaf tasks (line 2). Hence, the task allocation step is O(n_t n_a²).

The run time complexity of HD-MMDP is summarized in Table 3.2. For the applications considered in this work, the magnitudes of the variables are approximately n_t ≈ 10, |S| ≈ 100, |A| ≈ 10, n_max ≈ 3, and n_a ≈ 10. Thus, the most expensive step is solving the leaf tasks individually, as highlighted in the second row of the table.

Table 3.2: Run Time Complexity of HD-MMDP
    Step                        Run Time Complexity
    1) Decomposition            O(n_t)
    2) Solving Leaf Tasks       O(n_t |S|^{2 n_max} |A|^{n_max})
    3) Ranking of Leaf Tasks    O(n_t log n_t)
    4) Task Allocation          O(n_t n_a²)

In comparison, value iteration (VI) on the original MMDP has run time complexity O(|S|^{2 n_a} |A|^{n_a}). Since the planning spaces scale exponentially with the number of agents, VI is orders of magnitude more expensive than the proposed algorithm.

3.4 Summary

This section introduced the Hierarchically Decomposed Multi-agent Markov Decision Process (HD-MMDP) algorithm. First, common structures in an MMDP's reward function were explored and related to the coupling strength between agents. Then, the mechanics of the HD-MMDP algorithm were developed. The run time complexity of HD-MMDP was shown to be much lower than that of classical dynamic programming approaches. The next chapter performs an empirical analysis of HD-MMDP by applying it to a multi-agent persistent search and track (PST) mission.

Chapter 4

Application to Health Aware Planning

This chapter presents the application of the HD-MMDP algorithm to two multi-agent planning domains. It begins with a verbal description of a persistent search and track (PST) mission and then provides a mathematical formulation of this problem. The HD-MMDP algorithm is used to solve the PST mission, and the empirical results are analyzed and compared to those of other algorithms. Lastly, the chapter describes possible extensions to the HD-MMDP algorithm with application to an intruder monitoring mission.

4.1 Persistent Search and Track Mission (PST)

The persistent search and track mission was first proposed and developed by Bethke et al. [8, 45]. The mission involves a group of autonomous Unmanned Aerial Vehicles (UAVs) equipped with sensors for monitoring ground targets. The objective is to have the UAVs continuously survey a region of interest while satisfying a communication constraint. Uncertainty plays a major role in this planning problem since the UAVs have stochastic sensor failure models and stochastic fuel consumption dynamics.

Figure 4-1: Persistent Search and Track Mission. The objective is to have the UAVs monitor the tasking region while maintaining active communication with the base region. In addition, there exist various sources of uncertainty, such as stochastic fuel consumption dynamics and stochastic sensor failure models.
4.1.1 Mission Description

The mission scenario is illustrated in Fig. 4-1. There are three distinct geographical regions: the base, communication, and tasking regions. Ground target vehicles, such as supply trucks from other parties, may exist in the tasking region. UAVs are deployed to continuously monitor the tasking area while sending observations back to the base. For security reasons, the base region is located far from the tasking region, and no direct communication link can be reliably established. A communication relay region is therefore strategically positioned to connect the base and tasking regions, and a dedicated UAV is required to loiter in the communication region to enable reliable data transfer between them.

Fuel management is a crucial aspect of the PST mission. Given physical limitations, UAVs have finite fuel capacity. UAVs also have a stochastic fuel consumption model, due to uncertainties in the environment, such as wind gusts, and uncertainties in target vehicle behaviors (speed, turning radius) that require different degrees of maneuver aggressiveness. A UAV that runs out of fuel is declared Crashed and is permanently removed from the mission; thus, a major challenge is to minimize the risk of running out of fuel.

Sensor management is another crucial aspect of the PST mission. Each UAV is equipped with two sets of sensors: surveillance sensors that enable detection and tracking of ground targets, and actuation sensors that monitor actuator health. A UAV with a failed surveillance sensor cannot track ground targets effectively, but it can act as a communication relay. A UAV with an actuation sensor failure can neither track targets nor act as a communication relay. Neither failure mode leads to an immediate crash, and both can be repaired upon returning to base. In short, sensor failures lead to degraded capabilities.

The major challenges of planning in the PST domain are to (i) coordinate the group of UAVs to maximize coverage of the tasking area and (ii) hedge against agent-level uncertainties to prevent running out of fuel and to minimize the effects of sensor failures. The goal is to develop a policy that achieves sustainable performance.

4.1.2 Problem Formulation

The PST mission is formulated as a Multi-agent Markov Decision Process. Recall from Section 2.2 that an MMDP is defined by a tuple (n_a, S, A, P, R, γ): the number of agents, state space, action space, state transition function, reward function, and discount factor, respectively.

State Space: S

The state, s_i, of a UAV, N_i, is specified by a vector of three dimensions: geographical location y_i, fuel level f_i, and sensor health status h_i. Each of these quantities takes a scalar value. In particular,

    s_i = [y_i, f_i, h_i]^T    (4.1)

The set of possible locations consists of the base region Y_b, the communication region Y_c, and the tasking region Y_t. The set of possible fuel levels ranges from empty to full (F_max) in equipartitioned steps of Δf. The set of possible sensor health states consists of nominal health H_nom, surveillance sensor failure H_sen, and actuator sensor failure H_act. In summary,

    y_i ∈ Y = {Y_b, Y_c, Y_t}    (4.2)
    f_i ∈ F = {0, Δf, ..., F_max − Δf, F_max}    (4.3)
    h_i ∈ H = {H_nom, H_sen, H_act}    (4.4)

The joint state space is the Cartesian product of the individual state spaces, that is, S = S_1 × S_2 × ... × S_{n_a}. In this thesis, F_max is set to 10 and Δf is set to 1. Hence, the size of a single UAV's state space is |S_i| = 3 × 11 × 3 = 99, and the size of the joint state space is |S| = 99^{n_a}.
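A minimal sketch of the single-UAV state space defined above follows; the string labels for the regions and health states are illustrative encodings, not the thesis implementation.

```python
# Sketch: enumerate the single-UAV PST state space (location, fuel, sensor health).
from itertools import product

LOCATIONS = ['base', 'comm', 'task']                    # Y_b, Y_c, Y_t
F_MAX, DELTA_F = 10, 1
FUEL_LEVELS = list(range(0, F_MAX + DELTA_F, DELTA_F))  # 0, 1, ..., 10
HEALTH = ['nominal', 'sensor_fail', 'actuator_fail']    # H_nom, H_sen, H_act

SINGLE_UAV_STATES = list(product(LOCATIONS, FUEL_LEVELS, HEALTH))
print(len(SINGLE_UAV_STATES))        # 3 * 11 * 3 = 99
print(len(SINGLE_UAV_STATES) ** 10)  # joint state space of a ten-agent team, 99^10
```

The second print statement makes the exponential growth of the joint state space explicit: a ten-agent team already has on the order of 10^19 joint states.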
Action Space: A

The action, a_i, of a UAV is to move between the geographical regions, which are arranged in a chain with the communication region in the center; it is assumed that a UAV cannot fly directly from the base to the tasking region. In particular,

    a_i ∈ {−1, 0, 1}    (4.5)

which correspond to "move toward base", "stay", and "move toward tasking region", respectively. Depending on the state of a UAV, only a subset of the actions is allowed:

    a_i ∈ {−1, 0, 1}, if y_i = Y_c
    a_i ∈ {−1, 0},    if y_i = Y_t
    a_i ∈ {0, 1},     if y_i = Y_b
    a_i ∈ {0},        if f_i = 0    (4.6)

where the last line specifies that a UAV remains stationary once it has run out of fuel. The size of a single agent's action space, |A_i|, is three; the size of the joint action space, |A|, is 3^{n_a}.

State Transition Model: P

Since the MMDP assumes transition independence, it suffices to specify the state transition model of a single UAV. Recall that a UAV's state comprises three dimensions: geographical location, fuel level, and sensor health status. The transition dynamics for each agent's position are deterministic, such that an agent always reaches its intended location,

    y_i(k+1) = y_i(k),  if f_i = 0 or a_i(k) = 0
             = Y_b,     if y_i(k) = Y_c and a_i(k) = −1
             = Y_t,     if y_i(k) = Y_c and a_i(k) = +1
             = Y_c,     if y_i(k) = Y_t and a_i(k) = −1
             = Y_c,     if y_i(k) = Y_b and a_i(k) = +1    (4.7)

A stochastic fuel consumption model is developed to account for uncertainties in the environment. In particular, an agent consumes one unit of fuel with probability p_fuel and two units of fuel with probability 1 − p_fuel. An agent is refueled to the maximum fuel level upon returning to base, and an agent that has run out of fuel remains out of fuel:¹

    f_i(k+1) = f_i(k) − 1, with probability p_fuel,      if y_i(k+1) ≠ Y_b
             = f_i(k) − 2, with probability 1 − p_fuel,  if y_i(k+1) ≠ Y_b
             = F_max,                                     if y_i(k+1) = Y_b
             = 0,                                         if f_i(k) = 0    (4.8)

¹A state transition function should only depend on the system's current state, not its future state, so the use of y_i(k+1) is not allowed in general. It is acceptable here because y_i(k+1) is a deterministic function of s_i(k) and a_i(k).

Surveillance and actuator sensors can fail with probabilities p_sen and p_act, respectively. A UAV's failed sensor is fixed when the UAV returns to the base region:

    h_i(k+1) = h_i(k),  if f_i(k) = 0
             = H_nom,   if y_i(k) = Y_b
             = H_nom,   with probability (1 − p_sen)(1 − p_act), if y_i(k) ≠ Y_b and h_i(k) = H_nom
             = H_act,   with probability (1 − p_sen) p_act,      if y_i(k) ≠ Y_b and h_i(k) = H_nom
             = H_sen,   with probability p_sen,                  if y_i(k) ≠ Y_b and h_i(k) = H_nom    (4.9)

Reward Function: R

The reward function is designed to induce favorable outcomes by placing high rewards on a set of desirable states. In particular, a desirable state is characterized by (1) having agents fly in the tasking region, and (2) having at least one agent in the communication area to relay data between the base and tasking regions. This is made precise as

    R(s_1, ..., s_{n_a}) = I(N_com ≥ 1) r_sur N_sur    (4.10)

where N_com and N_sur are the numbers of capable UAVs flying in the communication and tasking regions, respectively; r_sur is a scalar value (units of reward for each UAV in the surveillance area); and I(clause) is the indicator function that returns 1 if the clause is true and 0 otherwise. Capable is defined as having the required sensor health status as described in Section 4.1.1. In particular,

    N_com = Σ_j I(y_j = Y_c) I(f_j > 0) I(h_j ≠ H_act)
    N_sur = Σ_j I(y_j = Y_t) I(f_j > 0) I(h_j = H_nom)    (4.11)
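The following sketch evaluates the reward function of Eqns. 4.10 and 4.11 on a joint state, using the illustrative state encoding from the earlier sketch; r_sur_unit stands in for the per-UAV surveillance reward r_sur and is a design parameter, not a value given in the thesis.

```python
# Sketch: the PST reward R(s) = I(N_com >= 1) * r_sur * N_sur built from
# per-agent capability indicators (Eqns. 4.10 and 4.11).
def capable_comm(state):
    loc, fuel, health = state
    return loc == 'comm' and fuel > 0 and health != 'actuator_fail'

def capable_surveil(state):
    loc, fuel, health = state
    return loc == 'task' and fuel > 0 and health == 'nominal'

def pst_reward(joint_state, r_sur_unit=1.0):
    n_com = sum(capable_comm(s) for s in joint_state)
    n_sur = sum(capable_surveil(s) for s in joint_state)
    return (n_com >= 1) * r_sur_unit * n_sur

# Example: one relay UAV (sensor failure is acceptable for relaying) plus two
# healthy UAVs in the tasking region earns a reward of 2.
print(pst_reward([('comm', 5, 'sensor_fail'),
                  ('task', 7, 'nominal'),
                  ('task', 3, 'nominal')]))
```

Note that the per-agent indicator terms are exactly the leaf reward functions identified in the structural decomposition that follows.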
Figure 4-2: Task Tree of the Persistent Search and Track Mission.

4.1.3 Structural Decomposition

Eqns. 4.10 and 4.11 can be combined and rewritten in the form of Eqn. 3.2, in which two leaf reward functions are identified:

    r_com(s_j) = I(y_j = Y_c) I(f_j > 0) I(h_j ≠ H_act)
    r_sur(s_j) = I(y_j = Y_t) I(f_j > 0) I(h_j = H_nom)    (4.12)

With the application of Alg. 2, a task tree for the PST mission is constructed, as shown in Fig. 4-2. There are two types of leaf tasks, namely communication and surveillance leaf tasks, built around Eqn. 4.12. The surveillance leaf tasks are coupled through a summation function node, which reflects the mission specification of having as many UAVs as possible flying in the tasking region. The communication leaf task is coupled to the set of surveillance leaf tasks through a multiplicative function node, which models the communication link constraint.

4.1.4 Empirical Analysis

Recall that an MMDP's state and action spaces scale exponentially with the number of agents and that value iteration (VI) has run time complexity O(|S|²|A|). Applying VI to obtain the optimal solution is therefore computationally prohibitive for n_a ≥ 4, as summarized in Table 4.1. HD-MMDP decomposes the PST mission into a number of singleton tasks, and therefore its computation is much cheaper than value iteration on the full MMDP. In simulation, it takes about 2.3 seconds to compute the optimal policies for the singleton leaf tasks in the off-line phase, and less than 2.0 seconds to generate a 50-step trajectory for a ten-agent PST mission in the on-line phase (task allocation step).

Table 4.1: Computational Time of Value Iteration on the PST Mission. VI is impractical for n_a ≥ 4 since it would require about 3 weeks to compute.
    number of agents (n_a)    |S|          |A|    Computational Time
    1                         99           3      ~ 0.1 s
    2                         9801         9      1 m 1 s
    3                         970299       27     57 m 27 s
    4                         96059601     81     ~ 21 days

Trade-off between Computational Tractability and Solution Optimality

A comparison between HD-MMDP and the optimal solution is shown in Fig. 4-3. The comparison is made on a three-agent scenario because this is the largest problem size that permits computing the optimal policy in a reasonable amount of time, according to Table 4.1. The graph shows that HD-MMDP's solution quality is within 8% of that of the optimal policy, yet HD-MMDP takes less than 1% of value iteration's computation time on the MMDP formed around the three-agent PST mission.

HD-MMDP yields a suboptimal solution partly because of the imposed decomposition structure. In solving leaf tasks individually, each agent only accounts for its own future uncertainties, whereas the optimal policy accounts for the future uncertainties of all agents simultaneously. In HD-MMDP, when an agent is assigned to the surveillance leaf task, it assumes that another agent will successfully complete the communication leaf task; it does not hedge against the uncertainty of how well the other agent will actually perform. However, explicitly accounting for the future uncertainties of all agents leads to exponential growth of the planning spaces. Hence, HD-MMDP reveals a trade-off between computational tractability and solution optimality: it uses the MDP formulation for the strong coupling encapsulated in leaf tasks and uses task allocation for weaker forms of coupling.

Figure 4-3: Three-Agent PST Mission Simulation Result. The red line shows the cumulative reward following a policy generated by the HD-MMDP algorithm; the blue line shows the cumulative reward following the MMDP's optimal policy.
HD-MMDP achieved sustainable cumulative reward (no UAV ran out of fuel) and produces reward within 8% of that of the optimal policy.

Decomposition Structure

HD-MMDP proposes a decomposition structure as revealed by the reward function. A natural question is whether this is a good decomposition structure. The effect of adopting various decomposition structures on solution quality is studied empirically on a nine-agent PST mission. A nine-agent scenario is chosen because 9 is an integer multiple of 3, which is the largest PST problem size that permits computing the optimal policy in a reasonable amount of time.

An alternative decomposition structure is to form three 3-agent leaf tasks. The first leaf task is a standard 3-agent PST mission, and the other two leaf tasks are PST missions without the communication link requirement; when combined, as in the MMDP formulation, only one agent is required to serve as a communication link. In this decomposition structure, subgroups are formed as large as is computationally plausible. The task allocation step follows the same set of heuristics as in the HD-MMDP algorithm.

Figure 4-4: Effect of the Decomposition Structure on a Nine-Agent PST Mission. The red line shows the cumulative reward using HD-MMDP's decomposition structure; the blue line shows the cumulative reward using three groups of three. The same task allocation algorithm was used on both decomposition structures. The graph shows that forming larger subgroups does not always induce better solution quality.

Fig. 4-4 shows the empirical result of using the two decomposition structures, averaged over 30 independent runs. HD-MMDP's decomposition structure (red) induces better solution quality than the alternative structure (blue). The three-groups-of-three decomposition structure took more than 2 hours to solve (solving instances of the 3-agent PST mission) and still yields inferior solution quality. This suggests that, in general, it is not beneficial to form large leaf tasks even when doing so is computationally feasible, because arbitrarily formed leaf tasks do not reflect structure in the joint problem. For instance, in the PST mission, earning a reward requires having one agent in the communication area and another agent in the tasking area. In a three-agent scenario, if two agents are running low on fuel and are scheduled to return to base, a capable (perfect sensor health and high fuel level) third agent should also return to base because it cannot earn any reward by itself. However, in a team of nine agents, when two agents are running low on fuel, the third capable agent should coordinate with the other six agents to maximize reward. In this case, arbitrarily formed leaf tasks introduce new structure (i.e., strong coupling between three agents) that does not exist in the original problem.

Pruning Solution Space

Recall that solving an MMDP means computing a policy, which specifies joint actions for all agents at every possible state of the system. In contrast, through the decomposition procedure, the HD-MMDP algorithm converts the MMDP into an agent assignment problem: the objective is to assign each agent to one of the leaf tasks such that the cumulative reward at the root node is maximized. This agent assignment problem is a non-trivial combinatorial optimization problem.
More importantly, each leaf task is stochastic, with only first-order statistics (expected reward) available. The combinatorial and stochastic nature of this agent assignment problem makes it difficult to optimize. In converting the MMDP into an agent assignment problem, a natural question is what benefit the transformation brings, especially if the latter is also difficult to solve. In fact, the solution space Π^dec of the agent assignment problem corresponds to a subset of the MMDP's policy space Π, that is, Π^dec ⊂ Π. It is observed that, on average, policies in Π^dec yield much better solution quality than those in Π; more precisely, E_{π∈Π^dec}[V^π(s)] > E_{π∈Π}[V^π(s)], where V^π(s) is the value function evaluated at state s under policy π.

For the ten-agent PST mission, random policies chosen from Π^dec (red) outperform random policies chosen from Π (green), as shown in Fig. 4-5. A random policy from Π^dec is constructed by assigning agents to leaf tasks randomly and then querying actions from each leaf task's local policy. A random policy from Π is constructed by choosing actions randomly for every agent. The green curve flattens after approximately 40 iterations because many UAVs had run out of fuel given the long sequence of random actions. In contrast, the red curve achieves sustained performance. Although the agent assignment step is random, the actions queried from the leaf tasks' local policies are not random: in the PST mission, regardless of whether a UAV with low fuel is assigned to the communication leaf task or the surveillance leaf task, the local policies of both leaf tasks will send the UAV back to base for refueling. Thus, no UAV ran out of fuel under a random policy from Π^dec. In short, the decomposition step turns the original MMDP into an agent assignment problem, and this conversion prunes the solution (policy) space of the original problem.

Figure 4-5: Pruning Solution Space. The green line shows the cumulative reward obtained by choosing actions randomly for every agent (following a random policy π ∈ Π). The red line shows the cumulative reward induced by assigning agents randomly to leaf tasks (following a random policy π ∈ Π^dec). Both curves are the average of 30 independent trials. The decomposition procedure can be seen as a pruning step because Π^dec ⊂ Π and, on average, policies in Π^dec have better solution quality than those in Π.

4.1.5 Simulation Results

HD-MMDP is applied to solve the ten-agent PST mission, whose planning domain has more than 10^19 states and 10^4 actions. It is estimated that value iteration would take centuries to compute the optimal policy for such enormous planning spaces. This thesis compares HD-MMDP against another multi-agent planning algorithm, known as the Group Aggregate Decentralized Multi-agent Markov Decision Process (GA-dec-MMDP) [45]. To the author's knowledge, among algorithms that are computationally tractable in planning spaces of this size, GA-dec-MMDP is the best solution approach for PST missions. GA-dec-MMDP is an approximate dynamic programming algorithm (see Section 1.2.1) specialized for multi-agent planning problems. Its main idea is that each agent can model the rest of the team in a reduced form that only needs to capture important aggregate properties.
For instance, in the PST mission, each agent can make a good decision if it can accurately predict the joint configuration of its teammates (the number of agents surveying the tasking area and the number of agents acting as a communication link); an agent does not need to predict which particular agent performs a sub-task or how. A domain expert can design a reduced model by specifying a set of feature vectors that capture such aggregate properties. The use of a reduced model makes GA-dec-MMDP computationally tractable, as its run time scales linearly with the number of agents.

However, GA-dec-MMDP has two major drawbacks. First, it requires a domain expert to specify a set of feature vectors; this usually relies on extensive domain knowledge and is difficult to generalize. Second, it requires developing a state transition function for the reduced model (i.e., how many agents will survey the tasking area). This is difficult because an accurate state transition function depends on the joint policy, yet deriving the joint policy requires specifying an accurate state transition function. To overcome this issue, Redding developed an approximate method using matrix extrapolation techniques.

In contrast, the HD-MMDP algorithm automatically identifies a decomposition structure from the reward function and does not require specific domain knowledge. Fig. 4-6 shows that HD-MMDP outperforms the GA-dec-MMDP algorithm by more than 35% in cumulative reward. Both algorithms are real-time implementable for the ten-agent PST mission.

Figure 4-6: Simulation of the Ten-Agent PST Mission. The comparison in cumulative reward shows that HD-MMDP outperforms GA-dec-MMDP by more than 35%. Both curves are the average of 30 independent trials.

4.2 Intruder Monitoring Mission

The intruder monitoring mission (IMM) is developed as an extension of the PST mission. IMM requires a group of UAVs to continuously survey a region of interest to detect elusive ground intruders. When a target is discovered, multiple UAVs act in concert to escort the target out of the region of interest. In comparison with the PST mission, IMM requires more complex interaction between agents because (1) it has multi-agent leaf tasks and (2) it encodes temporal constraints between leaf tasks. HD-MMDP is applied to solve IMM, and various extensions to HD-MMDP are proposed.

4.2.1 Mission Description

The mission scenario is illustrated in Fig. 4-7. The tasking region is discretized into an n × m grid world, which a group of UAVs is required to continuously monitor. Ground vehicles from the opposing forces may emerge stochastically and are initially hidden from the group of UAVs (fog of war). Further, the ground vehicles follow a known stochastic policy that guides them toward a goal position. The intruder monitoring mission requires the UAVs to (1) detect the initially hidden ground vehicles, (2) prevent the ground vehicles from reaching their goal position, and (3) escort the ground vehicles out of the tasking area.

Figure 4-7: Intruder Monitoring Mission. A blue ground vehicle from the opposing forces tries to reach its goal position, marked by the orange star. It has a known feedback policy that generates the trajectory shown in blue. The mission is to coordinate the two red UAVs so that they escort the ground vehicle out of the tasking area by forcing it to follow the green trajectory.
The three mission objectives can be formulated as MDPs, which are called the surveillance, blocking, and cornering leaf tasks, as illustrated in Fig. 4-8. In the surveillance leaf task, each grid cell is associated with an uncertainty value that relates to the probability of detecting a ground target in that cell. The uncertainty value increases with time and can only be reduced by flying a UAV to its vicinity. The goal is to reduce the uncertainty in the tasking region so as to maximize the probability of detecting ground targets. In Fig. 4-8(a), a lighter shade of gray corresponds to a lower uncertainty value; note how the UAV reduces the uncertainty in its close proximity.

The ground vehicles attempt to reach a goal position, but cannot move into grid positions occupied by UAVs. In the blocking leaf task, shown in Fig. 4-8(b), a single UAV tries to prevent the ground vehicle from reaching the goal position. In the cornering leaf task, shown in Fig. 4-8(c), two UAVs try to escort the ground vehicle out of the tasking area (through the two green corners) in a coordinated effort. In both leaf tasks, a UAV needs to predict the future positions of the ground vehicle and thereby position itself strategically to influence the ground vehicle's trajectory. A major challenge is to account for the stochasticity in the ground vehicle's choice of action.

Figure 4-8: Leaf Tasks in the Intruder Monitoring Domain. Sub-figures (a), (b), and (c) correspond to the surveillance, blocking, and cornering leaf tasks, respectively.

More importantly, leaf tasks in IMM are coupled through operational constraints. In particular, the blocking and cornering leaf tasks can only be executed after detection of a ground vehicle in the surveillance leaf task, which is a temporal constraint. In addition, the blocking and cornering leaf tasks (for one ground vehicle) are mutually exclusive, such that only one of them can be executed at a time. In contrast, in the PST domain, leaf tasks are coupled through the reward function but can be attempted independently. Another major challenge is to account for these operational constraints while solving the IMM.

4.2.2 Problem Formulation

Although there are two types of robotic vehicles, only the UAVs are considered agents from the planner's perspective; the ground vehicles are considered part of the environment. In particular, the planner should determine a set of actions for the group of UAVs in response to the stochasticity in the motion pattern of the ground vehicles. To distinguish between the two types of vehicles, N_i is used to denote the i-th UAV and M_j is used to denote the j-th ground vehicle.

State Space

The state, s_i, of a UAV, N_i, is specified by a vector of two dimensions, corresponding to the x and y coordinates of the UAV. In particular,

    s_i = [x_i, y_i]^T    (4.13)

The state, q_j, of a ground vehicle, M_j, is specified by a vector of three dimensions. The first two dimensions correspond to the x and y coordinates of the ground vehicle, and the third dimension is a binary variable specifying whether the ground vehicle has been detected. In particular,

    q_j = [x_j, y_j, d_j]^T    (4.14)

Further, the environment is associated with a matrix that specifies an uncertainty value for each grid position, W = [w_mn]. The configuration of the entire system is specified by the joint state of the UAVs, the ground targets, and the uncertainty in the environment,

    s = (s_1 × ... × s_{n_a}) × (q_1 × ... × q_{n_g}) × W    (4.15)
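The joint state of Eqn. 4.15 can be represented compactly, for example as in the following sketch; the dataclass layout is an illustrative encoding, not the thesis code.

```python
# Sketch: the IMM joint state — UAV positions, ground vehicle states
# (position plus a detected flag), and the grid of uncertainty values W.
from dataclasses import dataclass, field
from typing import List

@dataclass
class UAVState:
    x: int
    y: int

@dataclass
class GroundVehicleState:
    x: int
    y: int
    detected: bool = False

@dataclass
class IMMState:
    uavs: List[UAVState]
    intruders: List[GroundVehicleState]
    uncertainty: List[List[int]] = field(default_factory=list)  # W = [w_mn], 0..w_max

# Example: three UAVs, two undetected intruders, an 8 x 6 grid of uncertainty values
state = IMMState(
    uavs=[UAVState(3, 2), UAVState(4, 3), UAVState(5, 2)],
    intruders=[GroundVehicleState(0, 0), GroundVehicleState(7, 0)],
    uncertainty=[[5] * 6 for _ in range(8)],
)
```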
Action Space

The action space, A_i, defines the set of possible actions for a UAV. Each agent can choose to move to a nearby location within 2 units of Manhattan distance of its current location, as shown in Fig. 4-9(a). A UAV is allowed to move neither outside the grid world nor into a grid position intended by a ground vehicle. The set of possible actions of the ground vehicles is shown in Fig. 4-9(b), which illustrates that (i) UAVs are more agile than ground vehicles and (ii) ground vehicles cannot remain stationary.² The ground vehicle's choice of action is described in the state transition model.

²Technically, the ground vehicles are modeled as part of the environment, and their choices of action are not part of the action space.

Figure 4-9: Action Space of the Intruder Monitoring Mission. Sub-figures (a) and (b) show the sets of permissible actions for a UAV and a ground vehicle, respectively. Each shaded square shows a possible next state given the current position of the vehicle, and each next state is associated with a unique action (i.e., two squares to the left). UAVs are assumed to be more agile (able to travel to a larger area) than ground vehicles, and ground vehicles cannot remain stationary.

State Transition Model

The state transition model consists of three major components, which describe the state evolution of the UAVs, the ground vehicles, and the uncertainty values w_mn. The state evolution of each UAV is deterministic, such that a UAV always reaches its intended position. The state evolution of the uncertainty value w_mn is also deterministic: w_mn is decremented by w_dec (e.g., 5 units) if a UAV hovers at the particular grid position, and is incremented by w_inc (e.g., 1 unit) otherwise. The uncertainty value is upper bounded by w_max, such that w_mn ∈ [0, w_max].

The state evolution of each ground vehicle is characterized by a known stochastic policy. Among the four permissible actions illustrated in Fig. 4-9(b), the stochastic policy selects the two actions that lead to states closest (in Euclidean distance) to the goal location, and a probability distribution is specified over these two actions: the most favorable action is chosen 75% of the time and the second action 25% of the time. This policy is illustrated in Fig. 4-10, with the longer arrow indicating the more favorable action. The stochastic policy also takes into account the UAVs' positions, as illustrated in Fig. 4-10(b). In particular, the red UAV in Fig. 4-10(b) prevents the blue ground vehicle from traveling upwards; thus, the best and second-best actions for the ground vehicle are traveling to the right and traveling to the left, respectively. Similarly, the ground vehicle's choice of actions is changed in the grid positions inside the red T-shaped box.

Figure 4-10: State Transition Model of the Intruder Monitoring Mission. The two pale blue arrows at each grid position illustrate the stochastic policy of the blue ground vehicle. The action represented by the longer arrow is chosen 75% of the time and that represented by the shorter arrow is chosen 25% of the time. The presence of a red UAV induces a change in the ground vehicle's choice of action, as shown inside the red box in sub-figure (b).

When a ground vehicle first appears in the grid world, it is undetected, such that d_j = 0. The ground vehicle remains undetected with probability proportional to the uncertainty value at its position, such that P(d_j(k+1) = 0) = w_mn / w_max. Once a ground vehicle is detected, it remains detected.
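A sketch of the ground vehicle's stochastic policy and of the detection update described above follows. The grid bounds, the blocked-cell handling, and the tie-breaking are simplified assumptions for illustration only.

```python
# Sketch: ground vehicle picks one of the two admissible moves closest to its
# goal (75% / 25%); detection persists once achieved, otherwise the vehicle
# stays hidden with probability w_mn / w_max.
import math
import random

def ground_vehicle_step(pos, goal, blocked, grid=(8, 6), p_best=0.75):
    x, y = pos
    moves = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]        # no "stay" action
    moves = [(mx, my) for mx, my in moves
             if 0 <= mx < grid[0] and 0 <= my < grid[1] and (mx, my) not in blocked]
    moves.sort(key=lambda m: math.dist(m, goal))                     # closest-to-goal first
    if len(moves) == 1:
        return moves[0]
    return moves[0] if random.random() < p_best else moves[1]

def detection_update(detected, w_mn, w_max=5):
    """Once detected, stay detected; otherwise remain hidden w.p. w_mn / w_max."""
    if detected:
        return True
    return random.random() >= w_mn / w_max

# Example: an intruder at (2, 3) heading for (7, 5) with a UAV blocking (3, 3)
print(ground_vehicle_step((2, 3), (7, 5), blocked={(3, 3)}))
```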
Reward Function

The reward function is designed to specify a set of desirable states. Recall that the mission requires the UAVs to (1) detect a ground vehicle, (2) prevent the ground vehicle from reaching its goal position, and (3) escort the ground vehicle out of the tasking area. This is made precise as follows,

    R(s) = Σ_{j=1}^{n_g} I(d_j = 1) [ c(q_j, s_{i1}, s_{i2}) ⊕ g(q_j, s_{i1}) ]    (4.16)

where s is as defined in Eqn. 4.15; I(d_j = 1) is an indicator function that requires the ground vehicle to be detected; c(q_j, s_{i1}, s_{i2}) is a function that specifies the cornering condition; g(q_j, s_{i1}) is a function that specifies the blocking condition; and "⊕" expresses mutual exclusion. The cornering condition rewards states in which the ground vehicle is forced into one of the green squares illustrated in Fig. 4-8(c). The blocking condition rewards each state inversely with the distance between the ground vehicle and the goal location.

4.2.3 Extensions to HD-MMDP

The formulation of the intruder monitoring mission contains elements that do not conform to the definition of an MMDP given in Section 2.2. In particular, the joint state space is not the Cartesian product of the individual agents' (UAVs') state spaces, because the system also needs to account for the ground vehicles. In addition, there exist operational constraints between leaf tasks; for instance, the blocking leaf task and the cornering leaf task cannot be executed simultaneously. To this end, several extensions to the HD-MMDP algorithm are proposed. For instance, the definition of a leaf task is relaxed to include states characterizing the environment. However, the key idea behind HD-MMDP is unchanged: the algorithm first decomposes the planning problem into a number of subproblems by exploiting structure in the reward function, then solves each subproblem to compute a local policy, and finally assigns each agent to one of the subproblems and queries an action for each agent from the corresponding local policy.

Decomposition Structure and Ranking Heuristic

Three leaf reward functions can be identified from Eqn. 4.16, around which the surveillance, cornering, and blocking leaf tasks are formed. In particular,

    r_surv = Σ_{j=1}^{n_g} I(d_j = 1)
    r_cornering = c(q_j, s_{i1}, s_{i2})
    r_blocking = g(q_j, s_{i1})    (4.17)

Note that r_surv cannot be decomposed further, because the detection of each ground vehicle is coupled through the uncertainty values w_mn in the environment (see Section 4.2.2); maximizing the number of ground vehicles detected is equivalent to minimizing the uncertainty in the environment, that is, minimizing Σ_{m,n} w_mn.

Figure 4-11: Task Tree of the Intruder Monitoring Mission. Each ground vehicle is associated with two mutually exclusive leaf tasks, namely the blocking and cornering leaf tasks. A single surveillance leaf task is associated with all ground vehicles. The red curves encode the temporal constraint that a ground vehicle must first be detected; the green addition node encodes the mutual exclusion relation.

Novel structures are introduced into the task tree (Fig. 4-11) to encode the operational constraints. For instance, the red curves encode the temporal constraint that a ground vehicle must be detected before execution of the blocking or cornering leaf tasks, and the green addition node encodes that the blocking and cornering leaf tasks cannot be executed simultaneously. The heuristics for ranking the leaf tasks also need to be modified to satisfy the operational constraints. In this example, the surveillance leaf task is ranked the highest because ground vehicles need to be detected first.
The cornering leaf task is given precedence over the blocking leaf task because it is desirable to force the ground vehicles out of the tasking area.

State Abstraction

State abstraction is a concept more frequently applied to the hierarchical decomposition of single-agent MDPs. It is based on the idea that some state dimensions are irrelevant to the execution of a subtask and can be abstracted away, thereby drastically reducing the size of the subproblem's planning spaces. This idea can also be applied to the HD-MMDP algorithm. For instance, in IMM, the detection variable d_j of the ground vehicle's state q_j is irrelevant to the cornering and blocking leaf tasks (the vehicle is always assumed to be detected). Similarly, the uncertainty values w_mn in the environment can be abstracted away in the formulation of the cornering and blocking leaf tasks.

Solving Leaf Tasks Approximately

In the HD-MMDP algorithm, value iteration is used to solve each leaf task for simplicity. Yet each leaf task is a standalone MDP and can therefore be solved using any generic MDP solution approach, and taking advantage of a leaf task's local structure can yield significant computational savings. For instance, assuming w_mn ∈ [0, 5], the surveillance leaf task in IMM has 6^{n×m} states, since each grid location is associated with a distinct uncertainty value. Even for a 5 × 5 grid world, value iteration would take months to iterate over such a state space. However, since the state transition model of the surveillance leaf task is deterministic, as described in Section 4.2.2, a general purpose optimization algorithm can be applied. In particular, a sequential greedy algorithm is implemented, in which each UAV sequentially chooses the action that leads to the greatest uncertainty reduction. In short, approximate MDP solution approaches can be applied to leaf tasks for additional computational efficiency.
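The following is a sketch of the sequential greedy heuristic mentioned above for the surveillance leaf task: each UAV in turn picks the reachable cell whose uncertainty it can reduce the most, given the choices already made by earlier UAVs. The reachability set and the uncertainty bookkeeping are illustrative assumptions.

```python
# Sketch: sequential greedy surveillance — UAVs choose target cells one after
# another, each maximizing its own uncertainty reduction given earlier choices.
def sequential_greedy_surveillance(uav_positions, uncertainty, reach=2, w_dec=5):
    chosen, taken = [], set()
    W = [row[:] for row in uncertainty]                   # don't mutate the input grid
    for (ux, uy) in uav_positions:
        candidates = [(x, y)
                      for x in range(len(W)) for y in range(len(W[0]))
                      if abs(x - ux) + abs(y - uy) <= reach and (x, y) not in taken]
        best = max(candidates, key=lambda c: min(W[c[0]][c[1]], w_dec))
        chosen.append(best)
        taken.add(best)
        W[best[0]][best[1]] = max(0, W[best[0]][best[1]] - w_dec)
    return chosen                                          # one target cell per UAV

# Example on a 5 x 5 grid with maximal uncertainty everywhere
grid = [[5] * 5 for _ in range(5)]
print(sequential_greedy_surveillance([(0, 0), (4, 4)], grid))
```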
4.2.4 Simulation Results

The HD-MMDP algorithm (with the modifications outlined in Section 4.2.3) is applied to solve an intruder monitoring mission with three UAVs and two ground vehicles residing in an 8 × 6 grid world. Following a policy returned by the HD-MMDP algorithm, a simulation was run for 50 steps. In all 30 independent trials from the same initial configuration, HD-MMDP's policy achieved sustainable performance (steadily increasing cumulative reward). One simulation trajectory is illustrated in Fig. 4-12 to highlight key aspects of HD-MMDP's performance.

The UAVs are initialized in the center region and two ground vehicles emerge from the bottom corners, as illustrated in Fig. 4-12(a). Since the ground vehicles are initially undetected (labeled with gray disks), the UAVs perform the surveillance leaf task, as indicated by the green circles. Up to time step 3, as shown in Fig. 4-12(b), all ground vehicles remain undetected and the UAVs distribute themselves more evenly in the grid world. At time step 4, as shown in Fig. 4-12(c), the purple ground vehicle is detected and two UAVs switch from the surveillance leaf task to the cornering leaf task targeting the newly detected purple ground vehicle. Fast-forwarding to steps 9 and 10, as shown in Fig. 4-12(d, e), two UAVs have forced the purple ground target into one of the green corners, while the other UAV tries to prevent the orange ground vehicle from reaching the goal location. In step 9, the orange ground vehicle chose to go left while the UAV was expecting it to go right; this reflects the stochasticity in the ground vehicle's policy, which randomly chooses between the best and second-best actions. In steps 11 and 12, as shown in Fig. 4-12(f, g), the cornered purple ground vehicle exits and a new blue ground vehicle enters. Of the two UAVs near the bottom right corner, one switches to a surveillance task while the other moves upward to help escort the orange ground vehicle out of the grid world.

Figure 4-12: Simulation of the Intruder Monitoring Mission. Red squares represent UAVs; purple, orange, and blue disks represent ground vehicles. The orange star represents the ground vehicles' intended goal location. The green squares are the locations where the UAVs would like to trap the ground vehicles. Each vehicle's choice of action is shown by the arrow emanating from it. If a ground vehicle is undetected (d_j = 0), it is labeled with an additional gray disk (e.g., sub-figure (a)). If a UAV is assigned to block or corner a particular ground vehicle, it is labeled with a disk of that ground vehicle's color (e.g., sub-figure (c)). For clarity, the uncertainty value w_mn in each grid cell is not shown. Sub-figure (a) shows the initial configuration of the system; sub-figures (b-g) show the configuration of the system at steps 3, 4, 9, 10, 11, and 12, respectively.

4.3 Summary

This section formulated a multi-agent persistent search and track (PST) mission and an intruder monitoring mission (IMM). HD-MMDP was applied to solve the PST domain and showed a 35% improvement over GA-dec-MMDP. The IMM was shown to have more complicated operational constraints than the MMDP formulation allows; thus, various extensions to HD-MMDP were proposed and used to solve the IMM. The next chapter describes a hardware implementation of the IMM.

Chapter 5

Experimental Setup

This chapter presents the hardware and software infrastructure that enables complex multi-agent missions with real robotic vehicles. It begins with a description of a testbed environment (Fig. 5-1) with real-time motion tracking capability. It then explains the software organization, which coordinates a variety of functionalities, from high-level mission planning to low-level vehicle motor control. Next, it develops a novel optimization-based solution approach to the path planning problem. Lastly, it presents a hardware implementation of the intruder monitoring mission described in Sec. 4.2.

Figure 5-1: Robotic Vehicles and Experiment Environment. A ground robot and two quadrotors with package delivery capability are shown in sub-figure (a). The hardware experiment environment, equipped with the Vicon motion capture system (inside the yellow boxes), is shown in sub-figure (b).

5.1 RAVEN Testbed

The Real-time indoor Autonomous Vehicle test ENvironment (RAVEN) [57, 56] is a facility that enables rapid prototyping and testing of autonomous robotic technologies. It is an indoor space of approximately 12 m × 8 m × 2 m, as shown in Fig. 5-2(a). The RAVEN testbed is equipped with a Vicon motion capture system [52], which consists of 18 Vicon cameras, shown in Fig. 5-2(b). Each Vicon camera emits infrared radiation and detects its reflection from lightweight reflective markers mounted on the autonomous vehicles. By combining infrared images from multiple cameras, the Vicon system provides high-fidelity state measurements of multiple targets at 100 Hz, with sub-millimeter accuracy.

Given the construction of the building and the placement of computer systems, the Vicon-enabled volume is non-convex, as shown in Fig. 5-2(a). In particular, the bottom left corner does not have Vicon coverage and there is a pole in the middle of the space.
The space's non-convexity poses technical challenges for autonomous path planning, which are explored in Section 5.4.

Figure 5-2: Schematic of the RAVEN Testbed. Sub-figure (a) illustrates the Real-time indoor Autonomous Vehicle test ENvironment (RAVEN). The space is equipped with a Vicon motion capture system and an image projection system, shown in blue and green, respectively; the testbed's coordinate system is labeled in black, and the red squares illustrate the Vicon camera placement. A picture of a Vicon camera is shown in sub-figure (b).

A unique feature of the RAVEN testbed is an image projection system, shown in Fig. 5-3. It consists of multiple downward-facing Sony VPL-FH31 projectors mounted to the ceiling, calibrated to cast a contiguous image onto the floor; the projection area is shown as a rectangle with a green hatched pattern in Fig. 5-2(a). Acting as a window into the planner's belief state, the projected image can reveal important information about a mission. For instance, in Fig. 5-3 it shows the uncertainty state of the world as different levels of transparency, the position of the quadrotor as a purple cross, the intentions of the ground vehicles as arrows of various colors, and so on.¹ In path planning missions, it shows the intended trajectory of a vehicle. In short, the projection system helps observers better understand the movement of autonomous vehicles in complex missions, which in turn serves as a useful tool for debugging, evaluation, and demonstration purposes.

¹This is a snapshot of the intruder monitoring mission, which is explained in Section 5.5.

Figure 5-3: Image Projection Capability. The projection system reveals important information regarding the planner's belief state, such as the intent of a robotic agent and the state of the environment. The projection system is a vital tool for helping observers understand a complex multi-agent mission. Shown in this figure is a snapshot of the intruder monitoring mission (IMM).

5.2 Autonomous Vehicles

Various types of autonomous vehicles have been developed for the RAVEN testbed, as shown in Fig. 5-4. This section provides a short description of the robotic vehicles involved in the persistent search and track (PST) and intruder monitoring (IMM) missions. A quadrotor with recharge capability is shown hovering under a fan in Fig. 5-4(a). The quadrotor carries a 3D-printed battery cage, which houses a three-cell 1300 mAh lithium ion battery. The quadrotor is controlled via successive loop closures, with the attitude control loop (inner loop) running on-board and the acceleration control loop (outer loop) running on a stationary ground computer [19]. A trajectory generator produces a series of waypoint (position, velocity, and acceleration) commands, which are piped into the acceleration control loop. The quadrotor can be modified to carry up to 500 grams of cargo, and it communicates with a ground computer via an XBee connection.

The quadrotor can land autonomously on a custom-designed battery recharge station, as shown in Fig. 5-4(b). The recharge station [54, 53] has two rotating drums that each hold four battery cages. As a quadrotor lands on the recharge station, the depleted battery is moved from the quadrotor into an empty battery bay, and a fully charged battery is moved into the quadrotor. While performing a battery swap, the quadrotor is secured with a clamping mechanism that also acts as a temporary power source.
The quadrotor thereby remains powered on during a battery swap and retains its memory (i.e., integrator values). Batteries take less than 50 minutes to charge and are charged independently. A quadrotor can fly for approximately 7 minutes on a fully charged battery, so a recharge station with eight batteries can support the continuous operation of one quadrotor. The recharge station communicates with a ground computer via a Bluetooth connection.

Autonomous ground robots are based on commercially available iRobot Create vehicles, as shown in Fig. 5-4(c). An XBee module is added to the commercial product to enable wireless communication with a ground computer. The ground vehicles accept linear and angular velocities as control inputs, and they also require a trajectory generator for computing paths in the non-convex lab space.

Figure 5-4: Autonomous Robotic Agents. Sub-figure (a) shows a quadrotor with a removable battery cage hovering under a fan. Sub-figure (b) shows a battery recharge station executing an automated battery swap. Sub-figure (c) shows a pair of iRobot Create vehicles.

5.3 Software Architecture

The software infrastructure developed in RAVEN utilizes the Robot Operating System (ROS) [43], a set of libraries for robotic applications. ROS is a flexible framework that encourages modular software design. In particular, ROS abstracts away the need to explicitly fork new processes, spawn new threads, and synchronize between threads. A ROS program is organized as a graph, with each node being a standalone executable and each edge being a data connection between nodes. ROS facilitates communication between different executables by introducing simple data transfer protocols (topics and services). ROS is particularly suitable for multi-agent applications because each node can be developed and tested independently.
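The following is a minimal sketch of a ROS node in the publish/subscribe style described above, written with rospy. The topic names and the use of a pose message as a trivial waypoint command are hypothetical placeholders, not the actual interfaces of the RAVEN software stack.

```python
#!/usr/bin/env python
# Sketch: a minimal rospy node that subscribes to a (hypothetical) pose topic
# and republishes each pose as a waypoint command.
import rospy
from geometry_msgs.msg import PoseStamped

def pose_callback(msg, publisher):
    # Relay the measured pose as a (trivial) waypoint command.
    publisher.publish(msg)

if __name__ == '__main__':
    rospy.init_node('example_relay')
    cmd_pub = rospy.Publisher('/example/waypoint_cmd', PoseStamped, queue_size=10)
    rospy.Subscriber('/example/vicon_pose', PoseStamped, pose_callback,
                     callback_args=cmd_pub)
    rospy.spin()   # hand control to ROS; callbacks fire as messages arrive
```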
The software architecture that supports the PST and IMM experiments is shown in Fig. 5-5. Each rectangular box represents a ROS node, and the arrows represent data connections. The Vicon Data node, shown in the top left corner, is a ROS wrapper around the Vicon software and publishes filtered position measurements of all robotic vehicles at 100 Hz. The Mission Planner node, shown in the top right corner, is an implementation of high-level planning algorithms, such as HD-MMDP; given the operational state (i.e., battery level, grid position), it outputs mission-level commands, such as fly to the surveillance area or fly to base for a battery recharge. A Visualization node is implemented using ROS's rviz library; it simulates a virtual environment and illustrates the position and operational status of each vehicle.

The Mission Manager node is a central hub that aggregates all information about the system. It keeps a list of all vehicles of every type (quadrotors, ground vehicles, and recharge stations) and maintains a task queue for each individual vehicle. When the Mission Planner node issues multiple waypoint commands, the Mission Manager node appends these waypoint commands to the task queues of the corresponding vehicles. The Mission Manager node determines the condition for successfully completing a task (i.e., reaching an ε-neighborhood of the desired waypoint) and decides when to send out the next task in the queue.

Figure 5-5: Software Architecture. The software layer's modular design supports a wide variety of hardware experiments. The major components and the data connections between them are shown in this figure.

For stationary vehicles (i.e., recharge stations), operational commands are sent directly to the vehicle controllers. For mobile vehicles, a Path Planner node is required to generate feasible, collision-free trajectories, which are then sent to the vehicle controllers as sequences of waypoints. The algorithm design of the Path Planner node is explained in greater detail in Sec. 5.4. Further, the software infrastructure accommodates both real and virtual vehicles, a capability that is crucial for reducing hardware complexity during software development. The virtual vehicles publish their simulated states (i.e., positions), which are made available to the Mission Manager node.

5.4 Path Planner

The path planning problem is in itself a well-investigated research topic. Researchers have proposed optimization based algorithms [48, 20] and sampling based algorithms [33, 31, 34]. This thesis does not attempt to provide a comprehensive treatment of the path planning problem; rather, it seeks to find or develop an algorithm that can generate feasible trajectories for multiple (1-6) vehicles in real time. The multi-agent path planning problem is to find a set of control inputs such that each vehicle reaches its desired final position while satisfying collision constraints; in particular, vehicles must steer clear of static obstacles and avoid running into each other. In this work, since robust vehicle-level controllers exist for tracking trajectories, the control inputs are assumed to be a sequence of waypoint commands, [pos, vel, accel].

5.4.1 Potential Field

The potential field algorithm models vehicles, static obstacles, and the room boundary as particles with the same charge, and it models each vehicle's desired position as a particle with the opposite charge. The algorithm computes the potential field of the system and chooses an input that minimizes the potential; hence, each vehicle is attracted to its desired position and repelled from static obstacles and other vehicles. For computational reasons, the potential function is often chosen such that it decays to zero beyond a certain distance l from the vehicle. The potential field algorithm can therefore be seen as a decentralized, reactive solution approach, in which vehicles choose control inputs independently unless they are in close proximity (less than l apart).
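A sketch of the potential-field rule described above follows: each vehicle descends the gradient of an attractive potential toward its goal plus repulsive potentials from nearby vehicles and obstacles that vanish beyond the cutoff distance l. The gains, step size, and exact potential shapes are illustrative choices, not the lab's implementation.

```python
# Sketch: one gradient-descent step of a potential-field controller.
import numpy as np

def potential_field_step(pos, goal, others, l=1.0, k_att=1.0, k_rep=0.5, step=0.05):
    pos, goal = np.asarray(pos, float), np.asarray(goal, float)
    grad = k_att * (pos - goal)                  # gradient of 0.5*k_att*||pos - goal||^2
    for q in others:
        d = pos - np.asarray(q, float)
        dist = np.linalg.norm(d)
        if 1e-6 < dist < l:                      # repulsion only inside the cutoff
            grad += -k_rep * (1.0 / dist - 1.0 / l) * d / dist**3
    return pos - step * grad                     # move downhill in the total potential

# Example: one vehicle heading to the origin while avoiding a nearby neighbor
print(potential_field_step([1.0, 1.0], [0.0, 0.0], others=[[0.8, 1.0]]))
```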
Figure 5-6: A Live Lock Configuration for the Potential Field Algorithm. Each colored disk represents a quadrotor vehicle, and the associated arrow shows the quadrotor's intended goal position.

The potential field algorithm is easy to implement and computationally tractable. However, since the algorithm chooses an input that minimizes the potential of the current configuration, it does not compute the whole trajectory at once. Thus, the algorithm is often trapped in local minima, which leads to live lock configurations. For instance, consider the four-agent live lock configuration illustrated in Fig. 5-6, where the arrows show each vehicle's intended goal positions. Further, given its reactive nature, the potential field algorithm tends to induce sharp turns and jerky maneuvers. In practice, this algorithm works decently for fewer than three vehicles, but live lock configurations are common for teams of more than three vehicles.

5.4.2 Sequential Convex Program

Problem Formulation

This section describes an optimization-based path planning algorithm, largely based on a paper by Augugliaro et al. [3]. In particular, Augugliaro et al. formulated the multi-agent path planning problem as a sequential convex program, whose decision variables a_i[k] are the accelerations of agent i at discretized time step k.

Cost function: For a system with N agents and K discretization steps, the objective is to minimize

\[ \sum_{i=1}^{N} \sum_{k=1}^{K} \left\| a_i[k] + g \right\|^2 \qquad (5.1) \]

where g is the gravity vector. Optimizing this cost function minimizes the sum of thrust, which in practice tends to induce smooth trajectories with small curvature.

Constraints: Given the acceleration decision variables a_i[k] and the discretization step size h, position, velocity, and jerk can be derived via kinematic relations; in particular,

\[ p_i[k+1] = p_i[k] + h\, v_i[k] + \tfrac{h^2}{2}\, a_i[k], \qquad v_i[k+1] = v_i[k] + h\, a_i[k], \qquad j_i[k] = \frac{a_i[k] - a_i[k-1]}{h} \qquad (5.2) \]

where the expressions for p_i[k+1] and v_i[k+1] can be rewritten in terms of initial conditions and accelerations. A feasible trajectory must reach the desired final waypoint, such that at a fixed final time T,

\[ p_i[T/h] = p_i(T), \qquad v_i[T/h] = v_i(T), \qquad a_i[T/h] = a_i(T) \qquad (5.3) \]

In addition, the formulation assumes the vehicles reside in a convex planning space, and, given physical limits (i.e. actuator limits), each vehicle must satisfy position, velocity, acceleration, and jerk constraints, such that in each dimension,

\[ p_{\min} \le p_i[k] \le p_{\max}, \quad v_{\min} \le v_i[k] \le v_{\max}, \quad a_{\min} \le a_i[k] \le a_{\max}, \quad j_{\min} \le j_i[k] \le j_{\max}, \quad \forall\, i, k \qquad (5.4) \]

Eqns. 5.3 and 5.4 are linear convex constraints. The inter-vehicle collision avoidance constraint, however, is non-convex. In particular, it requires each pair of vehicles to be at least a distance R apart at all time steps,

\[ \left\| p_i[k] - p_j[k] \right\|_2 \ge R \qquad \forall\, i \ne j, \ \forall\, k \qquad (5.5) \]

The constraint is approximated using a Taylor expansion, thereby forming a convex relaxation

\[ \frac{\left( \bar{p}_i[k] - \bar{p}_j[k] \right)^T \left( p_i[k] - p_j[k] \right)}{\left\| \bar{p}_i[k] - \bar{p}_j[k] \right\|} \ge R \qquad (5.7) \]

where \bar{p}_i[k] and \bar{p}_j[k] are the positions of vehicles i and j from the previous iteration. This leads to the formation of a sequential convex program, since the algorithm iteratively refines the solution and updates the constraint approximation. In summary, the cost function (Eqn. 5.1) is quadratic; the boundary conditions (Eqn. 5.3) can be expressed as linear equality constraints by substituting the kinematic relations (Eqn. 5.2); and the physical capability constraints (Eqn. 5.4) and the convex approximation of collision avoidance (Eqn. 5.7) can be expressed as linear inequality constraints. This leads to a quadratic program at each iteration,

\[ \min_x \; x^T P x + q^T x + r \quad \text{s.t.} \quad A_{eq}\, x = b_{eq}, \quad A_{in}\, x \le b_{in} \qquad (5.8) \]
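As a concrete illustration of one such convexified subproblem, the sketch below builds and solves the QP for a single vehicle against one fixed neighbor, using the cvxpy modeling library rather than the MATLAB/MOSEK tools mentioned later in this chapter; the function name, the parameter values, and the zero-final-velocity boundary condition are illustrative assumptions, and only a subset of the constraints in Eqns. 5.3-5.4 is shown.

```python
import numpy as np
import cvxpy as cp

def scp_iteration(p0, v0, pf, other_traj, prev_traj, K=20, h=0.1, R=0.8,
                  a_max=4.0, g=np.array([0.0, 0.0, 9.81])):
    """One convexified QP (Eqn. 5.8) for a single vehicle, in the spirit of
    Eqns. 5.1-5.7. `other_traj` is the fixed (K+1, 3) trajectory of a neighboring
    vehicle; `prev_traj` is this vehicle's (K+1, 3) trajectory from the previous
    iteration, used to linearize the collision constraint (Eqn. 5.7)."""
    a = cp.Variable((K, 3))                      # accelerations: the decision variables

    # Kinematic relations (Eqn. 5.2): positions and velocities as affine functions of a.
    p, v = [p0], [v0]
    for k in range(K):
        v.append(v[k] + h * a[k])
        p.append(p[k] + h * v[k] + 0.5 * h ** 2 * a[k])

    constraints = [p[K] == pf, v[K] == 0]        # boundary conditions (cf. Eqn. 5.3)
    constraints += [cp.abs(a) <= a_max]          # acceleration limits (part of Eqn. 5.4)

    # Linearized collision avoidance (Eqn. 5.7) with respect to the fixed neighbor.
    for k in range(1, K + 1):
        d = prev_traj[k] - other_traj[k]
        constraints.append(
            cp.sum(cp.multiply(d, p[k] - other_traj[k])) / np.linalg.norm(d) >= R)

    cost = sum(cp.sum_squares(a[k] + g) for k in range(K))   # Eqn. 5.1 for one vehicle
    cp.Problem(cp.Minimize(cost), constraints).solve()
    return a.value                               # accelerations; recover p, v via Eqn. 5.2
```

The single-vehicle form shown here anticipates the decentralized adaptation discussed next, where each vehicle is solved in sequence against the already-computed trajectories of earlier vehicles.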
Decentralized Implementation

Augugliaro's paper proposed a centralized algorithm in which a single quadratic program is formed for all vehicles simultaneously; this program has O((KND)^2) constraints and KND decision variables. Recall that K is the number of discretization steps, N is the number of agents, and D is the number of dimensions. Since the run time of solving a quadratic program scales quadratically in the number of constraints, the run time of Augugliaro's algorithm is O((KND)^4) per iteration. In comparison, Morgan et al. [40] developed a decentralized SCP algorithm for control of multi-agent spacecraft swarms. In particular, it forms N small quadratic programs, one for each vehicle. The quadratic programs are solved sequentially, and the solution (trajectory) for an earlier vehicle is cast as constraints for later vehicles. The run time complexity of this algorithm is O(N(KND)^2) per iteration, which is much cheaper than that of Augugliaro. In practice, it is often more important to find feasible trajectories quickly than to guarantee optimality. In this thesis, SCP is used as a shorthand for a decentralized adaptation of Augugliaro's algorithm. The following section first shows that SCP does not work robustly for multi-agent path planning (it fails to return feasible trajectories for certain test cases); it then proposes key modifications that significantly improve robustness.

Convergence Pattern

Augugliaro presented a number of test cases and showed through empirical simulation that the algorithm usually converges within four iterations. However, the convergence process is not uniform. In particular, linearizing about the first several iterates often induces an infeasible quadratic program (QP). Common commercial solver packages, such as MATLAB or MOSEK, often return infeasible trajectories when the QP is infeasible. Sometimes, linearizing about the infeasible trajectories yields a feasible QP, and the algorithm then starts to converge. A typical convergence pattern of SCP is illustrated in Fig. 5-7. The red lines show the corner and pole constraints, which are explained in Sec. 5.4.3. In this case, two quadrotors attempt to swap positions. In SCP, the blue vehicle, traveling from (-1, -6) to (-1, -3), is solved first and hence has a straight trajectory. The green vehicle, traveling from (-1, -3) to (-1, -6), is solved second and needs to satisfy the collision constraint with respect to the blue vehicle. From top left to bottom right, Fig. 5-7 shows the green vehicle's trajectory after the 1st, 2nd, 5th, and 9th iterations. Linearization about the first four iterates yields infeasible QPs, and thus no feasible solutions were found (the green trajectories in the top row do not reach (-1, -6)). After a feasible solution is found, the algorithm then converges to a local optimum. This phenomenon sometimes causes the algorithm to fail to find feasible trajectories, which is discussed in the next section (shown in Figs. 5-9 and 5-10).

Figure 5-7: Typical Convergence Pattern of SCP. Two quadrotors attempt to swap positions. The blue vehicle is solved before the green vehicle. From top left to bottom right, this figure shows the green vehicle's trajectory after the 1st, 2nd, 5th, and 9th iterations. SCP returned infeasible trajectories in the first four iterations.

5.4.3 Incremental Sequential Convex Program

The convergence pattern of SCP is troublesome because it shows that SCP can get trapped in intermediate steps, where no feasible solution can be found. Empirical simulation shows that this happens approximately 40% of the time (on randomly generated test cases) for a group of four vehicles navigating in the RAVEN testbed. This section introduces a number of modifications that lead to the formulation of the incremental sequential convex programming algorithm, abbreviated iSCP. This work was developed jointly by the author, Mark Cutler, and Jack Quindlen.
Figure 5-8: Modeling Corner Constraint. The exponential loss function (Eqn. 5.9) is less than zero inside the infeasible corner region (x < -1 and y < -2), and greater than zero elsewhere. Increasing the value of the constant c leads to a better approximation of the true corner boundary.

Corner and Pole Constraints

Augugliaro's paper assumed a convex, rectangular planning space and does not permit any static obstacles. Motivated by the need to navigate in RAVEN's non-convex flight volume, an exponential loss function is developed to model corner and pole constraints; in particular,

\[ g\!\left(p^x_i[k], p^y_i[k]\right) = \exp\!\left(c\,(p^x_i[k] - x_{cor})\right) + \exp\!\left(c\,(p^y_i[k] - y_{cor})\right) - 2 \qquad (5.9) \]

where p^x_i[k] and p^y_i[k] are the x and y coordinates, respectively, of vehicle i at time step k, and x_{cor} and y_{cor} are the coordinates of the corner. The contour plot of this function is shown in Fig. 5-8. The constraint requires g(p^x_i[k], p^y_i[k]) > 0 for all i, k, thereby forcing vehicles to remain outside of the corner region. Since Eqn. 5.9 is nonlinear, a Taylor expansion is applied to form a linear approximation similar to that in Eqn. 5.7. The pole constraint can be developed similarly. The corner and pole constraints are shown in red in Figures 5-7, 5-9, and 5-10.

Incremental Addition of Constraints

SCP often fails because the quadratic program formed by linearizing about the initial guess (a straight line between the starting and final positions) is often infeasible, as shown in Figs. 5-9(a) and 5-10(a). Sometimes the infeasible QP yields infeasible trajectories, which, upon linearization, produce another infeasible QP. In such scenarios, the algorithm can be trapped and no feasible solution is found. This observation prompts the development of the incremental sequential convex program (iSCP). The key idea is to gently steer the trajectory towards the goal by adding constraints incrementally. Thereby, iSCP tries to ensure that all intermediate quadratic programs (at every iteration) are feasible. Recall that K is the number of discretization time steps. Thus, a vehicle's trajectory is composed of K points, with one point at each discretization time step. Each point in a vehicle's trajectory is subject to collision constraints, which are linearized by Taylor expansion about the point itself (Eqn. 5.7). Thus, each point is subject to a different set of collision constraints. In constructing a quadratic program (QP), SCP adds the collision constraints at all points to the QP in every iteration. In contrast, iSCP adds collision constraints incrementally. For instance, it adds the collision constraints about the first point (k = 1) on the first iteration, and it adds the collision constraints about the first two points (k = 1 and k = 2) on the second iteration.² In at most K iterations, iSCP will have added the collision constraints at all points.

²In the actual implementation, iSCP can skip certain points for computational efficiency.
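The incremental outer loop can be summarized by the following sketch; `solve_qp` is a hypothetical helper standing in for the construction and solution of the vehicle's QP (for example, along the lines of the single-vehicle subproblem sketched earlier), and is not part of the actual implementation.

```python
def iscp(solve_qp, initial_guess, K, max_iter=None):
    """Incremental SCP outer loop (sketch).

    solve_qp(constrained_steps, prev_traj) is assumed to build and solve the
    vehicle's QP with collision constraints imposed only at the time steps in
    `constrained_steps`, linearized about `prev_traj` (Eqn. 5.7); it returns the
    new trajectory, or None if the solver reports infeasibility.
    """
    traj = initial_guess                 # straight line between start and goal
    max_iter = max_iter or K
    for it in range(1, max_iter + 1):
        # Iteration `it` constrains the first min(it, K) points of the trajectory,
        # so after at most K iterations every point carries its collision constraints.
        constrained_steps = range(1, min(it, K) + 1)
        new_traj = solve_qp(constrained_steps, traj)
        if new_traj is not None:
            traj = new_traj              # linearize the next QP about the latest iterate
    return traj
```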
Final Time Scaling

Existing algorithms require the user to specify the time step h and the end time T in advance. However, it is often difficult to determine these values beforehand, especially for groups of more than three vehicles. If T is set too small, the path planning problem could be infeasible given the constraints on velocity, acceleration, and jerk. If T is set too large, the vehicles will travel very slowly. Instead, iSCP solves the problem in normalized time (T = 1) and scales the solution such that the maximum allowed acceleration (or velocity, or jerk) is achieved. In particular, given the accelerations a_i[k] in normalized time, the velocities and jerks are computed using Eqn. 5.2. A scaling factor r is computed as follows,

\[ r = \min\left\{ \min_{i,k} \left( \frac{v_{\max}}{\| v_i[k] \|} \right)^{2}, \;\; \min_{i,k} \frac{a_{\max}}{\| a_i[k] \|}, \;\; \min_{i,k} \left( \frac{j_{\max}}{\| j_i[k] \|} \right)^{2/3} \right\} \qquad (5.10) \]

where v_{\max}, a_{\max}, and j_{\max} are the maximum allowed velocity, acceleration, and jerk, respectively. The solution to the normalized problem is then scaled by the factor r,

\[ a_i[k]_{scaled} = r\, a_i[k] \quad \forall\, i, k \qquad (5.11) \]
\[ h_{scaled} = h / \sqrt{r} \qquad (5.12) \]
\[ T_{scaled} = T / \sqrt{r} \qquad (5.13) \]

It can be shown that this scaling procedure only changes the rate at which the vehicles travel along the trajectories, not the shape of the trajectories.
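A minimal sketch of this scaling step is given below, assuming a single vehicle starting at rest, symmetric per-axis limits, and the form of Eqn. 5.10 as reconstructed above; it is illustrative rather than the exact implementation.

```python
import numpy as np

def scale_normalized_solution(a, h, T, v_max, a_max, j_max):
    """Scale a normalized-time (T = 1) solution so that the tightest of the
    velocity/acceleration/jerk limits is met with equality (Eqns. 5.10-5.13).
    `a` is a (K, D) array of accelerations; zero initial velocity is assumed."""
    v = np.cumsum(h * a, axis=0)                   # velocities via Eqn. 5.2 with v[0] = 0
    j = np.diff(a, axis=0, prepend=a[:1]) / h      # jerks; the first step is treated as zero

    # Eqn. 5.10: under the scaling below, a scales by r, v by sqrt(r), j by r^(3/2).
    r = min((v_max / np.abs(v).max()) ** 2,
            a_max / np.abs(a).max(),
            (j_max / np.abs(j).max()) ** (2.0 / 3.0))

    a_scaled = r * a                               # Eqn. 5.11
    h_scaled = h / np.sqrt(r)                      # Eqn. 5.12
    T_scaled = T / np.sqrt(r)                      # Eqn. 5.13
    return a_scaled, h_scaled, T_scaled
```

Because the positions depend on the product h^2 a, which is left unchanged by this scaling, the path itself is preserved while the traversal speed changes.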
Simulation Results

The SCP and iSCP path planning algorithms were implemented and run in a simulated environment with corner and pole constraints (reflecting the layout of the RAVEN testbed). Each algorithm is given the same set of randomly generated test cases, in which the starting and ending positions of four vehicles are sampled uniformly from the feasible region. SCP returned feasible trajectories in 15 out of 25 test cases, while iSCP returned feasible trajectories in all test cases. To explain this observation, two test cases are presented to illustrate the differences in convergence between SCP and iSCP.

Fig. 5-9 illustrates a scenario in which a single quadrotor tries to navigate around the upper left corner. In both algorithms, the trajectory is initialized as a straight line connecting the starting and ending positions. The SCP algorithm, shown in the left column, added all collision constraints (at every discretization time step, linearized about the straight line) in forming the first QP. However, this QP is infeasible, and MATLAB returned a trajectory that does not reach the ending position, as shown in the top left sub-figure. Linearizing about this trajectory forms another infeasible QP, which leads to a similar trajectory shown in the second row. This process repeats, the algorithm gets stuck in the bottom sub-figure, and no feasible solution is found. In contrast, the iSCP algorithm, shown in the right column, only added the collision constraint for one discretization time step in forming the first QP. This leads to a feasible QP, whose solution is shown in the top right sub-figure. This trajectory, albeit infeasible with respect to the constraints not yet added, shows a small improvement over the previous iterate (the straight-line initialization) and still reaches the desired ending position. Repeating this procedure eventually leads to a feasible solution, as shown in the bottom right sub-figure.

Figure 5-9: Convergence Pattern Around a Corner. A quadrotor attempts to negotiate a corner while traveling from (-2, 0) to (-5, -2.5). Sub-figures (a) and (b) illustrate the quadrotor's trajectory computed by SCP and iSCP, respectively, at iterations 1, 2, 5, and 9 from top to bottom. SCP failed to find a feasible solution while iSCP found a feasible solution.

Fig. 5-10 illustrates a four-agent scenario in which two pairs of vehicles attempt to swap positions. The test case is chosen such that all vehicles are away from the corner and pole, and thus the example can be seen as path planning in a convex volume. In both algorithms, the red, blue, and green agents are solved before the teal agent. From the teal agent's perspective, the other three agents can be viewed as dynamic obstacles with known trajectories, which play a role similar to the corner constraint in the previous test case. SCP adds all collision avoidance constraints at once and eventually gets trapped because it induces an infeasible QP at every iteration. In contrast, iSCP finds a feasible solution by making incremental improvements one step at a time, forming a feasible QP at every iteration.

Figure 5-10: Convergence Pattern for Multi-agent System. Two pairs of quadrotors attempt to swap positions. The red, blue, and green agents are solved before the teal agent. Sub-figures (a) and (b) illustrate the trajectories computed for the teal agent by SCP and iSCP, respectively, at iterations 1, 2, 5, and 9 from top to bottom. SCP failed to find a feasible solution while iSCP found a feasible solution.

Lastly, Fig. 5-11 illustrates two challenging test cases in which the trajectories of any two vehicles intersect, which implies a high degree of interference between vehicles. In both scenarios, the collision avoidance constraint requires each pair of vehicles to be at least 0.8 m apart at every time step. In both scenarios, iSCP was able to find feasible trajectories in under two seconds.

Figure 5-11: Multi-agent Path Planning Test Cases. In sub-figure (a), six quadrotors swap positions in a convex environment. In sub-figure (b), four quadrotors navigate in the RAVEN environment. In both sub-figures, triangles and crosses represent the starting and ending positions of the quadrotors, respectively. Each color represents a different quadrotor.

Hardware Implementation

iSCP is implemented in C++ and utilizes the MOSEK library for solving the QP at each iteration. For single-vehicle scenarios, iSCP returns a feasible trajectory in less than a second. For four-agent scenarios, iSCP takes 2.1 seconds on average to compute feasible trajectories. iSCP is tested on quadrotors for path planning in the RAVEN testbed. A hardware demonstration of iSCP is shown in Fig. 5-12, in which a single quadrotor flies around a pole. Trajectories returned by iSCP are typically very smooth.

Figure 5-12: Hardware Demonstration of iSCP Algorithm. A quadrotor flies around a pole following a trajectory returned by iSCP. Sub-figures (a) and (b) show the quadrotor executing the trajectory from two different angles.

5.5 Intruder Monitoring Mission

The intruder monitoring mission (IMM), as described in Section 4.2, is implemented on hardware in the RAVEN testbed. Quadrotors are used for the UAVs and iRobot Creates are used for the ground vehicles. The mission was run for 50 time steps (decision periods), which took approximately 3 minutes and 20 seconds to complete in hardware. This shows that HD-MMDP is real-time implementable. The IMM hardware experiment is shown in Fig. 5-13. The grid world's background is a satellite image of mountainous terrain. The grid world is overlaid with translucent white tiles (fog of war) to illustrate the uncertainty state $w_{mn}$ of the environment. When quadrotors perform the surveillance task, as indicated by the green radar icon in Fig. 5-13(a), the transparency of the white tiles (in the vicinity of the quadrotor) increases. When quadrotors perform blocking or cornering tasks, as shown in Fig. 5-13(b), the transparency of the white tiles decreases. The yellow square is the ground vehicle's goal position.
When a ground vehicle first enters the grid world, it remains undetected, such as the green vehicle in Fig. 5-13(a). When a ground vehicle is detected, a colored disk and a flashing array are projected onto the ground vehicle. The x-y position of each quadrotor is shown by a colored cross projected onto the floor. (The shadow cast by the quadrotor is sometimes more than half a meter away from the quadrotor's actual x-y position.) The color of the cross indicates which ground vehicle the quadrotor is blocking. For instance, in Fig. 5-13(b), the quadrotor with a purple cross is trying to block the purple ground vehicle from reaching the yellow square.

The arrows emanating from each ground vehicle show the planner's understanding of the ground vehicle's intent. In particular, the green arrow shows the vehicle's most probable next state (one step ahead); the black arrow shows the ground vehicle's most probable state two steps ahead if the quadrotor remains stationary; and the white arrow shows the ground vehicle's most probable state two steps ahead if the quadrotor moves to block the ground vehicle. Therefore, the planner deploys quadrotors strategically to influence the behaviors of the ground vehicles (inducing a change of future action from the black arrow to the white arrow). For instance, in Fig. 5-13(b), the purple ground vehicle tries to reach the yellow square by going around the purple cross (following the green arrow first and then the black arrow). Knowing the ground vehicle's intent, the purple quadrotor moves in parallel to the green arrow, which prevents the ground vehicle from traveling along the black arrow in the next time step. Since the IMM is a stochastic problem, the arrows illustrate the most probable future states, and the ground vehicles may therefore sometimes choose an action different from the one indicated by the arrows.³ A video of this experiment can be found at https://www.youtube.com/watch?v=2oOIW7rZAJO.

³For visualization reasons, when a ground vehicle is forced to the edge, it is escorted out of the grid world at two times the normal speed. Thus, the intent arrows may look a bit confusing in such circumstances.

Figure 5-13: Hardware Demonstration of Intruder Monitoring Mission. The HD-MMDP algorithm (with the modifications outlined in Section 4.2.3) is applied to solve the intruder monitoring mission. The mission scenario is implemented in hardware, with an overhead projection showing the belief state of the system. Sub-figures (a) and (b) illustrate different visualization elements of the mission.

5.6 Summary

This chapter presented the hardware and software infrastructures that enabled the hardware implementation of the intruder monitoring mission (IMM). Thereby, it showed that the HD-MMDP algorithm is real-time implementable. It also presented a novel multi-agent path planning algorithm. The next chapter provides a summary of this thesis and points to directions for future work.

Chapter 6

Conclusions

6.1 Summary

This thesis has developed a novel hierarchical decomposition algorithm for solving Multi-agent Markov Decision Processes (MMDPs) by exploiting coupling relationships in the reward function. It motivated multi-agent planning problems and described how uncertainty plays a major role in many real-world scenarios. It reviewed the formulation of Markov Decision Processes (MDPs) and showed that the MMDP is a natural framework for modeling stochastic sequential decision processes for multi-agent systems.
It provided a short description of classical dynamic programming (DP), which can be used to find the optimal solution. However, classical DP approaches suffer from the curse of dimensionality, which, in multi-agent problems, implies that the algorithm's run time scales exponentially with the number of agents. The thesis proposed the Hierarchically Decomposed MMDP (HD-MMDP) algorithm, which automatically identifies a decomposition structure. The key idea is that different functional forms in the reward function represent different degrees of coupling between agents, which is formalized with the definition of the factored reward function. Then, subproblems are formulated at the bottom of the decomposition hierarchy and are solved using classical DP approaches. The subproblems typically have planning spaces that are orders of magnitude smaller than that of the original planning problem. HD-MMDP uses a heuristic approach for assigning agents to subproblems, through which the subproblems' solutions are combined to form a solution (policy) to the original problem.

HD-MMDP is applied to the persistent search and track (PST) mission, which is a cooperative multi-agent mission with stochastic fuel consumption and health progression models. HD-MMDP is shown to outperform the GA-dec-MMDP algorithm in the ten-agent PST domain by more than 35% in cumulative reward. Further, a more complex (in terms of coupling relationships) intruder monitoring mission (IMM) is formulated. HD-MMDP is shown to achieve sustained performance (steadily increasing cumulative reward) in the IMM planning domain. Lastly, the thesis presented the hardware and software infrastructures of the RAVEN testbed, which enabled the hardware implementation of the IMM. Hardware experiments illustrated that HD-MMDP is a real-time algorithm and shed light on the algorithm's performance.

6.2 Future Work

6.2.1 Optimization on Task Tree Structure

The last step of HD-MMDP is to assign each agent to one of the leaf tasks in the task tree structure. Currently, the algorithm adopts a heuristic approach: it determines an importance ranking of the leaf tasks based on the maximum attainable reward, and the leaf tasks are then executed in this fixed ordering. However, the process of computing this ordering does not account for the agents' current states. For instance, if all UAVs are low on fuel, it may be preferable to have the team perform an easier task with a lower reward as opposed to a difficult task with a higher reward. The ordering of execution should be made flexible, taking into account each agent's current state. More generally, assigning agents to leaf tasks can be seen as a stochastic optimization problem, for which more mathematically principled solution approaches should be explored.

6.2.2 Formulation of Leaf Tasks

As briefly described in Sec. 4.2.3, state abstraction can be applied when forming leaf tasks. This can drastically reduce the size of a leaf task's planning space, thereby improving run time complexity. State abstraction was applied in solving the intruder monitoring mission. However, this procedure often requires domain knowledge to identify dimensions that can be abstracted away. It would be interesting to explore ways to automate this procedure, possibly by using information embedded in the task tree structure.

Bibliography

[1] A-a Agha-mohammadi, Suman Chakravorty, and Nancy M Amato. FIRM: Feedback controller-based information-state roadmap - a framework for motion planning under uncertainty.
In Intelligent Robots and Systems (IROS), 2011 IEEE/RSJ International Conference on, pages 4284-4291. IEEE, 2011.

[2] Mehdi Alighanbari. Task assignment algorithms for teams of UAVs in dynamic environments. PhD thesis, Citeseer, 2004.

[3] Federico Augugliaro, Angela P Schoellig, and Raffaello D'Andrea. Generation of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 1917-1922. IEEE, 2012.

[4] Raphen Becker. Solving transition independent decentralized Markov decision processes. Computer Science Department Faculty Publication Series, page 208, 2004.

[5] Richard Bellman. Dynamic programming and stochastic control processes. Information and Control, 1(3):228-239, 1958.

[6] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research, 27(4):819-840, 2002.

[7] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2005.

[8] B. Bethke, J. P. How, and J. Vian. Group health management of UAV teams with applications to persistent surveillance. In American Control Conference (ACC), pages 3145-3150, Seattle, WA, 11-13 June 2008.

[9] Blai Bonet and Hector Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, volume 3, pages 12-21, 2003.

[10] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In IJCAI, 1999.

[11] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In International Joint Conference on Artificial Intelligence, pages 478-485, 1999.

[12] Justin A Boyan and Michael L Littman. Packet routing in dynamically changing networks: A reinforcement learning approach. Advances in Neural Information Processing Systems, pages 671-671, 1994.

[13] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.

[14] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement Learning and Dynamic Programming Using Function Approximators. CRC Press, 2010.

[15] Trevor Campbell, Luke Johnson, and Jonathan P. How. Multiagent allocation of Markov decision process tasks. In American Control Conference (ACC). IEEE, 2013.

[16] Yu Fan Chen, N. Kemal Ure, Girish Chowdhary, Jonathan P. How, and John Vian. Planning for large-scale multiagent problems via hierarchical decomposition with applications to UAV health management. In American Control Conference (ACC), Portland, OR, June 2014.

[17] Yann Chevaleyre, Paul E Dunne, Ulle Endriss, Jerome Lang, Michel Lemaitre, Nicolas Maudet, Julian Padget, Steve Phelps, Juan A Rodriguez-Aguilar, and Paulo Sousa. Issues in multiagent resource allocation. 2005.

[18] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple sources. The Journal of Machine Learning Research, 9:1757-1774, 2008.

[19] Mark Cutler. Design and Control of an Autonomous Variable-Pitch Quadrotor Helicopter. Master's thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, August 2012.

[20] Robin Deits and Russ Tedrake. Computing large convex regions of obstacle-free space through semidefinite programming. In Submitted to: Workshop on the Algorithmic Fundamentals of Robotics, 2014.

[21] Thomas G Dietterich et al. The MAXQ method for hierarchical reinforcement learning.
In Proceedings of the Fifteenth International Conference on Machine Learning, volume 8. Citeseer, 1998.

[22] Alborz Geramifard. Practical Reinforcement Learning Using Representation Learning and Safe Exploration for Large Scale Markov Decision Processes. PhD thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, February 2012.

[23] Claudia V Goldman and Shlomo Zilberstein. Decentralized control of cooperative systems: Categorization and complexity analysis. Journal of Artificial Intelligence Research (JAIR), 22:143-174, 2004.

[24] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In NIPS-14, pages 1523-1530. The MIT Press, 2001.

[25] Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In ICML, 2002.

[26] Carlos Guestrin, Shobha Venkataraman, and Daphne Koller. Context-specific multiagent coordination and planning with factored MDPs. In AAAI/IAAI, pages 253-259, 2002.

[27] Eric A. Hansen and Shlomo Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129(1-2):35-62, 2001.

[28] M. Hoeing, P. Dasgupta, P. Petrov, and S. O'Hara. Auction-based multi-robot task allocation in COMSTAR. In Proceedings of the 6th International Joint Conference on Autonomous Agents and Multiagent Systems, 2007.

[29] A.P. Hu, Chao Liu, and Hao Leo Li. A novel contactless battery charging system for soccer playing robot. In 15th International Conference on Mechatronics and Machine Vision in Practice, pages 646-650, December 2008.

[30] Luke Johnson. Decentralized Task Allocation for Dynamic Environments. Master's thesis, Massachusetts Institute of Technology, January 2012.

[31] S. Karaman and E. Frazzoli. Sampling-based algorithms for optimal motion planning. International Journal of Robotics Research, 30(7):846-894, June 2011.

[32] Jelle R Kok, Matthijs TJ Spaan, and Nikos Vlassis. Multi-robot decision making using coordination graphs. In Proceedings of the 11th International Conference on Advanced Robotics, ICAR, volume 3, pages 1124-1129, 2003.

[33] Steven M LaValle. Rapidly-exploring random trees: A new tool for path planning. 1998.

[34] Brandon Luders and Jonathan P. How. Probabilistic feasibility for nonlinear systems with non-Gaussian uncertainty using RRT. In AIAA Infotech@Aerospace Conference, St. Louis, MO, March 2011. (AIAA-2011-1589).

[35] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI Perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.

[36] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI Perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.

[37] Mausam and Daniel S. Weld. Solving concurrent Markov decision processes. In Deborah L. McGuinness and George Ferguson, editors, AAAI, pages 716-722. AAAI Press / The MIT Press, 2004.

[38] Nicolas Meuleau, Milos Hauskrecht, Kee-Eung Kim, Leonid Peshkin, Leslie Pack Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large weakly coupled Markov decision processes. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 165-172, 1998.

[39] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. In Machine Learning, pages 103-130, 1993.

[40] Daniel Morgan, Soon-Jo Chung, and Fred Y Hadaegh.
Model predictive control of swarms of spacecraft using sequential convex programming. Journal of Guidance, Control, and Dynamics, pages 1-16, 2014.

[41] Christos H Papadimitriou and John N Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3):441-450, 1987.

[42] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.

[43] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy Leibs, Rob Wheeler, and Andrew Y Ng. ROS: an open-source Robot Operating System. In ICRA Workshop on Open Source Software, volume 3, page 5, 2009.

[44] J. D. Redding, T. Toksoz, N. Kemal Ure, A. Geramifard, J. P. How, M. Vavrina, and J. Vian. Persistent distributed multi-agent missions with automated battery management. In AIAA Guidance, Navigation, and Control Conference (GNC), August 2011. (AIAA-2011-6480).

[45] Joshua D. Redding. Approximate Multi-Agent Planning in Dynamic and Uncertain Environments. PhD thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, Cambridge, MA, February 2012.

[46] A. Richards, J. Bellingham, M. Tillerson, and J. P. How. Coordination and control of multiple UAVs. In AIAA Guidance, Navigation, and Control Conference (GNC), Monterey, CA, August 2002. AIAA Paper 2002-4588.

[47] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood Cliffs, 1995.

[48] T. Schouwenaars, B. de Moor, E. Feron, and J. P. How. Mixed integer programming for multi-vehicle path planning. In Proceedings of the European Control Conference, pages 2603-2608, Porto, Portugal, September 2001. European Union Control Association.

[49] Satinder Singh and David Cohn. How to dynamically merge Markov decision processes, 1997.

[50] Trey Smith and Reid Simmons. Focused real-time dynamic programming for MDPs: Squeezing more out of a heuristic. In AAAI 2006, pages 1227-1232.

[51] Devika Subramanian, Peter Druschel, and Johnny Chen. Ants and reinforcement learning: A case study in routing in dynamic networks. In IJCAI (2), pages 832-838. Morgan Kaufmann, 1998.

[52] Vicon Motion Systems. Motion capture systems from Vicon, 2008. Online: http://www.vicon.com/.

[53] Tuna Toksoz. Design and Implementation of an Automated Battery Management Platform. Master's thesis, Massachusetts Institute of Technology, August 2012.

[54] Tuna Toksoz, Joshua Redding, Matthew Michini, Bernard Michini, Jonathan P. How, Matthew Vavrina, and John Vian. Automated Battery Swap and Recharge to Enable Persistent UAV Missions. In AIAA Infotech@Aerospace Conference, March 2011. (AIAA-2011-1405).

[55] Nazim Kemal Ure, Girish Chowdhary, Jonathan P. How, Matthew Vavrina, and John Vian. Health aware planning under uncertainty for UAV missions with heterogeneous teams. In Proceedings of the European Control Conference, Zurich, Switzerland, July 2013.

[56] M. Valenti, B. Bethke, D. Dale, A. Frank, J. McGrew, S. Ahrens, J. P. How, and J. Vian. The MIT Indoor Multi-Vehicle Flight Testbed. In IEEE International Conference on Robotics and Automation (ICRA), pages 2758-2759, 10-14 April 2007.

[57] M. Valenti, B. Bethke, G. Fiore, J. P. How, and E. Feron. Indoor Multi-Vehicle Flight Testbed for Fault Detection, Isolation, and Recovery. In AIAA Guidance, Navigation, and Control Conference (GNC), Keystone, CO, August 2006. (AIAA-2006-6200).

[58] Li-Chih Wang and Sian-Kun Lin. A multi-agent based agile manufacturing planning and control system.
Computers and Industrial Engineering, 57(2):620-640, 2009.