Hierarchical Decomposition of Multi-agent
Markov Decision Processes with Application
to Health Aware Planning

by

Yu Fan Chen

B.A.Sc., Aerospace Engineering, University of Toronto (2012)

Submitted to the Department of Aeronautics and Astronautics
in partial fulfillment of the requirements for the degree of
Master of Science in Aerospace Engineering
at the
MASSACHUSETTS INSTITUTE OF TECHNOLOGY
September 2014

© Massachusetts Institute of Technology 2014. All rights reserved.

Author: Department of Aeronautics and Astronautics, August 21, 2014

Certified by: Jonathan P. How, Richard C. Maclaurin Professor of Aeronautics and Astronautics, Thesis Supervisor

Accepted by: Paulo C. Lozano, Associate Professor of Aeronautics and Astronautics, Chair, Graduate Program Committee
Hierarchical Decomposition of Multi-agent
Markov Decision Processes with Application
to Health Aware Planning
by
Yu Fan Chen
Submitted to the Department of Aeronautics and Astronautics
on August 21, 2014, in partial fulfillment of the
requirements for the degree of
Master of Science in Aerospace Engineering
Abstract
Multi-agent robotic systems have attracted the interest of both researchers and practitioners because they provide more capabilities and afford greater flexibility than single-agent systems. Coordination of individual agents within large teams is often challenging because of the combinatorial nature of such problems. In particular, the number of possible joint configurations is the product of that of every agent. Further, real world applications often contain various sources of uncertainty. This thesis investigates techniques to address the scalability issue of multi-agent planning under uncertainty.
This thesis develops a novel hierarchical decomposition approach (HD-MMDP) for solving Multi-agent Markov Decision Processes (MMDPs), which provide a natural framework for formulating stochastic sequential decision-making problems. In particular, the HD-MMDP algorithm builds a decomposition structure by exploiting coupling relationships in the reward function. A number of smaller subproblems are formed and solved individually. The planning spaces of each subproblem are much smaller than those of the original problem, which improves computational efficiency, and the solutions to the subproblems can be combined to form a solution (policy) to the original problem. The HD-MMDP algorithm is applied to a ten-agent persistent search and track (PST) mission and shows more than 35% improvement over an existing algorithm developed specifically for this domain. This thesis also contributes to the development of the software infrastructure that enables hardware experiments involving multiple robots. In particular, the thesis presents a novel optimization-based multi-agent path planning algorithm, which was tested in simulation and hardware (quadrotor) experiments. The HD-MMDP algorithm is also used to solve a multi-agent intruder monitoring mission implemented on real robots.
Thesis Supervisor: Jonathan P. How
Title: Richard C. Maclaurin Professor of Aeronautics and Astronautics
Acknowledgments
First and foremost, I would like to thank my advisor, Professor Jonathan How, for his
guidance and support in the past two years. Jon has the ability to accurately convey
arcane concepts through simple diagrams, and always tries to draw connections to
the big picture. Also, Jon is always adamant about making good slides, which helped
me profoundly to improve the quality of my talks.
I would also like to thank all members of the Aerospace Control Lab, whose support has been indispensable to my research. In particular, I am grateful to Mark Cutler; thanks for bearing with me as I flew quadrotors into the walls over and over again. He is a master of hardware and ROS, and is always willing to lend a hand. I would also like to thank Kemal Ure, with whom I worked closely on this project. He was always there to answer questions and provide feedback; his input has been essential to this work. In addition, thanks to Jack Quindlen, Rob Grande, Sarah Ferguson, Chris
Lowe, and Bobby Klein, with whom I shared memories of many late night problem
sets and quals practice sessions. Also, special thanks to Beipeng Mu, for sharing the
cultural heritage; to Luke Johnson, for helping me in Portland; to Trevor Campbell,
for sharing the UofT spirit; and to Ali Agha, for his patience and generosity.
Finally, I would like to thank my wonderful parents. Thank you mom and dad,
for your unconditional support throughout the years of college and graduate school.
Also, I am grateful to my twin brother, whose presence always managed to confuse
and amuse my lab mates. To all my friends from UofT and MIT, thanks for being
there through the ups and downs.
This research has been funded by the Boeing Company. The author acknowledges
Boeing Research & Technology for support of the indoor flight facilities.
Contents

1 Introduction
    1.1 Background and Motivation
    1.2 Related Work
        1.2.1 Dynamic Programming Approaches
        1.2.2 Structural Decomposition Approaches
        1.2.3 Optimization Based Approaches
    1.3 Outline and Summary of Contributions

2 Preliminaries
    2.1 Markov Decision Processes
    2.2 Multi-agent Markov Decision Processes
    2.3 Dynamic Programming Algorithms
    2.4 Extensions to Markov Decision Processes
        2.4.1 Partially Observable Markov Decision Processes
        2.4.2 Decentralized Markov Decision Process (dec-MMDP)
    2.5 Summary

3 Hierarchical Decomposition of MMDP
    3.1 Structure of Reward Functions
        3.1.1 Construction of Reward Functions
        3.1.2 Factored Reward Function
        3.1.3 Useful Functional Forms
    3.2 Hierarchically Decomposed MMDP
        3.2.1 Decomposition
        3.2.2 Solving Leaf Tasks
        3.2.3 Ranking of Leaf Tasks
        3.2.4 Agent Assignment
        3.2.5 Summary
    3.3 Run Time Complexity Analysis
    3.4 Summary

4 Application to Health Aware Planning
    4.1 Persistent Search and Track Mission (PST)
        4.1.1 Mission Description
        4.1.2 Problem Formulation
        4.1.3 Structural Decomposition
        4.1.4 Empirical Analysis
        4.1.5 Simulation Result
    4.2 Intruder Monitoring Mission
        4.2.1 Mission Description
        4.2.2 Problem Formulation
        4.2.3 Extension to HD-MMDP
        4.2.4 Simulation Results
    4.3 Summary

5 Experimental Setup
    5.1 RAVEN Testbed
    5.2 Autonomous Vehicles
    5.3 Software Architecture
    5.4 Path Planner
        5.4.1 Potential Field
        5.4.2 Sequential Convex Program
        5.4.3 Incremental Sequential Convex Program
    5.5 Intruder Monitoring Mission
    5.6 Summary

6 Conclusions
    6.1 Summary
    6.2 Future Work
        6.2.1 Optimization on Task Tree Structure
        6.2.2 Formulation of Leaf Tasks

References
List of Figures

1-1  Applications of Multi-agent Systems
3-1  Construction of Multi-agent Reward Function
3-2  Two Agent Consensus Problem
3-3  Summation Reward Function
3-4  Multiplicative Reward Function
3-5  Example of Task Tree Structure
3-6  Solution to Leaf Tasks
3-7  Relative Importance of Leaf Tasks
3-8  Ranking of Leaf Tasks
3-9  Computation of Minimum Accepted Expected Reward
3-10 Assignment of Agents to Leaf Tasks
4-1  Persistent Search and Track Mission
4-2  Task Tree of Persistent Search and Track Mission
4-3  Three Agent PST Mission Simulation Result
4-4  Effect of Adopting Decomposition Structure
4-5  Pruning Solution Space
4-6  Simulation of Ten Agent PST Mission
4-7  Intruder Monitor Domain
4-8  Leaf Tasks in Intruder Monitoring Domain
4-9  Action Space of Intruder Monitoring Mission
4-10 State Transition Model of Intruder Monitoring Mission
4-11 Task Tree of Intruder Monitoring Mission
4-12 Simulation of Intruder Monitoring Mission
5-1  Robotic Vehicles and Experiment Environment
5-2  Schematic of RAVEN Testbed
5-3  Image Projection Capability
5-4  Autonomous Robotic Agents
5-5  Software Architecture
5-6  A Live Lock Configuration for Potential Field Algorithm
5-7  Typical Convergence Pattern of SCP
5-8  Modeling Corner Constraint
5-9  Convergence Pattern Around a Corner
5-10 Convergence Pattern for Multi-agent System
5-11 Multi-agent Path Planning Test Cases
5-12 Hardware Demonstration of iSCP Algorithm
5-13 Hardware Demonstration of Intruder Monitoring Mission
List of Tables

3.1  Local Optimal Policies can be Jointly Suboptimal
3.2  Run Time Complexity of HD-MMDP
4.1  Computational Time of Value Iteration on PST Mission
Chapter 1
Introduction
1.1 Background and Motivation
Multi-agent systems arise in many applications of interest. In the manufacturing domain, efficient resource allocation [17, 58] among a group of robotic agents is crucial
for maximizing production output. Similarly, in military applications, intelligent organization of multiple unmanned vehicles [46, 18, 55] is critical for surveillance and reconnaissance missions. Also, in the network management domain, smooth coordination of multiple servers [12, 51] is essential for maximizing network traffic. In each
of these scenarios, the objective is to determine an action for each agent to optimize
a joint performance metric for the team.
In addition, various sources of uncertainty arise naturally in such planning problems. In the manufacturing domain, there exist process and sensor noise that affect
robots' performance. In surveillance and reconnaissance missions, the behavior patterns of targets are often uncertain and can only be modeled probabilistically. In
network management, servers may fail due to internal flaws (software bugs) or malicious attack from external sources, both of which can be modeled as stochastic events.
Thus, these applications can be framed as multi-agent stochastic sequential decision
problems, for which a major challenge is to hedge against future uncertainties.
Markov Decision Processes (MDP) is a mathematical framework for stochastic sequential decision problems. In particular, given a probabilistic transition model, the
Figure 1-1: Applications of Multi-agent Systems. Left sub-figure shows multiple
robotic agents in a manufacturing setting; middle sub-figure shows a multi-vehicle
surveillance mission; right sub-figure shows a computer network.
goal is to find a policy that maximizes long term reward. A policy is a framework for
actions, which selects an appropriate action for each possible state of the system. A
naive approach is to develop a policy for each agent independently. However, explicit
coordination is desired when (i) a task is inherently multi-agent, such as transportation of a large object with combined effort, and (ii) a task is highly stochastic, such
that each agent should plan for contingencies in case of degraded capabilities of its
teammates. Another approach that explicitly accounts for coordination among a group of agents is to plan in the joint spaces. However, given the combinatorial nature of a multi-agent system, the joint planning space scales exponentially in the number of agents. In a simple surveillance domain, if the physical space is partitioned into 100 regions, each representing a possible state of a vehicle, a team comprised of ten vehicles would have 100^10 joint states. It is computationally prohibitive to enumerate
and iterate over such enormous planning spaces.
In short, multi-agent planning problems have great practical value, but are technically challenging because of (i) the need to model agent-level and environmental uncertainties, and (ii) the exponential growth of the planning space as the number
of agents increases. The primary objective of this thesis is to develop an algorithm
that can handle uncertainty models and scale favorably with the number of agents.
The proposed algorithm automatically identifies a decomposition structure from the
reward function, and attains computational tractability by exploiting this decomposition structure.
1.2 Related Work
Researchers have previously formulated multi-agent sequential decision problems as
Markov Decision Processes (MDP) [42] and Multi-agent Markov Decision Processes
(MMDP) [10]. Although an optimal policy can in theory be computed via dynamic
programming (DP) [5], the exponential growth of joint planning spaces has made such approaches impractical to implement. Instead, researchers have developed various generic methods (Section 1.2.1) for solving MDPs approximately at lower computational cost. In addition, researchers have studied various structural decomposition methods (Section 1.2.2) to lower computational complexity. Further, researchers have
considered solving multi-agent problems through formulation of a mathematical program (Section 1.2.3). A precise definition of MDP and MMDP, as well as the detailed
descriptions of classical DP methods, will be presented in the next chapter.
1.2.1 Dynamic Programming Approaches
Asynchronous Dynamic Programming
Classical DP methods maintain a tabular value function with an entry for every state, and perform a Bellman update once for every state per iteration. Asynchronous DP [7, 35]
is a class of DP algorithms that performs Bellman update at different frequencies for
different parts of the state space. The key idea is that some parts of the state space
are more likely to generate a favorable policy; and thus, it is desirable to focus on
those parts of the state space.
In the Prioritized Sweeping algorithm [39], the residual (the difference between successive
Bellman updates) is used as a metric to determine the best set of states for future
update. LAO* [27] keeps track of all previously explored states, and slowly expands
the set if a Bellman update suggests that an unexplored state could be useful. Real time dynamic programming (RTDP) [50, 9] simulates trajectories (Monte Carlo simulation) through the state space and performs Bellman updates on the explored trajectories.
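As a concrete illustration of residual-driven prioritization, the following Python sketch performs Bellman updates in order of decreasing residual. It is a simplified rendition of the idea rather than any of the cited implementations, and the dictionary-based MDP encoding is an assumption made for this example.

import heapq

def prioritized_updates(S, A, P, R, gamma=0.95, eps=1e-4, max_updates=100000):
    # Asynchronous DP sketch: states with large Bellman residuals are updated first.
    # P[s][a] is a list of (next_state, probability) pairs; R[(s, a)] is a scalar reward.
    V = {s: 0.0 for s in S}

    def backup(s):
        # One Bellman update at state s under the current value estimates.
        return max(R[(s, a)] + gamma * sum(p * V[sp] for sp, p in P[s][a]) for a in A)

    # Seed the queue with every state's initial residual (max-heap via negated keys).
    queue = [(-abs(backup(s) - V[s]), s) for s in S]
    heapq.heapify(queue)
    for _ in range(max_updates):
        if not queue:
            break
        neg_residual, s = heapq.heappop(queue)
        if -neg_residual < eps:
            break                                  # all queued residuals are small
        V[s] = backup(s)
        # Predecessors of s may now have larger residuals; re-queue them.
        for sp in S:
            if any(nxt == s for a in A for nxt, _ in P[sp][a]):
                heapq.heappush(queue, (-abs(backup(sp) - V[sp]), sp))
    return V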
Asynchronous DP has been shown to be very effective in planning problems with
a well defined initiation set and termination set; that is, the goal is to find a policy
that can drive the system from one part of the state space to another part of the state
space. For instance, a stochastic path planning problem requires finding a policy that
guides a vehicle from a starting position to a goal position. In such problems, the
algorithm can focus on parts of the state space "connecting" the initiation set and
the termination set. However, in some persistent planning domains where initiation
and termination sets are not specified, such as in reconnaissance missions, it is difficult to determine which states are more important than others. Further, this class of algorithms still requires explicit enumeration of a substantial portion of the joint
planning spaces, which can be impractical for multi-agent applications.
Approximate Dynamic Programming
Approximate DP [14, 22] is a class of DP algorithms that uses feature vectors to
aggregate states for lowering computational complexity. In particular, given a set of
feature vectors, φ_i(s), the value function can be approximated as V(s) = Σ_i w_i φ_i(s). The algorithms attempt to find the set of weights w_i that best approximates the true value function. Since the number of features (|Φ|) is typically much smaller than the number of states (|S|), approximate DP achieves significant computational savings compared to classical DP.
Unsurprisingly, since |Φ| < |S|, the representation power of the feature space (a linear combination of feature vectors) is less than that of tabular enumeration. Thus, various error metrics have been proposed for feature representations of the true value function. Minimizing different error metrics leads to different approximate Bellman update steps, which define different algorithms. For instance, Least Squares Temporal Difference (LSTD) [13] is a popular approximate DP algorithm that minimizes a projection error under the Bellman update. Approximate Value Iteration and approximate Policy Iteration are other examples of this class of algorithms. GA-dec-MMDP [44] is an algorithm developed specifically for multi-agent systems, which, by specifying a set of feature vectors, allows each agent to model the rest of the team in an approximate,
aggregate manner.
The key idea of approximate DP is that many states are similar. For instance,
in a path planning domain, one feature vector could lump together all states near
obstacles. One Bellman update step can change the value of all related states simultaneously. However, it usually requires extensive domain knowledge to specify a good
set of feature vectors.
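To make the idea concrete, the short sketch below fits the weights w_i by least squares against a tabular value function; the grid-world features are hypothetical and serve only to illustrate the approximation V(s) ≈ Σ_i w_i φ_i(s).

import numpy as np

def fit_linear_value_function(states, V_table, features):
    # Approximate V(s) ~= sum_i w_i * phi_i(s) by least squares against a tabular V.
    Phi = np.array([[phi(s) for phi in features] for s in states])   # |S| x |Phi| feature matrix
    v = np.array([V_table[s] for s in states])
    weights, *_ = np.linalg.lstsq(Phi, v, rcond=None)
    return weights

# Hypothetical features for a grid-world path planning domain.
goal = (4, 4)
features = [
    lambda s: 1.0,                                             # bias term
    lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1])),    # negative distance to goal
]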
1.2.2 Structural Decomposition Approaches
Structural decomposition methods [26] decompose a planning problem into a number
of subproblems with smaller planning spaces. Then, each subproblem is solved individually to compute local policies. Lastly, the local policies are combined to generate
a policy to the original problem.
Serial Decomposition
In serial decomposition, the planning spaces are partitioned into smaller sets. A
subproblem (i.e. MDP) can be formulated around each partitioned set. The size of
the original problem is the sum of that of the subproblems; that is, the planning spaces
of the original problem are the union of that of the subproblems. At each decision
period, serial decomposition identifies one active subproblem [29], and follows the
subproblem's local policy.
Consider a stochastic path planning problem inside a building.
The building
has multiple rooms connected by hallways and corridors. A serial decomposition
algorithm would decompose the physical space into individual rooms and hallways.
Local policies can be computed inside each room to drive the vehicle towards the
door, which connects the room to hallway or other rooms. A policy to the original
problem can be found by first identifying the room in which the vehicle currently resides, and then following that room's local policy. A popular serial decomposition algorithm is MAXQ by Dietterich et al. [21].
Serial decomposition works well when there exist good heuristics for partitioning
the planning space. Also, it usually assumes infrequent transition among different
partitions. Thus, these algorithms typically require domain knowledge. More importantly, serial decomposition still requires explicit enumeration of the entire state
space, which can be impractical for large scale multi-agent systems.
Parallel Decomposition
In parallel decomposition, the joint planning spaces are decomposed along different
dimensions. For instance, vehicles in surveillance mission can have a dimension specifying its geographical location and another dimension specifying its operational health
status. A subproblem is formulated for each dimension (or a subset of dimensions).
The size of the original problem is the product of that of the subproblems, that is,
the planning space of the original problem is the Cartesian product of that of the
subproblems. At each decision period, multiple subproblems are active, and the local
policies are combined to form a policy to the original problem.
It is worth noting that adding an agent is equivalent to adding a dimension in a
planning problem (see next chapter for details). Thus, parallel decomposition methods have been widely applied to multi-agent systems. For instance, the coordination
graph approach [25, 32] assumes different subsets of agents are coupled through locally shared value functions. Subproblems are formed around each subset of connected agents and a policy is computed by maximizing the sum of each subproblem's value function. Other examples include Large Weakly Coupled MDPs [38], Factored MDPs [24], Concurrent MDPs [37], and Dynamically Merged MDPs [49].
Parallel decomposition methods require a user-specified decomposition structure, which relies on domain knowledge. Also, coupling between subproblems is assumed to
take simple forms, such as addition of local value functions or reward functions.
1.2.3 Optimization Based Approaches
The previous two sections described solution approaches in the MDP framework.
In the optimization community, researchers have studied task allocation problems
for multi-agent systems [28, 2]. In particular, the objective is to optimize a joint
performance metric given individual scoring functions [30]
max_x  Σ_{i=1}^{N_a} Σ_{j=1}^{N_t} F_ij(x_ij, τ)                    (1.1)
where F_ij(x_ij, τ) is the scoring function specifying the reward of agent i performing task j at time τ, and x_ij are binary decision variables. The optimization problem
can also be augmented with various constraints. Exemplar algorithms for solving
this optimization problem approximately are sequential greedy algorithm and auction
based algorithms. Since this class of algorithms avoids enumerating and iterating
through the problem's planning space, it scales favorably with the number of agents
(compared to MDP approaches).
Optimization based algorithms were originally developed for deterministic systems and have recently been extended to incorporate some aspects of uncertainty models. For instance, Campbell et al. [15] used agent-level MDPs to compute the scoring functions F_ij. Yet, the structure of a task needs to be specified by the user, which relies on domain knowledge. Further, Eqn. 1.1 explicitly assumes additive utility between agents.
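A minimal sketch of the sequential greedy approach to Eqn. 1.1 is given below; the scoring table is hypothetical, and each agent is restricted to a single task for simplicity.

def sequential_greedy_allocation(agents, tasks, score):
    # Greedily build an assignment that approximately maximizes the sum of scores F_ij.
    # score(i, j) plays the role of F_ij in Eqn. 1.1; returns a dict agent -> task.
    assignment = {}
    free_agents, free_tasks = set(agents), set(tasks)
    while free_agents and free_tasks:
        i, j = max(((a, t) for a in free_agents for t in free_tasks),
                   key=lambda pair: score(*pair))       # best remaining (agent, task) pair
        assignment[i] = j
        free_agents.discard(i)
        free_tasks.discard(j)
    return assignment

# Usage with a hypothetical scoring table.
scores = {(0, 'search'): 3.0, (0, 'track'): 1.0, (1, 'search'): 2.0, (1, 'track'): 2.5}
print(sequential_greedy_allocation([0, 1], ['search', 'track'], lambda i, j: scores[(i, j)]))
# {0: 'search', 1: 'track'}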
1.3 Outline and Summary of Contributions
This thesis proposes a novel hierarchical decomposition algorithm, which decomposes
a multi-agent planning problem into a hierarchy of tasks by exploiting the structure
of the reward function. Often times the reward function reveals a different degree of
coupling between agents, and thus can guide the decomposition procedure. Smaller
planning problems are formed around the tasks at the bottom of the hierarchy. These
subproblems have state spaces that are orders of magnitudes smaller than that of the
original problem, thus allowing for computational tractability. Lastly, the algorithm
combines solutions of the subproblems to construct a policy to the original problem.
The proposed algorithm is shown to achieve better performance at lower computational cost in a persistent surveillance mission. The algorithm has also been applied
to other multi-agent planning domains with more complex coupling relations. In addition to simulation, the algorithm has been run on real robots to demonstrate that
it is real time implementable.
• Chapter 2 presents the preliminaries for modeling multi-agent systems under
uncertainty. It reviews the formulation of Markov Decision Processes (MDP)
and Multi-agent Markov Decision Processes (MMDP). It describes the mechanics of value iteration (a dynamic programming algorithm) for solving MDPs and
MMDPs.
• Chapter 3 proposes the Hierarchically Decomposed Multi-agent Markov Decision Processes (HD-MMDP) algorithm for solving MMDPs [16]. It motivates the solution approach by examining different functional structures in an MMDP's
reward function. Then, it describes the mechanics of the HD-MMDP algorithm
for generating an approximate solution to multi-agent planning problems.
• Chapter 4 describes a persistent search and track (PST) mission and applies HD-MMDP to solve this planning problem. Then, it presents an empirical analysis of HD-MMDP. On the ten-agent PST domain, simulation results show that HD-MMDP outperforms the GA-dec-MMDP algorithm by more than 35%. HD-MMDP
is also applied to another planning problem, the intruder monitoring mission
(IMM), and demonstrated sustainable performance. Lastly, several extensions
to HD-MMDP algorithms are proposed.
• Chapter 5 presents the hardware setup of the Real-time indoor Autonomous Vehicles ENvironment (RAVEN). It also describes a flexible software architecture that enabled hardware experiments involving multiple robots.
A novel multi-agent path planning algorithm is proposed and implemented in
RAVEN testbed. Hardware implementation of intruder monitoring mission is
presented to demonstrate that HD-MMDP is real time implementable.
• Chapter 6 offers concluding remarks and suggests possible directions for future work.
Chapter 2
Preliminaries
This chapter presents preliminaries for Markov Decision Processes, Multi-agent Markov
Decision Processes, and Dynamic Programming solution approaches.
2.1 Markov Decision Processes
Markov Decision Process (MDP) [42] is a flexible mathematical framework for formulating stochastic, sequential decision problems. MDP was originally developed
for discrete systems, where the number of states and actions is finite. Recent works
have extended MDP formulation to continuous systems, whereas this thesis focuses
on discrete MDPs. Mathematically, an MDP is defined by a tuple (S, A, P, R, γ), which
are state space, action space, state transition function, reward function, and discount
factor, respectively.
• State Space: S
State space is the set of all possible states of a system. For instance, imagine a
room partitioned into a 5 x 5 grid; then there are 25 possible states, each
of which uniquely identifies a configuration of the system. For some systems, a
state may be comprised of multiple dimensions. Consider a vehicle navigating
in the 5 x 5 grid. In addition to a locational dimension, it may additionally
have a battery level dimension, discretized to the nearest percentage. Then, the
state space of the system is the Cartesian product of both dimensions. The size of
the state space is the product of that of each dimension, which is 2500 in the
above example. Therefore, the state space of the system grows exponentially
with the addition of new dimensions, a phenomenon referred to as the curse
of dimensionality in the literature.
• Action Space: A
Action space is the set of all actions that a system can exert. For instance, a
vehicle navigating in a 5 x 5 grid must choose to go {forward, backward, left,
right}. In addition, the set of permissible actions may depend on the state of
the system, such that the vehicle cannot go across a physical boundary.
• State Transition Function: P
The state transition function specifies the probability of reaching state s' by taking action a at state s, that is, P: S × A × S → [0, 1]. The state transition function reflects the Markovian property of the problem, which means that the uncertainty distribution over future states depends solely on the system's current state and action, and is independent of the system's past state trajectory. The state transition function also reflects the degree of stochasticity in the system. A vehicle navigating in the 5 x 5 grid may be a stochastic system if the floor is slippery. In particular, a vehicle driving straight may have a 90% probability of reaching its intended location in front, but also a 10% probability of ending up in an adjacent location to the left or right, due to the slippery conditions.
• Reward Function: R
The reward function maps each state-action pair to a scalar value, that is, R: S × A → ℝ. Intuitively, the reward function is designed to favor a set of desired
states. For instance, a vehicle navigating in the 5 x 5 grid may assign a high
reward to its goal position, thereby favoring a policy that drives the vehicle
toward its goal. This thesis assumes that the objective is to generate a policy
that maximizes expected reward. Alternatively, in some scenarios, it is more
natural to think about cost and the objective is to minimize expected cost.
"
Discount Factor: -y
Discount factor is a scalax in range (0,1). Intuitively, a discount factor specifies
that reward in the distant future is less valuable. Computationally, it also has
some important implications on theoretical aspects.
• Policy: π
A deterministic policy, π, is a function that maps each state to an action, that is, π: S → A. Solving an MDP is the process of computing a policy that maximizes the discounted cumulative reward, which is called the value function. For an infinite horizon MDP, the value function is defined as

V^π(s) = E[ Σ_{t=0}^{∞} γ^t R(s_t, π(s_t)) ]                    (2.1)
The discount factor must be less than one or the value function may otherwise
diverge.
Theoretical results have been developed showing that the optimal
policy for an infinite horizon MDP is deterministic and stationary.
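For completeness, the boundedness that the discount factor provides can be stated explicitly; the following short derivation assumes only that the reward magnitude is bounded by some constant R_max.

% Bounded rewards plus geometric discounting keep the value function finite.
|V^{\pi}(s)| \;\le\; \sum_{t=0}^{\infty} \gamma^{t} R_{\max}
            \;=\; \frac{R_{\max}}{1-\gamma},
\qquad 0 \le \gamma < 1 .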
2.2 Multi-agent Markov Decision Processes
Multi-agent Markov Decision Processes (MMDP) [11] is the application of MDP to
model multi-agent systems. In particular, an MMDP is defined by a tuple (n_a, S, A, P, R, γ), where n_a is the number of agents, and the other variables are defined similarly to those of an MDP.
MMDP is distinguished from an MDP by its assumption of a factored action space,
according to the original definition by Boutilier. There are also two other assumptions,
factored state space and transition independence, that researchers often associate with
MMDP. In this thesis, all three conditions are assumed to hold.
• Factored Action Space
The joint action space is assumed to be the Cartesian product of that of each agent, that is, A = A_1 × A_2 × ⋯ × A_{n_a}. This condition implies that each agent's
action space remains unchanged in presence of its teammates, and that the joint
action is specified by actions of all agents.
• Factored State Space
The joint state space is assumed to be the Cartesian product of that of each agent, that is, S = S_1 × S_2 × ⋯ × S_{n_a}. This condition implies that each agent's state space remains unchanged in presence of its teammates, and that the joint state is specified by the states of all agents. The factored state and action space assumptions reflect the thinking that each agent remains its own entity.
• Transition Independence
The transition function is assumed to be the Cartesian product of that of each agent, that is, P = P_1 × P_2 × ⋯ × P_{n_a}. This condition implies that each
agent's transition dynamics is only a function of its own state and action, and
is independent of its teammates' states and actions [4]. Low level interaction
between agents, such as interference due to occupying the same state, is not
explicitly modeled given this condition.
In comparison to the MDP formulation, it can be seen that adding an agent to an MMDP has a similar effect to adding a dimension in an MDP. Planning spaces is a phrase used to
refer to state spaces and action spaces. Thus, the size of the planning spaces scales
exponentially with the number of agents.
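A minimal sketch of this growth, assuming identical per-agent state spaces, is shown below; it computes the joint state space size implied by the factored-state assumption and enumerates it only for a very small team.

from itertools import product

def joint_space_size(per_agent_states, n_agents):
    # Size of the joint state space under the factored-state assumption.
    return len(per_agent_states) ** n_agents

grid_states = range(100)                              # 100 per-agent states
print(joint_space_size(grid_states, 10))              # 100^10 = 10^20 joint states
joint_states_two_agents = list(product(grid_states, repeat=2))
print(len(joint_states_two_agents))                   # 10000; explicit enumeration only feasible for tiny teams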
2.3 Dynamic Programming Algorithms
Dynamic Programming (DP) [7] is a class of algorithms that solves optimization problems by breaking the problem down into a number of overlapping simpler subproblems. The subproblems are solved in different stages, and the solution to each subproblem is maintained in memory and used for solving later subproblems.
In the context of MDPs, dynamic programming exploits the Markovian property of
the problem. The best action to take at a particular state is independent of how the
system has reached the particular state. Thus, the original problem of finding the
Algorithm 1: Value Iteration
Input: MDP (S, A, P, R, γ), tolerance ε
Output: policy π
1  initialize V^0 ← 0, n ← 0
2  while max_s residual(s) > ε do
3      foreach s ∈ S do
4          V^{n+1}(s) ← max_a {R(s, a) + γ Σ_{s'} P(s, a, s') V^n(s')}
5          residual(s) ← |V^{n+1}(s) − V^n(s)|
6      n ← n + 1
7  return π(s) ← arg max_a {R(s, a) + γ Σ_{s'} P(s, a, s') V^n(s')}
optimal sequence of actions is equivalent to the simpler problem of finding the best
action assuming the system will follow the optimal policy thereafter. This idea leads
to the Bellman optimality condition,
V^{π*}(s) = max_a { R(s, a) + γ Σ_{s'} P(s, a, s') V^{π*}(s') }                    (2.2)

where V^{π*}(s) is the optimal value function associated with the optimal policy π*. Eqn. 2.2 not only specifies a condition that the optimal value function should satisfy, but also suggests an iterative algorithm for computing it. In particular, given an initialization V^0, one can apply

V^{k+1}(s) = max_a { R(s, a) + γ Σ_{s'} P(s, a, s') V^k(s') }                    (2.3)

to drive the next iterate closer to V^{π*}. Eqn. 2.3 is called the Bellman update, or Bellman backup, at state s. Theoretical results show that V^{π*}(s) is unique and that performing Bellman updates will drive any initialization of the value function toward the optimal value function asymptotically.
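The asymptotic convergence claim rests on the standard fact that the Bellman update is a γ-contraction in the max norm, summarized below.

% The Bellman operator T and its contraction property.
(T V)(s) = \max_{a}\Big\{ R(s,a) + \gamma \sum_{s'} P(s,a,s')\, V(s') \Big\},
\qquad
\lVert T V_1 - T V_2 \rVert_{\infty} \le \gamma \, \lVert V_1 - V_2 \rVert_{\infty}.
% By the Banach fixed-point theorem, T has a unique fixed point V^{\pi^*}, and
% \lVert V^{k} - V^{\pi^*} \rVert_{\infty} \le \gamma^{k} \lVert V^{0} - V^{\pi^*} \rVert_{\infty}.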
Value Iteration (VI) [36] is shown in Algorithm 1. The algorithm computes a
policy π given an MDP and a convergence tolerance ε. The algorithm first initializes the value function in line 1. Then, a Bellman update (lines 3-5) is performed at each state s ∈ S using the current iterate. The residual, which is the change in the value function between successive iterates, is computed and stored. Once the maximum residual falls below the tolerance ε, it is assumed that the value function has converged (lines
2-6). Lastly, a policy is retrieved from the value function by taking greedy selection
with respect to Bellman update equation (line 7).
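A direct Python transcription of Algorithm 1 is sketched below as an illustration (not the implementation used in the thesis), assuming the MDP is supplied as dictionaries.

def value_iteration(S, A, P, R, gamma, eps=1e-6):
    # Value iteration (Algorithm 1). P[(s, a)] is a list of (next_state, probability)
    # pairs and R[(s, a)] is a scalar reward. Returns the greedy policy and V.
    V = {s: 0.0 for s in S}                                    # line 1: V^0 <- 0

    def q(s, a, V):
        return R[(s, a)] + gamma * sum(p * V[sp] for sp, p in P[(s, a)])

    while True:
        V_next = {s: max(q(s, a, V) for a in A) for s in S}    # lines 3-4: Bellman sweep
        residual = max(abs(V_next[s] - V[s]) for s in S)       # line 5: residual
        V = V_next
        if residual <= eps:                                    # line 2: convergence check
            break

    policy = {s: max(A, key=lambda a: q(s, a, V)) for s in S}  # line 7: greedy policy
    return policy, V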
Each Bellman update step (line 4) requires iterating through every possible action and every possible next state, thus yielding run time complexity O(|A||S|), where |·| denotes the cardinality of a set. Consequently, each Bellman update sweep (lines 3-5) yields run time complexity O(|A||S|^2). Empirically, VI usually requires between 10 and 100 Bellman update sweeps to converge. Note that the run time complexity depends
on the size of the planning spaces. This relates to the curse of dimensionality such
that DP approaches quickly become computationally intractable with addition of new
dimensions to the planning spaces.
There also exists another popular DP approach known as policy iteration (PI) [36].
PI iterates between two steps, policy evaluation and policy update. In its original
form, PI has greater run time complexity than VI. A plethora of algorithms have
been developed based on PI and VI, including those reviewed in Sec. 1.2.1.
2.4 Extensions to Markov Decision Processes
2.4.1 Partially Observable Markov Decision Processes (POMDP)
Markov Decision Process assumes full observability; that is, after taking an action,
the next state will be known to the system. This model works reasonably well when
measurement noise is small compared to discretization level. Otherwise, uncertainty
distribution after each measurement needs to be explicitly modeled, which leads to the
formulation of Partially Observable Markov Decision Processes (POMDP) [47], defined by the tuple (S, B, A, P, R, γ). B denotes the belief space, which is a distribution over
states. When solving a POMDP, one has to keep track of the evolution of belief states.
It is extremely computationally intensive to compute the optimal policy for POMDP.
In particular, MDP is P-complete whereas POMDP is PSPACE-hard [41]. Recently,
researchers have developed more scalable algorithms, such as FIRM [1], that solve
POMDPs approximately.
2.4.2 Decentralized Markov Decision Process (dec-MMDP)
Multi-agent MDP, as defined in section 2.2, is a specialization of MDP, thus also
assumes full observability.
Assuming each agent can only measure its own state,
MMDP requires communication between agents such that a centralized planner can
gain full observation to the system. In contested environments where communication
bandwidth is limited, such assumption may be unrealistic. Thus, many researchers
have studied multi-agent sequential decision problems from a decentralized perspective. One particular modeling framework is dec-MDP [23]. Dec-MDP assumes joint
full observability, which means full observability can be achieved only by combining
observations of all agents. The communication limitation is modeled by a probability measure that specifies the probability of receiving an agent's observation given its
state and action. Thus, the optimal policy should strike a balance between optimizing
the reward function and maintaining a good degree of communication. Dec-MMDP is
NEXP-Complete due to its joint full observability assumption [6]. This thesis assumes
full observability and focuses on MDP and MMDP formulations.
2.5 Summary
This chapter reviewed the formulation of Markov Decision Processes (MDP) and Multi-agent Markov Decision Processes (MMDP). It also described dynamic programming (DP) as a class of algorithms for computing optimal solutions of MDPs and MMDPs. However, DP approaches are not practical for solving MMDPs due to the curse of dimensionality. The next chapter introduces an algorithm that solves MMDPs at lower
computational complexity by exploiting structures common to MMDPs.
Chapter 3
Hierarchical Decomposition of
MMDP
This chapter presents the mechanics of a novel hierarchical decomposition algorithm
for solving Multi-agent Markov Decision Processes (MMDP). The chapter starts with a discussion of various structures of an MMDP's reward function, and relates different functional structures to different degrees of coupling between agents. Then, it proposes an algorithm that exploits such structures for solving MMDPs. In particular, the proposed algorithm decomposes an MMDP into a number of subproblems, solves each subproblem individually, and combines the subproblems' local policies to form a
global policy to the MMDP. To distinguish between the subproblems and the original MMDP, the phrases local policy, local reward function are used to refer to the
subproblems, and the phrases global policy, global reward function, joint planning
spaces are used to refer to the original MMDP. If it is clear in context, the planning
spaces of a multi-agent subproblem are also referred to as joint planning spaces.
Figure 3-1: Construction of Multi-agent Reward Function. Each shape represents a
task that can be accomplished by a single agent.
3.1 Structure of Reward Functions
3.1.1 Construction of Reward Functions
The reward function, as defined in Section 2.1, can be specified in a tabular form
with one scalar entry per state-action pair. The scalar values can be arbitrary and
there need be no correlation between any two scalars. However, this mathematical
perspective is rarely useful in practice. After all, the reward function is designed by
people to induce a favorable policy upon solving the MDP. Thus, the reward function is often compactly represented by algebraic functions, which focus on a small set of favorable states. More importantly, the set of favorable states is often expressed as a composition of simple rules. For instance, in a multi-agent surveillance domain, a
reward can be given when (i) a certain number of agents is actively patrolling in a
specific area, and (ii) each patrolling agent has an adequate amount of fuel and fully
operational sensors.
The observation that reward functions usually have simple algebraic structure
is formalized with the definition of factored reward function. Before presenting the
precise mathematical formulation, it is useful to provide a simple motivating example
for construction of a multi-agent problem, as shown in Fig. 3-1. Let each shape
represent a point of interest in a 5 x 5 grid world; each shape is associated with a scalar
reward labeled to the right. Given agent dynamics and environmental constraints,
Figure 3-2: Two Agent Consensus Problem. Agents need to agree on which grid to
visit together (both choosing green action or red action).
developing a policy to reach each point of interest can be formulated as a single agent
MDP. These MDPs are called leaf tasks. With a slight abuse of terminology, the
state spaces of these MDPs are also referred to as leaf tasks. Given these building
blocks, more complicated missions can be constructed by adjoining these leaf tasks.
For instance, the addition r_1 + r_2 is used to indicate that tasks 1 and 2 can be executed independently, and the multiplication r_3 · r_4 is used to indicate that tasks 3 and 4 need to be executed simultaneously. We can build more complex missions
by repeated composition of leaf tasks.
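A small sketch of this composition, using made-up points of interest and reward values, might look as follows.

# Hypothetical singleton leaf rewards: each pays off at one point of interest on the grid.
def make_leaf_reward(point, value):
    return lambda s: value if s == point else 0.0

r1 = make_leaf_reward((0, 4), 2.0)
r2 = make_leaf_reward((4, 0), 5.0)
r3 = make_leaf_reward((2, 2), 1.0)
r4 = make_leaf_reward((4, 4), 9.0)

def R_independent(s1, s2):
    # Addition: tasks 1 and 2 can be executed independently.
    return r1(s1) + r2(s2)

def R_simultaneous(s3, s4):
    # Multiplication: reward is collected only when tasks 3 and 4 are done simultaneously.
    return r3(s3) * r4(s4)

def R_global(s1, s2, s3, s4):
    # Repeated composition builds a more complex four-agent mission.
    return R_independent(s1, s2) + R_simultaneous(s3, s4)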
The previous example illustrates composition of single-agent (singleton) leaf tasks.
Yet, there are multi-agent tasks that cannot be decomposed into singleton tasks. For
instance, consider the consensus problem shown in Fig. 3-2. The reward function for
this problem is r(s_1, s_2) = w(s_1) I(s_1 = s_2), where w(·) is a scalar function determining the value of each grid and I(·) is the indicator function specifying the condition that
both agents must be in the same grid for collection of reward. Thus, when choosing
an action, each agent must be cognizant of the other agent's state and action. Let
the green and red arrows denote two actions that each agent can choose from. In
this case, depending on magnitude of w(s) and state transition uncertainties, both
agent should either choose the green arrows or the red arrows. Mathematically, it is
difficult to decompose the indicator function into a composition of more elementary
functions of si and s2 separately.
3.1.2 Factored Reward Function
This section introduces some terminology to formalize the observations made in
section 3.1.1.
Task
The i-th task, T^i, is defined as the joint state space S^i = S_1 × ⋯ × S_{n_t} of any set of n_t agents. Task T^i is associated with a local reward function r^i, which is a part of the global reward function R. A local reward function is a mapping from the joint state space to a scalar, that is, r^i: T^i → ℝ. Further, t^i is defined as a particular state in task T^i, that is, t^i ∈ T^i. The definition of a task does not restrict to any particular
subset of agents; reward is given as long as some subset of agents reaches the desired
joint state. The dimension of a task's state space is the minimum required number
of agents for executing the task. More agents can be assigned to a task for adding
redundancy, in which case reward is calculated based on the subset of agents that
best accomplishes the task. Note that the composition of tasks forms another task.
Leaf Task
A leaf task is a task that cannot be decomposed into smaller sub-tasks. More precisely,
a leaf task is associated with a leaf reward function, which is a function that cannot be expressed as a composition of simpler functions,

r_leaf(s_1, ..., s_{n_t}) ≠ f_1(C) f_2(C^c)                    (3.1)

where C ⊂ {s_1, ..., s_{n_t}}, and C ≠ ∅.
Intuitively, leaf reward functions are the elementary building blocks of a global reward
function. Leaf tasks are subproblems formed around leaf reward functions.
The single-agent tasks from Fig. 3-1 are examples of singleton leaf tasks. The consensus problem from Fig. 3-2 is an example of a two-agent leaf task.
Factored Reward Function
A factored reward function is the repeated function composition of leaf reward functions,

R(s_1, s_2, ..., s_{n_a}) = f_h{..., f_l[..., f_m(r_{m1}(t^{m1}), ..., r_{mk}(t^{mk})), ..., r_{l1}(t^{l1}), ..., r_{lk}(t^{lk})], ..., r_{h1}(t^{h1}), ..., r_{hk}(t^{hk})}                    (3.2)

where h, l, m are dummy indices showing the association of different terms, the f(·)'s are arbitrary functions, and r_i(t^i) is the leaf reward function r_i(·) evaluated at state t^i of task T^i. For example, consider a three agent planning problem with global reward function R(s_1, s_2, s_3) = r_1(t^1)(r_2(t^2) + r_3(t^3)). This can be written in the form of Eqn. 3.2 as R(s_1, s_2, s_3) = f_1(f_2(r_2(t^2), r_3(t^3)), r_1(t^1)), where f_1 is the scalar multiplication function and f_2 is the scalar addition function. Recall from the definition of a task that t is the association of multiple agents' states (i.e., t = (s_1, s_2, s_3) for a three-agent task).
3.1.3 Useful Functional Forms
Construction of reward function illustrates that different functional forms can be used
to express different coordination specifications. More importantly, different functional
forms reveal the extent to which agents are coupled. Intuitively, coupling between
agents is weak if each agent can choose its own action without regard to its teammates'
states and actions; and coupling is strong if an agent needs to take into account all other agents' states and actions. The strongest form of coupling is represented by a leaf reward function, for which further decomposition is difficult. This section presents two functional forms that can be decomposed into subproblems. Subtask is a phrase used to describe a subproblem that is formulated as an MDP.

Summation Reward Function: R(s_1, ..., s_{n_a}) = Σ_k r_k(t^k)
This functional form represents weak coupling between agents because the subtasks
are almost independent. In particular, assuming infrequent transition of individual
Figure 3-3: Summation Reward Function. Agent A_1 performs the red subtask independently of A_2 performing the green subtask. Note that the colors of the arrows are different, which is to show that each agent can perform its own subtask without
concerning the other agent.
agents between subtasks, solving each subtask optimally in parallel would yield a jointly optimal solution because

max [ Σ_{i=0}^{∞} γ^i Σ_k r_k(t^k) ] = Σ_k max [ Σ_{i=0}^{∞} γ^i r_k(t^k) ]
Consider the example shown in Fig. 3-3, whose reward function is r_1 + r_2. Assuming agent A_1 is assigned to the red subtask and agent A_2 is assigned to the green subtask, the jointly optimal policy is to have each agent execute its local optimal policy.
Therefore, the two subtasks are almost independent in the sense that each agent can
execute its local policy without concerning the other agent's state and action.
Multiplicative Reward Function: R(s_1, ..., s_{n_a}) = Π_k r_k(t^k)
Multiplicative reward function represents a stronger degree of coupling than that
of the summation functional form. In particular, solving each subtask optimally in
parallel would not yield a jointly optimal solution. Intuitively, this functional form
specifies that the subtasks should be executed simultaneously.
Consider the example shown in Fig. 3-4, whose reward function is r_3 · r_4. Let agent A_1 be assigned to subtask 3 and agent A_2 be assigned to subtask 4. For simplicity, let the transition model be deterministic. The optimal policy of each subtask, π*, can be found via value iteration. One other policy, π', is given for each subtask. Each policy induces a sequence of rewards (one scalar reward for each decision period), which is listed in Table 3.1. Recall that the objective is to maximize the discounted cumulative reward, Σ_i γ^i R_π(s_i). For each task, it is seen that π* induces
Figure 3-4: Multiplicative Reward Function. Agent A1 is performing the orange
subtask and agent A 2 is performing the blue subtask. The lasso over the two arrows
is showing that the two agents should explicitly coordinate when performing the
subtasks.
Table 3.1: Local Optimal Policies can be Jointly Suboptimal. Each cell shows the sequence of rewards induced by a policy and the discounted cumulative reward calculated from that sequence. For each subtask individually, π* is the better policy, but for the joint task the suboptimal local policies π' are better.

                         Reward induced by π*                          Reward induced by π'
subtask r_3              (1, 0, 0, 1, 0, 0, ...)   Σ_{i=0}^K γ^{3i}       (0, 0, 1, 0, 0, 1, ...)   Σ_{i=0}^K γ^{3i+2}
subtask r_4              (0, 1, 0, 0, 1, 0, ...)   Σ_{i=0}^K γ^{3i+1}     (0, 0, 1, 0, 0, 1, ...)   Σ_{i=0}^K γ^{3i+2}
joint task r_3 · r_4     (0, 0, 0, 0, 0, 0, ...)   0                      (0, 0, 1, 0, 0, 1, ...)   Σ_{i=0}^K γ^{3i+2}
a higher discounted cumulative reward than that of π'. However, the joint policy of executing the two π* independently will induce zero discounted cumulative reward. In contrast, combining the suboptimal local policies π' will induce a discounted cumulative reward of Σ_{i=0}^K γ^{3i+2}, which is greater than zero. This example shows that, in general, combining locally optimal policies does not yield a globally optimal policy for a multiplicative reward function.
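The entries of Table 3.1 can be checked numerically; the short sketch below evaluates the discounted sums for a representative discount factor.

def discounted_sum(rewards, gamma):
    return sum((gamma ** i) * r for i, r in enumerate(rewards))

gamma, K = 0.9, 100                                      # any gamma in (0, 1); K periods approximate infinity
r3_opt  = [1 if i % 3 == 0 else 0 for i in range(K)]     # subtask r_3 under pi*
r4_opt  = [1 if i % 3 == 1 else 0 for i in range(K)]     # subtask r_4 under pi*
r_prime = [1 if i % 3 == 2 else 0 for i in range(K)]     # either subtask under pi'

joint_opt   = [a * b for a, b in zip(r3_opt, r4_opt)]    # rewards never coincide under the two pi*
joint_prime = [a * b for a, b in zip(r_prime, r_prime)]  # rewards coincide every third period under pi'

print(discounted_sum(joint_opt, gamma))                  # 0.0
print(discounted_sum(joint_prime, gamma))                # about gamma^2 / (1 - gamma^3) > 0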
In discussion of functional form and coupling strength, it was assumed (i) infrequent transition of agents in between different subtasks and (ii) assignment of
the minimum required number of agents to subtasks (i.e., one agent is assigned to serve a singleton task). However, there are planning domains where such assumptions can
lead to poor solution quality. For instance, if agents cycle through different states
reflecting different capabilities (i.e. battery level), then they should be periodically
assigned to different subtasks. Also, if one subtask is much more important than
all other subtasks, then more agents should be assigned to this important subtask
for added redundancy. These are also important aspects that a planning algorithm
should consider for developing a policy to the original planning problem.
3.2 Hierarchically Decomposed MMDP (HD-MMDP)
This section develops a hierarchical decomposition algorithm [16] that exploits structure in the factored reward function. This algorithm will (1) decompose the original
problem into a number of subproblems, (2) solve each subproblem using classical dynamic programming (DP), (3) rank the subproblems according to some importance
metric, and (4) assign each agent to one of the subproblems. Each agent will query
an action from its assigned subproblem's local policy, which, in turn, forms a global
policy to the original MMDP.
3.2.1 Decomposition
The first step is to decompose a factored reward function and to build a task tree
structure. A task tree is a graphical representation of a factored reward function.
The leaf nodes of a task tree are the leaf tasks of the planning problem. Recall a leaf
task is an MDP formed around a leaf reward function. The internal nodes of a task
tree are function nodes.
The Decomposition subroutine, shown in Algorithm 2, builds a task tree structure
recursively. In lines 1-3, an MDP is formed if a leaf task is identified. Otherwise,
a function node (lines 4-8) is formed and each factor (argument of the function) is
converted into a subtree and added as one of its children. The subroutine is invoked
recursively (line 7) until all leaf tasks are identified.
For example, consider the factored reward function r_1 r_2 + r_3 r_4, where r_i is used
Algorithm 2: Decomposition
Input: Agents N_j; state, action, and transition model <S_j, A_j, P_j> of each agent, for j = 1, ..., n_a; discount factor γ; and reward function R
Output: A task tree with leaf nodes as smaller MDPs t^i, and function nodes at all internal nodes
1  if R cannot be decomposed then
2      p ← new MDP(<S_j, A_j, P_j>, R, γ)
3      return p
4  else
5      p ← new function node
6      foreach factor r^i in R do
7          node ← Decomposition(N_j, <S_j, A_j, P_j>, r^i, γ)
8          add node as a child of p
9      return p
Figure 3-5: Task Tree of Example 3.1. This tree is built automatically by invoking the decomposition subroutine on the reward function r_1 r_2 + r_3 r_4.
as a shorthand for r_i(t^i). The decomposition subroutine first identifies a summation function node, and adds the two arguments, r_1 r_2 and r_3 r_4, as children of the summation node. Then, the subroutine is invoked separately on each child. In particular, upon decomposition of r_1 r_2, a multiplication function node is formed, and two leaf reward functions are identified. The decomposition of r_3 r_4 is carried out similarly.
Lastly, leaf tasks are formed around each leaf reward function. The resulting task
tree is illustrated in Fig. 3-5. This example (Example 3.1) will be used throughout
this section to explain the key procedures in HD-MMDP.
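A minimal sketch of this recursion is given below; it assumes the factored reward is supplied as a nested ('+', ...)/('*', ...) expression with string leaf labels, and it builds the tree of Fig. 3-5 without constructing the underlying MDPs.

def decompose(expr):
    # Mirror of Algorithm 2 on a nested reward expression.
    # expr is a leaf label (e.g. 'r1') or a tuple (op, factor_1, factor_2, ...).
    if isinstance(expr, str):                          # lines 1-3: leaf reward -> leaf task
        return {'type': 'leaf', 'reward': expr}
    op, *factors = expr                                # lines 4-8: function node with children
    return {'type': 'function', 'op': op,
            'children': [decompose(f) for f in factors]}

# Example 3.1: r1*r2 + r3*r4
task_tree = decompose(('+', ('*', 'r1', 'r2'), ('*', 'r3', 'r4')))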
Figure 3-6: Solution to Leaf Tasks. Leaf tasks are formed around leaf reward functions
and are solved with value iteration. The maximum of each value function, r_max, is labeled above each respective leaf task. The leaf nodes are labeled T instead of r to emphasize that leaf tasks are formed around each leaf reward function.
3.2.2 Solving Leaf Tasks
Given a task tree structure, the second step is to solve each subproblem (leaf task)
at the bottom of the hierarchy. Each leaf task can be viewed as a standalone MDP,
and can be solved using classical dynamic programming approaches. The HD-MMDP algorithm uses value iteration (VI) to compute an optimal policy for each leaf task.
Use of approximate algorithms for solving leaf tasks is also possible.
The optimal policy and the associated optimal value function can be found upon
solving each leaf task. The value function of a leaf task is the expected cumulative
reward of performing the particular leaf task. Thus, the relative magnitude of each
leaf task's value function reveals which leaf task is more valuable. As a heuristic,
the maximum attainable expected reward (r_max = max_s V(s)) is used to compare relative importance between the leaf tasks. In Example 3.1, the maximum attainable expected reward, r_max, is calculated for each leaf task and labeled in Fig. 3-6.
3.2.3 Ranking of Leaf Tasks
The third step is to develop an importance ordering for the set of leaf tasks. The r_max values calculated in the previous step will be useful for constructing this ordering. However, they cannot be used directly. For instance, consider T_2 and T_4 in Fig. 3-6. Although T_2 has a greater r_max than that of T_4, the joint task T_1 × T_2 is less rewarding than the joint task T_3 × T_4. Thus, it would be preferable to assign agents to the joint
Algorithm 3: rankTaskTree
Input: A task tree, tree
Output: An ordered list of leaf tasks, L
1  foreach leaf task T^i in tree do
2      r^i_max ← max_{t^i} V^i(t^i)
3  foreach internal node in tree do
4      r_max ← f_i(r^{child_1}_max, ..., r^{child_n}_max)
5  L ← Sort(leaf tasks, compareTask(tree))
6  return L
Figure 3-7: Relative Importance of Leaf Tasks. To determine an execution order for the leaf tasks, it is necessary to find which leaf task would contribute more to the joint reward function. The first step is to compute the maximum attainable expected reward r_max at each internal node in the tree, shown in sub-figure (a). This is carried out by using r_max at the leaf nodes and percolating up to the root node in the task tree. The next step is to compute the degree to which each sub-task contributes to the task's cumulative reward, ∂r_max, shown in sub-figure (b).
task T_3 × T_4. Consequently, leaf task T_4 should be ranked higher than T_2 despite having a lower r_max value.
The rankTaskTree subroutine, as shown in Algorithm 3, ranks the leaf tasks by
pairwise comparison. In lines 1-2, the maximum expected cumulative reward, r_max, is determined from each subtask's value function. In lines 3-4, r_max at each internal node is calculated by percolating up from the leaf tasks. This procedure is illustrated in greater detail for Example 3.1, as shown in Fig. 3-7. It should be noted that r_max is a statistic, in this case the mean of a distribution. Thus, r_max = f_i(r^{child_1}_max, ..., r^{child_n}_max) is only an approximation.
After computing r_max for each node in the task tree, the last step in rankTaskTree is to sort the leaf tasks using r_max as a heuristic (line 5 of Algorithm 3). The degree
Algorithm 4: compareTask
Input: A task tree, tree; leaf tasks T^j and T^k
Output: the leaf task with lower rank
1  g ← mostRecentCommonAncestor(T^j, T^k)
2  p^j ← children(g) ∩ ancestors(T^j)
3  p^k ← children(g) ∩ ancestors(T^k)
4  if ∂r^{p^j}_max > ∂r^{p^k}_max or (∂r^{p^j}_max = ∂r^{p^k}_max and r^{p^j}_max > r^{p^k}_max) then
5      return T^k
6  return T^j
The sorting procedure relies on a pairwise comparison subroutine, compareTask, shown in Algorithm 4. Given two leaf tasks T^i and T^k, the subroutine
first identifies their most recent common ancestor, g (line 1). Then, in lines 2-3, the algorithm finds the ancestors, p^i and p^k, of the two leaf tasks at one level below g. Lastly, in lines 4-6, the leaf task whose ancestor has a higher ∂r_max value is given a higher priority ranking. If the ∂r_max values of both sub-tasks are equal, the leaf task whose ancestor has the higher r_max value is ranked higher.
Recall that the composition of tasks forms another task with larger planning spaces. Intuitively, Algorithm 4 first identifies the smallest joint task that contains both leaf tasks. Then, it tries to identify which leaf task is more valuable to the execution of the joint task. The application of the compareTask subroutine to Example 3.1, and the resulting importance ordering, are illustrated in Fig. 3-8.
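A minimal sketch of the pairwise comparison in Algorithm 4 is shown below, assuming the task tree is stored as node objects with parent, children, depth, r_max, and d_r_max (∂r_max) attributes; these names are illustrative assumptions, not the thesis code. The function returns the higher-priority leaf task.

```python
def ancestors(node):
    """Return the set of ancestors of a tree node by following parent pointers."""
    result = set()
    while node.parent is not None:
        node = node.parent
        result.add(node)
    return result

def compare_task(task_i, task_k):
    """Return the higher-priority leaf task, following the logic of Algorithm 4."""
    anc_i, anc_k = ancestors(task_i), ancestors(task_k)
    # Most recent common ancestor: the deepest node shared by both ancestor sets.
    g = max(anc_i & anc_k, key=lambda n: n.depth)
    # Children of g that lie on the path to each leaf (a leaf may itself be a child of g).
    p_i = next(c for c in g.children if c is task_i or c in anc_i)
    p_k = next(c for c in g.children if c is task_k or c in anc_k)
    # Compare the branch contributions; break ties with r_max.
    if (p_i.d_r_max, p_i.r_max) > (p_k.d_r_max, p_k.r_max):
        return task_i
    return task_k
```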
3.2.4 Agent Assignment
The final step is to assign agents to the leaf tasks in the importance ordering determined in the previous step. Recall that the ordering was determined by r_max = max_s V(s), which is the maximum expected cumulative reward. In general, however, an agent's current state will not coincide with the best state, that is, s ≠ argmax_{s'} V(s'). Thus, the algorithm must determine (1) how well an agent would execute a leaf task, and (2) the condition under which a leaf task can be considered completed, so that the algorithm can move on to the next task in the importance ordering.
Figure 3-8: Ranking of Leaf Tasks. In sub-figure (a), the algorithm compares whether leaf task 2 or leaf task 3 is more valuable to the joint reward function. First, the most recent common ancestor of the two leaf tasks is found, which is the root node in this case. Second, move one level down from the most recent common ancestor, finding the two multiplication nodes in this case. The ∂r_max and r_max values at this level are compared to determine the priority ranking between the pair of leaf tasks. In this case, the values of ∂r_max are equal, so the values of r_max are compared. T3 is of higher priority than T2 since 10 > 9, as shown by the colored r_max values. In sub-figure (b), each pair of leaf tasks is compared, which induces the importance ranking illustrated in the figure. Instead of comparing each pair of tasks, a depth-first search algorithm can be implemented for better computational efficiency.
The value function specifies the expected cumulative reward starting from the current state, and it is a good indication of how well an agent would execute a task. The completion criterion is implemented by specifying a set of minimum rewards {C^i_min} for the leaf tasks {T^i}. With domain knowledge, a designer can specify each value independently. In this thesis, C^i_min is chosen to be 0.8 r^i_max. The ratio C^i_min / r^i_max can be interpreted as the probability of successfully executing a task. This completion metric specifies the amount of desired redundancy in each leaf task, such that more agents (than the minimum required) will be assigned to task T^i until at least C^i_min units of reward are expected to be achieved. The computation of C^i_min for Example 3.1 is illustrated in Fig. 3-9. Lastly, an action is queried for each agent from the local policy of its assigned leaf task.
The assignTask subroutine, shown in Algorithm 5, details the task allocation process. For each leaf task, the expected cumulative reward R^i is initialized to zero, the boolean completion flag C^i is initialized to false, and the agent assignment list A^i is initialized as an empty list (line 1). The leaf tasks are then processed (agents are assigned) in order of decreasing importance (line 2). For each leaf task, the algorithm first identifies the agent N_j that would best execute the task (line 5). Then, it adds agent N_j to the agent assignment list A^i of task T^i (line 6). Recall that leaf tasks can be either singleton or multi-agent. Leaf task T^i will be executed once it has been assigned a sufficient number of agents (lines 7-9), at which point an action is queried for each agent and the expected cumulative reward R^i is updated.¹ If R^i exceeds the minimum required cumulative reward C^i_min, then task T^i is considered completed and its assigned agents are removed from the set of available agents, N (lines 10-13). The algorithm then moves on to the next leaf task, and the process terminates when every agent has been assigned a leaf task (lines 3-4). The application of the assignTask subroutine to Example 3.1 is shown in Fig. 3-10.
Algorithm 5: assignTask
Input: Agents N_j, sorted list of leaf tasks L, completion heuristics C^i_min, policies π^i and value functions V^i(·)
Output: Actions a_j for each agent j = 1, ..., n_a
1   R^i <- 0, C^i <- false, A^i <- {} for every leaf task
2   foreach leaf task T^i in L do
3       if N is empty then
4           break
5       foreach best agent N_j in N do
6           A^i <- A^i ∪ {N_j}
7           if T^i has enough agents then
8               a_j <- π^i(s_j)   for all N_j ∈ A^i
9               R^i <- f(V^i({s_j}))
10              if R^i > C^i_min then
11                  C^i <- true
12                  N <- N \ A^i
13                  break
14  return a_j for j = 1, ..., n_a
¹The details of the updating function are explained in Chapter 4.
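The greedy allocation loop of Algorithm 5 can be sketched as follows. This is a simplified illustration under assumed data structures (a Task object exposing policy(s), value(s), a completion threshold c_min, and the minimum team size n_required); it is not the thesis implementation, and the aggregation of individual values into R^i is abbreviated to a plain sum.

```python
def assign_tasks(agents, sorted_tasks):
    """Greedily assign agents to leaf tasks in order of decreasing importance.

    agents:       dict mapping agent id -> current local state s_j
    sorted_tasks: list of Task objects (policy, value, c_min, n_required)
    Returns a dict mapping agent id -> action.
    """
    available = dict(agents)
    actions = {}
    for task in sorted_tasks:
        if not available:
            break                       # every agent already has an assignment
        assigned = []
        # Consider agents in order of how well they would execute this task.
        for agent_id in sorted(available, key=lambda a: task.value(available[a]), reverse=True):
            assigned.append(agent_id)
            if len(assigned) < task.n_required:
                continue                # not yet enough agents to execute the task
            # Query the local policy and estimate the expected cumulative reward.
            for a in assigned:
                actions[a] = task.policy(available[a])
            expected = sum(task.value(available[a]) for a in assigned)  # simplified update
            if expected >= task.c_min:
                for a in assigned:
                    del available[a]    # task completed: agents are no longer available
                break
    return actions
```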
Figure 3-9: Computation of the Minimum Accepted Expected Reward C_min. The values of C_min reflect the degree of desired redundancy in the execution of leaf tasks and serve as a completion heuristic. These can be specified from domain knowledge, or using a heuristic approach such as multiplying r_max by a constant factor.
Figure 3-10: Assignment of Agents to Leaf Tasks. The smiley faces are symbols for agents: yellow indicates that an agent has not yet been assigned a task, and red indicates that it has. The numbers labeled inside the leaf nodes are C_min, the minimum accepted cumulative reward. The Roman numerals under each leaf node indicate the importance ordering. In sub-figure (a), the first agent is assigned to the highest-ranked leaf task, as shown by the green arrow. In sub-figure (b), since the first agent achieves an expected cumulative reward of 1.8, which is greater than the minimum required value of 1.6, the algorithm moves on to the second leaf task and assigns the next agent to it. In sub-figure (c), since the second agent only achieves an expected cumulative reward of 3.2, which is smaller than the minimum required value of 4, the next agent will still be assigned to the second leaf task.
Algorithm 6: HD-MMDP
Input: Agents N_j; state, action, and transition model <S_j, A_j, P_j> of each agent, for j = 1, ..., n_a; discount factor γ; reward function R; and completion heuristics C_min
Output: Actions a_j for each agent j = 1, ..., n_a
1  tree <- Decomposition(<S_j, A_j, P_j>, R, γ)
2  foreach leaf task T^i in tree do
3      π^i, V^i <- ValueIterate(T^i)
4  L <- rankTaskTree(tree)
5  {a_j} <- assignTask(N_j, L, C_min, π^i, V^i(·))
6  return a_j for j = 1, ..., n_a
3.2.5 Summary
This section has developed the hierarchical decomposition (HD-MMDP) algorithm for Multi-agent Markov Decision Processes. The proposed algorithm is outlined in Algorithm 6, which summarizes its four major steps. First, exploiting structure in the factored reward function, HD-MMDP decomposes the original MMDP into a number of subproblems (line 1). Second, each subproblem is formed as an MDP and solved with value iteration (lines 2-3). Third, the subproblems are ranked using the r_max heuristic (line 4). Fourth, the subproblems are executed in order of decreasing importance (line 5), a process that determines an action for each agent (generating a joint policy).
3.3 Run Time Complexity Analysis
This section establishes the run time complexity of the HD-MMDP algorithm and shows how it compares favorably to classical DP approaches. Let n_t be the number of leaf reward functions, n_a be the number of agents, |S| be the size of a single agent's state space, |A| be the size of a single agent's action space, and n_max be the number of required agents in the largest leaf task.
1. Decomposition: O(n_t)
In the structural decomposition step, each recursive call of the Decomposition subroutine expands one of the internal function nodes. Since each internal function node has at least two children, the number of internal function nodes is at most n_t − 1 (in the case of a binary tree). Thus, the run time scales linearly with the number of leaf tasks, n_t.
2. Solving Leaf Tasks: O(n_t |S|^{2 n_max} |A|^{n_max})
Recall from Section 2.3 that the run time complexity of value iteration is O(|S_MDP|^2 |A_MDP|). Also, the sizes of the joint state and action spaces of n agents are |S|^n and |A|^n, respectively. The largest leaf task (with n_max agents) therefore has state and action spaces of sizes |S|^{n_max} and |A|^{n_max}, respectively, and the run time complexity for solving it is O(|S|^{2 n_max} |A|^{n_max}). Since there are at most n_t leaf tasks of this size, solving every leaf task has run time complexity O(n_t |S|^{2 n_max} |A|^{n_max}).
3. Ranking Leaf Tasks: O(n_t log n_t)
Computing the maximum expected cumulative reward (r_max) for every internal function node has run time complexity O(n_t), because there are at most n_t − 1 internal function nodes and calculating r_max at each node involves a constant-time function evaluation. The sorting of leaf tasks can be implemented as a depth-first search, which has run time complexity O(n_t log n_t).
4. Task Allocation: O(n_t n_a^2)
In the task allocation step (Algorithm 5), finding the best agent to execute a leaf task (line 5) requires looping through each agent, which is O(n_a). For each leaf task, this procedure may need to be repeated up to n_a times to reach the desired degree of redundancy (line 7). In addition, the subroutine loops through all n_t leaf tasks (line 2). Hence, the task allocation step is O(n_t n_a^2).
The run time complexity of HD-MMDP is summarized in Table 3.2. For the applications considered in this work, the magnitude of each variable is as follows: n_t ≈ 10, |S| ≈ 100, |A| ≈ 10, n_max ≈ 3, n_a ≈ 10. Thus, the most expensive step is solving the leaf tasks individually (second row of the table).
Table 3.2: Run Time Complexity of HD-MMDP

Step                        Run Time Complexity
1) Decomposition            O(n_t)
2) Solving Leaf Tasks       O(n_t |S|^{2 n_max} |A|^{n_max})
3) Ranking of Leaf Tasks    O(n_t log n_t)
4) Task Allocation          O(n_t n_a^2)
In comparison, value iteration (VI) on the original MMDP has run time complexity O(|S|^{2 n_a} |A|^{n_a}). Since the planning spaces scale exponentially with the number of agents, VI is orders of magnitude more expensive than the proposed algorithm.
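As a rough back-of-the-envelope check using the magnitudes quoted above (n_t ≈ 10, |S| ≈ 100, |A| ≈ 10, n_max ≈ 3, n_a ≈ 10), the dominant cost of HD-MMDP can be compared against value iteration on the joint MMDP:

```python
n_t, S, A, n_max, n_a = 10, 100, 10, 3, 10

hd_mmdp_cost = n_t * S**(2 * n_max) * A**n_max   # ~1e16 operations (solving all leaf tasks)
joint_vi_cost = S**(2 * n_a) * A**n_a            # ~1e50 operations (VI on the full MMDP)

print(hd_mmdp_cost, joint_vi_cost, joint_vi_cost // hd_mmdp_cost)
```

The roughly 34 orders of magnitude between the two estimates illustrate why the joint problem is intractable while the decomposed one is not; the numbers are order-of-magnitude estimates only.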
3.4 Summary
This section introduced the Hierarchically Decomposed Multi-agent Markov Decision Process (HD-MMDP) algorithm. First, common structures in the MMDP's reward function were explored and related to the coupling strength between agents. Then, the mechanics of the HD-MMDP algorithm were developed. The run time complexity of HD-MMDP was shown to be much lower than that of classical dynamic programming approaches. The next chapter performs an empirical analysis of HD-MMDP by applying it to a multi-agent persistent search and track (PST) mission.
Chapter 4
Application to Health Aware Planning
This chapter presents the application of the HD-MMDP algorithm to two multi-agent planning domains. It begins with a verbal description of a persistent search and track (PST) mission and then provides a mathematical formulation of this problem. Further, the HD-MMDP algorithm is used to solve the PST mission, and the empirical results are analyzed and compared to those of other algorithms. Lastly, it describes possible extensions to the HD-MMDP algorithm with application to an intruder monitoring mission.
4.1 Persistent Search and Track Mission (PST)
The persistent search and track mission was first proposed and developed by Bethke et al. [8, 45]. The mission involves a group of autonomous Unmanned Aerial Vehicles (UAVs) equipped with sensors for monitoring ground targets. The objective is to have the UAVs continuously survey a region of interest while satisfying a communication constraint. Uncertainty plays a major role in this planning problem since the UAVs have stochastic sensor failure models and stochastic fuel consumption dynamics.
Figure 4-1: Persistent Search and Track Mission. The objective is to have the UAVs monitor the tasking region while maintaining active communication with the base region. In addition, there exist various sources of uncertainty, such as stochastic fuel consumption dynamics and a stochastic sensor failure model.
4.1.1 Mission Description
The mission scenario is illustrated in Fig. 4-1. There are three distinct geographical regions, namely the base, communication, and tasking regions. There may exist ground target vehicles, such as supply trucks from other parties, in the tasking region. UAVs are deployed to continuously monitor the tasking area while sending observations back to the base. For security reasons, the base region is located far from the tasking region, and no direct communication link can be reliably established. Thus, a communication relay region is required, strategically positioned to connect the base and tasking regions. A dedicated UAV must loiter in the communication region to enable reliable data transfer between the base and tasking regions.
Fuel management is a crucial aspect of the PST mission. Given physical limitations, UAVs have a finite fuel capacity. UAVs also have a stochastic fuel consumption model, due to uncertainties in the environment, such as wind gusts, and uncertainties in target vehicle behaviors (speed, turning radius) that require different degrees of maneuver aggressiveness. A UAV that runs out of fuel is declared crashed and is permanently removed from the mission. Thus, a major challenge is to minimize the risk of running out of fuel.
Sensor management is another crucial aspect of the PST mission. Each UAV is equipped with two sets of sensors: a set of surveillance sensors that enable detection and tracking of ground targets, and a set of actuation sensors that monitor actuator health. A UAV with a failed surveillance sensor cannot track ground targets effectively, but it can act as a communication relay. A UAV with an actuation sensor failure can neither track targets nor act as a communication relay. Neither failure mode leads to an immediate crash, and both can be repaired upon returning to base. In short, sensor failures lead to degraded capabilities.
The major challenges of planning in the PST domain are (i) coordinating the group of UAVs to maximize coverage of the tasking area and (ii) hedging against agent-level uncertainties to prevent running out of fuel and to minimize the effects of sensor failures. The goal is to develop a policy that achieves sustainable performance.
4.1.2 Problem Formulation
The PST mission is formulated as a Multi-agent Markov Decision Process. Recall from Section 2.2 that an MMDP is defined by a tuple (n_a, S, A, P, R, γ), which are the number of agents, state space, action space, state transition function, reward function, and discount factor, respectively.
State Space: S
The state, s_i, of a UAV, N_i, is specified by a vector of three dimensions: geographical location y_i, fuel level f_i, and sensor health status h_i. Each of these quantities takes a scalar value. In particular,

s_i = [y_i, f_i, h_i]^T    (4.1)
The set of possible locations comprises the base region Y_b, the communication region Y_c, and the tasking region Y_t. The set of possible fuel levels ranges from empty to full (F_max) in equipartitioned steps of Δf. The set of possible sensor health states comprises nominal health H_nom, surveillance sensor failure H_sen, and actuator sensor failure H_act. In summary,

y_i ∈ Y = {Y_b, Y_c, Y_t}    (4.2)
f_i ∈ F = {0, Δf, ..., F_max − Δf, F_max}    (4.3)
h_i ∈ H = {H_nom, H_sen, H_act}    (4.4)

The joint state space is the Cartesian product of that of each agent, that is, S = S_1 × S_2 × ... × S_{n_a}. In this thesis, F_max is set to 10 and Δf is set to 1. Hence, the size of a single UAV's state space is |S_i| = 3 × 11 × 3 = 99, and the size of the joint state space is |S| = 99^{n_a}.
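As a sanity check of the sizes quoted above, a single UAV's state space can be enumerated directly (a small illustrative snippet, not the thesis code):

```python
from itertools import product

Y = ["Y_b", "Y_c", "Y_t"]        # base, communication, and tasking regions
F = list(range(0, 11))           # fuel levels 0, 1, ..., 10 (F_max = 10, delta_f = 1)
H = ["H_nom", "H_sen", "H_act"]  # nominal, surveillance failure, actuator failure

single_uav_states = list(product(Y, F, H))
print(len(single_uav_states))    # 3 * 11 * 3 = 99
print(99 ** 10)                  # joint state-space size for n_a = 10 agents
```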
Action Space: A
The action, a_i, of a UAV is to move between different geographical regions, which are arranged in a chain with the communication region in the center. It is assumed that a UAV cannot fly from the base to the tasking region directly. In particular,

a_i ∈ {−1, 0, 1}    (4.5)

which correspond to "move towards base", "stay", and "move towards tasking region", respectively. Depending on the state of a UAV, only a subset of the actions is allowed,

a_i ∈ {−1, 0, 1}   if y_i = Y_c
a_i ∈ {−1, 0}      if y_i = Y_t
a_i ∈ {0, 1}       if y_i = Y_b
a_i ∈ {0}          if f_i = 0    (4.6)

where the last line specifies that a UAV remains stationary once it has run out of fuel. The size of a single agent's action space, |A_i|, is three. The size of the joint action space |A| is 3^{n_a}.
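The admissible-action rule of Eqn. 4.6 reduces to a simple lookup on a UAV's current region and fuel level; one possible encoding is sketched below (illustrative only):

```python
def allowed_actions(y, f):
    """Admissible actions (-1: toward base, 0: stay, +1: toward tasking), per Eqn. 4.6."""
    if f == 0:
        return [0]           # out of fuel: remain stationary
    if y == "Y_c":
        return [-1, 0, 1]    # communication region: all moves allowed
    if y == "Y_t":
        return [-1, 0]       # tasking region: cannot move further outward
    return [0, 1]            # base region: cannot move further toward base
```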
State Transition Model: P
Since the MMDP assumes transition independence, it suffices to specify the state transition model of a single UAV. Recall that a UAV's state comprises three dimensions: geographical location, fuel level, and sensor health status. The transition dynamics for each agent's position are deterministic, such that an agent always reaches its intended location,

y_i(k+1) = y_i(k)   if f_i = 0 or a_i(k) = 0
y_i(k+1) = Y_b      if y_i(k) = Y_c and a_i(k) = −1
y_i(k+1) = Y_t      if y_i(k) = Y_c and a_i(k) = +1
y_i(k+1) = Y_c      if y_i(k) = Y_t and a_i(k) = −1
y_i(k+1) = Y_c      if y_i(k) = Y_b and a_i(k) = +1    (4.7)
A stochastic fuel consumption model is developed to account for uncertainties in the environment. In particular, away from base an agent consumes one unit of fuel with probability p_f and two units of fuel with probability 1 − p_f. An agent is refueled to the maximum fuel level upon returning to base, and an agent that has run out of fuel remains out of fuel,

f_i(k+1) = f_i(k) − 1   if y_i(k+1) ≠ Y_b,   Prob: p_f
f_i(k+1) = f_i(k) − 2   if y_i(k+1) ≠ Y_b,   Prob: 1 − p_f
f_i(k+1) = 10           if y_i(k+1) = Y_b
f_i(k+1) = 0            if f_i(k) = 0    (4.8)
¹A state transition function should only depend on the system's current state, not its future state; thus, the use of y_i(k+1) is not allowed in general. However, it is acceptable here since y_i(k+1) is a deterministic function of s_i(k) and a_i(k).
Surveillance and actuator sensors can fail with probabilities p_sen and p_act, respectively. A UAV's failed sensor is fixed when the UAV returns to the base region,

h_i(k+1) = h_i(k)    if f_i(k) = 0
h_i(k+1) = H_nom     if y_i(k) = Y_b
h_i(k+1) = H_nom     if y_i(k) ≠ Y_b and h_i(k) = H_nom,   Prob: (1 − p_sen)(1 − p_act)
h_i(k+1) = H_act     if y_i(k) ≠ Y_b and h_i(k) = H_nom,   Prob: (1 − p_sen) p_act
h_i(k+1) = H_sen     if y_i(k) ≠ Y_b and h_i(k) = H_nom,   Prob: p_sen    (4.9)
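Eqns. 4.7-4.9 can be combined into a single-UAV transition sampler. The sketch below follows the equations above; the numeric probabilities are arbitrary example values, the fuel level is clamped at zero for simplicity, and the case of an already-failed sensor away from base (not spelled out in Eqn. 4.9) is assumed to persist.

```python
import random

def step_uav(y, f, h, a, p_f=0.9, p_sen=0.05, p_act=0.02):
    """Sample the next state (y', f', h') of one UAV, following Eqns. 4.7-4.9."""
    # Position (Eqn. 4.7): deterministic motion along the base <-> comm <-> tasking chain.
    if f == 0 or a == 0:
        y_next = y
    elif y == "Y_c":
        y_next = "Y_b" if a == -1 else "Y_t"
    else:
        y_next = "Y_c"       # from Y_t only a = -1 is admissible, from Y_b only a = +1

    # Fuel (Eqn. 4.8): refuel at base, otherwise burn one or two units stochastically.
    if f == 0:
        f_next = 0
    elif y_next == "Y_b":
        f_next = 10
    else:
        f_next = max(f - (1 if random.random() < p_f else 2), 0)

    # Sensor health (Eqn. 4.9): repaired at base, may fail away from base.
    if f == 0:
        h_next = h
    elif y == "Y_b":
        h_next = "H_nom"
    elif h == "H_nom":
        r = random.random()
        if r < p_sen:
            h_next = "H_sen"
        elif r < p_sen + (1 - p_sen) * p_act:
            h_next = "H_act"
        else:
            h_next = "H_nom"
    else:
        h_next = h           # assumption: a failed sensor stays failed away from base

    return y_next, f_next, h_next
```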
Reward Function: R
The reward function is designed to induce favorable outcomes by placing high rewards on a set of desirable states. In particular, a desirable state is characterized by (1) having agents fly in the tasking region, and (2) having at least one agent in the communication area to relay data between the base and tasking regions. This is made precise as

R(s_1, ..., s_{n_a}) = I(N_com ≥ 1) r_sur N_sur    (4.10)

where N_com and N_sur are the numbers of capable UAVs flying in the communication and tasking regions, respectively; r_sur is a scalar value (units of reward for each UAV in the surveillance area); and I(clause) is the indicator function that returns 1 if clause is true and 0 otherwise. Capable is defined as having the required sensor health status, as described in Section 4.1.1. In particular,

N_com = Σ_j I(y_j = Y_c) I(f_j > 0) I(h_j ≠ H_act)
N_sur = Σ_j I(y_j = Y_t) I(f_j > 0) I(h_j = H_nom)    (4.11)
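The joint reward of Eqns. 4.10-4.11 simply counts capable UAVs by region; it can be written down directly (illustrative sketch):

```python
def pst_reward(states, r_sur=1.0):
    """Joint PST reward, following Eqns. 4.10-4.11.

    states: list of (y, f, h) tuples, one per UAV.
    """
    n_com = sum(1 for (y, f, h) in states if y == "Y_c" and f > 0 and h != "H_act")
    n_sur = sum(1 for (y, f, h) in states if y == "Y_t" and f > 0 and h == "H_nom")
    return r_sur * n_sur if n_com >= 1 else 0.0
```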
Figure 4-2: Task Tree of Persistent Search and Track Mission.
4.1.3 Structural Decomposition
Eqns. 4.10 and 4.11 can be combined and rewritten in the form of Eqn. 3.2, in which two leaf reward functions are identified:

r_com(s_j) = I(y_j = Y_c) I(f_j > 0) I(h_j ≠ H_act)
r_sur(s_j) = I(y_j = Y_t) I(f_j > 0) I(h_j = H_nom)    (4.12)

With application of Alg. 2, a task tree for the PST mission is constructed, as shown in Fig. 4-2. There are two types of leaf tasks, namely communication and surveillance leaf tasks, that are built around Eqn. 4.12. The surveillance leaf tasks are coupled through a summation function node, which reflects the mission specification of having as many UAVs flying in the tasking region as possible. The communication leaf task is coupled to the set of surveillance leaf tasks through a multiplicative function node, which models the communication link constraint.
4.1.4 Empirical Analysis
Recall that the MMDP's state and action spaces scale exponentially with the number of agents and that value iteration (VI) has run time complexity O(|S|^2 |A|). Thus, applying VI to obtain the optimal solution is computationally prohibitive for n_a ≥ 4, as summarized in Table 4.1. HD-MMDP decomposes the PST mission into a number of singleton tasks, and therefore its computation time is much lower than that of value iteration on the full MMDP. In simulation, it takes about 2.3 seconds to compute the optimal policies for the singleton
leaf tasks in the off-line phase, and it takes less than 2.0 seconds to generate a 50-step trajectory for a ten-agent PST mission in the on-line phase (task allocation step).

Table 4.1: Computational Time of Value Iteration on the PST Mission. VI is impractical for n_a ≥ 4 since it would require about 3 weeks to compute.

number of agents (n_a)    |S|         |A|    Computational Time
1                         99          3      ~0.1 s
2                         9801        9      1 m 1 s
3                         970299      27     57 m 27 s
4                         96059601    81     ~21 days
Trade-off between Computational Tractability and Solution Optimality
A comparison between HD-MMDP and the optimal solution is shown in Fig. 4-3. The comparison is made on a three-agent scenario because this is the largest problem size that permits computing the optimal policy in a reasonable amount of time, according to Table 4.1. The graph shows that HD-MMDP's solution quality is within 8% of that of the optimal policy. Yet, HD-MMDP takes less than 1% of value iteration's computation time on an MMDP formed around the three-agent PST mission.
HD-MMDP yields a suboptimal solution partly due to the imposed decomposition structure. In particular, in solving leaf tasks individually, each agent only accounts for its own future uncertainties. In contrast, the optimal policy accounts for the future uncertainties of all agents simultaneously. In HD-MMDP, when an agent is assigned to the surveillance leaf task, it assumes that another agent will successfully complete the communication leaf task; it does not hedge against the uncertainty of how well the other agent will actually perform. However, explicitly accounting for the future uncertainties of all agents leads to exponential growth of the planning spaces. Hence, HD-MMDP reveals a trade-off between computational tractability and solution optimality: it uses the MDP formulation for the strong coupling encapsulated in leaf tasks and uses task allocation for weaker forms of coupling.
Figure 4-3: Three-Agent PST Mission Simulation Result. The red line shows the cumulative reward following a policy generated by the HD-MMDP algorithm. The blue line shows the cumulative reward following the MMDP's optimal policy. HD-MMDP achieved a sustainable cumulative reward (no UAV ran out of fuel) and produces reward within 8% of that of the optimal policy.
Decomposition Structure
HD-MMDP proposes a decomposition structure as revealed by the reward function. A natural question to ask is whether this is a good decomposition structure. The effects of adopting various decomposition structures on solution quality are studied empirically on a nine-agent PST mission. A nine-agent scenario is chosen because 9 is an integer multiple of 3, the largest PST problem size that permits computing the optimal policy in a reasonable amount of time.
An alternative decomposition structure is to form three 3-agent leaf tasks. The
first leaf task is a standard 3-agent PST mission; and the two other leaf tasks are
PST missions without the communication link requirement. When combined, as in
the MMDP formulation, only one agent is required to serve as a communication link.
In this decomposition structure, subgroups are formed as large as computationally
plausible. The task allocation step follows the same set of heuristics as in the HD-MMDP algorithm.
Figure 4-4: Effect of the Decomposition Structure on a Nine-Agent PST Mission. The red line shows the cumulative reward of using HD-MMDP's decomposition structure. The blue line shows the cumulative reward of using three groups of three. The same task allocation algorithm was used on both decomposition structures. This graph shows that forming larger subgroups does not always induce better solution quality.
Fig. 4-4 shows the empirical results of using the two decomposition structures, averaged over 30 independent runs. HD-MMDP's decomposition structure (red) induces better solution quality than the alternative structure (blue). The three-groups-of-three decomposition structure took more than 2 hours to solve (solving instances of the 3-agent PST mission) and still yields inferior solution quality. This suggests that, in general, it is not beneficial to form large leaf tasks even when computationally feasible.
This is because arbitrarily formed leaf tasks do not reflect structure in the joint problem. For instance, in the PST mission, earning a reward requires having one agent
in the communication area and another agent in the tasking area. In a three agent
scenario, if two agents are running low on fuel and are scheduled to return to base, a
capable (perfect sensor health and high fuel level) third agent should also return to
base because it cannot earn any reward by itself. However, in a team of nine agents,
when two agents are running low on fuel, the third capable agent should coordinate
with the other six agents to maximize reward. In this case, arbitrarily formed leaf
tasks bring about new structures (i.e. strong coupling between 3 agents) that do not
exist in the original problem.
Pruning Solution Space
Recall that solving an MMDP means computing a policy, which specifies joint actions for all agents at every possible state of the system. In contrast, through a decomposition procedure, the HD-MMDP algorithm converts the MMDP to an agent assignment problem. The objective is to assign each agent to one of the leaf tasks such that the cumulative reward at the root node is maximized. This agent assignment problem is a non-trivial combinatorial optimization problem. More importantly, each leaf task is stochastic, with only first-order statistics (expected reward) available. The combinatorial and stochastic nature of this agent assignment problem makes it difficult to optimize.
In converting the MMDP to an agent assignment problem, a natural question to ask is what benefits this transformation brings, especially if the latter is also difficult to solve. In fact, the solution space Π^dec of the agent assignment problem corresponds to a subset of the MMDP's policy space Π, that is, Π^dec ⊂ Π. It is observed that, on average, policies in Π^dec yield much better solution quality than those in Π; more precisely, E_{π∈Π^dec}[V^π(s)] > E_{π∈Π}[V^π(s)], where V^π(s) is the value function evaluated at state s under policy π.
For the ten-agent PST mission, random policies chosen from Π^dec (red) outperform random policies chosen from Π (green), as shown in Fig. 4-5. A random policy from Π^dec is constructed by assigning agents to leaf tasks randomly and then querying actions from each leaf task's local policy. A random policy from Π is constructed by choosing actions randomly for every agent. The green curve flattens after approximately 40 iterations since many UAVs had run out of fuel given a long sequence of random actions. In contrast, the red curve achieves sustained performance. Although the agent assignment step is random, the actions queried from each leaf task's local policy are not random.
Figure 4-5: Pruning the Solution Space. The green line shows the cumulative reward obtained by choosing actions randomly for every agent (following a random policy π ∈ Π). The red line shows the cumulative reward induced by assigning agents randomly to leaf tasks (following a random policy π ∈ Π^dec). Both curves are the average of 30 independent trials. The decomposition procedure can be seen as a pruning step because Π^dec ⊂ Π and, on average, policies in Π^dec have better solution quality than those in Π.
In the PST mission, regardless of whether a UAV with low fuel is assigned to the communication leaf task or the surveillance leaf task, the local policies of both leaf tasks will send the UAV back to base for refueling. Thus, no UAV ran out of fuel under a random policy from Π^dec. In short, the decomposition step turns the original MMDP into an agent assignment problem, and this conversion prunes the solution (policy) space of the original problem.
4.1.5 Simulation Result
HD-MMDP is applied to solve the ten-agent PST mission, whose planning domain has more than 10^19 states and 10^4 actions. It is estimated that value iteration would take centuries to compute the optimal policy for such enormous planning spaces. This thesis compares HD-MMDP against another multi-agent planning algorithm, known as the Group Aggregate Decentralized Multi-agent Markov Decision Process (GA-dec-MMDP) [45]. To the author's knowledge, among algorithms that are computationally
tractable in planning spaces of this size, GA-dec-MMDP is the best solution approach
to PST missions.
GA-dec-MMDP is an approximate dynamic programming algorithm (see Section 1.2.1) specialized for multi-agent planning problems. Its main idea is that each agent can model the rest of the team in a reduced form, which only needs to capture important aggregate properties. For instance, in the PST mission, each agent can make a good decision if it can accurately predict the joint configuration of its teammates (the number of agents surveying the tasking area and the number of agents acting as a communication link); however, an agent does not need to predict which particular agent performs a given sub-task or how it does so. A domain expert can design a reduced model by specifying a set of feature vectors that capture such aggregate properties. The use of a reduced model makes GA-dec-MMDP computationally tractable, as its run time scales linearly with the number of agents.
However, GA-dec-MMDP has two major drawbacks. First, it requires a domain expert to specify a set of feature vectors, which usually relies on extensive domain knowledge and is difficult to generalize. Second, it requires developing a state transition function for the reduced model (i.e., how many agents will survey the tasking area). This is difficult because an accurate state transition function depends on the joint policy; however, deriving a joint policy requires specifying an accurate state transition function. To overcome this issue, Redding developed an approximate method using matrix extrapolation techniques.
In contrast, the HD-MMDP algorithm automatically identifies a decomposition structure from the reward function and does not require specific domain knowledge. Fig. 4-6 shows that HD-MMDP outperforms the GA-dec-MMDP algorithm by more than 35% in cumulative reward. Both algorithms are implementable in real time for the ten-agent PST mission.
Figure 4-6: Simulation of Ten Agent PST Mission. Comparison in cumulative reward
shows that HD-MMDP outperforms GA-dec-MMDP by more than 35%. Both curves
are the average of 30 independent trials.
4.2 Intruder Monitoring Mission
The intruder monitoring mission (IMM) is developed as an extension to the PST
mission. IMM requires a group of UAVs to continuously survey a region of interest
to detect elusive ground intruders. When a target is discovered, multiple UAVs will
act in concert to escort the targets out of the region of interest. In comparison with
the PST mission, IMM requires more complex interaction between agents because
(1) it has multi-agent leaf tasks and (2) it encodes temporal constraints between leaf
tasks. HD-MMDP is applied to solve IMM and various extensions to HD-MMDP are
proposed.
4.2.1 Mission Description
The mission scenario is illustrated in Fig. 4-7. The tasking region is discretized into an n × m grid world, which a group of UAVs is required to continuously monitor. Ground vehicles from the opposing forces may emerge stochastically and are initially hidden from the group of UAVs (fog of war).
Figure 4-7: Intruder Monitoring Mission. A blue ground vehicle from the opposing forces tries to reach its goal position, marked by the orange star. It has a known feedback policy that generates the trajectory shown in blue. The mission is to coordinate the two red UAVs such that they escort the ground vehicle out of the tasking area by forcing it to follow the green trajectory.
Further, the ground vehicles follow a known stochastic policy that guides them toward a goal position. The intruder monitoring mission requires the UAVs to (1) detect the initially hidden ground vehicles, (2) prevent the ground vehicles from reaching their goal positions, and (3) escort the ground vehicles out of the tasking area.
The three mission objectives can be formulated as MDPs, called the surveillance, blocking, and cornering leaf tasks, as illustrated in Fig. 4-8. In the surveillance leaf task, each grid cell is associated with an uncertainty value that relates to the probability of detecting a ground target in that cell. The uncertainty value increases with time and can only be reduced by flying a UAV to its vicinity. The goal is to reduce the uncertainty in the tasking region to maximize the probability of detecting ground targets. In Fig. 4-8(a), a lighter shade of gray corresponds to a lower uncertainty value; note how the UAV reduces uncertainty in its close proximity.
The ground vehicles attempt to reach a goal position, but cannot move into grid
positions occupied by UAVs. In the blocking leaf task, shown in Fig. 4-8(b), a single
UAV tries to prevent the ground vehicle from reaching the goal position. In the
cornering leaf task, shown in Fig. 4-8(c), two UAVs try to escort the ground vehicle
out of the tasking area (two green corners) in a coordinated effort.

Figure 4-8: Leaf Tasks in the Intruder Monitoring Domain. Sub-figures (a), (b), and (c) correspond to the surveillance, blocking, and cornering leaf tasks, respectively.

In both leaf tasks, a
UAV needs to predict the future positions of the ground vehicle and thereby position itself strategically to influence the ground vehicle's trajectory. A major challenge is to account for the stochasticity in the ground vehicle's choice of action.
More importantly, leaf tasks in IMM are coupled through operational constraints. In particular, the blocking and cornering leaf tasks can only be executed after detection of a ground vehicle in the surveillance leaf task, which is a temporal constraint. Additionally, the blocking and cornering leaf tasks (for one ground vehicle) are mutually exclusive, such that only one of them can be executed at a time. In contrast, in the PST domain, leaf tasks are coupled through the reward function but can be attempted independently. Another major challenge is to account for these operational constraints while solving the IMM.
4.2.2 Problem Formulation
Although there are two types of robotic vehicles, only the UAVs are considered agents from the planner's perspective. In contrast, the ground vehicles are considered part of the environment. In particular, the planner should determine a set of actions for the group of UAVs in response to stochasticity in the motion pattern of the ground vehicles. To distinguish between the two types of vehicles, N_i is used to denote the
ith UAV, and M_j is used to denote the jth ground vehicle.
State Space
The state, s_i, of a UAV, N_i, is specified by a vector of two dimensions, corresponding to the x and y coordinates of the UAV. In particular,

s_i = [x_i, y_i]^T    (4.13)

The state, q_j, of a ground vehicle, M_j, is specified by a vector of three dimensions. The first two dimensions correspond to the x and y coordinates of the ground vehicle, and the third dimension is a binary variable specifying whether the ground vehicle has been detected. In particular,

q_j = [x_j, y_j, d_j]^T    (4.14)

Further, the environment is associated with a matrix that specifies an uncertainty value for each grid position, such that W = [w_mn]. The configuration of the entire system is specified by the joint state of the UAVs, the ground targets, and the uncertainty in the environment,

s = (s_1 × ... × s_{n_a}) × (q_1 × ... × q_{n_g}) × W    (4.15)
Action Space
The action space, A_i, defines the set of possible actions for a UAV. Each agent can choose to go to a nearby location within 2 units of Manhattan distance of its current location, as shown in Fig. 4-9(a). Also, a UAV is allowed to move neither outside the grid world nor into a grid position intended by a ground vehicle. The set of possible actions of the ground vehicles is shown in Fig. 4-9, which illustrates that (i) UAVs are more agile than ground vehicles and (ii) ground vehicles cannot remain stationary.²
Figure 4-9: Action Space of the Intruder Monitoring Mission. Sub-figures (a) and (b) show the sets of permissible actions for a UAV and a ground vehicle, respectively. Each shaded square shows a possible next state given the current position of the vehicle, and each next state is associated with a unique action (e.g., two squares to the left). UAVs are assumed to be more agile (they can travel to a larger area) than ground vehicles, and ground vehicles cannot remain stationary.
The ground vehicle's choice of action is described in the state transition model.
State Transition Model
The state transition model consists of three major components, which describe the state evolution of the UAVs, the ground vehicles, and the uncertainty values w_mn. The state evolution of each UAV is deterministic, such that a UAV always reaches its intended position. The state evolution of the uncertainty value w_mn is also deterministic. In particular, w_mn is decremented by w_dec (e.g., 5 units) if a UAV hovers at that grid position, and w_mn is incremented by w_inc (e.g., 1 unit) otherwise. The uncertainty value is upper bounded by w_max, such that w_mn ∈ [0, w_max].
The state evolution of each ground vehicle is characterized by a known stochastic policy. Among the four permissible actions illustrated in Fig. 4-9(b), the stochastic policy selects the two actions that lead to states closest (in minimum Euclidean distance) to the goal location. Further, a probability distribution is specified over these two actions: the most favorable action is chosen 75% of the time and the second-best action is chosen 25% of the time. This policy is illustrated in Fig. 4-10, with the longer arrow corresponding to the more favorable action.
²Technically, the ground vehicles are modeled as part of the environment, and their choice of actions is not part of the action space.
Figure 4-10: State Transition Model of the Intruder Monitoring Mission. The two pale blue arrows at each grid position illustrate the stochastic policy of the blue ground vehicle. The action represented by the longer arrow is chosen 75% of the time and that represented by the shorter arrow is chosen 25% of the time. The presence of a red UAV induces a change in the ground vehicle's choice of action, as shown inside the red box in sub-figure (b).
The stochastic policy also takes into account the UAV's position, as illustrated in Fig. 4-10(b). In particular, the red UAV in Fig. 4-10(b) prevents the blue ground vehicle from traveling upwards. Thus, the best and second-best actions for the ground vehicle are traveling to the right and traveling to the left, respectively. Similarly, the ground vehicle's choice of actions is changed in the grid positions inside the red T-shaped box.
When a ground vehicle first appears in the grid world, it is undetected, such that d_j = 0. The ground vehicle remains undetected with probability proportional to the uncertainty value at its position, such that P(d_j(k+1) = 0) = w_mn / w_max. Once a ground vehicle is detected, it remains detected.
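A sketch of the ground vehicle's stochastic policy and detection model described above is given below. It assumes a four-connected move set and Euclidean distance to the goal; grid-boundary checks are omitted, and the vehicle is assumed to stay in place if it is completely surrounded. These simplifications are for illustration only.

```python
import math
import random

MOVES = [(1, 0), (-1, 0), (0, 1), (0, -1)]   # ground vehicles cannot remain stationary

def intruder_step(pos, goal, uav_positions):
    """Pick the ground vehicle's next cell: among admissible moves, take the one closest
    to the goal with probability 0.75 and the second closest otherwise."""
    candidates = [(pos[0] + dx, pos[1] + dy) for dx, dy in MOVES]
    candidates = [c for c in candidates if c not in uav_positions]
    candidates.sort(key=lambda c: math.hypot(c[0] - goal[0], c[1] - goal[1]))
    if len(candidates) >= 2:
        return candidates[0] if random.random() < 0.75 else candidates[1]
    return candidates[0] if candidates else pos

def stays_undetected(w_cell, w_max):
    """Detection model: the vehicle remains undetected with probability w_mn / w_max."""
    return random.random() < w_cell / w_max
```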
Reward Function
The reward function is designed to specify a set of desirable states. Recall that the UAVs are required to (1) detect a ground vehicle, (2) prevent the ground vehicle from reaching its goal position, and (3) escort the ground vehicle out of the tasking area. This is made precise as follows,

R(s) = Σ_{j=1}^{n_g} I(d_j = 1) [ c(q_j, s_{i1}, s_{i2}) ⊕ g(q_j, s_{i1}) ]    (4.16)

where s is as defined in Eqn. 4.15; I(d_j = 1) is an indicator function that specifies that the ground vehicle must be detected; c(q_j, s_{i1}, s_{i2}) is a function that specifies the cornering condition; g(q_j, s_{i1}) is a function that specifies the blocking condition; and ⊕ expresses mutual exclusion. The cornering condition rewards states in which the ground vehicle is forced into one of the green squares illustrated in Fig. 4-8(c). The blocking condition rewards each state inversely with the distance between the ground vehicle and the goal location.
4.2.3 Extension to HD-MMDP
The formulation of the intruder monitoring mission contains elements that do not conform to the definition of an MMDP as described in Section 2.2. In particular, the joint state space is not the Cartesian product of that of each agent (UAV), because the system also needs to account for the ground vehicles. In addition, there exist operational constraints between leaf tasks; for instance, the blocking leaf task and the cornering leaf task cannot be executed simultaneously.
To this end, several extensions to the HD-MMDP algorithm are proposed. For instance, the definition of a leaf task is relaxed to include states characterizing the environment. However, the key idea behind HD-MMDP is unchanged: the algorithm first decomposes the planning problem into a number of subproblems by exploiting structure in the reward function, then solves each subproblem to compute a local policy, and finally assigns each agent to one of the subproblems and queries an action for each agent from the corresponding local policy.
Decomposition Structure and Ranking Heuristic    Three leaf reward functions can be identified from Eqn. 4.16, around which the surveillance, cornering, and blocking leaf tasks are formed.
Figure 4-11: Task Tree of the Intruder Monitoring Mission. Each ground vehicle is associated with two mutually exclusive leaf tasks, namely the blocking and cornering leaf tasks. A surveillance leaf task is associated with all ground vehicles. The red curve encodes the temporal constraint that a ground vehicle must first be detected. The green addition node encodes the mutually exclusive relation.
In particular,

r_surv = Σ_{j=1}^{n_g} I(d_j = 1)
r_cornering = c(q_j, s_{i1}, s_{i2})
r_blocking = g(q_j, s_{i1})    (4.17)
Note that r_surv cannot be decomposed further because the detection of each ground vehicle is coupled through the uncertainty values (w_mn) in the environment (see Section 4.2.2). Maximizing the number of ground vehicles detected is equivalent to minimizing the uncertainty in the environment, that is, minimizing Σ_{m,n} w_mn.
Novel structures are introduced into the task tree (Fig. 4-11) to encode operational constraints. For instance, the red curves encode the temporal constraint that a ground vehicle must first be detected before execution of the blocking or cornering leaf tasks. In addition, the green addition node encodes that the blocking and cornering leaf tasks cannot be executed simultaneously. The heuristics for ranking the leaf tasks need to be modified to satisfy the operational constraints. In this example, the surveillance leaf task is ranked the highest because ground vehicles need to be detected first. The cornering leaf task is given precedence over the blocking leaf task because it is desirable to force the ground vehicles out of the tasking area.
State Abstraction    State abstraction is a concept more frequently applied to hierarchical decomposition of single-agent MDPs. It is based on the idea that some dimensions are irrelevant to the execution of a subtask and can be abstracted away, thereby drastically reducing the size of the subproblem's planning spaces. This idea can also be applied to the HD-MMDP algorithm. For instance, in IMM, the detection variable d_j of the ground vehicle's state q_j is irrelevant to the cornering and blocking leaf tasks (the vehicle is always assumed to be detected). Similarly, the uncertainty values w_mn in the environment can be abstracted away in the formulation of the cornering and blocking leaf tasks.
Solving Leaf Tasks Approximately    In the HD-MMDP algorithm, value iteration is used to solve each leaf task for simplicity. Yet, each leaf task is a standalone MDP and thus can be solved using any generic MDP solution approach. By taking advantage of a leaf task's local structure, significant computational savings can be achieved. For instance, assuming w_mn ∈ [0, 5], the surveillance leaf task in IMM has 6^{n×m} uncertainty states, since each grid location is associated with a distinct uncertainty value. Even for a 5 × 5 grid world, value iteration would take months to iterate over such a state space. However, since the state transition model of the surveillance leaf task is deterministic, as described in Section 4.2.2, a general-purpose optimization algorithm can be applied. In particular, a sequential greedy algorithm is implemented such that each UAV sequentially chooses the action that would lead to the most uncertainty reduction. In short, approximate MDP solution approaches can be applied to leaf tasks for additional computational efficiency.
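The sequential greedy rule for the surveillance leaf task can be sketched as follows: each UAV, in turn, picks the reachable cell whose visit yields the largest uncertainty reduction, given the cells already claimed by previous UAVs. This is an illustrative simplification (it assumes every UAV has at least one unclaimed reachable cell), not the thesis implementation.

```python
def sequential_greedy_surveillance(uav_positions, W, reachable, w_dec=5):
    """Pick one target cell per UAV, in sequence, to maximize uncertainty reduction.

    uav_positions: list of current UAV grid positions
    W:             dict mapping grid cell -> current uncertainty value w_mn
    reachable:     function(pos) -> list of cells the UAV can move to this step
    """
    claimed = set()
    targets = []
    for pos in uav_positions:
        options = [c for c in reachable(pos) if c not in claimed]
        # The reduction achievable at a cell is capped by its remaining uncertainty.
        best = max(options, key=lambda c: min(W.get(c, 0), w_dec))
        claimed.add(best)
        targets.append(best)
    return targets
```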
4.2.4 Simulation Results
The HD-MMDP algorithm (with the modifications outlined in Section 4.2.3) is applied to solve an intruder monitoring mission with three UAVs and two ground vehicles residing in an 8 × 6 grid world. Following a policy returned by the HD-MMDP algorithm, a simulation was run for 50 steps. In all 30 independent trials from the same initial configuration, HD-MMDP's policy achieved sustainable performance (steadily
increasing cumulative reward). A simulation trajectory is chosen and illustrated in
Fig. 4-12 to highlight key aspects of HD-MMDP's performance.
The UAVs are initialized in the center region and two ground vehicles emerge from the bottom corners, as illustrated in Fig. 4-12(a). Since the ground vehicles are initially undetected (labeled with a gray disk), the UAVs perform the surveillance leaf task, as indicated by the green circles. Up to time step 3, as shown in Fig. 4-12(b), all ground vehicles remain undetected and the UAVs distribute themselves more evenly in the grid world. In time step 4, as shown in Fig. 4-12(c), the purple ground vehicle is detected and two UAVs switch from the surveillance to the cornering leaf task, targeting the newly detected purple ground vehicle.
Fast-forwarding to steps 9 and 10, as shown in Fig. 4-12(d, e), two UAVs have forced the purple ground target into one of the green corners. The other UAV tries to prevent the orange ground vehicle from reaching the goal location. In step 9, the orange ground vehicle chose to go left while the UAV was expecting it to go right, reflecting the stochasticity in the ground vehicle's policy, which randomly chooses between the best and second-best actions. In steps 11 and 12, as shown in Fig. 4-12(f, g), the cornered purple ground vehicle exits and a new blue ground vehicle enters. Of the two UAVs near the bottom right corner, one switched to a surveillance task while the other went upward to help escort the orange ground vehicle out of the grid world.
4.3 Summary
This section formulated a multi-agent persistent search and track (PST) mission and an intruder monitoring mission (IMM). HD-MMDP was applied to solve the PST domain and showed more than 35% improvement over GA-dec-MMDP. The IMM was shown to have more complicated operational constraints than the MMDP formulation would allow; thus, various extensions to HD-MMDP were proposed and used to solve the IMM. The next chapter describes a hardware implementation of the IMM.
Figure 4-12: Simulation of the Intruder Monitoring Mission. Red squares represent UAVs; purple, orange, and blue disks represent ground vehicles. The orange star represents the ground vehicles' intended goal location. The green squares are the locations where the UAVs would like to trap the ground vehicles. Each vehicle's choice of action is shown by the arrow emanating from it. If a ground vehicle is undetected (d_j = 0), it is labeled with an additional gray disk (e.g., sub-figure (a)). If a UAV is assigned to block or corner a particular ground vehicle, it is labeled with a disk of that ground vehicle's color (e.g., sub-figure (c)). For clarity, the uncertainty value w_mn in each grid cell is not shown. Sub-figure (a) shows the initial configuration of the system. Sub-figures (b-g) show the configuration of the system at steps 3, 4, 9, 10, 11, and 12, respectively.
Chapter 5
Experimental Setup
This chapter presents the hardware and software infrastructures that enable complex
multi-agent missions with real robotic vehicles. It begins with a description of a
testbed environment (Fig. 5-1) with real-time motion tracking capability. Further,
it explains the software organization, which coordinates a variety of functionalities,
from high-level mission planning to low-level vehicle motor control. Then, it develops a novel optimization-based solution approach to the path planning problem. Lastly, it presents a hardware implementation of the intruder monitoring mission described in Sec. 4.2.
5.1 RAVEN Testbed
The Real-time indoor Autonomous Vehicle test ENvironment (RAVEN) [57, 56] is a facility that enables rapid prototyping and testing of autonomous robotic technologies. It is an indoor space of approximately 12 m × 8 m × 2 m, as shown in Fig. 5-2(a). The RAVEN testbed is equipped with a Vicon motion capture system [52], which consists of 18 Vicon cameras, shown in Fig. 5-2(b). Each Vicon camera emits infrared radiation and detects its reflection from lightweight reflective markers mounted on the autonomous vehicles. By combining infrared images from multiple cameras, the Vicon system provides high-fidelity state measurements of multiple targets at 100 Hz with sub-millimeter accuracy.
Figure 5-1: Robotic Vehicles and Experiment Environment. A ground robot and two quadrotors with package delivery capability are shown in sub-figure (a). The hardware experiment environment, equipped with a Vicon motion capture system (inside the yellow boxes), is shown in sub-figure (b).
Given the construction of the building and the placement of computer systems, the Vicon-enabled volume is non-convex, as shown in Fig. 5-2(a). In particular, the bottom left corner does not have Vicon coverage, and there is a pole in the middle of the space. The space's non-convexity poses technical challenges for autonomous path planning, which are explored in the next section.
A unique feature of the RAVEN testbed is an image projection system, shown in Fig. 5-3. It consists of multiple downward-facing Sony VPL-FH31 projectors mounted to the ceiling. The projectors are calibrated to cast a contiguous image onto the floor; the projection area is shown as a rectangle with a green hatched pattern in Fig. 5-2(a). Acting as a window into the planner's belief state, the projected image can reveal important information about a mission. For instance, in Fig. 5-3, it shows the uncertainty state of the world as different levels of transparency, the position of the quadrotor as a purple cross, the intentions of the ground vehicles as arrows of various colors, and so on.¹ In path planning missions, it shows the intended trajectory of a vehicle. In short, the projection system helps observers better understand the movement of autonomous vehicles in complex missions, which in turn serves as a useful tool for debugging, evaluation, and demonstration purposes.
¹This is a snapshot of the intruder monitoring mission, which is explained in Section 5.5.
Figure 5-2: Schematic of the RAVEN Testbed. Sub-figure (a) illustrates the Real-time Autonomous Vehicle test ENvironment (RAVEN). The space is equipped with a Vicon motion capture system and an image projection system, shown in blue and green, respectively. The RAVEN testbed's coordinate system is labeled in black. Red squares illustrate the Vicon camera placement. A picture of a Vicon camera is shown in sub-figure (b).
5.2 Autonomous Vehicles
Various types of autonomous vehicles have been developed for the RAVEN testbed, as shown in Fig. 5-4. This section provides a short description of the robotic vehicles involved in the persistent search and track (PST) and intruder monitoring (IMM) missions. A quadrotor with recharge capability is shown hovering under a fan in Fig. 5-4(a). The quadrotor carries a 3D-printed battery cage, which houses a three-cell 1300 mAh lithium ion battery. The quadrotor is controlled via successive loop closures, with the attitude control loop (inner loop) running on-board and the acceleration control loop (outer loop) running on a stationary ground computer [19].
Figure 5-3: Image Projection Capability. The projection system reveals important information regarding the planner's belief state, such as the intent of a robotic agent and the state of the environment. The projection system is a vital tool for helping observers understand a complex multi-agent mission. Shown in this figure is a snapshot of the intruder monitoring mission (IMM).
A trajectory generator produces a series of waypoint (position, velocity, and acceleration) commands, which are piped into the acceleration control loop. The quadrotor can be modified to carry up to 500 grams of cargo. The quadrotor communicates with a ground computer via an XBee connection.
The quadrotor can land autonomously on a custom-designed battery recharge
station, as shown in Fig.
5-4(b). The recharge station [54, 53] has two rotating
drums that each hold four battery cages.
As a quadrotor lands on the recharge
station, the depleted battery is moved from the quadrotor into an empty battery
bay, and a fully charged battery is moved into the quadrotor. While performing a
battery swap, the quadrotor is secured with a clamping mechanism that also acts
as a temporary power source. Thereby, the quadrotor remains powered on during a
battery swap, and thus retains its memory (i.e. integrator values). Batteries take less
than 50 minutes to charge and are charged independently. A quadrotor can fly for
approximately 7 minutes with a fully charged battery; hence, a recharge station with eight batteries can support the continuous operation of one quadrotor. The recharge station communicates with a ground computer via a Bluetooth connection.
Autonomous ground robots are based on commercially available iRobot Create vehicles, as shown in Fig. 5-4(c). An XBee module is added to the commercial product to enable wireless communication with a ground computer. The ground vehicles accept linear and angular velocities as control inputs. They also require a trajectory generator for computing paths in the non-convex lab space.

5.3 Software Architecture
The software infrastructure developed in RAVEN utilizes the Robot Operating System (ROS) [43], which is a set of libraries for robotic applications. ROS is a flexible framework that encourages modular software design. In particular, ROS abstracts away the need to explicitly fork new processes, spawn new threads, and synchronize between threads. A ROS program is organized like a graph, with each node being a standalone executable and each edge being a data connection between nodes. ROS facilitates communication between different executables through simple data transfer protocols (topics and services). ROS is particularly suitable for multi-agent applications because each node can be developed and tested independently.
The software architecture that supports the PST and IMM experiments is shown in Fig. 5-5. Each rectangular box represents a ROS node, and the arrows represent data connections. The Vicon Data node, shown in the top left corner, is a ROS wrapper of the Vicon software, which publishes filtered position measurements of all robotic vehicles at 100 Hz. The Mission Planner node, shown in the top right corner, is an implementation of high-level planning algorithms, such as HD-MMDP. Given the operational state (e.g., battery level, grid position), the Mission Planner node outputs mission-level commands, such as fly to the surveillance area or fly to base for battery recharge. A Visualization node is implemented using ROS's rviz library; it simulates a virtual environment and illustrates the position and operational status of each vehicle.
Figure 5-4: Autonomous Robotic Agents. Sub-figure(a) shows a quadrotor with
removable battery cage hovering under a fan. Sub-figure(b) shows a battery recharge
station executing an automated battery swap. Sub-figure(c) shows a pair of iRobot
Create vehicles.
vehicle.
The Mission Manager node is a central hub that aggregates all information about the
system. It also keeps a list of all types of vehicles (quadrotors, ground vehicles,
and recharge stations) and maintains a task queue for each individual vehicle. When
the Mission Planner node issues multiple waypoint commands, the Mission Manager node
Figure 5-5: Software Architecture. The software layer's modular design supports a wide
variety of hardware experiments. The major components and the data connections between them are shown in this figure.
appends these waypoint commands to the task queues of the corresponding vehicles.
The Mission Manager node determines the condition for successfully completing a task
(i.e., reaching an ε-neighborhood of the desired waypoint) and decides when to send out the
next task in the queue.
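A minimal sketch of this per-vehicle bookkeeping is shown below; it is a hypothetical simplification of the Mission Manager logic, and the class name and tolerance value are illustrative assumptions.

    # Sketch of per-vehicle task-queue bookkeeping (hypothetical simplification
    # of the Mission Manager logic described above).
    from collections import deque
    import numpy as np

    class VehicleTaskQueue:
        def __init__(self, eps=0.15):
            self.tasks = deque()   # pending waypoint commands for one vehicle
            self.eps = eps         # completion tolerance (epsilon-neighborhood)

        def append(self, waypoint):
            self.tasks.append(np.asarray(waypoint, dtype=float))

        def current(self):
            return self.tasks[0] if self.tasks else None

        def update(self, position):
            """Pop the current task once the vehicle is within eps of it and
            return the next task to send, or None if nothing changes."""
            if self.tasks and np.linalg.norm(np.asarray(position) - self.tasks[0]) < self.eps:
                self.tasks.popleft()
                return self.current()
            return None

    q = VehicleTaskQueue()
    q.append([1.0, 2.0, 1.5]); q.append([0.0, 0.0, 1.0])
    print(q.update([1.05, 2.0, 1.5]))   # within eps of the first waypoint: returns the next one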
For stationary vehicles (i.e., recharge stations), operational commands are sent
directly to the vehicle controllers. For mobile vehicles, a Path Planner node is required
for generating feasible and collision-free trajectories. The trajectories are then sent to the
vehicle controllers as sequences of waypoints. The algorithm design of the Path
Planner node is explained in greater detail in Sec. 5.4. Further, the software
infrastructure accommodates both real and virtual vehicles, a capability that is
crucial for reducing hardware complexity during software development. The virtual
vehicles publish their simulated states (i.e., position), which are made available
to the Mission Manager node.
5.4
Path Planner
The path planning problem is in itself a well-investigated research topic. Researchers
have proposed optimization based algorithms [48, 20] and sampling based algorithms
[33, 31, 34]. This thesis does not attempt to provide a comprehensive treatment of
the path planning problem; rather, it seeks an algorithm that can
generate feasible trajectories for multiple (1-6) vehicles in real time.
The multi-agent path planning problem is to find a set of control inputs such that
each vehicle reaches its desired final position while satisfying collision constraints. In
particular, vehicles must steer clear of static obstacles and avoid running into each
other. In this work, since robust vehicle level controllers exist for tracking
trajectories, the control inputs are assumed to be a sequence of waypoint commands,
[pos, vel, accel].
5.4.1
Potential Field
The potential field algorithm models vehicles, static obstacles and room boundary
as particles with the same charge; and it models each vehicle's desired position as a
particle with the opposite charge. The algorithm computes the potential field of the
system and chooses an input that minimizes potential. Hence, each vehicle is attracted
to its desired position and is repelled from static obstacles and other vehicles. For
computational reasons, the potential function is often chosen such that it decays to zero
beyond a certain distance l from the vehicle. Hence, the potential field algorithm can
be seen as a decentralized, reactive solution approach. In particular, vehicles choose
Figure 5-6: A Live Lock Configuration for Potential Field Algorithm. Each colored
disk represents a quadrotor vehicle, and the associated arrow shows the quadrotor's
intended goal position.
control inputs independently unless they are in close proximity (less than l apart).
The potential field algorithm is easy to implement and computationally tractable.
However, since this algorithm chooses an input that minimizes the potential of the current
configuration, it does not compute the whole trajectory at once. Thus, the algorithm
is often trapped in local minima, which leads to live lock configurations. For instance,
consider the four agent live lock configuration illustrated in Fig. 5-6, where the arrows
show each vehicle's intended position. Further, given its reactive nature, the potential
field algorithm tends to induce sharp turns and jerky maneuvers. In practice, this
algorithm works decently for fewer than three vehicles, but live lock configurations are
common for a team of more than three vehicles.
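For concreteness, a minimal sketch of one potential-field update step is given below; the gains, the repulsion cutoff l, and the particular repulsive potential are illustrative assumptions rather than the controller used on the testbed.

    # Illustrative potential-field step for one vehicle (a sketch, not the
    # controller used on the testbed). The vehicle descends the gradient of an
    # attractive goal potential plus repulsive potentials that vanish beyond
    # a cutoff distance l.
    import numpy as np

    def potential_field_step(p, goal, obstacles, l=1.0, k_att=1.0, k_rep=1.0, dt=0.05):
        p = np.asarray(p, dtype=float)
        grad = k_att * (p - np.asarray(goal))          # attractive term: pulls toward the goal
        for q in obstacles:                            # other vehicles and static obstacles
            diff = p - np.asarray(q)
            d = np.linalg.norm(diff)
            if 1e-6 < d < l:                           # repulsion acts only within range l
                grad += -k_rep * (1.0 / d - 1.0 / l) / d**2 * diff / d
        return p - dt * grad                           # step against the gradient

    # Example: one step toward the goal, nudged away from an obstacle at (0.9, 0.4)
    print(potential_field_step([0.0, 0.0], goal=[2.0, 0.0], obstacles=[[0.9, 0.4]]))

Because each step depends only on the current configuration, this sketch also illustrates why the method is reactive and can stall in a local minimum.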
5.4.2
Sequential Convex Program
Problem Formulation
This section describes an optimization based path planning algorithm, largely based
on a paper by Augugliaro et al. [3]. In particular, Augugliaro et al. formulated the
multi-agent path planning problem as a sequential convex program whose decision
variables a_i[k] are the accelerations of agent i at discretized time step k.
Cost function: For a system with N agents and K discretization steps, the objective
is to minimize

    \sum_{i=1}^{N} \sum_{k=1}^{K} \left\| a_i[k] + g \right\|^2        (5.1)

where g is the gravity vector. Minimizing this cost function minimizes the total
thrust, which in practice tends to induce smooth trajectories with small curvature.
Constraints: Given the acceleration decision variables a_i[k] and discretization step size
h, position, velocity, and jerk can be derived via the kinematic relations

    p_i[k+1] = p_i[k] + h \, v_i[k] + \frac{h^2}{2} a_i[k]
    v_i[k+1] = v_i[k] + h \, a_i[k]        (5.2)
    j_i[k] = \frac{a_i[k] - a_i[k-1]}{h}
where the expressions for p_i[k+1] and v_i[k+1] can be rewritten in terms of the initial conditions and accelerations. A feasible trajectory must reach the desired final waypoint,
such that at a fixed final time T,

    p_i[T/h] = p_i(T), \quad v_i[T/h] = v_i(T), \quad a_i[T/h] = a_i(T)        (5.3)
In addition, the formulation assumes the vehicles reside in a convex planning space.
Given physical (i.e., actuator) limits, each vehicle must also satisfy velocity, acceleration,
and jerk constraints, such that in each dimension,

    p_{min} \le p_i[k] \le p_{max}, \quad v_{min} \le v_i[k] \le v_{max},
    a_{min} \le a_i[k] \le a_{max}, \quad j_{min} \le j_i[k] \le j_{max}, \quad \forall i, k        (5.4)
Eqns. 5.3 and 5.4 are linear convex constraints. The inter-vehicle collision avoidance constraint, however, is non-convex. In particular, it requires each pair of vehicles
to be at least a distance R apart at all time steps,

    \left\| p_i[k] - p_j[k] \right\|_2 \ge R \quad \forall i \ne j, \ \forall k        (5.5)
The constraint is approximated using a Taylor expansion, thereby forming a convex
relaxation. Linearizing about the previous iterates \bar{p}_i[k] and \bar{p}_j[k] gives the unit vector

    \eta_{ij}[k] = \frac{\bar{p}_i[k] - \bar{p}_j[k]}{\left\| \bar{p}_i[k] - \bar{p}_j[k] \right\|}        (5.6)

and the linearized collision avoidance constraint

    \eta_{ij}[k]^T \left( p_i[k] - p_j[k] \right) \ge R        (5.7)

where \bar{p}_i[k] and \bar{p}_j[k] are the positions of vehicles i and j from the previous iteration.
This leads to the formulation of a sequential convex program, since the algorithm iteratively refines
the solution and updates the constraint approximation.
In summary, the cost function (Eqn. 5.1) is quadratic; the boundary conditions
(Eqn. 5.3) can be expressed as linear equality constraints by substituting kinematic
relations (Eqn. 5.2); the physical capability constraints (Eqn. 5.4) and the convex
approximation of the collision avoidance constraints (Eqn. 5.7) can be expressed as linear
inequality constraints. This leads to a quadratic program at each iteration.
    \min_{x} \; x^T P x + q^T x + r        (5.8)
    \text{s.t.} \;\; A_{eq} x = b_{eq}
    \qquad\;\; A_{in} x \le b_{in}
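To make the per-iteration structure concrete, the following sketch sets up one such convex subproblem for a single vehicle using the cvxpy modeling package (not the solver interface used in this work); the rest-to-rest boundary conditions and the box acceleration limit, used in place of the full set of limits in Eqn. 5.4, are simplifying assumptions.

    # Sketch of one convex subproblem (Eqn. 5.8) for a single vehicle, written
    # with cvxpy for illustration. Boundary conditions are rest-to-rest and only
    # a box acceleration limit is imposed; these are simplifications of
    # Eqns. 5.3-5.4, and the routine is not the thesis implementation.
    import numpy as np
    import cvxpy as cp

    def scp_subproblem(p0, v0, p_goal, p_prev, p_others, h, K, R, a_max, g):
        """p_prev: (K+1, 3) linearization trajectory from the previous iteration.
        p_others: list of (K+1, 3) trajectories of already-planned vehicles."""
        a = cp.Variable((K, 3))                       # acceleration decision variables
        # Kinematic rollout (Eqn. 5.2): positions/velocities are affine in the accelerations
        p, v = [p0], [v0]
        for k in range(K):
            p.append(p[k] + h * v[k] + 0.5 * h ** 2 * a[k])
            v.append(v[k] + h * a[k])
        cost = cp.sum_squares(a + np.tile(g, (K, 1)))          # thrust-like cost (Eqn. 5.1)
        cons = [p[K] == p_goal, v[K] == np.zeros(3),           # boundary conditions (Eqn. 5.3)
                cp.abs(a) <= a_max]                            # box acceleration limit
        for q in p_others:                                     # linearized collision avoidance (Eqn. 5.7)
            for k in range(1, K + 1):
                d = p_prev[k] - q[k]
                eta = d / np.linalg.norm(d)
                cons.append(eta @ (p[k] - q[k]) >= R)
        prob = cp.Problem(cp.Minimize(cost), cons)
        prob.solve()
        return prob.status, a.value

In a decentralized sequential scheme, each vehicle solves such a subproblem in turn, treating the trajectories already computed for earlier vehicles as the fixed p_others.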
Decentralized Implementation
Augugliaro's paper proposed a centralized algorithm in which a single quadratic program is formed for all vehicles simultaneously, with O((KND)^2) constraints and KND decision variables. Recall K is the number of discretization steps, N is the number of agents, and D is the number of dimensions.
Since the run time of solving a quadratic program scales quadratically in the number
of constraints, the run time of Augugliaro's algorithm is O((KND)^4) per iteration.
In comparison, Morgan et al. [40] developed a decentralized SCP algorithm for
control of multi-agent spacecraft swarms. In particular, it forms N small quadratic
programs, one for each vehicle. The quadratic programs are solved sequentially, and the
solution (trajectory) for an earlier vehicle is cast as constraints for later vehicles.
The run time complexity of this algorithm is O(N(KND)^2) per iteration, which is
much cheaper than that of Augugliaro's algorithm. In practice, it is often more important to find
feasible trajectories quickly than to guarantee optimality.
In this thesis, SCP is used as shorthand for a decentralized adaptation of
Augugliaro's algorithm. The following section first shows that SCP does not work
robustly for multi-agent path planning (it fails to return feasible trajectories for certain
test cases); then, it proposes key modifications that significantly improve robustness.
Convergence Pattern
Augugliaro presented a number of test cases and showed through empirical simulation
that the algorithm usually converges within four iterations. However, the convergence
process is not uniform. In particular, linearizing about the first several iterates often
induces an infeasible quadratic program (QP). Common commercial solver packages,
such as Matlab or MOSEK, often return infeasible trajectories when the QP is infeasible.
Sometimes, linearizing about the infeasible trajectories yields a feasible QP, and
the algorithm then starts to converge.
A typical convergence pattern of SCP is illustrated in Fig. 5-7. The red lines
show the corner and pole constraints, which are explained in Sec. 5.4.3. In this case, two
quadrotors attempt to swap positions. In SCP, the blue vehicle, traveling from (-1, -6) to (-1, -3), is solved first and hence has a straight trajectory. The green vehicle,
traveling from (-1, -3) to (-1, -6), is solved second and must satisfy a collision constraint
with respect to the blue vehicle. From top left to bottom right, this figure shows
the green vehicle's trajectory after the 1st, 2nd, 5th, and 9th iterations. Linearization
about the first four iterates yields infeasible QPs, and thus no feasible solutions
were found (the green trajectories in the top row do not reach (-1, -6)). After a feasible
solution is found, the algorithm then converges to a local optimum. This phenomenon
sometimes causes the algorithm to fail to find feasible trajectories, which is discussed
in the next section (see Figs. 5-9 and 5-10).
Figure 5-7: Typical Convergence Pattern of SCP. Two quadrotors attempt to swap
positions. The blue vehicle is solved before the green vehicle. From top left to
bottom right, this figure shows the green vehicle's trajectory after 1st, 2nd, 5th and
9th iterations. SCP returned infeasible trajectories in the first four iterations.
5.4.3
Incremental Sequential Convex Program
The convergence pattern of SCP is troublesome because it shows that SCP can get
trapped in intermediate steps, where no feasible solution can be found. Empirical
simulation shows that this happens approximately 40% of the time (on randomly generated test cases) for a group of four vehicles navigating in the RAVEN testbed. This
section introduces a number of modifications that lead to the formulation of the incremental sequential convex programming (iSCP) algorithm. This work was
developed jointly by the author, Mark Cutler, and Jack Quindlen.
Figure 5-8: Modeling the Corner Constraint. The exponential loss function (Eqn. 5.9) is
less than zero inside the infeasible corner region (x < -1 and y < -2), and greater
than zero elsewhere. Increasing the value of the constant c leads to a better approximation
of the true corner boundary.
Corner and Pole Constraints
Augugliaro's paper assumed a convex rectangular planning space and did not permit
any static obstacles. Motivated by the need to navigate in RAVEN's non-convex
flight volume, an exponential loss function is developed to model corner and pole
constraints, in particular,

    g(p_i^x[k], p_i^y[k]) = \exp\left( c \, (p_i^x[k] - x_{cor}) \right) + \exp\left( c \, (p_i^y[k] - y_{cor}) \right) - 2        (5.9)

where p_i^x[k] and p_i^y[k] are the x and y coordinates, respectively, of vehicle i at time step
k, and x_{cor} and y_{cor} are the coordinates of the corner. The contour plot of this function
is shown in Fig. 5-8. The constraint is to require g(p_i^x[k], p_i^y[k]) > 0 \; \forall i, k, thereby
forcing vehicles to remain outside of the corner region. Since Eqn. 5.9 is nonlinear,
a Taylor expansion is applied to form a linear approximation similar to that in Eqn. 5.7.
The pole constraint can be developed similarly. The corner and pole constraints are
shown in red in Figures 5-7, 5-9, and 5-10.
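A small sketch of the corner loss and the keep-out test it encodes is shown below; the corner location and the value of c are placeholder parameters.

    # Sketch of the exponential corner loss (Eqn. 5.9). The constraint
    # g(px, py) >= 0 keeps vehicles out of the region px < x_cor and py < y_cor;
    # the corner location and c below are placeholder values.
    import numpy as np

    def corner_loss(px, py, x_cor, y_cor, c=10.0):
        return np.exp(c * (px - x_cor)) + np.exp(c * (py - y_cor)) - 2.0

    # With the corner at (-1, -2): a point inside the corner region violates the
    # constraint, while a point well outside satisfies it.
    print(corner_loss(-2.0, -3.0, x_cor=-1.0, y_cor=-2.0) >= 0.0)   # False (inside)
    print(corner_loss( 0.0,  0.0, x_cor=-1.0, y_cor=-2.0) >= 0.0)   # True  (outside)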
Incremental Addition of Constraints
SCP often fails because the quadratic program formed by linearizing about the initial
guess (a straight line between the starting and final positions) is often infeasible, as shown
in Figs. 5-9(a) and 5-10(a). Sometimes the infeasible QP yields an infeasible trajectory, which, upon linearization, produces another infeasible QP. In such scenarios, the
algorithm can become trapped and no feasible solution is found.
This observation prompts the development of the incremental sequential convex program (iSCP). The key idea is to gently steer the trajectory towards the goal by
adding constraints incrementally. Thereby, iSCP tries to ensure that all intermediate
quadratic programs (at every iteration) are feasible.
Recall K is the number of discretization time steps. Thus, a vehicle's trajectory is
composed of K points, with one point at each discretization time step. Each point in
a vehicle's trajectory is subject to collision constraints, which are linearized by Taylor
expansion about the point itself (Eqn. 5.7). Thus, each point is subject to a different
set of collision constraints. In constructing a quadratic program (QP), SCP adds
collision constraints at all points to the QP in every iteration. In contrast, iSCP adds
collision constraints incrementally. For instance, it adds collision constraints about
the first point (k = 1) on the first iteration, and it adds collision constraints about the
first two points (k = 1 and k = 2) on the second iteration (in the actual implementation,
iSCP can skip certain points for computational efficiency). In at most K iterations,
iSCP adds collision constraints at all points.
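The schedule itself can be summarized in a few lines (a sketch of the idea only; the re-linearization and QP solve at each iteration proceed as in SCP):

    # Sketch of iSCP's incremental constraint schedule: on iteration t, collision
    # constraints are imposed only at the first min(t, K) trajectory points, which
    # helps keep the intermediate QPs feasible. (Idea only; the re-linearization
    # and QP solve at each iteration are as in SCP.)
    def constrained_steps(iteration, K):
        """Time steps whose collision constraints are included on this iteration."""
        return list(range(1, min(iteration, K) + 1))

    # Example with K = 10 trajectory points:
    print(constrained_steps(1, 10))    # [1]
    print(constrained_steps(3, 10))    # [1, 2, 3]
    print(constrained_steps(12, 10))   # all 10 points, as in standard SCP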
Final Time Scaling
Existing algorithms require the user to specify time step h and end time T in advance. However, it is often difficult to determine these values beforehand, especially
for groups of more than three vehicles. If T is set too small, the path planning problem could be infeasible given constraints on velocity, acceleration, and jerk. If T is
set too large, vehicles will travel very slowly. Instead, iSCP solves the problem
in normalized time (T = 1) and scales the solution such that the maximum allowed
acceleration (or velocity, or jerk) is achieved. In particular, given the accelerations a_i[k] in
normalized time, velocity and jerk are computed using Eqn. 5.2. A scaling factor r
is computed as follows,
    r = \min\left( \min_{i,k} \frac{v_{max}}{\left\| v_i[k] \right\|}, \; \min_{i,k} \frac{a_{max}}{\left\| a_i[k] \right\|}, \; \min_{i,k} \frac{j_{max}}{\left\| j_i[k] \right\|} \right)        (5.10)

where v_{max}, a_{max}, and j_{max} are the maximum allowed velocity, acceleration, and jerk,
respectively. The solution to the normalized problem is then scaled by the factor r,

    a_i[k]_{scaled} = a_i[k] \, r \quad \forall i, k        (5.11)
    h_{scaled} = h / \sqrt{r}        (5.12)
    T_{scaled} = T / \sqrt{r}        (5.13)
It can be shown that this scaling procedure only changes the rate at which vehicles
travel along the trajectories, but not the shape of trajectories.
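The shape-preservation claim can be checked numerically: scaling the accelerations by r and the time step by 1/sqrt(r) (Eqns. 5.11 and 5.12) leaves the integrated positions unchanged. The sketch below assumes a zero initial velocity and placeholder values for h and r.

    # Numerical check that the scaling of Eqns. 5.11-5.12 preserves the path
    # shape: positions obtained by integrating Eqn. 5.2 are identical before and
    # after scaling (assuming zero initial velocity).
    import numpy as np

    def rollout(a, h, p0=np.zeros(2), v0=np.zeros(2)):
        p, v = [p0], [v0]
        for ak in a:
            p.append(p[-1] + h * v[-1] + 0.5 * h ** 2 * ak)
            v.append(v[-1] + h * ak)
        return np.array(p)

    a = np.random.randn(20, 2)        # accelerations from the normalized-time solution
    h, r = 0.1, 0.5                   # placeholder time step and scaling factor
    print(np.allclose(rollout(a, h), rollout(a * r, h / np.sqrt(r))))   # True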
Simulation Results
SCP and iSCP path planning algorithms were implemented and run in a simulated
environment with corner and pole constraints (reflecting the layout of the RAVEN testbed).
Each algorithm is given the same set of randomly generated test cases, in which
the starting and ending positions of four vehicles are sampled uniformly from the
feasible region. SCP returned feasible trajectories in 15 out of 25 test cases, while
iSCP returned feasible trajectories in all test cases. To explain this observation, two
test cases are presented to illustrate the differences in convergence between SCP and
iSCP.
Fig. 5-9 illustrates a scenario in which a single quadrotor tries to navigate around
the upper left corner. In both algorithms, the trajectory is initialized as a straight line
connecting the starting and ending positions. The SCP algorithm, shown in the left
column, added all collision constraints (at every discretization time step, linearized
about the straight line) in forming the first QP. However, the QP is infeasible and
Figure 5-9: Convergence Pattern Around a Corner. A quadrotor attempts to negotiate a corner while traveling from (-2, 0) to (-5, -2.5). Sub-figures (a) and (b) illustrate
the quadrotor's trajectory computed by SCP and iSCP, respectively, at iterations 1,
2, 5, and 9 from top to bottom. SCP failed to find a feasible solution while iSCP
found a feasible solution.
Figure 5-10: Convergence Pattern for a Multi-agent System. Two pairs of quadrotors
attempt to swap positions. The red, blue, and green agents are solved before the teal
agent. Sub-figures(a) and (b) illustrate the quadrotors' trajectories computed for the
teal agent by SCP and iSCP, respectively, at iterations 1, 2, 5, and 9 from top to
bottom. SCP failed to find a feasible solution while iSCP found a feasible solution.
Figure 5-11: Multi-agent Path Planning Test Cases. In sub-figure (a), six quadrotors
swap positions in a convex environment. In sub-figure (b), four quadrotors navigate
in the RAVEN environment. In both sub-figures, triangles and crosses represent the
starting and ending positions of quadrotors, respectively. Each color represents a
different quadrotor.
Matlab returned a trajectory that does not reach the ending position, shown in the
top left sub-figure.
Linearizing about this trajectory forms another infeasible QP,
which leads to a similar trajectory shown in the second row. This process repeats
as the algorithm gets stuck in the bottom sub-figure and no feasible solution is found.
In contrast, the iSCP algorithm, shown in the right column, only added collision
constraints for one discretization time step in forming the first QP. This leads to a
feasible QP whose solution is shown in the top right sub-figure. This trajectory, albeit
infeasible, shows a small improvement over the previous iterate (the straight-line initialization)
and still reaches the desired ending position. Repeating this procedure eventually leads
to a feasible solution, as shown in the bottom right sub-figure.
Fig. 5-10 illustrates a four agent scenario in which two pairs of vehicles attempt to
swap positions. The test case is chosen such that all vehicles are away from the corner
and pole, and thus the example can be seen as path planning in a convex volume.
In both algorithms, the red, blue and green agents are solved before the teal agent.
From the teal agent's perspective, the other three agents can be viewed as dynamic
obstacles with known trajectories, which play a similar role as the corner constraint
in the previous test case. SCP adds all collision avoidance constraints at once and
eventually gets trapped because it induces an infeasible QP at every iteration. In
contrast, iSCP finds a feasible solution by making incremental improvements one
step at a time, forming a feasible QP at every iteration.
Lastly, Fig. 5-11 illustrates two challenging test cases in which the trajectories of
any two vehicles intersect, implying a high degree of interference between vehicles.
In both scenarios, the collision avoidance constraint requires each pair of vehicles to be at
least 0.8 m apart at every time step. In both scenarios, iSCP was able to find feasible
trajectories in under two seconds.
Hardware Implementation
iSCP is implemented in C++ and utilizes the MOSEK library for solving the QP at each iteration. For single vehicle scenarios, iSCP returns a feasible trajectory in less than a
second. For four agent scenarios, iSCP takes 2.1 seconds on average to compute feasible trajectories. iSCP is tested on quadrotors for path planning in the RAVEN testbed.
A hardware demonstration of iSCP is shown in Fig. 5-12, in which a single quadrotor
flies around a pole. Trajectories returned by iSCP are typically very smooth.
5.5
Intruder Monitoring Mission
The intruder monitoring mission (IMM), as described in Section 4.2, is implemented
on hardware in the RAVEN testbed. Quadrotors are used as UAVs and iRobot Creates
are used as ground vehicles. The mission was run for 50 time steps (decision periods),
which took approximately 3 minutes and 20 seconds to complete in hardware. This shows
that HD-MMDP can be implemented in real time.
The IMM hardware experiment is shown in Fig. 5-13. The grid world's background is
a satellite image of mountainous terrain. The grid world is overlaid with translucent
white tiles (fog of war) to illustrate the uncertainty state w_mn of the environment.
When quadrotors perform the surveillance task, as indicated by the green radar in Fig. 5-13(a), the transparency of the white tiles (in the vicinity of the quadrotor) increases.
When quadrotors perform blocking or cornering tasks, as shown in Fig. 5-13(b), the
transparency of the white tiles decreases. The yellow square is the ground vehicle's
Figure 5-12: Hardware Demonstration of iSCP Algorithm. A quadrotor flies around a
pole following a trajectory returned by iSCP. Sub-figures(a) and (b) show a quadrotor
executing the trajectory from two different angles.
goal position.
When a ground vehicle first enters the grid world, it remains undetected, such as
the green vehicle in Fig. 5-13(a). When a ground vehicle is detected, a colored disk
and a flashing array are projected onto the ground vehicle. The x-y position of each
quadrotor is shown by a colored cross projected on the floor. The shadow cast by the
quadrotor is sometimes more than half a meter away from the quadrotor's actual
x-y position. The color of the cross indicates which ground vehicle the quadrotor is
blocking. For instance, in Fig. 5-13(b), the quadrotor with a purple cross is trying to
block the purple ground vehicle from reaching the yellow square.
The arrows emanating from each ground vehicle show the planner's understanding
of the ground vehicle's intent. In particular, the green arrow shows the vehicle's most
probable next state (one step ahead); the black arrow shows the ground vehicle's
most probable next next state (two steps ahead) if the quadrotor remains stationary;
the white arrow shows the ground vehicle's most probable next next state (two steps
ahead) if the quadrotor moves to block the ground vehicle. Therefore, the planner
deploys quadrotors strategically to influence the behaviors of the ground vehicles
(inducing a change of future action from the black arrow to the white arrow). For instance, in
Fig. 5-13(b), the purple ground vehicle tries to reach the yellow square by going
around the purple cross (following the green arrow first and then the black arrow).
Knowing the ground vehicle's intent, the purple quadrotor moves in parallel to the
green arrow, which prevents the ground vehicle from traveling along the black arrow
in the next time step. Since IMM is a stochastic problem, the arrows illustrate the
most probable future states, and the ground vehicles may therefore sometimes choose
an action different from the one indicated by the arrows. (For visualization reasons, when
a ground vehicle is forced to the edge, it is escorted out of the grid world at twice the
normal speed, so the intent arrows may look confusing in such circumstances.) A video
of this experiment can be found at https://www.youtube.com/watch?v=2oOIW7rZAJO.
5.6
Summary
This chapter presented the hardware and software infrastructures that enabled the hardware implementation of the intruder monitoring mission (IMM). Thereby, it showed
that the HD-MMDP algorithm can be implemented in real time. It also presented a novel
multi-agent path planning algorithm. The next chapter provides a summary of this
thesis and points to directions for future work.
Figure 5-13: Hardware Demonstration of the Intruder Monitoring Mission. The HD-MMDP
algorithm (with the modifications outlined in Section 4.2.3) is applied to solve the intruder
monitoring mission. The mission scenario is implemented in hardware with overhead
projection showing the belief state of the system. Sub-figures(a) and (b) illustrate
different visualization elements of the mission.
Chapter 6
Conclusions
6.1
Summary
This thesis has developed a novel hierarchical decomposition algorithm for solving
Multi-agent Markov Decision Processes (MMDPs) by exploiting coupling relationships
in the reward function. It motivated multi-agent planning problems and described
how uncertainty plays a major role in many real world scenarios. It reviewed the
formulations of Markov Decision Processes (MDP) and showed that MMDP is a natural framework for modeling stochastic sequential decision processes for multi-agent
systems. It provided a short description of classical dynamic programming (DP),
which can be used to find the optimal solution. However, classical DP approaches
suffer from the curse of dimensionality, which, in multi-agent problems, implies that
the algorithm's run time scales exponentially with the number of agents.
The thesis proposed the Hierarchically Decomposed MMDP (HD-MMDP) algorithm,
which automatically identifies a decomposition structure. The key idea is that
different functional forms in the reward function represent different degrees of coupling
between agents, which is formalized with the definition of a factored reward function. Then,
subproblems are formulated at the bottom of the decomposition hierarchy and are solved
using classical DP approaches. The subproblems typically have planning spaces that
are orders of magnitude smaller than that of the original planning problem. HD-MMDP uses a heuristic approach for assigning agents to subproblems, and
the subproblems' solutions are combined to form a solution (policy) to the original
problem.
HD-MMDP is applied to the persistent search and track mission (PST), which
is a cooperative multi-agent mission with stochastic fuel consumption and health
progression models. HD-MMDP is shown to outperform the GA-dec-MMDP algorithm
in the ten agent PST domain by more than 35% in cumulative reward. Further, a
more complex (in terms of coupling relationships) intruder monitoring mission (IMM)
is formulated. HD-MMDP is shown to achieve sustained performance (steadily increasing cumulative reward) in the IMM planning domain. Lastly, the thesis presented the
hardware and software infrastructures of the RAVEN testbed, which enabled the hardware
implementation of IMM. Hardware experiments illustrated that HD-MMDP is a real-time
algorithm and shed light on the algorithm's performance.
6.2
Future Work
6.2.1
Optimization on Task Tree Structure
The last step of HD-MMDP is to assign each agent to one of the leaf tasks in the
task tree structure. Currently, the algorithm adopts a heuristic approach: it
determines an importance ranking of the leaf tasks based on the maximum attainable reward,
r_max. The leaf tasks are then executed in this fixed ordering. However, the process of
computing this ordering does not account for the agents' current states. For instance,
if all UAVs are low on fuel, it may be preferable to have the team perform
an easier task with a lower reward as opposed to a difficult task with a higher
reward. The ordering of execution should be made flexible, taking into account each
agent's current state. More generally, assigning agents to leaf tasks can be seen as a
stochastic optimization problem, for which more mathematically principled solution
approaches should be explored.
6.2.2
Formulation of Leaf Tasks
As briefly described in Sec. 4.2.3, state abstraction can be applied when forming leaf
tasks. This can drastically reduce the size of a leaf task's planning space, thereby
improving run-time complexity. State abstraction was applied in solving the intruder
monitoring mission. However, this procedure often requires domain knowledge to
identify dimensions that can be abstracted away. It would be interesting to explore
ways to automate this procedure, possibly by using information embedded in the task
tree structure.
Bibliography
[1] A-a Agha-mohammadi, Suman Chakravorty, and Nancy M Amato. Firm:
Feedback controller-based information-state roadmap-a framework for motion
planning under uncertainty. In Intelligent Robots and Systems (IROS), 2011
IEEE/RSJ International Conference on, pages 4284-4291. IEEE, 2011.
[2] Mehdi Alighanbari. Task assignment algorithms for teams of UAVs in dynamic
environments. PhD thesis, Citeseer, 2004.
[3] Federico Augugliaro, Angela P Schoellig, and Raffaello D'Andrea. Generation
of collision-free trajectories for a quadrocopter fleet: A sequential convex programming approach. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ
International Conference on, pages 1917-1922. IEEE, 2012.
[4] Raphen Becker. Solving transition independent decentralized markov decision
processes. Computer Science Department Faculty Publication Series, page 208,
2004.
[5] Richard Bellman. Dynamic programming and stochastic control processes. Information and Control, 1(3):228-239, 1958.
[6] Daniel S Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The
complexity of decentralized control of markov decision processes. Mathematics
of operations research, 27(4):819-840, 2002.
[7] D. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific,
2005.
[8] B. Bethke, J. P. How, and J. Vian. Group health management of UAV teams
with applications to persistent surveillance. In American Control Conference
(ACC), pages 3145-3150, Seattle, WA, 11-13 June 2008.
[9] Blai Bonet and Hector Geffner. Labeled rtdp: Improving the convergence of
real-time dynamic programming. In ICAPS, volume 3, pages 12-21, 2003.
[10] Craig Boutilier. Sequential optimality and coordination in multiagent systems.
In IJCAI, 1999.
[11] Craig Boutilier. Sequential optimality and coordination in multiagent systems.
In International Joint Conference on Artificial Intelligence, pages 478-485,
1999.
[12] Justin A Boyan and Michael L Littman. Packet routing in dynamically changing
networks: A reinforcement learning approach. Advances in neural information
processing systems, pages 671-671, 1994.
[13] Steven J Bradtke and Andrew G Barto. Linear least-squares algorithms for
temporal difference learning. Machine Learning, 22(1-3):33-57, 1996.
[14] Lucian Busoniu, Robert Babuska, Bart De Schutter, and Damien Ernst. Reinforcement Learning and Dynamic Programming Using Function Approximators.
CRC Press, 2010.
[15] Trevor Campbell, Luke Johnson, and Jonathan P. How. Multiagent allocation of
markov decision process tasks. In American Control Conference (ACC). IEEE,
2013.
[16] Yu Fan Chen, N. Kemal Ure, Girish Chowdhary, Jonathan P. How, and John
Vian. Planning for large-scale multiagent problems via hierarchical decomposition with applications to UAV health management. In American Control Conference (ACC), Portland, OR, June 2014.
[17] Yann Chevaleyre, Paul E Dunne, Ulle Endriss, Jerome Lang, Michel Lemaitre,
Nicolas Maudet, Julian Padget, Steve Phelps, Juan A Rodriguez-Aguilar, and
Paulo Sousa. Issues in multiagent resource allocation. 2005.
[18] Koby Crammer, Michael Kearns, and Jennifer Wortman. Learning from multiple
sources. The Journal of Machine Learning Research, 9:1757-1774, 2008.
[19] Mark Cutler. Design and Control of an Autonomous Variable-Pitch Quadrotor
Helicopter. Master's thesis, Massachusetts Institute of Technology, Department
of Aeronautics and Astronautics, August 2012.
[20] Robin Deits and Russ Tedrake. Computing large convex regions of obstacle-free
space through semidefinite programming. Submitted to the Workshop on the
Algorithmic Foundations of Robotics, 2014.
[21] Thomas G Dietterich et al. The maxq method for hierarchical reinforcement
learning. In Proceedings of the fifteenth international conference on machine
learning, volume 8. Citeseer, 1998.
[22] Alborz Geramifard. Practical Reinforcement Learning Using Representation
Learning and Safe Exploration for Large Scale Markov Decision Processes. PhD
thesis, Massachusetts Institute of Technology, Department of Aeronautics and
Astronautics, February 2012.
[23] Claudia V Goldman and Shlomo Zilberstein. Decentralized control of cooperative
systems: Categorization and complexity analysis. J. Artif. Intell. Res. (JAIR),
22:143-174, 2004.
[24] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with
factored mdps. In In NIPS-14, pages 1523-1530. The MIT Press, 2001.
[25] Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In ICML, 2002.
[26] Carlos Guestrin, Shobha Venkataraman, and Daphne Koller. Context-specific
multiagent coordination and planning with factored mdps. In AAAI/IAAI, pages
253-259, 2002.
[27] Eric A. Hansen and Shlomo Zilberstein. LAO*: A heuristic search algorithm
that finds solutions with loops. Artificial Intelligence, 129(1-2):35-62, 2001.
[28] M. Hoeing, P. Dasgupta, P. Petrov, and S. O'Hara. Auction-based multi-robot
task allocation in comstar. In Proceedings of the 6th international joint conference
on Autonomous agents and multiagent systems, 2007.
[29] A.P. Hu, Chao Liu, and Hao Leo Li. A novel contactless battery charging system
for soccer playing robot. In 15th International Conference on Mechatronics and
Machine Vision in Practice, pages 646 -650, December 2008.
[30] Luke Johnson. Decentralized Task Allocation for Dynamic Environments. Master's thesis, Massachusetts Institute of Technology, January 2012.
[31] S. Karaman and E. Frazzoli. Sampling-based algorithms for optimal motion
planning. International Journal of Robotics Research, 30(7):846-894, June 2011.
[32] Jelle R Kok, Matthijs TJ Spaan, and Nikos Vlassis. Multi-robot decision making
using coordination graphs. In Proceedings of the 11th International Conference
on Advanced Robotics, ICAR, volume 3, pages 1124-1129, 2003.
[33] Steven M LaValle. Rapidly-exploring random trees: A new tool for path planning.
1998.
[34] Brandon Luders and Jonathan P. How. Probabilistic feasibility for nonlinear
systems with non-Gaussian uncertainty using RRT. In AIAA Infotech@Aerospace
Conference, St. Louis, MO, March 2011. (AIAA-2011-1589).
[35] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI
Perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning.
Morgan & Claypool Publishers, 2012.
[36] Mausam and Andrey Kolobov. Planning with Markov Decision Processes: An AI
Perspective. Synthesis Lectures on Artificial Intelligence and Machine Learning.
Morgan & Claypool Publishers, 2012.
[37] Mausam and Daniel S. Weld. Solving concurrent markov decision processes. In
Deborah L. McGuinness and George Ferguson, editors, AAAI, pages 716-722.
AAAI Press / The MIT Press, 2004.
[38] Nicolas Meuleau, Milos Hauskrecht, Kee eung Kim, Leonid Peshkin, Leslie Pack
Kaelbling, Thomas Dean, and Craig Boutilier. Solving very large weakly coupled
markov decision processes. In Proceedings of the Fifteenth National Conference
on Artificial Intelligence, pages 165-172, 1998.
[39] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. In Machine Learning, pages 103-130,
1993.
[40] Daniel Morgan, Soon-Jo Chung, and Fred Y Hadaegh. Model predictive control of
swarms of spacecraft using sequential convex programming. Journal of Guidance,
Control, and Dynamics, pages 1-16, 2014.
[41] Christos H Papadimitriou and John N Tsitsiklis. The complexity of markov
decision processes. Mathematics of operations research, 12(3):441-450, 1987.
[42] M. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 1994.
[43] Morgan Quigley, Ken Conley, Brian Gerkey, Josh Faust, Tully Foote, Jeremy
Leibs, Rob Wheeler, and Andrew Y Ng. Ros: an open-source robot operating
system. In ICRA workshop on open source software, volume 3, page 5, 2009.
[44] J. D. Redding, T. Toksoz, N. Kemal Ure, A. Geramifard, J. P. How, M. Vavrina,
and J. Vian. Persistent distributed multi-agent missions with automated battery
management. In AIAA Guidance, Navigation, and Control Conference (GNC),
August 2011. (AIAA-2011-6480).
[45] Joshua D. Redding. Approximate Multi-Agent Planning in Dynamic and Uncertain Environments. PhD thesis, Massachusetts Institute of Technology, Department of Aeronautics and Astronautics, Cambridge, MA, February 2012.
[46] A. Richards, J. Bellingham, M. Tillerson, and J. P. How. Coordination and control of multiple UAVs. In AIAA Guidance, Navigation, and Control Conference
(GNC), Monterey, CA, August 2002. AIAA Paper 2002-4588.
[47] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice-Hall, Englewood Cliffs, NJ, 1995.
[48] T. Schouwenaars, B. de Moor, E. Feron, and J. P. How. Mixed integer programming for multi-vehicle path planning. In Proceedings of the European Control
Conference, pages 2603-2608, Porto, Portugal, September 2001. European Union
Control Association.
[49] Satinder Singh and David Cohn. How to dynamically merge markov decision
processes, 1997.
[50] Trey Smith and Reid Simmons. Focused real-time dynamic programming for
mdps: Squeezing more out of a heuristic. In AAAI 2006, pages 1227-1232.
[51] Devika Subramanian, Peter Druschel, and Johnny Chen. Ants and reinforcement
learning: A case study in routing in dynamic networks. In IJCAI (2), pages
832-838. Morgan Kaufmann, 1998.
[52] Vicon Motion Systems. Motion capture systems from Vicon, 2008. Online:
http://www.vicon.com/.
[53] Tuna Toksoz. Design and Implementation of an Automated Battery Management
Platform. Master's thesis, Massachusetts Institute of Technology, August 2012.
[54] Tuna Toksoz, Joshua Redding, Matthew Michini, Bernard Michini, Jonathan P.
How, Matthew Vavrina, and John Vian. Automated Battery Swap and Recharge
to Enable Persistent UAV Missions. In AIAA Infotech@Aerospace Conference,
March 2011. (AIAA-2011-1405).
[55] Nazim Kemal Ure, Girish Chowdhary, Jonathan P. How, Mathew Vavarina, and
John Vian. Health aware planning under uncertainty for uav missions with
heterogeneous teams. In Proceedings of the European Control Conference, Zurich,
Switzerland, July 2013.
[56] M. Valenti, B. Bethke, D. Dale, A. Frank, J. McGrew, S. Ahrens, J. P. How, and
J. Vian. The MIT Indoor Multi-Vehicle Flight Testbed. In IEEE International
Conference on Robotics and Automation (ICRA), pages 2758-2759, 10-14 April
2007.
[57] M. Valenti, B. Bethke, G. Fiore, J. P. How, and E. Feron. Indoor Multi-Vehicle
Flight Testbed for Fault Detection, Isolation, and Recovery. In AIAA Guidance,
Navigation, and Control Conference (GNC), Keystone, CO, August 2006. (AIAA-2006-6200).
[58] Li-Chih Wang and Sian-Kun Lin. A multi-agent based agile manufacturing planning and control system. Computers and Industrial Engineering, 57(2):620-640,
2009.