AN ABSTRACT OF THE THESIS OF
Scott Proper for the degree of Doctor of Philosophy in Computer Science
presented on December 1, 2009.
Title: Scaling Multiagent Reinforcement Learning
Abstract approved:
Prasad Tadepalli
Reinforcement learning in real-world domains suffers from three curses of
dimensionality: explosions in state and action spaces, and high stochasticity or
“outcome space” explosion. Multiagent domains are particularly susceptible to
these problems. This thesis describes ways to mitigate these curses in several
different multiagent domains, including real-time delivery of products using
multiple vehicles with stochastic demands, a multiagent predator-prey domain,
and a domain based on a real-time strategy game.
This thesis presents several approaches that mitigate each of these curses. To mitigate the problem of state-space explosion, “Tabular linear functions” (TLFs)
are introduced that generalize tile-coding and linear value functions and allow
learning of complex nonlinear functions in high-dimensional state-spaces. It is
also shown how to adapt TLFs to relational domains, creating a “lifted” version
called relational templates. To mitigate the problem of action-space explosion,
the replacement of complete joint action space search with a form of hill climbing
is described. To mitigate the problem of outcome space explosion, a more
efficient calculation of the expected value of the next state is shown, and two
real-time dynamic programming algorithms based on afterstates, ASH-learning
and ATR-learning, are introduced.
Lastly, two approaches that scale by treating a multiagent domain as being
formed of several coordinating agents are presented. “Multiagent H-learning”
and “Multiagent ASH-learning” are described, where coordination is achieved
through a method called “serial coordination”. This technique has the benefit of
addressing each of the three curses of dimensionality simultaneously by reducing
the space of states and actions each local agent must consider.
The second approach to multiagent coordination presented is “assignment-based
decomposition”, which divides the action selection step into an assignment phase
and a primitive action selection phase. Like the multiagent approach,
assignment-based decomposition addresses all three curses of dimensionality
simultaneously by reducing the space of states and actions each group of agents
must consider. This method is capable of much more sophisticated coordination.
Experimental results are presented which show successful application of all
methods described. These results demonstrate that the scaling techniques
described in this thesis can greatly mitigate the three curses of dimensionality
and allow solutions for multiagent domains to scale to large numbers of agents,
and complex state and outcome spaces.
© Copyright by Scott Proper
December 1, 2009
All Rights Reserved
Scaling Multiagent Reinforcement Learning
by
Scott Proper
A THESIS
submitted to
Oregon State University
in partial fulfillment of
the requirements for the
degree of
Doctor of Philosophy
Presented December 1, 2009
Commencement June 2010
Doctor of Philosophy thesis of Scott Proper presented on December 1, 2009.
APPROVED:
Major Professor, representing Computer Science
Director of the School of Electrical Engineering and Computer Science
Dean of the Graduate School
I understand that my thesis will become part of the permanent collection of
Oregon State University libraries. My signature below authorizes release of my
thesis to any reader upon request.
Scott Proper, Author
ACKNOWLEDGEMENTS
My deepest thanks are extended to all those who have supported me, in
particular my major professor, Prasad Tadepalli. Without his assistance and
support throughout my graduate education, this thesis could never have been
completed. In addition I would like to thank the members of my committee: Tom
Dietterich, Alan Fern, Ron Metoyer, and Jack Higginbotham for their patience
and support.
I would also like to thank Neville Mehta, Aaron Wilson, Sriraam Natarajan, and
Ronald Bjarnason for their friendship, and many useful discussions throughout
my research.
Very special thanks to my parents, Anna Collins-Proper and Datus Proper for
their love, support, encouragement, and understanding throughout my life. It is
because of them that I have had the opportunity to take my education this far.
Finally, I gratefully acknowledge the support of the Defense Advanced Research
Projects Agency under grant number FA8750-05-2-0249 and the National Science
Foundation for grant number IIS-0329278.
TABLE OF CONTENTS

1 Introduction
  1.1 Outline of the Thesis
  1.2 Thesis Organization

2 Background
  2.1 Reinforcement Learning
  2.2 Markov Decision Processes
  2.3 Dynamic Programming
    2.3.1 Total Reward Optimization
    2.3.2 Discounted Reward Optimization
    2.3.3 Average Reward Optimization
  2.4 Model-free Reinforcement Learning
  2.5 Model-based Reinforcement Learning
  2.6 Multiagent Reinforcement Learning

3 The Three Curses of Dimensionality
  3.1 Function Approximation
    3.1.1 Tabular Linear Functions
    3.1.2 Relational Templates
  3.2 Hill Climbing for Action Space Search
  3.3 Reducing Result-Space Explosion
    3.3.1 Efficient Expectation Calculation
    3.3.2 ASH-learning
    3.3.3 ATR-learning
  3.4 Experimental Results
    3.4.1 The Product Delivery Domain
    3.4.2 The Real-Time Strategy Domain
    3.4.3 ASH-learning Experiments
    3.4.4 ATR-learning Experiments
  3.5 Summary

4 Multiagent Learning
  4.1 Multiagent H-learning
    4.1.1 Decomposition of the State Space
    4.1.2 Decomposition of the Action Space
    4.1.3 Serial Coordination
  4.2 Multiagent ASH-learning
  4.3 Experimental Results
    4.3.1 Team Capture domain
    4.3.2 Experiments
  4.4 Summary

5 Assignment-based Decomposition
  5.1 Model-free Assignment-based Decomposition
  5.2 Model-based Assignment-based Decomposition
  5.3 Assignment Search Techniques
  5.4 Advantages of Assignment-based Decomposition
  5.5 Coordination Graphs
    5.5.1 The Max-plus Algorithm
    5.5.2 Dynamic Coordination
  5.6 Experimental Results
    5.6.1 Multiagent Predator-Prey Domain
    5.6.2 Model-free Reinforcement Learning Experiments
    5.6.3 Model-based Reinforcement Learning Experiments
  5.7 Summary

6 Assignment-level Learning
  6.1 HRL Semantics
  6.2 Function Approximation Semantics
  6.3 Experimental Results
    6.3.1 Four-state MDP Domain
    6.3.2 Real-Time Strategy Game Domain
  6.4 Summary

7 Conclusions
  7.1 Summary of Contributions
  7.2 Discussion and Future Work

Bibliography
LIST OF FIGURES

2.1 Schematic diagram for reinforcement learning.
2.2 The relationship between model-free (direct) and model-based (indirect) reinforcement learning.
3.1 Progression of states (s, s′, and s″) and afterstates (s_a and s′_{a′}).
3.2 The product delivery domain, with depot (square) and five shops (circles). Numbers indicate probability of customer visit each time step.
3.3 Comparison of complete search, Hill climbing, H- and ASH-learning for the truck-shop tiling approximation.
3.4 Comparison of complete search, Hill climbing, H- and ASH-learning for the linear inventory approximation.
3.5 Comparison of hand-coded algorithm vs. ASH-learning with complete search for the truck-shop tiling, linear inventory, and all feature-pairs tiling approximations.
3.6 Comparison of 3 agents vs 1 task domains.
3.7 Comparison of training on various source domains transferred to the 3 Archers vs. 1 Tower domain.
3.8 Comparison of training on various source domains transferred to the Infantry vs. Knight domain.
4.1 DBN showing the creation of afterstates s_{a1}...s_{am} and the final state s′ by the actions of agents a1...am and the environment E.
4.2 An example of the team capture domain for 2 pieces per side on a 4x4 grid.
4.3 The tiles used to create the function approximation for the team capture domain.
4.4 Comparison of multiagent, joint agent, H- and ASH-learning for the two vs. two Team Capture domain.
4.5 Comparison of ASH-learning approaches and hand-coded algorithm for the four vs. four Team Capture domain.
4.6 Comparison of multiagent ASH-learning to hand-coded algorithm for the ten vs. ten Team Capture domain.
5.1 A possible coordination graph for a 4-agent domain. Q-values indicate an edge-based decomposition of the graph.
5.2 Messages passed using Max-plus. Each step, every node passes a message to each neighbor.
5.3 A possible state in an 8 vs. 4 toroidal grid predator-prey domain. All eight predators (black) are in a position to possibly capture all four prey (white).
5.4 Comparison of various Q-learning approaches for the product delivery domain.
5.5 Examination of the optimality of policy found by assignment-based decomposition for product delivery domain.
5.6 Comparison of action selection and search methods for the 4 vs 2 Predator-Prey domain.
5.7 Comparison of action selection and search methods for the 8 vs 4 Predator-Prey domain.
5.8 Comparison of 6 agents vs 2 task domains.
5.9 Comparison of 12 agents vs 4 task domains.
6.1 Information typically examined by assignment-based decomposition.
6.2 Information examined by assignment-based decomposition with assignment-level learning.
6.3 A 4-state MDP with two tasks.
6.4 Comparison of various strategies for assignment-level learning.
6.5 Comparison of assignment-based decomposition with and without assignment-level learning for the 3 vs 2 real-time strategy domain.
6.6 Comparison of 6 archers vs. 2 glass cannons, 2 halls domain.
6.7 Comparison of 6 agents vs 4 tasks domain.
LIST OF TABLES

3.1 Various relational templates used in experiments. See Table 3.2 for descriptions of relational features, and Section 3.4.2 for a description of the domain.
3.2 Meaning of various relational features.
3.3 Different unit types.
3.4 Comparison of execution times for one run.
4.1 Comparison of execution times in seconds for one run of each algorithm. Column labels indicate number of pieces. “–” indicates a test requiring impractically large computation time.
5.1 Running times (in seconds), parameters required, and terms summed over for five algorithms applied to the product delivery domain.
5.2 Experiment data and run times. Columns list domain size, units involved (Archers, Infantry, Towers, Ballista, or Knights), use of transfer learning, assignment search type (“flat” indicates no assignment search), relational templates used for state and afterstate value functions, and average time to complete a single run.
7.1 The contributions of several methods discussed in this paper towards mitigating the three curses of dimensionality.
List of Algorithms

2.1 The Q-learning algorithm.
2.2 The R-learning algorithm.
2.3 The H-learning algorithm. The agent executes each step when in state s.
3.1 The ASH-learning algorithm. The agent executes steps 1-7 when in state s′.
3.2 The ATR-learning algorithm, using the update of Equation 3.14.
4.1 The multiagent H-learning algorithm with serial coordination. Each agent a executes each step when in state s.
4.2 The multiagent ASH-learning algorithm. Each agent a executes each step when in state s′.
5.1 The assignment-based decomposition Q-learning algorithm.
5.2 The ATR-learning algorithm with assignment-based decomposition, using the update of Equations 5.3 and 5.5.
5.3 The centralized anytime Max-plus algorithm.
5.4 The assignment-based decomposition Q-learning algorithm using coordination graphs.
6.1 The assignment-based decomposition with assignment-level learning Q-learning algorithm.
6.2 The ATR-learning algorithm with assignment-based decomposition and assignment-level learning.
DEDICATION
I dedicate this thesis to my mother, Anna,
and to my father, Datus, in memoriam.
Chapter 1 – Introduction
1.1 Outline of the Thesis
Reinforcement Learning (RL) is a method of teaching a computer to learn how to
act in a given environment or “domain” via trial and error. By repeatedly taking
actions, observing results and an associated reward or cost signal, a computer agent
may learn to act in an environment in such a way as to maximize its reward. This
kind of technique allows a computer to learn how to solve problems that might
be impractical to solve any other way. Reinforcement learning provides a nice
framework to model a variety of stochastic optimization problems [23], which are
optimization problems involving probabilistic (random) elements. Often, reinforcement learning is performed by learning a value function over states or state-action
pairs of the domain. This value function maps states or state-action pairs to values,
which allow an agent to determine the relative utility of a given state or action.
Typically, the value function is stored in a table, such that each single state or
state-action pair is mapped to a value. However, table-based approaches to large
RL problems suffer from three “curses of dimensionality”: explosions in state, action, and outcome spaces [16]. In this thesis, I propose and demonstrate several
ways to mitigate these curses in a variety of multiagent domains, including product delivery and routing, multiple predator and prey simulations, and real-time
strategy games.
The three main computational obstacles to dealing with large reinforcement
learning problems may be described as follows: First, the state space (and the time
required for convergence) grows exponentially in the number of variables. Second,
the space of possible actions is exponential in the number of agents, so even one-step look-ahead search is computationally expensive. Lastly, exact computation
of the expected value of the next state is costly, as the number of possible future
states (outcomes) can be exponential in the number of state variables. These three
obstacles are referred to as the three “curses of dimensionality”.
I introduce methods that effectively address each of the above difficulties, both
individually and together, in several different domains. To mitigate the exploding
state-space problem, I introduce “tabular linear functions” (TLFs), which can be
viewed as linear functions over some features, whose weights are functions of other
features. TLFs generalize tables, linear functions, and tile coding, and allow for
a fairly flexible mechanism for specifying the space of potential value functions. I
show particular uses of these functions in a product delivery domain that achieve
a compact representation of the value function and faster learning. I introduce a
“lifted” relational version of TLFs called “relational templates”, which I show to
facilitate transfer learning in certain domains.
Second, to reduce the computational cost of searching the action space, which is
exponential in the number of agents, I introduce a simple hill climbing algorithm
that effectively scales to a larger number of agents without sacrificing solution
quality.
Third, for model-based reinforcement learning algorithms, the expected value of
the next state at every step must be calculated. Unfortunately many domains have
a high stochastic branching factor (number of possible next states) when the state
is a Cartesian product of several random state variables. I provide two solutions
to this problem. First, I take advantage of the factoring of the action model and
the partial linearity of the value function to decompose this computation. Second,
I introduce two new algorithms called ASH-Learning and ATR-learning, which are
“afterstate” versions of model-based reinforcement learning using average reward
and total reward settings respectively. These algorithms learn by distinguishing
between the action-dependent and action-independent effects of an agent’s action
[23]. I show experimental results in a product delivery domain and a real-time
strategy game that demonstrate that my methods are effective in ameliorating the
three curses of dimensionality that limit the applicability of reinforcement learning.
The above methods address each curse of dimensionality individually. A method
to address all curses of dimensionality simultaneously would be very useful. To
this end, I introduce multiagent versions of the H-learning and ASH-learning algorithms. By decomposing the state and action spaces of a joint agent into several
weakly coordinating agents, we can simultaneously address each of the three curses
of dimensionality. Results for this are demonstrated in a “Team Capture” domain.
When implementing a multiagent RL solution, coordination between agents is
the main difficulty, and it determines the tradeoff between speed and solution quality. The weak “serial coordination” introduced by the multiagent H-learning and
ASH-learning methods may not be enough. To improve coordination, I introduce
both a model-free and a model-based version of the assignment-based decomposition architecture. In this architecture, the action space is divided into a task
assignment level and a task execution level. At the task assignment level, each
task is assigned a group of agents. At the task execution level, each agent is assigned primitive actions to perform based on its task. Since the task execution is
mostly local, I learn a low-dimensional relational template-based value function for
this. Since the task assignment level is global, I use various exact and approximate
search algorithms to do the task assignment. While this two-level decomposition
resembles that of hierarchical multiagent reinforcement learning [14], in contrast
to that work, I do not require that a value function be stored at the root level.
Such a root-level value function must be over the joint state-space of all agents in
the worst case, and is therefore intractable. I demonstrate results showing that using search
over values given by a lower-level value function at the assignment level allows my
system to scale up to 12 agents, and potentially much beyond this.
Assignment-based decomposition is flexible enough to allow a value function
at all levels of the decision-making process, although as described above it is not
required. As described later in this thesis, I will show how such a value function
may be added to the assignment level and how doing this may allow improved
assignment decisions under certain circumstances. However, because of the scaling
limitations of requiring a global value function over all agents, using an assignment-level value function can limit the total number of agents at the top level of the
decision-making process.
In this thesis, I also explore the usefulness of transfer learning when applied to
the scaling problem. Transfer learning is a research problem in machine learning
that focuses on storing knowledge gained while solving one problem and applying
it to a different but related problem. I explore this in the context of a real-time
strategy game by exploiting the benefits of relational templates. I present three
kinds of transfer learning results in real-time strategy games. First, I show how
my approach enables transfer of value function knowledge between different
but similarly-sized multiagent domains. Second, I show transfer across domains
with different numbers of agents and tasks. Combining these two types of transfer
learning, I then show how knowledge gained from learning in several small domains
may be transferred to solve problems in large domains with multiple types of agents
and tasks. Thus, transfer learning may be applied to scale reinforcement learning
for multiagent domains.
1.2 Thesis Organization
The rest of the thesis is organized as follows:
Chapter 2 introduces reinforcement learning and Markov Decision Processes.
It describes two previously-studied reinforcement learning methods: a model-free
discounted learning method called Q-learning and a model-based average-reward
learning method called H-learning. These algorithms form the basis of algorithms
described in later chapters.
Chapter 3 discusses the three curses of dimensionality and some methods for
mitigating them individually. These methods include two kinds of function approximation, which serve to mitigate the first curse of dimensionality or exploding
state space, a hill climbing action selection approach which I use to mitigate the
second curse of dimensionality or exploding action space, and the ASH-learning
and ATR-learning algorithms, which are techniques for mitigating the third curse
of dimensionality or exploding outcome space for model-based RL algorithms. Finally I introduce the product delivery domain and describe experimental results
for the above techniques.
Chapter 4 describes a method for learning in multiagent domains which simultaneously addresses each of the three curses of dimensionality. Decomposed versions
of the H-learning and ASH-learning algorithms are provided. The “Team Capture”
domain is explained, and experimental results for this work are presented.
Chapter 5 introduces assignment-based decomposition, a more sophisticated
coordination technique for multiagent domains, for both model-free (Q-learning)
and model-based (ATR-learning) reinforcement learning algorithms. I also show
how to use coordination graphs together with assignment-based decomposition. I
finally describe a new “Predator-Prey” domain and the results for my experiments
on this and other domains using assignment-based decomposition.
Chapter 6 introduces assignment-level learning, which adds a value function
over certain “global” features of the state to the top assignment-level decision.
This allows assignment-based decomposition to solve certain problems it might
otherwise have difficulty with, but limits the scalability of the algorithm.
Finally in Chapter 7, I summarize my results and discuss potential future work.
Chapter 2 – Background
This chapter outlines the background of Reinforcement Learning (RL) and describes two previously studied reinforcement learning methods: Q-learning and
H-learning. These two algorithms form a basis upon which I build several new
reinforcement learning algorithms described in later chapters.
2.1 Reinforcement Learning
Reinforcement learning is the problem faced by a learning agent that must learn to
act by trial-and-error interactions with its environment. In the standard reinforcement learning paradigm, an agent is connected to its environment via perception
and action, as shown in Figure 2.1.
Figure 2.1: Schematic diagram for reinforcement learning.

In each step of interaction, the agent senses the environment and then selects an action to change the state of the environment. This state transition generates
a reinforcement signal – reward or penalty – that is received by the agent. While
taking actions by trial-and-error, the agent may incrementally learn a “value function” over states or state-action pairs, which indicates their utility to that agent.
The goal of reinforcement learning methods is to arrive, by performing actions
and observing their outcomes, at a policy, i.e. a mapping from states to actions,
which maximizes some measure of the accumulated reward over time. RL methods differ according to the exact measure and optimization criteria they use to
select actions. These methods apply trial-and-error methodology to explore the
environment over time to come up with a desired policy.
2.2 Markov Decision Processes
The agent’s environment is modeled as a Markov Decision Process (MDP). An
MDP is a tuple ⟨S, A, P, R⟩, where S is a finite set of n discrete states and A is a
finite set of actions available to the agent. The set of actions which are applicable in
a state s is denoted by A(s), and these actions are called admissible. The actions are stochastic
and Markovian in that an action a in a given state s ∈ S results in a state s0 with
fixed probability P (s0 |s, a). This probability matrix is called an action model of
s. The reward function R : S × A → R returns the reward R(s, a) after taking
action a in state s, also callled the reward model of s. The action and reward
models of an MDP are called its “domain model”. Each action is assumed to take
one time step. An agent’s policy is defined as a mapping π : S → A, such that
the agent executes action π(s) when in state s. A stationary policy is one which
does not change with time. A deterministic policy always maps the same state to
the same action. For the remainder of this thesis, “policy” refers to a stationary
deterministic policy.
Instead of directly learning a policy, in RL the agent may learn a value function that estimates the value for each state. At any time, the RL methods use
one-step lookahead with the current value function to choose the best action in
each state by some kind of maximization. Therefore the policies that RL methods learn are called “greedy” with respect to their value functions. In addition to
such greedy actions, RL methods also take some directed or random (exploratory)
actions. These exploratory actions ensure that all reachable states are explored
with sufficient frequency so that a learning method does not get stuck in a local maximum. There are several exploration strategies. The random exploration
strategy takes random actions with a fixed probability, giving high probabilities
to actions with high values [1]. The counter-based exploration prefers to execute
actions that lead to less frequently visited states [26]. Recency-based exploration
prefers actions which have not been executed recently in a given state [22]. In this
thesis, I use an ε-greedy strategy in all my experiments, which takes a random action with probability ε and a greedy action with probability 1 − ε.
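As a concrete illustration, a minimal Python sketch of this selection rule follows; the Q-table keyed by (state, action) pairs and the list of admissible actions are assumptions made only for this example, not notation from the thesis.

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    # With probability epsilon take a random admissible action (exploration),
    # otherwise take the action with the highest current Q-value (greedy).
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))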
2.3 Dynamic Programming
Given a complete and accurate model of an MDP in the form of the action and
reward models P (s0 |s, a) and R(s, a), it is possible to solve the decision problem
off-line by applying Dynamic Programming (DP) algorithms [2, 3, 18]. The recurrence relation of DP differs according to the optimization criterion: total reward
optimization, discounted reward optimization, or average reward optimization.
2.3.1 Total Reward Optimization
Suppose that an agent using a policy π goes through states s0 , ..., st in time 0
through t, with some probability. The cumulative sum of rewards received by
following a policy π starting from any state s0 is given by:
V^π(s_0) = lim_{t→∞} E( Σ_{k=0}^{t−1} R(s_k, π(s_k)) )    (2.1)
When there is an absorbing goal state g which is reachable from every state
under every stationary policy, and from which there are no transitions to other
states, the value function for a given policy π can be computed using the following
recurrence relation:
V^π(g) = 0    (2.2)

∀s ≠ g:  V^π(s) = R(s, π(s)) + E( Σ_{s′∈S} P(s′|s, π(s)) V^π(s′) )    (2.3)
An optimal total reward policy π* maximizes the above value function over all states s_0 and policies π, i.e. V^{π*}(s_0) ≥ V^π(s_0).
Under the above conditions, the value function for the optimal total reward
policy π* can be computed by:

V^{π*}(g) = 0    (2.4)

V^{π*}(s) = max_{a∈A(s)} { R(s, a) + Σ_{s′∈S} P(s′|s, a) V^{π*}(s′) }    (2.5)

where π* indicates the optimal policy, and thus V^{π*}(s) is the value function corresponding to the optimal policy; A(s), as before, denotes the set of admissible actions in state s.
2.3.2 Discounted Reward Optimization
Total reward is a good candidate to optimize; but if the agent has an infinite horizon
and there is no absorbing goal state, the total reward approaches ∞. One way
to make this total finite is by exponentially discounting future rewards. In other
words, one unit of reward received after one time step is considered equivalent
to a reward of γ < 1 received immediately. We now maximize the discounted
cumulative sum of rewards received by following a policy. The discounted total
reward received by following a policy π from state s0 is given by:
fγπ (s0 )
t−1
X
= lim E(
γ t R(sk π(sk )))
t→∞
(2.6)
k=0
where γ < 1 is the discount factor. Discounting by γ < 1 makes fγπ (s0 ) finite.
The value function above can be computed for any state by solving the following
set of simultaneous recurrence relations:
f^π(s) = R(s, π(s)) + γ Σ_{s′∈S} P(s′|s, π(s)) f^π(s′)    (2.7)
An optimal discounted policy π ∗ maximizes the above value function over all
states s and policies π. It can be shown to satisfy the following recurrence relation
[1, 3]:
f^{π*}(s) = max_{a∈A(s)} { R(s, a) + γ Σ_{s′∈S} P(s′|s, a) f^{π*}(s′) }    (2.8)
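To make the recurrence concrete, the sketch below runs standard value iteration on Equation 2.8 for a small MDP given explicit model tables; the dictionary-based representation of P and R and the stopping tolerance are assumptions made for this example only, not part of the thesis.

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # Iterate f(s) <- max_a [ R[s][a] + gamma * sum_s' P[s][a][s'] * f(s') ]
    # until the largest change in any state falls below tol.
    # P[s][a] maps next states to probabilities; R[s][a] is a scalar reward;
    # actions(s) returns the admissible actions in s.
    f = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(R[s][a] + gamma * sum(p * f[s2] for s2, p in P[s][a].items())
                       for a in actions(s))
            delta = max(delta, abs(best - f[s]))
            f[s] = best
        if delta < tol:
            return f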
2.3.3 Average Reward Optimization
For average reward optimization, we seek to optimize the average reward per time
step computed over time t as t → ∞, which is called the gain [18]. For a given
starting state s0 and policy π, the gain is given by Equation 2.9, where rπ (s0 , t)
is the total reward in t steps when policy π is used starting at state s0 , and
E(rπ (s0 , t)) is its expected value:
ρ^π(s_0) = lim_{t→∞} (1/t) E(r^π(s_0, t))    (2.9)
The goal of average reward learning is to learn a policy that achieves near-optimal
gain by executing actions, receiving rewards and learning from them. A policy
that optimizes the gain is called a gain-optimal policy. The expected total reward
in time t for optimal policies depends on the starting state s and can be written
as the sum ρ(s) · t + h_t(s), where ρ(s) is its gain. The Cesaro limit (or expected value) of the second term h_t(s) as t → ∞ is called the bias of state s and is denoted
by h(s). In communicating MDPs, where every state is reachable from every other
state, the optimal gain ρ∗ is independent of the starting state [18]. This is because
as t → ∞, we can expect to visit every state infinitely often (including the starting
state), and the contribution of the starting state will be included as a part of
the average reward. ρ∗ and the biases of the states satisfy the following Bellman
equation:
h(s) = max_{a∈A(s)} { r(s, a) + Σ_{s′=1}^{N} p(s′|s, a) h(s′) } − ρ*    (2.10)
2.4 Model-free Reinforcement Learning
There are two main roles for experience in a reinforcement learning agent: it may
be used to directly learn the policy, or it may be used to learn a model which can
then be used to plan and learn a value function or policy from. This relationship
is visualized in Figure 2.2. Using experience to directly learn the policy is called
“direct RL” [23] or “model-free” reinforcement learning. In this case, the model is learned implicitly as a part of the value function or policy. The case in which the model is learned explicitly is called “indirect RL” or “model-based” reinforcement learning, and is covered in Section 2.5. Both model-free and model-based RL have advantages: for example, model-free methods will not be affected by biases in the structure or design of the model.

Figure 2.2: The relationship between model-free (direct) and model-based (indirect) reinforcement learning.

Algorithm 2.1: The Q-learning algorithm.
1: Initialize Q(s, a) arbitrarily
2: Initialize s to any starting state
3: for each step do
4:     Choose action a from s using an ε-greedy policy derived from Q
5:     Take action a, observe reward r and next state s′
6:     Q(s, a) ← Q(s, a) + α[ r + γ max_{a′∈A(s′)} Q(s′, a′) − Q(s, a) ]
7:     s ← s′
8: end
In this section I describe two common model-free algorithms: Q-learning and R-learning. Q-learning can be a discounted or total reward algorithm. The objective
is to find an optimal policy π ∗ that maximizes the expected discounted future
reward for each state s. The MDP is assumed to have an infinite horizon, and so
future rewards are discounted exponentially with a discount factor γ ∈ [0, 1).
The optimal action-value function or Q-function gives the expected discounted
future reward for any state s when executing action a and then following the
optimal policy. The Q-function satisfies the following recurrence relation:
Q*(s, a) = R(s, a) + γ Σ_{s′} P(s′|s, a) max_{a′} Q*(s′, a′)    (2.11)
The optimal policy for a state s is the action arg max_a Q*(s, a) that maximizes the
expected future discounted reward. See the most common form of the Q-learning
algorithm in Algorithm 2.1.
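The following short Python sketch renders the update of Algorithm 2.1 with a dictionary as the Q-table; the environment interface (reset, step, actions) is a hypothetical stand-in used only to make the example self-contained, not an API from this thesis.

import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.95, epsilon=0.1):
    # Tabular Q-learning: after each transition (s, a, r, s'), move Q(s, a)
    # toward r + gamma * max_a' Q(s', a').
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            acts = env.actions(s)
            if random.random() < epsilon:
                a = random.choice(acts)
            else:
                a = max(acts, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, x)] for x in env.actions(s2))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q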
R-learning [19] is an off-policy model-free average reward reinforcement learning
algorithm. As with all average reward algorithms, the objective is to find an
optimal policy π ∗ that maximizes the reward per time step. The MDP is assumed
to have an infinite horizon, but unlike discounted methods, the value functions for
a policy are defined relative to the average expected reward per time step under
the policy.
R-learning is a standard TD control method similar to Q-learning. It maintains
a value function for each state-action pair and a running estimate of the average
reward ρ, which is an approximation of ρπ , the true optimal average reward. See
the complete algorithm in Algorithm 2.2. Note that I do not conduct any experiments using this algorithm in this thesis; it is included here because of its similarity to other
methods I discuss, in particular H-learning, which is discussed in the next section.
Algorithm 2.2: The R-learning algorithm.
1: Initialize ρ and Q(s, a) arbitrarily
2: Initialize s to any starting state
3: for each step do
4:     Choose action a from s using an ε-greedy policy derived from Q
5:     Take action a, observe reward r and next state s′
6:     Q(s, a) ← Q(s, a) + α[ r − ρ + max_{a′∈A(s′)} Q(s′, a′) − Q(s, a) ]
7:     if Q(s, a) = max_{a∈A(s)} Q(s, a) then
8:         ρ ← ρ + β[ r − ρ + max_{a′∈A(s′)} Q(s′, a′) − max_{a∈A(s)} Q(s, a) ]
9:     s ← s′
10: end

2.5 Model-based Reinforcement Learning

Model-based RL has several advantages over model-free methods: indirect methods can often make fuller use of limited experience, and thus converge to a better policy given fewer interactions with the environment. Having a model also provides more options: one can choose to learn the model first and then use planning approaches to learn a policy off-line, for example. This allows the luxury of testing several different forms of function approximation to determine which works best with a given problem, while minimizing the amount of experience data that must be gathered. In addition, it is usually the case that a value function learned using a model needs to store a value only for each state, not each state-action pair as with model-free methods such as Q-learning. If the model is compact, this can result in many fewer parameters required to store the value function.
In this thesis, I explore several model-based RL algorithms based on “H-learning”, which is an average reward learning algorithm. H-learning is model-based in that it uses explicitly represented action models p(s′|s, u) and r(s, u). In
previous work, H-learning has been found to be more robust and faster than its
model-free counterpart, R-learning [19, 24].
At every step, the H-learning algorithm updates the parameters of the value
function in the direction of reducing the temporal difference error TDE, i.e., the
difference between the r.h.s. and the l.h.s. of the Bellman Equation 2.10:

TDE(s) = max_{a∈A(s)} { r(s, a) + Σ_{s′=1}^{N} p(s′|s, a) h(s′) } − ρ − h(s)    (2.12)

Algorithm 2.3: The H-learning algorithm. The agent executes each step when in state s.
1: Find an action u ∈ U(s) that maximizes R(s, u) + Σ_{q=1}^{N} p(q|s, u) h(q)
2: Take an exploratory action or a greedy action in the current state s. Let a be the action taken, s′ be the resulting state, and r_imm be the immediate reward received.
3: Update the model parameters for p(s′|s, a) and R(s, a)
4: if a greedy action was taken then
5:     ρ ← (1 − α)ρ + α(R(s, a) − h(s) + h(s′))
6:     α ← α/(α + 1)
7: end
8: h(s) ← max_{u∈U(s)} { R(s, u) + Σ_{q=1}^{N} p(q|s, u) h(q) } − ρ
9: s ← s′
One issue that still needs to be addressed in Average-reward RL is the estimation
of ρ∗ , the optimal gain. Since it is unknown, H-learning uses ρ, an estimate of the
average reward of the current greedy policy, instead. From Equation 2.10, it can
be seen that r(s, u) + h(s0 ) − h(s) gives an unbiased estimate of ρ∗ , when action u
is greedy in state s, and s0 is the next state. We may thus update ρ as follows, in
every step:
ρ ← (1 − α)ρ + α(r(s, a) − h(s) + h(s′))    (2.13)

See Algorithm 2.3 for the complete algorithm.
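A sketch of a single H-learning update, written in Python, is shown below; it omits the model-update step (step 3 of Algorithm 2.3) and assumes the reward and transition models R and P are dictionaries of current empirical estimates, so all names here are illustrative rather than the thesis's implementation.

def h_learning_step(s, a_taken, was_greedy, s_next, h, rho, alpha, R, P, actions):
    # One update of Algorithm 2.3 after taking a_taken in s and observing s_next.
    # R[s][a] and P[s][a] hold the learned reward/transition model estimates,
    # h maps states to bias values, rho is the average-reward estimate.
    def backed_up(state, act):
        # r(s,u) + sum_s' p(s'|s,u) h(s')
        return R[state][act] + sum(p * h[s2] for s2, p in P[state][act].items())
    if was_greedy:
        # rho <- (1 - alpha) * rho + alpha * (R(s,a) - h(s) + h(s'))
        rho = (1 - alpha) * rho + alpha * (R[s][a_taken] - h[s] + h[s_next])
        alpha = alpha / (alpha + 1)
    # h(s) <- max_u [ r(s,u) + sum_s' p(s'|s,u) h(s') ] - rho
    h[s] = max(backed_up(s, u) for u in actions(s)) - rho
    return rho, alpha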
2.6 Multiagent Reinforcement Learning
The term “multiagent” may have several meanings. In this thesis, when referring
to a description of a domain, “multiagent” means a factored joint action space,
i.e. the actions available to the entire agent may be expressed as a Cartesian
product of n sets a1 , a2 , ..., an , each set corresponding to the actions of a separate,
independent agent. All domains described in this thesis are multiagent domains
in this sense.
When referring to algorithms for solving multiagent domains, there are two
general categories: a “joint agent” approach, which treats the full set (or joint set)
of all agent actions as a large list of actions which must be exhaustively searched to
find the correct joint action, or a “multiagent” approach, which takes full advantage
of the factored nature of the action space and typically searches only a local space
of actions unique to each agent, for each agent. The joint agent approach is often
slow, but an exhaustive search of the action space may sometimes find solutions
that a multiagent approach could not. However, joint agent approaches are unlikely
to scale to large numbers of agents. In this thesis, I discuss methods for scaling
joint agent approaches in Chapter 3, and multiagent approaches in Chapters 4 and
5. Typically, the challenge of multiagent approaches involves introducing enough
coordination between agents so that the absence of an exhaustive search of the
action space is mitigated.
Within multiagent algorithms, there are two broad approaches: first, using a centralized multiagent approach to mitigate difficulties in scaling due to a large joint action space, or second, a distributed multiagent approach that is required
due to the constraints of the domain, for example, when some method is needed to
coordinate the actions of multiple robots acting in the world. The main difference
between these approaches is that communication and sharing of data between
agents is easier or effortless in the case of a centralized approach emphasizing
scaling. For domains requiring a distributed multiagent approach, communication
usually carries some sort of cost. In this thesis, I focus entirely on the benefits of
a multiagent approach to scaling, and do not consider problems of communication
between agents.
I present a brief example of a typical multiagent RL algorithm here, by adapting Q-learning to a multiagent context. In a multiagent approach, the global Q-function Q(s, a) is approximated as a sum of agent-specific action-value functions: Q(s, a) = Σ_{i=1}^{n} Q_i(s_i, a_i) [12]. Further, I approximate each agent-specific action-value as a function only of each agent's local state s_i. A “selfish” agent-based
version of multiagent Q-learning [23] updates each agent’s Q-value independently
using the update function:
Q_i(s_i, a_i) ← Q_i(s_i, a_i) + α[ R_i(s, a) + γ Q_i(s′_i, a*_i) − Q_i(s_i, a_i) ]    (2.14)
where α ∈ [0, 1] is the learning rate. The notation Qi indicates only that the
Q-value is agent-based. The parameters used to store the Q-function may either
be unique to that agent or shared between all agents. The term Ri indicates that
the reward is factored, i.e. a separate reward signal is given to each agent.
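To make the decomposition concrete, here is a small Python sketch of the agent-wise update of Equation 2.14, with one Q-table per agent and a factored reward; the local-state extraction and the argument names are assumptions made for illustration only.

def multiagent_q_update(Q, local_states, joint_action, rewards,
                        next_local_states, next_actions, alpha=0.1, gamma=0.95):
    # Selfish multiagent Q-learning: each agent i moves Q_i(s_i, a_i) toward
    # R_i + gamma * Q_i(s'_i, a*_i), where a*_i is agent i's next greedy action.
    # Q is a list of per-agent defaultdict(float) tables; rewards is the
    # factored reward vector, one entry per agent.
    for i in range(len(Q)):
        s_i, a_i, s2_i = local_states[i], joint_action[i], next_local_states[i]
        a_star = max(next_actions[i], key=lambda a: Q[i][(s2_i, a)])
        target = rewards[i] + gamma * Q[i][(s2_i, a_star)]
        Q[i][(s_i, a_i)] += alpha * (target - Q[i][(s_i, a_i)])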
Chapter 3 – The Three Curses of Dimensionality
Reinforcement learning algorithms suffer from three “curses of dimensionality”:
explosions in state and action spaces, and a large number of possible next states of
an action due to stochasticity (or “outcome space” explosion) [16]. In this chapter,
I explore several methods for mitigating each of these three curses individually.
To mitigate the explosion in the state space, I introduce two related methods of function approximation, Tabular Linear Functions (TLFs) and relational
templates. To mitigate the explosion in the action space common to multiagent algorithms, I suggest an approximate search of the action space using hill climbing. To help mitigate the explosion in the number of result states, I introduce ASH-learning and ATR-learning, which are hybrid model-free/model-based approaches using afterstates.
3.1 Function Approximation
Unfortunately, table-based reinforcement learning does not scale to large state spaces such as those explored in this thesis, due to limitations of both space and convergence speed. The value function needs to be approximated using a more compact representation to make it scale with problem size. Linear function approximators
are among the simplest and fastest means of approximation. However, since the
value function is usually highly nonlinear in the primitive features of the domain,
the user needs to carefully hand-design high-level features so that the value function
can be approximated by a function which is linear in them [27].
In the following sections I introduce two related function approximation schemes:
“Tabular Linear Functions” (TLFs) and “Relational Templates” which generalize
linear functions, tables, and tile coding. Usually, a TLF expresses a tradeoff between the small number of parameters used by typical linear functions, and the
expressiveness of a complete table. Like any table, TLFs may express a nonlinear value function, but like a linear function, the value function may be stored
compactly.
3.1.1 Tabular Linear Functions
A tabular linear function is a linear function of a set of “linear” features of the
state, where the weights of the linear function are arbitrary functions of other
discretized (or “nominal”) features. Hence the weights can be stored in a table
indexed by the nominal features, and when multiplied with the linear features of
the state and summed, produce the final value function.
More formally, a tabular linear function TLF is represented by Equation 3.1,
which is a sum of n terms. Each term is a product of a linear feature φi and a
weight θi . The features φi need not be distinct from each other, although they
usually are. Each weight θi is a function of mi nominal features fi,1 , . . . , fi,mi .
v(s) = Σ_{i=1}^{n} θ_i(f_{i,1}(s), . . . , f_{i,m_i}(s)) φ_i(s)    (3.1)
A TLF reduces to a linear function when there are no nominal features, i.e. when
θ1 , . . . , θn are scalar values. One can also view any TLF as a purely linear function
where there is a term for every possible set of values of the nominal features:
v(s) = Σ_{i=1}^{n} Σ_{k∈K} θ_{i,k} φ_i(s) I(f_i(s) = k)    (3.2)
Here I(fi (s) = k) is 1 if fi (s) = k and 0 otherwise. fi (s) is a vector of values
fi,1 (s), . . . , fi,mi (s) in Equation 3.1, and K is the set of all possible vectors. TLFs
reduce to a table when there is a single term and no linear features, i.e., n = 1 and
φ1 = 1 for all states. They reduce to tile coding or coarse coding when there are
no linear features, but there are multiple terms, i.e., φi = 1 for all i and n ≥ 1.
The nominal features of each term can be viewed as defining a tiling or partition
of the state space into overlapping regions and the terms are simply added up to
yield the final value of the state [23].
Most forms of TLF are created using prior knowledge about the domain (see
Section 3.4.1). What if such prior knowledge does not exist? It is possible to take
advantage of tabular linear functions by constraining them with some syntactic
restrictions. For example, we can define a set of terms over all possible pairs
(or triples) of primitive state features. We then sum over all n(n − 1)/2 terms, where n indicates the number of primitive state features. For example, if we have four
features f1 ...f4 , we will then have 6 possible tuples of 2 features each:
v(s) = θ1 (f1 , f2 ) + θ2 (f1 , f3 ) + θ3 (f1 , f4 ) + θ4 (f2 , f3 ) + θ5 (f2 , f4 ) + θ6 (f3 , f4 ) (3.3)
Since this is also an instance of tile-coding, I call this all feature-pairs tiling.
An advantage of TLFs is that they provide a flexible but simple framework to
consider and incorporate different assumptions about the functional form of the
value function and the set of relevant features.
In general, the value function is represented as a parameterized functional form
of Equation 3.1 with weights θ1 , . . . , θn and linear features φ1 , . . . , φn . Each weight
θi is a function of mi nominal features fi,1 , . . . , fi,mi .
Then each θi is updated using the following equation:
θ_i(f_{i,1}(s), . . . , f_{i,m_i}(s)) ← θ_i(f_{i,1}(s), . . . , f_{i,m_i}(s)) + β · TDE(s) · ∇_{θ_i} v(s)    (3.4)
where ∇θi v(s) = φi (s) and β is the learning rate.
The above update suggests that the value function would be adjusted to reduce the temporal difference error in state s. This update is very similar to the
update used for ordinary linear value functions. Unlike with a normal linear value
function, only those table entries that match the current state’s nominal features
are updated, in proportion to the value of the linear feature φi (s).
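One possible Python rendering of a TLF and the update of Equation 3.4 follows; representing each term as a pair of a nominal-feature function and a linear-feature function is an assumption made for this sketch, not the representation used in my experiments.

from collections import defaultdict

class TabularLinearFunction:
    # v(s) = sum_i theta_i(f_i(s)) * phi_i(s): each theta_i is a table indexed
    # by the values of the nominal features, and phi_i is a linear feature.
    def __init__(self, terms):
        # terms: list of (nominal_fn, linear_fn), where nominal_fn(s) returns a
        # hashable key and linear_fn(s) returns a float.
        self.terms = terms
        self.theta = [defaultdict(float) for _ in terms]

    def value(self, s):
        return sum(self.theta[i][f(s)] * phi(s)
                   for i, (f, phi) in enumerate(self.terms))

    def update(self, s, td_error, beta=0.01):
        # theta_i(f_i(s)) += beta * TDE(s) * phi_i(s)   (Equation 3.4)
        for i, (f, phi) in enumerate(self.terms):
            self.theta[i][f(s)] += beta * td_error * phi(s)

For the all feature-pairs tiling of Equation 3.3, each term's nominal function would return the values of one pair of primitive features, and its linear feature would be the constant 1.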
Table 3.1: Various relational templates used in experiments. See Table 3.2 for descriptions of relational features, and Section 3.4.2 for a description of the domain.

No. | Description
#1 | ⟨Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(B)⟩
#2 | ⟨UnitType(B), TaskType(A), Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(A)⟩
#3 | ⟨UnitType(B), Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(A)⟩
#4 | ⟨TaskType(A), Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(A)⟩
#5 | ⟨Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(B), TasksInrange(B)⟩
#6 | ⟨UnitType(B), TaskType(A), Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(A), TasksInrange(B)⟩
#7 | ⟨UnitType(B), Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(A), TasksInrange(B)⟩
#8 | ⟨TaskType(A), Distance(A, B), AgentHP(B), TaskHP(A), UnitsInrange(A), TasksInrange(B)⟩
#9 | ⟨UnitX(A), UnitY(A), UnitX(B), UnitY(B)⟩
3.1.2 Relational Templates
Many domains are object-oriented, where the state consists of multiple objects or
units of different classes, each with multiple attributes. Relational templates are
a “lifted” version of tabular linear functions, generalizing them to object-oriented
domains [7].
A relational template is defined by a set of relational features over shared
variables (see Table 3.1). Each relational feature may have certain constraints
on the objects that can be passed to it; for example, in Table 3.2 each feature has a type constraint on its variables.

Table 3.2: Meaning of various relational features.

Feature | Constraint | Meaning
Distance(A, B) | Task(A) ∧ Agent(B) | Manhattan distance between units
AgentHP(B) | Agent(B) | Hit points of an agent
TaskHP(A) | Task(A) | Hit points of a task
UnitsInrange(A) | Task(A) | Count of the number of agents able to attack a task
TasksInrange(B) | Agent(B) | Count of the number of enemies able to attack an agent
UnitX(B) | Agent(B) | X-coordinate of an agent
UnitY(B) | Agent(B) | Y-coordinate of an agent
UnitType(B) | Agent(B) | Type (archery or infantry) of an agent
TaskType(A) | Task(A) | Type (tower, ballista, or knight) of a task

Each template is instantiated in a state
by binding its variables to units of the correct type. An instantiated template i
defines a table θi indexed by the values of its features in the current state. In
general, each template may give rise to multiple instantiations in the same state.
The value v(s) of a state s is the sum of the values represented by all instantiations
of all templates.
v(s) = Σ_{i=1}^{n} Σ_{σ∈I(i,s)} θ_i(f_{i,1}(s, σ), . . . , f_{i,m_i}(s, σ))    (3.5)
where i is a particular template, I(i, s) is the set of possible instantiations of i
in state s, and σ is a particular instantiation of i that binds the variables of the
template to units in the state. The relational features fi,1 (s, σ), . . . , fi,mi (s, σ) map
state s and instantiation σ to discrete values which index into the table θi . All
instantiations of each template i share the same table θi , which is updated for each
σ using the following equation:
θi (fi,1 (s, σ), . . . , fi,mi (s, σ)) ← θi (fi,1 (s, σ), . . . , fi,mi (s, σ)) + α(T DE(s, σ)) (3.6)
where α is the learning rate. This update suggests that the value of v(s) would be
adjusted to reduce the temporal difference error in state s. In some domains, the
number of objects can grow or shrink over time: this merely changes the number
of instantiations of a template.
One template is more refined than another if it has a superset of features.
The refinement relationship defines a hierarchy over the templates with the base
template forming the root and the most refined templates at the leaves. The values
in the tables of any intermediate template in this hierarchy can be computed from
its child template by summing up the entries in its table that refine a given entry in
the parent template. Hence, the intermediate template tables need not be maintained explicitly; however, computing them on demand in this way adds to the complexity of action selection and updates, so my implementation explicitly maintains all templates.
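A sketch of how Equations 3.5 and 3.6 might be realized in Python is given below; the assumption that each template supplies an instantiation generator and a tuple-valued feature function is made only for this example.

from collections import defaultdict

class RelationalTemplateValue:
    # v(s) = sum over templates i and instantiations sigma of theta_i(f_i(s, sigma)).
    def __init__(self, templates):
        # templates: list of (instantiations_fn, features_fn), where
        # instantiations_fn(s) yields variable bindings sigma and
        # features_fn(s, sigma) returns a tuple of discrete feature values.
        self.templates = templates
        self.theta = [defaultdict(float) for _ in templates]

    def value(self, s):
        return sum(self.theta[i][feats(s, sigma)]
                   for i, (insts, feats) in enumerate(self.templates)
                   for sigma in insts(s))

    def update(self, s, td_error, alpha=0.01):
        # The table theta_i is shared by all instantiations of template i,
        # and is updated once per instantiation (Equation 3.6).
        for i, (insts, feats) in enumerate(self.templates):
            for sigma in insts(s):
                self.theta[i][feats(s, sigma)] += alpha * td_error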
3.2 Hill Climbing for Action Space Search
The second curse of dimensionality in reinforcement learning is the exponential
growth of the joint action space and the corresponding time required to search this
action space. To mitigate this problem, one may implement a simple form of hill
climbing which can greatly speed up the action selection process with minimal loss
in the quality of the resulting policy.
In my experiments, hill climbing was used only during training. This is possible
only when using an off-policy learning method, such as H-learning (Algorithm 2.3).
I found that full exploitation of the policy during training is not necessarily conducive to improved performance during testing. It is only important that potentially high-value states are explored as often as necessary to learn their true
values; this kind of good exploration is a property shared by both complete and
hill climbing searches of the action space.
I performed hill climbing by noting that every joint action a is a vector of sub-actions, each by a single agent, i.e., a = (a_1, . . . , a_k). This vector is initialized with all neutral actions. The definition of “neutral action” varies with the domain: a “wait” action would be a typical example for some domains. Starting at a_1, I consider a small neighborhood of actions (one for each possible action a_1 may take, other than the action it is currently set to), and a_1 is set to the best action. This process is repeated for each agent a_2, . . . , a_k. The process then starts over at a_1, repeating until a has converged to a local optimum.
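The coordinate-ascent search just described can be sketched in Python as follows; the value function q_value(s, joint_action) and the per-agent action lists are placeholders assumed for the example.

def hill_climb_joint_action(s, agent_actions, neutral, q_value):
    # Greedy coordinate ascent over the joint action: sweep over the agents,
    # setting each agent's sub-action to its best choice given the others,
    # and repeat until no single-agent change improves the value.
    joint = list(neutral)
    improved = True
    while improved:
        improved = False
        for i in range(len(joint)):
            best_a, best_v = joint[i], q_value(s, tuple(joint))
            for a in agent_actions[i]:
                if a == joint[i]:
                    continue
                candidate = joint[:i] + [a] + joint[i + 1:]
                v = q_value(s, tuple(candidate))
                if v > best_v:
                    best_a, best_v = a, v
            if best_a != joint[i]:
                joint[i] = best_a
                improved = True
    return tuple(joint)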
3.3 Reducing Result-Space Explosion
The third curse of dimensionality occurs due to the difficulty of efficiently calculating the expected value of the next state, in domains with many possible resulting
states. This is sometimes called the “result-space” explosion. Such domains often
arise when there are many objects in the domain that are not agent-controlled,
yet exhibit some unpredictable behavior. For example, in a real-time strategy
game (see Section 3.4.2) there may be many enemy agents, each of which is acting
according to some unknown or stochastic policy.
It should be noted that while model-free reinforcement learning may also suffer
from having too many result states (thus requiring a lower learning rate and a
longer time to converge) in this thesis I am primarily concerned with the time
required to actually calculate the expected value of the next state. Model-free
algorithms have a significant advantage here, as usually an explicit calculation of
this value is needed only when using model-based reinforcement learning. One of
the drawbacks of model-based methods is that they require stepping through all
possible next states of a given action to compute the expected value of the next
state. This is very time-consuming. Optimizing this step improves the speed of
the algorithm considerably. Consider the fact that we need to compute the term
Σ_{s′=1}^{N} p(s′|s, u) h(s′) in Equation 2.12 to compute the Bellman error and update the parameters. Since the number of possible next states is often exponential in domain parameters such as the number of enemy agents, doing this calculation by brute force is expensive. I present three possible solutions to this problem in
the next sections.
3.3.1 Efficient Expectation Calculation
For this first method to apply, the value function must be linear in any features
whose values change stochastically. For example, in the product delivery domain
(see Section 3.4.1), the only features whose values change stochastically are the
shop inventory levels. Hence this solution may be applied for the linear inventory
function approximation of my domain.
Under the above assumption, we can rewrite the exponential-size calculation $\sum_{s'=1}^{N} p(s'|s,u)\,h(s')$ in Equation 2.12 as $\sum_{s'=1}^{N} p(s'|s,u)\left(\sum_{l=1}^{n} \theta_l \phi_{l,s'}\right)$, which can be rewritten as $\sum_{l=1}^{n} \theta_l \sum_{s'=1}^{N} p(s'|s,u)\,\phi_{l,s'}$ and simplified to $\sum_{l=1}^{n} \theta_l E(\phi_{l,s'}|s,u)$, where $E(\phi_{l,s'}|s,u) = \sum_{s'=1}^{N} p(s'|s,u)\,\phi_{l,s'}$ represents the expected value of the feature value $\phi_l$ in the next state under action $u$. $E(\phi_{l,s'}|s,u)$ is directly estimated
by on-line sampling and stored in a factored form. Instead of taking time proportional to the number of possible next states, this only takes time proportional to
the number of features, which is exponentially smaller. For example, if the current
inventory level of shop l is 2, and the probability of inventory going down by 1
in this step is 0.2, and the probability of its going down by 2 or more is 0, then
$E(\phi_{l,s'}|s,u) = 2 - 1 \times 0.2 = 1.8$. So we obtain the following temporal difference error:

$$TDE(s) = \max_{u \in U(s)} \left\{ r(s,u) + \sum_{l=1}^{n} \theta_l E(\phi_{l,s'}|s,u) \right\} - \rho - h(s) \qquad (3.7)$$

by substituting $\sum_{l=1}^{n} \theta_l E(\phi_{l,s'}|s,u)$ for $\sum_{s'=1}^{N} p(s'|s,u)\,h(s')$ in Equation 2.12.
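The sketch below contrasts the brute-force expectation with the factored computation over expected feature values; it is a minimal illustration assuming a linear value function, and the helper names (next_state_distribution, expected_features) are hypothetical.

import numpy as np

# Minimal sketch, assuming h(s') is linear: h(s') = theta . phi(s').
def expected_value_brute_force(theta, phi, next_state_distribution):
    # Sums over every possible next state: exponential in e.g. the number of shops.
    return sum(p * np.dot(theta, phi(s_next))
               for s_next, p in next_state_distribution)

def expected_value_factored(theta, expected_features):
    # Uses E[phi_l | s, u], estimated on-line and stored per feature:
    # time is linear in the number of features instead of next states.
    return np.dot(theta, expected_features)

# Example mirroring the text: a shop inventory of 2 that drops by 1 with
# probability 0.2 has expected next-state feature value 2 - 1 * 0.2 = 1.8.
expected_inventory = 2 - 1 * 0.2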
3.3.2 ASH-learning
Figure 3.1: Progression of states ($s$, $s'$, and $s''$) and afterstates ($s_a$ and $s'_{a'}$).

A second method for optimizing the calculation of the expectation is a different algorithm I call ASH-learning, which stands for Afterstate H-learning. This is based on the notion of afterstates [23], also called
“post-decision states” [16]. Afterstates are created by conceptually splitting the effects of an agent’s action into “action-dependent” effects and “action-independent”
(or environmental) effects.
The afterstate is the state that results by taking
into account the action-dependent effects, but not the action-independent effects.
If we consider Figure 3.1, we can view the progression of states/afterstates as $s \xrightarrow{a} s_a \rightarrow s' \xrightarrow{a'} s'_{a'} \rightarrow s''$. The “a” subscript used here indicates that $s_a$ is the afterstate of state $s$ and action $a$. The action-independent effects of the environment have created state $s'$ from afterstate $s_a$. The agent chooses action $a'$, leading to afterstate $s'_{a'}$ and receiving reward $r(s', a')$. The environment again
stochastically selects a state, and so on. The h-values may now be redefined in
these terms:
$$h(s_a) = E(h(s')) \qquad (3.8)$$

$$h(s') = \max_{u \in U(s')} \left\{ r(s', u) + \sum_{s'_u=1}^{N} p(s'_u|s', u)\,h(s'_u) \right\} - \rho^* \qquad (3.9)$$
If we substitute Equation 3.9 into Equation 3.8, we obtain this Bellman equation:

$$h(s_a) = E\left[ \max_{u \in U(s')} \left\{ r(s', u) + \sum_{s'_u=1}^{N} p(s'_u|s', u)\,h(s'_u) \right\} - \rho^* \right] \qquad (3.10)$$
Here the $s'_u$ notation indicates the afterstate obtained by taking action $u$ in state
s0 . I estimate the expectation of the max above via sampling in the ASH-learning
algorithm (Algorithm 3.1). Since this avoids looping through all possible next
states, the algorithm is much faster. In the domains explored in this chapter, the
afterstate is deterministic given the agent’s actions, but the stochastic effects due
to the environment are unknown. Using afterstates to learn the expectation of
the value of the next state takes advantage of this knowledge. For such domains
with deterministic agent actions, we do not need to learn $p(s'_u|s', u)$, providing a
significant savings in memory, computation time, and code complexity. By storing only the values of states rather than state-action pairs, this method also retains the advantages of model-based H-learning.

1. Find an action $u \in U(s')$ that maximizes $r(s', u) + \sum_{s'_u=1}^{N} p(s'_u|s', u)\,h(s'_u)$
2. Take an exploratory action or a greedy action in the state $s'$. Let $a'$ be the action taken, $s'_{a'}$ be the afterstate, and $s''$ be the resulting state.
3. Update the model parameters $p(s'_{a'}|s', a')$ and $r(s', a')$ using the immediate reward received.
4. if a greedy action was taken then
5.     $\rho \leftarrow (1 - \alpha)\rho + \alpha\,(r(s', a') - h(s_a) + h(s'_{a'}))$
6.     $\alpha \leftarrow \frac{\alpha}{\alpha + 1}$
7. end
8. $h(s_a) \leftarrow (1 - \beta)\,h(s_a) + \beta\left(\max_{u \in U(s')}\left\{ r(s', u) + \sum_{s'_u=1}^{N} p(s'_u|s', u)\,h(s'_u) \right\} - \rho\right)$
9. $s \leftarrow s''$
10. $s_a \leftarrow s'_{a'}$

Algorithm 3.1: The ASH-learning algorithm. The agent executes each step when in state $s'$.
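A minimal sketch of one ASH-learning step is shown below, assuming tabular value storage and hypothetical domain hooks (actions, reward, afterstate, sample_environment, choose, update_model); it mirrors the structure of Algorithm 3.1 rather than reproducing the thesis implementation.

# Minimal sketch of one ASH-learning step; the domain hooks are illustrative assumptions.
def ash_step(h, rho, alpha, beta, s_a, s_prime, domain):
    # Backed-up value of s' under action u: r(s',u) + sum_q p(q|s',u) h(q)
    def backed_up(u):
        return domain.reward(s_prime, u) + sum(
            p * h.get(q, 0.0) for q, p in domain.candidate_afterstates(s_prime, u))

    greedy_u = max(domain.actions(s_prime), key=backed_up)
    a_prime = domain.choose(s_prime, greedy_u)          # exploratory or greedy
    s_a_prime = domain.afterstate(s_prime, a_prime)     # deterministic given the action
    s_next = domain.sample_environment(s_a_prime)       # stochastic environmental effects
    r = domain.reward(s_prime, a_prime)
    domain.update_model(s_prime, a_prime, s_a_prime, r) # update the p(.) and r(.) estimates

    if a_prime == greedy_u:                             # only greedy steps update rho
        rho = (1 - alpha) * rho + alpha * (r - h.get(s_a, 0.0) + h.get(s_a_prime, 0.0))
        alpha = alpha / (alpha + 1)

    h[s_a] = (1 - beta) * h.get(s_a, 0.0) + beta * (backed_up(greedy_u) - rho)
    return s_next, s_a_prime, rho, alpha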
The temporal difference error for the ASH-learning algorithm would be:
$$TDE(s_a) = \max_{u \in U(s')} \left\{ r(s', u) + \sum_{s'_u=1}^{N} p(s'_u|s', u)\,h(s'_u) \right\} - \rho - h(s_a) \qquad (3.11)$$
which I use in Equation 3.4 when using TLFs for function approximation.
ASH-learning generalizes model-based H-learning and model-free R-learning
[19]. ASH-learning reduces to H-learning if the afterstate is set to be the next
state, treating all effects to be action-dependent. Doing this, the expectation of
the maximum in Equation 3.11 drops out, and we have the Bellman equation for
H-learning in Equation 2.10.
To reduce ASH-learning to R-learning, two steps are required. First, we define
the afterstate to be the state-action pair, which makes the action-dependent effects
implicit. Second, note that for Equations 3.8 and 3.9, we assume the reward
r(s, a) in Figure 3.1 is given prior to the afterstate sa . It is also valid to instead
assume the reward is given after the afterstate. This is a conceptual difference
only, but combined with the first step above, this small change allows us to reduce
Equation 3.11 to the Bellman equation for R-learning:
$$h(s, a) = E\left[ r(s, a) + \max_{u \in U(s')} \{ h(s', u) \} - \rho^* \right] \qquad (3.12)$$
The transition probabilities $p(s'_u|s', u)$ drop out because the afterstate $(s', u)$ is
deterministic given the state and action. Under these circumstances, ASH-learning
reduces to model-free R-learning.
While in theory the afterstate may be defined as being anything from the
current state-action pair to the next state, in practice it is useful if it has low
stochasticity and small dimensionality. This is true for example when an agent’s
actions are completely deterministic and stochasticity is due to the actions of the
environment (possibly including other agents).
3.3.3 ATR-learning
In this section, I adapt ASH-learning to finite horizon domains. Instead of using
average reward, I calculate total reward. I call this variation of afterstate total-reward learning “ATR-learning”. I define the afterstate-based value function of ATR-learning as $av(s_a)$, which satisfies the following Bellman equation:

$$av(s_a) = \sum_{s' \in S} p(s'|s_a) \max_{u \in A} \left\{ r(s', u) + av(s'_u) \right\}. \qquad (3.13)$$

As with ASH-learning, I use sampling to avoid the expensive calculation of the expectation above. At every step, the ATR-learning algorithm updates the parameters of the value function in the direction of reducing the temporal difference error (TDE), i.e., the difference between the r.h.s. and the l.h.s. of the above Bellman equation:

$$TDE(s_a) = \max_{u \in A} \left\{ r(s', u) + av(s'_u) \right\} - av(s_a). \qquad (3.14)$$

The ATR-learning algorithm is shown in Algorithm 3.2.

1. Initialize afterstate value function $av(\cdot)$
2. Initialize $s$ to a starting state
3. for each step do
4.     Find action $u$ that maximizes $r(s, u) + av(s_u)$
5.     Take an exploratory action or a greedy action in the state $s$. Let $a$ be the joint action taken, $r$ the reward received, $s_a$ the corresponding afterstate, and $s'$ be the resulting state.
6.     Update the model parameters $r(s', a)$.
7.     $av(s_a) \leftarrow av(s_a) + \alpha\left(\max_{u \in A} \{ r(s', u) + av(s'_u) \} - av(s_a)\right)$
8.     $s \leftarrow s'$
9. end

Algorithm 3.2: The ATR-learning algorithm, using the update of Equation 3.14.
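For comparison, a similar sketch of one ATR-learning step (Algorithm 3.2) is given below, again with hypothetical domain hooks; in this total-reward setting no average-reward term appears in the update.

# Minimal sketch of one ATR-learning step for the episodic, total-reward setting.
# The domain hooks (actions, reward, afterstate, sample_environment, choose) are assumptions.
def atr_step(av, alpha, s, domain):
    def backed_up(state, u):
        return domain.reward(state, u) + av.get(domain.afterstate(state, u), 0.0)

    greedy_u = max(domain.actions(s), key=lambda u: backed_up(s, u))
    a = domain.choose(s, greedy_u)           # exploratory or greedy joint action
    s_a = domain.afterstate(s, a)            # deterministic agent effects
    s_next = domain.sample_environment(s_a)  # stochastic environmental effects
    domain.update_reward_model(s, a)

    # TDE of Equation 3.14: max_u {r(s',u) + av(s'_u)} - av(s_a)
    target = max(backed_up(s_next, u) for u in domain.actions(s_next))
    av[s_a] = av.get(s_a, 0.0) + alpha * (target - av.get(s_a, 0.0))
    return s_next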
3.4 Experimental Results
In this chapter, I describe two domains used to perform my experiments: a product
delivery domain, and a real-time strategy game domain. These domains illustrate
the three curses of dimensionality, each having several agents, states, actions, and
possible result states.
While discussing each domain, I will show how that domain exhibits each curse
of dimensionality. In general, the number of agents has the largest influence on the
number of states, actions, and result states – these are usually exponential in the
number of agents. If the environment includes several random actors (for example,
customers or enemy agents) this will also increase the number of possible result
states.
Finally I present my experiments involving these domains and the techniques
described so far.
3.4.1 The Product Delivery Domain
I simplify many of the complexities of the real-world delivery problem, including
varieties of products, seasonal changes in demands, constraints on the availability
of labor and on routing, the extra costs due to serving multiple shops in the same
trip, etc. Rather than building a deployable solution for a real world problem, my
goal is to scale the methods of reinforcement learning to be able to address the
combinatorial core of the product delivery problem. This approach is consistent
with the many similar efforts in the operations research literature [4].

Figure 3.2: The product delivery domain, with depot (square) and five shops (circles). Numbers indicate the probability of a customer visit each time step.

While RL
has been applied separately to inventory control [28] and vehicle routing [16,20,21]
in the past, I am not aware of any applications of RL to the integrated problem of
real-time delivery of products that includes both.
I assume a supplier of a single product that needs to be delivered to several
shops from a warehouse using several trucks. The goal is to ensure that the stores
remain supplied while minimizing truck movements. I experimented with an instance of the problem shown in Figure 3.2. To simplify matters further, I assumed
it takes one unit of time to go from any location to its adjacent location or to
execute an unload action.
The shop inventory levels and truck load levels are discretized into 5 levels 0-4.
It is easy to see that the size of the state space is exponential in the number of
trucks and the number of shops, which illustrates the first curse of dimensionality.
I experimented with 4 trucks, 5 shops, and 10 possible truck locations, which gives
a state-space size of $(5^5)(5^4)(10^4) = 19,531,250,000$.
Each truck has 9 actions available at each time step: unload 1, 2, 3, or 4
units, move in one of up to four directions, or wait. A policy for this domain
seeks to address the problems of inventory control, vehicle assignment, and routing
simultaneously. The set of possible actions in any one state is a Cartesian product
of the available actions for all trucks, and it is exponential in the number of trucks.
Thus, just picking a greedy joint action with respect to the value function requires
an exponential size search at each learning step, illustrating the second curse of
dimensionality. In my experiments with 4 trucks, $9^4 = 6561$ actions in each step
must be considered. Although this is feasible, we need a faster approach to scale
to larger numbers of trucks, since the action search occurs at each step of the
learning algorithm. Trucks are loaded automatically upon reaching the depot. A
small negative reward of −0.1 is given for every “move” action of a truck to reflect
the fuel cost.
The consumption at each shop is modeled by decreasing the inventory level by
1 unit with some probability, which independently varies from shop to shop. This
can be viewed as a purchase by a customer. In general, the number of possible next
states for a state and an action is exponential in the number of shops, since each
shop may end up in multiple next states, thus illustrating the third and final curse
of dimensionality. With my assumption of 5 shops, each of which may or may not
be visited by a customer each time step, this gives us up to $2^5 = 32$ possible next
states each time step. I call this the stochastic branching factor – the maximum
number of possible next states for any state-action pair. I also give a penalty of
−5 if a customer enters a store and finds the shelves empty.
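To make the outcome-space explosion concrete, the sketch below enumerates the possible joint next-inventory outcomes for one time step under the independent-customer model just described. The per-shop visit probabilities are illustrative placeholders, not the exact values of Figure 3.2.

from itertools import product

# Hypothetical per-shop probabilities that a customer visits this time step.
visit_prob = [0.3, 0.2, 0.25, 0.15, 0.1]

def next_inventory_outcomes(inventory):
    """Enumerate all 2^5 = 32 joint outcomes and their probabilities."""
    outcomes = []
    for visits in product([0, 1], repeat=len(inventory)):    # each shop visited or not
        p = 1.0
        next_inv = list(inventory)
        for shop, visited in enumerate(visits):
            p *= visit_prob[shop] if visited else 1 - visit_prob[shop]
            if visited:
                next_inv[shop] = max(0, next_inv[shop] - 1)   # a purchase lowers inventory by 1
        outcomes.append((tuple(next_inv), p))
    return outcomes

outcomes = next_inventory_outcomes((2, 3, 1, 4, 0))
assert len(outcomes) == 32            # the stochastic branching factor for 5 shops
assert abs(sum(p for _, p in outcomes) - 1.0) < 1e-9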
3.4.2 The Real-Time Strategy Domain
I performed several experiments on several variations of a real-time strategy game
(RTS) simulation. This RTS domain has several features that make it more challenging than the product delivery domain: enemy units respond actively to agent
actions, and can even kill the agents. Because enemy units are more powerful
than the agents, they require coordination to defeat. For problems with multiple
enemy units (discussed in Section 5.6.3), agents and enemy units may also vary
in type, requiring even more complex policies and coordination. Scaling problems
also prove particularly challenging as the number of agents and enemy units grows.
I implemented a simple real-time strategy game simulation on a 10x10 gridworld. The grid is presumed to be a coarse discretization of a real battlefield, and so units are permitted to share spaces. In this chapter, the experiments use three agents vs. a single enemy agent (see Section 5.6.3 for experiments with up to twelve starting agents and four enemy units). Units, either enemy or friendly, were defined by several features: position (in x and y coordinates), hit points (0-6), and type (archer, infantry, tower, ballista, knight, glass cannon, or hall). I also defined relational features such as the distance between agents and the enemy units, and aggregation features such as a count of the number of opposing units within range. In addition, each unit type was defined by how many starting hit points it had, how much damage it did, the range of its attack (in Manhattan distance), and whether it was mobile or not. See Table 3.3 for the differences between units.

Table 3.3: Different unit types.

Unit           HP  Damage  Range  Mobile
Archer          3       1      3  yes
Infantry        6       1      1  yes
Tower           6       1      3  no
Ballista        2       1      5  yes
Knight          6       2      1  yes
Glass Cannon    1       6      1  yes
Hall            6       0      0  no
Agents were always created as one of the weaker unit types (archer or infantry),
and enemies were created as one of the stronger types (tower, ballista, or knight).
The “glass cannon” and hall are special units described in Section 6.3.2.
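The statistics of Table 3.3 translate directly into a small data structure; one possible encoding is sketched below (the field and type names are illustrative).

from dataclasses import dataclass

@dataclass(frozen=True)
class UnitType:
    hp: int        # starting hit points
    damage: int    # damage per attack
    range: int     # attack range in Manhattan distance
    mobile: bool   # whether the unit can move

# Values taken from Table 3.3.
UNIT_TYPES = {
    "archer":       UnitType(hp=3, damage=1, range=3, mobile=True),
    "infantry":     UnitType(hp=6, damage=1, range=1, mobile=True),
    "tower":        UnitType(hp=6, damage=1, range=3, mobile=False),
    "ballista":     UnitType(hp=2, damage=1, range=5, mobile=True),
    "knight":       UnitType(hp=6, damage=2, range=1, mobile=True),
    "glass_cannon": UnitType(hp=1, damage=6, range=1, mobile=True),
    "hall":         UnitType(hp=6, damage=0, range=0, mobile=False),
}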
Agents had six actions available in each time step: move in one of the four cardinal directions, wait, or attack an enemy (if in range). Enemy units had the same options (along with a choice of whom to attack) and followed predefined policies, approaching the nearest enemy unit if mobile and out of range, or attacking the unit closest to death if within range. An attack on a unit within range always hits, inflicting damage to that unit and killing it if it is reduced to 0 hit points. Thus, the number of agents (and tasks) is reduced over time. Eventually, one side or the other is wiped out, and the battle is “won”. I also impose a time limit of 20 steps.
Due to the episodic nature of this domain, total reward reinforcement learning is
suitable. I gave a reward of +1 for a successful kill of an enemy unit, a reward
of −1 if an agent is killed, and a reward of −.1 each time step to encourage swift
completion of tasks. Thus, to receive positive reward, it is necessary for agents to
coordinate with each other to quickly kill enemy units without any losses of their
own.
Figure 3.3: Comparison of complete search, Hill climbing, H- and ASH-learning
for the truck-shop tiling approximation.
3.4.3 ASH-learning Experiments
I conducted several experiments testing ASH-learning, tabular linear functions, hill
climbing, and efficient expectation calculation (Section 3.3.1). Tests are averaged
over 30 runs of $10^6$ time steps for all results displayed here. Training is divided
into 20 phases of 48,000 training steps and 2,000 evaluation steps each. During
evaluation steps, exploration is turned off and complete search is used to select
actions for all methods. In all tests, 4 trucks and 5 shops were used.
I compared three different kinds of TLFs in my experiments. The first TLF
represents the value function h(s) as follows:
$$h(s) = \sum_{t=1}^{k} \sum_{x=1}^{n} \theta_{t,x}(p_t, l_t, i_x) \qquad (3.15)$$

where there are k trucks, n shops, and no linear features. The value function has kn terms, each term corresponding to a truck-shop pair (t, x). The nominal features are truck position $p_t$, truck load $l_t$, and shop inventory $i_x$. In analogy to tile coding, I call this the truck-shop tiling approximation.

Figure 3.4: Comparison of complete search, Hill climbing, H- and ASH-learning for the linear inventory approximation.
In this domain there are 10 truck locations, 5 shop inventory levels, and 5 levels
of truck loads. The number of parameters in the above TLF is k × n × 10 × 5 × 5,
as opposed to $10^k 5^k 5^n$, as required by a complete tabular representation.
The second TLF uses shop inventory levels ix as a linear feature instead of a
nominal feature as used by the truck-shop tiling:
$$h(s) = \sum_{t=1}^{k} \sum_{x=1}^{n} \theta_{t,x}(p_t, l_t)\, i_x \qquad (3.16)$$
Treating the shop inventory level ix as a linear feature is particularly attractive in
this model-based setting, where we need to explicitly calculate the expectation of
the value function over the possible next states. The linearity assumption makes
this computation linear in the number of shops rather than exponential (see Section 3.3.1). This also makes it unnecessary to discretize the shop inventory levels,
although we continue to discretize them in our experiments to keep the comparisons between different function approximation schemes fair. We call this scheme
the linear inventory approximation.
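A sketch of how the two TLF schemes above might be evaluated is given below; the array shapes and indexing conventions are illustrative assumptions rather than the actual implementation.

import numpy as np

# Hypothetical sizes matching the text: 10 truck locations, 5 load levels,
# 5 inventory levels, k trucks, n shops.
K_TRUCKS, N_SHOPS, N_POS, N_LOAD, N_INV = 4, 5, 10, 5, 5

# One parameter table per truck-shop pair.
theta_tiling = np.zeros((K_TRUCKS, N_SHOPS, N_POS, N_LOAD, N_INV))  # Eq. 3.15
theta_linear = np.zeros((K_TRUCKS, N_SHOPS, N_POS, N_LOAD))         # Eq. 3.16

def h_truck_shop_tiling(pos, load, inv):
    # h(s) = sum over truck-shop pairs of theta[t,x](p_t, l_t, i_x)
    return sum(theta_tiling[t, x, pos[t], load[t], inv[x]]
               for t in range(K_TRUCKS) for x in range(N_SHOPS))

def h_linear_inventory(pos, load, inv):
    # h(s) = sum over truck-shop pairs of theta[t,x](p_t, l_t) * i_x,
    # linear in the (possibly expected, non-integer) inventory levels.
    return sum(theta_linear[t, x, pos[t], load[t]] * inv[x]
               for t in range(K_TRUCKS) for x in range(N_SHOPS))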
In Figure 3.3 I compare results on the truck-shop tiling approximation, H-learning and ASH-learning, and complete search of the joint action space vs. hill
climbing search. Figure 3.4 repeats these tests for the linear inventory approximation. My results show that ASH-learning outperforms H-learning, converging faster
and to a better average reward, especially with the truck-shop tiling approximation. H-learning’s performance improves with the linear inventory approximation,
but it has a higher variance compared to ASH-learning.
From Table 3.4, we can see that the methods of addressing the stochastic
branching factor were quite successful. When using the linear inventory approximation and the corresponding optimization discussed in Section 4.3, H-learning
shaved 36 seconds (or 24%) off the execution time. ASH-learning was even more
successful at ameliorating the explosion in stochastic branching factor. The largest
gains in execution time were seen with the simplest of methods: using hill climbing
during the action search saved more time than any other method. Combined with
ASH-learning, this led to speedups of nearly an order of magnitude. These gains
in speed can be explained by comparing the average number of actions the hill climbing technique searches each time step to the number of actions considered by a complete search of the action space.

Table 3.4: Comparison of execution times for one run

Search          Algorithm       Approximation        Seconds
Complete        ASH-learning    All feature pairs    175
Complete        H-learning      Linear inventory     112
Complete        ASH-learning    Linear inventory     86
Complete        H-learning      Truck-shop tiling    148
Complete        ASH-learning    Truck-shop tiling    92
Hill climbing   H-learning      Linear inventory     19
Hill climbing   ASH-learning    Linear inventory     15
Hill climbing   H-learning      Truck-shop tiling    26
Hill climbing   ASH-learning    Truck-shop tiling    15
In these tests, hill climbing search considered an average of 44 actions before
reaching a local optimum. A complete search of the joint action space that ignores the most obviously illegal of the $9^4$ possible actions considered an average
of 385 actions: a significant savings. Moreover, my results show that hill climbing
performs as well as complete search when measuring the number of steps required
to reach convergence as well as the final average reward of the policy. This is a
very encouraging result since it suggests that it may be possible to do less search
during the learning phase and obtain just as good a result during the evaluation
or testing phase. As the number of trucks increases beyond 4, we would expect to
see even greater improvements in execution times of hill climbing search over full
search of the joint action space.
Figure 3.5: Comparison of hand-coded algorithm vs. ASH-learning with complete search for the truck-shop tiling, linear inventory, and all feature-pairs tiling
approximations.
Figure 3.5 compares the best results from Figures 3.3 and 3.4 vs. a fairly sophisticated hand-coded non-learning greedy algorithm and ASH-learning based on
the all feature-pairs tiling. The hand-coded algorithm worked by first prioritizing
the shops by the expected time until each shop will be empty due to customer
actions, then assigning trucks to the highest-priority shops. Once an assignment
has been made, it becomes much easier to assign a good move, unload, or wait
action to deliver products. It should be noted that creation of this hand-coded algorithm required considerable prior knowledge of the domain. All learning-based
approaches used ASH-learning with complete search. Note that the results in this
figure have a different scale on the Y-axis. Figure 3.5 shows that, for the vehicle
routing domain, the linear inventory approximation does not perform well. Encouragingly, the truck-shop tiling and all feature-pairs approximations converge to
a better average reward than the hand-coded algorithm. In this domain, it appears
that the all feature-pairs approximation performs better than any other. I verified
that the final average rewards reached by these tests are statistically significantly
different at the 95% confidence level using Student’s t-test.
3.4.4 ATR-learning Experiments
Relational templates (Section 3.1.2) may be easily adapted to facilitate transfer
learning between different domains. In particular, I demonstrate how knowledge
learned from experimenting with particular combinations of units (subdomains) –
for example, archers and infantry vs. towers or ballista – may be transferred to a
different subdomain by taking advantage of the properties of relational templates.
The results of the experiments are shown in Figures 3.6, 3.7, and 3.8. All figures
show the influence that learning on various combinations of source domains has on
the final performance in one or more different target domains. The experiments
were tested on the target domain for 30 runs of $10^5$ steps each, and averaged
together. I used the ATR-learning algorithm (Algorithm 3.2) for all experiments.
Each run was divided into 40 alternating training and testing phases of 3000 and
2000 steps each, respectively. I used ε = 0.1 for the training phases and ε = 0 for the
test phases. I adjusted α independently for each relational template: for “parent”
templates (#3-4), I set α = .01, and for any other template, I used α = .1. This
allows the parent templates, used only for transfer, to influence the value function
less than subdomain-specific templates.
Figure 3.6: Comparison of 3 agents vs 1 task domains.
For Figure 3.6, I trained a value function for $10^6$ steps on all subdomains not
included in the final “target” domains. I then transferred the value function to the
target domains: Infantry vs Knight or Archers and Infantry vs. Knight. Starting
distributions of units are randomized according to the allowed combinations of
units. These results show that using transfer learning improves results over using
no transfer at all. In the case of Archers and Infantry vs. Knight, agents have
no prior experience vs. the knight, yet perform better overall. This indicates that
agents can be sensitive to particular kinds of prior experience, which is verified in
Figure 3.7.
For Figures 3.7 and 3.8, I trained a value function for $10^6$ steps on the various combinations of source domains indicated in the legend. Abbreviations such as “AvK” indicate a single kind of domain – Archers vs. Knights, for example.
Figure 3.7: Comparison of training on various source domains transferred to the 3
Archers vs. 1 Tower domain.
Likewise, I, T, and B indicate Infantry, Towers, and Ballista, respectively. When training multiple domains at once, each episode was randomly initialized to one of the
allowable combinations of domains. I then transferred the parameters of the relational templates learned in these source domains to the target “AvT” or “IvK”
domains.
Our results show that additional relevant knowledge (in the form of training on
source domains that share a unit type with the target domain) is usually helpful,
though not always. For example, in the IvK target domain, training on the IvB
domain alone performs worse than not using transfer learning at all. However,
training IvB and IvT together is better than training on IvT alone, and training
on IvT is much better than no transfer at all. These results also show that irrelevant training – in this case on the AvT and AvB domains, which do not share a unit type with the IvK domain – harms transfer.

Figure 3.8: Comparison of training on various source domains transferred to the Infantry vs. Knight domain.
For the AvT target domain, transfer from any domain initially performs better
than no transfer at all, but only a few source domains continue to perform better
than no transfer by the end of each run. The “AvK” source domain provides the best training for both target domains. The IvT source domain, and the combined IvT, IvB, and IvK source domains, also perform well here.
These results confirm that the value function can be quite sensitive to the choice
of source domains to transfer from. Which source domains are most helpful is
often unpredictable. Initially, transfer from any combination of domains performs
better than no transfer at all, but only by training on certain source domains will
performance continue to improve over no transfer by the end of each run. While
we might expect irrelevant information – such as training on the IvK and IvB domains, which do not share a unit type with the AvT domain – to harm transfer, for this particular experiment that does not appear to be the case. It
is possible that training on these domains has an indirect influence on the value
function, which helps more than it harms.
3.5 Summary
I illustrated the three curses of dimensionality of reinforcement learning and showed
effective techniques to address them in certain domains. Tabular linear functions
seem to offer an attractive alternative to other forms of function approximation.
They are faster than neural nets and give opportunities to provide meaningful
prior knowledge without excessive feature engineering. Hill climbing is a cheap but
effective technique to mitigate the action-space explosion due to multiple agents.
I introduced ASH-learning and ATR-learning, which are afterstate versions of
model-based real-time dynamic programming. These algorithms are similar to Q-learning in that action-independent effects are not learned or used. However, the
value function is state-based, so it is more compact than Q-learning, much more
so for multiple agents. I have shown how ASH-learning generalizes model-based
H-learning and model-free R-learning [19]. ASH-learning reduces to H-learning if
the afterstate is set to be the next state. If the afterstate is set equal to the current
state-action pair, ASH-learning reduces to R-learning. Thus, ASH-learning carries
some advantages from both methods. As with R-learning, much of the action model
is not learned or used. However, the value function is state-based, and so it is more
compact than R-learning, especially in the decomposed agent case. Thus, ASH-learning combines the nice features of both model-based and model-free methods
and has proven itself very well in the domains I have tested it with. Similar gains
in performance should be expected for any domain in which the afterstate is either
observable or inferable. A limitation of the approach is when neither of these is
true, i.e., when the afterstate is neither inferable nor observable. It appears that
in many cases, it may be possible to induce a compact description of the action
models which in turn could help us derive the afterstate. This would make the
afterstate approach as broadly applicable as the standard model-based approach.
I have shown how relational templates may be refined from a “base” template
– applicable to all subdomains – to templates that are specialized to particular subdomains based on the features added to the template. By using several
templates with different combinations of type features, a function approximator is
created that generalizes between similar subdomains and also specializes to particular subdomains. This process allows for easy transfer of knowledge between
subdomains.
In addition I have shown how relational templates and assignment-based decomposition combine fruitfully to transfer knowledge from a domain with only a
few units to domains with many units. Although sometimes the addition of one
or more relational features is required, the decomposed value function used in this
technique allows a straightforward transfer of knowledge between domains.
In summary, I conclude that the explosions in state space, action space, and
high stochasticity may each be ameliorated.
Chapter 4 – Multiagent Learning
This chapter explores methods for implementing a multiagent learning approach
(as in Section 2.6) for model-based reinforcement learning methods. In particular,
it shows how to implement multiagent versions of H-learning and ASH-learning,
and demonstrates several experiments using these methods in the “Team Capture”
domain. This domain sets several agents (or game pieces) against an equal number
of enemy pieces in an effort to capture them.
When considering the three curses of dimensionality, a multiagent approach is
primarily of interest when there is concern about an explosion of the action space
due to a large number of agents. However, all three curses of dimensionality can be
mitigated by a multiagent approach. By considering a smaller, local state for each
agent, the number of states that must be learned by the agent is reduced. Similarly,
it may be possible to consider fewer environmental dimensions of the state – only
those that affect the local state of each agent – and thus fewer possible result states.
This may have the side effect of making the outcome space appear to become more
stochastic, due to the unmodeled effects of the other agents’ actions.
4.1 Multiagent H-learning
In a multiagent approach to reinforcement learning, the joint agent is decomposed into several agents, each of which has its own set of states and actions.
As I emphasize multiagent approaches as a method of scaling in large multiagent
domains, I allow agents to share memory and communicate free of cost. This kind
of action decomposition is useful, because the joint action space is exponentially
sized in the number of agents, so an exhaustive search is impractical. There exist
other multiagent approaches to this problem that work well for some domains;
see [9] for an alternative model-based multiagent approach based on least squares
policy iteration. That work requires the creation of a sparse coordination graph
between cooperating agents, which is not always practical.
Multiagent systems differ from joint agent systems most significantly in that
the environment’s dynamics can be affected by other agents, rendering it non-Markovian and non-stationary. In general, optimal performance may require the
agents to model each other’s goals, intentions, and communication needs [6]. Nevertheless, I pursue a simple approach of modeling each agent with its own MDP,
and adding a limited amount of carefully chosen coordination information. This
method is described in 3 steps.
4.1.1 Decomposition of the State Space
A multiagent MDP can be approximated as a set of MDPs, one for each agent.
1. Find an action $u_a \in U_a(s)$ that maximizes $E(r_a|s_{u_a}, u_a) + \sum_{q=1}^{N} p_a(q|s_{u_a}, u_a)\,h_a(q)$
2. Take an exploratory action or a greedy action in the current state $s$. Let $v_a$ be the action taken, $s_{v_a}$ be the afterstate, $s'$ be the resulting state, and $r_{imm}$ be the immediate reward received.
3. Update the model parameters for $p_a(s'|s_{v_a})$ and $E(r_a|s_{v_a}, v_a)$
4. if a greedy action was taken then
5.     $\rho_a \leftarrow (1 - \alpha)\rho_a + \alpha\,(E(r_a|s_{v_a}, v_a) - h_a(s) + h_a(s'))$
6.     $\alpha \leftarrow \frac{\alpha}{\alpha + 1}$
7. end
8. $h_a(s) \leftarrow \max_{u_a \in U_a(s)} \left\{ E(r_a|s_{u_a}, u_a) + \sum_{q=1}^{N} p_a(q|s_{u_a}, u_a)\,h_a(q) \right\} - \rho_a$
9. $s \leftarrow s'$

Algorithm 4.1: The multiagent H-learning algorithm with serial coordination. Each agent a executes each step when in state s.

Each agent’s state consists of a set of global variables that are accessible to all
agents and a set of variables local to just that agent. Similarly the joint action u
must be decomposed as a vector (u1 , u2 ..., un ) of agent actions, and the rewards
must be distributed among the agents in such a way that the total reward to the
system for a fixed joint policy at any time is the sum of the rewards received by
the individual agents:
$$E(r|s, u) = \sum_{a=1}^{m} E(r_a|s, u) \qquad (4.1)$$
Depending on the domain, rewards may be provided in already-factored form, or a
model of each agent’s reward must be learned such that the total reward is approximated by the sum of the predicted rewards for each agent, as in the above equation.
In this section, I show how to adapt H-learning into a multiagent algorithm. The
above additive decomposition of rewards yields a corresponding additive decomposition of the average reward ρ into ρa and the biases h into ha for agent a. Hence,
$h(s) = \sum_{a=1}^{m} h_a(s)$ and $\rho = \sum_{a=1}^{m} \rho_a$, and Equation 2.12 becomes:

$$TDE_a(s) = \max_{u \in U(s)} \left\{ E(r_a|s, u) + \sum_{s'=1}^{N} p(s'|s, u)\,h_a(s') \right\} - \rho_a - h_a(s) \qquad (4.2)$$
Here, E(ra |s, u) is agent a’s portion of the expected immediate reward. The ha (s)
notation indicates that agent a’s h-function is being used. The ρa for each agent
a is updated using:
$$\rho_a \leftarrow (1 - \alpha)\rho_a + \alpha\,(E(r_a|s, u) - h_a(s) + h_a(s')) \qquad (4.3)$$
Note that the h-value for each state no longer needs to be stored, as that value is
now decomposed across several local state values.
4.1.2 Decomposition of the Action Space
The next step is to decompose actions so that action selection is faster. Note that
Equation 4.2 searches the joint action space of all agents to find a joint action
that maximizes the right hand side. This is because the next state and its value
are functions of all agents’ actions. To make this explicit, we can write the joint
action u as a vector (u1 , u2 ..., un ). Thus computing the max in the right hand-side
of Equation 4.2 requires time exponential in the number of agents.
To reduce the complexity of the action choice, we want to have the agents
choose their actions independently. To do this, each agent must be able to model
the expected outcomes of the actions of the other agents. For each agent, the
unknown actions of other agents and their stochastic effects are considered as
being a part of the environment. Thus, the action models of the agent include not
only the known effects of its own actions, but also the effects of the actions of the
other agents, which are largely unknown.
Given a model of the other agents’ effects, Equation 4.2 now becomes:
$$TDE_a(s) = \max_{u_a \in U_a(s)} \left\{ E(r_a|s, u_a) + \sum_{s'=1}^{N} p(s'|s, u_a)\,h_a(s') \right\} - \rho_a - h_a(s) \qquad (4.4)$$
Here we need only examine the actions Ua (s) agent a may take in state s, rather
than the Cartesian product of all agent actions U (s). However, in addition to modeling the effects of the environment, the model variables E(ra |s, ua ) and p(s0 |s, ua )
must also model the effects of the actions of the other agents on the current agent’s
reward and the next state.
4.1.3 Serial Coordination
While the above method for multiagent H-learning can work well for some domains, problems may arise when coordination of agent actions is needed. With
the methods described in Sections 4.1.1 and 4.1.2, coordination is achieved through
the model. Each agent predicts the actions of the other agents. That prediction
may be inexact.
Figure 4.1: DBN showing the creation of afterstates $s_{a_1} \ldots s_{a_m}$ and the final state $s'$ by the actions of agents $a_1 \ldots a_m$ and the environment $E$.

To solve this problem, I introduce a limited form of coordination I call “serial coordination.” In particular, each agent chooses actions in sequence, and knows the
actions chosen by those agents that have chosen one. This is a compromise between
making completely independent choices, and a completely centralized decision.
This method assumes that agents are able to communicate their action choices
without cost; this may not always be correct for some domains. This method bears
a resemblance to the coordination graph methods of [11], but is far simpler to code
and implement. Serial coordination, together with the coordination provided by
the model (which model-free methods such as Q-learning lack), allows us to do
well with a simpler alternative to more complex forms of coordination.
Using the serial coordination method, each agent must either predict (via a
model) or know exactly the actions of other agents. The first agent to choose an
action must try to predict the actions of all other agents. Subsequent agents will
be allowed to know the actions of those agents that have selected an action, but
must model the action choices of the rest. In general, serial coordination reduces to
an MDP in which each agent takes its action, one at a time. The first agent knows
only the current state s. The second agent’s state is expanded by the knowledge
of the first agent’s action: s, a1 . The third agent’s state is determined by s, a1 , a2 ,
and so on until all agents have acted.
For domains in which the immediate effects of an agent’s action are deterministic, serial coordination is simple to implement by taking advantage of the structure
provided by a sequence of afterstates (Section 3.3.2). In this chapter, a state subscripted with an action indicates an afterstate. Thus, saj is the afterstate of state
s and action aj , and includes the effects of all actions a1 ...aj (see Algorithm 4.1).
A subscript used anywhere else indicates a particular agent.
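A minimal sketch of serial coordination built on afterstates is shown below; the per-agent hooks (actions, afterstate, local_value) are illustrative assumptions, and the local value estimate stands in for the agent's h- or av-function.

# Minimal sketch of serial coordination: agents pick actions one at a time,
# each conditioning on the afterstate produced by the agents before it.
def serial_coordination(state, agents, domain):
    chosen = []
    current = state                      # running afterstate s, s_{a_1}, s_{a_1 a_2}, ...
    for agent in agents:
        best_action, best_value, best_after = None, float("-inf"), current
        for action in domain.actions(agent, current):
            after = domain.afterstate(current, agent, action)   # deterministic agent effect
            value = domain.local_value(agent, after)            # agent's own value estimate
            if value > best_value:
                best_action, best_value, best_after = action, value, after
        chosen.append(best_action)
        current = best_after             # later agents see earlier agents' choices
    return chosen, current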
4.2 Multiagent ASH-learning
For model-based learning, we need to compute the expected value of the next state,
which is exponential in the number of relevant random variables, e.g. the effects
of other agents and the environment. To do this, I introduce a multiagent version
of ASH-learning (Section 3.3.2).
To adapt the joint-agent form of ASH-learning into a multiagent algorithm is
fairly straightforward. Each agent action va is treated as creating a sequence of
afterstates as in Figure 3.1. Equations 4.4 and 3.11 are combined to obtain:
$$TDE_a(s_{v_a}) = \max_{u_a \in U_a(s')} \left\{ E(r_a|s'_{u_a}, u_a) + \sum_{s'_{u_a}=1}^{N} p(s'_{u_a}|s', u_a)\,h_a(s'_{u_a}) \right\} - \rho_a - h_a(s_{v_a}) \qquad (4.5)$$
The complete algorithm is shown in Algorithm 4.2. Note that it is necessary to
store an afterstate for each agent for use in the update step.
1. Find an action $u_a \in U_a(s')$ that maximizes $E(r_a|s'_{u_a}, u_a) + \sum_{s'_{u_a}=1}^{N} p(s'_{u_a}|s', u_a)\,h_a(s'_{u_a})$
2. Take an exploratory action or a greedy action in the state $s'$. Let $v'_a$ be the action taken, $s'_{v'_a}$ be the afterstate, and $s''$ be the resulting state.
3. Update the model parameters $p(s'_{v'_a}|s', v'_a)$ and $E(r_a|s'_{v'_a}, v'_a)$ using the immediate reward received.
4. if a greedy action was taken then
5.     $\rho_a \leftarrow (1 - \alpha)\rho_a + \alpha\,(E(r_a|s'_{v'_a}, v'_a) - h_a(s_{v_a}) + h_a(s'_{v'_a}))$
6.     $\alpha \leftarrow \frac{\alpha}{\alpha + 1}$
7. end
8. $h_a(s_{v_a}) \leftarrow (1 - \beta)\,h_a(s_{v_a}) + \beta\left(\max_{u_a \in U_a(s')}\left\{ E(r_a|s'_{u_a}, u_a) + \sum_{s'_{u_a}=1}^{N} p(s'_{u_a}|s', u_a)\,h_a(s'_{u_a}) \right\} - \rho_a\right)$
9. $s \leftarrow s''$
10. $s_{v_a} \leftarrow s'_{v'_a}$

Algorithm 4.2: The multiagent ASH-learning algorithm. Each agent a executes each step when in state $s'$.
4.3 Experimental Results
This section introduces the Team Capture domain and uses it to illustrate issues
of scaling. I discuss how to implement this domain using multiagent learning
algorithms. I then discuss several experiments demonstrating these methods.
4.3.1 Team Capture domain
The Team Capture domain is a competitive two-sided game played on a square
grid. Sides are colored white and black, and control an equal number of pieces
(at least two per side). Each side takes turns moving all their pieces.

Figure 4.2: An example of the team capture domain for 2 pieces per side on a 4x4 grid.

Figure 4.3: The tiles used to create the function approximation for the team capture domain.

Each piece
has five actions available each turn: stay in position, or move up, down, left, or
right. Pieces may not leave the board or enter an occupied square. The goal is to
capture opposing pieces as quickly as possible. Any side may capture an opposing
piece by surrounding it on opposite sides with two or four of its own pieces. If this
occurs, the side that captured the piece receives a reward of 1 and the captured
piece is randomly moved to any empty square. Figure 4.2 illustrates an example
of the Team Capture domain for the two vs. two piece problem on a 4x4 grid. In
this example, if white piece 1 moves down, it will capture black piece 2 with white
piece 2.
Taking the joint state and action spaces of all pieces to be the set of basic
states and actions, the team capture domain can be reduced to a standard joint
agent MDP. In this model, the optimal policy is defined as the optimal joint action
for each joint state and can be computed in theory using single agent H-learning.
However, this immediately runs into huge scaling problems as the number of agents
(pieces) increases: the set of actions available to a joint agent each time step grows
exponentially in the number of pieces. Hence a multiagent approach should be
taken, which decomposes the joint agent into several coordinating agents.
A multiagent RL algorithm may be used to learn in the team capture domain
by splitting the reward received for capturing an opposing piece evenly between the
two or four agents (pieces) responsible for the capture. We can model the effects of other agents’ actions by noting that they contribute to an agent’s expected local reward only when they help capture an opposing piece. The probability P(a capture will occur | the number of allied pieces, up to 4, that might assist in a capture) should be measured. This model requires only 5 parameters. More advanced
models are possible, but I found this to work well.
I used Tabular Linear Function approximation (Section 3.1.1) to reduce the
number of parameters for each agent:
$$h_a(s) = \sum_{x=1}^{t} \theta_x(g_1(s, x, a), \ldots, g_n(s, x, a)) \qquad (4.6)$$
This equation represents a set of t tiles laid down over the local area of the grid for
each agent a. Each tile is represented by a function θx (in this case, a table) and
overlays n grid positions g1 (s, x, a), ..., gn (s, x, a). As can be seen in Figure 4.3, I
used a set of 20 overlapping 2x2 tiles surrounding the local area of each piece. Each
tile contains the state information relevant to the four grid squares it occupies.
Each square in the tile may have four possible values: empty, white, black, or
“off the board”. This gives us a total of $20 \times 4^4 = 5120$ parameters. For this
domain, all agents share the same function approximator, so there is no duplication
of parameters between agents. This decision trades off the possible gains from
having each agent learn specialized behavior for fewer parameters and thus faster
convergence. The subscript x in θx indicates that we take a sum over t weights,
each weight taken from a table. Each table is defined over the parameters of θ.
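The sketch below shows one way the tiled value function of Equation 4.6 could be evaluated; the tile indexing and cell encoding are illustrative assumptions.

import numpy as np

# Hypothetical encoding for Equation 4.6: 20 overlapping 2x2 tiles around a
# piece, each cell taking one of four values (empty, white, black, off-board).
N_TILES, CELLS_PER_TILE, N_VALUES = 20, 4, 4
theta = np.zeros((N_TILES, N_VALUES ** CELLS_PER_TILE))   # 20 * 4^4 = 5120 parameters

def tile_index(cell_values):
    """Flatten the 4 cell values of one tile into a single table index."""
    idx = 0
    for v in cell_values:          # v in {0: empty, 1: white, 2: black, 3: off-board}
        idx = idx * N_VALUES + v
    return idx

def h_agent(tile_observations):
    """tile_observations: list of 20 tuples of 4 cell values around the agent."""
    return sum(theta[x, tile_index(cells)]
               for x, cells in enumerate(tile_observations))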
I chose to use only limited, “local” state information to define the value function
for each agent in order to limit the number of parameters that must be learned
and thus decrease convergence time. In addition, there is a danger that if non-local
information is permitted to be learned, that information could be very noisy and
could potentially cause learning to suffer.
To adapt team capture to ASH-learning, the effects of each agent’s actions are
split into immediate effects (the movement of a piece, which is deterministic) and
environmental effects (the movement of other agents and enemy pieces). When
using serial coordination, the afterstate incorporates the effects of each piece on
the agent’s side moving up, down, left, or right, but the effects of opposing pieces
being captured, the moves made by the opposing team, and the agent’s pieces being
captured have not been calculated yet. None of these things require knowledge of
the action taken by the agent in order to calculate; only the afterstate.
4.3.2 Experiments
I conducted several experiments testing the methods discussed in this chapter.
Tests are averaged over 30 runs of $10^6$ time steps each. Runs are divided into 20 phases of 48,000 training steps and 2,000 evaluation steps each. During evaluation steps, exploration is turned off. The opposing agent uses a random policy.

Figure 4.4: Comparison of multiagent, joint agent, H- and ASH-learning for the two vs. two Team Capture domain.
Figure 4.4 compares my tests for 2 vs. 2 pieces on a 4x4 grid (see Figure 4.2). I
compared the results of using multiagent approaches vs. joint agent approaches,
and H-learning vs. ASH-learning. I also include the average reward received by a
greedy hand-coded policy that represents my best effort at creating a good solution
to this domain. My multiagent approaches all used decomposed state and action
spaces with serial coordination (which I found to be critical to allow successful
capture of opposing pieces).
Figures 4.5 and 4.6 display the result of applying ASH-learning and multiagent
approaches for 4 vs. 4 pieces on a 6x6 grid and 10 vs. 10 pieces on a 10x10 grid. H-learning is impractical for these tests due to very large stochastic branching factors.
Figure 4.5: Comparison of ASH-learning approaches and hand-coded algorithm
for the four vs. four Team Capture domain.
A joint agent is similarly impractical for the 10-piece domain. I used the function
approximation shown in Equation 4.6 to mitigate the problem of enormous state
spaces. I also compare my results to a good hand-coded agent.
From these results, we see that multiagent approaches perform nearly as well as
their joint agent counterparts: indeed, there is no statistically significant difference
between the multiagent and joint agent approaches for two pieces (using a 95%
confidence interval). The multiagent approaches for two pieces used twice as many
parameters as the joint agent approach, and so could further benefit from function
approximation. For this domain, H-learning outperforms ASH-learning (the difference is small, although statistically significant at a 95% confidence level); however, I
have observed that for some domains (including a product delivery/vehicle routing
domain) ASH-learning outperforms H-learning. The relative performance of these two algorithms appears to depend on the domain, function approximation used, and learning parameters.

Figure 4.6: Comparison of multiagent ASH-learning to hand-coded algorithm for the ten vs. ten Team Capture domain.

The greatest benefit of my approaches can be seen when
comparing the computation time required to create the results for Figures 4.4, 4.5,
and 4.6. As seen in Table 4.1, ASH-learning is much faster than H-learning. For two pieces, it is actually faster to use a joint agent approach rather than a multiagent approach. However, as the number of pieces increases, multiagent approaches rapidly become the only practical option. For ten pieces, I cannot use a joint agent approach at all.

Table 4.1: Comparison of execution times in seconds for one run of each algorithm. Column labels indicate number of pieces. “–” indicates a test requiring impractically large computation time.

Algorithm                  2     4    10
Multiagent ASH-learning    1    31    81
Joint ASH-learning         2   750     –
Multiagent H-learning     56     –     –
Joint H-learning          22     –     –
4.4 Summary
This chapter has shown that multiagent decomposition and serial coordination
taken together go a long way towards scaling model-based reinforcement learning
up to handle real-world problems. Scaling limitations of H-learning prevent me
from demonstrating similar benefits in this chapter.
A multiagent approach is particularly useful for solving the three curses of
dimensionality and scaling reinforcement learning problems to large domains. This
is because multiagent techniques can simultaneously address each of the curses: by
requiring each agent to consider its local state, the burden of memorizing a value for
each state is lessened. By considering each agent’s actions independently, actions
may be selected in time linear rather than exponential in the number of agents.
And finally, if each agent requires knowledge of only a subset of the objects in the
environment in order to successfully calculate the expected value of the next state,
the outcome space explosion may be greatly ameliorated.
Serial coordination is one of the simplest forms of multiagent model-based coordination. As with any multiagent coordination method, it is very much a compromise between quality and speed of action selection, in this case erring on the side
of speed. Better coordination techniques should result in improved performance;
this is a topic I will discuss in the next chapter.
In summary, I conclude that the multiagent versions of H-learning and ASH-learning greatly moderate the explosions in state and action spaces observed in
real-world domains. The experimental results in the Team Capture domain show
that this approach may be scaled to larger and more practical problems.
Chapter 5 – Assignment-based Decomposition
The simple multiagent learning approach discussed in the previous chapter is adequate for certain domains, but what if greater coordination is required for best
performance? In this chapter I explore a more sophisticated coordination method.
I start by defining a multiagent assignment MDP, or MAMDP: this is a cooperative multiagent MDP with two or more tasks that agents are required to
complete in order to receive a reward. Examples of such a domain might be fire
and emergency response in a typical city, product delivery using multiple trucks
to deliver to different customers, or a real-time strategy game in which several
friendly units must cooperate to destroy multiple enemy units. An assignment is
a mapping from agents to tasks. Given an assignment, agents are required to act
in accordance with it until the task is complete or the assignment is changed. The
task assignment may change every time step, every few time steps, every time a
new task arrives, every time a task completes, or every time all tasks complete.
In principle, each of these different cases can be modeled as a joint MDP with
appropriate changes to the state and action spaces. For example, the assignment
can be made a part of the state, and changes to the assignment can be treated
as part of the joint action. Conditions may be placed on when an assignment is
allowed to change. Weaker conditions will ensure a more flexible policy and hence
more potential reward. There is usually a reward for completing any task. The
68
goal is to maximize the total expected discounted reward.
More formally, an MAMDP extends the usual multiagent MDP framework to
a set of n agents G = {g} (|G| = n). Each agent g has its own set of local state Sg
and actions Ag . We also define a set of tasks T = {t}, each associated with a set of
state variables St that describe the task. The set of tasks (and corresponding state
variables required to describe them) may vary between states. The joint action
space is the Cartesian product of the actions of all n agents: A = A1 ×A2 ×...×An .
The joint state space is the Cartesian product of the states of all agents and all
tasks. The reward is decomposed between all $n$ agents, i.e., $R(s, a) = \sum_{i=1}^{n} R_i(s, a)$,
where Ri (s, a) is the agent-specific reward for state s and action a.
β : T → Gk is an assignment of tasks to agents; here k indicates an upper
bound on the number of agents that may be assigned to a particular task. β(t)
indicates the set of agents assigned to task t. Let sβ(t) denote the joint states of
all agents assigned to t, and aβ(t) denote the joint actions of all agents assigned to
task t. The total utility Q(s, a) depends on the states of all tasks and agents s and
actions a of all agents.
To solve MAMDPs, I propose a solution that splits the action selection step
of a reinforcement learning algorithm into two levels: the upper assignment level,
and the lower task execution level. At the assignment level, agents are assigned to
tasks. Once the assignment decision is made, the lower level action that each agent
should take to complete its assigned task is decided by reinforcement learning in
a smaller state space. This two-level decision making process occurs each timestep of the reinforcement learning algorithm, taking advantage of the opportunistic
reassignments.
At the assignment level, interactions between the agents assigned to different
tasks are ignored. This action decomposition exponentially reduces the number
of possible actions that need to be considered at the lowest level, at a cost of
increasing the number of possible assignments that must be considered. Because
each agent g need only consider its local state sg and task-specific state st to
come to a decision, this method can greatly reduce the number of parameters that
it is necessary to store. This reduction is possible because rather than storing
separate value functions for each possible agent and task combination, a single
value function may be shared between multiple agent-task assignments.
In the next sections, I discuss the particulars of my implementations of model-free and model-based methods, as well as various search techniques that can be
used to speed up the assignment search process. I also analyze the time and
space complexity of assignment-based decomposition as compared to more typical
methods.
5.1 Model-free Assignment-based Decomposition
The assignment β denotes the complete assignment of agents to tasks, and g = β(t)
denotes the group of agents assigned to task t. The Q-function Q(st , sg , ag ) denotes
the discounted total reward for a task t and set of agents g starting from a local
task state st and joint agent state sg , and joint actions ag of the agents in the
team. For an assigned subset of agents, the Q-function is learned using standard
Q-learning approaches:

$$Q(s_t, s_g, a_g) \leftarrow Q(s_t, s_g, a_g) + \alpha\left[ r_g + \gamma \max_{a'} Q(s'_t, s'_g, a') - Q(s_t, s_g, a_g) \right] \qquad (5.1)$$

1. Initialize $Q(s, a)$ optimistically
2. Initialize $s$ to any starting state
3. for each step do
4.     Assign tasks $T$ to agents $M$ by finding $\arg\max_{\beta} \sum_t v(s, \beta(t), t)$, where $v(s, g, t) = \max_{a \in A_g} Q(s_t, s_g, a)$
5.     For each task $t$, choose actions $a_{\beta(t)}$ from $s_{\beta(t)}$ using an ε-greedy policy derived from $Q$
6.     Take action $a$, observe rewards $r$ and next state $s'$
7.     foreach task $t$ do $Q(s_t, s_{\beta(t)}, a_{\beta(t)}) \leftarrow Q(s_t, s_{\beta(t)}, a_{\beta(t)}) + \alpha\left[ r_{\beta(t)} + \gamma \max_{a'} Q(s'_t, s'_{\beta(t)}, a') - Q(s_t, s_{\beta(t)}, a_{\beta(t)}) \right]$
8.     $s \leftarrow s'$
9. end

Algorithm 5.1: The assignment-based decomposition Q-learning algorithm.
The assignment problem described is nontrivial – the number of possible assignments is exponential in the number of agents. The space of possible assignments
may be searched by defining a value v(s, g, t) of a state s, task t, and set of agents
g. This value is derived from the underlying value function, given by Equation 5.1:
$$v(s, g, t) = \max_{a \in A_g} Q(s_t, s_g, a) \qquad (5.2)$$
Various techniques to search the assignment space using this value are discussed
in Section 5.3.
The final model-free assignment-based decomposition algorithm is shown in
Algorithm 5.1. This algorithm is similar to the ordinary Q-learning algorithm,
with several key differences: before the normal action-selection step in line 4, we
search for the best available assignment. If we have multiple tasks, in line 5 we
assume actions and states are factored by task. Likewise rewards are also factored.
In line 7, Q-values are updated for each task according to the local states and
actions associated with that task.
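To illustrate the assignment step (line 4 of Algorithm 5.1), the sketch below scores every mapping of agents to tasks by the summed task values and keeps the best; the exhaustive enumeration is shown only for clarity, and the faster search techniques of Section 5.3 would replace it. The representation of an assignment and the helper names are assumptions.

from itertools import product

# Minimal sketch of the assignment step: score every mapping of agents to
# tasks with sum_t v(s, beta(t), t) and keep the best. Exhaustive enumeration
# is exponential in the number of agents; it is shown only for clarity.
def best_assignment(state, agents, tasks, v):
    best, best_score = None, float("-inf")
    for choice in product(range(len(tasks)), repeat=len(agents)):
        # beta maps each task to the group of agents assigned to it
        beta = {t: tuple(g for g, c in zip(agents, choice) if c == t_idx)
                for t_idx, t in enumerate(tasks)}
        score = sum(v(state, beta[t], t) for t in tasks)
        if score > best_score:
            best, best_score = beta, score
    return best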
5.2 Model-based Assignment-based Decomposition
This section describes a model-based implementation of assignment-based decomposition, which has considerable advantages over its model-free counterpart. In
addition to requiring many fewer parameters than model-free methods, the time
required to calculate the assignment is greatly reduced, as will be shown below.
The value v(st , sg ) denotes the maximum expected total reward for a set of
agents g assigned to task t, starting from the joint state st , sg by following their
best policy and assuming no interference from other agents. Similarly, $av(s_{a_g})$ is defined as the value of the afterstate of $s$ due to the actions $a_g$ of agents $g$.
The temporal difference error (TDE) of the afterstate-based value function for an
assigned subset of agents using ATR-learning is as follows:
$$TDE(s_{a_g}) = \max_{u_g \in A_g} \left\{ r_g(s', u_g) + av(s'_{u_g}) \right\} - av(s_{a_g}). \qquad (5.3)$$
As with model-free assignment-based decomposition, the number of possible
assignments is exponential in the number of agents. We can search over the space
of possible assignments by defining a value y(s, g, t) of a state s, set of agents g,
and task t. This value is derived from the underlying state-based value function
v(st , sg ):
\[
y(s, g, t) = v(s_t, s_g). \tag{5.4}
\]
Note this is a considerable simplification of the corresponding model-free calculation of Equation 5.2.
The value function for ATR-learning is based on afterstates, so the value
v(st , sg ), being based on states, must be learned separately. The update for this
value is based on the temporal difference error given below:
\[
TDE(s_t, s_g) = r_g + \max_{u_g \in A_g} \big\{ r_g(s', u_g) + av(s'_{u_g}) \big\} - v(s_t, s_g) \tag{5.5}
\]
Here rg is the immediate reward received for task t and agents g. This equation
may re-use the calculation of the max found in Equation 5.3. This max is the long-term expected total reward for being in afterstate sa and thereafter executing the
optimal policy. This afterstate value does not account for the immediate reward
received for being in state s and taking action a, and so it must be added in here.
To find the best assignment of tasks to agents over the long run, we need to
compute the assignment that maximizes the sum of the expected total reward (in
the case of ATR-learning) until task completion plus the expected total reward that
the agents could collect after that. Unfortunately this leads to a global optimization
problem which we want to avoid. So we ignore the rewards after the first task is
completed and try to find the assignment that maximizes the total expected reward accumulated by all agents for that task. It turns out that this approximation is not so drastic, because the agents get to reassess the value of the task assignment after every step and opportunistically exchange tasks.

1:  Initialize state and afterstate h-functions v(·) and av(·)
2:  Initialize s to any starting state
3:  for each step do
4:      Assign tasks T to agents M by finding arg max_β Σ_t y(s, β(t)), where y(s, g, t) = v(s_t, s_g)
5:      For each task t, find joint action u_β(t) ∈ A_β(t) that maximizes r_β(t)(s, u_β(t)) + av(s_{u_β(t)})
6:      Take an exploratory action or a greedy action in the state s. For each set of agents β(t), let a_β(t) be the joint action taken, r_β(t) the reward received, s_{a_β(t)} the corresponding afterstate, and s' the resulting state.
7:      Update the model parameters r_β(t)(s, a_β(t)).
8:      foreach task t do
9:          Let Target_t = max_{u_β(t)∈A_β(t)} { r_β(t)(s', u_β(t)) + av(s'_{u_β(t)}) }
10:         av(s_{a_β(t)}) ← av(s_{a_β(t)}) + α(Target_t − av(s_{a_β(t)}))
11:         v(s_t, s_β(t)) ← v(s_t, s_β(t)) + α(r_β(t) + Target_t − v(s_t, s_β(t)))
12:     end
13:     s ← s'
14: end
Algorithm 5.2: The ATR-learning algorithm with assignment-based decomposition, using the update of Equations 5.3 and 5.5.
The final algorithm for assignment-based decomposition with ATR-learning is
shown in Algorithm 5.2. We begin by initializing the starting state and two value
functions v(·) and av(·). v(·) stores the state-based value function and is used only
to determine assignments. The afterstate-based value function is stored in av(·)
and is used to determine the task execution-level actions of each agent. Figure 3.1
may be helpful in understanding the progression in time of states and afterstates.
We begin each step of the algorithm by choosing an assignment. The use of v(·)
rather than av(·) here is critical: using an afterstate-based function to perform a
search would require a search over all actions available in the current state, slowing
the algorithm tremendously. Once we have an assignment, we then search for the
task execution-level actions. After taking actions, we obtain a decomposed reward
signal and resulting afterstate and state. We can then update the model of the
expected immediate reward. Finally, for each task we calculate the TD-error of
av(·) and use it to update v(·) and av(·), then go to the next step.
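The per-task updates on lines 9-11 of Algorithm 5.2 (Equations 5.3 and 5.5) can be sketched as follows. The table keys and the next_state_candidates structure (one expected-reward/afterstate pair per candidate joint action in s') are illustrative assumptions, not the thesis implementation.

from collections import defaultdict

# Tabular state and afterstate value functions; key encodings are illustrative.
v = defaultdict(float)    # v(s_t, s_g): state-based, used only for assignments
av = defaultdict(float)   # av(s_a): afterstate-based, used for action selection

def atr_update(state_key, afterstate_key, reward, next_state_candidates, alpha=0.1):
    """One ATR-learning step for a single task (lines 9-11 of Algorithm 5.2).

    next_state_candidates: list of (expected_reward, next_afterstate_key) pairs,
    one per joint action u_g available in the next state s'.
    """
    # Target_t = max_u { r(s', u) + av(s'_u) }   (the max term of Equation 5.3)
    target = max(r + av[k] for r, k in next_state_candidates)
    # Afterstate update: av(s_a) <- av(s_a) + alpha * (Target_t - av(s_a))
    av[afterstate_key] += alpha * (target - av[afterstate_key])
    # State update (Equation 5.5): v(s_t, s_g) <- v + alpha * (r_g + Target_t - v)
    v[state_key] += alpha * (reward + target - v[state_key])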
5.3 Assignment Search Techniques
The problem of searching the assignment space for the best possible assignment
is very important, as it can be the main difficulty in scaling assignment-based
decomposition to large domains. Here, I present several options:
Fixed assignment: This is the simplest possible option: no assignment search
at all. Assignments are arbitrarily set at the start of an episode and never change.
Exhaustive search: One straightforward method that guarantees optimal assignment is to exhaustively search for the mapping β that returns the maximum total value for all tasks, max_β Σ_t y(s, β(t), t). However, with many agents, this search
could become intractable. A faster approximate search technique is necessary,
which I introduce next.
Sequential greedy assignment: This search uses a simple method of greedily
assigning agents to high-value tasks: for each task t we consider all sets of agents
that might be assigned and choose the set g that provides the maximum value
v(s, g, t). We remove agents g from future consideration, and repeat until all tasks
or agents have been assigned.
Swap-based hill climbing: This method uses the assignment at the previous
step (or a random assignment for the first time this search occurs) as the starting
point of a hill climbing search of the assignment space. At each step of the search,
it considers all possible next states that can be obtained by swapping a set of
agents from one task with another set of the same size assigned to a different task.
It then commits to the swap resulting in the most improvement, repeating until
convergence.
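The two approximate searches above can be sketched in a few lines of Python before turning to bipartite search below. This is a simplified variant: the greedy pass considers groups of up to k agents per task, and the hill climber uses first-improvement single-agent swaps rather than best-improvement swaps of arbitrary equal-sized sets. The value(group, task) callable is assumed to wrap the learned v(s, g, t) for the current state.

import itertools

def greedy_assignment(tasks, agents, value, k):
    # Sequential greedy assignment: for each task, pick the best group of at
    # most k still-unassigned agents according to value(group, task).
    remaining = set(agents)
    assignment = {}
    for t in tasks:
        groups = [g for r in range(1, k + 1)
                  for g in itertools.combinations(sorted(remaining), r)]
        if not groups:
            break
        best = max(groups, key=lambda g: value(g, t))
        assignment[t] = best
        remaining -= set(best)
    return assignment

def swap_hill_climbing(assignment, value, max_iters=100):
    # Swap-based hill climbing (simplified): swap single agents between two
    # tasks whenever the swap improves the summed value, until no swap helps.
    tasks = list(assignment)
    for _ in range(max_iters):
        improved = False
        for t1, t2 in itertools.combinations(tasks, 2):
            g1, g2 = assignment[t1], assignment[t2]
            current = value(g1, t1) + value(g2, t2)
            for a1, a2 in itertools.product(g1, g2):
                new_g1 = tuple(sorted(set(g1) - {a1} | {a2}))
                new_g2 = tuple(sorted(set(g2) - {a2} | {a1}))
                if value(new_g1, t1) + value(new_g2, t2) > current:
                    assignment[t1], assignment[t2] = new_g1, new_g2
                    improved = True
                    break
            if improved:
                break
        if not improved:
            break
    return assignment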
Bipartite search: The Hungarian method [13] is a combinatorial optimization
algorithm which solves the assignment problem in polynomial time. I adapted this
technique for use in solving the assignment problem faced by assignment-based
decomposition. I adapt the Hungarian method (or Kuhn-Munkres algorithm as it
is sometimes called) to assign multiple agents to each task by copying each task as
many times as necessary to match the number of agents (one copy for each “slot”
available to agents for completing a task, given by the upper bound on agents per
task k). This creates an n × n matrix defining a bipartite graph, which can be
solved by the Hungarian method in polynomial time. The weight of each edge of
the graph is given by yg , where g is a single task and agent. The solution to the
bipartite graph consists of an assignment of each task to a set of agents.
A serious problem with this approach is that each edge of the graph, or entry
in the n × n matrix, cannot contain any information other than that pertaining
to the single edge and task of that edge. In other words, we must give up any
coordination information when making assignment decisions. While in principle
this could cause some serious sub-optimalities, in practice the assignment found
by this method performs very well.
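For the bipartite option, scipy.optimize.linear_sum_assignment solves the same linear assignment problem in polynomial time (its internal algorithm need not be the Hungarian method, but it returns an optimal matching). The sketch below builds the slot-duplicated cost matrix described above; edge_value(agent, task) is an assumed wrapper around the learned single agent-task value y, and the function names are hypothetical.

import numpy as np
from scipy.optimize import linear_sum_assignment

def bipartite_assignment(agents, tasks, k, edge_value):
    # Copy each task into k "slots" (k = upper bound on agents per task),
    # giving one bipartite node per slot, then solve the matching problem.
    # scipy minimizes cost, so values are negated.
    slots = [(t, s) for t in tasks for s in range(k)]
    cost = np.array([[-edge_value(a, t) for (t, _) in slots] for a in agents])
    rows, cols = linear_sum_assignment(cost)     # optimal matching in polynomial time
    assignment = {t: [] for t in tasks}
    for i, j in zip(rows, cols):
        assignment[slots[j][0]].append(agents[i])
    return assignment

# Hypothetical usage: 4 predators, 2 prey, at most 2 predators per prey.
# assignment = bipartite_assignment(["p1", "p2", "p3", "p4"], ["prey1", "prey2"], 2, value_fn)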
5.4 Advantages of Assignment-based Decomposition
The time complexity of assignment-based decomposition can be analyzed as follows. The time required to perform an exhaustive search of the assignment space is
the sum of the time required to pre-calculate v(s, g, t) values and the time required
to perform the actual search. The time required to calculate a single v(s, g, t) value is O(|A|^k), where |A| is the number of actions a single agent may take, and k is an upper bound on the number of agents that may be assigned to a task. Therefore, the time required to pre-calculate all values of v(s, g, t) is O(|A|^k |T| C(n, k)), where C is the choice function (binomial coefficient), |T| is the number of tasks, and n is the number of agents. An exhaustive search requires O(n!/(k!)^{n/k}) time, which is
proportional to the number of ways to assign k agents each to n/k tasks. This is
significantly reduced by any approximate search algorithm, such as hill climbing
or bipartite search.
The advantage of assignment-based decomposition is much more apparent when
we consider the space complexity of the value function. A value function over the entire state-action space would require O(|S_t|^{|T|} |S_a|^n |A|^n) parameters, where |S_t| and |S_a| are the sizes of the state required to store local parameters for each task and agent respectively. Assignment-based decomposition uses considerably fewer parameters to store the task-based value function Q(s_t, s_g, a). Instead, we need space of only O(|S_t| |S_a|^k |A|^k) parameters for each task, which is polynomial for fixed k.
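As a worked illustration of this gap, with hypothetical sizes |S_t| = |S_a| = 10, |A| = 5, |T| = 3, n = 6 agents, and k = 2 agents per task (numbers chosen only to make the comparison concrete):

\[
|S_t|^{|T|}\,|S_a|^{n}\,|A|^{n} = 10^{3}\cdot 10^{6}\cdot 5^{6} \approx 1.6\times 10^{13}
\qquad\text{vs.}\qquad
|T|\,|S_t|\,|S_a|^{k}\,|A|^{k} = 3\cdot 10\cdot 10^{2}\cdot 5^{2} = 75{,}000,
\]

so in this toy setting the decomposed representation is roughly eight orders of magnitude smaller.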
A further advantage of the additive decomposition of the task execution level in
Equation 5.6 is that each Qi,j function may share the same parameters. Generalizing, or transferring, that single shared value function to additional tasks and/or
agents can be quite simple. In many cases, no additional learning is necessary.
The same value function can often be used, for example, in domains with twice as
many tasks and agents as the original domain. Only the size of the search space
at the assignment level needs to grow.
5.5 Coordination Graphs
A coordination graph can be described over a system of agents to represent the
coordination requirements of that system. Such a graph contains a node for each
agent and an edge between pairs of agents if they must directly coordinate their
actions to optimize some particular Qij . See Figure 5.1 for an example coordination
graph showing some possible coordination requirements between four agents.
This section examines the potential of using coordination graphs to solve multiagent assignment MDPs. In most cases, if each agent independently pursues a
policy to optimize its own Qi (see Equation 2.14), this will not optimize the total
utility, since each agent’s actions affect the state and the utility of others. Hence,
collaborative agents need to coordinate. A coordination graph allows the agents
to specify and model coordination requirements [8]. The presence of an edge in a
coordination graph indicates that two agents should coordinate their action selection, for example, so as to avoid collisions. A coordination graph may be specified
as part of the domain, or if the graph is context specific [10], as a combination
of rules provided with the domain. This set of rules determines whether an edge
between any two vertices of the graph should exist, given the state.
As in [12] I use an edge-based decomposition of a context-specific coordination graph. The global Q-function for such a decomposition is approximated by a sum over all local Q-functions, each defined over an edge (i, j) of the graph:

\[
Q(s, a) = \sum_{(i,j) \in E} Q_{ij}(s_{ij}, a_i, a_j), \tag{5.6}
\]

where s_ij ⊆ s_i ∪ s_j is the subset of state variables relevant to agents i and j, and (i, j) ∈ E describes a pair of neighboring nodes (i.e., agents). The optimal action for a coordination graph is given by arg max_a Q(s, a). As with the agent-based Q-function, the notation Qij indicates only that the Q-value is edge-based. Parameters may or may not be shared between edges.

Figure 5.1: A possible coordination graph for a 4-agent domain. Q-values indicate an edge-based decomposition of the graph.
Coordination graphs are a powerful method for coordinating multiple agents,
but they are ill-fitted for solving multiagent assignment problems with arbitrary
coordination constraints. I show a simple proof of this below. For simplicity I
equate tasks and actions and assume that each action is relevant to a single task
a or b:
Proposition 1 Arbitrary reward functions from the joint action space A_1 × ... × A_n to {0, 1} are not expressible using an edge-based decomposition over a coordination graph.

Proof: Let A_1 = ... = A_n = {a, b}, hence there are 2^n joint actions. Each joint action may be mapped to 0 or 1 reward, leading to 2^(2^n) possible functions. To represent these functions, we need at least 2^n bits. A coordination graph over n agents has at most O(n^2) edges. Each edge has at most 4 constraints, one for each possible action pair. Thus, we have room for specifying only O(n^2) values, which are not sufficient to represent 2^(2^n) possible functions. □
Although coordination graphs alone are not sufficient to solve MAMDP problems, assignment-based decomposition is also sometimes insufficient to coordinate
complex MAMDPs. Assignment-based decomposition provides sufficient coordination if the problem is completely decomposed after assignments have been made; however, this is often not the case. The possibility remains of interference between
agents assigned to different tasks. To handle such interactions, I define a coordination graph over agents acting on the task execution level. An edge should be
placed between two agents when the actions of those agents might interfere, such
as when a collision is possible or the two agents might need to share a common
resource. Such coordination must be context-specific, since agents are constantly
changing states. Thus, it is necessary to combine the assignment decisions with
context-specific coordination at the task execution level. To that end, I adapt some
methods described in [11] and [12].
5.5.1 The Max-plus Algorithm
If we define a coordination graph over several agents, we must use an action selection algorithm that can take advantage of this structure. We wish to maximize the global payoff max_a Q(s, a) (where Q(s, a) is given by Equation 5.6). Initial work in coordination graphs suggested a variable elimination (VE) technique [9] to solve this problem. However work in [12] shows that VE techniques can be slow to solve large coordination graphs, require a lot of memory, and in addition can be quite complex to implement. Instead, [12] proposed using the Max-plus algorithm, which trades some solution quality for a great increase in solution speed.

Figure 5.2: Messages passed using Max-plus. Each step, every node passes a message to each neighbor.
The Max-plus algorithm is a message-passing algorithm based on loopy belief
propagation for Bayesian networks [15, 29, 30]. Agents in Max-plus instead pass
(normalized) values indicating the locally optimal payoff of each agent’s actions
along edges of the coordination graph (see Figure 5.2). Max-plus finds the global payoff by having each agent i repeatedly send messages μij to its neighbors:

\[
\mu'_{ij} = \max_{a_i} \Big\{ Q_{ij}(s_{ij}, a_i, a_j) + \sum_{k \in \Gamma(i) \setminus j} \mu_{ki}(a_i) \Big\} - c_{ij} \tag{5.7}
\]

where μki is the incoming message, and μ'ij is the outgoing message. All messages are in fact vectors over possible actions. Γ(i) \ j represents all neighbors of i except j, and cij is a normalization factor, calculated after the initial values μ'ij have been found. Max-plus sets this to be the average over all values of the outgoing message: c_ij = (1/|A_j|) Σ_{a_j} μ'_ij(a_j). This prevents messages from exploding in value as multiple iterations of the algorithm proceed. Once messages have converged or a time limit has been reached, each agent chooses the action that maximizes arg max_{a_i} { Q_ij(s_ij, a_i, a_j) + Σ_{j∈Γ(i)} μ_ji(a_i) } for that agent. See Algorithm 5.3 for the full algorithm.

Input: Graph G = (V, E)
Output: A vector of actions a*
1:  Initialize μ_ji = μ_ij = 0 for (i, j) ∈ E, g_i = 0 for i ∈ V, and m = −∞
2:  while converged = false and deadline not met do
3:      converged = true
4:      foreach agent i do
5:          foreach neighbor j ∈ Γ(i) do
6:              foreach action a_j ∈ A_j do
7:                  Create message μ'_ij(a_j) = max_{a_i} { f_i(a_i) + f_ij(a_i, a_j) + Σ_{k∈Γ(i)\j} μ_ki(a_i) }
8:              end
9:              c_ij = (1/|A_j|) Σ_{a_j} μ'_ij(a_j)
10:             Normalize: μ'_ij(a_j) ← μ'_ij(a_j) − c_ij
11:             if μ'_ij differs from μ_ij by a small threshold then
12:                 converged = false
13:             end
14:         end
15:         a'_i = arg max_{a_i} { f_i(a_i) + Σ_{j∈Γ(i)} μ_ji(a_i) }
16:     end
17:     if u(a') > m then
18:         a* = a' and m = u(a')
19:     μ ← μ'
20: end
return a*
Algorithm 5.3: The centralized anytime Max-plus algorithm.
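The message computation in Algorithm 5.3 can be sketched compactly in Python. The code below is a minimal synchronous Max-plus pass over a pairwise coordination graph: the f_i and f_ij payoff tables stand in for the unary terms and edge Q-values of the current state, and the anytime bookkeeping (tracking the best joint action found so far with u(a')) is omitted for brevity. This is an illustrative sketch under those assumptions, not the implementation used for the experiments.

import numpy as np

def max_plus(n_actions, f_i, f_ij, neighbors, n_iters=20, tol=1e-6):
    # n_actions[i]: |A_i|; f_i[i]: local payoff vector of shape (|A_i|,);
    # f_ij[(i, j)]: payoff matrix of shape (|A_i|, |A_j|), stored once per
    # undirected edge; neighbors[i]: set of neighbors of agent i.
    def edge(i, j):
        return f_ij[(i, j)] if (i, j) in f_ij else f_ij[(j, i)].T

    mu = {(i, j): np.zeros(n_actions[j]) for i in neighbors for j in neighbors[i]}
    for _ in range(n_iters):
        new_mu = {}
        for i in neighbors:
            for j in neighbors[i]:
                # Sum incoming messages to i from all neighbors except j.
                incoming = sum((mu[(k, i)] for k in neighbors[i] if k != j),
                               np.zeros(n_actions[i]))
                payoff = f_i[i][:, None] + edge(i, j) + incoming[:, None]
                msg = payoff.max(axis=0)       # maximize over a_i for each a_j
                msg -= msg.mean()              # normalization term c_ij
                new_mu[(i, j)] = msg
        done = all(np.abs(new_mu[k] - mu[k]).max() < tol for k in mu)
        mu = new_mu
        if done:
            break
    # Each agent picks the action maximizing local payoff plus incoming messages.
    return {i: int(np.argmax(f_i[i] + sum((mu[(j, i)] for j in neighbors[i]),
                                          np.zeros(n_actions[i]))))
            for i in neighbors}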
5.5.2 Dynamic Coordination
Although I use an edge-based decomposition (as described above), it is often the
case that rewards are received on a per-agent basis instead of a per-edge basis.
Thus, we must compute local Qi functions for each agent in the graph. Following
[12], I do this by assuming each Qij contributes equally to each agent i and j of
its edge:
\[
Q_i(s_i, a_i) = \frac{1}{2} \sum_{j \in \Gamma(i)} Q_{ij}(s_{ij}, a_i, a_j) \tag{5.8}
\]
where Γ(i) indicates the neighbors of agent i. The sum of all such Qi functions
equals Q in Equation 5.6. I assume each agent has at least one other neighbor
in the coordination graph. It is straightforward to adapt these methods in cases
where an agent does not need to coordinate with anyone.
Because our coordination graph is context-specific, to update the Q-function
we must use an agent-based update as opposed to an edge-based update. This is
because the presence or absence of edges changes from state to state, so we cannot
be assured that an edge that is present in the current time step was available in the last time step. To obtain the agent-based update equation for an edge-based decomposition, the agent-based update (Equation 2.14) is rewritten using Equation 5.8 to get:

\[
Q_{ij}(s_{ij}, a_i, a_j) \leftarrow Q_{ij}(s_{ij}, a_i, a_j) + \alpha \sum_{k \in \{i,j\}} \frac{R_k(s, a) + \gamma Q_k(s'_k, a^*_k) - Q_k(s_k, a_k)}{|\Gamma(k)|} \tag{5.9}
\]

This update equation propagates the temporal-difference error from all edges that include agents i and j to the local Q-function of each edge (i, j). This update is context-specific, because it does not require the same edges to be present at each time step of the Q-learning algorithm. It only requires that local Qk functions can be computed for each vertex of the coordination graph, which is done using Equation 5.8. The notation Qi and Qij indicates that the Q-values are agent-based or edge-based respectively. Q-function parameters are shared between agents and edges. The final Q-learning algorithm may be seen in Algorithm 5.4.

1:  Initialize Q(s, a) optimistically
2:  Initialize s to any starting state
3:  for each step do
4:      Assign tasks T to agents M by finding arg max_β Σ_t v(s, β(t), t), where v(s, g, t) = max_{a∈A_g} Σ_{i,j∈g} Q_ij(s_ij, a_i, a_j)
5:      Choose a from s using the Max-plus algorithm and an ε-greedy policy derived from Q
6:      Take action a, observe rewards r and next state s'
7:      Use rules given with the domain to create coordination graph G = (V, E) for state s'
8:      Determine agent Q-functions Q_i(s_i, a_i) and Q_i(s'_i, a*_i) for each agent i using Q_i(s_i, a_i) = (1/2) Σ_{j∈Γ(i)} Q_ij(s_ij, a_i, a_j)
9:      For each edge (i, j) of the coordination graph, update its Q-value using Q_ij(s_ij, a_i, a_j) ← Q_ij(s_ij, a_i, a_j) + α Σ_{k∈{i,j}} [ R_k(s, a) + γ Q_k(s'_k, a*_k) − Q_k(s_k, a_k) ] / |Γ(k)|
10:     s ← s'
11: end
Algorithm 5.4: The assignment-based decomposition Q-learning algorithm using coordination graphs.
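A sketch of the context-specific, edge-based update of Equations 5.8 and 5.9 is given below. The dictionary keys, the reward and neighbor maps, and the canonical edge ordering are illustrative assumptions; separate neighbor maps are passed for s and s' because the context-specific graph can change between steps.

from collections import defaultdict

# Edge-based Q_ij(s_ij, a_i, a_j); key encodings here are illustrative assumptions.
Q = defaultdict(float)

def edge_key(i, j, state, actions):
    a, b = sorted((i, j))                      # canonical edge id (a, b)
    return ((a, b), state[(a, b)], actions[a], actions[b])

def agent_q(k, state, actions, neighbors):
    # Equation 5.8: Q_k(s_k, a_k) = (1/2) * sum over edges touching agent k.
    return 0.5 * sum(Q[edge_key(k, j, state, actions)] for j in neighbors[k])

def edge_update(i, j, state, actions, rewards, next_state, next_actions,
                neighbors, next_neighbors, alpha=0.1, gamma=0.9):
    # Equation 5.9: add the per-agent TD errors of agents i and j, each divided
    # by that agent's degree, to the local Q-value of edge (i, j).
    key = edge_key(i, j, state, actions)
    delta = 0.0
    for k in (i, j):
        td_k = (rewards[k]
                + gamma * agent_q(k, next_state, next_actions, next_neighbors)
                - agent_q(k, state, actions, neighbors))
        delta += td_k / len(neighbors[k])
    Q[key] += alpha * delta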
A complication arises during the assignment search step of Algorithm 5.1 when
using coordination graphs. It is not possible to efficiently calculate the value of
an assignment v(s, g, t) while still taking into account the contribution of edge-based Q-values Qij that occur between groups of agents assigned to different tasks.
Hence, I approximate v(s, g, t) by only taking into account the local state and
actions of its assigned agents, the state variables st , and ignoring inter-group edges
of the graph. At the task execution level, I consider all interactions, but since the
task assignment is fixed, the possible interactions are again limited.
5.6 Experimental Results
I experimented with three different multiagent domains: the product delivery domain (discussed in Section 3.4.1), the real-time strategy game domain (discussed in Section 3.4.2), and a new predator-prey domain discussed below. I follow this
with several experiments using both model-free and model-based assignment-based
decomposition to solve them.
5.6.1 Multiagent Predator-Prey Domain
The multiagent predator-prey domain is a cooperative multiagent domain based on
work by [11]. That original domain requires two agents (predators) to cooperate in
order to capture a single prey. Agents move over a 10x10 toroidal grid world, and
may move in four directions or stay in place. Prey move randomly to any empty
square. Predators and prey move simultaneously, so predators must guess where
the prey will be in the next time step. If predators collide, or if a predator enters
the same space as the prey without an adjacent predator, the responsible predators
are penalized and moved to a random empty square. The prey is captured (with a
reward of 75) when one predator enters its square, and another predator is adjacent.
The version of the domain used in my experiments exhibits two key differences from that of [11]: first, I increase the numbers of predators and prey from 2 vs. 1 to 4 vs. 2 or 8 vs. 4 (see Figure 5.3). Second, each time a prey is captured, it is randomly relocated somewhere else on the board and the simulation continues. Thus, this version of the domain has an infinite horizon rather than being episodic.

Figure 5.3: A possible state in an 8 vs. 4 toroidal grid predator-prey domain. All eight predators (black) are in a position to possibly capture all four prey (white).
There are several consequences of the increase in scale of this domain. Of
course, the joint action and state spaces increase exponentially. More interesting
is a need for predators to be assigned to prey such that exactly two predators are
assigned to capture each prey, if the best average reward is to be found. Thus, this
domain is an example of an MAMDP.
Once predators are assigned to prey, it is useful to coordinate the actions of
predators on the task execution level to prevent collisions. Thus I introduce coordination graphs on the task execution level as described in Section 5.5. The existence of a top-level assignment provides several advantages, such as when defining
the rules determining when agents should cooperate. I change only one of the
coordination rules introduced in [11]. Predators should coordinate when either of
two conditions hold:
• the Manhattan distance between them is less than or equal to two cells, or
• both predators are assigned to the same prey.
The existence of predator assignments allows us to create improved coordination
rules on the task execution level. It also reduces the number of state variables
(i.e., prey) we are required to account for in the edge value function. The Q-value
of each edge between predators cooperating to capture a prey need only be based
on the positions of those predators relative to their assigned prey. The Q-values
of each edge between predators cooperating only for collision avoidance need only
be based on the positions of those two predators. The existence of these two kinds
of edges does increase the number of parameters required to represent the value
function, but far less than the exponential increase in the number of parameters
required to store a value function over two predators and two or more prey (without
function approximation).
5.6.2 Model-free Reinforcement Learning Experiments
For my experiments using model-free Q-learning and assignment-based decomposition, I compared results in two MAMDP domains: the product delivery domain
discussed in Section 3.4.1, and the multiagent predator-prey domain discussed in Section 5.6.1. The product delivery domain does not require coordination on the
task execution level. It is simple enough that flat and multiagent Q-learning results can be obtained (albeit requiring function approximation) for comparison to
my approach using assignment-based decomposition.
The multiagent predator-prey domain is more complex. Standard Q-learning
approaches do not work here. I also tested the use of coordination graphs, and
several different assignment search techniques.
The product delivery domain may be easily described as an MAMDP. Restocking stores becomes a task for any of the multiple agents (trucks) to complete. An
assignment is therefore a mapping from shops to the trucks that will serve them.
Each shop is assigned one truck, which may only unload at that shop. Thus,
agents’ actions cannot interfere with each other, and there is no need for coordi-
88
nation on the task execution level. Because not all shops can be delivered to, I
add a “phantom truck” for the unassigned shop. This “agent” has no associated
state features. Its existence allows the assignment step of the assignment-based
decomposition to determine the appropriate penalty for not assigning a truck to
any shop.
I conducted several experiments in this domain (see Figure 5.4). All results were
averaged over 30 runs of 10^6 steps each. I tuned the learning rate α separately for each test, setting α = 0.1 for the assignment-based decomposition test and α = 0.01 for all others. I set the discount rate γ = .9, and used ε-greedy exploration with ε = .1. Average reward was measured for 2,000 out of every 50,000 steps.
The assignment-based decomposition approach used an exhaustive search of
possible assignments and no function approximation. A total of 11,250 parameters
are required to store the value function Q(s_t, s_g, a_g) (5 shops, 5 shop inventory levels, 10 truck locations, 5 truck loads, and 9 possible actions per truck). Here s_t indicates state features about the assigned shop and its inventory, and s_g indicates features for truck position and load.
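As a quick arithmetic check of this count (a worked aside, not a figure from the original text): with |S_t| = 5 · 5 = 25 assigned-shop states, |S_a| = 10 · 5 = 50 truck states, |A| = 9 actions, and k = 1 truck per shop,

\[
|S_t|\,|S_a|^{k}\,|A|^{k} = 25 \cdot 50 \cdot 9 = 11{,}250,
\]

which matches the per-task space bound of Section 5.4.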
The joint and multiagent Q-learning approaches I used need too many parameters to represent the value function using a complete table. Thus, I used the
“truck-shop tiling” approximation discussed in Section 3.4.3. Each agent uses its
own value function, so four times as many parameters as the assignment-based decomposition were used. Joint agent Q-learning sums over four times as many terms,
additionally indexing with each truck, but requires no additional parameters. The
hand-coded approach works similarly to assignment-based decomposition: for each truck-shop pair, a distance weight was calculated from the state features. Then the assignment was made based on an exhaustive search over possible assignments, taking the assignment giving minimum total distance.

Figure 5.4: Comparison of various Q-learning approaches for the product delivery domain.
I tried two coordination methods for multiagent Q-learning: when selecting
actions, I either exhaustively searched over all joint actions, or I used a simple
form of multiagent coordination called serial coordination, which greedily selects actions for agents one at a time, allowing each agent to know the actions selected by previous agents.

Table 5.1: Running times (in seconds), parameters required, and terms summed over for five algorithms applied to the product delivery domain.

Algorithm                            Time    Space    Terms
Joint agent Q-learning                142   45,000       20
Multiagent Q, exhaustive search       160   45,000        5
Multiagent Q, serial coordination       3   45,000        5
Assignment-based decomposition Q        3   11,250        1
Hand-coded algorithm                    3      N/A      N/A

Figure 5.5: Examination of the optimality of policy found by assignment-based decomposition for product delivery domain.
Assignment-based decomposition outperformed all other approaches, although
my hand-coded algorithm comes close. The multiagent Q-learning approaches performed the worst of these methods. In CPU time, both multiagent Q-learning with
serial coordination and assignment-based decomposition approaches were much
faster than those approaches using an exhaustive search of the action space (Table 5.1).
I also examined the optimality of the policy found by assignment-based decomposition in the product delivery domain (Figure 5.5). The top line is an optimistic
estimate of the optimal policy in this domain. I calculated this by multiplying the average number of customer visits per time step (1) by the transportation cost required to satisfy a single customer visit (−.1) to get the average transportation cost per time step required to satisfy all customers (−.1). This estimate is very optimistic, because it ignores stockout costs, which are inevitable due to the stochastic nature of customer visits. Still, the average reward of the policy found by assignment-based decomposition is quite close to my estimate. This analysis may be taken one step further: as my estimate ignores stockout costs, we can similarly ignore the contribution of stockout events to the average reward of the policy found by assignment-based decomposition. The result is a graph of only the transportation costs incurred by this policy, seen in Figure 5.5. From this I conclude that the policy found by assignment-based decomposition is very close to optimal in this domain.

Figure 5.6: Comparison of action selection and search methods for the 4 vs 2 Predator-Prey domain.
Figure 5.7: Comparison of action selection and search methods for the 8 vs 4
Predator-Prey domain.
The second domain I experimented in is the multiagent predator-prey domain
discussed in Section 5.6.1. In these tests, results are shown over 10^7 steps of the
model-free assignment-based decomposition algorithms (Algorithms 5.1 and 5.4).
Figure 5.6 shows the results for 4 predators vs. 2 prey, and Figure 5.7 shows
the results for 8 predators vs. 4 prey. The same set of search and coordination
strategies are compared in both domains. I set the learning rate α = 0.1, discount
rate γ = .9, and exploration rate ε = .2. Average reward of the domain was measured for 2,000 steps out of every 500,000 steps. During test phases, ε was set to 0. Because the maximum reward receivable by two agents is 75, edge value
functions were optimistically initialized to this value. Results were averaged over
30 runs.
I conducted six identical experiments for each domain: Max-plus action selection without an assignment-based decomposition (using sparse cooperative Q-learning as in [11]), assignment-based decomposition without using coordination
graphs and using an exhaustive search of assignments (as in Algorithm 5.1), and
assignment-based decomposition with Max-plus action selection and four assignment methods: exhaustive search, sequential greedy assignment, swap-based hill
climbing, and a fixed assignment (as in Algorithm 5.4). For the fixed assignment,
I arbitrarily assigned pairs of predators to prey at the start of the run, then never
reassigned them.
As may be seen from these results, Max-plus search alone performed poorly
compared to the other techniques. This is because a coordination graph alone is
unable to capture the coordination requirements of the predator-prey domain (for
similar reasons as those seen in Proposition 1). Using assignment search alone
results in a large increase in performance; this kind of search does capture some
essential coordination requirements. However, this alone is also not enough: it is
still possible for agents to interfere (collide) with each other after assignments have
been made. This type of coordination is ideal for a coordination graph approach
to solve as described in Section 5.5, as may be seen by the experiments combining
assignment search with Max-plus action search.
Of the various task assignment methods, fixed assignment and sequential greedy
assignment did not perform well. Swap-based hill climbing performed almost identically to exhaustive search. This provides hope that similar approximate search
techniques can allow assignment-based decomposition to scale to a large number
of agents.
I also experimented with transfer learning (Figure 5.7). Instead of initializing
Q-values optimistically, I transferred parameters learned from the 4 vs. 2 to the 8
vs. 4 predator-prey domain. This is possible because both domains have the same
number of parameters; as would domains with any number of agents, because the
Q-functions are all based on 2 predators and 1 prey. I tested the resulting policy
by turning off learning and using assignment-based decomposition with exhaustive
assignment search and Max-plus coordination. These results demonstrate that,
thanks to assignment-based decomposition, a policy learned with few agents can
scale successfully to many more agents. Transfer learning is explored in greater
detail in the next section.
5.6.3 Model-based Reinforcement Learning Experiments
I performed all experiments on several variations of the real-time strategy game
(RTS) described in Section 3.4.2. I focus here on expanding the transfer learning
results of the previous section, and show how to use transfer learning to overcome
the difficulties of scaling the RTS domain to large numbers of units.
These experiments are in the context of model-based assignment-based decomposition (Section 5.2) and ATR-learning (Section 3.3.3). The relational templates
of Section 3.1.2 are also used to form the value function.
Transfer learning across different domains (as in Section 3.4.4) is very helpful,
but transfer learning may also provide an additional benefit when combined with
assignment-based decomposition: we can transfer knowledge learned in a small
domain (such as the 3 vs 1 domains discussed in Section 3.4.4) to a larger domain, such as the 6 vs. 2 or 12 vs. 4 domains discussed here.

Figure 5.8: Comparison of 6 agents vs 2 task domains.
To transfer results from the 3 vs. 1 domain to the 6 vs. 2 domain, we must
use assignment-based decomposition within the larger domain. This domain has
two enemy units (tasks) and six agents. Each time step, I use the algorithm shown
in Algorithm 5.2 to assign agents to tasks, and allow the task execution level to
decide how each group of agents should complete its single assigned task. Thus,
we can adapt the relational templates used to solve the 3 vs. 1 domain to this
larger problem.
If I adapt the templates used in the 3 vs. 1 domain (Table 3.1, #2-4) directly,
performance will suffer due to interference (being shot at) by enemy units other
than those assigned to each agent. To prevent this, I create a new aggregation
feature TasksInrange(B), and define a behavior transfer function [25] ρ(π) which initializes the new relational templates which include this feature (#6-8) by transforming the old templates (#2-4) which do not. I do this simply by copying the parameters of the old templates into those of the new for all possible values of the additional dimensions.

Figure 5.9: Comparison of 12 agents vs 4 task domains.
There is an additional consideration when transferring from the 3 vs. 1 to 6 vs
2 domains: as I am using assignment-based decomposition in the 6 vs. 2 domains,
can we (or should we?) transfer the state-based value functions? While this is
possible to do (by learning the function in the 3 vs. 1 domain), empirical results
have shown that it is not necessary, and performance suffers very little if we learn
the state-based value function from scratch each time. Hence, this is what I have
done for all results in this paper.
All experiments shown here (Figures 5.8 and 5.9) are averaged over 30 runs of 10^5 steps each. I used the ATR-learning algorithm with assignment-based decomposition (Algorithm 5.2) for most of the experiments. As with the experiments with the 3 vs. 1 domain, runs were divided into 40 alternating train/test phases, with ε = .1 or ε = 0. α is similarly adjusted independently: for the 6 vs. 2 domain, I used α = .01 for parent templates, and α = .1 otherwise. For the 12 vs. 4
domain, I used α = .001 for parent templates, and α = .01 otherwise.
I compared results with and without transfer (from all combined subdomains
in the 3 vs. 1 domain) to the 6 Archers vs. 2 Towers domain. I used exhaustive
search to compare transfer results. These results show that using transfer is significantly better than not using it at all. In addition, I tested several different forms of
assignment search: exhaustive, hill climbing, bipartite, and fixed assignment. As
expected, fixed assignment performs quite poorly. Bipartite search, while performing slightly worse than exhaustive, still does very well. The performance of hill
climbing varies between that of fixed assignment and bipartite search, depending
on how many times the hill climbing algorithm is used to improve the assignment.
Shown are results for only one iteration of the hill climbing algorithm, which is
only a modest improvement upon fixed assignment.
Finally, I tested the "flat" (no assignment-based decomposition, using Algorithm 3.2) 6 vs. 2 domain without transfer learning. As expected,
this performed very poorly, which is due to the difficulty of creating an adequate
function approximator for 6 agents and 2 tasks. I arrived at using only template
#5 after experimentation with several other alternative templates. Even with α set
very low (.008), parameters of the value function slowly continue to grow, causing
performance to peak and eventually dip. This points to the inadequacy of traditional methods for solving such a large problem: we need to decompose problems
of this size in order to solve them. In addition, “flat” ATR-learning is very slow
on such problems, taking almost 43 times more computation time to finish a single
run than when using assignment-based decomposition! (Table 5.2)
My tests on the 12 vs. 4 domain have similar results. Here, I transferred
from the 6 vs. 2 domain, which requires no additional relational features. Results
(Figure 5.9) show that using transfer provides an enormous benefit. All 12 vs. 4 results but one use bipartite search (as an exhaustive search of the assignment space is unacceptably slow), and this performs very well, especially compared to no assignment search at all. Finally, I tested transfer from the 6 vs. 2 domain to a 12 vs. 4 combined problem. In the combined problem, all unit types were allowed on their respective sides (archers or infantry for the agents, ballista, tower, or knight for the enemy units). This is a very complex problem that requires the assignment step of the assignment-based decomposition algorithm to assign the best possible agents (archers or infantry) to their best possible match. While the complexity of this domain prevents the algorithm from performing as well as it does on a single subdomain, the assignment-based decomposition algorithm still performed very well.

Table 5.2: Experiment data and run times. Columns list domain size, units involved (Archers, Infantry, Towers, Ballista, or Knights), use of transfer learning, assignment search type ("flat" indicates no assignment search), relational templates used for state and afterstate value functions, and average time to complete a single run.

Size      Subdomain(s)      Transfer   Search type     State       Afterstate   Seconds
                                                       templates   templates
3 vs 1    Any               no         flat            N/A         2-4,9        28
3 vs 1    Any               yes        flat            N/A         2-4,9        29
6 vs 2    A vs. T           no         exhaustive      5,9         6-9          34
6 vs 2    A vs. T           yes        exhaustive      5,9         6-9          60
6 vs 2    A vs. T           yes        bipartite       1           6-9          60
6 vs 2    A vs. T           yes        hill climbing   5,9         6-9          60
6 vs 2    A vs. T           yes        fixed           N/A         6-9          57
6 vs 2    A vs. T           no         flat            N/A         5            2567
12 vs 4   A vs. T           no         bipartite       1           6-9          76
12 vs 4   A vs. T           yes        bipartite       1           6-9          122
12 vs 4   A vs. T           yes        hill climbing   5,9         6-9          156
12 vs 4   A vs. T           yes        fixed           N/A         6-9          114
12 vs 4   A,I vs. T,B,K     yes        bipartite       1           6-9          108
Finally, I examine the performance of the various algorithm/domain combinations (Table 5.2). From these results, we can see that the computation time
required to solve a problem using assignment-based decomposition scales linearly
in the number of agents and tasks. This is a considerable improvement over
“flat” approaches, which require an exponential amount of time in the number
of agents/tasks to solve each domain. As expected, not searching at all is very
fast. Exhaustive search is the slowest search technique, and it is so slow as to be
unusable in the 12 vs. 4 domain. Of the various approximate search techniques,
hill climbing is the slowest. Interestingly, methods that used no transfer learning
were faster than those that did. This is most likely because more agents died
during these runs, resulting in less time to compute each time step.
5.7 Summary
This chapter introduced Multiagent Assignment MDPs and gave a two-level action
decomposition method that is effective for this class of MDPs. This class of MDPs
can capture many real-world domains such as vehicle routing and delivery, board
and real-time strategy games, disaster response, fire fighting in a city, etc., where
multiple agents and tasks are involved.
I showed how both model-free and model-based reinforcement learning algorithms may be adapted for use with assignment-based decomposition. In the
case of model-free RL, I also showed how to combine coordination graphs with
assignment-based decomposition to allow for two different types of coordination at
both the upper assignment level and the lower task execution level. I gave empirical results in two domains that demonstrate that the combination of assignment
search at the top level and coordinated reinforcement learning at the task execution level is well-suited to solving such domains, while either method alone is not
sufficiently powerful.
Because a search over an exponential number of assignments is not scalable as
the number of agents increases, I have also shown how several simple approximate
search techniques perform effective assignment search. My results show that bipartite search performs the best in terms of speed and average reward; however,
bipartite search cannot always be applied. In such situations, a method like hill
climbing search is preferable. These results encourage the conclusion that assignment search is a practical approach for large cooperative multiagent domains.
Chapter 6 – Assignment-level Learning
The techniques demonstrated in Chapter 5 with assignment-based decomposition
have involved the assignment level using information solely from the task execution
level to make a decision. This leads to compact, scalable value functions for many
agents. In this chapter, I explore the potential of trading off scalability for improved
solution quality by introducing assignment-level learning. This is similar to and
inspired by the way hierarchical reinforcement learning allows learning to occur at
every level of the hierarchy [5].
6.1 HRL Semantics
To see how assignment-level learning might be introduced, let us first examine
how assignment-based decomposition makes a decision. In Figure 6.1, we see three
states (of a potentially larger MDP) in which a single agent at state s1 needs to
make a choice between being assigned to task x or task y. This choice will then
lead the agent to states s2 or s3 respectively, receiving reward rx or ry . In this
figure, the local MDP at the task execution level is a one-step Markov chain, for
which only a single Q-value need be learned: either Qt (x) or Qt (y), for which
the values rx and ry will eventually be learned. (Here the notation Qt is used to
indicate the task-level value function). Note the lack of a state variable: the local
task-execution MDPs are aware of only the local state, of which there is only one for which a value must be learned, and thus no need for a state variable. Because we use the task-level value function to make decisions at the assignment level, the agent is assigned to the task that provides the greatest value max(Qt(x), Qt(y)), which becomes max(rx, ry) once the appropriate parameters have been learned.

Figure 6.1: Information typically examined by assignment-based decomposition.

Figure 6.2: Information examined by assignment-based decomposition with assignment-level learning.
This process does not take into account what may occur after a task has been
completed. Further tasks may become available, and the process may continue.
For example, in Figure 6.2, new tasks z or w may be completed after finishing
tasks x and y, receiving reward rz or rw . Tasks may potentially continue beyond
this decision indefinitely, or until the episode is terminated. Previous work in
hierarchical reinforcement learning [5] would suggest that to make the correct
decision in state s1 requires that we take into account potential sources of reward
that occur after the current task is finished. That is, we introduce a value function
at the assignment level and learn the value of the contribution of any reward we
receive after the current task is finished. This is called the completion function.
We can learn the completion function by conceptually splitting the assignment-level Q-function into two parts, the existing Qt function which is taken from the
task execution level and a Qa function which is learned and used only at the
assignment level. Thus the decision of how to assign a single agent in Figure 6.2
is made by comparing the values max(Qt (x) + Qa (s1 , x), Qt (y) + Qa (s1 , y)), which
becomes max(rx +rz , ry +rw ) once appropriate values of all parameters are learned.
Attaching this meaning to the Qa function is what I call hierarchical reinforcement learning (HRL) semantics, and I will now proceed to show why this meaning may yield bad results.
To begin, let us examine Figure 6.3, which shows a
simple 4-state MDP with a single agent and two tasks.
Should the agent be assigned to task x in the first state,
it “dies” and receives a negative reward. However, if it
is assigned to task y, it receives a zero reward, but is
given the opportunity to try task x again. This time,
it receives a positive reward. Let us start by examining what happens if we try to solve this problem with
assignment-based decomposition.
Figure 6.3: A 4-state
MDP with two tasks.
Assignment-based decomposition learns only two parameters, one per task: Qt (x) and Qt (y). Qt (y) will always be zero (because this
is the reward received for completing task y in state s3 ). The value of Qt (x) is
less certain: it may be less than or greater than zero, depending on which task
the agent is assigned to in state s1 . If the assignment is always optimal, the agent
should learn Qt (x) = 1. However, as soon as Qt (x) > 0, the agent will be assigned
to task x in state s1 , and eventually Qt (x) becomes negative. This leads to an
oscillation of this value around zero, which is undesirable.
To apply HRL semantics to solve this problem, we introduce two new parameters: Qa (s1 , x) and Qa (s1 , y) which learn the completion function for state s1
and tasks x and y. Qa (s1 , x) must be zero because no reward is ever received
after task x is finished. Qa (s1 , y) = 1, because this is the only reward ever received after task y is finished. Should the assignment be optimal, Qt (x) = 1 and
Qt (y) = 0 for the same reasons discussed previously. In this situation, we compute
max(Qt (x) + Qa (s1 , x), Qt (y) + Qa (s1 , y)) = max(1 + 0, 0 + 1) = max(1, 1) = 1.
Note that both tasks appear equally optimal in this case, which is undesirable. The
correct equation should compare the true Q-values at the end of the two possible
episodes, −1 and 1.
Why is HRL semantics failing here? It is because, at the task execution level,
states s1 and s3 are aliased, that is, the task execution level, and through that the
assignment-level Q-function, is failing to distinguish between these states. This is
because the task-level Q-values Qt (x) and Qt (y) only take into account the local
state. In hierarchical reinforcement learning, this would be considered an “unsafe
abstraction”. This unsafe abstraction could be removed only by having the Qt
functions take into account the global state; however this becomes impractical as
the number of agents increases, and the benefits of assignment-based decomposition
are lost.
6.2 Function Approximation Semantics
The problem with HRL semantics may be repaired by using different semantics, which I call "function approximation semantics". This simply supposes that the Qa
value should not be the completion function, but merely serves to correct the
task execution value Qt towards the true Q-value. For example, in Figure 6.3,
we have Qt values fixed by the task-level rewards (assuming optimal assignment)
so that Qt (x) = 1 and Qt (y) = 0. In order to arrive at the true Q-values to be
compared in state s1 , let Qa (s1 , x) = −2 and Qa (s1 , y) = 1. Thus, Q(s1 , x) =
Qt (x) + Qa (s1 , x) = 1 + −2 = −1, and Q(s1 , y) = Qt (y) + Qa (s1 , y) = 0 + 1 = 1,
which are the correct values and allow the correct decision to be made.
To determine how we might learn these improved values, we can simply substitute Qa(s, β) + Σ_t v(s, β(t), t), where β is the assignment (i.e., the assignment-level action), for Q(s, a) in the Q-learning update equation:

\[
Q(s, a) \leftarrow Q(s, a) + \alpha \Big[ r + \gamma \max_{a' \in A(s')} Q(s', a') - Q(s, a) \Big] \tag{6.1}
\]

which becomes:

\[
Q_a(s, \beta) + \sum_t v(s, \beta(t), t) \leftarrow Q_a(s, \beta) + \sum_t v(s, \beta(t), t)
+ \alpha \Big[ r + \gamma \max_{\beta'} \Big( Q_a(s', \beta') + \sum_t v(s', \beta'(t), t) \Big)
- \Big( Q_a(s, \beta) + \sum_t v(s, \beta(t), t) \Big) \Big]. \tag{6.2}
\]

After simplifying and eliminating Σ_t v(s, β(t), t) (because these values are updated during the task execution level update) on both sides of the arrow we get:

\[
Q_a(s, \beta) \leftarrow Q_a(s, \beta)
+ \alpha \Big[ r + \gamma \max_{\beta'} \Big( Q_a(s', \beta') + \sum_t v(s', \beta'(t), t) \Big)
- Q_a(s, \beta) - \sum_t v(s, \beta(t), t) \Big] \tag{6.3}
\]

Because assignments need not change often, a useful approximation of Equation 6.3 is possible. Instead of searching over all possible assignments to compute max_{β'} (Qa(s', β') + Σ_t v(s', β'(t), t)), we can approximate this quantity by simply using the current assignment β, which we can assume remains good in the next step:

\[
Q_a(s, \beta) \leftarrow Q_a(s, \beta)
+ \alpha \Big[ r + \gamma \Big( Q_a(s', \beta) + \sum_t v(s', \beta(t), t) \Big)
- Q_a(s, \beta) - \sum_t v(s, \beta(t), t) \Big] \tag{6.4}
\]

Algorithm 6.1 shows the final algorithm for model-free assignment-based decomposition with assignment-level learning. Similar methods to those shown in Section 5.2 may be used to adapt this algorithm to a model-based method, as shown in Algorithm 6.2. The notation α^a, γ^a, α^t, or γ^t refers to the learning rate and discount factor for assignment- and task-level Q-functions. Note the use of the notation Qt and Qa to distinguish between task- and assignment-level Q-functions respectively: these subscripts do not indicate an index over particular tasks or actions.

1:  Initialize Qa(s, a) to all zeros, and Qt(s, a) optimistically
2:  Initialize s to any starting state
3:  for each step do
4:      Assign tasks T to agents M by finding arg max_β [ Qa(s, β) + Σ_t v(s, β(t), t) ], where v(s, g, t) = max_{a∈A_g} Qt(s_t, s_g, a)
5:      For each task t, choose actions a_β(t) from s_β(t) using an ε-greedy policy derived from Qt
6:      Take action a, observe rewards r and next state s'
7:      foreach task t do Qt(s_t, s_β(t), a_β(t)) ← Qt(s_t, s_β(t), a_β(t)) + α^t [ r_β(t) + γ^t max_{a'} Qt(s'_t, s'_β(t), a') − Qt(s_t, s_β(t), a_β(t)) ]
8:      Qa(s, β) ← Qa(s, β) + α^a [ r + γ^a max_{β'} ( Qa(s', β') + Σ_t v(s', β'(t), t) ) − Qa(s, β) − Σ_t v(s, β(t), t) ]
9:      s ← s'
10: end
Algorithm 6.1: The assignment-based decomposition with assignment-level learning Q-learning algorithm.

1:  Initialize state and afterstate h-functions v(·) and av(·)
2:  Initialize assignment-level Q-function Qa(·)
3:  Initialize s to any starting state
4:  for each step do
5:      Assign tasks T to agents M by finding arg max_β Σ_t y(s, β(t)), where y(s, g, t) = v(s_t, s_g)
6:      For each task t, find joint action u_β(t) ∈ A_β(t) that maximizes r_β(t)(s, u_β(t)) + av(s_{u_β(t)})
7:      Take an exploratory action or a greedy action in the state s. For each set of agents β(t), let a_β(t) be the joint action taken, r_β(t) the reward received, s_{a_β(t)} the corresponding afterstate, and s' the resulting state.
8:      Update the model parameters r_β(t)(s, a_β(t)).
9:      foreach task t do
10:         Let Target_t = max_{u_β(t)∈A_β(t)} { r_β(t)(s', u_β(t)) + av(s'_{u_β(t)}) }
11:         av(s_{a_β(t)}) ← av(s_{a_β(t)}) + α(Target_t − av(s_{a_β(t)}))
12:         v(s_t, s_β(t)) ← v(s_t, s_β(t)) + α(r_β(t) + Target_t − v(s_t, s_β(t)))
13:     end
14:     Qa(s, β) ← Qa(s, β) + α^a [ r + γ^a max_{β'} ( Qa(s', β') + Σ_t y(s', β'(t)) ) − Qa(s, β) − Σ_t y(s, β(t)) ]
15:     s ← s'
16: end
Algorithm 6.2: The ATR-learning algorithm with assignment-based decomposition and assignment-level learning.
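As a concrete illustration of the approximate assignment-level update of Equation 6.4, the Python sketch below maintains a tabular Qa correction on top of the task-level values. The key encodings, the callable v, and the default learning rates (which mirror values reported in Section 6.3.2) are assumptions made for illustration, not the implementation used in the experiments.

from collections import defaultdict

Q_a = defaultdict(float)   # assignment-level corrections, keyed by (state, assignment)

def assignment_value(state, beta, v):
    # Q_a(s, beta) + sum_t v(s, beta(t), t): the corrected assignment value used
    # at the assignment level under function approximation semantics.
    return Q_a[(state, beta)] + sum(v(state, group, task) for task, group in beta)

def assignment_update(state, beta, reward, next_state, v,
                      alpha_a=0.001, gamma_a=1.0):
    # Approximate update of Equation 6.4: the next assignment is assumed to
    # remain beta instead of maximizing over beta'.
    target = reward + gamma_a * assignment_value(next_state, beta, v)
    Q_a[(state, beta)] += alpha_a * (target
                                     - Q_a[(state, beta)]
                                     - sum(v(state, group, task)
                                           for task, group in beta))

Here beta is assumed to be a hashable tuple of (task, group) pairs and v(state, group, task) a wrapper around the learned task-level value.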
6.3 Experimental Results
In this section I show the results for several experiments in two different domains:
first, the simple 4-state MDP shown in Figure 6.3, and second the real-time strategy
game domain discussed in Section 3.4.2. These results demonstrate the improvement gained by using function approximation semantics over either HRL semantics or assignment-based decomposition without assignment-level learning.

Figure 6.4: Comparison of various strategies for assignment-level learning.
6.3.1 Four-state MDP Domain
Figure 6.4 compares the results of normal assignment-based decomposition vs.
both assignment-level learning strategies discussed here on the simple 4-state MDP
shown in Figure 6.3. Results were plotted for 1000 episodes, and averaged over
10,000 runs. Softmax exploration with a temperature τ = .5 and learning rate
αa = αt = .01 was used in order to demonstrate the difference between the HRL
and function approximation semantics. As may be seen from these results, as expected, assignment-based decomposition with no learning fails to perform well in this domain. HRL semantics improves performance, but because the assignment-level Q-values for both tasks remain very close to each other, it still does not perform well. Only function approximation semantics allows learning
appropriate Qa values and performs near-optimally. It is this semantics which I
continue to use in my experiments in Section 6.3.2.
6.3.2 Real-Time Strategy Game Domain
These experiments are in the context of model-based assignment-based decomposition (Section 5.2) and ATR-learning (Section 3.3.3). The relational templates of
Section 3.1.2 are also used to represent the value function.
To test assignment-level learning in a more complex domain, I experimented
with several different combinations of units in the real-time strategy domain, with
and without assignment-level learning. I used model-based assignment-based decomposition as described in Section 5.2. All results are for 10^5 steps, and averaged over 30 runs. I used γ^t = .95 at the task level for this problem, and γ^a = 1 at the assignment level.
Because the global state is very complex and cannot be efficiently stored in
a table, a function approximator that can capture the needed global interactions
at the assignment level must be used. For these tests, I created a table over
several derived features of the global state. These derived features consisted of
a count of the number of enemy units of each type, and a count of the number
of units of each type assigned to each type of enemy unit. For example, these derived features could capture the fact that there are three archers assigned to a single enemy ballista, three assigned to a single enemy knight, and no agents assigned to the two remaining enemy units. Other information that might be useful, such as hit points of units or relative distances, could not be captured due to the exponential increase in the number of parameters required. Still, empirical results show that even these simple derived features can improve performance significantly with assignment-level learning.

Figure 6.5: Comparison of assignment-based decomposition with and without assignment-level learning for the 3 vs 2 real-time strategy domain.
Figure 6.5 shows results for three agents vs. two enemy units. One enemy was
a dangerous knight or ballista, the second enemy was a harmless, immobile “hall”.
Both enemy units returned a reward of 1 if killed. If enemy units killed an agent,
a reward of −1 was received. As the state information for both enemies was not
included in the local state of the task, when the agents are assigned to the hall,
the problem of predicting if the knight will kill them is very difficult. The hall is
harmless, and so attacking the hall first might appear more attractive. Unfortunately, ignoring the knight will allow it to quickly kill off agents. Thus, the correct
decision is to attack the knight first, then the hall. This is analogous to the problem presented in the simple 4-state MDP of Figure 6.3. An exhaustive search over
possible assignments was performed, with αt = .1 at the task level, and αa = .001
at the assignment level. As may be seen from these results, adding assignmentlevel learning improves average reward significantly, although assignment-based
decomposition still performs fairly well. Also included in these results are a comparison of assignment-level learning with and without the approximation used in
Equation 6.4. These results show that this approximation performs very similarly
to the full update of Equation 6.3 for these domains.
Figure 6.6 tests assignment-based decomposition with and without assignment-level learning on six agents vs. four enemy units. I introduced a new unit here
called the “glass cannon” (see Section 3.4.2), which instantly kills any unit it hits,
and likewise is instantly killed if attacked. The learning rate αt = .001 for the
task execution level. For the assignment level I set the learning rate to start at
1 and divided it by αa + 1 every 100 time steps. I used hill climbing assignment
search (repeated three times) to find the best action. Bipartite search cannot be
used with assignment-level learning as determining the value of the assignment
requires the global state, which bipartite search cannot provide. This time, I gave
a reward of zero if the glass cannon was killed, and ten if the hall was killed. As
with the 3 vs. 2 results, this made attacking the hall much more attractive, and so
average reward suffered without assignment-level learning. Still, performance was
robust in either case, although adding assignment-level learning improved results
significantly. Use of assignment-level learning with the approximate update also
improved performance, though not as much as with the full update of Equation 6.3.

Figure 6.6: Comparison of 6 archers vs. 2 glass cannons, 2 halls domain.
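The hill climbing assignment search used above can be sketched roughly as follows: starting from a random assignment, one agent's assignment is changed at a time whenever that improves the estimated value, and the whole process is restarted a few times. As before, assignment_value is a hypothetical stand-in for the learned assignment-level estimate.

    import random

    def hill_climb_assignments(agents, enemies, assignment_value, restarts=3):
        """Approximate search over assignments by repeated single-agent moves."""
        best_assignment, best_value = None, float("-inf")
        for _ in range(restarts):
            # Random starting assignment of each agent to some enemy.
            current = {agent: random.choice(enemies) for agent in agents}
            current_value = assignment_value(current)
            improved = True
            while improved:
                improved = False
                for agent in agents:
                    for enemy in enemies:
                        if enemy == current[agent]:
                            continue
                        candidate = dict(current)
                        candidate[agent] = enemy
                        value = assignment_value(candidate)
                        if value > current_value:
                            current, current_value = candidate, value
                            improved = True
            if current_value > best_value:
                best_assignment, best_value = current, current_value
        return best_assignment, best_value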
In Figure 6.7, I performed tests much as in Figure 6.6, but with different units and
a return to more standard rewards (a reward of 1 was given for all enemy kills). This
time I tested six archers vs. two knights and two halls. Both learning rates were set
to α_t = α_a = .001. Using assignment-level learning again improved performance over
no learning at the assignment level; however, this time the improvement was much
smaller than that for Figure 6.6. This is because the rewards given for killing the
enemy units are very similar, and so assigning agents to one or the other unit appears
similarly attractive. These results also demonstrate that assignment-based decomposition
can perform robustly regardless of whether or not assignment-level learning is used.
Interestingly, the approximate update of Equation 6.4 outperformed the full update here.

Figure 6.7: Comparison of 6 agents vs 4 tasks domain.
It was not possible to perform tests of assignment-level learning for larger numbers of agents, as the global table function approximator grows too large. Future
work may involve seeking a way to mitigate this difficulty.
6.4 Summary
Assignment-based decomposition is a robust method for solving MAMDP problems;
however, under certain circumstances, particularly when local state information is
insufficient to correctly differentiate between the true values of a particular task and
assignment, it may underperform. This
may occur, for example, when the rewards (and thus the learned Q-values) are
very different between tasks, making it appear that the higher-valued task should
be completed first, even if completing a lower-valued task first will lead to greater
reward in the long term.
To mitigate this problem, I have shown how previous work in hierarchical reinforcement learning can inspire “assignment-level learning”. However, unlike hierarchical reinforcement learning, assignment-level learning requires a different method
for learning a value function at the assignment level. This is due to the potentially
unsafe abstractions caused by global interactions that cannot easily be captured by
a local task-based value function approximator. Note that although assignment-level learning can improve average reward significantly, the ability to scale to larger
numbers of units can be greatly impaired. This is because the size of the value
function over the global state at the assignment level grows exponentially in the
number of agents.
Chapter 7 – Conclusions
7.1 Summary of Contributions
Throughout this thesis I have presented several techniques for mitigating each of
the three curses of dimensionality, either singly or several at once. Function approximation techniques such as tabular linear functions and relational templates mitigate the first curse of dimensionality (state space explosion). A hill climbing search
of the action space or other approximate search technique can mitigate the second
curse of dimensionality (action space explosion). Afterstate-based methods such as
ASH-learning and ATR-learning can help mitigate the third curse of dimensionality (outcome space explosion). Methods such as multiagent H-learning, multiagent
ASH-learning, and assignment-based decomposition techniques can mitigate some
or all of the curses of dimensionality at once. Finally, specialized techniques such
as transfer learning, while not typically used for this purpose, can combine with
assignment-based decomposition to scale domains with few agents to much larger
numbers of agents. To see a summary of some of the contributions of this dissertation and the curses each contribution can help mitigate, see Table 7.1.
It seems apparent that in order to solve the most complex domains, some
combination of all these scaling methods will be required. In particular, starting
with either assignment-based decomposition or a simpler multiagent method and
adding any further techniques required, such as function approximation, will often
perform very well.

Table 7.1: The contributions of several methods discussed in this thesis towards
mitigating the three curses of dimensionality.

Method                               State Space   Action Space   Outcome Space
Tabular linear functions             yes
Relational templates                 yes
Hill climbing the action space                     yes
Efficient expectation calculation                                 yes
ASH-learning                                                      yes
ATR-learning                                                      yes
Multiagent H-learning                yes           yes            yes
Multiagent ASH-learning              yes           yes            yes
Assignment-based decomposition       yes           yes            yes
Transfer learning
In particular, the fastest results found in this thesis use assignment-based decomposition
in combination with fast approximate assignment search techniques such as bipartite
search. Using these techniques, I have achieved nearly linear growth in required
computation time as additional agents are added to the domain, as opposed to the
exponential growth of conventional RL approaches (as shown in Table 5.2).
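A bipartite assignment search of the kind referred to here can be sketched with an off-the-shelf Hungarian-method solver, assuming for simplicity that each agent is matched to a distinct task and that a per-agent, per-task value estimate is available; the pair_value callable below is a hypothetical placeholder for that estimate.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def bipartite_assignment(agents, tasks, pair_value):
        """Match agents to tasks one-to-one, maximizing the sum of pairwise values."""
        value_matrix = np.array([[pair_value(a, t) for t in tasks] for a in agents])
        # linear_sum_assignment minimizes cost, so negate the values to maximize.
        rows, cols = linear_sum_assignment(-value_matrix)
        return {agents[r]: tasks[c] for r, c in zip(rows, cols)}

As noted earlier, such a search only sees per-agent, per-task value estimates rather than the global state, which is why it cannot be combined with assignment-level learning.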
All techniques discussed in this thesis involve tradeoffs: usually solution quality
for speed. With the right choice of techniques, the hope is that the loss in solution
quality is negligible. It is usually necessary to test several alternatives
before becoming confident that the approximations made are not too damaging to
the expected reward received.
Just as each curse of dimensionality may be more or less onerous for any particular domain, the possible tradeoffs required to mitigate each curse may be more
or less damaging to the expected reward. This tradeoff is particularly obvious
when choosing a function approximator. Typically, the more parameters allowed
in the function approximator, the better the value function that can be represented.
Unfortunately, convergence time is closely related to the number of parameters that
must be learned, and in the worst case this number is exponential in the number of
dimensions of the state (as in a tabular representation).
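As a rough illustration of this tradeoff, the sketch below follows the spirit of the tabular linear functions used in this thesis: a table indexed by a few discretized features, where each cell stores the weights of a linear function over the remaining continuous features. The class name, feature split, and example dimensions are hypothetical.

    import numpy as np

    class TabularLinearFunction:
        """Table over discrete features; each cell holds a linear value function
        over the continuous features (a sketch, not the thesis implementation)."""

        def __init__(self, table_shape, n_continuous):
            # One weight vector (plus bias) per table cell, so the total number of
            # parameters is prod(table_shape) * (n_continuous + 1).
            self.weights = np.zeros(table_shape + (n_continuous + 1,))

        def value(self, discrete_index, continuous_features):
            x = np.append(continuous_features, 1.0)   # bias term
            return float(self.weights[discrete_index] @ x)

        def update(self, discrete_index, continuous_features, target, alpha=0.01):
            x = np.append(continuous_features, 1.0)
            error = target - float(self.weights[discrete_index] @ x)
            self.weights[discrete_index] += alpha * error * x

With, say, a 10 x 10 grid of discrete cells and four continuous features, this representation needs 10 * 10 * 5 = 500 parameters, whereas fully discretizing every state dimension into a table would add a multiplicative factor for each extra dimension.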
Approaches that decompose the joint agent into a multiagent problem also
show an obvious tradeoff: in this case, the quality of coordination between agents.
A joint agent approach can perfectly coordinate between the agents. A multiagent
or assignment-based decomposition approach sacrifices perfect coordination for
fast action selection. In practice, most multiagent domains do not require perfect
coordination between agents. Some coordination is usually necessary, however, and it
is not always clear what form it should take. In this thesis I presented three
general kinds of such coordination: serial coordination, coordination graphs, and
assignment-based decomposition. I showed how these techniques do not have to
be mutually exclusive and can complement each other. Picking one or two of these
methods is often sufficient for most domains.
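As an illustration of the simplest of these, serial coordination, the sketch below selects actions for the agents one at a time, with each agent conditioning on the choices already made by the agents before it; local_value is a hypothetical stand-in for a per-agent value estimate given the partial joint action.

    def serial_coordination(agents, actions, local_value):
        """Greedy serial action selection: agent i sees the choices of agents 1..i-1.

        local_value: callable (agent, action, partial_joint_action) -> float.
        """
        joint_action = {}
        for agent in agents:                      # fixed ordering of the agents
            best_action = max(
                actions,
                key=lambda a: local_value(agent, a, dict(joint_action)),
            )
            joint_action[agent] = best_action     # later agents condition on this
        return joint_action

This evaluates on the order of the number of agents times the number of actions per decision, rather than the number of joint actions, which is where the savings in the action space come from.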
7.2 Discussion and Future Work
An interesting area of possible future work in model-based assignment-based decomposition is the introduction of coordination graphs, as was done for model-free
reinforcement learning in Section 5.5. Coordination graphs are not sufficient to
coordinate assignment decisions [17]; however, they are useful for coordinating between
agents at the task-execution level, for example to avoid collisions. The RTS
domain introduced here does not model collisions, and so there is no need for low-level
coordination between tasks as there is in [17]. Introducing collisions to this
RTS domain would be straightforward, but it would require adapting the use of
coordination graphs to a model-based RL algorithm.
Future work includes scaling the approaches in this thesis to work with much
larger numbers of agents, tasks, and state variables, and considering other kinds of
interactions such as global resource constraints.
Future work in assignment-based decomposition could address adapting it to a
decentralized domain. The Max-plus algorithm can already be decentralized [12];
however, assignment-based decomposition assumes a centralized controller. Adapting
these algorithms to work in a decentralized context could involve message-passing
techniques similar to those used by the Max-plus algorithm.
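For reference, a decentralizable coordination step of this kind can be sketched with the standard max-plus message updates on a pairwise coordination graph. The payoff functions, graph, and iteration count below are placeholders; this is the textbook form of the algorithm rather than an implementation from this thesis, and message normalization (helpful on cyclic graphs) is omitted.

    from collections import defaultdict

    def max_plus(actions, local_payoff, pair_payoff, edges, iterations=10):
        """Max-plus on a pairwise coordination graph (sketch).

        actions:      list of actions available to every agent
        local_payoff: callable (agent, action) -> float
        pair_payoff:  callable (agent_i, agent_j, action_i, action_j) -> float,
                      assumed defined for both orderings of each edge
        edges:        list of (agent_i, agent_j) pairs; every agent appears in some edge
        """
        neighbors = defaultdict(set)
        for i, j in edges:
            neighbors[i].add(j)
            neighbors[j].add(i)

        # messages[(i, j)][a_j]: what agent i tells agent j about j taking action a_j.
        messages = {(i, j): {a: 0.0 for a in actions}
                    for i in neighbors for j in neighbors[i]}

        for _ in range(iterations):
            new_messages = {}
            for (i, j) in messages:
                new_messages[(i, j)] = {
                    a_j: max(
                        local_payoff(i, a_i) + pair_payoff(i, j, a_i, a_j)
                        + sum(messages[(k, i)][a_i] for k in neighbors[i] if k != j)
                        for a_i in actions
                    )
                    for a_j in actions
                }
            messages = new_messages

        # Each agent picks the action maximizing its local payoff plus incoming messages.
        return {
            i: max(actions, key=lambda a: local_payoff(i, a)
                   + sum(messages[(k, i)][a] for k in neighbors[i]))
            for i in neighbors
        }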
Lastly, I hope to continue to explore similarities and differences between
assignment-based decomposition and hierarchical reinforcement learning. In particular, I hope to generalize my work with assignment-based decomposition to
handle complex hierarchical domains and multi-level assignment structures. For
example, one might imagine a domain inspired by real-life army hierarchies. At
the top, there is a general in command of several corps. Each corps has several
divisions, followed by a hierarchical structure of brigades, battalions, companies,
platoons, squads, and finally individual soldiers. A generalization of assignment-level learning to multi-level domains such as this would be an exciting new application. To my knowledge, no previous work in hierarchical RL has explored such
complex domains with many agents.
Bibliography
[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time
dynamic programming. Artif. Intell., 72(1-2):81–138, 1995.
[2] R. E. Bellman. Dynamic Programming. Dover Publications, Incorporated,
2003.
[3] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[4] O. Bräysy and M. Gendreau. Vehicle Routing Problem with Time Windows,
Part II: Metaheuristics. Working Paper, SINTEF Applied Mathematics, Department of Optimisation, Norway, 2003.
[5] T. G. Dietterich. The MAXQ method for hierarchical reinforcement learning.
In J. W. Shavlik, editor, ICML, pages 118–126. Morgan Kaufmann, 1998.
[6] M. Ghavamzadeh and S. Mahadevan. Learning to communicate and act using
hierarchical reinforcement learning. In AAMAS ’04: Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems,
pages 1114–1121, Washington, DC, USA, 2004. IEEE Computer Society.
[7] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans to
new environments in relational MDPs. In IJCAI ’03: International Joint
Conference on Artificial Intelligence, pages 1003–1010, 2003.
[8] C. Guestrin, D. Koller, and R. Parr. Multiagent planning with factored MDPs.
In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS ’01: Proceedings of the Neural Information Processing Systems, pages 1523–1530. MIT
Press, 2001.
[9] C. Guestrin, M. Lagoudakis, and R. Parr. Coordinated reinforcement learning.
In ICML ’02: Proceedings of the 19th International Conference on Machine
Learning, San Francisco, CA, July 2002. Morgan Kaufmann.
[10] C. Guestrin, S. Venkataraman, and D. Koller. Context specific multiagent coordination and planning with factored MDPs. In AAAI ’02: Proceedings of the
18th National Conference on Artificial Intelligence, pages 253–259, Edmonton,
Canada, July 2002.
[11] J. R. Kok and N. A. Vlassis. Sparse cooperative Q-learning. In R. Greiner
and D. Schuurmans, editors, ICML ’04: Proceedings of the 21st International
Conference on Machine Learning, pages 481–488, Banff, Canada, July 2004.
ACM.
[12] J. R. Kok and N. A. Vlassis. Collaborative multiagent reinforcement learning
by payoff propagation. J. Mach. Learn. Res., 7:1789–1828, 2006.
[13] H. Kuhn. The Hungarian Method for the assignment problem. Naval Research
Logistics Quarterly, 2:83–97, 1955.
[14] R. Makar, S. Mahadevan, and M. Ghavamzadeh. Hierarchical multi-agent
reinforcement learning. In AGENTS ’01: Proceedings of the 5th International
Conference on Autonomous Agents, pages 246–253, Montreal, Canada, 2001.
ACM Press.
[15] J. Pearl. Probabilistic reasoning in intelligent systems: networks of plausible
inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.
[16] W. B. Powell and B. Van Roy. Approximate Dynamic Programming for High-Dimensional Dynamic Resource Allocation Problems. In J. Si, A. G. Barto,
W. B. Powell, and D. Wunsch, editors, Handbook of Learning and Approximate
Dynamic Programming. Wiley-IEEE Press, Hoboken, NJ, 2004.
[17] S. Proper and P. Tadepalli. Solving multiagent assignment Markov decision
processes. In AAMAS ’09: Proceedings of the 8th International Joint Conference on Autonomous Agents and Multiagent Systems, pages
681–688, 2009.
[18] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic
Programming. John Wiley, 1994.
[19] A. Schwartz. A Reinforcement Learning Method for Maximizing Undiscounted
Rewards. In ICML ’93: Proceedings of the 10th International Conference on
Machine Learning, pages 298–305, Amherst, Massachusetts, 1993. Morgan
Kaufmann.
[20] N. Secomandi. Comparing Neuro-Dynamic Programming Algorithms for the
Vehicle Routing Problem with Stochastic Demands. Computers and Operations Research, 27(11-12), September 2000.
[21] N. Secomandi. A Rollout Policy for the Vehicle Routing Problem with
Stochastic Demands. Operations Research, 49(5):768–802, 2001.
[22] R. S. Sutton. Integrated architectures for learning, planning, and reacting
based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224. Morgan
Kaufmann, 1990.
[23] R. S. Sutton and A. G. Barto. Reinforcement learning: an introduction. MIT
Press, 1998.
[24] P. Tadepalli and D. Ok. Model-based Average Reward Reinforcement Learning. Artificial Intelligence, 100:177–224, 1998.
[25] M. E. Taylor and P. Stone. Behavior transfer for value-function-based reinforcement learning. In F. Dignum, V. Dignum, S. Koenig, S. Kraus, M. P.
Singh, and M. Wooldridge, editors, AAMAS ’05: The Fourth International
Joint Conference on Autonomous Agents and Multiagent Systems, pages 53–
59, New York, NY, July 2005. ACM Press.
[26] S. Thrun. The role of exploration in learning control. In D. White and
D. Sofge, editors, Handbook for Intelligent Control: Neural, Fuzzy and Adaptive Approaches. Van Nostrand Reinhold, Florence, Kentucky 41022, 1992.
[27] J. N. Tsitsiklis and B. V. Roy. Feature-based methods for large scale dynamic
programming. Machine Learning, 22(1-3):59–94, 1996.
[28] B. Van Roy, D. P. Bertsekas, Y. Lee, and J. N. Tsitsiklis. A Neuro-Dynamic
Programming Approach to Retailer Inventory Management. In Proceedings
of the IEEE Conference on Decision and Control, 1997.
[29] M. Wainwright, T. Jaakkola, and A. Willsky. Tree consistency and bounds
on the performance of the max-product algorithm and its generalizations.
Statistics and Computing, 14(2):143–166, 2004.
[30] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation
and its generalizations. Exploring artificial intelligence in the new millennium,
pages 239–269, 2003.