AN ABSTRACT OF THE THESIS OF

Scott Proper for the degree of Doctor of Philosophy in Computer Science, presented on December 1, 2009.

Title: Scaling Multiagent Reinforcement Learning

Abstract approved: Prasad Tadepalli

Reinforcement learning in real-world domains suffers from three curses of dimensionality: explosions in state and action spaces, and high stochasticity or "outcome space" explosion. Multiagent domains are particularly susceptible to these problems. This thesis describes ways to mitigate these curses in several different multiagent domains, including real-time delivery of products using multiple vehicles with stochastic demands, a multiagent predator-prey domain, and a domain based on a real-time strategy game.

This thesis presents several approaches that mitigate each of these curses. To mitigate the problem of state-space explosion, "tabular linear functions" (TLFs) are introduced, which generalize tile coding and linear value functions and allow learning of complex nonlinear functions in high-dimensional state spaces. It is also shown how to adapt TLFs to relational domains, creating a "lifted" version called relational templates. To mitigate the problem of action-space explosion, the replacement of complete joint action space search with a form of hill climbing is described. To mitigate the problem of outcome-space explosion, a more efficient calculation of the expected value of the next state is shown, and two real-time dynamic programming algorithms based on afterstates, ASH-learning and ATR-learning, are introduced.

Lastly, two approaches are presented that scale by treating a multiagent domain as being formed of several coordinating agents. First, "multiagent H-learning" and "multiagent ASH-learning" are described, where coordination is achieved through a method called "serial coordination". This technique has the benefit of addressing each of the three curses of dimensionality simultaneously by reducing the space of states and actions each local agent must consider. The second approach to multiagent coordination is "assignment-based decomposition", which divides the action selection step into an assignment phase and a primitive action selection phase. Like the multiagent approach, assignment-based decomposition addresses all three curses of dimensionality simultaneously by reducing the space of states and actions each group of agents must consider, and it is capable of much more sophisticated coordination.

Experimental results are presented which show successful application of all methods described. These results demonstrate that the scaling techniques described in this thesis can greatly mitigate the three curses of dimensionality and allow solutions for multiagent domains to scale to large numbers of agents and to complex state and outcome spaces.

© Copyright by Scott Proper
December 1, 2009
All Rights Reserved

Scaling Multiagent Reinforcement Learning

by Scott Proper

A THESIS submitted to Oregon State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Presented December 1, 2009
Commencement June 2010

Doctor of Philosophy thesis of Scott Proper presented on December 1, 2009.

APPROVED:

Major Professor, representing Computer Science

Director of the School of Electrical Engineering and Computer Science

Dean of the Graduate School

I understand that my thesis will become part of the permanent collection of Oregon State University libraries. My signature below authorizes release of my thesis to any reader upon request.
Scott Proper, Author

ACKNOWLEDGEMENTS

My deepest thanks are extended to all those who have supported me, in particular my major professor, Prasad Tadepalli. Without his assistance and support throughout my graduate education, this thesis could never have been completed. In addition I would like to thank the members of my committee: Tom Dietterich, Alan Fern, Ron Metoyer, and Jack Higginbotham for their patience and support. I would also like to thank Neville Mehta, Aaron Wilson, Sriraam Natarajan, and Ronald Bjarnason for their friendship and many useful discussions throughout my research. Very special thanks to my parents, Anna Collins-Proper and Datus Proper, for their love, support, encouragement, and understanding throughout my life. It is because of them that I have had the opportunity to take my education this far. Finally, I gratefully acknowledge the support of the Defense Advanced Research Projects Agency under grant number FA8750-05-2-0249 and the National Science Foundation under grant number IIS-0329278.

TABLE OF CONTENTS

1 Introduction
  1.1 Outline of the Thesis
  1.2 Thesis Organization
2 Background
  2.1 Reinforcement Learning
  2.2 Markov Decision Processes
  2.3 Dynamic Programming
    2.3.1 Total Reward Optimization
    2.3.2 Discounted Reward Optimization
    2.3.3 Average Reward Optimization
  2.4 Model-free Reinforcement Learning
  2.5 Model-based Reinforcement Learning
  2.6 Multiagent Reinforcement Learning
3 The Three Curses of Dimensionality
  3.1 Function Approximation
    3.1.1 Tabular Linear Functions
    3.1.2 Relational Templates
  3.2 Hill Climbing for Action Space Search
  3.3 Reducing Result-Space Explosion
    3.3.1 Efficient Expectation Calculation
    3.3.2 ASH-learning
    3.3.3 ATR-learning
  3.4 Experimental Results
    3.4.1 The Product Delivery Domain
    3.4.2 The Real-Time Strategy Domain
    3.4.3 ASH-learning Experiments
    3.4.4 ATR-learning Experiments
  3.5 Summary
4 Multiagent Learning
  4.1 Multiagent H-learning
    4.1.1 Decomposition of the State Space
    4.1.2 Decomposition of the Action Space
    4.1.3 Serial Coordination
  4.2 Multiagent ASH-learning
  4.3 Experimental Results
    4.3.1 Team Capture Domain
    4.3.2 Experiments
  4.4 Summary
5 Assignment-based Decomposition
  5.1 Model-free Assignment-based Decomposition
  5.2 Model-based Assignment-based Decomposition
  5.3 Assignment Search Techniques
  5.4 Advantages of Assignment-based Decomposition
  5.5 Coordination Graphs
    5.5.1 The Max-plus Algorithm
    5.5.2 Dynamic Coordination
  5.6 Experimental Results
    5.6.1 Multiagent Predator-Prey Domain
    5.6.2 Model-free Reinforcement Learning Experiments
    5.6.3 Model-based Reinforcement Learning Experiments
  5.7 Summary
6 Assignment-level Learning
  6.1 HRL Semantics
  6.2 Function Approximation Semantics
  6.3 Experimental Results
    6.3.1 Four-state MDP Domain
    6.3.2 Real-Time Strategy Game Domain
  6.4 Summary
7 Conclusions
  7.1 Summary of Contributions
  7.2 Discussion and Future Work
Bibliography

LIST OF FIGURES

2.1 Schematic diagram for reinforcement learning.
2.2 The relationship between model-free (direct) and model-based (indirect) reinforcement learning.
3.1 Progression of states ($s$, $s'$, and $s''$) and afterstates ($s_a$ and $s'_{a'}$).
3.2 The product delivery domain, with depot (square) and five shops (circles). Numbers indicate probability of customer visit each time step.
3.3 Comparison of complete search, hill climbing, H- and ASH-learning for the truck-shop tiling approximation.
3.4 Comparison of complete search, hill climbing, H- and ASH-learning for the linear inventory approximation.
3.5 Comparison of hand-coded algorithm vs. ASH-learning with complete search for the truck-shop tiling, linear inventory, and all feature-pairs tiling approximations.
3.6 Comparison of 3 agents vs 1 task domains.
3.7 Comparison of training on various source domains transferred to the 3 Archers vs. 1 Tower domain.
3.8 Comparison of training on various source domains transferred to the Infantry vs. Knight domain.
4.1 DBN showing the creation of afterstates $s_{a_1} \ldots s_{a_m}$ and the final state $s'$ by the actions of agents $a_1 \ldots a_m$ and the environment $E$.
4.2 An example of the team capture domain for 2 pieces per side on a 4x4 grid.
4.3 The tiles used to create the function approximation for the team capture domain.
4.4 Comparison of multiagent, joint agent, H- and ASH-learning for the two vs. two Team Capture domain.
4.5 Comparison of ASH-learning approaches and hand-coded algorithm for the four vs. four Team Capture domain.
4.6 Comparison of multiagent ASH-learning to hand-coded algorithm for the ten vs. ten Team Capture domain.
5.1 A possible coordination graph for a 4-agent domain. Q-values indicate an edge-based decomposition of the graph.
5.2 Messages passed using Max-plus. Each step, every node passes a message to each neighbor.
5.3 A possible state in an 8 vs. 4 toroidal grid predator-prey domain. All eight predators (black) are in a position to possibly capture all four prey (white).
5.4 Comparison of various Q-learning approaches for the product delivery domain.
5.5 Examination of the optimality of policy found by assignment-based decomposition for the product delivery domain.
5.6 Comparison of action selection and search methods for the 4 vs 2 Predator-Prey domain.
5.7 Comparison of action selection and search methods for the 8 vs 4 Predator-Prey domain.
5.8 Comparison of 6 agents vs 2 task domains.
5.9 Comparison of 12 agents vs 4 task domains.
6.1 Information typically examined by assignment-based decomposition.
6.2 Information examined by assignment-based decomposition with assignment-level learning.
6.3 A 4-state MDP with two tasks.
6.4 Comparison of various strategies for assignment-level learning.
6.5 Comparison of assignment-based decomposition with and without assignment-level learning for the 3 vs 2 real-time strategy domain.
6.6 Comparison of 6 archers vs. 2 glass cannons, 2 halls domain.
6.7 Comparison of 6 agents vs 4 tasks domain.

LIST OF TABLES

3.1 Various relational templates used in experiments. See Table 3.2 for descriptions of relational features, and Section 3.4.2 for a description of the domain.
3.2 Meaning of various relational features.
3.3 Different unit types.
3.4 Comparison of execution times for one run.
4.1 Comparison of execution times in seconds for one run of each algorithm. Column labels indicate number of pieces. "–" indicates a test requiring impractically large computation time.
5.1 Running times (in seconds), parameters required, and terms summed over for five algorithms applied to the product delivery domain.
5.2 Experiment data and run times. Columns list domain size, units involved (Archers, Infantry, Towers, Ballista, or Knights), use of transfer learning, assignment search type ("flat" indicates no assignment search), relational templates used for state and afterstate value functions, and average time to complete a single run.
7.1 The contributions of several methods discussed in this thesis towards mitigating the three curses of dimensionality.

LIST OF ALGORITHMS

2.1 The Q-learning algorithm.
2.2 The R-learning algorithm.
2.3 The H-learning algorithm. The agent executes each step when in state $s$.
3.1 The ASH-learning algorithm. The agent executes steps 1–7 when in state $s'$.
3.2 The ATR-learning algorithm, using the update of Equation 3.14.
4.1 The multiagent H-learning algorithm with serial coordination. Each agent $a$ executes each step when in state $s$.
4.2 The multiagent ASH-learning algorithm. Each agent $a$ executes each step when in state $s'$.
5.1 The assignment-based decomposition Q-learning algorithm.
5.2 The ATR-learning algorithm with assignment-based decomposition, using the update of Equations 5.3 and 5.5.
5.3 The centralized anytime Max-plus algorithm.
5.4 The assignment-based decomposition Q-learning algorithm using coordination graphs.
6.1 The assignment-based decomposition with assignment-level learning Q-learning algorithm.
6.2 The ATR-learning algorithm with assignment-based decomposition and assignment-level learning.

DEDICATION

I dedicate this thesis to my mother, Anna, and to my father, Datus, in memoriam.

Chapter 1 – Introduction

1.1 Outline of the Thesis

Reinforcement Learning (RL) is a method of teaching a computer to learn how to act in a given environment or "domain" via trial and error. By repeatedly taking actions and observing the results and an associated reward or cost signal, a computer agent may learn to act in an environment in such a way as to maximize its reward. This kind of technique allows a computer to learn how to solve problems that might be impractical to solve any other way. Reinforcement learning provides a convenient framework to model a variety of stochastic optimization problems [23], which are optimization problems involving probabilistic (random) elements.

Often, reinforcement learning is performed by learning a value function over states or state-action pairs of the domain. This value function maps states or state-action pairs to values, which allow an agent to determine the relative utility of a given state or action. Typically, the value function is stored in a table, such that each single state or state-action pair is mapped to a value. However, table-based approaches to large RL problems suffer from three "curses of dimensionality": explosions in state, action, and outcome spaces [16].
In this thesis, I propose and demonstrate several ways to mitigate these curses in a variety of multiagent domains, including product delivery and routing, multiple predator and prey simulations, and real-time strategy games.

The three main computational obstacles to dealing with large reinforcement learning problems may be described as follows. First, the state space (and the time required for convergence) grows exponentially in the number of variables. Second, the space of possible actions is exponential in the number of agents, so even one-step look-ahead search is computationally expensive. Lastly, exact computation of the expected value of the next state is costly, as the number of possible future states (outcomes) can be exponential in the number of state variables. These three obstacles are referred to as the three "curses of dimensionality".

I introduce methods that effectively address each of the above difficulties, both individually and together, in several different domains. To mitigate the exploding state-space problem, I introduce "tabular linear functions" (TLFs), which can be viewed as linear functions over some features, whose weights are functions of other features. TLFs generalize tables, linear functions, and tile coding, and allow for a fairly flexible mechanism for specifying the space of potential value functions. I show particular uses of these functions in a product delivery domain that achieve a compact representation of the value function and faster learning. I also introduce a "lifted" relational version of TLFs called "relational templates", which I show to facilitate transfer learning in certain domains.

Second, to reduce the computational cost of searching the action space, which is exponential in the number of agents, I introduce a simple hill climbing algorithm that effectively scales to a larger number of agents without sacrificing solution quality.

Third, for model-based reinforcement learning algorithms, the expected value of the next state must be calculated at every step. Unfortunately, many domains have a high stochastic branching factor (number of possible next states) when the state is a Cartesian product of several random state variables. I provide two solutions to this problem. First, I take advantage of the factoring of the action model and the partial linearity of the value function to decompose this computation. Second, I introduce two new algorithms called ASH-learning and ATR-learning, which are "afterstate" versions of model-based reinforcement learning in the average reward and total reward settings respectively. These algorithms learn by distinguishing between the action-dependent and action-independent effects of an agent's action [23]. I show experimental results in a product delivery domain and a real-time strategy game that demonstrate that my methods are effective in ameliorating the three curses of dimensionality that limit the applicability of reinforcement learning.

The above methods address each curse of dimensionality individually. A method that addresses all curses of dimensionality simultaneously would be very useful. To this end, I introduce multiagent versions of the H-learning and ASH-learning algorithms. By decomposing the state and action spaces of a joint agent into several weakly coordinating agents, we can simultaneously address each of the three curses of dimensionality. Results for this approach are demonstrated in a "Team Capture" domain.
When implementing a multiagent RL solution, coordination between agents is the main difficulty, and it determines the tradeoff between speed and solution quality. The weak "serial coordination" introduced by the multiagent H-learning and ASH-learning methods may not be enough. To improve coordination, I introduce both a model-free and a model-based version of the assignment-based decomposition architecture. In this architecture, the action space is divided into a task assignment level and a task execution level. At the task assignment level, each task is assigned a group of agents. At the task execution level, each agent is assigned primitive actions to perform based on its task. Since the task execution is mostly local, I learn a low-dimensional relational template-based value function for this level. Since the task assignment level is global, I use various exact and approximate search algorithms to do the task assignment. While this two-level decomposition resembles that of hierarchical multiagent reinforcement learning [14], in contrast to that work, I do not require that a value function be stored at the root level. Such a root-level value function must, in the worst case, be over the joint state-space of all agents, and is therefore intractable. I demonstrate results showing that using search over values given by a lower-level value function at the assignment level allows my system to scale up to 12 agents, and potentially well beyond this.

Assignment-based decomposition is flexible enough to allow a value function at all levels of the decision-making process, although as described above one is not required. Later in this thesis, I show how such a value function may be added to the assignment level and how doing so may allow improved assignment decisions under certain circumstances. However, because of the scaling limitations of requiring a global value function over all agents, using an assignment-level value function can limit the total number of agents at the top level of the decision-making process.

In this thesis, I also explore the usefulness of transfer learning when applied to the scaling problem. Transfer learning is a research problem in machine learning that focuses on storing knowledge gained while solving one problem and applying it to a different but related problem. I explore this in the context of a real-time strategy game by exploiting the benefits of relational templates. I present three kinds of transfer learning results in real-time strategy games. First, I show how my approach enables transfer of value function knowledge between different but similarly-sized multiagent domains. Second, I show transfer across domains with different numbers of agents and tasks. Combining these two types of transfer learning, I then show how knowledge gained from learning in several small domains may be transferred to solve problems in large domains with multiple types of agents and tasks. Thus, transfer learning may be applied to scale reinforcement learning for multiagent domains.

1.2 Thesis Organization

The rest of the thesis is organized as follows: Chapter 2 introduces reinforcement learning and Markov Decision Processes. It describes two previously-studied reinforcement learning methods: a model-free discounted learning method called Q-learning and a model-based average-reward learning method called H-learning. These algorithms form the basis of the algorithms described in later chapters.
Chapter 3 discusses the three curses of dimensionality and some methods for mitigating them individually. These methods include two kinds of function approximation, which serve to mitigate the first curse of dimensionality (exploding state space); a hill climbing action selection approach, which I use to mitigate the second curse of dimensionality (exploding action space); and the ASH-learning and ATR-learning algorithms, which are techniques for mitigating the third curse of dimensionality (exploding outcome space) for model-based RL algorithms. Finally, I introduce the product delivery domain and describe experimental results for the above techniques.

Chapter 4 describes a method for learning in multiagent domains which simultaneously addresses each of the three curses of dimensionality. Decomposed versions of the H-learning and ASH-learning algorithms are provided. The "Team Capture" domain is explained, and experimental results for this work are presented.

Chapter 5 introduces assignment-based decomposition, a more sophisticated coordination technique for multiagent domains, for both model-free (Q-learning) and model-based (ATR-learning) reinforcement learning algorithms. I also show how to use coordination graphs together with assignment-based decomposition. I finally describe a new "Predator-Prey" domain and the results of my experiments on this and other domains using assignment-based decomposition.

Chapter 6 introduces assignment-level learning, which adds a value function over certain "global" features of the state to the top assignment-level decision. This allows assignment-based decomposition to solve certain problems it might otherwise have difficulty with, but limits the scalability of the algorithm.

Finally, in Chapter 7, I summarize my results and discuss potential future work.

Chapter 2 – Background

This chapter outlines the background of Reinforcement Learning (RL) and describes two previously studied reinforcement learning methods: Q-learning and H-learning. These two algorithms form a basis upon which I build several new reinforcement learning algorithms described in later chapters.

2.1 Reinforcement Learning

Reinforcement learning is the problem faced by a learning agent that must learn to act by trial-and-error interactions with its environment. In the standard reinforcement learning paradigm, an agent is connected to its environment via perception and action, as shown in Figure 2.1.

[Figure 2.1: Schematic diagram for reinforcement learning.]

In each step of interaction, the agent senses the environment and then selects an action to change the state of the environment. This state transition generates a reinforcement signal – reward or penalty – that is received by the agent. While taking actions by trial-and-error, the agent may incrementally learn a "value function" over states or state-action pairs, which indicates their utility to that agent. The goal of reinforcement learning methods is to arrive, by performing actions and observing their outcomes, at a policy, i.e. a mapping from states to actions, which maximizes some measure of the accumulated reward over time. RL methods differ according to the exact measure and optimization criteria they use to select actions. These methods apply a trial-and-error methodology to explore the environment over time and arrive at a desired policy.

2.2 Markov Decision Processes

The agent's environment is modeled as a Markov Decision Process (MDP).
An MDP is a tuple $\langle S, A, P, R \rangle$ where $S$ is a finite set of $n$ discrete states and $A$ is a finite set of actions available to the agent. The set of actions which are applicable in a state $s$ are denoted by $A(s)$ and are called admissible. The actions are stochastic and Markovian in that an action $a$ in a given state $s \in S$ results in a state $s'$ with fixed probability $P(s'|s,a)$. This probability matrix is called the action model of $s$. The reward function $R : S \times A \to \mathbb{R}$ returns the reward $R(s,a)$ after taking action $a$ in state $s$, also called the reward model of $s$. The action and reward models of an MDP are called its "domain model". Each action is assumed to take one time step. An agent's policy is defined as a mapping $\pi : S \to A$, such that the agent executes action $\pi(s) \in A$ when in state $s$. A stationary policy is one which does not change with time. A deterministic policy always maps the same state to the same action. For the remainder of this thesis, "policy" refers to a stationary deterministic policy.

Instead of directly learning a policy, in RL the agent may learn a value function that estimates the value of each state. At any time, RL methods use one-step lookahead with the current value function to choose the best action in each state by some kind of maximization. Therefore the policies that RL methods learn are called "greedy" with respect to their value functions. In addition to such greedy actions, RL methods also take some directed or random (exploratory) actions. These exploratory actions ensure that all reachable states are explored with sufficient frequency so that a learning method does not get stuck in a local maximum. There are several exploration strategies. The random exploration strategy takes random actions with a fixed probability, giving high probabilities to actions with high values [1]. Counter-based exploration prefers to execute actions that lead to less frequently visited states [26]. Recency-based exploration prefers actions which have not been executed recently in a given state [22]. In this thesis, I use an $\epsilon$-greedy strategy in all my experiments, which takes a random action with probability $\epsilon$ and a greedy action with probability $1 - \epsilon$.
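As a concrete illustration, the following is a minimal Python sketch of $\epsilon$-greedy action selection over a tabular action-value function. The dictionary-based `Q` table and the `actions` list are illustrative assumptions, not the representation used in this thesis.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Return a random admissible action with probability epsilon;
    otherwise return the greedy action arg max_a Q(state, a)."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Unseen state-action pairs default to a value of 0.0.
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```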
2.3 Dynamic Programming

Given a complete and accurate model of an MDP in the form of the action and reward models $P(s'|s,a)$ and $R(s,a)$, it is possible to solve the decision problem off-line by applying Dynamic Programming (DP) algorithms [2, 3, 18]. The recurrence relation of DP differs according to the optimization criterion: total reward optimization, discounted reward optimization, or average reward optimization.

2.3.1 Total Reward Optimization

Suppose that an agent using a policy $\pi$ goes through states $s_0, \ldots, s_t$ in time $0$ through $t$, with some probability. The cumulative sum of rewards received by following a policy $\pi$ starting from any state $s_0$ is given by:

$$V^{\pi}(s_0) = \lim_{t \to \infty} E\left(\sum_{k=0}^{t-1} R(s_k, \pi(s_k))\right) \qquad (2.1)$$

When there is an absorbing goal state $g$ which is reachable from every state under every stationary policy, and from which there are no transitions to other states, the value function for a given policy $\pi$ can be computed using the following recurrence relations:

$$V^{\pi}(g) = 0 \qquad (2.2)$$

$$\forall s \neq g \quad V^{\pi}(s) = R(s, \pi(s)) + \sum_{s' \in S} P(s'|s, \pi(s)) V^{\pi}(s') \qquad (2.3)$$

An optimal total reward policy $\pi^*$ maximizes the above value function over all states $s_0$ and policies $\pi$, i.e. $V^{\pi^*}(s_0) \geq V^{\pi}(s_0)$. Under the above conditions, the value function for the optimal total reward policy $\pi^*$ can be computed by:

$$V^{\pi^*}(g) = 0 \qquad (2.4)$$

$$V^{\pi^*}(s) = \max_{a \in A(s)} \left\{ R(s,a) + \sum_{s' \in S} P(s'|s,a) V^{\pi^*}(s') \right\} \qquad (2.5)$$

where $\pi^*$ indicates the optimal policy, and thus $V^{\pi^*}(s)$ is a value function corresponding to the optimal policy.

2.3.2 Discounted Reward Optimization

Total reward is a good candidate to optimize, but if the agent has an infinite horizon and there is no absorbing goal state, the total reward approaches $\infty$. One way to make this total finite is to exponentially discount future rewards. In other words, one unit of reward received after one time step is considered equivalent to a reward of $\gamma < 1$ received immediately. We now maximize the discounted cumulative sum of rewards received by following a policy. The discounted total reward received by following a policy $\pi$ from state $s_0$ is given by:

$$f_{\gamma}^{\pi}(s_0) = \lim_{t \to \infty} E\left(\sum_{k=0}^{t-1} \gamma^k R(s_k, \pi(s_k))\right) \qquad (2.6)$$

where $\gamma < 1$ is the discount factor. Discounting by $\gamma < 1$ makes $f_{\gamma}^{\pi}(s_0)$ finite. The value function above can be computed for any state by solving the following set of simultaneous recurrence relations:

$$f^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s' \in S} P(s'|s, \pi(s)) f^{\pi}(s') \qquad (2.7)$$

An optimal discounted policy $\pi^*$ maximizes the above value function over all states $s$ and policies $\pi$. It can be shown to satisfy the following recurrence relation [1, 3]:

$$f^{\pi^*}(s) = \max_{a \in A(s)} \left\{ R(s,a) + \gamma \sum_{s' \in S} P(s'|s,a) f^{\pi^*}(s') \right\} \qquad (2.8)$$

2.3.3 Average Reward Optimization

For average reward optimization, we seek to optimize the average reward per time step computed over time $t$ as $t \to \infty$, which is called the gain [18]. For a given starting state $s_0$ and policy $\pi$, the gain is given by Equation 2.9, where $r^{\pi}(s_0,t)$ is the total reward in $t$ steps when policy $\pi$ is used starting at state $s_0$, and $E(r^{\pi}(s_0,t))$ is its expected value:

$$\rho^{\pi}(s_0) = \lim_{t \to \infty} \frac{1}{t} E(r^{\pi}(s_0, t)) \qquad (2.9)$$

The goal of average reward learning is to learn a policy that achieves near-optimal gain by executing actions, receiving rewards, and learning from them. A policy that optimizes the gain is called a gain-optimal policy. The expected total reward in time $t$ for optimal policies depends on the starting state $s$ and can be written as the sum $\rho(s) \cdot t + h_t(s)$, where $\rho(s)$ is its gain. The Cesàro limit (or expected value) of the second term $h_t(s)$ as $t \to \infty$ is called the bias of state $s$ and is denoted by $h(s)$. In communicating MDPs, where every state is reachable from every other state, the optimal gain $\rho^*$ is independent of the starting state [18]. This is because as $t \to \infty$, we can expect to visit every state infinitely often (including the starting state), and the contribution of the starting state will be included as a part of the average reward. $\rho^*$ and the biases of the states satisfy the following Bellman equation:

$$h(s) = \max_{a \in A(s)} \left\{ r(s,a) + \sum_{s'=1}^{N} p(s'|s,a) h(s') \right\} - \rho^* \qquad (2.10)$$
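To make the dynamic programming recurrences concrete, here is a minimal value-iteration sketch for the discounted criterion of Equation 2.8. The dictionary-based model layout (`P`, `R`) and the `actions` callback are assumptions made for illustration; this is not the implementation used in the experiments.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman backup of Equation 2.8 until the
    value function changes by less than tol in a full sweep.
    P[(s, a)] maps each next state s2 to P(s2 | s, a);
    R[(s, a)] is the expected immediate reward."""
    f = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backup = max(
                R[(s, a)] + gamma * sum(p * f[s2] for s2, p in P[(s, a)].items())
                for a in actions(s))
            delta = max(delta, abs(backup - f[s]))
            f[s] = backup
        if delta < tol:
            return f
```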
2.4 Model-free Reinforcement Learning

There are two main roles for experience in a reinforcement learning agent: it may be used to directly learn the policy, or it may be used to learn a model, which can then be used to plan and to learn a value function or policy. This relationship is visualized in Figure 2.2.

[Figure 2.2: The relationship between model-free (direct) and model-based (indirect) reinforcement learning.]

Using experience to directly learn the policy is called "direct RL" [23] or "model-free" reinforcement learning. In this case, the model is learned implicitly as a part of the value function or policy. The case in which the model is learned explicitly is called "indirect RL" or "model-based" reinforcement learning, and is covered in Section 2.5. Both model-free and model-based RL have advantages: for example, model-free methods will not be affected by biases in the structure or design of the model. In this section I describe two common model-free algorithms: Q-learning and R-learning.

Q-learning can be a discounted or total reward algorithm. The objective is to find an optimal policy $\pi^*$ that maximizes the expected discounted future reward for each state $s$. The MDP is assumed to have an infinite horizon, and so future rewards are discounted exponentially with a discount factor $\gamma \in [0,1)$. The optimal action-value function or Q-function gives the expected discounted future reward for any state $s$ when executing action $a$ and then following the optimal policy. The Q-function satisfies the following recurrence relation:

$$Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a') \qquad (2.11)$$

The optimal policy for a state $s$ is the action $\arg\max_a Q^*(s,a)$ that maximizes the expected future discounted reward. The most common form of the Q-learning algorithm is shown in Algorithm 2.1.

Algorithm 2.1: The Q-learning algorithm.
1. Initialize $Q(s,a)$ arbitrarily
2. Initialize $s$ to any starting state
3. for each step do
4.   Choose action $a$ from $s$ using an $\epsilon$-greedy policy derived from $Q$
5.   Take action $a$, observe reward $r$ and next state $s'$
6.   $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right]$
7.   $s \leftarrow s'$
8. end

R-learning [19] is an off-policy model-free average reward reinforcement learning algorithm. As with all average reward algorithms, the objective is to find an optimal policy $\pi^*$ that maximizes the reward per time step. The MDP is assumed to have an infinite horizon, but unlike discounted methods, the value functions for a policy are defined relative to the average expected reward per time step under the policy. R-learning is a standard TD control method similar to Q-learning. It maintains a value function for each state-action pair and a running estimate of the average reward $\rho$, which is an approximation of $\rho^{\pi}$, the true optimal average reward. The complete algorithm is shown in Algorithm 2.2. Note that I do not conduct any experiments using this algorithm in this thesis; it is included here because of its similarity to other methods I discuss, in particular H-learning, which is discussed in the next section.

Algorithm 2.2: The R-learning algorithm.
1. Initialize $\rho$ and $Q(s,a)$ arbitrarily
2. Initialize $s$ to any starting state
3. for each step do
4.   Choose action $a$ from $s$ using an $\epsilon$-greedy policy derived from $Q$
5.   Take action $a$, observe reward $r$ and next state $s'$
6.   $Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r - \rho + \max_{a' \in A(s')} Q(s',a') - Q(s,a) \right]$
7.   if $Q(s,a) = \max_{a \in A(s)} Q(s,a)$ then
8.     $\rho \leftarrow \rho + \beta \left[ r - \rho + \max_{a' \in A(s')} Q(s',a') - \max_{a \in A(s)} Q(s,a) \right]$
9.   end
10.  $s \leftarrow s'$
11. end
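The following Python sketch shows the inner loop of Algorithm 2.1 over a dictionary-based Q-table. The `env` interface (`reset` and `step`) and the `actions` callback are hypothetical stand-ins for whatever simulator is in use.

```python
import random

def q_learning(env, actions, steps=10000, alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Q-learning in the style of Algorithm 2.1. Assumes
    env.reset() -> state and env.step(action) -> (reward, next_state),
    and that actions(s) is non-empty for every reachable state."""
    Q = {}
    s = env.reset()
    for _ in range(steps):
        # Epsilon-greedy action selection derived from Q.
        if random.random() < epsilon:
            a = random.choice(actions(s))
        else:
            a = max(actions(s), key=lambda u: Q.get((s, u), 0.0))
        r, s2 = env.step(a)
        # One-step backup toward r + gamma * max_a' Q(s', a').
        best_next = max(Q.get((s2, u), 0.0) for u in actions(s2))
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (
            r + gamma * best_next - Q.get((s, a), 0.0))
        s = s2
    return Q
```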
2.5 Model-based Reinforcement Learning

Model-based RL has several advantages over model-free methods: indirect methods can often make fuller use of limited experience, and thus converge to a better policy given fewer interactions with the environment. Having a model also provides more options: one can choose to learn the model first and then use planning approaches to learn a policy off-line, for example. This allows the luxury of testing several different forms of function approximation to determine which works best with a given problem, while minimizing the amount of experience data that must be gathered. In addition, it is usually the case that a value function learned using a model needs to store a value only for each state, not each state-action pair as with model-free methods such as Q-learning. If the model is compact, this can result in many fewer parameters required to store the value function.

In this thesis, I explore several model-based RL algorithms based on "H-learning", which is an average reward learning algorithm. H-learning is model-based in that it uses explicitly represented action models $p(s'|s,u)$ and $r(s,u)$. In previous work, H-learning has been found to be more robust and faster than its model-free counterpart, R-learning [19, 24].

At every step, the H-learning algorithm updates the parameters of the value function in the direction of reducing the temporal difference error (TDE), i.e., the difference between the r.h.s. and the l.h.s. of the Bellman Equation 2.10:

$$TDE(s) = \max_{a \in A(s)} \left\{ r(s,a) + \sum_{s'=1}^{N} p(s'|s,a) h(s') \right\} - \rho - h(s) \qquad (2.12)$$

One issue that still needs to be addressed in average-reward RL is the estimation of $\rho^*$, the optimal gain. Since it is unknown, H-learning uses $\rho$, an estimate of the average reward of the current greedy policy, instead. From Equation 2.10, it can be seen that $r(s,u) + h(s') - h(s)$ gives an unbiased estimate of $\rho^*$ when action $u$ is greedy in state $s$ and $s'$ is the next state. We may thus update $\rho$ as follows in every step:

$$\rho \leftarrow (1-\alpha)\rho + \alpha \left( r(s,a) - h(s) + h(s') \right) \qquad (2.13)$$

See Algorithm 2.3 for the complete algorithm.

Algorithm 2.3: The H-learning algorithm. The agent executes each step when in state $s$.
1. Find an action $u \in U(s)$ that maximizes $R(s,u) + \sum_{q=1}^{N} p(q|s,u) h(q)$
2. Take an exploratory action or a greedy action in the current state $s$. Let $a$ be the action taken, $s'$ the resulting state, and $r_{imm}$ the immediate reward received.
3. Update the model parameters for $p(s'|s,a)$ and $R(s,a)$
4. if a greedy action was taken then
5.   $\rho \leftarrow (1-\alpha)\rho + \alpha(R(s,a) - h(s) + h(s'))$
6.   $\alpha \leftarrow \frac{\alpha}{\alpha+1}$
7. end
8. $h(s) \leftarrow \max_{u \in U(s)} \left\{ R(s,u) + \sum_{q=1}^{N} p(q|s,u) h(q) \right\} - \rho$
9. $s \leftarrow s'$
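The sketch below shows one way the per-step logic of Algorithm 2.3 might be organized in Python, with the action and reward models estimated from transition counts. The class layout and the count-based model are assumptions for illustration, not the implementation used in this thesis.

```python
from collections import defaultdict

class HLearner:
    """One-step H-learning (Algorithm 2.3) with the action and reward
    models estimated from counts."""

    def __init__(self, admissible, alpha0=1.0):
        self.h = defaultdict(float)                           # s -> h(s)
        self.counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s2: n}
        self.visits = defaultdict(int)                        # (s,a) -> n
        self.reward_sum = defaultdict(float)                  # (s,a) -> total reward
        self.rho, self.alpha = 0.0, alpha0
        self.admissible = admissible                          # s -> list of actions

    def backup(self, s, a):
        """R(s,a) + sum_q p(q|s,a) h(q) under the learned model."""
        n = self.visits[(s, a)]
        if n == 0:
            return 0.0
        expected_h = sum(c / n * self.h[q] for q, c in self.counts[(s, a)].items())
        return self.reward_sum[(s, a)] / n + expected_h

    def step(self, s, a, reward, s2, was_greedy):
        # Update the model parameters for p(.|s,a) and R(s,a).
        self.visits[(s, a)] += 1
        self.counts[(s, a)][s2] += 1
        self.reward_sum[(s, a)] += reward
        # Update rho on greedy steps (Equation 2.13), then decay alpha.
        if was_greedy:
            self.rho += self.alpha * (reward - self.h[s] + self.h[s2] - self.rho)
            self.alpha = self.alpha / (self.alpha + 1)
        # Bellman backup of Equation 2.10 at the current state.
        self.h[s] = max(self.backup(s, u) for u in self.admissible(s)) - self.rho
```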
2.6 Multiagent Reinforcement Learning

The term "multiagent" may have several meanings. In this thesis, when referring to a description of a domain, "multiagent" means a factored joint action space, i.e. the actions available to the entire agent may be expressed as a Cartesian product of $n$ sets $a_1, a_2, \ldots, a_n$, each set corresponding to the actions of a separate, independent agent. All domains described in this thesis are multiagent domains in this sense. When referring to algorithms for solving multiagent domains, there are two general categories: a "joint agent" approach, which treats the full (or joint) set of all agent actions as a large list of actions which must be exhaustively searched to find the correct joint action, or a "multiagent" approach, which takes full advantage of the factored nature of the action space and typically searches only a local space of actions unique to each agent, for each agent. The joint agent approach is often slow, but an exhaustive search of the action space may sometimes find solutions that a multiagent approach could not. However, joint agent approaches are unlikely to scale to large numbers of agents. In this thesis, I discuss methods for scaling joint agent approaches in Chapter 3, and multiagent approaches in Chapters 4 and 5.

Typically, the challenge of multiagent approaches involves introducing enough coordination between agents so that the absence of an exhaustive search of the action space is mitigated. Within multiagent algorithms, there are two broad approaches: first, using a centralized multiagent approach to mitigate difficulties in scaling due to a large joint action space, or second, a distributed multiagent approach that is required due to the constraints of the domain, for example, when some method is needed to coordinate the actions of multiple robots acting in the world. The main difference between these approaches is that communication and sharing of data between agents is easy or effortless in the case of a centralized approach emphasizing scaling. For domains requiring a distributed multiagent approach, communication usually carries some sort of cost. In this thesis, I focus entirely on the benefits of a multiagent approach to scaling, and do not consider problems of communication between agents.

I present a brief example of a typical multiagent RL algorithm here, by adapting Q-learning to a multiagent context. In a multiagent approach, the global Q-function $Q(s,a)$ is approximated as a sum of agent-specific action-value functions: $Q(s,a) = \sum_{i=1}^{n} Q_i(s_i, a_i)$ [12]. Further, I approximate each agent-specific action-value as a function only of each agent's local state $s_i$. A "selfish" agent-based version of multiagent Q-learning [23] updates each agent's Q-value independently using the update function:

$$Q_i(s_i,a_i) \leftarrow Q_i(s_i,a_i) + \alpha \left[ R_i(s,a) + \gamma Q_i(s'_i, a^*_i) - Q_i(s_i,a_i) \right] \qquad (2.14)$$

where $\alpha \in [0,1]$ is the learning rate. The notation $Q_i$ indicates only that the Q-value is agent-based. The parameters used to store the Q-function may either be unique to that agent or shared between all agents. The term $R_i$ indicates that the reward is factored, i.e. a separate reward signal is given to each agent.
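A minimal sketch of the selfish update of Equation 2.14 follows, assuming per-agent local states, sub-actions, and factored rewards are passed as parallel lists; this data layout is an assumption for illustration.

```python
def multiagent_q_update(Q, local_states, joint_action, rewards,
                        next_states, greedy_next, alpha=0.1, gamma=0.95):
    """Apply Equation 2.14 independently for each agent i:
    Q_i(s_i, a_i) += alpha * (R_i + gamma * Q_i(s'_i, a*_i) - Q_i(s_i, a_i)).
    greedy_next[i] is agent i's greedy next action a*_i. Keys carry the
    agent index i, so parameters are unique to each agent (they could
    instead be shared by dropping i from the key)."""
    for i, (si, ai) in enumerate(zip(local_states, joint_action)):
        old = Q.get((i, si, ai), 0.0)
        target = rewards[i] + gamma * Q.get((i, next_states[i], greedy_next[i]), 0.0)
        Q[(i, si, ai)] = old + alpha * (target - old)
```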
Chapter 3 – The Three Curses of Dimensionality

Reinforcement learning algorithms suffer from three "curses of dimensionality": explosions in state and action spaces, and a large number of possible next states of an action due to stochasticity (or "outcome space" explosion) [16]. In this chapter, I explore several methods for mitigating each of these three curses individually. To mitigate the explosion in the state space, I introduce two related methods of function approximation, Tabular Linear Functions (TLFs) and relational templates. To mitigate the explosion in the action space common to multiagent algorithms, I suggest an approximate search of the action space using hill climbing. To help mitigate the explosion in the number of result states, I introduce ASH-learning and ATR-learning, which are hybrid model-free/model-based approaches using afterstates.

3.1 Function Approximation

Unfortunately, table-based reinforcement learning does not scale to large state spaces such as those explored in this thesis, due both to limitations of space and to convergence speed. The value function needs to be approximated using a more compact representation to make it scale. Linear function approximators are among the simplest and fastest means of approximation. However, since the value function is usually highly nonlinear in the primitive features of the domain, the user needs to carefully hand-design high-level features so that the value function can be approximated by a function which is linear in them [27].

In the following sections I introduce two related function approximation schemes, "Tabular Linear Functions" (TLFs) and "Relational Templates", which generalize linear functions, tables, and tile coding. Usually, a TLF expresses a tradeoff between the small number of parameters used by typical linear functions and the expressiveness of a complete table. Like any table, TLFs may express a nonlinear value function, but like a linear function, the value function may be stored compactly.

3.1.1 Tabular Linear Functions

A tabular linear function is a linear function of a set of "linear" features of the state, where the weights of the linear function are arbitrary functions of other discretized (or "nominal") features. Hence the weights can be stored in a table indexed by the nominal features; when multiplied with the linear features of the state and summed, they produce the final value function.

More formally, a tabular linear function (TLF) is represented by Equation 3.1, which is a sum of $n$ terms. Each term is a product of a linear feature $\phi_i$ and a weight $\theta_i$. The features $\phi_i$ need not be distinct from each other, although they usually are. Each weight $\theta_i$ is a function of $m_i$ nominal features $f_{i,1}, \ldots, f_{i,m_i}$.

$$v(s) = \sum_{i=1}^{n} \theta_i(f_{i,1}(s), \ldots, f_{i,m_i}(s)) \, \phi_i(s) \qquad (3.1)$$

A TLF reduces to a linear function when there are no nominal features, i.e. when $\theta_1, \ldots, \theta_n$ are scalar values. One can also view any TLF as a purely linear function where there is a term for every possible set of values of the nominal features:

$$v(s) = \sum_{i=1}^{n} \sum_{k \in K} \theta_{i,k} \, \phi_i(s) \, I(f_i(s) = k) \qquad (3.2)$$

Here $I(f_i(s) = k)$ is 1 if $f_i(s) = k$ and 0 otherwise, $f_i(s)$ is the vector of values $f_{i,1}(s), \ldots, f_{i,m_i}(s)$ in Equation 3.1, and $K$ is the set of all possible vectors. TLFs reduce to a table when there is a single term and no linear features, i.e., $n = 1$ and $\phi_1 = 1$ for all states. They reduce to tile coding or coarse coding when there are no linear features but there are multiple terms, i.e., $\phi_i = 1$ for all $i$ and $n \geq 1$. The nominal features of each term can be viewed as defining a tiling or partition of the state space into overlapping regions, and the terms are simply added up to yield the final value of the state [23].

Most forms of TLF are created using prior knowledge about the domain (see Section 3.4.1). What if such prior knowledge does not exist? It is still possible to take advantage of tabular linear functions by constraining them with some syntactic restrictions. For example, we can define a set of terms over all possible pairs (or triples) of primitive state features. We then sum over all $\binom{n}{2}$ terms, where $n$ indicates the number of primitive state features. For example, if we have four features $f_1 \ldots f_4$, we will then have 6 possible tuples of 2 features each:

$$v(s) = \theta_1(f_1,f_2) + \theta_2(f_1,f_3) + \theta_3(f_1,f_4) + \theta_4(f_2,f_3) + \theta_5(f_2,f_4) + \theta_6(f_3,f_4) \qquad (3.3)$$

Since this is also an instance of tile coding, I call this all feature-pairs tiling.

An advantage of TLFs is that they provide a flexible but simple framework to consider and incorporate different assumptions about the functional form of the value function and the set of relevant features. In general, the value function is represented as a parameterized functional form of Equation 3.1 with weights $\theta_1, \ldots, \theta_n$ and linear features $\phi_1, \ldots, \phi_n$. Each weight $\theta_i$ is a function of $m_i$ nominal features $f_{i,1}, \ldots, f_{i,m_i}$, and is updated using the following equation:

$$\theta_i(f_{i,1}(s), \ldots, f_{i,m_i}(s)) \leftarrow \theta_i(f_{i,1}(s), \ldots, f_{i,m_i}(s)) + \beta \, TDE(s) \, \nabla_{\theta_i} v(s) \qquad (3.4)$$

where $\nabla_{\theta_i} v(s) = \phi_i(s)$ and $\beta$ is the learning rate. The above update adjusts the value function to reduce the temporal difference error in state $s$, and is very similar to the update used for ordinary linear value functions. Unlike with a normal linear value function, however, only those table entries that match the current state's nominal features are updated, in proportion to the value of the linear feature $\phi_i(s)$.
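As an illustration, here is a minimal TLF sketch in Python, assuming the state is an indexable vector of discretized primitive features; the class and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

class TabularLinearFunction:
    """Sketch of a TLF (Equation 3.1): v(s) = sum_i theta_i(f_i(s)) * phi_i(s).
    Each term pairs a nominal-feature function (returning a hashable tuple
    of discretized values) with a linear feature; the weights live in a
    table indexed by the nominal values."""

    def __init__(self, terms):
        # terms: list of (nominal_fn, linear_fn) pairs.
        self.terms = terms
        self.tables = [defaultdict(float) for _ in terms]

    def value(self, s):
        return sum(table[nominal(s)] * linear(s)
                   for table, (nominal, linear) in zip(self.tables, self.terms))

    def update(self, s, tde, beta=0.01):
        """Gradient step of Equation 3.4: only the entries matching the
        current nominal features move, in proportion to phi_i(s)."""
        for table, (nominal, linear) in zip(self.tables, self.terms):
            table[nominal(s)] += beta * tde * linear(s)

def all_feature_pairs(num_features):
    """The all feature-pairs tiling of Equation 3.3: one table per pair
    of primitive features, with no linear features (phi_i = 1)."""
    return [(lambda s, i=i, j=j: (s[i], s[j]), lambda s: 1.0)
            for i, j in combinations(range(num_features), 2)]

tlf = TabularLinearFunction(all_feature_pairs(4))  # six pairwise terms
```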
3.1.2 Relational Templates

Many domains are object-oriented, where the state consists of multiple objects or units of different classes, each with multiple attributes. Relational templates are a "lifted" version of tabular linear functions, generalizing them to object-oriented domains [7]. A relational template is defined by a set of relational features over shared variables (see Table 3.1). Each relational feature may have certain constraints on the objects that can be passed to it; for example, in Table 3.2 each feature has a type constraint on its variables.

Table 3.1: Various relational templates used in experiments. See Table 3.2 for descriptions of relational features, and Section 3.4.2 for a description of the domain.

No. | Description
#1  | ⟨Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(B)⟩
#2  | ⟨UnitType(B), TaskType(A), Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(A)⟩
#3  | ⟨UnitType(B), Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(A)⟩
#4  | ⟨TaskType(A), Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(A)⟩
#5  | ⟨Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(B), TasksInrange(B)⟩
#6  | ⟨UnitType(B), TaskType(A), Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(A), TasksInrange(B)⟩
#7  | ⟨UnitType(B), Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(A), TasksInrange(B)⟩
#8  | ⟨TaskType(A), Distance(A,B), AgentHP(B), TaskHP(A), UnitsInrange(A), TasksInrange(B)⟩
#9  | ⟨UnitX(A), UnitY(A), UnitX(B), UnitY(B)⟩

Table 3.2: Meaning of various relational features.

Feature          | Constraint          | Meaning
Distance(A, B)   | Task(A) ∧ Agent(B)  | Manhattan distance between units
AgentHP(B)       | Agent(B)            | Hit points of an agent
TaskHP(A)        | Task(A)             | Hit points of a task
UnitsInrange(A)  | Task(A)             | Count of the number of agents able to attack a task
TasksInrange(B)  | Agent(B)            | Count of the number of enemies able to attack an agent
UnitX(B)         | Agent(B)            | X-coordinate of an agent
UnitY(B)         | Agent(B)            | Y-coordinate of an agent
UnitType(B)      | Agent(B)            | Type (archery or infantry) of an agent
TaskType(A)      | Task(A)             | Type (tower, ballista, or knight) of a task

Each template is instantiated in a state by binding its variables to units of the correct type. An instantiated template $i$ defines a table $\theta_i$ indexed by the values of its features in the current state. In general, each template may give rise to multiple instantiations in the same state. The value $v(s)$ of a state $s$ is the sum of the values represented by all instantiations of all templates:

$$v(s) = \sum_{i=1}^{n} \sum_{\sigma \in I(i,s)} \theta_i(f_{i,1}(s,\sigma), \ldots, f_{i,m_i}(s,\sigma)) \qquad (3.5)$$

where $i$ is a particular template, $I(i,s)$ is the set of possible instantiations of $i$ in state $s$, and $\sigma$ is a particular instantiation of $i$ that binds the variables of the template to units in the state. The relational features $f_{i,1}(s,\sigma), \ldots, f_{i,m_i}(s,\sigma)$ map state $s$ and instantiation $\sigma$ to discrete values which index into the table $\theta_i$. All instantiations of each template $i$ share the same table $\theta_i$, which is updated for each $\sigma$ using the following equation:

$$\theta_i(f_{i,1}(s,\sigma), \ldots, f_{i,m_i}(s,\sigma)) \leftarrow \theta_i(f_{i,1}(s,\sigma), \ldots, f_{i,m_i}(s,\sigma)) + \alpha \, TDE(s,\sigma) \qquad (3.6)$$

where $\alpha$ is the learning rate. This update adjusts the value of $v(s)$ to reduce the temporal difference error in state $s$. In some domains, the number of objects can grow or shrink over time: this merely changes the number of instantiations of a template.
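The following sketch shows Equations 3.5 and 3.6 for a single template whose variables are A (a task) and B (an agent); the state representation (a mapping from units to positions) is an assumption made for illustration.

```python
from collections import defaultdict

def template_value(theta, features, state, tasks, agents):
    """Equation 3.5 for one template: each instantiation (A, B) indexes
    the shared table theta by its feature values; contributions are summed."""
    return sum(theta[tuple(f(state, A, B) for f in features)]
               for A in tasks for B in agents)

def template_update(theta, features, state, tasks, agents, tde, alpha=0.01):
    """Equation 3.6: the entry for every instantiation sigma moves in the
    direction that reduces the temporal difference error."""
    for A in tasks:
        for B in agents:
            theta[tuple(f(state, A, B) for f in features)] += alpha * tde

# An illustrative feature in the style of Table 3.2, assuming the state
# maps each unit to its grid position:
def distance(state, A, B):
    (ax, ay), (bx, by) = state[A], state[B]
    return abs(ax - bx) + abs(ay - by)

theta = defaultdict(float)
```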
One template is more refined than another if it has a superset of features. The refinement relationship defines a hierarchy over the templates, with the base template forming the root and the most refined templates at the leaves. The values in the tables of any intermediate template in this hierarchy can be computed from its child template by summing up the entries in the child's table that refine a given entry in the parent template. Hence, we could avoid maintaining the intermediate template tables explicitly; however, doing so adds to the complexity of action selection and updates, so my implementation explicitly maintains all templates.

3.2 Hill Climbing for Action Space Search

The second curse of dimensionality in reinforcement learning is the exponential growth of the joint action space and the corresponding time required to search this action space. To mitigate this problem, one may implement a simple form of hill climbing which can greatly speed up the action selection process with minimal loss in the quality of the resulting policy. In my experiments, hill climbing was used only during training. This is possible only when using an off-policy learning method, such as H-learning (Algorithm 2.3). I found that full exploitation of the policy during training is not necessarily conducive to improved performance during testing. It is only important that potentially high-value states are explored as often as necessary to learn their true values; this kind of good exploration is a property shared by both complete and hill climbing searches of the action space.

I performed hill climbing by noting that every joint action $\vec{a}$ is a vector of sub-actions, each by a single agent, i.e., $\vec{a} = (a_1, \ldots, a_k)$. This vector is initialized with all neutral actions. The definition of "neutral action" varies with the domain: a "wait" action would be a typical example for some domains. Starting at $a_1$, I consider a small neighborhood of actions (one for each possible action $a_1$ may take, other than the action it is currently set to), and $a_1$ is set to the best action. This process is repeated for each sub-action $a_2, \ldots, a_k$. The process then starts over at $a_1$, repeating until $\vec{a}$ has converged to a local optimum.
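A minimal sketch of this coordinate-ascent search follows; `q_value` is a hypothetical evaluator of a complete joint action (e.g., one-step lookahead with the current value function), and `agent_actions` lists each agent's admissible sub-actions.

```python
def hill_climb_joint_action(agent_actions, q_value, neutral):
    """Coordinate-ascent search over the joint action: start from the
    all-neutral joint action and repeatedly sweep over the agents,
    improving one sub-action at a time, until no single sub-action
    change improves the value (a local optimum)."""
    joint = [neutral] * len(agent_actions)
    improved = True
    while improved:
        improved = False
        for i, options in enumerate(agent_actions):
            current_val = q_value(joint)
            for a in options:
                candidate = joint[:i] + [a] + joint[i + 1:]
                val = q_value(candidate)
                if val > current_val:
                    joint, current_val, improved = candidate, val, True
    return joint
```

Each sweep evaluates only a linear number of neighboring joint actions per agent, rather than the exponential number a complete search would consider.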
3.3 Reducing Result-Space Explosion

The third curse of dimensionality occurs due to the difficulty of efficiently calculating the expected value of the next state in domains with many possible resulting states. This is sometimes called the "result-space" explosion. Such domains often arise when there are many objects in the domain that are not agent-controlled, yet exhibit some unpredictable behavior. For example, in a real-time strategy game (see Section 3.4.2) there may be many enemy agents, each of which is acting according to some unknown or stochastic policy. It should be noted that while model-free reinforcement learning may also suffer from having too many result states (thus requiring a lower learning rate and a longer time to converge), in this thesis I am primarily concerned with the time required to actually calculate the expected value of the next state. Model-free algorithms have a significant advantage here, as an explicit calculation of this value is usually needed only when using model-based reinforcement learning.

One of the drawbacks of model-based methods is that they require stepping through all possible next states of a given action to compute the expected value of the next state. This is very time-consuming, and optimizing this step improves the speed of the algorithm considerably. Consider the fact that we need to compute the term $\sum_{s'=1}^{N} p(s'|s,u) h(s')$ in Equation 2.12 to compute the Bellman error and update the parameters. Since there are often an exponential number of possible next states in domain parameters such as the number of enemy agents, doing this calculation by brute force is expensive. I present three possible solutions to this problem in the next sections.

3.3.1 Efficient Expectation Calculation

For this first method to apply, the value function must be linear in any features whose values change stochastically. For example, in the product delivery domain (see Section 3.4.1), the only features whose values change stochastically are the shop inventory levels. Hence this solution may be applied for the linear inventory function approximation of my domain.

Under the above assumption, we can rewrite the exponential-size calculation $\sum_{s'=1}^{N} p(s'|s,u) h(s')$ in Equation 2.12 as $\sum_{s'=1}^{N} p(s'|s,u) \left( \sum_{l=1}^{n} \theta_l \phi_{l,s'} \right)$, which can be rewritten as $\sum_{l=1}^{n} \theta_l \sum_{s'=1}^{N} p(s'|s,u) \phi_{l,s'}$ and simplified to $\sum_{l=1}^{n} \theta_l E(\phi_{l,s'}|s,u)$, where $E(\phi_{l,s'}|s,u) = \sum_{s'=1}^{N} p(s'|s,u) \phi_{l,s'}$ represents the expected value of the feature $\phi_l$ in the next state under action $u$. $E(\phi_{l,s'}|s,u)$ is directly estimated by on-line sampling and stored in a factored form. Instead of taking time proportional to the number of possible next states, this only takes time proportional to the number of features, which is exponentially smaller. For example, if the current inventory level of shop $l$ is 2, the probability of the inventory going down by 1 in this step is 0.2, and the probability of its going down by 2 or more is 0, then $E(\phi_{l,s'}|s,u) = 2 - 1 \times 0.2 = 1.8$. We thus obtain the following temporal difference error:

$$TDE(s) = \max_{u \in U(s)} \left\{ r(s,u) + \sum_{l=1}^{n} \theta_l E(\phi_{l,s'}|s,u) \right\} - \rho - h(s) \qquad (3.7)$$

by substituting $\sum_{l=1}^{n} \theta_l E(\phi_{l,s'}|s,u)$ for $\sum_{s'=1}^{N} p(s'|s,u) h(s')$ in Equation 2.12.
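In code, the factored expectation is a single dot product, shown in the minimal sketch below (list layouts assumed for illustration):

```python
def factored_expected_value(theta, expected_phi):
    """When the value function is linear in the stochastic features, the
    expectation sum_s' p(s'|s,u) h(s') collapses to
    sum_l theta_l * E(phi_l' | s, u). expected_phi[l] holds the sampled
    estimate E(phi_{l,s'} | s, u)."""
    return sum(t * e for t, e in zip(theta, expected_phi))

# Worked example from the text: an inventory level of 2 that drops by 1
# with probability 0.2 (and never by 2 or more) has expected next value
# 2 - 1 * 0.2 = 1.8.
print(factored_expected_value([1.0], [2 - 1 * 0.2]))  # 1.8
```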
3.3.2 ASH-learning

A second method for optimizing the calculation of the expectation is a different algorithm I call ASH-learning, which stands for Afterstate H-learning. This is based on the notion of afterstates [23], also called "post-decision states" [16]. Afterstates are created by conceptually splitting the effects of an agent's action into "action-dependent" effects and "action-independent" (or environmental) effects. The afterstate is the state that results from taking into account the action-dependent effects, but not the action-independent effects.

Figure 3.1: Progression of states (s, s', and s'') and afterstates (s_a and s'_{a'}).

If we consider Figure 3.1, we can view the progression of states and afterstates as s -(a)-> s_a -> s' -(a')-> s'_{a'} -> s''. The "a" subscript used here indicates that s_a is the afterstate of state s and action a. The action-independent effects of the environment have created state s' from afterstate s_a. The agent chooses action a', leading to afterstate s'_{a'} and receiving reward r(s', a'). The environment again stochastically selects a state, and so on. The h-values may now be redefined in these terms:

h(s_a) = E(h(s'))    (3.8)

h(s') = \max_{u \in U(s')} { r(s',u) + \sum_{s'_u=1}^{N} p(s'_u|s',u) h(s'_u) } - \rho^*    (3.9)

If we substitute Equation 3.9 into Equation 3.8, we obtain this Bellman equation:

h(s_a) = E( \max_{u \in U(s')} { r(s',u) + \sum_{s'_u=1}^{N} p(s'_u|s',u) h(s'_u) } ) - \rho^*    (3.10)

Here the s'_u notation indicates the afterstate obtained by taking action u in state s'. I estimate the expectation of the max above via sampling in the ASH-learning algorithm (Algorithm 3.1). Since this avoids looping through all possible next states, the algorithm is much faster. In the domains explored in this chapter, the afterstate is deterministic given the agent's actions, but the stochastic effects due to the environment are unknown. Using afterstates to learn the expectation of the value of the next state takes advantage of this knowledge. For such domains with deterministic agent actions, we do not need to learn p(s'_u|s',u), providing a significant savings in memory, computation time, and code complexity.

Algorithm 3.1: The ASH-learning algorithm. The agent executes these steps when in state s'.
1. Find an action u \in U(s') that maximizes r(s',u) + \sum_{s'_u=1}^{N} p(s'_u|s',u) h(s'_u)
2. Take an exploratory action or a greedy action in the state s'. Let a' be the action taken, s'_{a'} be the afterstate, and s'' be the resulting state.
3. Update the model parameters p(s'_{a'}|s',a') and r(s',a') using the immediate reward received.
4. if a greedy action was taken then
5.     \rho \leftarrow (1-\alpha)\rho + \alpha(r(s',a') - h(s_a) + h(s'_{a'}))
6.     \alpha \leftarrow \alpha / (\alpha+1)
7. end
8. h(s_a) \leftarrow (1-\beta) h(s_a) + \beta ( \max_{u \in U(s')} { r(s',u) + \sum_{s'_u=1}^{N} p(s'_u|s',u) h(s'_u) } - \rho )
9. s' \leftarrow s''
10. s_a \leftarrow s'_{a'}

By storing only the values of states rather than of state-action pairs, this method also retains the advantages of model-based H-learning. The temporal difference error for the ASH-learning algorithm is:

TDE(s_a) = \max_{u \in U(s')} { r(s',u) + \sum_{s'_u=1}^{N} p(s'_u|s',u) h(s'_u) } - \rho - h(s_a)    (3.11)

which I use in Equation 3.4 when using TLFs for function approximation.

ASH-learning generalizes model-based H-learning and model-free R-learning [19]. ASH-learning reduces to H-learning if the afterstate is set to be the next state, treating all effects as action-dependent. Doing this, the expectation of the maximum in Equation 3.10 drops out, and we have the Bellman equation for H-learning in Equation 2.10. To reduce ASH-learning to R-learning, two steps are required. First, we define the afterstate to be the state-action pair, which makes the action-dependent effects implicit. Second, note that for Equations 3.8 and 3.9, we assume the reward r(s,a) in Figure 3.1 is given prior to the afterstate s_a. It is also valid to instead assume the reward is given after the afterstate. This is a conceptual difference only, but combined with the first step above, this small change allows us to reduce Equation 3.11 to the Bellman equation for R-learning:

h(s,a) = E( r(s,a) + \max_{u \in U(s')} { h(s',u) } ) - \rho^*    (3.12)

The transition probabilities p(s'_u|s',u) drop out because the afterstate (s',u) is deterministic given the state and action. Under these circumstances, ASH-learning reduces to model-free R-learning.

While in theory the afterstate may be defined as anything from the current state-action pair to the next state, in practice it is useful if it has low stochasticity and small dimensionality. This is true, for example, when an agent's actions are completely deterministic and the stochasticity is due to the actions of the environment (possibly including other agents).
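A minimal tabular sketch of the core ASH-learning backup follows, assuming (as in the domains above) that the afterstate is deterministic given the state and action, so that p(s'_u|s',u) need not be learned. The helpers afterstate, actions, and r_model are illustrative stand-ins, not the thesis implementation.

import collections

# h can be a table of afterstate values, e.g. h = collections.defaultdict(float)
def ash_backup(h, rho, beta, s_a, s_prime, actions, afterstate, r_model):
    # Backed-up value: max over u of r(s', u) + h(s'_u), sampled rather than
    # summed over all next states (cf. Equation 3.11 with deterministic
    # afterstates).
    target = max(r_model(s_prime, u) + h[afterstate(s_prime, u)]
                 for u in actions(s_prime))
    h[s_a] = (1.0 - beta) * h[s_a] + beta * (target - rho)

def rho_update(rho, alpha, reward, h, s_a, s_a_next):
    # Average-reward estimate, updated only on greedy steps (Algorithm 3.1).
    return (1.0 - alpha) * rho + alpha * (reward - h[s_a] + h[s_a_next])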
3.3.3 ATR-learning

In this section, I adapt ASH-learning to finite-horizon domains. Instead of using average reward, I calculate total reward. I call this variation, afterstate total-reward learning, "ATR-learning". I define the afterstate-based value function of ATR-learning as av(s_a), which satisfies the following Bellman equation:

av(s_a) = \sum_{s' \in S} p(s'|s_a) \max_{u \in A} { r(s',u) + av(s'_u) }    (3.13)

As with ASH-learning, I use sampling to avoid the expensive calculation of the expectation above. At every step, the ATR-learning algorithm updates the parameters of the value function in the direction of reducing the temporal difference error (TDE), i.e., the difference between the r.h.s. and the l.h.s. of the above Bellman equation:

TDE(s_a) = \max_{u \in A} { r(s',u) + av(s'_u) } - av(s_a)    (3.14)

The ATR-learning algorithm is shown in Algorithm 3.2.

Algorithm 3.2: The ATR-learning algorithm, using the update of Equation 3.14.
1. Initialize the afterstate value function av(.)
2. Initialize s to a starting state
3. for each step do
4.     Find an action u that maximizes r(s,u) + av(s_u)
5.     Take an exploratory action or a greedy action in the state s. Let a be the joint action taken, r the reward received, s_a the corresponding afterstate, and s' the resulting state.
6.     Update the model parameters r(s',a).
7.     av(s_a) \leftarrow av(s_a) + \alpha ( \max_{u \in A} { r(s',u) + av(s'_u) } - av(s_a) )
8.     s \leftarrow s'
9. end
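The loop below sketches Algorithm 3.2 in Python. The environment interface (env.reset, env.actions, env.afterstate, env.step) is an assumption made for illustration, and the reward model is a simple running average; the thesis implementation may differ.

import random
import collections

def atr_learning(env, n_steps, alpha=0.1, epsilon=0.1, model_lr=0.1):
    av = collections.defaultdict(float)       # afterstate values av(s_a)
    r_model = collections.defaultdict(float)  # learned reward model r(s, a)
    s = env.reset()
    for _ in range(n_steps):
        acts = env.actions(s)
        # Greedy action maximizes r(s, u) + av(s_u); explore with prob. epsilon.
        a = (random.choice(acts) if random.random() < epsilon else
             max(acts, key=lambda u: r_model[(s, u)] + av[env.afterstate(s, u)]))
        s_a = env.afterstate(s, a)
        r, s_next = env.step(a)
        # Update the reward model toward the observed reward.
        r_model[(s, a)] += model_lr * (r - r_model[(s, a)])
        # TD update toward max_u { r(s', u) + av(s'_u) }  (Equation 3.14).
        target = max(r_model[(s_next, u)] + av[env.afterstate(s_next, u)]
                     for u in env.actions(s_next))
        av[s_a] += alpha * (target - av[s_a])
        s = s_next
    return av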
3.4 Experimental Results

In this chapter, I describe two domains used to perform my experiments: a product delivery domain and a real-time strategy game domain. These domains illustrate the three curses of dimensionality, each having several agents and many states, actions, and possible result states. While discussing each domain, I will show how that domain exhibits each curse of dimensionality. In general, the number of agents has the largest influence on the number of states, actions, and result states – these are usually exponential in the number of agents. If the environment includes several random actors (for example, customers or enemy agents), this will also increase the number of possible result states. Finally, I present my experiments involving these domains and the techniques described so far.

3.4.1 The Product Delivery Domain

I simplify many of the complexities of the real-world delivery problem, including varieties of products, seasonal changes in demands, constraints on the availability of labor and on routing, the extra costs due to serving multiple shops in the same trip, etc. Rather than building a deployable solution for a real-world problem, my goal is to scale the methods of reinforcement learning to be able to address the combinatorial core of the product delivery problem. This approach is consistent with the many similar efforts in the operations research literature [4]. While RL has been applied separately to inventory control [28] and vehicle routing [16, 20, 21] in the past, I am not aware of any applications of RL to the integrated problem of real-time delivery of products that includes both. I assume a supplier of a single product that needs to be delivered to several shops from a warehouse using several trucks. The goal is to ensure that the stores remain supplied while minimizing truck movements.

Figure 3.2: The product delivery domain, with depot (square) and five shops (circles). Numbers indicate the probability of a customer visit each time step.

I experimented with an instance of the problem shown in Figure 3.2. To simplify matters further, I assumed it takes one unit of time to go from any location to an adjacent location or to execute an unload action. The shop inventory levels and truck load levels are discretized into 5 levels, 0-4. It is easy to see that the size of the state space is exponential in the number of trucks and the number of shops, which illustrates the first curse of dimensionality. I experimented with 4 trucks, 5 shops, and 10 possible truck locations, which gives a state-space size of (5^5)(5^4)(10^4) = 19,531,250,000.

Each truck has 9 actions available at each time step: unload 1, 2, 3, or 4 units; move in one of up to four directions; or wait. A policy for this domain seeks to address the problems of inventory control, vehicle assignment, and routing simultaneously. The set of possible actions in any one state is the Cartesian product of the available actions for all trucks, and it is exponential in the number of trucks. Thus, just picking a greedy joint action with respect to the value function requires an exponential-size search at each learning step, illustrating the second curse of dimensionality. In my experiments with 4 trucks, 9^4 = 6561 actions must be considered in each step. Although this is feasible, we need a faster approach to scale to larger numbers of trucks, since the action search occurs at each step of the learning algorithm.

Trucks are loaded automatically upon reaching the depot. A small negative reward of -0.1 is given for every "move" action of a truck to reflect the fuel cost. The consumption at each shop is modeled by decreasing the inventory level by 1 unit with some probability, which varies independently from shop to shop. This can be viewed as a purchase by a customer. In general, the number of possible next states for a state and an action is exponential in the number of shops, since each shop may end up in multiple next states, thus illustrating the third and final curse of dimensionality. With my assumption of 5 shops, each of which may or may not be visited by a customer each time step, this gives us up to 2^5 = 32 possible next states each time step. I call this the stochastic branching factor – the maximum number of possible next states for any state-action pair. I also give a penalty of -5 if a customer enters a store and finds the shelves empty.
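The snippet below makes the stochastic branching factor concrete: it enumerates every combination of per-shop customer visits and its probability, which is exactly what a brute-force expectation over next states must loop over. The visit probabilities are illustrative, not the values from Figure 3.2.

from itertools import product

visit_prob = [0.1, 0.2, 0.15, 0.25, 0.3]   # per-shop customer probabilities (assumed)

outcomes = []
for visits in product([0, 1], repeat=len(visit_prob)):
    p = 1.0
    for v, q in zip(visits, visit_prob):
        p *= q if v else (1.0 - q)         # visit with prob. q, no visit otherwise
    outcomes.append((visits, p))

assert len(outcomes) == 2 ** len(visit_prob)          # 2^5 = 32 next states
assert abs(sum(p for _, p in outcomes) - 1.0) < 1e-9  # probabilities sum to 1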
3.4.2 The Real-Time Strategy Domain

I performed several experiments on several variations of a real-time strategy (RTS) game simulation. This RTS domain has several features that make it more challenging than the product delivery domain: enemy units respond actively to agent actions, and can even kill the agents. Because enemy units are more powerful than the agents, they require coordination to defeat. For problems with multiple enemy units (discussed in Section 5.6.3), agents and enemy units may also vary in type, requiring even more complex policies and coordination. Scaling problems also prove particularly challenging as the number of agents and enemy units grows.

I implemented a simple real-time strategy game simulation on a 10x10 gridworld. The grid is presumed to be a coarse discretization of a real battlefield, and so units are permitted to share spaces. In this chapter, the experiments use three agents vs. a single enemy agent (see Section 5.6.3 for experiments with up to twelve starting agents and four enemy units). Units, either enemy or friendly, were defined by several features: position (in x and y coordinates), hit points (0-6), and type (archer, infantry, tower, ballista, knight, glass cannon, or hall). I also defined relational features such as the distance between agents and the enemy units, and aggregation features such as a count of the number of opposing units within range. In addition, each unit type was defined by how many starting hit points it had, how much damage it did, the range of its attack (in Manhattan distance), and whether it was mobile or not. See Table 3.3 for the differences between units. Agents were always created as one of the weaker unit types (archer or infantry), and enemies were created as one of the stronger types (tower, ballista, or knight). The "glass cannon" and hall are special units described in Section 6.3.2.

Table 3.3: Different unit types.

Unit           HP   Damage   Range   Mobile
Archer          3        1       3      yes
Infantry        6        1       1      yes
Tower           6        1       3       no
Ballista        2        1       5      yes
Knight          6        2       1      yes
Glass Cannon    1        6       1      yes
Hall            6        0       0       no

Agents had six actions available in each time step: move in one of the four cardinal directions, wait, or attack an enemy (if in range). Enemy units had the same options (along with a choice of whom to attack) and followed predefined policies, approaching the nearest enemy unit if mobile and out of range, or attacking the unit closest to death if within range. An attack on a unit within range always hits, inflicting damage on that unit and killing it if it is reduced to 0 hit points. Thus, the number of agents (and tasks) is reduced over time. Eventually, one side or the other is wiped out, and the battle is "won". I also impose a time limit of 20 steps.

Due to the episodic nature of this domain, total-reward reinforcement learning is suitable. I gave a reward of +1 for a successful kill of an enemy unit, a reward of -1 if an agent is killed, and a reward of -0.1 each time step to encourage swift completion of tasks. Thus, to receive positive reward, it is necessary for agents to coordinate with each other to quickly kill enemy units without any losses of their own.

Figure 3.3: Comparison of complete search, hill climbing, H-learning, and ASH-learning for the truck-shop tiling approximation.

3.4.3 ASH-learning Experiments

I conducted several experiments testing ASH-learning, tabular linear functions, hill climbing, and efficient expectation calculation (Section 3.3.1). Tests are averaged over 30 runs of 10^6 time steps for all results displayed here. Training is divided into 20 phases of 48,000 training steps and 2,000 evaluation steps each. During evaluation steps, exploration is turned off and complete search is used to select actions for all methods. In all tests, 4 trucks and 5 shops were used.

I compared three different kinds of TLFs in my experiments. The first TLF represents the value function h(s) as follows:

h(s) = \sum_{t=1}^{k} \sum_{x=1}^{n} \theta_{t,x}(p_t, l_t, i_x)    (3.15)

where there are k trucks, n shops, and no linear features. The value function has kn terms, each term corresponding to a truck-shop pair (t,x). The nominal features are truck position p_t, truck load l_t, and shop inventory i_x. In analogy to tile coding, I call this the truck-shop tiling approximation.
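A sketch of how the truck-shop tiling of Equation 3.15 might be stored and updated follows; the state accessors (truck_pos, truck_load, inventory) are assumed names. Each of the kn tables is indexed by the three nominal features, and each entry that contributes to h(s) moves in the direction of the temporal difference error, as in Equation 3.6.

import collections

theta = collections.defaultdict(float)   # tables theta_{t,x}(p_t, l_t, i_x)

def h_value(state, n_trucks, n_shops):
    # Sum one table entry per truck-shop pair (Equation 3.15).
    total = 0.0
    for t in range(n_trucks):
        for x in range(n_shops):
            key = (t, x, state.truck_pos[t], state.truck_load[t], state.inventory[x])
            total += theta[key]
    return total

def tlf_update(state, n_trucks, n_shops, tde, alpha):
    # Move every contributing entry in the direction of the TD error.
    for t in range(n_trucks):
        for x in range(n_shops):
            key = (t, x, state.truck_pos[t], state.truck_load[t], state.inventory[x])
            theta[key] += alpha * tde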
In this domain there are 10 truck locations, 5 shop inventory levels, and 5 levels of truck load. The number of parameters in the TLF of Equation 3.15 is k x n x 10 x 5 x 5, as opposed to the 10^k 5^k 5^n required by a complete tabular representation.

The second TLF uses the shop inventory level i_x as a linear feature instead of a nominal feature as used by the truck-shop tiling:

h(s) = \sum_{t=1}^{k} \sum_{x=1}^{n} \theta_{t,x}(p_t, l_t) i_x    (3.16)

Treating the shop inventory level i_x as a linear feature is particularly attractive in this model-based setting, where we need to explicitly calculate the expectation of the value function over the possible next states. The linearity assumption makes this computation linear in the number of shops rather than exponential (see Section 3.3.1). This also makes it unnecessary to discretize the shop inventory levels, although I continue to discretize them in my experiments to keep the comparisons between different function approximation schemes fair. I call this scheme the linear inventory approximation.

Figure 3.4: Comparison of complete search, hill climbing, H-learning, and ASH-learning for the linear inventory approximation.

In Figure 3.3 I compare results on the truck-shop tiling approximation, for H-learning and ASH-learning, and for complete search of the joint action space vs. hill-climbing search. Figure 3.4 repeats these tests for the linear inventory approximation. My results show that ASH-learning outperforms H-learning, converging faster and to a better average reward, especially with the truck-shop tiling approximation. H-learning's performance improves with the linear inventory approximation, but it has a higher variance compared to ASH-learning.

From Table 3.4, we can see that the methods of addressing the stochastic branching factor were quite successful. When using the linear inventory approximation and the corresponding optimization discussed in Section 3.3.1, H-learning shaved 36 seconds (or 24%) off the execution time. ASH-learning was even more successful at ameliorating the explosion in the stochastic branching factor. The largest gains in execution time were seen with the simplest of methods: using hill climbing during the action search saved more time than any other method. Combined with ASH-learning, this led to speedups of nearly an order of magnitude.

Table 3.4: Comparison of execution times for one run.

Search          Algorithm      Approximation        Seconds
Complete        ASH-learning   All feature pairs        175
Complete        H-learning     Linear inventory         112
Complete        ASH-learning   Linear inventory          86
Complete        H-learning     Truck-shop tiling        148
Complete        ASH-learning   Truck-shop tiling         92
Hill climbing   H-learning     Linear inventory          19
Hill climbing   ASH-learning   Linear inventory          15
Hill climbing   H-learning     Truck-shop tiling         26
Hill climbing   ASH-learning   Truck-shop tiling         15

These gains in speed can be explained by comparing the average number of actions the hill-climbing technique searches each time step to the number of actions considered by a complete search of the action space. In these tests, hill-climbing search considered an average of 44 actions before reaching a local optimum. A complete search of the joint action space that ignores the most obviously illegal of the 9^4 possible actions considered an average of 385 actions: a significant savings for hill climbing. Moreover, my results show that hill climbing performs as well as complete search when measuring both the number of steps required to reach convergence and the final average reward of the policy.
This is a very encouraging result, since it suggests that it may be possible to do less search during the learning phase and obtain just as good a result during the evaluation or testing phase. As the number of trucks increases beyond 4, we would expect to see even greater improvements in the execution times of hill-climbing search over full search of the joint action space.

Figure 3.5: Comparison of a hand-coded algorithm vs. ASH-learning with complete search for the truck-shop tiling, linear inventory, and all feature-pairs tiling approximations.

Figure 3.5 compares the best results from Figures 3.3 and 3.4 against a fairly sophisticated hand-coded non-learning greedy algorithm and against ASH-learning based on the all feature-pairs tiling. The hand-coded algorithm worked by first prioritizing the shops by the expected time until each shop will be empty due to customer actions, then assigning trucks to the highest-priority shops. Once an assignment has been made, it becomes much easier to assign a good move, unload, or wait action to deliver products. It should be noted that the creation of this hand-coded algorithm required considerable prior knowledge of the domain. All learning-based approaches used ASH-learning with complete search. Note that the results in this figure have a different scale on the Y-axis. Figure 3.5 shows that, for the vehicle routing domain, the linear inventory approximation does not perform well. Encouragingly, the truck-shop tiling and all feature-pairs approximations converge to a better average reward than the hand-coded algorithm. In this domain, it appears that the all feature-pairs approximation performs better than any other. I verified that the final average rewards reached by these tests are statistically significantly different at the 95% confidence level using Student's t-test.

3.4.4 ATR-learning Experiments

Relational templates (Section 3.1.2) may be easily adapted to facilitate transfer learning between different domains. In particular, I demonstrate how knowledge learned from experimenting with particular combinations of units (subdomains) – for example, archers and infantry vs. towers or ballista – may be transferred to a different subdomain by taking advantage of the properties of relational templates.

The results of the experiments are shown in Figures 3.6, 3.7, and 3.8. All figures show the influence that learning on various combinations of source domains has on the final performance in one or more different target domains. The experiments were tested on the target domain for 30 runs of 10^5 steps each, and averaged together. I used the ATR-learning algorithm (Algorithm 3.2) for all experiments. Each run was divided into 40 alternating training and testing phases of 3,000 and 2,000 steps each, respectively. I used \epsilon = 0.1 for the training phases and \epsilon = 0 for the test phases. I adjusted \alpha independently for each relational template: for "parent" templates (#3-4), I set \alpha = 0.01, and for any other template, I used \alpha = 0.1. This allows the parent templates, used only for transfer, to influence the value function less than the subdomain-specific templates.

Figure 3.6: Comparison of 3 agents vs. 1 task domains.

For Figure 3.6, I trained a value function for 10^6 steps on all subdomains not included in the final "target" domains. I then transferred the value function to the target domains: Infantry vs. Knight, or Archers and Infantry vs. Knight. Starting distributions of units are randomized according to the allowed combinations of units.
These results show that using transfer learning improves results over using no transfer at all. In the case of Archers and Infantry vs. Knight, agents have no prior experience vs. the knight, yet perform better overall. This indicates that agents can be sensitive to particular kinds of prior experience, which is verified in Figure 3.7.

For Figures 3.7 and 3.8, I trained a value function for 10^6 steps on the various combinations of source domains indicated in the legend. An abbreviation such as "AvK" indicates a single kind of domain – Archers vs. Knights, for example. Likewise, I, T, and B indicate Infantry, Towers, and Ballista, respectively. When training on multiple domains at once, each episode was randomly initialized to one of the allowable combinations of domains. I then transferred the parameters of the relational templates learned in these source domains to the target "AvT" or "IvK" domains.

Figure 3.7: Comparison of training on various source domains transferred to the 3 Archers vs. 1 Tower domain.

My results show that additional relevant knowledge (in the form of training on source domains that share a unit type with the target domain) is usually helpful, though not always. For example, in the IvK target domain, training on the IvB domain alone performs worse than not using transfer learning at all. However, training on IvB and IvT together is better than training on IvT alone, and training on IvT is much better than no transfer at all. These results also show that irrelevant training – in this case on the AvT and AvB domains, which do not share a unit type with the IvK domain – harms transfer.

Figure 3.8: Comparison of training on various source domains transferred to the Infantry vs. Knight domain.

For the AvT target domain, transfer from any domain initially performs better than no transfer at all, but only a few source domains continue to perform better than no transfer by the end of each run. In both target domains, the "AvK" source domain provides the best possible training. The IvT source domain and the combined IvT, IvB, and IvK source domains also perform well here.

These results confirm that the value function can be quite sensitive to the choice of source domains to transfer from. Which source domains are most helpful is often unpredictable. Initially, transfer from any combination of domains performs better than no transfer at all, but only by training on certain source domains will performance continue to improve over no transfer by the end of each run. While we might expect irrelevant information – such as training on the IvK and IvB domains, which do not share a unit type with the AvT domain – to harm transfer, for this particular experiment that does not appear to be the case. It is possible that training on these domains has an indirect influence on the value function, which helps more than it harms.

3.5 Summary

I illustrated the three curses of dimensionality of reinforcement learning and showed effective techniques to address them in certain domains. Tabular linear functions seem to offer an attractive alternative to other forms of function approximation. They are faster than neural nets and give opportunities to provide meaningful prior knowledge without excessive feature engineering. Hill climbing is a cheap but effective technique to mitigate the action-space explosion due to multiple agents. I introduced ASH-learning and ATR-learning, which are afterstate versions of model-based real-time dynamic programming.
These algorithms are similar to Q-learning in that action-independent effects are not learned or used. However, the value function is state-based, so it is more compact than Q-learning's, much more so for multiple agents. I have shown how ASH-learning generalizes model-based H-learning and model-free R-learning [19]. ASH-learning reduces to H-learning if the afterstate is set to be the next state. If the afterstate is set equal to the current state-action pair, ASH-learning reduces to R-learning. Thus, ASH-learning carries some advantages from both methods. As with R-learning, much of the action model is not learned or used. However, the value function is state-based, and so it is more compact than R-learning's, especially in the decomposed-agent case. Thus, ASH-learning combines the nice features of both model-based and model-free methods and has proven itself very well in the domains I have tested it with. Similar gains in performance should be expected for any domain in which the afterstate is either observable or inferable. The approach is limited when neither of these is true, i.e., when the afterstate is neither inferable nor observable. It appears that in many cases it may be possible to induce a compact description of the action models, which in turn could help us derive the afterstate. This would make the afterstate approach as broadly applicable as the standard model-based approach.

I have shown how relational templates may be refined from a "base" template – applicable to all subdomains – to templates that are specialized to particular subdomains based on the particular features added to the template. By using several templates with different combinations of type features, a function approximator is created that generalizes between similar subdomains and also specializes to particular subdomains. This process allows for easy transfer of knowledge between subdomains. In addition, I have shown how relational templates and assignment-based decomposition combine fruitfully to transfer knowledge from a domain with only a few units to domains with many units. Although sometimes the addition of one or more relational features is required, the decomposed value function used in this technique allows a straightforward transfer of knowledge between domains.

In summary, I conclude that the explosions in state space and action space, and high stochasticity, may each be ameliorated.

Chapter 4 – Multiagent Learning

This chapter explores methods for implementing a multiagent learning approach (as in Section 2.6) for model-based reinforcement learning methods. In particular, it shows how to implement multiagent versions of H-learning and ASH-learning, and demonstrates several experiments using these methods in the "Team Capture" domain. This domain sets several agents (or game pieces) against an equal number of enemy pieces in an effort to capture them.

When considering the three curses of dimensionality, a multiagent approach is primarily of interest when there is concern about an explosion of the action space due to a large number of agents. However, all three curses of dimensionality can be mitigated by a multiagent approach. By considering a smaller, local state for each agent, the number of states that must be learned by the agent is reduced. Similarly, it may be possible to consider fewer environmental dimensions of the state – only those that affect the local state of each agent – and thus fewer possible result states.
This may have the side effect of making the outcome space appear more stochastic, due to the unmodeled effects of the other agents' actions.

4.1 Multiagent H-learning

In a multiagent approach to reinforcement learning, the joint agent is decomposed into several agents, each of which has its own set of states and actions. As I emphasize multiagent approaches as a method of scaling in large multiagent domains, I allow agents to share memory and communicate free of cost. This kind of action decomposition is useful because the joint action space is exponentially sized in the number of agents, so an exhaustive search is impractical. There exist other multiagent approaches to this problem that work well for some domains; see [9] for an alternative model-based multiagent approach based on least-squares policy iteration. That work requires the creation of a sparse coordination graph between cooperating agents, which is not always practical.

Multiagent systems differ from joint agent systems most significantly in that the environment's dynamics can be affected by other agents, rendering it non-Markovian and non-stationary. In general, optimal performance may require the agents to model each other's goals, intentions, and communication needs [6]. Nevertheless, I pursue a simple approach of modeling each agent with its own MDP and adding a limited amount of carefully chosen coordination information. This method is described in three steps.

Algorithm 4.1: The multiagent H-learning algorithm with serial coordination. Each agent a executes each step when in state s.
1. Find an action u_a \in U_a(s) that maximizes E(r_a|s_{u_a}, u_a) + \sum_{q=1}^{N} p_a(q|s_{u_a}, u_a) h_a(q)
2. Take an exploratory action or a greedy action in the current state s. Let v_a be the action taken, s_{v_a} be the afterstate, s' be the resulting state, and r_imm be the immediate reward received.
3. Update the model parameters for p_a(s'|s_{v_a}) and E(r_a|s_{v_a}, v_a)
4. if a greedy action was taken then
5.     \rho_a \leftarrow (1-\alpha)\rho_a + \alpha(E(r_a|s_{v_a}, v_a) - h_a(s) + h_a(s'))
6.     \alpha \leftarrow \alpha / (\alpha+1)
7. end
8. h_a(s) \leftarrow \max_{u_a \in U_a(s)} { E(r_a|s_{u_a}, u_a) + \sum_{q=1}^{N} p_a(q|s_{u_a}, u_a) h_a(q) } - \rho_a
9. s \leftarrow s'

4.1.1 Decomposition of the State Space

A multiagent MDP can be approximated as a set of MDPs, one for each agent. Each agent's state consists of a set of global variables that are accessible to all agents and a set of variables local to just that agent. Similarly, the joint action u must be decomposed as a vector (u_1, u_2, \ldots, u_n) of agent actions, and the rewards must be distributed among the agents in such a way that the total reward to the system for a fixed joint policy at any time is the sum of the rewards received by the individual agents:

E(r|s,u) = \sum_{a=1}^{m} E(r_a|s,u)    (4.1)

Depending on the domain, rewards may be provided in already-factored form, or a model of each agent's reward must be learned such that the total reward is approximated by the sum of the predicted rewards for each agent, as in the above equation.

In this section, I show how to adapt H-learning into a multiagent algorithm. The above additive decomposition of rewards yields a corresponding additive decomposition of the average reward \rho into \rho_a and the biases h into h_a for each agent a. Hence, h(s) = \sum_{a=1}^{m} h_a(s) and \rho = \sum_{a=1}^{m} \rho_a, and Equation 2.12 becomes:

TDE_a(s) = \max_{u \in U(s)} { E(r_a|s,u) + \sum_{s'=1}^{N} p(s'|s,u) h_a(s') } - \rho_a - h_a(s)    (4.2)

Here, E(r_a|s,u) is agent a's portion of the expected immediate reward. The h_a(s) notation indicates that agent a's h-function is being used.
The \rho_a for each agent a is updated using:

\rho_a \leftarrow (1-\alpha)\rho_a + \alpha(E(r_a|s,u) - h_a(s) + h_a(s'))    (4.3)

Note that the h-value for each state no longer needs to be stored, as that value is now decomposed across several local state values.

4.1.2 Decomposition of the Action Space

The next step is to decompose actions so that action selection is faster. Note that Equation 4.2 searches the joint action space of all agents to find a joint action that maximizes the right-hand side. This is because the next state and its value are functions of all agents' actions. To make this explicit, we can write the joint action u as a vector (u_1, u_2, \ldots, u_n). Thus, computing the max on the right-hand side of Equation 4.2 requires time exponential in the number of agents.

To reduce the complexity of the action choice, we want to have the agents choose their actions independently. To do this, each agent must be able to model the expected outcomes of the actions of the other agents. For each agent, the unknown actions of other agents and their stochastic effects are considered to be part of the environment. Thus, the action models of the agent include not only the known effects of its own actions, but also the effects of the actions of the other agents, which are largely unknown. Given a model of the other agents' effects, Equation 4.2 now becomes:

TDE_a(s) = \max_{u_a \in U_a(s)} { E(r_a|s,u_a) + \sum_{s'=1}^{N} p(s'|s,u_a) h_a(s') } - \rho_a - h_a(s)    (4.4)

Here we need only examine the actions U_a(s) agent a may take in state s, rather than the Cartesian product U(s) of all agents' actions. However, in addition to modeling the effects of the environment, the model variables E(r_a|s,u_a) and p(s'|s,u_a) must also model the effects of the actions of the other agents on the current agent's reward and the next state.

4.1.3 Serial Coordination

While the above method for multiagent H-learning can work well for some domains, problems may arise when coordination of agent actions is needed. With the methods described in Sections 4.1.1 and 4.1.2, coordination is achieved through the model: each agent predicts the actions of the other agents, and that prediction may be inexact.

Figure 4.1: DBN showing the creation of afterstates s_{a_1}, ..., s_{a_m} and the final state s' by the actions of agents a_1, ..., a_m and the environment E.

To solve this problem, I introduce a limited form of coordination I call "serial coordination." In particular, each agent chooses actions in sequence, and knows the actions chosen by those agents that have already chosen one. This is a compromise between making completely independent choices and a completely centralized decision. This method assumes that agents are able to communicate their action choices without cost; this assumption may not hold for some domains. This method bears a resemblance to the coordination graph methods of [11], but is far simpler to code and implement. Serial coordination, together with the coordination provided by the model (which model-free methods such as Q-learning lack), allows us to do well with a simpler alternative to more complex forms of coordination.

Using the serial coordination method, each agent must either predict (via a model) or know exactly the actions of other agents. The first agent to choose an action must try to predict the actions of all other agents. Subsequent agents will be allowed to know the actions of those agents that have already selected an action, but must model the action choices of the rest.
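A minimal sketch of serial coordination for deterministic action effects follows. afterstate applies one agent's action to the running afterstate, and value stands in for agent i's one-step lookahead value, which still models the choices of agents that have not yet acted; both are illustrative names rather than the thesis implementation.

# A sketch of serial coordination: agents choose actions one at a time, each
# conditioning on the afterstate produced by the choices already made.
def serial_coordination(state, n_agents, actions, afterstate, value):
    chosen = []
    s_after = state
    for i in range(n_agents):                 # agents choose in a fixed order
        best_a = max(actions(i, s_after),
                     key=lambda a: value(i, afterstate(s_after, i, a)))
        chosen.append(best_a)
        s_after = afterstate(s_after, i, best_a)   # later agents see this choice
    return chosen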
In general, serial coordination reduces to an MDP in which each agent takes its action one at a time. The first agent knows only the current state s. The second agent's state is expanded by the knowledge of the first agent's action: s, a_1. The third agent's state is determined by s, a_1, a_2, and so on until all agents have acted.

For domains in which the immediate effects of an agent's action are deterministic, serial coordination is simple to implement by taking advantage of the structure provided by a sequence of afterstates (Section 3.3.2). In this chapter, a state subscripted with an action indicates an afterstate. Thus, s_{a_j} is the afterstate of state s and action a_j, and includes the effects of all actions a_1, ..., a_j (see Algorithm 4.1). A subscript used anywhere else indicates a particular agent.

4.2 Multiagent ASH-learning

For model-based learning, we need to compute the expected value of the next state, which is exponential in the number of relevant random variables, e.g., the effects of other agents and the environment. To do this, I introduce a multiagent version of ASH-learning (Section 3.3.2). To adapt the joint-agent form of ASH-learning into a multiagent algorithm is fairly straightforward. Each agent action v_a is treated as creating a sequence of afterstates, as in Figure 3.1. Equations 4.4 and 3.11 are combined to obtain:

TDE_a(s_{v_a}) = \max_{u_a \in U_a(s')} { E(r_a|s'_{u_a}, u_a) + \sum_{s'_{u_a}=1}^{N} p(s'_{u_a}|s', u_a) h_a(s'_{u_a}) } - \rho_a - h_a(s_{v_a})    (4.5)

The complete algorithm is shown in Algorithm 4.2. Note that it is necessary to store an afterstate for each agent for use in the update step.

Algorithm 4.2: The multiagent ASH-learning algorithm. Each agent a executes each step when in state s'.
1. Find an action u_a \in U_a(s') that maximizes E(r_a|s'_{u_a}, u_a) + \sum_{s'_{u_a}=1}^{N} p(s'_{u_a}|s', u_a) h_a(s'_{u_a})
2. Take an exploratory action or a greedy action in the state s'. Let v'_a be the action taken, s'_{v'_a} be the afterstate, and s'' be the resulting state.
3. Update the model parameters p(s'_{v'_a}|s', v'_a) and E(r_a|s'_{v'_a}, v'_a) using the immediate reward received.
4. if a greedy action was taken then
5.     \rho_a \leftarrow (1-\alpha)\rho_a + \alpha(E(r_a|s'_{v'_a}, v'_a) - h_a(s_{v_a}) + h_a(s'_{v'_a}))
6.     \alpha \leftarrow \alpha / (\alpha+1)
7. end
8. h_a(s_{v_a}) \leftarrow (1-\beta) h_a(s_{v_a}) + \beta ( \max_{u_a \in U_a(s')} { E(r_a|s'_{u_a}, u_a) + \sum_{s'_{u_a}=1}^{N} p(s'_{u_a}|s', u_a) h_a(s'_{u_a}) } - \rho_a )
9. s' \leftarrow s''
10. s_{v_a} \leftarrow s'_{v'_a}

4.3 Experimental Results

This section introduces the Team Capture domain and uses it to illustrate issues of scaling. I discuss how to implement this domain using multiagent learning algorithms. I then discuss several experiments demonstrating these methods.

4.3.1 Team Capture domain

The Team Capture domain is a competitive two-sided game played on a square grid. Sides are colored white and black, and control an equal number of pieces (at least two per side). Each side takes turns moving all of its pieces. Each piece has five actions available each turn: stay in position, or move up, down, left, or right. Pieces may not leave the board or enter an occupied square. The goal is to capture opposing pieces as quickly as possible. Either side may capture an opposing piece by surrounding it on opposite sides with two or four of its own pieces. If this occurs, the side that captured the piece receives a reward of 1 and the captured piece is randomly moved to any empty square.

Figure 4.2: An example of the Team Capture domain for 2 pieces per side on a 4x4 grid.

Figure 4.3: The tiles used to create the function approximation for the Team Capture domain.
Figure 4.2 illustrates an example of the Team Capture domain for the two vs. two piece problem on a 4x4 grid. In this example, if white piece 1 moves down, it will capture black piece 2 together with white piece 2.

Taking the joint state and action spaces of all pieces to be the set of basic states and actions, the Team Capture domain can be reduced to a standard joint agent MDP. In this model, the optimal policy is defined as the optimal joint action for each joint state and can in theory be computed using single-agent H-learning. However, this immediately runs into huge scaling problems as the number of agents (pieces) increases: the set of actions available to the joint agent each time step grows exponentially in the number of pieces. Hence a multiagent approach should be taken, which decomposes the joint agent into several coordinating agents.

A multiagent RL algorithm may be used to learn in the Team Capture domain by splitting the reward received for capturing an opposing piece evenly between the two or four agents (pieces) responsible for the capture. We can model the effects of the other agents' actions by noting that they contribute to an agent's expected local reward only by assisting it in capturing an opposing piece. The probability that a capture will occur, conditioned on the number of allied pieces (up to 4) that might assist in the capture, should be measured. This model requires only 5 parameters. More advanced models are possible, but I found this one to work well.

I used tabular linear function approximation (Section 3.1.1) to reduce the number of parameters for each agent:

h_a(s) = \sum_{x=1}^{t} \theta_x(g_1(s,x,a), \ldots, g_n(s,x,a))    (4.6)

This equation represents a set of t tiles laid down over the local area of the grid for each agent a. Each tile is represented by a function \theta_x (in this case, a table) and overlays n grid positions g_1(s,x,a), \ldots, g_n(s,x,a). The subscript x in \theta_x indicates that we take a sum over t weights, each weight taken from a table defined over the arguments of \theta_x. As can be seen in Figure 4.3, I used a set of 20 overlapping 2x2 tiles surrounding the local area of each piece. Each tile contains the state information relevant to the four grid squares it occupies. Each square in the tile may have four possible values: empty, white, black, or "off the board". This gives us a total of 20 x 4^4 = 5120 parameters. For this domain, all agents share the same function approximator, so there is no duplication of parameters between agents. This decision trades off the possible gains from having each agent learn specialized behavior in exchange for fewer parameters and thus faster convergence.

I chose to use only limited, "local" state information to define the value function for each agent in order to limit the number of parameters that must be learned and thus decrease convergence time. In addition, there is a danger that if non-local information were permitted to be learned, that information could be very noisy and could potentially cause learning to suffer.

To adapt Team Capture to ASH-learning, the effects of each agent's actions are split into immediate effects (the movement of a piece, which is deterministic) and environmental effects (the movement of other agents and enemy pieces). When using serial coordination, the afterstate incorporates the effects of each piece on the agent's side moving up, down, left, or right; the effects of opposing pieces being captured, the moves made by the opposing team, and the agent's pieces being captured have not been calculated yet. None of these effects requires knowledge of the action taken by the agent in order to calculate; they depend only on the afterstate.
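A sketch of the tile-coded value function of Equation 4.6 follows, using assumed accessors for the board. Twenty overlapping 2x2 tiles around a piece, with four possible values per square, give the 20 x 4^4 = 5120 shared parameters described above; tile_offsets and square_value are illustrative stand-ins for the layout in Figure 4.3.

import collections

EMPTY, WHITE, BLACK, OFF = 0, 1, 2, 3

# One table per tile; each table is indexed by the 4 square contents,
# giving 20 * 4^4 = 5120 parameters shared by all agents.
weights = [collections.defaultdict(float) for _ in range(20)]

def local_value(board, px, py, tile_offsets, square_value):
    # Sum one weight per 2x2 tile laid over the piece's local area.
    total = 0.0
    for x_idx, (dx, dy) in enumerate(tile_offsets):
        contents = tuple(square_value(board, px + dx + i, py + dy + j)
                         for i in (0, 1) for j in (0, 1))
        total += weights[x_idx][contents]
    return total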
4.3.2 Experiments

I conducted several experiments testing the methods discussed in this chapter. Tests are averaged over 30 runs of 10^6 time steps each. Runs are divided into 20 phases of 48,000 training steps and 2,000 evaluation steps each. During evaluation steps, exploration is turned off. The opposing side uses a random policy.

Figure 4.4: Comparison of multiagent, joint agent, H- and ASH-learning for the two vs. two Team Capture domain.

Figure 4.4 compares my tests for 2 vs. 2 pieces on a 4x4 grid (see Figure 4.2). I compared the results of using multiagent approaches vs. joint agent approaches, and H-learning vs. ASH-learning. I also include the average reward received by a greedy hand-coded policy that represents my best effort at creating a good solution to this domain. My multiagent approaches all used decomposed state and action spaces with serial coordination (which I found to be critical to allow successful capture of opposing pieces).

Figure 4.5: Comparison of ASH-learning approaches and the hand-coded algorithm for the four vs. four Team Capture domain.

Figures 4.5 and 4.6 display the results of applying ASH-learning and multiagent approaches for 4 vs. 4 pieces on a 6x6 grid and 10 vs. 10 pieces on a 10x10 grid. H-learning is impractical for these tests due to very large stochastic branching factors. A joint agent is similarly impractical for the 10-piece domain. I used the function approximation shown in Equation 4.6 to mitigate the problem of enormous state spaces. I also compare my results to a good hand-coded agent.

From these results, we see that multiagent approaches perform nearly as well as their joint agent counterparts: indeed, there is no statistically significant difference between the multiagent and joint agent approaches for two pieces (using a 95% confidence interval). The multiagent approaches for two pieces used twice as many parameters as the joint agent approach, and so could further benefit from function approximation. For this domain, H-learning outperforms ASH-learning (the difference is small, although statistically significant at a 95% confidence level); however, I have observed that for some domains (including a product delivery/vehicle routing domain) ASH-learning outperforms H-learning. The relative performance of these two algorithms appears to depend on the domain, the function approximation used, and the learning parameters.

Figure 4.6: Comparison of multiagent ASH-learning to the hand-coded algorithm for the ten vs. ten Team Capture domain.

The greatest benefit of my approaches can be seen when comparing the computation time required to create the results for Figures 4.4, 4.5, and 4.6. As seen in Table 4.1, ASH-learning is much faster than H-learning. For two pieces, it is actually faster to use a joint agent approach rather than a multiagent approach. However, as the number of pieces increases, multiagent approaches rapidly become the only practical option. For ten pieces, I cannot use a joint agent approach at all.

Table 4.1: Comparison of execution times in seconds for one run of each algorithm. Column labels indicate the number of pieces. "–" indicates a test requiring impractically large computation time.

Algorithm                   2     4    10
Multiagent ASH-learning     1    31    81
Joint ASH-learning          2   750     –
Multiagent H-learning      56     –     –
Joint H-learning           22     –     –
4.4 Summary

This chapter has shown that multiagent decomposition and serial coordination taken together go a long way towards scaling model-based reinforcement learning up to handle real-world problems. Scaling limitations of H-learning prevented me from demonstrating similar benefits for that algorithm in this chapter.

A multiagent approach is particularly useful for addressing the three curses of dimensionality and scaling reinforcement learning problems to large domains. This is because multiagent techniques can simultaneously address each of the curses: by requiring each agent to consider only its local state, the burden of memorizing a value for each state is lessened. By considering each agent's actions independently, actions may be selected in time linear rather than exponential in the number of agents. And finally, if each agent requires knowledge of only a subset of the objects in the environment in order to successfully calculate the expected value of the next state, the outcome-space explosion may be greatly ameliorated.

Serial coordination is one of the simplest forms of multiagent model-based coordination. As with any multiagent coordination method, it is very much a compromise between quality and speed of action selection, in this case erring on the side of speed. Better coordination techniques should result in improved performance; this is a topic I will discuss in the next chapter.

In summary, I conclude that the multiagent versions of H-learning and ASH-learning greatly moderate the explosions in state and action spaces observed in real-world domains. The experimental results in the Team Capture domain show that this approach may be scaled to larger and more practical problems.

Chapter 5 – Assignment-based Decomposition

The simple multiagent learning approach discussed in the previous chapter is adequate for certain domains, but what if greater coordination is required for best performance? In this chapter I explore a more sophisticated coordination method. I start by defining a multiagent assignment MDP, or MAMDP: this is a cooperative multiagent MDP with two or more tasks that agents are required to complete in order to receive a reward. Examples of such a domain might be fire and emergency response in a typical city, product delivery using multiple trucks to deliver to different customers, or a real-time strategy game in which several friendly units must cooperate to destroy multiple enemy units.

An assignment is a mapping from agents to tasks. Given an assignment, agents are required to act in accordance with it until the task is complete or the assignment is changed. The task assignment may change every time step, every few time steps, every time a new task arrives, every time a task completes, or every time all tasks complete. In principle, each of these different cases can be modeled as a joint MDP with appropriate changes to the state and action spaces. For example, the assignment can be made a part of the state, and changes to the assignment can be treated as part of the joint action. Conditions may be placed on when an assignment is allowed to change; weaker conditions will ensure a more flexible policy and hence more potential reward. There is usually a reward for completing any task. The goal is to maximize the total expected discounted reward.

More formally, an MAMDP extends the usual multiagent MDP framework to a set of n agents G = {g} (|G| = n). Each agent g has its own set of local states S_g and actions A_g.
We also define a set of tasks T = {t}, each associated with a set of state variables S_t that describe the task. The set of tasks (and the corresponding state variables required to describe them) may vary between states. The joint action space is the Cartesian product of the actions of all n agents: A = A_1 x A_2 x ... x A_n. The joint state space is the Cartesian product of the states of all agents and all tasks. The reward is decomposed between all n agents, i.e., R(s,a) = \sum_{i=1}^{n} R_i(s,a), where R_i(s,a) is the agent-specific reward for state s and action a. \beta : T -> G^k is an assignment of tasks to agents; here k indicates an upper bound on the number of agents that may be assigned to a particular task. \beta(t) indicates the set of agents assigned to task t. Let s_{\beta(t)} denote the joint states of all agents assigned to task t, and a_{\beta(t)} denote their joint actions. The total utility Q(s,a) depends on the states s of all tasks and agents and the actions a of all agents.

To solve MAMDPs, I propose a solution that splits the action selection step of a reinforcement learning algorithm into two levels: the upper assignment level and the lower task execution level. At the assignment level, agents are assigned to tasks. Once the assignment decision is made, the lower-level action that each agent should take to complete its assigned task is decided by reinforcement learning in a smaller state space. This two-level decision making process occurs at each time step of the reinforcement learning algorithm, taking advantage of opportunistic reassignments. At the assignment level, interactions between the agents assigned to different tasks are ignored.

This action decomposition exponentially reduces the number of possible actions that need to be considered at the lowest level, at the cost of increasing the number of possible assignments that must be considered. Because each agent g need only consider its local state s_g and the task-specific state s_t to come to a decision, this method can greatly reduce the number of parameters that must be stored. This reduction is possible because rather than storing separate value functions for each possible agent and task combination, a single value function may be shared between multiple agent-task assignments.

In the next sections, I discuss the particulars of my implementations of model-free and model-based methods, as well as various search techniques that can be used to speed up the assignment search process. I also analyze the time and space complexity of assignment-based decomposition as compared to more typical methods.

5.1 Model-free Assignment-based Decomposition

The assignment \beta denotes the complete assignment of agents to tasks, and g = \beta(t) denotes the group of agents assigned to task t. The Q-function Q(s_t, s_g, a_g) denotes the discounted total reward for a task t and set of agents g starting from a local task state s_t and joint agent state s_g, with joint actions a_g of the agents in the team.
For an assigned subset of agents, the Q-function is learned using the standard Q-learning approach:

Q(s_t, s_g, a_g) \leftarrow Q(s_t, s_g, a_g) + \alpha [ r_g + \gamma \max_{a'} Q(s'_t, s'_g, a') - Q(s_t, s_g, a_g) ]    (5.1)

The assignment problem described is nontrivial – the number of possible assignments is exponential in the number of agents. The space of possible assignments may be searched by defining a value v(s,g,t) of a state s, task t, and set of agents g. This value is derived from the underlying value function, given by Equation 5.1:

v(s,g,t) = \max_{a \in A_g} Q(s_t, s_g, a)    (5.2)

Various techniques to search the assignment space using this value are discussed in Section 5.3.

The final model-free assignment-based decomposition algorithm is shown in Algorithm 5.1. This algorithm is similar to the ordinary Q-learning algorithm, with several key differences: before the normal action-selection step (line 5), we search for the best available assignment (line 4). If we have multiple tasks, in line 5 we assume that actions and states are factored by task; likewise, rewards are also factored. In line 7, Q-values are updated for each task according to the local states and actions associated with that task.

Algorithm 5.1: The assignment-based decomposition Q-learning algorithm.
1. Initialize Q(s,a) optimistically
2. Initialize s to any starting state
3. for each step do
4.     Assign tasks T to agents by finding \arg\max_\beta \sum_t v(s, \beta(t), t), where v(s,g,t) = \max_{a \in A_g} Q(s_t, s_g, a)
5.     For each task t, choose actions a_{\beta(t)} from s_{\beta(t)} using the \epsilon-greedy policy derived from Q
6.     Take action a, observe rewards r and next state s'
7.     foreach task t do
       Q(s_t, s_{\beta(t)}, a_{\beta(t)}) \leftarrow Q(s_t, s_{\beta(t)}, a_{\beta(t)}) + \alpha [ r_{\beta(t)} + \gamma \max_{a'} Q(s'_t, s'_{\beta(t)}, a') - Q(s_t, s_{\beta(t)}, a_{\beta(t)}) ]
8.     s \leftarrow s'
9. end

5.2 Model-based Assignment-based Decomposition

This section describes a model-based implementation of assignment-based decomposition, which has considerable advantages over its model-free counterpart. In addition to requiring many fewer parameters than model-free methods, the time required to calculate the assignment is greatly reduced, as will be shown below.

The value v(s_t, s_g) denotes the maximum expected total reward for a set of agents g assigned to task t, starting from the joint state (s_t, s_g), following their best policy and assuming no interference from other agents. Similarly, av(s_{a_g}) is defined as the value of the afterstate of s due to the actions a_g of agents g. The temporal difference error (TDE) of the afterstate-based value function for an assigned subset of agents using ATR-learning is as follows:

TDE(s_{a_g}) = \max_{u_g \in A_g} { r_g(s', u_g) + av(s'_{u_g}) } - av(s_{a_g})    (5.3)

As with model-free assignment-based decomposition, the number of possible assignments is exponential in the number of agents. We can search over the space of possible assignments by defining a value y(s,g,t) of a state s, set of agents g, and task t. This value is derived from the underlying state-based value function v(s_t, s_g):

y(s,g,t) = v(s_t, s_g)    (5.4)

Note this is a considerable simplification of the corresponding model-free calculation of Equation 5.2. The value function for ATR-learning is based on afterstates, so the value v(s_t, s_g), being based on states, must be learned separately. The update for this value is based on the temporal difference error given below:

TDE(s_t, s_g) = r_g + \max_{u_g \in A_g} { r_g(s', u_g) + av(s'_{u_g}) } - v(s_t, s_g)    (5.5)

Here r_g is the immediate reward received for task t and agents g. This equation may re-use the calculation of the max found in Equation 5.3. This max is the long-term expected total reward for being in afterstate s_a and thereafter executing the optimal policy. This afterstate value does not account for the immediate reward received for being in state s and taking action a, and so that reward must be added in here.
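The two updates can share one backed-up target, as the following sketch shows; the key and helper names are illustrative assumptions. Target corresponds to the max appearing in both Equations 5.3 and 5.5.

def backed_up_target(s_next, joint_actions, r_model, av, afterstate):
    # max over joint actions u_g of r_g(s', u_g) + av(s'_{u_g})
    return max(r_model(s_next, u) + av[afterstate(s_next, u)]
               for u in joint_actions)

def update_task_values(av, v, after_key, state_key, r_g, target, alpha):
    av[after_key] += alpha * (target - av[after_key])        # Equation 5.3
    v[state_key] += alpha * (r_g + target - v[state_key])    # Equation 5.5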
To find the best assignment of tasks to agents over the long run, we would need to compute the assignment that maximizes the sum of the expected total reward until task completion (in the case of ATR-learning) plus the expected total reward that the agents could collect after that. Unfortunately, this leads to a global optimization problem, which we want to avoid. So we ignore the rewards after the first task is completed and try to find the assignment that maximizes the total expected reward accumulated by all agents for that task. It turns out that this approximation is not so drastic, because the agents get to reassess the value of the task assignment after every step and opportunistically exchange tasks.

The final algorithm for assignment-based decomposition with ATR-learning is shown in Algorithm 5.2.

Algorithm 5.2: The ATR-learning algorithm with assignment-based decomposition, using the updates of Equations 5.3 and 5.5.
1. Initialize the state and afterstate value functions v(.) and av(.)
2. Initialize s to any starting state
3. for each step do
4.     Assign tasks T to agents by finding \arg\max_\beta \sum_t y(s, \beta(t), t), where y(s,g,t) = v(s_t, s_g)
5.     For each task t, find the joint action u_{\beta(t)} \in A_{\beta(t)} that maximizes r_{\beta(t)}(s, u_{\beta(t)}) + av(s_{u_{\beta(t)}})
6.     Take an exploratory action or a greedy action in the state s. For each set of agents \beta(t), let a_{\beta(t)} be the joint action taken, r_{\beta(t)} the reward received, s_{a_{\beta(t)}} the corresponding afterstate, and s' the resulting state.
7.     Update the model parameters r_{\beta(t)}(s, a_{\beta(t)}).
8.     for each task t do
9.         Target_t \leftarrow \max_{u_{\beta(t)} \in A_{\beta(t)}} { r_{\beta(t)}(s', u_{\beta(t)}) + av(s'_{u_{\beta(t)}}) }
10.        av(s_{a_{\beta(t)}}) \leftarrow av(s_{a_{\beta(t)}}) + \alpha(Target_t - av(s_{a_{\beta(t)}}))
11.        v(s_t, s_{\beta(t)}) \leftarrow v(s_t, s_{\beta(t)}) + \alpha(r_{\beta(t)} + Target_t - v(s_t, s_{\beta(t)}))
12.    end
13.    s \leftarrow s'
14. end

We begin by initializing the starting state and the two value functions v(.) and av(.). v(.) stores the state-based value function and is used only to determine assignments. The afterstate-based value function is stored in av(.) and is used to determine the task execution-level actions of each agent. Figure 3.1 may be helpful in understanding the progression in time of states and afterstates. We begin each step of the algorithm by choosing an assignment. The use of v(.) rather than av(.) here is critical: using an afterstate-based function to perform this search would require a search over all actions available in the current state, slowing the algorithm tremendously. Once we have an assignment, we then search for the task execution-level actions. After taking actions, we obtain a decomposed reward signal and the resulting afterstate and state. We can then update the model of the expected immediate reward. Finally, for each task we calculate the TD-error of av(.) and use it to update v(.) and av(.), then go to the next step.

5.3 Assignment Search Techniques

The problem of searching the assignment space for the best possible assignment is very important, as it can be the main difficulty in scaling assignment-based decomposition to large domains. Here, I present several options:

Fixed assignment: This is the simplest possible option: no assignment search at all. Assignments are arbitrarily set at the start of an episode and never change.
5.3 Assignment Search Techniques

The problem of searching the assignment space for the best possible assignment is very important, as it can be the main difficulty in scaling assignment-based decomposition to large domains. Here, I present several options:

Fixed assignment: This is the simplest possible option: no assignment search at all. Assignments are arbitrarily set at the start of an episode and never change.

Exhaustive search: One straightforward method that guarantees an optimal assignment is to exhaustively search for the mapping β that returns the maximum total value over all tasks, max_β Σ_t y(s, β(t), t). However, with many agents, this search becomes intractable, and a faster approximate search technique is necessary; I introduce several next.

Sequential greedy assignment: This search uses a simple method of greedily assigning agents to high-value tasks: for each task t, we consider all sets of agents that might be assigned and choose the set g that provides the maximum value v(s, g, t). We remove the agents g from future consideration, and repeat until all tasks or agents have been assigned. (A sketch of this method and the next appears at the end of this section.)

Swap-based hill climbing: This method uses the assignment from the previous step (or a random assignment the first time this search occurs) as the starting point of a hill-climbing search of the assignment space. At each step of the search, it considers all possible next states that can be obtained by swapping a set of agents from one task with another set of the same size assigned to a different task. It then commits to the swap resulting in the most improvement, repeating until convergence.

Bipartite search: The Hungarian method [13] is a combinatorial optimization algorithm which solves the assignment problem in polynomial time. I adapted this technique (also called the Kuhn-Munkres algorithm) for use in solving the assignment problem faced by assignment-based decomposition: it assigns multiple agents to each task by copying each task as many times as necessary to match the number of agents, one copy for each "slot" available to agents for completing a task, given by the upper bound k on agents per task. This creates an n × n matrix defining a bipartite graph, which the Hungarian method solves in polynomial time. The weight of each edge of the graph is given by y(s, g, t), where g is a single agent and t a single task. The solution to the bipartite graph consists of an assignment of each task to a set of agents.

A serious problem with this approach is that each edge of the graph, or entry in the n × n matrix, cannot contain any information other than that pertaining to the single agent and task of that edge. In other words, we must give up any coordination information when making assignment decisions. While in principle this could cause serious sub-optimalities, in practice the assignment found by this method performs very well.
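The following sketch illustrates sequential greedy assignment and a simplified swap-based hill climbing that exchanges one agent at a time (the method above swaps equal-sized sets; single-agent swaps keep the sketch short). The value table and group encoding are illustrative assumptions.

# Illustrative sketches of two approximate assignment searches (not thesis
# code). Groups are sorted tuples of agents; v maps (task, group) pairs to
# assignment values and assumes enough free agents remain for each task.
import itertools
from collections import defaultdict

v = defaultdict(float)  # v[(task, group)]: value of assigning group to task

def greedy_assignment(tasks, agents, k):
    """Sequential greedy assignment: give each task in turn the best
    remaining set of k agents, removing those agents from consideration."""
    free, beta = set(agents), {}
    for t in tasks:
        best = max(itertools.combinations(sorted(free), k),
                   key=lambda g: v[(t, g)])
        beta[t] = best
        free -= set(best)
    return beta

def swap_hill_climb(beta):
    """Swap-based hill climbing: repeatedly commit to the single best
    improving swap of agents between two tasks, until none improves."""
    def score(b):
        return sum(v[(t, g)] for t, g in b.items())
    while True:
        best_trial, best_val = None, score(beta)
        for t1, t2 in itertools.combinations(beta, 2):
            for a1 in beta[t1]:
                for a2 in beta[t2]:
                    g1 = tuple(sorted((set(beta[t1]) - {a1}) | {a2}))
                    g2 = tuple(sorted((set(beta[t2]) - {a2}) | {a1}))
                    trial = dict(beta)
                    trial[t1], trial[t2] = g1, g2
                    if score(trial) > best_val:
                        best_trial, best_val = trial, score(trial)
        if best_trial is None:
            return beta
        beta = best_trial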
5.4 Advantages of Assignment-based Decomposition

The time complexity of assignment-based decomposition can be analyzed as follows. The time required to perform an exhaustive search of the assignment space is the sum of the time required to pre-calculate v(s, g, t) values and the time required to perform the actual search. The time required to calculate a single v(s, g, t) value is O(|A|^k), where |A| is the number of actions a single agent may take, and k is an upper bound on the number of agents that may be assigned to a task. Therefore, the time required to pre-calculate all values of v(s, g, t) is O(|A|^k |T| C(n, k)), where C is the choice function (binomial coefficient), |T| is the number of tasks, and n is the number of agents. An exhaustive search requires O(n!/(k!)^{n/k}) time, which is proportional to the number of ways to assign k agents each to n/k tasks. This is significantly reduced by any approximate search algorithm, such as hill climbing or bipartite search.

The advantage of assignment-based decomposition is much more apparent when we consider the space complexity of the value function. A value function over the entire state-action space would require O(|S_t|^{|T|} |S_a|^n |A|^n) parameters, where |S_t| and |S_a| are the sizes of the state required to store local parameters for each task and agent respectively. Assignment-based decomposition uses considerably fewer parameters to store the task-based value function Q(s_t, s_g, a): we need space of only O(|S_t| |S_a|^k |A|^k) parameters for each task, which is polynomial for fixed k.

A further advantage of the additive decomposition of the task execution level in Equation 5.6 is that each Q_ij function may share the same parameters. Generalizing, or transferring, that single shared value function to additional tasks and/or agents can be quite simple. In many cases, no additional learning is necessary. The same value function can often be used, for example, in domains with twice as many tasks and agents as the original domain. Only the size of the search space at the assignment level needs to grow.

5.5 Coordination Graphs

A coordination graph can be described over a system of agents to represent the coordination requirements of that system. Such a graph contains a node for each agent and an edge between pairs of agents if they must directly coordinate their actions to optimize some particular Q_ij. See Figure 5.1 for an example coordination graph showing some possible coordination requirements between four agents.

Figure 5.1: A possible coordination graph for a 4-agent domain. Q-values indicate an edge-based decomposition of the graph.

This section examines the potential of using coordination graphs to solve multiagent assignment MDPs. In most cases, if each agent independently pursues a policy to optimize its own Q_i (see Equation 2.14), this will not optimize the total utility, since each agent's actions affect the state and the utility of others. Hence, collaborative agents need to coordinate. A coordination graph allows the agents to specify and model coordination requirements [8]. The presence of an edge in a coordination graph indicates that two agents should coordinate their action selection, for example, so as to avoid collisions. A coordination graph may be specified as part of the domain, or, if the graph is context-specific [10], as a combination of rules provided with the domain. This set of rules determines whether an edge between any two vertices of the graph should exist, given the state.

As in [12], I use an edge-based decomposition of a context-specific coordination graph. The global Q-function for such a decomposition is approximated by a sum over all local Q-functions, each defined over an edge (i, j) of the graph:

    Q(s, a) = Σ_{(i,j) ∈ E} Q_ij(s_ij, a_i, a_j)        (5.6)

where s_ij ⊆ s_i ∪ s_j is the subset of state variables relevant to agents i and j, and (i, j) ∈ E describes a pair of neighboring nodes (i.e., agents). The optimal action for a coordination graph is given by arg max_a Q(s, a). As with the agent-based Q-function, the notation Q_ij indicates only that the Q-value is edge-based. Parameters may or may not be shared between edges.
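As a small concrete rendering of Equation 5.6, the sketch below sums edge-local Q-values into the global Q, and includes the brute-force joint-action maximization whose exponential cost motivates the Max-plus algorithm of Section 5.5.1. All container names are illustrative assumptions.

# Illustrative rendering of Equation 5.6 (not thesis code).
import itertools
from collections import defaultdict

Q_edge = defaultdict(float)  # Q_ij(s_ij, a_i, a_j); parameters may be shared

def global_q(edges, s_ij, joint_action):
    """Equation 5.6: Q(s, a) = sum over (i, j) in E of Q_ij(s_ij, a_i, a_j)."""
    return sum(Q_edge[(s_ij[(i, j)], joint_action[i], joint_action[j])]
               for (i, j) in edges)

def argmax_joint(edges, s_ij, agents, actions):
    """Brute-force arg max_a Q(s, a): enumerates all |A|^n joint actions,
    which is exactly what message passing is meant to avoid."""
    best, best_val = None, float("-inf")
    for combo in itertools.product(actions, repeat=len(agents)):
        ja = dict(zip(agents, combo))
        val = global_q(edges, s_ij, ja)
        if val > best_val:
            best, best_val = ja, val
    return best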
Coordination graphs are a powerful method for coordinating multiple agents, but they are ill-suited to solving multiagent assignment problems with arbitrary coordination constraints. I show a simple proof of this below. For simplicity, I equate tasks and actions and assume that each action is relevant to a single task a or b:

Proposition 1 Arbitrary reward functions from the joint action space A_1 × … × A_n to {0, 1} are not expressible using an edge-based decomposition over a coordination graph.

Proof: Let A_1 = … = A_n = {a, b}, hence there are 2^n joint actions. Each joint action may be mapped to 0 or 1 reward, leading to 2^{2^n} possible functions. To represent these functions, we need at least 2^n bits. A coordination graph over n agents has at most O(n^2) edges. Each edge has at most 4 constraints, one for each possible action pair. Thus, we have room for specifying only O(n^2) values, which is not sufficient to represent 2^{2^n} possible functions. ∎

Although coordination graphs alone are not sufficient to solve MAMDP problems, assignment-based decomposition is also sometimes insufficient to coordinate complex MAMDPs. Assignment-based decomposition provides sufficient coordination if the problem is completely decomposed after assignments have been made; however, this is often not the case. The possibility remains of interference between agents assigned to different tasks. To handle such interactions, I define a coordination graph over agents acting on the task execution level. An edge should be placed between two agents when the actions of those agents might interfere, such as when a collision is possible or the two agents might need to share a common resource. Such coordination must be context-specific, since agents are constantly changing states. Thus, it is necessary to combine the assignment decisions with context-specific coordination at the task execution level. To that end, I adapt some methods described in [11] and [12].

5.5.1 The Max-plus Algorithm

If we define a coordination graph over several agents, we must use an action selection algorithm that can take advantage of this structure. We wish to maximize the global payoff max_a Q(s, a), where Q(s, a) is given by Equation 5.6. Initial work in coordination graphs [9] suggested a variable elimination (VE) technique to solve this problem. However, work in [12] shows that VE techniques can be slow to solve large coordination graphs, require a lot of memory, and in addition can be quite complex to implement. Instead, [12] proposed using the Max-plus algorithm, which trades some solution quality for a great increase in solution speed.

The Max-plus algorithm is a message-passing algorithm based on loopy belief propagation for Bayesian networks [15, 29, 30]. Agents in Max-plus instead pass (normalized) values indicating the locally optimal payoff of each agent's actions along the edges of the coordination graph (see Figure 5.2).

Figure 5.2: Messages passed using Max-plus. Each step, every node passes a message to each neighbor.

    Input: Graph G = (V, E)
    Output: A vector of actions a*
1   Initialize µ_ji = µ_ij = 0 for (i, j) ∈ E, g_i = 0 for i ∈ V, and m = −∞
2   while converged = false and deadline not met do
3       converged = true
4       foreach agent i do
5           foreach neighbor j ∈ Γ(i) do
6               foreach action a_j ∈ A_j do
7                   Create message µ'_ij(a_j) = max_{a_i} { f_i(a_i) + f_ij(a_i, a_j) + Σ_{k ∈ Γ(i)\j} µ_ki(a_i) }
8               end
9               c_ij = (1/|A_j|) Σ_{a_j} µ'_ij(a_j)
10              Normalize: µ'_ij(a_j) ← µ'_ij(a_j) − c_ij
11              if µ'_ij differs from µ_ij by a small threshold then converged = false
12          end
13          a'_i = arg max_{a_i} { f_i(a_i) + Σ_{j ∈ Γ(i)} µ_ji(a_i) }
14      end
15      if u(a') > m then a* = a' and m = u(a')
16      µ ← µ'
17  end
18  return a*
Algorithm 5.3: The centralized anytime Max-plus algorithm.
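The message loop of Algorithm 5.3 can be rendered compactly as below. This sketch assumes a pairwise payoff table f for each edge, omits the single-agent terms f_i and the anytime bookkeeping, and runs a fixed number of iterations; it is an illustration, not the thesis implementation.

# Illustrative Max-plus sketch (not thesis code); f[(i, j)][(a_i, a_j)]
# stores each undirected edge's payoff once, keyed with i < j.
def max_plus(nodes, edges, actions, f, iterations=20):
    # directed messages mu[(i, j)][a_j], initialized to zero
    mu = {}
    for (i, j) in edges:
        mu[(i, j)] = {a: 0.0 for a in actions}
        mu[(j, i)] = {a: 0.0 for a in actions}
    nbrs = {i: [j for j in nodes
                if j != i and (min(i, j), max(i, j)) in edges] for i in nodes}

    def payoff(i, j, ai, aj):
        return f[(i, j)][(ai, aj)] if i < j else f[(j, i)][(aj, ai)]

    for _ in range(iterations):
        new_mu = {}
        for i in nodes:
            for j in nbrs[i]:
                # Equation 5.7: maximize over this agent's own action,
                # adding all incoming messages except the one from j
                raw = {aj: max(payoff(i, j, ai, aj) +
                               sum(mu[(k, i)][ai] for k in nbrs[i] if k != j)
                               for ai in actions)
                       for aj in actions}
                c = sum(raw.values()) / len(actions)   # normalization c_ij
                new_mu[(i, j)] = {aj: raw[aj] - c for aj in actions}
        mu = new_mu
    # each agent picks the action with the highest sum of incoming messages
    return {i: max(actions, key=lambda ai: sum(mu[(j, i)][ai] for j in nbrs[i]))
            for i in nodes}

On small loopy graphs a couple of dozen iterations is usually enough for the messages to stabilize; the anytime wrapper of Algorithm 5.3 would additionally evaluate u(a') after every pass and keep the best joint action found so far.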
Max-plus finds the global payoff by having each agent i repeatedly send messages µ_ij to its neighbors:

    µ'_ij(a_j) = max_{a_i} { Q_ij(s_ij, a_i, a_j) + Σ_{k ∈ Γ(i)\j} µ_ki(a_i) } − c_ij        (5.7)

where µ_ki is an incoming message and µ'_ij is the outgoing message. All messages are in fact vectors over possible actions. Γ(i)\j represents all neighbors of i except j, and c_ij is a normalization factor, calculated after the initial values µ'_ij have been found. Max-plus sets this to be the average over all values of the outgoing message: c_ij = (1/|A_j|) Σ_{a_j} µ'_ij(a_j). This prevents messages from exploding in value as multiple iterations of the algorithm proceed. Once messages have converged or a time limit has been reached, each agent chooses the action that maximizes { Q_ij(s_ij, a_i, a_j) + Σ_{j ∈ Γ(i)} µ_ji(a_i) } for that agent. See Algorithm 5.3 for the full algorithm.

5.5.2 Dynamic Coordination

Although I use an edge-based decomposition (as described above), it is often the case that rewards are received on a per-agent basis instead of a per-edge basis. Thus, we must compute local Q_i functions for each agent in the graph. Following [12], I do this by assuming each Q_ij contributes equally to each agent i and j of its edge:

    Q_i(s_i, a_i) = (1/2) Σ_{j ∈ Γ(i)} Q_ij(s_ij, a_i, a_j)        (5.8)

where Γ(i) indicates the neighbors of agent i. The sum of all such Q_i functions equals Q in Equation 5.6. I assume each agent has at least one other neighbor in the coordination graph; it is straightforward to adapt these methods in cases where an agent does not need to coordinate with anyone.

Because our coordination graph is context-specific, to update the Q-function we must use an agent-based update as opposed to an edge-based update. This is because the presence or absence of edges changes from state to state, so we cannot be assured that an edge present in the current time step was available in the last time step. To obtain the agent-based update equation for an edge-based decomposition, the agent-based update (Equation 2.14) is rewritten using Equation 5.8 to get:

    Q_ij(s_ij, a_i, a_j) ← Q_ij(s_ij, a_i, a_j) + α Σ_{k ∈ {i,j}} [ R_k(s, a) + γ Q_k(s'_k, a*_k) − Q_k(s_k, a_k) ] / |Γ(k)|        (5.9)

This update equation propagates the temporal-difference error from all edges that include agents i and j to the local Q-function of each edge (i, j). This update is context-specific, because it does not require the same edges to be present at each time step of the Q-learning algorithm. It only requires that local Q_k functions can be computed for each vertex of the coordination graph, which is done using Equation 5.8.
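Equations 5.8 and 5.9 translate directly into code, assuming tabular edge values and a canonical ordering for undirected edges; the containers below are illustrative assumptions.

# Illustrative sketch of Equations 5.8 and 5.9 (not thesis code).
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.9
Q_edge = defaultdict(float)   # Q_ij(s_ij, a_i, a_j), edges stored with i < j

def edge_key(i, j):
    """Canonical (i, j) ordering so each undirected edge has one entry."""
    return (i, j) if i < j else (j, i)

def q_agent(i, nbrs, s_ij, a):
    """Equation 5.8: Q_i(s_i, a_i) = (1/2) sum_{j in Gamma(i)} Q_ij(s_ij, a_i, a_j)."""
    total = 0.0
    for j in nbrs[i]:
        e = edge_key(i, j)
        total += Q_edge[(s_ij[e], a[e[0]], a[e[1]])]
    return 0.5 * total

def edge_update(i, j, nbrs, s_ij, a, R, s_ij2, a_star):
    """Equation 5.9: push the per-agent TD errors of both endpoints, each
    divided by its degree |Gamma(k)|, into this edge's local Q-value."""
    td = sum((R[k] + GAMMA * q_agent(k, nbrs, s_ij2, a_star)
              - q_agent(k, nbrs, s_ij, a)) / len(nbrs[k])
             for k in (i, j))
    e = edge_key(i, j)
    Q_edge[(s_ij[e], a[e[0]], a[e[1]])] += ALPHA * td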
The notation Q_i and Q_ij indicates that the Q-values are agent-based or edge-based respectively. Q-function parameters are shared between agents and edges. The final Q-learning algorithm may be seen in Algorithm 5.4.

1   Initialize Q(s, a) optimistically
2   Initialize s to any starting state
3   for each step do
4       Assign tasks T to agents M by finding arg max_β Σ_t v(s, β(t), t), where v(s, g, t) = max_{a ∈ A_g} Σ_{i,j ∈ g} Q_ij(s_ij, a_i, a_j)
5       Choose a from s using the Max-plus algorithm and an ε-greedy policy derived from Q
6       Take action a, observe rewards r and next state s'
7       Use rules given with the domain to create the coordination graph G = (V, E) for state s'
8       Determine agent Q-functions Q_i(s_i, a_i) and Q_i(s'_i, a*_i) for each agent i using Q_i(s_i, a_i) = (1/2) Σ_{j ∈ Γ(i)} Q_ij(s_ij, a_i, a_j)
9       For each edge (i, j) of the coordination graph, update its Q-value using Q_ij(s_ij, a_i, a_j) ← Q_ij(s_ij, a_i, a_j) + α Σ_{k ∈ {i,j}} [ R_k(s, a) + γ Q_k(s'_k, a*_k) − Q_k(s_k, a_k) ] / |Γ(k)|
10      s ← s'
11  end
Algorithm 5.4: The assignment-based decomposition Q-learning algorithm using coordination graphs.

A complication arises during the assignment search step of Algorithm 5.1 when using coordination graphs. It is not possible to efficiently calculate the value of an assignment v(s, g, t) while still taking into account the contribution of edge-based Q-values Q_ij that occur between groups of agents assigned to different tasks. Hence, I approximate v(s, g, t) by taking into account only the local state and actions of its assigned agents and the state variables s_t, ignoring inter-group edges of the graph. At the task execution level, I consider all interactions, but since the task assignment is fixed, the possible interactions are again limited.

5.6 Experimental Results

I experimented with three different multiagent domains: the product delivery domain (discussed in Section 3.4.1), the real-time strategy game domain (discussed in Section 3.4.2), and a new predator-prey domain discussed below. I follow this with several experiments using both model-free and model-based assignment-based decomposition to solve them.

5.6.1 Multiagent Predator-Prey Domain

The multiagent predator-prey domain is a cooperative multiagent domain based on work by [11]. That original domain requires two agents (predators) to cooperate in order to capture a single prey. Agents move over a 10x10 toroidal grid world, and may move in four directions or stay in place. Prey move randomly to any empty square. Predators and prey move simultaneously, so predators must guess where the prey will be in the next time step. If predators collide, or if a predator enters the same space as the prey without an adjacent predator, the responsible predators are penalized and moved to a random empty square. The prey is captured (with a reward of 75) when one predator enters its square and another predator is adjacent.

The version of the domain used in my experiments exhibits two key differences from that of [11]: first, I increase the numbers of predators and prey from 2 vs. 1 to 4 vs. 2 or 8 vs. 4 (see Figure 5.3). Second, each time a prey is captured, it is randomly relocated somewhere else on the board and the simulation continues. Thus, this version of the domain has an infinite horizon rather than being episodic.

Figure 5.3: A possible state in an 8 vs. 4 toroidal grid predator-prey domain. All eight predators (black) are in a position to possibly capture all four prey (white).

There are several consequences of the increase in scale of this domain. Of course, the joint action and state spaces increase exponentially. More interesting is the need for predators to be assigned to prey such that exactly two predators are assigned to capture each prey, if the best average reward is to be found. Thus, this domain is an example of an MAMDP. Once predators are assigned to prey, it is useful to coordinate the actions of predators on the task execution level to prevent collisions. Thus I introduce coordination graphs on the task execution level as described in Section 5.5. The existence of a top-level assignment provides several advantages, such as when defining the rules determining when agents should cooperate. I change only one of the coordination rules introduced in [11]. Predators should coordinate when either of two conditions holds:

• the Manhattan distance between them is less than or equal to two cells, or
• both predators are assigned to the same prey.
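These two rules can be encoded as a simple edge predicate; the sketch below assumes the 10x10 torus described above, with illustrative container names.

# Illustrative encoding of the two coordination rules (not thesis code);
# `positions` maps each predator to an (x, y) cell and `assignment` maps
# predators to their assigned prey.
GRID = 10

def toroidal_manhattan(p, q):
    """Manhattan distance with wrap-around on both axes of the torus."""
    dx, dy = abs(p[0] - q[0]), abs(p[1] - q[1])
    return min(dx, GRID - dx) + min(dy, GRID - dy)

def coordination_edge(p1, p2, positions, assignment):
    """True when predators p1 and p2 should share a coordination-graph edge."""
    close = toroidal_manhattan(positions[p1], positions[p2]) <= 2
    return close or assignment[p1] == assignment[p2]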
The existence of predator assignments allows us to create improved coordination rules on the task execution level. It also reduces the number of state variables (i.e., prey) we are required to account for in the edge value function. The Q-value of each edge between predators cooperating to capture a prey need only be based on the positions of those predators relative to their assigned prey. The Q-value of each edge between predators cooperating only for collision avoidance need only be based on the positions of those two predators. The existence of these two kinds of edges does increase the number of parameters required to represent the value function, but far less than the exponential increase in the number of parameters required to store a value function over two predators and two or more prey (without function approximation).

5.6.2 Model-free Reinforcement Learning Experiments

For my experiments using model-free Q-learning and assignment-based decomposition, I compared results in two MAMDP domains: the product delivery domain discussed in Section 3.4.1, and the multiagent predator-prey domain discussed in Section 5.6.1. The product delivery domain does not require coordination on the task execution level. It is simple enough that flat and multiagent Q-learning results can be obtained (albeit requiring function approximation) for comparison to my approach using assignment-based decomposition. The multiagent predator-prey domain is more complex: standard Q-learning approaches do not work here. I also tested the use of coordination graphs, and several different assignment search techniques.

The product delivery domain may be easily described as an MAMDP. Restocking stores becomes a task for any of the multiple agents (trucks) to complete. An assignment is therefore a mapping from shops to the trucks that will serve them. Each shop is assigned one truck, which may only unload at that shop. Thus, agents' actions cannot interfere with each other, and there is no need for coordination on the task execution level. Because not all shops can be delivered to, I add a "phantom truck" for the unassigned shop. This "agent" has no associated state features; its existence allows the assignment step of the assignment-based decomposition to determine the appropriate penalty for not assigning a truck to any shop.

I conducted several experiments in this domain (see Figure 5.4). All results were averaged over 30 runs of 10^6 steps each. I tuned the learning rate α separately for each test, setting α = 0.1 for the assignment-based decomposition test and α = 0.01 for all others. I set the discount rate γ = .9, and used ε-greedy exploration with ε = .1. Average reward was measured for 2,000 out of every 50,000 steps.

The assignment-based decomposition approach used an exhaustive search of possible assignments and no function approximation. A total of 11,250 parameters is required to store the value function Q(s_t, s_g, a_g) (5 shops, 5 shop inventory levels, 10 truck locations, 5 truck loads, and 9 possible actions per truck). Here s_t indicates state features about the assigned shop and its inventory, and s_g indicates features for truck position and load.
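Spelled out, this parameter count is simply the product of the local table's five dimensions listed above:

    5 (shops) × 5 (inventory levels) × 10 (truck locations) × 5 (truck loads) × 9 (actions) = 11,250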
The joint and multiagent Q-learning approaches I used need too many parameters to represent the value function using a complete table. Thus, I used the "truck-shop tiling" approximation discussed in Section 3.4.3. Each agent uses its own value function, so four times as many parameters as the assignment-based decomposition were used. Joint agent Q-learning sums over four times as many terms, additionally indexing with each truck, but requires no additional parameters. The hand-coded approach works similarly to assignment-based decomposition: for each truck-shop pair, a distance weight was calculated from the state features. Then the assignment was made based on an exhaustive search over possible assignments, taking the assignment giving minimum total distance.

Figure 5.4: Comparison of various Q-learning approaches for the product delivery domain.

I tried two coordination methods for multiagent Q-learning: when selecting actions, I either exhaustively searched over all joint actions, or I used a simple form of multiagent coordination called serial coordination, which greedily selects actions for agents one at a time, allowing each agent to know the actions selected by previous agents.

Table 5.1: Running times (in seconds), parameters required, and terms summed over for five algorithms applied to the product delivery domain.

Algorithm                           | Time | Space  | Terms
Joint agent Q-learning              | 142  | 45,000 | 20
Multiagent Q, exhaustive search     | 160  | 45,000 | 5
Multiagent Q, serial coordination   | 3    | 45,000 | 5
Assignment-based decomposition Q    | 3    | 11,250 | 1
Hand-coded algorithm                | 3    | N/A    | N/A

Assignment-based decomposition outperformed all other approaches, although my hand-coded algorithm comes close. The multiagent Q-learning approaches performed the worst of these methods. In CPU time, both multiagent Q-learning with serial coordination and assignment-based decomposition were much faster than the approaches using an exhaustive search of the action space (Table 5.1).

I also examined the optimality of the policy found by assignment-based decomposition in the product delivery domain (Figure 5.5). The top line is an optimistic estimate of the optimal policy in this domain. I calculated this by multiplying the average number of customer visits per time step (1) by the transportation cost required to satisfy a single customer visit (−.1) to get the average transportation cost per time step required to satisfy all customers (−.1). This estimate is very optimistic, because it ignores stockout costs, which are inevitable due to the stochastic nature of customer visits. Still, the average reward of the policy found by assignment-based decomposition is quite close to my estimate.

Figure 5.5: Examination of the optimality of the policy found by assignment-based decomposition for the product delivery domain.

This analysis may be taken one step further: as my estimate ignores stockout costs, we can similarly ignore the contribution of stockout events to the average reward of the policy found by assignment-based decomposition. The result is a graph of only the transportation costs incurred by this policy, also seen in Figure 5.5. From this I conclude that the policy found by assignment-based decomposition is very close to optimal in this domain.

The second domain I experimented in is the multiagent predator-prey domain discussed in Section 5.6.1.
In these tests, results are shown over 10^7 steps of the model-free assignment-based decomposition algorithms (Algorithms 5.1 and 5.4). Figure 5.6 shows the results for 4 predators vs. 2 prey, and Figure 5.7 shows the results for 8 predators vs. 4 prey. The same set of search and coordination strategies is compared in both domains. I set the learning rate α = 0.1, discount rate γ = .9, and exploration rate ε = .2. Average reward of the domain was measured for 2,000 steps out of every 500,000 steps. During test phases, ε was set to 0. Because the maximum reward receivable by two agents is 75, edge value functions were optimistically initialized to this value. Results were averaged over 30 runs.

Figure 5.6: Comparison of action selection and search methods for the 4 vs. 2 predator-prey domain.

Figure 5.7: Comparison of action selection and search methods for the 8 vs. 4 predator-prey domain.

I conducted six identical experiments for each domain: Max-plus action selection without an assignment-based decomposition (using sparse cooperative Q-learning as in [11]), assignment-based decomposition without coordination graphs and using an exhaustive search of assignments (as in Algorithm 5.1), and assignment-based decomposition with Max-plus action selection and four assignment methods: exhaustive search, sequential greedy assignment, swap-based hill climbing, and a fixed assignment (as in Algorithm 5.4). For the fixed assignment, I arbitrarily assigned pairs of predators to prey at the start of the run, then never reassigned them.

As may be seen from these results, Max-plus search alone performed poorly compared to the other techniques. This is because a coordination graph alone is unable to capture the coordination requirements of the predator-prey domain (for reasons similar to those seen in Proposition 1). Using assignment search alone results in a large increase in performance; this kind of search does capture some essential coordination requirements. However, this alone is also not enough: it is still possible for agents to interfere (collide) with each other after assignments have been made. This type of coordination is ideal for a coordination graph approach to solve, as described in Section 5.5, and as may be seen from the experiments combining assignment search with Max-plus action selection.

Of the various task assignment methods, fixed assignment and sequential greedy assignment did not perform well. Swap-based hill climbing performed almost identically to exhaustive search. This provides hope that similar approximate search techniques can allow assignment-based decomposition to scale to a large number of agents.

I also experimented with transfer learning (Figure 5.7). Instead of initializing Q-values optimistically, I transferred parameters learned in the 4 vs. 2 domain to the 8 vs. 4 predator-prey domain. This is possible because both domains have the same number of parameters, as would domains with any number of agents, because the Q-functions are all based on 2 predators and 1 prey. I tested the resulting policy by turning off learning and using assignment-based decomposition with exhaustive assignment search and Max-plus coordination. These results demonstrate that, thanks to assignment-based decomposition, a policy learned with few agents can scale successfully to many more agents. Transfer learning is explored in greater detail in the next section.

5.6.3 Model-based Reinforcement Learning Experiments

I performed all experiments on several variations of the real-time strategy game (RTS) described in Section 3.4.2.
I focus here on expanding the transfer learning results of the previous section, and show how to use transfer learning to overcome the difficulties of scaling the RTS domain to large numbers of units. These experiments are in the context of model-based assignment-based decomposition (Section 5.2) and ATR-learning (Section 3.3.3). The relational templates of Section 3.1.2 are also used to form the value function.

Transfer learning across different domains (as in Section 3.4.4) is very helpful, but transfer learning may also provide an additional benefit when combined with assignment-based decomposition: we can transfer knowledge learned in a small domain (such as the 3 vs. 1 domains discussed in Section 3.4.4) to a larger domain, such as the 6 vs. 2 or 12 vs. 4 domains discussed here.

Figure 5.8: Comparison of 6 agents vs. 2 tasks domains.

To transfer results from the 3 vs. 1 domain to the 6 vs. 2 domain, we must use assignment-based decomposition within the larger domain. This domain has two enemy units (tasks) and six agents. Each time step, I use the algorithm shown in Algorithm 5.2 to assign agents to tasks, and allow the task execution level to decide how each group of agents should complete its single assigned task. Thus, we can adapt the relational templates used to solve the 3 vs. 1 domain to this larger problem.

If I adapt the templates used in the 3 vs. 1 domain (Table 3.1, #2-4) directly, performance will suffer due to interference (being shot at) by enemy units other than those assigned to each agent. To prevent this, I create a new aggregation feature TasksInRange(B), and define a behavior transfer function [25] ρ(π) which initializes the new relational templates that include this feature (#6-8) by transforming the old templates (#2-4) which do not. I do this simply by copying the parameters of the old templates into those of the new for all possible values of the additional dimensions.

Figure 5.9: Comparison of 12 agents vs. 4 tasks domains.

There is an additional consideration when transferring from the 3 vs. 1 to the 6 vs. 2 domains: as I am using assignment-based decomposition in the 6 vs. 2 domains, can we (or should we?) transfer the state-based value functions? While this is possible to do (by learning the function in the 3 vs. 1 domain), empirical results have shown that it is not necessary, and performance suffers very little if we learn the state-based value function from scratch each time. Hence, this is what I have done for all results in this thesis.

All experiments shown here (Figures 5.8 and 5.9) are averaged over 30 runs of 10^5 steps each. I used the ATR-learning algorithm with assignment-based decomposition (Algorithm 5.2) for most of the experiments. As with the experiments in the 3 vs. 1 domain, runs were divided into 40 alternating train/test phases, with ε = .1 or ε = 0. α is similarly adjusted independently: for the 6 vs. 2 domain, I used α = .01 for parent templates and α = .1 otherwise. For the 12 vs. 4 domain, I used α = .001 for parent templates and α = .01 otherwise.

I compared results with and without transfer (from all combined subdomains in the 3 vs. 1 domain) to the 6 Archers vs. 2 Towers domain. I used exhaustive search to compare transfer results. These results show that using transfer is significantly better than not using it at all. In addition, I tested several different forms of assignment search: exhaustive, hill climbing, bipartite, and fixed assignment. As expected, fixed assignment performs quite poorly.
Bipartite search, while performing slightly worse than exhaustive search, still does very well. The performance of hill climbing varies between that of fixed assignment and bipartite search, depending on how many times the hill climbing algorithm is used to improve the assignment. Shown are results for only one iteration of the hill climbing algorithm, which is only a modest improvement upon fixed assignment.

Finally, I tested the "flat" (no assignment-based decomposition, using the algorithm of Algorithm 3.2) 6 vs. 2 domain without transfer learning. As expected, this performed very poorly, due to the difficulty of creating an adequate function approximator for 6 agents and 2 tasks. I arrived at using only template #5 after experimentation with several other alternative templates. Even with α set very low (.008), parameters of the value function slowly continue to grow, causing performance to peak and eventually dip. This points to the inadequacy of traditional methods for solving such a large problem: we need to decompose problems of this size in order to solve them. In addition, "flat" ATR-learning is very slow on such problems, taking almost 43 times more computation time to finish a single run than when using assignment-based decomposition (Table 5.2).

My tests on the 12 vs. 4 domain have similar results. Here, I transferred from the 6 vs. 2 domain, which requires no additional relational features. Results (Figure 5.9) show that using transfer provides an enormous benefit.

Table 5.2: Experiment data and run times. Columns list domain size, units involved (Archers, Infantry, Towers, Ballista, or Knights), use of transfer learning, assignment search type ("flat" indicates no assignment search), relational templates used for state and afterstate value functions, and average time to complete a single run.

Size    | Subdomain(s)   | Transfer | Search type   | State templates | Afterstate templates | Seconds
3 vs 1  | Any            | no       | flat          | N/A             | 2-4,9                | 28
3 vs 1  | Any            | yes      | flat          | N/A             | 2-4,9                | 29
6 vs 2  | A vs. T        | no       | exhaustive    | 5,9             | 6-9                  | 34
6 vs 2  | A vs. T        | yes      | exhaustive    | 5,9             | 6-9                  | 60
6 vs 2  | A vs. T        | yes      | bipartite     | 1               | 6-9                  | 60
6 vs 2  | A vs. T        | yes      | hill climbing | 5,9             | 6-9                  | 60
6 vs 2  | A vs. T        | yes      | fixed         | N/A             | 6-9                  | 57
6 vs 2  | A vs. T        | no       | flat          | N/A             | 5                    | 2567
12 vs 4 | A vs. T        | no       | bipartite     | 1               | 6-9                  | 76
12 vs 4 | A vs. T        | yes      | bipartite     | 1               | 6-9                  | 122
12 vs 4 | A vs. T        | yes      | hill climbing | 5,9             | 6-9                  | 156
12 vs 4 | A vs. T        | yes      | fixed         | N/A             | 6-9                  | 114
12 vs 4 | A,I vs. T,B,K  | yes      | bipartite     | 1               | 6-9                  | 108

All 12 vs. 4 results but one use bipartite search (as an exhaustive search of the assignment space is unacceptably slow), and this performs very well, especially compared to no assignment search at all. Finally, I tested transfer from the 6 vs. 2 domain to a 12 vs. 4 combined problem. In the combined problem, all unit types were allowed on their respective sides (archers or infantry for the agents; ballista, tower, or knight for the enemy units). This is a very complex problem that requires the assignment step of the assignment-based decomposition algorithm to assign the best possible agents (archers or infantry) to their best possible match. While the complexity of this domain prevents the algorithm from performing as well as it does on a single subdomain, the assignment-based decomposition algorithm still performed very well.

Finally, I examine the performance of the various algorithm/domain combinations (Table 5.2). From these results, we can see that the computation time required to solve a problem using assignment-based decomposition scales linearly in the number of agents and tasks.
This is a considerable improvement over "flat" approaches, which require an amount of time exponential in the number of agents/tasks to solve each domain. As expected, not searching at all is very fast. Exhaustive search is the slowest search technique, so slow as to be unusable in the 12 vs. 4 domain. Of the various approximate search techniques, hill climbing is the slowest. Interestingly, methods that used no transfer learning were faster than those that did. This is most likely because more agents died during these runs, resulting in less time needed to compute each time step.

5.7 Summary

This chapter introduced Multiagent Assignment MDPs and gave a two-level action decomposition method that is effective for this class of MDPs. This class of MDPs can capture many real-world domains such as vehicle routing and delivery, board and real-time strategy games, disaster response, fire fighting in a city, etc., where multiple agents and tasks are involved. I showed how both model-free and model-based reinforcement learning algorithms may be adapted for use with assignment-based decomposition. In the case of model-free RL, I also showed how to combine coordination graphs with assignment-based decomposition to allow for two different types of coordination: at the upper assignment level and at the lower task execution level. I gave empirical results in two domains demonstrating that the combination of assignment search at the top level and coordinated reinforcement learning at the task execution level is well suited to solving such domains, while either method alone is not sufficiently powerful. Because a search over an exponential number of assignments is not scalable as the number of agents increases, I have also shown how several simple approximate search techniques perform effective assignment search. My results show that bipartite search performs the best in terms of speed and average reward; however, bipartite search cannot always be applied. In such situations, a method like hill climbing search is preferable. These results encourage the conclusion that assignment search is a practical approach for large cooperative multiagent domains.

Chapter 6 – Assignment-level Learning

The techniques demonstrated in Chapter 5 with assignment-based decomposition have involved the assignment level using information solely from the task execution level to make a decision. This leads to compact, scalable value functions for many agents. In this chapter, I explore the potential of trading off scalability for improved solution quality by introducing assignment-level learning. This is similar to and inspired by the way hierarchical reinforcement learning allows learning to occur at every level of the hierarchy [5].

6.1 HRL Semantics

To see how assignment-level learning might be introduced, let us first examine how assignment-based decomposition makes a decision. In Figure 6.1, we see three states (of a potentially larger MDP) in which a single agent at state s1 needs to make a choice between being assigned to task x or task y. This choice will then lead the agent to state s2 or s3 respectively, receiving reward r_x or r_y. In this figure, the local MDP at the task execution level is a one-step Markov chain, for which only a single Q-value need be learned: either Q_t(x) or Q_t(y), for which the values r_x and r_y will eventually be learned. (Here the notation Q_t is used to indicate the task-level value function.)
Figure 6.1: Information typically examined by assignment-based decomposition.

Figure 6.2: Information examined by assignment-based decomposition with assignment-level learning.

Note the lack of a state variable: the local task-execution MDPs are aware of only the local state, of which there is only one for which a value must be learned, and thus there is no need for a state variable. Because we use the task-level value function to make decisions at the assignment level, the agent is assigned to the task that provides the greatest value max(Q_t(x), Q_t(y)), which becomes max(r_x, r_y) once the appropriate parameters have been learned.

This process does not take into account what may occur after a task has been completed. Further tasks may become available, and the process may continue. For example, in Figure 6.2, new tasks z or w may be completed after finishing tasks x and y, receiving reward r_z or r_w. Tasks may potentially continue beyond this decision indefinitely, or until the episode is terminated. Previous work in hierarchical reinforcement learning [5] would suggest that making the correct decision in state s1 requires that we take into account potential sources of reward that occur after the current task is finished. That is, we introduce a value function at the assignment level and learn the value of the contribution of any reward we receive after the current task is finished. This is called the completion function.

We can learn the completion function by conceptually splitting the assignment-level Q-function into two parts: the existing Q_t function, which is taken from the task execution level, and a Q_a function, which is learned and used only at the assignment level. Thus the decision of how to assign a single agent in Figure 6.2 is made by comparing the values max(Q_t(x) + Q_a(s1, x), Q_t(y) + Q_a(s1, y)), which becomes max(r_x + r_z, r_y + r_w) once appropriate values of all parameters are learned. Attaching this meaning to the Q_a function is what I call hierarchical reinforcement learning or HRL semantics, and I will now proceed to show why this meaning may yield bad results.

To begin, let us examine Figure 6.3, which shows a simple 4-state MDP with a single agent and two tasks. Should the agent be assigned to task x in the first state, it "dies" and receives a negative reward. However, if it is assigned to task y, it receives a zero reward, but is given the opportunity to try task x again. This time, it receives a positive reward.

Figure 6.3: A 4-state MDP with two tasks.

Let us start by examining what happens if we try to solve this problem with assignment-based decomposition. Assignment-based decomposition learns only two parameters, one per task: Q_t(x) and Q_t(y). Q_t(y) will always be zero (because this is the reward received for completing task y in state s3). The value of Q_t(x) is less certain: it may be less than or greater than zero, depending on which task the agent is assigned to in state s1. If the assignment is always optimal, the agent should learn Q_t(x) = 1. However, as soon as Q_t(x) > 0, the agent will be assigned to task x in state s1, and eventually Q_t(x) becomes negative. This leads to an oscillation of this value around zero, which is undesirable.
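This oscillation is easy to reproduce. The following toy simulation is an illustrative assumption rather than thesis code: it takes the two rewards to be −1 (dying on task x from s1) and +1 (completing x from s3), with task y always paying 0, and tracks the single aliased value Q_t(x) across episodes.

# Toy reproduction of the oscillation in the 4-state MDP of Figure 6.3.
ALPHA = 0.1
Qt = {"x": 0.0, "y": 0.0}

for episode in range(1000):
    if Qt["x"] > Qt["y"]:
        # assigned to x in s1: the agent dies and Qt[x] is pulled toward -1
        Qt["x"] += ALPHA * (-1.0 - Qt["x"])
    else:
        # assigned to y in s1 (zero reward), then to x in s3 (+1 reward)
        Qt["y"] += ALPHA * (0.0 - Qt["y"])
        Qt["x"] += ALPHA * (1.0 - Qt["x"])
# Qt["x"] rises above zero after a y-then-x episode and is pushed back
# below zero on the next episode: it never settles, oscillating around 0.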
To apply HRL semantics to solve this problem, we introduce two new parameters, Q_a(s1, x) and Q_a(s1, y), which learn the completion function for state s1 and tasks x and y. Q_a(s1, x) must be zero, because no reward is ever received after task x is finished. Q_a(s1, y) = 1, because this is the only reward ever received after task y is finished. Should the assignment be optimal, Q_t(x) = 1 and Q_t(y) = 0 for the same reasons discussed previously. In this situation, we compute max(Q_t(x) + Q_a(s1, x), Q_t(y) + Q_a(s1, y)) = max(1 + 0, 0 + 1) = max(1, 1) = 1. Note that both tasks appear equally optimal in this case, which is undesirable. The correct equation should compare the true Q-values at the end of the two possible episodes, −1 and 1.

Why is HRL semantics failing here? It is because, at the task execution level, states s1 and s3 are aliased; that is, the task execution level, and through it the assignment-level Q-function, fails to distinguish between these states. This is because the task-level Q-values Q_t(x) and Q_t(y) take into account only the local state. In hierarchical reinforcement learning, this would be considered an "unsafe abstraction". This unsafe abstraction could be removed only by having the Q_t functions take into account the global state; however, this becomes impractical as the number of agents increases, and the benefits of assignment-based decomposition are lost.

6.2 Function Approximation Semantics

The problem with HRL semantics may be repaired by using different semantics, which I call "function approximation semantics". This simply supposes that the Q_a value should not be the completion function, but merely serves to correct the task execution value Q_t towards the true Q-value. For example, in Figure 6.3, we have Q_t values fixed by the task-level rewards (assuming optimal assignment) so that Q_t(x) = 1 and Q_t(y) = 0. In order to arrive at the true Q-values to be compared in state s1, let Q_a(s1, x) = −2 and Q_a(s1, y) = 1. Thus, Q(s1, x) = Q_t(x) + Q_a(s1, x) = 1 + (−2) = −1, and Q(s1, y) = Q_t(y) + Q_a(s1, y) = 0 + 1 = 1, which are the correct values and allow the correct decision to be made.

To determine how we might learn these improved values, we can simply substitute Q_a(s, β) + Σ_t v(s, β(t), t), where β is the assignment (i.e., the assignment-level action), for Q(s, a) in the Q-learning update equation:

    Q(s, a) ← Q(s, a) + α[ r + γ max_{a' ∈ A(s')} Q(s', a') − Q(s, a) ]        (6.1)

which becomes:

    Q_a(s, β) + Σ_t v(s, β(t), t) ← Q_a(s, β) + Σ_t v(s, β(t), t) + α[ r + γ max_{β'} ( Q_a(s', β') + Σ_t v(s', β'(t), t) ) − ( Q_a(s, β) + Σ_t v(s, β(t), t) ) ]        (6.2)
After simplifying and eliminating Σ_t v(s, β(t), t) from both sides of the arrow (because these values are updated during the task execution-level update), we get:

    Q_a(s, β) ← Q_a(s, β) + α[ r + γ max_{β'} ( Q_a(s', β') + Σ_t v(s', β'(t), t) ) − Q_a(s, β) − Σ_t v(s, β(t), t) ]        (6.3)

Because assignments need not change often, a useful approximation of Equation 6.3 is possible. Instead of searching over all possible assignments to compute max_{β'} ( Q_a(s', β') + Σ_t v(s', β'(t), t) ), we can approximate this quantity by simply using the current assignment β, which we can assume remains good in the next step:

    Q_a(s, β) ← Q_a(s, β) + α[ r + γ ( Q_a(s', β) + Σ_t v(s', β(t), t) ) − Q_a(s, β) − Σ_t v(s, β(t), t) ]        (6.4)

Algorithm 6.1 shows the final algorithm for model-free assignment-based decomposition with assignment-level learning. Similar methods to those shown in Section 5.2 may be used to adapt this algorithm to a model-based method, as shown in Algorithm 6.2. The notation α^a, γ^a, α^t, and γ^t refers to the learning rates and discount factors of the assignment- and task-level Q-functions. Note the use of the notation Q_t and Q_a to distinguish between task- and assignment-level Q-functions respectively: these subscripts do not indicate an index over particular tasks or actions, for example.

1   Initialize Q_a(s, a) to all zeros, and Q_t(s, a) optimistically
2   Initialize s to any starting state
3   for each step do
4       Assign tasks T to agents M by finding arg max_β Q_a(s, β) + Σ_t v(s, β(t), t), where v(s, g, t) = max_{a ∈ A_g} Q_t(s_t, s_g, a)
5       For each task t, choose actions a_β(t) from s_β(t) using an ε-greedy policy derived from Q_t
6       Take action a, observe rewards r and next state s'
7       foreach task t do
8           Q_t(s_t, s_β(t), a_β(t)) ← Q_t(s_t, s_β(t), a_β(t)) + α^t [ r_β(t) + γ^t max_{a'} Q_t(s'_t, s'_β(t), a') − Q_t(s_t, s_β(t), a_β(t)) ]
9       Q_a(s, β) ← Q_a(s, β) + α^a [ r + γ^a max_{β'} ( Q_a(s', β') + Σ_t v(s', β'(t), t) ) − Q_a(s, β) − Σ_t v(s, β(t), t) ]
10      s ← s'
11  end
Algorithm 6.1: The assignment-based decomposition Q-learning algorithm with assignment-level learning.

1   Initialize state and afterstate h-functions v(·) and av(·)
2   Initialize assignment-level Q-function Q_a(·)
3   Initialize s to any starting state
4   for each step do
5       Assign tasks T to agents M by finding arg max_β Σ_t y(s, β(t)), where y(s, g, t) = v(s_t, s_g)
6       For each task t, find joint action u_β(t) ∈ A_β(t) that maximizes r_β(t)(s, u_β(t)) + av(s_{u_β(t)})
7       Take an exploratory action or a greedy action in the state s. For each set of agents β(t), let a_β(t) be the joint action taken, r_β(t) the reward received, s_{a_β(t)} the corresponding afterstate, and s' the resulting state.
8       Update the model parameters r_β(t)(s, a_β(t))
9       foreach task t do
10          Let Target_t = max_{u_β(t) ∈ A_β(t)} { r_β(t)(s', u_β(t)) + av(s'_{u_β(t)}) }
11          av(s_{a_β(t)}) ← av(s_{a_β(t)}) + α(Target_t − av(s_{a_β(t)}))
12          v(s_t, s_β(t)) ← v(s_t, s_β(t)) + α(r_β(t) + Target_t − v(s_t, s_β(t)))
13      end
14      Q_a(s, β) ← Q_a(s, β) + α^a [ r + γ^a max_{β'} ( Q_a(s', β') + Σ_t y(s', β'(t)) ) − Q_a(s, β) − Σ_t y(s, β(t)) ]
15      s ← s'
16  end
Algorithm 6.2: The ATR-learning algorithm with assignment-based decomposition and assignment-level learning.
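The approximate assignment-level update of Equation 6.4 is small enough to sketch directly; the containers below, and the precomputed sums of task-level values passed in, are illustrative assumptions.

# Sketch of the approximate assignment-level update of Equation 6.4
# (not thesis code); beta must be hashable, e.g. a tuple of task groups.
from collections import defaultdict

ALPHA_A, GAMMA_A = 0.001, 1.0
Qa = defaultdict(float)  # Qa(s, beta): assignment-level correction term

def qa_update(s, beta, reward, s_next, v_sum, v_sum_next):
    """v_sum is sum_t v(s, beta(t), t); v_sum_next the same quantity at s'.
    Both come from the task-level value functions already being learned.
    Equation 6.4 reuses the current assignment beta in the target instead
    of maximizing over all assignments."""
    target = reward + GAMMA_A * (Qa[(s_next, beta)] + v_sum_next)
    Qa[(s, beta)] += ALPHA_A * (target - Qa[(s, beta)] - v_sum)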
6.3 Experimental Results

In this section I show the results of several experiments in two different domains: first, the simple 4-state MDP shown in Figure 6.3, and second, the real-time strategy game domain discussed in Section 3.4.2. These results demonstrate the improvement gained by using function approximation semantics over either HRL semantics or assignment-based decomposition without assignment-level learning.

6.3.1 Four-state MDP Domain

Figure 6.4 compares the results of normal assignment-based decomposition vs. both assignment-level learning strategies discussed here on the simple 4-state MDP shown in Figure 6.3. Results were plotted for 1,000 episodes, and averaged over 10,000 runs. Softmax exploration with a temperature τ = .5 and learning rate α^a = α^t = .01 was used in order to demonstrate the difference between the HRL and function approximation semantics.

Figure 6.4: Comparison of various strategies for assignment-level learning.

As may be seen from these results, assignment-based decomposition with no learning fails, as expected, to perform well in this domain. HRL semantics improves performance, but because the assignment-level Q-values for both tasks remain very close to each other, it still does not perform well. Only function approximation semantics learns appropriate Q_a values and performs near-optimally. It is this semantics which I continue to use in my experiments in Section 6.3.2.

6.3.2 Real-Time Strategy Game Domain

These experiments are in the context of model-based assignment-based decomposition (Section 5.2) and ATR-learning (Section 3.3.3). The relational templates of Section 3.1.2 are also used to represent the value function. To test assignment-level learning in a more complex domain, I experimented with several different combinations of units in the real-time strategy domain, with and without assignment-level learning. I used model-based assignment-based decomposition as described in Section 5.2. All results are for 10^5 steps, averaged over 30 runs. I used γ^t = .95 at the task level for this problem, and γ^a = 1 at the assignment level.

Because the global state is very complex and cannot be efficiently stored in a table, a function approximator that can capture the needed global interactions at the assignment level must be used. For these tests, I created a table over several derived features of the global state. These derived features consisted of a count of the number of enemy units of each type, and a count of the number of units of each type assigned to each type of enemy unit. For example, these derived features could capture the fact that there are three archers assigned to a single enemy ballista, three assigned to a single enemy knight, and no agents assigned to the two remaining enemy units. Other information that might be useful, such as hit points of units or relative distances, could not be captured due to the exponential increase in the number of parameters required. Still, empirical results show that even these simple derived features can improve performance significantly with assignment-level learning.

Figure 6.5: Comparison of assignment-based decomposition with and without assignment-level learning for the 3 vs. 2 real-time strategy domain.

Figure 6.5 shows results for three agents vs. two enemy units. One enemy was a dangerous knight or ballista; the second enemy was a harmless, immobile "hall". Both enemy units returned a reward of 1 if killed. If enemy units killed an agent, a reward of −1 was received. As the state information for both enemies was not included in the local state of the task, when the agents are assigned to the hall, the problem of predicting whether the knight will kill them is very difficult. The hall is harmless, and so attacking the hall first might appear more attractive. Unfortunately, ignoring the knight will allow it to quickly kill off agents. Thus, the correct decision is to attack the knight first, then the hall. This is analogous to the problem presented in the simple 4-state MDP of Figure 6.3.

An exhaustive search over possible assignments was performed, with α^t = .1 at the task level and α^a = .001 at the assignment level. As may be seen from these results, adding assignment-level learning improves average reward significantly, although assignment-based decomposition alone still performs fairly well. Also included in these results is a comparison of assignment-level learning with and without the approximation of Equation 6.4. These results show that this approximation performs very similarly to the full update of Equation 6.3 for these domains.
Figure 6.6 tests assignment-based decomposition with and without assignment-level learning on six agents vs. four enemy units. I introduced a new unit here called the "glass cannon" (see Section 3.4.2), which instantly kills any unit it hits, and likewise is instantly killed if attacked. The learning rate was α^t = .001 for the task execution level. For the assignment level, I set the learning rate to start at 1 and divided it by α^a + 1 every 100 time steps. I used hill climbing assignment search (repeated three times) to find the best action. Bipartite search cannot be used with assignment-level learning, as determining the value of the assignment requires the global state, which bipartite search cannot provide. This time, I gave a reward of zero if the glass cannon was killed, and ten if the hall was killed.

Figure 6.6: Comparison of the 6 archers vs. 2 glass cannons, 2 halls domain.

As with the 3 vs. 2 results, this made attacking the hall much more attractive, and so average reward suffered without assignment-level learning. Still, performance was robust in either case, although assignment-level learning improved results significantly. Use of assignment-level learning with the approximate update also improved performance, though not as much as with the full update of Equation 6.3.

In Figure 6.7, I performed tests much as I did in Figure 6.6; however, I used different units and returned to more standard rewards (a reward of 1 was given for all enemy kills). This time I tested six archers vs. two knights and two halls. Both learning rates α^t = α^a = .001. Using assignment-level learning again improved performance over no learning at the assignment level; however, this time the improvement was much less than that of Figure 6.6. This is because the rewards given for killing the enemy units are very similar, and so assigning agents to one or the other unit appears similarly attractive.

Figure 6.7: Comparison of the 6 agents vs. 4 tasks domain.

These results also demonstrate that assignment-based decomposition can perform robustly regardless of whether or not assignment-level learning is used. Interestingly, the approximate update of Equation 6.4 outperformed the full update here. It was not possible to perform tests of assignment-level learning for larger numbers of agents, as the global table function approximator grows too large. Future work may involve seeking a way to mitigate this difficulty.

6.4 Summary

Assignment-based decomposition is a robust method for solving MAMDP problems; however, under certain circumstances – particularly when local state information is insufficient to correctly differentiate between the true values of a particular task and assignment – assignment-based decomposition may underperform. This may occur, for example, when the rewards (and thus the learned Q-values) are very different between tasks, making it appear that the higher-valued task should be completed first, even if completing a lower-valued task first will lead to greater reward in the long term. To mitigate this problem, I have shown how previous work in hierarchical reinforcement learning can inspire "assignment-level learning". However, unlike hierarchical reinforcement learning, assignment-level learning requires a different method for learning a value function at the assignment level. This is due to the potentially unsafe abstractions caused by global interactions that cannot easily be captured by a local task-based value function approximator.
Note that although assignment-level learning can improve average reward significantly, the ability to scale to larger numbers of units can be greatly impaired. This is because the size of the value function over the global state at the assignment level grows exponentially in the number of agents.

Chapter 7 – Conclusions

7.1 Summary of Contributions

Throughout this thesis I have presented several techniques for mitigating each of the three curses of dimensionality, either singly or several at once. Function approximation techniques such as tabular linear functions and relational templates mitigate the first curse of dimensionality (state space explosion). A hill climbing search of the action space, or another approximate search technique, can mitigate the second curse of dimensionality (action space explosion). Afterstate-based methods such as ASH-learning and ATR-learning can help mitigate the third curse of dimensionality (outcome space explosion). Methods such as multiagent H-learning, multiagent ASH-learning, and assignment-based decomposition can mitigate some or all of the curses of dimensionality at once. Finally, specialized techniques such as transfer learning, while not typically used for this purpose, can combine with assignment-based decomposition to scale domains with few agents to much larger numbers of agents. For a summary of some of the contributions of this dissertation and the curses each contribution can help mitigate, see Table 7.1.

Table 7.1: The contributions of several methods discussed in this thesis towards mitigating the three curses of dimensionality.

Method                             | State Space | Action Space | Outcome Space
Tabular linear functions           | yes         |              |
Relational templates               | yes         |              |
Hill climbing the action space     |             | yes          |
Efficient expectation calculation  |             |              | yes
ASH-learning                       |             |              | yes
ATR-learning                       |             |              | yes
Multiagent H-learning              | yes         | yes          |
Multiagent ASH-learning            | yes         | yes          | yes
Assignment-based decomposition     | yes         | yes          | yes
Transfer learning                  | yes         |              |

It seems apparent that in order to solve the most complex domains, some combination of all these scaling methods will be required. In particular, starting with either assignment-based decomposition or a simpler multiagent method, and adding any further techniques required, such as function approximation, will often perform very well. The fastest results found in this thesis use assignment-based decomposition in combination with fast approximate assignment search techniques such as bipartite search. Using these techniques, I have achieved a nearly linear increase in required computation time as additional agents are added to the domain, as opposed to the exponential amount of time conventional RL approaches require (as shown in Table 5.2).

All techniques discussed in this thesis involve tradeoffs, usually solution quality for speed. By using the right techniques, it is hoped that the loss in solution quality is negligible. In general, it is usually necessary to test several alternatives before becoming confident that the approximations made are not too damaging to the expected reward received. Just as each curse of dimensionality may be more or less onerous for any particular domain, the possible tradeoffs required to mitigate each curse may be more or less damaging to the expected reward. This tradeoff is particularly obvious when choosing a function approximator. Typically, the more parameters allowed in the function approximator, the better the value function that can be represented.
Unfortunately, convergence time is closely related to the number of parameters to be learned, and in the worst case this number can be exponential in the number of dimensions of the state (as in a tabular representation).

Approaches that decompose the joint agent into a multiagent problem also show an obvious tradeoff: in this case, the quality of coordination between agents. A joint agent approach can coordinate the agents perfectly. A multiagent or assignment-based decomposition approach sacrifices perfect coordination for fast action selection. In practice, most multiagent domains do not require perfect coordination between agents. Some coordination is usually necessary, however, and it is not always clear what form it should take. In this thesis I presented three general kinds of such coordination: serial coordination, coordination graphs, and assignment-based decomposition. I showed that these techniques need not be mutually exclusive and can complement each other. Picking one or two of these methods is often sufficient for most domains. A minimal sketch of serial coordination is given at the end of this chapter.

7.2 Discussion and Future Work

An interesting area of possible future work in model-based assignment-based decomposition is the introduction of coordination graphs, as was done for model-free reinforcement learning in Section 5.5. Coordination graphs are not sufficient to coordinate assignment decisions [17]; however, they are useful for coordinating between agents at the task-execution level, for example to avoid collisions. The RTS domain introduced here does not model collisions, and so there is no need for low-level coordination between tasks as there is in [17]. Introducing collisions to this RTS domain would be straightforward, and it would require adapting the use of coordination graphs to a model-based RL algorithm. Future work also includes scaling the approaches in this thesis to work with much larger numbers of agents, tasks, and state variables, and considering other kinds of interactions such as global resource constraints.

Future work in assignment-based decomposition could address adapting it to a decentralized domain. The Max-plus algorithm can already be decentralized [12]; however, assignment-based decomposition assumes a centralized controller. Adapting these algorithms to work in a decentralized context could involve message-passing techniques similar to those used by the Max-plus algorithm.

Lastly, I hope to continue to explore the similarities and differences between assignment-based decomposition and hierarchical reinforcement learning. In particular, I hope to generalize my work with assignment-based decomposition to handle complex hierarchical domains and multi-level assignment structures. For example, one might imagine a domain inspired by real-life army hierarchies. At the top, there is a general in command of several corps. Each corps has several divisions, followed by a hierarchical structure of brigades, battalions, companies, platoons, squads, and finally individual soldiers. A generalization of assignment-level learning to multi-level domains such as this would be an exciting new application. To my knowledge, no previous work in hierarchical RL has explored such complex domains with many agents.
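The sketch below illustrates serial coordination in the spirit described above: each agent in turn greedily improves its own action while the other agents' current choices are held fixed, so a few serial sweeps replace a search of the exponentially large joint action space. The function and parameter names (q_value, actions, sweeps) are hypothetical rather than the thesis's exact interfaces.

    # A minimal sketch of serial coordination. Each agent in turn picks
    # its best action given the actions currently chosen by the others.
    # `q_value(state, joint_action)` is an assumed evaluator; all names
    # here are hypothetical.
    def serial_coordination(agents, state, q_value, actions, sweeps=3):
        # Start from an arbitrary joint action.
        joint = {agent: actions[agent][0] for agent in agents}
        for _ in range(sweeps):
            for agent in agents:  # fixed serial order
                joint[agent] = max(
                    actions[agent],
                    key=lambda act: q_value(state, {**joint, agent: act}))
        return joint

With n agents and k actions each, one sweep evaluates n x k joint actions rather than the k^n required by exhaustive joint search, which is the action-space mitigation claimed in Table 7.1.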
Bibliography

[1] A. G. Barto, S. J. Bradtke, and S. P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72(1-2):81–138, 1995.

[2] R. E. Bellman. Dynamic Programming. Dover Publications, Incorporated, 2003.

[3] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.

[4] O. Bräysy and M. Gendreau. Vehicle Routing Problem with Time Windows, Part II: Metaheuristics. Working paper, SINTEF Applied Mathematics, Department of Optimisation, Norway, 2003.

[5] T. G. Dietterich. The MAXQ method for hierarchical reinforcement learning. In J. W. Shavlik, editor, ICML '98: Proceedings of the 15th International Conference on Machine Learning, pages 118–126. Morgan Kaufmann, 1998.

[6] M. Ghavamzadeh and S. Mahadevan. Learning to communicate and act using hierarchical reinforcement learning. In AAMAS '04: Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems, pages 1114–1121, Washington, DC, USA, 2004. IEEE Computer Society.

[7] C. Guestrin, D. Koller, C. Gearhart, and N. Kanodia. Generalizing plans to new environments in relational MDPs. In IJCAI '03: Proceedings of the 18th International Joint Conference on Artificial Intelligence, pages 1003–1010, 2003.

[8] C. Guestrin, D. Koller, and R. Parr. Multiagent planning with factored MDPs. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, NIPS '01: Advances in Neural Information Processing Systems 14, pages 1523–1530. MIT Press, 2001.

[9] C. Guestrin, M. Lagoudakis, and R. Parr. Coordinated reinforcement learning. In ICML '02: Proceedings of the 19th International Conference on Machine Learning, San Francisco, CA, July 2002. Morgan Kaufmann.

[10] C. Guestrin, S. Venkataraman, and D. Koller. Context-specific multiagent coordination and planning with factored MDPs. In AAAI '02: Proceedings of the 18th National Conference on Artificial Intelligence, pages 253–259, Edmonton, Canada, July 2002.

[11] J. R. Kok and N. A. Vlassis. Sparse cooperative Q-learning. In R. Greiner and D. Schuurmans, editors, ICML '04: Proceedings of the 21st International Conference on Machine Learning, pages 481–488, Banff, Canada, July 2004. ACM.

[12] J. R. Kok and N. A. Vlassis. Collaborative multiagent reinforcement learning by payoff propagation. Journal of Machine Learning Research, 7:1789–1828, 2006.

[13] H. Kuhn. The Hungarian Method for the assignment problem. Naval Research Logistics Quarterly, 2:83–97, 1955.

[14] R. Makar, S. Mahadevan, and M. Ghavamzadeh. Hierarchical multi-agent reinforcement learning. In AGENTS '01: Proceedings of the 5th International Conference on Autonomous Agents, pages 246–253, Montreal, Canada, 2001. ACM Press.

[15] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1988.

[16] W. B. Powell and B. Van Roy. Approximate Dynamic Programming for High-Dimensional Dynamic Resource Allocation Problems. In J. Si, A. G. Barto, W. B. Powell, and D. Wunsch, editors, Handbook of Learning and Approximate Dynamic Programming. Wiley-IEEE Press, Hoboken, NJ, 2004.

[17] S. Proper and P. Tadepalli. Solving multiagent assignment Markov decision processes. In AAMAS '09: Proceedings of the 8th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 681–688, 2009.

[18] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley, 1994.

[19] A. Schwartz. A Reinforcement Learning Method for Maximizing Undiscounted Rewards. In ICML '93: Proceedings of the 10th International Conference on Machine Learning, pages 298–305, Amherst, Massachusetts, 1993. Morgan Kaufmann.

[20] N. Secomandi. Comparing Neuro-Dynamic Programming Algorithms for the Vehicle Routing Problem with Stochastic Demands. Computers and Operations Research, 27(11-12), September 2000.

[21] N. Secomandi.
A Rollout Policy for the Vehicle Routing Problem with Stochastic Demands. Operations Research, 49(5):796–802, 2001.

[22] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning, pages 216–224. Morgan Kaufmann, 1990.

[23] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.

[24] P. Tadepalli and D. Ok. Model-based Average Reward Reinforcement Learning. Artificial Intelligence, 100:177–224, 1998.

[25] M. E. Taylor and P. Stone. Behavior transfer for value-function-based reinforcement learning. In F. Dignum, V. Dignum, S. Koenig, S. Kraus, M. P. Singh, and M. Wooldridge, editors, AAMAS '05: Proceedings of the 4th International Joint Conference on Autonomous Agents and Multiagent Systems, pages 53–59, New York, NY, July 2005. ACM Press.

[26] S. Thrun. The role of exploration in learning control. In D. White and D. Sofge, editors, Handbook for Intelligent Control: Neural, Fuzzy and Adaptive Approaches. Van Nostrand Reinhold, Florence, Kentucky, 1992.

[27] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22(1-3):59–94, 1996.

[28] B. Van Roy, D. P. Bertsekas, Y. Lee, and J. N. Tsitsiklis. A Neuro-Dynamic Programming Approach to Retailer Inventory Management. In Proceedings of the IEEE Conference on Decision and Control, 1997.

[29] M. Wainwright, T. Jaakkola, and A. Willsky. Tree consistency and bounds on the performance of the max-product algorithm and its generalizations. Statistics and Computing, 14(2):143–166, 2004.

[30] J. S. Yedidia, W. T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. In Exploring Artificial Intelligence in the New Millennium, pages 239–269, 2003.