From: AAAI Technical Report SS-02-02. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.

Coordinated Reinforcement Learning

Carlos Guestrin, Computer Science Dept., Stanford University, guestrin@cs.stanford.edu
Michail Lagoudakis, Computer Science Dept., Duke University, mgl@cs.duke.edu
Ronald Parr, Computer Science Dept., Duke University, parr@cs.duke.edu

Abstract

We present several new algorithms for multiagent reinforcement learning. A common feature of these algorithms is a parameterized, structured representation of a policy or value function. This structure is leveraged in an approach we call coordinated reinforcement learning, by which agents coordinate both their action selection activities and their parameter updates. Within the limits of our parametric representations, the agents will determine a jointly optimal action without explicitly considering every possible action in their exponentially large joint action space. Our methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture.

1 Introduction

Consider a system where multiple agents, each with its own set of possible actions and its own observations, must coordinate in order to achieve a common goal. We want to find a mechanism for coordinating the agents' actions so as to maximize their joint utility. One obvious approach to this problem is to represent the system as a Markov decision process (MDP), where the "action" is a joint action for all of the agents and the reward is the total reward for all of the agents. The immediate difficulty with this approach is that the action space is quite large: if there are g agents, each of which can take a actions, then the joint action space has size a^g.

One natural approach to reducing the complexity of this problem is to restrict the amount of information that is available to each agent and hope to maximize global welfare by solving local optimization problems for each agent [14]. In some cases, it is possible to manipulate the presentation of information to the agents in a manner that forces local optimizations to imply global optimizations [17]. In general, however, the problem of finding a globally optimal solution for agents with partial information is known to be intractable [2].

Following [7], we present an approach that combines value function approximation with a message passing scheme by which the agents efficiently determine the jointly optimal action with respect to an approximate value function. Our approach is based on approximating the joint value function as a linear combination of local value functions, each of which relates only to the parts of the system controlled by a small number of agents. We show how such factored value functions allow the agents to find a globally optimal joint action using a very natural message passing scheme. This scheme can be implemented as a negotiation procedure for selecting actions at run time. Alternatively, if the agents share a common observation vector, each agent can efficiently determine the actions that will be taken by all of the collaborating agents without any additional communication.

Given an action selection mechanism, the remaining task is to develop a reinforcement learning algorithm that is capable of producing value functions of the appropriate form. An algorithm for computing such value functions is presented in [7] for the case where the model is known and represented as a factored MDP.
This is the first application of these techniques in the context of reinforcement learning, where we no longer require a factored model or even a discrete state space. We begin by presenting two methods for computing an appropriate value function through reinforcement learning: a variant of Q-learning and a variant of model-free policy iteration called Least Squares Policy Iteration (LSPI) [11]. We also demonstrate how parameterized value functions of the form acquired by our reinforcement learning variants can be combined in a very natural way with direct policy search methods such as [13; 12; 1; 15; 9]. The same communication and coordination structures used in the value function approximation phase are used in the policy search phase to sample from and update a factored stochastic policy function.

We call our overall approach Coordinated Reinforcement Learning because structured coordination between agents is used at the core of our learning algorithms and in our execution architectures. Our experimental results indicate that, in the case of LSPI, the message passing action selection mechanism and value function approximation can be combined to produce effective policies.

2 Cooperative Action Selection

We begin by considering the simpler problem of having a group of agents select a globally optimal joint action in order to maximize the sum of their individual utility functions. Suppose we have a collection of agents, where each agent j must choose an action aj from a finite set of possible actions Dom(Aj). We use A to denote {A1, ..., Ag}. The agents are acting in a space described by a set of state variables, X = {X1, ..., Xn}, where each Xi takes on values in some domain Dom(Xi). A state x defines a value xi ∈ Dom(Xi) for each variable Xi, and an action a defines an action aj ∈ Dom(Aj) for each agent. The agents must choose the joint action a that maximizes the total utility.

In general, the total utility Q will depend on all state variables X and on the actions of all agents A. However, in many practical problems, it is possible to approximate the total utility Q by the sum of local sub-utilities Qj, one for each agent. The total utility then becomes Q = Σ_j Qj. For example, consider the decision process of a section manager in a warehouse. Her local utility Qj may depend on the state of the inventory of her section, on her decision of which products to stock up, and on the decision of the sales manager over pricing and special offers. On the other hand, it may not depend directly on the actions of the customer support team. The decisions of the support team will still be relevant, however, as they may affect the actions of the sales manager.

Computing the action that maximizes Q = Σ_j Qj in such problems seems intractable a priori, as it would require the enumeration of the joint action space of all agents. Fortunately, by exploiting the local structure in the Qj functions through a coordination graph, we can compute the optimal action very efficiently, with limited communication between agents and limited observability, as proposed in [7]. We repeat the construction here as it will be important throughout this paper.

In our framework, each agent j has a local utility function Qj. An agent's local Q-function might be influenced by a subset of the state variables, the agent's own action, and the actions of some other agents; we define Scope[Qj] ⊆ X ∪ A to be the set of state variables and agents that influence Qj. (We use Qj(x, a) to denote the value of Qj applied to the instantiation of the variables in Scope[Qj] within x, a.)
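As a concrete illustration of this representation (our own sketch, not part of the original construction; the names LocalQ, scope and table are hypothetical), a local utility with a restricted scope can be stored as a table that is keyed only by the variables in Scope[Qj]:

    # Hypothetical sketch: a local utility Q_j defined only over the variables in Scope[Q_j].
    class LocalQ:
        def __init__(self, scope, table):
            self.scope = scope   # ordered tuple of variable/agent names, e.g. ('X1', 'A1', 'A2')
            self.table = table   # dict: tuple of values for `scope` -> real number

        def value(self, assignment):
            # Q_j(x, a): look up the instantiation of Scope[Q_j] within the full assignment (x, a).
            return self.table[tuple(assignment[v] for v in self.scope)]

    def total_utility(local_qs, assignment):
        # Q(x, a) = sum_j Q_j(x, a); each Q_j reads only the variables in its own scope.
        return sum(q.value(assignment) for q in local_qs)

Only the variables named in a local utility's scope are ever read, which is what allows each agent to evaluate its Qj from its own observations and the actions of a few neighbors.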
The scope of Qj can be further divided into two parts: the observable state variables, Observable[Qj] = {Xi ∈ X | Xi ∈ Scope[Qj]}, and the relevant agent decision variables, Relevant[Qj] = {Ai ∈ A | Ai ∈ Scope[Qj]}. This distinction will allow us to characterize the observations each agent needs to make and the type of communication needed to obtain the jointly optimal action, i.e., the joint action choice that maximizes the total utility Q = Σ_j Qj. We note that each Qj may be further decomposed as a linear combination of functions that involve fewer variables; in this case, the complexity of the algorithm may be further reduced.

Recall that our task is to find a coordination strategy for the agents to maximize Σ_j Qj at state x. First, note that the scope of the Qj functions that comprise the value can include both action choices and state variables. We assume that the agents have full observability of the relevant state variables, i.e., agent j can observe Observable[Qj]. Given a particular state x = {x1, ..., xn}, agent j can instantiate the part of Qj that depends on the state x, i.e., condition Qj on the state x. Note that each agent only needs to observe the variables in Observable[Qj], thereby decreasing considerably the amount of information each agent needs to gather. After conditioning on the current state, each Qj will only depend on the action choices of the agents in Relevant[Qj].

[Figure 1: Coordination graph for a 4-agent problem.]

Our task is now to select a joint action a that maximizes Σ_j Qj(a). The fact that the Qj depend on the actions of multiple agents forces the agents to coordinate their action choices. We can represent the coordination requirements of the system using a coordination graph, where there is a node for each agent and an edge between two agents if they must directly coordinate their actions to optimize some particular Qj. Fig. 1 shows the coordination graph for an example where

Q = Q1(a1, a2) + Q2(a2, a4) + Q3(a1, a3) + Q4(a3, a4).

A graph structure suggests the use of a cost network [6], which can be solved using non-serial dynamic programming [3] or a variable elimination algorithm which is virtually identical to variable elimination in a Bayesian network. We review this construction here, as it is a key component in the rest of the paper.

The key idea is that, rather than summing all functions and then performing the maximization, we maximize over variables one at a time. When maximizing over a variable Al, only the summands involving Al participate in the maximization. In our example, we wish to compute:

max_{a1,a2,a3,a4} Q1(a1, a2) + Q2(a2, a4) + Q3(a1, a3) + Q4(a3, a4).

Let us begin our optimization with agent 4. To optimize over A4, the functions Q1 and Q3 are irrelevant. Hence, we obtain:

max_{a1,a2,a3} Q1(a1, a2) + Q3(a1, a3) + max_{a4} [Q2(a2, a4) + Q4(a3, a4)].

We see that to choose A4 optimally, the agent must know the values of A2 and A3. In effect, it is computing a conditional strategy, with a (possibly) different action choice for each action choice of agents 2 and 3. Agent 4 can summarize the value that it brings to the system in the different circumstances using a new function f4(A2, A3), whose value at the point a2, a3 is the value of the inner max expression. This new function is a joint value function for agents 2 and 3, summarizing their joint contribution to the total reward under the assumption that agent 4 will act optimally with respect to their choices. Our problem now reduces to computing

max_{a1,a2,a3} Q1(a1, a2) + Q3(a1, a3) + f4(a2, a3),

having one fewer agent.
Next, agent 3 makes its decision, giving

max_{a1,a2} Q1(a1, a2) + f3(a1, a2), where f3(a1, a2) = max_{a3} [Q3(a1, a3) + f4(a2, a3)].

Agent 2 now makes its decision, giving

f2(a1) = max_{a2} [Q1(a1, a2) + f3(a1, a2)],

and agent 1 can now simply choose the action a1 that maximizes f1 = max_{a1} f2(a1). The result at this point is a number, which is the desired maximum over a1, ..., a4. We can recover the maximizing set of actions by performing the entire process in reverse: the maximizing choice for f1 selects the action a1* for agent 1. To fulfill its commitment to agent 1, agent 2 must choose the value a2* which maximizes f2(a1*). This, in turn, forces agent 3 and then agent 4 to select their actions appropriately.

In general, the algorithm maintains a set F of functions, which initially contains {Q1, ..., Qg}. The algorithm then repeats the following steps:

1. Select an uneliminated agent Al.
2. Take all f1, ..., fL ∈ F whose scope contains Al.
3. Define a new function f = max_{al} Σ_j fj and introduce it into F. The scope of f is ∪_{j=1}^{L} Scope[fj] − {Al}.

As above, the maximizing action choices are recovered by sending messages in the reverse direction. We note that this algorithm is essentially a special case of the algorithm used to solve influence diagrams with multiple parallel decisions [8] (as is the one in the next section). However, to our knowledge, these ideas have not yet been applied in the context of reinforcement learning.

The computational cost of this algorithm is linear in the number of new "function values" introduced in the elimination process. More precisely, consider the computation of a new function f whose scope is Z. To compute this function, we need to compute |Dom[Z]| different values. The cost of the algorithm is linear in the overall number of these values introduced throughout the algorithm. As shown in [6], this cost is exponential in the induced width of the coordination graph for the problem.

The algorithm is distributed in the sense that the only communication required is between agents that participate in the interior maximizations described above. There is no need for a direct exchange of information between other agents. Thus, the induced tree width has a natural interpretation in this context: it is the maximum number of agents that will need to directly collaborate on an action choice. The order in which the variables are eliminated has an important impact on the efficiency of this algorithm. We assume that this order is determined a priori and known to all agents.

At this point, we have shown that if the global utility function Q is approximated by the sum of local utilities Qj, then it is possible to use the coordination graph to compute the maximizing joint action efficiently. In the remainder of this paper, we will show how we can learn these local utilities efficiently.
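The variable elimination scheme just described can be sketched compactly. The following listing is our own illustration rather than the authors' implementation: the local Qj's are given as tables over the actions in their scope (already conditioned on the current state), the elimination order is fixed in advance, and the numbers in the four-agent example are arbitrary.

    from itertools import product

    def eliminate(order, action_domains, factors):
        """Maximize a sum of local Q functions by variable elimination.
        order: elimination order over agent names; action_domains: dict name -> iterable of actions;
        factors: list of (scope, table) pairs, where scope is a tuple of agent names and table maps
        value tuples (in scope order) to reals.  Returns (max value, maximizing joint action)."""
        factors = list(factors)
        recovery = []                                    # bookkeeping for the reverse pass
        for agent in order:                              # forward pass: eliminate one agent at a time
            touching = [f for f in factors if agent in f[0]]
            factors = [f for f in factors if agent not in f[0]]
            new_scope = tuple(sorted({v for s, _ in touching for v in s} - {agent}))
            new_table, argmax = {}, {}
            for ctx in product(*(action_domains[v] for v in new_scope)):
                assign = dict(zip(new_scope, ctx))
                scores = []
                for act in action_domains[agent]:        # conditional strategy for this agent
                    full = {**assign, agent: act}
                    scores.append((sum(t[tuple(full[v] for v in s)] for s, t in touching), act))
                new_table[ctx], argmax[ctx] = max(scores)
            factors.append((new_scope, new_table))       # message f passed to the remaining agents
            recovery.append((agent, new_scope, argmax))
        best_value = sum(t[()] for _, t in factors)      # only constant factors remain
        joint = {}
        for agent, scope, argmax in reversed(recovery):  # reverse pass: recover maximizing actions
            joint[agent] = argmax[tuple(joint[v] for v in scope)]
        return best_value, joint

    # Four-agent example from the text: Q = Q1(a1,a2) + Q2(a2,a4) + Q3(a1,a3) + Q4(a3,a4),
    # with binary actions and arbitrary made-up table entries.
    doms = {a: (0, 1) for a in ('a1', 'a2', 'a3', 'a4')}
    Q1 = (('a1', 'a2'), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0})
    Q2 = (('a2', 'a4'), {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 1.5})
    Q3 = (('a1', 'a3'), {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 0.0})
    Q4 = (('a3', 'a4'), {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 1.0})
    print(eliminate(('a4', 'a3', 'a2', 'a1'), doms, [Q1, Q2, Q3, Q4]))

In this example, each intermediate function ranges over at most two other agents' actions, which is exactly the induced-width behavior discussed above.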
The mechanism described above can be used to maximize not just immediate value, but long-term cumulative rewards, i.e., the true expected, discounted value of an action, by using Q-functions that are derived from the value function of an MDP. The extent to which such a scheme will be successful will be determined by our ability to represent the value function in a form that is usable by our action selection mechanism. Before addressing this question, we first review the MDP framework.

3 Markov decision processes

We assume that the underlying control problem is a Markov Decision Process (MDP). An MDP is defined as a 4-tuple (X, A, P, R) where: X is a finite set of states; A is a finite set of actions; P is a Markovian transition model, where P(s, a, s′) represents the probability of going from state s to state s′ with action a; and R is a reward function R : X × A × X → ℝ, such that R(s, a, s′) represents the reward obtained when taking action a in state s and ending up in state s′. For convenience, we will sometimes use r(s, a) = Σ_{s′} P(s, a, s′) R(s, a, s′). We assume that the MDP has an infinite horizon and that future rewards are discounted exponentially with a discount factor γ ∈ [0, 1).

A stationary policy π for an MDP is a mapping π : X → A, where π(x) is the action the agent takes at state x. The optimal value function V* is defined so that the value of a state is the maximal value achievable by any action at that state. More precisely, we define

Q_V(x, a) = r(x, a) + γ Σ_{x′} P(x′ | x, a) V(x′),

and the Bellman operator T* to be

T*V(x) = max_a Q_V(x, a).

The optimal value function V* is the fixed point of T*: V* = T*V*. For any value function V, we can define the policy obtained by acting greedily relative to V. In other words, at each state, we take the action that maximizes the one-step utility, assuming that V represents our long-term utility achieved at the next state. More precisely, we define Greedy(V)(x) = arg max_a Q_V(x, a). The greedy policy relative to the optimal value function V* is the optimal policy π* = Greedy(V*).

Recall that we are interested in computing local utilities Qj for each agent that will represent an approximation to the global utility Q. In [7], we presented an algorithm for computing such value functions for the case where the model is known and represented as a factored MDP. In this paper, we consider the case where the transition and reward models are unknown, i.e., the reinforcement learning case. In the next three sections, we present alternative, and complementary, approaches for coordinating agents in order to learn the local Qj functions.

4 Coordination structure in Q-learning

Q-learning is a standard approach to solving an MDP through reinforcement learning. In Q-learning, the agent directly learns the values of state-action pairs from observations of quadruples of the form (state, action, reward, next-state), which we will henceforth refer to as (x, a, r, x′). For each such quadruple, Q-learning performs the following update:

Q(x, a) ← Q(x, a) + α [r + γV(x′) − Q(x, a)],

where α is the "learning rate" or step size parameter and V(x′) = max_a Q(x′, a). With a suitable decay schedule for the learning rate, a policy that ensures that every state-action pair is experienced infinitely often, and a representation for Q(x, a) that can assign an independent value to every state-action pair, Q-learning will converge to estimates for Q(x, a) that reflect the expected, discounted value of taking action a in state x and proceeding optimally thereafter.

In practice, the formal convergence requirements for Q-learning almost never hold, because the state space is too large to permit an independent representation of the value of every state. Typically, a parametric function approximator such as a neural network is used to represent the Q-function for each action. The following gradient-based update scheme is then used:

w ← w + α [r + γV(x′) − Q(x, a, w)] ∇_w Q(x, a, w),   (1)

where w is a weight vector for our function approximation architecture and, again, the value V(x′) of the next state x′ is given by

V(x′) = max_a Q(x′, a).   (2)

The Q-learning update mechanism is completely generic and requires only that the approximation architecture is differentiable.
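For the special case of a linear architecture, where ∇_w Q(x, a, w) is simply the feature vector φ(x, a), one update of Eqs. (1) and (2) might look as follows (a minimal sketch under our own naming conventions, not code from the paper):

    import numpy as np

    def q_learning_step(w, phi, x, a, r, x_next, actions, alpha=0.1, gamma=0.95):
        """One gradient-style Q-learning update for Q(x, a, w) = w . phi(x, a)."""
        v_next = max(np.dot(w, phi(x_next, b)) for b in actions)   # Eq. (2): V(x') = max_a Q(x', a)
        delta = r + gamma * v_next - np.dot(w, phi(x, a))          # temporal-difference error
        return w + alpha * delta * phi(x, a)                       # Eq. (1): grad_w Q = phi(x, a)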
We are free to choose an architecture that is compatible with our action selection mechanism. Therefore, we can assume that every agent maintains a local Qj-function defined over some subset of the state variables, its own action, and possibly the actions of some other agents. The global Q-function is now a function of the state and of an action vector a, which contains the joint action of all of the agents, and is defined as the sum of these local Qj-functions:

Q(x, a, w) = Σ_{j=1}^{g} Qj(x, a, wj).

The Qj-functions themselves can be maintained locally by each agent as an arbitrary, differentiable function of a set of local weights wj. Each Qj can be defined over the entire state space, or over just some subset of the state variables visible to agent j.

There are some somewhat subtle consequences of this representation. The first is that determining V(x′) in Eq. (2) seems intractable, because it requires a maximization over an exponentially large action space. Fortunately, the Q-function is factored as a linear combination of local Qj functions. Thus, we can apply the coordination graph procedure from Section 2 to obtain the maximum value V(x′) at any given state x′.

Once we have defined the local Qj functions, we must compute the weight update in Eq. (1). Each agent must know

Δ(x, a, r, x′, w) = r + γV(x′) − Q(x, a, w),   (3)

the difference between the current Q-value and the discounted value of the next state. We have just shown that it is possible to apply the coordination graph from Section 2 to compute V(x′). The other unknown terms in Eq. (3) are the reward r and the previous Q-value, Q(x, a, w). The reward is observed, and the Q-value can be computed by a simple message passing scheme similar to the one used in the coordination graph, by fixing the action of every agent to the one assigned in a. Therefore, after the coordination step, each agent will have access to the value of Δ(x, a, r, x′, w). At this point, the weight update equation is entirely local:

wj ← wj + α Δ(x, a, r, x′, w) ∇_{wj} Qj(x, a, wj).

The reason is simply that the gradient decomposes linearly: once an action is selected, there are no cross-terms involving any wi and wj.

The locality of the weight updates in this formulation of Q-learning makes it very attractive for a distributed implementation. Each agent can maintain an entirely local Q-function and does not need to know anything about the structure of the neighboring agents' Q-functions. Different agents can even use different architectures, e.g., one might use a neural network and another might use a CMAC. The only requirement is that the joint Q-function be expressed as the sum of these individual Q-functions.

The only negative aspect of this Q-learning formulation is that, like almost all forms of Q-learning with function approximation, it is difficult to provide any kind of formal convergence guarantee. We have no reason to believe that it would perform any better or worse than general Q-learning with a linear function approximation architecture; however, this remains to be verified experimentally.
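To make the locality of these updates concrete, the following sketch (our own, with hypothetical names) performs one coordinated Q-learning step for linear local Qj's; the brute-force maximization over joint actions stands in for the coordination graph computation of Section 2, which would avoid the exponential enumeration:

    import numpy as np
    from itertools import product

    def coordinated_q_update(agents, x, a, r, x_next, action_domains, alpha=0.1, gamma=0.95):
        """agents: list of dicts, each with a local weight vector 'w' and local features 'phi',
        a callable (state, joint_action) -> np.ndarray that reads only the variables in Scope[Q_j].
        The joint Q-function is the sum of the local linear Q_j's."""
        def joint_q(state, joint_action):
            return sum(np.dot(ag['w'], ag['phi'](state, joint_action)) for ag in agents)

        names = sorted(action_domains)
        joint_actions = [dict(zip(names, vals))
                         for vals in product(*(action_domains[n] for n in names))]
        # V(x'): written as brute force here; in the text it is obtained by variable
        # elimination on the coordination graph, avoiding the exponential enumeration.
        v_next = max(joint_q(x_next, b) for b in joint_actions)
        delta = r + gamma * v_next - joint_q(x, a)       # Eq. (3), shared by all agents
        for ag in agents:                                # entirely local weight updates
            ag['w'] = ag['w'] + alpha * delta * ag['phi'](x, a)

Note that Δ is the only globally shared quantity; each agent then adjusts its own weights using only its local features.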
5 Multiagent LSPI

Least-squares policy iteration (LSPI) [11] is a new reinforcement learning method that performs policy iteration using a stored corpus of samples in place of a model. LSPI represents Q-functions as a linear combination of basis functions. Given a policy πi, LSPI computes a set of Q-functions, Q^πi (in the space spanned by the basis functions), which is a fixed point for πi with respect to the samples. The new Q^πi then implicitly defines a policy π(i+1), and the process is repeated until some form of convergence is achieved.

We briefly review the mathematical operations required for LSPI here; full details are available in [10]. We assume that our Q-functions are approximated by a linear combination of k basis functions (features):

Q̂(x, a) = Σ_{i=1}^{k} φi(x, a) wi.

For convenience, we express our basis functions in matrix form: Φ is the (|X||A| × k) matrix whose rows are the vectors φ(x, a)ᵀ. If we knew the transition matrix P^π for the current policy and knew the reward function, we could, in principle, compute the fixed point by defining and solving the following system:

A w^π = b, where A = Φᵀ(Φ − γ P^π Φ) and b = Φᵀ R.

In practice, we must use sampled experience with the environment in place of R and P^π. Given a set of samples D = {(x_di, a_di, r_di, x′_di) | i = 1, 2, ..., L}, we can construct approximate versions of Φ, P^πΦ, and R by stacking, for each sample, the row φ(x_di, a_di)ᵀ, the row φ(x′_di, π(x′_di))ᵀ, and the reward r_di, respectively; these yield approximations Ã and b̃ of A and b. It is easy to see that approximations (Ã1, b̃1) and (Ã2, b̃2) derived from different sets of samples (D1, D2) can be combined additively to yield a better approximation that corresponds to the combined set of samples: Ã = Ã1 + Ã2 and b̃ = b̃1 + b̃2. This observation leads to an incremental update rule for Ã and b̃. Assume that initially Ã = 0 and b̃ = 0. For a fixed policy π, a new sample (x, a, r, x′) contributes to the approximation according to the following updates:

Ã ← Ã + φ(x, a) [φ(x, a) − γ φ(x′, π(x′))]ᵀ  and  b̃ ← b̃ + φ(x, a) r.

We note that this approach is very similar to the LSTD algorithm [5]. Unlike LSTD, which defines a system of equations relating state values to state values, LSPI is defined over Q-values. Each iteration of LSPI yields the Q-values for the current policy. Thus, each solution implicitly defines the next policy for policy iteration. An important feature of LSPI is that it is able to reuse the same set of samples even as the policy changes. For example, suppose the corpus contains a transition from state x to state x′ under action a1 and πi(x′) = a2. This is entered into the Ã matrix as if a transition were made from Q(x, a1) to Q(x′, a2). If π(i+1) changes the action for x′ from a2 to a3, then the next iteration of LSPI enters a transition from Q(x, a1) to Q(x′, a3) into the Ã matrix. The sample can be reused because the dynamics of state x under action a1 have not changed.

The application of collaborative action selection to the LSPI framework is surprisingly straightforward. We first note that, since LSPI is a linear method, any set of Q-functions produced by LSPI will, by construction, be of the right form for collaborative action selection. Each agent is assigned a local set of basis functions which define its local Q-function. These basis functions can be defined over the agent's own actions as well as the actions of a small number of other agents. As with ordinary LSPI, the current policy πi is defined implicitly by the current set of Q-functions, Q^πi. However, in the multiagent case, we cannot enumerate every possible action to determine the policy at a given state, because the set of joint actions is exponential in the number of agents. Fortunately, we can again exploit the structure of the coordination graph to determine the optimal actions relative to Q^πi: for each transition from state x to state x′ under joint action a, the coordination graph is used to determine the optimal joint action for x′, yielding action a′. The transition is then added to the Ã matrix as a transition from Q(x, a) to Q(x′, a′).
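A compact sketch of one multiagent LSPI iteration under these conventions is given below (our own illustration, not the authors' code). Here φ(x, a) stacks all agents' local basis functions into a single length-k vector, and the brute-force argmax over joint actions is the step that the coordination graph replaces:

    import numpy as np
    from itertools import product

    def lspi_iteration(samples, phi, w, action_domains, k, gamma=0.95):
        """samples: iterable of (x, a, r, x_next); phi(x, a) -> length-k feature vector;
        w: weights of the current policy's Q-function.  Returns the weights of the Q-function
        of the policy that is greedy with respect to w."""
        names = sorted(action_domains)
        joint_actions = [dict(zip(names, vals))
                         for vals in product(*(action_domains[n] for n in names))]

        def greedy(x):   # a' = argmax_a phi(x, a) . w; done by the coordination graph in the text
            return max(joint_actions, key=lambda act: np.dot(phi(x, act), w))

        A = np.zeros((k, k))
        b = np.zeros(k)
        for (x, a, r, x_next) in samples:                # incremental construction of A~ and b~
            f, f_next = phi(x, a), phi(x_next, greedy(x_next))
            A += np.outer(f, f - gamma * f_next)
            b += r * f
        return np.linalg.lstsq(A, b, rcond=None)[0]      # solve A~ w = b~ (least squares for safety)

Iterating this function, feeding the returned weights back in as w, implements the policy iteration loop; the same sample set is reused at every iteration.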
An advantage of the LSPI approach to collaborative action selection is that it computes a value function for each successive policy which has a coherent interpretation as a projection into the linear space spanned by the individual agents' local Q-functions. Thus, there is reason to believe that the resulting Q-functions will be well suited to collaborative action selection using a coordination graph mechanism. A disadvantage of the LSPI approach is that it is not currently amenable to a distributed implementation during the learning phase: construction of the Ã matrix requires knowledge of the evaluation of each agent's basis functions for every state in the corpus, not only for every action that is actually taken, but for every action recommended by every policy considered by LSPI.

6 Coordination in direct policy search

Value function based reinforcement learning methods have recently come under some criticism as being unstable and difficult to use in practice. A function approximation architecture that is not well suited to the problem can diverge or produce poor results, with little meaningful feedback that is directly useful for modifying the function approximator to achieve better performance. LSPI was designed to address some of the concerns with Q-learning based value function approximation. It is more stable than Q-learning and, since it is a linear method, it is somewhat easier to debug. However, LSPI is still an approximate policy iteration procedure and can be quite sensitive to small errors in the estimated Q-values for policies [4]. In practice, LSPI can take large, coarse steps in policy space.

The shortcomings of value function based methods have led to a surge of interest in direct policy search methods [13; 12; 1; 15; 9]. These methods use gradient ascent to search a space of parameterized stochastic policies. As with all gradient ascent methods, local optima can be problematic. Defining a relatively smooth but expressive policy space and finding reasonable starting points within this space are important elements of any successful application of gradient ascent.

We now show how to seed a gradient ascent procedure with a multiagent policy generated by Q-learning or LSPI as described above. To guarantee that the gradient exists, policy search methods require stochastic policies. Our first task is to convert the deterministic policy implied by our Q-functions into a stochastic policy ρ(a | x), i.e., a distribution over actions given the state. A natural way to do this, which also turns out to be compatible with most policy search methods, is to create a softmax over the Q-values:

ρ(a | x) = e^{Σ_j Qj(x, a)} / Σ_b e^{Σ_j Qj(x, b)}.   (4)

To be able to apply policy search methods to such a policy representation, we must present two additional steps: the first is a method for efficiently generating samples from our stochastic policy, and the second is a method for efficiently differentiating our stochastic policy for gradient ascent purposes.

Sampling from our stochastic policy may appear problematic because of the size of the joint action space. For sampling purposes, we can ignore the denominator, since it is the same for all actions, and sample from the numerator directly as an unnormalized potential function. To do this sampling, we again use variable elimination on a coordination graph with exactly the same structure as the one in Section 2. Conditioning on the current state x is again easy: each agent needs to observe only the variables in Observable[Qj] and instantiate Qj appropriately.
At this point, we need to generate a sample from the Qj functions, which now depend only on the action choice. Following our earlier example, our task is to sample from the potential corresponding to the numerator of ρ(a | x). Suppose, for example, that the individual agents' Q-functions have the following form:

Q = Q1(a1, a2) + Q2(a2, a4) + Q3(a1, a3) + Q4(a3, a4),

and we wish to sample from the potential function

e^{Q1(a1,a2)} e^{Q2(a2,a4)} e^{Q3(a1,a3)} e^{Q4(a3,a4)}.

To sample actions one at a time, we follow a strategy of marginalizing out actions until we are left with a potential over a single action. We then sample from this potential and propagate the results backwards to sample actions for the remaining agents. Suppose we begin by eliminating a4. Agent 4 can summarize its impact on the rest of the distribution by combining its potential function with that of agent 2 and defining a new potential:

f4(A2, A3) = Σ_{a4} e^{Q2(a2,a4)} e^{Q4(a3,a4)}.

The problem now reduces to sampling from

e^{Q1(a1,a2)} e^{Q3(a1,a3)} f4(a2, a3),

having one fewer agent. Next, agent 3 communicates its contribution, giving e^{Q1(a1,a2)} f3(a1, a2), where f3(a1, a2) = Σ_{a3} e^{Q3(a1,a3)} f4(a2, a3). Agent 2 now communicates its contribution, giving f2(a1) = Σ_{a2} e^{Q1(a1,a2)} f3(a1, a2), and agent 1 can now sample actions from the potential P(a1) ∝ f2(a1). We can then sample actions for the remaining agents by reversing the direction of the messages and sampling from the distribution for each agent, conditioned on the choices of the previous agents. For example, when agent 2 is informed of the action selected by agent 1, agent 2 can sample actions from the distribution:

P(a2 | a1) = P(a2, a1) / P(a1) = e^{Q1(a1,a2)} f3(a1, a2) / f2(a1).

The general algorithm has the same message passing topology as the original action selection mechanism. The only difference is the content of the messages: the forward pass messages are probability potentials, and the backward pass messages are used to compute the conditional distributions from which actions are sampled.

The next key operation is the computation of the gradient of our stochastic policy function, a key operation in a REINFORCE-style [16] policy search algorithm (most policy search algorithms are of this style). First, recall that we are using a linear representation for each local Qj function:

Qj(x, a) = Σ_i w_{i,j} φ_{i,j}(x, a),

where each basis function φ_{i,j}(x, a) can depend on a subset of the state and action variables. Our stochastic policy representation now becomes:

ρ(a | x) = e^{Σ_{i,j} w_{i,j} φ_{i,j}(x,a)} / Σ_b e^{Σ_{i,j} w_{i,j} φ_{i,j}(x,b)}.

To simplify our notation, we will refer to the weights as w_i rather than w_{i,j} and to the basis functions as φ_i rather than φ_{i,j}. Finally, we need to compute the gradient of the log policy:

∂/∂w_i ln ρ(a | x) = ∂/∂w_i [ Σ_k w_k φ_k(x, a) − ln Σ_b e^{Σ_k w_k φ_k(x,b)} ]
                   = φ_i(x, a) − [ Σ_b φ_i(x, b) e^{Σ_k w_k φ_k(x,b)} ] / [ Σ_{b′} e^{Σ_k w_k φ_k(x,b′)} ].

We note that both the numerator and the denominator in this expression can be determined by a variable elimination procedure similar to the stochastic action selection procedure.

These action selection and gradient computation mechanisms provide the basic functions required for essentially any policy search method. As in the case of Q-learning, a global error signal must be shared by the entire set of agents. Apart from this, the gradient computations and the stochastic policy sampling procedure involve a message passing scheme with the same topology as the action selection mechanism. We believe that these methods can be incorporated into any of a number of policy search methods to fine-tune a policy derived by Q-learning with linear Q-functions or by LSPI.
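The two operations needed for policy search, drawing a joint action from ρ(a | x) and computing ∇_w ln ρ(a | x), can be sketched as follows (our own illustration with hypothetical names; the explicit enumeration over joint actions below is exactly what the message passing scheme above avoids):

    import numpy as np
    from itertools import product

    def sample_and_log_gradient(x, phi, w, action_domains, rng=None):
        """phi(x, a) -> feature vector (all agents' basis functions stacked); w: weight vector.
        Returns a joint action sampled from the softmax policy of Eq. (4) and the gradient
        of ln rho(a | x) with respect to w."""
        rng = rng or np.random.default_rng()
        names = sorted(action_domains)
        joint_actions = [dict(zip(names, vals))
                         for vals in product(*(action_domains[n] for n in names))]
        feats = np.array([phi(x, b) for b in joint_actions])   # one row per joint action b
        logits = feats @ w                                     # sum_i w_i phi_i(x, b)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                                   # rho(b | x), Eq. (4)
        idx = rng.choice(len(joint_actions), p=probs)
        # d/dw_i ln rho(a | x) = phi_i(x, a) - sum_b rho(b | x) phi_i(x, b)
        grad = feats[idx] - probs @ feats
        return joint_actions[idx], grad

In a REINFORCE-style method, this gradient is scaled by the observed return and accumulated into w; the message passing version computes the same sample and gradient without enumerating the joint action space.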
7 Experimental results

In this section we report results of applying our Coordinated RL approach with LSPI to the SysAdmin multiagent domain [7]. The SysAdmin problem consists of a network of n machines connected in one of the following topologies: chain, ring, star, ring-of-rings, or star-and-ring. The state of each machine j is described by two variables: status Sj ∈ {good, faulty, dead} and load Lj ∈ {idle, loaded, process successful}. New jobs arrive with probability 0.5 on idle machines and make them loaded. A loaded machine in good status can execute and successfully complete a job with probability 0.5 at each time step. A faulty machine can also execute jobs, but it will take longer to terminate (jobs end with probability 0.25). A dead machine is not able to execute jobs and remains dead until it is rebooted. Each machine receives a reward of +1 for each job completed successfully; otherwise it receives a reward of 0. Machines fail stochastically and switch status from good to faulty and eventually to dead, but they are also influenced by their neighbors: a dead machine increases the probability that its neighbors will become faulty and eventually die. Each machine is also associated with an agent Aj, which at each step has to choose between rebooting the machine or not. Rebooting a machine makes its status good independently of the current status, but any running job is lost. These agents have to coordinate their actions so as to maximize the total reward for the system, or, in other words, maximize the total number of successfully completed jobs in the system. Initially, all machines are in "good" status and the load is set to "process successful". The discount factor γ is set to 0.95.

The SysAdmin problem has been studied in [7], where the model of the process is assumed to be available as a factored MDP. This is a fairly large MDP, with 9^n possible states and 2^n possible actions for a network of n machines; a direct approach to solving such an MDP will fail even for small values of n. In [7], the state value function is approximated as a linear combination of indicator basis functions, and the coefficients are computed using a linear programming (LP) approach. The derived policies are very close to the theoretical optimum and significantly better than the policies learned by the Distributed Reward (DR) and Distributed Value Function (DVF) algorithms [14]. We use the SysAdmin problem to evaluate our Coordinated RL approach, despite the availability of a factored model, because it provides a baseline for comparing the model-based approach with Coordinated RL. Note that conventional RL methods would fail here because of the exponential action space.

We applied the Coordinated RL approach with LSPI in these experiments. To make a fair comparison with the results obtained by the LP approach, we had to make sure that the sets of basis functions used in each case are comparable. The LP approach makes use of "single" or "pair" indicator functions to approximate the state value function V(x) and forms Q-values using the model and the Bellman equation. It then solves a linear program to find the coefficients for the approximation of V(x) under a set of constraints imposed by the Q-values. On the other hand, LSPI uses a set of basis functions to approximate the state-action value function Q(x, a). To make sure the two sets are comparable, we have to choose basis functions for LSPI that span the space of all possible Q-functions that the LP approach can form.
To do this, we backprojected the LP method's basis functions through the transition model [7], added the instantaneous reward, and used the resulting functions as our basis for LSPI. We note that this set of basis functions potentially spans a larger space than the Q-functions considered by the LP approach. This is because the LP approach constrains its Q-functions by a lower-dimensional V-function, while LSPI is free to consider the entire space spanned by its Q-function basis. While there are some slight differences, we believe the bases are close enough to permit a fair comparison. In our experiments, we created two sets of LSPI basis functions corresponding to the backprojections of the "single" and "pair" indicator functions from [7].

[Figure 2: Results on star networks (LSPI compared with LP, DR, DVF, and the utopic maximum value, plotted against the number of samples).]

The first set of experiments involved a star topology with up to 14 legs, yielding a range from 3 to 15 agents. For n machines, we found experimentally that about 600n samples are sufficient for LSPI to learn a good policy. The samples were collected by starting at the initial state and following a purely random policy. To avoid biasing our samples too heavily toward the stationary distribution of the random policy, each episode was truncated at 15 steps; thus, samples were collected from 40n episodes, each 15 steps long. The resulting policies were evaluated with a Monte Carlo approach by averaging 20 runs of 100 steps each. The entire experiment was repeated 10 times with different sample sets and the results were averaged. Figure 2 shows the results obtained by LSPI compared with the results of LP, DR, and DVF as reported in [7] for the "single" basis functions. The error bars for LSPI correspond to the averages over the different sample sets used. The "pair" basis functions produced similar results for LSPI and slightly better results for LP; they are not reported here due to space limitations.

The second set of experiments involved a ring-of-rings topology with up to 6 small rings (each of size 3) forming a larger ring, so the total number of agents ranged from 6 to 18. All learning parameters were the same as in the star case. Figure 3 compares the results of LSPI with those of LP, DR, and DVF for the "single" basis functions. The results in both cases clearly indicate that LSPI learns very good policies, comparable to those of the LP approach using the same basis functions, but without any use of the model. It is worth noting that the number of samples used in each case grows linearly with the number of agents, whereas the joint state-action space grows exponentially.
[Figure 3: Results on ring-of-rings networks.]

8 Conclusions and future work

In this paper, we proposed a new approach to reinforcement learning: Coordinated RL. In this approach, agents make coordinated decisions and share information to achieve a principled learning strategy. Our method successfully incorporates the cooperative action selection mechanism derived in [7] into the reinforcement learning framework to allow for structured communication between agents, each of which has only partial access to the state description. We presented three instances of Coordinated RL, extending three RL frameworks: Q-learning, LSPI and policy search. With Q-learning and policy search, the learning mechanism can be distributed: agents communicate reinforcement signals, utility values and conditional policies. In LSPI, on the other hand, some centralized coordination is required to compute the projection of the value function. In all cases, the resulting policies can be executed in a distributed manner. In our view, an algorithm such as LSPI can provide a batch offline estimate of the Q-functions; subsequently, Q-learning or direct policy search can be applied online to refine this estimate. By using our Coordinated RL method, we can smoothly shift between these two phases.

Our initial empirical results demonstrate that reinforcement learning methods can learn good policies using the type of linear architecture required for cooperative action selection. In our experiments with LSPI, we reliably learned policies that were comparable to the best policies achieved by other methods and close to the theoretical optimum achievable in our test domains. The amount of data required to learn good policies scaled linearly with the number of state and action variables, even though the state and action spaces were growing exponentially.

Our initial experiments involved models with discrete state and action spaces that could be described compactly using a dynamic Bayesian network. These were chosen primarily to compare learning performance with previous closed-form approximation methods, and our basis functions were chosen to match closely the basis functions used in earlier experiments with the same models. In future work, we plan to explore problems that involve continuous variables and basis functions. While our methods do require discrete actions, they generalize immediately to continuous state variables.

Guestrin et al. [7] presented the coordination graph method for action selection used in this paper, along with an algorithm for finding an approximation to the Q-function using local Qj functions for problems where the MDP model can be represented compactly as a factored MDP. The Multiagent LSPI algorithm presented in this paper can be thought of as a model-free reinforcement learning approach for generating the local Qj approximation. The factored MDP approach [7] is able to exploit the model and compute its approximation very efficiently. On the other hand, Multiagent LSPI is more general, as it can be applied to problems that cannot be represented compactly by a factored MDP.

Acknowledgments

We are very grateful to Daphne Koller for many useful discussions. This work was supported by the ONR under the MURI program "Decision Making Under Uncertainty" and by the Sloan Foundation. The first author was also supported by a Siebel Scholarship.

References

[1] J. Baxter and P. Bartlett. Reinforcement learning in POMDPs via direct gradient ascent. In Proc. Int. Conf. on Machine Learning (ICML), 2000.
[2] D. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of Markov decision processes. In Proc. of Conf. on Uncertainty in AI (UAI-00), 2000.
[3] U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, New York, 1972.
[4] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
[5] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 22(1):33-58, January 1996.
[6] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1-2):41-85, 1999.
[7] C. Guestrin, D. Koller, and R. Parr. Multiagent planning with factored MDPs. In 14th Neural Information Processing Systems (NIPS-14), 2001.
[8] F. Jensen, F. V. Jensen, and S. Dittmer. From influence diagrams to junction trees. In UAI-94, 1994.
[9] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In NIPS-12, 2000.
[10] M. G. Lagoudakis and R. Parr. Model-free least-squares policy iteration. Technical Report CS-2001-05, Department of Computer Science, Duke University, December 2001.
[11] M. Lagoudakis and R. Parr. Model free least squares policy iteration. In NIPS-14, 2001.
[12] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In UAI-00, 2000.
[13] A. Ng, R. Parr, and D. Koller. Policy search via density estimation. In NIPS-12, 2000.
[14] J. Schneider, W. Wong, A. Moore, and M. Riedmiller. Distributed value functions. In 16th ICML, 1999.
[15] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS-12, 2000.
[16] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
[17] D. Wolpert, K. Wheeler, and K. Tumer. General principles of learning-based multi-agent systems. In Proc. of 3rd Int. Conf. on Autonomous Agents (Agents '99), 1999.