
From: AAAI Technical Report SS-02-02. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.
Coordinated Reinforcement Learning
Carlos Guestrin
Computer Science Dept.
Stanford University
guestrin@cs.stanford.edu

Michail Lagoudakis
Computer Science Dept.
Duke University
mgl@cs.duke.edu

Ronald Parr
Computer Science Dept.
Duke University
parr@cs.duke.edu
Abstract
We present several new algorithms for multiagent reinforcement learning. A common feature of these algorithms is a parameterized, structured representation of a policy or value function. This structure is leveraged in an approach we call coordinated reinforcement learning, by which agents coordinate both their action selection activities and their parameter updates. Within the limits of our parametric representations, the agents will determine a jointly optimal action without explicitly considering every possible action in their exponentially large joint action space. Our methods differ from many previous reinforcement learning approaches to multiagent coordination in that structured communication and coordination between agents appears at the core of both the learning algorithm and the execution architecture.
1 Introduction
Consider a system where multiple agents, each with its own set of possible actions and its own observations, must coordinate in order to achieve a common goal. We want to find a mechanism for coordinating the agents' actions so as to maximize their joint utility. One obvious approach to this problem is to represent the system as a Markov decision process (MDP), where the "action" is a joint action for all of the agents and the reward is the total reward for all of the agents. The immediate difficulty with this approach is that the action space is quite large: if there are g agents, each of which can take a actions, then the action space is a^g.
One natural approach to reducing the complexity of this problem is to restrict the amount of information that is available to each agent and hope to maximize global welfare by solving local optimization problems for each agent [14]. In some cases, it is possible to manipulate the presentation of information to the agents in a manner that forces local optimizations to imply global optimizations [17]. In general, however, the problem of finding a globally optimal solution for agents with partial information is known to be intractable [2].
Following [7] we present an approach that combines value function approximation with a message passing scheme by which the agents efficiently determine the jointly optimal action with respect to an approximate value function. Our approach is based on approximating the joint value function as a linear combination of local value functions, each of which
relates only to the parts of the system controlled by a small number of agents. We show how such factored value functions allow the agents to find a globally optimal joint action using a very natural message passing scheme. This scheme can be implemented as a negotiation procedure for selecting actions at run time. Alternatively, if the agents share a common observation vector, each agent can efficiently determine the actions that will be taken by all of the collaborating agents without any additional communication.
Given an action selection mechanism, the remaining task is to develop a reinforcement learning algorithm that is capable of producing value functions of the appropriate form. An algorithm for computing such value functions is presented in [7] for the case where the model is known and represented as a factored MDP. This is the first application of these techniques in the context of reinforcement learning, where we no longer require a factored model or even a discrete state space.
We begin by presenting two methods of computing an appropriate value function through reinforcement learning, a variant of Q-learning and a variant of model-free policy iteration called Least Squares Policy Iteration (LSPI) [11]. We also demonstrate how parameterized value functions of the form acquired by our reinforcement learning variants can be combined in a very natural way with direct policy search methods such as [13; 12; 1; 15; 9]. The same communication and coordination structures used in the value function approximation phase are used in the policy search phase to sample from and update a factored stochastic policy function. We call our overall approach Coordinated Reinforcement Learning because structured coordination between agents is used in the core of our learning algorithms and in our execution architectures. Our experimental results indicate that, in the case of LSPI, the message passing action selection mechanism and value function approximation can be combined to produce effective policies.
2 Cooperative Action Selection
We begin by considering the simpler problem of having a group of agents select a globally optimal joint action in order to maximize the sum of their individual utility functions. Suppose we have a collection of agents, where each agent j must choose an action a_j from a finite set of possible actions Dom(A_j). We use A to denote {A_1, ..., A_g}. The agents are acting in a space described by a set of state variables, X = {X_1, ..., X_n}, where each X_j takes on values in some domain Dom(X_j). A state x defines a setting x_j ∈ Dom(X_j) for each variable X_j and an action a defines an action a_j ∈ Dom(A_j) for each agent. The agents must choose the joint action a that maximizes the total utility.
In general, the total utility Q will depend on all state variables X and on the actions of all agents A. However, in many practical problems, it is possible to approximate the total utility Q by the sum of local sub-utilities Q_j, one for each agent. Now, the total utility becomes Q = Σ_j Q_j. For example, consider the decision process of a section manager in a warehouse. Her local utility Q_j may depend on the state of the inventory of her section, on her decision of which products to stock up and on the decision of the sales manager over pricing and special offers. On the other hand, it may not depend directly on the actions of the customer support team. However, the decisions of the support team will be relevant, as they may affect the actions of the sales manager.
Computing the action that maximizes Q = Σ_j Q_j in such problems seems intractable a priori, as it would require the enumeration of the joint action space of all agents. Fortunately, by exploiting the local structure in the Q_j functions through a coordination graph we can compute the optimal action very efficiently, with limited communication between agents and limited observability, as proposed in [7]. We repeat the construction here as it will be important throughout this paper.
In our framework, each agent j has a local utility function Q_j. An agent's local Q function might be influenced by a subset of the state variables, the agent's action and those of some other agents; we define Scope[Q_j] ⊆ X ∪ A to be the set of state variables and agents that influence Q_j. (We use Q_j(x, a) to denote the value of Q_j applied to the instantiation of the variables in Scope[Q_j] within x, a.) The scope of Q_j can be further divided into two parts: the observable state variables,

Observable[Q_j] = {X_i ∈ X | X_i ∈ Scope[Q_j]},

and the relevant agent decision variables,

Relevant[Q_j] = {A_i ∈ A | A_i ∈ Scope[Q_j]}.
This distinction will allow us to characterize the observations each agent needs to make and the type of communication needed to obtain the jointly optimal action, i.e., the joint action choice that maximizes the total utility Q = Σ_j Q_j. We note that each Q_j may be further decomposed as a linear combination of functions that involve fewer variables; in this case, the complexity of the algorithm may be further reduced.
Recall that our task is to find a coordination strategy for the agents to maximize Σ_j Q_j at state x. First, note that the scope of the Q_j functions that comprise the value can include both action choices and state variables. We assume that the agents have full observability of the relevant state variables, i.e., agent j can observe Observable[Q_j]. Given a particular state x = {x_1, ..., x_n}, agent j can instantiate the part of Q_j that depends on the state x, i.e., condition Q_j on state x. Note that each agent only needs to observe the variables in Observable[Q_j], thereby decreasing considerably the amount of information each agent needs to gather.

After conditioning on the current state, each Q_j will only depend on the agent's action choice A_j. Our task is now to
select a joint action a that maximizes Σ_j Q_j(a). The fact that the Q_j depend on the actions of multiple agents forces the agents to coordinate their action choices. We can represent the coordination requirements of the system using a coordination graph, where there is a node for each agent and an edge between two agents if they must directly coordinate their actions to optimize some particular Q_j. Fig. 1 shows the coordination graph for an example where

Q = Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4).

Figure 1: Coordination graph for a 4-agent problem.
A graph structure suggests the use of a cost network [6], which can be solved using non-serial dynamic programming [3] or a variable elimination algorithm which is virtually identical to variable elimination in a Bayesian network. We review this construction here, as it is a key component in the rest of the paper.
The key idea is that, rather than summing all functions and then doing the maximization, we maximize over variables one at a time. When maximizing over a_l, only summands involving a_l participate in the maximization. In our example, we wish to compute:

max_{a_1, a_2, a_3, a_4} Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4).

Let us begin our optimization with agent 4. To optimize A_4, functions Q_1 and Q_3 are irrelevant. Hence, we obtain:

max_{a_1, a_2, a_3} Q_1(a_1, a_2) + Q_3(a_1, a_3) + max_{a_4} [Q_2(a_2, a_4) + Q_4(a_3, a_4)].
We see that to choose A_4 optimally, the agent must know the values of A_2 and A_3. In effect, it is computing a conditional strategy, with a (possibly) different action choice for each action choice of agents 2 and 3. Agent 4 can summarize the value that it brings to the system in the different circumstances using a new function f_4(A_2, A_3) whose value at the point a_2, a_3 is the value of the internal max expression. This new function is a new joint value function for agents 2 and 3, summarizing their joint contribution to the total reward under the assumption that agent 4 will act optimally with respect to their choices.
Our problem now reduces to computing

max_{a_1, a_2, a_3} Q_1(a_1, a_2) + Q_3(a_1, a_3) + f_4(a_2, a_3),

having one fewer agent. Next, agent 3 makes its decision, giving:

max_{a_1, a_2} Q_1(a_1, a_2) + f_3(a_1, a_2),

where f_3(a_1, a_2) = max_{a_3} [Q_3(a_1, a_3) + f_4(a_2, a_3)]. Agent 2 now makes its decision, giving

f_2(a_1) = max_{a_2} Q_1(a_1, a_2) + f_3(a_1, a_2),

and agent 1 can now simply choose the action a_1 that maximizes f_1 = max_{a_1} f_2(a_1). The result at this point is a number, which is the desired maximum over a_1, ..., a_4.
We can recover the maximizing set of actions by performing the entire process in reverse: the maximizing choice for f_1 selects the action a_1* for agent 1. To fulfill its commitment to agent 1, agent 2 must choose the value a_2* which maximizes f_2(a_1*). This, in turn, forces agent 3 and then agent 4 to select their actions appropriately.
In general, the algorithm maintains a set F of functions, which initially contains {Q_1, ..., Q_g}. The algorithm then repeats the following steps:
1. Select an uneliminated agent A_l.
2. Take all f_1, ..., f_L ∈ F whose scope contains A_l.
3. Define a new function f = max_{a_l} Σ_j f_j and introduce it into F. The scope of f is ∪_{j=1}^L Scope[f_j] − {A_l}.
As above, the maximizing action choices are recovered by sending messages in the reverse direction. We note that this algorithm is essentially a special case of the algorithm used to solve influence diagrams with multiple parallel decisions [8] (as is the one in the next section). However, to our knowledge, these ideas have not yet been applied in the context of reinforcement learning.
The computational cost of this algorithm is linear in the number of new "function values" introduced in the elimination process. More precisely, consider the computation of a new function f whose scope is Z. To compute this function, we need to compute |Dom[Z]| different values. The cost of the algorithm is linear in the overall number of these values, introduced throughout the algorithm. As shown in [6], this cost is exponential in the induced width of the coordination graph for the problem. The algorithm is distributed in the sense that the only communication required is between agents that participate in the interior maximizations described above. There is no need for a direct exchange of information between other agents. Thus, the induced tree width has a natural interpretation in this context: it is the maximum number of agents who will need to directly collaborate on the action choice. The order in which the variables are eliminated will have an important impact on the efficiency of this algorithm. We assume that this is determined a priori and known to all agents.
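To make the procedure above concrete, the following is a minimal sketch of the forward (elimination) and reverse (action recovery) passes for small action domains, written in Python. The factor representation, function names, and the example tables are illustrative rather than taken from the paper; each Q_j is assumed to have already been conditioned on the current state x.

# Minimal sketch of coordination-graph variable elimination for a max.
# Factors are (scope, table) pairs: scope is a tuple of agent ids and the
# table maps tuples of actions (in scope order) to values.
from itertools import product

def eliminate_max(factors, domains, order):
    """Return (max value, maximizing joint action) for sum of the factors."""
    work = list(factors)
    backpointers = []                      # one conditional strategy per agent
    for agent in order:
        involved = [f for f in work if agent in f[0]]
        work = [f for f in work if agent not in f[0]]
        # new scope: every other agent mentioned by the involved factors
        new_scope = tuple(sorted({v for s, _ in involved for v in s} - {agent}))
        new_table, best_action = {}, {}
        for ctx in product(*(domains[v] for v in new_scope)):
            assign = dict(zip(new_scope, ctx))
            best_val, best_a = None, None
            for a in domains[agent]:
                assign[agent] = a
                val = sum(t[tuple(assign[v] for v in s)] for s, t in involved)
                if best_val is None or val > best_val:
                    best_val, best_a = val, a
            new_table[ctx], best_action[ctx] = best_val, best_a
        work.append((new_scope, new_table))
        backpointers.append((agent, new_scope, best_action))
    # reverse pass: recover the maximizing joint action
    joint = {}
    for agent, scope, best_action in reversed(backpointers):
        joint[agent] = best_action[tuple(joint[v] for v in scope)]
    max_value = sum(t[()] for _, t in work)   # remaining factors are constants
    return max_value, joint

# The 4-agent example: Q = Q1(a1,a2) + Q2(a2,a4) + Q3(a1,a3) + Q4(a3,a4),
# here with illustrative "agreement" tables over binary actions.
doms = {1: [0, 1], 2: [0, 1], 3: [0, 1], 4: [0, 1]}
table = {(x, y): float(x == y) for x in [0, 1] for y in [0, 1]}
factors = [((1, 2), dict(table)), ((2, 4), dict(table)),
           ((1, 3), dict(table)), ((3, 4), dict(table))]
print(eliminate_max(factors, doms, order=[4, 3, 2, 1]))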
At this point, we have shown that if the global utility function Q is approximated by the sum of local utilities Q_j, then it is possible to use the coordination graph to compute the maximizing joint action efficiently. In the remainder of this paper, we will show how we can learn these local utilities efficiently.

The mechanism described above can be used to maximize not just immediate value, but long-term cumulative rewards, i.e., the true expected, discounted value of an action, by using Q-functions that are derived from the value function of an MDP. The extent to which such a scheme will be successful will be determined by our ability to represent the value function in a form that is usable by our action selection mechanism. Before addressing this question, we first review the MDP framework.
3 Markov decision processes
We assume that the underlying control problem is a Markov Decision Process (MDP). An MDP is defined as a 4-tuple (X, A, P, R) where: X is a finite set of states; A is a finite set of actions; P is a Markovian transition model where P(x, a, x') represents the probability of going from state x to state x' with action a; and R is a reward function R : X × A × X → ℝ, such that R(x, a, x') represents the reward obtained when taking action a in state x and ending up in state x'. For convenience, we will sometimes use R(x, a) = Σ_{x'} P(x, a, x') R(x, a, x').
We will be assuming that the MDP has an infinite horizon and that future rewards are discounted exponentially with a discount factor γ ∈ [0, 1). A stationary policy π for an MDP is a mapping π : X → A, where π(x) is the action the agent takes at state x. The optimal value function V* is defined so that the value of a state must be the maximal value achievable by any action at that state. More precisely, we define:

Q_V(x, a) = R(x, a) + γ Σ_{x'} P(x' | x, a) V(x');

and the Bellman operator T* to be:

T*V(x) = max_a Q_V(x, a).

The optimal value function V* is the fixed point of T*: V* = T*V*.
For any value function V, we can define the policy obtained by acting greedily relative to V. In other words, at each state, we take the action that maximizes the one-step utility, assuming that V represents our long-term utility achieved at the next state. More precisely, we define Greedy(V)(x) = arg max_a Q_V(x, a). The greedy policy relative to the optimal value function V* is the optimal policy π* = Greedy(V*).
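As a small numerical illustration of Q_V, the Bellman operator, and Greedy(V), the sketch below runs value iteration on a randomly generated tabular MDP; all sizes and numbers are made up purely for illustration.

# Tabular value iteration: repeatedly apply T* and then act greedily.
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.95
P = np.random.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[x, a, x']
R = np.random.rand(n_states, n_actions)                                 # R(x, a)
V = np.zeros(n_states)

for _ in range(200):                      # V <- T* V
    Q_V = R + gamma * P @ V               # Q_V(x,a) = R(x,a) + gamma * sum_x' P(x'|x,a) V(x')
    V = Q_V.max(axis=1)

greedy_policy = Q_V.argmax(axis=1)        # Greedy(V)(x) = argmax_a Q_V(x, a)
print(V, greedy_policy)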
Recall that we are interested in computing local utilities Q_j for each agent that will represent an approximation to the global utility Q. In [7], we presented an algorithm for computing such value functions for the case where the model is known and represented as a factored MDP. In this paper, we consider the case of unknown action and transition models, i.e., the reinforcement learning case. In the next three sections, we will present alternative, and complementary, approaches for coordinating agents in order to learn the local Q_j functions.
4 Coordination structure in Q-learning
Q-learning is a standard approach to solving an MDP through reinforcement learning. In Q-learning the agent directly learns the values of state-action pairs from observations of quadruples of the form (state, action, reward, next-state), which we will henceforth refer to as (x, a, r, x'). For each such quadruple, Q-learning performs the following update:
Q(x, a) ← Q(x, a) + α [r + γ V(x') − Q(x, a)],

where α is the learning rate or step size parameter and V(x') = max_a Q(x', a). With a suitable decay schedule for
the learning rate, a policy that ensures that every state-action pair is experienced infinitely often, and a representation for Q(x, a) which can assign an independent value to every state-action pair, Q-learning will converge to estimates for Q(x, a) which reflect the expected, discounted value of taking action a in state x and proceeding optimally thereafter.
In practice the formal convergence requirements for Q-learning almost never hold because the state space is too large to permit an independent representation of the value of every state. Typically, a parametric function approximator such as a neural network is used to represent the Q function for each action. The following gradient-based update scheme is then used:

w ← w + α [r + γ V(x') − Q(x, a, w)] ∇_w Q(x, a, w),   (1)

where w is a weight vector for our function approximation architecture and, again, the value V(x') of the next state x' is given by:

V(x') = max_a Q(x', a).   (2)
The Q-learning update mechanism is completely generic and requires only that the approximation architecture is differentiable. We are free to choose an architecture that is compatible with our action selection mechanism. Therefore, we can assume that every agent maintains a local Q_j-function defined over some subset of the state variables, its own actions, and possibly actions of some other agents. The global Q-function is now a function of the state and an action vector a, which contains the joint action of all of the agents, defined as the sum of these local Q_j-functions:

Q(x, a, w) = Σ_{j=1}^g Q_j(x, a, w_j).
The Q_j-functions themselves can be maintained locally by each agent as an arbitrary, differentiable function of a set of local weights w_j. Each Q_j can be defined over the entire state space, or just some subset of the state variables visible to agent j.

There are some somewhat subtle consequences of this representation. The first is that determining V(x') in Eq. (2) seems intractable, because it requires a maximization over an exponentially large action space. Fortunately, the Q-function is factored as a linear combination of local Q_j functions. Thus, we can apply the coordination graph procedure from Section 2 to obtain the maximum value V(x') at any given state.

Once we have defined the local Q_j functions, we must compute the weight update in Eq. (1). Each agent must know:

Δ(x, a, r, x', w) = [r + γ V(x') − Q(x, a, w)],   (3)

the difference between the current Q-value and the discounted value of the next state. We have just shown that it is possible to apply the coordination graph from Section 2 to compute V(x'). The other unknown terms in Eq. (3) are the reward r and the previous Q value, Q(x, a, w). The reward is observed and the Q value can be computed by a simple message passing scheme similar to the one used in the coordination graph by fixing the action of every agent to the one assigned in a.

Therefore, after the coordination step, each agent will have access to the value of Δ(x, a, r, x', w). At this point, the weight update equation is entirely local:

w_j ← w_j + α Δ(x, a, r, x', w) ∇_{w_j} Q_j(x, a, w_j).

The reason is simply that the gradient decomposes linearly: once an action is selected, there are no cross-terms involving any w_i and w_j. The locality of the weight updates in this formulation of Q-learning makes it very attractive for a distributed implementation. Each agent can maintain an entirely local Q-function and does not need to know anything about the structure of the neighboring agents' Q-functions. Different agents can even use different architectures, e.g., one might use a neural network and another might use a CMAC. The only requirement is that the joint Q-function be expressed as a sum of these individual Q-functions.

The only negative aspect of this Q-learning formulation is that, like almost all forms of Q-learning with function approximation, it is difficult to provide any kind of formal convergence guarantees. We have no reason to believe that it would perform any better or worse than general Q-learning with a linear function approximation architecture; however, this remains to be verified experimentally.
5 Multiagent LSPI

Least squares policy iteration (LSPI) [11] is a new reinforcement learning method that performs policy iteration by using a stored corpus of samples in place of a model. LSPI represents Q-functions as a linear combination of basis functions. Given a policy π_i, LSPI computes a set of Q-functions, Q_{π_i} (in the space spanned by the basis functions), which are a fixed point for π_i with respect to the samples. The new Q_{π_i} then implicitly defines policy π_{i+1} and the process is repeated until some form of convergence is achieved.

We briefly review the mathematical operations required for LSPI here; full details are available in [10]. We assume that our Q-functions will be approximated by a linear combination of basis functions (features),

Q(x, a, w) = Σ_{i=1}^k φ_i(x, a) w_i = φ(x, a)^T w.

For convenience we express our basis functions in matrix form:

Φ = ( φ(x_1, a_1)^T ; φ(x_1, a_2)^T ; ... ; φ(x_{|X|}, a_{|A|})^T ),
where Φ is an (|X||A| × k) matrix. If we knew the transition matrix P^π for the current policy and knew the reward function, we could, in principle, compute the fixed point by defining and solving the following system:

A w^π = b,

where

A = Φ^T(Φ − γ P^π Φ)   and   b = Φ^T R.

In practice, we must sample experiences with the environment in place of R and P^π. Given a set of samples, D = {(s_{d_i}, a_{d_i}, r_{d_i}, s'_{d_i}) | i = 1, 2, ..., L}, we can construct approximate versions of Φ, P^πΦ, and R as follows:
Φ̂ is the L × k matrix whose i-th row is φ(s_{d_i}, a_{d_i})^T; the approximation of P^πΦ is the L × k matrix whose i-th row is φ(s'_{d_i}, π(s'_{d_i}))^T; and R̂ is the vector whose i-th entry is r_{d_i}.
It is easy to see that approximations (Ã_1, b̃_1) and (Ã_2, b̃_2) derived from different sets of samples (D_1, D_2) can be combined additively to yield a better approximation that corresponds to the combined set of samples:

Ã = Ã_1 + Ã_2   and   b̃ = b̃_1 + b̃_2.
This observation leads to an incremental update rule for Ã and b̃. Assume that initially Ã = 0 and b̃ = 0. For a fixed policy π, a new sample (x, a, r, x') contributes to the approximation according to the following update equations:

Ã ← Ã + φ(x, a) (φ(x, a) − γ φ(x', π(x')))^T

and

b̃ ← b̃ + φ(x, a) r.
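A minimal sketch of this accumulation is given below, assuming a global feature map φ(x, a) of dimension k and a current policy π; the features, samples, and regularization term are placeholders, and in the multiagent case π(x') would itself be computed by the coordination graph.

# Incremental construction of A~ and b~ from a batch of samples, then solve for w.
import numpy as np

k, gamma = 5, 0.95

def phi(x, a):
    # placeholder features over (state, joint action)
    return np.array([1.0, x, a, x * a, x ** 2])

def lstdq_accumulate(samples, pi):
    A = np.zeros((k, k))
    b = np.zeros(k)
    for (x, a, r, x_next) in samples:
        phi_sa = phi(x, a)
        phi_next = phi(x_next, pi(x_next))          # next action chosen by the current policy
        A += np.outer(phi_sa, phi_sa - gamma * phi_next)
        b += r * phi_sa
    return A, b                                     # contributions from sample sets add up

samples = [(0.1, 1, 0.0, 0.2), (0.2, 0, 1.0, 0.4), (0.4, 1, 0.0, 0.1)]
A, b = lstdq_accumulate(samples, pi=lambda x: int(x > 0.25))
w = np.linalg.solve(A + 1e-6 * np.eye(k), b)        # weights of Q^pi; regularized for stability
print(w)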
We note that this approach is very similar to the LSTD algorithm [5]. Unlike LSTD, which defines a system of equations relating state values to state values, LSPI is defined over Q-values. Each iteration of LSPI yields the Q-values for the current policy. Thus, each solution implicitly defines the next policy for policy iteration.
An important feature of LSPI is that it is able to reuse the same set of samples even as the policy changes. For example, suppose the corpus contains a transition from state x to state x' under action a_1 and π_i(x') = a_2. This is entered into the Ã matrix as if a transition were made from Q(x, a_1) to Q(x', a_2). If π_{i+1} changes the action for x' from a_2 to a_3, then the next iteration of LSPI enters a transition from Q(x, a_1) to Q(x', a_3) into the Ã matrix. The sample can be reused because the dynamics for state x under action a_1 have not changed.
The application of collaborative action selection to the LSPI framework is surprisingly straightforward. We first note that since LSPI is a linear method, any set of Q-functions produced by LSPI will, by construction, be of the right form for collaborative action selection. Each agent is assigned a local set of basis functions which define its local Q-function. These basis functions can be defined over the agent's own actions as well as the actions of a small number of other agents. As with ordinary LSPI, the current policy π_i is defined implicitly by the current set of Q-functions, Q_{π_i}. However, in the multiagent case, we cannot enumerate each possible action to determine the policy at some given state because this set of actions is exponential in the number of agents. Fortunately, we can again exploit the structure of the coordination graph to determine the optimal actions relative to Q_{π_i}: for each transition from state x to state x' under joint action a, the coordination graph is used to determine the optimal actions for x', yielding action a'. The transition is added to the Ã matrix as a transition from Q(x, a) to Q(x', a').

An advantage of the LSPI approach to collaborative action selection is that it computes a value function for each successive policy which has a coherent interpretation as a projection into the linear space spanned by the individual agents' local Q-functions. Thus, there is reason to believe that the resulting Q-functions will be well-suited to collaborative action selection using a coordination graph mechanism.

A disadvantage of the LSPI approach is that it is not currently amenable to a distributed implementation during the learning phase: construction of the Ã matrix requires knowledge of the evaluation of each agent's basis functions for every state in the corpus, not only for every action that is actually taken, but for every action recommended by every policy considered by LSPI.
6 Coordination in direct policy search
Value function based reinforcement learning methods have recently come under some criticism as being unstable and difficult to use in practice. A function approximation architecture that is not well-suited to the problem can diverge or produce poor results with little meaningful feedback that is directly useful for modifying the function approximator to achieve better performance.

LSPI was designed to address some of the concerns with Q-learning based value function approximation. It is more stable than Q-learning and, since it is a linear method, it is somewhat easier to debug. However, LSPI is still an approximate policy iteration procedure and can be quite sensitive to small errors in the estimated Q-values for policies [4]. In practice, LSPI can take large, coarse steps in policy space.

The shortcomings of value function based methods have led to a surge of interest in direct policy search methods [13; 12; 1; 15; 9]. These methods use gradient ascent to search a space of parameterized stochastic policies. As with all gradient ascent methods, local optima can be problematic. Defining a relatively smooth but expressive policy space and finding reasonable starting points within this space are all important elements of any successful application of gradient ascent.
We now show how to seed a gradient ascent procedure with a multiagent policy generated by Q-learning or LSPI as described above. To guarantee that the gradient exists, policy search methods require stochastic policies. Our first task is to convert the deterministic policy implied by our Q-functions into a stochastic policy μ(a | x), i.e., a distribution over actions given the state. A natural way to do this, which also turns out to be compatible with most policy search methods, is to create a softmax over the Q-values:

μ(a | x) = e^{Σ_j Q_j(x, a)} / Σ_b e^{Σ_j Q_j(x, b)}.   (4)
To be able to apply policy search methods for such a policy representation, we must present two additional steps: the first is a method of efficiently generating samples from our stochastic policy and the second is a method of efficiently differentiating our stochastic policy for gradient ascent purposes.

Sampling from our stochastic policy may appear problematic because of the size of the joint action space. For sampling purposes, we can ignore the denominator, since it is the same for all actions, and sample from the numerator directly as an unnormalized potential function. To do this sampling we again use variable elimination on a coordination graph with exactly the same structure as the one in Section 2. Conditioning on the current state x is again easy: each agent needs to observe only the variables in Observable[Q_j] and instantiate Q_j appropriately. At this point, we need to generate a sample from Q_j functions that depend only on the action choice.
Following our earlier example, our task is now to sample from the potential corresponding to the numerator of μ(a | x). Suppose, for example, that the individual agents' Q-functions have the following form:

Q = Q_1(a_1, a_2) + Q_2(a_2, a_4) + Q_3(a_1, a_3) + Q_4(a_3, a_4),

and we wish to sample from the potential function

P(a) ∝ e^{Q_1(a_1, a_2)} e^{Q_2(a_2, a_4)} e^{Q_3(a_1, a_3)} e^{Q_4(a_3, a_4)}.
To sample actions one at a time, we will follow a strategy of marginalizing out actions until we are left with a potential over a single action. We then sample from this potential and propagate the results backwards to sample actions for the remaining agents. Suppose we begin by eliminating a_4. Agent 4 can summarize its impact on the rest of the distribution by combining its potential function with that of agent 2 and defining a new potential:

f_4(a_2, a_3) = Σ_{a_4} e^{Q_2(a_2, a_4)} e^{Q_4(a_3, a_4)}.

The problem now reduces to sampling from

e^{Q_1(a_1, a_2)} e^{Q_3(a_1, a_3)} f_4(a_2, a_3),

having one fewer agent. Next, agent 3 communicates its contribution, giving:

e^{Q_1(a_1, a_2)} f_3(a_1, a_2),

where f_3(a_1, a_2) = Σ_{a_3} e^{Q_3(a_1, a_3)} f_4(a_2, a_3). Agent 2 now communicates its contribution, giving

f_2(a_1) = Σ_{a_2} e^{Q_1(a_1, a_2)} f_3(a_1, a_2),

and agent 1 can now sample actions from the potential P(a_1) ∝ f_2(a_1).

We can now sample actions for the remaining agents by reversing the direction of the messages and sampling from the distribution for each agent, conditioned on the choices of the previous agents. For example, when agent 2 is informed of the action selected by agent 1, agent 2 can sample actions from the distribution:

P(a_2 | a_1) = P(a_2, a_1) / P(a_1) = e^{Q_1(a_1, a_2)} f_3(a_1, a_2) / f_2(a_1).

The general algorithm has the same message passing topology as the original action selection mechanism. The only difference is the content of the messages: the forward pass messages are probability potentials and the backward pass messages are used to compute conditional distributions from which actions are sampled.
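The following is a sketch of this forward/backward sampling pass for the 4-agent example, assuming two actions per agent and state-conditioned Q_j given as 2 x 2 tables with arbitrary entries; the forward messages are the unnormalized potentials f_4, f_3, f_2, and the backward pass samples each action conditioned on the choices already drawn.

# Forward (sum) pass and backward (sampling) pass for the 4-agent softmax potential.
import numpy as np

rng = np.random.default_rng(0)
Q1, Q2, Q3, Q4 = (rng.normal(size=(2, 2)) for _ in range(4))   # Q1[a1,a2], Q2[a2,a4], Q3[a1,a3], Q4[a3,a4]

# forward (elimination) pass: sum out a4, a3, a2 in turn
f4 = np.einsum('bd,cd->bc', np.exp(Q2), np.exp(Q4))            # f4[a2,a3] = sum_a4 e^{Q2+Q4}
f3 = np.einsum('ac,bc->ab', np.exp(Q3), f4)                    # f3[a1,a2] = sum_a3 e^{Q3} f4
f2 = np.einsum('ab,ab->a', np.exp(Q1), f3)                     # f2[a1]    = sum_a2 e^{Q1} f3

def draw(potential):
    p = potential / potential.sum()
    return rng.choice(len(p), p=p)

# backward pass: sample each action conditioned on earlier choices
a1 = draw(f2)
a2 = draw(np.exp(Q1[a1, :]) * f3[a1, :])
a3 = draw(np.exp(Q3[a1, :]) * f4[a2, :])
a4 = draw(np.exp(Q2[a2, :] + Q4[a3, :]))
print(a1, a2, a3, a4)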
The next key operation is the computation of the gradient of our stochastic policy function, a key operation in a REINFORCE-style [16] policy search algorithm (most policy search algorithms are of this style). First, recall that we are using a linear representation for each local Q_j function:

Q_j(x, a) = Σ_i w_{i,j} φ_{i,j}(x, a),

where each basis function φ_{i,j}(x, a) can depend on a subset of the state and action variables. Our stochastic policy representation now becomes:

μ(a | x) = e^{Σ_{j,i} w_{i,j} φ_{i,j}(x, a)} / Σ_b e^{Σ_{j,i} w_{i,j} φ_{i,j}(x, b)}.

To simplify our notation, we will refer to the weights as w_i rather than w_{i,j} and the basis functions as φ_i rather than φ_{i,j}. Finally, we need to compute the gradient of the log policy:

∂/∂w_i ln μ(a | x) = ∂/∂w_i [ Σ_{i'} w_{i'} φ_{i'}(x, a) − ln Σ_b e^{Σ_{i'} w_{i'} φ_{i'}(x, b)} ]
= φ_i(x, a) − Σ_b φ_i(x, b) e^{Σ_{i'} w_{i'} φ_{i'}(x, b)} / Σ_{b'} e^{Σ_{i'} w_{i'} φ_{i'}(x, b')}
= φ_i(x, a) − Σ_b μ(b | x) φ_i(x, b).

We note that both the numerator and the denominator in the summation can be determined by a variable elimination procedure similar to the stochastic action selection procedure.
These action selection and gradient computation mechanisms provide the basic functions required for essentially any policy search method. As in the case of Q-learning, a global error signal must be shared by the entire set of agents. Apart from this, the gradient computations and stochastic policy sampling procedure involve a message passing scheme with the same topology as the action selection mechanism. We believe that these methods can be incorporated into any of a number of policy search methods to fine-tune a policy derived by Q-learning with linear Q-functions or by LSPI.
7 Experimental results
In this section we report results of applying our Coordinated RL approach with LSPI to the SysAdmin multiagent domain [7]. The SysAdmin problem consists of a network of n machines connected in one of the following topologies: chain, ring, star, ring-of-rings, or star-and-ring. The state of each machine j is described by two variables: status S_j ∈ {good, faulty, dead}, and load L_j ∈ {idle, loaded, process successful}. New jobs arrive with probability 0.5 on idle machines and make them loaded. A loaded machine in good status can execute and successfully complete a job with probability 0.5 at each time step. A faulty machine can also execute jobs, but it will take longer to terminate (jobs end with probability 0.25). A dead machine is not able to execute jobs and remains dead until it is rebooted. Each machine receives a reward of +1 for each job completed successfully; otherwise it receives a reward of 0. Machines fail stochastically and switch status from good to faulty and eventually to dead, but they are also influenced by their neighbors; a dead machine increases the probability that its neighbors will become faulty and eventually die.

Each machine is also associated with an agent A_j which at each step has to choose between rebooting the machine or not. Rebooting a machine makes its status good independently of the current status, but any running job is lost. The agents have to coordinate their actions so as to maximize the total reward for the system, or in other words, maximize the total number of successfully completed jobs in the system. Initially, all machines are in "good" status and the load is set to "process successful". The discount factor γ is set to 0.95.
The SysAdmin problem has been studied in [7], where the model of the process is assumed to be available as a factored MDP. This is a fairly large MDP with 9^n possible states and 2^n possible actions for a network of n machines. A direct approach to solving such an MDP will fail even for small values of n. The state value function is approximated as a linear combination of indicator basis functions, and the coefficients are computed using a Linear Programming (LP) approach. The derived policies are very close to the theoretical optimum and significantly better compared to policies learned by the Distributed Reward (DR) and Distributed Value Function (DVF) algorithms [14].
We use the SysAdmin problem to evaluate our coordinated RL approach despite the availability of a factored model because it provides a baseline for comparing the model-based approach with coordinated RL. Note that conventional RL methods would fail here because of the exponential action space. We applied the Coordinated RL approach with LSPI in these experiments. To make a fair comparison with the results obtained by the LP approach, we had to make sure that the sets of basis functions used in each case are comparable. The LP approach makes use of "single" or "pair" indicator functions to approximate the state value function V(x) and forms Q values using the model and the Bellman equation. It then solves a linear program to find the coefficients for the approximation of V(x) under a set of constraints imposed by the Q values.
On the other hand, LSPI uses a set of basis functions to approximate the state-action value function Q(x, a). To make sure the two sets are comparable, we have to choose basis functions for LSPI that span the space of all possible Q functions that the LP approach can form. To do this we backprojected the LP method's basis functions through the transition model [7], added the instantaneous reward, and used the resulting functions as our basis for LSPI. We note that this set of basis functions potentially spans a larger space than the Q functions considered by the LP approach. This is because the LP constrains its Q functions by a lower-dimensional V function, while LSPI is free to consider the entire space spanned by its Q-function basis. While there are some slight differences, we believe the bases are close enough to permit a fair comparison.

Figure 2: Results on star networks.
In our experiments, we created two sets of LSPI basis functions corresponding to the backprojections of the "single" or "pair" indicator functions from [7]. The first set of experiments involved a star topology with up to 14 legs, yielding a range from 3 to 15 agents. For n machines, we experimentally found that about 600n samples are sufficient for LSPI to learn a good policy. The samples were collected by starting at the initial state and following a purely random policy. To avoid biasing our samples too heavily by the stationary distribution of the random policy, each episode was truncated at 15 steps. Thus, samples were collected from 40n episodes, each one 15 steps long. The resulting policies were evaluated with a Monte-Carlo approach by averaging 20 runs of 100 steps each. The entire experiment was repeated 10 times with different sample sets and the results were averaged. Figure 2 shows the results obtained by LSPI compared with the results of LP, DR, and DVF as reported in [7] for the "single" basis functions case. The error bars for LSPI correspond to the averages over the different sample sets used. The "pair" basis functions produced similar results for LSPI and slightly better for LP and are not reported here due to space limitations.
The second set of experiments involved a ring-of-rings topology with up to 6 small rings (each one of size 3) forming a bigger ring; the total number of agents ranged from 6 to 18. All learning parameters were the same as in the star case. Figure 3 compares the results of LSPI with those of LP, DR, and DVF for the "single" basis functions case.

The results in both cases clearly indicate that LSPI learns very good policies, comparable to the LP approach using the same basis functions, but without any use of the model. It is worth noting that the number of samples used in each case grows linearly in the number of agents, whereas the joint state-action space grows exponentially.

Figure 3: Results on ring-of-rings networks.
8 Conclusions and future work

In this paper, we proposed a new approach to reinforcement learning: Coordinated RL. In this approach, agents make coordinated decisions and share information to achieve a principled learning strategy. Our method successfully incorporates the cooperative action selection mechanism derived in [7] into the reinforcement learning framework to allow for structured communication between agents, each of which has only partial access to the state description. We presented three instances of Coordinated RL, extending three RL frameworks: Q-learning, LSPI and policy search. With Q-learning and policy search, the learning mechanism can be distributed. Agents communicate reinforcement signals, utility values and conditional policies. On the other hand, in LSPI some centralized coordination is required to compute the projection of the value function. In all cases, the resulting policies can be executed in a distributed manner. In our view, an algorithm such as LSPI can provide a batch offline estimate of the Q-functions. Subsequently, Q-learning or direct policy search can be applied online to refine this estimate. By using our Coordinated RL method, we can smoothly shift between these two phases.

Our initial empirical results demonstrate that reinforcement learning methods can learn good policies using the type of linear architecture required for cooperative action selection. In our experiments with LSPI, we reliably learned policies that were comparable to the best policies achieved by other methods and close to the theoretical optimum achievable in our test domains. The amount of data required to learn good policies scaled linearly with the number of state and action variables even though the state and action spaces were growing exponentially.

Our initial experiments involved models with discrete state and action spaces that could be described compactly using a dynamic Bayesian network. These were chosen primarily to compare learning performance with previous closed-form approximation methods, and our basis functions were chosen to match closely the basis functions used in earlier experiments with the same models. In future work, we plan to explore problems that involve continuous variables and basis functions. While our methods do require discrete actions, they generalize immediately to continuous state variables.

Guestrin et al. [7] presented the coordination graph method for action selection used in this paper and an algorithm for finding an approximation to the Q-function using local Q_j functions for problems where the MDP model can be represented compactly as a factored MDP. The Multiagent LSPI algorithm presented in this paper can be thought of as a model-free reinforcement learning approach for generating the local Q_j approximation. The factored MDP approach [7] is able to exploit the model and compute its approximation very efficiently. On the other hand, Multiagent LSPI is more general as it can be applied to problems that cannot be represented compactly by a factored MDP.

Acknowledgments

We are very grateful to Daphne Koller for many useful discussions. This work was supported by the ONR under the MURI program "Decision Making Under Uncertainty" and by the Sloan Foundation. The first author was also supported by a Siebel Scholarship.
References

[1] J. Baxter and P. Bartlett. Reinforcement learning in POMDPs via direct gradient ascent. In Proc. Int. Conf. on Machine Learning (ICML), 2000.
[2] D. Bernstein, S. Zilberstein, and N. Immerman. The complexity of decentralized control of Markov decision processes. In Proc. of Conf. on Uncertainty in AI (UAI-00), 2000.
[3] U. Bertele and F. Brioschi. Nonserial Dynamic Programming. Academic Press, New York, 1972.
[4] D. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, Massachusetts, 1996.
[5] S. Bradtke and A. Barto. Linear least-squares algorithms for temporal difference learning. Machine Learning, 2(1):33-58, January 1996.
[6] R. Dechter. Bucket elimination: A unifying framework for reasoning. Artificial Intelligence, 113(1-2):41-85, 1999.
[7] Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored MDPs. In 14th Neural Information Processing Systems (NIPS-14), 2001.
[8] F. Jensen, F. V. Jensen, and S. Dittmer. From influence diagrams to junction trees. In UAI-94, 1994.
[9] V. Konda and J. Tsitsiklis. Actor-critic algorithms. In NIPS-12, 2000.
[10] M. G. Lagoudakis and R. Parr. Model-free least-squares policy iteration. Technical Report CS-2001-05, Department of Computer Science, Duke University, December 2001.
[11] M. Lagoudakis and R. Parr. Model free least squares policy iteration. In NIPS-14, 2001.
[12] A. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In UAI-00, 2000.
[13] A. Ng, R. Parr, and D. Koller. Policy search via density estimation. In NIPS-12, 2000.
[14] J. Schneider, W. Wong, A. Moore, and M. Riedmiller. Distributed value functions. In 16th ICML, 1999.
[15] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS-12, 2000.
[16] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229-256, 1992.
[17] D. Wolpert, K. Wheller, and K. Turner. General principles of learning-based multi-agent systems. In Proc. of 3rd Int. Conf. on Autonomous Agents (Agents '99), 1999.