
From: AAAI Technical Report WS-97-03. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.
The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems
Caroline Claus and Craig Boutilier
Department of Computer Science
University of British Columbia
Vancouver, B.C., Canada V6T 1Z4
{cclaus,cebly}@cs.ubc.ca
1 Introduction
The application of learning to the problem of coordination in multiagent systems (MASs) has become increasingly popular in AI and game theory. The use of reinforcement learning (RL), in particular, has attracted recent attention [11, 9, 8, 6]. As noted in [9], using RL as a means of achieving coordinated behavior is attractive because of its generality and robustness. Standard techniques for RL, for example, Q-learning [10], have been applied directly to MASs with some success. However, a general understanding of the conditions under which RL can be usefully applied, and exactly what form RL might take in MASs, are problems that have not yet been tackled in depth. We might ask the following questions:
• Are there differences between agents that learn as if there are no other agents (i.e., use single agent RL algorithms) and agents that attempt to learn both the values of specific joint actions and the strategies employed by other agents?
• Are RL algorithms guaranteed to converge in multiagent settings? If so, do they converge to (optimal) equilibria?
• How are rates of convergence and limit points influenced by the system structure and learning dynamics?
In this paper, we begin to address some of these questions in a very specific context, namely, repeated games in which agents have common interests (i.e., cooperative MASs). We focus our attention on Q-learning, due to its relative simplicity (certainly not for its general efficacy), consider some of the factors that may influence the dynamics of multiagent Q-learning, and provide partial answers to these questions. We first distinguish and compare two forms of multiagent RL (MARL). Independent learners (ILs) apply Q-learning in the classic sense, ignoring the existence of other agents. Joint action learners (JALs), in contrast, learn the value of their own actions in conjunction with those of other agents via integration of RL with equilibrium (or coordination) learning methods [12, 5, 4, 7]. We also examine the influence of partial observability on JALs, and how game structure and exploration strategies influence the dynamics of the learning process and the convergence to equilibrium. We conclude by mentioning several problems that promise to make the integration of RL with coordination learning an exciting area of research for the foreseeable future.
2 Preliminary Concepts and Notation
2.1 Single Stage Games
Our interest is in n-player cooperative repeated games.¹ We assume a collection α of n (heterogeneous) agents, each agent i ∈ α having available to it a finite set of individual actions A_i. Agents repeatedly play a stage game in which they each independently select an individual action to perform. The chosen actions at any point constitute a joint action, the set of which is denoted A = ×_{i∈α} A_i. With each a ∈ A is associated a (possibly stochastic) reward R(a); the decision problem is cooperative since there is a single reward function R reflecting the utility assessment of all agents. The agents wish to choose actions that maximize reward.
A randomized strategy for agent i is a distribution π ∈ Δ(A_i) (where Δ(A_i) is the set of distributions over the agent's action set A_i). Intuitively, π(a_i) denotes the probability of agent i selecting the individual action a_i. A strategy π is deterministic if π(a_i) = 1 for some a_i ∈ A_i. A strategy profile is a collection Π = {π_i : i ∈ α} of strategies for each agent i. The expected value of acting according to a fixed profile can easily be determined. If each π_i ∈ Π is deterministic, we can think of Π as a joint action. A reduced profile for agent i is a strategy profile for all agents but i (denoted Π_{-i}). Given a profile Π_{-i}, a strategy π_i is a best response for agent i if the expected value of the strategy profile Π_{-i} ∪ {π_i} is maximal for agent i; that is, agent i could not do better using any other strategy π_i'. Finally, we say that the strategy profile Π is a Nash equilibrium iff Π[i] (i's component of Π) is a best response to Π_{-i}, for every agent i. Note that in cooperative games, deterministic equilibria are easy to find. An equilibrium (or joint action) is optimal if no other has greater value.
As an example, consider the simple two-agent stage game:

        a0    a1
  b0     x     0
  b1     0     y

Agents A and B each have two actions at their disposal, a0, a1 and b0, b1, respectively. If x > y > 0, (a0, b0) and (a1, b1) are both equilibria, but only the first is optimal.
¹ Most of our conclusions hold mutatis mutandis for sequential, multiagent Markov decision processes [2] with multiple states.
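To make the equilibrium definitions concrete, the following sketch (ours, not from the original text) enumerates the deterministic equilibria and the optimal joint actions of the example game above, instantiated with x = 10 and y = 5 (illustrative values).

```python
from itertools import product

def deterministic_equilibria(action_sets, reward):
    """A joint action is a deterministic equilibrium if no single agent can
    improve the (common) reward by unilaterally switching its own action."""
    equilibria = []
    for joint in product(*action_sets):
        stable = True
        for i, actions in enumerate(action_sets):
            for alt in actions:
                deviation = joint[:i] + (alt,) + joint[i + 1:]
                if reward[deviation] > reward[joint]:
                    stable = False
        if stable:
            equilibria.append(joint)
    return equilibria

# The example game above with x = 10 and y = 5 (illustrative values).
reward = {("a0", "b0"): 10, ("a1", "b1"): 5, ("a0", "b1"): 0, ("a1", "b0"): 0}
eqs = deterministic_equilibria([("a0", "a1"), ("b0", "b1")], reward)
optimal = [a for a in eqs if reward[a] == max(reward.values())]
print(eqs)      # [('a0', 'b0'), ('a1', 'b1')]
print(optimal)  # [('a0', 'b0')] -- the optimal equilibrium
```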
2.2 Learning in Coordination Games
Action selection is more difficult if there are multiple optimal joint actions. If, for instance, x = y > 0, neither agent has a reason to prefer one or the other of its actions. If they choose them randomly, or in some way reflecting personal biases, then they risk choosing a suboptimal, or uncoordinated, joint action. The general problem of equilibrium selection can be addressed in several ways. We consider the notion that an equilibrium (i.e., coordinated action choice in our cooperative setting) might be learned through repeated play of the game [4, 5, 7, 8].
One especially simple, yet often effective, model for achieving coordination is fictitious play [3, 4]. Each agent i keeps a count C^j_{a_j}, for each j ∈ α and a_j ∈ A_j, of the number of times agent j has used action a_j in the past. When the game is encountered, i treats the relative frequencies of each of j's moves as indicative of j's current (randomized) strategy. That is, for each agent j, i assumes j plays action a_j ∈ A_j with probability Pr^j_{a_j} = C^j_{a_j} / (Σ_{b_j ∈ A_j} C^j_{b_j}). This set of strategies forms a reduced profile Π_{-i}, for which agent i adopts a best response. After the play, i updates its counts appropriately, given the actions used by the other agents. We think of these counts as reflecting the beliefs an agent has regarding the play of the other agents (initial counts can also be weighted to reflect priors).
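As a concrete illustration (ours, not part of the original presentation), the sketch below runs two fictitious-play agents on a two-action coordination game of the kind shown earlier; the payoff values and the unit initial counts are assumptions chosen for the example.

```python
import random

# Payoffs for a two-action coordination game like the one above
# (x > y > 0; the values 10 and 5 are illustrative).
REWARD = {("a0", "b0"): 10, ("a1", "b1"): 5,
          ("a0", "b1"): 0, ("a1", "b0"): 0}

class FictitiousPlayer:
    """Keeps counts of the opponent's past actions and plays a best response."""
    def __init__(self, my_actions, opp_actions):
        self.my_actions = my_actions
        self.counts = {a: 1 for a in opp_actions}  # initial counts act as a prior

    def belief(self):
        total = sum(self.counts.values())
        return {a: c / total for a, c in self.counts.items()}

    def best_response(self, reward_of):
        belief = self.belief()
        ev = {m: sum(p * reward_of(m, o) for o, p in belief.items())
              for m in self.my_actions}
        best = max(ev.values())
        return random.choice([m for m, v in ev.items() if v == best])

    def observe(self, opp_action):
        self.counts[opp_action] += 1

A = FictitiousPlayer(["a0", "a1"], ["b0", "b1"])
B = FictitiousPlayer(["b0", "b1"], ["a0", "a1"])
for _ in range(50):
    a = A.best_response(lambda mine, theirs: REWARD[(mine, theirs)])
    b = B.best_response(lambda mine, theirs: REWARD[(theirs, mine)])
    A.observe(b)
    B.observe(a)
print("A's belief about B:", A.belief())
```

With x > y, the two players quickly lock into the optimal joint action and reinforce each other's beliefs, mirroring the convergence behavior described next.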
This very simple adaptive strategy will converge to an equilibrium in our simple cooperative games, and can be made to converge to an optimal equilibrium [12, 1]; that is, the probability of a coordinated equilibrium after k interactions can be made arbitrarily high by increasing k sufficiently. It is also not hard to see that once the agents reach an equilibrium, they will remain there: each best response simply reinforces the beliefs of the other agents that the coordinated equilibrium remains in force.
We note that most game theoretic models assume that each agent can observe the actions executed by its counterparts with certainty. As pointed out and addressed in [1, 6], this assumption is often unrealistic. We will be interested below in the more general case where each agent obtains an observation which is related stochastically to the actual joint action selected. Formally, we assume an observation set O, and an observation model p : A → Δ(O). Intuitively, p(a)(o), which we write in the less cumbersome fashion Pr_a(o), denotes the probability of observation o being obtained by all agents when joint action a is performed. Each agent is aware of this function.²

² Generalizations of this model could also be used.
2.3 Reinforcement Learning
Action selection is more difficult still if agents are unaware of the rewards associated with various joint actions. In such a case, reinforcement learning can be used by the agents to estimate, based on past experience, the expected reward associated with individual or joint actions.

A simple, well-understood algorithm for single agent learning is Q-learning [10]. In our stateless setting, we assume a Q-value, Q(a), that provides an estimate of the value of performing (individual or joint) action a. An agent updates its estimate Q(a) based on a sample (a, r) as follows:

Q(a) ← Q(a) + α(r − Q(a))    (1)

The sample (a, r) is the "experience" obtained by the agent: action a was performed, resulting in reward r. Here α is the learning rate (0 < α < 1), governing to what extent the new sample replaces the current estimate. If α is decreased "slowly" during learning and all actions are sampled infinitely often, Q-learning will converge to the true Q-values for all actions in the single agent setting [10].
Convergence of Q-learning does not depend on the exploration strategy used. An agent can try its actions at any time; there is no requirement to perform actions that are currently estimated to be best. Of course, if we hope to enhance overall performance during learning, it makes sense (at least intuitively) to bias selection toward better actions. We can distinguish two forms of exploration. In nonexploitive exploration, an agent randomly chooses its actions with uniform probability. There is no attempt to use what was learned to improve performance; the aim is simply to learn Q-values. In exploitive exploration an agent chooses its best estimated action with probability p_x, and chooses some other action with probability 1 − p_x. Often the exploitation probability p_x is increased slowly over time. We call a nonoptimal action choice an exploration step and 1 − p_x the exploration probability. Nonoptimal action selection can be uniform during exploration, or can be biased by the magnitudes of Q-values. A popular biased strategy is Boltzmann exploration: action a is chosen with probability

e^{Q(a)/T} / Σ_{a'} e^{Q(a')/T}    (2)

The temperature parameter T can be decreased over time so that the exploitation probability increases.
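A minimal sketch of these two ingredients as an independent learner would use them: the stateless update of Equation (1) and Boltzmann selection of Equation (2). The parameter values are illustrative assumptions, not settings taken from the paper.

```python
import math
import random

def boltzmann_choice(q, temperature):
    """Pick an action with probability proportional to e^{Q(a)/T} (Equation (2))."""
    weights = {a: math.exp(v / temperature) for a, v in q.items()}
    r = random.random() * sum(weights.values())
    for action, w in weights.items():
        r -= w
        if r <= 0:
            return action
    return action  # fallback for floating-point rounding

def q_update(q, action, reward, alpha):
    """Stateless Q-learning update of Equation (1): Q(a) <- Q(a) + alpha*(r - Q(a))."""
    q[action] += alpha * (reward - q[action])

# Illustrative use for one independent learner with two individual actions.
q = {"a0": 0.0, "a1": 0.0}
alpha, temperature = 0.2, 16.0
action = boltzmann_choice(q, temperature)
q_update(q, action, reward=10.0, alpha=alpha)
```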
There are two distinct ways in which Q-learning could be applied to a multiagent system. We say a MARL algorithm is an independent learner (IL) algorithm if the agents learn Q-values for their individual actions based on Equation (1). In other words, they perform their actions, obtain a reward, and update their Q-values without regard to the actions performed by other agents. Experiences for agent i take the form (a_i, r), where a_i is the action performed by i and r is a reward. If an agent is unaware of the existence of other agents, cannot identify their actions, or has no reason to believe that other agents are acting strategically, then this is an appropriate method of learning. Of course, even if these conditions do not hold, an agent may choose to ignore information about the other agents' actions.

A joint action learner (JAL) is an agent that learns Q-values for joint actions as opposed to individual actions. The experiences for such an agent are of the form (a, r), where a is a joint action. This implies that each agent can observe the actions of other agents. We can generalize the picture slightly by allowing experiences of the form (a_i, o, r), where a_i is the action performed by i and o is its (joint action) observation.

For JALs, exploration strategies require some care. In the example above, if A currently has Q-values for all four joint actions, the expected value of performing a0 or a1 depends crucially on the strategy adopted by B. To determine the relative values of their individual actions, each agent in a JAL algorithm maintains beliefs about the strategies of other agents. Here we will use simple empirical distributions, possibly with biased initial weights as in fictitious play. Agent A, for instance, assumes that each other agent B will choose actions in accordance with A's current beliefs about B (i.e., A's empirical distribution over B's action choices). In general, agent i assesses the expected value of its individual action a_i to be

EV(a_i) = Σ_{a_{-i} ∈ A_{-i}} Q(a_{-i} ∪ {a_i}) · Π_{j≠i} Pr^j_{a_{-i}[j]}

Agent i can use these values just as it would Q-values in implementing an exploration strategy.³

³ The expression for EV(a_i) makes the justifiable assumption that the other agents are selecting their actions independently. Less reasonable is the assumption that these choices are uncorrelated, or even correlated with i's choices. Such correlations can often emerge due to the dynamics of belief updating without agents being aware of this correlation, especially if frequencies of particular joint actions are ignored.
Maintaining a belief distribution by means of fictitious play is problematic if agents have imprecise observational capabilities. Following [1], we use a simple Bayesian updating rule for beliefs:

Pr(a[j] = s | a[i] = a_i, o) = Pr(o | a[j] = s, a[i] = a_i) · Pr(a[j] = s) / Pr(o | a[i] = a_i)

Agent i then updates its distribution over j's probabilities using this "stochastic observation"; in particular, the count C^j_{a_j} is incremented by Pr(a_j | o) (intuitively, by a "fractional" outcome).⁴

⁴ This essentially corresponds to using the empirical counts as Dirichlet parameters, and treating i's beliefs as a Dirichlet distribution over j's set of mixed strategies.
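A sketch of this fractional count update: given the agent's own action and a noisy observation, one unit of count is spread over the other agent's possible actions in proportion to their posterior probabilities. The observation model and the uniform prior counts are assumptions made for the example.

```python
def posterior_over_other(my_action, obs, counts, obs_model):
    """Pr(a_j | o, a_i) is proportional to Pr(o | a_j, a_i) * Pr(a_j), with
    Pr(a_j) taken from the empirical counts (fictitious-play beliefs)."""
    total = sum(counts.values())
    unnorm = {a_j: obs_model[(my_action, a_j)][obs] * (counts[a_j] / total)
              for a_j in counts}
    z = sum(unnorm.values())
    return {a_j: p / z for a_j, p in unnorm.items()}

def fractional_count_update(my_action, obs, counts, obs_model):
    """Increment each count C^j_{a_j} by the posterior probability of a_j."""
    post = posterior_over_other(my_action, obs, counts, obs_model)
    for a_j, p in post.items():
        counts[a_j] += p

# Example: observation model Pr(o | joint action) with two possible observations.
obs_model = {("a0", "b0"): {"o0": 0.8, "o1": 0.2},
             ("a0", "b1"): {"o0": 0.2, "o1": 0.8}}
counts = {"b0": 1.0, "b1": 1.0}
fractional_count_update("a0", "o0", counts, obs_model)
print(counts)  # b0 receives most of the unit of count
```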
3 Comparing Independent and Joint-Action Learners
We first compare the relative performance of independent and joint-action learners on a simple coordination game of the form described above:

        a0    a1
  b0    10     0
  b1     0    10
3 ComparingIndependent and Joint-Action
convergenceof Q-values, but note that the Q-values for the
Learners
actions of the equilibria attained (e.g., (a0, b0)) tendedto
Wefirst compare the relative performance of independent
while the other actions tended to 0. Wenote that probability
and joint-action learners on a simple coordination gameof
of optimal action selection does not increase smoothlywithin
the form described above:
individual trials; the averagedprobabilities reflect the likelia0 al
hood of having reached an equilibrium by time t, as well as
b0 10
0
exploration probabilities. Wealso point out that muchfaster
10
bl 0
convergencecan be had for different parametersettings (e.g.,
decaying temperature T more rapidly). Wedefer general reThefirst thing to note is that ILsusingnonexploitiveexplomarks on convergenceto Section 5.
ration will not deemeither of their choices (on average)
be better than the other. For instance, A’s Q-valuesfor both
The figure also shows convergence for JALs under the
action a0 and al will convergeto 5, since whenever,say, a0
same circumstances (full observability is assumed). JALs
is executed, there is a 0.5 probability of b0 and bl being exdo perform somewhatbetter after a fixed numberof interecuted. Of course, at any point, due to the stochastic nature
actions, as shownin the graph. While the JALShave more
of the strategies and the decayin learning rate, we wouldexinformation at their disposal, convergenceis not enhanced
pect that the learned Q-valueswill not be identical; thus the
dramatically. In retrospect, this should not be too surprisagents, once they converge, might each have a reason to preing. WhileJALsare able to distinguish Q-valuesof different
fer one action to the other. Unfortunately, these biases need joint actions, their ability to use this informationis circumnot be coordinated.
scribed by the strictures of the action selection mechanism.
i)
3Theexpressionfor EV(a makesthe justifiable assumption Anagent maintains beliefs about the strategy being played
that the other agentsare selectingtheir actionsindependently.
Less by the other agents and "exploits" actions according to expected value basedon these beliefs. In other words, the value
reasonableis the assumption
that these choicesare uncorrelated,or
evencorrelatedwithi’s choices.Suchcorrelationscanoften emerge of individual actions "pluggedin" to the exploration strategy
due to the dynamics
of belief updatingwithoutagents beingaware is moreor less the sameas the Q-valueslearned by ILs--the
of this correlation,especiallyif frequenciesof particularjoint aconly distinction is that JALscomputethemusing explicit betions are ignored.
lief distributions and joint Q-valuesinstead of updatingthem
4Thisessentially correspondsto using the empiricalcountsas
directly. Thus, even thoughthe agents maybe fairly sure of
Dirichletparameters,
andtreating the i’s beliefs as a Dirichletdistributionoverj’s set of mixedstrategies.
the relative Q-valuesof joint actions, Boltzmannexploration
Pr(a[j] = s Ja[i] = ai , o)
Pr(oJa~] = s, a[i] = ai )Pr(aLj] = s)
l
15
5does not let themexploit this.
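For concreteness, here is a compact sketch of the kind of experiment just described: two independent learners repeatedly play the game above with Boltzmann exploration, starting at T = 16 and decaying by 0.9 per interaction. The trial length, learning rate, and temperature floor are our assumptions, not the authors' exact settings.

```python
import math
import random

REWARD = {("a0", "b0"): 10, ("a1", "b1"): 10, ("a0", "b1"): 0, ("a1", "b0"): 0}

def boltzmann(q, T):
    """Boltzmann selection over a dictionary of Q-values."""
    w = {a: math.exp(v / T) for a, v in q.items()}
    r = random.random() * sum(w.values())
    for a, weight in w.items():
        r -= weight
        if r <= 0:
            return a
    return a

def run_trial(steps=50, alpha=0.3, T0=16.0, decay=0.9):
    qA = {"a0": 0.0, "a1": 0.0}       # independent learners keep individual-action Q-values
    qB = {"b0": 0.0, "b1": 0.0}
    T = T0
    for _ in range(steps):
        a, b = boltzmann(qA, T), boltzmann(qB, T)
        r = REWARD[(a, b)]
        qA[a] += alpha * (r - qA[a])  # Equation (1), ignoring the other agent
        qB[b] += alpha * (r - qB[b])
        T = max(T * decay, 0.05)      # decaying temperature, floored for exp()
    return max(qA, key=qA.get), max(qB, key=qB.get)

# Fraction of trials whose greedy joint action is one of the two optimal equilibria.
hits = sum(run_trial() in [("a0", "b0"), ("a1", "b1")] for _ in range(100))
print(hits / 100.0)
```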
4 The Effects of Partial Action Observability
When JALs are in a situation where partial observability holds sway, the updating of Q-values should be undertaken with more care. Given an experience (a_i, o, r), where a_i is agent i's action and o is the resultant observation, there can be a number of joint actions a consistent with a_i and o, according to the observation model, that give rise to this experience. Intuitively, i should update the Q-values for each such a, something that conflicts with the usual Q-learning model. To deal with this, we propose the use of "fractional" updates of Q-values: the Q-value for joint action a will be updated by reward r, but the magnitude of this update will be "tempered" by the probability Pr(a | o, a_i) that a was taken given o and a_i.⁶ Specifically, agent i updates joint Q-values for all actions a where a[i] = a_i using:

Q(a) ← Q(a) + α · Pr(a | o, a_i) · (r − Q(a))    (3)

where Pr(a | o, a_i) is computed in the obvious way by i using its beliefs and Bayes rule.

⁶ An alternative way of viewing this: a single Q(a) is chosen by the agent, with probability Pr(a | o, a_i), for update with reward r.
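A sketch of the "fractional" update of Equation (3): every joint action consistent with the agent's own action receives a reward-based update, weighted by its posterior probability. The posterior values here are supplied directly for illustration (they would come from a belief update like the one sketched in Section 2.3).

```python
def fractional_q_update(joint_q, my_action, reward, alpha, posterior):
    """Equation (3): Q(a) <- Q(a) + alpha * Pr(a | o, a_i) * (r - Q(a)),
    applied to every joint action a consistent with the agent's own action.
    `posterior` maps each other-agent action a_j to Pr(a_j | o, a_i)."""
    for a_j, p in posterior.items():
        a = (my_action, a_j)
        joint_q[a] += alpha * p * (reward - joint_q[a])

# Illustration: after playing a0, an observation makes b0 three times as
# likely as b1 (the posterior would come from the belief sketch above).
joint_q = {("a0", "b0"): 0.0, ("a0", "b1"): 0.0}
fractional_q_update(joint_q, "a0", reward=10.0, alpha=0.2,
                    posterior={"b0": 0.75, "b1": 0.25})
print(joint_q)  # {('a0', 'b0'): 1.5, ('a0', 'b1'): 0.5}
```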
Figure 2: Time to convergence (opt. joint action with prob. 0.99) as a function of observation prob. (averaged over 1000 trials; α = 0.2, T = 16, T decayed at rate 0.9).
We first note that if agent i is learning in a stationary environment (i.e., all other agents play fixed, mixed strategies), then this update rule will converge; that is, the limiting Q-values are well-defined, under the same conditions required of Q-learning. Informally:

Proposition 1 Convergence of Q-values using update rule (3) is assured in stationary environments.
We note that convergence of Q-values is enhanced when the "fractional" nature of the updates is accounted for when decaying the learning parameter α: we typically maintain a separate α_a for each joint action a and decay α_a using the total experience with a (i.e., weighting the decay rate using Pr(a | o, a_i)). We also note that convergence cannot generally be to the true Q-values of joint actions unless the observation model is perfect. In our running example, for instance, if observations do not permit agent A to distinguish b0 from b1 perfectly, then the Q-value for, say, (a0, b0) must lie somewhere between 0 and 10.
Figure 2 shows how convergence to equilibrium is influenced by the accuracy of the observation model in the game above. The model associates a unique observation o_a with each of the four joint actions a. The observation probability refers to Pr(o_a | a), and ranges from 0.25 (fully unobservable) to 1 (fully observable). If o ≠ o_a then Pr(o | a) = (1 − Pr(o_a | a))/3. Rate of convergence refers to the expected number of interactions until convergence (i.e., until the probability of selecting an optimal joint action reaches 0.99). The more accurate the observation model, the more quickly the agents converge. Figure 3 shows the convergence of joint Q-values for observation probability 0.5.
Figure 3: Convergence of Q-values with observation probability 0.5 (averaged over 1000 trials).
Figure 4: Likelihood of convergence to opt. equilibrium as a function of penalty k (averaged over 100 trials).
5 Convergence and Game Structure
We will argue below that Q-learning, in both the IL and JAL cases, will converge to an equilibrium under appropriate conditions. Of course, in general, convergence to an optimal equilibrium cannot be assured. Before making this case, we consider the ways in which the game structure can influence the dynamics of the learning process.
        a0    a1    a2
  b0    10     0     k
  b1     0     2     0
  b2     k     0    10
When k < 0, this game has three deterministic equilibria, of which two are preferred. If k = −100, agent A, during initial exploration, will find its first and third actions to be unattractive because of B's random exploration. If A is an IL, the average rewards (and hence Q-values) for a0 and a2 will be quite low; and if A is a JAL, its beliefs about B's strategy will afford these actions low expected value. Similar remarks apply to B, and the self-confirming nature of equilibria virtually assures convergence to (a1, b1). However, the closer k is to 0, the lower the likelihood the agents will find their first and third actions unattractive: the stochastic nature of exploration means that, occasionally, these actions will have high estimated utility and convergence to one of the optimal equilibria will occur. Figure 4 shows how the probability of convergence to one of the optimal equilibria is influenced by the magnitude of the "penalty" k. Not surprisingly, different equilibria can be attained with different likelihoods.⁷

⁷ These results are shown for JALs, but the general pattern holds true for ILs as well.
Thus far, our examples show agents proceeding on a direct route to equilibria (albeit at various rates, and with destinations "chosen" stochastically). Unfortunately, convergence is not so straightforward in general. Consider the game:
        a0    a1    a2
  b0    11   -30     0
  b1   -30     7     6
  b2     0     0     5
Initially, the two learners are almost certainly going to begin to play the nonequilibrium strategy profile (a2, b2). This is seen clearly in Figures 5 and 6. However, once they "settle" at this point, as long as exploration continues (here Boltzmann exploration is used), agent B will soon find b1 to be more attractive, so long as A continues to primarily choose a2. Once the nonequilibrium point (a2, b1) is attained, agent A tracks B's move and begins to perform action a1. Once this equilibrium is reached, the agents remain there. Figures 5 and 6 clearly show the agents settling into nonequilibria at various points and "climbing" toward an equilibrium over time along a best reply path.⁸ The convergence (and resting points) for the joint Q-values are shown in Figure 7.

⁸ These results are based on Boltzmann exploration with a temperature setting of 1 and no decay; this is to induce considerable exploration, otherwise convergence is quite slow (see below). This explains why agent A converges to Pr(a1) = 0.7.

Figure 5: Dynamics of A's strategy.
Figure 6: Dynamics of B's strategy.
Figure 7: Dynamics of joint Q-values (α = 0.1).
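The "climbing" route described here can be read directly off the game matrix. The sketch below (ours) traces the deterministic best-reply path from (a2, b2) in the second example game.

```python
# The second example game, indexed as REWARD[(A's action, B's action)].
REWARD = {
    ("a0", "b0"): 11,  ("a1", "b0"): -30, ("a2", "b0"): 0,
    ("a0", "b1"): -30, ("a1", "b1"): 7,   ("a2", "b1"): 6,
    ("a0", "b2"): 0,   ("a1", "b2"): 0,   ("a2", "b2"): 5,
}
A_ACTIONS = ["a0", "a1", "a2"]
B_ACTIONS = ["b0", "b1", "b2"]

def best_reply_path(a, b, max_steps=10):
    """Alternate single-agent best replies, recording each move, until neither
    agent wants to change its action (a deterministic equilibrium)."""
    path = [(a, b)]
    for _ in range(max_steps):
        new_b = max(B_ACTIONS, key=lambda bb: REWARD[(a, bb)])  # B replies to A
        if new_b != b:
            b = new_b
            path.append((a, b))
            continue
        new_a = max(A_ACTIONS, key=lambda aa: REWARD[(aa, b)])  # A replies to B
        if new_a != a:
            a = new_a
            path.append((a, b))
            continue
        break  # neither agent wants to move: equilibrium reached
    return path

print(best_reply_path("a2", "b2"))  # [('a2', 'b2'), ('a2', 'b1'), ('a1', 'b1')]
```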
This phenomenon will obtain in general, allowing one to conclude that the multiagent Q-learning schemes we have proposed will converge to equilibria almost surely, but only if certain conditions are met in the case of ILs. First, exploration must continue at any point with nonnegative probability. This is required by Q-learning in any case, but is necessary to ensure that a nonequilibrium is not prematurely accepted. Second, when agents are exploring, if the probability of both moving away from a nonequilibrium point simultaneously is sufficiently high, then we cannot expect agent A to notice that it has a new best reply. So, for instance, if we use uniform exploration over time, we cannot induce B to learn a good enough Q-value for its "break out move" to actually move away (e.g., every time b1 is tried, there is a high enough probability that A does a0 that b1 never looks better than b2).

We note that Boltzmann exploration with a decaying temperature parameter ensures infinite exploration. Furthermore, as the temperature decreases over time, so too does the probability of simultaneous exploration. Thus, at some finite time t (assuming bounded rewards), the probability of B executing b1 while A performs a2 will become high enough to render b1 more attractive after some finite number of experiences with b1. At this point, (with high probability) B will perform b1, allowing A to respond in a similar fashion.
These arguments can be put together to show that, with proper exploration and proper use of best responses (i.e., that are at least asymptotically myopic [4]), we will eventually converge to an equilibrium. Informally:

Proposition 2 IL and JAL algorithms will converge to an equilibrium if sufficient, but decreasing, exploration is undertaken.⁹

⁹ We have not yet constructed a formal proof, but see no impediments to this conjecture. The convergence conjecture also relies on the absence of cycles in best reply paths of cooperative games.

We remark that this theoretical guarantee of convergence may not be of much practical value for sufficiently complicated games. The key difficulty is that convergence relies on the use of decaying exploration: this is necessary to ensure that the agents' estimated values are based (ultimately) on a stationary environment. This gradual decay, however, makes the time required to shift from the current entrenched strategy profile to a better profile rather long. If the agents initially settle on a profile that is a large distance (in terms of a best reply path) from an equilibrium, each shift required can take longer to occur because of the decay in exploration. Furthermore, as pointed out above, the probability of concurrent exploration may have to be sufficiently small to ensure that the expected value of a shift along the best reply path is greater than no such shift, which can introduce further delays in the process. We note that the longer these delays are, the lower the learning rate α becomes, requiring even more experience to overcome the initially biased estimated Q-values. Finally, the largest drawback lies in the fact that, even when JALs have perfect joint Q-values, beliefs based on a lot of experience require a considerable amount of contrary experience to be overcome. For example, once B has made the shift from b2 to b1 above, a significant amount of time is needed for A to switch from a2 to a1: it has to observe B performing b1 enough to overcome the rather large degree of belief it had that B would continue doing b2.
6 Concluding Remarks
We have described two basic ways in which Q-learning can be applied in multiagent cooperative settings, and examined the impact of various features on the success of the interaction between equilibrium selection learning techniques and RL techniques. We have demonstrated that the integration requires some care, and that Q-learning is not nearly as robust as in single-agent settings.

These considerations point toward a number of interesting avenues of research directed toward improving the performance of MARL. We remark on some of the directions we are currently pursuing. The first is the enhancement of practical convergence by the use of "windowing" methods for estimating beliefs in JALs. By using only the last k experiences to determine empirical belief distributions, the abatement of convergence caused by strongly held prior beliefs can be alleviated. We note that a memory-based technique of this sort might be applied to ILs if reward experiences are stored and old experiences discounted in determining Q-values. The second avenue we are investigating is possible techniques for encouraging agents to converge to optimal equilibria. This is one area where JALs have a distinct advantage over ILs: even if they have converged to an equilibrium, they can tell (since they have access to joint Q-values) if a better equilibrium exists. Coordination learning techniques (see, e.g., [1]) might then be applied, as could other exploration techniques that attempt to induce a shift from one equilibrium to another. Finally, we hope to explore the details associated with general, multistate sequential decision problems and investigate the application of generalization techniques in domains with large state spaces.
Acknowledgements: Thanks to Leslie Kaelbling and Michael Littman for their helpful discussions in the early stages of this work.
References
[1] C. Boutilier. Learning conventions in multiagent stochastic domains using likelihood estimates. UAI-96, pp. 106-114, Portland.
[2] C. Boutilier. Planning, learning and coordination in multiagent decision processes. TARK-96, pp. 195-210, Amsterdam.
[3] G. W. Brown. Iterative solution of games by fictitious play. In T. C. Koopmans (ed.), Activity Analysis of Production and Allocation. Wiley, 1951.
[4] D. Fudenberg and D. M. Kreps. Learning and Equilibrium in Strategic Form Games. CORE Foundation, Louvain, 1992.
[5] D. Fudenberg and D. K. Levine. Steady state learning and Nash equilibrium. Econometrica, 61(3):547-573, 1993.
[6] J. Hu and M. P. Wellman. Self-fulfilling bias in multiagent learning. ICMAS-96, pp. 118-125, Kyoto.
[7] E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61(5):1019-1045, 1993.
[8] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. ML-94, pp. 157-163, New Brunswick.
[9] S. Sen, M. Sekaran, and J. Hale. Learning to coordinate without sharing information. AAAI-94, pp. 426-431, Seattle.
[10] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
[11] G. Weiß. Learning to coordinate actions in multi-agent systems. IJCAI-93, pp. 311-316, Chambéry.
[12] H. Peyton Young. The evolution of conventions. Econometrica, 61(1):57-84, 1993.