From: AAAI Technical Report WS-97-03. Compilation copyright © 1997, AAAI (www.aaai.org). All rights reserved.

The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems

Caroline Claus and Craig Boutilier
Department of Computer Science
University of British Columbia
Vancouver, B.C., Canada V6T 1Z4
{cclaus,cebly}@cs.ubc.ca

1 Introduction

The application of learning to the problem of coordination in multiagent systems (MASs) has become increasingly popular in AI and game theory. The use of reinforcement learning (RL), in particular, has attracted recent attention [11, 9, 8, 6]. As noted in [9], using RL as a means of achieving coordinated behavior is attractive because of its generality and robustness. Standard techniques for RL, for example Q-learning [10], have been applied directly to MASs with some success. However, a general understanding of the conditions under which RL can be usefully applied, and exactly what form RL might take in MASs, are problems that have not yet been tackled in depth. We might ask the following questions:

- Are there differences between agents that learn as if there are no other agents (i.e., use single-agent RL algorithms) and agents that attempt to learn both the values of specific joint actions and the strategies employed by other agents?
- Are RL algorithms guaranteed to converge in multiagent settings? If so, do they converge to (optimal) equilibria?
- How are rates of convergence and limit points influenced by the system structure and learning dynamics?

In this paper, we begin to address some of these questions in a very specific context, namely, repeated games in which agents have common interests (i.e., cooperative MASs). We focus our attention on Q-learning, due to its relative simplicity (certainly not for its general efficacy), consider some of the factors that may influence the dynamics of multiagent Q-learning, and provide partial answers to these questions.

We first distinguish and compare two forms of multiagent RL (MARL). Independent learners (ILs) apply Q-learning in the classic sense, ignoring the existence of other agents. Joint action learners (JALs), in contrast, learn the value of their own actions in conjunction with those of other agents via integration of RL with equilibrium (or coordination) learning methods [12, 5, 4, 7]. We also examine the influence of partial observability on JALs, and how game structure and exploration strategies influence the dynamics of the learning process and the convergence to equilibrium. We conclude by mentioning several problems that promise to make the integration of RL with coordination learning an exciting area of research for the foreseeable future.

2 Preliminary Concepts and Notation

2.1 Single Stage Games

Our interest is in n-player cooperative repeated games.¹ We assume a collection α of n (heterogeneous) agents, each agent i ∈ α having available to it a finite set of individual actions A_i. Agents repeatedly play a stage game in which they each independently select an individual action to perform. The chosen actions at any point constitute a joint action, the set of which is denoted A = ×_{i∈α} A_i. With each a ∈ A is associated a (possibly stochastic) reward R(a); the decision problem is cooperative since there is a single reward function R reflecting the utility assessment of all agents. The agents wish to choose actions that maximize reward. A randomized strategy for agent i is a distribution π ∈ Δ(A_i) (where Δ(A_i) is the set of distributions over agent i's action set A_i). Intuitively, π(a_i) denotes the probability of agent i selecting the individual action a_i. A strategy π is deterministic if π(a_i) = 1 for some a_i ∈ A_i. A strategy profile is a collection Π = {π_i : i ∈ α} of strategies, one for each agent i. The expected value of acting according to a fixed profile can easily be determined. If each π_i ∈ Π is deterministic, we can think of Π as a joint action. A reduced profile for agent i is a strategy profile for all agents but i (denoted Π_{−i}). Given a profile Π_{−i}, a strategy π_i is a best response for agent i if the expected value of the strategy profile Π_{−i} ∪ {π_i} is maximal for agent i; that is, agent i could not do better using any other strategy π_i'. Finally, we say that the strategy profile Π is a Nash equilibrium iff Π[i] (i's component of Π) is a best response to Π_{−i}, for every agent i. Note that in cooperative games, deterministic equilibria are easy to find. An equilibrium (or joint action) is optimal if no other has greater value.

As an example, consider the simple two-agent stage game:

         a0   a1
    b0    x    0
    b1    0    y

Agents A and B each have two actions at their disposal, a0, a1 and b0, b1, respectively. If x > y > 0, then (a0, b0) and (a1, b1) are both equilibria, but only the first is optimal.

¹ Most of our conclusions hold mutatis mutandis for sequential, multiagent Markov decision processes [2] with multiple states.
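To make these definitions concrete, the following sketch (ours, not the paper's) encodes an instance of the example game with x = 10 and y = 5 as a reward table over joint actions and checks which deterministic joint actions are equilibria and which of those are optimal; all names (REWARD, is_deterministic_equilibrium, and so on) are purely illustrative.

    # Minimal sketch (not from the paper): the example stage game with x = 10, y = 5.
    from itertools import product

    A_ACTIONS = ["a0", "a1"]
    B_ACTIONS = ["b0", "b1"]
    REWARD = {("a0", "b0"): 10, ("a0", "b1"): 0,
              ("a1", "b0"): 0,  ("a1", "b1"): 5}

    def is_deterministic_equilibrium(joint):
        # Neither agent can gain by unilaterally switching its individual action.
        a, b = joint
        best_for_A = max(REWARD[(alt, b)] for alt in A_ACTIONS)
        best_for_B = max(REWARD[(a, alt)] for alt in B_ACTIONS)
        return REWARD[joint] >= best_for_A and REWARD[joint] >= best_for_B

    equilibria = [j for j in product(A_ACTIONS, B_ACTIONS) if is_deterministic_equilibrium(j)]
    best = max(REWARD[j] for j in equilibria)
    print("equilibria:", equilibria)                          # (a0, b0) and (a1, b1)
    print("optimal:   ", [j for j in equilibria if REWARD[j] == best])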
2.2 Learning in Coordination Games

Action selection is more difficult if there are multiple optimal joint actions. If, for instance, x = y > 0, neither agent has a reason to prefer one or the other of its actions. If they choose them randomly, or in some way reflecting personal biases, then they risk choosing a suboptimal, or uncoordinated, joint action. The general problem of equilibrium selection can be addressed in several ways. We consider the notion that an equilibrium (i.e., a coordinated action choice in our cooperative setting) might be learned through repeated play of the game [4, 5, 7, 8].

One especially simple, yet often effective, model for achieving coordination is fictitious play [3, 4]. Each agent i keeps a count C^j_{a^j}, for each j ∈ α and a^j ∈ A_j, of the number of times agent j has used action a^j in the past. When the game is encountered, i treats the relative frequencies of each of j's moves as indicative of j's current (randomized) strategy. That is, for each agent j, i assumes j plays action a^j ∈ A_j with probability

    Pr^j_i(a^j) = C^j_{a^j} / (Σ_{b^j ∈ A_j} C^j_{b^j})

This set of strategies forms a reduced profile Π_{−i}, for which agent i adopts a best response. After the play, i updates its counts appropriately, given the actions used by the other agents. We think of these counts as reflecting the beliefs an agent has regarding the play of the other agents (initial counts can also be weighted to reflect priors).

This very simple adaptive strategy will converge to an equilibrium in our simple cooperative games, and can be made to converge to an optimal equilibrium [12, 1]; that is, the probability of a coordinated equilibrium after k interactions can be made arbitrarily high by increasing k sufficiently. It is also not hard to see that once the agents reach an equilibrium, they will remain there: each best response simply reinforces the beliefs of the other agents that the coordinated equilibrium remains in force.
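A minimal sketch of fictitious play for agent A in the game above, reusing the REWARD table and action lists from the previous sketch; the helper names are ours and the prior counts are arbitrary.

    # Sketch of fictitious play for agent A (reuses REWARD, A_ACTIONS, B_ACTIONS above).
    from collections import defaultdict

    counts_B = defaultdict(int)           # C^B_b: how often A has seen B play b
    counts_B["b0"] = counts_B["b1"] = 1   # optional prior weights

    def belief_about_B():
        total = sum(counts_B[b] for b in B_ACTIONS)
        return {b: counts_B[b] / total for b in B_ACTIONS}

    def best_response_A():
        # A's best response to its empirical model of B's randomized strategy.
        belief = belief_about_B()
        ev = {a: sum(belief[b] * REWARD[(a, b)] for b in B_ACTIONS) for a in A_ACTIONS}
        return max(ev, key=ev.get)

    def observe(b_played):
        counts_B[b_played] += 1           # update beliefs after each play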
We note that most game-theoretic models assume that each agent can observe the actions executed by its counterparts with certainty. As pointed out and addressed in [1, 6], this assumption is often unrealistic. We will be interested below in the more general case where each agent obtains an observation which is related stochastically to the actual joint action selected. Formally, we assume an observation set O and an observation model p : A → Δ(O). Intuitively, p(a)(o), which we write in the less cumbersome fashion Pr_a(o), denotes the probability of observation o being obtained by all agents when joint action a is performed. Each agent is aware of this function.²

² Generalizations of this model could also be used.

2.3 Reinforcement Learning

Action selection is more difficult still if agents are unaware of the rewards associated with various joint actions. In such a case, reinforcement learning can be used by the agents to estimate, based on past experience, the expected reward associated with individual or joint actions.

A simple, well-understood algorithm for single-agent learning is Q-learning [10]. In our stateless setting, we assume a Q-value, Q(a), that provides an estimate of the value of performing (individual or joint) action a. An agent updates its estimate Q(a) based on sample (a, r) as follows:

    Q(a) ← Q(a) + α(r − Q(a))                                   (1)

The sample (a, r) is the "experience" obtained by the agent: action a was performed, resulting in reward r. Here α is the learning rate (0 < α < 1), governing to what extent the new sample replaces the current estimate. If α is decreased "slowly" during learning and all actions are sampled infinitely often, Q-learning will converge to the true Q-values for all actions in the single-agent setting [10].

Convergence of Q-learning does not depend on the exploration strategy used. An agent can try its actions at any time; there is no requirement to perform actions that are currently estimated to be best. Of course, if we hope to enhance overall performance during learning, it makes sense (at least intuitively) to bias selection toward better actions. We can distinguish two forms of exploration. In nonexploitive exploration, an agent randomly chooses its actions with uniform probability. There is no attempt to use what was learned to improve performance; the aim is simply to learn Q-values. In exploitive exploration, an agent chooses its best estimated action with probability p_x, and chooses some other action with probability 1 − p_x. Often the exploitation probability p_x is increased slowly over time. We call a nonoptimal action choice an exploration step and 1 − p_x the exploration probability. Nonoptimal action selection can be uniform during exploration, or can be biased by the magnitudes of Q-values. A popular biased strategy is Boltzmann exploration: action a is chosen with probability

    e^{Q(a)/T} / Σ_{a'} e^{Q(a')/T}                              (2)

The temperature parameter T can be decreased over time so that the exploitation probability increases.
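The update rule (1) and Boltzmann selection rule (2) are small enough to state directly in code; the sketch below is ours and assumes Q is simply a dictionary mapping actions to estimated values.

    # Sketch of the stateless Q-learning update (1) and Boltzmann selection (2).
    import math
    import random

    def q_update(Q, action, reward, alpha):
        Q[action] += alpha * (reward - Q[action])                   # Equation (1)

    def boltzmann_choice(Q, temperature):
        actions = list(Q)
        weights = [math.exp(Q[a] / temperature) for a in actions]   # Equation (2)
        total = sum(weights)
        return random.choices(actions, weights=[w / total for w in weights])[0]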
There are two distinct ways in which Q-learning could be applied to a multiagent system. We say a MARL algorithm is an independent learner (IL) algorithm if the agents learn Q-values for their individual actions based on Equation (1). In other words, they perform their actions, obtain a reward and update their Q-values without regard to the actions performed by other agents. Experiences for agent i take the form (a_i, r), where a_i is the action performed by i and r is a reward. If an agent is unaware of the existence of other agents, cannot identify their actions, or has no reason to believe that other agents are acting strategically, then this is an appropriate method of learning. Of course, even if these conditions do not hold, an agent may choose to ignore information about the other agents' actions.

A joint action learner (JAL) is an agent that learns Q-values for joint actions as opposed to individual actions. The experiences for such an agent are of the form (a, r), where a is a joint action. This implies that each agent can observe the actions of other agents. We can generalize the picture slightly by allowing experiences of the form (a_i, o, r), where a_i is the action performed by i and o is its (joint action) observation.

For JALs, exploration strategies require some care. In the example above, if A currently has Q-values for all four joint actions, the expected value of performing a0 or a1 depends crucially on the strategy adopted by B. To determine the relative values of their individual actions, each agent in a JAL algorithm maintains beliefs about the strategies of other agents. Here we will use simple empirical distributions, possibly with biased initial weights as in fictitious play. Agent A, for instance, assumes that each other agent B will choose actions in accordance with A's current beliefs about B (i.e., A's empirical distribution over B's action choices). In general, agent i assesses the expected value of its individual action a^i to be

    EV(a^i) = Σ_{a ∈ A_{−i}} Q(a ∪ {a^i}) · Π_{j ≠ i} Pr^j_i(a[j])

where A_{−i} denotes the set of joint actions of the agents other than i. Agent i can use these values just as it would Q-values in implementing an exploration strategy.³

³ The expression for EV(a^i) makes the justifiable assumption that the other agents are selecting their actions independently. Less reasonable is the assumption that these choices are uncorrelated, or even correlated, with i's choices. Such correlations can often emerge due to the dynamics of belief updating without agents being aware of this correlation, especially if frequencies of particular joint actions are ignored.
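A sketch (ours) of how a JAL might compute EV(a^i) in a two-agent game: joint Q-values are weighted by the agent's empirical beliefs about the other agent, and the resulting values can be passed to the Boltzmann rule above in place of Q-values.

    # Sketch: a JAL (agent A) scores its individual actions by weighting joint
    # Q-values with its empirical beliefs about B, as in EV(a^i).
    def expected_values_for_A(Q_joint, belief_B, a_actions, b_actions):
        # Q_joint maps joint actions (a, b) to learned Q-values; belief_B maps
        # B's actions to A's believed probabilities (e.g., normalized counts).
        return {a: sum(belief_B[b] * Q_joint[(a, b)] for b in b_actions)
                for a in a_actions}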
Maintaining a belief distribution by means of fictitious play is problematic if agents have imprecise observational capabilities. Following [1], we use a simple Bayesian updating rule for beliefs:

    Pr(a[j] = a^j | a[i] = a^i, o) = Pr(o | a[j] = a^j, a[i] = a^i) · Pr(a[j] = a^j) / Pr(o | a[i] = a^i)

Agent i then updates its distribution over j's probabilities using this "stochastic observation"; in particular, the count C^j_{a^j} is incremented by Pr(a^j | o) (intuitively, by a "fractional" outcome).⁴

⁴ This essentially corresponds to using the empirical counts as Dirichlet parameters, and treating i's beliefs as a Dirichlet distribution over j's set of mixed strategies.
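A sketch of the fractional belief update, under the assumption (as in the text) that the observation model is known to every agent; here obs_model[(a_i, b)][o] stands for Pr(o | joint action (a_i, b)), and all names are ours.

    # Sketch of the "fractional" belief update under noisy observations.
    def fractional_belief_update(counts_B, belief_B, obs_model, a_i, o, b_actions):
        posterior = {b: obs_model[(a_i, b)][o] * belief_B[b] for b in b_actions}
        norm = sum(posterior.values())               # Pr(o | a_i) under current beliefs
        if norm > 0:
            for b in b_actions:
                counts_B[b] += posterior[b] / norm   # fractional count increment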
3 Comparing Independent and Joint-Action Learners

We first compare the relative performance of independent and joint-action learners on a simple coordination game of the form described above:

         a0   a1
    b0   10    0
    b1    0   10

The first thing to note is that ILs using nonexploitive exploration will not deem either of their choices (on average) to be better than the other. For instance, A's Q-values for both action a0 and a1 will converge to 5, since whenever, say, a0 is executed, there is a 0.5 probability of b0 and of b1 being executed. Of course, at any point, due to the stochastic nature of the strategies and the decay in learning rate, we would expect that the learned Q-values will not be identical; thus the agents, once they converge, might each have a reason to prefer one action to the other. Unfortunately, these biases need not be coordinated.

Rather than pursuing this direction, we consider the case where both the ILs and JALs use Boltzmann exploration. Exploitation of the known values allows the agents to "coordinate" in their choices for the same reasons that equilibrium learning methods work when agents know the reward structure. Figure 1 shows the probability of two ILs and JALs selecting an optimal joint action as a function of the number of interactions they have. The temperature parameter is T = 16 initially and is decayed by a factor of 0.9^t at the (t+1)st interaction. We see that ILs coordinate quite quickly. There is no preference for either equilibrium point: each of the two equilibria was attained in about half of the trials. We do not show convergence of Q-values, but note that the Q-values for the actions of the equilibria attained (e.g., (a0, b0)) tended to 10 while those of the other actions tended to 0. We note that the probability of optimal action selection does not increase smoothly within individual trials; the averaged probabilities reflect the likelihood of having reached an equilibrium by time t, as well as exploration probabilities. We also point out that much faster convergence can be had for different parameter settings (e.g., decaying the temperature T more rapidly). We defer general remarks on convergence to Section 5.

[Figure 1: Convergence of coordination for ILs and JALs (averaged over 100 trials).]

The figure also shows convergence for JALs under the same circumstances (full observability is assumed). JALs do perform somewhat better after a fixed number of interactions, as shown in the graph. While the JALs have more information at their disposal, convergence is not enhanced dramatically. In retrospect, this should not be too surprising. While JALs are able to distinguish the Q-values of different joint actions, their ability to use this information is circumscribed by the strictures of the action selection mechanism. An agent maintains beliefs about the strategy being played by the other agents and "exploits" actions according to expected value based on these beliefs. In other words, the value of individual actions "plugged in" to the exploration strategy is more or less the same as the Q-values learned by ILs; the only distinction is that JALs compute them using explicit belief distributions and joint Q-values instead of updating them directly. Thus, even though the agents may be fairly sure of the relative Q-values of joint actions, Boltzmann exploration does not let them exploit this.⁵

⁵ The key reason for the difference in ILs and JALs is the larger difference in Q-values for JALs, which biases Boltzmann exploration slightly more toward the estimated optimal action.
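For concreteness, a rough sketch (ours) of the Figure 1 setup for two independent learners on the coordination game above, with T = 16 decayed by a factor of 0.9 each interaction as stated in the text; the learning rate and trial length are our own choices, and q_update and boltzmann_choice are the helpers sketched earlier.

    # Rough sketch of the Figure 1 setup for two independent learners on the
    # coordination game above (T = 16, decayed by 0.9 each interaction).
    def run_il_trial(interactions=50, alpha=0.1):
        Q_A = {"a0": 0.0, "a1": 0.0}
        Q_B = {"b0": 0.0, "b1": 0.0}
        game = {("a0", "b0"): 10, ("a0", "b1"): 0,
                ("a1", "b0"): 0,  ("a1", "b1"): 10}
        T = 16.0
        for _ in range(interactions):
            a = boltzmann_choice(Q_A, T)
            b = boltzmann_choice(Q_B, T)
            r = game[(a, b)]
            q_update(Q_A, a, r, alpha)    # each IL ignores the other's action
            q_update(Q_B, b, r, alpha)
            T *= 0.9
        return Q_A, Q_B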
4 The Effects of Partial Action Observability

When JALs are in a situation where partial observability holds sway, the updating of Q-values should be undertaken with more care. Given an experience (a_i, o, r), where a_i is agent i's action and o is the resultant observation, there can be a number of joint actions a consistent with a_i and o, according to the observation model, that give rise to this experience. Intuitively, i should update the Q-values for each such a, something that conflicts with the usual Q-learning model. To deal with this, we propose the use of "fractional" updates of Q-values: the Q-value for joint action a will be updated by reward r, but the magnitude of this update will be "tempered" by the probability Pr(a | o, a_i) that a was taken given o and a_i.⁶ Specifically, agent i updates the joint Q-values for all actions a where a[i] = a_i using

    Q(a) ← Q(a) + α · Pr(a | o, a_i)(r − Q(a))                   (3)

where Pr(a | o, a_i) is computed in the obvious way by i using its beliefs and Bayes rule.

⁶ An alternative way of viewing this: a single Q(a) is chosen by the agent, with probability Pr(a | o, a_i), for update with reward r.

We first note that if agent i is learning in a stationary environment (i.e., all other agents play fixed, mixed strategies), then this update rule will converge; that is, the limiting Q-values are well defined, under the same conditions required of Q-learning. Informally:

Proposition 1. Convergence of Q-values using update rule (3) is assured in stationary environments.

We note that convergence of Q-values is enhanced when the "fractional" nature of the updates is accounted for when decaying the learning parameter α: we typically maintain a separate α_a for each joint action a and decay α_a using the total experience with a (i.e., weighting the decay rate using Pr(a | o, a_i)). We also note that convergence cannot generally be to the true Q-values of joint actions unless the observation model is perfect. In our running example, for instance, if observations do not permit agent A to distinguish b0 from b1 perfectly, then the Q-value for, say, (a0, b0) must lie somewhere between 0 and 10.

Figure 2 shows how convergence to equilibrium is influenced by the accuracy of the observation model in the game above. The model associates a unique observation o_a with each of the four joint actions a. The observation probability refers to Pr(o_a | a), and ranges from 0.25 (fully unobservable) to 1 (fully observable). If o ≠ o_a, then Pr(o | a) = (1 − Pr(o_a | a))/3. Rate of convergence refers to the expected number of interactions until convergence (i.e., until the probability of selecting an optimal joint action reaches 0.99). The more accurate the observation model, the more quickly the agents converge. Figure 3 shows the convergence of joint Q-values for observation probability 0.5.

[Figure 2: Time to convergence (optimal joint action with probability 0.99) as a function of observation probability (averaged over 1000 trials; α = 0.2, T = 16, T decayed at rate 0.9).]

[Figure 3: Convergence of Q-values with observation probability 0.5 (averaged over 1000 trials).]
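A sketch (ours) of update rule (3) with a separate learning rate per joint action; the particular decay schedule shown is only indicative of the "fractional" decay described above.

    # Sketch of update rule (3) with a separate, fractionally decayed learning
    # rate per joint action; the decay schedule shown is only indicative.
    def fractional_q_update(Q_joint, alpha_per_action, posterior, reward):
        # posterior maps each joint action a consistent with (a_i, o) to Pr(a | o, a_i).
        for a, p in posterior.items():
            Q_joint[a] += alpha_per_action[a] * p * (reward - Q_joint[a])   # Equation (3)
            # decay alpha_a in proportion to the fractional experience p with a
            alpha_per_action[a] /= 1.0 + 0.01 * p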
5 Convergence and Game Structure

We will argue below that Q-learning, in both the IL and JAL cases, will converge to an equilibrium under appropriate conditions. Of course, in general, convergence to an optimal equilibrium cannot be assured. Before making this case, we consider the ways in which the game structure can influence the dynamics of the learning process. Consider the game:

         a0   a1   a2
    b0   10    0    k
    b1    0    2    0
    b2    k    0   10

When k < 0, this game has three deterministic equilibria, of which two are preferred. If k = −100, agent A, during initial exploration, will find its first and third actions to be unattractive because of B's random exploration. If A is an IL, the average rewards (and hence Q-values) for a0 and a2 will be quite low; and if A is a JAL, its beliefs about B's strategy will afford these actions low expected value. Similar remarks apply to B, and the self-confirming nature of equilibria virtually assures convergence to (a1, b1). However, the closer k is to 0, the lower the likelihood that the agents will find their first and third actions unattractive: the stochastic nature of exploration means that, occasionally, these actions will have high estimated utility and convergence to one of the optimal equilibria will occur. Figure 4 shows how the probability of convergence to one of the optimal equilibria is influenced by the magnitude of the "penalty" k. Not surprisingly, different equilibria can be attained with different likelihoods.⁷

⁷ These results are shown for JALs, but the general pattern holds true for ILs as well.

[Figure 4: Likelihood of convergence to an optimal equilibrium as a function of penalty k (averaged over 100 trials).]

Thus far, our examples show agents proceeding on a direct route to equilibria (albeit at various rates, and with destinations "chosen" stochastically). Unfortunately, convergence is not so straightforward in general. Consider the game:

         a0   a1   a2
    b0   11  -30    0
    b1  -30    7    6
    b2    0    0    5

Initially, the two learners are almost certainly going to begin to play the nonequilibrium strategy profile (a2, b2). This is seen clearly in Figures 5 and 6. However, once they "settle" at this point, as long as exploration continues (here Boltzmann exploration is used), agent B will soon find b1 to be more attractive, so long as A continues to primarily choose a2. Once the nonequilibrium point (a2, b1) is attained, agent A tracks B's move and begins to perform action a1. Once this equilibrium is reached, the agents remain there. Figures 5 and 6 clearly show the agents settling into nonequilibria at various points and "climbing" toward an equilibrium over time along a best reply path. The convergence (and resting points) for the joint Q-values are shown in Figure 7.⁸

⁸ These results are based on Boltzmann exploration with a temperature setting of 1 and no decay; this is to induce considerable exploration, otherwise convergence is quite slow (see below). This explains why agent A converges to Pr(a1) = 0.7.

[Figure 5: Dynamics of A's strategy.]

[Figure 6: Dynamics of B's strategy.]

[Figure 7: Dynamics of joint Q-values (α = 0.1).]

This phenomenon will obtain in general, allowing one to conclude that the multiagent Q-learning schemes we have proposed will converge to equilibria almost surely, but only if certain conditions are met in the case of ILs. First, exploration must continue at any point with nonnegligible probability. This is required by Q-learning in any case, but is necessary here to ensure that a nonequilibrium is not prematurely accepted. Second, when agents are exploring, if the probability of both moving away from a nonequilibrium point simultaneously is sufficiently high, then we cannot expect agent A to notice that it has a new best reply. So, for instance, if we use uniform exploration over time, we cannot induce B to learn a good enough Q-value for its "break out" move to actually move away (e.g., every time b1 is tried, there is a high enough probability that A does a0 that b1 never looks better than b2).

We note that Boltzmann exploration with a decaying temperature parameter ensures infinite exploration. Furthermore, as the temperature decreases over time, so too does the probability of simultaneous exploration. Thus, at some finite time t (assuming bounded rewards), the probability of B executing b1 while A performs a2 will become high enough to render b1 more attractive after some finite number of experiences with b1. At this point, B will (with high probability) perform b1, allowing A to respond in a similar fashion. These arguments can be put together to show that, with proper exploration and proper use of best responses (i.e., ones that are at least asymptotically myopic [4]), we will eventually converge to an equilibrium. Informally:

Proposition 2. IL and JAL algorithms will converge to an equilibrium if sufficient, but decreasing, exploration is undertaken.⁹

⁹ We have not yet constructed a formal proof, but see no impediments to this conjecture. The convergence conjecture also relies on the absence of cycles in best reply paths of cooperative games.

We remark that this theoretical guarantee of convergence may not be of much practical value for sufficiently complicated games. The key difficulty is that convergence relies on the use of decaying exploration: this is necessary to ensure that the agents' estimated values are based (ultimately) on a stationary environment. This gradual decay, however, makes the time required to shift from the current entrenched strategy profile to a better profile rather long. If the agents initially settle on a profile that is a large distance (in terms of a best reply path) from an equilibrium, each shift required can take longer to occur because of the decay in exploration. Furthermore, as pointed out above, the probability of concurrent exploration may have to be sufficiently small to ensure that the expected value of a shift along the best reply path is greater than that of no such shift, which can introduce further delays in the process. We note that the longer these delays are, the lower the learning rate α becomes, requiring even more experience to overcome the initially biased estimated Q-values. Finally, the largest drawback lies in the fact that, even when JALs have perfect joint Q-values, beliefs based on a lot of experience require a considerable amount of contrary experience to be overcome. For example, once B has made the shift from b2 to b1 above, a significant amount of time is needed for A to switch from a2 to a1: it has to observe B performing b1 often enough to overcome the rather large degree of belief it had that B would continue doing b2.
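For reference, the two reward tables from this section written as dictionaries (our encoding), so they can be plugged into a repeated-play loop like the one sketched after Figure 1.

    # The penalty game (with parameter k) and the climbing game from this section.
    def penalty_game(k):
        return {("a0", "b0"): 10, ("a1", "b0"): 0, ("a2", "b0"): k,
                ("a0", "b1"): 0,  ("a1", "b1"): 2, ("a2", "b1"): 0,
                ("a0", "b2"): k,  ("a1", "b2"): 0, ("a2", "b2"): 10}

    CLIMBING_GAME = {("a0", "b0"): 11,  ("a1", "b0"): -30, ("a2", "b0"): 0,
                     ("a0", "b1"): -30, ("a1", "b1"): 7,   ("a2", "b1"): 6,
                     ("a0", "b2"): 0,   ("a1", "b2"): 0,   ("a2", "b2"): 5}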
6 Concluding Remarks

We have described two basic ways in which Q-learning can be applied in multiagent cooperative settings, and examined the impact of various features on the success of the interaction between equilibrium selection learning techniques and RL techniques. We have demonstrated that the integration requires some care, and that Q-learning is not nearly as robust as in single-agent settings. These considerations point toward a number of interesting avenues of research directed toward improving the performance of MARL. We remark on some of the directions we are currently pursuing. The first is the enhancement of practical convergence by the use of "windowing" methods for estimating beliefs in JALs. By using only the last k experiences to determine empirical belief distributions, the abatement of convergence caused by strongly held prior beliefs can be alleviated. We note that a memory-based technique of this sort might also be applied to ILs if reward experiences are stored and old experiences discounted in determining Q-values. The second avenue we are investigating is possible techniques for encouraging agents to converge to optimal equilibria. This is one area where JALs have a distinct advantage over ILs: even if they have converged to an equilibrium, they can tell, since they have access to joint Q-values, whether a better equilibrium exists. Coordination learning techniques (see, e.g., [1]) might then be applied, as could other exploration techniques that attempt to induce a shift from one equilibrium to another. Finally, we hope to explore the details associated with general, multistate sequential decision problems and investigate the application of generalization techniques in domains with large state spaces.

Acknowledgements: Thanks to Leslie Kaelbling and Michael Littman for their helpful discussions in the early stages of this work.

References

[1] C. Boutilier. Learning conventions in multiagent stochastic domains using likelihood estimates. UAI-96, pp. 106-114, Portland.
[2] C. Boutilier. Planning, learning and coordination in multiagent decision processes. TARK-96, pp. 195-210, Amsterdam.
[3] G. W. Brown. Iterative solution of games by fictitious play. In T. C. Koopmans (ed.), Activity Analysis of Production and Allocation. Wiley, 1951.
[4] D. Fudenberg and D. M. Kreps. Learning and Equilibrium in Strategic Form Games. CORE Foundation, Louvain, 1992.
[5] D. Fudenberg and D. K. Levine. Steady state learning and Nash equilibrium. Econometrica, 61(3):547-573, 1993.
[6] J. Hu and M. P. Wellman. Self-fulfilling bias in multiagent learning. ICMAS-96, pp. 118-125, Kyoto.
[7] E. Kalai and E. Lehrer. Rational learning leads to Nash equilibrium. Econometrica, 61(5):1019-1045, 1993.
[8] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. ML-94, pp. 157-163, New Brunswick.
[9] S. Sen, M. Sekaran, and J. Hale. Learning to coordinate without sharing information. AAAI-94, pp. 426-431, Seattle.
[10] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279-292, 1992.
[11] G. Weiß. Learning to coordinate actions in multi-agent systems. IJCAI-93, pp. 311-316, Chambery.
[12] H. Peyton Young. The evolution of conventions. Econometrica, 61(1):57-84, 1993.