From: AAAI Technical Report WS-02-06. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.

Scalable Learning in Stochastic Games

Michael Bowling and Manuela Veloso
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213-3891

Abstract

Stochastic games are a general model of interaction between multiple agents. They have recently been the focus of a great deal of research in reinforcement learning as they are both descriptive and have a well-defined Nash equilibrium solution. Most of this recent work, although very general, has only been applied to small games with at most hundreds of states. On the other hand, there are landmark results of learning being successfully applied to specific large and complex games such as Checkers and Backgammon. In this paper we describe a scalable learning algorithm for stochastic games that combines three separate ideas from reinforcement learning into a single algorithm. These ideas are tile coding for generalization, policy gradient ascent as the basic learning method, and our previous work on the WoLF ("Win or Learn Fast") variable learning rate to encourage convergence. We apply this algorithm to the intractably sized game-theoretic card game Goofspiel, showing preliminary results of learning in self-play. We demonstrate that policy gradient ascent can learn even in this highly non-stationary problem with simultaneous learning. We also show that the WoLF principle continues to have a converging effect even in large problems with approximation and generalization.

Introduction

We are interested in the problem of learning in multiagent environments. One of the main challenges with these environments is that other agents in the environment may be learning and adapting as well. These environments are, therefore, no longer stationary. They violate the Markov property that traditional single-agent behavior learning relies upon. The model of stochastic games captures these problems very well through explicit models of the reward functions of the other agents and their effects on transitions. Stochastic games are also a natural extension of Markov decision processes (MDPs) to multiple agents and so have attracted interest from the reinforcement learning community.

The problem of simultaneously finding optimal policies for stochastic games has been well studied in the field of game theory. The traditional solution concept is that of Nash equilibria, a policy for all the players where each is playing optimally with respect to the others. This concept is a powerful solution for these games even in a learning context, since no agent could learn a better policy when all the agents are playing an equilibrium. It is this foundation that has driven much of the recent work in applying reinforcement learning to stochastic games (Littman 1994; Hu & Wellman 1998; Singh, Kearns, & Mansour 2000; Littman 2001; Bowling & Veloso 2002a; Greenwald & Hall 2002). This work has thus far only been applied to small games with enumerable state and action spaces.

Historically, though, a number of landmark results in reinforcement learning have looked at learning in particular stochastic games that are not small, nor are their states easily enumerated. Samuel's checkers-playing program (Samuel 1967) and Tesauro's TD-Gammon (Tesauro 1995) are successful applications of learning in games with very large state spaces. Both of these results made generous use of generalization and approximation, which have not been used in the more recent work.
On the other hand, both TD-Gammon and Samuel's Checkers player only used deterministic strategies to play competitively, while Nash equilibria often require stochastic strategies.

We are interested in scaling some of the recent techniques based on the Nash equilibrium concept to games with intractable state spaces. Such a goal is not new. Singh and colleagues also described future work of applying their simple gradient techniques to problems with large or infinite state and action spaces (Singh, Kearns, & Mansour 2000). This paper examines some initial results in this direction. We first describe the formal definition of a stochastic game and the notion of equilibria. We then describe one particular very large, two-player, zero-sum stochastic game, Goofspiel. Our learning algorithm is described as the combination of three ideas from reinforcement learning: tile coding, policy gradients, and the WoLF principle. We then show results of our algorithm learning to play Goofspiel in self-play. Finally, we conclude with some future directions for this work.

Stochastic Games

A stochastic game is a tuple (n, S, A_{1..n}, T, R_{1..n}), where n is the number of agents, S is a set of states, A_i is the set of actions available to agent i (and A is the joint action space A_1 × ... × A_n), T is a transition function S × A × S → [0, 1], and R_i is the reward function for the ith agent, S × A → ℝ. This looks very similar to the MDP framework except that we have multiple agents selecting actions, and the next state and rewards depend on the joint action of the agents. Another important difference is that each agent has its own separate reward function. The goal for each agent is to select actions in order to maximize its discounted future rewards with discount factor γ.

Stochastic games are a very natural extension of MDPs to multiple agents. They are also an extension of matrix games to multiple states. Two example matrix games are shown in Figure 1. In these games there are two players; one selects a row and the other selects a column of the matrix. The entry of the matrix they jointly select determines the payoffs. The games in Figure 1 are zero-sum games, so the row player would receive the payoff in the matrix, and the column player would receive the negative of that payoff. In the general case (general-sum games), each player would have a separate matrix that determines their payoffs.

Figure 1: Matching Pennies and Rock-Paper-Scissors matrix games.

Each state in a stochastic game can be viewed as a matrix game with the payoffs for each joint action determined by the matrices R_i(s, a). After playing the matrix game and receiving their payoffs, the players are transitioned to another state (or matrix game) determined by their joint action. One can see that stochastic games then contain both MDPs and matrix games as subsets of the framework.

Stochastic Policies. Unlike in single-agent settings, deterministic policies in multiagent settings can often be exploited by the other agents. Consider the Matching Pennies matrix game shown in Figure 1. If the column player were to play either action deterministically, the row player could win every time. This requires us to consider mixed strategies and stochastic policies. A stochastic policy, π : S → PD(A_i), is a function that maps states to mixed strategies, which are probability distributions over the player's actions.
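To make the matrix-game notation concrete, the sketch below (our own illustration, not code from the paper) represents the Rock-Paper-Scissors game of Figure 1 as a payoff matrix for the row player, and shows why a deterministic strategy is exploitable while the uniform mixed strategy is not.

```python
# Row player's payoffs for Rock-Paper-Scissors; the column player receives the negative.
RPS = [[ 0, -1,  1],
       [ 1,  0, -1],
       [-1,  1,  0]]

def expected_payoff(row_mixed, col_mixed, payoffs=RPS):
    """Expected payoff to the row player when both players use mixed strategies."""
    return sum(p_r * p_c * payoffs[r][c]
               for r, p_r in enumerate(row_mixed)
               for c, p_c in enumerate(col_mixed))

uniform = [1 / 3, 1 / 3, 1 / 3]         # the equilibrium mixed strategy for this game
rock = [1.0, 0.0, 0.0]                  # a deterministic (pure) strategy
paper = [0.0, 1.0, 0.0]                 # the best response to rock

print(expected_payoff(rock, paper))     # -1.0: a pure strategy can be fully exploited
print(expected_payoff(uniform, paper))  #  0.0: the uniform mixed strategy cannot
```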
Nash Equilibria. Even with the concept of mixed strategies, there are still no optimal strategies that are independent of the other players' strategies. We can, though, define a notion of best response. A strategy is a best response to the other players' strategies if it is optimal given their strategies. The major advancement that has driven much of the development of matrix games, game theory, and even stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr. 1950). A Nash equilibrium is a collection of strategies for each of the players such that each player's strategy is a best response to the other players' strategies. So, no player can do better by changing strategies given that the other players also don't change strategies.

What makes the notion of equilibrium compelling is that all matrix games have such an equilibrium, possibly having multiple equilibria. Zero-sum, two-player games, where one player's payoffs are the negative of the other's, have a single Nash equilibrium.¹ In the zero-sum examples in Figure 1, both games have an equilibrium consisting of each player playing the mixed strategy where all the actions have equal probability. The concept of equilibria also extends to stochastic games. This is a non-trivial result, proven by Shapley (Shapley 1953) for zero-sum stochastic games and by Fink (Fink 1964) for general-sum stochastic games.

¹ There can actually be multiple equilibria, but they will all have equal payoffs and are interchangeable (Osborne & Rubinstein 1994).

Learning in Stochastic Games. Stochastic games have been the focus of recent research in the area of reinforcement learning. There are two different approaches being explored. The first is that of algorithms that explicitly learn equilibria through experience, independent of the other players' policies (Littman 1994; Hu & Wellman 1998; Greenwald & Hall 2002). These algorithms iteratively estimate value functions, and use them to compute an equilibrium for the game. A second approach is that of best-response learners (Claus & Boutilier 1998; Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002a). These learners explicitly optimize their reward with respect to the other players' (changing) policies. This approach, too, has a strong connection to equilibria. If these algorithms converge when playing each other, then they must do so to an equilibrium (Bowling & Veloso 2001).

Neither of these approaches, though, has been scaled beyond games with a few hundred states. Games with a very large number of states, or games with continuous state spaces, make state enumeration intractable. Since previous algorithms in their stated form require the enumeration of states either for policies or value functions, this is a major limitation. In this paper we examine learning in a very large stochastic game, using approximation and generalization techniques. Specifically, we will build on the idea of best-response learners using gradient techniques (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002a). We first describe an interesting game with an intractably large state space.

Goofspiel

Goofspiel (or The Game of Pure Strategy) was invented by Merrill Flood while at Princeton (Flood 1985). The game has numerous variations, but here we focus on the simple two-player, n-card version. Each player receives a suit of cards numbered 1 through n, and a third suit of cards is shuffled and placed face down as the deck. Each round the next card is flipped over from the deck, and the two players each select a card, placing it face down. The cards are revealed simultaneously and the player with the higher card wins the card from the deck, which is worth its number in points. If the players choose the same valued card, then neither player gets any points. Regardless of the winner, both players discard their chosen cards. This is repeated until the deck and the players' hands are exhausted. The winner is the player with the most points.
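These rules translate directly into a short simulator. The following sketch is our own illustration (not code from the paper); a strategy is represented as a function from the observable state to a card in hand.

```python
import random

def play_goofspiel(n, strategy1, strategy2, rng=random):
    """Play one n-card game of Goofspiel and return (points1, points2).

    A strategy maps (my_hand, opp_hand, deck_remaining, upcard) to a card in my_hand.
    """
    hand1 = set(range(1, n + 1))
    hand2 = set(range(1, n + 1))
    deck = list(range(1, n + 1))
    rng.shuffle(deck)                             # the third suit, shuffled face down
    points1 = points2 = 0
    for i, upcard in enumerate(deck):
        remaining = set(deck[i:])                 # cards not yet won, upcard included
        bid1 = strategy1(set(hand1), set(hand2), remaining, upcard)
        bid2 = strategy2(set(hand2), set(hand1), remaining, upcard)
        hand1.remove(bid1)
        hand2.remove(bid2)
        if bid1 > bid2:                           # higher bid wins the upcard's points
            points1 += upcard
        elif bid2 > bid1:
            points2 += upcard
        # on a tie neither player scores; both bids are discarded regardless
    return points1, points2

def random_strategy(my_hand, opp_hand, deck, upcard):
    """Uniformly random play -- the surprisingly strong baseline noted in the text."""
    return random.choice(sorted(my_hand))

print(play_goofspiel(13, random_strategy, random_strategy))
```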
This game has numerous interesting properties making it a very interesting step between toy problems and more realistic problems. First, notice that this game is zero-sum, and as with many zero-sum games any deterministic strategy can be soundly defeated. In this game, this is done by simply playing the card one higher than the other player's deterministically chosen card. Second, notice that the number of states and state-action pairs grows exponentially with the number of cards. The standard size of the game, n = 13, is so large that just storing one player's policy or Q-table would require approximately 2.5 terabytes of space. Just gathering data on all the state-action transitions would require well over 10^12 playings of the game. Table 1 shows the number of states and state-action pairs as well as the policy size for three different values of n. This game obviously requires some form of generalization to make learning possible. Another interesting property is that randomly selecting actions is a reasonably good policy. The worst-case values of the random policy, along with the worst-case values of the best deterministic policy, are also shown in Table 1.

 n   |S|          |S × A|      SIZEOF(π or Q)   VALUE(det)   VALUE(random)
 4   692          15150        ~59 KB           -2           -2.5
 8   3 × 10^6     1 × 10^7     ~47 MB           -20          -10.5
13   1 × 10^11    7 × 10^11    ~2.5 TB          -65          -28

Table 1: The approximate number of states and state-actions, and the size of a stochastic policy or Q-table, for Goofspiel depending on the number of cards, n. The VALUE columns list the worst-case value of the best deterministic policy and of the random policy, respectively.

This game can be described using the stochastic game model. The state is the current cards in the players' hands and the deck, along with the upturned card. The actions for a player are the cards in the player's hand. The transitions follow the rules as described, with an immediate reward going to the player who won the upturned card. Since the game has a finite end and we are interested in maximizing total reward, we can set the discount factor γ to be 1. Although equilibrium learning techniques such as Minimax-Q (Littman 1994) are guaranteed to find the game's equilibrium, they require maintaining a state-joint-action table of values. This table would require 20.1 terabytes to store for the n = 13 card game. We will now describe a best-response learning algorithm using approximation techniques to handle the enormous state space.

Three Ideas - One Algorithm

The algorithm we will use combines three separate ideas from reinforcement learning. The first is the idea of tile coding as a generalization for linear function approximation. The second is the use of a parameterized policy and learning as gradient ascent in the policy's parameter space. The final component is the use of a WoLF variable learning rate to adjust the gradient ascent step size. We will briefly overview these three techniques and then describe how they are combined into a reinforcement learning algorithm for Goofspiel.

Tile Coding. Tile coding (Sutton & Barto 1998), also known as CMACs, is a popular technique for creating a set of boolean features from a set of continuous features. In reinforcement learning, tile coding has been used extensively to create linear approximators of state-action values (e.g., (Stone & Sutton 2001)).

The basic idea is to lay offset grids or tilings over the multidimensional continuous feature space. A point in the continuous feature space will be in exactly one tile for each of the offset tilings. Each tile has an associated boolean variable, so the continuous feature vector gets mapped into a very high-dimensional boolean vector. In addition, nearby points will fall into the same tile for many of the offset tilings, and so share many of the same boolean variables in their resulting vector. This provides the important feature of generalization. An example of tile coding in a two-dimensional continuous space is shown in Figure 2. This example shows two overlapping tilings, and so any given point falls into two different tiles.

Figure 2: An example of tile coding a two-dimensional space with two overlapping tilings.

Another common trick with tile coding is the use of hashing to keep the number of parameters manageable. Each tile is hashed into a table of fixed size. Collisions are simply ignored, meaning that two unrelated tiles may share the same parameter. Hashing reduces the memory requirements with little loss in performance. This is because only a small fraction of the continuous space is actually needed or visited while learning, and so independent parameters for every tile are often not necessary. Hashing provides a means for using only the number of parameters the problem requires, while not knowing in advance which state-action pairs need parameters.
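A minimal tile coder with hashing, in the spirit of this description (our own simplified sketch, not the paper's implementation): each tiling shifts a grid by a different offset before quantizing, and each tile is hashed into a fixed-size table to yield the indices of the active boolean features.

```python
def active_tiles(features, num_tilings=8, tile_size=4, table_size=10**6):
    """Hashed tile indices (one per tiling) for an integer feature vector.

    Each tiling uses a grid of width `tile_size` shifted by a different offset,
    so nearby feature vectors share most, but not all, of their active tiles.
    """
    tiles = []
    for t in range(num_tilings):
        coords = tuple((f * num_tilings - t * tile_size) // (tile_size * num_tilings)
                       for f in features)
        # hash the (tiling, coordinates) pair into a fixed-size table; collisions are ignored
        tiles.append(hash((t,) + coords) % table_size)
    return tiles

# Nearby feature vectors activate mostly the same tiles -- this is the generalization.
a = set(active_tiles((3, 7, 11)))
b = set(active_tiles((3, 8, 11)))
print(len(a & b), "of", len(a), "tiles shared")
```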
Policy Gradient Ascent. Policy gradient techniques (Sutton et al. 2000; Baxter & Bartlett 2000) are a method of reinforcement learning with function approximation. Traditional approaches approximate a state-action value function, and result in a deterministic policy that selects the action with the maximum learned value. Alternatively, policy gradient approaches approximate a policy directly, and then use gradient ascent to adjust the parameters to maximize the policy's value. There are three good reasons for the latter approach. First, there is a whole body of theoretical work describing convergence problems when using a variety of value-based learning techniques with a variety of function approximation techniques (see (Gordon 2000) for a summary of these results). Second, value-based approaches learn deterministic policies, and as we mentioned earlier, deterministic policies in multiagent settings are often easily exploitable. Third, gradient techniques have been shown to be successful for simultaneous learning in matrix games (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002a).

We use the policy gradient technique presented by Sutton and colleagues (Sutton et al. 2000). Specifically, we will define a policy as a Gibbs distribution over a linear combination of features, such as those taken from a tile coding representation of state-actions. Let θ be a vector of the policy's parameters and φ_{sa} be a feature vector for state s and action a; then this defines a stochastic policy according to,

\pi(s, a) = \frac{e^{\theta \cdot \phi_{sa}}}{\sum_b e^{\theta \cdot \phi_{sb}}}.

Their main result was a convergence proof for the following policy iteration rule that updates a policy's parameters,

\theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi_k}(s) \sum_a \frac{\partial \pi_k(s, a)}{\partial \theta} f_{w_k}(s, a).   (1)

Here α_k is an appropriately decayed learning rate, and d^{π_k}(s) is state s's contribution to the policy's overall value. This contribution is defined differently depending on whether the average or discounted start-state reward criterion is used. f_{w_k}(s, a) is an independent approximation of Q^{π_k}(s, a), with parameters w, which is the expected value of taking action a from state s and then following the policy π_k. For a Gibbs distribution, Sutton and colleagues showed that for convergence this approximation should have the following form,

f_w(s, a) = w \cdot \left[ \phi_{sa} - \sum_b \pi(s, b) \phi_{sb} \right].

As they point out, this amounts to f_w being an approximation of the advantage function, A^π(s, a) = Q^π(s, a) − V^π(s), where V^π(s) is the value of following policy π from state s. It is this advantage function that we will estimate and use for gradient ascent.

Using this basic formulation we derive an on-line version of the learning rule, where the policy's weights are updated with each state visited. The total reward criterion for Goofspiel is identical to having γ = 1 in the discounted setting. So, d^π(s) is just the probability of visiting state s when following policy π. For the Gibbs distribution this is just,

\theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi_k}(s) \sum_a \phi_{sa} \, \pi(s, a) f_{w_k}(s, a).   (2)

Since we will be visiting states on-policy, this amounts to updating weights in proportion to how often each state is visited. By doing updates on-line as states are visited, we can simply drop this term from equation 2, resulting in,

\theta_{k+1} = \theta_k + \alpha_k \sum_a \phi_{sa} \, \pi(s, a) f_{w_k}(s, a).   (3)
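To make the update concrete, here is a sketch of the Gibbs policy over tiled features and the on-line gradient step of equation 3. This is our own illustration: it assumes an `active_tiles(s, a)` helper in the spirit of the tile-coding sketch above (adapted to take a state-action pair) and stores θ as a dict from tile index to weight; neither choice comes from the paper.

```python
import math

def gibbs_policy(theta, s, actions, active_tiles):
    """π(s, a) ∝ exp(θ·φ_sa), where φ_sa is the boolean tile vector for (s, a)."""
    prefs = {a: sum(theta.get(i, 0.0) for i in active_tiles(s, a)) for a in actions}
    m = max(prefs.values())                           # subtract the max for numerical stability
    exps = {a: math.exp(p - m) for a, p in prefs.items()}
    z = sum(exps.values())
    return {a: e / z for a, e in exps.items()}

def policy_gradient_step(theta, s, actions, advantage, alpha, active_tiles):
    """On-line update of equation 3: θ ← θ + α Σ_a φ_sa π(s, a) f_w(s, a)."""
    pi = gibbs_policy(theta, s, actions, active_tiles)
    for a in actions:
        step = alpha * pi[a] * advantage(s, a)        # advantage(s, a) estimates f_w(s, a)
        for i in active_tiles(s, a):                  # φ_sa is sparse: only active tiles move
            theta[i] = theta.get(i, 0.0) + step
    return theta
```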
Lastly, we will do the policy improvement step (updating θ) simultaneously with the value estimation step (updating w). We will do value estimation using gradient-descent Sarsa(0) (Sutton & Barto 1998) over the same feature space as the policy. Specifically, if at time k the system is in state s and takes action a, transitioning to state s' and then taking action a', we update the weight vector,

w_{k+1} = w_k + \beta_k \left( r + \gamma Q_{w_k}(s', a') - Q_{w_k}(s, a) \right) \phi_{sa}.   (4)

The policy improvement step uses equation 3, where s is the state of the system at time k, and the action-value estimates from Sarsa, Q_{w_k}, are used to compute the advantage term,

f_{w_k}(s, a) = Q_{w_k}(s, a) - \sum_b \pi(s, b \mid \theta_k) Q_{w_k}(s, b).

Win or Learn Fast

WoLF ("Win or Learn Fast") is a method for changing the learning rate to encourage convergence in a multiagent reinforcement learning scenario (Bowling & Veloso 2002a). Notice that the gradient ascent algorithm described does not account for the non-stationary environment that arises with simultaneous learning in stochastic games. All of the other agents' actions are simply assumed to be part of the environment and unchanging. WoLF provides a simple way to account for other agents by adjusting how quickly or slowly the agent changes its policy.

Since only the rate of learning is changed, algorithms that are guaranteed to find (locally) optimal policies in non-stationary environments retain this property even when using WoLF. In stochastic games with simultaneous learning, WoLF has both theoretical evidence (limited to two-player, two-action matrix games) and empirical evidence (experiments in matrix games, as well as smaller zero-sum and general-sum stochastic games) that it encourages convergence in algorithms that don't otherwise converge (Bowling & Veloso 2002a). The intuition for this technique is that a learner should adapt quickly when it is doing more poorly than expected. When it is doing better than expected, it should be cautious, since the other players are likely to change their policy. This implicitly accounts for other players that are learning, rather than other techniques that try to explicitly reason about their action choices.

The WoLF principle naturally lends itself to policy gradient techniques, where there is a well-defined learning rate, α_k. With WoLF we replace the original learning rate with two learning rates α^w < α^l, to be used when winning or losing, respectively. One determination of winning and losing that has been successful is to compare the value of the current policy, V^π(s), to the value of the average policy over time, V^{π̄}(s). With the policy gradient technique above we can define a similar rule that compares the approximate value, using Q_w, of the current weight vector θ with that of the average weight vector over time, θ̄. Specifically, we are "winning" if and only if,

\sum_a \pi(s, a \mid \theta) Q_w(s, a) > \sum_a \pi(s, a \mid \bar{\theta}) Q_w(s, a).   (5)

When winning in a particular state, we update the parameters for that state using α^w; otherwise we use α^l.
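Putting the pieces together, a single learning step might look like the sketch below. Again this is our own illustration, reusing the `gibbs_policy` and `policy_gradient_step` helpers from the previous sketch: a gradient-descent Sarsa(0) update for the critic's weights w (equation 4), the advantage term computed from Q_w, and the policy update of equation 3 with the step size chosen by the rule of inequality 5. The average parameter vector θ̄ is assumed to be maintained outside this sketch; the paper does not spell out the averaging.

```python
def q_value(w, s, a, active_tiles):
    """Linear value estimate Q_w(s, a) = w·φ_sa over the same boolean tile features."""
    return sum(w.get(i, 0.0) for i in active_tiles(s, a))

def learning_step(theta, theta_avg, w, s, a, r, s_next, a_next, actions,
                  active_tiles, alpha_w, alpha_l, beta, gamma=1.0):
    """One combined step: Sarsa(0) for w (eq. 4), WoLF rate choice (ineq. 5), actor update (eq. 3)."""
    # Critic: gradient-descent Sarsa(0) over the tile features (equation 4).
    delta = (r + gamma * q_value(w, s_next, a_next, active_tiles)
             - q_value(w, s, a, active_tiles))
    for i in active_tiles(s, a):
        w[i] = w.get(i, 0.0) + beta * delta

    # WoLF: "winning" iff the current policy values s more than the average policy does.
    pi_now = gibbs_policy(theta, s, actions, active_tiles)
    pi_avg = gibbs_policy(theta_avg, s, actions, active_tiles)
    v_now = sum(pi_now[b] * q_value(w, s, b, active_tiles) for b in actions)
    v_avg = sum(pi_avg[b] * q_value(w, s, b, active_tiles) for b in actions)
    alpha = alpha_w if v_now > v_avg else alpha_l     # learn slowly when winning (α^w < α^l)

    # Actor: advantage f_w(s, a) = Q_w(s, a) - Σ_b π(s, b) Q_w(s, b), then equation 3.
    advantage = lambda st, ac: q_value(w, st, ac, active_tiles) - v_now
    policy_gradient_step(theta, s, actions, advantage, alpha, active_tiles)
```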
Learning in Goofspiel

We combine these three techniques in the obvious way. Tile coding provides a large boolean feature vector for any state-action pair. This is used both for the parameterization of the policy and for the approximation of the policy's value, which is used to compute the policy's gradient. Gradient updates are then performed on both the policy, using equation 3, and the value estimate, using equation 4. WoLF is used to vary the learning rate α_k in the policy update according to the rule in inequality 5. This composition can essentially be thought of as an actor-critic method (Sutton & Barto 1998). Here the Gibbs distribution over the set of parameters is the actor, and the gradient-descent Sarsa(0) is the critic. Tile coding provides the necessary parameterization of the state. The WoLF principle adjusts how the actor changes policies based on the response from the critic.

The main detail yet to be explained, and where the algorithm is specifically adapted to Goofspiel, is in the tile coding. The method of tiling is extremely important to the overall performance of learning, as it is a powerful bias on what policies can and will be learned. The major decision to be made is how to represent the state as a vector of numbers and which of these numbers are tiled together. The first decision determines what states are distinguishable, and the second determines how generalization works across distinguishable states. Despite the importance of the tiling, we essentially selected what seemed like a reasonable tiling and used it throughout our results.

We represent a set of cards, either a player's hand or the deck, by five numbers, corresponding to the value of the card that is the minimum, lower quartile, median, upper quartile, and maximum. This provides information about the general shape of the set, which is what is important in Goofspiel. The other values used in the tiling are the value of the card that is being bid on and the card corresponding to the agent's action. An example of this process in the 13-card game is shown in Table 2. These values are combined together into three tilings. The first tiles together the quartiles describing the players' hands. The second tiles together the quartiles of the deck with the card available and the player's action. The last tiles together the quartiles of the opponent's hand with the card available and the player's action. The tilings use tile sizes equal to roughly half the number of cards in the game, with the number of tilings greater than the tile size so as to distinguish between any integer state values. Finally, these tiles were all then hashed into a table of size one million in order to keep the parameter space manageable. We don't suggest that this is a perfect or even good tiling for this domain, but as we will show, the results are still interesting.

My hand:          {1, 3, 4, 5, 6, 8, 11, 13}    → quartiles (1, 4, 6, 8, 13)
Opponent's hand:  {4, 5, 8, 9, 10, 11, 12, 13}  → quartiles (4, 8, 10, 11, 13)
Deck:             {1, 2, 3, 5, 9, 10, 11, 12}   → quartiles (1, 3, 9, 10, 12)
Card available: 11    Action: 3

(1, 4, 6, 8, 13), (4, 8, 10, 11, 13), (1, 3, 9, 10, 12), 11, 3  → (tile coding) →  tiles ∈ {0, 1}^(10^6)

Table 2: An example state-action representation using quartiles to describe the players' hands and the deck. These numbers are then tiled and hashed, with the resulting tiles representing a boolean vector of size 10^6.
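The five-number summary is straightforward to compute. The sketch below is our own illustration; the paper does not specify its exact quartile convention, so we chose an index-based one that reproduces the example in Table 2. The resulting integer vector is what gets grouped into the three tilings and tile-coded.

```python
def five_number_summary(cards):
    """(min, lower quartile, median, upper quartile, max) of a set of card values."""
    xs = sorted(cards)
    n = len(xs)
    # a symmetric index-based convention; it reproduces the example in Table 2
    return (xs[0], xs[n // 4], xs[n // 2], xs[n - 1 - n // 4], xs[-1])

def state_action_features(my_hand, opp_hand, deck, upcard, action):
    """Integer feature vector summarizing the state and action, ready for tile coding."""
    return (five_number_summary(my_hand)
            + five_number_summary(opp_hand)
            + five_number_summary(deck)
            + (upcard, action))

print(state_action_features({1, 3, 4, 5, 6, 8, 11, 13},
                            {4, 5, 8, 9, 10, 11, 12, 13},
                            {1, 2, 3, 5, 9, 10, 11, 12}, 11, 3))
```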
Results

One of the difficult and open issues in multiagent reinforcement learning is that of evaluation. Before presenting learning results, we first need to look at how one evaluates learning success.

Evaluation

One straightforward evaluation technique is to have two learning algorithms learn against each other and simply examine the expected reward over time. This technique is not useful if one is interested in learning in self-play, where both players use an identical algorithm. In this case, with a symmetric zero-sum game like Goofspiel, the expected reward of the two agents is necessarily zero, providing no information.

Another common evaluation criterion is that of convergence. This is true in single-agent learning as well as multiagent learning. One strong motivation for considering this criterion in multiagent domains is the connection of convergence to Nash equilibria. If algorithms that are guaranteed to converge to optimal policies in stationary environments converge in a multiagent learning environment, then the resulting joint policy must be a Nash equilibrium of the stochastic game (Bowling & Veloso 2002a). Although convergence to an equilibrium is an ideal criterion for small problems, there are a number of reasons why this is unlikely to be possible for large problems. First, optimality in large (even stationary) environments is not generally feasible. This is exactly the motivation for exploring function approximation and policy parameterizations. Second, when we account for the limitations that approximation imposes on a player's policy, then equilibria may cease to exist, making convergence of policies impossible (Bowling & Veloso 2002b). Third, policy gradient techniques learn only locally optimal policies. They may converge to policies that are not globally optimal and therefore necessarily not equilibria.

Although convergence to equilibria, and therefore convergence in general, is not a reasonable criterion, we would still expect self-play learning agents to learn something. In this paper we use the evaluation technique used by Littman with Minimax-Q (Littman 1994). We train an agent in self-play, then freeze its policy and train a challenger to find that policy's worst-case performance. This challenger is trained using just gradient-descent Sarsa and chooses the action with maximum estimated value with ε-greedy exploration. Notice that the possible policies playable by the challenger are the deterministic policies (modulo exploration) playable by the learning algorithm being evaluated. Since Goofspiel is a symmetric zero-sum game, we know that the equilibrium policy, if one exists, would have value zero against its challenger. So, this provides some measure of how close the policy is to the equilibrium, by examining its value against its challenger.
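In outline, this worst-case evaluation looks like the following sketch (our own illustration, not the paper's code). The `train_challenger` and `evaluate` callables are assumed hooks standing in for routines the text describes, Sarsa(0) training against the frozen policy and averaging the payoff over evaluation games; the default game counts and exploration rate are those used in the experiments.

```python
import random

def challenger_action(w, s, actions, active_tiles, epsilon=0.05, rng=random):
    """ε-greedy challenger: greedy in its learned Q_w with a small exploration rate."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: sum(w.get(i, 0.0) for i in active_tiles(s, a)))

def worst_case_value(frozen_policy, train_challenger, evaluate,
                     challenger_games=10_000, eval_games=1_000):
    """Estimate a frozen policy's worst-case value by playing it against a trained challenger."""
    challenger = train_challenger(frozen_policy, challenger_games)
    return evaluate(frozen_policy, challenger, eval_games)
```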
A second, related criterion will also help in understanding the performance of the algorithm. Although policy convergence might not be possible, convergence of the expected value of the agents' policies may be possible. Since the real desirability of policy convergence is the convergence of the policy's value, this is in fact often just as good. This is also one of the strengths of the WoLF variable learning rate, as it has been shown to make learning algorithms with cycling policies and expected values converge both in expected value and in policy.

Experiments

Throughout our experiments, we examined three different learning algorithms in self-play. The first two did not use the WoLF variable learning rate, and instead followed a static step size. "Fast" used a large step size, α_k = 0.16; "Slow" used a small step size, α_k = 0.008; "WoLF" switched between these learning rates based on inequality 5. In all experiments, the value estimation update used a fixed learning rate of β = 0.2. These rates were not decayed, in order to better isolate the algorithms' effectiveness apart from appropriate selection of decay schedules. In addition, throughout training and evaluation runs, all agents followed an ε-greedy exploration strategy with ε = 0.05. The initial policies and values all begin with zero weight vectors, which with a Gibbs distribution corresponds to the random policy, which as we have noted is reasonably good.

In our first experiment we trained the learner in self-play for 40,000 games. After every 5,000 games we stopped the training and trained a challenger against the agent's current policy. The challenger was trained on 10,000 games using Sarsa(0) gradient ascent with the learning rate parameters described above. The two policies, the agent's and its challenger's, were then evaluated on 1,000 games to estimate the policy's worst-case expected value. This experiment was repeated thirty times for each algorithm.

The learning results, averaged over the thirty runs, are shown in Figure 3 for card sizes of 4, 8, and 13. The baseline comparison is with the random policy, a very competitive policy for this game. All three learners improve on this policy while training in self-play. The initial dips in the 8- and 13-card games are due to the fact that value estimates are initially very poor, making the initial policy gradients not in the direction of increasing the overall value of the policy. It takes a number of training games for the delayed reward of winning cards later to overcome the initial immediate reward of winning cards now. Lastly, notice the effect of the WoLF principle. It consistently outperforms the two static step size learners. This is identical to effects shown in non-approximated stochastic games (Bowling & Veloso 2002a).

The second experiment was to further examine the issue of convergence and the effect of the WoLF principle on the learning process. Instead of examining worst-case performance against some fictitious challenger, we now examine the expected value of the player's policy while learning in self-play. Again the algorithm was trained in self-play for 40,000 games. After every 50 games, both players' policies were frozen and evaluated over 1,000 games to find the expected value to the players at that moment. We ran each algorithm once on just the 13-card game and plotted its expected value over time while learning. The results are shown in Figure 4. Notice that the expected values of all the learning algorithms have some oscillation around zero. We would expect this with identical learners in a symmetric zero-sum game. The point of interest, though, is how close these oscillations stay to zero over time. The WoLF principle causes the policies to have a more constant expected value with lower-amplitude oscillations. This again shows that the WoLF principle continues to have converging effects even in stochastic games with approximation techniques.
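For reference, the training and evaluation protocol of the first experiment, with the step sizes reported above, can be sketched as follows. The constant names and the two hook callables are our own; we also assume the WoLF learner pairs the slow rate with winning and the fast rate with losing, following the α^w < α^l convention of the WoLF principle.

```python
# Step sizes and exploration settings as reported above (names are ours).
LEARNERS = {
    "Fast": dict(alpha_w=0.16,  alpha_l=0.16),   # static large step size
    "Slow": dict(alpha_w=0.008, alpha_l=0.008),  # static small step size
    "WoLF": dict(alpha_w=0.008, alpha_l=0.16),   # switches by inequality 5
}
BETA, EPSILON = 0.2, 0.05                        # critic step size and ε-greedy exploration
TRAIN_GAMES, EVAL_EVERY = 40_000, 5_000
CHALLENGER_GAMES, EVAL_GAMES = 10_000, 1_000

def first_experiment(train_one_game, challenger_value):
    """Driver for the first experiment; the two callables are assumed hooks that play
    one self-play training game and run the challenger evaluation, respectively."""
    curves = {}
    for name, rates in LEARNERS.items():
        values = []
        for game in range(1, TRAIN_GAMES + 1):
            train_one_game(rates, beta=BETA, epsilon=EPSILON)
            if game % EVAL_EVERY == 0:
                values.append(challenger_value(CHALLENGER_GAMES, EVAL_GAMES))
        curves[name] = values
    return curves
```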
Figure 3: Worst-case expected value of the policy learned in self-play. (Panels for n = 4, 8, and 13; x-axis: number of training games; curves: WoLF, Fast, Slow, and Random.)

Figure 4: Expected value of the game while learning. (Panels for Fast, Slow, and WoLF; x-axis: number of games.)

Conclusion

We have described a scalable learning algorithm for stochastic games, composed of three reinforcement learning ideas. We showed preliminary results of this algorithm learning in the game Goofspiel. These results demonstrate that the policy gradient approach using an actor-critic model can learn in this domain. In addition, the WoLF principle for encouraging convergence also seems to hold even when using approximation and generalization techniques.

There are a number of directions for future work. Within the game of Goofspiel, it would be interesting to explore alternative ways of tiling the state-action space. This could likely increase the overall performance of the learned policy, but would also examine how generalization might affect the convergence of learning. Might certain generalization techniques retain the existence of equilibria, and is the equilibrium learnable? Another important direction is to examine these techniques on more domains, with possibly continuous state and action spaces. Also, it would be interesting to vary some of the components of the system. Can we use a different approximator than tile coding? Do we achieve similar results with different policy gradient techniques (e.g., GPOMDP (Baxter & Bartlett 2000))? These initial results, though, show promise that gradient ascent and the WoLF principle can scale to large state spaces.

References

Baxter, J., and Bartlett, P. L. 2000. Reinforcement learning in POMDPs via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, 41-48. Stanford University: Morgan Kaufmann.

Bowling, M., and Veloso, M. 2001. Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 1021-1026.

Bowling, M., and Veloso, M. 2002a. Multiagent learning using a variable learning rate. Artificial Intelligence. In press.

Bowling, M., and Veloso, M. M. 2002b. Existence of multiagent equilibria with limited agents. Technical Report CMU-CS-02-104, Computer Science Department, Carnegie Mellon University.

Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.

Fink, A. M. 1964. Equilibrium in a stochastic n-person game. Journal of Science of the Hiroshima University, Series A-I 28:89-93.

Flood, M. 1985. Interview by Albert Tucker. The Princeton Mathematics Community in the 1930s, Transcript Number 11.
Gordon, G. 2000. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems 12. MIT Press.

Greenwald, A., and Hall, K. 2002. Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents. In press.

Hu, J., and Wellman, M. P. 1998. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, 242-250. San Francisco: Morgan Kaufmann.

Kuhn, H. W., ed. 1997. Classics in Game Theory. Princeton University Press.

Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, 157-163. Morgan Kaufmann.

Littman, M. 2001. Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning, 322-328. Williams College: Morgan Kaufmann.

Nash, Jr., J. F. 1950. Equilibrium points in n-person games. PNAS 36:48-49. Reprinted in (Kuhn 1997).

Osborne, M. J., and Rubinstein, A. 1994. A Course in Game Theory. The MIT Press.

Samuel, A. L. 1967. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development 11:601-617.

Shapley, L. S. 1953. Stochastic games. PNAS 39:1095-1100. Reprinted in (Kuhn 1997).

Singh, S.; Kearns, M.; and Mansour, Y. 2000. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 541-548. Morgan Kaufmann.

Stone, P., and Sutton, R. 2001. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the Eighteenth International Conference on Machine Learning, 537-544. Williams College: Morgan Kaufmann.

Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. MIT Press.

Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12. MIT Press.

Tesauro, G. J. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38:48-68.