Scalable Learning in Stochastic Games

Michael Bowling and Manuela Veloso
Computer Science Department
Carnegie Mellon University
Pittsburgh, PA 15213-3891
Abstract
Stochastic games are a general model of interaction between multiple agents. They have recently been the focus of a great deal of research in reinforcement learning as they are both descriptive and have a well-defined Nash equilibrium solution. Most of this recent work, although very general, has only been applied to small games with at most hundreds of states. On the other hand, there are landmark results of learning being successfully applied to specific large and complex games such as Checkers and Backgammon. In this paper we describe a scalable learning algorithm for stochastic games that combines three separate ideas from reinforcement learning into a single algorithm. These ideas are tile coding for generalization, policy gradient ascent as the basic learning method, and our previous work on the WoLF ("Win or Learn Fast") variable learning rate to encourage convergence. We apply this algorithm to the intractably sized game-theoretic card game Goofspiel, showing preliminary results of learning in self-play. We demonstrate that policy gradient ascent can learn even in this highly non-stationary problem with simultaneous learning. We also show that the WoLF principle continues to have a converging effect even in large problems with approximation and generalization.
Introduction
We are interested in the problem of learning in multiagent environments. One of the main challenges with these environments is that other agents in the environment may be learning and adapting as well. These environments are, therefore, no longer stationary. They violate the Markov property that traditional single-agent behavior learning relies upon.

The model of stochastic games captures these problems very well through explicit models of the reward functions of the other agents and their effects on transitions. They are also a natural extension of Markov decision processes (MDPs) to multiple agents and so have attracted interest from the reinforcement learning community. The problem of simultaneously finding optimal policies for stochastic games has been well studied in the field of game theory. The traditional solution concept is that of Nash equilibria, a policy for all the players where each is playing optimally with
respect to the others. This concept is a powerful solution for these games even in a learning context, since no agent could learn a better policy when all the agents are playing an equilibrium.
It is this foundation that has driven much of the recent work in applying reinforcement learning to stochastic games (Littman 1994; Hu & Wellman 1998; Singh, Kearns, & Mansour 2000; Littman 2001; Bowling & Veloso 2002a; Greenwald & Hall 2002). This work has thus far only been applied to small games with enumerable state and action spaces. Historically, though, a number of landmark results in reinforcement learning have looked at learning in particular stochastic games that are neither small nor easily enumerated. Samuel's Checkers playing program (Samuel 1967) and Tesauro's TD-Gammon (Tesauro 1995) are successful applications of learning in games with very large state spaces. Both of these results made generous use of generalization and approximation, which have not been used in the more recent work. On the other hand, both TD-Gammon and Samuel's Checkers player only used deterministic strategies to play competitively, while Nash equilibria often require stochastic strategies.
We are interested in scaling some of the recent techniques based on the Nash equilibrium concept to games with intractable state spaces. Such a goal is not new. Singh and colleagues also described future work of applying their simple gradient techniques to problems with large or infinite state and action spaces (Singh, Kearns, & Mansour 2000). This paper examines some initial results in this direction. We first describe the formal definition of a stochastic game and the notion of equilibria. We then describe one particular very large, two-player, zero-sum stochastic game, Goofspiel. Our learning algorithm is described as the combination of three ideas from reinforcement learning: tile coding, policy gradients, and the WoLF principle. We then show results of our algorithm learning to play Goofspiel in self-play. Finally, we conclude with some future directions for this work.
Stochastic Games
A stochastic game is a tuple (n, S, A_{1..n}, T, R_{1..n}), where n is the number of agents, S is a set of states, A_i is the set of actions available to agent i (and A is the joint action space A_1 × ... × A_n), T is a transition function S × A × S → [0, 1], and R_i is a reward function for the ith agent, S × A → R. This looks very similar to the MDP framework except we have multiple agents selecting actions, and the next state and rewards depend on the joint action of the agents. Another important difference is that each agent has its own separate reward function. The goal for each agent is to select actions in order to maximize its discounted future rewards with discount factor γ.
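As an illustration only (not from the paper), the tuple can be written down directly as a small container of callables; the field names below are our own, and the transition is given as a sampler rather than an explicit probability table:

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple

State = Any      # an element of S
Action = int     # an element of some agent's action set A_i

@dataclass
class StochasticGame:
    """Illustrative container for the tuple (n, S, A_1..n, T, R_1..n)."""
    n_agents: int                                                 # n
    actions: Callable[[State, int], Sequence[Action]]             # A_i: legal actions for agent i in state s
    transition: Callable[[State, Tuple[Action, ...]], State]      # T, written here as a sampler of the next state
    rewards: Callable[[State, Tuple[Action, ...]], List[float]]   # one reward per agent for a joint action, R_1..n
    discount: float = 1.0                                         # γ
```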
Stochastic games are a very natural extension of MDPs to multiple agents. They are also an extension of matrix games to multiple states. Two example matrix games are shown in Figure 1. In these games there are two players; one selects a row and the other selects a column of the matrix. The entry of the matrix they jointly select determines the payoffs. The games in Figure 1 are zero-sum games, so the row player would receive the payoff in the matrix, and the column player would receive the negative of that payoff. In the general case (general-sum games), each player would have a separate matrix that determines their payoffs.
[Figure 1: Matching Pennies and Rock-Paper-Scissors matrix games (payoff matrices for the row player).]
Each state in a stochastic game can be viewed as a matrix game, with the payoffs for each joint action determined by the matrices R_i(s, a). After playing the matrix game and receiving their payoffs, the players are transitioned to another state (or matrix game) determined by their joint action. One can see that stochastic games then contain both MDPs and matrix games as subsets of the framework.
Stochastic Policies. Unlike in single-agent settings, deterministic policies in multiagent settings can often be exploited by the other agents. Consider the Matching Pennies matrix game as shown in Figure 1. If the column player were to play either action deterministically, the row player could win every time. This requires us to consider mixed strategies and stochastic policies. A stochastic policy, π : S → PD(A_i), is a function that maps states to mixed strategies, which are probability distributions over the player's actions.
Nash Equilibria. Even with the concept of mixed strategies there are still no optimal strategies that are independent of the other players' strategies. We can, though, define a notion of best-response. A strategy is a best-response to the other players' strategies if it is optimal given their strategies. The major advancement that has driven much of the development of matrix games, game theory, and even stochastic games is the notion of a best-response equilibrium, or Nash equilibrium (Nash, Jr. 1950).
A Nash equilibrium is a collection of strategies for each of the players such that each player's strategy is a best-response to the other players' strategies. So, no player can do better by changing strategies given that the other players also don't change strategies. What makes the notion of equilibrium compelling is that all matrix games have such an equilibrium, possibly having multiple equilibria. Zero-sum, two-player games, where one player's payoffs are the negative of the other, have a single Nash equilibrium.¹ In the zero-sum examples in Figure 1, both games have an equilibrium consisting of each player playing the mixed strategy where all the actions have equal probability.

¹There can actually be multiple equilibria, but they will all have equal payoffs and are interchangeable (Osborne & Rubinstein 1994).
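As a quick check of that claim (our own illustration, using the standard row-player payoffs for Rock-Paper-Scissors), every action earns the same expected payoff against a uniformly random opponent, so no unilateral deviation can improve on the uniform strategy:

```python
# Row player's payoffs in Rock-Paper-Scissors: 0 = tie, 1 = win, -1 = loss.
RPS = [[0, -1, 1],
       [1, 0, -1],
       [-1, 1, 0]]

uniform = [1 / 3, 1 / 3, 1 / 3]
# Expected payoff of each pure row action against a uniformly mixing column player.
expected = [sum(p * payoff for p, payoff in zip(uniform, row)) for row in RPS]
print(expected)   # [0.0, 0.0, 0.0] -- every action is equally good, so uniform is a best response
```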
The concept of equilibria also extends to stochastic games. This is a non-trivial result, proven by Shapley (Shapley 1953) for zero-sum stochastic games and by Fink (Fink 1964) for general-sum stochastic games.
Learning in Stochastic Games. Stochastic games have been the focus of recent research in the area of reinforcement learning. There are two different approaches being explored. The first is that of algorithms that explicitly learn equilibria through experience, independent of the other players' policies (Littman 1994; Hu & Wellman 1998; Greenwald & Hall 2002). These algorithms iteratively estimate value functions, and use them to compute an equilibrium for the game. A second approach is that of best-response learners (Claus & Boutilier 1998; Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002a). These learners explicitly optimize their reward with respect to the other players' (changing) policies. This approach, too, has a strong connection to equilibria. If these algorithms converge when playing each other, then they must do so to an equilibrium (Bowling & Veloso 2001).

Neither of these approaches, though, has been scaled beyond games with a few hundred states. Games with a very large number of states, or games with continuous state spaces, make state enumeration intractable. Since previous algorithms in their stated form require the enumeration of states either for policies or value functions, this is a major limitation. In this paper we examine learning in a very large stochastic game, using approximation and generalization techniques. Specifically, we will build on the idea of best-response learners using gradient techniques (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002a). We first describe an interesting game with an intractably large state space.
Goofspiel
Goofspiel (or The Game of Pure Strategy) was invented by Merrill Flood while at Princeton (Flood 1985). The game has numerous variations, but here we focus on the simple two-player, n-card version. Each player receives a suit of cards numbered 1 through n, and a third suit of cards is shuffled and placed face down as the deck. Each round the next card is flipped over from the deck, and the two players each select a card, placing it face down. They are revealed simultaneously and the player with the highest card wins the card from the deck, which is worth its number in points. If the players choose the same valued card, then neither player gets any points. Regardless of the winner, both players discard their chosen card. This is repeated until the deck and players' hands are exhausted. The winner is the player with the most points.
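To make the rules concrete, the following is a small, illustrative simulator of one n-card game (not from the paper). Both players are passed in as arbitrary policies; a uniformly random bidder is included since the random policy is used as a baseline later in the paper.

```python
import random

def random_policy(hand, opp_hand, deck, upcard, rng):
    """Baseline: bid a uniformly random card from the hand."""
    return rng.choice(sorted(hand))

def play_goofspiel(n, policy1, policy2, seed=None):
    """Play one n-card game of Goofspiel and return (points1, points2)."""
    rng = random.Random(seed)
    hand1 = set(range(1, n + 1))
    hand2 = set(range(1, n + 1))
    deck = list(range(1, n + 1))
    rng.shuffle(deck)
    points1 = points2 = 0
    for i, upcard in enumerate(deck):          # flip the next prize card
        remaining_deck = set(deck[i + 1:])
        bid1 = policy1(hand1, hand2, remaining_deck, upcard, rng)
        bid2 = policy2(hand2, hand1, remaining_deck, upcard, rng)
        hand1.remove(bid1)                     # both bids are discarded regardless of outcome
        hand2.remove(bid2)
        if bid1 > bid2:
            points1 += upcard                  # higher bid wins the prize card's points
        elif bid2 > bid1:
            points2 += upcard
        # equal bids: nobody scores the prize
    return points1, points2

if __name__ == "__main__":
    # Example: estimate the random-vs-random point difference in the 13-card game.
    games = 1000
    total = 0
    for g in range(games):
        p1, p2 = play_goofspiel(13, random_policy, random_policy, seed=g)
        total += p1 - p2
    print("average point difference over", games, "games:", total / games)
```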
This game has numerous properties that make it an interesting step between toy problems and more realistic problems. First, notice that this game is zero-sum, and as with many zero-sum games any deterministic strategy can be soundly defeated; in this game, simply by playing the card one higher than the other player's deterministically chosen card. Second, notice that the number of states and state-action pairs grows exponentially with the number of cards. The standard size of the game, n = 13, is so large that just storing one player's policy or Q-table would require approximately 2.5 terabytes of space. Just gathering data on all the state-action transitions would require well over 10^12 playings of the game. Table 1 shows the number of states and state-action pairs as well as the policy size for three different values of n. This game obviously requires some form of generalization to make learning possible. Another interesting property is that randomly selecting actions is a reasonably good policy. The worst-case values of the random policy along with the worst-case values of the best deterministic policy are also shown in Table 1.
This game can be described using the stochastic game model. The state is the current cards in the players' hands and deck along with the upturned card. The actions for a player are the cards in the player's hand. The transitions follow the rules as described, with an immediate reward going to the player who won the upturned card. Since the game has a finite end and we are interested in maximizing total reward, we can set the discount factor γ to be 1. Although equilibrium learning techniques such as Minimax-Q (Littman 1994) are guaranteed to find the game's equilibrium, they require maintaining a state-joint-action table of values. This table would require 20.1 terabytes to store for the n = 13 card game. We will now describe a best-response learning algorithm using approximation techniques to handle the enormous state space.
Three Ideas - One Algorithm
The algorithm we will use combines three separate ideas from reinforcement learning. The first is the idea of tile coding as a generalization for linear function approximation. The second is the use of a parameterized policy and learning as gradient ascent in the policy's parameter space. The final component is the use of a WoLF variable learning rate to adjust the gradient ascent step size. We will briefly overview these three techniques and then describe how they are combined into a reinforcement learning algorithm for Goofspiel.
Tile Coding
Tile coding (Sutton & Barto 1998), also known as CMACs, is a popular technique for creating a set of boolean features from a set of continuous features. In reinforcement learning, tile coding has been used extensively to create linear approximators of state-action values (e.g., (Stone & Sutton 2001)).
[Figure 2: An example of tile coding a two-dimensional space with two overlapping tilings.]
The basic idea is to lay offset grids or tilings over the multidimensional continuous feature space. A point in the continuous feature space will be in exactly one tile for each of the offset tilings. Each tile has an associated boolean variable, so the continuous feature vector gets mapped into a very high-dimensional boolean vector. In addition, nearby points will fall into the same tile for many of the offset grids, and so share many of the same boolean variables in their resulting vector. This provides the important feature of generalization. An example of tile coding in a two-dimensional continuous space is shown in Figure 2. This example shows two overlapping tilings, and so any given point falls into two different tiles.
Another common trick with tile coding is the use of hashing to keep the number of parameters manageable. Each tile is hashed into a table of fixed size. Collisions are simply ignored, meaning that two unrelated tiles may share the same parameter. Hashing reduces the memory requirements with little loss in performance. This is because only a small fraction of the continuous space is actually needed or visited while learning, and so independent parameters for every tile are often not necessary. Hashing provides a means for using only the number of parameters the problem requires while not knowing in advance which state-action pairs need parameters.
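A minimal sketch of tile coding with hashing along these lines (illustrative only; the offset scheme, tile width, and hash function are our own assumptions, not the paper's):

```python
def tile_indices(values, num_tilings, tile_size, table_size):
    """Map a vector of numeric features to one hashed tile index per tiling.

    Each tiling is the same grid of width `tile_size`, shifted by a different
    offset, so a point activates exactly one tile per tiling.  The tile's
    coordinates are hashed into a table of `table_size` boolean features,
    ignoring collisions.
    """
    active = []
    for t in range(num_tilings):
        offset = t * tile_size / num_tilings                     # evenly spaced grid offsets
        coords = tuple(int((v + offset) // tile_size) for v in values)
        active.append(hash((t,) + coords) % table_size)          # hashed boolean feature index
    return active

# Example: two nearby feature vectors share most of their hashed tiles,
# which is exactly the generalization effect described in the text.
a = tile_indices([3, 7, 9], num_tilings=8, tile_size=6, table_size=1_000_000)
b = tile_indices([3, 8, 9], num_tilings=8, tile_size=6, table_size=1_000_000)
print(len(set(a) & set(b)), "of", len(a), "tiles shared")
```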
Policy Gradient Ascent
Policy gradient techniques (Sutton et al. 2000; Baxter & Bartlett 2000) are a method of reinforcement learning with function approximation. Traditional approaches approximate a state-action value function, and result in a deterministic policy that selects the action with the maximum learned value. Alternatively, policy gradient approaches approximate a policy directly, and then use gradient ascent to adjust the parameters to maximize the policy's value. There are three good reasons for the latter approach. First, there is a whole body of theoretical work describing convergence problems when using a variety of value-based learning techniques with a variety of function approximation techniques (see (Gordon 2000) for a summary of these results). Second, value-based approaches learn deterministic policies, and as we mentioned earlier deterministic policies in multiagent settings are often easily exploitable. Third, gradient techniques have been shown to be successful for simultaneous learning in matrix games (Singh, Kearns, & Mansour 2000; Bowling & Veloso 2002a).

n    |S|          |S x A|      SIZEOF(pi or Q)   VALUE(det)   VALUE(random)
4    692          15150        ~59KB             -2           -2.5
8    ~3 x 10^6    ~1 x 10^7    ~47MB             -20          -10.5
13   ~1 x 10^11   ~7 x 10^11   ~2.5TB            -65          -28

Table 1: The approximate number of states and state-actions, and the size of a stochastic policy or Q-table for Goofspiel depending on the number of cards, n. The VALUE columns list the worst-case value of the best deterministic policy and the random policy respectively.
We use the policy gradient technique presented by Sutton and colleagues (Sutton et al. 2000). Specifically, we will define a policy as a Gibbs distribution over a linear combination of features, such as those taken from a tile coding representation of state-actions. Let θ be a vector of the policy's parameters and φ_sa be a feature vector for state s and action a; then this defines a stochastic policy according to,

\pi(s, a) = \frac{e^{\theta \cdot \phi_{sa}}}{\sum_b e^{\theta \cdot \phi_{sb}}}.

Their main result was a convergence proof for the following policy iteration rule that updates a policy's parameters,

\theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi_k}(s) \sum_a \frac{\partial \pi_k(s, a)}{\partial \theta} f_{w_k}(s, a).    (1)

Here α_k is an appropriately decayed learning rate and d^{π_k}(s) is state s's contribution to the policy's overall value. This contribution is defined differently depending on whether the average or discounted start-state reward criterion is used. f_{w_k}(s, a) is an independent approximation of Q^{π_k}(s, a) with parameters w_k, which is the expected value of taking action a from state s and then following the policy π_k. For a Gibbs distribution, Sutton and colleagues showed that for convergence this approximation should have the following form,

f_w(s, a) = w \cdot \left( \phi_{sa} - \sum_b \pi(s, b) \, \phi_{sb} \right).

As they point out, this amounts to f_w being an approximation of the advantage function, A^π(s, a) = Q^π(s, a) − V^π(s), where V^π(s) is the value of following policy π from state s. It is this advantage function that we will estimate and use for gradient ascent. For the Gibbs distribution this is just,

\theta_{k+1} = \theta_k + \alpha_k \sum_s d^{\pi}(s) \sum_a \phi_{sa} \, \pi(s, a) \, f_{w_k}(s, a).    (2)

Using this basic formulation we derive an on-line version of the learning rule, where the policy's weights are updated with each state visited. The total reward criterion for Goofspiel is identical to having γ = 1 in the discounted setting. So, d^π(s) is just the probability of visiting state s when following policy π. Since we will be visiting states on-policy, this amounts to updating weights in proportion to how often the state is visited. By doing updates on-line as states are visited we can simply drop this term from equation 2, resulting in,

\theta_{k+1} = \theta_k + \alpha_k \sum_a \phi_{sa} \, \pi(s, a) \, f_{w_k}(s, a).    (3)

Lastly, we will do the policy improvement step (updating θ) simultaneously with the value estimation step (updating w). We will do value estimation using gradient-descent Sarsa(0) (Sutton & Barto 1998) over the same feature space as the policy. Specifically, if at time k the system is in state s and takes action a, transitioning to state s' and then taking action a', we update the weight vector,

w_{k+1} = w_k + \beta_k \left( r + \gamma Q_{w_k}(s', a') - Q_{w_k}(s, a) \right) \phi_{sa}.    (4)

The policy improvement step uses equation 3, where s is the state of the system at time k and the action-value estimates from Sarsa, Q_{w_k}, are used to compute the advantage term,

f_{w_k}(s, a) = Q_{w_k}(s, a) - \sum_b \pi(s, b \mid \theta_k) \, Q_{w_k}(s, b).
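To make the combined update concrete, here is a minimal sketch (ours, not the authors' code) of one on-line step in the spirit of equations 3 and 4, assuming a Gibbs policy and a linear critic over sparse boolean features given as lists of active tile indices:

```python
import math
from collections import defaultdict

class GibbsActorCritic:
    def __init__(self, alpha=0.008, beta=0.2, gamma=1.0):
        self.theta = defaultdict(float)   # policy parameters θ
        self.w = defaultdict(float)       # critic (value) parameters w
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def _score(self, params, features):
        # Linear combination over the sparse boolean feature vector φ_sa.
        return sum(params[i] for i in features)

    def policy(self, feats_per_action):
        """Gibbs distribution π(s, ·) over the actions available in s."""
        scores = [self._score(self.theta, f) for f in feats_per_action]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    def q(self, features):
        return self._score(self.w, features)

    def update(self, feats_sa, r, feats_next_sa, feats_per_action, terminal=False):
        """One Sarsa(0) critic step (eq. 4) and one policy-gradient step (eq. 3)."""
        # Critic: w <- w + β (r + γ Q(s',a') - Q(s,a)) φ_sa
        target = r if terminal else r + self.gamma * self.q(feats_next_sa)
        delta = target - self.q(feats_sa)
        for i in feats_sa:
            self.w[i] += self.beta * delta
        # Actor: θ <- θ + α Σ_a φ_sa π(s,a) f_w(s,a), with f_w the advantage estimate
        pi = self.policy(feats_per_action)
        qs = [self.q(f) for f in feats_per_action]
        v = sum(p * q for p, q in zip(pi, qs))
        for p, q, feats in zip(pi, qs, feats_per_action):
            advantage = q - v
            for i in feats:
                self.theta[i] += self.alpha * p * advantage
```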
Win or Learn Fast
WoLF ("Win or Learn Fast") is a method for changing the learning rate to encourage convergence in a multiagent reinforcement learning scenario (Bowling & Veloso 2002a). Notice that the gradient ascent algorithm described does not account for the non-stationary environment that arises with simultaneous learning in stochastic games. All of the other agents' actions are simply assumed to be part of the environment and unchanging. WoLF provides a simple way to account for other agents through adjusting how quickly or slowly the agent changes its policy.

Since only the rate of learning is changed, algorithms that are guaranteed to find (locally) optimal policies in non-stationary environments retain this property even when using WoLF. In stochastic games with simultaneous learning, WoLF has both theoretical evidence (limited to two-player, two-action matrix games) and empirical evidence (experiments in matrix games, as well as smaller zero-sum and general-sum stochastic games) that it encourages convergence in algorithms that don't otherwise converge (Bowling & Veloso 2002a). The intuition for this technique is that a learner should adapt quickly when it is doing more poorly than expected. When it is doing better than expected, it should be cautious, since the other players are likely to change their policy. This implicitly accounts for other players that are learning, rather than other techniques that try to explicitly reason about their action choices.
The WoLF principle naturally lends itself to policy gradient techniques where there is a well-defined learning rate, α_k. With WoLF we replace the original learning rate with two learning rates, α_w < α_l, to be used when winning or losing, respectively. One determination of winning and losing that has been successful is to compare the value of the current policy, V^π(s), to the value of the average policy over time, V^π̄(s). With the policy gradient technique above we can define a similar rule that examines the approximate value, using Q_w, of the current weight vector θ against that of the average weight vector over time, θ̄. Specifically, we are "winning" if and only if,

\sum_a \pi(s, a \mid \theta) \, Q_w(s, a) > \sum_a \pi(s, a \mid \bar{\theta}) \, Q_w(s, a).    (5)

When winning in a particular state, we update the parameters for that state using α_w, otherwise α_l.
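A sketch of this learning-rate selection (illustrative; it assumes the critic's estimates Q_w and a separately maintained average weight vector θ̄ are available, and it uses the step sizes reported later in the Experiments section):

```python
def wolf_step_size(pi_current, pi_average, q_values,
                   alpha_win=0.008, alpha_lose=0.16):
    """Pick the policy-update step size for this state using inequality 5.

    pi_current -- π(s, ·|θ), action probabilities under the current weights
    pi_average -- π(s, ·|θ̄), action probabilities under the time-averaged weights
    q_values   -- Q_w(s, ·), the critic's action-value estimates
    The average weights θ̄ are assumed to be maintained elsewhere, e.g. as an
    incremental average of θ over the updates applied so far.
    """
    value_current = sum(p * q for p, q in zip(pi_current, q_values))
    value_average = sum(p * q for p, q in zip(pi_average, q_values))
    # "Winning": the current policy looks better than the historical average policy.
    return alpha_win if value_current > value_average else alpha_lose
```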
Learning in Goofspiel
We combine these three techniques in the obvious way. Tile coding provides a large boolean feature vector for any state-action pair. This is used both for the parameterization of the policy and for the approximation of the policy's value, which is used to compute the policy's gradient. Gradient updates are then performed on both the policy using equation 3 and the value estimate using equation 4. WoLF is used to vary the learning rate α_k in the policy update according to the rule in inequality 5. This composition can essentially be thought of as an actor-critic method (Sutton & Barto 1998). Here the Gibbs distribution over the set of parameters is the actor, and the gradient-descent Sarsa(0) is the critic. Tile coding provides the necessary parameterization of the state. The WoLF principle adjusts how the actor changes policies based on the response from the critic.

The main detail yet to be explained, and where the algorithm is specifically adapted to Goofspiel, is in the tile coding. The method of tiling is extremely important to the overall performance of learning, as it is a powerful bias on what policies can and will be learned. The major decision to be made is how to represent the state as a vector of numbers and which of these numbers are tiled together. The first decision determines what states are distinguishable, and the second determines how generalization works across distinguishable states. Despite the importance of the tiling we essentially selected what seemed like a reasonable tiling, and used it throughout our results.

We represent a set of cards, either a player's hand or the deck, by five numbers, corresponding to the value of the card that is the minimum, lower quartile, median, upper quartile, and maximum. This provides information as to the general shape of the set, which is what is important in Goofspiel. The other values used in the tiling are the value of the card that is being bid on and the card corresponding to the agent's action. An example of this process in the 13-card game is shown in Table 2. These values are combined together into three tilings. The first tiles together the quartiles describing the players' hands. The second tiles together the quartiles of the deck with the card available and the player's action. The last tiles together the quartiles of the opponent's hand with the card available and the player's action. The tilings use tile sizes equal to roughly half the number of cards in the game, with the number of tilings greater than the tile sizes to distinguish between any integer state values. Finally, these tiles were all then hashed into a table of size one million in order to keep the parameter space manageable. We don't suggest that this is a perfect or even good tiling for this domain, but as we will show the results are still interesting.
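A sketch of this state-action representation (our own illustration; the exact quartile convention and tiling parameters used by the authors may differ slightly):

```python
def quartile_summary(cards):
    """Summarize a set of cards by (min, lower quartile, median, upper quartile, max)."""
    s = sorted(cards)
    def at(frac):
        return s[min(len(s) - 1, int(round(frac * (len(s) - 1))))]
    return (s[0], at(0.25), at(0.5), at(0.75), s[-1])

def state_action_features(my_hand, opp_hand, deck, upcard, action):
    """Group the numbers into the three tiling groups described in the text."""
    mine = quartile_summary(my_hand)
    opp = quartile_summary(opp_hand)
    dk = quartile_summary(deck)
    return [
        mine + opp,                 # tiling 1: both hands' quartiles
        dk + (upcard, action),      # tiling 2: deck quartiles, upcard, and action
        opp + (upcard, action),     # tiling 3: opponent's quartiles, upcard, and action
    ]

# Example: the 13-card hand used (roughly) in Table 2.
print(quartile_summary({1, 3, 4, 5, 6, 8, 11, 13}))   # -> (1, 4, 6, 8, 13)
```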
Results
One of the difficult and open issues in multiagent reinforcement learning is that of evaluation. Before presenting learning results we first need to look at how one evaluates learning success.

Evaluation
One straightforward evaluation technique is to have two learning algorithms learn against each other and simply examine the expected reward over time. This technique is not useful if one is interested in learning in self-play, where both players use an identical algorithm. In this case, with a symmetric zero-sum game like Goofspiel, the expected reward of the two agents is necessarily zero, providing no information.
Another common evaluation criterion is that of convergence. This is true in single-agent learning as well as multiagent learning. One strong motivation for considering this criterion in multiagent domains is the connection of convergence to Nash equilibrium. If algorithms that are guaranteed to converge to optimal policies in stationary environments converge in a multiagent learning environment, then the resulting joint policy must be a Nash equilibrium of the stochastic game (Bowling & Veloso 2002a).

Although convergence to an equilibrium is an ideal criterion for small problems, there are a number of reasons why this is unlikely to be possible for large problems. First, optimality in large (even stationary) environments is not generally feasible. This is exactly the motivation for exploring function approximation and policy parameterizations. Second, when we account for the limitations that approximation imposes on a player's policy, then equilibria may cease to exist, making convergence of policies impossible (Bowling & Veloso 2002b). Third, policy gradient techniques learn only locally optimal policies. They may converge to policies that are not globally optimal and therefore necessarily not equilibria.

Although convergence to equilibria, and therefore convergence in general, is not a reasonable criterion, we would still expect self-play learning agents to learn something. In this paper we use the evaluation technique used by Littman with Minimax-Q (Littman 1994). We train an agent in self-play, then freeze its policy, and train a challenger to find that policy's worst-case performance. This challenger is trained using just gradient-descent Sarsa and chooses the action with maximum estimated value with ε-greedy exploration. Notice that the possible policies playable by the challenger are the deterministic policies (modulo exploration) playable by the learning algorithm being evaluated.
Since Goofspiel is a symmetric zero-sum game, we know that the equilibrium policy, if one exists, would have value zero against its challenger. So, this provides some measure of how close the policy is to the equilibrium by examining its value against its challenger.

My Hand:   1  3  4  5  6  8  11  13      quartiles: (1, 4, 6, 8, 13)
Opp Hand:  4  5  8  9  10  11  12  13    quartiles: (4, 8, 10, 11, 13)
Deck:      1  2  3  5  9  10  11  12     quartiles: (1, 3, 9, 10, 12)
Card: 11   Action: 3

(1,4,6,8,13), (4,8,10,11,13), (1,3,9,10,12), 11, 3  --(tile coding)-->  TILES in {0,1}^(10^6)

Table 2: An example state-action representation using quartiles to describe the players' hands and the deck. These numbers are then tiled and hashed, with the resulting tiles representing a boolean vector of size 10^6.
A second related criterion will also help to understand the performance of the algorithm. Although policy convergence might not be possible, convergence of the expected value of the agents' policies may be possible. Since the real desirability of policy convergence is the convergence of the policy's value, this is in fact often just as good. This is also one of the strengths of the WoLF variable learning rate, as it has been shown to make learning algorithms with cycling policies and expected values converge both in expected value and policy.
Experiments
Throughout our experiments, we examined three different learning algorithms in self-play. The first two did not use the WoLF variable learning rate, and instead followed a static step size. "Fast" used a large step size, α_k = 0.16; "Slow" used a small step size, α_k = 0.008; "WoLF" switched between these learning rates based on inequality 5. In all experiments, the value estimation update used a fixed learning rate of β = 0.2. These rates were not decayed, in order to better isolate the algorithms' effectiveness apart from the appropriate selection of decay schedules. In addition, throughout training and evaluation runs, all agents followed an ε-greedy exploration strategy with ε = 0.05. The initial policies and values all begin with zero weight vectors, which with a Gibbs distribution corresponds to the random policy, which as we have noted is reasonably good.

In our first experiment we trained the learner in self-play for 40,000 games. After every 5,000 games we stopped the training and trained a challenger against the agent's current policy. The challenger was trained on 10,000 games using Sarsa(0) gradient ascent with the learning rate parameters described above. The two policies, the agent's and its challenger's, were then evaluated on 1,000 games to estimate the policy's worst-case expected value. This experiment was repeated thirty times for each algorithm.
The learning results, averaged over the thirty runs, are shown in Figure 3 for card sizes of 4, 8, and 13. The baseline comparison is with the random policy, a very competitive policy for this game. All three learners improve on this policy while training in self-play. The initial dips in the 8 and 13 card games are due to the fact that value estimates are initially very poor, making the initial policy gradients not in the direction of increasing the overall value of the policy. It takes a number of training games for the delayed reward of winning cards later to overcome the initial immediate reward of winning cards now. Lastly, notice the effect of the WoLF principle. It consistently outperforms the two static step size learners. This is identical to effects shown in non-approximated stochastic games (Bowling & Veloso 2002a).

The second experiment was to further examine the issue of convergence and the effect of the WoLF principle on the learning process. Instead of examining worst-case performance against some fictitious challenger, we now examine the expected value of the player's policy while learning in self-play. Again the algorithm was trained in self-play for 40,000 games. After every 50 games both players' policies were frozen and evaluated over 1,000 games to find the expected value to the players at that moment. We ran each algorithm once on just the 13 card game and plotted its expected value over time while learning.

The results are shown in Figure 4. Notice that the expected value of all the learning algorithms seems to have some oscillation around zero. We would expect this with identical learners in a symmetric zero-sum game. The point of interest though is how close these oscillations stay to zero over time. The WoLF principle causes the policies to have a more constant expected value with lower amplitude oscillations. This again shows that the WoLF principle continues to have converging effects even in stochastic games with approximation techniques.
[Figure 3: Worst-case expected value of the policy learned in self-play. Panels for n = 4, 8, and 13; curves for the WoLF, Fast, Slow, and Random policies; x-axis: number of training games.]

[Figure 4: Expected value of the game while learning. Curves for the WoLF, Fast, Slow, and Random policies; x-axis: number of games.]

Conclusion
We have described a scalable learning algorithm for stochastic games, composed of three reinforcement learning ideas. We showed preliminary results of this algorithm learning in the game Goofspiel. These results demonstrate that the policy gradient approach using an actor-critic model can learn in this domain. In addition, the WoLF principle for encouraging convergence also seems to hold even when using approximation and generalization techniques.
There are a number of directions for future work. Within the game of Goofspiel, it would be interesting to explore alternative ways of tiling the state-action space. This could likely increase the overall performance of the learned policy, but would also examine how generalization might affect the convergence of learning. Might certain generalization techniques retain the existence of equilibria, and is the equilibrium learnable? Another important direction is to examine these techniques on more domains, with possibly continuous state and action spaces. Also, it would be interesting to vary some of the components of the system. Can we use a different approximator than tile coding? Do we achieve similar results with different policy gradient techniques (e.g., GPOMDP (Baxter & Bartlett 2000))? These initial results, though, show promise that gradient ascent and the WoLF principle can scale to large state spaces.
References
Baxter, J., and Bartlett, P. L. 2000. Reinforcement learning in POMDP's via direct gradient ascent. In Proceedings of the Seventeenth International Conference on Machine Learning, 41-48. Stanford University: Morgan Kaufman.
Bowling, M., and Veloso, M. 2001. Rational and convergent learning in stochastic games. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 1021-1026.
Bowling, M., and Veloso, M. 2002a. Multiagent learning using a variable learning rate. Artificial Intelligence. In press.
Bowling, M., and Veloso, M. M. 2002b. Existence of multiagent equilibria with limited agents. Technical Report CMU-CS-02-104, Computer Science Department, Carnegie Mellon University.
Claus, C., and Boutilier, C. 1998. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence. Menlo Park, CA: AAAI Press.
Fink, A. M. 1964. Equilibrium in a stochastic n-person game. Journal of Science of Hiroshima University, Series A-I 28:89-93.
Flood, M. 1985. Interview by Albert Tucker. The Princeton Mathematics Community in the 1930s, Transcript Number 11.
Gordon, G. 2000. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems 12. MIT Press.
Greenwald, A., and Hall, K. 2002. Correlated Q-learning. In Proceedings of the AAAI Spring Symposium Workshop on Collaborative Learning Agents. In press.
Hu, J., and Wellman, M. P. 1998. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Proceedings of the Fifteenth International Conference on Machine Learning, 242-250. San Francisco: Morgan Kaufman.
Kuhn, H. W., ed. 1997. Classics in Game Theory. Princeton University Press.
Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the Eleventh International Conference on Machine Learning, 157-163. Morgan Kaufman.
Littman, M. 2001. Friend-or-foe Q-learning in general-sum games. In Proceedings of the Eighteenth International Conference on Machine Learning, 322-328. Williams College: Morgan Kaufman.
Nash, Jr., J. F. 1950. Equilibrium points in n-person games. PNAS 36:48-49. Reprinted in (Kuhn 1997).
Osborne, M. J., and Rubinstein, A. 1994. A Course in Game Theory. The MIT Press.
Samuel, A. L. 1967. Some studies in machine learning using the game of checkers. IBM Journal on Research and Development 11:601-617.
Shapley, L. S. 1953. Stochastic games. PNAS 39:1095-1100. Reprinted in (Kuhn 1997).
Singh, S.; Kearns, M.; and Mansour, Y. 2000. Nash convergence of gradient dynamics in general-sum games. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence, 541-548. Morgan Kaufman.
Stone, P., and Sutton, R. 2001. Scaling reinforcement learning toward RoboCup soccer. In Proceedings of the Eighteenth International Conference on Machine Learning, 537-544. Williams College: Morgan Kaufman.
Sutton, R. S., and Barto, A. G. 1998. Reinforcement Learning. MIT Press.
Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems 12. MIT Press.
Tesauro, G. J. 1995. Temporal difference learning and TD-Gammon. Communications of the ACM 38:48-68.