From: AAAI Technical Report SS-02-02. Compilation copyright © 2002, AAAI (www.aaai.org). All rights reserved.

Collaborative Learning in Strategic Environments

Akira Namatame, Noriko Tanoura, Hiroshi Sato
Dept. of Computer Science, National Defense Academy
Yokosuka, 239-8686, JAPAN
E-mail: {nama, hsato}@cc.nda.ac.jp

Abstract

There is no presumption that the collective behavior of interacting agents leads to collectively satisfactory results. How well agents can adapt to their social environment is different from how satisfactory a social environment they collectively create. In this paper, we attempt to reach a deeper understanding of this issue by specifying how agents interact and adapt their behavior. We consider problems of asymmetric coordination, which are formulated as minority games, and we address the following question: how do interacting agents realize efficient coordination, without any central authority, by self-organizing macroscopic order from the bottom up? We investigate several types of learning methodologies, including a new model, give-and-take learning, in which agents yield to others if they gain and randomize their actions if they lose or do not gain. We show that evolutionary learning is the most efficient in asymmetric strategic environments.

Keywords: asymmetric coordination, social efficiency, evolutionary learning, give-and-take learning

1. Introduction

There are many situations in which interacting agents can benefit from coordinating their actions. Social interactions pose many coordination problems for individuals. Individuals face the problem of sharing and distributing limited resources in an efficient way. Consider a competitive routing problem in networks, in which the paths from sources to destinations have to be established by multiple agents. In the context of traffic networks, for instance, agents have to determine their routes independently, and in telecommunication networks they have to decide what fraction of their traffic to send on each link of the network. Coordination sometimes implies that increased effort by some agents leads the remaining agents to follow suit, which gives rise to multiplier effects. We classify this type of coordination as symmetric coordination [3]. Coordination is also necessary to ensure that individual actions are carried out with few conflicts. We classify this type of coordination as asymmetric coordination [7].

Consider the following situation. A collection of agents have to travel using one of two routes, A or B. If each agent gains a payoff when he chooses the same route as the majority, the coordination is classified as symmetric. If, on the other hand, each agent gains a payoff when he chooses the opposite route from the majority, the coordination is classified as asymmetric. Coordination problems are characterized by many equilibria, and they often face the problem of coordination failure resulting from the agents' independent inductive processes [1][4]. An interesting problem is then under what circumstances a collection of agents realizes a particular stable situation, and whether that situation satisfies the conditions of social efficiency. In recent years, this issue has been addressed by formulating minority games (MG) [2][10]. However, the growing literature on MG treats agents as automata, merely responding to changing environments without deliberating about their decisions [13]. There is no presumption that the self-interested behavior of agents should usually lead to collectively satisfactory results [8][9].
How well each agent does in adapting to its social environment is not the same thing as how satisfactory a social environment the agents collectively create for themselves. An interesting problem is then under what circumstances a society of rational agents realizes social efficiency. Solutions to these problems often invoke the intervention of an authority who finds the social optimum and imposes the optimal behavior on agents. While such an optimal solution may be easy to find, its implementation may be difficult to enforce in practical situations. Self-enforcing solutions, in which agents achieve an optimal allocation of resources while pursuing their self-interests without any explicit agreement with others, are therefore of great practical importance. We are interested in a bottom-up approach that leads to more efficient coordination through more effective learning at the individual level [11]. Within the scope of our model, we create models in which agents make deliberate decisions by applying rational learning procedures. We explore the mechanism by which interacting agents get stuck at an inefficient equilibrium: while agents understand that the outcome is inefficient, each agent acting independently is powerless to manage decisions that reflect collective activity.

Agents also may not know what to do or how to make a decision. The design of efficient collective action is crucial in many fields. In collective activity, two types of activities may be necessary: each agent behaves as a member of society, while at the same time it behaves independently by adjusting its view and its action. At the individual level, it learns to improve its action based on its own observations and experiences. At the same level, agents put forward their learnt knowledge for consideration by others. An important aspect of this coordination is the learning rule adopted by each individual.

2. Formalism of Asymmetric Coordination and Minority Games

The El Farol bar problem and its variants provide a clean and simple example of asymmetric coordination problems [1][4]. Brian Arthur used this very simple yet interesting problem to illustrate effective uses of inductive reasoning by heterogeneous agents. There is a bar called El Farol in downtown Santa Fe, and a number of agents are interested in going to the bar each night. All agents have identical preferences. Each of them will enjoy the night at El Farol very much if there are no more than a threshold number of agents in the bar; however, each of them will suffer miserably if there are more than the threshold number of agents. In Arthur's example, the total number of agents is N = 100 and the threshold is set to 60. The only information available to agents is the number of visitors to the bar on previous nights. What makes this problem particularly interesting is that it is impossible for each agent to be perfectly rational, in the sense of correctly predicting the attendance on any given night. This is because if most agents predict that the attendance will be low (and therefore decide to attend), the attendance will actually be high, while if they predict the attendance to be high (and therefore decide not to attend), the attendance will be low. Arthur investigated the number of agents attending the bar over time using a diverse population of simple rules. One interesting result was that, over time, the average attendance of the bar is about 60. Agents make their choices by predicting ahead of time whether the attendance on the current night will exceed the capacity and then taking the appropriate course of action. Arthur examined the dynamic driving force behind this equilibrium.

Arthur's El Farol model has been extended in the form of minority games (MG), which showed for the first time how an equilibrium can be reached using inductive learning [2]. The MG is played by a collection of rational agents G = {A_i : 1 <= i <= N}. Without loss of generality, we can assume N is an odd number. In each period of the stage game, each agent must choose privately and independently between two strategies S = {S1, S2}. We represent the action of agent A_i at time period t by a_i(t) = 1 if he chooses S1 and a_i(t) = 0 if he chooses S2. Given the actions of all agents, the payoff of agent A_i is given by

  u_i(t) = 1  if a_i(t) = 1 and p(t) < 0.5, or a_i(t) = 0 and p(t) > 0.5,
  u_i(t) = 0  otherwise,                                                     (2.1)

where p(t) = (1/N) \sum_{1 \le i \le N} a_i(t) represents the proportion of agents who choose S1 at time period t.
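To make the payoff rule (2.1) concrete, the following Python sketch computes p(t) and the per-agent rewards for a single round. This is our own illustration; the function name minority_payoffs and the tiny five-agent example are not from the paper.

```python
import numpy as np

def minority_payoffs(actions):
    """One round of the minority game, following (2.1).

    actions: array of 0/1 choices, a_i(t) = 1 for S1 and 0 for S2.
    Returns (p, u): the proportion p(t) of agents choosing S1 and the
    payoff vector, where an agent earns 1 only on the minority side.
    """
    actions = np.asarray(actions)
    p = actions.mean()                        # p(t) = (1/N) * sum_i a_i(t)
    u = np.where(actions == 1, p < 0.5, p > 0.5).astype(int)
    return p, u

# Example with N = 5 agents: three choose S1, two choose S2.
p, u = minority_payoffs([1, 1, 1, 0, 0])
print(p, u)   # p = 0.6, and only the two S2-choosers are rewarded
```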
Each agent first receives the aggregate information p(t), which summarizes all agents' actions, and then decides whether to choose S1 or S2. An agent is rewarded with a unitary payoff whenever the side he chooses happens to be chosen by the minority of the agents, while agents on the majority side get nothing. All agents have access to public information on the record of past histories p(τ), τ < t. The past history available at time period t is denoted μ(t). How do agents choose actions under the common information μ(t)? Agents may behave differently because of their personal beliefs about the outcome of the next time period, p(t+1), which depends only on what agents do at the next time period t+1; the past history μ(t) has no direct impact on it.

We analyze the structure of the MG to see what we should expect. Social efficiency can be measured by the average payoff of an agent over a long time period. Consider the extreme case in which only one agent takes one side and all the others take the other side at each time period. The lucky agent gets a reward, the others get nothing, and the average payoff per agent is 1/N. The other extreme situation is when (N-1)/2 agents are on one side and (N+1)/2 agents are on the other side, where the average payoff is about 0.5. From the society's point of view, the latter situation is preferable.

The MG is characterized by many solutions. It is easy to see that this game has asymmetric Nash equilibria in pure strategies in which exactly (N-1)/2 agents choose one of the two sides. The game also has a unique symmetric mixed-strategy Nash equilibrium in which each agent selects the two sides with equal probability. With this mixed strategy, each agent can expect a payoff of about 0.5 in each time period, and the number of agents on one side follows a binomial distribution with mean N/2 and variance N/4. The variance is also a measure of social efficiency: the higher the variance, the larger the fluctuations around N/2 and the larger the corresponding aggregate welfare loss. Several learning rules have been found to lead to an efficient outcome when agents learn from each other [2][15].

How exactly does an agent's utility depend on the total number of participants? The MG can be represented as a 2x2 game in which an agent plays against the aggregate of the society, with the payoff matrix in Table 1. Suppose each agent plays with every other agent individually with the payoff matrix in Table 1; consistent with (2.1) and (2.2), an agent earns 1 against an opponent choosing the opposite side and 0 against an opponent choosing the same side.

Table 1. The payoff matrix of the minority game (rows: own strategy; columns: the other's strategy)

              S1 (go)   S2 (stay)
  S1 (go)        0          1
  S2 (stay)      1          0

The average payoffs of agent A_i from playing S1 and S2 against one other agent are then

  U_i(S1) = 1 - (1/N) \sum_{1 \le i \le N} a_i(t) = 1 - p(t),
  U_i(S2) = (1/N) \sum_{1 \le i \le N} a_i(t) = p(t),                        (2.2)

where p(t), as before, is the proportion of agents who choose S1 at time period t.
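The claims about the symmetric mixed-strategy equilibrium can be checked with a short simulation. The sketch below is our own code, with an arbitrarily chosen odd population size N = 101: every agent's choice is drawn independently with probability 0.5, and the number of S1-choosers is approximately binomial with mean N/2 and variance N/4, while the realized average payoff falls somewhat below 0.5 because of the fluctuations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 101, 10_000            # an odd population, many independent rounds

# Under the symmetric mixed-strategy equilibrium each agent plays S1
# with probability 0.5 in every round.
s1_counts = rng.binomial(1, 0.5, size=(T, N)).sum(axis=1)

print(s1_counts.mean())       # close to N/2 = 50.5
print(s1_counts.var())        # close to N/4 = 25.25

# Average payoff per agent: only agents on the minority side are rewarded.
minority = np.minimum(s1_counts, N - s1_counts)
print(minority.mean() / N)    # a little below 0.5; the gap is the welfare loss
```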
The matching methodology also plays an important role in the outcome of the game. Uniform matching, random matching, and local matching are the matching methodologies used in much of the literature [3][6]. Under uniform matching, agents interact with all other agents. Agents are not assumed to be knowledgeable enough to correctly anticipate all other agents' choices; they can only access information about the aggregate behavior of the society. Under local matching, as shown in Figure 1, each agent interacts only with its eight neighbours.

Figure 1. Local matchings with 8 neighbours.

The optimal outcome of the MG depends on the matching methodology. As shown in Figure 2, if each agent is matched with all other agents (uniform or random matching), he receives 0.5 as the optimal payoff. If each agent is matched with its eight neighbours (local matching), he receives 0.75 or 0.5 as the optimal payoff, depending on how the agents split into the two groups choosing S1 (dark circles) or S2 (white circles).

Figure 2. The optimal payoffs per agent with different matchings: (a)(b) local matching with 8 neighbours, with average payoffs 0.75 and 0.5; (c) uniform or random matching, with average payoff 0.5.
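As an illustration of how the optimal payoff under local matching depends on the way agents split into the two groups, the sketch below evaluates the Table 1 payoff of every agent against its 8 toroidal neighbours for two hand-picked configurations. The stripe and checkerboard patterns are our own choices, not necessarily the configurations drawn in Figure 2, but they attain the 0.75 and 0.5 payoffs mentioned above.

```python
import numpy as np

def local_payoffs(grid):
    """Average payoff per agent when each agent plays the Table 1 game
    once with each of its 8 (toroidal) neighbours: 1 against a neighbour
    on the opposite side, 0 against a neighbour on the same side."""
    opposite = np.zeros_like(grid, dtype=float)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            opposite += grid != np.roll(np.roll(grid, dr, axis=0), dc, axis=1)
    return (opposite / 8).mean()

n = 10
rows, cols = np.indices((n, n))
stripes = rows % 2              # alternating rows of S1 and S2
checker = (rows + cols) % 2     # checkerboard split

print(local_payoffs(stripes))   # 0.75
print(local_payoffs(checker))   # 0.5
```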
3. Learning Models in Strategic Environments

Game theory is typically based on the assumption of rational choice. In our view, the reason for the dominance of the rational-choice approach is not that scholars think it is realistic. Nor is game theory used solely because it offers good advice to a decision maker; its unrealistic assumptions undermine much of its value as a basis for advice. The real advantage of the rational-choice assumption is that it often allows deduction. The main alternative to the assumption of rational choice is some form of adaptive behavior. Adaptation may be expected at the individual level through learning, or at the population level through differential survival and reproduction of the more successful individuals. Either way, the consequences of adaptive processes are often very hard to deduce when there are many interacting agents following rules that have nonlinear effects. An important issue in strategic environments is therefore the learning rule adopted by each individual. We specify how agents adapt their behavior in response to others' behavior in strategic environments. Among the adaptive mechanisms that have been discussed in the learning literature are the following [5][6][12][14].

(1) Reinforcement learning. Agents tend to adopt actions that yielded a high payoff in the past and to avoid actions that yielded a low payoff. Payoffs describe choice behavior, but it is one's own past payoffs that matter, not the payoffs of others. The basic premise is that the probability of taking an action in the present increases with the payoff that resulted from taking that action in the past [6].

(2) Best-response learning. Agents adopt actions that optimize their expected payoff given what they expect others to do. In this learning model, agents choose best replies to the empirical frequency distribution of the previous actions of the others.

(3) Evolutionary learning. Agents who use high-payoff strategies are at a reproductive advantage compared to agents who use low-payoff strategies, hence the latter decrease in frequency in the population over time (natural selection). In the standard model of this situation, agents are viewed as being genetically coded with a strategy, and selection pressure favors agents that are fitter, i.e., whose strategy yields a higher payoff against the population.

(4) Social learning. Agents learn from each other. For instance, agents may copy the behavior of others, especially behavior that appears to yield high payoffs (imitation). In contrast to natural selection, the payoffs describe how agents make choices, and agents' payoffs must be observable by others for the model to make sense. The crossover strategy is another type of social learning.

These learning models can be arranged on the spectrum in Figure 3, with reinforcement learning and social learning based on give-and-take as the limiting cases at the two ends of the spectrum.

Figure 3. The spectrum of learning models: reinforcement, best-response, give-and-take, evolution (crossover).

4. Best-Response Learning

Under the assumption of rationality, agents are assumed to choose an optimal strategy based on a sample of information about what other agents have done in the past. Agents are able to calculate best replies and to learn the strategy distribution of play in the society. With best-response learning, each agent calculates his best strategy based on information about the current distribution of strategies [5]. At each period of time, each agent decides which strategy to choose given knowledge of the aggregate behavior of the population. Each agent thinks strategically, knowing that everyone else is also making a rational choice given its own information. An important assumption concerns how agents receive knowledge of the current strategy distribution. For simplicity, we assume that there is no strategic interaction across time: an agent's decision depends only on current information and not on any previous actions.

The dynamics of the agents' collective decision are described as follows. Let p(t) be the proportion of agents who have chosen S1 at time t, and let U_i(S_k) be the expected payoff to A_i when he chooses S_k, k = 1, 2. The best response of an agent is then given by

  if U_i(S1) > U_i(S2), choose S1;  if U_i(S1) < U_i(S2), choose S2.         (4.1)

The expected payoffs of agent A_i are obtained as

  U_i(S1) = θ(1 - p(t)),  U_i(S2) = (1 - θ)p(t),                             (4.2)

where θ parameterizes the off-diagonal payoffs of the 2x2 game: θ is the payoff of S1 against an agent choosing S2, and 1 - θ is the payoff of S2 against an agent choosing S1 (θ = 0.5 corresponds, up to rescaling, to Table 1). The aggregate information p(t), the current status of the collective decision, therefore has a decisive effect on the agents' rational decisions. Since U_i(S1) > U_i(S2) exactly when p(t) < θ, the best-response adaptive rule of agent A_i reduces to

  (i) if p(t) < θ, choose S1;  (ii) if p(t) > θ, choose S2.                  (4.3)

The result of learning with this global best-response rule is simple. Starting from any initial condition p(0), the system cycles between the two extreme situations in which all agents choose S1 or all agents choose S2. Under this cyclic behavior no agent gains, resulting in a huge waste. This result has considerable intuitive appeal, since it displays a situation in which rational individual action, in pursuit of well-defined preferences, leads to undesirable outcomes.
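The cycling behaviour of the global best-response rule (4.3) is easy to reproduce. The following sketch is our own code, assuming the symmetric threshold θ = 0.5 and an arbitrary initial condition: because every agent reacts to the same aggregate signal p(t), the population flips in lockstep between all-S1 and all-S2 and the average payoff drops to zero.

```python
# Global best-response dynamics (4.3) with an assumed threshold q = 0.5.
N, q = 101, 0.5
actions = [1] * 60 + [0] * 41            # an arbitrary initial condition p(0)

for t in range(6):
    p = sum(actions) / N
    # An agent is rewarded when it sits on the minority side (rule 2.1).
    payoff = sum(1 for a in actions if (a == 1) == (p < 0.5)) / N
    print(f"t={t}  p(t)={p:.2f}  average payoff={payoff:.2f}")
    # Every agent applies the same best response to the same signal p(t).
    actions = [1 if p < q else 0 for _ in range(N)]
```

Running this prints p(t) alternating between 0.00 and 1.00 after the first step, with average payoff 0.00 on every subsequent round, which is the coordination failure described above.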
5. Collaborative Learning with Give-and-Take

In this section we propose give-and-take learning, which departs from the conventional assumption that agents update their behavior only in order to improve a measure function such as their payoff. It is commonly assumed that agents tend to adopt actions that yielded a high payoff in the past and to avoid actions that yielded a low payoff. With give-and-take learning, by contrast, agents are assumed to yield to others if they received a payoff, by taking the opposite strategy at the next time period, and to choose randomly if they did not gain a payoff. Each agent receives the common information p(t), which aggregates all agents' actions in the last time period, and then decides whether to choose S1 or S2 at time period t+1 by considering whether he was rewarded at time t; he is rewarded a unitary payoff whenever the side he chooses happens to be chosen by the minority of the agents.

We formalize give-and-take learning as follows. The action a_i(t+1) of agent A_i at the next time period t+1 is determined by the rule

  (i)   a_i(t+1) = 0 (choose S2)  if a_i(t) = 1 and p(t) < 0.5  (gain),
  (ii)  a_i(t+1) = 1 (choose S1)  if a_i(t) = 0 and p(t) > 0.5  (gain),
  (iii) a_i(t+1) = RND(x)         if a_i(t) = 1 and p(t) > 0.5  (no gain),
  (iv)  a_i(t+1) = RND(x)         if a_i(t) = 0 and p(t) < 0.5  (no gain),   (5.1)

where RND(x) represents the mixed strategy x = (x, 1-x) of choosing S1 with probability x and S2 with probability 1-x.

We evaluate performance by comparing the average payoff of an individual. The expected payoff of an agent who chooses S1 is 1-p and that of an agent who chooses S2 is p, where p denotes the proportion of agents who choose S1. Therefore the average payoff per individual is 2p(1-p), which takes its maximum value 0.5 at p = 0.5. If exactly half of the agents (N = 50 of 100) attend the bar, the attendees receive a payoff of 1 and the average payoff is 0.5, which is the maximum payoff. With the collaborative (give-and-take) learning, every agent receives the payoff, and the majority of agents obtain an average payoff of about 0.35.

Figure 4. (a) The number of agents choosing S1 and S2 with the mixed strategy; (b) the proportion of agents with the same average payoff with the mixed strategy.

Figure 5. (a) The number of agents choosing S1 and S2 with give-and-take learning; (b) the proportion of agents with the same average payoff with give-and-take learning.
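A minimal simulation of the give-and-take rule (5.1) can be written as follows. This is our own sketch; the randomization probability x = 0.5, the population size, and the run length are arbitrary choices rather than values taken from the paper.

```python
import random

random.seed(1)
N, T, x = 101, 2000, 0.5
a = [random.randint(0, 1) for _ in range(N)]     # initial actions a_i(0)
total = [0.0] * N                                # cumulative payoff per agent

for t in range(T):
    p = sum(a) / N                               # aggregate signal p(t)
    for i in range(N):
        gained = (a[i] == 1 and p < 0.5) or (a[i] == 0 and p > 0.5)
        total[i] += gained
        if gained:
            a[i] = 1 - a[i]                      # yield: switch to the other side
        else:
            a[i] = 1 if random.random() < x else 0   # randomize with RND(x)

print(sum(total) / (N * T))                      # average payoff per agent per round
```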
6. Evolutionary Learning with Local Matching

In this section we investigate evolutionary learning, in which agents learn from their most successful neighbours and co-evolve their strategies over time. Each agent adopts the most successful strategy as a guide for its own decisions (individual learning); hence its success depends in large part on how well it learns from its neighbours. If a neighbour is doing well, its strategy can be imitated by the others (collective learning). In an evolutionary approach there is no need to assume a rational calculation that identifies the best strategy. Instead, the analysis of what is chosen at any specific time is based on the idea that effective strategies are more likely to be retained than ineffective ones [15]. Moreover, the evolutionary approach allows the introduction of new strategies as occasional random mutations of old strategies. The evolutionary principle itself can be thought of as the consequence of any one of three different mechanisms. It could be that more effective individuals are more likely to survive and reproduce. A second interpretation is that agents learn by trial and error, keeping effective strategies and altering ones that turn out poorly. A third interpretation is that agents observe each other, and those with poor performance tend to imitate the strategies of those they see doing better.

Here we use the local matching shown in Figure 1, in which each agent is matched with its 8 neighbours. Each agent is matched several times with the same neighbour, and its rule of strategy selection is coded as the list shown in Figure 6; a part of the list may be replaced with that of the most successful neighbour.

Figure 6. The representation of the meta-rule as a list of strategies.

An agent's decision rule is represented by a binary string. At each generation gen, gen in [1, ..., lastgen], agents repeatedly play the game for T iterations. An agent A_i, i in [1, ..., N], uses its binary string to make a decision about its action at each iteration t, t in [1, ..., T]. The binary string consists of 22 positions (genes) p_j, j in [1, 22], interpreted as follows. The first and second positions, p_1 and p_2, encode the actions that the agent takes at iterations t = 1 and t = 2. The positions p_j, j in [3, 6], encode the history of mutual actions that agent i and its neighbour (opponent) took at iterations t-1 and t-2. The positions p_j, j in [7, 22], encode the action that agent i takes at an iteration t > 2, one for each history encoded by the positions p_j, j in [3, 6].

We consider two types of evolutionary learning, mimicry and crossover. Each agent interacts with the agents on all eight adjacent squares and imitates the strategy of any better performing one. In each generation, each agent attains a success score measured by its average performance with its eight neighbours. If an agent has one or more neighbours who are more successful, the agent either converts to the strategy of the most successful of them (mimicry) or crosses its own strategy with the strategy of the most successful neighbour (crossover). Neighbours also serve another function: if a neighbour is doing well, its behaviour can be shared, and successful strategies can spread throughout the population from neighbour to neighbour [11].

Significant differences were observed between mimicry and crossover. As shown in Figure 7, with the mimicry strategy each agent acquires a payoff of approximately 0.35, while with crossover each agent acquires a payoff of approximately 0.7. Consequently, we can conclude that evolutionary learning leads to a more efficient situation in these strategic environments.

Figure 7. The average payoff with evolutionary learning: crossover and mimicry.
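The mimicry variant can be sketched in a few lines if the strategy representation is simplified. The code below is our own compression: the 22-position binary meta-rule of Figure 6 is replaced by a single probability of choosing S1, and crossover and mutation are omitted, so it illustrates only the imitate-the-most-successful-neighbour step under local matching, not the full model described above.

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, GENS = 20, 50, 30                 # 20x20 torus, T rounds per generation
strategy = rng.random((n, n))           # each agent's probability of choosing S1

def neighbours(r, c):
    return [((r + dr) % n, (c + dc) % n)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]

for gen in range(GENS):
    score = np.zeros((n, n))
    for t in range(T):
        a = (rng.random((n, n)) < strategy).astype(int)
        for dr in (-1, 0, 1):           # Table 1 payoff against each of the 8 neighbours
            for dc in (-1, 0, 1):
                if (dr, dc) != (0, 0):
                    score += (a != np.roll(np.roll(a, dr, axis=0), dc, axis=1)) / 8.0
    new = strategy.copy()
    for r in range(n):                  # mimicry: copy the most successful neighbour,
        for c in range(n):              # keeping one's own strategy if it is best
            best = max(neighbours(r, c) + [(r, c)], key=lambda rc: score[rc])
            new[r, c] = strategy[best]
    strategy = new

print(score.mean() / T)                 # average payoff per agent per round
```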
7. Conclusion

The interaction of heterogeneous agents produces some kind of coherent, systematic behavior. We investigated the macroscopic patterns arising from the strategic interactions of heterogeneous agents who behave according to local rules. In this paper we addressed the following questions: 1) How does a society of selfish agents self-organize its collective behavior, without a central authority, so as to satisfy the constraints? 2) How does learning at the individual level generate more efficient collective behavior? 3) How does co-evolution in a society lend its invisible hand to promote the self-organization of emergent collective behaviors?

In previous work on collective behavior, the standard assumption was that agents use the same kind of adaptive rule. In this paper we departed from this assumption by considering a model of agents that are heterogeneous with respect to the meta-rule they use for making decisions at each time period. Agents use ad hoc meta-rules to make their decisions based on past performance, and we considered several types of learning rules that agents use to update their meta-rules. We considered a specific strategic environment in which a large number of agents have to choose one of two sides independently and those on the minority side win, which is known as a minority game. A rational approach is helpless in the minority game, as it generates large-scale social inefficiency. We introduced a new learning model at the individual level, give-and-take learning, in which every agent makes his decision based on the past history of collective behavior. It is shown that the emergent collective behavior is more efficient than that generated by the mixed Nash equilibrium strategies. We also proposed collaborative learning based on a Darwinian approach. It is shown that in strategic environments where every agent has to keep improving his meta-rule for making decisions in order to survive, if agents learn from each other, then social efficiency is realized without a central authority.

8. References

[1] Arthur, W.B., "Inductive Reasoning and Bounded Rationality", American Economic Review, Vol.84, pp.406-411, 1994.
[2] Challet, D. and Zhang, Y.-C., "Emergence of Cooperation and Organization in an Evolutionary Game", Physica A, Vol.246, 1997.
[3] Cooper, R., Coordination Games, Cambridge Univ. Press, 1999.
[4] Fogel, D.B. and Chellapilla, K., "Inductive Reasoning and Bounded Rationality Reconsidered", IEEE Trans. on Evolutionary Computation, Vol.3, pp.142-146, 1999.
[5] Fudenberg, D. and Levine, D., The Theory of Learning in Games, The MIT Press, 1998.
[6] Hart, S. and Mas-Colell, A., "A Simple Adaptive Procedure Leading to Correlated Equilibrium", Econometrica, 2000.
[7] Iwanaga, S. and Namatame, A., "Asymmetric Coordination of Heterogeneous Agents", IEICE Trans. on Information and Systems, Vol.E84-D, pp.937-944, 2001.
[8] Iwanaga, S. and Namatame, A., "The Complexity of Collective Decision", Journal of Nonlinear Dynamics and Control, to appear.
[9] Kaniovski, Y., Kryazhimskii, A., and Young, H., "Adaptive Dynamics in Games Played by Heterogeneous Populations", Games and Economic Behavior, Vol.31, pp.50-96, 2000.
[10] Marsili, M., "Toy Models of Markets with Heterogeneous Interacting Agents", Lecture Notes in Economics and Mathematical Systems, Vol.503, pp.161-181, Springer, 2001.
[11] Namatame, A. and Sato, H., "Collective Learning in Social Coordination Games", Proc. of the AAAI Spring Symposium, SS-01-03, pp.156-159, 2001.
[12] Samuelson, L., Evolutionary Games and Equilibrium Selection, The MIT Press, 1998.
[13] Sipper, M., Evolution of Parallel Cellular Machines: The Cellular Programming Approach, Springer, 1996.
[14] Young, H.P., Individual Strategy and Social Structure, Princeton Univ. Press, 1998.
[15] Weibull, J., Evolutionary Game Theory, The MIT Press, 1996.