Applied Adversarial Reasoning and Risk Modeling: Papers from the 2011 AAAI Workshop (WS-11-06)

Robust Decision Making under Strategic Uncertainty in Multiagent Environments

Maciej M. Łatek and Seyed M. Mussavi Rizi
Department of Computational Social Science, George Mason University, Fairfax, VA 22030, USA
mlatek, smussavi@gmu.edu

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We introduce the notion of strategic uncertainty for boundedly rational, non-myopic agents as an analog to the equilibrium selection problem in classical game theory. We then motivate the need for and feasibility of addressing strategic uncertainty and present an algorithm that produces decisions that are robust to it. Finally, we show how agents' rationality levels and planning horizons alter the robustness of their decisions.

Introduction

To define strategic uncertainty, we start from equilibrium selection, an analogous problem in multiple-equilibrium games. A player can play any equilibrium strategy of a multiple-equilibrium game.
However, uncertainty about which equilibrium strategy he picks creates an equilibrium selection problem (Crawford and Haller 1990) that can be solved by:

(a) introducing equilibrium selection rules into the game that enable players to implement mutual best responses (Huyck, Battalio, and Beil 1991), for example picking risk-dominant equilibria;
(b) refining the notion of equilibrium so that only one equilibrium can exist (McKelvey and Palfrey 1995; 1998), for example the logit quantal response equilibrium; or
(c) selecting "learnable equilibria" that players can reach as a result of processes through which they learn from the history of play. For example, (Suematsu and Hayashi 2002) and (Hu and Wellman 2003) offer solutions for multiagent reinforcement learning in stochastic games that converge to a Nash equilibrium when other agents are adaptive and to an optimal response otherwise.

Generally, players need unlimited computational power to obtain equilibria for (a) and (b). On the other hand, players that use the multiagent learning algorithms in (c) need to interact repeatedly for a long period of time to be able to update internal representations of adversaries and the environment as they accrue experience. Yet learning algorithms fail to match the players' rate of adaptation to changes in the environment or in opponents' behavior, because they forgo abundant background information available to humans and rarely evoke introspection or counterfactual reasoning by players (Costa-Gomes, Crawford, and Iriberri 2009).

Now consider a player who wishes to best respond to an adversary that may have multiple, equally "good" actions to choose from at each step of the game. Strategic uncertainty characterizes this uncertainty over adversary actions with equal exact or mean payoffs at each step of the game. How can a player who faces strategic uncertainty find best responses that on average beat any choice by the adversary? Finding such robust best responses is a more sinister conundrum than the equilibrium selection problem: the player is not looking for a way to pick one equilibrium strategy among many as a probability distribution over joint action sets; he has already abandoned the notion of an equilibrium strategy in favor of best responses that maximize his expected payoff over a finite planning horizon. Not surprisingly, neither (a) nor (b) nor (c) can address strategic uncertainty.

Enter n-th order rationality. n-th order rationality belongs to a class of cognitive hierarchy models (Camerer, Ho, and Chong 2004) that use an agent's assumptions about how rational other agents are, together with information about the environment, to anticipate adversaries' behavior. For n > 0, n-th order rational agents (NORA) determine their best response to all other agents, assuming that they are (n − 1)-th order rational. Zeroth-order rational agents follow a non-strategic heuristic (Stahl and Wilson 1994). First-order rational agents use their beliefs about the state of the environment and the strategies of zeroth-order agents to calculate their best response to other agents, and so forth. NORA have long permeated studies of strategic interaction in one guise or another: Sun Tzu (Niou and Ordeshook 1994) advocated concepts similar to NORA in warfare; (Keynes 1936) did so in economics. NORA use information about the environment to explicitly anticipate adversaries, but do so by calculating single-agent best responses instead of optimizing in the joint action space. Combined with an efficient multiagent formulation that can be solved even for complex environments, n-th order rationality is a convenient heuristic for reducing strategic uncertainty in multiagent settings.

This paper motivates the feasibility of addressing strategic uncertainty by boundedly rational agents, then presents an algorithm that enables NORA to tackle strategic uncertainty in multiagent environments and uses this algorithm to show how to make n-th order rationality robust and tractable for non-myopic agents with finite planning horizons. Finally, it presents sensitivity analyses of non-myopic n-th order rationality to show how players' orders of rationality and planning horizons alter the robustness of their decisions to strategic uncertainty.

Motivating Example

In this section, we give an example of strategic uncertainty in the Colonel Blotto game and broadly describe how boundedly rational agents can deal with it.

Figure 1: Instructions for reading Colonel Blotto fitness surfaces. Each point within the triangle (axes p1, p2, p3 with p1 + p2 + p3 = const) maps uniquely onto a Blotto strategy through a barycentric transformation. Hatched areas dominate A's strategy (0.6, 0.2, 0.2).

Colonel Blotto

Colonel Blotto is a zero-sum game of strategic mismatch between players X and Y, first solved by (Gross and Wagner 1950). A policy s^x_t for player X at time t is a real vector s^x_t = (x^1_t, x^2_t, ..., x^M_t), where M is the number of fronts and x^i_t ∈ [0, 1] is the fraction of the budget X allocates to front i ∈ [1, M] at time t, such that Σ_{i=1}^{M} x^i_t = 1. Players have equal budgets. The single-stage payoff of s^x_t against s^y_t for X is

r^X(s^x_t, s^y_t) = Σ_{i=1}^{M} sgn(x^i_t − y^i_t),

where sgn(·) is the sign function and we assume M > 2 so that the game does not end in a tie.
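To make the single-stage payoff concrete, here is a minimal Python sketch (ours, not the authors' code) that evaluates r^X for two allocation vectors; the function names and the random-strategy helper are illustrative assumptions.

```python
import random

def blotto_payoff(x, y):
    """Single-stage Colonel Blotto payoff r^X(s^x_t, s^y_t):
    +1 for each front X wins, -1 for each front X loses, 0 for a tied front."""
    assert len(x) == len(y) > 2                      # the text assumes M > 2
    assert abs(sum(x) - 1) < 1e-9 and abs(sum(y) - 1) < 1e-9
    return sum((xi > yi) - (xi < yi) for xi, yi in zip(x, y))

def random_strategy(m):
    """Draw an allocation over m fronts that sums to one (illustrative heuristic)."""
    w = [random.random() for _ in range(m)]
    total = sum(w)
    return [wi / total for wi in w]

# Example: the strategy A plays at t = 0 in the text, against a random opponent.
print(blotto_payoff([0.6, 0.2, 0.2], random_strategy(3)))
```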
Colonel Blotto does not have a pure strategy Nash equilibrium, because a pure strategy s^x_t that allocates x^i_t > 0 to front i loses to a strategy s^y_t that allocates no budget to front i and more budget to all other fronts. Constant-sum Colonel Blotto has a mixed strategy Nash equilibrium in which the marginal distributions on all fronts are uniform on [0, 2/M] (Gross and Wagner 1950). This unpredictability leaves the opponent with no preference for a single allocation so long as no front receives more than 2/M of the budget. More modern treatments of this game allow interactions among fronts (Golman and Page 2008; Hart 2007). (Arad and Rubinstein 2010) discusses experimental results with Colonel Blotto, including fitting an n-th order rational model.

Since we need a repeated game in the next section, we transform the single-stage Colonel Blotto into a repeated game by making policy adjustment from t to t + 1 costly:

r^X(s^x_t, s^y_t, s^x_{t−1}) = Σ_{i=1}^{M} sgn(x^i_t − y^i_t) − δ ‖s^x_t − s^x_{t−1}‖.

Parameter δ controls the scaling of the penalty a player pays for changing his strategy.

Best Response and Robust Best Response

In this section, we assess strategic uncertainty in the best response dynamics of Colonel Blotto by Monte Carlo simulation. Suppose A and B play repeated Colonel Blotto with M = 3 and no strategy adjustment cost. If A plays strategy (p1, p2, p3) at t = 0, B's best responses at t = 1 belong to the hatched areas in Figures 1 and 2. Suppose A anticipates B's best responses. What should A do at t = 1? A's problem is that B can equally well choose any strategy that belongs to the best response set. If A fully accounts for the fact that B randomizes uniformly, A's response surface, presented in the right panel of Figure 2, has three local maxima, also listed in Table 1. Each of these maxima offers A a similar expected payoff at t = 1, around 0.14. A's expected payoff is only 0.03 if it takes just a single realization of B's best response and tries to best respond to it.

Figure 2: Fitness surfaces for first-order rational player B (left panel) and robust second-order rational player A (right panel) at t = 1. A plays strategy (0.6, 0.2, 0.2) at t = 0.

Table 1: Expected payoffs for a robust, second-order rational player A at t = 1 and t = 2, under the assumption that player B plays a simple myopic best response and player A chooses again after period 1.

  Name  Period 1 strategy     Expected payoff
                              period 1   period 2
  A1    (0.42, 0.52, 0.04)    0.141      0.201
  A2    (0.52, 0.24, 0.24)    0.145      0.169
  A3    (0.42, 0.04, 0.52)    0.141      0.201

What if A goes one step further and assumes that B will again best respond after period 1? A can construct a robust best response again, using the expected joint actions for period 1 as the departure point. This step differentiates between A's period 1 choices, clearly favoring two of them. This example underlines three lessons: (a) it is important for boundedly rational agents to realize the problem of multiple, equally desirable choices for their adversary; (b) it is possible to hedge against such choices by anticipating the adversary's choice selection rule; and (c) planning for more than one period may further clarify the choice.
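The hedging idea in this example can be sketched as a short Monte Carlo routine: enumerate a grid of candidate strategies, approximate B's best-response set to A's period-0 play, and score each of A's candidates by its average payoff against that whole set rather than against one draw from it. The grid resolution, the tie tolerance and all function names below are our own illustrative choices, not the authors' code.

```python
from itertools import product

def blotto_payoff(x, y):
    # Single-stage payoff from the earlier sketch, repeated for self-containment.
    return sum((xi > yi) - (xi < yi) for xi, yi in zip(x, y))

def simplex_grid(steps=20):
    """Allocations over three fronts with coordinates in multiples of 1/steps."""
    return [(i / steps, j / steps, (steps - i - j) / steps)
            for i, j in product(range(steps + 1), repeat=2) if i + j <= steps]

def best_response_set(opponent, candidates, tol=1e-9):
    """All candidates whose payoff against `opponent` is within `tol` of the best."""
    scored = [(blotto_payoff(c, opponent), c) for c in candidates]
    best = max(s for s, _ in scored)
    return [c for s, c in scored if best - s <= tol]

def robust_best_response(br_set, candidates):
    """Candidate with the highest average payoff over the whole best-response set."""
    return max(candidates,
               key=lambda c: sum(blotto_payoff(c, b) for b in br_set) / len(br_set))

# A played (0.6, 0.2, 0.2) at t = 0; hedge against every best response B might
# pick at t = 1, instead of best responding to a single realization.
grid = simplex_grid()
b_best_responses = best_response_set((0.6, 0.2, 0.2), grid)
a_robust = robust_best_response(b_best_responses, grid)
print(a_robust)
```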
RENORA Algorithm

Recursive Simulation

A multiagent model of strategic interactions among agents defines the space of feasible actions for each agent and constrains the possible sequences of interactions among agents. It also stores a library of past interactions among agents and calculates agents' payoffs for any allowable trajectory of interactions among them, based on each agent's implicit or explicit preferences or utility function. (Latek, Axtell, and Kaminski 2009) discussed how to decompose a multiagent model into a state of the environment and agents' current actions, and how to use the model to define a mapping from the state of the environment and agents' current actions into a realization of the next state of the environment and agents' current payoffs. In this setting, an agent can use adaptive heuristics or statistical procedures to compute the probability distribution of payoffs for any action he can take, then pick an action that is in some sense suitable. Alternatively, he can simulate the environment forward in a cloned model, derive the probability distribution of payoffs for the actions he can take by simulation, and pick a suitable action. When applied to multiagent models, this recursive approach to decision making amounts to having simulated agents use simulation to make decisions (Gilmer 2003). We call this technology multiagent recursive simulation (MARS). This approach was used to create myopic, non-robust n-th order rational agents in (Latek, Axtell, and Kaminski 2009). Non-myopic behaviors were added in (Latek and Rizi 2010). The next section generalizes the MARS approach to cope with strategic uncertainty.

Implementing n-th Order Rationality by Recursive Simulation

In order to behave strategically, agents need access to plausible mechanisms for forming expectations of others' future behaviors. n-th order rationality is one such mechanism. An n-th order rational agent (NORA) assumes that other agents are (n − 1)-th order rational and best responds to them. A zeroth-order rational agent acts according to a non-strategic heuristic such as randomly drawing actions from the library of historical interactions or continuing the current action. A first-order rational agent assumes that all other agents are zeroth-order rational and best responds to them. A second-order rational agent assumes that all other agents are first-order rational and best responds to them, and so forth. Observe that if the assumption of a second-order rational agent about other agents is correct, they must assume that the second-order rational agent is a zeroth-order rational agent rather than a second-order rational agent. In other words, an agent's assumptions about other agents' rationality orders ultimately reflect an assumption about how he perceives their assumptions about his own rationality.

To describe the algorithm that introduces non-myopic NORA into a given model Ψ, we denote the level of rationality of an NORA by d = 0, 1, 2, .... We label the i-th NORA with rationality level d as A^i_d and the set containing its feasible actions as 𝔸^i_d:

d = 0: a zeroth-order rational agent A^i_0 chooses an action in 𝔸^i_0;
d = 1: a first-order rational agent A^i_1 chooses an action in 𝔸^i_1, and so forth.

If an agent is denoted A^i_d, from his point of view the other agent must be A^{−i}_{d−1}. If the superscript index is absent, as in A_d, we refer to either of the two agents, assuming that the level of rationality is d. If the subscript is absent, as in A^i, we refer to agent i regardless of his level of rationality.

Now we show how a myopic NORA uses MARS to plan his action. First, let us take the case of A_0. The set 𝔸_0 contains feasible actions that are not conditioned on A_0's expectations of what the other agent will do. So A_0 does not assume that the other agent optimizes, and arrives at 𝔸_0 by using non-strategic heuristics such as expert advice, drawing actions from a fixed probability distribution over the action space, or sampling the library of historical interactions. We set the size of the action sample for A_0 to κ + 1 = |𝔸_0|. For example, consider the iterated prisoner's dilemma game: the equivalent-retaliation heuristic called "tit for tat" is an appropriate non-strategic d = 0 behavior with κ = 0. For notational ease, we will say that using the non-strategic heuristic is equivalent to launching NORA for d = 0: 𝔸^i_0 = NORA(A^i_0, ·).

A_0 adopts one action in the set of feasible actions 𝔸_0 after it is computed. Recall that A_1 forms 𝔸_1 by best responding to the 𝔸_0 adopted by the other agent in Ψ, whom he assumes to be A_0. So A_1 finds a strategy that on average performs best when the A_0 it faces adopts any course of action in its 𝔸_0. A_1 takes K samples of each pair of his candidate actions and feasible A_0 actions in order to integrate out the stochasticity of Ψ. For higher orders d, A^i_d does not consider all possible actions of his opponent A^{−i}, but focuses on κ + 1 historical actions and the τ most probable future actions computed under the (d − 1)-th order rationality assumption, that is, for A^{−i}_{d−1}.
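Below is a minimal, self-contained sketch of the myopic MARS recursion just described, using a toy cloneable repeated-Blotto model as a stand-in for Ψ. The class and function names, the small strategy grid, and the "repeat the last move" zeroth-order heuristic are all illustrative assumptions; the paper's construction works with any model that can be cloned.

```python
import random
from copy import deepcopy
from itertools import product
from statistics import mean

class RepeatedBlotto:
    """Toy cloneable model of repeated three-front Blotto (illustrative stand-in for Psi)."""
    def __init__(self, steps=4, delta=0.0):
        self.delta = delta
        self.grid = [(i / steps, j / steps, (steps - i - j) / steps)
                     for i, j in product(range(steps + 1), repeat=2) if i + j <= steps]
        self.last = {0: random.choice(self.grid), 1: random.choice(self.grid)}
        self.scores = {0: 0.0, 1: 0.0}

    def clone(self):
        return deepcopy(self)

    def feasible_actions(self, agent):
        return self.grid

    def historical_action(self, agent):
        return self.last[agent]            # zeroth-order heuristic: repeat the last move

    def step(self, joint):                 # joint = {agent id: allocation vector}
        for me in (0, 1):
            other = 1 - me
            gain = sum((a > b) - (a < b) for a, b in zip(joint[me], joint[other]))
            penalty = self.delta * sum(abs(a - b) for a, b in zip(joint[me], self.last[me]))
            self.scores[me] += gain - penalty
        self.last = dict(joint)

    def payoff(self, agent):
        return self.scores[agent]

def nora_actions(model, me, other, d, K=5, kappa=0):
    """Feasible-action set of a myopic d-th order rational agent under the MARS recursion."""
    if d == 0:
        # kappa + 1 non-strategic actions; with a richer history library these would differ.
        return [model.historical_action(me) for _ in range(kappa + 1)]
    opp = nora_actions(model, other, me, d - 1, K, kappa)   # opponent assumed (d-1)-th order
    def score(action):
        samples = []
        for b in opp:
            for _ in range(K):             # K rollouts integrate out model stochasticity
                c = model.clone()
                c.step({me: action, other: b})
                samples.append(c.payoff(me))   # cumulative payoff; shared offset cancels in ranking
        return mean(samples)
    scored = [(score(a), a) for a in model.feasible_actions(me)]
    best = max(s for s, _ in scored)
    return [a for s, a in scored if s == best]   # all equally good best responses

# Usage: a second-order rational agent 0 chooses among its best responses to agent 1.
model = RepeatedBlotto()
print(random.choice(nora_actions(model, me=0, other=1, d=2)))
```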
Algorithm 1 goes further and modifies the MARS principle to solve the multiple-period planning problem for NORA. (Haruvy and Stahl 2004) showed that n-th order rationality can be used to plan for longer horizons in repeated matrix games; however, no solution exists for a general model. In particular, we need a decision rule that enables A_d to derive optimum decisions if (a) he wishes to plan for more than one step; (b) he takes a random length of time to complete an action or aborts the execution of an action midcourse; and (c) he interacts asynchronously with other NORA. To address these issues, we introduce the notion of a planning horizon h. While no classic solution to problems (b) and (c) exists, the classic method of addressing (a), namely optimizing jointly over all h-step sequences of actions, leads to an exponential explosion in computational cost. RENORA handles (a), (b) and (c) simultaneously by exploiting probabilistic replanning (Yoon, Fern, and Givan 2007), hence the name replanning NORA (RENORA). In short, RENORA plan their first action knowing that they will have to replan after the first action is executed or aborted. The expected utility of the second, replanned action is added to that of the first, and so forth. Therefore, RENORA avoid taking actions that lead to states of the environment without good exits. We will show that the planning horizon h and τ play important roles in solving the strategic uncertainty problem.

In Algorithm 1, parameter K, the number of repetitions of the simulation for each complete joint action scenario, controls the desired level of robustness with respect to environmental noise. Parameters τ and κ enable decision makers to control the forward- versus backward-looking bias of RENORA. K determines the number of times each pair of agent strategies is played against the other; higher K therefore reduces the effect of model randomness on the chosen action. τ is the number of the opponent's equally good future actions an agent is willing to hedge against, whereas κ is the number of the opponent's actions an agent wishes to draw from history. RENORA generalizes fictitious best response: the higher κ ≥ 1 is, the closer an agent comes to playing fictitious best response at d = 1, whereas with κ = 0 a d ≥ 2 agent has a purely forward-looking orientation, as in the classical game-theoretic setup.
Figure 3 shows fitness surfaces generated by RENORA that can be directly compared with the results of the Monte Carlo study in Figure 2 and Table 1.

Figure 3: Fitness surfaces for different variants of player B and robust player A at t = 1 when A plays (0.2, 0.6, 0.2) at t = 0. Panels: (a) RENORA(A^B_1, 1), τ_A = 1; (b) RENORA(A^A_2, 1), τ_A = 1; (c) RENORA(A^A_2, 2), τ_A = 20.

  Input: parameters K, τ, κ; state of simulation Ψ
  Output: set 𝔸^i_d of optimal actions for A^i_d

  Compute 𝔸^{−i}_{d−1} = RENORA(A^{−i}_{d−1}, h)
  foreach action a_d available to A^i_d do
      Initialize the action's payoff p̄(a_d) = 0
      for i = 1 to K do
          foreach a_{d−1} ∈ 𝔸^{−i}_{d−1} do
              s = cloned Ψ
              Assign initial action a_d to self
              Assign initial action a_{d−1} to the other agent
              while s.time() < h do
                  if a_d is not executing then
                      Substitute for a_d an action taken at random from the set RENORA(A^i_d, h − s.time())
                  end
                  if a_{d−1} is not executing then
                      Substitute for a_{d−1} an action picked at random from RENORA(A^{−i}_{d−1}, h − s.time())
                  end
                  Accumulate A^i's payoff += s(a_{d−1}, a_d) by running an iteration of the cloned simulation
              end
          end
      end
      Compute the average strategy payoff p̄(a_d) over all taken samples
  end
  Eliminate all but the τ best actions from the set of initial actions available to A^i_d
  Compute the set 𝔸^i_0 = RENORA(A^i_0, h)
  Add both sets, arriving at 𝔸^i_d

Algorithm 1: Algorithm RENORA(A^i_d, h) for d, h > 0.
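The following Python sketch follows the structure of Algorithm 1 on the toy model and recursion introduced earlier. It assumes, for simplicity, that every action completes after exactly one model step and that the model exposes clone()/step()/payoff(); these interface choices, the parameter defaults and the function name are ours, not the paper's reference implementation.

```python
import random
from statistics import mean

def renora(model, me, other, d, h, K=3, tau=2, kappa=0):
    """Sketch of RENORA(A^i_d, h): returns up to tau + kappa + 1 candidate first actions."""
    if d == 0 or h <= 0:
        # Zeroth-order set: kappa + 1 non-strategic (here, historical) actions.
        return [model.historical_action(me) for _ in range(kappa + 1)]
    # The opponent is assumed to be (d - 1)-th order rational.
    opp_set = renora(model, other, me, d - 1, h, K, tau, kappa)
    scored = []
    for a in model.feasible_actions(me):
        samples = []
        for b in opp_set:
            for _ in range(K):                         # K rollouts per joint scenario
                s = model.clone()
                act_me, act_other, t = a, b, 0
                while t < h:                           # simulate forward, replanning as we go
                    s.step({me: act_me, other: act_other})
                    t += 1
                    if t < h:                          # both actions "finish" every period here
                        act_me = random.choice(renora(s, me, other, d, h - t, K, tau, kappa))
                        act_other = random.choice(renora(s, other, me, d - 1, h - t, K, tau, kappa))
                samples.append(s.payoff(me))           # cumulative payoff; shared offset cancels in ranking
        scored.append((mean(samples), a))
    scored.sort(reverse=True)
    forward_looking = [a for _, a in scored[:tau]]     # keep only the tau best first actions
    zeroth_order = renora(model, me, other, 0, h, K, tau, kappa)
    return forward_looking + zeroth_order              # union of the two sets

# Usage (small parameters keep the nested rollouts cheap):
# model = RepeatedBlotto(delta=0.25)
# candidates = renora(model, me=0, other=1, d=2, h=2, K=2, tau=2, kappa=0)
```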
Discussion

RENORA decouples environment and behavior representation, thus injecting strategic reasoning into multiagent simulations, the most general paradigm for modeling complex systems to date (Axtell 2000). RENORA achieves this feat by bringing n-th order rationality and recursive simulation together. (Gilmer 2003) and (Gilmer and Sullivan 2005) used recursive simulation to help decision making; (Durfee and Vidal 2003; 1995), (Hu and Wellman 2001), and (Gmytrasiewicz, Noh, and Kellogg 1998) implemented n-th order rationality in multiagent models using pre-calculated equations; and (Parunak and Brueckner 2006) used stigmergic interactions to generate likely trajectories of interacting groups. However, Algorithm 1 combines the two techniques for the first time to produce robust decisions that hedge against both model stochasticity, by varying K > 0, and agents' coevolving strategies, with τ > 0.

RENORA is related to the partially observable stochastic games (POSG) (Hansen, Bernstein, and Zilberstein 2004) and interactive partially observable Markov decision processes (IPOMDP) (Rathnasabapathy, Doshi, and Gmytrasiewicz 2006; Nair et al. 2002) frameworks. These frameworks use formal languages and sequential decision modeling to encode dynamic and stochastic multiagent environments. They explicitly model other agents' beliefs, capabilities, and preferences as part of the state space perceived by each agent. Each framework provides an equivalent notion of Nash equilibrium. A number of exact equilibrium search algorithms exist, but their cost often grows doubly exponentially in the number of agents, the time horizon and the size of the state-action space. Approximate algorithms are based on filtering possible future trajectories of the system through Monte Carlo tree search (Carmel and Markovitch 1996; Chaslot et al. 2008), grouping behaviorally equivalent models of agents together (Seuken and Zilberstein 2007), iterated elimination of dominated strategies, and applying reductions and limits to the nested beliefs of agents (Zettlemoyer, Milch, and Kaelbling 2008). RENORA similarly performs filtered search and subsequent optimization, using n-th order rationality. However, unlike POSG and IPOMDP, RENORA does not require that the model be rewritten in a specialized language; it works with any model that can be cloned.

Algorithm 1 derives robust and farsighted actions for an agent by computing the average best response payoff over τ + κ + 1 actions of the opponent. Depending on the level of risk an agent is willing to accept, the differences among these averages may or may not be of practical significance. Therefore, it may make sense to perform sensitivity analysis on the parameters of RENORA and to use other robustness criteria such as minimax, maximin or Hurwicz measures (Rosenhead, Elton, and Gupta 1972).

Experimental Results

We have conducted three experiments using RENORA agents A and B playing Colonel Blotto: (1) sweeping the spaces of d_A and d_B with τ_A = τ_B ∈ [1, 5]; (2) B ran RENORA(1, 1), and we swept the spaces of d_A and τ_A, keeping h_A = 1, τ_B = 1 and κ_A = κ_B = 0; (3) B ran RENORA(1, 1), and we swept the spaces of h_A and τ_A, keeping d_A = 2, τ_B = 1 and κ_A = κ_B = 0. In period t = 0, both agents behave randomly. We varied the cost of strategy adjustment δ from 0 to 0.25 in all experiments. For each experiment and each set of parameters, we executed an average of 10 runs of the game, each lasting 50 periods. Results use the average payoff of agent A as the target metric.

Figure 4: A's payoffs as a function of the players' rationality d_A × d_B, the type of environment δ ∈ {0, 0.25} and the number of forward-looking samples the players draw, τ_A = τ_B ∈ {1, 5}.

Figure 4 aggregates the results of Experiment (1). Regardless of the type of environment and how robust both agents attempt to be, A obtains the largest payoffs when d_A = d_B + 1, that is, when its beliefs about B's rationality are consistent with the truth. The higher the levels of rationality of both agents, the more opportunities there are for optimization noise in the RENORA best response calculation to compound. This problem is less pronounced in environments with strategic stickiness and can be further rectified by taking more forward-looking samples. This aspect is outlined in Figure 5(a).
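As an illustration of how a sweep like Experiment (1) could be driven on the toy implementation sketched above, here is a small harness; the run_game helper, the reduced number of periods and repetitions, and all defaults are our own choices to keep the sketch cheap, not the authors' experimental setup.

```python
import random
from statistics import mean

def run_game(d_a, d_b, delta, periods=20, h=1, K=2, tau=1, kappa=0):
    """One repeated-Blotto match between RENORA agents A (id 0) and B (id 1)."""
    model = RepeatedBlotto(delta=delta)            # toy model from the earlier sketch
    for _ in range(periods):
        a = random.choice(renora(model, 0, 1, d_a, h, K, tau, kappa))
        b = random.choice(renora(model, 1, 0, d_b, h, K, tau, kappa))
        model.step({0: a, 1: b})
    return model.payoff(0) / periods               # A's average per-period payoff

# Experiment (1)-style sweep over rationality orders for one value of delta.
for d_a in range(3):
    for d_b in range(3):
        avg = mean(run_game(d_a, d_b, delta=0.0) for _ in range(2))
        print(f"dA={d_a}  dB={d_b}  average payoff {avg:+.2f}")
```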
The second result links the number of forward-looking samples τ to the planning horizon h. In the Motivating Example we found that the planning horizon can play a role in distinguishing between possible myopic robust choices. In Figure 5(b) we do not find this effect to persist under repeated interaction when agents re-optimize each period. At the same time, increasing the planning horizon or the number of forward-looking samples does not create adverse effects on agents' payoffs. One exception might be long planning horizons when the cost of strategy adjustment is also high, but the statistical significance of the impact is borderline.

On the other hand, increasing the planning horizon removes optimism bias from agents' planning. In Figure 6 we present predicted versus realized payoffs for a d = 2 agent playing against a myopic best responder. Even when τ = 10, the agent predicts a payoff of 0.25 but obtains an actual payoff of 0.12, in the same vicinity as the 0.14 obtained in Table 1 for a specific initial condition. Therefore, assessments of the future become much more realistic when agents increase their planning horizon, while actual payoffs converge faster than predicted payoffs as the number of forward-looking samples increases.

Figure 5: Interaction between the planning horizon h, the rationality level d and τ on player A's payoffs. B uses RENORA(A^B_1, 1). Panels: (a) d and τ; (b) h and τ.

Figure 6: A's predicted and observed payoffs as a function of its planning horizon h_A and the number of forward-looking samples τ_A. Levels of rationality are fixed at d_A = 2 and d_B = 1. B has planning horizon h_B = 1 and τ_B = 1. In the upper panel, the cost of strategy adjustment δ is zero; in the lower panel δ = 0.25. Thin lines correspond to a 95% confidence interval.

Summary

In this paper, we demonstrated how to hedge against strategic uncertainty using a multiagent implementation of n-th order rationality for replanning agents with arbitrary planning horizons. The example we used, the Blotto game, is rather simple and served as a proof of concept. We have used RENORA in a number of richly detailed social simulations of markets and organizational behavior for which writing POSG or IPOMDP formalizations is simply impossible. The first example is the model presented in (Kaminski and Latek 2010) that we used to study the emergence of price discrimination in telecommunication markets. We showed that the irregular topology of the call graph leads to the emergence of price discrimination patterns that are consistent with real markets but very difficult to replicate using orthodox, representative-agent approaches. In (Latek and Kaminski 2009) we examined markets in which companies are allowed to obfuscate prices and customers are forced to rely on their direct experience and on signals they receive from social networks to make purchasing decisions. We used the RENORA-augmented model to search for market designs that are robust with respect to the bounded rationality of companies and customers. Finally, in (Latek, Rizi, and Alsheddi 2011) we studied how security agencies can determine a suitable blend of evidence on the historical patterns of terrorist behavior with current intelligence on terrorists in order to devise appropriate countermeasures. We showed that terrorist organizations' rapid acquisition of new capabilities makes the optimal strategies advocated by game-theoretic reasoning unlikely to succeed. Each of these simulations featured more than two strategic agents and many thousands of non-strategic adaptive agents.

Our work on RENORA is not yet complete. Recently, a distribution-free approach to modeling incomplete-information games through robust optimization equilibria has been proposed that seems to unify a number of other approaches (Aghassi and Bertsimas 2006). We are working toward using these results to formalize measures of robustness that can be used to drive parameter selection for RENORA.
Secondly, computational game theorists have recently begun to deal with incorporating cognitive elements of bounded rationality, for example anchoring theories of human perceptions of probability distributions, into their models of Stackelberg security games (Pita et al. 2010). We intend to create templates of imperfect model-cloning functions that serve a similar purpose. Lastly, the assumption that a d-th order rational agent considers the other player to be (d − 1)-th order rational is hard-coded in RENORA. Our experiments have shown that the performance of RENORA in self-play strongly depends on this assumption being true. We plan to make RENORA more adaptive by combining it with an adaptation heuristic that allows agents to learn the actual rationality orders of their opponents during interactions.

Acknowledgments

The authors were partly supported by the Office of Naval Research (ONR) grant N00014-08-1-0378. Opinions expressed herein are solely those of the authors, not of George Mason University or the ONR.

References

Aghassi, M., and Bertsimas, D. 2006. Robust Game Theory. Mathematical Programming 107:231–273.

Arad, A., and Rubinstein, A. 2010. Colonel Blotto's Top Secret Files. Levine's Working Paper Archive 1–23.

Axtell, R. 2000. Why Agents? On the Varied Motivations for Agent Computing in the Social Sciences. Technical Report 17, Center on Social Dynamics, The Brookings Institution.

Camerer, C. F.; Ho, T. H.; and Chong, J. K. 2004. A Cognitive Hierarchy Model of Games. Quarterly Journal of Economics 119:861–898.

Carmel, D., and Markovitch, S. 1996. Incorporating Opponent Models into Adversary Search. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI-96).

Chaslot, G.; Bakkes, S.; Szita, I.; and Spronck, P. 2008. Monte-Carlo Tree Search: A New Framework for Game AI. In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference.

Costa-Gomes, M. A.; Crawford, V. P.; and Iriberri, N. 2009. Comparing Models of Strategic Thinking in Van Huyck, Battalio, and Beil's Coordination Games. Journal of the European Economic Association 7:365–376.

Crawford, V., and Haller, H. 1990. Learning How to Cooperate: Optimal Play in Repeated Coordination Games. Econometrica 58(3):571–595.
Durfee, E. H., and Vidal, J. M. 1995. Recursive Agent Modeling Using Limited Rationality. In Proceedings of the First International Conference on Multi-Agent Systems, 125–132.

Durfee, E. H., and Vidal, J. M. 2003. Predicting the Expected Behavior of Agents That Learn About Agents: The CLRI Framework. Autonomous Agents and Multiagent Systems.

Gilmer, J. B., and Sullivan, F. 2005. Issues in Event Analysis for Recursive Simulation. In Proceedings of the 37th Winter Simulation Conference, 12–41.

Gilmer, J. 2003. The Use of Recursive Simulation to Support Decisionmaking. In Chick, S.; Sanchez, P. J.; Ferrin, D.; and Morrice, D., eds., Proceedings of the 2003 Winter Simulation Conference.

Gmytrasiewicz, P.; Noh, S.; and Kellogg, T. 1998. Bayesian Update of Recursive Agent Models. User Modeling and User-Adapted Interaction 8:49–69.

Golman, R., and Page, S. E. 2008. General Blotto: Games of Allocative Strategic Mismatch. Public Choice 138(3–4):279–299.

Gross, O., and Wagner, R. 1950. A Continuous Colonel Blotto Game. Technical report, RAND.

Hansen, E. A.; Bernstein, D. S.; and Zilberstein, S. 2004. Dynamic Programming for Partially Observable Stochastic Games. In Proceedings of the National Conference on Artificial Intelligence. American Association for Artificial Intelligence.

Hart, S. 2007. Discrete Colonel Blotto and General Lotto Games. International Journal of Game Theory 36(3–4):441–460.

Haruvy, E., and Stahl, D. 2004. Level-n Bounded Rationality on a Level Playing Field of Sequential Games. In Econometric Society 2004 North American Winter Meetings.

Hu, J., and Wellman, M. P. 2001. Learning about Other Agents in a Dynamic Multiagent System. Cognitive Systems Research 2:67–79.

Hu, J., and Wellman, M. P. 2003. Nash Q-Learning for General-Sum Stochastic Games. Journal of Machine Learning Research 4(6):1039–1069.

Huyck, J.; Battalio, R.; and Beil, R. 1991. Strategic Uncertainty, Equilibrium Selection, and Coordination Failure in Average Opinion Games. The Quarterly Journal of Economics 106(3):885–910.

Kaminski, B., and Latek, M. 2010. The Influence of Call Graph Topology on the Dynamics of Telecommunication Markets. In Jedrzejowicz, P.; Nguyen, N.; Howlett, R.; and Jain, L., eds., Agent and Multi-Agent Systems: Technologies and Applications, volume 6070 of Lecture Notes in Computer Science, 263–272.

Keynes, J. M. 1936. The General Theory of Employment, Interest and Money. Macmillan Cambridge University Press.

Latek, M. M., and Kaminski, B. 2009. Social Learning and Pricing Obfuscation. In Hernández, C.; Posada, M.; and López-Paredes, A., eds., Lecture Notes in Economics and Mathematical Systems, 103–114. Springer.

Latek, M. M., and Rizi, S. M. M. 2010. Plan, Replan and Plan to Replan: Algorithms for Robust Courses of Action under Strategic Uncertainty. In Proceedings of the 19th Conference on Behavior Representation in Modeling and Simulation (BRIMS).

Latek, M. M.; Axtell, R.; and Kaminski, B. 2009. Bounded Rationality via Recursion. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, 457–464.

Latek, M.; Rizi, S. M. M.; and Alsheddi, T. A. 2011. Optimal Blends of History and Intelligence for Robust Antiterrorism Policy. Journal of Homeland Security and Emergency Management 8.

McKelvey, R., and Palfrey, T. 1995. Quantal Response Equilibria for Normal Form Games. Games and Economic Behavior 10(1):6–38.

McKelvey, R., and Palfrey, T. 1998. Quantal Response Equilibria for Extensive Form Games. Experimental Economics 1(1):9–41.
Nair, R.; Tambe, M.; Yokoo, M.; Pynadath, D.; and Marsella, S. 2002. Towards Computing Optimal Policies for Decentralized POMDPs. Technical Report WS-02-06, AAAI.

Niou, E. M. S., and Ordeshook, P. C. 1994. A Game-Theoretic Interpretation of Sun Tzu's The Art of War. Journal of Peace Research 31:161–174.

Parunak, H. V. D., and Brueckner, S. 2006. Concurrent Modeling of Alternative Worlds with Polyagents. In Proceedings of the Seventh International Workshop on Multi-Agent-Based Simulation.

Pita, J.; Jain, M.; Tambe, M.; Ordonez, F.; and Kraus, S. 2010. Robust Solutions to Stackelberg Games: Addressing Bounded Rationality and Limited Observations in Human Cognition. Artificial Intelligence 174(15):1142–1171.

Rathnasabapathy, B.; Doshi, P.; and Gmytrasiewicz, P. 2006. Exact Solutions of Interactive POMDPs Using Behavioral Equivalence. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS '06).

Rosenhead, J.; Elton, M.; and Gupta, S. 1972. Robustness and Optimality as Criteria for Strategic Decisions. Operational Research Quarterly 23(4):413–431.

Seuken, S., and Zilberstein, S. 2007. Memory-Bounded Dynamic Programming for DEC-POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence.

Stahl, D., and Wilson, P. 1994. Experimental Evidence on Players' Models of Other Players. Journal of Economic Behavior and Organization 25:309–327.

Suematsu, N., and Hayashi, A. 2002. A Multiagent Reinforcement Learning Algorithm Using Extended Optimal Response. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS '02), 370.

Yoon, S.; Fern, A.; and Givan, R. 2007. FF-Replan: A Baseline for Probabilistic Planning. In Proceedings of the 17th International Conference on Automated Planning and Scheduling (ICAPS-07), 352–359.

Zettlemoyer, L. S.; Milch, B.; and Kaelbling, L. P. 2008. Multi-Agent Filtering with Infinitely Nested Beliefs. In Proceedings of NIPS'08.