Robust Decision Making under Strategic Uncertainty in Multiagent Environments

Applied Adversarial Reasoning and Risk Modeling: Papers from the 2011 AAAI Workshop (WS-11-06)
Maciej M. Łatek and Seyed M. Mussavi Rizi
Department of Computational Social Science
George Mason University
Fairfax, VA 22030, USA
{mlatek, smussavi}@gmu.edu
Abstract

We introduce the notion of strategic uncertainty for boundedly rational, non-myopic agents as an analog to the equilibrium selection problem in classical game theory. We then motivate the need for and feasibility of addressing strategic uncertainty and present an algorithm that produces decisions that are robust to it. Finally, we show how agents' rationality levels and planning horizons alter the robustness of their decisions.
Introduction
To define strategic uncertainty, we start with equilibrium selection as an analogous problem in multiple-equilibrium games. A player can play any equilibrium strategy of a multiple-equilibrium game. However, uncertainty about which equilibrium strategy he picks creates an equilibrium selection problem (Crawford and Haller 1990) that can be solved by:
(a) Introducing equilibrium selection rules into the game
that enable players to implement mutual best responses
(Huyck, Battalio, and Beil 1991), for example picking
risk-dominant equilibria.
(b) Refining the notion of equilibrium so that only one equilibrium can exist (McKelvey and Palfrey 1995; 1998), for
example, logit quantal response equilibrium.
(c) Selecting “learnable equilibria” that players can reach as a
result of processes through which they learn from the history of play. For example, (Suematsu and Hayashi 2002)
and (Hu and Wellman 2003) offer solutions for multiagent
reinforcement learning in stochastic games that converge
to a Nash equilibrium when other agents are adaptive and
to an optimal response otherwise.
Generally, players need to have unlimited computational
power to obtain equilibria for (a) and (b). On the other hand,
players that use multiagent learning algorithms in (c) need
to interact repeatedly for a long period of time to be able to
update internal representations of adversaries and the environment as they accrue experience. Yet, learning algorithms
fail to match the players’ rate of adaptation to changes in
the environment or opponents’ behavior, because they forgo
abundant background information available to humans, and rarely evoke introspection or counterfactual reasoning by players (Costa-Gomes, Crawford, and Iriberri 2009).

Now consider a player who wishes to best respond to an adversary that may have multiple, equally “good” actions to choose from at each step of the game. Strategic uncertainty characterizes this uncertainty over adversary actions with equal exact or mean payoffs at each step of the game. How can a player who faces strategic uncertainty find best responses that on average beat any choices by the adversary? Finding such robust best responses is a more sinister conundrum than the equilibrium selection problem: the player is not looking for a way to pick one equilibrium strategy among many as a probability distribution over joint action sets; he has already abandoned the notion of an equilibrium strategy for best responses that maximize his expected payoff over a finite planning horizon. Not surprisingly, neither (a) nor (b) nor (c) can address strategic uncertainty. Enter n-th order rationality.

n-th order rationality belongs to a class of cognitive hierarchy models (Camerer, Ho, and Chong 2004) that use an agent's assumptions on how rational other agents are and information on the environment to anticipate adversaries' behavior. For n > 0, n-th order rational agents (NORA) determine their best response to all other agents, assuming that they are (n − 1)-th order rational. Zeroth-order rational agents follow a non-strategic heuristic (Stahl and Wilson 1994). First-order rational agents use their beliefs on the state of the environment and the strategies of zeroth-order agents to calculate their best response to other agents, and so forth. NORA have long permeated studies of strategic interaction in one guise or another: Sun Tzu (Niou and Ordeshook 1994) advocated concepts similar to NORA in warfare, and (Keynes 1936) did so in economics. NORA use information on the environment to explicitly anticipate adversaries, but do so by calculating single-agent best responses instead of optimizing in the joint action space. Combined with an efficient multiagent formulation that can be solved even for complex environments, n-th order rationality is a convenient heuristic for reducing strategic uncertainty in multiagent settings.

This paper motivates the feasibility of addressing strategic uncertainty by boundedly rational agents, then presents an algorithm that enables NORA to tackle strategic uncertainty
in multiagent environments, and uses this algorithm to show how to make n-th order rationality robust and tractable for non-myopic agents with finite planning horizons. Finally, it produces sensitivity analyses of non-myopic n-th order rationality to show how players' order of rationality and planning horizon alter the robustness of their decisions to strategic uncertainty.
Motivating Example
In this section, we give an example of strategic uncertainty
in the Colonel Blotto game and broadly describe how boundedly rational agents can deal with it.
Figure 1: Instructions for reading Colonel Blotto fitness surfaces. Each point within the triangle maps uniquely onto a Blotto strategy through the barycentric transformation. Hatched areas dominate A's strategy (0.6, 0.2, 0.2).
Colonel Blotto
Colonel Blotto is a zero-sum game of strategic mismatch between players X and Y, first solved by (Gross and Wagner 1950). A policy s_t^x for player X at time t is a real vector s_t^x = (x_t^1, x_t^2, ..., x_t^M), where M is the number of fronts and x_t^i ∈ [0, 1] is the fraction of the budget X allocates to front i ∈ [1, M] at time t, such that sum_{i=1}^{M} x_t^i = 1. Players have equal available budgets. The single-stage payoff of s_t^x against s_t^y for X is

    r_X(s_t^x, s_t^y) = sum_{i=1}^{M} sgn(x_t^i − y_t^i),

where sgn(·) is the sign function and we assume M > 2 so the game does not end in a tie.

Colonel Blotto does not have a pure strategy Nash equilibrium, because a pure strategy s_t^x that allocates x_t^i > 0 to front i loses to a strategy s_t^y that allocates no budget to front i and more budget to all other fronts. Constant-sum Colonel Blotto has a mixed strategy Nash equilibrium in which the marginal distributions on all fronts are uniform on [0, 2/M] (Gross and Wagner 1950). This unpredictability leaves the opponent with no preference for a single allocation so long as no front receives more than 2/M of the budget. More modern treatments of this game allow interactions among fronts (Golman and Page 2008; Hart 2007). (Arad and Rubinstein 2010) discusses experimental results with Colonel Blotto, including fitting an n-th order rational model. Since we need a repeated game in the next section, we transform the single-stage Colonel Blotto into a repeated game by making policy adjustment from t to t + 1 costly:

    r_X(s_t^x, s_t^y, s_{t−1}^x) = sum_{i=1}^{M} sgn(x_t^i − y_t^i) − δ · ||s_t^x − s_{t−1}^x||.

Parameter δ controls the scaling of the penalty a player pays for changing his strategy.
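To make the stage payoff concrete, here is a minimal Python sketch of the two formulas above. The function name and the use of an L1 distance for the adjustment penalty ||s_t^x − s_{t−1}^x|| are illustrative assumptions on our part, not details fixed by the paper.

    import numpy as np

    def stage_payoff(s_x, s_y, s_x_prev=None, delta=0.0):
        """Single-stage Colonel Blotto payoff for player X over M fronts.
        s_x, s_y: current allocations (each sums to 1); s_x_prev: X's previous
        allocation, if any; delta: scaling of the policy-adjustment penalty."""
        s_x, s_y = np.asarray(s_x, float), np.asarray(s_y, float)
        payoff = float(np.sign(s_x - s_y).sum())          # sum_i sgn(x_i - y_i)
        if s_x_prev is not None and delta > 0.0:
            # Illustrative choice: L1 distance as the strategy-adjustment penalty.
            payoff -= delta * float(np.abs(s_x - np.asarray(s_x_prev, float)).sum())
        return payoff

    # A's opening strategy from Figure 1 against one possible reply by Y.
    print(stage_payoff([0.6, 0.2, 0.2], [0.0, 0.5, 0.5]))        # -> -1.0
    print(stage_payoff([0.6, 0.2, 0.2], [0.0, 0.5, 0.5],
                       s_x_prev=[0.4, 0.3, 0.3], delta=0.25))    # -> approximately -1.1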
Best Response and Robust Best Response

In this section, we assess strategic uncertainty in the best response dynamics of Colonel Blotto by Monte Carlo simulation. Suppose A and B play repeated Colonel Blotto with M = 3 and without strategy adjustment cost. If A plays strategy (p1, p2, p3) at t = 0, B's best responses at t = 1 belong to the hatched areas in Figures 1 and 2. Suppose A anticipates B's best responses. What should A do at t = 1?

A's problem is that B can equally well choose any strategy that belongs to the best response set. If A fully accounts for the fact that B randomizes uniformly, A's response surface, presented in the right panel of Figure 2, has three local maxima, also listed in Table 1. Each of these maxima offers similar expected payoffs to A at t = 1, around 0.14. A's expected payoff is only 0.03 if it takes just a single realization of B's best response and tries to best respond to it.

Figure 2: Fitness surfaces for first-order rational player B (left panel) and robust second-order rational player A (right panel) at t = 1. A plays strategy (0.6, 0.2, 0.2) at t = 0.

Table 1: Expected payoffs for the robust, second-order rational player A at t = 1 and t = 2, under the assumption that player B plays simple myopic best response and player A chooses again after period 1.

    Name    Period 1 strategy      Expected payoff, period 1    Expected payoff, period 2
    A1      (0.42, 0.52, 0.04)     0.141                        0.201
    A2      (0.52, 0.24, 0.24)     0.145                        0.169
    A3      (0.42, 0.04, 0.52)     0.141                        0.201

What if A goes one step further and assumes that B will again best respond after period 1? A can construct a robust best response again, using the expected joint actions for period 1 as the departure point. This step differentiates between A's period 1 choices, clearly favoring two of them. This example underlines three lessons: (a) it is important for boundedly rational agents to recognize the problem of multiple, equally desirable choices for their adversary; (b) it is possible to hedge against such choices by anticipating the adversary's choice selection rule; and (c) planning for more than one period may further clarify the choice.
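The following sketch shows one way to reproduce the flavor of this Monte Carlo exercise under simplifying assumptions of our own: strategies live on a coarse simplex grid, "equally good" replies are those within a small numerical tolerance of B's maximal payoff, and A's robust reply is the candidate with the highest average payoff against that whole best-response set.

    import itertools
    import numpy as np

    def grid_strategies(m=3, steps=20):
        """All budget allocations over m fronts on a regular grid (illustrative discretization)."""
        out = []
        for parts in itertools.product(range(steps + 1), repeat=m - 1):
            if sum(parts) <= steps:
                out.append(np.array(list(parts) + [steps - sum(parts)]) / steps)
        return out

    def payoff(s_x, s_y):
        return float(np.sign(s_x - s_y).sum())

    def best_response_set(opponent, candidates, tol=1e-9):
        """Every candidate within tol of the best payoff against `opponent`."""
        scores = [payoff(c, opponent) for c in candidates]
        best = max(scores)
        return [c for c, s in zip(candidates, scores) if s >= best - tol]

    def robust_reply(opponent, candidates):
        """Candidate that maximizes the average payoff over the opponent's whole best-response set."""
        brs = best_response_set(opponent, candidates)
        return max(candidates, key=lambda c: np.mean([payoff(c, b) for b in brs]))

    candidates = grid_strategies()
    print(robust_reply(np.array([0.6, 0.2, 0.2]), candidates))   # a robust reply for A at t = 1

Averaging over the entire best-response set, rather than best responding to a single draw from it, is what distinguishes the robust reply from a naive best response.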
RENORA Algorithm

Recursive Simulation

A multiagent model of strategic interactions among some agents defines the space of feasible actions for each agent and constrains possible sequences of interactions among agents. It also stores a library of past interactions among agents, and calculates agents' payoffs for any allowable trajectory of interactions among them based on each agent's implicit or explicit preferences or utility function. (Latek, Axtell, and Kaminski 2009) discussed how to decompose a multiagent model into a state of the environment and agents' current actions, and how to use the model to define a mapping from the state of the environment and agents' current actions into a realization of the next state of the environment and agents' current payoffs. In this setting, an agent can use adaptive heuristics or statistical procedures to compute the probability distribution of payoffs for any action he can take, then pick an action that is in some sense suitable. Alternatively, he can simulate the environment forward in a cloned model, derive the probability distribution of payoffs for the actions he can take by simulation, and pick a suitable action. When applied to multiagent models, this recursive approach to decision making amounts to having simulated agents use simulation to make decisions (Gilmer 2003). We call this technology multiagent recursive simulation (MARS). This approach was used to create myopic, non-robust n-th order rational agents in (Latek, Axtell, and Kaminski 2009). Non-myopic behaviors were added in (Latek and Rizi 2010). The next section generalizes the MARS approach to cope with strategic uncertainty.

Implementing n-th Order Rationality by Recursive Simulation

In order to behave strategically, agents need access to plausible mechanisms for forming expectations of others' future behaviors. n-th order rationality is one such mechanism. An n-th order rational agent (NORA) assumes that other agents are (n − 1)-th order rational and best responds to them. A zeroth-order rational agent acts according to a non-strategic heuristic such as randomly drawing actions from the library of historical interactions or continuing the current action. A first-order rational agent assumes that all other agents are zeroth-order rational and best responds to them. A second-order rational agent assumes that all other agents are first-order rational and best responds to them, and so forth. Observe that if the assumption of a second-order rational agent about other agents is correct, they must assume that the second-order rational agent is a zeroth-order rational agent rather than a second-order rational agent. In other words, an agent's assumptions on other agents' rationality orders ultimately reflect an assumption on how he perceives their assumptions on his own rationality.

To describe the algorithm that introduces non-myopic NORA into a given model Ψ, we denote the level of rationality of a NORA by d = 0, 1, 2, .... We label the i-th NORA corresponding to level of rationality d as A_d^i and a set containing its feasible actions as S_d^i:

    d = 0: a zeroth-order rational agent A_0^i chooses an action in S_0^i;
    d = 1: a first-order rational agent A_1^i chooses an action in S_1^i;

and so forth. If an agent is denoted A_d^i, from his point of view the other agent must be A_{d−1}^{−i}. If the superscript index is absent, as in A_d, we may refer to either of the two agents, assuming that the level of rationality is d. If the subscript is absent, as in A^i, we refer to agent i regardless of his level of rationality. Now we show how a myopic NORA uses MARS to plan his action.

First, let us take the case of A_0. Set S_0 contains feasible actions that are not conditioned on A_0's expectations of what the other agent will do. So A_0 does not assume that the other agent optimizes, and arrives at S_0 by using non-strategic heuristics such as expert advice, drawing actions from a fixed probability distribution over the action space, or sampling the library of historical interactions. We set the size of the action sample for A_0 as κ + 1 = |S_0|. For example, consider the iterated prisoner's dilemma game. The equivalent retaliation heuristic called "tit for tat" is an appropriate non-strategic d = 0 behavior with κ = 0. For notational ease, we will say that using the non-strategic heuristic is equivalent to launching NORA for d = 0:

    S_0^i = NORA(A_0^i, ·).

A_0 adopts one action in the set of feasible actions S_0 after it is computed. Recall that A_1 forms S_1 by best responding to the S_0 adopted by the other agent in Ψ, whom he assumes to be A_0. So A_1 finds a strategy that on average performs best when the A_0 it faces adopts any course of action in its S_0. A_1 takes K samples of each pair of his candidate actions and feasible A_0 actions in order to integrate out the stochasticity of Ψ. For higher orders of d, A_d^i does not consider all possible actions of his opponent A^{−i}, but focuses on κ + 1 historical actions and the τ most probable future actions computed under the d-th order rational assumption: A_{d−1}^{−i}.
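As a compact illustration of this recursion (not the paper's implementation), the sketch below computes the action set of a d-th order rational player in a two-player one-shot game by best responding, on average, to the simulated action set of a (d − 1)-th order opponent. The payoff-table interface and the matching-pennies demo are our own assumptions.

    import itertools
    from typing import Dict, List, Sequence, Tuple

    Payoffs = Dict[Tuple[str, str], Tuple[float, float]]   # (a_row, a_col) -> (u_row, u_col)

    def nora_action_set(payoffs: Payoffs, actions: Sequence[Sequence[str]],
                        player: int, d: int, tau: int = 1) -> List[str]:
        """Action set of a d-th order rational player in a two-player one-shot game."""
        if d == 0:
            return list(actions[player])          # level 0: one possible non-strategic choice set
        other = 1 - player
        opp_set = nora_action_set(payoffs, actions, other, d - 1, tau)
        def avg_payoff(a: str) -> float:          # hedge over the opponent's whole set
            total = 0.0
            for b in opp_set:
                key = (a, b) if player == 0 else (b, a)
                total += payoffs[key][player]
            return total / len(opp_set)
        ranked = sorted(actions[player], key=avg_payoff, reverse=True)
        return ranked[:tau]                       # keep the tau best replies

    # Demo: matching pennies; a first-order row player facing a level-0 column player.
    acts = (["H", "T"], ["H", "T"])
    pay = {(r, c): (1.0, -1.0) if r == c else (-1.0, 1.0)
           for r, c in itertools.product("HT", "HT")}
    print(nora_action_set(pay, acts, player=0, d=1, tau=1))

In the demo both replies earn the same average payoff against the level-0 set, so the reported singleton depends only on ordering, which is exactly the strategic uncertainty discussed above.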
Algorithm 1 goes further and modifies the MARS principle to solve the multiple-period planning problem for NORA. (Haruvy and Stahl 2004) showed that n-th order rationality can be used to plan for longer horizons in repeated matrix games; however, no solution exists for a general model. In particular, we need a decision rule that enables A_d to derive optimum decisions if (a) he wishes to plan for more than one step; (b) he takes random lengths of time to take an action or aborts the execution of an action midcourse; and (c) he interacts asynchronously with other NORA. To address these issues, we introduce the notion of a planning horizon h. While no classic solution to problems (b) and (c) exists, the classic method of addressing (a), that is, finding the optimum of h × number of actions, leads to an exponential explosion in computational cost. RENORA handles (a), (b) and (c) simultaneously by exploiting probabilistic replanning (Yoon, Fern, and Givan 2007), hence the name replanning NORA (RENORA). In short, RENORA plan their first action, knowing that they will have to replan after the first action is executed or aborted. The expected utility of the second, replanned action is added to that of the first, and so forth. Therefore, RENORA avoid taking actions that lead to states of the environment without good exits. We will show that the planning horizon h and τ play important roles in solving the strategic uncertainty problem.

Figure 3: Fitness surfaces for different variants of player B and robust player A at t = 1 when A plays (0.2, 0.6, 0.2) at t = 0. Panels: (a) RENORA(A_1^B, 1) with τ_A = 1; (b) RENORA(A_2^A, 1) with τ_A = 1; (c) RENORA(A_2^A, 2) with τ_A = 20.
In Algorithm 1, parameter K, the number of repetitions of the simulation for each complete joint action scenario, controls the desired level of robustness with respect to environmental noise. Parameters τ and κ enable decision makers to control the forward- versus backward-looking bias of RENORA. K determines the number of times a pair of agent strategies is played against each other; therefore, higher K reduces the effects of model randomness on the chosen action. τ is the number of the opponent's equally good future actions an agent is willing to hedge against, whereas κ represents the number of the opponent's actions an agent wishes to draw from history.

RENORA generalizes fictitious best response: κ represents the number of actions an agent wishes to draw from history, so the higher κ ≥ 1 is, the closer the agent comes to playing fictitious best response at d = 1. If κ = 0, a d ≥ 2 agent has a purely forward-looking orientation, as in the classical game theory setup. Figure 3 shows fitness surfaces generated by RENORA that can be directly compared with the results of the Monte Carlo study from Figure 2 and Table 2.
Input: parameters K, τ, κ; state of simulation Ψ
Output: set S_d^i of optimal actions for A_d^i

Compute S_{d−1}^{−i} = RENORA(A_{d−1}^{−i}, h)
foreach action a_d available to A_d^i do
    Initialize the action's payoff p̄(a_d) = 0
    foreach repetition i ∈ 1..K do
        foreach a_{d−1} ∈ S_{d−1}^{−i} do
            s = cloned Ψ
            Assign initial action a_d to self
            Assign initial action a_{d−1} to the other agent
            while s.time() < h do
                if a_d is not executing then
                    Replace a_d with an action taken at random from the set RENORA(A_d^i, h − s.time())
                end
                if a_{d−1} is not executing then
                    Replace a_{d−1} with an action picked at random from RENORA(A_{d−1}^{−i}, h − s.time())
                end
                Accumulate A^i's payoff += s(a_{d−1}, a_d) by running an iteration of the cloned simulation
            end
        end
    end
    Compute the average strategy payoff p̄(a_d) over all taken samples
end
Eliminate all but the τ best actions from the set of initial actions available to A_d^i
Compute the set S_0^i = RENORA(A_0^i, h)
Add both sets, arriving at S_d^i

Algorithm 1: Algorithm RENORA(A_d^i, h) for d, h > 0.
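Below is a minimal Python rendering of the control flow of Algorithm 1. The cloned-model interface (actions(), clone(), time(), advance(), payoff(), finished()) and the default parameter values are assumptions made for illustration; the sketch mirrors the structure of the pseudocode rather than reproducing the authors' implementation.

    import random
    from statistics import mean

    def renora(model, agent, other, d, h, K=3, tau=2, kappa=0, heuristic=None):
        """Sketch of RENORA(A_d, h): return up to tau forward-looking actions plus the
        level-0 set of size kappa + 1, following the structure of Algorithm 1.
        `model` is assumed to expose actions(agent), clone(), time(),
        advance(a_self, a_other), payoff(agent) and finished(action)."""
        draw = heuristic if heuristic is not None else (lambda m, ag: m.actions(ag))
        level0 = list(draw(model, agent))[:kappa + 1]          # non-strategic sample
        if d == 0:
            return level0
        opp_set = renora(model, other, agent, d - 1, h, K, tau, kappa, heuristic)
        scores = {}
        for a in model.actions(agent):
            samples = []
            for _ in range(K):                                 # average out model noise
                for b in opp_set:                              # hedge over the opponent's set
                    s = model.clone()
                    a_cur, b_cur, total = a, b, 0.0
                    while s.time() < h:
                        if s.finished(a_cur):                  # replan own action on completion
                            a_cur = random.choice(
                                renora(s, agent, other, d, h - s.time(),
                                       K, tau, kappa, heuristic))
                        if s.finished(b_cur):                  # replan the simulated opponent
                            b_cur = random.choice(
                                renora(s, other, agent, d - 1, h - s.time(),
                                       K, tau, kappa, heuristic))
                        s.advance(a_cur, b_cur)                # one iteration of the clone
                        total += s.payoff(agent)
                    samples.append(total)
            scores[a] = mean(samples)
        best_forward = sorted(scores, key=scores.get, reverse=True)[:tau]
        return best_forward + level0                           # union of the two sets

Because every replanning call shrinks the remaining horizon h − s.time(), the recursion bottoms out as the clone's clock approaches h, matching the replanning behavior described in the text.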
Discussion
RENORA decouples environment and behavior representation, thus injecting strategic reasoning into multiagent simulations, the most general paradigm for modeling complex systems to date (Axtell 2000). RENORA achieves this feat by bringing n-th order rationality and recursive simulation together. (Gilmer 2003) and (Gilmer and Sullivan 2005) used recursive simulation to help decision making; (Durfee and Vidal 2003; 1995), (Hu and Wellman 2001), and (Gmytrasiewicz, Noh, and Kellogg 1998) implemented n-th order rationality in multiagent models using pre-calculated equations; and (Parunak and Brueckner 2006) used stigmergic interactions to generate likely trajectories of interacting groups. However, Algorithm 1 combines the two techniques for the first time to produce robust decisions that hedge against both model stochasticity, by varying K > 0, and agents' coevolving strategies, with τ > 0.
RENORA is related to partially observable stochastic games (POSG) (Hansen, Bernstein, and Zilberstein
2004) and interactive, partially observable Markov decision
processes (IPOMDP) (Rathnasabapathy, Doshi, and Gmytrasiewicz 2006; Nair et al. 2002) frameworks. These frameworks use formal language and sequential decision modeling to encode dynamic and stochastic multiagent environments. They explicitly model other agents' beliefs, capabilities, and preferences as part of the state space perceived
by each agent. Each framework provides an equivalent notion of Nash equilibrium. A number of exact equilibrium
search algorithms exist, but their cost often grows doubly exponentially in the number of agents, the time horizon, and the size of the state-action space. Approximate algorithms are based on
filtering possible future trajectories of the system through
Monte Carlo tree search (Carmel and Markovitch 1996;
Chaslot et al. 2008), grouping behaviorally equivalent models of agents together (Seuken and Zilberstein 2007), iterated
elimination of dominated strategies, and applying the reductions and limits to nested beliefs of agents (Zettlemoyer,
Milch, and Kaelbling 2008). RENORA similarly performs
filtered search and subsequent optimization, using n-th order
rationality. However, unlike POSG and IPOMDP, RENORA
does not require that the model is rewritten using a specialized language; it works with any model that can be cloned.
Algorithm 1 derives robust and farsighted actions for an agent by computing the average best-response payoff over τ + κ + 1 actions of the opponent. Depending on the level of
κ + 1 actions of the opponent. Depending on the level of
risk an agent is willing to accept, the difference among these
averages may turn out to be of practical significance or not.
Therefore, it may make sense to perform sensitivity analysis
on parameters of RENORA and use other robustness criteria
like minimax, maximin or Hurwicz measures (Rosenhead,
Elton, and Gupta 1972).
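To illustrate how such alternative criteria could plug into the sensitivity analysis, the sketch below ranks candidate actions by different robustness measures applied to their sampled payoffs; the sample values and the Hurwicz coefficient are hypothetical.

    def hurwicz(payoffs, alpha=0.5):
        """Hurwicz criterion: a weighted blend of the best and worst sampled payoff."""
        return alpha * max(payoffs) + (1 - alpha) * min(payoffs)

    def rank_actions(samples, criterion):
        """Sort candidate actions by a robustness criterion applied to their payoff samples."""
        return sorted(samples, key=lambda a: criterion(samples[a]), reverse=True)

    # Hypothetical payoff samples for three candidate actions of a RENORA agent.
    samples = {"A1": [0.30, 0.00, 0.12], "A2": [0.15, 0.16, 0.17], "A3": [0.45, 0.01, 0.02]}
    print(rank_actions(samples, criterion=lambda p: sum(p) / len(p)))   # expected payoff
    print(rank_actions(samples, criterion=min))                         # maximin (worst case)
    print(rank_actions(samples, criterion=hurwicz))                     # Hurwicz blend

Note that the partly optimistic Hurwicz blend can reverse the ordering produced by the worst-case and expected-value criteria, which is exactly why the choice of measure deserves sensitivity analysis.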
Experimental Results
We have conducted three experiments using RENORA
agents A and B playing Colonel Blotto:
(1) Sweeping the spaces of dA and dB when τA = τB ∈
[1, 5];
(2) B ran RENORA(1, 1). We swept the spaces of dA and
τA , keeping hA = 1, τB = 1 and κ• = 0;
(3) B ran RENORA(1, 1). We swept the spaces of hA and
τA , keeping dA = 2, τB = 1 and κ• = 0.
In period t = 0, both agents behave randomly. We varied
the cost of strategy adjustment from 0 to 0.25 in all experiments. For each experiment and each set of parameters, we
executed an average of 10 runs of the game, each lasting 50
periods. Results use the average payoffs of agent A as the
target metric.
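For concreteness, the following sketch shows the kind of sweep-and-average harness these experiments imply; run_game() and its return values are hypothetical stand-ins, since the paper does not publish its simulation driver.

    import itertools
    import random
    from statistics import mean

    def run_game(d_a, d_b, tau, delta, periods=50, seed=0):
        """Hypothetical stand-in for one 50-period Blotto match between RENORA agents;
        returns a seeded placeholder payoff so the sweep below is runnable."""
        rng = random.Random(hash((d_a, d_b, tau, round(delta, 2), seed)))
        return rng.uniform(-0.2, 0.2)

    def sweep(d_values=range(5), tau_values=(1, 5), deltas=(0.0, 0.25), runs=10):
        """Average A's payoff over repeated runs for every (d_A, d_B, tau, delta) cell."""
        return {
            (d_a, d_b, tau, delta): mean(run_game(d_a, d_b, tau, delta, seed=r)
                                         for r in range(runs))
            for d_a, d_b, tau, delta in itertools.product(d_values, d_values,
                                                          tau_values, deltas)
        }

    results = sweep()
    print(results[(2, 1, 1, 0.0)])   # e.g., second-order A vs. first-order B, tau = 1, delta = 0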
Figure 4: A's payoffs as a function of the players' rationality levels dA × dB, the type of environment δ, and the number of forward-looking samples the players draw, τA = τB ∈ {1, 5}.

Figure 4 aggregates the results of Experiment (1). Regardless of the type of environment and how robust both agents attempt to be, A obtains the largest payoffs when dA = dB + 1, that is, when its beliefs about B's rationality are consistent with the truth. The higher the levels of rationality for both agents, the more opportunities there are for optimization noise in the RENORA best response calculation to compound. This problem is less pronounced in environments with strategic stickiness and can be further rectified by taking more forward-looking samples. This aspect is illustrated in Figure 5(a).
The second result links the number of forward-looking samples τ to the planning horizon h. In the Motivating Example section we found that the planning horizon can play a role in distinguishing between possible myopic robust choices. In Figure 5(b) we do not find this effect to persist under repeated interaction when
agents re-optimize each period. At the same time, increasing planning horizons or the number of forward-looking samples does not create adverse effects on agents' payoffs. One exception might be long planning horizons when the cost of strategy adjustment is also high, but the statistical significance of the impact is borderline.

On the other hand, increasing the planning horizon removes optimism bias in agents' planning. In Figure 6 we present predicted versus realized payoffs for a d = 2 agent playing against a myopic best responder. Even when τ = 10, the agent predicts a payoff of 0.25 but obtains an actual payoff of 0.12, in the same vicinity as the 0.14 obtained in Table 1 for a specific initial condition. Therefore, assessments of the future become much more realistic when agents increase their planning horizon, while actual payoffs converge faster than predicted payoffs as the number of forward-looking samples increases.
Summary
In this paper, we demonstrated how to hedge against strategic uncertainty using a multiagent implementation of n–th
order rationality for replanning agents with arbitrary planning horizons. The example we used, the Blotto game, is
rather simple and served as a proof-of-concept. We have
used RENORA in a number of richly detailed social simulations of markets and organizational behavior for which
writing POSG or IPOMDP formalizations is simply impossible.
The first example is the model presented in (Kaminski and
Latek 2010) that we used to study the emergence of price
discrimination in telecommunication markets. We showed
that the irregular topology of the call graph leads to the
emergence of price discrimination patterns that are consistent with real markets but are very difficult to replicate using
orthodox, representative-agent approaches. In (Latek and
Kaminski 2009) we examined markets in which companies
are allowed to obfuscate prices and customers are forced to
rely on their direct experience and signals they receive from
social networks to make purchasing decisions. We used the
RENORA-augmented model to search for market designs
that are robust with respect to the bounded rationality of
companies and customers. Finally, in (Latek, Rizi, and Alsheddi 2011) we studied how security agencies can determine a suitable blend of evidence on the historical patterns
of terrorist behavior with current intelligence on terrorists
in order to devise appropriate countermeasures. We showed
that terrorist organizations’ acquisition of new capabilities
at a rapid pace makes optimal strategies advocated by gametheoretic reasoning unlikely to succeed. Each of these simulations featured more than two strategic agents and many
thousands of non-strategic adaptive agents.
Our work on RENORA is not yet complete. Recently, a distribution-free approach to modeling incomplete-information games through robust optimization equilibrium has been proposed that seems to unify a number of other approaches (Aghassi and Bertsimas 2006). We are working toward using these results to formalize measures of robustness that can be used to drive parameter selection for RENORA. Secondly, computational game theorists have recently begun to deal with incorporating cognitive elements of bounded rationality, for example anchoring theories on human perceptions of probability distributions, into their models of Stackelberg security games (Pita et al. 2010). We intend to create templates of imperfect model cloning functions that serve a similar purpose. Lastly, the assumption that a d-order rational agent considers the other player to be (d − 1)-order rational is hard coded in RENORA. Our experiments have shown that the performance of RENORA in self-play strongly depends on this assumption being true. We plan to make RENORA more adaptive by combining it with an adaptation heuristic that allows agents to learn the actual rationality orders of their opponents during interactions.
Figure 5: Interaction between the planning horizon h, rationality levels d, and τ on player A's payoffs. Panel (a): d and τ; panel (b): h and τ. B uses RENORA(A_1^B, 1).
Figure 6: A's predicted and observed payoffs as a function of its planning horizon hA (panels for hA = 1, 2, 3) and the number of forward-looking samples τA. Levels of rationality are fixed at dA = 2 and dB = 1. B has planning horizon hB = 1 and τB = 1. In the upper panel, the cost of strategy adjustment δ is zero; in the lower panel, δ = 0.25. Thin lines correspond to a 95% confidence interval.
Acknowledgments

The authors were partly supported by Office of Naval Research (ONR) grant N00014–08–1–0378. Opinions expressed herein are solely those of the authors, not of George Mason University or the ONR.

References

Aghassi, M., and Bertsimas, D. 2006. Robust Game Theory. Mathematical Programming 107:231–273.
Arad, A., and Rubinstein, A. 2010. Colonel Blotto's Top Secret Files. Levine's Working Paper Archive, 1–23.
Axtell, R. 2000. Why Agents? On the Varied Motivations for Agent Computing in the Social Sciences. Technical Report 17, Center on Social Dynamics, The Brookings Institution.
Camerer, C. F.; Ho, T. H.; and Chong, J. K. 2004. A Cognitive Hierarchy Model of Games. Quarterly Journal of Economics 119:861–898.
Carmel, D., and Markovitch, S. 1996. Incorporating Opponent Models into Adversary Search. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI'96).
Chaslot, G.; Bakkes, S.; Szita, I.; and Spronck, P. 2008. Monte-Carlo Tree Search: A New Framework for Game AI. In Proceedings of the Fourth Artificial Intelligence and Interactive Digital Entertainment Conference.
Costa-Gomes, M. A.; Crawford, V. P.; and Iriberri, N. 2009. Comparing Models of Strategic Thinking in Van Huyck, Battalio, and Beil's Coordination Games. Journal of the European Economic Association 7:365–376.
Crawford, V., and Haller, H. 1990. Learning How to Cooperate: Optimal Play in Repeated Coordination Games. Econometrica 58(3):571–595.
Durfee, E. H., and Vidal, J. M. 1995. Recursive Agent Modeling Using Limited Rationality. In Proceedings of the First International Conference on Multi-Agent Systems, 125–132.
Durfee, E. H., and Vidal, J. M. 2003. Predicting the Expected Behavior of Agents That Learn About Agents: The CLRI Framework. Autonomous Agents and Multiagent Systems.
Gilmer, J. B., and Sullivan, F. 2005. Issues in Event Analysis
for Recursive Simulation. Proceedings of the 37th Winter
Simulation Conference 12–41.
Gilmer, J. 2003. The Use of Recursive Simulation to Support Decisionmaking. In Chick, S.; Sanchez, P. J.; Ferrin,
D.; and Morrice, D., eds., Proceedings of the 2003 Winter
Simulation Conference.
Gmytrasiewicz, P.; Noh, S.; and Kellogg, T. 1998. Bayesian
Update of Recursive Agent Models. User Modeling and
User-Adapted Interaction 8:49–69.
Golman, R., and Page, S. E. 2008. General Blotto: Games of Allocative Strategic Mismatch. Public Choice 138(3-4):279–299.
Gross, O., and Wagner, R. 1950. A Continuous Colonel
Blotto Game. Technical report, RAND.
Hansen, E. A.; Bernstein, D. S.; and Zilberstein, S. 2004. Dynamic Programming for Partially Observable Stochastic Games. In Proceedings of the National Conference on Artificial Intelligence. American Association for Artificial Intelligence.
Hart, S. 2007. Discrete Colonel Blotto and General Lotto
games. International Journal of Game Theory 36(3-4):441–
460.
Haruvy, E., and Stahl, D. 2004. Level-n Bounded Rationality on a Level Playing Field of Sequential Games. In
Econometric Society 2004 North American Winter Meetings. Econometric Society.
Hu, J., and Wellman, M. P. 2001. Learning about Other
Agents in a Dynamic Multiagent System. Cognitive Systems
Research 2:67–79.
Hu, J., and Wellman, M. P. 2003. Nash Q-Learning for
General-Sum Stochastic Games. Journal of Machine Learning Research 4(6):1039–1069.
Huyck, J.; Battalio, R.; and Beil, R. 1991. Strategic Uncertainty, Equilibrium Selection, and Coordination Failure
in Average Opinion Games. The Quarterly Journal of Economics 106(3):885–910.
Kaminski, B., and Latek, M. 2010. The Influence of Call
Graph Topology on the Dynamics of Telecommunication
Markets. In Jedrzejowicz, P.; Nguyen, N.; Howlet, R.; and
Jain, L., eds., Agent and Multi-Agent Systems: Technologies
and Applications, volume 6070 of Lecture Notes in Computer Science, 263–272.
Keynes, J. M. 1936. The General Theory of Employment, Interest and Money. Macmillan Cambridge University Press.
Latek, M. M., and Kaminski, B. 2009. Social Learning
and Pricing Obfuscation. In Hernández, C.; Posada, M.; and
López-Paredes, A., eds., Lecture Notes in Economics and
Mathematical Systems, 103–114. Springer.
Latek, M. M., and Rizi, S. M. M. 2010. Plan, Replan and Plan to Replan: Algorithms for Robust Courses of Action under Strategic Uncertainty. In Proceedings of the 19th Conference on Behavior Representation in Modeling and Simulation (BRIMS).
Latek, M. M.; Axtell, R.; and Kaminski, B. 2009. Bounded
Rationality via Recursion. Proceedings of The 8th International Conference on Autonomous Agents and Multiagent
Systems 457–464.
Latek, M.; Rizi, S. M. M.; and Alsheddi, T. A. 2011. Optimal Blends of History and Intelligence for Robust Antiterrorism Policy. Journal of Homeland Security and Emergency Management 8.
McKelvey, R., and Palfrey, T. 1995. Quantal Response Equilibria for Normal Form Games. Games and Economic Behavior 10(1):6–38.
McKelvey, R., and Palfrey, T. 1998. Quantal Response Equilibria for Extensive Form Games. Experimental Economics
1(1):9–41.
Nair, R.; Tambe, M.; Yokoo, M.; Pynadath, D.; and
Marsella, S. 2002. Towards Computing Optimal Policies
for Decentralized POMDPs. Technical Report WS-02-06,
AAAI.
Niou, E. M. S., and Ordeshook, P. C. 1994. A Game-Theoretic Interpretation of Sun Tzu's The Art of War. Journal of Peace Research 31:161–174.
Parunak, H. V. D., and Brueckner, S. 2006. Concurrent Modeling of Alternative Worlds with Polyagents. Proceedings of
the Seventh International Workshop on Multi-Agent-Based
Simulation.
Pita, J.; Jain, M.; Tambe, M.; Ordonez, F.; and Kraus, S.
2010. Robust Solutions to Stackelberg Games: Addressing
Bounded Rationality and Limited Observations in Human
Cognition. Artificial Intelligence 174(15):1142–1171.
Rathnasabapathy, B.; Doshi, P.; and Gmytrasiewicz, P. 2006. Exact Solutions of Interactive POMDPs Using Behavioral Equivalence. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS '06).
Rosenhead, J.; Elton, M.; and Gupta, S. 1972. Robustness
and Optimality as Criteria for Strategic Decisions. Operational Research Quarterly 23(4):413–431.
Seuken, S., and Zilberstein, S. 2007. Memory-Bounded Dynamic Programming for DEC-POMDPs. In Proceedings of the 20th International Joint Conference on Artificial Intelligence.
Stahl, D., and Wilson, P. 1994. Experimental Evidence on
Players’ Models of Other Players. Journal of Economic Behavior and Organization 25:309–327.
Suematsu, N., and Hayashi, A. 2002. A Multiagent Reinforcement Learning Algorithm Using Extended Optimal Response. In Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS '02), 370.
Yoon, S.; Fern, A.; and Givan, R. 2007. FF-Replan: A Baseline for Probabilistic Planning. In 17th International Conference on Automated Planning and Scheduling (ICAPS-07),
352–359.
Zettlemoyer, L. S.; Milch, B.; and Kaelbling, L. P. 2008.
Multi-Agent Filtering with Infinitely Nested Beliefs. In Proceedings of NIPS’08.