From: AAAI Technical Report SS-03-04. Compilation copyright © 2003, AAAI (www.aaai.org). All rights reserved.
Modeling and Interacting with Agents in Mixed Environments
Piotr J. Gmytrasiewicz
Department of Computer Science
University of Illinois at Chicago, Chicago, IL 60607-7053
piotr@cs.uic.edu
January 21, 2003
Abstract
We adopt the decision-theoretic principle of expected utility maximization as a paradigm for designing
autonomous rational agents operating in multi-agent and mixed environments. Our approach is based on
explicitly modeling other agents, human or artificial. The advantage of our approach is that explicit models
are the most comprehensive tools one can use to guide meaningful interactions. The disadvantage is that
formulating and maintaining explicit models can be computationally prohibitive.
1 Introduction
This paper seeks to address the issue of controlling autonomous agents interacting with other agents and humans within a common environment. We use the normative decision-theoretic paradigm of rational decision-making under uncertainty, according to which an agent should make decisions so as to maximize its expected
utility [7, 9, 11, 18, 34]. Decision theory is applicable to agents interacting with other agents and humans
because of uncertainty: The abilities, sensing capabilities, beliefs, goals, preferences, and, most importantly,
intentions of other agents and humans clearly are not directly observable and usually are not known with
certainty. Thus, according to Bayesian decision theory, agents should represent their uncertainty explicitly,
revise it based on information that becomes available during interactions, and use their up-to-date state of
knowledge to optimize their choices.
In decision theory, expected utility maximization is a theorem that follows from the axioms of probability
and utility theories [12, 29]. In other words, if an agent’s beliefs about the uncertain environment conform to
the axioms of probability theory, and its preferences obey the axioms of utility theory (see, for example, [34]
page 474), then the agent should choose its actions so as to maximize its expected utility. [Footnote 1: Some authors have expressed reservations as to the justifiability of these axioms. See the discussions in [26] and the excellent overview of descriptive aspects of decision theory in [8].]
Decision theory, embodied in the framework of partially observable Markov decision processes (POMDPs), forms a conceptual basis for deriving the optimal behavior of an agent in an uncertain environment.
The formalism of MDPs has been extended to multiple agents and has been called stochastic games
or Markov games [14, 32]. Traditionally, the solution concept used for stochastic games is that of Nash
equilibria. Some recent work in AI follows that tradition [6, 21, 24, 25]. However, while Nash equilibria are
useful for describing a multi-agent system if, and when, it has reached a stable state, this solution concept is
not sufficient as a general control paradigm in multi-agent and mixed environments. The main reasons are that there may be multiple equilibria with no clear way to choose among them (non-uniqueness), that equilibria do not specify actions in cases in which agents believe that other agents may not act according to their equilibrium strategies (incompleteness), and that some equilibrium solutions seem, simply put, irrational [5, 22].
Extensions of POMDPs to multiple agents appeared in [4, 33]. They have been called decentralized POMDPs (DEC-POMDPs), and are related to decentralized control problems [31]. However, the DEC-POMDP framework assumes that the agents are fully cooperative, i.e., that they have a common reward function and form a team. Further, it is assumed either that the optimal joint policy is computed centrally and then distributed among the agents, or that, if each of the agents computes the joint policy locally, they are guaranteed to obtain exactly the same solution. [Footnote 2: In [4] this assumption is supported by the assumption that the initial state of the system is commonly known.]
We propose a formalism that applies to autonomous agents with possibly conflicting objectives, operating
in partially observable environments, who locally compute what actions they should execute to optimize their
preferences given what they believe. We are motivated by a subjective approach to probability in games
[22], and we combine POMDPs, the notion of types from Bayesian games of incomplete information, work on
interactive belief systems [2, 1, 19, 27], and related work [5, 10].
Our approach follows the tradition of the knowledge-based paradigm of agent design, according to which agents need a representation of all of the important elements of their environment to allow them to function efficiently. The distinguishing feature of our approach is that it extends this paradigm to other agents: it relies on explicit modeling of other agents to predict their actions, and on optimal choice of the agent's own behavior given these predictions. The two main elements, predicting the actions of others and choosing one's own action, are both handled from the standpoint of decision theory. The descriptive power of decision-theoretic rationality, which describes the actions of rational individuals, is used to predict the actions of other agents. The prescriptive aspect of decision-theoretic rationality, which dictates that the optimal actions are the ones that maximize expected utility, is used by the agent to select its own action. An approach similar to ours has been called a decision-theoretic approach to game theory [22, 29], but a precise mathematical formulation of it has been lacking. Both decision theory and game theory have been applied to human decision making; this indicates that our approach is applicable to designing artificial systems that interact with humans as well as with other synthetic agents.
2 Background: Partially Observable Markov Decision Processes
A partially observable Markov decision process (POMDP) [7, 20, 23, 28] of an agent $i$ is defined as

$$ \mathrm{POMDP}_i = \langle S, A_i, T_i, \Omega_i, O_i, R_i \rangle \qquad (1) $$

where: $S$ is a set of possible states of the environment (defined as the reality external to the agent $i$); $A_i$ is a set of actions agent $i$ can execute; $T_i$ is a transition function, $T_i: S \times A_i \times S \rightarrow [0,1]$, which describes the results of agent $i$'s actions; $\Omega_i$ is the set of observations that the agent $i$ can make; $O_i$ is the agent's observation function, $O_i: S \times A_i \times \Omega_i \rightarrow [0,1]$, which specifies the probabilities of observations if agent $i$ executes various actions that result in different states; and $R_i$ is the reward function representing the agent $i$'s preferences, $R_i: S \times A_i \rightarrow \mathbb{R}$.
The POMDP framework is useful for defining the optimal behavior of an agent $i$, capable of executing actions $A_i$, in an environment with possible states $S$, given that the agent's sensing capabilities are described by the function $O_i$, that the agent's actions change the state according to the function $T_i$, and that the agent's preferences are as specified by the function $R_i$. To fully define the optimality of the agent's behavior we additionally need to specify its belief about the state of the environment, and its optimality criterion.
2.1 Belief and its Update
In POMDPs an agent's belief about the state is represented as a probability distribution over $S$ – it fully expresses the agent's current information about its environment. As the agent makes observations and acts, its belief about the state changes. Initially, before any observations or actions take place, the agent has some (prior) belief, $b^0$. After some number of time steps, $t$, we assume that the agent has made $t$ observations and has performed $t$ actions. (We assume that an action is taken at every time step; this is without loss of generality since one of the actions may be a No-op.) These can be assembled into the agent's observable history at time $t$: $h^t = \{ a^0, o^1, a^1, o^2, \ldots, a^{t-1}, o^t \}$. Let $H$ denote the set of all observable histories of the agent.

The agent's current belief over $S$ is continually revised based on the evolving observable history. It turns out that the agent's belief state is sufficient to summarize all of the past observable history and the initial belief (it is also called a sufficient statistic).
The belief update takes into account changes due to the action, $a^{t-1}$, executed at time $t-1$, and the next observation, $o^t$, received after the action. The new degree of belief, $b^t(s^t)$, that the current state is $s^t$, is:

$$ b^t(s^t) = \frac{O(s^t, a^{t-1}, o^t) \sum_{s^{t-1} \in S} T(s^{t-1}, a^{t-1}, s^t)\, b^{t-1}(s^{t-1})}{Pr(o^t \mid a^{t-1}, b^{t-1})} \qquad (2) $$

In the above computation the denominator can be treated as a normalizing constant. Let us summarize the above update, performed for all states $s^t$, as $b^t = SE(b^{t-1}, a^{t-1}, o^t)$ [23].
Given the initial belief, $b$, the probability that the belief state in the next time step will be equal to $b'$, given that the agent executes action $a$, will be denoted as $Pr(b' \mid a, b)$, and can be computed as:

$$ Pr(b' \mid a, b) = \sum_{o \in \Omega} Pr(b' \mid a, b, o)\, Pr(o \mid a, b) \qquad (3) $$

where $Pr(b' \mid a, b, o)$ is equal to 1 if $SE(b, a, o) = b'$ and 0 otherwise, and where $Pr(o \mid a, b)$ can be computed as:

$$ Pr(o \mid a, b) = \sum_{s' \in S} O(s', a, o) \sum_{s \in S} T(s, a, s')\, b(s) \qquad (4) $$
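To make the update concrete, the following is a minimal Python sketch of the state estimation function $SE$ of Eq. (2) and the observation probability of Eq. (4); the dictionary-based representation and the two-state example are illustrative assumptions, not part of the formalism.

```python
# A minimal sketch of the POMDP belief update of Eqs. (2)-(4), assuming a small
# discrete POMDP given by nested dictionaries:
#   T[s][a][s2]  = Pr(s2 | s, a)      (transition function)
#   O[s2][a][o]  = Pr(o | s2, a)      (observation function)
#   b[s]         = Pr(s)              (current belief)
# The dictionary layout and the tiny two-state example are illustrative assumptions.

def obs_prob(o, a, b, T, O):
    """Pr(o | a, b) of Eq. (4): probability of observing o after executing a in belief b."""
    return sum(O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in b) for s2 in O)

def belief_update(b, a, o, T, O):
    """SE(b, a, o) of Eq. (2): Bayes update of the belief after action a and observation o."""
    unnorm = {s2: O[s2][a][o] * sum(T[s][a][s2] * b[s] for s in b) for s2 in O}
    z = sum(unnorm.values())  # the denominator of Eq. (2), a normalizing constant
    return {s2: p / z for s2, p in unnorm.items()} if z > 0 else unnorm

if __name__ == "__main__":
    # Two hidden states; a "listen" action leaves the state unchanged and
    # produces a noisy observation of it.
    S = ["left", "right"]
    T = {s: {"listen": {s2: 1.0 if s2 == s else 0.0 for s2 in S}} for s in S}
    O = {s2: {"listen": {"hear-left": 0.85 if s2 == "left" else 0.15,
                         "hear-right": 0.15 if s2 == "left" else 0.85}}
         for s2 in S}
    b0 = {"left": 0.5, "right": 0.5}
    print(belief_update(b0, "listen", "hear-left", T, O))  # mass shifts toward "left"
```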
2.2 Optimality Criteria and Solutions
The agent's optimality criterion, $OC$, is needed to specify how rewards acquired over time accumulate. Commonly used criteria include:

- A finite horizon criterion, in which the agent maximizes the expected value of the sum of the first $T$ rewards: $E(\sum_{t=0}^{T} r_t)$, where $r_t$ is the reward obtained at time $t$ and $T$ is the length of the horizon. We will denote this criterion as $fh_T$.

- An infinite horizon criterion with discounting, according to which the agent maximizes $E(\sum_{t=0}^{\infty} \gamma^t r_t)$, where $0 < \gamma < 1$ is a discount factor. We will denote this criterion as $ih_{\gamma}$.

- An infinite horizon criterion with averaging, according to which the agent maximizes the average reward per time step. We will denote this as $ih_{AV}$.
In what follows, we only consider an infinite horizon criterion with discounting, for simplicity. The utility associated with a belief state, $b$, is composed of the best of the immediate rewards that can be obtained in $b$, together with the discounted expected sum of utilities associated with the belief states following $b$:

$$ U(b) = \max_{a \in A} \Big\{ \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{o \in \Omega} Pr(o \mid a, b)\, U\big(SE(b, a, o)\big) \Big\} \qquad (5) $$

The optimal action, $a^*$, for the case of the infinite horizon criterion with discounting, is then an element of the set of optimal actions for the belief state, $OPT(b)$, defined as:

$$ OPT(b) = \operatorname*{argmax}_{a \in A} \Big\{ \sum_{s \in S} b(s) R(s, a) + \gamma \sum_{o \in \Omega} Pr(o \mid a, b)\, U\big(SE(b, a, o)\big) \Big\} \qquad (6) $$
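Eq. (5) defines $U(b)$ recursively over a continuous space of beliefs; exact algorithms solve it, for instance, by value iteration over piecewise-linear value functions [20, 23]. As a rough illustration only, the sketch below approximates $U(b)$ and the set of Eq. (6) by expanding the recursion to a fixed depth, reusing the `belief_update` and `obs_prob` helpers sketched above; the depth cutoff and the reward-table layout `R[s][a]` are assumptions, not the solution methods discussed in the literature.

```python
# A depth-limited approximation of Eqs. (5)-(6), reusing belief_update and
# obs_prob from the previous sketch. R[s][a] is the immediate reward and gamma
# the discount factor; truncating the recursion at `depth` is an illustrative
# approximation, not an exact infinite-horizon solution.

def q_value(b, a, depth, A, Omega, T, O, R, gamma):
    """Immediate expected reward of a in b plus the discounted value of successor beliefs."""
    immediate = sum(b[s] * R[s][a] for s in b)
    future = 0.0
    for o in Omega:
        p_o = obs_prob(o, a, b, T, O)
        if p_o > 0.0:
            b_next = belief_update(b, a, o, T, O)
            future += p_o * utility(b_next, depth - 1, A, Omega, T, O, R, gamma)
    return immediate + gamma * future

def utility(b, depth, A, Omega, T, O, R, gamma):
    """Approximation of U(b) in Eq. (5), truncated after `depth` steps of lookahead."""
    if depth == 0:
        return 0.0
    return max(q_value(b, a, depth, A, Omega, T, O, R, gamma) for a in A)

def optimal_actions(b, depth, A, Omega, T, O, R, gamma):
    """Approximation of OPT(b) in Eq. (6): the actions whose value attains the maximum."""
    qs = {a: q_value(b, a, depth, A, Omega, T, O, R, gamma) for a in A}
    best = max(qs.values())
    return [a for a, q in qs.items() if abs(q - best) < 1e-9]
```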
2.3 Agent Types
The model above specifies the agent's optimal behavior, given its beliefs, its POMDP, and the optimality criterion the agent uses. These implementation-independent factors, grouped together, will be called a type, $\theta_i$, of agent $i$:

Definition 1: A type, $\theta_i$, of an agent $i$, is $\theta_i = \langle b_i, A_i, \Omega_i, T_i, O_i, R_i, OC_i \rangle$, where $b_i$ is agent $i$'s state of belief, $OC_i$ is its optimality criterion, and the rest of the elements are as defined above. Let $\Theta_i$ be the set of agent $i$'s types.
In defining the notion of an agent type, our intention is to list all of the implementation-independent factors relevant to the agent's behavior, under the assumption that the agent is Bayesian-rational. Given a type, the set of optimal actions on the left-hand side of Eq. (6) will be denoted as $OPT(\theta_i)$. We will generalize the notion of type below to situations which include interactions with other agents; it will then coincide with the notion of type used in Bayesian games [19, 14]. Of course, an agent's behavior can also depend on implementation-specific parameters, like processor speed, available memory, cognitive capacity, etc. These can be included in the (implementation-dependent, or complete) type, increasing the accuracy of the predicted behavior (see Footnote 3), but at the cost of additional complexity.
[Footnote 3: Types will be used to predict behavior, as we describe below, and as is customary in Bayesian games.]
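A type in the sense of Definition 1 can be represented directly as a record that groups the belief with the remaining POMDP elements and the optimality criterion. The sketch below shows one possible Python encoding; the field names and the use of callables for the functions are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet

# One possible encoding of Definition 1: a type theta_i groups agent i's belief
# b_i with its actions, observations, transition, observation, and reward
# functions, and its optimality criterion. The field names are illustrative.
@dataclass
class Type:
    belief: Dict[Any, float]                           # b_i: a probability distribution over states
    actions: FrozenSet[Any]                            # A_i
    observations: FrozenSet[Any]                       # Omega_i
    transition: Callable[[Any, Any, Any], float]       # T_i(s, a, s')
    observation_fn: Callable[[Any, Any, Any], float]   # O_i(s', a, o)
    reward: Callable[[Any, Any], float]                # R_i(s, a)
    criterion: str                                     # OC_i, e.g. "infinite-horizon, discounted"
```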
3 Interactive POMDPs
For simplicity of the presentation we consider an agent $i$ that is interacting with one other agent $j$.
Definition 2: An interactive POMDP of agent $i$, I-POMDP$_i$, is:

$$ \mathrm{I\text{-}POMDP}_i = \langle IS_i, A, T_i, \Omega_i, O_i, R_i \rangle \qquad (7) $$
where:
$IS_i$ is a set of interactive states defined as $IS_i = S \times M_j$, where $S$ is the set of states of the environment, and $M_j$ is the set of possible models of agent $j$. Models of agents list factors relevant to the agents' behavior. Analogously to states of the world, models of agents are intended to be rich enough to allow informed prediction about behavior. Let us note that the interactive states are subjective; the reality external to each agent, $i$ or $j$, is different in that it includes the agents other than itself. [Footnote 4: An alternative to defining states subjectively from the individual agents' point of view is to define a state space common to all agents.]
A general model of an agent, $m_j \in M_j$, is a function $m_j: H_j \rightarrow \Delta(A_j)$, where $H_j$ is the set of $j$'s observable histories, i.e., a mapping from histories to (probabilistic) predictions of $j$'s behavior. One version of a general model is obtained during so-called fictitious play [13, 30], where the probabilities of the agent's future actions are estimated as the frequencies of these actions observed in the past. Another simple version is a no-information model [16], according to which actions are independent of history and are assumed to occur with uniform probabilities, $1/|A_j|$ each.
Perhaps the most interesting model is the intentional model, defined to be the agent $j$'s type: $\theta_j = \langle b_j, A_j, \Omega_j, T_j, O_j, R_j, OC_j \rangle$. Each agent knows its own type, since it is private information. $i$'s belief is, therefore, a probability distribution over the states of the world and the models of the other agent, i.e., agent $j$. The notion of intentional model, or type, we use here coincides with the notion of type in game theory, where it is defined as embodying any of the agent $j$'s private information that is relevant to its decision making [19, 14]. In particular, if agents' beliefs are private information, then their types involve possibly infinitely nested beliefs over others' types and their beliefs about others. [Footnote 5: One can compare the infinite belief hierarchies to the decimal representation of $\pi$. The successive levels of nesting of the possibly infinite type hierarchy, like the digits of $\pi$, have progressively lower significance.] Such structures have been called knowledge-belief structures and type hierarchies in game theory [2, 1, 3, 27], and we called them recursive model structures in our prior work [16, 17]. The intentional models use the ascription of Bayesian rationality to agents. While computationally costly due to their explicit nature, they are the most comprehensive and general tools to predict the behavior of other agents.
$A = A_i \times A_j$ is the set of joint moves of all agents. Our intention is to include primarily physical actions (intended to change the physical state), and communicative actions (intended to be perceived by the other agent and to change its beliefs).
$T_i$ is a transition function, $T_i: IS_i \times A \times IS_i \rightarrow [0,1]$, which describes the results of the agents' actions. As we mentioned, actions can now change the physical reality – these are the physical actions – as well as the beliefs and models of the other agents. One can model communicative actions as changing the beliefs of the agents directly, but it may be more appropriate to model communication as action that can be observed by the other agents and that changes their beliefs indirectly.

One can make the following assumptions about $T_i$:
Definition 3: Under the belief non-manipulability (BNM) assumption agents' actions do not change the others' beliefs directly. Formally, $T_i(\langle s, \theta_j \rangle, a, \langle s', \theta'_j \rangle) \neq 0$ iff $b'_j = SE_{\theta_j}(b_j, a_j, o_j)$ for some $o_j \in \Omega_j$, where $b_j$ and $b'_j$ are the belief elements of $\theta_j$ and $\theta'_j$; that is, the other agent's belief can change only as a result of its own belief update, never as a direct effect of $i$'s action.
Belief non-manipulability is justified in usual cases of interacting autonomous agents. Autonomy precludes direct “mind control” and implies that the other agents’ belief states can be changed only indirectly, typically by changing the environment in a way observable to them.
$O_i$ is an observation function, $O_i: IS_i \times A \times \Omega_i \rightarrow [0,1]$.
Definition 4: Under the belief non-observability (BNO) assumption agents cannot observe each other's beliefs directly. Formally, for all observations $o_i \in \Omega_i$, states and actions, $O_i(\langle s, \theta_j \rangle, a, o_i) = O_i(\langle s, \theta'_j \rangle, a, o_i)$ whenever $\theta_j$ and $\theta'_j$ differ only in their belief components.
The belief non-observability assumption justifies our including the agent’s belief within its type, since
the latter embodies the factors private to the agent. The BNO assumption does not imply that the other
elements of the agent’s type are fully observable; indeed it seems unlikely they would be in practice.
$R_i$ is the reward function, defined as $R_i: IS_i \times A \rightarrow \mathbb{R}$. We allow the agent to have preferences over the physical states and over the models of the other agent, but usually only the physical state will matter.

$\Omega_i$ is defined as before, in the POMDP model.
Interactive POMDPs are a subjective counterpart to the objective, external view of agents taken in stochastic games [14], in the approaches of [6] and [24], and in decentralized POMDPs [4]. Interactive POMDPs represent an individual agent's point of view, and facilitate planning and problem solving at the agent's own individual level.
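The interactive state space $IS_i = S \times M_j$ pairs a physical state with a model of agent $j$, and the models range from simple history-independent predictors to full intentional types. The sketch below shows one way to organize this in Python; the class names, the add-one smoothing in the fictitious-play model, and the reuse of the `Type` record from the previous sketch are assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple

# Models of the other agent j. Each exposes action_probs(history_j), a
# (probabilistic) prediction of j's next action, in the spirit of the
# general model m_j described above. Class names are illustrative.

class NoInformationModel:
    """No-information model [16]: uniform over j's actions, independent of history."""
    def __init__(self, actions_j: List[Any]):
        self.actions_j = list(actions_j)

    def action_probs(self, history_j: Tuple = ()) -> Dict[Any, float]:
        p = 1.0 / len(self.actions_j)
        return {a: p for a in self.actions_j}

class FictitiousPlayModel:
    """Fictitious-play style model [13, 30]: empirical frequencies of j's observed actions."""
    def __init__(self, observed_actions: List[Any], actions_j: List[Any]):
        # Add-one smoothing is an assumption, used so unseen actions keep nonzero mass.
        self.counts = {a: 1 + observed_actions.count(a) for a in actions_j}

    def action_probs(self, history_j: Tuple = ()) -> Dict[Any, float]:
        total = sum(self.counts.values())
        return {a: c / total for a, c in self.counts.items()}

@dataclass
class IntentionalModel:
    """Intentional model: j's type theta_j; predictions come from solving for OPT(theta_j)."""
    type_j: Any  # e.g. the Type record sketched in Section 2.3

def interactive_states(physical_states: List[Any], models_j: List[Any]) -> List[Tuple[Any, Any]]:
    """Enumerate IS_i = S x M_j; agent i's belief is a distribution over these pairs."""
    return list(itertools.product(physical_states, models_j))
```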
3.1 Belief Update, Values and Optimality in I-POMDPs
Agent $i$'s beliefs evolve with time. The new belief state, $b_i^t$, is a function of the previous belief state, $b_i^{t-1}$, the last action, $a_i^{t-1}$, and the new observation, $o_i^t$, just as in POMDPs (see [34], Section 17.4, and [20, 23] for comparison). There are two differences that complicate things, however. First, since the state of the physical environment depends on the actions performed by both agents, the prediction of how the physical state changes has to be made based on the anticipated actions of the other agent. The probabilities of the other agent's actions are obtained from its models.
Secondly, changes in the models of the other agent have to be included in the update. Some changes may be directly attributed to the agents' actions [Footnote 6: For example, actions may change the agents' observation capabilities directly.], but, more importantly, the update of the other agent's beliefs due to its new observation has to be included. In other words, the agent has to update its beliefs based on how it anticipates the other agent updates its own beliefs, which includes how the other agent might expect the original agent to update, and so on. As should be expected, the update of the possibly infinitely nested belief over the other's types can itself be infinite. [Footnote 7: Continuing with the loose analogy between nested beliefs and $\pi$, one can compare the computations over a nested type hierarchy to computing the area of a circle using the formula $\pi r^2$. Both computations could go on forever, but the confidence in the results increases as the calculations proceed.]
Below, we will define the agent $i$'s belief update function, $b_i^t(is^t) = Pr(is^t \mid o_i^t, a_i^{t-1}, b_i^{t-1})$, where $is^t$ is an interactive state, $is^t \in IS_i$. We will use the belief state estimation function, $SE_{\theta_i}$, as an abbreviation for the belief updates over the individual states, so that $b_i^t = SE_{\theta_i}(b_i^{t-1}, a_i^{t-1}, o_i^t)$; $\tau(b_i^{t-1}, a_i^{t-1}, o_i^t, b_i^t)$ will stand for $Pr(b_i^t \mid b_i^{t-1}, a_i^{t-1}, o_i^t)$. We will also define the set of type-dependent optimal actions of an agent, $OPT(\theta_i)$.
Proposition 1: If the BNM and BNO assumptions hold, the belief update function for interactive POMDPs is:

$$ b_i^t(\langle s^t, \theta_j^t \rangle) = \beta \sum_{\langle s^{t-1}, \theta_j^{t-1} \rangle \in IS_i} b_i^{t-1}(\langle s^{t-1}, \theta_j^{t-1} \rangle) \sum_{a_j^{t-1} \in A_j} Pr(a_j^{t-1} \mid \theta_j^{t-1})\, T(s^{t-1}, a^{t-1}, s^t)\, O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t \in \Omega_j} O_j(s^t, a^{t-1}, o_j^t)\, \tau(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \qquad (8) $$

where $\tau$ is as defined before, now applied to agent $j$'s belief update, $b_j^{t-1}$ and $b_j^t$ are the belief elements of $\theta_j^{t-1}$ and $\theta_j^t$, respectively, $\beta$ is a normalizing constant, $O_j$ is the observation function in $\theta_j^t$, and $Pr(a_j^{t-1} \mid \theta_j^{t-1})$ is the probability of the other agent's action given its model. If the model is not intentional then the probabilities of actions are given directly by the model. If the model is the other agent's type, $\theta_j$, then this probability is equal to $Pr(a_j^{t-1} \mid OPT(\theta_j^{t-1}))$, i.e., to the probability that $a_j^{t-1}$ is Bayesian rational for that type of agent. We define $OPT$ below.

The proof of Proposition 1 is in [15].
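As a rough illustration of how the update of Eq. (8) can be computed for a small, finite problem, the sketch below updates a level-1 belief over interactive states $\langle s, b_j \rangle$, where the model of $j$ is summarized by $j$'s belief over the physical states and $j$'s frame is held fixed. The dictionary-based tables, and the helper names `predict_j` (standing in for $Pr(a_j \mid m_j)$, which for an intentional model would spread probability over $OPT(\theta_j)$) and `update_j` (standing in for $j$'s own $SE$ of Eq. (2)), are assumptions made for the sketch.

```python
# A minimal sketch of the I-POMDP belief update of Proposition 1 / Eq. (8) for
# one level of nesting. An interactive state is a pair (s, frozen belief of j).
# All signatures below are illustrative assumptions:
#   T[s][(a_i, a_j)][s2]      joint transition of the physical state
#   O_i_fn(s2, a_i, a_j, o_i) agent i's observation probability
#   O_j_fn(s2, a_i, a_j, o_j) the observation probability i ascribes to j
#   predict_j(b_j)            dict a_j -> Pr(a_j | model of j)
#   update_j(b_j, a_j, o_j)   j's own belief update (e.g. the SE of Eq. (2))

def freeze(b):
    """Make a belief dictionary hashable so it can serve as part of an interactive state."""
    return tuple(sorted(b.items()))

def ipomdp_belief_update(b_i, a_i, o_i, T, O_i_fn, O_j_fn, Omega_j, predict_j, update_j):
    """Return i's new belief over interactive states, following the structure of Eq. (8)."""
    new_b = {}
    for (s, fb_j), p_is in b_i.items():                 # sum over previous interactive states
        b_j = dict(fb_j)
        for a_j, p_aj in predict_j(b_j).items():        # sum over j's possible actions
            for s2, p_t in T[s][(a_i, a_j)].items():    # successor physical states
                w = p_is * p_aj * p_t * O_i_fn(s2, a_i, a_j, o_i)
                if w == 0.0:
                    continue
                for o_j in Omega_j:                     # sum over j's possible observations
                    p_oj = O_j_fn(s2, a_i, a_j, o_j)
                    if p_oj == 0.0:
                        continue
                    # j's belief changes only through its own update (the BNM assumption).
                    key = (s2, freeze(update_j(b_j, a_j, o_j)))
                    new_b[key] = new_b.get(key, 0.0) + w * p_oj
    z = sum(new_b.values())   # beta, the normalizing constant of Eq. (8)
    return {k: v / z for k, v in new_b.items()} if z > 0 else new_b
```

For an intentional model, `predict_j` itself requires solving agent $j$'s decision problem to obtain $OPT(\theta_j)$, which is where the nesting of models, and its computational cost, enters.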
Analogously to POMDPs, given an agent $i$'s type $\theta_i$ (a type includes the current belief, $b_i$), we can define the immediate reward for performing action $a$ as:

$$ r_i(\theta_i, a) = \sum_{is \in IS_i} b_i(is)\, R_i(is, a) \qquad (9) $$

Further, just as in POMDPs for the infinite horizon case, each belief state has an associated optimal value reflecting the maximum discounted payoff the agent can expect in this belief state:

$$ U(\theta_i) = \max_{a \in A_i} \Big\{ r_i(\theta_i, a) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a, b_i)\, U(\theta'_i) \Big\} \qquad (10) $$

where $\theta'_i$ is the type obtained from $\theta_i$ by replacing its belief with the updated belief $SE_{\theta_i}(b_i, a, o_i)$. Similarly, the agent $i$'s optimal action, $a^*$, for the case of the infinite horizon criterion with discounting, is an element of the set of optimal actions for the belief state, $OPT(\theta_i)$, defined as:

$$ OPT(\theta_i) = \operatorname*{argmax}_{a \in A_i} \Big\{ r_i(\theta_i, a) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a, b_i)\, U(\theta'_i) \Big\} \qquad (11) $$
4 Conclusions
This paper proposed a decision-theoretic approach to game theory (DTGT) as a paradigm for designing agents that are able to intelligently interact and coordinate actions with other agents in multi-agent and mixed human-agent environments. We defined a general multi-agent version of partially observable Markov decision processes, called interactive POMDPs, and discussed the assumptions, solution method, and an application of our approach.
Much future work is required to bring the ideas in this paper to implementation. On the theoretical side we need to investigate the properties of the proposed framework, including its convergence and approximation properties. We think that current POMDP solution algorithms can be modified to compute solutions to multi-agent planning problems. Finally, we will work on convincing examples and prototypes in domains of interest.
Acknowledgments
This research was supported by ONR grant N00014-95-1-0775, and by the National Science Foundation
CAREER award IRI-9702132.
References
[1] Robert J. Aumann. Interactive epistemology i: Knowledge. International Journal of Game Theory, (28):263–300,
1999.
[2] Robert J. Aumann and Adam Brandenburger. Epistemic conditions for Nash equilibrium. Econometrica, 1995.
[3] Robert J. Aumann and Aviad Heifetz. Incomplete information. In R. Aumann and S. Hart, editors, Handbook of
Game Theory with Economic Applications, Volume III, Chapter 43. Elsevier, 2002.
[4] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized
control of markov decision processes. To appear in Mathematics of Operations Research, 2002.
[5] Ken Binmore. Essays on Foundations of Game Theory. Pitman, 1982.
[6] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth
International Joint Conference on Artificial Intelligence, pages 478–485, August 1999.
[7] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[8] Colin Camerer. Individual decision making. In John H. Kagel and Alvin E. Roth, editors, The Handbook of
Experimental Economics. Princeton University Press, 1995.
[9] Jon Doyle. Rationality and its role in reasoning. Computational Intelligence, 8:376–409, 1992.
[10] Ronald R. Fagin, Joseph Y. Halpern, Yoram Moses, and Moshe Y. Vardi. Reasoning About Knowledge. MIT Press,
1995.
[11] Jerome A. Feldman and Robert F. Sproull. Decision theory and Artificial Intelligence II: The hungry monkey.
Cognitive Science, 1(2):158–192, April 1977.
[12] Peter Fishburn. The Foundations of Expected Utility. D. Reidel Publishing Company, 1981.
[13] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, 1998.
[14] Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.
[15] Piotr J. Gmytrasiewicz. A framework for sequential rationality in multi-agent settings. Technical report, University
of Illinois at Chicago, http://www.cs.uic.edu/ piotr/ 2003.
[16] Piotr J. Gmytrasiewicz and Edmund H. Durfee. Rational coordination in multi-agent environments. Autonomous
Agents and Multiagent Systems Journal, 3(4):319–350, 2000.
[17] Piotr J. Gmytrasiewicz and Edmund H. Durfee. Rational communication in multi-agent environments. Autonomous
Agents and Multiagent Systems Journal, 4:233–272, 2001.
[18] Peter Haddawy and Steven Hanks. Issues in decision-theoretic planning: Symbolic goals and numeric utilities. In
Proceedings of the 1990 DARPA Workshop on Innovative Approaches to Planning, Scheduling, and Control, pages
48–58, November 1990.
[19] John C. Harsanyi. Games with incomplete information played by ’Bayesian’ players. Management Science,
14(3):159–182, November 1967.
[20] Milos Hauskrecht. Value-function approximations for partially observable markov decision processes. Journal of
Artificial Intelligence Research, (13):33–94, July 2000.
[21] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Fifteenth
International Conference on Machine Learning, pages 242–250, 1998.
[22] Joseph B. Kadane and Patrick D. Larkey. Subjective probability and the theory of games. Management Science,
28(2):113–120, February 1982.
[23] Leslie P. Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable
stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.
[24] Daphne Koller and Brian Milch. Multi-agent influence diagrams for representing and solving games. In Seventeenth International Joint Conference on Artificial Intelligence, pages 1027–1034, Seattle, Washington, August
2001.
[25] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the
International Conference on Machine Learning, 1994.
[26] Per-Eric Malmnas. Axiomatic justification of the utility principle: A formal investigation. Synthese, (99):233–249,
1994.
[27] J.-F. Mertens and S. Zamir. Formulation of Bayesian analysis for games with incomplete information. International
Journal of Game Theory, 14:1–29, 1985.
[28] George E. Monahan. A survey of partially observable markov decision processes: Theory, models, and algorithms.
Management Science, pages 1–16, 1982.
[29] Roger B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, 1991.
[30] J. Nachbar. Evolutionary selection dynamics in games: Convergence and limit properties. International Journal
of Game Theory, 19:59–89, 1990.
[31] J. M. Ooi and G. W. Wornell. Decentralized control of a multiple access broadcast channel. In Proceedings of the
35th Conference on Decision and Control, 1996.
[32] Guillermo Owen. Game Theory: Second Edition. Academic Press, 1982.
[33] Ranjit Nair, David Pynadath, Milind Tambe, and Stacy Marsella. Towards computing optimal policies for decentralized POMDPs. In Notes of the 2002 AAAI Workshop on Game Theoretic and Decision Theoretic Agents, 2002.
[34] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.