From: AAAI Technical Report SS-03-04. Compilation copyright © 2003, AAAI (www.aaai.org). All rights reserved.

Modeling and Interacting with Agents in Mixed Environments

Piotr J. Gmytrasiewicz
Department of Computer Science
University of Illinois at Chicago, Chicago, IL 60607-7053
piotr@cs.uic.edu

January 21, 2003

Abstract

We adopt the decision-theoretic principle of expected utility maximization as a paradigm for designing autonomous rational agents operating in multi-agent and mixed environments. Our approach is based on explicitly modeling other agents, human or artificial. The advantage of our approach is that explicit models are the most comprehensive tools one can use to guide meaningful interactions. The disadvantage is that formulating and maintaining explicit models can be computationally prohibitive.

1 Introduction

This paper seeks to address the issue of controlling autonomous agents interacting with other agents and humans within a common environment. We use the normative decision-theoretic paradigm of rational decision-making under uncertainty, according to which an agent should make decisions so as to maximize its expected utility [7, 9, 11, 18, 34]. Decision theory is applicable to agents interacting with other agents and humans because of uncertainty: the abilities, sensing capabilities, beliefs, goals, preferences, and, most importantly, intentions of other agents and humans clearly are not directly observable and usually are not known with certainty. Thus, according to Bayesian decision theory, agents should represent their uncertainty explicitly, revise it based on information that becomes available during interactions, and use their up-to-date state of knowledge to optimize their choices.

In decision theory, expected utility maximization is a theorem that follows from the axioms of probability and utility theories [12, 29]. In other words, if an agent's beliefs about the uncertain environment conform to the axioms of probability theory, and its preferences obey the axioms of utility theory (see, for example, [34], page 474), then the agent should choose its actions so as to maximize its expected utility. (Some authors have expressed reservations as to the justifiability of these axioms; see the discussions in [26] and the excellent overview of descriptive aspects of decision theory in [8].)

Decision theory, based on the framework of partially observable Markov decision processes (POMDPs), forms a conceptual basis used to derive optimal behavior of an agent in an uncertain environment. The formalism of MDPs has been extended to multiple agents and has been called stochastic games or Markov games [14, 32]. Traditionally, the solution concept used for stochastic games is that of Nash equilibria. Some recent work in AI follows that tradition [6, 21, 24, 25]. However, while Nash equilibria are useful for describing a multi-agent system if, and when, it has reached a stable state, this solution concept is not sufficient as a general control paradigm in multi-agent and mixed environments. The main reasons are that there may be multiple equilibria with no clear way to choose among them (non-uniqueness), the fact that equilibria do not specify actions in cases in which agents believe that other agents may act not according to their equilibrium strategies (incompleteness), and the fact that some equilibrium solutions seem, simply put, irrational [5, 22].

Extensions of POMDPs to multiple agents appeared in [4, 33].
They have been called decentralized POMDPs (DEC-POMDPs), and are related to decentralized control problems [31]. However, the DEC-POMDP framework assumes that the agents are fully cooperative, i.e., that they share a common reward function and form a team. Further, it is assumed that the optimal joint policy is computed centrally and then distributed among the agents, or that, if each of the agents computes the joint policy locally, they are guaranteed to obtain exactly the same solution. (In [4] this assumption is supported by the assumption that the initial state of the system is commonly known.)

We propose a formalism that applies to autonomous agents with possibly conflicting objectives, operating in partially observable environments, who locally compute what actions they should execute to optimize their preferences given what they believe. We are motivated by a subjective approach to probability in games [22], and we combine POMDPs, the notion of types from Bayesian games of incomplete information, work on interactive belief systems [2, 1, 19, 27], and related work [5, 10]. Our approach follows the tradition of the knowledge-based paradigm of agent design, according to which agents need a representation of all of the important elements of their environment to allow them to function efficiently. The distinguishing feature of our approach is that it extends this paradigm to other agents: it relies on explicit modeling of other agents to predict their actions, and on optimal choices of the agent's own behaviors given these predictions. The two main elements, one of predicting the actions of others and the other of choosing the agent's own action, are both handled from the standpoint of decision theory. The descriptive power of decision-theoretic rationality, which describes the actions of rational individuals, is used to predict the actions of other agents. The prescriptive aspect of decision-theoretic rationality, which dictates that the optimal actions chosen are the ones that maximize expected utility, is used by the agent to select its own action. An approach similar to ours has been called a decision-theoretic approach to game theory [22, 29], but a precise mathematical formulation of it has been lacking. Both decision theory and game theory have been applied to human decision making; this indicates that our approach is applicable to designing artificial systems interacting with humans as well as with other synthetic agents.

2 Background: Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) [7, 20, 23, 28] of an agent $i$ is defined as

$$ POMDP_i = \langle S, A_i, T_i, \Omega_i, O_i, R_i \rangle \qquad (1) $$

where: $S$ is a set of possible states of the environment (defined as the reality external to the agent $i$); $A_i$ is the set of actions agent $i$ can execute; $T_i$ is a transition function, $T_i : S \times A_i \times S \rightarrow [0, 1]$, which describes the results of agent $i$'s actions; $\Omega_i$ is the set of observations that the agent $i$ can make; $O_i$ is the agent's observation function, $O_i : S \times A_i \times \Omega_i \rightarrow [0, 1]$, which specifies the probabilities of observations if the agent executes various actions that result in different states; $R_i$ is the reward function representing the agent $i$'s preferences, $R_i : S \times A_i \rightarrow \mathbb{R}$.

The POMDP framework is useful for defining the optimal behavior of an agent $i$, capable of executing actions $A_i$, in an environment with possible states $S$, given that the agent's sensing capabilities are described by $\Omega_i$ and the function $O_i$, that the agent's actions change the state according to the function $T_i$, and that the agent's preferences are as specified by the function $R_i$. To fully define the optimality of the agent's behavior we additionally need to specify its belief about the state of the environment, and its optimality criterion.
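As a concrete illustration (ours, not part of the original formulation), the tuple of Eq. (1) can be written down directly as a data structure. The class and field names below are assumptions for this sketch; the finite sets are represented by their sizes and the functions $T_i$, $O_i$, $R_i$ by arrays.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class POMDP:
    """A finite POMDP <S, A_i, T_i, Omega_i, O_i, R_i> as in Eq. (1)."""
    num_states: int        # |S|
    num_actions: int       # |A_i|
    num_observations: int  # |Omega_i|
    T: np.ndarray          # T[s, a, s'] = Pr(s' | s, a), shape (|S|, |A_i|, |S|)
    O: np.ndarray          # O[s', a, o] = Pr(o | s', a), shape (|S|, |A_i|, |Omega_i|)
    R: np.ndarray          # R[s, a] = immediate reward,   shape (|S|, |A_i|)
```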
2.1 Belief and its Update

In POMDPs an agent's belief about the state is represented as a probability distribution over $S$; it fully expresses the agent's current information about its environment. As the agent makes observations and acts, its belief about the state changes. Initially, before any observations or actions take place, the agent has some (prior) belief, $b_i^0$. After some number of time steps, $t$, the agent has made a sequence of observations and has performed a sequence of actions (we assume that an action is taken at every time step; this is without loss of generality since one of the actions may be a No-op). These can be assembled into agent $i$'s observable history, $h_i^t$, at time $t$. Let $H_i$ denote the set of all observable histories of agent $i$. Agent $i$'s current belief over $S$, $b_i^t$, is continually revised based on the evolving observable history. It turns out that the agent's belief state is sufficient to summarize all of the past observable history and the initial belief (it is also called a sufficient statistic). The belief update takes into account the changes due to the action, $a_i^{t-1}$, executed at time $t-1$, and the next observation, $o_i^t$, made after the action. The new degree of belief, $b_i^t(s')$, that the current state is $s'$, is:

$$ b_i^t(s') = \frac{O_i(s', a_i^{t-1}, o_i^t) \sum_{s \in S} T_i(s, a_i^{t-1}, s')\, b_i^{t-1}(s)}{\sum_{s' \in S} O_i(s', a_i^{t-1}, o_i^t) \sum_{s \in S} T_i(s, a_i^{t-1}, s')\, b_i^{t-1}(s)} \qquad (2) $$

In the above computation the denominator can be treated as a normalizing constant. Let us summarize the above update, performed for all states $s'$, as $b_i^t = SE(b_i^{t-1}, a_i^{t-1}, o_i^t)$ [23].

Given the initial belief, $b$, the probability that the belief state in the next time step will be equal to $b'$, given that the agent executes action $a$, will be denoted as $Pr(b' \mid a, b)$, and can be computed as:

$$ Pr(b' \mid a, b) = \sum_{o' \in \Omega_i} \delta(SE(b, a, o'), b')\, Pr(o' \mid a, b) \qquad (3) $$

where $\delta(SE(b, a, o'), b')$ is equal to 1 if $SE(b, a, o') = b'$ and 0 otherwise, and where $Pr(o' \mid a, b)$ can be computed as:

$$ Pr(o' \mid a, b) = \sum_{s' \in S} O_i(s', a, o') \sum_{s \in S} T_i(s, a, s')\, b(s) \qquad (4) $$

2.2 Optimality Criteria and Solutions

The agent's optimality criterion, $OC_i$, is needed to specify how the rewards acquired over time accumulate. Commonly used criteria include:

A finite horizon criterion, in which the agent maximizes the expected value of the sum of the first $T$ rewards, $E\left(\sum_{t=0}^{T} r_t\right)$, where $r_t$ is the reward obtained at time $t$ and $T$ is the length of the horizon.

An infinite horizon criterion with discounting, according to which the agent maximizes $E\left(\sum_{t=0}^{\infty} \gamma^t r_t\right)$, where $0 < \gamma < 1$ is a discount factor.

An infinite horizon criterion with averaging, according to which the agent maximizes the average reward per time step.

In what follows, we only consider the infinite horizon criterion with discounting, for simplicity. The utility associated with a belief state $b$ is composed of the best of the immediate rewards that can be obtained in $b$, together with the discounted sum of utilities associated with the belief states following $b$:

$$ U(b) = \max_{a \in A_i} \Big\{ \sum_{s \in S} b(s) R_i(s, a) + \gamma \sum_{b'} Pr(b' \mid a, b)\, U(b') \Big\} \qquad (5) $$

The optimal action, $a^*$, for the case of the infinite horizon criterion with discounting, is then an element of the set of optimal actions for the belief state, $OPT(b)$, defined as:

$$ OPT(b) = \arg\max_{a \in A_i} \Big\{ \sum_{s \in S} b(s) R_i(s, a) + \gamma \sum_{b'} Pr(b' \mid a, b)\, U(b') \Big\} \qquad (6) $$

2.3 Agent Types

The model above specifies the agent's optimal behavior, given its beliefs, its POMDP, and the optimality criterion the agent uses. These implementation-independent factors, grouped together, will be called a type, $\theta_i$, of agent $i$:

Definition 1: A type, $\theta_i$, of an agent $i$ is $\theta_i = \langle b_i, A_i, \Omega_i, T_i, O_i, R_i, OC_i \rangle$, where $b_i$ is agent $i$'s state of belief, $OC_i$ is its optimality criterion, and the rest of the elements are as defined above. Let $\Theta_i$ be the set of agent $i$'s types.

In defining the notion of an agent type, our intention is to list all implementation-independent factors relevant to the agent's behavior, under the assumption that the agent is Bayesian-rational. Given a type, the set of optimal actions on the left-hand side of Eq. (6) will be denoted as $OPT(\theta_i)$.
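As an illustration only (not part of the original paper), the update of Eq. (2) and the normalizer of Eq. (4) can be sketched in a few lines of Python. The array layout follows the earlier sketch, and the toy numbers in the usage example are made up.

```python
import numpy as np

def belief_update(T, O, b, a, o):
    """POMDP belief update of Eq. (2).

    T[s, a, s'] = Pr(s' | s, a); O[s', a, o] = Pr(o | s', a);
    b is the prior belief over S; a is the last action; o is the new observation.
    Returns the posterior belief b' and the normalizer Pr(o | a, b) of Eq. (4).
    """
    predicted = T[:, a, :].T @ b           # sum_s T(s, a, s') b(s), indexed by s'
    unnormalized = O[:, a, o] * predicted  # weight by the observation likelihood
    prob_o = unnormalized.sum()            # denominator of Eq. (2), i.e. Eq. (4)
    return unnormalized / prob_o, prob_o

# A toy two-state, two-action, two-observation problem (hypothetical numbers).
T = np.array([[[0.9, 0.1], [0.5, 0.5]],
              [[0.2, 0.8], [0.5, 0.5]]])
O = np.array([[[0.8, 0.2], [0.5, 0.5]],
              [[0.3, 0.7], [0.5, 0.5]]])
b0 = np.array([0.5, 0.5])
b1, p_o = belief_update(T, O, b0, a=0, o=1)   # SE(b0, a, o) of Eq. (2)
```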
We will generalize the notion of type below to situations which include interactions with other agents; it will then coincide with the notion of type used in Bayesian games [19, 14]. Of course, an agent's behavior can also depend on implementation-specific parameters, like the processor speed, the memory available, the cognitive capacity, and so on. These can be included in the (implementation-dependent, or complete) type, increasing the accuracy of the predicted behavior, but at the cost of additional complexity. (Types will be used to predict behavior, as we describe below, and as is customary in Bayesian games.)

3 Interactive POMDPs

For simplicity of presentation we consider an agent $i$ that is interacting with one other agent $j$.

Definition 2: An interactive POMDP of agent $i$, $I\text{-}POMDP_i$, is:

$$ I\text{-}POMDP_i = \langle IS_i, A, T_i, \Omega_i, O_i, R_i \rangle \qquad (7) $$

where:

$IS_i$ is a set of interactive states, defined as $IS_i = S \times M_j$, where $S$ is the set of states of the environment and $M_j$ is the set of possible models of agent $j$. Models of agents list the factors relevant to the agents' behavior. Analogously to states of the world, models of agents are intended to be rich enough to allow informed prediction about behavior. Let us note that the interactive states are subjective; the reality external to each agent is different in that it includes the agents other than itself. (An alternative to defining states subjectively, from the individual agents' points of view, is to define a state space common to all agents.) A general model of an agent, $m_j \in M_j$, is a function from observable histories to (probabilistic) predictions of $j$'s behavior, $m_j : H_j \rightarrow \Delta(A_j)$. One version of a general model is obtained during so-called fictitious play [13, 30], where the probabilities of the agent's future actions are estimated as the frequencies of these actions observed in the past. Another simple version is a no-information model [16], according to which actions are independent of history and are assumed to occur with uniform probabilities, $1/|A_j|$ each. Perhaps the most interesting model is the intentional model, defined to be the agent $j$'s type, $\theta_j$. Each agent knows its own type since it is its private information. Agent $i$'s belief is, therefore, a probability distribution over the states of the world and the models of the other agent, i.e., agent $j$. The notion of intentional model, or type, we use here coincides with the notion of type in game theory, where it is defined as embodying any of the agent's private information that is relevant to its decision making [19, 14]. In particular, if agents' beliefs are private information, then their types involve possibly infinitely nested beliefs over others' types and their beliefs about others. (One can compare the infinite belief hierarchies to the decimal representation of $\pi$: the successive levels of nesting of a possibly infinite type hierarchy, like the digits of $\pi$, have progressively lower significance.) They have been called knowledge-belief structures and type hierarchies in game theory [2, 1, 3, 27], and we have called them recursive model structures in prior work [16, 17]. The intentional models use the ascription of Bayesian rationality to agents. While computationally costly due to their explicit nature, they are the most comprehensive and general tools to predict the behavior of other agents.

$A$ is the set of joint moves of all agents, $A = A_i \times A_j$. Our intention is to include primarily physical actions (intended to change the physical state) and communicative actions (intended to be perceived by the other agent and to change its beliefs).

$T_i$ is a transition function, $T_i : IS_i \times A \times IS_i \rightarrow [0, 1]$, which describes the results of the agents' actions. As we mentioned, actions can change the physical reality (these are the physical actions), as well as the beliefs and models of the other agents. One can model communicative actions as changing the beliefs of the agents directly, but it may be more appropriate to model communication as an action that can be observed by others and that changes their beliefs indirectly.
One can make the following assumptions about $T_i$:

Definition 3: Under the belief non-manipulability (BNM) assumption, agents' actions do not change the other agents' beliefs directly. Formally, this means that the transition of an interactive state can depend on the agents' actions only through its physical state component; the other agent's belief changes only as a result of that agent's own belief update. Accordingly, in what follows we write the transition probabilities over the physical states as $T_i(s, a, s')$.

Belief non-manipulability is justified in the usual cases of interacting autonomous agents. Autonomy precludes direct "mind control" and implies that the other agents' belief states can be changed only indirectly, typically by changing the environment in a way observable to them.

$O_i$ is an observation function, $O_i : IS_i \times A \times \Omega_i \rightarrow [0, 1]$.

Definition 4: Under the belief non-observability (BNO) assumption, agents cannot observe each other's beliefs directly. Formally, for all observations, states and actions, the observation probabilities do not depend on the other agent's belief; in what follows we therefore write them as $O_i(s', a, o_i)$.

The belief non-observability assumption justifies our including the agent's belief within its type, since the latter embodies the factors private to the agent. The BNO assumption does not imply that the other elements of the agent's type are fully observable; indeed, it seems unlikely that they would be in practice.

$R_i$ is the reward function, defined as $R_i : IS_i \times A \rightarrow \mathbb{R}$. We allow the agent to have preferences over the physical states and over the states of the other agent, but usually only the physical state will matter.

$\Omega_i$ is defined as before, in the POMDP model.

Interactive POMDPs are a subjective counterpart to the objective external view of agents taken in stochastic games [14], in the approaches in [6] and [24], and in decentralized POMDPs [4]. Interactive POMDPs represent an individual agent's point of view, and facilitate planning and problem solving on the agent's own individual level.
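As an illustration of Definition 2 (again ours, with assumed names, rather than anything prescribed by the framework), the sketch below shows one way to represent interactive states and the two kinds of models of $j$ introduced above: subintentional general models, such as the no-information and fictitious-play models, and intentional models, i.e., types.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union
import numpy as np

# A general model maps agent j's observable history (action, observation pairs)
# to a probability distribution over j's actions.
History = List[Tuple[int, int]]
GeneralModel = Callable[[History], np.ndarray]

def no_information_model(num_actions: int) -> GeneralModel:
    """No-information model: uniform action probabilities, history is ignored."""
    return lambda history: np.full(num_actions, 1.0 / num_actions)

def fictitious_play_model(num_actions: int) -> GeneralModel:
    """Fictitious play: probabilities of j's future actions are estimated as the
    frequencies of its actions observed so far (uniform before any observations)."""
    def model(history: History) -> np.ndarray:
        if not history:
            return np.full(num_actions, 1.0 / num_actions)
        counts = np.bincount([a for a, _ in history], minlength=num_actions)
        return counts / counts.sum()
    return model

@dataclass
class Type:
    """An intentional model (a type) of agent j, as in Definition 1: j's own
    belief plus the remaining elements (A_j, Omega_j, T_j, O_j, R_j, OC_j)."""
    belief: np.ndarray   # j's own belief (kept flat here for brevity)

@dataclass
class InteractiveState:
    """An element of IS_i = S x M_j: a physical state paired with a model of j."""
    physical_state: int
    model_of_j: Union[GeneralModel, Type]
```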
3.1 Belief Update, Values and Optimality in I-POMDPs

Agent $i$'s beliefs evolve with time. The new belief state, $b_i^t$, is a function of the previous belief state, $b_i^{t-1}$, the last action, $a_i^{t-1}$, and the new observation, $o_i^t$, just as in POMDPs (see [34], Section 17.4, and [20, 23] for comparison). There are two differences that complicate things, however. First, since the state of the physical environment depends on the actions performed by both agents, the prediction of how the physical state changes has to be made based on the anticipated actions of the other agent. The probabilities of the other's actions are obtained from its models. Second, changes in the models of the other agent have to be included in the update. Some changes may be directly attributed to the agents' actions (for example, actions may change the agents' observation capabilities directly), but, more importantly, the update of the other agent's beliefs due to its new observation has to be included. In other words, the agent has to update its beliefs based on how it anticipates that the other agent updates its own beliefs, which includes how the other agent might expect the original agent to update, and so on. As should be expected, the update of the possibly infinitely nested belief over the other's types can also be infinite. (Continuing with the loose analogy between nested beliefs and $\pi$, one can compare the computations over the nested type hierarchy to computing the value of the area of a circle using the formula $\pi r^2$: both computations could go on forever, but the confidence in the results increases as the calculations proceed.)

Below, we will define agent $i$'s belief update function, $b_i^t(is^t) = Pr(is^t \mid b_i^{t-1}, a_i^{t-1}, o_i^t)$, where $is^t \in IS_i$ is an interactive state. We will use the belief state estimation function, $SE_i$, as an abbreviation for the belief updates performed for all interactive states, so that $b_i^t = SE_i(b_i^{t-1}, a_i^{t-1}, o_i^t)$; $\tau(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t)$ will stand for $Pr(b_j^t \mid b_j^{t-1}, a_j^{t-1}, o_j^t)$, i.e., for the other agent's own belief update. We will also define the set of type-dependent optimal actions of an agent, $OPT(\theta)$.

Proposition 1: If the BNM and BNO assumptions hold, the belief update function for interactive POMDPs is:

$$ b_i^t(is^t) = \beta \sum_{is^{t-1}} b_i^{t-1}(is^{t-1}) \sum_{a_j^{t-1}} Pr(a_j^{t-1} \mid m_j^{t-1})\, T_i(s^{t-1}, a^{t-1}, s^t)\, O_i(s^t, a^{t-1}, o_i^t) \sum_{o_j^t} O_j(s^t, a^{t-1}, o_j^t)\, \tau(b_j^{t-1}, a_j^{t-1}, o_j^t, b_j^t) \qquad (8) $$

where $is^{t-1} = \langle s^{t-1}, m_j^{t-1} \rangle$ and $is^t = \langle s^t, m_j^t \rangle$, $a^{t-1} = (a_i^{t-1}, a_j^{t-1})$ is the joint action, $b_j^{t-1}$ and $b_j^t$ are the belief elements of $m_j^{t-1}$ and $m_j^t$, respectively, $\beta$ is a normalizing constant, $O_j$ is the observation function in the model of $j$, and $Pr(a_j^{t-1} \mid m_j^{t-1})$ is the probability of the other agent's action given its model. If the model is not intentional, then the probabilities of the actions are given directly by the model. If the model is the other agent's type, $\theta_j$, then this probability is equal to $1/|OPT(\theta_j)|$ for the actions in $OPT(\theta_j)$ and zero otherwise, i.e., to the probability that the action is Bayesian rational for that type of agent. We define $OPT$ below. The proof of the Proposition is in [15].

Analogously to POMDPs, given an agent $i$'s type $\theta_i$ (the type includes the current belief $b_i$), we can define the expected immediate reward for performing action $a_i$ as:

$$ ER_i(\theta_i, a_i) = \sum_{is \in IS_i} b_i(is) \sum_{a_j} Pr(a_j \mid m_j)\, R_i(is, a_i, a_j) \qquad (9) $$

Further, just like in POMDPs for the infinite horizon case, each belief state has an associated optimal value reflecting the maximum discounted payoff the agent can expect in this belief state:

$$ U(\theta_i) = \max_{a_i \in A_i} \Big\{ ER_i(\theta_i, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, \theta_i)\, U(\theta_i') \Big\} \qquad (10) $$

where $\theta_i'$ differs from $\theta_i$ only in its belief element, which is updated to $SE_i(b_i, a_i, o_i)$. Similarly, the agent $i$'s optimal action, $a_i^*$, for the case of the infinite horizon criterion with discounting, is an element of the set of optimal actions for the belief state, $OPT(\theta_i)$, defined as:

$$ OPT(\theta_i) = \arg\max_{a_i \in A_i} \Big\{ ER_i(\theta_i, a_i) + \gamma \sum_{o_i \in \Omega_i} Pr(o_i \mid a_i, \theta_i)\, U(\theta_i') \Big\} \qquad (11) $$
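The full update of Eq. (8) recurses through the nested models. As a minimal illustration (our own sketch, with assumed array layouts, not the paper's implementation), the code below handles only the special case in which the model of $j$ is a fixed, history-independent distribution over $j$'s actions, such as the no-information model; Eq. (8) then reduces to a POMDP-style update of the belief over physical states in which the transition and observation probabilities are averaged over $j$'s predicted actions.

```python
import numpy as np

def ipomdp_belief_update_fixed_model(T, O_i, pr_aj, b, a_i, o_i):
    """Simplified instance of the I-POMDP belief update of Eq. (8).

    Special case: the model of j is a fixed, history-independent distribution
    over j's actions, so the interactive state reduces to the physical state.

    T[s, ai, aj, s']   = Pr(s' | s, ai, aj)
    O_i[s', ai, aj, o] = Pr(o | s', ai, aj)
    pr_aj[aj]          = Pr(aj | m_j), the model's prediction of j's action
    """
    num_states = T.shape[0]
    new_b = np.zeros(num_states)
    for aj, p_aj in enumerate(pr_aj):
        # Predict the physical state under the joint action (a_i, aj) and weight
        # by the observation likelihood and the predicted probability of aj.
        predicted = T[:, a_i, aj, :].T @ b
        new_b += p_aj * O_i[:, a_i, aj, o_i] * predicted
    return new_b / new_b.sum()   # beta, the normalizing constant of Eq. (8)
```

Handling intentional models would additionally require updating the nested belief $b_j$ via $\tau$, for every possible observation of $j$, which is where the recursive structure of Eq. (8) comes in.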
4 Conclusions

This paper proposed a decision-theoretic approach to game theory (DTGT) as a paradigm for designing agents that are able to intelligently interact and coordinate actions with other agents in multi-agent and mixed human-agent environments. We defined a general multi-agent version of partially observable Markov decision processes, called interactive POMDPs, and illustrated the assumptions, the solution method, and an application of our approach. Much future work is required to bring the ideas in this paper to implementation. On the theoretical side, we need to investigate the properties of the proposed framework, including its convergence and approximation properties. We think current POMDP solution algorithms can be modified to compute solutions to multi-agent planning problems. Finally, we will work on convincing examples and prototypes in domains of interest.

Acknowledgments

This research was supported by ONR grant N00014-95-1-0775, and by the National Science Foundation CAREER award IRI-9702132.

References

[1] Robert J. Aumann. Interactive epistemology I: Knowledge. International Journal of Game Theory, (28):263–300, 1999.
[2] Robert J. Aumann and Adam Brandenburger. Epistemic conditions for Nash equilibrium. Econometrica, 1995.
[3] Robert J. Aumann and Aviad Heifetz. Incomplete information. In R. Aumann and S. Hart, editors, Handbook of Game Theory with Economic Applications, Volume III, Chapter 43. Elsevier, 2002.
[4] Daniel S. Bernstein, Robert Givan, Neil Immerman, and Shlomo Zilberstein. The complexity of decentralized control of Markov decision processes. To appear in Mathematics of Operations Research, 2002.
[5] Ken Binmore. Essays on Foundations of Game Theory. Pitman, 1982.
[6] Craig Boutilier. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence, pages 478–485, August 1999.
[7] Craig Boutilier, Thomas Dean, and Steve Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
[8] Colin Camerer. Individual decision making. In John H. Kagel and Alvin E. Roth, editors, The Handbook of Experimental Economics. Princeton University Press, 1995.
[9] Jon Doyle. Rationality and its role in reasoning. Computational Intelligence, 8:376–409, 1992.
[10] Ronald R. Fagin, Joseph Y. Halpern, Yoram Moses, and Moshe Y. Vardi. Reasoning About Knowledge. MIT Press, 1995.
[11] Jerome A. Feldman and Robert F. Sproull. Decision theory and Artificial Intelligence II: The hungry monkey. Cognitive Science, 1(2):158–192, April 1977.
[12] Peter Fishburn. The Foundations of Expected Utility. D. Reidel Publishing Company, 1981.
[13] Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, 1998.
[14] Drew Fudenberg and Jean Tirole. Game Theory. MIT Press, 1991.
[15] Piotr J. Gmytrasiewicz. A framework for sequential rationality in multi-agent settings. Technical report, University of Illinois at Chicago, http://www.cs.uic.edu/~piotr/, 2003.
[16] Piotr J. Gmytrasiewicz and Edmund H. Durfee. Rational coordination in multi-agent environments. Autonomous Agents and Multiagent Systems Journal, 3(4):319–350, 2000.
[17] Piotr J. Gmytrasiewicz and Edmund H. Durfee. Rational communication in multi-agent environments. Autonomous Agents and Multiagent Systems Journal, 4:233–272, 2001.
[18] Peter Haddawy and Steven Hanks. Issues in decision-theoretic planning: Symbolic goals and numeric utilities. In Proceedings of the 1990 DARPA Workshop on Innovative Approaches to Planning, Scheduling, and Control, pages 48–58, November 1990.
[19] John C. Harsanyi. Games with incomplete information played by 'Bayesian' players. Management Science, 14(3):159–182, November 1967.
[20] Milos Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13:33–94, July 2000.
[21] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. In Fifteenth International Conference on Machine Learning, pages 242–250, 1998.
[22] Joseph B. Kadane and Patrick D. Larkey. Subjective probability and the theory of games. Management Science, 28(2):113–120, February 1982.
[23] Leslie P. Kaelbling, Michael L. Littman, and Anthony R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101(1–2):99–134, 1998.
[24] Daphne Koller and Brian Milch. Multi-agent influence diagrams for representing and solving games. In Seventeenth International Joint Conference on Artificial Intelligence, pages 1027–1034, Seattle, Washington, August 2001.
[25] M. L. Littman. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the International Conference on Machine Learning, 1994.
[26] Per-Eric Malmnas. Axiomatic justification of the utility principle: A formal investigation. Synthese, (99):233–249, 1994.
[27] J.-F. Mertens and S. Zamir. Formulation of Bayesian analysis for games with incomplete information. International Journal of Game Theory, 14:1–29, 1985.
[28] George E. Monahan. A survey of partially observable Markov decision processes: Theory, models, and algorithms. Management Science, pages 1–16, 1982.
[29] Roger B. Myerson. Game Theory: Analysis of Conflict. Harvard University Press, 1991.
[30] J. Nachbar. Evolutionary selection dynamics in games: Convergence and limit properties. International Journal of Game Theory, 19:59–89, 1990.
[31] J. M. Ooi and G. W. Wornell. Decentralized control of a multiple access broadcast channel. In Proceedings of the 35th Conference on Decision and Control, 1996.
[32] Guillermo Owen. Game Theory: Second Edition. Academic Press, 1982.
[33] Ranjit Nair, Milind Tambe, David Pynadath, and Stacy Marsella. Towards computing optimal policies for decentralized POMDPs. In Notes of the 2002 AAAI Workshop on Game Theoretic and Decision Theoretic Agents, 2002.
[34] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.