From: AAAI Technical Report WS-93-06. Compilation copyright © 1993, AAAI (www.aaai.org). All rights reserved.

PAGODA: An Integrated Architecture for Autonomous Intelligent Agents

Marie desJardins
SRI International
333 Ravenswood Ave.
Menlo Park, CA 94025
marie@erg.sri.com

Abstract

PAGODA (Probabilistic Autonomous GOal-Directed Agent) is an autonomous intelligent agent that explores a novel environment, building a model of the world and using the model to plan its actions [desJardins, 1992b]. PAGODA incorporates solutions to the problems of selecting learning tasks, choosing a learning bias, classifying observations, and performing inductive learning of world models under uncertainty, in an integrated system for planning and learning in complex domains. This paper raises some key issues in building autonomous embedded agents and shows how PAGODA addresses these issues. The probabilistic learning mechanism that PAGODA uses to build its world model is described in more detail.

1 Introduction

PAGODA is an autonomous intelligent agent that explores a novel, possibly nondeterministic, environment, building a probabilistic model of the world and using the model to plan its actions to maximize utility. The guiding principles behind PAGODA include probabilistic representation of knowledge, Bayesian evaluation techniques, and limited rationality as a normative behavioral goal. The key properties of PAGODA are:

- The agent operates autonomously, with minimal intervention from humans, and does not require a teacher to present or classify learning instances, or to provide a representation for learned theories.
- The agent handles uncertainty due to inaccurate sensors, randomness in the environment, and sensory limitations. The learned theories express observed uncertainty explicitly.

Section 2 motivates the problem by defining embedded limited rational agents and autonomous learning. Section 3 describes the architecture and the components of PAGODA; the probabilistic learning component is presented in more detail in Section 4. Conclusions are given in Section 5.

2 The Problem of Autonomous Learning

In this section, we define the concepts of embedded limited rational agents and autonomous learning and discuss four specific problems that arise from these definitions.

An embedded agent consists of three components: a transducer, a learning module, and a planner. Embedded agents must interact with their environments in real time, and are continuously being affected by and manipulating the environment. The sensory inputs received from the transducer, which may be incomplete or inconsistent, are referred to as the agent's perceived world. A perceived world may correspond to many actual world states. The agent's actions allow it to move about in limited ways, usually with limited accuracy. The learning module uses background knowledge and its sensory inputs to build a model of the world for the planner to use. In our model, the agent initially knows the set of actions it can execute, but not what effect those actions have.

A rational agent chooses actions that maximize its expected utility. A limited rational agent takes into account the cost of time, balancing time spent deliberating with performing external actions. An autonomous agent operates independently of human intervention. Specifically, it does not require inputs (except for its initial state) to tell it what its goals are, how to behave, or what to learn.
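To make the expected-utility criterion concrete, here is a minimal sketch in Python. It assumes a hypothetical world model that maps a perceived world and an action to a distribution over outcomes, and a utility function over outcomes; all of the names are expository assumptions, not part of PAGODA.

```python
from typing import Any, Callable, Dict, Iterable

PerceivedWorld = Any
Action = str
Outcome = Any
# A probabilistic world model: (perceived world, action) -> distribution over outcomes.
WorldModel = Callable[[PerceivedWorld, Action], Dict[Outcome, float]]

def expected_utility(world: PerceivedWorld, action: Action,
                     model: WorldModel, utility: Callable[[Outcome], float]) -> float:
    """E[U | world, action] under the learned (possibly uncertain) world model."""
    return sum(p * utility(o) for o, p in model(world, action).items())

def rational_choice(world: PerceivedWorld, actions: Iterable[Action],
                    model: WorldModel, utility: Callable[[Outcome], float]) -> Action:
    """A rational agent picks the action with maximum expected utility."""
    return max(actions, key=lambda a: expected_utility(world, a, model, utility))
```

A limited rational agent would additionally weigh the time spent computing these expectations against the utility of acting sooner.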
Deciding What to Learn: In a complex environment, the true world model will be too complicated for an agent with limited resources to learn completely. Therefore, the agent will have to focus attention on learning portions of this true world model. A rational agent should allocate its resources to maximize its ultimate goal achievement, by focusing its learning attention on whatever aspects of the world are expected to be most useful to learn.

Selecting Representations for Learning: An autonomous agent must decide what bias to use for each learning task. Bias, as defined by Mitchell [1980], is "...any basis for choosing one generalization over another, other than strict consistency with the observed training instances." Rather than requiring the designer to specify biases for each potential learning task, the agent must use background or learned knowledge to select biases as learning tasks arise.

Learning in Uncertain Domains: Many real-world environments contain uncertainty, which can arise from randomness, noise in the agent's sensors, sensory limitations, and/or complexity. Traditionally, learning has been defined as the problem of finding a theory that is consistent with all observed instances. However, when uncertainty is present, there may be no consistent theories under a reasonable learning bias. In this case, the agent must either settle for an inconsistent world model or represent uncertainty explicitly in its model.

Planning Under Uncertainty: When an agent's learned world model contains uncertainty, the agent needs a planning mechanism that can maximize goal satisfaction in the face of this uncertainty. Most AI planning techniques require deterministic models of the world, and are therefore inapplicable in this case. Fortunately, decision theory provides a paradigm for behaving optimally under uncertainty: a rational agent should choose whatever action maximizes its expected future average utility per unit time.

3 PAGODA: An Autonomous Agent Model

PAGODA (Probabilistic Autonomous GOal-Directed Agent) is a model for an intelligent agent that was motivated by the issues and problems discussed in the previous section. PAGODA is a limited semi-rational embedded agent that exhibits autonomous learning. We say "semi-rational" because PAGODA does not exhibit optimal resource-bounded behavior. However, the model does explicitly consider issues of limited rationality, providing important contributions towards building an optimal agent.

3.1 Architecture

PAGODA builds an explicit, predictive world model that its planner can use to construct action plans. The learning process is guided by the system's behavioral goal (to maximize utility), and background knowledge is used to select learning biases automatically. Figure 1 shows a schematic view of PAGODA.

[Figure 1: Schematic view of PAGODA, showing the environment, transducer, planner, and learner (goal generation, hypothesis generation, probabilistic evaluation) connected through a knowledge base of goals, theories, new theories, and biases.]

The behavior cycle of the agent is as follows (a code sketch of the cycle follows the list):

1. PAGODA's initial knowledge goal is to predict its utility: that is, it will first learn theories to predict the utility of performing various actions in specified world states. The agent's utility is provided as one of its sensory inputs.
2. Probabilistic background knowledge is used to assign a value to potential biases for each knowledge goal (Section 3.3). The bias with the highest value is sent to the hypothesis generator.
3. Sensory observations are sent to the agent by the transducer. Probabilities of old theories are updated to reflect the new evidence provided by the observations, and new theories are generated and evaluated (Section 4).
4. The planner analyzes the preconditions of the theories to determine which features of the environment will be most useful (with respect to maximizing utility) to learn (Section 3.2). These most useful preconditions are sent to the learner as new knowledge goals.
5. The planner initiates a forward search through the space of possible outcomes of actions, based on the probabilistic predictions made by the current best theories. The action which maximizes expected utility is taken (Section 3.4).
6. The action chosen by the planner is sent to the transducer, which executes the action in the real or simulated environment.
7. The sequence is repeated.
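The following is a minimal sketch of this cycle, assuming hypothetical transducer, learner, and planner objects with the method names shown; these interfaces are expository assumptions rather than PAGODA's actual code.

```python
def behavior_cycle(transducer, learner, planner, actions, steps=1000):
    """One possible rendering of the seven-step cycle above; a sketch, not the implementation."""
    learner.add_knowledge_goal("utility")                     # 1. initial knowledge goal: predict utility
    prev_world, prev_action = None, None
    for _ in range(steps):
        for goal in learner.knowledge_goals():                # 2. value potential biases; the best one
            learner.select_bias(goal)                         #    goes to the hypothesis generator (3.3)
        world = transducer.sense()                            # 3. new observation arrives; theories are
        if prev_action is not None:                           #    updated and new ones generated (Section 4)
            learner.observe(prev_world, prev_action, world)
        for feature in planner.useful_features(learner.model()):
            learner.add_knowledge_goal(feature)               # 4. most useful preconditions become new goals (3.2)
        action = planner.choose_action(world, learner.model(), actions)  # 5. forward search, max expected utility (3.4)
        transducer.execute(action)                            # 6. the chosen action is executed
        prev_world, prev_action = world, action               # 7. repeat
    return learner.model()
```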
3.2 Goal-Directed Learning

Initially, PAGODA has a trivial "theory" for features in its sensory inputs, in the sense that it can determine their values by examining its sensory inputs. Its learning effort is directed towards being able to predict feature values resulting from a proposed sequence of actions, by forming a model of the world that provides a mapping from perceived world and actions to feature values.

PAGODA incorporates a novel approach called Goal-Directed Learning (GDL), which allows the agent to decide what features of the world are most worth learning about [desJardins, 1992a]. The agent uses decision theory to compute the expected utility of being able to predict various features of the world, and formulates knowledge goals to predict the features with highest utility.

3.3 Evaluating Learning Biases

PAGODA uses probabilistic background knowledge to evaluate potential biases for each knowledge goal by computing how well each bias is expected to perform during future learning. Probabilistic Bias Evaluation (PBE) chooses a set of features that is as relevant as possible, without being so large that the complexity of the learning task is excessive [desJardins, 1991]. Each potential bias (set of input features) is assigned a value using a decision-theoretic computation which combines the expected accuracy of predictions over time with a time-preference (or discounting) function that expresses the agent's willingness to trade long-term for short-term performance. The computed value represents the expected discounted accuracy of predictions made by theories formed using the given bias. The bias with the highest value is used for learning.

3.4 Probabilistic Planning

PAGODA uses the principle of maximizing expected utility to choose its behaviors: it forward chains through the probability space of predictions, and selects the action that maximizes its expected utility. However, it currently only plans external actions and always searches to a fixed search depth, determined by the designer. The planner occasionally chooses a random action instead of selecting the best apparent action, in order to ensure that exploration continues and the agent does not get stuck on a local maximum, but it does not explicitly reason about the value of taking such sub-optimal actions.

4 Probabilistic Learning

PAGODA's theories are represented as sets of conditional probabilities. In order to evaluate them, the system trades off the predictive accuracy of a proposed theory with its simplicity using Bayesian theory evaluation.
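As a concrete (and purely hypothetical) rendering of this representation, a theory can be viewed as a set of rules, each pairing a conditioning context with an observed distribution over values of the predicted output feature; Section 4.1 describes how such rules are combined into predictions. The class and field names below are assumptions for exposition, not PAGODA's data structures.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Tuple

# A conditioning context: a partial description of the perceived world, possibly
# including the action taken, e.g. (("light", "on"), ("action", "grasp")).
Context = Tuple[Tuple[str, Any], ...]

@dataclass
class Rule:
    """One conditional distribution: P(output feature | conditioning context)."""
    context: Context
    distribution: Dict[Any, float]       # output-feature value -> observed probability

@dataclass
class Theory:
    """A set of rules predicting a single output feature."""
    output_feature: str
    rules: Dict[Context, Rule] = field(default_factory=dict)

    def add_rule(self, context: Context, distribution: Dict[Any, float]) -> None:
        self.rules[context] = Rule(context, distribution)
```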
The prior probability of a theory is a function of its simplicity; the likelihood of the evidence corresponds to the theory's accuracy with respect to the evidence. These are combined using Bayes' rule to yield an overall probability.

To distinguish between various levels of abstraction of theories, we use the terms class, structure, conditional probabilities, and specific theories. Each prior probability distribution is over theory classes; one or more theories are grouped into each class. The structure of a theory within a class refers to the conditioning contexts of the rules in the theory. When values for the conditional probabilities are added to these rules, the theory is called a specific theory. Simplicity is considered to be a measure applied to the class that a theory belongs in, rather than the theory itself. Various classification schemes may be used to measure simplicity; some possibilities are briefly presented in Section 4.2.2. The likelihood of the data (i.e., predictive accuracy), on the other hand, is a function of the specific theory and the conditional probabilities it contains.

4.1 Probabilistic Representation and Inference

A theory consists of a set of conditional probability distributions (called "rules"); each of these specifies the observed distribution of values of the predicted ("output") feature, given the conditioning context. Conditioning contexts consist of a perceived world and possibly an action taken by the agent.

PCI (Probability Computation using Independence) is a probabilistic inference mechanism that combines the conditional probabilities in a theory to make predictions about the output feature, given a proposed action and a perceived world [desJardins, 1993]. PCI determines which rules (conditional distributions) within a theory are relevant for making a prediction: these are the most specific conditioning contexts that apply to the perceived world. If there are multiple relevant rules, they are combined (using minimal independence assumptions) to get a single predicted distribution.

4.2 Bayesian Learning

In this section, we develop a Bayesian method for evaluating the relative likelihood of alternative theories. As we will show, only two terms need to be considered: the prior probability of the theory, P(T), and the accuracy of the theory, given by the likelihood of the evidence P(E|T). The former quantity is defined as a function of the simplicity of the class containing the theory; the latter is computed by applying PCI to the specific theory. The formula used to evaluate theories is derived in Section 4.2.1, and the prior probability distributions used by PAGODA are discussed in Section 4.2.2. Finally, likelihood is discussed in Section 4.2.3.

4.2.1 Bayesian Theory Evaluation

Recall that the theories being evaluated consist of conditional probabilities, which are determined empirically. These probabilities in a theory are distinct from the probability of a theory; the goal in this section is to find the latter. The theory with the highest probability should be the one with the most effective structure for representing the observed data. Given this structure, the conditional probabilities within the theory are straightforward to optimize. Complex structures (those with many dependencies) cost the agent in terms of space, computation time, and risk of overfitting. On the other hand, simple structures with only a few dependencies may not capture important relationships in the world.
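As a simplified illustration of the prediction step described in Section 4.1, the sketch below selects the most specific rules of a hypothetical Theory (as sketched earlier) that match a situation (a perceived world plus an action, given as a feature-value dictionary) and combines them into one predicted distribution. The multiplicative combination shown is only a stand-in for PCI's combination under minimal independence assumptions, not the actual computation [desJardins, 1993].

```python
from typing import Any, Dict, List

def most_specific_relevant_rules(theory: "Theory", situation: Dict[str, Any]) -> List["Rule"]:
    """Rules whose conditioning context matches the situation and is not strictly
    less specific than another matching rule."""
    matching = [r for r in theory.rules.values()
                if all(situation.get(k) == v for k, v in r.context)]
    return [r for r in matching
            if not any(set(r.context) < set(other.context) for other in matching)]

def predict(theory: "Theory", situation: Dict[str, Any]) -> Dict[Any, float]:
    """Combine the relevant rules into a single distribution over the output feature.
    (Simplified multiplicative combination; a stand-in for PCI, not PAGODA's method.)"""
    rules = most_specific_relevant_rules(theory, situation)
    values = set().union(*(r.distribution for r in rules)) if rules else set()
    combined = {v: 1.0 for v in values}
    for r in rules:
        for v in values:
            combined[v] *= r.distribution.get(v, 1e-6)     # floor avoids zeroing a value one rule omits
    total = sum(combined.values()) or 1.0
    return {v: p / total for v, p in combined.items()}
```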
The probability we wish to find, then, is the probability that the structure of this theory is the best representation of the behavior of the environment. It is not the probability that the particular values of the conditional (statistical) probabilities in the theory are correct, or even that they are close.¹ The statistical probabilities are estimated using observed frequencies; this maximizes the accuracy of the theory as given by the Bayesian likelihood P(E|T ∧ K).

Using the notation

  T  a proposed theory
  K  background knowledge
  E  evidence: a sequence of observations e_1, e_2, ..., e_n

Bayes' rule gives

  P(T|K ∧ E) = P(T|K) P(E|T ∧ K) / P(E|K)    (1)

Given that the same evidence and background knowledge is used to evaluate competing theories, the constant factor P(E|K) in the denominator can be dropped,² yielding

  P(T|K ∧ E) ∝ P(T|K) P(E|T ∧ K)    (2)

We also assume that the individual observations e_1 ... e_n composing E are independent, given K and T. This standard conditional independence assumption is reasonable, because the theories generated by the agent make independent predictions. Therefore, T embodies an assumption that the observations are independent, which must be true if T holds. Given this independence assumption, Equation 2 becomes:

  P(T|K ∧ E) ∝ P(T|K) ∏_{i=1}^{n} P(e_i|T ∧ K)    (3)

The first quantity on the right-hand side represents the "informed prior"--i.e., the probability of the theory given the background knowledge K, but no direct evidence. The second quantity represents the likelihood of the theory, i.e., the combined probabilities of each piece of evidence given the theory and K.

4.2.2 Prior Probability

The prior probability of a theory, P(T), is the probability of T before any evidence has been collected. A prior, however, is never completely uninformed: even before any direct observations about a particular learning task are made, an agent's past experience, available sensors, and internal representation will affect its disposition to believe a theory, and hence its prior probability distribution. All of the background knowledge available to an agent should ideally be reflected in its "prior."

The background knowledge K of Equation 3 consists of probabilistic background knowledge (provided by the designer) and theories about other output features, which are irrelevant to the probability of this theory. The probabilistic background knowledge is used by PAGODA to select the initial bias, i.e., the language in which its theories are represented [desJardins, 1991]. The relevant information in the probabilistic background knowledge is therefore implicit in the initial bias B, so that

  P(T|K) = P(T|B)

The search mechanism in PAGODA only explores the space defined by the initial bias. However, B does not provide a preference function over the space. We can drop the conditioning context B, since all theories are implicitly conditioned by the same bias.

A variety of justifications have been proposed for the use of simplicity as one test of the value of a theory. The most common reference to simplicity is Occam's razor, which tells us to select the simplest of the consistent theories.

¹ Which is not to say that the statistical probabilities aren't close, simply that we do not claim to measure their accuracy explicitly with this evaluation technique.
² Note that if we drop these normalizing factors, we no longer have a true probability distribution. However, for notational convenience, the resulting measure will still be called P.
But this has two problems: first, it does not tell us what to do when we have a complex theory with high accuracy on the learning set and a simpler, but slightly less accurate, theory. (Or, rather, it does tell us what to do--we are to reject all inconsistent theories out of hand, which doesn't seem reasonable.) Second, it does not provide a definition of simplicity.

The approach used by PAGODA is to evaluate theories by first determining the prior probability of the theory class, and then computing the likelihood of specific theories within the class. The prior probability is therefore a distribution over classes, rather than over theories. The prior probability of a theory P(T) refers to the probability that the true theory T' is in the same class as T, i.e.,

  P(T) = P(Class(T) = Class(T'))

We have used PAGODA as a testbed to experiment with several different metrics of simplicity. They differ in the level of classification (theories, rules, features, or terms) and in the method of finding the probability of the class (the two methods used are the uniform distribution and a Huffman encoding scheme).

In the uniform distribution on theories, all theories are placed in the same class, so equal prior probability is assigned to every theory. This prior leads to a preference for theories that exactly fit the data, if any exist. Under the rule-level classification of theories, all theories with the same number of rules are in the same class and have equal probability. This leads to a weak bias towards simpler theories--theories with fewer rules are preferred, but the complexity of the rules within the theory has no effect. In the feature-level classification of theories, a stronger bias towards simplicity, all theories with the same total number of features in the conditioning contexts of the rules are in the same class. Huffman encoding of terms is the strongest prior. The classification level is terms ("words" within feature descriptors), but a uniform distribution is not assumed. Rather, the frequency of terms within the theory is used to compute an optimal (Huffman) encoding for the theory, and the length of the encoded theory gives the negative logarithm of its probability.

Using PAGODA's Bayesian theory evaluation method, all of these priors converge to the best theory, given enough examples. However, we are interested in increasing learning performance in the short term, when a limited number of training examples may be available. We expect that in more complex worlds, the stronger biases towards simplicity will be better at avoiding overfitting in the short term, since stronger evidence is required to include a rule in a theory. In simple deterministic domains, we expect to find a good theory quickly, so the weaker biases should give good results. Preliminary experiments seem to bear this out, but do not yet provide a good understanding of when to select which prior.

4.2.3 Likelihood of Evidence

P(e_t|T ∧ K) is the probability of the observation made at time t, given a theory and background knowledge. P(e_t|T ∧ K) is equal to P(e_t|T) if e_t is conditionally independent of K given T, which is a reasonable assumption since the other theories in K make no predictions regarding T's output feature, and any relevant information in K has already been used to select the current bias.
If the theory being evaluated predicts the output feature o, e_t can be rewritten as

  senses_t ∧ action_t ∧ o_{t+1}

This is because in the current implementation of PAGODA, only features at time t are considered for predicting features at time t+1. This assumption does not affect the analysis, though; for example, features at time t-1 could be included without any significant modifications. The probability of e_t given the theory T is

  P(senses_t ∧ action_t ∧ o_{t+1} | T) = P(senses_t ∧ action_t | T) P(o_{t+1} | senses_t ∧ action_t ∧ T)

Since T makes no predictions regarding senses_t and action_t, the first term can be rewritten as the prior probability:

  P(senses_t ∧ action_t | T) = P(senses_t ∧ action_t)

We drop this term, since it is a constant for a given set of observations. The second term is computed by applying PCI to the specific theory T [desJardins, 1993].
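Putting Sections 4.2.1 through 4.2.3 together, a theory can be scored in the spirit of Equation 3 by adding a class-level log prior to the sum of per-observation log-likelihoods obtained from its predictions. The sketch below uses a rule-count penalty as a stand-in for the priors of Section 4.2.2 and the hypothetical predict function sketched earlier; all names and constants are assumptions, not PAGODA's implementation.

```python
import math
from typing import Any, Dict, List, Tuple

# An observation e_t: the conditioning situation (senses_t plus action_t) and the
# observed value of the output feature o at time t+1.
Observation = Tuple[Dict[str, Any], Any]

def evaluate_theory(theory: "Theory", observations: List[Observation], alpha: float = 1.0) -> float:
    """Unnormalized log P(T | E): a simplicity prior plus the summed log-likelihoods.
    The rule-count penalty is only a stand-in for the class-level priors of Section 4.2.2."""
    log_score = -alpha * len(theory.rules)                 # stand-in prior: fewer rules, higher prior
    for situation, outcome in observations:
        dist = predict(theory, situation)                  # predicted distribution over the output feature
        log_score += math.log(dist.get(outcome, 1e-6))     # log P(e_i | T), floored to avoid log(0)
    return log_score

def best_theory(candidates: List["Theory"], observations: List[Observation]) -> "Theory":
    """Select the candidate theory with the highest posterior score."""
    return max(candidates, key=lambda t: evaluate_theory(t, observations))
```

Working in log space here simply avoids numerical underflow when the evidence sequence is long; it does not change which theory is preferred.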
5 Conclusions and Future Work

PAGODA has been implemented in the RALPH (Rational Agent with Limited Processing Hardware) world, a simulated, nondeterministic robot domain. A good theory of this domain is non-trivial to represent. PAGODA learns a fairly good theory, but is limited by its inability to generate and represent internal states that would allow it to "remember" earlier sensory experiences to aid in predictions. Another result of these tests is that the choice of prior probability distribution has a significant impact on learning. Internal states and automated prior selection are important directions for future research.

The GDL component of PAGODA has not been extensively tested, since the RALPH domain is relatively feature-poor. However, even in this simple domain, GDL makes the optimal choice for a new feature to learn, improving the overall performance of the system within resource bounds. We plan to test GDL and the planning component in a domain with a larger number of interacting features.

There has been a fair amount of recent research on learning probabilistic theories, particularly decision trees (e.g., [Buntine, 1990, Quinlan, 1986]). We believe that the probabilistic rule-like representation used by PAGODA is a more natural representation, and that biases such as those PAGODA uses to prefer simple theories will lead to better hypotheses. However, we have not yet done any direct comparisons of PAGODA to these systems.

To our knowledge, there are no other intelligent agent architectures that integrate solutions to the range of problems PAGODA addresses--selecting learning tasks, choosing biases, learning and reasoning with probabilistic world models, and probabilistic planning. Although much work remains to be done, we believe that PAGODA addresses many of the fundamental concerns arising from the desire to build autonomous learning agents, and provides a foundation for building rational autonomous agents that can learn and plan in complex, nondeterministic environments.

References

[Buntine, 1990] Wray Buntine. A Theory of Learning Classification Rules. PhD thesis, University of Technology, Sydney, February 1990.

[Chaitin, 1977] G. J. Chaitin. Algorithmic information theory. IBM J. Res. Develop., 21:350-359, July 1977.

[desJardins, 1991] Marie desJardins. Probabilistic evaluation of bias for learning systems. In Eighth International Workshop on Machine Learning, pages 495-499. Morgan Kaufmann, 1991.

[desJardins, 1992a] Marie desJardins. Goal-directed learning: A decision-theoretic model for deciding what to learn next. In Proceedings of the Machine Discovery Workshop, 1992.

[desJardins, 1992b] Marie desJardins. PAGODA: A Model for Autonomous Learning in Probabilistic Domains. PhD thesis, UC Berkeley, 1992.

[desJardins, 1993] Marie desJardins. Representing and reasoning with probabilistic knowledge: A Bayesian approach. In Conference on Uncertainty in Artificial Intelligence, 1993.

[Mitchell, 1980] Tom Mitchell. The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, May 1980.

[Quinlan, 1986] R. Quinlan. The effect of noise on concept learning. In Ryszard Michalski, Jaime Carbonell, and Tom Mitchell, editors, Machine Learning II, pages 149-166. Morgan Kaufmann, 1986.

[Solomonoff, 1964] R. J. Solomonoff. A formal theory of inductive inference. Information and Control, 7, 1964.