Generic Exception Analysis in a Dynamic Multi-Agent Environment

by

Zhi-Hui (Winifred) Xu

Submitted to the Department of Electrical Engineering and Computer Science in Partial Fulfillment of the Requirements for the Degree of Bachelor of Science in Computer Science and Engineering and Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

May 21, 1999

© 1999 Zhi-Hui Xu. All rights reserved.

The author hereby grants to M.I.T. permission to reproduce and distribute publicly paper and electronic copies of this thesis and to grant others the right to do so.

Author: Department of Electrical Engineering and Computer Science, May 21, 1999
Certified by: Prof. Howard E. Shrobe, Thesis Supervisor
Accepted by: Arthur C. Smith, Chairman, Department Committee on Graduate Theses

ABSTRACT

In a dynamic real-world environment, autonomous agents need to be well coordinated in order to execute their mission successfully together. To ease the burden on each individual agent of handling errors and unforeseen circumstances, Dr. Dellarocas and Dr. Klein are working on a generic exception handling (EH) architecture called the "Good Citizen" approach, in which the EH capabilities are external to the local agents. This EH service acts as a "doctor" for all the agents in the "society". For this thesis, two specific protocols, TeamCore and Socially Attentive Monitoring (SAM), designed by Dr. Milind Tambe's group at ISI of USC, are studied and analyzed.
Results will be represented in the exception taxonomy at CCS in order to evaluate the ability of the current "good citizen" architecture to capture the exception handling components of the ISI protocols, and to expand the existing knowledge base of generic exceptions and their anticipation, detection, prevention and repair methods.

Thesis Supervisors:
Prof. Chrysanthos Dellarocas
Title: Assistant Professor, MIT Sloan School of Management
Dr. Mark Klein
Title: Research Associate, MIT Center for Coordination Sciences
Prof. Howard E. Shrobe
Title: Associate Director, MIT Artificial Intelligence Lab

Acknowledgments:

Time sure flies. Today I finally completed my Master of Engineering thesis and have officially graduated from MIT (Massachusetts Institute of Technology), the place where I have spent the past 5 years learning, "tooling", growing, and maturing. The person that I owe my entire life to is my dearest mother, who is in happy tears at this moment as she hears the good news. All of this would not have been possible without her, who worked so hard to raise and educate me. Mom, I love you very very much! She loves me more than her own dear life.

I would like to give thanks to those who helped and guided me through my five years of tedious yet fun times at MIT. First of all, my deepest gratitude goes to Prof. Chris Dellarocas and Dr. Mark Klein for their support and supervision throughout this term, and for giving me the opportunity to work on this project. Many thanks to Prof. Howard Shrobe and Prof. Patrick Winston, who helped me in defining my thesis and focusing on my goal.

To my close friends @ MIT: Florence, Cathy, Sabrina, Clare, and the many others who have given me support and encouragement, thanks for lending your ears. Life would have been boring without your friendship. I will miss the happy moments when we hung out at the food trucks, Au Bon Pain, the W20 lounge, and 500 Memorial Dr. (Next House), and the all-nighters we pulled. All of this has come to an end and will eventually pay off.
To my friends from CBCGB (CBF/ICF/CCharis), thank you very much for your prayers and support. I have grown so much spiritually while attending Lexington church. Your kindness has touched me deeply, and I will always remember the wonderful times we spent together, and Michael W. Smith's song "Friends".

"And friends are friends forever
If the Lord's the Lord of them.
And a friend will not say "never"
'Cause the welcome will not end."
--Michael W. Smith, "Friends"

Lastly, to those friends who have encouraged me and visited me from NYC: thank you, and keep in touch!

All of this is done through the grace of God the Almighty, who showered me with so many blessings, loved me and carried me on his shoulders throughout the years. Thank you, Lord, my Heavenly Father. Amen!

Dream what you want to dream;
Go where you want to go;
Be what you want to be;
Because you have only one life and
One chance to do *all* the things you want to do!

Table of Contents

ABSTRACT
Acknowledgments:
Chapter 1:
1.1). Dynamic environment causes troubled agents
1.2). "Survivalist" approach to exception handling
"Survivalist" agent approach:
1.3). The "Good Citizen" Approach - Social Monitoring
Social Monitoring:
Good Citizen Approach:
1.4). Thesis focus and outline
Chapter 2:
2.1.1). TeamCore Protocol:
2.1.2). Joint Intention Theory
2.1.3). Shared Plan Theory
2.1.4). Establish JPG
2.1.5). Monitor and Repair
2.1.6). Selective Communication
Chapter 3:
3.1). Socially Attentive Monitor (SAM)
Chapter 4:
4.1). Exception Handling Template:
EH entry template
Reference
For each exception type
For each handler
For every meta-process
For every meta-process
Chapter 5:
5.1). Comparison to the Good Citizen's Architecture
5.2). Table listing the comparison between TeamCore and Good Citizen
Chapter 6:
6.1). Future Direction:
References:

Chapter 1:

1.1).
Dynamic environment causes troubled agents

Throughout humanity's search for a well-adapted dynamic agent society, in which autonomous agents with different capabilities can come together and freely interact to accomplish useful tasks, exception handling has been one of the biggest obstacles to achieving that goal. Agents designed in the lab can rarely handle the entire array of possible exceptions caused by a changing environment. Each agent may be an expert in its particular field; together, however, the agents lack the overall comprehension needed to deal with the numerous unforeseen situations outside their specific realm of knowledge ("society of mind", Minsky). Thus the desire to construct a useful agent society that functions in a fashion similar to human society has not yet been completely fulfilled. With team tasks especially, the difficulty of assembling a team, delivering instructions, and carrying out the tasks together while each agent plays an assigned role is enormous.

Software agents have been employed in numerous complex and uncertain environments, such as virtual training programs, military applications, and space missions. Ensuring the successful completion of a task, or the capability to learn from failures so that they are not repeated, is critical to agent designers. Autonomous agents running in dynamic environments can encounter numerous unforeseen situations that may cause them to fail to accomplish their goals. Errors can arise both inside and outside the agent/protocol. Unanticipated changes in the scenario setting can lead to confusion, agent dependence relationships (such as consumer-producer relationships) can cause chain-reaction failures, and other overlooked variables can all generate exceptions. An unhandled exception can further trigger other undesired systemic problems (network clogging, resource contention, deadlock, etc.).

The most obvious type of exception is an error in agent/protocol design.
An agent can suddenly go wild and make a wrong decision because of a calculation error, a judgment error, or a sensor failure. The protocol implemented could have intrinsic design flaws, such as logic bugs, or could fail simply because the designer overlooked certain things. This type of failure is easy to detect and resolve, because it can be tested for at design time. Once it surfaces in a trial run (or perhaps in the debugging stage), the design flaw will soon expose itself. Detection is simple to implement because we can just check for deviation from the normal behavior path, and the resolution mechanism is even simpler.

The second type is due to hardware or environmental failures. An agent can go missing in action (MIA) while collaborating with a team; an electrical outage can cause many problems. Such a failure is usually outside the control of the implementers and is not an intrinsic fault in the architecture of the agent/protocol; yet it can be anticipated, and exception handlers can be written into the agent's code to prevent or repair such errors.

The last type of exception is the "emergent exception". Emergent exceptions are similar to the second type in that they occur unexpectedly at run time. They are not due to a fault in the design of the agent or protocol, but rather to unforeseen circumstances; examples include resource contention, deadlocks and livelocks, and message congestion. In a dynamic "open" agent environment, where myriad types of agents participate, the heterogeneity of the agents can cause subtle interaction errors. The occurrence of these exceptions is not predictable at design time, and they are thus very hard to avoid. Smart implementers can try to picture the scenarios and then place monitors that watch for the signs of these errors, so that the errors can be prevented or repaired later.

1.2).
"Survivalist" approach to exception handling

In the engineering world, one popular practice among developers is to simplify the agent world in order to ease design-time complexity. They envision the agent environment as a concrete man-made environment with a few variables and almost perfect settings. Traditionally, exception handling has been a subtask taken up after the majority of the main functionality has been implemented. Designers would try to anticipate an array of possible errors that the agent can generate from its own actions, plus a few of the possible environmental variables. While this may be sufficient to ensure successful task completion in a single-agent, isolated mechanical world, it is prone to numerous failures in an ensemble of agents involved in teamwork. Attempts have been made to anticipate possible failures at design time and then hard-code exception handlers into the individual agent applications; thus an extensive amount of effort has been spent on the prediction and prevention of possible errors. For each individual application agent, this kind of effort is tremendous.

"Survivalist" agent approach:

A great deal of effort has been spent in trying to resolve these issues. Some of it has focused on single-agent architectures with various reactive plans (open agent protocol) or other decision-making algorithms. Others chose to focus on improving the dynamic environments that agents work in, so as to handle many coordination issues at a higher layer.

One of the most popular approaches is the "Survivalist" agent. This is an instance of the distributed "self-check" method. After all, an agent has direct access to its own internal state information and to the data needed for the comparison. Designers choose to incorporate as many situations as possible into the exception handling section of the individual agent.
This requires the individual agent to have fairly sophisticated reasoning capabilities, a very "paranoid" set of checking conditions, plus heavy code to deal with each and every one of them. Anything forgotten is a potential loophole for failure while on a mission. The one advantage of this method is that individual agents know what they are looking for and have access to internal state values, and the designers can easily set standards for the local situation. However, the complexity of the agent code grows quickly, and the code is usually difficult to understand or maintain; code for agents built with compiled-in exception handling is not as reusable, and the state relationships become convoluted. Also, it is unrealistic to expect that all agents participating in a team will have the same exception handling capabilities, which seriously limits the amount of cooperation and interaction available to all.

In a multi-agent environment, however, this "survivalist" approach may not survive very well, because of the potentially endless number of errors caused simply by a diverse set of agents interacting with each other. Agent teams built on such "survivalist" agent infrastructures will be lacking in the following aspects:

+ Team construction - a lot of domain-dependent work requires huge amounts of human effort, and domain-dependent information is not reusable in other domains.
+ Team flexibility - when faced with uncertain situations, agents need a teamwork model to deal with unforeseen situations and to facilitate coordination and collaboration.
+ Team scale-up - resources are limited, and huge human effort goes into trying to predict the errors and program in the exception handlers.
+ Learning ability - repeated failure should be avoided, but since the agents cannot reason, they cannot learn from past failures.

1.3).
The "Good Citizen" Approach - Social Monitoring

Social Monitoring:

In contrast to the self-checking mechanism, human society has implemented a concept called "monitoring". Institutions like the police are specifically designated to watch for wrongdoing in society, and they are the ones responsible for detection and repair. This is convenient because it lifts the heavy burden of monitoring off the shoulders of each participant in the society. We empower the police to pay attention to a common set of standards that we believe everyone should obey, and we give them the responsibility and authority to deal with exceptions when they arise. Similarly, in the agent world, common concerns such as network congestion and resource poaching can occupy the attention of every agent, so it is wise to ask other "policing" agents to monitor for them and to resolve errors when they arise in those situations.

A policing agent monitors the behavior of an individual agent or a team (which may include itself), detecting and diagnosing failures as they occur. The big advantage of this type of "social policing" over "self-checking" is that we can save all the redundant effort and spend it on one particular monitor. The designer of the "society" is the one responsible for predicting the needs and writing the exception handling devices for this type of common-interest failure. We can also give those exception handlers the power to repair the situation, a power we could not possibly entrust to individual local agents.

On the other hand, the monitored agent's internal states are no longer easily accessible. It is infeasible for agents to continuously communicate their internal states to the monitoring agent, for reasons such as communication cost, privacy, and safety. Moreover, interactions amongst agents complicate the task of the monitor.
It now has to consider interactions between agent and environment as well as agent-agent communications. In addition, failures in the interaction between an agent and the environment affect that agent's interactions with other agents. Since only the individual agent knows what kind of values it is expecting and what type of logical calculations it is responsible for, it is hard for the social police to incorporate all of the variables and to know which values to check. Programming the social monitor to suit domain requirements can be very tedious and time-consuming for the implementers.

Good Citizen Approach:

The "Good Citizen" approach is one instance of the "social monitoring" method. It is currently under research at CCS (Center for Coordination Sciences) at MIT. This shared exception handling service is designed to be easily "plugged" into existing agent systems with little customization. The handler acts as a "doctor" that looks system-wide for "diseases" (exceptions) and tries to fix them as soon as any is detected. The "doctor" also possesses knowledge about the various types of exceptions and the myriad ways of fixing them, because it has an exception taxonomy database within easy reach. This provides an abstraction level that relieves the individual agents of exception handling duties and transfers the worries to the "doctor", who is now the expert in exception handling thanks to its vast database of "diseases" (see Figure 1 below). This clear division of labor allows agents from different backgrounds to interact freely with each other and to join forces easily. Each agent only needs to focus on its tasks at hand and not on detection and resolution of various exceptions.
This exception handling service is easily applicable to independent agents because its only "cost of admission" is that agents understand the common set of languages and protocols used. The exception handling service can be expanded to fit the needs of multiple domains, thus vastly increasing code reusability.

Figure 1: "Good Citizen" approach diagram
[Diagram labels: Core EH Agents (new agent registration; normative behavior specification; symptoms; find diagnoses; ranked diagnoses; create/select resolution plan; selected resolution plan). EH Agents Created As Needed (exception detection agent (sentinel), which looks for symptoms and issues diagnostic queries; exception resolution agent). Problem solving agent with query interface and action interface. Infrastructure.]

1.4). Thesis focus and outline

This project is part of an effort to identify generic exceptions that can happen to any agent in a dynamic environment and to categorize them into a taxonomy. This taxonomy will serve as the knowledge base used by the exception handling "doctors" in the "Good Citizen" approach. The exception handling service will focus on a set of targeted exceptions and then choose anticipation, avoidance, detection, and resolution methods based on the available handlers registered in the taxonomy.

In order to expand the taxonomy, our team has decided to analyze several DARPA-related protocols at hand and extract a set of generic exceptions from them. This thesis focuses on the work done at ISI of USC, namely the TeamCore and SAM protocols developed under Prof. Milind Tambe's group. It also evaluates the potential of the generic shared service approach mentioned above to provide flexible and adequate exception detection and resolution. The Exception Handling Entry Template is designed to capture the essence of each type of exception, detailing its anticipation, avoidance, detection, and resolution approaches.
This method is used to categorize the exceptions, and will be investigated for its comprehensiveness.

Chapter 2 presents the background on the TeamCore protocol, and chapter 3 the background on the SAM model. Detailed explanations illustrate their activities and their exception handling services. The core analysis of this thesis is given in chapter 4, including Entry Templates detailing each exception and its handlers. Chapter 5 presents a thorough analysis and comparative study of the "good citizen" approach versus TeamCore. The last chapter concludes this thesis with a list of potential future research and studies.

Chapter 2:

2.1.1). TeamCore Protocol:

A research team led by Prof. Milind Tambe at the Information Sciences Institute (ISI) of the University of Southern California (USC) has been designing a teamwork protocol for some time. Uncertainties in dynamic domains obstruct coherent teamwork. In their opinion, highly flexible coordination and communication are key to addressing such uncertainties. Their central hypothesis is that agents should be provided with general models of teamwork to address such difficulties. An implemented general model of teamwork called STEAM (Shell for TEAMwork) is under analysis in this thesis. It is mainly based on the Joint Intention Theory (Levesque et al., 1990) and borrows from the Shared Plan Theory (Grosz, 1996; Grosz & Kraus, 1996; Grosz & Sidner, 1990).

Teamwork involves a common team goal and coordination among team members. TeamCore is essentially a "wrapper" around a domain agent. A TeamCore agent is a purely "social agent" with only core teamwork capabilities, while the domain agent is the physical agent that interacts with the dynamic run-time environment. TeamCore is fundamentally a distributed team-oriented system that couples "social", domain-independent TeamCore agents with domain-specific agents.

2.1.2). Joint Intention Theory

The joint intentions framework centers around a team's team goal.
A team keeps a common team mental state, which is shared by all the members and can only be updated by the leader of the team. A team $\Theta$ jointly intends a team action if all team members are committed to carrying out the goal until the goal is achieved, unachievable, or irrelevant. The common goal is named a Joint Persistent Goal (JPG), and the action can be denoted as $\mathrm{JPG}(\Theta, p, q)$, where $p$ stands for completion of a team action and $q$ stands for the irrelevance clause. All team members involved in the JPG believe that $p$ is currently false (the goal has not been achieved yet) and that $p$ is their mutual goal. When $q$ turns false, the goal $p$ is now irrelevant, which leads to the termination of the JPG for the entire team.

Below is the list of conditions that must hold true while the team JPG is engaged:

1. $p$ must be currently false for all team members (meaning the goal is not yet achieved).
2. All team members ($\forall$ members of $\Theta$) set $p$ as their mutual goal, wanting it to become true.
3. Each team member believes that, until the goal is achieved, unachievable, or irrelevant, they each hold $p$ as a weak achievement goal (WAG).
4. $\mathrm{WAG}(\mu, p, \Theta, q)$, where:
* $\mu$ is a team member in $\Theta$
* $\mu$ believes $p$ is currently false
* when $p$ is no longer false, or $q$ is no longer true (implying that $p$ is achieved, unachievable, or irrelevant), $\mu$ will commit to having its private belief about $p$ become $\Theta$'s mutual belief.

2.1.3). Shared Plan Theory

A shared plan relies on a mental state, "intending that", which is defined via a set of axioms that guide an individual to take actions such as communicating, enabling/disabling, or forming a subteam to perform assigned tasks (Grosz & Kraus, 1996). There are two kinds of SharedPlan (SP): a Full SharedPlan (FSP) and a Partial SharedPlan (PSP). The focus here is mainly on the FSP. $\mathrm{FSP}(P, GR, \alpha, T_p, T_\alpha, R_\alpha)$ denotes a group $GR$'s plan $P$ at time $T_p$ to do action $\alpha$ at time $T_\alpha$ using recipe $R_\alpha$. $\mathrm{FSP}(P, GR, \alpha, T_p, T_\alpha, R_\alpha)$ holds iff the following conditions are true:

1.
All members of group $GR$ hold the intention that the proposition $\mathrm{Do}(GR, \alpha, T_\alpha)$ holds, i.e. that $GR$ does $\alpha$ over time $T_\alpha$.
2. All members of $GR$ mutually believe that $R_\alpha$ is the recipe for $\alpha$.
3. For each step $\beta_i$ in $R_\alpha$:
* A subgroup $GR_k$ ($GR_k \subseteq GR$) has an FSP for $\beta_i$, using recipe $R_{\beta_i}$.
* Other members of $GR$ believe that there exists a recipe such that $GR_k$ can bring about $\beta_i$ and have an FSP for $\beta_i$, even if they don't know $R_{\beta_i}$.
* Other members of $GR$ intend that $GR_k$ can bring about $\beta_i$ with the recipe.

An SP represents the mental snapshot of the team in action in a particular situation. It aspires to describe the entirety of a team's intentions and beliefs when engaged in teamwork.

2.1.4). Establish JPG

STEAM, an implementation of the TeamCore concept, uses Joint Intentions theory as its fundamental building block. It is built on enhancements to the Soar architecture (Newell, 1990), with a set of domain-independent rules written in Soar. TeamCore's main purpose is to facilitate coordination amongst team members; it also provides a minimal set of monitoring and repair services. The domain-independent component of TeamCore is responsible for surmounting and adapting to uncertainties in dynamic, complex domains. At its current stage, TeamCore requires that human users enter all domain-related information before agents enter into group commitment. This step abstracts the user-specified, domain-dependent information away from the core of the team organization rules.

The first step is to establish the JPG. This involves a three-step commitment procedure:

* The chosen team leader broadcasts a message asking each potential team member to commit to the JPG (establish a WAG for operator OP; if there is no response within the time limit, resend the message).
* Upon receiving the message, each agent $v_i$ in turn sends out a message committing to the WAG for OP.
* The leader waits until all agents have responded positively ($\forall v_i$), then confirms the establishment of the JPG.

Each team agent carries with it the pre-specified team plan.
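
The three-step commitment procedure above can be sketched in a few lines of Python. This is only an illustrative simulation under stated assumptions, not STEAM's Soar implementation: the `Member` class, the `replies_after` field (which stands in for lost or late replies), and the `max_rounds` limit are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """Hypothetical team member; `replies_after` simulates lost or late replies."""
    name: str
    replies_after: int = 1   # requests needed before this member answers
    requests_seen: int = 0

    def on_request(self, goal: str) -> bool:
        """Return True once the member commits to the WAG for `goal`."""
        self.requests_seen += 1
        return self.requests_seen >= self.replies_after


def establish_jpg(goal: str, members: list[Member], max_rounds: int = 3) -> bool:
    """Leader-side loop: broadcast, collect commitments, resend to silent members."""
    committed: set[str] = set()
    for _ in range(max_rounds):
        pending = [m for m in members if m.name not in committed]
        if not pending:
            break
        for m in pending:              # step 1: (re)send the commitment request
            if m.on_request(goal):     # step 2: the member commits to the WAG
                committed.add(m.name)
    # step 3: the JPG is confirmed only when every member has committed
    return len(committed) == len(members)
```

For example, `establish_jpg("escort", [Member("a"), Member("b", replies_after=2)])` succeeds on the second round, while a member that never answers within `max_rounds` leaves the JPG unestablished.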
Thus, upon the establishment of the initial JPG, the agents have agreed to commit to the same common goal and share the same team state. This commitment will not be terminated until some agent privately believes that the goal p or the condition q has changed and informs the other members; eventually the OP will update the team state, and the JPG is terminated.

The whole hierarchy of the TeamCore model revolves around reactive plans already installed in the agents. When changes occur, the different conditions of a plan are triggered and different sub-plans are activated. For consistency, only the (sub)team leader has the power to change the (sub)team state, thereby synchronizing all agents on that (sub)team. A private belief that leads to the termination of the JPG is also conveyed by the individual agent to the leader, and upon unanimous approval the leader changes the team state variables and terminates the group commitment. In this way the JPG concept helps team organization and avoids many coordination problems.

2.1.5). Monitor and Repair

In addition to the multiple levels of JPG commitment by the teams and subteams, TeamCore also incorporates some mechanisms for monitoring and repair. The monitoring designed for TeamCore aims to predict the various conditions under which goal p would no longer remain false (achieved, unachievable, or irrelevant). The corresponding axioms are loaded at initialization time along with other domain-specific information. Any observation by a domain agent that violates the pre-set axioms sets off a trigger, which then diagnoses the violation as either a "Critical Role Failure" or a regular "Role Failure". This is determined from the role relationship conditions between agents: AND, OR, or dependency relationships. A role failure is critical when that single role failure makes the goal unachievable. The action taken for repair in both cases is team reconfiguration.
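
As a minimal sketch of the diagnosis step, assume a goal is described by a single AND or OR relationship over named roles (role-dependency relationships are left out for brevity). The function names and role names below are hypothetical and are not taken from TeamCore itself.

```python
def goal_achievable(relation: str, roles: list[str], failed: set[str]) -> bool:
    """AND needs every role to be working; OR needs at least one working role."""
    working = [r for r in roles if r not in failed]
    if relation == "AND":
        return len(working) == len(roles)
    if relation == "OR":
        return len(working) > 0
    raise ValueError(f"unknown role relationship: {relation}")


def classify_failure(relation: str, roles: list[str], failed_role: str,
                     already_failed: frozenset = frozenset()) -> str:
    """A failure is critical when it alone makes the team goal unachievable."""
    failed = set(already_failed) | {failed_role}
    if goal_achievable(relation, roles, failed):
        return "role failure"
    return "critical role failure"
```

Under these assumptions, losing the hypothetical role "scout" in an AND relationship with "transport" is a critical role failure, whereas in an OR relationship it is only a regular role failure as long as "transport" is still working.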
For a critical role failure, a team member is determined to substitute for the failed role:

1. Call for candidates for role substitution.
2. Check for appropriate role capabilities and conflicting previous responsibilities.
3. Announce the role substitution to the team.
4. Delete non-critical conflicting commitments for the new agent.

For a regular role failure, the single dependent agent is simply disabled from performing its role.

2.1.6). Selective Communication

Another interesting feature of the TeamCore system is the selective communication method used to limit communication cost. An expected utility function is calculated from the value of the benefit versus the value of the communication cost. While this is not an exception handling mechanism, the addition of variable communication possibilities further complicates the exception detection and repair efforts. The diagram below by Dr. Milind Tambe explains it in detail:

Figure 2: Plan UML
[UML Class Diagram: TEAMCORE. Classes shown: Team Plan, Team Operator, Plan, Communication Cost, Application Conditions, Execution, Termination, Domain Level Agent (Name: String), Teamcore Agent (Name: String).]

Figure 3: Seq UML
[UML Sequence Diagram: Team Communication, between the Team Leader and Team Members 1 to 3. Messages shown: <establish commitment, joint activity> (request) (*); <refuse commitment, joint activity, reason> (refusal); <negotiation...> (**); <established commitment, joint activity> (confirmation), sent to each member; <terminate, reason> (assertion); <refuse termination, joint activity, reason> (refusal); <negotiation...> (**).]

(*) This nonstandard UML notation denotes a broadcast.
(**) This capability is under active development.

Figure 4: Team UML
[UML Class Diagram: State of a Single Agent. Classes shown: Root Belief State (Unique_name: String), Self Belief State, Team Belief State, Speaking Order (Names: Ordered List of Strings), Communication, Channel, Member List, Member, Leader, RoleInfoPlan (Plan_name: String, Team_role: String), with a "Participates In Subteams" association. The remaining attribute labels are not recoverable from the extracted text.]

Figure 5: TeamCore generic chart
[TeamCore Model: Load Domain Information; Establish Joint Persistent Goal (JPG); Estimate Gamma (communication); Monitor & Repair; Terminate Joint Persistent Goal (JPG).
Establish JPG: Leader makes announcement; Agents reply to join team; Confirm JPG established.
Terminate JPG: Agent reaches private belief; Inform all other team members; Leader updates team state; Confirm and terminate JPG.
Monitor and Repair Mechanisms: Specify conditions and axioms (role relationships: AND, OR, dependencies); Monitoring: infer abnormal behavior that violates axioms; Critical vs. Role Failure; Invoke Repair: role responsibilities for re-planning; Update team belief and team states.]

Chapter 3:

3.1). Socially Attentive Monitor (SAM)

In the world of dynamic agents, execution monitoring is a very important issue. Previous investigations of execution monitoring usually center on a single agent or system acting as the monitor, which detects and analyzes the behavior of other agents and dispatches exception handlers whenever needed. However, the shortcomings of such a method become apparent when errors come not only from individual agent behavior but also from inter-agent and agent-environment interactions.
This complicates the task, since the monitor now has to watch all of these interactions, and a failure in one of them may generate chain failures involving other agents.

SAM is a novel approach currently under study by Dr. Gal Kaminka at the Information Sciences Institute of the University of Southern California. This social execution framework is complementary to the existing approaches. SAM captures the various social relationships between agents and the environment, and violations are defined in terms of these relationship rules. The hypothesis is that, for at least some classes of exceptions, monitoring the maintenance of those relationship rules is easier than determining the "correct" behavior.

SAM possesses a knowledge base of the various types of social relationships employed in the particular domain. The SAM monitoring agent employs the available protocols, such as communication and plan recognition, to help it observe the agent-agent and agent-environment states; it then tries to deduce the relationship from the observation and match it against the template. The example given is to use the velocity and relative position of two planes to judge whether they are flying in formation.

SAM is responsible for the analysis of a violation after its detection, and uses the relationship model to try to explain the error. Other exception handlers must be implemented to effectively correct the situation and repair the malfunctioning agent if necessary.

Figure 6: General Structure of a SAM System.

The general structure of SAM's design emphasizes layers of abstraction and thus a clear division of responsibilities. The knowledge base contains models of the social relationships between the monitored agents and the environment. The monitor collects information to represent the agent relationships. The detector verifies the validity of the observed relationship by comparing it with the knowledge base. If a failure is detected, it is analyzed to determine the cause and find ways of fixing it.
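
The formation example can be sketched as follows. This is a hedged illustration of checking an observed relationship against a modelled one, not SAM's actual algorithm: the `PlaneState` type, the two-dimensional coordinates, and both threshold values are assumptions made only for this sketch.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaneState:
    """Hypothetical observation of one plane: position and velocity in 2D."""
    x: float
    y: float
    vx: float
    vy: float


def in_formation(a: PlaneState, b: PlaneState,
                 max_separation: float = 500.0,
                 max_velocity_diff: float = 5.0) -> bool:
    """Observed relationship: the planes are close and move at a similar velocity."""
    separation = math.hypot(a.x - b.x, a.y - b.y)
    velocity_diff = math.hypot(a.vx - b.vx, a.vy - b.vy)
    return separation <= max_separation and velocity_diff <= max_velocity_diff


def detect_violation(expected_in_formation: bool, a: PlaneState, b: PlaneState) -> bool:
    """Detector: True when the observation disagrees with the modelled relationship."""
    return expected_in_formation != in_formation(a, b)
```

Here the knowledge base is reduced to a single boolean (`expected_in_formation`); the detector only reports the mismatch, and diagnosis and repair are left to other components, as in the description above.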
Chapter 4:

4.1). Exception Handling Template:

The goal of this thesis is to investigate the exception collection mechanism designed to enrich the taxonomy. We have conducted a detailed analysis of the TeamCore and SAM (Socially Attentive Monitor) protocols designed by Prof. Milind Tambe's group at ISI. This study leads to the collection of exceptions relating to these protocols. Because TeamCore is an abstract "wrapper" protocol that tries to encapsulate the coordination efforts, we looked at the exceptions that are anticipated, avoided, detected, and resolved through this protocol.

The Exception Handling Template (EHT) is our main tool for collecting exception cases and their related handlers. One of the most important parts of the Good Citizen design is the ability to capture information about the various exceptions in the exception database. Related information is separated into sections for exception handlers, meta-processes, exception symptoms, and so on. In order to better categorize the protocols, exceptions, and handlers, a detailed Exception Handler table is used to test the flexibility of the design of the EH Knowledge chart and also to compare with other protocols.

The EH Template is being tested on the TeamCore protocol for the first time. Because of the abstract nature of the protocol, different variations will be looked at and designed. The exceptions are then extrapolated into "symptoms". According to the various "symptoms", methods of repair, detection and prevention are evaluated as fit. All of this analysis will in the end be used to fill the taxonomy tree with the generic exceptions.
Each exception is augmented by the following data structure:

+ Detailed description of the exception type
+ Types of processes in which it normally occurs
+ Anticipation methods
+ Avoidance methods
+ Detection methods
+ Resolution methods
+ Criticality: various system conditions in which this specific exception type may be critical

Each protocol also has a meta-process table that captures more information about exception handling in the flow of the process:

+ Identify target exceptions
+ Generate exception finding processes
+ Enact exception finding processes
+ Select exception instances to fix
+ Generate exception fixes
+ Enact exception fixes

The design of the EHT has evolved along with each stage of the investigation, which further improved its capability to capture the exception-related information accurately. This analysis is only the beginning of the collection of exception cases, and there will be much more work done relating to other protocols such as Contract Net, Multi-level Coordination, etc. Below is the analysis of the exception results:

EH entry template
Reference: Towards Flexible Teamwork (Tambe '97)

For each process

  Name: Team Work
  Description: Autonomous agents group together for a common task in a dynamic environment.
  Generalizations: Multi-Agent System Coordination Mechanism
  Applicability conditions (including all underlying assumptions): all agents available, leader chosen, communication channel available, proper domain information given.
  Exceptions:
    - Team Commitment Failure
    - Role Relationship Violation
    - Coordination Failure
    - Communication Problem
    - Information Distribution Problem
  Inputs:
  Outputs:
  Mapping:
  Decomposition:

For each process

  Name: Team Core
  Description: Autonomous agents group together for a common task in a dynamic environment using the TeamCore protocol.
  Generalizations: Team Work
  Applicability conditions (including all underlying assumptions): all agents available, leader chosen, communication channel available, proper domain information given.
  Exceptions:
    - Team Commitment Failure
    - Role Relationship Violation
    - Communication Problem
  Inputs:
  Outputs:
  Mapping:
  Decomposition:

For each exception type

  Name: Team Commitment Failure
  Description: Team agents abnormally terminate participation in the common tasks, or do not act in accordance with the team plan.
  Impact (AKA criticality): Team goal affected; it may become unachievable for the entire team, depending on the role of the failed agent.
  Generalizations: Role Relationship Violations
  Anticipation handler(s): n/a
  Avoidance handler(s): JPG Coordination Handler (TeamCore)
  Detection handler(s): Relationship Violation Detection Handler
  Resolution handler(s):

For each handler

  Name: JPG Coordination Handler
  Description:
    - TeamCore: JPG termination requires that each team member shares the same belief that the JPG is achieved, unachievable, or irrelevant. Thus all members share the common (sub)team state and cannot terminate until uniform consensus is reached.
    - TeamCore: only the leader can update the team state, and must get all members to share the same belief before termination.
  Generalizations: Avoidance Handler
  Applicability conditions:
    - communication capabilities
    - leader updates (sub)team state
    - all agents alive and listening
    - an agent with a private belief that the JPG is achieved, unachievable, or irrelevant spreads its knowledge
  Inputs: JPG, team agents, leader's messages
  Outputs: JPG establishment
  Mapping:
  Decomposition:
    - Leader sends out JPG request
    - Agents reply with confirmation
    - Team JPG established

  Name: Relationship Violation Detection Handler
  Description: SAM: agents observe each other's behavior and compare it to the social model; if it is not in accordance, they send out the error message "Role Relationship Violation".
  Generalizations: Detection Handler
  Applicability conditions:
    - communication capabilities
    - all agents alive
    - team relationship model accessible
  Inputs:
    - social relationship model
    - individual agent's observation in the physical domain
  Outputs: error message if applicable
  Mapping:
  Decomposition:
    - team members make observations
    - compare with social model
    - send/not send error message

For each exception type

  Name: Role-Relationship Violation
  Description: An agent does not act in accordance with its role in the team scenario. Example: a Missing In Action agent causes other agents to wait indefinitely.
  Impact (AKA criticality): Affects the behavior of other agents in the team. May cause the goal to become unachievable.
  Generalizations: Team Work Exceptions
  Anticipation handler(s): n/a
  Avoidance handler(s): n/a
  Detection handler(s):
    - Check Role Dependency Handler (TeamCore)
    - Relationship Violation Detection Handler (SAM)
  Resolution handler(s):
    - Change Plan Handler
    - Agent Replacement Handler

For each handler

  Name: Check Role Dependency Handler
  Description: TeamCore: constant checking of team states to make sure the relationships are not violated.
  Generalizations: Detection Handler
  Applicability conditions:
    - communication channels available
    - leaders update (sub)team state
  Inputs: Role Relationship Model
  Outputs: send out checking requests
  Mapping:
  Decomposition:
    - load Role Relationship Model
    - find target relationship rules
    - send out check request

  Name: Relationship Violation Detection Handler
  Description: SAM: agents observe each other's behavior and compare it to the social model; if it is not in accordance, they send out the error message "Role Relationship Violation".
  Generalizations: Detection Handler
  Applicability conditions:
    - communication capabilities
    - all agents alive
    - team relationship model accessible
  Inputs:
    - social relationship model
    - individual agent's observation in the physical domain
  Outputs: error message if applicable
  Mapping:
  Decomposition:
    - team members make observations
    - compare with social model
    - send/not send error message

  Name: Change Plan Handler
  Description: TeamCore: if the JPG becomes unachievable or irrelevant, change the plan dynamically or choose an alternative plan.
  Generalizations: Resolution Handler
  Applicability conditions:
    - communication available
    - leader updates (sub)team state
  Inputs: Role Relationship Model, Team Model
  Outputs: new JPG or alternative plan
  Mapping: depends on the alternative team plan
  Decomposition:
    - Leader sends out request
    - Announce change of plan to all team agents

  Name: Agent Replacement Handler
  Description: Replace the misbehaving agent with a healthy agent.
  Generalizations: Resolution Handler
  Applicability conditions:
    - pre-run-time information input
    - communication availability
    - healthy agent available
  Inputs:
    - domain dependent information
    - Role Relationship Model
  Outputs:
  Mapping: depends on agent availability and capabilities
  Decomposition:
    - Leader sends out request
    - All agents examine their own ability
    - If suitable, send in replacement agent; or else change plan

For each exception type

  Name: Coordination Failure
  Description: Agents join a team, but the initialization of their actions is not uniform.
  Impact (AKA criticality): Goal achievability affected.
  Generalizations: Team Work Exceptions
  Anticipation handler(s): n/a
  Avoidance handler(s): Establish JPG Handler
  Detection handler(s): n/a
  Resolution handler(s): n/a

For each handler

  Name: Establish JPG Handler
  Description: TeamCore: all members establish the JPG first, before processing orders; the leader takes responsibility for coordination.
  Generalizations: Avoidance Handler
  Applicability conditions:
    - communication availability
    - leader updates (sub)team state
  Inputs: goal, agents, leader chosen
  Outputs: agent team working for a common JPG
  Mapping: n/a
  Decomposition:
    - Leader sends out JPG request
    - Agents reply with confirmation
    - Announce JPG established

For each exception type

  Name: Communication Problem
  Description: Example: messages from the leader get lost during communication.
  Impact (AKA criticality):
    - Team coordination
    - Goal achievability
  Generalizations: Team Work Exceptions
  Anticipation handler(s): n/a
  Avoidance handler(s): n/a
  Detection handler(s): Repeat Message Handler
  Resolution handler(s): Spread Error Warning Handler

For each handler

  Name: Repeat Message Handler
  Description: Repeatedly send the message if the desired response is not received within a time limit.
  Generalizations: Detection Handler
  Applicability conditions:
    - communication availability
    - leader updates (sub)team state
  Inputs: no reply from a certain agent within the time limit
  Outputs: send out message again
  Mapping: n/a
  Decomposition: if over the time limit, send out the message again and record the number of resends.

  Name: Spread Error Warning Handler
  Description: Count the number of tries; after a certain number of unsuccessful tries, send an error message to the entire team.
  Generalizations: Resolution Handler
  Applicability conditions:
    - communication availability
    - leader updates (sub)team state
  Inputs: resend ineffective
  Outputs: message to other agents on the team, warning them of the communication failure.
  Mapping:
  Decomposition: if the set number of resends does not work, send out a message to all other agents.

For each exception type

  Name: Information Distribution Problem
  Description: An agent observes or receives information critically related to the team goal/state, and fails to communicate it to other team members.
  Impact (AKA criticality): Goal achievability for the team
  Generalizations: Team Work Exceptions
  Anticipation handler(s): n/a
  Avoidance handler(s): n/a
  Detection handler(s): Observation Handler
  Resolution handler(s): Communication Probability Handler

For each handler

  Name: Observation Handler
  Description: A supervising agent constantly checks the communication decisions made by an individual agent.
  Generalizations: Detection Handler
  Applicability conditions:
    - communication availability
    - team model (with role relationships)
    - team state (with relevancy clauses)
  Inputs: Team Model, JPG axioms
  Outputs: decide the relevancy of the information and the accuracy of the individual agent's communication decisions
  Mapping:
    - if the information does affect the JPG, then call the Communication Probability Handler (if Benefit > Cost, send message; if Benefit < Cost, do not send message)
    - if the information is irrelevant, stay put
  Decomposition:
    - receive information
    - check against Team Model and axioms
    - execute mapping decision

  Name: Communication Probability Handler
  Description: The supervisor agent updates the communication probability cost table in the individual agent (weighs communication costs and benefits, and determines whether other members may infer the same information, before sending out new information).
  Generalizations: Resolution Handler
  Applicability conditions:
    - communication availability
    - leader updates (sub)team state
  Inputs: agent error in decision to communicate
  Outputs: updated agent's communication probability table
  Mapping:
  Decomposition:

For every meta-process

  Name: TeamCore EH Meta-Process
  Description: Based on the Joint Intention and Shared Plan theories; facilitates coordination and the team work model. A group of agents collaborate to achieve a common task goal.
  Generalizations: Team Work (EH Meta-process)
  Applicability conditions:
  Inputs:
  Outputs:
    - resolution process status
    - working team order
  Mapping:
  Decomposition:

  Name: Identify target exceptions (TeamCore)
  Description: Hard wired at run time.
    * Team Commitment Failure: team agents abnormally terminate participation in the common tasks, or do not act in accordance with the team plan.
    * Role Relationship Violation: an agent does not act in accordance with its role in the team scenario.
    * Communication Problem: messages from agents get lost during communication.
  Applicability conditions:
  Generalizations: Identify target exceptions
  Inputs:
  Outputs: Classification of exception types:
    - Team Commitment Failure
    - Role Relationship Violation
    - Communication Problem
  Mapping:
  Decomposition:

  Name: Generate exception finding processes (TeamCore)
  Description: Hard wired at run time.
  Applicability conditions:
  Generalizations: Generate exception finding processes
  Inputs: Classification of exception types:
    - Team Commitment Failure
    - Role Relationship Violation
    - Communication Problem
  Outputs: detection handlers according to the classes of exceptions:
    - Relationship Violation Detection Handler, or
    - Check Role Dependency Handler, or
    - Repeat Message Handler
  Mapping:
    * symptoms for Team Commitment Failure: Relationship Violation Detection Handler
    * symptoms for Role Relationship Violation: Relationship Violation Detection Handler
    * symptoms for Communication Problem: Repeat Message Handler
    * search for the appropriate detection handler in the knowledge base
  Decomposition:

  Name: Enact exception finding processes (TeamCore)
  Description: Enact exception detection handlers at run time.
  Applicability conditions:
  Generalizations: Enact exception finding processes
  Inputs: the detection handler chosen for the exception class:
    - Relationship Violation Detection Handler, or
    - Check Role Dependency Handler, or
    - Repeat Message Handler
  Outputs: list of instances of exceptions
  Mapping:
  Decomposition:

  Name: Select exception instance(s) to fix (TeamCore)
  Description: Choose the order in which to fix exception instances, at run time.
  Applicability conditions:
  Generalizations: Select exception instance(s) to fix
  Inputs: list of instances of exceptions
  Outputs: list of instances of exceptions
  Mapping: FIFO order
  Decomposition:

  Name: Generate exception fix(es) (TeamCore)
  Description: Generate fix handlers at run time.
  Applicability conditions:
  Generalizations: Generate exception fix(es)
  Inputs: instance of exception from the Selector
  Outputs: exception resolution handlers:
    - Change Plan Handler
    - Agent Replacement Handler
  Mapping: For Role Relationship Violation:
    * if a replacement agent can be found: Agent Replacement Handler
    * if no agent is available for replacement: Change Plan Handler
  Decomposition:

  Name: Enact exception fix(es) (TeamCore)
  Description: Fix the error by enacting the resolution handlers from the generator, at run time.
  Applicability conditions:
  Generalizations: Enact exception fix(es)
  Inputs: Change Plan Handler or Agent Replacement Handler
  Outputs:
    - resolution process status
    - working team order
  Mapping: For Role Relationship Violation:
    * Change Plan Handler: choose another alternative team plan
    * Agent Replacement Handler: only if a replacement agent can be found
  Decomposition:
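The six meta-process stages tabulated above can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not TeamCore code: the function names, the `(exception_class, agent)` report format, and the `spare_agents` list are hypothetical. The detection mapping and the Role Relationship Violation fix rule are taken from the Mapping entries of the tables; other exception classes return no fix because the meta-process table lists no mapping for them.

```python
from collections import deque

# Detection handler per exception class, following the Mapping entries above.
DETECTION_HANDLER = {
    "Team Commitment Failure": "Relationship Violation Detection Handler",
    "Role Relationship Violation": "Relationship Violation Detection Handler",
    "Communication Problem": "Repeat Message Handler",
}


def generate_finders(target_exceptions):
    """Stage 2: look up a detection handler for each target exception class."""
    return {exc: DETECTION_HANDLER[exc] for exc in target_exceptions}


def enact_finders(finders, observations):
    """Stage 3: return exception instances in the order they were observed.

    `observations` is a hypothetical list of (exception_class, agent) reports.
    """
    return [(exc, agent) for exc, agent in observations if exc in finders]


def generate_fix(instance, spare_agents):
    """Stage 5: choose a resolution handler for one exception instance."""
    exc, _agent = instance
    if exc != "Role Relationship Violation":
        return None  # no fix mapping is listed for the other classes
    if spare_agents:
        return "Agent Replacement Handler"
    return "Change Plan Handler"


def run_meta_process(observations, spare_agents):
    targets = list(DETECTION_HANDLER)                    # Stage 1: identify targets
    finders = generate_finders(targets)                  # Stage 2
    queue = deque(enact_finders(finders, observations))  # Stage 3
    fixes = []
    while queue:
        instance = queue.popleft()                       # Stage 4: FIFO selection
        handler = generate_fix(instance, spare_agents)   # Stage 5
        fixes.append((instance, handler))                # Stage 6 would enact it
        if handler == "Agent Replacement Handler":
            spare_agents = spare_agents[1:]              # one spare is consumed
    return fixes


reports = [
    ("Role Relationship Violation", "agent-2"),
    ("Communication Problem", "agent-5"),
    ("Role Relationship Violation", "agent-7"),
]
for instance, handler in run_meta_process(reports, ["spare-1"]):
    print(instance, handler)
```

With a single spare agent, the first role violation is assigned the Agent Replacement Handler and the second falls back to the Change Plan Handler, which mirrors the "if a replacement agent can be found" rule in the Generate exception fix(es) table.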
For every meta-process

  Name: Socially Attentive Monitor (SAM)
  Description: Identifies agent exceptions via observable abnormal social behavior.
  Generalizations: Team Work (EH Meta-process)
  Applicability conditions:
  Inputs:
  Outputs:
    - restore relationship order
    - resolution process status
  Mapping:
  Decomposition:

  Name: Identify target exceptions (SAM)
  Description: Hard wired at design time.
  Applicability conditions:
  Generalizations: Identify target exceptions
  Inputs: n/a
  Outputs: Role Relationship Violation
  Mapping:
  Decomposition:

  Name: Generate exception finding processes (SAM)
  Description: Select the appropriate exception detector; hard wired at design time.
  Applicability conditions:
  Generalizations: Generate exception finding processes
  Inputs: Role Relationship Violation
  Outputs: Relationship Violation Detection Handler
  Mapping:
  Decomposition:

  Name: Enact exception finding processes (SAM)
  Description: Enact the exception detector at run time.
  Applicability conditions:
  Generalizations: Enact exception finding processes
  Inputs: Relationship Violation Detection Handler
  Outputs: instances of role relationship violation
  Mapping:
  Decomposition:

  Name: Select exception instance(s) to fix (SAM)
  Description: Select the identified exception instance to fix, at run time.
  Applicability conditions:
  Generalizations: Select exception instance(s) to fix
  Inputs: instances of role relationship violation
  Outputs: instances of role relationship violation
  Mapping: FIFO
  Decomposition:

  Name: Generate exception fix(es) (SAM)
  Description: Generate fix handlers at run time.
  Applicability conditions:
  Generalizations: Generate exception fix(es)
  Inputs: instances of role relationship violation
  Outputs: CONSA (negotiation protocol)
  Mapping:
  Decomposition:

  Name: Enact exception fix(es) (SAM)
  Description: Fix the troubled agent and restore relationship order.
  Applicability conditions:
  Generalizations: Enact exception fix(es)
  Inputs: CONSA (negotiation protocol)
  Outputs:
    - resolution process status
    - restore relationship order
  Mapping:
  Decomposition:

Chapter 5:

5.1). Comparison to the Good Citizen's Architecture

One sophisticated method currently under research here at CCS (Center for Coordination Sciences) at MIT is a variant of the "central monitoring" method. The research underway extracts the layer of generic exceptions from all the local agents and allows an outside Exception Handler (EH) to take care of the detection, prevention, repair and learning of the errors. This view concurs with Grosz (1996), who states that "capabilities for teamwork cannot be patched on, but must be designed in from the start".

"Good Citizen" was designed around a dynamic agent society that allows heterogeneous agent participation, and it has the team model structure at the center of its design. A stable environment is established for all types of agents. Several public service agencies provide common functionalities shared by all agents, and act as interfaces between individual agents and the rules and knowledge bases in the system (see Figure 1).

Before any agent attempts to join this "society", it must first agree to the rules and protocols used in the society, which are provided by the Socialization Service. Agents must register with the Socialization Service, which then pulls the data from the Social Laws and asks the agent to observe the rules listed. This ensures that all agents speak a common "language" and follow a set standard. The Social Laws dictate the proper protocol and normative behavior; they are supplied according to the needs of the dynamic domain.

The Notary Service is used when agents enter the society: a contract between the society and the agent is initialized and then stored by the Notary Service.
This contract is the "entry fee" that ensures the entering agent has agreed to observe the laws and behavior dictated by the Social Laws. In return, the society provides the agent with various public services, which include the monitoring and Exception Handling Service that alleviate the burden on individual agents.

The last agency is the Exception Handling Service (EH). Its main function is to provide sentinels that monitor the behavior of agents in the society, and to provide the exception handling mechanism whenever necessary. When an agent decides to join a team, a contract detailing the proper interaction between the various agents and the team goal is registered with the Notary Service. The Notary Service then alerts the EH about that particular contract. According to the type of contract, the EH searches its exception knowledge base for lists of possible exceptions to anticipate and avoid. The EH then instantiates sentinels (themselves agents of one type), which are sent into the agent society to take on the monitoring and detection responsibilities. When a sentinel observes an exception, it alerts the EH, which then chooses a proper method of avoidance or repair according to the knowledge base.

The EH templates described in the last section are used in the exception knowledge base; they attempt to capture the essence of a generic type of error and its corresponding response handlers. The same data are also entered into the Process Handbook at CCS in the form of a taxonomy. Please see the Figure 7 diagram below.

Figure 7: Good Citizen Approach Diagram. [Diagram not reproduced. Components shown: Agent Society (dynamic environment), Socialization Service, Social Laws, Notary Service, Contracts (Social & Team), Exception Handling Service, Exception Knowledge-base.]
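The contract registration flow described in this section (the Notary Service stores the contract and alerts the EH, which looks up the anticipated exceptions and instantiates sentinels) can be sketched as follows. This is a hedged illustration only: the class and method names, the contract fields, and the knowledge-base entries are assumptions made for the example, since the Good Citizen architecture is described here at the design level and has not been implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """A team contract registered with the Notary Service (hypothetical fields)."""
    contract_type: str
    agents: list


@dataclass
class ExceptionHandlingService:
    # Maps a contract type to the exception types to anticipate
    # (illustrative entries only, not the actual knowledge base).
    knowledge_base: dict
    sentinels: list = field(default_factory=list)
    alarms: list = field(default_factory=list)

    def on_contract(self, contract):
        """Instantiate one sentinel per anticipated exception type."""
        for exc in self.knowledge_base.get(contract.contract_type, []):
            self.sentinels.append((exc, tuple(contract.agents)))

    def alarm(self, exc, agent):
        """Called by a sentinel that has observed an exception."""
        self.alarms.append((exc, agent))


@dataclass
class NotaryService:
    eh: ExceptionHandlingService
    contracts: list = field(default_factory=list)

    def register(self, contract):
        """Store the contract, then alert the EH service about it."""
        self.contracts.append(contract)
        self.eh.on_contract(contract)


eh = ExceptionHandlingService(
    {"team": ["Role Relationship Violation", "Communication Problem"]}
)
notary = NotaryService(eh)
notary.register(Contract("team", ["a1", "a2"]))
print(eh.sentinels)
```

Keeping the lookup inside the EH service reflects the division of labor in the text: the Notary Service only records commitments, while the choice of what to monitor is driven by the exception knowledge base.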
After realizing that TeamCore is also an abstraction of general functionalities away from the domain agents into the TeamCore level, we attempted a comparison between TeamCore and our Good Citizen approach (GC). TeamCore attempts to abstract the team coordination efforts from the domain agents, so the "wrapper" TeamCore agents handle all the social interaction functionalities. GC, on the other hand, attempts to encapsulate the monitoring and exception handling efforts within a social institution. It is also designed for a much larger agent society rather than a single team. Below is a chart detailing the similarities and differences between the two designs.

5.2). Table listing the comparison between TeamCore and Good Citizen

Table 1: TeamCore vs. Good Citizen

  Feature: Domain dependent information
    TeamCore: Load Domain Information
    Good Citizen: Social Laws
    Comment: Both need to load domain dependent information in advance.

  Feature: Monitoring
    TeamCore: Monitoring
    Good Citizen: Sentinels
    Comment: The EHS in GC sends out sentinels to perform the monitoring role.

  Feature: Exception handling
    TeamCore: Repair
    Good Citizen: Exception Handling Service
    Comment: TeamCore does not handle all types of exceptions, only those where the role failure can be repaired by team reorganization.

  Feature: Detection of violation
    TeamCore: Preset axioms, goal/domain dependent
    Good Citizen: Knowledge-base
    Comment: GC is much more comprehensive and will be able to detect more run-time errors.

  Feature: Team coordination
    TeamCore: Leader (operator) conducts coordination. Team membership is pre-determined.
    Good Citizen: Notary Service tracks the agent commitments. Does not organize team memberships.
    Comment: GC does not describe the details of the actual calling of teams, but it allows a much more flexible team membership.

  Feature: Commitment
    TeamCore: The JPG functions as the commitment of all team agents to a specific goal. Not possible to terminate until the JPG terminates.
    Good Citizen: The Contracts section stores both social (agent-society) and team contracts.
    Comment: GC does not assume loyalty; it monitors and checks for commitment, and a contract may include a penalty.

  Feature: Agent society
    TeamCore: Team agents
    Good Citizen: Random agents that decide to pay the "entry fee" and obey the social laws.
    Comment: GC allows much more flexibility in the heterogeneity of the agent society.

  Feature: Guards to society
    TeamCore: None. All team agents are pre-chosen and join.
    Good Citizen: Socialization Service
    Comment: GC allows random agents into the society, so guards like the SS need to be introduced.

According to the chart above, TeamCore is a model aimed more specifically at the initiation and organization of a team task. Its focus is on the accomplishment of (sub)tasks once the team agents have committed to a common goal. The GC approach works on a different scale, because GC does not concern itself with how the team gets together and carries out the (sub)tasks. GC is a model for an entire agent society: how it should function so that it can bring together agents of different capabilities and designs to work harmoniously on team objectives in a dynamic environment. While TeamCore is certainly a very handy tool for team organization and handles several team-type exceptions, GC has a more comprehensive scope of exception handling service, because its knowledge base incorporates all types of generic exceptions and their corresponding handlers.

In addition, the two schemas can be combined to complement each other and form a less error-prone environment for all. GC can be used to set up the agent society, while TeamCore is used specifically in the team organization procedure. The sentinels sent out by the Exception Handling Service to detect errors in team performance can also incorporate the SAM methods.

Chapter 6:

6.1). Future Direction:

In order for the Good Citizen (GC) architecture to work well, it needs to include exception information for many other protocols. Our lab has so far focused on the TeamCore protocol, the Contract Net Protocol (used by DARPA), and the Multi-Level Coordination Protocol.
This analysis is only the beginning of the collection of exception cases, and there will be much more work done relating to other protocols such as Contract Net, Multi-level Coordination, etc. In the future, we want to greatly expand the taxonomy to include analyses of many other protocols, and thus further the capabilities of our exception knowledge base. Much more work can be done on specific exception detection, on research into algorithms for the resolution and diagnosis of errors, and on designing a standard language for the definition of normative behavior. Eventually, the Good Citizen architecture can be implemented and tested.

References:

Tambe, M. Agent architectures for flexible, practical teamwork. In Proceedings of the National Conference on Artificial Intelligence (AAAI-97). August 1997.

Tambe, M. Teamwork in real-world, dynamic environments. In Proceedings of the International Conference on Multi-agent Systems (ICMAS). December 1996.

Tambe, M. Tracking dynamic team activity. In Proceedings of the National Conference on Artificial Intelligence (AAAI). August 1996.

Dellarocas, C. and Klein, M. Exception handling in agent systems. 1998. Adaptive System and Evolutionary Software homepage, http://ccs.mit.edu/ases.

Klein, M. and Dellarocas, C. Exception Handling in Agent Systems. Proceedings of the Third International Conference on Autonomous Agents, Seattle, Washington, 1999.

Klein, M. An Exception Handling Approach to Enhancing Consistency, Completeness and Correctness in Collaborative Requirements Capture. Journal of Concurrent Engineering Research and Applications. March 1997.

Tambe, M. Towards Flexible Teamwork. Journal of Artificial Intelligence Research, Volume 7, pages 83-124, 1997. On-line appendix for this article: STEAM rules and documentation.

Tambe, M. Implementing agent teams in dynamic multi-agent environments. Applied Artificial Intelligence, volume 12, 1998.

Qiu, Z. and Tambe, M. Flexible Negotiations in Teamwork: Extended Abstract. Proceedings of the AAAI Fall Symposium on Distributed Continual Planning, 1998.

Tambe, M., Shen, W., Mataric, M., Goldberg, D., Modi, J., Qiu, Z., and Salemi, B. Teamwork in cyberspace: Using TEAMCORE to make agents team-ready. To appear in the Proceedings of the AAAI Spring Symposium on Agents in Cyberspace, 1999.

Michaud, F. and Mataric, M. J. Learning from History for Behavior-Based Mobile Robots in Non-stationary Conditions. Joint special issue on Learning in Autonomous Robots: Machine Learning, 31(1-3), 141-167, and Autonomous Robots, 5(3-4), Jul/Aug 1998, 335-354.

Mataric, M. J. Coordination and Learning in Multi-Robot Systems. IEEE Intelligent Systems, Mar/Apr 1998, 6-8.

Mataric, M. J. Learning Social Behavior. Robotics and Autonomous Systems, 20, 1997, 191-204.

Mataric, M. J. Reinforcement Learning in the Multi-Robot Domain. Autonomous Robots, 4(1), Jan 1997, 73-83.

Mataric, M. J. Designing and Understanding Adaptive Group Behavior. Adaptive Behavior, 4:1, Dec 1995, 51-80.

Mataric, M. J. Issues and Approaches in the Design of Collective Autonomous Agents. Robotics and Autonomous Systems, 16(2-4), Dec 1995, 321-331.

Mataric, M. J. Using Communication to Reduce Locality in Distributed Multi-Agent Learning. Proceedings, AAAI-97, Providence, Rhode Island, Jul 27-31, 1997, 643-648.

Goldberg, D. and Mataric, M. J. Interference as a Tool for Designing and Evaluating Multi-Robot Controllers. Proceedings, AAAI-97, Providence, Rhode Island, Jul 27-31, 1997, 637-642.