Toward Automatic Knowledge Validation

Scott A. Wallace, John E. Laird
University of Michigan
1101 Beal Ave.
Ann Arbor, MI 48109
734-647-0969
swallace@umich.edu, laird@umich.edu

Keywords: Error Detection, Knowledge Validation

ABSTRACT: Computer generated forces have enormous potential for training simulations, mission evaluation and real-world projects. However, to be successful they must faithfully reproduce expert human behavior. Unfortunately, there are currently no standard procedures or methods for determining whether a CGF's behavior meets this validity criterion. Instead, ad hoc methods, which are both tedious and error-prone, are often employed. These methods require the human expert and domain engineer to closely examine the CGF's performance in a number of different situations and evaluate its behavior. We propose a series of metrics developed to identify different types of deviations between a CGF's behavior and a human expert's behavior. The diversity of these metrics allows errors to be detected in a wide range of domains. We then describe a validation system that uses each of these metrics to analyze instances of behavior. The result of the analysis is an overall view of the similarities and differences between the CGF's behavior and the expert's behavior. A system designer can examine this analysis to help ensure that the CGF is acting in a manner sufficiently similar to the expert for the task at hand, thus providing an efficient means to validate a complex agent.

1. Introduction

Development of computer-generated forces (CGFs) is often a difficult task. Moreover, as CGFs begin to exhibit increasingly human-like behavior, the task of developing the CGF and then validating its behavior to ensure that it performs up to its specifications becomes additionally complex. A typical development model begins with knowledge acquisition. During this phase, human domain experts are interviewed to determine the underlying rules that guide their behavior. A knowledge engineer then uses this information to encode the CGF's knowledge base (KB) in a form that is usable by the underlying agent architecture. After this initial phase, the CGF's knowledge is in a form that is mostly correct, but may contain errors that will cause it to behave inappropriately in some situations. Validation, the process of determining whether a system's external behavior meets the user's requirements, attempts to uncover these situations [1] [5]. During validation, the domain expert typically examines the CGF's behavior on a number of test cases in order to identify errors. Because this process is both error-prone and tedious, an automated approach would be highly desirable.

On the surface, it might seem that error detection could be automated easily. Unfortunately, identifying errors without direct help from a domain expert is not a straightforward task because not every deviation in behavior is an error. In fact, the concept of an error is ambiguous and closely tied to the properties of the underlying task that is being performed. The complexities surrounding this problem can be illustrated best with a concrete example. Consider an airborne defense mission in which the pilot flies combat air patrol, intercepting enemy planes as they are identified. Initially, the expert pilot and the CGF counterpart fly identically. Once an enemy is identified, both the expert and the CGF decide to use the same tactic to engage the enemy.
While executing their maneuvers, however, there are significant differences between the speed and altitude of the expert and the CGF, even though they both succeed in shooting down the enemy. Finally, on the return to base, there is once again a deviation between the expert's and the CGF's speed and altitude even though their actions are otherwise identical. The problem at this point is to determine whether the CGF's deviation from the expert's behavior indicates an error, or whether the CGF's actions are within the scope of correct behavior. In order to make this distinction, it may be necessary to examine the expert's and CGF's actions, which affect the external environment, as well as their goals, which indicate their motivation for pursuing particular actions. Unfortunately, the best way to examine these behavioral elements is highly dependent on the domain. These domain dependencies make error detection difficult for the following reasons:

1. Differences between two sequences of actions do not necessarily imply an error. CGFs are meant to operate in high fidelity simulations and in real-world environments. These environments have a large potential goal and action space, and as the size of this space increases, there is an increased likelihood that problems will have multiple solutions. This means that the simple observation that a CGF and an expert performed a different action (such as maneuvering the airplane at a different speed) is not, by itself, enough information to determine whether an error has occurred.

2. Criteria for determining correctness may change during problem solving. Complex tasks require solving many simple problems. Each sub-problem may be viewed as a task in itself, and as such, the criteria used to judge whether it is correct may be unique within the scope of the overall problem. For example, similarity of speed and altitude may be relatively unimportant while engaging the enemy, but extremely important on the return to base (or vice versa). This means that error detection methods must be adaptable to suit a number of different situations. Otherwise, their use will be severely limited.

3. Criteria for determining correctness may depend on the goals accomplished. In some situations, primitive actions may be the wrong level of abstraction for determining whether a particular behavior is correct. In such cases, it may make more sense to leave the exact implementation open ended and perform an evaluation based on the motivation for performing each action, or goal. In the case of the CGF pilot, for example, a dogfight may be considered correctly executed so long as the enemy is destroyed, even though the exact sequence of actions used to accomplish this goal may differ from the expert's example.

4. The context of the task may affect which solutions are acceptable. For example, there may be significant flexibility in choosing a tactic of engagement when only a small number of planes is involved, but if the attack is part of a larger, coordinated effort, tactics that would otherwise be acceptable may be considered inappropriate in this situation. The result of this property is that in many cases the motivation for performing an action (i.e. the overall goal) will factor into determining whether that action was correct.

5. Only a fraction of available information is necessary to determine whether an error has occurred. In complex, real-world environments the state space is often very large if not infinite.
In order to deal with these worlds, CGFs must efficiently abstract their sensory information. As a result, only a fraction of the information is required to decide what action should be pursued at any given point or whether the CGF is behaving appropriately. Thus, as the environment grows more complex, error detection becomes more difficult because detection methods must correctly differentiate between features of the environment that are important for determining whether the CGF is behaving correctly, and features that should be ignored.

6. Problem solving strategies may be flexible and diverse. In non-deterministic environments, actions may not always have their intended effect. As a result, a procedure may fail unexpectedly even though it is the correct thing to do. To deal with such situations effectively, CGFs are likely to benefit from a flexible and diverse set of strategies that accomplish the task using different means. This complicates error detection because a detection method that overly constrains the CGF's behavior (for example, by ensuring that the CGF's actions are identical to the expert's actions) may adversely affect the CGF's performance when unexpected situations arise.

The difficulties surrounding automatic error detection are not trivial to overcome. Indeed, it may not be possible to develop a useful error detection system that can operate autonomously in the complex, interactive domains of many CGFs. Nonetheless, our research makes a step toward that goal by identifying the difficulties and outlining a set of weak methods that can be used across many different domains to identify errors. Although these methods do not, in themselves, define an autonomous error detection system, they do produce a high level analysis of the similarities and differences in behavior that we believe will be an improvement over current validation techniques.

2. Related Work

Detecting when an error has occurred is important for a number of tasks besides knowledge validation. In fact, it is a fundamental problem that has a bearing on:

• Automatic correction of knowledge bases, in which a system modifies an agent's knowledge to minimize the number of situations in which the agent behaves differently than a domain expert.
• Intelligent tutoring systems, in which a system detects errors in a human novice's behavior when compared against a formal specification.
• Intelligent interfaces, in which a system helps a user perform a task by identifying or even recovering from incorrect behavior.

Given the applicability of error detection to a number of problems in artificial intelligence, it is somewhat surprising that it has not been the center of more attention. Nonetheless, four distinct fields of active research may contribute toward novel methods of automating error detection.

The knowledge base verification and validation (V & V) community has explored issues related to judging a KB's correctness. Verification, the process of determining whether a system's internal specification is correct, has received the majority of attention within this community. Much of this work is focused even more narrowly on issues related to static verification methods (e.g. [3], [7], [10], [11]) that can be performed automatically and without the agent architecture. However, relatively little of the work in this community has focused on the complementary problem of automatic validation [14].
What research has been done in this field typically circumvents the problem of automatically detecting errors by off-loading this task onto the human domain expert (e.g. [5], [8]). Automating this process has the potential to significantly reduce cost, and would mark a significant step forward in validation technology.

The problem of detecting errors is also addressed in the knowledge refinement and theory refinement communities. Knowledge and theory refinement are concerned with producing correct knowledge bases from initial and partially flawed knowledge. Although these fields share many similarities with V & V, this community emphasizes automatically fixing problems after they are detected. As a result, it may not be surprising that they must have some method for determining when an error has occurred. Although some refinement frameworks allow a user defined specification of what constitutes an error (e.g. [6]), most refinement frameworks avoid the underlying complexities of the error detection problem. Instead, these frameworks make one or more limiting assumptions: that the nature of an error is very constrained, and perhaps not even context dependent [2] [9]; or that the task is limited to noninteractive problems such as classification, where errors can be detected without comparing to long episodes of human behavior [2] [6] [9].

In the field of intelligent tutoring systems (ITS), the error detection problem is addressed with slightly different assumptions. In this community, the goal is to determine whether a human novice performs a task, such as multicolumn subtraction, correctly. When the tutoring system determines that the human has made an error, it will attempt to provide some general knowledge that will allow the human to solve the problem correctly. This field of research has had mixed success. In general, the systems that have performed best operate in restricted domains where there are very few ways to solve a particular problem, and it is possible to ensure that the ITS's knowledge of these potential solutions is complete.

Finally, the dual of error detection is the subject of study in both the plan and goal recognition communities. In this body of research, the recognition framework monitors an action sequence (either produced by another CGF or a human) and attempts to determine what goal or plan this other entity is pursuing. Plan recognition has examined a number of different approaches to help classify behavior, ranging from plan libraries to Bayesian networks. The specifics of our problem, both that we are looking for divergent as opposed to convergent behavior, and that we are dealing with CGFs in rich, dynamic environments, mean that some of the typical assumptions of plan recognition will be violated.

Each of the four fields described above has some relation to the problem of detecting errors in a CGF's behavior. Although some work has examined simplified versions of the general error detection problem, automatic error detection performed at more than a superficial level remains a distant goal.

3. Error Detection Methods

When examining potential error detection methodologies, we make two main assumptions. The first of these is that error detection is performed by comparing the behavior of a CGF to previously recorded expert behavior. This assumption is common among most systems that incorporate even minimal amounts of error detection. The major weakness of this methodology is that errors can only be detected if they are exposed by the example problems.
Thus, choosing which examples should be used for validation has a significant impact on the efficacy of the validation process. However, failing to make this basic assumption about the availability of solved example problems means that error detection must rely upon only weak, task-independent information such as loops in behavior or strong environmental feedback such as death. Although our error detection methods will use expert behavior to perform validation, we will not focus on how this behavior is selected. Potential methods for selecting behavior have been addressed in some previous research (e.g. [13]), and further progress will be left as future work.

The second assumption that underlies all of the approaches we will consider is the availability of particular information in the expert behavior traces. Two simple error detection methods used previously in knowledge refinement and plan recognition both require explicit information about the sequence of actions pursued by the expert during problem solving. To increase the applicability of our methods to complex and dynamic interactive domains, we also require that information about the sequence of states encountered by the expert during problem solving is available either explicitly or by deductive means. In many situations, such as when the expert interacts with a computer simulation, information about the world states encountered by the expert and the sequence of actions they pursue can easily be captured by recording the stream of data to and from the simulator.

4. Building Blocks of ED Methods

Each potential approach is characterized by a number of different properties described below. Together, these form a landscape of methodologies that could be used to identify problematic differences in two streams of behavior. The specific properties encapsulated by a methodology will impact how simple it is to use, and the situations for which it is most suited.

ED1: [Availability of Goal Annotations] As we discussed previously, our methods rely on traces of human behavior. They must contain, at the very least, explicit information about what actions the expert performed, as well as a means to correctly deduce the sequence of environmental states that was encountered. In addition to this information, some error detection methodologies may rely on knowledge about the expert's goals at each point in problem solving. If this is the case, that information might be Given or Abduced using some inference procedure. In the tables that follow, the label '-' for property ED1 indicates that no information about the expert's goals is required. The label 'G' indicates that the goals have been given, and the label 'A' indicates that the goals have been abduced.

ED2: [Abstracted Examples] Because an expert behavior trace describes only a single solution path, it may be useful to use an abstracted version instead of, or along with, the original. Abstracted expert behavior may indicate, for example, that the expert's goal 'look' is functionally equivalent to the goals 'examine' and 'inspect'. Abstraction may also be used to indicate that actions such as 'turn left quickly' and 'turn left slowly' are functionally equivalent. Defining these abstractions allows an error detection method to ignore irrelevant details in the expert's behavior. This then helps to focus the error detection methods on aspects of the behavior that are critical to determining its correctness. Error detection methods that employ abstraction are labeled with an 'A' in property ED2, and methods that do not make use of abstraction are labeled with a '-'.
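To make these properties concrete, the following sketch (purely illustrative; the field names and labels are assumptions on our part, not part of any existing system) shows one way a single packet of recorded behavior might be represented, with goal annotations (ED1) optional and abstraction (ED2) reduced to simple equivalence classes over goal and action labels.

```python
from dataclasses import dataclass
from typing import Optional, List

@dataclass
class TracePacket:
    """One step of recorded behavior: the observed state, the action taken,
    and (optionally, per ED1) the goals the expert was pursuing."""
    state: dict                        # abstracted sensory features
    action: str                        # e.g. "turn-left-quickly"
    goals: Optional[List[str]] = None  # e.g. ["patrol", "intercept"], or None

# ED2: abstraction knowledge expressed as equivalence classes over labels.
ABSTRACTIONS = {
    "examine": "look", "inspect": "look",                       # goal labels
    "turn-left-quickly": "turn-left", "turn-left-slowly": "turn-left",
}

def abstract(label: str) -> str:
    """Map a concrete label to its abstract class (identity if unlisted)."""
    return ABSTRACTIONS.get(label, label)

def equivalent(a: str, b: str) -> bool:
    """Two labels are interchangeable if they share an abstract class."""
    return abstract(a) == abstract(b)
```

Under this representation, a method that employs abstraction would compare abstract labels rather than raw ones, so that 'turn left quickly' and 'turn left slowly' are treated as the same behavior.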
ED3: [Comparison Method] To determine whether an error has occurred, the CGF's behavior must be compared against the expert's behavior in some manner. Consider the following simple expert action sequence that completes an attack task: arm-weapon, move, shoot. Comparing this expert behavior to a CGF's behavior can be done in one of two ways. The first method uses a strong comparison that we refer to simply as compare (labeled 'C' for property ED3 in the tables below). Using this strong method, each element in the expert's behavior is expected to correspond to an element in the CGF's behavior. If the CGF fails to generate a corresponding behavior stream, an error is detected. An alternative to this approach is a much weaker comparison method we call justify (labeled 'J' in the tables below). This approach attempts to ensure that the CGF is able to explain or justify the expert's behavior at each point in the behavior trace. If so, then this indicates that the CGF and the expert share a body of knowledge about when goals and actions are appropriate. If not, this indicates an error. Unlike the stronger metric compare, justify allows the agent an additional degree of freedom to exploit knowledge or preferences about alternative actions that the expert might not have.

5. Basic Error Detection Methods

By combining the properties described above, we construct five distinct error detection methods. These methods, encompassing three novel methods and two that have been provided by prior work in knowledge refinement and plan recognition, are outlined in the following sections. As we examine each method, we analyze it with respect to the following criteria:

• What demands are placed on the domain expert and knowledge engineer?
• What trade-off is made between maximizing true positives (detected errors that are in fact errors) and reducing false positives (detected errors that are in fact acceptable behavior)?
• How does the method compare to the benchmark methods provided by the knowledge refinement and plan recognition communities?
• What environmental/CGF properties are supported by this method?

Analysis of these constraints will then allow us to determine the types of situations that a particular method is best suited for, and those situations in which it will be ineffective.

5.1 Strict Matching

ED1: -    ED2: -    ED3: C
Table 1: Properties of Strict Matching

Strict matching (SM) ensures that the CGF pursues the same series of actions as the expert when faced with a common situation. This benchmark approach is inherited from the knowledge refinement community, which uses a similar view of what constitutes an error on noninteractive, classification type tasks. Strict matching attempts to force the CGF to follow the same solution path. Returning to our previous air-combat example, strict matching would ensure that the expert and the CGF perform the same actions in the environment at all times (including how the plane is maneuvered during execution of specific tactics, and the speed and altitude of the plane at all points in flight). In strict matching, the expert's behavior is not abstracted in any manner, and errors are detected any time the CGF's behavior deviates from the expert's (Table 1 indicates the basic properties of SM). If strict matching is performed off-line, using captured traces of CGF and expert behavior, it can be viewed as a string matching problem.
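As a purely illustrative sketch (not the implementation used in our system; the action names come from the attack example above), strict matching over two recorded action sequences might be realized with an exact test plus a standard edit distance that counts omissions, insertions, and substitutions:

```python
def strict_match(expert: list, cgf: list) -> bool:
    """Exact matching: every expert action must correspond, in order."""
    return expert == cgf

def edit_distance(expert: list, cgf: list) -> int:
    """Approximate matching: the number of omissions, insertions, and
    substitutions separating the CGF's behavior from the expert's."""
    m, n = len(expert), len(cgf)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if expert[i - 1] == cgf[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # omission
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[m][n]

# e.g. edit_distance(["arm-weapon", "move", "shoot"],
#                    ["arm-weapon", "shoot"]) == 1   (one omission)
```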
Viewed this way, efficient exact matching algorithms can be used to indicate whether an error has taken place. Alternatively, approximate string matching algorithms can be used to indicate how many errors (of omission, insertion, or substitution) occurred in the CGF's behavior.

Perhaps the greatest advantage of strict matching is the fact that it requires no additional knowledge to be supplied by either the domain expert or the knowledge engineer. This method can be applied using only the observed sequence of states and actions that were encountered and pursued during problem solving. Because the methodology is very strict in terms of what constitutes an error, it maximizes true positives, but does little to reduce false positives. As a result, it is likely that CGFs that allow their problem solutions to be influenced by individual preferences would be classified as exhibiting incorrect behavior. To eliminate the false positives flagged by this overly strict routine, it is extremely likely that the domain expert would need to filter errors returned by this method, thus resulting in increased human effort and reducing the other benefits of this approach.

Strict matching meets many of the constraints and requirements outlined in Section 5. However, because of the rigidity of the error detection method, it is not particularly well suited to complex environments with many states or goals. Moreover, because a valid CGF must emulate expert behavior according to this approach, this method is not ideal for environments where diversity of behavior is valued. On the other hand, strict matching's ability to maximize true positive errors can be useful in domains with irreversible actions or with a high cost of failure. In some situations, such as protocol driven exercises, this property may outweigh all other concerns.

5.2 Justification Based Detection

ED1: -    ED2: -    ED3: J
Table 2: Properties of Justification Based Detection

Justification based detection attempts to ensure that the CGF can explain the expert's actions at all points in time (see Table 2). This second benchmark approach is inherited from plan recognition, where the goal is simply to understand another agent's motivations. Justification based detection requires that the CGF and the expert both share a large body of common knowledge, but it also allows a significant degree of flexibility not found in other methods. This freedom allows the CGF to pursue its own solution paths while still ensuring that the CGF is able to produce, at least in principle, correct behavior. It is important to note that justification based detection does not necessarily need to view actual CGF behavior. Instead, error detection is performed simply by asking the CGF to explain each of the expert's actions. Returning to the air-combat example, this means that so long as the CGF could justify the expert engaging the enemy, employing its chosen maneuvers, and then returning to base at a given speed and altitude, no error would be detected. Unfortunately, this methodology is likely to leave some of the CGF's knowledge untested. This means that although a CGF may understand expert behavior, in practice it may not perform the same actions as the expert would have. As with the other benchmark approach (SM), the main advantage of justification based error detection is that it can be used without any additional knowledge sources beyond the solved example problems, thus reducing the human effort involved in its use.
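A sketch of how the justification pass might be driven is shown below; the query interface cgf.justifies(state, action) is a hypothetical stand-in for whatever mechanism the agent architecture provides for explaining a decision, and the trace records follow the TracePacket sketch above.

```python
def justification_based_detection(expert_trace, cgf):
    """Ask the CGF to justify each expert action in the recorded state.
    `expert_trace` is a sequence of TracePacket-like records; the hypothetical
    `cgf.justifies(state, action)` returns True if the CGF's knowledge can
    explain choosing `action` in `state`."""
    errors = []
    for step, packet in enumerate(expert_trace):
        if not cgf.justifies(packet.state, packet.action):
            errors.append((step, packet.action))  # expert action the CGF cannot explain
    return errors
```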
Unlike strict matching, the inherent optimism of this method allows the CGF a large degree of freedom to pursue its own form of problem solving and reduces false positives. At the same time, however, this optimism is likely to lead to a similar reduction in the total number of detected errors.

Similar to strict matching, justification based detection meets many of our constraints. This approach should fare well in environments with a large state space, as it is only intended to ensure that the CGF's knowledge encompasses the expert's. In environments with a large goal space, however, it is increasingly likely that the CGF may be able to justify the expert behavior even if the knowledge used to make the justification is incorrect. This may occur when the goals associated with a particular action sequence are different for the CGF and the expert. On the other hand, this method does help support behavioral diversity by allowing CGFs to pursue their own solutions. Nonetheless, the optimism of JBD is likely to be a disadvantage in areas where there is a high cost of failure because the decisions a CGF makes during execution may not be known during validation.

5.3 Generalized Sequence Matching

ED1: -    ED2: A    ED3: C
Table 3: Properties of Generalized Sequence Matching

Generalized sequence matching (GSM) takes the strict matching approach, but allows the expert's actions to be viewed abstractly (see Table 3). Strict matching uses the observed expert's actions as a template upon which the CGF's behavior must match. Each action in the sequence is likely to be associated with parameters that modify (to some extent) its effects on the environment. GSM extends this approach by defining how actions may be substituted for one another. This allows the CGF some degree of freedom in its behavior that would otherwise have been identified as an error. Using this methodology in the air-combat example, one could explicitly inform the error detection system whether speed and altitude are critical components (for the sake of identifying errors) of the fly-plane action, and whether they should be restricted to a specific range of values. This method requires only minimal additional effort on the part of the domain expert. Before validation begins, the abstraction relationships (such as equivalence classes, taxonomies, or valid numerical ranges) must be defined.
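As an illustration only (the fly-plane parameters and the numerical ranges below are invented for the example), such abstraction knowledge might be written down as per-action parameter ranges and consulted whenever two parameterized actions are compared:

```python
# Hypothetical GSM abstraction knowledge: for each action, which parameters
# matter for error detection and what range of values counts as "the same".
VALID_RANGES = {
    "fly-plane": {"altitude": (8000, 12000),   # feet; invented numbers
                  "speed":    (350, 450)},     # knots; invented numbers
}

def gsm_actions_match(expert_act: dict, cgf_act: dict) -> bool:
    """Generalized sequence matching for one step: the action names must
    agree, and every constrained parameter of the CGF's action must fall
    inside the range defined by the abstraction knowledge."""
    if expert_act["name"] != cgf_act["name"]:
        return False
    for param, (lo, hi) in VALID_RANGES.get(expert_act["name"], {}).items():
        if not lo <= cgf_act.get(param, lo) <= hi:
            return False
    return True

# e.g. gsm_actions_match({"name": "fly-plane", "altitude": 10000, "speed": 400},
#                        {"name": "fly-plane", "altitude": 11000, "speed": 420})
# returns True even though the raw parameter values differ.
```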
These relationships are likely to be reusable for future validation of other CGFs so long as they all operate in the same domain. Moreover, the effort required to encode this abstraction knowledge is independent from the number of examples used during validation and the number of rules used to encode the CGF's behavior.

The coverage of generalized sequence matching (GSM) lies somewhere between that of the benchmark methods, SM and JBD. In terms of the constraints outlined earlier, GSM is well suited to complex environments because of its ability to allow a relatively loose correspondence between the CGF's and the expert's behavior, while at the same time having the ability to explicitly relate how much these two behaviors deviate. Not surprisingly, we expect that this method should perform relatively well in environments that have irreversible actions or a high cost of failure. Although it is potentially more optimistic than SM, in GSM we can set the acceptable abstraction level arbitrarily low in order to achieve the most desirable trade-off between a low number of false positives and a high number of true positives.

5.4 Abstract Goal Based Matching

ED1: G    ED2: -    ED3: C
Table 4: Properties of Abstract Goal Based Matching

Abstract goal based matching (AGBM) makes a somewhat different extension to strict matching. In this technique, the expert's and the CGF's goals, as opposed to their individual actions, are compared to one another. AGBM uses annotations in the solved example problems to compare the expert's upper n goals to the CGF's upper n goals at each point in time (see Table 4). Using this method to validate behavior allows the error detection system to verify that both the expert and the CGF pursue and achieve the same goals, even if their lower level actions (and potentially some lower level goals) are significantly different. In the air combat domain, this would make it easy to identify that both the expert and the CGF correctly completed the dogfight task by shooting down the enemy, but might make it hard to identify whether a specific maneuver had been performed.

As with strict matching, detecting differences between the CGF's behavior pattern and the expert's behavior pattern could be viewed as a string matching problem if it is performed off-line. The essential difference here is that the strings would be composed of goals as opposed to the primitive actions that are explicitly available in the behavior traces.

Abstract goal based matching requires the expert to annotate their problem solutions. Like GSM, the human effort involved is likely to be independent from the number of rules used to encode the CGF's knowledge. Instead, the effort is proportional to the number of solved problems used for validation. The coverage of AGBM is of a somewhat different nature than the error detection methods we have previously discussed, because the action sequences of the CGF and the expert are never explicitly compared. This means that the types of allowable action sequences are relatively diverse so long as they can achieve the appropriate goals. Because AGBM only needs to recognize a small set of high-level goal states, and because it already has rules to perform this function efficiently, it should scale well as the size of the state and goal spaces expands. AGBM's ability to deal with different environments can be controlled easily by adjusting the required level of correspondence between the CGF's and the expert's goals. A looser coupling involving perhaps the top few layers of the goal hierarchy will allow CGFs more flexibility than a tight coupling that requires correspondence deep into the goal stack. Not surprisingly, the tightness of the coupling also has a significant impact on the suitability of this method for environments with irreversible actions and a high cost of failure. These environments, however, benefit from a tighter coupling where errors in the CGF's reasoning process may be uncovered before an action is even taken.

5.5 Grounded Variability Matching

ED1: G    ED2: A    ED3: C
Table 5: Properties of Grounded Variability Matching

In the same vein as AGBM, grounded variability matching (GVM) works to ensure that both the CGF and the expert pursue the same goals. This method also works from the bottom up, similar to GSM (see Table 5), by taking into account what primitive actions are legitimate within the context of the current goal stack.
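A rough sketch of this combination is given below; the goal labels, the comparison depth, and the allowed-action table are invented for illustration and would in practice come from the expert's annotations and observed behavior:

```python
# Hypothetical GVM knowledge: for a given goal context (the deepest goal that
# is still compared), which primitive actions are considered acceptable.
ALLOWED_ACTIONS = {
    "return-to-base": {"fly-plane", "adjust-heading"},
    "engage-enemy":   {"fly-plane", "arm-weapon", "shoot"},
}

def gvm_step_ok(expert_goals, cgf_goals, cgf_action, depth=2):
    """Grounded variability matching for one step: the top `depth` goals must
    correspond, and the CGF's action must be allowed in that goal context."""
    if expert_goals[:depth] != cgf_goals[:depth]:
        return False                     # wrong motivation for the behavior
    if not cgf_goals:
        return False                     # no goal context to ground the action
    context = cgf_goals[:depth][-1]      # deepest goal that was compared
    return cgf_action in ALLOWED_ACTIONS.get(context, set())

# e.g. gvm_step_ok(["patrol", "engage-enemy"], ["patrol", "engage-enemy"], "shoot")
# returns True: same motivation, and "shoot" is legitimate while engaging.
```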
Returning to our air-combat example, using GVM one would be able to explicitly inform the error detection system that the altitude and speed are critical components of the fly-plane action when the goal is return-to-base, but that these components are relatively unimportant in other circumstances. As with AGBM, the expert must annotate the example problems with the goals that are being pursued at each point in time. Comparison of the CGF's and expert's goals takes place down to a pre-specified level and ensures that the CGF has the correct motivation for its actions. Tied to this bottom tier of the goal hierarchy is a list of allowable primitive actions. Comparison of the CGF's actions to the list of allowable actions ensures that the CGF is pursuing its goals in an acceptable fashion. This list may be extracted via interviews with a domain expert, but it is probably more feasible to extract it by observing experts perform the task multiple times, as has been done in [12].

Compared to other methods we have examined, GVM puts relatively high demands on the domain expert. Not only must goals be represented explicitly in the problem solutions (as in AGBM), but this method also requires observing a potentially large amount of expert behavior in order to gather information about what primitive actions are allowed within a specific context. GVM has the potential of providing better coverage than AGBM. The exact difference between these methods is likely to be significantly influenced by the details of the domain.

Compared to the two benchmark methods from knowledge refinement (SM) and plan recognition (JBD), GVM is likely to do much better, but at the cost of increased human effort. In terms of meeting the constraints of an ideal error detection methodology, GVM performs similarly to AGBM. It is well suited to environments with many states and plans, although because it must keep primitive action lists, there may be an efficiency concern in environments with a very large number of allowable actions. As in AGBM, the degree of coupling affects the ability of this method to deal well with environments that have shared resources or irreversible actions and a high cost of failure. To some extent, however, the severity of this trade-off is mitigated by the fact that each goal context is associated with allowable primitive actions. This means that even with a relatively loose coupling, there is a greater possibility of preventing failure.

5.6 Discussion of Basic Methods

Name   SM   JBD   GSM   AGBM   GVM
ED1    -    -     -     G      G
ED2    -    -     A     -      A
ED3    C    J     C     C      C
Table 6: Summary of Basic Error Detection Methods

Table 6 illustrates the properties of the five basic error detection methods we have presented. Together, they span a significant portion of the landscape of potential error detection methodologies. However, the careful reader will note that a number of potential methods have been left unaddressed. These methods can be classified into two groups, depicted in Table 7.

Name                         ED1   ED2   ED3
Justification Techniques     *     *     J
Basic Abductive Techniques   A     *     C
Table 7: Unexplored Error Detection Methods

Justification techniques represent all methods that use justification but attempt to either justify abstractions of the expert's action sequences (as given by the solved example problems) or attempt to justify only the expert's higher level goals. In both cases, these error detection methods are used to justify abstract representations of the expert's behavior, thus making them more general than JBD.
However, as we previously mentioned, JBD is already a very optimistic error detection method and is likely to allow a relatively large number of genuine errors to go undetected. Because the remaining justification techniques will detect only a subset of the errors identified by JBD, these methods are unlikely to warrant deeper investigation.

Basic abductive techniques suffer from similar problems. Abductive techniques attempt to use the CGF's knowledge in order to identify what goals the expert is pursuing while solving the example problems. Correct identification of the expert's goals allows error detection to be performed at a more abstract level than is possible by comparing primitive action sequences alone. They are useful because they force the CGF to achieve the same goals as the expert, but allow freedom in terms of the primitive actions used to achieve each goal. However, when abduction is used without an additional knowledge source, the expert's goals can only be determined by examining the primitive action sequence available in the solved example problem. Because abduction relies on the CGF's knowledge base to determine the expert's goals, it is only likely to be successful if the CGF solves the problem using a primitive action sequence that is very similar to the one used by the expert. Clearly, this constraint undermines the main power of abductive techniques: that they can be used to force the CGF to achieve the same goals as the expert while allowing freedom in the underlying primitive action sequence. This violation means that basic abductive techniques that do not rely on information aside from the expert's behavior trace and the faulty CGF knowledge are not worth further investigation.

6. Beyond Basic Methods

Each of the basic error detection methods outlined in the previous section can be aided by the use of additional information about the task domain. One such source of information describes the method for selecting packets from the CGF and expert behavior streams. Before a CGF's behavior is compared to an expert's behavior, a decision must be made as to which two packets in the streams should be analyzed. The most appropriate choice should be influenced by properties of the task and domain. Thus, CGFs that interact with a real-time environment should integrate the value of a world-clock into their method for selecting two packets of behavior to be compared. On the other hand, CGFs that operate in turn-based simulations may not have this requirement. In this case, it may be sufficient to simply compare actions as they occur, and ignore time that passes in between two successive actions. This is just one example of how additional domain knowledge could be used to improve the error detection process. As our investigation continues, we expect that we will be able to identify a set of orthogonal components that can be used in conjunction with one another to produce an error detection system tailored to the needs of a particular domain.

7. System Design

One of the primary design goals for our error detection system is the ability to scale as our understanding of the basic framework that underlies error detection grows. As a result, we have developed a modular framework that allows us to separate different aspects of the error detection process.

[Figure: error detection framework. The expert and CGF behavior streams feed a Sequencer, whose output is passed to Classifier1 … Classifiern.]

At this point, we have divided the framework into two orthogonal components. The first component is sequencing and involves selecting appropriate packets of CGF and expert behavior to compare, as described in Section 6. The second component is classification and involves applying one or more of the basic error detection metrics (described in Section 5) to the packets of behavior selected by the sequencer. Because these components are orthogonal, we get the maximal flexibility from our system: each new component results in a combinatorial increase in the number of potential ways in which we can identify errors.
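A skeletal rendering of this composition is sketched below (the class names and interfaces are illustrative assumptions, not the actual system); it makes the orthogonality explicit in that any sequencer can be paired with any set of classifiers:

```python
from typing import Iterator, List, Tuple

class Sequencer:
    """Decides which packet of expert behavior should be compared with which
    packet of CGF behavior (e.g. by action index or by world-clock time)."""
    def pairs(self, expert_stream, cgf_stream) -> Iterator[Tuple[object, object]]:
        yield from zip(expert_stream, cgf_stream)  # simplest turn-based pairing

class Classifier:
    """One error detection metric (SM, JBD, GSM, AGBM, GVM, ...)."""
    def is_error(self, expert_packet, cgf_packet) -> bool:
        raise NotImplementedError

def validate(expert_stream, cgf_stream, sequencer: Sequencer,
             classifiers: List[Classifier]) -> list:
    """Run every configured classifier over every selected pair of packets and
    report where the CGF's behavior diverges from the expert's."""
    report = []
    for step, (exp_pkt, cgf_pkt) in enumerate(sequencer.pairs(expert_stream, cgf_stream)):
        for clf in classifiers:
            if clf.is_error(exp_pkt, cgf_pkt):
                report.append((step, type(clf).__name__))
    return report
```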
The initial step in using our error detection system is for the knowledge engineer and domain expert to examine the properties of the environment and determine which error detection methods are most suitable for the task at hand. In essence, this means picking out one or more sets of components to perform the error detection task. Once this has been done, two behavior streams, one from the expert and one from the CGF, are used as input to the error detection system. As the system examines the two streams, it produces a description of where the CGF's behavior faithfully reproduces the expert's behavior, and where the CGF has made errors. Using this information, the domain expert and knowledge engineer can examine a large set of test cases, quickly isolating when and where errors have occurred, and significantly reducing the cost of the validation process.

9. Future Work

Our research so far has paved the way for a broad investigation of methods for detecting errors and validating CGFs efficiently. Our near term goal is to examine the performance of our basic error detection methods. To do this we will examine a simple object retrieval domain in which a CGF accomplishes a number of high-level goals such as plan-route, travel, and find-object. This domain contains approximately 20 primitive actions and 10 distinct goals, creating a very large space of potential behavior. As our experiments with this test domain mature, we will focus on identifying two critical relationships. The first relationship we will examine is between the output of an error detection method and the impact of that information on improving the efficiency of validation. A better understanding of this relationship will allow us to optimize our error detection methods. The second relationship we will examine is between properties of a goal and the effectiveness of a particular error detection method. A deeper understanding of this relationship will allow a better-grounded choice of which error detection methods should be applied to a particular problem.

In the longer term, we will continue searching for ways in which we can improve our system's ability to detect errors by exploiting new sources of knowledge. We will continue to organize these knowledge sources into orthogonal dimensions of a unified error detection framework. This will allow us to take advantage of the combinatorial growth of new detection methods that occurs each time a new source of knowledge can be added to the framework.

10. References

[1] David J. Bawcom: "An Incompleteness Handling Methodology for Validation of Bayesian Knowledge Bases". Master's Thesis: Air Force Institute of Technology, 1997.
[2] Susan Craw, D. Sleeman: "Automating the Refinement of Knowledge-Based Systems". Proceedings of the ECAI-90 Conference, pp. 167-172, 1990.
[3] Yolanda Gil, Eric Melz: "Explicit Representations of Problem-Solving Strategies to Support Knowledge Acquisition". Proceedings of the Thirteenth National Conference on Artificial Intelligence, pp. 469-476, 1996.
[4] Randolph M. Jones, John E. Laird, Paul E. Nielsen, Karen J. Coulter, Patrick Kenny, Frank V. Koss: "Automated Intelligent Pilots for Combat Flight Simulation". AI Magazine, Vol. 20, pp. 27-42, 1999.
[5] Byeong Ho Kang, Windy Gambetta, Paul Compton: "Verification and Validation with Ripple-Down Rules". International Journal of Human Computer Studies, Vol. 44(2), pp. 257-269, 1997.
[6] Patrick M. Murphy, Michael J. Pazzani: "Revision of Production System Rule-Bases". Proceedings of the 11th International Conference on Machine Learning, pp. 199-207, 1994.
[7] Tin A. Nguyen, Walton A. Perkins, Thomas J. Laffey, Deanne Pecora: "Knowledge Base Verification". AI Magazine, Vol. 8, pp. 69-75, 1987.
[8] Robert M. O'Keefe, Osman Balci, Eric P. Smith: "Validating Expert System Performance". IEEE Expert, Vol. 2(4), pp. 81-90, 1987.
[9] Douglas Pearson: "Learning Procedural Planning Knowledge in Complex Environments". Ph.D. Thesis: University of Michigan, 1996.
[10] Alun D. Preece, Rajjan Shinghal, Aida Batarekh: "Verifying Expert Systems: A Logical Framework and a Practical Tool". Expert Systems With Applications, Vol. 5, pp. 421-436, 1992.
[11] Marcelo Tallis: "A Script-Based Approach to Modifying Knowledge-Based Systems". International Journal of Human-Computer Studies, To Appear.
[12] Michael van Lent: "Learning Task-Performance Knowledge Through Observation". Ph.D. Thesis: University of Michigan, 2000.
[13] Nirmalie Wiratunga, Susan Craw: "Informed Selection of Training Examples for Knowledge Refinement". Proceedings of the 12th European Knowledge Acquisition Workshop, pp. 233-248, 2000.
[14] Neli Zlatareva, Alun Preece: "State of the Art in Automated Validation of Knowledge-Based Systems". Expert Systems With Applications, Vol. 7(2), pp. 151-167, 1994.

Author Biographies

SCOTT WALLACE is a Ph.D. candidate in the University of Michigan's Computer Science program. His research interests include empirical analysis of A.I. architectures and knowledge engineering. He received his B.S. in Physics and Mathematics from the University of Michigan in 1996.

JOHN LAIRD is a Professor of Electrical Engineering and Computer Science at the University of Michigan. He received his B.S. from the University of Michigan in 1975 and his Ph.D. from Carnegie Mellon University in 1983. He is one of the original developers of the Soar architecture and leads its continued development and evolution. From 1992 to 1997, he led the development of TacAir-Soar, a real-time expert system that flew all of the U.S. fixed-wing air missions in STOW-97.