Toward Automatic Knowledge Validation

Scott A. Wallace
John E. Laird
University of Michigan
1101 Beal Ave.
Ann Arbor, MI 48109
Keywords: Error Detection, Knowledge Validation
ABSTRACT: Computer generated forces have enormous potential for training simulations, mission evaluation and
real-world projects. However, to be successful they must faithfully reproduce expert human behavior. Unfortunately,
there are currently no standard procedures or methods for determining whether a CGF's behavior meets this validity
criterion. Instead, ad hoc methods, which are both tedious and error-prone, are often employed. These methods require
the human expert and domain engineer to closely examine the CGF’s performance in a number of different situations
and evaluate its behavior.
We propose a series of metrics developed to identify different types of deviations between a CGF's behavior and a
human expert's behavior. The diversity of these metrics allows errors to be detected in a wide range of domains. We
then describe a validation system that uses each of these metrics to analyze instances of behavior. The result of the
analysis is an overall view of the similarities and differences between the CGF's behavior and the expert's behavior. A
system designer can examine this analysis to help ensure that the CGF is acting in a manner sufficiently similar to the
expert for the task at hand, thus providing an efficient means to validate a complex agent.
1. Introduction
Development of computer-generated forces (CGFs) is
often a difficult task. Moreover, as CGFs begin to exhibit
increasingly human-like behavior, the task of developing
the CGF and then validating its behavior to ensure that it
performs up to its specifications becomes additionally
A typical development model begins with knowledge
acquisition. During this phase, human domain experts are
interviewed to determine underlying rules that guide their
behavior. A knowledge engineer then uses this
information to encode the CGF's knowledge base (KB) in
a form that is usable by the underlying agent architecture.
After this initial phase, the CGF's knowledge is in a form
that is mostly correct, but may contain errors that will
cause it to behave inappropriately in some situations.
Validation, the process of determining whether a system's
external behavior meets the user's requirements, attempts
to uncover these situations [1] [5].
During validation, the domain expert typically examines
the CGF's behavior on a number of test cases in order to
identify errors. Because this process is both error-prone
and tedious, an automated approach would be highly
desirable. On the surface, it might seem that error
detection could be automated easily. Unfortunately,
identifying errors without direct help from a domain
expert is not a straightforward task because not every
deviation in behavior is an error. In fact, the concept of an
error is ambiguous and closely tied to the properties of the
underlying task that is being performed. The
complexities surrounding this problem can be illustrated
best with a concrete example.
Consider an airborne defense mission in which the pilot
flies combat air patrol intercepting enemy planes as they
are identified. Initially, the expert pilot and its CGF
counterpart fly identically. Once an enemy is identified,
both the expert and the CGF decide to use the same tactic
to engage the enemy. While executing their maneuvers,
however, there are significant differences between the
speed and altitude of the expert and the CGF, even though
they both succeed in shooting down the enemy. Finally,
on the return to base, there is once again a deviation
between the expert’s and the CGF’s speed and altitude
even though their actions are otherwise identical. The
problem at this point is to determine whether the CGF’s
deviation from the expert’s behavior indicates an error, or
whether the CGF’s actions are within the scope of correct
behavior. In order to make this distinction, it may be
necessary to examine the expert’s and CGF’s actions,
which affect the external environment, as well as their
goals, which indicate their motivation for pursuing
particular actions. Unfortunately, the best way to examine
these behavioral elements is highly dependent on the
domain. These domain dependencies make error
detection difficult for the following reasons:
• Differences between two sequences of actions do
not necessarily imply an error. CGFs are meant to
operate in high fidelity simulations and in real-world
environments. These environments have a large
potential goal and action space, and as the size of this
space increases, there is an increased likelihood that
problems will have multiple solutions. This means
that the simple observation that a CGF and an expert
performed a different action (such as maneuvering
the airplane at a different speed) is not, by itself,
enough information to determine whether an error
has occurred.
• Criteria for determining correctness may change
during problem solving. Complex tasks require
solving many simple problems. Each sub-problem
may be viewed as a task in itself, and as such, the
criteria used to judge whether it is correct may be
unique within the scope of the overall problem. For
example, similarity of speed and altitude may be
relatively unimportant while engaging the enemy, but
extremely important on the return to base (or vice
versa). This means that error detection methods must
be adaptable to suit a number of different situations.
Otherwise, their use will be severely limited.
• Criteria for determining correctness may depend
on the goals accomplished. In some situations,
primitive actions may be the wrong level of
abstraction for determining whether a particular
behavior is correct. In such cases, it may make more
sense to leave the exact implementation open ended
and perform an evaluation based on the motivation
for performing each action, or goal. In the case of the
CGF pilot, for example, a dogfight may be
considered correctly executed so long as the enemy is
destroyed even though the exact sequence of actions
used to accomplish this goal may differ from the
expert’s example.
• The context of the task may affect which solutions
are acceptable. For example, there may be
significant flexibility in choosing a tactic of
engagement when only a small number of planes is
involved, but if the attack is part of a larger,
coordinated effort, tactics that would otherwise be
acceptable may be considered inappropriate in this
situation. The result of this property is that in many
cases the motivation for performing an action (i.e. the
overall goal) will factor into determining whether that
action was correct.
• Only a fraction of available information is
necessary to determine whether an error has
occurred. In complex, real-world environments the
state space is often very large if not infinite. In order
to deal with these worlds, CGFs must efficiently
abstract their sensory information. As a result, only a
fraction of the information is required to decide what
action should be pursued at any given point or
whether the CGF is behaving appropriately. Thus, as
the environment grows more complex, error detection
becomes more difficult because detection methods
must correctly differentiate between features of the
environment that are important for determining
whether the CGF is behaving correctly, and features
that should be ignored.
• Problem solving strategies may be flexible and
diverse. In non-deterministic environments, actions
may not always have their intended effect. As a
result, a procedure may fail unexpectedly even
though it is the correct thing to do. To deal with such
situations effectively, CGFs are likely to benefit from
a flexible and diverse set of strategies that
accomplish the task using different means. This
complicates error detection because a detection
method that overly constrains the CGF's behavior (for
example, by ensuring that the CGF’s actions are
identical to the expert’s actions) may adversely affect
the CGF’s performance when unexpected situations arise.
The difficulties surrounding automatic error detection are
not trivial to overcome. Indeed, it may not be possible to
develop a useful error detection system that can operate
autonomously in the complex, interactive domains of
many CGFs. Nonetheless, our research makes a step
toward that goal by identifying the difficulties and
outlining a set of weak methods that can be used across
many different domains to identify errors. Although these
methods do not, in themselves, define an autonomous
error detection system, they do produce a high level
analysis of the similarities and differences in behavior that
we believe will be an improvement over current
validation techniques.
2. Related Work
Detecting when an error has occurred is important for a
number of tasks besides knowledge validation. In fact, it
is a fundamental problem that has a bearing on:
• Automatic correction of knowledge bases, in which a
system modifies an agent's knowledge to minimize
the number of situations in which the agent behaves
differently than a domain expert.
• Intelligent tutoring systems, in which a system
detects errors in a human novice's behavior when
compared against a formal specification.
• Intelligent interfaces, in which a system helps a user
perform a task by identifying or even recovering
from incorrect behavior.
Given the applicability of error detection to a number of
problems in artificial intelligence, it is somewhat
surprising that it has not been the center of more attention.
Nonetheless, four distinct fields of active research may
contribute toward novel methods of automating error
detection.
The knowledge base verification and validation (V & V)
community has explored issues related to judging a KB's
correctness. Verification, the process of determining
whether a system's internal specification is correct, has
received the majority of attention within this community.
Much of this work is focused even more narrowly on
issues related to static verification methods (e.g. [3], [7],
[10], [11]) that can be performed automatically and
without the agent architecture. However, relatively little
of the work in this community has focused on the
complementary problem of automatic validation [14].
What research has been done in this field typically
circumvents the problem of automatically detecting errors
by off-loading this task onto the human domain expert
(e.g. [5], [8]). Automating this process has the potential
of significantly reducing cost, and making a significant
step forward in validation technology.
The problem of detecting errors is addressed in the
knowledge refinement and theory refinement
communities. Knowledge and theory refinement are
concerned with producing correct knowledge bases from
initial and partially flawed knowledge. Although these
fields share many similarities with V & V, this
community emphasizes automatically fixing problems
after they are detected. As a result, it may not be
surprising that they must have some method for
determining when an error has occurred. Although some
refinement frameworks allow some sort of user defined
specification of what constitutes an error (e.g. [6]), most
refinement frameworks avoid the underlying complexities
of the error detection problem. Instead, these frameworks
make one or more limiting assumptions: that the nature of
an error is very constrained, and perhaps not even context
dependent [2] [9]; or that the task is limited to non-interactive problems such as classification, where errors
can be detected without comparing to long episodes of
human behavior [2] [6] [9].
In the field of intelligent tutoring systems (ITS), the error
detection problem is addressed with slightly different
assumptions. In this community, the goal is to determine
whether a human novice performs a task, such as multi-column subtraction, correctly. When the tutoring system
determines that the human has made an error, it will
attempt to provide some general knowledge that will
allow the human to solve the problem correctly. This
field of research has had mixed success. In general,
systems that have performed best operate in restricted
domains where there are very few ways to solve a
particular problem, and it is possible to ensure that the
ITS’s knowledge of these potential solutions is complete.
Finally, the dual of error detection is the subject of study
in both the plan and goal recognition communities. In
this body of research, the recognition framework monitors
an action sequence (either produced by another CGF or a
human) and attempts to determine what goal or plan this
other entity is pursuing. Plan recognition has examined a
number of different approaches to help classify behavior
ranging from plan libraries to Bayesian networks. The
specifics of our problem, both that we are looking for
divergent as opposed to convergent behavior, and that we
are dealing with CGFs in rich, dynamic environments,
mean that some of the typical assumptions of plan
recognition will be violated.
Each of the four fields described above has some relation
to the problem of detecting errors in a CGF's behavior.
Although some work has examined simplified versions of
the general error detection problem, automatic error
detection performed at more than a superficial level
remains a distant goal.
3. Error Detection Methods
When examining potential error detection methodologies,
we make two main assumptions. The first of these is that
error detection is performed by comparing the behavior of
a CGF to previously recorded expert behavior. This
assumption is common among most systems that
incorporate even minimal amounts of error detection.
The major weakness of this methodology is that errors
can only be detected if they are exposed by the example
problems. Thus, choosing which examples should be
used for validation has a significant impact on the
efficacy of the validation process. However, failing to
make this basic assumption about the availability of
solved example problems means that error detection must
rely upon only weak, task-independent, information such
as loops in behavior or strong environmental feedback
such as death. Although our error detection methods will
use expert behavior to perform validation, we will not
focus on how this behavior is selected. Potential methods
for selecting behavior have been addressed in some
previous research (e.g. [13]), and further progress will be
left as future work.
The second assumption that underlies all of the
approaches we will consider is the availability of
particular information in the expert behavior traces. Two
simple error detection methods used previously in
knowledge refinement and plan recognition both require
explicit information about the sequence of actions
pursued by the expert during problem solving. To
increase the applicability of our methods to complex and
dynamic interactive domains, we also require that
information about the sequence of states encountered by
the expert during problem solving is available either
explicitly or by deductive means. In many situations,
such as when the expert interacts with a computer
simulation, information about the world states
encountered by the expert and the sequence of actions
they pursue can easily be captured by recording the
stream of data to and from the simulator.
4. Building Blocks of ED Methods
Each potential approach is characterized by a number of
different properties described below. Together, these
form a landscape of methodologies that could be used to
identify problematic differences in two streams of
behavior. The specific properties encapsulated by a
methodology will impact how simple it is to use, and the
situations for which it is most suited.
ED1: [Availability of Goal Annotations] As we
discussed previously, our methods rely on traces of
human behavior. They must contain, at the very least,
explicit information about what actions the expert
performed, as well as a means to correctly deduce the
sequence of environmental states that was encountered.
In addition to this information, some error detection
methodologies may rely on knowledge about the expert's
goals at each point in problem solving. If this is the case,
that information might be Given or Abduced using some
inference procedure. In the tables that follow, the label ‘-’
for property ED1 indicates that no information about the
expert’s goals is required. The label ‘G’ indicates that the
goals have been given, and the label ‘A’ indicates that the
goals have been abduced.
ED2: [Abstracted Examples] Because an expert
behavior trace describes only a single solution path, it
may be useful to use an abstracted version instead of, or
along with, the original. Abstracted expert behavior may
indicate, for example, that the expert’s goal ‘look’ is
functionally equivalent to the goals ‘examine’ and
‘inspect’. Abstraction may also be used to indicate that
actions such as ‘turn left quickly’ and ‘turn left slowly’ are
functionally equivalent. Defining these abstractions
allows an error detection method to ignore irrelevant
details in the expert’s behavior. This then helps to focus
the error detection methods on aspects of the behavior
that are critical to determining its correctness. Error
detection methods that employ abstraction are labeled
with an ‘A’ in property ED2. Error detection methods
that do not make use of abstraction are labeled with a ‘-’.
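One way to picture such abstractions is as a lookup table that maps each concrete goal or action onto a representative of its equivalence class. The sketch below is illustrative only: the table entries reuse the examples from the text, while the names EQUIVALENCE_CLASSES, abstract, and abstract_trace are assumptions introduced here and do not describe an existing system.

```python
# Hypothetical equivalence classes; the entries reuse the examples in the text.
EQUIVALENCE_CLASSES = {
    "examine": "look",
    "inspect": "look",
    "turn left quickly": "turn left",
    "turn left slowly": "turn left",
}


def abstract(symbol):
    """Map a concrete goal or action to its class representative.

    Symbols without an entry are returned unchanged."""
    return EQUIVALENCE_CLASSES.get(symbol, symbol)


def abstract_trace(trace):
    """Abstract every element of a behavior trace."""
    return [abstract(symbol) for symbol in trace]


expert = ["look", "turn left quickly", "shoot"]
cgf = ["inspect", "turn left slowly", "shoot"]

# The raw traces differ, but their abstracted forms are identical.
print(expert == cgf)                                  # False
print(abstract_trace(expert) == abstract_trace(cgf))  # True
```

Taxonomies or numerical ranges could play the same role as this flat table; a dictionary is used here only because it is the simplest form of equivalence class.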
ED3: [Comparison Method] To determine whether an
error has occurred, the CGF's behavior must be compared
against the expert’s behavior in some manner. Consider
the following simple expert action sequence that
completes an attack task: arm-weapon, move, shoot.
Comparing this expert behavior to a CGF’s behavior can
be done in one of two ways. The first method uses a
strong comparison that we refer to simply as compare
(labeled ‘C’ for property ED3 in the tables below). Using
this strong method, each element in the expert’s behavior
is expected to correspond to an element in the CGF’s
behavior. If the CGF fails to generate a corresponding
behavior stream, an error is detected. An alternative to
this approach is a much weaker comparison method we
call justify (labeled ‘J’ in the tables below). This
approach attempts to ensure that the CGF is able to
explain or justify the expert’s behavior at each point in the
behavior trace. If so, then this indicates that the CGF and
the expert share a body of knowledge about when goals
and actions are appropriate. If not, this indicates an error.
Unlike the stronger metric compare, justify allows the
agent an additional degree of freedom to exploit
knowledge or preferences about alternative actions that
the expert might not have.
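The contrast between the two comparison methods can be sketched as follows. This is a minimal illustration under stated assumptions: the CGF's knowledge is reduced to a hypothetical table (CGF_KNOWLEDGE) from state names to the actions it considers appropriate, and the state names themselves are invented for the example. Only the action sequence arm-weapon, move, shoot comes from the text.

```python
def compare(expert_actions, cgf_actions):
    """Strong comparison: each expert action must be matched by the CGF
    action at the same position. Returns True when no error is detected."""
    return list(expert_actions) == list(cgf_actions)


def justify(expert_trace, acceptable_actions):
    """Weak comparison: the CGF must be able to explain each expert action.

    `acceptable_actions(state)` is a hypothetical query that returns the set
    of actions the CGF considers appropriate in `state`."""
    return all(action in acceptable_actions(state)
               for state, action in expert_trace)


# Assumed toy knowledge: state -> actions the CGF considers appropriate.
CGF_KNOWLEDGE = {
    "enemy-sighted": {"arm-weapon", "evade"},
    "weapon-armed": {"move", "shoot"},
    "in-range": {"shoot"},
}


def cgf_acceptable(state):
    return CGF_KNOWLEDGE.get(state, set())


expert_trace = [("enemy-sighted", "arm-weapon"),
                ("weapon-armed", "move"),
                ("in-range", "shoot")]
expert_actions = [action for _, action in expert_trace]
cgf_actions = ["arm-weapon", "shoot", "shoot"]  # the CGF prefers to shoot early

print(compare(expert_actions, cgf_actions))   # False: the sequences differ
print(justify(expert_trace, cgf_acceptable))  # True: every expert action is explainable
```

Here compare flags the CGF's preference for shooting early as an error, while justify accepts it because the CGF also regards the expert's choice as appropriate.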
5. Basic Error Detection Methods
By combining the properties described above, we
construct five distinct error detection methods. These
methods, encompassing three novel methods and two that
have been provided by prior work in knowledge
refinement and plan recognition, are outlined in the
following sections. As we examine each method, we
analyze it with respect to the following criteria:
• What demands are placed on the domain expert and
knowledge engineer?
• What trade-off is made between maximizing true
positives (detected errors that are in fact errors) and
reducing false positives (detected errors that are in
fact acceptable behavior)?
• How does the method compare to the benchmark
methods provided by the knowledge refinement and
plan recognition communities?
• What environmental/CGF properties are supported by
this method?
Analysis of these constraints will then allow us to
determine the types of situations that a particular method
is best suited for, and those situations in which it will be
5.1 Strict Matching
Table 1: Properties of Strict Matching
Strict matching (SM) ensures that the CGF pursues the
same series of actions as the expert when faced with a
common situation. This benchmark approach is inherited
from the knowledge refinement community that uses a
similar view of what constitutes an error on non-interactive, classification-type tasks. Strict matching
attempts to force the CGF to follow the same solution
path. Returning to our previous air-combat example,
strict matching would ensure that the expert and the CGF
perform the same actions in the environment at all times
(including how the plane is maneuvered during execution
of specific tactics, and the speed and altitude of the plane
at all points in flight). In strict matching, the expert’s
behavior is not abstracted in any manner, and errors are
detected any time the CGF's behavior deviates from the
expert's (Table 1 indicates the basic properties of SM). If
strict matching is performed off-line, using captured
traces of CGF and expert behavior, it can be viewed as a
string matching problem. This would allow efficient
exact matching algorithms to be used to indicate whether
an error has taken place. Alternatively, approximate string
matching algorithms could be used to indicate how many
errors (of omission, insertion, or substitution) occurred in
the CGF's behavior.
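As a rough illustration of the approximate string matching view, the sketch below computes the Levenshtein edit distance between two action sequences, treating each action as one symbol. Levenshtein distance is one standard choice of approximate matching measure and is an assumption here, as are the function name edit_distance and the example traces.

```python
def edit_distance(expert, cgf):
    """Levenshtein distance between two action sequences.

    Counts the minimum number of omissions, insertions, and substitutions
    needed to turn the expert's sequence into the CGF's sequence."""
    prev = list(range(len(cgf) + 1))
    for i, expert_action in enumerate(expert, start=1):
        curr = [i]
        for j, cgf_action in enumerate(cgf, start=1):
            cost = 0 if expert_action == cgf_action else 1
            curr.append(min(
                prev[j] + 1,         # expert action omitted by the CGF
                curr[j - 1] + 1,     # extra action inserted by the CGF
                prev[j - 1] + cost,  # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1]


expert = ["arm-weapon", "move", "shoot"]

print(edit_distance(expert, ["arm-weapon", "move", "shoot"]))          # 0
print(edit_distance(expert, ["arm-weapon", "shoot"]))                  # 1 (omission)
print(edit_distance(expert, ["arm-weapon", "turn", "move", "fire"]))   # 2
```

A distance of zero corresponds to the exact matching case; any positive value gives a count of deviations that a domain expert could then inspect.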
Perhaps the greatest advantage of strict matching is the
fact that it requires no additional knowledge to be
supplied by either the domain expert or the knowledge
engineer. This method can be applied using only the
observed sequence of states and actions that were
encountered and pursued during problem solving.
Because the methodology is very strict in terms of what
constitutes an error, it maximizes true positives, but does
little to reduce false positives. As a result, it is likely that
CGFs that allow their problem solutions to be influenced
by individual preferences would be classified as
exhibiting incorrect behavior. To eliminate the false
positives flagged by this under-specific routine, it is
extremely likely that the domain expert would need to
filter errors returned by this method, thus resulting in
increased human effort and reducing the other benefits of
this approach.
Strict matching meets many of the constraints and
requirements outlined in Section 5. However, because of
the rigidity of the error detection method, it is not
particularly well suited to complex environments with
many states or goals. Moreover, because a valid CGF
must emulate expert behavior according to this approach,
this method is not ideal for environments where diversity
of behavior is valued. On the other hand, strict matching's
ability to maximize true positive errors can be useful in
domains with irreversible actions or with a high cost of
failure. In some situations, such as protocol driven
exercises, this property may outweigh all other concerns.
5.2 Justification Based Detection
Table 2: Properties of Justification Based Detection
Justification based detection attempts to ensure that the
CGF can explain the expert’s actions at all points in time
(see Table 2). This second benchmark approach is
inherited from plan recognition, where the goal is simply
to understand another agent's motivations. Justification based
detection requires that the CGF and the expert both share
a large body of common knowledge, but it also allows a
significant degree of flexibility not found in other
methods. This freedom allows the CGF to pursue its own
solution paths while still ensuring that the CGF is able to
produce (at least in principle) guaranteed correct
behavior. It is important to note that justification based
detection does not necessarily need to view actual CGF
behavior. Instead, error detection is performed simply by
asking the CGF to explain each of the expert’s actions.
Returning to the air-combat example, this means that so
long as the CGF could justify the expert engaging the
enemy, employing its chosen maneuvers, and then
returning to base at a given speed and altitude, no error
would be detected. Unfortunately, this methodology is
likely to leave some of the CGF's knowledge untested.
This means that although a CGF may understand expert
behavior, in practice it may not perform the same actions
as the expert would have.
As with the other benchmark approach (SM), the main
advantage of justification based error detection is that it
can be used without any additional knowledge sources
beyond the solved example problems, thus reducing the
human effort involved in its use. Unlike strict matching,
the inherent optimism of this method allows the CGF a
large degree of freedom to pursue its own form of
problem solving and reduces false positives. At the same
time, however, this optimism is likely to lead to a similar
reduction in the total number of detected errors.
Similar to strict matching, justification based detection
meets many of our constraints. This approach should fare
well in environments with a large state space, as it is only
intended to ensure that the CGF's knowledge
encompasses the expert's. In environments with a large
goal space, however, it is increasingly likely that the CGF
may be able to justify the expert behavior even if the
knowledge used to make the justification is incorrect.
This may occur when the goals associated with a
particular action sequence are different for the CGF and
the expert. On the other hand, this method does help
support behavioral diversity by allowing CGFs to pursue
their own solutions. Nonetheless, the optimism of JBD is
likely to be a disadvantage in areas where there is a high
cost of failure because the decisions a CGF makes during
execution may not be known during validation.
5.3 Generalized Sequence Matching
Table 3: Properties of Generalized Sequence Matching
Generalized sequence matching (GSM) takes the strict
matching approach, but allows the expert’s actions to be
viewed abstractly (see Table 3). Strict matching uses the
observed expert's actions as a template that the
CGF's behavior must match. Each action in the sequence
is likely to be associated with parameters that modify (to
some extent) its effects on the environment. GSM
extends this approach by defining how actions may be
substituted for one another. This allows the CGF some
degree of freedom in its behavior that would otherwise
have been identified as an error. Using this methodology
in the air-combat example, one could explicitly inform the
error detection system whether speed and altitude are
critical components (for the sake of identifying errors) of
the fly-plane action, and whether they should be restricted
to a specific range of values.
This method requires only minimal additional effort on
the part of the domain expert. Before validation begins,
the abstraction relationships (such as equivalence classes,
taxonomies, or valid numerical ranges) must be defined.
These relationships are likely to be reusable for future
validation of other CGFs so long as they all operate in the
same domain. Moreover, the effort required to encode
this abstraction knowledge is independent from the
number of examples used during validation and the
number of rules used to encode the CGF's behavior.
The coverage of generalized sequence matching (GSM)
lies somewhere between that of the benchmark methods,
SM and JBD. In terms of the constraints outlined earlier,
GSM is well suited to complex environments because of
its ability to allow a relatively loose correspondence
between the CGF's and the expert's behavior, while at the
same time having the ability to explicitly relate how much
these two behaviors deviate. Not surprisingly, we expect
that this method should perform relatively well in
environments that have irreversible actions or a high cost
of failure. Although it is potentially more optimistic than
SM, in GSM we can set the acceptable abstraction level
arbitrarily low in order to achieve the most desirable
trade-off between a low number of false positives and a
high number of true positives.
5.4 Abstract Goal Based Matching
Table 4: Properties of Abstract Goal Based Matching
Abstract goal based matching (AGBM) makes a
somewhat different extension to strict matching. In this
technique, the expert’s and the CGF’s goals, as opposed
to their individual actions, are compared to one another.
AGBM uses annotations in the solved example problems
to compare the expert's upper n goals to the CGF's upper
n goals at each point in time (see Table 4). Using this
method to validate behavior allows the error detection
system to verify that both the expert and the CGF pursue
and achieve the same goals, even if their lower level
actions (and potentially some lower level goals) are
significantly different. In the air combat domain, this
would make it easy to identify that both the expert and the
CGF correctly completed the dogfight task by shooting
down the enemy, but might make it hard to identify
whether a specific maneuver had been performed.
As with strict matching, detecting differences between the
CGF’s behavior pattern and the expert’s behavior pattern
could be viewed as a string matching problem if it is
performed off-line. The essential difference here is that
the strings would be composed of goals as opposed to the
primitive actions that are explicitly available in the
behavior traces.
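A goal-level comparison of this kind might look like the sketch below. It assumes that the expert's annotated goals and the CGF's goals are available as one goal stack per time step, listed from the most abstract goal downward, and that the two traces are already aligned step by step. The function name agbm_mismatches and the goal names are hypothetical.

```python
def agbm_mismatches(expert_stacks, cgf_stacks, n):
    """Return the time steps at which the upper n goals differ.

    Each stack lists goals from the most abstract (index 0) downward.
    The traces are assumed to be aligned and of equal length."""
    mismatches = []
    for t, (expert, cgf) in enumerate(zip(expert_stacks, cgf_stacks)):
        if expert[:n] != cgf[:n]:
            mismatches.append(t)
    return mismatches


# Hypothetical goal annotations for three time steps.
expert_stacks = [
    ["patrol"],
    ["intercept", "dogfight", "pincer"],
    ["return-to-base"],
]
cgf_stacks = [
    ["patrol"],
    ["intercept", "dogfight", "head-on"],
    ["return-to-base"],
]

print(agbm_mismatches(expert_stacks, cgf_stacks, n=2))  # []
print(agbm_mismatches(expert_stacks, cgf_stacks, n=3))  # [1]
```

With n=2 the differing choice of maneuver goes unnoticed because only the top two goals are compared; with n=3 it is reported at step 1. This is the coupling trade-off in miniature.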
Abstract Goal Based Matching (AGBM) requires the
expert to annotate their problem solutions. Like GSM,
the human effort involved is likely to be independent
from the number of rules used to encode the CGF's
knowledge. Instead, the effort is proportional to the
number of solved problems used for validation.
The coverage of AGBM is of a somewhat different nature
than the error detection methods we have previously
discussed, because the action sequence of the CGF and
the expert are never explicitly compared. This means that
the types of allowable action sequences are relatively
diverse so long as they can achieve the appropriate goals.
Because AGBM only needs to recognize a small set of
high-level goal states, and because it already has rules to
perform this function efficiently, it should scale well as
the size of the state and goal spaces expand. AGBM's
ability to deal with different environments can be
controlled easily by adjusting the required level of
correspondence between the CGF's and the expert's goals.
A looser coupling involving perhaps the top few layers of
the goal hierarchy will allow CGFs more flexibility than a
tight coupling that requires correspondence deep into the
goal stack. Not surprisingly, the tightness of the coupling
also has a significant impact on the suitability of this
method for environments with irreversible actions and a
high cost of failure. These environments, however,
benefit from a tighter coupling where errors in the CGF's
reasoning process may be uncovered before an action is
even taken.
5.5 Grounded Variability Matching
Table 5: Properties of Grounded Variability Matching
In the same vein as AGBM, grounded variability
matching (GVM) works to ensure that both the CGF and
the expert pursue the same goals. This method also works
from the bottom up similar to GSM (see Table 5) by
taking into account what primitive actions are legitimate
within the context of the current goal stack. Returning to
our air-combat example, using GVM one would be able to
explicitly inform the error detection system that the
altitude and speed are critical components of the fly-plane
action when the goal is return-to-base, but that these
components are relatively unimportant in other
As with AGBM, the expert must annotate the example
problems with the goals that are being pursued at each
point in time. Comparison of the CGF's and expert's
goals takes place down to a pre-specified level and
ensures that CGFs have the correct motivation for their
actions. Tied to this bottom tier of the goal hierarchy is a
list of allowable primitive actions. Comparison of the
CGF's actions to the list of allowable actions ensures that
the CGF is pursuing its goals in an acceptable fashion.
This list may be extracted via interviews with a domain
expert, but it is probably more feasible to extract it by
observing experts perform the task multiple times as has
been done in [12].
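The two-stage comparison described above might be sketched as follows. This is an illustration under stated assumptions rather than the actual implementation: goal stacks are assumed to be lists ordered from the top-level goal downward, the allowable-action lists are assumed to be keyed by the goal context at the pre-specified depth, and the names gvm_check, ALLOWED, fly-mission, lower-gear and fire-missile are hypothetical.

```python
from typing import Dict, FrozenSet, Sequence, Tuple

GoalContext = Tuple[str, ...]


def gvm_check(
    cgf_goals: Sequence[str],
    expert_goals: Sequence[str],
    cgf_action: str,
    depth: int,
    allowed: Dict[GoalContext, FrozenSet[str]],
) -> str:
    """Compare goals down to `depth`, then check the CGF's primitive action."""
    cgf_context = tuple(cgf_goals[:depth])
    expert_context = tuple(expert_goals[:depth])
    if cgf_context != expert_context:
        return "goal-mismatch"
    # An unknown goal context has no allowable actions in this sketch.
    if cgf_action not in allowed.get(expert_context, frozenset()):
        return "action-not-allowed"
    return "ok"


# Hypothetical allowable-action list tied to one goal context.
ALLOWED: Dict[GoalContext, FrozenSet[str]] = {
    ("fly-mission", "return-to-base"): frozenset({"fly-plane", "lower-gear"}),
}

goals = ["fly-mission", "return-to-base"]
print(gvm_check(goals, goals, "fly-plane", 2, ALLOWED))
print(gvm_check(goals, goals, "fire-missile", 2, ALLOWED))
print(gvm_check(["fly-mission", "intercept"], goals, "fly-plane", 2, ALLOWED))
```

A fuller version would presumably attach constraints on action parameters, such as altitude and speed, to each goal context; the sketch checks only action names.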
Compared to other methods we have examined, GVM
puts relatively high demands on the domain expert. Not
only must goals be represented explicitly in the problem
solutions (as in AGBM), but this method also requires
observing a potentially large amount of expert behavior in
order to gather information about what primitive actions
are allowed within a specific context.
GVM has the potential of providing better coverage than
AGBM. The exact difference between these methods is
likely to be significantly influenced by the details of the
domain. Compared to the two benchmark methods from
knowledge refinement (SM) and plan recognition (JBD),
GVM is likely to do much better, but at the cost of
increased human effort.
In terms of meeting the constraints of an ideal error
detection methodology, GVM performs similarly to
AGBM. It is well suited to environments with many
states and plans, although because it must keep primitive
action lists, there may be an efficiency concern in
environments with a very large number of allowable
actions. As in AGBM, the degree of coupling affects the
ability of this method to deal well with environments that
have shared resources or irreversible actions and a high
cost of failure. To some extent, however, the severity of
this trade-off is mitigated by the fact that each goal
context is associated with allowable primitive actions.
This means that even with a relatively loose coupling,
there is a greater possibility of preventing failure.
5.6 Discussion of Basic Methods
Table 6: Summary of Basic Error Detection Methods
Table 6 illustrates the properties of the five basic error
detection methods we have presented. Together, they
span a significant portion of the landscape of potential
error detection methodologies. However, the careful
reader will note that a number of potential methods have
been left unaddressed. These methods can be classified
into two groups, depicted in Table 7.
Table 7: Unexplored Error Detection Methods
Justification techniques represent all methods that use
justification but attempt to either justify abstractions of
the expert's action sequences (as given by the solved
example problems) or attempt to justify only the expert's
higher level goals. In both cases, these error detection
methods are used to justify abstract representations of the
expert's behavior, thus making them more general than
JBD. However, as we previously mentioned, JBD is
already a very optimistic error detection method and is
likely to allow a relatively large number of genuine errors
to go undetected. Because the remaining justification
techniques will detect only a subset of the errors
identified by JBD, these methods are unlikely to warrant
deeper investigation.
Basic abductive techniques suffer from similar problems.
Abductive techniques attempt to use the CGF's
knowledge in order to identify what goals the expert is
pursuing while solving the example problems. Correct
identification of the expert's goals allows error detection to
be performed at a more abstract level than is possible by
comparing primitive action sequences alone. Such techniques are
useful because they force the CGF to achieve the same
goals as the expert, but allow freedom in terms of the
primitive actions used to achieve each goal. However,
when abduction is used without an additional knowledge
source, the expert's goals can only be determined by
examining the primitive action sequence available in the
solved example problem. Because abduction relies on the
CGF's knowledge base to determine the expert's goals, it
is only likely to be successful if the CGF solves the
problem using a primitive action sequence that is very
similar to the one used by the expert. Clearly, this
constraint undermines the main power of abductive
techniques, namely that they can be used to force the CGF to
achieve the same goals as the expert while allowing
freedom in the underlying primitive action sequence.
This limitation means that basic abductive techniques that
do not rely on information aside from the expert's
behavior trace and the faulty CGF knowledge are not
worth further investigation.
6. Beyond Basic Methods
Each of the basic error detection methods outlined in the
previous section can be aided by the use of additional
information about the task domain. One such source of
information describes the method for selecting packets
from the CGF and expert behavior streams.
Before a CGF’s behavior is compared to an expert’s
behavior, a decision must be made as to which two packets
in the streams should be analyzed. The most appropriate
choice should be influenced by properties of the task and
domain. For example, CGFs that interact with a real-time
environment should integrate the value of a world-clock
into their method for selecting two packets of behavior to
be compared. On the other hand, CGFs that operate in
turn-based simulations may not have this requirement. In
this case, it may be sufficient to simply compare actions
as they occur, and ignore time that passes in between two
successive actions.
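The contrast between the two selection strategies can be made concrete with a small sketch. It assumes that each packet is a (world-clock time, action) pair and that, in the real-time case, an expert packet is paired with the CGF packet nearest in time within some tolerance. The pairing rule, the tolerance parameter and the function names are illustrative assumptions, not a description of the system's actual sequencer.

```python
from typing import List, Tuple

Packet = Tuple[float, str]  # (world-clock time, action name)


def pair_by_order(cgf: List[Packet], expert: List[Packet]) -> List[Tuple[Packet, Packet]]:
    """Turn-based selection: compare the i-th CGF action with the i-th expert action."""
    return list(zip(cgf, expert))


def pair_by_clock(
    cgf: List[Packet], expert: List[Packet], tolerance: float
) -> List[Tuple[Packet, Packet]]:
    """Real-time selection: pair each expert packet with the nearest-in-time CGF packet."""
    pairs = []
    for expert_packet in expert:
        nearest = min(cgf, key=lambda p: abs(p[0] - expert_packet[0]), default=None)
        # Expert packets with no CGF packet close enough in time stay unpaired.
        if nearest is not None and abs(nearest[0] - expert_packet[0]) <= tolerance:
            pairs.append((nearest, expert_packet))
    return pairs


expert_stream = [(0.0, "turn-left"), (2.0, "climb")]
cgf_stream = [(0.1, "turn-left"), (5.0, "climb")]

ordered = pair_by_order(cgf_stream, expert_stream)
timed = pair_by_clock(cgf_stream, expert_stream, tolerance=0.5)
print(len(ordered), len(timed))
```

In this example the order-based strategy pairs both actions and sees no difference, whereas the clock-based strategy leaves the expert's climb unpaired because the CGF climbed three seconds late.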
The example above illustrates just one way in which
additional domain knowledge could be used to improve
the error detection process. As our investigation
continues, we expect that we will be able to identify a set
of orthogonal components that can be used in conjunction
with one another to produce an error detection system
tailored to the needs of a particular domain.
7. System Design
One of the primary design goals for our error detection
system is the ability to scale as our understanding of the
basic framework that underlies error detection grows. As
a result, we have developed a modular framework that
allows us to separate different aspects of the error
detection process.
At this point, we have divided the framework into two
orthogonal components. The first component is
sequencing and involves selecting appropriate packets of
CGF and expert behavior to compare, as described in
Section 6. The second component is classification and
involves applying one or more of the basic error detection
metrics (described in Section 5) to the packets of behavior
selected by the sequencer. Because these components are
orthogonal, we get the maximal flexibility from our
system: each new component results in a combinatorial
increase in the number of potential ways in which we can
identify errors.
The initial step in using our error detection system is for
the knowledge engineer and domain expert to examine the
properties of the environment and determine which error
detection methods are most suitable for the task at hand.
In essence, this means picking out one or more sets of
components to perform the error detection task. Once this
has been done, two behavior streams, one from the expert
and one from the CGF, are used as input to the error
detection system. As the system examines the two
streams, it produces a description of where the CGF’s
behavior faithfully reproduces the expert’s behavior, and
where the CGF has made errors. Using this information,
the domain expert and knowledge engineer can examine a
large set of test cases, quickly isolating when and where
errors have occurred, and significantly reducing the cost
of the validation process.
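The separation between sequencing and classification might be organized along the following lines. This is a minimal sketch, assuming that packets are plain action names, that a sequencer is any function yielding pairs of packets, and that a classifier is any function returning True when two packets agree. All names here (detect_errors, in_order, same_action) are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Packet = str
Sequencer = Callable[[Sequence[Packet], Sequence[Packet]], Iterable[Tuple[Packet, Packet]]]
Classifier = Callable[[Packet, Packet], bool]  # True when the two packets agree


def detect_errors(
    cgf_stream: Sequence[Packet],
    expert_stream: Sequence[Packet],
    sequencer: Sequencer,
    classifiers: Dict[str, Classifier],
) -> List[Tuple[int, List[str]]]:
    """For each selected pair, list the names of the classifiers that flag a deviation."""
    report = []
    for index, (cgf_packet, expert_packet) in enumerate(sequencer(cgf_stream, expert_stream)):
        failed = [name for name, agrees in classifiers.items()
                  if not agrees(cgf_packet, expert_packet)]
        report.append((index, failed))
    return report


def in_order(cgf: Sequence[Packet], expert: Sequence[Packet]) -> Iterable[Tuple[Packet, Packet]]:
    """A trivial sequencer: pair packets by position."""
    return zip(cgf, expert)


def same_action(cgf_packet: Packet, expert_packet: Packet) -> bool:
    """A trivial classifier: the packets must be identical."""
    return cgf_packet == expert_packet


report = detect_errors(["a", "b", "c"], ["a", "x", "c"], in_order, {"same-action": same_action})
print(report)
```

Because the sequencer and the classifiers are passed in independently, any sequencer can be combined with any set of classifiers, which is the source of the combinatorial growth noted above.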
9. Future Work
Our research so far has paved the way for a broad
investigation of methods for detecting errors and
validating CGFs efficiently. Our near-term goal is to
examine the performance of our basic error detection
methods. To do this we will examine a simple object
retrieval domain in which a CGF accomplishes a number
of high-level goals such as plan-route, travel, and
find-object. This domain contains approximately 20 primitive
actions and 10 distinct goals, creating a very large space of
potential behavior.
As our experiments with this test domain mature, we will
focus on identifying two critical relationships. The first
relationship we will examine is between the output of an
error detection method and the impact of that information
on improving the efficiency of validation. A better
understanding of this relationship will allow us to
optimize our error detection methods. The second
relationship we will examine is between properties of a
goal and the effectiveness of a particular error detection
method. A deeper understanding of this relationship will
allow a better-grounded choice of which error detection
methods should be applied to a particular problem.
In the longer term, we will continue searching for ways in
which we can improve our system’s ability to detect
errors by exploiting new sources of knowledge. We will
continue to organize these knowledge sources into
orthogonal dimensions of a unified error detection
framework. This will allow us to take advantage of the
combinatorial growth of new detection methods that
occurs each time a new source of knowledge can be added
to the framework.
10. References
[1] David J. Bawcom: “An Incompleteness Handling
Methodology for Validation of Bayesian
Knowledge Bases”. Master's Thesis: Air Force
Institute of Technology, 1997.
[2] Susan Craw, D. Sleeman: “Automating the
Refinement of Knowledge-Based Systems”.
Proceedings of the ECAI90 Conference, pp. 167-172, 1990.
[3] Yolanda Gil, Eric Melz: “Explicit Representations
of Problem-Solving Strategies to Support
Knowledge Acquisition”. Proceedings of the
Thirteenth National Conference on Artificial
Intelligence, pp. 469-476, 1996.
[4] Randolph M. Jones, John E. Laird, Paul E. Nielsen,
Karen J. Coulter, Patrick Kenny, Frank V. Koss:
“Automated Intelligent Pilots for Combat Flight
Simulation”. AI Magazine, Vol. 20, pp. 27-42,
[5] Byeong Ho Kang, Windy Gambetta, Paul
Compton: “Verification and Validation with
Ripple-Down Rules”. International Journal of
Human Computer Studies, Vol. 44(2), pp. 257-269,
[6] Patrick M. Murphy, Michael J. Pazzani: “Revision
of production system rule-bases”. Proc. 11th
International Conference on Machine Learning, pp.
199-207, 1994.
[7] Tin A. Nguyen, Walton A. Perkins, Thomas J.
Laffey, Deanne Pecora: “Knowledge Base
Verification”. AI Magazine, Vol. 8, pp. 69-75,
[8] Robert M. O'Keefe, Osman Balci, Eric P. Smith:
“Validating Expert System Performance”. IEEE
Expert, Vol. 2(4), pp. 81-90, 1987.
[9] Douglas Pearson: “Learning Procedural Planning
Knowledge in Complex Environments”. Ph.D.
Thesis: University of Michigan, 1996.
[10] Alun D. Preece, Rajjan Shinghal, Aida Batarekh:
“Verifying Expert Systems: A Logical Framework
and a Practical Tool”. Expert Systems With
Applications, Vol. 5, pp. 421-436, 1992.
[11] Marcelo Tallis: “A Script-Based Approach to
Modifying Knowledge-Based Systems”.
International Journal of Human-Computer Studies,
To Appear.
[12] Michael van Lent: “Learning Task-Performance
Knowledge Through Observation”. Ph.D. Thesis:
University of Michigan, 2000.
[13] Nirmalie Wiratunga, Susan Craw: “Informed
Selection of Training Examples for Knowledge
Refinement”. Proceedings of the 12th European
Knowledge Acquisition Workshop, pp. 233-248,
[14] Neli Zlatareva, Alun Preece: “State of the Art in
Automated Validation of Knowledge-Based
Systems”. Expert Systems With Applications, Vol.
7(2), pp. 151-167, 1994.
Author Biographies
SCOTT WALLACE is a Ph.D. candidate in the
University of Michigan’s Computer Science program.
His research interests include empirical analysis of A.I.
architectures and knowledge engineering. He received
his B.S. in Physics and Mathematics from the University
of Michigan in 1996.
JOHN LAIRD is a Professor of Electrical Engineering
and Computer Science at the University of Michigan. He
received his B.S. from the University of Michigan in 1975
and his Ph.D. from Carnegie Mellon University in 1983.
He is one of the original developers of the Soar
architecture and leads its continued development and
evolution. From 1992-1997, he led the development of
TacAir-Soar, a real-time expert system that flew all of the
U.S. fixed-wing air missions in STOW-97.