Towards Combined Attention Focusing and Anomaly Detection in Medical Monitoring

From: AAAI Technical Report SS-94-01. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Richard J. Doyle
Artificial Intelligence Group
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, CA 91109-8099
rdoyle@aig.jpl.nasa.gov
Abstract

Any attempt to introduce automation into the monitoring of complex physical systems must start from a robust anomaly detection capability. This task is far from straightforward, for a single definition of what constitutes an anomaly is difficult to come by. In addition, to make the monitoring process efficient, and to avoid the potential for information overload on human operators, attention focusing must also be addressed. When an anomaly occurs, more often than not several sensors are affected, and the partially redundant information they provide can be confusing, particularly in a crisis situation where a response is needed quickly.

The focus of this paper is a new technique for attention focusing. The technique involves reasoning about the distance between two frequency distributions, and is used to detect both anomalous system parameters and "broken" causal dependencies. These two forms of information together isolate the locus of anomalous behavior in the system being monitored.

1 Introduction

Mission Operations personnel at NASA have the task of determining, from moment to moment, whether a space platform is exhibiting behavior which is in any way anomalous, which could disrupt the operation of the platform, and in the worst case, could represent a loss of ability to achieve mission goals. A traditional technique for assisting mission operators in space platform health analysis is the establishment of alarm thresholds for sensors, typically indexed by operating mode, which summarize which ranges of sensor values imply the existence of anomalies. Another established technique for anomaly detection is the comparison of predicted values from a simulation to actual values received in telemetry. However, experienced mission operators reason about more than alarm threshold crossings and discrepancies between predicted and actual values to detect anomalies: they may ask whether a sensor is behaving differently than it has in the past, or whether a current behavior may lead to a rapidly developing alarm sequence, the particular bane of operators.

Our approach to introducing automation into real-time systems monitoring is based on two observations: 1) mission operators employ multiple methods for recognizing anomalies, and 2) mission operators do not and should not interpret all sensor data all of the time. We seek an approach for determining from moment to moment which of the available sensor data is most informative about the presence of anomalies occurring within a system. The work reported here extends the anomaly detection capability in Doyle's SELMON monitoring system [5, 6] by adding an attention focusing capability. Other model-based monitoring systems include Dvorak's MIMIC, which performs robust discrepancy detection for continuous dynamic systems [7], and DeCoste's DATMI, which infers system states from incomplete sensor data [3]. This work also complements other work within NASA on empirical and model-based methods for fault diagnosis of aerospace platforms [1, 8, 9, 12].

2 Background: The SELMON Approach
How does a human operator or a machine observing a complex physical system decide when something is going wrong? Abnormal behavior is always defined as some kind of departure from normal behavior. Unfortunately, there appears to be no single, crisp definition of "normal" behavior. In the traditional monitoring technique of limit sensing, normal behavior is predefined by nominal value ranges for sensors. A fundamental limitation of this approach is the lack of sensitivity to context. In the other traditional monitoring technique of discrepancy detection, normal behavior is obtained by simulating a model of the system being monitored. This approach, while avoiding the insensitivity to context of the limit sensing approach, has its own limitations. The approach is only as good as the system model. In addition, normal system behavior typically changes with time, and the model must continue to evolve. Given these limitations, it can be difficult to distinguish genuine anomalies from errors in the model.

Noting the limitations of the existing monitoring techniques, we have developed an approach to monitoring which is designed to make the anomaly detection process more robust, to reduce the number of undetected anomalies (false negatives). Towards this end, we introduce multiple anomaly models, each employing a different notion of "normal" behavior.
2.1 Empirical Anomaly Detection Methods

In this section, we briefly describe the empirical methods that we use to determine, from a local viewpoint, when a sensor is reporting anomalous behavior. These measures use knowledge about each individual sensor, without knowledge of any relations among sensors.
Surprise

An appealing way to assess whether current behavior is anomalous or not is via comparison to past behavior. This is the essence of the surprise measure. It is designed to highlight a sensor which behaves other than it has historically. Specifically, surprise uses the historical frequency distribution for the sensor in two ways: to determine the likelihood of the given current value of the sensor (unusualness), and to examine the relative likelihoods of different values of the sensor (informativeness). It is those sensors which display unlikely values when other values of the sensor are more likely which get a high surprise score. Surprise is not high if the only reason a sensor's value is unlikely is that there are many possible values for the sensor, all equally unlikely.
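The surprise measure is not given in closed form here; one plausible formalization, combining an unusualness term with an entropy-based informativeness term (both the product form and the entropy deficit are our assumptions, not the paper's), is:

```python
import math

def surprise(hist, current_bin):
    """A sketch of the surprise measure: high when the current value is
    historically unlikely while other values are much more likely.
    hist: historical frequency distribution over value bins (sums to 1).
    current_bin: index of the bin containing the current sensor value."""
    n = len(hist)
    # Unusualness: how unlikely the current value is historically.
    unusualness = 1.0 - hist[current_bin]
    # Informativeness: entropy deficit of the historical distribution;
    # zero when all values are equally unlikely, which keeps surprise low
    # in exactly the case the text excludes.
    entropy = -sum(p * math.log(p) for p in hist if p > 0)
    informativeness = 1.0 - entropy / math.log(n) if n > 1 else 0.0
    return unusualness * informativeness
```

With a uniform history the informativeness term vanishes, so the surprise score is zero regardless of the current value, matching the behavior described above.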
Alarm

Alarm thresholds for sensors, indexed by operating mode, typically are established through an off-line analysis of system design. The notion of alarm in SELMON extends the usual one bit of information (the sensor is in alarm or it is not), and also reports how much of the alarm range has been traversed. Thus a sensor which has gone deep into alarm gets a higher score than one which has just crossed over the alarm threshold.
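The graded alarm score can be sketched as follows; the linear scaling and the full-scale bounds are assumptions, since the paper says only that the score grows with how much of the alarm range has been traversed:

```python
def alarm_score(value, lo, hi, lo_min, hi_max):
    """Sketch of the extended alarm measure: 0 in the nominal range
    [lo, hi], rising toward 1 as the sensor moves through the alarm
    range toward its full-scale bounds [lo_min, hi_max]."""
    if lo <= value <= hi:
        return 0.0                            # in the nominal range
    if value > hi:
        return (value - hi) / (hi_max - hi)   # fraction of high alarm range traversed
    return (lo - value) / (lo - lo_min)       # fraction of low alarm range traversed
```

A sensor halfway through its high alarm range thus scores 0.5, while one just past the threshold scores near zero.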
Alarm Anticipation

The alarm anticipation measure in SELMON performs a simple form of trend analysis to decide whether or not a sensor is expected to be in alarm in the future. A straightforward curve fit is used to project when the sensor will next cross an alarm threshold, in either direction. A high score means the sensor will soon enter alarm or will remain there. A low score means the sensor will remain in the nominal range or emerge from alarm soon.
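A minimal sketch of this measure, assuming a linear least-squares trend fit and a fixed look-ahead horizon (the paper specifies only "a straightforward curve fit", so both are assumptions):

```python
def anticipation_score(samples, lo, hi, horizon):
    """Sketch of alarm anticipation: fit a straight line to recent
    (time, value) samples and score how soon the projected trend
    crosses an alarm threshold in either direction."""
    n = len(samples)
    ts = [t for t, _ in samples]
    vs = [v for _, v in samples]
    t_bar, v_bar = sum(ts) / n, sum(vs) / n
    denom = sum((t - t_bar) ** 2 for t in ts)
    slope = sum((t - t_bar) * (v - v_bar) for t, v in samples) / denom if denom else 0.0
    now, v_now = samples[-1]
    if not lo <= v_now <= hi:
        return 1.0                  # already in alarm: will remain or deepen
    # Project the time of the next threshold crossing, in either direction.
    if slope > 0:
        t_cross = now + (hi - v_now) / slope
    elif slope < 0:
        t_cross = now + (lo - v_now) / slope
    else:
        return 0.0                  # flat trend: expected to stay nominal
    # Score decays from 1 (imminent crossing) to 0 (beyond the horizon).
    return max(0.0, 1.0 - (t_cross - now) / horizon)
```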
Value Change

A change in the value of a sensor may be indicative of an anomaly. In order to better assess such an event, the value change measure in SELMON compares a given value change to historical value changes seen on that sensor. The score reported is based on the proportion of previous value changes which were less than the given value change. It is maximum when the given value change is the greatest value change seen to date on that sensor. It is minimum when no value change has occurred in that sensor.

2.2 Model-Based Anomaly Detection Methods

Although many anomalies can be detected by applying anomaly models to the behavior reported at individual sensors, robust monitoring also requires reasoning about interactions occurring in a system and detecting anomalies in behavior reported by several sensors.

Deviation

The deviation measure is our extension of the traditional method of discrepancy detection. As in discrepancy detection, comparisons are made between predicted and actual sensor values, and differences are interpreted to be indications of anomalies. This raw discrepancy is entered into a normalization process identical to that used for the value change score, and it is this representation of relative discrepancy which is reported. The deviation score for a sensor is minimum if there is no discrepancy and maximum if the discrepancy between predicted and actual is the greatest seen to date on that sensor. Deviation only requires that a simulation be available in any form for generating sensor value predictions. However, the remaining sensitivity and cascading alarms measures require the ability to simulate and reason with a causal model of the system being monitored.

Sensitivity and Cascading Alarms

Sensitivity measures the potential for a large global perturbation to develop from the current state. Cascading alarms measures the potential for an alarm sequence to develop from the current state. Both of these anomaly measures use an event-driven causal simulator [2, 11] to generate predictions about future states of the system, given the current state. The current state is taken to be defined by both the current values of system parameters (not all of which may be sensed) and the pending events already resident on the simulator agenda. The measures assign scores to individual sensors according to how the system parameter corresponding to a sensor participates in, or influences, the predicted global behavior. A sensor will have its highest sensitivity score when behavior originating at that sensor causes all sensors causally downstream to exhibit their maximum value change to date. A sensor will have its highest cascading alarms score when behavior originating at that sensor causes all sensors causally downstream to go into an alarm state.

2.3 Previous Results

In order to assess whether SELMON increased the robustness of the anomaly detection process, we performed the following experiment: We compared SELMON performance to the performance of the traditional limit sensing technique in selecting critical sensor subsets specified by a Space Station Environmental Control and Life Support System (ECLSS) domain expert, sensors seen by that expert as useful in understanding episodes of anomalous behavior in actual historical data from ECLSS testbed operations.

The experiment asked the following specific question: How often did SELMON place a "critical" sensor in the top half of its sensor ordering based on the anomaly detection measures? The performance of a random sensor selection algorithm would be expected to be about 50%; any particular sensor would appear in the top half of the sensor ordering about half the time. Limit sensing detected the anomalies 76.3% of the time. SELMON detected the anomalies 95.1% of the time.

These results show SELMON performing considerably better than the traditional practice of limit sensing. They lend credibility to our premise that the most effective monitoring system is one which incorporates several models of anomalous behavior. Our aim is to offer a more complete, robust set of techniques for anomaly detection to make human operators more effective, or to provide the basis for an automated monitoring capability.

The following is a specific example of the value added of SELMON. During an episode in which the ECLSS pre-heater failed, system pressure (which normally oscillates within a known range) became stable. This "abnormally normal" behavior is not detected by traditional monitoring methods because the system pressure remains firmly in the nominal range where limit sensing fails to trigger. Furthermore, the fluctuating behavior of the sensor is not modeled; the predicted value is an averaged stable value which fails to trigger discrepancy detection. See [5, 6] for more details on these previous results in evaluating the SELMON approach.
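The normalization shared by the value change and deviation measures, which scores a new quantity against the history of such quantities on the same sensor, can be sketched as:

```python
def value_change_score(history, delta):
    """Sketch of the value change normalization: the score is the
    proportion of previously seen value changes on this sensor that
    were smaller than the given change. (The deviation measure feeds
    raw predicted-vs-actual discrepancies through the same process.)
    history: magnitudes of previous value changes; delta: the current one."""
    if not history:
        return 0.0
    return sum(1 for d in history if d < delta) / len(history)
```

The score is maximum (1.0) when the given change exceeds everything seen to date, and minimum when no previous change was smaller.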
3 Attention Focusing

A robust anomaly detection capability provides the core for monitoring, but only when this capability is combined with attention focusing does monitoring become both robust and efficient. Otherwise, the potential problems of information overload and too many false positives may defeat the utility of the monitoring system.
The attention focusing technique developed here uses two sources of information: historical data describing nominal system behavior, and causal information describing which pairs of sensors are constrained to be correlated, due to the presence of a dependency. The intuition is that the origin and extent of an anomaly can be determined if the misbehaving system parameters and the misbehaving causal dependencies can be determined.
3.1 Two Additional Measures

While SELMON runs, it computes incremental frequency distributions for all sensors being monitored. These frequency distributions can be saved as a method for capturing behavior from any episode of interest. Of particular interest are historical distributions which correspond to nominal system behavior.

To identify an anomalous sensor, we apply a distance measure, defined below, comparing the frequency distribution which represents recent behavior with the historical frequency distribution representing nominal behavior. We call the measure simply distance. To identify a "broken" causal dependency, we first apply the same distance measure to the historical frequency distributions for the cause sensor and the effect sensor. This reference distance is a weak representation of the correlation that exists between the values of the two sensors due to the causal dependency. This reference distance is then compared to the distance between the frequency distributions based on recent data of the same cause sensor and effect sensor. The difference between the reference distance and the recent distance is the measure of the "brokenness" of the causal dependency. We call this measure causal distance.
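The incremental frequency distributions that SELMON maintains can be sketched as a running histogram over fixed value subranges (the fixed bin edges and this interface are illustrative assumptions):

```python
import bisect

class IncrementalDistribution:
    """Running frequency distribution over n contiguous value subranges,
    as maintained for each monitored sensor."""

    def __init__(self, edges):
        self.edges = edges                  # boundaries of the subranges
        self.counts = [0] * (len(edges) - 1)
        self.total = 0

    def observe(self, value):
        """Incorporate one sensor reading into the distribution."""
        i = bisect.bisect_right(self.edges, value) - 1
        i = min(max(i, 0), len(self.counts) - 1)   # clamp out-of-range values
        self.counts[i] += 1
        self.total += 1

    def frequencies(self):
        """The vector d_i: each entry in [0, 1], summing to 1."""
        return [c / self.total for c in self.counts]
```

Snapshots of `frequencies()` taken over a nominal episode serve as the historical reference distributions described above.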
3.2 Desired Properties of the Distance Measure

Define a distribution $D$ as the vector $d_i$ such that

$$\forall i,\; 0 \le d_i \le 1 \quad \text{and} \quad \sum_{i=0}^{n-1} d_i = 1$$

For a sensor $S$, we assume that the range of values for the sensor has been partitioned into $n$ contiguous subranges which exhaust this range. We construct a frequency distribution as a vector $D_S$ of length $n$, where the value of $d_i$ is the frequency with which $S$ has displayed a value in the $i$th subrange.

If our aim were only to compare different frequency distributions of the same sensor, we could use a distance measure which required the number of partitions, or bins, in the two distributions to be equal, and the range of values covered by the distributions to be the same. However, since our aim is to be able to compare the frequency distributions of different sensors, these conditions must be relaxed.

We define two special types of frequency distribution. Let $F$ be the random, or flat, distribution where $\forall i,\; d_i = \frac{1}{n}$. Let $S_i$ be the set of "spike" distributions where $d_i = 1$ and $\forall j \ne i,\; d_j = 0$.

3.3 The Distance Measure

The distance measure is computed by projecting the two distributions into the two-dimensional space $[f, s]$ in polar coordinates and taking the Euclidean distance between the projections.

Define the "flatness" component $f(D)$ of a distribution as follows:

$$f(D) = \frac{1}{2} \sum_{i=0}^{n-1} \left| \frac{1}{n} - d_i \right|$$

This is simply the normalized sum of the bin-by-bin differences between the given distribution and $F$. Note that $0 \le f(D) < 1$. Also, $f(S_i) \to 1$ as $n \to \infty$.

Define the "spikeness" component $s(D)$ of a distribution as:

$$s(D) = \phi \sum_{i=0}^{n-1} \frac{i}{n-1}\, d_i$$

This is simply the centroid value calculation for the distribution. The weighting factor $\phi$ will be explained in a moment. Once again, $0 \le s(D) \le \phi$.

Now take $[f, s]$ to be polar coordinates $[r, \theta]$. This maps $F$ to the origin and the $S_i$ to points along an arc on the unit circle. See Figure 1.

Note that we take $\phi = \frac{\pi}{3}$. This choice of $\phi$ guarantees that $\Delta(S_0, S_{n-1}) = \Delta(F, S_0) = \Delta(F, S_{n-1}) = 1$, and all other distances in the region which is the range of $\Delta$ are by inspection $\le 1$.

Insensitivity to the number of bins in the two distributions and the range of values encoded in the distributions is provided by the $[f, s]$ projection function, which abstracts away from these properties of the distributions.

Additional details on desired properties of the distance measure and how they are satisfied by the function $\Delta$ may be found in [4].

3.4 Results
In this section, we report on the results of applying the distribution distance measure to the task of focusing attention in monitoring. The distribution distance measure is used to identify misbehaving nodes (distance) and arcs (causal distance) in the causal graph of the system being monitored, or equivalently, to detect and isolate the extent of anomalies in the system being monitored.
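A minimal sketch of the distance and causal distance measures of Sections 3.1 and 3.3, assuming a flatness component $f(D) = \frac{1}{2}\sum_i |1/n - d_i|$, a centroid spikeness scaled by $\phi = \pi/3$, and the Euclidean distance between the polar projections (this reading of the formulas is our assumption):

```python
import math

PHI = math.pi / 3   # angular span; puts the extreme spikes unit distance apart as n grows

def flatness(d):
    """f(D): normalized bin-by-bin difference from the flat distribution F."""
    n = len(d)
    return 0.5 * sum(abs(1.0 / n - di) for di in d)

def spikeness(d):
    """s(D): centroid of the distribution, scaled into [0, PHI]."""
    n = len(d)
    return PHI * sum(i * di for i, di in enumerate(d)) / (n - 1) if n > 1 else 0.0

def distance(d1, d2):
    """Delta(D1, D2): Euclidean distance between the [f, s] polar projections."""
    def to_xy(d):
        r, theta = flatness(d), spikeness(d)
        return r * math.cos(theta), r * math.sin(theta)
    (x1, y1), (x2, y2) = to_xy(d1), to_xy(d2)
    return math.hypot(x1 - x2, y1 - y2)

def causal_distance(cause_nominal, effect_nominal, cause_recent, effect_recent):
    """'Brokenness' of a causal dependency: how far the recent cause-effect
    distance has drifted from the nominal reference distance."""
    return abs(distance(cause_recent, effect_recent)
               - distance(cause_nominal, effect_nominal))
```

For finite $n$ a spike sits at radius $(n-1)/n$ rather than exactly 1, approaching the unit circle as $n$ grows.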
[Figure 1: The function Δ(D1, D2).]

[Figure 2: Causal Graph for the Forward Reactive Control System (FRCS) of the Space Shuttle.]

[Figure 3: A leak fault.]
3.4.1 A Space Shuttle Propulsion Subsystem

Figure 2 shows a causal graph for a portion of the Forward Reactive Control System (FRCS) of the Space Shuttle. A full causal graph for the Reactive Control System, comprising the Forward, Left and Right RCS, was developed with the domain expert.
3.4.2 Attention Focusing Examples

SELMON was run on seven episodes describing nominal behavior of the FRCS. The frequency distributions collected during these runs were merged. Reference distances were computed for sensors participating in causal dependencies.

SELMON was then run on 13 different fault episodes, representing faults such as leaks, sensor failures and regulator failures. Two of these episodes will be examined here; results were similar for all episodes. In each fault episode, and for each sensor, the distribution distance measure was applied to the incremental frequency distribution collected during the episode and the historical frequency distribution from the merged nominal episodes. These distances were a measure of the "brokenness" of nodes in the causal graph; i.e., instantiations of the distance measure.

New distances were computed between the distributions corresponding to sensors participating in causal dependencies. The differences between the new distances and the reference distances for the dependencies were a measure of the "brokenness" of arcs in the causal graph; i.e., instantiations of the causal distance measure.
The first episode involves a leak affecting the first and second manifolds (jets) on the oxidizer side of the FRCS. The pressures at these two manifolds drop to vapor pressure. The dependency between these pressures and the pressure in the propellant tank is severed because the valve between the propellant tank and the manifolds is closed. Thus there are two anomalous system parameters (the manifold pressures) and two anomalous mechanisms (the agreement between the propellant and manifold pressures when the valve is open).

[Figure 4: A pressure regulator fault.]
The distance and causal distance measures computed for nodes and arcs in the FRCS causal graph reflect this faulty behavior. See Figure 3. (To visualize how the distribution distance measure circumscribes the extent of anomalies, the coloring of nodes and the width of arcs in the figure are correlated with the magnitudes of the associated distance and causal distance scores.) An explanation for the apparent helium tank temperature anomaly is not available. However, we note that this behavior was present in the data for all six leak episodes.
The second episode involves an overpressurization of the propellant tank due to a regulator failure. Onboard software automatically attempts to close the valves which isolate the helium tank from the propellant tank. One of the valves sticks and remains open.

The distance and causal distance measures isolate both the misbehaving system parameters (propellant pressure and valve status indicators) and the altered relationships between the helium and propellant tank pressures and between the propellant tank pressure and the valve status indicators. Overpressurization of the propellant tank also alters the usual relation between propellant tank pressure and manifold pressures. See Figure 4.
4 Discussion

The distance and causal distance measures based on the distribution distance measure combine two concepts: 1) empirical data alone can be an effective model of behavior, and 2) the existence of a causal dependency between two parameters implies that their values are somehow correlated. The causal distance measure constructs a model of the correlation between two causally related parameters, capturing the general notion of constraint in an admittedly abstract manner. Nonetheless, these models of constraint arising from causality provide surprising discriminatory power for determining which causal dependencies (and corresponding system mechanisms) are misbehaving. (In the distance measure for detecting misbehaving system parameters, we are simply using the degenerate constraint of expected equality between historical and recent behavior.)
Several issues need to be examined to continue the evaluation of the attention focusing technique based on the distribution distance measure and its utility in monitoring.

We need to understand the sensitivity of the technique to how sensor value ranges are partitioned. Clearly the discriminatory power of the distribution distance measure is related to the resolution provided by the number of bins and the bin boundaries. The results reported here are encouraging, for the number of FRCS sensor bins was in many cases as low as three and in no case more than eight.
We need to understand the suitability of the technique for systems which have many modes or configurations. We would expect that the discriminatory power of the technique would be compromised if the distributions describing behaviors from different modes were merged. Thus the technique requires that historical data representing nominal behavior be separable for each mode. If there are many modes, at the very least there is a data management task. A capability for tracking mode transitions is also required. An unsupervised learning system which can enumerate system modes from historical data and enable automated classification would solve this problem nicely.
Not all distinct distributions are mapped to distinct points in the projection space [f, s] by the distribution distance measure. We need to understand this limitation; in particular, we need to understand whether or not distributions we wish to distinguish are in fact being distinguished. The judicious introduction of additional components (e.g., the number of local maxima in a frequency distribution) to the distribution projection space [f, s] may be required to enhance discriminability.
It should be possible to describe the temporal (along with the causal/spatial) extent of anomalies by incrementally comparing recent sensor frequency distributions calculated from a "moving window" of constant length with static reference frequency distributions.
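The moving-window comparison suggested here can be sketched as follows; the paper only proposes the idea, so the window mechanics are an illustrative assumption:

```python
import bisect
from collections import deque

def windowed_frequencies(stream, window, edges):
    """Sketch of the moving-window idea: maintain a frequency distribution
    over only the last `window` samples, ready to be compared against a
    static reference distribution at each step.
    edges: boundaries of the contiguous value subranges."""
    recent = deque(maxlen=window)       # bin indices of samples in the window
    counts = [0] * (len(edges) - 1)
    for value in stream:
        if len(recent) == window:       # the oldest sample is about to fall out
            counts[recent[0]] -= 1
        i = min(max(bisect.bisect_right(edges, value) - 1, 0), len(counts) - 1)
        recent.append(i)
        counts[i] += 1
        yield [c / len(recent) for c in counts]
```

Applying the distribution distance measure between each yielded window distribution and the static nominal reference would trace how an anomaly develops in time.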
5 Towards Applications in Medical Monitoring

The approach described in this paper has usability advantages over other forms of model-based reasoning. The overhead involved in constructing the causal and behavioral model of the system is minimal. The behavioral model is derived directly from actual data; no offline modeling is required. The causal model is of the simplest form, describing only the existence of dependencies. For the Shuttle RCS, a 198-node causal graph was constructed in a single one-and-one-half-hour session between the author and the domain expert.
This low modeling overhead holds out the promise that the SELMON approach may be suitable for applications outside the NASA mission operations arena for which it was designed, including perhaps medical monitoring.
The questions which need to be answered are: Do ample amounts of empirical data exist to describe "nominal" behavior for the measurable parameters used in medical monitoring? Can dependency-only causal graphs be easily built to describe where correlations should exist in such data? Pearl's work [10] may provide a way to construct such causal graphs if they are not easily generated by hand.

Our approach is most easily applied when the number of modes, configurations or states which need to be understood is small, for it is necessary to collect data representing nominal behavior for each mode. While I recognize my lack of expertise in this area, it would seem that the number of "nominal" modes found in medical monitoring contexts is in general smaller than that associated with designed physical artifacts.
Of particular concern in medical monitoring is the false alarm rate. The attention focusing approach described in this paper is meant to address this problem directly by identifying anomaly signatures which span causally related parameters, rather than stressing the possibly redundant information of anomalous behavior occurring at several individual sensors. This work, however, does not provide any silver bullet for determining how to set thresholds to define significant departures from nominal behavior, whatever model of nominal behavior is being employed.
6 Summary

We have described the properties and performance of a distance measure used to identify misbehavior at sensor locations and across mechanisms in a system being monitored. The technique enables the locus of an anomaly to be determined. This attention focusing capability is combined with a previously reported anomaly detection capability in a robust, efficient and informative monitoring system, which is being applied in mission operations at NASA.
7 Acknowledgements

The members of the SELMON team are Len Charest, Harry Porta, Nicolas Rouquette and Jay Wyatt. Other recent members of the SELMON project include Dan Clancy, Steve Chien and Usama Fayyad. Harry Porta provided valuable mathematical and counterexample insights during the development of the distance measure. Matt Barry, Dennis DeCoste and Charles Robertson also provided valuable discussion. Matt Barry served invaluably as domain expert for the Shuttle FRCS.

The research described in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
References

[1] K. H. Abbott, "Robust Operative Diagnosis as Problem Solving in a Hypothesis Space," 7th National Conference on Artificial Intelligence, St. Paul, Minnesota, August 1988.

[2] L. Charest, Jr., N. Rouquette, R. Doyle, C. Robertson, and J. Wyatt, "MESA: An Interactive Modeling and Simulation Environment for Intelligent Systems Automation," Hawaii International Conference on System Sciences, Maui, Hawaii, January 1994.

[3] D. DeCoste, "Dynamic Across-Time Measurement Interpretation," 8th National Conference on Artificial Intelligence, Boston, Massachusetts, August 1990.

[4] R. J. Doyle, "A Distance Measure for Attention Focusing and Anomaly Detection in Systems Monitoring," submitted to 12th National Conference on Artificial Intelligence, Seattle, Washington, July 1994.

[5] R. J. Doyle, L. Charest, Jr., N. Rouquette, J. Wyatt, and C. Robertson, "Causal Modeling and Event-driven Simulation for Monitoring of Continuous Systems," Computers in Aerospace 9, American Institute of Aeronautics and Astronautics, San Diego, California, October 1993.

[6] R. J. Doyle, S. A. Chien, U. M. Fayyad, and E. J. Wyatt, "Focused Real-time Systems Monitoring Based on Multiple Anomaly Models," 7th International Workshop on Qualitative Reasoning, Eastsound, Washington, May 1993.

[7] D. L. Dvorak and B. J. Kuipers, "Model-Based Monitoring of Dynamic Systems," 11th International Joint Conference on Artificial Intelligence, Detroit, Michigan, August 1989.

[8] T. Hill, W. Morris, and C. Robertson, "Implementing a Real-time Reasoning System for Robust Diagnosis," 6th Annual Workshop on Space Operations Applications and Research, Houston, Texas, August 1992.

[9] J. Muratore, T. Heindel, T. Murphy, A. Rasmussen, and R. McFarland, "Space Shuttle Telemetry Monitoring by Expert Systems in Mission Control," in Innovative Applications of Artificial Intelligence, H. Schorr and A. Rappaport (eds.), AAAI Press, 1989.

[10] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, San Mateo, 1988.

[11] N. F. Rouquette, S. A. Chien, and L. K. Charest, Jr., "Event-driven Simulation in SELMON: An Overview of EDSE," JPL Publication 92-23, August 1992.

[12] E. A. Scarl, J. R. Jamieson, and E. New, "Deriving Fault Location and Control from a Functional Model," 3rd IEEE Symposium on Intelligent Control, Arlington, Virginia, 1988.