From: AAAI Technical Report SS-94-04. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved. Attention Focusing and Anomaly Detection in Systems Monitoring RichardJ. Doyle Artificial IntelligenceGroup Jet PropulsionLaboratory CaliforniaInstitute of Technology Pasadena, CA 91109-8099 rdoyle~alg.jpl.nasa.gov operators employ multiple methods for recognizing anomalies, and 2) mission operators do not and should not interpret all sensor data all of the time. Weseek an approachfor determining from momentto momentwhich of the available sensor d~tA is most informative about the presence of anomalies occurring within a system. The work reported here e~tends the anomaly detection capability in Doyle’s SELMON monitoring system[5, 6] by adding an attention focusing capability. Other model-based monitoring systems include Dvorak’s MIMIC,which performs robust discrepancy detection for continuous dynamic systems [7], and DeCoste’s DATMI,which infers system states from incomplete sensor data [3]. This work also complements other work within NASA on empirical and model-basedmethodsfor fault diagnosis of aerospace platforms[1, 8, 9, 11]. Anyattempt to introduce automationinto the moniwring of complexphysical systems must start from a robust anomalydetection capability. This task is far from straightforward, for a single definition of what constitutes an anomalyis difficult to come by. In addition, to make the monitoring process efficient, and to avoid the potential for information overload on humanoperators, attention focusing must also be addressed. Whenan anomaly oc~’t~, moreoften than not several sensors are affected, and the partially redundantinformationthey provide can be confusing,particularly in a crisis situation where a response is needed quickly. The focus of this paper is a newtechnique for attention focusing. The technique involves reasoning about the distance between two frequency distributions, and is used to detect both anomaloussystem parameters and "broken" causal dependencies. Thesetwo forms of information together isolate the locus of anomalous behavior in the system being monitored. 2 Background: The SELMON Approach 1 Introduction Mission Operations personnel at NASA have the task of determining, from momentto moment,whether a space platform is exhibiting behavior which is in any way anomalous, whichcould disrupt the operation of the platform, and in the worst case, could represent a loss of ability to achieve mission goals. Atraditional techniquefor assisting mission operators in space platform health analysis is the establishmentof alarm thresholds for sensors, typically indexed by operating mode, which summarizewhich ranges of sensor values imply the existence of anomalies. Another established technique for anomalydetection is the comparisonof predicted values from a simulation to actual values received in telemetry. However, experienced mission operators reason about more than alarm threshold crossings and discrepancies between predicted and actual to detect anomalies: they mayask whethera sensor is behavingdifferently than it has in the past, or whethera current behaviormaylead to----the particular baneofoperators---a rapidly developing alarm seq~. Our approachto introducing automationinto real-time systems monitoring is based on two observations: 1) mission 39 Howdoes a humanoperator or a machine observing a complex physical system decide when something is going wrong7 Abnormalbehavior is always defined as somekind of departure from normalbehavior. Unfortunately, there appears to be no single, crisp definition of "normal"behavior. In the traditional monitoringtechnique of limit sensing, normalbehavior is predefined by nominalvalue ranges for sensors. A fundamental limitation of this approachis the lack of sensitivity to context. In the other traditional monitoring technique of discrepancy detection, normal behavior is obtained by simulating a modelof the systembeing monitored. This approach, while avoiding the insensitivity to Context of the limit sensing approach, has its ownlimitations. The approach is only as good as the system model. In addition, normal system behavior typically changes with time, and the model must continue to evolve. Giventhese limitations, it can be difficult to distinguish genuine anomaliesfrom errors in the model. Noting the limitations of the existing monitoring techniques, we have developed an approach to monitoring which is designed to makethe anomalydetection process more robust, to reduce the number of undetected anomalies (false negatives). Towardsthis end, we introduce multiple anomaly models, each employinga different notion of "normal" behavior. 2.1 Empirical Anomaly Detection Methods In this section, we briefly describe the empirical methods that we use to determine, from a localviewpoint, when a sensor is reporting anomalous behavior. Thesemeasuresuse anomalies.This rawdiscrepancyis entered into a normalizaknowledgeabout each individual sensor, without knowledge tion processidentical to that usedfor the valuechangescore, ofanyrelations among sensors. andit is this representationof relative discrepancywhichis reported.Thedeviationscorefor a sensoris minimum if there Surprise is no discrepancy and maximum if the discrepancy between Anappealing waytoassess whether current behavior is predictedandactualis the greatestseento date onthat sensor. anomalous ornotisviacomparison topastbehavior. This Deviationonlyrequiresthat a simulationbe availablein any is the essence of the surprise measure.It is designedto formforgenerating sensor value predictions. However,the highlighta sensorwhichbehavesother thanit has historically. remainingsensitivity and cascadingalarmsmeasuresrequire Specifically,surpriseusesthe historicalfrequency distribution the ability to simulateandreasonwith a causal modelof the for the sensor in two ways: Todeterminethe likelihood of systembeing monitored. the given current value of the sensor (unusualness),and Sensitivity and CascadingAlarms examinethe relative likelihoods of different values of the sensor (informativeness).It is those sensors whichdisplay Sensitivity measuresthe potential for a large global perunlikely values whenother values of the sensor are more turbation to developfrom current state. Cascadingalarms likely whichget a high surprisescore. Surpriseis not high measuresthe potential for an alarmsequenceto developfrom if the onlyreasona sensor’svalueis unlikelyis that thereare current state. Bothof these anomalymeasuresuse an eventmanypossiblevaluesfor the sensor,all equallyunlikely. drivencausal simulator[2, 10] to generatepredictionsabout future states of the system,given current state. C~t state Alarm is taken to be definedby both the current values of system Alarmthresholds for sensors, indexedby operating mode, parameters(not all of whichmaybe sensed) andthe pending typicallyare establishedthroughan off-line analysisof system events already resident on the simulator agenda.Themeadesign. Thenotion of alarmin SELMON extends the usual one sures assign scores to individualsensorsaccordingto howthe correspondingto a sensor participates in, bit of information (the sensoris in alarmor it is not), andalso systempmmneter reports howmuchof the alarmrangehas beentraversed. Thus or influences, the predicted global behavior. Asensor will a sensor whichhas gonedeepinto alarmgets a higher score haveits highestsensitivity scorewhenbehaviororiginatingat than onewhichhas just crossedover the alarmthreshold. that sensorcausesall sensorscausally downstream to exhibit their maximum value changeto date. Asensor will haveits AlarmAnticipation highest cascadingalarmsscore whenbehaviororiginating at that sensorcausesall sensorscausallydownstream to go into The alarm anticipation measurein SELMON performs a simpleformof trend analysisto decidewhetheror not a sensor an alarmstate. is expectedto be in alarmin the future. Astraightforward curvefit is usedto project whenthe sensorwill next cross an 2.3 PreviousResults alarmthreshold, in either direction. A high score meansthe In order to assess whetherSELMON increased the robustness sensorwill soonenter almmor will remainthere. Alowscore of the anomalydetection process, weperformedthe following meansthe sensor will remainin the nominalrange or emerge experiment: WecomparedSELMON performanceto the perfrom alarmsoon, formance of the traditional limit sensingtechniquein selecting critical sensorsubsetsspecified by a SpaceStation EnvironValue Change mental Control and Life Support System (ECLSS)domain Achangein the value of a sensor maybe indicative of an expert, sensorsseen by that expert as usefulin understanding anomaly.In order to better assess such an event, the value episodesof anomalous behaviorin actual historical data from changemeasurein SELMON comparesa given value change ECLSS testhed operations. to historical value changesseen on that sensor. Thescore Theexperimentaskedthe followingspecific question: How reportedis basedon the proportionof previousvalue changes often did SELMON place a "critical" sensorin the top half of whichwereless than the given value change.It is maximum its sensor orderingbasedon the anomalydetection measures7 whenthe givenvaluechangeis the greatest valuechangeseen Theperformanceof a randomsensor selection algorithm to date on that sensor. It is minimum whenno value change wouldbe expected to be about 50%;any particular sensor has occurredin that sensor. wouldappearin the top half of the sensororderingabouthalf the time. Limit sensing detected the anomalies76.3%of the 2.2 Model-BasedAnomalyDetection Methods time. SELMON detected the anomalies95.1%of the time. Although manyanomalies can be detected by applying Theseresults showSELMON performingconsiderablybetanomaly modelsto the behaviorreportedat individualsensors, ter than the traditional practice of limit sensing.Theylend robust monitoringalso requires reasoningabout interactions credibility to our premisethat the mosteffective monitoring occurring in a systemand detecting anomaliesin behavior systemis one whichincorporates several modelsof anomareportedby several sensors. lous behavior. Ouraim is to offer a morecomplete,robust set of techniquesfor anomalydetection to makehumanoperDeviation ators moreeffective, or to providethe basis for an automated Thedeviationmeasureis our extensionof the traditional monitoringcapability. methodof discrepancydetection. As in discrepancydetecThefollowingis a specific exampleof the value addedof tion, comparisons are madebetweenpredictedandactual senSELMON. During an episode in whichthe ECLSS pre-heater sor values,anddifferencesare interpretedto be indicationsof failed, systempressure (whichnormallyoscillates within 4O knownrange) becamestable. This "abnormallynormal"behavior is not detected by traditional monitoringmethodsbecausethe systempressureremainsfirmly in the nominalrange wherelimit sensingfails to trigger. Furthermore, the fluctuating behaviorof the sensoris not modeled;the predictedvalue is an averagedstable valuewhichfails to trigger discrepancy detection.See[5, 6] for moredetails onthese previousresults in evaln~fingthe SELMON approach. 3 Attention Focusing A robust anomalydetection capability providesthe core for monitoring,but only whenthis capability is combinedwith attention focusing does monitoringbecomeboth robust and efficient. Otherwise,the potential problemsof information overloadandtoo manyfalse positives maydefeat the utility of the monitoringsystem. Theattention focusingtechniquedevelopedhere uses two sourcesof information:historical data describing nominal system behavior, and causal informationdescribing which pairs of sensors are constrainedto be correlated, dueto the presenceof a dependency. Theintuition is that the origin and extent of an anomalycan be determinedif the misbehaving systemparametersand the misbehavingcausal dependencies can be determined. 3.1 TwoAdditional Measures WhileSELMON runs, it computesincrementalfrequencydistributions for all sensors beingmonitored.Thesefrequency distributions can be savedas a methodfor capturingbehavior fromany episodeof interest. Of particular interest are historical distributions whichcorrespondto nominalsystem behavior. Toidentify aa anomalous sensor, weapply a distance measure, definedbelow,to the frequencydistribution whichrepresents_recentbehaviorto the historicalfrequency distribution representing nominalbehavior. Wecall the measuresimply distance. To identify a "broken"causal dependency, wefirst apply the samedistance measureto the historical frequency distributionsfor the causesensorandthe effect sensor. This referencedistanceis a weakrepresentationof the correlation that exists betweenthe values of the twosensors dueto the causal dependency.This referencedistance is then compared to the distancebetweenthe frequencydistributions basedon _recentdata of the samecausesensorandeffect sensor.Thedifferencebetween the referencedistanceandthe recent distance is the measureof the "brokenness"of the causal dependency. Wecall this measurecausaldistance. 3.2 Desired Properties of the Distance Measure Definea distributionDas the vector di suchthat and If our aimwasonly to comparedifferent frequencydistributions of the samesensor, wecould use a distance measure whichrequired the numberof partitions, or bins in the two distributions to be equal, andthe rangeof valuescoveredby the distributions m be the same. However,since our aim is to be able to compare the frequencydistributionsof different sensors, these conditionsmustbe relaxed. Wedefine twospecial types of frequencydistribution. Let F be the random,or fiat distribution whereVi, di -- 1~. Let Si be the set of"spike"distributions whered~ -- 1 andYj i, dj =O. 3.3 The Distance Measure Thedistance measureis computed by projecting the two distributions into the two-dimensional space[f, s] in polar coordinates andtaking the euclidiandistance betweenthe projections. Definethe "flatness" component f(D) of a distribution as follows: n-I 1 This is simplythe sumof the bin-by-bindifferences betweenthe givendistribution andF. Notethat 0 _<f (D) _< Also,f(S~)-, 1 as n ~ oo. Definethe "spikeness"component s(D) of a distribution asn-I i d’ i=0 Thisis simplythe centroidvaluecalculationfor the distribution.Theweightingfactor ~b will be explainedin a moment. Onceagain, 0 < s(D) < 1. Now take[f, s] to bepolarcoordinates [r, 0]. Thismaps F to the origin and the Si to points alongan arc on the unit circle. SeeFigure1. Note that we take ~b = ~. This choice of ~ guarantees that A(SO,Sn-l) = A(F, So) = A(F, S.-1) = 1 and other distances in the region whichis the range of Aare by inspection_<1. Insensitivity to the number of bins in the twodistributions andthe rangeof valuesencoded in the distributionsis provided by the If, s] projection function, whichabstracts awayfrom these propertiesof the distributions. Additionaldetails ondesiredpropertiesof the distancemeasure andhowthey are satisfied by the functionAmaybe found in [4]. 3A Results In this section, wereport on the results of applyingthe disi=0 tribution distance measureto the task of focusingattention For a sensorS, weassumethat the rangeof valuesfor the in monitoring.Thedistribution distance measureis used m sensorhas beenpartitioned into n contiguoussubrangeswhich identify misbehaving nodes(distance) andarcs (causal disexhaustthis range. Weconstructa frequencydistribution as a tance) in the causal graphof the systembeingmonitored,or vector Dsof length n, wherethe valueof di is the frequency equivalently,detect andisolate the extentof anomalies in the with whichS has displayeda value in the ith subrange. systembeing monitored. n-I 41 Figure1: Thefunction A(D1,/:)2). Figure3: ’Aleak fault. Figure 2: Causal Graphfor the ForwardReactive Control System(FRCS)of the SpaceShuttle. 3.4.1 A Space Shuttle Propulsion Subsystem Figure2 showsa ca.~! graphfor a portion of the Forward ReactiveConlrolSystem(FRCS)of the SpaceShuttle. Afull causal graphfor the ReactiveControlSystem,comprisingthe Forward,Left and Right RCS,wasdevelopedwith the domain expert. 3.4.2 Attention FocusingExamples SIK.MON was run on seven episodes describing nominal behaviorof the FRCS.Thefrequencydistributions collected during these runs were merged.Referencedistances were computed for sensors participating in causal dependencies. Sm~ON wasthen run on 13 different fault episodes, representing faults suchas leaks, sensorfailures andregulator failures. Twoof these episodeswill be examinedhere; results weresimilar for all episodes. In each fault episode, and for each sensor, the distribution distance measurewas applied to the incrementalfrequencydistribution collected duringthe episodeandthe historical frequencydistribution from the mergednominalepisodes. Thesedistances were a measure of the "brokenness" of nodesin the causalgraph;i.e., instantiationsof the distancemeasure. Newdistances were computedbetweenthe distributions corresponding to sensorsparticipatingin causal dependencies. Thedifferencesbetweenthe newdistances and the reference distances for the dependencieswerea measureof the "brokenness"of arcs in the causalgraph;i.e., instantiationsof the causaldistance measure. Figure4: Apressureregulatorfault. the propellant tank is severedbecausethe valvebetweenthe propellant tank and the manifoldsis closed. Thusthere are two anomaloussystem paramOexs(the manifoldpressures) and two anomalousmechanisms(the agreementbetweenthe propellant and manifoldpressureswhenthe valve is open). The distance and causal distance measurescomputedfor nodesand arcs in the FRCS causal graphreflect this faulty behavior. See Figure 3. (To visualize howthe distribution distance measurecircumscribesthe extent of anomalies,the coloring of nodes and the width of axes in the figure are correlatedwith the magnitudes of the associateddistanceand causal distance scores). An explanation for the apparent heliumtank temperatureanomalyis not available. However, wenote that this behaviorwaspwaentin the data for all six lea, episodes. The secondepisode involves an overpressufiz~onof the propellant tank dueto a regulator failure. Onboardsoftware automaticallyattemptsto close the valves whichisolate the heliumtank fromthe propellanttank. Oneof the valvessticks and remainsopen. The distance and causal distance measuresisolate both the misbehavingsystemparameters(propellant pressure and valveStatusindicators) andthe altered relationshipsbetween the heliumand propellanttank pressuresandbetweenthe propellant tank pressure andthe valve status indicators. Overpre~urization of the propellanttankalso alters the usualrelation betweenpropellanttank pressure andmanifoldpressures. SeeFigure4. Thefirst episodeinvolvesa leak affecting the first and 4 Discussion secondmanifolds(jets) on the oxidizer side of the FRCS. Thepressuresat these twomanifoldsdropto vaporpressure. Thedistance andcausal distance measuresbasedon the disThedependency betweenthese pressures and the pressure in tribution distance measurecombinetwo concepts: 1) empir- 42 ical data alonecan be an effective modelof behavior,and2) volvedin constructingthe causal andbehavioralmodelof the the existence of a causal dependencybetweentwo parame- systemis minimal.Thebehavioralmodelis derived directly ters implies that their values are somehow correlated. The fromactual data; no ofiline modelingis required. Thecausal causaldistance measureconstructs a modelof the correlamodelis of the simplestform,describingonly the existenceof tion betweentwocausally related parameters,capturingthe dependencies.For the Shuttle RCS,a 198-nodecausal graph generalnotionof constraintin an admittedlyabstract manner. wasconstructedin a single one and one half hour session Nonetheless,these modelsof constraint arising fromcausal- betweenthe author andthe domainexpert. ity providesurprising discriminatorypowerfor determining SELMON is being applied at the NASA JohnsonSpaceCenwhichcausal dependencies(and correspondingsystemmech- ter as a monitoringtool for SpaceShuttle Operationsand anisms)are misbehaving. (In the distancemeasurefor detect- Space Station ~ons. Current application efforts include ing misbehavingsystemparameters,weare simplyusing the the onefor the Propulsion(PROP) flight control discipline degenerateconstraint of expectedequality betweenhistorical reported on here, and one for the Thermal(EECOM) flight andrecent behavior.) control discipline. EECOM wishesin particular to be able Severalissues needto be examined to continuethe evalua- to knowand reason about howactual orbiter thermal pertion of the attention focusingtechniquebasedon the distribu- formancediffers frompredictions generatedby an available tion distancemeasure andits utility in monitoring. mathematicalmodelof orbiter thermal performance.AnopWeneedto understandthe sensitivity of the techniqueto erational SELMON prototype, available starting with the rehowsensorvalue rangesare partitioned. Clearlythe discrim- cent HubbleRepairmissionis availablefor evaluationby all inatory powerof the distribution distancemeasureis related flight controldisciplines,onlyrequiringthat a list of sensors to the re~lution providedby the numberof bins and the bin .... owned" by that disciplinebe provided. boundaries.Theresults reportedhere are encouraging for the At the Jet PropulsionLaboratory,weare looking at the numberof FRCSsensor bins were in manycases as low as problemof onboarddownlinkdeterminationfor the Pluto Fast three andin no casesmorethan eight. Flybyproject, nowin its early planningphase.Thespacecraft Weneedto understandthe suitability of the techniquefor will havelimited communications bandwidthandit will not be systemswhichhavemanymodesor configurations. Wewould possible to transmit all onbonrd-collectedsensor d~t~ Only expectthat the discriminatorypowerof the techniquewould eight hours of coveragefromthe DeepSpaceNetworkwill be be compromised if the distributions describingbehaviorsfrom available per week.Thechallengeis to devise a methodfor different modeswere merged.Thusthe technique requires constructinga suitable summary of a week’sworthof sensor _thathistorical~d~t~_ representingnominal behavioris separable dAtAguaranteedto report on any anomalieswhichoccurred. for eachmode.¯If thereare imanymodes,at the very least there Theanomalydetection andattention focusingcapabilities of is a ,tqtA nuut~ement task. A capability for tracking mode SELMON maybe well-matchedto this task. transitions is also required. Anunsupervisedlearning system whichcan enumeratesystemmodesfromhistorical d~t~ 6 Summary andenableautomatedclassification wouldsolve this problem Wehavedescribedthe properties and performanceof a disnicely. Notall distinct distributionsare mapped to distinct pointsin tance measureused to identify misbehaviorat sensor locain a systembeing monitored. the projectionspace[f, s] bythe distributiondistancemeasure. tions and across mechanisms Thetechniqueenables the locus of an anomalyto be deterWeneedto understandthis limitation; in particular weneedto with a understandwhetheror not distributionswewishto distinguish mined.This attention focusingcapability is combined are in fact beingdistinguished.Thejudicial introductionof previouslyreportedanomaly detection capability in a robust, additional components (e.g., the numberof local maximain efficient and informativemonitoringsystem,whichis being a frequencydistribution) to the distributionprojectionspace applied in missionoperations at NASA¯ [f, s] maybe requiredto enhancediscriminability. Thediscriminatorypowerof the causal distance measure 7 Acknowledgements mightbe enhanced byretaining the flatness/spikenessdistincteam are Len Charest, Harry tion. For manylinear functions,different input distributions The membersof the S]~LMON Porta, Nicolas Rouquette and Jay Wyatt. Other recent memmaymapto value-shiftedbut similarly shapedoutputdistribers of the SI~ON project include DanClancy,Steve Chien butions. In other words, the spikeness component mayvary and Usama Fayyad. Harry Porta provided valuable mathewhilethe flatness component maybe relatively invariant. It matical and counterexample insights during the development maybe possible to distinguish the case wheremisbehavior of the distance measure. Matt Barry, Dennis DeCosteand is the result of bogusvaluesbeing propagatedthroughstill Charles Robertson also provided valuable discussion. Matt correctly functioningmechanisms. Barry served invaluably as domain expert for the Shuttle It shouldbe possible to describethe temporal(alongwith the causal/spatial) extent of anomaliesby incrementallycom- FRCS. Theresearchdescribedin this paperwascarried out by the paring recent sensor frequencydistributions calculated from Jet PropulsionLaboratory,CaliforniaInstitute of Technology, a "movingwindow" of constant length with static reference under a contract with the National Aeronauticsand Space frequencydistributions. Administration. 5 Towards Applications Theapproachdescribedin this paperhas usability advantages over other formsof model-based reasoning. Theoverheadin- 43 References [1] K.H. Abbott, "Robust ~ve Diagnosis as Problem Solving in a HypothesisSpace7 7th National Conferenceon Artificial Intelligence,St. Paul, Minnesota, August 1988. [2] L. Charest, Jr., N. Rouquette,R. Doyle,C. Robertson, andJ. Wyatt,"MESA: AnInteractive Modelingand Simulation Environmentfor Intelligent SystemsAutomation," HawaiiInternational Conferenceon SystemSciences, Maul,Hawaii,January1994. [3] D. DeCoste, "DynamicAcross-Time Meas~cnt Interp~afion;’ 8th NationalConference on Artificial Intelligence, Boston,Massachusetts,August1990. [4] R. J. Doyle,"A DistanceMeasurefor Attention Focusing and AnomalyDetection in SystemsMonitoring;’ submittedto 12th NationalConference on Artificial In, telligence, Seattle, Washington,July1994. [5] R.J. Doyle,L. Charest,Jr., N. Rouquette,J. Wyatt,and C. Robertson,"CausalModelingand Event-drivenSimulation for Monitoringof ContinuousSystems,"Computers in Aerospace 9, American Institute of Aeronautics and Aslzonautics,SanDiego,California, October1993. [6] R.J. Doyle, S. A. Chien, U. M. Fayyad,and E. J. Wyatt, ’T-,c~usedReal-timeSystemsMonitoringBasedon Multiple Anomaly Models,"7th International Workshop on QualitativeReasontng,Eastsound,Washington,May 1993. D. Moni[7] L. Dvorakand B. J. Kuipers, "Model-Based toting of Dynamic Systems"11 th InternationalConferenceon Artificial Intelligence, Detroit, Michigan,August 1989. [8] T. Hill, W.Morris, and C. Robextson,"Implementing a Real-time ReasoningSystemfor RobustDiagnosis" 6th AnnualWorkshop on SpaceOperationsApplications and Research,Houston,Texas, August1992. and [9] J. Muratore,T. Heindel,T. Murphy,A. Rasmussen, R. McFarland,"SpaceShuttle TelemetryMonitoringby Expert Systemsin MissionControl:’ in InnovativeApplications of Artificial Intelligence, H. Schorrand A. Rappaport(eds.), AAAI Press, 1989. [10] N. F. Rouquette,S. A. Chien,and L. K. Charest, Jr., "Event-driven Simulation in SELMON: An Overviewof 7 JPLPublication 92-23, August1992. EDSE [11] E. A. Scarl, J. R. Jamieson,andE. New,"DerivingFault Location and Control from a Functional Model,"3rd IEEESymposium on Intelligent Control,Arlington, Virginia, 1988. 44