Complex Event Pattern Recognition for Long-Term System Monitoring

Will Fitzgerald∗ and R. James Firby, I/NET, Inc.
will.fitzgerald@nrc-cnrc.gc.ca, firby@inetmi.com

Aaron Phillips, Wheaton College
Aaron.B.Phillips@wheaton.edu

Jordan Kairys, Kalamazoo College
k00jk01@kzoo.edu

∗ Now at the National Research Council Canada.
Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

One useful paradigm for monitoring autonomous systems over long periods of time is to watch for and respond to complex patterns of events. The Complex Event Recognition Architecture, or CERA, is an architecture for doing just this: it defines several classes of (hierarchical) event patterns and recognizers for those patterns. CERA software has been used successfully in a prototype monitoring system for a complex water recovery system at NASA's Johnson Space Center. We describe how CERA was used in this project and announce the availability of open-source versions of CERA for use in autonomous system monitoring.

Introduction

One of the difficulties of building human-machine interfaces (HMIs) for autonomous systems that operate over long periods of time is the current standard pattern for HMI design, the "model-view-controller" pattern (Krasner & Pope 1988). It biases HMIs toward being very thin: the view is mediated by a controller that more or less directly reflects the underlying model, typically the current state of the system. The problem, of course, is that the "current state of the system" may not be all that is needed to monitor or control an autonomous system over long periods. We may also need the history of the system, and predictions about its future, for understanding and control. The standard alternative, logging system events that can then be used for forensic or predictive analysis, has its own problems. The granularity of the logging data may be too fine, resulting in data streams that are overwhelmingly difficult to analyze; this is only exacerbated by logging data for extended periods of time.

In this paper, we describe a monitoring system called CERA that is designed to let the implementor create an easy-to-understand, hierarchical event detection and classification system. The CERA system combines a well-defined semantics for detecting extended event patterns with the ability to process the recognized patterns in arbitrary ways to generate additional synthetic events. This combination enables CERA to apply a uniform processing model to highly idiosyncratic, numeric low-level events and highly abstract, symbolic high-level events. In this paper, we describe how CERA was used to monitor events that occurred over time in a complex water recovery system. We go into some detail describing how we used CERA to define event pattern recognizers that turn raw telemetric data into increasingly useful, structured information whose semantic transparency aids long-term system monitoring.

The Water Recovery System Project

The Water Recovery System (WRS) is an advanced life support system project at NASA's Johnson Space Center (Schreckenghost et al. 1998b; Bonasso, Kortenkamp, & Thronesbery 2003). The WRS consists of four key subsystems: a biological water processor (BWP) for removing ammonia and organic compounds; a reverse osmosis system (RO) for removing inorganic compounds; an air evaporation system (AES) for recovering additional water from the brine produced by the RO; and a final post-processing system (PPS).
The BWP was first tested during a 450-day trial, and the WRS as a whole was tested during a year-long project. The WRS's subsystems contain approximately 200 sensors and actuators. The WRS as a whole is controlled by a three-tiered architecture (Schreckenghost et al. 1998a): a top layer consisting of an automated planner or commands from the human crew; a bottom layer consisting of the skill-level software for reading sensors and commanding actuators; and a middle layer for task sequencing using the RAP System (Firby 1989).

A number of significant events can be gleaned from the log data. For example, the communication channel between the WRS hardware and the RAP System would, from time to time, fail. Because the WRS hardware was no longer receiving control information, the WRS needed to safely shut down; this, of course, meant that each of the subsystems needed to safely shut down. Eventually, communications would recover and the WRS would restart; again, this meant that the subsystems would restart.

During operation of the WRS, each subsystem asynchronously logged measurements from its own sensors and actuators. These event logs were provided for a WRS monitoring demonstration. In this demonstration, the event logs were replayed, and a real-time event monitor, CERA, was used to watch for combinations of measurements that represented significant events, such as a loss of RAP System communication (LORC), recovery of RAP System communication (RORC), and the intermediate system and subsystem "safing" and restarting, as well as that entire event sequence.

    05/23/01 00:03:28 15121 12470 247 0 ...
    05/23/01 00:08:36 15121 12548 246 0 ...
    05/23/01 00:13:38 15121 12470 247 0 ...
    05/23/01 00:18:33 15040 12470 246 0 ...
    05/23/01 00:23:35 15040 12548 246 0 ...
    05/23/01 00:28:30 15121 12548 246 0 ...
    05/23/01 00:33:32 15040 12548 246 0 ...

Figure 1: Example data from the Reverse Osmosis log

Using CERA on the WRS Data

In this section, we provide specific details on how CERA was used in the WRS project. For more information on how CERA was used as a subsystem for human/machine collaboration, see (Martin et al. 2004) in this workshop. We provide these details to give an in-depth, real-life use case of complex event pattern detection.

What system developers typically have are telemetry data like those in Figure 1. What system developers would like is a way to abstract away the many details of these data and recognize patterns of events and sub-events that comprise higher-level, more complex event patterns. For example, for the WRS data, to recognize that the four subsystems have completed safing, it is useful to write something like this (the original version of CERA was written in Common Lisp, hence the Lisp-like syntax; of course, the specific syntax chosen is not the crucial point):

    (define-recognizer (safing-complete)
      (pattern '(all (safing (system bwp) (status on))
                     (safing (system ro) (status on))
                     (safing (system aes) (status on))
                     (safing (system pps) (status on))))
      (on-complete (st end)
        (signal-event '(all-safed) st end)))

That is, a recognizer named safing-complete watches for the pattern of four safing events occurring, one from each of the WRS subsystems, in any order. When all of these are seen, a synthesized all-safed event is signaled.

What are the advantages of such a system? The first, and most obvious, advantage is that of explanation. It is much easier for a human to understand this recognition pattern than to parse megabytes of log data looking for the data relationships that indicate safing of the various subsystems.
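To make the semantics of such an all pattern concrete, the following Common Lisp sketch shows the bookkeeping a recognizer of this kind must do. It is an illustration written for this discussion rather than code taken from CERA, and the names (*required-safings*, *seen-safings*, all-safed-p, note-safing-event) are ours:

    ;; Minimal sketch of the bookkeeping behind an ALL-style recognizer.
    ;; Illustration only, not the CERA implementation.

    (defparameter *required-safings* '(bwp ro aes pps)
      "Subsystems whose safing events must all be seen.")

    (defvar *seen-safings* '()
      "Subsystems whose safing events have been seen so far.")

    (defun all-safed-p ()
      "True when every required subsystem has been seen to safe."
      (null (set-difference *required-safings* *seen-safings*)))

    (defun note-safing-event (subsystem)
      "Record that SUBSYSTEM has reported safing; return true when the
    whole pattern is now complete."
      (pushnew subsystem *seen-safings*)
      (all-safed-p))

The set of subsystems seen so far is exactly what allows a recognizer to report how much of a pattern has completed, a point we return to below.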
Further, the pattern explicitly states what the relevant lower-level events are, as well as the pattern of their occurrence that leads to the completion of this pattern. In this case, an all-safed event is signalled just in case the four safing events have occurred. Patterns also have explanatory power in situ, while the pattern has not yet completed; that is, they can indicate the partial completion of patterns. For example, if two safing events have occurred (say, the AES and RO subsystems have completed safing), then it is reasonable to say we are in the middle of the completion of an all-safed event.

This leads to a second advantage, that of prediction. Because these patterns describe what has happened (in the previous example, two subsystems have completed safing), we can also use them to predict what might, or perhaps should, happen: in the current example, that the BWP and PPS subsystems will complete safing.

Third, we can build levels of abstraction of event patterns to make complex event pattern recognition possible and understandable. The safing-complete recognizer above is part of the overall pattern of RAP System communication loss, safing completion, RAP System communication reacquisition, and restart completion that comprises the overarching event we described earlier. In order to get an overall view of the situation at hand, the raw data events must be packaged into larger, hierarchically structured events (Thronesbery & Schreckenghost 2003).

Within this project, CERA acted as a stand-alone system that read the measurement events from a file and recognized semantically meaningful state changes, such as LORC and RORC events. When such events were recognized, CERA packaged them up in XML format and sent them to other parts of the monitoring system via CORBA connections.

The Raw Subsystem Events

The first step in the CERA event processing for the WRS monitoring system was to transform the complex, multi-valued events from each subsystem into simpler, more abstract events. To do this, four recognizers were defined, one to match the measurement events from each of the four subsystems. Each time a measurement event was matched, three things happened. First, measured telemetry values were saved away for possible later inclusion in an abstract event; this allows the highest granularity of data for later inspection, although the amount of data retained must of course be limited. This is essentially an internal logging function. Second, a function was called to see whether the subsystem had gone into safe mode. For example, the PPS system has gone into safe mode if its PT09 value is less than 1.4, its R04O value is 0, and its FM09 value is less than 200. (These details and rules were, of course, supplied by our NASA collaborators (Martin et al. 2004).) Third, another function checked in a similar way to see whether the subsystem was restarting. The specific data values and predicates for each subsystem are different, but the basic recognizer logic is the same.

These CERA raw data event recognizers illustrate the importance of combining pattern recognition with the ability to take arbitrary action. In this case, the patterns are relatively simple, consisting of a single event (although the event holds a lot of data).
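As an illustration of the kind of safe-mode predicate attached to such a measurement recognizer, here is a minimal Common Lisp sketch using the PPS thresholds quoted above; the function and argument names are ours, not part of the CERA API:

    ;; Sketch of a safe-mode predicate for the PPS subsystem, using the
    ;; thresholds quoted in the text. Illustration only.

    (defun pps-safe-mode-p (pt09 r04o fm09)
      "True when the PPS telemetry values indicate the subsystem has
    gone into safe mode."
      (and (< pt09 1.4)
           (zerop r04o)
           (< fm09 200)))

Analogous predicates, with their own data values and thresholds, check the other subsystems and the restart condition.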
However, in response to recognizing the pattern, a number of things must be done:

• the data values must be saved for future use;
• if the data values indicate a transition into or out of safe mode, a signal must be generated;
• if the data values indicate the system starting or stopping, a signal must be generated.

In our experience, this sort of idiosyncratic behavior in response to changes in specific data values occurs frequently when processing low-level telemetry. Thus, it must be simple to include arbitrary code for storing, comparing, and combining data values as appropriate.

The LORC Event

In addition to the raw measurement data from the various subsystems, CERA was also monitoring events detailing communications problems. In the WRS demonstration, external events were played back from the log that specifically mention when the RAP System loses contact with the various subsystems. A sketch of the recognizer that was used to process these events is shown in Figure 2.

    Recognizer: RAPS Communication Loss
      Pattern: repeat-match: RAPS Skills Comms Out, seconds ← seconds
        If RAPS Communication ≠ off-nominal:
          Record RAPS Communication state ← off-nominal;
          Begin saving telemetry values;
          Calculate start of loss from seconds;
          Signal RAPS Communication status: off-nominal.

    Recognizer: LORC
      Pattern: one: RAPS Communication status: off-nominal
      On complete: Signal a LORC event.

Figure 2: Recognizer for the LORC Event

A repeat-match pattern is used here to watch for RAPS-SKILLS-COMMS-OUT events. When each occurs, the attached code checks to see whether the event means that the communications status has changed. When a change to :off-nominal occurs, it is recorded for future use, and a synthetic STATUS-CHANGE event is signalled to CERA. In addition, the recording of telemetry events from all measurement recognizers is begun, and the real start time of the communications loss is computed (the actual loss began the specified number of seconds before the communication loss event was generated).

Again, a number of very idiosyncratic things need to be done in response to the raw RAPS-SKILLS-COMMS-OUT event. However, once that processing is done, the recognizer for the LORC event, a sketch for which is also shown in Figure 2, can be much simpler. The LORC event recognizer uses a simple one pattern looking for a single instance of the STATUS-CHANGE event generated by the communications loss recognizer. When this pattern is recognized, the LORC event is saved away for future processing and also signalled to the CERA system.

The processing steps in the LORC recognizer could easily have been put into the communications loss recognizer. However, we chose to separate them to keep the low-level event processing separate from the more abstract events needed by the WRS monitor. Saving the LORC event away is the first step in building up a more complex event description to send to other parts of the monitoring system.

The LORC Situation Event

The first situation of interest to the larger WRS monitoring system is when a LORC occurs followed by all four subsystems moving into safe mode. A sketch of the recognizer for this situation is shown in Figure 3. The pattern for this recognizer is a LORC event followed by the safing status of all subsystems changing to safe, in any order.

    Recognizer: LORC Situation
      Pattern: in-order:
        LORC;
        all: PBBWP status: safe;
             RO status: safe;
             AES status: safe;
             PPS status: safe
      On completion:
        Get saved LORC data;
        Publish it to CORBA;
        Signal a LORC Situation event.

Figure 3: Recognizer for the LORC Situation Event
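Read operationally, the pattern in Figure 3 advances roughly as in the following Common Lisp sketch. This is our own illustration of the in-order and all semantics, not the CERA internals, and the list-based event representation is invented for the example:

    ;; Rough illustration of how the Figure 3 pattern advances over an
    ;; event stream: first wait for a LORC event, then collect safe
    ;; statuses until all four subsystems have been seen.

    (defun lorc-situation-complete-p (events)
      "EVENTS is a list of simple event designators, oldest first, such
    as (:lorc) or (:safing :ro). True when a LORC has been seen and,
    after it, all four subsystems have reported safe."
      (let ((lorc-seen nil)
            (safed '()))
        (dolist (event events)
          (cond ((eq (first event) :lorc)
                 (setf lorc-seen t))
                ((and lorc-seen (eq (first event) :safing))
                 (pushnew (second event) safed))))
        (and lorc-seen
             (null (set-difference '(:pbbwp :ro :aes :pps) safed)))))

    ;; Example:
    ;; (lorc-situation-complete-p
    ;;  '((:lorc) (:safing :ro) (:safing :aes) (:safing :pps) (:safing :pbbwp)))
    ;; => T

A real recognizer must also handle contravening events (a subsystem leaving safe mode before the pattern completes) and track the start and end times of the situation, which this sketch omits.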
When the pattern is recognized, the recognizer finds the triggering LORC event saved away previously, and then publishes an XML version of the LORC situation event to the CORBA interface, and hence to the rest of the WRS monitoring system. In addition, a lorc-situation event is signalled to CERA. The creation of the appropriate XML for the LORC situation depends on the triggering event and on the telemetry data values saved away by the recognition of previous events.

The LORC-RORC Situation

Finally, the WRS monitor is interested in knowing when RAP System communication is restored and all of the subsystems have restarted. The recognizer of Figure 4 watches for that situation. The pattern looks for a LORC event, followed by a LORC situation, followed by all subsystems restarting. The LORC is not really necessary, as it will be included within the LORC situation; however, it simplifies extracting the trigger event for inclusion in the published XML description.

    Recognizer: LORC-RORC Situation
      Pattern: in-order:
        LORC;
        LORC Situation;
        all: PBBWP status: start;
             RO status: start;
             AES status: start;
             PPS status: start
      On completion:
        Get related events;
        Get saved LORC data;
        Publish it to CORBA;
        Signal a LORC-RORC Situation event;
        Stop saving telemetry values.

Figure 4: Recognizer for the LORC-RORC Situation

When this recognizer completes, it again publishes a composite XML LORC-RORC situation event to the WRS monitor via CORBA. It then signals a similar event to CERA and stops the recording of telemetry values from the subsystems. Meanwhile, all of the recognizers continue to process new events so that the next LORC and RORC events will be detected.

These patterns (with minor editing due to space considerations) are part of a set of patterns used to recognize significant events occurring over time in the WRS. On a 700 MHz Windows 2000 machine with 256 MB of memory running Allegro Common Lisp 6.1, the CERA software processes an event in approximately 1.5 ms (2346 base events, representing two days of data, in 3.55 seconds). We believe, therefore, that it provides a useful system for complex event detection in real time.

An Example Transcript

The sample transcript in Figure 5 is taken from one run of the WRS monitor demonstration data. It shows the events signalled by the CERA recognizers described above. Each line in the transcript begins with the event signalled, followed by the time covered by the event; point events show only a single time. The LORC, LORC Situation, and LORC-RORC Situation events are highlighted. In particular, note that, in contrast to the raw data transcript of Figure 1, these events can occur over time, they have internal structure, and they have semantically meaningful names.

The transcript begins with the RO subsystem stopping at 00:49 and then restarting at 01:14. The PPS subsystem also goes down, from 00:55 to 01:14. Then the AES subsystem can be seen to go down at 02:10 and restart at 02:14. Next, at 04:02, a loss of communications event is recognized that began at 03:56. This completes the pattern for the LORC recognizer, and a LORC event is signalled immediately at 04:02.
    (safing-status (system ro) (status safe)) at 00:49
    (startup-status (system ro) (status stop)) at 00:49
    (startup-status (system pps) (status stop)) at 00:55
    (safing-status (system pps) (status safe)) at 00:59
    (safing-status (system ro) (status nominal)) at 01:14
    (startup-status (system ro) (status start)) at 01:14
    (safing-status (system pps) (status nominal)) at 01:14
    (startup-status (system pps) (status start)) at 01:14
    (safing-status (system aes) (status safe)) at 02:10
    (startup-status (system aes) (status stop)) at 02:10
    (safing-status (system aes) (status nominal)) at 02:14
    (startup-status (system aes) (status start)) at 02:14
    (status-change (measure raps-comm) (status off-nominal) (reason lost-contact)) from 03:56 to 04:02
    (lorc) from 03:56 to 04:02
    (safing-status (system ro) (status safe)) at 04:04
    (startup-status (system ro) (status stop)) at 04:04
    (safing-status (system aes) (status safe)) at 04:04
    (startup-status (system aes) (status stop)) at 04:04
    (startup-status (system pps) (status stop)) at 04:04
    (safing-status (system pbbwp) (status safe)) at 04:05
    (startup-status (system pbbwp) (status stop)) at 04:05
    (safing-status (system pps) (status safe)) at 04:09
    (lorc-situation) from 03:56 to 04:09
    (safing-status (system aes) (status nominal)) at 14:32
    (startup-status (system aes) (status start)) at 14:32
    (status-change (measure raps-comm) (status nominal) (reason aes)) at 14:32
    (safing-status (system pbbwp) (status nominal)) at 14:32
    (startup-status (system pbbwp) (status start)) at 14:32
    (safing-status (system ro) (status nominal)) at 14:48
    (startup-status (system ro) (status start)) at 14:48
    (safing-status (system aes) (status safe)) at 14:48
    (startup-status (system aes) (status stop)) at 14:48
    (safing-status (system pps) (status nominal)) at 14:49
    (startup-status (system pps) (status start)) at 14:50
    (safing-status (system aes) (status nominal)) at 14:52
    (startup-status (system aes) (status start)) at 14:52
    (lorc-rorc-situation) from 03:56 to 14:52
    (safing-status (system aes) (status safe)) at 15:44
    (startup-status (system aes) (status stop)) at 15:44
    (safing-status (system aes) (status nominal)) at 15:49
    (startup-status (system aes) (status start)) at 15:49

Figure 5: Transcript of CERA Events Showing a LORC-RORC Situation Event (transcript edited for space)

After the LORC event, the RO and AES subsystems both enter safe mode at 04:04, and the PBBWP system follows at 04:05, with the PPS subsystem not far behind at 04:09. With all four subsystems entering safe mode after the LORC event, a LORC situation is recognized and signalled at 04:09. Notice that this event begins at 03:56, when the loss of communications actually began.

Finally, the subsystems begin coming back on line, and at 14:52 the AES restarts, completing the LORC-RORC situation pattern; a LORC-RORC situation event is signalled, running from 03:56 to 14:52. Notice that the AES subsystem does restart earlier, at 14:32, but then it stops again at 14:48, before the PPS subsystem restarts. The semantics of the all pattern, combined with the fact that, in our representation, an AES stop contravenes an AES start, mean that CERA will wait for all four subsystems to be started simultaneously before signalling the LORC-RORC event. Alternative behaviors can be obtained by using different patterns in the recognizer.
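A small Common Lisp sketch may make this waiting behavior concrete. It is our own illustration, not the CERA implementation, and it simply assumes that a stop event for a subsystem cancels that subsystem's earlier start:

    ;; Why the ALL recognizer waits for simultaneity: a later stop event
    ;; contravenes (cancels) the subsystem's earlier start, so the set of
    ;; currently started subsystems shrinks again. Sketch only.

    (defvar *started* '()
      "Subsystems currently believed to be started.")

    (defun note-startup-event (subsystem status)
      "Update *STARTED* for a startup-status event; return true when all
    four subsystems are in the started state at the same time."
      (ecase status
        (:start (pushnew subsystem *started*))
        (:stop  (setf *started* (remove subsystem *started*))))
      (null (set-difference '(:pbbwp :ro :aes :pps) *started*)))

Under this bookkeeping, the AES start at 14:32 is undone by its stop at 14:48, and the pattern completes only at 14:52, when all four subsystems are in the started state at once.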
CERA Pattern and Recognizer Types

CERA supports a variety of pattern and pattern recognizer types. What follows are brief descriptions of these patterns and pattern recognizers.

One patterns are the simplest of the pattern types. As their name implies, they consist of exactly one element, and the recognizer that a one pattern creates is complete as soon as it recognizes that element. Repeat-match patterns bind to signals (such as the telemetry events seen above) and provide one way to maintain state while monitoring.

A one-of pattern is an unordered collection of elements. A one-of recognizer is complete when any one of the elements specified by the pattern that created it occurs.

An in-order pattern is a collection of elements in a specified order. It follows that an in-order recognizer is complete once each of the elements of the in-order pattern has occurred in the correct order. This definition is somewhat lenient: it is acceptable for other elements to occur in between two of the elements in the pattern, and even for elements of the pattern to occur out of order, since events that do not directly affect the next element in the list are ignored by the recognizer.

An all pattern is a collection of elements with no regard given to the order in which they occur. Once all of the elements of a given all pattern have occurred, an all recognizer created by that pattern is complete.

Recognizers for in-order and all patterns can also make use of the idea of event contravention (previously we called this event contradiction (Fitzgerald, Firby, & Hannemann 2003)). The basic idea is this: in a complex, multichannel environment, many events will occur that are not relevant to the completion or futility of a recognizer. For example, while watching for all subsystems to go into safe mode, many events are irrelevant, but some events contravene the recognition of this pattern; having once noticed that a subsystem has gone into safe mode, that subsystem's restarting (that is to say, going into "non-safe" mode) contravenes the overall pattern recognition. To support this idea, software classes that represent base signals can be assigned a method that returns true or false when compared to other base signals. This predicate can be as simple or as complex as necessary for a given application.

A within pattern consists of a single element and a time duration. A within recognizer can only be complete if the element it is looking for occurs with start and finish times such that the difference between them is less than the duration specified by the pattern. A without pattern consists of a single item and a time interval. The recognizer created by a without pattern can only be completed if its subrecognizer returns complete with a start time after the finish time of the specified interval or a finish time before the start time of the specified interval; otherwise, the recognizer is said to be futile. CERA also provides patterns based on the possible relationships between the intervals over which events occur (Allen 1983).

Software Availability

The CERA system was originally developed in Common Lisp, and we took advantage of the ease with which Common Lisp allowed us to write patterns declaratively with procedural attachment (as in the examples given). We have also written procedural versions of the CERA software in both Python and C++. The Python and C++ versions of CERA are available on the CERA software project site at http://sourceforge.net/projects/cera and are released under an open-source license which allows them to be easily incorporated into new projects.
The proprietary Lisp version of CERA (Fitzgerald, Firby, & Hannemann 2003) is available from I/NET, Inc.

Conclusions

The immediate current state of an autonomous system is often reflected in the telemetric stream of sensor and actuator signals. Monitoring the state of a system in semantically meaningful terms, however, often requires watching for complex patterns of signals and events. As we have demonstrated in the WRS prototype, CERA is a useful architecture for this kind of complex system monitoring. Furthermore, monitoring the completion or non-completion of CERA event pattern recognizers provides further insight into system history and possible future states. Using a system such as CERA allows system developers to recognize hierarchical patterns of events, which leads to deeper insight into the system state. This is especially useful for systems that operate for long periods and do so autonomously most of the time.

CERA has also been used to provide the direct human-machine interface for a mobile robot that acts as a judge in a multi-modal memory game (Fitzgerald, Kairys, & Phillips 2004), and it can be useful for building the immediate multi-modal interfaces to monitoring systems (as well as for monitoring the systems themselves). By providing open-source versions of the CERA software, we hope to develop a community of users that will lead to new event pattern descriptions as well as innovations in multi-modal user interface development. Readers are invited to try the CERA software in their own projects.

Acknowledgements

CERA was originally developed under NASA SBIR Projects 99-1-08.04-1856 and 99-2-08.04-1856. The Python and C++ code was developed during a 2003 Summer NSF REU Program at Hope College. We gratefully acknowledge useful comments by Michael Hannemann on an earlier version of this paper.

References

Allen, J. 1983. Maintaining knowledge about temporal intervals. Communications of the ACM 26:832–843.

Bonasso, P.; Kortenkamp, D.; and Thronesbery, C. 2003. Intelligent control of a water-recovery system: Three years in the trenches. AI Magazine 24(1):19–44.

Firby, R. J. 1989. Adaptive execution in complex dynamic worlds. Technical Report YALEU/CSD/RR #672, Computer Science Department, Yale University.

Fitzgerald, W.; Firby, R. J.; and Hannemann, M. 2003. Multimodal event parsing for intelligent user interfaces. In Proceedings of the 2004 International Conference on Intelligent User Interfaces, 253–261. ACM Press.

Fitzgerald, W.; Kairys, J.; and Phillips, A. 2004. Complex event pattern recognition software for multi-modal user interfaces. Available on-line at http://cera.sourceforge.net.

Krasner, G. E., and Pope, S. T. 1988. A cookbook for using the Model-View-Controller user interface paradigm in Smalltalk-80. Journal of Object-Oriented Programming 1(3):26–49.

Martin, C.; Schreckenghost, D.; Bonasso, P.; Kortenkamp, D.; Milam, T.; and Thronesbery, C. 2004. Aiding collaboration among humans and complex software agents. In Proceedings of the AAAI 2004 Spring Symposium on Interaction Between Humans and Autonomous Systems over Extended Operation.

Schreckenghost, D.; Bonasso, R. P.; Kortenkamp, D.; and Ryan, D. 1998a. Three tier architecture for controlling space life support systems. In IEEE Symposium on Intelligence in Automation and Robotics.

Schreckenghost, D.; Ryan, D.; Thronesbery, C.; Bonasso, R. P.; and Poirot, D. 1998b. Intelligent control of life support systems for space habitats. In AAAI Innovative Applications of AI.

Thronesbery, C., and Schreckenghost, D. 2003. Situation views: Getting started handling anomalies. In Proceedings of the Systems, Man, and Cybernetics Conference. Washington, D.C.: IEEE.