International Journal of Condition Monitoring and Diagnostic Engineering Management, 4(4):14-32, October 2001, ISSN 1362-7681

Automated Safety Monitoring: A Review and Classification of Methods

Yiannis Papadopoulos1 & John McDermid2

1 Department of Computer Science, University of Hull, Hull, HU6 7RX, UK, e-mail: Y.I.Papadopoulos@hull.ac.uk
2 Department of Computer Science, University of York, York YO10 5DD, UK, e-mail: jam@cs.york.ac.uk

Yiannis Papadopoulos holds a degree in Electrical Engineering from the Aristotle University of Thessaloniki in Greece, an MSc in Advanced Manufacturing Technology from Cranfield University and a PhD in Computer Science from the University of York. In 1989, he joined the Square D Company, where he took up a leading role in the development of an experimental Ladder Logic compiler with fault injection capabilities and a simulator for the Symax programmable logic controllers. Between 1994 and 2001 he worked as a research associate and research fellow in the Department of Computer Science at the University of York. Today he is a lecturer in the Department of Computer Science at the University of Hull, where he continues his research on the safety and reliability of computer systems and software. Dr. Papadopoulos is the author of more than 20 scientific journal and conference papers. His research engages with a body of work on model-based safety analysis and on-line system diagnosis, and contributes to that body of work with new models and algorithms that capture and usefully exploit the hierarchical organisation of the structure and behaviour of the system. His work currently develops through extensive technical collaborations with European industry, mainly with DaimlerChrysler, Volvo and Germanischer Lloyd.

John McDermid studied Engineering and Electrical Science at Cambridge, as an MoD-sponsored student, then worked at RSRE Malvern (now part of DERA) as a research scientist. Whilst at RSRE he obtained a PhD in Computer Science from the University of Birmingham. He spent five years working for a software house (Systems Designers, now part of EDS) before being appointed to a chair in Software Engineering at the University of York in 1987. In York, he built up a new research group in high integrity systems engineering (HISE). The group studies a broad range of issues in systems, software and safety engineering, and works closely with the UK aerospace industry. Professor McDermid is the Director of the Rolls-Royce funded University Technology Centre (UTC) in Systems and Software Engineering and the British Aerospace-funded Dependable Computing System Centre (DCSC). The work in these two centres was a significant factor in the award of the 1996 Queen's Anniversary Prize for Higher and Further Education to the University. Professor McDermid has published over 200 papers and 7 books. He has also been an invited speaker at conferences in the UK, Australia and South America. He has worked as a consultant to industry and government in the UK, Europe and North America, particularly advising on policy for the procurement and assessment of safety critical systems.

Abstract. The role of automated safety monitors in safety critical systems is to detect conditions that signal potentially hazardous disturbances, and assist the operators of the system in the timely control of those disturbances.
In this paper, we re-define the problem of safety monitoring in terms of three principal dimensions: the optimisation of low-level status information and alarms, fault diagnosis, and failure control and correction. Within this framework, we discuss state-of-the-art systems and research prototypes, and develop a critical review and classification of safety monitoring methods. We also arrive at a set of properties that, in our view, should characterise an effective approach to the problem, and draw some conclusions about the current state of research in the field.

1. Introduction

Failures in safety critical systems can pose serious hazards for people and the environment. One way of preventing failures from developing into hazardous disturbances in such systems is to use additional external devices that continuously monitor the health of the system or that of the controlled processes. The role of such safety monitoring devices is to detect conditions that signal potentially hazardous disturbances, and assist the operators of the system in the timely control of those disturbances. Traditionally, safety monitoring relied primarily on the abilities of operators to process effectively a large amount of status information and alarms. Indeed, until the early 80s, monitoring aids were confined to hard-wired monitoring and alarm annunciation panels that provided direct indication of system statuses and alarms. Those conventional alarm systems have often failed to support the effective recognition and control of failures. A prominent example of such failure is the Three Mile Island nuclear reactor (TMI-2) accident in 1979 [1], where a multiplicity of alarms raised at the early stages of the disturbance confused operators and led them to a misdiagnosis of the plant status. Similar problems have also been experienced in the aviation industry. Older aircraft, for example, were equipped with status monitoring systems that incorporated a large number of instruments and pilot lights. Those systems were also ineffective in raising the level of situational awareness that is required in conditions of failure. Indeed, several accidents have occurred because pilots ignored or responded late to the indications or warnings of the alarm system [2]. The problems experienced with those conventional monitoring systems made it clear that it is unsafe to rely on operators alone to cope with an increasing amount of process information. They highlighted the need to improve the process feedback and the importance of providing intelligent support in monitoring operations. At the same time, the problems of conventional hard-wired systems motivated considerable work towards automating the safety monitoring task and developing computerised safety monitors. In this paper, we explore that body of work on safety monitoring through a comprehensive review of the relevant literature. We discuss the different facets of the problem, and develop a critical review and classification of safety monitoring methods. We also arrive at a set of properties that, in our view, should characterise an effective approach to the problem. Our presentation is developed as follows. In section two, we identify and analyse what we perceive as the three key aspects of the problem: the optimisation of a potentially enormous volume of low-level status information and alarms, the diagnosis of complex failures, and the control and correction of those failures.
We then use this trichotomy of the problem as a basis for the methodological review, which occupies the main body of the paper (sections three to five). Finally, we summarise the discussion, enumerate a number of properties that an effective safety monitoring scheme should have, and draw some conclusions about the current state of research in the field.

2. Nature of the Problem and Goals of Safety Monitoring

During the 80s and early 90s a series of significant studies in the aerospace, nuclear and process industries re-assessed the problem of safety monitoring and created a vast body of literature on requirements for the effective accomplishment of the task (for studies on requirements for safety monitoring in the aerospace sector, the reader is referred to [2-6]; for similar studies in the nuclear and process industries, to [7-12]). In this paper, we abstract away from the detailed recommendations of individual studies, and we present a structured set of goals which, we believe, effectively summarises this body of research. Figure 1 illustrates these goals and the hierarchical relationships between them. The first of those goals is the optimisation of status monitoring and alarm annunciation. We use the term optimisation to describe functions such as the filtering of false and unimportant alarms, the organisation of alarms to reflect priority or causal relationships, and the integration of plant measurements and low-level alarm signals into high-level functional alarms. The rationale here is that fewer, and more informative, functional alarms could help operators understand better the nature of disturbances. Functional alarms, though, generally describe the symptoms of a disturbance rather than its underlying causes, and the problem is that complex failures are not always manifested with unique, clear and easily identified symptoms. The reason for this is that causes and their effects do not necessarily have an exclusive one-to-one relationship. Indeed, different faults often exhibit identical or very similar symptoms. The treatment of those symptoms typically requires fault diagnosis, that is, the isolation of the underlying causal faults from a series of observable effects on the monitored process. In recent years, considerable research has focused on the development of automated systems that can assist operators in this highly skilled and difficult task. The principal goal of this research is to explore ways of automating the early detection of escalating disturbances and the isolation of root failures from ambiguous anomalous symptoms. The detection and diagnosis of failures is followed by the final and most decisive phase of safety monitoring, the phase of failure control and correction. Here, the system or the operator has to take actions that will remove or minimise the hazardous effects of failure. In taking such actions, operators often need to understand first how low-level malfunctions affect the functionality of the system. Thus, there are two areas in which an automated monitor could provide help in the control and correction of failures: the assessment of the effects of failure, and the provision of guidance on appropriate corrective measures. An additional and potentially useful goal of advanced monitoring is prognosis, the prediction of future conditions from present observations and historical trends of process parameters.
In the context of mechanical components, and generally components which are subject to wear, prognosis can be seen as the early identification of conditions that signify a potential failure [13]. At system level, the objective is to make predictions about the propagation of a disturbance while the disturbance is in progress and still at the early stages. Such predictions could help, for example, to assess the severity of the disturbance and, hence, the risk associated with the current state of the plant. Prognosis is, therefore, a potentially useful tool for optimising the decisions about the appropriate course of action during process disturbances. In this paper, we deal with two aspects of the safety monitoring problem that are closely related to prognosis: the early detection and diagnosis of disturbances, and the assessment of the functional effects of low-level failures. We wish to clarify, though, that the issue of prognosis itself is outside the scope of this paper.

Figure 1. Goals of advanced safety monitoring (optimisation of status monitoring and alarm annunciation, on-line fault diagnosis, and failure control and correction, each decomposed into the sub-goals discussed above)

3. Optimisation of Status Information and Alarms

The first key area of the safety monitoring problem identified in the preceding section is the optimisation of the status information and alarms produced by the monitoring system. Operational experience shows that disturbances in complex systems can cause an enormous, and potentially confusing, volume of anomalous plant measurements. Indeed, in traditional alarm annunciation panels this phenomenon has caused problems in the detection and control of disturbances. In such systems, alarms often went undetected by operators as they were reported amongst a large number of other less important and even spurious alarms [9]. The experience from those conventional monitoring systems shows that a key issue in improving the process feedback is the filtering of false and unimportant alarms. False alarms are typically caused either by instrumentation failures, that is, failures of the sensor or the instrumentation circuit, or by normal parameter transients that violate the acceptable operational limits of monitored variables. An equally important factor in the generation of false alarms is the confusion often made between alarms and statuses [14]. The status of an element of the system or the controlled process indicates the state that the element (component, sub-system or physical parameter) is in. An alarm, on the other hand, indicates that the element is in a particular state when it should be in a different one. Although status and alarm represent two different types of process information, statuses are often directly translated into alarms. The problem of false alarms can then arise whenever a particular status is not a genuine alarm in every context of the system operation. Another important issue in improving the process feedback appears to be the ability of the monitoring system to recognise distinct groups of alarms and their causal/temporal relationships during the disturbance.
An initiating hazardous event is typically followed by a cascading sequence of alarms, which are raised by the monitor as the failure propagates in the system and gradually disturbs an increasing number of process variables. However, a static view of this set of alarms does not convey very useful information about the nature of the disturbance. The order in which alarms are added into this set, though, can progressively indicate both the source and future direction of the disturbance (see, for example, the work of Dansak and Roscoe [15,16]). Thus, the ability to organise alarms in a way that can reflect their temporal or causal relationships is an important factor in helping operators to understand better the nature of disturbances. An alternative, and equally useful, way to organise status information and alarms is via a monitoring model which relates low-level process indications to functional losses or malfunctions of the system (see, for example, the work of Kim, Modarres and Larsson [17-19]). Such a model can be used in real time to provide the operator with a functional, as opposed to a purely architectural, view of failures. This functional view is becoming important as the complexity of safety critical systems increases and operators become less aware of the architectural details of the system. A number of methods have been proposed to address the optimisation of the process feedback during a disturbance. Their aims range from the filtering of false alarms to the provision of an intelligent interface between the plant and the operator that can organise low-level alarm signals and plant measurements into fewer and more meaningful alarm messages and warnings. Figure 2 identifies a number of such methods, and classifies them into three categories using as a criterion their primary goal with respect to safety monitoring. A discussion of these methods follows in the remainder of this section.

Figure 2. Classification of alarm optimisation methods (sensor validation methods, state and alarm association, multiple stage alarm relationships, alarm patterns, alarm trees, logic and production rules, electronic checklists and goal trees, grouped under the three goals of filtering false and unimportant alarms, organising alarms to reflect causal/temporal relationships, and providing high-level functional alarms)

3.1 Detection of Sensor Failures

A primary cause of false and misleading alarms is sensor failures. Such alarms can be avoided using general techniques for signal validation and sensor failure detection. Sensor failures can be classified into two types with respect to the detection process: coarse (easy to detect) and subtle (more difficult to detect). Coarse failures cause indications that lie outside the normal measurement range and can be easily detected using simple limit checking techniques [20]. Examples of such failures are a short circuit to the power supply or an open circuit. Subtle failures, on the other hand, cause indications that lie within the normal range of measurement, and they are more difficult to detect. Such failures include biases, nonlinear drifts and stuck-at failures. One way to detect such failures is to employ a hardware redundancy scheme, for example a triplex or quadruple configuration of components [21]. Such a system operates by comparing sensor outputs in a straightforward majority-voting scheme. The replicated sensors are usually spatially distributed to ensure protection against common cause failure.
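To make the two detection problems concrete, the following sketch (in Python) combines coarse limit checking with a simple majority-style vote over a triplex sensor configuration. The measurement range, tolerance and readings are hypothetical values chosen for illustration, not figures from any of the systems reviewed here.

```python
from statistics import median

# Hypothetical measurement range for the monitored variable (coarse limit check).
RANGE_LO, RANGE_HI = 0.0, 150.0
# Maximum disagreement tolerated between one sensor and the voted value.
TOLERANCE = 2.0

def coarse_check(reading: float) -> bool:
    """Detect coarse failures: readings outside the physically possible range."""
    return RANGE_LO <= reading <= RANGE_HI

def vote(readings: list[float]) -> tuple[float, list[int]]:
    """Majority-style vote over a triplex sensor set.

    Returns the voted (median) value and the indices of sensors whose readings
    are out of range or disagree with the voted value (subtle failures).
    """
    valid = [r for r in readings if coarse_check(r)]
    voted = median(valid) if valid else float("nan")
    suspect = [i for i, r in enumerate(readings)
               if not coarse_check(r) or abs(r - voted) > TOLERANCE]
    return voted, suspect

# Example: sensors 0 and 1 agree, sensor 2 has drifted subtly.
voted_value, suspect_sensors = vote([101.2, 101.5, 95.3])
print(voted_value, suspect_sensors)   # 101.2 [2]
```

The vote identifies the drifted channel even though its reading lies well inside the legal measurement range, which is precisely the class of subtle failure that limit checking alone cannot catch.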
Although hardware replication offers a robust method of protection against sensor failures, at the same time it leads to increased weight, volume and cost. An alternative, and in theory more cost-effective, way to detect subtle failures is by using analytical redundancy, in other words relationships between dissimilar process variables. State estimation [22] and parity space [23-25] techniques employ control theory to derive a normally consistent set of such relationships, detect inconsistencies caused by sensor failures and locate these failures. Kim and Modarres [26] propose an alternative method based on analytical redundancy, in which the diagnosis is achieved by parsing the structure of a sensor validation tree, and by examining in the process a number of sensor validation criteria specified in the tree structure. These criteria represent coherent relationships among process variables, which are violated in the context of sensor failures. Such relationships are formulated using deep knowledge about the system processes, for example pump performance curves, mass and energy conservation equations and control equations. Overall, a number of methods based on analytical redundancy have been developed to detect and diagnose sensor faults [27]. Experience, however, from attempts to apply these methods at industrial scale has shown that the development of the required models is an expensive process which often yields incomplete and inaccurate results [28]. More recently, research has been directed towards the concept of self-diagnostics, in which sensor failures are detected not from their effects on the process but through self-tests that exploit the manufacturer's own expertise and knowledge of the instrument [29]. The state of the art in this area is the SEVA approach developed at Oxford University [30,31]. In SEVA, sensor self-diagnosis is supplemented with an assessment of the impact of detected faults on measurement values. If a measurement is invalid, a SEVA sensor will attempt to replace that measurement with a projection of historical values of the given parameter. Together with the measurement value, SEVA sensors also provide information about uncertainty in the measurement as well as detailed device-specific diagnostics for maintenance engineers. A number of prototype SEVA sensors have been developed and the concept is now reaching maturity and the stage of commercialisation.

3.2 State-Alarm Association

Another cause of false and irrelevant alarms is the confusion often made between statuses and alarms [14]. A seemingly abnormal status of a system parameter (e.g. high or low) is often directly translated to an alarm. This direct translation, however, is not always valid. To illustrate this point, we will consider a simple system in which a tank is used to heat a chemical to a specified temperature (see Figure 3). The operational sequence of the system is described in the GRAFCET notation (see, for example, [32]) and shows the four states of this system: idle, filling, heating and emptying. Initially the system is idle, and the operational sequence starts when the start cycle signal arrives. A process cycle is then executed as the system goes through the four states in a sequential manner, and the cycle is repeated until a stop signal arrives. Let us now attempt to interpret the meaning of a tank full signal in the three functional states of the system. In the filling state, the signal does not signify an alarm.
On the contrary, in this state the signal is expected to cause a normal transition of the system from the filling state to the heating state. In the heating state, not only does the presence of the signal not signify an alarm, but its absence should generate one. The tank is now isolated, and therefore the tank full signal should remain high as long as the system remains in that state. Finally, in the emptying state, the signal signifies a genuine alarm. If tank full remains active while the system is in this state, then either Pump B has failed or the tank outlet is blocked.

Figure 3. Heating system and operational sequence [33]

This example indicates clearly that we need to interpret the status in the context of the system operation before we can decide if it is normal or not. One way we can achieve this is by determining which conditions signify a real alarm in each state or mode of the system. The monitor can then exploit mechanisms which track system states in real time to monitor only the relevant alarms in each state. The concept was successfully demonstrated in the Alarm Processing System [14]. This system can interpret the process feedback in the evolving context of operation of a nuclear power station. The system has demonstrated that it significantly reduces the number of false and unimportant alarms.
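The idea can be captured very compactly in software: each operational state carries its own interpretation rules, and a raw status contributes an alarm only where those rules say it should. The sketch below uses the tank example; the state names and the tank full signal come from the example above, while the rule table and function names are purely illustrative.

```python
from enum import Enum, auto

class State(Enum):
    IDLE = auto()
    FILLING = auto()
    HEATING = auto()
    EMPTYING = auto()

# For each state, conditions on the process feedback that signify a genuine
# alarm in that state only. In FILLING, "tank full" is a normal transition
# condition, so no rule is defined for it there.
ALARM_RULES = {
    State.HEATING:  [(lambda s: not s["tank_full"], "loss of content while heating")],
    State.EMPTYING: [(lambda s: s["tank_full"],     "Pump B failed or tank outlet blocked")],
}

def active_alarms(state: State, signals: dict) -> list[str]:
    """Return only the alarms that are genuine in the current operational state."""
    return [msg for test, msg in ALARM_RULES.get(state, []) if test(signals)]

# The same raw status produces different interpretations in different states.
print(active_alarms(State.FILLING,  {"tank_full": True}))   # []
print(active_alarms(State.EMPTYING, {"tank_full": True}))   # ['Pump B failed or tank outlet blocked']
```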
3.3 Alarm Filtering Using Multiple Stage Alarm Relationships

Multiple stage alarms are often used to indicate increasing levels of criticality in a particular disturbance. A typical example of this is the two-stage alarms employed in level monitoring systems to indicate unacceptably high and very high levels of content. When the actual level exceeds the very high point, both alarms are normally activated. This is obviously unnecessary, because the very-high level alarm implies the high level alarm. The level-precursor relationship method that has been employed in the Alarm Filtering System [34] extends this concept and uses more complex hierarchical relationships between multiple stage alarms to suppress precursor alarms whenever a more important alarm is generated. Many alarms could be filtered using this method alone in a typical process plant.

3.4 Identifying Alarm Patterns

An alarm pattern is a sequence of alarms which is typically activated following the occurrence of an initiating hazardous event in the system. Alarm patterns carry a significantly higher information content than the set of alarms that compose them. A unique alarm pattern, for example, unambiguously points to the cause of the disturbance, and indicates the course of its future propagation. The idea of organising alarms into patterns and then deriving diagnoses on-line from such patterns was first demonstrated in the Diagnosis of Multiple Alarms system at the Savannah River reactors [15]. A similar approach was followed in the Nuclear Power Plant Alarm Prioritisation program [16], where the alternative term alarm signature was used to describe a sequence of alarms which unfold in a characteristic temporal sequence. A shortcoming of this approach is that the unique nature of the alarm signature is often defined by the characteristic time of activation of each alarm. The implication is that the "pattern matching" algorithm must know the timing characteristics of the expected patterns, a fact which creates substantial technical difficulties in the definition of alarm patterns for complex processes. An additional problem is that the timing of an alarm signature may not be fixed and may depend on several system parameters. In the cooling system of a nuclear power plant, for example, the timing of an alarm sequence will almost certainly depend on parameters such as the plant operating power level, the number of active coolant pumps, and the availability of service water systems [35].

3.5 Organising Alarms Using Alarm Trees

An alternative way to organise alarms is the alarm tree [7]. The tree is composed of nodes, which represent alarms, and arrows that interconnect nodes and denote cause-effect relationships. Active alarms at the lowest level of the tree are called primary cause alarms. Alarms which appear at higher levels of the tree (effect alarms) are classified into two categories: important (uninhibited) alarms and less important (inhibited) alarms. During a disturbance, the primary cause alarm and all active uninhibited alarms are displayed in the causal order in which they appear in the tree. At the same time, inhibited alarms are suppressed to reduce the amount of information returned to the operator. Beyond real process alarms based on measurements, the tree can also generate messages based on such alarms. These messages, for example, may relate a number of alarms to indicate a possible diagnosis of a process fault. Such diagnostic messages are called deduced or synthesised alarms. Alarm trees have been extensively tried in the UK nuclear industry. Plant-wide applications of the method, however, have shown that alarm trees are difficult and costly to develop. It took ten man-years of effort, for example, to develop the alarm trees for the Oldbury nuclear power plant [9]. Operational experience has also indicated that large and complex alarm trees often deviate from the actual behaviour of the plant and that they do not appear to be very useful to operators. In evaluating the British experience, Meijer [36] states that the "most valuable trees were relatively small, typically connecting three or four alarms".

3.6 Synthesising Alarms Using Logical Inferences

A more flexible way of organising alarms and status information is by using logic and production rules [37]. A production rule is an If-Then statement that defines an implication relationship between a prerequisite and a conclusion. Production rules can be used to express logical relationships between low-level process indications and more informative conclusions about the disturbance. Such rules can be used in real time for the synthesis of high-level alarms. Rule-based monitors incorporate an inference engine which evaluates and decides how to chain rules, using data from the monitored process. As rules are chained and new conclusions are reached, the system effectively synthesises low-level plant data into higher level and more informative alarms. To illustrate the concept we will use the example of a pump, which delivers a substance from A to B (Figure 4). One of the faults that may occur in this system is an inadvertent closure of the suction valve while the pump is running. In such circumstances, the operator of a conventional system would see the status of the pump remain set to running and two alarms: no flow and low pressure at the suction valve. On the other hand, a rule-based system would call a simple rule to synthesise these low-level data and generate a single alarm that does not require further interpretation. This alarm (suction valve inadvertently closed) certainly conveys a denser and more useful message to the operator. It is worth noting that in generating this alarm, the system has performed a diagnostic function. Indeed, rule-based systems have been extensively applied in the area of fault diagnosis. The idea of using logic and production rules for alarm synthesis has been successfully applied in the Alarm Filtering System [34], an experimental monitor for nuclear power plants.

Figure 4. Production rule and transfer pump (IF the pump state is running AND there is no flow in the pipe AND low pressure exists at the suction valve THEN suction valve inadvertently closed)
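The rule shown in Figure 4 can be written down and evaluated directly. The following sketch encodes it as data and synthesises the high-level alarm from the three low-level indications; the string labels mirror the example, but the structure and function names are illustrative rather than taken from the Alarm Filtering System.

```python
# A production rule as an If-Then pair: all antecedents must hold for the
# conclusion (a synthesised, high-level alarm) to be raised.
RULES = [
    {
        "if": ["pump running", "no flow in pipe", "low pressure at suction valve"],
        "then": "suction valve inadvertently closed",
    },
]

def synthesise_alarms(observations: set[str]) -> list[str]:
    """Return the high-level alarms whose antecedents are all satisfied."""
    return [r["then"] for r in RULES if all(a in observations for a in r["if"])]

# Low-level feedback from the transfer pump example:
feedback = {"pump running", "no flow in pipe", "low pressure at suction valve"}
print(synthesise_alarms(feedback))   # ['suction valve inadvertently closed']
```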
3.7 Function-Oriented Monitoring

The status monitoring and alarm organisation methods that we have examined so far share a fundamental principle. Monitoring is performed against a model of the behaviour of the system in conditions of failure. Indeed, the different models that we have discussed (e.g. alarm patterns, alarm trees) are founded on a common notion of an expected anomalous event, a signal which signifies equipment failure or a disturbed system variable. A radically different approach is to monitor the system using a model of its normal (fault-free) behaviour. The objective here is to detect discrepancies between the actual process feedback and the expected normal behaviour. Such discrepancies indicate potential problems in the system and, therefore, provide the basis for generating alarms and warnings. A simple monitoring model of normal behaviour is the electronic checklist. An electronic checklist associates a critical function of the system with a checklist of normal conditions which ensure that the function can be carried out safely. In real time, the checklist enables monitoring of the function by evaluation of the necessary conditions. Electronic checklists have been widely used in the design of contemporary aircraft. Figure 5 illustrates an example from the Electronic Instrument System of the MD-11 aircraft [38]. Before a take-off, the pilot has to check assorted status information to ensure that he can proceed safely. The function "safe take-off" is, therefore, associated with a list of prerequisite conditions. If a take-off is attempted and these essential conditions are not present, the monitor generates a red alert. The checklist is then displayed to explain why the function cannot be carried out safely.

Figure 5. Checklist of essential items for take-off [38] (stab not in green band; slats not in take-off position; flaps not in take-off position; parking brake on; spoilers not armed)

Although electronic checklists are useful, they are limited in scope. To enable more complex representations of normal behaviour, Kim and Modarres [17] have proposed a much more powerful model called the Goal Tree Success Tree (GTST). According to the method, maintaining a critical function in a system can be seen as a target or a goal. In a nuclear power plant, for example, a basic goal is to "prevent radiation leak to the environment". Such a broad objective, though, can be decomposed into a group of goals which represent lower-level safety functions that the system should maintain in order to achieve the basic objective. These goals are then analysed and reduced to sub-goals, and the decomposition process is repeated until sub-goals cannot be specified without reference to the system hardware. The result of this process is a Goal Tree which shows the hierarchical implementation of critical functions in the system. A Success Tree is then attached to each of the terminal nodes in the Goal Tree. The Success Tree models the plant conditions which satisfy the goal that this terminal node represents. The GTST contains only two types of logical connectives: AND and OR gates. Thus, a simple value-propagation algorithm [26,39] can be used by the monitor to update nodes and provide functional alarms for those critical safety functions that the system fails to maintain. The GTST has been successfully applied in a number of pilot applications in the nuclear [18], space [40,41] and process industries [42]. Later on in this paper we return to this model to explore how it has been employed for on-line fault diagnosis.
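A minimal sketch of such a value-propagation step over AND/OR gates is shown below. The goal and condition names are invented for illustration (a real GTST would be far larger), and the evaluation simply propagates truth values bottom-up from monitored leaf conditions to the goals.

```python
# Each node is either a leaf (a monitored success condition) or a gate over its
# children; a goal that evaluates to False yields a functional alarm.
GTST = {
    "maintain coolant flow":    ("AND", ["primary pump available", "coolant inventory adequate"]),
    "primary pump available":   ("OR",  ["pump A running", "pump B running"]),
}

def evaluate(node: str, observations: dict) -> bool:
    """Propagate truth values bottom-up through AND/OR gates."""
    if node not in GTST:                      # leaf: read directly from the process feedback
        return observations.get(node, False)
    gate, children = GTST[node]
    results = [evaluate(c, observations) for c in children]
    return all(results) if gate == "AND" else any(results)

obs = {"pump A running": False, "pump B running": True, "coolant inventory adequate": False}
for goal in GTST:
    if not evaluate(goal, obs):
        print("functional alarm:", goal)      # only "maintain coolant flow" fails here
```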
3.8 Hierarchical Presentation of Status Information

An additional aspect of the problem of optimising status information is presentation. This aspect of the problem involves many issues, such as the design of appropriate media for input and display, user-interface design and other issues of man-machine interaction. Many of those issues are very much system, application and task dependent. Different interfaces, for example, would be required in a cockpit and a plant control room. We have identified, however, and we discuss here, a fundamental principle that has been used in a number of advanced systems to enhance the presentation of status information. This is the concept of layered monitors or, in other words, the idea of hierarchical presentation of the process feedback. Operational experience in safety critical systems shows that operators of such systems need much less information when systems are working than when they are malfunctioning [2]. To achieve this type of control over the amount of information provided to the operator, many contemporary systems employ a layered monitor. Under normal conditions, a main system display is used to present only a minimum amount of important data. When a subsystem malfunction is detected, more data is presented in a subsystem display, either automatically or on request. This approach is used in systems such as the Electronic Instrument System of the Douglas MD-11 [43] and the Electronic Centralised Aircraft Monitor of the Airbus A320 [44,45]. In both systems, the aircraft monitor consists of two displays. The first display is an Engine/Warning Display, which displays the status and health of the engines in terms of various parameters. The engine status is the most significant information under normal flight and it is always made available to the pilot. The second part of the display is an area dedicated to aircraft alerting information. The second display is called the Subsystems Display and can display synoptics of the aircraft subsystems (Electrical, Hydraulic, Fuel, Environment etc.). A particular synoptic can be selected from a control panel by pressing the corresponding push-button. Part of this display is an area where the system generates reports about failures, provides warnings about consequences and recommends corrective measures. Figure 6 illustrates how the MD-11 electronic instrument system reports a hydraulic fluid loss in one of the three hydraulic systems.
Initially, an alerting message appears in the main (engine) display. The hydraulic system push-button on the control panel (HYD) is also lit. When the button is actuated, the monitor brings up the synoptic of the hydraulic system. The synoptic shows that there is low hydraulic fluid in system A. It also displays the new status of the system after the automatic resolution of the fault by inactivation of the system A pump. The consequences of the fault are reported and some actions are recommended to the pilot. The idea of a layered monitor can also be found in the nuclear industry [46]. In a similar fashion, top-level displays are used to monitor important parameters which provide indication of the overall plant status. Low-level displays provide a focus on subsystems. When a process anomaly is detected, the monitoring system generates appropriate messages to guide the operator from main displays to appropriate subsystem displays.

Figure 6. A hydraulic failure reported by the MD-11 EIS [38]

4. On-line Fault Diagnosis

If faults and their effects had an exclusive one-to-one relationship, then complex failures would always exhibit unique and easily identified symptoms. In practice, however, more than one fault often causes identical or very similar symptoms. The control of such failures requires on-line fault diagnosis, in other words the isolation of the underlying causal faults from a series of observable effects on the monitored process. Fault diagnosis is a demanding task for system operators. It is performed under the stressful conditions that normally follow a serious failure, and the results have to be timely in order to allow further corrective action. The task also requires a wide range of expertise, which typically includes a functional understanding of components and their interrelationships, knowledge of the failure history of the system, and familiarity with system procedures [47]. The complicated nature of the task has motivated considerable research towards its automation. The common goal of this research has been the development of monitors that can perform diagnoses by effectively combining on-line observations of the system (the term observations is used here in a general sense and may include facts derived through a dialogue with the operator) and some form of coded knowledge about the system and its processes. Beyond this common goal, though, there is an interesting, and not yet concluded, debate concerning the type of coded knowledge that would be most appropriate for fault diagnosis. Should this knowledge be a set of problem-solving heuristics generated by an expert diagnostician, or should it be based on explicit models of the system? And if we assume the latter, which type of model would be more suitable for diagnostic reasoning: a model of normal behaviour of the system, for example, or a model of disturbed behaviour? This debate has played an important role in the evolution and diversification of research on automated fault diagnosis. In the context of this paper, the debate concerning the type of knowledge and its representation provides an observation point from which we can see diagnostic methods converging and forming "families", and then those families of methods diverging as different approaches to the same problem.
Figure 7 classifies diagnostic methods using as a criterion the type of knowledge they use as a basis for the diagnostic reasoning. At the first level of this classification we identify three classes of methods: the rule-based experiential/heuristic approach, model-based approaches and data-driven approaches. Within the model-based approach, we make a further distinction between two classes of diagnostic methods: those that rely on models of disturbed behaviour and those that rely on models of the normal behaviour of the system. Models of disturbed behaviour include fault propagation models that are typically used in reliability engineering, and qualitative causal process graphs such as the digraph. Models of normal behaviour include models that can be used to represent the functional decomposition of the system, models of system topology and component behaviour, and qualitative simulation models. Finally, the third class of methods in Figure 7 includes all those methods that infer process anomalies from observations of historical trends of process parameters. We collectively characterise those methods as data-driven approaches. A further discussion of the above methods follows in the remainder of this section.

Figure 7. Families of diagnostic methods: the rule-based experiential/heuristic approach; model-based approaches using models of disturbed behaviour (fault propagation models - cause consequence tree, cause consequence diagram, master fault tree; qualitative causal process graphs - digraph, event graph, logic flowgraph) or models of normal behaviour (models of functional decomposition - goal tree success tree, multilevel flow model; models of system structure and component behaviour, i.e. diagnosis from first principles; qualitative models of normal behaviour, i.e. diagnosis via qualitative simulation); and data-driven approaches (statistical process monitoring, monitoring of qualitative trends, neural networks trained with historical data for pattern recognition)

4.1 Rule-based Expert Systems

Diagnosis here is seen as a process of observing anomalous symptoms and then relating these symptoms to possible malfunctions using rules that describe relationships between faults and their effects. This approach, often termed "shallow reasoning" [48], originated from early work in medical diagnosis and the well-known MYCIN expert system [49]. Similar diagnostic monitors for engineered systems started to appear in the early eighties. These systems reason about process disturbances by operating on a knowledge base that contains two categories of knowledge: (a) declarative knowledge in the form of statements about the system, e.g. its initial conditions or configuration (facts), and (b) procedural knowledge in the form of production rules. Production rules typically describe functional relationships between system entities and causal relationships between failures and effects. Such rules are formed from empirical associations made by expert troubleshooters, or they are derived from analysis of plant models and theoretical studies of plant behaviour. The theory underlying rule-based systems is well documented in the literature (see, for example, [50,51]). In this paper, we assume some basic knowledge of this theory to discuss the application of such systems in fault diagnosis.
Diagnosis here is achieved with either forward or backward chaining of rules. In a typical diagnosis with forward chaining, a deviation of a monitored system parameter will trigger and fire a production rule. When the rule is fired, its consequent (conclusion) is considered a new fact that in turn may trigger other rules which anticipate this fact as one of their antecedents (prerequisites). The derived facts will then trigger a chain of rules, and the process is repeated until the most recently derived facts cannot trigger any other rules. These facts describe the causes of the disturbance. Another way to perform diagnosis is by testing fault hypotheses [52]. A fault hypothesis is typically a malfunction that operators or the monitor hypothesise to be true on the basis of some abnormal indications. Testing fault hypotheses requires a different inference mechanism called backward chaining. In backward chaining, the knowledge base is first scanned to see if the hypothesis is a consequent of a rule. The antecedents of the rule then form the next set of hypotheses. Hypothesis testing continues recursively until some hypothesis is false or all hypotheses are verified with real-time observations, in which case the system confirms the original fault hypothesis. An example of a diagnostic monitor which incorporates both inference paradigms is REACTOR [53,54], an experimental monitor for nuclear power plants. FALCON [55,56] is a similar prototype rule-based system that identifies the causes of process disturbances in chemical process plants. One of the problems of early expert systems was their inability to record history values and, in consequence, their inability to perform temporal reasoning. This limitation has prevented the application of early rule-based systems to many diagnosis problems where it is important to watch both short-term and long-term trends of system parameters. In more recent developments, though, we can see temporal extensions to the rule-based approach [57]. RT-WORKS, for example, is a rule-based expert system shell which also uses a frame representation language [58] to enable the definition of object classes. Each class has attributes, called frame slots, which can be used to represent monitored variables. Each frame slot can point to buffers where a series of values and their associated time tags are kept. An instance of the class is therefore defined not only in terms of its present state, but also in terms of its history. Figure 8 illustrates an instance of a hypothetical BATTERY frame class. The system allows production rules to reference the attributes of such frames and the inference engine to retrieve/insert history values in the buffers. Rules can also reference a number of statistical and other mathematical functions, which allow the inference engine to calculate trends and reason about past and future events.

Figure 8. An instance of a hypothetical BATTERY frame [58], in which slots such as Location, Input, Voltage and Status each hold time-tagged histories of values (e.g. Voltage = 34.4 at 7:12, 34.7 at 7:43, 36.5 at 8:21; Status = normal at 7:12, abnormal at 8:21)

A number of diagnostic monitors have adopted this hybrid rule/frame-based approach as a more powerful framework for knowledge representation. Ramesh et al., for example, have developed a hybrid expert system that is used to analyse disturbances in an acid production plant [59]. RESHELL is a similar expert system for fault diagnosis of gradually deteriorating engineering systems [60].
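The two inference mechanisms described at the start of this subsection can be sketched in a few lines. The rules and facts below are invented for illustration; the engine is a bare-bones rendering of forward chaining (derive new facts until a fixed point) and backward chaining (recursively verify the antecedents of a hypothesis), without the conflict resolution or temporal features of systems such as REACTOR or RT-WORKS.

```python
# Each rule maps a set of antecedent facts to a consequent (illustrative content).
RULES = [
    ({"coolant flow low", "pump commanded on"}, "pump degraded"),
    ({"pump degraded", "pump current high"}, "pump impeller damaged"),
]

def forward_chain(facts: set[str]) -> set[str]:
    """Fire rules repeatedly until no new facts can be derived."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in RULES:
            if antecedents <= derived and consequent not in derived:
                derived.add(consequent)          # the conclusion becomes a new fact
                changed = True
    return derived - facts                       # facts produced by the chaining

def backward_chain(hypothesis: str, observations: set[str]) -> bool:
    """Test a fault hypothesis by recursively verifying rule antecedents."""
    if hypothesis in observations:
        return True
    supporting = [ante for ante, cons in RULES if cons == hypothesis]
    return any(all(backward_chain(a, observations) for a in ante) for ante in supporting)

obs = {"coolant flow low", "pump commanded on", "pump current high"}
print(forward_chain(obs))                              # {'pump degraded', 'pump impeller damaged'}
print(backward_chain("pump impeller damaged", obs))    # True
```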
An additional problem in diagnosis is the possible uncertainty about the implication relationship between observations made and conclusions reached in the course of the diagnosis. This uncertainty is often inevitable, especially when the diagnosis exploits production rules that represent empirical associations about which we can only have a certain level of confidence. There can also be uncertainty about the antecedents or consequents of production rules when, for example, the verification of the facts that they anticipate is not a straightforward task but requires human judgement. Probably the most common way of representing this uncertainty is the use of certainty factors [49]. A certainty factor is a heuristic index (number) which associates a level of confidence with a rule, an antecedent or a consequent. Systems using such factors can arrive at one or more diagnostic conclusions with a degree of certainty. This is achieved as the inference engine uses certainty factors to continuously update belief values about antecedents or consequents that are encountered in the course of the inference process. The technique was first described in the MYCIN project [49] and it was later applied in other systems such as the mineral exploration expert system PROSPECTOR [61]. Rule based systems have demonstrated several benefits of separating knowledge from reasoning. They have demonstrated, for example, that the same principles can be applied to solve diagnostic problems in a number of domains, which span from medical diagnosis to troubleshooting of electromechanical devices. Rule-based systems can also perform both fault diagnosis and fault hypothesis testing, by operating on the same knowledge base and switching to the appropriate inference mechanism. Recently, rule-based diagnostic systems have been successfully combined with neural networks, an emerging and promising technology for the primary detection of symptoms of failure in industrial processes. QUINCE [62,63], for example, is a hybrid system that relies on neural networks to detect the primary symptoms of disturbances and rules that reflect expert knowledge to relate those symptoms to root faults. In this system, a neural network is trained to predict the values of monitored parameters. Provided that the network is trained on patterns that reflect normal conditions of operation, the prediction error of the network can be used as an indicator of how unexpected new values of the monitored parameters are. This in turn can be used to signal anomalies in the system. A set of observed such anomalies form a symptoms vector which is then multiplied by a fault matrix, in which expert rules that describe causal links between faults and symptoms have been encoded. The result of this operation indicates the most likely underlying fault that can explain the observed symptoms. This system has been used for monitoring of vibration harmonics and fault diagnosis in aircraft engines. In general, the application of rule-based systems in engineering has been very successful in processes that can be described using a small number of rules. However, attempts of wider application to complex systems have shown that large rule-based systems are prone to inconsistencies, incompleteness, long search time and lack of portability and maintainability [18,64]. From a methodological point of view, many of these problems can be attributed to the limitations of the production rule as a model for knowledge representation [65]. 
Production rules allow homogeneous representation and incremental growth of knowledge [66]. At the same time, though, simple implication relationships offer analysts very limited help in thinking about how to represent properties of the system, or in thinking about how to use such knowledge for reasoning about the nature of disturbances. These problems underlined the need for more elaborate models for knowledge representation and created a strong interest in model-based diagnostic systems.

4.2 Fault Propagation Models

The first monitor with some diagnostic capabilities that employed a fault propagation model was STAR [67], a small experimental system for the OECD Halden Reactor in Norway. STAR made use of Cause Consequence Diagrams (CCDs) to model the expected propagation of disturbances in the reactor. CCDs represent in a formal, logical way the causal relationships between the status of various system entities during a disturbance. They are typically composed of various interconnected cause and consequence trees. A cause tree describes how logical combinations of system malfunctions and their effects lead to a hazardous event. On the other hand, a consequence tree starts with the assumption that a hazardous event has occurred and then describes how other conditions and the success or failure of corrective measures lead the system to safe or unsafe states. Figure 9 illustrates an example cause consequence diagram. CCDs are used in STAR as templates for systematic monitoring of the system. The monitoring model is updated every five seconds with fresh process feedback, and every event in the CCD that can be verified with real-time data is checked for occurrence. An analysis module then examines the model for event chains and, by extrapolation of these chains, it determines possible causes of the disturbance and possible subsequent events. The EPRI-DAS [36] system uses a similar model, the Cause Consequence Tree (CCT), to perform early matching of anomalous event patterns that could signify the early stages of hazardous disturbances propagating into the system. The Cause Consequence Tree is a modified Cause Tree in which delays have been added to determine the precise chronological relationships between events. Despite some methodological problems, such as the difficulties in the development of representative CCDs and problems of computational efficiency [9], the principles that these early model-based systems have introduced are still encountered in contemporary research. The safety monitors recently developed for the Surry and San Onofre nuclear power plants in the USA [68], for example, also employ a fault propagation model to assist real-time safety monitoring. This is a master fault tree [69] synthesised from the plant Probabilistic Risk Assessment (PRA). Varde et al. [70] have proposed an alternative way of exploiting fault trees for diagnostic purposes. They describe a hybrid system in which artificial neural networks perform the initial primary detection of anomalies and transient conditions, and a rule-based monitor then performs diagnoses using a knowledge base that has been derived from the fault trees contained in the plant PRA. The knowledge base is created as fault trees are translated into a set of production rules, each of which records the logical relationship between two successive levels of the tree.
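A minimal sketch of this translation is shown below for a small, hypothetical fault tree: one rule is produced per gate, linking two successive levels of the tree. The tree content and the textual rule format are illustrative, not taken from the plant PRAs discussed above.

```python
# A small, hypothetical fault tree: each intermediate event is defined by a
# gate (AND/OR) over its immediate causes.
FAULT_TREE = {
    "loss of cooling":        ("OR",  ["both pumps unavailable", "heat exchanger blocked"]),
    "both pumps unavailable": ("AND", ["pump A failed", "pump B failed"]),
}

def tree_to_rules(tree: dict) -> list[str]:
    """Translate every gate into one production rule linking two successive levels."""
    rules = []
    for effect, (gate, causes) in tree.items():
        joiner = " AND " if gate == "AND" else " OR "
        rules.append(f"IF {joiner.join(causes)} THEN {effect}")
    return rules

for rule in tree_to_rules(FAULT_TREE):
    print(rule)
# IF both pumps unavailable OR heat exchanger blocked THEN loss of cooling
# IF pump A failed AND pump B failed THEN both pumps unavailable
```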
A traditional limitation of fault propagation models has been the difficulty of representing the effects that changes in the behaviour or structure of complex dynamic systems have on the failure behaviour of those systems. To illustrate this point we will consider, for example, the fuel system of an aircraft. In this system, there are usually a number of alternative ways of supplying the fuel to the aircraft engines. During operation, the system typically switches between different states in which it uses different configurations of fuel resources, pumps and pipes to maintain the necessary fuel flow. Initially, for example, the system may supply the fuel from the wing tanks and, when these resources are depleted, it may continue providing the fuel from a central tank. The system also incorporates complex control functions such as fuel sequencing and transfer among a number of separated tanks to maintain an optimal centre of gravity. If we attempt to analyse such a system with a technique like fault tree analysis we will soon be faced with the difficulties caused by the dynamic character of the system. Indeed, as components are activated, deactivated or perform alternative fuel transfer functions in different operational states, the set of failure modes that may have an adverse effect on the system changes. Almost inevitably, the causes and propagation of failure in one state of the system are different from those in other states. In that case, though, how can such complex state dependencies be taken into account during the fault tree analysis, and how can they be represented in the structure of fault trees?

Figure 9. Cause Consequence Diagram (CCD)

To address this problem, Papadopoulos and McDermid [71] introduce, at the centre of the assessment process, a dynamic model that can capture the behavioural and structural transformations that occur in complex systems. In its general form, this model is a hierarchy of abstract state-machines (see right-hand side of Figure 10) which is developed around the structural decomposition of the system (see left-hand side of Figure 10). The structural part of that model records the physical or logical architectural decomposition of the system into sub-systems and basic components, while the state machines determine the behaviour of the system and its sub-systems. The lower layers of the behavioural model identify transitions of low-level sub-systems to abnormal states, in other words states where those sub-systems deviate from their expected normal behaviour. As we progressively move towards the higher layers of the behavioural model, the model shows how logical combinations or sequences of such lower-level sub-system failures (transitions to abnormal states) propagate upwards and cause failure transitions at higher levels of the design. As Figure 10 illustrates, some of the transition conditions at the lower levels of the design represent the top events of fault trees which record the causes and propagation of failure through the architectures of the corresponding sub-systems. A methodology for the semi-mechanical construction of such models is elaborated in [72].
According to that methodology (see also [73,74]), the fault trees which are attached to the state-machines can be mechanically synthesised by traversing the structural model of the system and by exploiting some form of local component failure models, which define how the corresponding components fail in the various states of the system. As opposed to a classical fault tree or a cause consequence diagram, the proposed model records the gradual transformation of lower-level failures into system malfunctions, taking into account the physical or logical dependencies between those failures and their chronological order. Thus, the model not only situates the propagation of failure in the evolving context of the system operation, but also provides an increasingly more abstract and simplified representation of failure in the system. Furthermore, the model can be mechanically transformed into an executable specification, upon which an automated monitor could operate in real time. In [72], Papadopoulos describes the engine of such a monitor, in other words a set of generic algorithms by which the monitor can operate on the model in order to deliver a wide range of monitoring functions. These functions span from the primary detection of the symptoms of disturbances, through on-line fault diagnosis, to the provision of corrective measures that minimise or remove the effects of failures. This concept has been demonstrated in a laboratory model of an aircraft fuel system.

Figure 10. Placing Fault Trees in the Context of a Dynamic Behavioural Model

4.3 Qualitative Causal Process Graphs

The second class of diagnostic methods that rely on a model of disturbed behaviour is formed by methods which use qualitative causal process graphs. These graphs generally provide a graphical representation of the interactions between process variables in the course of a disturbance (although in the general case those interactions may also be valid during normal operation). Such models include the digraph [75-77], the logic flowgraph [78,79] and the event graph [80-82]. A digraph models how significant qualitative changes in the trend of process parameters affect other parameters. The notation uses nodes that represent process parameters and arrows that represent relationships between these parameters. If a process variable affects another one, a directed edge connects the independent and dependent variable and a sign (-1, +1) is used to indicate the type of relationship. A positive sign indicates that changes in the values of the two variables occur in the same direction (e.g. an increase will cause an increase), while a negative sign denotes that changes occur in the opposite direction. A diagnostic system can exploit these relationships to trace backwards the propagation of failure on the digraph and locate the nodes that have deviated first as a direct effect of a fault. We will use a simple example to demonstrate the basic principles underlying the application of digraphs in fault diagnosis. Figure 11 illustrates a heat exchanger and part of the digraph of this system.
In this system, the hot stream follows the path between points one and two (1,2) and the cooling stream follows the path between points three and four (3,4). Since there are no control loops, any disturbance in the input variables (the temperature and flow at the inputs of the hot stream [T1, M1] and the cooling stream [T3, M3]) will cause a disturbance in the output temperature of the hot stream (T2). The digraph determines this relationship of T2 with the four input parameters. It also anticipates the possibility of a leak in the cooling system, through a normally closed valve for example. This event is represented as the flow of leaking coolant M5. Finally, the digraph defines the relationships between the leakage (M5), the pump speed (P6) and the coolant flow (M3). Let us now assume that the diagnostic monitor observes the event “excessive temperature at the output of the hot stream”, which we symbolise as T2(+1). The task of the monitor is to locate one or more nodes, termed primary deviations [76], that can explain the event T2(+1) and any other abnormal event in the digraph. From the relationships between T2 and its adjacent nodes the system can determine that the possible causes of T2(+1) are T1(+1), M1(+1), T3(+1) or M3(-1). If all the process variables were observable, the system would be able to select a node in which to proceed with the search. If M3(-1) was true, for example, the set of possible causes would change to {P6(-1), M5(+1)}. If we further assume that P6(-1) was true, the system would be able to locate the cause of the disturbance to a single primary deviation. In practice, though, only a small proportion of digraph nodes is monitored and, hence, the diagnosis is likely to be ambiguous and to contain more than one possible primary deviation [64]. In the DIEX expert system [83], for example, every time a node deviates the system initiates a backward search on the digraph to determine a set of potential primary deviations. This search is terminated when a boundary of the digraph is reached, a normal node is met or the same node is encountered twice. With each new abnormal indication, a new backward search is initiated starting from the disturbed node, and the resulting set of candidate nodes is intersected with the existing set of primary deviations to update the diagnosis. To complete the diagnosis, the system reports a list of potential faults that have been statically associated with each primary node.

Figure 11. Heat exchanger system and digraph
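A minimal sketch of this style of backward search is given below, under the assumption that the digraph is stored as a simple adjacency structure; the representation and node names follow the heat exchanger example in the text, but the code is an illustration of the general idea rather than the DIEX algorithm [83].

```python
# Signed digraph for the heat exchanger: cause -> list of (effect, sign).
EDGES = {
    "T1": [("T2", +1)],
    "M1": [("T2", +1)],
    "T3": [("T2", +1)],
    "M3": [("T2", -1)],
    "P6": [("M3", +1)],
    "M5": [("M3", -1)],
}

def parents(node):
    """Return (cause, sign) pairs for all edges that point at `node`."""
    return [(c, s) for c, out in EDGES.items() for (e, s) in out if e == node]

def candidate_primary_deviations(node, direction, visited=None):
    """Backward search: which deviations of upstream nodes could explain `node`
    deviating in `direction` (+1 high, -1 low)?  The search stops at digraph
    boundaries or when a node is revisited; measurements of normal nodes are
    omitted here for brevity."""
    visited = visited or set()
    if (node, direction) in visited:
        return set()
    visited.add((node, direction))
    causes = parents(node)
    if not causes:                      # boundary of the digraph: primary deviation
        return {(node, direction)}
    found = set()
    for cause, sign in causes:
        found |= candidate_primary_deviations(cause, direction * sign, visited)
    return found

# First observation: T2(+1).
hypotheses = candidate_primary_deviations("T2", +1)
# Second observation: M3(-1); intersect, DIEX-style, with the existing set.
hypotheses &= candidate_primary_deviations("M3", -1)
print(hypotheses)   # {('P6', -1), ('M5', +1)}
```

Each new abnormal observation shrinks the candidate set, exactly as in the intersection step described above; with only T2 monitored the diagnosis would remain ambiguous among all five boundary deviations.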
Ulerich and Powers [84] have proposed a variant of this approach, where the diagnostic model is a fault tree that has been semi-mechanically generated from a system digraph. Let us use again the example of the heat exchanger, this time to illustrate the fault tree synthesis algorithm. The heat exchanger system is an open loop system. In a digraph like this, which does not contain closed control loops, the fault tree will have only OR gates [46]. This is a valid generalisation which reflects the fact that open loop systems do not have control mechanisms that can compensate for disturbances in the input variables. Any single disturbance is, therefore, likely to propagate and cause the deviation that represents the top event of the fault tree. To construct a fault tree for the top event T2(+1) we will first visit the nodes that affect T2 to determine the deviations that can cause the top event. These deviations {T1(+1), M1(+1), T3(+1), M3(-1)} form the next level of the fault tree and they are connected to the top event with an OR gate. If a node is independent (e.g. T1), the corresponding event (T1(+1)) is a primary deviation and becomes a terminal node in the fault tree. When a node is dependent, the search continues in a recursive way until all nodes connected to a dependent node are themselves independent. Figure 12 illustrates the fault tree synthesised for the heat exchanger by application of this algorithm. It can be noticed that nodes in this fault tree contain abstract qualitative descriptions. For the diagnosis, though, it is essential to replace at least some of these descriptions with expressions that can be monitored. The solution proposed by Ulerich is to attach to each root node of the fault tree an AND gate which determines the verification conditions for this event. Fault diagnosis is then accomplished via on-line monitoring of the minimal cutsets derived from this tree. This method has been demonstrated on a single closed control loop of a mixed-heat exchanger system. We must point out that the simple recursive algorithm that we have described can only be applied on systems without control loops. Lambert has analysed the digraph of a generic negative feedback loop system and derived a generic fault tree operator for the synthesis of fault trees for systems with such closed loops [46]. Lapp and Powers have also developed algorithms for digraphs with negative feedback and negative feed-forward loops. The synthesis of fault trees from digraphs has demonstrated benefits from formalising, systematising and automating the development of fault trees. However, the application of this method in fault diagnosis presents many difficulties, mainly because the system under examination should be modelled so as to fall into one of the categories for which a synthesis algorithm has been developed. A traditional limitation of the digraph approach has been the difficulty in developing digraphs for large process plants. To overcome this limitation, Chung [85] has proposed a method for the automatic construction of large plant digraphs from smaller digraphs of basic plant units. In this work, the plant digraph is synthesised from the topology of the plant and a library of models describing common plant units such as valves, pipes, pumps and vessels [86,87]. A number of other variations of the basic digraph approach have been proposed to enhance the expressiveness of the model and extend its application in complex systems. Kramer and Palowitch, for example, have developed an algorithm to convert a digraph into a concise set of production rules [88]. Such rules can be integrated with heuristic rules to achieve a more flexible and potentially powerful scheme for fault diagnosis.

Figure 12. Automatically generated fault tree for heat exchanger

Guarro et al. propose an extended digraph notation called the logic flowgraph [78]. The notation maintains the basic units of nodes and edges to represent process variables and their dependencies, but introduces an extended set of symbols and representation rules to achieve a greater degree of flexibility and modelling capability. Continuous and binary state variables, for example, are represented using different symbols.
Logical gates and condition edges can also be used to show how logical combinations of conditions can affect the network of variables. Yau, Guarro and Apostolakis have recently demonstrated the use of dynamic flowgraphs as part of their methodology for the development of a flight control system for the Titan II Space Launch Vehicle [79,89]. The Model-Integrated Diagnostic Analysis System (MIDAS) [80] uses another type of qualitative causal graph, the event graph, to perform monitoring, validation of measurements, primary detection of anomalies like violations of quantitative constraints and diagnostic reasoning. In the event graph, nodes still represent qualitative states of system parameters. However, edges do not simply represent the directional change between qualitative states. They are indicative of events and can depict relationships between different states of the same variable (intravariable links) or relationships between states of different variables (intervariable links). In MIDAS, such event graphs can be built semi-automatically from system digraphs. Although the general problem of diagnosis of multiple faults remains intractable, the system can diagnose multiple faults provided that each fault influences a different set of measurements [83]. Thus, it overcomes the single-fault assumption, which has seriously limited the application of traditional digraph-based approaches to diagnosis [81]. Tarifa and Scenna [90] attempt to address the difficulties arising in the development of large digraphs and the associated problems of efficiency of search algorithms, by proposing a method for structural and functional decomposition of the monitored process. They have reported an application of this method to a multistage flash desalination process plant that was successful in terms of both performance and computational cost [91]. Digraph-based approaches and their derivatives are process oriented, in the sense that they focus on process variables and exploit models of the propagation of disturbances between those variables. Within this framework, though, the modelling and safety monitoring of a system with complex control and operating procedures becomes extremely difficult. Indeed, the application of these approaches has been confined to fairly simple automated plant processes. However, we should note that the application of process digraphs in combination with other methods, statistical techniques for example [92], is still being investigated and some interesting results are reported in the literature, for instance new efficient algorithms for the diagnosis of multiple faults [93]. For more information on these developments the reader is referred to the work carried out in the Laboratory for Intelligent Process Systems at Purdue University3.

3 Information on this work can currently be found at www.lips.ecn.purdue.edu. This work is also linked to the Abnormal Situation Management Joint Research and Development Consortium led by Honeywell, formed in 1992 to develop technologies that would help plant operators in dealing with abnormal situations. Beyond Honeywell, the consortium also comprises Purdue and Ohio State Universities as well as a number of major oil and petrochemical companies.

4.4 Models of Functional Decomposition

Let us now turn our attention to those diagnostic methods that rely on a model of the normal behaviour of the system. One model that we have already introduced in section 3.7 is the Goal Tree Success Tree (GTST). A GTST depicts the hierarchical decomposition of a high-level safety objective into sets of lower-level safety goals and logical combinations of conditions that guarantee the accomplishment of these goals. The typical structure of a GTST is illustrated in Figure 13. GTSTs are logical trees which contain only AND and OR gates. Simple value-propagation algorithms can therefore be used by a monitor to update nodes of the tree that are not observable and to provide alarms for the critical functions that the system fails to maintain. On the other hand, simple search algorithms can be applied on the tree structure to locate the causes of functional failures. GOTRES [94], for example, incorporates an inference engine which can perform a depth first search on the tree to determine the reasons why a goal has failed.

Figure 13. Typical structure of a GTST [18]

An advancement of GOTRES is FAX [18], an expert system which uses a probabilistic approach and a breadth first search algorithm to update belief values in the tree structure. The system is, thus, able to decide which sub-goal is more likely to have failed and should be further investigated when the process indicates that a parent goal has failed. FORMENTOR [95] is another real-time expert system for risk prevention in hazardous applications which uses value propagation and search algorithms on a GTST to deliver monitoring and diagnostic functions. The GTST has been successfully applied in a number of realistically complex pilot applications in the nuclear [18], space [40] and petrochemical industries [96,97]. An alternative model of functional decomposition is the Multilevel Flow Model (MFM). Like the GTST, the MFM is a normative model of the system [98], a representation of what the system is supposed to do (system goals) and the functions that the system employs to reach these goals. The MFM defines three basic types: goals, functions, and components. The model is developed as a network of goals which are supported by a number of system functions which in turn are supported by components, as Figure 14 illustrates. The notation defines a rich vocabulary of basic mass, energy and information flow functions and their semantics. Beyond this decomposition of goals and functions, the notation also allows the description of the functional structure of the system. Each function in the hierarchy can be represented as a network of connected flow sub-functions. Larsson has described an algorithm which produces a “backward chaining style” of diagnosis using MFM models [19]. Triggered by a failure of a goal, the monitor hypothesises failures of functions, examines the networks of flow sub-functions for potential candidates and ultimately arrives at diagnoses of component malfunctions. Larsson argues that the MFM is particularly suitable for diagnosis with hard real-time constraints. By exploiting the hierarchy of goals and functions it is possible to produce a rough diagnosis quickly and then use any additional time to localise the fault with more precision.

Figure 14. Abstract view of a multilevel flow model [19]
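Both the GTST and the MFM lend themselves to very simple propagation and search schemes of this kind. As an illustration, the following sketch evaluates a small AND/OR goal tree bottom-up from observed success conditions and reports the leaf conditions that explain the loss of a goal; the tree, the goal names and the observations are invented for the example and do not reproduce GOTRES [94] or FAX [18].

```python
# Minimal bottom-up evaluation of an AND/OR goal tree (GTST-style).
# Leaves are monitored success conditions; internal nodes are goals.
TREE = {
    "maintain_core_cooling": ("AND", ["primary_circuit_ok", "heat_removal_ok"]),
    "heat_removal_ok":       ("OR",  ["main_feedwater_ok", "auxiliary_feedwater_ok"]),
}

def achieved(node, obs):
    """Value propagation: is the goal `node` currently achieved?"""
    if node not in TREE:                       # leaf: a monitored success condition
        return obs.get(node, False)
    gate, children = TREE[node]
    values = [achieved(c, obs) for c in children]
    return all(values) if gate == "AND" else any(values)

def why_failed(node, obs):
    """Depth-first search into failed branches, returning the leaf conditions
    that explain the loss of the goal."""
    if node not in TREE:
        return [node]
    _, children = TREE[node]
    return [leaf for c in children if not achieved(c, obs)
            for leaf in why_failed(c, obs)]

obs = {"primary_circuit_ok": True,
       "main_feedwater_ok": False,
       "auxiliary_feedwater_ok": False}
if not achieved("maintain_core_cooling", obs):
    print("alarm: maintain_core_cooling lost because of",
          why_failed("maintain_core_cooling", obs))
# -> alarm: maintain_core_cooling lost because of
#    ['main_feedwater_ok', 'auxiliary_feedwater_ok']
```

The same structure supports the coarse-to-fine behaviour noted above: a failed high-level goal can be reported immediately, while the descent into failed branches refines the explanation as time allows.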
4.5 Diagnosis from First Principles

A number of diagnostic methods have adopted an approach to diagnosis known as “diagnosis from first principles”.
Such methods use a model of the system structure and normal behaviour of its components to make predictions about the state of various system parameters. Diagnosis here is seen as a process of interaction between prediction and observation of the physical system, which is driven by discrepancies between predicted values and actual measurements. The theoretical foundation of this approach is Reiter’s [99] original diagnosis algorithm and its corrected version [100]. Reiter uses a model of system structure and component behaviour to derive a description of the system expressed in logic. Together with a set of observations of the system, this description provides the basis for computing a minimal set of possible hypotheses about component malfunctions that can explain observed deviations from the predicted (normal) behaviour. Davis and Hamscher [65] identify three stages in model-based diagnosis: hypothesis generation, hypothesis testing and hypothesis discrimination. Given a discrepancy, the first step in the diagnostic process is to identify which components might have misbehaved and caused that discrepancy (hypothesis generation). These components are called suspects. In the second stage of the process (hypothesis testing), we can eliminate a number of suspects, by hypothesising their failure and checking in the model if this condition is consistent with other observations of the system. Finally, if more than one hypothesis survives testing, we need to determine what additional observations of the system would be required to allow further discrimination among the remaining suspects (hypothesis discrimination). In generating hypotheses, one can use the structure of the system or properties of components to constrain the number of suspects. Consider, for example, the model of an electronic circuit, which consists of three multipliers and two adders (Figure 15), and assume that an incorrect output (e.g. F=10) has been observed. In this type of model one can safely assume that to be a suspect a component must be connected to the deviation. Application of this rule in this example reduces the suspects to ADD-1, MULT-1 and MULT-2. We can further reduce the number of suspects, by exploiting measurements that indicate more than one discrepancy between the model and the system. If the observations were F=10 and G=10, for example, then searching backwards from F would have generated the set of suspects {ADD-1, MULT-1, MULT-2}, while tracing back the dependencies from G would have generated the set {ADD-2, MULT-2, MULT-3}. Assuming a single point of failure, we can locate the suspects in the intersection between the two sets (MULT-2). This scheme is easily expanded to deal with multiple failures, where in the general case each hypothesis would be a set of components that could possibly account for the observed anomalies [65]. Let us now turn to the second stage of the diagnostic process and examine hypothesis testing. There are two main approaches to this task: constraint suspension and assumption-based truth maintenance systems. In constraint suspension [101], the behaviour of each component is modelled as a set of constraints. These constraints define all the possible relationships between the inputs and outputs of the device, or stated more generally, the inferences that can be drawn about the state of one terminal if we know the state of the other terminals. Figure 16, for example, illustrates an adder and the constraints that describe the normal behaviour of this component. 
Figure 15. Multiplier and adder circuit [65]

Figure 16. A model of the behaviour of an adder (constraints: C=A+B, A=C-B, B=C-A)

By modelling each component in this way, we effectively transform the model of the system into a network of constraints where we can propagate values in any direction, backwards or forwards, using relationships between component inputs and outputs. Assuming that we can supply the network with a set of measurements, we can then use the network to predict the state of other non-observable terminals within the model. An important property of this process is that when some of the input measurements are abnormal the predicted values are inconsistent. These inconsistencies arise as the predictive engine propagates both normal and abnormal values and, inevitably, fires constraints from different directions which generate different values for the same input or output terminal. In hypothesis testing we assume that the behaviour of the suspect (normal or faulty) is unknown. This is represented by a suspension of the suspect’s constraints [65]. We then supply the current observations of the system and let the rest of the network propagate values. If at the end of the propagation process the network is still inconsistent, even with the suspect’s constraints suspended, this indicates that the problem lies somewhere else, in other words, that a component other than the (suspended) suspect is misbehaving. In those circumstances, the current hypothesis is rejected. If, however, the network is consistent, we can safely conclude that the suspended component could account for the original inconsistency. Hence, the current hypothesis has survived testing, and the component remains a suspect. The more observations we supply to the network, the more hypotheses are rejected and the more precise the diagnosis becomes. Some systems such as GDE (General Diagnostic Engine) [102] provide a single mechanism for hypothesis generation and testing. The underlying technology in GDE is an assumption-based truth maintenance system [103], a predictive engine that propagates not only values, but also the assumptions that these values carry with them. Since all values are generated from a model of correct behaviour, the assumption underlying each value is that all components involved in the calculation of the value are functioning correctly. To illustrate this let us take the example [104] of the standard adder-multiplier circuit illustrated in Figure 17. By propagating the values given for the inputs and output F, a truth maintenance system will determine that either X=6 with the assumption that MULT-1 is working or X=4 with the assumption that MULT-2 and ADD-1 are working. The conflicting values for X, though, imply that at least one of the three components implicated in the assumptions must be malfunctioning. In GDE, whenever such conflicts are encountered, the system generates a set of suspects by taking the union of the assumptions underlying the conflicting predictions. The diagnosis is achieved as the propagation of constraints generates more conflicts, and the sets of assumptions are then intersected to identify suspects that can explain all the conflicts. This mechanism can generate diagnoses of both single and multiple faults. GDE has been tested successfully in many examples in the domain of troubleshooting digital circuits [104].

Figure 17. Assumption based Truth Maintenance System [104]
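The flavour of this kind of assumption-carrying value propagation can be conveyed in a few lines on the adder-multiplier circuit of Figure 17; the sketch below is a toy illustration of the conflict-recording idea, not the GDE algorithm [102].

```python
# Toy constraint propagation with assumption tracking on the circuit of Fig. 17.
# Each derived value carries the set of components assumed healthy when it was
# computed.
A, B, C, D, E = 3, 3, 2, 2, 3
F_observed = 10            # the faulty observation; a healthy circuit would give 12

# Forward predictions from the inputs.
X_fwd = (A * C, {"MULT-1"})                              # X = A*C, assuming MULT-1 ok
Y_fwd = (B * D, {"MULT-2"})                              # Y = B*D, assuming MULT-2 ok

# Backward inference from the observed output F = X + Y.
X_bwd = (F_observed - Y_fwd[0], {"ADD-1"} | Y_fwd[1])    # X = F - Y

def conflict(pred1, pred2):
    """If two derivations of the same quantity disagree, at least one component
    in the union of their assumptions is faulty (a 'conflict set')."""
    (v1, a1), (v2, a2) = pred1, pred2
    return a1 | a2 if v1 != v2 else None

print(conflict(X_fwd, X_bwd))   # {'MULT-1', 'MULT-2', 'ADD-1'}
```

The printed conflict set corresponds to the statement in the text that at least one of MULT-1, MULT-2 and ADD-1 must be malfunctioning; further observations would generate further conflicts from which the candidate diagnoses are refined.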
Another system that performs “diagnosis from first principles” is KATE [105], a model-based system developed for NASA’s Systems Autonomy Program at Kennedy Space Centre. KATE has been used for monitoring and troubleshooting of simple electromechanical systems. However, the system is not suitable for diagnosis of dynamic systems with complex behaviour [12]. This fact is a reflection of the general problems of diagnosis from first principles. Constraint propagators, for example, rely on snapshot measurements and do not permit temporal reasoning, something essential for the analysis of disturbances that exhibit complex and variable symptoms. There are several other difficulties with this approach, for example, difficulties in representing and reasoning about complex behaviour and issues concerning the adequacy and efficiency of predictive engines. For a discussion of these problems the reader is referred to [19,65,66].

4.6 Qualitative Simulation

One possible approach to monitoring and diagnosis of dynamic, continuous systems is via qualitative modelling and simulation [106]. A continuous system is typically represented in control theory as a set of quantitative, first order differential equations. In qualitative modelling, this representation is transformed into a set of qualitative constraints, which in Kuipers’ work are expressed in the language of a tool called QSIM [106]. Some of the constraints in QSIM represent equalities: for example, MULT(k,x,y) is directly equivalent to y=kx, and DERIV(f1,f2) is equivalent to f2=d(f1)/dt. Constraints, though, can also represent other qualitative relationships between physical parameters: M+(x,y), for example, indicates that y is monotonically increasing with x. Figure 18 illustrates how a quantitative differential equation is converted into a set of such constraints. QSIM incorporates an algorithm which can simulate a system represented as a set of qualitative constraints. Given the description of the system and a set of initial states, the algorithm can predict qualitative values (increasing, stable, decreasing) and quantitative ranges (e.g. [3.4..5.6]) for the immediate successor state of each parameter. MIMIC [107] is a system which uses the predictive power of QSIM for fault diagnosis of dynamic devices. Triggered by an anomalous observation of the physical system, MIMIC first generates a set of fault hypotheses. For each hypothesis, MIMIC evokes and initialises a QSIM model of the system under the current hypothesis. If the observations of the system match the predictions of the model, the model and the corresponding hypotheses are retained. Otherwise, they are discarded. One problem with this approach is that the number of QSIM models that need to be specified a priori is equal to a potentially large number of plausible faults. An additional drawback is that unanticipated faults are not modelled and, therefore, go undetected [108]. To overcome those problems, Ng [109] has proposed a modified version of Reiter’s algorithm which integrates qualitative simulation with the theory of diagnosis from first principles. Ng retains Reiter’s model of structure and component behaviour as a basis for the representation of the system.
However, the normal behaviour of each component is now specified as a set of qualitative constraints and, hence, the constraint propagators used by Davis and deKleer are replaced by the predictive engine of QSIM. Ng has demonstrated his approach on three small processes: a proportional temperature controller, a pressure regulator and a toaster.

Figure 18. Transformation of a differential equation into a set of qualitative constraints [106]. The equation d²u/dt² - du/dt + arctan(ku) = 0 is recast using the auxiliary functions f1=du/dt, f2=df1/dt, f3=ku and f4=arctan(f3), so that f2 - f1 + f4 = 0, and is then expressed as the constraints DERIV(u,f1), DERIV(f1,f2), MULT(k,u,f3), M+(f3,f4) and ADD(f2,f4,f1).

4.7 Data Driven Approaches to Fault Detection and Diagnosis

In the preceding sections we have shown that one way to construct a predictive model of normal behaviour for diagnostic purposes is from in-depth knowledge about the system and its processes. An alternative way to construct such a model is from empirical data collected during the operation of the system. In that case the reconstructed model of the process should fit the collected historical data to a function that relates different parameters of that process, for example output parameters to input parameters and control settings. The theoretical advantage of this approach is that the development of the predictive model does not require deep knowledge of the system. The challenge here, of course, is to define an appropriate fitting function that generates accurate predictions. In univariate data-driven monitoring methods, the fitting function usually relates one dependent variable (e.g. an output parameter) to a number of independent variables (e.g. input parameters). The output of the model here is the value of the dependent variable. In simple systems, the fitting function can be a first-order linear polynomial with coefficients derived from the empirical data. In more complex systems, though, higher order non-linear functions may be required. In such systems, therefore, it is often difficult to arrive at an accurate model. An additional limitation of univariate modelling is that the quality of the prediction of the dependent variable is usually strongly contingent on the quality of the measurements of the independent variables. Univariate methods are now being replaced by more powerful multivariate approaches in which statistical methods are applied on sets of historical data to create models that can more accurately predict the values of both dependent and independent variables in a process. Principal component analysis [110] and partial least squares [111] represent the principal examples of such techniques. A vast body of literature reports on how discrepancies between the values predicted by those techniques and actual parameter values can be used in real-time to determine and locate possible disturbances in plant processes. Another approach to data-driven fault detection is that of qualitative trend analysis [112]. There are two steps in this approach. The first is the identification of trends in measurements and the second is the interpretation of trends into fault scenarios. Qualitative parameter trends can be constructed from a series of primitives, each of which describes the qualitative state of a parameter in a window of time. Dash and Venkatasubramanian [113] define five such primitives: parameter value stable, increasing with an increasing rate, increasing with a decreasing rate, decreasing with an increasing rate and decreasing with a decreasing rate.
In qualitative trend analysis, a sequence of such primitives forms trends that can then be related to fault scenarios. The primary identification of primitives is achieved by simple calculation of first and second derivatives or, alternatively, by using neural networks, which have the advantage that they can learn from examples and tolerate noise. The qualitative abstraction of data achieved in this approach, coupled with techniques for the recognition of trend patterns, can provide a way of dealing with large amounts of process data [114].
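As a simple illustration of the first of these two steps, the sketch below classifies a window of samples into one of the five primitives from the signs of numerically estimated first and second derivatives; the thresholds, function names and example data are illustrative assumptions and do not reproduce the procedure of [113].

```python
import numpy as np

def trend_primitive(window, eps=1e-3):
    """Classify a window of equally spaced samples into one of five qualitative
    primitives using the signs of the estimated first and second derivatives."""
    t = np.arange(len(window))
    d1 = np.polyfit(t, window, 1)[0]        # slope of a linear fit
    d2 = 2 * np.polyfit(t, window, 2)[0]    # curvature from a quadratic fit
    if abs(d1) < eps:
        return "stable"
    if d1 > 0:
        return ("increasing at an increasing rate" if d2 > 0
                else "increasing at a decreasing rate")
    return ("decreasing at an increasing rate" if d2 < 0
            else "decreasing at a decreasing rate")

# A slowly saturating rise, e.g. a temperature approaching a new steady state.
samples = [20.0, 23.0, 25.0, 26.2, 26.9, 27.3]
print(trend_primitive(samples))   # increasing at a decreasing rate
```

A sequence of such labels over successive windows forms the qualitative trend that is then matched against known fault scenarios.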
5. Failure Control and Correction

Failure control and correction is the final and probably most decisive stage of safety monitoring, where operators (or an automated system) have to take actions that will remove or minimise the effects of failure. Like fault diagnosis, this task is also performed under stressful conditions and requires a wide range of expertise which includes at least a functional understanding of the system and a working knowledge of any published remedial procedures. State of the art aircraft monitors, such as the A320 Electronic Centralised Aircraft Monitor [44] and the Engine Indicating and Crew Alerting System on the Boeing 757 [115], provide some degree of assistance in this task. The design of these systems reflects a common underlying philosophy. For each warning that the system generates, there is an associated corrective procedure. In its simplest form, a corrective procedure is a set of actions that should be performed without further investigation. Such procedures are usually displayed electronically by the monitor. More complex procedures, which define conditional sequences of actions, are represented as decision trees, and they are typically listed in the pilot’s Quick Reference Handbook. Figure 19 illustrates a sample page of the McDonnell Douglas MD-11 aircraft handbook [2], which describes the procedure for an avionics compartment overheat (AVNCS COMPT OVHT) alert: a conditional sequence of checks and actions on the air system and on air conditioning packs 1 and 3, each branch of which ends with the configuration that must be retained for the remainder of the flight.

Figure 19. Quick Reference Handbook, MD-11 [2]

Several of the experimental monitors that we have examined in other domains follow similar approaches to failure control and correction. FORMENTOR [95], for example, relates nodes of the monitoring model (GTST) that represent significant alarms to FMEAs that provide corrective measures for these alarms. Varde et al. [70], on the other hand, have developed a system which generates corrective measures from event trees. Event trees are typically employed in safety analysis to determine the effect of a succession of responses to an initiating hazardous event. These approaches work insofar as there are no conflicts between the procedures associated with each alarm or initiating event. Problems arise, though, in cases where multiple failures or propagating disturbances cause symptoms that require conflicting remedies. An example of such a conflict is a condition where a high fuel temperature in an aircraft coincides with an engine stall. The procedure to cure high fuel temperature requires that the thrust on the associated engine is increased to aid cooling, while the corrective measure for an engine stall is to reduce thrust. Hill [116,117] has proposed a model, called the Goal Ordered Search Tree (GOST), that can accommodate such conflicting safety goals, and a method to use such a model for dynamic generation of optimum corrective measures. The GOST forms an aircraft status metric, by means of which the system can assess various procedures and decide which configuration optimises the status metric and therefore system safety. These procedures are optimal in the sense that they maximise system safety by securing as many as possible of the most critical functions. The GOST is a network of abstract nodes that represent aircraft critical functions and nodes that represent equipment status. Each node has a binary value (1,0), which shows whether the state represented by the node is achieved or not. A second attribute indicates the priority (or criticality) of the function that the node represents. Nodes are linked with directed arrows and logical connectives (and, or) which define how statuses are propagated in the network of nodes. A solid arrow carries the status of the source node, while a dashed arrow propagates the negative of this status. Let us now see how such a model can be used for dynamic resolution of potential conflicts between safety goals. Figure 20 illustrates part of a simplified GOST for an aircraft [116]. The example shows five safety goals (ALL-ENGINES, ENGINE1-SAFE, ELECTRICS, FUEL-SAFE and DOMESTICS) and their respective priorities (20, 40, 10, 30 and 5). The first failure that we will consider in this example is a high fuel temperature. This is represented by the node FUEL-HI-TEMP changing value from 0 to 1. The network shows that this condition causes loss of the FUEL-SAFE goal. This problem can be corrected by either disconnecting the integrated drive generator (IDG=0) or switching off the galley, which will cause FUEL-ACTIONS=1 and therefore FUEL-SAFE=1. The first procedure will result in loss of ELECTRICS, while the second in loss of DOMESTICS. Because DOMESTICS has a lower priority, the second procedure is selected and the network is reconfigured as illustrated in Figure 21.

Figure 20. Example GOST in initial configuration [116]
Figure 21. The GOST after the first corrective action [116]

Figure 22. The GOST after the second corrective action [116]

The second failure that we will consider in this example is an engine stall (ENG1-STALL=1). As we can see, this condition causes loss of the ENGINE1-SAFE goal. According to the network, to ensure the safety of the engine, the pilot can either switch off the engine (ENGINE-1=0) or reduce the engine thrust below the stall margin (THRUST-LOW=1). Switching the engine off directly denies the goal ALL-ENGINES, while reducing thrust indirectly denies the goal ELECTRICS 4. The priority of ELECTRICS is lower and, therefore, the second option is chosen and THRUST-LOW is selected. GALLEY off is not required any more to maintain FUEL-SAFE and, thus, the galley is switched on to restore the goal DOMESTICS. The new configuration is illustrated in Figure 22. Although in this example Hill demonstrates a very attractive idea, that of dynamic synthesis of optimum non-conflicting corrective measures, the method has a number of limitations. Firstly, there is an assumption that the statically defined priorities of goals remain valid in all circumstances throughout the flight [117]. Secondly, the method requires the binary representation of equipment statuses and, hence, is unable to generate corrective measures based on analogue settings. This problem brings us to a more general issue concerning the application of the method, that of how to address the difficulties arising from the scale and complexity of a system and how to apply the proposed concept in a realistic environment.

4 ELECTRICS is sacrificed as the IDG is switched off to maintain FUEL-SAFE, which would otherwise have been denied because of THRUST-LOW. Indeed, THRUST-LOW causes FUEL-ACTIONS=0 and therefore FUEL-SAFE=0. Thus, to maintain FUEL-SAFE (priority=30) and ALL-ENGINES (priority=20) we sacrifice ELECTRICS (priority=10) by switching off the IDG.
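The essence of the selection step in this example, choosing the remedy whose side effects deny goals of the lowest total priority, can be conveyed with a small sketch. The procedure and goal names below are taken from the example above, but the representation is an illustrative simplification, not Hill's GOST propagation algorithm [116,117].

```python
# Choose, among alternative corrective procedures, the one whose side effects
# deny goals of the lowest total priority (a simplified, illustrative scheme).
GOAL_PRIORITY = {"ALL-ENGINES": 20, "ENGINE1-SAFE": 40, "ELECTRICS": 10,
                 "FUEL-SAFE": 30, "DOMESTICS": 5}

def best_procedure(procedures):
    """Each candidate is (name, goals_restored, goals_denied).  Pick the one
    that restores the threatened goal while denying the least critical goals."""
    def cost(p):
        _, _, denied = p
        return sum(GOAL_PRIORITY[g] for g in denied)
    return min(procedures, key=cost)

# High fuel temperature: two ways of restoring FUEL-SAFE.
candidates = [
    ("disconnect IDG",    {"FUEL-SAFE"}, {"ELECTRICS"}),   # costs priority 10
    ("switch galley off", {"FUEL-SAFE"}, {"DOMESTICS"}),   # costs priority 5
]
print(best_procedure(candidates)[0])   # switch galley off

# Engine stall: switching the engine off denies ALL-ENGINES (20); reducing
# thrust indirectly denies ELECTRICS (10), so the latter is preferred.
candidates = [
    ("switch engine 1 off", {"ENGINE1-SAFE"}, {"ALL-ENGINES"}),
    ("reduce thrust",       {"ENGINE1-SAFE"}, {"ELECTRICS"}),
]
print(best_procedure(candidates)[0])   # reduce thrust
```

In the GOST itself the denied goals are not listed by hand but derived by propagating node statuses through the network, which is what allows the model to resolve conflicts dynamically as the aircraft configuration changes.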
6. Conclusions

In this paper, we developed a framework which defines the problem of safety monitoring in terms of three principal dimensions: the optimisation of low-level status information and alarms, fault diagnosis and fault correction. Within this framework, we have discussed state of the art systems and research prototypes, placing an emphasis on the methodological features of these systems. In the course of this discussion we extended our basic framework by identifying, in each of the three areas of the problem, the main methodological approaches and their derivatives. Our analysis confirms that safety monitoring is a complex and multifaceted problem. Perhaps more importantly, it draws on previous research on requirements and methods to identify the various aspects of a potentially effective solution to the problem. In Table 1, we summarise a set of properties that our research suggests should characterise an effective safety monitoring scheme. These properties cover a wide range of goals, which span from the early primary detection of the symptoms of a disturbance to the synthesis and provision of non-conflicting corrective measures.

Table 1. Properties of an effective safety monitoring scheme

Aspects of a potentially effective scheme for safety monitoring:
- early detection of the symptoms of a disturbance
- detection of sensor failures and the effective validation of measurements
- correct interpretation of statuses and alarms in the evolving context of the system operation
- ability to organise and synthesise low-level status information and alarms, exploiting their temporal, causal, functional and logical relationships
- provision of a hierarchical and functional, as opposed to a purely architectural, view of failure
- timely and unambiguous diagnosis of malfunctions
- synthesis and provision of non-conflicting corrective procedures

We have seen that model-based monitors can address one or more of these requirements by operating on functional, causal, logical, structural and behavioural models of the monitored processes. These monitors are more likely to be consistent, and they provide better diagnostic coverage than “experiential” expert systems, because the model building and validation processes aid the systematic collection and validation of the required knowledge about the monitored process. There are, however, a number of open research issues concerning the model-based approach. We have seen, for example, that within the model-based approach, there are two classes of diagnostic methods: those that rely on models of the failure behaviour of the system and those that rely on models of normal behaviour. In the past, it has been argued that models of failure behaviour do not perform as well as models of normal behaviour. Davis and Ng [118,108], for instance, claim that a method which relies on a model of normal behaviour can detect and diagnose any fault that will cause a discrepancy between observations of the system and the model predictions. The authors argue that explicit fault models, on the other hand, are likely to be incomplete or carry incorrect assumptions about how the system may fail, in which case unanticipated or incorrectly anticipated faults will go undetected. A counter-argument here is that a model (indeed, any model, including models of normal behaviour) is nothing but a set of assumptions, which suggests that normal models may also be incorrect, incomplete or insufficiently detailed. In the context of monitoring, though, such inappropriate models will also generate wrong predictions and in turn cause false and misleading alarms. This observation simply highlights an important truth, that the quality of monitoring and the inferences drawn by model-based monitors are always strongly dependent on the completeness and correctness of the underlying model. Formal verification techniques, as well as simulation and testing under failure conditions, can, of course, be applied to increase our confidence in the underlying model. The validation of the monitoring model, however, is clearly an issue that creates much wider scope for further research. Another problem with model-based approaches is the difficulty in scaling up to large systems or systems with complex behaviour. Indeed, large scale often makes it difficult to achieve representative and consistent models of the monitored system.
Simplifying assumptions can usually help to reduce the complexity of the monitoring model. Such assumptions, though, must be made with care so that they do not compromise the detection and diagnostic ability of the monitor. In addition, there are problems in representing and reasoning about complex behaviour and difficulties concerning the efficiency of some of the current predictive programs and diagnostic algorithms. Research is under way to address these issues and improve existing model-based methods. The recent plant-wide applications of fault propagation models, for example, are particularly encouraging, and demonstrate the real possibility of transforming fault propagation models from offline tools for understanding risk to operational tools for controlling this risk in real time [119]. At the same time, the fact that there are still many monitoring and diagnostic problems which are too hard to describe using current modelling technology highlights the significance of approaches that employ expert diagnostic knowledge, statistical analysis of data or machine learning, and suggests that further exploration of the potential for combining these approaches with model-based methods would be particularly useful in the future.

Acknowledgements

This study was carried out in the context of a European Commission funded project called SETTA (IST contract number 10043). We would like to thank the anonymous reviewers for pointing out omissions and for generally helping us with their comments to improve the initial version of this paper.

References

1. Malone T. B., Human Factors Evaluation of Control Room Design and Operator Performance at Three Mile Island 2, US Nuclear Regulatory Commission Report, NUREG/CR-1270, US NRC Washington DC (1980). 2. Billings C. E., Human-centred aircraft automation: A concept and Guidelines, NASA Technical Memorandum TM-103885, NASA Ames Research Centre, Moffett Field CA United States (1991). 3. Randle R. J., Larsen W. H., Williams D. H., Some Human Factors Issues in the Development and Evaluation of Cockpit Alerting and Warning Systems, NASA Technical Report RP-1055, NASA Ames Research Centre, Moffett Field CA United States (1980). 4. Wiener E. L., Curry R. E., Flight-Deck Automation: Promises and Problems, NASA Technical Memorandum TM-81206, NASA Ames Research Centre, Moffett Field CA United States (1980). 5. Wiener E. L., Beyond The Sterile Cockpit, Human Factors, 27(1)(1985):75-90. 6. Wiener E. L., Fallible Humans and Vulnerable Systems, in Wise J. A., Debons A. (eds.), Information Systems-Failure Analysis, NATO ASI series Volume 32, Springer Verlag (1987). 7. Long A. B., Technical Assessment of Disturbance Analysis Systems, Journal of Nuclear Safety, 21(1)(1980):38-49. 8. Long A. B., Summary and Evaluation of Scoping and Feasibility Studies for Disturbance Analysis and Surveillance System (DASS), EPRI Technical Report NP-1684, Electric Power Research Institute, Palo Alto California United States (1980). 9. Lees F. P., Process Computer Alarm and Disturbance Analysis System: Review of the State of the Art, Computers and Chemical Engineering, 7(6)(1983):669-694, Pergamon Press. 10. Rankin W. L., & Ames K. R., Near-term Improvements in Annunciator/Alarm Systems, US Nuclear Regulatory Commission Report, NUREG/CR-3217, US NRC Washington DC (1983). 11. Roscoe B. J., Weston L. M., Human Factors in Annunciator/Alarm Systems, US Nuclear Regulatory Commission Report, NUREG/CR-4463, US NRC Washington DC (1986). 12. Kim, I.
S., Computerised Systems for On-line Management of Failures, Reliability Engineering and System Safety, 44(1994):279-295, Elsevier Science. 13. Pusey H. C., Mechanical Failure Technology: A Historical Perspective, Journal of Condition Monitoring and Diagnostic Engineering Management, 1(4)(1998):25-34, COMADEM International, United Kingdom. 14. Domenico P., Mach E. & Corsberg D., Alarm Processing System, in Proceedings of Expert Systems Applications for the Electric Power Industry, pages 135-138, Orlando FL United States, June (1989). 15. Dansak M. M., Alarms Within Advanced Display Systems: Alternatives and Performance Measures, US Nuclear Regulatory Commission Report, NUREG/CR-2776, US NRC Washington DC (1980). 16. Roscoe B. J., Nuclear Power Plant Alarm Prioritisation (NPPAP) Program Status Report, US Nuclear Regulatory Commission Report, NUREG/CR-3684, US NRC Washington DC (1984). 17. Kim I. S., Modarres M., Application of Goal Tree-Success Tree Method as the Knowledge Base of Operator Advisory Systems, Nuclear Engineering and Design, 104(1987):67-81, Elsevier Science. 18. Chen L. W., Modarres M., Hierarchical Decision Process for Fault Administration, Computers and Chemical Engineering, 16(1992):425-448, Pergamon Press. 19. Larsson J. E., Diagnosis Based on Explicit Means-end Models, Artificial Intelligence, 80(1996):29-93, Elsevier Science. 20. Meijer C. H., On-line Power Plant Signal Validation Technique Utilising Parity-Space Representation and Analytical Redundancy, EPRI Technical Report NP-2110, Electric Power Research Institute, Palo Alto California United States (1981). 21. Clark R. N., Instrument Fault Detection, IEEE Transactions on Aerospace and Electronic Systems, 14(3)(1978):456-465. 22. Isermann R., Process Fault-Detection Based on Modelling and Estimation Methods, Automatica, 20(1984):387-404. 23. Upadhyaya B. R., Sensor Failure Detection and Estimation, Journal of Nuclear Safety, 26(1)(1985):23-32. 24. Patton R. J., Willcox S. W., A Robust Method for Fault Diagnosis Using Parity Space Eigenstructure Assignment, in Singh M. G., Hindi K. S., Schmidt G., Tzafestas S. (eds.), Fault Detection & Reliability: Knowledge-Based and Other Approaches, International Series on Systems and Control, pages 155-163, Pergamon Press (1987). 25. Willcox S. W., Robust Sensor Fault Diagnosis for Aircraft Based on Analytical Redundancy, DPhil Thesis, J.B. Morell Library, University of York, United Kingdom (1989). 26. Kim I. S., Modarres M., Hunt R. N. M., A Model-based Approach to On-line Disturbance Management: The Models, Reliability Engineering and System Safety, 28(1990):265-305, Elsevier Science. 27. Patton R. J., Robust Model-Based Fault Diagnosis: The State of the Art, IFAC/IMACS Safeprocess Symposium, Espoo (1994). 28. Leahy M. J., Henry M. P. and Clarke D. W., Sensor Validation in Biomedical Applications, Control Engineering Practice, 5(2)(1997):1753-1758. 29. Henry M. P. and Clarke D. W., The Self-Validating Sensor: Rationale, Definitions and Examples, Control Engineering Practice, 1(4)(1993):585-610. 30. Henry M. P., Self-validating Digital Coriolis Mass Flow Meter, Computing & Control Engineering, 11(5)(2000):219-227. 31. Henry M. P., Plant Asset Management via Intelligent Sensors: Digital, Distributed and For Free, Computing & Control Engineering, 11(5)(2000):211-213. 32.
Oulton B., Structured Programming Based on IEC SC 65 A, Using Alternative Programming Methodologies and Languages with Programmable Controllers, IEEE Conference on Electrical Engineering Problems in the Rubber and Plastic Industries, IEEE no: 92CH3111-2, pages 18-20, IEEE Service Center, United States (1992). 33. Papadopoulos Y., A Computer Aided System for PLC Program Testing and Debugging, M.Sc. Thesis, Cranfield University Library, United Kingdom (1993). 34. Corsberg D. R., Wilkie D., An Object Oriented Alarm Filtering System, in Proceedings of the 6th Power Plant Dynamics, Control and Testing Symposium, pages 123-125, Knoxville TN United States, April (1986). 35. Kim I. S., On-line Process Failure Diagnosis: The Necessities and a Comparative Review of the Methodologies, in Proceedings of Symposium on Safety of Thermal Reactors, pages 777-783, Portland OR United States (1991). 36. Meijer C. H., On-line Power Plant Alarm and Disturbance Analysis System, EPRI Technical Report NP-1379, pages 2.13-2.24, Electric Power Research Institute, Palo Alto California United States (1980). 37. Chester D., Lamb D., Dhurjati P., Rule-based Computer Alarm Analysis in Chemical Process Plants, Proceedings of the 7th Annual Conference on Computer Technology, pages 22-29, IEEE Press (1984). 38. Morgan J., Miller J., The MD-11 Electronic Instrument System, in Proceedings of the 11th Digital Avionics Systems Conference, Seattle, IEEE no: 92CH3212-8, pages 248-253, IEEE Publications, Piscataway United States (1992). 39. Kim I. S., Modarres M., Hunt R. N. M., A Model-based Approach to On-line Disturbance Management: The Application, Reliability Engineering and System Safety, 29(1990):185-239, Elsevier Science. 40. Pennings R., Saussais Y., Real-time Safety Oriented Decision Support Using a GTST for the ISO Satellite Application, in Proceedings of 13th International Conference on AI / 9th International Conference on Data Engineering, Avignon France, May (1993). 41. Gerlinger G., Pennings R., FORMENTOR: A Data Model Based on Information Propagation Control for Real-time Distributed Knowledge-based Systems, in Proceedings of the Conference on Real Time Systems-93, Paris France (1993). 42. Wilikens M., Burton C. J., Masera M., FORMENTOR: BP Pilot Application: Results of an Application to a Petrochemical Plant, JRC Technical Note ISEI/IE 2707/94, Institute for Systems Engineering and Informatics, Joint Research Centre, Ispra Italy, May (1994). 43. Bell D. A., Automatic Control of Aircraft Utility Sub-systems on the MD-11, in Proceedings of the 11th Digital Avionics Systems Conference, Seattle, United States, IEEE no: 92CH3212-8, pages 592-597, IEEE Publications, Piscataway United States (1992). 44. Airbus Industrie, The A320 Electronic Instrument System, Airbus Technical Report AI/EE-A639/86 (issue 2), Airbus Industrie, 31707 Blagnac Cedex France (1989). 45. British Airways, Airbus A320 Technical Manual, pages 11(30)(1991):1-60, BA Technical Information Services, London. 46. Henley E. J., Kumamoto H., Designing for Reliability and Safety Control, Prentice Hall, ISBN: 0-13-201229-4 (1985). 47. Pau L. F., Survey of Expert Systems for Fault Detection, Test Generation and Maintenance, Journal of Expert Systems, 3(2)(1986):100-111. 48. Scherer W. T., White C. C., A Survey of Expert Systems for Equipment Maintenance and Diagnostics, in Singh M. G., Hindi K. S., Schmidt G., Tzafestas S.
(eds.), Fault Detection & Reliability: Knowledge-Based and Other Approaches, International Series on Systems and Control, pages 3-18, Pergamon Press (1987). 49. Buchanan B. G., Shortliffe E. H., Rule Based Expert Systems: The Mycin Experiments of the Stanford Heuristic Programming Project, Addison-Wesley, New York (1984). 50. Miller R. K., Walker T. C., Artificial Intelligence Applications in Engineering, Fairmont Pr, ISBN: 0-13048-034-7 (1988). 51. Waterman D. A., A Guide to Expert Systems, Addison-Wesley, ISBN: 0-20108-313-2 (1986). 52. Shihari S. N., Applications of Expert Systems in Engineering, in Tzafestas S. G. (ed.), Knowledge-based System Diagnosis Supervision and Control, pages 1-10, Plenum Press, ISBN: 0-30643-036-3 (1989). 53. Underwood W. E., A CSA Model-based Nuclear Power Plant Consultant, in Proceedings of the National Conference on Artificial Intelligence, pages 302-305, AAAI Press, Pittsburgh PA United States (1982). 54. Nelson W. R., REACTOR: An Expert System for Diagnosis and Treatment of Nuclear Reactors, in Proceedings of the National Conference on Artificial Intelligence, pages 296-301, AAAI Press, Pittsburgh PA United States (1982). 55. Shirley R. S., Some Lessons Learned Using Expert Systems for Process Control, in Proceedings of the 1987 American Control Conference, June 10-12 (1987). 56. Rowan D. A., On-line Expert Systems in Process Industries, AI Expert, 4(8)(1989):30-38. 57. Ballard D., Nielsen D., A Real-time Knowledge Processing System for Rotor-craft Applications, in Proceedings of the 11th Digital Avionics Systems Conference, Seattle, IEEE no: 92CH3212-8, pages 199-204, IEEE Publications, Piscataway United States (1992). 58. Talarian Co., RTWORKS: A Technical Overview, Scientific Computers, Victoria Road, Burgess Hill, RH15 9LH, United Kingdom (1996). 59. Ramesh T. S., Shum S. K., Davis J. F., A Structured Framework for Efficient Problem Solving in Diagnostic Expert Systems, Computers and Chemical Engineering, 12(10)(1988):891-902, Pergamon Press. 60. Gazdik I., An Expert System Shell for Fault Diagnosis, in Tzafestas et al. (eds.), System Fault Diagnostics, Reliability and Related Knowledge-based Approaches, pages 81-104, Reidel Co (1987). 61. Yardick R., Vaughan D., Perrin B., Holden P., Kempf K., Evaluation of Uncertain Inference Models I: Prospector, in Proceedings of AAAI-86: Fifth National Conference on Artificial Intelligence, Philadelphia PA, pages 333-338, AAAI Press, United States (1986). 62. Nairac A., Townsend N., Carr R., King S., Cowley P. and Tarassenko L., A System for the Analysis of Jet Engine Vibration Data, Integrated Computer-Aided Engineering, 6(1999):53-65. 63. Tarassenko L., Nairac A., Townsend N., Buxton I. and Cowley P., Novelty Detection for the Identification of Abnormalities, Int. Journal of Systems Science, 141(4)(2000):210-216. 64. Finch F. E., Kramer M. A., Expert System Methodology for Process Malfunction Diagnosis, in Proceedings of 10th IFAC World Congress on Automatic Control, United States, July (1987). 65. Davis R., Hamscher W., Model Based Reasoning: Troubleshooting, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 3-28, Morgan Kaufman, ISBN: 1-55860-249-6 (1992). 66. Tzafestas S. G., System Fault Diagnosis Using the Knowledge-based Methodology, in Patton R., Frank P., Clark R. (eds.), Fault Diagnosis in Dynamic Systems: Theory and Applications, pages 509-595, Prentice Hall, ISBN: 0-13308-263-6 (1989). 67.
Felkel L., Grumbach R., Saedtler E., Treatment, Analysis and Presentation of Information about Component Faults and Plant Disturbances, Symposium on Nuclear Power Plant Control and Instrumentation, IAEA-SM-266/40, pages 340-347, United Kingdom (1978). 68. Puglia W. J., Atefi B., Examination of Issues Related to the Development and Implementation of Real-time Operational Safety Monitoring Tools in the Nuclear Power Industry, Reliability Engineering and System Safety, 49(1995):189-199, Elsevier Science. 69. Hook T. G., Use of the Safety Monitor in Operation Decision Making and the Maintenance Rule at the San Onofre Nuclear Generating Station, in Proceedings of PSA (Probabilistic Safety Assessment) 95, Seoul, Korea, 26-30 November (1995). 70. Varde P. V., Shankar S., Verma A. K., An Operator Support System for Research Reactor Operations and Fault Diagnosis Through a Connectionist Framework and PSA based Knowledge-based Systems, Reliability Engineering and System Safety, 60(1)(1998):53-71, Elsevier Science. 71. Papadopoulos Y., McDermid J. A., Safety Directed Monitoring Using Safety Cases, Journal of Condition Monitoring and Diagnostic Engineering Management, 1(4)(1998):5-15. 72. Papadopoulos Y., Safety Directed Monitoring Using Safety Cases, Ph.D. Thesis, J.B. Morell Library, University of York, U.K. (2000). 73. Papadopoulos Y., McDermid J. A., Hierarchically Performed Hazard Origin and Propagation Studies, in Lecture Notes in Computer Science, 1698(1999):139-152, Springer Verlag. 74. Papadopoulos Y., McDermid J. A., Sasse R., Heiner G., Analysis and Synthesis of the Behaviour of Complex Programmable Electronic Systems in Conditions of Failure, Reliability Engineering and System Safety, 71(3)(2001):229-247, Elsevier Science. 75. Shiozaki J., Matsuyama H., O’Shima E., Iri M., An Improved Algorithm for Diagnosis of System Failures in the Chemical Process, Computers and Chemical Engineering, 9(3)(1985):285-293, Pergamon Press. 76. Kramer M. A., Automated Diagnosis of Malfunctions based on Object-Oriented Programming, Journal of Loss Prevention in Process Industries, 1(1)(1988):226-232. 77. Iverson D. L., Patterson F. A., Advances in Digraph Model Processing Applied to Automated Monitoring and Diagnosis, Reliability Engineering and System Safety, 49(3)(1995):325-334, Elsevier Science. 78. Guarro S., Okrent D., The Logic Flowgraph: A New Approach to Process Failure Modelling and Diagnosis for Disturbance Analysis Applications, Nuclear Technology, 67(1984):348-359. 79. Yau M., Guarro S., Apostolakis G., The Use of Prime Implicants in Dependability Analysis of Software Controlled Systems, Reliability Engineering and System Safety, 62(1-2)(1998):23-32, Elsevier Science. 80. Finch F. E., Oyeleye O. O., Kramer M. A., A Robust Event-oriented Methodology for Diagnosis of Dynamic Process Systems, Computers and Chemical Engineering, 14(12)(1990):1379-1396, Pergamon Press. 81. Oyeleye O. O., Finch F. E., Kramer M. A., Qualitative Modelling and Fault Diagnosis of Dynamic Processes by MIDAS, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 262-275, Morgan Kaufman, ISBN: 1-55860-249-6 (1992). 82. Dziopa P., Gentil S., Multi-Model-based Operation Support System, Engineering Applications of Artificial Intelligence, 10(2)(1997):117-127, Elsevier Science. 83. Kramer M. A., Finch F. E., Fault Diagnosis of Chemical Processes, in Tzafestas S. G. (ed.), Knowledge-based System Diagnosis Supervision and Control, pages 247-263, Plenum Press, New York, ISBN: 0-30643-036-3 (1989). 84.
84. Ulerich N. H., Powers G., On-line Hazard Aversion and Fault Diagnosis in Chemical Processes: The Digraph + Fault Tree Method, IEEE Transactions on Reliability, 37(2)(1988):171-177, IEEE Press.
85. Chung P. W. H., Qualitative Analysis of Process Plant Behaviour, in Proceedings of the 6th International Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems, pages 277-283, Gordon and Breach Science Publishers, Edinburgh (1993).
86. Palmer C., Chung P. W. H., Verifying Signed Directed Graph Models for Process Plants, Computers and Chemical Engineering, 23(Suppl.)(1999):391-394, Pergamon Press.
87. McCoy S. A., Wakeman S. J., Larkin F. D., Chung P. W. H., Rushton A. G., Lees F. P., HAZID, a Computer Aid for Hazard Identification, Transactions IChemE, 77(B)(1999):328-334, IChemE.
88. Kramer M. A., Palowitch B. L., A Rule-based Approach to Diagnosis Using the Signed Directed Graph, AIChE Journal, 33(7)(1987):1067-1078.
89. Yau M., Guarro S., Apostolakis G., Demonstration of the Dynamic Flowgraph Methodology using the Titan II Space Launch Vehicle Digital Flight Control System, Reliability Engineering and System Safety, 49(3)(1995):335-353, Elsevier Science.
90. Tarifa E. E., Scenna N. J., A Methodology for Fault Diagnosis in Large Chemical Processes and an Application to a Multistage Flash Desalination Process: Part I, Reliability Engineering and System Safety, 60(1)(1998):29-40, Elsevier Science.
91. Tarifa E. E., Scenna N. J., A Methodology for Fault Diagnosis in Large Chemical Processes and an Application to a Multistage Flash Desalination Process: Part II, Reliability Engineering and System Safety, 60(1)(1998):41-51, Elsevier Science.
92. Vedam H., Dash S., Venkatasubramanian V., An Intelligent Operator Decision Support System for Abnormal Situation Management, Computers and Chemical Engineering, 23(1999):S577-S580, Elsevier Science.
93. Vedam H., Venkatasubramanian V., Signed Digraph-based Multiple Fault Diagnosis, Computers and Chemical Engineering, 21(1997):S655-S660, Elsevier Science.
94. Chung D. T., Modarres M., GOTRES: an Expert System for Fault Detection and Analysis, Reliability Engineering and System Safety, 24(1989):113-137, Elsevier Science.
95. Wilikens M., Nordvik J.-P., Poucet A., FORMENTOR, A Real-time Expert System for Risk Prevention in Hazardous Environments, Control Engineering Practice, 1(2)(1993):323-328.
96. Masera M., Wilikens M., Contini S., An On-line Supervisory System Extended for Supporting Maintenance Management, in Proceedings of the Topical Meeting on Computer-Based Human Support Systems: Technology, Methods and Future, American Nuclear Society, Philadelphia, United States, June 25-29 (1995).
97. Wilikens M., Burton C. J., FORMENTOR: Real-time Operator Advisory System for Loss Control - Application to a Petrochemical Plant, International Journal of Industrial Ergonomics, 17(4), Elsevier Science, April (1996).
98. Lind M., On the Modelling of Diagnostic Tasks, in Proceedings of the Third European Conference on Cognitive Science Approaches to Process Control, Cardiff, Wales (1991).
99. Reiter R., A Theory of Diagnosis from First Principles, Artificial Intelligence, 32(1987):57-96, Elsevier Science.
100. Greiner R., Smith B., Wilkerson R., A Correction to the Algorithm in Reiter’s Theory of Diagnosis, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 49-53, Morgan Kaufmann, ISBN: 1-55860-249-6 (1992).
101. Davis R., Diagnostic Reasoning Based on Structure and Behaviour, Artificial Intelligence, 24(3)(1984):347-410, Elsevier Science.
102. de Kleer J., Williams B. C., Diagnosis of Multiple Faults, Artificial Intelligence, 32(1)(1987):97-130, Elsevier Science.
103. de Kleer J., An Assumption-based Truth Maintenance System, Artificial Intelligence, 28(1986):127-162, Elsevier Science.
104. de Kleer J., Williams B. C., Diagnosing Multiple Faults, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 100-117, Morgan Kaufmann, ISBN: 1-55860-249-6 (1992).
105. Touchton R. A., Subramanyan N., Nuclear Power Applications of NASA Control and Diagnostic Technology, EPRI Technical Report NP-6839, Electric Power Research Institute, Palo Alto, California, United States (1990).
106. Kuipers B. J., Qualitative Simulation, Artificial Intelligence, 29(1986):289-338, Elsevier Science.
107. Dvorak D., Kuipers B. J., Model-based Monitoring and Diagnosis of Dynamic Systems, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 249-261, Morgan Kaufmann, ISBN: 1-55860-249-6 (1992).
108. Ng H. T., Model-based Multiple Fault Diagnosis of Time-Varying, Continuous Physical Devices, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 242-248, Morgan Kaufmann, ISBN: 1-55860-249-6 (1992).
109. Ng H. T., Model-based Multiple Fault Diagnosis of Dynamic Physical Devices, IEEE Expert, pages 38-43, IEEE (1991).
110. Nomikos P., MacGregor J. F., Monitoring Batch Processes Using Multiway Principal Component Analysis, AIChE Journal, 40(1994):1361-1375.
111. Wangen L. E., Kowalski B. R., A Multi-block Partial Least Squares Algorithm for Investigating Complex Chemical Systems, Journal of Chemometrics, 3(1988):3-20.
112. Cheung J. T., Stephanopoulos G., Representation of Process Trends. Part I: A Formal Representation Framework, Computers and Chemical Engineering, 14(1990):495-510.
113. Dash S., Venkatasubramanian V., Challenges in the Industrial Applications of Fault Diagnostic Systems, Computers and Chemical Engineering, 24(2-7)(2000):785-791, Elsevier Science.
114. Bakshi B. R., Stephanopoulos G., Representation of Process Trends, Part IV: Induction of Real-Time Patterns from Operating Data for Diagnosis and Supervisory Control, Computers and Chemical Engineering, 18(4)(1994):303-332.
115. Ramohalli G., The Honeywell On-board Diagnostic and Maintenance System for the Boeing 777, in Proceedings of the 11th Digital Avionics Systems Conference, Seattle, IEEE no: 92CH3212-8, pages 485-490, IEEE Publications, Piscataway, United States (1992).
116. Hill D., A Deep Knowledge Approach to Aircraft Safety Management, in Proceedings of the 13th International Conference on Artificial Intelligence and Expert Systems, pages 355-365, ISBN 2-906899-879 (1993).
117. Hill D., Aircraft Configuration Management using Constraint Satisfaction Techniques, in Proceedings of the IEE International Conference on Control ’94, 2:1278-1283, ISBN 0-852-966-113 (1994).
118. Davis R., Diagnostic Reasoning Based on Structure and Behaviour, in Hamscher W., Console L., de Kleer J. (eds.), Readings in Model-based Diagnosis, pages 376-407, Morgan Kaufmann, ISBN: 1-55860-249-6 (1992).
119. Chien S. H., Hook T. G., Lee R. J., Use of a Safety Monitor in Operational Decision Making at a Nuclear Generating Facility, Reliability Engineering and System Safety, 62(1998):11-16, Elsevier Science.