From: Proceedings of the Twelfth International FLAIRS Conference. Copyright © 1999, AAAI (www.aaai.org). All rights reserved.

Intelligent Alarm Handling

Lars Asker, Mats Danielson, Love Ekenberg
Dept. of Computer and Systems Sciences
Royal Institute of Technology
Electrum 230, SE-164 40, Kista, SWEDEN
{asker,mad,lovek}@dsv.su.se

Abstract

Most of the actions taken within today's power plants are directed by control systems, which are usually computerised and located in a central control room within the power plant. In normal states, the communication between the control system and the operators is satisfactory, with alarms occurring only infrequently. When large disturbances occur, however, the communication becomes problematic: instead of being aided by the messages, the operators are swamped by the amount of information, and often have to make more or less informed guesses about the cause of the abnormal situation. It is therefore of great importance that the control system can discriminate between normal and abnormal situations, becoming less sensitive in the latter case and giving priority to the alarms that must be sent to the operators. In order for the system to make such analyses, processes for diagnosis and decision making regarding the reliability and importance of the information are needed. This paper shows how machine learning algorithms can be combined with decision theory w.r.t. vague and numerically imprecise background information, by using classifiers. An ensemble is a classifier created by combining the predictions of multiple component classifiers. We present a new method for combining classifiers into an ensemble, based on a simple estimation of each classifier's competence. The purpose is to develop a filter for handling complex alarm situations. Decision situations are evaluated using fast algorithms developed particularly for solving these kinds of problems.
The presented framework has been developed in co-operation with one of the main actors in the Swedish power plant industry.

Introduction

Process control in today's power plants is most often handled by a computer system, which ensures that the process is kept within given limits and that the objects in the plant (pumps, engines, valves, etc.) are stable w.r.t. directions given by operators or the control system. When deviations from the normal situation occur, the computer system sends alarm messages to the operators, who take the necessary actions based on given recommendations or their own experience. When large deviations occur, the safety routines in the control system stop some or all of the subsystems in the plant. Most of the actions taken within the plant are directed by the control systems, which are usually computerised and located in a central control room within the power plant. In normal situations, the communication between the control system and the operators is satisfactory, with alarms occurring only infrequently. When a large disturbance occurs, however, the communication is far from satisfactory. For example, after a short power failure, the control system may stop pumps and engines, and the positions of valves may change. In this situation, the safety routines normally start to interfere, and a huge stream of alarm messages follows. Instead of being helped by the messages, the operators become swamped by all the information, and they often have to make more or less informed guesses about the cause of the abnormal situation. It would therefore be of great help to the operators if the control system could discriminate between normal and abnormal situations. In the former case, the sensitivity of the control system would be as today, resulting in few alarms being sent every now and then, while in the latter case, the system would be less sensitive and give priority to some alarms that are to be sent to the operators.
In order for the system to make such analyses, processes for diagnosis and decision making regarding the reliability and importance of the information are needed. This paper describes a model which has been developed for use in a large Swedish power plant and which, by using machine learning and reliability analysis, can evaluate the classified alarms according to credibilities that depend on the current situation. Furthermore, when a large disturbance has occurred, the system should be able to sort the alarms in an order that eases the recovery of the plant. To efficiently address the problem space indicated above, techniques from different AI sub-fields were employed. From the field of machine learning, techniques for creating models from observations were applied. ML techniques have been used for finding rules for classifying a situation as normal or not, as well as for finding rules that can help the control system to prioritise among the alarms. The techniques for reliability analysis are highly relevant, since the control system is faced with a complex decision problem when encountering abnormal situations. These models were used to describe or analyse new observations, e.g. suggesting how to act in certain situations, or to make predictions, e.g. suggesting what to expect in certain situations. Computational methods for decision analysis are another important subject area. Algorithms for computational decision analysis offer promising solutions to the problem of combining different classifiers when the information at hand is imprecise.

Classifiers

A popular method for creating an accurate classifier from a set of training data is to train several different classifiers on the training data and then to combine the predictions of these classifiers into a single prediction [Breiman, 1996; Drucker et al., 1994; Wolpert, 1992]. The resulting classifier is generally referred to as an ensemble because it is made up of component classifiers.
In this paper we present a novel method for combining the predictions of several classifiers. For each classifier we derive an interval prediction and a confidence interval associated with it (Figure 1). A number of researchers have demonstrated that ensembles are generally more accurate than any of their component classifiers [Breiman, 1996; Clemen, 1989; Quinlan, 1996; Wolpert, 1992; Zhang et al., 1992]. Using an ensemble, the class of an example is predicted by first classifying the example with each of the component classifiers and then combining the resulting predictions into a single classification. To create an ensemble, a user generally must focus on two aspects: (1) which classifiers to use as components of the ensemble; and (2) how to combine their individual predictions into one. Hansen and Salamon [1990] demonstrated that, under certain assumptions, the accuracy of an ensemble increases with the number of classifiers combined. For each example where the average error rate over the distribution of possible classifiers is less than 50%, they show that in the limit the expected error on that example can be reduced to zero. Of course, since not all patterns will necessarily share this characteristic (e.g., outliers may be predicted at more than 50% error), the error rate over all the patterns cannot necessarily be reduced to zero. But if we assume that a significant percentage of the patterns are predicted with less than 50% average error, gains in generalisation will be achieved. A key assumption of Hansen and Salamon's analysis is that the classifiers combined should be independent in their production of errors. Krogh and Vedelsby [1995] expanded on this notion to show that the error for an ensemble is related to the generalisation error of the classifiers plus how much disagreement there is between the different classifiers.
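The Hansen and Salamon argument can be illustrated numerically. The sketch below is our own illustration, not code from the paper: it computes the probability that a majority vote of n classifiers, each erring independently on an example with rate p < 0.5, is wrong on that example. The binomial tail shrinks toward zero as n grows.

```python
# Illustration of the ensemble-error argument: if n classifiers err
# independently on an example with rate p < 0.5, the majority vote errs
# only when more than half of them are wrong at once.
from math import comb

def majority_vote_error(p: float, n: int) -> float:
    """Probability that a strict majority of n independent classifiers,
    each with per-example error rate p, is wrong (n assumed odd)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

for n in (1, 5, 25, 101):
    print(n, majority_vote_error(0.3, n))
```

Running this with p = 0.3 shows the per-example error of a single classifier (0.3) dropping sharply as more independent classifiers are combined, in line with the limit argument above.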
Thus, much research on selecting appropriate classifiers to combine has focused on selecting classifiers that are accurate in their predictions but differ in where they are accurate. Methods for approaching this problem include using different classification methods, training on subsets of the data set, training on different sets of input features, and using different subsets of the training set for training the classifiers [Breiman, 1996; Drucker et al., 1994; Hansen and Salamon, 1990; Hashem et al., 1994; Krogh and Vedelsby, 1995; Maclin and Shavlik, 1995]. The second aspect of creating an ensemble is the choice of the function for combining the predictions of the component classifiers [Kearns and Seung, 1995]. Examples of combination functions include voting schemes [Hansen and Salamon, 1990], simple averages [Lincoln and Skrzypek, 1989], weighted average schemes [Perrone and Cooper, 1994; Rogova, 1994], and schemes for training combiners [Rost and Sander, 1993; Wolpert, 1992; Zhang et al., 1992]. Clemen [1989] demonstrated that, in the absence of knowledge concerning a specific problem, almost any reasonable method, including simple ones such as voting or using a weighted average, would result in an effective ensemble. Our approach can be thought of as being related to the hierarchical mixture of experts approach [Jacobs et al., 1991; Jordan and Jacobs, 1994; Nowlan and Hinton, 1990]. In a mixture of experts approach, a group of sub-classifiers is trained so that each sub-classifier becomes an expert on a different portion of the input space. Our approach differs in that our training mechanism is simpler (each classifier simply trains on the entire problem), but as a result our component classifiers may have significant overlap in their expertise.
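For concreteness, the simple combination functions surveyed above (voting, simple averages, and credibility-weighted averages) might be sketched as follows. This is a minimal illustration of our own; it assumes each component classifier reports a list of class probabilities, and the names and data shapes are not from the paper.

```python
# Sketches of three simple combination functions: majority voting,
# a simple average, and a credibility-weighted average. Each component
# classifier is assumed to report a list of class probabilities.

def vote(predictions):
    """Majority vote over each classifier's highest-ranked class
    (ties between classes are broken arbitrarily)."""
    winners = [max(range(len(p)), key=p.__getitem__) for p in predictions]
    return max(set(winners), key=winners.count)

def average(predictions):
    """Unweighted mean of the class probability vectors."""
    n = len(predictions)
    return [sum(p[c] for p in predictions) / n
            for c in range(len(predictions[0]))]

def weighted_average(predictions, weights):
    """Mean of the probability vectors, weighted by classifier credibility."""
    total = sum(weights)
    return [sum(w * p[c] for w, p in zip(weights, predictions)) / total
            for c in range(len(predictions[0]))]

# Three classifiers, two classes; the third dissents but carries weight 2.
preds = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
print(vote(preds))                       # majority picks class 0
print(average(preds))
print(weighted_average(preds, [1, 1, 2]))
```

Note how the weighted average can shift the combined belief toward a dissenting but highly credible classifier even when the vote does not.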
Instead of having each classifier point out a specific class, or set of classes, we allow the classifiers to distribute their belief over the classes, by allowing interval beliefs and requiring the distributed beliefs to be normalised (i.e. sum to unity).

A General Method for Combining Classifiers

Consider a scenario where a power plant operator is in charge of running a highly complex power generating process. The control system is partly based on a rule-based alarm system. When alarms occur, the operator, who may be a human as well as a software agent, needs to respond, either by ignoring the alarm or by taking some action based on the current state of the plant. See the event chain in Figure 2. More formally, the operator faces a situation involving a choice between a finite set of strategies Si (e.g. actions), having access to a finite set of triggered alarms Ai. In a situation modelled as above, some alarms may be more reliable than others. The operator may have access to assessments expressing the credibility of the different alarms.

Figure 2

From this information, the operator wants to evaluate the strategies given the individual alarms and their relative credibilities. However, for the operator to carry out his tasks and to acquire sufficient and reliable knowledge, it is fundamental that he is able to evaluate information gathered from different alarms, some unreliable and some noisy. The dynamic adaptation taking place over time, as the operator interacts with his environment and with the other alarms, is affected by the means available to assess and evaluate imprecise information. The operator may rank the credibilities of the different classifiers as well as quantify them in imprecise terms. The classifiers have a similar expressibility regarding their respective reports about the classes under consideration. We will consider the problem as a multi-criteria decision problem.
Aggregation of utility functions under a variety of criteria is investigated in the area of Multi-Attribute Utility Theory (MAUT) [Fishburn, 1970; Keeney and Raiffa, 1976; Keeney, 1992]. A number of techniques used in MAUT have been implemented in computer programs such as SMART [Edwards, 1977] and EXPERT CHOICE, the latter of which is based on the widely used AHP [Saaty, 1977; Saaty, 1980; Saaty, 1982]. AHP has been criticised in a variety of respects [Belton and Gear, 1983; Watson and Freeling, 1982; Watson and Freeling, 1983], and models using geometric mean value techniques have been suggested instead [Barzilai et al., 1987; Krovak, 1987]. Techniques based on the geometric mean value have, for instance, been implemented by Lootsma and Rog in REMBRANDT [Lootsma, 1993]. All these approaches have their advantages, but the requirement to provide numerically precise information sometimes seems unrealistic in real-life decision situations such as alarm classification, and a number of models with representations allowing imprecise statements have been suggested. For instance, [Salo and Hämäläinen, 1995] extend the AHP method in this respect and also make use of structural information when the alternatives are evaluated into overlapping intervals. The system ARIADNE [Sage and White, 1984] also allows the decision maker to use imprecise estimates, but does not discriminate between alternatives when these are evaluated into overlapping intervals. Fuzzy set theory is a more widespread approach to relaxing the requirement of numerically precise data, providing a more realistic model of the vagueness in subjective estimates of probabilities, weights, and values [Chen and Hwang, 1992; Lai and Hwang, 1994]. These approaches allow, among other features, the decision-maker to model and evaluate a decision situation in vague linguistic terms.
The methods we propose herein originate from earlier work on handling probabilistic decision problems involving a number of alternatives and consequences when the background information is vague or numerically imprecise [Danielson and Ekenberg, 1998; Ekenberg et al., 1996; Malmnäs, 1994]. The aim of this paper is to generalise that work into the realm of multiple criteria decision aids, while still conforming to classical statistical theory rather than to fuzzy set theory. By doing so, we try to avoid problems emanating from difficulties in providing set membership functions and in defining set operators with a satisfying intuitive correspondence.

Evaluating Information from Different Classifiers

As mentioned above, a significant feature of the framework is that it allows for situations where numerically imprecise or comparative sentences occur. These sentences are represented in a numerical format, and with respect to this representation the situations can be evaluated using a variety of decision rules. The further discriminating analyses try to show which parts of the given information are the most critical and must be given extra careful consideration.

Representation

The credibility of a classifier is, for instance, represented by weight estimates. Such estimates can be represented by linear constraints, and we treat two classes of these: interval sentences and comparative sentences (cf. [Ekenberg and Danielson, 1994]). Interval sentences are of the form “The weight of classifier i lies between the numbers ai and bi” and are translated into wi ∈ [ai, bi]. Comparative sentences are of the form “The importance of classifier i is greater than the importance of classifier j”. Such a sentence is translated into an inequality wi ≥ wj. Each statement is thus represented by one or more constraints. We call the conjunction of constraints of the types above, together with the normalisation constraint Σi≤n wi = 1, the credibility base (S).
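A minimal sketch of how such a credibility base might be held in software follows. The tuple encoding and function names are our own illustration, not from the paper; checking whether the whole base is consistent would in general be a linear programming problem, so here we only test one candidate weight vector against the constraints.

```python
# A toy encoding of a credibility base: interval and comparative
# sentences as tuples over weight variables w_0..w_{n-1}, plus the
# normalisation constraint sum(w) = 1.

def interval(i, a, b):
    """'The weight of classifier i lies between a and b': a <= w_i <= b."""
    return ("interval", i, a, b)

def comparative(i, j):
    """'Classifier i is at least as important as classifier j': w_i >= w_j."""
    return ("comparative", i, j)

def satisfies(w, constraints, tol=1e-9):
    """Check one candidate weight vector against the credibility base,
    including the normalisation constraint sum(w) = 1."""
    if abs(sum(w) - 1.0) > tol:
        return False
    for c in constraints:
        if c[0] == "interval":
            _, i, a, b = c
            if not (a - tol <= w[i] <= b + tol):
                return False
        else:
            _, i, j = c
            if w[i] < w[j] - tol:
                return False
    return True

# The paper's example: C2 (index 1) at least as credible as C1 (index 0),
# and the credibility of C1 lies in [0.30, 0.70].
base = [interval(0, 0.30, 0.70), comparative(1, 0)]
print(satisfies([0.4, 0.6], base))  # True: a consistent instantiation
print(satisfies([0.8, 0.2], base))  # False: violates both sentences
```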
The situation base (K) consists of similar translations of imprecise belief estimates. A situation base with n classifiers and m classes is expressed in belief variables {u11,...,u1n,...,um1,...,umn} stating the belief in the classes of situations according to the different classifiers. The term uij denotes the belief in situation Si with respect to classifier Cj. The collection of weight and belief statements constitutes the information frame. It is assumed that the variables' respective ranges are real numbers in the interval [0,1]. Below, we will refer to an information frame as a structure 〈S, K〉.

Example: A decision-maker gives assessments concerning a diagnosis situation. The objective of the diagnosis is to determine the cause of a set of alarms. The possible situations¹ (classes) are S1: detector malfunction, S2: fire in main generator, and S3: blown fuse in electric circuit. Assume that the diagnosis is supposed to be given with respect to the classifiers C1 and C2. The beliefs involved could, for example, be given as numbers, in which case they are linearly transformed to real values over the interval [0,1]. For instance, the assessments with respect to classifier C1 could be the following:
• The belief in situation S1 is between 0.20 and 0.50.
• The belief in situation S2 is between 0.20 and 0.60.
• The belief in situation S3 is between 0.40 and 0.60.
• The belief in situation S2 is at least 0.10 stronger than that of S1.
Similar belief assessments can be asserted with respect to C2. Moreover, the weights of C1 and C2 may be estimated as numbers in the interval [0,1], where 0 denotes the lowest credibility and 1 the highest. Thus, the assessments about the classifiers could be:
• Classifier C2 is at least as credible as C1.
• The credibility of classifier C1 is between 0.30 and 0.70.
One further reason for allowing interval as well as comparative assessments is that the background information may have different sources.
For instance, intervals may arise naturally from aggregated quantitative information, whereas qualitative analyses often result in comparisons. Since the sources may be different, the assessments may not necessarily be consistent with each other.

¹ The possible situations in the system analysed are defined by logic schemata which define the relationship between different alarm scenarios and possible causes.

The belief estimates with respect to C1 are translated into the following expressions:
u11 ∈ [0.20, 0.50]
u21 ∈ [0.20, 0.60]
u31 ∈ [0.40, 0.60]
u21 ≥ u11 + 0.10
The credibilities of C1 and C2 are also represented as numbers in the interval [0,1], and the translation of the assessments above results in the following expressions:
w2 ≥ w1
w1 ∈ [0.30, 0.70]

Aggregations

One reasonable candidate for an aggregation principle could be based on a weighted sum of the beliefs, and the following notation will be used to define this with respect to an information frame representing n classifiers and m situations:

Definition: Given an information frame 〈S, K〉, the global belief G(Si) of a situation Si is G(Si) = Σk≤n wk · uik, where wk and uik are variables in S and K, respectively.

Definition: Given an information frame 〈S, K〉, the difference in global belief δij between two situations Si and Sj is δij = G(Si) – G(Sj) = Σk≤n wk · (uik – ujk), where wk, uik, and ujk are variables in S and K, respectively.

Definition: Given an information frame 〈S, K〉, let a and b be two vectors of real numbers (a1,...,an) and (b11,...,bmn), respectively. Then define abG(Si) = Σk≤n ak · bik, where ak and bik are numbers substituted for wk and uik in G(Si). Similarly, define abδij = abG(Si) – abG(Sj).

With respect to these definitions we can, for instance, express the concept of admissibility in the sense of [Lehmann, 1959]. The concept of admissibility is computationally meaningful in our framework, as demonstrated in [Danielson and Ekenberg, 1998].
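As a small worked instance of the last definition: once numbers are substituted for the weight and belief variables, abG(Si) is simply a dot product, and abδij the difference of two such products. The sketch below is our own illustration; the numeric values are arbitrary points chosen to satisfy the example's constraints.

```python
# Point instantiation of the definitions above: abG(S_i) is the dot
# product of a numeric weight vector a with beliefs b_i, and abδ_ij the
# difference of two such products.

def abG(a, b_i):
    """Global belief of situation S_i for numeric weights and beliefs."""
    return sum(ak * bik for ak, bik in zip(a, b_i))

def abdelta(a, b_i, b_j):
    """Difference in global belief between situations S_i and S_j."""
    return abG(a, b_i) - abG(a, b_j)

a = [0.4, 0.6]     # weights for C1 and C2: sum to one, w2 >= w1
b1 = [0.30, 0.40]  # beliefs in S1 according to C1 and C2
b2 = [0.45, 0.35]  # beliefs in S2 according to C1 and C2
print(round(abG(a, b1), 2))    # 0.4*0.30 + 0.6*0.40 = 0.36
print(abdelta(a, b2, b1) > 0)  # True: S2 is preferred at this point
```

A decision rule over the information frame would then, in effect, quantify over all such instantiations consistent with the constraint systems S and K rather than a single one.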
However, admissibility often seems to be too weak to form a decision rule by itself, and in [Danielson and Ekenberg, 1998; Ekenberg et al., 1997] we introduce a variety of discriminating principles for the case of decisions under risk. Furthermore, in non-trivial decision situations, when an information frame contains numerically imprecise information, the principles suggested above are sometimes too weak to yield a conclusive result. A way to refine the analysis is to investigate how much the different intervals can be contracted before an expression such as δij > 0 ceases to be consistent. This contraction avoids the complexity inherent in combinatorial analyses, but it is still possible to study the stability of a result by gaining a better understanding of how important the interval boundary points are. By co-varying the contractions of an arbitrary set of intervals, it is possible to gain much better insight into the influence of the structure of the information frame on the solutions. Contrary to volume estimates, contractions are not measures of the sizes of the solution sets but rather of the strength of statements when the original solution sets are modified in controlled ways. Both the set of intervals under investigation and the scale of the individual contractions can be controlled. Consequently, a contraction can be regarded as a focus parameter that zooms in on central sub-intervals of the full statement intervals.

Concluding Remarks

We have demonstrated how a set of vague and numerically imprecise information can be used to evaluate critical situations with respect to a set of classifiers. The approach considers this problem with respect to the different classifiers as well as the analysis of the different situations involved. These aspects are modelled into information frames consisting of systems of linear expressions stating inequalities and interval assessments.
The situations may be evaluated relative to a variety of principles, for example generalisations of the principle of maximising the expected utility. Contractions are introduced as an automated sensitivity analysis. This concept allows us to investigate critical variables and the stability of the results. It should also be noted that the model can be generalised to integrate other aspects, such as costs, utilities and probabilities, into the framework and thereby take into account a more general picture of the aspects involved in these kinds of decision situations.

References

Barzilai, J., Cook, W., and Golany, B. 1987. “Consistent Weights for Judgements Matrices of the Relative Importance of Alternatives,” Operations Research Letters, vol. 6, pp. 131–134.
Belton, V., and Gear, A. E. 1983. “On a Shortcoming of Saaty's Method of Analytical Hierarchies,” OMEGA, vol. 11, pp. 227–230.
Breiman, L. 1996. “Bagging Predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140.
Chen, S-J., and Hwang, C-L. 1992. Fuzzy Multiple Attribute Decision Making, vol. 375, Springer-Verlag.
Clemen, R. 1989. “Combining Forecasts: A Review and Annotated Bibliography,” International Journal of Forecasting, vol. 5, pp. 559–583.
Danielson, M., and Ekenberg, L. 1998. “A Framework for Analysing Decisions under Risk,” European Journal of Operational Research, vol. 104, no. 3.
Drucker, H., Cortes, C., Jackel, L., LeCun, Y., and Vapnik, V. 1994. “Boosting and Other Machine Learning Algorithms,” Proceedings of the Eleventh International Conference on Machine Learning, pp. 53–61, New Brunswick, NJ, Morgan Kaufmann.
Edwards, W. 1977. “How to Use Multiattribute Utility Measurement for Social Decisionmaking,” IEEE Transactions on Systems, Man, and Cybernetics, SMC-7:5, pp. 326–340.
Ekenberg, L., and Danielson, M. 1994. “A Support System for Real-Life Decisions in Numerically Imprecise Domains,” Proceedings of the International Conference on Operations Research '94, pp. 500–505, Springer-Verlag.
Ekenberg, L., Danielson, M., and Boman, M. 1996. “From Local Assessments to Global Rationality,” International Journal of Intelligent and Cooperative Information Systems, vol. 5, nos. 2–3, pp. 315–331.
Ekenberg, L., Danielson, M., and Boman, M. 1997. “Imposing Security Constraints on Agent-Based Decision Support,” Decision Support Systems International Journal, vol. 20.
Fishburn, P. C. 1970. Utility Theory for Decision Making, John Wiley and Sons.
Hansen, L., and Salamon, P. 1990. “Neural Network Ensembles,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993–1001.
Hashem, S., Schmeiser, B., and Yih, Y. 1994. “Optimal Linear Combinations of Neural Networks: An Overview,” Proceedings of the 1994 IEEE International Conference on Neural Networks, Orlando, FL.
Jacobs, R., Jordan, M., Nowlan, S., and Hinton, G. 1991. “Adaptive Mixtures of Local Experts,” Neural Computation, vol. 3, pp. 79–87.
Jordan, M., and Jacobs, R. 1994. “Hierarchical Mixtures of Experts and the EM Algorithm,” Neural Computation, vol. 6, pp. 181–214.
Kearns, M., and Seung, H. 1995. “Learning from a Population of Hypotheses,” Machine Learning, vol. 18, pp. 255–276.
Keeney, R., and Raiffa, H. 1976. Decisions with Multiple Objectives: Preferences and Value Trade-offs, John Wiley and Sons.
Keeney, R. L. 1992. Value-Focused Thinking: A Path to Creative Decision Making, Harvard University Press.
Krogh, A., and Vedelsby, J. 1995. “Neural Network Ensembles, Cross Validation, and Active Learning,” in G. Tesauro, D. Touretzky, and T. Leen, eds., Advances in Neural Information Processing Systems, vol. 7, MIT Press, Cambridge, MA.
Krovak, J. 1987. “Ranking Alternatives – Comparison of Different Methods Based on Binary Comparison Matrices,” European Journal of Operational Research, vol. 32, pp. 86–95.
Lai, Y-J., and Hwang, C-L. 1994. Fuzzy Multiple Objective Decision Making, vol. 404, Springer-Verlag.
Lehmann, E. L. 1959. Testing Statistical Hypotheses, John Wiley and Sons.
Lincoln, W., and Skrzypek, J. 1989. “Synergy of Clustering Multiple Back Propagation Networks,” in D. Touretzky, ed., Advances in Neural Information Processing Systems, vol. 2, pp. 650–659, Morgan Kaufmann, San Mateo, CA.
Lootsma, F. A. 1993. “Scale Sensitivity in the Multiplicative AHP and SMART,” Journal of Multi-Criteria Decision Analysis, vol. 2, pp. 87–110.
Maclin, R., and Shavlik, J. 1995. “Combining the Predictions of Multiple Classifiers: Using Competitive Learning to Initialize Neural Networks,” Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montreal, Canada.
Malmnäs, P-E. 1994. “Towards a Mechanization of Real Life Decisions,” in Logic and Philosophy of Science in Uppsala, Prawitz and Westerståhl, eds., Kluwer Academic Publishers.
Nowlan, S., and Hinton, G. 1990. “Evaluation of Adaptive Mixtures of Competing Experts,” in R. Lippmann, J. Moody, and D. Touretzky, eds., Advances in Neural Information Processing Systems, vol. 3, Morgan Kaufmann, San Mateo, CA.
Perrone, M., and Cooper, L. 1994. “When Networks Disagree: Ensemble Methods for Hybrid Neural Networks,” in R. J. Mammone, ed., Artificial Neural Networks for Speech and Vision, Chapman and Hall, London.
Quinlan, J. R. 1996. “Bagging, Boosting, and C4.5,” Proceedings of the Thirteenth National Conference on Artificial Intelligence.
Rogova, G. 1994. “Combining the Results of Several Neural Network Classifiers,” Neural Networks, vol. 7, pp. 777–781.
Rost, B., and Sander, C. 1993. “Prediction of Protein Secondary Structure at Better than 70% Accuracy,” Journal of Molecular Biology, vol. 232, pp. 584–599.
Saaty, T. L. 1977. “A Scaling Method for Priorities in Hierarchical Structures,” Journal of Mathematical Psychology, vol. 15, pp. 234–281.
Saaty, T. L. 1980. The Analytic Hierarchy Process, McGraw-Hill.
Saaty, T. L. 1982. Decision Making for Leaders, Van Nostrand Reinhold.
Sage, A. P., and White, C. C. 1984. “ARIADNE: A Knowledge-Based Interactive System for Planning and Decision Support,” IEEE Transactions on Systems, Man, and Cybernetics, SMC-14:1.
Salo, A. A., and Hämäläinen, R. P. 1995. “Preference Programming through Approximate Ratio Comparisons,” European Journal of Operational Research, vol. 82, pp. 458–475.
Watson, S. R., and Freeling, A. N. S. 1982. “Assessing Attribute Weights,” OMEGA, vol. 10, pp. 582–583.
Watson, S. R., and Freeling, A. N. S. 1983. “Comments on: Assessing Attribute Weights by Ratio,” OMEGA, vol. 11, p. 13.
Wolpert, D. 1992. “Stacked Generalisation,” Neural Networks, vol. 5, pp. 241–259.
Zhang, X., Mesirov, J., and Waltz, D. 1992. “Hybrid System for Protein Secondary Structure Prediction,” Journal of Molecular Biology, vol. 225, pp. 1049–1063.