From: AAAI Technical Report SS-94-06. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Decision-Theoretic Planning for Anticipating and Troubleshooting Faults

Paul O’Rorke (ororke@ics.uci.edu)
Department of Information and Computer Science
University of California, Irvine
Irvine, CA 92717-3425 USA

Introduction

Problems associated with anticipating and troubleshooting faults arise during the design and manufacture of complex devices and systems, in preparing to deploy them, and in operations subsequent to deployment. Prior to deployment, it is often useful to anticipate faults in the components of a system, to determine their consequences, to assess the associated risks, and to propose and decide upon actions that reduce the risks. This collection of problems is called "Failure Modes and Effects Analysis" (FMEA). When malfunctions occur after deployment, it is important to assess the situation and to recommend actions that will help determine the causes and minimize negative impacts. This is called "Failure Detection, Isolation, and Recovery" (FDIR).

Decision-theoretic concepts like likelihood, value, and expected utility are frequently useful in anticipating and troubleshooting faults. These concepts break down barriers and unify topics that have been identified as separate areas in previous work. For example, FMEA and FDIR can be seen to have much in common, although they have been pursued as independent topics in previous research and development. In FMEA, anticipating potential risks involves assessing the likelihoods of faults and the costs associated with their effects. Recommending actions to reduce the risks ought to take into account the costs of the actions and the likelihood that they will succeed. In FDIR, prior probabilities of faults are useful in determining the most likely explanations of abnormal behavior, and the main goal is to generate plans for gathering more information about the fault and for fixing or working around it. The costs of information gathering probes and tests ought to be taken into account and the value of the information they provide ought to be weighed in terms of savings in repair or recovery costs. So, even within FDIR, problems such as diagnosis and repair planning that have been viewed as separable problems can be seen to have much in common.

Research Summary

In collaboration with students and colleagues, I have developed methods that support decision-making and planning relevant to anticipating and troubleshooting faults. The methods employ decision-theoretic techniques for evaluating situations and actions. Model-based and qualitative reasoning methods are used to predict normal behaviors, consequences of actions and faults, and to decide trade-offs. The following summary provides a brief overview of our research, descriptions of our methods and results, and pointers to relevant papers that contain more information.

In early work [El Fattah and O’Rorke, 1991] we developed several architectures integrating model-based diagnosis and explanation-based learning. We found that precompilation or "learning in advance" is often much more efficient than the traditional EBL technique for "learning while doing" in the context of model-based diagnosis of multiple faults. But precompilation is only feasible for relatively small devices, so we developed methods for trading accuracy for improvements in efficiency [El Fattah and O’Rorke, 1992, 1993a]. In one architecture, the user prespecifies a desired accuracy rate. The method employed by the system uses (learned) associational knowledge unless the accuracy dips below the desired threshold; then performance is considered to be unsatisfactory and the system switches from associational to model-based reasoning until accuracy returns to the desired range. One can get dramatic improvements in efficiency using this approach with a relatively small loss in accuracy. This is because a small number of rules tend to cover problems that occur frequently.

More recently, we have developed a new architecture called the Generic Troubleshooting System (GTS) that integrates diagnosis, repair planning, execution, and learning. GTS exploits information about the likelihoods of failures and the costs of probes and replacements [El Fattah and O’Rorke, 1993b,c]. Experimental results show that the system, which uses expected utility and the value of information, does well as compared with systems based on previous information-theoretic and purely probabilistic methods. It is often quicker at producing less expensive information-gathering and repair plans. During diagnosis and repair planning, GTS learns prediction macros and action rules that form re-entrant decision tree structures. Experimental results show that the learning techniques incorporated in GTS effectively provide speed-up and take good advantage of non-uniform distributions of faults.

We have also developed methods supporting diagnostic decision-making including qualitative and order-of-magnitude reasoning and EBL methods. An architecture called AbMaL (short for Abductive Macro Learner) [O’Rorke, to appear], uses abduction and explanation-based learning to acquire information including expert’s preferences and assessments of likelihoods. Computational experiments show that a coarse qualitative reasoning system [O’Rorke et al., 1993b] that can be learned using EBL [O’Rorke et al., 1993a] covers a disproportionately large number of important decision problems.
Mathematical analysis confirms and explains these experimental results; a more sophisticated method based on order-of-magnitude reasoning should make it possible to improve them [O’Rorke and El Fattah, 1993].

In contrast to FDIR, work on AI methods for partially automating FMEA has just begun [O’Rorke, 1992; Wirth and O’Rorke, 1993]. AI methods for predictive simulation, from simple forward chaining to model-based qualitative reasoning, have been applied with some success to the problem of determining consequences of component failures [O’Rorke, 1993]. An important problem is that the number of possible FMEAs can be very large for complex systems, especially if multiple faults are considered. I have developed a method for FMEA based on a decision-theoretic extension of David Poole’s Probabilistic Horn Abduction (PHA). The new method, which I call Decision-Theoretic Horn Abduction (DTHA) [O’Rorke, 1994], has the advantage that it focuses on the most important explanations (e.g., the ones with the smallest expected utility). This method is relevant to FMEA because FMEA is most concerned with the most likely failures and the most costly consequences.

Future Work

An important FMEA problem that has not yet been targeted for computerization is the problem of generating suggested actions aimed at reducing risks. This is a good target for AI research on decision-theoretic planning.

An important problem associated with FDIR is that diagnosis has been considered in isolation, and this has led to diagnosis methods that do not work well in practice within the larger FDIR picture. For example, most AI work on diagnosis presumes that faults must be localized before repair actions can be taken. But it is obviously not worthwhile to determine the exact location of a fault when it makes no difference with respect to the repair actions available.
(Determining exactly which gate is faulty in an integrated circuit may not be useful if the entire chip must be replaced in order to fix the fault.) Superior troubleshooting methods will be developed (and indeed already are being developed at several sites) using different forms of decision-theoretic planning.

In the future, more work is also needed that integrates AI approaches to FMEA, model-based diagnosis (MBD), and planning-based approaches to repair. Most work on FDIR and MBD has been done independently of FMEA, but it seems clear that this approach will lead to wasteful duplication. An integrated approach will exploit overlap and eliminate duplication. For example, one model and one simulation system can be used for both FMEA and MBD/FDIR. One decision-theoretic planner can be used to find actions that can reduce the costs of failures before they occur (in FMEA) and after they have occurred (in MBD and FDIR). Only the knowledge provided to the planner, for example about actions available in these situations, needs to be different.

Relevant Papers

[El Fattah and O’Rorke, 1991] Y. El Fattah and P. O’Rorke. Learning multiple fault diagnosis. In Proceedings of the Seventh IEEE Conference on Artificial Intelligence Applications, pages 235-239, Los Alamitos, CA, 1991, IEEE Computer Society Press.

[El Fattah and O’Rorke, 1992] Y. El Fattah and P. O’Rorke. Learning approximate diagnosis. In Proceedings of the Eighth IEEE Conference on Artificial Intelligence Applications, pages 150-156, Los Alamitos, CA, 1992, IEEE Computer Society Press.

[El Fattah and O’Rorke, 1993a] Y. El Fattah and P. O’Rorke. Explanation-based learning for diagnosis. Machine Learning, 13(1): 35-70, 1993.

[El Fattah and O’Rorke, 1993b] Y. El Fattah and P. O’Rorke. GTS: Improved troubleshooting via learning and utility evaluation. In The Fourth International Workshop on Principles of Diagnosis, pages 187-200, Aberystwyth, Wales, 1993.

[El Fattah and O’Rorke, 1993c] Y. El Fattah and P. O’Rorke. Utility of action in model-based troubleshooting. In Computing in Aerospace 9, pages 406-414, Washington, DC, 1993, AIAA.

[O’Rorke, 1992] P. O’Rorke. Failure modes and effects analysis: Opportunities for automation. 1992.

[O’Rorke, 1993] P. O’Rorke. The Wales FMEA project: Summary and evaluation. 1993.

[O’Rorke, 1994] P. O’Rorke. Focusing on the most important explanations: Decision-theoretic Horn abduction. 1994.

[O’Rorke, to appear] P. O’Rorke. Abduction and explanation-based learning: Case studies in diverse domains. Computational Intelligence, 10(4).

[O’Rorke and El Fattah, 1993] P. O’Rorke and Y. El Fattah. Qualitative and order of magnitude reasoning about diagnostic decisions. In The Fourth International Workshop on Principles of Diagnosis, pages 136-148, Aberystwyth, Wales, 1993.

[O’Rorke et al., 1993a] P. O’Rorke, Y. El Fattah, and M. Elliott. Explaining and generalizing diagnostic decisions. In Machine Learning: Proceedings of the Tenth International Conference, pages 228-235, San Mateo, CA, 1993, Morgan Kaufmann.

[O’Rorke et al., 1993b] P. O’Rorke, Y. El Fattah, and M. Elliott. Explanation-based learning for qualitative decision-making. In N. Piera-Carreté and M. G. Singh, Eds., Qualitative Reasoning and Decision Technologies, pages 409-418. CIMNE, Barcelona, 1993.

[Wirth and O’Rorke, 1993] R. Wirth and P. O’Rorke. Representing and reasoning about functions for failure modes and effects analysis. In Working Notes of the AAAI-93 Workshop on Reasoning about Function, Washington, DC, 1993.
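
Illustrative Sketch

The Research Summary and Future Work sections argue that the cost of a probe should be weighed against the repair cost it is expected to save, and that localizing a fault is pointless when it does not change the repair action. The following minimal Python sketch illustrates that reasoning. It is not the GTS algorithm. It assumes a single fault, a perfect probe, strictly positive replacement costs, and replacement in decreasing probability-to-cost order. The function names, component names, probabilities, and costs are all hypothetical.

```python
from typing import Dict, Set, Tuple

# Hypothetical representation: component name -> (fault probability, replacement cost).
Candidates = Dict[str, Tuple[float, float]]


def expected_repair_cost(candidates: Candidates) -> float:
    """Expected cost of replacing components one at a time until the fault is fixed.

    Assumes exactly one faulty component, strictly positive costs, and that
    each replacement reveals whether the device now works. Components are
    tried in decreasing probability-to-cost ratio (an assumed strategy).
    """
    total_p = sum(p for p, _ in candidates.values())
    if total_p == 0:
        return 0.0
    order = sorted(candidates.values(), key=lambda pc: pc[0] / pc[1], reverse=True)
    expected, spent = 0.0, 0.0
    for p, cost in order:
        spent += cost                      # cumulative cost paid if we get this far
        expected += (p / total_p) * spent  # weight by normalized fault probability
    return expected


def value_of_probe(candidates: Candidates, suspects: Set[str]) -> float:
    """Expected repair savings from a perfect probe that reports whether
    the fault lies inside `suspects` (probe cost not included)."""
    inside = {k: v for k, v in candidates.items() if k in suspects}
    outside = {k: v for k, v in candidates.items() if k not in suspects}
    total_p = sum(p for p, _ in candidates.values())
    p_inside = sum(p for p, _ in inside.values()) / total_p
    after = (p_inside * expected_repair_cost(inside)
             + (1.0 - p_inside) * expected_repair_cost(outside))
    return expected_repair_cost(candidates) - after


if __name__ == "__main__":
    # Made-up device with three replaceable components.
    device = {
        "power_supply": (0.6, 50.0),
        "controller": (0.3, 200.0),
        "sensor": (0.1, 20.0),
    }
    print("expected repair cost:", expected_repair_cost(device))
    print("value of probing the controller:", value_of_probe(device, {"controller"}))

    # Two faulty-gate hypotheses that share one repair action (replace the chip)
    # collapse to a single candidate, so localizing the gate saves nothing.
    chip = {"chip": (1.0, 30.0)}
    print("value of localizing within the chip:", value_of_probe(chip, {"chip"}))
```

With these made-up numbers the expected repair cost is about 118 and the controller probe is worth about 21, so under these assumptions the probe pays off only if it costs less than that. For the chip the value is zero, matching the integrated-circuit example in Future Work.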