Parametric Fault Trees with Dynamic Gates and Repair Boxes

Andrea Bobbio, Università del Piemonte Orientale, Alessandria
Daniele Codetta R., Università del Piemonte Orientale, Alessandria

Key Words: Parametric fault tree, modularization, dynamic gate, repair box, Colored Petri net.

SUMMARY & CONCLUSIONS

A new approach is proposed to include s-dependencies in Fault Tree (FT) models. With respect to previous techniques, the approach presented in this paper rests on two distinctive features. First, we adopt a parameterization technique, referred to as Parametric FT (PFT), to fold equal subtrees (or basic events) and obtain a more compact FT representation. It is shown that parameterization can be conveniently adopted for dynamic gates as well. Second, a PFT can be modularized and each module translated into a High Level Colored Petri net in the form of a Stochastic Well-formed Net (SWN). SWN generate a lumped Markov chain, and the saving in the dimension of the state space can be very substantial with respect to standard (non colored) Petri nets. Translation of PFT modules into SWN has proved to be very flexible, and various kinds of new dependencies can be easily accommodated. In order to exploit this flexibility a new primitive, called repair box, is introduced. A repair box, attached to an event, starts a repair activity on all the components that are failed when the event occurs. In contrast to all the previous FT based models, the addition of repair boxes enables our approach to model cyclic behaviors. We refer to the proposed approach as Dynamic Repairable PFT (DRPFT). A tool supporting DRPFT is briefly described, and the tool is validated by analyzing a benchmark proposed recently in the literature for quantitative comparison (Ref. 12).

1. INTRODUCTION

Traditional Fault Trees (FT) have gained widespread acceptance for the dependability and safety analysis of complex and critical systems, since they are simple to manipulate and are supported by powerful software tools for qualitative and quantitative analysis. However, traditional FT suffer from the main limitation that basic components must be assumed to be s-independent. S-dependence in the failure process arises when the failure behavior of a component depends on the state of the system. This kind of s-dependence has been recently tackled by many authors (Refs. 1, 2, 3, 4, 5). In the Dynamic FT (DFT) approach (Refs. 1, 3), the FT is decomposed into independent modules and each module is analyzed by generating its state space and solving the underlying continuous-time Markov chain (CTMC), or by solving the local dependencies by means of numerical techniques (Ref. 5). All the above FT models are acyclic, and no previous technique has addressed the problem of including in a FT actions, taken after a fault, that restore the system to a previous condition (repair, recovery, roll-back, rejuvenation), thus converting an acyclic model into a cyclic one.

In order to alleviate the state space largeness problem, a more compact FT representation has been proposed in Refs. 6, 7. This compact representation, referred to as Parametric FT (PFT), is based on the observation that often, due to redundancies, the system to be modeled contains similar replicated units or subtrees. Similar subtrees may be folded and parameterized, so that only one representative is explicitly included in the model. A PFT can still be modularized starting from the algorithm presented in Ref. 8, and each module can be automatically converted into a high level Colored Petri Net formalism called SWN (Ref. 9). SWN have the property that they generate symbolic states (markings) that may be viewed as a high level description of sets of actual markings.
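As a rough illustration of what a symbolic marking stands for, consider n identical two-state components. The sketch below is not the SWN symbolic reachability algorithm of Ref. 9; it only groups the ordinary markings by the number of failed components, under the assumption (made here for illustration) that the components are fully interchangeable. The function names are hypothetical.

```python
from itertools import product
from math import comb


def ordinary_markings(n):
    """All markings of n distinguishable two-state components (0 = up, 1 = failed)."""
    return list(product((0, 1), repeat=n))


def symbolic_markings(n):
    """Group the ordinary markings by the number of failed components.

    Each key plays the role of one symbolic marking; its value is the set of
    ordinary markings that the symbolic marking represents.
    """
    classes = {}
    for marking in ordinary_markings(n):
        classes.setdefault(sum(marking), []).append(marking)
    return classes


n = 4
classes = symbolic_markings(n)
print(len(ordinary_markings(n)), "ordinary markings")  # 2**n of them
print(len(classes), "symbolic markings")               # n + 1 of them
for failed, members in sorted(classes.items()):
    # the class with k failed components groups C(n, k) ordinary markings
    assert len(members) == comb(n, failed)
```

With interchangeable components the number of aggregated states grows linearly with n, while the number of ordinary markings grows as 2**n.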
The definition of symbolic markings allows us to exploit symmetry properties in the model and to generate the underlying Markov chain in lumped form. The degree of saving in the state space generation depends on the redundancies present in the system and can be very substantial (Ref. 6).

The aim of the present paper is to present an extended version of PFT that we call DRPFT (Dynamic Repairable Parametric FT). DRPFT implements dynamic gates in compact parametric form. Moreover, DRPFT is extended to include dependencies arising from the repair process, by adding a new primitive called repair box. The solution procedure for a DRPFT is presented and a software tool developed for the analysis is briefly described. Finally, the DRPFT is validated through a benchmark example taken from Ref. 12. The quantitative results obtained from DRPFT coincide with those published in Ref. 12, but the example emphasizes that the dimension of the state space obtained with the DRPFT approach is more than two orders of magnitude lower than the one obtained by previous non parametric techniques.

2. ACRONYMS

FT       fault tree
P(D)FT   parametric (dynamic) fault tree
DRPFT    dynamic repairable parametric fault tree
FDEP     functional dependency gate
PAND     priority and gate
SEQ      sequence enforcing gate
WSP      warm spare gate
SWN      stochastic well-formed nets

3. DYNAMIC REPAIRABLE PARAMETRIC FT

PFT have been extensively discussed in Refs. 6 and 7. By introducing a new event, called replicator event (and drawn as a dashed rectangle), similar subtrees (or basic events) can be folded and parameterized, by defining in the replicator event a parameter with its range of variation. Replicator events are, thus, a compact construct to generate as many identical subtrees as the cardinality of the declared parameter. PFT with AND, OR, and K:N gates can be automatically converted (Ref. 6) into a Colored Petri net in the form of a SWN (Ref. 9).
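The folding performed by a replicator event can be pictured through its inverse, the unfolding into ordinary events. The sketch below is only an illustration: the `Event` class and the `unfold` function are hypothetical names, and it assumes that every event below a replicator event carries the replicator's parameter, so that the whole subtree is copied once per parameter value (events that do not carry the parameter, and are therefore shared among replicas, are not handled).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Event:
    name: str
    gate: Optional[str] = None            # None marks a basic event
    inputs: List["Event"] = field(default_factory=list)
    replicas: int = 1                     # > 1 marks a replicator event


def unfold(event: Event, tag: str = "") -> List[Event]:
    """Expand `event` into the ordinary events it stands for."""
    copies = []
    for i in range(1, event.replicas + 1):
        # a replicator event appends the parameter value to the inherited tag
        new_tag = f"{tag}_{i}" if event.replicas > 1 else tag
        inputs = [e for child in event.inputs for e in unfold(child, new_tag)]
        copies.append(Event(event.name + new_tag, event.gate, inputs))
    return copies


def count_events(event: Event) -> int:
    """Number of events (gates and basic events) in a tree."""
    return 1 + sum(count_events(child) for child in event.inputs)


# Folded model: TOP = OR over SUB(i), i = 1..3, with SUB(i) = AND(A(i), B(i)).
pft = Event("TOP", "OR", [
    Event("SUB", "AND", [Event("A"), Event("B")], replicas=3),
])

ft = unfold(pft)[0]
print([sub.name for sub in ft.inputs])
print(count_events(pft), "events folded,", count_events(ft), "events unfolded")
```

The size of the folded model does not change with the cardinality of the parameter; only the unfolded tree grows.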
Notice that a PFT with no replicator events becomes a standard FT, and its automatic translation into a SWN provides a standard (non colored) PN.

In the following, we extend the PFT formalism to include the dynamic gates proposed in Refs. 1 and 3. In particular, we consider the dynamic gates FDEP (functional dependency gate), PAND (priority and gate), SEQ (sequence enforcing gate) and WSP (warm spare gate). We show that the dynamic gates can be parameterized (when the proper conditions arise) and translated into a SWN. Finally, we introduce a new primitive, called repair box. A repair box is assigned a constant repair rate and can be connected to any event in the PFT; it indicates the repair (with the assigned repair rate) of all the basic components that are failed when the event occurs. The PFT formalism, augmented with dynamic gates and repair boxes, is referred to as Dynamic Repairable PFT (DRPFT).

The analysis of a DRPFT follows a classical hierarchical scheme (Refs. 2, 10). The DRPFT structure is first modularized, i.e. partitioned into s-independent subtrees, called modules. Each module is converted into a Petri net in the form of a SWN and analyzed in isolation by resorting to the underlying lumped CTMC. The module failure probability, computed from the resulting CTMC, is cast back into the original DRPFT, by replacing the whole module with a single basic event. All the above steps are automated in a software tool and hidden from the modeler.

4. DYNAMIC GATES IN DRPFT AND THEIR TRANSLATION IN SWN

When suitable symmetry conditions arise, dynamic gates can be represented in compact parametric form, and then automatically translated into a SWN. The way in which this procedure is implemented in DRPFT is illustrated in the following paragraphs. The graphical symbols adopted for the dynamic gates are those introduced in Ref. 3.

4.1 FDEP

A FDEP gate is characterized by a trigger event and a set of dependent events. Dependent events may fail on their own or as an effect of the trigger event failure. In Fig. 1a, T is the trigger event while A and B are the dependent events. Since the FDEP in Fig. 1a has no replicator events, its translation is in the form of the standard PN of Fig. 1b. Failure of component A is represented by a token in place A, which can be produced by the firing of transition A_f (failure of A on its own) or by the firing of transition fdep_2 (failure of the trigger event T).

Fig. 1a: FDEP gate
Fig. 1b: Petri net representation of FDEP

If the dependent components are identical, they can be folded and parameterized as in Fig. 2a. T is the trigger event while D(i) is a replicator event providing the parametric representation of the set of dependent components. If D(i) has cardinality equal to 2, the DRPFT of Fig. 2a coincides with the DFT of Fig. 1a. However, the parameter i can have any cardinality and, hence, can represent a FDEP gate with any number of identical dependent components. Fig. 2b shows the corresponding SWN. The failure of one of the dependent components is represented by a colored token in place D and may be caused either by the firing of transition D_f (failure of one of the D(i)’s) or by the firing of transition fdep_2 (failure of T). Notice again that, in contrast to the FT of Fig. 1a, the complexity of the DRPFT structure of Fig. 2a does not depend on the number of dependent components, which only influences the cardinality of the set D(i).

Fig. 2a: PFT-FDEP gate
Fig. 2b: SWN representation of PFT-FDEP

4.2 PAND

A PAND gate fails if all of its inputs fail in a specified order (from left to right). Let us consider the gate in Fig. 3a: its input events are A and B and they have to fail in this order. Since the PAND in Fig. 3a has no replicator events, its translation is in the form of the standard PN of Fig. 3b. Transition pand_2 fires if B fails and A is still working; this transition puts a token in place Oper to indicate that the order has not been respected and the gate failure did not occur. Otherwise, if A fails before B, transition pand_1 fires putting a token in the failure place PAND_fail. A PFT construct can be envisaged if the PAND gate has more than two identical ordered input events.

Fig. 3a: PAND gate
Fig. 3b: Petri net representation of PAND

4.3 SEQ

A SEQ gate forces its inputs to occur in a specified order (we assume from left to right). The translation of the SEQ gate of Fig. 4a into a Petri net is shown in Fig. 4b, where the transition B_f, representing the failure of B, is enabled and fires in the presence of a token in place A (A is failed). In a similar way, the failure of C is enabled by the failure of B.

Fig. 4a: SEQ gate
Fig. 4b: Petri net representation of SEQ

4.4 WSP

A WSP gate is characterized by a main component and a set of ordered spare components. When the main component fails it is replaced by the first component available in the spare list. A spare may be in one of the following states: dormant or stand-by (it is not working, but ready to replace the main component); working (it is working in place of the main component, which is failed); failed. The failure rate of a spare component when in a working condition is λ. Denoting by α (0 ≤ α ≤ 1) the dormancy factor, the failure rate of the spare in the dormant condition is αλ. Notice that α = 0 models a cold stand-by, while α = 1 models the hot s-independent case (the WSP behaves as an AND gate). The WSP gate fails when the main component fails and there are no available spares.

Assuming that there are m identical spares, we can model the spares by means of the replicator node SP(i), in which the parameter i of cardinality m is defined (see Fig. 5a). Hence, in the DRPFT representation (Fig. 5a) the WSP gate has two inputs:
- a basic event P representing the failure of the main component;
- a replicator basic event SP(i) that is the parametric representation of the set of spares; the cardinality m of this set is equal to the number of spares.

Fig. 5a: PFT-WSP gate

The translation of the WSP gate of Fig. 5a is given in the SWN of Fig. 5b. Place SP_na contains the colored tokens of the spares which are not available because they are failed or already working; SP_curr contains the token relative to the spare which is currently replacing the main component. Transition SP_fail models the fault of a spare when in dormant condition, putting the relative token in SP_na. When the main component P fails (token in place P_dn), transition P_spare fires putting the token relative to the spare to be used in SP_curr and SP_na. If later the spare fails (firing of transition SP_fail) and place SP_na contains a number of tokens equal to the number of spares (there are no more available spares at the moment), transition P_fail fires modeling the general failure of the gate; otherwise another spare starts working by means of the P_spare transition.

Fig. 5b: SWN representation of a WSP

The SWN of Fig. 5b is actually more general than the corresponding WSP gate of Fig. 5a. Indeed, by assigning a color class to place P_dn we can model a situation in which there are n main components with m shared (or non-shared) spares.

5. MODULES DETECTION AND CLASSIFICATION

A module is a subtree that is s-independent from the rest of the FT. In a DRPFT, a subtree is a module when it has no nodes in common with other modules, does not descend from a dynamic gate or does not contain a repair box. However, the parameterization of the FT hinders the search for shared basic events, since additional conditions on the parameter definition and propagation need to be satisfied (Ref. 7). The example in Fig. 6 clarifies this point.

Fig. 6: DPFT structural modules
Fig. 7: reduced PFT

A module may be classified as static or dynamic.
Static modules contain common basic events (possibly in parameterized form) and can be analyzed by means of suitable combinatorial techniques. Dynamic modules contain dynamic gates or repair boxes and require a state-space analysis which, in the DRPFT methodology, is obtained by translating the dynamic module into a SWN. Dynamic modules are analyzed in isolation, and replaced in the original FT by a single basic event to which the module Top Event probability is assigned.

Tab. 1: modules classification
Header labels: Structural Module; Shared nodes (STEP 1); Shared Param. nodes (STEP 2); Dyn. Gate; Descendant (STEP 3); Min.; Mod. Type
Structural modules: SYS1, SUB(i), SYS2, D_F, SYS3, Q_F(i)
Cell values: no yes no no no no no no no yes no no yes yes no yes no no no no no no yes no yes no yes yes stat. dyn. dyn. dyn.

The module detection algorithm proceeds in three steps. In the first step, a structural analysis of the FT is performed, neglecting the specific nature of the gates. In this step, applying the linear algorithm described in Ref. 8, the subtrees with no shared nodes are identified; these are called structural modules. Structural modules are passed through steps 2 and 3. In step 2, appropriate conditions on the parameters defined in the replicator events are checked in order to verify the presence of parameterized common events. In step 3, it is checked whether the structural module does not descend from a dynamic gate and does not contain repair boxes. A dynamic module is minimal if it does not contain modules of any nature; minimal dynamic modules are those to be detached, analyzed apart and replaced in the original DRPFT.

Let us consider the DRPFT example in Fig. 6. The first step (algorithm in Ref. 8) locates the structural modules, which are encircled by dotted lines in Fig. 6. Then, each module is passed through the subsequent two steps. The module SUB(i) has shared parameterized common events.
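The classification performed after step 1 can be sketched as follows. This is a simplified reading of the procedure and not the tool's implementation: the structural modules are taken as given, step 2 (the parameter conditions) is omitted, every node is assumed to have a single parent, and a structural module that descends from a dynamic gate is simply not treated as a module of its own. The tree, the node names and the function names are hypothetical.

```python
DYNAMIC_GATES = {"FDEP", "PAND", "SEQ", "WSP"}

# node -> (gate type or None for a basic event, inputs, repair box attached?)
tree = {
    "TOP": ("OR",   ["S1", "S2"], False),
    "S1":  ("AND",  ["a", "b"],   False),
    "S2":  ("PAND", ["D", "c"],   False),
    "D":   ("OR",   ["d", "e"],   False),
}
for basic in "abcde":
    tree[basic] = (None, [], False)


def descendants(tree, root):
    """All nodes of the subtree rooted at `root`, root included."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(tree[node][1])
    return seen


def classify(tree, structural_modules):
    """Map each retained module to (type, contains no other module)."""
    parents = {c: p for p, (_, kids, _) in tree.items() for c in kids}

    def below_dynamic_gate(node):
        while node in parents:
            node = parents[node]
            if tree[node][0] in DYNAMIC_GATES:
                return True
        return False

    modules = [m for m in structural_modules if not below_dynamic_gate(m)]
    result = {}
    for m in modules:
        nodes = descendants(tree, m)
        dynamic = any(tree[n][0] in DYNAMIC_GATES or tree[n][2] for n in nodes)
        minimal = not any(o != m and o in nodes for o in modules)
        result[m] = ("dyn." if dynamic else "stat.", minimal)
    return result


result = classify(tree, ["TOP", "S1", "S2", "D"])
print(result)
```

In this toy tree, D is a structural module but lies below the PAND gate, so the module to be detached and analyzed through its state space is S2; S1 stays static and can be handled by combinatorial techniques.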
Indeed, the parameter declared in the replicator event SUB(i) differs from the parameter declared in the replicator event B(j). Hence, each replica generated by SUB(i) shares all the replicas generated by B(j). The minimal (static) module is, therefore, SYS1. Structural module D_F descends from a dynamic PAND gate, and the minimal (dynamic) module is, therefore, SYS2. Structural module Q_F contains a dynamic gate and turns out to be a dynamic module. The result of the modularization procedure is reported in Table 1. After each dynamic module has been replaced, the reduced FT structure is the one shown in Fig. 7, which can be solved by any traditional technique.

6. DRPFT TOOL OVERVIEW

The tool supporting the DRPFT formalism is DrawNet (Ref. 11). DrawNet has a flexible graphical interface (that can be adapted to any graph-like model) and saves the graphical structure into an XML file. The XML file is passed to the DRPFTproc block, which detects the modules of the FT and their (static or dynamic) nature. For each minimal dynamic module an XML file is generated and passed to the translator block from DRPFT to SWN. Then, a transient analysis of the SWN representing the dynamic module is performed, computing the module top event probability at a mission time specified by the user. The result is passed back to DRPFTproc and the dynamic module is replaced in the original DRPFT by a basic event whose failure probability is constant and equal to the result of the transient analysis. This procedure is iterated until all the dynamic modules have been analyzed and replaced; finally, the resulting (non dynamic) PFT is analyzed by any traditional technique for FT. Fig. 8 sketches the flow chart of the tool.

Fig. 8: tool overview

7. BENCHMARK ANALYSIS

In order to verify the correctness of the described procedure and to test the quantitative results provided by the tool described in the previous section, we have applied the DRPFT approach to a benchmark that has been specifically proposed in Ref.
12 for quantitative comparison (Fig. 9). The benchmark is composed of an OR gate whose input events are 8 WSP’s that share 2 spares (S1 and S2). Since in Ref. 12 all the components are assumed to be identical, we can fold them using the parameterization technique described in Section 4.4. The DRPFT version of the same example is reported in Fig. 9b. The replicator event Q_F(i), of cardinality 8, generates 8 identical subtrees that model the main components. The replicator event S(j), of cardinality 2, generates the two spares shared by the 8 main components. It should be remarked that the FT in Fig. 9a (and 9b) forms a single dynamic module and must be analyzed as a whole resorting to its state space representation.

Fig. 9a: benchmark (Ref. 12)
Fig. 9b: DRPFT version of the benchmark

The advantage of using the compact DRPFT representation of Fig. 9b comes from the fact that the analysis is based on the translation into a SWN that exploits the high level of symmetry of the example by directly generating the CTMC in lumped form. The lumped CTMC generated by the DRPFT tool contains 35 states. The number of CTMC states generated by the tool Galileo in the original benchmark is not known from Ref. 12, but can be estimated by unfolding the SWN. In this way, the estimated number of states for the CTMC generated by Fig. 9a is 5898. The saving using the DRPFT approach is more than two orders of magnitude with respect to a pure CTMC analysis. The results for the transient analysis at different mission times obtained from the DRPFT tool are compared in Table 2 with those reported in Ref. 12 and obtained from the Galileo tool, and turn out to be coincident.

Tab. 2: comparison of results

λ         α     t       DRPFT Unreliability   Galileo Unreliability
1.0E-06   0.1   8766    5.65660E-05           5.66E-05
1.0E-06   0.1   43830   5.72744E-03           5.73E-03
1.0E-06   0.1   87660   3.53699E-02           3.54E-02

8. REPAIR BOXES

In order to model the repair of failed components, we have introduced in the DRPFT formalism a new primitive called repair box (Ref. 11). A repair box may be connected to any event with the following meaning: when the event occurs, the repair box becomes enabled and starts repairing all the components that are failed in the subtree whose root is the event under consideration. Every repair box has a repair rate, representing the exponentially distributed time necessary to complete the repair. The use and the effect of this new construct are illustrated by means of the following example, in which the WSP gate of Fig. 5a is taken as a base model. A single repair box is added in Fig. 10 and two repair boxes are added in Fig. 11.

Fig. 10a: repair box connected to the main component
Fig. 10b: main component repair SWN

In Fig. 10a, a repair box is attached to the main component of the WSP. The effect of this repair box is to model the repair of the main component only while a spare is replacing it; when the repair ends, the spare is returned to a dormant condition while the main component is put back in operation. Failure of the system occurs when the main component is under repair and there are no more available spares. With respect to Fig. 5, the SWN of Fig. 10b, resulting from the translation of the DRPFT of Fig. 10a, contains the new transition named P_repair. When P_repair fires, the main component returns to the working state (P_dn becomes empty) and the spare that was replacing the main component returns to the stand-by state (its token is removed from SP_curr) and can be used again if necessary (its token is removed from SP_na too).

Fig. 11a: repair boxes connected to main and spare components
Fig. 11b: SWN translated from Fig. 11a

In Fig. 11a, a repair box is attached also to the (replicator) event modeling the spares. The effect of this second repair box is to model the repair of the failed spares as well. When a spare fails (either in dormant or operating condition) a repair action is started and the spare under repair is replaced by the first available spare in the list. The resulting SWN is shown in Fig. 11b. We have added place SP_dn, containing the tokens relative to failed spares, and the spare repair transition SP_repair, whose firing removes the token (of the same color) from SP_na and SP_dn in order to return the spare to its available state. The failure condition of this system (Top Event) is reached when the main component and all the spares are in a failed condition at the same time.

The Top Event (TE) unreliability has been computed using the DRPFT tool for different mission times and for the three cases (no repair - Fig. 5, one repair box - Fig. 10 and two repair boxes - Fig. 11). The results are summarized and compared in Table 3, where the assumed values for the failure rate λ, the dormancy factor α and the repair rate (common to the two repair boxes) are also reported. Looking at the results of Table 3, the effect of the repair boxes is to reduce the probability of reaching the system failure state, as expected.

Tab. 3: results with repair boxes (columns: Time (h); Fig. 5 TE unreliability; Fig. 10 TE unreliability; Fig. 11 TE unreliability)

ACKNOWLEDGMENTS

The work documented in this paper has been partially supported by MIUR under Grant FIRB-Perf-RBNE019N8N.

REFERENCES

1. J. B. Dugan, S. J. Bavuso, M. A. Boyd, “Dynamic Fault-Tree Models for Fault-Tolerant Computer Systems”, IEEE Transactions on Reliability, vol 41, 1992, pp 363-377.
2. A. Anand, A. K. Somani, “Hierarchical Analysis of Fault Trees with Dependencies, Using Decomposition”, Proc Annual Reliability and Maintainability Symposium, 1998, pp 69-75.
3. R. Manian, D. W. Coppit, K. J. Sullivan, J. B. Dugan, “Bridging the Gap Between Systems and Dynamic Fault Tree Models”, Proceedings Annual Reliability and Maintainability Symposium, 1999, pp 105-111.
4. A. Bobbio, L. Portinale, M. Minichino, E.
Ciancamerla, “Improving the Analysis of Dependable Systems by Mapping Fault Trees into Bayesian Networks”, Reliability Engineering and System Safety, vol 71, 2001, pp 249-260.
5. S. Amari, G. Dill, E. Howald, “A new approach to solve dynamic fault-trees”, Proceedings IEEE Annual Reliability and Maintainability Symposium, 2003.
6. A. Bobbio, G. Franceschinis, R. Gaeta, L. Portinale, “Parametric Fault-Tree for the Dependability Analysis of Redundant Systems and its High Level Petri Net Semantics”, IEEE Transactions on Software Engineering, vol 29, 2003, pp 270-287.
7. A. Bobbio, G. Franceschinis, R. Gaeta, L. Portinale, “Dependability Assessment of an Industrial Programmable Logic Controller via Parametric Fault-Tree and High Level PN”, Proc 9th International Workshop on Petri Nets and Performance Models, 2001, pp 29-38.
8. Y. Dutuit, A. Rauzy, “A Linear-Time Algorithm to Find Modules of Fault Trees”, IEEE Transactions on Reliability, vol 45, 1996, pp 422-425.
9. G. Chiola, C. Dutheillet, G. Franceschinis, S. Haddad, “Stochastic Well-Formed Colored Nets and Symmetric Modeling Applications”, IEEE Transactions on Computers, vol 42, 1993, pp 1343-1360.
10. J. B. Dugan, K. J. Sullivan, D. Coppit, “Developing a Low-Cost High-Quality Software Tool for Dynamic Fault-Tree Analysis”, IEEE Transactions on Reliability, vol 49, 2000, pp 49-59.
11. V. Vittorini, G. Franceschinis, M. Gribaudo, M. Iacono, N. Mazzocca, “DrawNet: Model Objects to Support Performance Analysis and Simulation of Complex Systems”, 12th Int Conf Modelling Tools and Techniques for Computer and Communication System Performance Evaluation, Springer Verlag LNCS, Vol 2324, 2002, pp 233-238.
12. H. Zhu, S. Zhou, J. B. Dugan, K. J.
Sullivan, “A Benchmark for Quantitative Fault Tree Reliability Analysis”, Proceedings Annual Reliability and Maintainability Symposium, 2001, pp 86-93.

BIOGRAPHIES

Andrea Bobbio
Dipartimento di Informatica
Università del Piemonte Orientale
Spalto Marengo, 33
15100 Alessandria, ITALY
e-mail: bobbio@unipmn.it

Andrea Bobbio graduated in Nuclear Engineering from Politecnico di Torino. Presently, he is full professor at the Department of Computer Science of the Università del Piemonte Orientale, Alessandria, Italy. His activity is mainly focused on the modeling and analysis of the performance and reliability of stochastic systems, with particular emphasis on Markovian and non-Markovian models and stochastic Petri Nets. Bobbio has spent various research periods at the Department of Computer Science of Duke University (Durham NC, USA), at the Technical University of Budapest and at the Department of Computer Science and Engineering of the Indian Institute of Technology in Kanpur (India). He has been principal investigator and leader of research groups in various research projects with public and private institutions. He is a Senior Member of IEEE, and he is the author of several papers in international journals as well as communications to international conferences.

Daniele Codetta R.
Dipartimento di Informatica
Università del Piemonte Orientale
Spalto Marengo, 33
15100 Alessandria, ITALY
e-mail: raiteri@unipmn.it

Daniele Codetta Raiteri received his degree in Computer Science in July 2002 from the Università del Piemonte Orientale (Italy) and is presently a Ph.D. student in Computer Science at the Università di Torino (Italy). His activity concerns stochastic models for reliability analysis, more specifically fault trees and their evolutions and analysis.