Methods of increasing modelling power for safety analysis of digital systems, applied to a Gas Turbine Control System

A. Bobbio1, E. Ciancamerla2, G. Franceschinis1, R. Gaeta3, M. Minichino2, L. Portinale1

1 DISTA, Università del Piemonte Orientale, 15100 Alessandria (Italy)
2 ENEA CR Casaccia, 00060 Roma (Italy)
3 Dipartimento di Informatica, Università di Torino, 10150 Torino (Italy)

Abstract. The paper describes a probabilistic approach, based on methods of increasing modelling power and different analytical tractability, to analyse the safety of a Gas Turbine Digital Controller. First, a Fault-Tree (FT) has been built to model the system under basic assumptions, such as independent stochastic activities and binary states. Then, the FT has been converted into a Bayesian Net, to include multi-state components and sequentially dependent failures of components and to perform diagnoses, and further into a Stochastic Petri Net, to accommodate repair activity and components in cold stand-by redundancy. Due to the very large state space of the Stochastic Petri Net model, Stochastic Well formed Nets (SWN) have been adopted to alleviate the state explosion problem. SWN make it possible to exploit the symmetries of the system and to fold components with the same failure rates, by the use of colours. Safety measures have been computed with reference to the emerging standard IEC 61508. The applicability, the limits and the main selection criteria of the investigated methods are provided.

1. Introduction

The paper describes a probabilistic approach, based on methods of increasing modelling power and different analytical tractability, to analyse the safety of a Gas Turbine Digital Controller. The Gas Turbine Digital Controller belongs to a cogeneration plant, ICARO, in operation at the ENEA CR Casaccia centre. The plant is composed of two sections: the gas turbine section, which produces electrical power, and the heat exchange section, which extracts heat from the turbine exhaust gases.
The Gas Turbine Digital Controller performs both control and protection functions, in order to let the gas turbine section work efficiently and with high availability, and to protect the engine from overtemperature and overspeed, which involves safety aspects. A fault or a deterioration in the Gas Turbine Digital Controller could result in a reduction of plant efficiency (i.e. increased fuel consumption or nitrogen oxide pollution), in a reduction of plant availability (i.e. decreased operating time due to trips or failures) or in a reduction of plant safety (i.e. the failure of a protection function, which could result in damage to the engine, safety critical to the plant because of its high capital cost).

The control and protection functions rely on a digital system. On one side this increases benefits; on the other side it increases risks, because such systems are vulnerable to random failures and design errors. For digital systems, the demand for safety is more and more urgent even in conventional application domains, like the ICARO cogeneration plant, as proven by the increasing demand for conformity to the IEC 61508 standard. The IEC 61508 standard does not address any specific sector. A very important concept in IEC 61508 is that of Safety Integrity Level (SIL). SILs are used as the basis for specifying the safety integrity requirements for the safety functions to be implemented by the safety related system. As far as the determination of the appropriate Safety Integrity Level is concerned, IEC 61508 is based on the concept of risk and provides a number of different methods, quantitative and qualitative, for determining it.

As a starting point, a Fault-Tree, modelling the system under basic assumptions such as independent stochastic activities and binary states, has been built and analysed. To include multi-state components and sequentially dependent failures of some components of the Digital Controller, and to perform diagnoses, the FT has been converted into a Bayesian Net.
To accommodate statistical dependence of the transition rates and repair transitions, the FT has been converted into a Stochastic Petri Net. The Stochastic Petri Net model turned out to be unmanageable, due to its very large state space. To manage such complexity, Stochastic Well formed Nets, a special class of Stochastic Petri Nets, have been adopted. Stochastic Well formed Nets make it possible to exploit the symmetries of the system by using colours, and so to alleviate the state explosion problem. Model results have been compared with the Safety Integrity Levels of the IEC 61508 standard.

The paper is organised in the following sections. We start in section 2 with the main requirements of the IEC 61508 standard. Section 3 deals with the description of the case study. Sections 4, 5 and 6 provide the models of the case study using the different methods, the investigated measures and the experimental results. Section 7 contains the conclusions.

2. IEC 61508 Standard

Process industry requires that well defined safety requirements be achieved, as hazards may be present in process installations. IEC 61508 introduces a principle referred to as As Low As Reasonably Practicable (ALARP). ALARP defines the tolerable risk as that risk where additional spending on risk reduction would be out of proportion to the actually obtainable reduction of risk. The strategy proposed by IEC 61508 takes into account both random and systematic errors, and gives emphasis not only to technical requirements, but also to the management of the safety activities for the whole safety lifecycle [2].

IEC 61508 has introduced the concept of Safety Integrity Level (SIL) in an attempt to homogenise the concept of safety requirements for Safety Instrumented Systems. According to IEC 61508 the SIL is defined as “one of 4 possible discrete levels for specifying the safety integrity requirements of the safety functions to be allocated to the safety-related systems.
SIL 4 has the highest level of safety integrity, SIL 1 has the lowest”. The target dependability measures for the 4 SILs are specified in Table 1, for systems with low demand mode of operation and with continuous (or high demand) mode of operation. The determination of the appropriate SIL for a safety-related system is a difficult task, and is largely related to the experience and judgement of the team doing the job. IEC 61508 offers suitable criteria and guidelines for assigning the appropriate SIL as a function of the level of fault-tolerance and of the diagnostic coverage.

Table 1. Safety Integrity Levels: Target Failure Measures

Safety Integrity   LOW DEMAND MODE OF OPERATION           CONTINUOUS/HIGH DEMAND MODE OF OPERATION
Level              (Probability of failure to perform     (Probability of a dangerous failure
                   its design function on demand)         per hour)
4                  >= 10^-5 to < 10^-4                    >= 10^-9 to < 10^-8
3                  >= 10^-4 to < 10^-3                    >= 10^-8 to < 10^-7
2                  >= 10^-3 to < 10^-2                    >= 10^-7 to < 10^-6
1                  >= 10^-2 to < 10^-1                    >= 10^-6 to < 10^-5

3. Gas Turbine Section

The gas turbine section (figure 1) consists fundamentally of four main parts: the compressor, the combustion chamber, the turbine itself and the generator. The gas turbine is a single shaft engine. The rotor, which rotates at 22500 rpm, is linked to a reduction gear for coupling with the generator. The compressor feeds air to the combustion chamber, where the gas is also fed. Here, the combustion produces gases at high pressure and high temperature. The NOx content can be kept within the required limits by a water injection that reduces the flame temperature. The expansion of these gases in the turbine produces the turbine rotation, with a torque that is transmitted to the generator in order to produce the electrical power output.

Figure 1 - Gas turbine section

The air flow rate is constant and a control valve regulates the gas fuel in the combustion chamber. The control valve is actuated by the control system and a position sensor reads its position.
The exhaust gas temperature, which is the most critical variable for the engine control system, is taken as an average of eight thermocouples, located along the circumference of the turbine exit. Among all the variables involved in the gas turbine operation, only a few are directly measured by sensors. The sensor averages are taken by analog circuitry and are used, together with the turbine speed, to protect the engine.

3.1. Gas Turbine Digital Controller

The Gas Turbine Digital Controller performs both control functions and protection functions [3,4]. It also performs alarm monitoring and communication functions, not considered in this paper. Control functions address the normal run operation and all the plant sequencing needed in starting and stopping operations. In performing control functions, the control system evolves through several states: starting, no-load, running on load and shutting down. From any of these states, a shutdown request overrides the control logic and leads the system to the prior-to-start state. At any time a shutdown request will cause the control system to enter its emergency shutdown state and carry out the shutdown actions, which include the de-energisation of the related relays. The control functions are essentially based on the control of the fuel metering valve position.

Protection functions simply consist in providing engine protection by independent overtemperature and overspeed shutdowns. Two thermocouple sensors are used, together with one speed probe, as inputs of the protection functions.

The gas turbine control system (figure 2) comprises a Main Controller, which implements the control functions, and a Backup Unit, which implements the protection functions. The Main Controller and the Backup Unit have separate processors and independent power supplies, so that the Backup Unit is able to provide independent protection functions.
The Main Controller occupies two full boards, the Baseboard and the Expansion Board, and a portion of the Auxiliary Board that is shared with the Backup Unit. The Baseboard hosts a processor performing the main control functions. The Expansion Board provides additional intelligent I/O capacity. The Backup Unit occupies the other portion of the Auxiliary Board, on which an independent processor performs the protection functions. The sensors and the actuators reported in figure 2 are limited to the ones needed to perform the fuel demand control logic in the run state.

Figure 2 - High level architecture of the Gas Turbine Control System

The Backup Unit implements the protection functions of the turbine by providing an independent overtemperature and overspeed shutdown. The hardware structure of the control system has been summarised as composed of two subsystems (figure 3), representing the Main Controller and the Backup Unit. Each unit has an independent CPU and uses a separate power supply circuit (operating from the same supply inlet), but the two units share the following transducer signals: 2 thermocouples and 1 speed probe. "Watchdog" relays are associated with each hardware circuit board.

Legend: DI - Digital input; AI - Analog input; CPU - 32-bit microprocessor; MEM - Memory; I/O - I/O bus; DO - Digital output; AO - Analog output; WD - Watchdog relay; PS - Power Supply inlet; SMC - Supply circuit of the main controller; RO - Relay output.

Figure 3 - Main Controller and Backup Unit hardware structure

The elementary components of the gas turbine control system are assumed to have constant failure rates (table 2).

Table 2 - Component / Failure Rate (f/h)

Component   Symbol    Failure rate (f/h)
Iobus       λ_IO      2.0 × 10^-9
Therm.      λ_Th      2.0 × 10^-9
Speed       λ_Sp      2.0 × 10^-9
Memory      λ_M       5.0 × 10^-8
DO          λ_DO      2.5 × 10^-7
AO          λ_AO      2.5 × 10^-7
RO          λ_RO      2.5 × 10^-7
DI          λ_DI      3.0 × 10^-7
AI          λ_AI      3.0 × 10^-7
PS          λ_PS      3.0 × 10^-7
SMC         λ_Smc     3.0 × 10^-7
SBU         λ_Sbu     3.0 × 10^-7
CPU         λ_CPU     5.0 × 10^-7
WD          λ_WD      2.5 × 10^-7

4. The Fault Tree model

At the highest level of the analysis we adopt a Fault Tree model (figure 4).

Figure 4 - Fault Tree model for safety critical failures

The Fault Tree analysis is based on the following simplifying assumptions: components (and the system) have binary behaviour (up or down) and failure events are statistically independent. Qualitative and quantitative analyses of the FT have been carried out. The qualitative analysis has been aimed at identifying the most critical failure paths, i.e. the minimal cut sets (mcs). The quantitative analysis has been aimed at evaluating measures useful to characterise safety. The analysis has found 43 mcs, with the characteristics given in table 3. The most critical mcs, sorted by criticality, are shown in table 4.

Table 3 - Order and number of mcs

Order                      4        3
Number of mcs              2        41
% on TE Unreliability      93.09    6.91

The following measures have been computed:
- Unreliability versus time;
- Safe Mission Time (SMT), computed as the time interval in which the system unreliability is strictly lower than a pre-assigned threshold;
- Mean Time To Failure (which we consider a less significant measure than the SMT);
- Most critical failure paths;
- SIL evaluation, limited to the table 1 requirements of the IEC 61508 standard.

Table 4 - Most critical mcs

No.   Minimal Cut Set
1     PS WDB WDM
2     CPUB CPUM WDB WDM
3     Speed WDB WDM
4     AIB CPUM WDB WDM
5     DIM CPUB WDB WDM
6     SupM CPUB WDB WDM
7     CPUM SupB WDB WDM
8     AIM CPUB WDB WDM

Unreliability versus time and failure frequency have been computed (table 5) for the SIL evaluation according to IEC 61508. Comparing the failure frequency of dangerous failures (third column of table 5) with the SIL target failure measures (table 1), the system meets SIL 3 up to 500,000 h. Fixing a limit for the unreliability U = 1.0 × 10^-3, the Safe Mission Time is SMT = 210,000 h.
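The comparison against the continuous/high demand bands of Table 1 can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `meets_sil` and `highest_sil_met` are hypothetical, the band limits are those of Table 1, and the three sample frequencies are taken from Table 5. A frequency below the lower limit of the SIL 4 band is treated here as meeting SIL 4, which is an assumption of the sketch and not a statement of the standard.

```python
# Upper limits of the Table 1 bands for continuous/high demand mode
# (probability of a dangerous failure per hour), keyed by SIL.
SIL_UPPER_LIMIT = {4: 1e-8, 3: 1e-7, 2: 1e-6, 1: 1e-5}


def meets_sil(failure_frequency, sil):
    """True if the frequency lies below the upper limit of the given SIL band.

    Assumption of this sketch: anything below the band counts as meeting it.
    """
    return failure_frequency < SIL_UPPER_LIMIT[sil]


def highest_sil_met(failure_frequency):
    """Return the most demanding SIL met by the frequency, or None."""
    for sil in (4, 3, 2, 1):
        if meets_sil(failure_frequency, sil):
            return sil
    return None


# Failure frequencies of dangerous failures copied from Table 5,
# keyed by mission time in hours.
TABLE_5_SAMPLES = {250_000: 8.056e-9, 300_000: 1.243e-8, 500_000: 3.722e-8}

for t_hours, frequency in TABLE_5_SAMPLES.items():
    print(t_hours, highest_sil_met(frequency))
```

With these inputs the sketch reports SIL 4 at 250,000 h and SIL 3 at 300,000 h and 500,000 h, in line with the SIL 3 result stated above.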
The Mean Time To Failure for the Top Event (TE) is: MTTF = 3.072 × 10^6 h.

Table 5 - Unreliability and failure frequency of dangerous failures

Time t (h)   TE Unreliability   Failure Frequency
10,000       9.095 × 10^-9      9.095 × 10^-13
50,000       5.157 × 10^-6      1.031 × 10^-10
100,000      7.317 × 10^-5      7.317 × 10^-10
150,000      3.291 × 10^-4      2.194 × 10^-9
200,000      9.256 × 10^-4      4.128 × 10^-9
250,000      2.014 × 10^-3      8.056 × 10^-9
300,000      3.730 × 10^-3      1.243 × 10^-8
350,000      6.181 × 10^-3      1.766 × 10^-8
400,000      9.447 × 10^-3      2.372 × 10^-8
450,000      1.358 × 10^-2      3.018 × 10^-8
500,000      1.861 × 10^-2      3.722 × 10^-8

5. The Bayesian Network model

According to the translation algorithm presented in [5], the Bayesian network (BN) derived from the FT of Figure 4 is reported in Figure 5. In the BN of Figure 5, gray ovals represent root nodes (corresponding to the basic events in the FT), while white ovals represent non-root nodes. Every node in the BN is a binary node, since the variable associated with it is a binary variable. The binary values of the variables associated with the nodes represent the presence of a failure condition (true value) or an operational condition (false value). The only chance (probabilistic) nodes of the BN are the roots (gray nodes). All the other nodes in the BN (white ovals) are deterministic nodes, whose Conditional Probability Tables (CPT) contain only 0 or 1 and are determined by the type of the gate in the FT they refer to (namely by the boolean AND function and by the boolean OR function) [5].

The root nodes must be assigned a probability value. Since the information about the failure probability of the system components is in the form of a constant failure rate (Table 2), the probability of the true value for a generic component C (with failure rate $\lambda_C$) at a specific mission time t is computed as $\Pr(C = \mathrm{true}) = 1 - e^{-\lambda_C t}$. Given the prior failure probabilities of system components (i.e.
basic events in the FT) computed at different mission times (from t = 1 × 10^5 h to t = 5 × 10^5 h), we can evaluate the unreliability of the TE by computing the probability of node TE in the BN of Figure 5 given no evidence.

Figure 5 - The Bayesian Net translating the FT of figure 4

5.1 Posterior analysis

The novelty and the strength of the BN approach consist in the possibility of computing posterior probabilities (i.e. diagnoses), in order to analyze the criticality of the system components with respect to partial or total system failure. To this end, a probabilistic computation has to be carried out, by considering the occurrence of the TE as the evidence provided to the BN. There are two main probabilistic computations that can be performed:
1. the posterior probability of each single component (in terms of BN, a belief updating propagation must be carried out);
2. the joint posterior probability over the set of components (in terms of BN, a belief revision looking for the most probable configurations of the root variables must be carried out).

The first analysis provides information about the criticality (with respect to the occurrence of the TE) of each single component alone, by computing the probability of each single component being down, given that the TE has occurred. The second kind of analysis is much more sophisticated and approaches the criticality problem over a set of components. It is worth noting that, differently from the MCS computation, all the components (i.e. basic events) are considered in a given configuration, which provides more precise information. In this case, the posterior joint probability of all the components, given that the TE has occurred, is computed. Table 6 reports the posteriors of each single component computed at time t = 5 × 10^5 h.
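The single-component posterior of item 1 can be illustrated on a very small example by brute-force enumeration, instead of the belief propagation a BN tool would use. The Python sketch below makes several assumptions that are not part of the paper: the structure function `top_event` is a hypothetical five-component fragment built from the two cut sets PS WDB WDM and CPUB CPUM WDB WDM, not the full FT of figure 4, and the function names are invented for illustration. The failure rates are those of Table 2, and the priors follow the formula Pr(C = true) = 1 - exp(-λ_C t).

```python
import math
from itertools import product

# Failure rates (f/h) from Table 2 for the components of the fragment.
RATES = {"WDm": 2.5e-7, "WDb": 2.5e-7, "PS": 3.0e-7,
         "CPUm": 5.0e-7, "CPUb": 5.0e-7}


def prior_failure_probability(rate, t):
    """Prior of a root node: Pr(C = true) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-rate * t)


def top_event(state):
    """Hypothetical fragment: both watchdogs AND (PS OR both CPUs) failed."""
    return (state["WDm"] and state["WDb"]
            and (state["PS"] or (state["CPUm"] and state["CPUb"])))


def posteriors(t):
    """Return Pr(TE) and Pr(C = true | TE) for every component C."""
    names = list(RATES)
    priors = {c: prior_failure_probability(RATES[c], t) for c in names}
    p_te = 0.0
    joint = dict.fromkeys(names, 0.0)  # accumulates Pr(C = true, TE)
    for values in product([False, True], repeat=len(names)):
        state = dict(zip(names, values))
        if not top_event(state):
            continue
        weight = 1.0
        for c in names:  # independent root nodes
            weight *= priors[c] if state[c] else 1.0 - priors[c]
        p_te += weight
        for c in names:
            if state[c]:
                joint[c] += weight
    return p_te, {c: joint[c] / p_te for c in names}


p_te, post = posteriors(5e5)
for name, value in post.items():
    print(name, value)
```

In this fragment the two watchdogs appear in every cut set, so their posterior is 1, while the posterior of PS is larger than its prior. The numerical values are those of the reduced fragment and are not expected to match the posteriors of the full model.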
Table 6 - Posterior Probabilities for single components

Component   Posterior
WDb         1
WDm         1
CPUb        0.37063624
PS          0.34525986
CPUm        0.30848555
AIb         0.2333944
SB          0.2333944
ROb         0.19688544
AIm         0.19425736
SM          0.19425736
DIm         0.19425736
AOm         0.16387042
DOm         0.16387042
Mem         0.03443292
Speed       0.00247744
I/Ob        0.00167474
I/Om        0.00139391
Th1         0.00100097
Th2         0.00100097

The component criticality is a more significant measure than the (prior) failure probability. Indeed, the order in which the components appear is different in the prior and in the posterior computations. We can notice that the two watchdogs WDm and WDb have criticality 1, since their failures are necessary in order to have a system failure (as could have been easily deduced from the structure of the FT as well). Moreover, the probability of a CPU failure in case of TE occurrence is about 30% for the CPU M of the main controller and about 37% for the CPU B of the backup unit. Notice that these posterior values are different, even though the failure rate of both CPUs is the same, because of the different role they play in the overall system dependability. In fact, the failures of both the main controller MC and the backup unit BU are given by the failure of the corresponding CPU in boolean OR with the failure of the PER sub-system, but the failure of PER M follows a different sequence of events than the failure of PER B, resulting in different posterior probabilities for the two CPUs as well.

5.2 Multi-state nodes and sequentially dependent failures

In the present section, we discuss the use of BN to highlight two peculiar features, not considered in FT, namely: the possibility of modeling non-binary events (events whose behavior is more accurately described by multi-state variables), and the inclusion of localized dependencies (where the state of a root component influences the state of other root components).
A more realistic case for the power supply PS is to consider three different conditions (states): working, degraded and failed. When PS is in the degraded state, it also induces an anomalous behavior in the supply equipment of the main controller MC (SM) and of the backup unit BU (SB). The BN that models the described situation is reported in Figure 6, where only the relevant part of the BN of Figure 5 is reconsidered. The PS node has three states, denoted by W for working, deg for degraded and F for failed. The prior probabilities of the PS node in the three different states are also reported in the figure. The arcs connecting node PS with both nodes SM and SB indicate a possible influence of the parent node PS on the children nodes SM and SB. This influence is quantified in the CPTs reported in Figure 6, where it is shown that a degradation in PS also induces a failure in SM and SB with probability 0.9.

Figure 6 - Portion of the BN showing the influence of a PS degradation

The degradation of the power supply PS does not have a direct effect on the system dependability; its effect originates from the negative influence of the degradation on other components of the system. Introducing this localized dependence in the overall BN model of Figure 5, the analysis has, however, shown only a small effect on the TE unreliability.

6. Stochastic Petri Net models

The Stochastic Petri Net model has been built by converting the FT of figure 4 by means of a conversion algorithm [6]. Some interesting points have emerged. The FT of Figure 4 has 19 basic components, and this is a relatively small number for a FT, whose quantitative analysis is based on combinatorial techniques. For a state space based analysis technique, like Petri Nets or Markov Chains, the complexity of the problem instead changes drastically. The state space of a system with 19 basic components consists of 2^19 states.
Moreover, the translation rules to convert a FT into a PN introduce several immediate transitions, which produce several vanishing states in the reachability set. Hence, the number of states to be generated (tangible plus vanishing) from the generated PN is larger than the 2^19 tangible states. The lesson to be learned is that even a "small" FT may produce an unmanageable PN.

In order to produce useful results with a PN model, some simplification must be adopted. A basic idea stems from the observation that often, due to symmetries or redundancies in the system to be modeled, the model may contain several similar components. To make the model more compact, the similar components may be folded and parameterized, so that only one representative is explicitly included in the model while, at the same time, the identity of each replica is maintained through the parameter value. To do that, a special class of Stochastic Petri Nets, called Stochastic Well-formed Nets (SWN), has been used. The SWN formalism was introduced [7] with the aim of alleviating the state space explosion problem that often undermines state space based models. Restrictions on color domains, arc functions and initial marking are imposed that lead to the definition of symbolic marking. A symbolic marking may be viewed as a high level description of sets of actual markings, expressed in terms of sets of colors. The definition of symbolic markings makes it possible to exploit symmetry properties in the model and to generate lumped Markov chains.

Figure 7 shows the SWN model. The model has been obtained by observing that several basic components have the same failure rates and may be grouped into classes. The symmetry property that is exploited in this case is that each class contains objects with the same failure rate. To make the figure clearer, we have not represented the inhibitor arcs from the TE to all the transitions.
The function of these inhibitor arcs can be appreciated only at the analysis level, since they avoid the generation of non-relevant absorbing states. The advantages of the SWN model with respect to the Stochastic Petri Net model are shown in Table 7, where the state count is reported together with the reduction factor obtained with the symbolic state representation of the SWN with respect to the unfolded Stochastic Petri Net model. In spite of the limited symmetries present in the model, the reduction factor is substantial.

The SWN model makes it possible to add new modeling issues that could not have been included in the FT and in the Bayesian models. In particular:
- Statistical dependence of the transition rate on the state of the system
- Repair transitions that enforce a cyclic behavior in the system.

Figure 7 - The SWN model

Table 7 - Reduction factor of GSPN states versus SWN symbolic states

                   GSPN      SWN       Reduction Factor
Tangible states    393503    83063     4.7
Vanishing states   665285    141364    4.7
Absorbing states   130785    27529     4.7

6.1 Hot and cold stand-by

For modeling statistical dependencies of the transition rates, we consider the case of redundant components in hot and cold stand-by. FT can only model parallel operation (sometimes called hot stand-by), in which the parallel units have the same failure rate independently of the state of the others. In a stand-by (or, more precisely, cold stand-by) operation, the stand-by unit has a zero failure rate when not in operation. This behavior induces a dependence of the failure rate of the stand-by unit on the state of the unit in operation, and as such cannot be included in FT models. The same behavior, on the other hand, can be very simply accommodated in SWN by using marking dependent transition rates for the failure transitions.
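The effect of a marking dependent failure rate can be seen on a minimal example outside the SWN model. The Python sketch below is an illustration under stated assumptions, not the model of Figure 7: it considers a group of identical units without repair and with perfect switching, and the function name `mttf_redundant_group` and its parameters are hypothetical. In hot stand-by every surviving unit is exposed to failure, so the total rate is proportional to the number of units still up; in cold stand-by only the unit in operation can fail.

```python
def mttf_redundant_group(rate_per_unit, n_units, policy):
    """Mean time until all units of a redundant group have failed.

    The total failure rate depends on the marking, i.e. on how many units
    are still up:
      - "hot":  all surviving units can fail, rate = units_up * rate_per_unit
      - "cold": only the operating unit can fail, rate = rate_per_unit
    Assumptions of this sketch: exponential failures, no repair,
    perfect switching to the stand-by unit.
    """
    mean_time = 0.0
    for units_up in range(n_units, 0, -1):
        if policy == "hot":
            exposed_units = units_up
        elif policy == "cold":
            exposed_units = 1
        else:
            raise ValueError("policy must be 'hot' or 'cold'")
        # Mean sojourn time in the state with `units_up` working units.
        mean_time += 1.0 / (exposed_units * rate_per_unit)
    return mean_time


# Two thermocouples with the failure rate of Table 2 (2.0e-9 f/h).
THERMOCOUPLE_RATE = 2.0e-9
print(mttf_redundant_group(THERMOCOUPLE_RATE, 2, "hot"))
print(mttf_redundant_group(THERMOCOUPLE_RATE, 2, "cold"))
```

For two units with rate λ the sketch gives 3/(2λ) for hot stand-by and 2/λ for cold stand-by, which is why the cold stand-by policy is expected to perform slightly better.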
6.2 Global repair

FT and Bayesian networks are intrinsically acyclic graph structures, and hence cannot be used to model systems in which, as a consequence of a fault, actions can be taken to restore the system to a previous condition. It has been assumed that the system is repaired whenever it fails, that the repair restores the system to its initial marking with all components up, and that the repair rate is constant ("as good as new" repair policy). To include a global repair, the basic SWN model of Figure 7 must be modified. The modification consists in adding a timed transition, denominated Glob R, modeling the stochastic duration of the repair activity. The timed transition Glob R has the TE as its only input place. Moreover, a number of additional immediate transitions of higher priority must be inserted, so that when the SWN model reaches a state where place TE becomes marked, all the timed transitions in the SWN model are disabled except transition Glob R. When transition Glob R fires, the high priority immediate transitions restore the SWN to its initial state, thus restarting the model.

We have run the SWN model with global repair and with different values of the repair rate. The obtained steady state availability is reported in Table 8 as a function of the repair rate. As a final numerical example, we have computed the steady state availability with a repair rate equal to 1, assuming for the thermocouples a hot (parallel) stand-by configuration and a cold stand-by configuration. The results are reported in Table 9. The effect of the operation policy is very small but, as can be expected, the cold stand-by policy proves to be superior.
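The steady state availability under an "as good as new" global repair can be cross-checked with the standard result for an alternating up/down cycle, A = MTTF / (MTTF + 1/μ), where μ denotes the repair rate. The Python sketch below is not the SWN solution. The value MTTF = 3.0 × 10^6 h is an assumption picked for illustration (it is close to, but not the same as, the 3.072 × 10^6 h obtained from the FT), the repair rate is assumed to be expressed per hour, and the function name is hypothetical.

```python
def steady_state_availability(mttf_hours, repair_rate_per_hour):
    """Availability of a system that alternates between an up period of
    mean length MTTF and a repair period of mean length 1 / repair rate,
    with the system restored as good as new after each repair."""
    mean_repair_time = 1.0 / repair_rate_per_hour
    return mttf_hours / (mttf_hours + mean_repair_time)


# Assumed mean time to failure (illustrative value, see the lead-in).
ASSUMED_MTTF_HOURS = 3.0e6

# Repair rates considered in Table 8.
for repair_rate in (1.0, 0.1, 0.01, 0.001):
    availability = steady_state_availability(ASSUMED_MTTF_HOURS, repair_rate)
    print(repair_rate, format(availability, ".12f"))
```

With the assumed MTTF, the formula agrees with the availabilities of Table 8 to the printed precision, which suggests that the global repair model behaves as a simple up/down cycle at this level of aggregation.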
Table 8 - Steady state availability versus repair rate

Repair rate   Availability
1             0.999999666667
0.1           0.999996666678
0.01          0.999966667778
0.001         0.999666777741

Table 9 - Steady state availability versus the configuration policy of the thermocouples

Configuration    Availability
Hot stand-by     0.999999666667
Cold stand-by    0.999999674650

7. Conclusions

The present paper has investigated classic and new methodologies for modelling and quantitatively evaluating dependability measures, and for the SIL evaluation, of a Gas Turbine Digital Controller, with reference to the IEC 61508 standard. First, a Fault Tree model has been built, adopting simplified basic assumptions. Then the FT has been translated into a BN, which is still an acyclic model but possesses more modelling and analysis power. Furthermore, the FT has been translated into a Stochastic Petri Net, a state space based model. This increased the complexity exponentially (the state explosion problem), so that even a small FT became intractable as a PN. To partially alleviate this problem, Stochastic Well formed Nets have been adopted. By means of the SWN, the global availability of the system has been computed. The applicability, the limits and the main selection criteria of the investigated methodologies have been provided, dealing with an application coming from the real world.

References

1. P. Incalcaterra - Impianto di cogenerazione del centro ricerche Casaccia dell'ENEA - Rapporto interno ENEA - 18 gennaio 1999
2. S. Bologna - Safety applications of programmable electronic systems in the process industry: impact of emerging standards - 16th International System Conference, Seattle, Washington, USA - Sept. 14-18, 1998
3. System manual of gas turbine governor controller of ICARO plant - September 1992
4. Hardware and Software design specification for a controller of ICARO plant gas turbine - October 1997
5. A. Bobbio, L. Portinale, M. Minichino, E.
Ciancamerla - Improving the Analysis of Dependable Systems by Mapping Fault Trees into Bayesian Networks - Reliability Engineering and System Safety Journal - vol. 71 N. 3, March 2001, pages 249-260 - ISSN 0951-8320
6. S. Bologna, E. Ciancamerla, M. Minichino, A. Bobbio, G. Franceschinis, L. Portinale, and R. Gaeta - Comparison of methodologies for the safety and dependability assessment of an industrial programmable logic controller - European Safety Dependability Conference (ESREL2001), pages 411–418 - September 2001
7. G. Chiola, C. Dutheillet, G. Franceschinis, and S. Haddad - Stochastic well-formed coloured nets for symmetric modelling applications - IEEE Transactions on Computers, 42:1343–1360, 1993