dependable systems dependability analysis How does the system react to the occurrence of a fault? TOPIC QUESTIONS How reliable or available is the system? What are the most critical faults? Dependability analysis Goal: estimate dependability-related properties • Reliability (MTTF, fault coverage …) • Availability • Performability and Dependability Importance of analysis and analytical model • to evaluate a design • a metric to compare different designs • to provide feedback to the designer during early design stages • use a model for performance analysis • used for quantitative and qualitative analysis Reliability & Availability Reliability-related analyses Evaluation of the fault-error relationship – For each fault, what are the effects (errors) and the consequent failures – For each failure, which are the possible causes Compute/estimate reliability/availability metrics starting from the system components and adopted fault models – MTTF – Fault coverage Analysis approaches Forward Starting from a set of events the effects on the system are evaluated … Backward Starting from the observed malfunctioning effects, possible causes (events) are analyzed and identified Reliability Analysis Forward & Backward Analysis Failure Mode and Effects (Criticality / Diagnostics) Analysis exploits the forward approach … given these events, what will happen? Fault Tree Analysis follows the backward method … what are the events that cause the observed failure? Reliability Analysis Forward & Backward Analysis In both cases the goal is to identify a causal relationship between events and failures Events include failures in – Hardware/Software – Human behavior – Environmental conditions Reliability Analysis Analytical techniques • • • • • • • • Reliability Block Diagrams – RDB Fault tree analysis – FTA Failure modes and effects analysis – FMEA Failure modes, effects and criticality analysis FMECA Failure modes, effects and diagnostic analysis FMEDA Hazard and operability studies – HAZOP Event tree analysis – ETA Risk analysis – RA Reliability Block Diagrams An inductive model wherein a system is divided into blocks that represent distinct elements such as components or subsystems Blocks are then combined according to system-success pathways RBDs are generally used to represent active elements in a system, in a manner that allows an exhaustive search for and identification of all pathways for success Reliability Block Diagrams An approach to compute the reliability of a system starting from the reliability of its components components in series components in parallel All components must be healthy for the system to work properly If one component is healthy the system works properly Reliability Block Diagrams Series: RS(t) = RC1(t) * RC2(t) Parallel: RS(t) = 1 – (1 – RC1(t)) * (1 – RC2(t)) series meno Block a±dabile dela±dabile meno a±dabile suoi componenti. meno a±dabile del Diagrams meno dei suoidei componenti. Reliability Se i componenti sono aditasso di costante guasto costante 2 Se i componenti sono a tasso guasto ∏i , (i =∏1, 2, .=. .1, , n) i , (i del componente è una funzione esponenziale del R tempo del componente i-esimo i-esimo è una funzione esponenziale del tempo i (t) = System S composed by components with an exponential distribution si ricava: si ricava: t °∏s t = e°∏s con Rs (t) =Rse(t) : n n X X con∏s: = ∏s ∏=i e∏i i=1 Me T T i=1 Il sistema ha ancora un comportamento di tipo esponec Il sistema serie haserie ancora un comportamento di tipo esponenziale datosomma dalla somma di guasto dei componenti. dato dalla dei tassidei di tassi guasto dei componenti. Example 2.1 - Si consideri il sistema di pressurizzazione Example 2.1 - Si consideri il sistema di pressurizzazione di un sed in Figura 2.2. Il sistema è costituito da 4 compone tato in tato Figura 2.2. Il sistema è costituito da 4 componenti ob di pressione (R), unadilogica di controllo un gruppo di pressione (R), una logica controllo (L), un(L), gruppo motore La connessione fra i blocchi è di tiponelserie valvola valvola (V). La(V). connessione fra i blocchi è di tipo serie sen devefunzionante essere funzionante per garantire il corretto funziona deve essere per garantire il corretto funzionamento mantenimento del prescritto valore di pressione nel serb mantenimento del prescritto valore di pressione nel serbatoio. di guasto costante, una dati banca datistati son nenti a nenti tasso aditasso guasto costante, da una da banca sono series a±dabile del meno a±dabile dei suoi compone Reliability Blockmeno Diagrams Se i componenti sono a tasso di guasto costante ∏ del componente i-esimo è una funzione esponenziale When all components are identical si ricava: Rs (t) = e°∏s t con : ∏s = n X ∏i i=1 Il sistema serie ha ancora un comportamento di t dato dalla somma dei tassi di guasto dei component Example 2.1 - Si consideri il sistema di pressuri tato in Figura 2.2. Il sistema è costituito da 4 di pressione (R), una logica di controllo (L), u valvola (V). La connessione fra i blocchi è di deve essere funzionante per garantire il corrett mantenimento del prescritto valore di pression nenti a tasso di guasto costante, da una banca series Reliability Block Diagrams Availability parallel Reliability Block Diagrams System P composed by n components Availability Reliability Block Diagrams ive t c A Component redundancy System redundancy RBDs Stand-by redundancy active parallel system: primary and redundant subsystems normally operate at all times standby redundant system: the standby component is not activated unless the online component fails dynamic switching to the standby elements (explicit fault detection) the switch can fail too RBDs Standby redundancy RBDs r out of n reliability Rs = System reliability Rc = Component reliability n = Number of components r = Minimum number of components which must survive RBDs Example 1 RA = 0.95 RB = 0.97 RC = 0.99 RD = 0.99 RE = 0.92 RF = 0.92 RBDs Solution Rs = RA RB [1-(1-RC)(1-RD)] [1-(1-RE)(1-RF)] Rs= (0.95)(0.97)[1-(1-0.99)(1-0.99)] [1-(1-0.92)(1-0.92)] = 0.916 RBDs Example 2 2 control blocks and 3 voice channels: • system is up if at least 1 control channel and at least 1 voice channel are up voice control voice control voice Example 2 – cont’d • Each control channel has reliability Rc • Each voice channel has reliability Rv • Reliability: R = [1 - (1 - Rc ) 2 ][1 - (1 - Rv ) 3 ] RBDs Example 3 RBDs RDB: used to compare different alternatives Cable Bundle Each block has R = .9 Taken from: SRE Meeting - Apr 2016 Alternative 1 Alternative 2 Alternative 3 RBDs Example 4 – homework C A B D C C E D RBDs Complex example RBDs Methods for non-series-parallel systems • State enumeration (Boolean Truth Table) • Factoring/conditioning • Binary Decision Diagrams Implemented in SHARPE RBDs non-series-parallel systems – Example 2 1 3 4 5 RBDs non-series-parallel systems – TT – Example 1 2 3 4 5 System Probability 1 1 1 1 1 1 R1R2R3R4R5 1 1 1 1 0 1 R1R2R3R4!R5 1 1 1 0 1 1 1 1 1 0 0 1 1 1 0 1 1 1 1 1 0 1 0 1 1 1 0 0 1 1 1 1 0 0 0 1 1 0 1 1 1 1 1 0 1 1 0 0 1 0 1 0 1 1 1 0 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 0 R1R2 R1!R2R3R4R5 R1!R2R3!R4R5 R1!R2!R3R4R5 RBDs non-series-parallel systems – TT – Example – cont’d 1 2 3 4 5 System Probability 0 1 1 1 1 1 !R1R2R3R4R5 0 1 1 1 0 1 !R1R2R3R4!R5 0 1 1 0 1 0 0 1 1 0 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0 1 1 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 1 1 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 !R1R2!R3R4R5 !R1!R2R3R4R5 !R1!R2!R3R4R5 !R1R2R3R4 RBDs non-series-parallel systems – TT – Example – cont’d Reliability: R1R2 + !R1R2R3R4 + !R1R2!R3R4R5 + !R1!R2R3R4R5 + !R1!R2!R3R4R5 + !R1R2!R3R4R5 + !R1!R2R3R4R5 + !R1!R2!R3R4R5 … simplifying and optimizing … Reliability: R1R2+ R4R5 + R1R3R5 + R2R3R4 BTW: !R = (1 – R) RBDs non-series-parallel systems – conditioning – Example 1 C3 down 2 1 2 4 5 1 2 4 5 3 4 5 C3 up RBDs non-series-parallel systems – conditioning – Example – cont’d • Component C3 is chosen to factor on (or condition on) • Upper resulting block diagram: C3 is down • Lower resulting block diagram: C3 is up • Series-parallel reliability formulas are applied to both the resulting block diagrams • Use the theorem of total probability to get the final result RBDs non-series-parallel systems – conditioning – Example – cont’d RC3down= 1 - (1 - R1R2) (1 - R4R5) RC3up = (1 - Q1Q4)(1 - Q2Q5) = [1 - (1-R1) (1-R4)] [1 - (1-R2) (1-R5)] Rbridge = RC3down (1-R3 ) + RC3up R3 RDBs Triple Modular Redundancy – TMR System works properly if • 2 out of 3 components work properly AND the voter works properly ≅ RBDs TMR • MTTFTMR is shorter than MTTFCOMP • Can tolerate transient faults and permanent faults • Higher reliability (for shorter missions) When do we have the same reliability? • RTMR(t) = RC(t) ≅ 0.7 MTTFC • RTMR(t) > RC(t) when the mission time is shorter than 70% of MTTFC RBDs TMR TMR: 2 out of 3 components (voter is a ‘perfect’ element) RBDs nMR TMR: 2oo3 and nMR: 3oo5 Redundancy is useful for specific mission times RBDs Pros and cons Advantages • an RBD allows early assessment of design concepts and allows to easily visualize the system logic Limitations • breaking down the systems to identify multiple levels of components > considerable effort • analyzing complex reliability diagrams can be difficult > not simple series / parallel configurations • modeling non-hardware failure mitigation measures, such as training and procedures, is difficult using this technique Exercise • The failure distribution of the main landing gear of a commercial airliner is Weibull with a shape parameter of 1.6 and a characteristic life of 10,000 landings • The nose gear also has a Weibull distribution with a shape parameter of 0.90 and a characteristic life of 15,000 landings • What is the reliability of the landing gear system if the system is to be overhauled after 1,000 landings? • R(t) = exp[-(t/η)]β η: scale parameter β: shape parameter Exercise 2 • Out of the 12 identical AC generators on the C-5 aircraft, at least 9 of them must be operating in order for the aircraft to complete its mission. Failures are known to follow an exponential distribution with a failure rate of 0.01 failure per hour. What is the reliability of the generator system over a 10 hour mission? FAULT TREES Fault Tree Analysis Defines a correlation between events and failures – Composes events and caused by means of “logic gates” The approach offers a tree model of the events and conditions that lead to the failure It can be used to characterize a system and to evaluate the overall dependability properties Approach Fault Tree Analysis It is a top-down approach to failure analysis, starting with a potential undesirable event (accident) called a TOP event, and then determining all the ways it can happen – What (combination of) events can lead to the TOP event • Use “logic gates” to connect the causes A FTA starts with the undesired event and traces backward to the necessary and sufficient causes Fault Tree Analysis - FTA Historical perspective Bell Telephone Laboratories developed the concept in 1962 for the US Air Force for use with the Minuteman system Later improved by Boeing Company One of many symbolic “analytical logic techniques” found in operations research and in system reliability FTA steps Taken from Fault Tree Handbook with Aerospace Fault Tree Handbook with Aerospace Applications – version 1.1 - 2002 Fault Trees • • • • • Combinatorial (non-state-space) model type Components are represented as nodes Components or subsystems in series are connected to OR gates Components or subsystems in parallel are connected to AND gates Components or subsystems in k-of-n (RBD) are connected as (n-k+1)-of-n gate Fault Tree Diagram System model produced with a symbolic representation – Basic events are connected with logical gates Events Gates Fault Tree Diagrams Example CPU DS1 DS1 NIC1 CPU DS2 DS3 NIC2 DS2 DS3 NIC1 NIC2 TOPEvent = !CPU + (!DS1 !DS2 !DS3) + (!NIC1 !NIC2) R = 1 – P(TOPEvent) = 1 - P(!CPU + (!DS1 !DS2 !DS3) + (!NIC1 !NIC2)) Fault Tree Diagrams Events Events Meaning Basic Event A basic initiating fault (or failure event) External Event (House Event) An event that is normally expected to occur. In general, these events can be set to occur or not occur, i.e. they have a fixed probability of 0 or 1. Undeveloped Event An event which is no further developed. It is a basic event that does not need further resolution. Conditioning Event A specific condition or restriction that can apply to any gate. Transfer Indicates a transfer continuation to a sub tree Symbol Fault Tree Diagrams Gates Gates Meaning AND The output event occurs if all input events occur OR The output event occurs if at least one of the input events occurs Voting OR (k-out-of-n) The output event occurs if k or more of the input events occur Inhibit The input event occurs if all input events occur and an additional conditional event occurs Symbol k Priority AND The output event occurs if all input events occur in a specific sequence XOR The output event occurs if exactly one input event occurs Mutually XOR The output event occurs when any input event occurs. All other events cannot occur. M Fault Tree Diagrams Failure probabilities Gate Failure probability OR Pt = P1 + P2 – (P1P2) AND Pt = P1P2 Pt = P1(1–P2) + (1–P1)P2 + P1P2 = P1–P1P2 + P2 – P1P2 + P1P2 = P1 + P2 –P1P2 Pt : total failure probability Pi: failure probability, event i Priority AND XOR Pt = P1 + P2 – 2(P1P2) Voting OR k-out-of-n Pt = P1 + P2+ P3– (P1P2) – (P1P3) – (P2P3) – (P1P2P3) (2-out-of-3) Mutually XOR Pt = P1 + P2 Fault Tree Diagrams FTD model Gate & Events symbols and descriptions gate description gate identifier event ID basic events Fault Tree Diagrams FTD Example top-level event Fault Tree Diagrams Fault tree construction Define a TOP event in a clear and unambiguous way – What – Where – When What are the necessary, and sufficient events and conditions causing the TOP event? Connect via AND- or OR-gate Appropriate level – Independent basic events – Events for which we have failure data Example Fuel system schematic Example Fuel system faults Fault States – No flow when needed – Flow cannot be shut off Component failures – Block Valve A fail open/fail close – Block Valve B fail open/fail close – Control Valve A fail open/fail close – Control Valve B fail open/fail close Example No flow when needed Example Flow cannot be shut off Fault tree Major characteristics: • FT without repeated events (same event in input at different gates) • can be mapped onto RBDs • can be solved in linear time • FT with repeated events • Theoretical complexity: exponential in the number of components • Up to 100 components can be solved … How to exploit Fault Trees? Analyze the the causes leading to top events and identify the critical elements within the entire system Identify the (sets of) basic elements that cause the top event Fault Tree Analysis Cut set: definition and use Given a fault tree, it is possible to derive cut sets Cut set: a set of basic events, such that if they all occur, the top event will occur Puts basic events into relation with the final outcome top set event Minimal cut set: smallest set of basic events leading to the top event Minimal cut set Cut sets The set is minimal if all its events must occur to lead to the top-event Each fault tree has a finite number of unique minimal cut sets The number of different basic events in a minimal cut set is called the order They identify all distinct ways a top event can occur w.r.t. basic events Path set A set of basic events whose non-occurrence (simultaneous) guarantees that the top event does not occur A minimal path set is one that cannot be reduced without loosing its status as a path set Qualitative assessment Qualitative assessment by investigating the minimal cut sets – Ordering cut sets – Ranking based on the type of basic events • Human error (most critical) • Failure of active equipment • Failure of passive equipment – Spot “large” cut sets with dependent items Qualitative assessment Ranking of events Rank Basic event 1 Basic Event 2 1 Human error Human error 2 Human error Failure of active unit 3 Human error Failure of passive unit 4 Failure of active unit Failure of active unit 5 Failure of active unit Failure of passive unit 6 Failure of passive unit Failure of passive unit Quantitative analysis Cut sets are computed and failure probabilities are combined to get the top event probability – Generate cut sets – Apply failure data – Compute probabilities – Compute criticality measures Instruments: FT mathematics (Boolean algebra & probability) FT approximation methods Reliability R = e-λT R+Q=1 Q = 1 – e-λT Approximation: when λT < 0.0001 then Q ≈ λT Where: λ : component failure rate = 1 / MBTF T: time interval Failure probabilities Gate Failure probability OR Pt = P1 + P2 – (P1P2) AND Pt = P1P2 Pt : total failure probability Pi: failure probability, event i Priority AND XOR Pt = P1 + P2 – 2(P1P2) Voting OR k-out-of-n Pt = P1 + P2+ P3 – (P1P2) – (P1P3) – (P2P3) – (P1P2P3) (2-out-of-3) Mutually XOR Pt = P1 + P2 Minimal cut sets: computation Boolean reduction Bottom up reduction algorithms Binary Decision Diagrams Min Terms method (Shannon decomposition) Modularization methods Genetic algorithms Minimal Cut Set: example Cut Sets: A A,B A,B,C A,B Min Cut Sets: A Minimal Cut Set: example Probability top event Ptop-event = 1 – Π(1-PCSi) PCSi : probability of the i-th cut set Ptop-event = 1 – (1-PA) (1-PAPB) (1-PAPBPC) FTA example T = E1 E2 E1 = A + E3 E2 = C + E4 E3 = B + C E4 = A B T = (A+B+C)(C+AB) = AC+BC+C+AB+ABC = C + AB Minimal Cut Set: example FTA example - continued T = C + AB Minimal cut sets: C AB PTopEvent = 1 – Π(1-PCSi) PTopEvent = 1 – (1-PC) (1-PAB) Fault Tree Analysis Characteristics It works by analyzing the “failure space” by looking at failure combinations – When this fails … Traditionally used to analyze fixed probabilities – each event that comprises the tree has a fixed probability of occurring – no variation in time (no repair, …) SAFETY RELATED ANALYSIS Functional safety and its analysis In the context of electrical/electronic (E/E) systems • Need to safely operate machines that have a potential to cause injury or property damage • Standards were established for applications like machinery or aviation, because of the potential impact caused by a safety-related failure • Effort to achieve a high probability to detect or prevent random hardware faults random are hardware failures that appear during the lifetime of a device because of a defect/wear-out as opposed to systematic failures induced in a deterministic way in the production process, manufacturing …. • “devices/functions” are specifically designed and added to a nominal system to provide safety Safety-related requirements • Hazards analysis leads to safety function requirements the identification of the functions necessary for the system to operate safely with respect to dangerous faults rd something that has the a z potential to harm you ha hazard analysis safety function requirements • Risk assessment leads to safety integrity requirements the likelihood of a safety function to properly perform the likelihood of a hazard actually causing harm k risthe future impact of a hazard risk assessment that is not controlled or eliminated safety integrity requirements hazard / risk assessment matrix critical situations require control of the function to prevent and/or reduce the risk Likelihood RISK Frequency of occurrence Hazard categories 1 catastrophic 2 critical 3 serious 4 minor A frequent 1A 2A 3A 4A B probable 1B 2B 3B 4B C occasional 1C 2C 3C 4C D remote 1D 2D 3D 4D E improbable 1E 2E 3E 4E Consequences fatality major injuries minor injuries negligible injuries very likely high high high medium likely high high medium medium unlikely high medium medium low highly unlikely medium medium low low Safety functions • safety functions are what needs to be done to achieve the desired/required level of safety, to reduce risk • elements that are added to the system to mitigate the effects of a fault somewhere else in the system They can be: • “on demand” (or “low demand”) • “continuous” (or “high demand”) *** automotive • On demand: found in a protection system separated from the system-underconsideration • Continuous: part of the system-under-consideration Safety functions • In general normal functions safety functions • In automotive normal functions safety functions Safety Integrity Levels Associated with Safety Integrity Requirements Apply to safety functions and not to system functions • (Automotive) Safety Integrity Level – (A)SIL SIL and ASIL 61508 SIL Probability of dangerous failure per hour Risk reduction factor 4 from 10-5 to 10-4 From 100,000 to 10,000 3 from 10-4 to 10-3 From 10,000 to 1,000 2 from 10-3 to 10-2 From 1,000 to 100 1 from 10-2 to 10-1 From 100 to 10 ASIL FAILURE RATE SPFM LFM A < 1000 FIT Not relevant Not relevant B < 100 FIT >= 90% >= 60% C < 100 FIT >= 97% >= 80% D < 10 FIT >= 99% >= 90% FIT: 1 fault in one device in 109 hours measure the ability of a function/device to detect faults Architectural metrics: SPFM, LFM and PMHF Single Point Fault Metric Reflects the robustness of a function to single point faults either by design or by coverage Latent Fault Metric Reflects robustness of a function against latent faults (mainly safe faults) Probabilistic Metric Overall likelihood of risk to happen, Hardware Faults measured in FIT Single Point Fault Causes a violation of the safety-goal: no mechanism has been designed to detect this fault ASIL Assignment from standards ability l l o r t n o C uency X q e r F x Severity Light / moderate injuries Sever injuries / survival probable Controllable Normally controllable Difficult to control very unlikely Quality Management Quality Management Quality Management unlikely Quality Management Quality Management Quality Management likely Quality Management Quality Management ASIL A very likely Quality Management ASIL A ASIL B very unlikely Quality Management Quality Management Quality Management unlikely Quality Management Quality Management ASIL A likely Quality Management ASIL A ASIL B ASIL A ASIL B ASIL C very unlikely Quality Management Quality Management ASIL A unlikely Quality Management ASIL A ASIL B likely ASIL A ASIL B ASIL C very likely ASIL B ASIL C ASIL D very likely Life-threatening / fatal Functional Safety: standards IEC 61508 – the generic functional safety standard for electrical and electronic (E/E) systems ISO 26262 – automotive-specific Example: ABS Is it a safety function? • It seems to be an “on demand” protection system ... used when necessary • However, in a modern vehicle it is closely bound to the operation of the powertrain itself • It implements not only the ABS but also a variety of stability control functions • It performs safety and other normal functions ASIL malfunction ABS system failure hazard analysis What unintended situation (hazard) may arise? How likely is the hazard? How harmful is the hazard? (severity) How controllable is the system if the hazard occurs? risk analysis What ASIL level is requested? How likely is the fault – fault rate (FIT)? How often should the safety function detect the fault/handle it – effectiveness of the detection? ASIL determination low high A B C D ASIL – ABS example Safety function analyses • FMEA – Failure Mode and Effects Analysis • FMECA – Failure Mode, Effects and Criticality Analysis • FMEDA – Failure Modes, Effects and Diagnostic Analysis FMEA, FMECA, FMEDA Analysis: boundary conditions The physical boundaries of the system which parts are included in the analysis The initial conditions what is the operating status when the (top) event occurs The level of resolution how detailed the analysis should be (components, faults, …) FMEA Progressively selects the individual components or functions within a system and investigates possible modes of failure Considers possible causes for each failure mode and assesses the likely consequences Effects of the failure are determined for the unit itself and for the complete system Possible remedial actions are suggested FMEA Goals Often used at a functional level, early in the lifecycle May be applied at several levels to refine the analysis The analysis can include probabilistic values Used to provide input data for fault tree analysis FMEA Analysis steps Four main steps – System definition, its functions and components – Failure modes identification, and their causes – Effects identification (top events) – Conclusions and recommendations For each failure mode, with a pre-defined scale: – Severity – Frequency – Detection Risk Priority Number = S•F•D FMEA Report sheet FMEA Example: X-by-wire system FMEA Report Failure mode assumptions Value failure: The unit produces one or several erroneous results which are syntactically correct. Timing failure: The logic value of a result is correct, but the result is delivered too late, or too early. Omission failure: The unit stops producing results for some finite time, and then (after an internal recovery) commences to produce correct and timely results again. Crash failure: The unit stops producing results and does not recover from the failure. Silent failure: The unit produces either no results at all, or results that can be identified as being incorrect by all other units. FMECA Extension of FMEA taking into account the importance of each component failure Introduces consequences of particular failures, probability/frequency of occurrence Determines sections of the system where failures are most important Categorization of system effects Fault category Description Environmental, Safety, and Health Result Criteria I Catastrophic Failure that may cause death to the uninvolved public or safety-critical system loss II Critical Failure that may cause severe injury to the uninvolved public, major property damage, or major safety- critical system damage III Marginal Failure that may cause minor injury to the uninvolved public or minor safety-critical damage IV Negligible Failure not serious enough to cause injury to the uninvolved public or safety-critical system damage Fault Severity Categories and Corresponding System Effects Likelihood of events Description Level Likelihood criteria Frequent A Likely to occur often in the life of an item, with a probability of occurrence greater than 10-1 per mission Probable B Will occur several times in the life of an item, with a probability of occurrence less than 10-1 but greater than 10-2 per mission Occasional C Likely to occur some time in the life of an item, with a probability of occurrence less than 10-2 but greater than 10-3 per mission Remote D Unlikely but possible to occur in the life of an item, with a probability of occurrence less than 10-3 but greater than 10-6 per mission Improbable E So unlikely, it can be assumed occurrence may not be experienced, with a probability of occurrence less than 10-6 per mission Extended categorization Fault category Effect on system I Negligible IIA A second fault event causes a transition into Category III - Critical IIB A second fault event causes a transition into Category IV Catastrophic IIC A system safety problem whose effect depends upon the situation III Critical and mission must be aborted IV Catastrophic Extended Fault Severity Categories and Corresponding System Effects Risk acceptance matrix FMECA Example Risk acceptance matrix Analysis • Identify the ways to fail, which are referred to as failure modes – Identifying causes of failure is useful because each cause may have its own mitigation approach • Identify all the effects or consequences of each failure mode • Identify the worst credible severity and probability for each failure mode as designated in the severity and likelihood • Assess the risk, using the risk acceptability matrix FMECA Pros and cons Advantages an FMECA allows for systematic evaluation of item failures and helps identify single-point failures may identify hazards overlooked in other system analysis techniques Limitations does not account for the consequences of coexisting, multi- element faults and failures > complementary techniques to identify multiple faults and failures majority of FMECA do not account for human error or hostile environments an FMECA can give an optimistic estimate of system reliability if a quantitative approach is used FMEDA Analysis of the behavior of the functionality devoted to guaranteeing safety Two points of view: – Functional/Architectural level – Structural level It combines standard FMEA techniques with extensions to identify online diagnostics techniques and the failure modes relevant to safety instrumented system design FMEDA Assumptions • • • • Only a single component failure will fail the entire product Failure rates are constant, wear out mechanisms are not included Propagation of failures is not relevant All components that are not part of the safety function and cannot influence the safety function (feedback immune) are excluded FMEDA Faults Faults can be: • SAFE (S): faulty and the outcome is that it provides the safety function even if not requested • DANGEROUS (D): faulty and the outcome is that it cannot provide the safety function when requested • NO EFFECT • DETECTED (D): the fault produces an observable effect, that can be detected before requesting the safety function • UNDETECTED (U): the fault corrupts the safety functions and it is only understandable when it will be required FMEDA Goals • Evaluate the impact of DANGEROUS UNDETECTED faults • Harden the system so that such impact is limited – Safety Functions are characterized by a Safety Integrity Level (SIL) that expresses the probability (on average) of being able to provide the safety function WHEN REQUESTED • Two actions: • Architectural level • Circuit level How does the system react to the occurrence of a fault? TOPIC QUESTIONS How reliable or available is the system? What are the most critical faults? Reliability/Availability estimation TOPICS + RBDs + FTs + FMEAs References Fault Tree Handbook Marvin Rausand’s Chapter on “System Analysis Fault Tree Analysis” Arnljot Hoyland, Marvin Rausand, “System Reliability Theory”, John Wiley & Sons, Inc., 1994 Federal Aviation Administration, “Guide to Reusable Launch and Reentry Vehicle Reliability Analysis”, 2005 http://sharpe.pratt.duke.edu/