Uploaded by burak.tevfik

Reliability Analysis Parellel Serial System

advertisement
dependable systems
dependability analysis
How does the system react to the
occurrence of a fault?
TOPIC
QUESTIONS
How reliable or available is the
system?
What are the most critical faults?
Dependability analysis
Goal: estimate dependability-related properties
• Reliability (MTTF, fault coverage …)
• Availability
• Performability and Dependability
Importance of analysis and analytical model
• to evaluate a design
• a metric to compare different designs
• to provide feedback to the designer during early design stages
• use a model for performance analysis
• used for quantitative and qualitative analysis
Reliability & Availability
Reliability-related analyses
Evaluation of the fault-error relationship
– For each fault, what are the effects (errors) and the consequent failures
– For each failure, which are the possible causes
Compute/estimate reliability/availability metrics starting from the system
components and adopted fault models
– MTTF
– Fault coverage
Analysis approaches
Forward
Starting from a set of events the effects on the system are evaluated …
Backward
Starting from the observed malfunctioning effects, possible causes (events) are
analyzed and identified
Reliability Analysis
Forward & Backward Analysis
Failure Mode and Effects (Criticality / Diagnostics) Analysis exploits the forward
approach
… given these events, what will happen?
Fault Tree Analysis follows the backward method
… what are the events that cause the observed failure?
Reliability Analysis
Forward & Backward Analysis
In both cases the goal is
to identify a causal relationship between events and failures
Events include failures in
– Hardware/Software
– Human behavior
– Environmental conditions
Reliability Analysis
Analytical techniques
•
•
•
•
•
•
•
•
Reliability Block Diagrams – RDB
Fault tree analysis – FTA
Failure modes and effects analysis – FMEA
Failure modes, effects and criticality analysis
FMECA
Failure modes, effects and diagnostic analysis
FMEDA
Hazard and operability studies – HAZOP
Event tree analysis – ETA
Risk analysis – RA
Reliability Block Diagrams
An inductive model wherein a system is divided into blocks that represent distinct
elements such as components or subsystems
Blocks are then combined according to system-success pathways
RBDs are generally used to represent active elements in a system, in a manner that
allows an exhaustive search for and identification of all pathways for success
Reliability Block Diagrams
An approach to compute the reliability of a system starting from the reliability of its
components
components in series
components in parallel
All components must be healthy
for the system to work properly
If one component is healthy the
system works properly
Reliability Block Diagrams
Series:
RS(t) = RC1(t) * RC2(t)
Parallel:
RS(t) = 1 – (1 – RC1(t)) * (1 – RC2(t))
series
meno Block
a±dabile
dela±dabile
meno a±dabile
suoi componenti.
meno
a±dabile
del Diagrams
meno
dei suoidei
componenti.
Reliability
Se i componenti
sono aditasso
di costante
guasto costante
2
Se i componenti
sono a tasso
guasto
∏i , (i =∏1,
2, .=. .1,
, n)
i , (i
del componente
è una funzione
esponenziale
del R
tempo
del componente
i-esimo i-esimo
è una funzione
esponenziale
del tempo
i (t) =
System
S composed by components with an exponential distribution
si ricava:
si ricava:
t
°∏s t
= e°∏s con
Rs (t) =Rse(t)
:
n
n
X
X
con∏s: = ∏s ∏=i
e∏i
i=1
Me T T
i=1
Il sistema
ha ancora
un comportamento
di tipo esponec
Il sistema
serie haserie
ancora
un comportamento
di tipo esponenziale
datosomma
dalla somma
di guasto
dei componenti.
dato dalla
dei tassidei
di tassi
guasto
dei componenti.
Example
2.1
- Si consideri
il sistema
di pressurizzazione
Example
2.1 - Si
consideri
il sistema
di pressurizzazione
di un sed
in Figura
2.2. Il sistema
è costituito
da 4 compone
tato in tato
Figura
2.2. Il sistema
è costituito
da 4 componenti
ob
di pressione
(R),
unadilogica
di controllo
un gruppo
di pressione
(R), una
logica
controllo
(L), un(L),
gruppo
motore
La connessione
fra i blocchi
è di
tiponelserie
valvola valvola
(V). La(V).
connessione
fra i blocchi
è di tipo
serie
sen
devefunzionante
essere funzionante
per garantire
il corretto
funziona
deve essere
per garantire
il corretto
funzionamento
mantenimento
del prescritto
valore
di pressione
nel serb
mantenimento
del prescritto
valore di
pressione
nel serbatoio.
di guasto
costante,
una dati
banca
datistati
son
nenti a nenti
tasso aditasso
guasto
costante,
da una da
banca
sono
series
a±dabile del meno a±dabile dei suoi compone
Reliability Blockmeno
Diagrams
Se i componenti sono a tasso di guasto costante ∏
del componente i-esimo è una funzione esponenziale
When all components
are identical
si ricava:
Rs (t) = e°∏s t
con :
∏s =
n
X
∏i
i=1
Il sistema serie ha ancora un comportamento di t
dato dalla somma dei tassi di guasto dei component
Example 2.1 - Si consideri il sistema di pressuri
tato in Figura 2.2. Il sistema è costituito da 4
di pressione (R), una logica di controllo (L), u
valvola (V). La connessione fra i blocchi è di
deve essere funzionante per garantire il corrett
mantenimento del prescritto valore di pression
nenti a tasso di guasto costante, da una banca
series
Reliability Block Diagrams
Availability
parallel
Reliability Block Diagrams
System P composed by n components
Availability
Reliability Block Diagrams
ive
t
c
A
Component redundancy
System redundancy
RBDs
Stand-by redundancy
active parallel system: primary and redundant subsystems normally operate at
all times
standby redundant system: the standby component is not activated unless the
online component fails
dynamic switching to the standby elements (explicit fault detection)
the switch can fail too
RBDs
Standby redundancy
RBDs
r out of n reliability
Rs = System reliability
Rc = Component reliability
n = Number of components
r = Minimum number of components which must survive
RBDs
Example 1
RA = 0.95
RB = 0.97
RC = 0.99
RD = 0.99
RE = 0.92
RF = 0.92
RBDs
Solution
Rs = RA RB [1-(1-RC)(1-RD)] [1-(1-RE)(1-RF)]
Rs= (0.95)(0.97)[1-(1-0.99)(1-0.99)] [1-(1-0.92)(1-0.92)] = 0.916
RBDs
Example 2
2 control blocks and 3 voice channels:
• system is up if at least 1 control channel and at least 1 voice channel are up
voice
control
voice
control
voice
Example 2 – cont’d
• Each control channel has reliability Rc
• Each voice channel has reliability Rv
• Reliability:
R = [1 - (1 - Rc ) 2 ][1 - (1 - Rv ) 3 ]
RBDs
Example 3
RBDs
RDB: used to compare different alternatives
Cable Bundle
Each block has R = .9
Taken from: SRE Meeting - Apr 2016
Alternative 1
Alternative 2
Alternative 3
RBDs
Example 4 – homework
C
A
B
D
C
C
E
D
RBDs
Complex example
RBDs
Methods for non-series-parallel systems
• State enumeration (Boolean Truth Table)
• Factoring/conditioning
• Binary Decision Diagrams
Implemented in SHARPE
RBDs
non-series-parallel systems – Example
2
1
3
4
5
RBDs
non-series-parallel systems – TT – Example
1
2
3
4
5
System
Probability
1
1
1
1
1
1
R1R2R3R4R5
1
1
1
1
0
1
R1R2R3R4!R5
1
1
1
0
1
1
1
1
1
0
0
1
1
1
0
1
1
1
1
1
0
1
0
1
1
1
0
0
1
1
1
1
0
0
0
1
1
0
1
1
1
1
1
0
1
1
0
0
1
0
1
0
1
1
1
0
1
0
0
0
1
0
0
1
1
1
1
0
0
1
0
0
1
0
0
0
1
0
1
0
0
0
0
0
R1R2
R1!R2R3R4R5
R1!R2R3!R4R5
R1!R2!R3R4R5
RBDs
non-series-parallel systems – TT – Example – cont’d
1
2
3
4
5
System
Probability
0
1
1
1
1
1
!R1R2R3R4R5
0
1
1
1
0
1
!R1R2R3R4!R5
0
1
1
0
1
0
0
1
1
0
0
0
0
1
0
1
1
1
0
1
0
1
0
0
0
1
0
0
1
0
0
1
0
0
0
0
0
0
1
1
1
1
0
0
1
1
0
0
0
0
1
0
1
0
0
0
1
0
0
0
0
0
0
1
1
1
0
0
0
1
0
0
0
0
0
0
1
0
0
0
0
0
0
0
!R1R2!R3R4R5
!R1!R2R3R4R5
!R1!R2!R3R4R5
!R1R2R3R4
RBDs
non-series-parallel systems – TT – Example – cont’d
Reliability: R1R2 + !R1R2R3R4 + !R1R2!R3R4R5 + !R1!R2R3R4R5 + !R1!R2!R3R4R5
+ !R1R2!R3R4R5 + !R1!R2R3R4R5 + !R1!R2!R3R4R5
… simplifying and optimizing …
Reliability: R1R2+ R4R5 + R1R3R5 + R2R3R4
BTW: !R = (1 – R)
RBDs
non-series-parallel systems – conditioning – Example
1
C3 down
2
1
2
4
5
1
2
4
5
3
4
5
C3 up
RBDs
non-series-parallel systems – conditioning – Example – cont’d
• Component C3 is chosen to factor on (or condition on)
• Upper resulting block diagram: C3 is down
• Lower resulting block diagram: C3 is up
• Series-parallel reliability formulas are applied to both the resulting block
diagrams
• Use the theorem of total probability to get the final result
RBDs
non-series-parallel systems – conditioning – Example – cont’d
RC3down= 1 - (1 - R1R2) (1 - R4R5)
RC3up = (1 - Q1Q4)(1 - Q2Q5)
= [1 - (1-R1) (1-R4)] [1 - (1-R2) (1-R5)]
Rbridge = RC3down (1-R3 ) + RC3up R3
RDBs
Triple Modular Redundancy – TMR
System works properly if
• 2 out of 3 components work properly
AND the voter works properly
≅
RBDs
TMR
• MTTFTMR is shorter than MTTFCOMP
• Can tolerate transient faults and permanent faults
• Higher reliability (for shorter missions)
When do we have the same reliability?
• RTMR(t) = RC(t)
≅ 0.7 MTTFC
• RTMR(t) > RC(t) when the mission time is shorter than 70% of MTTFC
RBDs
TMR
TMR: 2 out of 3 components (voter is a ‘perfect’ element)
RBDs
nMR
TMR: 2oo3 and nMR: 3oo5
Redundancy is useful
for specific
mission times
RBDs
Pros and cons
Advantages
• an RBD allows early assessment of design concepts and allows to
easily visualize the system logic
Limitations
• breaking down the systems to identify multiple levels of components
> considerable effort
• analyzing complex reliability diagrams can be difficult > not simple
series / parallel configurations
• modeling non-hardware failure mitigation measures, such as
training and procedures, is difficult using this technique
Exercise
• The failure distribution of the main landing gear of a commercial airliner is
Weibull with a shape parameter of 1.6 and a characteristic life of 10,000
landings
• The nose gear also has a Weibull distribution with a shape parameter of
0.90 and a characteristic life of 15,000 landings
• What is the reliability of the landing gear system if the system is to be
overhauled after 1,000 landings?
• R(t) = exp[-(t/η)]β
η: scale parameter
β: shape parameter
Exercise 2
• Out of the 12 identical AC generators on the C-5 aircraft, at least 9 of them
must be operating in order for the aircraft to complete its mission. Failures
are known to follow an exponential distribution with a failure rate of 0.01
failure per hour. What is the reliability of the generator system over a 10
hour mission?
FAULT TREES
Fault Tree Analysis
Defines a correlation between events and failures
– Composes events and caused by means of “logic gates”
The approach offers a tree model of the events and conditions that lead to the
failure
It can be used to characterize a system and to evaluate the overall dependability
properties
Approach
Fault Tree Analysis
It is a top-down approach to failure analysis, starting with a potential
undesirable event (accident) called a TOP event, and then determining all the
ways it can happen
– What (combination of) events can lead to the TOP event
• Use “logic gates” to connect the causes
A FTA starts with the undesired event and traces backward to the necessary and
sufficient causes
Fault Tree Analysis - FTA
Historical perspective
Bell Telephone Laboratories developed the concept in 1962 for the US Air Force
for use with the Minuteman system
Later improved by Boeing Company
One of many symbolic “analytical logic techniques” found in operations
research and in system reliability
FTA steps
Taken from
Fault Tree Handbook with Aerospace Fault Tree Handbook with Aerospace
Applications – version 1.1 - 2002
Fault Trees
•
•
•
•
•
Combinatorial (non-state-space) model type
Components are represented as nodes
Components or subsystems in series are connected to OR gates
Components or subsystems in parallel are connected to AND gates
Components or subsystems in k-of-n (RBD) are connected as (n-k+1)-of-n gate
Fault Tree Diagram
System model produced with a symbolic representation
– Basic events are connected with logical gates
Events
Gates
Fault Tree Diagrams
Example
CPU
DS1
DS1
NIC1
CPU
DS2
DS3
NIC2
DS2
DS3
NIC1
NIC2
TOPEvent = !CPU + (!DS1 !DS2 !DS3) + (!NIC1 !NIC2)
R = 1 – P(TOPEvent) = 1 - P(!CPU + (!DS1 !DS2 !DS3) + (!NIC1 !NIC2))
Fault Tree Diagrams
Events
Events
Meaning
Basic Event
A basic initiating fault (or failure event)
External Event
(House Event)
An event that is normally expected to occur. In
general, these events can be set to occur or not
occur, i.e. they have a fixed probability of 0 or 1.
Undeveloped Event
An event which is no further developed. It is a
basic event that does not need further
resolution.
Conditioning Event
A specific condition or restriction that can apply
to any gate.
Transfer
Indicates a transfer continuation to a sub tree
Symbol
Fault Tree Diagrams
Gates
Gates
Meaning
AND
The output event occurs if all input events occur
OR
The output event occurs if at least one of the input events
occurs
Voting OR
(k-out-of-n)
The output event occurs if k or more of the input events
occur
Inhibit
The input event occurs if all input events occur and an
additional conditional event occurs
Symbol
k
Priority AND The output event occurs if all input events occur in a
specific sequence
XOR
The output event occurs if exactly one input event occurs
Mutually
XOR
The output event occurs when any input event occurs. All
other events cannot occur.
M
Fault Tree Diagrams
Failure probabilities
Gate
Failure probability
OR
Pt = P1 + P2 – (P1P2)
AND
Pt = P1P2
Pt = P1(1–P2) + (1–P1)P2 + P1P2
= P1–P1P2 + P2 – P1P2 + P1P2
= P1 + P2 –P1P2
Pt : total failure probability
Pi: failure probability, event i
Priority AND
XOR
Pt = P1 + P2 – 2(P1P2)
Voting OR
k-out-of-n
Pt = P1 + P2+ P3– (P1P2) – (P1P3) – (P2P3) – (P1P2P3)
(2-out-of-3)
Mutually XOR
Pt = P1 + P2
Fault Tree Diagrams
FTD model
Gate & Events symbols and descriptions
gate description
gate
identifier
event ID
basic events
Fault Tree Diagrams
FTD Example
top-level
event
Fault Tree Diagrams
Fault tree construction
Define a TOP event in a clear and unambiguous way
– What
– Where
– When
What are the necessary, and sufficient events and conditions causing the TOP
event?
Connect via AND- or OR-gate
Appropriate level
– Independent basic events
– Events for which we have failure data
Example
Fuel system schematic
Example
Fuel system faults
Fault States
– No flow when needed
– Flow cannot be shut off
Component failures
– Block Valve A fail open/fail close
– Block Valve B fail open/fail close
– Control Valve A fail open/fail close
– Control Valve B fail open/fail close
Example
No flow when needed
Example
Flow cannot be shut off
Fault tree
Major characteristics:
• FT without repeated events (same event in input at different gates)
• can be mapped onto RBDs
• can be solved in linear time
• FT with repeated events
• Theoretical complexity: exponential in the number of components
• Up to 100 components can be solved …
How to exploit Fault Trees?
Analyze the the causes leading to top events and identify the critical elements
within the entire system
Identify the (sets of) basic elements that cause the top event
Fault Tree Analysis
Cut set: definition and use
Given a fault tree, it is possible to derive
cut sets
Cut set: a set of basic events, such that if they all occur, the top event will occur
Puts basic events into relation with the final outcome top set event
Minimal cut set: smallest set of basic events leading to the top event
Minimal cut set
Cut sets
The set is minimal if all its events must occur to lead to the top-event
Each fault tree has a finite number of unique minimal cut sets
The number of different basic events in a minimal cut set is called the order
They identify all distinct ways a top event can occur w.r.t. basic events
Path set
A set of basic events whose non-occurrence (simultaneous) guarantees that the top
event does not occur
A minimal path set is one that cannot be reduced without loosing its status as a
path set
Qualitative assessment
Qualitative assessment by investigating the minimal cut sets
– Ordering cut sets
– Ranking based on the type of basic events
• Human error (most critical)
• Failure of active equipment
• Failure of passive equipment
– Spot “large” cut sets with dependent items
Qualitative assessment
Ranking of events
Rank
Basic event 1
Basic Event 2
1
Human error
Human error
2
Human error
Failure of active unit
3
Human error
Failure of passive unit
4
Failure of active unit
Failure of active unit
5
Failure of active unit
Failure of passive unit
6
Failure of passive unit
Failure of passive unit
Quantitative analysis
Cut sets are computed and failure probabilities are combined to get the top event
probability
– Generate cut sets
– Apply failure data
– Compute probabilities
– Compute criticality measures
Instruments:
FT mathematics (Boolean algebra & probability)
FT approximation methods
Reliability
R = e-λT
R+Q=1
Q = 1 – e-λT
Approximation: when λT < 0.0001 then Q ≈ λT
Where:
λ : component failure rate = 1 / MBTF
T: time interval
Failure probabilities
Gate
Failure probability
OR
Pt = P1 + P2 – (P1P2)
AND
Pt = P1P2
Pt : total failure probability
Pi: failure probability, event i
Priority AND
XOR
Pt = P1 + P2 – 2(P1P2)
Voting OR
k-out-of-n
Pt = P1 + P2+ P3 – (P1P2) – (P1P3) – (P2P3) – (P1P2P3)
(2-out-of-3)
Mutually XOR
Pt = P1 + P2
Minimal cut sets: computation
Boolean reduction
Bottom up reduction algorithms
Binary Decision Diagrams
Min Terms method (Shannon decomposition)
Modularization methods
Genetic algorithms
Minimal Cut Set: example
Cut Sets:
A
A,B
A,B,C
A,B
Min Cut Sets:
A
Minimal Cut Set: example
Probability top event
Ptop-event = 1 – Π(1-PCSi)
PCSi : probability of the i-th cut set
Ptop-event = 1 – (1-PA) (1-PAPB) (1-PAPBPC)
FTA example
T = E1 E2
E1 = A + E3
E2 = C + E4
E3 = B + C
E4 = A B
T = (A+B+C)(C+AB)
= AC+BC+C+AB+ABC
= C + AB
Minimal Cut Set: example
FTA example - continued
T = C + AB
Minimal cut sets:
C
AB
PTopEvent = 1 – Π(1-PCSi)
PTopEvent = 1 – (1-PC) (1-PAB)
Fault Tree Analysis
Characteristics
It works by analyzing the “failure space” by looking at failure combinations
– When this fails …
Traditionally used to analyze fixed probabilities
– each event that comprises the tree has a fixed probability of occurring
– no variation in time (no repair, …)
SAFETY RELATED ANALYSIS
Functional safety and its analysis
In the context of electrical/electronic (E/E) systems
• Need to safely operate machines that have a potential to cause injury or
property damage
• Standards were established for applications like machinery or aviation, because
of the potential impact caused by a safety-related failure
• Effort to achieve a high probability to detect or prevent random hardware faults
random are hardware failures that appear during the lifetime of a device because of a
defect/wear-out as opposed to systematic failures induced in a deterministic way in the production
process, manufacturing ….
• “devices/functions” are specifically designed and added to a nominal system to
provide safety
Safety-related requirements
• Hazards analysis leads to safety function requirements
the identification of the functions necessary for the system to operate safely
with respect to dangerous faults
rd something that has the
a
z potential to harm you
ha
hazard analysis
safety function
requirements
• Risk assessment leads to safety integrity requirements
the likelihood of a safety function to properly perform
the likelihood of a hazard
actually causing harm
k
risthe future impact of a hazard
risk
assessment
that is not controlled or eliminated
safety integrity
requirements
hazard / risk assessment matrix
critical situations require control of
the function to prevent and/or
reduce the risk
Likelihood
RISK
Frequency of occurrence
Hazard categories
1
catastrophic
2
critical
3
serious
4
minor
A
frequent
1A
2A
3A
4A
B
probable
1B
2B
3B
4B
C
occasional
1C
2C
3C
4C
D
remote
1D
2D
3D
4D
E
improbable
1E
2E
3E
4E
Consequences
fatality
major injuries
minor injuries
negligible injuries
very likely
high
high
high
medium
likely
high
high
medium
medium
unlikely
high
medium
medium
low
highly
unlikely
medium
medium
low
low
Safety functions
• safety functions are what needs to be done to achieve the desired/required level
of safety, to reduce risk
• elements that are added to the system to mitigate the effects of a fault
somewhere else in the system
They can be:
• “on demand” (or “low demand”)
• “continuous” (or “high demand”) *** automotive
• On demand: found in a protection system separated from the system-underconsideration
• Continuous: part of the system-under-consideration
Safety functions
• In general
normal
functions
safety
functions
• In automotive
normal
functions
safety
functions
Safety Integrity Levels
Associated with Safety Integrity Requirements
Apply to safety functions and not to system functions
• (Automotive) Safety Integrity Level – (A)SIL
SIL and ASIL
61508 SIL
Probability of dangerous failure
per hour
Risk reduction factor
4
from 10-5 to 10-4
From 100,000 to 10,000
3
from 10-4 to 10-3
From 10,000 to 1,000
2
from 10-3 to 10-2
From 1,000 to 100
1
from 10-2 to 10-1
From 100 to 10
ASIL
FAILURE RATE
SPFM
LFM
A
< 1000 FIT
Not relevant
Not relevant
B
< 100 FIT
>= 90%
>= 60%
C
< 100 FIT
>= 97%
>= 80%
D
< 10 FIT
>= 99%
>= 90%
FIT: 1 fault in one device in 109 hours
measure the ability of a
function/device to detect faults
Architectural metrics: SPFM, LFM and PMHF
Single Point Fault Metric Reflects the robustness of a function to
single point faults either by design or by
coverage
Latent Fault Metric Reflects robustness of a function against
latent faults (mainly safe faults)
Probabilistic Metric Overall likelihood of risk to happen,
Hardware Faults measured in FIT
Single Point Fault Causes a violation of the safety-goal: no
mechanism has been designed to detect
this fault
ASIL Assignment from standards
ability
l
l
o
r
t
n
o
C
uency X
q
e
r
F
x
Severity
Light / moderate injuries
Sever injuries /
survival probable
Controllable
Normally controllable
Difficult to control
very unlikely
Quality Management
Quality Management
Quality Management
unlikely
Quality Management
Quality Management
Quality Management
likely
Quality Management
Quality Management
ASIL A
very likely
Quality Management
ASIL A
ASIL B
very unlikely
Quality Management
Quality Management
Quality Management
unlikely
Quality Management
Quality Management
ASIL A
likely
Quality Management
ASIL A
ASIL B
ASIL A
ASIL B
ASIL C
very unlikely
Quality Management
Quality Management
ASIL A
unlikely
Quality Management
ASIL A
ASIL B
likely
ASIL A
ASIL B
ASIL C
very likely
ASIL B
ASIL C
ASIL D
very likely
Life-threatening / fatal
Functional Safety: standards
IEC 61508 – the generic functional safety standard for electrical
and electronic (E/E) systems
ISO 26262 – automotive-specific
Example: ABS
Is it a safety function?
• It seems to be an “on demand” protection system
... used when necessary
• However, in a modern vehicle it is closely bound
to the operation of the powertrain itself
• It implements not only the ABS but also a variety
of stability control functions
• It performs safety and other normal functions
ASIL
malfunction
ABS system failure
hazard analysis
What unintended situation (hazard) may arise?
How likely is the hazard?
How harmful is the hazard? (severity)
How controllable is the system if the hazard occurs?
risk analysis
What ASIL level is requested?
How likely is the fault – fault rate (FIT)?
How often should the safety function detect the fault/handle it – effectiveness of the detection?
ASIL
determination
low
high
A
B
C
D
ASIL – ABS example
Safety function analyses
• FMEA – Failure Mode and Effects Analysis
• FMECA – Failure Mode, Effects and Criticality Analysis
• FMEDA – Failure Modes, Effects and Diagnostic Analysis
FMEA, FMECA, FMEDA
Analysis: boundary conditions
The physical boundaries of the system
which parts are included in the analysis
The initial conditions
what is the operating status when the (top) event occurs
The level of resolution
how detailed the analysis should be (components, faults, …)
FMEA
Progressively selects the individual components or functions within a system and
investigates possible modes of failure
Considers possible causes for each failure mode and assesses the likely
consequences
Effects of the failure are determined for the unit itself and for the complete system
Possible remedial actions are suggested
FMEA
Goals
Often used at a functional level, early in the lifecycle
May be applied at several levels to refine the analysis
The analysis can include probabilistic values
Used to provide input data for fault tree analysis
FMEA
Analysis steps
Four main steps
– System definition, its functions and components
– Failure modes identification, and their causes
– Effects identification (top events)
– Conclusions and recommendations
For each failure mode, with a pre-defined scale:
– Severity
– Frequency
– Detection
Risk Priority Number = S•F•D
FMEA
Report sheet
FMEA
Example: X-by-wire system
FMEA
Report
Failure mode assumptions
Value failure: The unit produces one or several erroneous results which are
syntactically correct.
Timing failure: The logic value of a result is correct, but the result is
delivered too late, or too early.
Omission failure: The unit stops producing results for some finite time, and
then (after an internal recovery) commences to produce correct and timely
results again.
Crash failure: The unit stops producing results and does not recover from the
failure.
Silent failure: The unit produces either no results at all, or results that can be
identified as being incorrect by all other units.
FMECA
Extension of FMEA taking into account the importance of each component failure
Introduces consequences of particular failures, probability/frequency of occurrence
Determines sections of the system where failures are most important
Categorization of system effects
Fault category
Description
Environmental, Safety, and Health Result Criteria
I
Catastrophic
Failure that may cause death to the uninvolved public or
safety-critical system loss
II
Critical
Failure that may cause severe injury to the uninvolved
public, major property damage, or major safety- critical
system damage
III
Marginal
Failure that may cause minor injury to the uninvolved
public or minor safety-critical damage
IV
Negligible
Failure not serious enough to cause injury to the
uninvolved public or safety-critical system damage
Fault Severity Categories and Corresponding System Effects
Likelihood of events
Description
Level
Likelihood criteria
Frequent
A
Likely to occur often in the life of an item, with a
probability of occurrence greater than 10-1 per mission
Probable
B
Will occur several times in the life of an item, with a probability of
occurrence less than 10-1 but greater than 10-2 per mission
Occasional
C
Likely to occur some time in the life of an item, with a probability of
occurrence less than 10-2 but greater than 10-3 per mission
Remote
D
Unlikely but possible to occur in the life of an item, with a probability
of occurrence less than 10-3 but greater than 10-6 per mission
Improbable
E
So unlikely, it can be assumed occurrence may not be experienced,
with a probability of occurrence less than 10-6 per mission
Extended categorization
Fault category
Effect on system
I
Negligible
IIA
A second fault event causes a transition into Category III - Critical
IIB
A second fault event causes a transition into Category IV Catastrophic
IIC
A system safety problem whose effect depends upon the situation
III
Critical and mission must be aborted
IV
Catastrophic
Extended Fault Severity Categories and Corresponding System Effects
Risk acceptance matrix
FMECA
Example
Risk acceptance matrix
Analysis
• Identify the ways to fail, which are referred to as failure
modes
– Identifying causes of failure is useful because each cause may
have its own mitigation approach
• Identify all the effects or consequences of each failure
mode
• Identify the worst credible severity and probability for
each failure mode as designated in the severity and
likelihood
• Assess the risk, using the risk acceptability matrix
FMECA
Pros and cons
Advantages
an FMECA allows for systematic evaluation of item failures and helps
identify single-point failures
may identify hazards overlooked in other system analysis techniques
Limitations
does not account for the consequences of coexisting, multi- element
faults and failures > complementary techniques to identify multiple
faults and failures
majority of FMECA do not account for human error or hostile
environments
an FMECA can give an optimistic estimate of system reliability if a
quantitative approach is used
FMEDA
Analysis of the behavior of the functionality devoted to guaranteeing safety
Two points of view:
– Functional/Architectural level
– Structural level
It combines standard FMEA techniques with extensions to identify online
diagnostics techniques and the failure modes relevant to safety instrumented
system design
FMEDA
Assumptions
•
•
•
•
Only a single component failure will fail the entire product
Failure rates are constant, wear out mechanisms are not included
Propagation of failures is not relevant
All components that are not part of the safety function and cannot influence
the safety function (feedback immune) are excluded
FMEDA
Faults
Faults can be:
• SAFE (S): faulty and the outcome is that it provides the safety function even
if not requested
• DANGEROUS (D): faulty and the outcome is that it cannot provide the safety
function when requested
• NO EFFECT
• DETECTED (D): the fault produces an observable effect, that can be detected
before requesting the safety function
• UNDETECTED (U): the fault corrupts the safety functions and it is only
understandable when it will be required
FMEDA
Goals
• Evaluate the impact of DANGEROUS UNDETECTED faults
• Harden the system so that such impact is limited – Safety Functions are
characterized by a Safety Integrity Level (SIL) that expresses the probability
(on average) of being able to provide the safety function WHEN REQUESTED
• Two actions:
• Architectural level
• Circuit level
How does the system react to the
occurrence of a fault?
TOPIC
QUESTIONS
How reliable or available is the
system?
What are the most critical faults?
Reliability/Availability estimation
TOPICS
+ RBDs
+ FTs
+ FMEAs
References
Fault Tree Handbook
Marvin Rausand’s Chapter on “System Analysis Fault Tree Analysis”
Arnljot Hoyland, Marvin Rausand, “System Reliability Theory”, John Wiley &
Sons, Inc., 1994
Federal Aviation Administration, “Guide to Reusable Launch and Reentry Vehicle
Reliability Analysis”, 2005
http://sharpe.pratt.duke.edu/
Download