System Safety

Using Probability for Risk Analysis
Barbara Luckett – 20 October 2009
Personal Background

 Naval Surface Warfare Center Dahlgren Division (NSWCDD)
   "…premier research and development center that serves as a specialty site for weapon system integration." – NSWCDD website


 Platform System Safety Branch
 Department of Defense (DoD) acquisition projects
 http://www.austal.com/index.cfm?objectID=6B42CC62-65BF-EBC1-2E3E308BACC92365
System Safety Terms and Concepts

 System – "a composite, at any level of complexity, of personnel, procedures, materials, tools, equipment, facilities, and software… used together in the intended operational or support environment to perform a given task or achieve a specific purpose." – MIL-STD 882C
 Safety – "freedom from those conditions that can cause death, injury, occupational illness, or damage to or loss of equipment or property, or damage to the environment." – MIL-STD 882C
What is System Safety?

 "The application of engineering and management principles, criteria, and techniques to optimize all aspects of safety within the constraints of operational effectiveness, time, and cost throughout all phases of the system life cycle." – MIL-STD 882C
 "For almost any system, product, or service, the most effective means of limiting product liability and accident risks is to implement an organized system safety function beginning in the conceptual design phase, and continuing through to its development, fabrication, testing, production, use, and ultimate disposal." – System Safety Society website
System Safety Terms and Concepts

 Hazard – "any real or potential condition that can cause death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment." – MIL-STD 882C
 Mishap – "an unplanned event or series of events resulting in death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment." – MIL-STD 882C
 Effect – "the result of a mishap (ie: death, injury, occupational illness; or damage to or loss of equipment or property; or damage to the environment)." – MIL-STD 882C
Mishap Severity

Catastrophic (Category 1) – Could result in: death, permanent total disability; system loss, loss exceeding $1M; or irreversible severe environmental damage that violates law or regulation.
Critical (Category 2) – Could result in: permanent partial disability, injuries, or occupational illness that may result in hospitalization of at least three personnel; major system damage, loss exceeding $200K but less than $1M; or reversible environmental damage causing a violation of law or regulation.
Marginal (Category 3) – Could result in: injury or occupational illness resulting in one or more lost workdays; minor system damage, loss exceeding $10K but less than $200K; or mitigable environmental damage without violation of law or regulation where restoration activities can be accomplished.
Negligible (Category 4) – Could result in: injury or illness not resulting in a lost work day; insignificant system damage, loss exceeding $2K but less than $10K; or minimal environmental damage not violating law or regulation.
Mishap Probability

A: Frequent – Likely to occur often in the life of an item, with a probability of occurrence greater than 10⁻¹. Fleet criteria: continuously experienced.
B: Probable – Will occur several times in the life of an item, with a probability of occurrence less than 10⁻¹ but greater than 10⁻² in that life. Fleet criteria: will occur frequently.
C: Occasional – Likely to occur some time in the life of an item, with a probability of occurrence less than 10⁻² but greater than 10⁻³ in that life. Fleet criteria: will occur several times.
D: Remote – Unlikely but possible to occur in the life of an item, with a probability of occurrence less than 10⁻³ but greater than 10⁻⁶ in that life. Fleet criteria: unlikely, but can reasonably be expected to occur.
E: Improbable – So unlikely it can be assumed occurrence may not be experienced, with a probability of occurrence less than 10⁻⁶. Fleet criteria: unlikely to occur, but possible.
Mishap Risk Index (MRI)

Rows are mishap probability levels; columns are mishap severity categories (1: Catastrophic, 2: Critical, 3: Marginal, 4: Negligible).

A: Frequent    – HIGH,    HIGH,    SERIOUS, MEDIUM
B: Probable    – HIGH,    HIGH,    SERIOUS, MEDIUM
C: Occasional  – HIGH,    SERIOUS, MEDIUM,  LOW
D: Remote      – SERIOUS, MEDIUM,  MEDIUM,  LOW
E: Improbable  – MEDIUM,  MEDIUM,  MEDIUM,  LOW
Risk Level – Suggested Criteria – Acceptance Authority
HIGH – Unacceptable – Service Acquisition Executive
SERIOUS – Undesirable – Program Executive Officer
MEDIUM – Acceptable with review – Program Manager
LOW – Acceptable without review – Program Manager
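The two tables above can be combined into a small lookup. The following Python sketch is illustrative only; the function and variable names are my own, not from the source:

```python
# Mishap Risk Index matrix: rows are probability levels A-E,
# columns are severity categories 1 (Catastrophic) .. 4 (Negligible).
MRI = {
    "A": ["HIGH", "HIGH", "SERIOUS", "MEDIUM"],   # Frequent
    "B": ["HIGH", "HIGH", "SERIOUS", "MEDIUM"],   # Probable
    "C": ["HIGH", "SERIOUS", "MEDIUM", "LOW"],    # Occasional
    "D": ["SERIOUS", "MEDIUM", "MEDIUM", "LOW"],  # Remote
    "E": ["MEDIUM", "MEDIUM", "MEDIUM", "LOW"],   # Improbable
}

def risk_level(probability: str, severity: int) -> str:
    """Look up the MRI for a probability level 'A'..'E' and severity 1..4."""
    return MRI[probability][severity - 1]

print(risk_level("D", 1))  # SERIOUS
```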
How do we get these values?

 Severity values are obtained by brainstorming "worst credible" mishaps in each of three categories:
1. Personnel injury/death
2. Damage to system equipment
3. Environmental damage
 Probability values are a little more technical…
Probability Terms and Concepts

 "The probability of any outcome of a random phenomenon is the proportion of times the outcome would occur in a very long series of repetitions." – COMAP
 "The sample space S of a random phenomenon is the set of all possible outcomes." – COMAP
 "An event is any outcome or set of outcomes of a random phenomenon… An event is a subset of the sample space." – COMAP
– COMAP text, 7th edition, pages 289-299
Probability Rules
1. 0 ≤ P(A) ≤ 1
2. P(S) = 1
3. P(Aᶜ) = 1 − P(A)
4. P(A or B) = P(A) + P(B) − P(A and B)
5. P(A and B) = P(A) × P(B), provided A and B are independent
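As a quick check, the five rules can be verified numerically with a fair six-sided die. This is an illustrative sketch; the events A and B below are made up for the example:

```python
from fractions import Fraction

# Sample space of a fair six-sided die; each outcome has probability 1/6.
S = set(range(1, 7))
P = {outcome: Fraction(1, 6) for outcome in S}

def prob(event):
    """P(event) = sum of the probabilities of the outcomes in the event."""
    return sum(P[o] for o in event)

A = {2, 4, 6}   # event: roll is even
B = {1, 2}      # event: roll is at most 2

assert 0 <= prob(A) <= 1                               # Rule 1
assert prob(S) == 1                                    # Rule 2
assert prob(S - A) == 1 - prob(A)                      # Rule 3 (complement)
assert prob(A | B) == prob(A) + prob(B) - prob(A & B)  # Rule 4
assert prob(A & B) == prob(A) * prob(B)                # Rule 5 (A, B independent here)
print("all five rules check out")
```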
Methods of Obtaining Probability Values
 Fault Tree Analysis
 Historical Mishap Data
 Given Information
 Standard Calculations
Fault Tree Analysis (FTA)

 Originally developed by Bell Telephone Laboratories in 1962 for the U.S. Air Force
   ○ Used to analyze probabilities of inadvertent launch of Minuteman missiles
 Technique was expanded and improved upon by the Boeing Company
 Fault Trees are now one of the most widely used methods in system reliability and failure probability analysis
Fault Tree Analysis (FTA)

 A Fault Tree is a top-down structured graphical representation of the interaction of failures of events within a system
   ○ Basic events (hazards and their causal factors) are at the bottom of the fault tree and are linked via logic symbols (known as gates) to a top event (mishap)
 Events in a Fault Tree are continually expanded until sub-events are created for which you can assign a probability
 We can use known probability values for the basic events, along with knowledge of logic gates and Boolean logic, to calculate the probability of the mishap occurring
Review of Logic Gates for FTA
 AND   NAND   OR   NOR   XOR   XNOR   NOT

A  B | A and B | A or B | A xor B | not A | A nand B | A nor B | A xnor B
0  0 |    0    |   0    |    0    |   1   |    1     |    1    |    1
0  1 |    0    |   1    |    1    |   1   |    1     |    0    |    0
1  0 |    0    |   1    |    1    |   0   |    1     |    0    |    0
1  1 |    1    |   1    |    0    |   0   |    0     |    0    |    1
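The gates in the truth table can be written as one-line Python functions; a minimal sketch that reproduces the table row by row:

```python
# Basic logic gates over 0/1 inputs, matching the truth table above.
def AND(a, b):  return a & b
def OR(a, b):   return a | b
def XOR(a, b):  return a ^ b
def NOT(a):     return 1 - a
def NAND(a, b): return NOT(AND(a, b))
def NOR(a, b):  return NOT(OR(a, b))
def XNOR(a, b): return NOT(XOR(a, b))

# Print one row per input combination, in the same column order as the table.
for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), XOR(a, b), NOT(a),
              NAND(a, b), NOR(a, b), XNOR(a, b))
```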
AND Gate Logic

Top event A: All on-site power fails
Inputs: B: Generator #1 fails; C: Generator #2 fails; D: Generator #3 fails

 All on-site power fails iff Generator #1 fails and Generator #2 fails and Generator #3 fails
 A = B and C and D
 P(A) = P(B) × P(C) × P(D), assuming the generator failures are independent
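Numerically, the AND gate is just a product of the input probabilities. A minimal sketch with made-up generator failure probabilities (assumed independent; the values are not from the source):

```python
# Hypothetical failure probabilities for the three generators.
p_gen1, p_gen2, p_gen3 = 0.05, 0.05, 0.02

# AND gate: the top event occurs only if ALL inputs occur.
p_all_power_fails = p_gen1 * p_gen2 * p_gen3
print(round(p_all_power_fails, 10))  # 5e-05
```

Note how quickly redundancy drives the top-event probability down: three moderately unreliable generators together yield a very small probability of total power loss.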
OR Gate Logic

Top event A: Elevator door 'closed' failed
Inputs: B: Hardware failure; C: Human error; D: Software failure

 Elevator door 'closed' failed iff hardware failure or human error or software failure
 A = B or C or D
 P(A) = P(B) + P(C) + P(D) − P(B)P(C) − P(B)P(D) − P(C)P(D) + P(B)P(C)P(D), assuming the failures are independent
FTA Methodology

 Generally involves five steps:
1. Define the undesired top event (mishap)
2. Obtain an understanding of the system
3. Construct the fault tree
   ○ Deductively define all potential failure paths
4. Evaluate the probability of the top event
5. Analyze output and determine what is required to mitigate the top event
1. Define the undesired top event (mishap)
   ○ Fire Protection Systems Fail
2. Obtain an understanding of the system
   ○ Primary smoke detection system with secondary heat detection system
   ○ AFFF (Aqueous film-forming foam) fire suppression system
3. Construct the fault tree
   ○ Deductively define all potential failure paths

Fire Protection Systems Fail
   Fire detection system fails
      Smoke detection system fails
      Heat detection system fails
   Fire suppression system fails
      Pump fails
      Blocked nozzles
Historical Mishap Data

 Using the probability of an event occurring in the past to predict the probability of the event occurring in the future
 EX) If we have a fleet of 5 ships (each with 6 freight elevators onboard) that have been in operation for 20 years (each elevator used approx. 35 hours/year), with 3 injuries caused by elevator malfunctions:

Operational hours per year (per ship): 6 elevators × 35 hours = 210
Total operational time in years: 5 ships × 20 years = 100
Total operational hours: 210 × 100 = 21,000

 The probability of a mishap can now be determined by dividing the number of times a mishap has occurred by the total operational hours
 P(mishap) = # of mishaps / total hours = 3/21,000 ≈ 1.42857 × 10⁻⁴
 This falls into the REMOTE probability level
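The fleet example above reduces to a few lines of arithmetic; a minimal sketch:

```python
# Fleet example: 5 ships x 6 elevators x 20 years x 35 hours/year.
ships, elevators_per_ship, years, hours_per_year = 5, 6, 20, 35
mishaps = 3

total_hours = ships * elevators_per_ship * years * hours_per_year
p_mishap = mishaps / total_hours

print(total_hours)         # 21000
print(round(p_mishap, 7))  # 0.0001429
assert 1e-6 < p_mishap < 1e-3  # inside the D: Remote probability band
```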
Given Information: Hardware Components

 Often, probability of failure for a system's hardware components may be available
 EX) Consider a system with an operational function that is dependent on all four of the individual components working (ie: the system function fails if any one of the components fails):

Component name – Component type – P(failure) per 1 million operational hours – P(success)
100 – A – 0.0643 – 0.9357
101 – B – 0.0917 – 0.9083
102 – C – 0.075 – 0.925
103 – B – 0.0917 – 0.9083

P(system failure) = 1 − [P(component A does not fail) × P(B does not fail) × P(C does not fail) × P(B does not fail)]
= 1 − [(0.9357)(0.9083)(0.925)(0.9083)] = 1 − 0.71406
= 0.28594 per 1 million operational hours
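The series-system calculation above takes only a few lines; a minimal sketch:

```python
# P(success) values for components 100-103, taken from the table above.
p_success = [0.9357, 0.9083, 0.925, 0.9083]

# The function works only if every component works
# (component failures assumed independent).
p_all_ok = 1.0
for p in p_success:
    p_all_ok *= p

p_system_failure = 1 - p_all_ok
print(round(p_system_failure, 5))  # 0.28594
```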
Given Information: Test Scenarios

 Operational tests can be conducted to provide an estimate of failure for certain system components
 EX) We can run a series of tests on a fire suppression system and note when the fire is extinguished.
   ○ Define a success here as an event where the fire is extinguished in less than 60 seconds from system activation.
 If we conduct 10 tests, and the system fails to extinguish the fire in under a minute once, we have P(failure) = 0.1
   ○ This estimate is not very accurate due to the small sample size
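A quick simulation illustrates why 10 trials give a coarse estimate. This sketch assumes a true failure probability of 0.1 purely for illustration (that value is not from the source):

```python
import random

random.seed(0)  # fixed seed for a reproducible run

failures, trials = 1, 10
p_hat = failures / trials
print(p_hat)  # 0.1

# Repeat the 10-trial experiment 1000 times with an assumed true failure
# probability of 0.1: the individual estimates scatter widely around 0.1.
estimates = [sum(random.random() < 0.1 for _ in range(trials)) / trials
             for _ in range(1000)]
print(min(estimates), max(estimates))  # wide spread for only 10 trials
```

With so few trials, a single run can easily report 0.0 or 0.3 for a true value of 0.1, which is why small-sample test estimates should be treated with caution.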
Standard Calculations: Event Types

 Let qᵢ(t) = P(failure of unit i occurs at time t)
 Different types of events:

1. Non-repairable unit
   ○ Unit i is not repaired when a failure occurs
   ○ Failure rate of λᵢ
   ○ qᵢ(t) = 1 − e^(−λᵢt) ≈ λᵢt
2. Repairable unit (repaired when failure occurs)
   ○ Unit i is repaired when a failure occurs and is assumed to be as good as new following a repair
   ○ Failure rate of λᵢ
   ○ Mean Time to Repair of MTTRᵢ
   ○ qᵢ(t) ≈ λᵢ × MTTRᵢ
3. Periodically tested (hidden failures)
   ○ Unit i is tested periodically with test interval τᵢ
   ○ Failure may occur at any time in the test interval, but the failure is only detected in a test or if a demand for the unit occurs. Typical for safety-critical units (ie: smoke detectors)
   ○ Failure rate of λᵢ
   ○ Test interval of τᵢ
   ○ qᵢ(t) ≈ λᵢ × τᵢ / 2
4. On-demand probability
   ○ Unit i is not active during normal operation, but may be subject to one or more demands
   ○ Often used for human (operator) error
   ○ qᵢ(t) = P(i fails on request)
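The first three approximations can be sketched as small Python helpers; the function names and example numbers below are illustrative, not from the source:

```python
import math

def q_nonrepairable(lam, t):
    """q(t) = 1 - exp(-λt), approximately λt when λt is small."""
    return 1 - math.exp(-lam * t)

def q_repairable(lam, mttr):
    """Unavailability of a repairable unit, approximately λ × MTTR."""
    return lam * mttr

def q_periodically_tested(lam, tau):
    """Average unavailability over a test interval, approximately λτ/2."""
    return lam * tau / 2

# Illustrative values: λ = 1e-5 failures/hour.
lam = 1e-5
print(round(q_nonrepairable(lam, 100), 6))       # 0.001 (close to λt)
print(round(q_repairable(lam, 8), 6))            # 8e-05
print(round(q_periodically_tested(lam, 720), 6)) # 0.0036
```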
Standard Calculations: Why is Human Error important?

 Human beings are an integral part of any system, so we cannot accurately estimate the probability of failure without taking people into consideration
 "Estimates of the probability that a person will, for example, have a moment's forgetfulness or lapse of attention and forget to close a valve or close the wrong valve, press the wrong button, make a mistake in arithmetic, and so on… They are not estimates of the probability of error due to poor training or instructions, lack of physical or mental ability, lack of motivation, or poor management"
 "… Because so much judgment is involved, it is tempting for those who wish to do so to try to 'jiggle' the figures to get the answers they want… Anyone who uses estimates of human reliability outside the usual ranges should be expected to justify them."
– An Engineer's View of Human Error by Trevor Kletz
Standard Calculations: Human Error Probability

P(Human Error) ≈ K1 × K2 × K3 × K4 × K5

Human Error Probability Parameters

K1 – Type of activity:
• Simple, routine: 0.001
• Requiring attention, routine: 0.01
• Not routine: 0.1

K2 – Temporary stress factor for routine activities (seconds available):
• 2: 10
• 10: 1
• 20: 0.5

K2 – Temporary stress factor for non-routine activities (seconds available):
• 3: 10
• 30: 1
• 45: 0.3
• 60: 0.1
Standard Calculations: Human Error Probability

Human Error Probability Parameters, continued

K3 – Operator qualifications:
• Carefully selected, expert, well-trained: 0.5
• Average knowledge and training: 1
• Little knowledge, poorly trained: 3

K4 – Activity anxiety factor:
• Situation of grave emergency: 3
• Situation of potential emergency: 2
• Normal situation: 1

K5 – Activity ergonomic factor:
• Excellent microclimate, excellent interface with plant: 0.1
• Good microclimate, good interface with plant: 1
• Discrete microclimate, discrete interface with plant: 3
• Discrete microclimate, poor interface with plant: 7
• Worst microclimate, poor interface with plant: 10
Standard Calculations: Human Error Probability

 Consider one scenario:
   ○ Type of activity: Requiring attention, routine → K1 = 0.01
   ○ Stress factor: More than 20 seconds available → K2 = 0.5
   ○ Operator qualifications: Average knowledge and training → K3 = 1
   ○ Activity anxiety factor: Potential emergency → K4 = 2
   ○ Activity ergonomic factor: Good microclimate, good interface with plant → K5 = 1
 P(Human Error) ≈ K1 × K2 × K3 × K4 × K5 = 0.01 × 0.5 × 1 × 2 × 1 = 0.01
 In this situation, a person will fail 1% of the time
 This falls into the PROBABLE category
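The scenario above reduces to a one-line product; a minimal sketch:

```python
# Factor values for the scenario above, read off the parameter tables.
K1 = 0.01  # requiring attention, routine
K2 = 0.5   # more than 20 seconds available (routine activity)
K3 = 1     # average knowledge and training
K4 = 2     # potential emergency
K5 = 1     # good microclimate, good interface with plant

p_human_error = K1 * K2 * K3 * K4 * K5
print(p_human_error)  # 0.01
```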
Back to a Fault Tree Example…

Alarm clock does not wake you up
   Alarm clock failure
      Main (plug-in) clock failure
         Power outage
         Forgot to set (or set incorrectly)
         Faulty clock
            Electrical fault
            Mechanical fault
      Backup (wind-up) clock failure
         Faulty clock
         Forget to wind
         Forget to set (or set incorrectly)
   You don't hear it
Alarm clock does not wake you up
   Alarm clock failure
      Main (plug-in) clock failure
         Power outage – P = 0.012
         Forgot to set (or set incorrectly) – P = 0.008
         Faulty clock
            Electrical fault – P = 0.0003
            Mechanical fault – P = 0.0004
      Backup (wind-up) clock failure
         Faulty clock – P = 0.0004
         Forget to wind – P = 0.012
         Forget to set (or set incorrectly) – P = 0.008
   You don't hear it – negligible
Probability that the Backup (wind-up) clock fails?

Backup (wind-up) clock failure (OR gate)
   Faulty clock – P = 0.0004
   Forget to wind – P = 0.012
   Forget to set (or set incorrectly) – P = 0.008

 P(backup clock failure) = P(faulty clock) + P(forget to wind) + P(forget to set) (rare-event approximation: the product terms are small enough to neglect)
 P(backup clock failure) = 0.0004 + 0.012 + 0.008
 P(backup clock failure) = 0.0204
Probability that the Main (plug-in) clock fails?

Main (plug-in) clock failure (OR gate)
   Power outage – P = 0.012
   Forgot to set (or set incorrectly) – P = 0.008
   Faulty clock (OR gate)
      Electrical fault – P = 0.0003
      Mechanical fault – P = 0.0004

 P(main clock failure) = P(power outage) + P(faulty clock) + P(forgot to set)
 P(main clock failure) = 0.012 + (0.0003 + 0.0004) + 0.008
 P(main clock failure) = 0.012 + 0.0007 + 0.008
 P(main clock failure) = 0.0207
Probability that the Alarm Clock Does Not Wake You Up?

Alarm clock does not wake you up
   Alarm clock failure (AND gate: both clocks must fail)
      Main (plug-in) clock failure – P = 0.0207
      Backup (wind-up) clock failure – P = 0.0204
   You don't hear it – negligible

 P(Alarm Clock Failure) = P(Main Clock Failure) × P(Backup Clock Failure) = 0.0207 × 0.0204 ≈ 0.000422
 P(Alarm Clock Does Not Wake You Up) = P(Alarm Clock Failure) + P(You Don't Hear It)
 P(Alarm Clock Does Not Wake You Up) ≈ 0.000422 = 4.22 × 10⁻⁴
 This falls into the REMOTE category
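The whole alarm-clock tree can be evaluated programmatically. This sketch uses the rare-event sum for OR gates and a product for the AND gate, matching the hand calculation above (helper names are my own):

```python
def or_gate(*ps):
    """OR gate, rare-event approximation: sum the input probabilities."""
    return sum(ps)

def and_gate(*ps):
    """AND gate: all inputs must occur; inputs assumed independent."""
    prod = 1.0
    for p in ps:
        prod *= p
    return prod

p_faulty_main = or_gate(0.0003, 0.0004)        # electrical, mechanical fault
p_main = or_gate(0.012, 0.008, p_faulty_main)  # outage, forgot to set, faulty
p_backup = or_gate(0.0004, 0.012, 0.008)       # faulty, forgot wind, forgot set
p_alarm_fails = and_gate(p_main, p_backup)     # both clocks must fail

print(round(p_main, 4), round(p_backup, 4))  # 0.0207 0.0204
print(round(p_alarm_fails, 6))               # 0.000422
```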
Conclusions

 System Safety is a risk management strategy based on identifying, analyzing, and eliminating or mitigating hazards using a systems-based approach
 Hazards are evaluated and analyzed based on the severity and probability values for their corresponding mishap
 Probability values can be obtained by using basic probability rules and Boolean logic in addition to historical data, published failure values, an understanding of potential failure paths in a system, and some simple calculations
 This allows us to quantitatively analyze risk levels and make an informed recommendation/decision
Sources

 MIL-STD 882C
 Introduction to System Safety: Tutorial for the 19th International System Safety Conference by Dick Church
 An Engineer's View of Human Error by Trevor Kletz
 For All Practical Purposes: Mathematical Literacy in Today's World, 7th edition
 http://www.navsea.navy.mil/nswc/dahlgren/default.aspx
 http://www.system-safety.org/about/
 http://www.weibull.com/basics/fault-tree/index.htm
 http://www.fault-tree.net/papers/andrews-fta-tutor.pdf
 http://www.fault-tree-analysis-software.com/fault-tree-analysis-basics.html
 http://www.ntnu.no/ross/srt/slides/fta.pdf