Software System Safety
Nancy G. Leveson
MIT Aero/Astro Dept.
© Copyright by the author, November 2004.
Accident with No Component Failures
[Figure: computer-controlled batch chemical reactor — the COMPUTER commands the CATALYST valve and the cooling WATER valve; VAPOR from the REACTOR is condensed in the CONDENSER and returned as REFLUX; other labels: VENT, GEARBOX, LA, LC.]
Types of Accidents
Component Failure Accidents
Single or multiple component failures
Usually assume random failure
System Accidents
Arise in interactions among components
No components may have "failed"
Caused by interactive complexity and tight coupling
Exacerbated by the introduction of computers.
[Figure: Safety and Reliability shown as distinct, partially overlapping properties.]
Accidents in high−tech systems are changing
their nature, and we must change our approaches
to safety accordingly.
Confusing Safety and Reliability
From an FAA report on ATC software architectures:
"The FAA’s en route automation meets the criteria for
consideration as a safety−critical system. Therefore,
en route automation systems must posses ultra−high
reliability."
From a blue ribbon panel report on the V−22 Osprey problems:
"Safety [software]: ...
Recommendation: Improve reliability, then verify by
extensive test/fix/test in challenging environments."
Does Software Fail?
Failure: Nonperformance or inability of system or component
to perform its intended function for a specified time
under specified environmental conditions.
A basic abnormal occurrence, e.g.,
burned out bearing in a pump
relay not closing properly when voltage applied
Fault: Higher−order events, e.g.,
relay closes at wrong time due to improper functioning
of an upstream component.
All failures are faults but not all faults are failures.
Reliability Engineering Approach to Safety
Reliability: The probability an item will perform its required
function in the specified manner over a given time
period and under specified or assumed conditions.
(Note: Most software−related accidents result from errors
in specified requirements or function and deviations
from assumed conditions.)
Concerned primarily with failures and failure rate reduction
Parallel redundancy
Standby sparing
Safety factors and margins
Derating
Screening
Timed replacements
Reliability Engineering Approach to Safety (2)
Assumes accidents are the result of component failure.
Techniques exist to increase component reliability
Failure rates in hardware are quantifiable.
Omits important factors in accidents.
May even decrease safety.
Many accidents occur without any component "failure":
e.g., accidents may be caused by equipment operation
outside the parameters and time limits upon which the
reliability analyses are based,
or by interactions of components
all operating according to specification.
Highly reliable components are not necessarily safe.
Software−Related Accidents
Are usually caused by flawed requirements
Incomplete or wrong assumptions about operation of
controlled system or required operation of computer.
Unhandled controlled−system states and environmental
conditions.
Merely trying to get the software "correct" or to make it
reliable will not make it safer under these conditions.
Software−Related Accidents (con’t.)
Software may be highly reliable and "correct" and still
be unsafe.
Correctly implements requirements but specified
behavior unsafe from a system perspective.
Requirements do not specify some particular behavior
required for system safety (incomplete)
Software has unintended (and unsafe) behavior beyond
what is specified in requirements.
A Possible Solution
Enforce discipline and control complexity
Limits have changed from structural integrity and physical
constraints of materials to intellectual limits
Improve communication among engineers
Build safety in by enforcing constraints on behavior
Example (batch reactor)
System safety constraint:
Water must be flowing into reflux condenser whenever
catalyst is added to reactor.
Software safety constraint:
Software must always open water valve before catalyst valve
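A minimal sketch (in Python, with a hypothetical valve interface not taken from the slides) of how this software safety constraint could be enforced as an explicit guard rather than left implicit in call ordering:

    # Hypothetical sketch: enforce "water valve open before catalyst valve".
    class SafetyConstraintViolation(Exception):
        """Raised when a commanded action would violate a safety constraint."""

    class ReactorController:
        def __init__(self):
            self.water_valve_open = False
            self.catalyst_valve_open = False

        def open_water_valve(self):
            self.water_valve_open = True  # reflux condenser now has cooling water

        def open_catalyst_valve(self):
            # Guard: reject the unsafe command instead of trusting callers
            # to issue commands in the right order.
            if not self.water_valve_open:
                raise SafetyConstraintViolation(
                    "catalyst valve commanded open while cooling water is off")
            self.catalyst_valve_open = True

    controller = ReactorController()
    controller.open_water_valve()    # safe ordering succeeds
    controller.open_catalyst_valve()

The point of the guard is that the constraint is checked where the hazardous command is issued, so a change elsewhere in the program cannot silently reorder the valve operations.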
The Problem to be Solved
The primary safety problem in computer−based systems
is the lack of appropriate constraints on design.
The job of the system safety engineer is to identify the
design constraints necessary to maintain safety and to
ensure the system and software design enforces them.
An Overview of The Approach
Engineers should recognize that reducing risk is not an
impossible task, even under financial and time constraints.
All it takes in many cases is a different perspective on the
design problem.
Mike Martin and Roland Schinzinger
Ethics in Engineering
System Safety
A planned, disciplined, and systematic approach to
preventing or reducing accidents throughout the life
cycle of a system.
"Organized common sense" (Mueller, 1968)
Primary concern is the management of hazards:
hazard identification, evaluation, elimination, and control
through analysis, design, and management.
MIL−STD−882
System Safety (2)
Hazard analysis and control is a continuous, iterative process
throughout system development and use.
[Figure: life-cycle phases (conceptual development, design, development, operations) with continuing activities: hazard identification, hazard resolution, verification, change analysis, operational feedback.]
Hazard resolution precedence:
1. Eliminate the hazard.
2. Prevent or minimize the occurrence of the hazard.
3. Control the hazard if it occurs.
4. Minimize damage.
Process Steps
1. Perform a Preliminary Hazard Analysis
Produces hazard list
2. Perform a System Hazard Analysis (not just Failure Analysis)
Identifies potential causes of hazards
3. Identify appropriate design constraints on system, software,
and humans.
4. Design at system level to eliminate or control hazards.
5. Trace unresolved hazards and system hazard controls to
software requirements.
Specifying Safety Constraints
Most software requirements only specify nominal behavior
Need to specify off−nominal behavior
Need to specify what software must NOT do
What the software must not do is not the inverse of what it must do
Derive from system hazard analysis
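To make the distinction concrete, a "must not" constraint can be written as a run-time monitor over observed states, separate from the nominal control logic (an illustrative Python sketch; the state variables are hypothetical):

    # Illustrative sketch: a "must not" constraint checked over a trace of
    # observed states, independently of the code that produced the trace.
    def first_violation(trace):
        """Index of the first state where catalyst flows without water, else None."""
        for i, state in enumerate(trace):
            if state["catalyst_flowing"] and not state["water_flowing"]:
                return i
        return None

    trace = [
        {"water_flowing": True,  "catalyst_flowing": False},
        {"water_flowing": True,  "catalyst_flowing": True},
        {"water_flowing": False, "catalyst_flowing": True},  # violation
    ]
    assert first_violation(trace) == 2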
Process Steps (2)
6. Software requirements review and analysis
Completeness
Simulation and animation
Software hazard analysis
Robustness (environment) analysis
Mode confusion and other human error analyses
Human factors analyses (usability, workload, etc.)
Process Steps (3)
7. Implementation with safety in mind
Defensive programming
Assertions and run-time checking (see the sketch below)
Separation of critical functions
Elimination of unnecessary functions
Exception handling, etc.
8. Off-nominal and safety testing
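A hedged sketch of the kind of assertion and run-time checking step 7 calls for (the descent-rate limit and function name are invented for illustration):

    SAFE_DESCENT_RATE_MPS = 2.0   # assumed safety limit, for illustration only

    def command_descent_rate(requested_rate_mps):
        # Defensive programming: reject malformed input outright.
        if not isinstance(requested_rate_mps, (int, float)):
            raise TypeError("descent rate must be numeric")
        # Run-time safety check: never pass an out-of-envelope command
        # downstream; clamp to the safe limit and report instead.
        if requested_rate_mps > SAFE_DESCENT_RATE_MPS:
            print(f"WARNING: {requested_rate_mps} m/s exceeds limit; clamping")
            return SAFE_DESCENT_RATE_MPS
        return requested_rate_mps

    assert command_descent_rate(5.0) == SAFE_DESCENT_RATE_MPS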
Process Steps (4)
9. Operational Analysis and Auditing
Change analysis
Incident and accident analysis
Performance monitoring
Periodic audits
A Human−Centered, Safety−Driven Design Process
[Figure: parallel System Engineering, Human Factors, and System Safety activities through design, development, and operations.
Human Factors: Preliminary Task Analysis; Operator Goals and Responsibilities; Task Allocation Principles; Operator Task and Training Requirements; Operator Task Analysis; Simulation/Experiments; Usability Analysis; Other Human Factors Evaluation (workload, situation awareness, etc.)
System Safety: Preliminary Hazard Analysis; Hazard List; Fault Tree Analysis; Safety Requirements and Constraints; System Hazard Analysis; Completeness/Consistency Analysis; State Machine Hazard Analysis; Deviation Analysis (FMECA); Mode Confusion Analysis; Human Error Analysis; Timing and other analyses; Safety Verification; Safety Testing; Software FTA
Shared: Simulation and Animation; Operational Analysis; Performance Monitoring; Change Analysis; Incident and accident analysis; Periodic audits]
Preliminary Hazard Analysis
1. Identify system hazards
2. Translate system hazards into high−level
system safety design constraints.
3. Assess hazards if required to do so.
4. Establish the hazard log.
System Hazards for Automated Train Doors
Train starts with door open.
Door opens while train is in motion.
Door opens while improperly aligned with station platform.
Door closes while someone is in doorway.
Door that closes on an obstruction does not reopen or reopened
door does not reclose.
Doors cannot be opened for emergency evacuation.
System Hazards for Air Traffic Control
Controlled aircraft violate minimum separation standards (NMAC).
Airborne controlled aircraft enters an unsafe atmospheric region.
Controlled airborne aircraft enters restricted airspace without
authorization.
Controlled airborne aircraft gets too close to a fixed obstacle
other than a safe point of touchdown on assigned runway (CFIT).
Controlled airborne aircraft and an intruder in controlled airspace
violate minimum separation.
Controlled aircraft operates outside its performance envelope.
Aircraft on ground comes too close to moving objects or collides
with stationary objects or leaves the paved area.
Aircraft enters a runway for which it does not have clearance.
Controlled aircraft executes an extreme maneuver within its
performance envelope.
Loss of aircraft control.
Exercise: Identify the system hazards for this cruise−control system
The cruise control system operates only when the engine is running.
When the driver turns the system on, the speed at which the car is
traveling at that instant is maintained. The system monitors the car’s
speed by sensing the rate at which the wheels are turning, and it
maintains desired speed by controlling the throttle position. After the
system has been turned on, the driver may tell it to start increasing
speed, wait a period of time, and then tell it to stop increasing speed.
Throughout the time period, the system will increase the speed at a
fixed rate, and then will maintain the final speed reached.
The driver may turn off the system at any time. The system will turn
off if it senses that the accelerator has been depressed far enough to
override the throttle control. If the system is on and senses that the
brake has been depressed, it will cease maintaining speed but will not
turn off. The driver may tell the system to resume speed, whereupon
it will return to the speed it was maintaining before braking and resume
maintenance of that speed.
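One way to start the exercise is to restate the description as an explicit state machine; writing the states and events down makes unhandled combinations (potential hazards) easier to spot. A Python sketch derived only from the description above:

    OFF, MAINTAINING, ACCELERATING, SUSPENDED = "off", "maintaining", "accelerating", "suspended"

    class CruiseControl:
        def __init__(self):
            self.state = OFF
            self.target_speed = None

        def event(self, name, current_speed=None):
            if name == "turn_on" and self.state == OFF:
                self.state, self.target_speed = MAINTAINING, current_speed
            elif name == "start_increase" and self.state == MAINTAINING:
                self.state = ACCELERATING        # speed increases at a fixed rate
            elif name == "stop_increase" and self.state == ACCELERATING:
                self.state, self.target_speed = MAINTAINING, current_speed
            elif name == "brake" and self.state in (MAINTAINING, ACCELERATING):
                self.state = SUSPENDED           # stops maintaining but stays on
            elif name == "resume" and self.state == SUSPENDED:
                self.state = MAINTAINING         # back to the pre-braking speed
            elif name in ("turn_off", "accelerator_override", "engine_stops"):
                self.state, self.target_speed = OFF, None
            # Any (state, event) pair not listed above is unspecified --
            # each such gap is a question for the hazard analysis.

    cc = CruiseControl()
    cc.event("turn_on", current_speed=60)
    cc.event("brake")
    cc.event("resume")
    assert cc.state == MAINTAINING and cc.target_speed == 60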
Hazards must be translated into design constraints.
HAZARD: Train starts with door open.
DESIGN CRITERION: Train must not be capable of moving with any door open.

HAZARD: Door opens while train is in motion.
DESIGN CRITERION: Doors must remain closed while train is in motion.

HAZARD: Door opens while improperly aligned with station platform.
DESIGN CRITERION: Door must be capable of opening only after train is stopped and properly aligned with platform unless emergency exists (see below).

HAZARD: Door closes while someone is in doorway.
DESIGN CRITERION: Door areas must be clear before door closing begins.

HAZARD: Door that closes on an obstruction does not reopen or reopened door does not reclose.
DESIGN CRITERION: An obstructed door must reopen to permit removal of obstruction and then automatically reclose.

HAZARD: Doors cannot be opened for emergency evacuation.
DESIGN CRITERION: Means must be provided to open doors anywhere when the train is stopped for emergency evacuation.
Example PHA for ATC Approach Control
HAZARDS
1. A pair of controlled aircraft
violate minimum separation
standards.
REQUIREMENTS/CONSTRAINTS
1a. ATC shall provide advisories that
maintain safe separation between
aircraft.
1b. ATC shall provide conflict alerts.
2. A controlled aircraft enters an
unsafe atmospheric region.
(icing conditions, windshear
areas, thunderstorm cells)
2a. ATC must not issue advisories that
direct aircraft into areas with unsafe
atmospheric conditions.
2b. ATC shall provide weather advisories
and alerts to flight crews.
2c. ATC shall warn aircraft that enter an
unsafe atmospheric region.
Example PHA for ATC Approach Control (2)
HAZARDS
REQUIREMENTS/CONSTRAINTS
3. A controlled aircraft enters
restricted airspace without
authorization.
3a. ATC must not issue advisories that
direct an aircraft into restricted airspace
unless avoiding a greater hazard.
3b. ATC shall provide timely warnings to
aircraft to prevent their incursion into
restricted airspace.
4. A controlled aircraft gets too close to a fixed obstacle or terrain other than a safe point of touchdown on assigned runway.
4. ATC shall provide advisories that maintain safe separation between aircraft and terrain or physical obstacles.
5. A controlled aircraft and an intruder in controlled airspace violate minimum separation standards.
5. ATC shall provide alerts and advisories to avoid intruders if at all possible.
6. Loss of controlled flight or loss of airframe integrity.
6a. ATC must not issue advisories outside the safe performance envelope of the aircraft.
6b. ATC advisories must not distract or disrupt the crew from maintaining safety of flight.
6c. ATC must not issue advisories that the pilot or aircraft cannot fly or that degrade the continued safe flight of the aircraft.
6d. ATC must not provide advisories that cause an aircraft to fall below the standard glidepath or intersect it at the wrong place.
Classic Hazard Level Matrix
                       SEVERITY
               I             II          III         IV
               Catastrophic  Critical    Marginal    Negligible
LIKELIHOOD
A Frequent     I-A           II-A        III-A       IV-A
B Moderate     I-B           II-B        III-B       IV-B
C Occasional   I-C           II-C        III-C       IV-C
D Remote       I-D           II-D        III-D       IV-D
E Unlikely     I-E           II-E        III-E       IV-E
F Impossible   I-F           II-F        III-F       IV-F
Another Example Hazard Level Matrix
[Figure: matrix assigning numeric hazard risk indices (1-12) to combinations of severity (I Catastrophic, II Critical, III Marginal, IV Negligible) and likelihood (A Frequent, B Probable, C Occasional, D Remote, E Improbable, F Impossible); the individual cell values are not recoverable from the extraction.]
Hazard Causal Analysis
Used to refine the high−level safety constraints into more
detailed constraints.
Requires some type of model (even if only in head of analyst)
Almost always involves some type of search through the
system design (model) for states or conditions that could lead
to system hazards.
Top−down
Bottom−up
Forward
Backward
Forward vs. Backward Search
[Figure: two diagrams mapping initiating events (A, B, C, D) to final states (W: nonhazard, X: HAZARD, Y: nonhazard, Z: nonhazard). A forward search starts from the initiating events and traces forward to discover which final states they can reach; a backward search starts from the HAZARD state and traces back to the initiating events that can lead to it.]
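A small Python sketch of the two search directions over a toy transition relation (the event and state names mirror the figure; the relation itself is invented):

    # Toy transition relation: initiating events -> reachable final states.
    transitions = {"A": ["W"], "B": ["X"], "C": ["Y"], "D": ["X", "Z"]}
    HAZARDS = {"X"}

    def forward_search(initiating_events):
        """Trace forward from each initiating event to its final states."""
        return {e: transitions.get(e, []) for e in initiating_events}

    def backward_search(hazard_states):
        """Trace back from each hazard to the initiating events reaching it."""
        return {h: [e for e, finals in transitions.items() if h in finals]
                for h in hazard_states}

    print(forward_search(["A", "B", "C", "D"]))  # shows B and D can reach hazard X
    print(backward_search(HAZARDS))              # {'X': ['B', 'D']}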
FTA and Software
Appropriate for qualitative analyses, not quantitative ones
System fault trees helpful in identifying potentially hazardous
software behavior.
Can use to refine system design constraints.
FTA can be used to verify code.
Identifies any paths from inputs to hazardous outputs or
provides some assurance they don’t exist.
Not looking for failures but incorrect paths (functions)
Fault Tree Example
[Figure: fault tree (structure approximate, as recoverable from the extraction).
Explosion
  AND
    Pressure too high
    Relief valve 1 does not open
      OR
        Valve failure
        Computer does not open valve 1
          OR
            Sensor failure
            Computer output too late
            Computer does not issue command to open valve 1
    Relief valve 2 does not open
      OR
        Valve failure
        Operator does not know to open valve 2
          AND
            Operator inattentive
            Valve 1 position indicator fails on
            Open indicator light fails on]
Example Fault Tree for ATC Arrival Traffic
[Fault tree diagram; content not recoverable from the extraction.]
Example Fault Tree for ATC Arrival Traffic (2)
[Continuation of the fault tree diagram; content not recoverable from the extraction.]
Requirements Completeness
Most software−related accidents involve software requirements
deficiencies.
Accidents often result from unhandled and unspecified cases.
We have defined a set of criteria to determine whether a
requirements specification is complete.
Derived from accidents and basic engineering principles.
Validated (at JPL) and used on industrial projects.
Completeness: Requirements are sufficient to distinguish
the desired behavior of the software from
that of any other undesired program that
might be designed.
Requirements Completeness Criteria (2)
How were the criteria derived?
Mapped the parts of a control loop to a state machine
[Figure: control loop with its I/O mapped onto a state machine]
Defined completeness for each part of the state machine:
states, inputs, outputs, transitions
Mathematical completeness
Added basic engineering principles (e.g., feedback)
Added what we have learned from accidents
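A sketch of the "mathematical completeness" idea: for a state machine, every (state, input) pair should have a specified behavior, and the unspecified pairs are requirements gaps (states and inputs here are hypothetical):

    states = {"idle", "operating", "fault"}
    inputs = {"start", "stop", "sensor_fail"}
    transitions = {
        ("idle", "start"): "operating",
        ("operating", "stop"): "idle",
        ("operating", "sensor_fail"): "fault",
        # ("idle", "sensor_fail") and others intentionally missing
    }

    def unspecified_cases(states, inputs, transitions):
        """All (state, input) pairs with no defined transition."""
        return [(s, i) for s in states for i in inputs
                if (s, i) not in transitions]

    for case in sorted(unspecified_cases(states, inputs, transitions)):
        print("incomplete:", case)   # each gap is an unspecified requirement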
Requirements Completeness Criteria (3)
About 60 criteria in all, including human-computer interaction.
(Won't go through them all; they are in the book.)
Startup, shutdown
Mode transitions
Inputs and outputs
Value and timing
Load and capacity
Environment capacity
Failure states and transitions
Human−computer interface
Robustness
Data age
Latency
Feedback
Reversibility
Preemption
Path Robustness
Most are integrated into the SpecTRM-RL language design, or simple
tools can check them.
Requirements Analysis
Model Execution, Animation, and Visualization
Completeness
State Machine Hazard Analysis (backwards reachability)
Software Deviation Analysis
Human Error Analysis
Test Coverage Analysis and Test Case Generation
Automatic code generation?
Model Execution and Animation
SpecTRM−RL models are executable.
Model execution is animated
Results of execution could be input into a graphical
visualization
Inputs can come from another model or simulator and
output can go into another model or simulator.
Design for Safety
Software design must enforce safety constraints
Should be able to trace from requirements to code (and vice versa)
Design should incorporate basic safety design principles
Safe Design Precedence
HAZARD ELIMINATION
Substitution
Simplification
Decoupling
Elimination of human errors
Reduction of hazardous materials or conditions
HAZARD REDUCTION
Design for controllability
Barriers
Lockins, Lockouts, Interlocks
Failure Minimization
Safety Factors and Margins
Redundancy
HAZARD CONTROL
Reducing exposure
Isolation and containment
Protection systems and fail−safe design
DAMAGE REDUCTION
Decreasing cost
Increasing effectiveness
STAMP
(Systems Theory Accident Modeling and Processes)
STAMP is a new theoretical underpinning for developing
more effective hazard analysis techniques for complex systems.
Accident models provide the basis for
Investigating and analyzing accidents
Preventing accidents
Hazard analysis
Design for safety
Assessing risk (determining whether systems are
suitable for use)
Performance modeling and defining safety metrics
Chain−of−Events Models
Explain accidents in terms of multiple events, sequenced
as a forward chain over time.
Events almost always involve a component failure, human
error, or an energy-related event
Form the basis of most safety−engineering and reliability
engineering analysis:
e.g., Fault Tree Analysis, Probabilistic Risk Assessment,
FMEA, Event Trees
and design:
e.g., redundancy, overdesign, safety margins, ...
Chain−of−Events Example
[Figure: chain-of-events example — moisture (AND) leads to corrosion, corrosion to weakened metal; weakened metal together with operating pressure (OR) leads to tank rupture; the rupture projects fragments, damaging equipment and injuring personnel.]
Chain−of−Events Example: Bhopal
E1: Worker washes pipes without inserting slip blind
E2: Water leaks into MIC tank
E3: Explosion occurs
E4: Relief valve opens
E5: MIC vented into air
E6: Wind carries MIC into populated area around plant
Limitations of Event Chain Models:
Social and organizational factors in accidents
Models need to include the social system as well as the
technology and its underlying science.
System accidents
Software error
Limitations of Event Chain Models (2)
Human error
Deviation from normative procedure vs. established practice
Cannot effectively model human behavior by decomposing
it into individual decisions and actions and studying it
in isolation from:
the physical and social context
the value system in which it takes place
the dynamic work process
Adaptation
Major accidents involve systematic migration of organizational
behavior under pressure toward cost effectiveness in an
aggressive, competitive environment.
[Figure: the Zeebrugge ferry capsizing as interacting decisions across departments.
Vessel design: design, stability analysis, shipyard — impaired stability
Cargo management: truck companies, excess load routines — equipment load added
Passenger management: excess numbers
Harbor design: berth design (Calais, Zeebrugge) — change of docking procedure
Traffic scheduling: operations management — transfer of Herald to Zeebrugge, time pressure
Vessel operation: operations management, standing orders, docking procedure, unsafe heuristics, captain's planning, crew working patterns
→ Capsizing
Operational decision making: decision makers from separate departments in an operational context very likely will not see the forest for the trees.
Accident analysis: the combinatorial structure of possible accidents can easily be identified.]
Ways to Cope with Complexity
Analytic Reduction (Descartes)
Divide system into distinct parts for analysis purposes.
Examine the parts separately.
Three important assumptions:
1. The division into parts will not distort the
phenomenon being studied.
2. Components are the same when examined singly
as when playing their part in the whole.
3. Principles governing the assembling of the components
into the whole are themselves straightforward.
Ways to Cope with Complexity (con’t.)
Statistics
Treat as a structureless mass with interchangeable parts.
Use Law of Large Numbers to describe behavior in
terms of averages.
Assumes components sufficiently regular and random
in their behavior that they can be studied statistically.
What about software?
Too complex for complete analysis:
Separation into non−interacting subsystems distorts
the results.
The most important properties are emergent.
Too organized for statistics
Too much underlying structure that distorts
the statistics.
Systems Theory
Developed for biology (Bertalanffy) and cybernetics (Norbert Wiener)
For systems too complex for complete analysis
Separation into non−interacting subsystems distorts results
Most important properties are emergent.
and too organized for statistical analysis
Concentrates on analysis and design of whole as distinct from parts
(basis of system engineering)
Some properties can only be treated adequately in their entirety,
taking into account all social and technical aspects.
These properties derive from relationships between the parts of
systems −− how they interact and fit together.
Systems Theory (2)
Two pairs of ideas:
1. Emergence and hierarchy
Levels of organization, each more complex than one below.
Levels characterized by emergent properties
Irreducible
Represent constraints upon the degree of freedom of
components at a lower level.
Safety is an emergent system property
It is NOT a component property.
It can only be analyzed in the context of the whole.
Systems Theory (3)
2. Communication and control
Hierarchies characterized by control processes working at
the interfaces between levels.
A control action imposes constraints upon the activity
at one level of a hierarchy.
Open systems are viewed as interrelated components kept
in a state of dynamic equilibrium by feedback loops of
information and control.
Control in open systems implies need for communication
A Systems Theory Model of Accidents
Safety is an emergent system property.
Accidents arise from interactions among
People
Societal and organizational structures
Engineering activities
Physical system components
that violate the constraints on safe component
behavior and interactions.
Not simply chains of events or linear causality,
but more complex types of causal connections.
Need to include the entire socio−technical system
STAMP
(Systems−Theoretic Accident Model and Processes)
Based on systems and control theory
Systems not treated as a static design
A socio−technical system is a dynamic process
continually adapting to achieve its ends and to
react to changes in itself and its environment
Preventing accidents requires designing a control
structure to enforce constraints on system behavior
and adaptation.
STAMP (2)
Views accidents as a control problem
e.g., O−ring did not control propellant gas release by
sealing gap in field joint
Software did not adequately control descent speed of
Mars Polar Lander.
Events are the result of inadequate control
and result from lack of enforcement of safety constraints.
To understand accidents, we need to examine the control structure
itself to determine why it was inadequate to maintain the safety
constraints and why the events occurred.
[Figure: general socio-technical safety control structure, one column for SYSTEM DEVELOPMENT and one for SYSTEM OPERATIONS.
Both columns, top down: Congress and Legislatures → Government Regulatory Agencies, Industry Associations, User Associations, Unions, Insurance Companies, Courts → Company Management → Project Management (development) or Operations Management (operations).
Downward channels: legislation; regulations, standards, certification, legal penalties, case law; safety policy, standards, resources; safety standards, work instructions, operating procedures.
Upward channels: government reports, lobbying, hearings and open meetings, accidents; certification info, change reports, whistleblowers, accident and incident reports, operations reports, maintenance reports; status reports, risk assessments, incident reports; hazard analyses, progress reports, problem reports, audit reports, work logs, inspections.
Development column: Project Management → Design and Documentation → Implementation and assurance → Manufacturing Management → Manufacturing, exchanging safety constraints, standards, test requirements, hazard analyses, test reports, review results, design rationale, and documentation; Maintenance and Evolution sends software revisions and hardware replacements to operations.
Operations column: Operations Management → Operating Process, comprising Human Controller(s), Automated Controller, Actuator(s), Physical Process, and Sensor(s), exchanging operating assumptions, operating procedures, revised operating procedures, work procedures, safety reports, audits, problem reports, incidents, change requests, and performance audits.]
Note:
Does not imply need for a "controller"
Component failures may be controlled through design
e.g., redundancy, interlocks, fail−safe design
or through process
manufacturing processes and procedures
maintenance procedures
But does imply the need to enforce the safety constraints
in some way.
The new model includes what we do now, and more.
Accidents occur when:
Design does not enforce safety constraints
unhandled disturbances, failures, dysfunctional interactions
Inadequate control actions
Control structure degrades over time, asynchronous evolution
Control actions inadequately coordinated among multiple
controllers.
Boundary areas
[Diagram: Controller 1 → Process 1 and Controller 2 → Process 2, with a gap at the boundary between the two processes]
Overlap areas (side effects of decisions and control actions)
[Diagram: Controller 1 and Controller 2 both controlling the same Process]
Process Models
[Figure: control loop. A Human Supervisor (Controller), holding a model of the process and a model of the automation, uses controls and displays to supervise an Automated Controller, which holds its own model of the process and model of the interfaces. The automated controller drives the Controlled Process through actuators (controlled variables) and observes it through sensors (measured variables); the process also has process inputs, process outputs, and disturbances.]
Process models must contain:
Required relationship among process variables
Current state (values of process variables)
The ways the process can change state
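A minimal Python sketch of these three elements, using an invented tank-level process: the variable relationships, the current state, and the allowed state changes a controller's model must capture:

    class ProcessModel:
        """A controller's internal model of the controlled process."""
        FILL_RATE = 1.0   # required relationship: level rises only while filling

        def __init__(self):
            # Current state: values of the process variables.
            self.level = 0.0
            self.inflow_open = False

        def step(self, dt):
            # The ways the process can change state: the update rule the
            # controller uses to predict the real process.
            if self.inflow_open:
                self.level += self.FILL_RATE * dt

    model = ProcessModel()
    model.inflow_open = True
    model.step(dt=2.0)
    assert model.level == 2.0   # the model's prediction, not the real process

The next slide's point follows directly: if the real tank drains or the valve sticks, the model and the process diverge, and commands based on the model can be unsafe.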
Relationship between Safety and Process Model
Accidents occur when the models do not match the process and
incorrect control commands are given (or correct ones not given)
How do they become inconsistent?
Wrong from the beginning,
e.g., uncontrolled disturbances,
unhandled process states,
inadvertently commanding the system into a hazardous state,
unhandled or incorrectly handled system component failures
[Note: these are related to what we called system accidents]
Missing or incorrect feedback and not updated correctly
Time lags not accounted for
Explains most software−related accidents
Safety and Human Mental Models
Explains developer errors
May have incorrect model of required system or software behavior
development process
physical laws
etc.
Also explains most human/computer interaction problems
Pilots and others not understanding the automation:
What did it just do?
Why did it do that?
What will it do next?
How did it get us into this state?
How do I get it to do what I want?
Why won’t it let us do that?
What caused the failure?
What can we do so it does not
happen again?
Or don’t get feedback to update mental models or disbelieve it
Validating and Using the Model
Can it explain (model) accidents that have already occurred?
Is it useful?
In accident and mishap investigation
In preventing accidents
Hazard analysis
Designing for safety
Is it better for these purposes than the chain−of−events model?
Using STAMP in Accident and
Mishap Investigation and
Root Cause Analysis
Modeling Accidents Using STAMP
Three types of models are needed:
1. Static safety control structure
Safety requirements and constraints
Flawed control actions
Context (social, political, etc.)
Mental model flaws
Coordination flaws
2. Dynamic structure
Shows how the safety control structure changed over time
3. Behavioral dynamics
Dynamic processes behind the changes, i.e., why the
system changes
ARIANE 5 LAUNCHER
[Figure: control structure. The OBC executes the flight program and controls the nozzles of the solid boosters and the Vulcain cryogenic engine (A: nozzle command to the booster nozzles; B: main engine command to the main engine nozzle). The SRI measures the attitude of the launcher and its movements in space and sends diagnostic and flight information to the OBC (D). The strapdown inertial platform supplies horizontal velocity (C) to the SRI and to the backup SRI, which measures the same quantities and takes over if the SRI is unable to send guidance info.]
Ariane 5: A rapid change in attitude and high aerodynamic loads stemming from a
high angle of attack create aerodynamic forces that cause the launcher
to disintegrate at 39 seconds after command for main engine ignition (H0).
Nozzles: Full nozzle deflections of solid boosters and main engine lead to angle
of attack of more than 20 degrees.
Self−Destruct System: Triggered (as designed) by boosters separating from main
stage at altitude of 4 km and 1 km from launch pad.
OBC (On−Board Computer)
OBC Safety Constraint Violated: Commands from the OBC to the nozzles must not
result in the launcher operating outside its safe envelope.
Unsafe Behavior: Control command sent to booster nozzles and later to main engine
nozzle to make a large correction for an attitude deviation that had not occurred.
Process Model: Model of the current launch attitude is incorrect, i.e., it contains
an attitude deviation that had not occurred. Results in incorrect commands
being sent to nozzles.
Feedback: Diagnostic information received from SRI
Interface Model: Incomplete or incorrect (not enough information in accident report
to determine which) − does not include the diagnostic information from the
SRI that is available on the databus.
Control Algorithm Flaw: Interprets diagnostic information from SRI as flight data and
uses it for flight control calculations. With both SRI and backup SRI shut down
and therefore no possibility of getting correct guidance and attitude information,
loss was inevitable.
[Figure repeated: Ariane 5 launcher control structure, as on the previous slide.]
SRI (Inertial Reference System):
SRI Safety Constraint Violated: The SRI must continue to send guidance
information as long as it can get the necessary information from the strapdown
inertial platform.
Unsafe Behavior: At 36.75 seconds after H0, SRI detects an internal error and
turns itself off (as it was designed to do) after putting diagnostic information on
the bus (D).
Control Algorithm: Calculates the Horizontal Bias (an internal alignment variable
used as an indicator of alignment precision over time) using the horizontal
velocity input from the strapdown inertial platform (C). Conversion from a 64−bit
floating point value to a 16−bit signed integer leads to an unhandled overflow
exception while calculating the horizontal bias. Algorithm reused from Ariane 4
where horizontal bias variable does not get large enough to cause an overflow.
Process Model: Does not match Ariane 5 (based on Ariane 4 trajectory data);
Assumes smaller horizontal velocity values than possible on Ariane 5.
Backup SRI (Inertial Reference System):
SRI Safety Constraint Violated: The backup SRI must continue to send guidance
information as long as it can get the necessary information from the strapdown
inertial platform.
Unsafe Behavior: At 36.75 seconds after H0, backup SRI detects an internal error
and turns itself off (as it was designed to do).
Control Algorithm: Calculates the Horizontal Bias (an internal alignment variable
used as an indicator of alignment precision over time) using the horizontal
velocity input from the strapdown inertial platform (C). Conversion from a 64−bit
floating point value to a 16−bit signed integer leads to an unhandled overflow
exception while calculating the horizontal bias. Algorithm reused from Ariane 4
where horizontal bias variable does not get large enough to cause an overflow.
Because the algorithm was the same in both SRI computers, the overflow
results in the same behavior, i.e., shutting itself off.
Process Model: Does not match Ariane 5 (based on Ariane 4 trajectory data);
Assumes smaller horizontal velocity values than possible on Ariane 5.
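A Python sketch of the failure mode described above: a 64-bit floating-point value converted to a 16-bit signed integer overflows once the value leaves the representable range. (The original code was Ada; the saturating variant is one illustrative alternative, not the actual fix.)

    INT16_MIN, INT16_MAX = -32768, 32767

    def to_int16_unchecked(x):
        value = int(x)
        if not (INT16_MIN <= value <= INT16_MAX):
            # Analogous to the unhandled Operand Error that shut down both SRIs.
            raise OverflowError("value does not fit in a 16-bit signed integer")
        return value

    def to_int16_saturating(x):
        # Saturate instead of raising, so one out-of-range value cannot
        # take down the guidance function.
        return max(INT16_MIN, min(INT16_MAX, int(x)))

    horizontal_bias = 40000.0   # plausible on an Ariane 5 trajectory, not Ariane 4
    try:
        to_int16_unchecked(horizontal_bias)
    except OverflowError as exc:
        print("unhandled case:", exc)
    print(to_int16_saturating(horizontal_bias))  # 32767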
Titan 4/Centaur/Milstar
[Figure: development and operations control structure.
DEVELOPMENT: Space and Missile Systems Center Launch Directorate (SMC); Defense Contract Management Command (responsible for administration of the LMA contract: contract administration, software surveillance, oversee the process); Prime Contractor (LMA), responsible for design and construction of the flight control system: LMA System Engineering, LMA Quality Assurance; software design and development (LMA Flight Control Software, Honeywell IMS software); IV&V by Analex (Analex-Cleveland: verify design; Analex Denver: IV&V of flight software); Aerospace: monitor software development and test; LMA FAST Lab: system test of INU.
OPERATIONS: Third Space Launch Squadron (3SLS), responsible for ground operations management; Ground Operations (CCAS).]
System Hazard:
System Safety Constraints:
[Slides largely not recoverable from the extraction: the system hazard and safety constraints for the example, and a safety control structure for a municipal water system. Recoverable fragments: "Design flaw: Shallow location", "Design flaw: No chlorinator", "Porous bedrock", "Minimal overburden", "Heavy rains".]
Dynamic Structure
[Figure: the water-system safety control structure redrawn to show how it changed over time; largely not recoverable from the extraction. Recoverable fragments repeat the design flaws above: "Design flaw: No chlorinator", "Design flaw: Shallow location", "Porous bedrock", "Minimal overburden", "Heavy rains".]
Modeling Behavioral Dynamics
[Figure: behavioral dynamics model; content not recoverable from the extraction.]
A (Partial) System Dynamics Model
of the Columbia Accident
[Figure: feedback structure relating budget cuts; budget cuts directed toward safety; rate of safety increase; system safety efforts; priority of safety programs; complacency; rate of increase in complacency; external pressure; performance pressure; expectations; perceived safety; risk; launch rate; success; success rate; accident rate.]
Steps in a STAMP analysis:
1. Identify
System hazards
System safety constraints and requirements
Control structure in place to enforce constraints
2. Model dynamic aspects of accident:
Changes to static safety control structure over time
Dynamic processes in effect that led to changes
3. Create the overall explanation for the accident
Inadequate control actions and decisions
Context in which decisions made
Mental model flaws
Control flaws (e.g., missing feedback loops)
Coordination flaws
STAMP vs. Traditional Accident Models
Examines interrelationships rather than linear cause−effect chains
Looks at the processes behind the events
Includes entire socio−economic system
Includes behavioral dynamics (changes over time)
Want not just to react to accidents and impose controls
for a while, but to understand why controls drift toward
ineffectiveness over time, and to:
Change those factors if possible
Detect the drift before accidents occur
Using STAMP to Prevent Accidents
Hazard Analysis
Safety Metrics and Performance Auditing
Risk Assessment
STAMP−Based Hazard Analysis (STPA)
Provides information about how safety constraints could be
violated.
Used to eliminate, reduce, and control hazards in system
design, development, manufacturing, and operations
Assists in designing safety into system from the beginning
Not just after−the−fact analysis
Includes software, operators, system accidents, management,
regulatory authorities
Can use a concrete model of control (SpecTRM−RL) that is
executable and analyzable
STPA - Step 1: Identify hazards and translate into high-level
requirements and constraints on behavior
TCAS Hazards
1. A near mid−air collision (NMAC)
(a pair of controlled aircraft violate minimum separation
standards)
2. A controlled maneuver into the ground
3. Loss of control of aircraft
4. Interference with other safety−related aircraft systems
5. Interference with ground−based ATC system
6. Interference with ATC safety−related advisory
STPA − Step 2: Define basic control structure
[Figure: TCAS control structure. The FAA sits above airline operations management and local ATC operations management. On each aircraft, TCAS presents displays and aural alerts to the pilot, using own and other aircraft information and an operating mode, and issues advisories; the pilot controls the aircraft. The air traffic controller, supported by radar and a flight data processor, communicates advisories to the pilots by radio.]
STPA − Step 3: Identify potential inadequate control actions that
could lead to hazardous process state
In general:
1. A required control action is not provided
2. An incorrect or unsafe control action is provided.
3. A potentially correct or inadequate control action is
provided too late (at the wrong time)
4. A correct control action is stopped too soon
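A tiny Python sketch of how these four flaw types can be applied mechanically to each control action to generate the candidate scenarios reviewed in the next slide (the control-action names are illustrative):

    FLAW_TYPES = [
        "not provided",
        "incorrect or unsafe action provided",
        "provided too late (at the wrong time)",
        "stopped too soon",
    ]
    control_actions = [
        "TCAS issues resolution advisory (RA)",
        "Pilot executes RA maneuver",
    ]
    # Cross the actions with the flaw types; each pair is a candidate
    # inadequate control action to assess against the NMAC hazard.
    for action in control_actions:
        for flaw in FLAW_TYPES:
            print(f"Candidate: {action} -- {flaw}")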
For the NMAC hazard:
TCAS:
1. The aircraft are on a near collision course and TCAS does
not provide an RA
2. The aircraft are in close proximity and TCAS provides an
RA that degrades vertical separation
3. The aircraft are on a near collision course and TCAS provides
an RA too late to avoid an NMAC
4. TCAS removes an RA too soon.
Pilot:
1. The pilot does not follow the resolution advisory provided
by TCAS (does not respond to the RA)
2. The pilot incorrectly executes the TCAS resolution advisory.
3. The pilot applies the RA but too late to avoid the NMAC
4. The pilot stops the RA maneuver too soon.
STPA − Step 4: Determine how potentially hazardous control
actions could occur.
In general: eliminate from the design, or control or mitigate
in the design or in operations.
Can use a concrete model in SpecTRM-RL:
Assists with communication and completeness of analysis
Provides a continuous simulation and analysis environment
to evaluate the impact of faults and the effectiveness of
mitigation features.
Step 4a: Augment control structure with process models for each
control component
Step 4b: For each of inadequate control actions, examine parts of
control loop to see if could cause it.
Guided by set of generic control loop flaws
Where human or organization involved must evaluate:
Context in which decisions made
Behavior−shaping mechanisms (influences)
Step 4c: Consider how designed controls could degrade over time
[Figure: SpecTRM-RL process models for TCAS — an Own Aircraft Model and an Other Aircraft (1..30) Model.
Inputs from own aircraft: radio altitude, radio altitude status, aircraft altitude limit, config climb inhibit, barometric altitude, barometric altimeter status, own Mode S address, air status, altitude climb inhibit, altitude rate, increase climb inhibit discrete, prox traffic display, traffic display permitted.
Own aircraft state variables include: Status {On (system start, fault detected)}; Climb Inhibit, Descent Inhibit, Increase Climb Inhibit, Increase Descent Inhibit {Inhibited, Not Inhibited, Unknown}; Sensitivity Level {1..7}; Current RA Sense {None, Climb, Descend, Unknown}; Current RA Level; Altitude Layer {Layer 1..4, Unknown}; air status {On ground, Airborne, Unknown}.
Other aircraft state variables include: Altitude Reporting {Yes, No, Lost, Unknown}; Classification {Other Traffic, Proximate Traffic, Potential Threat, Threat, Unknown}; RA Sense {None, Climb, Descend, Unknown}; RA Strength {Not Selected, VSL 0, VSL 500, VSL 1000, VSL 2000, Nominal 1500, Increase 2500, None, Unknown}; Crossing {Non-Crossing, Int-Crossing, Own-Cross, Unknown}; Reversal {Reversed, Not Reversed, Unknown}; other bearing (and validity), other altitude (and validity), range, Mode S address, sensitivity level, equippage.
Inputs from other aircraft feed the Other Aircraft Model.]
c
Sensors
Human Supervisor
(Controller)
Model of
Process
Model of
Automation
Measured
variables
Ú
Û
Ü
Õ
Û
Ö
Ý
Þ
×
ß
Ø
à
á
Ù
Process
inputs
Automated Controller
Controls
Displays
Model of
Process
Model of
Interfaces
Controlled
Process
Disturbances
Process
outputs
Actuators
Controlled
variables
STPA − Step 4b: Examine control loop for potential to cause
inadequate control actions
Inadequate Control Actions (enforcement of constraints)
Design of control algorithm (process) does not enforce constraints
Process models inconsistent, incomplete, or incorrect (lack of linkup)
Flaw(s) in creation or updating process
Inadequate or missing feedback
Not provided in system design
Communication flaw
Inadequate sensor operation (incorrect or no information provided)
Time lags and measurement inaccuracies not accounted for
Inadequate coordination among controllers and decision−makers
(boundary and overlap areas)
Inadequate Execution of Control Action
Communication flaw
Inadequate "actuator" operation
Time lag
STPA - Step 4c: Consider how designed controls could degrade
over time.
E.g., specified procedures ==> effective procedures
Use system dynamics models?
Use information to design protection against changes:
e.g., operational procedures
controls over changes and maintenance activities
auditing procedures and performance metrics
management feedback channels to detect unsafe changes
Comparisons with Traditional HA Techniques
Top−down (vs. bottom−up like FMECA)
Considers more than just component failures and failure events
Guidance in doing analysis (vs. FTA)
Handles dysfunctional interactions, software, management, etc.
Concrete model (not just in head)
Not physical structure (HAZOP) but control (functional) structure
General model of inadequate control
HAZOP guidewords based on model of accidents being
caused by deviations in system variables
Includes HAZOP model but more general
Compared with TCAS II Fault Tree (MITRE)
STPA results more comprehensive