
Safety in large technology systems
October 1999
Technology failure
Why do large, complex systems sometimes fail so spectacularly? Do the easy explanations of “operator error,” “faulty technology,” or “complexity” suffice? Are there managerial causes of technology failure? Are there design principles and engineering protocols that can enhance large system safety? What is the role of software in safety and failure?
Decision-making about complex technology systems
• the scientific basis of technology systems
• How do managers make intelligent decisions about complex technologies?
• managers, scientists, citizens
• example: Star Wars anti-missile systems
• Note: the technical specialist often does not make the decision, so the persuasive power of good scientific communication is critical.
Goal for technology management
• A central problem for designers, policy makers, and citizens, then, is how to avoid large-scale failures when possible, through appropriate design, and how to plan for minimizing the consequences of those failures which will inevitably occur.
Information and decision-making
• Information flow and management of complex technology systems
• complex organizations pursue multiple objectives simultaneously
• complex organizations pursue the same objective along different and conflicting paths
Surprising failures
• Franco-Prussian War; Israeli intelligence failure in the Yom Kippur War
• The Mercedes A-Class sedan and the moose test
• Chernobyl nuclear power meltdown
Therac-25
• high energies
• computer control rather than electromechanical control
• positioning the turntable: x-ray beam flattener
• 15,000 rad administered rather than 200 rad
Technology failure
• sources of failure
  - management failures
  - design failures
  - proliferating random failures
  - “storming” the system
• design for “soft landings”
• crisis management
Causes of failure
• complexity and multiple causal pathways and relations
• defective procedures
• defective training systems
• “human” error
• faulty design
Causes of failure
• “The causes of accidents are frequently, if not almost always, rooted in the organization--its culture, management, and structure. These factors are all critical to the eventual safety of the engineered system” (Leveson, 47).
Varieties of failure
• routine failures, stochastic failures, design failures, systemic failures, interactive failures, “horseshoe nail” failures
• vulnerability of modern technologies to software failure -- Euro conversion, Year 2000 bug, air traffic control failures
Sources of potential failure
• hardware interlocks replaced with software checks on turntable position (illustrated in the sketch below)
• cryptic malfunction codes; frequent messages
• excessive operator confidence in safety systems
• lack of effective mechanism for reporting and investigating failures
• poor software engineering practices
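
The first point can be made concrete with a short sketch. The class, names, and positions below are illustrative assumptions, not actual Therac-25 code; the point is only that the software's record of a commanded position is not the same as an independent check that the hardware actually reached it.

```python
# Illustrative sketch (not actual Therac-25 code): contrast a software-only
# interlock with one that also consults an independent position sensor.

TARGET_XRAY = "xray_flattener_in_place"

class Turntable:
    def __init__(self):
        self.commanded_position = None   # what the software asked for
        self.sensor_position = None      # what an independent sensor reports

    def command(self, position):
        self.commanded_position = position
        # In the real machine the table now has to move; the motion can fail.

def fire_beam_software_only(table):
    # Fragile: trusts the software's own record of what it commanded.
    if table.commanded_position == TARGET_XRAY:
        return "BEAM ON"
    return "BEAM INHIBITED"

def fire_beam_with_independent_check(table):
    # Safer: also requires an independent reading of the actual position,
    # playing the role the hardware interlock played on earlier models.
    if (table.commanded_position == TARGET_XRAY
            and table.sensor_position == TARGET_XRAY):
        return "BEAM ON"
    return "BEAM INHIBITED"

t = Turntable()
t.command(TARGET_XRAY)                      # software believes the flattener is in place
t.sensor_position = "field_light_position"  # but the table never got there
print(fire_beam_software_only(t))           # BEAM ON (hazardous)
print(fire_beam_with_independent_check(t))  # BEAM INHIBITED
```
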
Organizational factors
• “Large-scale engineered systems are more than just a collection of technological artifacts: They are a reflection of the structure, management, procedures, and culture of the engineering organization that created them, and they are also, usually, a reflection of the society in which they were created” (Leveson, 47).
Design for safety
• hazard elimination
• hazard reduction
• hazard control
• damage reduction
Aspects of design
• the technology -- machine, vehicle, software system, airport
• the management structure -- locus of decision-making
• the communications system -- transmission of critical and routine information within the organization
• training of workers for the task -- performance skills, safety procedures
Information and decision-making
• Information flow and management of complex technology systems
• complex organizations pursue multiple objectives simultaneously
• complex organizations pursue the same objective along different and conflicting paths
System safety
• builds safety in, rather than simply adding it on to a completed design
• deals with systems as a whole rather than subsystems or components
• takes a larger view of hazards than just failures
• emphasizes analysis rather than past experience and standards
System safety (2)
• emphasizes qualitative rather than quantitative approaches
• recognizes the importance of tradeoffs and conflicts in system design
• more than just system engineering
Hazard analysis
• development: identify and assess potential hazards
• operations: examine an existing system to improve its safety
• licensing: examine a planned system to demonstrate acceptable safety to a regulatory authority
Hazard analysis (2)
• construct an exhaustive inventory of hazards early in design
• classify by severity and probability (see the classification sketch below)
• construct causal pathways that lead to hazards
• design so as to eliminate, reduce, control, or ameliorate
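
A minimal sketch of the classification step, assuming made-up severity and probability categories and a hypothetical hazard inventory (none of these values come from the presentation or from a specific standard):

```python
# Illustrative hazard classification: rank hazards by severity x probability
# so design effort goes to the worst ones first. All categories, weights,
# and example hazards are assumptions for demonstration only.

SEVERITY = {"catastrophic": 4, "critical": 3, "marginal": 2, "negligible": 1}
PROBABILITY = {"frequent": 4, "probable": 3, "occasional": 2, "remote": 1}

def risk_rank(severity, probability):
    """Combine severity and probability into a single ranking score."""
    return SEVERITY[severity] * PROBABILITY[probability]

# Hypothetical hazard inventory built early in design.
hazards = [
    ("overdose from beam misconfiguration", "catastrophic", "remote"),
    ("spurious alarm distracts operator",   "marginal",     "frequent"),
    ("loss of audit-trail data",            "critical",     "occasional"),
]

# Highest-ranked hazards are candidates for elimination or reduction first.
for name, sev, prob in sorted(hazards, key=lambda h: -risk_rank(h[1], h[2])):
    print(f"{risk_rank(sev, prob):2d}  {sev:<12s} {prob:<10s} {name}")
```
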
Better software design
• design for the worst case
• avoid “single point of failure” designs
• design “defensively” (see the sketch below)
• investigate failures carefully and extensively
• look for “root cause,” not symptom or specific transient cause
• embed audit trails; design for simplicity
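
A minimal sketch of defensive checks combined with an embedded audit trail; the dose limit, file name, and function name are illustrative assumptions:

```python
# Defensive design with an audit trail: reject bad inputs instead of
# trusting them, and log every accepted or rejected command so failures
# can be investigated from evidence rather than reconstructed from memory.
import logging

logging.basicConfig(filename="treatment_audit.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

MAX_DOSE_RAD = 200  # illustrative limit, not a real clinical value

def set_dose(requested_rad):
    # Defensive checks: assume upstream code and operator input can be wrong.
    if not isinstance(requested_rad, (int, float)):
        logging.error("rejected non-numeric dose request %r", requested_rad)
        raise ValueError("dose must be numeric")
    if not (0 < requested_rad <= MAX_DOSE_RAD):
        logging.error("rejected out-of-range dose request %s", requested_rad)
        raise ValueError("dose out of allowed range")
    # Audit trail entry for every accepted command.
    logging.info("dose set to %s rad", requested_rad)
    return requested_rad

set_dose(150)        # accepted and logged
# set_dose(15000)    # would be rejected and logged, not silently applied
```
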
Safe software design
• control software should be designed with maximum simplicity (408)
• design should be testable; limited number of states
• avoid multitasking; use polling rather than interrupts (see the sketch below)
• design should be easily readable and understood
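
The polling and limited-states points can be sketched as a single control loop whose entire state space is written out explicitly; the states, inputs, and timing values are illustrative assumptions:

```python
# A polled control loop with a small, explicit set of states: every state
# and transition is visible in one function, which keeps the design
# testable and its worst-case behavior reviewable.
import time

IDLE, READY, BEAM_ON, FAULT = "IDLE", "READY", "BEAM_ON", "FAULT"

def read_inputs():
    # Placeholder for reading hardware status each polling cycle.
    return {"setup_complete": True, "interlock_closed": True, "operator_fire": False}

def control_step(state, inputs):
    """One pass of the loop: the complete state space is enumerated here."""
    if not inputs["interlock_closed"]:
        return FAULT
    if state == IDLE and inputs["setup_complete"]:
        return READY
    if state == READY and inputs["operator_fire"]:
        return BEAM_ON
    if state == BEAM_ON and not inputs["operator_fire"]:
        return READY
    return state

state = IDLE
for _ in range(3):            # a real controller would loop continuously
    state = control_step(state, read_inputs())
    time.sleep(0.01)          # fixed polling period instead of interrupts
print(state)                  # READY
```
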
Safe software (2)
• interactions between components should be limited and straightforward
• worst-case timing should be determinable by review of code
• code should include only the minimum features and capabilities required by the system; no unnecessary or undocumented features
Safe software (3)
• critical decisions (launch a missile) should not be made on values often taken by failed components -- 0 or 1 (see the sketch below)
• messages should be designed in ways that eliminate the possibility of computer hardware failures having hazardous consequences (missile launch example)
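
A sketch of the first point: the values failed hardware most commonly produces (all zeros, all ones) should not be able to authorize a critical action, so a distinctive multi-bit pattern is required instead of a boolean. The pattern and function names are illustrative assumptions:

```python
# A stuck-at or corrupted register often reads as all-zeros or all-ones,
# so a single bit should never be enough to authorize a critical command.

ARM_PATTERN = 0b1010_0101_0110_1001   # unlikely to arise from a stuck-at fault

def launch_authorized_boolean(flag):
    # Fragile: a stuck-at-1 failure reads as "authorized".
    return flag == 1

def launch_authorized_pattern(word):
    # Safer: all-zeros, all-ones, or random corruption almost never
    # matches the full arming pattern.
    return word == ARM_PATTERN

print(launch_authorized_boolean(1))              # True  (could be a fault)
print(launch_authorized_pattern(0xFFFF))         # False (stuck-at-ones rejected)
print(launch_authorized_pattern(ARM_PATTERN))    # True  (deliberate command)
```
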
Safe software (4)
• strive for maximal decoupling of the parts of a software control system
• accidents in tightly coupled systems are a result of unplanned interactions
• the flexibility of software encourages coupling and multiple functions; it is important to resist this impulse
Safe software (5)
• “Adding computers to potentially dangerous systems is likely to increase accidents unless extra care is put into system design” (411).
Scope and limits of simulations
• Computer simulations permit “experiments” on different scenarios presented to complex systems
• Simulations are not reality
• Simulations represent some factors and exclude others
• Simulations rely on a mathematization of the process that may be approximate or even false (see the sketch below)
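
A tiny illustration of the last point, assuming two made-up failure-rate models: the same simulated question gives different answers depending on which mathematization is chosen.

```python
# Two "simulations" of yearly failure counts over 40 years. The numbers
# are invented; the point is that the model, not reality, drives the answer.
import random

random.seed(1)

def simulate_failures(years, rate_per_year):
    """Assumes independent failures at a constant yearly rate."""
    return sum(1 for _ in range(years) if random.random() < rate_per_year)

def simulate_failures_with_wear(years, base_rate, wear_per_year):
    """Same question, but assumes the failure rate grows as equipment ages."""
    return sum(1 for y in range(years)
               if random.random() < base_rate + wear_per_year * y)

print(simulate_failures(40, 0.02))                 # constant-rate model
print(simulate_failures_with_wear(40, 0.02, 0.01)) # wear-out model
```
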
Human interface considerations
• unambiguous error messages (Therac-25) (see the sketch below)
• operator needs extensive knowledge about the “theory” of the system
• alarms need to be comprehensible (TMI); spurious alarms minimized
• operator needs knowledge about timing and sequencing of events
• design of control board is critical
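
A sketch of the first point. "MALFUNCTION 54" echoes the style of message the Therac-25 actually displayed; the improved wording is an illustrative assumption, not the machine's real interface:

```python
# Cryptic versus unambiguous error reporting.

def report_cryptic(code):
    # Gives the operator no basis for judging how serious the fault is.
    return f"MALFUNCTION {code}"

def report_unambiguous(code, dose_delivered, dose_prescribed):
    # States what happened, what it means, and what the operator must do.
    return (f"ERROR {code}: delivered dose ({dose_delivered} rad) does not match "
            f"prescription ({dose_prescribed} rad). Do NOT resume treatment; "
            f"call physics staff.")

print(report_cryptic(54))
print(report_unambiguous(54, 15000, 200))
```
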
Control panel anomalies
Risk assessment and prediction
• What is involved in assessing risk?
  - probability of failure
  - prediction of consequences of failure
  - failure pathways
Reasoning about risk
• How should we reason about risk?
• Expected utility: probability of outcome × utility of outcome (see the sketch below)
• Probability and science
• How to anticipate failure scenarios?
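
A minimal sketch of the expected-utility rule on this slide, with made-up probabilities and utilities:

```python
# Expected utility: weight each outcome's utility by its probability and sum.

def expected_utility(outcomes):
    """outcomes: list of (probability, utility) pairs; probabilities sum to 1."""
    return sum(p * u for p, u in outcomes)

# Hypothetical choice between two designs for a safety system.
design_a = [(0.99, 100), (0.01, -5000)]   # rare but severe failure
design_b = [(0.90, 100), (0.10, -200)]    # frequent but mild failure

print(expected_utility(design_a))   # 99 - 50 = 49.0
print(expected_utility(design_b))   # 90 - 20 = 70.0
```
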
Compare scenarios
• nuclear power vs coal power
• automated highway system vs routine traffic accidents
Ordinary reasoning and judgment
• well-known “fallacies” of ordinary reasoning:
  - time preference
  - framing
  - risk aversion
Large risks and small risks
• the decision-theory approach: minimize expected harms
• the decision-making reality: large harms are more difficult to absorb, even if smaller in overall consequence (see the comparison sketch below)
• example: JR West railway
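
A small numeric sketch of the contrast on this slide, using illustrative numbers: two risks with identical expected harm, one of which is far harder to absorb when it actually occurs.

```python
# Equal expected harm, very different worst cases.

def expected_harm(prob_per_year, harm_if_it_happens):
    return prob_per_year * harm_if_it_happens

small_frequent = expected_harm(0.10, 10)      # e.g., many minor incidents
large_rare     = expected_harm(0.001, 1000)   # e.g., one major accident

print(small_frequent, large_rare)   # both 1.0 per year in expectation
# Decision theory treats these as equivalent; organizations and the public
# usually do not, which is the tension the JR West example points to.
```
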
The end