
The Role of Complexity in System Safety and How to Manage It
Nancy Leveson
– You’ve carefully thought out all the angles
– You’ve done it a thousand times
– It comes naturally to you
– You know what you’re doing, it’s what you’ve been trained to do
your whole life.
– Nothing could possibly go wrong, right?
What is the Problem?
• Traditional safety engineering approaches were developed for relatively simple electro-mechanical systems
• New technology (especially software) is allowing almost unlimited complexity in the systems we are building
• Complexity is creating new causes of accidents
• We should build the simplest systems possible, but we are usually unwilling to make the necessary compromises
1. Complexity related to the problem itself
2. Complexity introduced in the design of the solution to the problem
• Need new, more powerful safety engineering approaches to deal with complexity and the new causes of accidents
What is Complexity?
• Complexity is subjective
– Not in system, but in minds of observers or users
– What is complex to one person or at one point in time may
not be to another
• Relative
• Changes with time
• Many aspects of complexity: Will focus on aspects most
relevant to safety
Relation of Complexity to Safety
• In complex systems, behavior cannot be thoroughly
– Planned
– Understood
– Anticipated
– Guarded against
• Critical factor is intellectual manageability
• Leads to “unknowns” in system behavior
• Need tools to
– Stretch our intellectual limits
– Deal with new causes of accidents
Types of Complexity Relevant to Safety
• Interactive Complexity: arises in interactions among
system components
• Non-linear complexity: cause and effect not related in
an obvious way
• Dynamic complexity: related to changes over time
• Decompositional complexity: related to how we decompose or modularize our systems
• Others ??
Interactive Complexity
• The level of interactions has reached the point where they can no longer be thoroughly anticipated or tested
• Coupling causes interdependence
– Increases number of interfaces and potential interactions
– Software allows us to build highly coupled and interactively
complex systems
• How does this affect safety engineering?
– Component failure vs. component interaction accidents
– Reliability vs. safety
Accident with No Component Failures
Software-Related Accidents
• Are usually caused by flawed requirements
– Incomplete or wrong assumptions about operation of controlled
system or required operation of computer
– Unhandled controlled-system states and environmental
conditions
• Merely trying to get the software “correct” or to make it
reliable will not make it safer under these conditions.
Types of Accidents
• Component Failure Accidents
– Single or multiple component failures
– Usually assume random failure
• Component Interaction Accidents
– Arise in interactions among components
– Related to interactive complexity and tight coupling
– Exacerbated by introduction of computers and software
Safety ≠ Reliability
• Safety and reliability are NOT the same
– Sometimes increasing one can even decrease the other.
– Making all the components highly reliable will not prevent
component interaction accidents.
• For relatively simple, electro-mechanical systems with
primarily component failure accidents, reliability engineering
can increase safety.
• But this is untrue for complex, software-intensive sociotechnical systems
• Our current safety engineering techniques assume accidents are caused by component failures (from Rasmussen)
Accident Causality Models
• Underlie all our efforts to engineer for safety
• Explain why accidents occur
• Determine the way we prevent and investigate accidents
• May not be aware you are using one, but you are
• Imposes patterns on accidents
“All models are wrong, but some are useful”
George Box
Chain-of-Events Model
• Explains accidents in terms of multiple events,
sequenced as a forward chain over time.
– Simple, direct relationship between events in chain
• Events almost always involve component failure, human
error, or energy-related event
• Forms the basis for most safety engineering and reliability engineering analysis:
e.g., FTA, PRA, FMECA, Event Trees, etc.
and design:
e.g., redundancy, overdesign, safety margins, …
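As a reminder of what these failure-oriented techniques compute, a minimal Python sketch of fault-tree evaluation follows; the gate structure and event names are invented for illustration, and real FTA tools are far more sophisticated.

```python
# A toy fault tree in the chain-of-events style, evaluated by brute force
# to find its minimal cut sets. Hypothetical gates and events, not from
# the slides.

from itertools import product

# Top event = (PumpFails OR PowerLoss) AND ReliefValveStuck
def top_event(pump_fails: bool, power_loss: bool, valve_stuck: bool) -> bool:
    return (pump_fails or power_loss) and valve_stuck

events = ["pump_fails", "power_loss", "valve_stuck"]

cut_sets = [
    {e for e, failed in zip(events, combo) if failed}
    for combo in product([False, True], repeat=len(events))
    if top_event(*combo)
]
# Keep only minimal cut sets (no proper subset is itself a cut set).
minimal = [c for c in cut_sets if not any(other < c for other in cut_sets)]
print(minimal)  # {pump_fails, valve_stuck} and {power_loss, valve_stuck}
```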
Reason’s Swiss Cheese Model
Swiss Cheese Model Limitations
• Focus on “barriers” (from the process industry approach
to safety) and omit other ways to design for safety
• Ignores common cause failures of barriers (systemic
accident factors)
• Does not include migration to states of high risk: “Mickey
Mouse Model”
• Assumes randomness in “lining up holes”
• Assumes some (linear) causality or precedence in the
cheese slices
• Human error better modeled as a feedback loop than a
“failure” in a chain of events
Non-Linear Complexity
• Definition: Cause and effect not related in an obvious way
• Systemic factors in accidents, e.g., safety culture
– Our accident models assume linearity (chain of events, Swiss
cheese)
– Systemic factors affect events in non-linear ways
• John Stuart Mill (1806-1873): “Cause” is a set of necessary
and sufficient conditions
– What about factors (conditions) that are not necessary or sufficient?
e.g., Smoking “causes” lung cancer
– Contrapositive: if A → B, then ¬B → ¬A
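Spelling out the logic behind the smoking example (my formalization, not notation from the slides):

```latex
% If "C causes E" required C to be both sufficient and necessary,
% we would need
\[
  (C \Rightarrow E) \,\wedge\, (E \Rightarrow C),
  \qquad\text{contrapositive: } (A \Rightarrow B) \equiv (\lnot B \Rightarrow \lnot A)
\]
% Smoking fails both directions (some smokers never develop lung cancer,
% and some non-smokers do), yet it is still a causal factor. Systemic
% accident factors behave the same way, which is why strictly linear,
% chain-of-events causality cannot capture them.
```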
Implications of Non-Linear Complexity
for Operator Error
• Role of operators in our systems is changing
– Supervising rather than directly controlling
– Not simply following procedures
– Non-linear complexity makes it harder for operators to make
real-time decisions
• Operator errors are not random failures
– All behavior is affected by the context (system) in which it occurs
– Human error a symptom, not a cause
– Human error better modeled as feedback loops
Dynamic Complexity
• Related to changes over time
• Systems are not static, but we assume they are
• Systems migrate toward states of high risk under
competitive and financial pressures [Rasmussen]
• Want flexibility but need to design ways to
– Prevent or control dangerous changes
– Detect when they occur during operations
Decompositional Complexity
• Definition: Structural decomposition not consistent with
functional decomposition
• Harder for humans to understand and find functional
design errors
• For safety, makes it difficult to determine whether system
will be safe
– Safety is related to functional behavior of system and its
components
– Not a function of the system structure
• No effective way to verify safety of object-oriented
system designs
Human Error, Safety, and Complexity
• Role of operators in our systems is changing
– Supervising rather than directly controlling
– Complexity is stretching limits of comprehensibility
– We are designing systems in which operator error is inevitable, and then blaming accidents on operators rather than designers
• Designers are unable to anticipate and prevent accidents
• Greatest need in safety engineering is to
– Limit complexity in our systems
– Practice restraint in requirements definition
– Do not add extra complexity in design
– Provide tools to stretch our intellectual limits
(Cartoon caption: “It’s still hungry … and I’ve been stuffing worms into it all day.”)
So What Do We Need to Do?
“Engineering a Safer World”
• Expand our accident causation models
• Create new hazard analysis techniques
• Use new system design techniques
– Safety-driven design
– Integrate safety analysis into system engineering
• Improve accident analysis and learning from events
• Improve control of safety during operations
• Improve management decision-making and safety culture
STAMP (System-Theoretic Accident Model and Processes)
• A new, more powerful accident causation model
• Based on systems theory, not reliability theory
• Treats accidents as a control problem (vs. a failure
problem)
“prevent failures” → “enforce safety constraints on system behavior”
STAMP (2)
• Safety is an emergent property that arises when system
components interact with each other within a larger
environment
– A set of constraints related to behavior of system components
(physical, human, social) enforces that property
– Accidents occur when interactions violate those constraints (a
lack of appropriate constraints on the interactions)
• Accidents are not simply an event or chain of events but
involve a complex, dynamic process
• Most major accidents arise from a slow migration of the entire system toward a state of high risk
– Need to control and detect this migration
STAMP (3)
• Treats safety as a dynamic control problem rather than a
component failure problem.
– O-ring did not control propellant gas release by sealing gap in field
joint of Challenger Space Shuttle
– Software did not adequately control descent speed of Mars Polar
Lander
– Temperature in batch reactor not adequately controlled in system
design
– Public health system did not adequately control contamination of
the milk supply with melamine
– Financial system did not adequately control the use of financial
instruments
Example Safety Control Structure
Safety Control in the Physical Process
Safety Constraints
• Each component in the control structure has
– Assigned responsibilities, authority, accountability
– Controls that can be used to enforce safety constraints
• Each component’s behavior is influenced by
– Context (environment) in which operating
– Knowledge about current state of process
Control processes operate between levels of control: a Controller, containing a Model of the Process, issues Control Actions to the Controlled Process and receives Feedback.
Accidents occur when the model of the process is inconsistent with the real state of the process and the controller provides inadequate control actions.
Feedback channels are critical, in both design and operation.
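A minimal Python sketch of this control loop may help (an illustration I am adding, with invented names and numbers): the controller acts on its model of the process, so when the feedback channel fails, the model drifts from the real state and the control actions become inadequate even though no component has “failed.”

```python
# Minimal, hypothetical sketch of the STAMP control loop above. The
# controller decides from its MODEL of the process, not the process itself.

class ControlledProcess:
    def __init__(self) -> None:
        self.altitude = 100.0                       # real process state

    def apply(self, thrust: float, disturbance: float) -> None:
        self.altitude += thrust - 10.0 - disturbance

class Controller:
    TARGET = 100.0

    def __init__(self) -> None:
        self.model_altitude = 100.0                 # model of the process

    def update_model(self, feedback: float | None) -> None:
        if feedback is not None:                    # feedback channel intact
            self.model_altitude = feedback
        # else: the model silently keeps its stale value

    def control_action(self) -> float:
        # Thrust chosen from the MODEL, trying to hold the target altitude.
        return 10.0 + (self.TARGET - self.model_altitude)

process, controller = ControlledProcess(), Controller()
for step in range(5):
    feedback = process.altitude if step < 2 else None  # channel fails at step 2
    controller.update_model(feedback)
    process.apply(controller.control_action(), disturbance=float(step))
    print(f"step {step}: real={process.altitude:5.1f}  "
          f"model={controller.model_altitude:5.1f}")
# After the feedback channel fails, the model stays frozen at 100 while the
# real altitude drifts away -- inadequate control with no component failure
# inside the controller itself.
```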
Relationship Between Safety and
Process Models (2)
• Accidents occur when models do not match process and
– Required control commands are not given
– Incorrect (unsafe) ones are given
– Correct commands given at wrong time (too early, too late)
– Control stops too soon
Explains software errors, human errors, component
interaction accidents …
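The four patterns above can be written down directly; here is a minimal sketch (the enum and scenario names are mine, not from the slides):

```python
# The four ways a control action can be unsafe, per the list above.

from enum import Enum

class UnsafeControlAction(Enum):
    NOT_PROVIDED = "required control action is not given"
    UNSAFE_PROVIDED = "an incorrect or unsafe control action is given"
    WRONG_TIMING = "a correct action is given too early or too late"
    STOPPED_TOO_SOON = "a continuous action stops too soon"

# Example: classifying hazardous 'landing gear' scenarios (hypothetical).
scenarios = {
    "gear never commanded before touchdown": UnsafeControlAction.NOT_PROVIDED,
    "gear commanded open at cruise speed": UnsafeControlAction.UNSAFE_PROVIDED,
    "gear commanded 2 s after touchdown": UnsafeControlAction.WRONG_TIMING,
    "gear extension halted at 40% travel": UnsafeControlAction.STOPPED_TOO_SOON,
}
for text, uca in scenarios.items():
    print(f"{text:45s} -> {uca.name}")
```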
Accident Causality Using STAMP
Uses for STAMP
• More comprehensive accident/incident
investigation and root cause analysis
• Basis for new, more powerful hazard analysis
techniques (STPA)
• Supports safety-driven design (physical, operational, organizational)
– Can integrate safety into the system engineering process
– Assists in design of human-system interaction and interfaces
Uses for STAMP (2)
• Organizational and cultural risk analysis
– Identifying physical and project risks
– Defining safety metrics and performance audits
– Designing and evaluating potential policy and structural
improvements
– Identifying leading indicators of increasing risk (“canary in the
coal mine”)
• Improve operations and management control of
safety
STPA (System-Theoretic Process Analysis)
• Identifies safety constraints (system and component
safety requirements)
• Identifies scenarios leading to violation of safety
constraints
– Includes scenarios (cut sets) found by Fault Tree Analysis
– Finds additional scenarios not found by FTA and other failure-oriented analyses
• Can be used on technical design and organizational
design
• Evaluated and compared to traditional HA methods
– Found many more potential safety problems
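As an illustration of the mechanical part of such an analysis, the sketch below crosses each control action with contexts and the four guidewords to generate candidate unsafe control actions for review; all names are hypothetical, and this is not the output of any real STPA tool.

```python
# Enumerating *candidate* unsafe control actions: control action x context
# x guideword. Each candidate is then judged against the hazards; the
# unsafe ones become safety constraints.

from itertools import product

control_actions = ["open relief valve", "start feed pump"]
contexts = ["pressure high", "pressure normal"]
guidewords = ["not provided", "provided", "wrong timing", "stopped too soon"]

candidates = [
    f"'{action}' {guide} when {context}"
    for action, context, guide in product(control_actions, contexts, guidewords)
]
# e.g., "'open relief valve' not provided when pressure high" -> constraint:
# the relief valve must be commanded open whenever pressure is high.
for c in candidates[:4]:
    print(c)
```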
Does it work? Is it practical?
Technical
• Safety analysis of new missile defense system (MDA)
• Safety-driven design of new JPL outer planets explorer
• Safety analysis of the JAXA HTV (unmanned cargo spacecraft to ISS)
• Incorporating risk into early trade studies (NASA Constellation)
• Orion (Space Shuttle replacement)
• NextGen (planned changes to air traffic control)
• Accident/incident analysis (aircraft, petrochemical plants, air traffic
control, railroad, UAVs …)
• Proton Therapy Machine (medical device)
• Adaptive cruise control (automobiles)
Does it work? Is it practical?
Social and Managerial
• Analysis of the management structure of the space shuttle program
(post-Columbia)
• Risk management in the development of NASA’s new manned space
program (Constellation)
• NASA Mission Control: re-planning and changing mission control procedures safely
• Food safety
• Safety in pharmaceutical drug development
• Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital
• UAVs in civilian airspace
• Analysis and prevention of corporate fraud
Integrating Safety into System
Engineering
• Hazard analysis must be integrated into the design and decision-making environment, and must be available when decisions are made.
• Lots of implications for specifications:
– Relevant information must be easy to find
– Design rationale must be specified
– Must be able to trace from high-level requirements to system
design to component requirements to component design and
vice versa.
– Must include specification of what NOT to do
– Must be easy to review and find errors
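A minimal sketch of the bidirectional traceability requirement above (the identifiers and requirement texts are hypothetical):

```python
# Links from high-level safety requirements through design decisions to
# component requirements, queryable in both directions.

TRACE = [
    # (high-level requirement, design decision, component requirement)
    ("SR-1 doors must not open between stations",
     "D-12 interlock door controller with motion sensor",
     "CR-7 door controller shall ignore OPEN while speed > 0"),
    ("SR-1 doors must not open between stations",
     "D-13 independent hardware door lock",
     "CR-9 lock shall engage whenever speed > 0"),
]

def downstream(req: str) -> list[str]:
    return [cr for sr, _, cr in TRACE if sr.startswith(req)]

def upstream(comp_req: str) -> list[str]:
    return sorted({sr for sr, _, cr in TRACE if cr.startswith(comp_req)})

print(downstream("SR-1"))  # component requirements implementing SR-1
print(upstream("CR-9"))    # why CR-9 exists: traces back to SR-1
```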
Intent Specifications
• Based on systems theory principles
• Designed to support
– System Engineering (including maintenance and evolution)
– Human problem solving
– Management of complexity (adds intent abstraction to
standard refinement and decomposition)
– Model-Based development
– Specification principles from preceding slide
Leveson, “Intent Specifications: An Approach to Building Human-Centered Specifications,” IEEE Transactions on Software Engineering, Jan. 2000
Level 3 Modeling Language: SpecTRM-RL
• Combined requirements specification and modeling
language. Supports model-based development.
• A state machine with a domain-specific notation on top of
it
– Reviewers can learn to read it in 10 minutes
– Executable
– Formally analyzable
– Automated tools for creation and analysis (e.g.,
incompleteness, inconsistency, simulation)
– Black-box requirements only (no component design)
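To suggest how an executable, analyzable requirements model can work, here is a minimal Python sketch of the underlying idea (my illustrative encoding, not the actual SpecTRM-RL notation): transitions are data, so completeness can be checked mechanically.

```python
# A black-box requirement for a hypothetical 'DOORS' output, written as a
# table: each complete input condition maps to a required output value.

from itertools import product

TABLE = {
    # (train_stopped, aligned_with_platform): commanded door state
    (True,  True):  "OPEN",
    (True,  False): "CLOSED",
    (False, True):  "CLOSED",
    (False, False): "CLOSED",
}

def doors_command(train_stopped: bool, aligned: bool) -> str:
    return TABLE[(train_stopped, aligned)]

print(doors_command(True, True))  # executable: "OPEN"

# Because the spec is data, incompleteness can be detected mechanically --
# the kind of analysis the slide attributes to the automated tools.
assert all((s, a) in TABLE for s, a in product([True, False], repeat=2)), \
    "incomplete requirements: some input condition is unhandled"
```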
SpecTRM-RL
• Black-box requirements only (no component design)
• Separates design from requirements
– Specify only the black-box transfer function across the component
– Reduces complexity by omitting information not needed at
requirements evaluation time
• Separation of concerns is an important way for humans to
deal with complexity
– Almost all software-related accidents caused by
incomplete or inadequate requirements (not software
design errors)
Conclusions
• Traditional safety engineering techniques do not
adequately handle complexity
– Interactive, non-linear, dynamic, and design (especially
decompositional)
• Need to take a system engineering view of safety rather
than the current component reliability view when building
complex systems
– Include entire socio-technical system including safety
culture and organizational structure
– Support top-down and safety-driven design
– Support specification and human review of requirements
Conclusions
• Need a more realistic handling of human errors and
human decision-making
• Need to include behavioral dynamics and changes over
time
– Consider processes behind events and not just events
– Understand why controls drift into ineffectiveness over time
and manage this drift
Nancy Leveson
“Engineering a Safer World”
(Systems Thinking Applied to Safety)
MIT Press, December 2011
Available for free download from:
http://sunnyday.mit.edu/safer-world