The Role of Complexity in System Safety and How to Manage It
Nancy Leveson

– You’ve carefully thought out all the angles
– You’ve done it a thousand times
– It comes naturally to you
– You know what you’re doing; it’s what you’ve been trained to do your whole life.
– Nothing could possibly go wrong, right?

What is the Problem?
• Traditional safety engineering approaches were developed for relatively simple electro-mechanical systems
• New technology (especially software) is allowing almost unlimited complexity in the systems we are building
• Complexity is creating new causes of accidents
• We should build the simplest systems possible, but are usually unwilling to make the compromises necessary
  1. Complexity related to the problem itself
  2. Complexity introduced in the design of the solution to the problem
• Need new, more powerful safety engineering approaches to deal with complexity and the new causes of accidents

What is Complexity?
• Complexity is subjective
  – Not in the system, but in the minds of observers or users
  – What is complex to one person or at one point in time may not be to another
    • Relative
    • Changes with time
• Many aspects of complexity: will focus on the aspects most relevant to safety

Relation of Complexity to Safety
• In complex systems, behavior cannot be thoroughly
  – Planned
  – Understood
  – Anticipated
  – Guarded against
• Critical factor is intellectual manageability
• Leads to “unknowns” in system behavior
• Need tools to
  – Stretch our intellectual limits
  – Deal with new causes of accidents

Types of Complexity Relevant to Safety
• Interactive complexity: arises in interactions among system components
• Non-linear complexity: cause and effect not related in an obvious way
• Dynamic complexity: related to changes over time
• Decompositional complexity: related to how we decompose or modularize our systems
• Others??

Interactive Complexity
• The level of interactions has reached the point where they can no longer be thoroughly anticipated or tested
• Coupling causes interdependence
  – Increases the number of interfaces and potential interactions
  – Software allows us to build highly coupled and interactively complex systems
• How does this affect safety engineering?
  – Component failure vs. component interaction accidents
  – Reliability vs. safety

Accident with No Component Failures
[figure]

Software-Related Accidents
• Are usually caused by flawed requirements
  – Incomplete or wrong assumptions about the operation of the controlled system or the required operation of the computer
  – Unhandled controlled-system states and environmental conditions
• Merely trying to get the software “correct” or to make it reliable will not make it safer under these conditions.

Types of Accidents
• Component failure accidents
  – Single or multiple component failures
  – Usually assumed to be random failures
• Component interaction accidents
  – Arise in interactions among components
  – Related to interactive complexity and tight coupling
  – Exacerbated by the introduction of computers and software

Safety ≠ Reliability
• Safety and reliability are NOT the same
  – Sometimes increasing one can even decrease the other.
  – Making all the components highly reliable will not prevent component interaction accidents.
• For relatively simple, electro-mechanical systems with primarily component failure accidents, reliability engineering can increase safety.
• But this is untrue for complex, software-intensive sociotechnical systems.
• Our current safety engineering techniques assume accidents are caused by component failures. (From Rasmussen)
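Illustrative sketch (not from the original slides; the component names and the catalyst/cooling scenario are hypothetical): two components that each satisfy their local specifications, yet whose interaction violates a system-level safety constraint. Verifying each component against its own requirements (reliability) says nothing about whether the constraint holds.

    # Hypothetical illustration of a component interaction accident:
    # each component meets its local specification, yet the composed
    # system violates a system-level safety constraint.

    class CatalystValveController:
        """Local spec: open the catalyst valve when the recipe step is reached."""
        def command(self, recipe_step_reached: bool) -> bool:
            return recipe_step_reached          # catalyst valve open?

    class CoolingController:
        """Local spec: open the cooling-water valve when temperature > 100."""
        def command(self, temperature: float) -> bool:
            return temperature > 100.0          # cooling valve open?

    def safety_constraint(catalyst_open: bool, cooling_open: bool) -> bool:
        # System-level constraint: catalyst must never flow without cooling water.
        return (not catalyst_open) or cooling_open

    catalyst = CatalystValveController()
    cooling = CoolingController()

    # Early in the batch the temperature is still low, so the cooling
    # controller (correctly, per its spec) keeps its valve closed while
    # the catalyst controller (correctly, per its spec) opens its valve.
    catalyst_open = catalyst.command(recipe_step_reached=True)
    cooling_open = cooling.command(temperature=25.0)

    # No component "failed", but the interaction is hazardous.
    print("constraint satisfied?", safety_constraint(catalyst_open, cooling_open))  # False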
Accident Causality Models
• Underlie all our efforts to engineer for safety
• Explain why accidents occur
• Determine the way we prevent and investigate accidents
• You may not be aware you are using one, but you are
• They impose patterns on accidents

“All models are wrong, some models are useful” – George Box

Chain-of-Events Model
• Explains accidents in terms of multiple events, sequenced as a forward chain over time
  – Simple, direct relationship between the events in the chain
• Events almost always involve component failure, human error, or an energy-related event
• Forms the basis for most safety engineering and reliability engineering
  – Analysis: e.g., FTA, PRA, FMECA, Event Trees, etc.
  – Design: e.g., redundancy, overdesign, safety margins, …

Reason’s Swiss Cheese Model
[figure]

Swiss Cheese Model Limitations
• Focuses on “barriers” (from the process industry approach to safety) and omits other ways to design for safety
• Ignores common cause failures of barriers (systemic accident factors)
• Does not include migration to states of high risk: “Mickey Mouse Model”
• Assumes randomness in “lining up the holes”
• Assumes some (linear) causality or precedence in the cheese slices
• Human error is better modeled as a feedback loop than as a “failure” in a chain of events

Non-Linear Complexity
• Definition: cause and effect not related in an obvious way
• Systemic factors in accidents, e.g., safety culture
  – Our accident models assume linearity (chain of events, Swiss cheese)
  – Systemic factors affect events in non-linear ways
• John Stuart Mill (1806–1873): a “cause” is a set of necessary and sufficient conditions
  – What about factors (conditions) that are not necessary or sufficient? e.g., smoking “causes” lung cancer
  – Contrapositive: if A → B, then ¬B → ¬A
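To spell out the logical point behind the Mill bullet above (the symbols S and C are mine, not from the slides): a sufficient condition guarantees its effect, a necessary condition cannot be absent when the effect occurs, and the contrapositive of an implication always holds. Smoking satisfies neither definition, which is why strictly linear "cause" models fit systemic factors poorly.

    \begin{align*}
    \text{$A$ is sufficient for $B$:} &\quad A \Rightarrow B \\
    \text{$A$ is necessary for $B$:} &\quad B \Rightarrow A \quad (\text{equivalently } \neg A \Rightarrow \neg B) \\
    \text{Contrapositive (always valid):} &\quad (A \Rightarrow B) \equiv (\neg B \Rightarrow \neg A) \\[4pt]
    \text{$S$ (smoking) not sufficient for $C$ (cancer):} &\quad S \wedge \neg C \text{ is possible} \\
    \text{$S$ not necessary for $C$:} &\quad \neg S \wedge C \text{ is possible}
    \end{align*}

Under Mill's strict definition smoking would not be a "cause" at all, yet it is clearly a causal factor; systemic factors in accidents behave the same way.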
Implications of Non-Linear Complexity for Operator Error
• The role of operators in our systems is changing
  – Supervising rather than directly controlling
  – Not simply following procedures
  – Non-linear complexity makes it harder for operators to make real-time decisions
• Operator errors are not random failures
  – All behavior is affected by the context (system) in which it occurs
  – Human error is a symptom, not a cause
  – Human error is better modeled as feedback loops

Dynamic Complexity
• Related to changes over time
• Systems are not static, but we assume they are
• Systems migrate toward states of high risk under competitive and financial pressures [Rasmussen]
• We want flexibility, but need to design ways to
  – Prevent or control dangerous changes
  – Detect when they occur during operations

Decompositional Complexity
• Definition: structural decomposition is not consistent with functional decomposition
• Harder for humans to understand and to find functional design errors
• For safety, makes it difficult to determine whether the system will be safe
  – Safety is related to the functional behavior of the system and its components
  – It is not a function of the system structure
• No effective way to verify the safety of object-oriented system designs

Human Error, Safety, and Complexity
• The role of operators in our systems is changing
  – Supervising rather than directly controlling
  – Complexity is stretching the limits of comprehensibility
  – We are designing systems in which operator error is inevitable, and then blaming accidents on operators rather than designers
• Designers are unable to anticipate and prevent accidents
• The greatest need in safety engineering is to
  – Limit complexity in our systems
  – Practice restraint in requirements definition
  – Not add extra complexity in design
  – Provide tools to stretch our intellectual limits

[Cartoon caption: “It’s still hungry … and I’ve been stuffing worms into it all day.”]

So What Do We Need to Do? (“Engineering a Safer World”)
• Expand our accident causation models
• Create new hazard analysis techniques
• Use new system design techniques
  – Safety-driven design
  – Integrate safety analysis into system engineering
• Improve accident analysis and learning from events
• Improve control of safety during operations
• Improve management decision-making and safety culture

STAMP (System-Theoretic Accident Model and Processes)
• A new, more powerful accident causation model
• Based on systems theory, not reliability theory
• Treats accidents as a control problem (vs. a failure problem):
  from “prevent failures” to “enforce safety constraints on system behavior”

STAMP (2)
• Safety is an emergent property that arises when system components interact with each other within a larger environment
  – A set of constraints related to the behavior of the system components (physical, human, social) enforces that property
  – Accidents occur when interactions violate those constraints (a lack of appropriate constraints on the interactions)
• Accidents are not simply an event or chain of events but involve a complex, dynamic process
• Most major accidents arise from a slow migration of the entire system toward a state of high risk
  – Need to control and detect this migration
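Illustrative sketch (not from the original slides; the train-door scenario and all names are hypothetical): one way to read the shift from “prevent failures” to “enforce safety constraints on system behavior” is that the constraint is stated explicitly and checked before a control action is issued, instead of being an implicit by-product of component reliability.

    # Hypothetical sketch: enforcing a safety constraint on system behavior.
    from dataclasses import dataclass

    @dataclass
    class ProcessState:
        train_moving: bool
        aligned_with_platform: bool

    def doors_may_open(state: ProcessState) -> bool:
        # System-level safety constraint: doors must not open while the
        # train is moving or away from a platform.
        return (not state.train_moving) and state.aligned_with_platform

    def door_controller(open_requested: bool, state: ProcessState) -> str:
        # The control action is issued only if the constraint allows it.
        if open_requested and doors_may_open(state):
            return "OPEN_DOORS"
        return "KEEP_DOORS_CLOSED"

    # The same request is safe in one context and unsafe in another.
    print(door_controller(True, ProcessState(train_moving=False, aligned_with_platform=True)))
    print(door_controller(True, ProcessState(train_moving=True, aligned_with_platform=False)))

The design question then becomes “what constraints must be enforced, and how could their enforcement become inadequate?” rather than “which components might fail?”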
STAMP (3)
• Treats safety as a dynamic control problem rather than a component failure problem. For example:
  – The O-ring did not control propellant gas release by sealing the gap in the field joint of the Challenger Space Shuttle
  – The software did not adequately control the descent speed of the Mars Polar Lander
  – The temperature in a batch reactor was not adequately controlled in the system design
  – The public health system did not adequately control contamination of the milk supply with melamine
  – The financial system did not adequately control the use of financial instruments

Example Safety Control Structure
[figure]

Safety Control in a Physical Process
[figure]

Safety Constraints
• Each component in the control structure has
  – Assigned responsibilities, authority, and accountability
  – Controls that can be used to enforce safety constraints
• Each component’s behavior is influenced by
  – The context (environment) in which it is operating
  – Its knowledge about the current state of the process
• Control processes operate between levels of control

[Figure: the basic control loop. A Controller, containing a Model of the Process, issues Control Actions to the Controlled Process and receives Feedback.]
• Accidents occur when the model of the process is inconsistent with the real state of the process and the controller provides inadequate control actions
• Feedback channels are critical
  – Design
  – Operation

Relationship Between Safety and Process Models (2)
• Accidents occur when the models do not match the process and
  – Required control commands are not given
  – Incorrect (unsafe) ones are given
  – Correct commands are given at the wrong time (too early, too late)
  – Control stops too soon
• Explains software errors, human errors, component interaction accidents, …

Accident Causality Using STAMP
[figure]

Uses for STAMP
• More comprehensive accident/incident investigation and root cause analysis
• Basis for new, more powerful hazard analysis techniques (STPA)
• Supports safety-driven design (physical, operational, organizational)
  – Can integrate safety into the system engineering process
  – Assists in the design of human–system interaction and interfaces

Uses for STAMP (2)
• Organizational and cultural risk analysis
  – Identifying physical and project risks
  – Defining safety metrics and performance audits
  – Designing and evaluating potential policy and structural improvements
  – Identifying leading indicators of increasing risk (“canary in the coal mine”)
• Improve operations and management control of safety

STPA (System-Theoretic Process Analysis)
• Identifies safety constraints (system and component safety requirements)
• Identifies scenarios leading to violation of the safety constraints
  – Includes the scenarios (cut sets) found by Fault Tree Analysis
  – Finds additional scenarios not found by FTA and other failure-oriented analyses
• Can be used on technical design and organizational design
• Evaluated and compared to traditional hazard analysis methods
  – Found many more potential safety problems

[Figure: control loop annotated with causal factors, e.g., missing or wrong communication with another controller.]

Does it work? Is it practical?
Technical
• Safety analysis of a new missile defense system (MDA)
• Safety-driven design of a new JPL outer planets explorer
• Safety analysis of the JAXA HTV (unmanned cargo spacecraft to the ISS)
• Incorporating risk into early trade studies (NASA Constellation)
• Orion (Space Shuttle replacement)
• NextGen (planned changes to air traffic control)
• Accident/incident analysis (aircraft, petrochemical plants, air traffic control, railroads, UAVs, …)
• Proton therapy machine (medical device)
• Adaptive cruise control (automobiles)
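Illustrative sketch referring back to the process-model and STPA slides above (not from the original slides; loosely inspired by the Mars Polar Lander example, with details simplified and all names hypothetical): a controller acts on its process model, so when that model diverges from the real process state it issues a command that is “correct” with respect to the model but hazardous in reality, even though nothing has failed.

    # The four ways a control action can be unsafe (from the slides):
    UNSAFE_CONTROL_ACTION_TYPES = [
        "required control command not given",
        "incorrect (unsafe) command given",
        "correct command given at the wrong time (too early, too late)",
        "control stops too soon",
    ]

    class DescentController:
        """Hypothetical controller for a lander's descent engine."""
        def __init__(self):
            # Process model: what the controller believes about the process.
            self.believes_touchdown = False

        def sensor_update(self, leg_sensor_triggered: bool):
            # The model is updated from feedback; a spurious sensor
            # transient makes the model diverge from reality.
            if leg_sensor_triggered:
                self.believes_touchdown = True

        def control_action(self) -> str:
            # The command is "correct" with respect to the model ...
            return "CUT_ENGINE" if self.believes_touchdown else "KEEP_THRUSTING"

    ctrl = DescentController()
    ctrl.sensor_update(leg_sensor_triggered=True)    # transient at leg deployment
    real_altitude_m = 40.0                           # ... but the real process state differs
    action = ctrl.control_action()
    hazardous = (action == "CUT_ENGINE") and real_altitude_m > 0.0
    print(action, "hazardous:", hazardous)           # CUT_ENGINE hazardous: True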
Does it work? Is it practical?
Social and Managerial
• Analysis of the management structure of the Space Shuttle program (post-Columbia)
• Risk management in the development of NASA’s new manned space program (Constellation)
• NASA Mission Control: re-planning and changing mission control procedures safely
• Food safety
• Safety in pharmaceutical drug development
• Risk analysis of outpatient GI surgery at Beth Israel Deaconess Hospital
• UAVs in civilian airspace
• Analysis and prevention of corporate fraud

Integrating Safety into System Engineering
• Hazard analysis must be integrated into the design and decision-making environment; it needs to be available when decisions are made.
• Lots of implications for specifications:
  – Relevant information must be easy to find
  – Design rationale must be specified
  – Must be able to trace from high-level requirements to system design to component requirements to component design, and vice versa
  – Must include specification of what NOT to do
  – Must be easy to review and to find errors in

Intent Specifications
• Based on systems theory principles
• Designed to support
  – System engineering (including maintenance and evolution)
  – Human problem solving
  – Management of complexity (adds intent abstraction to standard refinement and decomposition)
  – Model-based development
  – The specification principles from the preceding slide
• Leveson, “Intent Specifications: An Approach to Building Human-Centered Specifications,” IEEE Trans. on Software Engineering, Jan. 2000

Level 3 Modeling Language: SpecTRM-RL
• A combined requirements specification and modeling language; supports model-based development.
• A state machine with a domain-specific notation on top of it
  – Reviewers can learn to read it in 10 minutes
  – Executable
  – Formally analyzable
  – Automated tools for creation and analysis (e.g., incompleteness, inconsistency, simulation)
  – Black-box requirements only (no component design)

SpecTRM-RL
• Black-box requirements only (no component design)
• Separates design from requirements
  – Specify only the black-box transfer function across the component
  – Reduces complexity by omitting information not needed at requirements evaluation time
• Separation of concerns is an important way for humans to deal with complexity
  – Almost all software-related accidents are caused by incomplete or inadequate requirements (not software design errors)

Conclusions
• Traditional safety engineering techniques do not adequately handle complexity
  – Interactive, non-linear, dynamic, and design (especially decompositional) complexity
• Need to take a system engineering view of safety, rather than the current component reliability view, when building complex systems
  – Include the entire sociotechnical system, including safety culture and organizational structure
  – Support top-down and safety-driven design
  – Support specification and human review of requirements

Conclusions (2)
• Need a more realistic handling of human errors and human decision-making
• Need to include behavioral dynamics and changes over time
  – Consider the processes behind events, not just the events
  – Understand why controls drift into ineffectiveness over time, and manage this drift

Nancy Leveson, “Engineering a Safer World: Systems Thinking Applied to Safety,” MIT Press, December 2011.
Available for free download from: http://sunnyday.mit.edu/safer-world
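Closing illustration of the black-box, executable-requirements idea from the SpecTRM-RL slides above (a hedged sketch in ordinary Python, not the actual SpecTRM-RL notation; it reuses the hypothetical train-door scenario from the earlier sketch): the component’s required behavior is given purely as a state machine over its externally visible inputs and outputs, with no reference to its internal design, so it can be executed and checked for completeness and consistency.

    from enum import Enum

    class State(Enum):
        CLOSED = "closed"
        OPEN = "open"

    class Cmd(Enum):
        OPEN_DOOR = "open_door"
        CLOSE_DOOR = "close_door"
        NO_COMMAND = "no_command"

    # Black-box requirement as an explicit table:
    # (current state, open_requested, train_stopped) -> (next state, command)
    REQUIREMENT = {
        (State.CLOSED, True,  True):  (State.OPEN,   Cmd.OPEN_DOOR),
        (State.CLOSED, True,  False): (State.CLOSED, Cmd.NO_COMMAND),   # refuse: train moving
        (State.CLOSED, False, True):  (State.CLOSED, Cmd.NO_COMMAND),
        (State.CLOSED, False, False): (State.CLOSED, Cmd.NO_COMMAND),
        (State.OPEN,   True,  True):  (State.OPEN,   Cmd.NO_COMMAND),
        (State.OPEN,   False, True):  (State.CLOSED, Cmd.CLOSE_DOOR),
        (State.OPEN,   True,  False): (State.CLOSED, Cmd.CLOSE_DOOR),   # close if train starts moving
        (State.OPEN,   False, False): (State.CLOSED, Cmd.CLOSE_DOOR),
    }

    def step(state: State, open_requested: bool, train_stopped: bool):
        # Because the requirement is an explicit table, it is executable and
        # analyzable: completeness means every input combination is covered,
        # consistency means exactly one behavior per combination.
        return REQUIREMENT[(state, open_requested, train_stopped)]

    # Completeness check: every combination of state and inputs is specified.
    assert len(REQUIREMENT) == len(State) * 2 * 2

    print(step(State.CLOSED, open_requested=True, train_stopped=False))  # stays CLOSED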