Safety in large technology systems
October 1999

Technology failure
- Why do large, complex systems sometimes fail so spectacularly?
- Do the easy explanations of "operator error," "faulty technology," or "complexity" suffice?
- Are there managerial causes of technology failure?
- Are there design principles and engineering protocols that can enhance large-system safety?
- What is the role of software in safety and failure?

Decision-making about complex technology systems
- the scientific basis of technology systems
- How do managers make intelligent decisions about complex technologies?
- managers, scientists, citizens
- example: Star Wars anti-missile systems
- Note: the technical specialist often does not make the decision, so the persuasive power of good scientific communication is critical.

Goal for technology management
- A central problem for designers, policy makers, and citizens, then, is how to avoid large-scale failures when possible, through appropriate design, and how to plan for minimizing the consequences of those failures that will inevitably occur.

Information and decision-making
- information flow and management of complex technology systems
- complex organizations pursue multiple objectives simultaneously
- complex organizations pursue the same objective along different and conflicting paths

Surprising failures
- the Franco-Prussian war; the Israeli intelligence failure in the Yom Kippur war
- the Mercedes A-Class sedan and the "moose test"
- the Chernobyl nuclear power meltdown

Therac-25
- high energies
- computer control rather than electromechanical control
- positioning the turntable: the x-ray beam flattener
- 15,000 rads administered rather than 200 rads

Technology failure
- sources of failure
  - management failures
  - design failures
  - proliferating random failures
  - "storming" the system
- design for "soft landings"
- crisis management

Causes of failure
- complexity and multiple causal pathways and relations
- defective procedures
- defective training systems
- "human" error
- faulty design

Causes of failure
- "The causes of accidents are frequently, if not almost always, rooted in the organization--its culture, management, and structure. These factors are all critical to the eventual safety of the engineered system" (Leveson, 47).

Varieties of failure
- routine failures, stochastic failures, design failures, systemic failures, interactive failures, "horseshoe nail" failures
- vulnerability of modern technologies to software failure -- the Euro conversion, the Year 2000 bug, air traffic control failures

Sources of potential failure
- hardware interlocks replaced with software checks on turntable position (see the interlock sketch below)
- cryptic malfunction codes; frequent messages
- excessive operator confidence in safety systems
- lack of an effective mechanism for reporting and investigating failures
- poor software engineering practices

Organizational factors
- "Large-scale engineered systems are more than just a collection of technological artifacts: They are a reflection of the structure, management, procedures, and culture of the engineering organization that created them, and they are also, usually, a reflection of the society in which they were created" (Leveson, 47).
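To make the first bullet of "Sources of potential failure" concrete, here is a minimal sketch of the difference between a software-only position check and one backed by an independent hardware interlock. The mode names, encoder angles, and interlock signal are hypothetical illustrations, not drawn from the actual Therac-25 code.

```python
# Hypothetical sketch: why replacing a hardware interlock with a software
# check alone is risky. Modes, angles, and signal names are illustrative only.

X_RAY_MODE = "xray"
ELECTRON_MODE = "electron"

# Illustrative turntable angles (degrees) at which each mode is safe.
EXPECTED_POSITION = {X_RAY_MODE: 90.0, ELECTRON_MODE: 0.0}

def software_position_ok(mode: str, encoder_position: float) -> bool:
    """Software-only check: trusts a single sensor reading and the internal
    mode flag. A stale flag or a race condition can let this pass while the
    beam flattener is actually out of place."""
    return abs(encoder_position - EXPECTED_POSITION[mode]) < 1.0

def beam_permitted(mode: str, encoder_position: float,
                   interlock_switch_closed: bool) -> bool:
    """Defense in depth: the software check AND an independent hardware
    interlock (e.g., a microswitch on the turntable) must both agree
    before the high-energy beam is enabled."""
    return software_position_ok(mode, encoder_position) and interlock_switch_closed
```

The point of the hardware path is that it fails independently of the program: a race condition or stale mode flag in the software cannot, by itself, enable the beam.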
Design for safety
- hazard elimination
- hazard reduction
- hazard control
- damage reduction

Aspects of design
- the technology -- machine, vehicle, software system, airport
- the management structure -- locus of decision-making
- the communications system -- transmission of critical and routine information within the organization
- training of workers for the task -- performance skills, safety procedures

System safety
- builds safety in rather than simply adding it on to a completed design
- deals with systems as a whole rather than with subsystems or components
- takes a larger view of hazards than just failures
- emphasizes analysis rather than past experience and standards

System safety (2)
- emphasizes qualitative rather than quantitative approaches
- recognizes the importance of tradeoffs and conflicts in system design
- more than just system engineering

Hazard analysis
- development: identify and assess potential hazards
- operations: examine an existing system to improve its safety
- licensing: examine a planned system to demonstrate acceptable safety to a regulatory authority

Hazard analysis (2)
- construct an exhaustive inventory of hazards early in design
- classify hazards by severity and probability (see the hazard-ranking sketch below)
- construct the causal pathways that lead to hazards
- design so as to eliminate, reduce, control, or ameliorate them

Better software design
- design for the worst case
- avoid "single point of failure" designs
- design "defensively"
- investigate failures carefully and extensively
- look for the "root cause," not the symptom or a specific transient cause
- embed audit trails; design for simplicity

Safe software design
- control software should be designed with maximum simplicity (408)
- the design should be testable, with a limited number of states
- avoid multitasking; use polling rather than interrupts
- the design should be easily readable and understood

Safe software (2)
- interactions between components should be limited and straightforward
- worst-case timing should be determinable by review of the code
- the code should include only the minimum features and capabilities required by the system; no unnecessary or undocumented features

Safe software (3)
- critical decisions (e.g., launching a missile) should not be made on values often taken by failed components -- 0 or 1 (see the arming-pattern sketch below)
- messages should be designed so that computer hardware failures cannot have hazardous consequences (the missile-launch example)

Safe software (4)
- strive for maximal decoupling of the parts of a software control system
- accidents in tightly coupled systems are the result of unplanned interactions
- the flexibility of software encourages coupling and multiple functions; it is important to resist this impulse

Safe software (5)
- "Adding computers to potentially dangerous systems is likely to increase accidents unless extra care is put into system design" (411).

Scope and limits of simulations
- Computer simulations permit "experiments" on different scenarios presented to complex systems.
- Simulations are not reality.
- Simulations represent some factors and exclude others.
- Simulations rely on a mathematicization of the process that may be approximate or even false.
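The hazard-ranking sketch below illustrates the classify-by-severity-and-probability step from "Hazard analysis (2)". The hazard descriptions, category scales, and scores are hypothetical, used only to show how an inventory might be ordered so that design effort goes to the worst risks first.

```python
# Hypothetical sketch of ranking a hazard inventory by severity and probability.
# The hazards, scales, and scores below are illustrative, not from any real analysis.

from dataclasses import dataclass

SEVERITY = {"catastrophic": 4, "critical": 3, "marginal": 2, "negligible": 1}
PROBABILITY = {"frequent": 4, "probable": 3, "occasional": 2, "remote": 1}

@dataclass
class Hazard:
    description: str
    severity: str      # key into SEVERITY
    probability: str   # key into PROBABILITY

    def risk_index(self) -> int:
        # Simple product index; real standards use explicit risk matrices.
        return SEVERITY[self.severity] * PROBABILITY[self.probability]

inventory = [
    Hazard("beam delivered with flattener out of position", "catastrophic", "remote"),
    Hazard("cryptic malfunction code ignored by operator", "critical", "probable"),
    Hazard("spurious alarm distracts operator", "marginal", "frequent"),
]

# Highest-risk hazards first: these get elimination/reduction effort earliest in design.
for h in sorted(inventory, key=lambda h: h.risk_index(), reverse=True):
    print(f"{h.risk_index():2d}  {h.severity:12s} {h.probability:10s} {h.description}")
```

Real hazard analyses use standardized severity and probability categories; the point here is only that explicit classification makes the eliminate/reduce/control/ameliorate decisions visible and reviewable.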
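The arming-pattern sketch below illustrates the first point of "Safe software (3)": a hazardous command should not be keyed to 0 or 1, the values a failed component or cleared memory is most likely to produce. The specific bit pattern, message layout, and checksum are assumptions for illustration, not taken from any actual control system.

```python
# Hypothetical sketch: encode a hazardous command with a distinctive bit
# pattern plus a checksum, so that all-zero / all-one hardware failures or
# cleared memory cannot be mistaken for a valid "arm" message.

ARM_PATTERN = 0xA5C3    # deliberately not 0x0000 or 0xFFFF, the common failure values
SAFE_PATTERN = 0x5A3C   # "safe" also gets a distinct pattern rather than 0

def build_command(arm: bool) -> bytes:
    code = ARM_PATTERN if arm else SAFE_PATTERN
    payload = code.to_bytes(2, "big")
    checksum = (256 - sum(payload)) % 256   # simple additive checksum, illustrative
    return payload + bytes([checksum])

def is_armed(message: bytes) -> bool:
    """Treat anything other than a well-formed, correctly checksummed ARM
    message as 'not armed' -- the safe state."""
    if len(message) != 3:
        return False
    payload, checksum = message[:2], message[2]
    if (sum(payload) + checksum) % 256 != 0:
        return False
    return int.from_bytes(payload, "big") == ARM_PATTERN

assert not is_armed(b"\x00\x00\x00")   # stuck-at-zero failure stays safe
assert not is_armed(b"\xff\xff\xff")   # stuck-at-one failure stays safe
assert is_armed(build_command(True))
```

Because a cleared register or a stuck data line tends to produce all-zero or all-one values, neither can masquerade as the arming pattern, and any malformed message defaults to the safe state.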
Human interface considerations
- unambiguous error messages (Therac-25)
- the operator needs extensive knowledge about the "theory" of the system
- alarms need to be comprehensible (TMI); spurious alarms minimized
- the operator needs knowledge about the timing and sequencing of events
- the design of the control board is critical

Control panel anomalies

Risk assessment and prediction
- What is involved in assessing risk?
  - the probability of failure
  - prediction of the consequences of failure
  - failure pathways

Reasoning about risk
- How should we reason about risk?
- expected utility: the probability of an outcome x the utility of that outcome
- probability and science
- How do we anticipate failure scenarios?

Compare scenarios
- nuclear power vs. coal power
- an automated highway system vs. routine traffic accidents

Ordinary reasoning and judgment
- well-known "fallacies" of ordinary reasoning:
  - time preference
  - framing
  - risk aversion

Large risks and small risks
- the decision-theory approach: minimize expected harms
- the decision-making reality: large harms are more difficult to absorb, even if smaller in overall consequence
- example: the JR West railway

The end