Risk Aversion and Other Obstacles to Mission Success
A presentation to SPIN, March 2, 2007
Scott Jackson
jackessone@cox.net – (949) 854-0519
Systems Architecture and Engineering Program: A Node of the Resilience Engineering Network

Acknowledgments
This presentation is based in part on a paper titled "The Science of Organizational Psychology Applied to Mission Assurance," presented to the Conference on Systems Engineering Research, Los Angeles, 7-8 April 2006. Co-authors were Katherine Erlick, PhD, and Joann Gutierrez, both of whom have degrees in organizational psychology.

System Resilience, Culture and Paradigms
• System Resilience – the ability of organizational, hardware and software systems to mitigate the severity and likelihood of failures or losses, to adapt to changing conditions, and to respond appropriately after the fact.
• Culture is a key element in System Resilience.
• Thesis: The traditional methods of executive mandate and extensive training are not sufficient to achieve a System Resilience culture. The science of organizational psychology promises to show us a better way.
• Common paradigms, especially with respect to risk, can be obstacles to System Resilience.

The System Resilience Architecture
Figure 1, The Architecture of System Resilience: System Resilience is enabled by Capabilities, Infrastructure and Culture and can be obstructed by common paradigms; Metrics measure System Resilience and determine improvements. Capabilities can be divided into Technical and Managerial Capabilities; Infrastructure into Operational and Organizational Infrastructure; Culture can be enhanced by Cultural Initiatives, which can be divided into the Internal and External Organization. Tools and processes (e.g., risk) sit in the Capabilities and Infrastructure branches; taking risk seriously sits in Culture.

Common Root Causes Suggest Key Capabilities
Root Causes: lack of rigorous system safety, lack of information management, lack of risk management, regulatory faults, lack of review and oversight, incomplete verification, conflicting priorities, poor schedule management, lack of expertise, organizational barriers, incomplete requirements, faulty decision making.
Capabilities: cultural initiatives, system resilience oversight, system resilience infrastructure, culture, risk management, schedule management, cost management, requirements management, technology management, verification, system safety, configuration management, expertise, software, maintenance, manufacturing, operations, work environment, information management, emergence, regulatory environment, reliability, supplier management, adaptability.
Capabilities require more than system safety and reliability.

Case Studies Covered Many Domains
• American Airlines Flight 191 – Reason
• Apollo 13 – Leveson, Reason, Chiles
• Bhopal – Leveson, Reason, Chiles
• Challenger – Vaughan, Leveson, Reason, Chiles, Hollnagel
• Chernobyl – Leveson, Reason, Chiles
• Clapham Junction – Reason
• Columbia – Columbia Accident Investigation Board, Chiles, Hollnagel
• The Fishing Industry – Gaël
• Flixborough – Leveson, Reason
• Hospital Emergency Wards – Woods and Wears
• Japan Airlines 123 – Reason
• Katrina – Westrum
• King's Cross Underground – Leveson, Reason
• Mars Lander – Leveson
• Nagoya Airbus A300 – Dijkstra
• New York Electric Power Recovery on 9/11 – Mendonça
• Phillips 66 Company – Reason, Chiles
• Piper Alpha – Reason, Chiles, Paté-Cornell, Hollnagel
• Seveso – Leveson, Reason
• Texas City – Hughes, Chiles
• Three Mile Island – Leveson, Reason, Chiles
• TWA 800 – National Transportation Safety Board (NTSB)
• Windscale – Leveson

Risk Emphasis: Some Quotes
"A safety culture is a learning culture."
– James Reason, Managing the Risks of Organizational Accidents
"The severity with which a system fails is directly proportional to the intensity of the designer's belief that it cannot." (The Titanic Effect)
– Nancy Leveson, Safeware: System Safety and Computers
"Focus on problems."
– Weick and Sutcliffe, Managing the Unexpected
"One of our largest problems was success."
– Cor Herkströter, Royal Dutch/Shell

The Feynman Observation
"[Feynman's] failure estimate for the shuttle system was 1 in 25…"
"[NASA's] estimates [of failure] range from 1 in 100 [by working engineers] to 1 in 100,000 [by management]."
– Diane Vaughan, The Challenger Launch Decision
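A short illustrative calculation (not part of the original presentation; the 100-flight horizon is an assumed figure for the example) shows why this gap in estimates matters. The probability of losing at least one vehicle in N independent flights with per-flight failure probability p is 1 - (1 - p)^N, and the three estimates give wildly different answers:

    # Illustrative only: chance of at least one loss in N flights under
    # different per-flight failure estimates (N = 100 is assumed for the example).
    estimates = {
        "Feynman (1 in 25)": 1 / 25,
        "Working engineers (1 in 100)": 1 / 100,
        "Management (1 in 100,000)": 1 / 100_000,
    }
    N = 100  # assumed flight count, for illustration
    for label, p in estimates.items():
        print(f"{label}: P(at least one loss in {N} flights) = {1 - (1 - p) ** N:.3f}")

Under the 1-in-25 estimate a loss over 100 flights is nearly certain (about 0.98); under the 1-in-100,000 estimate it is about 0.001. Same program, two entirely different pictures of risk.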
Priorities: A Personal View
Presented as a three-step build, starting with the lowest priority:
• Priority Number 3 – Do you have a good risk tool?
• Priority Number 2 – Do you have a good risk process?
• Priority Number 1 – Do you take risk seriously?

Two Important Risk Paradigms
Definition: a paradigm is a mind-set, perception, way of thinking, or cultural belief.
• Paradigm No. 1 – The belief that even having risks is a sign of bad management
• Paradigm No. 2 – Risk as a "normative" condition
Diane Vaughan: "NASA's 'can do' attitude created a … risk taking culture that forced them to push ahead no matter what…"
"…flying with acceptable risks was normative in NASA culture."

Some Paradigms
Definition: a paradigm is a mind-set, perception, way of thinking, or cultural belief.
• Don't bother me with small problems.
• Our system (airplane, etc.) is safe. It has never had a major accident.
• We can't afford to verify everything.
• My job is to assure a safe design.
• If I am ethical, I have nothing else to worry about.

More Paradigms (p. 2)
• If I get too close to safety issues, then I may be liable.
• Safety is the responsibility of individuals, not organizations.
• Our customer pays us to design systems, not organizations.
• Human error has already been taken into account in safety analyses.
• Accidents are inevitable; there is nothing you can do to prevent them.
• Organizational issues are the purview of program management.

Still More Paradigms (p. 3)
• I am hampered by scope, schedule and cost constraints.
• Our contracts (with the customer and suppliers) do not allow us to consider aspects outside of design.
• Human errors are random and uncontrollable.
• You can't predict serious accidents.
• To change paradigms, all we need is a good executive and lots of training.

Some Thoughts on Risk
(from the Second Symposium on Resilience Engineering, Juan-les-Pins, France, November 2006)
Wreathall says we must consider "meta-risks," that is, risks that we all know are there and do not consider.
Epstein says that the important risks are in the lower right-hand corner of the risk matrix:
• Low probability
• High consequence
• Lots of them
• Examine by simulation (a simulation sketch follows this list)
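Epstein's point about examining the lower right-hand corner by simulation can be illustrated with a minimal Monte Carlo sketch. It is not from the original presentation, and the portfolio of ten risks, their probabilities and their consequence values are invented for illustration: each risk alone sits in a "low probability, high consequence" cell, yet simulating them together shows how likely a severe loss is in aggregate.

    # Illustrative Monte Carlo sketch (hypothetical numbers): a portfolio of many
    # low-probability, high-consequence risks, examined by simulation rather than
    # by reading the risk matrix one cell at a time.
    import random

    random.seed(1)  # repeatable illustration

    # Ten independent risks, each unlikely on any single mission (invented values).
    risks = [{"p": 0.005, "consequence": 100.0} for _ in range(10)]

    N = 100_000              # number of simulated missions
    severe_threshold = 100.0  # loss level we call "severe"
    severe = 0
    total_loss = 0.0

    for _ in range(N):
        # Sum the consequences of whichever risks happen to occur on this mission.
        loss = sum(r["consequence"] for r in risks if random.random() < r["p"])
        total_loss += loss
        if loss >= severe_threshold:
            severe += 1

    print(f"P(severe loss on a mission) ~ {severe / N:.3f}")
    print(f"Expected loss per mission   ~ {total_loss / N:.2f}")

Each risk by itself is a 1-in-200 event, yet with ten of them roughly one mission in twenty ends in a severe loss. That compounding of "lots of them" is exactly what a cell-by-cell reading of the risk matrix tends to hide.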
The Genesis of Paradigms
(Diagram: our paradigms grow out of cultural beliefs and out of pressures such as cost and schedule.)
The Old Model (start here): Executive Management → The Vision → Employees → Attend Training
The New Model, simplified (start here): Executive Management → Endorse Self-Discovery → Employees → Establish Communities of Practice → Learn New Paradigms

Some Approaches
• Training
• The Hero-Executive
• Socratic teaching
• Coaching
• Self-discovery through communities of practice
• Independent reviews
• Cost and schedule margins
• Standard processes
• Teams
• Rewards and incentives
• Management selection

Self-Discovery Through Communities of Practice
(Diagram: a Community of Practice built around a Core Group)
• Bottom-up
• Informal
• Core Group
• Dialogue
• Respect
• Inclusive

Conclusions
• Progress is both top-down and bottom-up.
• Organizational psychology is a necessary discipline for mission assurance and, hence, also for systems engineering.
• Training and top-down mandates have limited effectiveness.
• Self-discovery is the preferred path. No one can teach you the right paradigm; you have to learn it yourself.

Some Recommended Reading
Vaughan, Diane, The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA, University of Chicago Press, 1996.
Reason, James, Managing the Risks of Organizational Accidents, Ashgate, 1997.
Leveson, Nancy, Safeware: System Safety and Computers, Addison-Wesley, 1995.
Weick, Karl E. and Sutcliffe, Kathleen M., Managing the Unexpected, Jossey-Bass, 2001.
Senge, Peter, et al., The Dance of Change, Doubleday, 1999.
Wenger, Etienne, Communities of Practice: Learning, Meaning, and Identity, Cambridge University Press, 1998.