Can a Systems Reliability Approach Avoid the Logic of Failure?
Presented to the 2012 Annual Research Review, USC Center for System and Software Engineering
Presented by Myron Hecht, The Aerospace Corporation
March 7, 2012
© The Aerospace Corporation 2011, 2012

Motivation
• "Homework Assignment": email from Barry Boehm on 2/22/2012:
  "Dietrich Dörner's The Logic of Failure book* emphasizes that most failures come from overly simplistic short-term decisions made without considering their later effects. How about some Decision Considerations for Avoiding the Logic of Failure?"
• Question: Is a systems reliability approach effective at addressing the limitations in thinking that result in the Logic of Failure?
*Dietrich Dörner, The Logic of Failure: Recognizing and Avoiding Error in Complex Situations, Metropolitan Books, 1996; originally published in German © 1989 by Rowohlt Verlag GmbH; translated by Rita and Robert Kimber; ISBN 0-201-47948-6

Dietrich Dörner's Statement of the Logic of Failure
• What is the Logic of Failure? Thought patterns that, while appropriate to a simpler world, are disastrous for a complex one.
• "An individual's reality model ... as a rule will be both incomplete and wrong. People are most inclined to insist that they are right when they are wrong and when they are beset by uncertainty. The ability to admit ignorance or mistaken assumptions is indeed a sign of wisdom, and most individuals in the thick of complex situations are not, or not yet, wise." (Dörner, p. 42)
• Decision-Making Errors
  – Goal definition failures
    • Not recognizing conflicts in goals
    • Not understanding priorities of goals
  – Problem recognition failures
    • Oversimplification of the situation
    • Assuming the lack of obvious problems means all is well
  – Problem solution failures
    • Reliance on "proven solutions" that do not apply to the current problem
    • Focusing on solving an immediate crisis at the expense of long-term goals
    • Not considering what to keep when deciding what to fix
    • Focusing on pet projects and forgetting what is important

Dörner Error Avoidance Guidelines Summary
• State goals clearly.
• Establish priorities, but realize they may change as the situation does.
• Form a model of the system.
• Gather information ... but not too much.
• Don't excessively abstract. Remember common sense.
• Analyze errors and draw conclusions from them. Change your thinking and behavior according to that feedback.
Based on a summary by Clive Rowe, http://www.cliverowe.com/blog/2010/09/07/the-logic-of-failure/

Mapping of Decision-Making Errors to Avoidance Guidelines
[Matrix chart: the Dörner decision-making errors (rows, grouped as defining goals, problem recognition, and problem solution) marked with an X against each of the six avoidance guidelines (columns) that addresses them.]

Unsuccessful IT Project Causes
• Failure to meet requirements (product)
  – Organizational changes needed to realize benefits
  – Inadequately specified deliverables
  – Overly ambitious scope
  – Technology risks
• Late or over budget (process)
  – Impact of changes not recognized (technology, requirements, business case, actors)
  – Overly optimistic predictions of activity durations
  – Resources not available when required
  – Inadequate accountability and status tracking
Brenda Whittaker, "What went wrong? Unsuccessful information technology projects", Information Management & Computer Security 7/1 (1999), pp. 23–29

DoD Reliability and Maintainability (R&M) Problems
• GAO: Persistently low readiness rates and costly maintenance contribute to increases in the total ownership cost of DoD systems.
• DoD: 80% of systems acquired between 1995 and 2000 failed to meet R&M objectives because
  – requirements focused on technical performance, with little attention to operations and support (O&S) costs and readiness [defining goals];
  – immature technologies hindered the ability to design weapon systems with high reliability [immediate vs. long term];
  – there was limited collaboration among the organizations charged with requirements setting, product development, and maintenance; and
  – attention to data collection and analysis was inadequate.
DoD Guide for Achieving Reliability, Availability, and Maintainability, August 3, 2005, http://www.acq.osd.mil/dte/docs/RAM_Guide_080305.pdf

Mapping of DoD R&M Problems to Dörner Decision-Making Errors
[Matrix chart: the four DoD R&M problem causes above (rows) marked with an X against the Dörner decision-making errors (columns) they exemplify.]

Systems Reliability Approach (GEIA-STD-0009)
• Understand customer/user requirements and constraints
• Design and redesign for reliability
• Verification: produce reliable systems and products
• Fielding: monitor and assess user reliability
ANSI/GEIA-STD-0009, "Reliability Program Standard for Systems Design, Development, and Manufacturing", available from the Information Technology Association of America (ITAA), http://www.techstreet.com/itaagate.tmpl or http://webstore.ansi.org

Mapping of R&M Problems to Solutions (GEIA-STD-0009)
[Matrix chart: the DoD R&M problem causes (rows) marked with an X against the four GEIA-STD-0009 activities (columns) that address them.]

Mapping of Unsuccessful IT Projects to Dörner Logic of Failures
[Matrix chart: the Whittaker unsuccessful-IT-project causes (rows) marked with an X against the Dörner decision-making errors (columns) they exemplify.]

Mapping of Dörner Logic of Failure to Systems Reliability Approach (GEIA-STD-0009)
[Matrix chart: the Dörner decision-making errors (rows) marked with an X against the four GEIA-STD-0009 activities (columns) that counter them.]

Mapping of Unsuccessful IT Projects to Systems Reliability Approach (GEIA-STD-0009)
[Matrix chart: the Whittaker unsuccessful-IT-project causes (rows) marked with an X against the four GEIA-STD-0009 activities (columns) that address them.]

Question
• IF unsuccessful IT projects can be mapped to the Dörner Logic of Failures
• AND the systems reliability approach (GEIA-STD-0009) can be mapped to the Dörner Logic of Failures
• THEN why is it so hard to prevent the Logic of Failure?
Errors in Defining Goals
• Defining basic parameters
  – Example: failure rate (failures per unit time): MIL-STD-721C has 11 definitions related to failures and more than 15 different definitions of time
• Choosing attributes
  – Reliability (probability of success over a definite time interval): important to an airplane traveler
  – Availability (probability of being operational): important to a network or service user
  – Probability of success upon demand (e.g., actuation): important to a safety system
• Formulating verifiable requirements (counterexamples below)
  – No single point of failure
  – No failure propagation
  – No hazardous failure modes

Errors in Problem Recognition*
• System failures attributable to oversimplification or spuriously assuming all was well
  – Switchover failures
    • Ariane 5 (June 1996, ref. 1)
    • Phobos-Grunt (2012)
  – Capacity
    • RIM (November 2011 service interruption)
    • Walmart online web site (November 2006, Black Friday)
  – Maintenance
    • Juniper Networks (December 2011)
  – Data breaches
    • 18 million health records breached in 2011
*See references chart at end of this briefing for more details

Errors in Problem Solution
• Reliance on "proven solutions"
  – Hardware-only solutions that assume software does not fail
• Focusing on solving an immediate problem at the expense of long-term goals
  – Achieving greater reliability or reducing sustainment costs imposes immediate costs (problems), but the benefit comes over many years
• Not considering what to keep when deciding what to fix
  – Discarding test, verification, and analysis to solve a budget or schedule problem
• Focusing on pet projects and forgetting what is important
  – Consequences of ignoring reliability and dependability: market failure, legal liability, mission failure, organizational failure

References
• ARIANE 5, Flight 501 Failure, Report by the Inquiry Board, Prof. J. L. Lions, Chairman, 1996, available online at www.di.unito.it/~damiani/ariane5rep.html
• "Russia: Computer crash doomed Phobos-Grunt", Spaceflight Now, Feb. 6, 2012, available online at http://www.spaceflightnow.com/news/n1202/06phobosgrunt/
• Evan Schuman, "Black Friday Turns Servers Dark at Wal-Mart, Macy's", eWeek, Nov. 25, 2006, http://www.eweek.com/print_article2/0,1217,a=194801,00.asp
• Jim Duffy, "Juniper at the root of Internet outage?", Network World, November 7, 2011, available online at http://www.networkworld.com/news/2011/110711-internet-outage-252851.html
• Nicholas Kolakowski, "RIM BlackBerry Outage Hits Users Around the World", eWeek, October 10, 2011, available online at http://www.eweek.com/c/a/Mobile-and-Wireless/RIM-BlackBerry-Outage-Hits-Users-Around-the-World-781395/
• U.S. Dept. of Health and Human Services, Health Information Privacy web site, http://www.hhs.gov/ocr/privacy/hipaa/administrative/breachnotificationrule/postedbreaches.html

Thank you
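Backup: the three attributes on the "Errors in Defining Goals" chart differ not just in wording but in what gets computed. A minimal sketch, not part of the original briefing, assuming a constant failure rate and an assumed mean time to repair (all parameter values hypothetical), using the standard models R(t) = e^(-λt) and A = MTBF / (MTBF + MTTR):

```python
import math

# Assumed parameters for a hypothetical component (illustrative only).
FAILURE_RATE = 1e-4   # lambda: failures per hour (constant-rate model)
MTTR = 8.0            # mean time to repair, hours

def reliability(t_hours, lam=FAILURE_RATE):
    """Probability of success over a definite time interval:
    R(t) = exp(-lambda * t) under the exponential model."""
    return math.exp(-lam * t_hours)

def availability(lam=FAILURE_RATE, mttr=MTTR):
    """Steady-state probability of being operational:
    A = MTBF / (MTBF + MTTR), with MTBF = 1 / lambda."""
    mtbf = 1.0 / lam
    return mtbf / (mtbf + mttr)

def demand_success(p_fail_on_demand):
    """Probability of success upon demand (e.g., actuation):
    the complement of the per-demand failure probability."""
    return 1.0 - p_fail_on_demand

# The airplane traveler (a 10-hour flight), the network user (24x7
# service), and the safety system (one actuation) each care about a
# different one of these numbers:
print(f"R(10 h)      = {reliability(10):.6f}")    # ~0.999000
print(f"Availability = {availability():.6f}")     # ~0.999201
print(f"P(demand ok) = {demand_success(1e-3):.6f}")
```

The point of the sketch is the goal-definition error on the chart: the same component can score well on one attribute and poorly on another, so stating "the system shall be reliable" without choosing the attribute (and the time basis) leaves the requirement unverifiable.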