Can a Systems Reliability Approach Avoid
the Logic of Failure?
Presented to
2012 Annual Research Review
USC Center for System and Software Engineering
Presented by
Myron Hecht
The Aerospace Corporation
March 7, 2012
© The Aerospace Corporation 2012
Motivation
• “Homework Assignment”: Email from Barry Boehm on 2/22/2012:
Dietrich Dorner's The Logic of Failure book* emphasizes that most
failures come from overly simplistic short-term decisions made
without considering their later effects. How about some Decision
Considerations for Avoiding the Logic of Failure?
• Question: Is a systems reliability approach effective in addressing the
limitations in thinking that result in the Logic of Failure?
*Dietrich Dorner, The Logic of Failure: Recognizing and Avoiding Error in Complex
Situations, Metropolitan Books, 1996; originally published in German © 1989 by
Rowohlt Verlag GmbH; translated by Rita and Robert Kimber, ISBN 0-20147948-6
Dietrich Dorner’s Statement of the Logic of Failure
• What is the Logic of Failure?
– Thought patterns that, while appropriate to a simpler world, are disastrous for a
complex one
• “An individual’s reality model ... as a rule will be both incomplete and wrong.
People are most inclined to insist that they are right when they are wrong
and when they are beset by uncertainty. The ability to admit ignorance or
mistaken assumptions is indeed a sign of wisdom, and most individuals in
the thick of complex situations are not, or not yet, wise.” (Dorner, p. 42)
• Decision-Making Errors
– Goal definition failures
• Not recognizing conflicts in goals
• Not understanding priorities of goals
– Problem recognition failures
• Oversimplification of the situation
• Assuming that a lack of obvious problems meant all was well
– Problem solution failures
• Reliance on “proven solutions” that don’t apply to the current problem
• Focusing on solving an immediate crisis at the expense of long-term goals
• Not considering what to keep when deciding what to fix
• Focusing on pet projects and forgetting what is important
Dorner Error Avoidance Guidelines Summary
• State goals clearly.
• Establish priorities, but realize they may change as the situation
does.
• Form a model of the system.
• Gather information . . . but not too much.
• Don’t excessively abstract. Remember common sense.
• Analyze errors and draw conclusions from them. Change your
thinking and behavior according to that feedback.
Based on summary by Clive Rowe, http://www.cliverowe.com/blog/2010/09/07/the-logic-of-failure/
Mapping of Decision-Making Errors to Avoidance Guidelines
[Matrix chart: Dorner’s decision-making errors (rows, grouped under defining goals,
problem recognition, and problem solution) vs. the six avoidance guidelines
(columns); an X marks each guideline that helps avoid a given error.]
Unsuccessful IT Project Causes
• Failure to meet requirements (product)
– Organizational changes needed to realize benefits
– Inadequately specified deliverables
– Overly ambitious scope
– Technology risks
• Late or over budget (process)
– Impact of changes not recognized (technology, requirements,
business case, actors)
– Overly optimistic predictions of activity durations
– Resources not available when required
– Inadequate accountability and status tracking
Brenda Whittaker, “What went wrong? Unsuccessful information technology projects”,
Information Management & Computer Security 7/1 [1999] 23-29
DoD Reliability and Maintainability (R&M) Problems
• GAO: Persistently low readiness rates and costly maintenance
contribute to increases in the total ownership cost of DoD systems
• DoD: 80% of systems acquired between 1995 and 2000 failed to meet
R&M objectives because:
– requirements focused on technical performance, with little attention
to operations and support (O&S) costs and readiness [Defining
Goals]
– immature technologies hindered the ability to design weapon
systems with high reliability [Immediate vs. long term]
– limited collaboration among organizations charged with
requirements setting, product development, and maintenance
– inadequate attention to data collection and analysis
DoD Guide for Achieving Reliability, Availability, and Maintainability, August 3, 2005,
http://www.acq.osd.mil/dte/docs/RAM_Guide_080305.pdf
Mapping of DoD R&M Problems to Dorner Decision-Making Errors
[Matrix chart: the DoD R&M problems from the previous chart (rows) vs. Dorner’s
decision-making errors (columns); an X marks each error that a given problem
exemplifies.]
DoD Guide for Achieving Reliability, Availability, and Maintainability, August 3, 2005,
http://www.acq.osd.mil/dte/docs/RAM_Guide_080305.pdf
Systems Reliability Approach (GEIA STD-009)
• Understand Customer/User Requirements and Constraints
• Design and Redesign for Reliability
• Verification: Produce Reliable Systems and Products
• Fielding: Monitor and Assess User Reliability
ANSI/GEIA-STD-0009, “Reliability Program Standard for Systems Design, Development, and
Manufacturing”, available from the Information Technology Association of America (ITAA),
http://www.techstreet.com/itaagate.tmpl or http://webstore.ansi.org
Mapping of R&M Problems to Solutions (GEIA STD-009)
[Matrix chart: the DoD R&M problems (rows) vs. the four GEIA-STD-0009 activities
(columns: Understand Customer/User Requirements and Constraints; Design and
Redesign for Reliability; Verification: Produce Reliable Systems/Products; Fielding:
Monitor and Assess User Reliability); an x marks each activity that addresses a
given problem.]
DoD Guide for Achieving Reliability, Availability, and Maintainability, August 3, 2005,
http://www.acq.osd.mil/dte/docs/RAM_Guide_080305.pdf
Mapping Unsuccessful IT Projects to Dorner Logic of Failures
[Matrix chart: Whittaker’s unsuccessful-IT-project causes (rows, grouped under
“Failure to meet requirements (product)” and “Late or over budget (process)”) vs.
Dorner’s decision-making errors (columns); an X marks each error that a given
cause exemplifies.]
Mapping of Dorner Logic of Failure to Systems Reliability Approach (GEIA STD 009)
[Matrix chart: Dorner’s decision-making errors (rows) vs. the four GEIA-STD-0009
activities (columns: Requirements, Design, Verification, Fielding); an x marks each
activity that counters a given error.]
Mapping of Unsuccessful IT Projects to Systems Reliability Approach (GEIA STD 009)
[Matrix chart: Whittaker’s unsuccessful-IT-project causes (rows) vs. the four
GEIA-STD-0009 activities (columns: Requirements, Design, Verification, Fielding);
an x marks each activity that addresses a given cause.]
Question
• IF unsuccessful IT projects can be mapped to the Dorner Logic of
Failures
• AND the Systems Reliability Approach (GEIA STD 009) can be mapped
to the Dorner Logic of Failures
• THEN why is it still so hard to prevent the Logic of Failure?
Errors in Defining Goals
• Defining basic parameters
– Example: failure rate (failures per unit time): MIL-STD-721C has 11
definitions related to failures and more than 15 different definitions
of time
• Choosing attributes
– Reliability (probability of success over a definite time interval) –
important to an airplane traveler
– Availability (probability of being operational) – important to a
network or service user
– Probability of success upon demand (e.g., actuation) – important to
a safety system
• Formulating verifiable requirements (the following are counterexamples,
unverifiable as stated)
– No single point of failure
– No failure propagation
– No hazardous failure modes
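The difference among these attributes can be made concrete with a small numeric sketch. The figures below are hypothetical, and a constant (exponential) failure rate is assumed:

```python
import math

# Hypothetical figures, for illustration only
mtbf_hours = 2000.0              # mean time between failures
mttr_hours = 4.0                 # mean time to repair
failure_rate = 1.0 / mtbf_hours  # constant-rate (exponential) assumption

# Reliability: probability of surviving a definite interval without failure
# (the attribute the airplane traveler cares about for a 10-hour flight)
mission_hours = 10.0
reliability = math.exp(-failure_rate * mission_hours)

# Steady-state availability: long-run probability of being operational
# (the attribute the network or service user cares about)
availability = mtbf_hours / (mtbf_hours + mttr_hours)

print(f"R({mission_hours:g} h) = {reliability:.4f}")   # ~0.9950
print(f"A = {availability:.4f}")                       # ~0.9980
```

The same hypothetical hardware yields two different numbers depending on the attribute chosen, which is why the goal must be defined before a requirement can be verified.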
Errors in Problem Recognition*
• System failures attributable to oversimplification or to spuriously
assuming all was well
– Switchover failures
• Ariane 5 (June 1996, ref. 1)
• Phobos-Grunt (2012)
– Capacity
• RIM (October 2011 – service interruption)
• Walmart online web site (November 2006 – Black Friday)
– Maintenance
• Juniper Networks (November 2011)
– Data breaches
• 18 million health records breached in 2011
*See the references chart at the end of this briefing for more details.
Errors in Problem Solution
• Reliance on “proven solutions”
– Hardware-only solutions assuming software does not fail
• Focusing on solving an immediate problem at the expense of long-term
goals
– Achieving greater reliability or reducing sustainment costs imposes
immediate costs (problems), but the benefit comes over many
years
• Not considering what to keep when deciding what to fix
– Discarding test, verification, and analysis to solve a budget or
schedule problem
• Focusing on pet projects and forgetting what is important
– Consequences of ignoring reliability and dependability:
• Market failure
• Legal liability
• Mission failure
• Organizational failure
References
• ARIANE 5, Flight 501 Failure, Report by the Inquiry Board, Prof. J. L. Lions,
Chairman, 1996, available online at www.di.unito.it/~damiani/ariane5rep.html
• “Russia: Computer crash doomed Phobos-Grunt”, Spaceflight Now, Feb. 6, 2012,
available online at http://www.spaceflightnow.com/news/n1202/06phobosgrunt/
• Evan Suman, “Black Friday Turns Servers Dark at Wal-Mart, Macy’s”, eWeek, Nov.
25, 2006, http://www.eweek.com/print_article2/0,1217,a=194801,00.asp
• Jim Duffy, “Juniper at the root of Internet outage?”, Network World, November 7,
2011, available online at
http://www.networkworld.com/news/2011/110711-internet-outage-252851.html
• Nicholas Kolakowski, “RIM BlackBerry Outage Hits Users Around the World”, eWeek,
October 10, 2011, available online at
http://www.eweek.com/c/a/Mobile-and-Wireless/RIM-BlackBerry-Outage-Hits-Users-Around-the-World-781395/
• U.S. Dept. of Health and Human Services, Health Information Privacy web site,
http://www.hhs.gov/ocr/privacy/hipaa/administrative/breachnotificationrule/postedbreaches.html
Thank you
© The Aerospace Corporation 2012