Space System Development: Lessons Learned Conference on Quality in the Space and Defense

Space System Development:
Lessons Learned
(Excerpts)
Conference on Quality in
the Space and Defense
Industries
March 14, 15, 2011
Joe Nieberding
Presenter
Joe Nieberding:
Mr. Nieberding has over 40 years of management and technical
experience in leading and participating in NASA independent review
teams, and in evaluating NASA advanced space mission planning. Before
retiring from NASA GRC in 2000, he directed numerous studies during his
35 years at GRC to help select transportation,
propulsion, power, and communications systems for advanced NASA
mission applications. His Advanced Space Analysis Office led all
exploration advanced concept studies for GRC. In addition, he was a
launch team member on over 65 NASA Atlas/Centaur and Titan/Centaur
launches, and is a widely recognized expert in launch vehicles and
advanced transportation architecture planning for space missions. Mr.
Nieberding is co-founder and President of Aerospace Engineering
Associates.
© 2006 All Rights Reserved. Aerospace Engineering
Associates LLC
Introduction
• Excerpted from a two-day presentation aimed at assisting today’s
space system developers
– Explore overarching fundamental lessons derived from
• Many specific mishap case histories from multiple programs
• “Root” causes not unique to times/programs
• Will cover some material from the two-day presentation:
– A few of the detailed case histories
– A summary of causes for all case histories
– Example countermeasure “Rules of Practice”
• References given for all resource information
– Lessons learned charts (yellow background) were either developed
independently by Aerospace Engineering Associates (AEA) or extracted
from resource information
It ain’t what you don’t know that gets you into trouble.
It’s what you know for sure that just ain’t so.
Mark Twain
2 Day Outline
• Introduction
• The Practice of Failure Analysis
• Space Mission Record of Success
• General Management Lessons
• Lessons Learned from Specific Case Histories
– Screening Out Design Errors
– Impact of Weak Testing Practices
– Screening Out Procedural Errors
– System Engineering Lapses
– Mishaps Associated With Software
– When Processes Break Down
– Adverse Program Management Factors Can Produce Bad Outcomes
– A Piece Part Failure
– Not Everyone May Want the Project to Succeed
– Experienced Teams Make Mistakes
– Normalizing Deviance
– When Advanced Warnings are Missed
– The Perils of Heritage
2 Day Outline (concluded)
• Summary of Causes for the Foregoing Case
Histories
• The Unsuccessful Failure Investigation of Atlas
Centaur 70
• Common Cause Failures
• The Human Element
• Applying the Lessons: Sample “Rules of
Practice”
• One Strike and You’re Out! – Flight Termination
• Conclusions
Politicians are like diapers; They need to be changed often and for the same reason
Mark Twain
Historical Perspective
The Practice of Failure Analysis
Case                               Event
The Milan Cathedral                Wall collapse
The Tay Rail Bridge                Bridge collapse – 75 fatalities
Kansas City Hyatt Regency Skyway   Skyway collapse – 114 fatalities
American Airlines Flight 96        Separation of DC-10 aft cargo door – no fatalities
Turkish Airlines Flight 981        Separation of DC-10 aft cargo door – 346 fatalities
Tacoma Narrows Bridge              Bridge collapse
Russian R-16 ICBM                  Pad explosion – >120 fatalities
Historical Perspective: Prominent Failures from Across the
Spectrum of Engineering Endeavors
Possibly The Largest Disaster in the History of Rocketry!
• Baikonur Cosmodrome (Tyuratam), Kazakhstan (then USSR), 10/24/1960
• Preps for first test flight of R-16 ICBM
• Program rushed to launch on anniversary
of Bolshevik revolution (as a present for
Premier Khrushchev)
• Led by head of the Soviet Ballistic Missile
Forces, Marshal Mitrofan Nedelin
• 250 people on and around pad
– Viewing stand for visiting dignitaries
• Unsafe design and undisciplined
procedures caused 2nd stage ignition
• More than 120 people were killed, including
Nedelin
[Images: Mitrofan Nedelin; R-16 ICBM; Destroyed Pad and Memorial at Baikonur (Tyuratam)]
For additional information see “Rockets and People: Creating a Rocket Industry, Volume
II”, Boris Chertok, NASA History Series SP-2006-4110
Design Screens
A Quick Aside About Design Error “Screens”
GIVEN: Our design “machine” (humans) WILL produce errors at some >0 rate
[Diagram labels: Design Error; Design Error “Screens” (Design Review, Test); Unexpected Behavior]
“Engineers today, like Galileo three and a half centuries ago, are not superhuman. They
make mistakes in their assumptions, in their calculations, in their conclusions. That they
make mistakes is forgivable; that they catch them is imperative.” (1)
(1) “To Engineer is Human”; Henry Petroski, Vintage Books, 1992
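The screens idea can be made concrete with a small worked calculation. This is only a sketch under stated assumptions: each screen is treated as independent and as catching a fixed fraction of the errors that reach it. The function name `escaped_errors` and every number below are hypothetical, chosen for illustration and not taken from the presentation.

```python
def escaped_errors(errors_introduced, catch_rates):
    """Expected number of design errors that slip past every screen.

    Assumes (hypothetically) that screens act independently and that each
    one catches a fixed fraction of the errors reaching it.
    """
    remaining = float(errors_introduced)
    for rate in catch_rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("catch rate must be between 0 and 1")
        remaining *= (1.0 - rate)
    return remaining


# Hypothetical program: 100 design errors, design review catches 70%,
# testing catches 80% of what the review missed.
with_both_screens = escaped_errors(100, [0.70, 0.80])
# Same program with the test screen omitted.
review_only = escaped_errors(100, [0.70])

print(f"escapes with review + test: {with_both_screens:.1f}")
print(f"escapes with review only:   {review_only:.1f}")
```

Under these assumed rates, dropping one screen multiplies the number of escaped errors several times over, which is why an omitted test matters so much in the case histories that follow.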
Selected Mishaps
Genesis
• Underlying Issue: Omitted test combined with flawed
adaptation of heritage design
• Problem: Spacecraft failed to properly deploy drogue
chute (9/8/2004)
• Impact: Loss of some scientific data
Source: http://www.nasa.gov/pdf/149414main_Genesis_MIB.pdf;
Genesis Mishap Report, Dr. M. Ryschkewitsch Chairperson, 11/30/2005;
Presentation: Genesis Mishap Investigation and Stardust Entry, Dr. Mike
Ryschkewitsch and Pete Spidaliere
Genesis G-Switch Orientation
[Figure labels: Pyros; Velocity; As Installed; Acceleration to Activate Switch; Aerobraking Acceleration; Heatshield]
Genesis (cont’d)
• WHY: Improperly oriented gravity switch sensors
(inverted). Deficiencies in the following processes
resulted in the mishap:
− Design that inverted the G-switch sensor (a heritage design)
− Design reviews did not detect the error
− Verification processes did not detect the design error
• No tests were conducted that would reveal the problem
− Red Team review did not uncover the failure in the verification
process
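The inversion and the missing test can be illustrated with a toy model. This is a hedged sketch, not the Genesis avionics logic: the `GSwitch` class, the 3 g threshold, and the deceleration profile are all hypothetical. It models a switch that closes only when the acceleration along its own sensing axis exceeds a threshold, and shows that replaying a flight-like deceleration profile distinguishes a correct installation from an inverted one.

```python
from dataclasses import dataclass


@dataclass
class GSwitch:
    """Toy acceleration switch (hypothetical model, not flight hardware)."""
    threshold_g: float   # closes when sensed acceleration exceeds this
    orientation: int     # +1 = sensing axis as intended, -1 = inverted

    def closes(self, decel_g: float) -> bool:
        # The switch only sees the component along its own sensing axis.
        sensed = self.orientation * decel_g
        return sensed > self.threshold_g


def drogue_would_fire(switch: GSwitch, entry_profile_g) -> bool:
    """Test-like-you-fly check: replay an entry deceleration profile."""
    return any(switch.closes(g) for g in entry_profile_g)


# Hypothetical entry deceleration profile in g, rising then falling.
profile = [0.5, 1.0, 2.0, 4.0, 8.0, 4.0, 2.0]

correct = GSwitch(threshold_g=3.0, orientation=+1)
inverted = GSwitch(threshold_g=3.0, orientation=-1)

print("correct install fires drogue:", drogue_would_fire(correct, profile))
print("inverted install fires drogue:", drogue_would_fire(inverted, profile))
```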
Genesis (cont’d)
• The Board further identified ineffective systems
engineering as a root cause:
– Inadequate project and systems engineering management
– Inadequate systems engineering processes
– Inadequate review process
– Unfounded confidence in heritage designs
– Failure to “Test like you fly”
– Better/Faster/Cheaper philosophy - quote from MIB Report:
“Root Cause 6.1: Faster, Better, Cheaper (FBC) philosophy: Cost-capped
mission with threat of cancellation if overrun…
Findings:
• The project maintained the cost-cap, in part at the expense of adequate
technical oversight by JPL into LMSS Flight System and at the expense of
a complete and robust Systems Engineering function.
• The Agency was at fault for encouraging and accepting the FBC philosophy
as described above.”
Genesis (concluded)
LESSONS:
• Imposition of a concept (Better/Faster/Cheaper) absent sensible,
practical, and reliable implementation guidance is a recipe for
serious trouble
• Treat changed heritage designs as new designs
• Make it very difficult to change baselined* test plans
• Test like you fly – and pay attention to when you don’t
• Don’t let system reviews get superficial (checking the block)
*Those adopted after appropriate vetting activities
CONTOUR
• Underlying Issue: Erroneous prediction of
spacecraft thermal environment
• Problem: Spacecraft broke up following SRM firing
(8/15/2002)
• Impact: Loss of mission
CONTOUR (cont’d)
• Why: Spacecraft overheating caused by
improper installation of a “heritage” SRM
– Inadequate systems engineering process
– Inappropriate reliance on analysis by similarity
– Inadequate review function
– Dubious decision to omit telemetry coverage of motor
firing event
– Inadequate oversight, insight, and review of
subcontractors
– Inadequate communications between APL and ATK
– ATK models not specific to CONTOUR
– Limited understanding of the SRM plume heating
environments in space
– Limited understanding of CONTOUR SRM operating
conditions
Source: Contour Mishap Investigation Board Report, May 31, 2003;
http://klabs.org/richcontent/Reports/Failure_Reports/contour/contour.pdf
CONTOUR (concluded)
LESSONS:
• Heritage designs must be re-qualified for new applications
• Systems engineering is absolutely vital to mission success – in
this case it should have:
• Challenged the flawed heritage assumption
• Objected to the use of invalid models
• Insisted on a more complete understanding of SRM plume
heating
• Involve subcontractors early in the design process
• They need to understand and “buy in” to how their product is
integrated
Ariane 5
• Underlying Issue: Unwarranted reliance on
heritage software
• Problem: Forty seconds into maiden Ariane-5
flight (6/4/1996), vehicle veered off course and
broke up
• Impact: Loss of mission
• Why: Flight software error
– The flight software was programmed for Ariane-4
launch and trajectory conditions
• Didn’t account for higher horizontal velocity of Ariane-5
• Caused IRU software overflow error resulting in loss of
guidance information
• Never tested in conditions that simulated the Ariane-5
trajectory
Source: I-Shih Chang, Space Launch Reliability; http://sunnyday.mit.edu/accidents/Ariane5accidentreport.html
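The overflow mechanism can be sketched in a few lines. According to the inquiry board report cited above, the failure arose when a 64-bit floating-point value related to horizontal velocity was converted to a 16-bit signed integer. The sketch below is illustrative only: the function names and the sample values are hypothetical, and Python stands in for the original flight code.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if it does not fit."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Alternative handling: clamp to the representable range."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# Hypothetical values: one inside the range the heritage code was
# designed for, one larger value of the kind a faster trajectory produces.
heritage_case = 20000.0
new_vehicle_case = 50000.0

print(to_int16_checked(heritage_case))
print(to_int16_saturating(new_vehicle_case))
try:
    to_int16_checked(new_vehicle_case)
except OverflowError as err:
    print("conversion failed:", err)
```

A test that replays the new vehicle's trajectory range through the conversion, which is the "test like you fly" lesson, would exercise the failing branch before flight.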
Ariane 5 (concluded)
LESSONS:
• Technical experts need to push back against baseless
management directives
• Be very thorough in justifying dependence on previous
“heritage” hardware or software development/testing
• Have the decision to accept “heritage” verifications examined
in an IV&V mode
• Test like you fly and fly like you test
Lewis Spacecraft
• Underlying Issue: Misapplication of heritage system
• Problem: Spacecraft tumbled out of control 8/26/1997
• Impact: Loss of spacecraft
Lewis Spacecraft (cont’d)
• Why: Proximate Cause - Inoperable ACS safe mode
– Spacecraft had multiple anomalies during initial operations
• Contact lost for two orbits
• Reappeared in uncontrolled attitude mode
• Commanded to “safe mode”
– “Safe mode” adopted from Total Ozone Mapping Spacecraft
• Inherently unstable in Lewis application (no X-axis gyro)
– In spite of serious “cause unknown” anomalies, operations crew
entered rest period
• X-axis rates due to thruster imbalances
‒ Rates transferred to Y and Z axes (Polhode Motion)
‒ Computer shuts down excessive thruster firings
‒ Spacecraft rates transferred to principal moment of inertia axis
‒ Edge on to Sun - battery discharged ~ 72%
‒ Attempt to recover was flawed and failed
‒ Spacecraft went out of contact and was never reacquired
– Only one crew conducted all on-orbit operations (One 12 hour
shift/day)
• No crew on duty during significant periods when spacecraft in view of
ground station
[Figure labels: Safe Mode; X Axis Spin; Polhode Motion]
Source: Lewis Spacecraft Mission Failure Investigation Board Final Report, February 12, 1998
http://www.lr.tudelft.nl/live/pagina.jsp?id=a8b6dca2-92dc-4965-a64c-298189e5b58e&lang=en&binary=/doc/lewis_document.pdf
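The transfer of spin from one axis to the others can be illustrated with the torque-free Euler equations for a rigid body. This is a minimal sketch under loud assumptions: the inertia values are hypothetical, and treating X as the intermediate axis of inertia is an assumption made for illustration, not a Lewis mass property. The model ignores thrusters, energy dissipation, and the control system. It simply shows that a small disturbance grows when the body spins about its intermediate axis and stays small when it spins about its major axis.

```python
def euler_rates(w, inertia):
    """Torque-free Euler equations for a rigid body in principal axes."""
    wx, wy, wz = w
    ix, iy, iz = inertia
    return (
        (iy - iz) * wy * wz / ix,
        (iz - ix) * wz * wx / iy,
        (ix - iy) * wx * wy / iz,
    )


def rk4_step(w, inertia, dt):
    """One classical Runge-Kutta step for the body rates."""
    def add(a, b, s):
        return tuple(ai + s * bi for ai, bi in zip(a, b))

    k1 = euler_rates(w, inertia)
    k2 = euler_rates(add(w, k1, dt / 2), inertia)
    k3 = euler_rates(add(w, k2, dt / 2), inertia)
    k4 = euler_rates(add(w, k3, dt), inertia)
    return tuple(
        wi + dt / 6 * (a + 2 * b + 2 * c + d)
        for wi, a, b, c, d in zip(w, k1, k2, k3, k4)
    )


def peak_off_axis_rate(w0, inertia, spin_axis, t_end=40.0, dt=0.01):
    """Largest rate seen on any axis other than the initial spin axis."""
    w, peak = tuple(w0), 0.0
    for _ in range(int(t_end / dt)):
        w = rk4_step(w, inertia, dt)
        peak = max(peak, max(abs(w[i]) for i in range(3) if i != spin_axis))
    return peak


# Hypothetical principal inertias (Ix, Iy, Iz): X is the intermediate axis.
inertia = (2.0, 1.0, 3.0)

# Spin mostly about X (intermediate axis) with a small disturbance.
unstable_peak = peak_off_axis_rate((1.0, 0.01, 0.01), inertia, spin_axis=0)
# Spin mostly about Z (major axis) with the same small disturbance.
stable_peak = peak_off_axis_rate((0.01, 0.01, 1.0), inertia, spin_axis=2)

print(f"peak off-axis rate, spin about intermediate axis: {unstable_peak:.3f}")
print(f"peak off-axis rate, spin about major axis:        {stable_peak:.3f}")
```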
Lewis Spacecraft (cont’d)
• Root Causes:
– No mutual contractor/government understanding as to what
is meant by “Better/Faster/Cheaper” leading to:
• Requirements changes without adequate resource adjustment
• Undue cost and schedule pressures
• Inadequate ground station availability for initial operations
• Frequent key personnel changes
• Inadequate engineering discipline
• Inadequate management discipline
• Active NASA oversight and management absent
– Senior management imposition of an ill-defined concept
(Better/Faster/Cheaper)*
*While the BFC thrust was abandoned after multiple disappointing outcomes, vestiges (both good and bad) remain.
Lewis Spacecraft (cont’d)
LESSONS:
With respect to the proximate cause:
• “Heritage” hardware/software is often a trap
• Flag any proposed use of heritage designs for special attention
• Challenge applicability and understand its qualification history
• Make certain that the true heritage (especially the limitations) is
fully understood
• Even presumably qualified heritage items need to be
functionally tested in the way they will fly!
Lewis Spacecraft (concluded)
LESSONS: (concluded)
With respect to the root causes:
• Imposition of a concept (Better/Faster/Cheaper) absent
sensible, practical, and reliable implementation guidance is
a recipe for serious trouble
• Take great care to select qualified people to run a program; when it’s clear they’re not right for the job, replace them
Causation Summary
Causation Analysis – Breakdown by Category
[Pie chart: Distribution of Proximate Causes: Design 69%, Prod/Ops 23%, Pgm Mgt 8%]
[Pie chart: Distribution of Root Causes: Pgm Mgt and Sys Engr (41% and 51%, as extracted), Prod/Ops 8%]
[Bar chart: 25 Design Proximate Causes, Nature of Deficiencies; categories: Analysis, Qual Test, Dev Test, Sim, Heritage; numeric labels as extracted: 27, 12, 10, 7, 7, 1]
Observations
• Only one of the 39 cases analyzed (Atlas Centaur 24)
had failure of a piece part as the cause!
– Programs are doing a good job of acceptance testing
• The other 38 were associated with human error:
management weaknesses, systems engineering
shortcomings, etc.
• Therefore, it is necessary that risk assessments be
based on data that somehow reflects human error
Facts are stubborn things, but statistics are pliable.
Mark Twain
Observations (concluded)
• Programs that adopt a zero-based approach to
testing are betting on the ability of the engineering
community to foresee all aspects of system
performance under all conditions
– This is a very risky bet!
History demonstrates that tests frequently, if not
usually, produce unexpected (and unwanted) results
Applying the Lessons:
A Sample Set of “Rules of Practice”
• Issue: Many lessons learned have common themes. The issue is
to systematically infuse this knowledge into programs so they’re
not lessons forgotten
• One approach: For large and complex programs, impose a
Program specific set of overarching “Rules of Practice” that
govern how certain things are to be done (i.e. to codify some of
the lessons)
− Any deviation from these “Rules” would be cause for special attention (risk
management) by Program Management
− These ad hoc “Rules” would not take the place of existing design
standards or similar tools, but rather provide an additional mechanism to
flag when special action is warranted
Applying the Lessons:
A Sample Set of “Rules of Practice” (cont’d)
• Advance Warning: (Causal in 17 of 39 cases)
− An effective system for facilitating communication between those
concerned about a potential safety-of-flight problem and those in a
position to reconcile it is to be designed and embedded in the
Program culture (easier said than done - but surely it’s doable!). It
must be:
• Formal and visible.
• Reliable (if not foolproof).
• Simple to use with quick feedback.
• Plugged into real authority to stop the action.
• Culturally valued and respected.
Applying the Lessons:
A Sample Set of “Rules of Practice” (cont’d)
• Analytical Modeling: (Causal in 12 of 39 cases)
− All analytical modeling on which designs are based will be
test-validated and acquired from at least two independent sources.
− An independently validated plume heating analysis is required of all
systems employing a new propulsion arrangement.
• Heritage Items: (Contributing cause in 12 of 39 cases)
− Any item adopted for use based on successful flight performance in
another program will be deemed unqualified in the adopting
application until a thorough analysis has been performed to confirm
that the adopting application is identical (or less demanding) in all
relevant features to the prior successful application.
− Any deviations must be qualified by test.
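The heritage rule above amounts to an envelope comparison, which can be sketched as follows. This is an illustration under assumptions, not a qualification tool: the function name `heritage_deviations`, the feature names, and all numbers are hypothetical, and "more demanding" is simplified to "numerically larger".

```python
def heritage_deviations(prior_envelope, new_environment):
    """Return the features where the new application exceeds prior heritage.

    Both arguments map a feature name to the most demanding level seen
    (larger = more demanding). A feature that is missing from the prior
    envelope counts as a deviation, since nothing shows it was covered.
    """
    deviations = {}
    for feature, required in new_environment.items():
        qualified = prior_envelope.get(feature)
        if qualified is None or required > qualified:
            deviations[feature] = (qualified, required)
    return deviations


# Hypothetical numbers for illustration only.
prior = {"horizontal_velocity_m_s": 1000.0, "plume_heating_kw_m2": 50.0}
new = {
    "horizontal_velocity_m_s": 1400.0,   # more demanding than heritage
    "plume_heating_kw_m2": 40.0,         # within heritage
    "vibration_grms": 9.0,               # never assessed in prior program
}

gaps = heritage_deviations(prior, new)
status = "qualified by similarity" if not gaps else "unqualified: test required"
print(status)
for feature, (had, need) in gaps.items():
    print(f"  {feature}: heritage={had}, new application={need}")
```

Treating an unassessed feature as a deviation is a deliberate choice here: the rule deems the item unqualified until the analysis confirms coverage, so missing evidence should fail the check.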
Applying the Lessons:
A Sample Set of “Rules of Practice” (cont’d)
• Software: (Causal in 6 of 39 cases: Ariane 501, Titan IVB-32, SOHO, MCO, MPL, DART)
− All software development, testing, and application processes will
be controlled by a single, formal, configuration-managed
Software Management Plan for which a single individual is
responsible.
• Testing provided for in this plan will specifically include:
– Demonstration of proper flight software operation in nominal and
off-nominal flight simulation functional testing; this will be done with
flight hardware to the greatest extent possible.
– Formal “qualification” and “acceptance” testing of flight critical
software “end items” prior to controlled “release” for use.
• The plan will also provide for periodic, independent verification
that the original requirements remain valid.
Applying the Lessons:
A Sample Set of “Rules of Practice” (concluded)
• General Engineering Management Practices: Certain practices will
constitute required standard operating procedures:
− Rationale Documentation: It will be mandatory to systematically record the
rationale associated with all engineering products such as design and
operational requirements, procedures, test parameters, processes, design
choices, specifications, etc., and to place the rationale as close to the item it
relates to as possible.
− Assumptions: All assumptions that form the foundation for engineering
activities (analyses, test or not-to-test decisions, trade studies, design
approaches, etc.) will be explicitly stated and documented. A process for
validating, and periodically revalidating, the assumptions will be initiated.
• Etc. (This is a sampling – not an all inclusive list. Certainly, Project
specific “Rules” are also appropriate.)
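The Assumptions rule above (state each assumption explicitly, then validate and periodically revalidate it) could be supported by a simple register. The sketch below is one possible shape for such a register, offered as an assumption rather than a prescribed tool: the `Assumption` fields, the review intervals, and the example entries are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Assumption:
    statement: str              # the assumption, stated explicitly
    supports: str               # the analysis, test decision, or trade it underpins
    last_validated: date
    revalidate_every_days: int

    def is_due(self, today: date) -> bool:
        due_date = self.last_validated + timedelta(days=self.revalidate_every_days)
        return today >= due_date


def due_for_revalidation(register, today):
    """Return the assumptions whose revalidation interval has elapsed."""
    return [a for a in register if a.is_due(today)]


# Hypothetical register entries for illustration only.
register = [
    Assumption(
        statement="Heritage motor thermal model applies to the new installation",
        supports="thermal analysis",
        last_validated=date(2011, 1, 10),
        revalidate_every_days=90,
    ),
    Assumption(
        statement="Safe mode behaves the same as on the prior spacecraft",
        supports="decision not to retest safe mode",
        last_validated=date(2011, 3, 1),
        revalidate_every_days=180,
    ),
]

overdue = due_for_revalidation(register, today=date(2011, 6, 1))
for item in overdue:
    print(f"REVALIDATE: {item.statement} (supports: {item.supports})")
```

Recording what each assumption supports keeps the rationale close to the item it relates to, in the spirit of the Rationale Documentation rule.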
The Message
• Some may say that the foregoing rules are rather boring
- Nothing earthshaking - all pretty routine
But that’s exactly the point!
• Rigorous implementation and infusion of quality into all
aspects of routine, common sense practices will prevent
most mission failures
• It’s really not rocket science!
Conclusions
Conclusions – Stuff Happens
• Most mishaps can be broadly attributed to human error, not rocket
science
– Lack of complete understanding of how complex systems interact with each other
– Inadequate attention to every detail
– Flawed analyses or tests
– Improper use of “heritage” systems
– Flawed processes
– Flawed understanding of how software fails
– Reaction to budget or schedule pressure
– Imperfect management
• Often, a complex, subtle, sequence of events is needed
– If just one event in the chain were prevented, the failure would not have happened
• Must ensure quality in all the above areas
• Essential for mission success
• Over decades, the same root causes of failures appear repeatedly
• There are few new ones!
Conclusions – About Learning From Past Incidents
• Sometimes we do, but the process is haphazard
• Those involved learn what to do and/or what not to do
– But eventually they disappear taking with them:
• The nuances of causation
• Factors omitted from the official record
• The lessons themselves (often) and their underlying rationale
– Mishap Reports and Lessons Learned Data Bases (which have
come a long way) are what’s left but:
• Relevant information may be missing
• They lack the live element (the passion) and,
• Nothing beats talking to those who “were there”
Conclusions (cont’d)
• Basically, there is no universally successful approach
to learning the lessons from the past
• What’s needed is a dependable process that:
– Uncovers root causation from those involved and/or the
documentation
– Develops and promulgates “Rules of Practice” as countermeasures
• Organizations desiring to profit from applying lessons
previously learned should develop their own tailored
approaches
– Should be included in the Project Plan
In the end, lessons are still best learned as a “contact sport”
P. O. Box 40448
Bay Village OH 44140
www.aea-llc.com
Joe Nieberding, President
Email: joenieber@sbcglobal.net
Cell: 440-503-4758
MISSION
AEA’s mission is to leverage the vital lessons
learned by NASA’s spacefaring pioneers to
strengthen the skills of today’s aerospace
explorers.
Larry Ross, CEO
Email: ljross1@att.net
Cell: 440-227-7240