Risk Module: Risk Management, Fault Trees and Failure Mode Effects Analysis
Space Systems Engineering, version 1.0
Space Systems Engineering: Risk Module

Module Purpose: Risk
• To understand risk, risk management, fault tree analysis and failure mode effects analysis in the context of project development
• Acknowledge that risks are inevitable and recognize that through systematic management and analytic techniques they can be reduced
• Review three techniques that are used to discover, assess, rank and mitigate risk: risk management, fault tree analysis and failure mode effects analysis

What are Risks and Risk Management?
• Risks are potential events that have negative impacts on safety or project technical performance, cost or schedule
• Risks are an inevitable fact of life; risks can be reduced but never eliminated
• Risk Management comprises purposeful thought to the sources, magnitude, and mitigation of risk, and actions directed toward its balanced reduction
• The same tools and perspectives that are used to discover, manage and reduce risks can be used to discover, manage and increase project opportunities (opportunity management)

What is Risk Management?
Risk management is a continuous and iterative decision making technique designed to improve the probability of success.
It is a proactive approach that:
• Seeks or identifies risks
• Assesses the likelihood and impact of these risks
• Develops mitigation options for all identified risks
• Identifies the most significant risks and chooses which mitigation options to implement
• Tracks progress to confirm that cumulative project risk is indeed declining
• Communicates and documents the project risk status
• Repeats this process throughout the project life

Risk Management Considers the Entire Development and Operations Life of a Project
Risk Type: Example
• Technical Performance Risk: Failure to meet a spacecraft technical requirement or specification during verification
• Cost Risk: Failure to stay within a cost cap for the project
• Programmatic Risk: Failure to secure long-term political support
• Schedule Risk: Failure to meet a critical launch window
• Liability Risk: Spacecraft deorbits prematurely causing damage over the debris footprint
• Regulatory Risk: Failure to secure proper approvals for launch of nuclear materials
• Operational Risk: Failure of spacecraft during mission
• Safety Risk: Hazardous material release while fueling during ground operations
• Supportability Risk: Failure to resupply sufficient material to support human presence as planned

Every NASA Space Flight Project Begins with a Plan for Risk Management
This plan reflects the project’s risk management philosophy:
• Priority (criticality to long-term strategic plans)
• National significance
• Mission lifetime (primary baseline mission)
• Estimated project life cycle cost
• Launch constraints
• In-flight maintenance feasibility
• Alternative research opportunities or re-flight opportunities
The risk management philosophy is reflected in a number of ways:
• Whether single point failures are allowed
• Whether the system is monitored continuously during operations
• How much slack is in the development schedule
• How technical resource margins (e.g., mass, power, MIPS) are allocated throughout the development

Other Factors to Consider in Assessing Risk (but not limited to)…
• Complexity of management and technical interfaces
• Design and test margins
• Mission criticality
• Availability and allocation of resources such as mass, power, volume, data volume, data rates, and computing resources
• Scheduling and manpower limitations
• Ability to adjust to cost and funding profile constraints
• Mission operations
• Data handling, i.e., acquisition, archiving, distribution and analysis
• Launch system characteristics
• Available facilities

Risk Identification
• Risks are identified by the development team, peer reviews, lessons from past projects and expert review
• Lessons from past projects are captured via ‘trigger questions’, or questions that challenge a development strategy or design solution
• The project risk status and top ten risk list are reviewed periodically, usually monthly, and at the project milestone reviews

Example Risk Trigger Questions
• Have requirements been implemented such that a small change in requirements has the potential to cause large cost, performance or schedule system ramifications?
• Do designs or requirements push the current state-of-the-art?
• Has the concept for operating, maintaining, decommissioning or disposal of the system been adequately defined to ensure the identification of all requirements?
• Has an independent cost estimate (ICE) been performed?
• Is the schedule adequate to handle the level of requirements or objectives changes that are occurring or are likely to occur?
• Have the necessary facilities for environmental test been identified and availability problems been resolved?

More Considerations for Risk Discovery
While each space project has its unique risks, a list of the underlying sources of risks would include the following:
• Technical complexity: many design constraints or many dependent operational sequences having to occur in the right sequence and at the right time
• Organizational complexity: many independent organizations having to perform with limited coordination
• Inadequate margins or reserves
• Inadequate implementation plans
• Unrealistic schedules
• Total and year-by-year budgets mismatched to the actual implementation risks
• Over-optimistic designs pressured by mission expectations
• Limited engineering analysis and understanding due to inadequate engineering tools and models
• Limited understanding of the mission’s space environments
• Inadequately trained or inexperienced project personnel
• Inadequate processes or inadequate adherence to proven processes

Pause and Learn Opportunity
Engage the class in identifying risks for a familiar project.
• What kinds of risks are identified?
• What is the basis for their search for risks?
After the class has thought for a while, the instructor could present some trigger questions which may help discover new risks and show the value of the trigger questions.

Cartoon: Dilbert Identifies Risks
© United Features Syndicate, Inc.

The Benefits of Preparing for the Unexpected
Mars Spirit Rover Flash Memory Problem
Background: On January 21, 2004 (Sol 18), Spirit abruptly ceased communicating with mission control. The next day the rover radioed a 7.8 bit/s beep, confirming that it had received a transmission from Earth but indicating that the spacecraft believed it was in a fault mode.
“The thing that strikes me most about all this is how critical it was to have that INIT_CRIPPLED command in the system.
It’s not the kind of command that you’d ever expect to use under normal conditions on Mars. But back during the earliest days of the project Glenn realized that someday we might need the flexibility to deal with a broken flash file system, and he put INIT_CRIPPLED in the system and left it there. And when the anomaly hit, it saved the mission.”
From “Roving Mars” by Steve Squyres, Hyperion 2005
Be prepared for the low-probability event with a huge consequence.

After Identification Risks are Assessed
• Risks are assessed by characterizing the probability that a project will experience an undesired event and the consequences, impact or severity of the undesired event, were it to occur
• Risks can be compared on iso-curves consisting of a likelihood measure and a consequence measure
• Since the assessment of the likelihood and consequence of a risk is both subjective and has significant uncertainty, the characterization of risk is either qualitative (low, medium or high) or semi-quantitative (risks are captured on a 5x5 matrix)
[Figure: iso-risk curves plotted as Likelihood (Probability), from 0.0 to 1.0, versus Severity of Consequence, separating Low Risk, Medium Risk and High Risk regions]

An Example of Some Semi-Quantitative Definitions to Enable a Project to Compare and Rank Risks
Scales defined: Probability of Occurrence; Impact of Consequences

Probability of Occurrence
Scale  Measure
5      Near certain to occur (80-100%)
4      Highly likely to occur (60-80%)
3      Likely to occur (40-60%)
2      Unlikely to occur (20-40%)
1      Not likely; improbable (0-20%)
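
The probability-of-occurrence scale can be expressed as a small lookup, which is handy when many risks have to be scored consistently. The following is a minimal sketch, not part of the original module: the function name `likelihood_scale` is hypothetical, and because the bands on the slide share their endpoints (20% appears in both scale 1 and scale 2, for example), placing a boundary value in the lower band is an assumption.

```python
def likelihood_scale(probability: float) -> int:
    """Map a probability in [0, 1] to the 1-5 likelihood scale from the slide."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    # Upper bounds of scales 1 to 4; anything above 0.8 is scale 5.
    # Assumption: a value exactly on a boundary is placed in the lower band.
    upper_bounds = [0.2, 0.4, 0.6, 0.8]
    for scale, bound in enumerate(upper_bounds, start=1):
        if probability <= bound:
            return scale
    return 5


if __name__ == "__main__":
    for p in (0.05, 0.35, 0.5, 0.95):
        print(f"P = {p:.2f} -> likelihood scale {likelihood_scale(p)}")
```

A project that prefers the opposite boundary convention would only need to change `<=` to `<`.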

Consequence classes, each with Technical, Schedule and Cost criteria:
Class I, Catastrophic (Scale 5)
• Technical: A condition that may cause death or permanently disabling injury, facility destruction on the ground, or loss of crew, major systems, or vehicle during the mission
• Schedule: Launch window to be missed
• Cost: Cost overrun > 50% of planned cost
Class II, Critical (Scale 4)
• Technical: A condition that may cause severe injury or occupational illness, or major property damage to facilities, systems, equipment, or flight hardware
• Schedule: Schedule slippage causing launch date to be missed
• Cost: Cost overrun 15% to 50% of planned cost
Class III, Moderate (Scale 3)
• Technical: A condition that may cause minor injury or occupational illness, or minor property damage to facilities, systems, equipment, or flight hardware
• Schedule: Internal schedule slip that does not impact launch date
• Cost: Cost overrun 2% to 15% of planned cost
Class IV, Negligible (Scale 2)
• Technical: A condition that could cause the need for minor first aid treatment but would not adversely affect personal safety or health; damage to facilities, equipment, or flight hardware more than normal wear and tear level
• Schedule: Internal schedule slip that does not impact internal development milestones
• Cost: Cost overrun < 2% of planned cost

A 5x5 Risk Matrix Provides a Quick Visual Comparison of All Project Risks
• High risk: mission success jeopardized, immediate action required
• Medium risk: review regularly, contingent action if it does not improve
• Low risk: watch and review periodically

Top Risks and their Trends are Periodically Reviewed for the SOFIA Project
SOFIA Risk Matrix
Rank  Risk ID  Approach  Risk Title
1     DFRC-34  R         Landing Gear Door System Failure
2     DFRC-12  M         Sched Integration problems structure vs. avionics
3     DFRC-07  W         Cost growth for engine components
4     DFRC-24  A         Quality Control Resources insufficient
5     DFRC-01  W         Avionics software behind schedule
6     DFRC-11  R         Payload Capacity & Volume Trade-offs design issues
7     DFRC-04  R         Limited Flight Envelope, due to technical issues
8     DFRC-02  R         More flight testing may be required for Soft V&V
Approach: M - Mitigate, W - Watch, A - Accept, R - Research
[Figure: 5x5 matrix of Likelihood (1-5) versus Consequences (1-5) with the ranked risks plotted by number; Criticality (L x C) shaded High, Med or Low; trend symbols for Decreasing (Improving), Increasing (Worsening), Unchanged, and New Since Last Period]

Top Risks and their Trends are Periodically Reviewed for the Constellation SE&I
SE&I Top Risk List
[Figure: 5x5 matrix of Likelihood versus Consequence with the ranked risks plotted by number, beside a table with columns Rank, Trend, Title, Owning Team, Likelihood, and Consequence (Safe, Perf, Sched, Cost). Legend: Decreasing (Improving), Increasing (Worsening), Unchanged, N (new); Top Directorate Risk (TDR), Top Program Risk (TPR), Top Project Risk (TProjR)]
Risks listed:
• 1677 - Ares I/Orion Ascent Aeroacoustic Environments
• 1676 - Structural loads on CEV and LSAM during TLI
• 1125 - Software Development and Assurance
• 1603 - (SRR) Abort Site Sea State Limits Launch Availability
• 1135 - Program Visibility for Closing the Architecture
• 1122 - Requirements Maturation
• 1195 - CxP Lifecycle cost
• 1046 - Tailoring of Human-Rating requirements

The Status of the Most Significant Risks and Their Mitigation Options are Reviewed Periodically
• Title of risk
• Description or root cause
• Possible categorizations
  - System or subsystem
  - Cause category (technology, programmatic, cost, schedule, etc.)
  - Resources affected (budget, schedule slack, technical margins, etc.)
• Owner
• Assessment of implementation risk or mission risk
  - Likelihood: estimate of the probability of the risk event
  - Consequences: estimate of the performance, cost, safety and schedule effects
• Mitigation
  - Description, including costs of mitigation options
  - Mitigation option leverage or reduction in the assessed risk
  - Current mitigation activities
  - Current trends in risk significance: likelihood and impact
• Significant milestones
  - Opening and closing of the window of occurrence
  - Decision points for mitigation implementation effectiveness

Part 2 of Risk Module: Fault Tree Analysis, Event Tree Analysis

Fault Tree Analysis Supports Design Decisions and Failure Investigations
• Fault Tree Analysis (FTA) uses a top-down symbolic logic model and estimates of failure probabilities of ‘initiators’ to estimate the probability of occurrence (failure) of the pre-determined, undesirable, ‘top’ event
• An initiator is a credible undesirable event that is a contributing cause to top event failure
• ‘Cut sets’ are groups of initiators that, when taken together, cause top event failure
• ‘Path sets’ are groups of initiators such that, if none of them occurs, the top event does not occur
• FTA is both a design and a diagnostic tool
  - As a design tool, FTA is used to compare alternative design solutions and the resulting TOP event probability
  - As a diagnostic tool, FTA is used to investigate scenarios that may have led to the TOP event failure, leading to an estimate of the most likely cut sets

Fault Tree Analysis
Fault tree analysis is a graphical representation of the combination of faults that will result in the occurrence of some (undesired) top event. In the construction of a fault tree, successive subordinate failure events are identified and logically linked to the top event. The linked events form a tree structure connected by symbols called gates.
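
To make the role of gates concrete, the sketch below evaluates a very small fault tree with the two most common gates, AND and OR. It is an illustration under stated assumptions rather than the module's own method: initiators are assumed to be statistically independent, and the tree itself (two redundant pumps plus a power supply) and every probability in it are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Output event occurs only if ALL inputs occur: P = product of p_i."""
    return prod(probabilities)


def or_gate(probabilities):
    """Output event occurs if ANY input occurs: P = 1 - product of (1 - p_i)."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical initiator probabilities (assumed independent).
pump_a_fails = 0.1
pump_b_fails = 0.1
power_loss = 0.01

# Hypothetical tree: TOP = OR(power_loss, AND(pump_a_fails, pump_b_fails)).
# Its minimal cut sets are {power_loss} and {pump_a_fails, pump_b_fails}.
both_pumps_fail = and_gate([pump_a_fails, pump_b_fails])
top_event = or_gate([power_loss, both_pumps_fail])

if __name__ == "__main__":
    print(f"P(both pumps fail) = {both_pumps_fail:.4f}")
    print(f"P(top event)       = {top_event:.4f}")
```

Used as a design tool, the same two functions let alternative designs be compared: removing the redundant pump turns the AND gate into a single initiator and raises the top event probability accordingly.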

Refer to NASA Reference Publication 1358: System Engineering “Toolbox” for Design-Oriented Engineers
Section 3.6: Fault Tree Analysis (Handout)
Particular points:
• And/Or Gates explanation
• Example Fault Tree (Fig 3-20)

Event Trees
• Event trees can be viewed as a special case of fault trees, where the branches are all ORs weighted by their probabilities.
• Event trees are generated both in the success and failure domains.
• This technique explores system responses to an initiating “challenge” and enables assessment of the probability of an unfavorable or favorable outcome. The system challenge may be a failure or fault, an undesirable event, or a normal system operating command.
• In constructing the event tree, one traces each path to eventual success or failure.
• This technique is typically performed in phase C but may also be performed in phase B.
• See NASA Reference Publication 1358: System Engineering “Toolbox” for Design-Oriented Engineers section 3.8 for additional discussion.

Will the Stage Make it from Hangman’s Hill to Placer Gulch?
Station   Probability of no horses
1, 2, 3   0.2
4         0.1
Placer Gulch event tree example from a Safety & Mission Assurance training course by Pat Clemons of Sverdrup.
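
The idea of tracing each path to eventual success or failure can be sketched with the station probabilities from the table. The route diagram is not reproduced in this text, so the layout used here is purely an assumption for illustration: a single route that passes stations 1 through 4 in order, independent stations, and a trip that ends at the first station with no horses. The function name `event_tree` and the variable names are likewise hypothetical.

```python
# Probability of finding no horses at each station (values from the slide).
p_no_horses = {1: 0.2, 2: 0.2, 3: 0.2, 4: 0.1}


def event_tree(route):
    """Walk the route one station at a time.

    Returns (P(success), {station: P(trip ends at that station)}).
    Assumes independent stations and a trip that ends at the first
    station without horses (an illustrative assumption).
    """
    p_reach = 1.0  # probability that the stage has made it this far
    failures = {}
    for station in route:
        p_fail = p_no_horses[station]
        failures[station] = p_reach * p_fail  # failure branch ends here
        p_reach *= 1.0 - p_fail               # success branch continues
    return p_reach, failures


# Assumed single route through all four stations in order.
p_success, failures = event_tree([1, 2, 3, 4])

if __name__ == "__main__":
    for station, p in failures.items():
        print(f"Trip ends at station {station}: {p:.4f}")
    print(f"Stage reaches Placer Gulch: {p_success:.4f}")
    print(f"Sum over all branches:      {p_success + sum(failures.values()):.4f}")
```

Because every branch of an event tree ends in exactly one outcome, the branch probabilities sum to 1, which is a useful check on any hand-built tree. A route with alternative paths would need a different tree from the one assumed here.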

Fault Tree Analysis of the Placer Gulch Stage

Part 3 of Risk Module: Failure Mode Effects Analysis

Failure Mode Effects Analysis
• Objective
  - To ensure all failure modes have been identified and evaluated
• Technique
  - Select a method to rank project failure modes
  - Identify failure modes including all single point failure modes
  - Analyze failure modes and their mission effect
  - Determine those failure modes that might benefit from corrective action, e.g., alternative designs, redundancy, increased reliability
  - Determine which, if any, corrective actions will be implemented

Failure Mode Effects Analysis
FMEA is a design tool for identifying risk in the system or mission design, with the intent of mitigating those risks with design changes.
The FMEA risk mitigation:
1. Recognizes and evaluates the potential failure of a system and its effects;
2. Identifies actions which could eliminate or reduce the chance of a potential failure occurring.
FMEA is initiated in Phase B (Preliminary Design) and used to support design decisions in Phase C (Final Design).

Failure Mode and Effects Analysis
FMEA worksheet columns and the question each one answers:
• Item / Function: What are the functions or requirements?
• Potential Failure Mode: What can go wrong?
  - No Function
  - Partially Degraded Function
  - Intermittent Function
  - Unintended Function
• Potential Effects of Failure: What are the effects?
• Sev (severity): How bad is it?
• Class
• Potential Causes/Mechanism(s) of Failure: What are the cause(s)?
• Occur (occurrence): How often does it happen?
• Current Controls, Prevention/Detection: How can this be prevented and detected?
• Detec (detection): How good is this method at detecting it?
• RPN (risk priority number)
• Recommended Action(s): What can be done?
  - Design changes
  - Process changes
  - Special controls
  - Changes to standards, procedures, or guides
• Responsibility & Target Completion Date: Who is going to do it and when?
• Action Results (Actions Taken, Sev, Occ, Det, RPN): What did they do and what are the outcomes?

Module Summary: Risk
• Risk is inevitable; risks can be reduced but not eliminated.
• Risk management is a proactive systematic approach to assessing risks, generating alternatives and reducing cumulative project risk.
• Fault Tree Analysis is both a design and a diagnostic tool that uses estimates of the failure probabilities of initiators to estimate the probability of the pre-determined, undesirable, ‘top’ event.
• Failure Mode Effects Analysis is a design tool for identifying risk in the system design, with the intent of mitigating those risks with design changes.

Backup Slides for Risk Module

Uncertainties that Plague Projects
Uncertainties, the questions they raise, and their offsets:
• Mission Objectives
  - Questions: Will the baseline system satisfy the needs & objectives? Are they the best ones?
  - Offsets: Thorough study; analyses; cost & schedule credibility
• Technical Factors
  - Questions: Can baseline technology achieve the objectives? Can the specified technology be attained? Are all the requirements known?
  - Offsets: Technology development plan; paper studies; design reviews; establish performance margins; engineering model test and prototyping; test & evaluation
• Internal Factors
  - Questions: Can the plan and strategy meet the objectives?
  - Resources: Manpower skills; time; facilities
  - Offsets: Program strategy; budget allocations; contingency planning
• External Factors
  - Questions: Will outside influences jeopardize the project?
  - Offsets: Contingency; robust design

Project Risk Categories
Typical Technical Risk Sources
• Physical properties
• Material properties
• Radiation properties
• Testing/Modeling
• Integration/Interface
• Software Design
• Safety
• Requirement changes
• Fault detection
• Operating environment
• Proven/Unproven technology
• System complexity
• Unique/Special Resources
• COTS performance
Typical Programmatic Risk Sources
• Material availability
• Personnel availability
• Personnel skills
• Safety
• Security
• Environmental impact
• Communication problems
• Labor strikes
• Requirement changes
• Stakeholder advocacy
• Contractor stability
• Funding continuity and profile
• Regulatory changes
Typical Supportability Risk Sources
• Reliability and maintainability
• Training
• Operations and support
• Manpower considerations
• Facility considerations
• Interoperability considerations
• System safety
• Technical data
• Embedded training
Typical Cost Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to schedule risk
• Labor rates
• Estimating error
Typical Schedule Risk Sources
• Sensitivity to technical risk
• Sensitivity to programmatic risk
• Sensitivity to supportability risk
• Sensitivity to cost risk
• Degree of concurrency
• Number of critical path items
• Estimating error