Reliability Risk Assessment
Ray Barlog, PE
March 24, 2015
“Service Measured to the Standard”
Cornerstone Electrical Consultants, Inc.

Safety and Reliability
• Both deal with uncertainty and aim to reduce undesired outcomes
• Safety is mostly concerned with avoiding harm to humans
• Reliability is most often concerned with reducing economic losses ($$)

Risk
An event that has a negative consequence and has a probability of occurring (not an opportunity)
• Risk = Likelihood x Consequence
• Reliability Risk = Failure Probability x $$ Impact
• Reliability risks are often not constant across time

Risk
Do We Want To...
• Eliminate Risk?
• Reduce Risk?
• Manage Risk?

Risk Management Process
[Process diagram with steps labeled Identify, Analyze, Evaluate, Control, and Respond; an Assessment grouping is also labeled]

Risk Assessment
The process of identifying, analyzing, evaluating, and prioritizing risks

Some Reliability Risk Assessment Methods
• Functional FMEA
• Process FMEA
• Equipment FMEA
• Expected Value FMEA
• Concept FMEA
• Fault Tree Analysis
• Qualitative Fault Tree
• Event Tree Analysis
• What If Analysis
• Bow Tie Analysis
• Layer of Protection Analysis
• Markov Analysis
• RAM Modeling
• Stochastic Life Cycle Cost

3 Reliability RA Tools
• Functional Failure Mode and Effects Analysis (FMEA)
• Bow Tie Analysis
• Reliability, Availability, Maintainability (RAM) Modeling

FMEA
• Probably the most common reliability risk assessment tool
• Structured method
• Works best with a team of diverse backgrounds

FMEA
• Came from Military Procedure MIL-P-1629, Procedures for Performing a Failure Mode, Effects and Criticality Analysis, dated November 9, 1949
• Used and improved by NASA in the 1960s to improve and verify the reliability of space program hardware
• MIL-STD-1629A is used in the military and by commercial industry
• Used in the nuclear power industry for evaluating design risks
• SAE J1739 is an FMEA standard used in the auto industry

FMEA asks the questions
• What is the intended function?
• How does it fail? (failure mode)
• How often do we expect the failure to occur?
• How severe are the effects?
• What are the potential causes of the failure?
• How likely is the onset of failure to be detected?

Common Example
Objective: Determine the most critical risk and its cause(s) for this boiler feed water system.
[Figure: boiler feed water system with pumps P-1, P-2, P-3]
If 2 pumps fail, both boilers trip.

Risk Rating Factors

Rating | Severity | Occurrence (qualitative) | Failure rate (per yr)
1 | Less than $50k | Likelihood of occurrence is remote | 1.00E-06
2 | $50k to $100k | Low failure rate with supporting documentation | 1.00E-05
3 | $100k to $500k | Low failure rate without supporting documentation | 1.00E-04
4 | $500k to $1mm | Occasional failures | 1.00E-03
5 | $1mm to $5mm | Medium failure rate | 1.00E-02
6 | $5mm to $10mm | Moderately high failure rate | 1.00E-01
7 | $10mm to $100mm | High failure rate | 1

Detection | Certainty
Almost certain that the potential failure will be found or prevented before producing an economic loss | 100%
Current controls may or may not detect impending failure | 50%
Current controls probably will not detect the potential failure | 0%

FMEA Worksheet
Subsystem: Boiler Feed Pump System
Function of Subsystem: Deliver feedwater to boilers at 2 mmpph rate
Potential Failure Mode: Loss of ALL feed water flow

Potential Causes | OCC | Potential Failure Effects | SEV | DET | RPN | Current Controls | Recommended Actions | Action Owner
2oo3 pumps fail simultaneously due to seal failure | 4 | Boilers trip. Production loss of $100k per day x 5 days plus $50k pump repair cost. Total $550k loss | 4 | 2 | 32 | Manual condition monitoring for vibration | Consider continuous vibration monitoring | Joe Engineer
1 pump fails and auto-start for standby fails | 3 | Boilers trip. Production loss of $100k per day x 5 days plus $60k repair cost. Total $560k loss | 3 | 3 | 27 | Periodic testing of auto-start | None | NA
Loss of Station Service Bus B | 4 | Boilers trip. Production loss of $100k per day x 15 days plus $100k repair cost. Total $1.6mm loss | 5 | 2 | 40 | Periodic ultrasonic corona testing | None | NA
Pump 1 fails and Station Service Bus B fails | 3 | Boilers trip. Production loss of $100k per day x 15 days plus $50k repair cost. Total $1.55mm loss | 5 | 2 | 30 | Periodic ultrasonic corona testing | None | NA

Bow Tie Analysis
A simple graphical tool that shows the link between potential causes, preventive and mitigating controls, and consequences of a risk event
• Shows at a glance how risks are managed
• Can be purely qualitative or semi-quantitative

Generalized Bow Tie
[Diagram: Threats or Causes (Cause 1 to Cause 4), Barriers, TOP EVENT, Mitigations, Consequences; also labeled "Reason’s Swiss Cheese"]

Example Risk Matrix

Frequency per year or likelihood:
1 | 1/yr
2 | 1/10 yr
3 | 0.001/yr
4 | 0.0001/yr
5 | 0.00001/yr

Financial consequence severity:
A | <$50k
B | $50k to $500k
C | $500k to $5mm
D | $5mm to $50mm
E | $50mm to $100mm

Bow Tie - Common Example
[Diagram; recoverable labels listed below. Individual controls carry Weak / Medium / Strong ratings and numeric scores on the original diagram.]

THREATS or CAUSES
• 2oo3 pumps fail due to seal failures (F=1)
• One pump fails and auto-start fails (F=1)
• P-2 or P-3 fails and SS Bus A fails (F=3)
• Station Service Bus B failure (F=2)

BARRIERS / PREVENTIVE CONTROLS and MITIGATIVE CONTROLS (in the order they appear)
• Pump redundancy
• Robust shaft and bearing design
• Periodic testing of auto-start
• Operator response
• Burner Trip System
• Use of predictive maintenance techniques
• Corona testing to detect onset of failure
• Planned repairs prior to major damage
• Spares stocking strategy
• Quick pump repairs
• 3 Element BFW Control System

TOP EVENT
• Inadequate BFW flow to boilers

CONSEQUENCES
• Boiler tubes damaged: $10mm
• Large production downtime losses: $550k to $5mm
• Significant pump repair costs: >$100k

RAM Model
• RAM: Reliability, Availability, Maintainability
• Reliability: Probability of surviving a given time interval without failure under given conditions
• Availability: Average % of time a system is in a state to perform a function
• Maintainability: Probability of completing a maintenance task in a given time interval

RAM Model
• A graphical and mathematical representation of system operation, dependency, and performance
• Most quantitative of the three methods presented
• Requires failure data, repair time data, and system operating logic

RAM Model Building Block
• Series
[Figures: RAM Model Building Block; RAM Model-Example; RAM Model-Typical Input]

RAM Model Results
System Life Cycle Performance Summary
System Mean Availability | 99.986%, +/- 0.052%
Average Annual Production Losses | 2.457 mmLb/yr
Average Annual Production Losses | $5,120/yr
Average Outage Duration | 160.6 Hrs
Longest Duration Outage | 372 Hrs
Shortest Duration Outage | 0.34 Hrs
Results of 1000 simulations, 20 years in length

RAM Model Results
[Figure]

Pros / Cons - FMEA
Pros
• Structured, thorough
• Easy to learn
• Uses group knowledge
• Requires no special software
• Excellent for evaluating designs early in the process
Cons
• Tedious, time consuming
• Requires robust risk matrix
• Doesn't handle redundancy or multiple failures well
• Doesn't handle dependencies well
• Doesn't handle increasing failure rates well

Pros / Cons - Bow Tie
Pros
• Excellent risk management communication tool
• Easy to learn and interpret
• Uses group knowledge to develop
• Fairly quick to develop
Cons
• Quantifying risk requires modification
• Requires robust risk matrix
• Becomes complex with large systems
• Software recommended for good documentation

Pros / Cons - RAM Model
Pros
• Quantifies risks for prioritization
• Estimates risks over time
• Handles dependencies, redundancy, special ops rules
• Evaluating “What Ifs” can be done quickly
Cons
• Can be labor and $$ intensive for large systems
• Not easily understood by those not trained in it
• Requires special analyst skills for model building
• Quality of model depends on quality of data

Final Thoughts
• There is NO one best or universal method.
• Use the simplest method that can help you meet the objective of your assessment with the minimum investment of time and resources.
• Risk assessment alone is valueless: risks must be managed, and that takes action.

What are your questions?
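
Worked sketch: FMEA ranking for the Common Example

The FMEA worksheet's RPN column is the product of the occurrence, severity, and detection ratings, and the stated objective is to find the most critical risk. The Python sketch below recomputes the RPN for the four worksheet rows and ranks them. As a further illustration of Risk = Failure Probability x $$ Impact, it also estimates an expected annual loss per row. That second figure is not in the worksheet: it assumes the failure-rate column of the Risk Rating Factors table lines up with occurrence ratings 1 to 7 in order, and the names `FmeaRow`, `OCCURRENCE_RATE_PER_YR`, and `rank_by_rpn` are hypothetical.

```python
from dataclasses import dataclass

# Assumed mapping: occurrence rating -> failure rate per year, read in order
# from the Risk Rating Factors table (rating 1 = 1.00E-06 ... rating 7 = 1).
OCCURRENCE_RATE_PER_YR = {1: 1e-6, 2: 1e-5, 3: 1e-4, 4: 1e-3, 5: 1e-2, 6: 1e-1, 7: 1.0}


@dataclass
class FmeaRow:
    cause: str
    occ: int         # occurrence rating from the worksheet
    sev: int         # severity rating from the worksheet
    det: int         # detection rating from the worksheet
    loss_usd: float  # total loss stated in the worksheet

    @property
    def rpn(self) -> int:
        # Risk priority number: occurrence x severity x detection.
        return self.occ * self.sev * self.det

    @property
    def expected_annual_loss(self) -> float:
        # Illustrative only: failure rate (per yr) x $ impact.
        return OCCURRENCE_RATE_PER_YR[self.occ] * self.loss_usd


# Ratings and losses copied from the FMEA worksheet.
ROWS = [
    FmeaRow("2oo3 pumps fail simultaneously due to seal failure", 4, 4, 2, 550_000),
    FmeaRow("1 pump fails and auto-start for standby fails", 3, 3, 3, 560_000),
    FmeaRow("Loss of Station Service Bus B", 4, 5, 2, 1_600_000),
    FmeaRow("Pump 1 fails and Station Service Bus B fails", 3, 5, 2, 1_550_000),
]


def rank_by_rpn(rows):
    """Return the rows sorted from highest to lowest RPN."""
    return sorted(rows, key=lambda r: r.rpn, reverse=True)


if __name__ == "__main__":
    for row in rank_by_rpn(ROWS):
        print(f"RPN {row.rpn:3d}  ~${row.expected_annual_loss:,.0f}/yr  {row.cause}")
```

Ranking by RPN puts the loss of Station Service Bus B first (RPN 40), matching the worksheet. The expected-loss figures depend entirely on the assumed rating-to-rate mapping and should be read as order-of-magnitude only.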
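
Worked sketch: availability of series and redundant blocks

The RAM slides define availability as the average fraction of time a system can perform its function and note that a RAM model needs failure data, repair time data, and operating logic. The sketch below is not the simulation model behind the RAM Model Results slide. It is a simpler analytic approximation, assuming independent, identical components in steady state, with availability taken as MTBF / (MTBF + MTTR). The MTBF and MTTR values are hypothetical placeholders, not data from the presentation, and the function names are illustrative.

```python
from math import comb, prod


def availability(mtbf_hr: float, mttr_hr: float) -> float:
    """Steady-state availability of one component: uptime / (uptime + downtime)."""
    return mtbf_hr / (mtbf_hr + mttr_hr)


def series(avails) -> float:
    """Series block: every element must be up, so availabilities multiply."""
    return prod(avails)


def k_out_of_n(k: int, n: int, a: float) -> float:
    """At least k of n identical, independent components are up (binomial sum)."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    pump = availability(mtbf_hr=8_000, mttr_hr=120)
    bus = availability(mtbf_hr=200_000, mttr_hr=360)

    # Two pumps failing trips both boilers, so at least 2 of 3 pumps must run.
    pumps_2oo3 = k_out_of_n(2, 3, pump)

    # Pump group in series with a station service bus it depends on.
    system = series([pumps_2oo3, bus])
    print(f"pump {pump:.5f}  2oo3 {pumps_2oo3:.6f}  system {system:.6f}")
```

A closed-form calculation like this cannot represent auto-start logic, shared buses feeding specific pumps, or failure rates that change over time, which is where the simulation-based RAM model described above is the better fit.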