Accelerated Stress Testing and Reliability Workshop
October 9-11, 2013, San Diego, CA
Accelerating Reliability into the 21st Century
Keynote Presenter Day 1: Vice Admiral Walter Massenburg
Keynote Presenter Day 2: Alain Bensoussan, Thales Avionics

CALL FOR PRESENTATIONS: We are now accepting abstracts. Email to: don.gerstle@gmail.com. Guidelines are on the website, www.ieee-astr.org. For more details, join our LinkedIn Group: IEEE/CPMT Workshop on Accelerated Stress Testing and Reliability.

This is the 3rd of a series of four webinars being put on by Ops A La Carte, ASTR, and ASQ Reliability Division. Each webinar will also be presented as a full 2-hour tutorial at our ASTR Workshop, Oct 9-11, San Diego. Abstracts for presentations are due Apr 30. www.ieee-astr.org

Agenda
• Introduction: 5 min
• Accelerated Reliability Growth Testing: 45 min
• Questions: 10 min

Upcoming Reliability Webinars
Title: 40 Years of HALT: What Have We Learned
Author: Mike Silverman
Date: Sept 12, 2013, 12pm EST
http://reliabilitycalendar.org/webinars/english/40-years-of-halt-what-have-we-learned/
Location: Webinar
HALT began 40 years ago with a simple idea of testing beyond specifications in order to better understand design margins. Over the past 40 years, thousands of engineers around the world have been exposed to the concepts of HALT and have tried the techniques. This tutorial will explore what we have learned in the past 40 years and what the future of HALT could be.

Registration Demographics
For this webinar we have signed up:
– 250 registrants
– 17 countries
– 28 US states

Registration Question #1
Have you ever performed a Reliability Growth Test?
– Never
– All the time
– Tried once
Responses, in the order recorded on the slide: 45%, 25%, 20%

Registration Question #1
For your last RGT, did you have a chance to plan the duration and stresses?
– Neither
– Both
– Duration only
– Stresses only
Responses, in the order recorded on the slide: 50%, 25%, 10%, 10%

Traditional and Accelerated Reliability Growth
The Case of Lost (and Found) Failure Rates
Milena Krasich, P. E.
Raytheon, IDS
Copyright © 2012 Raytheon Company. All rights reserved. Customer Success Is Our Mission is a registered trademark of Raytheon Company.

Tutorial Objectives
Identify shortcomings of traditional reliability growth testing and offer alternatives:
• Reliability Growth Test objectives
• Explain the traditional Reliability Growth test methodology along with its assumptions
• Show shortfalls of the traditional methods
  – Entire item failure rate not calculated and presented in results
  – Test duration too long for modern high-reliability items
  – Little or no relationship between reliability and the stresses on the tested item
• Show principles of the Physics of Failure (PoF) test methodology
• Show how the Reliability Growth test based on PoF is constructed
• Show how the expected stresses are applied and accelerated
• Show how to account for total final failure rates
• Show the considerable test cost reduction achieved.

Traditional RG Test Methodology
• Overall test duration is determined from the initial and goal reliability measures: failure rates or Mean Time Between Failures, MTBF (or MTTF)
• The initial failure rate is estimated for the entire item and then used for calculations of reliability growth
• Reliability growth parameters and test duration are determined from the goal reliability, mathematically
• The magnitude (stress level) of the applied operational and environmental stresses is equal to that in use, but their duration is not
• Applied stress duration is determined by engineering judgment, and stress level by assumptions of some "mean" stress
• Overall test duration and stress application are unrelated to use profiles or to the required life or mission of the product; they are related only to mathematics
• Additional errors: mathematical

Principles and Assumptions
Goal: increase the current (existing) reliability, measured in mean time between failures
Goal magnitude guided by:
• Requirement or commercial logic
Item as designed contains design errors:
• Those are going to appear in test reasonably within the determined test time
• The errors found in test are going to be eliminated by design corrections (type B failure modes)
• The test continuation will evaluate the success of the fix
• Design errors that cannot be fixed (type A failure modes) will continuously be counted
• Failures determined to be random will not be counted
• Reliability growth will be measured.

Principles and Assumptions, cont.
• Failure rate during the test is constant when there are no changes of the tested item
• Failure rate decreases in steps with introduced design corrections, and remains constant until the next change
• The step curve is fitted with a curve representing a Non-Homogeneous Poisson Process (NHPP)
• The process definition: failure rate is constant until changes occur.
The facts not considered in application of that theory:
The initial failure rate is just the total failure rate. There is no rationale for how much of it is attributed to:
• Design problems that can be corrected
• Random events (those failure modes where one does not know where they come from, they "just happen")
• Design problems that cannot be corrected for one of these reasons:
  – Technically impossible
  – Economically not justifiable
  – Time to market constraints

Mathematical Model - Refresher
The expected accumulated number of failures up to test time $T$ is given by:
$$E[N(T)] = \lambda T^{\beta}, \quad \text{with } \lambda > 0,\ \beta > 0,\ T > 0$$
where $\lambda$ is the scale parameter and $\beta$ is the shape parameter, a function of the general effectiveness of the improvements:
• $0 < \beta < 1$ corresponds to reliability growth
• $\beta = 1$ corresponds to no reliability growth
• $\beta > 1$ corresponds to negative reliability growth (reliability degradation)
The failure intensity, when it is changing as a result of design improvements, after $T$ hours of testing is given by:
$$\rho(t) = \frac{d}{dt}E[N(t)] = \lambda \beta t^{\beta-1}, \quad \text{with } t > 0; \qquad \theta(T) = \frac{1}{\rho(T)}$$
The item failure intensity has three components:
$$\rho_{Item}(t) = \rho_B(t) + \rho_A(t) + \rho_r(t) = \lambda \beta t^{\beta-1} + \rho_A(t) + \rho_r(t)$$
$$\rho_A(t) = \lambda_A = \text{const.}, \qquad \rho_r(t) = \lambda_r = \text{const.}$$
$$\rho_{Item}(t) = \lambda \beta t^{\beta-1} + \lambda_A + \lambda_r \neq \lambda_{Item}\, \beta\, t^{\beta-1}$$

Mathematics of Traditional Reliability Growth
Failure mode types in test:
• Systematic: corrected in test (Type B) or not corrected (Type A)
• Random: constant
$$\rho_{Item}(t) = \rho_B(t) + \rho_A(t) + \rho_r(t)$$
$$\rho_{Item}(t) = \lambda \beta t^{\beta-1} + \rho_A(t) + \rho_r(t) \neq \lambda \beta t^{\beta-1}$$
Only the failure rates of type B failure modes are accounted for in a reliability test program. Those are the ones that show growth expressed by the power law model; the type A and random failure rates remain constant.
[Figure: failure intensity/failure rate (failures/hour, 0 to 0.06) versus test duration (0 to 6000 hours), showing $\rho_S(t) = \rho_A(t) + \rho_r(t) + \rho_B(t)$ together with its components: constant $\rho_A(t)$ and $\rho_r(t)$, and $\rho_B(t)$, the only failure modes with decreasing failure rates (power law).]

Planning Reliability Growth
To plan a reliability growth, the initial value of failure rate, $\lambda_I$, or initial mean time between failures, $\theta_I$, was assumed as known at some time $t_I$.
This initial failure rate would have a value that was known by experience for that item or by similarity with another like item, $\rho_I(t_I)$ = constant.
The thought process was then that this initial failure rate would decrease under the rules of the power law and, at the end of the test with the corrections, would assume a final value (a constant again), $\rho_F(t_F)$.
The Crow/AMSAA/Duane planning model is simple and easy to implement:
$$\rho(t) = \rho_I(t_I) \left(\frac{t}{t_I}\right)^{\beta-1}$$
But the initial failure rate has three components, and only one of those, the failure rate of the B failure modes, can be improved and fitted with the power law. The remaining components are constant.

Planning Reliability Growth, cont.
The remaining two components are constant.
The final failure rate as a function of time also contains three components, two constant and only one that can be fitted with the power law:
$$\rho_I(t_I) = \rho_{BI}(t_I) + \rho_A(t_I) + \rho_r(t_I) = \rho_{BI}(t_I) + \rho_A + \rho_r$$
The final B-modes failure rate is then made of the improved B-type failure modes failure rate, and the total final item or system failure rate also contains two additional constant components:
$$\rho_{BF}(t_F) = \rho_{BI}(t_I) \left(\frac{t_F}{t_I}\right)^{\beta-1}$$
$$\rho_F(t_F) = \lambda \beta t_F^{\beta-1} + \rho_A(t_F) + \rho_r(t_F) = \rho_{BF}(t_F) + \rho_A + \rho_r$$
$$\rho_F(t_F) = \rho_{BI}(t_I) \left(\frac{t_F}{t_I}\right)^{\beta-1} + \rho_A + \rho_r$$

A Failure Modes
• The random failure rates are not recorded or taken into account. The A-type failures are considered in the number of failures: it is said that they are included in the shape parameter calculations, but there is no example in current Handbooks that shows how it was done
• It is also stated that the Type A failure modes are counted every time they show up, repetitions included; no example of that statement could be found
• Given that there is no improvement applied, type A failure modes should be treated in the same manner as the random failure rates. They could be separately accounted for, but numerically their failure rate will be added to the random failure rate.
• This means that during the test, the A-type failure modes should be counted as another group of constant failure rates
• In that case the methodology of fixed duration testing should be applied to determine failure rates for both:
  – The A-type failure modes
  – All other random failure modes where the origin is not identifiable.

Present Method to Determine Test Duration
Test duration is mathematically determined from the reciprocal of the "failure rate" as:
$$\theta_F(t_F) = \theta_I(t_I) \left(\frac{t_F}{t_I}\right)^{1-\beta}$$
$$t_F = \exp\left[\frac{\ln \theta_F(t_F) - \ln \theta_I(t_I)}{1-\beta} + \ln t_I\right]$$
Where:
$\theta_F$ = final product MTBF (for mitigated, "fixed" failure modes only): given goal
$\theta_I$ = initial product MTBF (for failure modes that will be mitigated): assumed
$t_F$ = test duration needed to achieve the final MTBF for fixed failure modes
$t_I$ = initial test time (has various explanations): assumed. What is it?
Example, old school: $\theta_I$ = 4,000 hours, $\theta_F$ = 10,000 hours, $\beta$ = 0.6
[Figure: test duration $t_F(t_I)$ (hours, 0 to $1 \times 10^4$) versus initial test time $t_I$ (hours, 0 to 800).]

Initial MTBF – What is It?
In the traditional test design, the initial test MTBF is the MTBF assumed for the product, but:
The reciprocal of this initial MTBF is the initial failure rate, which is made up of three components, two of them constant, not Power Law:
• Design: correctable
• Design: non-correctable
• Random failure rates or failure modes
It is only the design failure modes that can be corrected (B type) that can be fitted by the Power Law (Weibull Intensity Function), thus:
$$\theta_{BI}(t_I) = \frac{1}{\rho_{BI}(t_I)}$$
• What part of the assumed, estimated initial failure rate of the entire item could those correctable failure modes be?
• Analytical prediction contains only the random failure rates
  – If Design Engineering is reasonably competent, Type A or B failure modes could be at most 40% of the assumed initial failure rate
  – The B failure rate could be only a small fraction of the estimated product failure rate before the test.

Parameters and Results
Recorded in test are cumulative times of occurrence of A and B failure modes.
$$\rho_B(t) = \frac{d}{dt}E[N_B(t)] = \lambda \beta t^{\beta-1}, \quad \text{with } t > 0; \qquad \theta_B(T) = \frac{1}{\rho_B(T)}$$
A modes are not addressed, and they should not be a part of the power law. Handbook text suggested they are counted; if they were, it would have been in error.
From test data, the shape and scale parameters are determined:
$$\hat{\beta} = \frac{N_B}{N_B \ln T - \sum_{i=1}^{N_B} \ln t_i}; \qquad \text{Unbiased: } \bar{\beta} = \frac{N_B - 1}{N_B \ln T - \sum_{i=1}^{N_B} \ln t_i}; \qquad \hat{\lambda} = \frac{N_B}{T^{\hat{\beta}}}$$
The reported failure rate and MTBF are:
$$\rho_B(T) = \lambda \beta T^{\beta-1}, \qquad \theta_B(T) = \frac{1}{\lambda \beta T^{\beta-1}}$$
Random and A modes do not seem to be a part of the achieved growth. They are, unfortunately, forgotten.
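The parameter estimation described above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's tool: it assumes a time-terminated test, the function name `fit_power_law` is hypothetical, and the failure times in the usage note are invented purely for demonstration. Only B-mode failure times are passed in, since A and random modes are not part of the power law.

```python
import math


def fit_power_law(failure_times, total_time):
    """Estimate power-law parameters from B-mode failure times (hours).

    Assumes a time-terminated test of length total_time. Returns the
    shape estimate, its unbiased version, the scale estimate, and the
    B-mode failure intensity and MTBF at the end of the test.
    """
    n = len(failure_times)
    if n < 2:
        raise ValueError("need at least two B-mode failures")
    if any(t <= 0 or t > total_time for t in failure_times):
        raise ValueError("failure times must lie in (0, total_time]")

    log_sum = sum(math.log(t) for t in failure_times)
    denom = n * math.log(total_time) - log_sum
    if denom <= 0:
        raise ValueError("failure times do not support an estimate")

    beta_hat = n / denom                    # maximum likelihood shape
    beta_unbiased = (n - 1) / denom         # unbiased shape
    lambda_hat = n / total_time ** beta_hat  # scale
    intensity_b = lambda_hat * beta_hat * total_time ** (beta_hat - 1)

    return {
        "beta_hat": beta_hat,
        "beta_unbiased": beta_unbiased,
        "lambda_hat": lambda_hat,
        "intensity_b": intensity_b,   # failures per hour, B modes only
        "mtbf_b": 1.0 / intensity_b,  # hours, B modes only
    }


if __name__ == "__main__":
    # Hypothetical B-mode failure times in a 1,500 hour test.
    fit = fit_power_law([25.0, 90.0, 210.0, 480.0, 950.0], 1500.0)
    for name, value in fit.items():
        print(f"{name}: {value:.6g}")
```

A shape estimate below 1 indicates growth for the B modes. The returned MTBF covers B modes only, so the constant A and random failure rates still have to be added to obtain the item failure rate.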
Comparison
If the initial test time is assumed to be 200 hours:
Traditional test (all failure rates follow the power law):
• Initial failure rate: $\lambda_I = 2.5 \times 10^{-4}$ f/hr
• Initial MTBF: $\theta_I$ = 4,000 hours
• Final MTBF: $\theta_F$ = 10,000 hours
• Final test time: 1,976 hours (from the initial time)
True status, only B-type failure modes improved (e.g. maximum 40% of the old "initial" failure rate, $\lambda_I = 2.5 \times 10^{-4}$ f/hr):
• Initial failure rate for B modes: $\lambda_{IB} = 0.4 \times 2.5 \times 10^{-4}$ f/hr $= 1 \times 10^{-4}$ f/hr
• Initial MTBF: $\theta_{IB}$ = 10,000 hours
• Possible final MTBF for B modes: $\theta_{FB}$ = 30,000 hours
• Overall final failure rate, B modes + random and A modes: $1.833 \times 10^{-4}$ f/hr
• Final overall MTBF: $\theta_F \approx 5,455$ hours
• Final test time: 3,118 hours (from the initial time)
The forgotten, unreported failure rate: $\lambda_A + \lambda_r = 1.5 \times 10^{-4}$ f/hr

The Solution – Way Forward
The possible correct solution:
• Prepare a reliability growth test for only B failure modes
• Count A-type failure modes as if they are random
• Count random failures
• Calculate the final failure rate and MTBF of the B failure modes
• Add the constant A and random failure rates to get results
Possible problems and difficulties:
• The calculated mathematical test duration is unrelated to use stresses or use profile
• The traditionally determined test duration is too short to account for the random failures; normally the required test duration for a reasonable confidence is about 10 MTBFs (in our example this would be about 70,000 hours)
  – The traditional RG test duration does not support this test time
• A short reliability growth test does not disclose any cumulative damage or failures of small failure rates that would start showing only after the test is complete, while the useful life of the item could be 10 or 20 years
The proposed viable solution: accelerated Reliability Growth test.
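The Comparison figures and the corrected procedure (grow only the B modes, then add the constant A and random rates) can be checked numerically. The sketch below assumes the planning relation $t_F = t_I (\theta_F/\theta_I)^{1/(1-\beta)}$ with $\beta$ = 0.6 and a 40% B-mode share; the function names are hypothetical.

```python
def growth_test_duration(theta_initial, theta_goal, t_initial, beta):
    """Test time at which a power-law MTBF grows from theta_initial
    (at t_initial) to theta_goal. All times are in hours."""
    return t_initial * (theta_goal / theta_initial) ** (1.0 / (1.0 - beta))


def compare_plans(lambda_total=2.5e-4, b_fraction=0.4,
                  theta_goal_total=10_000.0, theta_goal_b=30_000.0,
                  t_initial=200.0, beta=0.6):
    """Contrast the traditional plan with a B-modes-only plan."""
    # Traditional plan: the whole item failure rate is assumed to grow.
    traditional_hours = growth_test_duration(
        1.0 / lambda_total, theta_goal_total, t_initial, beta)

    # Corrected plan: only the B-mode share follows the power law.
    lambda_b_initial = b_fraction * lambda_total
    lambda_constant = lambda_total - lambda_b_initial  # A + random modes
    corrected_hours = growth_test_duration(
        1.0 / lambda_b_initial, theta_goal_b, t_initial, beta)

    # Final item failure rate: improved B modes plus the constant part.
    lambda_final = 1.0 / theta_goal_b + lambda_constant

    return {
        "traditional_hours": traditional_hours,
        "corrected_hours": corrected_hours,
        "lambda_constant": lambda_constant,
        "lambda_final": lambda_final,
        "final_mtbf": 1.0 / lambda_final,
    }


if __name__ == "__main__":
    for name, value in compare_plans().items():
        print(f"{name}: {value:.6g}")
```

With these inputs the traditional plan needs about 1,976 hours and the B-modes-only plan about 3,118 hours, while the item ends at roughly $1.833 \times 10^{-4}$ f/hr because the constant $1.5 \times 10^{-4}$ f/hr is never improved.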
Physics of Failure and Reliability
Failures occur when an item is not strong enough to withstand one or more attributes of a stress: level, duration, or repetitions of its application
• The higher the level, the shorter the duration or the fewer the repetitions needed to induce a failure
The area of overlap of the strength and stress distributions represents the probability of failure for each of the stresses:
• $\mu_L$, $\sigma_L$ = mean and standard deviation of the load distribution; $\sigma_L = b \times \mu_L$
• $\mu_S$, $\sigma_S$ = mean and standard deviation of the strength distribution; $\sigma_S = a \times \mu_S$
• If the mean of strength is a $k$ times multiple of the mean of stress (load), and the standard deviations of each are $a$ and $b$ times their respective mean values, the reliability of an item regarding each use stress ($i$), and the total reliability, will be:
$$R_i(k, \mu_{L\_i}) = \Phi\left[\frac{k \mu_{L\_i} - \mu_{L\_i}}{\sqrt{(a\, k\, \mu_{L\_i})^2 + (b\, \mu_{L\_i})^2}}\right]$$
$$R_{Item}(t_0) = \prod_{i=1}^{S} R_{Stress\_i}(t_i)$$
where $\Phi$ is the standard normal cumulative distribution function.

Physics of Failure Reliability – Margin k Selection
• Allocate reliability regarding each of the expected stresses in use
• The cumulative damage, and ultimately failure, due to a stress is proportional to the stress level and its duration.
• For the stress applied at the same level as in life, the cumulative damage model is:
$$D(t) \propto \int_0^{t} S(\tau)\, d\tau$$
[Figure: reliability (0.50 to 1.00) versus multiplier $k$ (1.00 to 1.50), with curves for b=0.5, a=0.05; b=0.2, a=0.05; b=0.05, a=0.05; b=0.2, a=0.02; b=0.1, a=0.02; b=0.05, a=0.02.]
• For the allocated reliability regarding each stress, select the value of margin $k$ which would multiply its duration in use to be applied in test;
• Apply stresses simultaneously whenever possible;
• If the same stress type is applied at different levels in use, recalculate their durations to the highest level (using acceleration factors);
• The most common values for $a$ and $b$ are: $a$ = 0.05, $b$ = 0.2

Test Acceleration
Each of the stresses is accelerated in test to allow for shorter test duration.
Total item failure rate is the sum of its failure rates regarding each individual stress ($\lambda_0$ is the item total failure rate in use condition and $\lambda_A$ is the accelerated item total failure rate; in reliability growth $\lambda$ is equivalent to $\rho$):
$$\lambda_0 = \sum_{i=1}^{N_S} \lambda_i$$
$$\lambda_A = A_{Test}\, \lambda_0 = \sum_{i=1}^{N_S} \left(\prod_{j} A_j\right) \lambda_i$$
Product $j$ exists when the stresses 1 to $j$ produce the same failure mode.
Stress acceleration models for different stresses, examples:
• Inverse power law model (usually applicable to thermal cycling, vibration, shock, humidity);
• Arrhenius model (used for temperature acceleration using absolute temperature);
• Eyring model (used also when the thermal stress is a factor in process acceleration);
• Step stress model, where the stress is increasing in steps;
• Fatigue model, representing the degradation due to repetitious stress.
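The margin selection and two of the acceleration models listed above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the presenter's implementation: normal load and strength distributions are assumed for the margin reliability, the Boltzmann constant is taken as $8.617 \times 10^{-5}$ eV/K with a 273 K offset, and all function names are hypothetical. The input values are those of the tutorial's test example (k = 1.5, $E_a$ = 1.2 eV, 65 °C ON, 35 °C OFF, 50% to 95% RH at 85 °C with h = 2.3, and 150 hours of vibration at 1.7 g rms tested at 3.2 g rms with w = 4).

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # assumed value of k_B in eV/K


def margin_reliability(k, a, b):
    """Stress-strength reliability for strength mean = k * load mean,
    with strength and load standard deviations a and b times their
    means (normal distributions assumed)."""
    z = (k - 1.0) / math.sqrt((a * k) ** 2 + b ** 2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def arrhenius_factor(ea_ev, temp_low_c, temp_high_c):
    """Arrhenius acceleration of temp_high_c relative to temp_low_c."""
    inv_low = 1.0 / (temp_low_c + 273.0)
    inv_high = 1.0 / (temp_high_c + 273.0)
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (inv_low - inv_high))


def inverse_power_time(time_use, stress_use, stress_test, exponent):
    """Inverse power law: test time equivalent to time_use at stress_use."""
    return time_use * (stress_use / stress_test) ** exponent


def worked_example(k=1.5, ea_ev=1.2):
    # Normalize OFF exposure (35 C) to an equivalent duration at 65 C.
    t_on, t_off = 7300.0, 80300.0
    t_on_normalized = t_on + t_off / arrhenius_factor(ea_ev, 35.0, 65.0)

    # Humidity: inverse power law in RH combined with Arrhenius in T.
    t_rh_test = (inverse_power_time(k * t_on_normalized, 50.0, 95.0, 2.3)
                 / arrhenius_factor(ea_ev, 65.0, 85.0))

    # Vibration: inverse power law in r.m.s. acceleration, per axis.
    t_vib_test = inverse_power_time(k * 150.0, 1.7, 3.2, 4.0)

    return {
        "t_on_normalized_h": t_on_normalized,
        "t_rh_test_h": t_rh_test,
        "t_vib_test_h": t_vib_test,
    }


if __name__ == "__main__":
    print(f"R(k=1.5, a=0.05, b=0.2) = {margin_reliability(1.5, 0.05, 0.2):.4f}")
    for name, value in worked_example().items():
        print(f"{name}: {value:.1f}")
```

Under these assumptions the sketch gives approximately 8,750 hours of normalized ON time, 300 hours of humidity exposure and 18 hours of vibration per axis, close to the values quoted in the test example. Small differences come from the assumed constant and rounding.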
Test Example B Failure Modes – duration k×life

Parameter                 Symbol              Value
Required life             $t_0$               10 years = 87,600 h
Required reliability      $R_0(t_0)$          0.8
Time ON                   $t_{ON}$            2 h/day = 7,300 h
Temperature ON            $T_{ON}$            65 °C
Time OFF                  $t_{OFF}$           22 h/day = 80,300 h
Temperature OFF           $T_{OFF}$           35 °C
Thermal cycling           $\Delta T_{Use}$    45 °C, two times per day
Total cycles              $N_{Use}$           7,300
Temperature ramp rate     $\zeta_{Use}$       1.5 °C/min
Vibrations, random        $W_{Use}$           16.68 m/s² r.m.s.
Relative humidity         $RH_{Use}$          50 %
Activation energy         $E_a$               1.2 eV

Determination of factor $k$ for major stresses:
$$R_i(t_0) = \left[R_0(t_0)\right]^{1/4} = 0.946, \qquad k = 1.5$$
[Figure: reliability (0.5 to 1) versus multiplier $k$ (1.00 to 1.50) for a=0.1, b=0.1.]
Stresses:
• Thermal cycling
• Thermal exposure (thermal dwell)
• Humidity
• Vibration
• Operational cycling
Thermal cycling:
$$A_{TC} = \left(\frac{\Delta T_{Test}}{\Delta T_{Use}}\right)^{m}, \qquad A_{Ramp\_Rate} = \left(\frac{\zeta_{Test}}{\zeta_{Use}}\right)^{1/3}, \qquad N_{TC\_Test} = \frac{N_{TC\_Use} \cdot k}{A_{TC} \cdot A_{Ramp\_Rate}}$$
Thermal dwell (normalize exposure when OFF to duration at ON temperature):
$$t_{ON\_N} = t_{ON} + t_{OFF} \exp\left[\frac{E_a}{k_B}\left(\frac{1}{T_{ON}+273} - \frac{1}{T_{OFF}+273}\right)\right] = 8,754 \text{ hours}$$
Duration of accelerated exposure:
$$t_{T\_Test} = t_{ON\_N} \cdot k \cdot \exp\left[\frac{E_a}{k_B}\left(\frac{1}{T_{Test}+273} - \frac{1}{T_{ON}+273}\right)\right], \qquad t_{T\_Test} = 168.1 \text{ h}$$
One thermal cycle in test = 24 hours in life

Test Example, Cont.
The thermal exposure is combined with the thermal cycling, distributed over the high temperature:
$$t_{TC} = 2 \times (\text{ramp time}) + (\text{temperature stabilization} + \text{thermal dwell}) + \text{dwell at cold}$$
The test cycle profile:
$$t_{TC} = 2 \times \frac{125}{10} + 22.3 + 5 = 52.3 \text{ min} \approx 0.875 \text{ h}$$
Humidity: test at 95% RH and temperature $T_{RH}$ = 85 °C (65 °C chamber + 20 °C internal temperature rise)
$$t_{RH\_Test} = k \cdot t_{ON\_N} \left(\frac{RH_{Use}}{RH_{Test}}\right)^{h} \exp\left[-\frac{E_a}{k_B}\left(\frac{1}{T_{ON}+273} - \frac{1}{T_{RH}+273}\right)\right], \qquad h = 2.3$$
$$t_{RH\_Test} = 300 \text{ h}$$
Vibration: 150,000 miles, 150 hours per axis vibration at 1.7 g rms.
Test level: 3.2 g rms
$$t_{Vib\_Test} = k \cdot t_{Vib\_Use} \left(\frac{W_{Use}}{W_{Test}}\right)^{w}, \quad \text{with } w = 4; \qquad t_{Vib\_Test} = 18 \text{ hours per axis}$$
To project test time to life, use the acceleration factor to multiply test time.

Failure    Time to failure (h)    Cumulative time to failure (n=24)    $\theta(t)$     log(t)    log[$\theta(t)$]
1          3,821.33               91,711.92                            91,711.92       4.96      4.96
2          5,781.33               138,751.92                           69,375.96       5.14      4.84
3          14,016                 336,384.00                           112,128         5.53      5.05
4          18,563.44              445,522.56                           111,380.64      5.65      5.05
$t_0 \cdot k$     131,400                3,153,600                            788,400         6.50      5.90

Data for reliability plotting:
• Initial B failure modes MTBF 100,000 hours, final $10^6$ hours
• Initial test time: 100 hours
• Total traditional test time: $4.6 \times 10^3$ hours
• Final test reliability (B failure modes): 0.99997
• Final MTBF (improved failure modes): 1,431,964 hours
• Total accelerated test time: 526 hours

Why Accelerated Reliability Growth?
• The test duration covers the entire product life
• It allows detection of all design problems, not only those that appear in a small fraction of product life
• It enables an estimate of the failure rate regarding product random events, disregarded in traditional RG testing
• The failure rate achieved by design improvement, together with the random failure rate, provides a realistic estimate of total product reliability
• Test duration is determined from the required total reliability in view of the product's physical cumulative damage from life stresses in use;
• Test acceleration allows achievement of a very reasonable test duration, shorter than traditional mathematically derived testing
• The reliability improvement through test is no longer cost prohibitive
• Test failure times are projected to their appearance in real life, and the analysis uses this data;
• Even though it covers the product's expected life (durability information), it is still considerably shorter than the traditional reliability growth test.

Biography
Milena_krasich@raytheon.com
Milena Krasich is a Senior Principal Systems Engineer in Raytheon Integrated Defense Systems, Whole Life Engineering in RAM Engineering Group, Sudbury, MA.
Prior to joining Raytheon, she was a Senior Technical Lead of Reliability Engineering in Design Quality Engineering of Bose Corporation, Automotive Systems Division. Before joining Bose, she was a Member of Technical Staff in the Reliability Engineering Group of General Dynamics Advanced Technology Systems, formerly Lucent Technologies, after a five-year tenure at the Jet Propulsion Laboratory in Pasadena, California. While in California, she was a part-time professor at the California State University Dominguez Hills, where she taught graduate courses in System Reliability, Advanced Reliability and Maintainability, and Statistical Process Control. At that time, she was also a part-time professor at the California State Polytechnic University, Pomona, teaching undergraduate courses in Engineering Statistics, Reliability, SPC, Environmental Testing, and Production Systems Design. She holds a BS and MS in Electrical Engineering from the University of Belgrade, Yugoslavia, and is a California registered Professional Electrical Engineer. She is also a member of the IEEE and ASQC Reliability Society, and a Fellow and the President Emeritus of the Institute of Environmental Sciences and Technology. Currently, she is the Technical Advisor (Chair) to the US Technical Advisory Group (TAG) to the International Electrotechnical Commission, IEC, Technical Committee TC56, Dependability. As a part of the TC56 working groups she is working on dependability/reliability standards as a project leader for revision of many released and current international standards such as IEC/IEEE/ANSI Reliability Growth IEC 61014 and IEC 61164, Fault Tree Analysis IEC/ANSI 61025, Testing for the constant failure rate and failure intensity (Reliability compliance/demonstration tests) IEC/ANSI 61124, and FMEA IEC/ANSI 60812, and for preparation of the new IEC standard on Accelerated Testing, IEC 62506.