Introduction to Reliability
• Reliability is:
  – An inherent feature of design
  – Concerned with performance in the field, as opposed to quality of production (conformance to design specs)
• Definition
  – Reliability is the probability that a system will perform in a satisfactory manner for a given period of time when used under specified operating conditions.

Introduction to Reliability (cont)
• What is Satisfactory?
  – All critical functions
  – Time-oriented quantitative factors: MTBF, $P(X > t_0)$ with $X$ = Lifetime
  – Qualitative factors, too
• Operating Conditions
  – Use
  – Handling, Transport, Installation, Storage

Reliability in the System Life-Cycle
• Conceptual Design Phase
  – Define reliability requirements of a system
  – Plan Reliability Program
• Preliminary Design Phase
  – Allocate reliability requirements
  – Predict reliability of components/subsystems
  – Provide reliability estimates to cost estimating and design trade-off studies
  – Participate in design reviews
  – Assess subsystem/component supplier reliability estimates

Reliability in the System Life-Cycle (cont)
• Detail Design Phase
  – More detailed reliability prediction
  – Assist in detail design decisions
  – Assist in logistic support analysis
  – Assist in prototype development
  – Recommend changes prior to production
  – Evaluate reliability of prototype
  – Participate in other test and evaluation activities as related to reliability

Reliability in the System Life-Cycle (cont)
• Production/Construction Phase
  – Monitor production
  – Perform reliability tests of selected items
    • Qualification Tests: prior to production, repetitive tests to determine MTBF, degradation, failure modes
    • Acceptance Tests: random or 100% testing of items exiting production to assure that reliability demonstrated during qualifying test is being achieved in production items.
  – Collect and analyze data on operational test (product evaluation tests at a designated site)
  – Recommend corrective action
  – Continue to update reliability models and predictions

Reliability in the System Life-Cycle (cont)
• System Use Phase
  – Data collection and analysis
  – Reliability improvement studies
  – Change recommendations
  – Equipment redesign projects

Measures of Reliability
• Let $T$ = random variable measuring the “lifetime” of an item (time to first, or next, failure)
• Range space of $T$ = $\{t : t \ge 0\}$
• Tests to establish the PDF and parameters of $T$ are called “Life Testing”
• The cumulative distribution function $F(t) = P(T \le t)$ is called the Failure Distribution Function

Measures of Reliability (cont)
• The Reliability Function is:
  – $R(t) = P(T > t) = 1 - F(t) = \int_t^\infty f(x)\,dx$
  [Figure: probability density function plotted against t, with the regions F(t) and R(t) (reliability) marked.]
• Four ways to determine R(t) for a particular system
  – Test many systems to failure. Develop curve empirically.
  – Test many subsystems, use historical field data on others, develop subsystem reliability functions, use a reliability system model to combine.
  – Extrapolate past experience with similar systems.
  – Physical properties: hypothesize a certain distribution.

Failures and Failure Rates
• 3 Types of Failure (see Figure 12.4)
  – Initial (failure at t = 0)
  – Random
  – Wearout
• If initial failures are to be disregarded in your analysis, then use
  $f(t) = \dfrac{g(t)}{1 - P(T = 0)}, \quad t > 0$
  as the density for $(T \mid T > 0)$

Failures and Failure Rates (cont)
• The Hazard Function is the instantaneous failure rate at time t, given survival up to t. It has the formula:
  $h(t) = \dfrac{f(t)}{R(t)} = -\dfrac{R'(t)}{R(t)}$
• Note: $H(t) = \int_0^t h(x)\,dx$ = number of failures in [0, t] is called the failure count function

Failures and Failure Rates (cont)
• How are H(t), R(t), F(t) related?
  $H(t) = -\int_0^t \dfrac{R'(x)}{R(x)}\,dx = -\log_e R(x)\Big|_0^t = -\log_e R(t) + \log_e R(0)$, where $\log_e R(0) = 0$ since $R(0) = 1$
• So, $R(t) = e^{-H(t)}$

Mean Lifetime (Time Between Failures)
• Mean Life = $\theta \equiv E(T) = \int_0^\infty t f(t)\,dt$, or
  $\int_0^\infty [1 - F(t)]\,dt = \int_0^\infty R(t)\,dt$
• Example:
  – Random failures are often modeled with a time to failure that is exponential with rate λ:
    $f(t) = \lambda e^{-\lambda t}$ for $t \ge 0$, and $f(t) = 0$ otherwise
    $F(t) = 1 - e^{-\lambda t}$
    $R(t) = e^{-\lambda t}$

Example (cont)
• Then $h(t) = \dfrac{f(t)}{R(t)} = \dfrac{\lambda e^{-\lambda t}}{e^{-\lambda t}} = \lambda$ (constant)
• Also, because $R(t) = e^{-H(t)}$, $H(t) = \lambda t$ (linear in t), and $\theta = E(T) = 1/\lambda$
• $P(T < \theta) = F(\theta) = 1 - e^{-\lambda\theta} = 1 - e^{-1} = 1 - \dfrac{1}{e} = 1 - 0.3679 = 0.6321$
• $P(T \ge \theta) = 0.3679$, independent of λ (or θ)

Examples on Pages 349
• Example 1
  – 5 components did not fail in 600 hours
  – 5 others failed at various points
  – $\lambda = \dfrac{5\ \text{failures}}{4180\ \text{hours}} = 0.001196$ failures per hour
• Example 2
  – Operating cycle = 168.8 hours
  – Downtime = 26.8 hours
  – Operating time = 142 hours

Examples on Pages 352-353 (cont)
• Number of failures = 6
• λ = 6 / 142 = 0.042 failures per hour
• MTBF = 1 / λ = 23.81 hours
• Operational Availability = $\dfrac{MTBM}{MTBM + MDT} = \dfrac{23.81}{23.81 + 4.4666} \approx 0.84$, with MDT = 26.8 / 6 = 4.4666 hours (the unrounded ratio is 142 / 168.8 = 0.841)
  – Valid only if we treat MTBF = MTBM (instant maintenance)
• Other examples are on handouts
  – Hines and Montgomery, example 15-7
  – Halpern, examples 10-1 thru 10-6
• Note: For the exponential failure model, $R(t) = e^{-\lambda t}$ is the first term (zero failures) of a Poisson distribution with parameter λt.

What if Failure Rate Not Constant?
Distribution    Failure rate h(t) behavior
Normal          Increasing function
Lognormal       Various shapes
Weibull         Decreasing (β < 1), constant (β = 1), increasing (β > 1)
Gamma           Decreasing (n < 1), constant (n = 1), increasing (n > 1)

What if Failure Rate Not Constant (cont)
• Have a different h(t) for each time interval where the rate is constant
• Use the average failure rate (AFR) between $t_1$ and $t_2$:
  $AFR(t_1, t_2) = \dfrac{\int_{t_1}^{t_2} h(t)\,dt}{t_2 - t_1} = \dfrac{H(t_2) - H(t_1)}{t_2 - t_1} = \dfrac{\ln R(t_1) - \ln R(t_2)}{t_2 - t_1}$
• Note: $AFR(0, t) = \dfrac{H(t)}{t} = \dfrac{-\ln R(t)}{t}$

Concepts Our Text Skips
• Renewal Rate Function r(t) = instantaneous failure rate at time t, accounting for replacement of failed items with new components from the same population as the original parts
• Censored Type I Data: A fixed test duration T is pre-set. Units that do not fail before T are “censored” in that the data doesn’t account for their survival beyond time T. If T is poorly chosen, we may get no failures by time T. Then what?

Concepts Our Text Skips (cont)
• Censored Type II Data: A fixed number of failures r is prespecified; n items are tested until r fail. If r is poorly chosen, the test may take too long.
• Readout Time Data: Record actual failure times of each failed component

Estimation of λ for Exponential Life
• λ = (number of failures) / (total unit test hours)
• Type I Censored Data (test ends at preset time T)
  – n items, r failures, $t_i$ = time of the i-th failure
  – $\lambda = \dfrac{r}{\sum_{i=1}^{r} t_i + (n - r)T}$
• Type II Censored Data (test ends at the r-th failure time $t_r$)
  – $\lambda = \dfrac{r}{\sum_{i=1}^{r} t_i + (n - r)t_r}$
• If the system has n components and the system fails when the first component fails:
  $\lambda_s = \sum_{i=1}^{n} \lambda_i$

System Reliability Models
• Defined: Math models of the system that show functional relationships among subsystems, components, etc.
• Examples
  – Reliability block diagram
    • Shows all possible success/failure combinations
    • Series and parallel; also k-out-of-n configuration
    • Any closed path through system is success
    • May not resemble system physically
    • Standby redundancy

System Reliability Models (cont)
• Coherent systems models
• Fault tree analysis and other cause-consequence diagrams
  – Work from top level events (failures)
  – To primary events (causes)

Series Configuration
[Figure: block diagram of components 1, 2, ..., n connected in series.]
• Static Model: $R_s = \prod_{i=1}^{n} R_i = R_1 R_2 \cdots R_n$
• Dynamic Model:
  $R_s(t) = \prod_{i=1}^{n} R_i(t)$
  $h_s(t) = \sum_{i=1}^{n} h_i(t)$
  $H_s(t) = \sum_{i=1}^{n} H_i(t)$

Example
• Exponential Subsystem Failure Models
  $R_s(t) = e^{-(\lambda_1 + \lambda_2 + \cdots + \lambda_n)t}$
  $h_s(t) = \sum_{i=1}^{n} \lambda_i$ (constant)
  $\theta = MTBF = \dfrac{1}{\sum_{i=1}^{n} \lambda_i}$
• See example on page 354

Active Parallel Configuration
[Figure: block diagram of components 1, 2, ..., n connected in parallel.]
• Static: $R_a = 1 - \prod_{i=1}^{n} (1 - R_i)$
• Dynamic: $R_a(t) = 1 - \prod_{i=1}^{n} (1 - R_i(t))$
• Identical Components: $R_a(t) = 1 - [1 - R(t)]^n$
• System fails only if all n subsystems fail

Example 1
• Always keep in mind: “Redundancy Has a Cost”

# of Components in Parallel    R           Wt.      Benefit/Cost
1                              0.95        5 lb     -
2                              0.9975      10 lb    0.0475 / 5 lb
3                              0.999875    15 lb    0.002375 / 5 lb

Example 2
• Exponential Subsystem Lifetime, Identical Subsystems
  $R_a(t) = 1 - [1 - e^{-\lambda t}]^n$
  $\theta_a = \int_0^\infty R_a(t)\,dt = \sum_{i=1}^{n} \dfrac{1}{\lambda i} = \sum_{i=1}^{n} \dfrac{\theta}{i}$
• E.g., if n = 3 and $\theta = 1/\lambda = 1000$ hours:
  $\theta_a = \dfrac{1000}{1} + \dfrac{1000}{2} + \dfrac{1000}{3} = 1000 + 500 + 333.33 = 1833.33$ hours

Special Configurations
• K-out-of-n Configuration
  – System works only if at least k of n components are working.
    Assume identical components with reliability R(t):
    $R_s(t) = \sum_{i=k}^{n} \binom{n}{i} [R(t)]^i [1 - R(t)]^{n-i}$
• If $R(t) = e^{-\lambda t} = e^{-t/\theta}$ (exponential), then $\theta_s = \sum_{i=k}^{n} \dfrac{\theta}{i}$

Special Configurations (cont)
• Combined Series-Parallel
  – Key: Treat components in parallel as a single component, then expand
  – A in series with (B parallel C): $R_s = R_A \cdot R_{B \cup C} = R_A [1 - (1 - R_B)(1 - R_C)]$
  – (A parallel B) in series with (C parallel D): $R_s = R_{A \cup B} \cdot R_{C \cup D} = [1 - (1 - R_A)(1 - R_B)][1 - (1 - R_C)(1 - R_D)]$
• See pages 354 - 355

Availability Measurement
• Inherent Availability (Ideal Support Environment)
  $A_i = \dfrac{MTBF}{MTBF + M_{ct}}$
  – $M_{ct}$ = mean corrective maintenance time = mean time to repair (MTTR)
  – Does not include preventive maintenance, logistics delay, or administrative delay.
• Achieved Availability (Ideal Support Environment)
  $A_a = \dfrac{MTBM}{MTBM + M}$
  – $M$ = mean active maintenance time = weighted average of corrective and preventive maintenance time
  – MTBM = mean time between any maintenance action, corrective or preventive

Availability Measurement
• Operational Availability (Actual Support Environment)
  $A_o = \dfrac{MTBM}{MTBM + MDT}$
  – MDT = mean downtime = weighted average of active maintenance (corrective and preventive) and delays (logistical and administrative)

Comments on Availability
• Availability is a function of both:
  – Reliability of a prime item
  – The logistics support subsystem
• Equipment designer can exert little control over support operations, but can design in:
  – Built-in diagnostics
  – Easy access
  – Rapid disconnect / connect

Comments on Availability (cont)
• The proper balance of R&M must be decided in early stages, when flexibility is great.
• Discussion of availability is always in some context:
  – Actual failure or not
  – Which mission, what is critical to success
  – Maintenance crew, equipment, spares availability

Reliability Techniques in System Design Phase
• Conceptual Design Phase
  – Assignment of system reliability goal based on:
    • Mission analysis
    • Cost analysis
    • Technical limits
• Preliminary Design:
  – Block diagram models
  – Estimation of $R_i(t)$ functions
  – Study of failure points, solutions

Reliability Techniques in System Design Phase (cont)
• Preliminary Design Phase (cont)
  – Definition of success/failure criteria
  – Budgeting/revision of reliability requirements
• Detail Design:
  – Material and parts selection
  – Standardization
  – Test and evaluation
  – Requirements for suppliers
  – Series-parallel recommendations
  – De-rating

Standardization
• Standardization:
  – Means selection of components and materials whose reliability characteristics are known, as well as their degradation under stress and aging. This indirectly eases the burden on spare parts inventories, by having the same component used in several systems

De-rating
• De-rating:
  – Use part in application below its rated value
  – A type of overdesign to provide reliability margin
• Steps:
  – Identify operating interval
  – Select de-rating % (see RCA Corp.
Table)
  – Calculate de-rated value of component to be used
• Example: ceramic capacitor for 100 V (max) application
  – RCA recommends 70% de-rating
  – $X(0.7) = 100$, so $X = 100 / 0.7 \approx 142.86$ V minimum rating required for the component

Binomial Expansion to Explain Parallel-Redundant Systems
• Consider 3 Identical Components in Parallel
  – P = Probability of operation of each
  – Q = Probability of failure of each
• $(P + Q)^3 = P^3 + \binom{3}{1} P^2 Q + \binom{3}{2} P Q^2 + \binom{3}{3} P^0 Q^3 = P^3 + 3P^2Q + 3PQ^2 + Q^3$
  – $P^3$: all 3 up
  – $3P^2Q$: 2 up, one failed
  – $3PQ^2$: one up, two failed
  – $Q^3$: all 3 down
• $P(\text{system operating}) = P^3 + 3P^2Q + 3PQ^2 = 1 - Q^3 = 1 - (1 - P)^3$

Binomial Expansion to Explain Parallel-Redundant Systems
• Let $P_A = P_B = P_C = P_D = 0.9$
• Which configuration is more reliable? Why?
  [Figure: two alternative block-diagram configurations of components A, B, C, D.]

Parallel Redundancy Has Its Drawbacks
• Limitations
  – Each subsystem must have a “switch” to assure its failure doesn’t disable the remaining components
  – Sometimes necessary to “disconnect” failed system
  – Redundancy increases weight, volume, cost and sometimes complexity. The failure sensing device may be unreliable
• Alternatives to Redundancy
  – Reduce number of parts
  – Simplify
  – Improve reliability level of parts used, especially at critical “nodes”
  – Burn-in of parts
  – On-board spares, repairs

Standby Redundancy
• Assume “cold” standby, not energized until failure detected in original component
• Assume reliability of “decision switch” is 100%
• Lifetime variable is $T = T_1 + \cdots + T_n$
• Standby always more reliable than simple parallel, if switch is 100% reliable
  [Figure: decision switch (DS) selecting among components 1, 2, ..., n.]

Standby Redundancy (cont)
• Assume lifetime variable is as follows:
  $T = T_1 + T_2 + \cdots + T_n$
  $E(T) = \sum_i E(T_i)$
  $V(T) = \sum_i V(T_i)$ (for independent $T_i$)
• If each $T_i$ is exponential with rate λ, $T$ is gamma(λ, n):
  $E(T) = \dfrac{n}{\lambda}$
  $V(T) = \dfrac{n}{\lambda^2}$
• n = 2: $R(t) = P(\text{system life} > t \mid \text{one standby}) = e^{-\lambda t} + (\lambda t)e^{-\lambda t}$
• n = 3: $R(t) = P(\text{system life} > t \mid \text{two standbys}) = e^{-\lambda t} + (\lambda t)e^{-\lambda t} + \dfrac{(\lambda t)^2}{2!} e^{-\lambda t}$
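
The standby formulas above can be checked numerically against the active parallel formula $R_a(t) = 1 - [1 - e^{-\lambda t}]^n$. The following is a minimal sketch, assuming identical exponential units and a perfect decision switch as the slides do. The function names, the mission time of 500 hours, and the rate λ = 1/1000 per hour (borrowed from the θ = 1000 hours example) are illustrative choices, not part of the original notes.

```python
import math


def standby_reliability(lam, t, n):
    """R(t) for n identical exponential units in cold standby, perfect switch.

    The system survives to t if fewer than n failures occur by t, i.e. the
    first n terms of a Poisson distribution with mean lam * t.
    """
    x = lam * t
    return math.exp(-x) * sum(x ** k / math.factorial(k) for k in range(n))


def active_parallel_reliability(lam, t, n):
    """R(t) = 1 - (1 - e^(-lam*t))^n for n identical active-parallel units."""
    return 1.0 - (1.0 - math.exp(-lam * t)) ** n


def standby_mean_life(lam, n):
    """E(T) = n / lam for the gamma(lam, n) system lifetime."""
    return n / lam


def active_parallel_mean_life(lam, n):
    """theta_a = sum over i = 1..n of 1 / (lam * i)."""
    return sum(1.0 / (lam * i) for i in range(1, n + 1))


lam = 1.0 / 1000.0  # assumed failure rate per hour (theta = 1000 hours)
t = 500.0           # assumed mission time in hours, for illustration only

for n in (1, 2, 3):
    print(
        f"n={n}: standby R={standby_reliability(lam, t, n):.4f}, "
        f"parallel R={active_parallel_reliability(lam, t, n):.4f}, "
        f"standby mean life={standby_mean_life(lam, n):.1f} h, "
        f"parallel mean life={active_parallel_mean_life(lam, n):.1f} h"
    )
```

For n = 3 the mean-life columns reproduce the 1833.33 hours of the active parallel example and the n/λ = 3000 hours of the standby case, which illustrates the claim that standby is more reliable than simple parallel when the switch is perfect.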

Benefits of Computerized Reliability Models
• Helps keep track of reliability relationships
  – Across levels of design
  – Within a given level
• Rapid Sensitivity Analysis
  – Is the overall R goal even feasible?
  – Study effect of different R allocations
  – Study effects of configuration changes on R
  – Study effects of substituting different components
  – Perform “worst-case” analysis
• Can be adapted to multiple missions: in essence, one model for each set of mission equipment/conditions
• Can be used to evaluate proposed modifications to existing system

Analytical Methods to Support Reliability Estimation and Assist in Design Decisions
• Stress-Strength Analysis
• Critical-Useful-Life Analysis
• For Complex Systems (radar, missiles, computers)
  – Failure Mode and Effect Analysis
  – Worst-Case Analysis
  – Sneak-Circuit Analysis
• Safety Analysis Techniques
  – Fault-Tree Analysis
  – Task and Error Analysis
  – Hazard Analysis

Discussion of Stress-Strength Analysis
• Measures resistance to stress (strength)
• Examples:
  – Operating wattage vs. rated wattage
  – Operating temperature vs. rated temperature
  – Pounds/square inch
• Includes:
  – Stress distribution, especially maximum stress
  – Stress causes, timing, frequency
  – Stress testing, such as metal fatigue tests

Discussion of Critical-Useful-Life Analysis
• Critical-Useful-Life Analysis:
  – Identification of critical item list and requirements of each of these items for preventive maintenance, corrective maintenance, and replacement.
    Includes studies of how to eliminate critical items through redesign.

Discussion of FMEA
• Failure Mode and Effect Analysis:
  – Identification of all possible failure modes of equipment, the possible causes, and the possible immediate/ultimate effects on the system and operation
    • Formal documentation in words, not diagrams
    • Estimation of probability of occurrence
    • Classify each failure by criticality
    • Describe corrective action alternatives

Discussion of Worst-Case Analysis
• Worst-Case Analysis:
  – Examining how the performance of an electrical circuit (or other device) will change over time as a result of drift in part characteristics. Provides guidance on how to allow for part parameter variation in design

Discussion of Sneak-Circuit Analysis
• Sneak-Circuit Analysis:
  – Use of math models to identify any unanticipated signal paths in a circuit that may degrade performance or introduce failure.

Reliability Prediction at Part, Circuit, and Subsystem Level
• Based On:
  – Similar equipment: extrapolate. Not very accurate.
  – Number and complexity of “active element groups” (these are controllers or converters of energy)
  – Part types, counts, failure rates are combined into an estimate of system reliability
  – Prediction based on testing, such as stress tests
• Used For:
  – Higher-level reliability prediction
  – As input to maintenance and logistic support analysis
  – Comparison with requirement: where are we over/under reliability?

Reliability Degradation Studies/Action
• Determine and correct potential/actual adverse effects due to:
  – Storage, packing, transportation, handling
  – Unpacking, assembly, set-up
  – Preventive and corrective maintenance
    • Carelessness
    • Wrong tools and equipment
    • Didn’t follow/know proper procedure

Reliability Test and Evaluation
• To answer the question: “Will the mature system achieve its MTBF requirement in operation?”
• Should be part of an integrated test plan to test entire spec.
• Type I Tests:
  – Are early enough in the design process so that design changes are fairly cheap
• Type II and III Tests must:
  – Follow approved procedures (first drafts of tech manuals and training courses)
  – Use test and support equipment that was specified in the maintenance concept and detailed in LSA
  – Be provided with (test) supply support
  – Be carefully planned, instrumented, documented, analyzed

Type II Reliability Testing
• Evaluation of prototype and early production models, using producer personnel
• Includes:
  – Reliability qualification tests, to determine
    • MTBF
    • MTBM
    • Failure sequences, detection, performance degradation
    • Maintenance procedure adequacy
    • Maintenance-induced failures
  – Production sampling acceptance tests

Types of Type II Tests
• Sequential Qualification Tests
  – Environmental test chambers
  – Environmental test cycle, equipment duty cycle
  – Multiple identical test items
  – Statistics-based accept-reject test plan
    • Producer’s risk α
    • Consumer’s risk β
    • Both usually range from 0.05 to 0.25 (negotiated)

Types of Type II Tests (cont)
• Reliability Acceptance Testing: plot MTBF versus time, look for growth/decline
• Reliability Life Testing: to determine failure distribution
  – Continuous (steady)
    • Fixed time, count failures
    • Fixed number of failures, count time
  – Step-stress (accelerated) testing
    • Step up stress until all units fail
    • Aids in planning burn-in

Type III Testing
• Definition: operational testing using:
  – A group of production units
  – Designated field test site
  – Representative mix of mission profiles
  – User personnel (first trained)
  – 1st sets of support equipment; spares
• Uniqueness
  – All elements of the system are operational and evaluated together
  – Where the true R, M, A and other performance measures are known for the first time, rather than estimated via models plus some Type I and II test data
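
Since Type III testing is where R, M, and A are first measured rather than modeled, it may help to see how field totals turn into those measures. The sketch below assumes the totals of Example 2 earlier in these notes (142 operating hours, 26.8 hours of downtime, 6 failures) and treats every maintenance action as a failure, so that MTBM = MTBF. The function name and argument names are hypothetical, chosen only for illustration.

```python
def operational_availability(operating_hours, downtime_hours, maintenance_actions):
    """Return (MTBM, MDT, A_o) from field totals.

    MTBM = operating hours per maintenance action
    MDT  = downtime hours per maintenance action
    A_o  = MTBM / (MTBM + MDT)
    """
    mtbm = operating_hours / maintenance_actions
    mdt = downtime_hours / maintenance_actions
    return mtbm, mdt, mtbm / (mtbm + mdt)


# Totals taken from Example 2; treating failures as the only maintenance
# actions is an assumption made for this illustration.
mtbm, mdt, a_o = operational_availability(142.0, 26.8, 6)
print(f"MTBM = {mtbm:.2f} h, MDT = {mdt:.2f} h, A_o = {a_o:.3f}")
# prints: MTBM = 23.67 h, MDT = 4.47 h, A_o = 0.841
```

The MTBM here is 23.67 hours rather than the 23.81 hours quoted in the example because the example rounds λ to 0.042 before inverting it; the availability agrees at 0.841.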