22.38 PROBABILITY AND ITS APPLICATIONS TO RELIABILITY, QUALITY CONTROL AND RISK ASSESSMENT
Fall 2005, Lecture 2

RISK-INFORMED OPERATIONAL DECISION MANAGEMENT (RIODM): RELIABILITY AND AVAILABILITY

Michael W. Golay
Professor of Nuclear Engineering
Massachusetts Institute of Technology

Component States and Populations

[Figure: two-state diagram. Successful components ($N_s$) move to the failed state ($N_f$) by failure; failed components return to the successful state by repair.]

Consider a population of $N_{so}$ successful components and $N_{fo}$ failed components placed into service at the same time. As time, $t$, progresses, some of the successful components will fail and some of the failed components will be repaired and returned to service. The expected populations of components vary in time as:

Expected Successful Components: $N_s(t) = N_o P_s(t)$
Expected Failed Components: $N_f(t) = N_o P_f(t)$
Probability Conservation: $P_s(t) + P_f(t) = 1$
Component Conservation: $N_s(t) + N_f(t) = N_o$

where $N_o = N_{so} + N_{fo}$ is the total population.

COMPONENT FAILURE PROBABILITY

Component (Conditional) Failure Rate, $\lambda(t)$:

$$\frac{1}{P_s(t)}\frac{dP_s(t)}{dt} = \frac{1}{N_s(t)}\frac{dN_s(t)}{dt} = -\lambda(t)$$

where
$P_s(t)$ = probability that an individual component will be successful at time, $t$;
$N_s(t)$ = expected number of components surviving at time, $t$ (note that $N_s(t=0) = N_{so}$);
$\lambda(t)$ = time-dependent (conditional) failure rate function.

Mean-Time-To-Failure (MTTF) $= 1/\lambda = \tau_f$, for $\lambda$ = constant.

COMPONENT REPAIR PROBABILITY

Component Repair Coefficient, $\mu(t)$:

$$\frac{1}{P_f(t)}\frac{dP_f(t)}{dt} = \frac{1}{N_f(t)}\frac{dN_f(t)}{dt} = -\mu(t)$$

where
$P_f(t)$ = probability that an individual component will be failed at time, $t$;
$N_f(t)$ = expected number of components failed at time, $t$ (note that $N_f(t=0) = N_{fo}$);
$\mu(t)$ = time-dependent (conditional) repair rate function.

Mean-Time-To-Repair (MTTR) $= 1/\mu = \tau_R$, for $\mu$ = constant.

Combined Repair and Failure

$$\frac{dN_s}{dt} = -\lambda N_s(t) + \mu N_f(t)$$
$$\frac{dN_f}{dt} = \lambda N_s(t) - \mu N_f(t)$$

These can be expressed as the matrix equation

$$\frac{d\mathbf{N}}{dt} = \mathbf{M}\,\mathbf{N}, \quad \text{where} \quad \mathbf{N} = \begin{pmatrix} N_s(t) \\ N_f(t) \end{pmatrix} \quad \text{and} \quad \mathbf{M} = \begin{pmatrix} -\lambda & \mu \\ \lambda & -\mu \end{pmatrix}.$$

This is the relationship for a Markov process, where for a single component:

$$\frac{d\mathbf{P}(t)}{dt} = \mathbf{M}\,\mathbf{P}(t), \quad \text{where} \quad \mathbf{P}(t) = \text{state vector of the component} = \begin{pmatrix} P_s(t) \\ P_f(t) \end{pmatrix}.$$

For the initial condition $P_s(t=0) = 1$ and $P_f(t=0) = 0$, the solution is:

$$P_s(t) = \frac{\mu}{\lambda+\mu} + \frac{\lambda}{\lambda+\mu}\,e^{-(\lambda+\mu)t}$$
$$P_f(t) = \frac{\lambda}{\lambda+\mu}\left[1 - e^{-(\lambda+\mu)t}\right]$$

Asymptotic result (i.e., as $t \to \infty$):

$$P_{s\infty} = \frac{\mu}{\lambda+\mu}, \qquad P_{f\infty} = \frac{\lambda}{\lambda+\mu}$$

[Figure: $P_s(t)$ falling from 1 toward $P_{s\infty}$ and $P_f(t)$ rising from 0 toward $P_{f\infty}$, plotted against time, $t$.]

COMPONENT CYCLE: RUN-TO-FAILURE, REPAIR AND RETURN-TO-SERVICE

Consider that the total mean cycle time is $\tau_{cycle}$ for the following sequence of component statuses:
a) Service, of mean duration $\tau_s$ (= MTTF)
b) Failure
c) Waiting for repair
d) Repaired and returned to service, with b) through d) of mean duration $\tau_R$ (= MTTR)

$$\tau_{cycle} = \tau_s + \tau_R = \frac{1}{\lambda} + \frac{1}{\mu} = \frac{\mu+\lambda}{\lambda\mu}$$
$$P_{s\infty} = \frac{\tau_s}{\tau_{cycle}} = \frac{\mu}{\mu+\lambda}$$
$$P_{f\infty} = \frac{\tau_R}{\tau_{cycle}} = \frac{\lambda}{\mu+\lambda}$$

EFFECTS OF COMPONENT TESTING AND INSPECTION

BENEFICIAL
• Verify That Component Is Operable
• Reveal Failures That Can Be Repaired
• Exercise Component and Maintain Operability
• Maintain Skills of Testing Team

HARMFUL
• Removal From Service Can Result in Complete Component Unavailability
• Wear and Tear Due to Testing (Wear, Fatigue, Corrosion, …)
• Introduction of New Defects (e.g., via Damage During Inspection, Fuel Depletion)
• Acceleration of Dependent Failures
• Damage or Degradation of Component via Incorrect Restoration to Service
• Human Error Can Cause Wrong Component to Be Removed From Service

TIME DEPENDENCE OF STANDBY COMPONENT UNAVAILABILITY, INCLUDING TEST AND REPAIR

[Figure: component unavailability (0 to 1.0) versus time, $t$, with tests at $t_1$ and $t_2$. Labels: test duration, $t_t$; mean repair duration, $t_R$; $f_R$, fraction of tested components requiring repair; post-test levels $Q_1$ and $Q_2$. Annotations: "Effect of tests, demonstrating that some failure modes have not become activated" and "Effect of untested failure modes and new defects". $t_1$ = time of first test; $t_i$ = time of i-th test.]

POST-TEST UNAVAILABILITY CAUSED BY
• Failures Requiring Repairs, Caused by Tests
• Defects Introduced by Tests, Resulting in Later Failures
• Incorrect Component (and Supporting System) Disengagement, Re-Engagement
• Incorrect Component Having Been Tested

MEAN UNAVAILABILITY, <Q>, UNDER DIFFERENT COMBINATIONS OF TESTING AND REPAIR: CASES TO BE CONSIDERED (λ = CONSTANT)

CASES
1. Asymptotic Component Unavailability as Function of µ, λ
2. Mean Component Unavailability During Standby Interval
3. Cycle Mean Unavailability Due to
   • Defects randomly introduced during standby, and
   • Unavailability due to testing and repairs
4. Cycle Mean Unavailability Due to
   • Pre-existing defects,
   • Defects introduced during standby, and
   • Unavailability due to testing and repairs
5. Standby Interval That Minimizes <Q>

CASE 1. ASYMPTOTIC AVAILABILITY WHEN FAILURES ARE MONITORED AND REPAIRED

Asymptotic Availability:

$$A_\infty = P_{s\infty} = \frac{\mu}{\mu+\lambda}$$

[Figure: $P_s$ falling from 1 toward $P_{s\infty}$ and $P_f$ rising from 0 toward $P_{f\infty}$, plotted against time, $t$.]

Note that MTTR $= 1/\mu = T_D$ ($T_D$ = repair-related down-time), so that

$$A = \frac{1}{1+\lambda T_D} \quad \text{and} \quad Q = 1 - A = \frac{\lambda T_D}{1+\lambda T_D};$$

also, $Q \approx \lambda T_D$ when $\lambda T_D \ll 1$.

CASE 2. MEAN UNAVAILABILITY DURING STANDBY PERIOD, ts

During Standby ($0 \le t \le t_s$):

$$Q(t) = P_f(t) = 1 - e^{-\lambda t} \approx 1 - (1 - \lambda t), \qquad Q(t) \approx \lambda t$$

[Figure: $P_f$ versus time over the standby period, $t_s$, with the mean value $\langle Q \rangle$ marked.]

$$\langle Q \rangle = \frac{t_D}{t_c} = \frac{1}{t_s}\int_0^{t_s} Q(t')\,dt' \approx \frac{1}{t_s}\int_0^{t_s} \lambda t'\,dt' = \frac{\lambda t_s^2}{2 t_s}$$

where $t_c$ = cycle time (here the cycle consists of the standby period only, so $t_c = t_s$) and $t_D$ = downtime. Thus

$$\langle Q \rangle = \frac{\lambda t_s}{2}$$

CASE 3. MEAN CYCLE UNAVAILABILITY, INCLUDING TESTING AND REPAIR

For the entire testing cycle, the expected unavailability, $\langle Q \rangle$, due to defects introduced randomly during standby and unavailability due to testing and repairs can be evaluated as:

$$\langle Q \rangle = \frac{1}{t_c}\int_0^{t_c} Q(t)\,dt = \frac{t_D}{t_c},$$

where the downtime, $t_D$, and the cycle time, $t_c$, are evaluated as shown under CASE 3 (continued).

[Figure: $Q$ versus time over one testing cycle, with the mean value $\langle Q \rangle$ marked.]

CASE 3. MEAN CYCLE UNAVAILABILITY (continued)

DOWNTIME: $t_D = t_{D_s} + t_{D_t} + t_{D_R}$
During Standby: $t_{D_s} = \dfrac{\lambda t_s^2}{2}$
During Testing: $t_{D_t} = t_t$
During Repair: $t_{D_R} = f_R t_R$
$f_R$ = repair frequency, the fraction of tests for which a repair is required

CYCLE TIME: $t_c = t_s + t_t + t_R$ (cycle = standby + testing + repair)

AVERAGE UNAVAILABILITY:

$$\langle Q \rangle = \frac{t_D}{t_c} = \frac{1}{t_s + t_t + t_R}\left[\frac{\lambda t_s^2}{2} + t_t + f_R t_R\right]$$

CASE 4. MEAN CYCLE UNAVAILABILITY, INCLUDING PRE-EXISTING UNAVAILABILITY, Qo

Evaluate the expected system unavailability, $\langle Q \rangle$, due to
• Pre-Existing Defects,
• Defects Introduced Randomly During Standby, and
• Unavailability Due to Testing and Repairs.

[Figure: $Q$ versus time over one testing cycle, starting from $Q_o$, with the mean value $\langle Q \rangle$ marked.]

CASE 4. MEAN CYCLE UNAVAILABILITY, INCLUDING PRE-EXISTING UNAVAILABILITY, Qo (continued)

DOWNTIME: $t_D = t_{D_s} + t_{D_t} + t_{D_R}$
During Standby: $t_{D_s} = Q_o t_s + (1 - Q_o)\dfrac{\lambda}{2} t_s^2$
$Q_o$ = expected unavailability due to pre-existing defects (i.e., those not interrogated during testing)
During Testing: $t_{D_t} = t_t$
During Repair: $t_{D_R} = f_R t_R$
For Entire Cycle: $t_D = Q_o t_s + (1 - Q_o)\dfrac{\lambda}{2} t_s^2 + t_t + f_R t_R$

CYCLE TIME: $t_c = t_s + t_t + t_R$ (cycle = standby + testing + repair)

AVERAGE UNAVAILABILITY:

$$\langle Q \rangle = \frac{t_D}{t_c} = \frac{1}{t_c}\left[Q_o t_s + (1 - Q_o)\frac{\lambda}{2} t_s^2 + t_t + f_R t_R\right]$$

COMBINED CASE OF EFFECT UPON STANDBY SYSTEM FAILURE OF PRE-EXISTING FAULT AND RANDOMLY INTRODUCED FAULT

[Figure: diagram of the events I and R.]

I = Pre-existing fault event
R = Random fault event
F = I + R = Component fault

Treating I and R as independent, with $P(I) = Q_o$ and $P(R) = \lambda t_s / 2$:

$$P(F) = P(I + R) = P(I) + P(R) - P(I)\cdot P(R)$$
$$P(F) = Q_o + \frac{\lambda t_s}{2} - Q_o\cdot\frac{\lambda t_s}{2}$$
$$P(F) = Q_o + (1 - Q_o)\frac{\lambda t_s}{2}$$

CASE 5. STANDBY INTERVAL THAT MINIMIZES <Q>

For a Good System: $t_t + f_R t_R \ll t_s$ and $Q_o \ll 1$, so that $t_c \approx t_s$ and

$$\langle Q \rangle \approx \frac{1}{t_s}\left[Q_o t_s + \frac{\lambda}{2} t_s^2 + t_t + f_R t_R\right]$$

The value of $t_s$ which minimizes $\langle Q \rangle$, $t_s^*$, is obtained from $\partial \langle Q \rangle / \partial t_s = 0$ as

$$t_s^* = \left[\frac{2(t_t + f_R t_R)}{\lambda}\right]^{1/2} = \left[2\tau_f\,(t_t + f_R t_R)\right]^{1/2}$$

$\tau_f$ = random defects contribution
$(t_t + f_R t_R)$ = testing and repair contribution

UNAVAILABILITY

• Failure density: $f_T(t) = \lambda e^{-\lambda t}, \quad t \ge 0$
• Cumulative Distribution Function (CDF): $F_T(t) = P(T \le t) = \int_0^t f_T(t')\,dt'$
• Unavailability, $Q(t)$: probability that the system is down at time $t$,

$$Q(t) = F_T(t) = \int_0^t f_T(t')\,dt' = 1 - e^{-\lambda t} \approx 1 - (1 - \lambda t), \qquad Q(t) \approx \lambda t$$

MEAN UNAVAILABILITY DURING STANDBY PERIOD, ts

$$\langle Q \rangle = \frac{1}{t_s}\int_0^{t_s} Q(t)\,dt \approx \frac{1}{t_s}\int_0^{t_s} \lambda t\,dt = \frac{\lambda t_s^2}{2 t_s}$$
$$\langle Q \rangle \approx \frac{\lambda t_s}{2}$$

MEAN CYCLE UNAVAILABILITY, INCLUDING TESTING AND REPAIR

[Figure: $Q(t)$ versus time over one cycle of length $t_c$, showing the standby period $t_s$, the test period $t_t$, the repair period $t_R$, and the mean value $\langle Q \rangle$.]

$$Q(t) = \begin{cases} \lambda t & (0 \le t \le t_s) \\ 1 & (t_s < t \le t_s + t_t) \\ f_R & (t_s + t_t < t \le t_c) \end{cases}$$

$$\langle Q \rangle = \frac{1}{t_c}\int_0^{t_c} Q(t)\,dt = \frac{1}{t_c}\left[\int_0^{t_s} \lambda t\,dt + \int_{t_s}^{t_s+t_t} dt + \int_{t_s+t_t}^{t_c} f_R\,dt\right] = \frac{1}{t_c}\left[\frac{\lambda}{2} t_s^2 + t_t + f_R t_R\right]$$

MEAN CYCLE UNAVAILABILITY INCLUDING PRE-EXISTING UNAVAILABILITY, Qo

[Figure: $Q(t)$ versus time over one cycle of length $t_c$, starting from $Q_o$, showing the standby period $t_s$, the test period $t_t$, the repair period $t_R$, and the mean value $\langle Q \rangle$.]

$$Q(t) = \begin{cases} Q_o + (1 - Q_o)\lambda t & (0 \le t \le t_s) \\ 1 & (t_s < t \le t_s + t_t) \\ f_R & (t_s + t_t < t \le t_c) \end{cases}$$

$$\langle Q \rangle = \frac{1}{t_c}\int_0^{t_c} Q(t)\,dt = \frac{1}{t_c}\left[\int_0^{t_s} \left(Q_o + (1 - Q_o)\lambda t\right)dt + \int_{t_s}^{t_s+t_t} dt + \int_{t_s+t_t}^{t_c} f_R\,dt\right] = \frac{1}{t_c}\left[Q_o t_s + (1 - Q_o)\frac{\lambda}{2} t_s^2 + t_t + f_R t_R\right]$$

STANDBY INTERVAL, ts*, THAT MINIMIZES <Q>

• $\langle Q \rangle = \dfrac{1}{t_c}\left[Q_o t_s + (1 - Q_o)\dfrac{\lambda}{2} t_s^2 + t_t + f_R t_R\right]$
• For a good system:
  $t_t + f_R t_R \ll t_s \;\Rightarrow\; t_c = t_s + t_t + t_R \approx t_s$
  $Q_o \ll 1 \;\Rightarrow\; (1 - Q_o) \approx 1$

$$\Rightarrow \langle Q \rangle \approx \frac{1}{t_s}\left[Q_o t_s + \frac{\lambda}{2} t_s^2 + t_t + f_R t_R\right] = Q_o + \frac{\lambda t_s}{2} + \frac{t_t + f_R t_R}{t_s}$$

Setting the derivative with respect to $t_s$ to zero at $t_s = t_s^*$:

$$\frac{\partial \langle Q \rangle}{\partial t_s}(t_s^*) = \frac{\lambda}{2} - (t_t + f_R t_R)\frac{1}{t_s^{*2}} = 0$$
$$\Rightarrow t_s^* = \left[\frac{2(t_t + f_R t_R)}{\lambda}\right]^{1/2}$$

STANDBY INTERVAL, ts*, THAT MINIMIZES <Q> (continued)

[Figure: $\langle Q \rangle$ versus time, with $t_s^*$ and $t_c$ marked.]

$$t_s^* = \left[\frac{2(t_t + f_R t_R)}{\lambda}\right]^{1/2} = \left[2\tau_f\,(t_t + f_R t_R)\right]^{1/2}$$

$\tau_f$ = random defects contribution
$(t_t + f_R t_R)$ = testing and repair contribution

MEAN UNAVAILABILITY, EXAMPLES

• Mean unavailability during standby period $t_s$:
  $t_s = 10^3$ hr, $\lambda = 10^{-4}$ hr$^{-1}$

$$\langle Q \rangle = \lambda\frac{t_s}{2} = 10^{-4} \times \frac{10^3}{2} = 0.05$$

• Mean cycle unavailability, including testing and repair:
  $t_s = 10^3$ hr, $\lambda = 10^{-4}$ hr$^{-1}$, $t_t = 25$ hr, $t_R = 60$ hr, $f_R = 0.01$

$$\langle Q \rangle = \frac{1}{t_c}\left[\frac{\lambda t_s^2}{2} + t_t + f_R t_R\right] = \frac{1}{10^3 + 25 + 60}\left[\frac{10^{-4} \times 10^{3 \times 2}}{2} + 25 + 0.01 \times 60\right] \approx 0.07$$

MEAN UNAVAILABILITY, EXAMPLES (continued)

• Mean cycle unavailability including $Q_o$:
  $t_s = 10^3$ hr, $\lambda = 10^{-4}$ hr$^{-1}$, $t_t = 25$ hr, $t_R = 60$ hr, $f_R = 0.01$, $Q_o = 0.02$

$$\langle Q \rangle = \frac{1}{t_c}\left[Q_o t_s + (1 - Q_o)\frac{\lambda t_s^2}{2} + t_t + f_R t_R\right] = \frac{1}{10^3 + 25 + 60}\left[0.02 \times 10^3 + (1 - 0.02)\frac{10^{-4} \times 10^{3 \times 2}}{2} + 25 + 0.01 \times 60\right] \approx 0.087$$

• Optimum standby interval $t_s^*$:

$$t_s^* = \left[\frac{2(t_t + f_R t_R)}{\lambda}\right]^{1/2} = \left[\frac{2(25 + 0.01 \times 60)}{10^{-4}}\right]^{1/2} \approx 715.54 \text{ hr}$$

EXAMINATION OF SEQUENCING OF TESTS

EXAMPLE OF TWO PARALLEL IDENTICAL COMPONENTS*
A) Successive Testing
B) Staggered Testing

* • Consider random failures during standby and time out-of-service during testing.
  • Ignore time out-of-service during repairs and pre-existing defects.

FOR REDUNDANT SYSTEMS CAN COMBINE INDIVIDUAL COMPONENT UNAVAILABILITY VALUES TO OBTAIN OVERALL SYSTEM UNAVAILABILITY, CONSIDER A 1/2 PARALLEL SYSTEM (e.g., Two Parallel EDGs), WHERE SUCCESS OF ONE COMPONENT IS SUFFICIENT FOR SYSTEM SUCCESS

[Figure: components A and B connected in parallel.]

$$Q_{system} = Q_A \cdot Q_B \quad \text{(ignoring dependencies)}$$

During Interval with Units A & B in Standby:

$$Q_s(t) = \left(1 - e^{-\lambda_A t_A}\right)\left(1 - e^{-\lambda_B t_B}\right) \approx \lambda_A t_A \cdot \lambda_B t_B = \lambda_A \lambda_B t_A t_B$$

$t_A$ = time that component A has been on standby
$t_B$ = time that component B has been on standby

Note: effects of downtime for repair are omitted from this analysis.
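The worked examples and the standby product rule above can be reproduced with a short script. This is a minimal sketch, not part of the lecture: the function names are hypothetical, the linearization $Q(t) \approx \lambda t$ and the cycle time $t_c = t_s + t_t + t_R$ are taken from the slides, and the two-unit product assumes independent components as the slides state.

```python
import math


def mean_cycle_unavailability(lam, t_s, t_t, t_r, f_r, q0=0.0):
    """Cycle-mean unavailability <Q> with the linearized Q(t) ~ lam * t.

    q0 is the pre-existing unavailability Q_o (0.0 recovers CASE 3).
    """
    t_c = t_s + t_t + t_r  # cycle time as defined in the lecture
    downtime = q0 * t_s + (1.0 - q0) * lam * t_s**2 / 2.0 + t_t + f_r * t_r
    return downtime / t_c


def optimal_standby_interval(lam, t_t, t_r, f_r):
    """Standby interval t_s* that minimizes <Q> in the good-system limit."""
    return math.sqrt(2.0 * (t_t + f_r * t_r) / lam)


def parallel_standby_unavailability(lam_a, t_a, lam_b, t_b):
    """Return (exact, linearized) Q_s for two independent units in standby."""
    exact = (1.0 - math.exp(-lam_a * t_a)) * (1.0 - math.exp(-lam_b * t_b))
    approx = lam_a * t_a * lam_b * t_b
    return exact, approx


if __name__ == "__main__":
    # Parameter values from the lecture's examples (times in hr, lam in 1/hr).
    lam, t_s, t_t, t_r, f_r = 1e-4, 1e3, 25.0, 60.0, 0.01

    print(round(lam * t_s / 2.0, 3))                                            # 0.05
    print(round(mean_cycle_unavailability(lam, t_s, t_t, t_r, f_r), 4))         # 0.0697
    print(round(mean_cycle_unavailability(lam, t_s, t_t, t_r, f_r, 0.02), 4))   # 0.0872
    print(round(optimal_standby_interval(lam, t_t, t_r, f_r), 2))               # 715.54

    # Two identical units, both on standby for 1000 hr (illustrative values).
    exact, approx = parallel_standby_unavailability(lam, 1e3, lam, 1e3)
    print(exact, approx)
```

For $\lambda t = 0.1$ the linearized product slightly overestimates the exact value, which is the conservative direction for an unavailability estimate.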
FOR REDUNDANT SYSTEMS CAN COMBINE INDIVIDUAL COMPONENT UNAVAILABILITY VALUES TO OBTAIN OVERALL SYSTEM UNAVAILABILITY, CONSIDER A 1/2 PARALLEL SYSTEM (e.g., Two Parallel EDGs), WHERE SUCCESS OF ONE COMPONENT IS SUFFICIENT FOR SYSTEM SUCCESS (continued)

During Interval with Unit A in Testing:
$$Q_s = 1 \cdot \left(1 - e^{-\lambda_B t_B}\right) \approx \lambda_B t_B$$

During Interval with Unit B in Testing:
$$Q_s = \left(1 - e^{-\lambda_A t_A}\right) \cdot 1 \approx \lambda_A t_A$$

During Interval with Unit A Possibly in Repair:
$$Q_s = f_{R_A}\left(1 - e^{-\lambda_B t_B}\right) \approx f_{R_A} \cdot \lambda_B t_B$$
where $f_{R_A}$ = repair frequency of Unit A

During Interval with Unit B Possibly in Repair:
$$Q_s = f_{R_B}\left(1 - e^{-\lambda_A t_A}\right) \approx f_{R_B} \cdot \lambda_A t_A$$
where $f_{R_B}$ = repair frequency of Unit B

ILLUSTRATION OF INDIVIDUAL COMPONENT (e.g., EDG) UNRELIABILITIES FOR A 1/2 PARALLEL SYSTEM GIVEN A STRATEGY OF TESTING EACH COMPONENT AT SUCCESSIVE INTERVALS (e.g., TESTING BOTH COMPONENTS DURING SAME OUTAGE)*

Let $\lambda_A = \lambda_B = \lambda$.

Testing Start Times:
Component A (tested first): $\tau_1 = \tau$, $\tau_2 = 2\tau + t_t$
Component B (tested second): $\tau_1' = \tau + t_t$, $\tau_2' = \tau_2 + t_t = 2\tau + 2t_t$

[Figure: unreliability of components A and B versus time; component A is tested first and component B second.]

* Role of repair omitted from the analysis.

ILLUSTRATION OF INDIVIDUAL COMPONENT (e.g., EDG) UNRELIABILITY FOR A 1/2 PARALLEL SYSTEM GIVEN A STRATEGY OF TESTING EACH COMPONENT AT EVENLY STAGGERED INTERVALS

[Figure: unreliability of components A and B versus time; component A is tested first and component B second, with the tests evenly staggered.]

COMPARISON OF MAXIMUM AND AVERAGE VALUES OF Q, FIRST CYCLE OF TESTING

$Q_{max}$ (system unavailability while one unit is in testing):
Successive Testing: $\lambda(\tau + t_t) \approx \lambda\tau$
Staggered Testing: $\lambda\left(\dfrac{\tau}{2} + t_t\right) \approx \lambda\dfrac{\tau}{2}$

$\langle Q \rangle_{cycle}$ (average of $\lambda^2 t_A t_B$ over the test interval $\tau$, neglecting test time):
Successive Testing: $\approx \dfrac{(\lambda\tau)^2}{3}$
Staggered Testing: $\approx \dfrac{5}{24}(\lambda\tau)^2$

HUMAN ERRORS ARE TYPICALLY MOST IMPORTANT

Also taking into account human errors committed during tests and repairs, and failure modes not tested previously:
$Q_o$ = unavailability due to defects existing at the start of the next testing cycle

$$Q_o = Q_U + Q_H,$$

where
$Q_U$ = unavailability due to failure modes not interrogated during the tests performed, and those activated upon demand;
$Q_H = \lambda_t t_t + \lambda_R t_R$;
$\lambda_t$ = rate of introduction of defects due to human errors during tests (e.g., system realignment errors), hr$^{-1}$;
$\lambda_R$ = rate of introduction of defects due to human errors during repairs (e.g., incorrectly installed gaskets, tools or debris left within a component), hr$^{-1}$.

SUMMARY
• Testing and Inspections Contribute to Simultaneous Increases and Decreases in System Availability
• These Contributions Can Be Balanced Optimally
• Staggered Testing Yields Lower Peak and Lower Mean System Unavailability vs. Successive Testing
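The last summary point can be checked numerically for two identical units. This is a minimal sketch under stated assumptions, not the lecture's own calculation: the function name and the `offset` parameter are hypothetical, test and repair downtime are neglected, and the standby-interval unavailability is taken as the linearized product $\lambda^2 t_A t_B$. Under these assumptions the averages should approach $(\lambda\tau)^2/3$ for successive testing and $5(\lambda\tau)^2/24$ for evenly staggered testing.

```python
def mean_system_unavailability(lam, tau, offset, steps=20_000):
    """Average lam^2 * t_A * t_B over one test interval tau (midpoint rule).

    offset is the lag of unit B's test behind unit A's, as a fraction of tau:
    0.0 models idealized successive testing (both units tested together),
    0.5 models evenly staggered testing. Test and repair time are neglected.
    """
    dt = tau / steps
    total = 0.0
    for i in range(steps):
        t_a = (i + 0.5) * dt              # time since unit A was last tested
        t_b = (t_a - offset * tau) % tau  # time since unit B was last tested
        total += (lam * t_a) * (lam * t_b) * dt
    return total / tau


if __name__ == "__main__":
    lam, tau, t_t = 1e-4, 1e3, 25.0  # illustrative values, hr^-1 and hr

    successive = mean_system_unavailability(lam, tau, offset=0.0)
    staggered = mean_system_unavailability(lam, tau, offset=0.5)
    print(successive, (lam * tau) ** 2 / 3.0)
    print(staggered, 5.0 * (lam * tau) ** 2 / 24.0)

    # Peak values while one unit is in test (the other carries the system).
    print(lam * (tau + t_t), lam * (tau / 2.0 + t_t))
```

With these values the staggered schedule gives both the lower mean and the lower peak, and the ratio of the means is 5/8.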