Fault-Tolerant Computing Motivation, Background, and Tools Oct. 2006 Terminology, Models, and Measures Slide 1 About This Presentation This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami Oct. 2006 Edition Released First Oct. 2006 Revised Terminology, Models, and Measures Revised Slide 2 Terminology, Models, and Measures for Dependability Oct. 2006 Terminology, Models, and Measures Slide 3 Oct. 2006 Terminology, Models, and Measures Slide 4 Impairments to Dependability Oct. 2006 Terminology, Models, and Measures Slide 5 The Fault-Error-Failure Cycle Includes both components and design Aspect Impairment Structure State Behavior Fault Error Failure Fault 0 Correct signal 0 0 Replaced with NAND? Schematic diagram of the Newcastle hierarchical model and the impairments within one level. Oct. 2006 Terminology, Models, and Measures Slide 6 The Four-Universe Model Universe Impairment Physical Logical Informational External Failure Fault Error Crash Cause-effect diagram for Avižienis’ four-universe model of impairments to dependability. Oct. 2006 Terminology, Models, and Measures Slide 7 Unrolling the Fault-Error-Failure Cycle Aspect Impairment Structure State Behavior Fault Error Failure First Cycle Second Cycle Abstraction Impairment Component Logic Information System Service Result Defect Fault Error Malfunction Degradation Failure LowLevel MidLevel HighLevel Cause-effect diagram for an extended six-level view of impairments to dependability. Oct. 2006 Terminology, Models, and Measures Slide 8 Multilevel Model Component Logic Defective Legend: Legned: Service Result Oct. 2006 Low-Level Impaired Faulty Initial Entry Entry Information System Ideal Erroneous Mid-Level Impaired Deviation Malfunctioning Remedy Degraded Tolerance Failed Terminology, Models, and Measures High-Level Impaired Slide 9 Analogy for the Multilevel Model An analogy for our multi-level model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow. Oct. 2006 Terminology, Models, and Measures Slide 10 Why Our Concern with Dependability? Reliability of n-transistor system, each having failure rate l R(t) = e–nlt There are only 3 ways of making systems more reliable 1.0 Reduce l .9999 .9990 .9900 .9048 0.8 Reduce n Reduce t 0.6 –n l t e 0.4 Alternative: Change the reliability formula by introducing redundancy in system Oct. 2006 .3679 0.2 0.0 104 106 Terminology, Models, and Measures nt 108 1010 Slide 11 Highly Dependable Computer Systems Long-life systems: Fail-slow, Rugged, High-reliability Spacecraft with multiyear missions, systems in inaccessible locations Methods: Replication (spares), error coding, monitoring, shielding Safety-critical systems: Fail-safe, Sound, High-integrity Flight control computers, nuclear-plant shutdown, medical monitoring Methods: Replication with voting, time redundancy, design diversity High-availability: Fail-soft, Robust, High-availability Telephone switching centers, transaction processing, e-commerce Methods: HW/info redundancy, backup schemes, hot-swap, recovery Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers Oct. 2006 Terminology, Models, and Measures Slide 12 Aspects of Dependability Oct. 2006 Terminology, Models, and Measures Slide 13 Concepts from Probability Theory Probability density function: pdf f(t) = prob[t x t + dt] / dt = dF(t) / dt Liftimes of 20 identical systems Cumulative distribution function: CDF F(t) = prob[x t] = 0t f(x) dx Expected value of x + Ex = - x f(x) dx = k xk f(xk) 0 10 20 30 40 50 30 40 50 30 40 50 1.0 0.8 CDF 0.6 Variance of x + 2 sx = - (x – Ex)2 f(x) dx = k (xk – Ex)2 f(xk) Time 0.4 F(t) 0.2 0.0 0 10 20 Time 0.05 Covariance of x and y yx,y = E [(x – Ex)(y – Ey)] = E [x y] – Ex Ey pdf 0.04 0.03 f(t) 0.02 0.01 0.00 0 Oct. 2006 10 Terminology, Models, and Measures 20 Time Slide 14 Some Simple Probability Distributions F(x) 1 CDF CDF CDF CDF Normal Binomial f(x) pdf Uniform Oct. 2006 pdf Exponential pdf Terminology, Models, and Measures Slide 15 Reliability and MTTF Reliability: R(t) Probability that system remains in the “Good” state through the interval [0, t] Two-state nonrepairable system R(t + dt) = R(t) [1 – z(t) dt] Hazard function R(t) = 1 – F(t) Start State Failure Failed CDF of the system lifetime, or its unreliability Constant hazard function z(t) = l R(t) = e–lt (system failure rate is independent of its age) Mean time to failure: MTTF + + MTTF = 0 t f(t) dt = 0 R(t) dt Expected value of lifetime Oct. 2006 Good Exponential reliability law Area under the reliability curve (easily provable) Terminology, Models, and Measures Slide 16 Failure Distributions of Interest Exponential: z(t) = l R(t) = e–lt MTTF = 1/l Discrete versions Geometric R(k) = q k Rayleigh: z(t) = 2l(lt) R(t) = e(-lt)2 MTTF = (1/l) p / 2 Weibull: z(t) = al(lt) a–1 R(t) = e(-lt)a MTTF = (1/l) G(1 + 1/a) Discrete Weibull Erlang: MTTF = k/l Gamma: Erlang and exponential are special cases Normal: Reliability and MTTF formulas are complicated Oct. 2006 Terminology, Models, and Measures Binomial Slide 17 Comparing Reliabilities Reliability difference: R2 – R1 Reliability gain: R2 / R1 Reliability improvement factor RIF2/1 = [1–R1(tM)] / [1–R2(tM)] Reliability functions for Systems 1/2 System Reliability (R) Example: [1 – 0.9] / [1 – 0.99] = 10 Reliability improv. index RII = log R1(tM) / log R2(tM) Mission time extension MTE2/1(rG) = T2(rG) – T1(rG) Mission time improv. factor: MTIF2/1(rG) = T2(rG) / T1(rG) Oct. 2006 Terminology, Models, and Measures Slide 18 Availability, MTTR, and MTBF (Interval) Availability: A(t) Fraction of time that system is in the “Up” state during the interval [0, t] Two-state repairable system Steady-state availability: A = limt A(t) Pointwise availability: a(t) Probability that system available at time t A(t) = (1/t) 0t a(x) dx Repair Start State Down Up Failure Availability = Reliability, when there is no repair Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair) Repair rate MTTF MTTF m A= = = 1/m = MTTR MTTF + MTTR MTBF l+m In general, m >> l, leading to A 1 Oct. 2006 Terminology, Models, and Measures (Will justify this equation later) Slide 19 System Up and Down Times Short repair time implies good Maintainability (serviceability) Time to first failure Repair Start State Down Up Failure Time between failures Repair time Up Down 0 t1 t 2 t'2 t'1 t Time Oct. 2006 Terminology, Models, and Measures Slide 20 Performability and MCBF Performability: P Composite measure, incorporating both performance and reliability Three-state degradable system Repair Start State Up 2 Up 1 Partial repair Down Failure Partial failure Simple example Worth of “Up2” twice that of “Up1” t pUpi = probability system is in state Upi Question: What is system availability here? P = 2pUp2 + pUp1 pUp2 = 0.92, pUp1 = 0.06, pDown = 0.02, P = 1.90 (system performance equiv. To that of 1.9 processors on average) Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails: PIF = (2 – 2 0.92) / (2 – 1.90) = 1.6 Oct. 2006 Terminology, Models, and Measures Slide 21 System Up, Partially Up, and Down Times Important to prevent direct transitions to the “Down” state (coverage) Repair Start State Up 2 Up 1 Partial failure Partial repair Down Failure Partial Failure Up Partially Up Total Failure Partial Repair t2 t'2 Time MCBF Down 0 Oct. 2006 t1 Terminology, Models, and Measures t'1 t 3 t'3 t Slide 22 Integrity and Safety Risk: Prob. of being in “Unsafe Failed” state There may be multiple unsafe states, each with a different consequence (cost) Simple analysis Lump “Safe Failed” state with “Good” state; proceed as in reliability analysis Three-state fail-safe system Failure Start State Good Failure More detailed analysis Even though “Safe Failed” state is more desirable than “Unsafe Failed”, it is still not as desirable as the “Good” state; so keeping it separate makes sense Safe Failed Unsafe Failed For example, if a repair transition is introduced between “Safe Failed” and “Good” states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability Oct. 2006 Terminology, Models, and Measures Slide 23