Fault-Tolerant Computing: Dealing with Low-Level Impairments
Oct. 2006, Fault Masking, Slide 1

Slide 2: About This Presentation

This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. © Behrooz Parhami

Edition   Released    Revised   Revised
First     Oct. 2006

Slide 3: Fault Masking

Slide 4: (no text on this slide)

Slide 5: Multilevel Model

[Figure: multilevel model of impairments. States, from best to worst:]
  Ideal
  Component level: Defective (low-level impaired)
  Logic level: Faulty (low-level impaired)
  Information level: Erroneous (mid-level impaired)
  System level: Malfunctioning (mid-level impaired)
  Service level: Degraded (high-level impaired)
  Result level: Failed (high-level impaired)
Legend: Initial entry, Deviation, Remedy, Tolerance
Figure markers: "Last lecture", "Today"

Slide 6: Handling Faults

[Figure: flowchart of fault-handling options. Labels recovered: Fault; Avoid; Tolerate; Quality Assurance; Testing; Prevent; Test; Repair; Discard; Miss; Monitor; Mask; Detect; Reconfigure; Abort; Expose; Remove; Static Redundancy; Dynamic Redundancy; Full? (Yes/No); Fixed; Screened; Failed Component; Injured; Perfect or Failed-safe System; Degraded; Restored; Unaffected State]

Slide 7: Some Options for Fault Tolerance

1. Detect and replace
   Dynamic redundancy (cold/hot standby)
   Detection via
   -- coding, watchdog timer, self-checking
   -- duplication (pair-and-spares)

2. Mask
   Static redundancy
   May revert to simplex instead of duplex
   Design challenges include
   -- synchronization for voting
   -- voting on imprecise results

3. Mask, diagnose, and reconfigure
   Hybrid redundancy
   Fault masked at output, but diagnosed
   -- e.g., via comparison with voter output
   Faulty circuit is replaced by spare
   Becomes static upon spare exhaustion
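Option 3 (mask, diagnose, and reconfigure) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: modules are modeled as plain callables, and the names `majority` and `HybridRedundantSystem` are hypothetical, not taken from the slides.

```python
from collections import Counter


def majority(values):
    """Return the value held by a strict majority of `values`, or None."""
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) // 2 else None


class HybridRedundantSystem:
    """Voted modules plus spares (hypothetical structure, for illustration)."""

    def __init__(self, active, spares):
        self.active = list(active)   # callables currently feeding the voter
        self.spares = list(spares)   # callables held in reserve

    def step(self, x):
        outputs = [module(x) for module in self.active]
        voted = majority(outputs)    # the fault is masked at the output
        if voted is None:
            raise RuntimeError("no majority: the fault could not be masked")
        # Diagnose by comparing each module's output with the voter output,
        # then replace any disagreeing module while spares remain.
        for i, out in enumerate(outputs):
            if out != voted and self.spares:
                self.active[i] = self.spares.pop(0)
        return voted


def good(x):
    return x + 1


def stuck(x):
    return -1


system = HybridRedundantSystem([good, stuck, good], [good])
result = system.step(1)   # the stuck module is outvoted, then swapped out
```

Once `spares` is empty, a disagreeing module simply stays in place and keeps being outvoted, which is the "becomes static upon spare exhaustion" behavior.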

Slide 8: Comparing Fault Tolerance Schemes

Detect and replace (detector D, module, spare)
   Advantages: less power (cold standby); long life (just add spares)
   Drawbacks: coverage factor; tolerance latency

Mask (modules 1, 2, 3 feeding voter V)
   Advantages: immediate masking; high safety
   Drawbacks: power/area penalty; voter critical

Mask, diagnose, and reconfigure (modules 1, 2, 3 and spare 4 feeding switch-voter S/V)
   Advantages: immediate masking; long life and high safety
   Drawbacks: power/area penalty; switch-voter critical

Slide 9: Inherent Fault Masking in Logic Circuits

[Figure: a nonredundant gate network with inputs a, b, c, d, internal signals e, f, g, h, and output z, annotated with signal values]

For the signal values shown:
   0→1 fault in b is critical
   0→1 fault in c or d is not critical (it is masked)
   1→0 fault in a or h is not critical (it is masked)

Even nonredundant circuits have some masking capability.
Is there a way to exploit the inherent masking capabilities of logic gates to achieve fault tolerance?

Slide 10: Interwoven Redundant Logic

[Figure: a gate network with inputs a, b, c, d, gates e, f, g, and output z, next to its interwoven version with signal copies a1..a4, b1..b4, e1..e4, f1..f4]

A 1→0 change is critical for AND, subcritical for OR.
A 0→1 change is critical for OR, subcritical for AND.
Alternating layers of ANDs and ORs can mask each other’s critical faults.

Let x1, x2, x3, and x4 be 4 copies of the signal x.

To mask h critical faults:
   Number of gates multiplied by $(h + 1)^2$
   Gate inputs multiplied by $h + 1$
For h = 1, the scheme is known as quadded logic.

Slide 11: Interwoven Logic for Nanoelectronics

Half-adder implemented in quadded logic
[Figure: half-adder with inputs a, b and outputs s, c, and its quadded version]
Source: IEEE D&T July-Aug. 2005 pp. 328-339
From: http://ieeexplore.ieee.org/iel5/54/32070/01492293.pdf

Slide 12: Highly Reliable Logic with “Crummy” Relays

Moore & Shannon, 1956

a: prob [contact made | energized]
c: prob [contact made | not energized]

No matter how crummy the relays (i.e., how close the values of a and c), one can interconnect many of them in a redundant series-parallel structure to achieve arbitrarily high reliability.
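The flavor of the Moore and Shannon result can be seen in a small numeric sketch. It rests on assumptions that are not on the slide: contacts behave independently, the building block is two parallel branches of two series contacts each, and a and c straddle that block's fixed point (about 0.618). The general construction needs extra series or parallel stages, which are not shown. The names `series_parallel` and `amplify` are hypothetical.

```python
def series_parallel(p):
    """Closure probability of two parallel branches, each made of two
    contacts in series, when every contact closes with probability p."""
    return 1.0 - (1.0 - p * p) ** 2


def amplify(a, c, levels):
    """Replace every contact by the 4-contact network, `levels` times over.

    Returns the resulting (a, c) pair and the number of contacts used.
    """
    contacts = 1
    for _ in range(levels):
        a, c = series_parallel(a), series_parallel(c)
        contacts *= 4
    return a, c, contacts


# Crummy make-contact relays: closes 70% of the time when energized,
# and 50% of the time when it should stay open.
a_out, c_out, n_contacts = amplify(0.7, 0.5, levels=6)
print(f"a = {a_out:.5f}, c = {c_out:.5f}, contacts = {n_contacts}")
```

With these starting values, six levels of substitution push a above 0.99 and c below 0.01, at the price of 4096 contacts per original contact.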

Slide 12 (continued): relay contact types

“Make” contact (normally open): a > c
“Break” contact (normally closed): a < c

Slide 13: TMR with Perfect Voter

[Figure: modules 1, 2, 3 feeding voter V]

$R = 3R_m^2 - 2R_m^3 \overset{?}{>} R_m$

Condition on the module reliability:
   $R = R_m[1 + (1 - R_m)(2R_m - 1)]$
   $(1 - R_m)(2R_m - 1) > 0$
   $R_m > 1/2$

Reliability improvement factor:
   $\mathrm{RIF}_{\mathrm{TMR/Simplex}} = (1 - R_m)/(1 - R) = 1/[1 - R_m(2R_m - 1)]$

[Plot: R versus $R_m$, both from 0.0 to 1.0. TMR is better for $R_m > 0.5$; simplex is better for $R_m < 0.5$.]
[Plot: R versus $\lambda t$ for TMR and simplex; the curves cross at $\lambda t = \ln 2$. MTTF: TMR $5/(6\lambda)$, simplex $1/\lambda$.]

Slide 14: TMR with Imperfect Voter

[Figure: modules 1, 2, 3 feeding voter V]

$R = R_v(3R_m^2 - 2R_m^3) \overset{?}{>} R_m$

Condition on the voter reliability:
   $R_v > 1/(3R_m - 2R_m^2)$
   $dR_v^{\min}/dR_m = (-3 + 4R_m)/(3R_m - 2R_m^2)^2$

Condition on the module reliability:
   $\frac{3 - \sqrt{9 - 8/R_v}}{4} < R_m < \frac{3 + \sqrt{9 - 8/R_v}}{4}$

Example: $R_v = 0.95$ requires that $0.56 < R_m < 0.94$

[Plot: minimum required $R_v$ versus $R_m$ (0.5 to 1.0). TMR is better above the curve, simplex below. The curve reaches its minimum at $R_m = 0.75$, where $R_v = 8/9 \approx 0.889$; the line $R_v = 0.95$ crosses it at $R_m = 0.56$ and $R_m = 0.94$.]

Slide 15: TMR with Compensating Faults

[Figure: modules 1, 2, 3 feeding voter V]

$R_m = 1 - p_0 - p_1$ ($p_0$, $p_1$: 0- and 1-fault probabilities)

$R = (3R_m^2 - 2R_m^3) + 6p_0p_1R_m$
   First term: basic TMR. Second term: compensation (one good module, with the other two modules having opposite faults that cancel in the vote).

Example: $R_m = 0.998$, $p_0 = p_1 = 0.001$
   R = 0.999,988 + 0.000,006 = 0.999,994
   $\mathrm{RIF}_{\mathrm{TMR/Simplex}}$ = 0.002 / 0.000,012 ≈ 167
   $\mathrm{RIF}_{\mathrm{Compen/TMR}}$ = 0.000,012 / 0.000,006 ≈ 2.0

Slide 16: Implementing a Bit-Voter

[Figure: inputs x1, x2, x3 feeding a bit-voter V with output y]

TMR bit-voting: $y = x_1x_2 \lor x_2x_3 \lor x_3x_1$
(carry output of a single-bit full-adder)

What about 5MR, 7MR?
Gate-level design quickly explodes in size.

Other designs are also possible:
   Arithmetic: add the bits, compare to threshold
   Mux-based
   Selection-based (majority of bit values is their median)

[Figure: 3-out-of-5 voter built of 2-input gates]
[Figure: two mux-based designs for a 3-out-of-5 bit-voter]

Slide 17: Complexity of Different Bit-Voter Designs

[Plot: cost of majority bit-voters as a function of the number n of inputs]
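The TMR reliability conditions from Slides 13 to 15 are easy to check numerically. The sketch below simply evaluates the formulas; the function names are hypothetical, and independent module and voter failures are assumed, as in the formulas themselves.

```python
import math


def tmr_reliability(rm, rv=1.0):
    """R = Rv (3 Rm^2 - 2 Rm^3); rv = 1.0 models a perfect voter."""
    return rv * (3 * rm**2 - 2 * rm**3)


def module_reliability_window(rv):
    """Range of Rm for which TMR with voter reliability rv beats simplex.

    Returns None when the voter is too unreliable for TMR to ever win.
    """
    disc = 9 - 8 / rv
    if disc <= 0:
        return None
    root = math.sqrt(disc)
    return (3 - root) / 4, (3 + root) / 4


def tmr_with_compensation(p0, p1):
    """Basic TMR term plus the compensating-fault term 6 p0 p1 Rm."""
    rm = 1 - p0 - p1
    return tmr_reliability(rm) + 6 * p0 * p1 * rm


low, high = module_reliability_window(0.95)
print(f"Rv = 0.95: TMR wins for {low:.2f} < Rm < {high:.2f}")
print(f"Compensating faults: R = {tmr_with_compensation(0.001, 0.001):.6f}")
```

Evaluating `tmr_reliability(0.5)` gives 0.5, the break-even point between TMR and simplex with a perfect voter.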
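The bit-voter alternatives listed on Slide 16 can be compared in software. This is a behavioral sketch, not a gate-level design: the function names are hypothetical, n is assumed odd, and the product-term count refers only to the minimal sum-of-products form.

```python
from itertools import combinations, product
from math import comb


def majority_sum_of_products(bits):
    """Gate-level form: OR over all ANDs of (n + 1) / 2 inputs (n odd)."""
    k = len(bits) // 2 + 1
    return int(any(all(group) for group in combinations(bits, k)))


def majority_arithmetic(bits):
    """Arithmetic form: add the bits and compare the sum to a threshold."""
    return int(sum(bits) > len(bits) // 2)


def majority_by_selection(bits):
    """Selection form: the majority of n bits (n odd) is their median."""
    return sorted(bits)[len(bits) // 2]


def product_term_count(n):
    """Number of AND terms in the sum-of-products form for n inputs."""
    return comb(n, n // 2 + 1)


def designs_agree(n):
    """Check all 2^n input patterns against the three formulations."""
    return all(
        majority_sum_of_products(bits)
        == majority_arithmetic(bits)
        == majority_by_selection(bits)
        for bits in product((0, 1), repeat=n)
    )


for n in (3, 5, 7, 9):
    print(n, product_term_count(n))
```

For n = 3 the sum-of-products form has the three terms x1x2, x2x3, x3x1; the term count grows quickly with n, which is the sense in which the gate-level design "explodes in size".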

Slide 18: Voting at the Word Level

Using bit-by-bit voting may be dangerous.

   x1 = 0 0          x1 = 0 0 0
   x2 = 1 0          x2 = 1 0 1
   x3 = 1 1          x3 = 1 1 0
   y  = 1 0          y  = 1 0 0

One might think that in this example, any of the module outputs could be correct, so that producing 1 0 at the output isn’t all that wrong.
However, with bit-by-bit voting, the output may be different from all inputs.

Design of bit- and word-voting networks discussed in:
Parhami, B., “Voting Networks,” IEEE TR, Aug. 1991

Slide 19: Some Simple Voter Designs

If in the case of 3-way disagreement any of the inputs can be chosen, then a simple design is possible.
This design can be readily generalized to a larger number of inputs.

[Figure: x1 and x2 feed a comparator; its Disagree signal controls a two-input multiplexer (inputs 0 and 1) that produces y; x3 is the third data input]

One can perform pseudo voting that yields the median of 3 analog signals (Dennis, N.G., Microelectronics and Reliability, Aug. 1974).
Median and mean voting are also possible with digital signals.

Slide 20: Switch for Standby Redundancy

Standby redundancy requires an n-to-1 switch to select the output of the currently active module.

The detectors use various info to deduce fault conditions
-- Error coding
-- Reasonableness checks
-- Watchdog timer

Once a fault has been detected, the switch reconfigures the system by flagging the faulty unit and activating the next spare in sequence.

[Figure: modules 1, 2, 3, ... (the active module and its spares), each with a detector D, feeding an n-to-1 switch]

If we use an n-to-2 switch and compare the two selected outputs, the configuration is known as “pair-and-spares”.
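The danger of bit-by-bit voting described on Slide 18 can be reproduced with a short sketch. Module outputs are modeled as equal-length bit strings, the function names are hypothetical, and the input words are illustrative.

```python
from collections import Counter


def bitwise_vote(words):
    """Vote each bit position independently across the module outputs."""
    n = len(words)
    return "".join(
        "1" if sum(w[i] == "1" for w in words) > n // 2 else "0"
        for i in range(len(words[0]))
    )


def word_vote(words):
    """Return the word produced by a strict majority of modules, else None."""
    word, count = Counter(words).most_common(1)[0]
    return word if count > len(words) // 2 else None


outputs = ["000", "101", "110"]      # three-way disagreement
bit_result = bitwise_vote(outputs)   # a word that no module produced
word_result = word_vote(outputs)     # None: the disagreement is reported
print(bit_result, bit_result in outputs, word_result)
```

A word-level voter can signal that no majority exists, whereas the bit-level voter silently fabricates an output.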
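The standby switch behavior from Slide 20 (flag the faulty unit, activate the next spare in sequence) can be sketched as follows. The class name `StandbySwitch` and the detector interface are assumptions for illustration; a real detector would rely on error coding, reasonableness checks, or a watchdog timer rather than a Python predicate.

```python
class StandbySwitch:
    """n-to-1 switch that forwards the output of the currently active module."""

    def __init__(self, modules, detector):
        self.modules = list(modules)   # module 0 starts active; the rest are spares
        self.detector = detector       # detector(x, y) -> True if y looks faulty
        self.faulty = [False] * len(self.modules)
        self.active = 0

    def step(self, x):
        while self.active < len(self.modules):
            y = self.modules[self.active](x)
            if not self.detector(x, y):
                return y
            self.faulty[self.active] = True   # flag the faulty unit
            self.active += 1                  # activate the next spare in sequence
        raise RuntimeError("all modules flagged faulty: spares exhausted")


def square(x):
    return x * x


def broken(x):
    return -1


# Reasonableness check (assumed for this example): a square is never negative.
switch = StandbySwitch([broken, square, square], lambda x, y: y < 0)
value = switch.step(3)
```

Note that the quality of the result depends entirely on the detector: a fault it misses passes straight through the switch, which is the coverage-factor drawback of detect-and-replace schemes.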

Slide 21: Switch for Hybrid Redundancy

Hybrid redundancy with n active and s spare modules requires an (n + s)-to-n switch to select the outputs of the active modules.

Self-purging redundancy is a variant of hybrid redundancy in which all modules are active at the outset, but they are purged as they disagree with the majority output.

The voter in self-purging redundancy is a threshold voter that considers the inputs with weights of 1 (active) or 0 (purged).

[Figure: modules 1, 2, 3 and spare 4, ... feeding switch-voter S/V]

Slide 22: Applications of nMR and Hybrid Redundancy

The Space Shuttle:
   Uses 5-way redundancy in hardware
   Originally, 3 operational units and 2 spares (one warm, one cold)
   More recently, 4 operational units and 1 spare
   Additionally, uses two independently developed software systems

Japanese Shinkansen “Bullet” Train:
   Triple-duplex system (6-fold redundancy)
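The self-purging scheme described on Slide 21 can be sketched with a weighted voter. The slide does not say how the threshold is set; the sketch assumes it is a strict majority of the currently active modules, and the class name `SelfPurgingVoter` is hypothetical.

```python
class SelfPurgingVoter:
    """Threshold voter over single-bit inputs with weights 1 (active) or 0 (purged)."""

    def __init__(self, n):
        self.weights = [1] * n   # all modules are active at the outset

    def vote(self, bits):
        active = sum(self.weights)
        ones = sum(w * b for w, b in zip(self.weights, bits))
        # Assumed threshold: strict majority of the modules still active.
        y = int(2 * ones > active)
        # Purge every active module that disagrees with the voted output.
        for i, b in enumerate(bits):
            if self.weights[i] and b != y:
                self.weights[i] = 0
        return y


voter = SelfPurgingVoter(5)
first = voter.vote([1, 1, 1, 0, 1])    # module 3 disagrees and is purged
second = voter.vote([1, 1, 0, 0, 1])   # module 2 disagrees and is purged
third = voter.vote([0, 0, 1, 1, 0])    # purged modules no longer count
```

After the first two votes only three modules still carry weight, so the third vote is decided by those three alone.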