ELEC 7770 Advanced VLSI Design, Spring 2010
Soft Errors and Fault-Tolerant Design

Vishwani D. Agrawal
James J. Danaher Professor
ECE Department, Auburn University, Auburn, AL 36849
vagrawal@eng.auburn.edu
http://www.eng.auburn.edu/~vagrawal/COURSE/E7770_Spr10

Spring 2010, Apr 14. ELEC 7770: Advanced VLSI Design (Agrawal)

Soft Errors
- Soft errors are errors caused by the operating environment.
- They are not due to a permanent hardware fault.
- Soft errors are intermittent or random, which makes testing for them unreliable.
- One way to deal with soft errors is to make the hardware robust:
  - Capable of detecting soft errors
  - Capable of correcting soft errors
- Both measures are probabilistic.

Some Early References
- J. von Neumann, “Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components,” pp. 329-378, 1959, in A. H. Taub, editor, John von Neumann: Collected Works, Volume V: Design of Computers, Theory of Automata and Numerical Analysis, Oxford University Press, 1963.
- M. A. Breuer, “Testing for Intermittent Faults in Digital Circuits,” IEEE Trans. Computers, vol. C-22, no. 3, pp. 241-246, March 1973.
- T. C. May and M. H. Woods, “Alpha-Particle-Induced Soft Errors in Dynamic Memories,” IEEE Trans. Electron Devices, vol. ED-26, no. 1, pp. 2-9, 1979.

Causes of Soft Errors
- Interconnect coupling (crosstalk).
- Power supply noise: IR-drop, power droop, ground bounce.
- Ignition noise.
- Electromagnetic pulse (EMP).
- Effects generally attributed to alpha-particles:
  - Charged particles: electrons, protons, ions.
  - Radiation (photons): X-rays, gamma-rays, ultra-violet light.

Sources of Alpha-Particles
- Radioactive contamination in VLSI packaging material.
- Ionosphere, magnetosphere and solar radiation.
- Other electromagnetic radiation.

Alpha-Particle
- Helium nucleus: two protons and two neutrons.
- Mass = $6.65 \times 10^{-27}$ kg.
- Charge = $+2e$ ($e = 1.6 \times 10^{-19}$ C).
- Rest-mass energy, $mc^2$ = 3.73 GeV.

Soft Error Rate (SER)
- Failures in time (FIT): one FIT is 1 error per billion ($10^9$) hours of operation.
- An alternative unit is mean time between failures (MTBF) or mean time to failure (MTTF).
- 1 year MTBF = $10^9 / (365 \times 24)$ = 114,155 FIT.

Particle Strike
[Figure: an ion or charged particle strikes an n+ region over a p- substrate, generating + and - charges along its track.]

Induced Current
[Figure: plot of induced current versus time.]
- $I(t) = I_0 \left( e^{-t/a} - e^{-t/b} \right)$, with $a \gg b$.

Voltage Induced at a Node
- $V = Q/C$, where $Q = \int I(t)\,dt$ and $C$ = node capacitance.
- A smaller node capacitance results in a larger voltage swing.

Effect on Digital Circuit
[Figure: clocked (CK) digital circuit with combinational logic between IN and OUT; charged particles strike at two locations.]

An SRAM Cell
[Figure: SRAM cell with word line WL, supply VDD and two bit lines (BL), storing a 0 bit and a 1 bit on its two internal nodes.]

SRAM Cell Struck by Alpha-Particle
- Single-event upset (SEU).
[Figure: the same SRAM cell struck by charged particles; the stored bits flip 0→1 and 1→0.]

A Resistor Hardened SRAM Cell
[Figure: resistor-hardened SRAM cell with WL, VDD, two bit lines (BL), a 0 bit and a 1 bit.]

D-Latch
[Figure: D-latch with input D, outputs Q and its complement, holding 1 and 0 while CK = 0.]

SEU in D-Latch
[Figure: the same D-latch with CK = 0 struck by charged particles; the stored values flip 1→0 and 0→1.]

Single Event Transients in Combinational Logic
[Figure: combinational logic between clocked (CK) storage elements, with signal values marked; charged particles strike an internal node.]
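
As a numerical check on the soft error rate units and the induced-voltage relations given earlier (FIT versus MTBF, and V = Q/C with the double-exponential current), the following Python sketch may help. The function names and the example values of I0, a, b and C are illustrative assumptions, not values from the lecture. The charge is the closed-form integral of I(t) from 0 to infinity, which works out to I0(a - b).

```python
HOURS_PER_YEAR = 365 * 24          # 8760 hours, as used on the SER slide
FIT_HOURS = 1e9                    # one FIT = 1 error per 10^9 hours


def mtbf_hours_to_fit(mtbf_hours):
    """Convert a mean time between failures (in hours) to a FIT rate."""
    return FIT_HOURS / mtbf_hours


def collected_charge(i0, a, b):
    """Integral of I(t) = I0*(exp(-t/a) - exp(-t/b)) over t from 0 to infinity."""
    return i0 * (a - b)


def node_voltage(i0, a, b, c_node):
    """Voltage disturbance V = Q/C at a node of capacitance c_node."""
    return collected_charge(i0, a, b) / c_node


# Hypothetical example values, assumed for illustration only
i0 = 1e-4       # current scale factor I0, amperes
a = 200e-12     # slower time constant, seconds (a >> b)
b = 50e-12      # faster time constant, seconds

print(round(mtbf_hours_to_fit(HOURS_PER_YEAR)))   # 114155
for c_node in (10e-15, 100e-15):                  # 10 fF and 100 fF nodes
    print(c_node, node_voltage(i0, a, b, c_node))
```

With these assumed numbers the same strike disturbs the 10 fF node ten times more than the 100 fF node, which is the point of the "smaller capacitance, larger swing" remark.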

Effects of Transients
- Error correcting effects:
  - Transient pulse is filtered by gate inertia.
  - Transient is blocked by an unsensitized path.
  - Transient is blocked by an inactive clock.
- Error enhancing effects:
  - A large number of gates can produce multiple pulses.
  - Fanouts can multiply error pulses.

Typical Soft Error Distribution
[Figure from: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.]

Soft Error Simulation
- F. Wang and V. D. Agrawal, “Soft Error Rate with Inertial and Logical Masking,” Proc. 22nd International Conference on VLSI Design, January 2009, pp. 459-464.
- F. Wang and V. D. Agrawal, “Soft Error Rate Determination for Nanoscale Sequential Logic,” Proc. 11th International Symposium on Quality Electronic Design (ISQED), March 2010.

SEUs in FPGA
- Parts that can be affected:
  - Look-up table (LUT)
  - Configuration memory cell
  - Flip-flop
  - Block RAM
- Ref.: F. L. Kastensmidt, L. Carro and R. Reis, Fault-Tolerant Techniques for SRAM-Based FPGAs, Springer, 2006.

LUT
[Figure: four-input LUT with inputs F1 to F4; the memory cells hold the truth table that drives the output "out".]

SEU in LUT
[Figure: the same LUT after a charged particle strike; one memory cell has changed from 1 to 0, which alters the function seen at "out".]

Four Types of SEU in FPGA
[Figure: FPGA logic block with a LUT (inputs F1 to F4), a flip-flop (FF), configuration memory cells (M) and block RAM; the upset locations are labeled Type 1 through Type 4.]
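
To make the LUT upset concrete, here is a minimal Python sketch of a four-input LUT modeled as 16 memory cells, with one cell flipped by an SEU. The LUT contents used here (an AND-OR function) and the helper names are assumptions chosen for illustration; they are not the contents shown in the lecture figure.

```python
def lut_output(cells, f1, f2, f3, f4):
    """Read a 4-input LUT: the inputs select one of 16 memory cells."""
    address = f1 * 8 + f2 * 4 + f3 * 2 + f4
    return cells[address]


def inject_seu(cells, address):
    """Return a copy of the LUT contents with one memory cell flipped."""
    upset = list(cells)
    upset[address] ^= 1
    return upset


def affected_inputs(golden, faulty):
    """List the input addresses at which the two LUTs disagree."""
    return [addr for addr in range(16) if golden[addr] != faulty[addr]]


# Hypothetical LUT contents: out = (F1 AND F2) OR (F3 AND F4)
golden = [(f1 & f2) | (f3 & f4)
          for f1 in (0, 1) for f2 in (0, 1)
          for f3 in (0, 1) for f4 in (0, 1)]

# Upset the cell addressed by F1=1, F2=1, F3=0, F4=0 (address 12): 1 changes to 0
faulty = inject_seu(golden, 12)

print(lut_output(golden, 1, 1, 0, 0))    # 1
print(lut_output(faulty, 1, 1, 0, 0))    # 0
print(affected_inputs(golden, faulty))   # [12]
```

The upset changes the implemented logic function for exactly one input combination, and in this model the wrong value stays in place until the cell is rewritten.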

SEU Detection Methods
- Hardware redundancy
- Time redundancy
- Error detection codes (EDC)
- Self-checker techniques

SEU Mitigation Techniques
- Triple modular redundancy (TMR)
- Multiple redundancy with voting
- Error detection and correction codes (EDAC)
- Hardened memory cells
- FPGA-specific methods:
  - Reconfiguration
  - Partial configuration
  - Rerouting design

Hardware Redundancy for Detection
[Figure: the inputs drive the combinational logic and a duplicated copy; their outputs are compared, and logic 1 on the comparison output indicates an error.]
- Hardware overhead is high, about 100%.
- Performance penalty is negligible.

Time Redundancy for Detection
[Figure: the combinational logic output is captured by two D flip-flops, one clocked by CK and the other by CK + d; the two captured values are compared, and logic 1 indicates an error.]
- Hardware overhead is low.
- Performance penalty (about d) = maximum detectable pulse width.

Repeat on Error Detection
[Figure: the time-redundancy circuit with a C-element (C) driving the output.]
- Operation: if an error is detected, the output retains its previous value. Repeating the computation can produce the correct result.

Muller C-Element
[Figure: C-element symbol with inputs A and B, and an implementation built around an SR latch (S, R, Q).]
A  B  output
0  0  0
0  1  Old output
1  0  Old output
1  1  1

Dynamic CMOS C-Element
[Figure: dynamic CMOS C-element with inputs A and B.]
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0

Pseudostatic CMOS C-Element
[Figure: CMOS C-element with inputs A and B and a weak keeper on the output.]
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0
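
The hold behavior in the C-element truth tables, and its use for repeat-on-error, can be sketched behaviorally in Python. This is a minimal model under stated assumptions: it implements the non-inverting Muller C-element (the CMOS versions in the tables above are inverting), and the class and function names are hypothetical.

```python
class CElement:
    """Behavioral model of a non-inverting Muller C-element."""

    def __init__(self, initial=0):
        self.output = initial

    def update(self, a, b):
        # Inputs agree: the output follows them.
        # Inputs differ: the output holds its old value.
        if a == b:
            self.output = a
        return self.output


def error_flag(sample_ck, sample_ck_d):
    """Time-redundancy check: samples taken at CK and CK + d should match."""
    return int(sample_ck != sample_ck_d)


c = CElement(initial=0)
trace = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]
for a, b in trace:
    # Columns: A, B, error flag, C-element output
    print(a, b, error_flag(a, b), c.update(a, b))
```

Whenever the two samples disagree, the error flag is 1 and the modeled output keeps its previous value, so a transient that corrupts only one sample does not propagate.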

Built-In Soft Error Resilience (BISER)
[Figure: data from the combinational logic feeds a flip-flop and a duplicate flip-flop on the same clock; their outputs drive inputs A and B of a C-element with a weak keeper on its output.]
A  B  output
0  0  1
0  1  Old output
1  0  Old output
1  1  0

BISER
- Assumptions:
  - Most soft errors in combinational logic are eliminated by inertial and logic masking.
  - The soft error pulse generated in a flip-flop is much shorter than the clock period.
  - The probability of either a master or slave latch being struck by a soft error exactly at the clock edge is small.
- The flip-flop is duplicated and the outputs are fed to a C-element.
- A twenty-fold reduction in soft errors was observed.
- Ref.: S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, “Robust System Design with Built-In Soft-Error Resilience,” Computer, vol. 38, no. 2, pp. 43-52, February 2005.

Triple Modular Redundancy (TMR)
[Figure: the inputs feed three copies of the combinational logic (copy 1, copy 2, copy 3); a majority voter combines the three results into the output.]

TMR Error Reduction
- Voter input error probability = E, assumed independent for each input.
- Output error probability, e = Prob(two errors or three errors):
  $e = \binom{3}{2} E^2 (1 - E) + \binom{3}{3} E^3 = 3E^2 - 3E^3 + E^3 = 3E^2 - 2E^3$
- For very small E, $E^3 \ll E^2$, so $e \approx 3E^2$.

TMR Error Probability
Input error probability, E    Output error probability, e
0.0                           0.0
0.001                         0.000002998
0.01                          0.000298
0.1                           0.028
0.2                           0.104
0.3                           0.216
0.4                           0.352
0.5                           0.5
0.6                           0.648

Majority Voter Circuit
[Figure: majority voter with inputs A, B and C.]
A  B  C  output
0  0  0  0
0  0  1  0
0  1  0  0
0  1  1  1
1  0  0  0
1  0  1  1
1  1  0  1
1  1  1  1
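
The TMR error expression and the voter function can be cross-checked with a short Python sketch. It evaluates e = 3E²(1 - E) + E³ directly and also by summing the probabilities of all error patterns that the majority voter lets through. The function names are assumptions made for this illustration.

```python
from itertools import product


def majority(a, b, c):
    """Majority voter: output is 1 when at least two inputs are 1."""
    return (a & b) | (b & c) | (a & c)


def tmr_error_probability(E):
    """Probability that two or three of the voter inputs are in error."""
    return 3 * E**2 * (1 - E) + E**3


def tmr_error_by_enumeration(E):
    """Same quantity, summed over all eight input-error patterns."""
    total = 0.0
    for errs in product((0, 1), repeat=3):
        if majority(*errs):                  # a majority of inputs are wrong
            p = 1.0
            for bit in errs:
                p *= E if bit else (1 - E)
            total += p
    return total


# Columns: E, exact e, small-E approximation 3E^2
for E in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(E, tmr_error_probability(E), 3 * E**2)
```

The printout shows the approximation 3E² tracking the exact value closely for small E and overestimating it as E grows; at E = 0.5 the voter gives no benefit, since e = E.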

Alternative Implementations of Voter
[Figure: two voter implementations with inputs A, B and C: a circuit powered from VDD, and a LUT addressed by ABC with contents 0 0 0 1 0 1 1 1.]

Triple Modular Redundancy (TMR)
[Figure: time-redundant TMR. The combinational logic output is captured by three D flip-flops clocked at CK, CK + d and CK + 2d; a majority voter feeds an output flip-flop clocked at CK + 3d.]

TMR for Memory Cells
[Figure: the combinational logic output is stored in three D flip-flops, all clocked by CK, followed by a majority voter that produces the output.]
Problems:
1. Accumulation of errors in flip-flops.
2. Voter is not protected.

FF Refresh and TMR for Memory Cells
[Figure: three D flip-flops (r1, r2, r3) clocked by CK, each refreshed through its own majority voter; a further majority voter produces the output.]

Reliability Analysis
- Determine how long a system will work without failure.
- Find:
  - Mean time to failure (MTTF)
  - Mean time to repair (MTTR)
  - FIT rate

Reliability Function
- Reliability function of a system, R(t) = probability of survival at time t.
- Determined from failure rates of components, λ(t) = number of failures per unit time.
- λ(t) generally varies with time.

Failure Rate, λ(t)
[Figure: failures per second, λ(t), on a logarithmic axis from $10^0$ down to $10^{-12}$, versus time t. Three regions: infant mortality; constant failure rate (useful life), where λ(t) = λ; and wearout or aging.]

Deriving R(t)
- R(t) is the probability of no error in interval [0, t].
- Divide the interval into a large number (n) of subintervals of duration t/n.
- Let x be the probability of error in one subinterval.
- Assume that duration t/n is so small that either no error occurs or at most one error can occur.
- Then, average errors in a subinterval = $0 \cdot (1 - x) + 1 \cdot x = x = \lambda t / n$.
- Probability of no error in interval [0, t] is
  $R(t) = (1 - x)^n = (1 - \lambda t / n)^n \to e^{-\lambda t}$ as $n \to \infty$, which follows from the limit $\lim_{n \to \infty} (1 + y/n)^n = e^{y}$ with $y = -\lambda t$.

R(t) and MTBF
- $R(t) = e^{-\lambda t}$
- $\mathrm{MTBF} = \int_0^\infty R(t)\,dt = \int_0^\infty e^{-\lambda t}\,dt = 1/\lambda$

Reliability and MTBF
[Figure: reliability R(t), from 0.0 to 1.0, versus time t marked at 1, 2 and 3 MTBF. At t = 1 MTBF, R(t) = 1/e = 0.368.]

Example: First Generation Computer
- 10,000 electron tubes.
- Average burn-out rate: 5 tubes per 100,000 hours.

Reliability of TMR
- R(TMR) = Prob(all three modules correct) + Prob(any two modules correct)
  $= R^3 + 3R^2 (1 - R) = 3R^2 - 2R^3 = 3e^{-2\lambda t} - 2e^{-3\lambda t}$

MTBF of TMR
- $R(\mathrm{TMR}) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$
- $\mathrm{MTBF} = \int_0^\infty R(\mathrm{TMR})\,dt = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda}$

MTBF of TMR
[Figure: reliability R(t), from 0.0 to 1.0, versus time t for TMR and for a single module, with the mission duration marked on the time axis.]

Error Detection Code
- Errors: bits can flip due to noise in circuits and in communication.
- Extra bits are used for error detection.
- Example: a parity bit in ASCII code.
  - 7-bit ASCII code for A: 1000001
  - Even parity code for A (even number of 1s): 01000001
  - Odd parity code for A (odd number of 1s): 11000001
  - The leading bit of each 8-bit code is the parity bit.
- A single-bit error in the 7-bit code of “A” changes the symbol, e.g., 1000101 is “E” and 1000000 is “@”.
- The error is detected in the 8-bit code because it changes the specified parity.

Richard W. Hamming (1915-1998)
- Error-correcting codes (ECC).
- Also known for Hamming distance: HD = number of bit positions in which two binary vectors differ.
- Example: HD(1101, 1010) = 3.
- Hamming Medal, 1988.
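
The parity and Hamming distance definitions above can be sketched in a few lines of Python. Code words are represented as bit strings, and the parity bit is placed in front of the 7-bit code, as in the ASCII example. The function names are assumptions made for this illustration.

```python
def parity_bit(bits, odd=False):
    """Parity bit that makes the total number of 1s even (or odd)."""
    bit = bits.count("1") % 2      # even parity: add a 1 only if the count is odd
    if odd:
        bit ^= 1
    return str(bit)


def add_parity(bits, odd=False):
    """Prefix the parity bit to the code word."""
    return parity_bit(bits, odd) + bits


def parity_ok(word, odd=False):
    """Check a received word against the agreed parity."""
    return word.count("1") % 2 == (1 if odd else 0)


def hamming_distance(x, y):
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have equal length")
    return sum(p != q for p, q in zip(x, y))


a7 = format(ord("A"), "07b")               # '1000001'
print(add_parity(a7))                      # 01000001
print(add_parity(a7, odd=True))            # 11000001
print(parity_ok("01000101"))               # False: "A" corrupted to "E"
print(hamming_distance("1101", "1010"))    # 3
```

A single parity bit raises the minimum distance between valid code words from 1 to 2, which is enough to detect one flipped bit but not to locate it.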

The Idea of Hamming Code
- Code space contains $2^N$ possible N-bit code words.
[Figure: code word 1010 “A” and its HD = 1 neighbors 0010 “2”, 1110 “E”, 1000 “8” and 1011 “B”. A 1-bit error in “A” produces another valid code word.]
- Error not correctable. Reason: no redundancy.
- Hamming’s idea: increase HD between valid code words.

Hamming’s Distance ≥ 3 Code
[Figure: 7-bit code words 0010101 “2”, 1110100 “E”, 1010010 “A”, 0011110 “3”, 1000111 “8” and 1011001 “B”, separated by HD = 3 or 4. A 1-bit error in “A” gives the invalid word 0010010 “?”, which is at HD = 1 from “A” and at HD ≥ 2 from the other code words.]
- Shortest distance decoding eliminates the error.

Minimum Distance-3 Hamming Code
Symbol  Original code  Odd-parity code  ECC, HD ≥ 3
0       0000           10000            0000000
1       0001           00001            0001011
2       0010           00010            0010101
3       0011           10011            0011110
4       0100           00100            0100110
5       0101           10101            0101101
6       0110           10110            0110011
7       0111           00111            0111000
8       1000           01000            1000111
9       1001           11001            1001100
A       1010           11010            1010010
B       1011           01011            1011001
C       1100           11100            1100001
D       1101           01101            1101010
E       1110           01110            1110100
F       1111           11111            1111111

- Original code: symbol “0” with a single-bit error will be interpreted as “1”, “2”, “4” or “8”.
  - Reason: the Hamming distance between codes is 1. A code with any bit error maps onto another valid code.
- Remedy: design codes with HD ≥ 2. Example: parity code. A single-bit error is detected but not correctable.
- Remedy: design codes with HD ≥ 3. For single-bit error correction, decode as the valid code at HD = 1.
- For detection or correction of more error bits, design codes with HD ≥ 4.

A Book on Coding Theory
- R. W. Hamming, Coding and Information Theory, Englewood Cliffs, New Jersey: Prentice-Hall, 1980.

Byzantine Empire, 527-565
- Emperor Justinian and General Belisarius
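
Returning to the minimum distance-3 code table earlier in this section, the following Python sketch reproduces the 7-bit ECC column and applies shortest-distance decoding. The three parity equations are inferred from the table entries (they are not stated in the lecture), and the function names are assumptions for illustration.

```python
def encode(nibble):
    """Encode a 4-bit value as the 7-bit code word d3 d2 d1 d0 p2 p1 p0."""
    d3, d2, d1, d0 = ((nibble >> i) & 1 for i in (3, 2, 1, 0))
    p2 = d3 ^ d2 ^ d1          # parity equations inferred from the table
    p1 = d3 ^ d2 ^ d0
    p0 = d3 ^ d1 ^ d0
    return nibble * 8 + p2 * 4 + p1 * 2 + p0


# Map every valid 7-bit code word back to its 4-bit symbol
CODEBOOK = {encode(n): n for n in range(16)}


def distance(x, y):
    """Hamming distance between two integers viewed as bit vectors."""
    return bin(x ^ y).count("1")


def decode(word):
    """Shortest-distance decoding: return (symbol, distance to its code word)."""
    best = min(CODEBOOK, key=lambda cw: distance(cw, word))
    return CODEBOOK[best], distance(best, word)


print(format(encode(0xA), "07b"))   # 1010010, the code word for "A"
print(decode(0b0010010))            # (10, 1): the 1-bit error in "A" is corrected
```

Because every pair of valid code words differs in at least three positions, a word with one flipped bit has exactly one code word at distance 1, so the search over the codebook always recovers the original symbol for single-bit errors.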

Byzantine General’s Problem
- In a war, a general needs to communicate an attack (a) or retreat (r) order to subordinates in the field.
- For success, perfect agreement is necessary.
- Byzantine fault:
  - Subordinates can be unreliable or malicious.
  - Communication (messengers) can be unreliable or malicious.

Example 1: Single Fault
- General: D; Subordinates: A, B and C
[Figure: D sends retreat (r) to A, B and C; the order to A is corrupted, r→a.]

Example 1: Majority Agreement
- General: D; Subordinates: A, B and C
[Figure: A receives a while B and C receive r; the subordinates exchange the orders they received, and by majority A, B and C all decide Retreat.]

Example 2: Two Faults
- General: D; Subordinates: A, B and C
[Figure: D sends attack (a) to A, B and C.]

Example 2: Byzantine Failure
- General: D; Subordinates: A, B and C
[Figure: D sends a to A, B and C; the messages exchanged among the subordinates mix a and r, and the resulting decisions (Attack, Attack, Retreat) do not agree.]

Byzantine Resilient System
- A system that can function correctly in the presence of Byzantine faults.
- Byzantine protocol for an n-node system:
  - Any node can initiate a message broadcast.
  - Each node rebroadcasts the received message to all nodes it has not heard from.
  - After communications end, nodes take a majority decision.
- Ref.: L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.

Byzantine Resilience Conditions
- In order to tolerate t failures:
  - The system must have at least 3t + 1 nodes.
  - There must be at least 2t + 1 disjoint communication paths between nodes.
  - A node must exchange messages at least t + 1 times.

Four-Core Processor System
[Figure: four processor cores A, B, C and D.]
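
The resilience conditions and the majority decision step can be expressed as a small Python sketch. It computes the minimum resources for a given number of faults t and takes a strict-majority vote over the messages a node has collected. The function names and the dictionary keys are assumptions made for this illustration, and the vote is a simplification of the full protocol.

```python
from collections import Counter


def resilience_requirements(t):
    """Minimum resources needed to tolerate t Byzantine faults."""
    return {"nodes": 3 * t + 1, "disjoint_paths": 2 * t + 1, "rounds": t + 1}


def max_tolerable_faults(nodes):
    """Largest t for which the node count is at least 3t + 1."""
    return (nodes - 1) // 3


def majority_decision(messages):
    """Value held by a strict majority of the collected messages, else None."""
    value, count = Counter(messages).most_common(1)[0]
    return value if 2 * count > len(messages) else None


print(resilience_requirements(1))            # {'nodes': 4, 'disjoint_paths': 3, 'rounds': 2}
print(max_tolerable_faults(4))               # 1
print(majority_decision(["a", "r", "r"]))    # r
```

With t = 1 the conditions call for four nodes and two rounds of message exchange, so a four-core system is the smallest configuration that can tolerate a single Byzantine fault.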

Example 1: C Initiates Message m, Sends n to A and m to B and D
Processor  First round  Second round  Decoded message
A          n            m, m          m
B          m            m, n          m
D          m            m, n          m

Example 2: C Initiates Message m, B Sends p to A and D
Processor  First round  Second round  Decoded message
A          m            m, p          m
B          m            m, m          m
D          m            m, p          m

Example 3: C Initiates Message m, A and B Generate Faulty Message q
Processor  First round  Second round  Decoded message
A          m            m, q          m
B          m            m, q          m
D          m            q, q          q

References
- L. Lamport, R. Shostak and M. Pease, “The Byzantine Generals Problem,” ACM Trans. Prog. Lang. Syst., vol. 4, no. 3, pp. 382-401, July 1982.
- D. K. Pradhan, Fault-Tolerant Computer System Design, Upper Saddle River, New Jersey: Prentice Hall PTR, 1996.
- P. K. Lala, Self-Checking and Fault-Tolerant Digital Design, San Francisco: Morgan Kaufmann, 2001.