Single Event Upset: An Embedded Tutorial
Fan Wang, Vishwani D. Agrawal
Department of Electrical and Computer Engineering, Auburn University, AL 36849, USA
21st International Conf. on VLSI Design, Hyderabad, India, January 4-8, 2008
January 4-8, 2008, VLSI Design 2008

Motivation for This Work
- With the continuous downscaling of CMOS technologies, device reliability has become a major bottleneck.
- The sensitivity of electronic systems to radiation can potentially become a major cause of soft (non-permanent) failures.
- Both circuit designers and test engineers need basic knowledge of the soft errors caused by the basic radiation mechanisms, and of soft error mitigation techniques.

Outline
- Introduction to Soft Errors
  - What is a Soft Error?
  - Historical notes
- Basic radiation mechanisms in silicon
- Soft error resilience techniques
- A case study
- Conclusion

Introduction to SEU
- Certain behaviors in state-of-the-art electronic circuits are caused by random factors.
- A single event upset (SEU) is a non-permanent, non-functional error.
- Definition from the NASA Thesaurus: “Single Event Upset (SEU): Radiation-induced errors in microelectronic circuits caused when charged particles (usually from the radiation belts or from cosmic rays) lose energy by ionizing the medium through which they pass, leaving behind a wake of electron-hole pairs”.

What is a Soft Error?
- A “fault” is the cause of errors.
- A non-permanent fault is a non-destructive fault and falls into two categories:
  - Transient faults, caused by environmental conditions like temperature, humidity, pressure, voltage, power supply, vibrations, fluctuations, electromagnetic interference, ground loops, cosmic rays and alpha particles.
  - Intermittent faults, caused by non-environmental conditions like loose connections, aging components, critical timing, resistive or capacitive variations and noise in the system.
- With advances in manufacturing, soft errors caused by cosmic rays and alpha particles are now the dominant cause of failures in electronic systems.

Historical Notes
- From 1954 through 1957, failures in digital electronics were reported during above-ground nuclear bomb tests.
- In 1962, Wallmark and Marcus predicted that cosmic rays would start upsetting microcircuits through heavy ionized particle strikes once feature sizes became small enough.
- In the 1970s and early 1980s, the effects of radiation received attention and more researchers examined the physics of these phenomena. The same was true of fault-tolerant computing theory.
- In 1978, May and Woods of Intel Corporation determined that these soft errors were caused by alpha particles emitted in the radioactive decay of uranium and thorium, present at just a few parts-per-million in package materials.
- In 1979, Guenzer and Wolicki reported that the error-causing particles came not only from uranium and thorium but also from nuclear reactions induced by high-energy neutrons and protons. The term “SEU” has been in use since this paper.
- In 1979, Ziegler and Lanford of IBM predicted that cosmic rays could cause the same upset phenomenon in electronics (not only memories) even at sea level.

Soft Error Rate of Specific Applications
Figures of merit:
1. Failures In Time (FIT): the number of failures per 10^9 device hours.
2. MTTF (Mean Time To Failure): an MTTF of 1 year corresponds to 10^9/(24*365) FIT = 114,155 FIT.
- The SER of contemporary commercial chips is controlled to within 100 to 1000 FIT.
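The conversion between the two figures of merit can be checked with a few lines of Python. This is a minimal sketch, assuming a constant failure rate so that MTTF is the reciprocal of the failure rate; the function and constant names are illustrative and do not come from the tutorial.

```python
# Assumption: constant failure rate, so MTTF (hours) = 1e9 / FIT.
HOURS_PER_YEAR = 24 * 365      # same year length as used on the slide
FIT_DEVICE_HOURS = 1e9         # 1 FIT = 1 failure per 10^9 device hours


def mttf_hours_to_fit(mttf_hours: float) -> float:
    """Convert a mean time to failure in hours into a FIT rate."""
    return FIT_DEVICE_HOURS / mttf_hours


def fit_to_mttf_hours(fit: float) -> float:
    """Convert a FIT rate into a mean time to failure in hours."""
    return FIT_DEVICE_HOURS / fit


if __name__ == "__main__":
    # An MTTF of one year expressed in FIT (the slide's 114,155 FIT).
    print(round(mttf_hours_to_fit(HOURS_PER_YEAR)))
    # The 100 to 1000 FIT range quoted for commercial chips, as MTTF in years.
    for fit in (100, 1000):
        years = fit_to_mttf_hours(fit) / HOURS_PER_YEAR
        print(f"{fit} FIT -> MTTF of about {years:.0f} years")
```

Read this way, a chip held to 1000 FIT has an MTTF of roughly 114 years, which shows why FIT rather than MTTF is the convenient unit when rates are this low.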
- Most hard failure mechanisms produce failure rates on the order of 1 to 100 FIT.
- The SER of programmable logic is almost 100 times larger than that of combinational logic.
- Soft error rate for SRAM-based FPGAs:
  - Smaller design rules and lower supply voltages raise the SEU rate.
  - A radiation chamber was used to calculate the SEU frequency at an altitude of 10 km at 60°N (Sweden):

    FPGA       Process   Vcc     1 SEU every
    XC4010E    0.60 um   5 V     1×10^6 hours
    XC4010XL   0.35 um   3.3 V   2.8×10^5 hours

  - Projecting this for 3 design rule shrinks and 2 voltage reductions gives approximately 1 SEU every 28.2 hours.
Sources:
M. Ohlsson, P. Dyreklev, K. Johansson and P. Alfke, “Neutron Single Event Upsets in SRAM-Based FPGAs”, Proc. 1998 IEEE Nuclear & Space Radiation Effects Conference.
Chuck Stroud, “FPGA Architectures and Operation for Tolerating SEUs”, Electrical Engineering VLSI Design and Test Seminar, Spring 2007, Auburn University.

Example: SRAM-Based FPGA System*
(Table, continued; contents not recovered.)
*Notes:
1. Example (1) was evaluated for Denver using SpaceRad 4.5 (a radiation effects prediction software program). Source: Actel.
2. All systems are without any protection.

Radiation Mechanisms for Silicon (1)
1. Alpha particles are emitted when the nucleus of an unstable isotope decays to a lower energy state. (This was the dominant cause of soft errors in DRAM in the 1970s.)
- Uranium and thorium have the highest activity among naturally occurring radioactive materials.
- In the terrestrial environment, the major sources of radioactive impurities are lead-based isotopes in the solder bumps of flip-chip technology, gold used for bond wires and lid plating, aluminum in ceramic packages, lead-frame alloys and interconnect metallization.
**With carefully selected materials, the effect of this mechanism can be greatly reduced.

Radiation Mechanisms for Silicon (2)
2. High-energy (> 1 MeV*) neutrons from cosmic radiation induce soft errors in semiconductor devices via secondary ions produced by neutron reactions with silicon nuclei.
- Cosmic rays of galactic origin react with the Earth’s atmosphere to produce complex cascades of secondary particles.
- Neutrons are the cosmic radiation source most likely to cause SEU in deep-submicron semiconductors at terrestrial altitudes.
- The neutron flux depends on the altitude above sea level: the flux density increases with altitude.
*MeV: million electron volts.
**Nowadays, neutrons are the major cause among all failure mechanisms.

Radiation Mechanisms for Silicon (3)
3. The secondary radiation induced by the interaction of cosmic ray neutrons with boron is the third significant source of ionizing particles in electronic systems.
- It arises from low-energy cosmic neutron interactions with the isotope boron-10 (10B).
- 10B is commonly used as a p-type dopant for junction formation and in the IC package.
Baumann et al., IEEE Trans. Device and Materials Reliability, vol. 1, no. 1, pp. 17–22, 2001.
**This mechanism can be greatly reduced or eliminated by removing the source of 10B.

Single Event Transient (SET)
- An SET is caused by the generation of charge when a high-energy particle passes through a sensitive node.
- Each SET has unique characteristics (polarity, waveform, amplitude, duration, etc.) that depend on the particle impact location, particle energy, device technology, device supply voltage and output load.
- Off transistors struck in the junction area by a heavy ion with high enough LET* are the most sensitive to SEU, specifically the channel region of the off-NMOS transistor and the drain region of the off-PMOS transistor.
*Linear Energy Transfer (LET) is a measure of the energy transferred to the device per unit length as an ionizing particle travels through a material.
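To make the LET definition above concrete, LET can be converted into the charge an ion deposits per unit track length. The sketch below relies on textbook silicon constants that are not stated in the tutorial (about 3.6 eV to create one electron-hole pair, and a density of 2.33 g/cm^3), and it assumes LET is given in the mass-normalized unit MeV·cm^2/mg; the function name is illustrative.

```python
# Assumed textbook constants for silicon (not taken from the tutorial).
EV_PER_PAIR_SI = 3.6            # energy to create one electron-hole pair, eV
SI_DENSITY_MG_PER_CM3 = 2330.0  # 2.33 g/cm^3 expressed in mg/cm^3
ELECTRON_CHARGE_C = 1.602e-19   # elementary charge, coulombs


def let_to_pc_per_um(let_mev_cm2_per_mg: float) -> float:
    """Charge deposited per micron of track in silicon, in pC/um."""
    # Multiply by density to turn mass-normalized LET into energy per length.
    mev_per_cm = let_mev_cm2_per_mg * SI_DENSITY_MG_PER_CM3
    ev_per_um = mev_per_cm * 1e6 / 1e4          # MeV -> eV, cm -> um
    pairs_per_um = ev_per_um / EV_PER_PAIR_SI   # electron-hole pairs per um
    coulomb_per_um = pairs_per_um * ELECTRON_CHARGE_C
    return coulomb_per_um * 1e12                # C -> pC


if __name__ == "__main__":
    for let in (1.0, 10.0, 97.0):
        print(f"LET {let:5.1f} MeV*cm^2/mg -> {let_to_pc_per_um(let):.3f} pC/um")
```

Under these assumptions an LET of about 97 MeV·cm^2/mg deposits roughly 1 pC per micron of track, which gives a feel for how much charge a heavy ion can leave near a sensitive junction.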

More Details of SET Generation
(a) Along the path it traverses, the particle produces a dense radial distribution of electron-hole pairs.
(b) Outside the depletion region, the non-equilibrium charge distribution induces a temporary funnel-shaped potential distortion along the trajectory of the event (drift component).
(c) The funnel collapses; the diffusion component then dominates the collection process until all excess carriers have been collected, recombined, or diffused away from the junction area.
(d) Current vs. time, illustrating the charge collection and SET generation.

Analytical Model of SET
- The time constants depend strongly on the type of ion, its initial energy and the properties of the specific technology.
- An approximate analytical model for ion track charge collection has a double-exponential form. It gives an induced current with a rapid rise time but a more gradual fall time:

  I(t) = Q / (τα − τβ) × (e^(−t/τα) − e^(−t/τβ))

  where Q is the collected charge, τα is the collection time constant of the junction and τβ is the time constant for establishing the ion track.
*Typical values are approximately 1.64 × 10^-10 s for τα and 5.10 × 10^-11 s for τβ. (Experimental results from NASA JPL.)

SET in CMOS Inverter
*For example, in ami12 technology, when the output load capacitance is 100 fF and the cumulative collected charge is 0.65 pC, the nominal amplitude of the voltage pulse is Q/C = 0.65 pC / 100 fF = (0.65 × 10^-12 C) / (100 × 10^-15 F) = 6.5 V.

Soft Error Mitigation Techniques
Soft error tolerance techniques can be classified into two types: recovery and prevention.
- Recovery: recover from an error after it occurs. Includes on-line recovery mechanisms, fault-tolerant computing, ECC/parity checks, redundancy, etc.
- Prevention: methods that protect microchips from soft errors before they occur.
- The need for a recovery mechanism stems from the fact that prevention techniques may not be enough for contemporary microchips.
- Soft errors are not the only reason why computer systems need to resort to a recovery procedure.
- Random errors due to noise, unreliable components, and coupling effects may also require a recovery mechanism.

Some Mitigation Techniques
Prevention Techniques
1. Purify the fabrication material:
- Uranium and thorium impurities have been reduced below one hundred parts per trillion for high reliability.
- To eliminate 10B, alternative insulators that do not contain boron are used.
2. Radiation-hardened process technologies:
- SER performance can be greatly improved by adapting the process technology either to reduce the collected charge or to increase the critical charge.
- Specific methods: use additional well isolation; replace bulk silicon with SOI.
- A 10x reduction in SER over conventional bulk devices is achieved when a fully depleted SOI substrate is used. However, SOI is more expensive, and parasitic bipolar action limits further reduction of SER.

Selected Mitigation Techniques
Recovery Techniques
1. Redundancy
- Gains higher system reliability at the cost of extra time, extra space, or both.
- Classic design: Triple Modular Redundancy (TMR) with a majority voter.
- New design*: time redundancy based on a C-element gate that compares two samples of the combinational primary outputs, taken at t0 and t0+d.
2. Error Detection and Correction Code (EDAC)
- Simple solution for memory: add a parity bit to each memory word.
- In most situations, it must be combined with a system-level approach for error recovery.
*S. Mitra, M. Zhang, S. Waqas, N. Seifert, B. Gill, and K. S. Kim, “Combinational Logic Soft Error Correction,” in Proc. International Test Conference, 2006, pp. 1–9.

A Case Study: IBM eServer z990 System
z990 configuration:
1. The z990 contains 4 pluggable nodes connected through a planar board.
2. Each node contains up to 64 GB of physical memory and 32 MB of L2 cache, for a system capacity of 256 GB of memory and 128 MB of L2 cache.
Error tolerance techniques used:
1. Extensive use of ECC and parity with retry on data and controls
2. Full SRAM ECC and parity protection
3. Microprocessor mirroring

Conclusion
- SER in logic and memory chips will continue to increase as devices become more sensitive to soft errors at sea level.
- Open soft error issues:
1. How should EDA tools handle soft error hardening?
2. Analysis of radiation mechanisms (too complex to be comprehensive)
3. Soft error rate analysis for logic
4. Error mitigation methods

Useful References and Further Readings
1. “Single Event Phenomena” (Messenger and Ash, 1993)
2. “Ionizing Radiation Effects in MOS Devices and Circuits” (Ma and Dressendorfer, 1989)
3. “Handbook of Radiation Effects” (A. Holmes-Siedle and L. Adams, 1993)
4. “Fault-Tolerance Techniques for SRAM-Based FPGAs” (Fernanda Lima Kastensmidt, Luigi Carro, Ricardo Reis, 2006)
5. Test methods and standards: JEDEC89, JEDEC89A, JEDEC89-2
6. Journals: IEEE Trans. on Nuclear Science, IEEE Trans. on Reliability
7. NASA Goddard’s test group: http://radhome.gsfc.nasa.gov/radhome/papers/seeca5.htm
8. NASA Space Environment and Effects Program: http://see.msfc.nasa.gov/
……

Thank You