
Single Event Upset
An Embedded Tutorial
Fan Wang
Vishwani D. Agrawal
Department of Electrical and Computer Engineering
Auburn University, AL 36849 USA
21st International Conference on VLSI Design, Hyderabad, India, January 4-8, 2008
January 4-8, 2008
VLSI Design 2008
1
Motivation for This Work
• With the continuous downscaling of CMOS technologies,
device reliability has become a major bottleneck.
• The sensitivity of electronic systems to radiation can
become a major cause of soft (non-permanent) failures.
• Both circuit designers and test engineers need a basic
knowledge of the radiation mechanisms that cause soft
errors and of soft error mitigation techniques.
Outline
Introduction to Soft Errors
What is a soft error?
Historical notes
Basic radiation mechanisms in silicon
Soft error resilience techniques
A case study
Conclusion
Introduction to SEU
Certain behaviors in state-of-the-art electronic circuits
are caused by random factors.
A single event upset (SEU) is a non-permanent,
non-functional error.
Definition from the NASA Thesaurus:
“Single Event Upset (SEU): Radiation-induced errors
in microelectronic circuits caused when charged
particles (usually from the radiation belts or from
cosmic rays) lose energy by ionizing the medium
through which they pass, leaving behind a wake of
electron-hole pairs”.
What is a Soft Error?
• A “fault” is the cause of errors.
• A non-permanent fault is a non-destructive fault and
falls into one of two categories:
   - Transient faults, caused by environmental conditions such as
     temperature, humidity, pressure, voltage, power supply, vibrations,
     fluctuations, electromagnetic interference, ground loops, cosmic
     rays and alpha particles.
   - Intermittent faults, caused by non-environmental conditions such as
     loose connections, aging components, critical timing, resistive or
     capacitive variations and noise in the system.
• With advances in manufacturing, “soft errors” caused
by cosmic rays and alpha particles have become a dominant
cause of failures in electronic systems.
Historical Notes
• In the period 1954 through 1957, failures in digital electronics were
reported during above-ground nuclear bomb tests.
• In 1962, Wallmark and Marcus predicted that cosmic rays would start
upsetting microcircuits through heavy ionized particle strikes once
feature sizes became small enough.
• In the 1970s and early 1980s, the effects of radiation received attention
and more researchers examined the physics of these phenomena.
Fault-tolerant computing theory received similar attention.
• In 1978, May and Woods of Intel Corporation determined that soft
errors in memory chips were caused by alpha particles emitted in the
radioactive decay of uranium and thorium, present at levels of just a
few parts per million in package materials.
• In 1979, Guenzer and Wolicki reported that error-causing particles
came not only from uranium and thorium but also from nuclear reactions
involving high-energy neutrons and protons. The term “SEU” has
been in use since this paper.
• In 1979, Ziegler and Lanford of IBM predicted that cosmic rays
could cause the same upset phenomenon in electronics (not only
memories), even at sea level.
Soft Error Rate of Specific Applications

Figures of merit:
1. Failures In Time (FIT): the number of failures per 10^9 device hours.
2. Mean Time To Failure (MTTF): a 1-year MTTF corresponds to
   10^9/(24*365) FIT = 114,155 FIT.
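The conversion between the two figures of merit is a single division. The following Python sketch shows it under the same 365-day year used in the arithmetic above; the function and constant names are illustrative and do not come from the slides.

```python
HOURS_PER_YEAR = 24 * 365   # 8760 hours, no leap days (same assumption as the slide)
FIT_DEVICE_HOURS = 1e9      # one FIT is one failure per 10^9 device hours


def mttf_hours_to_fit(mttf_hours: float) -> float:
    """Failure rate in FIT for a given mean time to failure in hours."""
    return FIT_DEVICE_HOURS / mttf_hours


def fit_to_mttf_hours(fit: float) -> float:
    """Mean time to failure in hours for a given failure rate in FIT."""
    return FIT_DEVICE_HOURS / fit


# A 1-year MTTF expressed in FIT.
print(round(mttf_hours_to_fit(HOURS_PER_YEAR)))          # 114155

# A 1000 FIT part, expressed as MTTF in years.
print(fit_to_mttf_hours(1000) / HOURS_PER_YEAR)          # roughly 114 years
```

Read the other way round, a rate of one upset every 10^6 hours is 1000 FIT.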
The SER of contemporary commercial chips is held to within 100 to 1000 FIT.
Most hard failure mechanisms produce error rates on the order of 1 to 100 FIT.
The SER of programmable logic is almost 100 times larger than that of combinational logic.
Soft Error Rate for SRAM-Based FPGAs:
Smaller design rules and lower supply voltages
A radiation chamber was used to calculate SEU frequency at an altitude of 10 km at 60°N (Sweden)

FPGA       Process    Vcc     1 SEU every
XC4010E    0.60 µm    5 V     1×10^6 hours
XC4010XL   0.35 µm    3.3 V   2.8×10^5 hours

Projecting this for 3 design rule shrinks and 2 voltage reductions gives ≈1 SEU every 28.2 hours.
M. Ohlsson, P. Dyreklev, K. Johansson and P. Alfke, “Neutron Single Event Upsets in SRAM-Based FPGAs”, Proc. 1998 IEEE Nuclear & Space Radiation Effects Conference.
Chuck Stroud, “FPGA Architectures and Operation for Tolerating SEUs”, Electrical Engineering VLSI design and test seminar, Spring 2007, Auburn University.
Example: SRAM-Based FPGA System*
[Table: example SRAM-based FPGA systems; contents not recoverable]
*1. Example (1) is tested at Denver, using SpaceRad 4.5 (a radiation
effects prediction software program). Source: Actel.
2. All systems are without any protection.
Radiation Mechanisms for Silicon (1)
1. Alpha particles are emitted when the nucleus of an
unstable isotope decays to a lower energy state
(the dominant soft error cause for DRAM in the 1970s).
   - Uranium and thorium have the highest activity among
     naturally occurring radioactive materials.
   - In the terrestrial environment, major sources of radioactive
     impurities are lead-based isotopes in the solder bumps of
     flip-chip technology, gold used for bond wires and lid
     plating, aluminum in ceramic packages, lead-frame alloys
     and interconnect metallization.
**With carefully selected materials, the effect of this
mechanism can be greatly reduced.
Radiation Mechanisms for Silicon (2)
2. High-energy (> 1 MeV*) neutrons from cosmic
radiation induce soft errors in semiconductor
devices via secondary ions produced by
neutron reactions with silicon nuclei.
• Cosmic rays of galactic origin react with the
Earth’s atmosphere to produce complex cascades of
secondary particles.
• Neutrons are the cosmic radiation source most likely
to cause SEUs in deep-submicron semiconductors at
terrestrial altitudes. The neutron flux depends on
the altitude above sea level: its density increases
with altitude.
*MeV: million electron volts
**Nowadays, neutrons are the major cause among
all failure mechanisms.
Radiation Mechanisms for Silicon (3)
3. The secondary radiation induced by the interaction of
cosmic ray neutrons with boron is the third significant
source of ionizing particles in electronic systems.
   - Low-energy cosmic neutrons interact with the isotope boron-10 (10B).
   - 10B is commonly used as a p-type dopant for junction formation in ICs.
Baumann et al., IEEE Trans. Device and Materials Reliability, vol. 1, no. 1, pp. 17–22, 2001.
**This mechanism can be greatly reduced or
eliminated by removing the source of 10B.
Single Event Transient (SET)
• An SET is caused by the charge generated when a high-energy
particle passes through a sensitive node.
• Each SET has its own characteristics, such as polarity,
waveform, amplitude and duration, which depend on particle
impact location, particle energy, device technology,
device supply voltage and output load.
• Off transistors struck in the junction area by a heavy ion
with high enough LET* are most sensitive to SEU.
• Specifically, the channel region of the off-NMOS
transistor and the drain region of the off-PMOS
transistor are most sensitive.
*Linear Energy Transfer (LET) is a measure of the energy transferred to the
device per unit length as an ionizing particle travels through a material.
More Details of SET Generation
(a) Along the path it traverses, the particle produces a dense radial distribution of
electron-hole pairs.
(b) Outside the depletion region, the non-equilibrium charge distribution induces
a temporary funnel-shaped potential distortion along the trajectory of the
event (drift component).
(c) The funnel collapses; the diffusion component then dominates the collection
process until all excess carriers have been collected, recombined, or
diffused away from the junction area.
(d) Current vs. time, illustrating charge collection and SET generation.
Analytical Model of SET
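• The time constants depend strongly on the type of ion, its initial
energy and the properties of the specific technology.
• An approximate analytical model for ion track charge collection is a
double-exponential form. It gives an induced current with a rapid
rise time but a more gradual fall time:
$$I(t) = \frac{Q}{\tau_\alpha - \tau_\beta}\left(e^{-t/\tau_\alpha} - e^{-t/\tau_\beta}\right)$$
where $Q$ is the collected charge, $\tau_\alpha$ is the collection time constant
of the junction and $\tau_\beta$ is the ion-track establishment time constant.
*Typical values are approximately 1.64 × 10^-10 s for $\tau_\alpha$
and 5.10 × 10^-11 s for $\tau_\beta$.
*Experimental results from NASA JPL.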
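As a rough illustration of the double-exponential model, the Python sketch below evaluates the commonly cited form I(t) = Q/(τ_α − τ_β) · (e^(−t/τ_α) − e^(−t/τ_β)) with the typical time constants quoted above. Assigning 1.64 × 10^-10 s to the collection constant τ_α and 5.10 × 10^-11 s to the track-establishment constant τ_β is an assumption, as are the function names and the 0.65 pC example charge.

```python
import math

TAU_ALPHA = 1.64e-10  # s, assumed: collection time constant of the junction
TAU_BETA = 5.10e-11   # s, assumed: ion-track establishment time constant


def set_current(t, q, tau_a=TAU_ALPHA, tau_b=TAU_BETA):
    """Double-exponential SET current (A) at time t (s) for collected charge q (C)."""
    if t < 0:
        return 0.0
    return q / (tau_a - tau_b) * (math.exp(-t / tau_a) - math.exp(-t / tau_b))


def peak_time(tau_a=TAU_ALPHA, tau_b=TAU_BETA):
    """Time at which dI/dt = 0, obtained by differentiating the expression above."""
    return tau_a * tau_b / (tau_a - tau_b) * math.log(tau_a / tau_b)


def collected_charge(q, t_end, steps=20000):
    """Trapezoidal integral of the current from 0 to t_end (should approach q)."""
    h = t_end / steps
    total = 0.5 * (set_current(0.0, q) + set_current(t_end, q))
    total += sum(set_current(i * h, q) for i in range(1, steps))
    return total * h


q = 0.65e-12  # C, illustrative charge only
t_pk = peak_time()
print(f"peak at {t_pk:.3e} s, I_peak = {set_current(t_pk, q):.3e} A")
print(f"charge recovered by integration: {collected_charge(q, 20 * TAU_ALPHA):.3e} C")
```

The prefactor Q/(τ_α − τ_β) normalizes the pulse so that its integral over time equals the collected charge, which the numerical integration checks.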
SET in CMOS Inverter
*For example, in ami12 technology, when the output load capacitance is 100 fF
and the cumulative collected charge is 0.65 pC, the amplitude of the voltage
pulse is 0.65 pC / 100 fF = (0.65 × 10^-12 C) / (100 × 10^-15 F) = 6.5 V.
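The first-order estimate used here is simply V = Q/C. The short Python sketch below evaluates it for the quoted charge and load; the function name is illustrative, and the estimate ignores the restoring current of the on transistor and any clamping near the supply rails.

```python
def pulse_amplitude(charge_c: float, load_f: float) -> float:
    """First-order voltage pulse amplitude V = Q / C across a load capacitance."""
    if load_f <= 0:
        raise ValueError("load capacitance must be positive")
    return charge_c / load_f


# Values quoted in the example: 0.65 pC collected on a 100 fF load.
v = pulse_amplitude(0.65e-12, 100e-15)
print(f"{v:.2f} V")  # 6.50 V
```

With these numbers the ratio evaluates to 6.5 V, so a smaller collected charge or a larger load would be needed for a sub-volt pulse.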
Soft Error Mitigation Techniques
• Soft error tolerance techniques can be classified into
two types: recovery and prevention.
• Recovery: recover from an error after it occurs.
Includes on-line recovery mechanisms, fault-tolerant computing,
ECC/parity checks, redundancy, etc.
• Prevention: methods that protect microchips from soft errors
before they occur.
• The need for a recovery mechanism stems from the fact
that prevention techniques may not be enough for
contemporary microchips.
• Soft errors are not the only reason why computer systems
need to resort to a recovery procedure. Random errors
due to noise, unreliable components, and coupling
effects may also require a recovery mechanism.
Some Mitigation Techniques
• Prevention Techniques
1. Purify the fabrication materials:
• Uranium and thorium impurities have been reduced
below one hundred parts per trillion for high reliability.
• To eliminate 10B, alternative insulators that don’t contain
boron are used.
2. Radiation-hardened process technologies:
• SER performance can be greatly improved by adapting
the process technology either to reduce the collected
charge or to increase the critical charge.
• Specific methods: use additional well isolation; replace
bulk silicon with SOI.
A 10x reduction in SER is achieved over conventional bulk devices
when a fully depleted SOI substrate is used. However, SOI is more
expensive, and parasitic bipolar action limits further reduction of SER.
Selected Mitigation Techniques
• Recovery Techniques
1. Redundancy
   - Gains higher system reliability at the cost of extra time, extra space, or both.
   - Classic design: Triple Modular Redundancy (TMR) with a majority voter.
   - New design*: time redundancy based on a C-element gate that compares two samples
     of the combinational primary outputs, taken at t0 and t0+d.
2. Error Detection and Correction Code (EDAC)
   - Simple solution for memory: add a parity bit to each memory word.
   - In most situations, it must be combined with a system-level approach for error
     recovery.
*S. Mitra, Z. Ming, S. Waqas, N. Seifert, B. Gill, and K. S. Kim, “Combinational Logic Soft Error Correction,” in Proc. International Test Conference, 2006, pp. 1–9.
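The recovery techniques listed on this slide can be sketched behaviorally in a few lines of Python. This is a simplified software model under stated assumptions, not the circuit from the cited paper: `majority` models a bitwise TMR voter, `c_element` models a C-element that holds its previous output when its two input samples disagree, and `even_parity` models a single parity bit per word. All names and the 4-bit example word are illustrative.

```python
def majority(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant module outputs (TMR voter)."""
    return (a & b) | (a & c) | (b & c)


def c_element(x: int, y: int, held: int) -> int:
    """Bitwise C-element: follows the inputs where they agree, else keeps `held`."""
    return (x & y) | (held & (x | y))


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1s (word + bit) even."""
    return bin(word).count("1") % 2


golden = 0b1011
upset = golden ^ 0b0100  # one bit flipped by a transient

# TMR: a single corrupted copy is outvoted by the two good copies.
print(majority(golden, upset, golden) == golden)        # True

# Time redundancy: both samples agree, so the output updates to the new value...
out = c_element(golden, golden, 0b0000)
# ...then a glitch reaches only one of the two samples, so the output holds.
out = c_element(upset, golden, out)
print(out == golden)                                    # True

# Parity: a single flip is detected, a double flip is not.
stored_parity = even_parity(golden)
print(even_parity(upset) != stored_parity)              # True (detected)
print(even_parity(golden ^ 0b0110) == stored_parity)    # True (missed)
```

The last line shows why a parity bit only detects errors and must be paired with a system-level recovery step.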
A Case Study: IBM eServer z990 System
• z990 configuration
1. The z990 contains 4 pluggable nodes connected through a
planar board.
2. Each node contains up to 64 GB of physical memory and
32 MB of L2 cache, for a system capacity of 256 GB of
memory and 128 MB of L2 cache.
• Error tolerance techniques used:
1. Extensive use of ECC and parity with retry on data
and controls
2. Full SRAM ECC and parity protection
3. Microprocessor mirroring
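To make the ECC idea concrete, the Python sketch below implements a textbook Hamming(7,4) single-error-correcting code. It is purely an illustration of how an error-correcting code locates and repairs a flipped bit; the slides do not describe the codes actually used in the z990, and all function names here are illustrative.

```python
from typing import List, Tuple


def hamming74_encode(data: List[int]) -> List[int]:
    """Encode 4 data bits as a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_correct(code: List[int]) -> Tuple[List[int], int]:
    """Return (corrected codeword, 1-based error position, or 0 if none found)."""
    s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
    s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
    s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome read as a binary position
    fixed = list(code)
    if position:
        fixed[position - 1] ^= 1
    return fixed, position


def hamming74_data(code: List[int]) -> List[int]:
    """Extract the 4 data bits from a 7-bit codeword."""
    return [code[2], code[4], code[5], code[6]]


word = [1, 0, 1, 1]
sent = hamming74_encode(word)
received = list(sent)
received[4] ^= 1  # a single upset flips codeword position 5
fixed, position = hamming74_correct(received)
print(position, hamming74_data(fixed) == word)  # 5 True
```

A code of this kind corrects any single flipped bit per codeword; two flips in the same codeword would be miscorrected, which is one reason to combine ECC with parity and retry.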
Conclusion
• SER in logic and memory chips will
continue to increase as devices
become more sensitive to soft errors
at sea level.
• Open soft error issues:
1. How should EDA tools handle soft error hardening?
2. Analysis of radiation mechanisms (too complex
to be comprehensive)
3. Soft error rate analysis for logic circuits
4. Error mitigation methods
Useful References and Further Readings
1. “Single Event Phenomena” (Messenger and Ash, 1993)
2. “Ionizing Radiation Effects in MOS Devices and Circuits” (Ma and Dressendorfer, 1989)
3. “Handbook of Radiation Effects” (A. Holmes-Siedle and L. Adams, 1993)
4. “Fault-Tolerance Techniques for SRAM-Based FPGAs” (F. L. Kastensmidt, L. Carro and R. Reis, 2006)
5. Test methods and standards: JEDEC JESD89, JESD89A, JESD89-2
6. Journals: IEEE Trans. on Nuclear Science, IEEE Trans. on Reliability
7. NASA Goddard’s test group:
http://radhome.gsfc.nasa.gov/radhome/papers/seeca5.htm
8. NASA Space Environment and Effects Program:
http://see.msfc.nasa.gov/
……
Thank You . . .