Reliability- and Process-Variation Aware Design of
Integrated Circuits – A Broader Perspective
Muhammad A. Alam, Kaushik Roy, and Charles Augustine
Department of ECE, Purdue University, West Lafayette, IN 47907, USA
phone: 765-494-5988, fax: 765-494-6441, e-mail: alam@purdue.edu
Abstract— A broad review of the literature for reliability- and process-variation aware VLSI design shows a re-emergence of the topic as a core area of active research. Design of reliable circuits with unreliable components has been a challenge since the early days of electro-mechanical switches and has been addressed by elegant coding and redundancy techniques. Radiation-hard design principles have likewise been used extensively for systems affected by soft transient errors. Additional modern reliability concerns associated with parametric degradation due to NBTI and soft-broken gate dielectrics, along with the proliferation of memory and thin-film technologies, add a new dimension to reliability-aware design. Taken together, these device, circuit, architectural, and software based fault-tolerant approaches have enabled continued scaling of integrated circuits and are likely to be a part of any reliability qualification protocol for future technology generations.
Keywords-positive reliability physics, circuit design, variability,
modeling, lifetime projection
I. INTRODUCTION
Most introductory courses on VLSI design presume interchangeability and uniformity of components whose properties remain invariant with time, and posit that the fundamental challenge of IC design is the trade-off among area, performance, testability, and power dissipation of a circuit. The key feature of such deterministic optimization is that it involves analysis of a static and uniform network of transistor and interconnect nodes in response to a set of time-dependent test patterns. Since in practice no two transistors are quite alike even within a single die due to process variability [1], nor do they retain the same characteristics over time as defects accumulate within the transistor in response to nodal activity and input patterns [2][3][4], the idealized design principles are hardly defensible in practice. Traditional VLSI design bypasses the analysis and optimization of such a nonuniform, dynamic network by approximating the problem as the optimization of a uniform static network with a certain guard-band. This allows the optimized static circuit to continue functioning even if the devices have randomly distributed parameters (within some margin), or become faster or slower over time.
Such worst-case, guard-band limited VLSI design is widely used in microelectronic integrated circuits in modern CPUs, as well as in macroelectronic technologies such as Liquid Crystal Displays (LCDs). Despite its routine use, the methodology is inefficient and involves a considerable penalty in the area-performance-power-reliability budget. After all, process-induced variation [5], time-dependent degradation, and radiation-induced soft and hard errors make it difficult to design integrated circuits within conservative guard-bands without unacceptable compromise of the power/performance targets of the circuit design. Given the
constraints, the following question frames the discussion of this paper: "How does one design and optimize for power/performance/area metrics various VLSI and macroelectronic circuits with components that are occasionally faulty, but more generally have statistically distributed parameters that evolve in time either abruptly or gradually?" There is, of course, no global optimization principle that addresses all problems, but the design and device communities have collaborated over the years to find a series of device/circuit/architecture/software solutions to reliability concerns in specific contexts. We will summarize some of the key developments from the perspective of a reliability engineer.
Figure 1. ICs might experience different types of faults during the fabrication and/or operation of the system, and these faults must be managed at the hardware or software levels. In this article, we classify the faults into four categories depending on how they affect system performance and the classes of solution strategies used to address them.

Although the process and reliability considerations across the electronics industry are multi-faceted and constitute a vast topic, in general they can be classified into four groups (see Fig. 1). The first group involves transistor-to-transistor, die-to-die, and wafer-to-wafer process-induced variation that introduces parametric differences between two otherwise nominally identical and functional transistors. The second group involves irreversible parametric degradation arising
from the loss of passivated surfaces (e.g., broken Si-H bonds for NBTI/HCI damage in microelectronic and macroelectronic applications [6], loss of Si-H bonds and increase in dark current in amorphous-Si based solar cells [138], etc.) and salt penetration through oxides in biosensors [7], etc. The third group of reliability issues involves transient errors (e.g., soft errors due to radiation [8][9], transient charge loss in Flash and ZRAM memories [10], etc.). Finally, the fourth group involves permanent damage arising from pre-existing or newly generated bulk defects (e.g., broken Si-O bonds) in SiO2 that lead to gate dielectric breakdown in logic transistors [11][13], anomalous charge loss in Flash transistors [14], loss of resistance ratio in MRAM cells, radiation-induced permanent damage in SRAM cells [25], open interconnects due to imperfect processing or electromigration, etc.
In each section of the following discussion, we will discuss the definition, detection, and solution strategies associated with these four classes of defects. We conclude with a brief discussion regarding the status, challenges, and opportunities of CAD tools in implementing process- and reliability-aware design methodologies and provide an outlook for emerging research directions for VLSI design.
II. PHYSICAL ORIGIN OF VARIABILITY
In this section, we consider the general features of the four defect classes associated with process- and reliability-aware design.
A. Origin/Measure of 'Time-Zero' Variation
Modern VLSI designers are intimately aware of process-related parametric fluctuation arising from (i) random-dopant fluctuation [26][27][28][29], (ii) fluctuation of oxide thickness [30], and (iii) statistically distributed channel lengths due to line-edge roughness, etc. This randomness reflects the local, submicron-scale processing history of individual transistors and translates to threshold voltage variation, fluctuation in gate leakage, and variability of series resistance. Regardless of the type of transistor (e.g., FinFET, ultra-thin body SOI, double-gate transistors, etc. [32]), continued technology scaling guarantees the susceptibility of these transistors to processing conditions.
Other types of electronic components are similarly affected. The fluctuation in the number of nanocrystals in NC-Flash memories translates to a distribution of threshold voltage and retention time, the number of grains within a TFT channel dictates its ON current [34], and Nair et al. [7] have shown that dopant-induced statistical fluctuation of drain current is so significant that absolute biosensing based on nanowire biosensors is actually impossible. In short, all areas of modern electronics (e.g., logic/memory in microelectronics, macroelectronics, and bioelectronics) are influenced by randomness of design parameters.
Figure 2. Type 1 faults are associated with the distribution of initial parameters due to process variation (blue). When coupled with Type 2 time-dependent parametric degradation (red), the total guard-band necessary to design an IC can become unacceptably large.
To design an IC with transistors with random parameters, the randomness is first captured at the lowest level of abstraction (e.g., the process level), and its effects on quantities like threshold voltage and leakage current can be determined either numerically or by using percolation theory and stochastic geometry. These effective electrical parameters are then propagated to higher levels of abstraction (e.g., the circuit/system level) using numerical or analytical approaches [35][36]. Although widely used, Monte-Carlo based numerical determination of VT has the disadvantage that a limited sample size makes the calculation of the tails of the distribution difficult and probably inaccurate. Finally, even for nominally identical transistors, the local operating environment (nodal activity, passivation, etc.) or the R-C drop due to statistically distributed interconnect lengths results in local inhomogeneity in operating temperature and ultimately in differences in VT, drive current, etc. Fortunately, given the nodal activity and operating conditions, this aspect of the problem may be predictably defined by a self-consistent solution of the electro-thermal design problem [37][38].
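To make the Monte-Carlo flow and its sample-size limitation concrete, the minimal Python sketch below propagates an assumed Gaussian VT distribution to subthreshold leakage and reports a tail percentile for two sample sizes; the nominal VT, its sigma, and the slope factor are illustrative assumptions, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) parameters: nominal VT, random-dopant sigma,
# and subthreshold slope factor n*vT (all in volts).
VT0, SIGMA_VT = 0.30, 0.030
N_VT = 1.5 * 0.0259            # n * kT/q at 300 K

def leakage(vt):
    """Relative subthreshold leakage: Ioff ~ exp(-VT/(n*vT))."""
    return np.exp(-vt / N_VT)

for n_samples in (1_000, 1_000_000):
    vt = rng.normal(VT0, SIGMA_VT, n_samples)   # 'time-zero' VT samples
    ioff = leakage(vt) / leakage(VT0)           # normalized to the nominal device
    # Tail metrics (e.g., the 99.9th percentile) converge slowly with sample size.
    print(f"N={n_samples:>9}: mean leakage x{ioff.mean():.2f}, "
          f"99.9th percentile x{np.percentile(ioff, 99.9):.2f}")
```

The tail estimate obtained from the small sample can differ noticeably from the large-sample value, which is exactly the difficulty with Monte-Carlo tail calculations noted above.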
hysical Models of
B. ‘Time-Dependent’ Variation: Ph
Reliability
The second set of variability concerns involves permanent degradation of individual transistor characteristics that evolve with time as a function of transistor activity [2][3][11][12][53][56]. Unlike the 'time-zero' variation discussed above, this variation would arise even for two identically processed transistors with the same initial characteristics, and the shift in parametric values is generally permanent and cannot be restored to its pristine value by turning the IC off; see Fig. 2.
Several degradation mechanisms, such as negative bias temperature instability (NBTI), positive bias temperature instability (PBTI), hot carrier degradation (HCI), and (soft) dielectric breakdown (S-TDDB), have been extensively characterized by many groups over the years. Moreover, the implications of these degradations for circuit and system performance have also been explored extensively. The key difference between degradation models for reliability-aware design and the standard models for qualification is the emphasis on the time, frequency, and duty-cycle dependencies of degradation in addition to the traditional focus on voltage acceleration, temperature acceleration, and the statistical distribution of failure time. This generalization of reliability models emphasizes the fact that it is too conservative to qualify a technology for a given operating voltage and temperature and to dictate that all IC designs based on the technology reside within that boundary. Instead, the quantification of the time dependence of degradation allows trade-offs among voltage, temperature, and time based on local nodal activity and the specific usage condition of the IC.
Consider the case of NBTI degradation associated with PMOS transistors biased in inversion. The degradation is presumed to involve dissociation of Si-H bonds at the interface, followed by either repassivation of the bonds by newly created free hydrogen or diffusion away from the interface following H2 dimerization [2]. (An introductory lecture, a set of lecture notes, and a numerical simulation tool (DevRel) to calculate degradation characteristics are available at www.nanohub.org [15][16][17].) Based on decades of research by many groups across the industry (see Fig. 3), the degradation of threshold voltage can be succinctly described by [18]
ΔVT(t) ∝ (KF N0 / KR)^(2/3) (DH t)^n,
where t is the integrated stress time seen by a transistor, n ~ 1/6 is the time exponent, Eox is the surface field, N0 is the number of Si-H bonds at the interface, T is the temperature, Γ is the polarization constant, DH ∝ exp(−ED/kT) is the diffusion coefficient, KF ∝ Eox exp(Γ Eox) exp(−EF/kT) and KR ∝ exp(−ER/kT) are the forward-dissociation and reverse-annealing rates, and ED, EF, and ER are the activation energies for diffusion, forward dissociation, and reverse annealing of Si-H bonds. The net degradation is frequency independent
[19] and equals the degradation at 50% duty cycle [20].
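As a worked illustration of this power-law behavior, the short sketch below evaluates a generic ΔVT = A0·exp(−EA/kT)·t^(1/6) model and extrapolates a short stress measurement to a 10-year lifetime; the prefactor, activation energy, temperature, and measurement point are hypothetical placeholders rather than parameters extracted from [18].

```python
import math

K_B = 8.617e-5                 # Boltzmann constant, eV/K
N_EXP = 1.0 / 6.0              # NBTI time exponent discussed above

def delta_vt(t_seconds, temp_k, a0=3e-2, ea=0.1):
    """Generic power-law NBTI shift (volts); a0 and ea are illustrative placeholders."""
    return a0 * math.exp(-ea / (K_B * temp_k)) * t_seconds ** N_EXP

# Extrapolate a hypothetical 1000-s stress measurement at 125 C to 10 years.
t_meas, t_life = 1e3, 10 * 365 * 24 * 3600.0
dvt_meas = delta_vt(t_meas, 398.0)
dvt_life = dvt_meas * (t_life / t_meas) ** N_EXP     # pure t^(1/6) scaling
print(f"dVT({t_meas:.0f} s) ~ {dvt_meas*1e3:.1f} mV -> "
      f"dVT(10 yr) ~ {dvt_life*1e3:.1f} mV (x{(t_life/t_meas)**N_EXP:.1f})")
```

Because of the weak t^(1/6) dependence, a five-decade extrapolation in time changes the shift by less than an order of magnitude, which is why long-time projection hinges on getting the exponent right.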
Remarkably, PBTI degradation associated with NMOS transistors biased in inversion is also described by a similar formula, except that the physical meanings of the individual terms are different and the magnitude of degradation is somewhat reduced. NBTI is also a key reliability concern for solar cells and TFTs based on a-Si or poly-crystalline Si, and is described by an analogous formula [21].
There are corresponding formulas for hot carrier degradation based on the bond-dissociation model [22][23][24]. The soft-breakdown characteristics after dielectric breakdown can be characterized by a power-law growth of the gate leakage [11][52],
IG,SBD(t) = ΔIG [γ(V,T) t]^β,
where ΔIG is the jump in gate leakage for each breakdown event, γ(V,T) is the voltage- and temperature-dependent acceleration factor, and β is the Weibull slope. Similar formulas exist for backend degradation arising from electromigration, stress migration, etc.
Finally, while radiation damage was primarily associated with transient faults (soft errors) or gate punch-through (hard errors), close analysis of radiation detectors shows that steady generation of permanent traps and type inversion of donor levels lead to a time-dependent increase in excess leakage and depletion width. Such parametric degradation has been a serious concern for the viability of radiation detectors.
Once these degradation characteristics are characterized, their effect on inverting logic [3], SRAM cells [56][12], pipelined microprocessors [94], etc. can be readily analyzed, and the susceptibility to failure due to a degradation mode, or a combination thereof, can be predicted. Based on these predictive characterizations, we will explore various solution methodologies later in the paper.
Figure 3. (Left) Time evolution of degradation at very long stress times (>1000 hr) from various published results, showing a power-law dependence with a universally observed exponent of n = 1/6. (Right) AC NBTI degradation, normalized to 50% AC, versus pulse duty cycle and frequency from various published results; R-D solutions (red lines) are shown [2]. Taken from S. Mahapatra et al., Proc. IRPS, 2011.
C. Physical Origin of Transient Faults and Soft Errors
Many researchers attribute the mysterious charge loss of electrometers in the 1800s to the first reported occurrence of radiation damage in electronic components, and the US Navy also reported radiation-related failures of electronic components during atomic bomb tests in WWII. Radiation-induced transient faults gained increasing prominence in the 1960s and 1970s as transistors were miniaturized and transient effects associated with radiation strikes and latch-up [9] caused serious reliability concerns. Transient soft errors are run-time computational errors that do not lead to permanent degradation of parameters, nor are they related to process variation, but the random nature of the transient upset makes detection and correction a challenging problem. For example, tests of an SGI Altix 4700 computer (32 blades, 35 GB SRAM) show approximately 4-5 SRAM upsets every week – errors that must be corrected for the proper operation of the machine.
Sources of radiation include alpha particles and high-energy neutrons in solar winds and cosmic particles as well as low-energy neutrons from packaging materials [39], making the design of satellites and spacecraft particularly challenging. Although single event effects (SEE) are generally associated with transient errors related to single-bit upsets (SEU) or multi-bit upsets (MBU), permanent faults related to single event gate rupture or burnout (SEGR/SEB) are also possible and are discussed in the next section [39][40].
Modeling/characterization of soft errors involves three steps [41]: (i) understanding the radiation environment of the electronics, which may involve interaction of the atmosphere with radiation fluxes, physics of nuclear fission, etc.; (ii) given the relative fluxes of particles, computing the probability that they will interact with a given active volume of a transistor; and finally, (iii) calculating the critical charge necessary to upset a particular type of device. These effects are modeled at various levels of sophistication, depending on the degree of precision necessary for a particular application, and compared extensively with measurements. For example, simulations show and experiments confirm that the soft error rate (SER) increases exponentially with reduction in supply voltage – increasing by a factor of 10-50 as the supply voltage is scaled from 5 V to 1 V. In the worst-case scenario, a strike
can change the state of multiple nodes (MBU), making it difficult to recover the lost data. In memories, techniques exist to detect and correct one error in every column/row, but in the case of multiple errors, present-day techniques are not sufficient and will result in complete system failure [8]. Thus, it is important to explore techniques that can detect (and, if possible, correct) the errors without allowing them to propagate. In general, the transition to SOI technologies reduces the amount of charge that is generated within the bulk of the body. Since this increases the LET threshold for upset, the SER probability is reduced by a factor of 2-3 [42]. However, trapping and other hysteresis effects could be a concern for SOI transistors.
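The exponential voltage sensitivity quoted above can be reproduced with the common critical-charge approximation SER ∝ exp(−Qcrit/Qs) with Qcrit ≈ Cnode·VDD; this model and the node capacitance and collection-charge values below are outside assumptions used purely for illustration.

```python
import math

def relative_ser(vdd, c_node_ff=1.0, q_s_fc=1.3):
    """Relative soft error rate from an exponential critical-charge model:
    SER ~ exp(-Qcrit/Qs), with Qcrit ~ Cnode*VDD (1 fF x 1 V = 1 fC).
    Particle flux and sensitive area are folded into the normalization."""
    q_crit = c_node_ff * vdd
    return math.exp(-q_crit / q_s_fc)

ratio = relative_ser(1.0) / relative_ser(5.0)
print(f"SER(1 V) / SER(5 V) ~ {ratio:.0f}x")   # ~20x, same order as the 10-50x above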
Finally, memories like Flash and ZRAM are also susceptible to soft errors [10], since an energetic particle may eject the stored charge and change the memory state permanently. We will briefly summarize the well-known detection and solution strategies later in the paper.
D. Origin of Permanent Faults
Permanent faults have been a reliability concern since the early days of the electronics industry, which was based on electro-mechanical switches and punch-cards read by optical readers [43]. The unreliability of mechanical switches in the 1940s and 1950s was so severe that many great scientists like von Neumann, Shannon, and Hamming spent considerable time working on fundamental principles for designing reliable systems from unreliable components. The techniques include the judicious combination of redundancy, stability analysis, and coding. To this day,
when computer scientists discuss the reliability of components, the language and terminology ('stuck-at-zero' fault, meaning a relay which cannot be closed, or 'stuck-at-one' fault, meaning a relay permanently connected to the output) can be dated back to the early days of computer engineering.
Figure 4. (Top) Type 4 faults present in scaled devices: open defects, short defects, and resistive bridges. (Bottom, left) Defective NMOS (drain-source short) in an inverter and (bottom, right) the corresponding (less than perfect) voltage transfer characteristic (VTC).
Modern semiconductor processing has vastly improved process reliability. While it is still possible to have physical defects such as shorts, opens, and resistive bridges (see Fig. 4) that can be modeled as 'stuck-at-zero', 'stuck-at-one', or stuck-open faults [44], fortunately these challenges are likely to be resolved with process debugging or detected at test time using standard test-generation techniques. However, once the circuit has been used for a prolonged period of time, 'stuck-at-zero' type faults may occur with a hard dielectric breakdown at the gate of an NMOS (Fig. 4).
Moreover, given the reduced oxide thickness, there is an increasing likelihood of punch-through of the dielectric film (SEGR/SEB) by energetic particles, leading to permanently damaged transistors. These types of hard errors also occur in transistors based on CNTs, poly-silicon NWs, etc., since excessive heating may lead to burning of the channel region [45]. The classical theories of fault tolerance can therefore be useful in these contexts, with the implicit assumption that a certain percentage of devices may fail randomly during IC operation. However, since all such redundancy techniques incur an extra area/power penalty, ensuring high performance with high reliability requires intelligent trade-offs and thoughtful design at the architecture, circuit, and transistor levels. We will discuss the detection and solution strategies for such failures in the next section.
III. CIRCUIT SOLUTION STRATEGIES FOR LOGIC CIRCUITS
A. Process Variation and Solution Strategies.
Process variation and reliability considerations pose special design challenges for logic circuits. Historically, circuit designers first became aware of the challenges of design under variability as they had to reckon with the widening distribution of leakage due to random dopant fluctuation. In response, device physicists proposed channel dopant-free transistor designs [26]. At the circuit design level, deterministic 'global' techniques like back-gate biasing and adaptive VDD have been proposed to tighten the VT and leakage current distributions [46]. These approaches are particularly effective for systematic process fluctuation arising from line-edge roughness or oxide thickness, etc.
On the other hand, CRISTA [47] is a circuit/architecture co-design technique for variation-tolerant digital system design, which allows for aggressive voltage over-scaling. The CRISTA design principle (a) isolates and predicts the paths that may become critical under process variations, (b) ensures that they are activated rarely, and (c) avoids possible delay failures in the critical paths by adaptively stretching the clock period to two cycles. This allows the circuit to operate at a reduced supply voltage while achieving the required yield with a small throughput penalty (due to rare two-cycle operations), as illustrated below. A schematic of the approach is shown in Fig. 5.
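The throughput cost of CRISTA's rare two-cycle operations follows from a one-line expected-value argument; the activation probabilities below are hypothetical and are not taken from [47].

```python
def crista_relative_throughput(p_two_cycle):
    """Relative throughput when a fraction p of operations take two cycles
    (rare critical-path activations) and the rest complete in one cycle."""
    avg_cycles = (1.0 - p_two_cycle) + 2.0 * p_two_cycle
    return 1.0 / avg_cycles

for p in (0.01, 0.05, 0.10):   # hypothetical critical-path activation rates
    print(f"p(two-cycle) = {p:.0%} -> relative throughput {crista_relative_throughput(p):.3f}")
```

Even a 5% activation rate costs less than 5% throughput, which is why the clock can be set by the non-critical paths at the reduced supply voltage.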
Figure 5. (Top) 32-bit full adder designed using the CRISTA design methodology, with a pre-decoder for critical-path prediction and one-cycle/two-cycle selection. (Bottom) RAZOR flip-flop (CLK_D is the delayed CLK).

A recently introduced approach called 'RAZOR' [48] uses a shadow latch to detect transient failures in pipelined logic, which cause an output to be incorrectly
latched. Upon detection of an error, the instruction is re-executed by flushing the pipeline and using the correct data from the shadow latch (Fig. 5). Razor is an adaptive tuning technique for correct operation under parametric process variation.
At the system level, design techniques like the robust core checker [49], leakage-based PVT sensors [62], or on-chip PLL-based sensors [50] have also been explored to generate self-diagnostic signals that dynamically control adaptive body-bias for inter-die variation. An excellent review of process-tolerant design techniques is given in [1].
B. Variation Due to Parametric Degradation
Parametric time-dependent degradation associated with NBTI, HCI, TDDB, and EM are challenging problems – made even more complicated by additional contributions from 'time-zero' process variation. Given voltage-, temperature-, and time-dependent analytical models, however, statistical design methodologies [51] offer promising solutions in such a scenario. We offer a brief review of the approaches related to NBTI to illustrate the principles of statistical VLSI design in the presence of variability; previously there has been extensive work regarding gate-dielectric breakdown [52], and there are suggestions by many that PBTI and HCI are likely to be analyzed by similar approaches.
To appreciate the role of such parametric degradation in logic circuits, we used the 70 nm Predictive Technology Model (PTM) and an NBTI-aware static timing analysis (STA) tool [3] to estimate the temporal degradation in circuit delay under NBTI. Both simple logic gates (e.g., NAND, NOR) as well as various ISCAS benchmark circuits involving a few hundred to a few thousand transistors were simulated to estimate the degradation. In general, the results indicate that circuit delays can degrade by more than 10% within 10 years of operation [3]. Such a large change in parameters from just one of the degradation mechanisms is a genuine cause for concern.
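As a rough cross-check of the ~10% delay-degradation figure, the sketch below feeds an assumed end-of-life ΔVT into an alpha-power-law delay model; the supply voltage, nominal VT, exponent, and ΔVT values are illustrative assumptions rather than outputs of the NBTI-aware STA tool [3].

```python
def relative_delay(vdd, vt, alpha=1.3):
    """Alpha-power-law gate delay (arbitrary units): delay ~ VDD/(VDD - VT)^alpha."""
    return vdd / (vdd - vt) ** alpha

VDD, VT0, ALPHA = 1.0, 0.30, 1.3            # assumed 70 nm-like operating point
for dvt in (0.02, 0.04, 0.06):              # hypothetical 10-year NBTI shifts (V)
    ratio = relative_delay(VDD, VT0 + dvt, ALPHA) / relative_delay(VDD, VT0, ALPHA)
    print(f"dVT = {1e3*dvt:.0f} mV -> delay increases by {100*(ratio - 1):.1f}%")
```

With these assumed numbers, a 40-60 mV shift produces roughly an 8-12% delay increase, consistent with the order of magnitude reported by the benchmark simulations.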
The techniques proposed to address this degradation are discussed next.
Pre-Silicon Solution Strategies: The solution strategies discussed in the literature roughly fall into two categories: deterministic and statistical. Various global deterministic techniques like dynamic VDD scaling, duty-cycle scaling, etc. have been discussed in the literature [53]. 'VDD scaling' may not always be optimal because, in addition to increasing reliability, it also increases delay [54][55]. The proposal of 'reduced duty-cycle' [56] is misleading, because in inverting logic a reduced duty-cycle at one stage translates to an enhanced duty-cycle for the following stage [57].
The statistical technique involves Lagrange-multiplier (LM) based sizing given the constraints of area, gate delay, and product lifetime. Both cell-based sizing [3] and transistor-based sizing [57] can be effective. Since the threshold voltage of a gate depends on the switching activity, which in turn depends on the sizes of the other transistors, the sizing algorithm is by definition self-consistent [58]. The results show that cell-sizing (where NMOS and PMOS are scaled by the same factor) [3] of several ISCAS benchmark circuits can make them NBTI-tolerant with only modest area overhead (~9%). Even better optimization is desirable and possible if the PMOS is scaled (but the NMOS is not), thereby providing – in the best case and as shown in Fig. 6 – a factor of two improvement in area penalty [57], so that a wide variety of ISCAS benchmark circuits at the 70 nm node can be designed for NBTI tolerance with only a 5% area penalty (and no delay penalty). One limitation of the specific work discussed above is that the model does not include the effect of NBTI relaxation – a key feature of NBTI characteristics. More recent work [59] addresses the relaxation issue specifically and arrives at improved (but comparable) estimates of the various trade-offs.
Figure 6. PMOS-only LR optimization of the ISCAS benchmark circuit C1908 based on 70 nm PTM offers approximately 45% saving in area overhead (11.7% for cell-based optimization, 6.13% for PMOS-only optimization).
Post-Silicon 'Silicon Odometers': Since NBTI degradation is specific to the local operating environment, all ICs do not degrade equally. An operating-condition-specific indicator of NBTI degradation would therefore be of great value. A new class of techniques, broadly known as "Silicon Odometers," has recently been proposed to allow a simple and direct estimate of the actual usage of an IC.
Figure 7. The change in Iddq (red triangles) is measured by temporarily flipping the states of the transistors (Vin = 0) so that leakage from the degraded PMOS is reflected in Iddq. Note that the power-law exponents of individual devices and of the IC (here) are identical.
The first approach suggests the use of the quiescent leakage current (IDDQ) to establish on-the-fly time-dependent degradation due to NBTI; see Fig. 7 [60][61]. Over the years, Iddq has been used as a process-debugging tool for time-zero short- or open-circuit issues [95]. Since Iddq is also affected by VT shift, it can also be an effective and efficient monitor of application- and usage-specific IC degradation as a function of time.
The second approach is based on differential signals from built-in modules or newly designed structures. For example, Refs. [63][64] compare the phase difference between two ring oscillators (DLL) – one stressed in actual operation and the other not – to estimate the actual usage history of the IC. Other groups [50] have used variation in the PLL signal to arrive at the same signatures. Finally, Ref. [65] uses a PMOS transistor biased in the subthreshold region as an NBTI sensor for the actual usage of the IC, and this degradation is then mapped back to frequency through a 15-stage NAND-gate ring oscillator. In sum, a wide variety of techniques are now being used to measure – in situ – the actual usage of the IC. Other types of locally embedded sensors have also been used to dynamically monitor the degradation [66][67]. This enables the circuit to be run at its maximum possible frequency, while preventing any failure due to degradation.
These indicators – be it IDDQ or DLL phase shifts – would then allow one to use adaptive design to extend the lifetime of ICs or to indicate incipient failure and preventive maintenance, allowing a powerful post-Si design approach for future ICs. The effectiveness of the Iddq methodology has been demonstrated on ISCAS benchmark circuits at the 70 nm PTM node, establishing a perfect correlation between NBTI degradation and the decrease in Iddq. The technique has also been verified experimentally both for NBTI and TDDB to show the ease of implementation of the methodology. Although recent results show encouraging trends, the area and power overhead of the sensors as well as the algorithmic complexity associated with various corrective techniques are limitations of this technique that need careful consideration.
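As an illustration of how an Iddq reading could be mapped back to an average threshold-voltage shift, the sketch below inverts a subthreshold-leakage model Iddq ∝ exp(−VT/(n·vT)); the assumption of subthreshold-dominated leakage and the slope factor are illustrative and not taken from [60][61].

```python
import math

def dvt_from_iddq(iddq_ratio, n=1.5, v_thermal=0.0259):
    """Average PMOS dVT (volts) inferred from the measured Iddq drop,
    assuming subthreshold-dominated leakage: Iddq ~ exp(-VT/(n*vT))."""
    return -n * v_thermal * math.log(iddq_ratio)

# Hypothetical odometer readings: Iddq has dropped to 80% and 60% of its time-zero value.
for ratio in (0.8, 0.6):
    print(f"Iddq/Iddq0 = {ratio:.1f} -> estimated dVT ~ {1e3*dvt_from_iddq(ratio):.1f} mV")
```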
Solution Based on Degradation-Free Transistors:
Classical MOSFETs are undergoing significant changes: the channels have been strained by Si/Ge source and drain, and high-k/metal gates have been introduced in the gate stack. These material changes have led to an intriguing suggestion of the possibility of NBTI degradation-free transistor operation for high-performance logic circuits [68]. Briefly, the ON-current of a transistor is given by ION ~ μ(VG − VT) (μ is the mobility, VT the threshold voltage), so that
ΔION/ION,0 = Δμ/μ0 − ΔVT/(VG − VT,0),
where the subscript '0' indicates pre-stress parameters. When VT degrades (increases) with the generation of bulk and interface traps, the effective field at the Si/SiO2 interface is reduced. Since the slope of the mobility-field characteristic is negative, this translates to a positive Δμ. Taken together, the self-compensating effects of ΔVT and Δμ reduce the drain-current degradation, and at some strain values can make the transistor degradation-free. A combination of circuit techniques as well as device engineering may keep NBTI degradation within acceptable limits. There may be similar opportunities for PBTI- and HCI-hardened design principles.
C. Detection and Solution Techniques for Transient Failures
Similar to the degradation-free transistors for NBTI, we have already discussed the intrinsic improvement in soft-error robustness of SOI transistors, which have a smaller collection volume and therefore a higher LET threshold for upset. In addition, there are many circuit/architectural/software detection and correction algorithms for soft errors in logic circuits, as discussed below.
At the circuit level, flip-flops have been proposed with built-in soft error tolerance [69]. The flip-flop implementation consists of two parallel flip-flops with four latch elements and a C-element for comparing the outputs of the two flip-flops. Out of the four latch elements, one can be subjected to a single event upset (SEU). In that case, the outputs of the two flip-flops differ and the error does not propagate to the output through the C-element. In the event of more than one error in the four latch elements, the output can go wrong; however, the probability of such an event is very low. Another topology for a soft-error tolerant flip-flop is proposed in [70]. Along with immunity towards soft errors, the proposed flip-flop offers enhanced scan capability for delay fault testing.
Finally, in the 'checkpoint and roll back' technique [71], instructions stop executing once a fault has been detected and the state of the system is restored from the point of error introduction (i.e., rolled back). This technique is associated with the micro-architectural Access-Control-Extension (ACE) instruction set [72] for detecting the faults. Although the approach can stall the processor for a few cycles, it results in correct execution of instructions without the need for complete flushing of the pipeline.
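The masking behavior of the C-element described above (the output updates only when the two flip-flop copies agree and otherwise holds its previous value) can be captured behaviorally in a few lines; this is a conceptual sketch, not the transistor-level design of [69].

```python
def c_element(a, b, prev_out):
    """Muller C-element: drive the inputs' common value when they agree,
    otherwise hold the previous output (masking a single upset copy)."""
    return a if a == b else prev_out

# One latch copy suffers an SEU: the outputs disagree, so the stored value is held.
stored = 1
ff_main, ff_shadow = 1, 0          # shadow copy flipped by a particle strike
stored = c_element(ff_main, ff_shadow, stored)
print("output after single upset:", stored)   # -> 1 (error masked)

# Both copies agree on new data: the output follows them.
stored = c_element(0, 0, stored)
print("output after normal update:", stored)  # -> 0
```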
The above-mentioned approaches for soft error tolerance are applicable to general-purpose computing systems. On the other hand, for probabilistic applications such as Recognition, Mining, and Synthesis (RMS), the Error Resilient System Architecture (ERSA) is proposed in [73]. ERSA combines error-resilient algorithms and software optimization with asymmetric reliability cores (some cores are more reliable than others due to parameter variations, say). Results show that the architecture is resilient to high error rates on the order of 20,000 errors/cycle per core.
Figure 8. Transistor-level redundancy techniques for Type 4 short and open faults. With Po and Ps denoting the probabilities of open and short faults of a single transistor, the configuration that fixes short faults gives Po_eq = 2Po − Po^2 and Ps_eq = Ps^2, while the configuration that fixes open faults gives Po_eq = Po^2 and Ps_eq = 2Ps − Ps^2; in both cases Pwork_eq = (1 − Po_eq)(1 − Ps_eq).

Figure 9. (Top, left) Cascaded TMR and the probability of correct system operation. (Top, right) Triplicated Interwoven Redundancy (TIR) and the probability of correct system operation. (Bottom) Comparison of the success probability of TMR and TIR versus the probability of failure of each component.

D. Detection and Solution Techniques for Hard Faults
In the gate-level 'stuck-at' fault model, each circuit node can be at a stuck-at-zero (1/0: good value is ONE and faulty value is ZERO), stuck-at-one (0/1), 0/0, 1/1, or x/x value. Note that this model is used in reference to the manifestation of defects at the electrical level and is not a physical defect model. To detect a stuck-at-one fault (say), the node is activated to have a logic value of 0 using an input vector such that the effect of the fault is propagated to a primary output.
Although we have already discussed IDDQ (quiescent current) tests to determine NBTI and HCI degradation, the IDDQ test was originally developed to detect hard faults (such as shorts between two nodes, pseudo-stuck-at faults, stuck-on faults, etc. [74]). The assumption that in CMOS circuits an activated fault, such as a bridging defect, will lead to a large increase in quiescent current worked well in early technology generations for detecting manufacturing defects. However, with increased leakage and leakage spread (due to parameter variations) in scaled technologies, it is becoming increasingly difficult to use IDDQ to test for manufacturing defects; the sensitivity of IDDQ tests under increased leakage and parameter variations has decreased considerably.
Pre-Silicon Solution Strategies: Once the faults have been identified, their effects can be mitigated by various types of redundancy. At the transistor level, one can deploy various types of series and parallel transistor redundancies depending on the prevailing types of defects (as shown in Fig. 8). Redundancies can be distributed at random throughout the circuit. The introduction of transistor-level redundancy may decrease the finite impulse response (FIR) filter failure probability by almost 20% [75].
At the system level, triple modular redundancy (TMR) and Triplicated Interwoven Redundancy (TIR) are popular options to address both transient and permanent faults [76]. The final result of the computation in TIR and TMR is decided by majority voting over three identical sets of hardware (see Fig. 9). For a given area overhead, however, TIR seems to provide somewhat improved robustness compared to TMR. Note that in this analysis the voting circuit is assumed to have zero failures.
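The TMR curve in Fig. 9 follows directly from majority voting over three independent copies with the failure-free voter assumed above; a minimal sketch of that calculation is given below.

```python
def tmr_success(p_fail):
    """Probability that a TMR system works: at least 2 of 3 copies work,
    assuming independent failures and a perfect majority voter."""
    q = 1.0 - p_fail
    return q**3 + 3 * q**2 * p_fail

for p in (1e-3, 5e-3, 1e-2):
    print(f"p_fail={p:.0e}: single copy {1-p:.6f}, TMR {tmr_success(p):.6f}")
```

For p_fail = 10^-3 the system failure probability drops from 10^-3 to about 3x10^-6, consistent with the near-unity success probabilities plotted in Fig. 9.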
The main challenge in dealing with hard failures is that many of the classical techniques proposed above to address soft degradation are not suitable for handling such catastrophic failures: global solutions like adaptive body-bias, etc. are not helpful, and while redundancy could help, the BIST may have to monitor cell failures at an exceptionally high rate, so the approach may not be practical. Various other versions of reconfigurable logic based on FPGAs are being investigated and offer some promise [77][78], but more research is needed for practical, cost-effective solutions. Therefore, the challenges of circuit design under random permanent faults offer new avenues for research in this field.
IV. DETECTION AND SOLUTION TECHNIQUES FOR SRAM
The relative footprint of memory in microprocessors is expected to reach 50% by 2008 and 90% by 2011. As such, robust SRAM design under process variation and parameter degradation has been a persistent challenge for many years [27][79].
A. Process Variation and Solution Strategies
Four functional issues define SRAM yield: ACCESS failure (AF) defines an unacceptable increase in cell access time, READ failure (RF) defines the flipping of the cell while reading, WRITE failure (WF) defines the inability to write to a cell, and HOLD failure (HF) defines the flipping of the cell state in standby, especially in low-VDD mode [80]. Only those memory blocks that have none of these failures in any of their cells are expected to remain functional. Systematic inter-die variation and random intra-die variation limit the yield of SRAM cells – as expected.
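Because a block is functional only if every cell passes, array yield falls off sharply with cell count; the minimal sketch below makes this binomial argument explicit, with the per-cell failure probabilities chosen purely for illustration.

```python
def array_yield(p_cell_fail, n_cells):
    """Yield of a memory block = probability that none of its cells fail,
    assuming independent cell failures."""
    return (1.0 - p_cell_fail) ** n_cells

N_CELLS = 2 * 1024 * 1024 * 8          # e.g., a 2 MB array (bits)
for p in (1e-9, 1e-8, 1e-7):           # hypothetical per-cell failure probabilities
    print(f"p_cell={p:.0e}: yield = {100*array_yield(p, N_CELLS):.1f}%")
```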
Of the many proposals discussed in the literature to address this issue, we will discuss two representative cases. Similar to logic transistors, the first approach involves using adaptive body bias to tighten the distribution of VT and leakage current. The signal may be self-generated by monitoring the on-chip leakage. It has been shown both theoretically and experimentally that such adaptive body bias can increase SRAM yield significantly [81]. Once again, the methods are effective for inter-die variation, but unsuitable for fine-grained intra-die variations.
The second approach consists of two circuit-level techniques: one involving more complex SRAM cell designs, especially for low-voltage applications [82], and the other involving redundant columns to replace faulty cells. In the 'redundant column' approach, the Built-In Self Test (BIST) circuit periodically (e.g., every time the operating condition is changed) checks for faulty bits and subsequently resizes the memory to prevent access to the faulty bits [96]. It has been shown that such a design can be transparent to the processor, with negligible effect on access time or area/energy overhead, and can still improve yield by as much as 50%.
B. Reliability Aware SRAM Design
Similar to the design of logic transistors, SRAM design in the simultaneous presence of process variation and parameter degradation is a significantly more challenging problem. For example, regardless of whether an SRAM cell stores a '0' or '1', if the cell is not accessed frequently, one of the PMOS pull-up transistors could be under NBTI stress for a prolonged period of time. As such, the divergence of the parameters of the pair of PMOS transistors within an SRAM cell can erode the read and write stability of the cell [12]. Likewise, since gate dielectric breakdown (TDDB) is a statistical process, breakdown would induce excess gate leakage in one transistor without a corresponding effect on the other five transistors [83]. This asymmetric degradation is a key reliability concern for SRAM design and needs to be reflected in the IC design process.
An analysis of a typical 6T SRAM array shows that the READ failure probability increases by a factor of 10-50 due to NBTI if the initial ΔVT due to process variation ranges between 30-40 mV (70 nm PTM technology). On the other hand, the WRITE stability improves somewhat with NBTI, as the cell can be written at a smaller voltage. Overall, however, metrics like the static noise margin (SNM) reduce so quickly with NBTI that it emerges as an important reliability concern [12]. Indeed, it was found that the parametric yield of large SRAM arrays (e.g., 2MB, PTM 70nm) can degrade by up to 10% in a 3-year time period (Fig. 10).
Solution Strategies: Several groups have proposed and explored potential remedies for NBTI-induced SRAM degradation. These approaches involve improvements in the pre-silicon design phase, the post-silicon test phase, and the post-silicon operation phase, as discussed below.
In the device design phase, the most effective route to combating the NBTI-related asymmetry is to reduce the initial asymmetry as much as possible. Therefore, techniques that reduce process variation are well-suited for reliability-aware design as well. These may include dynamic VDD scaling, body biasing, or redundant arrays to replace faulty cells, etc. [53][56][12].
Figure 10. Temporal parametric yield of a 6T-SRAM array. The decrease in memory yield for larger memories, especially towards the end of the product cycle, could be addressed by pre-silicon design (e.g., cell flip) or post-silicon monitoring (Fig. 8).

During the test phase, a body-bias based burn-in strategy could also be effective [55]. In particular, if the body bias is applied such that the SNM is reduced, then NBTI-induced degradation will immediately identify vulnerable cells. With
an additional N-well bias, the SNM could be increased for
those cells and this increases memory yield by a factor of 3
– a remarkable improvement.
Finally, a cell-flipping algorithm proposes to restore symmetry in the PMOS transistors by inverting the stored data in the memory every other day, so that any given PMOS transistor is not subjected to prolonged uninterrupted stress. While good results are obtained, the main limitation of the approach is the algorithmic and hardware complexity/overhead associated with inverting the data every other day [56].
C. Soft Errors: SRAM Solutions
For SRAM memory, various types of error detection/correction schemes are now routinely used for a wide variety of ICs. These include (i) Single Error Correction (SEC), which requires 6-8 additional error correction bits depending on bit-length, (ii) Single Error Correction, Double Error Detection (SEC-DED) schemes with 7-9 error correction bits, and (iii) Double Error Correction (DEC) schemes with 12-24 bits [84]. These techniques, based on coding theory, require substantial overhead in area, power, and execution time and require careful implementation; however, in general they are effective and widely used.
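The parity-bit counts quoted above follow from the textbook Hamming condition for single-error correction, 2^k ≥ m + k + 1, with one additional bit for SEC-DED; the helper below reproduces those numbers and is a generic relation rather than an implementation from [84].

```python
def sec_parity_bits(data_bits):
    """Smallest k with 2**k >= data_bits + k + 1 (Hamming single-error correction)."""
    k = 1
    while 2**k < data_bits + k + 1:
        k += 1
    return k

for m in (32, 64, 128, 256):
    k = sec_parity_bits(m)
    print(f"{m:>3} data bits: SEC needs {k} bits, SEC-DED needs {k + 1} bits")
```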
D. Permanent Faults: Detection and Solution Strategies
Possible manufacturing defects (such as shorts, gate oxide defects, opens, etc.) can be modeled as circuit nodes being stuck at zero or one. There exist several test pattern generators that optimize the number of test vectors while trying to achieve close to 100% fault coverage. We have already discussed the use of test patterns and IDDQ for logic circuits; here we discuss approaches that are relevant for faults which occur in SRAMs.
SRAM fault models consider not only standard stuck-at faults but also transition faults (a cell fails to make a 0-to-1 or 1-to-0 transition) and coupling faults (a transition in bit 'j' causes an unwanted change in bit 'i'). A popular testing methodology known as the MARCH test involves writing and reading 0 and 1 in successive memory locations [85]. The test failure data provided by the MARCH test enables identification of the fault locations in the memory array. The addresses of faulty bit-cells can then be remapped to redundant bit-cells in the array (column redundancy) to increase memory yield.
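A minimal sketch of a MARCH-style element sequence (write 0 ascending; read 0/write 1 ascending; read 1/write 0 descending) applied to a simulated array with one stuck-at-1 cell is shown below; it illustrates the principle rather than any specific industrial MARCH algorithm from [85].

```python
def march_test(read, write, n):
    """Return the addresses that fail a simple MARCH-like sequence."""
    failures = set()
    for a in range(n):                       # M0: write 0 (ascending)
        write(a, 0)
    for a in range(n):                       # M1: read 0, write 1 (ascending)
        if read(a) != 0:
            failures.add(a)
        write(a, 1)
    for a in reversed(range(n)):             # M2: read 1, write 0 (descending)
        if read(a) != 1:
            failures.add(a)
        write(a, 0)
    return sorted(failures)

# Simulated 16-bit memory with a stuck-at-1 fault at address 5.
mem = [0] * 16
STUCK_AT_1 = {5}
def write(addr, val): mem[addr] = 1 if addr in STUCK_AT_1 else val
def read(addr): return mem[addr]

print("faulty addresses:", march_test(read, write, len(mem)))   # -> [5]
```

The failing addresses reported by such a sequence are exactly what is remapped to redundant columns, as described above.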
Techniques to address soft errors in SRAM memory rely either on the error correction codes (ECC) discussed above, or on raising the critical charge necessary to flip a cell by various circuit techniques. A classic example involves the Dual Interlocked Storage Cell (DICE), where SRAM latches connected back to back store the memory state [97]. Regardless of the position of the strike, the bit does not flip, because the other latch prevents the racing of the signal necessary to complete the bit-flip. Other rad-hard methods add resistance to the feedback loop to slow the racing of the signal and thereby reduce the probability of a bit flip. Such resistive hardening, however, may not be scalable in high-performance circuits.
V. OUTLOOK/CONCLUSIONS
A. Evolution of CAD Tools
One of the primary challenges in process- and reliability-aware design is the lack of widely available CAD tools and a well-recognized design philosophy that could encapsulate the information about device properties for circuit designers to take appropriate corrective action.
Recently, there have been changes across the board. Commercial device simulators like MiniMOS and Medici now incorporate models for NBTI, HCI, and TDDB degradation, and university-based efforts on reliability models (e.g., DevRel at www.nanohub.org) are also being developed. Circuit design techniques are being explored by several groups from Purdue (CRISTA), the University of Minnesota, Arizona State, and the University of Michigan, and industrial groups like TI and IBM are publishing their design methodologies at various stages of the design process [86]. Recent simulation software from the Arizona group, called 'NewAge', self-consistently integrates activity-dependent temperature enhancement and transistor degradation along with timing analysis to provide a systematic prediction of the role of parametric degradation in various integrated circuits [87]. Eventually, we anticipate that these issues will propagate to system-level tools.
The multi-scale radiation damage and soft error models have grown increasingly sophisticated over the years. For example, simulators like SEMM-2 from IBM contain particle transport, nuclear physics, and particle-source generator modules as well as detailed layout information for the front-end and back-end to create a realistic simulation environment for the prediction of soft errors [41]. These simulators, however, focus on charge generation and critical charge; if hot carrier information is desired, a modeling framework that couples GEANT4 with full-band Monte Carlo simulation is appropriate and has been successfully used to study charge ejection from Flash and ZRAM memories [10].
Several Automatic Test Pattern Generation (ATPG) tools are available from vendors like Mentor Graphics, Synopsys, and Cadence that can generate test patterns for stuck-at faults and IDDQ, along with at-speed testing for path delay faults [88][89][90]. Test failure data coupled with the test patterns and the gate-level design description can be used to identify the different defects causing the test failure and the defect locations [93]. However, the cost of external testing using automatic test equipment is increasing as chips become more complex, thereby increasing the total chip cost. Research is in progress to achieve lower test cost using on-chip testing (Built-In Self Test, BIST) instead of relying on the expensive external testing methodologies practiced today [91][92].
In sum, in this paper we have reviewed four classes of faults and various fault-tolerant strategies to detect and manage them. We have explored in some depth the various approaches proposed to design with these faults, both for logic and for memory transistors. The techniques are innovative and often appear promising, as they offer savings of power/area over more conservative designs. However, the implications for test, burn-in, and yield are important questions that need to be explored. Eventually, we suggest that the only way to implement any design philosophy is through CAD tools. Progress towards that goal is already underway.
ACKNOWLEDGEMENT
We wish to acknowledge contributions from members of our groups: K.-H. Kang, B. Paul, A. E. Islam, D. Varghese, and G. Karakonstantis. This work was supported by the National Science Foundation, Applied Materials, Texas Instruments, and Taiwan Semiconductor Manufacturing Company.
REFERENCES
[1] S. Borkar, "Designing Reliable Systems from Unreliable Components: The Challenge of Transistor Variability and Degradation", IEEE Micro, pp. 10-16, 2005.
[2] M. A. Alam and S. Mahapatra, "A comprehensive model of PMOS NBTI degradation," Microelectronics Reliability, 45 (1), p. 71, 2005.
[3] B. C. Paul et al., EDL-26, pp. 560-562, 2005. Also see B. Paul, Proc. of DATE, pp. 780-785, 2006.
[4] M. A. Alam, N. Pimparkar, S. Kumar, "... for Macroelectronics Applications", MRS Bulletin, 31, 466, 2006.
[5] G. Groeseneken, R. Degraeve, and B. P. Roussel, "Challenges in Reliability Assessment of Advanced CMOS Technologies", Keynote Presentation at the Int. Conf. on Physical and Failure Analysis of Integrated Circuits, pp. 1-9, 2007.
[6] H. Kufluoglu and M. A. Alam, "A geometrical unification of the theories of NBTI and HCI time-exponents and its implications for ultra-scaled planar and surround-gate MOSFETs," IEEE IEDM Technical Digest, p. 113, 2004.
[7] P. Nair and M. Alam, "Design Considerations of Nanobiosensors", IEEE Trans. Electron Devices, 2007.
[8] R. Baumann, "The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction," IEDM Tech. Digest, pp. 329-332, 2002.
[9] P. E. Dodd, M. R. Shaneyfelt, J. R. Schwank, and G. L. Hash, "Neutron Induced Latchup in SRAM at Ground Level," IRPS Proceedings, pp. 51-55, 2003.
[10] N. Butt, Ph.D. Thesis, Purdue University, 2008.
[11] M. Alam and K. Smith, “A Phenomenological Theory of
Correlated Multiple Soft-Breakdown Events in Ultra-thin Gate
Dielectrics,” IEEE IRPS Proceedings, p. 406, (2003).
[12] K. Kang, H. Kufluoglu, K. Roy, and M. A. Alam, “Impact of
Negative Bias Temperature Instability in Nano-Scale SRAM Array:
Modeling and Analysis,” to be published in IEEE Transactions on
CAD, 2007.
[13] J. H. Stathis, “Reliability Limits for the Gate Insulator in CMOS
Technology,” IBM Journal of Research and Development, vol. 46
(2/3), pp. 265-286, (2002).
[14] R. Degraeve, B. Govoreanu, B. Kaczer, J. Van Houdt, G.
Groeseneken, “Measurement and statistical analysis of single
trap current-voltage characteristics in ultrathin SiON”, Proc. of
IRPS, 2005. pp. 360- 365.
[15] https://www.nanohub.org/resources/1647/
[16] https://www.nanohub.org/resources/1214/
[17] http://cobweb.ecn.purdue.edu/~ee650/handouts.htm
[18] A.E. Islam, Ph.D. Thesis, Purdue University, 2010.
[19] R. Fernandez, B. Kaczer, IEDM Tech. Dig. 2006.
[20] H. Kufluoglu et al., Proc. of IRPS, 2007.
[21] D. Allee et al., “Degradation Effects in a a-Si :H Thin Film
Transistors and their Impact on Circuit Perfomance,” Proc. of
IRPS, p. 158, 2008.
[22] E. Takeda, C. Yang, A. Miura-Hamada, “Hot Carrier Effects in
MOS Devices,” Academic Press (1995).
[23] http://www.eecs.berkeley.edu/Pubs/TechRpts/1990/ ERL-90-4.pdf
[24] D. Varghese, Ph.D. Thesis, Purdue University, 2009.
[25] G. Cellere, L. Larcher, A. Paccagnella, A. Visconti, and M. Bonanomi, "Radiation induced leakage current in floating gate memory cells", IEEE Transactions on Nuclear Science, vol. 52, no. 6, pp. 2144-2152, 2005.
[26] R. W. Keyes, “Physical Limits of Silicon Transistors and Circuits,
Rep. Prog. Phys. 68, 2701-2746, 2005.
[27] A. Bhavnagarwala, X. Tang, J. Meindl, The impact of Intrinsic
Device Fluctuations on CMOS SRAM Cell Reliability“, IEEE J. of
Solid State Circuits, 36 (4), pp. 658-664, 2001.
[28] V. De, X. Tang, and J. Meindl, “Random MOSFET Parameter
Fluctuation Limits to Gigascale Integration (GSI)”, 1996 Sym. Of
VLSI Technology, paper 20.4.
[29] M. Agostinelli, S. Lau, S. Pae, P. Marzolf, H. Muthali, S. Jacobs,
“PMOS NBTI-induced circuit mismatch in advanced
technologies,” IEEE IRPS Proceedings, p. 171 (2004).
[30] M. A. Alam, B. Weir, P. Silverman, J. Bude, A. Ghetti, Y. Ma, M.
M. Brown, D. Hwang, and A. Hamad , “Physics and Prospects of
Sub-2nm Oxides,” in The Physics and Chemistry of SiO2 and the
eSi-SiO2 interface-4 H. Z. Massoud, I. J. R. Baumvol, M. Hirose,
E. H. Poindexter, Editors, PV 2000-2, p. 365, The Electrochemical
Society, Pennington, NJ (2000).
[31] D. G. Flagello, H. V. D. Laan, J. V. Schoot, I Bouchoms, and B.
Geh, “Understanding Systematic and Random CD Variations using
Predictive Modelling Techniques,” SPIE Conference in Optical
Microlithography XII, pp. 162-175, 1999.
[32] M. Ieong, B. Doris, J. Kedzierski, K. Rim, M. Yang, “Silicon
device scaling to the sub-10 nm regime,” Science, vol. 306 p.2057
(2004).
[33] A. Asenov, “Random dopant induced threshold voltage lowering
and fluctuations in sub-0.1 μm MOSFET's: A 3-D “atomistic”
simulation study” IEEE Transactions on Electron Devices
45(12):pp. 2502-2513, 1998.
[34] Simon W-B. Tam, Yojiro Matsueda, Hiroshi Maeda, Mutsumi
Kimura,” Improved Polysilicon TFT Drivers for Light Emitting
Polymer Displays,” Proc Int Disp Workshops, 2000.
[35] R. Gusmeroli, A.S. Spinelli, C. Monzio Compagnoni, D. Ielmini
and A.L. Lacaita, “Edge and percolation effects on VT window in
nanocrystal memories”, Microelectronics Engineering, doi:
10.1016 /j.mee2005.04.066.
[36] N. Pimparkar, J. Guo and M. Alam, “Performance Assessment of
Sub-percolating Network Transistors by an Analytical Model,”
IEEE Trans. of Electron Devices, 54(4), 2007.
[37] J. Choi, A. Bansal, M. Meterelliyoz, J. Murthy, and K. Roy,
“Leakage Power Dependent Temperature Estimation to Predict
Thermal Runaway in FinFET Circuits,” International Conference
on CAD, pp. 2006.
[38] S. Kumar, J. Murthy, and M. Alam, “Self-Consistent ElectroThermal Analysis of Nanotube Network Transistors” Journal of
Applied Physics, 109, 014315, 2011. Also see, S.-C. Lin and K.
Banerjee, “An Electrothermally-aware Full Chip Substrate
Temperature Gradient Evaluation Methodology for Leakage
Dominant Technologies with Implications for Power Estimation
and Hot Spot Management”, Proc. ICCAD, pp. 568-574, 2006.
[39] Robert Bauman, Tutorial Notes, International Reliability Physics
Symposium, 2004.
[40] D.F. Heidel et al. “Single Event Upsets and Multiple-Bit Upsets on
a 45 nm SOI SRAM”, IEEE Trans. Nuclear Science. 56(6), pp.
3499-3504, 2009.
[41] H. H. K. Tang, "SEMM-2: A New Generation of Single-Event-Effect Modeling Tool," IBM Journal of Research and
Development, 52(3), pp. 233-244, 2008.
[42] P. E. Dodd, F. W. Sexton, M. R. Shaneyfelt, J. R. Schwank, and D.
S. Walsh, Sandia National Laboratories “Epi, Thinned and SOI
substrates: Cure-Alls or Just Lesser Evils”, IRPS Proc. 2007.
[43] R. Lucky, in Silicon Dreams: Information, Man, and Machine,
1989.
[44] E. J. McCluskey, Chao-Wen Tseng, “Stuck-fault tests vs. actual
defects”, Proc. ITC, pp.336-342, 2000.
[45] S. Shekhar, M. Erementchouk, M. Leuenberger and S. Khondaker,
“Correlated breakdown of CNT in aligned arrays”,
arxiv.org/ftp/arxiv/papers/1101/1101.4040.pdf, 2011.
[46] J. Tschanz, J. Kao, S. Narendra, R. Nair, D. Antoniadis, A.
Chandrakasan, V. De, “Adaptive Body Bias for Reducing Impacts
of Die-to-Die and Within-Die Parameter Variations on
Microprocessor Frequency and Leakage,” International Solid State
Circuit Conference, pp. 477-479, 2002.
[47] S. Ghosh, S. Bhunia, and K. Roy, “CRISTA: A New Paradigm for
Low- Power,Variation-Tolerant, and Adaptive Circuit Synthesis
Using Critical Path Isolation”, Transactions on CAD, 2007.
[48] D. Ernst, N. Kim, et al., "Razor: A Low-Power Pipeline Based on
Circuit-Level Timing Speculation," Proc. IEEE/ACM Int.
Symposium on Microarchitecture (MICRO), Dec. 2003.
[49] T. Austin, “DIVA: A Reliable Substrate for Deep Submicron
Microarchitecture Design,” ACM/IEEE 32nd Annual Symposium
on Microarchitecture, p. 196 November 1999.
[50] K. Kang, K. Kim, and K. Roy, “Variation Resilient Low-Power
Circuit Design Methodology using On-Chip Phase Locked Loop”,
to be published in ACM/IEEE Design Automation Conference,
2007.
[51] C. Visweswariah, “Death, Taxes and Failing Chips,” ACM/IEEE
Design Automation Conference, pp. 343-347, 2003.
[52] B. Kaczer, IRPS Tutorial of Gate Dielectric Breakdown, 2009.
[53] R. Vattikonda et al. Proc. of DAC, 1047-1052, 2006.
[54] T. Krishnan, V. Reddy, S. Chakravarthi, J. Rodriguez, S. John, S.
Krishnan, “NBTI impact on transistor and circuit: models,
mechanisms and scaling effects”, Proc. of IEDM, 2003.
[55] A. Krishnan et al., NBTI Effects of SRAM Circuits, IEEE IEDM
Digest, 2006.
[56] S. Kumar et al., Proc. of Quality Electronic Design, pp. 210-218,
2006.
[57] K. Kang, H. Kufluoglu, M. A. Alam, and K. Roy, “Efficient
Transistor-Level Sizing Technique under Temporal Performance
Degradation due to NBTI,” IEEE International Conference on
Computer Design, pp. 216-221, 2006.
[58] C. P. Chen, C. C. N. Chu, and D. F. Wong, “Fast and exact
simultaneous gate and wire sizing by Lagrangian relaxation,"
IEEE Trans. on Computer-Aided Design of Integrated Circuits and
Systems, vol. 19, pp. 1014-1025, (1999).
[59] S. Kumar, C. Kim and S. Sapatnekar, “NBTI Aware Synthesis of
Digital Circuits”, Design Automation Conference, pp. 370-375, San
Diego, California, USA, June 2007
[60] K. Kang, K. Kim, A. E. Islam, M. A. Alam, and K. Roy,
“Characterization and Estimation of Circuit Reliability
Degradation under NBTI using On-line IDDQ Measurement," ACM/IEEE Design Automation Conference, 2007.
[61] K. Kang, M. A. Alam, and K. Roy, "Estimation of NBTI Degradation using On-Chip IDDQ Measurement", International Reliability Physics Symposium, pp. 10-16, 2007.
[62] C. H. Kim, K. Roy, S. Hsu, R. Krishnamurthy, and S. Borkar, "An On-Die CMOS Leakage Current Sensor For Measuring Process Variation in Sub-90nm Generations", VLSI Circuits Symposium, June 2004.
[63] T. Kim, R. Persaud, and C. H. Kim, "Silicon Odometer: An On-Chip Reliability Monitor for Measuring Frequency Degradation of Digital Circuits", IEEE Journal of Solid-State Circuits, pp. 874-880, Apr. 2008.
[64] T. Kim, R. Persaud, and C. H. Kim, "Silicon Odometer: An On-Chip Reliability Monitor for Measuring Frequency Degradation of Digital Circuits", VLSI Circuits Symposium, pp. 122-123, June 2007.
[65] E. Karl, P. Singh, D. Blaauw, and D. Sylvester, "Compact In-Situ Sensors for Monitoring Negative Bias-Temperature-Instability Effect and Oxide Degradation", IEEE International Solid State Circuits Conference, pp. 410-411, 2008.
[66] M. Agarwal, S. Mitra, B. C. Paul, and M. Zhang, "Circuit Failure Prediction and its Application to Transistor Aging," VLSI Test Symposium, 2007.
[67] S. Mitra, Proc. of Int. Rel. Phys. Symposium, 2008.
[68] A. E. Islam and M. A. Alam, "On the Possibility of Degradation-free Field-Effect Transistors", Appl. Phys. Lett. 92, 173504, 2008.
[69] M. Zhang et al., "Sequential Element Design With Built-In Soft Error Resilience," IEEE Transactions on VLSI Systems, 2006.
[70] A. Goel et al., "Low-overhead design of soft-error-tolerant scan flip-flops with enhanced-scan capability," ASP-DAC, 2006.
[71] R. Koo and S. Toueg, "Checkpointing and rollback-recovery for distributed systems," IEEE Trans. Software Engineering, 13 (1), pp. 23-31, 1987.
[72] K. Constantinides et al., "Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation", International Symposium on Microarchitecture.
[73] L. Leem et al., "ERSA: Error Resilient System Architecture for Probabilistic Applications", DATE, 2010.
[74] J. M. Soden, C. F. Hawkins, R. K. Gulati, and W. Mao, "IDDQ Testing: A Review," Journal of Electronic Testing: Theory and Applications, Vol. 3, No. 4, pp. 291-303, Dec. 1992.
[75] N. Banerjee, C. Augustine, and K. Roy, "Fault-Tolerance with Graceful Degradation in Quality: A Design Methodology and its Application to Digital Signal Processing Systems", DFT, 2008.
[76] J. Han, J. Gao, Y. Qi, P. Jonker, and J. A. B. Fortes, "Toward Hardware-Redundant, Fault-Tolerant Logic for Nanoelectronics", IEEE Design & Test, vol. 22, no. 4, pp. 328-339, July 2005.
[77] M. Z. Hasan and S. G. Ziavras, "Runtime Partial Reconfiguration
for Embedded Vector Processors", International Conference on Information Technology: New Generations, Las Vegas, Nevada, 2007.
[78] J. R. Heath, P. J. Kuekes, G. Snider, and R. S. Williams, "A Defect-Tolerant Computer Architecture: Opportunities for Nanotechnology", Science, 280, pp. 1716-1721, 1998.
[79] S. Mukhopadhyay, H. Mahmoodi, and K. Roy, "Modeling of Failure Probability and Statistical Design of SRAM Array for Yield Enhancement in Nanoscaled CMOS", IEEE TCAD, 2004.
[80] S. Mukhopadhyay, A. Raychowdhury, H. Mahmoodi, and K. Roy, "Statistical Design and Optimization of SRAM cell for yield improvement," International Conference on Computer Aided Design, pp. 10-13, 2004.
[81] S. Mukhopadhyay, K. Kang, H. Mahmoodi, and K. Roy, "Reliable and self-repairing SRAM in nano-scale technologies using leakage and delay monitoring," International Test Conference, 2005.
[82] K. Roy, J. Kulkarni, and M.-E. Hwang, "Process-Tolerant Ultralow Voltage Digital Subthreshold Design", IEEE Topical Meeting on Silicon Monolithic Integrated Circuits in RF Systems (SiRF), pp. 42-45, 2008.
[83] R. Rodriguez, J. Stathis, B. Liner, S. Kowalczyk, C. Chung, R. Joshi, G. Northrop, K. Bernstein, A. Bhavnagarwala, and S. Lombardo, "The Impact of Gate Oxide Breakdown on SRAM Stability," IEEE Electron Device Letters, 23, p. 559, 2002.
[84] Heidergott, NSREC Short Course, 1999.
[85] A. J. van de Goor, "Using March Tests to Test SRAMs," IEEE Design & Test of Computers, Vol. 10, No. 1, pp. 8-14, Mar. 1993.
[86] A. Goda, ISQED, 2005.
[87] M. DeBole, Int. J. Parallel Prog., 2009.
[88] http://www.mentor.com/products/silicon-yield
[89] http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pages/TetraMAXATPG.aspx
[90] http://www.cadence.com/products/ld/true_time_test
[91] V. D. Agrawal, "A tutorial on built-in self-test. I. Principles", IEEE Design and Test of Computers, 1993.
[92] V. D. Agrawal, "A tutorial on built-in self-test. II. Applications", IEEE Design and Test of Computers, 1993.
[93] W. T. Cheng et al., "Compactor independent direct diagnosis", Proc. Asian Test Symp., 2004.
[94] B. Vaidyanathan, A. S. Oates, and X. Yuan, "Intrinsic NBTI-variability aware statistical pipeline performance assessment and tuning", ICCAD, pp. 164-171, 2009.
[95] R. Rajsuman, Iddq Testing for CMOS VLSI, Artech House, 1995.
[96] A. Agarwal, B. C. Paul, H. Mahmoodi, A. Datta, and K. Roy, "A process-tolerant cache architecture for improved yield in nanoscale technologies," IEEE Transactions on VLSI, vol. 13, no. 1, pp. 27-38, 2005.
[97] T. Calin et al., IEEE Trans. Nucl. Sci., 43, p. 2874, 1996.