Reliability- and Process-Variation Aware Design of Integrated Circuits – A Broader Perspective

Muhammad A. Alam, Kaushik Roy, and Charles Augustine
Department of ECE, Purdue University, West Lafayette, IN 47907, USA
phone: 765-494-5988, fax: 765-494-6441, e-mail: alam@purdue.edu

Abstract— A broad review of the literature on reliability- and process-variation aware VLSI design shows a re-emergence of the topic as a core area of active research. Designing reliable circuits with unreliable components has been a challenge since the early days of electro-mechanical switches and has been addressed by elegant coding and redundancy techniques, while radiation-hard design principles have been used extensively for systems affected by soft transient errors. Additional modern reliability concerns associated with the parametric degradation of NBTI and soft-broken gate dielectrics, together with the proliferation of memory and thin-film technologies, add new dimensions to reliability-aware design. Taken together, these device-, circuit-, architecture-, and software-based fault-tolerant approaches have enabled continued scaling of integrated circuits and are likely to be a part of any reliability qualification protocol for future technology generations.

Keywords— positive reliability physics, circuit design, variability, modeling, lifetime projection

I. INTRODUCTION

Most introductory courses on VLSI design presume interchangeability and uniformity of components whose properties remain invariant with time, and posit that the fundamental challenge of IC design is the trade-off among area, performance, testability, and power dissipation of a circuit. The key feature of such deterministic optimization is that it involves analysis of a static and uniform network of transistor and interconnect nodes in response to a set of time-dependent test-patterns. Since in practice no two transistors are quite alike even within a single die due to process variability [1], nor do they retain the same characteristics over time as defects accumulate within the transistors in response to nodal activity and input patterns [2][3][4], the idealized design principles are hardly defensible in practice. Traditional VLSI design bypasses the analysis and optimization of such a nonuniform, dynamic network by approximating the problem as the optimization of a uniform static network with a certain guard-band. This allows the optimized static circuit to continue functioning even if the devices have randomly distributed parameters (within some margin), or become faster or slower over time. Such worst-case, guard-band limited VLSI design is widely used in microelectronic integrated circuits in modern CPUs, as well as in macroelectronic technologies such as Liquid Crystal Displays (LCD). Despite its routine use, the methodology is inefficient and involves a considerable penalty in the area-performance-power-reliability budget. After all, process-induced variation [5], time-dependent degradation, and radiation-induced soft and hard errors make it difficult to design integrated circuits within conservative guard-bands without unacceptable compromise of the power/performance targets of the circuit design.
Given these constraints, the following question frames the discussion of this paper: "How does one design and optimize, for power/performance/area metrics, various VLSI and macroelectronic circuits with components that are occasionally faulty, but more generally have statistically distributed parameters that evolve either abruptly or gradually over time?" There is of course no global optimization principle that addresses all such problems, but the design and device communities have collaborated over the years to find a series of device/circuit/architecture/software solutions to reliability concerns in specific contexts. We will summarize some of the key developments from the perspective of a reliability engineer.

Although process and reliability considerations across the electronics industry are multi-faceted and constitute a vast topic, in general they can be classified into four groups (see Fig. 1). The first group involves transistor-to-transistor, die-to-die, and wafer-to-wafer process-induced variation that introduces parametric differences between two otherwise nominally identical and functional transistors. The second group involves irreversible parametric degradation arising from the loss of passivated surfaces (e.g., broken Si-H bonds for NBTI/HCI damage in microelectronic and macroelectronic applications [6], loss of Si-H bonds and increase in dark current in amorphous-Si based solar cells [138], salt penetration through oxides in biosensors [77], etc.). The third group involves transient errors (e.g., soft errors due to radiation [8][9], transient charge loss in Flash and ZRAM memories [10], etc.). Finally, the fourth group involves permanent damage arising from pre-existing or newly generated bulk defects (e.g., broken Si-O bonds) in SiO2 that lead to gate dielectric breakdown in logic transistors [11][13], anomalous charge loss in Flash transistors [14], loss of resistance ratio in MRAM cells, radiation-induced permanent damage in SRAM cells [25], open interconnects due to imperfect processing or electromigration, etc.

Figure 1: ICs might experience different types of faults during the fabrication and/or operation of the system, and these faults must be managed at the hardware or software levels. In this article, we classify the faults into four categories depending on how they affect system performance and the classes of solution strategies used to address them.

In each section of the following discussion, we will discuss the definition, detection, and solution strategies associated with these four classes of defects. We conclude with a brief discussion regarding the status, challenges, and opportunities of CAD tools in implementing process- and reliability-aware design methodologies, and provide an outlook for the emerging research directions for VLSI design.

II. PHYSICAL ORIGIN OF VARIABILITY

In this section, we consider the general features of the four defect classes associated with process- and reliability-aware design.

A. Origin/Measure of 'Time-Zero' Variation

Modern VLSI designers are intimately aware of process-related parametric fluctuation arising from (i) random-dopant fluctuation [26][27][28][29], (ii) fluctuation of oxide thickness [30], and (iii) statistically distributed channel lengths due to line-edge roughness, etc.
This randomness reflects the local, submicron-scale processing history of individual transistors and translates into threshold voltage variation, fluctuation in gate leakage, and variability of series resistance. Regardless of the type of transistor (e.g., FINFET, ultra-thin body SOI, double-gate transistors, etc. [32]), continued technology scaling guarantees the susceptibility of these transistors to the processing conditions. Other types of electronic components are similarly affected. The fluctuation in the number of nanocrystals in NC-Flash memories translates to a distribution of threshold voltage and retention time, the number of grains within a TFT channel dictates its ON current [34], and, as Nair et al. [7] have shown, the dopant-induced statistical fluctuation of drain current is so significant that absolute biosensing based on nanowire biosensors is actually impossible. In short, all areas of modern electronics (e.g., logic/memory in microelectronics, macroelectronics, and bioelectronics) are influenced by the randomness of design parameters.

Figure 2: Type 1 faults are associated with the distribution of initial parameters due to process variation (blue). When coupled with Type 2 time-dependent parametric degradation (red), the total guard-band necessary to design an IC can become unacceptably large.

To design an IC with transistors with random parameters, the randomness is first captured at the lowest level of abstraction (e.g., the process level), and its effects on quantities like threshold voltage and leakage current can be determined either numerically or by using percolation theory and stochastic geometry. These effective electrical parameters are then propagated to higher levels of abstraction (e.g., the circuit/system level) using numerical or analytical approaches [35][36]. Although widely used, the Monte Carlo based numerical determination of VT has the disadvantage that the limited sample size makes the calculation of the tails of the distribution difficult and probably inaccurate.

Finally, even for nominally identical transistors, the local operating environment (nodal activity, passivation, etc.) or the R-C drop due to the statistically distributed length of interconnects results in local inhomogeneity in operating temperature, and ultimately in differences in transistor parameters. Fortunately, given the nodal activity and operating conditions, this aspect of the problem may be predictably defined by a self-consistent solution of the electro-thermal design problem [37][38].

B. 'Time-Dependent' Variation: Physical Models of Reliability

The second set of variability concerns involves permanent degradation of individual transistors, i.e., characteristics that evolve with time as a function of transistor activity [2][3][11][12][53][56]. Unlike the 'time-zero' variation discussed above, this variation arises even for two identically processed transistors with the same initial characteristics, and the shift in parametric values is generally permanent: it cannot be restored to the pristine value by turning the IC off (see Fig. 2).

Several degradation mechanisms, such as negative bias temperature instability (NBTI), positive bias temperature instability (PBTI), hot carrier degradation (HCI), and (soft) dielectric breakdown (S-TDDB), have been extensively characterized by many groups over the years. Moreover, the implications of these degradation mechanisms for circuit and system performance have also been explored extensively.
The key difference between degradation models for reliability-aware design and the standard models used for qualification is the emphasis on the time, frequency, and duty-cycle dependencies of degradation, in addition to the traditional focus on voltage acceleration, temperature acceleration, and the statistical distribution of failure times. This generalization of the reliability models reflects the fact that it is too conservative to qualify a technology at a given operating voltage and temperature and then insist that all IC designs based on the technology reside within that boundary. Instead, quantifying the time dependence of degradation allows trade-offs among voltage, temperature, and time based on the local nodal activity and the specific usage conditions of the IC.

Consider the case of NBTI degradation associated with PMOS transistors biased in inversion. The degradation is presumed to involve dissociation of Si-H bonds at the interface, followed either by repassivation of the broken bonds by the newly created free hydrogen or by diffusion of the hydrogen away from the interface following dimerization [2]. (An introductory lecture, a set of lecture notes, and a numerical simulation tool (DevRel) to calculate degradation characteristics are available at www.nanohub.org [15][16][17].) Based on decades of research by many groups across the industry (see Fig. 3), the degradation of the threshold voltage can be succinctly described by [18]

\Delta V_T(t) \propto \left( \frac{k_F(E_{ox},T)\,N_0}{k_R(T)} \right)^{2/3} \left[ D_H(T)\,t \right]^{n},

where t is the integrated stress time seen by a transistor, n ≈ 1/6 is the time exponent, E_{ox} is the surface field, N_0 is the number of Si-H bonds at the interface, T is the temperature, and D_H is the hydrogen diffusion coefficient; the forward-dissociation rate k_F ∝ exp(γE_{ox})exp(−E_F/kT) involves the polarization constant γ, and E_D, E_F, and E_R are the activation energies for diffusion, forward dissociation, and reverse annealing of the Si-H bonds, respectively. The net degradation is frequency independent [19] and equals the degradation at 50% duty cycle [20]. Remarkably, PBTI degradation associated with NMOS transistors biased in inversion is also described by a similar formula, except that the physical meaning of the individual terms is different and the magnitude of degradation is somewhat reduced. NBTI is also a key reliability concern for solar cells and TFTs based on a-Si or poly-crystalline Si and is described by an analogous formula [21]. Corresponding formulas exist for hot-carrier degradation based on the bond-dissociation model [22][23][24].

The soft-breakdown characteristics after dielectric breakdown can be characterized by [11][52]

\Delta I_G(t) \approx \Delta I_{BD}\left[\gamma(V,T)\,t\right]^{\beta},

where \Delta I_{BD} is the jump in gate leakage for each breakdown event, \gamma(V,T) is the voltage- and temperature-dependent acceleration factor, and \beta is the Weibull slope. Similar formulas exist for backend degradation arising from electromigration, stress migration, etc. Finally, while radiation damage has primarily been associated with transient faults (soft errors) or gate punch-through (hard errors), close analysis of radiation detectors shows that the steady generation of permanent traps and the type inversion of donor levels lead to a time-dependent increase in excess leakage and depletion width. Such parametric degradation has been a serious concern for the viability of radiation detectors.

Once these degradation characteristics are characterized, their effect on inverting logic [3], SRAM cells [56][12], pipelined microprocessors [94], etc. can be readily analyzed, and the susceptibility to failure due to a single degradation mode or a combination thereof can be predicted. Based on these predictive characterizations, we will explore various solution methodologies later in the paper.
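To make the use of such a compact model concrete, the following minimal sketch evaluates a power-law NBTI shift of the form ΔV_T(t) = A·exp(−E_A/kT)·t^(1/6). The prefactor, activation energy, and temperature used below are illustrative assumptions for a generic technology, not fitted parameters from [18].

```python
import math

# Illustrative NBTI power-law evaluation: dVt(t) = A * exp(-Ea/kT) * t^n
# All coefficient values below are assumptions for illustration only.
KB = 8.617e-5          # Boltzmann constant (eV/K)
N_EXP = 1.0 / 6.0      # time exponent from the R-D model
EA = 0.1               # assumed effective activation energy (eV)
A0 = 3.0e-2            # assumed prefactor (V / s^n) at the stress field used

def delta_vt(t_sec, temp_k=378.0):
    """Threshold-voltage shift (V) after t_sec seconds of DC stress."""
    return A0 * math.exp(-EA / (KB * temp_k)) * t_sec ** N_EXP

ten_years = 10 * 365 * 24 * 3600.0
for label, t in [("1 day", 86400.0), ("1 year", ten_years / 10), ("10 years", ten_years)]:
    print(f"{label:>8}: dVt = {1e3 * delta_vt(t):5.1f} mV")
```

With the assumed coefficients, the shift grows from roughly 10 mV after a day of stress to a few tens of mV at ten years, illustrating why the slow t^(1/6) power law still matters over a product lifetime.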
Figure 3. Time evolution of degradation at very long stress times (stress time > 1000 hr, several technologies and voltages) from various published results (Freescale IRPS'07 [19], IMEC, ST, TUV, Infineon, and ours), showing a power-law dependence with the universally observed exponent of n = 1/6; also, AC NBTI degradation normalized to 50% AC versus pulse duty cycle and frequency from various published results. R-D solutions (red lines) are shown [2]. Taken from S. Mahapatra et al., Proc. IRPS, 2011.

C. Physical Origin of Transient Faults and Soft Errors

Many researchers regard the mysterious charge loss of electrometers in the 1800s as the first reported occurrence of radiation damage in electronic components, and the US Navy also reported radiation-related failures of electronic components during atomic bomb tests in WWII. Radiation-induced transient faults gained increasing prominence in the 1960s and 1970s as transistors were miniaturized and transient effects associated with radiation strikes and latch-up [9] caused serious reliability concerns.

Transient soft errors are run-time computational errors that do not lead to permanent degradation of parameters, nor are they related to process variation, but the random nature of the transient upset makes detection and correction a challenging problem. For example, tests of an SGI Altix 4700 computer (32 blades, 35 GB SRAM) show approximately 4-5 SRAM upsets every week – errors that must be corrected for the proper operation of the machine. Sources of radiation include alpha particles and high-energy neutrons in solar winds and cosmic particles, as well as low-energy neutrons from packaging materials [39], making the design of satellites and spacecraft particularly challenging. Although single-event effects (SEE) are generally associated with transient errors, i.e., single-bit upsets (SEU) or multi-bit upsets (MBU), permanent faults related to single-event gate rupture or burnout (SEGR/SEB) are also possible and are discussed in the next section [39][40].

Modeling/characterization of soft errors involves three steps [41]: (i) understanding the radiation environment of the electronics, which may involve the interaction of the atmosphere with radiation fluxes, the physics of nuclear fission, etc.; (ii) given the relative fluxes of particles, the probability that they will interact with a given active volume of a transistor; and, finally, (iii) calculation of the critical charge necessary to upset a particular type of device. These effects are modeled at various levels of sophistication, depending on the degree of precision necessary for a particular application, and are compared extensively with measurements. For example, simulations show and experiments confirm that the SER increases exponentially with reduction in supply voltage – increasing by a factor of 10-50 as the supply voltage is scaled from 5 V to 1 V. In the worst-case scenario, a strike can change the state of multiple nodes (MBU), making it difficult to recover the lost data. In memories, techniques exist to detect and correct one error in every column/row, but in the case of multiple errors, present-day techniques are not sufficient and the result can be complete system failure [8]. Thus, it is important to explore techniques that can detect (and if possible correct) the errors, without allowing the errors to propagate.
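The exponential sensitivity to supply voltage quoted above can be illustrated with one widely used empirical soft-error model (the Hazucha–Svensson form), in which the SER falls off exponentially with the critical charge Q_crit ≈ C_node·V_DD. This model and all numerical values below are illustrative assumptions, not results from the simulations cited in this paper.

```python
import math

# Empirical soft-error model (Hazucha-Svensson form):
#   SER ~ flux * area * K * exp(-Qcrit / Qs),  with Qcrit ~ Cnode * Vdd.
# All numerical values below are illustrative assumptions.
FLUX_AREA_K = 1.0e5     # lumped flux * sensitive-area * technology constant (arbitrary units)
QS_FC = 2.0             # assumed charge-collection slope Qs (fC)
CNODE_FF = 1.5          # assumed effective node capacitance (fF)

def ser(vdd_volts):
    """Relative soft-error rate at a given supply voltage (arbitrary units)."""
    qcrit_fc = CNODE_FF * vdd_volts    # fF * V = fC
    return FLUX_AREA_K * math.exp(-qcrit_fc / QS_FC)

print("SER(1 V) / SER(5 V) =", ser(1.0) / ser(5.0))
```

With these assumed values the model gives roughly a 20x increase in SER between 5 V and 1 V operation, in line with the 10-50x range quoted above.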
In general, the transition to SOI technologies reduces the amount of charge that is generated within the bulk of the body. Since this increases the LET threshold for upset, the SER is reduced by a factor of 2-3 [42]. However, trapping and other hysteresis effects could be a concern for SOI transistors. Finally, memories like Flash and ZRAM are also susceptible to soft errors [10], since an energetic particle may eject the stored charge and change the memory state permanently. We will briefly summarize the well-known detection and solution strategies later in the paper.

D. Origin of Permanent Faults

Permanent faults have been a reliability concern since the early days of the electronics industry, which was based on electromechanical switches and optically read punch-cards [43]. The unreliability of mechanical switches in the 1940s and 1950s was so severe that great scientists like von Neumann, Shannon, and Hamming spent considerable time working on the fundamental principles of designing reliable systems with unreliable components. The resulting techniques involve the judicious combination of redundancy, stability analysis, and coding. To this day, when computer scientists discuss the reliability of components, the language and terminology ('stuck-at-zero' fault, meaning a relay that cannot be closed, or 'stuck-at-one' fault, meaning a relay permanently connected to the output) can be traced back to the early days of computer engineering.

Figure 4. (Top) Type 4 faults (open defects, short defects, resistive bridges) present in scaled devices. (Bottom left) Defective NMOS in an inverter and (bottom right) the corresponding (less than perfect) voltage transfer characteristic (VTC).

Modern semiconductor processing has vastly improved process reliability. While it is still possible to have physical defects such as shorts, opens, and resistive bridges (see Fig. 4) that can be modeled as 'stuck-at-zero', 'stuck-at-one', or stuck-open faults [44], these challenges are fortunately likely to be resolved by process debugging or detected at test time using standard test-generation techniques. However, once the circuit has been used for a prolonged period of time, 'stuck-at-zero' type faults may occur through a hard dielectric breakdown at the gate of an NMOS (Fig. 4). Moreover, given the reduced oxide thickness, there is an increasing likelihood of punch-through of the dielectric film (SEGR/SEB) by energetic particles, leading to permanently damaged transistors. These types of hard errors also occur in transistors based on CNTs, poly-silicon NWs, etc., since excessive heating may lead to burning of the channel region [45].

The classical theories of fault tolerance can therefore be useful in these contexts, with the implicit assumption that a certain percentage of devices may fail randomly during IC operation. However, since all such redundancy techniques incur extra area/power penalties, ensuring high performance with high reliability requires intelligent trade-offs and thoughtful design at the architecture, circuit, and transistor levels. We will discuss the detection and solution strategies for such failures in the next section.

III. CIRCUIT SOLUTION STRATEGIES FOR LOGIC CIRCUITS

A. Process Variation and Solution Strategies

Process variation and reliability considerations pose special design challenges for logic circuits.
Historically, circuit designers first became aware of the challenges of design under variability when they had to reckon with the widening distribution of leakage due to random dopant fluctuation. In response, device physicists proposed channel dopant-free transistor designs [26]. At the circuit design level, deterministic 'global' techniques like back-gate biasing and adaptive VDD have been proposed to tighten the VT and leakage-current distributions [46]. These approaches are particularly effective for systematic process fluctuations arising from line-edge roughness, oxide thickness, etc.

On the other hand, CRISTA [47] is a circuit/architecture co-design technique for variation-tolerant digital system design, which allows for aggressive voltage over-scaling. The CRISTA design principle (a) isolates and predicts the paths that may become critical under process variations, (b) ensures that they are activated rarely, and (c) avoids possible delay failures in the critical paths by adaptively stretching the clock period to two cycles. This allows the circuit to operate at reduced supply voltage while achieving the required yield with a small throughput penalty (due to rare two-cycle operations). A schematic of the approach is shown in Fig. 5.

A recently introduced approach called 'RAZOR' [48] uses a shadow latch to detect transient failures in pipelined logic that cause an output to be incorrectly latched. Upon the detection of an error, the instruction is re-executed by flushing the pipeline and using the correct data from the shadow latch (Fig. 5). Razor is an adaptive tuning technique for correct operation under parametric process variation. At the system level, design techniques like the robust core checker [49], leakage-based PVT sensors [Kim04], or on-chip PLL-based sensors [50] have also been explored to generate self-diagnostic signals to dynamically control adaptive body bias for inter-die variation. An excellent review of process-tolerant design techniques is given in [1].

Figure 5. (Top) 32-bit full adder designed using the CRISTA design methodology (with a pre-decoder for critical-path prediction and one-cycle/two-cycle operation). (Bottom) RAZOR flip-flop with shadow latch and comparator (CLK_D is the delayed CLK).

B. Variation Due to Parametric Degradation

Parametric time-dependent degradation associated with NBTI, HCI, TDDB, and EM poses challenging problems – made even more complicated by the additional contributions of 'time-zero' process variation. Given voltage-, temperature-, and time-dependent analytical models, however, statistical design methodologies [51] offer promising solutions in such a scenario. We offer a brief review of the approaches related to NBTI to illustrate the principles of statistical VLSI design in the presence of variability; there has previously been extensive work on gate-dielectric breakdown [52], and many have suggested that PBTI and HCI can be analyzed by similar approaches.

To appreciate the role of such parametric degradation in logic circuits, we used the 70 nm Predictive Technology Model (PTM) and an NBTI-aware static timing analysis (STA) tool [3] to estimate the temporal degradation of circuit delay under NBTI. Both simple logic gates (e.g.,
NAND and NOR gates) as well as various ISCAS benchmark circuits involving a few hundred to a few thousand transistors were simulated to estimate the degradation. In general, the results indicate that circuit delays can degrade by more than 10% within 10 years of operation [3]. Such a large change in parameters from just one of the degradation mechanisms is a genuine cause for concern. The techniques proposed to address this degradation are discussed next.

Pre-Silicon Solution Strategies: The solution strategies discussed in the literature roughly fall into two categories: deterministic and statistical. Various global deterministic techniques like dynamic VDD scaling, duty-cycle scaling, etc. have been discussed in the literature [53]. 'VDD scaling' may not always be optimal because, in addition to increasing reliability, it also increases delay [54][55]. The proposal of 'reduced duty-cycle' [56] is misleading, because in inverting logic a reduced duty-cycle at one stage translates to an enhanced duty-cycle at the following stage [57]. The statistical technique involves Lagrange-multiplier (LM) based sizing given the constraints of area, gate delay, and product lifetime. Both cell-based sizing [3] and transistor-based sizing [57] can be effective. Since the threshold voltage of a gate depends on its switching activity, which in turn depends on the sizes of the other transistors, the sizing algorithm is by definition self-consistent [58]. The results show that cell sizing (where NMOS and PMOS are scaled by the same factor) [3] can make several ISCAS benchmark circuits NBTI-tolerant with only modest area overhead (~9%). Even better optimization is desirable and possible if only the PMOS is scaled (but not the NMOS), thereby providing – in the best case and as shown in Fig. 6 – a factor of two improvement in area penalty [57], so that a wide variety of ISCAS benchmark circuits at the 70 nm node can be designed for NBTI tolerance with only a 5% area penalty (and no delay penalty). One limitation of the specific work discussed above is that the model does not include the effect of NBTI relaxation – a key feature of NBTI characteristics. More recent work [59] addresses the relaxation issue specifically and arrives at improved (but comparable) estimates of the various trade-offs.

Figure 6. PMOS-only LR optimization of the ISCAS benchmark circuit c1908 based on 70 nm PTM offers approximately 45% saving in area overhead (11.7% for cell-based optimization, 6.13% for PMOS-only optimization).
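As a rough sanity check on the delay numbers quoted above, the sketch below estimates the fractional gate-delay increase caused by a threshold-voltage shift using an alpha-power-law delay model, t_d ∝ V_DD/(V_DD − V_T)^α. The supply, threshold, α, and ΔV_T values are assumptions for illustration and are not the parameters of the PTM-based STA study in [3].

```python
# Fractional gate-delay increase from an NBTI threshold shift, using the
# alpha-power-law delay model  t_d ~ Vdd / (Vdd - Vt)^alpha.
# All parameter values are illustrative assumptions.
VDD = 1.0       # supply voltage (V), assumed
VT0 = 0.25      # fresh threshold voltage (V), assumed
ALPHA = 1.3     # velocity-saturation index, assumed

def delay(vt):
    return VDD / (VDD - vt) ** ALPHA

for dvt_mv in (20, 40, 60):
    ratio = delay(VT0 + dvt_mv / 1000.0) / delay(VT0)
    print(f"dVt = {dvt_mv:2d} mV  ->  delay increase = {100 * (ratio - 1):4.1f} %")
```

With these assumed values, a 40-60 mV shift corresponds to roughly a 7-12% delay increase, consistent in magnitude with the >10% ten-year degradation reported for the benchmark circuits.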
Post-Silicon 'Silicon Odometers': Since NBTI degradation is specific to the local operating environment, all ICs do not degrade equally. An operating-condition-specific indicator of NBTI degradation would therefore be of great value. A new class of techniques, broadly known as 'Silicon Odometers', has recently been proposed to allow a simple and direct estimate of the actual usage of an IC. The first approach suggests the use of the quiescent leakage current (IDDQ) to establish on-the-fly the time-dependent degradation due to NBTI, see Fig. 7 [60][61]. Over the years, IDDQ has been used as a process-debugging tool for time-zero short- or open-circuit issues [95]. Since IDDQ is also affected by the VT shift, it can also be an effective and efficient monitor of application- and usage-specific IC degradation as a function of time. The second approach is based on differential signals from built-in modules or newly designed structures. For example, Refs. [63][64] compare the phase difference between two ring oscillators (DLL) – one stressed in actual operation and the other not – to estimate the actual usage history of the IC. Other groups [50] have used the variation in a PLL signal to arrive at the same signatures. Finally, Ref. [65] uses a PMOS transistor biased in the subthreshold region as an NBTI sensor for the actual usage of the IC, and this degradation is then mapped back to frequency through a 15-stage NAND-gate ring oscillator. In sum, a wide variety of techniques are now being used to measure – in situ – the actual usage of the IC. Other types of locally embedded sensors have also been used to dynamically monitor the degradation [66][67]. This enables the circuit to be run at its maximum possible frequency, while preventing any failure due to degradation. These indicators – be it IDDQ or DLL phase shifts – would then allow one to use adaptive design to extend the lifetime of the ICs, or to indicate incipient failure and the need for preventive maintenance, providing a powerful post-Si design approach for future ICs.

Figure 7: The change in IDDQ (red triangles) is measured by temporarily flipping the states of the transistors (Vin = 0) so that leakage from the degraded PMOS is reflected in IDDQ. Note that the power-law exponents of individual devices and of the IC (here) are identical.

The effectiveness of the IDDQ methodology has been demonstrated on ISCAS benchmark circuits at the 70 nm PTM node – establishing a perfect correlation between NBTI degradation and the decrease in IDDQ. The technique has also been verified experimentally, both for NBTI and TDDB, to show the ease of implementation of the methodology. Although recent results show encouraging trends, the area and power overhead of the sensors, as well as the algorithmic complexity associated with the various corrective techniques, are limitations that need careful consideration.

Solution Based on Degradation-Free Transistors: Classical MOSFETs are undergoing significant changes: the channels have been strained by Si/Ge source and drain, and high-k/metal gates have been introduced in the gate stack. These material changes have led to an intriguing suggestion of the possibility of NBTI degradation-free transistor operation for high-performance logic circuits [68]. Briefly, the ON-current of a transistor is given by I_D ~ \mu (V_G - V_T) (\mu is the mobility, V_T the threshold voltage), so that \Delta I_D / I_{D,0} = \Delta\mu/\mu_0 - \Delta V_T/(V_G - V_{T,0}), where the subscript '0' indicates pre-stress parameters. When V_T degrades (increases) with the generation of bulk and interface traps, the effective field at the Si/SiO2 interface is reduced. Since the mobility-field characteristic has a negative slope, this translates to a positive \Delta\mu. Taken together, the self-compensating effects of \Delta\mu and \Delta V_T reduce the drain-current degradation and, for some bias and strain values, can make the transistor degradation-free. A combination of circuit techniques as well as device engineering may thus keep NBTI degradation within acceptable limits. There may be similar opportunities for PBTI- and HCI-hardened design principles.
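The self-compensation argument above can be illustrated numerically by comparing the two terms of ΔI_D/I_{D,0}. In the toy sketch below, the mobility gain is simply parameterized as a fraction s of the overdrive loss; the parameter s and all bias values are illustrative assumptions, not the detailed mobility-field-strain analysis of [68].

```python
# Self-compensation of NBTI drain-current degradation:
#   dId/Id0 = dMu/Mu0 - dVt/(Vg - Vt0)
# Toy model: the mobility gain is approximated as dMu/Mu0 ~ s * dVt/(Vg - Vt0),
# where s is an assumed mobility-field sensitivity (0 <= s <= 1).
VG = 1.0        # gate voltage (V), assumed
VT0 = 0.3       # fresh threshold voltage (V), assumed
DVT = 0.04      # NBTI-induced threshold shift (V), assumed

def did_over_id(sensitivity):
    overdrive_term = DVT / (VG - VT0)
    mobility_term = sensitivity * overdrive_term   # assumed linearized compensation
    return mobility_term - overdrive_term

for s in (0.0, 0.5, 1.0):
    print(f"s = {s:3.1f}:  dId/Id0 = {100 * did_over_id(s):5.2f} %")
```

In this toy picture, s = 1 corresponds to complete cancellation, i.e., the degradation-free condition; real devices require the full mobility-field and strain dependence discussed in [68].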
C. Detection and Solution Techniques for Transient Failures

Similar to the degradation-free transistors for NBTI, we have already discussed the intrinsic improvement in the soft-error robustness of SOI transistors, which have a smaller collection volume and therefore a higher LET threshold for upset. In addition, there are many circuit/architectural/software detection and correction algorithms for soft errors in logic circuits, as discussed below.

At the circuit level, flip-flops with built-in soft-error tolerance have been proposed [69]. The flip-flop implementation consists of two parallel flip-flops with four latch elements and a C-element for comparing the outputs of the two flip-flops. If one of the four latch elements is subjected to a single-event upset (SEU), the outputs of the two flip-flops differ and the error does not propagate to the output through the C-element. In the event of more than one error among the four latch elements, the output can go wrong; however, the probability of such an event is very low. Another topology for a soft-error tolerant flip-flop is proposed in [70]. Along with immunity to soft errors, the proposed flip-flop offers enhanced scan capability for delay-fault testing. Finally, in the 'checkpoint and roll back' technique [71], instructions stop executing once a fault has been detected, and the state of the system is restored from the point of error introduction (i.e., rolled back). This technique is associated with the micro-architectural Access-Control-Extension (ACE) instruction set [72] for detecting the faults. Although the approach can stall the processor for a few cycles, it results in correct execution of instructions without the need for a complete flush of the pipeline.

The above-mentioned approaches for soft-error tolerance are applicable to general-purpose computing systems. On the other hand, for probabilistic applications such as Recognition, Mining, and Synthesis (RMS), the Error Resilient System Architecture (ERSA) is proposed in [73]. ERSA combines error-resilient algorithms and software optimization with asymmetric reliability cores (some cores are more reliable than others due to parameter variations, say). Results show that the architecture is resilient to high error rates on the order of 20,000 errors/cycle per core.

D. Detection and Solution Techniques for Hard Faults

In the gate-level 'stuck-at' fault model, each circuit node can be at a stuck-at-zero (1/0: good value is ONE and faulty value is ZERO), stuck-at-one (0/1), 0/0, 1/1, or x/x value. Note that this model refers to the manifestation of defects at the electrical level and is not a physical defect model. For detecting a stuck-at-one fault (say), the node is activated to have a logic value of 0 using an input vector such that the effect of the fault is propagated to a primary output.

Figure 8. Transistor-level redundancy techniques for Type 4 short and open faults, annotated with the equivalent open/short fault probabilities (Po: probability of an open fault, Ps: probability of a short fault) of the series and parallel configurations.

Figure 9. (Top left) Cascaded TMR and the probability of correct system operation. (Top right) Triplicated Interwoven Redundancy (TIR) and the probability of correct system operation. (Bottom) Comparison of TMR and TIR success probability versus the failure probability of each component.
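The equivalent fault probabilities annotated in Fig. 8 and the majority-voting behavior of Fig. 9 can be reproduced with a short calculation. The sketch below assumes independent open/short fault probabilities Po and Ps per transistor and an ideal (failure-free) voter; the numerical values are illustrative.

```python
# Equivalent fault probabilities for two-transistor redundancy (Fig. 8) and
# success probability of triple modular redundancy with an ideal voter (Fig. 9).
# Po, Ps: independent probabilities of open and short faults per transistor.

def parallel_pair(po, ps):
    """Two transistors in parallel: open only if both open; short if either shorts."""
    po_eq = po ** 2
    ps_eq = 2 * ps - ps ** 2
    return (1 - po_eq) * (1 - ps_eq)          # probability the pair still works

def series_pair(po, ps):
    """Two transistors in series: short only if both short; open if either opens."""
    po_eq = 2 * po - po ** 2
    ps_eq = ps ** 2
    return (1 - po_eq) * (1 - ps_eq)

def tmr(p_module):
    """TMR with an ideal majority voter: works if at least 2 of 3 modules work."""
    q = 1 - p_module
    return p_module ** 3 + 3 * p_module ** 2 * q

po = ps = 1e-3
print("parallel pair:", parallel_pair(po, ps))
print("series pair:  ", series_pair(po, ps))
print("TMR (module p = 0.999):", tmr(0.999))
```

For a given area overhead, interleaving the redundancy at a finer granularity (TIR) distributes it more effectively than module-level TMR, as discussed below.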
Although we have already discussed IDDQ (quiescent current) tests to determine NBTI and HCI degradation, the IDDQ test was originally developed to detect hard faults (such as shorts between two nodes, pseudo-stuck-at faults, stuck-on faults, etc. [74]). The assumption that, in CMOS circuits, an activated fault such as a bridging defect will lead to a large increase in quiescent current worked well in early technology generations for detecting manufacturing defects. However, with the increased leakage and leakage spread (due to parameter variations) in scaled technologies, it is becoming increasingly difficult to use IDDQ to test for manufacturing defects: the sensitivity of IDDQ tests under increased leakage and parameter variations has decreased considerably.

Pre-Silicon Solution Strategies: Once the faults have been identified, their effects can be mitigated by various types of redundancy. At the transistor level, one can deploy various series and parallel transistor redundancies depending on the prevailing types of defects (as shown in Fig. 8). Redundancies can be distributed at random throughout the circuit. The introduction of transistor-level redundancy may decrease the failure probability of a finite impulse response (FIR) filter by almost 20% [75]. At the system level, triple modular redundancy (TMR) and Triplicated Interwoven Redundancy (TIR) are popular options to address both transient and permanent faults [76]. The final result of the computation in TMR and TIR is decided by majority voting over three identical sets of hardware (see Fig. 9). For a given area overhead, however, TIR seems to provide somewhat improved robustness compared to TMR. Note that in this analysis the voting circuit is assumed to be failure-free.

The main challenge in dealing with hard failures is that many of the classical techniques proposed above to address soft degradation are not suitable for such catastrophic failures: global solutions like adaptive body bias are not helpful, and redundancy could help, but since the BIST may have to monitor cell failures at an exceptionally high rate, the approach may not be practical. Various other versions of reconfigurable logic based on FPGAs are being investigated and offer some promise [77][78], but more research is needed for practical, cost-effective solutions. Therefore, the challenge of circuit design under random permanent faults offers new avenues for research in this field.

IV. DETECTION AND SOLUTION TECHNIQUES FOR SRAM

The relative footprint of memory in microprocessors is expected to reach 50% by 2008 and 90% by 2011. As such, robust SRAM design under process variation and parameter degradation has been a persistent challenge for many years [27][79].

A. Process Variation and Solution Strategies

Four functional issues define SRAM yield: ACCESS failure (AF) defines an unacceptable increase in cell access time, READ failure (RF) defines the flipping of the cell while reading, WRITE failure (WF) defines the inability to write to a cell, and HOLD failure (HF) defines the flipping of the cell state in standby, especially in low-VDD mode [80]. Only those memory blocks that have none of these failures in any of their cells are expected to remain functional. Systematic inter-die variation and random intra-die variation limit the yield of SRAM cells – as expected. Of the many proposals discussed in the literature to address this issue, we will discuss two representative cases.
Similar to logic transistors, the first approach involves using adaptive body bias to tighten the distributions of VT and leakage current. The signal may be self-generated by monitoring the on-chip leakage. It has been shown both theoretically and experimentally that such adaptive body bias can increase SRAM yield significantly [81]. Once again, the method is effective for inter-die variation, but unsuitable for fine-grained intra-die variations. The second approach consists of two circuit-level techniques: one involving more complex SRAM cell designs, especially for low-voltage applications [82], and the other involving redundant columns to replace faulty cells. In the 'redundant column' approach, a Built-in Self Test (BIST) circuit periodically (e.g., every time the operating condition is changed) checks for faulty bits and subsequently resizes the memory to prevent access to the faulty bits [96]. It has been shown that such a design can be transparent to the processor, with negligible effect on access time or area/energy overhead, and can still improve yield by as much as 50%.

B. Reliability Aware SRAM Design

Similar to the design of logic circuits, SRAM design in the simultaneous presence of process variation and parameter degradation is a significantly more challenging problem. For example, regardless of whether an SRAM cell stores a '0' or a '1', if the cell is not accessed frequently, one of the PMOS pull-up transistors could be under NBTI stress for a prolonged period of time. As such, the divergence of the parameters of the pair of PMOS transistors within an SRAM cell can erode the read and write stability of the cell [12]. Likewise, since gate dielectric breakdown (TDDB) is a statistical process, breakdown in one of the transistors would induce excess gate leakage in that transistor without a corresponding effect on the other five transistors [83]. This asymmetric degradation is a key reliability concern for SRAM design and needs to be reflected in the IC design process. An analysis of a typical 6T SRAM array shows that the READ failure probability increases by a factor of 10-50 due to NBTI if the initial ΔVT due to process variation ranges between 30-40 mV (70 nm PTM technology). On the other hand, the WRITE stability improves somewhat with NBTI, as the cell can be written at a smaller voltage. Overall, however, metrics like the static noise margin (SNM) reduce so quickly with NBTI that it emerges as an important reliability concern [12]. Indeed, it was found that the parametric yield of a large SRAM array (e.g., 2 MB, PTM 70 nm) can degrade by up to 10% over a 3-year period (Fig. 10).

Figure 10. Temporal parametric yield of a 6T-SRAM array. The decrease in memory yield for larger memories, especially towards the end of the product cycle, could be addressed by pre-silicon design (e.g., cell flip) or post-silicon monitoring (Fig. 8).
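A compact way to see how a slow parametric drift erodes array-level yield, in the spirit of Fig. 10: if each cell fails when its (normally distributed) noise margin crosses zero, and NBTI shifts the mean margin as t^(1/6), the array yield is the per-cell survival probability raised to the number of cells. All margin, sigma, drift, and array-size numbers below are illustrative assumptions, not the data behind Fig. 10.

```python
import math

# Array yield from a drifting, normally distributed cell margin.
# yield(t) = P(cell ok at t) ** Ncells, with the mean margin reduced by an
# NBTI-like t^(1/6) drift. All numbers are illustrative assumptions.
MARGIN0_MV = 140.0          # assumed mean cell margin at t = 0 (mV)
SIGMA_MV = 22.0             # assumed process-induced sigma of the margin (mV)
DRIFT_MV_10Y = 18.0         # assumed mean margin loss after 10 years (mV)
NCELLS = 16 * 1024 * 1024   # cells in a 2 MB (16 Mb) array

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def array_yield(t_years):
    drift = DRIFT_MV_10Y * (t_years / 10.0) ** (1.0 / 6.0)
    p_cell_ok = phi((MARGIN0_MV - drift) / SIGMA_MV)
    return p_cell_ok ** NCELLS

for t in (0, 1, 3, 10):
    print(f"t = {t:2d} y : yield = {100 * array_yield(t):6.2f} %")
```

With these assumed numbers the array yield erodes by roughly ten percentage points over the first three years, qualitatively reproducing the trend of Fig. 10.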
Solution Strategies: Several groups have proposed and explored potential remedies for NBTI-induced SRAM degradation. These approaches involve improvements in the pre-silicon design phase, the post-silicon test phase, and the post-silicon operation phase, as discussed below.

In the device design phase, the most effective route to combating NBTI-related asymmetry is to reduce the initial asymmetry as much as possible. Therefore, techniques that reduce process variation are well-suited for reliability-aware design as well. These may include dynamic VDD scaling, body biasing, or redundant arrays to replace faulty cells, etc. [53][56][12].

During the test phase, a body-bias based burn-in strategy could also be effective [55]. In particular, if the body bias is applied such that the SNM is reduced, then NBTI-induced degradation will immediately identify the vulnerable cells. With an additional N-well bias, the SNM can then be increased for those cells, and this increases memory yield by a factor of 3 – a remarkable improvement. Finally, a cell-flipping algorithm proposes to restore symmetry in the PMOS transistors by inverting the stored data in the memory every other day, so that no PMOS transistor is subjected to prolonged uninterrupted stress. While good results are obtained, the main limitation of the approach is the algorithmic and hardware complexity/overhead associated with inverting the data every other day [56].

C. Soft Errors: SRAM Solutions

For SRAM, various types of error detection/correction schemes are now routinely used in a wide variety of ICs. These include (i) Single Error Correction (SEC) schemes, which require 6-8 additional error-correction bits depending on the word length, (ii) Single Error Correction, Double Error Detection (SEC-DED) schemes with 7-9 error-correction bits, and (iii) Double Error Correction (DEC) schemes with 12-24 bits [84]. These coding-theory based techniques require substantial overhead in area, power, and execution time and require careful implementation; in general, however, they are effective and widely used.

D. Permanent Faults: Detection and Solution Strategies

The possible manufacturing defects (such as shorts, gate oxide defects, opens, etc.) can be modeled as circuit nodes being stuck at zero or one. There exist several test-pattern generators that optimize the number of test vectors while trying to achieve close to 100% fault coverage. We have already discussed the use of test patterns and IDDQ for logic circuits; here we discuss approaches that are relevant for faults that occur in SRAMs. SRAM fault models consider not only standard stuck-at faults but also transition faults (a cell fails to make a 0→1 or 1→0 transition) and coupling faults (a transition in bit 'j' causes an unwanted change in bit 'i'). A popular testing methodology known as the MARCH test involves writing/reading 0 and 1 in successive memory locations [85] (a minimal sketch follows at the end of this subsection). The test failure data provided by the MARCH test enables identification of the fault locations in the memory array. The addresses of faulty bit-cells can then be remapped to redundant bit-cells in the array (column redundancy) to increase memory yield.

Techniques to address soft errors in SRAM rely either on the error-correction codes (ECC) discussed above, or on raising the critical charge necessary to flip a cell by various circuit techniques. A classic example involves Dual Interlocked Storage Cells, where SRAM cells connected back to back store the memory state [97]. Regardless of the position of the strike, the bit does not flip because the other cell prevents the racing of the signal necessary to complete the bit-flip. Other rad-hard methods add resistance to the feedback loop to slow the racing of the signal and thereby reduce the probability of a bit flip. Such resistive hardening, however, may not be scalable in high-performance circuits.
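As referenced above, a minimal sketch of a MARCH-style memory test follows. It applies the element sequence of the classical MARCH C- algorithm to a toy memory model with one injected stuck-at-0 cell; the memory model, array size, and fault location are illustrative assumptions.

```python
# Minimal MARCH C- style test over a toy memory with one stuck-at-0 cell.
# Elements: up(w0); up(r0,w1); up(r1,w0); down(r0,w1); down(r1,w0); up(r0)

N = 16
STUCK_AT_0 = {5}                      # assumed faulty address for illustration

mem = [0] * N
def write(addr, val):
    mem[addr] = 0 if addr in STUCK_AT_0 else val
def read(addr):
    return 0 if addr in STUCK_AT_0 else mem[addr]

def march_element(order, ops, failures):
    for addr in order:
        for op, val in ops:
            if op == "w":
                write(addr, val)
            elif read(addr) != val:    # 'r' op: compare against expected value
                failures.add(addr)

up, down = range(N), range(N - 1, -1, -1)
failures = set()
march_element(up,   [("w", 0)], failures)
march_element(up,   [("r", 0), ("w", 1)], failures)
march_element(up,   [("r", 1), ("w", 0)], failures)
march_element(down, [("r", 0), ("w", 1)], failures)
march_element(down, [("r", 1), ("w", 0)], failures)
march_element(up,   [("r", 0)], failures)
print("faulty addresses detected:", sorted(failures))
```

The same element structure detects transition faults and many coupling faults as well, which is why MARCH-type algorithms are the workhorse of embedded memory BIST.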
V. OUTLOOK/CONCLUSIONS

A. Evolution of CAD Tools

One of the primary challenges in process- and reliability-aware design is the lack of widely available CAD tools and a well-recognized design philosophy that could encapsulate the information about device properties for circuit designers to take appropriate corrective action. Recently, there have been changes across the board. Commercial device simulators like MiniMOS and Medici now incorporate models for NBTI, HCI, and TDDB degradation, and university-based reliability-model efforts (e.g., DevRel at www.nanohub.org) are also being developed. Circuit design techniques are being explored by several groups, e.g., from Purdue (CRISTA), the University of Minnesota, Arizona State, and the University of Michigan, and industrial groups like TI and IBM are publishing their design methodologies for various stages of the design process [86]. Recent simulation software from the Arizona group, called 'NewAge', self-consistently integrates activity-dependent temperature enhancement and transistor degradation with timing analysis to provide a systematic prediction of the role of parametric degradation in various integrated circuits [87]. Eventually, we anticipate that these issues will propagate to the system-level tools.

The multi-scale radiation damage and soft-error models have grown increasingly sophisticated over the years. For example, simulators like SEMM-2 from IBM contain particle-transport, nuclear-physics, and particle-source generator modules, as well as detailed front-end and back-end layout information, to create a realistic simulation environment for the prediction of soft errors [41]. These simulators focus on charge generation and critical charge; if hot-carrier information is desired, however, a modeling framework combining GEANT4 with full-band Monte Carlo simulation is appropriate and has been successfully used to study charge ejection from Flash and ZRAM memories [10].

Several Automatic Test Pattern Generator (ATPG) tools are available from vendors like Mentor Graphics, Synopsys, and Cadence that can generate test patterns for stuck-at faults and IDDQ, along with at-speed testing for path-delay faults [88][89][90]. Test failure data, coupled with test patterns and a gate-level design description, can be used to identify the different defects causing the test failures and the defect locations [93]. However, the cost of external testing using automatic test equipment is increasing as chips become more complex, thereby increasing the total chip cost. Research is in progress to achieve lower test cost using on-chip testing (Built-In Self Test, BIST) instead of relying on the expensive external testing methodologies practiced today [91][92].

In sum, in this paper we have reviewed four classes of faults and various fault-tolerant strategies to detect and manage them. We have explored in some depth the various approaches proposed to design with these faults, both for logic and for memory. The techniques are innovative and often appear promising, as they offer savings in power/area over more conservative design. However, the implications for test, burn-in, and yield are important questions that need to be explored. Eventually, we suggest that the only way to implement any design philosophy is through the CAD tools. Progress towards that goal is already underway.

ACKNOWLEDGEMENT

We wish to acknowledge contributions from members of our groups, K.-H. Kang, B. Paul, A. E. Islam, D. Varghese and G. Karakonstantis. This work was supported by the National Science Foundation, Applied Materials, Texas Instruments, and Taiwan Semiconductor Manufacturing Company.

REFERENCES

[1] [2] [3] [4] [5] S. Borker, “Designing Reliable Systems from Unreliable Components: The Challenge of Transistor Variability and Degradation”, IEEE Micro, p. 10-16, 2005. M. A. Alam, S.
Mahapatra, “A comprehensive model of PMOS NBTI degradation,” Microelectronics Reliability, 45 (1) p. 71, 2005. B. C. Paul et al. EDL-26, 560-562, 2005. Also see, B. Paul, Proc. of DATE, p. 780-785, 2006. M.A. Alam, N. Pimparkar, S. Kumar, and for Macroelectronics Applications”, MRS Bulletin, 31, 466, 2006. G. Groeseneken, R. Degraeve, B. P. Roussel, “Challenges in Reliability Assessment of Advanced CMOS Technologies”, Keynote Presentation at the Int. Conf. on. Physical and Failure 4A.1.9 [6] [7] [8] [9] Analysis of Integrated Circuits, p. 1-9, 2007. H. Kufluoglu and M.A. Alam, “A geometrical unification of the theories of NBTI and HCI time-exponents and its implications for ultra-scaled planar and surround-gate MOSFETs,” IEEE IEDM Technical Digest, p. 113 ( 2004). P. Nair and M. Alam, “Design Considerations of Nanobiosensors”, IEEE Trans. Elec. Devices, 2007. R. Baumann, “The impact of technology scaling on soft error rate performance and limits to the efficacy of error correction,” IEDM Tech. Digest, pp. 329-332 ( 2002). P. E. Dodd, M. R. Shaneyfelt, J.R. Schwank, and G. L. Hash, “Neutron Induced Latchup in SRAM at Ground Level, IRPS Proceeding, pp. 51-55, (2003). IRPS11-361 [10] N. Butt, Ph.D. Thesis, Purdue University, 2008. [11] M. Alam and K. Smith, “A Phenomenological Theory of Correlated Multiple Soft-Breakdown Evetns in Ultra-thin Gate Dielectrics,” IEEE IRPS Proceedings, p. 406, (2003). [12] K. Kang, H. Kufluoglu, K. Roy, and M. A. Alam, “Impact of Negative Bias Temperature Instability in Nano-Scale SRAM Array: Modeling and Analysis,” to be published in IEEE Transactions on CAD, 2007. [13] J. H. Stathis, “Reliability Limits for the Gate Insulator in CMOS Technology,” IBM Journal of Research and Development, vol. 46 (2/3), pp. 265-286, (2002). [14] R. Degraeve, B. Govoreanu, B. Kaczer, J. Van Houdt, G. Groeseneken, “Measurement and statistical analysis of single trap current-voltage characteristics in ultrathin SiON”, Proc. of IRPS, 2005. pp. 360- 365. [15] https://www.nanohub.org/resources/1647/ [16] https://www.nanohub.org/resources/1214/ [17] http://cobweb.ecn.purdue.edu/~ee650/handouts.htm [18] A.E. Islam, Ph.D. Thesis, Purdue University, 2010. [19] R. Fernandez, B. Kaczer, IEDM Tech. Dig. 2006. [20] H. Kufluoglu et al., Proc. of IRPS, 2007. [21] D. Allee et al., “Degradation Effects in a a-Si :H Thin Film Transistors and their Impact on Circuit Perfomance,” Proc. of IRPS, p. 158, 2008. [22] E. Takeda, C. Yang, A. Miura-Hamada, “Hot Carrier Effects in MOS Devices,” Academic Press (1995). [23] http://www.eecs.berkeley.edu/Pubs/TechRpts/1990/ ERL-90-4.pdf [24] D. Varghese, Ph.D. Thesis, Purdue University, 2009. [25] G. Cellere, L. Larcher, A. Paccagnella. A. Visconti, M. Bonanomi , “Radiation induced leakage current in floating gate memory cells”, IEEE transactions on nuclear science 2005, vol. 52 (1), no6, pp. 2144-2152. [26] R. W. Keyes, “Physical Limits of Silicon Transistors and Circuits, Rep. Prog. Phys. 68, 2701-2746, 2005. [27] A. Bhavnagarwala, X. Tang, J. Meindl, The impact of Intrinsic Device Fluctuations on CMOS SRAM Cell Reliability“, IEEE J. of Solid State Circuits, 36 (4), pp. 658-664, 2001. [28] V. De, X. Tang, and J. Meindl, “Random MOSFET Parameter Fluctuation Limits to Gigascale Integration (GSI)”, 1996 Sym. Of VLSI Technology, paper 20.4. [29] M. Agostinelli, S. Lau, S. Pae, P. Marzolf, H. Muthali, S. Jacobs, “PMOS NBTI-induced circuit mismatch in advanced technologies,” IEEE IRPS Proceedings, p. 171 (2004). [30] M. A. Alam, B. Weir, P. Silverman, J. Bude, A. 
Ghetti, Y. Ma, M. M. Brown, D. Hwang, and A. Hamad , “Physics and Prospects of Sub-2nm Oxides,” in The Physics and Chemistry of SiO2 and the eSi-SiO2 interface-4 H. Z. Massoud, I. J. R. Baumvol, M. Hirose, E. H. Poindexter, Editors, PV 2000-2, p. 365, The Electrochemical Society, Pennington, NJ (2000). [31] D. G. Flagello, H. V. D. Laan, J. V. Schoot, I Bouchoms, and B. Geh, “Understanding Systematic and Random CD Variations using Predictive Modelling Techniques,” SPIE Conference in Optical Microlithography XII, pp. 162-175, 1999. [32] M. Ieong, B. Doris, J. Kedzierski, K. Rim, M. Yang, “Silicon device scaling to the sub-10 nm regime,” Science, vol. 306 p.2057 (2004). [33] A. Asenov, “Random dopant induced threshold voltage lowering and fluctuations in sub-0.1 μm MOSFET's: A 3-D “atomistic” simulation study” IEEE Transactions on Electron Devices 45(12):pp. 2502-2513, 1998. [34] Simon W-B. Tam, Yojiro Matsueda, Hiroshi Maeda, Mutsumi Kimura,” Improved Polysilicon TFT Drivers for Light Emitting Polymer Displays,” Proc Int Disp Workshops, 2000. [35] R. Gusmeroli, A.S. Spinelli, C. Monzio Compagnoni, D. Ielmini and A.L. Lacaita, “Edge and percolation effects on VT window in nanocrystal memories”, Microelectronics Engineering, doi: 10.1016 /j.mee2005.04.066. [36] N. Pimparkar, J. Guo and M. Alam, “Performance Assessment of Sub-percolating Network Transistors by an Analytical Model,” IEEE Trans. of Electron Devices, 54(4), 2007. IRPS11-362 4A.1.10 [37] J. Choi, A. Bansal, M. Meterelliyoz, J. Murthy, and K. Roy, “Leakage Power Dependent Temperature Estimation to Predict Thermal Runaway in FinFET Circuits,” International Conference on CAD, pp. 2006. [38] S. Kumar, J. Murthy, and M. Alam, “Self-Consistent ElectroThermal Analysis of Nanotube Network Transistors” Journal of Applied Physics, 109, 014315, 2011. Also see, S.-C. Lin and K. Banerjee, “An Electrothermally-aware Full Chip Substrate Temperature Gradient Evaluation Methodology for Leakage Dominant Technologies with Implications for Power Estimation and Hot Spot Management”, Proc. ICCAD, pp. 568-574, 2006. [39] Robert Bauman, Tutorial Notes, International Reliability Physics Symposium, 2004. [40] D.F. Heidel et al. “Single Event Upsets and Multiple-Bit Upsets on a 45 nm SOI SRAM”, IEEE Trans. Nuclear Science. 56(6), pp. 3499-3504, 2009. [41] H.H. K. Tang, “SEMM-2: A New Generation of Single-EventEffect Modeling Tool,” IBM Journal of Research and Development, 52(3), pp. 233-244, 2008. [42] P. E. Dodd, F. W. Sexton, M. R. Shaneyfelt, J. R. Schwank, and D. S. Walsh, Sandia National Laboratories “Epi, Thinned and SOI substrates: Cure-Alls or Just Lesser Evils”, IRPS Proc. 2007. [43] R. Lucky, in Silicon Dreams: Information, Man, and Machine, 1989. [44] E. J. McCluskey, Chao-Wen Tseng, “Stuck-fault tests vs. actual defects”, Proc. ITC, pp.336-342, 2000. [45] S. Shekhar, M. Erementchouk, M. Leuenberger and S. Khondaker, “Correlated breakdown of CNT in aligned arrays”, arxiv.org/ftp/arxiv/papers/1101/1101.4040.pdf, 2011. [46] J. Tschanz, J. Kao, S. Narendra, R. Nair, D. Antoniadis, A. Chandrakasan, V. De, “Adaptive Body Bias for Reducing Impacts of Die-to-Die and Within-Die Parameter Variations on Microprocessor Frequency and Leakage,” International Solid State Circuit Conference, pp. 477-479, 2002. [47] S. Ghosh, S. Bhunia, and K. Roy, “CRISTA: A New Paradigm for Low- Power,Variation-Tolerant, and Adaptive Circuit Synthesis Using Critical Path Isolation”, Transactions on CAD, 2007. [48] D. Ernst, N. 
Kim, et al., "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," Proc. IEEE/ACM Int. Symposium on Microarchitecture (MICRO), Dec. 2003. [49] T. Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” ACM/IEEE 32nd Annual Symposium on Microarchitecture, p. 196 November 1999. [50] K. Kang, K. Kim, and K. Roy, “Variation Resilient Low-Power Circuit Design Methodology using On-Chip Phase Locked Loop”, to be published in ACM/IEEE Design Automation Conference, 2007. [51] C. Visweswariah, “Death, Taxes and Failing Chips,” ACM/IEEE Design Automation Conference, pp. 343-347, 2003. [52] B. Kaczer, IRPS Tutorial of Gate Dielectric Breakdown, 2009. [53] R. Vattikonda et al. Proc. of DAC, 1047-1052, 2006. [54] T. Krishnan, V. Reddy, S. Chakravarthi, J. Rodriguez, S. John, S. Krishnan, “NBTI impact on transistor and circuit: models, mechanisms and scaling effects”, Proc. of IEDM, 2003. [55] A. Krishnan et al., NBTI Effects of SRAM Circuits, IEEE IEDM Digest, 2006. [56] S. Kumar et al., Proc. of Quality Electronic Design, pp. 210-218, 2006. [57] K. Kang, H. Kufluoglu, M. A. Alam, and K. Roy, “Efficient Transistor-Level Sizing Technique under Temporal Performance Degradation due to NBTI,” IEEE International Conference on Computer Design, pp. 216-221, 2006. [58] C. P. Chen, C. C. N. Chu, and D. F. Wong, “Fast and exact simultaneous gate and wire sizing by Langrangian relaxation,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 19, pp. 1014-1025, (1999). [59] S. Kumar, C. Kim and S. Sapatnekar, “NBTI Aware Synthesis of Digital Circuits”, Design Automation Conference, pp. 370-375, San Diego, California, USA, June 2007 [60] K. Kang, K. Kim, A. E. Islam, M. A. Alam, and K. Roy, “Characterization and Estimation of Circuit Reliability [61] [62] [63] [64] [65] [66] [67] [68] [69] [70] [71] [72] [73] [74] [75] [76] [77] Degradation under NBTI using On-line IDDQ Measurement,” to be published in ACM/IEEE Design Automation Conference, 2007. K. Kang, M. A. Alam, and K. Roy, “Estimation of NBTI Degradation using On-Chip IDDQ Measurement”, International Reliability Physics Symposium, pp. 10-16, 2007. C. H. Kim, K. Roy, S. Hsu, R. Krishnamurthy, S. Borkhar, "An On-Die CMOS Leakage Current Sensor For Measuring Process Variation in Sub-90nm Generations", VLSI Circuits Symposium, June 2004 T. Kim, R. Persaud, and C.H. Kim, "Silicon Odometer: An OnChip Reliability Monitor for Measuring Frequency Degradation of Digital Circuits", IEEE Journal of Solid-State Circuits, pp 874-880, Apr. 2008. A T. Kim, R. Persaud, and C.H. Kim, "Silicon Odometer: An OnChip Reliability Monitor for Measuring Frequency Degradation of Digital Circuits", VLSI Circuits Symposium, pp122-123, June 2007 E. Karl, P. Singh, D. Blaauw, D. Sylvester “Compact In-Situ Sensor for Monitoring Nagative Bias-Temperature-Instability Effect and Oxide Degradation”, IEEE International Solid State Circuit Conference, pp. 410-411, California, USA, 2008 M. Agarwal, S. Mitra, B. C. Paul, M. Zhang, “Circuit Failure Prediction and its Application to Transistor Aging,” VLSI Test Symposium, 2007. S. Mitra, Proc. of Int. Rel. Phys. Symposium, 2008. A.E. Islam and M. A. Alam, “On the Possibility of Degradation free Field-Effect Transistors”, Appl. Phys. Lett. 92, 173504, 2008. M. Zhang et. al., “Sequential Element Design With Built-In Soft Error Resilience,” IEEE Transactions on VLSI Systems, 2006. A. Goel et. 
al., “Low-overhead design of soft-error-tolerant scan flip-flops with enhanced-scan capability,” ASP-DAC, 2006. R. Koo and S. Toueg, “Checkpointing and rollback-recovery for distributed systems,” IEEE Trans. Software Engineering 13, 1, 23– 31, 1987. K. Constantinides et. al., "Software-Based Online Detection of Hardware Defects: Mechanisms, Architectural Support, and Evaluation", International Symposium on Micro architecture. L. Leem et. al., “ERSA: Error Resilient System Architecture for Probabilistic Applications”, DATE, 2010. J.M. Soden, C.F. Hawkins, R.K. Gulati, W. Mao, “IDDQ Testing: A Review,” Journal of Electronic Testing: Theory and Applications, Vol. 3, No. 4, pp. 291-303, Dec. 1992. N. Banerjee, C. Augustine, K. Roy, “Fault-Tolerance with Graceful Degradation in Quality: A Design Methodology and its Application to Digital Signal Processing Systems”, DFT, 2008. Jie Han , Jianbo Gao , Yan Qi , Pieter Jonker , Jose A. B. Fortes, Toward Hardware-Redundant, Fault-Tolerant Logic for Nanoelectronics, IEEE Design & Test, v.22 n.4, p.328-339, July 2005 M. Z. Hasan and S. G. Ziavras, “Runtime Partial Reconfiguration 4A.1.11 [78] [79] [80] [81] [82] [83] [84] [85] [86] [87] [88] [89] [90] [91] [92] [93] [94] [95] [96] [97] for Embedded Vector Processors”, International Conference on Information Technology: New Generations”, Las Vegas, Nevada, April 204, 2007. J. R. Heath, P. J. Kuekes, G. Snider, and R. S. Williams, “A DefectTolerant Computer Architecture: Opportunities for Nanotechnology”, Science, 280, pp. 1716-1721, 1998. S. Mukhopadhyay, H. Mahmoodi, and K. Roy, “Modeling of Failure Probability and Statistical Design of SRAM Array for Yield Enhancement in Nanoscaled CMOS”, IEEE TCAD, 2004. S. Mukhopadhyay, A. Raychowdhury, H. Mahmoodi, K. Roy, “Statistical Design and Optimization of SRAM cell for yield improvement,” International Conference on Computer Aided Design, pp. 10-13, 2004. S. Mukhopadhyay, K. Kang, H. Mahmoodi, K. Roy, “Reliable and self-repairing SRAM in nano-scale technologies using leakage and delay monitoring,” International Test Conference, 2005. , K. Roy, J. Kulkarni, M.-E. Hwang, “Process-Tolerant Ultralow Voltage Digital Subthreshold Design” IEEE Topical Meeting on Silicon Monolithic Integrated Circuits in RF Systems, 2008. SiRF, pp. 42-45, 2008. R. Rodriguez, J. Stathis, B. Liner, S. Kowalczyk, C. Chung, R. Joshi, G. Northrop, K. Bernstein, A. Bhavnagarwala, and S. Lombardo, “The Impact of Gate Oxide Breakdown on SRAM Stability,” IEEE Electron. Dev. Lett. 23, 559 (2002). Heidergott, NSREC Short Course, 1999. A.J. van de Goor, “Using March Tests to Test SRAMs,” IEEE Design & Test of Computers, Vol. 10, No. 1, Mar. 1993, pp. 8-14. [Goda05] A. Goda, ISQED, 2005. M. DeBole, Int. J. Parallel Prog. 2009 http://www.mentor.com/products/silicon-yield http://www.synopsys.com/Tools/Implementation/RTLSynthesis/Pa ges/TetraMAXATPG.aspx http://www.cadence.com/products/ld/true_time_test V. D Agrawal, “A tutorial on built-in self-test. I. Principles”, Design and Test of Computers, 1993. V. D Agrawal, “A tutorial on built-in self-test. II. Applications”, Design and Test of Computers, 1993. W. T Cheng et. al., “Compactor independent direct diagnosis”, Proc. Asian Test Symp, 2004. B. Vaidyanathan, B.; A.S. Oates, X. Yuan Intrinsic NBTIvariability aware statistical pipeline performance assessment and tuning, ICCAD, pp. 164-171, 2009. [Rajsuman95] R. Rajusman, Iddq Testing for CMOS VLSI, Artech House, 1995. [Agarwal05] A. Agarwal, B. C. Paul, H. Mahmoodi, A. Datta, K. 
Roy, “A process-tolerant cache architecture for improved yield in nanoscale technologies,” IEEE Transactions on VLSI, vol. 13, no. 1, pp. 27-38, 2005. Calin, et al., IEEE Trans. Nucl. Sci. 43, p. 2874, 1996.