Amir Yazdanbakhsh
Georgia Institute of Technology

Raghuraman Balasubramanian, Tony Nowatzki, Karthikeyan Sankaralingam
University of Wisconsin–Madison

In the future, especially in harsh environments (such as aerospace, underwater, and military), microprocessors are increasingly likely to fail in the field because of manufacturing test fault escapes and various aging and wear-out phenomena.1,2 Circuit failure prediction techniques employ wear-out device physics principles and empirical measurements3 to predict failures in the field before they occur for logic and static RAM (SRAM). Models of the dominant mechanisms, namely negative bias temperature instability (NBTI), hot carrier injection (HCI), and time-dependent dielectric breakdown (TDDB), show that logic wear out increases the delay of gates because a degraded (increased) $V_{th}$ reduces the overdrive $(V_{DD} - V_{th})$. However, wear out of SRAM transistors affects the SRAM arrays’ performance parameters (such as read stability, write stability, and read delay) differently. Previous work has shown that degraded read stability is the dominant wear-out failure in SRAM arrays.3–5 (The effect of aging on transistors’ mobility is not considered.)

Extensive literature has addressed wear-out prediction inspired by these observations (in the interest of space, we provide one representative citation6). However, as far as we know, no prior work simultaneously addresses both logic and SRAM. Furthermore, the existing techniques individually suffer from complexity, overhead, accuracy, and generality problems, and they become particularly ineffective in harsh environments in which wear-out challenges are exacerbated. These prior techniques are discussed further in the “Related Work in Circuit Failure Prediction” sidebar.
Our goal is to develop a unifying yet simple mechanism that covers both logic and SRAM and delivers low complexity, low overhead, and high accuracy. To this end, we developed a comprehensive circuit failure prediction technique called the Aged Full-Chip Predictor for both logic and SRAM in many-core systems. The Aged Full-Chip Predictor allows safe execution up to 0.4 days before logic failures and extends the typical lifetime by 14 months over a system that uses ECC for SRAM.

Published by the IEEE Computer Society. 0272-1732/15/$31.00 © 2015 IEEE

Related Work in Circuit Failure Prediction

Figure A shows the various alternatives for handling wear out in logic and SRAM. Dimitris Gizopoulos and colleagues provide a good overview of detection techniques for logic.1 Logic wear-out prediction is based on canaries,2 in-situ flip-flop techniques,3 delay measurement,4 and built-in self-test (BIST).5 SRAM-based detection and prediction techniques are based on sensors or modifications to the SRAM cell,6,7 complex error-correcting codes (ECCs), and hybrid ECC and cell sizing.8 None of these can simultaneously deliver low complexity, low overhead, and high accuracy because each operates within only a single computing layer. When implemented at the circuit level, these techniques suffer from complexity and always remain active.
On the other hand, an architecture-level-only solution suffers from low accuracy because architecture fault models do not capture most physical effects. (In both logic- and SRAM-based directions, there is a body of work on mitigation and repair, which is complementary and somewhat orthogonal to detection and prediction.)

References
1. D. Gizopoulos et al., “Architectures for Online Error Detection and Recovery in Multicore Processors,” Proc. ACM/IEEE Design, Automation, and Test in Europe Conf., 2011, pp. 1–6.
2. J. Tschanz et al., “Tunable Replica Circuits and Adaptive Voltage-Frequency Techniques for Dynamic Voltage, Temperature, and Aging Variation Tolerance,” Proc. Symp. VLSI Circuits, 2009, pp. 112–113.
3. D. Ernst et al., “Razor: A Low-Power Pipeline based on Circuit-Level Timing Speculation,” Proc. 36th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2003, pp. 7–18.
4. J. Blome et al., “Self-Calibrating Online Wearout Detection,” Proc. 40th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2007, pp. 109–122.
5. J.C. Smolens et al., “Detecting Emerging Wearout Faults,” 3rd IEEE Workshop Silicon Errors in Logic-System Effects, 2007; http://jared.smolens.org/documents/first-smolens_____________________________ selse07.pdf.
6. F. Ahmed and L. Milor, “Reliable Cache Design with On-Chip Monitoring of NBTI Degradation in SRAM Cells using BIST,” Proc. 28th VLSI Test Symp., 2010, pp. 63–68.
7. Z. Qi et al., “SRAM-Based NBTI/PBTI Sensor System Design,” Proc. 47th ACM/IEEE Design Automation Conf., 2010, pp. 849–852.
8. Z. Chishti et al., “Improving Cache Lifetime Reliability at Ultra-Low Voltages,” Proc. 42nd Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2009, pp. 89–99.
Figure A. The operation of failure-prediction techniques that target logic and static RAM (SRAM). Compared to other logic-detection techniques, Aged-SDMR has low overhead and coverage on all logic cells. Compared to error-correcting code (ECC) alone, Aged-AsymChk can predict the second failure before it occurs. (Panels: prediction techniques targeting logic, and prediction techniques targeting memories (SRAM); the horizontal axis is time in years over the lifetime of a processor, and line thickness indicates operational overheads.)

Design

The design of the Aged Full-Chip Predictor leverages three primary mechanisms. We discuss the insight for each and outline their design below. Figure 1 provides an overview of the execution of our comprehensive failure-prediction system.

Virtual aging to manifest faults

Our key insight is to virtually wear out the processor and thus manifest a wear-out fault early. We convert the wear-out degradation into a higher-level and easier-to-detect fault; we then expose and detect the fault, which effectively predicts and detects the wear out.
IEEE Micro, November/December 2015

[Figure 1 panels: execution is divided into S-epochs and L-epochs. Aged-SDMR is active for 1% of the cycles at the start of each L-epoch, while user applications keep running. Aged-AsymChk is active at the start of each S-epoch: all processes are paused, the cache is flushed, the BIST check runs, and processes resume. For processor memories, virtual aging makes the cells behave as if they were weeks older, so eventual failures manifest as stuck-at faults; BIST test vectors expose these faults and BIST checkers detect them, with no modifications to the SRAM cells. For processor logic, virtual aging makes eventual failures manifest as delay faults; user applications expose these faults as errors and sampling DMR with a checker core detects them. Near-critical paths are left unmodified; additional capture logic driven by a phased clock is inserted to cover fast gates on noncritical paths.]

Figure 1. Two techniques, based on virtual aging, together provide comprehensive failure prediction.
Aged-SDMR detects manifested logic errors using sampling and dual-modular redundancy, whereas Aged-AsymChk detects manifested SRAM errors using asymmetric checking.

All device-level wear-out faults eventually must manifest at a higher abstraction level; thus, any detection technique can be repurposed as a prediction technique. We carry out virtual aging by reducing supply voltage using dynamic voltage scaling. We can tune the prediction’s timeliness by changing the amount of voltage reduction. Virtual aging is instantaneously reversible; resetting to nominal voltage restores the processor’s current age.

Sampled redundancy to expose and detect logic failure

We observed that wear out in logic is first exposed as a logic delay fault, and sampled redundancy with execution on a second core can be effective in handling logic transistors. BIST and stuck-at fault models are insufficient for providing full coverage for these delay-driven failures. The key idea of the solution, Aged-SDMR, is to couple cores randomly at randomly chosen periods of time, run one core virtually aged, use the second (redundant) core as a checker core, and couple these using a nonintrusive lightweight mechanism. Because logic faults start as delay faults, a comprehensive redundant core is necessary for full coverage. Shuou Nomura and colleagues introduced the concept of Sampling + DMR,7 which solves the overhead problem that historically has plagued redundancy. Our key advancement over their work is to use virtual aging during DMR execution to ensure that faults always occur first in a DMR window, thus ensuring no missed errors.

Asymmetric checkers to expose and detect SRAM failure

Aged-SDMR cannot be used for SRAM because checkpointing the entire SRAM state is infeasible, especially considering today’s megabyte-sized level-2 caches.
However, wear out in SRAMs results in read stability problems, and therefore its effect can be captured by a simple stuck-at fault model. The solution, Aged-AsymChk, leverages this insight and uses established asymmetric checker technology such as BIST to check the SRAMs when they are virtually aged. Specifically, we write known vectors to an SRAM, then read out the values; any mismatch between these indicates an impending failure.

Use of existing techniques

The principles of dynamic voltage scaling, sampling, redundancy, and asymmetric checking using BIST are well known. Our work’s implementation and design contribution is a novel use of existing techniques, while avoiding disruptive or intrusive mechanisms and providing comprehensive logic and SRAM wear-out prediction. The implementation requirements are simple or already existent: dynamic voltage scaling capability; separate voltage islands for SRAMs and logic; a reliability manager module added to cores to allow checking of retired instructions; BIST capability in the SRAMs; and a controller (like a cache controller) in the SRAM that allows its contents to be safely evicted prior to being overwritten for BIST.

Implementation

We present the organization of our system and the implementation of virtual aging, fault exposure, and fault detection. Within each, we discuss logic and SRAM. Figure 1 shows the high-level overview and details of each individual approach. We focus on SRAM in this article because our previous work covered the logic.8

Overall organization

Conceptually, we execute the processor in epochs; at the start of every epoch there is a window in which the processor is virtually aged.
As Figure 1 shows, we have two types of epochs: logic epochs (L-epochs), in which only the logic is virtually aged, and SRAM epochs (S-epochs), in which only SRAM is virtually aged. These never overlap and are executed at different rates.

Virtual aging

We virtually age a processor by reducing the supply voltage to both logic and SRAM arrays. Although the enabling mechanism is the same, the failure behavior is different. For SRAM, prior to virtual aging, we must ensure any useful SRAM state is written to some other location. For an SRAM that is part of a cache, the cache controller can be enhanced to evict all dirty lines. Otherwise, it can be done completely in software using instructions like WBINVD (writeback and invalidate cache) in the AMD64 architecture. SRAMs in speculative structures such as branch predictor tables can simply be overwritten. Precise interrupts that would start an S-epoch ensure that structures such as load queues and the rename table are empty. We can virtually age large memory structures, such as L2 caches with many SRAM blocks, by applying the S-epochs one SRAM array at a time, coordinated with the controller to turn off banks.

Effect on logic. The delay of a gate, $t_d$, is inversely proportional to $(V_{DD} - V_{th})^2$. Wear out causes $V_{th}$ and hence $t_d$ to increase. Reducing $V_{DD}$ has the same effect and can be calibrated to mimic weeks or months of aging.

Effect on SRAM. Consider the basic six-transistor SRAM cell organization. In a newly manufactured cell, the cross-coupled inverters are fairly identical, producing a voltage transfer characteristic as in Figure 2a. The static noise margin (SNM) is the minimum noise or extraneous voltage that can corrupt the stored value. The read failure probability is the likelihood of such corruption for a given cell. Owing to wear out, the SRAM’s inverters degrade, reducing the static noise margin as shown in Figures 2b and 2c, which consequently increases the read failure probability.
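As a rough illustration of why lowering the supply voltage can stand in for aging, the sketch below applies only the proportionality stated for logic (gate delay inversely proportional to the square of the overdrive, VDD minus Vth) and compares a threshold-voltage increase against an equal supply-voltage reduction. The 1.2-V nominal supply matches the value shown in Figure 2; the 0.30-V fresh threshold, the 50-mV shift, and the function name are hypothetical choices made for illustration, and the model ignores every effect other than the overdrive term.

```python
def relative_delay(vdd: float, vth: float, vdd_nom: float, vth_nom: float) -> float:
    """Gate delay relative to the nominal, unaged operating point.

    Uses only the proportionality t_d ~ 1 / (VDD - Vth)^2 quoted in the text.
    """
    overdrive = vdd - vth
    if overdrive <= 0:
        raise ValueError("supply voltage must exceed threshold voltage")
    return ((vdd_nom - vth_nom) / overdrive) ** 2


VDD_NOM = 1.2   # volts, nominal supply shown in Figure 2
VTH_NOM = 0.30  # volts, hypothetical fresh-device threshold (assumption)
DELTA = 0.05    # volts, illustrative shift (assumption)

# Real aging: the threshold voltage rises by DELTA at nominal supply.
aged = relative_delay(VDD_NOM, VTH_NOM + DELTA, VDD_NOM, VTH_NOM)
# Virtual aging: the supply drops by DELTA while the threshold is unchanged.
virtual = relative_delay(VDD_NOM - DELTA, VTH_NOM, VDD_NOM, VTH_NOM)

print(f"aged: {aged:.3f}x  virtually aged: {virtual:.3f}x")
```

Under this simplified model the two slowdowns coincide because both changes shrink the overdrive by the same amount, and returning the supply to its nominal value restores the unaged delay.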
Furthermore, SRAM wear out is asymmetric and depends on the stored value in the SRAM cell. For example, when a zero is stored in the SRAM cell, the p-channel MOS (PMOS) transistor in one of the inverters is subjected to stress, whereas the PMOS transistor in the other one goes into recovery mode. With extremely high wear out, cells can become stuck at 0 or 1 permanently (see Figure 2d).

Virtual aging’s behavior for SRAM is similar to the logic case. The fundamental source of the SNM change is a decreased $(V_{DD} - V_{th})$ due to increased $V_{th}$. The same decrease can be achieved by lowering $V_{DD}$, and it can be instantaneously reset back to the current age by returning to nominal $V_{DD}$.

Figure 3 shows an HSpice simulation of virtual aging’s effectiveness. Using MOS reliability analysis (MOSRA) aging models, we ran simulations of the SRAM cell with various amounts of aging. For the technology and the MOSRA parameters that we considered, failure happened at approximately 12 years (626 weeks) for a worst-case stressed cell (that is, one that constantly stores either one or zero for the duration of the aging). The MOSRA parameters are TIT0 = 5e-8; TITFD = 7.5e-10; TITTD = 1.45e-20; TN = 0.5; RelMode = default (both HCI and BTI). At each aging setting, we also ran a simulation with various amounts of voltage reduction. In this case, we first obtained the total amount of stress on the transistors during the whole aging period at nominal voltage, which shows up as a shift in $V_{th}$. Given the shifted $V_{th}$ values for each transistor, we simulated the SRAM cell with the reduced voltage to observe the aging failure. The dots in the figure indicate the age at which the cell failed for various amounts of voltage reduction. Subtracting this age from 12 years provides the window of advance failure notification. This experiment demonstrates that reducing voltage serves the purpose of virtual aging.

Figure 2. Six-transistor (6T) SRAM cell transfer characteristics and the read failure in the SRAM cell. 6T SRAM transfer characteristics, plotted as V(QB) versus V(Q) at VDD = 1.2 V, for (a) a new chip (age = 0 years), (b) positive read static noise margin (SNM) after wear out (age ≈ 10 years), and (c) zero read SNM after wear out (age ≈ 12 years). (d) Negative (near-zero) read SNM causes the stored value in the SRAM to flip (initial stored value is zero); the panel plots VWL, VQ, and VQB against time in µs.

Fault exposure

The fault exposure mechanism is what makes all errors visible to the detection mechanism.

SRAM. The goal of fault exposure is to condition a failed cell to produce errors.
Our main contribution here is based on a simple observation: the read stability problem in failed cells can be abstracted as a stuck-at-zero or a stuck-at-one fault if we can write known values into the SRAM and then read them. We reuse the pattern generators in memory BIST to produce and write these values: a simple “March” algorithm that writes all zeros followed by all ones will suffice for Aged-AsymChk.

Logic. Exposing permanent faults in the critical path is straightforward. A permanent fault keeps producing errors in the circuit; however, depending on the input values, some of the faults might be masked. Therefore, we need a mechanism that samples more than once to guarantee the detection mechanism’s completeness. Figure 4a shows how degradation affects a critical path, assuming that a guardband is added to accommodate aging. As the chip ages, the delay increases and the guardband slack decreases. When the delay degradation overshoots the guardband (3 years in the figure), soft breakdown occurs. Under virtual aging, the additional delay in gates that fall in near-critical paths shows up as faults at the flip-flops they drive. This causes a bit flip (or metastability) at the output of the flip-flops that can propagate to cause an architectural state corruption. These faults are exposed with no modifications required to the processor. Figure 1 shows an example circuit block highlighting the fact that the critical path is left unmodified.

Noncritical paths introduce subtle challenges because gates that are exclusively on noncritical paths (fast gates) can degrade directly to hard breakdown without ever manifesting as a delay fault, thus circumventing the prediction mechanism. Simple clock-phase shifting logic can be added to gates on noncritical paths to effectively expose their delays (see Figure 1). Because modifications are only to paths that have much slack, they are not a source of complexity.
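Returning to the SRAM case, the write-then-read check described above (all zeros, then all ones) can be sketched in software as follows. This is only a behavioral model under stated assumptions: the `FaultyArray` class, its `stuck` map, and `march_check` are hypothetical names invented for illustration, whereas the approach in the text reuses hardware memory BIST pattern generators while the array is virtually aged.

```python
from typing import Dict, List


class FaultyArray:
    """Toy bit array in which selected cells are stuck at a fixed value."""

    def __init__(self, size: int, stuck: Dict[int, int]):
        self.bits = [0] * size
        self.stuck = dict(stuck)  # cell index -> stuck value (0 or 1)

    def write(self, index: int, value: int) -> None:
        self.bits[index] = value

    def read(self, index: int) -> int:
        # A stuck cell returns its stuck value regardless of what was written.
        return self.stuck.get(index, self.bits[index])


def march_check(array: FaultyArray, size: int) -> List[int]:
    """Write all zeros then all ones, reading back after each pass.

    Returns the sorted indices whose read value mismatched the written one.
    """
    failing = set()
    for pattern in (0, 1):
        for i in range(size):
            array.write(i, pattern)
        for i in range(size):
            if array.read(i) != pattern:
                failing.add(i)
    return sorted(failing)


# Hypothetical example: cell 2 is stuck at 1 and cell 5 is stuck at 0.
sram = FaultyArray(8, stuck={2: 1, 5: 0})
print(march_check(sram, 8))
```

The all-zeros pass catches stuck-at-one cells and the all-ones pass catches stuck-at-zero cells, which is why two passes suffice for a pure stuck-at model.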
Figure 3. The timing of failure manifestation using virtual aging versus supply voltage. As the supply voltage is reduced (virtual aging), the time when the failure occurs becomes earlier. (Axes: voltage in volts versus time in weeks. The marked point ß is predicted about 28 weeks ahead of end of life with VDD reduced by 45 mV.)

Fault detection

The fault-detection mechanism compares measured (read) values against known (written) values to determine when a fault has occurred.

Logic. For fault detection in logic, we use a separate checker core that is started on the basis of the checked core’s checkpoint. The checker core operates at regular voltage. As we outlined earlier, we need a full-fledged core to address accuracy problems, because BIST and test-vector-based techniques compromise coverage for delay-based fault models. We also add a simple reliability manager module to every core, which monitors retiring instructions, converts them into a signature, and sends the signature to the checker core using the L2-cache communication network. The checker core’s reliability manager checks the signature against its own computed signatures. Shuou Nomura and colleagues describe the firmware or OS support that allows arbitrary cores to be paired using the idea of virtual CPUs.7 We assume the same to allow the coupling of cores.

SRAM. The detection phase is trivial for Aged-AsymChk, because the BIST controller knows what values to expect; any differences are flagged as impending failures.

Discussion

An important question to consider is, compared to prior works, what do we lose or what assumptions are broken or ignored? We make one judicious cross-layer (circuit to architecture layer) assumption: the state or values in the SRAM can be drained using an architectural mechanism, allowing the SRAM’s contents to be overwritten periodically for BIST-based stuck-at-fault testing. In the context of a microprocessor execution, this is a reasonable and easy-to-implement assumption. The circuit-based techniques, in contrast, attempt to address wear out in isolation and hence avoid such assumptions.

Figure 4. Signal integrity in circuits as they age. (a) In near-critical paths, the signal integrity will not hold once the guardband is degraded (a delay fault), and virtual aging alone can detect the problem in advance. (b) In noncritical paths, hard breakdown may occur before a delay fault manifests, but a phased clock on these paths can expose the issue earlier.

Evaluation

Our evaluation of wear out and of the Aged Full-Chip Predictor’s effectiveness is organized around eight questions, of which questions 5 through 8 address overhead and accuracy.
• Q1: Are wear out and its effects measurably observable?
• Q2: Can voltage reduction virtually manifest wear-out faults?
• Q3: Are the manifested faults exposed to a higher level?
• Q4: Are the faults exposed to the higher level detected?
• Q5: What are the overheads?
• Q6: What is the delay to predict the wear out?
• Q7: When does this technique provably fail to predict wear out?
• Q8: How does this technique compare to the current state-of-the-art methods?

We examine each question for logic and SRAM. By design, we achieve low complexity, which was our other key goal.

Methodology

Our evaluation of the Aged Full-Chip Predictor uses a prototype system we built on the basis of the OpenRISC processor (see Figure 5). For logic and Aged-SDMR, our general philosophy is as follows:

• Use Spice and MOSRA with the 32-nm silicon-on-insulator library to evaluate any gate-level effects.
• Use gate-level delay-aware simulations to check for timing faults.
• Use full-system emulation on the field-programmable gate array when actual runtime data is required.

For Aged-AsymChk, our evaluation is similar:

• Use Spice and MOSRA to evaluate any gate-level effects, including the noise margin.
• Use the noise-margin results to determine failures in SRAM reads.
• Use analytical models and workload measurements to determine the effect of applications on wear out.

One difference is that we run more benchmarks using larger input sets, totaling 35 and spanning SPEC2K, SPEC2006, MediaBench, and Parboil, to capture cache and SRAM effects more representatively.
Aged-SDMR results

Table 1 summarizes the key results for Aged-SDMR, and Table 2 compares Aged-SDMR to three state-of-the-art techniques.9–11

Aged-AsymChk results

We address the evaluation questions for Aged-AsymChk in detail below.

Understanding degradation (Q1). Degradation in SRAM devices is measurably observable and cannot be statically determined because it depends on the switching activity. Figure 3 previously showed this aging behavior at the cell level. Figure 6a shows the wear out at the application level for every cell in a 64-Kbyte data cache (a two-way set associative, level-1 cache with 64-byte blocks).
Here, we quantify and visualize wear-out intensity using a simple model: we count each cycle in which a cell holds 1 as one unit of wear out, and we assume every transition to 0 contributes −1/100th of one unit (modeling NBTI recovery). For all applications, we consider a 200-million-cycle window, and pixel values are normalized to maximum wear out. Two banks form the cache ways, shown side by side. We also determined the average and standard deviation of wear out across all the bits with all 35 applications and computed them to be 0.278 and 0.2895, respectively. Even simply looking at distributions of wear out among the bits, we observe they sometimes follow a normal distribution but with large differences in standard deviation and variance across benchmarks (see Figure 6b). These measurements demonstrate the diversity and substantiate two points: the degradation is highly application dependent, and degradation within the different cells of an SRAM block can vary significantly.

Figure 5. Evaluation setup. We built a prototype system based on the OpenRISC processor to evaluate the Aged Full-Chip Predictor.

Table 1. Aged-SDMR results.
Evaluation question | Results
Understanding degradation (Q1) | Delay degradation in CMOS logic is measurably observable. Dependent on factors including switching activity (cannot be statically determined).
Manifesting faults (Q2) | Reducing $V_{DD}$ mimics aging. For example, a 50-mV (4.1%) reduction corresponds to predicting up to nine months in advance.
Exposing faults (Q3) | While in Aged-SDMR mode, timing faults indicate impending hard or soft breakdowns. Virtual aging induces timing faults at a rate between 0 and 9.8%.
Detecting faults (Q4) | Faults introduced in Aged-SDMR mode translate to architectural errors and can be caught without escapes. Empirically, errors were seen in at least 0.02% of cycles and were caught within a few samples.
Estimating overheads (Q5) | Aged-SDMR has small area (8.9%), power (2.54%), and energy (0.7%) overheads.
Delay to predict (Q6) | We can guarantee an upper bound on Aged-SDMR’s prediction latency mathematically, based on defect and sampling rates. The longest latency to predict is 0.4 days.
When the technique does not work (Q7) | Aged-SDMR cannot predict faults that do not start as delay faults. For delay-based faults, missed sites are those that have high switching activity but do not affect the architectural trace (integer benchmarks might do this to the floating-point pipeline). If more than 0.4 days of life remain, Aged-SDMR will still predict correctly. This masking scenario is rare in commercial designs because power/value gating avoids unnecessary switching.
Comparison to state-of-the-art methods (Q8) | Aged-SDMR is comparable, if not better, on other metrics and also provides generality. Previous techniques do not provide generality and accuracy, leaving fast gates (30 to 40% of gates) uncovered.

Table 2. A comparison of Aged-SDMR and three state-of-the-art techniques.
Technique | Area overhead (%) | Power overhead (%) | Time to predict | Prediction horizon
Online wear-out prediction9 | 4.6† | 8.6† | 4 days | 2 years, 4 days
WearMon11 | <14‡ | Not reported | Varies | Not reported
FIRST10 | Not reported | 0 | 1 day | 9 months, 1 day*
Aged-SDMR | 8.94 | 3.2 | 0.4 days | 9 months, 0.4 days
† For every eight signals monitored.
‡ Rough estimates from field-programmable gate array use numbers reported by the authors.
* Assuming a virtual aging mechanism similar to this work.

Manifesting faults (Q2). As we demonstrated earlier, reducing $V_{DD}$ mimics aging (see Figure 3).
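Returning to the wear-out intensity model used for Q1, the bookkeeping for a single cell can be sketched as below: each cycle in which the cell holds 1 adds one unit, and each transition to 0 subtracts 1/100th of a unit. The function names and the per-cycle trace format are assumptions made for illustration; the source does not describe how the counts were actually collected.

```python
from typing import List, Sequence

RECOVERY_CREDIT = 1.0 / 100.0  # units removed per 1 -> 0 transition (from the text)


def wear_out_units(trace: Sequence[int]) -> float:
    """Accumulate wear-out units for one cell from its per-cycle stored values."""
    units = 0.0
    previous = None
    for value in trace:
        if value == 1:
            units += 1.0                 # one unit per cycle spent holding 1
        elif previous == 1:
            units -= RECOVERY_CREDIT     # transition to 0 models NBTI recovery
        previous = value
    return units


def normalize(per_cell: List[float]) -> List[float]:
    """Scale per-cell wear out so that the most worn cell maps to 1.0."""
    peak = max(per_cell)
    if peak <= 0:
        return [0.0 for _ in per_cell]
    return [units / peak for units in per_cell]


# Hypothetical four-cycle traces for three cells.
traces = [
    [1, 1, 1, 1],  # constantly holds 1: worst-case stress
    [1, 0, 1, 0],  # toggles every cycle
    [0, 0, 0, 0],  # never stressed under this model
]
print(normalize([wear_out_units(t) for t in traces]))
```

Normalizing to the most worn cell mirrors the statement that pixel values are normalized to maximum wear out, so intensities from different applications can be compared on a 0-to-1 scale.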
Empirically, for example, a 45-mV reduction emulated 28 weeks of aging.

Exposing faults (Q3) and detecting faults at a higher level (Q4). Figure 2d showed that the end effect of SRAM cell aging is read-stability failure. By design, writing 1s and then reading them exposes the wear-out fault under virtual aging.

Estimating overheads (Q5). In terms of area, there is practically no additional overhead; we simply reuse the existing BIST circuitry. In terms of performance slowdown, Aged-AsymChk can be run quite infrequently. Because it predicts wear out without memory corruption and is 100 percent accurate, the only requirement is to run at periods less than the age mimicked by virtual aging, which is on the order of weeks. On the basis of our empirical data, the overhead of checking is pessimistically on the order of 1 million cycles. Even assuming that S-epochs are activated as often as every 100 context switches, which at a 5-ms OS scheduling quantum would be half a second, a 1-GHz processor at one instruction per cycle would have negligible overhead (0.2 percent). Therefore, Aged-AsymChk introduces no significant performance, power, or area overhead to the system.

Delay to predict (Q6). Compared to logic, the delay to predict for SRAM is on the order of milliseconds, because the prediction happens in a single S-epoch and is application independent. The delay guarantees for logic are probabilistic and are for the worst case, because some sampling windows are required to guarantee overlap of the DMR window with a fault occurrence by the application.

When the technique does not work (Q7). Failures in SRAM that do not start as read failures cannot be detected. Although these exist and include electromigration, for example, there is evidence that NBTI, which we cover, is dominant. Unlike the logic case, for device faults that adhere to the model, Aged-AsymChk is 100 percent correct because it is based on the formal BIST model that can generate vectors with 100 percent coverage.

Figure 6. Application-level behavior of the wear out in the SRAM cells. (a) Visualization of the SRAM wear out in a 64-Kbyte data cache for four applications (164-gzip, 175-vpr, 429-mcf, and 456-hmmer). Wear out of each SRAM cell depends on the application behavior. (b) SRAM cells distribution (x-axis: normalized wear-out intensity; y-axis: percentage of bits; curves: gzip, vpr, mcf, and hmmer). A point (x, y) indicates that y percent of the bits in the SRAM have the wear-out intensity of x.

Table 3. Defect rates (parts per million) of SRAM arrays

$f_c(t)$ | ECC (16 data bits, 6 ECC bits), single failure | ECC (16 data bits, 6 ECC bits), double failure | ECC (256 data bits, 10 ECC bits), single failure | ECC (256 data bits, 10 ECC bits), double failure
$10^{-7}$ | 4,495 | 0 | 53,018 | 1
$10^{-6}$ | 44,055 | 0 | 419,881 | 72
$10^{-5}$ | 362,700 | 47 | 995,662 | 7,179
$10^{-4}$ | 988,903 | 4,716 | 999,999 | 508,041
$10^{-3}$ | 1,000,000 | 373,043 | 1,000,000 | 1,000,000

Comparison to state-of-the-art methods (Q8). As we mentioned earlier, prior work does not provide low overhead, high accuracy, and low complexity. Quantitatively, Aged-AsymChk either eliminates silent data corruptions for baselines without ECC or it increases the array’s lifetime.
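The low-overhead part of this comparison rests on the estimate given under "Estimating overheads (Q5)": a check of about 1 million cycles, run once every 100 context switches at a 5-ms scheduling quantum on a 1-GHz processor. As a minimal sketch of that arithmetic (the function and constant names are ours, introduced only for illustration, and the numbers are the ones quoted in the text):

```python
# Back-of-the-envelope check of the Aged-AsymChk performance overhead.
# All four inputs are the figures quoted in the text; the names are hypothetical.
CHECK_CYCLES = 1_000_000           # pessimistic cost of one check, in cycles
CONTEXT_SWITCHES_PER_CHECK = 100   # one S-epoch every 100 context switches
QUANTUM_S = 5e-3                   # 5-ms OS scheduling quantum
CLOCK_HZ = 1e9                     # 1-GHz processor, one instruction per cycle


def check_overhead(check_cycles, switches_per_check, quantum_s, clock_hz):
    """Return (seconds between checks, fraction of cycles spent checking)."""
    period_s = switches_per_check * quantum_s
    cycles_between_checks = period_s * clock_hz
    return period_s, check_cycles / cycles_between_checks


period_s, overhead = check_overhead(
    CHECK_CYCLES, CONTEXT_SWITCHES_PER_CHECK, QUANTUM_S, CLOCK_HZ
)
print(f"check period: {period_s:.1f} s, overhead: {overhead:.1%}")
# prints: check period: 0.5 s, overhead: 0.2%
```

The sketch treats the check as extra work on top of the cycles between checks. Counting it inside the period instead changes the result only in the third significant digit.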
We developed an SRAM array defect-rate model to show how we can extend the average proficient lifetime by 14 months, considering common wear-out patterns. We first used a fixed cell-failure model (excluding dynamic sources of wear out such as the application and temperature) and then extended those results, considering time-varying failure rates.

Failure model preliminaries. Using basic probability, we built a simple analytical model for how wear out affects SRAM array failure. The key input was a cell’s read failure probability at a given time ($f_c(t)$). (The read failure probability indicates the probability that a six-transistor SRAM cell has a read failure at a given time. For example, the read failure probability $10^{-7}$ indicates that one SRAM cell out of $10^{7}$ cells has read failure.) We considered an SRAM made of n blocks and used cache-block granularity single-error correction and double-error detection ECC. We used two cache block sizes with k data bits and e ECC bits: (16, 6) and (256, 10). Also, we define the defect rate as the defective parts per million. Furthermore, the single-failure defect rate considers one bit failure to be a defect, whereas the double-failure defect rate considers two failures (in a single block) to be a defect. ECC-only arrays are proficient only until the first error, at which point they must be decommissioned to prevent uncorrectable errors. Arrays with prediction capability are proficient until just before the second error, extending their lifetime.

SRAM array model for fixed defect rates. We can build a defect rate model, based on the binomial probability model, for an SRAM array by calculating the failure probability of bits in a cache block ($f_c(t)$), then the failure probability of blocks in the array. We consider both single-failure (Equation 1) and double-failure (Equation 2) cases below.
\[ f_{\mathrm{block},1}(t) = 1 - \left(1 - f_c(t)\right)^{k+e} \tag{1} \]

\[ f_{\mathrm{block},2}(t) = 1 - \left[ \left(1 - f_c(t)\right)^{k+e} + \binom{k+e}{1}\, f_c(t)\, \left(1 - f_c(t)\right)^{k+e-1} \right] \tag{2} \]

\[ f_{\mathrm{array}}(t) = 1 - \left[ \left(1 - f_{\mathrm{block},i}(t)\right)^{n} \right] \tag{3} \]

Equations 1 and 2 calculate the probability that at least one or at least two bits, respectively, in a given $(k+e)$-bit block are erroneous at a given time. Equation 3 finds the probability that at least one block in a given SRAM array made of n blocks is faulty at a given time, where $i$ is 1 for the single-failure case and 2 for the double-failure case. Table 3 shows the single- and double-failure defect rates for various cell failure probabilities ($f_c(t)$) and two extreme granularities of ECC.

We can draw three implications from Table 3. First, as expected, fine-grained ECC has a lower defect rate. Second, at low cell-failure probabilities, the number of failures with only a single defect is orders of magnitude more than when allowing prediction. And third, schemes decommissioning arrays and cache blocks at first failure incur wasted lifetime: nearly 100 and 36 percent for coarse- and fine-grained ECC, respectively, with $f_c(t) = 10^{-5}$.

Figure 7. Wear-out models and added life from effective prediction. (a) The (x, y) point indicates the read failure probability of an SRAM cell normalized to $10^{-6}$ ($f_c(t)$ is y after x months); x-axis: age in months; y-axis: normalized $f_c$; curves: optimistic, linear, and pessimistic. (b) The (x, y) point indicates that the lifetime of x percentage of total fabricated SRAM arrays is extended by y months; x-axis: percentage of SRAM arrays; y-axis: months of added life; curves: optimistic, linear, and pessimistic.

Extending results for dynamic wear out.
To quantify the wasted lifetime for SRAM arrays, we extend the model to include dynamic SRAM wear out, the primary effect of which is to cause $f_c(t)$ to become time dependent (increasing over time). Our extended model must incorporate several issues. First, the wear out of different bits will vary, implying that a single $f_c(t)$ no longer models the entire array. Second, depending on the SRAM’s usage, the $f_c(t)$ changes to some value by the end of the SRAM array’s lifetime. Third, $f_c(t)$ changes at some rate with time to reach this final value. Finally, we must determine when the array is single-failure defective or double-failure defective.

These phenomena are highly application dependent, and we make some simplifying assumptions to capture first-order effects. First, we assume the highest $f_c(t)$ of the bits in a block, thus providing a lower-bound estimate on wasted life. Second, we assume $f_c(t)$ changes by one order of magnitude due to wear out; this has strong empirical evidence from circuit literature.3,12 Finally, to model the rate of change of $f_c(t)$, we consider reciprocal, linear, and exponential change, as in Figure 7a. Linear change is likely the common case. Exponential and reciprocal represent the worst (pessimistic) case and best (optimistic) case for the benefits of our technique, respectively.

We considered a 36-month period discretized at monthly granularity, and we assumed the second error occurs at the end of this period. We used $f_c(t)$ at each month to calculate the defect rates, which determine how many arrays are wasted due to early decommissioning based on the first failure.

Figure 7b shows the dynamic wear-out model’s results in terms of months of added life for a percent of the SRAM arrays, which suggests two things. First, the lifetime can be extended significantly, by 17, 14, and 7 months on average for the three scenarios (optimistic, linear, and pessimistic). Second, significant fractions of the SRAM arrays see an improvement: 95, 87, and 46 percent of arrays, respectively.
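The per-month defect-rate calculation described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation. The article does not give the number of blocks n; `N_BLOCKS = 2048` is our inference, chosen because it comes close to the Table 3 entries (for example, roughly 4,495 ppm for the single-failure case at $f_c = 10^{-7}$ with (16, 6) ECC). The `linear_fc` ramp (a tenfold increase from $10^{-6}$ over 36 months) is likewise a hypothetical stand-in for the linear curve of Figure 7a, and all function names are ours.

```python
import math


def block_failure(fc, bits, min_failures):
    """Probability that at least `min_failures` of `bits` cells fail.

    min_failures=1 follows Equation 1; min_failures=2 follows Equation 2.
    """
    p_zero = (1.0 - fc) ** bits                       # no cell fails
    if min_failures == 1:
        return 1.0 - p_zero
    if min_failures == 2:
        p_one = bits * fc * (1.0 - fc) ** (bits - 1)  # exactly one cell fails
        # Clamp guards against tiny negative values from rounding.
        return max(0.0, 1.0 - (p_zero + p_one))
    raise ValueError("min_failures must be 1 or 2")


def array_defect_ppm(fc, k, e, n_blocks, min_failures):
    """Equation 3 applied to a block of k data + e ECC bits, in parts per million."""
    f_block = block_failure(fc, k + e, min_failures)
    # expm1/log1p keep precision when f_block is very small.
    f_array = -math.expm1(n_blocks * math.log1p(-f_block))
    return 1e6 * f_array


def linear_fc(month, fc_start=1e-6, growth=10.0, months=36):
    """ASSUMED linear wear-out ramp: fc grows `growth`-fold over `months`."""
    return fc_start * (1.0 + (growth - 1.0) * month / months)


N_BLOCKS = 2048  # ASSUMPTION: inferred by back-solving Table 3, not stated in the article

for month in (0, 12, 24, 36):
    fc = linear_fc(month)
    for k, e in ((16, 6), (256, 10)):
        single = array_defect_ppm(fc, k, e, N_BLOCKS, 1)
        double = array_defect_ppm(fc, k, e, N_BLOCKS, 2)
        print(f"month {month:2d}  ECC({k},{e})  fc={fc:.2e}  "
              f"single={single:,.0f} ppm  double={double:,.0f} ppm")
```

The gap between the single-failure and double-failure columns at a given month is the fraction of arrays that a first-failure decommissioning policy retires while they could still be used, which is the wasted lifetime that prediction aims to recover. Reproducing the 17, 14, and 7 month averages would additionally require the exact curve shapes of Figure 7a, which this sketch does not attempt.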
By providing a unified technique for error prediction in both logic and SRAM settings, which is low overhead and has high fault coverage, the Aged Full-Chip Predictor could serve as an important component for future fault-dominated technologies. The mechanisms behind the concepts of virtual aging and sampling are well understood and easy to implement, making the idea attractive and practical to deploy. One primary implication is that future designs can more aggressively provision the resources for recovering from soft errors (such as ECC in SRAMs), while relying on the Aged Full-Chip Predictor for the prediction and detection of hard errors. Looking forward, understanding the relationship between delay degradation and failure modes in far-out semiconductor technologies will be the key to using virtual aging to address future reliability challenges.

References

1. A. Haggag et al., “Realistic Projections of Product Fails from NBTI and TDDB,” Proc. 44th Ann. IEEE Int’l Reliability Physics Symp., 2006, pp. 541–544.
2. A.W. Strong et al., Reliability Wearout Mechanisms in Advanced CMOS Technologies, vol. 12, Wiley-IEEE Press, 2009.
3. K. Kang et al., “Impact of Negative-Bias Temperature Instability in Nanoscale SRAM Array: Modeling and Analysis,” IEEE Trans.
Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 10, 2007, pp. 1770–1781.
4. A. Bansal et al., “Impacts of NBTI and PBTI on SRAM Static/Dynamic Noise Margins and Cell Failure Probability,” Microelectronics Reliability, vol. 49, no. 6, 2009, pp. 642–649.
5. T.T.-H. Kim and Z.H. Kong, “Impact Analysis of NBTI/PBTI on SRAM VMIN and Design Techniques for Improved SRAM VMIN,” J. Semiconductor Tech. and Science, vol. 13, no. 2, 2013, pp. 87–97.
6. S. Kothawade et al., “Mitigating NBTI in the Physical Register File through Stress Prediction,” Proc. IEEE 30th Int’l Conf. Computer Design, 2012, pp. 345–351.
7. S. Nomura et al., “Sampling + DMR: Practical and Low-Overhead Permanent Fault Detection,” Proc. 38th Ann. Int’l Symp. Computer Architecture, 2011, pp. 201–212.
8. R. Balasubramanian and K. Sankaralingam, “Virtually-Aged Sampling DMR: Unifying Circuit Failure Prediction and Circuit Failure Detection,” Proc. 46th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2013, pp. 123–135.
9. J. Blome et al., “Self-Calibrating Online Wearout Detection,” Proc. 40th Ann. IEEE/ACM Int’l Symp. Microarchitecture, 2007, pp. 109–122.
10. J.C. Smolens et al., “Detecting Emerging Wearout Faults,” 3rd IEEE Workshop Silicon Errors in Logic-System Effects, 2007; http://jared.smolens.org/documents/first smolens-selse07.pdf.
11. B. Zandian et al., “WearMon: Reliability Monitoring Using Adaptive Critical Path Testing,” Proc. 40th Ann. IEEE/IFIP Int’l Conf. Dependable Systems and Networks, 2010, pp. 151–160.
12. K. Kang et al., “Estimation of Statistical Variation in Temporal NBTI Degradation and Its Impact on Lifetime Circuit Performance,” Proc. IEEE/ACM Int’l Conf. Computer-Aided Design, 2007, pp. 730–734.

Amir Yazdanbakhsh is a PhD student in the School of Computer Science at the Georgia Institute of Technology and a research assistant in the Alternative Computing Technologies (ACT) Lab. His research interests include computer architecture, approximate general-purpose computing, mixed-signal accelerator design, machine learning, and programming languages for hardware design. Yazdanbakhsh has an MS in computer engineering from the University of Wisconsin–Madison and an MS in electrical and computer engineering from the University of Tehran. He is a student member of IEEE. Contact him at a.yazdanbakhsh@gatech.edu.

Raghuraman Balasubramanian is a digital design engineer at Google. His research interests include microprocessor architecture and circuit design. Balasubramanian has an MS in computer science from the University of Wisconsin–Madison, where he completed the work for this article. Contact him at raghuraman.b@gmail.com.

Tony Nowatzki is a PhD student in the Department of Computer Sciences at the University of Wisconsin–Madison and a member of the Vertical Research Group. His research interests include architecture and compiler codesign and mathematical modeling. Nowatzki has an MS in computer science from the University of Wisconsin–Madison. He is a student member of IEEE. Contact him at tjn@cs.wisc.edu.

Karthikeyan Sankaralingam is an associate professor in the Department of Computer Sciences and the Department of Electrical and Computer Engineering at the University of Wisconsin–Madison, where he also leads the Vertical Research Group. His research interests include microarchitecture, architecture, and very large-scale integration. Sankaralingam has a PhD in computer science from the University of Texas at Austin. He is a senior member of IEEE. Contact him at karu@cs.wisc.edu.