Self-calibrating Online Wearout Detection

Authors: Jason Blome, Shuguang Feng, Shantanu Gupta, Scott Mahlke
MICRO-40, December 3, 2007
University of Michigan, Electrical Engineering and Computer Science

Motivation

“Designing Reliable Systems from Unreliable Components…” - Shekhar Borkar (Intel)
• More failures to come
• Failures will be wearout induced
[Srinivasan, DSN‘04] [Borkar, MICRO‘05]

Current Approaches

Traditional
• Design margins
• Burn-in

Detection: based on replication of computation
• TMR (Tandem/HP NonStop servers)
• DIVA (Bower, MICRO’05)

Prediction: utilizes precise analytical models and/or sensors
• Canary circuits (Sentinel Silicon, RidgeTop)
• RAMP (Srinivasan, UIUC/IBM)

Slide callouts: Impractical, Static, Costly

Wearout Mechanisms

Many failure mechanisms have been shown to be progressive:
• Hot carrier injection (HCI)
• Negative Bias Temperature Instability (NBTI)
• Electromigration (EM)
• Oxide Breakdown (OBD)
[Figure: MOSFET cross-section (G, S, D, B, N+ regions, oxide, P-well) with gate leakage currents labeled]

Objective

Propose a failure prediction technique that exploits the progressive nature of wearout.
Monitor impact on path delays.

Prediction                               Detection
• Monitors evolution of wearout          • Identifies existing fault
• Proactive                              • Reactive
• Enables failure avoidance/mitigation   • Enables failure recovery
• Continuous feedback                    • End-of-life feedback
• False negatives and positives          • False negatives

Oxide Breakdown (OBD)

Accumulation of defects leads to a conductive path.
[Figure: Percolation Model of defect accumulation in the oxide, showing ΔI_oxide; Stathis, JAP‘06]

OBD HSPICE Model

Post-breakdown leakage modeling
[Figure: MOSFET cross-section with gate leakage components I_gs, I_gd, I_gcs, I_gcd, and I_gb]
[BSIM4.6.0, ‘06] [Rodriguez, Stathis, Linder, IRPS ‘03]

$I_{gcs}$,
$I_{gcd}$, and $I_{gb}$ remain unchanged

$$I_{gs} = K \cdot I_{gs0}, \qquad I_{gd} = K \cdot I_{gd0}$$

Characterization Testbench

90nm standard cell library
[Figure: characterization testbench; labels: t_circuit, t_cell, Gate UUT, DC, BUFX4, FO4 GATE, FO4 BUFX4]

Impact on Propagation Delay

[Figure: propagation delay plot]

Delay Profiling Unit (DPU)

[Figure: DPU block diagram; labels: input signal, uArch Module, Latency Sampling]

TRIX Analysis

The magnitude of the divergence between TRIX_global and TRIX_local reflects the amount of degradation.

TRIX Analysis Details

Exponential Moving Average (EMA)

$$EMA(t) = EMA_{t-1} + \alpha \, (price_t - EMA_{t-1})$$

where $\alpha$ is defined by the window size and $price_t$ is the sample at time $t$.

Triple-smoothed Exponential Moving Average

$$EMA1(t) = EMA1_{t-1} + \alpha \, (price_t - EMA1_{t-1})$$
$$EMA2(t) = EMA2_{t-1} + \alpha \, (EMA1_t - EMA2_{t-1})$$
$$EMA3(t) = EMA3_{t-1} + \alpha \, (EMA2_t - EMA3_{t-1})$$

Noisy Latency Profile

[Figure: Raw Latency Profile, TRIX Profile (local), and TRIX Profile (global); y-axis: Percent Nominal Delay (%), 94 to 110; x-axis: Increasing Age]

DPU with TRIX Hardware

[Figure: block diagram; labels: input signal, Latency Sampling, TRIX_l Calculation, TRIX_g Calculation, Prediction]

Wearout Detection Unit (WDU)

[Figure: block diagram; labels: Latency Sampling, TRIX_l Calculation, TRIX_g Calculation, Prediction]

Evaluation Framework

[Figure: evaluation flow; blocks: OR1200 Verilog; 90nm Library; Synthesis and Place and Route; Fully Synthesized, P&R, OR1200 Core; MediaBench Suite; Workload Simulator; Monte Carlo Simulator; Gate-level Processor Simulator; Timing, Power, and Temperature Simulations; HSPICE Simulations; OBD Wearout Model; Wearout Simulator]
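
The triple-smoothed EMA from the TRIX Analysis Details slide, and the local/global divergence from the TRIX Analysis slide, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the hardware described in the talk: the `2 / (N + 1)` mapping from window size to α is a common convention that the slides do not confirm, and the window sizes, the seeding with the first sample, the threshold, and all function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


def alpha_from_window(window: int) -> float:
    """Smoothing factor for a window of N samples (assumed 2 / (N + 1) convention)."""
    if window < 1:
        raise ValueError("window must be >= 1")
    return 2.0 / (window + 1)


@dataclass
class TripleEMA:
    """Three cascaded EMAs: EMA3 smooths EMA2, which smooths EMA1 of the samples."""
    alpha: float
    ema1: Optional[float] = None
    ema2: Optional[float] = None
    ema3: Optional[float] = None

    def update(self, sample: float) -> float:
        if self.ema1 is None:
            # Assumption: seed every stage with the first sample.
            self.ema1 = self.ema2 = self.ema3 = sample
        else:
            # Each stage follows EMA(t) = EMA(t-1) + alpha * (input(t) - EMA(t-1)).
            self.ema1 += self.alpha * (sample - self.ema1)
            self.ema2 += self.alpha * (self.ema1 - self.ema2)
            self.ema3 += self.alpha * (self.ema2 - self.ema3)
        return self.ema3


def divergence_profile(samples: List[float],
                       local_window: int = 4,
                       global_window: int = 32) -> List[float]:
    """Per-sample difference between the short-window and long-window outputs."""
    local = TripleEMA(alpha_from_window(local_window))
    global_ = TripleEMA(alpha_from_window(global_window))
    return [local.update(s) - global_.update(s) for s in samples]


def first_flag(samples: List[float], threshold: float) -> Optional[int]:
    """Index of the first sample whose divergence exceeds the (hypothetical) threshold."""
    for i, d in enumerate(divergence_profile(samples)):
        if d > threshold:
            return i
    return None


# Synthetic latency trace in percent of nominal delay: flat, then slowly rising.
flat = [100.0] * 50
aging = flat + [100.0 + 0.1 * i for i in range(1, 101)]
```

On a flat trace the two averages coincide and nothing is flagged. Once latency starts to drift upward, the short-window average follows the drift more closely than the long-window one, so the divergence grows and eventually crosses the threshold.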

WDU Accuracy

[Figure: bar chart of Life Expended and Signals Flagged; y-axis: Percentage (%), 0 to 120; x-axis: Module (ALU, Register File, LSU, Next PC)]

WDU Overhead

[Figure: Percentage Overhead (%) vs. # Signals Monitored (1, 2, 4, 8); series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware; y-axis: 0 to 50]

WDU Overhead

[Figure: Percentage Overhead (%) vs. # Signals Monitored (1, 2, 4, 8); series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware; y-axis: 0 to 3]

Long-term Vision

Introspective Reliability Management (IRM)
• Intelligent reliability management directed by on-chip sensor feedback
• Prospective sensors
  • Delay (WDU)
  • Leakage/Vt
  • Temperature

Introspective Reliability Management

[Figure: IRM architecture; labels: OS, Scheduled Jobs, Virtualization Layer, Job Assignment, Thread Migration, Reconfiguration, DVFS Settings, Power/CLK Gating, IRM Policy, Reliability Assessment, Runtime Analysis, Aggregate Analysis, Filtering and Analysis, Raw Sensor Data, Filtered Data Stream, Processed Data, WDU]

Conclusions

• Many progressive wearout phenomena impact device-level performance
  • It’s possible to characterize this impact and anticipate failures
• WDU performance
  • Failure predicted within 20% of end of life (tunable)
  • Area overhead < 3% (hybrid)
• Low-level sensors can be used to enable intelligent reliability management

Questions?