Layali Rashid

Errors that occur in bursts, at the same location, when the fault is activated [WDSN07].
Faults which occur frequently and irregularly for a period of time [ASPLOS08].
A persistent defect that causes zero or more failures, such as a speck of conductive dust partially bridging two traces [EuroSys11].
Bursts of errors that recur nondeterministically [Layali].

257 servers for 1.2 years. Memory SBE rate:

Number of Errors    Frequency (%)
0                   47.5
1-5                 31.5
6-99                13.3
100-1000            5.8
>1000               1.9

6.2% of the memory subsystems were affected by ifaults.
Processor buses in 2 servers had 15 to 7104 SBE bursts.

Rates of failures (regardless of the error type).

Failure                   P(1 fail)   P(2 fail | 1 fail)   P(3 fail | 2 fail)
CPU (5 working days)      0.3%        30%                  56%
CPU (30 working days)     0.5%        34%                  59%
DRAM (5 working days)     0.03%       11%                  46%
DRAM (30 working days)    0.05%       8%                   50%
Disk (5 working days)     0.2%        29%                  53%
Disk (30 working days)    0.3%        29%                  59%

Rates of ifaults out of total failures:

Fault location    Rate of ifaults
CPU               39%
DRAM              19%
Disk              39%

When do ifaults recur?

Fault location    Recur within 10 days    Recur within a month
CPU               84%                     97%
Disk              86%                     99%

How many times does an ifault recur?
◦ MTTF decreases as more failures occur.
◦ Not exponentially distributed.

Location            Source                                   Result in ifault?
Wires               Electromigration                         Yes
                    Stress migration                         Not mentioned
                    Crosstalk                                Yes
Transistor          Gate oxide breakdown                     Yes
                    Hot carrier injection                    Not mentioned
                    Negative bias temperature instability    Not mentioned
Package and pins    Thermal cycling                          Not mentioned
Other               Manufacturing defects                    Yes
                    Dust                                     Yes

[Gate oxide breakdown]
[Figure, from Wikipedia: gate stack labeled PolySi gate, SiO2, substrate, Ileak.]
Traps exist in SiO2 due to manufacturing defects or high voltage.
Hard breakdown
Soft breakdown
Consequences:
◦ ↑ Leakage current
◦ ↑ Power consumption
Possible solutions:
◦ High-k dielectric.
◦ Burn-in.

[Electromigration]
[Figure credit: University of Kiel]
Thinner wires, high current density and temperature.
Metal film imperfections.
Consequences:
◦ Voids → stuck opens
◦ Hillocks → stuck shorts

[Stress migration]
Thermal stress.
◦ Growth of voids
Contributes to electromigration.
Consequences:
◦ Voids → stuck opens

[Crosstalk]
Major problem during layout synthesis.
Consequences:
◦ Delays and glitches

[Thermal cycling]
Appears in package and die interface (e.g. solder joints).
Large cycles vs. small cycles
Consequences: ?

Other Wearout Mechanisms
Hot Carrier Injection
Negative Bias Temperature Instability

[Hot carrier injection]
Dominant reliability concern for nMOS transistors.
Happens during normal operating temperature ranges.
[Figure: nMOS cross-section labeled Vg, Vs, Vd, Vbs, Ig, n+, p+.]
Consequences:
◦ Decreases drain current
◦ Slower IC

[Negative bias temperature instability]
◦ Dominant reliability concern for pMOS.
◦ Happens at high temperature.
[Figure: pMOS cross-section labeled Vg, Vs, Vd, Vbs, PolySi, SiO2, p+, n+, substrate, with H, H2 and Si atoms.]
Consequences:
◦ Reduces Vt
◦ Reduces IC speed (~20%)
◦ Path delay error

Location            Source                                   Model
Wires               Electromigration                         Short and open
                    Stress migration                         Short and open
                    Crosstalk                                Delay and glitch
Transistor          Gate oxide breakdown                     Ileakage
                    Hot carrier injection                    Path delay
                    Negative bias temperature instability    Path delay
Package and pins    Thermal cycling
Other               Manufacturing defects
                    Dust

Duration:
◦ Supply voltage fluctuation lasts from 5 to 30 cycles.
◦ Temperature effects evolve over hundreds of microseconds or milliseconds.
◦ Soft breakdown evolves over a few days, then becomes hard breakdown.
◦ S: 1x10^4 s + R: 2x10^4 s

Transistor
◦ Stuck-open: last output
◦ Stuck-short: IDDQ
◦ Delay

Wire
◦ Open: stuck-open (last output), stuck-at
◦ Short: IDDQ
◦ Delay
◦ Bridging: logical AND/OR

Intermittent faults are loosely defined and their causes are not well explored.
We need more accurate results on the rates of ifaults
◦ Rates and number of recurrences
Do NBTI, stress migration, thermal cycling and hot carrier injection cause ifaults?
◦ Evidence from scientific studies or field data.

Backup Slides

From [AdvancesinRadioScience09]

From [AdvancesinRadioScience09]

Example from Dr. Ivanov's course

Copyright 2001, Agrawal & Bushnell

RAM
◦ Pattern sensitivity: BDS
◦ Coupling: BDS

Wire
◦ Open: stuck-open (last output), stuck-at
◦ Short
◦ Delay
◦ Bridging:
  wired AND/OR: IDDQ
  dominant: IDDQ
  dominant AND/OR: IDDQ

[Wikipedia] Many articles.
[WDSN07] Impact of Intermittent Faults on Nanocomputing Devices. WDSN, 2007.
[D3T] Emphasis on the existence of intermittent faults in embedded systems. IEEE Workshop on Defect and Data Driven Testing (D3T), 2010.
[ASPLOS08] Adapting to intermittent faults in multicore systems.
[EuroSys11] Cycles, Cells and Platters: An Empirical Analysis.
[IEEETrans.onElectronDevices96] Soft breakdown of ultra-thin gate oxide layers.
[ACMSurveys10] Electromigration for Microarchitects, Intel.
[Applied Physics Letters91] Stress-migration related electromigration damage mechanism in passivated, narrow interconnects.
[AdvancesinRadioScience09] Impact of negative and positive bias temperature stress on 6T-SRAM cells.
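
Worked example for the "Rates of failures" table: the conditional rates can be chained to estimate how often a machine sees repeated failures. This is a minimal sketch, not part of the original study. It assumes that "P(k fail)" means "at least k failures within the observation window", so the chain rule applies directly. The names RATES, chained and recurrence_ratio are hypothetical and introduced only for illustration.

```python
"""Chain the conditional failure rates from the 'Rates of failures' table."""

# Values copied from the table, as fractions:
# (P(1 fail), P(2 fail | 1 fail), P(3 fail | 2 fail)).
RATES = {
    "CPU (5 working days)": (0.003, 0.30, 0.56),
    "CPU (30 working days)": (0.005, 0.34, 0.59),
    "DRAM (5 working days)": (0.0003, 0.11, 0.46),
    "DRAM (30 working days)": (0.0005, 0.08, 0.50),
    "Disk (5 working days)": (0.002, 0.29, 0.53),
    "Disk (30 working days)": (0.003, 0.29, 0.59),
}


def chained(p1, p2_given_1, p3_given_2):
    """Return (P(at least 2 failures), P(at least 3 failures)).

    Assumption: each conditional rate is conditioned on the previous
    event in the chain, so the joint probabilities are plain products.
    """
    p2 = p1 * p2_given_1
    p3 = p2 * p3_given_2
    return p2, p3


def recurrence_ratio(p1, p2_given_1):
    """How many times likelier a second failure is than a first one."""
    return p2_given_1 / p1


if __name__ == "__main__":
    for name, (p1, p21, p32) in RATES.items():
        p2, p3 = chained(p1, p21, p32)
        ratio = recurrence_ratio(p1, p21)
        print(f"{name:24s} P(>=2)={p2:.5%}  P(>=3)={p3:.5%}  ratio={ratio:.0f}x")
```

Under this reading, for CPU (30 working days) the chance of at least two failures is 0.5% x 34% = 0.17%, and a machine that has already failed once is about 68 times likelier to fail again than a machine is to fail in the first place. This is in line with the observation that MTTF decreases as more failures occur.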