Microprocessor Reliability
Robert Pawlowski
ECE 570 – 2/19/2013

Reliability
• Involves different aspects of a processor that can affect performance and functionality.
  – Ultimately can reduce the lifetime of the processor.
• Issues typically manifest themselves at the device level.
  – Solutions can be implemented at multiple design levels.

Why the concern?
• Operating at the highest frequencies and/or lowest power possible increases sensitivity to process-related variabilities.
  – Gate length/doping concentration variations
  – Temperature
  – Supply voltage droops
• This decreases processor yield
• Decreasing device sizes → Increased effect of external issues

Outline
• Error Classification
• Hard Errors
• Soft Errors
• Sources of radiation
• Device/Circuit approaches
• Architectural approaches
• Error detection
• Error correction
• System level impact

Processor Error Classification
• Hard Errors will result in permanent processor failure.
• Processor lifetime is inversely proportional to hard error rate.
• Soft Errors do not permanently damage the device.

Hard Errors
• Extrinsic failures
  – Caused by process and manufacturing defects
  – Occur with decreasing rate over time
  – No impact from micro-architecture
• Intrinsic failures
  – Related to processor wear-out
  – Occur with increasing rate over time
  – Related to wafer packaging, process parameters, and processor design.
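The two hard-error trends above (extrinsic failures occurring at a decreasing rate, intrinsic failures at an increasing rate) and the inverse relation between lifetime and hard error rate can be sketched numerically. The following is a minimal illustration, not a model taken from the slides: it assumes a Weibull hazard function for each failure class, and every parameter value is hypothetical. The FIT unit (failures per 10^9 device-hours) is used for the constant-rate lifetime estimate.

```python
# Minimal sketch of hard-error rate trends.
# Assumption: Weibull hazards; the slides do not name a distribution.
# All parameter values below are hypothetical, chosen only for illustration.

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape-1).

    shape < 1 gives a decreasing rate (extrinsic, defect-driven failures);
    shape > 1 gives an increasing rate (intrinsic, wear-out failures).
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def mttf_hours(fit):
    """Mean time to failure in hours for a constant failure rate given in FIT.

    1 FIT = 1 failure per 1e9 device-hours, so MTTF = 1e9 / FIT.
    Doubling the hard error rate halves the expected lifetime.
    """
    return 1e9 / fit


EXTRINSIC = dict(shape=0.5, scale=1e6)   # hypothetical early-life parameters
INTRINSIC = dict(shape=3.0, scale=1e5)   # hypothetical wear-out parameters


def total_hazard(t):
    """Sum of both hazards; high early, low mid-life, rising at wear-out."""
    return weibull_hazard(t, **EXTRINSIC) + weibull_hazard(t, **INTRINSIC)


if __name__ == "__main__":
    for hours in (10, 1_000, 100_000):
        print(hours, total_hazard(hours))
    print("MTTF at 1000 FIT:", mttf_hours(1000), "hours")
```

With these hypothetical parameters the combined rate is highest early in life, lowest in mid-life, and rises again as wear-out dominates.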

Soft Errors
• Occur in both memory and logic
  – External radiation is the main issue in memory
    • Alpha particles
    • High-energy neutrons
    • Thermal neutrons
• Different causes of transient errors in logic
  – External radiation
  – Supply voltage droop
    • Power supply fluctuations
  – Ground bounce, cross-talk
  – Process variation, temperature
  – Affect delay of computational paths

Radiation-Induced Soft Errors
• Ionized particle strike causing a state change
• No permanent damage (not a hard error)
• Combinational logic: Single Event Transients (SET)
• Memory cells: Single Bit Upset (SBU), Multi Bit Upset (MBU)
• Three causes of soft errors
  – Alpha particles
  – Thermal neutrons
  – High-energy neutrons

Alpha-Particles
• Emitted from impurities in packaging materials.
• Create electron-hole pairs through direct ionization
• Range for a 10 MeV particle < 100 um
  – Typical energy 4-9 MeV
• Improved manufacturing trends → Reduced effect
  – Purified materials
  – Shielding layers

Neutrons
• Result of cosmic ray reactions with the atmosphere
• High-energy neutrons react with chip materials.
• Concrete is the only shielding material
  – 1.4x lower flux per foot of thickness

Neutrons
• Thermal neutrons (<< 1 MeV) react with the boron-doped phosphosilicate glass (BPSG) dielectric layer.
  – Produce ionized particles that can cause soft errors
• Solution → Remove BPSG from advanced processes
• Mostly solved
  – SEUs still found in 45 nm, 90 nm

Device-level solutions
• Larger device sizes → Larger capacitance
  – Increase the amount of charge necessary to flip a bit (critical charge)
• Multiple VT design
  – Sensitivity to variation at low VDD may limit effectiveness.
• Body biasing is also common to both radiation hardening and variation tolerance

Circuit-level solutions
• DICE cell
  – Used for SRAM, FFs, latches
• Built-in current sensors on supply lines of memory cells.

Modular redundancy
• Dual Modular Redundancy
  – [Diagram: data in feeds a Main Module and a Replicated Module; outputs are data out and error]
• Triple Modular Redundancy
  – [Diagram: data in feeds a Main Module and two Replicated Modules; a Voter produces data out]

Redundant Circuits
• Redundancy increases area/power
• DMR/TMR in sub/near-VT
  – Timing variation between circuits increases
• Utilization of redundant lanes for parallel operation can increase throughput at low VDD

Self-Checking Circuits
• Partition circuit into smaller blocks
  – Error checker for each block
• Use error detection codes
  – Berger codes
  – Arithmetic codes
• Increases circuit delay for error computation

Circuit-Level Speculation
• Uses approximated circuit implementation
  – Goal is to reduce critical path

Tunable Replica Circuits
• Mirrors delay of critical path
• Monitors for errors over voltage/frequency changes

Timing Speculation
[Figure: main DFF clocked by clk and a Shadow Latch clocked by delayed clk, with an error output; timing diagram showing clk, delayed clk, data in (D0 D1 D2), error, and data out (D0 D1 D2)]
• Razor timing error detection
  – Designed for transient faults
  – Effective against SETs and SBUs on flip-flops
• Requires error recovery

Error Recovery Options in Scalar Processors
• Clock Gating:
  – Global error signal
  – Clock gating
  – 1-cycle penalty

Error Recovery Options in Scalar Processors
• Multiple Issue:
  – Error signals propagated to control unit
  – Instructions must be flushed
  – Error instruction then replayed
  – 2N-cycle penalty

Error Recovery Options in Scalar Processors
• Counter-flow pipelining
• Micro-rollback

Error correcting codes for memories
• Most common is Hamming code
• Check bits stored when data written
• Identifies error and erroneous bit position

Error correcting codes for memories
• Single-bit ECC adds area/power and delay
  – Low VDD → Increased delay
  – Hybrid VDD operation will reduce delay
• Overhead increases for multi-bit ECC
  – Increased memory density → higher probability of MBU
  – Current research shows an increase in the ratio of MBU to total SER in sub-VT

System-Level Impact
• Soft errors can have a large effect on processor functionality
  – Increasing issue with further device scaling
• All methods of error detection/correction are costly
  – Need to be added to system blocks wisely
    • SEU distribution
    • Effects of process variation

System-Level Impact
• How to determine what blocks have the highest system-level impact?
  – Mostly through simulation
• For radiation: all-encompassing
  – Includes fault injection at the circuit level
• Different models have been developed
  – ReStore (University of Illinois at Urbana-Champaign)
    • Focuses on system-level effect of radiation-induced errors
  – RAMP (IBM)
    • Directed more towards hard errors and processor failure.

Questions?
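Backup: the simulation-based fault injection mentioned under System-Level Impact can be illustrated with a toy Monte Carlo campaign. This is only a sketch under stated assumptions, not the method used by ReStore or RAMP: the 8-bit `module` function, the per-copy upset probability, and the single-bit-flip fault model are all hypothetical. It compares how often an injected upset reaches the output of an unprotected block versus a triple-modular-redundant block with a bitwise majority voter.

```python
import random


def module(x):
    """Hypothetical 8-bit combinational block: increment modulo 256."""
    return (x + 1) & 0xFF


def flip_bit(value, bit):
    """Model a single-bit upset by inverting one bit of the word."""
    return value ^ (1 << bit)


def tmr_vote(a, b, c):
    """Bitwise majority vote across three replicated outputs."""
    return (a & b) | (a & c) | (b & c)


def run_campaign(trials, upset_prob, seed=0):
    """Inject random single-bit upsets and count output errors.

    Returns (unprotected_errors, tmr_errors). The unprotected design is
    represented by the first copy alone; the TMR design votes over all three.
    """
    rng = random.Random(seed)
    unprotected_errors = 0
    tmr_errors = 0
    for _ in range(trials):
        x = rng.randrange(256)
        golden = module(x)  # fault-free reference result
        copies = []
        for _ in range(3):
            out = module(x)
            if rng.random() < upset_prob:
                out = flip_bit(out, rng.randrange(8))
            copies.append(out)
        if copies[0] != golden:
            unprotected_errors += 1
        if tmr_vote(*copies) != golden:
            tmr_errors += 1
    return unprotected_errors, tmr_errors


if __name__ == "__main__":
    plain, voted = run_campaign(trials=10_000, upset_prob=0.05)
    print("unprotected errors:", plain)
    print("TMR errors:", voted)
```

In this fault model the voter only fails when two copies are upset in the same bit position during the same trial, which is why the TMR error count stays far below the unprotected count. Repeating such a campaign per block is one way to rank which blocks deserve protection.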