Methodology to Compute Architectural Vulnerability Factors
Chris Weaver (1, 2), Shubhendu S. Mukherjee (1), Joel Emer (1), Steven K. Reinhardt (1, 2), Todd Austin (2)
(1) Fault Aware Computing Technology (FACT), VSSAD, Intel
(2) University of Michigan

Overview
- Background
- Previous reliability estimation methodology
- Proposed methodology for early reliability estimates
- Sample analysis
- Conclusion

Strike Changes State
[Figure: a strike changes the state of a stored bit (0, 1)]

Failure Rate Definitions
- Interval-based: MTBF = Mean Time Between Failures
- Rate-based: FIT = Failure in Time = 1 failure in a billion hours
- 1 year MTBF = 10^9 / (24 * 365) FIT = 114,155 FIT
- FIT rates are additive: Cache: 0 FIT + IQ: 114K FIT + FU: 114K FIT = total of 228K FIT

Motivation
[Figure: Data Corruption FIT (log scale, 1 to 100,000) by year, 2003 to 2012, plotted against the 1000 MTBF goal. Curves: FIT if all flips manifest as errors; FIT if 10% of flips manifest as errors; results of precise & early analysis]
- If we meet the goal, we are done
- If we don't meet the goal, add error protection schemes

Objectives
- Determine which bits matter
- Compute FIT rate

Strike on state bit
- Bit read?
  - no: benign fault, no error
  - yes: Bit has error protection?
    - yes, error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE)
    - yes, error can be corrected (e.g., ECC): no error
    - no: Does bit matter?
      - yes: Silent Data Corruption (SDC) *
      - no: benign fault, no error
* We only focus on SDC FIT

Architectural Vulnerability Factor (AVF)
- AVF_bit = probability that the bit matters
          = (# of visible errors) / (# of bit flips from particle strikes)
- FIT_bit = intrinsic FIT_bit * AVF_bit

Previous AVF Methodology
- Statistical Fault Injection (SFI) with RTL
[Figure: simulate a strike on a latch feeding logic and observe the output. Does the fault propagate to architectural state?]

Characteristics of SFI with RTL
- Naturally characterizes all logical structures
- RTL is not available until late in the design cycle
- Numerous experiments are needed to flip all bits
- Generally done at the chip level
- Limited structural insight

Objectives
- Determine which bits matter
  - Earlier in the design cycle
  - With fewer experiments
  - At the structural level
- Compute FIT rate
  - Intrinsic FIT per bit
  - Architectural Vulnerability Factor

Our Analysis: Which bits matter?
- Branch Predictor: doesn't matter at all (AVF = 0%)
- Program Counter: almost always matters (AVF ~ 100%)

Architecturally Correct Execution (ACE)
[Figure: Program Input flowing to Program Outputs]
- The ACE path requires only a subset of values to flow correctly through the program's data flow graph (and the machine)
- Anything else (the un-ACE path) can be derated away

Example of un-ACE instruction: Dynamically Dead Instruction
[Figure: a dynamically dead instruction]
- Most bits of an un-ACE instruction do not affect program output

Dynamic Instruction Breakdown
- DYNAMICALLY DEAD 20%
- PERFORMANCE INST 1%
- ACE 46%
- PREDICATED FALSE 7%
- NOP 26%
Average across all of Spec2K slices

Mapping ACE & un-ACE Instructions to the Instruction Queue
[Figure: instruction queue entries labeled ACE Inst; NOP and Prefetch (architectural un-ACE); Ex-ACE Inst, Wrong-Path Inst, and Idle (micro-architectural un-ACE)]

Vulnerability of a structure
- AVF = fraction of cycles a bit contains ACE state
- Example: a 4-bit structure observed over 4 cycles

  Cycle   T=1   T=2   T=3   T=4
  ACE%    2/4   1/4   0/4   3/4

- AVF = (average number of ACE bits in a cycle) / (total number of bits in the structure)
      = ((2+1+0+3)/4) / 4

Little's Law for ACEs
- N_ace = T_ace x L_ace (Little's Law: average occupancy = throughput x latency)
- AVF = N_ace / N_total

Computing AVF
- Our approach is conservative: we assume every bit is ACE unless proven otherwise
- Data analysis: try to prove that data held in a structure is un-ACE
- Timing analysis: track the time this data spends in the structure

Computing FIT rate of a Chip
Total FIT = sum over structures i of (FIT per bit_i x # of bits_i x AVF_i)

  Structure           FIT per bit   # of bits   AVF   Total FIT
  Branch Predictor    .001*         1K          0     0
  Program Counter     .001*         64          1     0.064
  Instruction Queue   .001*         6400        ?     ?
  Functional Units    .001*         4000        ?     ?
  …                                                   …

Total FIT of whole chip = sum of the Total FIT column
* Intrinsic FIT per bit from externally published data

Results: Experimental Setup
- Used ASIM modeling infrastructure
- Model of an Itanium®2-like processor
- Ran all Spec2K benchmarks
  - Compiled with the highest level of optimization with the Intel electron compiler
- Simulated under a full OS
- Simulation points chosen using SimPoint (Sherwood et al.)

Instruction Queue
- IDLE 31%
- Ex-ACE 10%
- ACE 29%
- NOP 15%
- PREDICATED FALSE 3%
- WRONG PATH 3%
- DYNAMICALLY DEAD 8%
- PERFORMANCE INST 1%
ACE percentage = AVF = 29%

Functional Units
- WRONG PATH 1%
- SPECULATIVE ISSUE 1%
- DYNAMICALLY DEAD 4%
- PERFORMANCE INST 0%
- PREDICATED FALSE 1%
- NOP 6%
- ACE 9%
- UNIT IDLE 77%
- DATAPATH IDLE 1%
- LOGICAL MASKING 0%
ACE percentage = AVF = 9%

Computing FIT rate of Chip

  Structure           FIT per bit   # of bits   AVF   Total FIT
  Branch Predictor    .001*         1K          0     0
  Program Counter     .001*         64          1     0.064
  Instruction Queue   .001*         6400        .29   1.856
  Functional Units    .001*         4000        .09   0.360
  …                                                   …

Total FIT of whole chip = sum of the Total FIT column
* Intrinsic FIT per bit from externally published data

Summary
- Determine which bits matter
  - ACE (Architecturally Correct Execution)
- Compute FIT rate
  - Intrinsic FIT per bit
  - AVF (Architectural Vulnerability Factor)

Questions?
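
Worked check of the slide arithmetic
The arithmetic on the "Failure Rate Definitions", "Vulnerability of a structure", and "Computing FIT rate of Chip" slides can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' tooling: the function names are hypothetical, "1K" is assumed to mean 1024 bits (it has no effect here because the branch predictor's AVF is 0), and the total covers only the four structures the table lists, not the rows elided by "…".

```python
HOURS_PER_YEAR = 24 * 365  # as used on the "Failure Rate Definitions" slide


def mtbf_years_to_fit(mtbf_years):
    """Convert an MTBF in years to FIT (failures per billion hours)."""
    return 1e9 / (mtbf_years * HOURS_PER_YEAR)


def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = average number of ACE bits per cycle / total bits in the structure."""
    average_ace_bits = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace_bits / total_bits


def structure_fit(fit_per_bit, num_bits, avf):
    """FIT of one structure = intrinsic FIT per bit x number of bits x AVF."""
    return fit_per_bit * num_bits * avf


# Rows copied from the "Computing FIT rate of Chip" table.
# Assumption: "1K" is read as 1024 bits.
LISTED_STRUCTURES = [
    ("Branch Predictor", 0.001, 1024, 0.0),
    ("Program Counter", 0.001, 64, 1.0),
    ("Instruction Queue", 0.001, 6400, 0.29),
    ("Functional Units", 0.001, 4000, 0.09),
]

# FIT rates are additive, so the listed rows can simply be summed.
partial_chip_fit = sum(
    structure_fit(fit_per_bit, bits, avf)
    for _, fit_per_bit, bits, avf in LISTED_STRUCTURES
)

# The 4-bit, 4-cycle example: 2, 1, 0, and 3 ACE bits in cycles 1 to 4.
example_avf = structure_avf([2, 1, 0, 3], total_bits=4)

print(f"1-year MTBF in FIT: {mtbf_years_to_fit(1):,.0f}")
print(f"Example structure AVF: {example_avf:.3f}")
print(f"FIT of the four listed structures: {partial_chip_fit:.3f}")
```

Because FIT rates add, protecting or derating one structure changes only its own term in the sum, which is what makes the per-structure breakdown useful.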

Statistical Fault Injection (SFI) Algorithm
- Find a statistically significant set of bits
- Randomly select a bit
- Flip the bit
- Run two simulations: one with the bit flip and one without
- Run for a pre-defined # of cycles
- Compare the architectural state of the two simulations (e.g., register file)
- If there is a mismatch, declare an error
- Repeat the algorithm with a different bit flip
- AVF = # of mismatches observed / total # of experiments
- Widely used, and has provided useful AVF numbers to date

SFI vs. ACE analysis
- Accuracy of microarchitectural un-ACE
  - SFI: better than ACE analysis
  - ACE: conservative
- Accuracy of architectural un-ACE
  - SFI: conservative
  - ACE: better than SFI (e.g., covers dynamically dead instructions)
- Insight
  - SFI: per-structure insights are harder
  - ACE: Little's Law & per-structure breakdown are easier
- # of experiments
  - SFI: a large # is required to be statistically significant
  - ACE: a small # of experiments can give good accuracy
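
SFI sketch
The SFI steps listed above can be illustrated on a toy model. The sketch below is not an RTL flow and not the authors' setup: the four-register "machine", its three-step "program", and all function names are hypothetical, chosen so that the true answer is known in advance. Registers r0 and r3 are overwritten before they are read, so flips in them are benign; flips in r1 or r2 always reach the final register file. Because the toy simulation is deterministic, the fault-free run is computed once and reused rather than rerun for every experiment.

```python
import random

NUM_REGS = 4  # hypothetical toy machine: four registers
WIDTH = 8     # each register is 8 bits wide


def simulate(initial_regs):
    """Run the toy program and return the final register file.

    r3 and r0 are overwritten before being read, so their initial
    values are dead; r1 and r2 feed the result and are left in place.
    """
    regs = list(initial_regs)
    regs[3] = 0
    regs[0] = (regs[1] + regs[2] + regs[3]) & 0xFF
    return tuple(regs)


def sfi_avf(initial_regs, num_experiments, seed=0):
    """Estimate AVF as (# of mismatches) / (# of experiments)."""
    rng = random.Random(seed)
    golden = simulate(initial_regs)  # simulation without a bit flip
    mismatches = 0
    for _ in range(num_experiments):
        # Randomly select one bit of the initial state and flip it.
        reg = rng.randrange(NUM_REGS)
        bit = rng.randrange(WIDTH)
        faulty_regs = list(initial_regs)
        faulty_regs[reg] ^= 1 << bit
        # Compare architectural state; a mismatch counts as an error.
        if simulate(faulty_regs) != golden:
            mismatches += 1
    return mismatches / num_experiments


estimate = sfi_avf([5, 17, 42, 99], num_experiments=2000)
print(f"SFI estimate of AVF: {estimate:.3f}")
```

In this toy, 16 of the 32 state bits matter, so the estimate should land near 0.5; the residual sampling noise is the reason SFI needs a large number of experiments to be statistically significant.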