Architectural Vulnerability Factor (AVF) Computation for Address-Based Structures

Arijit Biswas, Paul Racunas, Shubu Mukherjee
FACT Group, DEG, Intel

Joel Emer
VSSAD, Intel

Razvan Cheveresan
Sun Microsystems, Intern FACT Group

Ram Rangan
Princeton University, Intern FACT Group

Moore's Law Graph
• Soft errors are a serious problem
  – Assuming a certain error rate, failure rate of whole chip increases

[Figure: failure rate from vulnerable latches (log scale, 1 to 10000) versus year (2003 to 2012), with curves for 100% vulnerable and 20% vulnerable latches, a 1000-year MTBF goal line, and a region labeled "12x GAP". Chart based on 200,000 latches as used in the Fujitsu SPARC Processor (2003).]

FACT Group, Intel

All bits are not created equal!

[Figure: a stored bit holding 1 or 0. Particle strike causes bit flip!]

All bits are not created equal!

Particle strike causes bit flip!
• Bit read?
  – no: benign fault, no error
  – yes: does the bit have error protection?
• Bit has error protection: detection & correction
  – benign fault, no error
• Bit has error protection: detection only
  – Does bit matter? yes: True Detected Unrecoverable Error
  – Does bit matter? no: False Detected Unrecoverable Error
• Bit has no error protection
  – Does bit matter? yes: Silent Data Corruption
  – Does bit matter? no: benign fault, no error
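The fault-outcome flowchart above can be read as a small decision function. The following is a minimal sketch, assuming the three protection levels shown on the slide; the names `Protection` and `classify_bit_flip` are hypothetical and are not taken from the original deck.

```python
from enum import Enum


class Protection(Enum):
    """Protection level of the struck bit (labels follow the flowchart)."""
    NONE = "none"
    DETECTION_ONLY = "detection only"
    DETECTION_AND_CORRECTION = "detection & correction"


def classify_bit_flip(bit_read: bool, protection: Protection, bit_matters: bool) -> str:
    """Walk the flowchart for one flipped bit and return the outcome label."""
    # A flipped bit that is never read cannot affect anything.
    if not bit_read:
        return "benign fault, no error"
    # Correction hardware repairs the bit before it is used.
    if protection is Protection.DETECTION_AND_CORRECTION:
        return "benign fault, no error"
    # Detection without correction always raises an error; it is a "false"
    # error when the bit would not have affected the program outcome.
    if protection is Protection.DETECTION_ONLY:
        if bit_matters:
            return "true detected unrecoverable error"
        return "false detected unrecoverable error"
    # No protection: the flip goes unnoticed by hardware.
    if bit_matters:
        return "silent data corruption"
    return "benign fault, no error"


if __name__ == "__main__":
    print(classify_bit_flip(True, Protection.NONE, True))
    print(classify_bit_flip(True, Protection.DETECTION_ONLY, False))
```

Only the "bit matters" branches contribute to a structure's AVF, which is why the rest of the deck concentrates on deciding when a bit matters.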
• Architectural Vulnerability Factor (AVF)
  – Probability that a bit flip will cause user-visible error
• Soft Error Rate of a Structure = (AVF per bit) x (# Bits) x (Intrinsic Error Rate per bit)
• Reducing AVF reduces SER
  – High AVF indicates need for protection
  – Low AVF can help remove protection hardware
• SER Protection can be Expensive
  – Impacts Area, Power, Performance, Design Time

Simple Examples
• Committed Program Counter AVF ~ 100%
• Branch Predictor AVF = 0%

Complex Examples
• Instruction Queue AVF = 29%
• Execution Units AVF = 9%
• Used a new concept
  – Architecturally Correct Execution (ACE)

Architecturally Correct Execution (ACE)

[Figure: program input flowing through the program's data flow graph to program outputs.]

• ACE path requires only a subset of values to flow correctly through the program's data flow graph (and the machine)
• Anything else (un-ACE path) can be derated away

Example of un-ACE instruction: Dynamically Dead Instruction

[Figure: instruction sequence with one instruction marked "Dynamically Dead Instruction".]

Most bits of an un-ACE instruction do not affect program output

ACE Breakdown of Instruction Queue
• IDLE 31%
• ACE 29%
• Ex-ACE 10%
• NOP 15%
• PREDICATED FALSE 3%
• WRONG PATH 3%
• DYNAMICALLY DEAD 8%
• PERFORMANCE INST 1%

Average across all of Spec2K slices for an IA64-like processor
ACE % = AVF = 29%

A New AVF Analysis – Address-Based Structures
• Caches, data translation buffers, store buffers
  – Make up large portions of a modern chip
• Simple ACE analysis is no longer enough
• Data & Tag structures need new concepts
  – Extended Lifetime Analysis
  – Hamming-Distance-1 Analysis
  – Cooldown
  – AVF Reduction - Flushing

Lifetime Analysis
• Idle is unACE

[Figure: lifetime timeline with events Fill, Read, Read, Evict and intervals Idle, Valid, Valid, Valid, Idle.]

  – Assuming all time intervals are equal
  – For 3/5 of the lifetime the bit is valid
  – Gives a measure of the structure's utilization
    • Number of useful bits
    • Amount of time useful bits are resident in structure
    • Valid for a particular trace

Lifetime Analysis of Write-through Data Cache
• Valid is not necessarily ACE

[Figure: write-through data cache timeline: Idle, Fill, Read, Read, Evict, Idle.]

• ACE % = AVF = 2/5 = 40%
• Example Lifetime Components
  – ACE: fill-to-read, read-to-read
  – unACE: idle, read-to-evict, write-to-evict

Lifetime Analysis of Write-through Data Cache
• Data ACEness is a function of instruction ACEness

[Figure: write-through DCache timeline: Idle, Fill, Read, Read, Evict, Idle.]

• Second Read is by an unACE instruction
• AVF = 1/5 = 20%

Tags are Hard
• A fault associated with a tag that is nominally associated with a particular instruction can impact the correct execution of a different independent instruction
• False Negatives only error if writeback is necessary
  – Uses standard lifetime analysis
• False Positives always result in error
  – Need bit-level analysis

False Positive

[Figure: "Expect": the incoming address 1001 is compared against the stored tag address and the result is MISS. "Acquire": the faulty tag address now equals the incoming address 1001 and the result is HIT.]

• Expected Tag Miss, but got Hit – Error
• How do you compute the AVF? Fault injection?

Hamming-Distance-1 Analysis
• Assuming a single-bit error model

[Figure: tag array entries and incoming addresses (values shown: 101010, 001010, 111010, 000001, 111000, 010101, 111111), with two Hamming-Distance-1 matches marked.]

• Now we can use lifetime analysis on the identified bit(s)

Edge Effects
• Simulation introduces unknown component
  – Simulation not run to completion
  – Only execute small segment of code

[Figure: timeline with events Fill, Read, Read, then Sim End; the interval from Sim End to the later Evict is "Unknown" (not simulated).]

• Worst Case AVF = Known AVF + Unknown AVF
• How do we reduce/eliminate unknown?

Cooldown
• Run simulation beyond end interval.
  – Any bits that were already valid (the unknown bits) are resolved

[Figure: AVF % (0 to 50) with and without cooldown for Dcache Data (WT), Dcache Data (WB), Dcache Tags (WT), Dcache Tags (WB), dTLB Data, and dTLB Tags.]

• Trend: unknown AVF primarily resolves to unACE
• Best Estimate AVF = Known AVF after Cooldown
• 10 Million Instructions Simulation, 10 Million Instructions Cooldown

Data AVFs (Average)

[Figure: bar chart of average data AVF % (0 to 50) for STB, DTB, Dcache (WB), and Dcache (WT).]

• STB AVF lower due to large idle component and bytemasks
• DTB AVF higher due to high average utilization
• Dcache (WB) AVF higher than Dcache (WT) since dirty bytes still ACE after last read

Data AVF of DTB

[Figure: Best Estimate AVF % (0 to 100) per benchmark: bzip2_source, cc_166, crafty, eon_kajiya, gap, gzip_graphic, mcf, parser, perlbmk_makerand, twolf, vortex_lendian3, vpr_route, ammp, applu, apsi, art_1, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise.]

• Large variability in AVF
• Ranges from ~0% to 80%
• Based on structure utilization by benchmark

Tag AVFs (Average)

[Figure: bar chart of average tag AVF % (0 to 50) for STB, DTB, Dcache (WB), and Dcache (WT).]

• Tag AVFs lower than expected for DTB and DCache (WT)
  – Only Hamming-Distance-1 matches contribute ACE time
• Tag AVFs higher than data for STB and DCache (WB)
  – Dynamically dead tags are still ACE for dirty bytes

Tag AVF of DTB

[Figure: Best Estimate AVF % (0 to 100) per benchmark, same benchmark list as the Data AVF of DTB chart.]

• AVFs surprisingly small, little variation
• Protection added to DTB CAMs prior to AVF calculation (large # bits)
• AVF calculation shows NO protection was needed in this case

AVF Observations
• DTB and Write-through Data Cache
  – Typically Tag AVF < Data AVF
    • only hamming-distance 1 hits contribute to Tag AVF
    • dynamic dead data are unACE
• STB and Write-back Data Cache
  – Typically Tag AVF ≥ Data AVF
    • Tag AVF ACE till eviction if line is dirty
    • dynamic dead data can be ACE
    • Bytemasks and writes may make certain bytes of data unACE while all bits of tag are always ACE

AVF Reduction: Flushing
• Flushing (emulates a context switch)
  – Also eliminates unknowns by flushing all live entries at end of simulation
• Main concept: Transform part of ACE time into unACE at the expense of some performance

[Figure: entry lifetime timeline with Fill, Read, Read, Flush, Fill, and Evict events and intervals labeled Idle and ACE.]

AVF Reduction: Flushing

[Figure: AVF % (0 to 40) for Data and Tags of the write-through Dcache (WT), write-back Dcache (WB), and DTB, each with no flushing (base) and with 5M, 1M, and 100K cycle flush intervals.]

• >50% AVF reduction for 100K cycle Flush (Flush takes 0 time)
• Max IPC reduction: 1.77% DTB, 1.25% WT/WB DCache
• Avg IPC reduction: 0.56% DTB, 0.19% WT/WB DCache

Summary
• SER is an ever-increasing problem
  – Need standard, quantitative way to evaluate design cost of adding protection/recovery to structures
• AVF gives us a quantitative way to measure the cost of adding protection
• Presented a methodology to compute the AVF of address-based structures
  – Lifetime Analysis
  – False Negatives and False Positives
    • Hamming Distance-1 Analysis for False Positives
  – Edge Effects and Cooldown
    • Analogous to Warmup
  – AVF Reduction - Flushing
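As an illustration of the lifetime analysis listed in the summary, the sketch below computes the data AVF of a single write-through cache entry from a time-ordered event trace. It is a minimal sketch under stated assumptions, not the authors' tool: the function name `write_through_data_avf` and the event tuple layout are hypothetical, an interval counts as ACE only when it ends in a read by an ACE instruction while the entry is valid, and treating write-to-read as ACE is an assumption the slides do not state.

```python
def write_through_data_avf(events, sim_start, sim_end):
    """Return the fraction of simulated time during which the entry is ACE.

    events: time-ordered list of (time, kind, reader_is_ace) tuples, where
    kind is "fill", "read", "write", or "evict". reader_is_ace is only
    meaningful for reads (hypothetical layout, chosen for this sketch).
    """
    ace_time = 0
    prev_time = sim_start
    valid = False
    for time, kind, reader_is_ace in events:
        # Fill-to-read and read-to-read intervals are ACE when the closing
        # read is by an ACE instruction. Write-to-read is also counted here
        # (assumption). Intervals closed by fill, write, or evict are unACE.
        if kind == "read" and valid and reader_is_ace:
            ace_time += time - prev_time
        if kind == "fill":
            valid = True
        elif kind == "evict":
            valid = False
        prev_time = time
    # Time after the last event (idle after eviction) is unACE.
    return ace_time / (sim_end - sim_start)


if __name__ == "__main__":
    # Five equal intervals: idle, fill-to-read, read-to-read, read-to-evict, idle.
    both_reads_ace = [(1, "fill", False), (2, "read", True),
                      (3, "read", True), (4, "evict", False)]
    second_read_unace = [(1, "fill", False), (2, "read", True),
                         (3, "read", False), (4, "evict", False)]
    print(write_through_data_avf(both_reads_ace, 0, 5))     # 0.4
    print(write_through_data_avf(second_read_unace, 0, 5))  # 0.2
```

With five equal intervals this reproduces the two write-through examples from the lifetime analysis slides: 2/5 = 40% when both reads are ACE, and 1/5 = 20% when the second read is by an unACE instruction.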