A Systematic Methodology to Compute the Architectural

Methodology to Compute Architectural Vulnerability Factors
Chris Weaver (1, 2)
Shubhendu S. Mukherjee (1)
Joel Emer (1)
Steven K. Reinhardt (1, 2)
Todd Austin (2)
(1) Fault Aware Computing Technology (FACT), VSSAD, Intel
(2) University of Michigan
Overview
Background
Previous reliability estimation methodology
Proposed methodology for early reliability estimates
Sample analysis
Conclusion
Strike Changes State
[Figure: a particle strike flips the value of a stored bit between 0 and 1]
Failure Rate Definitions
Interval-based
MTBF = Mean Time Between Failures
Rate-based
FIT = Failure in Time = 1 failure in a billion hours
1 year MTBF = 10^9 / (24 * 365) FIT = 114,155 FIT
Additive
Cache: 0 FIT
+ IQ: 114K FIT
+ FU: 114K FIT
= Total of 228K FIT
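The conversion and the additive property above can be checked with a few lines of Python. This is a minimal sketch: the function names are illustrative, and the 24 * 365 hours-per-year convention is taken from the slide.

```python
HOURS_PER_YEAR = 24 * 365   # the slide's convention (no leap days)
FIT_HOURS = 10**9           # 1 FIT = 1 failure per billion hours


def mtbf_years_to_fit(mtbf_years):
    """Convert an MTBF given in years into a FIT rate."""
    return FIT_HOURS / (mtbf_years * HOURS_PER_YEAR)


def fit_to_mtbf_years(fit):
    """Convert a FIT rate back into an MTBF in years."""
    return FIT_HOURS / (fit * HOURS_PER_YEAR)


one_year_fit = mtbf_years_to_fit(1)

# FIT rates add across components, which MTBF values do not.
component_fit = {"Cache": 0.0, "IQ": one_year_fit, "FU": one_year_fit}
total_fit = sum(component_fit.values())

print(round(one_year_fit))            # 114155
print(fit_to_mtbf_years(total_fit))   # roughly half a year
```

Two components that each fail once a year give a combined MTBF of about six months, which is why the rate-based FIT form is the convenient one to sum over a chip.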
Motivation
[Figure: Data Corruption FIT (log scale, 1 to 100000) versus year (2003 to 2012), plotted against the "1000 MTBF Goal" line. Two curves: FIT if all flips manifest as errors, and FIT if 10% of flips manifest as errors.]
Results of precise & early analysis
If we meet goal: we are done
If we don’t meet goal: add error protection schemes
Objectives


Determine which bits matter
Compute FIT rate
Strike on state bit
Bit read?
  no: benign fault, no error
  yes: Bit has error protection?
    yes, error is only detected (e.g., parity + no recovery): detected, but unrecoverable error (DUE)
    yes, error can be corrected (e.g., ECC): no error
    no: Does bit matter?
      yes: Silent Data Corruption (SDC)
      no: benign fault, no error
* We only focus on SDC FIT
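The decision flow on this slide can be written as a small classifier. This is only a sketch of the flowchart: the `Outcome` names and the `protection` labels (`None`, `"detect"`, `"correct"`) are hypothetical and chosen for illustration.

```python
from enum import Enum


class Outcome(Enum):
    BENIGN = "benign fault, no error"
    CORRECTED = "no error (corrected)"
    DUE = "detected, but unrecoverable error"
    SDC = "silent data corruption"


def classify_strike(bit_read, protection, bit_matters):
    """Classify one strike on a state bit.

    protection is None (unprotected), "detect" (e.g., parity with no
    recovery) or "correct" (e.g., ECC). These labels are illustrative.
    """
    if not bit_read:
        return Outcome.BENIGN          # the flipped value is never consumed
    if protection == "correct":
        return Outcome.CORRECTED
    if protection == "detect":
        return Outcome.DUE
    if protection is not None:
        raise ValueError(f"unknown protection: {protection!r}")
    # Unprotected and read: the outcome depends on whether the bit matters.
    return Outcome.SDC if bit_matters else Outcome.BENIGN


print(classify_strike(True, None, True).value)   # silent data corruption
```

Only the last branch contributes to the SDC FIT that the rest of the talk focuses on.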
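Architectural Vulnerability Factor (AVF)
$$\mathrm{AVF}_{bit} = \text{Probability Bit Matters} = \frac{\#\text{ of Visible Errors}}{\#\text{ of Bit Flips from Particle Strikes}}$$
$$\mathrm{FIT}_{bit} = \text{intrinsic } \mathrm{FIT}_{bit} \times \mathrm{AVF}_{bit}$$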
Previous AVF Methodology

Statistical Fault Injection with RTL
[Figure: simulate a strike on a latch that feeds logic and an output. Does the fault propagate to architectural state?]
Characteristics of SFI with RTL
Naturally characterizes all logical structures
RTL is not available until late in the design cycle
Numerous experiments are needed to flip all bits
Generally done at the chip level
  Limited structural insight
Objectives

Determine which bits matter




Earlier in the design cycle
With fewer experiments
At the structural level
Compute FIT rate


Intrinsic FIT per bit
Architectural Vulnerability Factor
Our Analysis: Which bits matter?
Branch Predictor
  Doesn’t matter at all (AVF = 0%)
Program Counter
  Almost always matters (AVF ~ 100%)
Architecturally Correct Execution (ACE)
[Figure: data flow from Program Input to Program Outputs]
ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
Anything else (un-ACE path) can be derated away
Example of un-ACE instruction: Dynamically Dead Instruction
Most bits of an un-ACE instruction do not affect program output
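To make the idea of a dynamically dead instruction concrete, here is a minimal sketch that finds such instructions in a register-only trace with a backward liveness pass. The trace format `(dest, sources)`, the `live_out` set, and the function name are assumptions made for illustration; memory, stores, and control flow are ignored, and this is not presented as the analysis used in the talk.

```python
def find_dynamically_dead(trace, live_out):
    """Return indices of instructions whose results never reach an output.

    trace: list of (dest_register, source_registers) in execution order.
    live_out: registers treated as program outputs after the trace ends.
    """
    live = set(live_out)
    dead = []
    for index in range(len(trace) - 1, -1, -1):
        dest, sources = trace[index]
        if dest in live:
            # The result is needed later, so its inputs are needed too.
            live.discard(dest)
            live.update(sources)
        else:
            # Overwritten before any live read, or read only by dead code.
            dead.append(index)
    return sorted(dead)


trace = [
    ("r1", ("r0",)),        # 0: read only by instruction 1, which is dead
    ("r2", ("r1",)),        # 1: r2 is overwritten by 2 before any read
    ("r2", ("r0",)),        # 2: live, read by instruction 3
    ("r3", ("r2", "r0")),   # 3: live, r3 is the program output
]
print(find_dynamically_dead(trace, live_out={"r3"}))   # [0, 1]
```

Instruction 1 is dead because its result is overwritten unread; instruction 0 is dead transitively because its only reader is itself dead.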
Dynamic Instruction Breakdown
Dynamically dead: 20%
Performance inst: 1%
ACE: 46%
Predicated false: 7%
NOP: 26%
Average across all of Spec2K slices
Mapping ACE & un-ACE Instructions to the Instruction Queue
[Figure: instruction queue entries labeled ACE Inst, NOP, Prefetch, Ex-ACE, Wrong-Path Inst, and Idle]
Architectural un-ACE: NOP, Prefetch
Micro-architectural un-ACE: Ex-ACE, Wrong-Path Inst, Idle
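Vulnerability of a structure
AVF = fraction of cycles a bit contains ACE state
Example (4-bit structure observed over 4 cycles):
T=1: ACE% = 2/4
T=2: ACE% = 1/4
T=3: ACE% = 0/4
T=4: ACE% = 3/4
$$\mathrm{AVF} = \frac{(2+1+0+3)/4}{4} = \frac{\text{Average number of ACE bits in a cycle}}{\text{Total number of bits in the structure}}$$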
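The worked example on this slide (per-cycle ACE counts of 2, 1, 0, and 3 in a 4-bit structure) can be reproduced directly. The function name is illustrative; exact fractions are used so the result matches the slide's arithmetic.

```python
from fractions import Fraction


def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = average number of ACE bits per cycle / total bits in the structure."""
    average_ace_bits = Fraction(sum(ace_bits_per_cycle), len(ace_bits_per_cycle))
    return average_ace_bits / total_bits


avf = structure_avf([2, 1, 0, 3], total_bits=4)
print(avf, float(avf))   # 3/8 0.375
```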
Little’s Law for ACEs
$$N_{ace} = T_{ace} \times L_{ace}$$
$$\mathrm{AVF} = \frac{N_{ace}}{N_{total}}$$
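A short sketch of how Little's Law yields an AVF, reading T_ace as the average rate at which ACE bits enter the structure and L_ace as the average number of cycles an ACE bit stays there (that reading of the symbols is an assumption here). The residency values below are hypothetical, chosen to total 6 ACE bit-cycles over 4 cycles in a 4-bit structure.

```python
from fractions import Fraction


def avf_littles_law(ace_residencies, window_cycles, total_bits):
    """Estimate AVF as N_ace / N_total with N_ace = T_ace * L_ace.

    ace_residencies: cycles each ACE bit spent in the structure.
    window_cycles: length of the observation window in cycles.
    """
    if not ace_residencies:
        return Fraction(0)
    throughput = Fraction(len(ace_residencies), window_cycles)        # T_ace
    latency = Fraction(sum(ace_residencies), len(ace_residencies))    # L_ace
    n_ace = throughput * latency                                      # N_ace
    return n_ace / total_bits


# Three ACE bits resident for 1, 2 and 3 cycles in a 4-cycle window.
print(avf_littles_law([1, 2, 3], window_cycles=4, total_bits=4))   # 3/8
```

The appeal of this form is that a performance model only needs to track how many ACE bits enter a structure and how long they stay, rather than sampling every bit on every cycle.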
Computing AVF
Our approach is conservative
  We assume every bit is ACE unless proven otherwise
Data Analysis
  Try to prove that data held in a structure is un-ACE
Timing Analysis
  Tracks the time this data spent in the structure
Computing FIT rate of a Chip
$$\text{Total FIT} = \sum_i \left(\text{FIT per bit}_i \times \#\text{ of bits}_i \times \mathrm{AVF}_i\right)$$
Structure            FIT per bit   # of bits   AVF   Total FIT
Branch Predictor     .001*         1K          0     0
Program Counter      .001*         64          1     0.064
Instruction Queue    .001*         6400        ?     ?
Functional Units     .001*         4000        ?     ?
…                                                    …
Total FIT of whole chip = Σ of the Total FIT column
* Intrinsic FIT per bit from externally published data
Results:
Experimental Setup
Used ASIM modeling infrastructure
Model of an Itanium® 2-like processor
Ran all Spec2K benchmarks
  Compiled with highest level of optimization with the Intel electron compiler
  Simulated under a full OS
  Simulation points chosen using SimPoint (Sherwood et al.)
Instruction Queue
Idle: 31%
Ex-ACE: 10%
ACE: 29%
NOP: 15%
Predicated false: 3%
Wrong path: 3%
Dynamically dead: 8%
Performance inst: 1%
ACE percentage = AVF = 29%
Functional Units
Wrong path: 1%
Speculative issue: 1%
Dynamically dead: 4%
Performance inst: 0%
Predicated false: 1%
NOP: 6%
ACE: 9%
Unit idle: 77%
Datapath idle: 1%
Logical masking: 0%
ACE percentage = AVF = 9%
Computing FIT rate of Chip
Structure            FIT per bit   # of bits   AVF   Total FIT
Branch Predictor     .001*         1K          0     0
Program Counter      .001*         64          1     0.064
Instruction Queue    .001*         6400        .29   1.856
Functional Units     .001*         4000        .09   0.360
…                                                    …
Total FIT of whole chip = Σ of the Total FIT column
* Intrinsic FIT per bit from externally published data
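The per-structure products in the table can be reproduced with a short script. This sketch covers only the four rows the slide lists, so the sum is a partial total and not the whole-chip figure (the table elides the remaining structures). The variable and function names are illustrative.

```python
# (structure, intrinsic FIT per bit, number of bits, AVF), copied from the table.
# "1K" is taken as 1000 bits here; with AVF = 0 the choice does not affect the sum.
ROWS = [
    ("Branch Predictor", 0.001, 1000, 0.0),
    ("Program Counter", 0.001, 64, 1.0),
    ("Instruction Queue", 0.001, 6400, 0.29),
    ("Functional Units", 0.001, 4000, 0.09),
]


def structure_fit(fit_per_bit, num_bits, avf):
    """FIT contribution of one structure."""
    return fit_per_bit * num_bits * avf


def total_fit(rows):
    """Sum of the per-structure FIT contributions."""
    return sum(structure_fit(f, n, a) for _, f, n, a in rows)


for name, f, n, a in ROWS:
    print(f"{name:18s} {structure_fit(f, n, a):.3f}")
print(f"{'Listed rows only':18s} {total_fit(ROWS):.3f}")
```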
Summary

Determine which bits matter


ACE (Architecturally Correct Execution)
Compute FIT rate


Intrinsic FIT per bit
AVF (Architectural Vulnerability Factor)
Questions?
Statistical Fault Injection (SFI)
Algorithm
  Find a statistically significant set of bits
  Randomly select a bit
  Flip the bit
  Run two simulations: one with bit flip and one without bit flip
  Run for pre-defined # cycles
  Compare architectural state of two simulations (e.g., register file)
  If mismatch, declare an error
  Repeat algorithm with different bit flip
  AVF = # mismatches observed / total # experiments
Used widely
  + has provided useful AVF numbers to date
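The algorithm above can be sketched on a toy machine. Everything concrete here is an assumption made for illustration: a four-register, 8-bit machine, a fixed four-instruction program of additions, and the final register file standing in for architectural state. Real SFI runs against an RTL model, not a toy like this.

```python
import random

NUM_REGS = 4
WIDTH = 8
MASK = (1 << WIDTH) - 1

# Toy program: (dest, a, b) means regs[dest] = (regs[a] + regs[b]) & MASK.
PROGRAM = [(2, 0, 1), (3, 2, 2), (1, 0, 0), (2, 3, 1)]


def run(initial, flip=None):
    """Run the program; flip is (cycle, register, bit) or None for a clean run."""
    regs = list(initial)
    for cycle, (dest, a, b) in enumerate(PROGRAM):
        if flip is not None and flip[0] == cycle:
            regs[flip[1]] ^= 1 << flip[2]   # inject the strike before this cycle
        regs[dest] = (regs[a] + regs[b]) & MASK
    return regs


def estimate_avf(num_experiments, seed=0):
    """AVF = mismatches observed / total experiments."""
    rng = random.Random(seed)
    initial = [rng.randrange(1 << WIDTH) for _ in range(NUM_REGS)]
    golden = run(initial)                    # simulation without a bit flip
    mismatches = 0
    for _ in range(num_experiments):
        flip = (rng.randrange(len(PROGRAM)),
                rng.randrange(NUM_REGS),
                rng.randrange(WIDTH))
        if run(initial, flip) != golden:     # compare final register files
            mismatches += 1
    return mismatches / num_experiments


print(estimate_avf(1000))
```

Some flips are masked here for the same reasons they are in a real design: the struck register may be overwritten before it is read, or the flipped bit may be shifted out of the 8-bit result.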
SFI vs. ACE analysis
Accuracy of micro-architectural un-ACE
  SFI: Better than ACE analysis
  ACE: Conservative
Accuracy of architectural un-ACE
  SFI: Conservative
  ACE: Better than SFI (e.g., covers dynamically dead instructions)
Insight
  SFI: Per-structure insights harder
  ACE: Little’s Law & per-structure breakdown easier
# of experiments
  SFI: Large # required to be statistically significant
  ACE: Small # of experiments can give good accuracy