Architectural Vulnerability Factor (AVF) Computation for Address-Based Structures

Arijit Biswas, Paul Racunas, Shubu Mukherjee
FACT Group, DEG, Intel
Joel Emer
VSSAD, Intel
Razvan Cheveresan
Sun Microsystems, Intern FACT Group
Ram Rangan
Princeton University, Intern FACT Group
Moore’s Law Graph
• Soft errors are a serious problem
– Assuming a certain error rate, failure rate of whole chip increases
[Chart: failure rate from vulnerable latches (log scale, 1 to 10000) by year (2003 to 2012), with curves for 100% Vulnerable and 20% Vulnerable, a "1000 year MTBF Goal" line, and a "12x GAP" annotation.]
Chart based on 200,000 latches as used in the Fujitsu SPARC Processor (2003)
FACT Group, Intel
All bits are not created equal!
[Figure: a stored bit (values 1 and 0 shown); a particle strike causes a bit flip.]
All bits are not created equal!
Particle strike causes bit flip. Outcome depends on whether the bit is read, whether it has error protection, and whether it matters:
• Bit not read: benign fault, no error
• Bit read, protected with Detection & Correction: benign fault, no error
• Bit read, protected with Detection only:
– Bit matters: True Detected Unrecoverable Error
– Bit does not matter: False Detected Unrecoverable Error
• Bit read, no error protection:
– Bit matters: Silent Data Corruption
– Bit does not matter: benign fault, no error
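The decision tree on this slide can be sketched as a small function. This is only an illustration: the names `Protection` and `classify_fault` are hypothetical and do not come from the talk, and the three inputs are assumed to be known for the bit in question.

```python
from enum import Enum


class Protection(Enum):
    """Hypothetical labels for the three protection cases on the slide."""
    NONE = "no protection"
    DETECT_ONLY = "detection only"
    DETECT_AND_CORRECT = "detection & correction"


def classify_fault(bit_read, protection, bit_matters):
    """Return the outcome of a single bit flip, following the slide's tree."""
    if not bit_read:
        return "benign fault, no error"
    if protection is Protection.DETECT_AND_CORRECT:
        return "benign fault, no error"
    if protection is Protection.DETECT_ONLY:
        # Detection without correction always raises an error; it is a
        # "false" one when the bit would not have mattered anyway.
        if bit_matters:
            return "True Detected Unrecoverable Error"
        return "False Detected Unrecoverable Error"
    # Unprotected bit: the flip is only harmful if the bit matters.
    if bit_matters:
        return "Silent Data Corruption"
    return "benign fault, no error"


if __name__ == "__main__":
    print(classify_fault(True, Protection.NONE, True))
    print(classify_fault(True, Protection.DETECT_ONLY, False))
```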
Does bit matter?
• Architectural Vulnerability Factor (AVF)
– Probability that a bit flip will cause a user-visible error
• Soft Error Rate (SER) of a Structure =
AVF_bit × (# Bits) × (Intrinsic Error Rate)_bit
• Reducing AVF reduces SER
– High AVF indicates need for protection
– Low AVF can help remove protection hardware
• SER Protection can be Expensive
– Impacts Area, Power, Performance, Design Time
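As a worked instance of the SER formula above, the sketch below multiplies the three terms. The function name `structure_ser` and all numeric inputs are made up for illustration; they are not figures from the talk.

```python
def structure_ser(avf_bit, num_bits, intrinsic_rate_per_bit):
    """SER of a structure = AVF_bit x (# bits) x intrinsic error rate per bit."""
    if not 0.0 <= avf_bit <= 1.0:
        raise ValueError("AVF is a probability and must lie in [0, 1]")
    return avf_bit * num_bits * intrinsic_rate_per_bit


if __name__ == "__main__":
    # Hypothetical structure: 4096 bits, arbitrary per-bit error rate units.
    high = structure_ser(0.40, 4096, 0.001)
    low = structure_ser(0.20, 4096, 0.001)
    # Halving the AVF halves the structure's SER, all else being equal.
    print(high, low, low / high)
```

Because the relationship is a plain product, any AVF reduction translates into a proportional SER reduction for the same bit count and intrinsic rate.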
Simple Examples
• Committed Program Counter AVF ~ 100%
• Branch Predictor AVF = 0%
Complex Examples
• Instruction Queue AVF = 29%
• Execution Units AVF = 9%
• Used a new concept
– Architecturally Correct Execution (ACE)
Architecturally Correct Execution
(ACE)
[Figure: data flow from Program Input to Program Outputs.]
• ACE path requires only a subset of values to flow
correctly through the program’s data flow graph (and the
machine)
• Anything else (un-ACE path) can be derated away
Example of un-ACE instruction:
Dynamically Dead Instruction
Most bits of an un-ACE instruction do not affect
program output
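A minimal sketch of spotting dynamically dead instructions in a trace follows. It assumes the usual meaning of the term (an instruction whose result is never used), which the slide does not spell out. The trace format, a list of `(dest, sources)` register tuples, and the name `find_dynamically_dead` are hypothetical. The sketch only catches results that are overwritten before any read; values still live at the end of the trace are conservatively treated as used.

```python
def find_dynamically_dead(trace):
    """Return indices of instructions whose result is overwritten unread.

    trace: list of (dest, sources) tuples, where dest is a register name
    (or None) and sources is a list of register names.
    """
    dead = []
    last_writer = {}  # register -> index of the instruction that last wrote it
    was_read = {}     # register -> has the last-written value been read yet?
    for idx, (dest, sources) in enumerate(trace):
        # Sources are consumed before the destination is written.
        for src in sources:
            if src in last_writer:
                was_read[src] = True
        if dest is not None:
            if dest in last_writer and not was_read[dest]:
                dead.append(last_writer[dest])
            last_writer[dest] = idx
            was_read[dest] = False
    return dead


if __name__ == "__main__":
    trace = [
        ("r1", []),      # 0: r1 written ...
        ("r1", []),      # 1: ... and overwritten before any read
        ("r2", ["r1"]),  # 2: reads the value written by instruction 1
        ("r3", ["r2"]),  # 3
    ]
    print(find_dynamically_dead(trace))
```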
ACE Breakdown of Instruction Queue
• IDLE: 31%
• ACE: 29%
• Ex-ACE: 10%
• NOP: 15%
• PREDICATED FALSE: 3%
• WRONG PATH: 3%
• DYNAMICALLY DEAD: 8%
• PERFORMANCE INST: 1%
Average across all of Spec2K slices for an IA64-like processor
ACE % = AVF = 29%
A New AVF Analysis –
Address-Based Structures
• Caches, data translation buffers, store buffers
– Make up large portions of a modern chip
• Simple ACE analysis is no longer enough
• Data & Tag structures need new concepts
– Extended Lifetime Analysis
– Hamming-Distance-1 Analysis
– Cooldown
– AVF Reduction - Flushing
Lifetime Analysis
• Idle is unACE
[Timeline: Idle, Fill, Valid, Read, Valid, Read, Valid, Evict, Idle]
– Assuming all time intervals are equal
– For 3/5 of the lifetime the bit is valid
– Gives a measure of the structure’s utilization
• Number of useful bits
• Amount of time useful bits are resident in structure
• Valid for a particular trace
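The 3/5 figure above can be reproduced with a short sketch. The event format, a time-ordered list of `(time, kind)` pairs, and the function name `valid_fraction` are assumptions made for this example; the time stamps simply encode five equal intervals.

```python
def valid_fraction(events, total_time):
    """Fraction of the observed time during which the entry holds valid data.

    events: time-ordered list of (time, kind), kind in {"fill", "read", "evict"}.
    """
    valid_time = 0
    fill_time = None
    for time, kind in events:
        if kind == "fill":
            fill_time = time
        elif kind == "evict" and fill_time is not None:
            # The entry is valid from its fill until its eviction.
            valid_time += time - fill_time
            fill_time = None
    return valid_time / total_time


if __name__ == "__main__":
    # Five equal intervals: idle, fill-to-read, read-to-read, read-to-evict, idle.
    events = [(1, "fill"), (2, "read"), (3, "read"), (4, "evict")]
    print(valid_fraction(events, total_time=5))
```

An entry that is filled but not evicted within the observed window contributes nothing here; handling that case is the subject of the edge-effects discussion later in the talk.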
Lifetime Analysis of Write-through
Data Cache
• Valid is not necessarily ACE
[Timeline, Write-through Data Cache: Idle, Fill, Read, Read, Evict, Idle]
• ACE % = AVF = 2/5 = 40%
• Example Lifetime Components
– ACE: fill-to-read, read-to-read
– unACE: idle, read-to-evict, write-to-evict
Lifetime Analysis of Write-through
Data Cache
• Data ACEness is a function of instruction ACEness
[Timeline, Write-through DCache: Idle, Fill, Read, Read, Evict, Idle]
• Second Read is by an unACE instruction
• AVF = 1/5 = 20%
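Both write-through examples (40% and 20%) can be reproduced with the sketch below. It assumes a single data bit, a trace with only fills, reads, and evictions (no writes), and that the time from a fill up to the last read by an ACE instruction is ACE while everything else is unACE. The event format and the name `write_through_avf` are hypothetical.

```python
def write_through_avf(events, total_time):
    """AVF of one write-through cache bit from a time-ordered event list.

    events: list of (time, kind, by_ace_instruction) with kind in
    {"fill", "read", "evict"}; the flag is only meaningful for reads.
    """
    ace_time = 0
    fill_time = None
    last_ace_read = None
    for time, kind, by_ace in events:
        if kind == "fill":
            fill_time, last_ace_read = time, None
        elif kind == "read" and by_ace and fill_time is not None:
            last_ace_read = time
        elif kind == "evict" and fill_time is not None:
            # Only fill-to-last-ACE-read counts; read-to-evict is unACE
            # because memory still holds a correct copy.
            if last_ace_read is not None:
                ace_time += last_ace_read - fill_time
            fill_time = None
    return ace_time / total_time


if __name__ == "__main__":
    both_ace = [(1, "fill", False), (2, "read", True),
                (3, "read", True), (4, "evict", False)]
    second_unace = [(1, "fill", False), (2, "read", True),
                    (3, "read", False), (4, "evict", False)]
    print(write_through_avf(both_ace, 5))      # both reads by ACE instructions
    print(write_through_avf(second_unace, 5))  # second read by an unACE instruction
```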
Tags are Hard
• A fault associated with a tag that is nominally
associated with a particular instruction can
impact the correct execution of a different
independent instruction
• False Negatives cause an error only if a writeback is necessary
– Uses standard lifetime analysis
• False Positives always result in error
– Need bit-level analysis
False Positive
• Expect: Incoming Address 1001 compared against a Tag Address that differs in one bit: MISS
• Acquire: Incoming Address 1001 compared against Tag Address 1001 (after the bit flip): HIT
• Expected Tag Miss, but got Hit – Error
• How do you compute the AVF? Fault injection?
Hamming-Distance-1 Analysis
• Assuming a single-bit error model
Incoming Address: 101010
Tag Array:
001010 (Hamming-Distance-1 Match)
111010 (Hamming-Distance-1 Match)
000001
111000
010101
111111
• Now we can use lifetime analysis on the
identified bit(s)
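A sketch of the Hamming-distance-1 check follows. The bit patterns are those on the slide, with 101010 assumed to be the incoming address and the remaining six values assumed to be tag entries (the only assignment that yields exactly two matches). The function name `hd1_matches` is hypothetical.

```python
def hd1_matches(incoming, tags):
    """Return (tag index, differing bit position) for every tag that is
    exactly one bit flip away from the incoming address."""
    matches = []
    for index, tag in enumerate(tags):
        diff = incoming ^ tag
        # Exactly one bit set: nonzero and a power of two.
        if diff != 0 and (diff & (diff - 1)) == 0:
            matches.append((index, diff.bit_length() - 1))
    return matches


if __name__ == "__main__":
    incoming = 0b101010
    tags = [0b001010, 0b111010, 0b000001, 0b111000, 0b010101, 0b111111]
    for index, bit in hd1_matches(incoming, tags):
        print(f"tag {tags[index]:06b}: a flip of bit {bit} would cause a false hit")
```

Under the single-bit error model, only the reported bit of each matching tag can turn a miss into a false hit, so lifetime analysis needs to track just those bits.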
Edge Effects
• Simulation introduces unknown component
– Simulation not run to completion
– Only execute small segment of code
[Timeline: Idle, Fill, Read, Read, Sim End; the interval running into Sim End is Unknown, and the later Evict and Idle are Not Simulated.]
• Worst Case AVF = Known AVF + Unknown AVF
• How do we reduce/eliminate unknown?
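The bookkeeping behind the worst-case bound can be sketched as follows. The interval lengths are hypothetical, and the rule used to resolve the unknown interval (a later read makes it ACE, a later eviction makes it unACE) is an assumption that matches the write-through data case only.

```python
def avf_bounds(ace_time, unace_time, unknown_time):
    """Return (known AVF, worst-case AVF) for one bit."""
    total = ace_time + unace_time + unknown_time
    known = ace_time / total
    worst = known + unknown_time / total  # Known AVF + Unknown AVF
    return known, worst


def resolve_unknown(ace_time, unace_time, unknown_time, next_event):
    """Fold the unknown interval into ACE or unACE once the next event
    for the entry has been observed past the end of the measured window."""
    if next_event == "read":
        return ace_time + unknown_time, unace_time, 0
    if next_event == "evict":
        return ace_time, unace_time + unknown_time, 0
    return ace_time, unace_time, unknown_time  # still unresolved


if __name__ == "__main__":
    # Hypothetical trace: 1 idle, 2 ACE (fill-to-read, read-to-read), 1 unknown.
    print(avf_bounds(ace_time=2, unace_time=1, unknown_time=1))
    # Continuing the run shows the entry is evicted next.
    print(avf_bounds(*resolve_unknown(2, 1, 1, "evict")))
```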
Cooldown
• Run the simulation beyond the end of the measured interval
– Any bits that were already valid (the unknown bits) are resolved
• Trend: unknown AVF primarily resolves to unACE
• Best Estimate AVF = Known AVF after Cooldown
• 10 Million Instructions Simulation
• 10 Million Instructions Cooldown
[Chart: AVF % (axis 0 to 50), Cooldown vs. No Cooldown, for Dcache Data (WB), Dcache Data (WT), dTLB Data, Dcache Tags (WB), Dcache Tags (WT), dTLB Tags.]
Data AVFs (Average)
[Chart: average data AVF % (axis 0 to 50) for STB, DTB, Dcache (WB), Dcache (WT).]
• STB AVF lower due to large idle component and bytemasks
• DTB AVF higher due to high average utilization
• Dcache (WB) AVF higher than Dcache (WT) since dirty bytes still ACE after last read
Data AVF of DTB
[Chart: Best Estimate AVF % (axis 0 to 100) per benchmark: bzip2_source, cc_166, crafty, eon_kajiya, gap, gzip_graphic, mcf, parser, perlbmk_makerand, twolf, vortex_lendian3, vpr_route, ammp, applu, apsi, art_1, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise.]
• Large variability in AVF
• Ranges from ~0% to 80%
• Based on structure utilization by benchmark
Tag AVFs (Average)
[Chart: average tag AVF % (axis 0 to 50) for STB, DTB, Dcache (WB), Dcache (WT).]
• Tag AVFs lower than expected for DTB and DCache (WT)
– Only Hamming-Distance-1 matches contribute ACE time
• Tag AVFs higher than data for STB and DCache (WB)
– Dynamically dead tags are still ACE for dirty bytes
Tag AVF of DTB
[Chart: Best Estimate AVF % (axis 0 to 100) per benchmark: bzip2_source, cc_166, crafty, eon_kajiya, gap, gzip_graphic, mcf, parser, perlbmk_makerand, twolf, vortex_lendian3, vpr_route, ammp, applu, apsi, art_1, equake, facerec, fma3d, galgel, lucas, mesa, mgrid, sixtrack, swim, wupwise.]
• AVFs surprisingly small, little variation
• Protection added to DTB CAMs prior to AVF calculation (large # bits)
• AVF calculation shows NO protection was needed in this case
AVF Observations
• DTB and Write-through Data Cache
– Typically Tag AVF < Data AVF
• Only Hamming-Distance-1 hits contribute to Tag AVF
• Dynamically dead data are unACE
• STB and Write-back Data Cache
– Typically Tag AVF ≥ Data AVF
• Tag is ACE until eviction if the line is dirty
• Dynamically dead data can be ACE
• Bytemasks and writes may make certain bytes of data unACE while all bits of the tag are always ACE
AVF Reduction: Flushing
• Flushing (emulates a context switch)
– Also eliminates unknowns by flushing all live entries
at end of simulation
• Main concept: transform part of the ACE time into unACE time at the expense of some performance
[Timeline: Idle, Fill, Read, Read, Flush, Fill, Evict, Idle, with ACE intervals marked.]
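The ACE-for-performance trade can be illustrated with a toy model of a single write-through entry. Everything here is an assumption made for the sketch: flushes occur at multiples of `flush_period` and take no time, every read is by an ACE instruction, a read after a flush misses and refills the entry instantly, and the name `ace_time_with_flush` and the time stamps are hypothetical.

```python
def ace_time_with_flush(fill_time, read_times, flush_period=None):
    """ACE time of one entry, optionally with periodic flushes.

    ACE time is counted from each (re)fill to the last read that hits
    before the next flush invalidates the entry.
    """
    ace = 0
    start = fill_time  # time the current copy of the entry was filled
    last_read = None   # last read that hit on the current copy
    for t in sorted(read_times):
        flushed = (flush_period is not None
                   and t // flush_period > start // flush_period)
        if flushed:
            # Close the segment that ended at the flush; this read misses
            # and refills the entry (the performance cost of flushing).
            if last_read is not None:
                ace += last_read - start
            start, last_read = t, None
        else:
            last_read = t
    if last_read is not None:
        ace += last_read - start
    return ace


if __name__ == "__main__":
    reads = [40, 90, 140]
    print(ace_time_with_flush(0, reads))                    # no flushing
    print(ace_time_with_flush(0, reads, flush_period=100))  # flush every 100 cycles
```

In this toy trace the flush removes the interval between the read at 90 and the read at 140 from the ACE total, and the price is one extra miss at cycle 140.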
AVF Reduction: Flushing
[Chart: AVF % (axis 0 to 40) for Data and Tags of the Writethrough, Writeback, and DTB structures under No Flushing (base), 5M cycle Flush, 1M cycle Flush, and 100K cycle Flush.]
• >50% AVF reduction for 100K cycle Flush (Flush takes 0 time)
• Max IPC reduction: 1.77% DTB, 1.25% WT/WB DCache
• Avg IPC reduction: 0.56% DTB, 0.19% WT/WB DCache
Summary
• SER is an ever-increasing problem
– Need standard, quantitative way to evaluate design
cost of adding protection/recovery to structures
• AVF gives us a quantitative way to measure the cost of adding protection
• Presented a methodology to compute the AVF of Address-Based Structures
– Lifetime Analysis
– False Negatives and False Positives
• Hamming Distance-1 Analysis for False Positives
– Edge Effects and Cooldown
• Analogous to Warmup
– AVF Reduction - Flushing