Self-calibrating Online Wearout Detection

Authors:
Jason Blome
Shuguang Feng
Shantanu Gupta
Scott Mahlke
MICRO-40
December 3, 2007
University of Michigan
Electrical Engineering and Computer Science
Motivation

“Designing Reliable Systems from Unreliable Components…”
- Shekhar Borkar (Intel)
More failures to come
Failures will be wearout-induced
[Srinivasan, DSN‘04]
[Borkar, MICRO‘05]
Current Approaches

Traditional
  Design margins
  Burn-in
Detection: based on replication of computation
  TMR (Tandem/HP NonStop servers)
  DIVA (Bower, MICRO’05)
Prediction: utilizes precise analytical models and/or sensors
  Canary circuits (Sentinel Silicon, RidgeTop)
  RAMP (Srinivasan, UIUC/IBM)
Drawbacks noted: impractical, static, costly
Wearout Mechanisms

Many failure mechanisms have been shown to be progressive
  Hot carrier injection (HCI)
  Negative Bias Temperature Instability (NBTI)
  Electromigration (EM)
  Oxide Breakdown (OBD)
[Figure: MOSFET cross-section. Labels: G, S, D, B terminals; N+ regions; P-well; oxide; gate leakage currents Igs, Igd, Igcs, Igcd, Igb]
Objective

Propose a failure prediction technique that exploits the progressive nature of wearout
Monitor impact on path delays

Prediction
• Monitors evolution of wearout
• Proactive
• Enables failure avoidance/mitigation
• Continuous feedback
• False negatives and positives

Detection
• Identifies existing fault
• Reactive
• Enables failure recovery
• End-of-life feedback
• False negatives
Oxide Breakdown (OBD)

Accumulation of defects leads to a conductive path
Percolation Model [Stathis, JAP‘06]
[Figure: MOSFET cross-sections. Labels: G, S, D, B terminals; N+ regions; P-well; oxide; ΔIoxide]
OBD HSPICE Model
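Post-breakdown leakage modeling
$I_{gs} = K \cdot I_{gs0}$
$I_{gd} = K \cdot I_{gd0}$
$I_{gcs}$, $I_{gcd}$, and $I_{gb}$ remain unchanged
[BSIM4.6.0, ‘06]
[Rodriguez, Stathis, Linder, IRPS ‘03]
[Figure: MOSFET cross-section. Labels: G, S, D, B terminals; N+ regions; P-well; gate leakage currents Igs, Igd, Igcs, Igcd, Igb]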
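The scaling rule above can be illustrated with a minimal Python sketch. The function name `post_breakdown_leakage`, the dictionary keys, and the numeric values in the usage lines are hypothetical; the slides apply this model inside HSPICE with BSIM4 leakage components, not in Python.

```python
def post_breakdown_leakage(nominal, k):
    """Return leakage currents after oxide breakdown for a scale factor K.

    `nominal` maps component names (hypothetical keys) to pre-breakdown
    currents in amperes. Only the gate-to-source and gate-to-drain
    components are multiplied by K; Igcs, Igcd and Igb are copied unchanged,
    as stated on the slide.
    """
    scaled = dict(nominal)              # copy so the caller's values stay intact
    scaled["igs"] = k * nominal["igs"]  # Igs = K * Igs0
    scaled["igd"] = k * nominal["igd"]  # Igd = K * Igd0
    return scaled


# Illustrative values only; they are not taken from the slides.
fresh = {"igs": 1e-9, "igd": 2e-9, "igcs": 3e-9, "igcd": 4e-9, "igb": 5e-9}
worn = post_breakdown_leakage(fresh, k=10.0)
```

Sweeping `k` upward is one way to emulate progressively worse breakdown in a characterization run.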
Characterization Testbench

90nm standard cell library
[Figure: characterization testbench schematic. Labels: DC, BUFX4 (x2), Gate UUT, FO4 GATE, FO4 BUFX4, tcell, tcircuit]
Impact on Propagation Delay
Delay Profiling Unit (DPU)
[Figure: DPU block diagram. Labels: input signal, Latency Sampling, uArch Module]
TRIX Analysis
Magnitude of divergence between TRIXglobal and TRIXlocal reflects the amount of degradation
TRIX Analysis Details

Exponential Moving Average (EMA)
$EMA_t = EMA_{t-1} + \alpha \, (price_t - EMA_{t-1})$
where $\alpha$ is defined by the window size

Triple-smoothed Exponential Moving Average
$EMA1_t = EMA1_{t-1} + \alpha \, (price_t - EMA1_{t-1})$
$EMA2_t = EMA2_{t-1} + \alpha \, (EMA1_t - EMA2_{t-1})$
$EMA3_t = EMA3_{t-1} + \alpha \, (EMA2_t - EMA3_{t-1})$
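The recurrences above can be written as a short Python sketch. Two choices here are assumptions, not taken from the slides: the smoothing factor uses the common convention alpha = 2 / (window + 1), since the slide only says alpha is defined by the window size, and all three averages are seeded with the first sample. The function names are hypothetical, and `samples` plays the role of `price`.

```python
def alpha_from_window(window):
    """Assumed convention: alpha = 2 / (N + 1) for a window of N samples."""
    return 2.0 / (window + 1)


def ema_step(prev, sample, alpha):
    """One EMA update: EMA_t = EMA_{t-1} + alpha * (sample_t - EMA_{t-1})."""
    return prev + alpha * (sample - prev)


def triple_ema(samples, window):
    """Return the triple-smoothed EMA (EMA3) for each input sample."""
    if not samples:
        return []
    alpha = alpha_from_window(window)
    ema1 = ema2 = ema3 = samples[0]  # assumption: seed with the first sample
    smoothed = []
    for x in samples:
        ema1 = ema_step(ema1, x, alpha)     # smooths the raw samples
        ema2 = ema_step(ema2, ema1, alpha)  # smooths EMA1
        ema3 = ema_step(ema3, ema2, alpha)  # smooths EMA2
        smoothed.append(ema3)
    return smoothed
```

Because each stage feeds the next, a single outlier is attenuated three times before it reaches the output, which is why the triple-smoothed form suits noisy measurements.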
Noisy Latency Profile
[Figure: Percent Nominal Delay (%), 94 to 110, versus increasing age. Series: Raw Latency Profile, Trix Profile (local), Trix Profile (global)]
DPU with TRIX Hardware
[Figure: DPU with TRIX hardware. Labels: input signal, Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
Wearout Detection Unit (WDU)
[Figure: WDU block diagram. Labels: Latency Sampling, TRIXl Calculation, TRIXg Calculation, Prediction]
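The comparison between TRIXl and TRIXg can be sketched in Python as a streaming monitor. This is an illustration under stated assumptions, not the hardware design from the talk: the class names, the window sizes (a short window standing in for TRIXl, a long one for TRIXg), the alpha = 2 / (window + 1) convention, and the fixed divergence threshold are all hypothetical.

```python
class TripleEma:
    """Streaming triple-smoothed EMA seeded with an initial value."""

    def __init__(self, window, seed):
        self.alpha = 2.0 / (window + 1)  # assumed convention for alpha
        self.e1 = self.e2 = self.e3 = seed

    def update(self, sample):
        a = self.alpha
        self.e1 += a * (sample - self.e1)
        self.e2 += a * (self.e1 - self.e2)
        self.e3 += a * (self.e2 - self.e3)
        return self.e3


class WearoutMonitor:
    """Flags a signal when the local trend rises above the global trend."""

    def __init__(self, nominal, local_window=4, global_window=64,
                 threshold_pct=2.0):
        self.nominal = nominal
        self.local = TripleEma(local_window, nominal)    # fast-moving trend
        self.global_ = TripleEma(global_window, nominal)  # slow-moving trend
        self.threshold_pct = threshold_pct               # hypothetical setting

    def observe(self, latency):
        """Feed one latency sample; return (divergence in % of nominal, flag)."""
        local = self.local.update(latency)
        glob = self.global_.update(latency)
        divergence_pct = 100.0 * (local - glob) / self.nominal
        return divergence_pct, divergence_pct > self.threshold_pct
```

With these settings a single slow sample moves the local trend only slightly, so it is not flagged, while a sustained increase in latency pulls the local trend away from the global one and eventually raises the flag.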
Evaluation Framework
[Figure: evaluation flow. Labels: OR1200 Verilog; 90nm Library; Synthesis and Place and Route; Fully Synthesized, P&R, OR1200 Core; MediaBench Suite; Workload Simulator; Gate-level Processor Simulator; Timing, Power, and Temperature Simulations; OBD Wearout Model; HSPICE Simulations; Monte Carlo Simulator; Wearout Simulator]
WDU Accuracy
[Figure: Percentage (%), 0 to 120, per module (ALU, Register File, LSU, Next PC). Series: Life Expended, Signals Flagged]
WDU Overhead
[Figure: Percentage Overhead (%), 0 to 50, versus # Signals Monitored (1, 2, 4, 8). Series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]
WDU Overhead
[Figure: Percentage Overhead (%), 0 to 3, versus # Signals Monitored (1, 2, 4, 8). Series: Area-Hybrid, Area-Hardware, Power-Hybrid, Power-Hardware]
Long-term Vision

Introspective Reliability Management (IRM)
  Intelligent reliability management directed by on-chip sensor feedback
Prospective sensors
  Delay (WDU)
  Leakage/Vt
  Temperature
Introspective Reliability Management
[Figure: IRM architecture. Labels: OS, Scheduled Jobs, Virtualization Layer, Job Assignment, Thread Migration, Reconfiguration, Configuration, DVFS Settings, Power/CLK Gating, Reliability Assessment, IRM Policy, Runtime Analysis, Aggregate Analysis, Processed Data, Filtered Data Stream, Sensor Data Filtering and Analysis, Raw Sensor Data, WDU]
Conclusions

Many progressive wearout phenomena impact device-level performance.
  It’s possible to characterize this impact and anticipate failures
WDU performance
  Failure predicted within 20% of end of life (tunable)
  Area overhead < 3% (hybrid)
Low-level sensors can be used to enable intelligent reliability management
Questions?