Software-Based On-Line Detection of Hardware Defects

Online Low-Cost Defect Tolerance Solutions
for Microprocessor Designs
Thesis Proposal
Tue. Dec. 18th 2007
Kypros Constantinides
Advisor: Todd Austin
Department of Electrical Engineering and Computer Science
University of Michigan
Reliability Challenges of Technology Scaling
Age-related wearout
- Electromigration
- Gate-oxide breakdown (TDDB)
Transient Faults (due to natural radiation)
Parametric Process Variation (uncertainty in device & environment)
Manufacturing Defects (that escape testing and burn-in)
Thermal Runaway
- Increased heating
- Higher transistor leakage
- Higher power dissipation
[Figure: transistor cross-section (source, gate, drain, N+ regions, P substrate) with induced charge]
Reliability Challenges of Technology Scaling
[Figure: cost versus silicon process technology, with curves for product cost, cost per transistor, and reliability cost; past a certain point further scaling is not profitable]
Reliability cost
1) Cost of built-in defect tolerance mechanisms
2) Cost of R&D needed to develop reliable technologies
Suggested Approach
1) Build products out of unreliable components/technologies
2) Provide reliability through very low cost defect-tolerance techniques
Presentation Outline
Previous Work – Traditional Techniques
Preliminary Results
- BulletProof – A Hardware-Based Defect Tolerance Technique
- ACE Testing – A Software-Based Defect Tolerance Technique
Future Work
Timeline
Traditional Defect Tolerance Techniques
Used in high-end life-critical systems (e.g., aviation)
- Triple Modular Redundancy (voting scheme)
- N-Version Hardware
[Figure: Triple Modular Redundancy: three modules feeding voting logic]
[Figure: 2-Version Hardware: Processor Type A and Processor Type B feeding a checker]
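The voting scheme behind Triple Modular Redundancy can be sketched in a few lines. This is only an illustration, not code from the proposal: the names `vote` and `tmr` and the toy modules are hypothetical, and a real voter is a hardware circuit rather than software.

```python
from collections import Counter
from typing import Callable, Sequence


def vote(outputs: Sequence[int]) -> int:
    """Return the value produced by a strict majority of the modules."""
    value, count = Counter(outputs).most_common(1)[0]
    if 2 * count <= len(outputs):
        # No strict majority: the voter cannot mask the fault.
        raise ValueError("no majority among module outputs")
    return value


def tmr(modules: Sequence[Callable[[int], int]], x: int) -> int:
    """Run every redundant module on the same input and vote on the results."""
    return vote([module(x) for module in modules])


# Hypothetical modules: two healthy copies and one defective copy.
def healthy(x: int) -> int:
    return 2 * x


def defective(x: int) -> int:
    return 2 * x + 1
```

With two healthy copies and one defective copy, `tmr([healthy, defective, healthy], 21)` returns the healthy result, which is how a single faulty module is masked. The price is three copies of the hardware plus the voter.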
Examples of More Recent Research Approaches
- Processor Checking (DIVA – Austin, MICRO’99)
- Task Checking (Argus – Meixner, MICRO’07)
[Figure: Processor Checking: main processor verified by a checker processor]
[Figure: Task Checking: main processor with control-flow, data-flow, computation, and memory checkers]
Shortcomings of Existing Techniques
Existing techniques continuously check for execution errors
- Redundant computation requires significant extra hardware – high area overhead
- Continuous checking consumes significant energy – pressure on power budget
Suitable for high-end or life-critical systems
BUT, too costly to employ for mainstream systems
Thesis Goal
Thesis: Defect-tolerance techniques can provide the same level of reliability as traditional techniques, but at a much lower cost.
Goals:
- Area: ultra low-cost solution, < 5% area cost
- Reliability: ~99% of defects are detectable and recoverable
- Performance: low runtime performance overhead (due to testing), < 10%
- After recovery the system still operates in degraded performance mode, < 10%
[Figure: thesis goal on three axes: Provided Reliability ~99%, Area Cost < 5%, Performance < 10%]
BulletProof Pipeline - Overview (ASPLOS’06, DATE’07)
[Figure: BulletProof pipeline: IF, ID, EX, MEM, and WB stages, each paired with a checker (checkers with SEU detection), connected by IF/ID, ID/EX, EX/MEM, and MEM/WB latches; I-cache, D-cache, memory and L2 cache; external interrupts and I/O transfers; checkpoint scan chain. Legend: speculative vs. non-speculative state, trans-epoch vs. intra-epoch data]
[Figure: timeline of checkpoint intervals: computation proceeds with on-line distributed testing using checkers; state is speculative during each checkpoint interval, which is bounded by checkpoints]
BulletProof: Distributed Testing and Recovery
[Figure: pipeline with IF/ID, ID/EX, EX/MEM, and MEM/WB latches; each stage has a local tester and a local checker; a defective unit (marked X) triggers checkpoint recovery and reconfiguration]
[Figure: timeline of computational epochs: computation overlapped with testing, and computation with no testing; marked events: state checkpoint, fault manifests, fault detected, testing complete]
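One way to read the epoch timeline is as a commit-or-rollback loop: results stay speculative for the duration of an epoch, and they are committed only if testing finishes without flagging a defect. The sketch below assumes exactly that reading. `Pipeline`, `run_epoch`, and the component name `alu0` are hypothetical names for illustration, and the integer state stands in for the architectural state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Pipeline:
    committed: int = 0                                 # last checkpointed (non-speculative) state
    disabled: List[str] = field(default_factory=list)  # components taken offline after a failed test


def run_epoch(p: Pipeline,
              compute: Callable[[int], int],
              run_tests: Callable[[], List[str]]) -> bool:
    """Run one computational epoch; return True if its results were committed."""
    speculative = compute(p.committed)  # results remain speculative during the epoch
    failing = run_tests()               # names of components whose test failed
    if failing:
        p.disabled.extend(failing)      # reconfigure: stop using the defective components
        return False                    # discard speculative results; checkpoint is untouched
    p.committed = speculative           # testing passed: this becomes the new checkpoint
    return True
```

Usage: after an epoch that returns `False`, the caller re-executes the same work from `p.committed` on the reconfigured (degraded) pipeline.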
Experimental Methodology – Baseline Architecture

Baseline Architecture:
- 5-stage 4-wide VLIW architecture, 32KB I-Cache, 32KB D-Cache
- Embedded designs: Need high reliability with high cost sensitivity
Circuit-Level Evaluation:
- Prototype with a physical layout (TSMC 0.18um)
- Accurate area overhead estimations
- Accurate fault coverage area estimations
Architecture-Level Evaluation:
- Trimaran toolset & Dinero IV cache simulator
- Average computational epoch size
- Performance while in graceful degradation
- Benchmarks: SPECINT2000, MediaBench, MiBench
[Figure: baseline pipeline: PC, 32KB I-cache (address/data), IF/ID, four decoders, ID/EX, 4-write/8-read register file, two ALUs, two multipliers, two address generators, EX/MEM, 32KB D-cache, MEM/WB]
Design Defect Coverage

Defect Coverage: fraction of the total design area in which a defect can be detected and corrected
Stage   Coverage
IF      92.5%
ID      93.6%
EX      97.7%
MEM     92.6%
WB      92.7%
Overall Design Defect Coverage: 95.2%
Area Overhead Summary
Component     Area overhead   (share of total overhead)
EX            11.09%          (86%)
RF            1.26%           (9.8%)
ID            0.22%           (1.7%)
L1 I-Cache    0.08%           (0.66%)
L1 D-Cache    0.08%           (0.66%)
IF            0.07%           (0.6%)
WB            0.06%           (0.5%)
Overall design area cost: 12.9%
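As a quick consistency check, the per-component overheads can be summed and compared with the reported overall area cost. The sketch assumes the parenthesized figures on the slide are each component's share of the total overhead; the function names are illustrative only.

```python
# Per-component area overhead in percent of the design area, copied from the slide.
overhead = {
    "EX": 11.09,
    "RF": 1.26,
    "ID": 0.22,
    "IF": 0.07,
    "WB": 0.06,
    "L1 I-Cache": 0.08,
    "L1 D-Cache": 0.08,
}


def total_overhead(parts: dict) -> float:
    """Sum of all per-component overheads."""
    return sum(parts.values())


def shares(parts: dict) -> dict:
    """Each component's share (in percent) of the total overhead."""
    total = total_overhead(parts)
    return {name: 100.0 * value / total for name, value in parts.items()}
```

The components sum to 12.86%, which rounds to the reported 12.9%, and the EX stage alone accounts for roughly 86% of the added area.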
BulletProof Summary
BulletProof Pipeline:
- Provided Reliability: 95.2%
- Silicon Area Cost: 12.9%
- Runtime Performance Overhead: <1%
Trade off runtime performance to get lower area overhead and higher reliability
Software-Based Defect Detection (MICRO’07)
1) Move the hardware checking overhead to software
2) Firmware periodically stalls the processor and performs hardware checking
3) Provide architectural support to the software checking routines
Advantages over hardware-based techniques
- Lower area overhead
- Higher runtime flexibility
  - it can support multiple fault models
  - dynamic tuning of testing process
- Easier to upgrade (software patches)
[Figure: firmware periodically stalls the processor and runs hardware checking routines; architectural support gives software-based checking accessibility and controllability over the hardware]
Access-Control Extensions (ACE) Framework

- ACE Hardware: architectural support that enables software access to the processor state
- ACE Instructions: special instructions can access and control any part of the processor state
- ACE Firmware: firmware can periodically run directed hardware tests
[Figure: system stack: Applications, Operating System, ACE Firmware (software); ACE Extension (ISA); ACE Hardware, Processor State, Processor (hardware)]
Accessing The Processor State (ACE Hardware)


- We leverage the existing full hold-scan chain infrastructure
- Full hold-scan chains are employed by most modern processors to improve/automate manufacturing testing
[Figure: scan state (shadow processor state) alongside the processor state]
Accessing The Processor State (ACE Hardware)
- ACE Instructions can move values from the architectural registers to the scan state and vice versa
- ACE Instructions can swap data between the scan state and the processor state
[Figure: ACE tree of ACE nodes connecting the register file to the scan state, which shadows the processor state]
Software-based Testing & Diagnosis (ACE Firmware)


Step 1: Load test pattern into scan state
Step 2: 3-cycle atomic test operation
- Cycle 1: Swap scan state with processor state
- Cycle 2: Test cycle
- Cycle 3: Swap scan state with processor state
Step 3: Validate test response
[Figure: ATPG (automatic test pattern & response generation) produces test patterns and test responses stored in memory; a pattern travels through the register file and ACE nodes into the scan state, is swapped into the processor state for the test cycle, and the captured test response is swapped back out and validated]
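The three steps can be sketched with a toy model in which the processor state and the scan state are bit vectors of equal width. Everything here is an assumption made for illustration: `Core`, `run_ace_test`, the `rotate_and_invert` logic, and the single stuck-at fault are hypothetical, and the expected response is taken from a fault-free run instead of an ATPG tool.

```python
from typing import Callable, List, Optional, Tuple

Bits = List[int]


class Core:
    """Toy core: a processor state plus a shadow scan state of equal width."""

    def __init__(self, logic: Callable[[Bits], Bits], width: int,
                 stuck_at: Optional[Tuple[int, int]] = None) -> None:
        self.logic = logic
        self.proc: Bits = [0] * width
        self.scan: Bits = [0] * width
        self.stuck_at = stuck_at  # (bit index, forced value), or None if healthy

    def swap(self) -> None:
        """Exchange the scan state and the processor state."""
        self.proc, self.scan = self.scan, self.proc

    def clock(self) -> None:
        """One test cycle: the logic computes the next processor state."""
        nxt = list(self.logic(self.proc))
        if self.stuck_at is not None:
            index, value = self.stuck_at
            nxt[index] = value  # the defect forces this bit
        self.proc = nxt


def rotate_and_invert(state: Bits) -> Bits:
    """Hypothetical combinational logic under test."""
    return [1 - bit for bit in state[1:] + state[:1]]


def run_ace_test(core: Core, pattern: Bits, expected: Bits) -> bool:
    core.scan = list(pattern)     # Step 1: load test pattern into scan state
    core.swap()                   # Cycle 1: pattern enters the processor state
    core.clock()                  # Cycle 2: test cycle
    core.swap()                   # Cycle 3: response moves to scan, old state returns
    return core.scan == expected  # Step 3: validate test response
```

Because the operation begins and ends with a swap, the processor state that was live before the test is back in place afterwards, so in this model the test does not disturb the interrupted computation.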
Timeline of Software-Based Testing
Software-based testing is coupled with a checkpointing and recovery mechanism
[Figure: timeline of a checkpoint interval showing computation, the functional test, the ACE-based test, and checkpoints]
Functional software test
- Checks whether the core is capable of running ACE-based testing
- Limited fault coverage: 60-70%
- Very fast: < 1000 instructions
Directed ACE-based testing
- High-quality testing (ATPG patterns)
- High fault coverage: ~99%
- Runtime: < 1M instructions
Experimental Methodology

OpenSPARC T1 CMP – based on Sun’s Niagara
- Synopsys Design Compiler to synthesize the OpenSPARC CMP
- Synopsys TetraMAX ATPG tool for test pattern generation
RTL implementation of ACE framework to get area overhead
Microarchitectural simulation to get performance overhead
- SESC cycle-accurate simulator
- Simulate a SPARC core enhanced with the ACE framework
- Benchmarks from the SPEC CPU2000 suite
Preliminary Functional Testing


- Fault injection campaign on a gate-level netlist of a SPARC core
- Software functional test, 3 phases (~700 instructions):
  - Control flow check
  - Register access
  - Use all ISA instructions
Fault injection outcomes:
Outcome                          Share of injected faults
Register Access Assertion        23.36%
Incorrect Execution Assertion    21.38%
Control Flow Assertion           7.45%
Memory Error                     6.49%
Timeout                          1.57%
Illegal Execution                1.40%
Early Execution Termination      0.49%
Undetected Faults                37.86%
- Functional testing coverage is low, ~62%
- Undetected faults do not affect the execution of ACE firmware
- Full coverage provided with further ACE-based testing
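The ~62% coverage figure follows from adding up every outcome in which the fault was caught. The snippet below repeats that arithmetic; the pairing of labels with percentages is read off the flattened pie chart and should be treated as an assumption, in particular the split between "Early Execution Termination" and "Timeout".

```python
# Outcome shares (percent of injected faults), as read from the slide's pie chart.
outcomes = {
    "Memory Error": 6.49,
    "Illegal Execution": 1.40,
    "Early Execution Termination": 0.49,
    "Timeout": 1.57,
    "Control Flow Assertion": 7.45,
    "Register Access Assertion": 23.36,
    "Incorrect Execution Assertion": 21.38,
    "Undetected Faults": 37.86,
}

# Coverage is everything except the undetected slice.
coverage = sum(share for name, share in outcomes.items()
               if name != "Undetected Faults")
total = sum(outcomes.values())
```

The detected categories add up to 62.14%, and together with the undetected slice the chart accounts for 100% of the injected faults.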
Full-chip Distributed ACE-based Testing


- Chip testing is distributed to the eight SPARC cores
- Testing for stuck-at and path-delay fault models
Cores    Test Instructions   Coverage
[0,1]    312K                99.6%
[2,4]    468K                98.7%
[3,5]    405K                98.8%
[6,7]    333K                99.9%
Performance Overhead of ACE-Based Testing

- Performance overhead depends on the fault model used to generate patterns
- ACE framework is flexible enough to support test patterns from different fault models
[Figure: bar chart of average performance overhead (%), y-axis from 0 to 30, SPEC CPU2000 average with a 100M checkpoint interval; x-axis labels include Stuck-at, Stuck-at + Path Delay, N-Detect (N=2) + Path Delay, and N-Detect (N=4) + Path Delay, ordered toward higher quality testing]
ACE Framework Area Overhead



- RTL implementation of ACE Framework in Verilog
- Explored several ACE tree configurations
- 8 ACE trees (1 per core) to cover OpenSPARC
- ~230K ACE accessible bits
Area Overhead:
- 0.7% each tree
- 5.8% for ACE framework
ACE Testing Summary
- Provided Reliability: ~99%
- Silicon Area Cost: 5.8%
- Runtime Performance Overhead: 5-25%
[Figure: summary chart on the same three axes, with the BulletProof Pipeline shown for comparison]
Contributions to Date - Acknowledgements

BulletProof Pipeline (ASPLOS’06, DATE’07)
- Todd Austin and Valeria Bertacco (project supervision)
- Smitha Shyam and Sujay Phadke (ASPLOS’06)
  - Physical prototype implementation
  - Distributed Checkers
- Mojtaba Mehrara and Mona Attariyan (DATE’07)
  - Added soft-error detection to BulletProof pipeline
  - Increased the fault coverage of the technique (protection for control logic)
ACE Testing Framework (MICRO’07)
- Todd Austin, Onur Mutlu and Valeria Bertacco (project supervision)
Overview of Future Research Directions
Online Low-cost Defect Tolerance Solutions
- Online Defect Detection & Diagnosis
  - BulletProof Pipeline
  - ACE Testing
  - Add value to already proposed techniques
- Online System Recovery
  - Low overhead periodic checkpoint and recovery
  - Existing mechanisms:
    • ReVive + ReViveI/O
    • SafetyNet
- Online System Repair
- Evaluation Infrastructure
  - Fault Injection Based Analysis Framework
Extend the ACE Framework to Other Applications
Overhead of ACE framework can be amortized by other applications:
- Online Defect Detection & Diagnosis
- Online Performance Monitoring
- Online Design Bug Detection
- Manufacturing Testing
- Post-silicon Debugging
[Figure: the ACE framework (processor plus ACE firmware) provides hardware accessibility & controllability to these applications]
Flexible Event Monitoring Architecture

- Event monitoring requires real-time signal monitoring/processing
- Event monitoring hardware:
  - Bug signature checkers
  - Performance counters
- Support of monitoring capabilities for all ~230K bits of OpenSPARC is very expensive: ~25-30% area overhead
[Figure: ACE tree (register file and ACE nodes) extended with a programmable logic core]
Design Bugs - Preliminary Analysis

- Most bugs are in complex control logic
  - Memory subsystem (lsu)
  - Exception/interrupt control (tlu)
- Load/Store Unit (lsu) & Trap Logic Unit (tlu) account for 96% of the design bugs in the OpenSPARC core
- They account for only 49% of the core’s scan cells
Design Bug Distribution (SPARC core): lsu 51%, tlu 44%, ifu 4%, spu 1%
Scan Cells Distribution (SPARC core): tlu 26%, lsu 23%, ifu 21%, exu 11%, spu 8%, mul 6%, ffu 5%
Current Software-Based Fault Simulation Framework
A Monte Carlo-based fault simulation & analysis framework
- Inputs: fault models, design (gate-level netlist), stimuli
- Each injected fault is described by type, time, location, duration
- Monte Carlo simulation loop (1000x): the fault analyzer compares a fault-exposed model against a golden model (no faults injected)
- Fault is:
  • Logic masked
  • Timing masked
  • Architecture masked
  • Error (fault manifests)
Supported Models
• Stuck-at
• Stuck-open
• Bridge
• Path-delay
• Transient (SEU)
Fault simulation & analysis speed ~ 10KHz
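The structure of such a Monte Carlo loop can be sketched on a toy netlist. This is a deliberately reduced illustration, not the framework itself: the three-input circuit, the node names, and the function names are hypothetical, only the stuck-at model is covered, and the outcome is collapsed to "masked" versus "error" without separating logic, timing, and architectural masking.

```python
import random
from collections import Counter
from typing import Optional, Tuple

NODES = ["a", "b", "c", "n1", "out"]


def circuit(a: int, b: int, c: int,
            fault: Optional[Tuple[str, int]] = None) -> int:
    """Toy netlist: out = (a AND b) OR c. `fault` is (node, stuck value) or None."""

    def force(node: str, value: int) -> int:
        # A stuck-at fault overrides whatever value the node would carry.
        if fault is not None and fault[0] == node:
            return fault[1]
        return value

    a, b, c = force("a", a), force("b", b), force("c", c)
    n1 = force("n1", a & b)
    return force("out", n1 | c)


def monte_carlo(trials: int = 1000, seed: int = 0) -> Counter:
    """Inject one random stuck-at fault per trial and compare with the golden model."""
    rng = random.Random(seed)
    outcomes: Counter = Counter()
    for _ in range(trials):
        fault = (rng.choice(NODES), rng.randint(0, 1))     # random location and value
        stimulus = [rng.randint(0, 1) for _ in range(3)]   # random input vector
        golden = circuit(*stimulus)                        # no fault injected
        exposed = circuit(*stimulus, fault=fault)          # fault-exposed model
        outcomes["error" if exposed != golden else "masked"] += 1
    return outcomes
```

For example, `circuit(1, 1, 1, fault=("n1", 0))` still returns 1 because the OR gate hides the stuck node, which is a small instance of logic masking. The same fault with `c = 0` changes the output and is counted as an error.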
Hardware Accelerated Fault Simulation
Port the software-based fault simulation & analysis framework to the BEE2 hardware emulation platform
In collaboration with:
- Andrea Pellegrini
- Dan Zhang
[Figure: BEE2 emulation board]
Online Repair Techniques

- Qualitatively evaluate the effectiveness of graceful degradation that exploits existing resource redundancy
  - But different architectures have different degrees of resource redundancy
  - For what defect rates is a given degree of resource redundancy adequate?
- Is graceful degradation enough? Do we need to spare?
  - If yes, what to spare?
[Figure: example architectures with 2 cores, 8 cores, and 80 tiles]
Thesis Completion Timeline
[Figure: timeline from Jan’08 to May’09 in two-month steps, with a possible internship and target publications: IEEE Transactions on Computers, MICRO’08 or ASPLOS’09, DAC’09, DSN’09]
Thank You!
Questions?