Design Validation and
Debugging
Tim Cheng
Department of Electrical & Computer Engineering
UC Santa Barbara
VLSI Design and Education Center (VDEC)
Univ. of Tokyo
1
Harder to Design Robust and Reliable Chips
• First-silicon success rate has been dropping
– ~30% for complex ASIC/SoC at 0.13µm (according to an ASIC vendor)
– Pre-silicon logic bugs have been increasing at 3X-4X per
generation for Intel’s processors
• Yield has been dropping for volume production, and yield ramp-up takes longer
– IBM’s 8-core Cell-Processor chips: ~10-20% yield (July 2006)
• “Better than worst-case” design resulting in failures
w/o defects
– Increase in variation of process parameters with scaling
– Worst-case design getting way too conservative
2
In-Field Failures are Common and Costly
• Xbox 360: 16.4% failure rate
• Additional warranty and
refund will cost Microsoft
$1.15B ($86 per $300-item)
• More than financial cost:
reputation and market loss
• Non-trivial failure rate
– 15% on average
http://arstechnica.com/news.ars/post/20080214-xbox-360-failure-rates-worse-than-most-consumer-
3
Design for Robustness and Reliability
• Systems must be designed to cope with failures
• Efficient silicon debug is becoming a must
– Need efficient design validation and debugging methodology
– Design for debugging would become necessary
• Must have embedded self-test for error detection
– For both testing in manufacturing line and in-field testing
– Both on-line and off-line testing
• Re-configurability and adaptability for error recovery make increasing sense
– Using spares to replace defective parts
– Using redundancy to mask errors
– Using tuning to compensate variations
4
Outline
• Post-Silicon Validation and Debug
• SMT-Based RTL Error Diagnosis [ITC
2008]
• SAT-Based Diagnostic Test Generation
[ATS 2007]
5
Bugs in Silicon
• Manufacturing defects
– Discovered during manufacturing test (<<1M DPM)
• Functional bugs (AKA logic bugs)
– Exist in all components
– ~98% found before tape out, ~2% post-silicon*
• Circuit bugs (AKA electrical bugs)
– Not all components exhibit failures
– Fails in some operating region (voltage, temperature, or
frequency)
– Usually caused by design margin errors, IR drop, crosstalk
coupling, L·di/dt noise, process variation …
– ~50% found before tape out, ~50% post-silicon*
* Source: Intel
6
Validation Domain Characteristics
• Pre-silicon validation
– Cycle accurate simulation
– F_SIM << F_PROD: cycle-poor
– Any signal visible (i.e. white box): debugging is
straightforward
– Limited platform level interaction
• Post-silicon validation
– Tests run at F_PROD: cycle-rich
– Component tested in platform configuration
– Only package pins visible: difficult debug
7
Post-Si History And Trends
• Functional bugs relatively constant
– Correlate well to design complexity (amount
of new and changed RTL)
– Late specification changes are contributors
• Circuit and analog bugs growing over
time
– I/O circuit complexity increasing sharply
– Speedpaths (limiting the F_MAX of a component)
dominate CPU core circuit issues
8
Post-Si Debug Challenges
• Trend is toward lower observability
– Integration increasing towards SoC
• Functional and circuit issues require different
solutions
• On average, circuit bugs take 3× as long to root-cause as
functional bugs
– Bugs found on platforms, but are debugged on debug-enabled
automatic test equipment (ATE)
– Often need multiple iterations to reproduce on the tester
– Often a long latency between a circuit issue and its syndrome
9
Pre-Si Verification vs. Post-Si Debugging
[Flow diagram: Specification → RTL Description → Logic Netlist → Physical Design. Pre-silicon functional debugging inserts corrections at the specification/RTL stages; silicon debugging & fault diagnosis addresses faults/errors at the netlist and physical-design stages.]
10
Automated Debugging/Diagnosis
A failed verification/test step is followed by debugging/diagnosis:
[Flow diagram: a testbench or test vectors drive verification (of the design) or testing (of the silicon); on PASS the flow ends, on FAIL the counterexamples/diagnostic patterns feed automated debugging/diagnosis.]
11
Leveraging Pre-Si Verification & Manufacturing
Test Efforts for Post-Si Validation
[Diagram: the design flow (Specification → RTL Description → Logic Netlist → Physical Design) annotated with three efforts: pre-silicon verification (white box, but lacking error-propagation analysis/metrics), post-silicon validation (black box), and manufacturing test (black box, with fault models at a very low level of abstraction).]
12
Outline
• Post-Silicon Validation and Debug
• SMT-Based RTL Error Diagnosis [ITC
2008]
• SAT-Based Diagnostic Test Generation
[ATS 2007]
13
SAT-Based Diagnosis
Given an erroneous design and a set of failing tests:
1. Replicate the circuit for each failing test
2. Add additional circuitry (MUXes at candidate locations) into the circuit model
3. Add input/output constraints
4. SAT assignment(s) → fault location(s)!
14
SAT-Based Diagnosis - Example
• Stuck-at-1 fault on line l1
• Input vector v=(0, 0, 1) detects 1/0 at y
[Circuit diagram: x1 = 0 and x2 = 0 drive line l1, whose good/faulty value is 0/1 under the stuck-at-1 fault; with x3 = 1 the discrepancy propagates to output y as 1/0.]
Courtesy: A. Veneris
15
SAT-Based Diagnosis –
Example (Cont’d)
1. Insert a MUX at each error candidate location
[Diagram: a MUX with select s1 and free input w1 inserted at candidate line l1.]
2. Apply input/output vector constraints
[Diagram: the MUX-inserted circuit constrained with the failing vector x1 = 0, x2 = 0, x3 = 1 and the observed response y = 0; a satisfying assignment sets s1 = 1 and w1 = 1, implicating l1. A z3 sketch of this encoding follows below.]
Courtesy: A. Veneris
16
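To make the two steps above concrete, here is a minimal sketch in Python with the z3 solver (`pip install z3-solver`). The gate structure is an assumption, since the original figure did not survive extraction: l1 = x1 AND x2 and y = NOT(l1 AND x3), which is consistent with the 0/1 and 1/0 values on the previous slide. Constraining the MUX-inserted circuit with the failing vector and the observed faulty response forces the solver to activate the MUX at l1.

```python
# Minimal sketch of MUX-based SAT diagnosis in z3.
# Assumed gate structure: l1 = x1 AND x2, y = NOT(l1 AND x3).
from z3 import Bools, Bool, If, And, Not, Solver, sat

x1, x2, x3 = Bools('x1 x2 x3')
s1 = Bool('s1')                  # MUX select: 1 -> replace l1 with w1
w1 = Bool('w1')                  # free variable the solver may choose

l1 = If(s1, w1, And(x1, x2))     # step 1: MUX inserted at candidate l1
y = Not(And(l1, x3))             # rest of the circuit

solver = Solver()
solver.add(Not(x1), Not(x2), x3)  # step 2: failing input vector (0, 0, 1)
solver.add(Not(y))                # observed faulty response y = 0

if solver.check() == sat:
    m = solver.model()
    # Unique solution: s1 = True, w1 = True, i.e. l1 stuck-at-1 explains it.
    print('s1 =', m[s1], ' w1 =', m[w1])
```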
SAT-Based Diagnosis – Multiple
Diagnostic Tests
[Diagram: the circuit is replicated three times, once per diagnostic test; all copies share the MUX select s1 and free variable w1 at l1, and each copy is constrained with its own input vector (x1, x2, x3) and observed output y, so a single satisfying assignment must explain every failing test simultaneously.]
Courtesy: A. Veneris
17
RTL Design Error Diagnosis
• Using Boolean SAT-Solvers for RTL
design error diagnosis is not efficient
– The translation to Boolean is expensive
– High level information is discarded
Propose an SMT-based, automated method for RTL design error diagnosis
18
Satisfiability Modulo Theory
(SMT) Solvers
• Targets combined decision procedures (CDP)
• Integrates the Boolean-level approach with higher-level decision procedures, such as ILP
• SHIVA-UIF: an SMT solver developed for RTL circuits, combining
– Boolean theory
– Bit-vector theory
– Equality theory
which makes it a good candidate as the satisfiability engine for hardware designs
19
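As a small illustration of why word-level theories pay off (this is generic z3, not SHIVA-UIF): an RTL relation over 32-bit vectors stays a two-constraint bit-vector query instead of being bit-blasted to CNF up front.

```python
# Word-level query in z3's bit-vector theory: the solver reasons at RTL
# granularity rather than on a CNF translation of 32 single-bit signals.
from z3 import BitVec, Solver

a, b = BitVec('a', 32), BitVec('b', 32)
s = Solver()
s.add(a + b == 10, a > b)   # word-level arithmetic and (signed) comparison
print(s.check())            # sat
print(s.model())            # concrete 32-bit values for a and b
```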
RTL Design Error Diagnosis Utilizing
SHIVA-UIF
• Extends the main idea of the Boolean-SAT-based diagnosis approach to the word level
– MUXs are added to word-level signals
[Flow: from the failing patterns and error candidates, add MUXs to the design and impose each test as constraints, then run the SMT solver; on SAT, add the identified candidate to the possible-candidate list and add constraints to avoid the same solution, then re-solve; on UNSAT, remove the remaining candidates, leaving a reduced candidate list. A sketch of this loop follows below.]
20
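A hedged sketch of the loop on this slide in z3 (the helper name and the split into `constraints`/`selects` are assumptions, not the paper's code): solve, record which MUX selects are active, block that select assignment, and repeat until UNSAT; candidates that never appear in any solution can be removed.

```python
# Sketch of the SMT diagnosis loop: enumerate candidate sets by repeated
# solving with blocking constraints (All-SAT projected onto MUX selects).
from z3 import Solver, Or, sat, is_true

def enumerate_candidates(constraints, selects):
    # constraints: MUX-enriched design plus failing-test I/O constraints
    # selects:     one MUX select variable per candidate location
    solver = Solver()
    solver.add(constraints)
    possible = set()
    while solver.check() == sat:
        model = solver.model()
        vals = [model.eval(s, model_completion=True) for s in selects]
        possible.update(s for s, v in zip(selects, vals) if is_true(v))
        # Block this exact select assignment so the next solution differs.
        solver.add(Or([s != v for s, v in zip(selects, vals)]))
    return possible   # selects never activated -> remove those candidates
```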
Initialization Steps
• Simple effect-cause
analysis used to limit
the potential candidates
• A MUX is inserted at each potentially erroneous signal
[Diagram: the example design, in which the adder computes L = X1 + X2 and the comparator computes Y = (L == X3); a MUX with select S and free word-level input W is inserted at candidate signal L.]
21
Could Directly Modify the HDL Code
(at Potentially Erroneous Statements)
module full_adder_imp (a1, a2, c_in, s, c_out);
input a1, a2, c_in;
output s, c_out;
wire temp;
assign s = a1 ^ a2 ^ c_in;
assign temp = (a1 & a2) | (a1 & c_in);
assign c_out = temp | (a2 & c_in);
endmodule
module full_adder_muxed (a1, a2, free1, free2,
free3, s1, s2, s3, c_in, s, c_out);
input a1, a2, c_in;
input free1, free2, free3; // free variables the solver may assign
input s1, s2, s3;          // MUX selects, one per candidate statement
output s, c_out;
wire temp, s_mux, temp_mux, c_out_mux;
assign s_mux = a1 ^ a2 ^ c_in;
assign s = s1 ? s_mux : free1;         // MUX: original logic vs. free1
assign temp_mux = (a1 & a2) | (a1 & c_in);
assign temp = s2 ? temp_mux : free2;   // MUX at the second statement
assign c_out_mux = temp | (a2 & c_in);
assign c_out = s3 ? c_out_mux : free3; // MUX at the third statement
endmodule
22
Inserting Constraints w.r.t. Failing
Test and Expected Response
• Add constraints corresponding to a failing test and its expected response to the MUX-inserted circuit/code
[Diagram: with inputs X1 = 3, X2 = 3, X3 = 5 and expected response Y = 1, the MUX-inserted design yields the constraint ((S ? W : (3+3)) = 5); the solver returns SAT with S = 1 and W = 5. A z3 version follows below.]
23
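The constraint on this slide drops straight into z3; only the 8-bit width of W is an assumption here.

```python
# The slide's word-level constraint ((S ? W : (3 + 3)) = 5) in z3.
from z3 import Bool, BitVec, BitVecVal, If, Solver

S = Bool('S')                # MUX select at candidate signal L
W = BitVec('W', 8)           # free word-level variable (8-bit width assumed)
mux_out = If(S, W, BitVecVal(3, 8) + BitVecVal(3, 8))

solver = Solver()
solver.add(mux_out == 5)     # expected response: comparator output Y = 1
print(solver.check())        # sat
print(solver.model())        # S = True, W = 5, matching the slide
```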
Experimental Results
Design | No. of word-level elements | No. of patterns | No. of initial candidates* | No. of final candidates
B03    |  108 |  4 |    72 |   6
B04    |  108 |  5 |    72 |   9
B05    | 9700 | 28 | 12949 |   5
C5     |  115 | 20 |   211 |  13
C10    |  230 | 18 |   561 |   9
C12    |  420 | 12 |   579 | 100
C15    |  345 | 13 |   911 |  25
C16    |  540 |  9 |   595 |   7
C17    |  720 |  8 |   815 |  28
C18    | 1800 | 28 |  2135 |  10
C30    | 2910 | 26 |  3499 |  87
• 11 example circuits (IWLS 2005 benchmarks)
• An error is randomly injected in each circuit
• * after applying simple effect-cause analysis
24
Experimental Results
[Four plots (circuits B03, C5, C10, C15): number of remaining candidates (y-axis) vs. number of failing patterns imposed, 1-19 (x-axis).]
• 4 sample circuits, each with 1000 random errors
• Average/Max/Minimum number of remaining candidates
25
Experimental Results – Effect of
Applying More Failing Tests
• Average of 4 sample circuits, each with 1000 random errors
Range of failing test indexes | # of erroneous ckt instances in which # of candidates reduced (out of 1000) | Average reduction in size of candidate list (in %)
  5 to 200 | 588 | 1.74%
 10 to 200 | 418 | 1.16%
 20 to 200 | 318 | 0.97%
 50 to 200 | 177 | 0.73%
100 to 200 | 102 | 0.62%
26
Disadvantage of Model-Free Diagnosis
[Diagram: the MUX-inserted Design (selects S1-S5, free inputs W1-W5 at the initial error candidates, including L) shown beside the Golden Model; both are built from X1, X2, X3, an adder (+), and a comparator (=) producing Y.]
• Some errors are indistinguishable from each other
• Example: L is the real error location but the solver can
find satisfying values for all initial error candidates
27
Advantages of SMT-Based
RTL Design Error Diagnosis
• The learned information can be reused
• The order of candidate identification is easy-to-difficult, implicitly determined by the solver
– The solver tends to set the MUXs of easy-to-diagnose candidates first, and,
– By the time difficult candidates are checked, the accumulated learned clauses help reduce complexity
• Running All-SAT for this model results in:
– Eliminating a group of candidates without explicitly targeting
them one at a time
28
Outline
• Post-Silicon Validation and Debug
• SMT-Based RTL Error Diagnosis [ITC
2008]
• SAT-Based Diagnostic Test Generation
[ATS 2007]
29
Diagnostic Test Pattern
Generation (DTPG)
• Generates tests that distinguish fault types
or locations
• One of the most computationally intensive
problems
• Most existing methods are based on modified
conventional ATPG or Sequential ATPG
• Very complex and tedious implementation
30
Traditional SAT-based DTPG
• Use a miter-like model to transform DTPG into
a SAT problem
[Diagram: a miter built from two copies of the circuit sharing the primary inputs (PI); one copy carries fault f1, the other fault f2, and the primary outputs (PO) drive a comparator with objective M = 1. SAT → distinguishable (the model is a distinguishing test); UNSAT → indistinguishable. A z3 sketch follows below.]
31
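A minimal miter for one fault pair, again on the small assumed circuit used earlier (l1 = x1 AND x2, y = NOT(l1 AND x3)); the fault choices below are illustrative. SAT on the miter output yields a distinguishing test in the model.

```python
# Miter check for one fault pair: copy A carries f1 (l1 stuck-at-1),
# copy B carries f2 (x3 stuck-at-0); M = XOR of the two outputs.
from z3 import Bools, BoolVal, And, Not, Xor, Solver, sat

x1, x2, x3 = Bools('x1 x2 x3')           # shared primary inputs

def copy_with_fault(l1_sa=None, x3_sa=None):
    l1 = BoolVal(l1_sa) if l1_sa is not None else And(x1, x2)
    x3_eff = BoolVal(x3_sa) if x3_sa is not None else x3
    return Not(And(l1, x3_eff))          # primary output y

y_a = copy_with_fault(l1_sa=True)        # f1: l1 stuck-at-1
y_b = copy_with_fault(x3_sa=False)       # f2: x3 stuck-at-0

solver = Solver()
solver.add(Xor(y_a, y_b))                # objective M = 1
if solver.check() == sat:                # SAT -> distinguishable
    print('distinguishing test:', solver.model())
else:
    print('indistinguishable fault pair')
```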
SAT-based DTPG
• Limitations:
– Need to build a miter circuit for each fault
pair
– Cannot share learned information for
different fault pairs
Objectives: reduce the number of miter circuits and the computational cost of each DTPG run by reusing learned information from previous runs
32
DTPG Model for Injecting Multiple
Fault Pairs
• Inject the same set of N = 2^n to-be-differentiated faults into each of the two circuits in the miter
• Add an n-to-2^n decoder in each circuit to activate exactly one fault at a time
• The decoder inputs, PI1 and PI2, become extra sets of primary inputs
• Solve objective M = 1
[Diagram: the decoder-based miter with fault selects sel1 … selN in one copy and sel'1 … sel'N in the other; in the example, the solution Vi differentiates f1 and f6. A sketch of the decoder encoding follows below.]
33
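In an SMT formulation the decoder need not be built gate by gate; here is a sketch of an equivalent encoding in z3 (the tiny fault site at the end is a hypothetical stand-in, not a circuit from the slides).

```python
# Decoder encoding sketch: activation signal sel[i] is (pi1 == i), so
# exactly one of the N = 2^n injected faults is active for any pi1 value.
from z3 import BitVec, Bool, If, And

n = 3                                   # N = 8 faults per miter copy
pi1 = BitVec('pi1', n)                  # extra primary inputs to the decoder
sel = [pi1 == i for i in range(2**n)]   # one-hot by construction

# Hypothetical fault site: fault 5 is a stuck-at-1 on signal l1.
x1, x2 = Bool('x1'), Bool('x2')
l1 = If(sel[5], True, And(x1, x2))      # sel[5] active -> l1 forced to 1
```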
DTPG Procedure Using Proposed Model
[Flow: from the list of fault candidates, build the DTPG model, simplify the circuit, and solve the objective M = 1; on SAT a diagnostic pattern is found and a blocking constraint is added before re-solving; on UNSAT the procedure ends.]
• For a SAT solution, the values assigned at PI1 and PI2 are the indices of the activated fault pair, and the values assigned at PI form a diagnostic test
• After the diagnostic test for fault pair (fi, fj) is found, add a blocking clause so that a test for the same pair is not generated again (see the sketch below)
• After UNSAT, all remaining fault pairs are indistinguishable
34
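A hedged sketch of this procedure in z3 (the miter construction is assumed done elsewhere; `miter_constraints`, `pi1`, and `pi2` are hypothetical handles): each SAT model names a fault pair and carries its diagnostic test, and a blocking clause retires the pair.

```python
# Sketch of the DTPG loop over the decoder-based miter model.
from z3 import Solver, Or, sat

def dtpg_all_pairs(miter_constraints, pi1, pi2):
    # miter_constraints: decoder-enriched miter with objective M = 1
    # pi1, pi2:          n-bit decoder inputs selecting the active faults
    solver = Solver()
    solver.add(miter_constraints)
    tests = {}
    while solver.check() == sat:
        m = solver.model()
        i, j = m[pi1].as_long(), m[pi2].as_long()
        tests[(i, j)] = m                  # PI values in m are the test
        # Blocking clause: do not regenerate a test for this pair.
        solver.add(Or(pi1 != i, pi2 != j))
    return tests    # pairs never appearing here are indistinguishable
```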
Main Advantages of the DTPG Model
• The learned information can be reused
• Order of target fault pair selection is
automatically determined by SAT solving
– Easy-to-distinguish fault pairs would be implicitly
targeted first
• Running All-SAT for this miter model could:
– Find diagnostic patterns for all pairs of faults
– Naturally perform diagnostic pattern compaction
• Identify a group of indistinguishable fault pairs
without explicitly targeting them one at a time
35
Finding More Compact Diagnostic Tests
[Diagram: the same decoder-based miter, but with don't-care bits (x) at the decoder inputs PI1 and PI2; a single pattern Vij can then differentiate the fault group {f0, f2} from {f6, f7}.]
36
DTPG with Compaction Heuristic
• Solve objective M = 1 using SAT solver
• Use existing patterns to guide the SAT solving
• Find don’t cares at PI1 and PI2 in the newly generated pattern, so that the corresponding pattern differentiates two groups of faults (a sketch follows below)
37
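One way to realize this heuristic, sketched under the same assumptions as the loop above (`miter_constraints`, `pi1`, `pi2` are hypothetical handles; `pi_vars` are the ordinary primary inputs): pin the primary inputs to a newly found test V and enumerate every fault pair that V still distinguishes, so one pattern covers a group rather than a single pair.

```python
# Compaction sketch: freeze the test vector V, then enumerate all fault
# pairs that this single pattern differentiates.
from z3 import Solver, Or, sat

def pairs_covered_by_test(miter_constraints, pi_vars, test_model, pi1, pi2):
    solver = Solver()
    solver.add(miter_constraints)
    for v in pi_vars:   # pin the primary inputs to the found test V
        solver.add(v == test_model.eval(v, model_completion=True))
    covered = []
    while solver.check() == sat:
        m = solver.model()
        i, j = m[pi1].as_long(), m[pi2].as_long()
        covered.append((i, j))
        solver.add(Or(pi1 != i, pi2 != j))   # move on to the next pair
    return covered
```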
DTPG for Multiple Faults
• Need m n-to-2^n decoders in each faulty circuit (m is the cardinality of the multiple fault)
• One output from each decoder is connected to an m-input OR gate
• Can inject m or fewer faults
• Combine existing methods before using the proposed DTPG model
[Diagram: m decoders with selects Sel_i … Sel_j; one output of each decoder feeds the OR gate that activates a fault.]
38
DTPG Results
Circuit | #Initial Fault Pairs | #D/#E/#A  | #Diagnostic Patterns | CPU (sec)
S5378   |   66 |   63/3/0  | 13 | 0.3
S13207  | 1225 | 1198/27/0 | 28 | 3.9
S15850  |  231 |  204/27/0 |  7 | 3.3
S35932  |  120 |  106/14/0 |  7 | 2.0
S38417  |  351 |  351/0/0  |  8 | 2.9
S38584  | 1225 | 1205/20/0 | 33 | 7.3
• Initial fault pairs: generated by a critical-path-tracing tool
• All fault pairs injected into one miter circuit
• #D—distinguishable, #E—equivalent, #A—aborted
39
Summary
• SMT-based RTL Design Error Diagnosis
– An enhanced model injecting single/multiple design errors
– Enable sharing of the learned information
– Identify false candidates without explicitly targeting them
• SAT-based DTPG
– Use an enhanced miter model injecting multiple faults
– Enable sharing of the learned information
– Identify indistinguishable faults efficiently
– Support diagnosis between mixed, multiple fault types
– Combine with diagnostic test pattern compaction
40