UNIVERSITY OF CALIFORNIA
SANTA CRUZ
COMPREHENSIVE FAULT DIAGNOSIS OF COMBINATIONAL CIRCUITS
A dissertation submitted in partial satisfaction
of the requirements for the degree of
DOCTOR OF PHILOSOPHY
in
COMPUTER ENGINEERING
by
David B. Lavo
September 2002
The Dissertation of David B. Lavo
is approved:
Professor Tracy Larrabee, Chair
Professor F. Joel Ferguson
Professor David P. Helmbold
Robert C. Aitken, Ph.D.
Frank Talamantes
Vice Provost & Dean of Graduate Studies
Copyright © by
David B. Lavo
2002
Contents
List of Figures
List of Tables
Abstract
Acknowledgements
Chapter 1. Introduction
Chapter 2. Background
2.1 Types of Circuits
2.2 Diagnostic Data
2.3 Fault Models
2.4 Fault Models vs. Algorithms: A Short Tangent into a Long Debate
2.5 Diagnostic Algorithms
2.5.1 Early Approaches and Stuck-at Diagnosis
2.5.2 Waicukauski & Lindbloom
2.5.3 Stuck-At Path-Tracing Algorithms
2.5.4 Bridging fault diagnosis
2.5.5 Delay fault diagnosis
2.5.6 IDDQ diagnosis
2.5.7 Recent Approaches
2.5.8 Inductive Fault Analysis
2.5.9 System-Level Diagnosis
Chapter 3. A Deeper Understanding of the Problem: Developing a Fault Diagnosis Philosophy
3.1 The Nature of the Defect is Unknown
3.2 Fault Models are Hopelessly Unreliable
3.3 Fault Models are Practically Indispensable
3.4 With Fault Models, More is Better
3.5 Every Piece of Data is Valuable
3.6 Every Piece of Data is Possibly Bad
3.7 Accuracy Should be Assumed, but Precision Should be Accumulated
3.8 Be Practical
Chapter 4. First Stage Fault Diagnosis: Model-Independent Diagnosis
4.1 SLAT, STAT, and All That
4.2 Multiplet Scoring
4.3 Collecting and Diluting Evidence
4.4 “A Mathematical Theory of Evidence”
4.5 Turning Evidence into Scored Multiplets
4.6 Matching Simple Failing Tests: An Example
4.7 Matching Passing Tests
4.8 Matching Complex Failures
4.9 Size is an Issue
4.10 Experimental Results – Simulated Faults
4.11 Experimental Results – FIB Defects
Chapter 5. Second Stage Fault Diagnosis: Implication of Likely Fault Models
5.1 An Old, but Still Valid, Debate
5.2 Answers and Compromises
5.3 Finding Meaning (and Models) in Multiplets
5.4 Plausibility Metrics
5.5 Proximity Metrics
5.6 Experimental Results – Multiplet Classification
5.7 Analysis of Multiple Faults
5.8 The Advantages of (Multiplet) Analysis
Chapter 6. Third Stage Fault Diagnosis: Mixed-Model Probabilistic Fault Diagnosis
6.1 Drive a Little, Save a Lot: A Short Detour into Inexpensive Bridging Fault Diagnosis
6.1.1 Stuck with Stuck-at Faults
6.1.2 Composite Bridging Fault Signatures
6.1.3 Matching and (Old Style) Scoring with Composite Signatures
6.1.4 Experimental Results with Composite Bridging Fault Signatures
6.2 Mixed-model Diagnosis
6.3 Scoring: Bayes decision theory
6.4 The Probability of Model Error ...
6.5 ... Vs. Acceptance Criteria
6.6 Stuck-at scoring
6.7 0th-Order Bridging Fault Scoring
6.8 1st-Order Bridging Fault Scoring
6.9 2nd-Order Bridging Fault Scoring
6.10 Expressing Uncertainty with Dempster-Shafer
6.11 Experimental Results – Hewlett-Packard ASIC
6.12 Experimental Results – Texas Instruments ASIC
6.13 Conclusion
Chapter 7. IDDQ Fault Diagnosis
7.1 Probabilistic Diagnosis, Revisited
7.2 Back to Bayes (One Last Time)
7.3 Probabilistic IDDQ Diagnosis
7.4 IDDQ Diagnosis: Pre-Set Thresholds
7.5 IDDQ Diagnosis: Good-Circuit Statistical Knowledge
7.6 IDDQ Diagnosis: Zero Knowledge
7.7 A Clustering Example
7.8 Experimental Results
Chapter 8. Small Fault Dictionaries
8.1 The Unbearable Heaviness of Unabridged Dictionaries
8.2 Output-Compacted Signatures
8.3 Diagnosis with Output Signatures
8.4 Objects in Dictionary are Smaller Than They Appear
8.5 What about Unmodeled Faults?
8.6 An Alternative to Path Tracing?
8.7 Clustering Output Signatures
8.8 Clustering Vector Signatures & Low-Resolution Diagnosis
Chapter 9. Conclusions and Future Work
Bibliography
List of Figures
Figure 2.1. Example of pass-fail fault signatures.
Figure 2.2. Example of indexed and bitmapped full-response fault signatures.
Figure 4.1: Simple per-test diagnosis example.
Figure 4.2. An example belief function.
Figure 4.3. Another belief function.
Figure 4.4. The combination of two belief functions.
Figure 4.5. Example showing the combination of faults.
Figure 4.6. A third test result is combined with the results from the previous example.
Figure 4.7. Example test results with matching faults.
Figure 4.8. A-sa-1 will likely fail on many more vectors than will B-sa-1.
Figure 4.9. Combination of evidence from the first two tests.
Figure 4.10. Example of constructing a set of possibly-failing outputs for a multiplet.
Figure 4.11. Multiplets (A,B), (A,B,C) and (A,B,D) explain all test results, but (A,B) is smaller and so preferred.
Figure 4.12. The choice of best multiplet is difficult if (A) predicts additional failures but (B, C) does not.
Figure 6.1. The composite signature of X bridged to Y with match restrictions (in black) and match requirements (labeled R).
Figure 7.1. IDDQ results for 100 vectors on 1 die (SEMATECH experiment).
Figure 7.2. Assignment of a binary p̂(A | O) for the ideal case of a fixed IDDQ threshold.
Figure 7.3. Assignment of a linear p̂(A | O) with a fixed IDDQ threshold.
Figure 7.4. Assignment of normally-distributed p̂(O | A) and p̂(O | Ā).
Figure 7.5. Determining a pass threshold based on an assumed distribution and the minimum-vector measured IDDQ.
Figure 7.6. The same data given in Figure 7.1, with the test vectors ordered by IDDQ magnitude.
Figure 7.7. Estimating p̂(O | A) and p̂(O | Ā) as normal distributions of clustered values.
Figure 7.8. Full data set of 196 ordered IDDQ measurements.
Figure 7.9. Division of the ordered measurements into clusters.
Figure 8.3. A simple example of clustering by subsets of outputs.
List of Tables
Table 4.1. Results from scoring and ranking multiplets on some simulated defects.
Table 4.2. FastScan and iSTAT results on TI FIB experiments: 2 stuck-at faults, 14 bridges.
Table 5.1. Results from correlating top-ranked multiplets to different fault models.
Table 6.1. Set of likely effects that can invalidate composite bridging fault predictions.
Table 6.2. Diagnosis results for round 1 of the experiments: twelve stuck-at faults.
Table 6.3. Diagnosis results for round 2 of the experiments: nine bridging faults.
Table 6.4. Diagnosis results for round 3 of the experiments: four open faults.
Table 6.5. Diagnosis results for TI FIB experiments: 2 stuck-at faults, 14 bridges.
Table 7.1. Results on SEMATECH defects.
Table 8.1. Size of top-ranked candidate set (in faults) and total number of signature bits.
Table 8.2. Size of top-ranked candidate set (in faults) and total number of signature bits.
Table 8.3. Output-compacted signature sizes adjusted for repeated output signatures.
Table 8.4. Success rate for bridging fault diagnosis using stuck-at fault candidates.
Table 8.5. Top-ranked candidate set size and signature bits for pass-fail and output-compacted (alone) signatures.
Table 8.6. Diagnostic results when output-compacted signatures are clustered down to 1000 bits each.
Table 8.7. Diagnostic results for clustering (PF+OC) signatures down to 100 bits total.
Abstract
Comprehensive Fault Diagnosis of Combinational Circuits
by
David B. Lavo
Determining the source of failure in a defective circuit is an important but difficult task.
Important, since finding and fixing the root cause of defects can lead to increased product quality and
greater product profitability; difficult, because the number of locations and variety of mechanisms
whereby a modern circuit can fail are increasing dramatically with each new generation of circuits.
This thesis presents a method for diagnosing faults in combinational VLSI circuits. While it
consists of several distinct stages and specializations, this method is designed to be consistent with
three main principles: practicality, probability and precision. The proposed approach is practical, as it
uses relatively simple modeling and algorithms, and limited computation, to enable diagnosis in even
very large circuits. It is also probabilistic, imposing a probability-based framework to resist the
inherent noise and uncertainty of fault diagnosis, and to allow the combined use of multiple fault
models, algorithms, and data sets towards a single diagnostic result. Finally, it is precise, using an
iterative approach to move from simple and abstract fault models to complex and specific fault
behaviors.
The diagnosis system is designed to address both the initial stage of diagnosis, when nothing is
known about the number or types of faults present, and end-stage diagnosis, in which multiple
arbitrarily-specific fault models are applied to reach a desired level of diagnostic precision. It deals
with both logic fails and quiescent current (IDDQ) test failures. Finally, this thesis addresses the
problem of data size in dictionary-based diagnosis, and in doing so introduces the new concept of low-resolution fault diagnosis.
Acknowledgements
Among the people who have contributed to this work, I would first like to thank my co-authors
on various publications: Ismed Hartanto, Brian Chess, Tracy Larrabee, Joel Ferguson, Jon Colburn,
Jayashree Saxena, and Ken Butler. Their contributions to this work, both in its exposition and
execution, have been invaluable.
I would also like to thank those people who have taken the time to provide advice, guidance, and
insight into the issues involved in this research. These people include Rob Aitken, David Helmbold,
Haluk Konuk, Phil Nigh, Eric Thorne, Doug Williams, Paul Imthurn and John Bruschi.
And while they have already been mentioned, two people deserve special acknowledgement for
their remarkable dedication to seeing this work completed. The first is Tracy Larrabee, my advisor,
who managed to provide both the constant encouragement and the extraordinary patience that this
research required. The other is Rob Aitken, who believed enough in the work to encourage and
sponsor it, in a variety of ways, throughout the many years it took to complete.
While many people have believed in this work, and given their time and support to help me
complete it, no one has believed as strongly, helped so much, or is owed as much as my wife,
Elizabeth. I am very happy to have completed this work, and even happier to be able to dedicate this
dissertation to her.
Chapter 1. Introduction
Ensuring the high quality of integrated circuits is important for many reasons, including high
production yield, confidence in fault-free circuit operation, and the reliability of delivered parts.
Rigorous testing of circuits can prevent the shipment of defective parts, but improving the production
quality of a circuit depends upon effective failure analysis, the process of determining the cause of
detected failures. Discovering the cause of failures in a circuit can often lead to improvements in
circuit design or manufacturing process, with the subsequent production of higher-quality integrated
circuits.
Motivating the quest for improving quality, as with many research efforts, is bottom-line
economics. A better quality production process means higher yield and more usable (or sellable) die
for the same wafer cost. Fewer defective chips mean lower assembly costs (more assembled boards
and products actually work) and lower costs associated with repair or scrap. And a better-quality chip
or product means a more satisfied customer and a greater assurance of future business. Failure
analysis is therefore an essential tool for improving both quality and profitability.
A useful, if somewhat strained, analogy for the process of failure analysis is
criminal detective work: given the evidence of circuit failure, determine the cause of the failure,
identifying a node or region that is the source of error. In addition to location, it is useful to identify the
mechanism of failure, such as an unintentional short or open, so that remediating changes can be
considered in the design or manufacturing process.
Historically, failure analysis has been a physical process; a surprising number of present-day
failure analysis teams still use only physical methods to investigate chip failures. The stereotypical
failure analysis lab is a team of hard-boiled engineers physically and aggressively interrogating the
failing part, using scanning electron microscopes, particle beams, infrared sensors, liquid crystal films,
and a variety of other high-tech and high-cost techniques to eventually force a confession out of the
silicon scofflaw. The final result, if successful, is the identification of the actual cause of failure for the
circuit, along with the requisite gory “crime scene” photograph of the defective region itself: an errant
particle, missing or extra conductor, a disconnected via, and so on.
The sweaty, smoke-filled scene of the failure analysis lab is only part of the story, however; this
physical part of the process is usually referred to as root-cause identification. Given the enormous number of circuit devices in
modern ICs, and the number of layers in most complex circuits, physical interrogation cannot hope to
succeed without first having a reasonable list of suspect locations. Conducting a physical root-cause
examination on an entire defective chip is akin to having to conduct a house-to-house search of an
entire metropolis, in which every member of the populace is a possible suspect.
It is the job of the other part of failure analysis, usually called fault diagnosis, to do the logical
detective work. Based on the data available about the failing part, the purpose of fault diagnosis is to
produce an evaluation of the failing chip and a list of likely defect sites or regions. A lot is riding on
this initial footwork: if the diagnosis is either inaccurate or imprecise (identifying either incorrect or
excessively many fault candidates, respectively), the process of physical fault location will be
hampered, resulting in the waste of considerable amounts of time and effort.
Previously-proposed strategies for VLSI fault diagnosis have suffered from a variety of self-imposed limitations. Some techniques are limited to a specific fault model, and many will fail in the
face of any unmodeled behavior or unexpected data. Others apply ad hoc or arbitrary scoring
mechanisms to rate fault candidates, making the results difficult to interpret or to compare with the
results from other algorithms. This thesis presents an approach to fault diagnosis that is robust,
comprehensive, extendable, and practical. By introducing a probabilistic framework for diagnostic
prediction, it is designed to incorporate disparate diagnostic algorithms, different sets of data, and a
mixture of fault models into a single diagnostic result.
The fundamental aspects of fault diagnosis will be discussed in Chapter 2, including fault models,
fault signatures, and diagnostic algorithms. Chapter 3 indulges in an examination of the issues
inherent in fault diagnosis, and presents a philosophy of diagnosis that will guide the balance of the
work. Chapter 4 presents the first stage of the proposed diagnostic approach, which handles the initial
condition of indeterminate fault behaviors. Chapter 5 discusses the second stage of diagnosis, in which
likely fault models are inferred from the first-stage results. Chapter 6 digresses to a discussion of
inexpensive bridging fault models, and introduces the third stage of diagnosis, in which multiple fault
models are applied to refine the diagnostic result. Chapter 7 extends the diagnosis system to
the topic of IDDQ failures, and Chapter 8 addresses the issue of small fault dictionaries. Chapter 9
presents the conclusions from this research and discusses areas of further work.
Chapter 2. Background
Here is the problem of fault diagnosis in a nutshell: a circuit has failed one or more tests applied
to it; from this failing information, determine what has gone wrong. The evidence usually consists of a
description of the tests applied, and the pass-fail results of those tests. In addition, more detailed per-test failing information may be provided. The purpose of fault diagnosis is to logically analyze
whatever information exists about the failures and produce a list of likely fault candidates. These
candidates may be logical nodes of the circuit, physical locations, defect scenarios (such as shorted or
open signal lines), or some combination thereof.
This chapter will give the background of the problem of fault diagnosis. It starts with a
description of the types of circuits that will and will not be addressed by the diagnosis methods
described in this thesis. It will explain the types of data that make up the raw materials of the
diagnosis process, and then introduce the abstractions of defective behavior known as fault models.
Finally, it will present the various algorithms and approaches that previous researchers have proposed
for various instances of the fault diagnosis problem.
2.1 Types of Circuits
This thesis will only address the problem of fault diagnosis in combinational logic. While nearly
all large-scale modern circuits are sequential, meaning they contain state-holding elements, most are
tested in a way that transforms their operation under test from sequential to combinational. This is
usually accomplished by implementing scan-based test [AbrBre90], in which all state-holding flip-flops in the circuit are modified so that they can be controlled and observed by shifting data through
one or more scan chains. During scan tests, input data is scanned into the flip-flops via the scan chains
and other input data is applied to the input pins (or primary inputs) of the circuit. Once these inputs
are applied and the circuit has stabilized its response (now fully combinational), the circuit is clocked
to capture the results back into the flip-flops, and the data values at the output pins (or primary
outputs) of the circuit are recorded. The combination of values at the output pins and the values
scanned out of the flip-flops make up the response of the circuit to the test, and these values are
compared to the expected response of a good circuit. If there is a mismatch for any test, the circuit is
considered defective, and the process of fault diagnosis can begin.
This thesis will not address the diagnosis of failures during tests that consist of multiple clock
cycles and therefore exercise sequential circuit behavior. So-called functional tests fall under this domain, and
are extremely difficult to diagnose due to the mounting complexity of defective behavior under
multiple sequential time frames. Another sequential circuit type that is not addressed here is that of
memories such as RAMs and ROMs. Unlike the “random” logic of logic gates and flip-flops,
however, the “structured” nature of memories makes them especially amenable to simple fault
diagnosis. It is usually a simple process to control and observe any word or bit in most memories to
determine the location of test failure.
2.2 Diagnostic Data
Part of the data that is involved in fault diagnosis, at least for scan tests, has already been
introduced: namely, the input values applied at the circuit input pins and scanned into the flip-flops.
The input data for each scan operation, including values driven at input pins, is referred to as the input
pattern or test vector. The operation of scanning and applying an input to the circuit and recording its
output response is formally called a test¹, and a collection of tests designed to exercise all or part of
the circuit is called a test set. This information, along with the expected output values (determined by
prior simulation of the circuit and test set), makes up the test program actually applied to the circuit.
¹ Traditional scan tests test only the function of a circuit, and usually only require a single input pattern and record a single
combinational response. Tests that test the speed of a circuit, however, must create logic transitions in the circuit and so must
apply pairs of input values, often by scanning two input patterns into the circuit. This type of test is still a single test and records
a single response, and as such is commonly referred to as a “two-pattern test”.
The test program runs on a tester, which can handle either wafers or packaged die, and can apply
tests and observe circuit responses. The tester records the actual responses measured at circuit outputs,
and any differences between the observed responses and the expected responses are recorded in the
tester data log. While it is not the usual default setting during production test, this thesis will assume
that the data log information identifies all mismatched responses and not just the first failing response.
It is usually a simple matter to re-program a tester from a default “stop-on-first-fail” mode to a
diagnostic “record-all-fails” mode once a die or chip has been selected for failure analysis.
The response of a defective circuit to a test set is referred to as the observed faulty behavior, and
its data representation is commonly known as a fault signature. For scan tests, the fault signature is
usually represented in one of two common forms. The first, the pass-fail fault signature, reports the
result for each test in the test set, whether a pass or a fail. Typically the fault signature consists either
of the indices of the failing tests, or a bit vector for the entire test set in which the failing tests (by
convention) are represented as 1s and the passing tests by 0s. Figure 2.1, below, gives an example of a
fault signature for a simple test set of 10 tests, out of which 4 failing tests are recorded.
Results for 10 total tests:
1: Pass   2: Pass   3: Pass   4: Pass   5: Fail
6: Pass   7: Fail   8: Fail   9: Pass   10: Fail

Pass-fail signatures:
By index:       5, 7, 8, 10
By bit vector:  0000101101

Figure 2.1. Example of pass-fail fault signatures.
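To make the two representations concrete, the following minimal Python sketch (illustrative only, and not taken from any tool discussed in this thesis; the variable names are invented) derives both pass-fail forms from the per-test results shown in Figure 2.1:

    # Per-test results of Figure 2.1 ("P" = pass, "F" = fail), for tests 1..10.
    results = ["P", "P", "P", "P", "F", "P", "F", "F", "P", "F"]

    # Signature by index: the (1-based) indices of the failing tests.
    by_index = [i + 1 for i, r in enumerate(results) if r == "F"]

    # Signature by bit vector: one bit per test, 1 for a fail and 0 for a pass.
    bit_vector = "".join("1" if r == "F" else "0" for r in results)

    print(by_index)    # [5, 7, 8, 10]
    print(bit_vector)  # 0000101101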
The second type of fault signature is the full-response fault signature, which reports not only
what tests failed but also at which outputs (flip-flops and primary outputs) the discrepancies were
observed. As with test vectors, circuit outputs are usually indexed to facilitate identification. Figure
2.2 gives another simple example of indexed and bitmapped full-response fault signatures. Each
failing vector number in the indexed signature is augmented with a list of failing outputs. In the
bitmapped signature, a second dimension has been added for failing outputs.
Indexed full-response signature:
5: 2, 4
7: 3, 4
8: 7
10: 2, 7

Bitmapped full-response signature (rows are outputs 1-10, columns are vectors 1-10):

              Vectors
              1 2 3 4 5 6 7 8 9 10
    Output 1  0 0 0 0 0 0 0 0 0 0
    Output 2  0 0 0 0 1 0 0 0 0 1
    Output 3  0 0 0 0 0 0 1 0 0 0
    Output 4  0 0 0 0 1 0 1 0 0 0
    Output 5  0 0 0 0 0 0 0 0 0 0
    Output 6  0 0 0 0 0 0 0 0 0 0
    Output 7  0 0 0 0 0 0 0 1 0 1
    Output 8  0 0 0 0 0 0 0 0 0 0
    Output 9  0 0 0 0 0 0 0 0 0 0
    Output 10 0 0 0 0 0 0 0 0 0 0

Figure 2.2. Example of indexed and bitmapped full-response fault signatures.
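The same failure data can be held in either full-response form; the short Python sketch below (again purely illustrative, with hypothetical variable names) builds the bitmapped matrix of Figure 2.2 from the indexed signature:

    # Indexed full-response signature of Figure 2.2: failing vector -> failing outputs.
    indexed = {5: [2, 4], 7: [3, 4], 8: [7], 10: [2, 7]}
    NUM_VECTORS, NUM_OUTPUTS = 10, 10

    # Bitmapped form: one row per output, one column per vector, 1 marks a mismatch.
    bitmap = [[0] * NUM_VECTORS for _ in range(NUM_OUTPUTS)]
    for vector, outputs in indexed.items():
        for output in outputs:
            bitmap[output - 1][vector - 1] = 1

    # Output 2 (second row) fails on vectors 5 and 10, as in the figure.
    print(bitmap[1])   # [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]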
Scan tests are only a single part of the suite of tests usually applied to a production chip. Another
common type of test, called an IDDQ test, puts the circuit in a non-switching or static state and
measures the quiescent current draw. If an abnormally high current is measured, a defect is assumed to
be the cause and the part is marked for scrap or failure analysis.
The fault signature generated by an IDDQ test set can take one of two forms. The first is the same
as the pass-fail signature introduced earlier for scan tests, in which either index numbers or bits are
used to represent passing (normal or low IDDQ current) and failing (high current) tests. The second type
of signature records an absolute current measurement for each IDDQ test in the form of a real number.
This thesis will address fault diagnosis for both scan and IDDQ tests, as these are the two major
types of comprehensive tests performed on commercial circuits. Other tests, such as those for
memories, pads, or analog blocks, cover a much more limited area and require more specialized (often
manual) diagnostics. Functional test failures, as mentioned, are especially difficult to diagnose, but
fortunately (at least for fault diagnosis) functional tests are gradually being eclipsed by scan-based
tests. Diagnosis for Built-In-Self-Test (BIST) [AbrBre90], in which on-chip circuitry is used to apply
and capture test patterns, will not be directly addressed here. However, many of the diagnosis
techniques presented in this thesis can be applied to BIST results if the data can be made available for
off-chip processing. Finally, the issue of timing or speed test diagnostics will be addressed only briefly
and remains a subject for further research.
2.3 Fault Models
The ultimate targets of both testing and diagnosis are physical defects. In the logical domain of
testing and diagnostic algorithms, a defect is represented by an abstraction known as a logical fault, or
simply fault. A description of the behavior and assumptions constituting a logical fault is referred to as
a fault model. Test and diagnosis algorithms use a fault model to work with the entire set of fault
instances in a target circuit.
The most popular fault model for both testing and diagnosis is the single stuck-at fault model, in
which a node in the circuit is assumed to be unable to change its logic value. The stuck-at model is
popular due to its simplicity, and because it has proved to be effective both in providing test coverage
and diagnosing a limited range of faulty behaviors [JacBis86]. As an abstract representation of a class
of defects, the stuck-at fault is commonly used to represent the defect of a circuit node shorted to either
power or ground. It is also used, however, to both detect and diagnose a wide range of other
defect types, as will be seen in the rest of this thesis.
Perhaps the second most popular fault model is the bridging fault model. Used to represent an
electrical short between signal lines, in its most common form the model describes a short between two
gate outputs. Most bridging fault models ignore bridge resistance, and instead focus on the logical
behavior of the fault. These models include the wired-OR bridging fault, in which a logic 1 on either
bridged node results in the propagation of a logic 1 downstream from both nodes; the wired-AND
bridging fault, which propagates a 0 if either node is 0; and the dominance bridging fault, in which one
gate is much stronger than the other and is assumed to always drive its logic value onto the other
bridged node. Other bridging fault models have been developed of much greater sophistication
[AckMil91, GrePat92, MaxAit93, Rot94, MonBru92], taking into account gate drive strengths, various
bridge resistances, and even more than two bridged nodes, but they are not used as much due to their
computational complexity during large-scale test generation or fault diagnosis.
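As a rough illustration of these logic-level abstractions, the following hypothetical Python sketch computes the values observed downstream of two bridged nodes A and B from the values a and b that their drivers attempt to establish. The function names are mine, not taken from the cited models, and electrical effects such as bridge resistance are ignored entirely:

    # Hypothetical logic-level renderings of common two-node bridging fault models.
    # a and b are the values (0 or 1) driven onto the bridged nodes A and B;
    # each function returns the values observed downstream of A and B.

    def wired_and(a, b):
        # A 0 on either node pulls both nodes to 0.
        v = a & b
        return v, v

    def wired_or(a, b):
        # A 1 on either node pulls both nodes to 1.
        v = a | b
        return v, v

    def a_dominates(a, b):
        # The gate driving A is assumed much stronger and always wins.
        return a, a

    print(wired_and(1, 0))    # (0, 0)
    print(wired_or(1, 0))     # (1, 1)
    print(a_dominates(1, 0))  # (1, 1)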
Bridging fault models have become popular due to increasing attention to defects in the
interconnect of modern chips. Similarly, there has been a commensurate rise in interest in open fault
models, which attempt to model electrical opens, breaks, and disconnected vias. Since opens can result
in state-holding, intermittent, and pattern-dependent fault effects, these models have generally been
more complex and less widely used for both testing and diagnosis.
Instead of interconnect faults, several fault models have concentrated on defects in logic gates
and transistors. Among these are the transistor-stuck-on and transistor-stuck-off models, which are
similar to conventional stuck-at faults. Various intra-gate short models have been proposed to model
shorts between transistors in standard-cell logic gates. Many of these models have not enjoyed
widespread success simply because the stuck-at model tends to work nearly as well for generating
effective tests at much lower complexity.
Other fault models have been developed to represent timing-related defects, including the
transition fault model and the path-delay fault model. The first assumes that a defect-induced delay is
introduced at a single gate input or output, while the second spreads the total delay along a circuit path
from input to output.
2.4 Fault Models vs. Algorithms: A Short Tangent into a Long Debate
The previous section briefly introduced a wide variety of fault models, from the simple and
abstract stuck-at model to more complicated, specific, and realistic fault models. The stuck-at fault
model has been generally dominant for several decades, and continues to be dominant today, both for
its simplicity and its demonstrated utility. But the general trend, in the field of testing at least, has been
a tentative shift away from sole reliance on the stuck-at model towards more realistic fault models that
will facilitate the generation of better tests for more complicated defects. The question is, then, what
models are best for fault diagnosis?
A paper by Aitken and Maxwell [AitMax95] identifies two main components to any fault
diagnosis approach. The first is the choice of fault model, and the second is the algorithm used to
apply the fault model to the diagnostic problem. As the authors explain, the effectiveness of a
diagnostic technique will be compromised by the limitations of the fault model it employs. So, for
example, a diagnosis tool that relies purely on the stuck-at fault model can never completely or
correctly diagnose a signal-line short or open, simply because it is looking for one thing while another
has occurred.
The authors go on to explain that the role of the diagnosis algorithm, then, has evolved to try to
overcome the limitations of the chosen fault model. This will be illustrated in the next section of this
chapter in an overview of previous diagnosis research; a common technique is to use the stuck-at
model but adjust the algorithm to anticipate bridging-fault behaviors. But, the authors also opened a
debate, which remains active to this day: is it better for a diagnosis technique to use more realistic fault
models with a simple algorithm, or to use simple and abstract models with a more clever and robust
algorithm?
As with any interesting debate, there are good arguments on both sides. The argument for simple
fault models is that they are more practical to apply to large circuits and more flexible for a wide
variety of defect behaviors. The argument for better models, taken by the authors in their original
paper, is that good models are necessary for both diagnostic accuracy and precision. Simple models do
not provide sufficient accuracy because defect behavior is often complex, more complex than even
clever algorithms anticipate. They also do not result in sufficient precision because they do not
provide enough specificity (e.g. “look for a short at this location”) to guide effective physical failure
analysis.
This thesis will attempt to resolve this debate as it presents a new diagnostic approach. The next
section outlines how previous researchers have addressed the diagnostic problem, and notes how each
participant has taken their place in the model vs. algorithm debate.
2.5 Diagnostic Algorithms
This section will cover the diagnosis algorithms proposed by previous researchers, in a roughly
chronological order. The general trend, as will become clear, has been from simple approaches that
target simple defects, to more complex algorithms that try to address more complicated defect
scenarios.
Diagnosis algorithms have traditionally been classified into two types, according to how they
approach the problem. The first and by far the most popular approach is called cause-effect fault
diagnosis [AbrBre90]. A cause-effect algorithm starts with a particular fault model (the “cause”), and
compares the observed faulty behavior (the “effect”) to simulations of that fault in the circuit. A
simulation of any fault instance produces a fault signature, or a list of all the test vectors and circuit
outputs by which a fault is detected, and which can be in one of the signature formats described earlier.
The process of cause-effect diagnosis is therefore one of comparing the signature of the observed
faulty behavior with a set of simulated fault signatures, each representing a fault candidate. The
resulting set of matches constitutes a diagnosis, with each algorithm specifying what is acceptable as a
“match”.
The main job of a cause-effect algorithm is to perform this matching between simulated candidate
and observed behavior. The general historical trend has been from very simple or exact matching,
where the defect is assumed to correspond very closely to the fault model, to more complicated
matching and scoring schemes that attempt to deal with a range of defect types and unmodeled
behavior.
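In its simplest form this matching is just a set comparison. The sketch below is a schematic Python illustration, assuming signatures are stored as sets of (vector, output) pairs and using invented fault names; it reports the dictionary faults whose simulated signature exactly equals the observed one, and the more elaborate schemes discussed later effectively replace the equality test with a score:

    # Exact-match cause-effect diagnosis over a (static) fault dictionary.
    # Each signature is a set of (vector, output) pairs at which failures occur.

    def exact_match_diagnosis(observed, dictionary):
        return [fault for fault, sig in dictionary.items() if sig == observed]

    observed = {(5, 2), (5, 4), (7, 3)}
    dictionary = {
        "net12 stuck-at-1": {(5, 2), (5, 4), (7, 3)},   # equals observed: diagnosed
        "net47 stuck-at-0": {(5, 2), (9, 1)},           # does not match
    }
    print(exact_match_diagnosis(observed, dictionary))  # ['net12 stuck-at-1']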
A cause-effect algorithm is characterized by the choice of a particular fault model before any
analysis of the actual faulty behavior is performed. A cause-effect algorithm can further be classified
as static, in which all fault simulation is done ahead of time and all fault signatures stored in a database
called a fault dictionary; or, it can be dynamic, where simulations are performed only as needed.
The opposite approach, and the second classification of diagnosis algorithms, is called (not
surprisingly) effect-cause fault diagnosis [AbrBre80, RajCox87]. These algorithms attempt the
common-sense approach of starting from what has gone wrong on the circuit (the fault “effect”) and
reasoning back through the logic to infer possible sources of failure (the “cause”). Most commonly the
cause suggested by these algorithms is a logical location or area of the circuit under test, not
necessarily a failure mechanism.
Most effect-cause methods have taken the form of path-tracing algorithms. They use assumptions
about the propagation and sensitization of candidate faults to traverse a circuit netlist, usually
identifying a set of fault-free lines and thereby implicating other logic that is possibly faulty.
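A very rough structural caricature of this kind of back-tracing, far simpler than the cited algorithms (which reason about logic values, not just connectivity), is to intersect the input cones of the failing outputs: under a single-fault assumption, a line outside the intersection cannot by itself explain all the failures. The netlist and signal names in the Python sketch below are hypothetical:

    # Hypothetical back-trace sketch: netlist maps each line to the lines driving it.

    def input_cone(netlist, line):
        # All lines that can structurally reach the given line.
        cone, stack = set(), [line]
        while stack:
            node = stack.pop()
            if node not in cone:
                cone.add(node)
                stack.extend(netlist.get(node, []))
        return cone

    def implicated_lines(netlist, failing_outputs):
        cones = [input_cone(netlist, out) for out in failing_outputs]
        return set.intersection(*cones) if cones else set()

    netlist = {"z1": ["g3"], "z2": ["g3", "g4"], "g3": ["a", "b"], "g4": ["b", "c"]}
    print(implicated_lines(netlist, ["z1", "z2"]))   # {'g3', 'a', 'b'}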
Effect-cause diagnosis methods have several advantages. First, they don't incur the often-significant overhead of simulating and storing the responses of a large set of faults. Second, they can
be constructed to be general enough to handle, at least implicitly, the presence of multiple faults and
diffuse fault behavior. This is an advantage over most other diagnosis strategies that rely heavily on a
single-fault assumption. The most common disadvantage of effect-cause diagnosis algorithms is
significant inherent imprecision. Most are conservative in their inferences to avoid eliminating any
candidate logic, but this usually leads to a large implicated area. Also, since a pure effect-cause
algorithm doesn't use fault models, it necessarily cannot provide a candidate defect mechanism (such
as a bridge or open) for consideration.
In fact, while most effect-cause algorithms claim to be “fault-model-independent”, this is a
difficult claim to justify. Existing effect-cause algorithms implicitly make assumptions about fault
sensitization, propagation, or behavior that are impossible to distinguish from classic fault modeling.
(Usually, the implicit model is the stuck-at fault model.) This is understandable: it is the job of a
diagnosis algorithm to make inferences about the underlying defect, but it is difficult to do so without
some assumptions about faulty behavior, which in turn is difficult to do without some fault
modeling.
The following sections present algorithms for VLSI diagnosis proposed by previous researchers,
from the early 1980s to the present day. In general, the earliest algorithms have targeted solely stuck-at faults and associated simple defects, while the later and more sophisticated algorithms have used
more detailed fault models and targeted more complicated defects.
2.5.1 Early Approaches and Stuck-at Diagnosis
Many early systems of VLSI diagnosis, such as Western Electric Company's DORA [AllErv92]
and an early approach of Teradyne, Inc. [RatKea86], attempted to incorporate the concept of cause-effect diagnosis with a previous-generation physical method called guided-probe analysis. Guided-probe analysis employed a physical voltage probe and feedback from an analysis algorithm to
intelligently select accessible circuit nodes for evaluation. The Teradyne and DORA techniques
attempted to supplement the guided-probe analysis algorithm with information from stuck-at fault
signatures.
Both systems used relatively advanced (for their time) matching algorithms. The DORA system
used a nearness calculation that the authors describe as fuzzy match. The Teradyne system employed
the concept of prediction penalties: the signature of a candidate fault is considered a prediction of some
faulty behavior, made up of <output:vector> pairs. When matching with the actual observed behavior,
the Teradyne algorithm scored a candidate fault by penalizing for each <output:vector> pair found in
the stuck-at signature but not found in the observed behavior, and penalizing for each <output:vector>
pair found in the observed behavior but not the stuck-at signature. These have commonly become
known as misprediction and non-prediction penalties, respectively. A related Teradyne system
[RicBow85] introduced the processing of possible-detects, or outputs in stuck-at signatures that have
unknown logic values, into the matching process.
While other early and less-sophisticated algorithms applied stuck-at fault signatures directly,
expecting exact matches to simulated behaviors, it became obvious to the testing community that most
failures in CMOS circuits do not behave exactly like stuck-at faults. Stuck-at diagnosis algorithms
responded by increasing the complexity and sophistication of their matching to account for these
unmodeled effects. An algorithm proposed by Kunda [Kun93] ranked matches by the size of
intersection between signature bits. This stress on minimum non-prediction (misprediction was not
penalized) reflects an implicit assumption that unmodeled behavior generally leads to over-prediction:
the algorithm does not expect the stuck-at model to be perfect, but any unmodeled behavior will cause
fewer actual failures than predicted by simulation. This assumption likely arose from the intuitive
expectation that most defects involve a single fault site with intermittent faulty behavior — a not
uncommon scenario for many chips that have passed initial tests but failed scan tests, especially after
burn-in or packaging. Most authors, however, do not make this assumption explicit or explore its
consequences, and an unexamined preference for the fault candidate that “explains the most failures”
(regardless of over-prediction) is common to many diagnosis algorithms.
A more balanced approach was proposed by De and Gunda [DeGun95], in which the user can
supply relative weightings for misprediction and non-prediction. By modifying traditional scoring
with these weightings, the algorithm assigns a quantitative ranking to each stuck-at fault. The authors
claim that the method can be used to explicitly target defects that behave similar to but not exactly like
the stuck-at model, such as some opens and multiple independent stuck-at faults, but it can diagnose
bridging defects only implicitly (by user interpretation). This is perhaps the most general of the simple
stuck-at algorithms and is unique for its ability to allow the user to adjust the assumptions about
unmodeled behavior that other algorithms make implicitly.
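Schematically, this whole family of stuck-at matching schemes can be viewed as a weighted count of mispredicted and non-predicted failures. The Python sketch below is an illustration of that idea under assumed set-based signatures and weights, not a reproduction of the Teradyne or De and Gunda implementations:

    # Candidate scoring with separate misprediction / non-prediction penalties.
    # Signatures are sets of (vector, output) pairs; a lower score is a better match.

    def penalty_score(candidate, observed, w_mispred=1.0, w_nonpred=1.0):
        mispredicted = candidate - observed    # predicted by the fault, not seen on the tester
        nonpredicted = observed - candidate    # seen on the tester, not predicted by the fault
        return w_mispred * len(mispredicted) + w_nonpred * len(nonpredicted)

    observed  = {(5, 2), (7, 3), (8, 7)}
    candidate = {(5, 2), (7, 3), (9, 1)}       # explains two failures, mispredicts one
    print(penalty_score(candidate, observed))  # 2.0: one misprediction plus one non-prediction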
2.5.2 Waicukauski & Lindbloom
The algorithm developed by Waicukauski and Lindbloom (W&L) [WaiLin89] deserves its own
subsection because it has been so pervasive and successful — the most popular commercial tool is
based on this algorithm — and also because it introduced several techniques that other algorithms have
since adopted.
The W&L algorithm relies solely on stuck-at fault assumptions and simulations, and as such can
be best classified as a (dynamic) cause-effect algorithm. It does, however, use limited path-tracing to
implicate portions of the circuit and reduce the number of simulations it performs, so it does borrow
elements from effect-cause approaches.
The W&L algorithm uses a very simple scoring mechanism, relying mainly on exact matching.
But, it performs this matching in an innovative way, by matching fault signatures on a per-test basis.
Most fault diagnosis algorithms count the number of mismatched bits between the observed behavior
and a candidate fault signature across the entire test set. Each bit is a <vector:output> pair, as in the
Teradyne algorithm described earlier, and an intersection is performed between the set of bits in the
observed behavior and the set in each candidate fault signature.
In the W&L algorithm, by contrast, each test vector that actually fails on the tester is considered
independently. For each failing test, the set of failing outputs is compared with each candidate fault; if
a candidate predicts a fail for that test, and the outputs match exactly, then a “match” is declared. Each
matching fault candidate is then simulated against the rest of the failing tests, and the candidate that
matches the most failing tests (exactly) is retained. All of the matched test results for this candidate
are removed from the observed faulty signature, and the process repeats until all failing tests are
considered.
Note that this matching algorithm is really a greedy coverage algorithm over the set of failing
tests. Since the tests are considered in order, the sequence in which the tests are examined could affect
the contents of the final diagnosis when multiple candidates are required to match all of the tests. It
should also be noted that the practice of removing test results as they are matched reflects a desire to
address multiple simultaneous defects, as well as an assumption that the fault effects from such defects
are non-interfering.
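The following simplified Python sketch is my own rendering of that greedy per-test loop, not the published W&L implementation. Here observed maps each failing test to its set of failing outputs, and simulate(fault, test) is assumed to return the set of outputs at which the candidate fault would fail that test:

    def per_test_greedy_diagnosis(observed, candidates, simulate):
        diagnosis, remaining = [], dict(observed)
        while remaining:
            test = next(iter(remaining))                 # an as-yet-unexplained failing test
            matches = [f for f in candidates
                       if simulate(f, test) == remaining[test]]   # exact per-test match
            if not matches:
                remaining.pop(test)                      # no candidate explains this test
                continue
            # Keep the candidate that exactly matches the most remaining failing tests.
            best = max(matches, key=lambda f: sum(simulate(f, t) == outs
                                                  for t, outs in remaining.items()))
            diagnosis.append(best)
            remaining = {t: outs for t, outs in remaining.items()
                         if simulate(best, t) != outs}   # drop the tests it explains
        return diagnosis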
The algorithm also conducts a simple post-processing step, in which it classifies the diagnosis by
examining the final candidate set. If the diagnosis consists of a single stuck-at fault (with any
equivalent faults) that matches all failing tests, it then checks the tests that pass in the observed
behavior. If all of these passing test results are also predicted by the stuck-at candidate, the diagnosis
is classified as a “Class I” diagnosis, or an exact match with a single stuck-at fault. If the diagnosis
consists of a single candidate that matches all failing tests but not all passing tests (i.e., there is some
misprediction), then the diagnosis is classified as “Class II”. The authors explain that Class II
diagnoses could indicate the presence of an open, an intermittent stuck-at defect, or a dominance
bridging fault. Finally, a “Class III” diagnosis consists of multiple stuck-at candidates with possible
mispredicted and non-predicted behaviors.
The two most interesting features of the W&L algorithm, the per-test approach and the post-processing analysis, will be discussed further in later sections of this thesis. Overall, the W&L
algorithm is interesting not only because it is so commonly used, but also because it raises some
interesting theoretical issues.
2.5.3 Stuck-At Path-Tracing Algorithms
The classic effect-cause algorithms are those that rely on path-tracing to implicate portions of the
circuit. Examples of these are the approaches suggested by Abramovici and Breuer [AbrBre80] and
Rajski and Cox [RajCox87]. While they claim fault-model-independence, these algorithms attempt to
identify nodes in the circuit that can be demonstrated to change their logic values (or toggle) during the
test set, which amounts to an implicit targeting of stuck-at faults. In fact, these algorithms maintain a
stricter adherence to the stuck-at model than the cause-effect algorithms just described, as any
intermittent stuck-at defect is not anticipated and would not be diagnosed correctly.
2.5.4 Bridging fault diagnosis
The first evolution of diagnosis algorithms away from the stuck-at model was when they started
to address bridging faults explicitly. Some of the stuck-at diagnosis algorithms already presented
claim to be able to occasionally diagnose bridging faults, but only fortuitously by addressing limited
unmodeled behavior. Perhaps the simplest explicit bridging fault diagnosis algorithm is that proposed
by Millman, McCluskey, and Acken (MMA) [MilMcC90], which was a direct transition from stuck-at
faults to bridges. The authors introduced the idea of composite bridging-fault signatures, which are
created by concatenating the four stuck-at fault signatures for the two bridged nodes. This was a novel
way of creating fault signatures without relying on bridging fault simulation, which can be
computationally expensive especially if electrical effects are considered. The underlying idea is that
the actual behavior of a bridge, for any failing test vector, will be a subset of the behaviors predicted
by the four related stuck-at faults. The matching algorithm used is simple subset matching: any
candidate whose composite signature contains all the observed failing <vector:output> pairs is
considered a match and appears in the final diagnosis.
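Rendered with set-based stuck-at signatures, the composite construction and subset match are only a few lines. The sketch below is a minimal illustration, with a hypothetical stuck_at table keyed by (node, stuck-at value), and is not the MMA implementation itself:

    # Composite bridging fault signatures built from four stuck-at signatures.
    # stuck_at[(node, v)] is the simulated signature of node stuck-at v,
    # stored as a set of (vector, output) pairs.

    def composite_signature(stuck_at, node_a, node_b):
        return (stuck_at[(node_a, 0)] | stuck_at[(node_a, 1)] |
                stuck_at[(node_b, 0)] | stuck_at[(node_b, 1)])

    def subset_match(observed, stuck_at, bridge_candidates):
        # A bridge matches if its composite signature covers every observed failure.
        return [(a, b) for (a, b) in bridge_candidates
                if observed <= composite_signature(stuck_at, a, b)]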
A similar approach to the MMA algorithm was taken by Chakravarty and Gong [ChaGon93],
whose algorithm did not explicitly create composite signatures but used a matching technique on
combinations of stuck-at signatures to create the same result. Both of these bridging-fault diagnosis
methods suffer from imprecision, however: the average diagnosis sizes for both are very large,
consisting of hundreds or thousands of candidates. The performance of the MMA algorithm was
improved significantly by Chess, Lavo, et al. [CheLav95], by classifying vectors in the composite
signatures as stronger or weaker predictions of bridging fault behavior, and refining the match scoring
appropriately. Other researchers have continued to use and extend the idea of (stuck-at based)
composite signatures for various fault models [VenDru00].
A more direct approach to bridging fault diagnosis was suggested by Aitken and
Maxwell [AitMax95]. As opposed to the algorithms just described, in which the simple stuck-at fault
model is augmented with more-complex algorithms to deal with unmodeled behavior, the authors
instead chose to build dictionaries comprised of realistic bridging faults. (A realistic bridging fault is a
short that is considered likely to occur in the fabricated circuit based on a signal-line proximity
analysis of the circuit artwork.) This is pure cause-effect diagnosis for bridging-faults: the fault
candidates are the same faults targeted for diagnosis. The authors report excellent results, both in
accuracy and precision.
While there are obvious advantages to this approach, there are also significant disadvantages.
The number of realistic two-line bridging faults is significantly larger than the number of single stuck-at faults for a circuit. Since simulating each of these faults can be expensive, especially if
the simulation considers electrical effects, the overall time spent in fault simulation can be prohibitive.
In addition, even the best bridging fault simulations may not reflect the behavior of actual shorts,
requiring continual validation and refinement of the fault models [Ait95] and possibly the use of a
more complex matching algorithm.
Bridging fault diagnosis in general is plagued by the so-called candidate selection problem: there
are many more faults in a circuit than can be reasonably considered by any diagnosis algorithm. Even
for two-line bridging faults, there are $\binom{n}{2}$ possible candidates. The Aitken and Maxwell approach got
around this problem by considering only realistic bridging faults, but the analysis required for
determining the set of realistic faults can itself be impractical. Other methods have been suggested,
including one by Lavo et al. [LavChe97] that used a two-stage diagnosis approach, the first stage to
identify likely bridges and the second stage to directly diagnose the bridging fault candidates. This
thesis will explore the candidate selection problem in more detail in a subsequent chapter.
2.5.5 Delay fault diagnosis
Due to the increasing importance of timing-related defects in high-performance designs,
researchers have proposed methods to diagnose timing defects with delay fault models. Because of its
simplicity, the transition fault model, in which the excessive delay is lumped at one circuit node, has
been preferred. Diagnosis with the path-delay fault model, which considers distributed delay along a
path from circuit input to output, has been hampered by the candidate selection problem: there are an
enormous number of paths through a modern circuit.
An example of fault diagnosis using the path-delay fault model is the approach suggested by
Girard et al. [GirLan92]. The authors use a method called critical path tracing [AbrMen84] to traverse
backwards through the circuit from the failing outputs, implicating nodes that transition for each test.
In this way it is similar to the effect-cause algorithms described in section 2.5.3, but its decisions at
each node are determined by the transition fault model rather than the stuck-at fault model.
2.5.6 IDDQ diagnosis
Aside from logic levels and assertion timing data, researchers have applied information from other
types of tests to diagnose defects. One source of such information is the amount of quiescent current
drawn for certain test vectors, or IDDQ diagnosis. The vectors used for IDDQ diagnosis are designed to
put the circuit in a static state, in which no logic transitions are occurring, so that a high amount of
measured current draw will indicate the likely presence of a defect (such as a short to a power line).
An advantage to IDDQ diagnosis is that the defects should have high observability: the measurable fault
effects do not have to propagate through many levels of logic to be observed, but are rather measured
at the supply pin. The issue of IDDQ observability is a complicated one, however, and will be discussed
later in Chapter 7.
Aitken presented a method of diagnosing faults when logic fails and IDDQ fails are measured
simultaneously [Ait91], and he later generalized this approach to include fault models for intra-gate
and inter-gate shorts [Ait92]. The approach presented by Chakravarty and Liu examines the logic
values applied to circuit nodes during failing tests, and attempts to identify pairs of nodes with
opposite logic values as possible bridging fault sites [ChaLiu93]. All of these approaches, however, rely
on IDDQ measurements that can be definitively classified as either a pass or a fail, which limits their
application in some situations.
This limitation is addressed by the application of current signatures [Bur89, GatMal96], in which
relative measurements of current across the test set are used to infer the presence of a defect, rather
than the absolute values of IDDQ. A diagnosis approach suggested by Gattiker and Maly [GatMal97,
GatMal98] attempts to use the presence of certain large differences between current measurements as a
sign that certain types of defects are present. This concept was further extended by Thibeault [Thi97],
who applied a maximum likelihood estimator to changes in IDDQ measurements to infer defective fault
types. These approaches, while more robust, stress the implication of defect type rather than location;
the algorithm I propose later in this thesis targets explicit fault instances or locations. It is possible that
these two strategies could be combined to further improve resolution, a topic I discuss in Chapter 7.
2.5.7 Recent Approaches
A couple of recently-published papers have suggested diagnosis algorithms that attempt to target
multiple defects or fault models. The first, called the POIROT algorithm [VenDru00], diagnoses test
patterns one at a time, much like the Waicukauski and Lindbloom algorithm. In addition, it employs
stuck-at signatures, composite bridging fault signatures, and composite signatures for open faults on
nets with fanout. Its scoring method is rather rudimentary, especially when it compares the scores of
different fault models, relying on an interpretation of Occam’s Razor [Tor38] to prefer stuck-at
candidates over bridging candidates, and bridging candidates over open faults.
Another algorithm, called SLAT [BarHea01], also uses a per-test diagnosis strategy, and attempts
a coverage algorithm over the observed behavior using stuck-at signatures and only exact matching of
failing outputs. In both of these ways it is very similar to the W&L algorithm. However, it modifies
that algorithm by attempting to build multiple coverings, which it calls multiplets; each multiplet is a
set of stuck-at faults that together explain all the perfectly-matched test patterns. Test results that don’t
match exactly, and passing patterns, are ignored.
Because they explicitly target multiple faults and complex fault behaviors, the SLAT and the
POIROT algorithms are interesting for application to an initial pass of fault diagnosis, when little is
known about the underlying defects. These algorithms, in addition to W&L, will be discussed further
in Chapter 4 of this thesis, which addresses initial-stage fault diagnosis.
2.5.8 Inductive Fault Analysis
The diagnosis techniques presented so far do not use physical layout information to diagnose
faults. Intuitively, however, identifying a fault as the explanation for observed failures has much to do with the relative
likelihood of certain defects occurring in the actual circuit. Inductive Fault Analysis (IFA) [SheMal85]
uses the circuit layout to determine the relative probabilities of individual physical faults in the
fabricated circuit.
Inductive fault analysis uses the concept of a spot defect (or point defect), which is an area of
extra or missing conducting material that creates an unintentional electrical short or break in a circuit.
As these spot defects often result in bridge or open behaviors, inductive fault analysis can provide a
fault diagnosis of sorts: an ordered list of physical faults (bridges or opens) that are likely to occur, in
which the order is defined by the relative probability of each associated fault. The relative probability
of a fault is expressed as its weighted critical area (WCA), defined as the physical area of the layout
that is sensitive to the introduction of a spot defect, multiplied by the defect density for that defect
type. For example, two circuit nodes that run close to one another for a relatively long distance provide
a large area for the introduction of a shorting point defect; the resulting large WCA value indicates that
a bridging fault between these nodes is considered relatively likely.
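As a small illustrative sketch (not taken from any IFA tool; the critical areas and defect densities below are invented), ranking bridge sites by weighted critical area might look like this in Python:

def weighted_critical_area(critical_area, defect_density):
    # WCA = layout area sensitive to a spot defect x density of that defect type.
    return critical_area * defect_density

# Hypothetical layout data: (node pair, critical area in um^2, defects per um^2)
sites = [(("a", "b"), 12.0, 1e-6), (("c", "d"), 3.5, 1e-6), (("a", "c"), 8.0, 2e-6)]
for pair, area, density in sorted(sites,
                                  key=lambda s: weighted_critical_area(s[1], s[2]),
                                  reverse=True):
    print(pair, weighted_critical_area(area, density))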
One way that inductive fault analysis can be applied to fault diagnosis is through the creation of
fault lists. Inductive fault analysis tools such as Carafe [JeeISTFA93, JeeVTS93] can provide a
realistic fault list, useful for fault models such as the bridging fault model, in which the number of
possible faults is intractable for most circuits. By limiting the candidates to only faults that can
realistically occur in the fabricated circuit, a diagnosis can be obtained that is much more precise than
one that results from consideration of all theoretical faults.
Another possible way to use inductive fault analysis for diagnosis is presented in Chapter 6, in
which IFA can provide the a-priori probabilities for a set of candidate faults. This is a generalization of
the idea of creating faultlists, in which faults are not characterized as realistic or unrealistic, but instead
are rated as more or less probable.
IFA has also been applied to the related field of yield analysis; a technique proposed by Ferguson
and Yu [FerYu96] uses a combination of IFA and maximum likelihood estimation to perform a sort of
statistical diagnosis on process monitor circuits. A similar combination of layout examination,
statistical inference, and fault modeling will be applied to more traditional cause-effect fault diagnosis
in Chapter 6 of this thesis.
2.5.9 System-Level Diagnosis
The area of system-level diagnosis, which deals with finding defective components in large-scale
electronic systems, is outside the area of research of this dissertation. However, some interesting work
has been done in this area, which predates CMOS and VLSI diagnosis and often deals with very different issues. The most comprehensive diagnosis approach has been developed by
Simpson and Sheppard [SimShe94], who have presented a probabilistic approach for everything from
determining the identity of failing subsystems to determining the optimal order of diagnostic tests.
They have also suggested an approach for CMOS diagnosis using fault dictionaries [SheSim96]. Their
methods apply the Dempster-Shafer method of analysis, which I will use extensively and discuss
further in Chapter 4.
Chapter 3. A Deeper Understanding of the Problem: Developing a Fault Diagnosis Philosophy
The previous chapter presented some of the various ways that researchers have approached the
problem of VLSI fault diagnosis. These attempts have spanned a period of over 25 years, and a good
deal of academic and industrial effort has gone into making fault diagnosis work in the real world.
And yet, few if any academic diagnosis algorithms have made a successful transition into industrial
use.
The reasons for this lack of success are many, but chief among them is probably the
disparity between academic assumptions about the problem and the real-world conditions of industrial
failure analysis. This chapter will examine these assumptions in some detail, and by trying to rectify
them will present a philosophic framework for approaching the problem of fault diagnosis that will
guide the rest of the research presented in this thesis.
3.1 The Nature of the Defect is Unknown
Several theoretical fault diagnosis systems have claimed great success in some variation of the
following experiment: physically create or simulate a defect of a certain fault type, create some
candidates of that fault type, and run the diagnosis algorithm to choose the correct candidate out of the
list. While the accuracy of these success stories is indeed laudable, the result is a little like pulling a
guilty culprit out of a police lineup: the job is made much easier if the choices are limited ahead of
time.
It is an unfortunate fact of failure analysis, however, that what form a defect has taken, or what
fault model could best represent the actual electrical phenomenon, is not known in advance. In the real
world, a circuit simply fails some tests; it does not generally give any indication of what type of defect
is present. While some algorithms have been proposed that attempt to infer a defect type from some
behavior, most notably IDDQ information [GatMal97, GatMal98], these will not work on the most
common failure type: there is generally little or no information about defect type that can be gleaned
from standard scan failures.
Acknowledging this lack of initial information leads to a basic principle of fault diagnosis,
often ignored by academic researchers but obvious to industrial failure analysis engineers:
(i) A fault diagnosis algorithm should be designed with the assumption that the underlying defect mechanism is unknown.
Given this fact, it makes little sense to design a fault diagnosis algorithm that only works when
the underlying defect is a certain type or class. Or, if an algorithm is targeted to one fault type, it
should be designed so that an unmodeled fault will result in either explicit or obvious failure. This
leads to the next principle:
(ii) A fault diagnosis algorithm should indicate the quality of its result.
This way, if a diagnosis algorithm does encounter some behavior that violates its basic
assumptions, it can let the user know that these assumptions may have been wrong.
3.2 Fault Models are Hopelessly Unreliable
Many clever diagnosis algorithms have been proposed, using a variety of fault models, and all
promise great success as long as one condition holds: nothing unexpected ever happens. These
expectations come from the fault model used, the diagnostic algorithm, or both. So, if the modeled
defect doesn't cause a circuit failure when expected, or if a failure occurs along an unanticipated path,
the algorithm will either quit or get hopelessly off the track of the correct suspect.
If the problem is defective fault models, then maybe the solution is to work very hard to perfect
the models. If the models were perfect, then diagnosis would reduce to a simple process of finding
exactly the matching candidate for the observed behavior. But, once again, the cold hard world
intrudes with the cold hard facts: fault model perfection is extremely difficult, and may very likely be
impossible.
Perhaps best documented are the problems inherent in bridging fault modeling: many simplified
bridging fault models have been proposed, and each in turn has been demonstrated to be inadequate or
inaccurate in one or more important respects [AckMil91, AckMil92, GrePat92, MaxAit93, MonBru92,
Rot94]. Even the most complex and computationally intensive models can fail to account for the
subtleties of defect characteristics and the vagaries of defective circuit behavior. And it is not only the
complex models that are prone to error: even apparently simple predictions may be hard to make when
real defects are involved [Ait95].
The unfortunate fact is that faulty circuits have the tendency to misbehave—they are faulty, after
all—and often fail in ways not predicted by the best of fault simulators or the most carefully crafted
fault models. The only answer is that any diagnostic technique that hopes to be effective on real-world
defective circuits has to be robust enough to tolerate at least some level of noise and uncertainty. If
not, the only certain thing about the process will be the resulting frustration of a sadly misguided
engineer.
(iii) A fault diagnosis algorithm should make no inviolable assumptions regarding the defect behavior or its chosen fault model(s): fault models are only approximations.
3.3 Fault Models are Practically Indispensable
Given the well-documented limitations of fault models, several diagnosis algorithms have tried to
minimize them or do away with them completely. Some, such as some effect-cause algorithms, claim
to be “fault-model-independent”. Others attempt to use the abstract nature of the stuck-at fault model
to avoid the messy and unreliable aspects of realistic fault models.
While the idea behind these approaches has merit, abstract diagnosis is not enough for real-world
failure analysis. The majority of fault diagnosis algorithms that address complex defects use the stuck-at fault model to get “close enough” to the actual defect behavior to enable physical mapping. But,
using the stuck-at model alone results in some well-characterized problems in both accuracy and
precision. For example, even a robust stuck-at diagnosis may identify one of two shorted nodes only
60% to 90% of the time [AitMax95, LavChe97]. For situations in which a 10% to 40% failure rate is
unacceptable, or such partial answers (single-node explanations) are inadequate, stuck-at diagnosis
alone is not the answer.
The use of the stuck-at model is typical of a common answer to the problem of unreliable fault
models: use an abstract model that makes as few assumptions as possible. But, while this approach has
historically worked for testing, it is not likely to work for fault diagnosis.
Generally speaking, fault models have proved their utility for test generation. If, for example, a
test is generated to detect the (abstract) situation of a circuit node stuck-at 0, there is considerable
evidence to suggest that the test will, in the process, detect a wide range of related defects: the node
shorted to ground, perhaps, or missing conductor to a pull-up network, or even a floating node held
low from a capacitive effect. When testing a circuit for defects, the actual relation of fault model to
defect is less important than whether the defect is caught or not.
But what does it mean, in the world of fault diagnosis, to explain the actual failures of a circuit
with an abstract fault model? Try as one might, no failure analysis engineer is ever going to find a
single stuck-at fault under the microscope; a stuck-at fault, strictly defined, is not a specific
explanation, but is instead a useful fiction.
For fault diagnosis, the issue is one of resolution: the more abstract the model used, the less well
the fault candidates in the final diagnosis will map to actual defects in the silicon. A stuck-at candidate,
for example, may implicate a range of mechanisms or defect scenarios involving the specified stuck-at
node, and the failure analysis engineer must account for this poor resolution by performing some
amount of mapping to actual circuit elements. The more specific the fault model, the better the
correspondence to actual defects, and the less mapping work is required: a sophisticated bridging fault
candidate, with specific electrical characteristics, will usually resolve to either a single or a few defect
scenarios.
(iv) A more specific fault model is always preferable for diagnosis.
This is exactly the point made by Aitken and Maxwell [AitMax95], who pointed out the
perils of using abstract fault models for complex defect behaviors. While accuracy may be the most
important quality of a diagnosis algorithm, the precision of a diagnosis tool is what makes it truly
useful for failure analysis.
3.4 With Fault Models, More is Better
The conflicting principles of unknown fault origins and the desirability of specific fault models
lead to a dilemma. If a diagnosis algorithm can make no assumptions about the nature of the
underlying defect, how can it apply a specific or detailed fault model to the problem?
The answer, as with many things in life, is that more is better. Since no one fault model will ever
provide both the accuracy and precision required from useful fault diagnosis, the best approach is to
apply as many different fault models to the problem as possible. In this way, a wide range of possible
defects can be handled with the highest possible precision for the failure analysis engineer.
(v) The more fault models used or considered during fault diagnosis, the greater the potential for precision, accuracy, and robustness.
So, perhaps a stuck-at diagnosis, a bridging diagnosis, and a delay fault diagnosis or two could be
performed, and the results from this mix of algorithms examined. But apart from the time and work
required, a problem remains in reconciling the different results: how can one compare the top
candidates from, for example, a stuck-at fault diagnosis algorithm to the top bridging candidates from a
completely different algorithm? Many diagnosis techniques employ unique scoring mechanisms to
rate their candidates, and even when common techniques are used, such as Hamming distance, they are
often applied in different ways or to different data: a “1-bit difference” may mean something very
different for a stuck-at candidate than for an IDDQ candidate.
It is essential, then, that a diagnosis algorithm present its results in a way that enables comparison
to the results of other diagnosis algorithms. A diagnosis engineer will get the best result possible by
leveraging the efforts of many algorithms and different modeling, but only if these efforts can be
effectively combined.
(vi) A fault diagnosis algorithm should produce diagnoses that allow comparison or combination with the results from other diagnosis algorithms.
3.5 Every Piece of Data is Valuable
The concept of “more is better” regarding fault models applies equally well to information: the
more data that is applied to the problem of fault diagnosis, generally the higher the quality of the
eventual result. This is especially true of sets of data from different sources or types of tests, such as
using results from both scan and IDDQ tests. It can often be the case that IDDQ information, for example,
can differentiate fault candidates that are essentially equivalent under voltage tests [GatMal97,
GatMal98].
Therefore, the process of diagnosis should be inclusive, using every available source of information
to improve the final diagnosis.
(vii) A diagnosis algorithm or set of algorithms should use every available bit of data about the defect in producing or refining a diagnosis.
3.6 Every Piece of Data is Possibly Bad
There is one problem with the “use all data” rule: any or all of the data might be unreliable,
misleading, or downright corrupt. Data in the failure analysis problem is inherently noisy. As
mentioned, simulations and fault models are only imperfect approximations. The failure data from the
tester may not be completely reliable, and often results are not repeatable, especially for IDDQ
measurements. The data files may be compressed with some data loss, and with the size and
complexity of netlists and test programs, it’s always possible that some part of the test results or a
simulation is missing or incorrect. In general, then, any diagnosis algorithm that hopes to be
successful in the real (messy) world needs to be robust enough to handle some data error.
(viii) A diagnosis algorithm should not make any irreversible decisions based on any single piece of data.
3.7 Accuracy Should be Assumed, but Precision Should be Accumulated
The prime directive of a diagnosis algorithm is to be as accurate as possible, even at the cost of
precision. It is far better to give a large answer, or even no answer, than to give a wrong or misleading
one. A large or imprecise diagnosis can always be refined, but an inaccurate one will lead to physical
de-processing of the wrong part of a chip, with the possible destruction of the actual defect site.
(ix) Accuracy is the most important feature of a diagnosis algorithm; a large or even empty answer is preferable to the wrong answer.
But, a diagnosis methodology should be designed so that iterative applications of new data or different algorithms successively increase the precision and improve the diagnosis. Each step, however, needs to ensure that the accuracy of previous stages is not compromised or lost.
(x) Diagnosis algorithms should be designed so that successive stages or applications increase the precision of the answer, with a minimal sacrifice of accuracy.
3.8 Be Practical
Over the years there have been many diagnosis algorithms proposed, but the computational or
data requirements of many of them immediately disqualify them for application to modern circuits.
For instance, simulating a sophisticated fault model across an entire netlist of millions of logic gates is
usually not feasible. Neither is considering all $\binom{n}{2}$ possible two-line bridging faults.
If an algorithm does require sophisticated fault modeling, however, it may still have application
on a much-reduced faultlist resulting from a previously-obtained diagnosis. The trade-off in such a
case is that the precision promised by such an algorithm may be worth the initial work to reduce the
candidate space.
(xi) A diagnosis algorithm should have realistic and reasonable resource requirements, with high-resource algorithms reserved for high-precision diagnoses on a limited fault space.
Chapter 4. First Stage Fault Diagnosis: Model-Independent Diagnosis
Fault diagnosis, especially in its initial stage, can be a daunting task. Not only does the failure
analysis engineer not know what kind of defect he is dealing with, but there may in fact be multiple
separate defects, any number of which may interfere with each other to modify expected fault
behaviors. The defect behavior may be intermittent or difficult to reproduce. Also, the size of the
circuit may make application of all but the simplest diagnosis algorithms impractical.
Given these facts, a long-lived staple of fault diagnosis research has apparently outlived its
usefulness. The single fault assumption – that there is one defect in the circuit under diagnosis that can
be modeled by a single instance of a particular fault model – may not apply for modern fault diagnosis.
While it has simplified many diagnostic approaches, some of which have worked quite well despite
real-world violations of the premise, the single fault assumption has led to problems with two common
defect types: multiple faults, and complex faults. As defined here, complex faults are faults in which
the fault behavior involves several circuit nodes, involves multiple erroneous logic values, is pattern-dependent, or is otherwise intermittent or unpredictable.
Traditionally, the single fault assumption has led to the expectation of a certain internal
consistency, or some dependence between the test results, with regard to defective circuit behavior. In
cause-effect diagnosis, a fault model is selected beforehand, and the observed faulty behavior is
compared, as a single collection of failing patterns and outputs, to fault signatures obtained by
simulation. In effect-cause diagnosis, many algorithms look for test results that prove that certain
nodes in the circuit are able to toggle, and are therefore fault-free throughout the rest of the test set. In
either case, the assumption has been that individual test results are not independent, but are rather
wholly determined by the presence of the single unknown defect.
From the beginning, however, a few diagnosis techniques eschewed the single fault assumption,
especially those that directly addressed multiple faults. These approaches, either implicitly or
explicitly, forsake inter-test dependence and instead consider each test independently. The advantage
to such approaches is that pattern-dependent and intermittent faults can still be identified, as can the
component faults of complex defects. The drawback is that a conclusion drawn about the defect from
one test cannot be applied to any other test, and the net result is (in effect) a diagnosis for each test
pattern. This can lead to large candidate sets that are difficult to understand and use, especially as
guidance for physical failure analysis. Also, since these algorithms no longer implicate a single
instance of a fault model, there is now the problem of constructing a plausible defect scenario to
explain the observed behavior.
This chapter will attempt to address these drawbacks by improving both the process and the
product of per-test fault diagnosis. First, the process will be improved by including more information
to score candidates, and paring down the candidate list to a manageable number. Second, the product
will be improved by suggesting a way of interpreting the candidates to infer the most likely defect
type. The result is a general-purpose approach to identifying likely sources of defective behavior in a
circuit despite the complexity or unpredictability of the actual defects.
4.1 SLAT, STAT, and All That
While increasing in recent popularity, the idea of conducting fault diagnosis one test pattern at a
time is a venerable one. Waicukauski and Lindbloom [WaiLin89], Eichelberger et al. [EicLin91], and,
more recently, the POIROT [VenDru00] and SLAT [BarHea01] diagnostic systems all suggest or rely
on per-test fault diagnosis to address multiple or complex faults. We can, without too much license,
state the primary axiom of the one-test-at-a-time approach as follows:
For any single test, an exact match between the observed failures (at circuit
outputs or flip-flops) with those predicted by a simulated fault is strong
evidence that the fault is present in the circuit, if only during that test.
The underlying concept is uncontroversial, as it underpins both traditional fault diagnosis as well
as scientific modeling and prediction: A match between model and observation supports the
assumptions of the model or implicates the modeled cause. The difference here is that the traditional
comparison of model to observed behavior is decomposed into comparisons on individual test vectors,
with a stricter threshold of exact matching to produce stronger implications.
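A minimal sketch of this per-test matching step, assuming precomputed per-test failing-output sets for each candidate fault (the data below is hypothetical and loosely anticipates the example of Figure 4.1):

def per_test_matches(observed_fails, predicted_fails, faults):
    # For each failing test, list the faults whose simulated failing
    # outputs exactly equal the observed failing outputs.
    matches = {}
    for test, outputs in observed_fails.items():
        matches[test] = [f for f in faults
                         if predicted_fails.get((f, test), frozenset()) == outputs]
    return matches

observed = {1: frozenset({"o2"}), 2: frozenset({"o5"}), 3: frozenset({"o1", "o4"})}
predicted = {("A", 1): frozenset({"o2"}), ("B", 2): frozenset({"o5"}),
             ("C", 3): frozenset({"o1", "o4"}), ("D", 3): frozenset({"o1", "o4"}),
             ("E", 3): frozenset({"o1", "o4"})}
print(per_test_matches(observed, predicted, "ABCDE"))
# -> {1: ['A'], 2: ['B'], 3: ['C', 'D', 'E']}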
The statement that “the fault is present” should not be taken too broadly. It does not mean that
the fault (or modeled defect) is physically present, or that any conclusions can be drawn about the
defect in any circumstance other than the specific failing test. Applied most commonly to stuck-at faults, all that can be inferred from a match is that a particular node has the wrong value for a
particular test. However, that node is not implicated as the source of any other failures, nor is it
actually “stuck-at” any value at all, since there is no evidence that it doesn’t toggle during other tests.
Note also that the axiom cannot claim that a match constitutes proof that a particular fault is
present. A per-test diagnosis approach can be fooled by aliasing, when the fault effects from multiple
or complex faults mimic the response from a simple stuck-at fault. This can happen, for instance, if
the propagation from a fault site is altered by the presence of other simultaneous faults, or due to
defect-induced behaviors such as the Byzantine General’s effect downstream from bridged circuit
nodes [AckMil91, LamSho80]. The probability of such aliasing is impossible to determine, given the
variety of ways in which it could occur. Per-test diagnosis approaches rely on the assumption that this
probability is small, and on the hope that, should aliasing implicate the wrong fault, this fault is not wholly unrelated to the actual defect and is therefore not completely misleading.
A secondary axiom, implicit in the W&L paper but stated in somewhat different terms in the
SLAT paper, is the following:
There will be some tests during which the defect(s) to be diagnosed will
behave as a single, simple fault, which will, by application of the primary
axiom, implicate something about the defect(s).
What this axiom states is that, for any defective chip, there will be some tests for which the
failing outputs will exactly match the predicted failing outputs of one or more simple (generally stuck-at) faults. This assertion relies on the observation that many complex defects will, for some applied
tests, behave like stuck-at faults that are in some way related to the actual defect. For example, a
bridging fault will occasionally behave, on some tests, just like a stuck-at fault on one of the bridged
nodes.2
The way that a per-test fault diagnosis algorithm proceeds is to find these simple failing tests
(referred to in the SLAT paper as SLAT patterns), and identify and collect the faults that match them.
The candidate faults are arranged into sets of faults that cover all the matched tests. The SLAT authors
call these collections of faults multiplets, a term adopted in this thesis. As a simple example, consider
the following three tests, with the associated matching fault candidates:
Test Number    Exactly-Matching Faults
1              A
2              B
3              C, D, E

Figure 4.1: Simple per-test diagnosis example.
In this example, fault A is a match for test #1, which means that the predicted failing outputs for
fault A on test #1 match exactly with the observed failing outputs for that test. Similarly, fault B
matches on test #2, while for test #3 three faults match exactly: C, D, and E. The SLAT algorithm will
build the following multiplets as a diagnosis: (A, B, C), (A, B, D), and (A, B, E). Each multiplet
“explains”, or covers, all of the simple failing test patterns. SLAT uses a simple recursive covering
algorithm to traverse all covering sets smaller than a pre-set maximum size, and then only reports
minimal-sized coverings (multiplets) in its final diagnosis.
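The covering step can be sketched as follows (an illustrative recursion, not the published SLAT code): given the per-test match lists, enumerate covers up to a size limit and keep only the minimal-sized ones.

def multiplets(per_test, max_size):
    # Enumerate fault sets no larger than max_size that contain at least
    # one matching fault for every simple failing test, then keep only the
    # minimal-sized covers.
    tests = list(per_test)
    covers = set()

    def extend(i, chosen):
        if len(chosen) > max_size:
            return
        if i == len(tests):
            covers.add(chosen)
            return
        faults = per_test[tests[i]]
        if chosen & set(faults):          # this test is already covered
            extend(i + 1, chosen)
        else:
            for f in faults:
                extend(i + 1, chosen | {f})

    extend(0, frozenset())
    smallest = min((len(c) for c in covers), default=0)
    return [c for c in covers if len(c) == smallest]

per_test = {1: ["A"], 2: ["B"], 3: ["C", "D", "E"]}
print(sorted(sorted(m) for m in multiplets(per_test, 4)))
# -> [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E']]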
For comparison, the W&L algorithm will report one set of faults – (A, B, C, D, E) – in its diagnosis on the above example, with a note that faults C, D, and E are equivalent explanations for test #3. The POIROT algorithm will produce the same results, with a score based on how many tests are explained by each fault (in this case, all faults would get the same score).

2 Note that this axiom is also the basis of the original MMA algorithm [MilMcC90], which used stuck-at faults to diagnose bridging faults (see Section 2.5.4 of this thesis).
There are several advantages to the per-test fault diagnosis approach. First, it explicitly handles the
pattern-dependence often seen with complex fault behaviors. It also explicitly targets multiple fault
behaviors. And, by breaking up single stuck-at fault behaviors into their per-test components, it
attempts to perform a model-independent or abstract fault diagnosis. (Since it still relies on stuck-at
fault sensitization and propagation conditions, however, it cannot be considered truly fault-model-independent.) This sort of abstract fault diagnosis is just the thing for an initial, first-pass fault
diagnosis when nothing is known about the actual defect(s) present.
This chapter will propose a new per-test algorithm. This algorithm is similar in style to the SLAT
diagnosis technique, but is able to use more information and so produce a better, more quantified,
diagnostic result. The SLAT technique is focused on determining fault locations, hence the name:
“Single Location At a Time”. The new approach will instead focus on the faults themselves, but will,
like SLAT, diagnose test patterns one at a time. Borrowing the nomenclature, however, we will refer
to the process of per-test diagnosis as “STAT” – “Single Test At a Time”.3 For shorthand, the new
algorithm will be called “iSTAT”, for “improved STAT”. Like SLAT, the iSTAT algorithm uses
stuck-at faults to build multiplets, but differs from SLAT in two important ways. First, it uses a
scoring mechanism to order multiplets to narrow the resulting candidate set. Second, it can use the
results from both passing and complex failing tests to improve the scoring of candidate fault sets.

3 We will hereafter refer to the class of diagnosis algorithms that includes Waicukauski and Lindbloom, POIROT, SLAT, and the new iSTAT algorithm as “STAT”, or “per-test”, diagnosis algorithms.

4.2 Multiplet Scoring

The biggest problem with a STAT-based diagnosis is that, since each test is essentially an individual diagnosis, the number of candidates can become quite large. Specifically, the number of multiplets used to explain the entire set of failing patterns can be large, and each multiplet will itself be composed of multiple individual component faults. What is needed is a way to reduce the number of
multiplets, or to score and rank the multiplets to indicate a preference between them. This section will
introduce a method for scoring and ranking multiplets. It will also talk about how to recover
information from tests that don’t fail exactly like a stuck-at fault, and from passing tests that don’t fail
at all.
4.3 Collecting and Diluting Evidence
The basic motivation of STAT-based approaches, as expressed in the first axiom above, is that an
exact match between failing and predicted outputs on a single test is strong evidence for the fault.
While this much seems reasonable, it seems just as obvious that the evidence provided by a failing test
is diluted if there are many fault candidates that match. For instance, in the simple example given
above, the evidence for fault A is much stronger than that for any of faults C, D, or E, simply because
fault A is the only candidate (according to the axiom) that can explain the failures of test #1. The
evidence provided by test #3 is just as significant as the evidence from test #1; it is just shared among
three possible explanations.
This division of evidence can also be illustrated by imagining failures on outputs with a lot of
fan-in, or a defect in an area with many equivalent faults. While there will be a number of faults that
match the failure exactly, test results will not provide much compelling evidence to point to any
particular fault instance.
The first way that iSTAT improves per-test diagnosis is to consider the weight of evidence
pointing to individual faults, and to quantify and collect that evidence into multiplet scores. The
mechanism that iSTAT uses to quantify diagnostic evidence is the Dempster-Shafer method of
evidentiary reasoning.
4.4 “A Mathematical Theory of Evidence”
A means of quantitatively manipulating evidence was developed by Arthur Dempster in the
1960s, and refined by his student Glenn Shafer in 1976 [Sha76]. At its center is a generalization of the
familiar Bayes rule of conditioning, also known simply as Bayes Rule:

$$p(C_i \mid B) = \frac{p(C_i)\,p(B \mid C_i)}{p(B)} = \frac{p(C_i)\,p(B \mid C_i)}{\sum_{i=1}^{n} p(C_i)\,p(B \mid C_i)} \qquad (1)$$
In this formulation of Bayes Rule, B represents some phenomenon or observed behavior, and
each Ci is a possible candidate explanation or cause for that behavior. The set of candidates is assumed
to be mutually exclusive. Bayes Rule is commonly used for the purposes of statistical inference or
prediction, which attempt to determine the most likely probability distribution or cause underlying a
particular observed phenomenon.
Bayes Rule uses the prior probability (or a-priori probability) p(Ci) of candidate Ci and the
conditional probability of B given the candidate Ci to determine the posterior probability p(Ci | B) of
candidate Ci given B. This posterior probability is central to Bayes decision theory, which states that
the most likely candidate given a certain behavior is that for which
$p(C_i \mid B) \geq p(C_j \mid B)$ for all $i \neq j$
When applied to the problem of fault diagnosis, Bayes decision theory can be used to determine
the best fault candidate (Ci) given a particular observed behavior (B).
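As a small worked sketch (with invented priors and likelihoods), the posterior computation over a set of mutually exclusive fault candidates is straightforward:

def posteriors(priors, likelihoods):
    # Bayes Rule: p(Ci | B) = p(Ci) p(B | Ci) / sum_j p(Cj) p(B | Cj)
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    total = sum(joint.values())
    return {c: v / total for c, v in joint.items()}

# Three hypothetical candidates with equal priors but different
# probabilities of producing the observed behavior B:
print(posteriors({"f1": 1/3, "f2": 1/3, "f3": 1/3},
                 {"f1": 0.9, "f2": 0.3, "f3": 0.05}))
# -> f1 receives posterior 0.9 / 1.25 = 0.72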
The Dempster-Shafer method was developed to address certain difficulties with Bayes Rule when
it is applied to the conditions of epistemic probability, in which probability assignments are based on
belief or personal judgement, rather than its usual application to aleatory probability, where probability
values express the likelihood or frequency of outcomes determined by chance.
The conditions of epistemic probability are familiar to most people: a person will assign a degree
of belief to a proposition relative to the strength of evidence presented in its favor. There is an explicit
and unavoidable role of judgement in such a process. It is possible or likely that no prior information
or belief about the problem exists before the evidence is considered. Finally, there is a possibility that
a judgement cannot be made, or belief will be reserved, in the case of ignorance or lack of evidence.
The Dempster-Shafer method is designed with these considerations in mind. It is best illustrated
geometrically: the basic element of the Dempster-Shafer method is a belief function, which can be
thought of as a division of a unit line segment into various probability assignments. Probability can be
assigned either to individual possibilities (referred to as singletons) or to subsets of possibilities; the set of all singletons is represented by Θ. A probability assignment represents the support accorded to some
singleton or subset based on a piece of evidence; in addition, an explicit degree of doubt or ignorance
about the evidence can be assigned. The total of all probability assignments equals one; an example of
such an assignment over subsets A1, …, Am is shown in Figure 4.2. (The “m1” notation can be thought
of either as the probability “mass” or “measure” accorded due to the first piece of evidence.)
[Figure: a unit line segment partitioned into probability masses m1(A1), m1(A2), …, m1(Am), and m1(Θ).]

Figure 4.2. An example belief function.
The assignment m1(Θ) represents the degree of doubt regarding the evidence or the assignments,
and represents probability not accorded to any singleton or subset. The introduction of a second piece
of evidence results in the creation of a second belief function, with a new assignment of probabilities to
a possibly-different set of elements:
[Figure: a second unit line segment partitioned into masses m2(B1), m2(B2), …, m2(Bn), and m2(Θ).]

Figure 4.3. Another belief function.
Dempster’s rule of combination performs an orthogonal combination of these two belief
functions. Geometrically, the two line segments are combined to produce a square, which represents
the new total probability mass of the combination:
[Figure: the unit square formed by crossing the two line segments, with m1(A1), …, m1(Am), m1(Θ) along one axis and m2(B1), …, m2(Bn), m2(Θ) along the other.]

Figure 4.4. The combination of two belief functions.
The squares in Figure 4.4 represent the probability assigned to intersections of the subsets. The
total combined probability of a subset is the sum of all non-contradictory assignments to that subset.
Note that Θ combined with any singleton or subset is not contradictory, and so such combinations are
included in the summations. The actual final probability assigned to each subset is re-normalized by
dividing by the total probability mass assigned to non-contradictory combinations:

$$m(C) = \frac{\sum_{i,j:\; A_i \cap B_j = C} m_1(A_i)\,m_2(B_j)}{1 - \sum_{i,j:\; A_i \cap B_j = \emptyset} m_1(A_i)\,m_2(B_j)} \qquad (2)$$
4.5 Turning Evidence into Scored Multiplets
The iSTAT algorithm uses a relatively straightforward implementation of the Dempster-Shafer
method for diagnostic scoring. Each failing test that is matched exactly by one or more fault
candidates results in a belief function; each candidate is assigned an equal portion of the belief
assigned by the test result. Also, some probability mass is reserved to account for the possibility of
aliasing, discussed earlier in this chapter. Since an exact match on a test result is the strongest
evidence implicating fault candidates, this reserved belief is small.
For this application, singletons are defined as vectors of n individual faults, each of which
explains or matches one of the n simple failing tests. As an example, consider a circuit with two
simple failing tests and three individual fault candidates: A, B, and C. Since a valid diagnostic
explanation must cover both failing tests, the set Θ of possible singletons is {(A,A), (A,B), (A,C), (B,A), (B,B), (B,C), (C,A), (C,B), (C,C)}. If fault A is a match for the first test, the evidence provided by this match devolves on the subset {(A,A), (A,B), (A,C)}, which will be represented here by (A,θ).
An example of using Dempster’s rule of combination on these elements is shown in Figure 4.5 below.
In this example, faults A and B both match on test 1, and faults A and C match on test 2. (In
Dempster-Shafer terms, (A,θ) and (B,θ) are the focal elements of test #1, as are (θ,A) and (θ,C) for
test #2.)
The Dempster-Shafer method provides two ways of calculating total belief given the probabilities
computed according to Dempster’s rule of combination. The first is termed belief (Bel), and the
second is upper probability (P*):
$$\mathrm{Bel}(A) = \sum_{B \subseteq A} m(B) \qquad (3)$$

$$P^{*}(A) = 1 - \mathrm{Bel}(\bar{A}) = \sum_{B \cap A \neq \emptyset} m(B) \qquad (4)$$
To illustrate these calculations, the belief and upper probability assigned to the singleton (A,C)
are:
Bel(A,C) = m(A,C)
P*(A,C) = m(A,C) + m(A,θ) + m(θ,C) + m(Θ)
[Figure: the combination square for the two tests, with m1(A,θ), m1(B,θ), and m1(Θ) along one axis and m2(θ,A), m2(θ,C), and m2(Θ) along the other, producing cells m(A,A), m(A,C), m(B,A), m(B,C), m(θ,A), m(θ,C), m(A,θ), m(B,θ), and m(Θ).]

Figure 4.5. Example showing the combination of faults.
These combinations of sets of faults resemble multiplets in the STAT sense, but not all Dempster-Shafer combinations qualify as valid multiplets. First, a multiplet must be complete, or contain a fault
to match every simple failing test. Due to the way evidence is distributed by iSTAT (to individual
faults), this implies that only singletons with non-zero belief assignments qualify as multiplet
candidates. As an example, for the singleton (B,B) in Figure 4.5, P*((B,B)) = m(B,θ) + m(Θ), but
Bel(B,B) = 0. Second, certain singletons are indistinguishable as multiplets; for example, the
singletons (A,C) and (C,A) are equivalent to the multiplet (A,C). (Also note that the singleton (A,A) is
simply (A) when represented, in the final diagnosis, as a multiplet.) According to these criteria, the
valid multiplets after processing the two tests shown in Figure 4.5 are (A,A), (A,C), (A,B) and (B,C).
If a third simple failing test is processed, these four multiplets would constitute the focal elements
that are combined with the evidence from the third test. The iSTAT algorithm uses the upper
probability number of each multiplet as its new probability assignment, and these assignments are
normalized, along with m(Θ), according to Equation 2 above. The advantage to reducing the focal
elements to multiplets before each new test is that the size of the convolution stays practical even for a
large number of tests. Using the normalized plausibility allows the calculations to retain relevant
probability masses assigned to non-singleton combinations.4 An example of processing a third test,
matching with fault C, is shown in Figure 4.6 below.
[Figure: the coarsened focal elements m1,2(A,A,θ), m1,2(A,C,θ), m1,2(A,B,θ), m1,2(B,C,θ), and m1,2(Θ) combined with the third test's m3(θ,θ,C) and m3(Θ), producing cells m(A,A,C), m(A,C,C), m(A,B,C), m(B,C,C), m(θ,θ,C), m(A,A,θ), m(A,C,θ), m(A,B,θ), m(B,C,θ), and m(Θ).]

Figure 4.6. A third test result is combined with the results from the previous example.
4 Reducing the number of focal elements in this manner is referred to in Dempster-Shafer terminology as a coarsening of the frame of discernment. Using upper probability assignments for the new frame is referred to as an outer reduction over the frame.
After the last simple failing test has been processed, the upper probability numbers for all
qualifying multiplets are used as their respective scores. The iSTAT algorithm applies a final criterion
to the multiplets, however: a multiplet must be non-redundant, which means it cannot contain faults in
excess of those required to cover all of the simple failing tests. (This is an arbitrary criterion, but it is
consistent with the conventions of other per-test, and traditional, diagnosis algorithms.) In Figure 4.6,
the resulting multiplets are (A,C), (A,B,C), and (B,C), but multiplet (A,B,C) is marked as redundant
and eliminated. The final upper probabilities are re-normalized to produce the actual scores for the
remaining multiplets.
4.6 Matching Simple Failing Tests: An Example
A short example will illustrate the scoring process. Figure 4.7, below, presents some test-
matching results.
Test Number    Matching Faults
1              A
2              A, D
3              B
4              C, D

Figure 4.7. Example test results with matching faults.
Test #1 results in a belief function in which all evidence supports fault A. The
amount of ignorance regarding this test result (whether fault A is really the cause of the behavior) is
arbitrary but assumed to be small; the iSTAT algorithm uses the value m(Θ) = 0.01, so the support
awarded to fault A for test 1 is m1(A) = 0.99.
For test 2, the evidence supports both faults A and D, so the total belief is split between these
faults: m2(A) = m2(D) = 0.495. A geometric representation of the combination of these belief
functions is shown below. The proportion of area allotted to m1(Θ) is exaggerated in the figure for
readability.
[Figure: the combination square for tests 1 and 2 of the example, with m1(A,θ) = 0.99 and m1(Θ) = 0.01 along one axis and m2(θ,A) = m2(θ,D) = 0.495 and m2(Θ) = 0.01 along the other, producing cells m(A,A), m(A,D), m(A,θ), m(θ,A), m(θ,D), and m(Θ).]

Figure 4.8. Combination of evidence from the first two tests.
The calculation of combined probabilities is as follows:
P*(A,A) = m2(θ,A)m1(A,θ) + m2(θ,A)m1(Θ) + m2(Θ)m1(A,θ) + m1(Θ)m2(Θ)
        = (0.495)(0.99) + (0.495)(0.01) + (0.01)(0.99) + (0.01)(0.01)
        = 0.505

P*(A,D) = m2(θ,D)m1(A,θ) + m2(θ,D)m1(Θ) + m2(Θ)m1(A,θ) + m2(Θ)m1(Θ)
        = (0.495)(0.99) + (0.495)(0.01) + (0.01)(0.99) + (0.01)(0.01)
        = 0.505

m(Θ) = m2(Θ)m1(Θ) = (0.01)(0.01) = 0.0001
As you can see, equal plausibility is given to (A,A) and (A,D). Note that while the multiplet
(A,D) is redundant, redundant combinations must be retained until all simple failing tests are
processed. The re-normalized assignments then become m1,2(A,A) = m1,2(A,D) = 0.49995.
After the application of the third test, the results of which match with fault B, the revised
probabilities are:
P*(A,A,B) = (0.99)(0.49995) + (0.01)(0.49995) + (0.99)(0.0001) + (0.01)(0.0001) = 0.500049

P*(A,D,B) = 0.500049

m(Θ) = (0.0001)(0.01) = 0.000001
These assignments are then re-normalized to total 1.0. Finally, test #4 matches faults C and D,
and the top combinations become:
P*(A,A,B,C) = (0.495 + 0.01)(0.4999995) + (0.495 + 0.01)(0.000001) = 0.2525002525

P*(A,A,B,D) = 0.2525002525

P*(A,D,B,C) = 0.2525002525

P*(A,D,B,D) = 0.2525002525

m(Θ) = (0.000001)(0.01) = 1×10⁻⁸
Since (A,A,B,D) and (A,D,B,D) are indistinguishable as multiplets, the multiplet (A,B,D) gets
the sum of these probabilities. Since this is the final simple failing test, the redundant multiplet
(A,B,C,D) is eliminated, and the resulting final multiplets and re-normalized probabilities are:
P*(A,B,D) = 0.505000505 / (0.505000505 + 0.2525002525 + 1×10⁻⁸) = 0.666

P*(A,B,C) = 0.2525002525 / (0.7575007675) = 0.333

m(Θ) = 1×10⁻⁸
The same multiplets will be built by the SLAT algorithm, as they are the minimal covering sets
for the observed failing tests. However, the iSTAT algorithm was designed to prefer multiplet (A,B,D)
to multiplet (A,B,C), based on the intuitive notion that there exists more evidential support for fault D
than fault C. The calculations above support this intuition, showing that the Dempster-Shafer method
assigns twice the support to the multiplet containing fault D.
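The following Python sketch mimics these mechanics (it is an illustrative reconstruction, not the actual iSTAT implementation): per-test belief functions are combined, coarsened to complete assignments after each test, merged into multiplets at the end, and redundant multiplets are dropped. Applied to the tests of Figure 4.7 it prefers (A,B,D) to (A,B,C) by roughly two to one, in line with the calculation above.

from collections import defaultdict
from itertools import product

THETA = "?"   # placeholder meaning "no claim about this test"

def combine_test(prior, matching_faults, doubt=0.01):
    # One Dempster combination followed by the coarsening step: keep only
    # complete assignments (one fault per processed test), scored by their
    # normalized upper probabilities, plus the ignorance element.
    share = (1.0 - doubt) / len(matching_faults)
    test_bf = {(f,): share for f in matching_faults}
    test_bf[(THETA,)] = doubt

    joint = defaultdict(float)
    for (a, ma), (b, mb) in product(prior.items(), test_bf.items()):
        joint[a + b] += ma * mb

    def compatible(elem, full):
        return all(e in (THETA, f) for e, f in zip(elem, full))

    width = len(next(iter(joint)))
    ignorance = tuple([THETA] * width)
    coarse = {c: sum(m for e, m in joint.items() if compatible(e, c))
              for c in joint if THETA not in c}
    coarse[ignorance] = joint[ignorance]
    total = sum(coarse.values())
    return {k: v / total for k, v in coarse.items()}

def istat_scores(tests):
    state = {(): 1.0}
    for faults in tests:
        state = combine_test(state, faults)
    # Merge indistinguishable assignments into multiplets (fault sets) and
    # drop redundant (non-minimal) multiplets before the final re-normalization.
    mults = defaultdict(float)
    for assign, m in state.items():
        if THETA not in assign:
            mults[frozenset(assign)] += m
    kept = {m: s for m, s in mults.items()
            if not any(other < m for other in mults)}
    total = sum(kept.values())
    return {tuple(sorted(m)): round(s / total, 3) for m, s in kept.items()}

print(istat_scores([["A"], ["A", "D"], ["B"], ["C", "D"]]))
# -> roughly {('A', 'B', 'D'): 0.667, ('A', 'B', 'C'): 0.333}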
The application of this scoring alone makes the iSTAT algorithm preferable to other per-test
diagnosis algorithms; all such algorithms produce essentially the same candidate faults, but by
assigning a probability score to each candidate set, iSTAT provides much more guidance in selecting
candidates out of what can be large diagnoses. But, there is more information that per-test approaches
usually fail to consider and that can be applied to produce even better final diagnoses.
4.7 Matching Passing Tests
Most STAT-based algorithms completely ignore passing tests, probably because passing tests
don’t fit well with the basic axioms expressed earlier: it is difficult to infer a failure when no failure
has occurred. But, STAT algorithms will suffer a loss of resolution, especially when compared with
traditional non-STAT algorithms, when dealing with some defects.
For example, consider an observed behavior that mimics a classic stuck-at fault. In such a case
(which is surprisingly common, for power or ground shorts, signal-to-signal shorts, and for opens), a
traditional diagnosis algorithm that matches both failing and passing tests will produce either a single
fault candidate, or a list of faults that are behaviorally equivalent under the applied test set. But, a per-test algorithm that ignores passing tests will produce the same equivalence list, plus all fault candidates
whose fault signatures are supersets of the observed behavior. A possible scenario is shown in
Figure 4.9, in which a fault that is difficult to sensitize (a stuck-at 1 on the NAND gate) is dominated
by another fault that is easier to sensitize (a stuck-at 1 on the inverter).
[Figure: a small circuit fragment showing the two fault sites, A-sa-1 on an inverter and B-sa-1 on a NAND gate.]

Figure 4.9. A-sa-1 will likely fail on many more vectors than will B-sa-1.
The difference in behavior between these two stuck-at faults becomes most apparent when
considering the tests that fault B-sa-1 passes but A-sa-1 doesn’t. But, a STAT algorithm, or any
diagnosis algorithm, that ignores passing patterns may not be able to distinguish these faults,
depending upon the tests applied. An even simpler example is a non-controlling stuck-at fault on a
gate input (stuck-at-0 for an OR gate, or stuck-at-1 for an AND gate). Most STAT algorithms will
implicate a stuck-at fault on the output of the gate as strongly as the (preferred) input fault, simply
because the output fault explains all the failing test patterns. In any case, it is especially disappointing
for STAT algorithms not to be able to perform as well as traditional algorithms on such “perfect”, and
common, behaviors as single stuck-at faults.
To remedy this, the iSTAT algorithm must deal with passing tests. The process of matching
passing patterns is very similar to matching simple failing patterns: candidates that predict a pass for a passing test will share in the belief assigned based on that test. These belief values are combined
according to Dempster’s rule of combination, as with the failing tests.
An important difference in dealing with passing tests is that only multiplets (candidate fault sets
that explain all simple failing tests) are used as the focal elements of each probability assignment, not
individual faults. The reason is that passing tests don’t, according to the per-test axioms stated earlier,
provide any evidence for individual faults. Rather, they only imply the lack of fault sensitization or
unmodeled fault behavior.
It is difficult to infer much about the conditional probability of a set of faults given a passing test
result. Obviously, if all of the component faults are predicted to pass on a particular passing test, then
that result provides some evidence in support of that multiplet. If, however, some of the component
faults of a multiplet predict failures for a passing test, it is possible that none of these faults were
activated, or if any such fault was sensitized then none of its failures propagated to observable outputs.
Either condition could occur due to interactions between multiple faults. The likelihood of
interference with either sensitization or propagation is a difficult one to calculate, especially for larger
multiplets.5
It seems reasonable to assume that the likelihood of no sensitization and propagation is
proportional to the number of components in a multiplet that predict a pass for any test. This means that, for each passing test, a multiplet will get an initial score, from a maximum of 1.0 (all faults predict a pass) to a minimum of 0.0 (all faults predict some failure). Then, this initial score is divided by the total score over all multiplets, so that the total belief accorded over all multiplets is equal to 1.0.

5 If per-test IDDQ pass-fail information were available, it would indicate whether a logical pass does indeed indicate the absence of a defect or not. On a test that passes scan tests but fails IDDQ, then, a multiplet that predicts one or more failures would not be subject to a scoring penalty.
Since the evidence provided by any passing test is relatively weak, any inference made from one
is not strong, and so the degree of doubt or ignorance assigned to a passing test should be high. The
iSTAT algorithm uses a value of m(Θ) = 0.5. The belief invested in each multiplet is therefore
adjusted again, by multiplying by 0.5, to re-normalize the total belief to 1.0.
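A small sketch of this passing-test scoring (illustrative only; the multiplets and the prediction data are invented):

def passing_test_masses(multiplets, predicts_fail, doubt=0.5):
    # Belief function induced by one passing test.  Each multiplet's raw
    # score is the fraction of its component faults that predict a pass;
    # the scores are normalized across multiplets and m(Theta) = 0.5 is
    # reserved as doubt.
    raw = {m: sum(1 for f in m if f not in predicts_fail) / len(m)
           for m in multiplets}
    total = sum(raw.values())
    if total == 0.0:
        return {"THETA": 1.0}          # no support: everything stays as doubt
    masses = {m: (1.0 - doubt) * r / total for m, r in raw.items()}
    masses["THETA"] = doubt
    return masses

# Hypothetical case: fault D (but not A, B, or C) predicts failures on this
# passing test, so (A,B,C) receives more of the test's belief than (A,B,D).
print(passing_test_masses([("A", "B", "C"), ("A", "B", "D")], {"D"}))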
4.8 Matching Complex Failures
The SLAT algorithm ignores any failing test pattern that doesn’t match exactly with one or more
candidate faults. If we refer to the easily-matched patterns as simple failing tests, then the question
becomes what to do with the complex failing tests, or tests that don’t match exactly with any stuck-at
fault.
The POIROT algorithm uses a greedy covering algorithm on such failing output sets, using
individual faults to explain subsets of the failing outputs. In an example given in the POIROT paper,
failures occur for the example circuit on outputs 1 to 5. The POIROT algorithm looks for the stuck-at
fault that explains the most outputs (1, 2, and 3 for the first), and then looks for the fault that explains
the remainder (4 and 5). The iSTAT algorithm takes a different approach. First, as with passing
patterns only multiplets are considered when trying to match the failing outputs, and not individual
stuck-at faults. Second, instead of trying to match subsets of the failing outputs, we attempt a much
simpler and more conservative matching process, as explained below.
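For contrast with the iSTAT matching described next, the greedy covering idea can be sketched in a few lines. This is a generic set-cover heuristic in the spirit of the POIROT matching just described, not that tool's actual implementation; the data structures are assumed for illustration.

```python
def greedy_cover(failing_outputs, predicted_fails):
    """Greedily choose stuck-at faults that together explain a set of failing outputs.

    failing_outputs : set of outputs observed to fail on one test
    predicted_fails : dict mapping fault name -> set of outputs it predicts to fail
    """
    remaining, chosen = set(failing_outputs), []
    while remaining:
        # Pick the fault that explains the most still-unexplained outputs.
        best = max(predicted_fails, key=lambda f: len(predicted_fails[f] & remaining))
        covered = predicted_fails[best] & remaining
        if not covered:
            break                      # nothing explains what is left
        chosen.append(best)
        remaining -= covered
    return chosen, remaining           # remaining is non-empty if some outputs stay unexplained
```

On the example above, such a procedure would first choose a fault explaining outputs 1, 2, and 3, and then a fault explaining the remaining outputs 4 and 5.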
Determining which outputs are predicted to fail by a multiplet is not easy, because we have no
way of knowing how the fault effects of the individual fault components will interact for any test
vector. The fault effects of one activated fault could prevent the propagation to some outputs of a
second activated fault. Or, one fault could prevent the sensitization of another fault completely, or
cause another fault to become sensitized that normally would not.
It is not practical to investigate all of the various permutations of these fault interactions for most
multiplets, especially if electrical effects such as drive fights or variable logic thresholds are
involved. So, iSTAT ignores these complications and instead chooses a conservative path of
matching by combining all the failing outputs and then ignoring misprediction (or overprediction) of
the observed failing outputs. For example, if the following faults are contained in the multiplet
(A, B, C):
Fault    Predicted Failing Outputs
A        1, 5, 8
B        2, 5
C        2, 10

Figure 4.10. Example of constructing a set of possibly-failing outputs for a multiplet
The total list of failing outputs for this multiplet is (1, 2, 5, 8, 10). A successful match, then, is a
match with any subset of these outputs, such as (1), (2, 5, 10), (1, 10), and so on. A match with any
subset is considered an “explanation” of the failures, but any non-subset, such as (1, 2, 6) is not. It is
possible that fault interaction could cause such an unexpected propagation and therefore a mismatch,
but iSTAT will tolerate this (assumed small) probability of error if it generally aids in ranking
candidate multiplets.
This matching on complex failing tests results in either a success or a failure for each multiplet on
each test. The degree of belief assigned to each matching multiplet is therefore 1.0 divided by the
number of matching multiplets. As with passing tests, the evidence provided by a complex failing test
is not perfect, and so iSTAT assigns a degree of doubt m(Θ) = 0.1, and the belief assigned to
individual matching multiplets is normalized by multiplying by 0.9.
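A minimal sketch of this conservative matching and the resulting belief assignment follows; it assumes per-fault output predictions from stuck-at simulation, and the helper names are illustrative rather than taken from the actual iSTAT code.

```python
def matches_complex_test(multiplet, predicted_fails, observed_fails):
    """Conservative match: the observed failing outputs must be a subset of the
    union of outputs the multiplet's component faults might cause to fail."""
    possible = set()
    for fault in multiplet:
        possible |= predicted_fails.get(fault, set())
    return observed_fails <= possible

def complex_test_bpa(multiplets, predicted_fails, observed_fails, m_theta=0.1):
    """Belief assignment for one complex failing test: matching multiplets share
    the remaining 0.9 equally, and 0.1 is kept as doubt."""
    matching = [m for m in multiplets
                if matches_complex_test(m, predicted_fails, observed_fails)]
    if not matching:
        return {"Theta": 1.0}          # the test provides no usable evidence
    bpa = {"Theta": m_theta}
    for m in matching:
        bpa[m] = (1.0 - m_theta) / len(matching)
    return bpa
```

Applied to the example of Figure 4.10, the multiplet (A, B, C) would match observed failing outputs such as (1), (2, 5, 10), or (1, 10), but not (1, 2, 6).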
4.9 Size is an Issue
In addition to matching all the simple failing tests, the SLAT paper implicitly introduces another
criterion for judging multiplets, namely by multiplet size: only minimal-sized multiplets are considered
in the final diagnosis. As an example, consider the following set of vectors and matching faults:
Test Number    Matching Faults
1              A
2              A
3              B
4              B, C, D

Figure 4.11. Multiplets (A, B), (A, B, C) and (A, B, D) explain all test results, but (A, B) is smaller and so preferred
A minimally-sized multiplet that covers all of the failing vectors is (A, B). But, it is also possible
to cover the failing vectors with the multiplet (A, B, C) by choosing fault C to explain the failures on
test #4. Similarly, another possible candidate is (A, B, D). Intuitively, (A, B) seems to be the best, and
most likely, candidate due to the evidence for fault B from test #3. There is also the principle of
Occam’s Razor [Tor38], which states “Causes shall not be multiplied beyond necessity”, or more
commonly, “The simplest answer is best”6. The application of Occam’s Razor therefore argues for
choosing multiplets of minimal size.
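As a concrete illustration of the minimal-size criterion, the brute-force search below finds the smallest fault sets that contain at least one matching fault for every simple failing test. It is only an illustration of the criterion, not how SLAT or iSTAT actually construct their multiplets, and the input representation is assumed.

```python
from itertools import combinations

def minimum_multiplets(per_test_matches):
    """Smallest multiplets that supply a matching fault for every simple failing test.

    per_test_matches : list of sets, one per simple failing test, each holding
                       the stuck-at faults that exactly explain that test
    """
    all_faults = sorted(set().union(*per_test_matches))
    for size in range(1, len(all_faults) + 1):
        found = []
        for combo in combinations(all_faults, size):
            combo_set = set(combo)
            # The multiplet must explain (hit) every simple failing test.
            if all(combo_set & test for test in per_test_matches):
                found.append(frozenset(combo))
        if found:
            return found   # stop at the first (smallest) size with a solution
    return []

# For the tests of Figure 4.11 this returns the single minimal multiplet (A, B):
# minimum_multiplets([{"A"}, {"A"}, {"B"}, {"B", "C", "D"}]) -> [frozenset({"A", "B"})]
```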
But, what happens when the scenario is not quite as simple? Consider the next example:
Test Number    Matching Faults
1              A, B
2              A, C
3              A, C
4              A, B

Figure 4.12. The choice of best multiplet is difficult if (A) predicts additional failures but (B, C) does not.
While iSTAT will build and score the multiplets (A) and (B,C), SLAT will only consider the
multiplet (A). At first glance it would appear that multiplet (A) is an obviously better choice than
(B, C). But suppose that fault A is also predicted to fail other tests that don’t fail on the tester, while
faults B and C are only predicted to fail on tests #1 through #4. We would then be faced with the
choice of explaining the behavior with either an intermittent stuck-at fault (A), or a well-behaved pair
of stuck-at faults (B, C). In such a case, a simplistic application of Occam’s Razor may not work to
slice out the best or simplest answer.
6 Or, less commonly, “Nunquam ponenda est pluralitas sine necessitate”.
For the example above, the iSTAT algorithm will assign the following probabilities to each
multiplet through test #4:
P * ( A)  0.499999961
P * ( B, C )  0.499999961
m( )  8 *10 8
If, however, test #5 is a passing test and faults B and C are both predicted to pass while fault A is
predicted to fail, the multiplet probabilities are adjusted to the following:
P * ( A)  0.333
P * ( B, C )  0.666
m()  5 *10 8
The actual values calculated will depend upon the value of m(Θ) assigned for passing tests, which
in turn is determined by the judgement of the algorithm designer or user.
The iSTAT algorithm follows the SLAT convention of rejecting multiplets with redundant faults,
such as (A, B, C) in the example of Figure 4.11. But, by allowing such non-minimal multiplets as
(B, C) in the second example (Figure 4.12), the iSTAT algorithm can consider a wider range of defect
scenarios than can SLAT and many other per-test algorithms.
4.10 Experimental Results – Simulated Faults
This section presents results on some simulated defects in an industrial circuit. These defects
were created by modifying the circuit netlist and simulating the test vectors to obtain faulty behaviors.
Only logical fault simulation was done; in none of the cases was any electrical-level (or SPICE-level)
simulation performed. The idea was to create defects of varying complexity, and of the types that per-test diagnosis algorithms usually target: multiple and intermittent stuck-at faults, wired-logic bridging
faults, and faults clustered on nets and gates.
The iSTAT algorithm was performed on each simulated defect, including all of the matching and
scoring methods described earlier. For each trial, a diagnosis will consist of a set of multiplets. For
each diagnosis, Table 4.1 below reports the type of defect we simulated and the size of the multiplets,
where the size indicates the number of component faults in each multiplet. All SLAT multiplets
contain the same number of faults (by construction); for these experiments, so did all top-ranked
iSTAT multiplets.
The next column reports the number of SLAT multiplets, built according to the SLAT algorithm.
There is one difference, however, between the multiplets described here and those described in the
SLAT paper: These multiplets contain stuck-at faults (describing a circuit node and fault polarity),
while SLAT multiplets consist of only faulty circuit nodes (faults of opposite polarity on the same
node are collapsed into one “location”).
For each diagnosis, we then report the number of top-ranked iSTAT multiplets. This value gives
the number of multiplets that all receive the same top score. A higher number indicates lower
resolution, as the algorithm expresses no preference among these candidates. The comparison of this
number with the number of SLAT multiplets indicates the improvement in resolution over the SLAT
algorithm. The next column reports whether the diagnosis was a success or not, defined as the correct
multiplet receiving the highest score.
For the two-line bridging defects, the result can be a partial (“P”) success if only one node of the
bridge is identified by the faults in a multiplet. A complete success (“Y”) requires that both nodes be
represented by at least one fault in the multiplet. Therefore, it is very unlikely that the diagnosis of a
dominance bridge can be anything but a partial success, because no faults ever originate from the
dominating node. Also, the implication of both nodes of a non-dominance bridging fault is highly
dependent upon the test set. In order for both nodes to appear in a multiplet, the test set will have to
propagate failures from both nodes and put opposite logic values on those nodes during the detecting
tests.
On other defects, a successful diagnosis is expected to identify exactly the faults inserted. So, if
two stuck-at faults were inserted, the correct multiplet should have two faults of the correct polarity.
One exception was defect #13, where the diagnosis was judged a success even though three faults were
inserted while the multiplet indicated four faults, since the fourth implicated fault was the stem of the
net fault.
Overall, both the SLAT algorithm and the iSTAT algorithm produce a correct diagnosis on all
trials. This is a remarkable success rate even for this small trial size and relatively small circuit, given
the complexity of some of the defect behaviors. The number of times that SLAT produced a small
diagnosis was surprising (4 multiplets or fewer on 13 of 20 trials), but in every case iSTAT was able to
match or improve this resolution, in some cases dramatically.
Defect  Simulated Defect                      Size of     No. of SLAT  No. of top-ranked   Success?
No.                                           multiplets  multiplets   iSTAT multiplets
1       Single stuck-at fault                 1           7            4                   Y
2       2 independent stuck-at faults         2           21           8                   Y
3       2 independent stuck-at faults         2           1            1                   Y
4       2 interfering stuck-at faults         2           9            4                   Y
5       3 interfering stuck-at faults         3           2            1                   Y
6       4 stuck-at faults, 3 interfering      4           2            1                   Y
7       Two-line wired-OR bridge              2           2            1                   Y
8       Two-line wired-AND bridge             2           2            1                   Y
9       Two-line wired-AND bridge             2           1            1                   Y
10      Two-line wired-XNOR bridge            3           13           7                   Y
11      Two-line dominance bridge             1           3            1                   P
12      Two-line dominance bridge             1           2            1                   P
13      Net fault (3 branch stuck-at faults)  4           90           1                   Y
14      Net fault (3 branch stuck-at faults)  3           4            1                   Y
15      Gate replacement (OR to AND)          1           1            1                   Y
16      Gate replacement (OR to NOR)          2           11           7                   Y
17      Gate replacement (MUX to NAND)        2           3            2                   Y
18      Gate output inversion                 1           3            1                   Y
19      Multiple logic errors on one gate     1           1            1                   Y
20      Multiple logic errors on one gate     2           27           10                  Y

Table 4.1. Results from scoring and ranking multiplets on some simulated defects.
Another observation is that it was difficult to create gate faults that looked like anything other
than (possibly-intermittent) stuck-at faults on the gate outputs. The inability to create truly “complex”
faulty gate behaviors most likely has much to do with pattern-dependent fault detection, since output
faults on gates can often swamp faults on the gate inputs unless enough tests with the right logic values
are applied. The same is true for the bridging faults where, while the maximum multiplet size is 4
(both faults on both nodes detected), the algorithm mostly produced 1- to 2-fault multiplets.
The circuit used in these experiments is relatively small; future work includes repeating the same
experiments on a larger circuit, with even more complex simulated defects. However, these results do
indicate that per-test algorithms are effective in diagnosing complicated behaviors, and that the iSTAT
algorithm improves upon previous approaches, both in the resolution it provides and in the amount of
information it can apply to the diagnostic problem.
4.11 Experimental Results – FIB Defects
Texas Instruments supplied data to UCSC on some production chips that had been altered by use
of a Focused Ion Beam (FIB) to insert known defects [SaxBal98]. A total of sixteen defects were
inserted (one per chip), including two shorting signal lines to power and ground (to mimic stuck-at
faults) and fourteen shorting two signal lines.
An interesting aspect of this data is that the TI engineers had also used the most popular
commercial diagnosis tool, Mentor Fastscan, to diagnose the same failures. It is widely believed that
Fastscan implements the W&L algorithm, providing a good comparison of the effectiveness of these
two per-test algorithms on real-world circuits. (This same data will be re-visited in Chapter 6.)
Table 4.2 presents the results from these experiments. The first column gives the id used by the
TI engineers to identify each FIB’d circuit. The second column reports the number of nodes identified
by Fastscan in its diagnosis, either two, one, or none; for the fourteen two-line bridging defects, the
best answer is a two-node identification. A Fastscan diagnosis consists of a list of stuck-at faults, and
can be of any length. Unfortunately, the TI engineers did not report the Fastscan diagnosis sizes for
these trials, or in what position the bridged nodes appeared in the list. (Fastscan orders its stuck-at
candidates by the number of failing patterns explained, or matched, by each fault.) The third column
reports the number of nodes identified by iSTAT. For iSTAT, a diagnosis was considered a success if
one of the two nodes was identified in the top-ranked multiplet (of any size). If the other node was
found in any of the top 10 multiplets, then iSTAT was given credit for identifying both nodes. The last
column provides notes about some defects or diagnoses.
In summary, Fastscan was able to identify both nodes of a bridge only 2 out of 14 times, and
for 3 diagnoses was unable to identify either bridged node. By comparison, iSTAT was able to
identify both nodes 5 out of 14 times, with no failures. Plus, iSTAT was able to do at least as well as
Fastscan on all diagnoses, improving on the 3 cases for which Fastscan failed.
The case of FIB7 deserves mention since the number of top-ranked multiplets was quite large.
For this diagnosis, the top 50 multiplets were all of size 1 and were all given the same score. While
both nodes were indeed identified in the top-ranked multiplets, this diagnosis is larger than the usually
accepted standard of 10 candidates for a successful or usable diagnosis. The result seems to indicate
that a large set of equivalent faults has been implicated, but further analysis of this diagnosis will have
to wait until either circuit information or the Fastscan results are available from TI.
ID          Fastscan    iSTAT       Top-ranked       No. of top-ranked
                                    multiplet size   multiplets
FIB-sa1     exact       exact       1                1
FIB-sa2     exact       exact       1                1
FIBx        one node    one node    1                3
FIBy        none        two nodes   2                10
FIB1intra   one node    one node    2                1
FIB2intra   none        one node    1                6
FIB3inter   one node    one node    1                2
FIB4intra   none        two nodes   2                2
FIB4inter   one node    one node    1                1
FIB5intra   one node    one node    1                1
FIB5inter   one node    one node    1                1
FIB6intra   one node    two nodes   2                2
FIB6inter   two nodes   two nodes   2                1
FIB7        two nodes   two nodes   1                50
FIB8        one node    one node    1                1
FIB9        one node    one node    2                4

Notes on individual defects: pattern-dependent dominance bridge that behaves like an intermittent
stuck-at fault on one node; bridge between two inputs in an XOR tree; dominance bridge (three of the
defects); only one node sensitized by the tests; feedback bridging fault.

Table 4.2. Fastscan and iSTAT results on TI FIB experiments: 2 stuck-at faults, 14 bridges.
Chapter 5. Second Stage Fault Diagnosis: Implication of Likely Fault Models
For the initial stage of diagnosis, when the best fault model to apply is unknown and the entire
chip must be considered, a per-test diagnosis algorithm that uses the abstract stuck-at fault model is
ideal: it is flexible enough to deal with intermittent, multiple, and complex fault behaviors, and it is
simple enough to apply to even large netlists. But the diagnoses returned by such algorithms are often
difficult to apply as a guide for physical failure analysis. These diagnoses often consist of a large
number of stuck-at fault candidates, each of which explains only a part of the observed behavior, and
which appear to be wholly unrelated to one another. And, an individual candidate is not actually a
traditional stuck-at fault, which was itself an abstraction, but is rather the further abstraction of a piece
of a stuck-at fault, valid only during certain failing tests. While physical failure analysis has had a
difficult enough time with traditional stuck-at diagnosis, dealing as it does with logical gates and nodes
and not actual physical circuit structures, it has an even more difficult time using the results from
abstract “model-independent” per-test diagnosis algorithms.
5.1 An Old, but Still Valid, Debate
A recurring but often unstated theme throughout much of fault diagnosis research, the debate
identified by Aitken and Maxwell [AitMax95] (introduced in Chapter 2) surfaces again. In this “A-M
debate” (whether for Aitken-Maxwell or algorithms-vs.-models) the main question is whether more
clever and flexible algorithms, or more accurate and sophisticated fault models, will lead to a better
diagnostic result. On the side of better algorithms is mainly the practical argument: applying specific
fault models across an entire circuit, when the actual best fault model to apply is unknown, is not
feasible for most modern circuits. It is better, says the algorithm side, to build abstract diagnosis
algorithms and decompose the behavior into individual tests, in order to come up with the most general
answer possible.
But, answers the better-model crowd, by using already-abstract models and decomposing the
behavior into individual tests, any relation between the test results or faults that could help identify an
actual failure mechanism has been lost. It is better, they say, to retain the idea of how something could
go wrong in a circuit, through the use of a fault model, so that the final answer has some precision and
some utility for failure analysis. Plus, the application of realistic fault models is a better test of actual
defect behavior than the often-simplistic assumptions built into per-test algorithms about fault
interference and propagation.
5.2 Answers and Compromises
The SLAT algorithm makes only a superficial attempt to make its diagnoses more
understandable. The SLAT authors propose the construction of splats, which are sets of (apparently)
equivalent nodes common to all multiplets. But, this does nothing except identify fault equivalencies;
each multiplet must still be investigated as a possible defect scenario. A more ambitious analysis
method, called “SLAT Plus”, was recently proposed [BarBha01]. This method analyzes logic-value
relationships across all nodes of the circuit during observed failures, in an attempt to infer possible
bridging defects. That work, however, is preliminary, and involves a different and more extensive type
of analysis than is proposed here.
The W&L algorithm also makes a simple attempt to interpret its results, by classifying a
diagnosis into one of three categories. The first, called “Class I”, is an exact match with a single stuck-at fault, both for failing and passing tests. The second, or “Class II”, is identified when a single stuck-at fault can explain all the failing tests but not all the passing tests; in other words, an intermittent
stuck-at fault. Finally, “Class III” is reserved for diagnoses that consist of multiple stuck-at faults that
match only inexactly, a category that consists of a wide range of defects.
The POIROT algorithm attempts something of a compromise: it decomposes the matching
operation into single tests, but also applies a set of pre-built signatures for certain fault models in
addition to the stuck-at model. It is, in fact, much like the W&L algorithm with the addition of
bridging fault and net fault candidates. Since it explicitly targets these additional models, it doesn’t
require any interpretation of its results when one of the more specific candidates (bridge or net fault) is
implicated. However, by relying on an a-priori set of candidates, it suffers from the candidate selection
problem, first mentioned in section 2.5.4. There are C(n, 2) = n(n-1)/2 possible two-line bridging faults in a circuit,
where n is the number of signal lines and can be quite large. There are also 2n individual stuck-at
faults, and O(n) open faults. The result is that, with only 3 candidate types considered, the POIROT
algorithm can quickly become infeasible for large circuits.
5.3 Finding Meaning (and Models) in Multiplets
The main problem with the diagnoses returned by most per-test diagnosis schemes is one of
interpretation. The end product of these algorithms can often be a large collection of sets of faults
(multiplets), any of which can be used to explain the observed faulty behavior. If you show even a
single multiplet, consisting of several faults or nodes, to a failure analysis engineer, the likely response
is “But what does this mean?”
The purpose of this chapter is to find a way to discover meaning in multiplets. The idea is to
analyze each multiplet in a diagnosis to determine whether the component faults are in some way
related to one another, or if they appear to be simply a collection of random faults. In the first case, an
algorithm should then be able to infer a defect mechanism; in the second case, either the meaning
escapes (due to unmodeled behavior) or perhaps the circuit behavior really is the result of a collection
of unrelated defects.
But how can candidate faults be related to each other, and a meaning extracted from the observed
behavior? The traditional answer for explaining defective behavior has been the use of fault models.
The stuck-at fault model, various bridging fault models, and the transition fault model are all examples
of using abstractions to simplify what can be complex defect behaviors. These fault models have the
advantage of being relatively easy to understand and (with some translation) identify as part of failure
analysis.
It seems intuitive, then, to interpret multiplets by correlating them with common fault models,
calculating for every multiplet a correlation score for each fault model. A high correlation score
implicates a likely defect scenario for that multiplet. A low correlation score for every candidate
multiplet in a diagnosis indicates either that the defect is not well represented by any of the fault
models, or that the defect consists of multiple unrelated fault instances.
5.4 Plausibility Metrics
To judge this correlation, the most natural scoring, mathematically speaking, is the plausibility of
a match between a multiplet and a fault model, or the upper probability limit that a multiplet represents
an instance of a particular fault model. For each multiplet, the proposed iSTAT analysis algorithm
computes a plausibility score for each fault model, with a maximum score of 1.0 (complete agreement
of faults to defect assumptions) and a minimum score of 0.0 (no agreement). A description of each
fault model considered and the details of the plausibility calculations follow.
A. Single stuck-at/intermittent stuck-at fault
This case is trivial: if the multiplet consists of a single fault candidate, it will be classified as a
stuck-at or intermittent stuck-at fault on a single node. While this is a simple classification, many
defect types mimic intermittent stuck-at faults. Depending upon the test set, bridging faults, gate
faults, open faults and transition faults could all look like stuck-at faults. In the SLAT paper, the
authors found that 37% of the defects they diagnosed looked like stuck-at faults, which is not
inconsistent with this author’s industrial experience of diagnosing actual failures. So, this defect
class is likely to be a catch-all for many defects that aren’t activated multiple times by the test set.
Plausibility: 1.0 if multiplet is size 1; 0.0 otherwise.
B. Node/transition fault
If a multiplet consists of two fault candidates of opposite polarity on the same node, it is
classified as a node fault. The most likely defects for this scenario are a dominance bridging fault,
a gate delay fault, or some open faults.
Plausibility: 1.0 if multiplet is size 2, and faults involve the same node; 0.0 otherwise.
C. Net fault
If examination of the netlist determines that most or all of the component faults of a multiplet
are the branches or stem of a common net, then it can be identified as a net fault. This type of
fault was proposed by the authors of the POIROT system to cover open defects that affect nets
with fanout.
Plausibility: 1.0 if multiplet is size 2 or greater, and all faults are on the same net (including
fanout); if size 3 or greater, the portion of faults on the same net; 0.0 if multiplet is size 1.
D. Gate fault
If we find by examining either the faultlist or the circuit netlist that most or all of the faults in
a multiplet involve a common gate or standard cell, then it will be classified as a gate fault. Some
possible defects that could look like intermittent faults on a gate’s outputs and inputs are transistor
stuck-on or stuck-off, internal shorts, clocking problems, or some other logic error.
Note that since gate faults are a superset of node faults, any multiplet that gets a node fault
score of 1.0 will also get a gate fault score of 1.0. While this classification is slightly redundant, it
does reflect the fact that any defect on a node can also reasonably be attributed to its connected
gates.
Plausibility: 1.0 if multiplet is size 2 or greater, and all faults are on ports of the same gate; if
size 3 or greater, portion of faults on the same gate; 0.0 if multiplet is size 1.
E. Two-line bridging fault
The identification of a two-line bridging fault relies on a multiplet containing faults on two
nodes. Also, due to the nature of two-line shorts, tests that detect faults having opposite polarity
should fail, and tests that detect faults of the same polarity should pass.
Plausibility: if multiplet is size 2, 3, or 4, and all faults are on (exactly) two nodes, then
combine a) portion of common tests for faults of opposite polarity that fail, with b) portion of
common tests for faults of same polarity that pass 7; 0.0 otherwise.
F. Path/path-delay fault
If netlist examination finds that the component faults of a multiplet can be found on a single path
by tracing back from failing outputs, then the defect is classified as a path fault. The as-yet
unproven assumption is that path-delay faults can be identified in this manner.
Plausibility: 1.0 if multiplet is size 2 or greater and all faults exist on a path from an output to
an input; if size 3 or greater, portion of faults on the same path; 0.0 if all faults are on the same
node, gate or net, or if multiplet is size 1.
Note that none of these plausibility calculations mention equivalent faults. If a fault dictionary
was used to perform the initial diagnosis that produced the multiplets, or if for some other reason fault
collapsing was performed by the simulator, then equivalent faults will have to be identified and
considered (usually by expanding the set of multiplets.) In the case of iSTAT, it was designed as a
path-tracing algorithm that does not require fault collapsing, so all equivalent faults are already
identified and are contained in the multiplets.
These plausibility calculations were designed so that the information they require could be
determined during the normal iSTAT algorithm operations of limited path tracing and fault simulation.
For the bridging fault model, these calculations are a subset of those that a normal bridging fault
7 The next chapter explains these conditions in more detail; as metrics for scoring bridging fault candidates, they are referred to
as “required vector” and “restricted vector” scores, respectively [LavTCAD98].
diagnosis algorithm would perform; the same is true for node and net faults. At this stage, however, no
specific fault simulation is done, other than normal STAT-based stuck-at simulation. These
calculations, then, are a sort of “first-order” model-based diagnosis on the multiplet candidates, and the
plausibility numbers express how reasonable it is to pursue a more intensive diagnosis for any fault
model.
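Several of these scores reduce to simple set bookkeeping over the faults in a multiplet. The sketch below assumes lookup tables (node_of, polarity_of, net_of, gate_of) derived from the netlist; the names and the handling of corner cases are illustrative, not the dissertation's implementation.

```python
def fraction_sharing(multiplet, attribute_of):
    """Largest fraction of a multiplet's faults that share one attribute value
    (for example, the same net or the same gate)."""
    counts = {}
    for fault in multiplet:
        key = attribute_of[fault]
        counts[key] = counts.get(key, 0) + 1
    return max(counts.values()) / len(multiplet)

def node_fault_plausibility(multiplet, node_of, polarity_of):
    """1.0 only for a size-2 multiplet with opposite-polarity faults on one node."""
    if len(multiplet) != 2:
        return 0.0
    a, b = tuple(multiplet)
    return 1.0 if node_of[a] == node_of[b] and polarity_of[a] != polarity_of[b] else 0.0

def net_fault_plausibility(multiplet, net_of):
    """1.0 if all faults share a net; for size 3 or more, the shared portion."""
    if len(multiplet) < 2:
        return 0.0
    share = fraction_sharing(multiplet, net_of)
    return share if len(multiplet) >= 3 else (1.0 if share == 1.0 else 0.0)

def gate_fault_plausibility(multiplet, gate_of):
    """Same shape as the net-fault score, but grouping faults by gate."""
    if len(multiplet) < 2:
        return 0.0
    share = fraction_sharing(multiplet, gate_of)
    return share if len(multiplet) >= 3 else (1.0 if share == 1.0 else 0.0)
```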
5.5 Proximity Metrics
The plausibility calculations for the models that involve electrical shorts could be significantly
improved if information about physical proximity of the faults is available. For a traditional stuck-at
fault, the implication is that a signal line is shorted to power or ground; whether this is plausible or not
depends upon the proximity of a supply wire to the signal line. Similarly, the plausibility of a two-line
bridging fault is highly dependent upon the proximity of the two lines, considered along the length of
both wires. This information, however, is not normally used during traditional fault diagnosis, which
usually only works with netlist information and test data, and so it was not included in the calculations
specified above or in the experiments described below.
But the issue of proximity raises an interesting avenue of fault interpretation. Not all correlations
or fault relationships can be expressed by traditional fault models. There are complicated defect
scenarios that affect isolated areas of a die, such as large spot defects, physical damage, or poor
localized implantation [NighVal98]. No current fault model could properly capture such a scenario,
even though a STAT-type diagnosis might implicate faulty circuit nodes in the area of the defect.
An additional type of correlation, then, would be useful for interpreting a multiplet: the physical
proximity of the component faults. This proximity can be calculated from an analysis of the layout or
artwork files, often represented in database form (such as the popular “DEF” format). When faced
with a set of multiplets in a diagnosis, the proximity measure would tell the failure analysis engineer
how localized the faults for each multiplet are in silicon. Given the limits to how much area physical
investigation can reasonably cover, a high physical proximity correlation could very well be the most
valuable information to an FA engineer, more valuable perhaps than any fault model.
Stuck-at faults are usually associated with a port on a logic gate, but fault effects can affect the
wire connected to the port. Defining a fault location by its port, however, is simpler than defining the
location by the area traversed by a wire. Using the (x, y) coordinates of gate ports, a proximity
measure of a set of candidate faults can be determined by using a sum-of-squared-error calculation: the
(Euclidean) distance of each fault from the mean fault location is squared and summed. The result is a
numerical representation of the nearness of the faults to each other, with a smaller number indicating
higher physical proximity.
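A minimal sketch of this sum-of-squared-error measure, assuming each fault has already been mapped to the (x, y) coordinates of its gate port (for example, from a DEF-based layout database):

```python
def proximity(fault_locations):
    """Sum of squared Euclidean distances from each fault to the mean fault
    location; smaller values mean the faults sit closer together in silicon.

    fault_locations : list of (x, y) coordinates of the implicated gate ports
    """
    n = len(fault_locations)
    mean_x = sum(x for x, _ in fault_locations) / n
    mean_y = sum(y for _, y in fault_locations) / n
    return sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in fault_locations)
```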
If wires, and not just gate ports, are taken into account, the calculation of proximity is
considerably more complicated. A single net in a circuit can traverse a large part of the die, and can
include multiple layers of metal, introducing a third (z) dimension into the calculations. Two
calculations, however, would probably be worth the effort: the first would be the size of the bounding
box that contains at least part of each wire. This is the same adjacency calculation performed for two
wires when determining bridging fault likelihood. The second is the size of the bounding box that
contains all of the wire mass. This is even interesting in the case of multiplets with only a single stuck-at fault: given two stuck-at fault candidates, the one that is most interesting to pursue is the one whose
wire covers the smallest physical area of the die, since that area can be more thoroughly inspected. It
is not uncommon for an apparently-ideal diagnosis that consists of a single perfectly-matching stuck-at
fault to be useless, simply because the implicated fault actually covers too great a physical area to
conduct a search for root cause.
Another type of proximity measure that would be interesting for multiplet analysis is logical
proximity, or the number of gates or cells that separate the set of faults in the multiplet. This
information would be easier to calculate than physical proximity, since it can be determined from the
same netlist file used for fault tracing and simulation. Some of this proximity information is captured
in the node, net, and gate fault classes, but some more complicated defects may involve several gates.
In any case, both the logical and physical proximity measures could indicate how related a particular
set of faults in a multiplet are, which may help in limiting the search for root cause to an area of the die
or to an area of functional logic.
5.6 Experimental Results – Multiplet Classification
Table 5.1 gives the results of the multiplet classification technique on the same simulated defects
from Table 4.1. For each simulated defect, the plausibility of the top (correct) multiplet is calculated
vis-à-vis each defect class.
As expected, some of the defects are classified as stuck-at faults simply because the diagnosis
multiplet size is 1. For the bridging and gate faults that are classified as stuck-at, the result is highly
dependent on the test set—if the tests don’t activate the other faults or fault polarities, then these
defects will look like stuck-at faults.
Defect  Simulated Defect                      Single    Node    Net     Gate    2-Line   Path
No.                                           Stuck-at  Fault   Fault   Fault   Bridge   Fault
1       Single stuck-at fault                 1.0       0.0     0.0     0.0     0.0      0.0
2       2 independent stuck-at faults         0.0       0.0     0.0     0.0     0.0      0.0
3       2 independent stuck-at faults         0.0       0.0     0.0     0.0     1.0      0.0
4       2 interfering stuck-at faults         0.0       0.0     0.0     0.0     0.0      0.0
5       3 interfering stuck-at faults         0.0       0.0     0.0     0.0     0.0      0.67
6       4 stuck-at faults, 3 interfering      0.0       0.0     0.0     0.0     0.0      0.75
7       Two-line wired-OR bridge              0.0       0.0     0.0     0.0     1.0      0.0
8       Two-line wired-AND bridge             0.0       0.0     0.0     0.0     1.0      0.0
9       Two-line wired-AND bridge             0.0       0.0     0.0     0.0     1.0      0.0
10      Two-line wired-XNOR bridge            0.0       0.0     0.0     0.0     1.0      0.0
11      Two-line dominance bridge             1.0       0.0     0.0     0.0     0.0      0.0
12      Two-line dominance bridge             1.0       0.0     0.0     0.0     0.0      0.0
13      Net fault (3 branch stuck-at faults)  0.0       0.0     1.0     0.0     0.0      0.0
14      Net fault (3 branch stuck-at faults)  0.0       0.0     1.0     0.0     0.0      0.0
15      Gate replacement (OR to AND)          1.0       0.0     0.0     0.0     0.0      0.0
16      Gate replacement (OR to NOR)          0.0       1.0     0.0     1.0     0.0      0.0
17      Gate replacement (MUX to NAND)        0.0       0.0     0.0     1.0     0.0      0.0
18      Gate output inversion                 1.0       0.0     0.0     0.0     0.0      0.0
19      Multiple logic errors on one gate     1.0       0.0     0.0     0.0     0.0      0.0
20      Multiple logic errors on one gate     0.0       0.0     0.0     1.0     0.0      0.0

Table 5.1. Results from correlating top-ranked multiplets to different fault models.
Generally speaking, a fault that received a 0.0 plausibility score for all defect classes was a case
of multiple unrelated stuck-at faults. It is possible, however, for two simultaneous but unrelated stuck-at faults to get a non-zero bridging fault score, as happened with defect #3. For that defect, the stuck-at
faults in the multiplet are of opposite polarity, and all vectors common to the two fault signatures fail,
so there is nothing in this behavior that is inconsistent with a two-line bridging fault. On the other
hand, the component faults for defects #2 and #4 are of the same polarity, but all common vectors fail,
which is completely inconsistent with a bridging fault. In either case, this analysis can only judge the
consistency of the behavior with a bridging fault; it would take either layout analysis, or a bridging-fault diagnosis algorithm, or both, to judge whether the bridging fault is actually a good explanation for
the behavior.
5.7 Analysis of Multiple Faults
By correlating multiplets to individual fault classes, the above procedure implicitly invokes the
venerable single fault assumption, which is that the observed behavior can be attributed to a single
fault mechanism. But, one of the strengths of per-test approaches such as iSTAT is that they should be
able to implicate the components of multiple simultaneous defects, even if the implications consist of
partial stuck-at faults and are therefore somewhat vague.
The signal for the possible presence of multiple faults is a low plausibility score across all fault
classes. If enough fault classes are applied, this would indicate that the faults in a multiplet don’t
match up well with any single fault scenario and the behavior may be due to multiple faults. There are
several ways, then, to re-analyze the candidates to infer multiple fault groups.
Some of the fault classes define partial correlation scores, and for these classes a non-zero score
might indicate that some of the faults in a multiplet fit the defect scenario. These are the path, net, and
gate fault classes, and if a multiplet gets an imperfect but non-zero score for any of these classes, the
faults that do correlate well can be separated and the rest of the faults re-analyzed to infer the presence
of a second defect.
Another way to infer multiple defects is by applying the proximity measures introduced in the
last section. Groups of individual faults that have high mutual proximity imply a high probability that
they are related in a single defect mechanism. These proximity measures can be used with a simple
clustering algorithm, such as nearest-neighbor [DudHar73], to determine likely groups of faults, which
can then be re-analyzed to correlate with the set of fault classes.
Finally, some of the fault classes have a defined cardinality, or a certain number of expected
individual fault components. These are the two-line bridge fault class and the node class, and any
multiplet that does not contain exactly two faults (node fault) or two, three, or four faults (bridge fault) will
automatically get a plausibility score of 0 for these classes. If multiple defects are involved, however,
the multiplet could contain a viable node or bridge candidate mixed in with other candidate faults.
For these two fault classes, an exhaustive search has to be performed for large multiplet sizes.
The case of node faults is simple: unless two faults of opposite polarity on the same node (e.g. Astuckat-0 and A-stuckat-1) are contained in the multiplet, there is no evidence for a node (or transition)
fault. For bridging faults, given a multiplet of k otherwise-uncorrelated stuck-at faults, there are
C(k, 2) = k(k-1)/2 possible bridging candidates. For most multiplets, this is an easily-handled number: for multiplets of
size 20 it is 190 candidates, for size 50 it is 1,225 candidates, and for size 100 it is 4,950 candidates.
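A sketch of that enumeration, assuming a node_of lookup from each fault to the circuit node it sits on (the names are illustrative, not the actual implementation):

```python
from itertools import combinations

def bridge_candidates(multiplet, node_of):
    """Enumerate the two-line bridge candidates implied by a multiplet of
    otherwise-uncorrelated stuck-at faults: every pair of distinct nodes."""
    nodes = sorted({node_of[fault] for fault in multiplet})
    return [frozenset(pair) for pair in combinations(nodes, 2)]
```

For k distinct nodes this yields k(k-1)/2 candidates, the counts quoted above.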
5.8 The Advantages of (Multiplet) Analysis
Considering the number of bridging faults that can be constructed out of the components of a
multiplet, the advantages of a robust first-pass per-test diagnosis run are obvious. There may be
hundreds of thousands or even millions of signal lines in very large circuits. An exhaustive search of
two-line bridging faults in even a ten thousand-line circuit would involve almost 50 million candidates.
Other techniques that limit the search to, say, 10n candidates [LavChe97] (where n is the number of
signal lines) may still be impractical for large circuits. But when the candidate space is limited to the
number of faults in a set of multiplets, many diagnosis techniques that were quickly running out of
steam on modern circuits suddenly become practical again.
What this points out is that the main advantage of STAT diagnosis is that it solves one of the
major problems of model-based diagnosis: the problem of candidate extraction or selection. By using
a STAT-based diagnosis algorithm that can identify likely fault components, and a way to translate the
components into candidate fault models, model-based algorithms can be applied on much-reduced
candidate spaces and produce much more precise diagnoses.
To revisit the algorithms-vs-models debate, the problem with model-based diagnosis is its lack of
flexibility with regard to unmodeled behavior and its impracticality on large circuits. STAT-based
diagnosis solves both of these problems, at the cost of imprecision and some opacity in its final
diagnostic result. But precision and clarity are exactly the qualities that model-based diagnosis excels
at. The analysis step introduced in this chapter, by overlaying fault models on multiplets, provides a
natural transition from model-independent STAT-based diagnosis to model-based diagnosis, which is
in turn developed and extended in the next chapter.
Chapter 6. Third Stage Fault Diagnosis: Mixed-Model Probabilistic Fault Diagnosis
As I asserted in Chapter 2, one of the principles of fault diagnosis is that the more fault models
that can be applied to a particular diagnostic problem the better. Fault model-based diagnosis provides
a level of precision and clarity of result that abstract model-independent diagnosis algorithms cannot
match. And, the more fault models that are applied, the more assumptions and scenarios are tested
against the defect behavior in the search for the underlying cause.
There have, however, always been two problems with performing diagnosis with multiple fault
models. The first is a practical one, due to the size of modern circuits and the candidate selection
problem mentioned in earlier chapters. It is simply not practical to consider multiple fault models, and
perhaps multiple versions of the same fault model (such as wired-OR and wired-AND bridging fault
models), when the circuit netlist is as large as is common today. The second problem is that it has
always been difficult or impossible to compare the results from two different diagnosis algorithms.
Not all algorithms score their candidates, and even when they do they often use ad-hoc or arbitrary
scoring methods specific to the chosen fault model. So, it can be difficult to compare the results from,
for example, a bridging fault diagnosis algorithm with those from one that targets open faults.
The candidate space problem can be addressed in the manner described in the previous two
chapters, by using an abstract but accurate model-independent algorithm, and then determining the
most promising fault models to apply to further refine the diagnosis. But, the problem remains of how
to compare diagnostic results across multiple fault models. This problem has existed since the first
model-specific or non-stuck-at algorithm was proposed, but has not even been acknowledged until
recently. The only other algorithm to consider multiple fault models, the POIROT algorithm
[VenDru00], used a theoretically suspect application of Occam’s Razor to simply prefer equivalently-scored stuck-at candidates to node candidates, and node candidates to bridging candidates, using the
rationale that stuck-at candidates are “simpler” than node and bridge faults. But, given that a stuck-at
fault is actually just an abstraction, and in its simplest form represents a node shorted to power or
ground, it is unclear whether a stuck-at fault is really a simpler explanation for a particular behavior
than a bridging fault.
This chapter presents a way to solve the model-comparison problem by introducing a
probabilistic framework for fault diagnosis. It is designed to incorporate disparate diagnostic
algorithms, different sets of data, and a mixture of fault models into a single diagnostic result. It will
develop a rigorous approach to incorporating both data and user assumptions into the scoring of fault
candidates. It will also present the results of experiments on an industrial circuit that was physically
modified to insert various defects. But first, it will present a way to include the normally computation-intensive bridging fault model into a practical diagnostic system.
6.1 Drive a Little, Save a Lot: A Short Detour into Inexpensive Bridging Fault Diagnosis
This section gives a brief overview of the subject of bridging fault diagnosis. It presents a method
of constructing relatively accurate bridging fault signatures that is much more cost-effective than
simulation. It also introduces some important concepts for the proposed multiple-fault-model
diagnostic framework.
6.1.1 Stuck with Stuck-at Faults
Fault diagnosis using the stuck-at model has dominated in most industrial settings, largely
because the stuck-at fault model is ubiquitous in testing-related tools. Therefore, a good stuck-at fault
simulator is usually available and in wide use, along with other convenient items such as a fault-list, a
stuck-at test set, and logic fail information from the tester. But the desire to overcome the limitations of
the stuck-at model for diagnosis has motivated a great deal of research into better fault models, better
algorithms, and different approaches to the problem of fault diagnosis.
One of the “better” models is the bridging fault model, which represents the unintentional
shorting of two signal lines [Mei74]. The bridging fault model has gained prominence due to the
increasing circuit area devoted to interconnect in modern circuits, with a commensurate increase in the
rate of shorted interconnect lines. But, it is difficult to accurately model bridging faults, and various
models of increasing sophistication have attempted to capture realistic effects of shorted nodes
[AckMil92, GrePat92, Rot94]. Complications include variable drive strengths [MaxAit93], the
Byzantine Generals Problem [AckMil91, LamSho80], feedback, bridge resistance [MonBru92], and
defect-induced sequential behavior. A bridging fault model that accounts for most or all of these
complications would be too expensive to apply diagnostically to all but the most limited of candidate
spaces.
6.1.2 Composite Bridging Fault Signatures
The prominence of the stuck-at fault model, and the prevalence of bridging defects in CMOS
circuits, have motivated several attempts at using the stuck-at fault model to perform bridging fault
diagnosis. I have previously published an improvement [CheLav95] to one such technique, by
Millman, McCluskey, and Acken (MMA) [MilMcC90]; the improved technique demonstrated
considerable success at diagnosing simulated bridging faults. Like the original technique, my approach
used only stuck-at fault simulation and signatures, but improved on the original technique in three
ways: considering only realistic bridges, incorporating match restriction (flagging some test vectors as
incapable of detecting a particular bridging fault), and incorporating match requirement (flagging some
vectors as dependably detecting a particular bridging fault).
The basic idea behind both the original and the improved technique is that of the composite bridging
fault signature, which is the union of the four single stuck-at signatures associated with the two bridged
nodes. The underlying idea of the composite bridging signature is this: if a bridging fault is detected by
a test, that test will also detect one or more of the four stuck-at faults on the bridged nodes. Therefore,
it is expected that the actual bridging fault signature (the set of detecting test vectors if the bridge
occurs) will be a subset of the vectors found in the bridge's composite signature.
Figure 6.1 illustrates the composite signature of a fault candidate for node X bridged with node
Y; for simplicity, the contents of each of the four component sets can be considered to be the test
vectors, numbered from 1 to n, that detect each respective fault. The figure, then, portrays the set
concatenation of four stuck-at signatures. The black lines in the figure illustrate the concept of match
restriction: if the same test vector (in the figure, the same vector number occupies the same relative
position in each set) detects both X stuck-at 0 and Y stuck-at 0, it by definition tries to set each line
to 1. When both bridged nodes are set to the same value it is highly unlikely that the bridging fault will
be stimulated (no error should result), and the test vector can be marked as restricted in the composite
signature. The same holds true for any test that detects both X stuck-at 1 and Y stuck-at 1.
Figure 6.1. The composite signature of X bridged to Y with match restrictions (in black) and
match requirements (labeled R)
The lines labeled R in Figure 6.1 illustrate the complementary concept to match restriction, called
match requirement: if the same test vector can detect both X stuck-at 0 and Y stuck-at 1 (or vice versa), that test should detect the bridging fault (since it sensitizes and propagates both simple fault
conditions), and it is flagged in the composite signature as a required vector.
The result is a signature for the bridging fault node X bridged to node Y, but notice that only
stuck-at fault signatures (and simulation) were used - no bridging fault modeling or simulation was
required. This is a tremendous practical advantage, as it allows inexpensive but approximate bridging
fault signatures to be created much more cheaply than with almost any bridging simulator, using tools
(a stuck-at simulator or a set of stuck-at fault signatures) that are usually readily available.
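The construction can be sketched directly as set operations on the four stuck-at signatures. Each signature is assumed to be represented as the set of test numbers that detect the corresponding fault; the function and parameter names are illustrative.

```python
def composite_signature(x_sa0, x_sa1, y_sa0, y_sa1):
    """Composite signature for the bridge between nodes X and Y, built from the
    four stuck-at signatures (each a set of detecting test numbers).

    Returns (composite, restricted, required):
      composite  - union of the four stuck-at signatures
      restricted - tests that drive X and Y to the same value, so are unlikely
                   to excite the bridge (flagged, not removed)
      required   - tests that drive X and Y to opposite values and propagate
                   both fault effects, so should detect the bridge
    """
    composite = x_sa0 | x_sa1 | y_sa0 | y_sa1
    restricted = (x_sa0 & y_sa0) | (x_sa1 & y_sa1)   # both nodes set to 1, or both to 0
    required = (x_sa0 & y_sa1) | (x_sa1 & y_sa0)     # opposite values on the two nodes
    return composite, restricted, required
```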
6.1.3 Matching and (Old Style) Scoring with Composite Signature
As the composite signatures are now only approximations to actual bridging fault behaviors, the
matching algorithm that selects candidates for the final diagnosis must allow and expect some
mismatch between the predictions (composite signatures) and the observed behavior (actual failing test
vectors). The original MMA technique only expected that the correct candidate's composite signature
would be a superset of the observed behavior. However, the elimination of the restricted vectors, and
the specification of required vectors, improves the predictions, and provides the matching algorithm a
means for refining its expectations and judging the goodness of each candidate compared to the
observed behavior.
My previous scoring system was lexicographic, in which each candidate was ranked on three
criteria, in descending order of importance. First, as in the original technique, the observed behavior
for a bridging fault is expected to be a subset of the candidate signature, so any nonprediction (errors
seen but not predicted) is very unexpected. Second, some test vectors in each candidate are marked as
required, so we can judge a candidate by how many of its required vectors actually detected the fault.
Third, while some misprediction (errors predicted but not seen) is to be expected with composite
signatures, excessive misprediction indicates a poor match with the observed behavior. The final
scoring, as stated, was lexicographic, with the (smallest) amount of nonprediction having priority,
followed by the number of successful required vector predictions, and finally by the (smallest) amount
of misprediction.
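A sketch of that lexicographic ranking over composite signatures (set representations and names assumed for illustration):

```python
def rank_candidates(candidates, observed_fails):
    """Rank bridge candidates best-first by the three lexicographic criteria.

    candidates     : dict name -> (composite, required) sets of test numbers
    observed_fails : set of test numbers that actually failed on the tester
    """
    def key(name):
        composite, required = candidates[name]
        nonprediction = len(observed_fails - composite)   # failures seen but not predicted
        required_hits = len(required & observed_fails)    # required vectors that did fail
        misprediction = len(composite - observed_fails)   # failures predicted but not seen
        # Least nonprediction first, then most required hits, then least misprediction.
        return (nonprediction, -required_hits, misprediction)
    return sorted(candidates, key=key)
```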
6.1.4 Experimental Results with Composite Bridging Fault Signatures
I then ran experiments to see how well this technique could perform at diagnosing simulated
bridging defects, especially in the presence of noise [LavLar96]. Various amounts of random noise
were added to the simulated bridging signatures, and the technique attempted to identify the correct
bridging fault in a list of 10 candidates. The results were quite successful: even in the presence of
severe noise (causing the deletion of more than half of the original information or the addition of half again
as much spurious information or both), the scoring mechanism was able to successfully extract the
correct candidate from 70% to 95% of the time.
This was a surprisingly good result, especially since no other published diagnostic technique has
attempted to diagnose complex behaviors in the presence of so much noise and unmodeled behavior.
But, underlying all of the unmodeled behavior, there was still a bridging fault behavior to be
unearthed. In these experiments, the defect was known ahead of time to be a bridge, and then bridging
candidates were used to identify it. What about a more realistic scenario, where the form of the defect
is unknown? Can a diagnosis algorithm account for another fault type, and incorporate and distinguish
between varying explanations for the observed faulty behavior? These are exactly the questions I
needed to answer when I set out to transfer this technology to industrial use, performing real-world
diagnoses on actual failing circuits.
6.2 Mixed-model Diagnosis
An ideally robust diagnosis system would have the ability to include an arbitrary number of fault
models, would employ all the models towards diagnosing the faulty circuit, and would report a single
answer that represents the best explanation for the behavior over all candidates. Such a system would
admittedly require more work as more models were added, but it could theoretically cover an arbitrary
range of fault types and behaviors. Such a system is perhaps the ideal, with a model for every
contingency, but in practice the number of models will probably be limited to those considered most
likely or most interesting. For this research, my approach was to build towards a robust diagnosis
system by starting small, with a combination stuck-at fault/bridging fault diagnostic system.
The idea behind such a two-model system is relatively modest. First, I use bridging fault
candidates and (composite) signatures to diagnose actual bridging defects. Second, I use stuck-at
candidates and signatures to diagnose a selected set of other defects: shorts to power or ground and
“charged” opens (disconnected circuit lines that hold a high or low logic value). These defect types
were chosen because they are assumed to be both commonplace and well represented by the stuck-at
fault model. The diagnostic bottom line is: if the behavior looks most like a bridging candidate, score
the bridge highest; if it looks most like a stuck-at candidate, score the stuck-at candidate highest; if
neither happens, give some indication that the behavior is not much like any of the candidates, bridging
or stuck-at.
It should be obvious that, in order for this mixed-model system to work, an improved method of
scoring fault candidates is required that can be applied across fault models. This is not possible with
the previously-described composite bridging fault scoring, as there is direct reference to such bridging-specific items as required and restricted vectors. Some generalization of the concept of candidate
scoring needs to be defined that will work for any fault candidate, regardless of fault model.
6.3 Scoring: Bayes decision theory
Perhaps the most intuitive method of scoring and comparing fault candidates is numeric, and
specifically probabilistic. In other words, what a diagnosis should really calculate is the probability that
the failures seen are due to one fault candidate or another, whether that candidate is a stuck-at fault or
some other fault type. It would follow, then, that the candidate with the highest probability of having
occurred is the most likely suspect.
Applying probabilistic measures to the problem of diagnosis has been recently proposed by a
number of researchers. Sheppard and Simpson have developed a comprehensive approach to system-level diagnosis that they recently proposed for application to traditional fault dictionaries [SheSim96].
Thibeault [Thi97] has developed an approach to IDDQ diagnosis that uses a form of current signatures
and maximum likelihood estimation, comparing measured current levels to predictions of differential
current under a given noise model. And, a method for probabilistically conducting physical failure
analysis has been developed by Henderson and Soden at Sandia National Labs [HenSod97].
The probability of a fault candidate occurrence given an observed faulty behavior can be
expressed literally as p(c|b), where the candidate and behavior are represented by their fault signatures
c and b respectively. An obvious choice for the best candidate is the one with the maximum posterior
probability of all candidates considered:
p(ci | b) ≥ p(cj | b) for all j ≠ i
This is merely the simplest expression of Bayes decision theory, used extensively in the fields of
pattern recognition and classification, and introduced earlier in Chapter 4. The theory states that the
best explanation (or classification) for a phenomenon is the explanation judged to be most likely given
the phenomenon. This is obvious, intuitive, and simple, so of course there's a catch: the probability
measure p(ci | b) is difficult to calculate directly. Fortunately, Bayes rule comes to the rescue:

p(ci | b) = p(ci) p(b | ci) / Σj p(cj) p(b | cj)
The value p(ci) is the a-priori probability of each fault candidate: that is, the probability of a
fault's occurrence over all candidates regardless of fault model. The conditional probability p(b | ci) is
the probability that the behavior seen is the result of the candidate fault occurring. While this
expression may not seem like much of an improvement, the difference now is that, unlike the
probability p(ci | b), both p(ci) and p(b | ci) can be calculated or approximated for each candidate, as
will be explained shortly.
Since the denominator in the above equation is the same for all fault candidates, calculating and
comparing the numerator for each fault candidate gives a numerical ordering across all candidates,
regardless of model.8 Using the probability p(ci) p(b | ci) as a scoring function is the classic Bayes
decision rule, and under some basic assumptions can be proven to give the minimum error rate of any
scoring or decision method.
8
One of the assumptions of Bayes rule is that the candidates are an exhaustive and mutually exclusive set of causes for the
observed phenomenon. This will generally not be true for fault diagnosis, as unmodeled behavior may occur. Therefore, while
the numerator alone still provides an ordering over the fault candidates considered, the denominator does not satisfy the rule of
total probability and the complete ratio will likely be an overestimation of the posterior probability for any fault candidate.
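As an illustration, the following Python sketch ranks fault candidates by the numerator p(ci) p(b | ci) of Bayes rule; the function and data names are hypothetical, and the likelihood values are assumed to have been computed elsewhere, as described in the following sections.

    def rank_by_posterior_numerator(candidates):
        # candidates: iterable of (name, prior, likelihood) tuples, where
        # likelihood is p(b | ci) however it was estimated for that model.
        # The candidate with the largest prior * likelihood product is the
        # most likely explanation of the observed behavior.
        return sorted(candidates, key=lambda c: c[1] * c[2], reverse=True)

    # With equal priors the ordering is decided entirely by p(b | ci):
    ranked = rank_by_posterior_numerator([
        ("bridge A-B",        1.0, 2.4e-7),   # hypothetical bridging candidate
        ("node C stuck-at-1", 1.0, 8.1e-9),   # hypothetical stuck-at candidate
    ])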
The a-priori probabilities p(ci) can be calculated through various means. One method is inductive
fault analysis, which examines the physical layout of the fabricated circuit and calculates probabilities
that various defects will occur [SheMal85, JeeVTS93]. Alternatively, defect sample statistics can be
used, or other estimates based on specifics of the actual circuit. In the absence of such information, the
a-priori probabilities can be approximated as equal for all candidates, implying that all faults,
regardless of model, are equally likely. This is a gross approximation and can obviously affect the
accuracy of the results, but it does allow a diagnosis to proceed if a good estimate of the a-priori
probabilities is not available.
The conditional probability p(b | ci) expresses the probability of the observed behavior resulting
from a particular candidate fault. In other words, it is the probability that the circuit behaves in a
certain way if the fault occurs. Traditional classification of physical phenomena would usually involve
sampling the various candidates and describing the frequencies of their behaviors statistically. This is
obviously not possible for VLSI fault diagnosis. Sufficient samples are simply not available for every
candidate of every fault model: gathering such sample data would take root-cause failure analysis of
thousands of defective chips. And, the statistics so painstakingly determined from any one chip
would likely not apply to any other chip, due to differences in circuit design or manufacturing process.
Instead of samples and statistics, diagnostic scoring will have to rely on probabilistic modeling:
the conditional probability functions will be estimates based upon the information available and the
inherent assumptions in each of the candidate fault models. In other words, candidate fault signatures
will be treated as predictions of actual defect behavior, and the conditional probabilities will be
functions of the estimated rates of prediction error.
The question then is, what are the levels of confidence associated with each type of fault model
used? The answer depends upon the accuracy of the models and predictions, the correlation of each
model to the defects it targets, and, perhaps most importantly, the judgement of the failure analysis
engineer.
6.4 The Probability of Model Error ...
The conditional probability p(b | ci) of a stuck-at candidate should be relatively straightforward to
calculate: it is a function of the expected error rate of the stuck-at simulator that produced that
candidate's signature. In other words, the likelihood that a prediction for a stuck-at fault differs from
the observed behavior when that fault is realized should depend upon the accuracy of the fault
simulator, and to a lesser extent upon other factors such as the reliability of the measurements and the
integrity of the data.
Some definitions and notations will help here. Usually during fault diagnosis, comparisons are
made on a per-test basis between prediction and behavior; a prediction error occurs, for example, when
the chip fails a test that the fault candidate is predicted to pass. (For this and the next section, the
discussion of predictions and behaviors will be limited to pass-fail results only.) The probability of
this is p(chip fails | candidate predicts pass); in the standard notation of diagnosis, a 0 in a fault
signature indicates a passing response and a 1 indicates a failing response, so the above expression
reduces to p(b = 1 | c = 0), or more simply, p(1|0). Now, to continue calculating the required
probabilities, I make a simplifying assumption: namely, that the outcomes (success or failure) of
candidate predictions are independent. While dependencies may exist for some candidate predictions,
the inaccuracies introduced by this assumption of independence are likely to be swamped by inherent
approximations of fault simulation. (The limits of precision are especially obvious in the case of
composite bridging fault signatures.) With this assumption of independence, the full conditional
probability for a candidate can be expressed as
p(b | c) = ∏_{k=1}^{n} p(bk | ck)
where k is the index over all n test vectors, bk is the kth bit of the behavior signature, and ck is the
kth bit of the candidate signature. The value, then, of p(b | ci) for stuck-at candidates should be
relatively easy to calculate: assuming an unbiased simulator with
p(0 | 1) = p(1 | 0) = (1.0 - p(1 | 1)) = (1.0 - p(0 | 0)) = x,
if a value or estimate can be assigned to x (the probability of prediction error, or prediction error rate), then
the score of each candidate can be expressed simply as the product of per-test probabilities. But this of
course begs the question of what a good value for x is, or what the expected rate of prediction error is
for a particular stuck-at simulator. It is possible (but perhaps unrealistic) that this value can, in some
cases, be obtained statistically: perhaps sufficient failure analysis has been performed on a significant
number of stuck-at defects to determine this probability with a high degree of confidence. Lacking this
information, however, an estimate will have to suffice.
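A minimal sketch of this per-test product, assuming a single prediction error rate x for all stuck-at predictions and the signature encoding used above (0 = pass, 1 = fail), might look as follows; the default of 0.01 mirrors the 1% value adopted later in this chapter, the names are illustrative, and the log of the product is accumulated only to avoid numerical underflow on long test sets.

    import math

    def stuck_at_log_likelihood(candidate_sig, observed_sig, x=0.01):
        # Per-test product p(b | c) under the independence assumption, with
        # p(0|1) = p(1|0) = x and p(0|0) = p(1|1) = 1 - x for every test.
        log_p = 0.0
        for c_k, b_k in zip(candidate_sig, observed_sig):
            log_p += math.log(1.0 - x) if c_k == b_k else math.log(x)
        return log_p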
6.5 ... Vs. Acceptance Criteria
It is widely accepted that the stuck-at fault has no single direct analog in the realm of silicon
defects. Its closest manifestation would probably be a circuit node shorted to power or ground. If such
shorts are the only defects targeted diagnostically with the stuck-at model, then the error rates for
stuck-at predictions should be quite low, as a good correlation of defect behavior to simulation should
occur. But if the stuck-at candidates are meant to target a wider range of defects, with less direct
correlation to classical stuck-at faults, then higher error rates will have to be expected.
Regardless of the value chosen, the role of operator choice points out that the process of scoring
fault candidates is largely an arbitrary one, in which the assignment of probabilities is really a matter of
establishing acceptance criteria for the various fault models used. If a low stuck-at prediction error rate
is used, then a stuck-at candidate with large error will be assigned a low conditional probability;
compared with a candidate of another model with the same number of errors but a higher error rate, the
stuck-at candidate will be scored lower.
Assigning an error rate for fault models has been done implicitly by almost every traditional fault
diagnosis algorithm, and by failure analysis engineers who must reconcile the results from different
diagnosis tools. Some algorithms do not tolerate any prediction error; they implicitly assign a zero
probability of error and reject any imperfect candidate. Others expect much more error in one
direction than another (such as expecting misprediction but not non-prediction) and express this as
weighted or lexicographic ratings. And, if an engineer uses multiple fault diagnosis algorithms and the
top candidate reported by a stuck-at diagnosis tool, say, misses the same number of fault predictions as
the top candidate from a bridging diagnosis program, then it is human judgement that decides how
much error to tolerate in each model and therefore which candidate to prefer.
The point is subtle, but it is important enough to bear repeating: every diagnosis algorithm sets its
own acceptable error rates, although almost all do it implicitly. The simplest algorithm that only
accepts exact matches sets a probability of model error to 0, as do “model-independent” path-tracing
algorithms that use strict fault propagation and sensitization conditions. An algorithm that awards one
point to a fault candidate for correctly predicting or matching a failure and one point for predicting a
passing test is applying a uniform probability to model misprediction and non-prediction. And an
algorithm that applies lexicographic scoring or uses Occam’s Razor to prefer one type of candidate to
another is simply weighting types of predictions or candidates by dominating factors. If this thesis
makes one contribution to the state of the art in fault diagnosis, it should be the identification of this
principle:
All fault diagnosis is probabilistic, and the underlying probabilities are
almost all epistemic, or based on human judgement.
By adopting Bayes decision theory for fault diagnosis, then, I am arguing for making these
implicit judgements explicit. Explicit parameters have the great advantages of transparency and
mutability: the assumptions built into an algorithm are not hidden but declared, and they can then be
adjusted to different diagnostic conditions or updated upon new information. The assignment of error
rates to each fault model and its predictions is equivalent to stating acceptance criteria for each type of
candidate employed. In the case of the proposed two-model diagnosis system, the algorithm will
obviously have to accept or tolerate more prediction error with composite bridging signatures than with
stuck-at candidates. In the spirit of full disclosure, specifying these usually-implicit values is intended
to codify the assumptions and knowledge about the various fault models into a single diagnosis tool where they can be examined and updated as necessary.
6.6 Stuck-at scoring
For this research no statistical information was available about the behavior of stuck-at defects in
actual manufactured circuits. Therefore, the approach taken for candidate scoring necessarily involved
an arbitrary assignment of per-vector prediction error for stuck-at candidates.
In general, fault diagnosis will be more effective and accurate if it targets more specific fault
types and ties the models more directly to the defects targeted. It will be more effective because the
increased precision greatly facilitates the subsequent work of physical failure analysis, and it will be
more accurate because the fault predictions will be more accurate and therefore easier to match to the
associated defects. This point was made in the “algorithms vs. models” paper by Aitken and Maxwell
[AitMax95], already mentioned in earlier chapters; the authors’ argument was that diagnosis is most
successful (both most accurate and precise) when a fault model is used to target only defects that it best
represents. The implication of this idea is that the expected error rate for stuck-at candidates should be
set relatively low. This philosophy argues for a relatively tight link between the stuck-at predictions
used and the defects targeted for diagnosis. To this end, an expected prediction error rate of 1% was
arbitrarily chosen for stuck-at fault candidates in the presented diagnosis system. Viewed as an
acceptance criterion, the implication is that any stuck-at candidate that matches less than 99% of the
observed behavior should be considered a poor match. The value of 1% is somewhat arbitrary, but is
based on limited industrial experience with power or ground shorts and opens, the two defect types
explicitly targeted with the stuck-at candidates.
6.7 0th-Order Bridging Fault Scoring
Since the assignment of an error rate for stuck-at candidates is somewhat arbitrary, the value of
an error rate for bridging candidates is similarly arbitrary. It is the value of the bridging error rate
relative to the stuck-at error rate that will determine the selection of bridging or stuck-at candidates for
any particular diagnosis.
As detailed previously, the composite bridging fault signatures used in this system are only
approximations to the expected behaviors, and a significant amount of prediction error is anticipated.
Accordingly, the error rate assigned for bridging fault candidates should be significantly higher than
that assigned to stuck-at candidates. For our purposes a significant difference will be at least an order
of magnitude, so a 0th-order estimate for the bridging candidate error rate, given the stuck-at rate
specified above, would be 10%. While this is admittedly a gross estimate, it is not far from the value
seen in our previous experience with composite signatures vis-a-vis simulated bridging fault behavior
[LavTCAD98].
6.8 1st-Order Bridging Fault Scoring
A better estimate for the bridging fault candidate error rate can be obtained by looking at the
components of the composite signature described earlier. Doing so points out that different per-vector
predictions in a composite signature have very different expected errors. As the name implies, a
required vector prediction is expected to be wrong very infrequently; similarly, a restricted vector should
produce a passing result nearly all of the time. Also, misprediction is significantly more probable than
nonprediction. Given these factors, one might reasonably assign individual error rates to the various
types of composite predictions, again relative to the stuck-at error rate previously assigned: 10% for
non-required vectors, 1% for nonprediction and required vectors (since they rely on stuck-at
assumptions), and 0.1% for restricted vectors. These values are consistent with those I have seen over
thousands of simulated bridging-fault diagnoses, and provide a somewhat more accurate basis for
discrimination than the simplistic 0th-order estimate given above.
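A sketch of this first-order scoring, with one error rate per prediction type, follows; the rates are the ones just quoted, while the prediction labels and names are illustrative rather than any particular tool's interface.

    import math

    FIRST_ORDER_ERROR_RATE = {
        "fail":       0.10,    # ordinary (non-required) failing prediction
        "required":   0.01,    # required vector
        "pass":       0.01,    # nonprediction (predicted pass)
        "restricted": 0.001,   # restricted vector
    }

    def bridging_log_likelihood(predictions, observed_sig):
        # predictions: per-test labels from the composite signature;
        # observed_sig: per-test observed results, 0 = pass, 1 = fail.
        log_p = 0.0
        for label, b_k in zip(predictions, observed_sig):
            err = FIRST_ORDER_ERROR_RATE[label]
            predicted_fail = label in ("fail", "required")
            matched = (b_k == 1) == predicted_fail
            log_p += math.log(1.0 - err) if matched else math.log(err)
        return log_p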
6.9 2nd-Order Bridging Fault Scoring
It is possible to refine the estimates of prediction error further by examining the possible causes
for the actual behavior to diverge from prediction. While this approaches the complicated topic of
bridging fault modeling (the avoidance of which was the basis of the composite signature idea), the
salient factors affecting composite bridging signatures can be identified relatively easily. They are
summarized in Table 6.1.
Probability   Description
p(sv)         Probability that a test puts the same logic value on both bridged nodes.
p(hr)         Probability that high resistance of the short prevents (per-vector) any fault effect, regardless of gate type or topology.
p(wf)         Probability that one node wins a drive fight and asserts a definite (faulty) logic value on the other node, but the corresponding stuck-at fault is not detected, causing no fault effect to result from the bridge (non-required vector only).
p(bg)         Probability that a Byzantine Generals situation and re-convergent fanout downstream from the bridge invalidate a pass/fail prediction.
p(fb)         Probability that fault-induced feedback invalidates a pass/fail prediction.

Assumptions:
  The events {sv, hr} are independent.
  The events {sv, wf, bg} are mutually exclusive, as are {hr, wf, bg}.
  The events {fb, hr, wf} are mutually exclusive.
  The events {fb, bg} are approximated as independent.
  The event fb is dependent on sv: p(fb) = p(fb|sv) + p(fb|¬sv).

Table 6.1. Set of likely effects that can invalidate composite bridging fault predictions.
With a little bit of thought, the relevant conditional error probabilities can be expressed as:
p(0 | 1) = p(sv) + p(hr) - p(sv) p(hr) + p(bg) + (1 - p(bg)) (p(fb|sv) + p(fb|¬sv)) + p(wf)
p(1 | 0) = p(bg) + (1 - p(bg)) (p(fb|sv) + p(fb|¬sv))
p(1 | 0*) = p(fb|sv)
p(0 | 1*) = p(hr) + p(bg) + (1 - p(bg)) p(fb|¬sv)
In these equations, p(1|0*) refers to the restricted vector error rate and p(0|1*) refers to the required vector error rate. While this degree of decomposition requires more calculation, it does offer
certain benefits over the simpler 1st-order approximations. First, some of the probabilities are easy to
estimate: p(sv) can be approximated as 0.5, and p(wf) as 0.25. Second, simulator and netlist
information can provide accurate values for p(sv), p(wf), and p(fb) on a per-candidate basis. But,
values for such probabilities as p(hr) and p(bg) would require either extensive bridging fault
characterization, or the assignment of estimates as described earlier (most likely relative to the stuck-at
error rates). Given the philosophy of an inexpensive diagnosis system based on stuck-at simulation
only, I have decided that the most practical and consistent approach is to use order-of-magnitude
estimates for these values. Note, however, that the imposition of a probabilistic framework allows
values for these parameters to be used should they be available.
6.10 Expressing Uncertainty with Dempster-Shafer
The Dempster-Shafer theory of evidence presented in Chapter 4 can also be applied to the
mixed-model scoring described in this chapter. The conditional probabilities just presented can be
used as the degrees of belief for candidates of each fault model, and an uncertainty value can be added
to the belief assignment over all candidates for each test vector.
The per-test uncertainty value, however, would be calculated differently from the situation
presented in the first-pass algorithm of Chapter 4, in which the evidence provided by certain test
results is considered to be much stronger than for other tests. In the case of the mixed-model
algorithm, assumptions about different test results are built into the conditional probabilities
themselves, as with the case for restricted and required vectors for bridging candidates. In this case,
the per-test uncertainty would be a function of the conditional probabilities of all candidates; in other
words, a test result for which all candidates expressed a conditional probability of 0.5 would result in
maximum uncertainty.
Where the Dempster-Shafer method can really add value to the mixed-model algorithm is in its
final calculation of the weight of conflict of all evidence, which is determined by the final value of
m(Θ). As a final measure of the total uncertainty of the probability assignments, it can provide a
valuable confidence value for the diagnosis as a whole. This can be especially important as the
algorithm applies fewer or more specific fault models, since it can express how well the observed
behavior matches the expectations of the models applied. If the confidence level is low, then, an
engineer can decide to try re-running the algorithm with a different set of models or assumptions.
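For concreteness, a minimal sketch of Dempster's rule of combination over a two-candidate frame is given below; the candidate names and per-test mass values are invented for illustration. The mass remaining on the whole frame after combination is the overall-uncertainty measure discussed above, and the normalization constant tracks the weight of conflict.

    def combine(m1, m2):
        # Dempster's rule for two mass functions keyed by frozensets of candidates.
        combined = {}
        conflict = 0.0
        for a, mass_a in m1.items():
            for b, mass_b in m2.items():
                inter = a & b
                if inter:
                    combined[inter] = combined.get(inter, 0.0) + mass_a * mass_b
                else:
                    conflict += mass_a * mass_b
        return {s: v / (1.0 - conflict) for s, v in combined.items()}, conflict

    THETA = frozenset({"bridge A-B", "node C stuck-at-1"})   # hypothetical candidates
    test1 = {frozenset({"bridge A-B"}): 0.6, frozenset({"node C stuck-at-1"}): 0.2, THETA: 0.2}
    test2 = {frozenset({"bridge A-B"}): 0.5, frozenset({"node C stuck-at-1"}): 0.3, THETA: 0.2}
    beliefs, conflict = combine(test1, test2)   # beliefs[THETA] is the residual uncertainty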
The Dempster-Shafer method also promises more flexibility than a traditional application of
Bayes rule since it can calculate the posterior probability of combinations of faults, which would allow
an explicit scoring of multiple simultaneous faults. This application is questionable, however, for two
reasons. First, the computation of all possible fault combinations for large circuits would be infeasible.
Second, the per-test conditional probabilities of fault candidates are not independent unless the fault
effects themselves are completely independent.
The current version of the mixed-model algorithm does not implement the Dempster-Shafer
method, largely due to practical computational issues – since these experiments were run on whole
circuits, the scoring algorithm had to be very simple. But, the application of the iSTAT and analysis
algorithms described in earlier chapters should reduce the candidate space for future experiments, and
allow the application of a more interesting scoring algorithm. Certainly, the promise of adding an
explicit confidence measure is compelling, and so implementing Dempster-Shafer scoring in the
mixed-model algorithm is a subject of near-term future work. Elements of this work include defining
the proper per-test uncertainty function, and perhaps including a way to consider small-sized
combinations of faults (and judge or estimate their independence) to enable multiple fault diagnosis.
6.11 Experimental results – Hewlett-Packard ASIC
In order to evaluate the diagnosis approach just described, I implemented the technique and
performed several diagnosis experiments on a production industrial circuit. The experiments were
performed at Hewlett-Packard, with their support and equipment; the circuit used was a Hewlett-Packard ASIC. Defects were inserted into the circuits using a focused ion beam (FIB). (Knowing the
exact form and location of a defect is obviously very useful for validation [Ait95]; diagnosis of failing
production chips is an obvious next step.) The circuit was built with a 0.5-micron process, and its
ATPG model had approximately 150,000 gates.
There were three rounds of experimentation. In the first, the FIB engineer connected arbitrary
signal lines to either power or ground in order to mimic stuck-at behavior. In the second round, he
joined neighboring signal lines in order to represent bridging faults; in the third round he broke signal
lines in order to simulate open defects.
The diagnosis experiments were performed despite several practical challenges. First, only pass-fail signatures were readily available, so no information about failing outputs was used. Second, no
realistic bridging fault candidate list was available, so the diagnosis program had to consider all
bridges to be possible. Third, no gate descriptions or simulator information was available for
refinement of the p(wf), p(sv), or p(fb) estimates used for composite bridging scoring. Fourth, no
statistical analysis of fault or defect frequencies (such as IFA) was performed, so a uniform prior was
used for the Bayesian scoring of candidate faults. It is assumed that the addition of any or all of these
missing elements would improve the accuracy and resolution of the resulting diagnoses.
It is also important to reiterate that no information or tool was used for diagnosis other than a
stuck-at faultlist, a pass-fail dictionary (from a stuck-at fault simulator), and a list of failing vectors for
each faulty circuit. Also, for all experiments in this chapter, the first- and second-pass diagnosis
algorithms described in earlier chapters were not yet available to reduce the candidate faultlist.
Therefore, these experiments represent a worst-case situation, in which the model-specific algorithm
must run on the entire circuit. These circuits, while large, contain at most a few hundred thousand
stuck-at faults, and so most likely represent the last generation of industrial circuits for which such an
approach is feasible.
The diagnosis program requires estimates of prediction error, or sources of error, for bridging and
stuck-at fault candidates. The initial assignment for bridging faults was
p(sv) = 0.5
p(wf) = 0.25
p(hr) = p(bg) = p(fb) = 0.01
p(fb|¬sv) = 100 · p(fb|sv)
The resulting bridging fault probabilities of error were
p(0 | 1) = 0.78
p(1 | 0) = 0.02
p(1 | 0*) = 0.0001
p(0 | 1*) = 0.03
For stuck-at faults, p(0|1) = p(1|0) = 0.01.
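As a cross-check, the bridging error rates above follow from the second-order expressions of Section 6.9, with p(fb) split between p(fb|sv) and p(fb|¬sv) as assumed in Table 6.1; the short Python sketch below reproduces them (variable names are illustrative).

    p_sv, p_wf = 0.5, 0.25
    p_hr = p_bg = p_fb = 0.01
    p_fb_sv = p_fb / 101.0          # p(fb|sv), with p(fb|~sv) = 100 * p(fb|sv)
    p_fb_nsv = 100.0 * p_fb_sv      # p(fb|~sv)

    p_0_1      = (p_sv + p_hr - p_sv * p_hr + p_bg
                  + (1 - p_bg) * (p_fb_sv + p_fb_nsv) + p_wf)   # ~0.78
    p_1_0      = p_bg + (1 - p_bg) * (p_fb_sv + p_fb_nsv)       # ~0.02
    p_1_0_star = p_fb_sv                                        # ~0.0001
    p_0_1_star = p_hr + p_bg + (1 - p_bg) * p_fb_nsv            # ~0.03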
In most cases, these estimates were probably pessimistic. The experiments were designed to see
if the proposed algorithm could 1) distinguish between stuck-at and bridging defects, and 2) correctly
identify the nodes involved in the defect. Another goal was to determine how open defects would be
diagnosed in this system, and whether suspicions about their similarity to stuck-at behaviors are
justified. The results are given in Tables 6.2, 6.3, and 6.4.
Each of the three tables presents results from a round of experiments, for stuck-at, bridging, and
open defects respectively. Each row in a table is an individual diagnosis experiment on a single
defective circuit. The first column of each row gives the defect number. The second column (Top
Candidate) indicates which candidate the diagnosis algorithm scored highest. More than one candidate
can get the same top score, so the third column (Num. Tied for First) reports the number of top-scoring
candidates. The fourth column (Classification) classifies each diagnosis, and the last column (Notes)
gives a short qualitative description or details on each diagnosis. In the tables, candidates are
described by their model and quality of match to the actual inserted defect. The two candidate models
are bf for bridging fault and sa for stuck-at fault. Three grades of match between candidate and the
actual defect are specified. An exact match exactly identifies the single node or pair of nodes involved
in the defect. A partial match either identifies one out of two bridged nodes (for a stuck-at candidate),
or pairs a stuck-at or open node with another unrelated node (for a bridging candidate). A misleading
match does not correctly identify any faulted nodes, although the table indicates if an apparently
unrelated node is logically near (within two simple logic gates up or downstream from) the fault site.
To illustrate, a stuck-at candidate that identifies one of a pair of shorted nodes is considered
sa-partial. A bridging candidate that pairs the correct stuck-at node with another is bf-partial, as is a
bridging candidate that only correctly identifies one of a pair of shorted nodes. For open defects, either
stuck-at fault on the open nodes is considered sa-exact.
Defect   Top Candidate   Num. Tied for First   Classification    Notes
1.1      bf-partial      2                     Partial success   Significantly (17%) non-stuck-at behavior; 8 of top 10 candidates are bf-partial
1.2      sa-exact        1                     Success           Next 100 candidates are bf-partial
1.3      sa-exact        > 300                 Ambiguous
1.4      sa-exact        9                     Success
1.5      sa-exact        3                     Success           Other 2 top candidates are near
1.6      sa-exact        > 100                 Ambiguous         Next 100 candidates all bf-partial, all same score
1.7      sa-exact        1                     Success           Next 100 candidates all bf-partial, all same score
1.8      sa-exact        1                     Success
1.9      sa-exact        2                     Success           Other top candidate is near; next 100 candidates are bf-partial, all same score
1.10     sa-exact        3                     Success           Other top candidates are near; next 100 candidates are bf-partial, all same score
1.11     sa-exact        1                     Success           Next 100 candidates all bf-partial, all same score
1.12     sa-exact        2                     Success           Other top candidate is near; next 100 candidates are bf-partial, all same score

Table 6.2. Diagnosis results for round 1 of the experiments: twelve stuck-at faults.
There are four diagnosis classifications for each experiment. A success indicates that an exact
match is found in the top 10 candidates. A partial success indicates that at least a partial match, but no
exact match, is contained in the top 10. A diagnosis is a failure if no exact or partial matches rank in
the top 10. In any event, a diagnosis is considered ambiguous if the top 10 or more candidates all
receive the same score. An ambiguous diagnosis indicates that more information (such as failing
outputs, for example) is needed to distinguish between highly-ranked candidates.
A point of detail: The last three bridging defects, defects 2.7 to 2.9, all followed the same
scenario, and are considered only partial successes. In all three cases, the FIB bridged the outputs of
two inverters, one having a much stronger drive strength than the other. The result in such cases is that
fault effects only initiate from the weaker node, the stronger node never being overdriven. All three of
the diagnoses reflect this situation: in all three cases, the top candidate is the weaker of the two inverter
outputs stuck-at either 1 or 0. Without the dominating node ever being the source of error effects, it is
doubtful whether another algorithm could do better looking only at the logic failures of the circuit;
perhaps IDDQ diagnosis has a chance of identifying this type of defect.
Defect   Top Candidate   Num. Tied for First   Classification    Notes
2.1      bf-exact        1                     Success           Defect is a feedback bridging fault
2.2      bf-exact        1                     Success           Next 9 candidates are bf-partial
2.3      bf-exact        1                     Success           Second candidate is sa-partial
2.4      bf-partial      1                     Success           Second candidate is bf-exact; other node in top candidate is near
2.5      bf-exact        1                     Success           Next three candidates are bf-partial
2.6      bf-partial      1                     Success           Second candidate is bf-partial, third candidate is bf-exact
2.7      sa-partial      1                     Partial success   Dominated node: see text
2.8      sa-partial      1                     Partial success   Dominated node: see text
2.9      sa-partial      1                     Partial success   Dominated node: see text

Table 6.3. Diagnosis results for round 2 of the experiments: nine bridging faults.
Defect   Top Candidate    Num. Tied for First   Classification   Notes
3.1      sa-exact         3                     Success          Behavior identical to node stuck-at 1
3.2      sa-exact         2                     Success          Behavior identical to node stuck-at 1
3.3      sa-exact         2                     Success          Behavior identical to node stuck-at 1
3.4      sa-misleading    2                     Failure          Significantly (21%) non-stuck-at behavior; 14th candidate is bf-partial

Table 6.4. Diagnosis results for round 3 of the experiments: four open faults.
The results indicate that the approach works quite well at accurately diagnosing and
distinguishing a mixture of fault types. The one failed diagnosis occurred on the last open defect,
when the behavior was significantly (21%) different than the signature for the node stuck-at 1.
Whether this behavior is typical or not is an area of further research; answering this question may lead
to a refinement of the acceptance criteria for stuck-at candidates, or possibly the addition of another
fault model specifically for open defects.
6.12 Experimental results – Texas Instruments ASIC
I supplied the mixed-model diagnosis software to a team of engineers at Texas Instruments, who
independently performed another round of similar experiments [SaxBal98]. As with the Hewlett-Packard experiment, samples of a production ASIC were modified by inserting defects with a focused
ion beam. A total of sixteen diagnoses were performed, the first two on signal lines shorted to power
and ground, and the next fourteen on signal-to-signal shorts.
An interesting aspect of this experiment is that the TI engineers also ran the most widely used
commercial diagnosis tool, Mentor Fastscan, on the same failures. It is widely believed that Fastscan
implements the W&L algorithm described in Chapters 2 and 4. This provides a useful comparison of
the effectiveness of these two algorithms on some interesting defects in real-world circuits.
Table 6.5 presents the results from these experiments. The first column gives the id used by the
TI engineers to identify each FIB’d circuit. The second column reports the number of nodes identified
by Fastscan in its diagnosis, either two, one, or none. A Fastscan diagnosis consists of a list of stuck-at
faults, and can be of any length. Unfortunately, the TI engineers did not report the Fastscan diagnosis
sizes for these trials, or in what position the bridged nodes appeared in the list. (Fastscan orders its
stuck-at candidates by the number of failing patterns explained, or matched, by each fault.) The third
column gives the results from the mixed-model probabilistic algorithm, using the same match types
defined in the last section. The last column provides notes about some defects or diagnoses.
With the exception of FIB7, the diagnoses returned by the mixed-model algorithm are superior to
that of the commercial tool. In eight out of fourteen bridging faults, the mixed-model algorithm gave a
better result: in seven cases it identified both nodes when Fastscan could only identify one or neither
node, and in the other case it identified one node when Fastscan implicated none. In almost all cases
the mixed-model algorithm did a good job of differentiating stuck-at vs. bridging fault behaviors,
something Fastscan cannot do. And, the TI engineers noted that the bf-partial diagnoses could likely
have been improved with a better test set, as most of these defects involved pattern-dependent
sensitization.
ID          Fastscan    Mixed-model algorithm   Notes
FIB-sa1     exact       sa-exact
FIB-sa2     exact       sa-exact
FIBx        one node    sa-partial              Pattern-dependent dominance bridge; behaves like intermittent stuck-at fault on one node
FIBy        none        bf-partial
FIB1intra   one node    bf-exact
FIB2intra   none        sa-misleading           Bridge between two inputs in XOR tree
FIB3inter   one node    bf-exact                Dominance bridge
FIB4intra   none        bf-exact
FIB4inter   one node    bf-exact
FIB5intra   one node    bf-partial              Only one node sensitized by tests
FIB5inter   one node    bf-exact                Dominance bridge
FIB6intra   one node    bf-exact
FIB6inter   two nodes   bf-exact
FIB7        two nodes   bf-partial              Feedback bridging fault
FIB8        one node    bf-partial              Dominance bridge
FIB9        one node    bf-exact

Table 6.5. Diagnosis results for TI FIB experiments: 2 stuck-at faults, 14 bridges.
6.13 Conclusion
This chapter describes an approach to model-based fault diagnosis built around a probabilistic
evaluation of a set of fault candidates given all the available data about a failing circuit. The
introduction of probability as a common measure of diagnostic inference allows different algorithms to
process different sets of data, using different sets and types of candidates, to produce a single
diagnostic result.
Chapter 7. IDDQ Fault Diagnosis
The mainstream of VLSI fault diagnosis has been concerned with logic failures at circuit outputs
or internal scan elements, as has this thesis up to this point. The reasons for this are many, but perhaps
most important is that in the field of test, logic-related fault models are dominant – especially the
single stuck-at fault model. The emergence of the IDDQ fault model, in which the presence of a defect
causes an abnormally high amount of current to flow in the circuit in a normally quiescent or static
state, has spurred interest in using this fault model for fault diagnosis.
There are several apparent advantages to performing diagnosis with IDDQ fault models and information. First, many chips submitted to failure analysis do not have hard logic fails, but may fail only IDDQ tests. Second, the IDDQ fault model has the advantage of high observability: unlike the logic-level fault models, the effect of the fault does not have to propagate through many levels of logic but only to the point of current monitoring. Therefore, IDDQ diagnosis can differentiate between defects
that are indistinguishable with other logic-level fault models, and the confounding effects of
indeterminate propagation that plague other diagnosis techniques are generally eliminated [AckMil92].
Third, and perhaps most important, considering IDDQ fault models in diagnosis adds another source of
information, generally orthogonal to traditional diagnosis results, that can be used to refine or add
confidence to an existing logic-based diagnosis.
On the other hand, IDDQ diagnosis presents its own challenge of ambiguity. Rather than the
simple pass-fail results of a logic-based test, IDDQ diagnosis algorithms must interpret the results of a
current measurement (of perhaps questionable precision and accuracy) as either a passably low current
value or a defectively high current value. As Nigh, Forlenza, and Motika said, “it should be obvious
that determining an IDDQ diagnostic current threshold […] is not simple” [NighFor97]. These authors
were involved with the Sematech test experiment [NighNee97a, NighNee97b], in which IDDQ diagnosis
was performed on a large number of failing chips [NighVal98]. While IDDQ diagnosis generally proved
to be accurate and useful, it required a great deal of manual intervention: the pass-fail current threshold
for each chip had to be repeatedly adjusted until a perfect diagnostic match was found. The work
presented in this chapter was intended first and foremost to provide an answer for the difficulties of
that experiment.
7.1 Probabilistic Diagnosis, Revisited
As stated in the last chapter, the diagnosis problem is by its nature probabilistic. This thesis
argues for acknowledging this fact openly, and reflecting it explicitly in the design of diagnosis
algorithms. A diagnosis algorithm that calculates and expresses its diagnoses probabilistically has
advantages both in the quality and the usability of its results. The results are more usable, because they
can be directly applied as inputs to another algorithm. The results are higher quality, because they can
adjust to the inherent complexities of fault diagnosis. The most common reasons for the unfortunate
complexity of the diagnostic process are two: noise and uncertainty.
Noise comes from many different sources during diagnosis. First, the measurements taken at the
tester are subject to human or mechanical frailties, meaning that the pass-fail results obtained may
contain errors or may not be completely reliable. Second, this data must be stored and transmitted to
the diagnosis program; given the size and complexity of modern circuits, the data reported from the tester
is commensurately large and complex and is subject to noise and data loss from many sources. In fact,
many dictionary organizations are deliberately lossy to achieve aggressive compression targets, often
at the expense of diagnostic utility [CheLar99].
Uncertainty seems to be inherent in the nature of the diagnosis problem. Fault simulators are
used to predict behavior and build fault dictionaries, but even for apparently simple fault models they
often mispredict or fail to predict the resulting behavior from actual defects. Also, a pass or fail
response from the defective chip may have poor repeatability, or may itself be open to interpretation,
both increasing the uncertainty of any resulting diagnosis.
Nowhere in the field of test is the uncertainty of result more keenly felt than in IDDQ testing.
Therefore, nowhere in the field of fault diagnosis is a probabilistic approach more necessary than
during the diagnosis of IDDQ faults.
7.2 Back to Bayes (One Last Time)
Much of the probabilistic diagnosis approach presented so far has been built on Bayesian
prediction or Bayes decision theory. A Bayesian predictor scores the possible causes of an effect
according to the probability of a cause given the effect. To recap the notation and terminology, a cause
or fault candidate is denoted by ci, and the effect or behavior by b. Bayes decision theory says the most likely candidate given a certain behavior is that for which p(ci | b) ≥ p(cj | b) for all j ≠ i. The
posterior probability p(ci | b) for any candidate is determined by Bayes Rule:
p(ci | b) = p(ci) p(b | ci) / ∑_{i=1}^{n} p(ci) p(b | ci) ,      (1)
where p(ci) is referred to as the prior probability (or a-priori probability) of candidate ci and p(b | ci) is
referred to as the conditional probability of b given the candidate ci.
Since both candidates and behaviors can be represented as sequences of responses to the test set
(their fault signatures), if the probabilities of correct and incorrect predictions are assumed to be
independent then the conditional probabilities can be expressed as
p(b | ci) = ∏_{j=1}^{m} p(bj | ci,j) ,      (2)
where m is the number of test vectors, and ci,j is the predicted response of candidate i to test j.
The last chapter presented a mixed-model fault diagnosis algorithm where each p(bj | ci,j) was (in
effect) an estimate of the accuracy of a fault candidate’s prediction for a particular test vector. So, for
example, if a particular fault candidate predicted a failing response with a confidence level of 90%, its
p(observed fail | predicted fail) is 0.90, and its p(observed pass | predicted fail) is 1.0 - 0.90, or 0.10.
These estimates, then, are applied with Bayes Rule to determine the total posterior probability for each
candidate, and the resulting diagnosis consists of faults sorted by decreasing probability. The
algorithm used a uniform prior for all candidates – the a-priori probability of any defect was equal to
that of any other defect.
7.3 Probabilistic IDDQ Diagnosis
In logic-based diagnosis, the greatest source of uncertainty is fault behavior, specifically the
manner in which fault effects propagate (or do not propagate) from the site of a defect to where they
are eventually observed at primary outputs or at scan elements. Because of this uncertainty, even the
best simulators cannot perfectly predict what behaviors will result from a given fault model and
instance.
Conversely, IDDQ fault models are generally not subject to the same vagaries of prediction or
propagation: if a modeled defect is present, it generally produces an abnormally high current level that
theoretically should be observable. However, the “theoretically” is important here, as it is observation,
or rather interpretation, that is the most difficult obstacle for IDDQ diagnosis.
Because of the high background leakage currents that occur in modern VLSI circuits, it can be
difficult to distinguish a high, or failing, IDDQ measurement from a low, or passing, one. In fact, the
most important conditional probability for IDDQ diagnosis is whether an observed quiescent current
level is an indication of an activated defect or not.
In the fields of machine learning or statistical estimation, this problem could be addressed with
the following experiment. Start with a set of defective chips, each of which contains just a single one
of a set of known fault candidates. Apply all IDDQ tests to each chip, and for each test record the IDDQ
value along with the identity of the fault. From this data, then, determine the following distribution:
p(observed IDDQ on test j | fault Fi is present)
or
p(Oj | Fi)
This information can then be used in Equation 2 above, substituting for p(bj | ci,j). But, the problem
for IDDQ diagnosis is that the experiment just described is completely impractical. It is not practical to
determine the identity of a large enough number of defects to gather these statistics.
It may be possible, however, to estimate the distribution of good-circuit IDDQ values over all tests,
as well as the distribution of faulty-circuit IDDQ values over all tests and faults. Scenarios for producing
these estimates are presented in subsequent sections of this chapter. This information can be
represented by
p(observed IDDQ | a fault is activated), or p̂(O | A),
and
p(observed IDDQ | no fault is activated), or p̂(O | ¬A).
One more available piece of information is an estimate of the accuracy of the IDDQ fault simulator
(and fault model). As is the case for logical faults, we can estimate the probability of misprediction
and nonprediction for any fault on any test. The difference for I DDQ fault models is that the prediction
is not fail or pass, but rather fault activation or non-activation:
p(fault i is not activated on test j | fault i is present and activation is predicted for test j),
    or p(¬Ai,j | Âi,j, Fi), or Mi,j,
and
p(fault i is activated on test j | fault i is present and activation is not predicted),
    or p(Ai,j | ¬Âi,j, Fi), or Ni,j,
where Âi,j denotes the prediction of activation of fault i by test j.
Since per-fault and per-test information about prediction error is usually not available, Mi,j and Ni,j
can be estimated by a single M and N for all candidates and tests.
Given this information, the unknown distribution p(Oj | Fi) needed for the Bayesian estimator can
be replaced with the estimations of p̂(O | A) and p̂(O | ¬A):
p(Oj | Fi) = p(Oj ∧ Fi) / p(Fi)
           = p(Oj ∧ (Ai,j ∨ ¬Ai,j)) / p(Fi)                                    [Ai,j ∨ ¬Ai,j = Fi,j = Fi]
           = [p(Oj ∧ Ai,j) + p(Oj ∧ ¬Ai,j) - p(Oj ∧ Ai,j ∧ ¬Ai,j)] / p(Fi)
           = [p(Oj ∧ Ai,j) + p(Oj ∧ ¬Ai,j)] / p(Fi)                            [Ai,j ∧ ¬Ai,j = ∅]
           = [p(Oj | Ai,j) p(Ai,j) + p(Oj | ¬Ai,j) p(¬Ai,j)] / p(Fi)
           = [p̂(O | A) p(Ai,j) + p̂(O | ¬A) p(¬Ai,j)] / p(Fi)                   [substitute estimations]
           = p̂(O | A) p(Ai,j) / p(Fi) + p̂(O | ¬A) p(¬Ai,j) / p(Fi)
           = p̂(O | A) p(Ai,j | Fi) + p̂(O | ¬A) p(¬Ai,j | Fi)                   (3)
The probabilities p(Ai,j | Fi) and p(¬Ai,j | Fi) are the probabilities of a fault's activation and
non-activation, respectively, for a particular test given the fault’s presence. The probabilities are not
known exactly, but can be estimated from the rates of misprediction and nonprediction mentioned
earlier. Since a candidate can predict either a pass (no fault activation) or fail (fault activation) for test
j, Equation 3 can be decomposed into two conditions:
p(Oj | Fi) = p̂(O | A) p(Ai,j | Fi) + p̂(O | ¬A) p(¬Ai,j | Fi)                                          (3)
           = p(Âi,j)[p̂(O | A) p(Ai,j | Âi,j, Fi) + p̂(O | ¬A) p(¬Ai,j | Âi,j, Fi)] +
             p(¬Âi,j)[p̂(O | A) p(Ai,j | ¬Âi,j, Fi) + p̂(O | ¬A) p(¬Ai,j | ¬Âi,j, Fi)]
           = p(Âi,j)[p̂(O | A)(1 - M) + p̂(O | ¬A)(M)] +
             p(¬Âi,j)[p̂(O | A)(N) + p̂(O | ¬A)(1 - N)]                                                  (4)
In the previous chapter on probabilistic logic diagnosis the values of M and N were enough to
define the per-test conditional probabilities for each fault candidate and test result. Now, in the case of
IDDQ diagnosis, there are two additional conditional probabilities to be calculated or estimated,
reflecting the uncertainty in interpreting the test results as either pass or fail.
There exists a certain amount of uncertainty in interpreting the results of logic tests as well, but it
is dominated by the much more serious and common concern of model prediction error: it is much more
likely that, for a stuck-at or bridging fault, the simulator’s pass-fail prediction will be wrong than it is
that the result of a test will be misinterpreted. For this reason, I omitted explicit mention of this type of
error in the previous chapter on logic diagnosis and its calculations, instead concentrating on the
estimates of model prediction error. For IDDQ diagnosis the emphasis is reversed: an IDDQ fault
prediction of activation or non-activation is assumed to be wrong very rarely, due to the simplicity of
the models (only simple sensitization is required, and propagation is assumed). Therefore, for IDDQ diagnosis the prediction error can be ignored, and Equation 4 reduces to
p(Oj | Fi) = p(Âi,j)[p̂(O | A)(1 - M) + p̂(O | ¬A)(M)] + p(¬Âi,j)[p̂(O | A)(N) + p̂(O | ¬A)(1 - N)]
           = p(Âi,j) p̂(O | A) + p(¬Âi,j) p̂(O | ¬A)                                                      (5)
Now, the salient per-test conditional probabilities for IDDQ diagnosis have been reduced to the
estimated distributions of IDDQ currents during fault activation and non-activation. The following
sections of this chapter present different diagnostic scenarios and methods for how these estimates can
be created in each scenario.
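Assuming the activated and non-activated current levels are modeled as normal densities (the scenarios below describe where their parameters might come from), the per-test term of Equation 5 can be sketched as follows; all names and parameters are illustrative.

    import math

    def normal_pdf(x, mean, sigma):
        return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

    def p_obs_given_fault(observed_iddq, predicts_activation,
                          pass_mean, pass_sigma, fail_mean, fail_sigma):
        # Equation 5 with prediction error ignored: the candidate's activation
        # prediction selects which estimated density, p(O|A) or p(O|~A),
        # evaluates the measured current.
        if predicts_activation:
            return normal_pdf(observed_iddq, fail_mean, fail_sigma)
        return normal_pdf(observed_iddq, pass_mean, pass_sigma)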
Previous researchers, notably Gattiker and Maly [GatMal97] and Thibeault [Thi97], have
proposed diagnosis algorithms based on IDDQ test results and probability assessments. In these
approaches the defining element is the use of levels of IDDQ measurements to identify candidate fault
classes (such as faults on 3-input NAND gates or 2-input NORs) in the circuit. These classes can then
be used to refine the diagnosis to individual fault instances. In my approach I am concerned entirely with, and deal directly with, fault instances (a fault on a specific circuit node or nodes). However, an
algorithm that computed probabilities for fault classes could be used to provide the prior probabilities
for the individual faults used in this algorithm.
7.4 IDDQ Diagnosis: Pre-Set Thresholds
In an ideal world, the identification of IDDQ thresholds for test and diagnosis would be trivial: the
definitions of abnormal and normal current values would be fixed and unchanging. For a particular
chip, a fixed threshold could be established that would always divide passing from failing IDDQ.
Consider the following graph, an excerpt of actual IDDQ measurements from the Sematech experiment:
Figure 7.1. IDDQ results for 100 vectors on 1 die (Sematech experiment).
It is possible that for this chip, a threshold value of 100 µA, indicated on the chart by a bold line, would always serve as a viable pass-fail threshold for IDDQ measurements. In this ideal situation, the assignment of the conditional probability p̂(A | O) would be easy:
Figure 7.2. Assignment of a binary p̂(A | O) for the ideal case of a fixed IDDQ threshold.
The inset graph on the left, rotated by 90 degrees, gives the probability that a given IDDQ measurement indicates a defect activation in this ideal case. This is the reverse conditioning from that required for Equation 5, but the distribution p̂(O | A) can be computed by application of Bayes Rule:
p̂(O | A) = p̂(A | O) p̂(O) / p̂(A)
The values of p̂(O) and p̂(A) can either be estimated from the sample values or as uniform distributions. In any event, since the pass-fail threshold is fixed and unambiguous, the probabilities are similarly definite: p̂(O | A) = 0.0 for observed currents below 100 µA, and p̂(O | A) = 1.0 for anything above 100 µA. A similar situation is true for p̂(O | ¬A). This type of extreme conditional probability
assignment, however, leads to posterior candidate assignments of either 0.0 or 1.0 – nothing less than a
perfect match of candidate to behavior will be assigned a non-zero posterior probability.
A less simplistic p̂(A | O) might be that shown in Figure 7.3. In this case, a piecewise linear probability assignment is used, where p̂(A | O = 0.0) = 0.0, p̂(A | O = max. IDDQ) = 1.0, and p̂(A | O = threshold IDDQ) = 0.5. This assignment, of course, assumes that the maximum IDDQ measurement indicates the presence of a defect. Application of Bayes Rule with constant or uniform p̂(O) and p̂(A) will result in the same distribution for p̂(O | A), scaled by a constant factor.
Figure 7.3. Assignment of a linear p̂(A | O) with a fixed IDDQ threshold.
The diagnostic implication of the p̂(A | O) shown above is that current measurements well below the fixed threshold are considered much less likely to indicate the presence of a defect, and those at the maximum almost certainly indicate defectively high current. The choice of linearity is arbitrary but common: in a traditional non-probabilistic diagnosis system the equivalent scoring mechanism would be to give a candidate fault one point for every µA measured below the threshold when it predicts a pass, and subtract one point per below-threshold µA if it predicts a fail. Similarly, a candidate would get one point for every µA above the threshold for predicting a fail and would subtract one per µA for predicting a pass. Such a scoring mechanism would result in the same candidate ordering, with the same
predicting a pass. Such a scoring mechanism would result in the same candidate ordering, with the
same relative assignment of scores, as a Bayesian predictor (assuming a uniform prior) that uses the
linear conditional probabilities illustrated in Figure 7.3.
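A small Python sketch of the piecewise-linear p̂(A | O) just described, with hypothetical parameter names, is given below.

    def p_activation_given_iddq(observed, threshold, max_iddq):
        # 0.0 at zero current, 0.5 at the fixed threshold, 1.0 at the maximum
        # observed IDDQ, interpolated linearly in between.
        if observed <= 0.0:
            return 0.0
        if observed >= max_iddq:
            return 1.0
        if observed <= threshold:
            return 0.5 * observed / threshold
        return 0.5 + 0.5 * (observed - threshold) / (max_iddq - threshold)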
Another possibility for estimating p̂(O | A) and p̂(O | ¬A) is to assume that failing (activated) and passing (non-activated) IDDQ measures are normally distributed. This is, in fact, the general assumption behind many IDDQ testing theories. If the only data available is the pre-set threshold and the actual IDDQ results from the tester, then p̂(O | A) and p̂(O | ¬A) can be generated by estimating the mean and variance of two univariate normal distributions from the sets of sample data. The maximum likelihood estimates for the mean and variance in this case are just the sample mean and variance:
μ̂ = (1/n) ∑_{k=1}^{n} xk

σ̂² = (1/n) ∑_{k=1}^{n} (xk - μ̂)²
The resulting estimated distributions on the sample data presented before would look like that
shown in Figure 7.4 (the illustrated variances are not to scale).
Figure 7.4. Assignment of normally-distributed p̂(O | A) and p̂(O | ¬A).
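A sketch of these maximum-likelihood (sample) estimates, splitting the readings at the pre-set threshold and fitting one normal distribution to each group, follows; the names are illustrative, and at least one reading is assumed on each side of the threshold.

    def fit_normal(samples):
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / n   # ML estimate (divide by n)
        return mean, var

    def fit_pass_fail(iddq_readings, threshold):
        # Below-threshold readings estimate p(O|~A); above-threshold readings
        # estimate p(O|A).
        passing = [x for x in iddq_readings if x <= threshold]
        failing = [x for x in iddq_readings if x > threshold]
        return fit_normal(passing), fit_normal(failing)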
An important assumption in the estimated p̂(O | A) shown above is that the distribution is indeed
univariate. Research on IDDQ failures has demonstrated that IDDQ failures usually involve multiple
sensitized defect paths with various IDDQ levels, and so generally cluster into multiple normal
distributions. This will be addressed in Section 7.6 of this chapter. It should also be noted that there
are a wide variety of more powerful statistical and machine learning techniques available to extract and
test mixture densities of the sort encountered in IDDQ test results. I have chosen the statistical
assumptions and clustering algorithm described above for their simplicity, but more complicated
methods may prove useful or effective, and may be a part of further research.
7.5 IDDQ Diagnosis: Good-Circuit Statistical Knowledge
The ideal case of a fixed threshold is unfortunately something of a rarity for modern circuits. A
fixed threshold is often difficult or impossible to set, as the normal variations of a VLSI process can
result in a wide range of defect-free or acceptable IDDQ current values from die to die. In a nearly ideal
world, enough information would be available for diagnosis to account for these variations and adjust
its assessments of the tester data accordingly.
If enough time, effort, and expense are dedicated to the job it may be possible to adequately
define the defect-free IDDQ characteristics of a single chip and test set. If a sufficient sample of dies
covering the range of process variations is tested with the same vector set and the results are analyzed,
then it is possible that an expected good-circuit IDDQ distribution can be determined. From this
distribution, one can define acceptable ranges for measured I DDQ for both test and diagnosis.
The most sophisticated of these techniques examine the relation of the minimum and maximum
measured IDDQ per die over many samples, and develop acceptance criteria for the range of measured
IDDQ for dies submitted for production testing. One such technique developed at Hewlett-Packard and
Agilent Technologies assumes a normal distribution of good-circuit values for the ratio of maximum to
minimum IDDQ, and from this establishes a 3σ threshold as an acceptable ratio for test [MaxNei99].
(The ratio of maximum to minimum current is used to compensate for die-to-die variations in IDDQ
current.) Figure 7.5 below shows how the statistically-determined value of p̂(O | ¬A) might be
applied to the example test data shown before.
Figure 7.5. Determining a pass threshold based on an assumed distribution and the minimum-vector measured IDDQ.
As shown in Figure 7.5, the IDDQ current ratios method defines a pass-fail threshold as a function of the IDDQ measurement at a presumed-minimum vector and a previously established 3σ acceptance limit. The inset curve demonstrates that the same distribution used for testing can be used as the p̂(O | ¬A) distribution necessary for probabilistic diagnosis. The best estimate for p̂(O | A) in this case is probably also a normal distribution, using either the simple univariate method described in the last section, or the multivariate clustering method described in the next.
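A sketch of how a ratio-based limit of this kind might translate into a per-die pass threshold for diagnosis is given below; ratio_mean and ratio_sigma stand for characterization data that would come from the sampled dies, and the names are illustrative rather than the cited method's actual interface.

    def per_die_threshold(min_iddq, ratio_mean, ratio_sigma):
        # The good-circuit max/min IDDQ ratio is assumed normal; scaling a
        # 3-sigma limit on that ratio by the minimum measured IDDQ of the die
        # under diagnosis gives its pass-fail current threshold.
        return min_iddq * (ratio_mean + 3.0 * ratio_sigma)

    # Readings below the returned value are treated as draws from p(O|~A).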
7.6 IDDQ Diagnosis: Zero Knowledge
The level of preparation and analysis described in the previous section is often not available for
every chip that requires fault diagnosis. If neither a fixed threshold nor a statistically based variable
threshold is available, then some other mechanism must be employed to distinguish and characterize
passing and failing IDDQ measurements for fault diagnosis.
Gattiker and Maly have proposed a method of identifying the presence of defect-induced high
current paths in a circuit [GatMal96, GatMal97]. They note that when the IDDQ measurements of a
defective chip are ordered by magnitude one or more steps can usually be identified in the resulting
graph. Below is the same data set given before, this time with the vectors ordered by increasing
current value.
Figure 7.6. The same data given in Figure 7.1, with the test vectors ordered by IDDQ magnitude.
Now a large step is clearly visible in the IDDQ measurements. The value of this identification is
based on three assumptions. First, small variations in both normal and abnormal IDDQ values are due to
vector-dependent levels of transistor leakage not associated with any defect, and to normal variations
of measurement. Second, large variations are due to the activation of different defect-induced current
paths. Third, these large variations are several times larger than the small variations arising from
transistor leakage or measurement error. Therefore, the presence of a large step in the ordered IDDQ
graph suggests the presence of a defect-induced path from power to ground.
If the absence of a step suggests the absence of an activated defect path, then using current
signatures as described can separate assumed-passing vectors from assumed-failing vectors without
establishing a prior pass-fail threshold. If a large IDDQ step can be identified, all ordered vectors before
the first large step can be considered passing vectors, and all ordered vectors after the first large step
can be considered failing.
Since it is assumed that the small variations in both normal and abnormal IDDQ measurements are
due to various transistor leakage paths and to measurement noise, a reasonable conclusion is that the
resulting passing and failing IDDQ measurements are normally distributed. Applying these
assumptions, and using the presence of a large step to set an IDDQ threshold, an overlay of estimated
conditional probabilities p̂(O | A) and p̂(O | ~A) on the data given above would look something like
Figure 7.7.
[Figure 7.7 plot: IDDQ (µA) vs. vector order, with the estimated p(O|A) and p(O|~A) distributions overlaid on the ordered measurements.]
Figure 7.7. Estimating p̂(O | A) and p̂(O | ~A) as normal distributions of clustered values.
If there is more than one identified current-signature step, then there will be a p̂(O | A) distribution
defined for each cluster of failing IDDQ measurements.
In order to automate the process of determining these distributions, two algorithms are necessary:
one to define groupings of passing and failing IDDQ values, and one to determine a mean and variance
for each distribution thus defined. To begin, the assumptions of the zero-knowledge case are as
follows:
1. No statistical information about the circuit or process in question is available; no data is available for diagnosis except the IDDQ tester results themselves and, perhaps, a prior distribution on the fault candidates.
2. There are at least two passing (normal IDDQ level) test results and two failing (abnormal IDDQ level) test results.
3. The lowest IDDQ measurement is assumed to be a pass, and the highest is assumed to be a fail.
Proceeding from these assumptions, the first task is to divide the test results into groups of
passing and failing test vectors. Using the current signature concept, an algorithm is required to
identify large steps in the sorted IDDQ measurements. One such algorithm is actually a rather simple
version of the hierarchical clustering algorithms commonly used in pattern classification [DudHar73],
and can be described as follows:
1. Sort the vectors by increasing IDDQ value, initially placing all vectors in a single cluster.
2. Break the cluster at the single largest IDDQ step value.
3. For each resulting cluster, calculate the average and largest IDDQ step values.
4. If the largest IDDQ step is K times larger than the cluster average, break the cluster at that step.
5. Loop to step 3 until no new clusters have been formed, or until the maximum number of clusters is reached.
6. Define the lowest (by IDDQ value) cluster as passing (no defect activation) and all other clusters as failing.
A value must be established for K, the multiplier at which an IDDQ step suggests an activated
defect path instead of a normal measurement or leakage variation. Empirical evidence suggests that
such steps are large: for the experiments described in this chapter a value of 10 was used.
The second and remaining task is to establish distributions for each passing and failing cluster of
measurements. The maximum likelihood estimates described in Section 7.4 can be used to estimate
the mean and variance, from the observed sample data, of the normal distribution of each cluster. The
data within each cluster is assumed to be univariate, simplifying the calculations.
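To make these two steps concrete, here is a Python sketch (the function names and the cap on the number of clusters are my own choices; K = 10 follows the value used in this chapter) that performs the step-based clustering and then fits a normal distribution to each cluster by maximum likelihood:

```python
import numpy as np

def split_at_largest_step(cluster):
    """Split a sorted cluster at its single largest IDDQ step."""
    steps = np.diff(cluster)
    i = int(np.argmax(steps)) + 1
    return [cluster[:i], cluster[i:]]

def cluster_iddq(measurements, k=10, max_clusters=8):
    """Group IDDQ measurements into clusters separated by large steps.

    A step is treated as 'large' when it is more than k times the average step
    within its cluster.  The default max_clusters cap is a hypothetical choice."""
    sorted_iddq = np.sort(np.asarray(measurements, dtype=float))
    clusters = split_at_largest_step(sorted_iddq)     # always break at the largest step first

    changed = True
    while changed and len(clusters) < max_clusters:
        changed = False
        new_clusters = []
        for c in clusters:
            steps = np.diff(c)
            if len(steps) > 1 and steps.max() > k * steps.mean():
                new_clusters.extend(split_at_largest_step(c))
                changed = True
            else:
                new_clusters.append(c)
        clusters = new_clusters

    clusters.sort(key=lambda c: c[0])                 # lowest-current cluster first
    passing, failing = clusters[0], clusters[1:]
    return passing, failing

def fit_normal(cluster):
    """Maximum-likelihood mean and variance of a (univariate) cluster."""
    c = np.asarray(cluster, dtype=float)
    return c.mean(), c.var()                          # ML variance divides by n, not n - 1
```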
7.7 A Clustering Example
The data shown in all the graphs of this chapter, taken from the same chip, was simplified for
purposes of clarity and illustration. The actual IDDQ data consisted of nearly twice as many vectors
(196 vs 100) with more than one large step apparent in the measurements. The full data set is
displayed in Figure 7.8 below.
There are several obvious steps in this current signature. Applying the clustering algorithm
described earlier results in the cluster assignments shown in Figure 7.9. This example, in fact,
represents something of an anomaly among the Sematech data. Almost all of the other dies have IDDQ
measurements that produce only two clusters, one passing and one failing, making the process of
setting thresholds and assigning conditional probabilities very simple.
Figure 7.8. Full data set of 196 ordered IDDQ measurements.
Figure 7.9. Division of the ordered measurements into clusters.
7.8 Experimental Results
As I stated earlier, the main purpose of this work is to replicate the Sematech diagnosis
experiments with improved diagnostic methods and algorithms. Phil Nigh of IBM has supplied UCSC
with IDDQ test results and defect information for sixteen chips that were submitted to failure analysis
after IDDQ diagnosis.
In 15 of the 16 cases, Phil Nigh reported successful diagnoses by manually adjusting IDDQ pass-fail thresholds until a perfect match was found in one of the two candidate dictionaries: a pseudo-stuck-at and a bridging fault dictionary, both of which contained pass-fail IDDQ signatures. The
bridging fault candidate list was derived from an examination of the adjacency of same-metal signal
wires, and supplemented by same-gate-input bridges. There were approximately 710K faults and
300K unique signatures (each representing a fault equivalence class) in the pseudo-stuck-at dictionary,
and approximately 560K faults and 220K signatures in the bridging fault dictionary.
As part of the Sematech experiment, these diagnosis results were verified by physical failure
analysis. I was able to verify these results by converting the IDDQ tester results into pass-fail signatures,
using the reported thresholds, and using a simple diagnosis algorithm to find the same candidates.
Next, I fed the raw IDDQ tester results into the zero-knowledge clustering algorithm described
earlier, and from there into the probabilistic diagnosis algorithm, using a uniform prior. In all cases but
one, the clustering algorithm set pass-fail thresholds at the same (sorted) vector index as the manual
method. For all fifteen of these chips, the diagnosis program identified, as the highest-ranked candidates,
exactly the same faults that were previously verified by failure analysis. (The identification of these
faults as “successful” matches to the defects was done by the team at IBM, using their criteria for
matching and verification.)
In one case (HGQ0810/2890), Phil Nigh was unable to correlate the implicated candidate faults
with the results from physical analysis: the best candidates, with perfect signature matches, had no
apparent relation to the defect site. These same five faults showed up at the top of the probabilistic
diagnosis, but a bridge containing the defective node did appear in the next five candidates. This can
only be considered a partial success, however, both because of the relatively low ranking of the bridge
and the fact that the more appropriate pseudo-stuck-at candidate was not included in the diagnosis.
This particular chip, along with a few others of uncertain or unknown physical verification, remains a
subject of ongoing research.
                        Successful Diagnosis
Wafer ID/Chip ID        Manual    Automated    Defect Found
QYQ0801/3488            Y         Y            Metal-metal short
QEQ0713                 Y         Y            Poly-GND short
BJQ0611/3392            Y         Y            Poly-poly short
YXQ0810/2274            Y         Y            Gate-drain short
IAQ1405/2795            Y         Y            Poly-Nwell short
ITQ1312/2284            Y         Y            PFET ‘poor drive’, 13 transistors
ITQ0214/1787            Y         Y            Metal-metal short
RUQ0418/1947            Y         Y            Source-drain leakage
GJQ0908/3382            Y         Y            Poly-poly short & poly-GND short
R5Q0306/3053            Y         Y            Source-drain & drain-substrate shorts
R6Q1608/3062            Y         Y            Poly-Vdd short
ILQ0209/3498            Y         Y            PFET: poly-Nwell leak, poor drive
LJQ1510/2177            Y         Y            Diffusion-substrate leak, 12 transistors
BJQ0908/1725            Y         Y            PFET diffusion anomaly
IXQ1508/4835            Y         Y            Poly-metal short
HGQ0810/2890            N         Partial      Poly-Nwell short
Table 7.1. Results on Sematech defects.
Chapter 8. Small Fault Dictionaries
Up to this point, this thesis has dealt exclusively with the theory of fault diagnosis, and has
proposed several algorithms consistent with a probabilistic and precise diagnosis methodology. But,
one of the self-proclaimed principles of this thesis is that a diagnosis system should be practical,
especially considering the enormous data sizes involved in modern circuits. This chapter addresses
one of the main data problems in fault diagnosis, that of the size of fault dictionaries. Not all diagnosis
algorithms use fault dictionaries; in fact, the choice of whether or not to use dictionaries is often
orthogonal to the methods of matching and scoring candidates. But, since almost all algorithms can
use fault dictionaries, and some situations mandate their use, making dictionaries practical is an
interesting and important topic.
It may be useful to first recap some of the background and terminology introduced in Chapter 2.
Traditional fault diagnosis, often referred to as cause-effect diagnosis, compares the simulated
behaviors of a set of faults with the defective behavior of the chip on the tester. The simulated
behavior of a fault is usually called its fault signature; a complete record, consisting of the list of
failing vectors and the outputs (for each vector) at which errors are detected, is called a full-response
fault signature.
If the simulated behaviors are collected and stored before diagnosis, the result is known as a fault
dictionary. The problem with dictionary-based diagnosis schemes is the enormous amount of data that
is required, both to store and process. The common alternative to using fault dictionaries is to perform
fault simulation at the time of diagnosis, removing the storage requirement [WaiLin89]. In addition, a
process known as path tracing [AbrBre80, RajCox87] can be employed to trace back from erroneous
outputs and implicate a cone of logic, thereby dynamically creating a faultlist for limited simulation.
And yet, despite its onerous data requirements, dictionary-based diagnosis remains popular for
several reasons. First, since fault simulation is performed as a part of test generation, most test
generators can create a fault dictionary (usually stuck-at) as a standard option. A second and more
practical reason is that using a fault dictionary removes the dependency of the diagnosis program on
the circuit netlist and the messy details of simulation. It can often be difficult, long after a circuit has
taped-out and been archived, to restore the final versions of all necessary components of the circuit,
from the main netlist to subsidiary designs to the full set of library files. It can also often be difficult to
reliably restore and faithfully simulate the tester program. For these reasons, dictionaries are often
very popular with failure analysis teams who, often far removed from design and test, appreciate the
fact that all the required diagnostic information about a circuit is encapsulated into a single data file.
Finally, dictionary-based diagnosis can often provide a good result very quickly, simply because
the fault simulation work has been done ahead of time and is therefore amortized over many diagnosis
runs. This aspect is especially significant for high-volume situations in which a large number of parts
must be diagnosed, and in cases where a quick diagnostic result is desired.
In this chapter, I present a method of addressing the major problem in dictionary-based diagnosis,
namely the size of fault dictionaries. I will first examine the components of the data involved in fault
diagnosis, and the costs and benefits of each. I will then propose a strategy for approximating the
information content of full-response dictionaries at a minimum cost. Finally, I will begin to develop a
new approach of low-resolution diagnosis, in which a conscious trade-off is made between data size
and precision. All of this is an attempt to postpone, for a while at least, the widely expected demise of
dictionary-based fault diagnosis.
8.1 The Unbearable Heaviness of Unabridged Dictionaries
In a classic full-response fault dictionary, the detection data for an individual fault consists of the
test vectors for which it is detected and the outputs (primary circuit outputs or scan flops) to which the
fault is propagated for each detecting test vector. If there are f faults in the fault list, v test vectors, and
o outputs, the total number of bits required for an uncompressed (no data loss) dictionary is f*v*o.
Different encodings of this data, considering the relative number of faults, vectors, and outputs, as well
as the number of detections, can result in very different dictionary sizes for the same data [BopHar96].
For purposes of a generalized comparison, we will leave aside such considerations and focus on the
raw number of bits of data in a full dictionary. For full-response dictionaries, this number can be truly
enormous and completely impractical for modern circuits.
(In addition, this chapter will ignore the topic of data compression and such algorithms and
programs as Lempel-Ziv, Huffman coding, gzip, etc. Data compression, when applied to fault
dictionaries, addresses the question of how the detection data is stored. This chapter, on the other
hand, will address what data is stored in a fault dictionary. In all cases, data compression algorithms
can be applied to the various data sets presented here, but such compressed dictionaries cannot usually
be used for diagnosis without first uncompressing them, a serious disadvantage for very large data
sets.)
Several techniques have been applied to reduce the data requirements of the full-response fault
dictionary. Most involve some compaction, or loss of data from the original. So-called drop-on-k
dictionaries do not record every detection in the test set, but stop after a certain number of detections.
These dictionaries, however, are of questionable utility for fault diagnosis [CheLar99]. The most
commonly-used compaction technique for fault dictionaries is the pass-fail dictionary, in which the
per-vector output data has been completely removed and the results of each test are expressed as a
single bit: 0 for no detection, 1 for detection at any output. Pass-fail dictionaries are often relatively
small, requiring f*v bits, and are in some situations quite usable for fault diagnosis.
The problem with using pass-fail dictionaries is, of course, that all of the failing output data has
been lost. This information can be very useful in distinguishing between fault candidates that fail the
same set of tests. In addition, considering only faults in the input cones of failing outputs can usually
significantly reduce the candidate space. The bottom line is that a pass-fail dictionary usually produces
a much lower resolution diagnosis, one in which many candidates receive the same score and are
effectively indistinguishable.
To demonstrate this, I ran stuck-at diagnosis experiments on the ISCAS-85 circuits and four
industrial circuits. The entire stuck-at faultlist was simulated and diagnosed using both the full-response and pass-fail dictionaries for each circuit. Table 8.1 reports the number of faults, with
equivalent scores, that were ranked #1 using the full-response (FR) and pass-fail (PF) dictionaries.
In all cases, the correct match will be one of these top-ranked faults. (The correct match, and all top-ranked faults, will get a “perfect” matching score.) The table also notes the number of bits contained
in each dictionary.
In some cases the difference in resolution is quite dramatic, especially for the larger circuits in
which more data has been lost by removing the output information. But equally or more dramatic is
the difference in the number of bits required for each type of dictionary. The first goal of this chapter,
then, is to find some way to re-introduce the obviously-useful output information into a fault
dictionary, while keeping the size of the dictionary to pass-fail-sized numbers.
Circuit    FR faults ranked #1    FR bits (f*v*o)      PF faults ranked #1    PF bits (f*v)
C432       2.29                   191,142              2.80                   27,306
C499       1.17                   901,120              1.17                   28,160
C880       1.61                   1,512,864            1.66                   56,032
C1355      1.67                   2,936,192            1.71                   91,756
C1908      1.82                   2,966,700            1.99                   118,668
C2670      2.24                   18,141,184           3.04                   283,456
C3540      2.03                   8,963,724            2.10                   407,442
C5315      1.89                   73,878,966           2.07                   600,642
C6288      1.33                   7,207,680            1.40                   225,240
C7552      1.63                   131,224,800          2.12                   1,226,400
Ind-A      2.74                   232,836,120,000      5.49                   15,522,408
Ind-B      2.33                   929,424,581,448      2.91                   46,373,844
Ind-C      2.51                   297,857,813,000      51.0                   21,271,000
Ind-D      1.91                   9,077,621,646        2.86                   333,822
Table 8.1. Size of top-ranked candidate set (in faults) and total number of signature bits.
8.2 Output-Compacted Signatures
Consider the contents of a typical fault signature for a single candidate (Figure 8.1). A fault
signature can be thought of as a matrix of bits, in which each row represents the pass (0) or fail (1)
response, at an individual output, to each test vector. The bits in each column represent the outputs at
which the fault will be detected for a particular vector. There are therefore v columns and o rows in
this view of a fault signature, and there are f such (v*o)-sized matrices in the full fault dictionary.
      v1  v2  v3  v4  v5  v6  v7  v8  v9 | OC
o1     0   0   0   0   0   0   0   0   0 |  0
o2     0   0   0   0   0   0   0   0   0 |  0
o3     1   1   0   0   0   0   0   0   1 |  1
o4     1   1   0   0   0   0   1   0   1 |  1
o5     0   0   0   0   0   0   0   0   1 |  1
o6     0   0   0   0   0   0   0   0   0 |  0
o7     0   0   0   0   0   0   0   0   0 |  0
o8     0   0   0   0   0   0   0   0   0 |  0
o9     0   1   0   0   0   0   0   0   1 |  1
PF     1   1   0   0   0   0   1   0   1 |
Figure 8.1. Full-response fault signature for a single fault.
The bottom row, labeled PF, is the traditional pass-fail signature for this fault, and is the bitwise-OR of all rows in the table. This pass-fail signature says that the fault is predicted to fail tests v1, v2,
v7, and v9. The pass-fail dictionary is constructed of such signatures, one per fault, for an
uncompressed storage requirement of f*v bits.
An interesting observation is that the failing output information can be compacted in the same
way as the failing vector information. The result of doing so is the final column in Figure 8.1 labeled
OC, where each bit indicates whether a particular output ever fails, under the test set, for that fault. I
refer to this as the “output-compacted” signature of the fault; it is the bitwise-OR of all columns in
the fault signature matrix.
An idea, then, is to re-introduce failing output information by constructing a dictionary to include
these output-compacted signatures. These signatures can be added into the traditional pass-fail
dictionary, or stored as a separate file. The additional storage required is f*o bits, and the total for all
signatures will be f*(v+o). This can be significantly smaller than the f*v*o bits required for the full-response dictionary.
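As a small illustration of this construction, the sketch below takes the example full-response matrix of Figure 8.1 and derives both signatures by OR-reduction along the two axes:

```python
import numpy as np

# Full-response signature for one fault: rows = outputs, columns = vectors.
# This reproduces the example matrix of Figure 8.1.
fr = np.zeros((9, 9), dtype=np.uint8)
fr[2, [0, 1, 8]] = 1          # o3 fails v1, v2, v9
fr[3, [0, 1, 6, 8]] = 1       # o4 fails v1, v2, v7, v9
fr[4, 8] = 1                  # o5 fails v9
fr[8, [1, 8]] = 1             # o9 fails v2, v9

pf = fr.any(axis=0).astype(np.uint8)   # pass-fail signature: OR over outputs, one bit per vector
oc = fr.any(axis=1).astype(np.uint8)   # output-compacted signature: OR over vectors, one bit per output

print("PF:", pf)   # [1 1 0 0 0 0 1 0 1]  -> fails v1, v2, v7, v9
print("OC:", oc)   # [0 0 1 1 1 0 0 0 1]  -> outputs o3, o4, o5, o9
```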
8.3 Diagnosis with Output Signatures
If a fault dictionary includes the output signature information, the question then becomes how
best to use this information. Specifically, how should this information be used to rate fault candidates?
Normally, in pass-fail diagnosis a candidate is scored by the number of bit differences in its signature
from that of the observed behavior. Two commonly-used metrics are nonprediction and misprediction.
Nonprediction is the number of bits in the observed behavior not found in the candidate signature
(underprediction). Misprediction is the number of bits in the candidate not found in the observed
behavior (overprediction). The score for each candidate fault will then consist of some combination of
a nonprediction and misprediction score. Different diagnosis algorithms may weight nonprediction
and misprediction differently, depending upon the specifics of the fault model and simulator.
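As a deliberately minimal sketch of these two counts for pass-fail signatures (bit sequences of 0s and 1s; how the counts are weighted into a score is left to the particular algorithm, as noted above):

```python
def pf_match_counts(observed_pf, candidate_pf):
    """Nonprediction and misprediction counts between two pass-fail signatures."""
    nonprediction = sum(1 for o, c in zip(observed_pf, candidate_pf) if o and not c)
    misprediction = sum(1 for o, c in zip(observed_pf, candidate_pf) if c and not o)
    return nonprediction, misprediction
```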
This same method could be followed with output-compacted signatures: find the intersection with
the behavior’s output signature, and weight the intersection with the appropriate parameters. Then,
combine the scores for the (pass-fail) vector matching and the output matching into a single candidate
score.
An alternative scoring method is to use both the pass-fail and output signatures simultaneously.
Looking at the example fault signature again, we see that the intersection of pass-fail and output
signatures defines an area of possible fault detection (shaded areas of Figure 8.2).
      v1  v2  v3  v4  v5  v6  v7  v8  v9 | OC
o1     0   0   0   0   0   0   0   0   0 |  0
o2     0   0   0   0   0   0   0   0   0 |  0
o3     1   1   0   0   0   0   0   0   1 |  1
o4     1   1   0   0   0   0   1   0   1 |  1
o5     0   0   0   0   0   0   0   0   1 |  1
o6     0   0   0   0   0   0   0   0   0 |  0
o7     0   0   0   0   0   0   0   0   0 |  0
o8     0   0   0   0   0   0   0   0   0 |  0
o9     0   1   0   0   0   0   0   0   1 |  1
PF     1   1   0   0   0   0   1   0   1 |
[In the original figure, the cells at the intersection of the failing vectors (v1, v2, v7, v9) and the failing outputs (o3, o4, o5, o9) are shaded to show the detection area.]
Figure 8.2. The intersection of pass-fail and output signatures.
During diagnosis, then, an observed failure bit (vector & output) inside of this 2-dimensional
detection area can be considered a successful prediction, or part of the intersection of candidate and
behavior. A detection outside of this area is considered a failed prediction, or a non-predicted bit.
Misprediction must be de-emphasized in this scoring method since the intersection area, in bits, will
usually be much greater than the actual number of predicted failing bits.
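The sketch below illustrates this two-dimensional matching. The weighting scheme is hypothetical (the chapter only requires that misprediction be de-emphasized); the detection area is the outer product of the candidate's output-compacted and pass-fail signatures:

```python
import numpy as np

def score_candidate(observed_fr, cand_pf, cand_oc, nonpred_w=1.0, mispred_w=0.1):
    """Score a candidate from its PF and OC signatures against observed behavior.

    observed_fr: observed full-response behavior, an (outputs x vectors) 0/1 matrix.
    cand_pf:     candidate pass-fail signature, one bit per vector.
    cand_oc:     candidate output-compacted signature, one bit per output."""
    area = np.outer(np.asarray(cand_oc, bool), np.asarray(cand_pf, bool))
    observed = np.asarray(observed_fr, bool)

    predicted    = np.logical_and(observed, area).sum()    # observed fails inside the area
    nonpredicted = np.logical_and(observed, ~area).sum()   # observed fails outside the area
    mispredicted = np.logical_and(~observed, area).sum()   # area bits that never failed

    # Misprediction is weighted lightly, since the detection area is usually
    # much larger than the set of bits the fault would actually produce.
    return predicted - nonpred_w * nonpredicted - mispred_w * mispredicted
```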
I repeated the earlier experiments to determine what diagnostic value the addition of output-compacted signatures provides. Table 8.2 gives the results for a full-response dictionary vs. a pass-fail
dictionary vs. a pass-fail dictionary with output-compacted (PF+OC) signatures.
It appears from the data that the output signatures do indeed increase the precision of these
diagnoses. In almost all cases where there was a significant difference in precision between the pass-fail and full-response dictionaries, the addition of output-compacted signatures made up most of that
difference. While the pass-fail dictionary with output signatures is much smaller than the full-response
dictionary, it is still a significant increase over the number of pass-fail bits. This is especially true for
the industrial circuits, which, since they contain scan flip-flops, have many more outputs than the
simple ISCAS circuits. The next challenge is to decrease this overhead while retaining the diagnostic
improvement.
Circuit    FR faults     FR bits (f*v*o)     PF faults     PF bits (f*v)    PF+OC faults    PF+OC bits (f*(v+o))
           ranked #1                         ranked #1                      ranked #1
C432       2.29          191,142             2.80          27,306           2.32            29,637
C499       1.17          901,120             1.17          28,160           1.17            42,240
C880       1.61          1,512,864           1.66          56,032           1.61            70,720
C1355      1.67          2,936,192           1.71          91,756           1.69            117,740
C1908      1.82          2,966,700           1.99          118,668          1.82            135,718
C2670      2.24          18,141,184          3.04          283,456          2.24            371,520
C3540      2.03          8,963,724           2.10          407,442          2.03            441,014
C5315      1.89          73,878,966          2.07          600,642          1.89            926,100
C6288      1.33          7,207,680           1.40          225,240          1.34            345,368
C7552      1.63          131,224,800         2.12          1,226,400        1.63            1,585,920
Ind-A      2.74          232,836,120,000     5.49          15,522,408       2.87            309,507,408
Ind-B      2.33          929,424,581,448     2.91          46,373,844       2.34            420,237,312
Ind-C      2.51          297,857,813,000     51.0          21,271,000       2.51            319,128,813
Ind-D      1.91          9,077,621,646       2.86          333,822          1.91            66,113,689
Table 8.2. Size of top-ranked candidate set (in faults) and total number of signature bits.
8.4 Objects in Dictionary are Smaller Than They Appear
The pass-fail with output signature dictionary size formula given in the previous section,
(f*(v + o)), is in fact a worst-case calculation of the number of bits required for storing output-compacted signatures. One common aspect of fault dictionaries is that logically-connected sets of
faults tend to fail at the same set of circuit outputs. For example, the faults that make up a sub-design
in the circuit will usually propagate their failures to the outputs (possibly scan or wrapper flops) of that
sub-design. A cursory examination of the contents of any full-response fault dictionary will usually
confirm that sets of failing outputs are repeated often throughout the fault signatures. Output
signatures, which are collections of per-fault output sets, are even more likely to repeat across a fault
set.
This fact can be exploited by storing a particular output signature only once in a dictionary.
Then, every fault that has that output signature will reference that particular output signature by an
index, rather than the full string of output bits. If there are so unique output signatures in a dictionary,
then this index will require log2(so) bits for each of the f faults. The revised formula for the size of the
output signature dictionary, with pass-fail signatures, is therefore (f*(log2(so)+v) + (so*o)). Table 8.3
reports the number of unique signatures for each circuit above, the percent of the faultlist this
represents, and the actual number of bits used in the dictionaries for the results given above. The
number of pass-fail bits is repeated for comparison.
These revised sizes for the pass-fail with output signature dictionaries are much more acceptable,
especially given the increase in diagnostic precision (as reported in Table 8.2).
Circuit   Faults (f)   Unique OC sigs (so)   PF bits (f*v)    PF+OC bits, w/repeats (f*(v+o))    PF+OC bits, unique (f*(log2(so)+v) + (so*o))
C432      333          41                    27,306           29,637                              29,591
C499      440          302                   28,160           42,240                              41,784
C880      544          141                   56,032           70,720                              64,191
C1355     812          426                   91,756           117,740                             112,696
C1908     682          318                   118,668          135,718                             132,756
C2670     1376         203                   283,456          371,520                             307,456
C3540     1526         359                   407,442          441,014                             429,074
C5315     2646         678                   600,642          926,100                             710,496
C6288     3754         288                   225,240          345,368                             268,242
C7552     3360         586                   1,226,400        1,585,920                           1,322,702
Ind-A     19599        5352                  15,522,408       309,507,408                         96,057,195
Ind-B     18654        5633                  46,373,844       420,237,312                         159,512,932
Ind-C     21271        2423                  21,271,000       319,128,813                         55,455,521
Ind-D     2419         622                   333,822          66,113,689                          17,272,058
Table 8.3. Output-compacted signature sizes adjusted for repeated output signatures.
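As a check on these formulas, consider C432 from Table 8.3: f = 333 and so = 41 are given, and the PF and FR bit counts imply v = 82 vectors (27,306 / 333) and o = 7 outputs (191,142 / 27,306). A few lines of Python reproduce all four table entries (the index width log2(so) is rounded up to a whole number of bits):

```python
from math import ceil, log2

f, v, o, so = 333, 82, 7, 41            # C432: faults, vectors, outputs, unique OC signatures

fr_bits      = f * v * o                              # full-response dictionary
pf_bits      = f * v                                  # pass-fail dictionary
pf_oc_bits   = f * (v + o)                            # PF+OC, output signatures repeated
pf_oc_unique = f * (ceil(log2(so)) + v) + so * o      # PF+OC, unique output signatures indexed

print(fr_bits, pf_bits, pf_oc_bits, pf_oc_unique)     # 191142 27306 29637 29591
```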
8.5 What about Unmodeled Faults?
I next performed some experiments to see if adding output-compacted signatures to a pass-fail
dictionary would also help a dictionary’s ability to diagnose unmodeled faults. The classic unmodeled
fault vis-à-vis the stuck-at fault model is the bridging fault. I have previously published a bridging
fault diagnosis approach that uses stuck-at faults to identify at least one node in a bridging fault
pair [LavChe97].
In a similar experiment, I simulated and diagnosed, with stuck-at fault candidates, a set of
realistic bridging faults for each of the ISCAS-85 circuits. I compared the three types of candidate
signatures: full-response, pass-fail, and pass-fail with output-compacted signatures. A diagnostic
success was defined as identifying one of the two bridged circuit nodes in the top 10 stuck-at faults.
The data in Table 8.4 shows that the full-response signatures provide the highest success rate
when diagnosing bridging faults, as expected. The pass-fail success rate is generally much lower, on
some circuits succeeding less than half the time, which is unacceptable for most diagnostic situations.
The pass-fail dictionary augmented with output-compacted signatures provides a significant
improvement over the pass-fail results, at the relatively small additional cost in bits (compared to the
full-response requirements) reported in Table 8.3.
Circuit    FR Success %    PF Success %    PF+OC Success %
C432       98.1            84.2            92.4
C499       83.2            34.4            71.3
C880       99.4            79.4            95.7
C1355      82.6            43.1            62.1
C1908      89.2            60.7            79.0
C2670      95.6            76.2            90.5
C3540      99.5            83.2            90.5
C5315      97.9            65.2            94.9
C6288      88.5            35.2            57.3
C7552      95.1            68.0            86.7
Table 8.4. Success rate for bridging fault diagnosis using stuck-at fault candidates.
8.6 An Alternative to Path Tracing?
As mentioned earlier, path tracing has been proposed as a first step in fault diagnosis to reduce
the candidate faultlist to a tractable size. In this capacity, path tracing helps by limiting the faultlist to
just those faults in the input cone of affected outputs. A path-tracing algorithm can either be static, in
which the tracing algorithm only looks at the logical paths in the netlist, or it can be dynamic, in which
information about the fault model and the applied vectors is used to eliminate certain candidate faults
in the input cones.
An interesting observation about output-compacted signatures is that they contain much of the
same information that is obtained from a dynamic path-tracing algorithm; that is, they report the set of
outputs to which the fault effects are propagated for each candidate fault. Output-compacted
signatures lose some resolution because they do not store the per-vector propagation information; fault
propagation can change, depending upon the applied test vector, as fault effects are either blocked or
transmitted. But, output signatures will provide better resolution than can be obtained from static path
tracing, because the fault type is known and aggregate vector information is stored.
Therefore, output-compacted signatures can be thought of as filling the same role in dictionary-based fault diagnosis as a preliminary path tracing. This would be an advantage in scenarios,
mentioned earlier, where path tracing is not convenient or possible during fault diagnosis.
I was curious, then, how much diagnostic resolution the output-compacted signatures provide on
their own, aside from the pass-fail information. To this end, I ran the same stuck-at diagnosis experiments described earlier in this chapter, diagnosing stuck-at behaviors with stuck-at candidates, but this time only using the output
signatures for matching. The results are shown in Table 8.5 below. As with the previous stuck-at
diagnosis experiments, the correct candidate will always be ranked #1; again, the question is how
many candidates are ranked equally at the top of the diagnosis list.
Circuit    Faults    PF faults ranked #1    PF bits (f*v)    OC faults ranked #1    OC bits (f*log2(so)+so*o)
C432       333       2.80                   27,306           31.6                   2,285
C499       440       1.17                   28,160           2.24                   13,624
C880       544       1.66                   56,032           12.5                   8,159
C1355      812       1.71                   91,756           3.31                   20,940
C1908      682       1.99                   118,668          16.0                   14,088
C2670      1,376     3.04                   283,456          52.7                   24,000
C3540      1,526     2.10                   407,442          17.1                   21,632
C5315      2,646     2.07                   600,642          31.5                   109,854
C6288      3,754     1.40                   225,240          37.8                   43,002
C7552      3,360     2.12                   1,226,400        60.0                   96,302
Ind-A      19,599    5.49                   15,522,408       15.8                   80,534,787
Ind-B      18,654    2.91                   46,373,844       23.1                   113,139,088
Ind-C      21,271    51.0                   21,271,000       21.5                   34,184,521
Ind-D      2,419     2.86                   333,822          16.4                   16,938,236
Table 8.5. Top-ranked candidate set size and signature bits for pass-fail and output-compacted (alone) signatures.
It is difficult, from this limited set of results, to draw any conclusions about the potential for
using output signatures by themselves. It is possible that output signatures could, like static path
tracing, prove useful for diagnosing unmodeled faults, or in cases where unmodeled behavior is
expected. Until further research is done, it seems the power of these output signatures is best realized
when they are, as demonstrated earlier, used in combination with traditional pass-fail signatures.
8.7 Clustering Output Signatures
Even the small data set used in these experiments hints at an impending problem with the use of
output-compacted signatures: real circuits have many more outputs than vectors on average. This fact
will cause the output signature size to explode when compared to the number of pass-fail vectors.
It was mentioned earlier in this chapter that the failing output sets of many faults in a circuit can
be identical; this fact was used to remove a large number of identical output signatures from the
dictionaries. But, it is also true that many more faults have similar sets of failing outputs, differing by
only a small number of bits. Can this fact be used to further reduce the size of an output signature
dictionary, by combining similar output signatures, while maintaining the previous diagnostic
accuracy?
The idea of identifying and combining similar individuals in a set of values or vectors has been
studied extensively in the fields of machine learning and pattern recognition [DudHar73]. The
common approach is referred to as clustering, and many clustering algorithms have been identified and
analyzed for various situations.
I considered two clustering algorithms to reduce the number of output signature bits in the sample
dictionaries. The first, based on a method referred to as hierarchical clustering, starts with every
output signature in its own cluster (identical signatures are already combined). Then, it finds the most
similar pair of clusters and combines those signatures into a new signature, creating a new cluster (and
a new signature). The algorithm proceeds in the same fashion until the desired number of clusters is
achieved.
Similarity between two output signatures can be described in a number of ways. The method I
chose was to express the distance (or dissimilarity) by the number of 1s in the bitwise-XOR between
two signatures. The signature pair with the smallest XOR value was chosen for the next clustering. In
case of multiple pairs with the same XOR value, the clustering algorithm chooses the one with the
maximum number of 1s in a bitwise-OR, which effectively chooses the signatures with the most failing
outputs in common. The resulting signature from a clustering of two output signatures is the bitwise-OR of the two signatures.
A disadvantage of this method is the effort required to perform the clustering. Finding the initial
set of pair-wise distances between output signatures is an O(n2) process, where n is the number of
initial output signatures (so in previous tables). While this work is only required once at dictionary
creation, it can be time-consuming for very large circuits.
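A sketch of this greedy agglomerative clustering follows (hypothetical function name; for simplicity it rescans all pairs on every merge instead of caching the O(n2) distance table, which would be preferable for large signature sets):

```python
import numpy as np

def cluster_output_signatures(signatures, target_count):
    """Merge output signatures until only 'target_count' cluster signatures remain.

    Distance is the number of 1s in the XOR of two signatures; ties prefer the
    pair with the most 1s in their OR (equivalently, the most failing outputs
    in common).  A merge replaces the pair with their bitwise OR."""
    sigs = [np.asarray(s, dtype=bool) for s in signatures]
    while len(sigs) > target_count:
        best_key, best_pair = None, None
        for i in range(len(sigs)):
            for j in range(i + 1, len(sigs)):
                xor = int(np.logical_xor(sigs[i], sigs[j]).sum())
                orr = int(np.logical_or(sigs[i], sigs[j]).sum())
                key = (xor, -orr)                      # smallest XOR, then largest OR
                if best_key is None or key < best_key:
                    best_key, best_pair = key, (i, j)
        i, j = best_pair
        merged = np.logical_or(sigs[i], sigs[j])       # new cluster signature
        sigs = [s for k, s in enumerate(sigs) if k not in (i, j)] + [merged]
    return sigs
```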
My second method of clustering output signatures is to divide the full set of circuit outputs into
distinct sequential subsets. For example, if a circuit has 10,000 outputs, the outputs could be divided
into 1000 subsets of 10 outputs each. Then, output signatures can be created with one bit per subset,
where the value of each bit indicates whether the fault propagates its fault effects to any of the outputs
in the set. This method relies on the assumption that (as is often the case) closely-numbered outputs
are also closely positioned in the circuit, so that the failing bits (and clusters) of localized faults tend to
be highly correlated.
(The issue of correlation is important to the success of any clustering algorithm. Unless the bits
in a cluster are highly correlated, then the values of the bits in the clustered signature will be
meaningless. The first clustering algorithm, by examining all the signatures against each other, can
judge these correlations well. The second method, on the other hand, must rely on the expectation that
adjacent sequential bits are by nature correlated with each other. It is, however, much simpler and
enables the creation of clustered signatures on the fly, not just after all the original signatures have
been written. This is a tremendous advantage for large data sets.)
As a simple example of the second clustering algorithm, consider the diagram below (Figure 8.3).
It shows a set of 25 bits, where a shaded box represents a 1 (detection) and an unshaded box represents
a 0. This set could be thought of as a full output signature. The lower set shows a clustered signature,
where each bit represents a clustering of 5 of the original output bits. The result, then, is a 5-bit
clustered output signature.
Figure 8.3. A simple example of clustering by subsets of outputs.
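The subset method needs only a few lines; in the sketch below the example bits are hypothetical, since the individual shaded cells of Figure 8.3 are not listed in the text, but the reduction from 25 output bits to a 5-bit clustered signature mirrors the figure:

```python
import numpy as np

def cluster_by_output_subsets(oc_signature, group_size):
    """One bit per group of consecutive outputs: set if the fault reaches any of them."""
    sig = np.asarray(oc_signature, dtype=bool)
    pad = (-len(sig)) % group_size                     # pad so the length divides evenly
    sig = np.concatenate([sig, np.zeros(pad, dtype=bool)])
    return sig.reshape(-1, group_size).any(axis=1).astype(np.uint8)

example = np.zeros(25, dtype=np.uint8)
example[[2, 3, 11]] = 1                                # hypothetical failing outputs
print(cluster_by_output_subsets(example, 5))           # -> [1 0 1 0 0]
```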
I performed both types of clustering to reduce the number of output-compacted signature bits for
the industrial circuits. (The ISCAS circuits have too few outputs to be interesting for this experiment.)
My target was to reduce the number of output bits to about the same number as the pass-fail bits. To
this end, for each circuit, I clustered the output signatures down to 1000 bits per fault.
I did not find a significant difference between the diagnostic results from the two clustering
algorithms; only the data for the second (simpler) algorithm is reported here. Once again, the table
shows the number of equivalently-ranked top candidates for each type of signature.
Circuit    Outputs (o)    FR faults     PF faults     PF+OC unclustered    PF+OC clustered    OC (clustered)
                          ranked #1     ranked #1     faults ranked #1     faults ranked #1   bits per fault
Ind-A      15,000         2.74          5.49          2.87                 2.87               1,000
Ind-B      20,042         2.33          2.91          2.34                 2.34               1,000
Ind-C      14,003         2.51          51.0          2.51                 6.21               1,000
Ind-D      27,193         1.91          2.86          1.91                 2.03               1,000
Table 8.6. Diagnostic results when output-compacted signatures are clustered down to 1000 bits each.
This table does not report the number of dictionary bits for the PF + clustered-OC dictionary,
because I did not perform the collapsing of duplicate signatures as was done earlier (Table 8.3).
I expect that, even more so than for the unclustered signatures, this would cause a significant reduction
in the dictionary size. In any case, the maximum number of bits required for the output signatures
(1000 bits) is between 0.4 and 1.3 times the number of pass-fail signature bits.
These results indicate that using even highly-clustered output signatures increases diagnostic
precision over pass-fail signatures alone. At the approximate cost of doubling the size of the pass-fail
dictionaries, the precision of the result approaches that of the full-response dictionaries.
8.8 Clustering Vector Signatures & Low-Resolution Diagnosis
To follow up on these results, I was interested to find out whether or not the same sort of
clustering can be used to further reduce the number of pass-fail signature bits. Specifically, can a
clustering algorithm be applied to pass-fail signatures to create effective vector signatures of a small
number of bits? This question is of particular importance for the very largest of modern circuits, which
contain millions of faults, thousands of test vectors, and tens or hundreds of thousands of outputs. For
these circuits, even traditional pass-fail signatures are too large for practical dictionary-based
diagnosis.
The correlation assumed for consecutive output bits, however, probably does not exist for pass-fail vector bits in most cases. (An exception is when a subset of consecutive tests targets a particular
set of faults; test sets, however, can be reorganized and this correlation can be lost.) Therefore, the
simple clustering algorithm will not be as effective when applied to pass-fail signatures as it was when
applied to output signatures.
I was curious, however, to see what sort of results could be obtained using the simple clustering
algorithm on both pass-fail and output signatures. The idea was to create “tiny” dictionaries at a
reasonable cost, and then see what sort of diagnoses could be performed. To this end, I created
dictionaries with only 100 bits per fault signature, divided between 50 clustered vector bits and 50
clustered output bits. For both halves of each signature, the clusters were created by the simple
sequential clustering algorithm described earlier. Diagnoses were performed on three of the four
industrial circuits; the fourth had too few test vectors in its test set to be of interest. The results are
presented in Table 8.7.
Circuit    Faults (f)    Outputs (o)    FR faults ranked #1    PF faults ranked #1    100b signatures ranked #1
Ind-A      19,599        15,000         2.74                   5.49                   201.0
Ind-B      18,654        20,042         2.33                   2.91                   3.89
Ind-C      21,271        14,003         2.51                   51.0                   90.42
Table 8.7. Diagnostic results for clustering (PF+OC) signatures down to 100 bits total.
Of course, the precision of the resulting diagnoses is much lower than could be obtained with
either the full fault data or with unclustered signatures of any sort. A point of further research is to
examine the efficiency of these types of signatures, to see if the per-bit reduction in the candidate
faultlist is as good or better than either pass-fail or full-response fault signatures.
Despite the significant loss of precision, trading off precision for data load may be attractive if
diagnosis can be performed iteratively, by using data of ever-increasing resolution on ever-decreasing
faultlists. An initial diagnosis would be quite large, but if it produces an accurate set of fault
candidates that represents, say, 1-10% of the faultlist, then diagnosis can proceed where normally it
would be completely impractical. I refer to this approach as low-resolution fault diagnosis. It is
possible that this approach could find application in very high-volume situations or in system-level
diagnostics. Or perhaps this technique could enable “built-in-self-diagnosis”, in which a chip could
diagnose itself to some reasonable level of precision, using this kind of tiny dictionary and results from
built-in-self-test (BIST).
Chapter 9. Conclusions and Future Work
Fault diagnosis in modern circuits is a difficult task, considering the size of today’s circuits and
the almost-innumerable ways in which they can fail. But, there is good reason to attempt the task
anyway, as the quality of these circuits depends upon identifying and fixing sources of error.
I have presented an approach to fault diagnosis in combinational circuits that attempts to be as
comprehensive as possible. Perhaps the most important contribution is the introduction of a
probabilistic framework to the problem, which allows many different fault models, algorithms, and
sources of information to be applied to the problem to produce an accurate result, the precision of
which increases as more effort is applied. In developing a guiding philosophy for my approach, I have
identified many issues involved in fault diagnosis that, while some may be common sense, have more
often been ignored than heeded by previous researchers.
The diagnosis approach presented here covers most stages of the problem, from an initial
diagnosis step that can handle multiple and complex defects, to a model-based stage that can apply
fault models of arbitrary sophistication to refine a diagnosis as far as desired. I have also addressed the
issue of non-logic fails in the form of an extension of the probabilistic framework to cover IDDQ test
fails. Finally, I have addressed the practical issue of static data sizes, a problem that can defeat many
diagnosis strategies on very large circuits. In doing so I introduced a new topic of low-resolution
diagnosis, which may find use in some more exotic situations such as high-speed diagnosis or self-diagnosing circuits.
The future work identified by my research falls into four categories. The first is to uniformly
apply the Dempster-Shafer method of scoring fault candidates across all stages of the diagnosis
methodology. I have applied it to the first stage iSTAT algorithm, where it seemed the most natural
and practical fit, but applying it to the model-based algorithm would result in two major
improvements. First, it would provide a confidence measure for the final diagnosis, a very important
piece of information for practical use. Second, it might provide some limited means of considering
multiple (perhaps two or three) model instances at one time for the case of multiple independent
defects.
The second area of future work is to address the issue of timing-related failures. This thesis
considered only static logic failures, or tests that fail at a very slow speed. But, failures in which a chip
fails to meet timing requirements on tests run at-speed are very interesting for modern high-speed or
high-performance designs. I suspect that some modification of the iSTAT algorithm could address this
issue by implicating individual faults along timing-critical paths.
The third area of future work is to run these algorithms on actual production fails to see if they
can properly diagnose defects that aren’t artificially created. This is the real test, of course, of any
diagnosis algorithm, but will take a relatively large effort on the part of an industrial team to carry out
physical root-cause verification on multiple real chips. This is a difficult task given current limits on
research and development funds, and is complicated by the fact that it is now usually the case that one
company designs a chip while another company manufactures it, while a third company may do the
failure analysis. Such disintegration in industry, however, creates an opening that an easy-to-use yet
powerful diagnosis tool could exploit.
Finally, I would also like to pursue the avenue of low-resolution fault diagnosis, if only to satisfy
my curiosity about how much diagnosis can be performed with how little data. The idea of future
circuits that can do self-diagnosis, and therefore possibly repair themselves, is intriguing and worth
some investigation, however impractical it may be.
Bibliography
[AbrBre80] M. Abramovici and M. A. Breuer. Multiple fault diagnosis in combinational circuits
based on an effect-cause analysis. IEEE Transactions on Computing, Vol. C-29, pages 451-460, June
1980.
[AbrBre90] M. Abramovici, M. Breuer, and A. Friedman. Digital Systems Testing and Testable
Design. W.H. Freeman and Company, New York, NY. 1990.
[AbrMen84] M. Abramovici, P.R. Menon and D.T. Miller. Critical Path Tracing: An Alternative to
Fault Simulation. IEEE Design & Test, IEEE, February 1984.
[AckMil91] J.M. Acken and S.D. Millman. Accurate modeling and simulation of bridging faults.
Proceedings of the Custom Integrated Circuits Conference, pages 17.4.1-17.4.4, 1991.
[AckMil92] J.M. Acken and S.D. Millman. Fault model evolution for diagnosis: Accuracy vs.
precision. Proceedings of the Custom Integrated Circuits Conference, 1992.
[Ait91] R. Aitken. Fault Location with Current Monitoring. Proceedings of the International Test
Conference, pages 623-632, IEEE, 1991.
[Ait92] R. Aitken. A Comparison of Defect Models for Fault Location with Iddq Measurements.
Proceedings of the International Test Conference, pages 778-787, IEEE, 1992.
[Ait95] R. Aitken. Finding defects with fault models. Proceedings of the International Test
Conference, pages 498-505, IEEE, 1995.
[AitMax95] R. Aitken and P. Maxwell. Better models or better algorithms? On techniques to improve
fault diagnosis. Hewlett-Packard Journal, February 1995.
[AllErv92] R.W. Allen, M.M. Ervin-Willis and R.E. Tullose. DORA: CAD Interface to Automatic
Diagnostics. 19th Design Automation Conference, pages 559-563, 1982.
[BarBha01] T. Bartenstein, J. Bhawnani. SLAT Plus: Work in Progress. 2nd International IEEE
Workshop on Yield Optimization and Test, Baltimore, Nov. 1-2, 2001.
[BarHea01] T. Bartenstein, D. Heaberlin, L. Huisman, D. Sliwinski. Diagnosing Combinational Logic
Designs Using the Single Location At-a-Time (SLAT) Paradigm. Proceedings of the International
Test Conference, pages 287-296, IEEE, 2001.
[BopHar96] V. Boppana, I. Hartanto, W. K. Fuchs. Full Fault Dictionary Storage Based on Labeled
Tree Encoding. Proceedings IEEE VLSI Test Symposium, pages 174-179, April 1996.
[Bur89] D. Burns. Locating high resistance shorts in CMOS circuits by analyzing supply current
measurement vectors. International Symposium for Testing and Failure Analysis, pages 231-237,
November 1989.
[ChaGon93] S. Chakravarty and Y. Gong. An algorithm for diagnosing two-line bridging faults in
combinational circuits. Proceedings of the Design Automation Conference, pages 520-524, 1993.
[ChaLiu93] S. Chakravarty and M. Liu. Iddq measurement based diagnosis of bridging faults.
Journal of Electronic Testing: Theory and Application (Special Issue on Iddq Testing), 1993.
[CheLar99] B. Chess and T. Larrabee. Creating Small Fault Dictionaries. IEEE Transactions on
Computer-Aided Design, pages 346-356, March 1999.
[CheLav95] B. Chess, D.B. Lavo, F.J. Ferguson and T. Larrabee. Diagnosis of Realistic Bridging
Faults with Single Stuck-At Information. Dig. Of Technical Papers, 1995 IEEE International
Conference on Computer-Aided Design, pages 185-192, Nov. 1995.
[DeGun95] K. De and A. Gunda. Failure analysis for full-scan circuits. Proceedings of the
International Test Conference, pages 636-645, IEEE, 1995.
[DudHar73] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons,
1973.
[EicLin91] E. Eichelberger, E. Lindbloom, J. Waicukauski and T. Williams. Structured Logic
Testing. Prentice Hall, New Jersey, 1991.
[FerYu96] F.J. Ferguson and J. Yu. Maximum likelihood estimation for yield analysis. Proceedings
of the Defect and Fault Tolerance in VLSI Systems Symposium, pages 149-157, IEEE, 1996.
[GatMal96] A. Gattiker and W. Maly. Current Signatures. Proceedings of the 1996 VLSI Test
Symposium, pages 112-117, IEEE, 1996.
[GatMal97] A. Gattiker and W. Maly. Current Signatures: Application. Proceedings of the
International Test Conference, pages 156-165, IEEE, 1997.
[GatMal98] A. Gattiker and W. Maly. Toward Understanding “IDDQ-Only” Fails. Proceedings of the
International Test Conference, pages 156-165, IEEE, 1998.
[GirLan92] P. Girard, C. Landrault and S. Pravossoudovitch. Delay Fault Diagnosis by Critical Path
Tracing. IEEE Design and Test of Computers, IEEE, December 1992.
[GrePat92] G. Greenstein and J. Patel. EPROOFS: a CMOS bridging fault simulator. Proceedings of
the International Conference on Computer-Aided Design, pages 268-271, IEEE, 1992.
[HenSod97] C. Henderson and J. Soden. Signature analysis for IC diagnosis and failure analysis.
Proceedings of the International Test Conference, pages 310-318, IEEE, 1997.
[JacBis86] J. Jacob and N.N. Biswas. GTBD faults and lower bounds on multiple fault coverage of
single fault test sets. Proceedings of the International Test Conference, pages 849-855, IEEE, 1986.
[JeeISTFA93] A. Jee and F.J. Ferguson. Carafe: A software tool for failure analysis. Proceedings of
the International Symposium on Testing and Failure Analysis, pages 143-149, 1993.
[JeeVTS93] A. Jee and F.J. Ferguson. Carafe: An inductive fault analysis tool for CMOS VLSI
circuits. Proceedings of the IEEE VLSI Test Symposium, pages 92-98, 1993.
[Kun93] R.P. Kunda. Fault location in full-scan designs. International Symposium for Testing &
Failure Analysis, pages 121-126, 1993.
[LamSho80] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. Technical
Report 54, Comp. Sci. Lab, SRI International, March 1980.
[LavChe97] D.B. Lavo, B. Chess, T. Larrabee, F.J. Ferguson, J. Saxena and K. Butler. Bridging Fault
Diagnosis in the Absence of Physical Information. Proceedings of the International Test Conference,
pages 887-893, IEEE, 1997.
[LavTCAD98] D.B. Lavo, B. Chess, T. Larrabee, and F. J. Ferguson. Diagnosing realistic bridging
faults with single stuck-at information. IEEE Transactions on Computer-Aided Design, pages 255-268,
March 1998.
[LavLar96] D.B. Lavo, T. Larrabee, and B. Chess. Beyond Byzantine Generals: Unexpected behavior
and bridging-fault diagnosis. Proceedings of the International Test Conference, pages 611-619. IEEE,
1996.
[MaxAit93] P. Maxwell and R. Aitken. Biased voting: a method for simulating CMOS bridging faults
in the presence of variable gate logic thresholds. Proceedings of the International Test Conference,
pages 63-72. IEEE, 1993.
[MaxNei99] P. Maxwell, P. O’Neill, R. Aitken, R. Dudley, N. Jaarsma, Minh Quach, and D.
Wiseman. Current Ratios: A Self-Scaling Technique for Production IDDQ Testing. Proceedings of the
International Test Conference, IEEE, 1999.
[Mei74] K.C.Y Mei. Bridging and stuck-at faults. IEEE Transactions on Computers, C-23(7), pages
720-727, July 1974.
[MilMcC90] S.D. Millman, E.J. McCluskey and J.M. Acken. Diagnosing CMOS bridging faults with
stuck-at fault dictionaries. Proceedings of the International Test Conference, pages 860-870, IEEE,
1990.
[MonBru92] R. Rodriguez-Montanez, E.M.J.G. Bruls and J. Figueras. Bridging defects resistance
measurements in a CMOS process. Proceedings of the International Test Conference, pages 892-899,
IEEE, 1992.
[NighFor97] P. Nigh, D. Forlenza and F. Motika. Application and Analysis of IDDQ Diagnostic
Software. Proceedings of the International Test Conference, pages 319-327, IEEE, 1997.
[NighNee97a] P. Nigh, W. Needham, K. Butler, P. Maxwell and R. Aitken. An Experimental Study
Comparing the Relative Effectiveness of Functional, Scan, IDDQ and Delay-Fault Testing. Proceedings of
VLSI Test Symposium, pages 459-463, 1997.
[NighNee97b] P. Nigh, W. Needham, K. Butler, P. Maxwell, R. Aitken and W. Maly. So What is an
Optimal Test Mix? A Discussion of the Sematech Methods Experiment. Proceedings of the
International Test Conference , pages 1037-1038, IEEE, 1997.
[NighVal98] P. Nigh, D. Vallett, A. Patel, J. Wright, F. Motika, D. Forlenza, R. Kurtulik, W. Chong.
Failure Analysis of Timing and IDDQ-only Failures from the SEMATECH Test Methods Experiment.
Proceedings of the International Test Conference, IEEE, pages 43-52, 1997.
[RajCox87] J. Rajski and H. Cox. A method of test generation and fault diagnosis in very large
combinational circuits. Proceedings of the International Test Conference, pages 932-943, 1987.
[RatKea86] V. Ratford and P. Keating. Integrating guided probe and fault dictionary: an enhanced
diagnostic approach. Proceedings of the International Test Conference, pages 304-311, IEEE, 1986.
[RicBow85] J. Richman and K.R. Bowden. The modern fault dictionary. Proceedings of the
International Test Conference, pages 696-702, IEEE, 1985.
[Rot94] Roth, C.D. Simulation and test pattern generation for bridge faults in CMOS ICs. Master’s
Thesis, University of California Santa Cruz, Department of Computer Engineering, June 1994.
[SaxBal98] J. Saxena, H. Balachandran, K. Butler, D.B. Lavo, B. Chess, T. Larrabee and F.J.
Ferguson. On Applying Non-Classical Defect Models to Automated Diagnosis. Proceedings of the
International Test Conference, pages 748-757, IEEE, 1998.
[Sha76] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton,
New Jersey, 1976.
[SheMal85] J.P. Shen, W. Maly and F.J. Ferguson. Inductive Fault Analysis of MOS Integrated
Circuits. IEEE Design and Test of Computers, 2(6):13-26, December 1985.
[SheSim96] J. W. Sheppard and W. R. Simpson. Improving the accuracy of diagnostics provided by
fault dictionaries. Proceedings of the 14th VLSI Test Symposium, pages 180-185, IEEE, 1996.
[SimShe94] W.R. Simpson and J.W. Sheppard. System Test and Diagnosis. Kluwer Academic
Publishers, Norwell, MA, 1994.
[Thi97] C. Thibeault. A Novel Probabilistic Approach for IC Diagnosis Based on Differential
Quiescent Current Signatures. Proceedings of the 1997 VLSI Test Symposium, pages 80-85, IEEE,
1997.
[Tor38] S.C. Tornay. Ockham: Studies and Selections. Open Court Publishers, La Salle, IL, 1938.
[VenDru00] S. Venkataraman, S. Drummonds. POIROT: A Logic Fault Diagnosis Tool and Its
Applications. Proceedings of the International Test Conference, pages 253-262, IEEE, 2000.
[WaiLin89] J. Waicukauski and E. Lindbloom. Failure diagnosis of structured VLSI. IEEE Design
and Test of Computers, pages 49-60, August 1989.
Bibliography (in order of reference)
[AbrBre90] M. Abramovici, M. Breuer, and A. Friedman. Digital Systems Testing and Testable
Design. W.H. Freeman and Company, New York, NY. 1990.
[JacBis86] J. Jacob and N.N. Biswas. GTBD faults and lower bounds on multiple fault coverage of
single fault test sets. Proceedings of the International Test Conference, pages 849-855, IEEE, 1986.
[AckMil91] J.M. Acken and S.D. Millman. Accurate modeling and simulation of bridging faults.
Proceedings of the Custom Integrated Circuits Conference, pages 17.4.1-17.4.4, 1991.
[GrePat92] G. Greenstein and J. Patel. EPROOFS: a CMOS bridging fault simulator. Proceedings of
the International Conference on Computer-Aided Design, pages 268-271, IEEE, 1992.
[MaxAit93] P. Maxwell and R. Aitken. Biased voting: a method for simulating CMOS bridging faults
in the presence of variable gate logic thresholds. Proceedings of the International Test Conference,
pages 63-72. IEEE, 1993.
[Rot94] Roth, C.D. Simulation and test pattern generation for bridge faults in CMOS ICs. Master’s
Thesis, University of California Santa Cruz, Department of Computer Engineering, June 1994.
[MonBru92] R. Rodriguez-Montanez, E.M.J.G. Bruls and J. Figueras. Bridging defects resistance
measurements in a CMOS process. Proceedings of the International Test Conference, pages 892-899,
IEEE, 1992.
[AitMax95] R. Aitken and P. Maxwell. Better models or better algorithms? On techniques to improve
fault diagnosis. Hewlett-Packard Journal, February 1995.
[AbrBre80] M. Abramovici and M. A. Breuer. Multiple fault diagnosis in combinational circuits
based on an effect-cause analysis. IEEE Transactions on Computing, Vol. C-29, pages 451-460, June
1980.
[RajCox87] J. Rajski and H. Cox. A method of test generation and fault diagnosis in very large
combinational circuits. Proceedings of the International Test Conference, pages 932-943, 1987.
[AllErv92] R.W. Allen, M.M. Ervin-Willis and R.E. Tullose. DORA: CAD Interface to Automatic
Diagnostics. 19th Design Automation Conference, pages 559-563, 1982.
[RatKea86] V. Ratford and P. Keating. Integrating guided probe and fault dictionary: an enhanced
diagnostic approach. Proceedings of the International Test Conference, pages 304-311, IEEE, 1986.
[RicBow85] J. Richman and K.R. Bowden. The modern fault dictionary. Proceedings of the
International Test Conference, pages 696-702, IEEE, 1985.
[Kun93] R.P. Kunda. Fault location in full-scan designs. International Symposium for Testing and Failure Analysis, pages 121-126, 1993.
[DeGun95] K. De and A. Gunda. Failure analysis for full-scan circuits. Proceedings of the
International Test Conference, pages 636-645, IEEE, 1995.
[WaiLin89] J. Waicukauski and E. Lindbloom. Failure diagnosis of structured VLSI. IEEE Design
and Test of Computers, pages 49-60, August 1989.
[MilMcC90] S.D. Millman, E.J. McCluskey and J.M. Acken. Diagnosing CMOS bridging faults with
stuck-at fault dictionaries. Proceedings of the International Test Conference, pages 860-870, IEEE,
1990.
[ChaGon93] S. Chakravarty and Y. Gong. An algorithm for diagnosing two-line bridging faults in
combinational circuits. Proceedings of the Design Automation Conference, pages 520-524, 1993.
[CheLav95] B. Chess, D.B. Lavo, F.J. Ferguson and T. Larrabee. Diagnosis of Realistic Bridging Faults with Single Stuck-At Information. Digest of Technical Papers, 1995 IEEE International Conference on Computer-Aided Design, pages 185-192, November 1995.
[VenDru00] S. Venkataraman and S. Drummonds. POIROT: A Logic Fault Diagnosis Tool and Its Applications. Proceedings of the International Test Conference, pages 253-262, IEEE, 2000.
[Ait95] R. Aitken. Finding defects with fault models. Proceedings of the International Test
Conference, pages 498-505, IEEE, 1995.
[LavChe97] D.B. Lavo, B. Chess, T. Larrabee, F.J. Ferguson, J. Saxena and K. Butler. Bridging Fault
Diagnosis in the Absence of Physical Information. Proceedings of the International Test Conference,
pages 887-893, IEEE, 1997.
[GirLan92] P. Girard, C. Landrault and S. Pravossoudovitch. Delay Fault Diagnosis by Critical Path Tracing. IEEE Design and Test of Computers, December 1992.
[AbrMen84] M. Abramovici, P.R. Menon and D.T. Miller. Critical Path Tracing: An Alternative to Fault Simulation. IEEE Design and Test of Computers, February 1984.
[Ait91] R. Aitken. Fault Location with Current Monitoring. Proceedings of the International Test
Conference, pages 623-632, IEEE, 1991.
[Ait92] R. Aitken. A Comparison of Defect Models for Fault Location with Iddq Measurements.
Proceedings of the International Test Conference, pages 778-787, IEEE, 1992.
[ChaLiu93] S. Chakravarty and M. Liu. Iddq measurement based diagnosis of bridging faults.
Journal of Electronic Testing: Theory and Application (Special Issue on Iddq Testing), 1993.
[Bur89] D. Burns. Locating high resistance shorts in CMOS circuits by analyzing supply current
measurement vectors. International Symposium for Testing and Failure Analysis, pages 231-237,
November 1989.
[GatMal96] A. Gattiker and W. Maly. Current Signatures. Proceedings of the 1996 VLSI Test
Symposium, pages 112-117, IEEE, 1996.
[GatMal97] A. Gattiker and W. Maly. Current Signatures: Application. Proceedings of the
International Test Conference, pages 156-165, IEEE, 1997.
[GatMal98] A. Gattiker and W. Maly. Toward Understanding “IDDQ-Only” Fails. Proceedings of the International Test Conference, pages 156-165, IEEE, 1998.
[Thi97] C. Thibeault. A Novel Probabilistic Approach for IC Diagnosis Based on Differential
Quiescent Current Signatures. Proceedings of the 1997 VLSI Test Symposium, pages 80-85, IEEE,
1997.
[Tor38] S.C. Tornay. Ockham: Studies and Selections. Open Court Publishers, La Salle, IL, 1938.
[BarHea01] T. Bartenstein, D. Heaberlin, L. Huisman, D. Sliwinski. Diagnosing Combinational Logic
Designs Using the Single Location At-a-Time (SLAT) Paradigm. Proceedings of the International
Test Conference, pages 287-296, IEEE, 2001.
[SheMal85] J.P. Shen, W. Maly and F.J. Ferguson. Inductive Fault Analysis of MOS Integrated
Circuits. IEEE Design and Test of Computers, 2(6):13-26, December 1985.
[JeeISTFA93] A. Jee and F.J. Ferguson. Carafe: A software tool for failure analysis. Proceedings of
the International Symposium on Testing and Failure Analysis, pages 143-149, 1993.
[JeeVTS93] A. Jee and F.J. Ferguson. Carafe: An inductive fault analysis tool for CMOS VLSI
circuits. Proceedings of the IEEE VLSI Test Symposium, pages 92-98, 1993.
[FerYu96] F.J. Ferguson and J. Yu. Maximum likelihood estimation for yield analysis. Proceedings
of the Defect and Fault Tolerance in VLSI Systems Symposium, pages 149-157, IEEE, 1996.
[SimShe94] W.R. Simpson and J.W. Sheppard. System Test and Diagnosis. Kluwer Academic
Publishers, Norwell, MA, 1994.
[SheSim96] J. W. Sheppard and W. R. Simpson. Improving the accuracy of diagnostics provided by
fault dictionaries. Proceedings of the 14th VLSI Test Symposium, pages 180-185, IEEE, 1996.
[AckMil92] J. M. Acken and S. D. Millman. Fault model evolution for diagnosis: Accuracy vs.
precision. Proceedings of the Custom Integrated Circuits Conference, 1992.
[EicLin91] E. Eichelberger, E. Lindbloom, J. Waicukauski and T. Williams. Structured Logic
Testing. Prentice Hall, New Jersey, 1991.
[LamSho80] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. Technical Report 54, Computer Science Laboratory, SRI International, March 1980.
[Sha76] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton,
New Jersey, 1976.
[BarBha01] T. Bartenstein and J. Bhawnani. SLAT Plus: Work in Progress. 2nd International IEEE Workshop on Yield Optimization and Test, Baltimore, Nov. 1-2, 2001.
[NighVal98] P. Nigh, D. Vallett, A. Patel, J. Wright, F. Motika, D. Forlenza, R. Kurtulik and W. Chong. Failure Analysis of Timing and IDDQ-only Failures from the SEMATECH Test Methods Experiment. Proceedings of the International Test Conference, pages 43-52, IEEE, 1998.
[DudHar73] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons,
1973.
[Mei74] K.C.Y. Mei. Bridging and stuck-at faults. IEEE Transactions on Computers, C-23(7), pages 720-727, July 1974.
[LavLar96] D. B. Lavo, T. Larrabee, and B. Chess. Beyond Byzantine Generals: Unexpected behavior and bridging-fault diagnosis. Proceedings of the International Test Conference, pages 611-619, IEEE, 1996.
[HenSod97] C. Henderson and J. Soden. Signature analysis for IC diagnosis and failure analysis.
Proceedings of the International Test Conference, pages 310-318, IEEE, 1997.
[LavTCAD98] D. B. Lavo, B. Chess, T. Larrabee, and F. J. Ferguson. Diagnosing realistic bridging
faults with single stuck-at information. IEEE Transactions on Computer-Aided Design, pages 255-268,
March 1998.
[SaxBal98] J. Saxena, H. Balachandran, K. Butler, D.B. Lavo, B. Chess, T. Larrabee and F.J.
Ferguson. On Applying Non-Classical Defect Models to Automated Diagnosis. Proceedings of the
International Test Conference, pages 748-757, IEEE, 1998.
[NighFor97] P. Nigh, D. Forlenza and F. Motika. Application and Analysis of IDDQ Diagnostic Software. Proceedings of the International Test Conference, pages 319-327, IEEE, 1997.
[NighNee97a] P. Nigh, W. Needham, K. Butler, P. Maxwell and R. Aitken. An Experimental Study Comparing the Relative Effectiveness of Functional, Scan, IDDQ and Delay-Fault Testing. Proceedings of the VLSI Test Symposium, pages 459-463, 1997.
[NighNee97b] P. Nigh, W. Needham, K. Butler, P. Maxwell, R. Aitken and W. Maly. So What is an Optimal Test Mix? A Discussion of the SEMATECH Methods Experiment. Proceedings of the International Test Conference, pages 1037-1038, IEEE, 1997.
[CheLar99] B. Chess and T. Larrabee. Creating Small Fault Dictionaries. IEEE Transactions on
Computer-Aided Design, pages 346-356, March 1999.
[MaxNei99] P. Maxwell, P. O’Neill, R. Aitken, R. Dudley, N. Jaarsma, M. Quach, and D. Wiseman. Current Ratios: A Self-Scaling Technique for Production IDDQ Testing. Proceedings of the International Test Conference, IEEE, 1999.
[BopHar96] V. Boppana, I. Hartanto and W. K. Fuchs. Full Fault Dictionary Storage Based on Labeled Tree Encoding. Proceedings of the IEEE VLSI Test Symposium, pages 174-179, April 1996.