UNIVERSITY OF CALIFORNIA
SANTA CRUZ

COMPREHENSIVE FAULT DIAGNOSIS OF COMBINATIONAL CIRCUITS

A dissertation submitted in partial satisfaction of the requirements for the degree of

DOCTOR OF PHILOSOPHY

in

COMPUTER ENGINEERING

by

David B. Lavo

September 2002

The Dissertation of David B. Lavo is approved:

Professor Tracy Larrabee, Chair
Professor F. Joel Ferguson
Professor David P. Helmbold
Robert C. Aitken, Ph.D.

Frank Talamantes
Vice Provost & Dean of Graduate Studies

Copyright © by David B. Lavo 2002

Contents

List of Figures .......... v
List of Tables .......... vi
Abstract .......... vii
Acknowledgements .......... viii
Chapter 1. Introduction .......... 1
Chapter 2. Background .......... 4
  2.1 Types of Circuits .......... 4
  2.2 Diagnostic Data .......... 5
  2.3 Fault Models .......... 8
  2.4 Fault Models vs. Algorithms: A Short Tangent into a Long Debate .......... 9
  2.5 Diagnostic Algorithms .......... 11
    2.5.1 Early Approaches and Stuck-at Diagnosis .......... 13
    2.5.2 Waicukauski & Lindbloom .......... 14
    2.5.3 Stuck-At Path-Tracing Algorithms .......... 16
    2.5.4 Bridging fault diagnosis .......... 16
    2.5.5 Delay fault diagnosis .......... 18
    2.5.6 IDDQ diagnosis .......... 19
    2.5.7 Recent Approaches .......... 20
    2.5.8 Inductive Fault Analysis .......... 20
    2.5.9 System-Level Diagnosis .......... 22
Chapter 3. A Deeper Understanding of the Problem: Developing a Fault Diagnosis Philosophy .......... 23
  3.1 The Nature of the Defect is Unknown .......... 23
  3.2 Fault Models are Hopelessly Unreliable .......... 24
  3.3 Fault Models are Practically Indispensable .......... 25
  3.4 With Fault Models, More is Better .......... 27
  3.5 Every Piece of Data is Valuable .......... 28
  3.6 Every Piece of Data is Possibly Bad .......... 29
  3.7 Accuracy Should be Assumed, but Precision Should be Accumulated .......... 29
  3.8 Be Practical .......... 30
Chapter 4. First Stage Fault Diagnosis: Model-Independent Diagnosis .......... 31
  4.1 SLAT, STAT, and All That .......... 32
  4.2 Multiplet Scoring .......... 35
  4.3 Collecting and Diluting Evidence .......... 36
  4.4 “A Mathematical Theory of Evidence” .......... 37
  4.5 Turning Evidence into Scored Multiplets .......... 40
  4.6 Matching Simple Failing Tests: An Example .......... 43
  4.7 Matching Passing Tests .......... 46
  4.8 Matching Complex Failures .......... 48
  4.9 Size is an Issue .......... 49
  4.10 Experimental Results – Simulated Faults .......... 51
  4.11 Experimental Results – FIB Defects .......... 54
Chapter 5. Second Stage Fault Diagnosis: Implication of Likely Fault Models .......... 56
  5.1 An Old, but Still Valid, Debate .......... 56
  5.2 Answers and Compromises .......... 57
  5.3 Finding Meaning (and Models) in Multiplets .......... 58
  5.4 Plausibility Metrics .......... 59
  5.5 Proximity Metrics .......... 62
  5.6 Experimental Results – Multiplet Classification .......... 64
  5.7 Analysis of Multiple Faults .......... 65
  5.8 The Advantages of (Multiplet) Analysis .......... 66
Chapter 6. Third Stage Fault Diagnosis: Mixed-Model Probabilistic Fault Diagnosis .......... 68
  6.1 Drive a Little, Save a Lot: A Short Detour into Inexpensive Bridging Fault Diagnosis .......... 69
    6.1.1 Stuck with Stuck-at Faults .......... 69
    6.1.2 Composite Bridging Fault Signatures .......... 70
    6.1.3 Matching and (Old Style) Scoring with Composite Signature .......... 72
    6.1.4 Experimental Results with Composite Bridging Fault Signatures .......... 72
  6.2 Mixed-model Diagnosis .......... 73
  6.3 Scoring: Bayes decision theory .......... 74
  6.4 The Probability of Model Error ... .......... 77
  6.5 ... Vs. Acceptance Criteria .......... 78
  6.6 Stuck-at scoring .......... 80
  6.7 0th-Order Bridging Fault Scoring .......... 80
  6.8 1st-Order Bridging Fault Scoring .......... 81
  6.9 2nd-Order Bridging Fault Scoring .......... 81
  6.10 Expressing Uncertainty with Dempster-Shafer .......... 83
  6.11 Experimental results – Hewlett-Packard ASIC .......... 84
  6.12 Experimental results – Texas Instruments ASIC .......... 88
  6.13 Conclusion .......... 90
Chapter 7. IDDQ Fault Diagnosis .......... 91
  7.1 Probabilistic Diagnosis, Revisited .......... 92
  7.2 Back to Bayes (One Last Time) .......... 93
  7.3 Probabilistic IDDQ Diagnosis .......... 94
  7.4 IDDQ Diagnosis: Pre-Set Thresholds .......... 98
  7.5 IDDQ Diagnosis: Good-Circuit Statistical Knowledge .......... 101
  7.6 IDDQ Diagnosis: Zero Knowledge .......... 103
  7.7 A Clustering Example .......... 106
  7.8 Experimental Results .......... 107
Chapter 8. Small Fault Dictionaries .......... 110
  8.1 The Unbearable Heaviness of Unabridged Dictionaries .......... 111
  8.2 Output-Compacted Signatures .......... 114
  8.3 Diagnosis with Output Signatures .......... 115
  8.4 Objects in Dictionary are Smaller Than They Appear .......... 117
  8.5 What about Unmodeled Faults? .......... 118
  8.6 An Alternative to Path Tracing? .......... 119
  8.7 Clustering Output Signatures .......... 121
  8.8 Clustering Vector Signatures & Low-Resolution Diagnosis .......... 124
Chapter 9. Conclusions and Future Work .......... 126
Bibliography .......... 128

List of Figures

FIGURE 2.1. EXAMPLE OF PASS-FAIL FAULT SIGNATURES .......... 6
FIGURE 2.2. EXAMPLE OF INDEXED AND BITMAPPED FULL-RESPONSE FAULT SIGNATURES .......... 7
FIGURE 4.1. SIMPLE PER-TEST DIAGNOSIS EXAMPLE .......... 34
FIGURE 4.2. AN EXAMPLE BELIEF FUNCTION .......... 38
FIGURE 4.3. ANOTHER BELIEF FUNCTION .......... 38
FIGURE 4.4. THE COMBINATION OF TWO BELIEF FUNCTIONS .......... 39
FIGURE 4.5. EXAMPLE SHOWING THE COMBINATION OF FAULTS .......... 41
FIGURE 4.6. A THIRD TEST RESULT IS COMBINED WITH THE RESULTS FROM THE PREVIOUS EXAMPLE .......... 42
FIGURE 4.7. EXAMPLE TEST RESULTS WITH MATCHING FAULTS .......... 43
FIGURE 4.8. COMBINATION OF EVIDENCE FROM THE FIRST TWO TESTS .......... 44
FIGURE 4.9. A-SA-1 WILL LIKELY FAIL ON MANY MORE VECTORS THAN WILL B-SA-1 .......... 46
FIGURE 4.10. EXAMPLE OF CONSTRUCTING A SET OF POSSIBLY-FAILING OUTPUTS FOR A MULTIPLET .......... 49
FIGURE 4.11. MULTIPLETS (A,B), (A,B,C) AND (A,B,D) EXPLAIN ALL TEST RESULTS, BUT (A,B) IS SMALLER AND SO PREFERRED .......... 50
FIGURE 4.12. THE CHOICE OF BEST MULTIPLET IS DIFFICULT IF (A) PREDICTS ADDITIONAL FAILURES BUT (B, C) DOES NOT .......... 50
FIGURE 6.1. THE COMPOSITE SIGNATURE OF X BRIDGED TO Y WITH MATCH RESTRICTIONS (IN BLACK) AND MATCH REQUIREMENTS (LABELED R) .......... 71
FIGURE 7.1. IDDQ RESULTS FOR 100 VECTORS ON 1 DIE (SEMATECH EXPERIMENT) .......... 98
FIGURE 7.2. ASSIGNMENT OF A BINARY p̂(A | O) FOR THE IDEAL CASE OF A FIXED IDDQ THRESHOLD .......... 98
FIGURE 7.3. ASSIGNMENT OF A LINEAR p̂(A | O) WITH A FIXED IDDQ THRESHOLD .......... 99
FIGURE 7.4. ASSIGNMENT OF NORMALLY-DISTRIBUTED p̂(O | A) AND p̂(O | Ā) .......... 101
FIGURE 7.5. DETERMINING A PASS THRESHOLD BASED ON AN ASSUMED DISTRIBUTION AND THE MINIMUM-VECTOR MEASURED IDDQ .......... 102
FIGURE 7.6. THE SAME DATA GIVEN IN FIGURE 7.1, WITH THE TEST VECTORS ORDERED BY IDDQ MAGNITUDE .......... 103
FIGURE 7.7. ESTIMATING p̂(O | A) AND p̂(O | Ā) AS NORMAL DISTRIBUTIONS OF CLUSTERED VALUES .......... 104
FIGURE 7.8. FULL DATA SET OF 196 ORDERED IDDQ MEASUREMENTS .......... 106
FIGURE 7.9. DIVISION OF THE ORDERED MEASUREMENTS INTO CLUSTERS .......... 107
FIGURE 8.3. A SIMPLE EXAMPLE OF CLUSTERING BY SUBSETS OF OUTPUTS .......... 123

List of Tables

TABLE 4.1. RESULTS FROM SCORING AND RANKING MULTIPLETS ON SOME SIMULATED DEFECTS .......... 53
TABLE 4.2. FASTSCAN AND ISTAT RESULTS ON TI FIB EXPERIMENTS: 2 STUCK-AT FAULTS, 14 BRIDGES .......... 55
TABLE 5.1. RESULTS FROM CORRELATING TOP-RANKED MULTIPLETS TO DIFFERENT FAULT MODELS .......... 64
TABLE 6.1. SET OF LIKELY EFFECTS THAT CAN INVALIDATE COMPOSITE BRIDGING FAULT PREDICTIONS .......... 82
TABLE 6.2. DIAGNOSIS RESULTS FOR ROUND 1 OF THE EXPERIMENTS: TWELVE STUCK-AT FAULTS .......... 87
TABLE 6.3. DIAGNOSIS RESULTS FOR ROUND 2 OF THE EXPERIMENTS: NINE BRIDGING FAULTS .......... 88
TABLE 6.4. DIAGNOSIS RESULTS FOR ROUND 3 OF THE EXPERIMENTS: FOUR OPEN FAULTS .......... 88
TABLE 6.5. DIAGNOSIS RESULTS FOR TI FIB EXPERIMENTS: 2 STUCK-AT FAULTS, 14 BRIDGES .......... 90
TABLE 7.1. RESULTS ON SEMATECH DEFECTS .......... 109
TABLE 8.1. SIZE OF TOP-RANKED CANDIDATE SET (IN FAULTS) AND TOTAL NUMBER OF SIGNATURE BITS .......... 113
TABLE 8.2. SIZE OF TOP-RANKED CANDIDATE SET (IN FAULTS) AND TOTAL NUMBER OF SIGNATURE BITS .......... 117
TABLE 8.3. OUTPUT-COMPACTED SIGNATURE SIZES ADJUSTED FOR REPEATED OUTPUT SIGNATURES .......... 118
TABLE 8.4. SUCCESS RATE FOR BRIDGING FAULT DIAGNOSIS USING STUCK-AT FAULT CANDIDATES .......... 119
TABLE 8.5. TOP-RANKED CANDIDATE SET SIZE AND SIGNATURE BITS FOR PASS-FAIL AND OUTPUT-COMPACTED (ALONE) SIGNATURES .......... 120
TABLE 8.6. DIAGNOSTIC RESULTS WHEN OUTPUT-COMPACTED SIGNATURES ARE CLUSTERED DOWN TO 1000 BITS EACH .......... 123
TABLE 8.7. DIAGNOSTIC RESULTS FOR CLUSTERING (PF+OC) SIGNATURES DOWN TO 100 BITS TOTAL .......... 125

Abstract

Comprehensive Fault Diagnosis of Combinational Circuits

by

David B. Lavo

Determining the source of failure in a defective circuit is an important but difficult task. Important, since finding and fixing the root cause of defects can lead to increased product quality and greater product profitability; difficult, because the number of locations and variety of mechanisms whereby a modern circuit can fail are increasing dramatically with each new generation of circuits.

This thesis presents a method for diagnosing faults in combinational VLSI circuits. While it consists of several distinct stages and specializations, this method is designed to be consistent with three main principles: practicality, probability, and precision. The proposed approach is practical, as it uses relatively simple modeling and algorithms, and limited computation, to enable diagnosis in even very large circuits. It is also probabilistic, imposing a probability-based framework to resist the inherent noise and uncertainty of fault diagnosis, and to allow the combined use of multiple fault models, algorithms, and data sets towards a single diagnostic result. Finally, it is precise, using an iterative approach to move from simple and abstract fault models to complex and specific fault behaviors.

The diagnosis system is designed to address both the initial stage of diagnosis, when nothing is known about the number or types of faults present, and end-stage diagnosis, in which multiple arbitrarily-specific fault models are applied to reach a desired level of diagnostic precision. It deals with both logic fails and quiescent current (IDDQ) test failures. Finally, this thesis addresses the problem of data size in dictionary-based diagnosis, and in doing so introduces the new concept of low-resolution fault diagnosis.
Acknowledgements

Among the people who have contributed to this work, I would first like to thank my co-authors on various publications: Ismed Hartanto, Brian Chess, Tracy Larrabee, Joel Ferguson, Jon Colburn, Jayashree Saxena, and Ken Butler. Their contributions to this work, both in its exposition and execution, have been invaluable.

I would also like to thank those people who have taken the time to provide advice, guidance, and insight into the issues involved in this research. These people include Rob Aitken, David Helmbold, Haluk Konuk, Phil Nigh, Eric Thorne, Doug Williams, Paul Imthurn, and John Bruschi.

And while they have already been mentioned, two people deserve special acknowledgement for their remarkable dedication to seeing this work completed. The first is Tracy Larrabee, my advisor, who managed to provide both the constant encouragement and the extraordinary patience that this research required. The other is Rob Aitken, who believed enough in the work to encourage and sponsor it, in a variety of ways, throughout the many years it took to complete.

While many people have believed in this work, and given their time and support to help me complete it, no one has believed as strongly, helped so much, or is owed as much as my wife, Elizabeth. I am very happy to have completed this work, and even happier to be able to dedicate this dissertation to her.

Chapter 1. Introduction

Ensuring the high quality of integrated circuits is important for many reasons, including high production yield, confidence in fault-free circuit operation, and the reliability of delivered parts. Rigorous testing of circuits can prevent the shipment of defective parts, but improving the production quality of a circuit depends upon effective failure analysis, the process of determining the cause of detected failures. Discovering the cause of failures in a circuit can often lead to improvements in circuit design or manufacturing process, with the subsequent production of higher-quality integrated circuits.

Motivating the quest for improving quality, as with many research efforts, is bottom-line economics. A better quality production process means higher yield and more usable (or sellable) die per the same wafer cost. Fewer defective chips means lower assembly costs (more assembled boards and products actually work) and lower costs associated with repair or scrap. And, a better quality chip or product means a more satisfied customer and a greater assurance of future business. Failure analysis is therefore an essential tool for improving both quality and profitability.

A useful if somewhat strained analogy to the process of failure analysis is its similarity to criminal detective work: given the evidence of circuit failure, determine the cause of the failure, identifying a node or region that is the source of error. In addition to location, it is useful to identify the mechanism of failure, such as an unintentional short or open, so that remediating changes can be considered in the design or manufacturing process.

Historically, failure analysis has been a physical process; a surprising number of present-day failure analysis teams still use only physical methods to investigate chip failures.
The stereotypical failure analysis lab is a team of hard-boiled engineers physically and aggressively interrogating the failing part, using scanning electron microscopes, particle beams, infrared sensors, liquid crystal films, and a variety of other high-tech and high-cost techniques to eventually force a confession out of the silicon scofflaw. The final result, if successful, is the identification of the actual cause of failure for the circuit, along with the requisite gory “crime scene” photograph of the defective region itself: an errant particle, missing or extra conductor, a disconnected via, and so on.

The sweaty, smoke-filled scene of the failure analysis lab is only part of the story, however, and is usually referred to as root-cause identification. Given the enormous number of circuit devices in modern ICs, and the number of layers in most complex circuits, physical interrogation cannot hope to succeed without first having a reasonable list of suspect locations. Conducting a physical root-cause examination on an entire defective chip is akin to having to conduct a house-to-house search of an entire metropolis, in which every member of the populace is a possible suspect.

It is the job of the other part of failure analysis, usually called fault diagnosis, to do the logical detective work. Based on the data available about the failing part, the purpose of fault diagnosis is to produce an evaluation of the failing chip and a list of likely defect sites or regions. A lot is riding on this initial footwork: if the diagnosis is either inaccurate or imprecise (identifying either incorrect or excessively many fault candidates, respectively), the process of physical fault location will be hampered, resulting in the waste of considerable amounts of time and effort.

Previously-proposed strategies for VLSI fault diagnosis have suffered from a variety of self-imposed limitations. Some techniques are limited to a specific fault model, and many will fail in the face of any unmodeled behavior or unexpected data. Others apply ad hoc or arbitrary scoring mechanisms to rate fault candidates, making the results difficult to interpret or to compare with the results from other algorithms. This thesis presents an approach to fault diagnosis that is robust, comprehensive, extendable, and practical. By introducing a probabilistic framework for diagnostic prediction, it is designed to incorporate disparate diagnostic algorithms, different sets of data, and a mixture of fault models into a single diagnostic result.

The fundamental aspects of fault diagnosis will be discussed in Chapter 2, including fault models, fault signatures, and diagnostic algorithms. Chapter 3 indulges in an examination of the issues inherent in fault diagnosis, and presents a philosophy of diagnosis that will guide the balance of the work. Chapter 4 presents the first stage of the proposed diagnostic approach, which handles the initial condition of indeterminate fault behaviors. Chapter 5 discusses the second stage of diagnosis, in which likely fault models are inferred from the first-stage results. Chapter 6 digresses to a discussion of inexpensive bridging fault models, and introduces the third stage of diagnosis, in which multiple fault models are applied to refine the diagnostic result. Chapter 7 extends the diagnosis system to the topic of IDDQ failures, and Chapter 8 addresses the issue of small fault dictionaries. Chapter 9 presents the conclusions from this research and discusses areas of further work.
Chapter 2. Background

Here is the problem of fault diagnosis in a nutshell: a circuit has failed one or more tests applied to it; from this failing information, determine what has gone wrong. The evidence usually consists of a description of the tests applied, and the pass-fail results of those tests. In addition, more detailed per-test failing information may be provided. The purpose of fault diagnosis is to logically analyze whatever information exists about the failures and produce a list of likely fault candidates. These candidates may be logical nodes of the circuit, physical locations, defect scenarios (such as shorted or open signal lines), or some combination thereof.

This chapter will give the background of the problem of fault diagnosis. It starts with a description of the types of circuits that will and will not be addressed by the diagnosis methods described in this thesis. It will explain the types of data that make up the raw materials of the diagnosis process, and then introduce the abstractions of defective behavior known as fault models. Finally, it will present the various algorithms and approaches that previous researchers have proposed for various instances of the fault diagnosis problem.

2.1 Types of Circuits

This thesis will only address the problem of fault diagnosis in combinational logic. While nearly all large-scale modern circuits are sequential, meaning they contain state-holding elements, most are tested in a way that transforms their operation under test from sequential to combinational. This is usually accomplished by implementing scan-based test [AbrBre90], in which all state-holding flip-flops in the circuit are modified so that they can be controlled and observed by shifting data through one or more scan chains. During scan tests, input data is scanned into the flip-flops via the scan chains and other input data is applied to the input pins (or primary inputs) of the circuit. Once these inputs are applied and the circuit has stabilized its response (now fully combinational), the circuit is clocked to capture the results back into the flip-flops, and the data values at the output pins (or primary outputs) of the circuit are recorded. The combination of values at the output pins and the values scanned out of the flip-flops makes up the response of the circuit to the test, and these values are compared to the expected response of a good circuit. If there is a mismatch for any test, the circuit is considered defective, and the process of fault diagnosis can begin.

This thesis will not address the diagnosis of failures during tests that consist of multiple clock cycles and therefore involve sequential circuit behavior. So-called functional tests fall under this domain, and are extremely difficult to diagnose due to the mounting complexity of defective behavior under multiple sequential time frames. Another sequential circuit type that is not addressed here is that of memories such as RAMs and ROMs. Unlike the “random” logic of logic gates and flip-flops, however, the “structured” nature of memories makes them especially amenable to simple fault diagnosis. It is usually a simple process to control and observe any word or bit in most memories to determine the location of test failure.

2.2 Diagnostic Data

Part of the data that is involved in fault diagnosis, at least for scan tests, has already been introduced: namely, the input values applied at the circuit input pins and scanned into the flip-flops.
The input data for each scan operation, including values driven at input pins, is referred to as the input pattern or test vector. The operation of scanning and applying an input to the circuit and recording its output response is formally called a test¹, and a collection of tests designed to exercise all or part of the circuit is called a test set. This information, along with the expected output values (determined by prior simulation of the circuit and test set), makes up the test program actually applied to the circuit.

¹ Traditional scan tests test only the function of a circuit, and usually only require a single input pattern and record a single combinational response. Tests that test the speed of a circuit, however, must create logic transitions in the circuit and so must apply pairs of input values, often by scanning two input patterns into the circuit. This type of test is still a single test and records a single response, and as such is commonly referred to as a “two-pattern test”.

The test program runs on a tester, which can handle either wafers or packaged die, and can apply tests and observe circuit responses. The tester records the actual responses measured at circuit outputs, and any differences between the observed responses and the expected responses are recorded in the tester data log. While it is not the usual default setting during production test, this thesis will assume that the data log information identifies all mismatched responses and not just the first failing response. It is usually a simple matter to re-program a tester from a default “stop-on-first-fail” mode to a diagnostic “record-all-fails” mode once a die or chip has been selected for failure analysis.

The response of a defective circuit to a test set is referred to as the observed faulty behavior, and its data representation is commonly known as a fault signature. For scan tests, the fault signature is usually represented in one of two common forms. The first, the pass-fail fault signature, reports the result for each test in the test set, whether a pass or a fail. Typically the fault signature consists either of the indices of the failing tests, or a bit vector for the entire test set in which the failing tests (by convention) are represented as 1s and the passing tests by 0s. Figure 2.1, below, gives an example of a fault signature for a simple example of 10 tests, out of which 4 failing tests are recorded.

    Results for 10 total tests:
      1: Pass   2: Pass   3: Pass   4: Pass   5: Fail
      6: Pass   7: Fail   8: Fail   9: Pass  10: Fail

    Pass-fail signatures:
      By index:      5, 7, 8, 10
      By bit vector: 0000101101

Figure 2.1. Example of pass-fail fault signatures.

The second type of fault signature is the full-response fault signature, which reports not only what tests failed but also at which outputs (flip-flops and primary outputs) the discrepancies were observed. As with test vectors, circuit outputs are usually indexed to facilitate identification. Figure 2.2 gives another simple example of indexed and bitmapped full-response fault signatures. Each failing vector number in the indexed signature is augmented with a list of failing outputs. In the bitmapped signature, a second dimension has been added for failing outputs.

    Indexed full-response signature:
      5: 2, 4
      7: 3, 4
      8: 7
      10: 2, 7

    Bitmapped full-response signature (rows are test vectors 1-10, columns are outputs 1-10):
      1:  0 0 0 0 0 0 0 0 0 0
      2:  0 0 0 0 0 0 0 0 0 0
      3:  0 0 0 0 0 0 0 0 0 0
      4:  0 0 0 0 0 0 0 0 0 0
      5:  0 1 0 1 0 0 0 0 0 0
      6:  0 0 0 0 0 0 0 0 0 0
      7:  0 0 1 1 0 0 0 0 0 0
      8:  0 0 0 0 0 0 1 0 0 0
      9:  0 0 0 0 0 0 0 0 0 0
      10: 0 1 0 0 0 0 1 0 0 0

Figure 2.2. Example of indexed and bitmapped full-response fault signatures.
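To make the two signature encodings concrete, the short Python sketch below (an illustration added here, not code from the thesis; the variable names and data structures are assumptions) represents the example of Figures 2.1 and 2.2 and shows that the full-response form subsumes the pass-fail form.

    # Illustrative sketch only: one possible encoding of the signatures in
    # Figures 2.1 and 2.2. Names and structures are assumptions, not the
    # thesis's actual implementation.

    NUM_TESTS = 10
    NUM_OUTPUTS = 10

    # Pass-fail signature, by index and as a bit vector (1 = failing test).
    failing_tests = [5, 7, 8, 10]
    pass_fail_bits = ["1" if t in failing_tests else "0"
                      for t in range(1, NUM_TESTS + 1)]
    assert "".join(pass_fail_bits) == "0000101101"

    # Full-response signature, indexed form: failing test -> failing outputs.
    full_response = {5: [2, 4], 7: [3, 4], 8: [7], 10: [2, 7]}

    # Bitmapped form: a test-by-output matrix of 0s and 1s.
    bitmap = [[0] * NUM_OUTPUTS for _ in range(NUM_TESTS)]
    for test, outputs in full_response.items():
        for out in outputs:
            bitmap[test - 1][out - 1] = 1

    # A full-response signature can always be collapsed to a pass-fail one.
    derived_pass_fail = sorted(full_response.keys())
    assert derived_pass_fail == failing_tests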
Scan tests are only a single part of the suite of tests usually applied to a production chip. Another common type of test, called an IDDQ test, is to put the circuit in a non-switching or static state and measure the quiescent current draw. If an abnormally high current is measured, a defect is assumed to be the cause and the part is marked for scrap or failure analysis. The fault signature generated by an IDDQ test set can take one of two forms. The first is the same as the pass-fail signature introduced earlier for scan tests, in which either index numbers or bits are used to represent passing (normal or low IDDQ current) and failing (high current) tests. The second type of signature records an absolute current measurement for each IDDQ test in the form of a real number.

This thesis will address fault diagnosis for both scan and IDDQ tests, as these are the two major types of comprehensive tests performed on commercial circuits. Other tests, such as those for memories, pads, or analog blocks, cover a much more limited area and require more specialized (often manual) diagnostics. Functional test failures, as mentioned, are especially difficult to diagnose, but fortunately (at least for fault diagnosis) functional tests are gradually being eclipsed by scan-based tests. Diagnosis for Built-In Self-Test (BIST) [AbrBre90], in which on-chip circuitry is used to apply and capture test patterns, will not be directly addressed here. However, many of the diagnosis techniques presented in this thesis can be applied to BIST results if the data can be made available for off-chip processing. Finally, the issue of timing or speed test diagnostics will be addressed only briefly and remains a subject for further research.

2.3 Fault Models

The ultimate targets of both testing and diagnosis are physical defects. In the logical domain of testing and diagnostic algorithms, a defect is represented by an abstraction known as a logical fault, or simply fault. A description of the behavior and assumptions constituting a logical fault is referred to as a fault model. Test and diagnosis algorithms use a fault model to work with the entire set of fault instances in a target circuit.

The most popular fault model for both testing and diagnosis is the single stuck-at fault model, in which a node in the circuit is assumed to be unable to change its logic value. The stuck-at model is popular due to its simplicity, and because it has proved to be effective both in providing test coverage and in diagnosing a limited range of faulty behaviors [JacBis86]. As an abstract representation of a class of defects, the stuck-at fault is commonly used to represent the defect of a circuit node shorted to either power or ground. It is commonly used, however, to both detect and diagnose a wide range of other defect types, as will be seen in the rest of this thesis.

Perhaps the second most popular fault model is the bridging fault model. Used to represent an electrical short between signal lines, in its most common form the model describes a short between two gate outputs. Most bridging fault models ignore bridge resistance, and instead focus on the logical behavior of the fault. These models include the wired-OR bridging fault, in which a logic 1 on either bridged node results in the propagation of a logic 1 downstream from both nodes; the wired-AND bridging fault, which propagates a 0 if either node is 0; and the dominance bridging fault, in which one gate is much stronger than the other and is assumed to always drive its logic value onto the other bridged node.
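To make the behavior of these simple logical models concrete, the following Python sketch (an illustrative formulation, not taken from the thesis) evaluates a pair of bridged nodes a and b under the wired-AND, wired-OR, and dominance interpretations, alongside a plain stuck-at assignment.

    # Illustrative sketch only: how the simple logical fault models described
    # above assign values to two bridged nodes a and b. This is an assumed
    # formulation for exposition, not an implementation from the thesis.

    def stuck_at(driven_value, stuck_value):
        """A stuck-at node ignores its driven value and holds a constant."""
        return stuck_value

    def wired_and(a, b):
        """Wired-AND bridge: a 0 on either node drives both nodes to 0."""
        v = a & b
        return v, v

    def wired_or(a, b):
        """Wired-OR bridge: a 1 on either node drives both nodes to 1."""
        v = a | b
        return v, v

    def dominance(a, b):
        """Dominance bridge where a's driver is stronger: a is forced onto b."""
        return a, a

    if __name__ == "__main__":
        print("node stuck-at-0 regardless of driven value:", stuck_at(1, 0))
        for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
            print("driven", (a, b),
                  "wired-AND ->", wired_and(a, b),
                  "wired-OR ->", wired_or(a, b),
                  "a dominates b ->", dominance(a, b))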
Other bridging fault models have been developed of much greater sophistication [AckMil91, GrePat92, MaxAit93, Rot94, MonBru92], taking into account gate drive strengths, various bridge resistances, and even more than two bridged nodes, but they are not used as much due to their computational complexity during large-scale test generation or fault diagnosis.

Bridging fault models have become popular due to an increasing attention to defects in the interconnect of modern chips. Similarly, there has been a commensurate rise in interest in open fault models, which attempt to model electrical opens, breaks, and disconnected vias. Since opens can result in state-holding, intermittent, and pattern-dependent fault effects, these models have generally been more complex and less widely used for both testing and diagnosis.

Instead of interconnect faults, several fault models have concentrated on defects in logic gates and transistors. Among these are the transistor-stuck-on and transistor-stuck-off models, which are similar to conventional stuck-at faults. Various intra-gate short models have been proposed to model shorts between transistors in standard-cell logic gates. Many of these models have not enjoyed widespread success simply because the stuck-at model tends to work nearly as well for generating effective tests at much lower complexity.

Other fault models have been developed to represent timing-related defects, including the transition fault model and the path-delay fault model. The first assumes that a defect-induced delay is introduced at a single gate input or output, while the second spreads the total delay along a circuit path from input to output.

2.4 Fault Models vs. Algorithms: A Short Tangent into a Long Debate

The previous section briefly introduced a wide variety of fault models, from the simple and abstract stuck-at model to more complicated, specific, and realistic fault models. The stuck-at fault model has been generally dominant for several decades, and continues to be dominant today, both for its simplicity and its demonstrated utility. But the general trend, in the field of testing at least, has been a tentative shift away from sole reliance on the stuck-at model towards more realistic fault models that will facilitate the generation of better tests for more complicated defects. The question is, then, what models are best for fault diagnosis?

A paper by Aitken and Maxwell [AitMax95] identifies two main components to any fault diagnosis approach. The first is the choice of fault model, and the second is the algorithm used to apply the fault model to the diagnostic problem. As the authors explain, the effectiveness of a diagnostic technique will be compromised by the limitations of the fault model it employs. So, for example, a diagnosis tool that relies purely on the stuck-at fault model can never completely or correctly diagnose a signal-line short or open, simply because it is looking for one thing while another has occurred.
The authors go on to explain that the role of the diagnosis algorithm, then, has evolved to try to overcome the limitations of the chosen fault model. This will be illustrated in the next section of this chapter in an overview of previous diagnosis research; a common technique is to use the stuck-at model but adjust the algorithm to anticipate bridging-fault behaviors. But, the authors also opened a debate, which remains active to this day: is it better for a diagnosis technique to use more realistic fault models with a simple algorithm, or to use simple and abstract models with a more clever and robust algorithm?

As with any interesting debate, there are good arguments on both sides. The argument for simple fault models is that they are more practical to apply to large circuits and more flexible for a wide variety of defect behaviors. The argument for better models, taken by the authors in their original paper, is that good models are necessary for both diagnostic accuracy and precision. Simple models do not provide sufficient accuracy because defect behavior is often complex, more complex than even clever algorithms anticipate. They also do not result in sufficient precision because they do not provide enough specificity (e.g. “look for a short at this location”) to guide effective physical failure analysis.

This thesis will attempt to resolve this debate as it presents a new diagnostic approach. The next section outlines how previous researchers have addressed the diagnostic problem, and notes how each participant has taken their place in the model vs. algorithm debate.

2.5 Diagnostic Algorithms

This section will cover the diagnosis algorithms proposed by previous researchers, in a roughly chronological order. The general trend, as will become clear, has been from simple approaches that target simple defects, to more complex algorithms that try to address more complicated defect scenarios.

Diagnosis algorithms have traditionally been classified into two types, according to how they approach the problem. The first and by far the most popular approach is called cause-effect fault diagnosis [AbrBre90]. A cause-effect algorithm starts with a particular fault model (the “cause”), and compares the observed faulty behavior (the “effect”) to simulations of that fault in the circuit. A simulation of any fault instance produces a fault signature, or a list of all the test vectors and circuit outputs by which a fault is detected, and which can be in one of the signature formats described earlier. The process of cause-effect diagnosis is therefore one of comparing the signature of the observed faulty behavior with a set of simulated fault signatures, each representing a fault candidate. The resulting set of matches constitutes a diagnosis, with each algorithm specifying what is acceptable as a “match”. The main job of a cause-effect algorithm is to perform this matching between simulated candidate and observed behavior. The general historical trend has been from very simple or exact matching, where the defect is assumed to correspond very closely to the fault model, to more complicated matching and scoring schemes that attempt to deal with a range of defect types and unmodeled behavior. A cause-effect algorithm is characterized by the choice of a particular fault model before any analysis of the actual faulty behavior is performed. A cause-effect algorithm can further be classified as static, in which all fault simulation is done ahead of time and all fault signatures stored in a database called a fault dictionary; or, it can be dynamic, where simulations are performed only as needed.
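As an illustration of the static cause-effect approach, the following Python sketch (hypothetical fault names and signatures; not the thesis's implementation) builds a tiny fault dictionary of full-response signatures and applies the simplest possible match criterion, an exact match.

    # Illustrative sketch only: static cause-effect diagnosis with a fault
    # dictionary and the simplest possible match criterion (exact match).
    # The fault names and signatures here are made-up examples.

    # Each signature is the set of (test, output) pairs predicted to fail.
    fault_dictionary = {
        "net42 stuck-at-0": {(5, 2), (5, 4), (7, 3)},
        "net42 stuck-at-1": {(8, 7), (10, 2)},
        "net17 stuck-at-1": {(5, 2), (5, 4), (7, 3), (7, 4)},
    }

    observed = {(5, 2), (5, 4), (7, 3), (7, 4)}  # from the tester data log

    def diagnose_exact(observed, dictionary):
        """Return every candidate whose simulated signature exactly matches."""
        return [fault for fault, signature in dictionary.items()
                if signature == observed]

    print(diagnose_exact(observed, fault_dictionary))
    # ['net17 stuck-at-1'] -- only a perfect match is accepted; any unmodeled
    # behavior would leave this diagnosis empty, which motivates the scoring
    # schemes discussed in the following sections.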
The opposite approach, and the second classification of diagnosis algorithms, is called (not surprisingly) effect-cause fault diagnosis [AbrBre80, RajCox87]. These algorithms attempt the common-sense approach of starting from what has gone wrong on the circuit (the fault “effect”) and reasoning back through the logic to infer possible sources of failure (the “cause”). Most commonly the cause suggested by these algorithms is a logical location or area of the circuit under test, not necessarily a failure mechanism. Most effect-cause methods have taken the form of path-tracing algorithms. They use assumptions about the propagation and sensitization of candidate faults to traverse a circuit netlist, usually identifying a set of fault-free lines and thereby implicating other logic that is possibly faulty.

Effect-cause diagnosis methods have several advantages. First, they don't incur the often-significant overhead of simulating and storing the responses of a large set of faults. Second, they can be constructed to be general enough to handle, at least implicitly, the presence of multiple faults and diffuse fault behavior. This is an advantage over most other diagnosis strategies that rely heavily on a single-fault assumption.

The most common disadvantage of effect-cause diagnosis algorithms is significant inherent imprecision. Most are conservative in their inferences to avoid eliminating any candidate logic, but this usually leads to a large implicated area. Also, since a pure effect-cause algorithm doesn't use fault models, it necessarily cannot provide a candidate defect mechanism (such as a bridge or open) for consideration. In fact, while most effect-cause algorithms claim to be “fault-model-independent”, this is a difficult claim to justify. Existing effect-cause algorithms implicitly make assumptions about fault sensitization, propagation, or behavior that are impossible to distinguish from classic fault modeling. (Usually, the implicit model is the stuck-at fault model.) This is understandable: it is the job of a diagnosis algorithm to make inferences about the underlying defect, but it is difficult to do so without some assumptions about faulty behavior, which is in turn difficult to do so without some fault modeling.

The following sections present algorithms for VLSI diagnosis proposed by previous researchers, from the early 1980s to the present day. In general, the earliest algorithms have targeted solely stuck-at faults and associated simple defects, while the later and more sophisticated algorithms have used more detailed fault models and targeted more complicated defects.

2.5.1 Early Approaches and Stuck-at Diagnosis

Many early systems of VLSI diagnosis, such as Western Electric Company's DORA [AllErv92] and an early approach of Teradyne, Inc. [RatKea86], attempted to incorporate the concept of cause-effect diagnosis with a previous-generation physical method called guided-probe analysis. Guided-probe analysis employed a physical voltage probe and feedback from an analysis algorithm to intelligently select accessible circuit nodes for evaluation. The Teradyne and DORA techniques attempted to supplement the guided-probe analysis algorithm with information from stuck-at fault signatures.
Both systems used relatively advanced (for their time) matching algorithms. The DORA system used a nearness calculation that the authors describe as fuzzy match. The Teradyne system employed the concept of prediction penalties: the signature of a candidate fault is considered a prediction of some faulty behavior, made up of <output:vector> pairs. When matching with the actual observed behavior, the Teradyne algorithm scored a candidate fault by penalizing for each <output:vector> pair found in the stuck-at signature but not found in the observed behavior, and penalizing for each <output:vector> pair found in the observed behavior but not the stuck-at signature. These have commonly become known as misprediction and non-prediction penalties, respectively. A related Teradyne system [RicBow85] introduced the processing of possible-detects, or outputs in stuck-at signatures that have unknown logic values, into the matching process.

While other early and less-sophisticated algorithms applied stuck-at fault signatures directly, expecting exact matches to simulated behaviors, it became obvious to the testing community that most failures in CMOS circuits do not behave exactly like stuck-at faults. Stuck-at diagnosis algorithms responded by increasing the complexity and sophistication of their matching to account for these unmodeled effects. An algorithm proposed by Kunda [Kun93] ranked matches by the size of intersection between signature bits. This stress on minimum non-prediction (misprediction was not penalized) reflects an implicit assumption that unmodeled behavior generally leads to over-prediction: the algorithm does not expect the stuck-at model to be perfect, but any unmodeled behavior will cause fewer actual failures than predicted by simulation. This assumption likely arose from the intuitive expectation that most defects involve a single fault site with intermittent faulty behavior — a not uncommon scenario for many chips that have passed initial tests but failed scan tests, especially after burn-in or packaging. Most authors, however, do not make this assumption explicit or explore its consequences, and an unexamined preference for the fault candidate that “explains the most failures” (regardless of over-prediction) is common to many diagnosis algorithms.

A more balanced approach was proposed by De and Gunda [DeGun95], in which the user can supply relative weightings for misprediction and non-prediction. By modifying traditional scoring with these weightings, the algorithm assigns a quantitative ranking to each stuck-at fault. The authors claim that the method can be used to explicitly target defects that behave similar to but not exactly like the stuck-at model, such as some opens and multiple independent stuck-at faults, but it can diagnose bridging defects only implicitly (by user interpretation). This is perhaps the most general of the simple stuck-at algorithms and is unique for its ability to allow the user to adjust the assumptions about unmodeled behavior that other algorithms make implicitly.
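The penalty-based scoring introduced by the Teradyne system, and made user-tunable by De and Gunda, can be sketched in a few lines of Python (an illustrative reconstruction with made-up weights and signatures, not the published implementations).

    # Illustrative sketch only: scoring a stuck-at candidate by separately
    # penalizing mispredicted and non-predicted <output:vector> pairs, with
    # user-supplied weights in the spirit of De and Gunda. The weights and
    # signatures below are made-up examples.

    def penalty_score(candidate_sig, observed, w_mis=1.0, w_non=1.0):
        """Lower scores are better; 0 means a perfect match."""
        mispredicted = candidate_sig - observed    # predicted to fail, but passed
        non_predicted = observed - candidate_sig   # failed, but not predicted
        return w_mis * len(mispredicted) + w_non * len(non_predicted)

    observed = {(5, 2), (5, 4), (7, 3)}
    candidates = {
        "net42 stuck-at-0": {(5, 2), (5, 4), (7, 3), (9, 1)},   # over-predicts
        "net17 stuck-at-1": {(5, 2), (7, 3)},                    # under-predicts
    }

    # Weighting misprediction lightly tolerates intermittent (over-predicted)
    # behavior; weighting it heavily demands that every prediction be observed.
    for w_mis in (0.25, 2.0):
        ranked = sorted(candidates, key=lambda f:
                        penalty_score(candidates[f], observed, w_mis=w_mis))
        print("w_mis =", w_mis, "->", ranked)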
2.5.2 Waicukauski & Lindbloom

The algorithm developed by Waicukauski and Lindbloom (W&L) [WaiLin89] deserves its own subsection because it has been so pervasive and successful — the most popular commercial tool is based on this algorithm — and also because it introduced several techniques that other algorithms have since adopted.

The W&L algorithm relies solely on stuck-at fault assumptions and simulations, and as such can be best classified as a (dynamic) cause-effect algorithm. It does, however, use limited path-tracing to implicate portions of the circuit and reduce the number of simulations it performs, so it does borrow elements from effect-cause approaches.

The W&L algorithm uses a very simple scoring mechanism, relying mainly on exact matching. But, it performs this matching in an innovative way, by matching fault signatures on a per-test basis. Most fault diagnosis algorithms count the number of mismatched bits between the observed behavior and a candidate fault signature across the entire test set. Each bit is a <vector:output> pair, as in the Teradyne algorithm described earlier, and an intersection is performed between the set of bits in the observed behavior and the set in each candidate fault signature. In the W&L algorithm, by contrast, each test vector that actually fails on the tester is considered independently. For each failing test, the set of failing outputs is compared with each candidate fault; if a candidate predicts a fail for that test, and the outputs match exactly, then a “match” is declared. Each matching fault candidate is then simulated against the rest of the failing tests, and the candidate that matches the most failing tests (exactly) is retained. All of the matched test results for this candidate are removed from the observed faulty signature, and the process repeats until all failing tests are considered.

Note that this matching algorithm is really a greedy coverage algorithm over the set of failing tests. Since the tests are considered in order, the sequence in which the tests are examined could affect the contents of the final diagnosis when multiple candidates are required to match all of the tests. It should also be noted that the practice of removing test results as they are matched reflects a desire to address multiple simultaneous defects, as well as an assumption that the fault effects from such defects are non-interfering.

The algorithm also conducts a simple post-processing step, in which it classifies the diagnosis by examining the final candidate set. If the diagnosis consists of a single stuck-at fault (with any equivalent faults) that matches all failing tests, it then checks the tests that pass in the observed behavior. If all of these passing test results are also predicted by the stuck-at candidate, the diagnosis is classified as a “Class I” diagnosis, or an exact match with a single stuck-at fault. If the diagnosis consists of a single candidate that matches all failing tests but not all passing tests (i.e., there is some misprediction), then the diagnosis is classified as “Class II”. The authors explain that Class II diagnoses could indicate the presence of an open, an intermittent stuck-at defect, or a dominance bridging fault. Finally, a “Class III” diagnosis consists of multiple stuck-at candidates with possible mispredicted and non-predicted behaviors.

The two most interesting features of the W&L algorithm, the per-test approach and the post-processing analysis, will be discussed further in later sections of this thesis. Overall, the W&L algorithm is interesting not only because it is so commonly used, but also because it raises some interesting theoretical issues.
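The per-test, greedy-coverage flavor of matching just described can be sketched as follows (an illustrative reading of the published algorithm with made-up data; the actual tool also uses path-tracing and incremental simulation to limit its work).

    # Illustrative sketch only: per-test, greedy-coverage matching in the
    # style described above. Candidate signatures map each failing test to
    # the exact set of outputs predicted to fail; all data here is made up.

    def per_test_greedy_cover(observed, candidates):
        """observed: {test: frozenset(failing outputs)} from the tester.
        candidates: {fault: {test: frozenset(outputs)}} from simulation.
        Returns a list of (fault, tests explained) chosen greedily."""
        remaining = dict(observed)
        chosen = []
        while remaining:
            # A candidate "matches" a test only if its predicted failing
            # outputs for that test are exactly the observed failing outputs.
            best, best_tests = None, set()
            for fault, sig in candidates.items():
                matched = {t for t, outs in remaining.items()
                           if sig.get(t) == outs}
                if len(matched) > len(best_tests):
                    best, best_tests = fault, matched
            if best is None:
                break                      # some failing tests remain unexplained
            chosen.append((best, sorted(best_tests)))
            for t in best_tests:           # remove explained tests and repeat
                del remaining[t]
        return chosen

    observed = {5: frozenset({2, 4}), 7: frozenset({3, 4}), 8: frozenset({7})}
    candidates = {
        "f1": {5: frozenset({2, 4}), 7: frozenset({3, 4})},
        "f2": {8: frozenset({7})},
    }
    print(per_test_greedy_cover(observed, candidates))
    # [('f1', [5, 7]), ('f2', [8])] -- two faults together cover all failing tests.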
2.5.3 Stuck-At Path-Tracing Algorithms

The classic effect-cause algorithms are those that rely on path-tracing to implicate portions of the circuit. Examples of these are the approaches suggested by Abramovici and Breuer [AbrBre80] and Rajski and Cox [RajCox87]. While they claim fault-model-independence, these algorithms attempt to identify nodes in the circuit that can be demonstrated to change their logic values (or toggle) during the test set, which amounts to an implicit targeting of stuck-at faults. In fact, these algorithms maintain a stricter adherence to the stuck-at model than the cause-effect algorithms just described, as any intermittent stuck-at defect is not anticipated and would not be diagnosed correctly.

2.5.4 Bridging fault diagnosis

The first evolution of diagnosis algorithms away from the stuck-at model was when they started to address bridging faults explicitly. Some of the stuck-at diagnosis algorithms already presented claim to be able to occasionally diagnose bridging faults, but only fortuitously by addressing limited unmodeled behavior. Perhaps the simplest explicit bridging fault diagnosis algorithm is that proposed by Millman, McCluskey, and Acken (MMA) [MilMcC90], which was a direct transition from stuck-at faults to bridges. The authors introduced the idea of composite bridging-fault signatures, which are created by concatenating the four stuck-at fault signatures for the two bridged nodes. This was a novel way of creating fault signatures without relying on bridging fault simulation, which can be computationally expensive especially if electrical effects are considered. The underlying idea is that the actual behavior of a bridge, for any failing test vector, will be a subset of the behaviors predicted by the four related stuck-at faults. The matching algorithm used is simple subset matching: any candidate whose composite signature contains all the observed failing <vector:output> pairs is considered a match and appears in the final diagnosis.

A similar approach to the MMA algorithm was taken by Chakravarty and Gong [ChaGon93], whose algorithm did not explicitly create composite signatures but used a matching technique on combinations of stuck-at signatures to create the same result. Both of these bridging-fault diagnosis methods suffer from imprecision, however: the average diagnosis sizes for both are very large, consisting of hundreds or thousands of candidates. The performance of the MMA algorithm was improved significantly by Chess, Lavo, et al. [CheLav95], by classifying vectors in the composite signatures as stronger or weaker predictions of bridging fault behavior, and refining the match scoring appropriately. Other researchers have continued to use and extend the idea of (stuck-at based) composite signatures for various fault models [VenDru00].
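The composite-signature construction and its subset matching can be sketched as follows (illustrative only; the stuck-at signatures are invented, and a set union stands in for the concatenation described above).

    # Illustrative sketch only: building a composite signature for a bridge
    # between nodes x and y from the four single stuck-at signatures, then
    # applying MMA-style subset matching. All signatures here are made up.

    def composite_signature(stuck_at_sigs, x, y):
        """Combine the four stuck-at signatures of the two bridged nodes."""
        return (stuck_at_sigs[(x, 0)] | stuck_at_sigs[(x, 1)] |
                stuck_at_sigs[(y, 0)] | stuck_at_sigs[(y, 1)])

    def mma_match(observed, composite):
        """A bridge candidate matches if it predicts every observed failure."""
        return observed <= composite

    stuck_at_sigs = {
        ("x", 0): {(1, 3)}, ("x", 1): {(4, 2), (6, 5)},
        ("y", 0): {(2, 1)}, ("y", 1): {(6, 5), (9, 2)},
    }

    observed = {(4, 2), (6, 5)}                  # the bridge fails on a subset
    comp = composite_signature(stuck_at_sigs, "x", "y")
    print(mma_match(observed, comp))             # True: candidate is kept
    print(mma_match(observed | {(3, 3)}, comp))  # False: an unexplained failure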
Since the cost of simulating each of these realistic bridging faults can be expensive, especially if the simulation considers electrical effects, the overall time spent in fault simulation can be prohibitive. In addition, even the best bridging fault simulations may not reflect the behavior of actual shorts, requiring continual validation and refinement of the fault models [Ait95] and possibly the use of a more complex matching algorithm.

Bridging fault diagnosis in general is plagued by the so-called candidate selection problem: there are many more faults in a circuit than can be reasonably considered by any diagnosis algorithm. Even for two-line bridging faults, there are on the order of n² possible candidates. The Aitken and Maxwell approach got around this problem by considering only realistic bridging faults, but the analysis required for determining the set of realistic faults can itself be impractical. Other methods have been suggested, including one by Lavo et al. [LavChe97] that used a two-stage diagnosis approach, the first stage to identify likely bridges and the second stage to directly diagnose the bridging fault candidates. This thesis will explore the candidate selection problem in more detail in a subsequent chapter.

2.5.5 Delay fault diagnosis

Due to the increasing importance of timing-related defects in high-performance designs, researchers have proposed methods to diagnose timing defects with delay fault models. Because of its simplicity, the transition fault model, in which the excessive delay is lumped at one circuit node, has been preferred. Diagnosis with the path-delay fault model, which considers distributed delay along a path from circuit input to output, has been hampered by the candidate selection problem: there are an enormous number of paths through a modern circuit. An example of fault diagnosis using the path-delay fault model is the approach suggested by Girard et al. [GirLan92]. The authors use a method called critical path tracing [AbrMen84] to traverse backwards through the circuit from the failing outputs, implicating nodes that transition for each test. In this way it is similar to the effect-cause algorithms described in section 2.5.3, but its decisions at each node are determined by the transition fault model rather than the stuck-at fault model.

2.5.6 IDDQ diagnosis

Aside from logic levels and assertion timing data, researchers have applied information from other types of tests to diagnose defects. One source of such information is the amount of quiescent current drawn for certain test vectors, or IDDQ diagnosis. The vectors used for IDDQ diagnosis are designed to put the circuit in a static state, in which no logic transitions are occurring, so that a high amount of measured current draw will indicate the likely presence of a defect (such as a short to a power line). An advantage of IDDQ diagnosis is that the defects should have high observability: the measurable fault effects do not have to propagate through many levels of logic to be observed, but are instead measured at the supply pin. The issue of IDDQ observability is a complicated one, however, and will be discussed later in Chapter 7. Aitken presented a method of diagnosing faults when logic fails and IDDQ fails are measured simultaneously [Ait91], and he later generalized this approach to include fault models for intra-gate and inter-gate shorts [Ait92].
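The general flavor of this kind of IDDQ-based candidate screening can be sketched as follows, under the assumption of clean pass/fail IDDQ data (a limitation discussed next); this is an illustrative sketch, not a reproduction of Aitken's published method.

```python
# Illustrative sketch of pass/fail IDDQ candidate screening.
# Each candidate short predicts elevated IDDQ on the vectors that drive its
# two nodes to opposite logic values; predictions are assumed to come from
# logic simulation of the fault-free circuit.

def screen_iddq_candidates(failing_vectors, passing_vectors, predictions):
    """Keep candidates whose predicted elevated-IDDQ vectors cover every
    observed IDDQ failure and predict no elevated current on passing vectors."""
    survivors = []
    for candidate, predicted_fail in predictions.items():
        if failing_vectors <= predicted_fail and not (passing_vectors & predicted_fail):
            survivors.append(candidate)
    return survivors

# predictions: candidate bridge -> set of vectors with opposite values on its nodes
predictions = {
    ("n3", "n9"): {2, 5, 7},
    ("n4", "n9"): {2, 7},
}
print(screen_iddq_candidates({2, 7}, {1, 3, 5}, predictions))  # [('n4', 'n9')]
```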
The approach presented by Chakravarty and Liu examines the logic values applied to circuit nodes during failing tests, and attempts to identify pairs of nodes with opposite logic values as possible bridging fault sites [ChaLiu93]. All of these approaches, however, rely on IDDQ measurements that can be definitively classified as either a pass or a fail, which limits their application in some situations. This limitation is addressed by the application of current signatures [Bur89, GatMal96], in which relative measurements of current across the test set, rather than the absolute values of IDDQ, are used to infer the presence of a defect. A diagnosis approach suggested by Gattiker and Maly [GatMal97, GatMal98] attempts to use the presence of certain large differences between current measurements as a sign that certain types of defects are present. This concept was further extended by Thibeault [Thi97], who applied a maximum likelihood estimator to changes in IDDQ measurements to infer defective fault types. These approaches, while more robust, stress the implication of defect type rather than location; the algorithm I propose later in this thesis targets explicit fault instances or locations. It is possible that these two strategies could be combined to further improve resolution, a topic I discuss in Chapter 7.

2.5.7 Recent Approaches

A couple of recently-published papers have suggested diagnosis algorithms that attempt to target multiple defects or fault models. The first, called the POIROT algorithm [VenDru00], diagnoses test patterns one at a time, much like the Waicukauski and Lindbloom algorithm. In addition, it employs stuck-at signatures, composite bridging fault signatures, and composite signatures for open faults on nets with fanout. Its scoring method is rather rudimentary, especially when it compares the scores of different fault models, relying on an interpretation of Occam's Razor [Tor38] to prefer stuck-at candidates over bridging candidates, and bridging candidates over open faults.

Another algorithm, called SLAT [BarHea01], also uses a per-test diagnosis strategy, and attempts a coverage algorithm over the observed behavior using stuck-at signatures and only exact matching of failing outputs. In both of these ways it is very similar to the W&L algorithm. However, it modifies that algorithm by attempting to build multiple coverings, which it calls multiplets; each multiplet is a set of stuck-at faults that together explain all the perfectly-matched test patterns. Test results that don't match exactly, and passing patterns, are ignored.

Because they explicitly target multiple faults and complex fault behaviors, the SLAT and POIROT algorithms are interesting for application to an initial pass of fault diagnosis, when little is known about the underlying defects. These algorithms, in addition to W&L, will be discussed further in Chapter 4 of this thesis, which addresses initial-stage fault diagnosis.

2.5.8 Inductive Fault Analysis

The diagnosis techniques presented so far do not use physical layout information to diagnose faults. Intuitively, however, identifying a fault as the explanation for a defect has much to do with the relative likelihood of certain defects occurring in the actual circuit. Inductive Fault Analysis (IFA) [SheMal85] uses the circuit layout to determine the relative probabilities of individual physical faults in the fabricated circuit.
Inductive fault analysis uses the concept of a spot defect (or point defect), which is an area of extra or missing conducting material that creates an unintentional electrical short or break in a circuit. As these spot defects often result in bridge or open behaviors, inductive fault analysis can provide a fault diagnosis of sorts: an ordered list of physical faults (bridges or opens) that are likely to occur, in which the order is defined by the relative probability of each associated fault. The relative probability of a fault is expressed as its weighted critical area (WCA), defined as the physical area of the layout that is sensitive to the introduction of a spot defect, multiplied by the defect density for that defect type. For example, two circuit nodes that run close to one another for a relatively long distance provide a large area for the introduction of a shorting point defect; the resulting large WCA value indicates that a bridging fault between these nodes is considered relatively likely.

One way that inductive fault analysis can be applied to fault diagnosis is through the creation of fault lists. Inductive fault analysis tools such as Carafe [JeeISTFA93, JeeVTS93] can provide a realistic fault list, useful for fault models such as the bridging fault model, in which the number of possible faults is intractable for most circuits. By limiting the candidates to only faults that can realistically occur in the fabricated circuit, a diagnosis can be obtained that is much more precise than one that results from consideration of all theoretical faults. Another possible way to use inductive fault analysis for diagnosis is presented in Chapter 6, in which IFA can provide the a-priori probabilities for a set of candidate faults. This is a generalization of the idea of creating fault lists, in which faults are not characterized as realistic or unrealistic, but instead are rated as more or less probable.

IFA has also been applied to the related field of yield analysis; a technique proposed by Ferguson and Yu [FerYu96] uses a combination of IFA and maximum likelihood estimation to perform a sort of statistical diagnosis on process monitor circuits. A similar combination of layout examination, statistical inference, and fault modeling will be applied to more traditional cause-effect fault diagnosis in Chapter 6 of this thesis.

2.5.9 System-Level Diagnosis

The area of system-level diagnosis, which deals with finding defective components in large-scale electronic systems, is outside the scope of this dissertation. However, some interesting work has been done in this area, which predates and often deals with issues very different from those of CMOS and VLSI diagnosis. The most comprehensive diagnosis approach has been developed by Simpson and Sheppard [SimShe94], who have presented a probabilistic approach for everything from determining the identity of failing subsystems to determining the optimal order of diagnostic tests. They have also suggested an approach for CMOS diagnosis using fault dictionaries [SheSim96]. Their methods apply the Dempster-Shafer method of analysis, which I will use extensively and discuss further in Chapter 4.

Chapter 3. A Deeper Understanding of the Problem: Developing a Fault Diagnosis Philosophy

The previous chapter presented some of the various ways that researchers have approached the problem of VLSI fault diagnosis.
These attempts have spanned a period of over 25 years, and a good deal of academic and industrial effort has gone into making fault diagnosis work in the real world. And yet, few if any academic diagnosis algorithms have made a successful transition into industrial use. The reasons for this lack of success are many, but chief among them is probably the disparity between academic assumptions about the problem and the real-world conditions of industrial failure analysis. This chapter will examine these assumptions in some detail and, by trying to rectify them, will present a philosophical framework for approaching the problem of fault diagnosis that will guide the rest of the research presented in this thesis.

3.1 The Nature of the Defect is Unknown

Several theoretical fault diagnosis systems have claimed great success in some variation of the following experiment: physically create or simulate a defect of a certain fault type, create some candidates of that fault type, and run the diagnosis algorithm to choose the correct candidate out of the list. While the accuracy of these success stories is indeed laudable, the result is a little like pulling a guilty culprit out of a police lineup: the job is made much easier if the choices are limited ahead of time.

It is an unfortunate fact of failure analysis, however, that what form a defect has taken, or what fault model could best represent the actual electrical phenomenon, is not known in advance. In the real world, a circuit simply fails some tests; it does not generally give any indication of what type of defect is present. While some algorithms have been proposed that attempt to infer a defect type from some behavior, most notably IDDQ information [GatMal97, GatMal98], these will not work on the most common failure type: there is generally little or no information about defect type that can be gleaned from standard scan failures. Acknowledging this lack of initial information leads to a basic principle of fault diagnosis, often ignored by academic researchers but obvious to industrial failure analysis engineers:

(i) A fault diagnosis algorithm should be designed with the assumption that the underlying defect mechanism is unknown.

Given this fact, it makes little sense to design a fault diagnosis algorithm that only works when the underlying defect is a certain type or class. Or, if an algorithm is targeted to one fault type, it should be designed so that an unmodeled fault will result in either explicit or obvious failure. This leads to the next principle:

(ii) A fault diagnosis algorithm should indicate the quality of its result.

This way, if a diagnosis algorithm does encounter some behavior that violates its basic assumptions, it can let the user know that these assumptions may have been wrong.

3.2 Fault Models are Hopelessly Unreliable

Many clever diagnosis algorithms have been proposed, using a variety of fault models, and all promise great success as long as one condition holds: nothing unexpected ever happens. These expectations come from the fault model used, the diagnostic algorithm, or both. So, if the modeled defect doesn't cause a circuit failure when expected, or if a failure occurs along an unanticipated path, the algorithm will either quit or get hopelessly off the track of the correct suspect. If the problem is defective fault models, then maybe the solution is to work very hard to perfect the models.
If the models were perfect, then diagnosis would reduce to a simple process of finding exactly the matching candidate for the observed behavior. But, once again, the cold hard world intrudes with the cold hard facts: fault model perfection is extremely difficult, and may very likely be impossible. Perhaps best documented are the problems inherent in bridging fault modeling: many simplified bridging fault models have been proposed, and each in turn has been demonstrated to be inadequate or inaccurate in one or more important respects [AckMil91, AckMil92, GrePat92, MaxAit93, MonBru92, Rot94]. Even the most complex and computationally intensive models can fail to account for the subtleties of defect characteristics and the vagaries of defective circuit behavior. And it is not only the complex models that are prone to error: even apparently simple predictions may be hard to make when real defects are involved [Ait95].

The unfortunate fact is that faulty circuits have the tendency to misbehave (they are faulty, after all) and often fail in ways not predicted by the best of fault simulators or the most carefully crafted fault models. The only answer is that any diagnostic technique that hopes to be effective on real-world defective circuits has to be robust enough to tolerate at least some level of noise and uncertainty. If not, the only certain thing about the process will be the resulting frustration of a sadly misguided engineer.

(iii) A fault diagnosis algorithm should make no inviolable assumptions regarding the defect behavior or its chosen fault model(s): fault models are only approximations.

3.3 Fault Models are Practically Indispensable

Given the well-documented limitations of fault models, several diagnosis algorithms have tried to minimize them or do away with them completely. Some, such as some effect-cause algorithms, claim to be "fault-model-independent". Others attempt to use the abstract nature of the stuck-at fault model to avoid the messy and unreliable aspects of realistic fault models.

While the idea behind these approaches has merit, abstract diagnosis is not enough for real-world failure analysis. The majority of fault diagnosis algorithms that address complex defects use the stuck-at fault model to get "close enough" to the actual defect behavior to enable physical mapping. But using the stuck-at model alone results in some well-characterized problems in both accuracy and precision. For example, even a robust stuck-at diagnosis may identify one of two shorted nodes only 60% to 90% of the time [AitMax95, LavChe97]. For situations in which a 10% to 40% failure rate is unacceptable, or such partial answers (single-node explanations) are inadequate, stuck-at diagnosis alone is not the answer.

The use of the stuck-at model is typical of a common answer to the problem of unreliable fault models: use an abstract model that makes as few assumptions as possible. But, while this approach has historically worked for testing, it is not likely to work for fault diagnosis. Generally speaking, fault models have proved their utility for test generation. If, for example, a test is generated to detect the (abstract) situation of a circuit node stuck-at 0, there is considerable evidence to suggest that the test will, in the process, detect a wide range of related defects: the node shorted to ground, perhaps, or a missing conductor to a pull-up network, or even a floating node held low from a capacitive effect.
When testing a circuit for defects, the actual relation of fault model to defect is less important than whether the defect is caught or not. But what does it mean, in the world of fault diagnosis, to explain the actual failures of a circuit with an abstract fault model? Try as one might, no failure analysis engineer is ever going to find a single stuck-at fault under the microscope; a stuck-at fault, strictly defined, is not a specific explanation, but is instead a useful fiction.

For fault diagnosis, the issue is one of resolution: the more abstract the model used, the less well the fault candidates in the final diagnosis will map to actual defects in the silicon. A stuck-at candidate, for example, may implicate a range of mechanisms or defect scenarios involving the specified stuck-at node, and the failure analysis engineer must account for this poor resolution by performing some amount of mapping to actual circuit elements. The more specific the fault model, the better the correspondence to actual defects, and the less mapping work is required: a sophisticated bridging fault candidate, with specific electrical characteristics, will usually resolve to either a single defect scenario or a few.

(iv) A more specific fault model is always preferable for diagnosis.

This is exactly the point made by Aitken and Maxwell [AitMax95], who pointed out the perils of using abstract fault models for complex defect behaviors. While accuracy may be the most important quality of a diagnosis algorithm, the precision of a diagnosis tool is what makes it truly useful for failure analysis.

3.4 With Fault Models, More is Better

The conflicting principles of unknown fault origins and the desirability of specific fault models lead to a dilemma. If a diagnosis algorithm can make no assumptions about the nature of the underlying defect, how can it apply a specific or detailed fault model to the problem? The answer, as with many things in life, is that more is better. Since no one fault model will ever provide both the accuracy and precision required from useful fault diagnosis, the best approach is to apply as many different fault models to the problem as possible. In this way, a wide range of possible defects can be handled with the highest possible precision for the failure analysis engineer.

(v) The more fault models used or considered during fault diagnosis, the greater the potential for precision, accuracy, and robustness.

So, perhaps a stuck-at diagnosis, a bridging diagnosis, and a delay fault diagnosis or two could be performed, and the results from this mix of algorithms examined. But apart from the time and work required, a problem remains in reconciling the different results: how can one compare the top candidates from, for example, a stuck-at fault diagnosis algorithm to the top bridging candidates from a completely different algorithm? Many diagnosis techniques employ unique scoring mechanisms to rate their candidates, and even when common techniques are used, such as Hamming distance, they are often applied in different ways or to different data: a "1-bit difference" may mean something very different for a stuck-at candidate than for an IDDQ candidate. It is essential, then, that a diagnosis algorithm present its results in a way that enables comparison to the results of other diagnosis algorithms. A diagnosis engineer will get the best result possible by leveraging the efforts of many algorithms and different modeling, but only if these efforts can be effectively combined.
(vi) A fault diagnosis algorithm should produce diagnoses that allow comparison or combination with the results from other diagnosis algorithms.

3.5 Every Piece of Data is Valuable

The concept of "more is better" regarding fault models applies equally well to information: the more data that is applied to the problem of fault diagnosis, generally the higher the quality of the eventual result. This is especially true of sets of data from different sources or types of tests, such as using results from both scan and IDDQ tests. It can often be the case that IDDQ information, for example, can differentiate fault candidates that are essentially equivalent under voltage tests [GatMal97, GatMal98]. Therefore, the process of diagnosis should be inclusive, using every available source of information to improve the final diagnosis.

(vii) A diagnosis algorithm or set of algorithms should use every available bit of data about the defect in producing or refining a diagnosis.

3.6 Every Piece of Data is Possibly Bad

There is one problem with the "use all data" rule: any or all of the data might be unreliable, misleading, or downright corrupt. Data in the failure analysis problem is inherently noisy. As mentioned, simulations and fault models are only imperfect approximations. The failure data from the tester may not be completely reliable, and often results are not repeatable, especially for IDDQ measurements. The data files may be compressed with some data loss, and with the size and complexity of netlists and test programs, it's always possible that some part of the test results or a simulation is missing or incorrect. In general, then, any diagnosis algorithm that hopes to be successful in the real (messy) world needs to be robust enough to handle some data error.

(viii) A diagnosis algorithm should not make any irreversible decisions based on any single piece of data.

3.7 Accuracy Should be Assumed, but Precision Should be Accumulated

The prime directive of a diagnosis algorithm is to be as accurate as possible, even at the cost of precision. It is far better to give a large answer, or even no answer, than to give a wrong or misleading one. A large or imprecise diagnosis can always be refined, but an inaccurate one will lead to physical de-processing of the wrong part of a chip, with the possible destruction of the actual defect site.

(ix) Accuracy is the most important feature of a diagnosis algorithm; a large or even empty answer is preferable to the wrong answer.

But a diagnosis methodology should be designed so that iterative applications of new data or different algorithms successively increase the precision and improve the diagnosis. Each step, however, needs to ensure that the accuracy of previous stages is not compromised or lost.

(x) Diagnosis algorithms should be designed so that successive stages or applications increase the precision of the answer, with a minimal sacrifice of accuracy.

3.8 Be Practical

Over the years there have been many diagnosis algorithms proposed, but the computational or data requirements of many of them immediately disqualify them for application to modern circuits. For instance, simulating a sophisticated fault model across an entire netlist of millions of logic gates is usually not feasible. Neither is considering all n² possible two-line bridging faults. If an algorithm does require sophisticated fault modeling, however, it may still have application on a much-reduced fault list resulting from a previously-obtained diagnosis.
The trade-off in such a case is that the precision promised by such an algorithm may be worth the initial work to reduce the candidate space.

(xi) A diagnosis algorithm should have realistic and reasonable resource requirements, with high-resource algorithms reserved for high-precision diagnoses on a limited fault space.

Chapter 4. First Stage Fault Diagnosis: Model-Independent Diagnosis

Fault diagnosis, especially in its initial stage, can be a daunting task. Not only does the failure analysis engineer not know what kind of defect he is dealing with, but there may in fact be multiple separate defects, any number of which may interfere with each other to modify expected fault behaviors. The defect behavior may be intermittent or difficult to reproduce. Also, the size of the circuit may make application of all but the simplest diagnosis algorithms impractical.

Given these facts, a long-lived staple of fault diagnosis research has apparently outlived its usefulness. The single fault assumption – that there is one defect in the circuit under diagnosis that can be modeled by a single instance of a particular fault model – may not apply for modern fault diagnosis. While it has simplified many diagnostic approaches, some of which have worked quite well despite real-world violations of the premise, the single fault assumption has led to problems with two common defect types: multiple faults, and complex faults. As defined here, complex faults are faults in which the fault behavior involves several circuit nodes, involves multiple erroneous logic values, is pattern-dependent, or is otherwise intermittent or unpredictable.

Traditionally, the single fault assumption has led to the expectation of a certain internal consistency, or some dependence between the test results, with regard to defective circuit behavior. In cause-effect diagnosis, a fault model is selected beforehand, and the observed faulty behavior is compared, as a single collection of failing patterns and outputs, to fault signatures obtained by simulation. In effect-cause diagnosis, many algorithms look for test results that prove that certain nodes in the circuit are able to toggle, and are therefore fault-free throughout the rest of the test set. In either case, the assumption has been that individual test results are not independent, but are rather wholly determined by the presence of the single unknown defect.

From the beginning, however, a few diagnosis techniques eschewed the single fault assumption, especially those that directly addressed multiple faults. These approaches, either implicitly or explicitly, forsake inter-test dependence and instead consider each test independently. The advantage of such approaches is that pattern-dependent and intermittent faults can still be identified, as can the component faults of complex defects. The drawback is that a conclusion drawn about the defect from one test cannot be applied to any other test, and the net result is (in effect) a diagnosis for each test pattern. This can lead to large candidate sets that are difficult to understand and use, especially as guidance for physical failure analysis. Also, since these algorithms no longer implicate a single instance of a fault model, there is now the problem of constructing a plausible defect scenario to explain the observed behavior. This chapter will attempt to address these drawbacks by improving both the process and the product of per-test fault diagnosis.
First, the process will be improved by including more information to score candidates, and by paring down the candidate list to a manageable number. Second, the product will be improved by suggesting a way of interpreting the candidates to infer the most likely defect type. The result is a general-purpose approach to identifying likely sources of defective behavior in a circuit despite the complexity or unpredictability of the actual defects.

4.1 SLAT, STAT, and All That

While increasing in recent popularity, the idea of conducting fault diagnosis one test pattern at a time is a venerable one. Waicukauski and Lindbloom [WaiLin89], Eichelberger et al. [EicLin91], and, more recently, the POIROT [VenDru00] and SLAT [BarHea01] diagnostic systems all suggest or rely on per-test fault diagnosis to address multiple or complex faults. We can, without too much license, state the primary axiom of the one-test-at-a-time approach as follows:

For any single test, an exact match between the observed failures (at circuit outputs or flip-flops) and those predicted by a simulated fault is strong evidence that the fault is present in the circuit, if only during that test.

The underlying concept is uncontroversial, as it underpins both traditional fault diagnosis as well as scientific modeling and prediction: a match between model and observation supports the assumptions of the model or implicates the modeled cause. The difference here is that the traditional comparison of model to observed behavior is decomposed into comparisons on individual test vectors, with a stricter threshold of exact matching to produce stronger implications.

The statement that "the fault is present" should not be taken too broadly. It does not mean that the fault (or modeled defect) is physically present, or that any conclusions can be drawn about the defect in any circumstance other than the specific failing test. Applied most commonly to stuck-at faults, all that can be inferred from a match is that a particular node has the wrong value for a particular test. However, that node is not implicated as the source of any other failures, nor is it actually "stuck-at" any value at all, since there is no evidence that it doesn't toggle during other tests.

Note also that the axiom cannot claim that a match constitutes proof that a particular fault is present. A per-test diagnosis approach can be fooled by aliasing, when the fault effects from multiple or complex faults mimic the response from a simple stuck-at fault. This can happen, for instance, if the propagation from a fault site is altered by the presence of other simultaneous faults, or due to defect-induced behaviors such as the Byzantine General's effect downstream from bridged circuit nodes [AckMil91, LamSho80]. The probability of such aliasing is impossible to determine, given the variety of ways in which it could occur. Per-test diagnosis approaches rely on the assumption that this probability is small, and on the hope that, should aliasing implicate the wrong fault, this fault is not wholly unrelated to the actual defect and is therefore not completely misleading.

A secondary axiom, implicit in the W&L paper but stated in somewhat different terms in the SLAT paper, is the following:

There will be some tests during which the defect(s) to be diagnosed will behave as a single, simple fault, which will, by application of the primary axiom, implicate something about the defect(s).
What this axiom states is that, for any defective chip, there will be some tests for which the failing outputs will exactly match the predicted failing outputs of one or more simple (generally stuck-at) faults. This assertion relies on the observation that many complex defects will, for some applied tests, behave like stuck-at faults that are in some way related to the actual defect. For example, a bridging fault will occasionally behave, on some tests, just like a stuck-at fault on one of the bridged nodes. (Footnote 2: Note that this axiom is also the basis of the original MMA algorithm [MilMcC90], which used stuck-at faults to diagnose bridging faults; see Section 2.5.4 of this thesis.)

The way that a per-test fault diagnosis algorithm proceeds is to find these simple failing tests (referred to in the SLAT paper as SLAT patterns), and identify and collect the faults that match them. The candidate faults are arranged into sets of faults that cover all the matched tests. The SLAT authors call these collections of faults multiplets, a term adopted in this thesis. As a simple example, consider the following three tests, with the associated matching fault candidates:

    Test Number    Exactly-Matching Faults
    1              A
    2              B
    3              C, D, E

Figure 4.1: Simple per-test diagnosis example.

In this example, fault A is a match for test #1, which means that the predicted failing outputs for fault A on test #1 match exactly with the observed failing outputs for that test. Similarly, fault B matches on test #2, while for test #3 three faults match exactly: C, D, and E. The SLAT algorithm will build the following multiplets as a diagnosis: (A, B, C), (A, B, D), and (A, B, E). Each multiplet "explains", or covers, all of the simple failing test patterns. SLAT uses a simple recursive covering algorithm to traverse all covering sets smaller than a pre-set maximum size, and then only reports minimal-sized coverings (multiplets) in its final diagnosis. (A small sketch of this covering construction appears below.) For comparison, the W&L algorithm will report one set of faults – (A, B, C, D, E) – in its diagnosis on the above example, with a note that faults C, D, and E are equivalent explanations for test #3. The POIROT algorithm will produce the same results, with a score based on how many tests are explained by each fault (in this case, all faults would get the same score).

There are several advantages to the per-test fault diagnosis approach. First, it explicitly handles the pattern-dependence often seen with complex fault behaviors. It also explicitly targets multiple fault behaviors. And, by breaking up single stuck-at fault behaviors into their per-test components, it attempts to perform a model-independent or abstract fault diagnosis. (Since it still relies on stuck-at fault sensitization and propagation conditions, however, it cannot be considered truly fault-model-independent.) This sort of abstract fault diagnosis is just the thing for an initial, first-pass fault diagnosis when nothing is known about the actual defect(s) present.

This chapter will propose a new per-test algorithm. This algorithm is similar in style to the SLAT diagnosis technique, but is able to use more information and so produce a better, more quantified, diagnostic result. The SLAT technique is focused on determining fault locations, hence the name: "Single Location At a Time". The new approach will instead focus on the faults themselves, but will, like SLAT, diagnose test patterns one at a time.
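The covering construction just described can be sketched in a few lines. The snippet below enumerates the minimal covering multiplets for the Figure 4.1 example by choosing one matching fault per test; it is an illustrative simplification of SLAT's recursive covering search, not the published implementation, and would not scale to large test sets.

```python
from itertools import product

# matches maps each simple failing test to the faults that match it exactly
# (the Figure 4.1 example).
matches = {1: {"A"}, 2: {"B"}, 3: {"C", "D", "E"}}

def multiplets(matches):
    """Every set of faults that covers all simple failing tests,
    keeping only the minimal-sized covers (as SLAT reports)."""
    covers = {frozenset(choice) for choice in product(*matches.values())}
    smallest = min(len(c) for c in covers)
    return sorted(sorted(c) for c in covers if len(c) == smallest)

print(multiplets(matches))
# [['A', 'B', 'C'], ['A', 'B', 'D'], ['A', 'B', 'E']]
```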
Borrowing the nomenclature, however, we will refer to the process of per-test diagnosis as "STAT" – "Single Test At a Time". (Footnote 3: We will hereafter refer to the class of diagnosis algorithms that includes Waicukauski and Lindbloom, POIROT, SLAT, and the new iSTAT algorithm as "STAT", or "per-test", diagnosis algorithms.) For shorthand, the new algorithm will be called "iSTAT", for "improved STAT". Like SLAT, the iSTAT algorithm uses stuck-at faults to build multiplets, but differs from SLAT in two important ways. First, it uses a scoring mechanism to order multiplets and so narrow the resulting candidate set. Second, it can use the results from both passing and complex failing tests to improve the scoring of candidate fault sets.

4.2 Multiplet Scoring

The biggest problem with a STAT-based diagnosis is that, since each test is essentially an individual diagnosis, the number of candidates can become quite large. Specifically, the number of multiplets used to explain the entire set of failing patterns can be large, and each multiplet will itself be composed of multiple individual component faults. What is needed is a way to reduce the number of multiplets, or to score and rank the multiplets to indicate a preference between them. This section will introduce a method for scoring and ranking multiplets. It will also describe how to recover information from tests that don't fail exactly like a stuck-at fault, and from passing tests that don't fail at all.

4.3 Collecting and Diluting Evidence

The basic motivation of STAT-based approaches, as expressed in the first axiom above, is that an exact match between failing and predicted outputs on a single test is strong evidence for the fault. While this much seems reasonable, it seems just as obvious that the evidence provided by a failing test is diluted if there are many fault candidates that match. For instance, in the simple example given above, the evidence for fault A is much stronger than that for any of faults C, D, or E, simply because fault A is the only candidate (according to the axiom) that can explain the failures of test #1. The evidence provided by test #3 is just as significant as the evidence from test #1; it is just shared among three possible explanations. This division of evidence can also be illustrated by imagining failures on outputs with a lot of fan-in, or a defect in an area with many equivalent faults. While there will be a number of faults that match the failure exactly, the test results will not provide much compelling evidence to point to any particular fault instance.

The first way that iSTAT improves per-test diagnosis is to consider the weight of evidence pointing to individual faults, and to quantify and collect that evidence into multiplet scores. The mechanism that iSTAT uses to quantify diagnostic evidence is the Dempster-Shafer method of evidentiary reasoning.

4.4 "A Mathematical Theory of Evidence"

A means of quantitatively manipulating evidence was developed by Arthur Dempster in the 1960's, and refined by his student Glenn Shafer in 1976 [Sha76]. At its center is a generalization of the familiar Bayes rule of conditioning, also known simply as Bayes Rule:

\[
p(C_i \mid B) \;=\; \frac{p(C_i)\, p(B \mid C_i)}{p(B)} \;=\; \frac{p(C_i)\, p(B \mid C_i)}{\sum_{j=1}^{n} p(C_j)\, p(B \mid C_j)} \tag{1}
\]

In this formulation of Bayes Rule, B represents some phenomenon or observed behavior, and each Ci is a possible candidate explanation or cause for that behavior. The set of candidates is assumed to be mutually exclusive.
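As a small numerical instance of Equation (1), with invented numbers purely for illustration (two mutually exclusive candidates with equal priors, one of which predicts the observed behavior more strongly):

```python
# Tiny numerical instance of Bayes Rule (Equation 1); all values are invented.
priors      = {"C1": 0.5, "C2": 0.5}          # p(Ci)
likelihoods = {"C1": 0.9, "C2": 0.3}          # p(B | Ci)

p_B = sum(priors[c] * likelihoods[c] for c in priors)            # p(B) = 0.6
posteriors = {c: priors[c] * likelihoods[c] / p_B for c in priors}
print(posteriors)  # {'C1': 0.75, 'C2': 0.25} -> C1 preferred under Bayes decision theory
```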
Bayes Rule is commonly used for the purposes of statistical inference or prediction, which attempt to determine the most likely probability distribution or cause underlying a particular observed phenomenon. Bayes Rule uses the prior probability (or a-priori probability) p(Ci) of candidate Ci and the conditional probability of B given the candidate Ci to determine the posterior probability p(Ci | B) of candidate Ci given B. This posterior probability is central to Bayes decision theory, which states that the most likely candidate given a certain behavior is that for which

\[
p(C_i \mid B) \;\ge\; p(C_j \mid B) \quad \text{for all } j \ne i.
\]

When applied to the problem of fault diagnosis, Bayes decision theory can be used to determine the best fault candidate (Ci) given a particular observed behavior (B).

The Dempster-Shafer method was developed to address certain difficulties with Bayes Rule when it is applied to the conditions of epistemic probability, in which probability assignments are based on belief or personal judgement, rather than its usual application to aleatory probability, where probability values express the likelihood or frequency of outcomes determined by chance. The conditions of epistemic probability are familiar to most people: a person will assign a degree of belief to a proposition relative to the strength of evidence presented in its favor. There is an explicit and unavoidable role of judgement in such a process. It is possible or likely that no prior information or belief about the problem exists before the evidence is considered. Finally, there is a possibility that a judgement cannot be made, or belief will be reserved, in the case of ignorance or lack of evidence. The Dempster-Shafer method is designed with these considerations in mind.

It is best illustrated geometrically: the basic element of the Dempster-Shafer method is a belief function, which can be thought of as a division of a unit line segment into various probability assignments. Probability can be assigned either to individual possibilities (referred to as singletons) or to subsets of possibilities; the set of all singletons is represented by Θ. A probability assignment represents the support accorded to some singleton or subset based on a piece of evidence; in addition, an explicit degree of doubt or ignorance about the evidence can be assigned. The total of all probability assignments equals one; an example of such an assignment over the subsets A1 through Am is shown in Figure 4.2. (The "m1" notation can be thought of either as the probability "mass" or the "measure" accorded due to the first piece of evidence.)

Figure 4.2. An example belief function: a unit segment divided into the assignments m1(A1), m1(A2), ..., m1(Am), and m1(Θ).

The assignment m1(Θ) represents the degree of doubt regarding the evidence or the assignments, and represents probability not accorded to any singleton or subset. The introduction of a second piece of evidence results in the creation of a second belief function, with a new assignment of probabilities to a possibly-different set of elements:

Figure 4.3. Another belief function: a unit segment divided into the assignments m2(B1), m2(B2), ..., m2(Bn), and m2(Θ).

Dempster's rule of combination performs an orthogonal combination of these two belief functions. Geometrically, the two line segments are combined to produce a square, which represents the new total probability mass of the combination:

Figure 4.4. The combination of two belief functions: the two unit segments form the sides of a unit square, whose cells correspond to the intersections of the subsets Ai and Bj and carry the products m1(Ai)·m2(Bj).

The squares in Figure 4.4 represent the probability assigned to intersections of the subsets.
The total combined probability of a subset is the sum of all non-contradictory assignments to that subset. Note that Θ combined with any singleton or subset is not contradictory, and so such combinations are included in the summations. The actual final probability assigned to each subset is re-normalized by dividing by the total probability mass assigned to non-contradictory combinations:

\[
m(C) \;=\; \frac{\displaystyle\sum_{i,j \,:\, A_i \cap B_j = C} m_1(A_i)\, m_2(B_j)}{1 \;-\; \displaystyle\sum_{i,j \,:\, A_i \cap B_j = \varnothing} m_1(A_i)\, m_2(B_j)} \tag{2}
\]

4.5 Turning Evidence into Scored Multiplets

The iSTAT algorithm uses a relatively straightforward implementation of the Dempster-Shafer method for diagnostic scoring. Each failing test that is matched exactly by one or more fault candidates results in a belief function; each candidate is assigned an equal portion of the belief assigned by the test result. Also, some probability mass is reserved to account for the possibility of aliasing, discussed earlier in this chapter. Since an exact match on a test result is the strongest evidence implicating fault candidates, this reserved belief is small.

For this application, singletons are defined as vectors of n individual faults, each of which explains or matches one of the n simple failing tests. As an example, consider a circuit with two simple failing tests and three individual fault candidates: A, B, and C. Since a valid diagnostic explanation must cover both failing tests, the set Θ of possible singletons is {(A,A), (A,B), (A,C), (B,A), (B,B), (B,C), (C,A), (C,B), (C,C)}. If fault A is a match for the first test, the evidence provided by this match devolves on the subset {(A,A), (A,B), (A,C)}, which will be represented here by (A,θ). An example of using Dempster's rule of combination on these elements is shown in Figure 4.5 below. In this example, faults A and B both match on test 1, and faults A and C match on test 2. (In Dempster-Shafer terms, (A,θ) and (B,θ) are the focal elements of test #1, as are (θ,A) and (θ,C) for test #2.)

The Dempster-Shafer method provides two ways of calculating total belief given the probabilities computed according to Dempster's rule of combination. The first is termed belief (Bel), and the second is upper probability (P*):

\[
\mathrm{Bel}(A) \;=\; \sum_{B \subseteq A} m(B) \tag{3}
\]

\[
P^{*}(A) \;=\; 1 - \mathrm{Bel}(\bar{A}) \;=\; \sum_{B \cap A \ne \varnothing} m(B) \tag{4}
\]

To illustrate these calculations, the belief and upper probability assigned to the singleton (A,C) are:

Bel(A,C) = m(A,C)
P*(A,C) = m(A,C) + m(A,θ) + m(θ,C) + m(Θ)

                 m1(A,θ)    m1(B,θ)    m1(Θ)
    m2(θ,A)      m(A,A)     m(B,A)     m(θ,A)
    m2(θ,C)      m(A,C)     m(B,C)     m(θ,C)
    m2(Θ)        m(A,θ)     m(B,θ)     m(Θ)

Figure 4.5. Example showing the combination of faults: each cell corresponds to the intersection of the row and column subsets, and receives the product of their assignments.

These combinations of sets of faults resemble multiplets in the STAT sense, but not all Dempster-Shafer combinations qualify as valid multiplets. First, a multiplet must be complete, or contain a fault to match every simple failing test. Due to the way evidence is distributed by iSTAT (to individual faults), this implies that only singletons with non-zero belief assignments qualify as multiplet candidates. As an example, for the singleton (B,B) in Figure 4.5, P*((B,B)) = m(B,θ) + m(Θ), but Bel(B,B) = 0. Second, certain singletons are indistinguishable as multiplets; for example, the singletons (A,C) and (C,A) are equivalent to the multiplet (A,C). (Also note that the singleton (A,A) is simply (A) when represented, in the final diagnosis, as a multiplet.) According to these criteria, the valid multiplets after processing the two tests shown in Figure 4.5 are (A,A), (A,C), (A,B), and (B,C).
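The combination rule of Equation (2) is straightforward to implement. The sketch below applies it to the two tests of Figure 4.5; θ stands for a position with no commitment, the mass values are illustrative rather than the ones iSTAT actually uses (those appear in Section 4.6), and the function names are mine.

```python
from itertools import product

THETA = "θ"  # wildcard position: no commitment for that test

def intersect(x, y):
    """Position-wise intersection of two subsets; None if contradictory."""
    out = []
    for a, b in zip(x, y):
        if a == THETA:   out.append(b)
        elif b == THETA: out.append(a)
        elif a == b:     out.append(a)
        else:            return None          # contradictory assignment
    return tuple(out)

def combine(m1, m2):
    """Dempster's rule of combination (Equation 2)."""
    raw, conflict = {}, 0.0
    for (x, wx), (y, wy) in product(m1.items(), m2.items()):
        c = intersect(x, y)
        if c is None:
            conflict += wx * wy
        else:
            raw[c] = raw.get(c, 0.0) + wx * wy
    return {c: w / (1.0 - conflict) for c, w in raw.items()}

# Two tests, as in Figure 4.5: test 1 matches faults A and B, test 2 matches A and C.
m1 = {("A", THETA): 0.45, ("B", THETA): 0.45, (THETA, THETA): 0.10}
m2 = {(THETA, "A"): 0.45, (THETA, "C"): 0.45, (THETA, THETA): 0.10}
print(combine(m1, m2))   # masses for (A,A), (A,C), (B,A), (B,C), (A,θ), (θ,A), ...
```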
If a third simple failing test is processed, these four multiplets would constitute the focal elements that are combined with the evidence from the third test. The iSTAT algorithm uses the upper probability number of each multiplet as its new probability assignment, and these assignments are normalized, along with m(Θ), according to Equation 2 above. The advantage of reducing the focal elements to multiplets before each new test is that the size of the convolution stays practical even for a large number of tests. Using the normalized plausibility allows the calculations to retain relevant probability masses assigned to non-singleton combinations. (Footnote 4: Reducing the number of focal elements in this manner is referred to in Dempster-Shafer terminology as a coarsening of the frame of discernment; using upper probability assignments for the new frame is referred to as an outer reduction over the frame.) An example of processing a third test, matching with fault C, is shown in Figure 4.6 below.

                  m1,2(A,A,θ)  m1,2(A,C,θ)  m1,2(A,B,θ)  m1,2(B,C,θ)  m1,2(Θ)
    m3(θ,θ,C)     m(A,A,C)     m(A,C,C)     m(A,B,C)     m(B,C,C)     m(θ,θ,C)
    m3(Θ)         m(A,A,θ)     m(A,C,θ)     m(A,B,θ)     m(B,C,θ)     m(Θ)

Figure 4.6. A third test result is combined with the results from the previous example.

After the last simple failing test has been processed, the upper probability numbers for all qualifying multiplets are used as their respective scores. The iSTAT algorithm applies a final criterion to the multiplets, however: a multiplet must be non-redundant, which means it cannot contain faults in excess of those required to cover all of the simple failing tests. (This is an arbitrary criterion, but it is consistent with the conventions of other per-test, and traditional, diagnosis algorithms.) In Figure 4.6, the resulting multiplets are (A,C), (A,B,C), and (B,C), but multiplet (A,B,C) is marked as redundant and eliminated. The final upper probabilities are re-normalized to produce the actual scores for the remaining multiplets.

4.6 Matching Simple Failing Tests: An Example

A short example will illustrate the scoring process. Figure 4.7, below, presents some test-matching results.

    Test Number    Matching Faults
    1              A
    2              A, D
    3              B
    4              C, D

Figure 4.7. Example test results with matching faults.

The result of test #1 yields a belief function in which all evidence supports fault A. The amount of ignorance regarding this test result (whether fault A is really the cause of the behavior) is arbitrary but assumed to be small; the iSTAT algorithm uses the value m(Θ) = 0.01, so the support awarded to fault A for test 1 is m1(A) = 0.99. For test 2, the evidence supports both faults A and D, so the total belief is split between these faults: m2(A) = m2(D) = 0.495. A geometric representation of the combination of these belief functions is shown below. The proportion of area allotted to m1(Θ) is exaggerated in the figure for readability.

                       m1(A,θ) = 0.99    m1(Θ) = 0.01
    m2(θ,A) = 0.495    m(A,A)            m(θ,A)
    m2(θ,D) = 0.495    m(A,D)            m(θ,D)
    m2(Θ)   = 0.01     m(A,θ)            m(Θ)

Figure 4.8. Combination of evidence from the first two tests.

The calculation of combined probabilities is as follows:

P*(A,A) = m2(θ,A)·m1(A,θ) + m2(θ,A)·m1(Θ) + m2(Θ)·m1(A,θ) + m2(Θ)·m1(Θ)
        = (0.495)(0.99) + (0.495)(0.01) + (0.01)(0.99) + (0.01)(0.01) = 0.505

P*(A,D) = m2(θ,D)·m1(A,θ) + m2(θ,D)·m1(Θ) + m2(Θ)·m1(A,θ) + m2(Θ)·m1(Θ)
        = (0.495)(0.99) + (0.495)(0.01) + (0.01)(0.99) + (0.01)(0.01) = 0.505

m(Θ) = m2(Θ)·m1(Θ) = (0.01)(0.01) = 0.0001

As you can see, equal plausibility is given to (A,A) and (A,D).
Note that while the multiplet (A,D) is redundant, redundant combinations must be retained until all simple failing tests are processed. The re-normalized assignments then become m1,2(A,A) = m1,2(A,D) = 0.49995. After the application of the third test, the results of which match with fault B, the revised probabilities are:

P*(A,A,B) = (0.99)(0.49995) + (0.01)(0.49995) + (0.99)(0.0001) + (0.01)(0.0001) ≈ 0.500049
P*(A,D,B) ≈ 0.500049
m(Θ) = (0.0001)(0.01) = 0.000001

These assignments are then re-normalized to total 1.0. Finally, test #4 matches faults C and D, and the top combinations become:

P*(A,A,B,C) = (0.495 + 0.01)(0.4999995) + (0.495 + 0.01)(0.000001) ≈ 0.2525002525
P*(A,A,D,B) ≈ 0.2525002525
P*(A,D,B,C) ≈ 0.2525002525
P*(A,D,B,D) ≈ 0.2525002525
m(Θ) = (0.000001)(0.01) = 1×10⁻⁸

Since (A,A,D,B) and (A,D,B,D) are indistinguishable as multiplets, the multiplet (A,B,D) gets the sum of these probabilities. Since this is the final simple failing test, the redundant multiplet (A,B,C,D) is eliminated, and the resulting final multiplets and re-normalized probabilities are:

P*(A,B,D) = 0.505000505 / (0.505000505 + 0.2525002525 + 1×10⁻⁸) ≈ 0.666
P*(A,B,C) = 0.2525002525 / 0.7575007575 ≈ 0.333
m(Θ) ≈ 1×10⁻⁸

The same multiplets will be built by the SLAT algorithm, as they are the minimal covering sets for the observed failing tests. However, the iSTAT algorithm was designed to prefer multiplet (A,B,D) to multiplet (A,B,C), based on the intuitive notion that there exists more evidential support for fault D than fault C. The calculations above support this intuition, showing that the Dempster-Shafer method assigns twice the support to the multiplet containing fault D.

The application of this scoring alone makes the iSTAT algorithm preferable to other per-test diagnosis algorithms; all such algorithms produce essentially the same candidate faults, but by assigning a probability score to each candidate set iSTAT provides much more guidance in selecting candidates out of what can be large diagnoses. But there is more information that per-test approaches usually fail to consider and that can be applied to produce even better final diagnoses.

4.7 Matching Passing Tests

Most STAT-based algorithms completely ignore passing tests, probably because passing tests don't fit well with the basic axioms expressed earlier: it is difficult to infer a failure when no failure has occurred. But STAT algorithms will suffer a loss of resolution, especially when compared with traditional non-STAT algorithms, when dealing with some defects.

For example, consider an observed behavior that mimics a classic stuck-at fault. In such a case (which is surprisingly common, for power or ground shorts, signal-to-signal shorts, and for opens), a traditional diagnosis algorithm that matches both failing and passing tests will produce either a single fault candidate, or a list of faults that are behaviorally equivalent under the applied test set. But a per-test algorithm that ignores passing tests will produce the same equivalence list, plus all fault candidates whose fault signatures are supersets of the observed behavior. A possible scenario is shown in Figure 4.9, in which a fault that is difficult to sensitize (a stuck-at 1 on the NAND gate) is dominated by another fault that is easier to sensitize (a stuck-at 1 on the inverter).

Figure 4.9. A-sa-1 will likely fail on many more vectors than will B-sa-1.
The difference in behavior between these two stuck-at faults becomes most apparent when considering the tests that fault B-sa-1 passes but A-sa-1 doesn't. But a STAT algorithm, or any diagnosis algorithm, that ignores passing patterns may not be able to distinguish these faults, depending upon the tests applied. An even simpler example is a non-controlling stuck-at fault on a gate input (stuck-at-0 for an OR gate, or stuck-at-1 for an AND gate). Most STAT algorithms will implicate a stuck-at fault on the output of the gate as strongly as the (preferred) input fault, simply because the output fault explains all the failing test patterns. In any case, it is especially disappointing for STAT algorithms not to be able to perform as well as traditional algorithms on such "perfect", and common, behaviors as single stuck-at faults. To remedy this, the iSTAT algorithm must deal with passing tests.

The process of matching passing patterns is very similar to matching simple failing patterns: candidates that correctly predict a passing test will share in the belief assigned on the basis of that test. These belief values are combined according to Dempster's rule of combination, as with the failing tests. An important difference in dealing with passing tests is that only multiplets (candidate fault sets that explain all simple failing tests) are used as the focal elements of each probability assignment, not individual faults. The reason is that passing tests don't, according to the per-test axioms stated earlier, provide any evidence for individual faults. Rather, they only imply the lack of fault sensitization or unmodeled fault behavior.

It is difficult to infer much about the conditional probability of a set of faults given a passing test result. Obviously, if all of the component faults are predicted to pass on a particular passing test, then that result provides some evidence in support of that multiplet. If, however, some of the component faults of a multiplet predict failures for a passing test, it is possible that none of these faults were activated, or, if any such fault was sensitized, that none of its failures propagated to observable outputs. Either condition could occur due to interactions between multiple faults. The likelihood of interference with either sensitization or propagation is a difficult one to calculate, especially for larger multiplets. (Footnote 5: If per-test IDDQ pass-fail information were available, it would indicate whether a logical pass does indeed indicate the absence of a defect or not. On a test that passes scan tests but fails IDDQ, then, a multiplet that predicts one or more failures would not be subject to a scoring penalty.)

It seems reasonable to assume that the likelihood of no sensitization and propagation is proportional to the number of components in a multiplet that predict a pass for a given test. This means that, for each passing test, a multiplet will get an initial score, from a maximum of 1.0 (all faults predict a pass) to a minimum of 0.0 (all faults predict some failure). Then, this initial score is divided by the total score over all multiplets, so that the total belief accorded over all multiplets is equal to 1.0. Since the evidence provided by any passing test is relatively weak, any inference made from one is not strong, and so the degree of doubt or ignorance assigned to a passing test should be high. The iSTAT algorithm uses a value of m(Θ) = 0.5.
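A minimal sketch of this passing-test adjustment, assuming the proportionality rule and the m(Θ) = 0.5 doubt value just described (the function and variable names are illustrative):

```python
# Sketch of the passing-test belief assignment described above.
# Each multiplet gets an initial score equal to the fraction of its component
# faults that predict a pass on the passing test; scores are then normalized
# across multiplets and scaled by (1 - m_theta) to reserve explicit doubt.

def passing_test_belief(multiplets, predicts_pass, m_theta=0.5):
    """multiplets: dict name -> list of component faults;
    predicts_pass: dict fault -> True if that fault predicts a pass on this test."""
    raw = {name: sum(predicts_pass[f] for f in faults) / len(faults)
           for name, faults in multiplets.items()}
    total = sum(raw.values())
    if total == 0:                          # no multiplet gains support (assumed handling)
        return {"Theta": 1.0}
    belief = {name: (1.0 - m_theta) * score / total for name, score in raw.items()}
    belief["Theta"] = m_theta               # reserved doubt for this piece of evidence
    return belief

# Example: multiplet (A) predicts a failure on the passing test, while both
# faults of multiplet (B, C) predict a pass.
multiplets = {"(A)": ["A"], "(B,C)": ["B", "C"]}
predicts_pass = {"A": False, "B": True, "C": True}
print(passing_test_belief(multiplets, predicts_pass))
# {'(A)': 0.0, '(B,C)': 0.5, 'Theta': 0.5}
```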
The belief invested in each multiplet is therefore adjusted again, by multiplying by 0.5, to re-normalize the total belief (including m(Θ)) to 1.0.

4.8 Matching Complex Failures

The SLAT algorithm ignores any failing test pattern that doesn't match exactly with one or more candidate faults. If we refer to the easily-matched patterns as simple failing tests, then the question becomes what to do with the complex failing tests, or tests that don't match exactly with any stuck-at fault. The POIROT algorithm uses a greedy covering algorithm on such failing output sets, using individual faults to explain subsets of the failing outputs. In an example given in the POIROT paper, failures occur for the example circuit on outputs 1 to 5. The POIROT algorithm looks for the stuck-at fault that explains the most outputs (outputs 1, 2, and 3 for the first fault), and then looks for the fault that explains the remainder (outputs 4 and 5).

The iSTAT algorithm takes a different approach. First, as with passing patterns, only multiplets are considered when trying to match the failing outputs, and not individual stuck-at faults. Second, instead of trying to match subsets of the failing outputs, we attempt a much simpler and more conservative matching process, as explained below.

Determining which outputs are predicted to fail by a multiplet is not easy, because we have no way of knowing how the fault effects of the individual fault components will interact for any test vector. The fault effects of one activated fault could prevent the propagation to some outputs of a second activated fault. Or, one fault could prevent the sensitization of another fault completely, or cause another fault to become sensitized that normally would not. It is not practical to investigate all of the various permutations of these fault interactions for most multiplets, especially if electrical effects such as drive fights or variable logic thresholds are involved. So, iSTAT ignores these complications and instead chooses a conservative path of matching by combining all the failing outputs and then ignoring misprediction (or overprediction) of the observed failing outputs. For example, suppose the following faults are contained in the multiplet (A, B, C):

    Fault    Predicted Failing Outputs
    A        1, 5, 8
    B        2, 5
    C        2, 10

Figure 4.10. Example of constructing a set of possibly-failing outputs for a multiplet.

The total list of failing outputs for this multiplet is (1, 2, 5, 8, 10). A successful match, then, is a match with any subset of these outputs, such as (1), (2, 5, 10), (1, 10), and so on. A match with any subset is considered an "explanation" of the failures, but any non-subset, such as (1, 2, 6), is not. It is possible that fault interaction could cause such an unexpected propagation and therefore a mismatch, but iSTAT will tolerate this (assumed small) probability of error if it generally aids in ranking candidate multiplets.

This matching on complex failing tests results in either a success or a failure for each multiplet on each test. The degree of belief assigned to each matching multiplet is therefore 1.0 divided by the number of matching multiplets. As with passing tests, the evidence provided by a complex failing test is not perfect, so iSTAT assigns a degree of doubt m(Θ) = 0.1 and the belief assigned to individual matching multiplets is normalized by multiplying by 0.9.
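A minimal sketch of this conservative matching and belief assignment, assuming the m(Θ) = 0.1 doubt value stated above (the names and the handling of the no-match case are illustrative):

```python
# Sketch of iSTAT-style matching on a complex failing test.
# A multiplet "explains" the test if the observed failing outputs are a subset
# of the union of the failing outputs predicted by its component faults.

def complex_test_belief(multiplets, predicted_outputs, observed_fails, m_theta=0.1):
    """multiplets: dict name -> component faults;
    predicted_outputs: dict fault -> set of outputs predicted to fail."""
    union = lambda faults: set().union(*(predicted_outputs[f] for f in faults))
    matching = [name for name, faults in multiplets.items()
                if observed_fails <= union(faults)]
    if not matching:                        # no multiplet explains this test (assumed handling)
        return {"Theta": 1.0}
    share = (1.0 - m_theta) / len(matching)
    belief = {name: share for name in matching}
    belief["Theta"] = m_theta
    return belief

# Figure 4.10 example: multiplet (A,B,C) can explain any subset of {1, 2, 5, 8, 10}.
predicted_outputs = {"A": {1, 5, 8}, "B": {2, 5}, "C": {2, 10}}
multiplets = {"(A,B,C)": ["A", "B", "C"]}
print(complex_test_belief(multiplets, predicted_outputs, {2, 5, 10}))
# {'(A,B,C)': 0.9, 'Theta': 0.1}
print(complex_test_belief(multiplets, predicted_outputs, {1, 2, 6}))
# {'Theta': 1.0}  -> (1, 2, 6) is not a subset, so no match
```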
4.9 Size is an Issue

In addition to matching all the simple failing tests, the SLAT paper implicitly introduces another criterion for judging multiplets, namely by multiplet size: only minimal-sized multiplets are considered in the final diagnosis. As an example, consider the following set of vectors and matching faults:

Test Number      1    2    3    4
Matching Faults  A    A    B    B, C, D

Figure 4.11. Multiplets (A, B), (A, B, C) and (A, B, D) explain all test results, but (A, B) is smaller and so preferred.

A minimally-sized multiplet that covers all of the failing vectors is (A, B). But, it is also possible to cover the failing vectors with the multiplet (A, B, C) by choosing fault C to explain the failures on test #4. Similarly, another possible candidate is (A, B, D). Intuitively, (A, B) seems to be the best, and most likely, candidate due to the evidence for fault B from test #3. There is also the principle of Occam's Razor [Tor38], which states "Causes shall not be multiplied beyond necessity", or more commonly, "The simplest answer is best".6 The application of Occam's Razor therefore argues for choosing multiplets of minimal size. But, what happens when the scenario is not quite as simple? Consider the next example:

Test Number      1     2     3     4
Matching Faults  A, B  A, C  A, C  A, B

Figure 4.12. The choice of best multiplet is difficult if (A) predicts additional failures but (B, C) does not.

While iSTAT will build and score the multiplets (A) and (B, C), SLAT will only consider the multiplet (A). At first glance it would appear that multiplet (A) is an obviously better choice than (B, C). But suppose that fault A is also predicted to fail other tests that don't fail on the tester, while faults B and C are only predicted to fail on tests #1 through #4. We would then be faced with the choice of explaining the behavior with either an intermittent stuck-at fault (A), or a well-behaved pair of stuck-at faults (B, C). In such a case, a simplistic application of Occam's Razor may not work to slice out the best or simplest answer.

6 Or, less commonly, "Nunquam ponenda est pluralitas sine necessitate".

For the example above, the iSTAT algorithm will assign the following probabilities to each multiplet through test #4:

$P^*(A) = 0.499999961$, $P^*(B, C) = 0.499999961$, $m(\Theta) = 8 \times 10^{-8}$

If, however, test #5 is a passing test and faults B and C are both predicted to pass while fault A is predicted to fail, the multiplet probabilities are adjusted to the following:

$P^*(A) = 0.333$, $P^*(B, C) = 0.666$, $m(\Theta) = 5 \times 10^{-8}$

The actual values calculated will depend upon the value of m(Θ) assigned for passing tests, which in turn is determined by the judgement of the algorithm designer or user.

The iSTAT algorithm follows the SLAT convention of rejecting multiplets with redundant faults, such as (A, B, C) in the example of Figure 4.11. But, by allowing such non-minimal multiplets as (B, C) in the second example (Figure 4.12), the iSTAT algorithm can consider a wider range of defect scenarios than can SLAT and many other per-test algorithms.
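The belief arithmetic behind the numbers quoted above can be reproduced with a small sketch of Dempster's rule of combination for the simple case used here, where every focal element is either a single multiplet (treated as an atomic hypothesis) or the whole frame Θ. The per-test mass values in the example run are illustrative only; the exact figures depend on the m(Θ) values chosen for failing and passing tests.

THETA = "THETA"   # sentinel for the whole frame (ignorance)

def combine(m1, m2):
    """Combine two mass assignments {hypothesis: mass, THETA: mass} by Dempster's rule."""
    combined, conflict = {}, 0.0
    for h1, v1 in m1.items():
        for h2, v2 in m2.items():
            if h1 == THETA and h2 == THETA:
                key = THETA
            elif h1 == THETA:
                key = h2
            elif h2 == THETA or h1 == h2:
                key = h1
            else:
                conflict += v1 * v2       # disjoint hypotheses: conflicting evidence
                continue
            combined[key] = combined.get(key, 0.0) + v1 * v2
    return {k: v / (1.0 - conflict) for k, v in combined.items()}

# Rough illustration of the Figure 4.12 scenario: four simple failing tests support
# (A) and (B, C) equally; a fifth, passing, test then favors (B, C) because fault A
# alone predicts a failure on it.
failing = {"A": 0.45, "(B,C)": 0.45, THETA: 0.10}
state = failing
for _ in range(3):
    state = combine(state, failing)
state = combine(state, {"(B,C)": 0.5, THETA: 0.5})   # the passing test, m(Theta) = 0.5
print(state)   # belief on (B, C) ends up roughly twice the belief on (A)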
4.10 Experimental Results – Simulated Faults

This section presents results on some simulated defects in an industrial circuit. These defects were created by modifying the circuit netlist and simulating the test vectors to obtain faulty behaviors. Only logical fault simulation was done; in none of the cases was any electrical-level (or SPICE-level) simulation performed. The idea was to create defects of varying complexity, and of the types that per-test diagnosis algorithms usually target: multiple and intermittent stuck-at faults, wired-logic bridging faults, and faults clustered on nets and gates. The iSTAT algorithm was run on each simulated defect, including all of the matching and scoring methods described earlier. For each trial, a diagnosis will consist of a set of multiplets.

For each diagnosis, Table 4.1 below reports the type of defect we simulated and the size of the multiplets, where the size indicates the number of component faults in each multiplet. All SLAT multiplets contain the same number of faults (by construction); for these experiments, so did all top-ranked iSTAT multiplets. The next column reports the number of SLAT multiplets, built according to the SLAT algorithm. There is one difference, however, between the multiplets described here and those described in the SLAT paper: these multiplets contain stuck-at faults (describing a circuit node and fault polarity), while SLAT multiplets consist of only faulty circuit nodes (faults of opposite polarity on the same node are collapsed into one "location"). For each diagnosis, we then report the number of top-ranked iSTAT multiplets. This value gives the number of multiplets that all receive the same top score. A higher number indicates lower resolution, as the algorithm expresses no preference among these candidates. The comparison of this number with the number of SLAT multiplets indicates the improvement in resolution over the SLAT algorithm.

The next column reports whether the diagnosis was a success or not, defined as the correct multiplet receiving the highest score. For the two-line bridging defects, the result can be a partial ("P") success if only one node of the bridge is identified by the faults in a multiplet. A complete success ("Y") requires that both nodes be represented by at least one fault in the multiplet. Therefore, it is very unlikely that the diagnosis of a dominance bridge can be anything but a partial success, because no faults ever originate from the dominating node. Also, the implication of both nodes of a non-dominance bridging fault is highly dependent upon the test set. In order for both nodes to appear in a multiplet, the test set will have to propagate failures from both nodes and put opposite logic values on those nodes during the detecting tests. On other defects, a successful diagnosis is expected to identify exactly the faults inserted. So, if two stuck-at faults were inserted, the correct multiplet should have two faults of the correct polarity. One exception was defect #13, where the diagnosis was judged a success even though three faults were inserted while the multiplet indicated four faults, since the fourth implicated fault was the stem of the net fault.

Overall, both the SLAT algorithm and the iSTAT algorithm produce a correct diagnosis on all trials. This is a remarkable success rate even for this small trial size and relatively small circuit, given the complexity of some of the defect behaviors. The number of times that SLAT produced a small diagnosis was surprising (4 multiplets or fewer on 13 of 20 trials), but in all cases iSTAT was able to improve this resolution, in some cases dramatically.
Defect  Simulated Defect                        Size of     No. of SLAT  No. of Top-Ranked  Success?
No.                                             multiplets  multiplets   iSTAT multiplets
1       Single stuck-at fault                   1           7            4                  Y
2       2 independent stuck-at faults           2           21           8                  Y
3       2 independent stuck-at faults           2           1            1                  Y
4       2 interfering stuck-at faults           2           9            4                  Y
5       3 interfering stuck-at faults           3           2            1                  Y
6       4 stuck-at faults, 3 interfering        4           2            1                  Y
7       Two-line wired-OR bridge                2           2            1                  Y
8       Two-line wired-AND bridge               2           2            1                  Y
9       Two-line wired-AND bridge               2           1            1                  Y
10      Two-line wired-XNOR bridge              3           13           7                  Y
11      Two-line dominance bridge               1           3            1                  P
12      Two-line dominance bridge               1           2            1                  P
13      Net fault (3 branch stuck-at faults)    4           90           1                  Y
14      Net fault (3 branch stuck-at faults)    3           4            1                  Y
15      Gate replacement (OR to AND)            1           1            1                  Y
16      Gate replacement (OR to NOR)            2           11           7                  Y
17      Gate replacement (MUX to NAND)          2           3            2                  Y
18      Gate output inversion                   1           3            1                  Y
19      Multiple logic errors on one gate       1           1            1                  Y
20      Multiple logic errors on one gate       2           27           10                 Y

Table 4.1. Results from scoring and ranking multiplets on some simulated defects.

Another observation is that it was difficult to create gate faults that looked like anything other than (possibly-intermittent) stuck-at faults on the gate outputs. The inability to create truly "complex" faulty gate behaviors most likely has much to do with pattern-dependent fault detection, since output faults on gates can often swamp faults on the gate inputs unless enough tests with the right logic values are applied. The same is true for the bridging faults: while the maximum multiplet size is 4 (both faults on both nodes detected), the algorithm mostly produced 1- to 2-fault multiplets.

The circuit used in these experiments is relatively small; future work includes repeating the same experiments on a larger circuit, with even more complex simulated defects. However, these results do indicate that per-test algorithms are effective in diagnosing complicated behaviors, and that the iSTAT algorithm improves upon previous approaches, both in the resolution it provides and in the amount of information it can apply to the diagnostic problem.

4.11 Experimental Results – FIB Defects

Texas Instruments supplied data to UCSC on some production chips that had been altered by use of a Focused Ion Beam (FIB) to insert known defects [SaxBal98]. A total of sixteen defects were inserted (one per chip), including two shorting signal lines to power and ground (to mimic stuck-at faults) and fourteen shorting two signal lines. An interesting aspect of this data is that the TI engineers had also used the most popular commercial diagnosis tool, Mentor Fastscan, to diagnose the same failures. It is widely believed that Fastscan implements the W&L algorithm, so this data provides a good comparison of the effectiveness of these two per-test algorithms on real-world circuits. (This same data will be re-visited in Chapter 6.)

Table 4.2 presents the results from these experiments. The first column gives the ID used by the TI engineers to identify each FIB'd circuit. The second column reports the number of nodes identified by Fastscan in its diagnosis, either two, one, or none; for the fourteen two-line bridging defects, the best answer is a two-node identification. A Fastscan diagnosis consists of a list of stuck-at faults, and can be of any length. Unfortunately, the TI engineers did not report the Fastscan diagnosis sizes for these trials, or in what position the bridged nodes appeared in the list. (Fastscan orders its stuck-at candidates by the number of failing patterns explained, or matched, by each fault.) The third column reports the number of nodes identified by iSTAT.
For iSTAT, a diagnosis was considered a success if one of the two nodes was identified in the top-ranked multiplet (of any size). If the other node was found in any of the top 10 multiplets, then iSTAT was given credit for identifying both nodes. The last column provides notes about some defects or diagnoses.

In summary, Fastscan was able to identify both nodes of a bridge only 2 out of 14 times, and for 3 diagnoses was unable to identify either bridged node. By comparison, iSTAT was able to identify both nodes 5 out of 14 times, with no failures. Plus, iSTAT was able to do at least as well as Fastscan on all diagnoses, improving on the 3 cases for which Fastscan failed.

The case of FIB7 deserves mention since the number of top-ranked multiplets was quite large. For this diagnosis, the top 50 multiplets were all of size 1 and were all given the same score. While both nodes were indeed identified in the top-ranked multiplets, this diagnosis is larger than the usually accepted standard of 10 candidates for a successful or usable diagnosis. The result seems to indicate that a large set of equivalent faults has been implicated, but further analysis of this diagnosis will have to wait until either circuit information or the Fastscan results are available from TI.

ID          Fastscan    iSTAT       Top-Ranked      No. of Top-Ranked
                                    Multiplet Size  Multiplets
FIB-sa1     exact       exact       1               1
FIB-sa2     exact       exact       1               1
FIBx        one node    one node    1               3
FIBy        none        two nodes   2               10
FIB1intra   one node    one node    2               1
FIB2intra   none        one node    1               6
FIB3inter   one node    one node    1               2
FIB4intra   none        two nodes   2               2
FIB4inter   one node    one node    1               1
FIB5intra   one node    one node    1               1
FIB5inter   one node    one node    1               1
FIB6intra   one node    two nodes   2               2
FIB6inter   two nodes   two nodes   2               1
FIB7        two nodes   two nodes   1               50
FIB8        one node    one node    1               1
FIB9        one node    one node    2               4

Notes recorded for several of these defects: a pattern-dependent dominance bridge that behaves like an intermittent stuck-at fault on one node; a bridge between two inputs in an XOR tree; three dominance bridges; a defect for which only one node was sensitized by the tests; and a feedback bridging fault.

Table 4.2. Fastscan and iSTAT results on TI FIB experiments: 2 stuck-at faults, 14 bridges.

Chapter 5. Second Stage Fault Diagnosis: Implication of Likely Fault Models

For the initial stage of diagnosis, when the best fault model to apply is unknown and the entire chip must be considered, a per-test diagnosis algorithm that uses the abstract stuck-at fault model is ideal: it is flexible enough to deal with intermittent, multiple, and complex fault behaviors, and it is simple enough to apply to even large netlists. But the diagnoses returned by such algorithms are often difficult to apply as a guide for physical failure analysis. These diagnoses often consist of a large number of stuck-at fault candidates, each of which explains only a part of the observed behavior, and which appear to be wholly unrelated to one another. And, an individual candidate is not actually a traditional stuck-at fault, which was itself an abstraction, but is rather the further abstraction of a piece of a stuck-at fault, valid only during certain failing tests. While physical failure analysis has had a difficult enough time with traditional stuck-at diagnosis, dealing as it does with logical gates and nodes and not actual physical circuit structures, it has an even more difficult time using the results from abstract "model-independent" per-test diagnosis algorithms.
5.1 An Old, but Still Valid, Debate A recurring but often unstated theme throughout much of fault diagnosis research, the debate identified by Aitken and Maxwell [AitMax95] (introduced in Chapter 2) surfaces again. In this “A-M debate” (whether for Aitken-Maxwell or algorithms-vs.-models) the main question is whether more clever and flexible algorithms, or more accurate and sophisticated fault models, will lead to a better diagnostic result. On the side of better algorithms is mainly the practical argument: applying specific fault models across an entire circuit, when the actual best fault model to apply is unknown, is not feasible for most modern circuits. It is better, says the algorithm side, to build abstract diagnosis 56 algorithms and decompose the behavior into individual tests, in order to come up with the most general answer possible. But, answers the better-model crowd, by using already-abstract models and decomposing the behavior into individual tests, any relation between the test results or faults that could help identify an actual failure mechanism has been lost. It is better, they say, to retain the idea of how something could go wrong in a circuit, through the use of a fault model, so that the final answer has some precision and some utility for failure analysis. Plus, the application of realistic fault models is a better test of actual defect behavior than the often-simplistic assumptions built into per-test algorithms about fault interference and propagation. 5.2 Answers and Compromises The SLAT algorithm makes only a superficial attempt to make its diagnoses more understandable. The SLAT authors propose the construction of splats, which are sets of (apparently) equivalent nodes common to all multiplets. But, this does nothing except identify fault equivalencies; each multiplet must still be investigated as a possible defect scenario. A more ambitious analysis method, called “SLAT Plus”, was recently proposed [BarBha01]. This method analyzes logic-value relationships across all nodes of the circuit during observed failures, in an attempt to infer possible bridging defects. That work, however, is preliminary, and involves a different and more extensive type of analysis than is proposed here. The W&L algorithm also makes a simple attempt to interpret its results, by classifying a diagnosis into one of three categories. The first, called “Class I”, is an exact match with a single stuckat fault, both for failing and passing tests. The second, or “Class II”, is identified when a single stuckat fault can explain all the failing tests but not all the passing tests; in other words, an intermittent stuck-at fault. Finally, “Class III” is reserved for diagnoses that consist of multiple stuck-at faults that match only inexactly, a category that consists of a wide range of defects. 57 The POIROT algorithm attempts something of a compromise: it decomposes the matching operation into single tests, but also applies a set of pre-built signatures for certain fault models in addition to the stuck-at model. It is, in fact, much like the W&L algorithm with the addition of bridging fault and net fault candidates. Since it explicitly targets these additional models, it doesn’t require any interpretation of its results when one of the more specific candidates (bridge or net fault) is implicated. However, by relying on a a-priori set of candidates, it suffers from the candidate selection problem, first mentioned in section 2.5.4. 
There are $\binom{n}{2}$ possible two-line bridging faults in a circuit, where n is the number of signal lines and can be quite large. There are also 2n individual stuck-at faults, and O(n) open faults. The result is that, even with only three candidate types considered, the POIROT algorithm can quickly become infeasible for large circuits.

5.3 Finding Meaning (and Models) in Multiplets

The main problem with the diagnoses returned by most per-test diagnosis schemes is one of interpretation. The end product of these algorithms can often be a large collection of sets of faults (multiplets), any of which can be used to explain the observed faulty behavior. If you show even a single multiplet, consisting of several faults or nodes, to a failure analysis engineer, the likely response is "But what does this mean?"

The purpose of this chapter is to find a way to discover meaning in multiplets. The idea is to analyze each multiplet in a diagnosis to determine whether the component faults are in some way related to one another, or if they appear to be simply a collection of random faults. In the first case, an algorithm should then be able to infer a defect mechanism; in the second case, either the meaning escapes (due to unmodeled behavior) or perhaps the circuit behavior really is the result of a collection of unrelated defects.

But how can candidate faults be related to each other, and a meaning extracted from the observed behavior? The traditional answer for explaining defective behavior has been the use of fault models. The stuck-at fault model, various bridging fault models, and the transition fault model are all examples of using abstractions to simplify what can be complex defect behaviors. These fault models have the advantage of being relatively easy to understand and (with some translation) identify as part of failure analysis. It seems intuitive, then, to interpret multiplets by correlating them with common fault models, calculating for every multiplet a correlation score for each fault model. A high correlation score implicates a likely defect scenario for that multiplet. A low correlation score for every candidate multiplet in a diagnosis indicates either that the defect is not well represented by any of the fault models, or that the defect consists of multiple unrelated fault instances.

5.4 Plausibility Metrics

To judge this correlation, the most natural scoring, mathematically speaking, is the plausibility of a match between a multiplet and a fault model, or the upper probability limit that a multiplet represents an instance of a particular fault model. For each multiplet, the proposed iSTAT analysis algorithm computes a plausibility score for each fault model, with a maximum score of 1.0 (complete agreement of faults to defect assumptions) and a minimum score of 0.0 (no agreement). A description of each fault model considered and the details of the plausibility calculations follow.

A. Single stuck-at/intermittent stuck-at fault

This case is trivial: if the multiplet consists of a single fault candidate, it will be classified as a stuck-at or intermittent stuck-at fault on a single node. While this is a simple classification, many defect types mimic intermittent stuck-at faults. Depending upon the test set, bridging faults, gate faults, open faults and transition faults could all look like stuck-at faults.
In the SLAT paper, the authors found that 37% of the defects they diagnosed looked like stuck-at faults, which is not inconsistent with this author's industrial experience of diagnosing actual failures. So, this defect class is likely to be a catch-all for many defects that aren't activated multiple times by the test set.

Plausibility: 1.0 if multiplet is size 1; 0.0 otherwise.

B. Node/transition fault

If a multiplet consists of two fault candidates of opposite polarity on the same node, it is classified as a node fault. The most likely defects for this scenario are a dominance bridging fault, a gate delay fault, or some open faults.

Plausibility: 1.0 if multiplet is size 2, and faults involve the same node; 0.0 otherwise.

C. Net fault

If examination of the netlist determines that most or all of the component faults of a multiplet are the branches or stem of a common net, then it can be identified as a net fault. This type of fault was proposed by the authors of the POIROT system to cover open defects that affect nets with fanout.

Plausibility: 1.0 if multiplet is size 2 or greater, and all faults are on the same net (including fanout); if size 3 or greater, the portion of faults on the same net; 0.0 if multiplet is size 1.

D. Gate fault

If we find by examining either the faultlist or the circuit netlist that most or all of the faults in a multiplet involve a common gate or standard cell, then it will be classified as a gate fault. Some possible defects that could look like intermittent faults on a gate's outputs and inputs are transistor stuck-on or stuck-off, internal shorts, clocking problems, or some other logic error. Note that since gate faults are a superset of node faults, any multiplet that gets a node fault score of 1.0 will also get a gate fault score of 1.0. While this classification is slightly redundant, it does reflect the fact that any defect on a node can also reasonably be attributed to its connected gates.

Plausibility: 1.0 if multiplet is size 2 or greater, and all faults are on ports of the same gate; if size 3 or greater, the portion of faults on the same gate; 0.0 if multiplet is size 1.

E. Two-line bridging fault

The identification of a two-line bridging fault relies on a multiplet containing faults on two nodes. Also, due to the nature of two-line shorts, tests that detect faults having opposite polarity should fail, and tests that detect faults of the same polarity should pass.

Plausibility: if multiplet is size 2, 3, or 4, and all faults are on (exactly) two nodes, then combine a) the portion of common tests for faults of opposite polarity that fail with b) the portion of common tests for faults of the same polarity that pass7; 0.0 otherwise.

7 The next chapter explains these conditions in more detail; as metrics for scoring bridging fault candidates, they are referred to as "required vector" and "restricted vector" scores, respectively [LavTCAD98].

F. Path/path-delay fault

If netlist examination finds that the component faults of a multiplet lie on a single path, by tracing back from failing outputs, then the defect is classified as a path fault. The as-yet unproven assumption is that path-delay faults can be identified in this manner.

Plausibility: 1.0 if multiplet is size 2 or greater and all faults exist on a path from an output to an input; if size 3 or greater, the portion of faults on the same path; 0.0 if all faults are on the same node, gate or net, or if multiplet is size 1.
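The simpler of these checks reduce to straightforward set tests over the faults in a multiplet. The sketch below, using a hypothetical Fault record, is one way the single stuck-at, node, net, and gate plausibilities could be computed; the bridge and path scores are omitted because they also require test data and path tracing.

from dataclasses import dataclass

@dataclass(frozen=True)
class Fault:
    node: str       # circuit node name
    polarity: int   # 0 = stuck-at-0, 1 = stuck-at-1
    gate: str       # gate or cell whose port the fault sits on
    net: str        # net (stem plus fanout branches) containing the node

def shared_portion(values):
    """Portion of faults sharing the most common value of some attribute."""
    best = max(set(values), key=values.count)
    return values.count(best) / len(values)

def plausibilities(multiplet):
    """multiplet: list of Fault records.  Returns a plausibility score per fault class."""
    n = len(multiplet)
    scores = {"single stuck-at": 1.0 if n == 1 else 0.0}
    # Node/transition fault: exactly two faults of opposite polarity on one node.
    scores["node fault"] = 1.0 if (n == 2
                                   and len({f.node for f in multiplet}) == 1
                                   and len({f.polarity for f in multiplet}) == 2) else 0.0
    # Net and gate faults: 1.0 if all faults share the net/gate; for size 3 or more,
    # the portion of faults that do; 0.0 for a single fault.
    for name, attr in (("net fault", "net"), ("gate fault", "gate")):
        if n == 1:
            scores[name] = 0.0
        else:
            portion = shared_portion([getattr(f, attr) for f in multiplet])
            scores[name] = portion if (n >= 3 or portion == 1.0) else 0.0
    return scores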
Note that none of these plausibility calculations mention equivalent faults. If a fault dictionary was used to perform the initial diagnosis that produced the multiplets, or if for some other reason fault collapsing was performed by the simulator, then equivalent faults will have to be identified and considered (usually by expanding the set of multiplets). In the case of iSTAT, it was designed as a path-tracing algorithm that does not require fault collapsing, so all equivalent faults are already identified and are contained in the multiplets.

These plausibility calculations were designed so that the information they require could be determined during the normal iSTAT algorithm operations of limited path tracing and fault simulation. For the bridging fault model, these calculations are a subset of those that a normal bridging fault diagnosis algorithm would perform; the same is true for node and net faults. At this stage, however, no specific fault simulation is done, other than normal STAT-based stuck-at simulation. These calculations, then, are a sort of "first-order" model-based diagnosis on the multiplet candidates, and the plausibility numbers express how reasonable it is to pursue a more intensive diagnosis for any fault model.

5.5 Proximity Metrics

The plausibility calculations for the models that involve electrical shorts could be significantly improved if information about the physical proximity of the faults is available. For a traditional stuck-at fault, the implication is that a signal line is shorted to power or ground; whether this is plausible or not depends upon the proximity of a supply wire to the signal line. Similarly, the plausibility of a two-line bridging fault is highly dependent upon the proximity of the two lines, considered along the length of both wires. This information, however, is not normally used during traditional fault diagnosis, which usually only works with netlist information and test data, and so it was not included in the calculations specified above or in the experiments described below.

But the issue of proximity raises an interesting avenue of fault interpretation. Not all correlations or fault relationships can be expressed by traditional fault models. There are complicated defect scenarios that affect isolated areas of a die, such as large spot defects, physical damage, or poor localized implantation [NighVal98]. No current fault model could properly capture such a scenario, even though a STAT-type diagnosis might implicate faulty circuit nodes in the area of the defect. An additional type of correlation, then, would be useful for interpreting a multiplet: the physical proximity of the component faults. This proximity can be calculated from an analysis of the layout or artwork files, often represented in database form (such as the popular "DEF" format). When faced with a set of multiplets in a diagnosis, the proximity measure would tell the failure analysis engineer how localized the faults for each multiplet are in silicon. Given the limits to how much area physical investigation can reasonably cover, a high physical proximity correlation could very well be the most valuable information to an FA engineer, more valuable perhaps than any fault model.

Stuck-at faults are usually associated with a port on a logic gate, but fault effects can affect the wire connected to the port.
Defining a fault location by its port, however, is simpler than defining the location by the area traversed by a wire. Using the (x, y) coordinates of gate ports, a proximity measure of a set of candidate faults can be determined by using a sum-of-squared-error calculation: the (Euclidean) distance of each fault from the mean fault location is squared and summed. The result is a numerical representation of the nearness of the faults to each other, with a smaller number indicating higher physical proximity.

If wires, and not just gate ports, are taken into account, the calculation of proximity is considerably more complicated. A single net in a circuit can traverse a large part of the die, and can include multiple layers of metal, introducing a third (z) dimension into the calculations. Two calculations, however, would probably be worth the effort: the first would be the size of the bounding box that contains at least part of each wire. This is the same adjacency calculation performed for two wires when determining bridging fault likelihood. The second is the size of the bounding box that contains all of the wire mass. This is interesting even in the case of multiplets with only a single stuck-at fault: given two stuck-at fault candidates, the one that is most interesting to pursue is the one whose wire covers the smallest physical area of the die, since that area can be more thoroughly inspected. It is not uncommon for an apparently-ideal diagnosis that consists of a single perfectly-matching stuck-at fault to be useless, simply because the implicated fault actually covers too great a physical area to conduct a search for root cause.

Another type of proximity measure that would be interesting for multiplet analysis is logical proximity, or the number of gates or cells that separate the set of faults in the multiplet. This information would be easier to calculate than physical proximity, since it can be determined from the same netlist file used for fault tracing and simulation. Some of this proximity information is captured in the node, net, and gate fault classes, but some more complicated defects may involve several gates. In any case, both the logical and physical proximity measures could indicate how related a particular set of faults in a multiplet are, which may help in limiting the search for root cause to an area of the die or to an area of functional logic.
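As a small illustration, the port-coordinate version of this measure is just a sum of squared distances from the mean fault location; the (x, y) coordinates are assumed to come from a layout-database lookup.

def proximity(coords):
    """coords: list of (x, y) gate-port locations for the faults in a multiplet.
    Returns the sum-of-squared-error measure; smaller means higher physical proximity."""
    n = len(coords)
    mean_x = sum(x for x, _ in coords) / n
    mean_y = sum(y for _, y in coords) / n
    return sum((x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in coords)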
5.6 Experimental Results – Multiplet Classification

Table 5.1 gives the results of the multiplet classification technique on the same simulated defects from Table 4.1. For each simulated defect, the plausibility of the top (correct) multiplet is calculated vis-à-vis each defect class. As expected, some of the defects are classified as stuck-at faults simply because the diagnosis multiplet size is 1. For the bridging and gate faults that are classified as stuck-at, the result is highly dependent on the test set: if the tests don't activate the other faults or fault polarities, then these defects will look like stuck-at faults.

Defect  Simulated Defect                        Single    Node   Net    Gate   2-Line  Path
No.                                             Stuck-at  Fault  Fault  Fault  Bridge  Fault
1       Single stuck-at fault                   1.0       0.0    0.0    0.0    0.0     0.0
2       2 independent stuck-at faults           0.0       0.0    0.0    0.0    0.0     0.0
3       2 independent stuck-at faults           0.0       0.0    0.0    0.0    1.0     0.0
4       2 interfering stuck-at faults           0.0       0.0    0.0    0.0    0.0     0.0
5       3 interfering stuck-at faults           0.0       0.0    0.0    0.0    0.0     0.67
6       4 stuck-at faults, 3 interfering        0.0       0.0    0.0    0.0    0.0     0.75
7       Two-line wired-OR bridge                0.0       0.0    0.0    0.0    1.0     0.0
8       Two-line wired-AND bridge               0.0       0.0    0.0    0.0    1.0     0.0
9       Two-line wired-AND bridge               0.0       0.0    0.0    0.0    1.0     0.0
10      Two-line wired-XNOR bridge              0.0       0.0    0.0    0.0    1.0     0.0
11      Two-line dominance bridge               1.0       0.0    0.0    0.0    0.0     0.0
12      Two-line dominance bridge               1.0       0.0    0.0    0.0    0.0     0.0
13      Net fault (3 branch stuck-at faults)    0.0       0.0    1.0    0.0    0.0     0.0
14      Net fault (3 branch stuck-at faults)    0.0       0.0    1.0    0.0    0.0     0.0
15      Gate replacement (OR to AND)            1.0       0.0    0.0    0.0    0.0     0.0
16      Gate replacement (OR to NOR)            0.0       1.0    0.0    1.0    0.0     0.0
17      Gate replacement (MUX to NAND)          0.0       0.0    0.0    1.0    0.0     0.0
18      Gate output inversion                   1.0       0.0    0.0    0.0    0.0     0.0
19      Multiple logic errors on one gate       1.0       0.0    0.0    0.0    0.0     0.0
20      Multiple logic errors on one gate       0.0       0.0    0.0    1.0    0.0     0.0

Table 5.1. Results from correlating top-ranked multiplets to different fault models.

Generally speaking, a multiplet that received a 0.0 plausibility score for all defect classes was a case of multiple unrelated stuck-at faults. It is possible, however, for two simultaneous but unrelated stuck-at faults to get a non-zero bridging fault score, as happened with defect #3. For that defect, the stuck-at faults in the multiplet are of opposite polarity, and all vectors common to the two fault signatures fail, so there is nothing in this behavior that is inconsistent with a two-line bridging fault. On the other hand, the component faults for defects #2 and #4 are of the same polarity, but all common vectors fail, which is completely inconsistent with a bridging fault. In either case, this analysis can only judge the consistency of the behavior with a bridging fault; it would take either layout analysis, or a bridging-fault diagnosis algorithm, or both, to judge whether the bridging fault is actually a good explanation for the behavior.

5.7 Analysis of Multiple Faults

By correlating multiplets to individual fault classes, the above procedure implicitly invokes the venerable single fault assumption, which is that the observed behavior can be attributed to a single fault mechanism. But, one of the strengths of per-test approaches such as iSTAT is that they should be able to implicate the components of multiple simultaneous defects, even if the implications consist of partial stuck-at faults and are therefore somewhat vague. The signal for the possible presence of multiple faults is a low plausibility score across all fault classes. If enough fault classes are applied, this would indicate that the faults in a multiplet don't match up well with any single fault scenario and the behavior may be due to multiple faults. There are several ways, then, to re-analyze the candidates to infer multiple fault groups. Some of the fault classes define partial correlation scores, and for these classes a non-zero score might indicate that some of the faults in a multiplet fit the defect scenario.
These are the path, net, and gate fault classes, and if a multiplet gets an imperfect but non-zero score for any of these classes, the faults that do correlate well can be separated and the rest of the faults re-analyzed to infer the presence of a second defect.

Another way to infer multiple defects is by applying the proximity measures introduced in the last section. Groups of individual faults that have high mutual proximity imply a high probability that they are related in a single defect mechanism. These proximity measures can be used with a simple clustering algorithm, such as nearest-neighbor [DudHar73], to determine likely groups of faults, which can then be re-analyzed to correlate with the set of fault classes.

Finally, some of the fault classes have a defined cardinality, or a certain number of expected individual fault components. These are the two-line bridge fault class and the node class, and any multiplet that does not contain exactly two faults (node fault) or two, three or four faults (bridge fault) will automatically get a plausibility score of 0 for these classes. If multiple defects are involved, however, the multiplet could contain a viable node or bridge candidate mixed in with other candidate faults. For these two fault classes, an exhaustive search has to be performed for large multiplet sizes. The case of node faults is simple: unless two faults of opposite polarity on the same node (e.g. A-stuck-at-0 and A-stuck-at-1) are contained in the multiplet, there is no evidence for a node (or transition) fault. For bridging faults, given a multiplet of k otherwise-uncorrelated stuck-at faults, there are $\binom{k}{2}$ possible bridging candidates. For most multiplets, this is an easily-handled number: for multiplets of size 20 it is 190 candidates, for size 50 it is 1,225 candidates, and for size 100 it is 4,950 candidates.
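A sketch of this exhaustive (but small) search, assuming each fault is simply a (node, polarity) pair, follows; for a multiplet of 20 faults on distinct nodes it yields the 190 bridge candidates mentioned above, rather than a search over the entire netlist.

from itertools import combinations

def node_fault_candidates(multiplet):
    """Pairs of opposite-polarity faults on the same node (e.g. A-sa-0 with A-sa-1)."""
    return [(f1, f2) for f1, f2 in combinations(multiplet, 2)
            if f1[0] == f2[0] and f1[1] != f2[1]]

def bridge_candidates(multiplet):
    """All C(k, 2) pairs of distinct nodes that could be tried as two-line bridges."""
    nodes = sorted({node for node, _ in multiplet})
    return list(combinations(nodes, 2))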
5.8 The Advantages of (Multiplet) Analysis

Considering the number of bridging faults that can be constructed out of the components of a multiplet, the advantage of a robust first-pass per-test diagnosis run is obvious. There may be hundreds of thousands or even millions of signal lines in very large circuits. An exhaustive search of two-line bridging faults in even a ten-thousand-line circuit would involve almost 50 million candidates. Other techniques that limit the search to, say, 10n candidates [LavChe97] (where n is the number of signal lines) may still be impractical for large circuits. But when the candidate space is limited to the number of faults in a set of multiplets, many diagnosis techniques that were quickly running out of steam on modern circuits suddenly become practical again.

What this points out is that the main advantage of STAT diagnosis is that it solves one of the major problems of model-based diagnosis: the problem of candidate extraction or selection. By using a STAT-based diagnosis algorithm that can identify likely fault components, and a way to translate the components into candidate fault models, model-based algorithms can be applied on much-reduced candidate spaces and produce much more precise diagnoses. To revisit the algorithms-vs-models debate, the problem with model-based diagnosis is its lack of flexibility with regard to unmodeled behavior and its impracticality on large circuits. STAT-based diagnosis solves both of these problems, at the cost of imprecision and some opacity in its final diagnostic result. But precision and clarity are exactly the qualities that model-based diagnosis excels at. The analysis step introduced in this chapter, by overlaying fault models on multiplets, provides a natural transition from model-independent STAT-based diagnosis to model-based diagnosis, which is in turn developed and extended in the next chapter.

Chapter 6. Third Stage Fault Diagnosis: Mixed-Model Probabilistic Fault Diagnosis

As I asserted in Chapter 2, one of the principles of fault diagnosis is that the more fault models that can be applied to a particular diagnostic problem, the better. Fault model-based diagnosis provides a level of precision and clarity of result that abstract model-independent diagnosis algorithms cannot match. And, the more fault models that are applied, the more assumptions and scenarios are tested against the defect behavior in the search for the underlying cause.

There have, however, always been two problems with performing diagnosis with multiple fault models. The first is a practical one, due to the size of modern circuits and the candidate selection problem mentioned in earlier chapters. It is simply not practical to consider multiple fault models, and perhaps multiple versions of the same fault model (such as wired-OR and wired-AND bridging fault models), when the circuit netlist is as large as is common today. The second problem is that it has always been difficult or impossible to compare the results from two different diagnosis algorithms. Not all algorithms score their candidates, and even when they do they often use ad-hoc or arbitrary scoring methods specific to the chosen fault model. So, it can be difficult to compare the results from, for example, a bridging fault diagnosis algorithm with those from one that targets open faults.

The candidate space problem can be addressed in the manner described in the previous two chapters, by using an abstract but accurate model-independent algorithm, and then determining the most promising fault models to apply to further refine the diagnosis. But the problem remains of how to compare diagnostic results across multiple fault models. This problem has existed since the first model-specific or non-stuck-at algorithm was proposed, but has not even been acknowledged until recently. The only other algorithm to consider multiple fault models, the POIROT algorithm [VenDru00], used a theoretically suspect application of Occam's Razor to simply prefer equivalently-scored stuck-at candidates to node candidates, and node candidates to bridging candidates, using the rationale that stuck-at candidates are "simpler" than node and bridge faults. But, given that a stuck-at fault is actually just an abstraction, and in its simplest form represents a node shorted to power or ground, it is unclear whether a stuck-at fault is really a simpler explanation for a particular behavior than a bridging fault.

This chapter presents a way to solve the model-comparison problem by introducing a probabilistic framework for fault diagnosis. It is designed to incorporate disparate diagnostic algorithms, different sets of data, and a mixture of fault models into a single diagnostic result. It will develop a rigorous approach to incorporating both data and user assumptions into the scoring of fault candidates. It will also present the results of experiments on an industrial circuit that was physically modified to insert various defects. But first, it will present a way to include the normally computation-intensive bridging fault model into a practical diagnostic system.
6.1 Drive a Little, Save a Lot: A Short Detour into Inexpensive Bridging Fault Diagnosis This section gives a brief overview of the subject of bridging fault diagnosis. It presents a method of constructing relatively accurate bridging fault signatures that is much more cost-effective than simulation. It also introduces some important concepts for the proposed multiple-fault-model diagnostic framework. 6.1.1 Stuck with Stuck-at Faults Fault diagnosis using the stuck-at model has dominated in most industrial settings, largely because the stuck-at fault model is ubiquitous in testing-related tools. Therefore, a good stuck-at fault simulator is usually available and in wide use, along with other convenient items such as a fault-list, a stuck-at test set, and logic fail information from the tester. But the desire to overcome the limitations of the stuck-at model for diagnosis has motivated a great deal of research into better fault models, better algorithms, and different approaches to the problem of fault diagnosis. 69 One of the “better” models is the bridging fault model, which represents the unintentional shorting of two signal lines [Mei74]. The bridging fault model has gained prominence due to the increasing circuit area devoted to interconnect in modern circuits, with a commensurate increase in the rate of shorted interconnect lines. But, it is difficult to accurately model bridging faults, and various models of increasing sophistication have attempted to capture realistic effects of shorted nodes [AckMil92, GrePat92, Rot94]. Complications include variable drive strengths [MaxAit93], the Byzantine Generals Problem [AckMil91, LamSho80], feedback, bridge resistance [MonBru92], and defect-induced sequential behavior. A bridging fault model that accounts for most or all of these complications would be too expensive to apply diagnostically to all but the most limited of candidate spaces. 6.1.2 Composite Bridging Fault Signatures The prominence of the stuck-at fault model, and the prevalence of bridging defects in CMOS circuits, has motivated several attempts at using the stuck-at fault model to perform bridging fault diagnosis. I have previously published an improvement [CheLav95] to one such technique, by Millman, McCluskey, and Acken (MMA) [MilMcC90]; the improved technique demonstrated considerable success at diagnosing simulated bridging faults. Like the original technique, my approach used only stuck-at fault simulation and signatures, but improved on the original technique in three ways: considering only realistic bridges, incorporating match restriction (flagging some test vectors as incapable of detecting a particular bridging fault), and incorporating match requirement (flagging some vectors as dependably detecting a particular bridging fault). The basic idea to both the original and the improved technique is that of the composite bridging fault signature, which is the union of the four single stuck-at signatures associated with the two bridged nodes. The underlying idea of the composite bridging signature is this: if a bridging fault is detected by a test, that test will also detect one or more of the four stuck-at faults on the bridged nodes. Therefore, 70 it is expected that the actual bridging fault signature (the set of detecting test vectors if the bridge occurs) will be a subset of the vectors found in the bridge's composite signature. 
Figure 6.1 illustrates the composite signature of a fault candidate for node X bridged with node Y; for simplicity, the contents of each of the four component sets can be considered to be the test vectors, numbered from 1 to n, that detect each respective fault. The figure, then, portrays the set concatenation of four stuck-at signatures. The black lines in the figure illustrate the concept of match restriction: if the same test vector (in the figure, the same vector number occupies the same relative position in each set) detects both X stuck-at 0 and Y stuck-at 0, it by definition tries to set each line to 1. When both bridged nodes are set to the same value it is highly unlikely that the bridging fault will be stimulated (no error should result), and the test vector can be marked as restricted in the composite signature. The same holds true for any test that detects both X stuck-at 1 and Y stuck-at 1. Figure 6.1. The composite signature of X bridged to Y with match restrictions (in black) and match requirements (labeled R) The lines labeled R in Figure 1 illustrate the complementary concept to match restriction, called match requirement: if the same test vector can detect both X stuck-at 0 and Y stuck-at 1 (or viceversa), that test should detect the bridging fault (since it sensitizes and propagates both simple fault conditions), and it is flagged in the composite signature as a required vector. The result is a signature for the bridging fault node X bridged to node Y, but notice that only stuck-at fault signatures (and simulation) were used - no bridging fault modeling or simulation was required. This is a tremendous practical advantage, as it allows inexpensive but approximate bridging fault signatures to be created much more cheaply than with almost any bridging simulator, using tools (a stuck-at simulator or a set of stuck-at fault signatures) that are usually readily available. 71 6.1.3 Matching and (Old Style) Scoring with Composite Signature As the composite signatures are now only approximations to actual bridging fault behaviors, the matching algorithm that selects candidates for the final diagnosis must allow and expect some mismatch between the predictions (composite signatures) and the observed behavior (actual failing test vectors). The original MMA technique only expected that the correct candidate's composite signature would be a superset of the observed behavior. However, the elimination of the restricted vectors, and the specification of required vectors, improves the predictions, and provides the matching algorithm a means for refining its expectations and judging the goodness of each candidate compared to the observed behavior. My previous scoring system was lexicographic, in which each candidate was ranked on three criteria, in descending order of importance. First, as in the original technique, the observed behavior for a bridging fault is expected to be a subset of the candidate signature, so any nonprediction (errors seen but not predicted) is very unexpected. Second, some test vectors in each candidate are marked as required, so we can judge a candidate by how many of its required vectors actually detected the fault. Third, while some misprediction (errors predicted but not seen) is to be expected with composite signatures, excessive misprediction indicates a poor match with the observed behavior. 
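One plausible realization of this construction (a sketch under assumed data structures, not the published implementation; signatures are taken here to be sets of detecting test numbers) is shown below, together with the three match quantities just described.

def composite_signature(x_sa0, x_sa1, y_sa0, y_sa1):
    """Composite signature for node X bridged to node Y, built from four stuck-at
    signatures, with restricted and required vectors flagged."""
    composite = x_sa0 | x_sa1 | y_sa0 | y_sa1
    # Restricted: a test detecting both sa-0 faults (or both sa-1 faults) tries to
    # drive both nodes to the same value, so it is unlikely to stimulate the bridge.
    restricted = (x_sa0 & y_sa0) | (x_sa1 & y_sa1)
    # Required: a test detecting X sa-0 and Y sa-1 (or vice versa) sets the nodes to
    # opposite values and propagates both faults, so it should detect the bridge.
    required = (x_sa0 & y_sa1) | (x_sa1 & y_sa0)
    return composite, restricted, required

def match_quantities(observed, composite, restricted, required):
    """observed: set of tests that actually failed on the tester."""
    predicted = composite - restricted                  # vectors genuinely expected to fail
    nonprediction = len(observed - composite)           # failures never predicted (very unexpected)
    required_hits = len(required & observed)            # required vectors that did fail
    misprediction = len(predicted - observed)           # predicted failures never seen
    return nonprediction, required_hits, misprediction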
The final scoring, as stated, was lexicographic, with the (smallest) amount of nonprediction having priority, followed by the number of successful required vector predictions, and finally by the (smallest) amount of misprediction. 6.1.4 Experimental Results with Composite Bridging Fault Signatures I then ran experiments to see how well this technique could perform at diagnosing simulated bridging defects, especially in the presence of noise [LavLar96]. Various amounts of random noise were added to the simulated bridging signatures, and the technique attempted to identify the correct bridging fault in a list of 10 candidates. The results were quite successful: even in the presence of severe noise (causing the deletion of more half of the original information or the addition of half again 72 as much spurious information or both), the scoring mechanism was able to successfully extract the correct candidate from 70% to 95% of the time. This was a surprisingly good result, especially since no other published diagnostic technique has attempted to diagnose complex behaviors in the presence of so much noise and unmodeled behavior. But, underlying all of the unmodeled behavior, there was still a bridging fault behavior to be unearthed. In these experiments, the defect was known ahead of time to be a bridge, and then bridging candidates were used to identify it. What about a more realistic scenario, where the form of the defect is unknown? Can a diagnosis algorithm account for another fault type, and incorporate and distinguish between varying explanations for the observed faulty behavior? These are exactly the questions I needed to answer when I set out to transfer this technology to industrial use, performing real-world diagnoses on actual failing circuits. 6.2 Mixed-model Diagnosis An ideally robust diagnosis system would have the ability to include an arbitrary number of fault models, would employ all the models towards diagnosing the faulty circuit, and would report a single answer that represents the best explanation for the behavior over all candidates. Such a system would admittedly require more work as more models were added, but it could theoretically cover an arbitrary range of fault types and behaviors. Such a system is perhaps the ideal, with a model for every contingency, but in practice the number of models will probably be limited to those considered most likely or most interesting. For this research, my approach was to build towards a robust diagnosis system by starting small, with a combination stuck-at fault/bridging fault diagnostic system. The idea behind such a two-model system is relatively modest. First, I use bridging fault candidates and (composite) signatures to diagnose actual bridging defects. Second, I use stuck-at candidates and signatures to diagnose a selected set of other defects: shorts to power or ground and “charged" opens (disconnected circuit lines that hold a high or low logic value). These defect types were chosen because they are assumed to be both commonplace and well represented by the stuck-at 73 fault model. The diagnostic bottom line is: if the behavior looks most like a bridging candidate, score the bridge highest; if it looks most like a stuck-at candidate, score the stuck-at candidate highest; if neither happens, give some indication that the behavior is not much like any of the candidates, bridging or stuck-at. 
It should be obvious that, in order for this mixed-model system to work, an improved method of scoring fault candidates is required that can be applied across fault models. This is not possible with the previously-described composite bridging fault scoring, as there is direct reference to such bridging-specific items as required and restricted vectors. Some generalization of the concept of candidate scoring needs to be defined that will work for any fault candidate, regardless of fault model.

6.3 Scoring: Bayes decision theory

Perhaps the most intuitive method of scoring and comparing fault candidates is numeric, and specifically probabilistic. In other words, what a diagnosis should really calculate is the probability that the failures seen are due to one fault candidate or another, whether that candidate is a stuck-at fault or some other fault type. It would follow, then, that the candidate with the highest probability of having occurred is the most likely suspect.

Applying probabilistic measures to the problem of diagnosis has been recently proposed by a number of researchers. Sheppard and Simpson have developed a comprehensive approach to system-level diagnosis that they recently proposed for application to traditional fault dictionaries [SheSim96]. Thibeault [Thi97] has developed an approach to IDDQ diagnosis that uses a form of current signatures and maximum likelihood estimation, comparing measured current levels to predictions of differential current under a given noise model. And, a method for probabilistically conducting physical failure analysis has been developed by Henderson and Soden at Sandia National Labs [HenSod97].

The probability of a fault candidate occurrence given an observed faulty behavior can be expressed literally as p(c|b), where the candidate and behavior are represented by their fault signatures c and b respectively. An obvious choice for the best candidate is the one with the maximum posterior probability of all candidates considered:

$p(c_i \mid b) \geq p(c_j \mid b) \quad \forall j \neq i$

This is merely the simplest expression of Bayes decision theory, used extensively in the fields of pattern recognition and classification, and introduced earlier in Chapter 4. The theory states that the best explanation (or classification) for a phenomenon is the explanation judged to be most likely given the phenomenon. This is obvious, intuitive, and simple, so of course there's a catch: the probability measure p(ci | b) is difficult to calculate directly. Fortunately, Bayes rule comes to the rescue:

$p(c_i \mid b) = \frac{p(c_i)\, p(b \mid c_i)}{\sum_i p(c_i)\, p(b \mid c_i)}$

The value p(ci) is the a-priori probability of each fault candidate: that is, the probability of a fault's occurrence over all candidates regardless of fault model. The conditional probability p(b | ci) is the probability that the behavior seen is the result of the candidate fault occurring. While this expression may not seem like much of an improvement, the difference now is that, unlike the probability p(ci | b), both p(ci) and p(b | ci) can be calculated or approximated for each candidate, as will be explained shortly. Since the denominator in the above equation is the same for all fault candidates, calculating and comparing the numerator for each fault candidate gives a numerical ordering across all candidates, regardless of model.8 Using the probability p(ci) p(b | ci) as a scoring function is the classic Bayes decision rule, and under some basic assumptions can be proven to give the minimum error rate of any scoring or decision method.
8 One of the assumptions of Bayes rule is that the candidates are an exhaustive and mutually exclusive set of causes for the observed phenomenon. This will generally not be true for fault diagnosis, as unmodeled behavior may occur. Therefore, while the numerator alone still provides an ordering over the fault candidates considered, the denominator does not satisfy the rule of total probability and the complete ratio will likely be an overestimation of the posterior probability for any fault candidate. 75 The a-priori probabilities p(ci) can be calculated through various means. One method is inductive fault analysis, which examines the physical layout of the fabricated circuit and calculates probabilities that various defects will occur [SheMal85, JeeVTS93]. Alternatively, defect sample statistics can be used, or other estimates based on specifics of the actual circuit. In the absence of such information, the a-priori probabilities can be approximated as equal for all candidates, implying that all faults, regardless of model, are equally likely. This is a gross approximation and can obviously affect the accuracy of the results, but it does allow a diagnosis to proceed if a good estimate of the a-priori probabilities is not available. The conditional probability p(b | ci) expresses the probability of the observed behavior resulting from a particular candidate fault. In other words, it is the probability that the circuit behaves in a certain way if the fault occurs. Traditional classification of physical phenomena would usually involve sampling the various candidates and describing the frequencies of their behaviors statistically. This is obviously not possible for VLSI fault diagnosis. Sufficient samples are simply not available for every candidate of every fault model: gathering such sample data would take root-cause failure analysis of thousands of defective chips. And, the statistics such painstakingly determined from any one chip would likely not apply to any other chip, due to differences in circuit design or manufacturing process. Instead of samples and statistics, diagnostic scoring will have to rely on probabilistic modeling: the conditional probability functions will be estimates based upon the information available and the inherent assumptions in each of the candidate fault models. In other words, candidate fault signatures will be treated as predictions of actual defect behavior, and the conditional probabilities will be functions of the estimated rates of prediction error. The question then is, what are the levels of confidence associated with each type of fault model used? The answer depends upon the accuracy of the models and predictions, the correlation of each model to the defects it targets, and, perhaps most importantly, the judgement of the failure analysis engineer. 76 6.4 The Probability of Model Error ... The conditional probability p(b | ci) of a stuck-at candidate should be relatively straightforward to calculate: it is a function of the expected error rate of the stuck-at simulator that produced that candidate's signature. In other words, the likelihood that a prediction for a stuck-at fault differs from the observed behavior when that fault is realized should depend upon the accuracy of the fault simulator, and to a lesser extent upon other factors such as the reliability of the measurements and the integrity of the data. Some definitions and notations will help here. 
Usually during fault diagnosis, comparisons are made on a per-test basis between prediction and behavior; a prediction error occurs, for example, when the chip fails a test that the fault candidate is predicted to pass. (For this and the next section, the discussion of predictions and behaviors will be limited to pass-fail results only.) The probability of this is p(chip fails | candidate predicts pass); in the standard notation of diagnosis, a 0 in a fault signature indicates a passing response and a 1 indicates a failing response, so the above expression reduces to p(b = 1 | c = 0), or more simply, p(1|0).

Now, to continue calculating the required probabilities, I make a simplifying assumption: namely, that the outcomes (success or failure) of candidate predictions are independent. While dependencies may exist for some candidate predictions, the inaccuracies introduced by this assumption of independence are likely to be swamped by the inherent approximations of fault simulation. (The limits of precision are especially obvious in the case of composite bridging fault signatures.) With this assumption of independence, the full conditional probability for a candidate can be expressed as

    p(b | c) = Π (k = 1..n) p(bk | ck)

where k is the index over all n test vectors, bk is the kth bit of the behavior signature, and ck is the kth bit of the candidate signature.

The value, then, of p(b | ci) for stuck-at candidates should be relatively easy to calculate: assuming an unbiased simulator with

    p(0 | 1) = p(1 | 0) = (1.0 − p(1 | 1)) = (1.0 − p(0 | 0)) = x,

if a value or estimate can be assigned to x (the probability of prediction error, or prediction error rate), the score of each candidate can be expressed simply as the product of per-test probabilities. But this of course begs the question of what a good value for x is, or what the expected rate of prediction error is for a particular stuck-at simulator. It is possible (but perhaps unrealistic) that this value can, in some cases, be obtained statistically: perhaps sufficient failure analysis has been performed on a significant number of stuck-at defects to determine this probability with a high degree of confidence. Lacking this information, however, an estimate will have to suffice.

6.5 ... Vs. Acceptance Criteria

It is widely accepted that the stuck-at fault has no single direct analog in the realm of silicon defects. Its closest manifestation would probably be a circuit node shorted to power or ground. If such shorts are the only defects targeted diagnostically with the stuck-at model, then the error rates for stuck-at predictions should be quite low, as a good correlation of defect behavior to simulation should occur. But if the stuck-at candidates are meant to target a wider range of defects, with less direct correlation to classical stuck-at faults, then higher error rates will have to be expected.

Regardless of the value chosen, the role of operator choice points out that the process of scoring fault candidates is largely an arbitrary one, in which the assignment of probabilities is really a matter of establishing acceptance criteria for the various fault models used. If a low stuck-at prediction error rate is used, then a stuck-at candidate with large error will be assigned a low conditional probability; compared with a candidate of another model that has the same number of errors but a higher error rate, the stuck-at candidate will be scored lower.
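As a concrete illustration of the per-test product just described, the following sketch scores a stuck-at candidate against an observed pass-fail behavior under a single symmetric error rate x. It is only an illustration of the calculation, not the implementation used in this work; the function name, the signature encoding (0 = pass, 1 = fail), and the default value of x are assumptions for the example.

    # Score a stuck-at candidate signature against an observed pass-fail behavior.
    # Both signatures are sequences of 0 (pass) and 1 (fail); x is the assumed
    # per-test prediction error rate of the stuck-at simulator.
    def stuck_at_likelihood(candidate, behavior, x=0.01):
        p = 1.0
        for c_k, b_k in zip(candidate, behavior):
            # Each prediction is confirmed with probability (1 - x) and
            # contradicted (misprediction or non-prediction) with probability x.
            p *= (1.0 - x) if c_k == b_k else x
        return p

    # Example: a candidate that matches 9 of 10 test results scores (1-x)^9 * x.
    print(stuck_at_likelihood([1,0,0,1,1,0,0,0,1,0], [1,0,0,1,1,0,0,0,0,0]))

A candidate's final score is then this likelihood multiplied by its a-priori probability, as in the Bayes numerator above.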
Assigning an error rate for fault models has been done implicitly by almost every traditional fault diagnosis algorithm, and by failure analysis engineers who must reconcile the results from different diagnosis tools. Some algorithms do not tolerate any prediction error; they implicitly assign a zero probability of error and reject any imperfect candidate. Others expect much more error in one direction than another (such expecting misprediction but not non-prediction) and express this as 78 weighted or lexicographic ratings. And, if an engineer uses multiple fault diagnosis algorithms and the top candidate reported by a stuck-at diagnosis tool, say, misses the same number of fault predictions as the top candidate from a bridging diagnosis program, then it is human judgement that decides how much error to tolerate in each model and therefore which candidate to prefer. The point is subtle, but it is important enough to bear repeating: every diagnosis algorithm sets its own acceptable error rates, although almost all do it implicitly. The simplest algorithm that only accepts exact matches sets a probability of model error to 0, as do “model-independent” path-tracing algorithms that use strict fault propagation and sensitization conditions. An algorithm that awards one point to a fault candidate for correctly predicting or matching a failure and one point for predicting a passing test is applying a uniform probability to model misprediction and non-prediction. And an algorithm that applies lexicographic scoring or uses Occam’s Razor to prefer one type of candidate to another is simply weighting types of predictions or candidates by dominating factors. If this thesis makes one contribution to the state of the art in fault diagnosis, it should be the identification of this principle: All fault diagnosis is probabilistic, and the underlying probabilities are almost all epistemic, or based on human judgement. By adopting Bayes decision theory for fault diagnosis, then, I am arguing for making these implicit judgements explicit. Explicit parameters have the great advantages of transparency and mutability: the assumptions built into an algorithm are not hidden but declared, and they can then be adjusted to different diagnostic conditions or updated upon new information. The assignment of error rates to each fault model and its predictions is equivalent to stating acceptance criteria for each type of candidate employed. In the case of the proposed two-model diagnosis system, the algorithm will obviously have to accept or tolerate more prediction error with composite bridging signatures than with stuck-at candidates. In the spirit of full disclosure, specifying these usually-implicit values is intended 79 to codify the assumptions and knowledge about the various fault models into a single diagnosis tool where they can be examined and updated as necessary. 6.6 Stuck-at scoring For this research no statistical information was available about the behavior of stuck-at defects in actual manufactured circuits. Therefore, the approach taken for candidate scoring necessarily involved an arbitrary assignment of per-vector prediction error for stuck-at candidates. In general, fault diagnosis will be more effective and accurate if it targets more specific fault types and ties the models more directly to the defects targeted. 
It will be more effective because the increased precision greatly facilitates the subsequent work of physical failure analysis, and it will be more accurate because the fault predictions will be more accurate and therefore easier to match to the associated defects. This point was made in the “algorithms vs. models” paper by Aitken and Maxwell [AitMax95], already mentioned in earlier chapters; the authors’ argument was that diagnosis is most successful (both most accurate and precise) when a fault model is used to target only defects that it best represents. The implication of this idea is that the expected error rate for stuck-at candidates should be set relatively low. This philosophy argues for a relatively tight link between the stuck-at predictions used and the defects targeted for diagnosis. To this end, an expected prediction error rate of 1% was arbitrarily chosen for stuck-at fault candidates in the presented diagnosis system. Viewed as an acceptance criterion, the implication is that any stuck-at candidate that matches less than 99% of the observed behavior should be considered a poor match. The value of 1% is somewhat arbitrary, but is based on limited industrial experience with power or ground shorts and opens, the two defect types explicitly targeted with the stuck-at candidates 6.7 0th-Order Bridging Fault Scoring Since the assignment of an error rate for stuck-at candidates is somewhat arbitrary, the value of an error rate for bridging candidates is similarly arbitrary. It is the value of the bridging error rate 80 relative to the stuck-at error rate that will determine the selection of bridging or stuck-at candidates for any particular diagnosis. As detailed previously, the composite bridging fault signatures used in this system are only approximations to the expected behaviors, and a significant amount of prediction error is anticipated. Accordingly, the error rate assigned for bridging fault candidates should be significantly higher than that assigned to stuck-at candidates. For our purposes a significant difference will be at least an order of magnitude, so a 0th-order estimate for the bridging candidate error rate, given the stuck-at rate specified above, would be 10%. While this is admittedly a gross estimate, it is not far from the value seen in our previous experience with composite signatures vis-a-vis simulated bridging fault behavior [LavTCAD98]. 6.8 1st-Order Bridging Fault Scoring A better estimate for the bridging fault candidate error rate can be obtained by looking at the components of the composite signature described earlier. Doing so points out that different per-vector predictions in a composite signature have very different expected errors. As the name implies, a required vector prediction is expected to fail very infrequently; similarly, a restricted vector should produce a passing result nearly all of the time. Also, misprediction is significantly more probable than nonprediction. Given these factors, one might reasonably assign individual error rates to the various types of composite predictions, again relative to the stuck-at error rate previously assigned: 10% for non-required vectors, 1% for nonprediction and required vectors (since they rely on stuck-at assumptions), and 0.1% for restricted vectors. These values are consistent with those I have seen over thousands of simulated bridging-fault diagnoses, and provide a somewhat more accurate basis for discrimination than the simplistic 0th-order estimate given above. 
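To make the role of these relative rates concrete, the following sketch assigns a different error rate to each type of composite prediction and applies the same per-test product used for stuck-at candidates. It is an illustration only; the prediction-type encoding and the helper name are assumptions, while the rate values are the 1st-order estimates just given.

    # 1st-order per-test error rates for composite bridging predictions:
    # '1' = non-required failing prediction, '0' = passing (non)prediction,
    # '1*' = required vector, '0*' = restricted vector.
    BRIDGING_ERROR_RATES = {'1': 0.10, '0': 0.01, '1*': 0.01, '0*': 0.001}

    def composite_likelihood(candidate, observed_fails, rates=BRIDGING_ERROR_RATES):
        # candidate is a sequence of prediction types; observed_fails is a
        # sequence of booleans (True where the chip failed the test).
        p = 1.0
        for pred, failed in zip(candidate, observed_fails):
            err = rates[pred]
            predicted_fail = pred.startswith('1')
            p *= err if predicted_fail != failed else (1.0 - err)
        return p

Under these values, a bridging candidate and a stuck-at candidate with the same number of contradicted predictions receive different scores, which is exactly the acceptance-criteria effect described above: more error is tolerated where the model's predictions are known to be weaker.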
6.9 2nd-Order Bridging Fault Scoring

It is possible to refine the estimates of prediction error further by examining the possible causes for the actual behavior to diverge from prediction. While this approaches the complicated topic of bridging fault modeling (the avoidance of which was the basis of the composite signature idea), the salient factors affecting composite bridging signatures can be identified relatively easily. They are summarized in Table 6.1.

    p(sv): Probability that a test puts the same logic value on both bridged nodes.
    p(hr): Probability that high resistance of the short prevents (per-vector) any fault effect, regardless of gate type or topology.
    p(wf): Probability that one node wins a drive fight and asserts a definite (faulty) logic value on the other node, but the corresponding stuck-at fault is not detected, causing no fault effect to result from the bridge (non-required vector only).
    p(bg): Probability that a Byzantine Generals situation and re-convergent fanout downstream from the bridge invalidate a pass/fail prediction.
    p(fb): Probability that fault-induced feedback invalidates a pass/fail prediction.

    Assumptions: The events {sv, hr} are independent. The events {sv, wf, bg} are mutually exclusive, as are {hr, wf, bg}. The events {fb, hr, wf} are mutually exclusive. The events {fb, bg} are approximated as independent. The event fb is dependent on sv: p(fb) = p(fb|sv) + p(fb|¬sv).

    Table 6.1. Set of likely effects that can invalidate composite bridging fault predictions.

With a little bit of thought, the relevant conditional error probabilities can be expressed as:

    p(0 | 1)  = p(sv) + p(hr) − p(sv)·p(hr) + p(bg) + (1 − p(bg))·(p(fb|sv) + p(fb|¬sv)) + p(wf)
    p(1 | 0)  = p(bg) + (1 − p(bg))·(p(fb|sv) + p(fb|¬sv))
    p(1 | 0*) = p(fb|sv)
    p(0 | 1*) = p(hr) + p(bg) + (1 − p(bg))·p(fb|¬sv)

In these equations, p(1|0*) refers to the restricted vector error rate and p(0|1*) refers to the required vector error rate. While this degree of decomposition requires more calculation, it does offer certain benefits over the simpler 1st-order approximations. First, some of the probabilities are easy to estimate: p(sv) can be approximated as 0.5, and p(wf) as 0.25. Second, simulator and netlist information can provide accurate values for p(sv), p(wf), and p(fb) on a per-candidate basis. But, values for such probabilities as p(hr) and p(bg) would require either extensive bridging fault characterization, or the assignment of estimates as described earlier (most likely relative to the stuck-at error rates). Given the philosophy of an inexpensive diagnosis system based on stuck-at simulation only, I have decided that the most practical and consistent approach is to use order-of-magnitude estimates for these values. Note, however, that the imposition of a probabilistic framework allows values for these parameters to be used should they be available.

6.10 Expressing Uncertainty with Dempster-Shaffer

The Dempster-Shaffer theory of evidence presented in Chapter 4 can also be applied to the mixed-model scoring described in this chapter. The conditional probabilities just presented can be used as the degrees of belief for candidates of each fault model, and an uncertainty value can be added to the belief assignment over all candidates for each test vector.
The per-test uncertainty value, however, would be calculated differently from the situation presented in the first-pass algorithm of Chapter 4, in which the evidence provided by certain test results is considered to be much stronger than for other tests. In the case of the mixed-model algorithm, assumptions about different test results are built into the conditional probabilities themselves, as with the case for restricted and required vectors for bridging candidates. In this case, the per-test uncertainty would be a function of the conditional probabilities of all candidates; in other words, a test result for which all candidates expressed a conditional probability of 0.5 would result in maximum uncertainty. Where the Dempster-Shaffer method can really add value to the mixed-model algorithm is in its final calculation of the weight of conflict of all evidence, which is determined by the final value of m(Θ). As a final measure of the total uncertainty of the probability assignments, it can provide a valuable confidence value for the diagnosis as a whole. This can be especially important as the algorithm applies fewer or more specific fault models, since it can express how well the observed behavior matches the expectations of the models applied. If the confidence level is low, then, an engineer can decide to try re-running the algorithm with a different set of models or assumptions. The Dempster-Shaffer method also promises more flexibility than a traditional application of Bayes rule since it can calculate the posterior probability of combinations of faults, which would allow an explicit scoring of multiple simultaneous faults. This application is questionable, however, for two 83 reasons. First, the computation of all possible fault combinations for large circuits would be infeasible. Second, the per-test conditional probabilities of fault candidates are not independent unless the fault effects themselves are completely independent. The current version of the mixed-model algorithm does not implement the Dempster-Shaffer method, largely due to practical computational issues – since these experiments were run on whole circuits, the scoring algorithm had to be very simple. But, the application of the iSTAT and analysis algorithms described in earlier chapters should reduce the candidate space for future experiments, and allow the application of a more interesting scoring algorithm. Certainly, the promise of adding an explicit confidence measure is compelling, and so implementing Dempster-Shaffer scoring in the mixed-model algorithm is a subject of near-term future work. Elements of this work include defining the proper per-test uncertainty function, and perhaps including a way to consider small-sized combinations of faults (and judge or estimate their independence) to enable multiple fault diagnosis. 6.11 Experimental results – Hewlett-Packard ASIC In order to evaluate the diagnosis approach just described, I implemented the technique and performed several diagnosis experiments on a production industrial circuit. The experiments were performed at Hewlett-Packard, with their support and equipment; the circuit used was a HewlettPackard ASIC. Defects were inserted into the circuits using a focused ion beam (FIB). (Knowing the exact form and location of a defect is obviously very useful for validation [Ait95]; diagnosis of failing production chips is an obvious next step.) The circuit was built with a 0.5-micron process, and its ATPG model had approximately 150,000 gates. 
There were three rounds of experimentation. In the first, the FIB engineer connected arbitrary signal lines to either power or ground in order to mimic stuck-at behavior. In the second round, he joined neighboring signal lines in order to represent bridging faults; in the third round he broke signal lines in order to simulate open defects.

The diagnosis experiments were performed despite several practical challenges. First, only pass-fail signatures were readily available, so no information about failing outputs was used. Second, no realistic bridging fault candidate list was available, so the diagnosis program had to consider all bridges to be possible. Third, no gate descriptions or simulator information was available for refinement of the p(wf), p(sv), or p(fb) estimates used for composite bridging scoring. Fourth, no statistical analysis of fault or defect frequencies (such as IFA) was performed, so a uniform prior was used for the Bayesian scoring of candidate faults. It is assumed that the addition of any or all of these missing elements would improve the accuracy and resolution of the resulting diagnoses. It is also important to reiterate that no information or tool was used for diagnosis other than a stuck-at faultlist, a pass-fail dictionary (from a stuck-at fault simulator), and a list of failing vectors for each faulty circuit.

Also, for all experiments in this chapter, the first- and second-pass diagnosis algorithms described in earlier chapters were not yet available to reduce the candidate faultlist. Therefore, these experiments represent a worst-case situation, in which the model-specific algorithm must run on the entire circuit. These circuits, while large, contain at most a few hundred thousand stuck-at faults, and so most likely represent the last generation of industrial circuits for which such an approach is feasible.

The diagnosis program requires estimates of prediction error, or sources of error, for bridging and stuck-at fault candidates. The initial assignment for bridging faults was

    p(sv) = 0.5
    p(wf) = 0.25
    p(hr) = p(bg) = p(fb) = 0.01
    p(fb|¬sv) = 100 · p(fb|sv)

The resulting bridging fault probabilities of error were

    p(0 | 1)  = 0.78
    p(1 | 0)  = 0.02
    p(1 | 0*) = 0.0001
    p(0 | 1*) = 0.03

For stuck-at faults, p(0|1) = p(1|0) = 0.01. In most cases, these estimates were probably pessimistic.

The experiments were designed to see if the proposed algorithm could 1) distinguish between stuck-at and bridging defects, and 2) correctly identify the nodes involved in the defect. Another goal was to determine how open defects would be diagnosed in this system, and whether suspicions about their similarity to stuck-at behaviors are justified.

The results are given in Tables 6.2, 6.3, and 6.4. Each of the three tables presents results from a round of experiments, for stuck-at, bridging, and open defects respectively. Each row in a table is an individual diagnosis experiment on a single defective circuit. The first column of each row gives the defect number. The second column (Top Candidate) indicates which candidate the diagnosis algorithm scored highest. More than one candidate can get the same top score, so the third column (Num. Tied for First) reports the number of top-scoring candidates. The fourth column (Classification) classifies each diagnosis, and the last column (Notes) gives a short qualitative description or details on each diagnosis. In the tables, candidates are described by their model and quality of match to the actual inserted defect.
The two candidate models are bf for bridging fault and sa for stuck-at fault. Three grades of match between candidate and the actual defect are specified. An exact match exactly identifies the single node or pair of nodes involved in the defect. A partial match either identifies one out of two bridged nodes (for a stuck-at candidate), or pairs a stuck-at or open node with another unrelated node (for a bridging candidate). A misleading match does not correctly identify any faulted nodes, although the table indicates if an apparently unrelated node is logically near (within two simple logic gates up or downstream from) the fault site. To illustrate, a stuck-at candidate that identifies one of a pair of shorted nodes is considered sa-partial. A bridging candidate that pairs the correct stuck-at node with another is bf-partial, as is a bridging candidate that only correctly identifies one of a pair of shorted nodes. For open defects, either stuck-at fault on the open nodes is considered sa-exact. 86 Defect 1.1 Top Candidate bf-partial Num. Tied for First 2 Classification Partial success 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 sa-exact sa-exact sa-exact sa-exact sa-exact sa-exact sa-exact sa-exact 1 > 300 9 3 > 100 1 1 2 Success Ambiguous Success Success Ambiguous Success Success Success 1.10 sa-exact 3 Success 1.11 1.12 sa-exact sa-exact 1 2 Success Success Notes Significantly (17%) non-stuck-at behavior; 8 of top 10 candidates are bf-partial Next 100 candidates are bf-partial Other 2 top candidates are near Next 100 candidates all bf-partial, all same score Next 100 candidates all bf-partial, all same score Other top candidate is near; next 100 candidates are bf-partial, all same score Other top candidates are near; next 100 candidates are bf-partial, all same score Next 100 candidates all bf-partial, all same score Other top candidate is near; next 100 candidates are bf-partial, all same score Table 6.2. Diagnosis results for round 1 of the experiments: twelve stuck-at faults. There are four diagnosis classifications for each experiment. A success indicates that an exact match is found in the top 10 candidates. A partial success indicates that at least a partial match, but no exact match, is contained in the top 10. A diagnosis is a failure if no exact or partial matches rank in the top 10. In any event a diagnosis is considered ambiguous if the top 10 matches, or more, all receive the same score. An ambiguous diagnosis indicates that more information (such as failing outputs, for example) is needed to distinguish between highly-ranked candidates. A point of detail: The last three bridging defects, defects 2.7 to 2.9, all followed the same scenario, and are considered only partial successes. In all three cases, the FIB bridged the outputs of two inverters, one having a much stronger drive strength than the other. The result in such cases is that fault effects only initiate from the weaker node, the stronger node never being overdriven. All three of the diagnoses reflect this situation: in all three cases, the top candidate is the weaker of the two inverter outputs stuck-at either 1 or 0. Without the dominating node ever being the source of error effects, it is doubtful whether another algorithm could do better looking only at the logic failures of the circuit; perhaps IDDQ diagnosis has a chance of identifying this type of defect. 87 Defect 2.1 2.2 2.3 2.4 Top Candidate bf-exact bf-exact bf-exact bf-partial Num. 
Tied for First 1 1 1 1 Classification Success Success Success Success 2.5 2.6 bf-exact bf-partial 1 1 Success Success 2.7 2.8 2.9 sa-partial sa-partial sa-partial 1 1 1 Partial success Partial success Partial success Notes Defect is a feedback bridging fault Next 9 candidates are bf-partial Second candidate is sa-partial Second candidate is bf-exact; other node in top candidate is near Next three candidates are bf-partial Second candidate is bf-partial, third candidate is bf-exact Dominated node: see text Dominated node: see text Dominated node: see text Table 6.3. Diagnosis results for round 2 of the experiments: nine bridging faults. Defect 3.1 3.2 3.3 3.4 Top Candidate sa-exact sa-exact sa-exact samisleading Num. Tied for First 3 2 2 2 Classification Success Success Success Failure Notes Behavior identical to node stuck-at 1 Behavior identical to node stuck-at 1 Behavior identical to node stuck-at 1 Significantly (21%) non-stuckat behavior; 14th candidate is bf-partial Table 6.4. Diagnosis results for round 3 of the experiments: four open faults. The results indicate that the approach works quite well at accurately diagnosing and distinguishing a mixture of fault types. The one failed diagnosis occurred on the last open defect, when the behavior was significantly (21%) different than the signature for the node stuck-at 1. Whether this behavior is typical or not is an area of further research; answering this question may lead to a refinement of the acceptance criteria for stuck-at candidates, or possibly the addition of another fault model specifically for open defects. 6.12 Experimental results – Texas Instruments ASIC I supplied the mixed-model diagnosis software to a team of engineers at Texas Instruments, who independently performed another round of similar experiments [SaxBal98]. As with the HewlettPackard experiment, samples of a production ASIC were modified by inserting defects with a focused ion beam. A total of sixteen diagnoses were performed, the first two on signal lines shorted to power and ground, and the next fourteen on signal-to-signal shorts. 88 An interesting aspect of this experiment is that the TI engineers also ran the most widely used commercial diagnosis tool, Mentor Fastscan, on the same failures. It is widely believed that Fastscan implements the W&L algorithm described in Chapters 2 and 4. This provides a useful comparison of the effectiveness of these two algorithms on some interesting defects in real-world circuits. Table 6.5 presents the results from these experiments. The first column gives the id used by the TI engineers to identify each FIB’d circuit. The second column reports the number of nodes identified by Fastscan in its diagnosis, either two, one, or none. A Fastscan diagnosis consists of a list of stuck-at faults, and can be of any length. Unfortunately, the TI engineers did not report the Fastscan diagnosis sizes for these trials, or in what position the bridged nodes appeared in the list. (Fastscan orders its stuck-at candidates by the number of failing patterns explained, or matched, by each fault.) The third column gives the results from the mixed-model probabilistic algorithm, using the same match types defined in the last section. The last column provides notes about some defects or diagnoses. With the exception of FIB7, the diagnoses returned by the mixed-model algorithm are superior to that of the commercial tool. 
In eight out of fourteen bridging faults, the mixed-model algorithm gave a better result: in seven cases it identified both nodes when Fastscan could only identify one or neither node, and in the other case it identified one node when Fastscan implicated none. In almost all cases the mixed-model algorithm did a good job of differentiating stuck-at vs. bridging fault behaviors, something Fastscan cannot do. And, the TI engineers noted that the bf-partial diagnoses could likely have been improved with a better test set, as most of these defects involved pattern-dependent sensitization. 89 ID Fastscan FIB-sa1 FIB-sa2 FIBx exact exact one node Mixed-model algorithm sa-exact sa-exact sa-partial FIBy FIB1intra FIB2intra FIB3inter FIB4intra FIB4inter FIB5intra FIB5inter FIB6intra FIB6inter FIB7 FIB8 FIB9 none one node none one node none one node one node one node one node two nodes two nodes one node one node bf-partial I bf-exact I sa-misleading bf-exact I bf-exact I bf-partial bf-exact I bf-exact I bf-exact I bf-exact bf-partial bf-partial bf-exact I Notes Pattern-dependent dominance bridge; behaves like intermittent stuck-at fault on one node Bridge between two inputs in XOR tree Dominance bridge Only one node sensitized by tests Dominance bridge Feedback bridging fault Dominance bridge Table 6.5. Diagnosis results for TI FIB experiments: 2 stuck-at faults, 14 bridges. 6.13 Conclusion This chapter describes an approach to model-based fault diagnosis built around a probabilistic evaluation of a set of fault candidates given all the available data about a failing circuit. The introduction of probability as a common measure of diagnostic inference allows different algorithms to process different sets of data, using different sets and types of candidates, to produce a single diagnostic result. 90 Chapter 7. IDDQ Fault Diagnosis The mainstream of VLSI fault diagnosis has been concerned with logic failures at circuit outputs or internal scan elements, as has this thesis up to this point. The reasons for this are many, but perhaps most important is that in the field of test, logic-related fault models are dominant – especially the single stuck-at fault model. The emergence of the IDDQ fault model, in which the presence of a defect causes an abnormally high amount of current to flow in the circuit in a normally quiescent or static state, has spurred interest in using this fault model for fault diagnosis. There are several apparent advantages to performing diagnosis with I DDQ fault models and information. First, many chips submitted to failure analysis do not have hard logic fails, but may fail only IDDQ tests. Second, the IDDQ fault model has the advantage of high observability: unlike the logiclevel fault models, the effect of the fault does not have to propagate through many levels of logic but only to the point of current monitoring. Therefore, I DDQ diagnosis can differentiate between defects that are indistinguishable with other logic-level fault models, and the confounding effects of indeterminate propagation that plague other diagnosis techniques are generally eliminated [AckMil92]. Third, and perhaps most important, considering IDDQ fault models in diagnosis adds another source of information, generally orthogonal to traditional diagnosis results, that can be used to refine or add confidence to an existing logic-based diagnosis. On the other hand, IDDQ diagnosis presents its own challenge of ambiguity. 
Rather than the simple pass-fail results of a logic-based test, IDDQ diagnosis algorithms must interpret the results of a current measurement (of perhaps questionable precision and accuracy) as either a passably low current value or a defectively high current value. As Nigh, Forlenza, and Motika said, “it should be obvious that determining an IDDQ diagnostic current threshold […] is not simple” [NighFor97]. These authors were involved with the Sematech test experiment [NighNee97a, NighNee97b], in which IDDQ diagnosis was performed on a large number of failing chips [NighVal98]. While IDDQ diagnosis generally proved to be accurate and useful, it required a great deal of manual intervention: the pass-fail current threshold for each chip had to be repeatedly adjusted until a perfect diagnostic match was found. The work presented in this chapter was intended first and foremost to provide an answer to the difficulties of that experiment.

7.1 Probabilistic Diagnosis, Revisited

As stated in the last chapter, the diagnosis problem is by its nature probabilistic. This thesis argues for acknowledging this fact openly, and reflecting it explicitly in the design of diagnosis algorithms. A diagnosis algorithm that calculates and expresses its diagnoses probabilistically has advantages both in the quality and the usability of its results. The results are more usable, because they can be directly applied as inputs to another algorithm. The results are higher quality, because they can adjust to the inherent complexities of fault diagnosis.

The most common reasons for the unfortunate complexity of the diagnostic process are two: noise and uncertainty. Noise comes from many different sources during diagnosis. First, the measurements taken at the tester are subject to human or mechanical frailties, meaning that the pass-fail results obtained may contain errors or may not be completely reliable. Second, this data must be stored and transmitted to the diagnosis program; given the size and complexity of modern circuits, the data reported from the tester is commensurately large and complex and is subject to noise and data loss from many sources. In fact, many dictionary organizations are deliberately lossy to achieve aggressive compression targets, often at the expense of diagnostic utility [CheLar99].

Uncertainty seems to be inherent in the nature of the diagnosis problem. Fault simulators are used to predict behavior and build fault dictionaries, but even for apparently simple fault models they often mispredict or fail to predict the resulting behavior from actual defects. Also, a pass or fail response from the defective chip may have poor repeatability, or may itself be open to interpretation, both increasing the uncertainty of any resulting diagnosis. Nowhere in the field of test is the uncertainty of result more keenly felt than in IDDQ testing. Therefore, nowhere in the field of fault diagnosis is a probabilistic approach more necessary than during the diagnosis of IDDQ faults.

7.2 Back to Bayes (One Last Time)

Much of the probabilistic diagnosis approach presented so far has been built on Bayesian prediction or Bayes decision theory. A Bayesian predictor scores the possible causes of an effect according to the probability of a cause given the effect. To recap the notation and terminology, a cause or fault candidate is denoted by ci, and the effect or behavior by b. Bayes decision theory says the most likely candidate given a certain behavior is that for which p(ci | b) ≥ p(cj | b) for all j ≠ i.
The posterior probability p(ci | b) for any candidate is determined by Bayes Rule:

    p(ci | b) = p(ci) p(b | ci) / Σ (k = 1..n) p(ck) p(b | ck) ,    (1)

where p(ci) is referred to as the prior probability (or a-priori probability) of candidate ci and p(b | ci) is referred to as the conditional probability of b given the candidate ci. Since both candidates and behaviors can be represented as sequences of responses to the test set (their fault signatures), if the probabilities of correct and incorrect predictions are assumed to be independent then the conditional probabilities can be expressed as

    p(b | ci) = Π (j = 1..m) p(bj | ci,j) ,    (2)

where m is the number of test vectors, and ci,j is the predicted response of candidate i to test j.

The last chapter presented a mixed-model fault diagnosis algorithm where each p(bj | ci,j) was (in effect) an estimate of the accuracy of a fault candidate's prediction for a particular test vector. So, for example, if a particular fault candidate predicted a failing response with a confidence level of 90%, its p(observed fail | predicted fail) is 0.90, and its p(observed pass | predicted fail) is 1.0 − 0.90, or 0.10. These estimates, then, are applied with Bayes Rule to determine the total posterior probability for each candidate, and the resulting diagnosis consists of faults sorted by decreasing probability. The algorithm used a uniform prior for all candidates – the a-priori probability of any defect was equal to that of any other defect.

7.3 Probabilistic IDDQ Diagnosis

In logic-based diagnosis, the greatest source of uncertainty is fault behavior, specifically the manner in which fault effects propagate (or do not propagate) from the site of a defect to where they are eventually observed at primary outputs or at scan elements. Because of this uncertainty, even the best simulators cannot perfectly predict what behaviors will result from a given fault model and instance. Conversely, IDDQ fault models are generally not subject to the same vagaries of prediction or propagation: if a modeled defect is present, it generally produces an abnormally high current level that theoretically should be observable. However, the “theoretically” is important here, as it is observation, or rather interpretation, that is the most difficult obstacle for IDDQ diagnosis. Because of the high background leakage currents that occur in modern VLSI circuits, it can be difficult to distinguish a high, or failing, IDDQ measurement from a low, or passing, one. In fact, the most important conditional probability for IDDQ diagnosis is whether an observed quiescent current level is an indication of an activated defect or not.

In the fields of machine learning or statistical estimation, this problem could be addressed with the following experiment. Start with a set of defective chips, each of which contains just a single one of a set of known fault candidates. Apply all IDDQ tests to each chip, and for each test record the IDDQ value along with the identity of the fault. From this data, then, determine the following distribution:

    p(observed IDDQ on test j | fault Fi is present), or p(Oj | Fi)

This information can then be used in Equation 2 above, substituting for p(bj | ci,j). But, the problem for IDDQ diagnosis is that the experiment just described is completely impractical. It is not practical to determine the identity of a large enough number of defects to gather these statistics.
It may be possible, however, to estimate the distribution of good-circuit IDDQ values over all tests, as well as the distribution of faulty-circuit IDDQ values over all tests and faults. Scenarios for producing these estimates are presented in subsequent sections of this chapter. This information can be represented by

    p(observed IDDQ | a fault is activated), or p̂(O | A), and
    p(observed IDDQ | no fault is activated), or p̂(O | ¬A)

One more available piece of information is an estimate of the accuracy of the IDDQ fault simulator (and fault model). As is the case for logical faults, we can estimate the probability of misprediction and nonprediction for any fault on any test. The difference for IDDQ fault models is that the prediction is not fail or pass, but rather fault activation or non-activation (here Âi,j denotes a prediction that fault i is activated by test j, and Ai,j denotes actual activation):

    p(fault i is not activated on test j | fault i is present and activation is predicted for test j), or p(¬Ai,j | Âi,j, Fi), or Mi,j, and
    p(fault i is activated on test j | fault i is present and activation is not predicted), or p(Ai,j | ¬Âi,j, Fi), or Ni,j

Since per-fault and per-test information about prediction error is usually not available, Mi,j and Ni,j can be estimated by a single M and N for all candidates and tests.

Given this information, the unknown distribution p(Oj | Fi) needed for the Bayesian estimator can be replaced with the estimations p̂(O | A) and p̂(O | ¬A):

    p(Oj | Fi) = p(Oj ∧ Fi) / p(Fi)
               = p(Oj ∧ (Ai,j ∨ ¬Ai,j) ∧ Fi) / p(Fi)
               = [p(Oj ∧ Ai,j ∧ Fi) + p(Oj ∧ ¬Ai,j ∧ Fi)] / p(Fi)            (Ai,j and ¬Ai,j are mutually exclusive)
               = [p(Oj | Ai,j ∧ Fi) p(Ai,j ∧ Fi) + p(Oj | ¬Ai,j ∧ Fi) p(¬Ai,j ∧ Fi)] / p(Fi)
               ≈ [p̂(O | A) p(Ai,j ∧ Fi) + p̂(O | ¬A) p(¬Ai,j ∧ Fi)] / p(Fi)   (substitute estimations)
               = p̂(O | A) p(Ai,j | Fi) + p̂(O | ¬A) p(¬Ai,j | Fi)             (3)

The probabilities p(Ai,j | Fi) and p(¬Ai,j | Fi) are the probabilities of a fault's activation and non-activation, respectively, for a particular test given the fault's presence. The probabilities are not known exactly, but can be estimated from the rates of misprediction and nonprediction mentioned earlier. Since a candidate can predict either a pass (no fault activation) or fail (fault activation) for test j, Equation 3 can be decomposed into two conditions:

    p(Oj | Fi) = p̂(O | A) p(Ai,j | Fi) + p̂(O | ¬A) p(¬Ai,j | Fi)                               (3)
               = p(Âi,j)·[p̂(O | A) p(Ai,j | Âi,j, Fi) + p̂(O | ¬A) p(¬Ai,j | Âi,j, Fi)]
                 + p(¬Âi,j)·[p̂(O | A) p(Ai,j | ¬Âi,j, Fi) + p̂(O | ¬A) p(¬Ai,j | ¬Âi,j, Fi)]
               = p(Âi,j)·[p̂(O | A)(1 − M) + p̂(O | ¬A)·M] + p(¬Âi,j)·[p̂(O | A)·N + p̂(O | ¬A)(1 − N)]   (4)

In the previous chapter on probabilistic logic diagnosis the values of M and N were enough to define the per-test conditional probabilities for each fault candidate and test result. Now, in the case of IDDQ diagnosis, there are two additional conditional probabilities to be calculated or estimated, reflecting the uncertainty in interpreting the test results as either pass or fail. There exists a certain amount of uncertainty in interpreting the results of logic tests as well, but it is dominated by the much more serious and common concern of model prediction error: it is much more likely that, for a stuck-at or bridging fault, the simulator's pass-fail prediction will be wrong than it is that the result of a test will be misinterpreted.
For this reason, I omitted explicit mention of this type of error in the previous chapter on logic diagnosis and its calculations, instead concentrating on the estimates of model prediction error. For IDDQ diagnosis the emphasis is reversed: an IDDQ fault prediction of activation or non-activation is assumed to be wrong very rarely, due to the simplicity of the models (only simple sensitization is required, and propagation is assumed). Therefore, for IDDQ diagnosis the prediction error can be ignored, and Equation 4 reduces to

    p(Oj | Fi) = p(Âi,j)·[p̂(O | A)(1 − M) + p̂(O | ¬A)·M] + p(¬Âi,j)·[p̂(O | A)·N + p̂(O | ¬A)(1 − N)]
               = p(Âi,j)·p̂(O | A) + p(¬Âi,j)·p̂(O | ¬A)                                          (5)

Now, the salient per-test conditional probabilities for IDDQ diagnosis have been reduced to the estimated distributions of IDDQ currents during fault activation and non-activation. The following sections of this chapter present different diagnostic scenarios and methods for how these estimates can be created in each scenario.

Previous researchers, notably Gattiker and Maly [GatMal97] and Thibeault [Thi97], have proposed diagnosis algorithms based on IDDQ test results and probability assessments. In these approaches the defining element is the use of levels of IDDQ measurements to identify candidate fault classes (such as faults on 3-input NAND gates or 2-input NORs) in the circuit. These classes can then be used to refine the diagnosis to individual fault instances. In my approach I am concerned entirely with, and deal directly with, fault instances (a fault on a specific circuit node or nodes). However, an algorithm that computed probabilities for fault classes could be used to provide the prior probabilities for the individual faults used in this algorithm.

7.4 IDDQ Diagnosis: Pre-Set Thresholds

In an ideal world, the identification of IDDQ thresholds for test and diagnosis would be trivial: the definitions of abnormal and normal current values would be fixed and unchanging. For a particular chip, a fixed threshold could be established that would always divide passing from failing IDDQ. Consider the following graph, an excerpt of actual IDDQ measurements from the Sematech experiment:

Figure 7.1. IDDQ results for 100 vectors on 1 die (Sematech experiment).

It is possible that for this chip, a threshold value of 100 µA, indicated on the chart by a bold line, would always serve as a viable pass-fail threshold for IDDQ measurements. In this ideal situation, the assignment of the conditional probability p̂(A | O) would be easy:

Figure 7.2. Assignment of a binary p̂(A | O) for the ideal case of a fixed IDDQ threshold.

The inset graph on the left, rotated by 90 degrees, indicates the probability that a given IDDQ measurement indicates a defect activation in this ideal case. This is the reverse conditioning from that required for Equation 5, but the distribution p̂(O | A) can be computed by application of Bayes Rule:

    p̂(O | A) = p̂(O) p̂(A | O) / p̂(A)

The values of p̂(O) and p̂(A) can either be estimated from the sample values or as uniform distributions. In any event, since the pass-fail threshold is fixed and unambiguous, the probabilities are similarly definite: p̂(O | A) = 0.0 for observed currents below 100 µA, and p̂(O | A) = 1.0 for anything above 100 µA. A similar situation is true for p̂(O | ¬A).
This type of extreme conditional probability assignment, however, leads to posterior candidate assignments of either 0.0 or 1.0 – nothing less than a perfect match of candidate to behavior will be assigned a non-zero posterior probability. A less simplistic p̂(A | O) might be that shown in Figure 7.3. In this case, a piecewise linear probability assignment is used, where p̂(A | O = 0.0) = 0.0, p̂(A | O = max. IDDQ) = 1.0, and p̂(A | O = threshold IDDQ) = 0.5. This assignment, of course, assumes that the maximum IDDQ measurement indicates the presence of a defect. Application of Bayes Rule with constant or uniform p̂(O) and p̂(A) will result in the same distribution for p̂(O | A), scaled by a constant factor.

Figure 7.3. Assignment of a linear p̂(A | O) with a fixed IDDQ threshold.

The diagnostic implication of the p̂(A | O) shown above is that current measurements well below the fixed threshold are considered much less likely to indicate the presence of a defect, and those at the maximum almost certainly indicate defectively high current. The choice of linearity is arbitrary but common: in a traditional non-probabilistic diagnosis system the equivalent scoring mechanism would be to give a candidate fault one point for every µA measured below the threshold when it predicts a pass, and subtract one point per below-threshold µA if it predicts a fail. Similarly, a candidate would get one point for every µA above the threshold for predicting a fail and would subtract one per µA for predicting a pass. Such a scoring mechanism would result in the same candidate ordering, with the same relative assignment of scores, as a Bayesian predictor (assuming a uniform prior) that uses the linear conditional probabilities illustrated in Figure 7.3.

Another possibility for estimating p̂(O | A) and p̂(O | ¬A) is to assume that failing (activated) and passing (non-activated) IDDQ measures are normally distributed. This is, in fact, the general assumption behind many IDDQ testing theories. If the only data available is the pre-set threshold and the actual IDDQ results from the tester, then p̂(O | A) and p̂(O | ¬A) can be generated by estimating the mean and variance of two univariate normal distributions from the sets of sample data. The maximum likelihood estimates for the mean and variance in this case are just the sample mean and variance:

    µ̂ = (1/n) Σ (k = 1..n) xk
    σ̂² = (1/n) Σ (k = 1..n) (xk − µ̂)²

The resulting estimated distributions on the sample data presented before would look like that shown in Figure 7.4 (the illustrated variances are not to scale).

Figure 7.4. Assignment of normally-distributed p̂(O | A) and p̂(O | ¬A).

An important assumption in the estimated p̂(O | A) shown above is that the distribution is indeed univariate. Research on IDDQ failures has demonstrated that IDDQ failures usually involve multiple sensitized defect paths with various IDDQ levels, and so generally cluster into multiple normal distributions. This will be addressed in Section 7.6 of this chapter. It should also be noted that there are a wide variety of more powerful statistical and machine learning techniques available to extract and test mixture densities of the sort encountered in IDDQ test results.
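A minimal sketch of the simple univariate estimate just described is given below; it assumes the measurements have already been split at a threshold into passing and failing samples, and the function names are hypothetical. The two estimated densities can then be used directly as p̂(O | A) and p̂(O | ¬A) in Equation 5.

    import math

    def normal_mle(samples):
        # Maximum likelihood estimates: sample mean and (biased) sample variance.
        n = len(samples)
        mean = sum(samples) / n
        var = sum((x - mean) ** 2 for x in samples) / n
        return mean, var

    def normal_pdf(x, mean, var):
        return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

    # Per-test conditional probability of Equation 5 for one candidate and one test;
    # activation_predicted is the candidate's (deterministic) prediction for this test.
    def p_obs_given_fault(iddq, activation_predicted, fail_stats, pass_stats):
        mean, var = fail_stats if activation_predicted else pass_stats
        return normal_pdf(iddq, mean, var)

Since the densities are continuous, the values returned are useful for comparing and ranking candidates rather than as absolute probabilities.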
I have chosen the statistical assumptions and clustering algorithm described above for their simplicity, but more complicated methods may prove useful or effective, and may be a part of further research.

7.5 IDDQ Diagnosis: Good-Circuit Statistical Knowledge

The ideal case of a fixed threshold is unfortunately something of a rarity for modern circuits. A fixed threshold is often difficult or impossible to set, as the normal variations of a VLSI process can result in a wide range of defect-free or acceptable IDDQ current values from die to die. In a nearly ideal world, enough information would be available for diagnosis to account for these variations and adjust its assessments of the tester data accordingly.

If enough time, effort, and expense are dedicated to the job it may be possible to adequately define the defect-free IDDQ characteristics of a single chip and test set. If a sufficient sample of dies covering the range of process variations is tested with the same vector set and the results are analyzed, then it is possible that an expected good-circuit IDDQ distribution can be determined. From this distribution, one can define acceptable ranges for measured IDDQ for both test and diagnosis. The most sophisticated of these techniques examine the relation of the minimum and maximum measured IDDQ per die over many samples, and develop acceptance criteria for the range of measured IDDQ for dies submitted for production testing. One such technique developed at Hewlett-Packard and Agilent Technologies assumes a normal distribution of good-circuit values for the ratio of maximum to minimum IDDQ, and from this establishes a 3σ threshold as an acceptable ratio for test [MaxNei99]. (The ratio of maximum to minimum current is used to compensate for die-to-die variations in IDDQ current.) Figure 7.5 below shows how the statistically-determined value of p̂(O | ¬A) might be applied to the example test data shown before.

Figure 7.5. Determining a pass threshold based on an assumed distribution and the minimum-vector measured IDDQ.

As shown in Figure 7.5, the IDDQ current ratios method defines a pass-fail threshold as a function of the IDDQ measurement at a presumed-minimum vector and a previously established 3σ acceptance limit. The inset curve demonstrates that the same distribution used for testing can be used as the p̂(O | ¬A) distribution necessary for probabilistic diagnosis. The best estimate for p̂(O | A) in this case is probably also a normal distribution, using either the simple univariate method described in the last section, or the multivariate clustering method described in the next.

7.6 IDDQ Diagnosis: Zero Knowledge

The level of preparation and analysis described in the previous section is often not available for every chip that requires fault diagnosis. If neither a fixed threshold nor a statistically based variable threshold is available, then some other mechanism must be employed to distinguish and characterize passing and failing IDDQ measurements for fault diagnosis.

Gattiker and Maly have proposed a method of identifying the presence of defect-induced high current paths in a circuit [GatMal96, GatMal97]. They note that when the IDDQ measurements of a defective chip are ordered by magnitude one or more steps can usually be identified in the resulting graph. Below is the same data set given before, this time with the vectors ordered by increasing current value.
Figure 7.6. The same data given in Figure 7.1, with the test vectors ordered by IDDQ magnitude.

Now a large step is clearly visible in the IDDQ measurements. The value of this identification is based on three assumptions. First, small variations in both normal and abnormal IDDQ values are due to vector-dependent levels of transistor leakage not associated with any defect, and to normal variations of measurement. Second, large variations are due to the activation of different defect-induced current paths. Third, these large variations are several times larger than the small variations arising from transistor leakage or measurement error. Therefore, the presence of a large step in the ordered IDDQ graph suggests the presence of a defect-induced path from power to ground.

If the absence of a step suggests the absence of an activated defect path, then using current signatures as described can separate assumed-passing vectors from assumed-failing vectors without establishing a prior pass-fail threshold. If a large IDDQ step can be identified, all ordered vectors before the first large step can be considered passing vectors, and all ordered vectors after the first large step can be considered failing. Since it is assumed that the small variations in both normal and abnormal IDDQ measurements are due to various transistor leakage paths and to measurement noise, a reasonable conclusion is that the resulting passing and failing IDDQ measurements are normally distributed. Applying these assumptions, and using the presence of a large step to set an IDDQ threshold, an overlay of estimated conditional probabilities p̂(O | A) and p̂(O | ¬A) on the data given above would look something like Figure 7.7.

Figure 7.7. Estimating p̂(O | A) and p̂(O | ¬A) as normal distributions of clustered values.

If there is more than one identified current signature step, then there will be a p̂(O | A) distribution defined for each cluster of failing IDDQ measurements.

In order to automate the process of determining these distributions, two algorithms are necessary: one to define groupings of passing and failing IDDQ values, and one to determine a mean and variance for each distribution thus defined. To begin, the assumptions of the zero-knowledge case are as follows:

1. No statistical information about the circuit or process in question is available; no data is available for diagnosis except the IDDQ tester results themselves, except perhaps a prior distribution on the fault candidates.
2. There are at least two passing (normal IDDQ levels) test results and two failing (abnormal IDDQ levels) test results.
3. The lowest IDDQ measurement is assumed to be a pass, and the highest is assumed to be a fail.

Proceeding from these assumptions, the first task is to divide the test results into groups of passing and failing test vectors. Using the current signature concept, an algorithm is required to identify large steps in the sorted IDDQ measurements. One such algorithm is actually a rather simple version of the hierarchical clustering algorithms commonly used in pattern classification [DudHar73], and can be described as follows (a brief sketch of this procedure is given below):

1. Sort the vectors by increasing IDDQ value, initially placing all vectors in a single cluster.
2. Break the cluster by the single largest IDDQ step value.
3. For each resulting cluster, calculate the average and largest IDDQ step values.
4. If the largest IDDQ step is K times larger than the cluster average, break the cluster at that step.
5. Loop to #3 until no new clusters have been formed, or until the maximum number of clusters is reached.
6. Define the lowest (by IDDQ values) cluster as passing (no defect activation), all other clusters as failing.

A value must be established for K, the multiplier at which an IDDQ step suggests an activated defect path instead of a normal measurement or leakage variation. Empirical evidence suggests that such steps are large: for the experiments described in this chapter a value of 10 was used. The second and remaining task is to establish distributions for each passing and failing cluster of measurements. The maximum likelihood estimates described in Section 7.4 can be used to estimate the mean and variance, from the observed sample data, of the normal distribution of each cluster. The data within each cluster is assumed to be univariate, simplifying the calculations.
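The sketch below is one way the six-step procedure above might be implemented; the function names, the maximum cluster count, and the handling of ties are assumptions of the example, with K defaulting to the value of 10 used in these experiments.

    # Cluster sorted IDDQ measurements by breaking at disproportionately large steps.
    def split_cluster(cluster, K, force=False):
        # Return two sub-clusters if the largest step qualifies (or force is set).
        if len(cluster) < 2:
            return None
        steps = [cluster[i + 1] - cluster[i] for i in range(len(cluster) - 1)]
        largest = max(steps)
        average = sum(steps) / len(steps)
        if force or (average > 0 and largest >= K * average):
            cut = steps.index(largest) + 1
            return [cluster[:cut], cluster[cut:]]
        return None

    def cluster_iddq(measurements, K=10.0, max_clusters=8):
        values = sorted(measurements)                                  # step 1
        clusters = split_cluster(values, K, force=True) or [values]   # step 2
        changed = True
        while changed and len(clusters) < max_clusters:                # steps 3-5
            changed = False
            for i, c in enumerate(clusters):
                parts = split_cluster(c, K)
                if parts:
                    clusters[i:i + 1] = parts
                    changed = True
                    break
        # Step 6: the lowest cluster is taken as passing, the rest as failing.
        return clusters[0], clusters[1:]

In the common two-cluster case this reduces to a single pass-fail threshold at the largest step, matching the current-signature behavior described above.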
7.7 A Clustering Example

The data shown in all the graphs of this chapter, taken from the same chip, was simplified for purposes of clarity and illustration. The actual IDDQ data consisted of nearly twice as many vectors (196 vs. 100) with more than one large step apparent in the measurements. The full data set is displayed in Figure 7.8 below. There are several obvious steps in this current signature. Applying the clustering algorithm described earlier results in the cluster assignments shown in Figure 7.9. This example, in fact, represents something of an anomaly among the Sematech data. Almost all of the other die have IDDQ measurements that produce only two clusters, one passing and one failing, making the process of setting thresholds and assigning conditional probabilities very simple.

Figure 7.8. Full data set of 196 ordered IDDQ measurements.

Figure 7.9. Division of the ordered measurements into clusters.

7.8 Experimental Results

As I stated earlier, the main purpose of this work is to replicate the Sematech diagnosis experiments with improved diagnostic methods and algorithms. Phil Nigh of IBM has supplied UCSC with IDDQ test results and defect information for sixteen chips that were submitted to failure analysis after IDDQ diagnosis. In 15 of the 16 cases, Phil Nigh reported successful diagnoses by manually adjusting IDDQ pass-fail thresholds until a perfect match was found in one of the two candidate dictionaries: a pseudo-stuck-at and a bridging fault dictionary, both of which contained pass-fail IDDQ signatures. The bridging fault candidate list was derived from an examination of the adjacency of same-metal signal wires, and supplemented by same-gate-input bridges. There were approximately 710K faults and 300K unique signatures (each representing a fault equivalence class) in the pseudo-stuck-at dictionary, and approximately 560K faults and 220K signatures in the bridging fault dictionary.

As part of the Sematech experiment, these diagnosis results were verified by physical failure analysis. I was able to verify these results by converting the IDDQ tester results into pass-fail signatures, using the reported thresholds, and using a simple diagnosis algorithm to find the same candidates. Next, I fed the raw IDDQ tester results into the zero-knowledge clustering algorithm described earlier, and from there into the probabilistic diagnosis algorithm, using a uniform prior. In all cases but one the clustering algorithm set pass-fail thresholds at the same (sorted) vector index set by the manual method.
For all fifteen of these chips, the diagnosis program identified exactly the same faults, as the highest-ranked candidates, as were previously verified by failure analysis. (The identification of these faults as "successful" matches to the defects was done by the team at IBM, using their criteria for matching and verification.)

In one case (HGQ0810/2890), Phil Nigh was unable to correlate the implicated candidate faults with the results from physical analysis: the best candidates, with perfect signature matches, had no apparent relation to the defect site. These same five faults showed up at the top of the probabilistic diagnosis, but a bridge containing the defective node did appear in the next five candidates. This can only be considered a partial success, however, both because of the relatively low ranking of the bridge and because the more appropriate pseudo-stuck-at candidate was not included in the diagnosis. This particular chip, along with a few others of uncertain or unknown physical verification, remains a subject of ongoing research.

Wafer ID/Chip ID   Manual   Automated   Defect Found
QYQ0801/3488       Y        Y           Metal-metal short
QEQ0713            Y        Y           Poly-GND short
BJQ0611/3392       Y        Y           Poly-poly short
YXQ0810/2274       Y        Y           Gate-drain short
IAQ1405/2795       Y        Y           Poly-Nwell short
ITQ1312/2284       Y        Y           PFET 'poor drive', 13 transistors
ITQ0214/1787       Y        Y           Metal-metal short
RUQ0418/1947       Y        Y           Source-drain leakage
GJQ0908/3382       Y        Y           Poly-poly short & poly-GND short
R5Q0306/3053       Y        Y           Source-drain & drain-substrate shorts
R6Q1608/3062       Y        Y           Poly-Vdd short
ILQ0209/3498       Y        Y           PFET: poly-Nwell leak, poor drive
LJQ1510/2177       Y        Y           Diffusion-substrate leak, 12 transistors
BJQ0908/1725       Y        Y           PFET diffusion anomaly
IXQ1508/4835       Y        Y           Poly-metal short
HGQ0810/2890       N        Partial     Poly-Nwell short

Table 7.1. Results on Sematech defects. (The Manual and Automated columns indicate a successful diagnosis by each method.)

Chapter 8. Small Fault Dictionaries

Up to this point, this thesis has dealt exclusively with the theory of fault diagnosis, and has proposed several algorithms consistent with a probabilistic and precise diagnosis methodology. But, one of the self-proclaimed principles of this thesis is that a diagnosis system should be practical, especially considering the enormous data sizes involved in modern circuits. This chapter addresses one of the main data problems in fault diagnosis, that of the size of fault dictionaries.

Not all diagnosis algorithms use fault dictionaries; in fact, the choice of whether or not to use dictionaries is often orthogonal to the methods of matching and scoring candidates. But, since almost all algorithms can use fault dictionaries, and some situations mandate their use, making dictionaries practical is an interesting and important topic.

It may be useful to first recap some of the background and terminology introduced in Chapter 2. Traditional fault diagnosis, often referred to as cause-effect diagnosis, compares the simulated behaviors of a set of faults with the defective behavior of the chip on the tester. The simulated behavior of a fault is usually called its fault signature; a complete record, consisting of the list of failing vectors and the outputs (for each vector) at which errors are detected, is called a full-response fault signature. If the simulated behaviors are collected and stored before diagnosis, the result is known as a fault dictionary. The problem with dictionary-based diagnosis schemes is the enormous amount of data that is required, both to store and to process.
The common alternative to using fault dictionaries is to perform fault simulation at the time of diagnosis, removing the storage requirement [WaiLin89]. In addition, a process known as path tracing [AbrBre80, RajCox87] can be employed to trace back from erroneous outputs and implicate a cone of logic, thereby dynamically creating a faultlist for limited simulation.

And yet, despite its onerous data requirements, dictionary-based diagnosis remains popular for several reasons. First, since fault simulation is performed as a part of test generation, most test generators can create a fault dictionary (usually stuck-at) as a standard option. A second and more practical reason is that using a fault dictionary removes the dependency of the diagnosis program on the circuit netlist and the messy details of simulation. It can often be difficult, long after a circuit has taped out and been archived, to restore the final versions of all necessary components of the circuit, from the main netlist to subsidiary designs to the full set of library files. It can also often be difficult to reliably restore and faithfully simulate the tester program. For these reasons, dictionaries are often very popular with failure analysis teams who, often far removed from design and test, appreciate the fact that all the required diagnostic information about a circuit is encapsulated into a single data file. Finally, dictionary-based diagnosis can often provide a good result very quickly, simply because the fault simulation work has been done ahead of time and is therefore amortized over many diagnosis runs. This aspect is especially significant for high-volume situations in which a large number of parts must be diagnosed, and in cases where a quick diagnostic result is desired.

In this chapter, I present a method of addressing the major problem in dictionary-based diagnosis, namely the size of fault dictionaries. I will first examine the components of the data involved in fault diagnosis, and the costs and benefits of each. I will then propose a strategy for approximating the information content of full-response dictionaries at a minimum cost. Finally, I will begin to develop a new approach of low-resolution diagnosis, in which a conscious trade-off is made between data size and precision. All of this is an attempt to postpone, for a while at least, the widely expected demise of dictionary-based fault diagnosis.

8.1 The Unbearable Heaviness of Unabridged Dictionaries

In a classic full-response fault dictionary, the detection data for an individual fault consists of the test vectors for which it is detected and the outputs (primary circuit outputs or scan flops) to which the fault is propagated for each detecting test vector. If there are f faults in the fault list, v test vectors, and o outputs, the total number of bits required for an uncompressed (no data loss) dictionary is f*v*o. Different encodings of this data, considering the relative number of faults, vectors, and outputs, as well as the number of detections, can result in very different dictionary sizes for the same data [BopHar96]. For purposes of a generalized comparison, we will leave aside such considerations and focus on the raw number of bits of data in a full dictionary. For full-response dictionaries, this number can be truly enormous and completely impractical for modern circuits.
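To give a rough sense of scale, the short computation below uses the largest industrial circuit that appears later in this chapter (Ind-B): its fault and output counts are taken from Tables 8.3 and 8.6, and its vector count is inferred from the bit totals in Table 8.1. The snippet is purely illustrative.

```python
# Ind-B, one of the industrial circuits used later in this chapter.
f = 18_654      # faults (Table 8.3)
o = 20_042      # outputs (Table 8.6)
v = 2_486       # test vectors, inferred from the Table 8.1 bit counts

full_response_bits = f * v * o            # uncompressed full-response dictionary
print(full_response_bits)                 # 929,424,581,448 bits
print(full_response_bits / 8 / 2**30)     # roughly 108 GiB of raw signature data
```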
(In addition, this chapter will ignore the topic of data compression and such algorithms and programs as Lempel-Ziv, Huffman coding, gzip, etc. Data compression, when applied to fault dictionaries, addresses the question of how the detection data is stored. This chapter, on the other hand, will address what data is stored in a fault dictionary. In all cases, data compression algorithms can be applied to the various data sets presented here, but such compressed dictionaries cannot usually be used for diagnosis without first uncompressing them, a serious disadvantage for very large data sets.)

Several techniques have been applied to reduce the data requirements of the full-response fault dictionary. Most involve some compaction, or loss of data from the original. So-called drop-on-k dictionaries do not record every detection in the test set, but stop after a certain number of detections. These dictionaries, however, are of questionable utility for fault diagnosis [CheLar99]. The most commonly-used compaction technique for fault dictionaries is the pass-fail dictionary, in which the per-vector output data has been completely removed and the results of each test are expressed as a single bit: 0 for no detection, 1 for detection at any output. Pass-fail dictionaries are often relatively small, requiring f*v bits, and are in some situations quite usable for fault diagnosis.

The problem with using pass-fail dictionaries is, of course, that all of the failing output data has been lost. This information can be very useful in distinguishing between fault candidates that fail the same set of tests. In addition, considering only faults in the input cones of failing outputs can usually significantly reduce the candidate space. The bottom line is that a pass-fail dictionary usually produces a much lower resolution diagnosis, one in which many candidates receive the same score and are effectively indistinguishable.

To demonstrate this, I ran stuck-at diagnosis experiments on the ISCAS-85 circuits and four industrial circuits. The entire stuck-at faultlist was simulated and diagnosed using both the full-response and pass-fail dictionaries for each circuit. Table 8.1 reports the number of faults, with equivalent scores, that were ranked #1 using the full-response (FR) and pass-fail (PF) dictionaries. In all cases, the correct match will be one of these top-ranked faults. (The correct match, and all top-ranked faults, will get a "perfect" matching score.) The table also notes the number of bits contained in each dictionary. In some cases the difference in resolution is quite dramatic, especially for the larger circuits in which more data has been lost by removing the output information. But equally or more dramatic is the difference in the number of bits required for each type of dictionary. The first goal of this chapter, then, is to find some way to re-introduce the obviously useful output information into a fault dictionary, while keeping the size of the dictionary to pass-fail-sized numbers.

Circuit   FR faults ranked #1   FR bits (f*v*o)     PF faults ranked #1   PF bits (f*v)
C432      2.29                  191,142             2.80                  27,306
C499      1.17                  901,120             1.17                  28,160
C880      1.61                  1,512,864           1.66                  56,032
C1355     1.67                  2,936,192           1.71                  91,756
C1908     1.82                  2,966,700           1.99                  118,668
C2670     2.24                  18,141,184          3.04                  283,456
C3540     2.03                  8,963,724           2.10                  407,442
C5315     1.89                  73,878,966          2.07                  600,642
C6288     1.33                  7,207,680           1.40                  225,240
C7552     1.63                  131,224,800         2.12                  1,226,400
Ind-A     2.74                  232,836,120,000     5.49                  15,522,408
Ind-B     2.33                  929,424,581,448     2.91                  46,373,844
Ind-C     2.51                  297,857,813,000     51.0                  21,271,000
Ind-D     1.91                  9,077,621,646       2.86                  333,822

Table 8.1. Size of top-ranked candidate set (in faults) and total number of signature bits.
8.2 Output-Compacted Signatures

Consider the contents of a typical fault signature for a single candidate (Figure 8.1). A fault signature can be thought of as a matrix of bits, in which each row represents the pass (0) or fail (1) response, at an individual output, to each test vector. The bits in each column represent the outputs at which the fault will be detected for a particular vector. There are therefore v columns and o rows in this view of a fault signature, and there are f such (v*o)-sized matrices in the full fault dictionary.

      v1  v2  v3  v4  v5  v6  v7  v8  v9 | OC
o1     0   0   0   0   0   0   0   0   0 |  0
o2     0   0   0   0   0   0   0   0   0 |  0
o3     1   1   0   0   0   0   0   0   1 |  1
o4     1   1   0   0   0   0   1   0   1 |  1
o5     0   0   0   0   0   0   0   0   1 |  1
o6     0   0   0   0   0   0   0   0   0 |  0
o7     0   0   0   0   0   0   0   0   0 |  0
o8     0   0   0   0   0   0   0   0   0 |  0
o9     0   1   0   0   0   0   0   0   1 |  1
PF     1   1   0   0   0   0   1   0   1 |

Figure 8.1. Full-response fault signature for a single fault.

The bottom row, labeled PF, is the traditional pass-fail signature for this fault, and is the bitwise-OR of all rows in the table. This pass-fail signature says that the fault is predicted to fail tests v1, v2, v7, and v9. The pass-fail dictionary is constructed of such signatures, one per fault, for an uncompressed storage requirement of f*v bits.

An interesting observation is that the failing output information can be compacted in the same way as the failing vector information. The result of doing so is the final column in Figure 8.1, labeled OC, where each bit indicates whether a particular output ever fails, under the test set, for that fault. I refer to this as the "output-compacted" signature of the fault; it is the bitwise-OR of all columns in the fault signature matrix.

An idea, then, is to re-introduce failing output information by constructing a dictionary to include these output-compacted signatures. These signatures can be added into the traditional pass-fail dictionary, or stored as a separate file. The additional storage required is f*o bits, and the total for all signatures will be f*(v+o). This can be significantly smaller than the f*v*o bits required for the full-response dictionary.

8.3 Diagnosis with Output Signatures

If a fault dictionary includes the output signature information, the question then becomes how best to use this information. Specifically, how should this information be used to rate fault candidates? Normally, in pass-fail diagnosis a candidate is scored by the number of bit differences in its signature from that of the observed behavior. Two commonly-used metrics are nonprediction and misprediction. Nonprediction is the number of bits in the observed behavior not found in the candidate signature (underprediction). Misprediction is the number of bits in the candidate not found in the observed behavior (overprediction). The score for each candidate fault will then consist of some combination of a nonprediction and misprediction score. Different diagnosis algorithms may weight nonprediction and misprediction differently, depending upon the specifics of the fault model and simulator.

This same method could be followed with output-compacted signatures: find the intersection with the behavior's output signature, and weight the intersection with the appropriate parameters. Then, combine the scores for the (pass-fail) vector matching and the output matching into a single candidate score. A small sketch of this style of scoring is given below.
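As a concrete illustration, the sketch below builds the pass-fail and output-compacted signatures of the Figure 8.1 fault from its full-response signature (represented here as a set of failing (vector, output) pairs), and scores a candidate with separate nonprediction and misprediction counts. The combination weights are illustrative assumptions, not values prescribed by this thesis.

```python
# Full-response signature of the Figure 8.1 example fault, as (vector, output) pairs.
FR = {("v1", "o3"), ("v1", "o4"),
      ("v2", "o3"), ("v2", "o4"), ("v2", "o9"),
      ("v7", "o4"),
      ("v9", "o3"), ("v9", "o4"), ("v9", "o5"), ("v9", "o9")}

def pass_fail_signature(fr):
    """Bitwise-OR across outputs: the set of failing vectors."""
    return {v for (v, _) in fr}

def output_compacted_signature(fr):
    """Bitwise-OR across vectors: the set of outputs that ever fail."""
    return {o for (_, o) in fr}

def non_and_misprediction(candidate_bits, observed_bits):
    """Nonprediction: observed bits the candidate does not predict.
       Misprediction: candidate bits not seen in the observed behavior."""
    return len(observed_bits - candidate_bits), len(candidate_bits - observed_bits)

def score(candidate_fr, observed_fr, w_non=1.0, w_mis=0.5):
    """Combine pass-fail (vector) matching and output matching into one score.
       Lower is better; the weights are illustrative only."""
    n_v, m_v = non_and_misprediction(pass_fail_signature(candidate_fr),
                                     pass_fail_signature(observed_fr))
    n_o, m_o = non_and_misprediction(output_compacted_signature(candidate_fr),
                                     output_compacted_signature(observed_fr))
    return w_non * (n_v + n_o) + w_mis * (m_v + m_o)

print(sorted(pass_fail_signature(FR)))         # ['v1', 'v2', 'v7', 'v9']
print(sorted(output_compacted_signature(FR)))  # ['o3', 'o4', 'o5', 'o9']
```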
An alternative scoring method is to use both the pass-fail and output signatures simultaneously. Looking at the example fault signature again, we see that the intersection of pass-fail and output signatures defines an area of possible fault detection (the shaded areas of Figure 8.2).

Figure 8.2. The intersection of pass-fail and output signatures. (The figure repeats the matrix of Figure 8.1, with the cells at the intersection of the failing vectors v1, v2, v7, and v9 and the failing outputs o3, o4, o5, and o9 shaded.)

During diagnosis, then, an observed failure bit (vector & output) inside of this 2-dimensional detection area can be considered a successful prediction, or part of the intersection of candidate and behavior. A detection outside of this area is considered a failed prediction, or a non-predicted bit. Misprediction must be de-emphasized in this scoring method since the intersection area, in bits, will usually be much greater than the actual number of predicted failing bits. A sketch of this simultaneous scoring is given below.
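The following minimal sketch shows this simultaneous use of the two signatures; the weights are again illustrative assumptions rather than values from this thesis.

```python
def area_score(candidate_pf, candidate_oc, observed_bits, w_non=1.0, w_mis=0.05):
    """Score a candidate using its pass-fail and output signatures simultaneously.

    candidate_pf:  set of vectors the candidate is predicted to fail.
    candidate_oc:  set of outputs at which the candidate ever fails.
    observed_bits: observed failing (vector, output) pairs from the tester.

    An observed bit inside the PF x OC rectangle counts as a successful
    prediction; one outside it is nonpredicted.  Misprediction is measured
    against the whole rectangle and therefore strongly de-emphasized.
    """
    area = {(v, o) for v in candidate_pf for o in candidate_oc}
    hits = observed_bits & area
    nonpredicted = observed_bits - area
    mispredicted = area - observed_bits
    return len(hits) - w_non * len(nonpredicted) - w_mis * len(mispredicted)
```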
I repeated the earlier experiments to determine what diagnostic value the addition of output-compacted signatures provides. Table 8.2 gives the results for a full-response dictionary vs. a pass-fail dictionary vs. a pass-fail dictionary with output-compacted (PF+OC) signatures.

It appears from the data that the output signatures do indeed increase the precision of these diagnoses. In almost all cases where there was a significant difference in precision between the pass-fail and full-response dictionaries, the addition of output-compacted signatures made up most of that difference. While the pass-fail dictionary with output signatures is much smaller than the full-response dictionary, it is still a significant increase over the number of pass-fail bits. This is especially true for the industrial circuits, which, since they contain scan flip-flops, have many more outputs than the simple ISCAS circuits. The next challenge is to decrease this overhead while retaining the diagnostic improvement.

Circuit   FR faults   FR bits           PF faults   PF bits      PF+OC faults   PF+OC bits
          ranked #1   (f*v*o)           ranked #1   (f*v)        ranked #1      (f*(v+o))
C432      2.29        191,142           2.80        27,306       2.32           29,637
C499      1.17        901,120           1.17        28,160       1.17           42,240
C880      1.61        1,512,864         1.66        56,032       1.61           70,720
C1355     1.67        2,936,192         1.71        91,756       1.69           117,740
C1908     1.82        2,966,700         1.99        118,668      1.82           135,718
C2670     2.24        18,141,184        3.04        283,456      2.24           371,520
C3540     2.03        8,963,724         2.10        407,442      2.03           441,014
C5315     1.89        73,878,966        2.07        600,642      1.89           926,100
C6288     1.33        7,207,680         1.40        225,240      1.34           345,368
C7552     1.63        131,224,800       2.12        1,226,400    1.63           1,585,920
Ind-A     2.74        232,836,120,000   5.49        15,522,408   2.87           309,507,408
Ind-B     2.33        929,424,581,448   2.91        46,373,844   2.34           420,237,312
Ind-C     2.51        297,857,813,000   51.0        21,271,000   2.51           319,128,813
Ind-D     1.91        9,077,621,646     2.86        333,822      1.91           66,113,689

Table 8.2. Size of top-ranked candidate set (in faults) and total number of signature bits.

8.4 Objects in Dictionary are Smaller Than They Appear

The pass-fail with output signature dictionary size formula given in the previous section, f*(v+o), is in fact a worst-case calculation of the number of bits required for storing output-compacted signatures. One common aspect of fault dictionaries is that logically-connected sets of faults tend to fail at the same set of circuit outputs. For example, the faults that make up a sub-design in the circuit will usually propagate their failures to the outputs (possibly scan or wrapper flops) of that sub-design. A cursory examination of the contents of any full-response fault dictionary will usually confirm that sets of failing outputs are repeated often throughout the fault signatures. Output signatures, which are collections of per-fault output sets, are even more likely to repeat across a fault set.

This fact can be exploited by storing a particular output signature only once in a dictionary. Then, every fault that has that output signature will reference that particular output signature by an index, rather than the full string of output bits. If there are s_o unique output signatures in a dictionary, this index will require log2(s_o) bits for each of the f faults. The revised formula for the size of the output signature dictionary, with pass-fail signatures, is therefore f*(log2(s_o)+v) + (s_o*o). Table 8.3 reports the number of unique signatures for each circuit above, the percent of the faultlist this represents, and the actual number of bits used in the dictionaries for the results given above. The number of pass-fail bits is repeated for comparison. These revised sizes for the pass-fail with output signature dictionaries are much more acceptable, especially given the increase in diagnostic precision (as reported in Table 8.2). A sketch of this index-based storage scheme is given after the table.

Circuit   Faults   Unique OC           PF bits      PF+OC bits (w/repeats)   PF+OC bits (unique)
          (f)      signatures (s_o)    (f*v)        (f*(v+o))                (f*(log2(s_o)+v) + (s_o*o))
C432      333      41                  27,306       29,637                   29,591
C499      440      302                 28,160       42,240                   41,784
C880      544      141                 56,032       70,720                   64,191
C1355     812      426                 91,756       117,740                  112,696
C1908     682      318                 118,668      135,718                  132,756
C2670     1,376    203                 283,456      371,520                  307,456
C3540     1,526    359                 407,442      441,014                  429,074
C5315     2,646    678                 600,642      926,100                  710,496
C6288     3,754    288                 225,240      345,368                  268,242
C7552     3,360    586                 1,226,400    1,585,920                1,322,702
Ind-A     19,599   5,352               15,522,408   309,507,408              96,057,195
Ind-B     18,654   5,633               46,373,844   420,237,312              159,512,932
Ind-C     21,271   2,423               21,271,000   319,128,813              55,455,521
Ind-D     2,419    622                 333,822      66,113,689               17,272,058

Table 8.3. Output-compacted signature sizes adjusted for repeated output signatures.
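The sketch below shows one way to store output-compacted signatures by index, together with the bit-count formulas just discussed. The example numbers reproduce the C432 row of Tables 8.1 through 8.3 (v = 82 and o = 7 are inferred from the table entries), and log2(s_o) is rounded up to a whole number of bits; the function names are mine.

```python
import math

def build_indexed_oc_dictionary(oc_signatures):
    """Store each distinct output-compacted signature once; each fault keeps an index.

    oc_signatures: dict mapping fault name -> frozenset of outputs the fault ever fails.
    Returns (list of unique signatures, dict mapping fault name -> index into that list).
    """
    unique, index_of, fault_index = [], {}, {}
    for fault, sig in oc_signatures.items():
        if sig not in index_of:
            index_of[sig] = len(unique)
            unique.append(sig)
        fault_index[fault] = index_of[sig]
    return unique, fault_index

def dictionary_bits(f, v, o, s_o):
    """Worst-case bit counts for the storage schemes discussed in this chapter."""
    return {
        "full-response (f*v*o)": f * v * o,
        "pass-fail (f*v)": f * v,
        "PF+OC, repeats (f*(v+o))": f * (v + o),
        "PF+OC, unique (f*(log2(s_o)+v) + s_o*o)":
            f * (math.ceil(math.log2(s_o)) + v) + s_o * o,
    }

# C432: f = 333, v = 82, o = 7, s_o = 41 unique output signatures.
print(dictionary_bits(333, 82, 7, 41))
# -> 191,142 / 27,306 / 29,637 / 29,591 bits, matching Tables 8.1-8.3
```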
8.5 What about Unmodeled Faults?

I next performed some experiments to see if adding output-compacted signatures to a pass-fail dictionary would also help a dictionary's ability to diagnose unmodeled faults. The classic unmodeled fault vis-à-vis the stuck-at fault model is the bridging fault. I have previously published a bridging fault diagnosis approach that uses stuck-at faults to identify at least one node in a bridging fault pair [LavChe97]. In a similar experiment, I simulated and diagnosed, with stuck-at fault candidates, a set of realistic bridging faults for each of the ISCAS-85 circuits. I compared the three types of candidate signatures: full-response, pass-fail, and pass-fail with output-compacted signatures. A diagnostic success was defined as identifying one of the two bridged circuit nodes in the top 10 stuck-at faults.

The data in Table 8.4 shows that the full-response signatures provide the highest success rate when diagnosing bridging faults, as expected. The pass-fail success rate is generally much lower, on some circuits succeeding less than half the time, which is unacceptable for most diagnostic situations. The pass-fail dictionary augmented with output-compacted signatures provides a significant improvement over the pass-fail results, at the relatively small additional cost in bits (compared to the full-response requirements) reported in Table 8.3.

Circuit   FR Success %   PF Success %   PF+OC Success %
C432      98.1           84.2           92.4
C499      83.2           34.4           71.3
C880      99.4           79.4           95.7
C1355     82.6           43.1           62.1
C1908     89.2           60.7           79.0
C2670     95.6           76.2           90.5
C3540     99.5           83.2           90.5
C5315     97.9           65.2           94.9
C6288     88.5           35.2           57.3
C7552     95.1           68.0           86.7

Table 8.4. Success rate for bridging fault diagnosis using stuck-at fault candidates.

8.6 An Alternative to Path Tracing?

As mentioned earlier, path tracing has been proposed as a first step in fault diagnosis to reduce the candidate faultlist to a tractable size. In this capacity, path tracing helps by limiting the faultlist to just those faults in the input cone of affected outputs. A path-tracing algorithm can either be static, in which the tracing algorithm only looks at the logical paths in the netlist, or it can be dynamic, in which information about the fault model and the applied vectors is used to eliminate certain candidate faults in the input cones.

An interesting observation about output-compacted signatures is that they contain much of the same information that is obtained from a dynamic path-tracing algorithm; that is, they report the set of outputs to which the fault effects are propagated for each candidate fault. Output-compacted signatures lose some resolution because they do not store the per-vector propagation information; fault propagation can change, depending upon the applied test vector, as fault effects are either blocked or transmitted. But, output signatures will provide better resolution than can be obtained from static path tracing, because the fault type is known and aggregate vector information is stored.

Therefore, output-compacted signatures can be thought of as filling the same role in dictionary-based fault diagnosis as a preliminary path tracing. This would be an advantage in scenarios, mentioned earlier, where path tracing is not convenient or possible during fault diagnosis.

I was curious, then, how much diagnostic resolution the output-compacted signatures provide on their own, aside from the pass-fail information. To this end, I ran the same experiments in Section X above, diagnosing stuck-at behaviors with stuck-at candidates, but this time only using the output signatures for matching. The results are shown in Table 8.5 below. As with the previous stuck-at diagnosis experiments, the correct candidate will always be ranked #1; again, the question is how many candidates are ranked equally at the top of the diagnosis list.

Circuit   Faults   PF faults    PF bits       OC faults    OC bits
                   ranked #1    (f*v)         ranked #1    (f*log2(s_o) + s_o*o)
C432      333      2.80         27,306        31.6         2,285
C499      440      1.17         28,160        2.24         13,624
C880      544      1.66         56,032        12.5         8,159
C1355     812      1.71         91,756        3.31         20,940
C1908     682      1.99         118,668       16.0         14,088
C2670     1,376    3.04         283,456       52.7         24,000
C3540     1,526    2.10         407,442       17.1         21,632
C5315     2,646    2.07         600,642       31.5         109,854
C6288     3,754    1.40         225,240       37.8         43,002
C7552     3,360    2.12         1,226,400     60.0         96,302
Ind-A     19,599   5.49         15,522,408    15.8         80,534,787
Ind-B     18,654   2.91         46,373,844    23.1         113,139,088
Ind-C     21,271   51.0         21,271,000    21.5         34,184,521
Ind-D     2,419    2.86         333,822       16.4         16,938,236

Table 8.5. Top-ranked candidate set size and signature bits for pass-fail and output-compacted (alone) signatures.

It is difficult, from this limited set of results, to draw any conclusions about the potential for using output signatures by themselves. It is possible that output signatures could, like static path tracing, prove useful for diagnosing unmodeled faults, or in cases where unmodeled behavior is expected. Until further research is done, it seems the power of these output signatures is best realized when they are, as demonstrated earlier, used in combination with traditional pass-fail signatures. A sketch of one way to use an output signature as a path-tracing substitute (as a filter on the candidate faultlist) is given below.
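This minimal sketch uses output-compacted signatures in roughly the role of a preliminary path trace. The strict "covers all failing outputs" criterion is my assumption; a looser overlap test may be more appropriate when unmodeled behavior is expected.

```python
def filter_candidates_by_output_signature(observed_failing_outputs, oc_dictionary):
    """Keep only faults whose output-compacted signature covers the observed failing outputs.

    observed_failing_outputs: set of outputs at which the chip failed on the tester.
    oc_dictionary: dict mapping fault name -> set of outputs the fault can ever reach.

    Faults that cannot propagate to every observed failing output are dropped
    before any detailed signature matching, much like a path-trace cone filter.
    """
    return [fault for fault, outputs in oc_dictionary.items()
            if observed_failing_outputs <= outputs]
```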
8.7 Clustering Output Signatures

Even the small data set used in these experiments hints at an impending problem with the use of output-compacted signatures: real circuits have, on average, many more outputs than vectors. This fact will cause the output signature size to explode when compared to the number of pass-fail vectors. It was mentioned earlier in this chapter that the failing output sets of many faults in a circuit can be identical; this fact was used to remove a large number of identical output signatures from the dictionaries. But, it is also true that many more faults have similar sets of failing outputs, differing by only a small number of bits. Can this fact be used to further reduce the size of an output signature dictionary, by combining similar output signatures, while maintaining the previous diagnostic accuracy?

The idea of identifying and combining similar individuals in a set of values or vectors has been studied extensively in the fields of machine learning and pattern recognition [DudHar73]. The common approach is referred to as clustering, and many clustering algorithms have been identified and analyzed for various situations. I considered two clustering algorithms to reduce the number of output signature bits in the sample dictionaries; both are sketched in code below.

The first, based on a method referred to as hierarchical clustering, starts with every output signature in its own cluster (identical signatures are already combined). Then, it finds the most similar pair of clusters and combines those signatures into a new signature, creating a new cluster (and a new signature). The algorithm proceeds in the same fashion until the desired number of clusters is achieved. Similarity between two output signatures can be described in a number of ways. The method I chose was to express the distance (or dissimilarity) as the number of 1s in the bitwise-XOR between two signatures. The signature pair with the smallest XOR value was chosen for the next clustering. In case of multiple pairs with the same XOR value, the clustering algorithm chooses the one with the maximum number of 1s in a bitwise-OR, which effectively chooses the signatures with the most failing outputs in common. The resulting signature from a clustering of two output signatures is the bitwise-OR of the two signatures. A disadvantage of this method is the effort required to perform the clustering. Finding the initial set of pair-wise distances between output signatures is an O(n²) process, where n is the number of initial output signatures (s_o in previous tables). While this work is only required once at dictionary creation, it can be time-consuming for very large circuits.

My second method of clustering output signatures is to divide the full set of circuit outputs into distinct sequential subsets. For example, if a circuit has 10,000 outputs, the outputs could be divided into 1000 subsets of 10 outputs each. Then, output signatures can be created with one bit per subset, where the value of each bit indicates whether the fault propagates its fault effects to any of the outputs in the subset. This method relies on the assumption that (as is often the case) closely-numbered outputs are also closely positioned in the circuit, so that the failing bits (and clusters) of localized faults tend to be highly correlated.
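The two clustering methods can be sketched as follows. Output signatures are represented as sets of failing output indices; all names are mine, and the Figure 8.3-style example at the end uses a made-up signature, since the actual bits of that figure are not reproduced here.

```python
def xor_distance(a, b):
    """Dissimilarity between two output signatures (sets of failing outputs):
       the number of 1s in their bitwise XOR, i.e. the symmetric difference."""
    return len(a ^ b)

def hierarchical_cluster(signatures, target):
    """Method 1: greedily merge the two closest signatures (smallest XOR count,
       ties broken by the largest OR) until only `target` signatures remain.
       The merged signature is the bitwise OR (union) of the pair."""
    sigs = [frozenset(s) for s in signatures]
    while len(sigs) > target:
        best = None
        for i in range(len(sigs)):
            for j in range(i + 1, len(sigs)):
                key = (xor_distance(sigs[i], sigs[j]), -len(sigs[i] | sigs[j]))
                if best is None or key < best[0]:
                    best = (key, i, j)
        _, i, j = best
        merged = sigs[i] | sigs[j]
        sigs = [s for k, s in enumerate(sigs) if k not in (i, j)] + [merged]
    return sigs

def subset_cluster(signature, num_outputs, group_size):
    """Method 2: divide the outputs into consecutive groups of `group_size`;
       each clustered bit is 1 if the fault reaches any output in that group."""
    return [int(any(o in signature for o in range(g, min(g + group_size, num_outputs))))
            for g in range(0, num_outputs, group_size)]

# Figure 8.3-style example: 25 outputs clustered into 5 groups of 5 outputs each.
sig = {2, 3, 11, 17, 18}           # hypothetical failing outputs for one fault
print(subset_cluster(sig, 25, 5))  # -> [1, 0, 1, 1, 0]
```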
(The issue of correlation is important to the success of any clustering algorithm. Unless the bits in a cluster are highly correlated, the values of the bits in the clustered signature will be meaningless. The first clustering algorithm, by examining all the signatures against each other, can judge these correlations well. The second method, on the other hand, must rely on the expectation that adjacent sequential bits are by nature correlated with each other. It is, however, much simpler, and it enables the creation of clustered signatures on the fly, not just after all the original signatures have been written. This is a tremendous advantage for large data sets.)

As a simple example of the second clustering algorithm, consider the diagram below (Figure 8.3). It shows a set of 25 bits, where a shaded box represents a 1 (detection) and an unshaded box represents a 0. This set could be thought of as a full output signature. The lower set shows a clustered signature, where each bit represents a clustering of 5 of the original output bits. The result, then, is a 5-bit clustered output signature.

Figure 8.3. A simple example of clustering by subsets of outputs.

I performed both types of clustering to reduce the number of output-compacted signature bits for the industrial circuits. (The ISCAS circuits have too few outputs to be interesting for this experiment.) My target was to reduce the number of output bits to about the same number as the pass-fail bits. To this end, for each circuit, I clustered the output signatures down to 1000 bits per fault. I did not find a significant difference between the diagnostic results from the two clustering algorithms; only the data for the second (simpler) algorithm is reported here. Once again, the table shows the number of equivalently-ranked top candidates for each type of signature.

Circuit   Outputs (o)   FR faults   PF faults   PF+OC unclustered   PF+OC clustered   OC (clustered)
                        ranked #1   ranked #1   faults ranked #1    faults ranked #1  bits per fault
Ind-A     15,000        2.74        5.49        2.87                2.87              1,000
Ind-B     20,042        2.33        2.91        2.34                2.34              1,000
Ind-C     14,003        2.51        51.0        2.51                6.21              1,000
Ind-D     27,193        1.91        2.86        1.91                2.03              1,000

Table 8.6. Diagnostic results when output-compacted signatures are clustered down to 1000 bits each.

This table does not report the number of dictionary bits for the PF+clustered OC dictionary, because I did not perform the collapsing of duplicate signatures as was done in Section X (Table 8.3). I expect that, even more so than for the unclustered signatures, this would cause a significant reduction in the dictionary size. In any case, the maximum number of bits required for the output signatures (1000 bits) is between 0.4 and 1.3 times the number of pass-fail signature bits. These results indicate that using even highly-clustered output signatures increases diagnostic precision over pass-fail signatures alone. At the approximate cost of doubling the size of the pass-fail dictionaries, the precision of the result approaches that of the full-response dictionaries.

8.8 Clustering Vector Signatures & Low-Resolution Diagnosis

To follow up on these results, I was interested to find out whether or not the same sort of clustering can be used to further reduce the number of pass-fail signature bits. Specifically, can a clustering algorithm be applied to pass-fail signatures to create effective vector signatures of a small number of bits?
This question is of particular importance for the very largest of modern circuits, which contain millions of faults, thousands of test vectors, and tens or hundreds of thousands of outputs. For these circuits, even traditional pass-fail signatures are too large for practical dictionary-based diagnosis.

The correlation assumed for consecutive output bits, however, probably does not exist for pass-fail vector bits in most cases. (An exception is when a subset of consecutive tests targets a particular set of faults; test sets, however, can be reorganized and this correlation can be lost.) Therefore, the simple clustering algorithm will not be as effective when applied to pass-fail signatures as it was when applied to output signatures.

I was curious, however, to see what sort of results could be obtained using the simple clustering algorithm on both pass-fail and output signatures. The idea was to create "tiny" dictionaries at a reasonable cost, and then see what sort of diagnoses could be performed. To this end, I created dictionaries with only 100 bits per fault signature, divided between 50 clustered vector bits and 50 clustered output bits. For both halves of each signature, the clusters were created by the simple sequential clustering algorithm described earlier. Diagnoses were performed on three of the four industrial circuits; the fourth had too few test vectors in its test set to be of interest. The results are presented in Table 8.7.

Circuit   Faults (f)   Outputs (o)   FR faults   PF faults   100-bit signatures
                                     ranked #1   ranked #1   ranked #1
Ind-A     19,599       15,000        2.74        5.49        201.0
Ind-B     18,654       20,042        2.33        2.91        3.89
Ind-C     21,271       14,003        2.51        51.0        90.42

Table 8.7. Diagnostic results for clustering (PF+OC) signatures down to 100 bits total.

Of course, the precision of the resulting diagnoses is much lower than could be obtained with either the full fault data or with unclustered signatures of any sort. A point of further research is to examine the efficiency of these types of signatures, to see if the per-bit reduction in the candidate faultlist is as good as or better than that of either pass-fail or full-response fault signatures. Despite the significant loss of precision, trading off precision for data load may be attractive if diagnosis can be performed iteratively, by using data of ever-increasing resolution on ever-decreasing faultlists. An initial diagnosis would produce quite a large candidate set, but if that set is accurate and represents, say, 1-10% of the faultlist, then diagnosis can proceed where normally it would be completely impractical. I refer to this approach as low-resolution fault diagnosis. It is possible that this approach could find application in very high-volume situations or in system-level diagnostics. Or perhaps this technique could enable "built-in self-diagnosis", in which a chip could diagnose itself to some reasonable level of precision, using this kind of tiny dictionary and results from built-in self-test (BIST).

Chapter 9. Conclusions and Future Work

Fault diagnosis in modern circuits is a difficult task, considering the size of today's circuits and the almost-innumerable ways in which they can fail. But, there is good reason to attempt the task anyway, as the quality of these circuits depends upon identifying and fixing sources of error. I have presented an approach to fault diagnosis in combinational circuits that attempts to be as comprehensive as possible.
Perhaps the most important contribution is the introduction of a probabilistic framework to the problem, which allows many different fault models, algorithms, and sources of information to be applied to the problem to produce an accurate result, the precision of which increases as more effort is applied. In developing a guiding philosophy for my approach, I have identified many issues involved in fault diagnosis that, while some may be common sense, have more often been ignored than heeded by previous researchers.

The diagnosis approach presented here covers most stages of the problem, from an initial diagnosis step that can handle multiple and complex defects, to a model-based stage that can apply fault models of arbitrary sophistication to refine a diagnosis as far as desired. I have also addressed the issue of non-logic fails in the form of an extension of the probabilistic framework to cover IDDQ test fails. Finally, I have addressed the practical issue of static data sizes, a problem that can defeat many diagnosis strategies on very large circuits. In doing so I introduced a new topic of low-resolution diagnosis, which may find use in some more exotic situations such as high-speed diagnosis or self-diagnosing circuits.

The future work identified by my research falls into four categories. The first is to uniformly apply the Dempster-Shafer method of scoring fault candidates across all stages of the diagnosis methodology. I have applied it to the first-stage iSTAT algorithm, where it seemed the most natural and practical fit, but applying it to the model-based algorithm would result in two major improvements. First, it would provide a confidence measure for the final diagnosis, a very important piece of information for practical use. Second, it might provide some limited means of considering multiple (perhaps two or three) model instances at one time for the case of multiple independent defects.

The second area of future work is to address the issue of timing-related failures. This thesis considered only static logic failures, or tests that fail at a very slow speed. But, failures in which a chip fails to meet timing requirements on tests run at-speed are very interesting for modern high-speed or high-performance designs. I suspect that some modification of the iSTAT algorithm could address this issue by implicating individual faults along timing-critical paths.

The third area of future work is to run these algorithms on actual production fails to see if they can properly diagnose defects that aren't artificially created. This is the real test, of course, of any diagnosis algorithm, but it will take a relatively large effort on the part of an industrial team to carry out physical root-cause verification on multiple real chips. This is a difficult task given current limits on research and development funds, and is complicated by the fact that it is now usually the case that one company designs a chip while another company manufactures it, while a third company may do the failure analysis. Such disintegration in industry, however, creates an opening that an easy-to-use yet powerful diagnosis tool could exploit.

Finally, I would also like to pursue the avenue of low-resolution fault diagnosis, if only to satisfy my curiosity about how much diagnosis can be performed with how little data. The idea of future circuits that can do self-diagnosis, and therefore possibly repair themselves, is intriguing and worth some investigation, however impractical it may be.

Bibliography
[AbrBre80] M. Abramovici and M. A. Breuer. Multiple fault diagnosis in combinational circuits based on an effect-cause analysis. IEEE Transactions on Computers, Vol. C-29, pages 451-460, June 1980.
[AbrBre90] M. Abramovici, M. Breuer, and A. Friedman. Digital Systems Testing and Testable Design. W.H. Freeman and Company, New York, NY, 1990.
[AbrMen84] M. Abramovici, P.R. Menon and D.T. Miller. Critical Path Tracing: An Alternative to Fault Simulation. IEEE Design & Test, IEEE, February 1984.
[AckMil91] J.M. Acken and S.D. Millman. Accurate modeling and simulation of bridging faults. Proceedings of the Custom Integrated Circuits Conference, pages 17.4.1-17.4.4, 1991.
[AckMil92] J.M. Acken and S.D. Millman. Fault model evolution for diagnosis: Accuracy vs. precision. Proceedings of the Custom Integrated Circuits Conference, 1992.
[Ait91] R. Aitken. Fault Location with Current Monitoring. Proceedings of the International Test Conference, pages 623-632, IEEE, 1991.
[Ait92] R. Aitken. A Comparison of Defect Models for Fault Location with Iddq Measurements. Proceedings of the International Test Conference, pages 778-787, IEEE, 1992.
[Ait95] R. Aitken. Finding defects with fault models. Proceedings of the International Test Conference, pages 498-505, IEEE, 1995.
[AitMax95] R. Aitken and P. Maxwell. Better models or better algorithms? On techniques to improve fault diagnosis. Hewlett-Packard Journal, February 1995.
[AllErv92] R.W. Allen, M.M. Ervin-Willis and R.E. Tullose. DORA: CAD Interface to Automatic Diagnostics. 19th Design Automation Conference, pages 559-563, 1982.
[BarBha01] T. Bartenstein, J. Bhawnani. SLAT Plus: Work in Progress. 2nd International IEEE Workshop on Yield Optimization and Test, Baltimore, Nov. 1-2, 2001.
[BarHea01] T. Bartenstein, D. Heaberlin, L. Huisman, D. Sliwinski. Diagnosing Combinational Logic Designs Using the Single Location At-a-Time (SLAT) Paradigm. Proceedings of the International Test Conference, pages 287-296, IEEE, 2001.
[BopHar96] V. Boppana, I. Hartanto, W. K. Fuchs. Full Fault Dictionary Storage Based on Labeled Tree Encoding. Proceedings of the IEEE VLSI Test Symposium, pages 174-179, April 1996.
[Bur89] D. Burns. Locating high resistance shorts in CMOS circuits by analyzing supply current measurement vectors. International Symposium for Testing and Failure Analysis, pages 231-237, November 1989.
[ChaGon93] S. Chakravarty and Y. Gong. An algorithm for diagnosing two-line bridging faults in combinational circuits. Proceedings of the Design Automation Conference, pages 520-524, 1993.
[ChaLiu93] S. Chakravarty and M. Liu. Iddq measurement based diagnosis of bridging faults. Journal of Electronic Testing: Theory and Application (Special Issue on Iddq Testing), 1993.
[CheLar99] B. Chess and T. Larrabee. Creating Small Fault Dictionaries. IEEE Transactions on Computer-Aided Design, pages 346-356, March 1999.
[CheLav95] B. Chess, D.B. Lavo, F.J. Ferguson and T. Larrabee. Diagnosis of Realistic Bridging Faults with Single Stuck-At Information. Digest of Technical Papers, 1995 IEEE International Conference on Computer-Aided Design, pages 185-192, Nov. 1995.
[DeGun95] K. De and A. Gunda. Failure analysis for full-scan circuits. Proceedings of the International Test Conference, pages 636-645, IEEE, 1995.
[DudHar73] R. Duda and P. Hart. Pattern Classification and Scene Analysis. John Wiley & Sons, 1973.
[EicLin91] E. Eichelberger, E. Lindbloom, J. Waicukauski and T. Williams. Structured Logic Testing. Prentice Hall, New Jersey, 1991.
[FerYu96] F.J. Ferguson and J. Yu. Maximum likelihood estimation for yield analysis. Proceedings of the Defect and Fault Tolerance in VLSI Systems Symposium, pages 149-157, IEEE, 1996.
[GatMal96] A. Gattiker and W. Maly. Current Signatures. Proceedings of the 1996 VLSI Test Symposium, pages 112-117, IEEE, 1996.
[GatMal97] A. Gattiker and W. Maly. Current Signatures: Application. Proceedings of the International Test Conference, pages 156-165, IEEE, 1997.
[GatMal98] A. Gattiker and W. Maly. Toward Understanding "IDDQ-Only" Fails. Proceedings of the International Test Conference, pages 156-165, IEEE, 1998.
[GirLan92] P. Girard, C. Landrault and S. Pravossoudovitch. Delay Fault Diagnosis by Critical Path Tracing. IEEE Design and Test of Computers, IEEE, December 1992.
[GrePat92] G. Greenstein and J. Patel. EPROOFS: a CMOS bridging fault simulator. Proceedings of the International Conference on Computer-Aided Design, pages 268-271, IEEE, 1992.
[HenSod97] C. Henderson and J. Soden. Signature analysis for IC diagnosis and failure analysis. Proceedings of the International Test Conference, pages 310-318, IEEE, 1997.
[JacBis86] J. Jacob and N.N. Biswas. GTBD faults and lower bounds on multiple fault coverage of single fault test sets. Proceedings of the International Test Conference, pages 849-855, IEEE, 1986.
[JeeISTFA93] A. Jee and F.J. Ferguson. Carafe: A software tool for failure analysis. Proceedings of the International Symposium on Testing and Failure Analysis, pages 143-149, 1993.
[JeeVTS93] A. Jee and F.J. Ferguson. Carafe: An inductive fault analysis tool for CMOS VLSI circuits. Proceedings of the IEEE VLSI Test Symposium, pages 92-98, 1993.
[Kun93] R.P. Kunda. Fault location in full-scan designs. International Symposium for Testing & Failure Analysis, pages 121-126, 1993.
[LamSho80] L. Lamport, R. Shostak, and M. Pease. The Byzantine Generals Problem. Technical Report 54, Comp. Sci. Lab, SRI International, March 1980.
[LavChe97] D.B. Lavo, B. Chess, T. Larrabee, F.J. Ferguson, J. Saxena and K. Butler. Bridging Fault Diagnosis in the Absence of Physical Information. Proceedings of the International Test Conference, pages 887-893, IEEE, 1997.
[LavTCAD98] D.B. Lavo, B. Chess, T. Larrabee, and F.J. Ferguson. Diagnosing realistic bridging faults with single stuck-at information. IEEE Transactions on Computer-Aided Design, pages 255-268, March 1998.
[LavLar96] D.B. Lavo, T. Larrabee, and B. Chess. Beyond Byzantine Generals: Unexpected behavior and bridging-fault diagnosis. Proceedings of the International Test Conference, pages 611-619, IEEE, 1996.
[MaxAit93] P. Maxwell and R. Aitken. Biased voting: a method for simulating CMOS bridging faults in the presence of variable gate logic thresholds. Proceedings of the International Test Conference, pages 63-72, IEEE, 1993.
[MaxNei99] P. Maxwell, P. O'Neill, R. Aitken, R. Dudley, N. Jaarsma, Minh Quach, and D. Wiseman. Current Ratios: A Self-Scaling Technique for Production IDDQ Testing. Proceedings of the International Test Conference, IEEE, 1999.
[Mei74] K.C.Y. Mei. Bridging and stuck-at faults. IEEE Transactions on Computers, C-23(7), pages 720-727, July 1974.
[MilMcC90] S.D. Millman, E.J. McCluskey and J.M. Acken. Diagnosing CMOS bridging faults with stuck-at fault dictionaries. Proceedings of the International Test Conference, pages 860-870, IEEE, 1990.
[MonBru92] R. Rodriguez-Montanez, E.M.J.G. Bruls and J. Figueras. Bridging defects resistance measurements in a CMOS process. Proceedings of the International Test Conference, pages 892-899, IEEE, 1992.
[NighFor97] P. Nigh, D. Forlenza and F. Motika. Application and Analysis of IDDQ Diagnostic Software. Proceedings of the International Test Conference, pages 319-327, IEEE, 1997.
[NighNee97a] P. Nigh, W. Needham, K. Butler, P. Maxwell and R. Aitken. An Experimental Study Comparing the Relative Effectiveness of Functional Scan IDDQ Delay-Fault Testing. Proceedings of VLSI Test Symposium, pages 459-463, 1997.
[NighNee97b] P. Nigh, W. Needham, K. Butler, P. Maxwell, R. Aitken and W. Maly. So What is an Optimal Test Mix? A Discussion of the Sematech Methods Experiment. Proceedings of the International Test Conference, pages 1037-1038, IEEE, 1997.
[NighVal98] P. Nigh, D. Vallett, A. Patel, J. Wright, F. Motika, D. Forlenza, R. Kurtulik, W. Chong. Failure Analysis of Timing and IDDQ-only Failures from the SEMATECH Test Methods Experiment. Proceedings of the International Test Conference, pages 43-52, IEEE, 1997.
[RajCox87] J. Rajski and H. Cox. A method of test generation and fault diagnosis in very large combinational circuits. Proceedings of the International Test Conference, pages 932-943, 1987.
[RatKea86] V. Ratford and P. Keating. Integrating guided probe and fault dictionary: an enhanced diagnostic approach. Proceedings of the International Test Conference, pages 304-311, IEEE, 1986.
[RicBow85] J. Richman and K.R. Bowden. The modern fault dictionary. Proceedings of the International Test Conference, pages 696-702, IEEE, 1985.
[Rot94] C.D. Roth. Simulation and test pattern generation for bridge faults in CMOS ICs. Master's Thesis, University of California Santa Cruz, Department of Computer Engineering, June 1994.
[SaxBal98] J. Saxena, H. Balachandran, K. Butler, D.B. Lavo, B. Chess, T. Larrabee and F.J. Ferguson. On Applying Non-Classical Defect Models to Automated Diagnosis. Proceedings of the International Test Conference, pages 748-757, IEEE, 1998.
[Sha76] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, Princeton, New Jersey, 1976.
[SheMal85] J.P. Shen, W. Maly and F.J. Ferguson. Inductive Fault Analysis of MOS Integrated Circuits. IEEE Design and Test of Computers, 2(6):13-26, December 1985.
[SheSim96] J.W. Sheppard and W.R. Simpson. Improving the accuracy of diagnostics provided by fault dictionaries. Proceedings of the 14th VLSI Test Symposium, pages 180-185, IEEE, 1996.
[SimShe94] W.R. Simpson and J.W. Sheppard. System Test and Diagnosis. Kluwer Academic Publishers, Norwell, MA, 1994.
[Thi97] C. Thibeault. A Novel Probabilistic Approach for IC Diagnosis Based on Differential Quiescent Current Signatures. Proceedings of the 1997 VLSI Test Symposium, pages 80-85, IEEE, 1997.
[Tor38] S.C. Tornay. Ockham: Studies and Selections. Open Court Publishers, La Salle, IL, 1938.
[VenDru00] S. Venkataraman, S. Drummonds. POIROT: A Logic Fault Diagnosis Tool and Its Applications. Proceedings of the International Test Conference, pages 253-262, IEEE, 2000.
[WaiLin89] J. Waicukauski and E. Lindbloom. Failure diagnosis of structured VLSI. IEEE Design and Test of Computers, pages 49-60, August 1989.