1058 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 30, NO. 7, JULY 2011

METER: MEasuring Test Effectiveness Regionally

Yen-Tzu Lin, Member, IEEE, and R. D. (Shawn) Blanton, Fellow, IEEE

Abstract: Researchers from both academia and industry continually propose new fault models and test metrics for coping with the ever-changing failure mechanisms exhibited by scaling fabrication processes. Understanding the relative effectiveness of current and proposed metrics and models is vitally important for selecting the best mix of methods for achieving a desired level of quality at reasonable cost. Evaluating metrics and models traditionally relies on actual test experiments, which are time-consuming and expensive. To reduce the cost of evaluating new test metrics, fault models, design-for-test techniques, and others, this paper proposes a new approach, MEasuring Test Effectiveness Regionally (METER). METER exploits the readily available test-measurement data that is generated from chip failures. The approach does not require the generation and application of new patterns but uses analysis results from existing tests, which we show to be more than sufficient for performing a thorough evaluation of any model or metric of interest. METER is demonstrated by comparing several metrics and models that include: 1) stuck-at; 2) N-detect; 3) PAN-detect (physically-aware N-detect); 4) bridge fault models; and 5) the input pattern fault model (also more recently referred to as the gate-exhaustive metric). We also provide in-depth discussion on the advantages and disadvantages of METER, and contrast its effectiveness with that of the traditional approaches involving the test of actual integrated circuits.

Index Terms: Fault models, test effectiveness, test evaluation, test metrics.

I. Introduction

THE MAIN objective of manufacturing test is to separate good chips from bad chips.
Test methodologies continue to evolve, however, to capture the changing characteristics of chip failures, and new fault models and test metrics have been developed to guide the test generation process. Here, we use the phrase "fault model" in its classic sense, as an abstract representation of the behavior that results from some type of defect. A "test metric," in contrast, is not necessarily meant to model defect behavior but instead is a way to evaluate or measure the quality that a test set would presumably achieve when applied to failing chips. The stuck-at fault model [1] has been used as both a model and a metric, and has been universally adopted as the basis of test generation because of its simplicity and low cost. Nevertheless, as manufacturing technology continues to scale and design complexity increases, failure behaviors have become, and continue to become, more complicated and therefore harder to characterize [2]. The behavior of even static defects (i.e., defects that have no sequence or timing dependency and thus can be detected by a single test pattern) involves more complex mechanisms that can no longer be sufficiently dealt with using just the stuck-at fault model [3], [4]. Various fault models and test metrics have been developed to ensure test quality, including the bridge [5], [6], transition [7], and input-pattern [8] fault models, and test metrics such as gate exhaustive [9], [10], bridge coverage estimate [12], N-detect [3], [11]–[13], and physically-aware N-detect (PAN-detect) [14]–[16]. For all existing and newly developed test methods, it is important to understand their relative effectiveness so that the proper mix of test approaches can be identified for achieving the required quality level at an acceptable cost. Traditionally, test methods have been evaluated empirically.
Specifically, experiments involving real integrated circuits (ICs) are conducted to reveal defect characteristics and to assess the capability of various test and design-for-test (DFT) methods to uncover chip failure. Unique fallouts (i.e., chip-failure detections), typically shown in the form of a Venn diagram, are considered to be good indicators of relative effectiveness. Fig. 1 summarizes some real-chip experiments on test evaluation that have appeared in the literature over the last 15 years [3], [10], [12], [13], [16]–[30]. The y-axis shows the process node and the x-axis shows time. Each circle indicates the year that the work was published and the process node used for fabricating the design in the experiment. The size of the circle reflects the number of test methods evaluated. Finally, experiments conducted by the same organization have the same color. Fig. 1 shows that the evaluation of fault models and test metrics continues to be of significant interest.

Experiments involving real ICs, however, require a sufficiently large sample in order to produce statistically significant results. When more test methods are compared, more tests must be generated and applied in a production environment. More often than not, generating tests for new, proposed models or metrics is a significant challenge since the commercially available test tools are typically hard-coded to handle only a limited set (e.g., stuck-at, bridge, transition fault, and others). Conducting real-chip experiments for test evaluation is therefore time-consuming and expensive. An evaluation approach that is more economical, automatable, and effective is very much desired.
Fig. 1. Recent real-chip experiments on test evaluation.

In this paper, we introduce a general and cost-effective test-metric evaluation methodology, MEasuring Test Effectiveness Regionally (METER), and show how it can be used to evaluate a large variety of fault models and test metrics. METER analyzes the failure log files that result from the application of any set of test patterns, i.e., no additional tests are needed. The cost of this method is therefore low since extra test generation and test application are completely avoided. Finally, METER is general since it can be used to evaluate any test metric, fault model, or DFT approach that is meaningful within the environment used for collecting the test data.

The basic idea of METER is to identify the locations (or regions) of the failure within the bad chip, and then evaluate the region against the metric/model of concern using the tests already applied. METER is not perfect, however, since it relies on the identification of the failure within the bad chip using diagnosis or other localization techniques. For failing chips that have some ambiguity in the localization results, the effectiveness measures must be statistically analyzed. This shortcoming, however, also exists and is exacerbated in the traditional evaluation approach. Specifically, the fact that tests targeting a particular fault model or test metric detect a given failing chip does not necessarily mean that the model or metric “captures” the defect behavior.
In other words, it is possible that the chip failure is fortuitously caught by the applied tests but not due to the targeted metric/model. METER instead precisely addresses this problem by evaluating models or metrics specifically for possible failure regions.

METER was first introduced in [16] and [31]. This paper subsumes and extends existing work by: 1) defining new quantitative measures of the effectiveness and efficiency of various models/metrics over different products and technology nodes; 2) applying METER to large, industrial designs that include an NVIDIA graphics processing unit (GPU) and an IBM application-specific integrated circuit (ASIC); and 3) showcasing how METER can be used to select parameters for automatic test pattern generation (ATPG).

The rest of this paper is organized as follows. Section II provides background on test evaluation and describes related work. The details of the test metric/model evaluation methodology METER are described in Section III. Evaluation results for several different test metrics are presented in Section IV. Section V provides a discussion on the applicability of METER, and compares METER with the traditional tester-based approach. Finally, in Section VI, conclusions are drawn.

Fig. 2. Coverage distribution achieved by a nearly 100% stuck-at test set for various metrics/models.

II. Background

In this paper, we utilize the notion of fault coverage for an individual line. For instance, a signal line has two stuck-at faults and can have a coverage for some set of tests that is equal to 0%, 50%, or 100%. For a line that has four “close” neighbors, there are eight possible two-line bridge faults, where it is assumed each neighbor can impose a faulty-0 or faulty-1 value on the targeted line. The possible bridge coverages for the line are described by the set {1/8 = 12.5%, 2/8 = 25%, ..., 8/8 = 100%}.
Finally, for a line driven by a two-input gate, the possible gate-exhaustive coverages include $2^2 = 4$ possibilities that lie in the set {25%, 50%, 75%, 100%}. Sometimes, instead of reporting percentages, we will simply list the number of detections for a given metric (as will be seen later in Table II). Extending this notion to N-detect and PAN-detect is a little more complicated since both usually refer to one type of fault polarity (either stuck-at-0 or stuck-at-1). Therefore, the coverage for these test metrics is calculated for a line with a specific fault polarity.

With this notion of coverage, we show that additional tests are not really necessary in METER. We have observed that most test sets inherently achieve high coverage of most metrics and fault models for a majority of signal lines. In other words, it is likely that any given circuit region has very high coverage for any reasonable metric or fault model under consideration. For example, Fig. 2 shows the distribution of coverage achieved by the production, stuck-at test set for each signal line in a test chip (details of the chip are presented in Section IV) for the bridge fault model, and the gate-exhaustive and PAN-detect test metrics. We use N = 10 for the PAN-detect metric, which means 100% coverage is achieved for some line stuck-at-v (v ∈ {0, 1}) if the fault is detected ten times with ten different neighborhood states [14]–[16]. Fig. 2 shows that although the stuck-at test set does not directly target any other models/metrics, 82.8%, 90.11%, and 53.5% of the signal lines have 100% coverage of the bridge model, and the gate-exhaustive and physically-aware ten-detect metrics, respectively. This means that some arbitrary region that is affected by a defect can likely be used to fully evaluate a fault model or test metric. Even when the coverage is not 100%, important information can still be derived, as described in detail in Section IV.

TABLE I. Comparison Between METER and Other Similar Work on Test Evaluation
Columns: METER; [26]; [30] (EMD, Bridge, Intra-Cell)
Rows: Extra tests; Defect region/location known; Correlate coverage changes directly with defect detection
Entries (as extracted): No Yes No No Yes Yes; Yes Yes No No Yes Yes No No No

Fig. 3. Overview of the test-metric evaluation methodology METER.

Table I compares METER with other similar work on test evaluation. Specifically, the work in [26] compares several test metrics by correlating chip failures with metric coverage achieved by the applied test patterns. These patterns do not necessarily directly target the considered test metrics, which means that additional test patterns for different test metrics are not needed. Nevertheless, in [26], metric coverage is calculated over the whole design. This means, at a minimum, fault simulation has to be performed for the entire circuit. Also, depending on the evaluated metrics, additional logical and physical information for every signal line in the design may be needed. In METER, defect detection is better correlated with metric effectiveness since coverage is limited to the potential defect regions within the failing chips. The need for logical/physical information for coverage calculation is therefore correspondingly reduced. In contrast, the work in [30] used diagnosis to investigate test quality. It, however, uses additional test patterns to measure the effectiveness of test metrics and fault models that include, e.g., N-detect [or more specifically, embedded multi-detect (EMD)] and bridge faults. Reference [30] also utilized diagnosis to identify failing chips with intra-cell defects, and examined these chips using the already-applied test patterns.
That analysis, however, focuses on reducing the mismatch between defect and fault behavior (i.e., the metric/model behavior outside of the defect behavior) rather than on the effectiveness of the metric/model.

III. Test-Metric Evaluation

METER is a cost-effective and time-efficient approach for comparing and evaluating the relative effectiveness/efficiency of test metrics and fault models.¹ As shown in Fig. 3, METER consists of four stages: 1) suspect region identification; 2) test selection; 3) failing chip selection; and 4) test evaluation. Specifically, tester-response data that results from the application of any type of test set is collected and analyzed. The test data may simply include chip pass/fail information, or may be more comprehensive in the form of full-fail response data. The collected test data is then analyzed to identify suspect regions within a failing chip, which are the logical lines that are believed to cause chip failures. Next, depending on the objective of the experiment, a subset of test patterns and some failing chips of interest are selected for further analysis. Finally, in the test evaluation stage, test metrics are evaluated for the identified suspects from the failing chips using the selected test patterns. This is achieved by correlating changes in metric coverage with defect detection. Details of each stage are described in the following sections.

¹ From this point on, we will not make any distinction between test metric and fault model.

A. Suspect Region Identification

The first stage of METER is to identify suspect regions that are believed to cause chip failures. Several approaches of varying cost and accuracy can be used, as illustrated in Fig. 4. With the least amount of test data, i.e., only the pass-fail outcomes of test patterns are recorded, the suspect regions include all those that are sensitized by the failing patterns (i.e., test patterns that fail the chip), as shown at the top level of the inverted triangle in Fig. 4. If additional information is collected and more comprehensive techniques are used, higher accuracy is expected but at a higher cost. For example, if test-pattern failure responses are recorded, back-cone tracing or path tracing [32] from failing outputs and scan elements can be applied to identify possible defect regions (the second and third levels of the inverted triangle). Fault simulation can also be performed to identify the fault sites whose responses are compatible (e.g., match, subsume, and others) with tester responses (fourth level). Alternatively, diagnosis can be used to identify suspects with higher accuracy and resolution (fifth level). Diagnosis suspects are very likely to include the actual defect regions, and are inexpensive to obtain. If physical failure analysis (PFA) results are available, test-metric evaluation can be performed on what is presumably the actual defect region (bottom level).

Fig. 4. Approaches for identifying suspect regions for test-metric evaluation.

Among the aforementioned approaches, test-metric evaluation using PFA results is of the highest accuracy but has an associated high cost. Moreover, the number of failing chips that have PFA results is typically small. Since more sophisticated region-identification techniques often imply more assumptions and restrictions, fewer failing chips can have their suspects successfully identified. The number of chips available for evaluation is therefore likely to decrease as more advanced region-identification approaches are employed, as shown in Fig. 4. Diagnosis, in contrast, is less expensive since it mostly involves circuit/fault simulation. Often a decent number of failing chips can be diagnosed and used for analysis. In other words, one is more likely to draw statistically significant conclusions by analyzing diagnosable failing chips.
Circuit tracing-based approaches require less computation time than diagnosis, but the number of suspect regions that result is often much higher. Among the possible techniques for identifying suspect regions, diagnosis provides very good accuracy at a reasonable cost, and it often results in a sufficient number of samples that can be used for analysis. We believe diagnosis is a good choice since it provides a proper tradeoff between cost and accuracy.

B. Test Selection

Given a failing chip c and an identified suspect s of c, the test patterns in the production test set $T$ generated for the chip design can be classified based on whether they: 1) were applied to c; 2) sensitized suspect s; and 3) passed or failed chip c, as illustrated in Fig. 5. METER allows great flexibility in selecting the test patterns used for analysis, which can be any subset of $T$. The only requirement is that for a chip c, at least one test pattern that failed c needs to be included in the selected test set $T_{sel_c}$ so that we can correlate defect detection with changes in metric coverage for some identified suspect. For example, if the test flow stops after the first failing pattern (FFP), then the subset of test patterns from the first test pattern to the FFP can be used. If more test patterns are applied, extra information collected from the application of subsequent test patterns can also be utilized.

Depending on how the test patterns are selected, the subsets used for different chips may not be the same. For instance, if the subset of test patterns up to the FFP is used, then the selected test set for chip 1 can be different from that for chip 2, because the FFPs of chip 1 and chip 2 may be different. In contrast, if all the production test patterns are used, then the test sets selected for different chips will be the same.
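The FFP-based selection described above can be sketched in a few lines of Python. This is a minimal illustration only: the function name `select_up_to_ffp` and the encoding of tester outcomes as a list of booleans (True meaning the pattern failed the chip) are assumptions made for the example, not part of METER.

```python
from typing import List, Sequence


def select_up_to_ffp(failed: Sequence[bool]) -> List[int]:
    """Return the indices of the patterns from the first one through the FFP.

    failed[i] is True when pattern i failed the chip (hypothetical encoding).
    """
    for index, pattern_failed in enumerate(failed):
        if pattern_failed:
            return list(range(index + 1))
    # At least one failing pattern must be selected for every chip.
    raise ValueError("no failing pattern recorded for this chip")


# Two hypothetical chips tested with the same five production patterns.
chip1 = [False, False, True, False, True]
chip2 = [False, True, False, False, False]
selected_chip1 = select_up_to_ffp(chip1)
selected_chip2 = select_up_to_ffp(chip2)
print(selected_chip1, selected_chip2)
```

Because the two hypothetical chips have different FFPs, the selected subsets differ in length, which is the situation described for chip 1 and chip 2.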
It should be noted that in some test flows, such as in adaptive test or in a stop-on-first-fail environment, some test patterns may not be applied. These un-applied test patterns ($\overline{T}_c$), although no pass/fail information is available for them, can still be used for analysis. The use of selected test patterns will be described in Section III-E.

Fig. 5. Categories of test patterns given a failing chip c and a suspect region s. The notation in parentheses denotes the set of test patterns in that category.

C. Failing-Chip Selection

The objective of failing-chip selection is to identify chips that are suitable for test evaluation and are of interest. The chip selection process may vary depending on the goal of the evaluation, the evaluated test metrics, the adopted suspect-region identification techniques, and the characteristics of the applied test patterns. For example, all the failing chips in the failure logs can be used for analysis if a large sample size is desired. In the cases where diagnosis is adopted for identifying suspect regions, diagnosable failing chips are chosen. If test metrics that target multiple faulty lines (e.g., open, Byzantine bridge, multiple stuck-at, and others) are to be evaluated, diagnosis methods such as [33] and [34] can be used to identify chips that exhibit this behavior. If the considered test metrics target defects that are not deterministically detected by stuck-at test, a set of “hard-to-detect” failing chips that do not exhibit stuck-at behavior can be selected.

D. Test Metrics for Evaluation

METER can be used to evaluate any test metric, fault model, or DFT approach, or their variants, whether they target static or dynamic defect behaviors. For instance, the input-pattern fault model can be evaluated at the gate level or higher levels of hierarchy (this will be demonstrated later in Section IV-E).
METER is applicable as long as the test environment employed and the test approaches applied to the failing chips adhere to the assumptions of the test metrics under evaluation. For example, evaluating sequence-dependent defects for PAN-detect test, although possible, is not reasonable since any detection of sequence-dependent defects is fortuitous in nature. Similarly, evaluating the transition fault model for a stuck-at-only test (i.e., no launch-on-shift or launch-on-capture) would also be inappropriate.

E. Test Evaluation

To evaluate a test metric, we examine whether defect detection is associated with changes in coverage of the test metric for the identified suspect regions. This is achieved by analyzing the selected subset of test patterns $T_{sel_c}$ for each failing chip c. Without loss of generality, we assume here that all of the test patterns applied to c are selected, i.e., $T_{sel_c} = T_c$. (The cases where un-applied test patterns are used, i.e., $T_{sel_c} \cap \overline{T}_c \neq \emptyset$, will be discussed later.) $T_c$ is fault simulated without fault dropping using stuck-at faults involving the identified suspect regions only. For each suspect s of chip c, we identify the set of test patterns in $T_c$ that sensitize s (i.e., $T_{c,s}$ in Fig. 5), and track the changes in metric coverage for s resulting from the application of each test pattern in $T_{c,s}$. Let $Cov_{m,c}(s)$ denote the coverage of metric m for suspect s of chip c.

Fig. 6. Correlation between defect detection and changes in metric coverage for suspect s of failing chip c.

A test pattern $t \in T_{c,s}$ can be classified into one of the following categories depending on whether t increases $Cov_{m,c}(s)$ and failed chip c (see Fig. 6):
1) t increases $Cov_{m,c}(s)$ and passed chip c (zone 1);
2) t increases $Cov_{m,c}(s)$ and failed chip c (zone 2);
3) t does not increase $Cov_{m,c}(s)$ but failed chip c (zone 3);
4) t does not increase $Cov_{m,c}(s)$ and passed chip c (outside the two circles).

Given a test metric, if a failing pattern $t_f \in T_{c,s}$ increases the metric coverage for s, the test metric is considered effective in detecting chip c. In other words, a test metric is deemed effective if the metric coverage increases with the application of the failing pattern $t_f$ (which falls into zone 2). If the coverage does not change with $t_f$ (i.e., $t_f$ ∈ zone 3), then the test metric was not at all needed to detect the corresponding chip failure. The chip failed due to other reasons outside the scope of the metric. Moreover, if the tests before $t_f$ have already achieved “100% coverage” of the test metric, then the corresponding metric does not guarantee the detection of the failure. In contrast, if a test pattern $t_p \in T_{c,s}$ increases coverage of some metric for suspect s but did not fail chip c (i.e., $t_p$ ∈ zone 1), then the metric covers behavior outside the defect behavior. In the best case, this means more test patterns are needed to further improve metric coverage to eventually detect the defect. But obviously, this may not be possible if the defect behavior lies outside the metric. Let $X^1_{m,c,s}$, $X^2_{m,c,s}$, and $X^3_{m,c,s}$ denote the number of test patterns in $T_{c,s}$ that fall into zones 1, 2, and 3, respectively, for a metric m. A good test metric should have a large $X^2_{m,c,s}$ and a small $X^1_{m,c,s}$ and $X^3_{m,c,s}$.

If $T_{sel_c}$ includes more than one failing pattern, then each failing pattern may indict different suspects. For example, a suspect reported by single location at a time (SLAT) diagnosis [35] may only be associated with some failing patterns whose tester responses match the suspect’s stuck-at fault simulation responses. If this is the case, each suspect should be examined using the corresponding failing patterns.

A failing chip may have multiple suspects, each of which could alone cause chip failure. For these situations, the analysis needs to be performed for each suspect. For each chip c, we sum $X^1_{m,c,s}$, $X^2_{m,c,s}$, and $X^3_{m,c,s}$ over all the suspects of c as follows:

$$X^k_{m,c} = \sum_{s} X^k_{m,c,s}, \quad k \in \{1, 2, 3\} \tag{1}$$

where $X^k_{m,c}$ will be used to assess the effectiveness and efficiency of a test metric.

1) Test effectiveness: To evaluate the effectiveness of a test metric, we examine how well the metric subsumes defect behavior using a measure called effectiveness ratio. For a given metric m and chip c, the effectiveness ratio is computed as follows:

$$ER_{m,c} = \frac{X^2_{m,c}}{X^2_{m,c} + X^3_{m,c}}. \tag{2}$$

The effectiveness ratio represents how often the coverage of m is increased for some suspect when the chip failed. A high effectiveness ratio means that increasing metric coverage correlates with defect detection for this particular chip. If $X^2_{m,c} + X^3_{m,c} = 1$, i.e., chip c has only one suspect and $T_{sel_c}$ includes only one failing pattern, then $ER_{m,c}$ simply depends on whether the failing pattern increases metric coverage. If the coverage is increased, then $X^2_{m,c} = 1$ and $ER_{m,c} = 1$. Otherwise, $ER_{m,c} = 0$. In other words, if $X^2_{m,c} + X^3_{m,c} = 1$, $ER_{m,c}$ becomes a binary indicator of whether the test metric is effective.

2) Test efficiency: Another focus in test evaluation is the efficiency of a test metric. Early detection of failing chips is desired because it saves test application cost, especially in a stop-on-first-fail environment. Test efficiency has been defined as the ratio of the number of patterns targeting a specific test metric to the number of chip failures detected by those patterns. The metric with a smaller number of “patterns per failure” is considered more efficient [28]. Instead, we define the efficiency of a test metric to be the ratio of zone 2 to the left circle (see Fig. 6).
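The zone classification, the per-chip sums, and the two ratios described above can be sketched as follows. This is a minimal illustration under stated assumptions: the data layout (one list of `(increases coverage, failed chip)` pairs per suspect) and all function names are hypothetical, and returning 0 for the efficiency ratio when the left circle is empty is a convention chosen here for the sketch.

```python
from typing import Dict, List, Optional, Tuple

# One entry per pattern in T_{c,s}: (increases coverage of s, failed chip c).
Outcome = Tuple[bool, bool]


def zone(increases: bool, failed: bool) -> Optional[int]:
    """Map a pattern to zone 1, 2, or 3 of Fig. 6, or None if outside both circles."""
    if increases:
        return 2 if failed else 1
    return 3 if failed else None


def zone_counts(per_suspect: Dict[str, List[Outcome]]) -> Dict[int, int]:
    """Sum the per-suspect zone counts over all suspects of one chip."""
    counts = {1: 0, 2: 0, 3: 0}
    for outcomes in per_suspect.values():
        for increases, failed in outcomes:
            z = zone(increases, failed)
            if z is not None:
                counts[z] += 1
    return counts


def effectiveness_ratio(x: Dict[int, int]) -> float:
    """Zone 2 over all failing patterns: X2 / (X2 + X3)."""
    failing = x[2] + x[3]
    if failing == 0:
        raise ValueError("no failing pattern sensitizes any suspect")
    return x[2] / failing


def efficiency_ratio(x: Dict[int, int]) -> float:
    """Zone 2 over the left circle: X2 / (X1 + X2); 0 if the circle is empty."""
    left_circle = x[1] + x[2]
    return x[2] / left_circle if left_circle else 0.0


# Hypothetical chip with two suspects.
chip = {
    "s1": [(True, False), (True, False), (True, True), (False, False)],
    "s2": [(True, False), (False, True)],
}
counts = zone_counts(chip)
er = effectiveness_ratio(counts)
fr = efficiency_ratio(counts)
print(counts, er, fr)
```

In this made-up example, three patterns raise coverage without failing the chip, one failing pattern raises coverage, and one failing pattern does not, so the metric explains half of the failing patterns while only a quarter of the coverage-increasing patterns coincide with a failure.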
Formally, we compute the efficiency ratio for test metric m and failing chip c as follows:

$$FR_{m,c} = \frac{X^2_{m,c}}{X^1_{m,c} + X^2_{m,c}}. \tag{3}$$

A high $FR_{m,c}$ means that zone 1 is small relative to the left circle and that increasing metric coverage correlates well with defect detection. The test metric is therefore more efficient in capturing chip failure. If metric m is not at all effective for a particular chip c, resulting in $X^2_{m,c} = 0$, then $FR_{m,c}$ becomes zero by definition. This measure of efficiency is particularly useful in a stop-on-first-fail environment but is also applicable to cases where information on subsequent failing patterns is collected as well. It should be noted that a metric can be very effective in defect detection but have a poor efficiency. Moreover, a metric that precisely captures a small portion of some defect behavior may be ineffective but efficient.

3) Fault-detection recording: Suppose for some metric m, a failing pattern $t_{f,1} \in T_{sel_c}$ detects some fault involving suspect s of chip c. If a subsequent failing pattern $t_{f,2} \in T_{sel_c}$ also detects the same fault, then $t_{f,2}$ does not increase metric coverage. In other words, $t_{f,2}$ is placed into zone 3. $X^3_{m,c,s}$ (as well as $X^3_{m,c}$) is increased by 1, while $X^2_{m,c,s}$ (and $X^2_{m,c}$) remains the same, which in turn degrades $ER_{m,c}$. Nevertheless, it is possible that the fault captures the behavior of the defect causing chip failure, and every test pattern that detects this particular fault fails the chip. The metric m is effective in detecting chip c, but this is not accounted for by the current formulation of $ER_{m,c}$. To prevent underestimating $ER_{m,c}$, a different fault-detection recording scheme can be used. In the new scheme, a fault is recorded as detected only if it is detected by a passing pattern. (This does not affect whether a test pattern detects the fault; only the detection status of the fault is changed.)
In other words, if a fault is detected only by failing patterns, each failing pattern detecting this fault increases the metric coverage (while the fault is still recorded as undetected). These failing patterns are therefore classified into zone 2 instead of zone 3, and $X^2_{m,c,s}$ as well as $X^2_{m,c}$ are increased. With this new fault-detection recording scheme, $ER_{m,c}$ will not be underestimated. Nevertheless, because of the faults detected only by failing patterns, we lose the opportunity to examine whether other faults also detect chip failures. The original fault-detection scheme does not have this issue. Analysis can be performed using one or both schemes depending on the evaluation objective.

4) Using un-applied test patterns: In Section III-B, we mentioned that in some test flows, some production test patterns may not be applied to a failing chip c (the subset $\overline{T}_c$ in Fig. 5). These patterns, if applied, may further increase the coverage of some test metrics for some suspects, and may have the capability of detecting chip failures. METER provides a way to consider the effect that $\overline{T}_c$ could possibly have had. Without loss of generality, assume that all test patterns in $\overline{T}_c$ are selected for analysis, i.e., $\overline{T}_c \subset T_{sel_c}$. Specifically, for a suspect s of failing chip c, test patterns in $\overline{T}_c$ that sensitize s ($\overline{T}_{c,s}$) are used. Let $X^1_{m,\overline{c},s}$, $X^2_{m,\overline{c},s}$, and $X^3_{m,\overline{c},s}$ be the number of test patterns in $\overline{T}_{c,s}$ that could fall into zones 1, 2, and 3, respectively, for a metric m. Again, for each chip c, we calculate as follows:

$$X^k_{m,\overline{c}} = \sum_{s} X^k_{m,\overline{c},s}, \quad k \in \{1, 2, 3\}. \tag{4}$$

The definitions of effectiveness ratio and efficiency ratio can then be rewritten as follows:

$$ER_{m,c} = \frac{X^2_{m,c} + X^2_{m,\overline{c}}}{(X^2_{m,c} + X^3_{m,c}) + (X^2_{m,\overline{c}} + X^3_{m,\overline{c}})} \tag{5}$$

$$FR_{m,c} = \frac{X^2_{m,c} + X^2_{m,\overline{c}}}{(X^1_{m,c} + X^2_{m,c}) + (X^1_{m,\overline{c}} + X^2_{m,\overline{c}})}. \tag{6}$$

Since $\overline{T}_c$ was not applied, it is unknown whether a test pattern in $\overline{T}_{c,s}$ would pass or fail chip c.
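As a small numerical illustration of (5) and (6), the sketch below combines zone counts from applied patterns with counts assumed for un-applied patterns. The function name `combined_ratios`, the tuple layout `(X1, X2, X3)`, and the example numbers are all hypothetical; returning 0 for the efficiency ratio when no pattern raises coverage is a convention chosen for the sketch.

```python
from typing import Tuple

Counts = Tuple[int, int, int]  # (X1, X2, X3) summed over all suspects of a chip


def combined_ratios(applied: Counts, unapplied: Counts) -> Tuple[float, float]:
    """Evaluate the effectiveness and efficiency ratios of (5) and (6)."""
    x1, x2, x3 = applied
    u1, u2, u3 = unapplied
    zone2 = x2 + u2
    failing = (x2 + x3) + (u2 + u3)      # denominator of (5)
    left_circle = (x1 + x2) + (u1 + u2)  # denominator of (6)
    if failing == 0:
        raise ValueError("at least one failing pattern is required")
    er = zone2 / failing
    fr = zone2 / left_circle if left_circle else 0.0
    return er, fr


# With no un-applied patterns, (5) and (6) reduce to (2) and (3).
er_applied, fr_applied = combined_ratios((3, 1, 1), (0, 0, 0))
# Hypothetical counts for the un-applied patterns.
er_both, fr_both = combined_ratios((3, 1, 1), (2, 1, 0))
print(er_applied, fr_applied, er_both, fr_both)
```

Passing all-zero counts for the un-applied patterns is a quick sanity check that the rewritten definitions agree with the original ones. The second call only shows the arithmetic for one assumed set of counts.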
The actual values of $X^1_{m,\overline{c},s}$, $X^2_{m,\overline{c},s}$, and $X^3_{m,\overline{c},s}$ for the un-applied patterns are therefore unknown. However, from fault simulation, we know which test patterns in $\overline{T}_{c,s}$ increase coverage for some suspect s of chip c (denote the set as $\overline{T}^{+}_{c,s}$) and which test patterns do not (denote the set as $\overline{T}^{-}_{c,s}$). The number of test patterns in $\overline{T}^{+}_{c,s}$ and $\overline{T}^{-}_{c,s}$ can be used to calculate the best and worst effectiveness/efficiency ratio that a test metric could achieve with $\overline{T}_c$. Specifically, in the worst case, chip c passes all the test patterns in $\overline{T}^{+}_{c,s}$, and fails all the test patterns in $\overline{T}^{-}_{c,s}$. In other words, $X^1_{m,\overline{c},s} = |\overline{T}^{+}_{c,s}|$, $X^2_{m,\overline{c},s} = 0$, and $X^3_{m,\overline{c},s} = |\overline{T}^{-}_{c,s}|$. Calculating $X^k_{m,\overline{c}}$ and substituting into (5) and (6) provides the worst case effectiveness ratio and efficiency ratio, respectively. In the best case, the chip fails all the test patterns in $\overline{T}^{+}_{c,s}$, and passes all the test patterns in $\overline{T}^{-}_{c,s}$. As a result, $X^1_{m,\overline{c},s} = 0$, $X^2_{m,\overline{c},s} = |\overline{T}^{+}_{c,s}|$, and $X^3_{m,\overline{c},s} = 0$. The best case effectiveness ratio and efficiency ratio can then be derived accordingly.

Actually applying $\overline{T}_c$ may indict additional suspects that were not identified previously. When this occurs, further analysis concerning defect regions or defect detections may be needed.

F. Metric/Model Case Studies

In the following, we illustrate the detailed procedures employed for evaluating the effectiveness and efficiency of the bridge fault model [5], [6], the gate-exhaustive metric [9], [10] (also known as the gate-level input-pattern fault model [8]), and the physically-aware N-detect test metric [14]–[16]. METER is not limited to these test metrics, however, and can be just as easily applied to various DFT approaches as well.

1) Bridge fault models: To evaluate various bridge fault models, we first extract the possible bridge regions for each identified suspect of a chip c.
Specifically, the physical neighbors that are within a distance d of each suspect are obtained from the design's layout.² A suspect s and each of its physical neighbors form a possible bridge defect, and all the bridge defects involving s are evaluated. Associated with each bridge consisting of a suspect s and its physical neighbor p are two 2-line bridge faults: s stuck-at zero when p = 0 and s stuck-at one when p = 1. Traditional bridge fault models (e.g., AND-type, OR-type, dominate, and the four-way bridge fault models) are all implicitly considered, including both non-feedback and feedback bridges. M-line bridge faults can be handled as well, but in this analysis we only consider bridge faults with M = 2.

For each test pattern t ∈ Tselc, we examine the bridge faults that are detected by t and by Tprev, where Tprev is the set of test patterns in Tselc that are applied before t. Whenever a physical neighbor p is driven to the opposite value of s and a stuck-at fault affecting s is detected, a bridge fault involving p and s is deemed detected. We specifically examine whether a particular bridge fault is detected by t but not by Tprev. If all of the bridge faults detected by t are detected by Tprev, or if t does not detect any bridge fault, then t does not increase the bridge fault coverage for the suspect s of chip c. Test t will be classified depending on whether t failed chip c and whether t increases bridge fault coverage for s, as described in Section III-E. Quantities $X^{1}_{B,c,s}$, $X^{2}_{B,c,s}$, and $X^{3}_{B,c,s}$ (where B stands for bridge fault) are calculated based on the classification result, and are used to assess the effectiveness and efficiency of the bridge fault model.

² Physical neighbors can be identified using DRC/LVS [28] or critical-area [36] approaches or by utilizing parasitic extraction data.
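The bridge-fault classification described above can be illustrated with a minimal Python sketch. The `Pattern` record, its field names, and the helper functions are hypothetical and introduced only for illustration. The sketch assumes that every detection is recorded (the original recording scheme), that a passing pattern adding no coverage falls outside all three zones, and that effectiveness and efficiency are the zone-2 share of failing patterns and of coverage-increasing patterns, respectively.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple

BridgeFault = Tuple[str, int]  # (neighbor name, neighbor logic value)


@dataclass
class Pattern:
    """One test pattern as seen by a single suspect s (hypothetical record)."""
    failed: bool                     # True if the chip failed this pattern
    detects_stuck_at: bool           # True if a stuck-at fault on s is detected
    s_value: int                     # fault-free logic value of s (0 or 1)
    neighbor_values: Dict[str, int]  # logic value on each physical neighbor


def bridge_faults_detected(t: Pattern) -> Set[BridgeFault]:
    """A bridge fault (p, v) is deemed detected when neighbor p carries the
    value v opposite to s and a stuck-at fault on s is detected."""
    if not t.detects_stuck_at:
        return set()
    return {(p, v) for p, v in t.neighbor_values.items() if v != t.s_value}


def classify(patterns: Iterable[Pattern]) -> Dict[int, int]:
    """Count the patterns of one suspect that fall into zones 1, 2, and 3."""
    detected_so_far: Set[BridgeFault] = set()  # faults detected by Tprev
    zones = {1: 0, 2: 0, 3: 0}
    for t in patterns:                         # patterns in application order
        new_faults = bridge_faults_detected(t) - detected_so_far
        if t.failed:
            zones[2 if new_faults else 3] += 1
        elif new_faults:
            zones[1] += 1
        detected_so_far |= new_faults
    return zones


def ratios(zones: Dict[int, int]) -> Tuple[float, float]:
    """Effectiveness X2/(X2+X3) and efficiency X2/(X1+X2); 0.0 if undefined."""
    x1, x2, x3 = zones[1], zones[2], zones[3]
    er = x2 / (x2 + x3) if x2 + x3 else 0.0
    fr = x2 / (x1 + x2) if x1 + x2 else 0.0
    return er, fr


example = [
    Pattern(False, True, 1, {"a": 0, "b": 1}),  # passes, detects (a, 0): zone 1
    Pattern(False, True, 1, {"a": 0, "b": 1}),  # passes, nothing new: no zone
    Pattern(True, True, 0, {"a": 0, "b": 1}),   # fails, detects (b, 1): zone 2
]
zones = classify(example)
er, fr = ratios(zones)
```

In this made-up example the failing pattern is the only one in zones 2 and 3, so the effectiveness ratio is one, while the single earlier passing pattern that added coverage halves the efficiency ratio.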
2) Gate-exhaustive metric: We also evaluate the effectiveness/efficiency of the gate-exhaustive metric [8]–[10]. With the assumption that only a single gate is faulty, gate-exhaustive testing requires each gate output to be sensitized for all possible input combinations. The procedure for evaluating the gate-exhaustive metric is similar to the one used for the bridge fault models. For each suspect s of a failing chip c, we identify the inputs of the gate that drives the suspect, i.e., the driver-gate inputs. The set of logic values applied to the driver-gate inputs of s by a test pattern t that sensitizes s is defined as a driver state of s. We next track the driver state established by each test pattern t ∈ Tselc, and examine whether t establishes a new driver state and sensitizes s, thereby increasing the gate-exhaustive coverage. In the case where s is a branch, the coverage of the downstream gate driven by s is calculated. Based on the zone that t falls into (see Fig. 6), $X^{1}_{G,c,s}$, $X^{2}_{G,c,s}$, and $X^{3}_{G,c,s}$ (G stands for gate-exhaustive) are calculated.

3) Physically-aware N-detect metric: The physically-aware N-detect (PAN-detect) metric exploits physical information to generate test patterns capable of improving defect detection for modern designs [14]–[16]. The metric defines the neighborhood of a suspect as the set of signal lines surrounding the suspect. Three types of signal lines are considered in the neighborhood of a suspect [16]: 1) signal lines that are within a distance d of the suspect in the layout (physical neighbors); 2) inputs of the gate that drives the suspect (driver-gate inputs); and 3) side inputs of the gates that receive the suspect (receiver-side inputs). The set of logic values established by a test pattern t on the neighborhood lines of a suspect s when s is sensitized is called the neighborhood state. A PAN-detect test requires that a targeted signal line be sensitized with at least N neighborhood states.
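Driver states and neighborhood states are tracked in the same way: a pattern adds coverage only if it sensitizes the suspect while establishing a state not seen before. The following minimal Python sketch illustrates this bookkeeping. The function names and the tuple encoding of a state are assumptions made for illustration, not part of any tool used in this work.

```python
from typing import List, Sequence, Tuple

State = Tuple[int, ...]            # logic values on driver-gate or neighborhood lines
Observation = Tuple[bool, State]   # (pattern sensitizes s?, state it establishes)


def coverage_increase_flags(patterns: Sequence[Observation]) -> List[bool]:
    """For each pattern, in application order, report whether it establishes
    a new state while sensitizing the suspect."""
    seen = set()
    flags = []
    for sensitizes, state in patterns:
        is_new = sensitizes and state not in seen
        if is_new:
            seen.add(state)
        flags.append(is_new)
    return flags


def distinct_states(patterns: Sequence[Observation]) -> int:
    """Number of distinct states under which the suspect was sensitized."""
    return len({state for sensitizes, state in patterns if sensitizes})


def satisfies_pan_detect(patterns: Sequence[Observation], n: int) -> bool:
    """True if the suspect is sensitized with at least n distinct states."""
    return distinct_states(patterns) >= n


observed = [
    (True, (0, 1)),    # first sensitization: new state
    (True, (0, 1)),    # same state again: no coverage increase
    (False, (1, 1)),   # suspect not sensitized: state does not count
    (True, (1, 0)),    # new state
]
flags = coverage_increase_flags(observed)
```

For the gate-exhaustive metric the state tuple would hold only the driver-gate input values; for PAN-detect it would hold the values on all neighborhood lines.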
To evaluate the effectiveness of PAN-detect, we extract the neighborhood for each suspect. We next track the neighborhood states established by test pattern t ∈ Tselc and by Tprev for each suspect s. If t establishes a new neighborhood state that has not yet been established by Tprev, then t increases the PAN-detect coverage for the suspect. Test t is then classified into the appropriate zone based on the rules described in Section III-E, and $X^{1}_{P,c,s}$, $X^{2}_{P,c,s}$, and $X^{3}_{P,c,s}$ (P stands for PAN-detect) are calculated.

IV. Experiments

We apply METER to evaluate the bridge, gate-exhaustive, and PAN-detect metrics. Failure logs from LSI test chips fabricated in a 110 nm process are utilized. The test chip design consists of 384 64-bit arithmetic-logic units (ALUs), where each ALU has ∼3000 gates. The stuck-at test of an ALU consists of approximately 260 scan-test patterns, achieving >99% stuck-at fault coverage. In this experiment, signal lines within 0.5 µm of the targeted line are deemed physical neighbors, which are used for evaluating both bridge and PAN-detect. We have data for over 2500 failing chips. For an assumed yield of 95%, failing chips make up about 5% of all manufactured chips, so our analysis here is equivalent to a chip test experiment involving more than 50 000 chips.

In the following, we describe the procedures used to select failing chips and identify suspects for subsequent analysis (Section IV-A), and present the results in detail (Sections IV-B–IV-E). While the test patterns up to and including the FFP are used in Sections IV-A–IV-E, we demonstrate in Section IV-F how test metrics can be evaluated using all the applied test patterns and different suspect-region identification techniques.

A. Diagnosable and Hard-to-Detect Chip Selection

In this experiment, we use diagnosis to identify suspect regions that cause chip failure. The three test metrics evaluated, namely, bridge, gate-exhaustive, and PAN-detect, target defects not deterministically detected by stuck-at test patterns.
A set of diagnosable and hard-to-detect failing chips is therefore selected for analysis. Here, diagnosable means that a suspect region that leads to the FFP can be pinpointed by diagnosis, while "hard to detect" means that the failing chip would not necessarily be detected by tests aimed only at stuck-at faults. Hard-to-detect chips are the target of bridge, N-detect, and PAN-detect test, and therefore are the subject of our analysis.

Of the 2533 chips in the LSI failure logs, 720 chips are diagnosable and 87 of the 720 are hard-to-detect.³ The 87 chips are partitioned into two categories: 28 chips having only one suspect and 59 having two or more suspects, each of which alone could cause the chip's FFP. The 28 failing chips are of particular interest since we have significant confidence in the failure region identified by diagnosis. Test metrics can be easily evaluated for the single suspect of each chip. For the remaining 59 chips, resolution for the FFP is degraded, meaning that there is more than one single-region candidate that could cause the FFP. For these cases, test-metric evaluation is performed and analyzed over all the suspects of a chip (Section IV-C).

³ All the 2533 chips, including those that are disregarded here, are analyzed later when all the failing chips are examined using all the applied test patterns and less-restricted suspect-region identification techniques.

B. Single-Suspect Failing Chips

Table II shows the results of applying METER to the 28 single-suspect LSI failing chips. Column one gives the chip index. Columns two to six show the total number of physical neighbors of the suspect (Nbrs), the number of test patterns (including the FFP) that sensitize the identified suspect (i.e., Nd,B), the number of unique bridge faults detected by the patterns before the FFP (Bprev), the number of test patterns before the FFP that detect new bridge faults ($X^{1}_{B,c}$), and the number of new, unique bridge faults detected by the FFP (BFFP), respectively. Similarly, for gate exhaustive, columns seven to ten give the total number of gate inputs driving the suspect (Gate inputs), the number of test patterns that sensitize the identified suspect (Nd,G), and the number of unique driver states (of the suspect) that are established before the FFP (Gprev) and by the FFP (GFFP). The last four columns show the numbers for PAN-detect, including the number of signal lines in the neighborhood of the suspect (Nbrhd), the number of test patterns that sensitize the identified suspect (Nd,P), and the number of unique neighborhood states established before the FFP (Pprev) and by the FFP (PFFP).

TABLE II
Test-Metric Evaluation Results for Single-Suspect Failing Chips

     |             Bridge             |       Gate-Exhaustive       |      PAN-Detect
Chip | Nbrs Nd,B Bprev X^1_{B,c} BFFP | Gate inputs Nd,G Gprev GFFP | Nbrhd Nd,P Pprev PFFP
   1 |   14    3    14         2    0 |           2    3     1    0 |    15    3     2    1
   2 |   18    6    28         5    1 |           1    6     2    0 |    19    4     3    1
   3 |   18    8    31         4    1 |           3    8     4    1 |    21    7     6    1
   4 |    5   19    10         6    0 |           2   19     3    1 |     7   10     8    1
   5 |   10    3    12         2    3 |           2    3     2    0 |    12    2     1    1
   6 |   32    2    13         1   10 |           2    2     1    0 |    32    2     1    1
   7 |   12    3     7         1    1 |           2    7     3    0 |    13    3     1    1
   8 |   12   12    22         6    0 |           2   12     2    0 |    14    9     7    1
   9 |    8   16    14         4    0 |           2   16     4    0 |     8   10     8    1
  10 |   24    3    14         2    2 |           2    3     1    0 |    25    3     2    1
  11 |    7    4    10         3    1 |           2    4     2    0 |     9    2     1    1
  12 |   14    4    12         3    0 |           2    5     2    0 |    15    4     3    0
  13 |   14    9    25         5    1 |           2    9     2    0 |    15    5     3    1
  14 |   11    9    18         4    1 |           2    9     2    0 |    12    3     2    1
  15 |    5    6     5         4    2 |           2    6     2    0 |     7    4     3    1
  16 |   13    5    10         3    1 |           2    6     2    0 |    14    5     4    1
  17 |    8    5     9         4    2 |           2    5     3    0 |     9    2     1    1
  18 |   15   19    29         9    0 |           2   19     3    1 |    16    9     8    1
  19 |   11    2     7         1    2 |           2    2     1    0 |    13    2     1    1
  20 |    8   10    15         8    0 |           2   10     4    0 |    10    6     5    0
  21 |   18   42    36        12    0 |           3   42     6    0 |    21   30    29    1
  22 |   13    2     6         1    5 |           1    2     1    0 |    14    2     1    1
  23 |    6   17    11         5    0 |           2   17     3    0 |     8    7     5    1
  24 |   14    5    18         3    4 |           1    5     2    0 |    15    2     1    1
  25 |   16    3    19         2    3 |           1    3     2    0 |    16    2     1    1
  26 |   11    4    15         3    1 |           2    4     2    0 |    12    2     1    1
  27 |   32    2    13         1   10 |           2    2     1    0 |    32    2     1    1
  28 |   32    2    17         1    6 |           2    2     1    0 |    32    2     1    1

For gate-exhaustive and PAN-detect, the number of test patterns that passed the chip and increase metric coverage for the suspect (i.e., $X^{1}_{G,c}$ and $X^{1}_{P,c}$) is equal to the number of states established by the patterns before the FFP. In other words, $X^{1}_{G,c}$ = Gprev and $X^{1}_{P,c}$ = Pprev. $X^{1}_{G,c}$ and $X^{1}_{P,c}$ are therefore not listed explicitly.

Bridge faults involving a suspect include the cases where the suspect fails with a faulty-0 or faulty-1. Similarly, for gate exhaustive, considering all possible driver states implicitly takes into account both stuck-at faults. The analyzed test patterns therefore include those that sensitize the suspect to either logic zero or logic one up to and including the FFP. In cases where the suspect is a branch, the downstream gate g driven by s is analyzed, and the considered test patterns include those that sensitize the output of g. Since there can be more test patterns that sensitize g compared to s, Nd,G can be larger than Nd,B. For PAN-detect, on the contrary, a neighborhood state is associated with a specific stuck-at fault involving the suspect. Only the test patterns sensitizing the suspect with the required stuck-at fault polarity are considered. Therefore, Nd,G ≥ Nd,B ≥ Nd,P.

Analysis of Table II reveals that nine of the 28 single-suspect chips have BFFP = 0. This means that for these nine chips, tests aimed at bridge faults do not guarantee failure detection. Specifically, for chips 4 and 21, all the bridge faults involving the identified suspects are detected by the patterns before the FFP (i.e., 2 × Nbrs = Bprev). In other words, the bridge coverage for the suspects of these two chips is 100%, which indicates that the use of typical bridge models does not guarantee detection of these failures. Moreover, chips 8, 9, 18, 20, and 23 have a bridge coverage of over 80%.
For these cases, it is possible that additional bridge coverage could have detected the failure, but it obviously was not necessary since BFFP = 0 for each of these failing chips.

For the gate-exhaustive metric, only three of the 28 chips have increased coverage due to the FFP (i.e., GFFP = 1). The gate-exhaustive coverage appears to be low for many of these chips, which is surprising given Fig. 2. There could be a number of reasons why coverage is low, however, including that only tests before the FFP are examined and that some gate-input patterns may not be possible due to circuit structure. In any event, it is not the case that a particular driver state alone is needed to detect a large majority of these single-suspect failures. For the PAN-detect metric, the FFPs of all but two failing chips (chips 12 and 20) establish new neighborhood states for the identified suspects.

Fig. 7. Venn diagram showing the number of single-suspect failing chips whose FFP increases coverage for the bridge, gate-exhaustive, and PAN-detect test metrics.

Because single-suspect failing chips are considered here and Tselc includes only one failing pattern (the FFP), $X^{2}_{m,c} + X^{3}_{m,c} = 1$, where m ∈ {B, G, P}. For chips whose FFP increases some metric coverage (i.e., chips with BFFP > 0, GFFP > 0, or PFFP > 0), $X^{2}_{m,c} = 1$ and $X^{3}_{m,c} = 0$. The effectiveness ratio for these chips for the corresponding metric is therefore equal to one. Otherwise, $X^{2}_{m,c} = 0$ and $X^{3}_{m,c} = 1$, and the effectiveness ratio becomes zero. For example, the effectiveness ratio of chip 5 in Table II for bridge, gate-exhaustive, and PAN-detect is one, zero, and one, respectively. The efficiency ratio of a chip, on the contrary, can be calculated using (3) with $X^{1}_{B,c}$, $X^{1}_{G,c}$ = Gprev, and $X^{1}_{P,c}$ = Pprev for bridge, gate-exhaustive, and PAN-detect.
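The per-chip calculation for a single-suspect chip can be sketched in a few lines of Python. The function name and argument names are hypothetical; the inputs are the chip 5 entries of Table II ($X^{1}_{B,c}$ = 2, BFFP = 3; Gprev = 2, GFFP = 0; Pprev = 1, PFFP = 1). The sketch assumes the FFP is the only failing pattern analyzed, and that an efficiency ratio with a zero denominator is reported as zero.

```python
from typing import Dict, Tuple


def single_suspect_ratios(x1: int, ffp_new: int) -> Tuple[float, float]:
    """Effectiveness and efficiency ratio of one metric for a chip whose only
    analyzed failing pattern is the FFP.

    x1      -- passing patterns before the FFP that increased coverage (zone 1)
    ffp_new -- new faults or states contributed by the FFP
    """
    x2 = 1 if ffp_new > 0 else 0   # FFP in zone 2 if it increases coverage
    x3 = 1 - x2                    # otherwise the FFP is in zone 3
    er = x2 / (x2 + x3)            # denominator is always 1 here
    fr = x2 / (x1 + x2) if (x1 + x2) else 0.0
    return er, fr


# (x1, ffp_new) per metric for chip 5 of Table II
chip5: Dict[str, Tuple[int, int]] = {
    "bridge": (2, 3),
    "gate-exhaustive": (2, 0),
    "PAN-detect": (1, 1),
}
results = {metric: single_suspect_ratios(*v) for metric, v in chip5.items()}
```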
For instance, the efficiency ratio of chip 5 for bridge, gate-exhaustive, and PAN-detect is 1/(2+1) = 0.33, 0/(2+0) = 0, and 1/(1+1) = 0.5, respectively.

The Venn diagram in Fig. 7 summarizes the outcome of effectiveness evaluation for single-suspect failing chips. Each integer in the diagram is the number of chips whose FFP increases coverage of the evaluated metrics for the suspect regions. Fig. 7 shows that one chip is captured by all three metrics, while five chips are uniquely caught by PAN-detect. Note that the FFPs of chips 12 and 20 do not increase the coverage for any of the three metrics, implying that the chips failed in a way that cannot be captured by any of the three metrics, at least for the metric parameters used. These chips are further discussed in Section IV-E.

C. Multiple-Suspect Failing Chips

We apply METER to the 59 failing chips with multiple suspects. Specifically, for each suspect of a multiple-suspect failing chip, we apply the same analysis described in Section IV-B. Results for all the suspects are then collected, and the effectiveness ratio and efficiency ratio are calculated. Figs. 8 and 9 show the distributions of the effectiveness and efficiency ratios for the 59 multiple-suspect failing chips, respectively. For the bridge fault model, the FFP of 11 failing chips each detects some new bridge fault for all the suspects (i.e., ER = 100%). The bridge fault coverage is not increased for any of the suspects for nine of the 59 failing chips, implying that the bridge effectiveness and efficiency ratios for these chips are zero. For gate exhaustive, no chip has ER = 100%, but 48 of the 59 chips have ER = FR = 0. For PAN-detect, the FFP of 39 failing chips each establishes a new neighborhood state for all suspects (ER = 100%). The FFP of the remaining 20 chips each establishes a new neighborhood state for at least one but not all of the suspects. Since the suspects of these chips have failed with Nd > 1, each suspect is sensitized by at least one passing pattern, i.e., $X^{1}_{m,c} \geq 1$. As a result, the efficiency ratio can be at most 50%. Note that the trend of test-metric effectiveness observed from the multiple-suspect failing chips is in line with what we observed from the single-suspect failing chips presented in the previous section.

Fig. 8. Distribution of the effectiveness ratio ER for the multiple-suspect failing chips.

Fig. 9. Distribution of the efficiency ratio FR for the multiple-suspect failing chips.

D. Average Efficiency and Effectiveness

Using the data from the 28 single-suspect chips in Table II, an average effectiveness and efficiency ratio of bridge, gate exhaustive, and PAN-detect for the 28 single-suspect and hard-to-detect failing chips can be calculated and compared. Here, we adopt a visual approach to compare the effectiveness and efficiency for these metrics. For each metric m and each failing chip c, we calculate the ratio of $X^{1}_{m,c}$, $X^{2}_{m,c}$, and $X^{3}_{m,c}$ to their sum as follows:

$$F^{k}_{m,c} = \frac{X^{k}_{m,c}}{X^{1}_{m,c} + X^{2}_{m,c} + X^{3}_{m,c}}, \quad k \in \{1, 2, 3\}. \tag{7}$$

The ratios for a zone are averaged over all of the 28 failing chips as follows:

$$A^{k}_{m} = \frac{\sum_{c} F^{k}_{m,c}}{|C|}, \quad k \in \{1, 2, 3\} \tag{8}$$

where |C| = 28 is the number of chips considered. Note that $A^{1}_{m} + A^{2}_{m} + A^{3}_{m} = 1$. The averaged effectiveness ratio and efficiency ratio are calculated as follows:

$$ER_{m} = \frac{A^{2}_{m}}{A^{2}_{m} + A^{3}_{m}} \tag{9}$$

$$FR_{m} = \frac{A^{2}_{m}}{A^{1}_{m} + A^{2}_{m}}. \tag{10}$$

For each evaluated test metric, we re-plot Fig. 6 and make the area of zones 1, 2, and 3 proportional to $A^{1}_{m}$, $A^{2}_{m}$, and $A^{3}_{m}$, as shown in Fig. 10(a)–(c).⁴ It can be observed that PAN-detect has the largest zone 2 (0.31) among the three evaluated metrics, and also has the largest average effectiveness and efficiency ratios. The gate-exhaustive metric, on the contrary, has the smallest zone 2 (0.03), which leads to the lowest effectiveness and efficiency ratios among the three metrics.

Fig. 10. Likelihood that a test pattern failed a chip and/or increases coverage for (a) bridge, (b) gate-exhaustive, (c) PAN-detect, and (d) N-detect for the 28 LSI failing chips.

⁴ Using the data in Table II, we can also evaluate traditional N-detect, and the result is shown in Fig. 10(d). Specifically, Nd,P is the number of times a suspect is sensitized with the required stuck-at fault polarity, i.e., the stuck-at fault involving the suspect region of a failing chip has been N-detected with N = Nd,P test patterns when it fails. When a chip fails, the coverage for N-detect for a suspect region is always increased since Nd,P increases (unless a hard threshold of N is used). In other words, N-detect can never be ineffective. This is reflected in the perfect ER and the high FR for N-detect, and is also shown in Fig. 10(d), where zone 3 is empty and the left oval completely subsumes the right one. While the proposed measures reveal the characteristics of N-detect, using these measures to judge the effectiveness of N-detect is inappropriate. This holds for any test metric that can never be ineffective. Later, we demonstrate how to better compare N-detect and PAN-detect by utilizing METER in a different manner.

E. Test Metric Generalization

Results of METER described in Sections IV-B and IV-C show that the FFP of a chip may not improve the coverage of a metric. These chips failed due to defects whose activation conditions lie outside these metrics. Of the three evaluated metrics, PAN-detect increases coverage for the FFP for most failing chips. Nevertheless, PAN-detect does not capture two single-suspect failing chips and may fail to capture 20 multiple-suspect failing chips in the worst case (as shown in Figs. 7 and 8, respectively).
In our analysis thus far, we included physical neighbors, driver-gate inputs, and receiver-side inputs in the neighborhood for each suspect signal line. In the diagnosis procedures described in [37], [38], other types of neighbors are used as well, including the driver-gate inputs of physical neighbors, since they are known to affect drive strengths [6]. If the driver-gate inputs of physical neighbors are included in the neighborhoods instead of the physical neighbors, the FFPs of all the hard-to-detect chips (with either single or multiple suspects) establish new neighborhood states for all the suspects of these chips. Because the neighborhood encompasses all the localized influences on a suspect line, it is not surprising that PAN-detect performs well. However, there is a danger that too large a neighborhood makes the metric too general. Exploring the tradeoff between including additional types of signal lines in the neighborhood and increasing the distance d used for physical neighbor extraction, and the mismatch between defect behaviors, is needed to efficiently generate effective test sets. METER can be easily used to meet this objective by analyzing and guiding the selection of parameters used in ATPG.

Bridge fault models focus on unintended connections among wires. Use of bridge fault models is typically limited to defects that involve only two lines, create no structural feedback, and ignore cell-drive strengths. But they can be generalized in several ways, e.g., by including more than two lines and more complex contention functions. The gate-exhaustive test focuses on problems at the transistor level. It too, however, can be generalized to higher levels of the hierarchy or to include groups of cells or gates [8], [39]. Both bridge and gate-exhaustive metrics are subsumed, however, by PAN-detect with a neighborhood that includes physical neighbors, driver-gate inputs, and receiver-side inputs.

F. Utilizing All Test Patterns

We further apply METER using all of the applied test patterns for each of the 2533 available failing chip logs. In addition, we demonstrate different methods for identifying potential suspects. In the first method, any region that is sensitized by a failing pattern is deemed a suspect region. In the second method, the stuck-at fault response of a sensitized region (for at least one failing pattern) must exactly match the failing-pattern tester response to be deemed a suspect.

For the bridge fault model, and the gate-exhaustive and PAN-detect metrics, we plot the effectiveness ratio ER for all 2533 chips against the total number of unique suspects identified across all failing patterns. Specifically, Fig. 11(a) shows the result for selecting suspects using only pass-fail test data,⁵ while Fig. 11(b) shows the result for SLAT regions. Each failing chip has three points plotted: one that indicates the effectiveness ratio for PAN-detect (triangles), one for bridge (circles), and one for gate-exhaustive (squares). Fig. 11(a) and (b) reveals a significant range in the ER values, which is expected since many of the suspects have nothing to do with the defect. Also as expected, the scatter along the ER axis reduces as the suspect-identification procedure improves, as demonstrated when moving from the sensitized regions [Fig. 11(a)] to the SLAT regions [Fig. 11(b)].

Fig. 11. Effectiveness ratio for the 2533 chips. (a) Calculated for all regions sensitized by the failing patterns. (b) Calculated for the SLAT regions.

⁵ The analyzed ALU has 5110 signal lines, i.e., a failing chip has at most 5110 unique sensitized regions. The maximum occurs when a failing chip has many failing patterns where the union of the sensitized regions is the set of all the signal lines.
Finally, it is clear that the effectiveness of the metrics follows the same trend observed for the hard-to-detect chips, i.e., PAN-detect is most effective, followed by bridge and then gate exhaustive.

V. Discussion

In this section, we demonstrate the applicability of METER to large designs, and discuss applications of the methodology. We also compare and contrast METER against the traditional tester-based approach for evaluating test metrics.

A. Applicability to Large Designs

METER can be easily applied to large, industrial designs since it simply analyzes failure logs from already applied tests, and only requires fault simulation of stuck-at faults involving suspect regions. In other words, only a small portion of the circuit has to be analyzed and existing fault simulation tools can be utilized. METER is therefore scalable to large designs.

To demonstrate applicability, we apply METER to an NVIDIA GPU. The GPU has ∼10M gates, and is fabricated using 90 nm technology. The bridge fault model and the gate-exhaustive and PAN-detect metrics are evaluated. Diagnosis is performed using Synopsys TetraMAX [40] to identify suspect regions within failing chips. For each failing chip, the test patterns up to and including the chip's FFP are used for analysis. Of the 4000+ failing chips in the stuck-at failure logs, we focused on the 33 chips that have a single suspect reported by diagnosis and are hard to detect (Nd > 1). We perform fault simulation on these suspects over the selected test patterns using TetraMAX, and examine whether a chip's FFP increases the coverage of any metric for the suspect. The outcome is summarized in the Venn diagram shown in Fig. 12. Each integer in the diagram represents the number of chips whose FFP increases the coverage of the evaluated metric(s) for the suspect regions.

Fig. 12. Venn diagram showing the number of single-suspect failing chips whose FFP increases coverage for bridge, gate-exhaustive, and PAN-detect for NVIDIA GPUs.
The evaluation results are consistent with the previous experiments that use the LSI test chips. Specifically, PAN-detect uniquely captures five chip failures, while bridge test is found to be much more effective than gate-exhaustive. There are two chips whose FFP does not increase coverage of any of the three metrics. It is likely that PAN-detect test can capture these two chips if the driver-gate inputs of physical neighbors are included in the neighborhoods instead of the physical neighbors, similar to the case discussed in Section IV-E.

Similar to the experiment using the LSI test chips, we calculate the average ratios $A^{k}_{m}$, $ER_{m}$, and $FR_{m}$, using (8)–(10), respectively, for the three evaluated metrics (as well as N-detect) for the 33 selected NVIDIA GPUs (see Fig. 13). It can be observed that for these GPUs, PAN-detect test is most effective and efficient in defect detection, followed by bridge and then gate-exhaustive.

Fig. 13. Illustration of chances that a test pattern failed a chip and/or increases coverage for (a) bridge, (b) gate-exhaustive, (c) PAN-detect, and (d) N-detect for the 33 NVIDIA GPUs.

Our measure of effectiveness and efficiency provides a means to evaluate and compare test metrics over different manufacturing technologies and products. For example, by contrasting Figs. 10 and 13, it can be observed that gate-exhaustive test is more effective for the NVIDIA 90 nm GPUs than for the LSI 110 nm ALU chips. On the contrary, the ratio for the bridge fault model remains virtually the same. The PAN-detect metric is even more effective and efficient for the NVIDIA GPUs. To be conclusive, however, much more data from failing chips should be analyzed.

B. Applications

METER has been demonstrated by comparing the effectiveness and efficiency of several metrics. Measures of test-metric effectiveness and insufficiencies learned from tester data provide guidelines for developing new fault models, test metrics, and DFT methods. The information can also be used to select a proper mix of tests to guarantee a certain level of quality as described in [38] and [41]. Specifically, the work in [38] and [41] derives a defect-type distribution. METER can be used in conjunction with that work to determine which metrics and models are best at detecting the derived defect types, thus enabling custom test, i.e., a test that matches the defect-type distribution for a given design. Other applications of METER include guiding the selection of parameters used in ATPG, such as the distance for bridge extraction and neighbor identification for PAN-detect.

In the following, we demonstrate how METER can be applied to select a proper value of N for both N-detect and PAN-detect. As N increases, it is expected that the defect coverage of an N/PAN-detect test set would increase [3], [11], [14]. The improved test quality, however, comes at the cost of a higher pattern count and test application cost. Selecting an appropriate value of N therefore requires a tradeoff between cost and quality. Common practice is to choose N based on available test resources (i.e., tester memory, test time, and others). Here, we use METER to demonstrate how the test quality can be examined as a function of N.

For choosing N, we use failure logs of another large design, an IBM ASIC. The IBM chip has nearly a million gates and is fabricated using 130 nm technology. Physical neighborhood information includes all the signal lines within 0.6 µm of the targeted line. The stuck-at test set applied during wafer test consists of 3439 test patterns that achieve 99.51% stuck-at fault coverage. Among the 606 chips in the stuck-at failure logs, the 304 chips that failed scan chain flush test are disregarded.
The remaining 302 chips are diagnosed to identify the suspects using Cadence Encounter Diagnostics [42]. Each suspect is fault simulated using all test patterns up to and including the chip's FFP. For the stuck-at fault involving the suspect, we record the number of times the fault is detected (i.e., the number of N detections, Nd) and the number of neighborhood states established for the fault (i.e., the number of PAN detections, Ns).

Because 284 of the 302 diagnosed failing chips have more than one suspect, we take the following approach to handle multiple-suspect chips. For each failing chip, we record Nd and Ns of the suspect that is ranked highest in the diagnosis report, as well as the maximum/minimum/average Nd and Ns over all the suspects of the chip. The diagnosis tool employed reports a score for each identified suspect, where the score measures the similarity between suspect behavior and failing-chip behavior. The best-ranked suspect is the one that has the highest score, and is considered more likely to be the actual location of the failure. Using the Nd and Ns for the best-ranked suspect, as well as using the max/min/average Nd and Ns, constitutes a variety of options for obtaining Nd and Ns values for a multiple-suspect failing chip.

Fig. 14(a) and (b) shows the histograms of the number of N and PAN detections, respectively. The bars indicate the number of chips that are N/PAN detected, and the table in each plot reports the tail data. For example, seven of the 302 failing chips have their best-ranked suspect N-detected four times before the corresponding chip failed. For one chip, the best-ranked suspect was sensitized 33 times with a different neighborhood state before it finally failed for the 34th state. Using Fig. 14, the number of possible test escapes when different values of Nd and Ns are chosen can be easily determined.
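The escape count read off the histograms can also be computed directly from the per-chip detection counts. The following Python sketch is illustrative only: the function names and the three-chip data set are hypothetical (they are not the IBM data), and it assumes that a recorded count includes the detection on the FFP, so a chip is counted as an escape of an N-detect (or PAN-detect) test when its count exceeds N.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Suspect = Tuple[float, int]   # (diagnosis score, detection count Nd or Ns)


def summarize_chip(suspects: Sequence[Suspect]) -> Dict[str, float]:
    """Reduce the suspects of one chip to a single detection count in four
    ways: best-ranked suspect, maximum, minimum, and average."""
    best_score, best_count = max(suspects, key=lambda s: s[0])
    counts = [count for _, count in suspects]
    return {
        "best": best_count,
        "max": max(counts),
        "min": min(counts),
        "avg": mean(counts),
    }


def count_escapes(counts: Sequence[float], n: int) -> int:
    """Chips whose suspect needed more than n detections before failing."""
    return sum(1 for c in counts if c > n)


# Hypothetical failing chips, each a list of (score, detection count) suspects.
chips: List[List[Suspect]] = [
    [(0.90, 4), (0.50, 12)],
    [(0.80, 11)],
    [(0.70, 2), (0.95, 34)],
]
summaries = [summarize_chip(c) for c in chips]
escapes_at_10 = {
    option: count_escapes([s[option] for s in summaries], 10)
    for option in ("best", "max", "min", "avg")
}
```

Sweeping `n` over a range of values and comparing the escape count against a target defect level gives one way to examine quality as a function of N.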
For instance, if the best-ranked suspect region is the actual defect region, then applying an Nd = 10-detect test to that region only would lead to three test escapes. In contrast, applying a physically-aware Ns = 10-detect test would result in one test escape. Given enough chips to analyze and a threshold on the defect parts per million, this analysis can be used to select the value of N for ATPG.

Fig. 14. Histograms showing the number of (a) N detections and (b) PAN detections for the 302 IBM failing ASICs.

C. Comparing Test-Metric Evaluation Methods

Table III compares METER with the traditional approach involving application of extra test patterns generated specifically for the metrics under evaluation. Both approaches require the analysis of the chip’s design information (e.g., netlist and layout) for identifying fault characteristics such as physical neighbors and driver-gate inputs. Tester-based evaluation, however, typically requires the generation and fault simulation of extra test patterns in order to isolate the detection characteristics of each metric. Furthermore, new, powerful ATPG and fault simulation tools need to be developed, or existing tools have to be tricked, to generate tests for new test metrics since tests are needed for the entire design. METER, in contrast, fault simulates only a small subset of the existing test patterns, albeit without fault dropping, against suspect regions identified from failing chips. Given the NP-complete nature of ATPG, limited fault simulation of just a portion of the design without fault dropping is a significantly less-intensive task. METER can be easily applied to large designs, as demonstrated in Section V-A, since current tools for stuck-at fault simulation can be utilized and the analysis only requires some script writing. The traditional tester-based evaluation approach, where the entire design is considered, is not scalable in this way. More significantly, analysis of existing fail data is a much more cost-effective activity as compared to the tester time needed to apply extra patterns to tens or hundreds of thousands of chips.

However, since METER utilizes existing test patterns, the test environment (e.g., temperature, voltage, test clock rate, and others) cannot be changed. In other words, test metrics are evaluated under the environment that was employed. Moreover, as already mentioned in Section I, the coverage of the test metric is not controlled. On the other hand, as shown in Fig. 2, the coverage achieved for any given metric for most of the design is extremely high since it is typically the case that many regions in a design are well tested by a thorough stuck-at test set. The detection efficiency of bridge faults and the gate-exhaustive metric is probably even higher since some untested bridges and input-pattern faults are quite likely redundant.

Last but not least, tester-based evaluation is a gross measure of effectiveness since it is unknown whether the unique fallout is due to the model/metric/DFT method being evaluated or simply fortuitous in nature. METER, instead, is fine-grained in that it associates defect detection with changes in metric coverage for the suspect regions believed to contain the defect. Although suspect identification techniques such as diagnosis are not perfect, it is quite likely that the reported suspects include the actual defect regions. If we analyze all the suspects and observe statistically significant trends in the data, meaningful conclusions can be drawn.

TABLE III
Comparison of Test-Metric Evaluation Techniques

Tester-Based Evaluation              METER
Netlist analysis (−)                 Netlist analysis (−)
Layout analysis (−)                  Layout analysis (−)
ATPG/fault sim. (×)                  Limited fault sim. w/o fault dropping (✓)
New ATPG/fault sim. tools (×)        Existing fault sim. tools (✓)
Tester use (×)                       Analysis of fail data (✓)
Controllable test environment (✓)    Test environment not controllable (×)
Controllable coverage (✓)            Coverage not controllable (×)
Gross (×)                            Fine-grained (✓)
* ✓: good; −: tie; ×: bad.

VI. Conclusion

A general and cost-effective test-metric evaluation methodology, METER, was described and demonstrated. METER provided a novel approach for analyzing the effectiveness and efficiency of new and existing test and DFT methods, which traditionally relied on empirical data from expensive and time-consuming chip experiments. METER analyzed failure log files from tests already applied, and did not require additional tests. Although only test data from failing chips were analyzed, the analysis was equivalent to chip experiments with a large sample size. The time and cost for test generation and test application for test-metric evaluation were therefore completely avoided.

One limitation of METER is that test metrics are evaluated only under the environment that was employed (e.g., temperature, voltage, test clock rate, and others) since existing test-measurement data is utilized. As the test environment changes, the relative effectiveness of the evaluated test metrics may change as well. Test-measurement data collected under a different environment is needed to evaluate metrics for different conditions. Moreover, the test approaches applied to the failing chips should adhere to the assumptions of the test metrics under evaluation. Otherwise, the evaluation is possible but not reasonable.

METER has been demonstrated by comparing the effectiveness and efficiency of several metrics that include bridge, gate-exhaustive, and PAN-detect using the stuck-at failure logs from actual fabricated and tested ICs. With this approach, test metrics can be easily evaluated and compared over different manufacturing technologies and products. The resulting information provides guidelines on how to select the best mix of test methods. It can also be used to guide the development of new test metrics, fault models, and DFT methods, as well as the selection of parameters used in ATPG.

VII. Acknowledgment

The authors would like to thank Carnegie Mellon University, Pittsburgh, PA, Ph.D. students O. Poku for his help on the LSI experiment, C. Xue for his help on the IBM experiment, and J. Nelson, W. C. Tam, and X. Yu for their help on the NVIDIA experiment.

References

[1] M. L. Bushnell and V. D. Agrawal, Essentials of Electronic Testing for Digital, Memory, and Mixed-Signal VLSI Circuits. Boston, MA: Kluwer, 2000.
[2] S. Sengupta, S. Kundu, S. Chakravarty, P. Parvathala, R. Galivanche, G. Kosonocky, M. Rodgers, and T. M. Mak, “Defect-based test: A key enabler for successful migration to structural test,” Intel Technol. J., Q.1, pp. 1–12, 1999.
[3] S. C. Ma, P. Franco, and E. J. McCluskey, “An experimental chip to evaluate test techniques experiment results,” in Proc. Int. Test Conf., Oct. 1995, pp. 663–672.
[4] E. J. McCluskey and C.-W. Tseng, “Stuck-fault tests vs. actual defects,” in Proc. Int. Test Conf., Oct. 2000, pp. 336–342.
[5] K. C. Y. Mei, “Bridging and stuck-at faults,” IEEE Trans. Comput., vol. C-23, no. 7, pp. 720–727, Jul. 1974.
[6] J. M. Acken and S. D. Millman, “Accurate modeling and simulation of bridging faults,” in Proc. Custom Integr. Circuits Conf., May 1991, pp. 12–15.
[7] J. A. Waicukauski, E. Lindbloom, B. K. Rosen, and V. S. Iyengar, “Transition fault simulation,” IEEE Des. Test Comput., vol. 4, no. 2, pp. 32–38, Apr. 1987.
[8] R. D. Blanton and J. P. Hayes, “Properties of the input pattern fault model,” in Proc. Int. Conf. Comput. Des., Oct. 1997, pp. 372–380.
[9] E. J. McCluskey, “Quality and single-stuck faults,” in Proc. Int. Test Conf., Oct. 1993, p. 597.
[10] K. Y. Cho, S. Mitra, and E. J. McCluskey, “Gate exhaustive testing,” in Proc. Int. Test Conf., Nov. 2005.
[11] I. Pomeranz and S. M. Reddy, “A measure of quality for N-detection test sets,” IEEE Trans. Comput., vol. 53, no. 11, pp. 1497–1503, Nov. 2004.
[12] B. Benware, C. Schuermyer, N. Tamarapalli, K.-H. Tsai, S. Ranganathan, R. Madge, J. Rajski, and P. Krishnamurthy, “Impact of multiple-detect test patterns on product quality,” in Proc. Int. Test Conf., Sep.–Oct. 2003, pp. 1031–1040.
[13] M. E. Amyeen, S. Venkataraman, A. Ojha, and S. Lee, “Evaluation of the quality of N-detect scan ATPG patterns on a processor,” in Proc. Int. Test Conf., Oct. 2004, pp. 669–678.
[14] R. D. Blanton, K. N. Dwarakanath, and A. B. Shah, “Analyzing the effectiveness of multiple-detect test sets,” in Proc. Int. Test Conf., Sep.–Oct. 2003, pp. 876–885.
[15] Y.-T. Lin, O. Poku, N. K. Bhatti, and R. D. Blanton, “Physically-aware N-detect test pattern selection,” in Proc. DATE, Mar. 2008, pp. 634–639.
[16] Y.-T. Lin, O. Poku, R. D. Blanton, P. Nigh, P. Lloyd, and V. Iyengar, “Evaluating the effectiveness of physically-aware N-detect test using real silicon,” in Proc. Int. Test Conf., Oct. 2008.
[17] P. C. Maxwell, R. C. Aitken, K. R. Kollitz, and A. C. Brown, “IDDQ and AC scan: The war against unmodeled defects,” in Proc. Int. Test Conf., Oct. 1996, pp. 250–258.
[18] P. Nigh, W. Needham, K. Butler, P. Maxwell, and R. Aitken, “An experimental study comparing the relative effectiveness of functional, scan, IDDq and delay-fault testing,” in Proc. VLSI Test Symp., May 1997, pp. 459–464.
[19] J. T.-Y. Chang, C.-W. Tseng, Y.-C. Chu, S. Wattal, M. Purtell, and E. J. McCluskey, “Experimental results for IDDQ and VLV testing,” in Proc. VLSI Test Symp., Apr. 1998, pp. 118–123.
[20] C.-W. Tseng and E. J. McCluskey, “Multiple-output propagation transition fault test,” in Proc. Int. Test Conf., Oct. 2001, pp. 358–366.
[21] S. Chakravarty, A. Jain, N. Radhakrishnan, E. W. Savage, and S. T. Zachariah, “Experimental evaluation of scan tests for bridges,” in Proc. Int. Test Conf., Oct. 2002, pp. 509–518.
[22] B. R. Benware, R. Madge, C. Lu, and R. Daasch, “Effectiveness comparisons of outlier screening methods for frequency dependent defects on complex ASICs,” in Proc. VLSI Test Symp., May 2003, pp. 39–46.
[23] E. J. McCluskey, A. Al-Yamani, J. C.-M. Li, C.-W. Tseng, E. Volkerink, F.-F. Ferhani, E. Li, and S. Mitra, “ELF-Murphy data on defects and test sets,” in Proc. VLSI Test Symp., Apr. 2004, pp. 16–22.
[24] S. Mitra, E. Volkerink, E. J. McCluskey, and S. Eichenberger, “Delay defect screening using process monitor structures,” in Proc. VLSI Test Symp., Apr. 2004, pp. 43–48.
[25] S. Chakravarty, Y. Chang, H. Hoang, S. Jayaraman, S. Picano, C. Prunty, E. W. Savage, R. Sheikh, E. N. Tran, and K. Wee, “Experimental evaluation of bridge patterns for a high performance microprocessor,” in Proc. VLSI Test Symp., May 2005, pp. 337–342.
[26] R. Guo, S. Mitra, E. Amyeen, J. Lee, S. Sivaraj, and S. Venkataraman, “Evaluation of test metrics: Stuck-at, bridge coverage estimate and gate exhaustive,” in Proc. VLSI Test Symp., Apr.–May 2006, pp. 66–71.
[27] E. N. Tran, V. Kasulasrinivas, and S. Chakravarty, “Silicon evaluation of logic proximity bridge patterns,” in Proc. VLSI Test Symp., Apr.–May 2006, pp. 78–85.
[28] C. Schuermyer, J. Pangilinan, J. Jahangiri, M. Keim, and J. Rajski, “Silicon evaluation of static alternative fault models,” in Proc. VLSI Test Symp., May 2007, pp. 265–270.
[29] J. Geuzebroek, E. J. Marinissen, A. Majhi, A. Glowatz, and F. Hapke, “Embedded multi-detect ATPG and its effect on the detection of unmodeled defects,” in Proc. Int. Test Conf., Oct. 2007.
[30] S. Eichenberger, J. Geuzebroek, C. Hora, B. Kruseman, and A. Majhi, “Toward a world without test escapes: The use of volume diagnosis to improve test quality,” in Proc. Int. Test Conf., Oct. 2008.
[31] Y.-T. Lin and R. D. Blanton, “Test effectiveness evaluation through analysis of readily-available tester data,” in Proc. Int. Test Conf., Nov. 2009.
[32] S. Venkataraman and W. K. Fuchs, “A deductive technique for diagnosis of bridging faults,” in Proc. Int. Conf. Comput.-Aided Des., Nov. 1997, pp. 562–567.
[33] X. Yu and R. D. Blanton, “Multiple defect diagnosis using no assumptions on failing pattern characteristics,” in Proc. Des. Automat. Conf., Jun. 2008, pp. 361–366.
[34] X. Yu and R. D. Blanton, “An effective and flexible multiple defect diagnosis methodology using error propagation analysis,” in Proc. Int. Test Conf., Oct. 2008.
[35] L. M. Huisman, “Diagnosing arbitrary defects in logic designs using single location at a time (SLAT),” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 23, no. 1, pp. 91–101, Jan. 2004.
[36] W. Maly and J. Deszczka, “Yield estimation model for VLSI artwork evaluation,” Electron. Lett., vol. 19, no. 6, pp. 226–227, Mar. 1983.
[37] R. Desineni, O. Poku, and R. D. Blanton, “A logic diagnosis methodology for improved localization and extraction of accurate defect behavior,” in Proc. Int. Test Conf., Oct. 2006.
[38] X. Yu, Y.-T. Lin, W. C. Tam, O. Poku, and R. D. Blanton, “Controlling DPPM through volume diagnosis,” in Proc. VLSI Test Symp., May 2009, pp. 134–139.
[39] A. Jain, “Arbitrary defects: Modeling and applications,” Masters thesis, Graduate School, Rutgers Univ., New Brunswick, NJ, Oct. 1999.
[40] The TetraMAX Reference Manual, Synopsys, Inc., Mountain View, CA [Online]. Available: http://www.synopsys.com
[41] X. Yu and R. D. Blanton, “Estimating defect-type distributions through volume diagnosis and defect behavior attribution,” in Proc. Int. Test Conf., Nov. 2010.
[42] The Encounter Diagnostics Reference Manual, Cadence Design Systems, Inc., San Jose, CA [Online]. Available: http://www.cadence.com

R. D. (Shawn) Blanton (S’93–M’95–SM’03–F’09) received the Bachelors degree in engineering from Calvin College, Grand Rapids, MI, in 1987, the Masters degree in electrical engineering from the University of Arizona, Tucson, in 1989, and the Ph.D. degree in computer science and engineering from the University of Michigan, Ann Arbor, in 1995. He is currently a Professor with the Department of Electrical and Computer Engineering at Carnegie Mellon University, Pittsburgh, PA, where he is also the Director of the Center for Silicon System Implementation (CSSI), an organization consisting of 18 faculty members and over 80 graduate students focused on the design and manufacture of silicon-based systems.