A Combinatorial Group Testing Method for FPGA Fault Location
Ronald F. DeMara, Carthik A. Sharma
University of Central Florida

Introduction
• Field Programmable Gate Arrays
  - Gate-array-based reconfigurable architecture
  - Matrix of Logic Cells (Look-Up Tables) surrounded by peripheral I/O cells
• Capabilities
  - Runtime reconfiguration
  - On-chip processor core & millions of gate-equivalent logic elements
• Millions of FPGA devices produced annually: most SRAM-based
• Used in mission-critical applications
  - Remote systems & hazardous environments
  - Space applications: satellites, probes, and shuttles

Group Testing Algorithms
• Origin: World War II blood testing
  - Problem: Test samples from millions of new recruits
  - Solution: Test blocks of samples before testing individual samples
• Problem Definition
  - Identify subset Q of defectives from set P
  - Minimize number of tests
  - Test v-subsets of P
  - Form suitable blocks

Previous Work
• Pre-compiled Column-Based Dual FPGA architecture [Mitra04]
  - Autonomous detection, repair by shifting pre-compiled columns
  - Isolation using distributed CED-checkers and “blind” reconfiguration attempts
• Overview of Combinatorial Group Testing and Applications [Du00]
  - Provides taxonomy and general algorithms for applying CGT
  - Examples of CGT applications: DNA clone library filtering, vaccine screening, computer fault diagnosis, etc.
• CGT Enhanced Circuit Diagnosis [Kahng04]
  - Presents doubling, halving, etc. for circuit fault diagnosis using BIST, CGT
  - Requires ability to test resources individually
• Chinese Remainder Sieve technique [Eppstein05]
  - Efficient non-adaptive and two-stage CGT based on prime-number-driven test formation
  - Improved algorithms for practical problem sizes (n < 1080) with small number of defectives (d < 4)

Fault-Handling Techniques
[Table: fault-handling techniques organized by Device Failure Characteristics (Duration, Target), Approach, and method rows (Detection, Isolation, Diagnosis, Recovery)]
• Duration: Transient (SEU); Permanent (SEL, Oxide Breakdown, Electron Migration, LPD)
• Target: Device Configuration; Processing Datapath
• Approaches and methods named: TMR, BIST Methods, STARS, CED, Supplementary Testbench, CGT-Based Dueling, Repetitive Readback, Bitwise Comparison, Majority Vote, Duplex Output Comparison, Cartesian Intersection, Fast Run-time Location (Repetitive Intersections unnecessary), Invert Bit Value, Ignore Discrepancy, Replicate in Spare Resource, Select Spare Resource, Worst-case Clock Period Dilation, Evolutionary Algorithm using Intrinsic Fitness Evaluation

Isolation Problem Outline
• Objectives
  - Locate faulty logic and/or interconnect resource: a single stuck-at fault model is assumed
  - Online Fault Isolation: device not entirely removed from service
• Features
  - Runtime Reconfiguration: FPGA resources configured dynamically
  - Utilize Runtime Inputs: avoid special test-vectors, improve availability
• Constraints
  - Use pre-designed configurations: defined by target application
  - Subsets under test have constant resource utilization range for a given isolation problem
  - Resource grouping influences fault articulation: resource-mapping and input vector might mask hardware faults
  - Do not use specialized “block designs”
  - Runtime reconfiguration limited to column-swapping
  - “Non-reasonable” algorithm: “tests” may be repeated without gaining new isolation information

Fault Location Using Dueling
• The set of all competing configurations is represented by S.
• Set Ck represents the resources utilized by configuration k.
• Each competing configuration k, 1 ≤ k ≤ |S|, has a unique binary Usage Matrix Uk, 1 ≤ k ≤ p.
• Uk has elements Uk[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, where m and n represent the rows and columns in the device layout respectively.
• Elements Uk[i,j] = 1 denote the usage of resource (i, j) by Ck.
• The History Matrix H, with elements H[i,j], 1 ≤ i ≤ m, 1 ≤ j ≤ n, is an integer matrix used to represent the relative fitness of individual resources.
• H[i,j] provides instantaneous relative fitness values of resources.

Dueling Example
[Figure: History Matrix H[i,j] at t=0 (all zeros) and at t=2, shown alongside Usage Matrices U1 and U2]
• H[i,j] changes after C1 and C2 are loaded
• U1 and U2 are corresponding Usage Matrices
• (3,3) is identified as the faulty resource

Modified Halving
[Flowchart: isolation loop with halving]
• Initiate H Matrix: initially all H[i,j] = 0
• Select & Load Competing Configurations
• Discrepancy? No: Decrement Corresponding H Matrix Elements; Yes: Increment Corresponding H Matrix Elements
• Stasis after n Iterations?
• Unique Max In H?
• Stasis after n iterations: Yes, Swap 50% Suspect Columns
• Unique Max in H: Yes, Return Indices of Faulty Element
• Notes: Fitness Augmentation can be non-linear; Columns can be swapped with any other Columns; Selection Process can be Adaptive

FPGA Arrangement for Dueling
[Figure: SRAM-based FPGA holding an L Half-Configuration and an R Half-Configuration, each with Function Logic and a Discrepancy Check; CONFIGURATION BIT STREAM from OFF-CHIP EEPROM; INPUT DATA, DATA OUTPUT, CONTROL, FEEDBACK; Reconfiguration Algorithm]
(NOTE: a non-volatile memory is already required to boot any SRAM FPGA from cold start ... this is not an additional chip)
Configurations in Population
• C = CL ∪ CR
• CL = subset of left-half configurations
• CR = subset of right-half configurations
• |CL| = |CR| = |C|/2

Isolation Progress without Halving
[Figure: Number of Suspected Faulty Elements (log scale) vs. Number of Iterations, without halving; annotation: temporary stasis in isolation due to insufficient design diversity]
• Resource Utilization = 40%
• Initially |S| = 20,000
• Number of suspected faulty elements constant at 36 after 23 iterations
• No subsequent improvement due to lack of differentiating information between competing configurations

Dueling with Modified Halving
[Figure: Number of Suspected Faulty Elements (log scale) vs. Number of Iterations, dueling with halving; annotation: symptoms of stasis invoke halving procedure for fast isolation]
• Halving works by swapping half the used columns with unused ones
• Halving progressively reduces the size of the set of suspected faulty elements
• Isolation proceeds till a single faulty element is isolated
• Fault isolated after 19 iterations

Effect of Total Number of Elements
[Figure: Average Number of Iterations for Fault Isolation vs. Number of Rows and Columns in Device; Population Size = 40, Resource Utilization = 50%; annotation: Increased Problem Size]
• Number of Elements = (Number of Rows x Number of Columns)
• As the size of the array containing
the fault increases, the increase in the required number of iterations is minimal
• For 1 million elements, only 27.4 iterations required.

Effect of Population Size
[Figure: Average Number of Iterations for Fault Isolation vs. Population Size; Resource Utilization = 50%, Number of Resources = 40000; annotation: increased population size provides minimal added benefit]
• Single fault in S is assumed
• As population size increases, isolation expected to be faster
• Increased population size implies more initial designs
• A population size of 30 seems to be an ideal tradeoff between ease of isolation and the difficulty of generating an increased number of individuals.

Effect of Resource Utilization
[Figure: Average Number of Iterations for Fault Isolation vs. Resource Utilization (%); curves for Population Size = 20 and Population Size = 40; Number of Resources = 40000]
• Moderate resource utilization ideal for isolation
• Rate of isolation progress low with extreme utilization characteristics
• Isolation takes longer when less than 20% or greater than 80% of the available resources are utilized.
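
The History Matrix bookkeeping from the Fault Location Using Dueling and Modified Halving slides can be sketched in a few lines. This is a minimal illustration, not the authors' simulator: the function names, the random choice of usage matrices, and the assumption that the fault articulates whenever the faulty resource is used are all hypothetical, and column-swap halving is left out.

```python
import random

def update_history(history, usage, discrepant):
    """Increment H for used resources on a discrepancy, decrement otherwise."""
    delta = 1 if discrepant else -1
    for i, row in enumerate(usage):
        for j, used in enumerate(row):
            if used:
                history[i][j] += delta

def unique_max(history):
    """Return the (row, col) holding the unique maximum of H, or None if tied."""
    best = max(max(row) for row in history)
    cells = [(i, j) for i, row in enumerate(history)
             for j, value in enumerate(row) if value == best]
    return cells[0] if len(cells) == 1 else None

def locate_fault(rows, cols, fault, utilization=0.5, max_iters=1000, seed=0):
    """Duel random usage matrices until H has a unique maximum.

    Assumptions (not from the slides): usage is drawn at random per resource,
    and a discrepancy is observed whenever the faulty resource is used.
    """
    rng = random.Random(seed)
    history = [[0] * cols for _ in range(rows)]
    for iteration in range(1, max_iters + 1):
        usage = [[1 if rng.random() < utilization else 0 for _ in range(cols)]
                 for _ in range(rows)]
        discrepant = usage[fault[0]][fault[1]] == 1
        update_history(history, usage, discrepant)
        suspect = unique_max(history)
        if suspect is not None:
            return suspect, iteration
    return None, max_iters

# 0-based indices: (2, 2) corresponds to resource (3,3) in the Dueling Example.
cell, iterations = locate_fault(10, 10, fault=(2, 2))
print(cell, iterations)
```

Under these assumptions the faulty resource never scores below any other resource, so a unique maximum in H can only be the fault. Stasis shows up as a tie that random selection is slow to break, which is the situation halving is meant to address.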
Future Work
• Conducting Tests using Benchmark Circuits
  - ISCAS89 s38584 with 11448 gates: sequential logic
  - ISCAS85 circuits with max 3513 gates: combinational logic
  - Compression/Signal Processing algorithms, such as the Lempel-Ziv (LZ) compression scheme [Mitra04]
• Development of an architecture to enable column-swapping
  - Multi-layer Runtime Reconfigurable Architecture (MRRA) being prototyped

Backup Slides
• On following pages …

Online Dueling Evaluation
• Objective
  - Isolate faults by successive intersection between sets of FPGA resources used by configurations
  - Analyze complexity of Isolation process
• Variables
  - Total resources available: measured in number of LUTs
  - Number of Competing Configurations: number of initial “Seed” designs in CRR process
  - Degree of Articulation: some inputs may not manifest faults, even if faulty resource used by individual
  - Resource Utilization Factor: percentage of FPGA resources required by target application/design
  - Number of Iterations for Isolation: measure of complexity and time involved in isolating fault

Discrepancy Mirror Circuit Fault Coverage
Component          | Fault Scenario 1 | Fault Scenario 2 | Fault Scenario 3    | Fault Scenario 4    | Fault-Free
Function Output A  | Fault            | Correct          | Correct             | Correct             | Correct
Function Output B  | Correct          | Fault            | Correct             | Correct             | Correct
XNOR A             | Disagree (0)     | Disagree (0)     | Fault: Disagree (0) | Agree (1)           | Agree (1)
XNOR B             | Disagree (0)     | Disagree (0)     | Agree (1)           | Fault: Disagree (0) | Agree (1)
Buffer A           | 0                | 0                | High-Z              | 0                   | 1
Buffer B           | 0                | 0                | 0                   | High-Z              | 1
Match Output       | 0                | 0                | 0                   | 0                   | 1

Influence of LUT utilization
• Perpetually Articulating Inputs with Equiprobable Distribution
  - expected number of pairings grows sub-linearly in number of resources
  - utilization below 20% or above 80% implicates (or exonerates) a smaller sub-set of resources
  - at 50% utilization, the expected number of pairings for 1,000, 10,000, and 100,000 resources is 11.1, 14.9, and 17.6, respectively
• Intermittently Articulating Inputs with Equiprobable Distribution
  - at 90% utilization, a mean value of 258 pairings is required to isolate the faulty resource.
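
The dependence of the number of pairings on utilization can be explored with a small Monte Carlo sketch. It is a simplified model rather than the evaluation behind the slide figures: the function name `mean_pairings`, the independent random usage per resource, and the perpetually articulating fault are assumptions made for illustration. A healthy resource stays suspect only while its usage matches the faulty resource's usage in every pairing seen so far.

```python
import random

def mean_pairings(num_resources, utilization, trials=200, seed=1):
    """Estimate the mean number of pairings needed to isolate one fault."""
    if not 0.0 < utilization < 1.0:
        raise ValueError("utilization must be strictly between 0 and 1")
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        others = num_resources - 1  # healthy resources still under suspicion
        pairings = 0
        while others > 0:
            pairings += 1
            fault_used = rng.random() < utilization
            # A healthy suspect survives only if its usage matches the
            # faulty resource's usage in this pairing.
            keep = utilization if fault_used else 1.0 - utilization
            others = sum(1 for _ in range(others) if rng.random() < keep)
        total += pairings
    return total / trials

for u in (0.2, 0.5, 0.9):
    print(u, mean_pairings(1000, u))
```

In this model a pairing separates a given healthy resource from the fault with probability 2u(1-u), which peaks at 50% utilization; that is one way to read why extreme utilization slows isolation.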
Accommodating Multi-bit Word Widths
• Proof of concept
  - The present circuit works efficiently
  - Demonstrates important Dueling-enabled isolation method
• Strategies
  - Use an array of detectors: attempt to minimize points of failure as word-width increases; number of logic resources used is acceptable for smaller circuits
  - Create new circuit or scheme, combining fault-tolerant coding-based methods with single-fault secure circuit
  - Current research focused on improving detector by investigating codes and fault-secure circuits

Pull-down Resistor Considerations
• Proof of concept
  - The present circuit works in a verifiably correct manner
  - Can utilize synthesized (digital) pull-down resistors which simulate the behavior of analog resistors
  - Demonstrates Dueling-enabled isolation method
  - Can be utilized without implementation problems for Custom-VLSI designs
• Alternative Approach
  - Alternate detector circuits for FPGA implementation are under investigation
  - Avoid using tri-state buffers and pull-down resistors; use native digital components available on FPGAs

Competitive Runtime Reconfiguration (CRR)
• Evolutionary Computation strategies effective for more than just repair phase: continually detect, rank, and isolate faults entirely within the underlying data throughput flow
[Flowchart: CRR procedural flow, PRIMARY LOOP runs while L=R (discrepancy free)]
• Initialization: Population partitioned into functionally-identical yet physically-distinct half-configurations
• Selection: choose FPGA configuration(s) labeled L and R
• Detection: apply functional inputs to compute FPGA outputs using L, R
• Fitness Adjustment: update fitness of only L and R based on detection results
• Genetic Operators: invoke only once
• Annotations: fault detection by robust consensus over time; no test vectors; graceful degradation via ranking of alternatives; online during repair; completely-repaired criteria can be ignored
• Adjust Controls: detection mode, overlap interval, ...
• Genetic Operators invoked only on L or R
• Decision: is either L's or R's fitness < Repair Threshold?
• Performance readily adjustable; no reconfiguration when fault-free; device remains online

Conceptual Innovation
• Novel fitness assessment via pairwise discrepancy without any pre-conceived oracle for correctness (emergent behavior)
• Failures in population memory covered
• Fault isolation is model-free and self-calibrating
• Checking logic is part of individual, hence also competes for correctness
• Diverse alternatives working a priori
[Figure: SRAM-based FPGA arrangement with L and R Half-Configurations, as in FPGA Arrangement for Dueling]

Configuration Health States
[Figure: state transitions during lifetime of the ith Half-Configuration. States: primordial, pristine, suspect, under repair, refurbished. Transitions 1 to 11 are labeled with detection outcomes (L=R or L≠R) and comparisons of fitness f_i against thresholds f_OT and f_RT; annotations: complete repair, partial repair, COMPETITION integral with EVOLUTION]

Discrepancy Operator
• Baseline Discrepancy Operator is dyadic operator with binary output:
  $C_i^L \otimes C_i^R = 0$ if $Z(C_i^L) = Z(C_i^R)$, and $1$ otherwise
• Z(Ci) is FPGA data throughput output of configuration Ci
• WTA (Equivalence): $\otimes_{WTA} = \bigvee_j \left( C_{i,j}^L \ \mathrm{EOR}\ C_{i,j}^R \right)$
• RS (Hamming Distance): $\otimes_{RS} = \sum_j \left( C_{i,j}^L \ \mathrm{EOR}\ C_{i,j}^R \right)$

Procedural Flow under Consensus-Based Evaluation
[Flowchart: procedural flow under Consensus-Based Evaluation]
• Initialization: Population partitioned into functionally-identical yet physically-distinct half-configurations
• Decision: is either L's or R's fitness < Repair Threshold?
• Selection: choose FPGA configuration(s) labeled L and R
• Detection: apply functional inputs to compute FPGA outputs using L, R (L=R: discrepancy free)
• Fitness Adjustment: update fitness of only L and R based on detection results
• PRIMARY LOOP branches: NO / YES; on YES, invoke Genetic Operators only once, and only on L or R
• Adjust Controls: detection mode, overlap interval, ...

Initialization, Consensus-Based Evaluation, Regeneration
• Initialization: Partition P into sub-populations CL and CR of size |P|/2 to designate physical FPGA resource utilization (left-half or right-half)
• Consensus-Based Evaluation: based on Discrepancy Operator; Four Fitness States: Pristine, Suspect, Under Repair, Refurbished
• Regeneration: Genetic Operators to recover; Reintroduction Rate; Operators only applied once, then offspring returned to “service” without concern about increasing fitness

GA Parameters & Experiments
• GA parameters
  - Population size: 20 individuals
  - Crossover rate: 5%
  - Mutation rate: up to 80% per bit
• GA operators
  - External-Module-Crossover
  - Internal-Module-Crossover
  - Internal-Module-Mutation
• Speciation
  - Two-point crossover between individuals from same sub-group
  - Crossover points chosen to prevent intra-CLB crossover
  - Breeding occurs exclusively among members of sub-populations
  - Maintains non-interfering resource use among L, R
• Experiments …
  - Fault Isolation Characteristics
  - Regenerative Experiments
• Demonstrate …
  - Objective fitness function replaced by the Consensus-based Evaluation Approach and Relative Fitness
  - Elimination of additional test vectors

Impact of Fault on Viable Individuals
• Existence of Positive Test Vector
  - Input Ip comprises a positive test vector iff $C_v(I_p) \otimes C_f(I_p) = 1$, where Cv denotes a viable configuration and Cf denotes a faulty configuration
  - So if a discrepancy is visible then some Ip exists which manifests the fault
• Minimal Case when Ip is Unique
  - Ip is unique if fault is observable under exactly one test vector
• Probability Mass Function for Encountering Ip in Minimal Case
  - Consider Ew=600 yielding 99.5% coverage for a module with input space W=64
  - The number of input occurrences, 0 ≤ i ≤ 600, that randomly encounter Ip to identify the fault is governed by the probability mass function:
    $\mathrm{p.m.f.}(i) = \binom{D}{i} \left(\frac{n}{W}\right)^{i} \left(1 - \frac{n}{W}\right)^{D-i}$, with $D = 600$, $W = 64$, $n = 1$, $0 \le i \le 600$
    where D is the length of Ew

Isolation of a single faulty individual with 1-out-of-64 impact
• Outliers are identified after EW iterations have elapsed
• Expected D.V. = (1/64)*600 = 9.375 from individual impacted by fault
• Isolated individual’s DV differs from the average DV by 3 after 1 or more observation intervals of length EW

Isolation of a single faulty L individual with 10-out-of-64 impact
• Compare with 1-out-of-64 fault impact
• Expected DV of (10/64)*600 = 93.75 for faulty configuration
• One isolation will be complete approx. once in every 93.75/5 = 19 Sliding Windows
• Fault Isolation achieved is 100%

Isolation of 8 faulty individuals L4&R4 with 1-out-of-64 impact
• Expected isolations do not occur approx. 40% of the time
  - Average discrepancy value of the population is higher
  - Outlier isolation difficult
  - Multiple faulty individuals, discrepancies scattered

Regeneration Performance
3x3 Multiplier
Experiment Number | Fault Location   | Failure Type | Correctness after Fault | Total Iterations | Discrepant Iterations | Repair Iterations | Final Correctness | Effective Throughput (%)
1                 | CLB3,LUT0,Input1 | Stuck-at-1   | 52 / 64                 | 17920100         | 421123                | 1194              | 64 / 64           | 97.65
2                 | CLB6,LUT0,Input1 | Stuck-at-0   | 33 / 64                 | 802050           | 17034                 | 47                | 64 / 64           | 97.87
3                 | CLB5,LUT2,Input0 | Stuck-at-1   | 22 / 64                 | 3134660          | 68027                 | 193               | 64 / 64           | 97.83
4                 | CLB7,LUT2,Input0 | Stuck-at-0   | 38 / 64                 | 8158280          | 185193                | 513               | 64 / 64           | 97.73
5                 | CLB9,LUT0,Input1 | Stuck-at-0   | 40 / 64                 | 2332670          | 71613                 | 219               | 64 / 64           | 96.93
Average           |                  |              | 32.6 / 64               | 6469550          | 152598                | 433               | 64 / 64           | 97.6

Parameters:
• Difference (vs. Hamming Distance)
• Evaluation Window: Ew = 600
• Suspect Threshold: DVS = 1-6/600 = 99%
• Repair Threshold: DVR = 1-4/600 = 99.3%
• Re-introduction rate: r = 0.1
Repairs evolved in-situ, in real-time, without additional test vectors, while allowing device to remain partially online.
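
The expected discrepancy values quoted in the isolation slides, (1/64)*600 and (10/64)*600, can be checked numerically. The sketch below assumes the p.m.f. for encountering Ip is a binomial distribution over one evaluation window, with success probability equal to the fraction of the input space that manifests the fault; the function and parameter names (`encounter_pmf`, `expected_dv`, `window`, `input_space`, `impact`) are hypothetical.

```python
from math import comb

def encounter_pmf(i, window=600, input_space=64, impact=1):
    """Binomial probability of exactly i fault-revealing inputs in one window."""
    p = impact / input_space
    return comb(window, i) * p**i * (1.0 - p)**(window - i)

def expected_dv(window=600, input_space=64, impact=1):
    """Expected discrepancy value over one window: window * impact / input_space."""
    return window * impact / input_space

print(expected_dv())           # 1-out-of-64 impact: 9.375
print(expected_dv(impact=10))  # 10-out-of-64 impact: 93.75

# The mean of the assumed p.m.f. should agree with the closed form.
mean_from_pmf = sum(i * encounter_pmf(i) for i in range(601))
print(mean_from_pmf)
```

Summing `i * encounter_pmf(i)` over the window gives the same mean as the closed form, which is a quick consistency check on the binomial assumption.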
Multilayer Runtime Reconfiguration Architecture (MRRA)
[Figure: block diagram with Fault-Repair Genetic Algorithm, Control System, Microprocessor, Reconfiguration Engine, System Bus, Virtex-II Pro FPGA, and RAM]
• Develop MRRA fast reconfiguration paradigm for the CRR approach
• Validate with real hardware platform along with detailed performance analysis
• First general-purpose framework for a wide variety of applications requiring dynamic reconfiguration
• Extend existing theories on reconfiguration

Loosely Coupled Solution
[Figure: Avnet FPGA Development Board carrying the Virtex-II Pro FPGA and Off-Chip RAM, connected through a PCI Interface to control hosted on PC; signals: Input Data, FPGA Output, Bit file]
• The entire system operates on a 32-bit basis
• The Virtex-II Pro is mounted on a development board which can then be interfaced with a workstation running Xilinx EDK and ISE.

For further info …
EH Website http://cal.ucf.edu