Built-in Self Repair

Computer Engineering
Self Repair Technology for Logic Circuits: Architecture, Overhead and Limitations
Heinrich T. Vierhaus
BTU Cottbus, Computer Engineering Group
CREDES / ZUSYS / DAAD Summer School 2011, Tallinn

Outline
1. Introduction: Nano Structure Problems
2. The Problem of Wear-Out
3. Repair for Memory and FPGAs
4. Basic Logic Repair Strategies & Structures
5. Test and Repair Administration
6. De-Stressing Strategies
7. Cost, Overhead, Single Points of Failure
8. Summary and Conclusions

1. Introduction
A bunch of new problems from nano structures ...

Nanoelectronic Problems
Lithography: The wavelength used to "map" structural information from masks to wafers is larger (up to 4 times or more) than the minimum structural features (193 nm versus 90 / 65 / 45 nm). Layouts must therefore be adapted to correct mapping faults.
Statistical parameter variations: The number of dopant atoms in MOS transistor channels becomes so small that statistical variations of the doping density affect device parameters such as threshold voltages.

New Problems with Nano-Technologies
[Figure: a light source with 193 nm wavelength exposes the resist on a wafer through a mask (reticle); the exposed resist has feature sizes down to 28 nm.]

Layout Correction
A modified layout compensates for mapping faults.
Compensation is critical and non-ideal.
Faults are not random but correlated!
Requires fast fault diagnosis.

Doping Fluctuations in MOS Transistors
[Figure: MOS transistor with poly-Si gate and n-doped regions in a p-substrate, with individual doping atoms in the channel.]
The density and distribution of doping atoms cause shifts in transistor threshold voltages!
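To put a number on the doping-fluctuation argument, here is a minimal sketch assuming that dopant atoms are placed independently (Poisson statistics): a channel region expected to hold N dopants then shows a standard deviation of sqrt(N), i.e. a relative spread of 1/sqrt(N). The doping level, the square channel and the 10 nm depth are hypothetical illustration values, not figures from the slides, and the dopant count is only a proxy for Vth, which also depends on where the atoms sit.

```python
import math


def mean_dopant_count(doping_cm3: float, length_nm: float, width_nm: float, depth_nm: float) -> float:
    """Expected number of dopant atoms in a box-shaped channel region."""
    nm_to_cm = 1e-7
    volume_cm3 = (length_nm * nm_to_cm) * (width_nm * nm_to_cm) * (depth_nm * nm_to_cm)
    return doping_cm3 * volume_cm3


def relative_fluctuation(mean_count: float) -> float:
    """Poisson statistics: standard deviation sqrt(N), so the relative spread is 1/sqrt(N)."""
    return 1.0 / math.sqrt(mean_count)


if __name__ == "__main__":
    # Hypothetical values: 1e18 cm^-3 channel doping, square channel, 10 nm deep region.
    for feature_nm in (90, 65, 45):
        n = mean_dopant_count(1e18, feature_nm, feature_nm, 10)
        print(f"{feature_nm} nm: ~{n:.0f} dopants, relative spread {relative_fluctuation(n):.1%}")
```

Under these assumptions, shrinking from 90 nm to 45 nm cuts the expected count from about 81 to about 20 dopants and doubles the relative spread from about 11 % to about 22 %.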
Nanostructure Problems
Individual device characteristics such as Vth depend more strongly on statistical variations of underlying physical features such as doping profiles. Primary relevance: yield.
A significant share of basic devices will be "out of specs" and must be replaced by backup elements after production to improve yield. Primary relevance: yield.
Smaller features mean higher stress (field strength, current density) and also foster new mechanisms of early wear-out. Primary relevance: lifetime.
Transient errors, e.g. from charged particles that can discharge circuit nodes, must be recognized and compensated "in time". Primary relevance: dependability.

Fault Tolerant Computing
[Figure: levels at which a fault event can be handled.]
Software-based fault detection & compensation (specific): works only for transient faults!
HW logic & RT-level detection & compensation (universal): typically works for transient and permanent faults!
Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!

2. Wear-Out Problems and Mechanisms
Structures on ICs used to outlive both their application and even their users. Not any more ...

IC Structures May Get Tired
"Wear-out" effects in nano-electronic ICs are likely to appear much earlier, causing a lot of problems for dependable long-term applications!

Fault Effects on ICs
[Figure: IC cross-section (metal 1-3, vias, polyimide low-k dielectric, field oxide, high-k gate oxide, n/p diffusions, n-well) annotated with metal migration, low-k insulator deterioration, and transistor deterioration (HCI, NBTI), eventually leading to gate oxide shorts!]
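The yield argument (out-of-spec devices replaced by backup elements) can be quantified with a simple binomial model. This sketch assumes independent faults with one common probability and that every spare can stand in for any element, as in memory row or column repair; the Layout Correction slide warns that real faults are correlated, so the numbers are optimistic. The function name `array_yield` is made up for illustration.

```python
from math import comb


def array_yield(n_required: int, n_spares: int, p_fail: float) -> float:
    """Probability that at most n_spares of all n_required + n_spares elements are faulty.

    Assumes independent faults with equal probability p_fail and that any spare
    can replace any failing element (a hypothetical, idealized repair model).
    """
    total = n_required + n_spares
    return sum(
        comb(total, i) * p_fail**i * (1.0 - p_fail) ** (total - i)
        for i in range(n_spares + 1)
    )


if __name__ == "__main__":
    # Illustrative numbers only: 1000 required elements, 0.1 % fault probability each.
    for spares in (0, 1, 2, 4, 8):
        print(spares, round(array_yield(1000, spares, 0.001), 4))
```

Without spares, the whole array must be fault-free, so yield collapses as soon as element count times fault probability approaches 1; a few spares recover most of it.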
Wear-Out Mechanisms
Metal migration: Metal atoms (Al, Cu) tend to migrate under high current density and high temperature.
Stress migration: Migration effects may be enhanced under mechanical stress conditions.
Effect: Metal lines and vias may eventually be interrupted. The effect is partly reversible by changing current directions.

Metal Migration
[Figure: metal wire under high current density between neighbor wires; after some time in operation, voids (holes) cause an open defect, and displaced metal can cause a short to a neighbor.]
Vias are especially prone to such defects.
The effect can be partly reversed by reversing the direction of current flow!

Transistor Degradation
Negative Bias Temperature Instability (NBTI): reduced switching speed of p-channel MOS transistors that have operated under constant negative gate bias for a long time. The effect is partly reversible.
Hot Carrier Injection (HCI): reduced switching speed of n-channel MOS transistors, induced by positive gate bias and frequent switching. Not reversible.
Gate oxide deterioration: induced by high field strength. Not reversible.
Dielectric breakdown: insulating layers between metal lines may break down, causing shorts between signal lines.
Design technology must include a prospective "lifetime budget"!!

Management of Wear-Out by "Fault Tolerant Computing"?
Built-in fault tolerance and error compensation are needed in nanotechnologies anyway, for the management of transient faults.
Wear-out-induced faults may first show up as "intermittent" faults, which become more and more frequent.
Faults in synchronous circuits and systems are detected clock cycle by clock cycle. Hence, for many types of fault-tolerant architecture, the detection does not even recognize whether a fault is permanent or not.
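A per-cycle detector only answers "error in this cycle or not". The sketch below is a hypothetical monitor (`FaultHistory`, the window size and the run threshold are illustrative assumptions) that adds the history such detectors lack: a run-length counter marks a fault as permanent once it recurs in several successive cycles, the definition used on the "Can TMR and Codes Compensate Permanent Faults?" slide, and a sliding window tracks how often the flag fires, so intermittent wear-out faults that become more frequent show up as a rising error rate.

```python
from collections import deque


class FaultHistory:
    """Hypothetical monitor layered on top of a per-clock-cycle error flag."""

    def __init__(self, window: int = 16, run_threshold: int = 4):
        self.recent = deque(maxlen=window)  # last `window` error flags
        self.run = 0                        # consecutive cycles with an error
        self.run_threshold = run_threshold

    def update(self, error_flag: bool) -> str:
        """Feed one cycle's flag and return a (heuristic) classification."""
        self.recent.append(error_flag)
        self.run = self.run + 1 if error_flag else 0
        if self.run >= self.run_threshold:
            return "permanent"  # recurs in several successive cycles
        if not error_flag:
            return "ok"
        # Error now, but not yet persistent: was it seen before in the window?
        return "intermittent" if sum(self.recent) > 1 else "isolated"

    def error_rate(self) -> float:
        """Fraction of cycles in the window that flagged an error."""
        return sum(self.recent) / len(self.recent) if self.recent else 0.0


if __name__ == "__main__":
    history = FaultHistory()
    flags = [False, True, False, False, False, True, False, True, True, True, True]
    print([history.update(f) for f in flags])
    print(f"error rate: {history.error_rate():.2f}")
```

For the flag sequence above, the first error is classified as isolated, the later ones as intermittent, and the fourth consecutive error as permanent.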
Triple Modular Redundancy
[Figure: the input signal feeds execution units 1, 2 and 3; a comparator / voter delivers the majority result and an error-detect signal.]
Can detect and compensate almost any type of fault.
Overhead of about 200-300 %, plus additional signal delays.
The voter itself is not covered and must be a "self-checking checker".
Standard (by law) in avionics applications!

Error Detecting / Correcting Codes
[Figure: data are stored or transmitted together with a signature; signature comparison provides fault detection and error correction.]
Often applicable to 1- or 2-bit faults only.
Often limited to certain fault models (e.g. uni-directional faults).
Becomes expensive if applied to computational units.

Can TMR and Codes Compensate Permanent Faults?
Fault / error detection circuitry typically works on a clock-cycle basis. It does not "know" whether a fault is transient or permanent.
A permanent fault is a fault event that occurs repeatedly in several to many successive clock cycles.
Error correction technology can detect and compensate such permanent faults as well as transient faults.
A critical condition occurs if transient faults occur on top of permanent faults: the superposition of fault effects is then likely to exceed the system's fault handling capacity.
System components that actively run "in parallel" suffer from the same wear-out effects. Therefore dependability increases before the wear-out limit is reached, but there is no significant lifetime extension!

Redundancy and Wear-Out
During the normal lifetime of the system, duplication or triplication can enhance reliability significantly. But area and power consumption are also roughly tripled.
And by the end of the normal operating time ("out of fuel / steam"), all three systems will fail shortly one after the other!! Reliability enhancement is not the same as lifetime extension!!

Self Repair?
Software-based fault detection & compensation (specific): works only for transient faults!
HW logic & RT-level detection & compensation (universal): typically works for transient and permanent faults!
Transistor- and switch-level compensation (very specific): typically works for specific types of transient faults only!
Missing so far: self repair for permanent faults!

3. Repair for Memory and FPGAs
Compensation of transient faults is not enough. Some technologies for transient compensation can handle permanent faults, too, but not in the long run and not together with additional transient faults!

Memory Test & Repair
[Figure: memory array with line address decoder, read / write lines, columns and a spare column.]

Memory Test & Repair (2)
[Figure: the same memory array with a memory BIST controller that switches in the spare column.]
... is already state of the art!
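The claim that reliability enhancement is not lifetime extension follows from the standard 2-out-of-3 formula for TMR with a perfect voter, R_TMR = 3R^2 - 2R^3, which exceeds the single-unit reliability R only while R > 0.5. The sketch below assumes Weibull-distributed unit lifetimes (shape `beta` > 1 to mimic wear-out); `eta` and `beta` are hypothetical values, and the voter is assumed fault-free, which the TMR slide points out it is not.

```python
import math


def r_unit(t: float, eta: float, beta: float) -> float:
    """Weibull survival probability of one unit; beta > 1 models wear-out (assumed parameters)."""
    return math.exp(-((t / eta) ** beta))


def r_tmr(t: float, eta: float, beta: float) -> float:
    """TMR with an ideal voter works while at least 2 of 3 identical units work."""
    r = r_unit(t, eta, beta)
    return 3 * r**2 - 2 * r**3


def crossover_time(eta: float, beta: float) -> float:
    """Time at which r_unit = 0.5; after this point TMR is LESS reliable than one unit."""
    return eta * math.log(2) ** (1.0 / beta)


if __name__ == "__main__":
    eta, beta = 10.0, 4.0  # hypothetical: characteristic life 10 years, strong wear-out
    for t in (2.0, 5.0, 8.0, crossover_time(eta, beta), 11.0, 13.0):
        print(f"t={t:5.2f}  single={r_unit(t, eta, beta):.3f}  TMR={r_tmr(t, eta, beta):.3f}")
```

For beta = 1 (constant failure rate lambda = 1/eta) the same model gives the textbook result that the mean time to failure of TMR is 5/(6 lambda), below the 1/lambda of a single unit. With larger beta, the three units fail shortly one after the other around the crossover time.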
FPGA-based Self Repair
[Figure]

In-System FPGA Repair
[Figure]

Repair Mechanism: Row/Line Shift
[Figure: CLB array with rows of occupied CLBs, a row containing a faulty CLB and a reserve row; functions are shifted row by row towards the reserve row.]
Little overhead for the re-configuration process.
Loss of many "good" CLBs for every fault.

Distributed Backup CLBs
[Figure: CLB array with a faulty CLB, functionally occupied CLBs, a selected non-occupied replacement CLB and reserve CLBs.]
Minimum loss of functional CLBs.
High effort for re-wiring, which requires massive "embedded" computing power (32-bit CPU, 500 MHz).

Self Repair within FPGA Basic Blocks
Heterogeneous repair strategies are required (memory, logic).
Logic blocks may use methods known from memory BISR.
Additional repair strategies are necessary for logic elements.
The basic overhead of FPGAs versus standard logic (about a factor of 10) increases further.
Repair strategies for logic may use some features already present in FPGAs (e.g. switched interconnects).

Structure of a CLB Slice
[Figure: CLB slice with program, FF, logic and SRAM inputs; a logic field with a redundant row, selected by multiplexers, drives the logic, SRAM and FF outputs.]

FPGAs for a Solution?
The granularity of re-configurable logic blocks (CLBs) in most FPGAs is in the order of several thousand transistors.
For fault densities between 0.01 % and 0.1 %, replacement must take place at a granularity of blocks of about 100-500 transistors.
Efficient FPGA repair requires detailed fault diagnosis plus specific repair schemes, which cannot be kept as pre-computed reconfiguration schemes.
Computing specific repair schemes requires "in-system EDA" (re-placement and routing) with a massive demand for computing power.
There is no source of such "always available" computing power.

Self-Repairing FPGA?
[Figure: reconfigurable logic (CLBs with wiring blocks, WB) together with a virtual CPU, configuration memory, new-configuration memory and a CLB configuration scheme / program.]

Advanced FPGA Structures
[Figure: CLB / WB array with embedded CPUs, ALUs and multipliers.]
... are only partly re-configurable, for performance reasons!

FPGA / CPLD Repair
Looks pretty easy at first glance because of the regular architecture!
Requires lines / columns of switches for configuration at the inputs and between the AND / OR matrices.
Requires additional programmability of cross-points by double-gate transistors as in EEPROM or Flash memory:
Not fully compatible with standard CMOS.
Limited number of (re-)configurations.
Floating-gate (FAMOS) transistors are fault-sensitive!

4. Basic Logic Repair Strategies
Repair techniques that replace failing building blocks by redundant elements from a "silent" reserve are not new. IBM has been selling such computer systems, specifically for banking applications, for decades. But always with few (2-10) backup elements (CPUs), assuming a small number of failures (< 10) within years.

Mainframes ...
will often contain "redundant" CPUs for eventual fault compensation. But one faulty transistor then "costs" a whole CPU, limiting fault handling to a few (about 10) permanent fault cases.

Granularity of Replacement
[Chart: expected fault density (1 out of ...) versus granularity of replacement in transistors, from single transistors (10^0) and gates (10^1) via FPGA macro blocks (10^2) to cores (10^4) and CPUs (10^5); marked regions: block-level replacement (e.g. FPGAs), core replacement (e.g. CPU), and replacement for logic, which is hardly explored.]

Repair Overhead versus Element Loss
[Chart: repair procedure overhead and functioning elements lost versus size of replaced blocks (granularity, 1 to 10M transistors); very small blocks lead to prohibitive overhead, very large blocks to a prohibitive fault density, and new methods and architectures are needed in between.]

Built-in Self Repair (BISR)
BISR is well understood for highly regular structures such as embedded memory blocks.
BISR essentially depends on built-in self test (BIST) with high diagnostic resolution.
Steps: fault detection, fault diagnosis, fault isolation, redundancy allocation, fault / redundancy management.
Redundancy management must monitor faults, replacements and the available redundancy, and must also re-establish a "working" system state after power-down.

Levels of Repair
Transistor / switch level: replace transistors or transistor groups. Losses by reconfiguration (switched-off "good" devices): potentially small (20-50 %) for transistor faults. Overhead for test and diagnosis: very high.
Repair overhead will dominate reliability!
Gate level: replace gates or logic cells.
Losses by reconfiguration: medium (60 to 90 %) for single transistor faults. Overhead for test and diagnosis: high.
Macro-block level: replace functional macros (ALU, FPU, CPU). Losses by reconfiguration: high, 99 % or more. Overhead for test and diagnosis: maybe acceptable.

The Fault Isolation Problem
[Figure: a driver feeding load 1 and load 2; load 2 has a gate short.]
GND shorts at gate inputs affect the whole fan-in network and render redundancy useless!!

Block-Level Repair
[Figure: blocks of AND gates, each with a spare element (SE).]
Blocks of logic / RT elements (gates and larger) each contain a redundant element that can replace a faulty unit.

Switching Concept (1)
[Figure: configurations 1 and 2 of a block with functional blocks 1-3 and a replacement block between input and output switches; in each configuration, one block is connected to the test in / test out terminals.]

Switching Concept (2)
[Figure: configurations 3 and 4 of the same block.]

A Regular Switching Scheme
The scheme is regular and scalable by nature, always comprising k functional blocks of the same kind plus 1 additional block for backup.
Building blocks are separated by (pass-)transistor switches at their inputs and outputs, providing full isolation of a faulty block.
There are always 2 additional pass transistors between two functional blocks.
The reconfiguration scheme shifts functionality between blocks in a regular way, which results in a simple administration scheme.
The functional access to the "spare" block can be used for testing purposes.
In any state of (re-)configuration, the potentially "faulty" block is connected to the test input / output terminals.

Overhead Depending on Block Size
Transistors per block of 3 functional elements plus 1 backup:
Basic element | Functional | Backup | Normal switches | Extended switches
2-NAND | 12 | 4 | 18 | 24
2-AND | 18 | 6 | 18 | 24
2-XOR | 18 | 6 | 18 | 24
Half adder | 36 | 12 | 24 | 30
Full adder | 90 | 30 | 30 | 36
For small basic blocks, the switches make up the essential overhead (200 %)!
For larger basic blocks, the overhead can be reduced to about 30-50 % ... not counting test and administration overhead!
Extract larger basic units from seemingly irregular logic netlists!!

Overhead
Transistors per RLB (3 functional units):
Basic block | Functional | Backup | Switches (min. / ext.) | Overhead
2-NAND | 12 | 4 | 18 / 24 | 230 %
2-AND | 18 | 6 | 18 / 24 | 160 %
XOR | 18 | 6 | 18 / 24 | 160 %
Half adder | 36 | 12 | 24 / 30 | 116 %
Full adder | 90 | 30 | 30 / 36 | 73 %
8-bit ALU | 4500 | 1500 | 168 / 224 | 38 %

5. Test and Repair Administration
[Figure: centralized control, where a test generator, a configurator and status memory, a test analyzer and system monitoring serve all RLBs (the central control may be faulty!), versus de-centralized test and control, where each logic RLB has its own BIST and configuration unit.]

Blocks, Switching, Administration
[Figure: local versus remote (re-)configuration of columns of functional units (F-Unit) and redundant units (Red.-Unit) separated by columns of switches, with configuration units, decoders and a global control unit.]

Combining Test and Re-Configuration
[Figure: a test input is applied to the logic under test; its test output is compared with a reference, and the fault detect signal controls the next state of the configuration memory / counter.]
Test and Administration
[Figure: functional blocks 1 ... n and a replacement block between input and output switches; a decoder driven by a state register, plus a test clock and a self-test circuit with fault indicator and fault flag ("fix at fault").]
Testing is done by comparison with reference outputs.
The system is run through its states of re-configuration with the same input test pattern applied.
During test, one functional unit is always removed from normal operation and connected to the test I/Os.
Each of the elements in a block is testable via specific test inputs.
In case of a "fault detect", the system is fixed in its current state.
Such a procedure of self-test and self-reconfiguration can run at every system start-up, avoiding a central "fault memory".

Controller for (Re-)Configuration
[Figure: RLB with functional units f1-f3 between switches, a decoder generating select signals s1-s4, control bits in a scan path, test in / reference comparison, and a BISR controller with clock, reset, test and fault signals.]
Controller of minimum complexity: 80 transistors (3 + 1 configuration).
A controller may drive one or several re-configurable blocks in parallel, depending on their size.

Local Interconnects
The block-based repair scheme described so far cannot cover faults on wires between re-configurable blocks.
For small basic blocks (such as logic gates), most of the wiring lies between re-configurable units and is not covered.
For larger (RT-level) basic blocks, most of the wiring lies within the basic blocks and is covered.
Schemes that also cover inter-block wiring are possible, but require FPGA-like configurable switching and complex switching schemes.
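The overhead columns of the RLB tables can be reproduced with one formula: (backup unit + switch transistors + controller transistors) / (transistors in the 3 functional units). The sketch below uses the transistor counts from the Cost / Overhead table (for example a 2-NAND of 4 transistors with 30 switch transistors and an 81- or 200-transistor controller); the `RLB` class is a made-up helper, and the `rlbs_per_controller` parameter is an illustrative assumption showing why driving several RLBs from one controller, as the controller slide suggests, lowers the administration overhead.

```python
from dataclasses import dataclass


@dataclass
class RLB:
    name: str
    unit: int       # transistors per functional unit
    switches: int   # switch transistors per RLB
    units: int = 3  # functional units per RLB (plus one backup unit)


def overhead_percent(rlb: RLB, controller: float = 0.0, rlbs_per_controller: int = 1) -> float:
    """Extra transistors relative to the functional logic, in percent."""
    functional = rlb.units * rlb.unit
    backup = rlb.unit                                 # one spare of the same kind
    controller_share = controller / rlbs_per_controller
    return 100.0 * (backup + rlb.switches + controller_share) / functional


if __name__ == "__main__":
    # Transistor counts as listed in the Cost / Overhead table.
    blocks = [RLB("2-NAND", 4, 30), RLB("Half adder", 12, 40), RLB("Full adder", 30, 50),
              RLB("2-bit ALU", 352, 140), RLB("4-bit ALU", 699, 180), RLB("8-bit ALU", 1367, 260)]
    for b in blocks:
        plain = overhead_percent(b, controller=81)       # controller without de-stressing
        destress = overhead_percent(b, controller=200)   # controller with de-stressing
        shared = overhead_percent(b, controller=81, rlbs_per_controller=4)  # hypothetical sharing
        print(f"{b.name:10s} {plain:7.1f} % {destress:7.1f} %   shared by 4: {shared:7.1f} %")
```

The first two columns match the table to within rounding. For the larger blocks, switches and controller shrink to a few percent, leaving the 33 % of the backup unit as the floor.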
Essentials of the Repair Scheme
Logic self repair is feasible at a cost below triple modular redundancy (TMR).
There is a trade-off between the size of the reconfigurable logic blocks (RLBs) and the maximum tolerable fault density.
Administration, not redundancy, makes up the critical overhead. Effort can be saved by administrating several RLBs in parallel.
Low-level interconnects between RLBs are the essential "single point of failure" in the repair scheme!

6. De-Stressing
[Chart: component failure rate (10^-4 to 10^-1) versus system lifetime (t1 to t4), comparing the failure curve without de-stressing to the failure curve with de-stressing.]

The Purpose of De-Stressing
Building blocks of the same type in digital systems may be used more or less heavily.
Blocks running with the highest dynamic load and at the highest temperature are candidates for early failure.
Periodically using otherwise "silent" resources to relieve such units from stress may extend the overall lifetime of the system.
The re-configuration scheme developed for repair may also serve this purpose with slight modifications.
... and the scheme must be compatible with repair architectures!

The Scheme of De-Stressing
[Figure: states 0-3 of a block with basic blocks BB1-BB3, a replacement block RB and tasks 1-3 under low, medium and heavy load; in each state, a different block is idle as backup / under test while the tasks are redistributed.]
A better initial distribution of tasks and stress makes for a better re-distribution.
Repair capabilities can be preserved.
But: de-stressing may require re-organisation within an active system, while repair has so far been done off-line!
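The effect of rotating the "silent" block can be illustrated with a toy model. This sketch assumes that stress accumulates as load times time (a hypothetical simplification that ignores temperature and recovery effects). It uses the regular shift assignment of the repair scheme, where functions below the idle block stay and the others move up by one, and it compares a static configuration, where the spare is always idle, with rotating the idle position every epoch. The task loads and function names are made up for illustration.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def assignment(idle: int, n_blocks: int) -> List[int]:
    """Physical block serving each task when block `idle` is relieved (regular shift scheme)."""
    return [t if t < idle else t + 1 for t in range(n_blocks - 1)]


def accumulated_stress(task_loads: Sequence[float], epochs: int, rotate: bool) -> List[float]:
    """Sum of load x time per physical block (a deliberately crude stress model)."""
    n_blocks = len(task_loads) + 1  # k tasks on k functional blocks + 1 spare
    stress = [0.0] * n_blocks
    for epoch in range(epochs):
        idle = epoch % n_blocks if rotate else n_blocks - 1  # static: spare always idle
        for task, block in enumerate(assignment(idle, n_blocks)):
            stress[block] += task_loads[task]
    return stress


def best_initial_order(task_loads: Sequence[float], epochs: int) -> Tuple[float, ...]:
    """Task ordering that minimizes the peak accumulated stress under rotation."""
    return min(permutations(task_loads),
               key=lambda order: max(accumulated_stress(order, epochs, rotate=True)))


if __name__ == "__main__":
    loads = [1.0, 2.0, 3.0]  # hypothetical low / medium / heavy load
    print("static :", accumulated_stress(loads, 4, rotate=False))
    print("rotated:", accumulated_stress(loads, 4, rotate=True))
    best = best_initial_order(loads, 4)
    print("best order", best, "->", accumulated_stress(best, 4, rotate=True))
```

For loads of 1, 2 and 3 over 4 epochs, the static configuration piles 12 units of stress on one block while the spare gets none; rotation lowers the peak to 9, and placing the heavy task in the middle of the block order lowers it further to 8. In this toy case, that matches the remark that a better initial distribution of tasks makes for a better re-distribution.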
Modified Control Scheme
For de-stressing, functions have to be shifted while the system is in "hot" operation.
As long as all building blocks are fully functional, two functional blocks can run in parallel, serving the same inputs and outputs.
With a total of k building blocks (including the spare one), there are k "stable" states of re-configuration (for k = 4: 1 normal, 3 repairs) and (k-1) intermediate states for "handover" in case of de-stressing.
No extra switches are necessary, but there is additional overhead in state management and state decoding.

FSM including Transitional States
[Figure: state diagram with stable states 0, 1, 2, 3 and transitional states 0/1, 1/2, 2/3 between them; transitions are controlled by the signal tr.]
If a "flying" transition between repair states becomes necessary, the control logic has seven states instead of four!

Control Logic Functionality
Each of the four basic blocks can be accessed through the extra test access.
With a test input pattern applied, the RBB is run through its 4 states.
If a BB or the RB is found to be faulty through the test access, the control is fixed in this state. The faulty block is then not in functional use.
The controller has a "fault" flag, which indicates the status "backup in use".
Once a fault has been detected in an RBB, it cannot be used for de-stressing.
As long as no fault has been detected in an RBB, it can activate re-configuration for de-stressing via an extra test control signal, which makes the FSM run through the extended scheme of logic states for "hot" re-configuration.
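The seven-state controller can be sketched as a small state machine. The state names follow the FSM diagram; the transition rules, the `hot` input that selects handover steps, and the decoder mapping are assumptions reconstructed from the text, not the author's circuit: a stable state s disconnects block s (which physical block gets number 0 is arbitrary here), and a handover state s/s+1 drives function s from two blocks in parallel. A detected fault freezes the controller, which then ignores further de-stressing requests, following the rule that a block with a detected fault cannot be used for de-stressing.

```python
from dataclasses import dataclass
from typing import List, Set

# Stable states "0".."3" and handover states "s/s+1" (k = 4 blocks: 3 functional + 1 spare).
STATES = ["0", "0/1", "1", "1/2", "2", "2/3", "3"]


def decode(index: int, k: int = 4) -> List[Set[int]]:
    """Physical blocks driving each of the k-1 functions in STATES[index] (hypothetical decoder).

    Stable state s: block s is disconnected (test position); functions i < s use
    block i, functions i >= s use block i + 1. In handover state s/s+1, function s
    is driven by blocks s and s+1 in parallel.
    """
    s = index // 2
    mapping = [{i} if i < s else {i + 1} for i in range(k - 1)]
    if index % 2 == 1:
        mapping[s] = {s, s + 1}
    return mapping


@dataclass
class ReconfigController:
    index: int = 0            # position in STATES
    fault_flag: bool = False  # "backup in use": configuration frozen

    @property
    def state(self) -> str:
        return STATES[self.index]

    def step(self, tr: bool, hot: bool = False, fault_detect: bool = False) -> str:
        """One clock of the controller (transition rules assumed, not taken from the slides)."""
        if self.fault_flag:
            return self.state              # frozen: faulty block stays out of use
        if self.index % 2 == 1:
            self.index += 1                # complete a handover; a fault report here is ignored
        elif fault_detect:
            self.fault_flag = True         # fix the configuration in this stable state
        elif tr and self.index < len(STATES) - 1:
            self.index += 1 if hot else 2  # hot: via handover state; off-line test: direct
        return self.state


if __name__ == "__main__":
    c = ReconfigController()
    for _ in range(6):
        print(c.step(tr=True, hot=True), decode(c.index))
```

Running `step(tr=True, hot=True)` from state 0 visits 0/1, 1, 1/2, 2, 2/3 and 3, while `step(tr=True)` without `hot` jumps directly between stable states, as in the off-line test sweep.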
Extended Control Logic
[Figure: reconfigurable block (RB) with test in / test out; a decoder driven by the FSM generates the switch control signals; the FSM has clock, tr and reset inputs; a fault flip-flop ("1" for fault detect) with its own reset holds the fault flag.]

7. Overhead and Limitations
BISR requires additional overhead.
The inevitable extra circuitry used for fault administration is not fault-free by definition. But we can assume that such circuitry, if fabricated correctly, is not in heavy use all the time and will therefore show far fewer stress-induced failures.
Memory cells used for repair state administration are prone to transient fault effects from particle radiation. With suitable state encoding (1-out-of-n code), a parity check can be applied.

Overhead
Overhead factors:
- Number and size of redundant elements,
- Number of switches for (re)configuration,
- Test and fault diagnosis,
- Control logic,
- Extra overhead for system management.

Cost / Overhead (3 functional blocks plus 1 backup in RLB)
Basic block | Functional trans. | Backup trans. | Switch trans. | Controller trans.* | Overhead %*
2-NAND | 3 * 4 | 4 | 30 | 81 / 200 | 960 / 1950
Half adder | 3 * 12 | 12 | 40 | 81 / 200 | 369 / 700
Full adder | 3 * 30 | 30 | 50 | 81 / 200 | 179 / 311
2-bit ALU | 3 * 352 | 352 | 140 | 81 / 200 | 54.2 / 65.5
4-bit ALU | 3 * 699 | 699 | 180 | 81 / 200 | 45.8 / 51.5
8-bit ALU | 3 * 1367 | 1367 | 260 | 81 / 200 | 41.6 / 44.5
* without / with extensions for de-stressing; controller design optimized for supervision by parity control.

Sources of Overhead
Basic block | Complexity (trans.) | Redundancy % | Switches % | Control % | Control with de-stressing %
2-NAND | 4 | 33 | 250 | 675 | 1666
Half adder | 12 | 33 | 111 | 225 | 555
Full adder | 30 | 33 | 55 | 90 | 222
2-bit ALU | 352 | 33 | 13 | 7.6 | 18.9
4-bit ALU | 699 | 33 | 8.5 | 3.8 | 9.5
8-bit ALU | 1367 | 33 | 6.2 | 2 | 4.8
Switch and control overhead dominate; a reasonable lower bound for the complexity of basic blocks is around 100-200 transistors.

Overhead and Block Size
[Chart: overhead in % (log scale, 10 to 1000) versus basic block size (10 to 10^4 transistors) for self repair and for self repair plus de-stressing; the 33 % level of the backup unit is marked.]

The Switching Problem (1)
[Figure: three switch variants with their switch control: one compensates "always on" faults, one compensates "always off" faults, and one compensates both "always on" and "always off" faults.]
... but always only in one single transistor.

Single Points of Failure
[Figure: transistor switches of a reconfigurable logic block (RLB) between the configuration control network and the signal wiring, with three fault locations: 1: short gate - signal input, 2: short gate - block input, 3: channel short.]

Pass Transistor Faults
A short between the signal input (Usign) and the control input (Uctrl) may be handled by designing the gate input line (Rbr) as a fuse. Then one additional transistor is needed as a "power sink".

Blowing Fuses
[Figure: pass transistor with a fuse in its gate line (CTL in), a gate short to the signal path (sin, sout), a high supply VDDhigh, and a power-sink transistor that blows the fuse.]

8. Summary and Conclusions
Logic self-repair is not impossible, but not cheap either.
The lower bound for logic blocks is about 100 transistors.
Experience shows that most logic designs "yield" some potential for logic extraction.
Repair technologies work even (much) better for regular processor architectures such as VLIW processors.
In real-life designs, a large part of the system (memory: 50-90 %, functional units: 10-40 %) is regular.
Only a small fraction is truly "irregular" and needs higher overhead.
No such strategy yet for analog and mixed-signal circuits!

Real Embedded Systems
[Figure: SoC floorplan with CPUs, data paths, memory controllers, DSP, caches and a mixed-signal / RF block.]
... only a small fraction of the real system is truly irregular and needs "expensive" logic repair!

Regular Processor Architectures
[Figure: processor with control logic (needs logic BISR), register file and multiple parallel processing units (Add, Mult).]
Regular processor structures with multiple parallel units need expensive logic (self-)repair only for their control logic.
Reconfiguration of data-path elements can be arranged by software, which does not suffer from wear-out!

Design for Repairability
[Flow chart: starting from the RT netlist, obvious regular blocks are extracted and composed into RT-level RLBs; regular entities are found and extracted from the random logic and composed into gate-level RLBs; the random rest logic remains; finally, the RLB control scheme and control circuitry are composed and the reliability is estimated.]

This is the END! Thank you for not falling asleep! (I would have ...)
