Fault-Tolerance in VHDL Description: Transient-Fault Injection & Early Reliability Estimation

Fabian Vargas, Alexandre Amory
Catholic University - PUCRS, Electrical Engineering Dept.
Av. Ipiranga, 6681. 90619-900 Porto Alegre, Brazil
vargas@computer.org

Raoul Velazco
TIMA-INPG Laboratory
46, Av. Félix Viallet, 38031 Grenoble, France
velazco@imag.fr

Abstract

We present a new approach to estimate the reliability of complex circuits used in harsh environments such as radiation. This estimation can be performed at an early stage of the design process, whereas it is usually carried out in the laboratory by means of radiation facilities (particle accelerators). In our case, we estimate the expected tolerance of the complex circuit with respect to SEU during the VHDL specification step. The early-estimated reliability level is then used to steer the design process toward a trade-off between the maximum area overhead due to the insertion of redundancy and the minimum reliability required for a given application. This approach is being automated through the development of a CAD tool.

Keywords: Fault-Tolerant Circuits; Reliability Estimation; Radiation-Exposed Environments; Single-Event Upset (SEU); High-Level Description (VHDL); Error Detection and Correction (EDAC) Codes.

1. Introduction

At the present time, it is generally accepted that the occurrence of transient faults in memory elements, commonly known as single-event upsets (SEUs), is a potential threat to the reliability of integrated circuits operating in radiation environments [1,2,3]. This subject is of considerable importance because SEUs occur in space/avionics applications due to the presence of high-energy particles. SEUs can also occur at ground level (due to atmospheric neutrons), which may potentially affect the operation of digital systems based on future sub-micron technologies. Heavy particles incident on storage elements such as flip-flops, latches, and RAM memory cells produce a dense track of electron-hole pairs, and this ionization can modify their contents, a phenomenon also referred to as single-event upset or soft error [4,5,6].

It is interesting to note that the Hubble Space Telescope has trouble not only with its mirrors. NASA's biggest space-based astronomical observatory also struggles daily with radiation-induced electronic failures. SEUs in a 1-Kbit low-power TTL bipolar random access memory cause one of Hubble's crucial focusing elements to lose data on a regular basis. Ironically, the chip has a well-documented history of SEU failures, and NASA officials knew about them before the craft was launched. Nothing was done, however, because Hubble's orbit takes it through relatively benign radiation territory. It is only when the telescope passes through the heavily proton-charged South Atlantic Anomaly over Brazil that problems occur, and NASA engineers have developed a software solution to compensate for the errors [7]. The same effects can be observed in SRAM-based FPGAs and commercial microprocessors when exposed to this type of radiation [8,9].

One possible technique to cope with SEU effects is the one proposed in [10,11]. In that work, the authors combined IDDQ current monitoring with coding techniques to cope with SEU-induced errors in SRAM memories. In this approach, the current checking is performed on the SRAM columns and is combined with a single parity bit per SRAM word to perform error correction.
This approach has been proved to be very effective in detecting/correcting SEUs in SRAMs.

Another technique extensively used by designers to implement SEU-tolerant integrated circuits is the use of process-related enhancements, such as CMOS-SOI/SOS technologies [4,5,6]. In addition, some designers rely on hardware redundancy in the form of extra transistors used to implement registers that are more stable against SEU disruptions [4,5]. Even though these approaches are expensive, they are very effective in coping with SEUs.

Another solution widely employed by designers to cope with the harmful consequences of SEUs is the use of Error Detection and Correction (EDAC) approaches. In this case, a Hamming code plus one parity bit is appended to each of the memory elements (special/general-purpose registers or memory words). If we consider 64-bit-wide memory elements, this approach results in an area overhead of 10-15% per memory element. Despite this area overhead, this is the most commonly used approach to "harden" complex circuits against the SEU phenomenon [13].

Despite the good success these approaches present in coping with transient faults in the memory elements of complex circuits, one of their most important drawbacks is that the designer needs to go all the way to the end of the design process in order to verify the effectiveness of the approach in handling transient faults. In other words, the designer needs to fabricate the circuit and test it in a laboratory radiation environment in order to validate the design. As a consequence, both time-to-market and cost increase dramatically if the validation process fails, because the whole design process must restart.

In order to cope with this problem, this paper proposes an approach that estimates the reliability level of the circuit under development at an early step of the design process. This estimation is performed during the circuit specification, in the VHDL high-level description language. If the reliability level obtained with respect to transient faults in memory elements is the one expected for a given application, then the designer can implement the circuit in an FPGA or an ASIC. Otherwise, he remains in the initial step of the design process in order to modify/improve the embedded fault-tolerant functions specified for the circuit under development.

At present, a tool that automates the insertion of the coding techniques into the circuit storage elements and estimates the obtained reliability (both steps performed at the VHDL description level) is under development.

2. Single-Event Upset (SEU)

In a CMOS static memory cell, the nodes sensitive to high-energy particles are the drains of the off-transistors. Thus, two sensitive nodes are present in such a structure: the drains of the p-type and the n-type off-transistors. When a single high-energy particle (typically a heavy ion) strikes a sensitive node of a memory cell, it loses energy via the production of electron-hole pairs, the result being a densely ionized track in the local region of that element [12]. The charge collection process following a single particle strike is now described. Fig. 1 shows the simple example case of a particle incident on a reverse-biased n+p junction, that is, the drain of an n-type off-transistor.

[Fig. 1 appears here: (a) cross-section of the reverse-biased n+p junction (NFET gate, source S at 0 V, drain D at 5 V) showing the ion track and the drift, funneling and diffusion charge-collection regions in the p substrate; (b) the resulting current pulse, with its prompt (drift + funneling) and delayed (diffusion) components, over a 0-100 nsec. time axis.]

Fig. 1. Illustration of the charge collection mechanism that causes single-event upset: (a) particle strike and charge generation; (b) current pulse shape generated in the n+p junction during the collection of the charge.

Charge collection occurs by three processes, which begin immediately after the creation of the ionized track: drift in the equilibrium depletion region, diffusion, and funneling. A high electric field is present in the equilibrium depletion region, so carriers generated in that region are swept out rapidly; this process is called drift. Carriers generated beyond the equilibrium depletion region width (more specifically, the charge generated beyond the influence of the excess-carrier concentration gradients) can be collected by diffusion. The third process, charge funneling, also plays an important role in the collection process. Charge funneling involves a spreading of the field lines into the device substrate beyond the equilibrium depletion width; the charge generated by the incident particle over the funnel region is then collected rapidly. If the charge collected during the occurrence of these three processes, Qd, is large enough, that is, greater than the critical charge Qc of the memory cell, then the memory cell flips, inverting its logic state (the critical charge Qc of a memory cell is the greatest charge that can be deposited in the memory cell before the cell is corrupted, that is, before its logic state is inverted). Fig. 1b shows the resulting current pulse shape that is expected to occur due to the three charge collection processes described above. This current pulse is generated between the reverse-biased n+ (resp. p+) drain depletion region and the p-substrate (resp. n-well), for the case of an n-well technology, for instance.
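The upset condition described above, and the current pulse of Fig. 1b, are commonly summarized in the radiation-effects literature as follows; the double-exponential pulse approximation (the classical Messenger model) and its time constants are not given in this paper and are quoted here only for reference:

    Q_d = \int_0^{\infty} I(t)\,dt \;>\; Q_c \quad\Longrightarrow\quad \text{cell flips}

    I(t) = I_0\left(e^{-t/\tau_f} - e^{-t/\tau_r}\right), \qquad \tau_r \ll \tau_f

Here \tau_r models the fast onset of the prompt (drift + funneling) collection, while \tau_f models the slow decay dominated by diffusion.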
3. The Proposed Approach

The proposed approach is divided into two steps: I) according to a built-in reliability functions library, the designer specifies the coding techniques to be incorporated into the circuit; II) then, by using a specific fault injection technique, the designer estimates the circuit reliability. Both steps are performed in the VHDL high-level description language.

3.1. Built-In Reliability Functions Library: achieving the desired circuit fault-tolerance

The first step of the approach is based on the incorporation of coding techniques into the original VHDL circuit description. The coding approaches considered are in the form of: (a) a Hamming code plus one parity bit per storage element (single registers) to correct single errors and to detect double errors (SEC/DED); and (b) a two-dimensional parity code to be applied to the columns and lines of embedded memory arrays.
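As an illustration of approach (a), the sketch below computes the check bits of a SEC/DED code for the 8-bit word case used in the examples of this section (8 information bits plus 5 check bits). The exact data-to-parity bit assignment is not given in the paper, so the mapping below (the standard Hamming positions 1..12) is an assumption:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Minimal sketch of a SEC/DED check-bit generator for an 8-bit word:
    -- 5 check bits (P1, P2, P4, P8 of the Hamming code plus the overall
    -- parity P0). The bit-to-parity mapping is the standard one for
    -- Hamming positions 1..12 and is an assumption, not from the paper.
    entity PARITY_GEN is
      port(
        D     : in  std_logic_vector(7 downto 0); -- information bits
        CHECK : out std_logic_vector(4 downto 0)  -- P8 & P4 & P2 & P1 & P0
      );
    end PARITY_GEN;

    architecture RTL of PARITY_GEN is
      signal P0, P1, P2, P4, P8 : std_logic;
    begin
      -- P1, P2, P4 and P8 are computed in parallel, as in Fig. 3b
      P1 <= D(0) xor D(1) xor D(3) xor D(4) xor D(6);
      P2 <= D(0) xor D(2) xor D(3) xor D(5) xor D(6);
      P4 <= D(1) xor D(2) xor D(3) xor D(7);
      P8 <= D(4) xor D(5) xor D(6) xor D(7);
      -- P0 is the parity of the whole (information + check) word
      P0 <= D(0) xor D(1) xor D(2) xor D(3) xor D(4) xor D(5) xor D(6)
            xor D(7) xor P1 xor P2 xor P4 xor P8;
      CHECK <= P8 & P4 & P2 & P1 & P0;
    end RTL;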
While the first approach is more suitable for single storage elements placed individually in the circuit, for instance in a microprocessor (e.g., the Program Counter (PC), Stack Pointer (SP), Exception Program Register (EPR), and Page Table Register (PTR)), the second approach is more suitable for groups of storage elements, such as the Branch Prediction Table and the Address Translation Cache.

Note that if we consider a register placed individually in the processor, then the simplest (and most reliable) way to make it SEU-tolerant is to append the Hamming code plus 1 parity bit to the original bits of the register. For example, consider a 32-bit register: we can implement this approach by appending 7 check bits to the register (6 bits for the Hamming code + 1 parity bit). The drawback of this approach is the large area overhead required to implement it: around 22% in this case.

Now, if we consider a large area containing several tens or hundreds of registers (like an embedded memory), we can minimize the area overhead by appending only one parity bit per line and one parity bit per column of the memory array (i.e., the two-dimensional parity approach). For example, if we have a memory of 64 32-bit words, then we can implement the two-dimensional approach with 96 bits (64 lines + 32 columns). The drawback of this approach is that it may involve speed degradation for large sequences of write operations into the memory (in order to maintain the integrity of the word parities).

Compared to the modified Hamming approach, note that for a given period of time t the two-dimensional parity technique results in lower reliability. This is true because the probability of occurrence of one error in 64 bits of a single line is greater than the probability of occurrence of one error in a single word of 32 bits. Note that the occurrence of a second error in the same memory line corrupts the whole information stored in the memory (in this case, the error is detected, but cannot be localized). In the case of the modified Hamming code, the occurrence of a second error in a given word is detected and cannot be localized either, but in this case only the information in this word is lost, which confines the error and maintains the integrity of the rest of the information stored in the memory.

This approach is being automated through the development of the FT-PRO CAD tool, whose main design flow steps are shown in fig. 2. Fig. 3a shows the target block diagram that is generated by the FT-PRO tool after compiling the initial circuit description in VHDL ("High-Reliability HW Part" block in fig. 2). This structure generates and appends check bits to the information bits each time the application program writes data into a memory element. Similarly, each time data is read from a memory element, the data integrity is checked by the Checker/Corrector block and, if a correctable error is found, this block writes the corrected data back into the memory element so that it can be read again by the application program. Figs. 3b and 3c present details of the Parity Generator and the Checker/Corrector blocks shown in fig. 3a. The example shown in these figures targets an 8-bit word processor. Note that the parities P1, P2, P4 and P8 used by the Hamming code are computed in parallel, while the computation of the whole word parity (P0) is serial (see fig. 3b).

[Fig. 2 appears here: design flow from "VHDL Circuit Description" and "Built-In Reliability Functions Library" through "Generation of the Fault-Tolerant HW" ("High-Reliability HW Part") and "VHDL Simulator" (with "Fault-Injection Constraints"), to a "Transient-Fault Coverage / Desired Reliability Level?" decision; on NO, the flow tries to select different reliability functions; on YES, it proceeds to "HW Synthesis". The simulation loop constitutes the Circuit Reliability Verification Step.]

Fig. 2. Block diagram of the FT-PRO tool being developed to automate the process of generating storage-element transient-fault-tolerant complex circuits.

[Figs. 3a, 3b and 3c appear here.]

Fig. 3. (a) Target block diagram generated by the FT-PRO tool. (b) Parity Generator block. (c) Checker/Corrector block.
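Using the figures quoted above, the area trade-off between the two schemes for a memory of 64 words of 32 bits can be restated as a quick check (counting check bits only, an approximation that ignores the encoder/decoder logic):

    \text{Hamming + parity:}\quad 64 \times 7 = 448\ \text{check bits} \;\Rightarrow\; \tfrac{448}{64 \times 32} \approx 21.9\%

    \text{Two-dimensional parity:}\quad 64 + 32 = 96\ \text{bits} \;\Rightarrow\; \tfrac{96}{2048} \approx 4.7\%

This is the roughly 22% per-register overhead quoted above versus under 5% for the array, which is what motivates the two-dimensional scheme for large memories despite its lower reliability.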
3.2. Reliability Early-Estimation: injecting bit-flip faults (SEUs) in VHDL code

It is commonly agreed that high-level description languages are widely used to describe hardware systems as software programs. Consequently, a transient fault that affects the hardware operation can be considered as a fault affecting the software execution. In other words, a bit-flip fault affecting the hardware operation (e.g., an SEU-induced fault in a memory element) can have an equivalent representation at the software implementation level.

In this section, we present the fault-injection technique we have developed to modify registers "on-the-fly" during VHDL simulation. The fault model assumed is not restricted to single faults; thus, any combination of faults can occur in a memory element of the circuit. The idea in our work is to verify, at a high-level description and as early as possible in the design process, the circuit reliability against SEU-induced faults. Of course, except for the registers modified with EDAC codes, the other, non-modified registers of the circuit are not reliable; any fault affecting the latter registers may lead to a system failure. Therefore, the designer must be aware of this situation and balance the desired reliability level against the required HW cost for the circuit under design. This situation is especially true in the case of large memory portions of the circuit, such as embedded caches. In this particular case, it can happen that the application tolerates a given number of errors (bit-flips) in the data cache block of the circuit. This allows the designer to decide not to protect this part of the circuit in order to minimize the area overhead due to the insertion of the built-in reliability functions.

Note that the reliability is not only a function of which memory elements have been protected with EDAC codes, but also a function of the application itself. Since memory elements are checked only when they are used (i.e., read out) by the application, it may happen that, after a long period out of use, a memory element is corrupted by more errors than can be handled by the EDAC code associated with it.

The proposed approach works as follows: initially, we insert the (single or multiple) bit-flip faults in the VHDL code according to a predefined mean time between failures (MTBF). Then, we simulate the circuit by running a program (testbench) as close as possible to the application program. After the simulation, we inspect the primary outputs (POs) of the circuit to verify, for each of the injected bit-flip faults, whether it affected the functional circuit operation. We can then reach one of three conclusions:

a) the fault was not propagated to the POs; it is then considered redundant;

b) the fault was propagated to the POs of the circuit and it was detected by the built-in reliability functions appended to the memory elements (this can be verified by reading out the outputs of the comparators in the VHDL code after simulation); the reliability of the circuit is then maintained. (In cases (a) and (b), we have the generation of a codeword.)

c) finally, if the fault produced an erroneous PO and it was not detected by the appended hardware (generation of a non-codeword), then the reliability of the circuit is reduced. This happens because either the reliability functions used fail to detect such a fault, or the choice of the memory elements to be made fault-tolerant is not adequate (because important blocks of storage elements remain in their original form).
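In case (b), the read-out of the comparator outputs could be automated in the testbench. A hypothetical fragment for the REG_FT entity shown later in Fig. 5 might look as follows; the ERROR-port encoding and the DETECTED_FAULTS counter are assumptions, not defined in the paper:

    -- Hypothetical testbench fragment (not from the paper) for counting
    -- detected faults through the ERROR port of REG_FT (Fig. 5).
    -- Assumption: ERROR = "01" flags a corrected single error and
    -- ERROR = "10" a detected but uncorrectable double error.
    signal DETECTED_FAULTS : integer := 0;  -- in the testbench declarative part

    MONITOR: process(CLOCK)
    begin
      if rising_edge(CLOCK) then
        if ERROR = "01" or ERROR = "10" then
          -- case (b): the built-in reliability functions flagged the fault
          DETECTED_FAULTS <= DETECTED_FAULTS + 1;
        end if;
      end if;
    end process;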
At the end of this process, when all input test vectors and faults have been applied to the circuit, we compute the overall bit-flip fault coverage as a function of the predefined MTBF for the target application as follows:

    Bit-Flip_Fault_Coverage(MTBF) = K / (M - E)

where K is the number of detected bit-flip faults, M is the total number of injected bit-flip faults, and E is the number of redundant bit-flip faults in the VHDL code.

After this basic description of how the proposed approach interprets the results obtained from the fault injection procedure, in the following we describe the mechanism used to perform fault injection at the VHDL code level. Fig. 4 presents the main structure used to inject faults in the VHDL code. In this approach, we use a Linear Feedback Shift Register (LFSR) to inject bit-flip faults into the selected memory element. The proposed approach presents three different operating modes:

(a) normal_mode. In this mode, the circuit is in normal operation and no fault injection is possible during the simulation process.

(b) precision_fault-injection_mode. In this operating mode, single or multiple faults can be injected in the selected memory register. To do so, the user defines which bits will be flipped, and in which sequence, by setting specific seeds into the LFSR before clocking it. This results in the injection of the fault(s) in the selected memory element. Next, the user resets the LFSR and repeats the operation to insert another seed into this element, and so on (see fig. 4).

[Fig. 4 appears here.]

Fig. 4. Approach used to inject faults in the VHDL code (example for a circuit that operates with 8 information bits plus 5 check bits).

(c) random_fault-injection_mode. In this mode, the user does not alternate reset and seed insertion into the LFSR every time he wants to inject a fault in a selected memory element. Instead, a single reset is performed at the beginning of the process in order to insert the first seed. After this, every time the user wants to inject a fault in the selected memory element, he only needs to generate the clock signal to the LFSR. Note that in this operating mode none, one, or more faults can be pseudo-randomly injected into the selected memory element, while in operating mode (b) the user defines in a very precise way the number and the position of the bits that have to be flipped in the selected memory element.

At the VHDL code level, the LFSR can be implemented by means of a Generate Statement [14]. This mechanism can be used for conditional elaboration of a portion of a VHDL description. In other words, we define, for example, a variable TEST_GEN that assumes three possible values: "00", "01" or "11". Then, if TEST_GEN is set to "00", the portion of the VHDL code that describes the hardware necessary to implement a bypass mechanism for the LFSR (the normal_mode) is selected to be compiled together with the main VHDL code. If TEST_GEN is "01", the hardware necessary to force the LFSR to operate in the precision_fault-injection_mode is selected to be compiled with the main VHDL code. In a similar way, if TEST_GEN is "11", the selected portion of the VHDL code is the one that describes the hardware necessary to implement the random_fault-injection_mode. Fig. 5 shows pseudo-VHDL code illustrating how the Generate Statement can be used to implement the proposed high-level fault injection mechanism.
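The body of the LFSR component instantiated in the code of Fig. 5 below is not given in the paper. A minimal sketch consistent with the port map used there might be the following; the seed-as-generic interface, the use of the LFSR state as a bit-flip mask, and the feedback polynomial (x^4 + x^3 + 1) are all assumptions:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Minimal LFSR sketch (assumptions: the seed is a generic and must be
    -- nonzero; a '1' in the internal state flips the corresponding data
    -- bit; feedback polynomial x^4 + x^3 + 1, maximal length for 4 bits).
    entity LFSR is
      generic( SEED : std_logic_vector(3 downto 0) := "1000" );
      port(
        LFSR_IN  : in  std_logic_vector(3 downto 0); -- current register bits
        LFSR_OUT : out std_logic_vector(3 downto 0); -- bits with fault(s) injected
        CLK_IN   : in  std_logic;                    -- evaluation (injection) clock
        RST_IN   : in  std_logic                     -- reloads the seed
      );
    end LFSR;

    architecture RTL of LFSR is
      signal STATE : std_logic_vector(3 downto 0) := SEED;
    begin
      process(CLK_IN, RST_IN)
      begin
        if RST_IN = '1' then
          STATE <= SEED;  -- seed insertion
        elsif rising_edge(CLK_IN) then
          -- shift left with feedback from taps 4 and 3
          STATE <= STATE(2 downto 0) & (STATE(3) xor STATE(2));
        end if;
      end process;
      -- bits marked '1' in STATE are flipped in the selected memory element
      LFSR_OUT <= LFSR_IN xor STATE;
    end RTL;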
    package FAULT_INJECTION_PKG is
      ...
      -- fault injection mode:
      --   0 => normal mode
      --   1 => precision fault injection mode
      --   2 => random fault injection mode
      constant FAULT_INJECTION : integer := 0;
      -- to allow fault injection in the high-order data bits, set this constant
      constant FAULT_DATA_HIGH : std_logic := '1';
      ...
    end FAULT_INJECTION_PKG;

    ------------------------------------------------

    entity REG_FT is
      port(
        CLOCK, RESET,
        CE    : in  std_logic;                     -- chip enable
        D     : in  std_logic_vector(7 downto 0);  -- input from data bus
        Q     : out std_logic_vector(7 downto 0);  -- output to data bus
        ERROR : out std_logic_vector(1 downto 0)
      );
    end REG_FT;

    architecture REG_FT of REG_FT is
      -- register (info + check bits)
      signal REG : std_logic_vector(12 downto 0);
      -- info bits
      alias INFO_REG : std_logic_vector(7 downto 0) is REG(12 downto 5);
      -- check bits
      alias CHECK_REG : std_logic_vector(4 downto 0) is REG(4 downto 0);
      ...
    begin
      ...
      NORMAL_MODE: if FAULT_INJECTION = 0 generate
        -- input data from data bus
        INFO_REG <= D;
        -- parity from parity generator
        CHECK_REG <= PARITY_GEN;
      end generate;

      PRECISION_FAULT_INJECTION_MODE: if FAULT_INJECTION = 1 generate
        FAULT_DATA_HIGH_BLOCK: if FAULT_DATA_HIGH = '1' generate
          LFSR_DATA_HIGH: LFSR port map(
            LFSR_IN  => INFO_REG(7 downto 4),
            LFSR_OUT => LFSR_OUT_DATA_HIGH,
            CLK_IN   => CLK_LFSR,
            RST_IN   => RESET);
          -- insert a fault in the 4 MSB bits
          INFO_REG <= LFSR_OUT_DATA_HIGH & INFO_REG(3 downto 0);
        end generate;
      end generate;
      ...
    end REG_FT;

Fig. 5. Pseudo-VHDL code illustrating the high-level fault-injection mechanism.

Now, consider the clock signal C1 used to drive the LFSR. The goal of this control signal is to determine the moment when the LFSR evaluates, i.e., the exact moment when a fault is injected in the selected memory element. This can be achieved at the VHDL code level by using the "after" delay mechanism, which can be used to introduce timing constraints to memory element assignments, for instance [14]. The example below illustrates its application (written as a single waveform assignment, so that C1 has only one driver):

    ...
    C1 <= '1' after 100 ms,
          '0' after 200 ms,
          '1' after 300 ms,
          '0' after 400 ms;
    ...

Therefore, in this example, we are injecting fault(s) into a given selected memory element at the simulation instants 100 ms and 300 ms, respectively. Note that the number of faults injected depends on the seed placed in the LFSR.

Computation example of the proposed approach: to give a good understanding of the proposed approach, assume for example that we specified a circuit containing 300 32-bit registers. For this circuit, we developed a representative application program (testbench) and injected 1,000 faults, with a time interval between fault injections equal to 8 s, for an overall simulation time slightly greater than 8,000 s. Suppose that, for this set of faults injected during the testbench simulation, 990 faults were detected by the additional EDAC HW facilities appended to the original circuit. The other 10 faults produced erroneous POs and were not detected by the additional HW. Thus, the transient fault coverage for this circuit, with respect to the specific testbench simulated, is 99%, and the computed error rate, given in [errors/component·second], is 10/8,000 = 1.25 x 10^-3 (i.e., we can expect, on average, 1 functional error in this component every 800 seconds, or 0.22 hours).
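In the notation of Section 3.2, the example's numbers work out as follows (taking E = 0 redundant faults, an assumption, since the example does not report any):

    \text{Bit-Flip\_Fault\_Coverage} = \frac{K}{M - E} = \frac{990}{1000 - 0} = 99\%

    \lambda = \frac{10}{8000\ \text{s}} = 1.25 \times 10^{-3}\ \text{errors/(component}\cdot\text{s)} \;\Rightarrow\; 1/\lambda = 800\ \text{s} \approx 0.22\ \text{h}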
By extrapolating these computations to circuit operation in the field, we can expect a fault coverage of at least 99% for any mechanism that injects faults in this circuit with a time interval greater than 8 s, when running the same simulated application program. With this result in mind, we can, for instance, implement this hypothetical circuit in an ASIC. Suppose that the ASIC is based on standard-cell memory elements which were previously characterized to operate in Low-Earth Orbit¹ radiation with a typical error rate of 1 x 10^-4 errors/bit·day. Expressing this value in [errors/component·second], we have 1.11 x 10^-5. In other words, the error rate for this component in LEO is expected to be 1 error (i.e., 1 transient fault injected) every 25 hours (or 90,000 s). Note that we simulated our hypothetical circuit with a time interval between fault injections equal to 8 s and, as a result, obtained a fault coverage of 99%. These results allow us to expect a fault coverage even higher than this value if we decide to implement the hypothetical circuit with the suggested standard-cell memory elements, because the time interval between fault injections in this case is roughly 10,000 times greater than the one simulated (i.e., 1 fault every 90,000 s).

¹ Low-Earth Orbit (LEO): approximately 700 km at 98°.

3.3. Preliminary Results

This section presents some preliminary results that are used to verify the FT-PRO tool performance profile in terms of the area overhead required to implement the approach described in Section 3.1. All the results indicated hereafter were obtained for Altera FPGA components (EPF10K20RC240-4 and EPF10K30RC240-4). With this goal in mind, Table I summarizes the area overhead (given in programmable logic cells, PLCs) required to implement the Parity Generator and the Checker/Corrector blocks that are generated automatically by the FT-PRO tool.

    Number of Programmable Logic Cells
    Register Length   Parity Generator Block   Checker/Corrector Block
          8                     8                        24
         11                    10                        30
         16                    17                        43
         32                    36                        83
         57                    70                       140
         64                    80                       166

Table I. Number of logic cells required per block for different register lengths.

4. Conclusions & Future Work

We presented a new approach that automates the process of generating fault-tolerant complex circuits described in the VHDL language. The approach uses coding techniques associated with registers or groups of registers to detect the occurrence of a bit-flip (single-event upset, SEU) and to localize the affected memory element (thus performing error correction). In a second step, this approach estimates the reliability of such complex circuits with respect to SEU. This procedure is also performed at an early stage of the design process, i.e., at the circuit VHDL specification level.

This approach is being automated through the development of the FT-PRO CAD tool. The development of a test vehicle (a Z80-like microprocessor) is now in progress. This component will be implemented in a commercial FPGA and exercised under radiation at the Lawrence Berkeley Laboratory facility (88-inch cyclotron). Experimental results will allow us to verify the effectiveness of the reliability early-estimation procedure, as well as provide valuable feedback for future improvements of the built-in reliability functions database.

References

[1] Pickel, J. C.; Blandford, J. T., Jr. Cosmic Ray Induced Errors in MOS Memory Cells. IEEE Transactions on Nuclear Science, vol. NS-25, no. 6, Dec. 1978.
[2] Turflinger, T. L.; Davey, M. V. Understanding Single Event Phenomena in Complex Analog and Digital Integrated Circuits. IEEE Transactions on Nuclear Science, vol. NS-37, no. 6, Dec. 1990.
[3] Browning, J. S.; Griffee, J. W.; Holtkamp, D. B.; Priedhorsky, W. C. An Assessment of the Radiation Tolerance of Large Satellite Memories in Low Earth Orbits. 1st European Conference on Radiation and its Effects on Devices and Systems (RADECS), Marseille, France, Sep. 1991.
[4] Srour, J. R.; McGarrity, J. M. Radiation Effects on Microelectronics in Space. Proceedings of the IEEE, vol. 76, no. 11, Nov. 1988.
[5] Kerns, S. E.; Shafer, B. D. (eds.). The Design of Radiation-Hardened ICs for Space: a Compendium of Approaches. Proceedings of the IEEE, vol. 76, no. 11, Nov. 1988.
[6] Messenger, G. C.; Ash, M. S. The Effects of Radiation on Electronic Systems. Van Nostrand Reinhold Company Inc., New York, 1986.
[7] ARCHIMEDES ESPRIT-III Basic Research Project, Contract no. 7107. Brussels, Belgium, July 1991.
[8] Velazco, R.; Cheynet, Ph.; Ecoffet, R. Operation in Space of Artificial Neural Networks Implemented by Means of a Dedicated Architecture Based on a Transputer. XI Brazilian Symposium on Integrated Circuit Design (SBCCI'98), Rio de Janeiro, Brazil, Sep. 30 - Oct. 2, 1998.
[9] Bezerra, F.; Velazco, R.; Assoum, A.; Benezech, D. SEU and Latch-up Results on Transputers.
IEEE Transactions on Nuclear Science, vol. 43, no. 3, part I, 1996.
[10] Vargas, F. L.; Nicolaidis, M. SEU-Tolerant SRAM Design Based on Current Monitoring. 24th International Symposium on Fault-Tolerant Computing (FTCS-24), Austin, Texas, USA, Jun. 1994, pp. 106-115.
[11] Calin, T.; Vargas, F.; Nicolaidis, M. Upset-Tolerant CMOS SRAM Using Current Monitoring: Prototype and Test Experiments. International Test Conference (ITC'95), Washington, USA, Oct. 1995, pp. 45-53.
[12] Ansell, G. P.; Tirado, J. S. CMOS in Radiation Environments. VLSI Systems Design, Sep. 1986.
[13] Vargas, F.; Bezerra, E.; Wulff, L.; Barros, D. Reliability-Oriented Hardware/Software Partitioning Approach. 4th International IEEE On-Line Testing Workshop, Capri, Italy, 6-8 July 1998, pp. 17-22.
[14] Active-VHDL Series, Book #1 – VHDL Reference Guide. Aldec, Inc., Feb. 1998, www.aldec.com.