MERGING BIST AND CONFIGURABLE COMPUTING TECHNOLOGY TO IMPROVE AVAILABILITY IN SPACE APPLICATIONS*

Eduardo Bezerra 1, Fabian Vargas 2, Michael Paul Gough 3

1, 3 Space Science Centre, School of Engineering, University of Sussex, Brighton, BN1 9QT, England
E.A.Bezerra@sussex.ac.uk, M.P.Gough@sussex.ac.uk
1, 2 Catholic University - PUCRS, 90619-900 Porto Alegre - Brazil
eduardob@inf.pucrs.br, vargas@computer.org

Abstract

The use of configurable computers in space applications depends not only on highly reliable military devices being commercially available, but also on the definition of strategies for fault tolerance and on-board testing. This paper introduces techniques targeting mainly the problems related to Single Event Upsets (SEUs) in on-board electronics. Important subjects such as performance and dependability figures are also discussed.

Keywords: Fault Tolerance, Configurable Computing, BIST, FPGA, VHDL and Space Applications.

1. Introduction

Power consumption, processing speed, area usage, and weight are important concerns of designers of computers for space applications. Because of the application-specific nature of this kind of system, its requirements can vary significantly from application to application, which results in a completely new design for almost every new application. As a consequence, computer systems for space applications are very expensive. A possible solution to decrease design costs is to implement the processing elements using configurable devices [1-4]. This technology is appropriate for the implementation of application-dependent solutions. It allows designers to have different hardware configurations, adequate for every new application, without the need for changes in the board layout. The drawback of this solution is the difficulty associated with the "software" development for this kind of hardware [5]. For instance, for systems that require complex data structures, the best solution may still be to use conventional microprocessor-based boards.

In the case of computers for critical applications, or computers used in situations where maintenance is not possible or is very expensive, designers have additional concerns related to dependability features such as availability, reliability and testability [6]. In the past few years, strategies to improve the dependability features of configurable computer systems have been proposed and implemented [7-21]. These strategies are mainly based on the traditional ones used in microprocessor-based systems. Following a similar approach, this paper proposes strategies to improve the availability and reliability of configurable computing systems. An important consideration in defining the strategies is that configurable computing opens new possibilities, for instance the combination of software and hardware fault-tolerance strategies at the same level of abstraction. The ideas discussed here can be used not only for space applications, but also for any other embedded system with similar dependability requirements.

The paper is organised as follows. Section 2 reviews the Single Event Upset mechanism. Section 3 describes the basic node of a network architecture for space applications, based on configurable computers. Section 4 introduces the strategies for availability and reliability improvement, with emphasis on a Built-In Self-Test (BIST) strategy for SEU management. Section 5 describes an approach for masking connectivity faults. Sections 6 and 7 discuss expected and obtained results, and Section 8 presents conclusions and future directions.

* This research is partly supported by the Brazilian Council for the Development of Science and Technology (CNPq), and the Catholic University PUCRS, Brazil.

2. Single-Event Upset (SEU)

Charge generated beyond the equilibrium depletion region width, more specifically beyond the influence of the excess-carrier concentration gradients, can be collected by diffusion.
The third process, charge funneling, also plays an important role in the collection process. Charge funneling involves a spreading of the field lines into the device substrate beyond the equilibrium depletion width. The charge generated by the incident particle over the funnel region is then collected rapidly. If the charge Qd collected during the three processes described above is large enough, greater than the critical charge Qc of the memory cell, then the memory cell flips, inverting its logic state (the critical charge Qc of a memory cell is the largest charge that can be deposited in the memory cell before the cell is corrupted, that is, before its logic state is inverted). Fig. 1b shows the resulting current pulse shape that is expected to occur due to the three charge collection processes described above. This current pulse is generated between the reverse-biased n+ (resp. p+) drain depletion region and the p-substrate (resp. n-well), for the case of an n-well technology, for instance.

In a CMOS static memory cell, the nodes sensitive to high-energy particles are the drains of off-transistors. Thus, two sensitive nodes are present in such a structure: the drains of the p-type and the n-type off-transistors. When a single high-energy particle (typically a heavy ion) strikes a memory cell sensitive node, it loses energy via production of electron-hole pairs, the result being a densely ionized track in the local region of that element [22]. The charge collection process following a single particle strike is now described. Fig. 1 shows the simple example case of a particle normally incident on a reverse-biased n+p junction, that is, the drain of an n-type off-transistor.

3. System Description Overview

The block diagram in Fig.
2.a shows the proposed architecture for an on-board processing system, which may be used for both scientific and commercial space applications. The two main improvements in this architecture over the architectures that have been used in most space applications are the use of configurable devices as the main processing elements, and the use of a network to connect the different modules. As the main objective of this paper is to discuss and propose test strategies for the processing elements, we describe a basic network node in this Section.

The block diagram of the network node shown in Fig. 2.b has two external communication channels: one for connection to the network, and the other for interfacing to the application. The first has a fixed number of signals, as the protocol for inter-node communication is pre-defined. The second is defined according to the application requirements, and is a good example of the flexibility introduced by configurable computing technology. The main responsibilities of the processing element, FPGA A in Fig. 2.b, are to construct telemetry packets, according to the European Space Agency (ESA) standards [23-25], and to implement the protocol for communication on the on-board network system. In some applications it may be possible to transfer some or all user processing activities to FPGA A. An example is when the on-board data-handling (OBDH) system and one (or more) of the applications are designed by the same group [26]. In this case the "User" block shown in Fig. 2.a may be very simple, for instance consisting of only analog devices, sensors and converters, as FPGA A may be used to execute all the processing. FPGA B works as a configuration manager, allowing FPGA A to be initialised from the flash memories as if they were serial ROMs; the flash contents may be changed from the ground in case of system upgrades or bug fixes.

Fig. 1. Illustration of the charge collection mechanism that causes single-event upset: (a) particle strike and charge generation; (b) current pulse shape generated in the n+p junction during the collection of the charge, with a prompt component (drift + funneling) and a delayed component (diffusion).

Charge collection occurs by three processes which begin immediately after creation of the ionised track: drift in the equilibrium depletion region, diffusion and funneling. A high electric field is present in the equilibrium depletion region, so carriers generated in that region are swept out rapidly; this process is called drift.

As the systems based on this architecture are conceived for long-life missions, all the electronic components must be military devices. However, even employing highly reliable devices, because of the hostile environment found in space and the long life expected for the system, additional fault-tolerance strategies are used in the design. In order to define the strategies, a very simple, but efficient, fault model was chosen. The faults considered in this fault model are stuck-at and connectivity faults [6]. Stuck-at faults are good representatives of the bit errors that can occur in SRAM-based devices such as FPGAs, which are sensitive to Single Event Upsets (SEUs) caused by atmospheric high-energy particles [27-31]. The other modelled fault, connectivity, is responsible for more than 90% of the problems in a board and, as described later, special strategies such as bus replication and voters are used to tolerate this problem. The strategies used to prevent and, when that is not possible, to tolerate the faults belonging to the adopted fault model are described next in Section 4.

Legend for Fig. 2a: 1 - Protocol for on-board communication; 2 - ESA standard protocol; 3 - Configurable interface.
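The adopted fault model can be illustrated with a minimal software sketch (Python is used here purely for illustration; the function names and bit patterns are ours, not the paper's): a stuck-at fault pins one configuration bit to a fixed value, and comparing a read-back word against the golden word exposes the corrupted positions.

```python
# Illustrative model of the adopted fault model (names and values are ours,
# not the paper's): a stuck-at fault pins one configuration bit to a fixed
# value; comparing the read-back word with the golden word exposes it.

def apply_stuck_at(word, bit, value):
    """Return `word` with position `bit` stuck at 0 or 1."""
    if value:
        return word | (1 << bit)      # stuck-at-1
    return word & ~(1 << bit)         # stuck-at-0

def upset_bits(golden, readback):
    """Positions where the read-back word disagrees with the golden word."""
    diff = golden ^ readback
    return [i for i in range(diff.bit_length()) if diff & (1 << i)]

golden = 0b10110110
faulty = apply_stuck_at(golden, 2, 1)   # bit 2 was already 1: no visible error
print(upset_bits(golden, faulty))       # []
faulty = apply_stuck_at(golden, 0, 1)   # bit 0 stuck at 1 (was 0)
print(upset_bits(golden, faulty))       # [0]
```

Note that a stuck-at fault is only observable when the stuck value differs from the intended one, which is why the detection strategies of Section 4 compare against a known-good reference.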
4. SEU Prevention Strategies

Fig. 2. Block diagram of the proposed system. (a) Network architecture. (b) Basic CCM node, comprising FPGA A (processing element), FPGA B (control, configuration manager and readback), flash memory (configuration bitstream), serial PROM (emergency recovery bitstream), and optional RAM.

FPGA B is responsible for the management of the reconfiguration and test of FPGA A. These activities are detailed in this Section. Other components of the Configurable Computer Module (CCM) node are RAM, PROM and Flash memories. A RAM module may be attached to FPGA A, to be used by some applications as a scratch area. The emergency recovery PROM holds a configuration bitstream (CB) for FPGA B, used every time the system is initialised, which happens mainly when the Flash memory upsets. The CB is first used to boot FPGA B, which is then responsible for booting FPGA A with a basic functionality, enough to talk to the network bus and reload the flash memories. The flash memories hold the CB for FPGA A.

4.1. Refresh Operation in a Triple Modular Redundancy (TMR) FPGA System

In [27,28] there is a study showing the low SEU susceptibility of Xilinx FPGAs. However, for some critical applications where human life is at a premium, or when the whole embarked electronics depends on two or three core components, the low SEU susceptibility must be improved even further. In [28] a strategy to reduce the effects of SEUs in FPGA systems was proposed. Basically, as shown in Fig. 3, three FPGAs are configured with the same bitstream (triple redundancy) and operate in synchronism. A controller reads the three FPGA bitstreams, bit after bit, and if there are no differences, correct functioning with no SEU occurrence is assumed. This procedure is executed continuously, with no interference in the FPGA's normal operation. Such a scheme is possible because of the FPGA's readback feature.
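The bit-by-bit comparison of the three readback bitstreams (Fig. 3) can be modelled in software as follows (Python used only as an illustration; the function name and bit patterns are ours, not the paper's): majority agreement identifies which replica, if any, suffered an SEU.

```python
# Software sketch of the TMR readback check: the controller compares the three
# readback bitstreams bit after bit. A bit that disagrees with the other two
# identifies the FPGA hit by an SEU, which is then reconfigured (refreshed).

def find_upset_fpga(bitstreams):
    """Compare three readback bitstreams (strings of '0'/'1') bit by bit.
    Return None if all agree, else the index (0..2) of the disagreeing FPGA."""
    a, b, c = bitstreams
    for bit_a, bit_b, bit_c in zip(a, b, c):
        if bit_a == bit_b == bit_c:
            continue
        # the two agreeing replicas out-vote the third
        if bit_a == bit_b:
            return 2
        if bit_a == bit_c:
            return 1
        return 0
    return None

golden = "1011001110"
print(find_upset_fpga([golden, golden, golden]))   # None: no SEU detected
flipped = golden[:4] + "1" + golden[5:]            # single bit flip in FPGA 1
print(find_upset_fpga([golden, flipped, golden]))  # 1: refresh FPGA 1
```

Because the check runs on readback data, the application logic keeps running while the comparison proceeds, which is the property the paper exploits.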
If one input of the controller is different from the others, it is assumed that an SEU has occurred, and a reconfiguration of the faulty FPGA is executed. In [27] it was shown that a simple refresh operation, in this case by means of reconfiguration, is enough to recover the device from an SEU. The main problem with the refresh recovery is the total loss of measurement data within the instrument system. Another problem is the time necessary for reconfiguration; depending on the application size, it is therefore recommended to divide the system into small blocks using several small FPGAs. This is because a small FPGA can be configured in just a fraction of a second (e.g. 195 ms for the Xilinx XQ4085XL). The block size has to be calculated according to the application time requirements.

Fig. 3. A TMR FPGA system: a voter compares the readback bitstreams (user registers, user logic, routing) of three FPGAs loaded from the same configuration bitstream, and raises an error signal on a mismatch.

In the next three Sections, methods for SEU prevention are proposed. These new methods are based on the refresh execution, but without FPGA replication.

4.2. Periodic Refresh Without FPGA Replication

As the target system is a long-life application, periods of downtime can be considered in its design, and thus it is possible for the system to be interrupted and completely reinitialised after some time running, with no major problems. In order to implement this strategy it is necessary, as shown in Fig. 4, to have an internal FPGA 15 Hz clock generator and a 19-bit counter, written in VHDL. On each rising edge generated by the 15 Hz clock, the counter is incremented. Every time the counter wraps around to zero, which happens about every 19.4 hours, the refresh operation is executed. Refresh is achieved by the counter process resetting the FPGA PROG pin, which leads to the FPGA being reconfigured, preventing SEU occurrences from affecting the system functioning. Note that in this strategy there is no test execution and, consequently, no SEU detection. The refresh is executed periodically, even if there are no SEU occurrences. In terms of hardware resources, this strategy is less expensive than the TMR one but, depending on the application, it can be very expensive in terms of system availability.

Fig. 4. Using a counter to start the refresh operation. The counter process drives the PROG pin:

    counter <= counter + 1;
    if counter = 0 then
      PROG <= '0'; -- reset
    else
      PROG <= '1';
    end if;

4.3. Signature Analysis-Driven Refresh Without FPGA Replication

Another option for SEU prevention is the use of a signature analysis method [6] to identify when a refresh operation is necessary. In terms of hardware, this strategy may be slightly more expensive than the clock/counter one, but it is more efficient for applications where periods of downtime and loss of data are not allowable.

The method proposed here uses the LFSR/PSG (Linear Feedback Shift Register/Parallel Signature Generator) approach for signature generation and analysis [33]. A Linear Feedback Shift Register (LFSR) is a shift register with combinational feedback logic around it that, when clocked, generates a sequence of pseudo-random patterns [9,33]. In our case, we consider the use of a primitive polynomial, in order to generate all the 2^n - 1 possible combinations, where n is the degree of the polynomial. A Parallel Signature Generator (PSG) is an LFSR with exclusive-or gates between the shift register stages, implementing a generator polynomial [9] used to compact a given sequence of bits. Note that there is a probability of aliasing, that is, of masking eventual errors in the bitstream to be compacted during the signature generation process, which is inversely proportional to the length of the PSG implemented. In other words, the probability of aliasing is given by:

    (2^(k-r) - 1) / (2^k - 1)

where:
k is the sequence length, given in bits (that is, the length of the FPGA configuration bitstream whose signature is to be generated);
r is the LFSR length (that is, the polynomial degree, or the number of flip-flops required for its implementation).

Two FPGAs are necessary to implement this approach. Fig. 5 shows a block diagram based on the CCM node of Fig. 2.b. The LFSR/PSG process is located on FPGA B, and the processes responsible for starting the readback and refresh operations are located on FPGA A.

Fig. 5. The LFSR/PSG approach: the LFSR/PSG process on FPGA B, clocked at 15 Hz, triggers the readback process on FPGA A, and drives the PRG pins to start the refresh.

Fig. 6 shows the VHDL skeleton of the LFSR/PSG process. The base clock used by the LFSR part of this strategy is the same internal 15 Hz clock used in the clock/counter strategy, and is represented in VHDL by the CLK_15HZ_IN signal. In order to define when the readback starts, the LFSR process, clocked by the 15 Hz signal (Fig. 5), first loads the input signal A_TO_H with '0'. As a consequence, the implicit loop, forced by the INT_CLK signal, works as a simple primitive LFSR (lines 39 to 47), generating all 2^n - 1 pseudo-random patterns (n = 8).

    1:  entity LFSR is
    2:    port (
    3:      CLK_IN       : in  std_logic;
    4:      RESET_NEG_IN : in  std_logic;
    5:      CLK_15HZ_IN  : in  std_logic;
    6:      SR_IN        : in  std_logic_vector(7 downto 0);
    7:      DO_PSG_IN    : in  std_logic;
    8:      Q_OUT        : out std_logic_vector(7 downto 0)
    9:    );
    10: end LFSR;
    11: architecture LFSR_BEH of LFSR is
    12:   signal INT_CLK, WASH : std_logic;
    13:   signal D, AUX_Q, A_TO_H : std_logic_vector(7 downto 0);
    14: begin
    15:   INT_CLK <= CLK_15HZ_IN when WASH = '0'
    16:              else CLK_IN;
    17:
    18:   A_TO_H <= (others => '0') when WASH = '0' else SR_IN;
    19:   Q_OUT  <= D when DO_PSG_IN = '0' else AUX_Q;
    20:
    21:   LFSR_PSG_PRO: process (INT_CLK, RESET_NEG_IN)
    22:   begin
    23:     if RESET_NEG_IN = '0' then
    24:       WASH <= '0';
    25:       D <= (others => '0');
    26:       AUX_Q <= "10010101"; -- seed
    27:     elsif INT_CLK'event and INT_CLK = '1' then
    28:       if WASH = '0' then
    29:         if AUX_Q = "10010101" then
    30:           WASH <= '1'; -- WASH mode (PSG)
    31:         else
    32:           WASH <= '0'; -- NORMAL mode (LFSR)
    33:           AUX_Q <= "10010101"; -- seed
    34:         end if;
    35:       else
    36:         -- talk to READBACK_PRO waiting for the
    37:         -- end of readback, then WASH <= '0'
    38:       end if;
    39:       D(0) <= (AUX_Q(6) xor AUX_Q(7)) xor A_TO_H(0);
    40:       D(1) <= AUX_Q(0) xor A_TO_H(1);
    41:       D(2) <= AUX_Q(1) xor A_TO_H(2);
    42:       D(3) <= AUX_Q(2) xor A_TO_H(3);
    43:       D(4) <= AUX_Q(3) xor A_TO_H(4);
    44:       D(5) <= AUX_Q(4) xor A_TO_H(5);
    45:       D(6) <= AUX_Q(5) xor A_TO_H(6);
    46:       D(7) <= AUX_Q(6) xor A_TO_H(7);
    47:       AUX_Q <= D + AUX_Q;
    48:     end if;
    49:   end process LFSR_PSG_PRO;
    50: end LFSR_BEH;

Fig. 6. Partial VHDL code for the LFSR/PSG process.

For this example, the string "10010101" was chosen as the seed. When the output of the LFSR (signal Q_OUT) has a pattern matching the seed (line 29), the operation mode is changed to readback (signal WASH goes to '1'). At this moment the input clock is switched from the internal 15 Hz clock to the system clock, used by the readback, and the signal START_READBACK is sent to FPGA A (Fig. 5).

After every 8 cycles of the system clock, the signal A_TO_H is loaded with the contents of the 8-bit shift register SR (Fig. 5; Fig. 6, line 18), which holds the last 8 bits received from FPGA A. This shift register is controlled by the process READBACK_PRO (not listed), which is also responsible for filtering out the "unusable data", "RAM bits" and "capture bits", which are not used for signature generation [34]. The reason is that these bits change dynamically during FPGA utilisation and are not suitable for comparison with the "gold" signature. The "gold" signature is generated on the ground, using the same PSG method, from the original bitstream used to configure FPGA A, and is stored on board, in FPGA B. When the readback is concluded, the LFSR/PSG process in FPGA B compares the calculated signature with the pre-defined one stored on-chip. If the test fails, a "start refresh" signal is sent to FPGA A (Fig. 5) in order to "clean" possible SEUs. FPGA B refreshes itself after the end of all FPGA A readback/refresh executions.

4.4. Signature Analysis With Continuous Readback Execution

In the LFSR/PSG strategy, the test for SEU occurrences is executed periodically. The LFSR is used to start the readback operation and to compact the configuration bitstream time after time. Another option is to execute the readback continuously, as it does not affect normal FPGA operation. This method is less expensive in terms of hardware, as part of the LFSR/PSG process is not necessary. As the readback is executed continuously, there is no need to send a signal to the "start readback" process on FPGA A. In this case, the internal 15 Hz clock and the circuit used for clock signal switching are not used. Alternatively, the 15 Hz clock could be used in a different process to control the FPGA B self-refreshing activity. This strategy not only saves space on FPGA B, but also allows the integrity of FPGA A to be verified more frequently.
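The parallel signature compaction of Sections 4.3 and 4.4 can be mirrored by a short software model (our sketch, with the feedback taps taken from the skeleton of Fig. 6, where stage 0 folds in bits 6 and 7; treat the exact polynomial as illustrative): the "gold" signature computed on the ground must be reproduced on board by compacting the filtered readback.

```python
# Software model of the LFSR/PSG (signature register) check. An 8-bit state is
# shifted once per input vector, folding the input bits in with XOR gates, as
# in the Fig. 6 skeleton. A single flipped readback bit (an SEU) changes the
# final signature, triggering a refresh. Taps and test data are illustrative.

SEED = [1, 0, 0, 1, 0, 1, 0, 1]  # "10010101", the seed used in the paper

def psg_signature(vectors, seed=SEED):
    """Compact a sequence of 8-bit vectors (lists of 0/1) into a signature."""
    q = list(seed)
    for vec in vectors:
        d = [0] * 8
        d[0] = q[6] ^ q[7] ^ vec[0]      # feedback taps folded into stage 0
        for i in range(1, 8):
            d[i] = q[i - 1] ^ vec[i]     # shift with input folding
        q = d
    return q

bitstream = [[(i * j) % 2 for j in range(8)] for i in range(16)]
gold = psg_signature(bitstream)          # computed "on the ground"
print(psg_signature(bitstream) == gold)  # True: signatures match, no refresh
corrupted = [row[:] for row in bitstream]
corrupted[5][3] ^= 1                     # single bit flip, as caused by an SEU
print(psg_signature(corrupted) == gold)  # False: mismatch, start refresh
```

Because the compaction is linear, any single-bit error is guaranteed to change the signature; only multi-bit errors can alias, with the probability given by the formula above.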
5. Masking Connectivity Faults

The SEU prevention strategies described in the last Section are very efficient for fault prevention in the processing modules, as operation units are implemented using the FPGA SRAM-based look-up tables (LUTs). Control units of processing modules are partially implemented using flip-flops. They are one of the points not covered by this work, since well-known fault-tolerant strategies to improve control (and data) flow reliability can be found in the literature [6].

Reliability improvement in the processing modules is worthless if the input data correctness is not guaranteed. The proposed strategy is shown in the block diagram of Fig. 7. In this scheme a majority voter receives the same data from three different FPGA input pins; if at least two of them are equal, the data is sent to the application, otherwise an error signal is set, invalidating the data. The block diagram was partially generated by Synplify [35] from a VHDL source code.

Fig. 7. Using replicated inputs and a voter to mask connectivity faults.

The strategy is used to mask faults in the external FPGA pins and in the internal FPGA routing resources. It is assumed that the same sensor output is connected to three different FPGA pins, sending the same data to the voter. Using three different sensors, which would characterise a triple modular redundant (TMR) implementation [6], may be possible, but it depends on the data being collected. In most cases, different sensors send different data to the voter and, even if the data is correct, it may result in a wrong interpretation by the voter. This happens because different sensors can detect different physical phenomena at the same instant of time.

For this fault masking strategy to be efficient, the three input signals for the voters must preferably be located on three distant pins. For instance, in the block diagram of Fig. 7, the input IP1_1_IN may be located on pin 40, whilst the input IP1_2_IN is located on pin 80. The pin locations are chosen by the designer, using a constraints file, before the placement and routing (PAR) execution [36]. The netlist generated by a synthesis tool from a VHDL source code has no pin location or routing information, and this netlist is used as input to the PAR tool. In some cases it may be necessary to edit the CB generated by the PAR tool and manually change the position of the components of a voter, in order to move them closer to the input pins. The pin locations and the delays of individual routes can be specified in a constraints file. However, the PAR tool may place a voter very close to one pin but very distant from another, with both time delays meeting the constraints file, yet very different from each other. A solution to avoid the need for manual intervention is to define very short delays in the constraints file. The problem with this solution is that, depending on the design complexity and the size of the FPGA chosen, the specified constraints may not be achievable. This strategy for connectivity fault masking can be employed in the CCM node shown in Fig. 2.b, because the voters are implemented in the same FPGA along with the application, as shown in Fig. 7. The strategy masks permanent, transient and intermittent faults efficiently.

6. Numerical Analysis of the CCM Node in Two Modes of Operation

The CCM node shown in Fig. 2.b, implementing the LFSR/PSG strategy described in Sections 4.3 and 4.4, has been analysed numerically using reliability evaluation techniques [6]. Two situations are considered. In the first, the three flash memories hold three different configuration bitstreams (CBs). This scenario represents a real reconfigurable computing system, because the FPGA functionality can be altered, on the fly, according to the application requirements. From the fault-tolerance point of view it is not a good approach because, in case of an SEU occurrence in one of the flash memories, the respective application has to stop and wait for a good CB to be uploaded from the ground station. The reliability of this situation is given by Equation 1:

    R1(t) = 1 - (1 - Rflash(t)) = Rflash(t)    (Equation 1)

Since the reliabilities of all components other than the flash memories are treated as constant, they have not been included in the numerical analysis. The reliabilities of the flash memories are not constant, because their contents may be changed when a new CB is uploaded.

In the second situation, the three flash memories hold the same CB, which characterises a TMR system. The vote is executed implicitly by FPGA B, using the signature analysis method described in Sections 4.3 and 4.4. As this test strategy is not capable of fault location, in case of a fault detection it is not possible to identify whether the problem was in the flash memory or in the FPGA. In either case, FPGA A is reconfigured with a CB from another flash memory. If the error persists, the diagnosis is a permanent fault in FPGA A, and the module has to be bypassed. On the other hand, if no error is detected with the new CB, the respective flash memory is considered faulty, and it is refreshed in order to try to clear any SEU occurrences. The reliability of this situation is given by Equation 2.

At the end of 30 hours of operation the reliabilities of the two cases remain almost the same, and they stay within an acceptable range until the end of 100 hours of operation. However, the reliability difference between the two architectures becomes more distinct as time progresses. For example, at the end of 3000 hours of operation (125 days), the reliability of the redundant case (R2) is 1.3 times better than that of the non-redundant one (R1).
7. Expected Performance

The main motivation for using FPGAs instead of microprocessors for on-board computer implementation is the gain in performance together with a decrease in PCB area usage. In order to assess the feasibility of using FPGAs in this class of application, a case study was developed. The case study is the FPGA implementation of an on-board instrumentation module of a NASA sounding rocket [37]. This rocket flew from Spitzbergen, Norway, in the winter of 1997/1998, carrying a scientific application computer designed at the Space Science Centre, University of Sussex [38]. The real-time science measurement performed by this embedded system is an auto-correlation function (ACF) [39] processing of particle count pulses, as a means of studying processes occurring in near-Earth plasmas. The original module as flown consisted of a board with two DS87C520 microcontrollers (8051 family), FIFOs, state machines, and software written in assembly language. Although this ACF implementation is a very specific application, the system as a whole, including the hardware and the software parts, is not too different from conventional embedded systems based on microprocessors or microcontrollers. This case study is a typical memory transfer application, with a high input sampling rate and few processing modules. The most demanding actions for the processing blocks are the ones with multiply-and-accumulate operations (MACs), typical of DSP applications. The test strategies proposed in this paper are designed to execute in parallel with the user application and with the fault-tolerance strategies. There are no performance penalties, because there is no need, for instance, to time-share tasks. Table 1 lists the number of cycles necessary to run the processes written in assembly language for the microcontrollers, and the equivalent ones written in VHDL for the FPGA.
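The ACF computation at the heart of the case study reduces to repeated multiply-and-accumulate passes over the count series; a minimal software sketch (our formulation, not the flown assembly or VHDL) makes the structure explicit:

```python
# Minimal sketch of an auto-correlation function (ACF) over particle count
# pulses: for each lag, accumulate products of the series with a shifted copy
# of itself. Each inner sum is a chain of MAC operations, which is why the
# FPGA implementation gains so much over the 8051-class microcontrollers.

def acf(counts, max_lag):
    """Unnormalised auto-correlation for lags 0..max_lag-1."""
    n = len(counts)
    return [sum(counts[i] * counts[i + lag] for i in range(n - lag))
            for lag in range(max_lag)]

counts = [2, 0, 1, 3, 0, 2, 1, 0]  # illustrative count samples
print(acf(counts, 3))              # [19, 5, 8]: lag 0 is the total power
```

Peaks at non-zero lags reveal periodicity in the pulse train, which is the plasma-physics quantity of interest.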
    R2(t) = 1 - ((1 - Rflash1(t)) * (1 - Rflash2(t)) * (1 - Rflash3(t)))    (Equation 2)

To demonstrate the reliability improvement when using replicated CBs (R2), a hypothetical situation is considered in which the failure rate (λ) is identical for each flash memory. For this study a failure rate of 0.0001/hour was chosen, to allow the generation of quantitative information for comparison purposes. Considering that R(t) = e^(-λt) and λflash1 = λflash2 = λflash3 = λflash, Equations 1 and 2 can be rewritten as:

    R1(t) = e^(-λt)    (Equation 3)

    R2(t) = e^(-3λt) - 3e^(-2λt) + 3e^(-λt)    (Equation 4)

Fig. 8. The reliability responses for the two situations: non-redundant (R1) and redundant (R2). Reliability (0.7 to 1) is plotted against time (1 to 3000 hours). The graph was plotted from Equations 3 and 4.

Another important system feature improved by the use of configurable computing technology is the reduction in the number of hardware components. The main system repercussions are reductions in on-board area usage and in power consumption. The reduction in the number of components can be achieved by using only one FPGA device configured to execute the same functionality as the whole 8051-based system. For instance, the external FIFO chips were replaced by circular FIFOs implemented using data structures in VHDL. Considering the use of the FPGA board in an application where fault tolerance is not a requirement, as in the original case study, all of the board's electronics can be implemented in a single chip, for instance an XQ4085XL Xilinx FPGA [32-37] with two XQ1701L serial ROMs to store the bitstream. The result is a reduction from 22 to 3 chips. Supposing the use of the case study in a long-life system, fault tolerance is required, and a board like the one shown in Fig. 2.b can be used.
In this case, the reduction in the number of components may be from 22 to about 10 chips on-board, which is still a significant result, considering that the system now has fault-tolerant capabilities.

Process   Microcontroller (cycles)   FPGA (cycles)   Rate
P1        4,518                      1               4,518 times faster
P2        8..36                      1               8 to 36 times faster
P3        18..1,018                  1..68           18 to 14.97 times faster
P4        1,240                      48              25.8 times faster
P5        1,334..3,438               132..143        10.11 to 24 times faster
P6        11,116                     288             38.6 times faster

Table 1. Performance comparison for the case study.

8. Conclusions & Future Work

This paper introduced the use of a BIST technique and traditional fault-tolerance strategies, together with configurable computing technology, in order to improve the availability of on-board computers used in space applications. In Section 3 a network architecture for spacecraft instruments was proposed. Section 4 presented test and fault-tolerance strategies to detect and fix/tolerate SEU occurrences. In Section 5, a technique to mask connectivity faults was described. Sections 6 and 7 described expected and obtained results.

The strategies described in this paper deserve a deeper investigation, in order to be used in the design of a fault-tolerant on-board instrument processing system entirely based on configurable computing. During the case study implementation, a series of problems related to the development of FPGA-based systems arose. For instance, the synthesis tools available for high-level languages (e.g., behavioural VHDL and Verilog) are still not efficient, and a VHDL developer has to follow strict rules to obtain good results [40]. An FPGA configuration bitstream generated from a high-level language is space consuming, and yields a lower-performance circuit when compared to one generated from schematic diagrams or low-level languages such as structural VHDL. Another concern is the time necessary for Electronic Design Automation (EDA) tools to generate CBs. In time-critical systems, such as space applications, effective development facilities are important because of the short time available for making remedial changes to a faulty application. In the past, several missions were saved as a result of rapid problem identification, followed by the development of a solution, ground tests, and the timely transmission of the new software to the spacecraft computer. In addition to the selection of efficient EDA tools, another investigation to be carried out concerns the choice of hardware description language. After selecting the language and the EDA tools, the next step will be the implementation of a prototype, in order to determine the feasibility of the test and fault-tolerance strategies proposed here.

Acknowledgement

This research is partly supported by the Brazilian Council for the Development of Science and Technology (CNPq), and the Catholic University (PUCRS), Brazil. The authors would like to thank Aldec and the Xilinx University Program for the donation of the hardware and software tools used in this research.

References

[1] Mangione-Smith, W. et al. "Seeking Solutions in Configurable Computing". IEEE Computer, pp. 38-43, Nov. 1997.
[2] Villasenor, J. and Mangione-Smith, W. "Configurable Computing". Scientific American, pp. 66-71, Jun. 1997.
[3] DeHon, A. "Reconfigurable Architectures for General-Purpose Computing". PhD Thesis, Artificial Intelligence Laboratory, MIT, USA, 368p., Oct. 1996.
[4] Villasenor, J. et al. "Configurable Computing Solutions for Automatic Target Recognition". Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, Apr. 1996, pp. 70-79.
[5] Bezerra, E.A. and Gough, M.P. "A guide to migrating from microprocessor to FPGA coping with the support tools limitations". Microprocessors and Microsystems, to appear.
[6] Pradhan, D.K. "Fault-Tolerant Computer System Design". Prentice-Hall, 544p., 1996.
[7] Lach, J.; Mangione-Smith, W.H.; and Potkonjak, M. "Efficiently Supporting Fault-Tolerance in FPGAs".
Proceedings of the 1998 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA'98), Monterey, CA, Feb. 1998.
[8] Kocan, F. and Saab, D.G. "Dynamic Fault Diagnosis on Reconfigurable Hardware". Proceedings of the 1999 ACM/SIGDA Design Automation Conference (DAC'99), New Orleans, Louisiana, 1999.
[9] Conde, R.F. et al. "Adaptive Instrument Module - A Reconfigurable Processor for Spacecraft Applications". Proceedings of the 1999 Military and Aerospace Applications of Programmable Logic Devices Conference (MAPLD'99), The Johns Hopkins University, USA, Sep. 1999.
[10] Bezerra, E.A. et al. "Improving the Dependability of Embedded Systems Using Configurable Computing Technology". Proceedings of the XIV International Symposium on Computer and Information Sciences (ISCIS'99), Izmir, Turkey, Oct. 1999, pp. 49-56.
[11] Moreno, J.M. et al. "Feasible Evolutionary and Self-Repairing Hardware by Means of the Dynamic Reconfiguration Capabilities of the FIPSOC Devices". In Lecture Notes in Computer Science, v. 1478: Sipper, M. et al. (Eds.), Evolvable Systems: From Biology to Hardware, 1998, pp. 345-355.
[12] Vargas, F. et al. "Optimizing HW/SW Co-design Towards Reliability for Critical-Application Systems". 4th International IEEE On-Line Testing Workshop, Capri, Italy, 6-8 July 1998, pp. 17-22.
[13] Vargas, F. et al. "Reliability Verification of Fault-Tolerant Systems Design Based on Mutation Analysis". Proceedings of the SBCCI'98 Brazilian Symposium on Integrated Circuit Design, Buzios, Rio de Janeiro, Brazil, 1998.
[14] Borrione, D. (Ed.). "From HDL Descriptions to Guaranteed Correct Circuit Designs". Proceedings of the IFIP WG 10.2, 1987.
[15] Carmichael, C. et al. "SEU Mitigation Techniques for Virtex FPGAs in Space Applications". Xilinx Aerospace and Defense Products, Internal Report, 1999. http://www.xilinx.com/appnotes/VtxSEU.pdf
[16] Fukunaga, A. et al. "Evolvable Hardware for Spacecraft Autonomy". NASA JPL Technical Reports, Snowmass, Colorado, USA, Mar. 1998. http://techreports.jpl.nasa.gov/
[17] Culbertson, W.B. et al. "Defect Tolerance on the Teramac Custom Computer". Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'97), Napa Valley, California, 1997, pp. 116-123.
[18] Stroud, C. et al. "BIST-Based Diagnostics of FPGA Logic Blocks". Proceedings of the International Test Conference (ITC'97), Washington, DC, Nov. 1997, pp. 539-547.
[19] Huang, K. and Lombardi, F. "An Approach to Testing Programmable/Configurable Field Programmable Gate Arrays". Proceedings of the 1996 IEEE VLSI Test Symposium, 1996, pp. 450-455.
[20] Thompson, A. "Evolving Fault Tolerant Systems". Proceedings of the 1st IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA'95), Sheffield, Sep. 1995.
[21] Vargas, F. and Amory, A. "Design of SEU-Tolerant Processors for Radiation-Exposed Systems". 4th IEEE International High Level Design Validation and Test Workshop (HLDVT'99), San Diego, CA, USA, Nov. 4-6, 1999.
[22] Ansell, G.P. and Tirado, J.S. "CMOS in Radiation Environments". VLSI Systems Design, Sep. 1986.
[23] ESA. "Packet Telemetry Standard, Issue 1". ESA PSS-04-106, European Space Agency, Jan. 1988.
[24] ESA. "Packet Telecommand Standard, Issue 2". ESA PSS-04-107, European Space Agency, Apr. 1992.
[25] ESA. "Packet Utilisation Standard, Issue 1". ESA PSS-07-0, European Space Agency, Mar. 1992.
[26] ESA. "Cluster: mission, payload and supporting activities". ESA SP-1159, European Space Agency, Mar. 1993.
[27] Mattias, O. et al. "Neutron Single Event Upsets in SRAM-Based FPGAs". Xilinx High Reliability Products, Internal Report, 4p., 1999. <www.xilinx.com/products/hirel_qml.htm>
[28] Alfke, P. and Padovani, R. "Radiation Tolerance of High-Density FPGAs". Xilinx High Reliability Products, Internal Report, 4p., 1999. <www.xilinx.com/products/hirel_qml.htm>
[29] Lum, G. and Vandenboom, G. "Single Event Effects Testing of Xilinx FPGAs". Xilinx High Reliability Products, Internal Report, 5p., 1999. <www.xilinx.com/products/hirel_qml.htm>
[30] Normand, E. "Single Event Upset at Ground Level". IEEE Transactions on Nuclear Science, vol. 43, pp. 2742-2750, 1996.
[31] Olsen, J. et al. "Neutron-Induced Single Event Upset in Static RAMs Observed at 10 km Flight Altitude". IEEE Transactions on Nuclear Science, vol. 40, pp. 74-77, 1993.
[32] Xilinx. "The Programmable Logic Data Book". San Jose, 1999.
[33] Peterson, W.W. and Weldon, E.J. "Error-Correcting Codes". MIT Press, Cambridge, MA, 1972.
[34] Xilinx. "Virtex Configuration and Readback". Xilinx Application Note 138 (XAPP 138), San Jose, Mar. 1999, 25pp.
[35] Synplicity. "Synplify Better Synthesis - User Guide, release 5.0". Synplicity, 1998.
[36] Xilinx. "Synthesis and Simulation Design Guide". Xilinx, 314p., 1998.
[37] Bezerra, E.A. et al. "A VHDL implementation of an on-board ACF application targeting FPGAs". Proceedings of the 1999 Military and Aerospace Applications of Programmable Logic Devices Conference (MAPLD'99), The Johns Hopkins University, USA, Sep. 1999.
[38] Gough, M.P. "Particle Correlator Instruments in Space: Performance Limitations, Successes, and the Future". American Geophysical Union, Santa Fe Chapman Conference, 1995.
[39] Beauchamp, K. and Yuen, C. "Digital Methods for Signal Analysis". George Allen & Unwin, 316p., 1979.
[40] IEEE. "Draft Standard For VHDL Register Transfer Level Synthesis". IEEE, 1998.