MERGING BIST AND CONFIGURABLE COMPUTING TECHNOLOGY TO IMPROVE AVAILABILITY IN SPACE APPLICATIONS*

Eduardo Bezerra 1, Fabian Vargas 2, Michael Paul Gough 3

1,3 Space Science Centre, School of Engineering, University of Sussex, Brighton, BN1 9QT, England
E.A.Bezerra@sussex.ac.uk, M.P.Gough@sussex.ac.uk

1,2 Catholic University - PUCRS, 90619-900 Porto Alegre - Brazil
eduardob@inf.pucrs.br, vargas@computer.org
Abstract

The use of configurable computers in space applications depends not only on the commercial availability of highly reliable military devices, but also on the definition of strategies for fault tolerance and on-board testing. This paper introduces techniques targeting mainly the problems related to Single Event Upsets (SEUs) in on-board electronics. Important subjects such as performance and dependability figures are also discussed.

Keywords: Fault Tolerance, Configurable Computing, BIST, FPGA, VHDL, Space Applications.
1. Introduction

Power consumption, processing speed, area usage and weight are important concerns for designers of computers for space applications. Because of the application-specific nature of this kind of system, requirements can vary significantly from application to application, which results in a completely new design for almost every new application. As a consequence, computer systems for space applications are very expensive.

A possible solution to decrease design costs is to implement the processing elements using configurable devices [1-4]. This technology is appropriate for the implementation of application-dependent solutions. It allows designers to have different hardware configurations, adequate for every new application, without the need for changes in the board layout. The drawback of this solution is the difficulty associated with "software" development for this kind of hardware [5]. For instance, for systems that require complex data structures, the best solution may still be to use conventional microprocessor-based boards.

In the case of computers for critical applications, or computers used in situations where maintenance is not possible or is very expensive, designers have additional concerns related to dependability features such as availability, reliability and testability [6]. In the past few years, strategies to improve the dependability of configurable computer systems have been proposed and implemented [7-21]. These strategies are mainly based on the traditional ones used in microprocessor-based systems.

Following a similar approach, this paper proposes strategies to improve the availability and reliability of configurable computing systems. An important consideration in defining the strategies is that configurable computing opens up new possibilities, for instance, the combination of software and hardware fault-tolerance strategies at the same level of abstraction. The ideas discussed here can be used not only in space applications, but also in any other embedded system with similar dependability requirements. The paper is organised as follows. Section 2 reviews the Single Event Upset (SEU) mechanism. Section 3 describes the basic node of a network architecture for space applications, based on configurable computers. Section 4 introduces the strategies for availability and reliability improvement, with emphasis on a Built-In Self-Test (BIST) strategy for SEU management. Section 5 describes an approach for masking connectivity faults. Sections 6 and 7 discuss expected and obtained results, and Section 8 presents conclusions and future directions.
* This research is partly supported by the Brazilian Council for the Development of Science and Technology (CNPq), and the Catholic University (PUCRS), Brazil.
2. Single-Event Upset (SEU)

In a CMOS static memory cell, the nodes sensitive to high-energy particles are the drains of off-transistors. Thus, two sensitive nodes are present in such a structure: the drains of the p-type and the n-type off-transistors.

When a single high-energy particle (typically a heavy ion) strikes a memory cell sensitive node, it loses energy via the production of electron-hole pairs, the result being a densely ionised track in the local region of that element [22]. The charge collection process following a single particle strike is now described. Fig. 1 shows the simple example case of a particle normally incident on a reverse-biased n+p junction, that is, the drain of an n-type off-transistor.

Fig. 1. Illustration of the charge collection mechanism that causes single-event upset: (a) particle strike and charge generation (the ion track through an n-FET, showing drift, funneling and electron-current diffusion in the p substrate); (b) current pulse shape generated in the n+p junction during the collection of the charge (a prompt drift-plus-funneling component followed by a delayed diffusion component, over roughly 0.2 ns to 100 ns).

Charge collection occurs by three processes, which begin immediately after creation of the ionised track: drift in the equilibrium depletion region, diffusion and funneling. A high electric field is present in the equilibrium depletion region, so carriers generated in that region are swept out rapidly; this process is called drift. Carriers generated beyond the equilibrium depletion region width, more specifically the charge generated beyond the influence of the excess-carrier concentration gradients, can be collected by diffusion. The third process, charge funneling, also plays an important role in the collection. Charge funneling involves a spreading of the field lines into the device substrate beyond the equilibrium depletion width; the charge generated by the incident particle over the funnel region is then collected rapidly. If the charge Qd collected during the three processes described above is large enough, greater than the critical charge Qc of the memory cell, then the memory cell flips, inverting its logic state (the critical charge Qc of a memory cell is the greatest charge that can be deposited in the cell before it is corrupted, that is, before its logic state is inverted).

Fig. 1b shows the resulting current pulse shape that is expected to occur due to the three charge collection processes described above. This current pulse is generated between the reverse-biased n+ (resp. p+) drain depletion region and the p-substrate (resp. n-well), for the case of an n-well technology, for instance.

3. System Description Overview

The block diagram in Fig. 2.a shows the proposed architecture for an on-board processing system, which may be used for both scientific and commercial space applications. The two main improvements of this architecture over those used in most space applications are the use of configurable devices as the main processing elements, and the use of a network to connect the different modules. As the main objective of this paper is to discuss and propose test strategies for the processing elements, we describe a basic network node in this Section.

The block diagram of the network node shown in Fig. 2.b has two external communication channels, one for connection to the network and the other for interfacing to the application. The first has a fixed number of signals, as the protocol for inter-node communication is pre-defined. The second is defined according to the application requirements, and is a good example of the flexibility introduced by configurable computing technology.

The main responsibilities of the processing element, FPGA A in Fig. 2.b, are to construct telemetry packets according to the European Space Agency (ESA) standards [23-25], and to implement the protocol for communication on the on-board network system. In some applications it may be possible to transfer some or all user processing activities to FPGA A. An example is when the on-board data-handling (OBDH) system and one (or more) of the applications are designed by the same group [26]. In this case the "User" block shown in Fig. 2.a may be very simple, for instance consisting only of analog devices, sensors and converters, as FPGA A may be used to execute all the processing.
Fig. 2. Block diagram of the proposed system. (a) Network architecture: user modules (User 1 ... User n) connect through Configurable Computer Modules (CCM 1 ... CCM n) to the on-board network bus, together with a shared RAM and a CCM (TC/TM) that talks to the ground station; the legend marks (1) the protocol for on-board communication, (2) the ESA standard protocol and (3) the configurable interface. (b) Basic CCM node: FPGA A (processing element) with an optional RAM, FPGA B (control, configuration manager and readback), flash memory (configuration bitstream) and a serial PROM (emergency recovery bitstream).

FPGA B is responsible for the management of the reconfiguration and test of FPGA A. These activities are detailed next in Section 4.

Other components of the Configurable Computer Module (CCM) node are the RAM, PROM and flash memories. A RAM module may be attached to FPGA A, to be used by some applications as a scratch area. The emergency recovery PROM holds a configuration bitstream (CB) for FPGA B, used every time the system is initialised, which happens mainly when the flash memory upsets. The CB is first used to boot FPGA B, which is then responsible for booting FPGA A with a basic functionality, enough to talk to the network bus and re-load the flash memories. The flash memories hold the CB for FPGA A. They may be changed from the ground in case of system upgrades or bug fixes. FPGA B works as a configuration manager, allowing FPGA A to be initialised from the flash memories as if they were serial ROMs.

As the systems based on this architecture are conceived for long-life missions, all the electronic components must be military devices. However, even when employing highly reliable devices, because of the hostile environment found in space and the long life expected for the system, additional fault-tolerance strategies are used in the design. In order to define the strategies, a very simple, but efficient, fault model was chosen. The faults considered in this fault model are stuck-at and connectivity faults [6]. Stuck-at faults are good representatives of the bit errors that can occur in SRAM-based devices such as FPGAs, which are sensitive to Single Event Upsets (SEUs) caused by atmospheric high-energy particles [27-31]. The other modelled fault, connectivity, is responsible for more than 90% of the problems in a board and, as described later, special strategies such as bus replication and voters are used to tolerate it. The strategies used to prevent and, when that is not possible, to tolerate the faults belonging to the adopted fault model are described next in Section 4.

4. SEU Prevention Strategies

4.1. Refresh Operation in a Triple Modular Redundancy (TMR) FPGA System

In [27,28] there is a study showing the low SEU susceptibility of Xilinx FPGAs. However, for some critical applications, where human life is at a premium or where the whole embarked electronics depends on two or three core components, even this low SEU susceptibility must be further improved.

In [28] a strategy to reduce the effects of SEUs in FPGA systems was proposed. Basically, as shown in Fig. 3, three FPGAs are configured with the same bitstream (triple redundancy) and operate in synchronism. A controller reads the three FPGA bitstreams, bit after bit, and if there are no differences, correct functioning with no SEU occurrence is assumed. This procedure is executed continuously, with no interference in the normal FPGA operation. Such a scheme is possible because of the FPGA's readback feature.

Fig. 3. A TMR FPGA system: a voter compares the readback bitstreams (user registers, user logic and routing) of three identically configured FPGAs, each loaded from the same serial EPROM configuration bitstream, and raises an error signal on any mismatch.

If one input of the controller is different from the others, it is assumed that an SEU has occurred, and a reconfiguration of the faulty FPGA is executed. In [27] it was shown that a simple refresh operation, in this case by means of reconfiguration, is enough to recover the device from an SEU. The main problem with the refresh recovery is the total loss of measurement data within the instrument system. Another problem is the time necessary for reconfiguration; depending on the application size, it is therefore recommended to divide the system into small blocks using several small FPGAs, because a small FPGA can be configured in just a fraction of a second (e.g. 195 ms for a Xilinx XQ4085XL). The block size has to be calculated according to the application time requirements.

Fig. 4. Using a counter to start the refresh operation: an internal 15 Hz clock drives a counter process which, on wrap-around, asserts the FPGA PRG pin to trigger reconfiguration of the application processes. The core of the counter process is:

counter <= counter + 1;
if counter = 0 then
  PROG <= '0';  -- reset
else
  PROG <= '1';
end if;
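The bit-after-bit comparison performed by the TMR controller of Fig. 3 can be sketched in software. The following Python model is our own illustration, not the paper's controller: it votes on three readback bitstreams and reports which FPGA disagrees and is therefore a candidate for reconfiguration.

```python
def vote_readback(b1, b2, b3):
    """Bit-wise 2-of-3 vote over three readback bitstreams.

    Returns (voted_bits, disagreeing_fpgas): any FPGA whose bit differs
    from the majority is a candidate for refresh by reconfiguration.
    """
    voted = []
    disagreeing = set()
    for bits in zip(b1, b2, b3):
        majority = 1 if sum(bits) >= 2 else 0
        voted.append(majority)
        for fpga, bit in enumerate(bits):
            if bit != majority:
                disagreeing.add(fpga)
    return voted, sorted(disagreeing)

# An SEU flips one configuration bit of the second FPGA (index 1):
golden = [1, 0, 1, 1, 0, 1, 0, 0]
upset = golden.copy()
upset[3] ^= 1
voted, faulty = vote_readback(golden, upset, golden)
```

As long as only one FPGA is upset, the voted bitstream equals the golden one, and the disagreeing index tells the controller which device to refresh.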
In the next three subsections, methods for SEU prevention are proposed. These new methods are based on the refresh execution, but without FPGA replication.
4.2. Periodic Refresh Without FPGA Replication

As the target system is a long-life application, periods of downtime can be considered in its design, and it can thus be interrupted and completely reinitialised after some time running, with no major problems. To implement this strategy it is necessary, as shown in Fig. 4, to have an internal FPGA 15 Hz clock generator and a 19-bit counter, both written in VHDL. On each rising edge generated by the 15 Hz clock, the counter is incremented. Every time the counter reaches zero, which happens about every 19.4 hours, the refresh operation is executed. Refresh is achieved by the counter process asserting the FPGA PROG pin, which leads to the FPGA being reconfigured, preventing SEU occurrences from affecting the system functioning. It is important to notice that in this strategy there is no test execution and, consequently, no SEU detection. The refresh is executed periodically, even if there is no SEU occurrence. In terms of hardware resources, this strategy is less expensive than the TMR one but, depending on the application, it can be very expensive in terms of system availability.
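The refresh interval of the clock/counter scheme is simply the counter wrap-around time, 2^n / f. A quick arithmetic check (ours, not from the paper): with the 15 Hz clock, the quoted interval of about 19.4 hours corresponds to a counter that wraps after 2^20 ticks, whereas a counter wrapping after 2^19 ticks would refresh roughly every 9.7 hours.

```python
def refresh_period_hours(n_bits, clock_hz):
    """Hours between wrap-arounds of an n_bits-wide free-running counter."""
    return (2 ** n_bits) / clock_hz / 3600.0

print(round(refresh_period_hours(20, 15), 1))  # → 19.4
print(round(refresh_period_hours(19, 15), 1))  # → 9.7
```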
4.3. Signature Analysis-Driven Refresh Without FPGA Replication

Another option for SEU prevention is the use of a signature analysis method [6] to identify when a refresh operation is necessary. In terms of hardware, this strategy may be slightly more expensive than the clock/counter one, but it is more efficient for applications where periods of downtime and loss of data are not allowable. The method proposed here uses the LFSR/PSG (Linear Feedback Shift Register/Parallel Signature Generator) approach for signature generation and analysis [33].

A Linear Feedback Shift Register (LFSR) is a shift register with combinational feedback logic around it that, when clocked, generates a sequence of pseudo-random patterns [9,33]. In our case, we consider the use of a primitive polynomial, in order to generate all the 2^n - 1 possible combinations, where n is the degree of the polynomial. A Parallel Signature Generator (PSG) is an LFSR with exclusive-or gates between the shift-register stages, implementing a generator polynomial [9] used to compact a given sequence of bits. Note that there is a probability of aliasing, that is, of masking eventual errors in the bitstream to be compacted during the signature generation process, which is inversely proportional to the length of the PSG implemented. More precisely, the probability of aliasing is given by:

P(aliasing) = (2^(k-r) - 1) / (2^k - 1)

where:
k is the sequence length, given in bits (that is, the length of the FPGA configuration bitstream whose signature is to be generated);
r is the LFSR length (that is, the polynomial degree, or the number of flip-flops required for its implementation).

Two FPGAs are necessary to implement this approach. Fig. 5 shows a block diagram based on the CCM node of Fig. 2.b. The LFSR/PSG process is located on FPGA B, and the processes responsible for starting the readback and refresh operations are located on FPGA A.
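The aliasing probability above can be evaluated directly; for realistic bitstream lengths it is essentially 2^(-r), so each extra signature flip-flop halves the chance of a masked error. A small Python check (our own illustration):

```python
from fractions import Fraction

def aliasing_probability(k, r):
    """Probability that an r-bit signature masks an error in a k-bit stream:
    (2**(k - r) - 1) / (2**k - 1), which approaches 2**-r for large k."""
    return Fraction(2 ** (k - r) - 1, 2 ** k - 1)

p = aliasing_probability(k=1024, r=8)
# p is within a hair of 2**-8 = 0.00390625
```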
Fig. 5. The LFSR/PSG approach: the LFSR/PSG and "Refresh?" processes run on FPGA B, clocked either by the internal 15 Hz clock or by the system clock; FPGA B reads FPGA A back through its readback pin, drives the PRG pins and the flash memory, and signals the "Start readback?" process on FPGA A.
Fig. 6 shows the VHDL skeleton of the LFSR/PSG process. The base clock used by the LFSR part of this strategy is the same internal 15 Hz clock used in the clock/counter strategy, and is represented in VHDL by the CLK_15HZ_IN signal.

In order to define when the readback starts, first the LFSR process, clocked by the 15 Hz signal (Fig. 5), loads the input signal A_TO_H with '0'. As a consequence, the implicit loop, forced by the INT_CLK signal, works as a simple primitive LFSR (lines 39 to 47), generating all 2^n - 1 pseudo-random patterns (n = 8).

For this example, the string "10010101" was chosen as the seed. When the output of the LFSR (signal Q_OUT) has a pattern matching the seed (line 29), the operation mode is changed to readback (signal WASH goes to '1'). At this moment the input clock is switched from the internal 15 Hz clock to the system clock used by the readback, and the signal START_READBACK is sent to FPGA A (Fig. 5).
 1: entity LFSR is
 2:   port (
 3:     CLK_IN      : in std_logic;
 4:     RESET_NEG_IN: in std_logic;
 5:     CLK_15HZ_IN : in std_logic;
 6:     SR_IN       : in std_logic_vector(7 downto 0);
 7:     DO_PSG_IN   : in std_logic;
 8:     Q_OUT       : out std_logic_vector(7 downto 0)
 9:   );
10: end LFSR;
11: architecture LFSR_BEH of LFSR is
12:   signal INT_CLK, WASH: std_logic;
13:   signal D, AUX_Q,
14:          A_TO_H: std_logic_vector(7 downto 0);
15: begin
16:   INT_CLK <= CLK_15HZ_IN when WASH = '0' else
17:              CLK_IN;
18:   A_TO_H <= (others => '0') when WASH = '0' else
19:             SR_IN;
20:   Q_OUT <= D when DO_PSG_IN = '0' else
21:            AUX_Q;
22:   LFSR_PSG_PRO: process (INT_CLK, RESET_NEG_IN)
23:   begin
24:     if RESET_NEG_IN = '0' then
25:       WASH <= '0'; D <= (others => '0');
26:       AUX_Q <= "10010101";          -- seed
27:     elsif INT_CLK'event and INT_CLK = '1' then
28:       if WASH = '0' then
29:         if AUX_Q = "10010101" then
30:           WASH <= '1';              -- WASH mode (PSG)
31:         else
32:           WASH <= '0';              -- NORMAL mode (LFSR)
33:           AUX_Q <= "10010101";      -- seed
34:         end if;
35:       else
36:         -- talk to READBACK_PRO waiting for the
37:         -- end of readback, then WASH <= '0'
38:       end if;
39:       D(0) <= (AUX_Q(6) xor AUX_Q(7)) xor A_TO_H(0);
40:       D(1) <= AUX_Q(0) xor A_TO_H(1);
41:       D(2) <= AUX_Q(1) xor A_TO_H(2);
42:       D(3) <= AUX_Q(2) xor A_TO_H(3);
43:       D(4) <= AUX_Q(3) xor A_TO_H(4);
44:       D(5) <= AUX_Q(4) xor A_TO_H(5);
45:       D(6) <= AUX_Q(5) xor A_TO_H(6);
46:       D(7) <= AUX_Q(6) xor A_TO_H(7);
47:       AUX_Q <= D + AUX_Q;
48:     end if;
49:   end process LFSR_PSG_PRO;
50: end LFSR_BEH;

Fig. 6. Partial VHDL code for the LFSR/PSG process.

After every 8 cycles of the system clock, the signal A_TO_H is loaded with the contents of the 8-bit shift register SR (Fig. 5; Fig. 6, line 18), which holds the last 8 bits received from FPGA A. This shift register is controlled by the process READBACK_PRO (not listed), which is also responsible for filtering out the "unusable data", "RAM" and "capture" bits, which are not used for signature generation [34]. This is because these bits change dynamically during FPGA utilisation and are not suitable for comparison with the "gold" signature. The "gold" signature is generated on the ground, using the same PSG method, from the original bitstream used to configure FPGA A, and is stored on-board, in FPGA B.

When the readback is concluded, the LFSR/PSG process in FPGA B compares the calculated signature to the pre-defined one stored on-chip. If the test fails, then a "start refresh" signal is sent to FPGA A (Fig. 5) in order to "clean" possible SEUs. FPGA B refreshes itself after the end of all FPGA A readback/refresh executions.
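For ground-side generation of the "gold" signature, the update rule of Fig. 6 (the XOR network of lines 39-46 followed by the accumulating addition of line 47) can be mirrored in software. The sketch below is our own Python model, not code from the paper; for simplicity it treats the D network and the accumulation as a single-cycle update (in the VHDL, D is a registered signal, so the real hardware pipelines these by one clock). It uses the same seed, "10010101".

```python
def psg_step(q, byte_in):
    """One step of the Fig. 6 compactor: D = shifted/XORed state ^ input,
    then AUX_Q <= D + AUX_Q (modulo 256, as for an 8-bit register)."""
    qb = [(q >> i) & 1 for i in range(8)]
    ib = [(byte_in >> i) & 1 for i in range(8)]
    d = [(qb[6] ^ qb[7]) ^ ib[0]] + [qb[i - 1] ^ ib[i] for i in range(1, 8)]
    d_val = sum(bit << i for i, bit in enumerate(d))
    return (d_val + q) & 0xFF

def signature(byte_stream, seed=0b10010101):
    q = seed
    for byte in byte_stream:
        q = psg_step(q, byte)
    return q

# A single flipped bit in the readback stream changes the signature:
good = signature([0x12, 0x34, 0x56])
upset = signature([0x12, 0x34 ^ 0x01, 0x56])
```

Running the same model over the original bitstream on the ground and over the readback stream on-board, and comparing the two results, is exactly the test that drives the "start refresh" decision.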
4.4. Signature Analysis With Continuous Readback Execution
In the LFSR/PSG strategy, the test for SEU occurrences is executed periodically. The LFSR is used to start the readback operation and to compact the configuration bitstream time after time. Another option is to execute the readback continuously, as it does not affect normal FPGA operation. This method is less expensive in terms of hardware, as part of the LFSR/PSG process is not necessary. As the readback is executed continuously, there is no need to send a signal to the "start readback" process on FPGA A. In this case, the internal 15 Hz clock and the circuit used for clock signal switching are not used. Alternatively, the 15 Hz clock could be used in a different process, to control the FPGA B self-refreshing activity.

This strategy not only saves space on FPGA B, but also allows the integrity of FPGA A to be verified more frequently.
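The continuous variant can be pictured as a loop that keeps compacting readback data and compares each completed signature against the stored gold one; only a mismatch triggers the refresh. A minimal Python sketch, entirely our own illustration (the toy rotate-and-XOR compactor below merely stands in for the real PSG):

```python
def toy_signature(bits, seed=0b10010101):
    """Toy 8-bit rotate-and-XOR compactor standing in for the PSG."""
    sig = seed
    for bit in bits:
        sig = (((sig << 1) | (sig >> 7)) & 0xFF) ^ bit
    return sig

def monitor_pass(read_back, gold, refresh):
    """One continuous-readback pass: refresh only on a signature mismatch."""
    if toy_signature(read_back()) != gold:
        refresh()
        return True
    return False

config = [1, 0, 1, 1, 0, 0, 1, 0] * 4   # pretend configuration bitstream
gold = toy_signature(config)

upset = config.copy()
upset[13] ^= 1                          # inject a single SEU
refreshes = []
monitor_pass(lambda: upset, gold, lambda: refreshes.append("refresh"))
```

Because the pass runs back-to-back rather than on a 15 Hz schedule, an upset is caught within one full readback time rather than one LFSR period.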
5. Masking Connectivity Faults

The SEU prevention strategies described in the last Section are very efficient for fault prevention in processing modules, as operation units are implemented using the FPGA SRAM-based look-up tables (LUTs). Control units of processing modules are partially implemented using flip-flops. They are one of the points not covered by this work, since well-known fault-tolerant strategies to improve control (and data) flow reliability can be found in the literature [6].

Reliability improvement in the processing modules is worthless if the correctness of the input data is not guaranteed. The proposed strategy is shown in the block diagram of Fig. 7. In this scheme a majority voter receives the same data from three different FPGA input pins; if at least two of them are equal, the data is sent to the application, otherwise an error signal is set, invalidating the data. The block diagram was partially generated by Synplify [35] from VHDL code.

Fig. 7. Using replicated inputs and a voter to mask connectivity faults: each set of replicated input pins feeds a voter in the FPGA kernel, which passes the voted data to the application process.

The strategy is used to mask faults in the external FPGA pins, and in the internal FPGA routing resources. It is assumed that the same sensor output is connected to three different FPGA pins, sending the same data to the voter. Using three different sensors, which characterises a triple modular redundant (TMR) implementation [6], may be possible, but it will depend on the data being collected. In most cases, different sensors send different data to the voter and, even if the data is correct, it may result in a wrong interpretation by the voter. This happens because different sensors can detect different physical phenomena at the same instant of time.

For this fault masking strategy to be efficient, the three input signals for the voters must be located, preferably, on three distant pins. For instance, in the block diagram in Fig. 7, the input IP1_1_IN may be located on pin 40, whilst the input IP1_2_IN is located on pin 80. The pin locations are chosen by the designer, using a constraints file, before the placement and routing (PAR) execution [36]. The netlist generated by a synthesis tool from VHDL source code has no pin location and routing information, and this netlist is used as input to the PAR tool. In some cases it may be necessary to edit the CB generated by the PAR tool and manually change the position of the components of a voter, in order to bring them closer to the input pins. The pin locations and the delays of individual routes can be specified in a constraints file. Moreover, the PAR tool may place a voter very close to one pin but very distant from another, with both delays meeting the constraints file yet very different from each other. A solution to avoid the need for manual intervention is to define very short delays in the constraints file. The problem with this solution is that, depending on the design complexity and the size of the FPGA chosen, the constraints specified may not be achievable. This strategy for masking connectivity faults can be employed in the CCM node shown in Fig. 2.b, because the voters are implemented in the same FPGA along with the application, as shown in Fig. 7. This strategy masks permanent, transient or intermittent faults efficiently.
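The voter behaviour described in Section 5, vote when at least two replicated copies agree, invalidate otherwise, can be captured in a few lines. A Python sketch, our own illustration (in the paper the voter is VHDL inside the FPGA):

```python
def pin_voter(a, b, c):
    """Majority vote over three copies of the same input word.

    Returns (data, error): error is asserted, and the data invalidated,
    when no two of the three replicated inputs agree.
    """
    if a == b or a == c:
        return a, False
    if b == c:
        return b, False
    return None, True

# One corrupted copy is masked; three different copies raise the error:
print(pin_voter(0x5A, 0x5A, 0x12))  # → (90, False)
print(pin_voter(1, 2, 3))           # → (None, True)
```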
6. Numerical Analysis of the CCM Node in Two Modes of Operation

The CCM node shown in Fig. 2.b, implementing the LFSR/PSG strategy described in Sections 4.3 and 4.4, has been analysed numerically using reliability evaluation techniques [6]. For this analysis two situations are considered.

In the first situation, the three flash memories hold three different configuration bitstreams (CBs). This scenario represents a real reconfigurable computing system, because the FPGA functionality can be altered, on-the-fly, according to the application requirements. From the fault-tolerance point of view it is not a good approach as, in case of an SEU occurrence in one of the flash memories, the respective application has to stop and wait for a good CB to be uploaded from the ground station. The reliability of this situation can be found from Equation 1:

R1(t) = 1 - (1 - Rflash(t))     (Equation 1)
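Equations 1 and 2 can be checked numerically; the following Python snippet (our own, using the exponential flash-memory model and the 0.0001/hour failure rate quoted later in this Section) reproduces the roughly 1.3x advantage of the redundant arrangement after 3000 hours:

```python
import math

FAILURE_RATE = 0.0001          # per hour, for every flash memory

def r_flash(t):
    return math.exp(-FAILURE_RATE * t)

def r1(t):
    """Equation 1: non-redundant node, one distinct CB per flash memory."""
    return 1 - (1 - r_flash(t))

def r2(t):
    """Equation 2 with identical memories: 1 - (1 - R)^3 (TMR flash)."""
    return 1 - (1 - r_flash(t)) ** 3

print(round(r2(3000) / r1(3000), 2))  # → 1.33
```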
Since the reliabilities of all components but the flash memories are constant, they have not been included in the numerical analysis. The reliability of the flash memories is not constant, because their contents may be changed when a new CB is uploaded.

In the second situation, the three flash memories hold the same CB, which characterises a TMR system. The vote is executed, implicitly, by FPGA B, using the signature analysis method described in Sections 4.3 and 4.4. As this test strategy is not capable of fault location, in case of a fault detection it is not possible to identify whether the problem was in the flash memory or in the FPGA. In either case, FPGA A is reconfigured with a CB from another flash memory. If the error persists, then the diagnosis is a permanent fault in FPGA A, and the module has to be bypassed. On the other hand, if no error is detected with the new CB, then the respective flash memory is considered faulty, and it needs to be refreshed in order to try to clear any SEU occurrences. The reliability of this situation can be found from Equation 2.
R2(t) = 1 - ((1 - Rflash1(t)) * (1 - Rflash2(t)) * (1 - Rflash3(t)))     (Equation 2)

To demonstrate the reliability improvement obtained when using replicated CBs (R2), a hypothetical situation is considered in which the failure rate (λ) is identical for each flash memory. For this study a failure rate of 0.0001/hour was chosen, to allow the generation of quantitative figures for comparison purposes. Considering that R(t) = e^(-λt) and λflash1 = λflash2 = λflash3 = λflash, Equations 1 and 2 can be rewritten as:

R1(t) = e^(-λt)     (Equation 3)

R2(t) = e^(-3λt) - 3e^(-2λt) + 3e^(-λt)     (Equation 4)

Fig. 8. The reliability responses for the two situations: reliability (from about 0.7 to 1) plotted against time (1 to 3000 hours, logarithmic scale) for the non-redundant (R1) and redundant (R2) cases.

The graph in Fig. 8 was plotted from Equations 3 and 4. From this graph it is possible to observe that the reliabilities of the two cases remain almost the same after 30 hours of operation, and stay within an acceptable range until 100 hours of operation. However, the reliability difference between the two architectures becomes more distinctive as time progresses. For example, after 3000 hours of operation (125 days), the reliability of the redundant case (R2) is 1.3 times better than that of the non-redundant one (R1).

7. Expected Performance

The main motivation for using FPGAs instead of microprocessors for on-board computer implementation is the gain in performance together with a decrease in PCB area usage. In order to assess the feasibility of using FPGAs in this class of application, a case study was developed. The case study is the FPGA implementation of an on-board instrumentation module of a NASA sounding rocket [37]. This rocket flew from Spitzbergen, Norway, in the winter of 1997/1998, carrying a scientific application computer designed at the Space Science Centre, University of Sussex [38]. The real-time science measurement performed by this embedded system is an auto-correlation function (ACF) [39] processing of particle count pulses, as a means of studying processes occurring in near-Earth plasmas. The original module as flown consisted of a board with two DS87C520 microcontrollers (8051 family), FIFOs, state machines, and software written in assembly language. Although this ACF implementation is a very specific application, the system as a whole, including the hardware and software parts, is not too different from conventional embedded systems based on microprocessors or microcontrollers. This case study is a typical memory transfer application, with a high input sampling rate and few processing modules. The most demanding actions for the processing blocks are the ones with multiply-and-accumulate operations (MACs), typical of DSP applications.

The test strategies proposed in this paper are designed to execute in parallel with the user application and with the fault-tolerance strategies. There are no performance penalties because, for instance, there is no need to time-share tasks. Table 1 lists the number of cycles necessary to run the processes written in assembly language for the microcontrollers, and the equivalent ones written in VHDL for the FPGA.

Another important system feature improved by the use of configurable computing technology is the reduction in the number of hardware components. The main system repercussions are the reduction in on-board area usage and in power consumption. The reduction in the number of components can be achieved by using only one FPGA device configured to execute the same functionality as the whole 8051-based system. For
instance, external chip FIFOs were replaced by circular FIFOs implemented using data structures in VHDL. Considering the use of the FPGA board in an application where fault tolerance is not a requirement, as in the original case study, the whole electronics board can be implemented in a single chip, for instance an XQ4085XL Xilinx FPGA [32-37] with two XQ1701L serial ROMs to store the bitstream. The result is a reduction from 22 to 3 chips. Supposing the use of the case study in a long-life system, fault tolerance is then required, and a board such as the one shown in Fig. 2.b can be used. In this case, the reduction in the number of components may be from 22 to about 10 chips on-board, which is still a significant result, considering that the system now has fault-tolerance capabilities.
Table 1. Performance comparison for the case study.

Process | Microcontroller (cycles) | FPGA (cycles) | Rate
P1      | 4,518                    | 1             | 4,518 times faster
P2      | 8..36                    | 1             | 8 to 36 times faster
P3      | 18..1,018                | 1..68         | 18 to 14.97 times faster
P4      | 1,240                    | 48            | 25.8 times faster
P5      | 1,334..3,438             | 132..143      | 10.11 to 24 times faster
P6      | 11,116                   | 288           | 38.6 times faster
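The Rate column in Table 1 is simply the ratio of the microcontroller and FPGA cycle counts; a quick check of two rows (our own arithmetic):

```python
def speedup(mcu_cycles, fpga_cycles):
    """Speed-up of the FPGA implementation over the 8051 microcontroller."""
    return mcu_cycles / fpga_cycles

print(round(speedup(1240, 48), 1))    # P4 → 25.8
print(round(speedup(11116, 288), 1))  # P6 → 38.6
```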
Design Automation (EDA) tools to generate CBs. In time
critical systems, such as space applications, effective
development facilities are important because of the short
time available for making remedial changes to a faulty
application. In the past several missions were saved as a
result of the rapid problem identification, followed by the
development of a solution, ground tests and timely
transmission of the new software to the spacecraft
computer. In addition to the selection of efficient EDA
tools, another investigation to be done is related to the
hardware description language subject. After selecting the
language and the EDA tools, the next step will be the
implementation of a prototype, in order to determine the
feasibility of the test and fault-tolerant strategies proposed
here.
Rate
Acknowledgement
4,518 times faster
8 to 36 times faster
18 to 14.97 times faster
25.8 times faster
10.11 to 24 times faster
38.6 times faster
This research is partly supported by the Brazilian
Council for the Development of Science and Technology
(CNPq), and the Catholic University (PUCRS), Brazil. The
authors would like to thank Aldec and Xilinx University
Program for the donation of the hardware and software
tools used in this research.
References
Table 1. Performance comparison for the case study.
[1]
8. Conclusions & Future Work
[2]
This paper introduced the use of a BIST technique and
traditional fault-tolerance strategies together with
configurable computing technology, in order to improve
the availability of on-board computers used in space
applications. In Section 3 a network architecture for
spacecraft instruments was proposed. Section 4 presented
test and fault-tolerance strategies to detect and fix/tolerate
SEU occurrences. In Section 5, it was described a
technique to mask connectivity faults. Sections 6 and 7
described expected and obtained results.
The strategies described in this paper deserve a deeper
investigation, in order to be used in the design of a faulttolerant on-board instrument processing system, entirely
based on configurable computing. During the case study
implementation, a series of problems related to the
development of FPGA based systems arose. For instance,
the synthesis tools available for high-level languages
(e.g., VHDL behavioural and Verilog) are still not
efficient, and a VHDL developer has to follow strict rules
to obtain good results [40]. An FPGA configuration
bitstream generated from a high-level language is space
consuming, and represents a lower performer circuit when
compared to one generated from schematic diagrams or
low-level languages such as VHDL structural.
Another concern is the time necessary for Electronic
[1] Mangione-Smith, W. et al. "Seeking Solutions in Configurable Computing". IEEE Computer, pp. 38-43, Nov. 1997.
[2] Villasenor, J. & Mangione-Smith, W. "Configurable Computing". Scientific American, pp. 66-71, Jun. 1997.
[3] DeHon, A. "Reconfigurable Architectures for General-Purpose Computing". PhD Thesis, Artificial Intelligence Laboratory, MIT, USA, 368p., Oct. 1996.
[4] Villasenor, J. et al. "Configurable Computing Solutions for Automatic Target Recognition". Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, pp. 70-79, Napa, CA, Apr. 1996.
[5] Bezerra, E.A. and Gough, M.P. "A guide to migrating from microprocessor to FPGA coping with the support tools limitations". Microprocessors and Microsystems, to appear.
[6] Pradhan, D.K. "Fault-Tolerant Computer System Design". Prentice-Hall, 544p., 1996.
[7] Lach, J.; Mangione-Smith, W.H. and Potkonjak, M. "Efficiently Supporting Fault-Tolerance in FPGAs". Proceedings of the 1998 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA'98), Monterey, CA, Feb. 1998.
[8] Kocan, F. and Saab, D.G. "Dynamic Fault Diagnosis on Reconfigurable Hardware". Proceedings of the 1999 ACM/SIGDA Design Automation Conference (DAC'99), New Orleans, Louisiana, 1999.
[9] Conde, R.F. et al. "Adaptive Instrument Module - A Reconfigurable Processor for Spacecraft Applications". Proceedings of the 1999 Military and Aerospace Applications of Programmable Logic Devices Conference (MAPLD'99), The Johns Hopkins University, USA, Sep. 1999.
[10] Bezerra, E.A. et al. "Improving the Dependability of Embedded Systems Using Configurable Computing Technology". Proceedings of the XIV International Symposium on Computer and Information Sciences (ISCIS'99), Izmir, Turkey, Oct. 1999, pp. 49-56.
[11] Moreno, J.M. et al. "Feasible Evolutionary and Self-Repairing Hardware by Means of the Dynamic Reconfiguration Capabilities of the FIPSOC Devices". In Lecture Notes in Computer Science, v. 1478: Sipper, M. et al. (Eds.), Evolvable Systems: From Biology to Hardware, Proceedings, IX, 1998, pp. 345-355.
[12] Vargas, F. et al. "Optimizing HW/SW Codesign Towards Reliability for Critical-Application Systems". 4th International IEEE On-Line Testing Workshop, Capri, Italy, pp. 17-22, 6-8 July 1998.
[13] Vargas, F. et al. "Reliability Verification of Fault-Tolerant Systems Design Based on Mutation Analysis". Proceedings of the SBCCI'98 - Brazilian Symposium on Integrated Circuit Design, Buzios, Rio de Janeiro, Brazil, 1998.
[14] Borrione, D. (Ed.). "From HDL Descriptions to Guaranteed Correct Circuit Designs". Proceedings of the IFIP WG 10.2, 1987.
[15] Carmichael, C. et al. "SEU Mitigation Techniques for Virtex FPGAs in Space Applications". Xilinx Aerospace and Defense Products, Internal Report, 1999. http://www.xilinx.com/appnotes/VtxSEU.pdf
[16] Fukunaga, A. et al. "Evolvable Hardware for Spacecraft Autonomy". NASA JPL Technical Reports, Snowmass, Colorado, USA, Mar. 1998. http://techreports.jpl.nasa.gov/
[17] Culbertson, W.B. et al. "Defect Tolerance on the Teramac Custom Computer". Proceedings of the 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'97), Napa Valley, California, 1997, pp. 116-123.
[18] Stroud, C. et al. "BIST-Based Diagnostics of FPGA Logic Blocks". Proceedings of the International Test Conference (ITC'97), Washington, DC, Nov. 1997, pp. 539-547.
[19] Huang, K. & Lombardi, F. "An Approach to Testing Programmable/Configurable Field Programmable Gate Arrays". Proceedings of the 1996 IEEE VLSI Test Symposium, 1996, pp. 450-455.
[20] Thompson, A. "Evolving Fault Tolerant Systems". Proceedings of the 1st IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications (GALESIA'95), Sheffield, Sep. 1995.
[21] Vargas, F.; Amory, A. "Design of SEU-Tolerant Processors for Radiation-Exposed Systems". 4th IEEE International High Level Design Validation and Test Workshop (HLDVTW'99), San Diego, CA, USA, Nov. 4-6, 1999.
[22] Ansell, G.P.; Tirado, J.S. "CMOS in Radiation Environments". VLSI Systems Design, Sep. 1986.
[23] ESA. "Packet Telemetry Standard, Issue 1". ESA PSS-04-106, European Space Agency, Jan. 1988.
[24] ESA. "Packet Telecommand Standard, Issue 2". ESA PSS-04-107, European Space Agency, Apr. 1992.
[25] ESA. "Packet Utilisation Standard, Issue 1". ESA PSS-07-0, European Space Agency, Mar. 1992.
[26] ESA. "Cluster: mission, payload and supporting activities". ESA SP-1159, European Space Agency, Mar. 1993.
[27] Mattias, O. et al. "Neutron Single Event Upsets in SRAM-Based FPGAs". Xilinx High Reliable Products, Internal Report, 4p., 1999. <www.xilinx.com/products/hirel_qml.htm>
[28] Alfke, P. and Padovani, R. "Radiation Tolerance of High-Density FPGAs". Xilinx High Reliable Products, Internal Report, 4p., 1999. <www.xilinx.com/products/hirel_qml.htm>
[29] Lum, G. and Vandenboom, G. "Single Event Effects Testing of Xilinx FPGAs". Xilinx High Reliable Products, Internal Report, 5p., 1999. <www.xilinx.com/products/hirel_qml.htm>
[30] Normand, E. "Single Event Upset at Ground Level". IEEE Transactions on Nuclear Science, vol. 43, pp. 2742-2750, 1996.
[31] Olsen, J. et al. "Neutron-Induced Single Event Upset in Static RAMs Observed at 10km Flight Altitude". IEEE Transactions on Nuclear Science, vol. 40, pp. 74-77, 1993.
[32] Xilinx. "The Programmable Logic Data Book". San Jose, 1999.
[33] Peterson, W.W. and Weldon, E.J. "Error Correcting Codes". MIT Press, Cambridge, MA, 1972.
[34] Xilinx. "Virtex Configuration and Readback". Xilinx Application Note 138 (XAPP138), San Jose, Mar. 1999, 25pp.
[35] Synplicity. "Synplify Better Synthesis - User Guide, release 5.0". Synplicity, 1998.
[36] Xilinx. "Synthesis and Simulation Design Guide". Xilinx, 314p., 1998.
[37] Bezerra, E.A. et al. "A VHDL implementation of an on-board ACF application targeting FPGAs". Proceedings of the 1999 Military and Aerospace Applications of Programmable Logic Devices Conference (MAPLD'99), The Johns Hopkins University, USA, Sep. 1999.
[38] Gough, M.P. "Particle Correlator Instruments in Space: Performance Limitations, Successes, and the Future". American Geophysics Union, Santa Fe Chapman Conference, 1995.
[39] Beauchamp, K. and Yuen, C. "Digital Methods for Signal Analysis". George Allen & Unwin, 316p., 1979.
[40] IEEE. "Draft Standard for VHDL Register Transfer Level Synthesis". IEEE, 1998.