International Journal of Engineering Trends and Technology (IJETT) – Volume 11 Number 11 - May 2014
Highly Reliable Fault Tolerant Technique for Safety
Critical Applications
Nanditha S#1, Laxmi C*2
# PG Student, Dept. of PG Studies, VTU Regional Office, Gulbarga, Karnataka, India
* Guest Lecturer, Dept. of PG Studies, VTU Regional Office, Gulbarga, Karnataka, India
Abstract—This paper presents a highly reliable fault tolerant
technique for safety critical applications using the Five Modular
Redundancy (FMR) method. In high radiation environments such as
spacecraft and nuclear thermal plants, single event upsets (SEUs)
are likely to degrade system operation by causing single bit flips
in the sequential elements of the electronic components. If such
systems are not provided with fault tolerance, there is a high
chance of obtaining a false response. To avoid this problem the
system is made redundant and a roll-forward recovery mechanism is
used to increase the overall reliability. A scan cell design is
employed to shift out the internal states of all the flip-flops
during the comparison and recovery processes. The proposed method
is designed in Verilog HDL and simulated using the Xilinx ISE
simulator.
Keywords— Design for Testability (DFT), Fault-tolerance, N-modular redundancy (NMR), Roll-forward error recovery, Scan chain.
I. INTRODUCTION
Embedded processors are now used in a wide range of applications.
Among these, safety critical applications such as avionics and
patient life-support monitoring are of particular concern, because
they must both meet timing constraints and provide appropriate
fault tolerance. Such systems are equipped with error detection
and correction mechanisms, but there is a trade-off between
reliability and timing: if reliability is improved, timing
constraints are compromised. For example, roll-back recovery
enhances the overall reliability but compromises timing
constraints, because it increases the number of missed deadlines
in some applications. In safety critical applications it is
generally not acceptable to improve reliability without meeting
the timing constraints. For this reason a fault tolerant technique
with low performance overhead is needed. One such technique is
modular redundancy. Among modular redundancy schemes, triple
modular redundancy (TMR) has good timing behaviour and is
therefore widely used in safety critical applications, including
avionics and satellites. A TMR system can recover from up to two
errors, but it fails to recover when two common erroneous
flip-flops are present, i.e. when the same flip-flop is erroneous
in two modules [1]. A traditional TMR system has three replicated
cores and a majority voter. The majority voter selects the fault
free output among the three outputs of the replicated cores. This
mechanism has some limitations in safety critical applications.
One of its limitations is its inability to cope with TMR failures
[2]. A TMR failure occurs due to the presence of faulty modules or
a faulty voter. In harsh environments, faults are caused by cosmic
radiation or high energy neutrons striking the substrate of the
device. The probability of two particles hitting the common
replica flip-flops is very low, but the probability of two
particles hitting two replica modules is high in harsh
environments when the system runs for long durations. In long term
applications without an appropriate recovery mechanism, the
probability of TMR failures is very high. To overcome this issue
the TMR system is equipped with a transient error recovery
mechanism [1]. Most TMR based recovery mechanisms use a retry
technique [1], [5], but it is not well suited for tight deadline
applications.
Roll-forward recovery mechanisms, unlike retry based mechanisms,
involve no re-computation and can therefore be used in safety
critical applications. A roll-forward mechanism for TMR systems
has been proposed in [3]. This technique is not suitable for
general purpose processors, as it requires detailed information
about the registers of the TMR modules. A TMR based technique that
is applicable to general purpose processors is presented in [4].
This technique, called ScTMR, uses a roll-forward mechanism and
scan chains implemented in the circuits to recover from both
transient and permanent faults. Another TMR based technique
recovers multiple faults [1]. Its main drawback is that it cannot
detect two common erroneous flip-flops. This paper presents a scan
chain based, highly reliable system with a multiple error recovery
mechanism. The proposed technique, five modular redundancy (FMR)
with a recovery mechanism, is a version of N-modular redundancy
(NMR) and provides high reliability to the system.
II. RELATED WORK
In TMR systems the voter does not mask multiple faults, i.e. if
two faulty modules are present in the system then the probability
of choosing the faulty output is very high, and even single errors
cannot be recovered because no error detection and recovery
mechanisms are involved. The techniques in [3], [5], [6] use
modified voters to diagnose the faulty module. These techniques
are hardware based, while the technique proposed in [7] uses a
software based method for voting and fault diagnosis, which has a
negative impact on system performance. Some of these voters use
the history of faulty modules: whenever the number of consecutive
faults exceeds a predefined threshold, the error is identified as
a permanent fault.
Multiple voters and a disagreement detector can be used to mask
transient and permanent faults [3], [5]. A disagreement detector
is a circuit that detects a single fault, but a faulty detector
may itself lead to failure. Most previous work uses a retry
recovery mechanism, in which the faulty module re-executes the
process after the error is detected. Re-execution adds performance
overhead, hence the retry technique cannot be used in tight
deadline applications.
The work presented in [1] compares the traditional TMR system with
one equipped with an appropriate recovery mechanism. It shows that
multiple errors can be recovered, but it fails to recover common
erroneous flip-flops. The study states that the roll-forward
mechanism has a low performance overhead compared to the retry
mechanism, hence it is well suited for tight deadline
applications, and that it is more reliable than the retry
mechanism. In the roll-forward method, the faulty module is
recovered by replacing its state with the state of a fault free
module, which avoids re-computation. It is well known that the
reliability of a system increases with increased redundancy. The
work in [8] states that with five modular redundancy two faults
can be recovered, including both common and non-common erroneous
flip-flops. The proposed architecture shows that increased modular
redundancy adds a reliability advantage to the system, and it is
provided with a multiple error recovery mechanism to improve the
reliability further compared to that achieved in [8].
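The effect of added redundancy can be illustrated with the standard majority-voting reliability expression. The Python sketch below is not taken from the paper: it assumes a perfect voter, independent module failures, identical module reliability r, and no recovery mechanism, and the function name nmr_reliability is hypothetical.

```python
from math import comb


def nmr_reliability(n: int, r: float) -> float:
    """Probability that a majority of n modules is fault free.

    Assumptions (not from the paper): perfect voter, independent
    failures, every module has the same reliability r, no recovery.
    """
    majority = n // 2 + 1
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k)
        for k in range(majority, n + 1)
    )


for r in (0.90, 0.99):
    tmr = nmr_reliability(3, r)
    fmr = nmr_reliability(5, r)
    print(f"r={r}: TMR={tmr:.6f}  FMR={fmr:.6f}")
```

Under these assumptions five modules give a higher majority reliability than three only when r is above 0.5, and the expression says nothing about the additional gain from a recovery mechanism.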
III. FMR ARCHITECTURE
The block diagram of the FMR system is shown in Fig. 1.
Fig. 1. Proposed FMR Architecture
In this technique the system remains in the normal state until the
voter detects an error. In comparison mode all the internal states
are shifted out through the SCO ports using the scan chains, and
during the recovery process the SCI ports are used to restore the
states. When a fault is present, the SCI port of the faulty module
is connected to the SCO port of a fault free module with the help
of a multiplexer. If no fault is present, the SCI port of each
module is connected to the SCO port of the same module.
A. The Majority Voter
Fig. 2. The Majority Voter
In redundant systems it is challenging to detect faulty modules
and recover them. The reliability of the system reduces
significantly if a faulty module is wrongly identified or cannot
be found.
To address this issue, a voter that can identify the faulty module
is presented in this paper. The voter consists of ten comparators,
Cm12, Cm13, Cm14, Cm15, Cm23, Cm24, Cm25, Cm34, Cm35 and Cm45,
each comparing the outputs of two modules at a time. Each
comparator asserts one error signal, which is used to select the
fault free output. The priority encoder (PE in Fig. 2) uses the
error signals from the comparators to generate the select signal
for the final output selector multiplexer. For example, if output
I is faulty then the four comparators Cm12, Cm13, Cm14 and Cm15
assert their error signals. Using these error signals, the
priority encoder generates three select lines for the final output
selector multiplexer.
IV. FMR CONTROLLER
Fig. 3. State Diagram
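The controller is triggered by the error signals of the majority voter described above, so a small behavioural model of that voter may help as a reference point. The Python sketch below is not the authors' Verilog design. The ten comparator names follow the text, but the selection rule (take the lowest-numbered module that agrees with at least two others) is an assumption standing in for the priority encoder, whose exact encoding the paper does not give. The name fmr_vote is hypothetical.

```python
from itertools import combinations

MODULES = 5  # modules I..V are indices 0..4 in this sketch


def fmr_vote(outputs):
    """Return (selected_output, selected_index, error_signals)."""
    if len(outputs) != MODULES:
        raise ValueError("expected exactly five module outputs")

    # Ten pairwise comparators; True means the two outputs disagree.
    errors = {
        f"Cm{i + 1}{j + 1}": outputs[i] != outputs[j]
        for i, j in combinations(range(MODULES), 2)
    }

    # Assumed selection rule (stand-in for the priority encoder):
    # pick the lowest-numbered module that agrees with at least two
    # others, i.e. a member of a majority of three or more.
    for i in range(MODULES):
        agreeing = sum(
            1
            for j in range(MODULES)
            if j != i
            and not errors[f"Cm{min(i, j) + 1}{max(i, j) + 1}"]
        )
        if agreeing >= 2:
            return outputs[i], i, errors

    return None, None, errors  # no majority exists


# Module I delivers a wrong value; modules II..V agree.
value, index, errors = fmr_vote([0x7, 0x3, 0x3, 0x3, 0x3])
asserted = sorted(name for name, flag in errors.items() if flag)
print(value, index, asserted)
```

With module I faulty, the asserted signals in this model are Cm12, Cm13, Cm14 and Cm15, which matches the example given in the voter description.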
Fig. 3 shows the state diagram of the proposed architecture. It
includes four states: normal mode, comparison mode, recovery mode
and unrecoverable mode. When there are no faults in the system, it
is in the normal mode of operation. As soon as the voter detects a
fault, the controller enters comparison mode and all the internal
states are scanned to locate the fault. During this operation the
flip-flops are arranged as a shift register and all the internal
states are shifted out using the scan chain. Once a fault is
detected, the controller activates recovery mode, where the states
are compared again and the erroneous states are replaced with the
error free states. After the recovery process is complete, the
system resumes normal operation until the next fault occurs. If it
fails to recover the fault, it goes to the unrecoverable condition
and the system is stopped for fail safe. If, in comparison mode,
the controller fails to detect any fault, it also enters the
unrecoverable state.
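The four-state behaviour described above can be written down as a small transition function. This Python sketch is only an illustration: the names Mode, next_mode and run are hypothetical, and each call is assumed to represent the decision taken at the end of a mode (for example after a complete scan), which simplifies the clock-level behaviour of the actual controller.

```python
from enum import Enum, auto


class Mode(Enum):
    NORMAL = auto()
    COMPARISON = auto()
    RECOVERY = auto()
    UNRECOVERABLE = auto()


def next_mode(mode, voter_error=False, fault_located=False,
              recovered=False):
    """Return the next controller mode for the given condition flags."""
    if mode is Mode.NORMAL:
        # Stay in normal mode until the voter flags an error.
        return Mode.COMPARISON if voter_error else Mode.NORMAL
    if mode is Mode.COMPARISON:
        # A scan that locates no fault leads to the fail-safe stop.
        return Mode.RECOVERY if fault_located else Mode.UNRECOVERABLE
    if mode is Mode.RECOVERY:
        # Recovery either completes or the system is stopped.
        return Mode.NORMAL if recovered else Mode.UNRECOVERABLE
    return Mode.UNRECOVERABLE  # the stopped system stays stopped


def run(events):
    """Apply a list of flag dictionaries, starting from normal mode."""
    mode = Mode.NORMAL
    trace = [mode]
    for flags in events:
        mode = next_mode(mode, **flags)
        trace.append(mode)
    return trace


trace = run([
    {"voter_error": True},
    {"fault_located": True},
    {"recovered": True},
])
print([m.name for m in trace])
```

The example trace follows one successful cycle: normal, comparison, recovery, and back to normal.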
A. Controller in Comparison and Recovery Modes of Operation
In a multiple error recovery system, one can detect multiple
faults and recover the system from them. Fig. 4 shows the
architecture of the proposed system. In this system, the internal
states of all FMR modules are shifted out using the scan chains
and the outputs of all module pairs (I/II, I/III, ..., IV/V) are
compared. As shown in Fig. 4, ten counters are used, namely C12,
C13, C14, C15, C23, C24, C25, C34, C35 and C45. In the proposed
controller architecture, XOR gates are used to compare the
internal states of the FMR modules. When a mismatch is found
between two modules, the corresponding counter is incremented by
one. The number of counters can be decreased to five, but in that
case common erroneous flip-flops cannot be detected. In comparison
mode the counters count up, and the fault locater detects the
faulty modules and stores them in a register. In recovery mode
this information is used to recover the exact erroneous state, and
the counters count down. At the end of recovery mode all the
counters should be zero. Otherwise, another fault has occurred
during the recovery process and the system enters the
unrecoverable state.
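The comparison-mode bookkeeping can be sketched in Python as follows. The ten counter names come from the text and the XOR models the pairwise comparison. The fault locater is not specified in detail in the paper, so a per-flip-flop majority rule is assumed here, with bit 0 of the register standing for module I. The name compare_scan_out is hypothetical.

```python
from itertools import combinations

MODULES = 5  # modules I..V are indices 0..4


def compare_scan_out(streams):
    """Count pairwise mismatches and locate faulty modules.

    streams: five equal-length bit lists, one per module, in the
    order in which the bits leave the SCO ports.
    """
    pairs = list(combinations(range(MODULES), 2))
    counters = {f"C{i + 1}{j + 1}": 0 for i, j in pairs}
    faulty_mask = 0  # assumed register layout: bit 0 = module I

    for bits in zip(*streams):  # one scan clock per iteration
        for i, j in pairs:
            # XOR of the two scan-out bits: 1 on a mismatch.
            counters[f"C{i + 1}{j + 1}"] += bits[i] ^ bits[j]
        # Assumed locater rule: a module is faulty when its bit
        # differs from the majority of the five bits.
        majority = 1 if sum(bits) >= 3 else 0
        for i, bit in enumerate(bits):
            if bit != majority:
                faulty_mask |= 1 << i

    return counters, faulty_mask


golden = [1, 0, 1, 1, 0, 0, 1, 0]
streams = [list(golden) for _ in range(MODULES)]
streams[0][2] ^= 1  # same flip-flop upset in module I ...
streams[2][2] ^= 1  # ... and in module III (common erroneous bit)

counters, mask = compare_scan_out(streams)
print(format(mask, "05b"), counters)
```

In this example modules I and III hold the same wrong bit, so C13 stays at zero while the counters that pair either of them with a fault free module are incremented. This is one way to see why pairwise counters are needed to expose common erroneous flip-flops.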
Fig. 4. Controller Configuration in Comparison and Recovery Modes
V. SIMULATION RESULTS
A. The Majority Voter Simulation Results
Fig. 5 shows the simulation results of the majority voter. The
signal H is the final fault free output. K, L, M, N and O are the
outputs of the five redundant modules, respectively. The design is
checked for all combinations of inputs. The voter selects the
majority output, i.e. the output that is the same for more than
two modules. If any mismatch is found, the corresponding error
signals (E12, E13, E23, etc.) are activated.
Fig. 5. Voter Results
B. FMR Simulation Results Showing Multiple Error Recovery
Fig. 6, Fig. 7, Fig. 8 and Fig. 9 show the simulation results of
the FMR system for various numbers of faults. Initially the system
is reset for some period, during which it is in normal mode. The
faults are injected using a simulation based fault injection
technique. Comparison mode is activated if any of the ten error
signals is asserted by the voter. The scan chain is enabled in
comparison mode and recovery mode, i.e. the internal flip-flops
are configured as a shift register, which is used to shift out all
the values of each module. In comparison mode the fault can be
seen at the SCO port of the faulty module. During this process the
internal states are compared; if there is any mismatch between the
internal states of the modules, the corresponding counter is
incremented and the faulty module is detected using the fault
locater. In recovery mode the internal states are compared again,
and each time a mismatch is detected the counter is decremented by
one. Using the fault locater status, the faulty states are
recovered in this mode.
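The fault injection and roll-forward recovery exercised in these simulations can be mimicked with a simple shift-register model. The Python sketch below is an illustration under stated assumptions rather than the simulated Verilog design. Each module state is a bit list whose last element sits at the SCO port, the faulty-module register uses bit 0 for module I, and the donor is assumed to be the lowest-numbered fault free module. The names inject_seu and roll_forward_recover are hypothetical.

```python
def inject_seu(state, bit):
    """Return a copy of state with one flip-flop value inverted."""
    flipped = list(state)
    flipped[bit] ^= 1
    return flipped


def roll_forward_recover(states, faulty_mask):
    """Rewrite faulty modules from a fault free one via scan chains.

    states      : one bit list per module; last element is at SCO.
    faulty_mask : bit i set means module i was marked faulty.
    """
    healthy = [i for i in range(len(states))
               if not (faulty_mask >> i) & 1]
    if not healthy:
        raise ValueError("no fault free module left: unrecoverable")
    donor = healthy[0]  # assumed choice of the donor module

    states = [list(s) for s in states]
    for _ in range(len(states[0])):      # one scan clock per flip-flop
        sco = [s[-1] for s in states]    # bit visible at each SCO port
        for i in range(len(states)):
            # SCI multiplexer: a faulty module takes the donor's SCO,
            # a fault free module recirculates its own SCO.
            src = donor if (faulty_mask >> i) & 1 else i
            states[i] = [sco[src]] + states[i][:-1]
    return states


golden = [1, 0, 1, 1, 0, 0, 1, 0]
states = [list(golden) for _ in range(5)]
states[2] = inject_seu(states[2], 3)     # upset in module III
recovered = roll_forward_recover(states, 0b00100)
print(all(s == golden for s in recovered))
```

After one full pass of the scan chain the fault free modules hold their original state again and the faulty module holds a copy of the donor's state, so no re-computation is needed.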
Fig. 6. Simulation showing one faulty module
Fig. 6 shows the simulation results for a single faulty module.
Here module III is faulty, hence the signal F3 is high and the
faulty module register I holds a value of 00100, which indicates
that module III is faulty. At the end of recovery mode all the
faults are recovered.
Fig. 7. Simulation showing two faulty modules
Fig. 7 shows the simulation results for two faulty modules. Here
module I and module III are faulty, hence the corresponding
signals F1 and F3 are high and the faulty module register I has a
value of 00101. At the end of recovery mode all the counters are
equal to zero, which indicates the complete recovery of the faulty
modules.
Fig. 8. Simulation showing three faulty modules
Fig. 8 shows the simulation results for three faulty modules. In
this case module I, module II and module III are faulty, hence the
corresponding signals F1, F2 and F3 are high and the faulty module
register I has a value of 00111. Fig. 8 shows that at the end of
recovery mode all the faults are recovered.
Fig. 9. Simulation showing four faulty modules
Fig. 9 shows the simulation results for four faulty modules. Here
module I, module II, module III and module IV are faulty, hence
the signals F1, F2, F3 and F4 are high and the faulty module
register I has a value of 01111. Fig. 9 shows that at the end of
recovery mode all the faults are recovered.
VI. CONCLUSIONS
In this paper, a technique to increase the reliability of onboard
systems for safety critical applications is introduced. The
proposed system is provided with a fault recovery mechanism, which
is an added advantage for fault tolerant redundant systems. The
results section shows that even two common erroneous flip-flops
can be detected and corrected.
ACKNOWLEDGMENT
The authors would like to thank the Dept. of PG Studies, VTU
Regional Office, Gulbarga, whose timely support and suggestions
went a long way in the completion of the project.
REFERENCES
[1] M. Ebrahimi, S. G. Miremadi, H. Asadi, and M. Fazeli, “Low-Cost Scan-Chain-Based Technique to Recover Multiple Errors in TMR Systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 8, pp. 1454–1468, Aug. 2013.
[2] H. Kim and K. G. Shin, “Design and analysis of an optimal instruction retry policy for TMR controller computers,” IEEE Trans. Comput., vol. 45, no. 11, pp. 1217–1225, Nov. 1996.
[3] K. G. Shin and H. Kim, “A time redundancy approach to TMR failures using fault-state likelihoods,” IEEE Trans. Comput., vol. 43, no. 10, pp. 1151–1162, Oct. 1994.
[4] M. Ebrahimi, S. G. Miremadi, and H. Asadi, “ScTMR: A scan chain based error recovery technique for TMR systems in safety-critical applications,” in Proc. Design Autom. Test Eur. Conf. Exhibit., 2011, pp. 1–4.
[5] F. L. Kastensmidt, L. Sterpone, L. Carro, and M. S. Reorda, “On the optimal design of triple modular redundancy logic for SRAM-based FPGAs,” in Proc. Design Autom. Test Eur. Conf. Exhibit., 2005, pp. 1530–1591.
[6] S. Yu and E. J. McCluskey, “On-line testing and recovery in TMR systems for real-time applications,” in Proc. Int. Test Conf., 2001, pp. 240–249.
[7] P. K. Chande, A. K. Ramani, and P. C. Sharma, “Modular TMR multiprocessor system,” IEEE Trans. Ind. Electron., vol. 36, no. 1, pp. 34–41, Feb. 1989.
[8] Haryono, Jazi Eko Istiyanto, Agus Harjoko, and Agfianto Eko Putra, “Five Modular Redundancy with Mitigation Technique to Recover the Error Module,” IJASCSE, vol. 3, no. 2, 2014.
ISSN: 2231-5381
http://www.ijettjournal.org