International Journal of Engineering Trends and Technology (IJETT) – Volume 11 Number 11 - May 2014

Highly Reliable Fault Tolerant Technique for Safety Critical Applications

Nanditha S #1, Laxmi C *2
# PG Student, Dept. of PG Studies, VTU Regional Office, Gulbarga, Karnataka, India
* Guest Lecturer, Dept. of PG Studies, VTU Regional Office, Gulbarga, Karnataka, India

Abstract: This paper presents a highly reliable fault-tolerant technique for safety-critical applications based on five modular redundancy. In high-radiation environments such as spacecraft and nuclear thermal plants, single event upsets (SEUs) are likely to degrade system operation by causing single-bit flips in the sequential elements of the system's electronic components. Without fault tolerance, such a system has a high chance of producing a false response. To avoid this problem, the system is made redundant and a roll-forward recovery mechanism is used to increase the overall reliability. A scan cell design is employed to shift out the internal states of all flip-flops during the comparison and recovery processes. The proposed method is designed in Verilog HDL and simulated using the Xilinx ISE simulator.

Keywords: Design for Testability (DFT), Fault tolerance, N-modular redundancy (NMR), Roll-forward error recovery, Scan chain.

I. INTRODUCTION

Embedded processors are now used in a wide range of applications. Among these, safety-critical applications such as avionics and patient life-support monitoring are of particular concern, because they must both meet timing constraints and provide appropriate fault tolerance. Such systems are equipped with error detection and correction mechanisms, but there is a trade-off between reliability and timing: when reliability is improved, the timing constraints tend to be compromised.
If a roll-back recovery technique is used, the overall reliability is enhanced, but timing constraints are compromised because more deadlines are missed in some applications. In safety-critical applications it is generally not acceptable to improve reliability without meeting the timing constraints. It is therefore important to provide a fault-tolerant technique with low performance overhead for these applications. One such technique is modular redundancy. Among modular redundancy schemes, triple modular redundancy (TMR) has good timing behaviour and is therefore widely used in safety-critical applications, including avionics and satellites. A TMR system can recover from up to two errors, but it fails to recover when the same (common) flip-flop is erroneous in two modules [1].

A traditional TMR system has three replicated cores and a majority voter, which selects the fault-free output among the three core outputs. This mechanism has some limitations in safety-critical applications. One of them is its inability to cope with TMR failures [2]. A TMR failure occurs when faulty modules or a faulty voter are present. In harsh environments, faults are caused by cosmic radiation or high-energy neutrons striking the substrate of the device. The probability that two particles hit the same flip-flop in two replicas is very low, but the probability that two particles hit two different replica modules is high when the system runs for long durations in such environments. In long-term applications without an appropriate recovery mechanism, the probability of TMR failures is therefore very high. To overcome this issue, the TMR system is equipped with a transient error recovery mechanism [1].

ISSN: 2231-5381
Most TMR-based recovery mechanisms use a retry technique [1], [5], which is not well suited to applications with tight deadlines. Roll-forward recovery mechanisms involve no recomputation, unlike retry-based recovery, and can therefore be used in safety-critical applications. A roll-forward mechanism for TMR systems has been proposed in [3]. That technique is not suitable for general-purpose processors because it requires detailed information about the registers of the TMR modules. A TMR-based technique applicable to general-purpose processors is presented in [4]. Called ScTMR, it uses a roll-forward mechanism and the scan chains implemented in the circuit to recover from both transient and permanent faults. Another paper presents a TMR-based technique to recover from multiple faults [1]; its main drawback is that it cannot detect the case where two common erroneous flip-flops are present.

This paper presents a scan-chain-based, highly reliable system with a multiple-error recovery mechanism. The proposed technique, five modular redundancy (FMR) with a recovery mechanism, is a version of N-modular redundancy (NMR) and provides high reliability to the system.

II. RELATED WORK

In TMR systems, the voter does not mask multiple faults: if two faulty modules are present, the probability of choosing a faulty output is very high, and even single errors cannot be recovered because no error detection and recovery mechanisms are involved. The techniques in [3], [5], [6] use modified voters to diagnose the faulty module. These techniques are hardware based, whereas the technique proposed in [7] uses a software-based method for voting and fault diagnosis, which has a negative impact on system performance.
Some of these voters use the history of faulty modules: whenever the number of consecutive faults exceeds a predefined threshold, the error is identified as a permanent fault.

http://www.ijettjournal.org

Multiple voters and a disagreement detector can be used to mask transient and permanent faults [3], [5]. A disagreement detector is a circuit that detects a single fault, but a faulty detector may itself lead to failure. Most previous papers use a retry recovery mechanism, in which the faulty module re-executes the process after an error is detected. Re-execution adds performance overhead, so the retry technique cannot be used in applications with tight deadlines.

The work presented in [1] compares the traditional TMR system with one equipped with an appropriate recovery mechanism. It shows that multiple errors can be recovered, but it fails to recover common erroneous flip-flops. The study states that the roll-forward mechanism has lower performance overhead than the retry mechanism, and hence is well suited to tight-deadline applications, and that it is more reliable than the retry mechanism. In the roll-forward method, the faulty module is recovered by replacing its state with the state of a fault-free module, which avoids re-computation.

It is well known that increased redundancy increases the reliability of a system. The work in [8] states that, with five modular redundancy, two faults can be recovered, including both common and non-common erroneous flip-flops. The proposed architecture shows that increased modular redundancy adds a reliability advantage to the system, and it is provided with a multiple-error recovery mechanism to improve the reliability further beyond that achieved in [8].
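To make the redundancy argument concrete, the following sketch compares the textbook reliability of majority-voted TMR and FMR. It assumes independent module failures, an ideal voter, and masking only (no scan-chain recovery). The function name `nmr_reliability` and the sample module reliabilities are illustrative and are not taken from [8] or from the proposed design.

```python
from math import comb


def nmr_reliability(n: int, r: float) -> float:
    """Probability that a strict majority of n independent modules is fault free.

    Assumes an ideal voter; r is the reliability of a single module (0 <= r <= 1).
    """
    if n % 2 == 0:
        raise ValueError("n must be odd so that a strict majority exists")
    majority = n // 2 + 1
    # Sum the binomial probabilities of having 'majority' or more good modules.
    return sum(
        comb(n, k) * r**k * (1 - r) ** (n - k) for k in range(majority, n + 1)
    )


if __name__ == "__main__":
    for r in (0.90, 0.95, 0.99):
        tmr = nmr_reliability(3, r)
        fmr = nmr_reliability(5, r)
        print(f"module r={r:.2f}  TMR={tmr:.6f}  FMR={fmr:.6f}")
```

Under these assumptions, five modules are more reliable than three only when the reliability of each module exceeds 0.5; at exactly 0.5 both schemes give 0.5.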
III. FMR ARCHITECTURE

The block diagram of FMR is shown in Fig. 1.

Fig. 1 Proposed FMR Architecture

In this technique, the system remains in the normal state until the voter detects an error. In comparison mode, all internal states are shifted out through the SCO ports using the scan chains, and during the recovery process the SCI ports are used to restore the states. When a fault is present, the SCI port of the faulty module is connected through a multiplexer to the SCO port of a fault-free module. If no fault is present, the SCI port of each module is connected to the SCO port of the same module.

A. The Majority Voter

In redundant systems it is challenging to detect faulty modules and recover them. The system's reliability drops significantly on a wrong detection or a failure to find a faulty module. To address this issue, this paper presents a voter that can identify the faulty module. The voter consists of ten comparators, Cm12, Cm13, Cm14, Cm15, Cm23, Cm24, Cm25, Cm34, Cm35 and Cm45, each comparing the outputs of two modules at a time. Each comparator asserts one error signal, which is used to select the fault-free output. The priority encoder (PE in Fig. 2) uses the error signals from the comparators to generate the select signal for the final output selector multiplexer. For example, if the output of module I is faulty, the four comparators Cm12, Cm13, Cm14 and Cm15 assert their error signals. From these error signals, the priority encoder generates the three select lines for the final output selector multiplexer.

Fig. 2 The Majority Voter

IV. FMR CONTROLLER

Fig. 3 State Diagram

Fig. 3 shows the state diagram for the proposed architecture. It has four states: normal mode, comparison mode, recovery mode and unrecoverable mode. When there are no faults in the system, it remains in the normal mode of operation.
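The voter's pairwise comparison and the four controller modes can be sketched as a small behavioural model. This is a minimal Python sketch, not the authors' Verilog: the names `voter`, `next_mode`, `voter_error`, `fault_located` and `recovered` are assumptions, the priority encoder is replaced by a simple majority search, and the transitions follow the description of Fig. 3 in this section.

```python
from enum import Enum, auto
from itertools import combinations

# The ten module pairs, in the same order as comparators Cm12 ... Cm45.
PAIRS = list(combinations(range(5), 2))


def voter(outputs):
    """Return (selected_output, error_signals) for five module outputs.

    error_signals maps each pair (a, b) to True when the two outputs differ.
    The selected output is the value shared by at least three modules, or
    None when there is no such majority (an assumption of this sketch).
    """
    outputs = list(outputs)
    errors = {(a, b): outputs[a] != outputs[b] for a, b in PAIRS}
    for value in outputs:
        if outputs.count(value) >= 3:
            return value, errors
    return None, errors


class Mode(Enum):
    NORMAL = auto()
    COMPARISON = auto()
    RECOVERY = auto()
    UNRECOVERABLE = auto()


def next_mode(mode, *, voter_error=False, fault_located=False, recovered=False):
    """One controller transition, evaluated at the end of the current phase."""
    if mode is Mode.NORMAL:
        return Mode.COMPARISON if voter_error else Mode.NORMAL
    if mode is Mode.COMPARISON:
        return Mode.RECOVERY if fault_located else Mode.UNRECOVERABLE
    if mode is Mode.RECOVERY:
        return Mode.NORMAL if recovered else Mode.UNRECOVERABLE
    return Mode.UNRECOVERABLE  # the system stays stopped for fail-safe


if __name__ == "__main__":
    selected, errors = voter([9, 7, 7, 7, 7])  # module I delivers a wrong output
    mode = next_mode(Mode.NORMAL, voter_error=any(errors.values()))
    print(selected, [pair for pair, err in errors.items() if err], mode.name)
```

For the example in Section III-A, where the output of module I is faulty, `voter([9, 7, 7, 7, 7])` selects 7 and raises exactly the four error signals that correspond to Cm12, Cm13, Cm14 and Cm15.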
As soon as the voter detects a fault, the controller enters comparison mode and all internal states are scanned to detect the presence of the fault. During this operation the flip-flops are arranged as a shift register and all internal states are shifted out through the scan chain. Once a fault is detected, the controller activates recovery mode, in which the states are compared again and the erroneous states are replaced with error-free states. After the recovery process completes, the system resumes normal operation until the next fault occurs. If it fails to recover from the fault, it goes to the unrecoverable state and the system is stopped for fail-safe. If the controller fails to detect any fault in comparison mode, it also enters the unrecoverable state.

A. Controller in comparison and recovery modes of operation

A multiple-error recovery system can detect multiple faults and recover from them. Fig. 4 shows the architecture of the proposed system. In this system, the internal states of all FMR modules are shifted out using the scan chains and the outputs of all module pairs (I/II, I/III, ..., IV/V) are compared. As shown in Fig. 4, ten counters are used, namely C12, C13, C14, C15, C23, C24, C25, C34, C35 and C45. In the proposed controller architecture, XOR gates are used to compare the internal states of the FMR modules. When a mismatch is found between two modules, the corresponding counter is incremented by one. The number of counters can be reduced to five, but in that case common erroneous flip-flops cannot be detected. In comparison mode the counters count up, and the fault locator detects the faulty modules and stores them in a register. In recovery mode this information is used to recover the exact erroneous state, and the counters count down. At the end of recovery mode all the counters should be zero.
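The counter scheme can be illustrated with a behavioural sketch of one comparison pass and one recovery pass over the scan chains. It is a minimal Python model under stated assumptions, not the authors' implementation: each module state is a list of bits, the names `compare_states`, `locate_faulty` and `recover` are hypothetical, and the locator rule (a module is fault free if its state matches at least two other modules) is an assumption that only covers up to two faulty modules. The locator rule behind the three- and four-module results in Section V is not given in the text.

```python
from itertools import combinations

N_MODULES = 5
PAIRS = list(combinations(range(N_MODULES), 2))  # ten pairs, like C12 ... C45


def compare_states(states):
    """Comparison mode: shift out all scan chains and up-count mismatches."""
    counters = {pair: 0 for pair in PAIRS}
    for bits in zip(*states):  # one scan-out (SCO) bit per module per clock
        for a, b in PAIRS:
            counters[(a, b)] += bits[a] ^ bits[b]  # XOR comparator
    return counters


def locate_faulty(counters):
    """Assumed locator rule; bit m of the result marks module m+1 as faulty."""
    register = 0
    for m in range(N_MODULES):
        matches = sum(1 for pair in PAIRS if m in pair and counters[pair] == 0)
        if matches < 2:
            register |= 1 << m
    return register


def recover(states, counters, register):
    """Recovery mode: shift again, down-count mismatches, and feed each faulty
    module's SCI from the SCO of a fault-free module (roll-forward)."""
    if register == (1 << N_MODULES) - 1:
        return states, False  # no fault-free module could be identified
    good = next(m for m in range(N_MODULES) if not (register >> m) & 1)
    new_states = [[] for _ in range(N_MODULES)]
    for bits in zip(*states):
        for a, b in PAIRS:
            counters[(a, b)] -= bits[a] ^ bits[b]
        for m in range(N_MODULES):
            source = good if (register >> m) & 1 else m
            new_states[m].append(bits[source])
    # A nonzero counter would mean the shifted-out data changed during recovery.
    return new_states, all(count == 0 for count in counters.values())


if __name__ == "__main__":
    golden = [1, 0, 1, 1, 0, 0, 1, 0]
    states = [list(golden) for _ in range(N_MODULES)]
    states[0][3] ^= 1  # flip the same flip-flop in module I ...
    states[2][3] ^= 1  # ... and in module III (a common erroneous flip-flop)
    counters = compare_states(states)
    register = locate_faulty(counters)
    states, ok = recover(states, counters, register)
    print(format(register, "05b"), ok)
```

In this example modules I and III carry the same flipped bit, so their pair counter stays at zero while both disagree with the other three modules. The assumed locator therefore flags them as 00101, and after the recovery pass all five states equal the fault-free state.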
A nonzero counter at the end of recovery mode indicates that another fault has occurred during the recovery process, and the system enters the unrecoverable state.

Fig. 4 Controller Configuration in Comparison and Recovery Modes

V. SIMULATION RESULTS

A. The Majority Voter Simulation Results

Fig. 5 shows the simulation results of the majority voter. The signal H is the final fault-free output. K, L, M, N and O are the outputs of the five redundant modules. The design is checked for all combinations of inputs. The voter selects the majority output, i.e. the output that is the same for more than two modules. If any mismatch is found, the corresponding error signals (E12, E13, E23, etc.) are activated.

Fig. 5 Voter Results

B. FMR Simulation Results Showing Multiple Error Recovery

Fig. 6, Fig. 7, Fig. 8 and Fig. 9 show the simulation results of the FMR system for various numbers of faults. Initially the system is held in reset for some period, during which it is in normal mode. The faults are injected using a simulation-based fault injection technique. Comparison mode is activated if the voter asserts any of the ten error signals. The scan chain is enabled in comparison mode and recovery mode, i.e. the internal flip-flops are configured as a shift register, which is used to shift out all the values of each module. In comparison mode the fault can be seen at the SCO port of the faulty module. During this process the internal states are compared; on any mismatch between the internal states of two modules the corresponding counter is incremented, and the fault locator identifies the faulty module. In recovery mode the internal states are compared again, and each time a mismatch is detected the counter is decremented by one. The faulty states are recovered in this mode using the fault locator status.

Fig. 6 Simulation showing one faulty module

Fig. 6 shows the simulation results for a single faulty module.
Here module III is faulty, hence the signal F3 is high and the register I holds a value of 00100, which shows that module III is faulty. It shows that at the end of recovery mode all the faults are recovered.

Fig. 7 Simulation showing two Faulty Modules

Fig. 7 shows the simulation results for two faulty modules. Here module I and module III are faulty, hence the corresponding signals F1 and F3 are high and the faulty module register I has a value of 00101. At the end of recovery mode all the counters are equal to zero, which indicates the complete recovery of the faulty modules.

Fig. 8 Simulation showing three Faulty Modules

Fig. 8 shows the simulation results for three faulty modules. In this case module I, module II and module III are faulty, hence the corresponding signals F1, F2 and F3 are high and the faulty module register I has a value of 00111. Fig. 8 shows that at the end of recovery mode all the faults are recovered.

Fig. 9 Simulation showing four Faulty Modules

Fig. 9 shows the simulation results for four faulty modules. Here module I, module II, module III and module IV are faulty, hence the signals F1, F2, F3 and F4 are high and the faulty module register I has a value of 01111. Fig. 9 shows that at the end of recovery mode all the faults are recovered.

VI. CONCLUSIONS

In this paper, a technique to increase the reliability of on-board systems for safety-critical applications is introduced. The proposed system is provided with a fault recovery mechanism, which is an added advantage for fault-tolerant redundant systems. The results section shows that even two common erroneous flip-flops can be detected and corrected.

ACKNOWLEDGMENT

The authors would like to thank the Dept. of PG Studies, VTU Regional Office, Gulbarga, whose timely support and suggestions went a long way in the completion of the project.

REFERENCES

[1] M. Ebrahimi, S. G. Miremadi, H. Asadi, and M. Fazeli, “Low-Cost Scan-Chain-Based Technique to Recover Multiple Errors in TMR Systems,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 8, pp. 1454–1468, Aug. 2013.
[2] H. Kim and K. G. Shin, “Design and analysis of an optimal instruction retry policy for TMR controller computers,” IEEE Trans. Comput., vol. 45, no. 11, pp. 1217–1225, Nov. 1996.
[3] K. G. Shin and H. Kim, “A time redundancy approach to TMR failures using fault-state likelihoods,” IEEE Trans. Comput., vol. 43, no. 10, pp. 1151–1162, Oct. 1994.
[4] M. Ebrahimi, S. G. Miremadi, and H. Asadi, “ScTMR: A scan chain based error recovery technique for TMR systems in safety-critical applications,” in Proc. Design Autom. Test Eur. Conf. Exhibit., 2011, pp. 1–4.
[5] F. L. Kastensmidt, L. Sterpone, L. Carro, and M. S. Reorda, “On the optimal design of triple modular redundancy logic for SRAM-based FPGAs,” in Proc. Design Autom. Test Eur. Conf. Exhibit., 2005, pp. 1530–1591.
[6] S. Yu and E. J. McCluskey, “On-line testing and recovery in TMR systems for real-time applications,” in Proc. Int. Test Conf., 2001, pp. 240–249.
[7] P. K. Chande, A. K. Ramani, and P. C. Sharma, “Modular TMR multiprocessor system,” IEEE Trans. Ind. Electron., vol. 36, no. 1, pp. 34–41, Feb. 1989.
[8] Haryono, Jazi Eko Istiyanto, Agus Harjoko and Agfianto Eko Putra, “Five Modular Redundancy with Mitigation Technique to Recover the Error Module,” IJASCSE, Volume 3, Issue 2, 2014.