7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems Upon the detection of a failure, the system discards the current erroneous state and determines the correct state without any loss of computation. There are two different approaches: a) Hardware Redundancy – Static Redundancy – Dynamic Redundancy b) Software Redundancy 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.1 Static Redundancy Approaches There are 3 different approaches to mask the failures: Active Masking Redundancy Active Masking Using Fail-Stop Modules Active Redundancy Using Self-Diagnosis 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.1 Static Redundancy Approaches Active Masking Redundancy: Uses adequate level of replication to tolerate the failures, using voting on the outputs of all the replicas. E.g.: TMR (Triple Modular Redundant) systems mask a single failure without any performance loss. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.1 Static Redundancy Approaches Active Redundancy Using Fail-Stop Modules: Multiple modules of each processor actively execute each process. Each processor itself is assumed to be failstop. Thus, if one of the processors fails, it stops executing and the other processors executing the task continue functioning without any performance penalty, even in the presence of failures. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.1 Static Redundancy Approaches E.g. in a given system, each subsystem is duplicated, forming a pair. One of the replicas is identified as the spare. Each subsystem and its spare are, themselves, made self-checking by replication. The HW is thereby replicated 4 times. All 4 copies of the HW are tightly synchronized. When a fault is detected in a subsystem by its self-checking mechanisms, it disconnects itself as well as that the spare starts providing its service without any interruption or rollback. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.1 Static Redundancy Approaches Active Redundancy Using Self-Diagnosis: Analogous to the one using “fail-stop modules”, however, instead of concurrent self-checking mechanism, selfdiagnosis tasks are used to identify the faulty processor. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.1 Static Redundancy Approaches E.g. the reconfigurable duplication mechanism, where the process is replicated on 2 processors. Their outputs are continuously compared. If any mismatch indicating a failure of at least one of the processors in the pair is detected, each processor runs self-diagnostic tasks to determine if it has failed. Once the faulty processor is identified, the output of the fault-free processor can be accepted as correct. The use of self-diagnostic tasks instead of concurrent selfchecking results in a slight computation overhead for determining the faulty processor after a fault is detected. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Forward recovery schemes based on dynamic redundancy and checkpointing try to avoid rollback even in the presence of failures. The fault is thus tolerated without the performance penalty of a rollback. E.g. Consider a duplex system that detects failures by checkpointing the two modules in the system periodically and then, comparing their states. When a failure is detected, the roll-forward checkpointing scheme tries to determine which of the two processing modules, if any, is fault-free. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Concurrent retry in the Roll Forward Checkpointing Scheme (RFCS) Scheme. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Concurrent retry in the Roll Forward Checkpointing Scheme (RFCS) Scheme. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Variations of the RFCS may assume that each module has built-in fault detection capability such as parity checks, exception detection. Thus, 4 different scenarios can be conceptualized: Resources Used Recovery Strategy With Spare No Spare Optimistic (only single faults) Roll-forward (I) Roll-forward (I) Rollback (I)* Pessimistic (may occur double faults) Roll-forward (II) Rollback (II) Three Different Recovery Schemes (* no built-in fault detection capability included). 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches In an optimistic recovery strategy, one trusts the built-in detection capability to the fullest extent. This scheme will not require the use of a spare, even though it may be available. Module I1 I2 A I1 Optimistic scheme with or without spare. Roll-forward (I) I2 B roll-forward 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Pessimistic Scheme with spare rolling forward with all single faults. In the pessimistic recovery strategy, It may be noted that although module B has been already suspect to be faulty, a more conservative action was taken just in case A might have experienced a failure which escaped the built-in detection capability during I1. Pessimistic Scheme with spare rolling back with double faults. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Reliability 1 2 The ideal curve 1 is preferred because it allows a small reduction in reliability to be traded off against a large gain in performance. (This is the case of Optimistic Recovery Strategies). 3 Performance Three different rollforward schemes. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Generally, the mean completion time given a failure has occurred is lower for the roll-forward scheme for both optimistic and pessimistic strategies. Without any failure, all the schemes perform similarly. When there is no built-in detection capability, the pessimistic and the corresponding optimistic scheme have identical reliabilities. Since there is no built-in detection, there is no way to identify the faulty module without comparison between operating modules and the spare one. When there is 100% fault detection, with or without spare schemes have identical reliabilities. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Note: = failure rate; c = detection coverage (indicates the degree of builtin detection capabilities); n = # of checkpoint intervals. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Roll-forward Rollback Optimistic Pessimistic Performance comparison between optimistic and pessimistic schemes: mean completion time, given a fault. (Optimistic scheme is better) Reliability comparison between optimistic and pessimistic schemes. (Pessimistic scheme is better) 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches One of the important advantages of a roll-forward scheme is in the minimal degradation in I/O performance: All outputs after I1 will experience one checkpoint interval delay. Permanent delay in rollback scheme outputs in the event of a fault. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches The outputs x and y are the only ones delayed and all other outputs are will occur at the regularly scheduled interval. x,y,z Module w v : System outputs A B I1 I2 I3 I4 I5 I6 I1 I2 I3 I4 I5 I6 I1 I2 Spare Activated Spare Release Temporary delay in roll-forward scheme outputs in the event of a fault. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.2 Dynamic Redundancy Approaches Forward Recovery Using Checkpointing. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.3 Software Redundancy-Based Approach for Forward Error Recovery The previous approaches primarily require HW redundancy (+300%). This approach requires a certain degree of SW redundancy, as well as HW redundancy: SW redundancy is implemented by using Recovery Blocks. Recovery blocks are a language construct that supports the incorporation of program redundancy into a fault-tolerant program in a concise and easily readable form. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.3 Software Redundancy-Based Approach for Forward Error Recovery The syntax of the recovery block is: Ensure T by B1 else by B2 . . . else by Bn else error Where: T is acceptance test; B1 denotes the primary try block; Bk denotes the (k – 1)th alternate try block. 7. Fault Tolerance Through Dynamic or Standby Redundancy 7.5 Forward Recovery Systems – 7.5.3 Software Redundancy-Based Approach for Forward Error Recovery Distributed Recovery Block.