CS 7500 Fault-tolerant Systems
Dr. Cannon, Spring '05

Introduction to Fault-tolerant Systems

Terms
1. Fault-tolerance: Continuing to provide services in the event of faults.
2. Fault: A H/W, S/W, communication, or data-input problem.
3. Error: A state of incorrect operation caused by a fault that may eventually lead to a failure.
4. Failure: Disruption of normal services or incorrect operation due to an error.
5. Reliability R(t): The probability that a system or component will be functioning normally at time t.
6. Unreliability F(t): The probability that a system or component will have failed by time t; F(t) = 1 - R(t).
7. Fault coverage: The ability to detect, locate, contain, and recover from a fault in order to return to normal operation.
8. Fault latency: The time between the occurrence of a fault and the detection of a resulting error.
9. Error latency: The time between the occurrence of an error and the resulting failure.
10. Graceful degradation: Prioritizing remaining resource allocations to provide critical services.

Fault features
1. Causes
   o H/W: heat, age, radiation, vibration or mechanical stress, EM fields, power loss, etc.
   o S/W: resource limits, code corruption, specification errors, design errors, bugs, etc.
   o Communication: signal corruption or interference, loss of support services, H/W causes, S/W causes, etc.
2. Fault modes
   o *Transient, intermittent, permanent
   o Fail-silent – a component ceases to operate or perform.
   o Fail-babbling – a component continues to operate, but produces meaningless output or results.
   o Fail-Byzantine – a component continues to operate as if no fault occurred, producing meaningful but incorrect output or results.

*There is an important difference between transient and intermittent faults. An intermittent fault is due to repeated environmental conditions, hardware limitations, software problems, or poor design. As such, intermittent faults are potentially detectable and thus repairable. Transient faults are due to transient conditions, such as natural alpha or gamma radiation. Since they do not recur and the hardware is not damaged, they are generally not repairable.

Importance of Fault-tolerant systems
Providing critical services
Providing services when the cost of repair is high
Providing services when access for repair is difficult
Providing services when high availability is important

System characteristics
1. Single computer systems
   o Low redundancy of needed components
   o High reliability
   o Simple detection, location, and containment of faults
2. Distributed and parallel systems
   o High redundancy of needed components
   o Low reliability – the more components in a system, the higher the probability that some component will fail
   o Difficult detection, location, and containment of faults

*A parallel system of N nodes, each with node reliability Rn(t), will have a system reliability R(t) < R1(t) * R2(t) * ... * RN(t) due to communications, synchronization, and exclusion failures.
*A distributed system of N nodes will have a lower system R(t) than N equivalent single-computer systems due to communications, synchronization, and exclusion failures.

Systems design objectives
1. Design a system with a low probability of failure.
2. Add methodologies for adequate fault coverage.

Intro to H/W fault-coverage systems
1. Passive TMR (Triple Modular Redundancy) – a module is triplicated and a voter passes on the majority result. (A minimal voter sketch follows this list.)

   [Figure: two passive TMR arrangements. Example A: three CPUs feeding a single voter into memory. Example B: three replicated CPU/voter/memory paths.]

2. Active replacement – a set of unused components is held in reserve. In the event of a fault, a failed component is replaced from this set.
3. Temporal TMR (TTMR) – a single component is re-used sequentially and the result is voted on. (Useful for coverage of transient faults.)
4. Reconfiguration – a system is rebuilt from remaining resources. (Useful primarily for FPGA-based systems.) This also applies to systems where trying different configurations increases the odds of detecting an error. For a simple example, each time a program is executed, a buffer is allocated to a different location in memory.
5. Lockstep redundancy – duplicate systems are run in parallel with all inputs going to both systems. The output of one system is used until an error is detected, at which time the backup system's output is switched in. (This is sometimes called hot backup.)
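The core of both passive TMR and TTMR above is a majority voter. The following is a minimal sketch in C, not taken from the notes; the function name vote3 and the use of plain integer results are assumptions made for illustration.

    #include <stdio.h>

    /* Majority vote over three redundant results.
     * Returns the value agreed on by at least two inputs and sets *ok to 1.
     * On a three-way disagreement, sets *ok to 0 and returns a, arbitrarily. */
    static int vote3(int a, int b, int c, int *ok)
    {
        *ok = 1;
        if (a == b || a == c) return a;   /* a agrees with at least one other copy */
        if (b == c)           return b;   /* a is the odd one out */
        *ok = 0;                          /* no majority: uncovered fault */
        return a;
    }

    int main(void)
    {
        int ok;
        /* Passive TMR: the same computation performed by three modules.
         * TTMR would instead obtain the three values from three sequential
         * runs of one module. */
        int r = vote3(42, 42, 17, &ok);   /* one copy produced a faulty 17 */
        printf("voted result = %d, majority found = %d\n", r, ok);
        return 0;
    }

With TTMR, the three arguments come from three sequential executions of the same module, which masks a transient fault that corrupts only one of the runs.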
Intro to data coverage systems
1. Checksums – a message is treated as a sequence of integers and summed (modulo N). The checksum is stored at the end of the message. The receiver re-computes the checksum and compares. (A small checksum and parity sketch follows this list.)
2. Cyclic-redundancy checksums (CRCs) – a variation of checksums: the message is treated as the coefficients of a polynomial, which is divided by a fixed generator polynomial; the remainder is stored at the end of the message. The receiver repeats the division and compares.
3. Parity – a message is divided into small integers (usually bytes). A bit recording whether each integer contains an even or odd number of 1 bits is added at the end of that integer. (Parity can also include more than one bit and be done by even/odd powers.)
4. Error-correcting codes – a message is coded so that a change to any one bit can be detected and corrected.
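A minimal sketch in C of the additive checksum and single-bit parity schemes above; it is not from the notes, and the modulus of 256 and the function names are assumptions made for illustration.

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Additive checksum: sum of the message bytes modulo 256.
     * The sender appends this value; the receiver recomputes it and compares. */
    static uint8_t checksum(const uint8_t *msg, size_t len)
    {
        uint8_t sum = 0;
        for (size_t i = 0; i < len; i++)
            sum = (uint8_t)(sum + msg[i]);    /* modulo-256 arithmetic */
        return sum;
    }

    /* Even-parity bit for one byte: chosen so that the byte plus the parity
     * bit together contain an even number of 1 bits. */
    static int parity_bit(uint8_t b)
    {
        int ones = 0;
        for (int i = 0; i < 8; i++)
            ones += (b >> i) & 1;
        return ones & 1;
    }

    int main(void)
    {
        uint8_t msg[] = { 'F', 'a', 'u', 'l', 't' };
        printf("checksum = %u\n", (unsigned)checksum(msg, sizeof msg));
        printf("parity of 'F' (0x46) = %d\n", parity_bit('F'));
        return 0;
    }

A single bit flip changes both the parity of its byte and the checksum, so the receiver can detect it; neither scheme can correct the error, which is what the error-correcting codes in item 4 add.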
Intro to S/W methodologies for dependable systems
1. Fault avoidance
2. Fault tolerance
3. Fault removal
4. Fault forecasting

Intro to S/W coverage systems
1. Redundancy with independent modules
   o TMR
   o TTMR
2. Consistency checking
3. Periodic self-tests
4. Periodic heartbeats
5. Checkpointing and rollback (a minimal sketch appears at the end of these notes)
6. Transaction systems
7. Roll-forward systems

*S/W and H/W systems show similar failure characteristics.
*Fault coverage is often cheaper in S/W.

S/W testing for fault avoidance
1. Robustness testing – all input is checked for being outside of specifications.
2. Limit testing – input is tested at boundary limits.
3. Path testing – test all possible paths for a possible input.
4. Branch testing – test each path from every branch.
5. Special-value testing – test against known outputs for special-value inputs.
6. Interface testing – test all module interface interactions.

S/W verification
Formal methods (example: CSP notation)
Symbolic execution (example: if symbolic execution shows that (a+b)^2 produces a^2 + 2ab + b^2, the module will be correct for all values of a and b)

*S/W verification methods have limited utility in the real world.

S/W Error Rates
Software faults do not all have an equal probability of producing detectable errors. In the following study, 15.25 million executions of two different programs detected 11 errors in one program and 19 errors in the other. For each error, the number of executions prior to its detection was noted, as was the rate at which the error occurred across all executions.

Fault expectations
The type of faults a system designer should expect depends heavily upon the type of system being implemented. One operating in a harsh radiation environment might expect different faults than one operating in a quiet office.

In general, and in the absence of any harsh environmental conditions, a system designer typically expects that the most common faults will be fail-silent. This is because hardware components typically either work or they don't; the range of problems that permit a hardware component to fail transiently is quite narrow. It is difficult to put a number on this, but a guess might be that fail-silent faults are in the range of 95%. As you would expect, fail-silent problems are generally self-confining.

Byzantine faults are probably the least common, but often the most devastating. They are also the most difficult to detect, locate, contain, and recover from.

Recovery
The type of recovery system implemented depends heavily on the purpose or application of the system. No one set of categories defines all recovery types, but the following issues should be considered.

Resume recovery. In many systems, recovery only requires that the system resume normal operation within a reasonable time. Loss of data and incorrect outputs during a failure are not an important issue.

Graceful degradation. In this situation, recovery may not be complete, but the system needs to resume normal operation for the most critical or highest-priority services.

Lossless recovery. Here, no data corruption or loss is tolerated. A system must not only detect, locate, and contain faults, but also prevent any fault-induced errors from reaching the outside or affecting system actions.

Continuous recovery. This is similar to resume recovery, but no recovery time is tolerated. Loss of data is tolerated, but no delays in system operation are tolerated. (This is a bit counter-intuitive, since one normally thinks of data loss as a gap in system operation. An example might be a system where output data is momentarily lost, but no halt causes input to be missed or lost.)

Total recovery. In this situation, no fault effect should be noticeable outside the system. To the outside world, no evidence is given that a fault occurred.
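One common mechanism behind resume and lossless recovery is checkpointing and rollback (item 5 under "Intro to S/W coverage systems" above). The following is a minimal sketch in C, not from the notes; the in-memory checkpoint, the state structure, and the consistency check used to detect the error are assumptions made for illustration.

    #include <stdio.h>
    #include <string.h>

    /* Illustrative application state: a step counter and a running total. */
    struct state {
        long step;
        long total;
    };

    static struct state current, saved;

    static void take_checkpoint(void) { saved = current; }   /* save known-good state */
    static void rollback(void)        { current = saved; }   /* discard work since the checkpoint */

    /* Consistency check: after step n, total must equal 1 + 2 + ... + n.
     * A violation means a fault has corrupted the state. */
    static int state_is_consistent(void)
    {
        return current.total == current.step * (current.step + 1) / 2;
    }

    int main(void)
    {
        int injected = 0;
        memset(&current, 0, sizeof current);
        take_checkpoint();

        for (long i = 1; i <= 10; i++) {
            current.step   = i;
            current.total += i;

            if (i == 7 && !injected) {        /* inject one transient fault */
                injected = 1;
                current.total += 1000;
            }

            if (!state_is_consistent()) {     /* error detected: roll back and redo */
                printf("fault detected at step %ld, rolling back\n", i);
                rollback();
                i = current.step;             /* resume from the checkpointed step */
                continue;
            }
            take_checkpoint();                /* state verified: advance the checkpoint */
        }
        printf("final total = %ld\n", current.total);
        return 0;
    }

Because the checkpoint only advances after the consistency check passes, the rollback always restores a verified state and the faulty step is simply re-executed.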