CS 7500
Dr. Cannon
Fault-tolerant Systems
Spring ‘07
Introduction to Fault-tolerant Systems
Terms
1. Fault: A H/W malfunction, S/W bug, communication corruption, data storage
problem, or malicious tampering.
2. Error: A state of incorrect operation caused by a fault that may eventually lead
to a failure.
3. Failure: Disruption of normal services or incorrect operation due to an error. So,
faults lead to errors which lead to failures.
4. Fault-tolerance: Continuing to provide services (recovery) in the event of faults.
(Traditionally, software fault tolerance is viewed as using software methods to
tolerate software-induced faults. We will expand this definition to include all
software and hardware-induced faults.)
5. Reliability R(t): The probability that a system or component functions correctly
throughout the interval [0, t], i.e., has not failed by time t.
6. Availability A(t): The probability that a system or component is operational and
available for use at time t. (A small numerical sketch of both terms follows this list.)
7. Fault coverage: The ability to detect, locate, contain, and recover from a range
of faults in order to return to normal operation.
8. Fault latency: The difference in time between the occurrence of a fault and the
detection of a resulting error.
9. Error latency: The difference in time between the occurrence of an error and the
resulting failure.
10. Graceful degradation: Prioritizing remaining resource allocations to provide
critical services. In a sense, this is controlled vs. uncontrolled failure.
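A rough numerical sketch of terms 5 and 6, assuming a constant failure rate (so R(t) = e^(-λt)) and illustrative MTTF/MTTR values; the numbers below are assumptions for illustration, not figures from these notes:

    import math

    def reliability(failure_rate_per_hour, t_hours):
        """R(t): probability of running without failure through [0, t] (exponential model)."""
        return math.exp(-failure_rate_per_hour * t_hours)

    def steady_state_availability(mttf_hours, mttr_hours):
        """A: long-run fraction of time the component is usable."""
        return mttf_hours / (mttf_hours + mttr_hours)

    lam = 1e-4                                    # assumed: one failure per 10,000 hours
    print(reliability(lam, 1000))                 # ~0.905 after 1,000 hours
    print(steady_state_availability(10_000, 8))   # ~0.9992 with an 8-hour mean repair time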
Fault features
1. Causes
o (H/W): heat, manufacturing flaws (infant mortality), aging, radiation,
vibration, mechanical stress, EMI, power loss, design deficiencies, data
dependencies, etc.
o (S/W): resource limit violations, code corruption, specification errors,
design errors, bugs, data dependencies, deadlock, livelock, etc.
o (Com): signal corruption or interference, loss of support services, H/W
causes, S/W causes, data dependencies, tampering, etc.
Estimates of fault occurrence in transaction-based systems (Avresky, Hardware
and Software Fault Tolerance in Parallel Computing Systems, Ellis Horwood,
1992):
Fault cause          Percentage
H/W                  40%
Environment          5%
S/W                  30%
Data and operation   25%
2. Fault Modes
o Fail-silent – a component ceases to operate or to produce any output.
o Fail-babbling – a component continues to operate, but produces
meaningless output or results.
o Fail-Byzantine – a component continues to operate as if no fault occurred,
producing meaningful but incorrect output or results. (These three modes
are illustrated in the sketch that follows this list.)
3. Fault Types
o *Transient (soft)
o Intermittent
o Permanent (hard)
*There is an important difference between transient and intermittent faults. Intermittent
faults are due to repeated environmental conditions, hardware limitations, software
problems, or poor design. As such, they are potentially detectable and thus repairable.
Transient faults are due to transient conditions – such as natural alpha or gamma
radiation. Since they do not recur and the hardware is not damaged, they are generally
not repairable.
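A small sketch of the three fault modes; read_sensor, the temperature values, and the range check are hypothetical, chosen only to show that fail-silent and fail-babbling behavior is caught by simple checks while fail-Byzantine behavior is not:

    import random

    TRUE_TEMP_C = 21.4   # assumed "ground truth" for this illustration

    def read_sensor(mode):
        if mode == "healthy":
            return TRUE_TEMP_C
        if mode == "fail-silent":
            return None                        # no output at all: caught by a timeout/None check
        if mode == "fail-babbling":
            return random.uniform(-1e6, 1e6)   # meaningless output: caught by a range check
        if mode == "fail-byzantine":
            return TRUE_TEMP_C + 5.0           # plausible but wrong: passes both checks

    def looks_valid(reading):
        """Simple acceptance test: present and within a physically sensible range."""
        return reading is not None and -40.0 <= reading <= 85.0

    for mode in ("healthy", "fail-silent", "fail-babbling", "fail-byzantine"):
        r = read_sensor(mode)
        print(f"{mode:15s} reading={r!r:>22s} passes check: {looks_valid(r)}")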
Importance of Fault-tolerant systems
o Providing services that are critical.
o Providing services when the cost of repair is high.
o Providing services when access for repair is difficult.
o Providing services when high availability is important.
System characteristics
1. Single computer systems
o Low redundancy of needed components
o High reliability
o Simple detection, location, and containment of faults
2. Distributed and Parallel systems
o High redundancy of needed components
o Low reliability – the more components in a system, the higher the
probability that any one component will fail
o Difficult detection, location, and containment of faults
*A parallel system of N nodes, each with reliability Rn(t), will have a system reliability
R(t) < prod(Rn(t)) due to communication, synchronization, and exclusion failures. A
distributed system of N nodes will likewise have a lower system R(t) than N equivalent
single-computer systems, for the same reasons.
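To make this concrete, a small numerical sketch assuming identical nodes and an illustrative per-node reliability of 0.99 at some fixed time t; the product prod(Rn(t)) is itself only an upper bound on the system R(t), and it falls off quickly as N grows:

    # Assumed per-node reliability at the mission time of interest (illustrative value).
    r_node = 0.99

    for n in (1, 10, 100, 500):
        r_upper_bound = r_node ** n   # prod(Rn(t)) with N identical nodes
        print(f"N = {n:3d}   upper bound on system R(t) = {r_upper_bound:.3f}")
    # N = 100 gives ~0.366: even 99%-reliable nodes make a fragile 100-node system.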
Software Systems Design objectives
1. Design a system with a low probability of failure.
2. Add software methodologies for adequate fault coverage.
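One possible illustration of objective 2 (a sketch only; with_retry and flaky_read are hypothetical names, and these notes do not prescribe this particular method): a simple retry wrapper covers transient faults, which as noted earlier do not recur, and falls back to a degraded result when retries are exhausted, a modest form of graceful degradation:

    import random
    import time

    def with_retry(operation, attempts=3, delay_s=0.1, fallback=None):
        """Run operation; retry on failure to ride out transient faults,
        then return a degraded fallback result rather than failing outright."""
        for _ in range(attempts):
            try:
                return operation()
            except Exception:
                time.sleep(delay_s)   # brief pause before retrying
        return fallback               # controlled (graceful) degradation

    def flaky_read():
        """Stand-in operation that suffers a transient fault about half the time."""
        if random.random() < 0.5:
            raise IOError("transient read fault")
        return 42

    print(with_retry(flaky_read, fallback="last known good value"))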
Limitations and Practicality of Dependability Strategies
1. Dependable systems require attention to faults at four levels:
a. fault avoidance through proper design
b. fault detection and diagnosis
c. fault forecasting
d. fault tolerance
2. All of these strategies are limited, imperfect, and themselves subject to faults. As a
consequence, they are typically applied in a recursive manner. In addition, tools
and methods should be applied at each stage of the “fault → error → failure”
progression.
3. Strategies that are the most effective, or that have the least impact and overhead,
are typically those specific to a given application, data domain, and system, rather
than strategies that are more general and abstract.
4. Building dependability into a system usually increases the budget by a factor >2
(in both time and money).
5. Fault-tolerance methods with greater coverage may mask faults that could
otherwise be easily corrected.
6. Failures may be the result of cascaded faults, errors, and previously undetected
failures. A method or tool may be attacking the leaves of a problem and ignoring
the root.