CS 7500
Dr. Cannon
Fault-tolerant Systems
Spring ‘05
Introduction to Fault-tolerant Systems
Terms
1. Fault-tolerance: Continuing to provide services in the event of faults.
2. Fault: A H/W, S/W, communication, or data input problem.
3. Error: A state of incorrect operation caused by a fault that may eventually lead
to a failure.
4. Failure: Disruption of normal services or incorrect operation due to an error.
5. Reliability R(t): The probability that a system or component will be normally
functioning at time t
6. Unreliability F(t): The probability that a system or component will have failed by time t; F(t) = 1 - R(t). (See the worked example after this list.)
7. Fault coverage: The ability to detect, locate, contain, and recover from a fault in
order to return to normal operation.
8. Fault latency: The difference in time between the occurrence of a fault and the
detection of a resulting error.
9. Error latency: The difference in time between the occurrence of an error and the resulting failure.
10. Graceful degradation: Prioritizing remaining resource allocations to provide
critical services.
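
As a rough illustration of terms 5 and 6, the sketch below evaluates R(t) and F(t) under an assumed exponential failure model with a hypothetical constant failure rate; the notes do not specify a failure model, so treat the numbers as illustrative only.

    import math

    # Assumed exponential failure model: R(t) = exp(-rate * t), F(t) = 1 - R(t).
    # The failure rate and operating time below are hypothetical.
    failure_rate = 1e-4        # failures per hour (assumed)
    t = 1000.0                 # operating time in hours

    R = math.exp(-failure_rate * t)   # reliability: probability of still working at time t
    F = 1.0 - R                       # unreliability: probability of having failed by time t

    print(f"R({t:.0f} h) = {R:.4f}")  # approximately 0.9048
    print(f"F({t:.0f} h) = {F:.4f}")  # approximately 0.0952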
Fault features
1. Causes
o (H/W): heat, age, radiation, vibration or mechanical stress, EM fields,
power loss, etc.
o (S/W): resource limits, code corruption, specification errors, design
errors, bugs, etc.
o (Com): signal corruption or interference, loss of support services, H/W
causes, S/W causes, etc.
2. Fault Modes
o *Transient, intermittent, permanent
o Fail-silent – a component ceases to operate or perform
o Fail-babbling – a component continues to operate, but produces
meaningless output or results.
o Fail-Byzantine – a component continues to operate as if no fault occurred,
producing meaningful but incorrect output or results.
*There is an important difference between transient and intermittent faults. An intermittent fault is due to repeated environmental conditions, hardware limitations, software problems, or poor design. As such, intermittent faults are potentially detectable and thus repairable. Transient faults are due to transient conditions – such as natural alpha or gamma radiation. Since they do not recur and the hardware is not damaged, transient faults are generally not repairable.
Importance of Fault-tolerant systems
• Providing critical services
• Providing services when the cost of repair is high
• Providing services when access for repair is difficult
• Providing services when high availability is important
System characteristics
1. Single computer systems
o Low redundancy of needed components
o High reliability
o Simple detection, location, and containment of faults
2. Distributed and Parallel systems
o High redundancy of needed components
o Low reliability – the more components in a system, the higher the
probability that any one component will fail
o Difficult detection, location, and containment of faults
*A parallel system of N nodes, each with reliability Rn(t), will have a system reliability R(t) < prod(Rn(t)) due to communications, synchronization, and exclusion failures (see the sketch below).
*A distributed system of N nodes will have a lower system R(t) than N equivalent single-computer systems due to communications, synchronization, and exclusion failures.
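
A minimal sketch of the series-reliability bound noted above; the per-node reliabilities are hypothetical, and the point is only that the product (and therefore the real system reliability, which is lower still) shrinks as nodes are added.

    import math

    # Hypothetical per-node reliabilities Rn(t) at some fixed time t.
    node_reliabilities = [0.999, 0.998, 0.999, 0.997]

    # If every node must work, prod(Rn(t)) is an upper bound on system reliability.
    # Communication, synchronization, and exclusion failures push R(t) below it.
    upper_bound = math.prod(node_reliabilities)

    print(f"prod(Rn(t)) = {upper_bound:.4f}")   # about 0.9930; actual R(t) is lower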
Systems Design objectives
1. Design a system with a low probability of failure.
2. Add methodologies for adequate fault coverage
Intro to H/W fault-coverage systems
1. Passive TMR (Triple Modular Redundancy)
[Figure: Example A – three CPUs, a single voter, and a single memory; Example B – three CPUs, each with its own voter and memory (triplicated voting). A simple majority-voter sketch follows this list.]
2. Active replacement – a set of unused components is held in reserve. In the event
of a fault, a failed component is replaced from this set.
3. Temporal TMR (TTMR) – A single component is re-used sequentially and the
result is voted on. (Useful for coverage of transient faults)
4. Reconfiguration – A system is rebuilt from remaining resources (Useful primarily
for FPGA-based systems). This also applies to systems where trying different
configurations increases the odds of detecting an error. For a simple example,
each time a program is executed, a buffer is allocated to a different location in
memory.
5. Lockstep redundancy – Duplicate systems are run in parallel with all inputs going
to both systems. The output of one system is utilized until an error is detected – at
which time the backup system output is used or switched in. (This is sometimes
called Hot Backup.)
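
A minimal majority-voter sketch for passive TMR (item 1 above). The replica outputs are hypothetical stand-ins for the three CPUs; a real voter would also have to deal with timing and with its own failure.

    from collections import Counter

    def tmr_vote(results):
        """Return the majority value from three replicated results.

        Raises an error if no two results agree (an uncovered fault)."""
        value, count = Counter(results).most_common(1)[0]
        if count < 2:
            raise RuntimeError("no majority: fault not masked")
        return value

    # Hypothetical outputs from three replicated CPUs; one replica is faulty.
    replica_outputs = [42, 42, 41]
    print(tmr_vote(replica_outputs))   # 42 -- the single fault is masked by voting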
Intro to data coverage systems
1. Checksums – a message is treated as a sequence of integers and summed (modulo N). The checksum is stored at the end of the message. The receiver re-computes the checksum and compares. (See the sketch after this list.)
2. Cyclic-Redundancy Checksums – (A variation of checksums) a message is treated as the coefficients of a polynomial. The remainder of dividing that polynomial by a fixed generator polynomial is calculated and stored at the end of the message.
3. Parity – a message is divided into small integers (usually bytes). A bit recording whether each integer contains an even or odd number of 1 bits is added as an extra bit at the end of that integer. (Parity can also include more than one bit and be done by even/odd powers.)
4. Error-correcting codes – a message is coded so that the change of any one bit can
be detected and corrected.
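
A minimal modulo-N checksum sketch for item 1; N = 256 and the framing (checksum appended as the final byte) are assumptions, since the notes leave them open.

    def checksum(data: bytes) -> int:
        # Treat the message as a sequence of byte-sized integers, summed modulo 256.
        return sum(data) % 256

    def send(message: bytes) -> bytes:
        # Store the checksum at the end of the message.
        return message + bytes([checksum(message)])

    def receive(frame: bytes) -> bytes:
        # Re-compute the checksum over the body and compare with the stored value.
        body, stored = frame[:-1], frame[-1]
        if checksum(body) != stored:
            raise ValueError("checksum mismatch: data fault detected")
        return body

    frame = send(b"hello")
    print(receive(frame))                             # b'hello'
    corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # flip one bit in transit
    try:
        receive(corrupted)
    except ValueError as err:
        print(err)                                    # checksum mismatch: data fault detected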
Intro to S/W methodologies for dependable systems
1. Fault Avoidance
2. Fault Tolerance
3. Fault Removal
4. Fault Forecasting
Intro to S/W coverage systems
1. Redundancy with independent modules
o TMR
o TTMR
2. Consistency checking
3. Periodic self-tests
4. Periodic heartbeats
5. Checkpointing and rollback (see the sketch below)
6. Transaction systems
7. Roll forward systems
*S/W and H/W systems show similar failure characteristics
*Fault coverage is often cheaper in S/W
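
A minimal checkpoint-and-rollback sketch for item 5 above; keeping the checkpoint in memory is a simplification, since a real system would write it to stable storage.

    import copy

    class CheckpointedTask:
        """Toy stateful task that can roll back to its last known-good state."""

        def __init__(self):
            self.state = {"processed": 0}
            self.saved = copy.deepcopy(self.state)

        def checkpoint(self):
            # Save a known-good copy of the current state.
            self.saved = copy.deepcopy(self.state)

        def rollback(self):
            # Discard the (possibly corrupted) state and resume from the checkpoint.
            self.state = copy.deepcopy(self.saved)

        def process(self, item, fail=False):
            self.state["processed"] += 1
            if fail:   # simulated fault part-way through processing
                raise RuntimeError(f"fault while processing {item}")

    task = CheckpointedTask()
    task.process("a")
    task.checkpoint()
    try:
        task.process("b", fail=True)   # error detected after a partial update
    except RuntimeError:
        task.rollback()                # return to the checkpointed state
    print(task.state)                  # {'processed': 1}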
S/W testing for fault avoidance
1. robustness testing – the system is tested with inputs outside of its specifications
2. limit testing – inputs are tested at the boundaries of the specification (see the sketch after this list)
3. path testing – test all possible execution paths for the possible inputs
4. branch testing – test each path from every branch
5. special value testing – test against known outputs for special value inputs
6. interface testing – test all module interface interactions
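
A small limit-testing sketch for item 2, using Python's built-in unittest module; the clamp function and its 0 to 100 specification are hypothetical.

    import unittest

    def clamp(value, low=0, high=100):
        # Hypothetical module under test: keep value within [low, high].
        return max(low, min(high, value))

    class LimitTests(unittest.TestCase):
        # Limit testing: exercise inputs at and just beyond the specified boundaries.
        def test_boundaries(self):
            self.assertEqual(clamp(0), 0)        # lower boundary
            self.assertEqual(clamp(100), 100)    # upper boundary
            self.assertEqual(clamp(-1), 0)       # just below the lower limit
            self.assertEqual(clamp(101), 100)    # just above the upper limit

    if __name__ == "__main__":
        unittest.main()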
S/W verification
• formal methods (example: CSP notation)
• symbolic execution (example: if executing a module on symbolic inputs shows that (a+b)² produces a² + 2ab + b², the module will be correct for all values of a and b; see the sketch below)
*S/W verification methods have limited utility in the real world
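
A small illustration of the symbolic-execution example, assuming the third-party sympy package is available; the add_then_square module below is hypothetical.

    import sympy

    def add_then_square(x, y):
        # Hypothetical module under verification: should compute (x + y) squared.
        return (x + y) ** 2

    a, b = sympy.symbols("a b")
    result = sympy.expand(add_then_square(a, b))   # execute on symbols, not numbers
    print(result)                                  # a**2 + 2*a*b + b**2
    print(result.equals(a**2 + 2*a*b + b**2))      # True: correct for all a and b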
S/W Error Rates
Software faults do not all have equal probability of producing detectable errors. In one study, 15.25 M executions of two different programs detected 11 errors in one program and 19 errors in the other. For each error, the number of executions prior to error detection was noted, as was the rate at which each error occurred across all executions.
Fault expectations
The type of faults a system designer should expect depends heavily upon the type of
system being implemented. One operating in a harsh radiation environment might expect
different faults than one operating in a quiet office. In general, and in the absence of harsh environmental conditions, a system designer typically expects that the most common
faults will be fail-silent. This is because hardware components typically work or they
don’t. The range of problems that permit a hardware component to fail transiently is quite
narrow. It is difficult to put a number to this, but a guess might be in the range of 95%.
As you would expect, fail-silent problems are generally self-confining.
Byzantine faults are probably the least common – but often the most devastating. They
are also the most difficult to detect, locate, contain, and recover from.
Recovery
The type of recovery system implemented depends heavily on the purpose or application
of the system. No one set of categories defines all recovery types, but the following
issues should be considered.

• Resume recovery. In many systems, recovery only requires that the system resumes normal operation within a reasonable time. Loss of data and incorrect outputs during a failure are not an important issue.
• Graceful degradation. In this situation, recovery may not be complete, but needs to resume normal operation for the most critical or highest-priority services.
• Lossless recovery. Here, no data corruption or loss is tolerated. A system must not only detect, locate, and contain faults, but prevent any fault-induced errors from reaching the outside or affecting system actions.
• Continuous recovery. This is similar to resume recovery, but no recovery time is tolerated. Loss of data is tolerated, but no delays in system operation are tolerated. (This is a bit counter-intuitive, since one normally thinks of data loss as a gap in system operation. An example might be a system where output data is momentarily lost, but no halt causes input to be missed or lost.)
• Total recovery. In this situation, no fault effect should be noted outside the system. To the world, no evidence is given that a fault occurred.