IEEE TRANSACTIONS ON COMPUTERS, VOL. C-25, NO. 12, DECEMBER 1976
Fault-Tolerant Systems
ALGIRDAS AVIZIENIS, FELLOW, IEEE
Abstract-Basic concepts, motivation, and techniques of fault
tolerance are discussed in this paper. The topics include fault
classification, redundancy techniques, reliability modeling and
prediction, examples of fault-tolerant computers, and some approaches to the problem of tolerating design faults.
Index Terms-Fault classification, fault tolerance, fault-tolerant
computer design, redundancy techniques, reliability modeling,
reliable computing.
I. INTRODUCTION
THE PROBLEM of reliable computing is as old as the
first computers, which employed relays, vacuum tubes,
delay line and cathode-ray tube storage, and other relatively unreliable components. Semiconductor components
and magnetic core storage elements with much greater
reliability were introduced in the second computer generation, and the reliability problem was significantly alleviated. However, much greater reliability requirements
were posed by the space program in the early 1960's, as well
as by other real-time applications in which human lives
could be jeopardized by computer failure. In addition to
improvements in component reliability and in test methods, redundancy at various levels of system organization
was investigated and employed to increase the probability
of uninterrupted correct operation over specified time
intervals.
The concept of fault tolerance unifies various approaches to reliability assurance by means of testing, diagnosis, and redundancy in machine organization and
operation. It emerged in the late 1960's and reached maturity with the formation of the IEEE Computer Society
Technical Committee on Fault-Tolerant Computing in
1969 and the subsequent holding of the First International
Symposium on Fault-Tolerant Computing in 1971. The
Symposium has been held annually since then, and it has
become the major international forum for the discussion
of current experience and new ideas in system design, redundancy techniques, system modeling and analysis,
testing and diagnosis methods, and other related areas.
II. BASIC CONCEPTS
The most widely accepted definition of a fault-tolerant computing system is that it is a system which has the built-in capability (without external assistance) to preserve the continued correct execution of its programs and input/output (I/O) functions in the presence of a certain set of operational faults. An operational fault is an unspecified (failure-induced) change in the value of one or more logic variables in the hardware of the system. It is the immediate consequence of a physical failure event. The event may be a permanent component failure, a temporary or intermittent component malfunction, or externally originating interference with the operation of the system. "Correct execution" means that the programs, the data, and the results do not contain errors, and that the execution time does not exceed a specified limit. An error is the symptom of a fault, i.e., it is the change in an instruction or a data word which is caused by the presence of a fault at the point of the physical failure event.

Manuscript received March 22, 1976; revised June 23, 1976. This work was supported by the National Science Foundation under Grant MCS72-03633 A04. The author is with the Department of Computer Science, University of California, Los Angeles, CA 90024.
The above definition of a fault-tolerant system is based
on the premise that the system will function correctly when
operational faults are not present; that is, it postulates that
design faults have been eliminated from the system prior
to its use. A more general interpretation of fault tolerance
also includes the ability to tolerate thus-far-undetected
design faults. Design faults occur both in hardware and in
software of the system. They are the consequences of either: 1) an incomplete or incorrect specification; or 2) an
incorrect translation of the specification into the final design: the assemblies of components or the sequences of machine language instructions and sets of data words.
In order to tolerate left-over design faults, the system
needs provisions to recognize their presence and to deactivate the affected hardware or software while program
execution continues using the remaining parts of the system.
In the strict sense, a system is fault-tolerant only if external assistance is not required to implement its fault
tolerance. Most of the current-generation computers require manual assistance; however, it must be noted that
nearly every computer employs some fault-tolerance
techniques, such as microdiagnosis [1], [4], parity checking
[2], watchdog timers [3], etc. The fault-tolerance process
is interrupted at the point at which a maintenance specialist takes over the control of the system. An off-line
replacement or repair of deactivated parts of the system
is acceptable in a fault-tolerant system, as long as the
correct execution of programs is being continued without
external intervention.
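Watchdog timers, one of the fault-tolerance techniques mentioned above, can be sketched in modern terms. The following Python fragment is an illustrative sketch only; the class name, timeout value, and polling scheme are assumptions, not from the paper. A monitored computation must refresh the timer periodically, and a missed deadline is treated as a detected fault that can trigger a recovery action.

```python
import time

class WatchdogTimer:
    """Toy watchdog: the monitored computation must call kick()
    within `timeout` seconds of the previous kick, or expired()
    reports a fault so a recovery action can be started."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = time.monotonic()

    def kick(self):                      # periodic "I am alive" signal
        self.last_kick = time.monotonic()

    def expired(self):                   # polled by the fault monitor
        return time.monotonic() - self.last_kick > self.timeout

wd = WatchdogTimer(timeout=0.05)
wd.kick()
assert not wd.expired()        # heartbeat is fresh
time.sleep(0.06)               # simulate a hung computation
assert wd.expired()            # watchdog flags the fault
```

In a real machine the "kick" is a hardware strobe and expiry raises an interrupt; the polling loop here only mimics that behavior.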
A partially fault-tolerant ("fail-soft," "gracefully degrading") system is one which has the built-in capability
to reduce its specified ("full") computing capacity and to
"shrink" to a smaller system by discarding some previously
used programs or by slowing down below the specified rate
of execution. The reduction occurs because of decreases
in the working hardware configuration that are due to
operational faults. Another cause of such "shrinking" may
be the discovery of left-over design faults in the hardware
or programs.
The preceding discussion makes it evident that the concept of fault tolerance cuts across the traditional hardware/software boundaries and involves the entire computing system: the equipment, the programs, and the operating environment as well. Fault tolerance is an essential aspect of reliable computing, which in the most general sense means "the correct execution of a specified set of algorithms," and encompasses all the following elements: correctness and completeness of software specifications; testing and verification (proofs) of programs and microprograms; elimination of hardware design errors; continued correct execution of programs and protection of data in the presence of hardware (operational) faults; and protection of the computing system against error-induced disruption or unauthorized access of programs and data. The present discussion deals with the problem of hardware (operational) faults; some aspects of design fault tolerance are considered in the last section.

III. MOTIVATION FOR FAULT TOLERANCE

Two distinct approaches have been used in order to reach the goal of reliable computing. These approaches are applicable to all parts of the computing system, including its hardware elements, microprograms, system programs, data bases, and user programs.

In the traditional approach (to be called fault intolerance in the following discussion) the reliability of computing is assured by a priori elimination of the causes of unreliability, i.e., of faults. This elimination takes place before regular use begins, and the resources that are allocated to attain reliability are spent on perfecting the system prior to its field use. Redundancy is not employed, and all parts of the system must function correctly at all times. Since in practice it has not been possible to assure the complete a priori elimination of all causes of unreliability, the goal of fault intolerance is to reduce the unreliability (expressed as the probability of system failure over the duration of the specified computing process) or the unavailability of the system to an acceptably low value. To supplement this approach, manual maintenance procedures are devised which return the system to an operating condition after a failure. The cost of providing maintenance personnel and the cost of the disruption and delay of computing also are parts of the overall cost of using the fault-intolerance approach. The procedures which have led to the attainment of reliable systems using this approach are: acquisition of the most reliable components within the given cost and performance constraints; use of thoroughly refined techniques for the interconnection of components and assembly of subsystems; packaging of the hardware to screen out expected forms of interference; and carrying out of comprehensive testing to eliminate hardware and software design faults.

Once the design has been completed, a quantitative prediction of system reliability is made using known or predicted failure rates for the components and interconnections. In a "purely" fault-intolerant (i.e., nonredundant) design, the probability of fault-free hardware operation is equated to the probability of correct program execution. Such a design is characterized by the decision to invest all the reliability resources into high-reliability components and refinement of assembly, packaging, and testing techniques. Occasional system failures are accepted as a necessary evil, and manual maintenance is provided for their correction. To facilitate maintenance, some built-in error detection, diagnosis, and retry techniques are provided. (These techniques are discussed in more detail later.) This is the current practice in computer system design; the trend is toward an increasing number of built-in aids for the maintenance engineer.

The traditional (fault-intolerance) approach of diagnosis-aided manual repair, however, has proved to be an insufficient solution in many cases because of at least three reasons: 1) the unacceptability of the delays and interruptions of real-time programs caused by manual repair action; 2) the inaccessibility of some systems to manual repair; and 3) the excessively high costs of lost time and of maintenance in many installations. The direct dependence of human lives on some computer-controlled operations (air traffic control, manned spaceflight, etc.) has added a psychological reason to object to the fault-intolerance approach: while only one system in a million may fail over a given time interval, all users of the entire million systems are subject to the anticipation that they may be involved in this failure.

An alternate approach which alleviates most of the above shortcomings of the traditional fault-intolerance approach is offered by fault tolerance. In this approach, the reliability of computing is assured by the use of protective redundancy. The causes of unreliability are expected to be present and to induce errors during the computing process, but their disrupting effects are automatically counteracted by the redundancy. Reliable computing may be made possible despite certain classes of hardware failures, external interference with computer operation, and perhaps even remaining design errors in hardware and software. Some of the resources allocated to attain reliability are spent on protective redundancy. The redundant parts of the system (both hardware and software) either take part in the computing process or are present in a standby condition, ready to act automatically to preserve its undisrupted continuation. This contrasts with the maintenance procedures in a fault-intolerant system, which are invoked after the computing process has been disrupted, and the system remains "down" for the duration of the maintenance period.

It is evident that the two approaches are complementary and that the resources allocated to attain the required reliability of computing may be divided between fault
tolerance and fault intolerance. Experience and analysis both seem to indicate that a balanced allocation of resources between the two approaches is most likely to yield the highest reliability of computing. Fault tolerance does not entirely eliminate the need for reliable components; instead, it offers the option to allocate part of the reliability resources to the inclusion of redundancy. One reason for the use of a fault-tolerant design is to achieve a reliability or availability prediction that cannot be attained by the purely fault-intolerant design. A second reason may be the attainment of a reliability (or availability) prediction that matches the purely fault-intolerant design at a lower overall cost of implementation. A third reason is the psychological support to the users who know that provisions have been made to handle faults automatically as a regular part of the computing process. An overview of past practice shows that fault intolerance has been the dominant choice in both hardware and software in the 1950's and 1960's. In recent years the fault-tolerance approach has been making significant inroads in hardware system design; its application in software has remained very limited. The main argument against the use of fault-tolerance techniques in computer systems has been the cost of redundancy. The evolution of component technology into large-scale integration, the decreasing cost of mass-produced hardware elements, the very high cost of software testing, and the increasing reliability requirements all favor increasing use of fault tolerance in computer systems of the future.

IV. ASPECTS OF FAULT TOLERANCE

Investigators and designers of fault-tolerant systems have to deal with three fundamental aspects of fault tolerance. They are: 1) systematic identification and characterization of the set of faults to be tolerated; 2) development and choice of redundancy techniques which provide protection against the identified faults; and 3) analytic or experimental prediction of the effectiveness of redundancy techniques [1]. The following discussion deals with these aspects as they apply to operational hardware faults.

A. Characterization of Operational Faults

Operational faults are caused by three classes of failure events that affect the hardware of the system during execution of programs: permanent failures of hardware components; temporary malfunctions of components; and external interference with computer operation.

There are three useful dimensions for the classification of operational faults: duration (transient versus permanent); extent (local versus distributed); and value (determinate versus indeterminate).

Transient faults are faults of limited duration, caused either by temporary malfunctions of components or by external interference. Recovery requires waiting until the transient subsides, followed by the reestablishment of a known correct state of the programs and the data (often called the "rollback point" or "checkpoint"). The characterization of a transient fault must include a "maximum duration" parameter; faults that last longer will be interpreted as permanent by the recovery algorithm. Other characteristics are the arrival model and the duration of transients [1], [5].

Permanent faults are caused by permanent (also called "solid") failures of components; recovery from them requires either the presence of a duplicate (spare) component, or provisions to continue operation without the failed part. In addition, the reestablishment of the "rollback point" may be required as well, following the corrective action on the failed part. The classifications according to extent and according to value are applicable to both transient and permanent faults.

The extent of a fault describes how many logic variables in the hardware are simultaneously affected by the fault which is due to one failure event. Local (single) faults are those that affect only single logic variables, while distributed (related multiple) faults are those that affect two or more variables, one module, or an entire system. The physical proximity of logic elements in contemporary medium- and large-scale integration (MSI and LSI) circuitry has made distributed faults much more likely than in the discrete component designs of the past. Distributed faults are also caused by single failures of some critical elements in a computer system, for example: clocks, power supplies, data buses, switches for reconfiguration, etc.

The value of a fault is determinate when the logic values affected by the fault assume a constant value ("stuck on 0" or "stuck on 1") throughout its entire duration. The fault value is indeterminate when it varies between "0" and "1," but not in accord with design specifications, throughout the duration of the fault. The determinacy of a fault depends on the physical failure mechanisms, such as shorts, burnouts, or drift of component values.

It is important to note that the description of fault extent and fault value applies at the origin of the fault, that is, at the point at which the failure event has actually taken place. The introduction of one or more faulty (incorrect) logic values into the computing process will often lead to more extensive fault symptoms in the logic downstream from the point of failure. At other times, the presence of a faulty logic value will be masked by other (correct) logic variables and no symptoms at all will be seen downstream.

Confusion and ambiguity are avoided when the term "fault" is restricted to the change in logic variable(s) at the point of the actual hardware failure. The fault-caused changes of logic variables which are observed downstream on the outputs of correctly functioning logic elements will be called "errors." This choice of terms establishes the following cause-effect sequence.

1) The failure, which is a physical event, causes a fault, which is a change of logic variable(s) at the point of failure.

2) The fault supplies incorrect input(s) to the computing process and may cause an error to be produced by subsequent operations of failure-free logic circuits.
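The fault/error distinction and the masking of faulty logic values can be made concrete with a small gate-level sketch. The two-gate circuit, the `g1/0` stuck-at label, and the function names below are hypothetical illustrations, not from the paper: a determinate ("stuck on 0") fault at one gate becomes visible downstream as an error only for inputs that sensitize it, which is precisely what fault-detection test generation must discover.

```python
def circuit(a, b, fault=None):
    """Two-level circuit out = (a & b) | (a & ~b).
    `fault` optionally forces one internal line to a constant,
    modeling a determinate ("stuck-at") operational fault."""
    g1 = a & b
    g2 = a & (1 - b)
    if fault == "g1/0":          # gate g1 output stuck on 0
        g1 = 0
    return g1 | g2               # OR gate downstream of the fault point

# The fault produces a downstream error only for inputs that sensitize it:
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    good, bad = circuit(a, b), circuit(a, b, fault="g1/0")
    print(a, b, "error" if good != bad else "masked")
```

Only the input (1, 1) exposes the fault; for the other three inputs the faulty value is masked by correct logic variables, so a test set that detects this fault must include (1, 1).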
The number of points that can be observed for the purpose of fault detection is limited because integrated circuits are internally complex and have relatively few outputs. Digital-logic simulation programs which analyze the behavior of faulty logic circuits and predict the errors that will appear on the outputs (for a given class of faults) are essential tools for the generation of fault-detection tests [26]. A recent simulation and analysis system to analyze the behavior of faulty circuits is the Logic Analyzer for Maintenance Planning (LAMP) System [11]. In addition,
LAMP also performs logic design verification, generates
fault-detection tests, evaluates diagnostics, and produces
data for trouble-location manuals. LAMP exemplifies the
current trend toward multipurpose simulation systems in
digital system design.
B. Redundancy Techniques
The redundancy techniques that have been developed
to protect computer systems against operational faults may
assume three different forms: hardware (additional components), software (special programs), and time (repetition
of operations).
Hardware redundancy: This consists of the components
that are introduced in order to provide fault-tolerance. The
techniques of introducing hardware redundancy have been
classified (on the basis of terminal activity of modules) into
two categories: static redundancy and dynamic redundancy [2]. The static redundancy method is also known as
"masking" redundancy, since the redundant components
are employed to mask the effect of hardware failures within
a given hardware module, and the outputs of the module
remain unaffected as long as the protection is effective.
The static technique is applicable against both transient
and permanent faults. The redundant copies of an element
are permanently connected and powered; therefore they
provide fault masking instantaneously and automatically.
However, if the fault is not susceptible to masking and
causes an error, a delayed recovery is not provided. Two
forms of static redundancy have been applied in the space
program: replication of individual electronic components
and triple modular redundancy (TMR) with voting [3].
Several other forms have been studied but did not reach
application because of excessive cost or the need for
practically unrealizable special components [2]. The use
of static redundancy is based on the assumption that
failures of the redundant copies are independent. For this
reason, use of static redundancy is difficult to justify within
integrated circuit packages, in which many failure phenomena are likely to affect several adjacent components.
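As a sketch of static ("masking") redundancy, the following Python fragment models TMR with a bit-wise two-of-three majority voter; the module outputs and the corrupted bit pattern are invented values for illustration. The vote masks any fault confined to a single copy, instantaneously and without a recovery action, which is why the assumption of independent copy failures is essential.

```python
def majority(a, b, c):
    """Bit-wise 2-of-3 majority vote over three module outputs:
    each output bit equals the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)

# Three replicated modules compute the same word; one suffers a
# distributed fault that flips two of its output bits.
correct = 0b1011_0110
faulty  = correct ^ 0b0100_0001       # two bits corrupted in one copy
assert majority(correct, correct, faulty) == correct   # fault masked
```

If a second copy fails before the first is dealt with, the voter can be outvoted; static redundancy by itself provides no delayed recovery once masking is defeated.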
In the dynamic hardware redundancy approach, fault-caused errors or error signals appear at the outputs of a
module. Fault tolerance is implemented in two steps. The
presence of a fault is detected, and a subsequent recovery
action either eliminates the fault, or corrects the error.
When human assistance is entirely bypassed, dynamic
redundancy (usually supported by software and time re-
dundancy techniques) provides self-repair of a computer
system. Limited, i.e., human-controlled, use of dynamic
redundancy techniques in computer hardware is quite
extensive [1], [2], [4]. A common example is the use of
parity to detect errors in data transmission and storage.
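A minimal sketch of that parity example, with hypothetical word values: an even-parity bit is appended on transmission or storage, and any single-bit error makes the overall parity odd and is therefore detected (though not located or corrected).

```python
def add_parity(word):
    """Append an even-parity bit so the total number of 1's is even."""
    p = bin(word).count("1") & 1
    return (word << 1) | p

def parity_ok(coded):
    """Check a stored/transmitted word; any single-bit error is detected."""
    return bin(coded).count("1") % 2 == 0

coded = add_parity(0b1101_0010)
assert parity_ok(coded)               # error-free transfer
assert not parity_ok(coded ^ 0b100)   # one flipped bit is caught
```

Note that an even number of simultaneous bit errors passes the check; single-bit parity trades low redundancy for limited detection power.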
A complete application of dynamic redundancy requires
that a number of decisions should be made during the
functional design of the computer. The design choices include: level of modularization, fault-detection techniques,
type of recovery action, "hard core" protection, forms of
intermodule communication, validation of inputs, and
interfaces with system software [1]. The use of dynamic
redundancy has been inhibited because it needs an early
commitment in the hardware design process. In contrast,
static redundancy (and software redundancy, as well) can
be superimposed on an existing nonredundant design.
Software redundancy: This includes all additional
programs, program segments, instructions, and microinstructions which would not be needed in a fault-free
computer. They serve to provide either fault detection or
recovery in fault-tolerant computer systems, usually in
conjunction with dynamic hardware redundancy. Three
major forms of software redundancy are: 1) multiple
storage of critical programs and data; 2) test and diagnostic
programs or microprograms; and 3) parts of the operating
system which interact with hardware redundancy and
carry out program restarts. Combinations of all three forms
are found in most modern fault-tolerant computers.
Compared to hardware redundancy, an advantage of
software is the ability to add fault-tolerance features after
the hardware has been designed. This may allow the design
of fault-tolerant systems using redundant configurations of "off-the-shelf" hardware with very limited supporting
features, such as comparators, voters, parity checkers, etc.
Another advantage is the relatively easier modification and
refinement of the software features after their introduction
into the system. The main disadvantage of software redundancy is the difficulty of assuring that the software
features will be able to function correctly after the occurrence of a fault and that they will be invoked sufficiently
early, that is, before the fault-caused errors have irrevocably disrupted the programs or mutilated the data base.
Other disadvantages include the relatively high cost of
generating and verifying or validating the required software, the storage requirements (including the need to
tolerate failures of memories holding the software), and
the difficulty of estimating or proving the adequacy of the
software redundancy features. Combinations of software
and hardware redundancy are employed in most faulttolerant systems, but they differ in the choice of the point
in the detection and recovery sequence at which software
takes over control.
Time (execution) redundancy: This consists of repeating or acknowledging machine operations at various
levels: micro-operations, single instructions, program
segments, or entire programs. It is usually employed together with dynamic hardware and software redundancy
techniques. Two distinct goals of time redundancy are:
fault detection by means of repeated execution or acknowledgments; and recovery by program restarts or operation retries after fault detection or reconfiguration has
occurred.
The repeated execution of a program is probably the
oldest form of fault detection. While suitable to detect
errors due to transient faults, it is limited because consistent errors are produced by permanent faults, and comparison may fail to reveal the same error in the results. The
use of various forms of acknowledgments ("handshakes")
has been extensively used in general purpose systems,
especially for control error detection in secondary storage,
channels, and I/O devices. Retransmission is the usual
error correction method in this case. Another common use
of time redundancy is found in the identification and
correction of errors caused by transient faults, and in
program restarts after a hardware reconfiguration. This
is accomplished by the repetition or "rollback" (after error
detection) of single instructions, program segments, or
entire programs. While single-instruction retries are
transparent to the programmer, longer rollbacks require
programming constraints as well as protected storage for
the rollback address and for the state vector, including its
double-buffering. "Singular" events in a computer are
program-controlled events which should not be repeated
as part of a program rollback operation, for example,
real-time output commands which initiate irreversible
actions in the system under computer control. The potential damage makes it imperative that the provision for
handling of singular events should be incorporated in
rollback procedures.
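The rollback scheme above can be sketched as follows; the step function, the retry limit, and the use of an exception as the error signal are illustrative assumptions, not the paper's mechanism. A checkpoint of the state is saved before each program segment; a detected error causes re-execution from the checkpoint, which recovers from transient faults, while repeated failure is treated as permanent. Singular (irreversible) events would have to be deferred until a step commits.

```python
import copy

def run_with_rollback(step, state, n_steps, max_retries=3):
    """Execute n_steps of `step`, checkpointing state before each one.
    On a detected error (RuntimeError), roll back to the checkpoint
    and retry; exhausting the retries is treated as a permanent fault."""
    for i in range(n_steps):
        checkpoint = copy.deepcopy(state)     # protected rollback point
        for attempt in range(max_retries + 1):
            try:
                state = step(i, copy.deepcopy(checkpoint))
                break                         # segment committed
            except RuntimeError:
                pass                          # transient: retry from checkpoint
        else:
            raise SystemExit("permanent fault: recovery failed")
    return state

# A step that suffers one transient fault on its first attempt at i == 1:
glitched = {"done": False}
def step(i, s):
    if i == 1 and not glitched["done"]:
        glitched["done"] = True
        raise RuntimeError("transient error detected")
    s["sum"] = s.get("sum", 0) + i
    return s

print(run_with_rollback(step, {}, 4))   # the transient is absorbed
```

The double copy (one for the checkpoint, one for the working state) mirrors the double-buffering of the state vector mentioned above.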
The redundancy checkout problem is the problem of
verifying that the redundant parts are ready to be used
when faults occur. Implementation of checkout encounters
difficulties in systems with static hardware redundancy,
especially in component-redundant systems. Dynamically
redundant systems are inherently better suited for redundancy checkouts, since they possess extensive fault
detection and permit sequential switching-in of spare
modules to be tested. One critical requirement of checkout
is to provide systematic verification that the fault-indicating signals are operational. The techniques of fault
tolerance can be extended beyond the boundaries of the
fault-tolerant computer to effect automatic maintenance
of various peripheral systems which communicate with the
computer. The methodology of extending fault tolerance
consists of the development of fault-tolerant interfaces,
introduction of fault-detection methods in the systems
outside the computer, and programming of recovery sequences to be executed by the computer [6]. In commercial
general purpose systems, the reverse process has occasionally taken place. Because of the relatively high unreliability of peripheral mechanical devices, fault tolerance
began at the peripherals and only later was brought into
the central processing unit (CPU) and main memory.
C. Modeling and Analysis

The choice of redundancy techniques requires verification that the redundant system possesses the expected fault tolerance. Insufficiencies of the design may be uncovered, and the design can be refined by changes or additions of various forms of redundancy. The principal quantitative measures are reliability (with respect to permanent faults), survivability (with respect to transient faults), and availability [1], [5], [7], [8]. There are two approaches to the prediction of fault tolerance: 1) the analytic approach, in which fault-tolerance measures of the system are obtained from a mathematical model of the system, and 2) the experimental approach, in which faults are inserted either into a simulated model of a system, or into a prototype of the actual hardware, and fault-tolerance measures are estimated from statistical data.

The class of static reliability models is suitable for the reliability prediction of systems with static hardware redundancy. The redundant elements are assumed to be permanently connected and to fail statistically independently. They have the same failure rate and are instantaneously available to perform the masking of a failure with unity probability of success. Under these assumptions, the reliability of a redundant system is obtained as the sum of the reliabilities of all distinct configurations (including none or some failed parts) that do not lead to system failure. Reliability models of static redundancy are found in handbooks and textbooks of reliability theory and are used for reliability analysis of various redundant structures, e.g., relay contact networks, aircraft frames, etc. [8].

Dynamic redundancy requires the success of consecutive actions of fault detection and recovery in order to utilize redundant (spare) parts. The use of static reliability models for the dynamic case is equivalent to assuming unity probability of success of both actions; for this reason, very high reliabilities can be predicted as the number of spares is increased. It was recognized early in the studies of dynamic redundancy that even slight imperfections of detection and recovery will leave some spares unused. The effect of such imperfections was formalized in the reliability model through the concept of "coverage," defined as the conditional probability of successful recovery, given that a fault has occurred [7]. In general, the dynamic model must represent the entire complexity of the proposed fault-tolerant system, including: 1) differing failure rates for powered and unpowered modules; 2) increase in module failure rate due to fault detection and switching hardware; 3) number of spares of each module; 4) imperfect fault detection and recovery; 5) method of "hard core" protection; 6) existing intramodule fault tolerance; 7) extent and value of the faults; and 8) duration and distribution of expected transient faults. Recent work has considered repairable systems and models that include transient as well as permanent faults [5]. An important goal of further study remains the development of models that consider the characteristics of both hardware and software and consider both permanent and transient faults. Analytical models of dynamically redundant systems are complex because of the number of different parameters that may be varied in the search for a good design. A very useful tool for the study of such systems is an interactive computer program
which readily allows the variation of the fault-tolerance
parameters of the redundant system for online design re-
finement [5], [7].
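A sketch of both model classes, using assumed failure-rate, mission-time, and coverage values: the static model sums the reliabilities of the non-failing TMR configurations, while a deliberately crude dynamic model shows how coverage below unity limits the benefit of a spare. The simplified standby formula assumes an unpowered spare that cannot fail and ignores switching hardware, which are two of the very complexities the full dynamic model must represent.

```python
from math import exp

LAMBDA, T = 1e-4, 1000.0          # assumed failure rate (1/h), mission (h)
R = exp(-LAMBDA * T)              # nonredundant module reliability

# Static model: TMR survives if at most one of three copies fails;
# sum the reliabilities of the non-failing configurations.
R_tmr = R**3 + 3 * R**2 * (1 - R)

# Crude dynamic model: one cold spare, switched in with coverage c
# (the conditional probability of successful recovery given a fault);
# the spare is assumed not to fail while unpowered.
def R_standby(c):
    return R + c * (1 - R) * R

for c in (1.0, 0.99, 0.9):
    print(f"coverage {c:.2f}: R = {R_standby(c):.6f}")
print(f"simplex  R = {R:.6f}, TMR R = {R_tmr:.6f}")
```

Even this toy model reproduces the qualitative conclusion of the text: with perfect coverage the spare helps greatly, and every percent of imperfection erodes the predicted gain.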
Two approaches to experimental prediction of reliability are simulation and experimentation with a hardware prototype. While their use is more costly and time
consuming, the experimental methods are essential when
the available analytic models do not adequately represent
the complex structure of the system or the nature of the
expected faults. An accurate description of the system and
characterization of the faults are the principal prerequisites when simulation is employed to derive the fault-tolerance estimates for the computer [9]. This approach has
been extensively employed in reliability prediction for the
redundant SATURN V guidance computer [3]. Continued
progress in the development of digital simulators has taken
place since then [26]. A contemporary multipurpose simulation system is the Bell Laboratories' LAMP system
[11].
The use of hardware prototypes requires a large investment of effort in constructing the prototype, but avoids
the inaccuracies which may occur in postulating the fault
effects in a simulated model of the system. An example of
the use of a hardware prototype is the experimental
fault-tolerant JPL-STAR computer [10]. In the JPL-STAR computer an electronic "black box" was used to
inject faults of adjustable duration and extent at selected
points in the hardware of the system. Another example of
the experimental approach is the OAO processor, in which
a component-redundant system was completely disassembled to determine the number of failed components
after 3000 hours of operation [3]. A prototype also was used
to derive the fault dictionary for the No. 1 ESS System
[12].
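The effect of the STAR-style injection "black box" can be mimicked in software (a hypothetical sketch, not the JPL hardware; the function names are invented): a fault of adjustable position and duration is injected into one copy of a computation, and a duplicate copy counts how often the mismatch is detected.

```python
def compute(x):
    # The simulated module: some deterministic computation on 8-bit data
    return (x * 3 + 1) & 0xFF

def faulty_compute(x, bit, duration, cycle):
    # Inject a fault of adjustable position (`bit`) and duration
    # (`duration` cycles) into the module's output
    y = compute(x)
    if cycle < duration:
        y ^= 1 << bit
    return y

def run_duplexed(x, bit, duration, cycles):
    # Duplex comparison: a fault is detected whenever the two copies disagree
    return sum(
        faulty_compute(x, bit, duration, cycle) != compute(x)
        for cycle in range(cycles)
    )

print(run_duplexed(7, 2, 3, 10))  # fault active for 3 of 10 cycles -> prints 3
```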
V. SYSTEM DESIGN
The ultimate proof of the effectiveness of the fault-tolerance techniques discussed in the preceding section is
found in the performance of existing systems. For the
convenience of discussion, we make the distinction between 1) fully fault-tolerant (or self-repairing) and 2)
manually-controlled systems with fault-tolerance features. The former complete their recovery actions without
the participation of a maintenance specialist, while the
latter depend on human decision-making as part of the
recovery sequence. These decisions may take place at
various stages of the sequence, from the initiation of diagnostics to the operation of a switch which disconnects
a failed part of the system. The subsequent discussion will
focus on examples of fully fault-tolerant systems.
The fully fault-tolerant systems may be further classified according to the availability of external (offline) repair.
In closed systems repair is not available, and the system
inevitably fails after the redundancy resources have been
exhausted. Closed systems are usually found in space applications [3], [10]. In repairable systems failed parts are
automatically identified and isolated from further participation in computing. They are then replaced by means
of an offline manual action. System failures usually occur
either because of imperfect fault detection and recovery
algorithms, or because of catastrophic faults (i.e., faults
that cannot be handled by the recovery procedure selected). A less frequent cause of system failure is exhaustion of redundancy, which occurs when faults occur faster
than the repair procedure can handle them. Very prominent examples of repairable systems are the several models
of the ESS telephone switching systems [12], [13].
Another classification of fault-tolerant systems may be
based on the implementation of the recovery algorithm.
Hardware-controlled systems use dedicated hardware
which collects fault indications and initiates recovery,
while software-controlled systems depend on special
programs to interpret fault indications and to carry out the
automatic recovery procedures. The hardware-controlled
recovery approach depends on special hardware to carry
out fault detection and to initiate the recovery procedures.
After the existence of a properly functioning software
system has been assured, the completion of recovery is
usually transferred to software control. It is evident that
further software systems may be superimposed on the
hardware-controlled design, leading to a multilevel recovery procedure. A special case of hardware-controlled
recovery is found in statically redundant systems in which
faults are masked by redundant hardware, and thus remain totally invisible to the software. Two examples of
such systems are the OAO data processor which used
component redundancy and the CPU of the SATURN V
guidance computer, which used TMR protection [3]. A
separate software-controlled recovery system is needed in
statically redundant systems if they are to continue operating after the first fault that escapes the masking effect
and affects the software.
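The masking effect described above can be illustrated with a short sketch (illustrative Python, not the SATURN V voter, which was of course hardware): each output bit of a triplicated module is taken as the majority of the three copies, so a single faulty copy remains invisible to the software.

```python
def vote(a, b, c):
    # Bitwise 2-of-3 majority: each result bit agrees with at least two
    # of the three module copies, so one faulty copy is masked
    return (a & b) | (a & c) | (b & c)

good = 0b10110010
faulty = good ^ 0b01000000                 # one copy delivers a wrong bit
assert vote(good, faulty, good) == good    # the fault never reaches the software
```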
Dynamically redundant systems with hardware control
usually depend on a dedicated hardware module that
gathers fault signals and initiates recovery. Different uses
of duplexing and hardware-controlled switchover techniques are found in the memory, power supply, and peripheral units of the SATURN V guidance computer in
combination with a TMR-protected serial CPU unit.
Separate fault-detection and switchover-control units were
used for every functional unit [3]. Probably the first operational computer with fully hardware-controlled dynamic redundancy was the experimental JPL-STAR
computer [10]. Intended for self-contained multiyear space
missions, this computer employs a special test-and-repair-processor (TARP) module to control recovery and
self-repair. Software assistance is invoked only to perform
memory copying and to resume standard operation after
self-repair. The French MECRA computer is another early
experimental design [14]. A few other hardware-controlled
system designs that have not reached operation have been
described in recent literature [1], [4].
The principal advantage of hardware-controlled recovery systems lies in their independence of the operation
of any software immediately after the fault has occurred.
The recovery process is transferred to software only after
its ability to operate has been assured. The relatively late
appearance of such systems may be attributed to the need
to introduce the recovery module into the design at its
inception, thus requiring an early commitment to the
hardware-controlled approach.
The software-controlled recovery systems depend on
special programs to initiate the recovery action upon the
detection of a fault. Fault signals are obtained by both
hardware and software methods, for example: parity
checkers, comparators, power level monitors, watchdog
timers, test programs, reasonableness checks, etc. The
main limitation of these systems is the need for the recovery software to remain operational in the presence of
faults, since recovery cannot otherwise be initiated. A
significant advantage of the software-controlled approach
is that existing "off-the-shelf" hardware system modules
may be used to assemble fault-tolerant organizations.
These modules contain various forms of hardware fault
detection, which are usually supplemented by further
software methods. For this reason software-controlled
systems have appeared earlier and are currently being used
in numerous applications requiring high reliability and
availability. While every modern operating system incorporates some recovery features, the present paper will be
limited to selected illustrations of historically important
and advanced systems.
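One of the fault signals listed above, the watchdog timer, can be sketched as follows (a hypothetical software analogue; the class and names are not taken from any of the cited systems): if the monitored task fails to check in within the timeout, a recovery routine is initiated.

```python
import threading
import time

class Watchdog:
    # If kick() is not called within `timeout` seconds, the recovery
    # routine is invoked: a software analogue of a watchdog timer.
    def __init__(self, timeout, recover):
        self.timeout, self.recover = timeout, recover
        self._last = time.monotonic()
        self._stop = False
        threading.Thread(target=self._watch, daemon=True).start()

    def kick(self):
        # Called periodically by the healthy main task
        self._last = time.monotonic()

    def _watch(self):
        while not self._stop:
            if time.monotonic() - self._last > self.timeout:
                self.recover()          # initiate software-controlled recovery
                self._last = time.monotonic()
            time.sleep(self.timeout / 10)

events = []
dog = Watchdog(0.1, lambda: events.append("recovery"))
dog.kick()
time.sleep(0.3)     # the main task "hangs" and stops kicking the timer
dog._stop = True
print(events)       # recovery was initiated at least once
```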
An important early design of the 1950's with complete
duplication and extensive recovery provisions was the
SAGE system [15]. The IBM System/360 architecture
contains very complete serviceability provisions for multisystem operation in order to attain high availability, reconfiguration, and fail-soft operation [16]. An early example of a multisystem which includes further extensions
of the System/360 design is the IBM 9020 multiprocessing
system for air traffic control applications [17]. Noteworthy
are the operational error analysis program and the diagnostic monitor of the 9020. The recent IBM System/370
hardware incorporates automatic retry of CPU operations,
error coding to correct single-bit errors in processor and
control storage, and I/O retry facilities. The software
provides recovery management support routines, I/O and
channel error recovery, checkpoint/restart facilities,
microdiagnostics, and online diagnostics of I/O device
errors. An interesting illustration of extensive use of
backup storage and dynamic reconfiguration in a general-purpose time-shared system is found in the MIT
Multics System [18]. The Pluribus is a minicomputer/
multiprocessor system with extensive fault-tolerance
provisions which serves as a switching node in the ARPA
Network [19].
Another direction of software-controlled system development is in aerospace applications. Representative
illustrations of this approach are the SIFT design [20], the
RCS system [9], and the C. S. Draper Laboratory modular
system [21]. One more area of application which requires
fault-tolerant operation and very high availability for
several years of continuous operation is the control of
electronic switching systems for telephone exchanges.
These systems usually employ manual repair by replacement of a failed part as the last (offline) step of the recovery procedure, while maintaining normal operation by
means of the remaining system modules. A well documented illustration is found in the Electronic Switching
Systems (ESS) of Bell Laboratories. The ESS designs use
several hardware techniques (duplication, matching, error
codes, function monitors) and special software (check
routines, diagnostics, audits) as well as software and
hardware emergency procedures when normal recovery
action does not succeed [12], [13].
Besides the complete systems discussed above, significant efforts have been devoted to providing fault tolerance for individual subsystems, which can then be assembled to form a fault-tolerant system. This is especially
true for secondary and mass storage which has been
characterized by relatively low reliability in the past.
Representative error coding applications include the use
of codes for error control in data communications, magnetic tape units, disc files, primary random access storage,
and a photo-digital mass store [22]. Single-error-correcting
codes are used in the control storage of ESS No. 1 [12] and
the main and control storage of IBM System/370 computers. Error-correcting codes have been shown to provide
a very effective method for fault-tolerance in the storage
medium, and the remaining problems are concentrated in
providing fault-tolerance in the memory access and
readout circuitry.
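The single-error-correcting codes mentioned above can be illustrated with a Hamming (7,4) sketch (illustrative only; the actual ESS and System/370 codes differ in length and detail): three parity bits locate any single flipped bit in a stored word, which is then inverted back.

```python
def hamming74_encode(d):
    # d: 4-bit data word -> 7-bit codeword; parity bits sit at positions 1, 2, 4
    d = [(d >> i) & 1 for i in range(4)]
    p0 = d[0] ^ d[1] ^ d[3]
    p1 = d[0] ^ d[2] ^ d[3]
    p2 = d[1] ^ d[2] ^ d[3]
    bits = [p0, p1, d[0], p2, d[1], d[2], d[3]]   # positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(c):
    bits = [(c >> i) & 1 for i in range(7)]
    # Each check spans the positions whose index has that syndrome bit set;
    # a nonzero syndrome is the 1-based position of the flipped bit
    s = 0
    for k in range(3):
        par = 0
        for pos in range(1, 8):
            if pos & (1 << k):
                par ^= bits[pos - 1]
        s |= par << k
    if s:
        bits[s - 1] ^= 1                          # correct the single error
    return (bits[2] << 0) | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

word = 0b1011
corrupted = hamming74_encode(word) ^ (1 << 4)     # a single stored bit fails
assert hamming74_correct(corrupted) == word       # the word is recovered
```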
Recent studies have considered the problem of fault
tolerance in associative memories and processors [23]. In
general, processor fault-tolerance has been provided by
duplication and reconfiguration at the system level. Some
investigations have been conducted in the use of arithmetic
error codes as the means for error-detection for processor
faults [24] and an experimental processor has been designed and constructed for the JPL-STAR computer [10].
Continuing reductions in the cost of microprocessors make further emphasis on duplication very likely, although error-detecting codes remain a convenient method
for the identification of the faulty processor in a disagreeing pair. Another potentially very effective approach
to error detection in integrated circuits is self-checking
logic design [25].
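The arithmetic error codes mentioned above [24] rely on the fact that a correct arithmetic operation preserves a residue check. A minimal sketch, assuming a low-cost mod-3 residue code (the function names are invented): because no power of two is divisible by 3, any single-bit error in an adder's output changes the residue and is detected.

```python
M = 3  # check modulus: mod-3 residues are cheap to form in binary hardware

def residue_add(a, b):
    # Carry each operand with its residue check; after the addition,
    # a disagreeing residue would signal a processor fault
    s = a + b
    if s % M != (a % M + b % M) % M:
        raise RuntimeError("arithmetic fault detected")
    return s

def residue_add_faulty(a, b):
    # Same check, but with an injected single-bit fault in the adder output;
    # since no power of two is divisible by 3, the flip changes the residue
    s = (a + b) ^ (1 << 3)
    if s % M != (a % M + b % M) % M:
        raise RuntimeError("arithmetic fault detected")
    return s
```

With residue_add(20, 22) the check passes and 42 is returned; residue_add_faulty(20, 22) raises, flagging the faulty result and thus identifying the bad member of a disagreeing pair.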
VI. THE PROBLEM OF DESIGN FAULTS
In order to attain the goal of general fault-tolerance it
is necessary that one of two conditions should be satisfied
with respect to design faults: 1) the hardware and software
should be free of design faults prior to the start of operational use; or, 2) the system should contain complete provisions to detect and to circumvent the effects of hardware
and software design faults during operational use.
The principal tools for the a priori elimination of design
faults are design verification programs for hardware [26],
[27] and an extensive repertoire of validation and verification tools for software [28], [29]. These methods correspond to the fault-intolerance approach of perfecting
hardware components in order to avoid operational
faults.
The alternative is to provide design fault tolerance for
both hardware and software in the same sense as it is
provided against operational faults in fault-tolerant
hardware. The essential difference is that operational
faults in hardware occur after the start of operational use,
while design faults in hardware and software are present
at the start, but become disruptive at a later time. We also
note that design modifications and corrections of discovered design faults are known to lead to new design faults,
and that discoveries of design faults may be expected
throughout the useful life of any large system, similar to
the occurrence of operational faults. This practically verified observation establishes a relationship between the
methodologies for dealing with operational faults and
design faults: the methods that have proven successful in
operational fault tolerance may be transferable to provide
fault tolerance against design faults, especially those
in software which are very difficult to eliminate prior to
operational use.
Both the static and the dynamic hardware redundancy
approaches (as discussed in Section IV-B) have their
counterparts in software design fault tolerance. In the
static case, multiple independently generated programs
are operated concurrently on multiple copies of the hardware [30]. Comparison or majority voting at specified
points is employed to detect or correct design faults. Systems such as SIFT [20], the symmetric multiprocessor [21],
and the Space Shuttle Computer System [3] are especially
suitable for such multiple programming. The dynamic case
uses the equivalent to standby sparing, in which acceptance tests serve to detect design faults and to initiate a
switchover to an alternate software module [31]. An extension of the above techniques to hardware design fault
tolerance is also feasible: the functionally identical copies
of hardware modules are to be designed and manufactured
by separate organizations in order to avoid the existence
of identical design faults in all copies.
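The dynamic software case described above, acceptance tests plus switchover to an alternate module [31], can be sketched as follows (illustrative only; the routine names are invented): each independently written version is tried in turn until one passes the acceptance test.

```python
def recovery_block(alternates, accept, x):
    # Try each independently written version in turn; the first result that
    # passes the acceptance test is returned (dynamic software redundancy)
    for version in alternates:
        try:
            result = version(x)
        except Exception:
            continue                    # a crashed version counts as failed
        if accept(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def primary_sqrt(x):
    # Primary version with a design fault: a crude approximation
    return x / 2

def alternate_sqrt(x):
    # Independently written alternate: slow but dependable bisection
    lo, hi = 0.0, max(1.0, x)
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
    return lo

def accept(x, r):
    # Acceptance test applied to every version's result
    return abs(r * r - x) < 1e-6

result = recovery_block([primary_sqrt, alternate_sqrt], accept, 9.0)
print(result)   # the faulty primary is rejected; the alternate yields ~3.0
```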
The state of the art in design fault tolerance resembles
that in operational fault tolerance some ten years ago. The
cost and the effectiveness of the design fault tolerance
approaches remain to be investigated, and the techniques
require much further development and experimentation.
The success of operational fault tolerance, however, does
strongly indicate that design fault tolerance cannot be
safely ignored solely because of the past tradition of fault
intolerance in this field.
ACKNOWLEDGMENT
Discussions with Drs. D. A. Rennels and W. M. Whitney
of the Jet Propulsion Laboratory, Pasadena, CA, and Maj.
R. L. Bush of the Air Force Avionics Laboratory, Dayton,
OH, have been valuable in the formulation of this paper.
Special thanks are due to one referee for very extensive and
perceptive comments and suggestions.
REFERENCES
[1] A. Avizienis, "Architecture of fault-tolerant computing systems,"
in Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, France,
June 1975, pp. 3-16.
[2] R. A. Short, "The attainment of reliable digital systems through the
use of redundancy-A survey," IEEE Computer Group News, vol.
2, pp. 2-17, Mar. 1968.
[3] A. E. Cooper and W. T. Chow, "Development of on-board space computer systems," IBM J. Res. Develop., vol. 20, pp. 5-19, Jan. 1976.
[4] W. C. Carter and W. G. Bouricius, "A survey of fault-tolerant computer architecture and its evaluation," Computer, vol. 4, pp. 9-16, Jan.-Feb. 1971.
[5] Y.-W. Ng and A. Avizienis, "A model for transient and permanent fault recovery in closed fault tolerant systems," in Proc. 1976 Int. Symp. Fault-Tolerant Computing, Pittsburgh, PA, June 1976, pp. 182-188.
[6] G. C. Gilley, "A fault-tolerant spacecraft," in Dig. 1972 Int. Symp. Fault-Tolerant Computing, June 1972, pp. 105-109.
[7] W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability modeling techniques for self-repairing computer systems," in Proc. 24th Nat. Conf. Ass. Comput. Mach., 1969, pp. 295-383.
[8] R. W. Barlow and F. Proschan, Mathematical Theory of Reliability. New York: Wiley, 1965.
[9] H. O. Levy and R. B. Conn, "A simulation program for reliability prediction of fault tolerant systems," in Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 104-109.
[10] A. Avizienis et al., "The STAR (self-testing and repairing) computer: An investigation of the theory and practice of fault-tolerant computer design," IEEE Trans. Comput., vol. C-20, no. 11, pp. 1312-1321, Nov. 1971.
[11] H. Y. Chang, G. W. Smith, Jr., and R. B. Walford, "LAMP: System description," Bell Syst. Tech. J., vol. 53, pp. 1431-1449, Oct. 1974.
[12] R. W. Downing, J. S. Nowak, and L. S. Tuomenoksa, "No. 1 ESS maintenance plan," Bell Syst. Tech. J., vol. 43, Part 1, pp. 1961-2019, Sept. 1964.
[13] H. J. Beuscher et al., "Administration and maintenance plan of No. 2 ESS," Bell Syst. Tech. J., vol. 48, pp. 2765-2815, Oct. 1969.
[14] F. P. Maison, "The MECRA: A self-repairable computer for highly reliable process," IEEE Trans. Comput., vol. C-20, pp. 1382-1393, Nov. 1971.
[15] R. R. Everett, C. A. Zraket, and H. D. Benington, "SAGE-A data-processing system for air defense," in Proc. Eastern Joint Comput. Conf., Washington, DC, Dec. 1957, pp. 148-155.
[16] W. C. Carter et al., "Design of serviceability features for the IBM System/360," IBM J. Res. Develop., vol. 8, pp. 115-125, Apr. 1964.
[17] "An application-oriented multiprocessing system," IBM Syst. J.,
vol. 6, pp. 78-132, 1967.
[18] F. J. Corbato, J. H. Saltzer, and C. T. Clingen, "Multics-The first
seven years," in AFIPS Conf. Proc., vol. 40, 1972, pp. 571-583.
[19] S. M. Ornstein et al., "Pluribus-A reliable multiprocessor," in
AFIPS Conf. Proc., vol. 44, 1975, pp. 551-559.
[20] J. H. Wensley, "SIFT-Software implemented fault tolerance," in
AFIPS Conf. Proc., vol. 41, part 1, 1972, pp. 243-254.
[21] A. L. Hopkins, Jr. and T. B. Smith, III, "The architectural elements
of a symmetric fault-tolerant multiprocessor," IEEE Trans. Comput., vol. C-24, pp. 498-505, May 1975.
[22] D. T. Tang and R. T. Chien, "Coding for error control," IBM Syst.
J., vol. 8, pp. 48-86, 1969.
[23] B. Parhami and A. Avizienis, "A study of fault-tolerance techniques
for associative processors," in AFIPS Conf. Proc., vol. 43, pp.
643-652, 1974.
[24] A. Avizienis, "Arithmetic error codes: Cost and effectiveness studies for application in digital system design," IEEE Trans. Comput., vol. C-20, pp. 1322-1331, Nov. 1971.
[25] W. C. Carter, "Theory and use of checking circuits," in Computer Systems Reliability. Infotech Information Ltd., 1974, pp. 413-454.
[26] S. A. Szygenda and E. W. Thompson, "Modeling and digital simulation for design verification and diagnosis," IEEE Trans. Comput.,
this issue, pp. 1242-1253.
[27] T. T. Butler et al., "LAMP: Application to switching-system development," Bell Syst. Tech. J., vol. 53, pp. 1535-1555, Oct.
1974.
[28] Proc. 1975 Int. Conf. Reliable Software, Los Angeles, CA, Apr.
1975.
[29] E. C. Nelson, "Software reliability," in Dig. 1975 Int. Symp.
Fault-Tolerant Computing, Paris, France, June 1975, pp. 24-28.
[30] A. Avizienis, "Fault-tolerance and fault-intolerance: Complementary approaches to reliable computing," in Proc. 1975 Int. Conf. Reliable Software, Los Angeles, CA, Apr. 1975, pp. 458-464.
[31] B. Randell, "System structure for software fault tolerance," IEEE
Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.
Algirdas Avizienis (S'55-M'56-F'73) was born in Kaunas, Lithuania,
in 1932. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Illinois, Urbana.
While at the University of Illinois he was associated with the
Digital Computer Laboratory as a Research Assistant and Fellow from 1954 to 1960, participating in the design of the Illiac II computer. In
1960 he joined the Technical Staff of the Jet
Propulsion Laboratory (JPL), California Institute of Technology, Pasadena, where he initiated and directed the JPL Self-Testing And Repairing (STAR) computer research project. In
1962 he became a member of the faculty of the
School of Engineering and Applied Science at
the University of California, Los Angeles. Presently he is a Professor in the Department of Computer Science at UCLA,
where he teaches and conducts research in computer system architecture
and fault-tolerant computing and is the Principal Investigator of a fiveyear research grant on fault-tolerance sponsored by the National Science
Foundation. He has also remained associated with the Jet Propulsion
Laboratory as an Academic Member of the Technical Staff, and he has
served as the Principal Consultant on fault-tolerant computer research
projects with Ultrasystems, Inc., Irvine, CA.
Dr. Avizienis has served as the first Chairman of the IEEE Computer
Society's Technical Committee on Fault-Tolerant Computing (1969-1973), and as Chairman of the First International Symposium on
Fault-Tolerant Computing in 1971.
Analog/Hybrid Computation and Digital Simulation
GRANINO A. KORN, SENIOR MEMBER, IEEE, AND ROBERT VICHNEVETSKY, MEMBER, IEEE
Abstract-This survey article traces the development of analog/hybrid and digital simulation of continuous systems since the
early 1950's, describes modern aspects, and discusses future
trends.
Index Terms-Continuous-system simulation, analog/hybrid
computation, interactive computers, simulation.
I. SIMULATION OF CONTINUOUS SYSTEMS
CONTINUOUS dynamical systems are those systems which can be described by differential equa-
tions of the initial value kind. Typically,
dy_i/dt = F_i(y_1, y_2, ..., y_n),    i = 1, 2, ..., n    (1)

where y_i(t) are the state variables, and (1) are the state equations. These equations represent a mathematical model of the system.
For computer simulation, we solve (1) by a computer
as an initial value problem, to produce the continuous
functions of time y_i(t). This permits experimentation with
the mathematical model which "simulates" the system's
behavior.
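The solution of (1) as an initial value problem can be sketched with the simplest digital method, forward-Euler integration (an illustrative fragment, not any particular simulation language; the oscillator example is invented):

```python
def simulate(f, y0, dt, steps):
    # Forward-Euler solution of the initial value problem dy/dt = f(y)
    y = list(y0)
    for _ in range(steps):
        dydt = f(y)
        y = [yi + dt * di for yi, di in zip(y, dydt)]
    return y

# State equations of a lightly damped oscillator:
#   y1' = y2,   y2' = -y1 - 0.1 * y2
f = lambda y: [y[1], -y[0] - 0.1 * y[1]]

state = simulate(f, [1.0, 0.0], 0.001, 10_000)   # 10 simulated seconds
print(state)   # the oscillation has decayed from the initial amplitude 1.0
```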
Objects being simulated in this fashion have typically
Manuscript received April 20, 1976; revised August 16, 1976.
G. A. Korn is with the Department of Electrical Engineering, University of Arizona, Tucson, AZ.
R. Vichnevetsky is with the Department of Computer Science, Rutgers
University, New Brunswick, NJ 08903.
included aerospace vehicles, industrial plants, chemical
or biological processes and, in recent years, social and en-
vironmental systems. They may include few state variables
(say 2 or 3) or be rather complex, up to a few thousand
variables. Real-time simulation is obtained when the
computer time (independent variable) changes at the same
rate as the time t in the real system described by (1).
Real-time simulation permits linking the computer with
real hardware (e.g., for testing the control device of a plant)
or interfacing with human operators (e.g., placing a live
"pilot in the loop" of a real-time airplane simulation used
as a training device). More will be said about applications
of simulation in Section X. Computers used for the simulation of continuous dynamical systems have been analog,
digital, or hybrid, the latter combining the first two into
a single computing system (Figs. 1 and 2).
II. ANALOG COMPUTERS
Electronic analog computers implement mathematical
relations between physical variables (typically electrical
voltages) with separate (parallel) computing elements
working simultaneously. Each element effects a simple
operation such as addition, multiplication, and, most important of all, integration with respect to time. Programming is achieved by establishing interconnections between
outputs from and inputs to individual computing elements
by means of wires, called "patch cords."
The first electronic analog computers were conceived
in the 1940's almost as special purpose simulators for air-