IEEE TRANSACTIONS ON COMPUTERS, VOL. C-25, NO. 12, DECEMBER 1976
Fault-Tolerant Systems
ALGIRDAS AVIZIENIS, FELLOW, IEEE
Abstract-Basic concepts, motivation, and techniques of fault
tolerance are discussed in this paper. The topics include fault
classification, redundancy techniques, reliability modeling and
prediction, examples of fault-tolerant computers, and some approaches to the problem of tolerating design faults.
Index Terms-Fault classification, fault tolerance, fault-tolerant
computer design, redundancy techniques, reliability modeling,
reliable computing.
I. INTRODUCTION
THE PROBLEM of reliable computing is as old as the
first computers, which employed relays, vacuum tubes,
delay line and cathode-ray tube storage, and other relatively unreliable components. Semiconductor components
and magnetic core storage elements with much greater
reliability were introduced in the second computer generation, and the reliability problem was significantly alleviated. However, much greater reliability requirements
were posed by the space program in the early 1960's, as well
as by other real-time applications in which human lives
could be jeopardized by computer failure. In addition to
improvements in component reliability and in test methods, redundancy at various levels of system organization
was investigated and employed to increase the probability
of uninterrupted correct operation over specified time
intervals.
The concept of fault tolerance unifies various approaches to reliability assurance by means of testing, diagnosis, and redundancy in machine organization and
operation. It emerged in the late 1960's and reached maturity with the formation of the IEEE Computer Society
Technical Committee on Fault-Tolerant Computing in
1969 and the subsequent holding of the First International
Symposium on Fault-Tolerant Computing in 1971. The
Symposium has been held annually since then, and it has
become the major international forum for the discussion
of current experience and new ideas in system design, redundancy techniques, system modeling and analysis,
testing and diagnosis methods, and other related areas.
II. BASIC CONCEPTS
The most widely accepted definition of a fault-tolerant computing system is that it is a system which has the built-in capability (without external assistance) to preserve the continued correct execution of its programs and input/output (I/O) functions in the presence of a certain set of operational faults. An operational fault is an unspecified (failure-induced) change in the value of one or more logic variables in the hardware of the system. It is the immediate consequence of a physical failure event. The event may be a permanent component failure, a temporary or intermittent component malfunction, or externally originating interference with the operation of the system. "Correct execution" means that the programs, the data, and the results do not contain errors, and that the execution time does not exceed a specified limit. An error is the symptom of a fault, i.e., it is the change in an instruction or a data word which is caused by the presence of a fault at the point of the physical failure event.

Manuscript received March 22, 1976; revised June 23, 1976. This work was supported by the National Science Foundation under Grant MCS72-03633 A04. The author is with the Department of Computer Science, University of California, Los Angeles, CA 90024.
The above definition of a fault-tolerant system is based
on the premise that the system will function correctly when
operational faults are not present; that is, it postulates that
design faults have been eliminated from the system prior
to its use. A more general interpretation of fault tolerance
also includes the ability to tolerate thus-far-undetected
design faults. Design faults occur both in hardware and in
software of the system. They are the consequences of either: 1) an incomplete or incorrect specification; or 2) an
incorrect translation of the specification into the final design: the assemblies of components or the sequences of machine language instructions and sets of data words.
In order to tolerate left-over design faults, the system
needs provisions to recognize their presence and to deactivate the affected hardware or software while program
execution continues using the remaining parts of the system.
In the strict sense, a system is fault-tolerant only if external assistance is not required to implement its fault
tolerance. Most of the current-generation computers require manual assistance; however, it must be noted that
nearly every computer employs some fault-tolerance
techniques, such as microdiagnosis [1], [4], parity checking
[2], watchdog timers [3], etc. The fault-tolerance process
is interrupted at the point at which a maintenance specialist takes over the control of the system. An off-line
replacement or repair of deactivated parts of the system
is acceptable in a fault-tolerant system, as long as the
correct execution of programs is being continued without
external intervention.
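Watchdog timers, one of the fault-tolerance techniques mentioned above, can be sketched in modern terms. The following Python fragment is an illustrative sketch only; the class name, timeout value, and polling scheme are assumptions, not from the paper. A monitored computation must refresh the timer periodically, and a missed deadline is treated as a detected fault that can trigger a recovery action.

```python
import time

class WatchdogTimer:
    """Toy watchdog: the monitored computation must call kick()
    within `timeout` seconds of the previous kick, or expired()
    reports a fault so a recovery action can be started."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = time.monotonic()

    def kick(self):                      # periodic "I am alive" signal
        self.last_kick = time.monotonic()

    def expired(self):                   # polled by the fault monitor
        return time.monotonic() - self.last_kick > self.timeout

wd = WatchdogTimer(timeout=0.05)
wd.kick()
assert not wd.expired()        # heartbeat is fresh
time.sleep(0.06)               # simulate a hung computation
assert wd.expired()            # watchdog flags the fault
```

In a real machine the "kick" is a hardware strobe and expiry raises an interrupt; the polling loop here only mimics that behavior.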
A partially fault-tolerant ("fail-soft," "gracefully degrading") system is one which has the built-in capability
to reduce its specified ("full") computing capacity and to
"shrink" to a smaller system by discarding some previously
used programs or by slowing down below the specified rate
of execution. The reduction occurs because of decreases
in the working hardware configuration that are due to
operational faults. Another cause of such "shrinking" may
be the discovery of left-over design faults in the hardware
or programs.
The preceding discussion makes it evident that the concept of fault tolerance cuts across the traditional hardware/software boundaries and involves the entire computing system: the equipment, the programs, and the operating environment as well. Fault tolerance is an essential aspect of reliable computing, which in the most general sense means "the correct execution of a specified set of algorithms," and encompasses all the following elements: correctness and completeness of software specifications; testing and verification (proofs) of programs and microprograms; elimination of hardware design errors; continued correct execution of programs and protection of data in the presence of hardware (operational) faults; and protection of the computing system against error-induced disruption or unauthorized access of programs and data. The present discussion deals with the problem of hardware (operational) faults; some aspects of design fault tolerance are considered in the last section.

III. MOTIVATION FOR FAULT TOLERANCE

Two distinct approaches have been used in order to reach the goal of reliable computing. These approaches are applicable to all parts of the computing system, including its hardware elements, microprograms, system programs, data bases, and user programs.

In the traditional approach (to be called fault intolerance in the following discussion) the reliability of computing is assured by a priori elimination of the causes of unreliability, i.e., of faults. This elimination takes place before regular use begins, and the resources that are allocated to attain reliability are spent on perfecting the system prior to its field use. Redundancy is not employed, and all parts of the system must function correctly at all times. Since in practice it has not been possible to assure the complete a priori elimination of all causes of unreliability, the goal of fault intolerance is to reduce the unreliability (expressed as the probability of system failure over the duration of the specified computing process) or the unavailability of the system to an acceptably low value. To supplement this approach, manual maintenance procedures are devised which return the system to an operating condition after a failure. The cost of providing maintenance personnel and the cost of the disruption and delay of computing also are parts of the overall cost of using the fault-intolerance approach. The procedures which have led to the attainment of reliable systems using this approach are: acquisition of the most reliable components within the given cost and performance constraints; use of thoroughly refined techniques for the interconnection of components and assembly of subsystems; packaging of the hardware to screen out expected forms of interference; and carrying out of comprehensive testing to eliminate hardware and software design faults.

Once the design has been completed, a quantitative prediction of system reliability is made using known or predicted failure rates for the components and interconnections. In a "purely" fault-intolerant (i.e., nonredundant) design, the probability of fault-free hardware operation is equated to the probability of correct program execution. Such a design is characterized by the decision to invest all the reliability resources into high-reliability components and refinement of assembly, packaging, and testing techniques. Occasional system failures are accepted as a necessary evil, and manual maintenance is provided for their correction. To facilitate maintenance, some built-in error detection, diagnosis, and retry techniques are provided. (These techniques are discussed in more detail later.) This is the current practice in computer system design; the trend is toward an increasing number of built-in aids for the maintenance engineer.

The traditional (fault-intolerance) approach of diagnosis-aided manual repair, however, has proved to be an insufficient solution in many cases because of at least three reasons: 1) the unacceptability of the delays and interruptions of real-time programs caused by manual repair action; 2) the inaccessibility of some systems to manual repair; and 3) the excessively high costs of lost time and of maintenance in many installations. The direct dependence of human lives on some computer-controlled operations (air traffic control, manned spaceflight, etc.) has added a psychological reason to object to the fault-intolerance approach: while only one system in a million may fail over a given time interval, all users of the entire million systems are subject to the anticipation that they may be involved in this failure.

An alternate approach which alleviates most of the above shortcomings of the traditional fault-intolerance approach is offered by fault tolerance. In this approach, the reliability of computing is assured by the use of protective redundancy. The causes of unreliability are expected to be present and to induce errors during the computing process, but their disrupting effects are automatically counteracted by the redundancy. Reliable computing may be made possible despite certain classes of hardware failures, external interference with computer operation, and perhaps even remaining design errors in hardware and software. Some of the resources allocated to attain reliability are spent on protective redundancy. The redundant parts of the system (both hardware and software) either take part in the computing process or are present in a standby condition, ready to act automatically to preserve its undisrupted continuation. This contrasts with the maintenance procedures in a fault-intolerant system, which are invoked after the computing process has been disrupted, and the system remains "down" for the duration of the maintenance period.

It is evident that the two approaches are complementary and that the resources allocated to attain the required reliability of computing may be divided between fault
tolerance and fault intolerance. Experience and analysis both seem to indicate that a balanced allocation of resources between the two approaches is most likely to yield the highest reliability of computing. Fault tolerance does not entirely eliminate the need for reliable components; instead, it offers the option to allocate part of the reliability resources to the inclusion of redundancy. One reason for the use of a fault-tolerant design is to achieve a reliability or availability prediction that cannot be attained by the purely fault-intolerant design. A second reason may be the attainment of a reliability (or availability) prediction that matches the purely fault-intolerant design at a lower overall cost of implementation. A third reason is the psychological support to the users who know that provisions have been made to handle faults automatically as a regular part of the computing process. An overview of past practice shows that fault intolerance has been the dominant choice in both hardware and software in the 1950's and 1960's. In recent years the fault-tolerance approach has been making significant inroads in hardware system design; its application in software has remained very limited. The main argument against the use of fault-tolerance techniques in computer systems has been the cost of redundancy. The evolution of component technology into large-scale integration, the decreasing cost of mass-produced hardware elements, the very high cost of software testing, and the increasing reliability requirements all favor increasing use of fault tolerance in computer systems of the future.

IV. ASPECTS OF FAULT TOLERANCE

Investigators and designers of fault-tolerant systems have to deal with three fundamental aspects of fault tolerance. They are: 1) systematic identification and characterization of the set of faults to be tolerated; 2) development and choice of redundancy techniques which provide protection against the identified faults; and 3) analytic or experimental prediction of the effectiveness of redundancy techniques [1]. The following discussion deals with these aspects as they apply to operational hardware faults.

A. Characterization of Operational Faults

Operational faults are caused by three classes of failure events that affect the hardware of the system during execution of programs: permanent failures of hardware components; temporary malfunctions of components; and external interference with computer operation.

There are three useful dimensions for the classification of operational faults: duration (transient versus permanent); extent (local versus distributed); and value (determinate versus indeterminate).

Transient faults are faults of limited duration, caused either by temporary malfunctions of components or by external interference. Recovery requires waiting until the transient subsides, followed by the reestablishment of a known correct state of the programs and the data (often called the "rollback point" or "checkpoint"). The characterization of a transient fault must include a "maximum duration" parameter; faults that last longer will be interpreted as permanent by the recovery algorithm. Other characteristics are the arrival model and the duration of transients [1], [5].

Permanent faults are caused by permanent (also called "solid") failures of components; recovery from them requires either the presence of a duplicate (spare) component, or provisions to continue operation without the failed part. In addition, the reestablishment of the "rollback point" may be required as well, following the corrective action on the failed part. The classifications according to extent and according to value are applicable to both transient and permanent faults.

The extent of a fault describes how many logic variables in the hardware are simultaneously affected by the fault which is due to one failure event. Local (single) faults are those that affect only single logic variables, while distributed (related multiple) faults are those that affect two or more variables, one module, or an entire system. The physical proximity of logic elements in contemporary medium- and large-scale integration (MSI and LSI) circuitry has made distributed faults much more likely than in the discrete component designs of the past. Distributed faults are also caused by single failures of some critical elements in a computer system, for example: clocks, power supplies, data buses, switches for reconfiguration, etc.

The value of a fault is determinate when the logic values affected by the fault assume a constant value ("stuck on 0" or "stuck on 1") throughout its entire duration. The fault value is indeterminate when it varies between "0" and "1," but not in accord with design specifications, throughout the duration of the fault. The determinacy of a fault depends on the physical failure mechanisms, such as shorts, burnouts, or drift of component values.

It is important to note that the description of fault extent and fault value applies at the origin of the fault, that is, at the point at which the failure event has actually taken place. The introduction of one or more faulty (incorrect) logic values into the computing process will often lead to more extensive fault symptoms in the logic downstream from the point of failure. At other times, the presence of a faulty logic value will be masked by other (correct) logic variables and no symptoms at all will be seen downstream.

Confusion and ambiguity are avoided when the term "fault" is restricted to the change in logic variable(s) at the point of the actual hardware failure. The fault-caused changes of logic variables which are observed downstream on the outputs of correctly functioning logic elements will be called "errors." This choice of terms establishes the following cause-effect sequence.

1) The failure, which is a physical event, causes a fault, which is a change of logic variable(s) at the point of failure.

2) The fault supplies incorrect input(s) to the computing process and may cause an error to be produced by subsequent operations of failure-free logic circuits.
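The fault/error distinction and the masking of faulty logic values can be made concrete with a small gate-level sketch. The two-gate circuit, the `g1/0` stuck-at label, and the function names below are hypothetical illustrations, not from the paper: a determinate ("stuck on 0") fault at one gate becomes visible downstream as an error only for inputs that sensitize it, which is precisely what fault-detection test generation must discover.

```python
def circuit(a, b, fault=None):
    """Two-level circuit out = (a & b) | (a & ~b).
    `fault` optionally forces one internal line to a constant,
    modeling a determinate ("stuck-at") operational fault."""
    g1 = a & b
    g2 = a & (1 - b)
    if fault == "g1/0":          # gate g1 output stuck on 0
        g1 = 0
    return g1 | g2               # OR gate downstream of the fault point

# The fault produces a downstream error only for inputs that sensitize it:
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    good, bad = circuit(a, b), circuit(a, b, fault="g1/0")
    print(a, b, "error" if good != bad else "masked")
```

Only the input (1, 1) exposes the fault; for the other three inputs the faulty value is masked by correct logic variables, so a test set that detects this fault must include (1, 1).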
The number of points that can be observed for the purpose of fault detection is limited because integrated circuits are internally complex and have relatively few outputs. Digital-logic simulation programs which analyze the behavior of faulty logic circuits and predict the errors that will appear on the outputs (for a given class of faults) are essential tools for the generation of fault-detection tests [26]. A recent simulation and analysis system to analyze the behavior of faulty circuits is the Logic Analyzer for Maintenance Planning (LAMP) System [11]. In addition,
LAMP also performs logic design verification, generates
fault-detection tests, evaluates diagnostics, and produces
data for trouble-location manuals. LAMP exemplifies the
current trend toward multipurpose simulation systems in
digital system design.
B. Redundancy Techniques
The redundancy techniques that have been developed
to protect computer systems against operational faults may
assume three different forms: hardware (additional components), software (special programs), and time (repetition
of operations).
Hardware redundancy: This consists of the components
that are introduced in order to provide fault-tolerance. The
techniques of introducing hardware redundancy have been
classified (on the basis of terminal activity of modules) into
two categories: static redundancy and dynamic redundancy [2]. The static redundancy method is also known as
"masking" redundancy, since the redundant components
are employed to mask the effect of hardware failures within
a given hardware module, and the outputs of the module
remain unaffected as long as the protection is effective.
The static technique is applicable against both transient
and permanent faults. The redundant copies of an element
are permanently connected and powered; therefore they
provide fault masking instantaneously and automatically.
However, if the fault is not susceptible to masking and
causes an error, a delayed recovery is not provided. Two
forms of static redundancy have been applied in the space
program: replication of individual electronic components
and triple modular redundancy (TMR) with voting [3].
Several other forms have been studied but did not reach
application because of excessive cost or the need for
practically unrealizable special components [2]. The use
of static redundancy is based on the assumption that
failures of the redundant copies are independent. For this
reason, use of static redundancy is difficult to justify within
integrated circuit packages, in which many failure phenomena are likely to affect several adjacent components.
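As a sketch of static ("masking") redundancy, the following Python fragment models TMR with a bit-wise two-of-three majority voter; the module outputs and the corrupted bit pattern are invented values for illustration. The vote masks any fault confined to a single copy, instantaneously and without a recovery action, which is why the assumption of independent copy failures is essential.

```python
def majority(a, b, c):
    """Bit-wise 2-of-3 majority vote over three module outputs:
    each output bit equals the value held by at least two copies."""
    return (a & b) | (a & c) | (b & c)

# Three replicated modules compute the same word; one suffers a
# distributed fault that flips two of its output bits.
correct = 0b1011_0110
faulty  = correct ^ 0b0100_0001       # two bits corrupted in one copy
assert majority(correct, correct, faulty) == correct   # fault masked
```

If a second copy fails before the first is dealt with, the voter can be outvoted; static redundancy by itself provides no delayed recovery once masking is defeated.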
In the dynamic hardware redundancy approach, fault-caused errors or error signals appear at the outputs of a
module. Fault tolerance is implemented in two steps. The
presence of a fault is detected, and a subsequent recovery
action either eliminates the fault, or corrects the error.
When human assistance is entirely bypassed, dynamic
redundancy (usually supported by software and time re-
dundancy techniques) provides self-repair of a computer
system. Limited, i.e., human-controlled, use of dynamic
redundancy techniques in computer hardware is quite
extensive [1], [2], [4]. A common example is the use of
parity to detect errors in data transmission and storage.
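A minimal sketch of that parity example, with hypothetical word values: an even-parity bit is appended on transmission or storage, and any single-bit error makes the overall parity odd and is therefore detected (though not located or corrected).

```python
def add_parity(word):
    """Append an even-parity bit so the total number of 1's is even."""
    p = bin(word).count("1") & 1
    return (word << 1) | p

def parity_ok(coded):
    """Check a stored/transmitted word; any single-bit error is detected."""
    return bin(coded).count("1") % 2 == 0

coded = add_parity(0b1101_0010)
assert parity_ok(coded)               # error-free transfer
assert not parity_ok(coded ^ 0b100)   # one flipped bit is caught
```

Note that an even number of simultaneous bit errors passes the check; single-bit parity trades low redundancy for limited detection power.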
A complete application of dynamic redundancy requires
that a number of decisions should be made during the
functional design of the computer. The design choices include: level of modularization, fault-detection techniques,
type of recovery action, "hard core" protection, forms of
intermodule communication, validation of inputs, and
interfaces with system software [1]. The use of dynamic
redundancy has been inhibited because it needs an early
commitment in the hardware design process. In contrast,
static redundancy (and software redundancy, as well) can
be superimposed on an existing nonredundant design.
Software redundancy: This includes all additional
programs, program segments, instructions, and microinstructions which would not be needed in a fault-free
computer. They serve to provide either fault detection or
recovery in fault-tolerant computer systems, usually in
conjunction with dynamic hardware redundancy. Three
major forms of software redundancy are: 1) multiple
storage of critical programs and data; 2) test and diagnostic
programs or microprograms; and 3) parts of the operating
system which interact with hardware redundancy and
carry out program restarts. Combinations of all three forms
are found in most modern fault-tolerant computers.
Compared to hardware redundancy, an advantage of
software is the ability to add fault-tolerance features after
the hardware has been designed. This may allow the design
of fault-tolerant systems using redundant configurations of "off-the-shelf" hardware with very limited supporting
features, such as comparators, voters, parity checkers, etc.
Another advantage is the relatively easier modification and
refinement of the software features after their introduction
into the system. The main disadvantage of software redundancy is the difficulty of assuring that the software
features will be able to function correctly after the occurrence of a fault and that they will be invoked sufficiently
early, that is, before the fault-caused errors have irrevocably disrupted the programs or mutilated the data base.
Other disadvantages include the relatively high cost of
generating and verifying or validating the required software, the storage requirements (including the need to
tolerate failures of memories holding the software), and
the difficulty of estimating or proving the adequacy of the
software redundancy features. Combinations of software
and hardware redundancy are employed in most faulttolerant systems, but they differ in the choice of the point
in the detection and recovery sequence at which software
takes over control.
Time (execution) redundancy: This consists of repeating or acknowledging machine operations at various
levels: micro-operations, single instructions, program
segments, or entire programs. It is usually employed together with dynamic hardware and software redundancy
techniques. Two distinct goals of time redundancy are:
fault detection by means of repeated execution or acknowledgments; and recovery by program restarts or operation retries after fault detection or reconfiguration has
occurred.
The repeated execution of a program is probably the
oldest form of fault detection. While suitable to detect
errors due to transient faults, it is limited because consistent errors are produced by permanent faults, and comparison may fail to reveal the same error in the results. The
use of various forms of acknowledgments ("handshakes")
has been extensively used in general purpose systems,
especially for control error detection in secondary storage,
channels, and I/O devices. Retransmission is the usual
error correction method in this case. Another common use
of time redundancy is found in the identification and
correction of errors caused by transient faults, and in
program restarts after a hardware reconfiguration. This
is accomplished by the repetition or "rollback" (after error
detection) of single instructions, program segments, or
entire programs. While single-instruction retries are
transparent to the programmer, longer rollbacks require
programming constraints as well as protected storage for
the rollback address and for the state vector, including its
double-buffering. "Singular" events in a computer are
program-controlled events which should not be repeated
as part of a program rollback operation, for example,
real-time output commands which initiate irreversible
actions in the system under computer control. The potential damage makes it imperative that the provision for
handling of singular events should be incorporated in
rollback procedures.
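The rollback scheme above can be sketched as follows; the step function, the retry limit, and the use of an exception as the error signal are illustrative assumptions, not the paper's mechanism. A checkpoint of the state is saved before each program segment; a detected error causes re-execution from the checkpoint, which recovers from transient faults, while repeated failure is treated as permanent. Singular (irreversible) events would have to be deferred until a step commits.

```python
import copy

def run_with_rollback(step, state, n_steps, max_retries=3):
    """Execute n_steps of `step`, checkpointing state before each one.
    On a detected error (RuntimeError), roll back to the checkpoint
    and retry; exhausting the retries is treated as a permanent fault."""
    for i in range(n_steps):
        checkpoint = copy.deepcopy(state)     # protected rollback point
        for attempt in range(max_retries + 1):
            try:
                state = step(i, copy.deepcopy(checkpoint))
                break                         # segment committed
            except RuntimeError:
                pass                          # transient: retry from checkpoint
        else:
            raise SystemExit("permanent fault: recovery failed")
    return state

# A step that suffers one transient fault on its first attempt at i == 1:
glitched = {"done": False}
def step(i, s):
    if i == 1 and not glitched["done"]:
        glitched["done"] = True
        raise RuntimeError("transient error detected")
    s["sum"] = s.get("sum", 0) + i
    return s

print(run_with_rollback(step, {}, 4))   # the transient is absorbed
```

The double copy (one for the checkpoint, one for the working state) mirrors the double-buffering of the state vector mentioned above.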
The redundancy checkout problem is the problem of
verifying that the redundant parts are ready to be used
when faults occur. Implementation of checkout encounters
difficulties in systems with static hardware redundancy,
especially in component-redundant systems. Dynamically
redundant systems are inherently better suited for redundancy checkouts, since they possess extensive fault
detection and permit sequential switching-in of spare
modules to be tested. One critical requirement of checkout
is to provide systematic verification that the fault-indicating signals are operational. The techniques of fault
tolerance can be extended beyond the boundaries of the
fault-tolerant computer to effect automatic maintenance
of various peripheral systems which communicate with the
computer. The methodology of extending fault tolerance
consists of the development of fault-tolerant interfaces,
introduction of fault-detection methods in the systems
outside the computer, and programming of recovery sequences to be executed by the computer [6]. In commercial
general purpose systems, the reverse process has occasionally taken place. Because of the relatively high unreliability of peripheral mechanical devices, fault tolerance
began at the peripherals and only later was brought into
the central processing unit (CPU) and main memory.
C. Modeling and Analysis

The choice of redundancy techniques requires verification that the redundant system possesses the expected fault tolerance. Insufficiencies of the design may be uncovered, and the design can be refined by changes or additions of various forms of redundancy. The principal quantitative measures are reliability (with respect to permanent faults), survivability (with respect to transient faults), and availability [1], [5], [7], [8]. There are two approaches to the prediction of fault tolerance: 1) the analytic approach, in which fault-tolerance measures of the system are obtained from a mathematical model of the system, and 2) the experimental approach, in which faults are inserted either into a simulated model of a system, or into a prototype of the actual hardware, and fault-tolerance measures are estimated from statistical data.

The class of static reliability models is suitable for the reliability prediction of systems with static hardware redundancy. The redundant elements are assumed to be permanently connected and to fail statistically independently. They have the same failure rate and are instantaneously available to perform the masking of a failure with unity probability of success. Under these assumptions, the reliability of a redundant system is obtained as the sum of the reliabilities of all distinct configurations (including none or some failed parts) that do not lead to system failure. Reliability models of static redundancy are found in handbooks and textbooks of reliability theory and are used for reliability analysis of various redundant structures, e.g., relay contact networks, aircraft frames, etc. [8].

Dynamic redundancy requires the success of consecutive actions of fault detection and recovery in order to utilize redundant (spare) parts. The use of static reliability models for the dynamic case is equivalent to assuming unity probability of success of both actions; for this reason, very high reliabilities can be predicted as the number of spares is increased. It was recognized early in the studies of dynamic redundancy that even slight imperfections of detection and recovery will leave some spares unused. The effect of such imperfections was formalized in the reliability model through the concept of "coverage," defined as the conditional probability of successful recovery, given that a fault has occurred [7]. In general, the dynamic model must represent the entire complexity of the proposed fault-tolerant system, including: 1) differing failure rates for powered and unpowered modules; 2) increase in module failure rate due to fault detection and switching hardware; 3) number of spares of each module; 4) imperfect fault detection and recovery; 5) method of "hard core" protection; 6) existing intramodule fault tolerance; 7) extent and value of the faults; and 8) duration and distribution of expected transient faults. Recent work has considered repairable systems and models that include transient as well as permanent faults [5]. An important goal of further study remains the development of models that consider the characteristics of both hardware and software and consider both permanent and transient faults. Analytical models of dynamically redundant systems are complex because of the number of different parameters that may be varied in the search for a good design. A very useful tool for the study of such systems is an interactive computer program
which readily allows the variation of the fault-tolerance
parameters of the redundant system for online design re-
finement [5], [7].
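A sketch of both model classes, using assumed failure-rate, mission-time, and coverage values: the static model sums the reliabilities of the non-failing TMR configurations, while a deliberately crude dynamic model shows how coverage below unity limits the benefit of a spare. The simplified standby formula assumes an unpowered spare that cannot fail and ignores switching hardware, which are two of the very complexities the full dynamic model must represent.

```python
from math import exp

LAMBDA, T = 1e-4, 1000.0          # assumed failure rate (1/h), mission (h)
R = exp(-LAMBDA * T)              # nonredundant module reliability

# Static model: TMR survives if at most one of three copies fails;
# sum the reliabilities of the non-failing configurations.
R_tmr = R**3 + 3 * R**2 * (1 - R)

# Crude dynamic model: one cold spare, switched in with coverage c
# (the conditional probability of successful recovery given a fault);
# the spare is assumed not to fail while unpowered.
def R_standby(c):
    return R + c * (1 - R) * R

for c in (1.0, 0.99, 0.9):
    print(f"coverage {c:.2f}: R = {R_standby(c):.6f}")
print(f"simplex  R = {R:.6f}, TMR R = {R_tmr:.6f}")
```

Even this toy model reproduces the qualitative conclusion of the text: with perfect coverage the spare helps greatly, and every percent of imperfection erodes the predicted gain.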
Two approaches to experimental prediction of reliability are simulation and experimentation with a hardware prototype. While their use is more costly and time
consuming, the experimental methods are essential when
the available analytic models do not adequately represent
the complex structure of the system or the nature of the
expected faults. An accurate description of the system and
characterization of the faults are the principal prerequisites when simulation is employed to derive the fault-tolerance estimates for the computer [9]. This approach has
been extensively employed in reliability prediction for the
redundant SATURN V guidance computer [3]. Continued
progress in the development of digital simulators has taken
place since then [26]. A contemporary multipurpose simulation system is the Bell Laboratories' LAMP system
[11].
The use of hardware prototypes requires a large investment of effort in constructing the prototype, but avoids
the inaccuracies which may occur in postulating the fault
effects in a simulated model of the system. An example of
the use of a hardware prototype is the experimental
fault-tolerant JPL-STAR computer [10]. In the JPL-STAR computer an electronic "black box" was used to
inject faults of adjustable duration and extent at selected
points in the hardware of the system. Another example of
the experimental approach is the OAO processor, in which
a component-redundant system was completely disassembled to determine the number of failed components
after 3000 hours of operation [3]. A prototype also was used
to derive the fault dictionary for the No. 1 ESS System
[12].
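The effect of the STAR-style injection "black box" can be mimicked in software (a hypothetical sketch, not the JPL hardware; the function names are invented): a fault of adjustable position and duration is injected into one copy of a computation, and a duplicate copy counts how often the mismatch is detected.

```python
def compute(x):
    # The simulated module: some deterministic computation on 8-bit data
    return (x * 3 + 1) & 0xFF

def faulty_compute(x, bit, duration, cycle):
    # Inject a fault of adjustable position (`bit`) and duration
    # (`duration` cycles) into the module's output
    y = compute(x)
    if cycle < duration:
        y ^= 1 << bit
    return y

def run_duplexed(x, bit, duration, cycles):
    # Duplex comparison: a fault is detected whenever the two copies disagree
    return sum(
        faulty_compute(x, bit, duration, cycle) != compute(x)
        for cycle in range(cycles)
    )

print(run_duplexed(7, 2, 3, 10))  # fault active for 3 of 10 cycles -> prints 3
```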
V. SYSTEM DESIGN
The ultimate proof of the effectiveness of the fault-tolerance techniques discussed in the preceding section is
found in the performance of existing systems. For the
convenience of discussion, we make the distinction between 1) fully fault-tolerant (or self-repairing) and 2)
manually-controlled systems with fault-tolerance features. The former complete their recovery actions without
the participation of a maintenance specialist, while the
latter depend on human decision-making as part of the
recovery sequence. These decisions may take place at
various stages of the sequence, from the initiation of diagnostics to the operation of a switch which disconnects
a failed part of the system. The subsequent discussion will
focus on examples of fully fault-tolerant systems.
The fully fault-tolerant systems may be further classified according to the availability of external (offline) repair.
In closed systems repair is not available, and the system
inevitably fails after the redundancy resources have been
exhausted. Closed systems are usually found in space applications [3], [10]. In repairable systems failed parts are
automatically identified and isolated from further participation in computing. They are then replaced by means
of an offline manual action. System failures usually occur
either because of imperfect fault detection and recovery
algorithms, or because of catastrophic faults (i.e., faults
that cannot be handled by the recovery procedure selected). A less frequent cause of system failure is exhaustion of redundancy, which occurs when faults occur faster
than the repair procedure can handle them. Very prominent examples of repairable systems are the several models
of the ESS telephone switching systems [12], [13].
Another classification of fault-tolerant systems may be
based on the implementation of the recovery algorithm.
Hardware-controlled systems use dedicated hardware
which collects fault indications and initiates recovery,
while software-controlled systems depend on special
programs to interpret fault indications and to carry out the
automatic recovery procedures. The hardware-controlled
recovery approach depends on special hardware to carry
out fault detection and to initiate the recovery procedures.
After the existence of a properly functioning software
system has been assured, the completion of recovery is
usually transferred to software control. It is evident that
further software systems may be superimposed on the
hardware-controlled design, leading to a multilevel recovery procedure. A special case of hardware-controlled
recovery is found in statically redundant systems in which
faults are masked by redundant hardware, and thus remain totally invisible to the software. Two examples of
such systems are the OAO data processor which used
component redundancy and the CPU of the SATURN V
guidance computer, which used TMR protection [3]. A
separate software-controlled recovery system is needed in
statically redundant systems if they are to continue operating after the first fault that escapes the masking effect
and affects the software.
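The masking effect described above can be illustrated with a short sketch (illustrative Python, not the SATURN V voter, which was of course hardware): each output bit of a triplicated module is taken as the majority of the three copies, so a single faulty copy remains invisible to the software.

```python
def vote(a, b, c):
    # Bitwise 2-of-3 majority: each result bit agrees with at least two
    # of the three module copies, so one faulty copy is masked
    return (a & b) | (a & c) | (b & c)

good = 0b10110010
faulty = good ^ 0b01000000                 # one copy delivers a wrong bit
assert vote(good, faulty, good) == good    # the fault never reaches the software
```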
Dynamically redundant systems with hardware control
usually depend on a dedicated hardware module that
gathers fault signals and initiates recovery. Different uses
of duplexing and hardware-controlled switchover techniques are found in the memory, power supply, and peripheral units of the SATURN V guidance computer in
combination with a TMR-protected serial CPU unit.
Separate fault-detection and switchover-control units were
used for every functional unit [3]. Probably the first operational computer with fully hardware-controlled dynamic redundancy was the experimental JPL-STAR
computer [10]. Intended for self-contained multiyear space
missions, this computer employs a special test-and-repair-processor (TARP) module to control recovery and
self-repair. Software assistance is invoked only to perform
memory copying and to resume standard operation after
self-repair. The French MECRA computer is another early
experimental design [14]. A few other hardware-controlled
system designs that have not reached operation have been
described in recent literature [1], [4].
The principal advantage of hardware-controlled recovery systems lies in their independence of the operation
of any software immediately after the fault has occurred.
The recovery process is transferred to software only after
its ability to operate has been assured. The relatively late
appearance of such systems may be attributed to the need
to introduce the recovery module into the design at its
inception, thus requiring an early commitment to the
hardware-controlled approach.
The software-controlled recovery systems depend on
special programs to initiate the recovery action upon the
detection of a fault. Fault signals are obtained by both
hardware and software methods, for example: parity
checkers, comparators, power level monitors, watchdog
timers, test programs, reasonableness checks, etc. The
main limitation of these systems is the need for the recovery software to remain operational in the presence of
faults, since recovery cannot otherwise be initiated. A
significant advantage of the software-controlled approach
is that existing "off-the-shelf" hardware system modules
may be used to assemble fault-tolerant organizations.
These modules contain various forms of hardware fault
detection, which are usually supplemented by further
software methods. For this reason software-controlled
systems have appeared earlier and are currently being used
in numerous applications requiring high reliability and
availability. While every modern operating system incorporates some recovery features, the present paper will be
limited to selected illustrations of historically important
and advanced systems.
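One of the fault signals listed above, the watchdog timer, can be sketched as follows (a hypothetical software analogue; the class and names are not taken from any of the cited systems): if the monitored task fails to check in within the timeout, a recovery routine is initiated.

```python
import threading
import time

class Watchdog:
    # If kick() is not called within `timeout` seconds, the recovery
    # routine is invoked: a software analogue of a watchdog timer.
    def __init__(self, timeout, recover):
        self.timeout, self.recover = timeout, recover
        self._last = time.monotonic()
        self._stop = False
        threading.Thread(target=self._watch, daemon=True).start()

    def kick(self):
        # Called periodically by the healthy main task
        self._last = time.monotonic()

    def _watch(self):
        while not self._stop:
            if time.monotonic() - self._last > self.timeout:
                self.recover()          # initiate software-controlled recovery
                self._last = time.monotonic()
            time.sleep(self.timeout / 10)

events = []
dog = Watchdog(0.1, lambda: events.append("recovery"))
dog.kick()
time.sleep(0.3)     # the main task "hangs" and stops kicking the timer
dog._stop = True
print(events)       # recovery was initiated at least once
```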
An important early design of the 1950's with complete
duplication and extensive recovery provisions was the
SAGE system [15]. The IBM System/360 architecture
contains very complete serviceability provisions for multisystem operation in order to attain high availability, reconfiguration, and fail-soft operation [16]. An early example of a multisystem which includes further extensions
of the System/360 design is the IBM 9020 multiprocessing
system for air traffic control applications [17]. Noteworthy
are the operational error analysis program and the diagnostic monitor of the 9020. The recent IBM System/370
hardware incorporates automatic retry of CPU operations,
error coding to correct single-bit errors in processor and
control storage, and I/O retry facilities. The software
provides recovery management support routines, I/O and
channel error recovery, checkpoint/restart facilities,
microdiagnostics, and online diagnostics of I/O device
errors. An interesting illustration of extensive use of
backup storage and dynamic reconfiguration in a general-purpose time-shared system is found in the MIT
Multics System [18]. The Pluribus is a minicomputer/
multiprocessor system with extensive fault-tolerance
provisions which serves as a switching node in the ARPA
Network [19].
Another direction of software-controlled system development is in aerospace applications. Representative
illustrations of this approach are the SIFT design [20], the
RCS system [9], and the C. S. Draper Laboratory modular
system [21]. One more area of application which requires
fault-tolerant operation and very high availability for
several years of continuous operation is the control of
electronic switching systems for telephone exchanges.
These systems usually employ manual repair by replacement of a failed part as the last (offline) step of the recovery procedure, while maintaining normal operation by
means of the remaining system modules. A well documented illustration is found in the Electronic Switching
Systems (ESS) of Bell Laboratories. The ESS designs use
several hardware techniques (duplication, matching, error
codes, function monitors) and special software (check
routines, diagnostics, audits) as well as software and
hardware emergency procedures when normal recovery
action does not succeed [12], [13].
Besides the complete systems discussed above, significant efforts have been devoted to providing fault tolerance for individual subsystems, which can then be assembled to form a fault-tolerant system. This is especially
true for secondary and mass storage which has been
characterized by relatively low reliability in the past.
Representative error coding applications include the use
of codes for error control in data communications, magnetic tape units, disc files, primary random access storage,
and a photo-digital mass store [22]. Single-error-correcting
codes are used in the control storage of ESS No. 1 [12] and
the main and control storage of IBM System/370 computers. Error-correcting codes have been shown to provide
a very effective method for fault-tolerance in the storage
medium, and the remaining problems are concentrated in
providing fault-tolerance in the memory access and
readout circuitry.
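The single-error-correcting codes mentioned above can be illustrated with a Hamming (7,4) sketch (illustrative only; the actual ESS and System/370 codes differ in length and detail): three parity bits locate any single flipped bit in a stored word, which is then inverted back.

```python
def hamming74_encode(d):
    # d: 4-bit data word -> 7-bit codeword; parity bits sit at positions 1, 2, 4
    d = [(d >> i) & 1 for i in range(4)]
    p0 = d[0] ^ d[1] ^ d[3]
    p1 = d[0] ^ d[2] ^ d[3]
    p2 = d[1] ^ d[2] ^ d[3]
    bits = [p0, p1, d[0], p2, d[1], d[2], d[3]]   # positions 1..7
    return sum(b << i for i, b in enumerate(bits))

def hamming74_correct(c):
    bits = [(c >> i) & 1 for i in range(7)]
    # Each check spans the positions whose index has that syndrome bit set;
    # a nonzero syndrome is the 1-based position of the flipped bit
    s = 0
    for k in range(3):
        par = 0
        for pos in range(1, 8):
            if pos & (1 << k):
                par ^= bits[pos - 1]
        s |= par << k
    if s:
        bits[s - 1] ^= 1                          # correct the single error
    return (bits[2] << 0) | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)

word = 0b1011
corrupted = hamming74_encode(word) ^ (1 << 4)     # a single stored bit fails
assert hamming74_correct(corrupted) == word       # the word is recovered
```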
Recent studies have considered the problem of fault
tolerance in associative memories and processors [23]. In
general, processor fault-tolerance has been provided by
duplication and reconfiguration at the system level. Some
investigations have been conducted in the use of arithmetic
error codes as the means for error-detection for processor
faults [24] and an experimental processor has been designed and constructed for the JPL-STAR computer [10].
Continuing reductions in the cost of microprocessors make further emphasis on duplication very likely, although error-detecting codes remain a convenient method
for the identification of the faulty processor in a disagreeing pair. Another potentially very effective approach
to error detection in integrated circuits is self-checking
logic design [25].
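The arithmetic error codes mentioned above [24] rely on the fact that a correct arithmetic operation preserves a residue check. A minimal sketch, assuming a low-cost mod-3 residue code (the function names are invented): because no power of two is divisible by 3, any single-bit error in an adder's output changes the residue and is detected.

```python
M = 3  # check modulus: mod-3 residues are cheap to form in binary hardware

def residue_add(a, b):
    # Carry each operand with its residue check; after the addition,
    # a disagreeing residue would signal a processor fault
    s = a + b
    if s % M != (a % M + b % M) % M:
        raise RuntimeError("arithmetic fault detected")
    return s

def residue_add_faulty(a, b):
    # Same check, but with an injected single-bit fault in the adder output;
    # since no power of two is divisible by 3, the flip changes the residue
    s = (a + b) ^ (1 << 3)
    if s % M != (a % M + b % M) % M:
        raise RuntimeError("arithmetic fault detected")
    return s
```

With residue_add(20, 22) the check passes and 42 is returned; residue_add_faulty(20, 22) raises, flagging the faulty result and thus identifying the bad member of a disagreeing pair.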
VI. THE PROBLEM OF DESIGN FAULTS
In order to attain the goal of general fault-tolerance it
is necessary that one of two conditions should be satisfied
with respect to design faults: 1) the hardware and software
should be free of design faults prior to the start of operational use; or, 2) the system should contain complete provisions to detect and to circumvent the effects of hardware
and software design faults during operational use.
The principal tools for the a priori elimination of design
faults are design verification programs for hardware [26],
[27] and an extensive repertoire of validation and verification tools for software [28], [29]. These methods correspond to the fault-intolerance approach of perfecting
hardware components in order to avoid operational
faults.
The alternative is to provide design fault tolerance for
both hardware and software in the same sense as it is
provided against operational faults in fault-tolerant
hardware. The essential difference is that operational
faults in hardware occur after the start of operational use,
while design faults in hardware and software are present
at the start, but become disruptive at a later time. We also
note that design modifications and corrections of discovered design faults are known to lead to new design faults,
and that discoveries of design faults may be expected
throughout the useful life of any large system, similar to
the occurrence of operational faults. This practically verified observation establishes a relationship between the
methodologies for dealing with operational faults and
design faults: the methods that have proven successful in
operational fault tolerance may be transferable to provide
fault tolerance against design faults, especially those
in software which are very difficult to eliminate prior to
operational use.
Both the static and the dynamic hardware redundancy
approaches (as discussed in Section IV-B) have their
counterparts in software design fault tolerance. In the
static case, multiple independently generated programs
are operated concurrently on multiple copies of the hardware [30]. Comparison or majority voting at specified
points is employed to detect or correct design faults. Systems such as SIFT [20], the symmetric multiprocessor [21],
and the Space Shuttle Computer System [3] are especially
suitable for such multiple programming. The dynamic case
uses the equivalent to standby sparing, in which acceptance tests serve to detect design faults and to initiate a
switchover to an alternate software module [31]. An extension of the above techniques to hardware design fault
tolerance is also feasible: the functionally identical copies
of hardware modules are to be designed and manufactured
by separate organizations in order to avoid the existence
of identical design faults in all copies.
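The dynamic software case described above, acceptance tests plus switchover to an alternate module [31], can be sketched as follows (illustrative only; the routine names are invented): each independently written version is tried in turn until one passes the acceptance test.

```python
def recovery_block(alternates, accept, x):
    # Try each independently written version in turn; the first result that
    # passes the acceptance test is returned (dynamic software redundancy)
    for version in alternates:
        try:
            result = version(x)
        except Exception:
            continue                    # a crashed version counts as failed
        if accept(x, result):
            return result
    raise RuntimeError("all alternates failed the acceptance test")

def primary_sqrt(x):
    # Primary version with a design fault: a crude approximation
    return x / 2

def alternate_sqrt(x):
    # Independently written alternate: slow but dependable bisection
    lo, hi = 0.0, max(1.0, x)
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if mid * mid < x else (lo, mid)
    return lo

def accept(x, r):
    # Acceptance test applied to every version's result
    return abs(r * r - x) < 1e-6

result = recovery_block([primary_sqrt, alternate_sqrt], accept, 9.0)
print(result)   # the faulty primary is rejected; the alternate yields ~3.0
```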
The state of the art in design fault tolerance resembles
that in operational fault tolerance some ten years ago. The
cost and the effectiveness of the design fault tolerance
approaches remain to be investigated, and the techniques
require much further development and experimentation.
The success of operational fault tolerance, however, does
strongly indicate that design fault tolerance cannot be
safely ignored solely because of the past tradition of fault
intolerance in this field.
ACKNOWLEDGMENT
Discussions with Drs. D. A. Rennels and W. M. Whitney
of the Jet Propulsion Laboratory, Pasadena, CA, and Maj.
R. L. Bush of the Air Force Avionics Laboratory, Dayton,
OH, have been valuable in the formulation of this paper.
Special thanks are due to one referee for very extensive and
perceptive comments and suggestions.
REFERENCES
[1] A. Avizienis, "Architecture of fault-tolerant computing systems,"
in Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, France,
June 1975, pp. 3-16.
[2] R. A. Short, "The attainment of reliable digital systems through the
use of redundancy-A survey," IEEE Computer Group News, vol.
2, pp. 2-17, Mar. 1968.
[3] A. E. Cooper and W. T. Chow, "Development of on-board space computer systems," IBM J. Res. Develop., vol. 20, pp. 5-19, Jan. 1976.
[4] W. C. Carter and W. G. Bouricius, "A survey of fault-tolerant computer architecture and its evaluation," Computer, vol. 4, pp. 9-16, Jan.-Feb. 1971.
[5] Y.-W. Ng and A. Avizienis, "A model for transient and permanent fault recovery in closed fault tolerant systems," in Proc. 1976 Int. Symp. Fault-Tolerant Computing, Pittsburgh, PA, June 1976, pp. 182-188.
[6] G. C. Gilley, "A fault-tolerant spacecraft," in Dig. 1972 Int. Symp. Fault-Tolerant Computing, June 1972, pp. 105-109.
[7] W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability modeling techniques for self-repairing computer systems," in Proc. 24th Nat. Conf. Ass. Comput. Mach., 1969, pp. 295-383.
[8] R. W. Barlow and F. Proschan, Mathematical Theory of Reliability. New York: Wiley, 1965.
[9] H. O. Levy and R. B. Conn, "A simulation program for reliability prediction of fault tolerant systems," in Dig. 1975 Int. Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 104-109.
[10] A. Avizienis et al., "The STAR (self-testing and repairing) computer: An investigation of the theory and practice of fault-tolerant computer design," IEEE Trans. Comput., vol. C-20, no. 11, pp. 1312-1321, Nov. 1971.
[11] H. Y. Chang, G. W. Smith, Jr., and R. B. Walford, "LAMP: System description," Bell Syst. Tech. J., vol. 53, pp. 1431-1449, Oct. 1974.
[12] R. W. Downing, J. S. Nowak, and L. S. Tuomenoksa, "No. 1 ESS maintenance plan," Bell Syst. Tech. J., vol. 43, Part 1, pp. 1961-2019, Sept. 1964.
[13] H. J. Beuscher et al., "Administration and maintenance plan of No. 2 ESS," Bell Syst. Tech. J., vol. 48, pp. 2765-2815, Oct. 1969.
[14] F. P. Maison, "The MECRA: A self-repairable computer for highly reliable process," IEEE Trans. Comput., vol. C-20, pp. 1382-1393, Nov. 1971.
[15] R. R. Everett, C. A. Zraket, and H. D. Benington, "SAGE-A data-processing system for air defense," in Proc. Eastern Joint Comput. Conf., Washington, DC, Dec. 1957, pp. 148-155.
[16] W. C. Carter et al., "Design of serviceability features for the IBM System/360," IBM J. Res. Develop., vol. 8, pp. 115-125, Apr. 1964.
[17] "An application-oriented multiprocessing system," IBM Syst. J.,
vol. 6, pp. 78-132, 1967.
[18] F. J. Corbato, J. H. Saltzer, and C. T. Clingen, "Multics-The first
seven years," in AFIPS Conf. Proc., vol. 40, 1972, pp. 571-583.
[19] S. M. Ornstein et al., "Pluribus-A reliable multiprocessor," in
AFIPS Conf. Proc., vol. 44, 1975, pp. 551-559.
[20] J. H. Wensley, "SIFT-Software implemented fault tolerance," in
AFIPS Conf. Proc., vol. 41, part 1, 1972, pp. 243-254.
[21] A. L. Hopkins, Jr. and T. B. Smith, III, "The architectural elements
of a symmetric fault-tolerant multiprocessor," IEEE Trans. Comput., vol. C-24, pp. 498-505, May 1975.
[22] D. T. Tang and R. T. Chien, "Coding for error control," IBM Syst.
J., vol. 8, pp. 48-86, 1969.
[23] B. Parhami and A. Avizienis, "A study of fault-tolerance techniques
for associative processors," in AFIPS Conf. Proc., vol. 43, pp.
643-652, 1974.
[24] A. Avizienis, "Arithmetic error codes: Cost and effectiveness studies for application in digital system design," IEEE Trans. Comput., vol. C-20, pp. 1322-1331, Nov. 1971.
[25] W. C. Carter, "Theory and use of checking circuits," in Computer Systems Reliability. Infotech Information Ltd., 1974, pp. 413-454.
[26] S. A. Szygenda and E. W. Thompson, "Modeling and digital simulation for design verification and diagnosis," IEEE Trans. Comput.,
this issue, pp. 1242-1253.
[27] T. T. Butler et al., "LAMP: Application to switching-system development," Bell Syst. Tech. J., vol. 53, pp. 1535-1555, Oct.
1974.
[28] Proc. 1975 Int. Conf. Reliable Software, Los Angeles, CA, Apr.
1975.
[29] E. C. Nelson, "Software reliability," in Dig. 1975 Int. Symp.
Fault-Tolerant Computing, Paris, France, June 1975, pp. 24-28.
[30] A. Avizienis, "Fault-tolerance and fault-intolerance: Complementary approaches to reliable computing," in Proc. 1975 Int. Conf. Reliable Software, Los Angeles, CA, Apr. 1975, pp. 458-464.
[31] B. Randell, "System structure for software fault tolerance," IEEE
Trans. Software Eng., vol. SE-1, pp. 220-232, June 1975.
Algirdas Avizienis (S'55-M'56-F'73) was born in Kaunas, Lithuania,
in 1932. He received the B.S., M.S., and Ph.D. degrees in electrical engineering from the University of Illinois, Urbana.
While at the University of Illinois he was associated with the
Digital Computer Laboratory as a Research Assistant and Fellow from 1954 to 1960, participating in the design of the Illiac II computer. In
1960 he joined the Technical Staff of the Jet
Propulsion Laboratory (JPL), California Institute of Technology, Pasadena, where he initiated and directed the JPL Self-Testing And Repairing (STAR) computer research project. In
1962 he became a member of the faculty of the
School of Engineering and Applied Science at
the University of California, Los Angeles. Presently he is a Professor in the Department of Computer Science at UCLA,
where he teaches and conducts research in computer system architecture
and fault-tolerant computing and is the Principal Investigator of a fiveyear research grant on fault-tolerance sponsored by the National Science
Foundation. He has also remained associated with the Jet Propulsion
Laboratory as an Academic Member of the Technical Staff, and he has
served as the Principal Consultant on fault-tolerant computer research
projects with Ultrasystems, Inc., Irvine, CA.
Dr. Avizienis has served as the first Chairman of the IEEE Computer
Society's Technical Committee on Fault-Tolerant Computing (1969-1973), and as Chairman of the First International Symposium on
Fault-Tolerant Computing in 1971.
Analog/Hybrid Computation and Digital Simulation
GRANINO A. KORN, SENIOR MEMBER, IEEE, AND ROBERT VICHNEVETSKY, MEMBER, IEEE
Abstract-This survey article traces the development of analog/hybrid and digital simulation of continuous systems since the
early 1950's, describes modern aspects, and discusses future
trends.
Index Terms-Continuous-system simulation, analog/hybrid
computation, interactive computers, simulation.
I. SIMULATION OF CONTINUOUS SYSTEMS
CONTINUOUS dynamical systems are those systems which can be described by differential equa-
tions of the initial value kind. Typically,
dy_i/dt = F_i(y_1, y_2, ..., y_n),    i = 1, 2, ..., n    (1)

where y_i(t) are the state variables, and (1) are the state equations. These equations represent a mathematical model of the system.
For computer simulation, we solve (1) by a computer
as an initial value problem, to produce the continuous
functions of time y_i(t). This permits experimentation with
the mathematical model which "simulates" the system's
behavior.
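The solution of (1) as an initial value problem can be sketched with the simplest digital method, forward-Euler integration (an illustrative fragment, not any particular simulation language; the oscillator example is invented):

```python
def simulate(f, y0, dt, steps):
    # Forward-Euler solution of the initial value problem dy/dt = f(y)
    y = list(y0)
    for _ in range(steps):
        dydt = f(y)
        y = [yi + dt * di for yi, di in zip(y, dydt)]
    return y

# State equations of a lightly damped oscillator:
#   y1' = y2,   y2' = -y1 - 0.1 * y2
f = lambda y: [y[1], -y[0] - 0.1 * y[1]]

state = simulate(f, [1.0, 0.0], 0.001, 10_000)   # 10 simulated seconds
print(state)   # the oscillation has decayed from the initial amplitude 1.0
```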
Objects being simulated in this fashion have typically
Manuscript received April 20, 1976; revised August 16, 1976.
G. A. Korn is with the Department of Electrical Engineering, University of Arizona, Tucson, AZ.
R. Vichnevetsky is with the Department of Computer Science, Rutgers
University, New Brunswick, NJ 08903.
included aerospace vehicles, industrial plants, chemical
or biological processes and, in recent years, social and en-
vironmental systems. They may include few state variables
(say 2 or 3) or be rather complex, up to a few thousand
variables. Real-time simulation is obtained when the
computer time (independent variable) changes at the same
rate as the time t in the real system described by (1).
Real-time simulation permits linking the computer with
real hardware (e.g., for testing the control device of a plant)
or interfacing with human operators (e.g., placing a live
"pilot in the loop" of a real-time airplane simulation used
as a training device). More will be said about applications
of simulation in Section X. Computers used for the simulation of continuous dynamical systems have been analog,
digital, or hybrid, the latter combining the first two into
a single computing system (Figs. 1 and 2).
II. ANALOG COMPUTERS
Electronic analog computers implement mathematical
relations between physical variables (typically electrical
voltages) with separate (parallel) computing elements
working simultaneously. Each element effects a simple
operation such as addition, multiplication, and, most important of all, integration with respect to time. Programming is achieved by establishing interconnections between
outputs from and inputs to individual computing elements
by means of wires, called "patch cords."
The first electronic analog computers were conceived
in the 1940's almost as special purpose simulators for air-