Error-control coding techniques, implemented by means of self-checking circuits, will improve system reliability.

Error-Correcting Codes and Self-Checking Circuits

D. K. Pradhan, Oakland University
J. J. Stiffler, Raytheon Company

It is not surprising that error-control coding techniques have been used in computers for many years, especially since they have proven effective against both transient and permanent faults. What is surprising is that coding techniques have not found even more extensive use (in commercial computers, for example), in view of their potential for improving the overall reliability of computers. In this article, therefore, we will bring out some of the reasons for the limited acceptance of error-control coding techniques. In addition, we will examine some code properties, as well as certain implementation techniques, that might help overcome these currently perceived limitations.

Coding for error control

Techniques for coding digital information in order to protect it from errors while it is being stored, transferred, or otherwise manipulated have been the subject of intense investigation for several decades. There is a large body of literature detailing the remarkable theoretical and practical developments resulting from this effort. Although most of this work has been concerned with the error models and decoding constraints encountered in communications, the potential utility of these codes in protecting information in computers has long been recognized. Indeed, such considerations motivated some of the initial work on error-control codes.

Stated somewhat formally, error-control coding entails mapping elements of a data set X = {xi} onto elements of a code-word set Y = {yi}. The code words thus represent the information to be manipulated but are presumably less vulnerable to errors (induced, for example, by failures in the circuitry used to carry out the manipulation) than the original unencoded data. If F denotes the set of independent faults to which the circuitry in question is subject, and if E is the set of errors that can be produced as a result of these faults, then each e ∈ E is an error that can occur as the result of some fault f ∈ F. And if the distance d(y, y') from y to y' is defined as the minimum number of errors in E needed to change y to y', and the distance d associated with a code Y is defined as the minimum distance from any one of its code words to any other code word, then it is not difficult to see that, subject to some mild conditions (for example, the condition that d(y1, y') + d(y2, y') ≥ min[d(y1, y2), d(y2, y1)]), a distance-d code can be used to detect up to d-1 errors; to correct up to (d-1)/2 errors; or to correct all errors if their total number does not exceed some c < (d-1)/2 and otherwise to detect any number of errors in the range c+1 through d-c-1.

In the present context, X generally consists of the 2^k binary k-tuples, and Y of a subset of the 2^n binary n-tuples for some n > k. The most widely postulated class of errors is that in which individual code-word bits are erroneously complemented; that is, E consists of the n errors ei, i = 1, 2, ..., n, with ei the error caused by complementing the ith bit of any y ∈ Y. The errors encountered in transferring (or transmitting) information from one point to another can often be characterized in this way, as can those observed in retrieving information from certain storage media. The distance between two code words in this case is simply the Hamming distance, i.e., the number of corresponding bit positions in which they differ. However, the faults, and hence the errors, in a given situation strongly depend on the circuitry in question, and the Hamming distance may or may not be the relevant measure.
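As a concrete illustration of these definitions, the following sketch (in Python, with a toy code invented purely for the example) computes a code's minimum Hamming distance and the detection and correction capabilities it implies:

    from itertools import combinations

    def hamming_distance(a: str, b: str) -> int:
        """Number of bit positions in which two equal-length words differ."""
        return sum(x != y for x, y in zip(a, b))

    def minimum_distance(code: list) -> int:
        """Minimum distance d of a code: smallest pairwise Hamming distance."""
        return min(hamming_distance(a, b) for a, b in combinations(code, 2))

    # A toy code mapping 2 information bits onto 5-bit code words.
    code = ["00000", "01011", "10101", "11110"]

    d = minimum_distance(code)                       # 3 for this code
    print("detects up to", d - 1, "errors")          # 2
    print("corrects up to", (d - 1) // 2, "errors")  # 1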
As already noted, communications problems have motivated most of the study of error-control codes. Although much of the resulting knowledge can also be exploited in the design of fault-tolerant computers, there are some important differences between the two applications:

(1) Information is generally transmitted serially over a communications channel; it is generally handled in a parallel format in a computer. Consequently, the serial encoding and decoding algorithms developed for communications applications have a much more limited applicability in computer systems.

(2) The time allowed for encoding and decoding is generally more constrained in computer applications than in communications. A few milliseconds' or even a few seconds' delay in decoding information received over a communications channel may be entirely acceptable, whereas even a microsecond's delay in handling critical-path information in a computer could be intolerable.

(3) The complexity of the encoding and decoding circuitry is frequently a much more serious limitation in computer applications than it is in communications. If a code is sufficiently effective in combatting transmission errors, its use may be justified, regardless of the complexity of its associated hardware. But in a computer the main reason for encoding information is to protect it against hardware faults. Unless the hardware needed to generate and check the code is relatively simple compared to the hardware thus monitored, a fault-prone decoder could increase rather than decrease the likelihood of erroneous information propagation.

(4) Anticipated errors in a computer may be different from those in a communications system. Even when the first-order error statistics are the same, differences in higher-order statistics may considerably change the relative effectiveness of various coding techniques in the two applications (see "Coding for RAMs" later in this article).

(5) Communication codes are designed to protect information as it is transferred from one place to another. Although this function is important in computer applications as well, and is in fact the function of major interest here, it should be noted that other constraints on computer error-control codes are also sometimes desirable. In particular, it is frequently desirable to be able to protect information not only when it is being stored or transferred, but also when it is being modified. Many computer operations entail the evaluation of functions of two variables: xk = f(xi, xj). If a code is to protect information even while it is being manipulated in this manner, the code must be preserved under such operations; that is, for every function f of interest and for every xi, xj, xk ∈ X, there must exist a function g such that yk = g(yi, yj) whenever xk = f(xi, xj), with yi, yj, and yk representing the encoded versions of xi, xj, and xk. As might be expected, this last constraint can be severe, particularly when the class of functions f(xi, xj) is large.
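A single even-parity check bit illustrates both sides of this constraint: the code is preserved under bit-by-bit exclusive-OR but not under AND. The following sketch (an illustration only; the encoding and the operand values are arbitrary) makes the point:

    def encode(x: int) -> tuple:
        """Even-parity encoding: the data word plus one parity bit."""
        return (x, bin(x).count("1") % 2)

    def is_codeword(y: tuple) -> bool:
        x, p = y
        return bin(x).count("1") % 2 == p

    a, b = encode(0b1011), encode(0b0110)

    # Bitwise XOR preserves the code: g can be taken as component-wise XOR.
    assert is_codeword((a[0] ^ b[0], a[1] ^ b[1]))

    # Bitwise AND does not: ANDing these code words leaves the code.
    assert not is_codeword((a[0] & b[0], a[1] & b[1]))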
Codes that are preserved under certain arithmetic operations have been developed,15 and others are preserved under bit-by-bit logical operations21; but no known class of codes is preserved under all operations usually implemented in a computer. Although the codes available for logical operations can be used for arithmetic operations,13,14 and vice versa,19 such use is likely to be too inefficient to be of interest. To date, no efficient code useful for both logical and arithmetic operations is available. To achieve a breakthrough, then, one has to look into some new and unconventional coding schemes. Unfortunately, research in this area has been limited, partly because of certain negative results established earlier by Elias17 and by Peterson and Rabin,20 in the area of single-error detection in logical computations. However, it is important to note that more recent work by Pradhan and Reddy21 exhibits an efficient code for multiple-error detection/correction in logical computations.

With the possible exception of (4), the differences between communications and computer applications of error-control coding militate against the latter application. Nevertheless, error-control coding techniques, judiciously applied, can be highly effective in increasing computer reliability. In many cases, codes developed primarily for communications applications can be used to advantage in computers, particularly if the peculiarities of the error patterns likely to occur in a computer are fully exploited. A good example of this is the type of error coding used in RAMs (discussed in detail later in this article).

The principles of error-correcting codes can also be used to design fault-tolerant logic. Significant work in this area has been reported both in the US40-45 and in the USSR.33-39 This research includes the design of counters (Reed and Chiang,45 Hsiao et al.40); synchronous sequential machines (Larsen and Reed,42 Meyer41); and asynchronous sequential machines (Pradhan,44 Pradhan and Reddy,43 Sagalovich37-39). The basic technique used in all of these designs incorporates static redundancy as an integral part of the design. The states of the circuit are encoded into code words of an error-correcting code (in the case of synchronous circuits), or its analog (in the case of asynchronous circuits). The output and next-state functions are defined so that the effect of any fault from a prescribed set is masked; i.e., the network produces correct outputs in spite of the fault. This is accomplished by defining the behavior of the machine for potentially faulty states so that it corresponds to that of the correct state.

This coding technique differs from the replication technique in two respects: the redundancy is implicit, and there is no explicit error-correcting logic, such as majority logic. The advantage of coding over replication is that the former may require fewer I/O pins and chips. This follows from the fact that the number of state variables required by the coding scheme is much smaller than that required by replication, and from the fact that the fault-tolerant circuit that uses coding can be implemented as a single circuit. Two important possible applications of this technique are yield enhancement and reliability improvement.30 Yield is improved when chips with e or fewer cell failures are not discarded in acceptance testing, where e is some number less than the error-correction capability, t. The remaining (t-e) error-correction capability is used for reliability improvement.
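The flavor of these coded-state designs can be suggested by a toy sketch (our own construction for illustration, not one of the cited designs): the states of a modulo-4 counter are encoded in the distance-3 code used earlier, and the next-state function is defined on every possible register pattern so that a single-bit state error is masked implicitly, without majority logic:

    # Four counter states encoded as code words of a distance-3 (5,2) code.
    STATE_CODE = ["00000", "01011", "10101", "11110"]

    def nearest_state(word: str) -> int:
        """Index of the code word closest to `word` in Hamming distance.
        With distance 3, a word lies within distance 1 of at most one state."""
        dist = lambda i: sum(a != b for a, b in zip(word, STATE_CODE[i]))
        return min(range(4), key=dist)

    def next_state(word: str) -> str:
        """Next-state function, defined for ALL 32 patterns: any pattern
        within distance 1 of state i behaves like state i, so a single-bit
        error in the state register is absorbed on the next clock."""
        return STATE_CODE[(nearest_state(word) + 1) % 4]

    corrupted = "01010"                    # state 1 ("01011"), one bit in error
    assert next_state(corrupted) == STATE_CODE[2]   # the count continues: 1 -> 2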
The principles of coding have also found use in the design of self-checking circuits,46-49 which are discussed next.

Self-checking circuits

Self-checking circuits, in general, are that class of circuits in which the occurrence of a fault can be determined by observation of the outputs of the circuits. An important subclass of these self-checking circuits is known as "totally self-checking" circuits. TSC circuits are of special interest and so will be discussed here in detail. Loosely speaking, a circuit is TSC if a fault in the circuit cannot cause an error in the outputs without detection of the fault. Also, the information regarding the presence of the fault is available in TSC circuits as an integral part of the outputs. These TSC circuits are particularly useful in reducing or eliminating the "hard core" of a fault-tolerant system. (A "hard core" is that part of a system that cannot tolerate failures, e.g., decoders for error-correcting codes.)

The following is an example of a circuit which has built-in fault-detection capability. Through this example, we develop the motivation for the TSC design, which is discussed next. The autonomous linear shift register, shown in Figure 1, is a linear counter and has a cycle length of 15. A special feature of this counter is that it can detect faults in itself on-line. The shift register produces the nonzero code words of a five-bit single-parity code in a cyclic chain. The states of the shift register are five-bit code words: the first four bits are information bits, and the last bit is the parity bit. To begin with, the shift register is initialized to a nonzero code word in the single-bit parity code; for each successive shift, it produces a new code word, as shown in Table 1. After 15 shifts, the shift register counter starts repeating the cycle.

Figure 1. A linear counter designed for fault-detection.

Table 1. Shift register contents after different shifts (the 15 successive five-bit code words produced on shifts 0 through 14).

Any single error in the outputs of the shift register can be detected by checking the parity. This requires the addition of extra logic, as shown in Figure 1. The output of this extra logic will be 0 in the absence of any errors, and 1 in the presence of a single error. It can be shown that this design detects all single faults that may occur in the shift register.58 However, the inherent difficulty in this design is that faults cannot be allowed to occur in the parity-check logic, because such a fault (for example, a stuck-at-0 fault at the output of this extra logic) could prevent the circuit from detecting any errors thereafter in the outputs of the shift register. Thus, a subsequent fault in the shift register and the corresponding error could go unnoticed. The presence of a hard core in the design is responsible for the difficulty described above; TSC circuits have been developed to overcome precisely this type of problem.
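In software, the counter and its check can be sketched as follows (the feedback connections of Figure 1 are not reproduced here, so a four-bit maximal-length register with an even-parity fifth bit is assumed for illustration; note that the check function below is exactly the hard core just described):

    def counter_states(seed: int = 0b0001):
        """Four information bits from a maximal-length LFSR (assumed taps:
        x^4 + x + 1) plus an even-parity fifth bit: 15 five-bit code words."""
        state = seed
        for _ in range(15):
            parity = bin(state).count("1") % 2
            yield (state << 1) | parity               # 5-bit code word
            fb = ((state >> 3) ^ state) & 1           # assumed feedback taps
            state = ((state << 1) | fb) & 0b1111

    def check(word5: int) -> int:
        """The extra parity logic: 0 = no error, 1 = single error detected."""
        return bin(word5).count("1") % 2

    assert all(check(w) == 0 for w in counter_states())  # fault-free operation
    assert check(next(counter_states()) ^ 0b00100) == 1  # a single error is flagged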
One fundamental feature of TSC design is that redundancy is embedded directly into the outputs, making it possible to partition the set of outputs into a set of code-word (valid) outputs and a set of non-code-word (invalid) outputs. These code-word outputs are the set of outputs that normally appear at the outputs of the network. On the other hand, the appearance of a non-code word at the outputs of the network signals the presence of a fault in the network. However, the features that distinguish the TSC network from other self-checking networks are its fault-secure46 and self-testing49 properties. These two properties can best be described in terms of the input/output mapping of the network: Let X and Y represent the sets of inputs and outputs, respectively. Let Y = Y1 ∪ Y2, where Y1 and Y2 represent the code-word (valid) and non-code-word (invalid) outputs, respectively. Let F represent a prescribed set of faults for which the design is TSC. Let y be the correct output for the input x. (Note that x ∈ X and y ∈ Y1.) Let y' be the output of the network for the same input x in the presence of a fault f ∈ F.

Fault-secureness guarantees that if y' is a code word, y' ∈ Y1, then y' = y. In other words, the output of the faulty network cannot be a code word and, at the same time, be different from the correct output, y. Thus, as long as the output is a code word, it can safely be assumed to be correct. This is illustrated in Figure 2.

Figure 2. The fault-secure property of TSC networks.

On the other hand, the self-testing property guarantees that for every fault f ∈ F, there exists at least one input x ∈ X for which the resulting output, y', is a non-code word, y' ∈ Y2; i.e., input x will result in the signaling of the presence of the fault. In other words, the input set contains at least one test for every fault in the prescribed set; thus, the occurrence of any fault is bound to be detected sometime during the operation (see Figure 3).

Figure 3. The self-testing property of TSC networks.

While TSC networks possess these interesting features, they do have some practical limitations. For one thing, although every fault is detectable sometime during the operation, there is no guarantee that the first fault will be detected before a second fault occurs. Such a pathological case might be the occurrence of a fault detectable only by a particular input. The possibility exists that this input can become invalid as a test if a second fault occurs before the input is actually applied. This has been referred to as the error latency problem.59 Second, the application of TSC circuits has been limited by the absence of any systematic technique that realizes the self-testing property in an economical way. For example, although fault-secureness can be readily achieved in asynchronous networks, incorporating the self-testing property can be too formidable a task.57

However, a class of TSC circuits known as TSC checkers has a significant potential for large-scale use in the design of fault-tolerant computers. A TSC checker46 is a TSC circuit designed to detect errors in error-detecting codes used in fault-tolerant computers. The most basic application of TSC checkers, however, is their use in monitoring a general TSC network, as described below. The outputs of the TSC network are fed to a TSC checker, designed so that any non-code word at its inputs produces only a non-code word at its outputs. Thus, by observing the output of the checker, one can detect any fault in the TSC network or the checker itself. (However, the checker output does not provide any information as to the location of the fault, i.e., whether the fault is in the TSC circuit or in the checker itself.) TSC checkers have two output leads and, hence, four output combinations: 00, 01, 10, and 11.
Two combinations are considered valid (code-word) outputs; they are usually 01 and 10. The appearance of an invalid (non-code-word) output combination indicates either the presence of an error in the input code word or a fault in the checker. The function of TSC checkers (see Figure 4) is to detect any errors in the input code word, as well as any faults that may occur in the checker itself, as long as they do not occur at the same time.

Table 2 presents the checker outputs for the four possible cases. In the first case, the checker is fault-free and the code word is error-free; the output of the checker is always one of the valid outputs. In the next case, there is an error in the input code word, but there are no faults in the checker. Here, the output will always be an invalid output, so that the error in the code word can be detected. In the third case, there is a fault in the checker, but there is no error in the code word. Then, the output may be either the correct, valid output or an invalid output, depending on whether or not the input code word is a test for the fault in the checker. Finally, when there is an error in the code word and a fault in the checker, the output is indeterminate.

Table 2. Outputs for TSC checker.

                             CHECKER: FAULT-FREE    CHECKER: FAULTY
    CODE WORD: ERROR-FREE    VALID                  VALID/INVALID
    CODE WORD: WITH ERROR    INVALID                INDETERMINATE

As an example, consider a TSC checker design for single-parity codes, where the prescribed set of faults is the set of single faults. This design requires the use of two separate parity checkers. (Figure 4 shows the design for a nine-bit code.) The bits in the code word u1, u2, ..., um, um+1, ..., un are divided into two groups: u1, u2, ..., um; and um+1, um+2, ..., un. (For an optimal design, m is equal to n/2.) These two groups then form the inputs to the two different parity-check circuits. The first circuit produces the output g = u1 ⊕ u2 ⊕ ... ⊕ um, and the second produces um+1 ⊕ um+2 ⊕ ... ⊕ un, which an inverter then complements to form the output h.

Figure 4. A TSC checker for 9-bit single-parity code.

To illustrate that this design is indeed TSC, first let us consider even-parity codes. Since any code word has an even number of 1's, the two groups of bits contain either both an even or both an odd number of 1's. Thus, the valid outputs from the network correspond to 01 and 10. On the other hand, in the presence of a single error in the input code word, one of the two groups will have an odd number of 1's, and the other an even number of 1's. So, a single error at the inputs will produce either 11 or 00 as the output. Since single faults can produce an error only at one of the two outputs, this TSC checker is fault-secure.

The self-testing property of the checker can be deduced from the following observations: The set of input code words applies all possible input combinations to each of the parity-check circuits. Thus, a fault in one of these parity-check circuits will result in an error at its output sometime during the operation. This will, therefore, be detected as an invalid network output. As an example, consider a stuck-at-1 fault at the output lead of the EX-OR gate shown in Figure 4. This fault is detected by a large number of code words, including the one shown. Similarly, in the case of odd-parity codes, the above design can be modified by deleting the inverter at the output of h.
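The behavior just described can be condensed into a short simulation (a sketch of the two-tree checker for the nine-bit even-parity code of Figure 4; the particular code word chosen is arbitrary):

    def tsc_checker(bits, stuck_g=None):
        """Two-rail parity checker for an even-parity code word.
        g = parity of the first group; h = complemented parity of the second
        group (the inverter noted above). Valid outputs: (0,1) and (1,0).
        `stuck_g` optionally forces the g rail to model a stuck-at fault."""
        m = len(bits) // 2
        g = sum(bits[:m]) % 2 if stuck_g is None else stuck_g
        h = 1 - sum(bits[m:]) % 2
        return g, h

    word = [1, 1, 0, 0, 1, 0, 1, 0, 0]             # even weight: a code word
    assert tsc_checker(word) == (0, 1)             # fault-free, error-free: valid

    bad = word.copy(); bad[3] ^= 1                 # single error in the input
    assert tsc_checker(bad) == (1, 1)              # invalid output: detected

    # Self-testing: a stuck-at-1 fault on g is exposed by any code word whose
    # first group has even parity, such as this one.
    assert tsc_checker(word, stuck_g=1) == (1, 1)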
Now, consider the design of the linear counter shown in Figure 1. This design can be made TSC by replacing the error-detection logic with a TSC checker, as shown in Figure 5. Although the checker never receives the all-0 code word, it is still self-testing for all single faults. A point worth noting is that since all TSC checkers have at least two output leads, there may be the need for some hard-core logic to monitor the checker outputs. However, the value of the TSC checker is its capacity to significantly reduce the hard core in a fault-tolerant system.

Figure 5. A TSC linear counter.

TSC checkers for several other codes, such as constant-weight codes,46,52,55,59,61 Hamming codes,7 and Berger codes,48,56 are available in the literature. The next two sections discuss the use of codes for error control in RAMs and certain integrated circuits.

Coding for RAMs

Errors in binary communications channels generally affect the transmitted bits either independently or in bursts and tend to be symmetric; that is, an error is roughly as likely to convert a 1 to a 0 as conversely. The same can often be said about errors in computers due to bus faults or faults in the storage medium (although other types of errors can also occur; see "Coding for LSI circuits and transient faults" below). Codes designed to combat such faults are sometimes referred to as transfer-error-control codes to distinguish them from codes used to control other types of errors (e.g., arithmetic or logical errors). The minimum Hamming distance between any two distinct words in such a code clearly provides an indication of its effectiveness against independent bit errors. Moreover, if the word "bit" in the previous definition of Hamming distance (see "Coding for error control") is replaced by "symbol," the same measure can also be used to gauge the effectiveness of codes in combatting errors (e.g., byte-oriented errors) confined to discrete groups of bits2,24 (with each group of bits treated as a single symbol).

Virtually all useful transfer-error-control coding techniques involve generalizations of the familiar parity-check concept. A simple parity check on the binary digits representing the information (i.e., a (k+1)th bit representing the modulo-two sum of the k data bits) obviously provides a means of detecting a change (error) in any single bit, or indeed in any odd number of bits. More powerful codes (codes having minimum distances d > 2) are constructed by simply appending more parity-check bits, with each such bit representing the parity of a different subset of the k information bits. Techniques for constructing error-control codes in this manner are well documented and need not be discussed further here.2,3

Such codes protect the contents of memory against hardware malfunctions simply by storing the parity-check bits belonging to each word along with the word itself. When a word is retrieved from memory, the parity bits that should be associated with that word can be redetermined and compared with those actually retrieved. Any difference between the calculated and retrieved parity bits, called the "error syndrome," indicates the presence of one or more errors. Moreover, if the number of errors does not exceed the number that can be corrected, the syndrome uniquely identifies those bits that are erroneous. This procedure, while conceptually straightforward, can be complex to implement when even moderate numbers of errors are to be corrected.
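As an illustration of the procedure, the following sketch uses the (7,4) Hamming code (a choice made for the example; the article does not fix a particular code). The check bits are stored with the word, and on retrieval the recomputed syndrome directly names the erroneous bit:

    # Parity-check matrix of the (7,4) Hamming code: column j is j in binary,
    # so the syndrome of a single-bit error is the 1-based position of that bit.
    H = [[(j >> i) & 1 for j in range(1, 8)] for i in range(3)]

    def syndrome(word):
        """Recompute the parity checks; 0 means no detected error, otherwise
        the value is the (1-based) position of a single erroneous bit."""
        return sum((sum(h * w for h, w in zip(row, word)) % 2) << i
                   for i, row in enumerate(H))

    stored = [0, 0, 1, 0, 1, 1, 0]           # a code word: its syndrome is 0
    assert syndrome(stored) == 0

    retrieved = stored.copy()
    retrieved[4] ^= 1                        # bit 5 corrupted in memory
    pos = syndrome(retrieved)                # nonzero syndrome: error present
    assert pos == 5
    retrieved[pos - 1] ^= 1                  # the syndrome locates the bit; fix it
    assert retrieved == stored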
The widespread use of multiple-error-correcting codes in communication systems has come about largely because the mathematical structure used to define good error-control codes (the cyclic code structure) can also be exploited to reduce significantly the complexity of their decoders. Unfortunately, the resulting decoders, while well-suited for the serial data flow encountered in communications, would significantly increase the time needed to retrieve a corrected word from memory. (This is not to say that the cyclic structure imposed on error-control codes cannot be used to advantage in computer applications; see, for example, Brown and Sellers,27 and Chien.28) The use of error-control codes to protect the contents of a computer's main memory is therefore limited by two factors: fast (parallel) decoders tend to be too complex; simple (serial) decoders tend to be too slow. As a result, the codes used in main memories have been restricted to single-error-detection (d = 2); single-error-correction (d = 3); or single-error-correction, double-error-detection (d = 4) codes, since either the delay or the cost of the circuitry entailed in correcting more than a single error is generally unacceptable.

There are several ways to correct multiple errors in memories, however, without incurring either the delays or the complexities usually associated with multiple-error correction. These methods, to be effective, must take advantage of the fact that certain error patterns are much more likely than others to be encountered in random-access memories. The error-pattern distribution clearly depends on both the memory technology and organization. Consider, for example, a RAM in which faults occur independently, each fault affecting only a single bit position (but possibly affecting the same bit in every word). Such error patterns are likely to occur, for example, in plated-wire memories and in N-word x 1-bit semiconductor memories. One method of handling this fault pattern is simply to use a d = 2 or d = 3 code to detect errors as they occur, and then to isolate the defective bit position and switch in a spare bit line to replace it on all subsequent stores. This, of course, requires extra hardware to implement the spare bit lines and the switch needed to make them accessible. The code is used only to detect errors (or, at most, to correct errors when the memory's contents are being recovered following a detected error). The time needed to detect an error, even when added to the delay introduced by the switch, can be considerably less than the time needed to correct even a single error. The number of errors that can be corrected using this method, so long as they occur singly, is limited only by the number of spare bit lines. (If errors tend to occur in clusters, a suitable single-symbol error-detecting or correcting code2 might be used instead of the single-bit error-detecting or correcting code assumed here.) If the memory were implemented with N-word by m-bit semiconductor devices, for example, the symbol alphabet might be defined as the set of 2^m binary m-tuples.

Another method for coping with multiple errors in random-access memories without introducing excessive decoding delays or excessively complex hardware is to take advantage of the erasure-correcting capability of transfer-error-control codes.32 A potentially erroneous bit is called an erasure if its location is known and the only uncertainty is whether or not it is actually correct.
This, of course, is the situation typically encountered in a RAM once a bit line has been diagnosed as defective (and is not replaced). Erasure correction has two major advantages: (1) Distance-d transfer codes can be used to correct up to d-1 erasures, as opposed to a maximum of (d-1)/2 errors. (This is easily verified: if two code words are distance d apart, at least d bits have to be erased before the two words can be identical in their unerased positions.) And (2) erasures can be corrected more quickly than errors. It is only necessary to determine and store the syndromes associated with the various combinations of erasures to be corrected, and to compare the calculated syndrome of each word read from memory with this stored set. The erased positions containing errors are known as soon as a match is found. If the calculated syndrome does not match a stored syndrome (and if it is not the all-0 syndrome indicating no errors), a new bit position contains an error. It is then only necessary to determine the location of that error, identify that position as an erasure, and augment the set of stored syndromes to reflect this added erasure. This procedure can continue until the erasure-correction capability of the code has been exhausted.

It is important to reemphasize that the effectiveness both of a code and of the procedure used to decode it strongly depends on the failure modes of the hardware. Consider, for example, the reliability of a 32-bit, 4096-word RAM array protected by a distance-4 code. (Any single-bit error is corrected; if a second error is detected, the memory is no longer used.) Suppose the memory array is to be implemented with 1024 x 1 semiconductor chips, each having a λ = 10^-6 failure/hour hazard rate; and suppose all failed chips exhibit one of two symptoms: either a single cell (bit-storage element) fails without affecting the rest of the device, or the entire chip fails. Let γλ denote the hazard rate associated with the first type of failure, and (1-γ)λ the rate associated with the second type. The probability R(t) that the array is still usable after six months of operation is plotted in Figure 6. As the plot shows, the effectiveness of the code is highly dependent on γ, the fraction of failures confined to a single cell. The probability that the code is inadequate (i.e., that the array is no longer usable) varies by nearly three orders of magnitude, from .0056 percent when γ = 1 to 5 percent when γ = 0.

The conclusion is apparent: a coding scheme may be considerably less effective than expected if the types of failures are considerably different than expected. So unless the likelihood of various types of failures can be reliably predicted, it is generally better to select as robust a coding procedure as possible (i.e., one that works well regardless of the type of failure). In this example, a multiple-erasure-decoding scheme, in which each chip is treated as unreliable as soon as it exhibits a malfunction, might well be preferable.

Figure 6. Code effectiveness as a function of failure mode (probability that the array is unusable, plotted against γ, the fraction of single-bit failures).
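Returning to the erasure-decoding procedure described above, the table-lookup idea can be sketched by continuing the (7,4) Hamming example from the previous section (d = 3, so up to d-1 = 2 erasures can be handled; the choice of defective bit lines is of course arbitrary):

    from itertools import chain, combinations

    erased = (2, 4)      # 0-based positions of bit lines known to be defective

    def pattern_syndrome(positions):
        """Syndrome of the error pattern flipping exactly these positions
        (syndrome() as in the earlier sketch; syndromes add modulo 2)."""
        return syndrome([1 if i in positions else 0 for i in range(7)])

    # Precompute and store the syndrome of every subset of the erasures.
    subsets = chain.from_iterable(combinations(erased, r) for r in range(3))
    table = {pattern_syndrome(s): s for s in subsets}

    def read_word(raw):
        s = syndrome(raw)
        if s in table:   # errors confined to the known erasures: fast fix
            return [b ^ (i in table[s]) for i, b in enumerate(raw)]
        # Otherwise a new bit position is in error: locate it, add it to the
        # erasure set, and extend `table` (omitted in this sketch).
        raise RuntimeError("error outside the known erasures")

    damaged = [b ^ (i in erased) for i, b in enumerate(stored)]  # both lines bad
    assert read_word(damaged) == stored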
Coding for LSI circuits and transient faults

As illustrated by the example in the last section, a code can be effective if there is a good match between the type of errors for which the code is designed and the type of errors that occur. This section focuses on codes specifically designed for a class of errors different from the types discussed so far: the so-called unidirectional error, one that contains either 1-to-0 or 0-to-1 errors, but not both. (There is evidence that unidirectional errors occur in many integrated circuits69; faults such as short-circuit faults are a likely source of these errors.)

Existing codes developed to control unidirectional errors are reviewed here. Their inadequacies are discussed, since these inadequacies have prompted development of new codes68 particularly effective against transient faults. The various faults in a standard LSI device, the read-only memory, provide a clear illustration of how unidirectional errors are caused in practice. (It is important to note that the following discussion of the sources of unidirectional errors in ROMs is relevant to certain technologies and not to all.) Unidirectional errors in ROMs have a number of likely sources:

(1) Address decoder. Single and multiple faults in address decoders result in either no access or multiple access.51 No access yields an all-0 word read-out, and multiple access causes the OR of several words to be read out. In both cases, the resulting errors are unidirectional, as they contain either 0-to-1 or 1-to-0 type errors. (In the case of multiple access, when the correct code word is not contained in the accessed set of code words, the error in the correct code word is not necessarily unidirectional. However, this error can be modeled as a unidirectional error in some other code word that is contained in the accessed set. Hence, any unidirectional error-detection scheme can detect the error.)

(2) Word line. An open word line may cause all bits in the word beyond the point of failure to be stuck at 0. On the other hand, two word lines shorted together will form an OR function beyond the point where they are shorted. In either case, the resulting errors are unidirectional.

(3) Power supply. A failure in the power supply usually results in a unidirectional error.

There are two classes of codes that can detect all unidirectional errors: constant-weight codes, which are nonseparable, and Berger codes,63 which are separable. A code is separable if the information contained in any code word is represented directly by the k-bit number in some fixed k positions. In other words, a code C with M code words is separable if, for every i, 0 ≤ i ≤ M-1, there exists a single code word xi ∈ C in which the number i appears in some fixed k positions of xi. On the other hand, the information contained in a code word of a nonseparable code cannot be obtained without using a special decoder circuit. The nonseparable codes are, therefore, not useful in most computer applications, such as error-control coding of operands in arithmetic and logic processors, addresses for data in memory, and horizontal microinstructions in control units.5

The nonseparable m-out-of-n code consists simply of all possible binary n-vectors with m 1's in them. Unidirectional errors result in a change in the number of 1's in the code word; hence, these errors are detected. Because these codes are of the nonseparable type, they have limited use. However, significant work has already been performed on the design of TSC checkers for m-out-of-n codes. (In fact, the availability of efficient TSC checkers56-59 is precisely what makes these codes of some practical interest.) The ESS computer is the first known application of m-out-of-n codes.
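Both checks are simple enough to sketch in a few lines (an illustration of the definitions, not of the cited TSC checker designs; the word sizes are arbitrary):

    def constant_weight_valid(word: str, m: int = 2) -> bool:
        """m-out-of-n check: a unidirectional error can only move the weight
        in one direction, so it always changes the count of 1's."""
        return word.count("1") == m

    assert constant_weight_valid("0110")         # a 2-out-of-4 code word
    assert not constant_weight_valid("0100")     # 1-to-0 error: weight fell
    assert not constant_weight_valid("1110")     # 0-to-1 error: weight rose

    def berger_encode(data: str) -> str:
        """Separable Berger code: append the binary count of the 0's in the
        data. A unidirectional error moves the data's 0-count and the check
        field's value in opposite directions, so it cannot escape detection."""
        check_len = len(data).bit_length()
        return data + format(data.count("0"), "0%db" % check_len)

    def berger_valid(word: str, k: int) -> bool:
        return int(word[k:], 2) == word[:k].count("0")

    cw = berger_encode("1010")                   # "1010" + "010" (two 0's)
    assert berger_valid(cw, 4)
    assert not berger_valid("1110" + cw[4:], 4)  # 0-to-1 error in data: detected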
In contrast, Berger codes63 are separable and therefore have a much greater potential for application in fault-tolerant computers. Recently, efficient TSC designs for Berger codes have become available.55 Although these codes have not yet found implementation in fault-tolerant computers, there is significant potential for their use in both synchronous and asynchronous circuits.57

It is interesting to note, however, that there are two reasons why neither of the above-described codes may ever find widespread application in fault-tolerant computers: their inability to correct any errors, and their incompatibility with parity-check codes, currently the primary codes used in computers. Consequently, recent research64,68 has focused on developing codes not only compatible with parity-check codes and able to correct a limited number of errors, but also able to detect certain unidirectional errors. Before we illustrate this further, we will provide the motivation for error correction in the context of transient faults.

Error correction is one of the most effective error-control techniques for transient (or intermittent) faults. These faults constitute many real-time faults and have the following general characteristics:

(1) They are often environmentally induced; the circuits most vulnerable to transient faults are those operating close to their tolerance limits because of aging, overloading, etc. As a result, transient faults are extremely difficult to diagnose during initial (acceptance) testing. Therefore, some form of on-line protection is essential to control errors resulting from these faults.

(2) They cause errors that are often nonrecurring. Therefore, dynamic redundancy techniques such as on-line testing and switching to spares may not be effective, since all that is needed may be to restore the correct information, not to discard physical components.

(3) They produce two types of errors: independent errors or "bursty" errors. The errors caused by a single transient fault are likely to be limited in number if they are independent, and unidirectional if they are bursty.

All these characteristics lead to an error-control approach that may prove to be most effective against transient faults: the use of codes that can correct some t random errors, as well as detect all unidirectional errors. Recently, some attempt has been made to construct precisely such codes.64,67,68

An example of a random error-correcting and unidirectional error-detecting code is shown in Table 3. The code, C, is a systematic code and can both correct single errors and detect all unidirectional errors. It is important to note that this code is a parity-check code as well as a systematic (separable) one. Therefore, it does not have the shortcomings of either Berger codes (which are not parity-check codes) or m-out-of-n codes (which are not separable codes).

Table 3. A random error-correcting and unidirectional error-detecting code.

                  INFORMATION BITS    CHECK BITS
                  u1    u2            p1    p2    p3    p4
        C =       0     0             1     1     0     1
                  0     1             1     0     1     0
                  1     0             0     1     1     0
                  1     1             0     0     0     1
    BIT POSITION: 1     2             3     4     5     6

The following equations describe the parity-check relationship between the check bits and the information bits:

    p1 = u1 ⊕ 1
    p2 = u2 ⊕ 1
    p3 = u1 ⊕ u2
    p4 = u1 ⊕ u2 ⊕ 1

(p1, p2, and p4 are odd parities.) The equations for the four-bit error syndrome (s1, s2, s3, s4) are as follows:

    s1 = p1 ⊕ u1 ⊕ 1
    s2 = p2 ⊕ u2 ⊕ 1
    s3 = p3 ⊕ u1 ⊕ u2
    s4 = p4 ⊕ u1 ⊕ u2 ⊕ 1

Table 4 describes the combinations of syndrome bit patterns and the corresponding error locations.

Table 4. Syndrome bits and error positions.

    s1    s2    s3    s4      BIT IN ERROR
    0     0     0     0       NONE
    1     0     1     1       1
    0     1     1     1       2
    1     0     0     0       3
    0     1     0     0       4
    0     0     1     0       5
    0     0     0     1       6
    ALL OTHER COMBINATIONS    MULTIPLE/UNIDIRECTIONAL ERROR DETECTION

The encoding and decoding circuit for the code shown in Table 3 can be easily implemented. Note that the code is easily derived from the maximal-length (7,3) code by code puncturing and expurgation techniques and is also a coset code.2 Other techniques for constructing such codes can be found in Pradhan.68
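The code is small enough to exercise directly; the following sketch implements the Table 3 encoder, the syndrome equations, and the Table 4 lookup (the all-0 read-out caused by an address-decoder fault serves as the unidirectional-error example):

    def encode_c(u1: int, u2: int) -> list:
        """Table 3 encoder: information bits u1 u2 followed by p1 p2 p3 p4."""
        return [u1, u2, u1 ^ 1, u2 ^ 1, u1 ^ u2, u1 ^ u2 ^ 1]

    def syndrome_c(w: list) -> tuple:
        u1, u2, p1, p2, p3, p4 = w
        return (p1 ^ u1 ^ 1, p2 ^ u2 ^ 1, p3 ^ u1 ^ u2, p4 ^ u1 ^ u2 ^ 1)

    # Table 4 as a lookup: syndrome pattern -> erroneous bit position (or None).
    TABLE4 = {(0, 0, 0, 0): None, (1, 0, 1, 1): 1, (0, 1, 1, 1): 2,
              (1, 0, 0, 0): 3, (0, 1, 0, 0): 4, (0, 0, 1, 0): 5, (0, 0, 0, 1): 6}

    def decode_c(w: list) -> tuple:
        s = syndrome_c(w)
        if s not in TABLE4:
            raise ValueError("multiple/unidirectional error detected")
        if TABLE4[s] is not None:
            w = w.copy()
            w[TABLE4[s] - 1] ^= 1          # correct the single error
        return w[0], w[1]

    w = encode_c(1, 0)                     # code word 1 0 0 1 1 0
    w[4] ^= 1                              # a single (random) error in bit 5
    assert decode_c(w) == (1, 0)           # corrected

    # An all-0 read-out (a unidirectional, 1-to-0 corruption of any code word)
    # yields syndrome (1, 1, 0, 1), which is not in Table 4: it is detected
    # rather than miscorrected.
    try:
        decode_c([0, 0, 0, 0, 0, 0])
    except ValueError:
        pass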
The motivations for using certain types of codes in LSI devices, then, are different from those for using certain other codes in communications. We hope that the material here will suggest new research for the construction of unconventional codes that provide error protection more in line with the type of errors encountered in LSI devices.

Error-control coding techniques can be highly effective in improving computer reliability, in spite of the generally tougher constraints imposed on encoding and decoding hardware in this application than in more conventional communications applications. The effective use of such codes requires both the code and the decoding procedure to be tailored to the peculiarities of the hardware.

The benefits of error-control codes are demonstrated in the Fault-Tolerant Spaceborne Computer.4 The FTSC protects all addresses and data by means of various versions of the same (shortened) cyclic code. This code is appended to each data word as it enters the computer (if the word is already encoded, the code is checked at the entry port) and remains with that word throughout all computer operations except those taking place in the central processing unit (in which case, all operations are performed in duplicate). The properties of the code used in the FTSC4 include the

(1) ability to detect bursts of up to eight adjacent errors (serial data bus);
(2) ability to detect all errors confined to an eight-bit byte (address and data buses);
(3) ability to correct a single erasure and simultaneously detect a second error (memory);
(4) ability to detect a solid burst of errors, i.e., an error pattern in which a group of contiguous bits are all in error (address generators in the direct-memory-access units); and
(5) ability to be generated and monitored by a concatenation of identical devices (throughout the computer; this property can be regarded as a parallel version of the previously noted shift-register encoding properties of cyclic codes).

Even this list does not begin to exhaust the ways in which error-control coding techniques can improve computer reliability. We predict significantly increased use of these techniques as the properties of error-control codes become better understood.

Acknowledgment

This work was supported in part by AFOSR Contract No. F49620-79-C-0119.

References and bibliography

General references

1. Ball, M. and F. Hardie, "Effect on Detection of Intermittent Failure in Digital Systems," AFIPS Conf. Proc., Vol. 35, 1969 FJCC, pp. 329-335.
2. Berlekamp, E. R., Algebraic Coding Theory, McGraw-Hill, New York, 1968.
3. Peterson, W. W. and E. J. Weldon, Error-Correcting Codes, MIT Press, Cambridge, Mass., 1971.
4. Stiffler, J. J., "Architectural Design for Near-100% Fault Coverage," Proc. 1976 Int'l Symp. Fault-Tolerant Computing, Pittsburgh, Pa., June 1976, pp. 134-137.*
5. Tanenbaum, A. S., Structured Computer Organization, Prentice-Hall, Englewood Cliffs, N.J., 1976.
6. Tasar, O. and V. Tasar, "A Study of Intermittent Faults in Digital Computers," AFIPS Conf. Proc., 1977 NCC, pp. 807-811.
7. Wakerly, J. F., Error Detecting Codes, Self-Checking Circuits and Applications, Elsevier-North Holland, New York, 1978.

Codes for arithmetic operations

8. Avizienis, A., "Arithmetic Error Codes: Cost and Effectiveness Studies for Application in Digital Systems," IEEE Trans. Computers, Vol. C-20, No. 11, Nov. 1971, pp. 1322-1330.
9. Chien, R. T., S. J. Hong, and F. P. Preparata, "Some Results on the Theory of Arithmetic Codes," Information and Control, Vol. 19, 1971, pp. 246-264.
10. Diamond, J. L., "Checking Codes for Digital Computers," Proc. IRE, Apr. 1955, pp. 487-488.
11. Langdon, G. G., Jr., and C. K. Tang, "Concurrent Error Detection for Group-Carry-Look-Ahead in Binary Adders," IBM J. Research and Development, Vol. 14, No. 5, Sept. 1970, pp. 563-573.
12. Massey, J. L. and O. N. Garcia, "Error Correcting Codes in Computer Arithmetic," Chapter 5 in Advances in Information System Sciences, Vol. 4, pp. 273-326, Plenum Press, New York, 1971.
13. Pradhan, D. K., "Fault-Tolerant Carry Save Adders," IEEE Trans. Computers, Vol. C-23, No. 11, Nov. 1974, pp. 1320-1322.
14. Pradhan, D. K. and L. C. Chang, "Synthesis of Fault-Tolerant Arithmetic and Logic Processors by Using Nonbinary Codes," Digest of Papers-Fourth Ann. Int'l Symp. Fault-Tolerant Computing, Urbana, Ill., June 1974, pp. 4.22-4.28.*
15. Rao, T. R. N., Error Coding for Arithmetic Processors, Academic Press, New York, 1974.

Codes for logical operations

16. Eden, M., "A Note on Error Detection in Noisy Logical Computers," Information and Control, Vol. 2, Sept. 1959, pp. 310-313.
17. Elias, P., "Computation in the Presence of Noise," IBM J. Research and Development, Vol. 2, Oct. 1958, pp. 346-353.
18. Garcia, O. N. and T. R. N. Rao, "On the Methods of Checking Logical Operations," Proc. 2nd Ann. Princeton Conf. Information Science and Systems, 1968, pp. 89-95.
19. Monteiro, P. M. and T. R. N. Rao, "A Residue Checker for Arithmetic and Logical Operations," Digest of Papers-Int'l Symp. Fault-Tolerant Computing, Boston, Mass., June 1972.*
20. Peterson, W. W. and M. O. Rabin, "On Codes for Checking Logical Networks," IBM J. Research and Development, Vol. 3, No. 2, Apr. 1959, pp. 163-168.
21. Pradhan, D. K. and S. M. Reddy, "Error-Control Techniques for Logical Processors," IEEE Trans. Computers, Vol. C-21, No. 12, Dec. 1972, pp. 1331-1336.
22. Winograd, S. and J. D. Cowan, Reliable Computation in the Presence of Noise, MIT Press, Cambridge, Mass., 1963.

Codes for memory

23. Black, C. J., C. E. Sundberg, and W. K. S. Walker, "Development of a Spaceborne Memory with a Single Error and Erasure Correction Scheme," Proc. Seventh Ann. Int'l Conf. Fault-Tolerant Computing, Los Angeles, Calif., June 1977, pp. 50-55.*
24. Bossen, D. C., "b-adjacent Error Correction," IBM J. Research and Development, Vol. 14, July 1970, pp. 402-408.
25. Bossen, D. C., L. C. Chang, and C. L. Chen, "Measurement and Generation of Error Correcting Codes for Package Failures," IEEE Trans. Computers, Vol. C-27, No. 3, Mar. 1978, pp. 201-204.
26. Carter, W. C. and C. E. McCarthy, "Implementation of an Experimental Fault-Tolerant Memory System," IEEE Trans. Computers, Vol. C-25, No. 6, June 1976, pp. 557-568.
27. Brown, D. T. and F. F. Sellers, Jr., "Error Correction for IBM 800-bit-per-inch Magnetic Tape," IBM J. Research and Development, Vol. 14, July 1970, pp. 384-389.
28. Chien, R. T., "Memory Error Control: Beyond Parity," IEEE Spectrum, July 1973, pp. 18-23.
29. Hsiao, M. Y., "Optimum Odd-weight Column Codes," IBM J. Research and Development, Vol. 14, No. 4, July 1970.
30. Hsiao, M. Y. and D. C. Bossen, "Orthogonal Latin Square Configuration for LSI Memory Yield and Reliability Enhancement," IEEE Trans. Computers, Vol. C-24, No. 5, May 1975, pp. 512-517.
31. Reddy, S. M., "A Class of Linear Codes for Error Control in Byte-per-card Organized Digital Systems," IEEE Trans. Computers, Vol. C-27, No. 5, May 1978, pp. 455-459.
32. Stiffler, J. J., "Coding for Random Access Memories," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 526-531.

Fault-tolerant logic using coding

33. Problems of Information Transmission, translated from Russian, Vols. 1-4, Faraday Press; Vol. 5, Consultants Bureau, Plenum Publishing Co., New York.
34. Nemsadze, N. I., Problems of Information Transmission, 1969 (No. 1), 1972 (No. 2), Consultants Bureau, Plenum Publishing Co., New York.
35. Nikanorov, A. A., Problems of Information Transmission, 1974 (No. 2), Consultants Bureau, Plenum Publishing Co., New York.
36. Nikanorov, A. A. and Y. L. Sagalovich, "Linear Codes for Automata," Int'l Symp. Design and Maintenance of Logical Systems, Toulouse, France, Sept. 27-28, 1972.
37. Sagalovich, Y. L., Problems of Information Transmission, 1960 (No. 2), 1965 (No. 2), 1967 (No. 2), 1972 (No. 3), 1973 (No. 1), 1976 (No. 4), 1978 (No. 2), Faraday Press and Consultants Bureau, Plenum Publishing Co., New York.
38. Sagalovich, Y. L., States Coding and Automata Reliability, Svjas, Moscow, 1975 (in Russian).
39. Sagalovich, Y. L., "Information Theoretical Methods in the Theory of Reliability for Discrete Automata," Proc. 1975 IEEE-USSR Joint Workshop on Information Theory, Moscow, Dec. 15-19, 1975.
40. Hsiao, M. Y., A. M. Patel, and D. K. Pradhan, "Store Address Generator with Built-in Fault-Detection Capabilities," IEEE Trans. Computers, Vol. C-26, No. 11, Nov. 1977, pp. 1144-1147.
41. Meyer, J. F., "Fault-Tolerant Sequential Machines," IEEE Trans. Computers, Vol. C-20, Oct. 1971, pp. 1167-1177.
42. Larsen, R. W. and I. Reed, "Redundancy by Coding Versus Redundancy by Replication for Failure-Tolerant Sequential Circuits," IEEE Trans. Computers, Vol. C-21, No. 2, Feb. 1972, pp. 130-137.
43. Pradhan, D. K. and S. M. Reddy, "Fault-Tolerant Asynchronous Networks," IEEE Trans. Computers, Vol. C-23, No. 7, July 1974, pp. 651-658.
44. Pradhan, D. K., "Fault-Tolerant Asynchronous Networks Using Read-Only Memories," IEEE Trans. Computers, Vol. C-27, No. 7, July 1978, pp. 674-679.
45. Reed, I. S. and A. C. L. Chiang, "Coding Techniques for Failure-Tolerant Counters," IEEE Trans. Computers, Vol. C-19, No. 11, Nov. 1970, pp. 1035-1038.

Self-checking circuits

46. Anderson, D. A., "Design of Self-Checking Digital Networks," CSL Report No. 527, University of Illinois, Urbana, Ill., 1971.
47. Anderson, D. A. and G. Metze, "Design of Totally Self-Checking Circuits for M-out-of-n Codes," IEEE Trans. Computers, Vol. C-22, No. 3, Mar. 1973, pp. 263-269.
48. Ashjee, M. J. and S. M. Reddy, "On Totally Self-Checking Checkers for Separable Codes," IEEE Trans. Computers, Vol. C-26, No. 8, Aug. 1977, pp. 737-744.
49. Carter, W. C. and P. R. Schneider, "Design of Dynamically Checked Computers," Proc. IFIP Congress 68, Vol. 2, Edinburgh, Scotland, pp. 878-883.
50. Carter, W. C., K. A. Duke, and D. C. Jessep, "A Simple Self-Testing Decoder Checking Circuit," IEEE Trans. Computers, Vol. C-20, No. 11, Nov. 1971, pp. 1413-1414.
51. Cook, R. W. et al., "Design of Self-Checking Microprogram Controls," IEEE Trans. Computers, Vol. C-22, No. 3, Mar. 1973, pp. 255-262.
52. David, R., "A Totally Self Checking 1-out-of-3 Code," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 570-572.
53. Diaz, M., "Design of Totally Self Checking and Fail Safe Sequential Machines," Digest of Papers-Fourth Ann. Int'l Symp. Fault-Tolerant Computing, Urbana, Ill., June 1974, pp. 3.19-3.24.*
54. Diaz, M. and J. M. Desouza, "Design of Self-Checking Microprogrammed Controls," Digest of Papers-1975 Int'l Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 137-142.*
55. Marouf, M. A. and A. D. Friedman, "Design of Self-Checking Checkers for Berger Codes," Digest of Papers-Eighth Ann. Int'l Conf. Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 179-184.*
56. Marouf, M. A. and A. D. Friedman, "Efficient Design of Self-Checking Checkers for M-Out-of-N Codes," Proc. Seventh Ann. Int'l Conf. Fault-Tolerant Computing, Los Angeles, Calif., June 1977, pp. 143-149.
57. Pradhan, D. K., "Asynchronous State Assignments with Unateness Properties and Fault-Secure Design," IEEE Trans. Computers, Vol. C-27, No. 5, May 1978, pp. 396-404.
58. Pradhan, D. K. et al., "Shift Registers Designed for On-Line Fault-Detection," Digest of Papers-Eighth Ann. Conf. Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 173-178.*
59. Shedletsky, J. J. and E. J. McCluskey, "The Error Latency of a Fault in Combinational Digital Circuits," Digest of Papers-1975 Int'l Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 210-214.*
60. Smith, J. E., "The Design of Totally Self-Checking Combinational Circuits," CSL Report No. R-737, University of Illinois, Urbana, Ill., Aug. 1976.
61. Smith, J. E. and G. Metze, "Strongly Fault Secure Logic Networks," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 491-499.
62. Wang, S. L. and A. Avizienis, "The Design of Totally Self-Checking Circuits Using Programmable Logic Arrays," Digest of Papers-Ninth Ann. Int'l Symp. Fault-Tolerant Computing, Madison, Wis., June 1979, pp. 173-180.*

Codes for unidirectional errors

63. Berger, J. M., "A Note on Error Correction Codes for Asymmetric Channels," Information and Control, Vol. 4, Mar. 1961, pp. 68-73.
64. Bose, B., "Theory and Design of Unidirectional Error Codes," PhD dissertation, Computer Science and Engineering Dept., Southern Methodist University, Dallas, Tex., in progress.
65. Frieman, C. V., "Protective Block Codes for Asymmetric Binary Channels," PhD dissertation, Columbia University, New York, May 1961.
66. Parhami, B. and A. Avizienis, "Detection of Storage Errors in Mass Memories Using Low-Cost Arithmetic Codes," IEEE Trans. Computers, Vol. C-27, No. 4, Apr. 1978, pp. 302-308.
67. Pradhan, D. K. and S. M. Reddy, "Fault-Tolerant Fail-Safe Logic Networks," Proc. COMPCON Spring 77, pp. 361-363.
68. Pradhan, D. K., "A New Class of Error Correcting-Detecting Codes for Fault-Tolerant Computer Applications," to appear in IEEE Trans. Computers, Special Issue on Fault-Tolerant Computing, Vol. C-29, No. 6, June 1980.
69. Sahani, R. M., "Reliability of Integrated Circuits," Proc. IEEE Int'l Computer Group Conf., Washington, D.C., June 1970, pp. 213-219.
70. Wakerly, J. F., "Detection of Unidirectional Multiple Errors Using Low Cost Arithmetic Codes," IEEE Trans. Computers, Vol. C-24, No. 2, Feb. 1975, pp. 210-212.
*This digest or proceedings is available from the IEEE Computer Society Publications Office, 5855 Naples Plaza, Suite 301, Long Beach, CA 90803.

D. K. Pradhan is the guest editor of this issue; his biographical sketch appears on p. 7.

J. J. Stiffler is a consulting engineer at the Raytheon Company, Sudbury, Massachusetts. From 1961 to 1967 he was on the technical staff of the Jet Propulsion Laboratory, Pasadena, California. The author of many papers in the field of communications, Stiffler wrote Theory of Synchronous Communications and contributed to two other books. His current interests include the design and analysis of highly reliable data-processing systems. Stiffler received the AB in physics, magna cum laude, from Harvard University in 1956 and the MS in electrical engineering from the California Institute of Technology in 1957. After a year in Paris as a Fulbright scholar, he returned to Caltech, where he received the PhD in 1962. Stiffler is a member of Phi Beta Kappa and Sigma Xi.