Error-Correcting Codes and Self-Checking Circuits

D. K. Pradhan, Oakland University
J. J. Stiffler, Raytheon Company

Error-control coding techniques, implemented by means of self-checking circuits, will improve system reliability.
It is not surprising that error-control coding
techniques have been used in computers for many
years, especially since they have proven effective
against both transient and permanent faults. What is
surprising is that coding techniques have not found
even more extensive use (in commercial computers,
for example), in view of their potential for improving
the overall reliability of computers. In this article,
therefore, we will bring out some of the reasons for
the limited acceptance of error-control coding techniques. In addition, we will examine some code properties, as well as certain implementation techniques,
that might help overcome these currently perceived
limitations.
Coding for error control
Techniques for coding digital information in order
to protect it from errors while it is being stored,
transferred, or otherwise manipulated have been the
subject of intense investigation for several decades.
There is a large body of literature detailing the remarkable theoretical and practical developments resulting from this effort. Although most of this work
has been concerned with the error models and decoding constraints encountered in communications, the
potential utility of these codes in protecting information in computers has long been recognized. Indeed,
such considerations motivated some of the initial
work on error-control codes.
Stated somewhat formally, error-control coding entails mapping elements of a data set X = {xi} onto elements of a code-word set Y = {yi}. The code words thus represent the information to be manipulated but are presumably less vulnerable to errors (induced, for example, by failures in the circuitry used to carry out the manipulation) than the original unencoded data. If F denotes the set of independent faults to which the circuitry in question is subject, and if E is the set of errors that can be produced as a result of these faults, then each e ∈ E is an error that can occur as the result of some fault f ∈ F. And if the distance d(y, y') from y to y' is defined as the minimum number of errors in E needed to change y to y', and the distance d associated with a code Y is defined as the minimum distance from any one of its code words to any other code word, then it is not difficult to see that, subject to some mild conditions (for example, the condition that d(y1, y') + d(y2, y') ≥ min[d(y1, y2), d(y2, y1)]), a distance-d code can be used to detect up to at least d-1 errors; to correct up to (d-1)/2 errors; or to correct all errors if their total number does not exceed some c < (d-1)/2 and otherwise to detect any number of errors in the range c+1 to d-c-1.
In the present context, X generally consists of the 2^k binary k-tuples, and Y is a subset of the 2^n binary n-tuples for some n > k. The most widely postulated class of errors is that in which individual code-word bits are erroneously complemented; that is, E consists of the n errors ei, i = 1, 2, . . . , n, with ei the error caused by complementing the ith bit of any y ∈ Y. The errors encountered in transferring (or transmitting) information from one point to another can often be characterized in this way, as can those observed in retrieving
information from certain storage media. The distance
between two code words in this case is simply the
Hamming distance, i.e., the number of corresponding
bit positions in which they differ. However, the
faults, and hence the errors, in a given situation
strongly depend on the circuitry in question, and the
Hamming distance may or may not be the relevant
measure.
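When the Hamming distance is the relevant measure, the detection and correction limits follow directly from the code's minimum distance. The following sketch computes them; the four-word repetition code used here is purely illustrative.

from itertools import combinations

def hamming_distance(y1, y2):
    # Number of corresponding bit positions in which two words differ.
    return sum(a != b for a, b in zip(y1, y2))

def minimum_distance(code):
    # Minimum distance between any pair of distinct code words.
    return min(hamming_distance(y1, y2) for y1, y2 in combinations(code, 2))

# Two information bits repeated three times: a distance-3 code.
code = ["000000", "010101", "101010", "111111"]
d = minimum_distance(code)
print(d)             # 3
print(d - 1)         # up to d-1 errors are detectable
print((d - 1) // 2)  # up to (d-1)/2 errors are correctable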
As already noted, communications problems have
motivated most of the study of error-control codes.
Although much of the resulting knowledge can also
be exploited in the design of fault-tolerant computers, there are some important differences between
the two applications:
(1) Information is generally transmitted serially
over a communications channel; it is generally
handled in a parallel format in a computer. Consequently, the serial encoding and decoding algorithms developed for communications applications have a much more limited applicability
in computer systems.
(2) The time allowed for encoding and decoding is
generally more constrained in computer applications than in communications. A few milliseconds' or even a few seconds' delay in decoding information received over a communications channel may be entirely acceptable,
whereas even a microsecond's delay in handling critical-path information in a computer
could be intolerable.
(3) The complexity of the encoding and decoding
circuitry is frequently a much more serious
limitation in computer applications than it is in
communications. If a code is sufficiently effective in combatting transmission errors, its use
may be justified, regardless of the complexity
of its associated hardware. But in a computer
the main reason for encoding information is to
protect it against hardware faults. Unless the
hardware needed to generate and check the
code is relatively simple compared to the hardware thus monitored, a fault-prone decoder
could increase rather than decrease the
likelihood of erroneous information propagation.
(4) Anticipated errors in a computer may be different from those in a communications system.
Even when the first-order error statistics are
the same, differences in higher-order statistics
may considerably change the relative effectiveness of various coding techniques in the two applications (see "Coding for RAMs" later in this
article).
(5) Communication codes are designed to protect
information as it is transferred from one place
to another. Although this function is important
in computer applications as well, and is in fact
the function of major interest here, it should be
noted that other constraints on computer error-control codes are also sometimes desirable. In
particular, it is frequently desirable to be able
to protect information not only when it is being
stored or transferred, but also when it is being
modified. Many computer operations entail the
evaluation of functions of two variables: xk = f(xi, xj). If a code is to protect information even while it is being manipulated in this manner, the code must be preserved under such operations; that is, for every function f of interest and for every xi, xj, xk ∈ X, there must exist a function g such that yk = g(yi, yj) whenever xk = f(xi, xj), with yi, yj, and yk representing the encoded versions of xi, xj, and xk; a small illustration follows.
As might be expected, this last constraint can be
severe, particularly when the class of functions f(xi,
xj) is large. Codes that are preserved under certain arithmetic operations have been developed;15 others are preserved under bit-by-bit logical operations;21 but no known class of codes is preserved under all operations usually implemented in a computer. Although the codes available for logical operations can be used for arithmetic operations,13,14 and vice versa,19 such use is likely to be too inefficient to be of
interest. To date, no efficient code useful for both
logical and arithmetic operations is available. To
achieve a breakthrough, then, one has to look into
some new and unconventional coding schemes. Unfortunately, research in this area has been limited,
partly because of certain negative results established
earlier by Elias17 and by Peterson and Rabin,20 in the
area of single-error detection in logical computations.
However, it is important to note that more recent
work by Pradhan and Reddy21 exhibits an efficient
code for multiple-error detection/correction in logical
computations.
With the possible exception of (4), the differences
between communications and computer applications
of error-control coding militate against the latter application. Nevertheless, error-control coding techniques, judiciously applied, can be highly effective in
increasing computer reliability. In many cases, codes
developed primarily for communications applications can be used to advantage in computers, particularly if the peculiarities of the error patterns likely to occur in a computer are fully exploited. A good
example of this is the type of error coding used in
RAMs (discussed in detail later in this article).
The principles of error-correcting codes can also be
used to design fault-tolerant logic. Significant work
in this area has been reported both in the US40-45 and
in the USSR.33-39 This research includes the design of
counters (Reed and Chiang,45 Hsiao, et al.40); synchronous sequential machines (Larsen and Reed,42
Meyer41); and asynchronous sequential machines
(Pradhan,44 Pradhan and Reddy,43 Sagalovich37-39).
The basic technique used in all of these designs incorporates static redundancy as an integral part of
the design. The states of the circuit are encoded into
code words of an error-correcting code (in the case of
synchronous circuits), or its analog (in the case of
asynchronous circuits). The output and next-state
functions are defined so that the effect of any fault
from a prescribed set is masked; i.e., the network produces correct outputs, in spite of the fault. This is accomplished by defining the behavior of the machine
for potentially faulty states so that it corresponds to
that of the correct state.
This coding technique differs from the replication
technique in two respects: the redundancy is implicit,
and there is no explicit error-correcting logic, such as
majority logic. The advantage of coding over replication is that the former may require fewer I/O pins and
chips. This follows from the fact that the number of
state variables required by the coding scheme is
much smaller than that required by replication, and
from the fact that the fault-tolerant circuit that uses
coding can be implemented as a single circuit.
Two important possible applications of this technique are yield enhancement and reliability improvement.30 Yield is improved when chips with e or fewer cell failures are not discarded in acceptance testing, where e is some number less than the error-correction capability, t. The remaining (t-e) error-correction capability is used for reliability improvement.
The principles of coding have also found use in the
design of self-checking circuits,46-49 which are discussed next.
Self-checking circuits
Self-checking circuits, in general, are that class of circuits in which the occurrence of a fault can be determined by observing the outputs of the circuit.
An important subclass of these self-checking circuits
is known as "totally self-checking" circuits. TSC circuits are of special interest and so will be discussed
here in detail.
Loosely speaking, a circuit is TSC if a fault in the
circuit cannot cause an error in the outputs without
detection of the fault. Also, the information regarding the presence of the fault is available in TSC circuits as an integral part of the outputs. These TSC
circuits are particularly useful in reducing or
eliminating the "hard core" of a fault-tolerant system. (A "hard core" is that part of a system that cannot tolerate failures, e.g., decoders for error-correcting codes.) The following is an example of a circuit
which has built-in fault-detection capability.
Through this example, we develop the motivation for
the TSC design, which is discussed next.
The autonomous linear shift register, shown in
Figure 1, is a linear counter and has a cycle length of
15. A special feature of this counter is that it can detect faults in itself on-line. The shift register produces the nonzero code words from a five-bit single-parity code in a cyclic chain. The states of the shift
register are five-bit code words: the first four bits are
information bits, and the last bit is the parity bit.
To begin with, the shift register is initialized to a
nonzero code word in the single-bit parity code; for
each successive shift, it produces a new code word, as
shown in Table 1. After 15 shifts, the shift register
counter starts repeating the cycle.
Any single error in the outputs of the shift register
can be detected by checking the parity. This requires
an addition of extra logic, as shown in Figure 1. The
output of this extra logic will be 0 in the absence of
any errors, and 1 in the presence of a single error.
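A behavioral sketch of such a counter follows. The feedback taps below are an assumption made for illustration (any maximal-length four-bit feedback yields the cycle of length 15); the actual connections are those of Figure 1.

def step(info):
    # One shift of the 4-bit linear feedback shift register (assumed taps).
    new = info[3] ^ info[2]
    return [new] + info[:3]

def with_parity(info):
    # Append the parity bit, making the 5-bit state a code word.
    return info + [info[0] ^ info[1] ^ info[2] ^ info[3]]

def parity_check(state):
    # The extra logic of Figure 1: 0 for a code word, 1 for a single error.
    out = 0
    for bit in state:
        out ^= bit
    return out

info = [1, 0, 0, 0]                      # any nonzero starting state will do
states = []
for shift in range(15):
    state = with_parity(info)
    assert parity_check(state) == 0      # fault-free outputs are code words
    states.append(tuple(state))
    info = step(info)

assert len(set(states)) == 15            # all 15 nonzero code words appear
assert tuple(with_parity(info)) == states[0]   # then the cycle repeats

bad = list(states[0]); bad[2] ^= 1       # any single error in the outputs
assert parity_check(bad) == 1            # is flagged by the parity check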
It can be shown that this design detects all single
faults that may occur on the shift register.58 However, the inherent difficulty in this design is that
faults cannot be allowed to occur in the parity-check
logic because such a fault could prevent the circuit
Figure 1. A linear counter designed for fault-detection.
Table 1.
Shift register contents after different shifts.
(The table lists the counter's fifteen successive five-bit states, for shifts 0 through 14; each is a nonzero code word of the five-bit single-parity code, and shift 15 returns the counter to its starting state.)
from detecting any errors thereafter in the outputs of
the shift register (for example, a stuck-at-0 fault at
the output of this extra logic). Thus, a subsequent
fault in the shift register and the corresponding error
could go unnoticed.
The presence of a hard core in the design is responsible for the difficulty described above. TSC circuits
have been developed to overcome precisely this type
of problem. One fundamental feature of TSC design is
that redundancy is embedded directly into the outputs, making it possible to partition the set of outputs into a set of code-word (valid) outputs, and a set
of non-code-word (invalid) outputs. These code-word
outputs are the set of outputs that normally appear
at the outputs of the network. On the other hand, the
appearance of a non-code word at the outputs of the
network signals the presence of a fault in the network. However, the features that distinguish the
TSC network from other self-checking networks are
its fault-secure46 and self-testing49 properties.
These two properties can best be described in terms of the input/output mapping of the network. Let X and Y represent the sets of inputs and outputs, respectively. Let Y = Y1 ∪ Y2, where Y1 and Y2 represent the code-word (valid) and non-code-word (invalid) outputs, respectively. Let F represent a prescribed set of faults for which the design is TSC. Let y be the correct output for the input x. (Note that x ∈ X and y ∈ Y1.) Let y' be the output of the network for the same input x in the presence of a fault f ∈ F. Fault-secureness guarantees that if y' is a code word, y' ∈ Y1, then y' = y. In other words, the output of the faulty network cannot be a code word and, at the same time, be different from the correct output, y. Thus, as long as the output is a code word, it can safely be assumed to be correct. This is illustrated in Figure 2.

Figure 2. The fault-secure property of TSC networks.

On the other hand, the self-testing property guarantees that for every fault f ∈ F there exists at least one input x ∈ X for which the resulting output is a non-code word, y' ∈ Y2; i.e., input x will result in the signaling of the presence of the fault. In other words, the input set contains at least one test for every fault in the prescribed set; thus, the occurrence of any fault is bound to be detected sometime during the operation (see Figure 3).

Figure 3. The self-testing property of TSC networks.
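Both properties can be verified mechanically for small networks by exhausting the input set X and the fault set F. The sketch below is illustrative only; the one-bit duplication network and its two stuck-at faults are assumptions chosen to keep the example small.

def fault_secure(net, X, F, Y1):
    # For every fault and input, a code-word output must equal the
    # fault-free output.
    return all(net(x, f) == net(x, None) or net(x, f) not in Y1
               for x in X for f in F)

def self_testing(net, X, F, Y1):
    # Every fault must be exposed (non-code output) by at least one input.
    return all(any(net(x, f) not in Y1 for x in X) for f in F)

# Example: a duplication network y = (x, x); code words are (0,0) and (1,1).
Y1 = {(0, 0), (1, 1)}
X = [0, 1]
F = ["line0_stuck_0", "line1_stuck_1"]

def net(x, fault):
    a, b = x, x
    if fault == "line0_stuck_0":
        a = 0
    if fault == "line1_stuck_1":
        b = 1
    return (a, b)

print(fault_secure(net, X, F, Y1))   # True
print(self_testing(net, X, F, Y1))   # True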
While TSC networks possess these interesting
features, they do have some practical limitations. For
one thing, although every fault is detectable sometime during the operation, there is no guarantee that
the first fault will be detected before a second fault occurs. Such a pathological case might be the occurrence of a fault detectable only by a particular input.
The possibility exists that this input can become invalid as a test if a second fault occurs before the input
is actually applied. This has been referred to as the error latency problem.59
Second, the application of TSC circuits has been
limited by the absence of any systematic technique
that realizes the self-testing property in an economical way. For example, although fault-secureness can be readily achieved in asynchronous networks, incorporating the self-testing property can be
too formidable a task.57
However, a class of TSC circuits known as TSC
checkers has a significant potential for large-scale
use in the design of fault-tolerant computers. A TSC
checker46 is a TSC circuit designed to detect errors in
error-detecting codes used in fault-tolerant computers. The most basic application of TSC checkers,
however, is their use in monitoring a general TSC network, as described below.
The outputs of the TSC network are fed to a TSC
checker, designed so that any non-code word at its inputs produces only a non-code word at its outputs.
Thus, by observing the output of the checker, one can
detect any fault in the TSC network or the checker
itself. (However, the checker output does not provide
any information as to the location of the fault, i.e.,
whether the fault is in the TSC circuit or in the
checker itself.)
TSC checkers have two output leads and, hence,
four output combinations: (00, 01, 10, 11). Two combinations are considered valid (code-word) outputs;
they are usually (01, 10). The appearance of an invalid
(non-code-word) output combination indicates either
the presence of an error at the input code word, or a
fault in the checker. The function of TSC checkers
(see Figure 4) is to detect any errors in the input code
word, as well as any faults that may occur in the
checker itself-as long as they do not occur at the
same time. Table 2 presents the network outputs for
four possible cases.
In the first case, the checker is fault-free and the
code word is error-free; the output of the checker is
always one of the valid outputs. In the next case,
there is an error in the input code word, but there are
no faults in the checker. Here, the output will always
be an invalid output, so that the error in the code word
can be detected. In the third case, there is a fault in
the checker, but there is no error in the code word.
Then, the output may be either the correct, valid output, or an invalid output, depending on whether or
not the input code word is a test for the fault in the
checker. Finally, when there is an error in the code
word and a fault in the checker, the output is indeterminate.
As an example, consider a TSC checker design for
single-parity codes, where the prescribed set of faults
is the set of single faults. This design requires the use
of two separate parity checkers. (Figure 4 shows the design for a nine-bit code.)

Table 2.
Outputs for TSC checker.

                        CHECKER: FAULT-FREE    CHECKER: FAULTY
CODE WORD: ERROR-FREE   VALID                  VALID/INVALID
CODE WORD: WITH ERROR   INVALID                INDETERMINATE

The bits in the code word u1, u2, . . . , um, um+1, . . . , un are divided into two groups: u1, u2, . . . , um and um+1, um+2, . . . , un. (For an optimal design, m is equal to n/2.) These two groups then form the inputs to the two different parity-check circuits. The first circuit produces the output g = u1 ⊕ u2 ⊕ . . . ⊕ um, and the second yields the output h = um+1 ⊕ um+2 ⊕ . . . ⊕ un ⊕ 1, the final complementation being performed by an inverter at the circuit's output.
To illustrate that this design is indeed TSC, first let us consider even-parity codes. Since any code word has an even number of 1's, both groups of bits contain either an even or an odd number of 1's. Thus, the valid outputs from the network correspond to (01, 10). On the other hand, in the presence of a single error in the input code word, one of the two groups will have an odd number of 1's, and the other an even number of 1's. So, a single error at the inputs will produce either 11 or 00 as the output. Since single faults can produce an error only at one of the two outputs, this TSC checker is fault-secure.
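A functional sketch of this checker follows; the nine-bit word is an arbitrary even-parity example, and the final inversion on h corresponds to the inverter mentioned below.

def parity(bits):
    p = 0
    for b in bits:
        p ^= b
    return p

def tsc_parity_checker(u):
    m = len(u) // 2
    g = parity(u[:m])        # first parity-check circuit
    h = parity(u[m:]) ^ 1    # second circuit, followed by the inverter
    return (g, h)

VALID = {(0, 1), (1, 0)}

word = [1, 1, 0, 1, 0, 0, 1, 0, 0]       # weight 4: an even-parity code word
assert tsc_parity_checker(word) in VALID

word[3] ^= 1                             # a single error at the inputs
assert tsc_parity_checker(word) not in VALID   # output becomes 00 or 11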
The self-testing property of the checker can be deduced from the following observations: The set of input code words applies all possible input combinations to each of the parity-check circuits. Thus, a fault in one of these parity-check circuits will result in an error at its output sometime during the operation. This will, therefore, be detected as an invalid network output. As an example, consider a stuck-at-1 fault at the output lead of the EX-OR gate shown in Figure 4. This fault is detected by a large number of code words, including the one shown.

Figure 4. A TSC checker for 9-bit single-parity code.
Similarly, in the case of odd-parity codes, the above
design can be modified by deleting the inverter at the
output of h.
Now, consider the design of a linear counter, shown
in Figure 1. This design can be made TSC by replacing the error-detection logic with a TSC checker, as
shown in Figure 5. Although the checker never receives the all-0 code word, it is still self-testing for all
single faults.
A point worth noting is that since all TSC checkers have at least two output leads, there may be the need for some hard-core logic to monitor the checker outputs. However, the value of the TSC checker is its capacity to significantly reduce the hard core in a fault-tolerant system.

Figure 5. A TSC linear counter.

TSC checkers for several other codes, such as constant-weight codes,46,52,55,59-61 Hamming codes,7 and Berger codes,48,55 are available in the literature. The next two sections discuss the use of codes for error control in RAMs and certain integrated circuits.

Coding for RAMs

Errors in binary communications channels generally affect the transmitted bits either independently or in bursts and tend to be symmetric; that is, an error is roughly as likely to convert a 1 to a 0 as conversely. The same can often be said about errors in computers due to bus faults or faults in the storage medium (although other types of errors can also occur; see "Coding for LSI circuits and transient faults" below). Codes designed to combat such faults are sometimes referred to as transfer-error-control codes to distinguish them from codes used to control other types of errors (e.g., arithmetic or logical errors). The minimum Hamming distance between any two distinct words in such a code clearly provides an indication of its effectiveness against independent bit errors. Moreover, if the word "bit" in the previous definition of Hamming distance (see "Coding for error control") is replaced by "symbol," the same measure can also be used to gauge the effectiveness of codes in combatting errors confined to discrete groups of bits2,24 (e.g., byte-oriented errors, with each group of bits treated as a single symbol).

Virtually all useful transfer-error-control coding techniques involve generalizations of the familiar parity-check concept. A simple parity check on the binary digits representing the information (i.e., a (k+1)th bit representing the modulo-two sum of the k
data bits) obviously provides a means of detecting a
change (error) in any single bit, or indeed in any odd
number of bits. More powerful codes (codes having
minimum distances d > 2) are constructed by simply
appending more parity-check bits, with each sum bit
representing the parity of a different subset of the k
information bits. Techniques for constructing error-control codes in this manner are well-documented and need not be discussed further here.2,3
Such codes protect the contents of memory against
hardware malfunctions simply by storing the parity-check bits belonging to each word along with the
word itself. When a word is retrieved from memory,
the parity bits that should be associated with that
word can be redetermined and compared with those
actually retrieved. Any difference between the calculated and retrieved parity bits, called the "error syndrome," indicates the presence of one or more errors.
Moreover, if the number of errors does not exceed the
number that can be corrected, the syndrome uniquely
identifies those bits that are erroneous.
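The procedure can be sketched with the classic (7,4) Hamming code, a distance-3 code used here purely for illustration (the article does not prescribe a particular code):

# Each check bit is the parity of a different subset of the 4 data bits.
CHECKS = [(0, 1, 3), (0, 2, 3), (1, 2, 3)]

def encode(data):
    # 4 data bits -> 7-bit stored word (data followed by check bits).
    return data + [sum(data[i] for i in s) % 2 for s in CHECKS]

def syndrome(word):
    # Difference between recomputed and retrieved check bits.
    data, checks = word[:4], word[4:]
    return tuple((sum(data[i] for i in s) + c) % 2
                 for s, c in zip(CHECKS, checks))

stored = encode([1, 0, 1, 1])
stored[2] ^= 1                 # a single-bit error in the memory word
print(syndrome(stored))        # (0, 1, 1): nonzero, so an error is present

# Each of the seven single-bit errors yields a distinct nonzero syndrome,
# so the erroneous bit can be located and complemented (corrected).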
This procedure, while conceptually straightforward, can be complex to implement when even moderate numbers of errors are to be corrected. The widespread use of multiple-error correcting codes in communication systems has come about largely because
the mathematical structure used to define good errorcontrol codes (the cyclic code structure) can also be
exploited to reduce significantly the complexity of
their decoders. -Unfortunately, the resulting decoders, while well-suited for the serial data flow encountered in communications, would significantly increase the time needed to retrieve a corrected word
from memory. (This is not to say that the cyclic structure imposed on error-control codes cannot be used to
advantage in computer applications; see, for example, Brown and Sellers,27 and Chien.28)
The use of error-control codes to protect the contents of a computer's main memory is therefore
limited by two factors: fast (parallel) decoders tend to
be too complex; simple (serial) decoders tend to be too
slow. As a result, the codes used in main memories
have been restricted to single-error-detection (d = 2);
single-error-correction (d = 3); or single-error-correction, double-error-detection (d = 4) codes; since either
the delay or the cost of the circuitry entailed in correcting more than a single error is generally unacceptable.
There are several ways to correct multiple errors in
memories, however, without incurring either the delays or the complexities usually associated with
multiple-error-correction. These methods, to be effective, must take advantage of the fact that certain error patterns are much more likely than others to be
encountered in random-access memories. The errorpattern distribution' clearly depends on both the
memory technology and organization.
Consider, for example, a RAM in which faults occur
independently, each fault affecting only a single bit
position (but possibly affecting the same bit in every
word). Such error patterns are likely to occur, for example, in plated-wire memories and in N-word x 1-bit
semiconductor memories. One method of handling
this fault pattern is simply to use a d = 2 or d = 3 code
to detect errors as they occur, and then to isolate the
defective bit position and switch in a spare bit line to
replace it on all subsequent stores. This, of course, requires extra hardware to implement the spare bit
lines and the switch needed to make them accessible.
The code is used only to detect errors (or, at most, to
correct errors when the memory's contents are being
recovered following a detected error). The time
needed to detect an error, even when added to the
delay introduced by the switch, can be considerably
less than the time needed to correct even a single error. The number of errors that can be corrected using
this method, so long as they occur singly, is limited
only by the number of spare bit lines. (If errors tend to
occur in clusters, a suitable single-symbol error-detecting or correcting code2 might be used instead of
the single-bit error-detecting or correcting code
assumed here.) If the memory were implemented
with N-word by m-bit semiconductor devices, for example, the symbol alphabet might be defined as the
set of 2m binary m-tuples.
Another method for coping with multiple errors in
random-access memories without introducing excessive decoding delays or excessively complex hardware is to take advantage of the erasure-correcting
capability of transfer-error-control codes.32 A potentially erroneous bit is called an erasure if its location
is known and the only uncertainty is whether or not it
is actually correct. This, of course, is the situation
typically encountered in a RAM once a bit line has
been diagnosed as defective (and is not replaced).
Erasure correction has two major advantages: (1)
Distance-d transfer codes can be used to correct up to
d-1 erasures, as opposed to a maximum of (d-1)/2
errors. (This is easily verified: if two code words are
distance d apart, at least d bits have to be erased in
each before the two words can be identical in their
unerased positions.) And (2) erasures can be corrected
more quickly than errors. It is only necessary to
determine and store the syndromes associated with
the various combinations of erasures to be corrected,
and to compare the calculated syndrome of each word
read from memory with this stored set. The erased
positions containing errors are known as soon as a
match is found. If the calculated syndrome does not
match a stored syndrome (and if it is not the all-O syndrome indicating no errors), a new bit position contains an error. It is then only necessary to determine
the location of that error, identify that position as an
erasure, and augment the set of stored syndromes to
reflect this added erasure. This procedure can continue until the erasure-correction capability of the
code has been exhausted.
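A sketch of this stored-syndrome procedure, reusing the illustrative (7,4) Hamming code from the earlier sketch (d = 3, so up to d-1 = 2 erasures are correctable):

from itertools import chain, combinations

# Column p of H is the syndrome contributed by an error in bit position p.
H_COLS = [(1,1,0), (1,0,1), (0,1,1), (1,1,1), (1,0,0), (0,1,0), (0,0,1)]

def syndrome_of(positions):
    s = (0, 0, 0)
    for p in positions:
        s = tuple(a ^ b for a, b in zip(s, H_COLS[p]))
    return s

def stored_syndromes(erasures):
    # Syndromes of every error pattern confined to the erased positions.
    subsets = chain.from_iterable(combinations(erasures, r)
                                  for r in range(1, len(erasures) + 1))
    return {syndrome_of(sub): sub for sub in subsets}

erasures = [1, 5]                  # bit lines already diagnosed as defective
table = stored_syndromes(erasures)

print(table.get(syndrome_of([1, 5])))   # (1, 5): erased bits found in error
print(table.get(syndrome_of([0])))      # None: an error in a new position

# The new position would now be diagnosed, marked as a further erasure, and
# the stored set augmented; with d = 3, however, the two-erasure capability
# is then exhausted, and a larger-distance code would be needed.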
It is important to reemphasize that the effectiveness both of a code and of the procedure used to
decode it strongly depend on the failure modes of the
hardware. Consider, for example, the reliability of a
32-bit, 4096-word RAM array protected by a
distance-4 code. (Any single-bit error is corrected; if a
second error is detected, the memory is no longer
used.) Suppose the memory array is to be implemented with 1024 x 1 semiconductor chips, each
COMPUTER
having a A = 10-6 failure/hour hazard rate; and suppose all failed chips exhibit one of two symptoms:
either a single cell (bit-storage element) chip fails
without affecting the rest of the device, or the entire
chip fails. Let yA denote the hazard rate associated
with the first type of failure, then, and (1 -y)A the rate
associated with the second type of failure. The probability R(t) that the array is still usable after six
months of operation is plotted in Figure 6. As the plot
shows, the effectiveness of the code is highly dependent on y, the fraction of failures confined to a single
cell. The probability that the code is inadequate (i.e.,
that the array is no longer usable) varies by nearly
three orders of magnitude, from .0056 percent when y
= 1 to 5 percent when y = 0. The conclusion is apparent-a coding scheme may be considerably less effective than expected if the types of failures are considerably different than expected. So unless the likelihood of various types of failures can be reliably
predicted, it is generally better to select as robust a
coding procedure as possible (i.e., one that works well
regardless of the type of failure). In this example, a
multiple-erasure-decoding scheme, in which each
chip is treated as unreliable as soon as it exhibits a
malfunction, might well be preferable.
Figure 6. Code effectiveness as a function of failure mode. (The horizontal axis is γ, the fraction of single-bit failures, running from 0.0 to 1.0.)

Coding for LSI circuits and transient faults

As illustrated by the example in the last section, a code can be effective if there is a good match between the type of errors for which the code is designed and the type of errors that occur. This section focuses on codes specifically designed for a class of errors different from the types discussed so far. This type is the so-called unidirectional error, one that contains either 1-to-0 or 0-to-1 errors, but not both. (There is evidence that unidirectional errors occur in many integrated circuits69; faults such as short-circuit faults are a likely source of these errors.)

Existing codes developed to control unidirectional errors are reviewed here. Their inadequacies are discussed, since these inadequacies have prompted development of new codes68 particularly effective against transient faults.

The various faults in a standard LSI device, the read-only memory, provide a clear illustration of how unidirectional errors are caused in practice. (It is important to note that the following discussion of the sources of unidirectional errors in ROMs is relevant to certain technologies and not to all.)

Unidirectional errors in ROMs have a number of likely sources:

(1) Address decoder. Single and multiple faults in address decoders result in either no access or multiple access.51 No access yields an all-0-word read-out, and multiple access causes the OR of several words to be read out. In both cases, the resulting errors are unidirectional, as they contain either 0-to-1 or 1-to-0 type errors. (In the case of multiple access, when the correct code word is not contained in the accessed set of code words, the error in the correct code word is not necessarily unidirectional. However, this error can be modeled as a unidirectional error in some other code word that is contained in the accessed set. Hence, any unidirectional error-detection scheme can detect the error.)

(2) Word line. An open word line may cause all bits in the word beyond the point of failure to be stuck at 0. On the other hand, two word lines shorted together will form an OR function beyond the point where they are shorted. In either case, the resulting errors are unidirectional.

(3) Power supply. A failure in the power supply usually results in a unidirectional error.

There are two classes of codes that can detect all unidirectional errors: constant-weight codes, which are nonseparable, and Berger codes,63 which are separable. A code is separable if the information contained in any code word is represented directly by the k-bit number in some fixed k positions. In other words, a code C with M code words is separable if, for every i, 0 ≤ i ≤ M-1, there exists a single code word, xi ∈ C, in which the number i appears in some fixed k positions in xi.
On the other hand, the information contained in a
code word in a nonseparable code cannot be obtained
without using a special decoder circuit. The nonseparable codes are, therefore, not useful in most computer applications, such as error-control coding of
operands in arithmetic and logic processors, addresses for data in memory, and horizontal microinstructions in control units.5
The nonseparable m-out-of-n code consists simply of all possible binary n-vectors with m 1's in them. Unidirectional errors result in a change in the number of 1's in the code word; hence, these errors are detected. Because these codes are of the nonseparable type,
they have limited use. However, significant work has
already been performed on the design of TSC
checkers for m-out-of-n codes. (In fact, the availability of efficient TSC checkers56-59 is precisely what
makes these codes of some practical interest.) The
ESS computer is the first known application of
m-out-of-n codes.
In contrast, Berger codes63 are separable and therefore have a much greater potential for application in
fault-tolerant computers. Recently, efficient TSC designs for Berger codes have become available.55 Although these codes have not yet found implementation in fault-tolerant computers, there is significant
potential for their use in both synchronous and asynchronous circuits.57
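Both classes are easy to state precisely. The following sketch (with arbitrary six-bit examples) shows an m-out-of-n validity check and Berger encoding and checking:

import math

def m_out_of_n_valid(word, m):
    # Constant-weight (nonseparable): valid iff the word has exactly m 1's.
    return sum(word) == m

def berger_encode(info):
    # Separable: k information bits followed by a binary count of their 0's.
    k = len(info)
    r = max(1, math.ceil(math.log2(k + 1)))      # number of check bits
    zeros = k - sum(info)
    return info + [(zeros >> i) & 1 for i in reversed(range(r))]

def berger_valid(word, k):
    info, check = word[:k], word[k:]
    value = 0
    for b in check:
        value = (value << 1) | b
    return value == k - sum(info)

assert m_out_of_n_valid([1, 0, 1, 0, 1, 0], m=3)

word = berger_encode([1, 0, 1, 1, 0, 0])   # three 0's -> check bits 011
assert berger_valid(word, k=6)

# A unidirectional error moves the information weight and the check value
# in opposite directions, so it cannot go undetected:
word[1] = 1
word[4] = 1                                # two 0-to-1 errors
assert not berger_valid(word, k=6)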
Error correction is one of the most
effective error-control techniques for
transient faults, since these faults are
often environmentally induced and
hard to diagnose.
It is interesting to note, however, that there are two reasons why neither of the above-described codes may ever find widespread application in fault-tolerant computers: their inability to correct any errors, and their incompatibility with parity-check codes, currently the primary codes used in computers.

Consequently, recent research64,68 has focused on developing codes not only compatible with parity-check codes and able to correct a limited number of errors, but also able to detect certain unidirectional errors. Before we illustrate this further, we will provide the motivation for error correction in the context of transient faults.

Error correction is one of the most effective error-control techniques for transient (or intermittent) faults. These faults constitute many real-time faults and have the following general characteristics:

(1) They are often environmentally induced; the circuits most vulnerable to transient faults are those operating close to their tolerance limits because of aging, overloading, etc. As a result, transient faults are extremely difficult to diagnose during initial (acceptance) testing. Therefore, some form of on-line protection is essential to control errors resulting from these faults.

(2) They cause errors that are often nonrecurring. Therefore, dynamic redundancy techniques such as on-line testing and switching to spares may not be effective, since all that is needed may be to restore the correct information, not to discard physical components.

(3) They produce two types of errors: independent errors or "bursty" errors. The errors caused by a single transient fault are likely to be limited in number if they are independent, and unidirectional if they are bursty.

All these characteristics lead to an error-control approach that may prove to be most effective against transient faults: to use codes that can correct some t random errors, as well as detect all unidirectional errors. Recently, some attempts have been made to construct precisely such codes.64,67,68

An example of a random error-correcting and unidirectional error-detecting code is shown in Table 3. The code, C, is a systematic code and can both correct single errors and detect all unidirectional errors.

It is important to note that this code is a parity-check code as well as a systematic (separable) one. Therefore, it does not have the shortcomings of either Berger codes (which are not parity-check codes) or m-out-of-n codes (which are not separable codes).

The following equations describe the parity-check relationship between the check bits and the information bits:

p1 = u1 ⊕ 1
p2 = u2 ⊕ 1
p3 = u1 ⊕ u2
p4 = u1 ⊕ u2 ⊕ 1

(p1, p2, and p4 are odd parities.)

The equations for the four-bit error syndrome (s1, s2, s3, s4) are as follows:

s1 = p1 ⊕ u1 ⊕ 1
s2 = p2 ⊕ u2 ⊕ 1
s3 = p3 ⊕ u1 ⊕ u2
s4 = p4 ⊕ u1 ⊕ u2 ⊕ 1

Table 4 describes the combinations of syndrome bit patterns and the corresponding error locations.

The encoding and decoding circuit for the code shown in Table 3 can easily be implemented. Note that the code is derived from the maximal-length (7,3) code by code puncturing and expurgation techniques and is also a co-set code.2 Other techniques for constructing such codes can be found in Pradhan.68

The motivations for using certain types of codes in LSI devices, then, are different from those for using certain other codes in communications. We hope that the material here will suggest new research on the construction of unconventional codes that provide error protection more in line with the types of errors encountered in LSI devices.
Table 3.
A random error-correcting and unidirectional error-detecting code.

          INFORMATION BITS    CHECK BITS
          u1    u2            p1    p2    p3    p4
C =       0     0             1     1     0     1
          0     1             1     0     1     0
          1     0             0     1     1     0
          1     1             0     0     0     1
BIT POSITION:  1     2        3     4     5     6
Table 4.
Syndrome bits and error positions.

S1   S2   S3   S4            BIT IN ERROR
0    0    0    0             NONE
1    0    1    1             1
0    1    1    1             2
1    0    0    0             3
0    1    0    0             4
0    0    1    0             5
0    0    0    1             6
ALL OTHER COMBINATIONS       MULTIPLE UNIDIRECTIONAL ERROR DETECTION
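A sketch of encoding and syndrome decoding for this code, following the parity-check and syndrome equations given above (the Python rendering is illustrative):

def encode(u1, u2):
    return [u1, u2,
            u1 ^ 1,          # p1 (odd parity on u1)
            u2 ^ 1,          # p2 (odd parity on u2)
            u1 ^ u2,         # p3
            u1 ^ u2 ^ 1]     # p4 (odd parity on u1, u2)

def syndrome(w):
    u1, u2, p1, p2, p3, p4 = w
    return (p1 ^ u1 ^ 1, p2 ^ u2 ^ 1, p3 ^ u1 ^ u2, p4 ^ u1 ^ u2 ^ 1)

# Table 4 as a dictionary: syndrome pattern -> bit position in error.
TABLE4 = {(0, 0, 0, 0): None, (1, 0, 1, 1): 1, (0, 1, 1, 1): 2,
          (1, 0, 0, 0): 3, (0, 1, 0, 0): 4, (0, 0, 1, 0): 5, (0, 0, 0, 1): 6}

def decode(w):
    s = syndrome(w)
    if s in TABLE4:
        return TABLE4[s]     # no error, or the single bit to complement
    return "multiple unidirectional error detected"

w = encode(1, 0)
w[4] ^= 1                    # single error in bit position 5 (p3)
print(decode(w))             # 5

w = encode(1, 0)
w[1] ^= 1
w[2] ^= 1                    # two 0-to-1 errors (bit positions 2 and 3)
print(decode(w))             # multiple unidirectional error detected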
Error-control coding techniques can be highly effective in improving computer reliability, in spite of the generally tougher constraints imposed on encoding and decoding hardware in this application
than in more conventional communications applications. The effective use of such codes requires both
the code and the decoding procedure to be tailored to
the peculiarities of the hardware.
The benefits of error-control codes are demonstrated in the Fault-Tolerant Spaceborne Computer.4
The FTSC protects all addresses and data by means
of various versions of the same (shortened) cyclic
code. This code is appended to each data word as it
enters the computer (if the word is already encoded,
the code is checked at the entry port) and remains
with that word throughout all computer operations
except those taking place in the central processing
unit (in which case, all operations are performed in
duplicate). The properties of the code used in the
FTSC4 include the
(1) ability to detect bursts of up to eight adjacent
errors (serial data bus);
(2) ability to detect all errors confined to an eight-bit byte (address and data buses);
(3) ability to correct a single erasure and simultaneously detect a second error (memory);
(4) ability to detect a solid burst of errors, i.e., an
error pattern in which a group of contiguous
bits are all in error (address generators in the
direct-memory access units); and
(5) ability to be generated and monitored by a concatenation of identical devices (throughout the
computer; this property can be regarded as a
parallel version of the previously noted shift-register encoding properties of cyclic codes; see the sketch below).
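The shift-register view of cyclic-code encoding can be sketched as follows. The generator polynomial here is a small example chosen for illustration; the FTSC's actual polynomial is not given in this article.

G = [1, 0, 1, 1]         # example generator g(x) = x^3 + x + 1 (r = 3)

def remainder(bits):
    # Division over GF(2) by G, exactly as a shift register performs it.
    reg = list(bits) + [0] * (len(G) - 1)
    for i in range(len(bits)):
        if reg[i]:
            for j, g in enumerate(G):
                reg[i + j] ^= g
    return reg[-(len(G) - 1):]

def encode(data):
    # Append the remainder so the whole word is divisible by g(x).
    return data + remainder(data)

def check(word):
    return remainder(word) == [0] * (len(G) - 1)

word = encode([1, 1, 0, 1, 0, 0, 1, 0])
assert check(word)

word[3] ^= 1
word[4] ^= 1             # a burst of two adjacent errors
assert not check(word)   # detected: any burst of length <= r is caught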
Even this list does not begin to exhaust the ways in
which error-control coding techniques can improve
computer reliability. We predict significantly increased use of these techniques as the properties of
error-control codes become better understood.
Acknowledgment
This work was supported in part by AFOSR Contract No. F49620-79-C-0119.
References and bibliography
General references
1. Ball, M. and F. Hardie, "Effects and Detection of Intermittent Failures in Digital Systems," AFIPS Conf. Proc., Vol. 35, 1969 FJCC, pp. 329-335.
2. Berlekamp, E. R., Algebraic Coding Theory, McGraw-
Hill, New York, 1968.
3. Peterson, W. W. and E. J. Weldon, Error-Correcting
Codes, MIT Press, Cambridge, Mass., 1971.
4. Stiffler, J. J., "Architectural Design for Near-100%
Fault Coverage," Proc. 1976 Int'l Symp. Fault-Tolerant Computing, Pittsburgh, Pa., June 1976, pp.
134-137.*
5. Tanenbaum, A. S., Structured Computer Organization, Prentice-Hall, Englewood Cliffs, N. J., 1976.
6. Tasar, D. and V. Tasar, "A Study of Intermittent
Faults in Digital Computers," AFIPS Conf. Proc. 1977
NCC, pp. 807-811.
7. Wakerly, J. F., Error Detecting Codes, Self-Checking
Circuits and Applications, Elsevier-North Holland,
New York, 1978.
Codes for arithmetic operations
8. Avizienis, A., "Arithmetic Error Codes, Cost and Effectiveness Studies for Application in Digital Systems," IEEE Trans. Computers, C-20, No. 11, Nov.
1971, pp. 1322-1330.
9. Chien, R. T., S. J. Hong, and F. P. Preparata, "Some
Results on the Theory of Arithmetic Codes," Information and Control, Vol. 19, 1971, pp. 246-264.
10. Diamond, J. L., "Checking Codes for Digital Computers," Proc. IRE, Apr. 1955, pp. 487-488.
11. Langdon, G. G., Jr., and C. K. Tang, "Concurrent Error Detection for Group-Carry-Look-Ahead in Binary
Adders," IBM J. Research and Development, Vol. 14,
No. 5, Sept. 1970, pp. 563-573.
12. Massey, J. L., and O. N. Garcia, "Error Correcting
Codes in Computer Arithmetic," Chapter 5, in Advances in Information System Sciences, Vol. 4, pp.
273-326, Plenum Press, New York, 1971.
13. Pradhan, D. K., "Fault-Tolerant Carry Save Adders,"
IEEE Trans. Computers, Vol. C-23, No. 11, Nov. 1974,
pp. 1320-1322.
14. Pradhan, D. K., and L. C. Chang, "Synthesis of Fault-Tolerant Arithmetic and Logic Processors by Using
Nonbinary Codes," Digest of Papers-Fourth Ann.
Int'l Symp. Fault-Tolerant Computing, Urbana, Ill.,
June 1974, pp. 4.22-4.28.*
15. Rao, T. R. N., Error Coding for Arithmetic Processors,
Academic Press, New York, 1974.
Codes for logical operations
16. Eden, M., "A Note on Error Detection in Noisy
Logical Computers," Information and Control, Vol. 2,
Sept. 1959, pp. 310-313.
17. Elias, P., "Computation in the Presence of Noise,"
IBM J. Research and Development, Vol. 2, Oct. 1958,
pp. 346-353.
18. Garcia, O. N. and T. R. N. Rao, "On the Methods of Checking Logical Operations," Proc. 2nd Ann. Princeton Conf. Information Science and Systems, 1968, pp.
89-95.
19. Monteiro, P. M. and T. R. N. Rao, "A Residue Checker
for Arithmetic and Logical Operations," Digest of
Papers-Int'l Symp. Fault-Tolerant Computing,
Boston, Mass., June 1972.*
20. Peterson, W. W. and M. O. Rabin, "On Codes for
Checking Logical Networks," IBM J. Research and
Development, Vol. 3, No. 2, Apr. 1959, pp. 163-168.
21. Pradhan, D. K., and S. M. Reddy, "Error Control Techniques for Logical Processors," IEEE Trans. Computers, Vol. C-21, No. 12, Dec. 1972, pp. 1331-1336.
22. Winograd, S. and J. D. Cowan, Reliable Computation in
the Presence of Noise, M.I.T. Press, Cambridge,
Mass., 1963.
Codes for memory
23. Black, C. J., C. E. Sundberg, and W. K. S. Walker, "Development of a Spaceborne Memory with a Single Error and Erasure Correction Scheme," Proc. Seventh
Ann. Int'l Conf. Fault-Tolerant Computing, Los
Angeles, Calif., June 1977, pp. 50-55.*
24. Bossen, D. C., "b-adjacent Error Correction," IBM J.
Research and Development, Vol. 14, July 1970, pp.
402-408.
25. Bossen, D. C., L. C. Chang, and C. L. Chen, "Measurement and Generation of Error Correcting Codes for
Package Failures," IEEE Trans. Computers, Vol.
C-27, No. 3, Mar. 1978, pp. 201-204.
26. Carter, W. C. and C. E. McCarthy, "Implementation of
an Experimental Fault-Tolerant Memory System,"
IEEE Trans. Computers, Vol. C-25, No. 6, June 1976,
pp. 557-568.
27. Brown, D. T. and F. F. Sellers, Jr., "Error Correction
for IBM 800-bit-per-inch Magnetic Tape," IBM J.
Research and Development, Vol. 14, July 1970, pp.
384-389.
28. Chien, R. T., "Memory Error Control Beyond Parity,"
IEEE Spectrum, July 1973, pp. 18-23.
29. Hsiao, M. Y., "Optimum Odd-weight Column Codes,"
IBM J. Research and Development, Vol. 14, No. 4,
July 1970.
30. Hsiao, M. Y. and D. C. Bossen, "Orthogonal Latin
Square Configuration for LSI Memory Yield and
Reliability Enhancement," IEEE Trans. Computers,
Vol. C-24, No. 5, May 1975, pp. 512-517.
31. Reddy, S. M., "A Class of Linear Codes for Error Control in Byte-per-card Organized Digital Systems,"
IEEE Trans. Computers, Vol. C-27, No. 5, May 1978,
pp. 455-459.
32. Stiffler, J. J., "Coding for Random Access Memories,"
IEEE Trans. Computers, Vol. C-27, No. 6, June 1978,
pp. 526-531.
Fault-tolerant logic using coding
33. Problems of Information Transmission, Translated
from Russian, Vol. 1-4, Faraday Press; Vol. 5, Consultants Bureau, Plenum Publishing Co., New York.
34. Nemsadze, N. I., Problems of Information Transmission, 1969 (No. 1), 1972 (No. 2), Consultants Bureau,
Plenum Publishing Co., New York.
35. Nikaronov, A. A., Problems of Information Transmis-
sion, 1974, (No. 2), Consultants Bureau, Plenum Publishing Co., New York.
36. Nikanorov, A. A. and Y. L. Sagalovich, "Linear Codes
for Automata," Int'l Symp. Design and Maintenance
of Logical Systems, Toulouse, France, Sept. 27-28,
1972.
37. Sagalovich, Y. L., Problems of Information Transmission, 1960 (No. 2), 1965 (No. 2), 1967 (No. 2), 1972
(No.3), 1973 (No. 1), 1976 (No.4), 1978 (No. 2), Faraday
Press and Consultants Bureau, Plenum Publishing
Co., New York.
38. Sagalovich, Y. L., States Coding and Automata
Reliability, Svjas, Moscow, 1975 (in Russian).
39. Sagalovich, Y. L., "Information Theoretical Methods
in the Theory of Reliability for Discrete Automata,"
Proc. 1975 IEEE-USSR Joint Workshop on Information Theory, Dec. 15-19, 1975, Moscow.
40. Hsiao, M. Y., A. M. Patel, and D. K. Pradhan, "Store
Address Generator with Built-in Fault-Detection
Capabilities," IEEE Trans. Computers, Vol. C-26, No.
11, Nov. 1977, pp. 1144-1147.
41. Meyer, J. F., "Fault-Tolerant Sequential Machines,"
IEEE Trans. Computers, Vol. C-20, Oct. 1971, pp.
1167-1177.
42. Larsen, R. W. and I. Reed, "Redundancy by Coding
Versus Redundancy by Replication for Failure-Tolerant Sequential Circuits," IEEE Trans. Computers, Vol. C-21, No. 2, Feb. 1972, pp. 130-137.
43. Pradhan, D. K. and S. M. Reddy, "Fault-Tolerant
Asynchronous Networks," IEEE Trans. Computers,
Vol. C-23, No. 7, July 1974, pp. 651-658.
44. Pradhan, D. K., "Fault-Tolerant Asynchronous Networks Using Read-Only Memories," IEEE Trans.
Computers, Vol. C-27, No. 7, July 1978, pp. 674-679.
45. Reed, I. S. and A. C. L. Chiang, "Coding Techniques
for Failure Tolerant Counters," IEEE Trans. Computers, Vol. C-19, No. 11, Nov. 1970, pp. 1035-1038.
Self-checking circuits
46. Anderson, D. A., "Design of Self-Checking Digital
Networks," CSL Report No. 527, University of Illinois, Urbana, Ill., 1971.
47. Anderson, D. A. and G. Metze, "Design of Totally Self-Checking Circuits for M-out-of-n Codes," IEEE
Trans. Computers, Vol. C-22, No. 3, Mar. 1973, pp.
263-269.
48. Ashjee, M. J. and S. M. Reddy, "On Totally Self-Checking Checkers for Separable Codes," IEEE
Trans. Computers, Vol. C-26, No. 8, Aug. 1977, pp.
737-744.
49. Carter, W. C. and P. R. Schneider, "Design of
Dynamically Checked Computers," Proc. IFIP Congress 68, Vol. 2, Edinburgh, Scotland, pp. 878-883.
50. Carter, W. C., K. A. Duke, and D. C. Jessep, "A Simple
Self-Testing Decoder Checking Circuit," IEEE Trans.
Computers, Vol. C-20, No. 11, Nov. 1971, pp.
1413-1414.
51. Cook, R. W. et al., "Design of Self-Checking Microprogram Controls," IEEE Trans. Computers, Vol. C-22, No. 3, Mar. 1973, pp. 255-262.
52. David, R., "A Totally Self Checking 1-out-of-3 Code," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 570-572.
53. Diaz, M., "Design of Totally Self Checking and Fail Safe Sequential Machines," Digest of Papers-Fourth Ann. Int'l Symp. Fault-Tolerant Computing, Urbana, Ill., June 1974, pp. 3.19-3.24.*
54. Diaz, M. and J. M. Desouza, "Design of Self-Checking Microprogrammed Controls," Digest of Papers-1975 Int'l Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 137-142.*
55. Marouf, M. A. and A. D. Friedman, "Design of Self-Checking Checkers for Berger Codes," Digest of Papers-Eighth Ann. Int'l Conf. Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 179-184.*
56. Marouf, M. A. and A. D. Friedman, "Efficient Design of Self-Checking Checkers for M-Out-of-N Codes," Proc. Seventh Ann. Int'l Conf. Fault-Tolerant Computing, Los Angeles, Calif., June 1977, pp. 143-149.
57. Pradhan, D. K., "Asynchronous State Assignments with Unateness Properties and Fault-Secure Design," IEEE Trans. Computers, Vol. C-27, No. 5, May 1978, pp. 396-404.
58. Pradhan, D. K. et al., "Shift Registers Designed for On-Line Fault-Detection," Digest of Papers-Eighth Ann. Conf. Fault-Tolerant Computing, Toulouse, France, June 1978, pp. 173-178.*
59. Shedletsky, J. J. and E. J. McCluskey, "The Error Latency of a Fault in Combinational Digital Circuits," Digest of Papers-1975 Int'l Symp. Fault-Tolerant Computing, Paris, France, June 1975, pp. 210-214.*
60. Smith, J. E., "The Design of Totally Self-Checking Combinational Circuits," CSL Report No. R-737, University of Illinois, Urbana, Ill., Aug. 1976.
61. Smith, J. E. and G. Metze, "Strongly Fault Secure Logic Networks," IEEE Trans. Computers, Vol. C-27, No. 6, June 1978, pp. 491-499.
62. Wang, S. L. and A. Avizienis, "The Design of Totally Self-Checking Circuits Using Programmable Logic Arrays," Digest of Papers-Ninth Ann. Int'l Symp. Fault-Tolerant Computing, Madison, Wis., June 1979, pp. 173-180.*
Codes for unidirectional errors
63. Berger, J. M., "A Note on Error Correction Codes for
Asymmetric Channels," Information and Control,
Vol. 4, Mar. 1961, pp. 68-73.
64. Bose, B., "Theory and Design of Unidirectional Error
Codes," PhD Dissertation, Computer Science and
Engineering Dept., Southern Methodist University,
Dallas, Tex., in progress.
65. Frieman, C. V., "Protective Block Codes for Asymmetric Binary Channels," PhD Dissertation, Columbia University, New York, May 1961.
66. Parhami, B. and A. Avizienis, "Detection of Storage Errors in Mass Memories Using Low-Cost Arithmetic Codes,"
IEEE Trans. Computers, Vol. C-27, No. 4, Apr. 1978,
pp. 302-308.
67. Pradhan, D. K. and S. M. Reddy, "Fault-Tolerant Fail-Safe Logic Networks," Proc. COMPCON Spring 77,
pp. 361-363.
68. Pradhan, D. K., "A New Class of Error-Correcting/Detecting Codes for Fault-Tolerant Computer Applications," to appear in IEEE Trans. Computers,
Special Issue on Fault-Tolerant Computing, Vol. C-29,
No. 6, June 1980.
69. Sahani, R. M., "Reliability of Integrated Circuits,"
Proc. IEEE Int'l Computer Group Conf., Washington
D. C., June 1970, pp. 213-219.
70. Wakerly, J. F., "Detection of Unidirectional Multiple
Errors Using Low Cost Arithmetic Codes," IEEE
Trans. Computers, Vol. C-24, No. 2, Feb. 1975, pp.
210-212.
*This digest or proceedings is available from the IEEE Computer
Society Publications Office, 5855 Naples Plaza, Suite 301, Long
Beach, CA 90803.
D. K. Pradhan is the guest editor of this issue; his
biographical sketch appears on p. 7.
J. J. Stiffler is a consulting engineer at the Raytheon Company, Sudbury, Massachusetts. From 1961 to 1967 he was on the technical staff of the Jet Propulsion Laboratory, Pasadena, California. The author of many papers in the field of communications, Stiffler wrote Theory of Synchronous Communications and contributed to two other books. His current interests include the design and analysis of highly reliable data-processing systems.
Stiffler received the AB in physics, magna cum laude,
from Harvard University in 1956 and the MS in electrical
engineering from the California Institute of Technology in
1957. After a year in Paris as a Fulbright scholar, he returned to Caltech, where he received the PhD in 1962. Stiffler is a member of Phi Beta Kappa and Sigma Xi.