IEEE TRANSACTIONS ON COMPUTERS, VOL. c

advertisement
.1. 1*1_-4.11_,;.yl
.7 .
%. P..771" "',
7T
T71T-lw .7ri.w
.F' 7 '
531
IEEE TRANSACTIONS ON COMPUTERS, VOL. c-27, No. 6, JUNE 1978
for e = 3 they are of comparable complexity. For e > 3, [8] C. J. Black, C. E. Sundberg, and W. K. S. Walker, "Development of a
spaceborne memory with a single error and erasure correction
however, the new decoder is evidently considerably more
scheme," in Proc. Int. Conf on Fault-Tolerant Computing, June 1977.
complex and somewhat slower than the altemative device.
Nevertheless, since the point of diminishing retums in
providing bit-line-failure tolerance in conventional memories is usually reached for e = 2 or 3, the decoding
technique described here would appear to merit serious
consideration.
REFERENCES
[1] J. J. Stiffler, N. G. Parke, IV, and P. C. Barr, "The SERF fault-tolerant
computer," in Proc. Int. Symp. on Fault-Tolerant Computing, pp.
23-31, June 20-22, 1973.
[2] K. N. Levitt, M. W. Green, and J. Goldberg, "The data communication problem in a self-repairable microprocessor," in AFIPS, Spring
Joint Comput. Conf., 1968.
[3] D. P. Siewiorek and E. J. McCluskey, "An iteractive cell switch design
for hybrid redundancy," IEEE Trans. Comput., vol. C-22, pp. 290-297,
Mar. 1973.
[4] E. R. Berelkamp, Algebraic Coding Theory. New York: McGraw-Hill,
1968.
[5] W. W. Peterson and E. J. Weldon, Jr., Error-Correcting Codes (second
edition). Cambridge, MA: M.I.T. Press, 1972.
[6] W. C. Carter and C. E. McCarthy, "Implementation of an experimental fault-tolerant memory system," IEEE Trans. Comput., vol. C-25,
pp. 557-568, June 1976.
[7] W. Y. Hsiao, "A class of optimal minimum odd-weight-column
SEC/DEC codes," IBM J. Res. Develop., July 1970.
J. J. Stiffier (M'63-F'76) was born in Mitchellville, IA, on May 22, 1934. He received the A.B.
degree in physics, magna cum laude, from Harvard University, Cambridge, MA, in 1956, and
the M.S. degree in electrical engineering from the
|
tC jF
California Institute of Technology, Pasadena, in
1957. After spending a year in Paris as the recipient of a Fulbright scholarship, he returned to
Caltech where he received the Ph.D. degree in
1962.
From 1961 to 1967 he was a member of the
Technical Staff at the Jet Propulsion Laboratory, Pasadena, CA. In 1967
he joined the Raytheon Company, Sudbury, MA, where he is currently
employed as a Consulting Engineer. He is the author of numerous
papers in the field of communications, the author of the book, Theory of
Synchronous Communications, and a contributing author of two other
books. His current interests include the design and analysis of highly
reliable data processing systems.
Dr. Stiffier is a member of Phi Beta Kappa and Sigma Xi.
Synchronization and Matching in
Redundant Systems
DANIEL DAVIES,
STUDENT MEMBER, IEEE, AND
Abstract-Most of the published work on massive redundancy
makes one crucial assumption: the redundant modules are synchronized. There are three ways of achieving synchronization in
redundant systems-independent accurate clocks, a common external reference, and mutual feedback. The use of a common external
reference is currently the most widely used technique, but suffers
from vulnerability to common-point failures. We introduce a novel
mutual feedback technique, called "synchronization voting," that
does not have this drawback. A practical application of synchronization voting is described in the appendix-a fault-tolerant crystalcontrolled clock.
Most redundant systems make another assumption: in fault-free
operation the outputs are equal (match). However, it is well known
that the outputs of redundant input sensors cannot be guaranteed to
match, even in fault-free operation. They are usually handled by a
Manuscript received August 11, 1977; revised March 27, 1978.
D. Davies is with the Digital Systems Laboratory, Stanford University,
Stanford, CA 94305.
J. F. Wakerly was with the Digital Systems Laboratory, Stanford
University, Stanford, CA 94305. He is now with BNR Inc., Palo Alto, CA.
JOHN F. WAKERLY,
MEMBER, IEEE
median-select or similar signal selection rule, so that redundant
voters can pick a common input value for processing by the rest of the
system. But we show that in the presence of certain sensor failures,
there is no signal selection rule that can pick a common input value,
and we show how this problem can be circumvented by using multiple
levels of voters.
Index Terms-Asynchronous networks, clocks, fault-tolerant
computing, microcomputers, N-modular redundancy (NMR), signal
selection, synchronization, triple modular redundancy (TMR),
voters.
I. INTRODUCTION
M /[ASSIVE redundancy is a well-known method for
improving the reliability of digital systems [3], [7]. In
its most common form, triple modular redundancy (TMR),
three copies of each module are used, and each set of three
redundant outputs is voted upon by majority voters before
being sent to the next level ofmodules. In this way, failures in
any single module or voter are tolerated, since a single
MVs
0018-9340/78/0600-0531$00.75 (D 1978 IEEE
532
IEEE TRANSACTIONS ON COMPUTERS, VOL.
erroneous output is outvoted by the two remaining modules.
Fig. 1 shows a nonredundant system and its TMR counterpart. Some extensions to TMR are N-modular redundancy
(NMR), in which 2f + 1 modules are used along with
(2f + 1)Iinput majority voters to tolerate up toffaults; and
hybrid redundancy, which combines TMR or NMR with a
disagreement detector and spares that can be switched in
when a faulty module is detected [21].
Most of the published work on massive redundancy
makes one crucial assumption: the modules are synchronized so that the voters (simple combinational logic
circuits) can observe the module outputs simultaneously
and produce their majority outputs.' Except for faulttolerant clock designs [11], [13], little work has been done on
methods for reliably providing synchronization.
Another often-made assumption about massive redundant systems is that the outputs of the redundant copies
should always be equal. Although this is normally true, there
are some important situations, mainly with redundant input
sensors, where equality cannot be guaranteed even in the
absence of faults.
In Section II we discuss the general issues in synchronization and matching. There are three possible mechanisms for
synchronization-accurate independent clocks, a common
external reference, and mutual feedback to correct for drift.
In Section III we introduce a novel mutual feedback
mechanism called "synchronization voting." Section IV
describes the problems and solutions in voting on inputs
that can be unequal even in the absence of faults. Finally, the
appendix describes a practical application of synchronization voting-a triplicated fault-tolerant clock.
II. THE GENERAL PROBLEM AND SOME SOLUTIONS
A. Synchronization
As indicated earlier, TMR is applied to a system by
partitioning the nonredundant system into modules, replicating, and voting between modules. The choice of the level
at which voting is applied depends on reliability, cost, and
other factors [4], [10], [18], [20], [29], [30].
The module partition determines the synchronization
requirements of the system. For example, the modules in a
system that applies voting at the register transfer level must
be synchronized at every clock cycle, whereas a system that
applies voting only at the external outputs must be synchronized only when an external output is produced.
We use the term voting cycle to denote the interval of time
between the appearance of successive module outputs that
are voted upon. When voting is applied at the register
transfer level, the voting cycle is identical to the clock cycle.
On the other hand, if voting is applied only to external
outputs, then the voting cycle is the time between the
appearance of successive external outputs.
1 The only references we have found dealing with voting on totally
asynchronQus processes are [16] and [17]. They discuss systems of redundant computers with separate time bases monitoring sensors to perform
real-time flight control. Input and output values are sampled and
produced at different times by each computer, complicating the tasks of
voting and error detection.
c-27, NO. 6,
JUNE
1978
M2
M3
(a)
(b)
Fig. 1. (a) Nonredundant system. (b) TMR version.
Since time is infinitely divisible, it is impossible to perfectly synchronize the voting cycles of two or more redundant modules. In a system composed of redundant modules
(or sets of modules), we use the term synchronization interval
to denote an interval of time during which all modules are in
the "same" voting cycle. Fig. 2 shows the intuitive notions of
voting cycles and synchronization intervals for three
machines.
To formalize this notion, we consider a set of two or more
machines, each having n voting cycles over an interval of
time (T,, T2). We sequentially number the voting cycles of
each machine, beginning with 1. The set of machines is said
to be synchronized if for each i 1 < i < n, there is a nonzero
interval of time (i.e., synchronization interval) during which
each of the machines is in voting cycle i.
Note that the definition does not specify a minimum
duration for synchronization intervals. A practical system
should only be considered to be synchronized if each
synchronization interval has some minimum duration, expressed either as a fixed length of time or as a fraction of the
duration of the voting cycles. The required minimum duration is implementation dependent.
B. Output Matching
The definition of synchronization above only characterizes the time behavior of the modules, making no assumptions about equality or even correspondence of their
outputs. A set of modules is operating correctly only if they
are synchronized and in some sense "matched" during every
synchronization interval.
Whether applied to external outputs or internal state
variables, voting need not be done on a strict bit-by-bit basis.
Instead, it is possible to apply a metric or distance function
to the state variables if they are ordered (as are the real
533
DAVIES AND WAKERLY: SYNCHRONIZATION AND MATCHING IN REDUNDANT SYSTEMS
I
Module 1
-H
voting cycles
Module 2
voting cycles
L
I
1
Module 3
I
Synchronizationl
intervals
1
voting cycles
2
I
1
3
2
2
2
1
4
3
1
1
3
4
3
14
I
Fig. 2. Voting cycles and sytiEhronization intervals.
numbers). Two outputs could be said to match if they are
within a specified distance of each other. Indeed, this type of
voting is essential in a system where redundant modules can
receive close but unequal inputs, such as a real-time control
system with redundant sensors.
For the moment we consider only the case of strict
bit-by-bif-majority voting. Voting in systems where acceptable outputs can take on a range of values will be discussed
in Section IV.
C. Maintaining Synchronization
Synchronization techniques for redundant systems can be
classified into three types.
1) Independent Accurate Time Bases: The machines are
fully decoupled, but can maintain synchronization for a
short time, governed by the accuracy of the individual time
bases.
2) Common External Reference: The machines rely on a
common external time base to maintain synchronization.
3) Mutual Feedback: There is no common time reference.
The machines monitor each other and correct themselves for
drift.
The first method is impractical for most applications,
because the achievable mission times are so short. The
second method is the most widely used, in both proposed
and implemented systems [2], [4], [10], [14], [15], [19], [22],
[26].
Mutual feedback has not been widely studied; yet it is
attractive from several points of view. It has excellent
reliability characteristics because it has no common failure
points. And it can be conveniently implemented using
current technology. We develop a mutual feedback
technique called ".synchronization voting" in the next
section.
III. SYNCHRONIZATION VOTING
A. System Model
The only way to synchronize redundant machines over
long periods of time without an external reference is to
provide mutual feedback to correct for their inevitable drift.
We shall give a model for a simple synchronizer and an
architecture for a system that will operate in the presence of
any f machine and synchronizer failures.
The model considers only the synchronization aspects of
the system; the voters for internal state variables and
external outputs can be ignored. Output matching is discussed in Section IV.
The system model is shown in Fig. 3. As in traditional
NMR, it should be apparent that the system requires at least
Fig. 3. System model.
2f + 1 modules and synchronizers to tolerate anyf module
and synchronizer failures. Each module has an input line
called "Go" and an output line called "READY." The Go input
tells the module that it should begin a voting cycle. The
READY output line indicates that a module has finished the
voting cycle (i.e., produced an output to be voted on) and is
ready to begin the next voting cycle. The voting cycle is the
time between successive Go's. The Go output can be used as a
strobe for data voters on the internal state variables and
external outputs of the module, assuming a sufficient hold
time.
The synchronizers are replicated, one for each module.
Each synchronizer produces a Go output for its corresponding module, based on the READY outputs of all the modules.
We assume that modules are properly isolated so a synchronizer (or voter) failure affects only its corresponding
module. Therefore, a synchronizer or voter failure is equivalent to a failure of the corresponding module [1].
The choice of signaling format (e.g., levels, pulses, or
transitions) for READY and GO is not crucial. Also, the
synchronizer operations discussed below can be implemented in either hardware or software. However, when
synchronization is implemented in software, the READY and
Go lines may only be sampled at discrete intervals. Uncertainty about the times at which the signals are actually
received limits the degree to which the modules may be
synchronized [12].
B. System Operation
The system must meet a number of design objectives:
1) A majority of the modules must be synchronized even
in the presence of up to f faults in modules and synchron-
izers. There are only two restrictions on the allowable failure
modes: a) Each fault must be isolated to a single module or
synchronizer; b) A module fault may not cause a different
interpretation of its READY output by different synchron-
534
IEEE TRANSACTIONS ON
COMPUTERS, VOL. c-27, NO. 6, JUNE 1978
READY 1
READY 2
2f+1
MAJORITY
VOTER
GO
READY 2f+1
Fig 4. Synchrronizer model.
izers. (See Section III-E for a way to remove the second
restriction.)
2) There may be some modules that are habitually slower
than others. These should not be "left behind" nor should
they be required to abort their calculations in order to catch
up.
3) A synchronizer must not send Go until it sees READY
from at leastf + 1 different modules, sincefof them could be
stuck active or produce spurious READY signals.
4) A synchronizer must be able to send GO with up tof
READY's inactive, since up tofmodules can have READY stuck
in the inactive state.
5) The synchronizer may not know the time required for
each voting cycle, though it may know the maximum
difference in time between corresponding voting cycles of
the fastest and slowest modules.
The proposed synchronizer model is very simple- and is
shown in Fig. 4. A majority circuit determines whenf + 1 of
the READY inputs have been received, then delays for a length
of time d, and sends GO.2 Any protocol-related circuitry is
placed on the inputs of the majority circuit and on the
output of the delay. For example, Fig. 5 shows the gate
implementation of a 3-input synchronizer assuming activelow pulses for READY and Go.
For most system implementations, the minimum synchronizer delay d must be greater than the maximum variation in voting cycle times of the replicated modules, plus the
maximum variation in the value of d in the synchronizers.
The delay d ensures that the slowest module has time to
catch up before the next voting cycle begins, but still allows
the system to continue operation if up tofREADY outputs are
stuck inactive.
Due to the asynchronous nature of the system, a faulty
module has control of the majority gates as soon asf READY
signals become active. Giving the sy'nchronizers explicit
knowledge of the maximum variation in the arrival times of
good inputs allows synchronization to be accomplished
with only 2f+ 1 modules.3 This can be contrasted with the
hysteresis voting scheme [11], which does not require this
explicit knowledge but instead requires 3f + 1 modules to
prevent a bad module from having control.
The voting synchronizer has a problem shared by all
asynchronous systems. Flip-flops may exhibit anomalous
behavior when narrow "runt" pulses are received [8], [9],
[25]. Depending on the signal formats and design, a synch2 The same idea was independently used in [26] to synchronize the "bus
reply" line in a triplicated microcomputer. In this application, all three
microcomputers and a single nonredundant synchronizer receive a
common clock.
3 A similar idea was independently used in [13] in the design of a faulttolerant clock. Their design differs in that it contains an unnecessary
output delay.
READY 1
READY 2
READY 3
Fig. 5. Pulse-based synchronizer circuit.
ronizer may generate runt pulses internally. Even if it does
not, as in the design of Fig. 5, a faulty module may apply runt
pulses to the input of the synchronizer. Therefore this failure
mode must be considered in the system design and reliability
analysis. As shown in the above-mentioned references, the
probability of anomalous behavior can be made arbitrarily
small, but never zero.
C. Analysis of Voting Synchronization
Formally, the operation of the system can'be described as
follows. Let c: denote the duration of active voting cycle n as
executed on module i, that is, the time from when it receives
Go until it produces READY. Let d, denote the total delay
from receiving a majority of the READY'S to producing Go in
synchronizer i during voting cycle n. (This is the delay d in
Fig. 4 plus any fixed or variable logic delays.) Finally, let gi
denote the time at which synchronizer i produces GO at the
beginning of active voting cycle n and let re denote the time
at which module i produces READY at the end of active voting
cycle n. Then the time behavior of the system is characterized
by the following equations:
+
e = median (ll`, r12i. r1 )1+)dnl<i<2f+1 (1)
= g + c',
1 . <i 2f+ 1
(2)
Theorem 1: Let C be the maximum difference between any
c aandforanyvalueofn,and let D be the maximum
difference between any d: and d,. The system modeled in Fig.
3 remains synchronized as long as f + 1
module/synchronizer pairs are fault-free and no d7 is less
than C + D.
Proof: The system is said to be synchronized at voting
cycle n if there are at least f + 1 nonfaulty
module/synchronizer pairs (numbered from 1 tof + 1 for
simplicity) producing GO signals such that g;-9 <D for
DAVIES AND WAKERLY: SYNCHRONIZATION AND MATCHING IN REDUNDANT SYSTEMS
535
1 < i, j <f + 1. If the system starts in synchronization, we Assuming that the machine and synchronizer delays take on
can show by induction that it remains in synchronization. their worst values at each i, that is, c3- C = C and
Assuming that the system is synchronized at voting cycle d - d = D, then
n,then Ir -r7I < C + D for I < i,j <f+ 1,duetovaria(11)
g- = (n - 1)(C + D).
tions in c7. Since a particular READY signal appears the same
to all synchronizers even if faulty, the same median value is Obviously, the difference grows without bound and the
obtained in the calculation of each g91 + by thef + 1 good modules cannot remain synchronized.
synchronizers using (1). Hence, the difference between the
nonfaulty GO'S is due only to the variations in d"; therefore, E. Unequal Ready Outputs
:+1_g!+1 D for 1 < i,j <f + 1.
As we have seen above, the system modeled in Fig. 3 may
Q.E.D.
Exercises similar to Theorem 1 can be used to examine the lose synchronization in the presence of a single special fault.
operation of other synchronizer protocols or fault assump- This problem can be overcome by adding another level of
tions. For example, consider a synchronizer that produces synchronizers to eliminate the error in the first level Go's, as
Go as above, or immediately after receiving all the READY'S, shown in Fig. 6.4 In fact, iff module or synchronizer errors
whichever is sooner. This type of operation is described by are allowed in the system, thenf + 1 levels of synchronizers
modifying (1) above:
are needed. This ensures that at least one level will be
fault-free and the next level will be able to synchronentirely
= minimum [median ('1- 1, r21, , A2-11) +dn1,
ize properly. Furthermore, each level must have 2f + 1
maximum (ru'- , rA2-', **,,2-11)]
synchronizers to keep the system from failing if all of the
faults
occur in the same level. As with other NMR systems,
(3) there are
I<i<2f+1.
many patterns of more thanffaults that the system
As before, the time difference in the Go signals is at most D, will tolerate. In particular, the system will tolerate up to f
and the time difference in the READY'S is at most C + D. The faults in each level, as long as there is one level that is entirely
formulation of (3) ensures that no nonfaulty synchronizer fault-free (the last level of synchronizers and the module are
sends Go until all nonfaulty modules have sent READY.
considered to be one level).
It is interesting to note that the level that actually
D. The Need For Equal Ready Outputs
synchronizes the system may contain faulty synchronizers,
We can also use this analysis to show the need to forbid but its good units will have good signals to work with from
faults in which a single module sends different READY signals the fault-free level preceding it. The time difference in the GO
to different modules. Consider a triplicated system in which signals from the synchronizing level will be at most Di, the
the modules and synchronizers have the following time maximum difference in the synchronizer-delays at that level.
behavior:
To prevent any good module from receiving Go before it
sends READY, the minimum synchronizer delay dr'in at each
< 3
C1 <
for all n
level must be greater than sum of the cycle time and delay
d'1 < d' < dn3
for all n.
~
(5) variations at all the levels. Assuming as in Section III-D that
Now suppose that the READY output of module 2 fails such the active voting cycle time variation is C and the synchronthat it appears always active to synchronizer 1, and always izer delay variation at each level is Di, we require
inactive to synchronizer 3. There are a number of ways that
< i <m.
Dj
(12)
diin > C +
such a fault could occur. If the module has separate lines for
1 <j<f+ 1
each READY output, then a double fault on the module The analysis of this system structure is similar to the ones
(considered a single module failure) could obviously cause previously given, and hinges on the fact that one level of
this behavior. But even a single electrical fault could do synchronizers must be fault-free.
it-suppose the READY output is stuck at an intermediate
The sole purpose of the extra levels of synchronizers is to
logic value, and differences in electrical characteristics cause eliminate the effects of a very special, possibly unlikely, type
one module to interpret it as active, the other as inactive.
of fault. In practice, the reliability tradeoffs between providRegardless of the cause, the faulty behavior of the system ing the extra synchronizers and accepting the fault conis characterized by the equations:
sequences should be examined closely.
=
+
median
dn
+
(6)
o,
dl-=
el
(el-', 3-1)
IV. OUTPUT MATCHING
C2
faulty
gn3 = median (rA,- 1, e3- 1 ) + dn3- 1
-=7+c,
1<i<3.
irrelevant,
(7) A. The Problem
r3- 1 + dn3- 1(8)
The need for output matching occurs at the interfaces
(9) between modules in an NMR system, and it may also be
required at the primary inputs. Matching requirements are
Assuming that all machines start in synchronization with
g = gl = g9, it can be shown by induction that
4 Synchronizers must be the type defined by (1) and (2) in the general
_
[(ci -ci ) + (d'i -dy)].
(10) case. If synchronizers defined by (3) and (2) are used, certain additional
1
2
=
since
1<i<n-1
relations between c: and d, at each level must hold.
536
IEEE TRANSACTIONS ON COMPUTERS, VOL.
c-27, NO. 6, JUNE 1978
Fig. 6. System tolerating unequal READY'S.
the same whether the system is synchronized conventionally
or with synchronization voting.
In NMR systems it is usually assumed that all nonfaulty
copies of the state variables and external outputs are equaL
i.e., they match. Matching can be ensured by starting the
machines in the same state and applying equal inputs to each
machine at every cycle. One obvious way to guarantee
equality of inputs is to apply the same physical input signals
to each machine.
For reliability reasons, it may be desirable to use redundant input sensors. Even in the absence of faults, we cannot
guarantee that they will all produce the same value, due to
differences in sampling times, biases, and possible quantization errors (depending on the type of input sensors). Even if
all the sensors receive a common clock and there is no
quantizing, it is impossible to guarantee that they will all
produce the same value. This is true because in general
external events are not synchronized with the system, and so
events can occur arbitrarily close to the sampling pulse.
When this happens, it is possible for some sensors to "catch"
the event, and others to "miss" it until the next sampling
time. How then can we reliably choose among redundant
input values in such a way that each copy of the machine sees
the same inputs?
One way to solve the problem is to use a single voter
circuit to select an input value from the redundant sensors
and send it to all the machines, but then single failures in the
voter are not tolerated. This approach only makes sense if
the voter is much more reliable than the sensors.
The next obvious approach is to replicate the input voters,
providing one for each machine (for example, see [27]).
Presumably, if the voters are not faulty, they will all pick the
same value to pass on to their respective machines. This
eliminates the common point failure problem. However, we
show in Section IV-B that under some single-fault conditions, no voter design can ensure that all machines receive
the same value.
A final approach is to remove the need for equal machine
inputs, requiring only that the inputs are within an acceptable distance of each other under a suitable metric. Each
machine would look only at its own input sensors. In this
case, the state variables and external outputs of the redundant machines cannot be guaranteed to match, even in the
absence of faults. In fact, as shown in [24], the outputs of
machines performing pure integrations will almost certainly
diverge.
In the remainder of this section we discuss the replicated
voters approach. Our object will be to ensure that all good
machine inputs are equal even though the sensor outputs
may be inherently unequal and some sensors and voters may
be faulty.
B. Voting Schemes
We consider voting in two distinct situations:
1) Nonfaulty outputs must be equal.
2) Nonfaulty outputs may be unequal.
The first case is encountered at the interfaces between
modules and calls for traditional bit-by-bit majority voters.
It is well known that by providing 2f + 1 voters and
modules, up to f voter or module failures can be tolerated.
Furthermore, all nonfaulty voters will produce the same
output, even if the faulty modules send different outputs to
different voters. Therefore, we need not consider this case
further.
The second case occurs when voting on values produced
by input sensors. In this case the voter must select from a
range of acceptable input values. If the input values are
ordered, then a median-select rule can be used-the voter
simply selects the input with the median value [6], [17].
Another possibility is to use a weighted-average scheme; for
example, each weight could be assigned according to how
close the sensor output was to the selected value in the
previous cycle [5]. In any of these schemes, replicated voters
will obviously produce equal outputs if they all receive the
same set of inputs from the sensors (i.e., we assume each
sensor, even if faulty, transmits the same value to all voters).
DAVIES AND WAKERLY: SYNCHRONIZATION AND MATCHING IN REDUNDANT SYSTEMS
This, of course, does not mean all sensors must transmit
the same value, simply that all voters see the same set off
values.
Like the READY lines in Section III-E, a sensor may have a
failure mode in which it can send different values to different
voters. If this can happen, then both of the above votingr
schemes above break down. In fact, in the next section we
prove that if different voters can receive different values from
a faulty sensor, no voting scheme can guarantee that all
voters produce the same value.
C. Unequal Outputs from a Faulty Sensor
Any signal selection rule takes a vector V = <v 1, v2, ,
V2f + 1 > of input values from a set R and selects an output
value s(V) in R. As indicated above, under certain failure
modes a faulty sensor could send different values to different
voters, so that two voters could see two input vectors U and
V differing in one component. We would like the signal
selection rule to produce the same output even in the
presence of such a failure. But we show below that there is no
nontrivial rule with this property.
Theorem 2: Let a signal selection function s have the
property that for any two vectors U and V that differ in at
most one component, s(U) = s(V). Then for any vector V,
s(V) = c, where c is a constant value.
Proof: For any pair of vectors U and V, we can
construct a chain of 2fvectors leading from U to V that differ
in one component at each step:
U1 = <VI, u2, U3, * U2f+1>
U2 = <Vl, V2, U3, ,u2f +I>
<y1, v2, *-, V2f, u2f+I>_
Then by applying the hypothesis 2f times, we have
U2f
=
Q.E.D.
s(U) s(U1) = s(U2) = = s(U2f) = s(V) = c.
So there is no voting scheme that can always map slightly
different input vectors to the same output value.
The problem of unequal outputs from a faulty sensor is
analogous to the unequal READY problem discussed in
Section III-E, and the solution is similar to the one offered in
Section Ill-F. If a voter cannot fail in a manner that sends
different outputs to different voters on the next level, then
the solution is simple. A second level of voters is added, and
we-are assured that the nonfaulty second level voters will all
produce the same output value, since they all see the same
input values. Any reasonable signal selection rule can be
used.
If a voter can send unequal outputs to the next level and
the system is to tolerate any f faults, then f + 1 levels of
voters must be provided to guarantee that at least one level
of voters will be entirely fault-free. Furthermore, the signal
selection rule must be median-select or some other rule that
guarantees to pick a value lying in the range specified by the
majority of good values. For example, if the smallest good
value is x and the largest is y, the rule should return a value in
the range [x, y]. This ensures each good selector produces an
=
537
acceptable output. In particular, if a majority ofinput values
are equal, the rule will return that majority value. Once a
fault-free level is encountered, each succeeding level will
have at leastf + 1 outputs identical, and majority selection
will prevail.
V. CONCLUSION
We have offered some new solutions to the problems of
synchronization and matching in massive-redundant
systems. Synchronization voting is a practical alternative to
the existing practice of using a common clock for synchronization in NMR systems. For those who still like common
clocks, the appendix shows a triplicated fault-tolerant
crystal-controlled clock implemented with synchronization
voting. A triplicated microcomputer system with no
common clock is currently being studied.
It is usually assumed that faulty modules in a redundant
system send the same erroneous outputs to all other modules that are listening to them. But we have shown that under
some fault conditions this need not be true. We have shown
how both synchronization voting and conventional signal
selection rules can fail in the presence of these special faults,
and have indicated how to solve the problem by adding
extra levels of voters.
APPENDIX
A TMR CRYSTAL-CONTROLLED CLOCK
This appendix shows the design of a single-fault tolerant
clock. It consists of three independent modules, each containing a crystal, a quad 2-input ECL AND gate, a voltage
comparator, three resistors, and a capacitor. It is selfstarting and requires no precision parts other than the
crystal. A lab model was built and operated at a frequency of
8 MHz.
A crystal may be modeled as shown in Fig. 7. At its
series-resonant frequency, determined by L and C 1, it looks
like a small resistive load (about 20 to 40 Ql for a typical 8
MHz crystal). As the input frequency changes, the crystal
becomes more reactive, and its output is attenuated and
shifted in phase.
A simple 8 MHz crystal oscillator is shown in Fig. 8. An
ECL logic gate [23] acts as an amplifier, being biased into its
linear range by the feedback resistor. The crystal is only
lightly loaded because of the relatively low ECL signal
voltages and currents (-0.8 to - 1.8 V, 300 yA). The output
waveform will have the crystal's period, but its duty cycle
will depend on the relationship between the bias point and
the gate's threshold.
Given three such circuits, we break the feedback path in
each just before the crystal (Fig. 9, the additional circuits will
be explained shortly). If a voter is inserted here with inputs
from the three amplifiers, all three crystals will be excited
with pulses at their median frequency.
In this circuit, the crystals act as very high Q notch filters.
The amplitude and phase of each crystal's output will be
governed by the difference between its resonant (notch)
frequency and the frequency at which it is driven (the median
frequency of all three). While the attenuation of the fastest
538
IEEE TRANSACTIONS ON
COMPUTERS, VOL. C-27, NO. 6, JUNE 1978
co
R
L
C1
Fig. 7. Crystal model.
8.000 MHz
6.8K
Fig. 8. Crystal oscillator.
-CLOCK
3
Fig. 9. Fault-tolerant clock.
and slowest crystal outputs will not generally be significant
after amplification, the phase shift may be.
In terms of the synchronization voting model, each
crystal-amplifier pair is a "module," and each voter is a
synchronizer without its timer. The timer is not needed since
sending a slow crystal its "GO" slightly before it sends
"READY" is permissible here; unlike a processor, a crystal can
be driven to a frequency higher than its natural one. Also,
since a crystal "module" produces no data other than its
"READY," there is no need to wait for data.
When one input to a voter is 1 and the other is 0, the third
directly controls the output. In order to limit the effects of a
single random failure, the difference between the transitions
of good amplifier outputs (i.e., jitter) must be kept as small as
possible. There are three main causes of jitter in our linked
oscillators:
1) Amplifier and voter delays will be different.
2) Depending on gate thresholds, oscillator duty cycles
will be different.
3) Each crystal will have a different phase shift depending
on the difference between its driving and resonant
frequencies.
The sum of the differences of amplifier and voter delays
may bc as much as 5 ns [23]. The oscillator duty cycles are
matched using the voltage comparators shown in Fig. 9. The
voltage comparator in each oscillator constantly adjusts the
gate's bias to maintain a 50 percent duty cycle. There will
still be a difference in duty cycles due to differing offset biases
of the comparators, but the worst case difference is less than
1 ns with the components shown.
Treating the crystal in Fig. 7 as a notch filter, its phase shift
P can be found as a function of the difference df between
resonant and driving frequencies, assuming df is small:
P(df)
,
(df- L)/(7z R).
-
539
DAVIES AND WAKERLY: SYNCHRONIZATION AND MATCHING IN REDUNDANT SYSTEMS
Using typical values of L = 2.5 millihenries, R = 20 Q,
and assuming 8 MHz crystals accurate to 50 ppm (any two
crystals within 100 ppm of each other), we find the maximum
phase shift between two crystal outputs to be 0.07 cycles, or
about 8 ns. Thus, the total jitter due to all three effects may
be as much as 14 ns.
A receiver for the three redundant clock signals consists of
a 3-input majority voter and a filtering circuit to eliminate
spurious voter outputs occurring while the good outputs
differ. Since the jitter time is very short (14 ns maximum), a
simple RC network and Schmitt trigger may be used to
accomplish the filtering. The receiver produces a valid clock
output if any two of the master oscillators are good.
This fault-tolerant clock was built at Stanford and tested
with a totally self-checking clock checker [28]. When any
signal in any of the three oscillators was replaced with a
random pulse train, the worst variation seen in the output of
a clock receiver was 10 ns.
REFERENCES
[1] J. A. Abraham and D. P. Siewiorek, "An algorithm for the accurate
reliability evaluation of triple modular redundancy networks," IEEE
Trans. Comput., vol. C-23, pp. 682-692, July 1974.
[2] Advanced Control, "Advanced control technology and its potential
for future transport aircraft," in Proc. Symp. Advanced Control,
Dryden Flight Res. Cent., Edwards, CA, July 1974.
[3] A. Avizienis, "Fault-tolerant systems," IEEE Trans. Comput., vol.
C-25, pp. 1304-1312, Dec. 1976.
[4] A. Avizienis, G. C. Gilley, F. P. Mathur, D. A. Rennels, J. A. Rohr,
and D. K. Rubin, "The STAR (self-testing and repairing) computer:
An investigation of the theory and practice of fault-tolerant computer
design," IEEE Trans. Comput., vol. C-20, pp. 1312-1321, Nov. 1971.
[5] R. B. Broen, "New voters for redundant systems," J. Dynamic Syst.,
Meas., Contr., pp. 41-45, Mar. 1975.
[6] J. E. Bush, "Realization of a voter/monitor for a digital flight control
system," M.S. thesis, Air Force Institute of Technology, NTIS Doc.
AD-A027 434/OST, June 1976.
[7] W. C. Carter and W. G. Bouricius, "A survey of fault-tolerant
architecture and its evaluation," Computer, vol. 4, pp. 9-16, Jan. 1971.
[8] T. J. Chaney, S. M. Ornstein, and W. M. Littlefield, "Beware the
synchronizer," presented at COMPCON-72, IEEE Comput. Soc.
Conf., San Francisco, CA, Sept. 12-14, 1972.
[9] T. J. Chaney and C. E. Molnar, "Anomalous behavior of synchronizer and arbiter circuits," IEEE Trans. Comput. (Corresp.), vol. C-22,
pp. 421-422, Apr. 1973.
[10] A. E. Cooper and W. T. Chow, "Development of on-board space
computer systems," IBM J. Res. Develop., vol. 20, pp. 20-28, Jan.
1976.
[11] W. M. Daly, A. L. Hopkins, and J. F. McKenna, "A fault-tolerant
digital clocking system," in Dig. 3rd Int. Symp. Fault-Tolerant
Comput., Palo Alto, CA, IEEE Publ. 73CH0772-4C, pp. 17-22, June
1973.
[12] D. Davies, "Reliable synchronization of redundant systems," in Proc.
5th Annu. Symp. Comput. Archit., Palo Alto, CA, Apr. 3-5, 1978.
[13] J. M. de Souza and E. P. Paz, "Fault tolerant digital clocking
system," Electron. Lett., vol. 11, no. 18, pp. 433-434, Sept. 4, 1975.
[14] J. M. de Souza, E. P. Paz, and C. Landrault, "A research oriented
microcomputer with built-in auto-diagnostics," in Proc. 6th Int.
Symp. Fault-Tolerant Computing, Pittsburgh, PA, IEEE Publ.
76CH1094-2C, pp. 3-8, June 1976.
[15] J. B. Dickinson, J. B. Jackson, and G. C. Randa, "Saturn V launch
vehicle digital computer and data adapter," AFIPS ConfJ Proc. 1964
FJCC, vol. 26, Spartan Books, Baltimore, MD, pp. 501-516, 1964.
[16] D. W. Donahey, "Asynchronous comparison monitoring of digital
flight control systems," M.S. thesis, Air Force Institute of Technology, NTIS Doc. AD-A019 812/7ST, Dec. 1975.
[17] L. E. Fairbanks and J. E. Templeman, "Flight-critical digital control
system evaluation," in AIAA Digital Avion. Syst. Conf Rec., AIAA
paper 75-566, Apr. 1975.
[18] B. J. Flehinger, "Reliability improvement through redundancy at var-
[19]
[20]
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
ious system levels," IBM J. Res. Develop., vol. 2, pp. 148-158, Feb.
1958.
A. L. Hopkins and T. B. Smith, III, "The architectural elements of a
symmetric fault-tolerant multiprocessor," IEEE Trans. Comput., vol.
C-24, pp. 498-505, May 1975.
R. E. Lyons and W. Vanderkulk, "The use of triple-modular redundancy to improve computer reliability," IBM J. Res. Develop., vol. 6,
pp. 200-209, Apr. 1962.
F. Mathur and A. Avizienis, "Reliability analysis and architecture of
a hybrid-redundant digital system: Generalized triple modular redundancy with self-repair," in AFIPS Conf. Proc., 1970 SJCC, vol.
36, Spartan Books, Baltimore, MD, pp. 375-383, 1970.
F. B. Moore and J. B. White, "Application of redundancy in the
Saturn V guidance and control system," presented at AIAA Guidance, Control, and Flight Dynamics Conf., Huntsville, AL, Aug. 1415, 1967, NASA Doc. NASA-TM-X-73352, NTIS Doc.
N77-13102/7GI, 1976.
Motorola, MECL Integrated Circuits Data Book, Motorola Inc.,
Phoenix, AZ, 1973.
H. A. Padinha, "Divergence in redundant guidance, navigation, and
control systems," in Proceedings of National Aerospace Meeting, Institute of Navigation (ION), Washington DC, Mar. 13-14, 1973, pp.
59-74.
M. Pechoucek, "Anomalous response times of input synchronizers,"
IEEE Trans. Comput., vol. C-25, pp. 133-139, Feb. 1976.
D. Siewiorek, M. Canepa, and S. Clark, "C.vmp: The architecture
and implementation of a fault-tolerant multiprocessor," in Proc. 7th
Int. Conf. Fault-Tolerant Comput., Los Angeles, CA, IEEE Publ.
77CH1223-7C, pp. 37-43, June 1977.
J. R. Sklaroff, "Redundancy management techniques for space shuttle
computers," IBM J. Res. Develop., vol. 20, pp. 20-28, Jan. 1976.
A. M. Usas, "A totally self-checking checker design for the detection
of errors in periodic signals," IEEE Trans. Comput., vol. C-24, pp.
483-488, May 1975.
J. F. Wakerly, "Transients failures in triple modular redundancy
systems with sequential modules," IEEE Trans. Comput., vol. C-24,
pp. 570-573,. May 1975.
, "Microcomputer reliability improvement using triple modular
redundancy," Proc. IEEE, vol. 64, pp. 889-895, June 1976.
*.
Daniel Davies (S'75) was born on July 30, 1951,
in Detroit, MI. He received the B.S. degree from
The Johns Hopkins University, Baltimore, MD,
in 1973 and the M.S. degree from Stanford Uni-
versity, Stanford, CA, in 1974.
He is currently completing work
torate at Stanford. In 1974, he
on
his doc-
participated in the
development of Stanford's "EMMY" processor.
Mr.
Currently, he is assisting in the installation of a
building automation system for the Stanford
University Medical Center.
Davies is a member of Phi Beta Kappa, Tau Beta Pi, and Eta
Kappa Nu.
Wakerly (S'72-M'74) was born on June
15, 1949, in Chicago, IL. He received the B.EE.
degree from Marquette University, Milwaukee,
John F.
WI, in 1970 and the M.S.E.E. and Ph.D. degrees
from Stanford University, Stanford, CA, in 1971
and 1973, respectively. He was a Hertz Foundation Fellow during his studies at Stanford.
He was Assistant Professor of Electrical Engineering at Stanford University from 1974 to
1976. Since 1976 he has been a member of the
Scientific Staff at BNR, Inc., Palo Alto, CA, and
a lecturer at Stanford. He has published many technical articles in the
areas of computer reliability, microprocessors, and digital systems education. He is the author of two textbooks, Logic Design Projects Using
Standard Integrated Circuits (New York: Wiley, 1976), and ErrorDetecting Codes, Self-Checking Circuits, and Applications (New York:
Elsevier North-Holland, 1978).
Dr. Wakerly is a member of the Association for Computing Machinery,
Sigma Xi, Tau Beta Pi, and Eta Kappa Nu.
Download