Coding Approaches to Fault Tolerance in
Combinational and Dynamic Systems
THE KLUWER INTERNATIONAL SERIES
IN ENGINEERING AND COMPUTER SCIENCE
CODING APPROACHES TO
FAULT TOLERANCE IN
COMBINATIONAL AND
DYNAMIC SYSTEMS
CHRISTOFOROS N. HADJICOSTIS
Coordinated Science Laboratory and
Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign
"
~.
SPRINGER SCIENCE+BUSINESS MEDIA, LLC
ISBN 978-1-4613-5271-6
ISBN 978-1-4615-0853-3 (eBook)
DOI 10.1007/978-1-4615-0853-3
Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available
from the Library of Congress.
Copyright © 2002 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2002
Softcover reprint of the hardcover 1st edition 2002
All rights reserved. No part of this publication may be reproduced, stored in a
retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the
publisher, Springer Science+Business Media, LLC.
Printed on acid-free paper.
To Pani
Contents

List of Figures
List of Tables
Foreword
Preface
Acknowledgments

1. INTRODUCTION
   1  Definitions, Motivation and Background
   2  Fault-Tolerant Combinational Systems
      2.1  Reliable Combinational Systems
      2.2  Minimizing Redundant Hardware
   3  Fault-Tolerant Dynamic Systems
      3.1  Redundant Implementations
      3.2  Faults in the Error-Correcting Mechanism
   4  Coding Techniques for Fault Diagnosis

Part I  Fault-Tolerant Combinational Systems
2. RELIABLE COMBINATIONAL SYSTEMS OUT OF UNRELIABLE COMPONENTS
   1  Introduction
   2  Computational Models for Combinational Systems
   3  Von Neumann's Approach to Fault Tolerance
   4  Extensions of Von Neumann's Approach
      4.1  Maximum Tolerable Noise for 3-Input Gates
      4.2  Maximum Tolerable Noise for u-Input Gates
   5  Related Work and Further Reading
3. ABFT FOR COMBINATIONAL SYSTEMS
   1  Introduction
   2  Arithmetic Codes
   3  Algorithm-Based Fault Tolerance
   4  Generalizations of Arithmetic Coding to Operations with Algebraic Structure
      4.1  Fault Tolerance for Abelian Group Operations
           4.1.1  Use of Group Homomorphisms
           4.1.2  Error Detection and Correction
           4.1.3  Separate Group Codes
      4.2  Fault Tolerance for Semigroup Operations
           4.2.1  Use of Semigroup Homomorphisms
           4.2.2  Error Detection and Correction
           4.2.3  Separate Semigroup Codes
      4.3  Extensions

Part II  Fault-Tolerant Dynamic Systems
4. REDUNDANT IMPLEMENTATIONS OF ALGEBRAIC MACHINES
   1  Introduction
   2  Algebraic Machines: Definitions and Decompositions
   3  Redundant Implementations of Group Machines
      3.1  Separate Monitors for Group Machines
      3.2  Non-Separate Redundant Implementations for Group Machines
   4  Redundant Implementations of Semigroup Machines
      4.1  Separate Monitors for Reset-Identity Machines
      4.2  Non-Separate Redundant Implementations for Reset-Identity Machines
   5  Summary

5. REDUNDANT IMPLEMENTATIONS OF DISCRETE-TIME LTI DYNAMIC SYSTEMS
   1  Introduction
   2  Discrete-Time LTI Dynamic Systems
   3  Characterization of Redundant Implementations
   4  Hardware Implementation and Fault Model
   5  Examples of Fault-Tolerant Systems
   6  Summary
6. REDUNDANT IMPLEMENTATIONS OF LINEAR FINITE-STATE MACHINES
   1  Introduction
   2  Linear Finite-State Machines
   3  Characterization of Redundant Implementations
   4  Examples of Fault-Tolerant Systems
   5  Hardware Minimization in Redundant LFSM Implementations
   6  Summary
7. UNRELIABLE ERROR CORRECTION IN DYNAMIC SYSTEMS
   1  Introduction
   2  Fault Model for Dynamic Systems
   3  Reliable Dynamic Systems using Distributed Voting Schemes
   4  Reliable Linear Finite-State Machines
      4.1  Low-Density Parity Check Codes and Stable Memories
      4.2  Reliable Linear Finite-State Machines using Constant Redundancy
   5  Other Issues
8. CODING APPROACHES FOR FAULT DETECTION AND IDENTIFICATION IN DISCRETE EVENT SYSTEMS
   1  Introduction
   2  Petri Net Models of Discrete Event Systems
   3  Fault Models for Petri Nets
   4  Separate Monitoring Schemes
      4.1  Separate Redundant Petri Net Implementations
      4.2  Fault Detection and Identification
   5  Non-Separate Monitoring Schemes
      5.1  Non-Separate Redundant Petri Net Implementations
      5.2  Fault Detection and Identification
   6  Applications in Control
      6.1  Monitoring Active Transitions
      6.2  Detecting Illegal Transitions
   7  Summary

9. CONCLUDING REMARKS
   1  Summary
   2  Future Research Directions

About the Author

Index
List of Figures

1.1   Triple modular redundancy.
1.2   Fault-tolerant combinational system.
1.3   Triple modular redundancy with correcting feedback.
1.4   Fault-tolerant dynamic system.
2.1   Error correction using a "restoring organ."
2.2   Plots of functions f(q) and g(q) for two different values of p.
2.3   Two successive restoring iterations in von Neumann's construction for fault tolerance.
3.1   Arithmetic coding scheme for protecting binary operations.
3.2   aN arithmetic coding scheme for protecting integer addition.
3.3   ABFT scheme for protecting matrix multiplication.
3.4   Fault-tolerant computation of a group operation.
3.5   Fault tolerance using an abelian group homomorphism.
3.6   Coset-based error detection and correction.
3.7   Separate arithmetic coding scheme for protecting integer addition.
3.8   Separate coding scheme for protecting a group operation.
3.9   Partitioning of semigroup (N, ×) into congruence classes.
4.1   Series-parallel decomposition of a group machine.
4.2   Redundant implementation of a group machine.
4.3   Separate redundant implementation of a group machine.
4.4   Relationship between a separate monitor and a decomposed group machine.
5.1   Delay-adder-gain implementation and the corresponding signal flow graph for an LTI dynamic system.
5.2   State evolution equation and hardware implementation of the digital filter in Example 5.2.
5.3   Redundant implementation based on a checksum condition.
5.4   Second redundant implementation based on a checksum condition.
6.1   Hardware implementation of the linear feedback shift register in Example 6.1.
6.2   Different implementations of a convolutional encoder.
7.1   Reliable state evolution subject to faults in the error corrector.
7.2   Modular redundancy with distributed voting scheme.
7.3   Hardware implementation of Gallager's modified iterative decoding scheme for LDPC codes.
7.4   Replacing k LFSM's with n redundant LFSM's.
7.A.1 Encoded implementation of k LFSM's using n redundant LFSM's.
8.1   Petri net with three places and three transitions.
8.2   Cat-and-mouse maze.
8.3   Petri net model of a distributed processing system.
8.4   Concurrent monitoring scheme using a separate Petri net implementation.
8.5   Example of a separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1.
8.6   Example of a separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1.
8.7   Example of a separate redundant Petri net implementation that identifies single transition or single place faults in the Petri net of Figure 8.1.
8.8   Concurrent monitoring scheme using a non-separate Petri net implementation.
8.9   Example of a non-separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1.
8.10  Example of a non-separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1.
8.11  Example of a separate redundant Petri net implementation that enhances control in the Petri net of Figure 8.3.
List of Tables

2.1  Input-output table for the 3-input XNAND gate.
5.1  Syndrome-based error detection and identification in Example 5.1.
Foreword
Fault tolerance requires redundancy, but redundancy comes at a price. At one
extreme of redundancy, fault tolerance may involve running several complete
and independent replicas of the desired process; discrepancies then indicate
faults, and the majority result is taken as correct. More modest levels of redundancy - for instance, adding parity check bits to the operands of a computation
- can still be very effective, but need to be more carefully designed, so as to
ensure that the redundancy conforms appropriately to the particular characteristics of the computation or process involved. The latter challenge is the focus
of this book, which has grown out of the author's graduate theses at MIT.
The original stimulus for the approach taken here comes from the work of
Beckmann and Musicus, developed in Beckmann's 1992 doctoral thesis, also at
MIT. That work focused on computations having group structure. The essential
idea was to map the group in which the computation occurred to a larger group
via a homomorphism, thereby preserving the structure of the computation while
introducing the necessary redundancy. Hadjicostis has significantly expanded
the setting to processes occurring in more general algebraic and dynamic systems.
For combinational (i.e., memoryless) systems, this book shows how to recognize and exploit system structure in a way that leads to resource-efficient
arithmetic coding and "ABFT" (algorithm-based fault-tolerant) schemes, and
characterizes separate (parity-type) codes. These results are then extended to
dynamic systems, providing a unified system-theoretic framework that makes
connections with traditional error correcting methodologies for communication systems, allows coding techniques to be studied in conjunction with the
dynamics of the process that is being protected, and enables the development
of fault-tolerance techniques that can account for faults in the error corrector itself. Numerous examples throughout the book illustrate how the framework and methodology translate to particular situations of interest, providing a
parametrization of the range of possibilities for redundant implementation, and
allowing one to examine features of and trade-offs among different possibilities
and realizations.
The book responds to the growing need to handle faults in complex digital
chips and complex networked systems, and to consider the effects of faults
at the design stage rather than afterwards. I believe that the approach taken
by the author points the way to addressing such needs in a systematic and
fruitful fashion. The material here should be of interest to both researchers and
practitioners in the area of fault tolerance.
George Verghese
Massachusetts Institute of Technology
Preface
As the complexity of systems and networks grows, the likelihood of faults in
certain components or communication links increases significantly and the consequences become highly unpredictable and severe. Even within a single digital
device, the reduction of voltages and capacitances, the shrinking of transistor
sizes and the sheer number of gates involved have led to a significant increase
in the frequency of so-called "soft-errors," and have prompted leading semiconductor manufacturers to admit that they may be facing difficult challenges in
the future. The occurrence of faults becomes a major concern when the systems
involved are life-critical (such as military, transportation or medical systems),
or operate in remote or inaccessible environments (where repair may be difficult
or even impossible).
A fault-tolerant system is able to tolerate internal faults and preserve desirable
overall behavior and output. A necessary condition for a system to be fault-tolerant is that it exhibit redundancy, which enables it to distinguish between
correct and incorrect results or between valid and invalid states. Redundancy is
expensive and runs counter to the traditional goals of system design; thus,
the success of a fault-tolerance design relies on making efficient use of hardware
by adding redundancy in those parts of the system that are more liable to faults
than others. Traditionally, the design of fault-tolerant systems has considered
two quite distinct fault models: one model constructs reliable systems out of
unreliable components (all of which may suffer faults with a certain probability)
whereas the other model focuses on detecting and correcting a fixed number
of faults (aiming at minimizing the required hardware). This book addresses
both of these fault models and describes coding approaches that can be used
to exploit the algorithmic/evolutionary structure in a particular combinational
or dynamic system in order to avoid excessive use of redundancy. The book
has grown out of thesis work at the Massachusetts Institute of Technology and
research at the University of Illinois at Urbana-Champaign.
Chapters 2 and 3 describe coding approaches for designing fault-tolerant
combinational systems, i.e., systems with no internal memory that perform a
static function evaluation on their inputs. Chapter 2 reviews von Neumann's
work on "Probabilistic Logics and the Synthesis of Reliable Organisms from
Unreliable Components," which is one of the first systematic approaches to
fault tolerance. Subsequent related results on combinational circuits that are
constructed as interconnections of unreliable ("noisy") gates are also discussed.
In these approaches, a combinational system is built out of components (e.g.,
gates) that suffer transient faults with constant probability; the goal is to assemble these unreliable components in a way that introduces "structured" redundancy and ensures that, with high probability, the overall functionality is the
correct one.
Chapter 3 describes a distinctly different approach to fault tolerance, which
aims at protecting a given combinational system against a pre-specified number of component faults. Such designs become preferable once system
components are fairly reliable; they generally aim at using a minimal amount
of structured redundancy to achieve detection and correction of a pre-specified
number of faults. As explained in Chapter 3, coding techniques are particularly
successful for arithmetic and linear operations; extensions of these techniques
to operations with group or semigroup structure are also discussed.
The remainder of the book focuses on fault tolerance in dynamic systems,
such as finite-state controllers or computer simulations, whose internal state
influences their future behavior. Modular redundancy (system replication) and
other traditional techniques for fault tolerance are expensive, and rely heavily
- particularly in the case of dynamic systems operating over extended time
horizons - on the assumption that the error-correcting mechanism does not
fail. The book describes a systematic methodology for adding structured redundancy to a dynamic system, exposing a wide range of possibilities between
no redundancy and full replication. These possibilities can be parameterized in
various settings, including algebraic machines (Chapter 4) and linear dynamic
systems (Chapters 5 and 6). By adopting specific fault models and, in some
cases, by making explicit connections with hardware implementations, the exposition in these chapters describes resource-efficient designs for redundant
dynamic systems. Optimization criteria for choosing among different redundant implementations are not explicitly addressed; several examples, however,
illustrate how such criteria can be posed and investigated.
Chapter 7 relaxes the traditional assumption that the error-correcting mechanism does not fail. The basic idea is to use a distributed error-correcting mechanism so that the effects of faults are dispersed within the redundant system in
a non-devastating fashion. As discussed in Chapter 7, one can employ these
techniques to obtain a variant of modular redundancy that uses unreliable system replicas and unreliable voters to construct redundant dynamic systems that
evolve in time with a low probability of failure. By combining these techniques
with low-complexity error-correcting coding, one can efficiently protect identical unreliable linear finite-state machines that operate in parallel on distinct
input sequences. The approach requires only a constant amount of redundant
hardware per machine to achieve a probability of failure that remains below any
pre-specified bound over any given finite time interval.
Chapter 8 applies coding techniques in other contexts. In particular, it
presents a methodology for diagnosing faults in discrete event systems that are
described by Petri net models. The method is based on embedding the given
Petri net model in a larger Petri net that retains the functionality and properties
of the given one, while introducing redundancy in a way that facilitates error
detection and identification.
Chapter 9 concludes with a look into emerging research directions in the
areas of fault tolerance, reliable system design and fault diagnosis. Unlike
traditional methodologies, which add error detecting and correcting capabilities on top of existing, non-redundant systems, the methodology developed in
this book simultaneously considers the design for fault tolerance together with
the implementation of a given system. This comprehensive approach to fault
tolerance allows the study of a larger class of redundant implementations and
can be used to better understand fundamental limitations in terms of system-,
coding- and information-theoretic constraints. Future work should also focus
on the implications of redundancy on the speed and power efficiency of digital
systems, and also on the development of systematic ways to trade off various system parameters of interest, such as redundant hardware, fault coverage,
detection/correction complexity and delay.
Christoforos N. Hadjicostis
Urbana, Illinois
Acknowledgments
This book has grown out of research work at the Massachusetts Institute of
Technology and the University of Illinois at Urbana-Champaign. There are
many colleagues and friends that have been extremely generous with their help
and advice during these years, and to whom I am indebted.
I am very thankful to many members of the faculty at MIT for their involvement and contribution to my graduate research. In particular, I would like to
express my most sincere thanks to George Verghese for his inspiring guidance,
and to Alan Oppenheim and Greg Wornell for their support during my tenure
at the Digital Signal Processing Group. Also, the discussions that I had with
Sanjoy Mitter, Alex Megretski, Bob Gallager, David Forney and Srinivas Devadas were thought-provoking and helpful in defining my research direction; I
am very thankful to all of them.
I am also grateful to many members of the faculty at UIUC for their warm
support during these first few years. In particular, I would like to thank Steve
Kang and Dick Blahut, who served as heads of the Department of Electrical and
Computer Engineering, Ravi Iyer, the director of the Coordinated Science Laboratory, and Tamer Başar, the director of the Decision and Control Laboratory,
whose advice and direction have been a tremendous motivation for writing this
book.
I would also like to thank my many friends and colleagues who made academic life at MIT and at UIUC both enjoyable and productive. Special thanks
go to Carl Livadas, Babis Papadopoulos and John Apostolopoulos, who were
a great source of advice during my graduate studies. At UIUC, Andy Singer,
Francesco Bullo and Petros Voulgaris were encouraging and always willing to
help in any way they could. Becky Lonberger, Francie Bridges, Darla Chupp,
Vivian Mizuno, Maggie Beucler, Janice Zaganjori and Sally Bemus made life a
lot simpler by meticulously taking care of administrative matters. I would also
like to thank Eleftheria Athanasopoulou, Boon Pang Lim and Yingquan Wu for
proof-reading portions of this book.
I am very grateful to many research agencies and companies that have supported my work as a graduate student and as a research professor. These include the Defense Advanced Research Projects Agency for support under the
Rapid Prototyping of Application Specific Signal Processors project, the Electric Power Research Institute and the Department of Defense for support under the Complex Interactive Networks/Systems Initiative, the National Science
Foundation for support under the Information Technology Research and Career
programs, the Air Force Office of Scientific Research for support under their
University Research Initiative, the UIUC Campus Research Board, the National
Semiconductor Corporation, the Grass Instrument Company and Motorola.
Finally, I am extremely thankful to Jennifer Evans and Kluwer Academic Publishers for encouraging me to make these ideas more widely available
through the publication of this book.
Chapter 1
INTRODUCTION
1  DEFINITIONS, MOTIVATION AND BACKGROUND
Modern digital systems are subject to a variety of potential faults that can
corrupt their output and degrade their performance [Johnson, 1989; Pradhan,
1996; Siewiorek and Swarz, 1998]. In this context, a fault is a deviation of
a given system from its required or expected behavior. The more complex
a computational system is, or the longer an algorithm runs, the higher
the risk of a hardware malfunction that renders the overall functionality of the
system useless. Depending on the duration of faults, two broad classes are
defined [Johnson, 1989]: (i) Permanent faults manifest themselves in a consistent manner and include design or software errors, manufacturing defects, or
irreversible physical damage. (ii) Transient faults do not appear on a consistent
basis and only manifest themselves in a certain portion of system invocations;
transient faults could be due to noise, such as absorption of alpha particles and
electromagnetic interference, or environmental factors, such as overheating.
An error is the manifestation of a fault and may lead to an overall failure
in the system [Johnson, 1989]. A fault-tolerant system is one that tolerates
internal faults and prevents them from unacceptably corrupting its overall task,
output or final result [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz,
1998]. Concurrent error masking, that is, detection and correction of errors
concurrently with system operation, is one of the most desirable forms of fault
tolerance because no degradation in the overall performance of the system takes
place; at the same time, however, concurrent error masking usually implies a
large overhead in terms of error-detecting and correcting operations.
Fault tolerance is motivated primarily by applications that require high reliability (such as medical, military or transportation systems), or by systems
that operate in remote locations where repair may be difficult or even impossible (as in the case of space missions, hazardous environments and remote
sensors) [Pradhan, 1996; Avizienis, 1997]. In addition, fault tolerance can
relax design/manufacturing specifications leading, for example, to yield enhancement in integrated circuits [Koren and Singh, 1990; Peercy and Banerjee,
1993; Leveugle et al., 1994]. As the complexity of computational and signal processing systems increases, their vulnerability to faults becomes higher,
making fault tolerance necessary rather than simply desirable [Redinbo, 1987].
The current trends towards higher clock speeds, lower power consumption and
smaller transistor sizes aggravate this problem even more and lead to a significant increase in the frequency of so-called "soft-errors."
For the reasons mentioned above, fault tolerance has been addressed in a
variety of settings. The most systematic treatment has been for the case of
reliable digital transmissions through unreliable ("noisy") communication links.
Shannon's seminal work in [Shannon, 1948a; Shannon, 1948b] demonstrated
that error-correcting coding techniques can effectively and efficiently protect
against noise in digital communication systems. More specifically, it showed
that, contrary to the common perception of that time, the employment of coding
techniques can enable reliable transmission of digital messages using only a
constant amount of redundancy per bit. This result led to the birth of information
and coding theory [Gallager, 1968; Cover and Thomas, 1999; Peterson and
Weldon Jr., 1972; Blahut, 1983; Wicker, 1995].
Following the success of error-correcting coding in digital communication
systems, Shannon and other researchers applied similar techniques to protect
digital circuits against hardware faults (see for example [Elias, 1958; Winograd
and Cowan, 1963; Taylor, 1968b; Larsen and Reed, 1972] and the exposition
in [Rao and Fujiwara, 1989]). More recently, related techniques were applied
at a higher level to protect special-purpose systems against a fixed number
of "functional" faults, which could be hardware, software or other. These
ideas were introduced within the context of algorithm-based fault tolerance
[Huang and Abraham, 1984; Beckmann and Musicus, 1993; Roy-Chowdhury
and Banerjee, 1996].
The development of an appropriate fault model is a significant aspect of all
designs for fault tolerance. The fault model describes the consequences of each
fault on the state or output of a system, effectively abstracting the cause of a
fault and allowing the mathematical study of fault tolerance. For example, in
Shannon's work the effect of "noise" in a digital communication channel is
captured by the probability that a particular bit gets transmitted erroneously
(i.e., its binary value is flipped). Similarly, the corruption of a single bit in
the digital representation of the output/state of a system is commonly used to
model the effect of faults in digital systems. Note that the fault model does
not have to mimic the actual fault mechanism; for example, one can model
the error due to a fault in a multiplier as additive or the error due to a fault
in an adder as multiplicative.¹ Efficient fault models need to be close to reality, yet simple enough to allow algebraic or algorithmic manipulation. If a
single hardware fault manifests itself in an unmanageable number of errors in
the analytical representation, then the corresponding error
detection/correction scheme becomes unnecessarily complicated.
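The additive error model just described can be made concrete with a short sketch. In the following Python fragment, the fault location, operand values and function names are hypothetical, chosen only for illustration: a single bit of a multiplier's output is corrupted, and the very same fault is then expressed as an additive error term e.

```python
# Model a single-bit "flip" fault in a digital result as an additive error e,
# so that the faulty output can be written as r_faulty = r + e.

def flip_bit(value: int, position: int) -> int:
    """Corrupt one bit of a non-negative integer result (illustrative fault)."""
    return value ^ (1 << position)

correct = 13 * 7                 # intended multiplier output: 91
faulty = flip_bit(correct, 3)    # a hardware fault flips bit 3 of the result

# The same fault, expressed in the additive error model:
e = faulty - correct             # bit 3 of 91 was set, so e = -2**3 = -8
assert faulty == correct + e
```

The point of the abstraction is that an error detector never needs to know which transistor failed; it only needs to recognize that the produced result differs from a valid one by some error term.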
This book focuses mostly on fault tolerance in combinational systems (Chapters 2 and 3) and dynamic systems (Chapters 4-7). The distinction between
combinational and dynamic systems is that the latter evolve in time according
to their internal state (memory), whereas the former have no internal state and
no evolution with respect to time.
DEFINITION 1.1 A combinational system C performs a function evaluation
on its inputs X1, X2, ..., Xu. More specifically, the output r of the combinational
system only depends on the inputs provided, i.e., it is described by a function
λC as

r = λC(X1, X2, ..., Xu) .
Examples of combinational systems include adders, arithmetic logic units,
and special purpose systems for various signal processing computations. The
book focuses on protecting such systems against faults that corrupt the output
of the system (i.e., faults that produce an incorrect result but do not cause the
system to hang or behave in some other unpredictable way).
DEFINITION 1.2 A dynamic system S evolves in time according to some internal state. More specifically, the state of the system at time step t, denoted
by qS[t], together with the input at time step t, denoted by x[t], completely
determine the system's next state according to a state evolution equation

qS[t + 1] = δS(qS[t], x[t]) .

The output y[t] of the system at time step t is based on the corresponding state
and input, and is captured by the output equation

y[t] = λS(qS[t], x[t]) .
Examples of dynamic systems include finite-state machines, digital filters,
convolutional encoders, and more generally algorithms or simulations running
on a computer architecture over several time steps. When discussing fault
tolerance in dynamic systems, the book focuses on faults that cause an unreliable
dynamic system to take a transition to an incorrect state. Depending on the
underlying system and its actual implementation, these faults can be permanent
or transient, and hardware or software. Due to the nature of dynamic systems,
the effects of a state transition fault may last over several time steps; in addition,
state corruption at a particular time step generally leads to the corruption of the
overall behavior and output at future time steps. Note that faults in the output
mechanism of a dynamic system can be treated like faults in a combinational
system as long as the representation of the state is correct. For this reason, when
discussing fault tolerance in dynamic systems, the book focuses on protecting
against state transition faults.
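Definition 1.2 and the effect of a state transition fault can be sketched in a few lines. The machine below, a mod-4 running-sum machine, is a hypothetical example (not one from the book): it has a state evolution function and an output function, and the simulation shows how corrupting a single state transition leads to corrupted outputs at later time steps.

```python
# Sketch of a dynamic system per Definition 1.2: a state evolution equation
# delta(q, x) and an output equation out(q, x). The mod-4 running-sum machine
# here is a hypothetical example chosen only for illustration.

def delta(q: int, x: int) -> int:
    """State evolution equation: next state from current state and input."""
    return (q + x) % 4

def out(q: int, x: int) -> int:
    """Output equation: output from current state and input."""
    return (q + x) % 2

def run(inputs, q0=0, fault_at=None):
    """Simulate the machine; optionally corrupt one state transition."""
    q, ys = q0, []
    for t, x in enumerate(inputs):
        ys.append(out(q, x))
        q = delta(q, x)
        if t == fault_at:   # state transition fault: machine lands in a wrong state
            q = (q + 1) % 4
    return ys

inputs = [1, 0, 1, 1, 0, 1]
print(run(inputs))                  # fault-free output sequence
print(run(inputs, fault_at=1))      # outputs after t = 1 are corrupted
```

Because the corrupted state feeds every subsequent transition, a single transient fault here behaves much like a permanent one unless some error-correcting mechanism restores the state.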
2  FAULT-TOLERANT COMBINATIONAL SYSTEMS
A necessary condition for a system to be fault-tolerant is that it exhibits
redundancy. "Structured" redundancy (that is, redundancy that has been intentionally introduced in some systematic way) allows a combinational system
to distinguish between valid and invalid results and, if possible, identify the
error and perform the necessary error-correcting procedures. Structured redundancy can also be used to guarantee acceptably degraded performance despite
faults. A well-designed fault-tolerant system makes efficient use of resources
by adding redundancy in those parts of the system that are more liable to faults
than others.
The traditional way of designing combinational systems that cope with hardware faults is the use of N-modular hardware redundancy [von Neumann,
1956]. By replicating the original system N times, one performs the desired
calculation multiple times in parallel. The final result is chosen based on what
the majority of the system replicas agree upon. For example, in the triple modular redundancy (TMR) scheme of Figure 1.1, if all three modules agree on
a result, then, the voter outputs that result; if only two of the modules agree,
then, the voter outputs that result and declares the third module faulty; if all
modules disagree, then, the voter flags an error. When using N-modular redundancy with majority voting, one can correct faults in c different systems if
N ≥ 2c + 1. If the modules are self-checking (that is, if they have the ability to
detect and flag internal errors), then, one can detect up to N and correct up to
N - 1 errors. An implicit assumption in the above discussion is that the voter
is fault-free. A number of commercial and other systems have used modular
redundancy schemes [Avizienis et al., 1971; Harper et al., 1988]; several examples can be found in [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz,
1998].
Modular redundancy schemes have been the primary methodology in designs
for fault tolerance because they decouple system design from fault tolerance
design. Modular redundancy, however, is inherently expensive due to system
replication; for this reason, a variety of hybrid methods have evolved, involving
hierarchical levels of modular redundancy that only replicate the parts of the
system that are more vulnerable to faults. When time delay is not an issue, a
popular alternative is N -modular time redundancy, where one uses the same
hardware to repeat a calculation N times. If only transient faults take place,
this approach has the same effect as N -modular hardware redundancy.
Figure 1.1.  Triple modular redundancy.
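The TMR voting rule described above can be sketched as follows; the function name and return convention are illustrative, not taken from the book. Given three replica outputs, unanimity or a two-out-of-three majority yields the final output, a lone dissenter is declared faulty, and three-way disagreement raises the uncorrectable-error flag.

```python
def tmr_vote(out1, out2, out3):
    """Majority voter for triple modular redundancy.

    Returns (final_output, suspected_faulty_modules, uncorrectable_flag).
    """
    if out1 == out2 == out3:
        return out1, [], False      # unanimous: no module suspected
    if out1 == out2:
        return out1, [3], False     # module 3 disagrees with the majority
    if out1 == out3:
        return out1, [2], False
    if out2 == out3:
        return out2, [1], False
    return None, [1, 2, 3], True    # all replicas disagree: flag an error

assert tmr_vote(5, 5, 9) == (5, [3], False)
assert tmr_vote(1, 2, 3) == (None, [1, 2, 3], True)
```

Note the implicit assumption flagged in the text: the voter itself is fault-free. Chapter 7 revisits exactly this assumption.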
The success of coding techniques in digital communication systems prompted
many researchers to investigate alternative ways for achieving resource-efficient
fault tolerance in computational systems. Not surprisingly, these techniques
have been successful in protecting digital storage devices, such as random-access memory chips and hard drives, "chip-kill" and RAID (Redundant Array of Inexpensive Disks) being perhaps the most successful examples [Patterson et al., 1988]. However, in systems that also involve some simple processing on the data (e.g., Boolean circuits or arithmetic units), the application of such coding ideas becomes far more challenging. The general model of these fault-tolerance schemes consists of multiple interdependent stages, as illustrated in
Figure 1.2. These stages include the encoder, the redundant computational unit,
the error detector/corrector, and the decoder. Redundancy is incorporated by
encoding the operands and by ensuring that the redundant computational unit
involves extra outputs that only arise when faults occur. The error detector
examines the output of the redundant computational unit and decides whether
it is valid or not. Finally, the decoder maps the corrected result back to its
non-redundant form. In many cases, there are large overlaps between several
of the subsystems shown in Figure 1.2. The model, however, illustrates the
basic idea in the design of fault-tolerant systems: at the point where the fault
takes place, the representation of the result involves redundancy and enables
one to detect and/or correct the corresponding errors. Usually, faults are only
allowed in the redundant computational unit and (sometimes) in the encoder;
the error corrector and the decoder are commonly assumed to be fault-free.
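As a toy instance of this four-stage pipeline (an AN arithmetic code with check base 3; the base and the operand values are arbitrary illustrative choices, not a scheme prescribed in this chapter):

```python
A = 3                                    # check base of the AN code (illustrative)

def encode(x):                           # encoder: operand -> redundant codeword
    return A * x

def redundant_add(cx, cy):               # redundant computational unit
    return cx + cy                       # sums of codewords remain multiples of A

def detect(cz):                          # error detector: is the result a valid codeword?
    return cz % A == 0

def decode(cz):                          # decoder: back to the non-redundant result
    return cz // A

cx, cy = encode(12), encode(30)
cz = redundant_add(cx, cy)
assert detect(cz) and decode(cz) == 42   # fault-free run passes the check

faulty = cz + 1                          # a fault perturbs the output by 1
assert not detect(faulty)                # the single additive error is caught
```

The redundancy lives in the representation at the point where the fault strikes: any additive error that is not a multiple of the check base produces an invalid codeword.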
As pointed out in [Pippenger, 1990; Avizienis, 1997], there have traditionally been two different philosophies for dealing with faults in combinational
systems: one focuses on constructing reliable systems out of unreliable components and the other focuses on detecting and correcting an a priori fixed
Figure 1.2.   Fault-tolerant combinational system.
number of faults while minimizing the required hardware overhead. The underlying assumptions in each approach are quite distinct: in the former approach all components suffer faults with a certain probability, whereas in the latter approach the number of faults is fixed. Given enough redundancy, the latter assumption essentially allows parts of the system to be assumed fault-free. The next two sections describe these two approaches in the context of fault-tolerant combinational systems.
2.1   RELIABLE COMBINATIONAL SYSTEMS
One approach towards fault tolerance is the construction of fault-tolerant
systems out of unreliable components, i.e., components that fail independently
with some nonzero probability. The goal of these designs is to assemble the
unreliable components in a way that produces a reliable overall system, that is, a
system that performs as desired with high probability. As one adds redundancy into the fault-tolerant system, the probability with which components fail remains constant. Thus, the larger the system, the more faults it has to tolerate on
the average, but the more flexibility one has in using structured redundancy to
ensure that, with high probability, the redundant system will have the desirable
behavior. Work in this direction started with von Neumann [von Neumann,
1956] and was continued by many others, mostly in the context of fault-tolerant
Boolean circuits [Winograd and Cowan, 1963; Taylor, 1968b; Gács, 1986; Hajek and Weller, 1991; Evans, 1994; Evans and Pippenger, 1998]. This approach
is described in Chapter 2.
2.2   MINIMIZING REDUNDANT HARDWARE
The second approach towards fault tolerance aims at guaranteeing the detection and correction of a fixed number of faults. It closely follows the general model in Figure 1.2 and usually requires that the error-correcting and decoding stages are fault-free. In this particular context, the latter assumption seems to
be inevitable because, regardless of how much redundancy is added, a single fault in the very last stage of the system will result in an erroneous output.
The TMR system of Figure 1.1 is perhaps the most common example that falls in this category of designs for fault tolerance. It protects against a single hardware fault in any one system replica but not in the voter. Numerous other redundant systems have also been implemented with the capability to detect/correct single faults, assuming that error detection/correction is fault-free. The basic premise behind these designs is that the error-correcting mechanism is much simpler than the actual system implementation and that faults are rare; thus, it is reasonable to assume that the error corrector is fault-free and to aim at protecting against a fixed number of faults (for example, if faults are independent and occur with probability Pf ≪ 1, then the probability of two simultaneous faults is of the order of Pf^2, which is very small compared to Pf).
Once the validity of the two assumptions above is established, designs for fault tolerance can focus their attention on adding a minimal amount of redundancy in order to detect/correct a pre-specified number of faults in the redundant computational unit. This approach has been particularly successful when features of a computation or an algorithm can be exploited in order to introduce "structured" redundancy in a way that offers more efficient fault coverage than modular redundancy. Work in this direction includes arithmetic coding schemes, algorithm-based fault tolerance and algebraic techniques, all of which are described in more detail in Chapter 3. Related applications range from arithmetic circuits [Rao, 1974], to 2-D systolic arrays for parallel matrix multiplication [Huang and Abraham, 1984; Jou and Abraham, 1986], fault-tolerant sorting networks [Choi and Malek, 1988; Liang and Kuo, 1990; Sun et al., 1994], and convolution using the fast Fourier transform [Beckmann and Musicus, 1993].
3   FAULT-TOLERANT DYNAMIC SYSTEMS
Traditionally, fault tolerance in dynamic systems has been based on variations
of modular redundancy. The technique uses several replicas of the original,
unreliable dynamic system, each initialized at the same state and supplied with
the same input sequence. Each replica goes through the same sequence of
states, unless a fault in its state transition mechanism causes a deviation from
the correct behavior. If the majority of the system replicas are in the correct
state at a given time step, an external voting mechanism will be able to decide
what the correct state is using a majority voting rule; the output can then be
computed based on this error-free state.
To understand the severity of state transition faults consider the following
scenario: assume that an unreliable dynamic system is subject to transient faults
and that the probability of taking an incorrect state transition (on any input at any
given time step) is Ps. If faults between different time steps are independent,
then the probability that the system follows the correct state trajectory for L consecutive time steps is (1 - Ps)^L and goes to zero exponentially with L. In general, the probability of ending up in the correct state after L steps is also low,² which means that the output of the system at time step L will be erroneous with high probability (because it is calculated based on an erroneous state). Therefore, the first priority in the design of a fault-tolerant dynamic system should be to ensure that the system follows the correct state trajectory.
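To make the exponential decay concrete (an illustrative computation; the value Ps = 0.01 is arbitrary):

```python
# Probability of following the correct state trajectory for L steps,
# assuming independent per-step fault probability Ps.
Ps = 0.01
for L in (10, 100, 1000):
    print(L, (1 - Ps) ** L)   # roughly 0.90, 0.37 and 4e-5, respectively
```

Even a 1% per-step fault probability leaves essentially no chance of an error-free trajectory after a thousand steps.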
There are several subtle issues that arise when using modular redundancy schemes in the context of dynamic systems [Hadjicostis, 1999]. For instance, in the example above, the use of majority voting at the end of L time steps may be highly unsuccessful. The problem is that after a system replica operates for L time steps, the probability that it has followed the correct sequence of states is (1 - Ps)^L. Moreover, at time step L, system replicas may be in incorrect states with probabilities that are prohibitively high for a voter to reliably decide what the correct state is. (An extreme example would be the case when an incorrect state is more likely to be reached than the correct one; this would make it impossible for a voter to decide what the correct state is, regardless of the number of system replicas that are used!) A possible solution to this problem is to correct the state of the system replicas at the end of each time step, as shown in Figure 1.3. In this arrangement, the state agreed upon by the majority of the systems is fed back to all systems to reset them to the "correct" state. One does not necessarily have to feed back the correct state at the end of each time step; if a correction is to be fed back once every T steps, however, one needs to ensure that (1 - Ps)^T does not become too small.
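The effect of feeding the voted state back at every step can be seen in a small Monte Carlo sketch (a hypothetical three-replica mod-16 counter; the fault probability, state-space size and trial count are arbitrary illustrative choices, and the voter itself is assumed fault-free, as in Figure 1.3):

```python
import random

def run(L, p_s, feedback, trials=2000, seed=0):
    """Fraction of trials in which the voted state is correct after L steps.

    Each replica increments a counter mod 16; a fault replaces the next
    state with a random one.  With feedback=True the majority state is
    written back into every replica after each step.
    """
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        states, correct = [0, 0, 0], 0
        for _ in range(L):
            correct = (correct + 1) % 16
            states = [(s + 1) % 16 if rng.random() > p_s else rng.randrange(16)
                      for s in states]
            if feedback:
                majority = max(set(states), key=states.count)
                states = [majority] * 3
        majority = max(set(states), key=states.count)
        ok += (majority == correct)
    return ok / trials

print("no feedback :", run(200, 0.02, feedback=False))   # majority usually wrong
print("with feedback:", run(200, 0.02, feedback=True))   # usually correct
```

Without feedback, each replica drifts off the correct trajectory independently; with per-step correction, a wrong final state essentially requires two replicas to fault in the same step.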
Another possible way of addressing the above problem is to let the systems
evolve for several time steps and then perform error correction using a mechanism that is more complicated than a simple voter. For example, one could look
at the overall state evolution (not just the final states) of each system replica
and then make an educated decision about what the correct state sequence is.
One concern about this approach is that, by allowing the system to evolve incorrectly for several time steps, system performance could be compromised in
the intervals between error correction. A bigger concern is that the complexity
of the error-correcting mechanism may increase, resulting in an unmanageable
number of errors in the correcting mechanism itself.
The concurrent error correction approach in Figure 1.3 has two major drawbacks:
1. System replication may be unnecessarily expensive. In order to avoid replication, one can employ a redundant implementation, i.e., a version of the
dynamic system which is redundant and follows a restricted state evolution
[Hadjicostis, 1999]. Faults violate the imposed restrictions, which enables
an external mechanism to perform error detection and correction. Redundant
implementations range from no redundancy to full replication and provide
Figure 1.3.   Triple modular redundancy with correcting feedback.
the means to characterize and parameterize constructions of fault-tolerant dynamic systems. The book discusses redundant implementations in various settings, including algebraic machines (Chapter 4), linear time-invariant dynamic systems (Chapter 5) and linear finite-state machines (Chapter 6).
2. The scheme relies heavily on the assumption that the voter is fault-free. If the voter also fails independently between time steps (i.e., if the voter outputs a state that, with probability Pv, is different from the state agreed upon by the majority of the systems), one is faced with another problem: after L time steps the probability that the modular redundancy scheme performs correctly is at best (1 - Pv)^L (ignoring the probability that a fault in the voter may accidentally result in feeding back the correct state in cases where most systems are in an incorrect state). Similarly, the probability that the majority of the replicas are in the correct state after L time steps is also very low. Therefore, if voters are not reliable, there appears to be a limit on the number of time steps for which one can guarantee reliable evolution using a simple replication scheme. What is more alarming is that faults in the voting mechanism become more significant as one increases the number of time steps for which the fault-tolerant dynamic system operates. Even if Pv is significantly smaller than Ps (e.g., because the dynamic system is more complex than the voter), the probability that the modular redundancy scheme performs correctly is bounded above by (1 - Pv)^L and can become unacceptably small for large L. In order to deal with faults in the error-correcting mechanism, one can use distributed error correction, so that the effects of faults in individual components of the error-correcting mechanism do not corrupt the overall system state. The trade-offs involved in such schemes are discussed in Chapter 7.
Figure 1.4.   Fault-tolerant dynamic system.
3.1   REDUNDANT IMPLEMENTATIONS
In order to avoid replication when constructing fault-tolerant dynamic systems, one can replace the original system with a larger, redundant system that preserves the state, evolution and properties of the original system in some encoded form. An external mechanism can then perform error detection and correction by identifying and analyzing violations of the restrictions on the set of states that are allowed in this larger dynamic system. The larger dynamic system is called a redundant implementation and is part of the overall fault-tolerant structure shown in Figure 1.4: the input to the redundant implementation at time step t, denoted by e(xs[t]), is an encoded version of the input xs[t] to the original system; furthermore, at any given time step t, the state qs[t] of the original dynamic system can be recovered concurrently from the corresponding state qh[t] of the redundant system through a decoding mapping ℓ [i.e., qs[t] = ℓ(qh[t])]. Note that the error detection/correction procedure is input-independent, so that the next-state function is not evaluated in the error corrector.
The following definition formalizes the notion of a redundant implementation for a dynamic system [Hadjicostis, 1999]. Note that the definition is independent of the error-detecting or correcting scheme.
DEFINITION 1.3  Let S be a dynamic system with state set Qs, input set Xs, initial state qs[0] and state evolution equation

qs[t + 1] = δs(qs[t], xs[t]) ,

where qs[·] ∈ Qs, xs[·] ∈ Xs and δs is the next-state function.
Let H be a dynamic system with state set Qh, input set Xh, initial state qh[0] and state evolution equation

qh[t + 1] = δh(qh[t], xh[t]) ,

where qh[·] ∈ Qh, xh[·] ∈ Xh and δh is the next-state function.
System H is a redundant implementation for S if there exist
(i) an injective input encoding mapping e : Xs → Xh, and
(ii) a one-to-one state decoding mapping f
such that, for all input sequences,

f(qh[t]) = qs[t]   for all t ≥ 0 .

The set Qh^v is defined as f⁻¹(Qs) = {qh[·] = f⁻¹(qs[·]) | qs[·] ∈ Qs} and is called the subset of valid states in H.
If the following two conditions are satisfied for all qs[·] ∈ Qs and all xs[·] ∈ Xs,

f(qh[0]) = qs[0] ,
f(δh(f⁻¹(qs[t]), e(xs[t]))) = δs(qs[t], xs[t]) ,
then the state of S at all time steps t ≥ 0 can be recovered from the state of H through the decoding mapping f (under fault-free conditions at least); this can be proved by induction on the number of time steps. Knowledge of the restrictions on the subset of valid states Qh^v allows the external error detecting/correcting mechanism to handle faults. Any faults that cause transitions to invalid states (i.e., states outside the subset Qh^v) will be detected and, if possible, corrected. Assuming no faults in the error corrector and no uncorrectable faults in the state transition mechanism, the redundant implementation will then be able to concurrently simulate the operation of the original dynamic system. One then aims at using a minimal amount of redundancy to construct redundant implementations that are appropriate for protecting the given dynamic system against a pre-specified number of faults. As shown in Chapters 4-6, this general approach can be used to parameterize different redundant implementations in various settings and to make connections with hardware by developing appropriate fault models.
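A minimal concrete instance of this definition (the mod-15 counter with an AN-coded state below is an illustrative construction, not an example taken from later chapters) can verify the two conditions exhaustively:

```python
N = 5                                   # S: a mod-5 counter, qs[t+1] = (qs[t] + xs[t]) mod 5

def delta_S(q, x):
    return (q + x) % N

# H: a mod-15 counter whose valid states are the multiples of 3
# (an AN code on the state; chosen purely for illustration).
def e(x):                               # injective input encoding
    return 3 * x

def f(qh):                              # state decoding (defined on valid states)
    return qh // 3

def delta_H(qh, xh):                    # next-state function of H
    return (qh + xh) % (3 * N)

def is_valid(qh):                       # membership in the subset of valid states
    return qh % 3 == 0

# Condition checks: f(qh[0]) = qs[0] and
# f(delta_H(f^{-1}(qs[t]), e(xs[t]))) = delta_S(qs[t], xs[t]).
assert f(0) == 0
for q in range(N):
    for x in range(N):
        assert f(delta_H(3 * q, e(x))) == delta_S(q, x)

# A fault that knocks the redundant state off a multiple of 3 is detectable:
corrupted = delta_H(3 * 2, e(1)) + 1    # correct next state is 9; the fault gives 10
assert not is_valid(corrupted)
print("both conditions of the definition hold")
```

Here any single additive error of ±1 on the redundant state leaves the set of valid states, so an external detector that merely checks divisibility by 3 catches it without evaluating the next-state function.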
Note that the definition of a redundant implementation does not specify next-state transitions when the redundant system is in a state outside the set of valid states (this issue becomes important when the error detector/corrector is not fault-free or when the error-correcting mechanism is combined with the state transition mechanism [Larsen and Reed, 1972; Wang and Redinbo, 1984]). Due to this flexibility, there are multiple different redundant implementations for a given error detecting/correcting scheme and in many cases it may be possible to systematically characterize and exploit this flexibility (e.g., to minimize hardware or to perform error detection/correction periodically).
3.2   FAULTS IN THE ERROR-CORRECTING MECHANISM
Unlike the situation in combinational systems, fault tolerance in dynamic systems requires consideration of error propagation. The problem is that a fault causing a transition to an incorrect next state at a particular time step will not only affect the output at that particular time step (which may be an unavoidable possibility given that one uses fault-prone elements), but will also affect the state and output of the system at later times. In addition, the problem of error propagation intensifies as one increases the number of time steps for which the dynamic system operates. In contrast, faults in a combinational system (as well as faults in the hardware implementation of the output function of a dynamic system) only affect the output at a particular time step but have no aftereffects on the future performance of the system. Specifically, they do not intensify as one increases the number of time steps for which the system operates.
Chapter 7 describes the handling of transient faults³ in both the next-state transition mechanism and the error detecting/correcting mechanism. The possibility of faults in the error-correcting mechanism implies that one can no longer guarantee that the fault-tolerant system will end up in the right state at the completion of the error-correcting stage. To overcome this problem, one can associate with each state a set of states and ensure that, at any given time step, the fault-tolerant system is, with high probability, within the set of states that represents the actual state [Larsen and Reed, 1972; Wang and Redinbo, 1984; Hadjicostis, 1999].
Employing the above design principle, Chapter 7 analyzes a variant of modular redundancy that uses unreliable system replicas and unreliable voters to construct redundant dynamic systems that evolve reliably for any given finite number of time steps. More specifically, given unreliable system replicas (i.e., dynamic systems that take incorrect state transitions with probability Ps, independently between different time steps) and unreliable voters (that suffer transient faults independently between different time steps with probability Pv), Chapter 7 describes ways to guarantee that the state evolution of a redundant fault-tolerant implementation will be the correct one. This method ensures that, with high probability, the fault-tolerant system will go through a sequence of states that correctly represents the error-free state sequence (i.e., the state of the redundant system at each time step is within a set of states that correspond to the state the fault-free system would be in). It is shown that, under this very
general approach, there is a logarithmic trade-off between the number of time
steps and the amount of redundancy that is needed to achieve a given probability
of failure [Hadjicostis, 2000].
For the special case of linear finite-state machines, one can combine the
above techniques with low-complexity error-correcting codes to make more
efficient use of redundancy. More specifically, one can obtain interconnections
of identical linear finite-state machines that operate in parallel on distinct input
sequences and use a constant amount of hardware per machine to achieve a
desired probability of failure (for the given number of time steps) [Hadjicostis
and Verghese, 1999]. In other words, by increasing the number of machines
that operate in parallel, one can achieve a smaller probability of failure or,
equivalently, operate the machines for a longer time interval; the redundancy per
machine (including the hardware required in the error-correcting mechanism)
remains bounded by a constant.
The analysis in Chapter 7 provides a better understanding of the trade-offs
involved when designing fault-tolerant systems out of unreliable components.
These include constraints on the fault probabilities in the system/corrector, the
length of operation and the required amount of redundancy. Furthermore, the
analysis effectively demonstrates that the two-stage approach to fault tolerance
of Figure 1.4 can be used successfully (and in some cases efficiently) to construct
reliable dynamic systems out of unreliable components.
4   CODING TECHNIQUES FOR FAULT DIAGNOSIS
The coding techniques that are studied in this book can also be applied in other contexts. Chapter 8 explores one such direction by employing coding techniques in order to facilitate fault diagnosis in complex discrete event systems (DES's). A diagnoser or a monitoring mechanism operates concurrently with a given DES and is able to detect and identify faults by analyzing available activity and status information. There is a large volume of work on fault diagnosis in dynamic systems and networks, particularly within the systems/control and computer engineering communities. For example, within the systems and control community, there has been a long-standing interest in fault diagnosis in large-scale dynamic systems, including finite automata [Cieslak et al., 1988; Sampath et al., 1995; Sampath et al., 1998], Petri net models [Silva and Velilla, 1985; Sahraoui et al., 1987; Valette et al., 1989; Cardoso et al., 1995; Hadjicostis and Verghese, 1999], timed systems [Zad et al., 1999; Pandalai and Holloway, 2000], and communication networks [Bouloutas et al., 1992; Wang and Schwartz, 1993; Park and Chong, 1995]. The goal in all of these approaches is to develop a monitor (diagnoser) that can detect and identify faults from a given, pre-determined set. The usual approach is to locate a set of inherently invariant properties of the system, a subset of which is violated soon after a particular fault takes place. By tracking the activity in the system, one is
able to detect violations of such invariant properties (which indicates the presence of a fault) and correlate them with a unique fault in the system (which then constitutes fault identification). The task becomes challenging because of potential observability limitations (in terms of the inputs, states or outputs that are observed [Cieslak et al., 1988]) and various other requirements (such as detection/communication delays [Debouk et al., 2000], sensor allocation limitations [Debouk et al., 1999], distributivity/decentralizability constraints [Aghasaryan et al., 1998; Debouk et al., 1998], or the sheer size of the diagnoser).
In Chapter 8, coding techniques are used to design the state evolution of the monitor so that, at any given time step, certain constraints are enforced between its state and the state of the DES. Fault detection and identification is then achieved by analyzing violations of these coding constraints. The approach is very general and can handle a variety of fault models. There are a number of connections that can be made with the more traditional fault diagnosis techniques mentioned in the previous paragraph; Chapter 8 aims at pointing out some of these potential connections.
Notes

1 The faulty result rf of a multiplier can be written as rf = r + e, where r is the fault-free result (i.e., the result that would have been obtained under no faults) and e is an appropriate real number. Similarly, the faulty result rf of an adder can be written as rf = r × e, where r is the fault-free result and e is an appropriate real number (r ≠ 0).

2 The probability of ending up in the correct state after L steps depends on the dynamic structure of the particular finite-state machine and on whether multiple faults may lead to the correct state. The argument can be made more precise if one chooses a particular implementation for the machine (consider, for example, the linear feedback shift register shown in Figure 6.1 of Chapter 6, with each fault causing a particular bit in the state vector to flip with probability Pb).

3 Permanent faults can be handled more efficiently using reconfiguration techniques rather than concurrent error detection and correction. In some sense, permanent faults are easier to deal with than transient faults. For example, when testing for permanent faults in an integrated circuit, it may be reasonable to assume that the testing mechanism (error-detecting mechanism) has been verified to be fault-free. Since such verification only needs to take place once, one can devote large amounts of time and resources in order to test for the absence of permanent faults in this testing/correcting mechanism.
References
Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1998). Fault detection and diagnosis in distributed systems: an approach by partially stochastic Petri nets. Discrete Event Dynamic Systems: Theory and Applications, 8(2):203-231.
Avizienis, A. (1997). Toward systematic design of fault-tolerant systems. IEEE Computer, 30(4):51-58.
Avizienis, A., Gilley, G. C., Mathur, F. P., Rennels, D. A., Rohr, J. A., and
Rubin, D. K. (1971). The STAR (self-testing and repairing) computer: An
investigation of the theory and practice of fault-tolerant computer design.
In Proceedings of the 1st Int. Conf. on Fault-Tolerant Computing, pages
1312-1321.
Beckmann, P. E. and Musicus, B. R. (1993). Fast fault-tolerant digital convolution using a polynomial residue number system. IEEE Transactions on Signal Processing, 41(7):2300-2313.
Blahut, R. E. (1983). Theory and Practice of Data Transmission Codes. Addison-Wesley, Reading, Massachusetts.
Bouloutas, A., Hart, G. W., and Schwartz, M. (1992). Simple finite state fault detectors for communication networks. IEEE Transactions on Communications, 40(3):477-479.
Cardoso, J., Künzle, L. A., and Valette, R. (1995). Petri net based reasoning for
the diagnosis of dynamic discrete event systems. In Proceedings of the IFSA
'95, the 6th Int. Fuzzy Systems Association World Congress, pages 333-336.
Choi, Y.-H. and Malek, M. (1988). A fault-tolerant systolic sorter. IEEE Transactions on Computers, 37(5):621-624.
Cieslak, R., Desclaux, C., Fawaz, A. S., and Varaiya, P. (1988). Supervisory
control of discrete-event processes with partial observations. IEEE Transactions on Automatic Control, 33(3):249-260.
Cover, T. M. and Thomas, J. A. (1999). Elements of Information Theory. John
Wiley & Sons, New York.
Debouk, R., Lafortune, S., and Teneketzis, D. (1998). Coordinated decentralized
protocols for failure diagnosis of discrete event systems. In Proceedings of
the 37th IEEE Conf. on Decision and Control, pages 3763-3768.
Debouk, R., Lafortune, S., and Teneketzis, D. (1999). On an optimization problem in sensor selection for failure diagnosis. In Proceedings of the 38th IEEE
Conf. on Decision and Control, pages 4990-4995.
Debouk, R., Lafortune, S., and Teneketzis, D. (2000). On the effect of communication delays in failure diagnosis of decentralized discrete event systems. In Proceedings of the 39th IEEE Conf. on Decision and Control, pages 2245-2251.
Elias, P. (1958). Computation in the presence of noise. IBM Journal of Research
and Development, 2(10):346-353.
Evans, W. (1994). Information Theory and Noisy Computation. PhD thesis,
EECS Department, University of California at Berkeley, Berkeley, California.
Evans, W. and Pippenger, N. (1998). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory, 44(3):1299-1305.
Gács, P. (1986). Reliable computation with cellular automata. Journal of Computer and System Sciences, 32(2):15-78.
Gallager, R. G. (1968). Information Theory and Reliable Communication. John
Wiley & Sons, New York.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. (2000). Fault-tolerant dynamic systems. In Proceedings of
ISIT 2000, the Int. Symp. on Information Theory, page 444.
Hadjicostis, C. N. and Verghese, G. C. (1999a). Fault-tolerant linear finite state machines. In Proceedings of the 6th IEEE Int. Conf. on Electronics, Circuits and Systems, pages 1085-1088.
Hadjicostis, C. N. and Verghese, G. C. (1999b). Monitoring discrete event systems using Petri net embeddings. In Application and Theory of Petri Nets
1999, number 1639 in Lecture Notes in Computer Science, pages 188-208.
Hajek, B. and Weller, T. (1991). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory,
37(2):388-391.
Harper, R. E., Lala, J. H., and Deyst, J. J. (1988). Fault-tolerant parallel processor architecture review. In Eighteenth Int. Symp. on Fault-Tolerant Computing, Digest of Papers, pages 252-257.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for
matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Johnson, B. (1989). Design and Analysis of Fault-Tolerant Digital Systems.
Addison-Wesley, Reading, Massachusetts.
Jou, J.-Y. and Abraham, J. A. (1986). Fault-tolerant matrix arithmetic and signal processing on highly concurrent parallel structures. Proceedings of the IEEE, 74(5):732-741.
Koren, I. and Singh, A. D. (1990). Fault-tolerance in VLSI circuits. IEEE Computer, 23(7):73-83.
Larsen, R. W. and Reed, I. S. (1972). Redundancy by coding versus redundancy
by replication for failure-tolerant sequential circuits. IEEE Transactions on
Computers, 21(2):130-137.
Leveugle, R., Koren, Z., Koren, I., Saucier, G., and Wehn, N. (1994). The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398-1406.
Liang, S. C. and Kuo, S. Y. (1990). Concurrent error detection and correction in
real-time systolic sorting arrays. In Proceedings of 20th IEEE Int. Symp. on
Fault-Tolerant Computing, pages 434-441. IEEE Computer Society Press.
Pandalai, D. N. and Holloway, L. E. (2000). Template languages for fault monitoring of timed discrete event processes. IEEE Transactions on Automatic
Control, 45(5):868-882.
Park, Y. and Chong, E. K. P. (1995). Fault detection and identification in communication networks: a discrete event systems approach. In Proceedings of the 33rd Annual Allerton Conf. on Communication, Control, and Computing, pages 126-135.
Patterson, D. A., Gibson, G., and Katz, R. H. (1988). A case for redundant
arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD,
pages 109-116.
Peercy, M. and Banerjee, P. (1993). Fault-tolerant VLSI systems. Proceedings
of the IEEE, 81(5):745-758.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT
Press, Cambridge, Massachusetts.
Pippenger, N. (1990). Developments in the synthesis of reliable organisms from
unreliable components. In Proceedings of Symposia in Pure Mathematics,
volume 50, pages 311-324.
Pradhan, D. K. (1996). Fault-Tolerant Computer System Design. Prentice Hall,
Englewood Cliffs, New Jersey.
Rao, T. R. N. (1974). Error Coding for Arithmetic Processors. Academic Press,
New York.
Rao, T. R. N. and Fujiwara, E. (1989). Error-Control Coding for Computer
Systems. Prentice-Hall, Englewood Cliffs, New Jersey.
Redinbo, G. R. (1987). Signal processing architectures containing distributed fault-tolerance. In Conference Record - Twentieth Asilomar Conf. on Signals, Systems & Computers, pages 711-716.
Roy-Chowdhury, A. and Banerjee, P. (1996). Algorithm-based fault location and recovery for matrix computations on multiprocessor systems. IEEE Transactions on Computers, 45(11):1239-1247.
Sahraoui, A., Atabakhche, H., Courvoisier, M., and Valette, R. (1987). Joining Petri nets and knowledge-based systems for monitoring purposes. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, pages 1160-1165.
Sampath, M., Lafortune, S., and Teneketzis, D. (1998). Active diagnosis of discrete-event systems. IEEE Transactions on Automatic Control, 43(7):908-929.
Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., and Teneketzis, D. (1995). Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40(9):1555-1575.
Shannon, C. E. (1948a). A mathematical theory of communication (Part I). Bell
System Technical Journal, 27(7):379-423.
Shannon, C. E. (1948b). A mathematical theory of communication (Part II).
Bell System Technical Journal, 27(10):623-656.
Siewiorek, D. and Swarz, R. (1998). Reliable Computer Systems: Design and Evaluation. A. K. Peters.
Silva, M. and Velilla, S. (1985). Error detection and correction in Petri net
models of discrete events control systems. In Proceedings of ISCAS 1985,
the IEEE Int. Symp. on Circuits and Systems, pages 921-924.
Sun, J., Cerny, E., and Gecsei, J. (1994). Fault tolerance in a class of sorting
networks. IEEE Transactions on Computers, 43(7):827-837.
Taylor, M. G. (1968). Reliable information storage in memories designed from unreliable components. The Bell System Technical Journal, 47(10):2299-2337.
Valette, R., Cardoso, J., and Dubois, D. (1989). Monitoring manufacturing systems by means of Petri nets with imprecise markings. In Proceedings of the IEEE Int. Symp. on Intelligent Control, pages 233-238.
von Neumann, J. (1956). Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. Princeton University Press, Princeton, New Jersey.
Wang, C. and Schwartz, M. (1993). Fault detection with multiple observers. IEEE/ACM Transactions on Networking, 1(1):48-55.
Wang, G. X. and Redinbo, G. R. (1984). Probability of state transition errors in a finite state machine containing soft failures. IEEE Transactions on Computers, 33(3):269-277.
Wicker, S. B. (1995). Error Control Systems. Prentice Hall, Englewood Cliffs,
New Jersey.
Winograd, S. and Cowan, J. D. (1963). Reliable Computation in the Presence
of Noise. MIT Press, Cambridge, Massachusetts.
Zad, S. H., Kwong, R. H., and Wonham, W. M. (1999). Fault diagnosis in
timed discrete-event systems. In Proceedings of the 38th IEEE Conference
on Decision and Control, pages 1756-1761.
I
FAULT-TOLERANT COMBINATIONAL SYSTEMS
Chapter 2
RELIABLE COMBINATIONAL SYSTEMS OUT OF
UNRELIABLE COMPONENTS
1
INTRODUCTION
In one of his most influential papers, von Neumann considered the construction of reliable combinational systems out of unreliable components [von Neumann, 1956]. He focused on a class of digital systems that performed computation by using appropriately interconnected voting mechanisms. More specifically, von Neumann constructed reliable systems out of unreliable 3-bit voters, some of which were used to perform computation and some of which were used as "restoring organs" to achieve error correction. The voters used for computational purposes received inputs that were either primary inputs, constants or outputs from other voters; the voters that functioned as "restoring organs" ideally (i.e., under fault-free conditions) received identical inputs.
Von Neumann's fault model assumed that a voter fails by providing an output that differs from the value agreed upon by the majority of its inputs. When voter faults are independent and occur with probability exactly p, von Neumann demonstrated a fault-tolerant construction that is successful if p < 0.0073. In fact, using unreliable 3-input gates (including unreliable 3-bit voters) that fail exactly with probability p, it was later shown in [Hajek and Weller, 1991] that it is possible to construct reliable circuits for computing arbitrary Boolean formulas if and only if p < 1/6. The fraction 1/6 can thus be seen as the maximum tolerable noise in unreliable 3-input gates. These results were extended to interconnections of u-input gates (for u odd) in [Evans, 1994].
This chapter discusses von Neumann's approach for reliable computation
and related extensions. The focus is on reliably unreliable components, i.e.,
components that fail exactly with a known probability p. Extensions of these
results to less restrictive fault models, such as models where each component
fails with a probability that is bounded by a known constant p, are not explicitly
addressed in this chapter; the interested reader can refer to [Pippenger, 1985;
Pippenger, 1990] and references therein.
2
COMPUTATIONAL MODELS FOR
COMBINATIONAL SYSTEMS
A u-input Boolean gate computes a Boolean function

f : \{0, 1\}^u \rightarrow \{0, 1\} .
Inputs to gates are Boolean variables that are either primary inputs to the circuit, constants or outputs of other gates. A network or a combinational circuit is a loop-free interconnection of gates such that the output of each gate is the input to other gates (except for the last gate, which provides the final output). A formula is a network in which the output of a gate is an input to at most one gate [Pippenger, 1988; Feder, 1989].
Complex combinational circuits may involve a large number of individual
components (gates), all of which belong to the set of available types of Boolean
gates. In other words, one assumes that there is a given pool or basis of available
prototype gates. The depth and size of such combinational circuits are defined using graph-theoretic nomenclature [Pippenger, 1985; Evans and Schulman, 1993; Evans and Schulman, 1999]:
• The depth of the circuit is the maximum number of gates that can be found
in a path that connects a primary input to the final output.
• The size of the circuit is the total number of gates.
An unreliable u-input gate is modeled as a gate that with probability 1 − p (0 ≤ p < 1/2) computes the correct output on its inputs and with probability p it fails, i.e., it produces an incorrect output (its binary value is flipped). Note that these unreliable gates are reliably unreliable in the sense that they fail exactly with probability p [Pippenger, 1990]. A more powerful fault model would be one where each gate fails with a probability that is bounded by p (i.e., the ith gate fails with probability p_i ≤ p).
When considering interconnections of unreliable gates, it is assumed that different gates fail independently. A formula or a network is considered reliable if its final output is correct with a "large" probability for all combinations of primary inputs. This probability is usually simply required to be larger than 1/2 (so that the combinational system is more likely to produce the correct output than an incorrect one).
Figure 2.1. Error correction using a "restoring organ."

3
VON NEUMANN'S APPROACH TO FAULT TOLERANCE
In [von Neumann, 1956] von Neumann constructed reliable combinational
circuits out of unreliable 3-input voters. Some of the voters were used for
computation and some of them were used as "restoring organs" to achieve error
correction. In von Neumann's reliable combinational circuits, an unreliable
voter takes three bits (Boolean values) as inputs and under fault-free conditions
outputs the value that the majority of them have. With probability (exactly) p,
however, the unreliable voter outputs an incorrect value (i.e., it flips a "1" to a
"0" and vice-versa).
Voters that operate as restoring organs ideally receive identical inputs. The basic picture is shown in Figure 2.1: the restoring organ receives as inputs the outputs of three replicas of the same combinational system (circuit). These replicas are given the same inputs and operate independently from each other. If each of their outputs is erroneous with probability q (q < 1/2), the probability that the majority of the inputs to the voter are incorrect will be given by

\theta(q) \triangleq \sum_{k=2}^{3} \binom{3}{k} q^k (1-q)^{3-k} = 3q^2 - 2q^3 .    (2.1)
Since the restoring organ is assumed to fail with probability p, independently from the other systems, the probability that the output of the restoring organ is erroneous is given by

f(q) = p(1 - \theta(q)) + (1 - p)\theta(q)
     = p + (1 - 2p)\theta(q)
     = p + (1 - 2p)(3q^2 - 2q^3) .    (2.2)

Figure 2.2. Plots of functions f(q) and g(q) for two different values of p.
Function f(q) (along with the function g(q) = q) is plotted in Figure 2.2 for two different values of p.
The basic approach in von Neumann's scheme is to successively use restoring organs until the final output reaches an acceptable or desirable level of probability of error. More specifically, one builds a ternary tree whose leaves are hardware-independent replicas of an unreliable combinational circuit and whose internal nodes are 3-input restoring organs. This scheme, illustrated in Figure 2.3 for two levels of restoring organs, has a hardware cost that is exponential in the number of levels in the tree. For example, s levels of restoring organs require 3^s replicas of the combinational system and (3^s - 1)/2 voters.
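The behavior of this construction can be explored numerically. The fragment below is a sketch (the helper names are ours): it iterates the function f(q) of Eq. (2.2) and tallies the hardware cost of an s-level ternary tree.

```python
def f(q, p):
    """Error probability at the output of a restoring organ, Eq. (2.2)."""
    theta = 3 * q**2 - 2 * q**3  # majority of the 3 inputs wrong, Eq. (2.1)
    return p + (1 - 2 * p) * theta

def iterate(q, p, levels):
    """Error probability after `levels` successive restoring organs."""
    for _ in range(levels):
        q = f(q, p)
    return q

def tree_cost(levels):
    """(replicas, voters) used by a ternary tree with `levels` restoring levels."""
    return 3**levels, (3**levels - 1) // 2

# Below the 1/6 threshold the iterates settle near a small value q*;
# above it they are drawn to 1/2 (a completely unreliable output).
q_good = iterate(0.4, 0.01, 60)   # p = 0.01 < 1/6
q_bad = iterate(0.4, 0.20, 60)    # p = 0.20 > 1/6
```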
Figure 2.3. Two successive restoring iterations in von Neumann's construction for fault tolerance.
The final output of an s-level ternary tree of voters is erroneous with a probability that is given by

q^* = \underbrace{f(f(f(\cdots f}_{s \text{ iterations}}(q) \cdots ))) ,

where the number of successive iterations of the function f(·) is the same as the number of levels in the ternary tree. (The simplicity of the above formula is a direct consequence of the components being reliably unreliable. If this were not the case, then the probability θ in Eq. (2.1) would depend on three variables (e.g., q_1, q_2, q_3) rather than a single one (q) and the discussion would become slightly more complicated.)
THEOREM 2.1 Repeated iterations of the function f(q) in Eq. (2.2) converge monotonically to a value q^*, such that:

• If 0 ≤ p < 1/6 and 0 ≤ q < 1/2, then q^* satisfies p ≤ q^* < 1/2.

• If 1/6 < p < 1/2 and 0 ≤ q < 1/2, then q^* = 1/2.
Proof: The proof in von Neumann's paper first finds the roots of the function q − f(q). Since q = 1/2 is a solution (by inspection), the other two roots can be found as the roots of the quadratic

\frac{q - f(q)}{q - \frac{1}{2}} = 2\left((1 - 2p)q^2 - (1 - 2p)q + p\right) .

The two solutions are given by

q = \frac{1}{2}\left(1 \pm \sqrt{\frac{1 - 6p}{1 - 2p}}\right)

and are complex if p > 1/6. In such a case, the form of f(q) is as shown at the right side of Figure 2.2. For p < 1/6, the two solutions are real and the form of f(q) is as shown at the left side of Figure 2.2. The points of intersection are given by q_0 and 1 − q_0, where

q_0 = \frac{1}{2}\left(1 - \sqrt{\frac{1 - 6p}{1 - 2p}}\right) .    (2.3)
The following two cases need to be considered:

1. If 0 ≤ p < 1/6 and 0 ≤ q < 1/2, then the monotonicity and continuity of f(q) imply that successive iterations of f(·) will converge to q_0 (because, given 0 ≤ q_i < q_0, then q_{i+1} = f(q_i) satisfies q_i < q_{i+1} < q_0, whereas given q_0 < q_i < 1/2, then q_{i+1} = f(q_i) satisfies q_0 < q_{i+1} < q_i).

2. If 1/6 < p < 1/2 and 0 ≤ q < 1/2, then the monotonicity and continuity of f(q) imply that successive iterations of f(·) will converge to 1/2 (because, given 0 ≤ q_i < 1/2, then q_{i+1} = f(q_i) satisfies q_i < q_{i+1} < 1/2).

At this point the proof of the theorem is complete. □
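The fixed points just derived can be verified directly; in the sketch below (helper names ours), q_0 from Eq. (2.3), its mirror 1 − q_0 and 1/2 are checked to satisfy q = f(q).

```python
from math import sqrt

def f(q, p):
    """Von Neumann's restoring-organ error probability, Eq. (2.2)."""
    return p + (1 - 2 * p) * (3 * q**2 - 2 * q**3)

def q0(p):
    """Smaller real root of q = f(q) for p < 1/6, Eq. (2.3)."""
    return 0.5 * (1 - sqrt((1 - 6 * p) / (1 - 2 * p)))

# q0, 1 - q0 and 1/2 are exactly the three fixed points of f.
p = 0.05
fixed = q0(p)
```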
Von Neumann's construction in [von Neumann, 1956] demonstrated that if p ≤ 0.0073, then it is possible to construct reliable combinational systems out of unreliable voters. The constant 0.0073 was the result of particular details of his construction and, as von Neumann himself pointed out, it can be improved. The problem was that the ternary tree in Figure 2.3 needed to have leaves with outputs that are erroneous with probability smaller than 1/2; therefore, one had to decompose a combinational system into reliable subsystems in a way that ensures that the probability of error at the output of a subsystem is smaller than 1/2 (so that it can be driven to q^* by consecutive stages of restoring organs).
Table 2.1. Input-output table for the 3-input XNAND gate.

 i1  i2  i3 | XNAND Output
  0   0   0 |      1
  0   0   1 |      1
  0   1   0 |      0
  0   1   1 |      1
  1   0   0 |      1
  1   0   1 |      0
  1   1   0 |      0
  1   1   1 |      0

4
EXTENSIONS OF VON NEUMANN'S APPROACH

4.1
MAXIMUM TOLERABLE NOISE FOR 3-INPUT GATES
By considering reliably unreliable 3-input gates that fail with probability p, Hajek and Weller demonstrated that it is possible to construct reliable combinational circuits that calculate arbitrary Boolean functions if p < 1/6 [Hajek and Weller, 1991]. Using techniques different from the ones described in this chapter, they also proved that, if p > 1/6, then it is impossible to construct reliable circuits out of (reliably) unreliable 3-input gates. Therefore, 1/6 can be seen as the maximum tolerable noise in 3-input components. The latter result also applies to less restrictive fault models of unreliable gates, including gates that fail with probability bounded by p.
The construction in [Hajek and Weller, 1991] goes as follows:
• Any Boolean function can be computed by appropriately interconnected 2-input noiseless NAND gates. In the fault-tolerant construction, each 2-input NAND gate (with inputs x1 and x2) will be emulated by an unreliable 3-input XNAND gate (with inputs i1, i2 and i3) that functions as shown in Table 2.1. One can think of i1 as a noisy version of x1, and i2 and i3 as noisy versions of x2 [Hajek and Weller, 1991].
• Suppose that the following conditions are true:

1. p < 1/6.

2. Each of the inputs i1, i2 and i3 is erroneous (i.e., differs from the corresponding input of the emulated NAND gate) with the same probability q, where 0 ≤ q < 1/2.

3. All of the above events are independent.
One can verify that the output of an unreliable XNAND gate (with inputs i1, i2, i3) will equal the output of a reliable NAND gate with inputs x1 and x2 with a probability that is larger than 1/2.

To show this, all one has to do is to consider all different cases separately. For example, if x1 = 0 and x2 = 0, then the reliable NAND gate should output "1." The probability that the unreliable gate produces an incorrect output (i.e., "0") can be calculated as the sum of the probabilities of the following eight events:

1. i1 = 0, i2 = 0, i3 = 0 and XNAND gate fault; this event occurs with probability (1 − q)^3 p.

2. i1 = 0, i2 = 0, i3 = 1 and XNAND gate fault; this event occurs with probability q(1 − q)^2 p.

3. i1 = 0, i2 = 1, i3 = 0 and no XNAND gate fault; this event occurs with probability q(1 − q)^2 (1 − p).

...

8. i1 = 1, i2 = 1, i3 = 1 and no XNAND gate fault; this event occurs with probability q^3 (1 − p).

The sum of the probabilities of all of the above eight events is given by

(1 − p)[q(1 − q)^2 + 2q^2(1 − q) + q^3] + p[(1 − q)^3 + 2q(1 − q)^2 + q^2(1 − q)] ,

which can be easily shown to be less than 1/2 (e.g., by taking the first and second derivatives with respect to q).
• Given an arbitrary Boolean function and its circuit implementation based on reliable 2-input NAND gates, one can construct a reliable circuit from unreliable 3-input XNAND gates using the following strategy:

(i) Replace each NAND gate with an XNAND gate.

(ii) Replicate the circuit that generates input x2 to each NAND gate; use the first circuit replica to provide i2 and the second circuit replica to provide i3 to the XNAND gate.

(iii) Do this recursively, starting from the NAND gate that provides the output of the original circuit and working towards the primary inputs (until all NAND gates are replaced by XNAND gates).
• Depending on the actual inputs, the outputs of the XNAND gates will be erroneous with different probabilities (that are nevertheless smaller than 1/2). This would be a problem at the next level of XNAND gates because different inputs would be erroneous with unequal probabilities (which invalidates the requirement that each input is erroneous with the same probability). To avoid this problem, one can use the ternary tree strategy in Figure 2.3 with enough levels of 3-input unreliable voters so that the probability of an error under any combination of inputs is close enough to the "steady-state" error probability q_0 [see Eq. (2.3)].
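The case analysis sketched above can be carried out exhaustively by a short enumeration. The fragment below (names ours) uses the truth table of Table 2.1 and computes the exact probability that the noisy XNAND disagrees with the reliable NAND; for sample values p < 1/6 and q < 1/2, the probability of an incorrect output indeed stays below 1/2.

```python
from itertools import product

# Truth table of the 3-input XNAND gate (Table 2.1), indexed by (i1, i2, i3).
XNAND = {
    (0, 0, 0): 1, (0, 0, 1): 1, (0, 1, 0): 0, (0, 1, 1): 1,
    (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 0,
}

def error_probability(x1, x2, q, p):
    """Exact probability that a noisy XNAND disagrees with NAND(x1, x2),
    when each input is independently flipped with probability q and the
    gate itself fails (flips its output) with probability p."""
    target = 1 - (x1 & x2)  # output of a reliable NAND gate
    total = 0.0
    for i1, i2, i3 in product((0, 1), repeat=3):
        # i1 is a noisy copy of x1; i2 and i3 are noisy copies of x2.
        pr = 1.0
        for noisy, clean in ((i1, x1), (i2, x2), (i3, x2)):
            pr *= q if noisy != clean else 1 - q
        out = XNAND[(i1, i2, i3)]
        # The output is wrong either because the fault-free XNAND value is
        # already wrong (no fault), or because a fault flips a right value.
        total += pr * (p if out == target else 1 - p)
    return total

err = error_probability(0, 0, q=0.2, p=0.1)  # the x1 = x2 = 0 case from the text
```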
4.2
MAXIMUM TOLERABLE NOISE FOR u-INPUT GATES

The results of [Hajek and Weller, 1991] were generalized to u-input gates for odd u in [Evans, 1994]. Again, the assumption was that gates are reliably unreliable and that they fail independently with probability p. It was shown that, using unreliable u-input gates, one can construct circuits that reliably calculate arbitrary Boolean functions if p satisfies

p < \frac{1}{2} - \frac{2^{u-2}}{u\binom{u-1}{\frac{u-1}{2}}} .    (2.4)
To prove this statement, consider the following simplified scenario first. Suppose that a (reliably) unreliable u-bit voter (u odd) fails with probability p (a voter fails when it outputs a result that is not equal to the majority of its inputs). Furthermore, assume that the u input bits ideally (under fault-free conditions) have identical values, but each one may be erroneous with probability q < 1/2. Assume that the inputs being erroneous and the voter failing are independent events.

The probability that the majority of the inputs to the voter are incorrect is given by

\theta_u(q) = \sum_{k=\frac{u+1}{2}}^{u} \binom{u}{k} q^k (1-q)^{u-k} ,

and the probability that the output of the voter is erroneous is given by

f_u(q) = p(1 - \theta_u(q)) + (1 - p)\theta_u(q)
       = p + (1 - 2p)\theta_u(q) .
THEOREM 2.2 Let 0 ≤ q < 1/2. Repeated applications of the function f_u(q) [i.e., f_u(f_u(... (f_u(q)) ...))] will converge to a value q^* that satisfies p ≤ q^* < 1/2 if 0 ≤ p < p_max, where p_max denotes the bound on the right-hand side of Eq. (2.4). If p_max < p < 1/2, they will converge to 1/2.
Proof: The proof presented here is slightly different from the proof in [Evans, 1994]. First, it is argued that f_u(q) can only take one of the two forms shown in Figure 2.2. The distinction is made based on whether the function f_u(q) intersects the line of the function g(q) = q at one point or three points (in both cases, the point (1/2, 1/2) is an intersection point). Then, one finds the maximum possible p for which f_u(q) is guaranteed to have slope larger than one at q = 1/2 (which is sufficient to ensure that f_u(q) is of the form shown at the left side of Figure 2.2).

The following facts can be easily verified:

1. f_u(1/2) = 1/2.

2. f_u(q) = 1 − f_u(1 − q) [i.e., f_u(q) is odd symmetric around the point (1/2, 1/2)].

3. df_u(q)/dq ≡ f_u'(q) ≥ 0 for 0 ≤ q ≤ 1/2 [i.e., f_u(q) is non-decreasing in the interval [0, 1/2]].

4. d²f_u(q)/dq² ≡ f_u''(q) ≥ 0 for 0 ≤ q ≤ 1/2 [i.e., the derivative of f_u(q) is non-decreasing in the interval [0, 1/2]].

The above establish that the function f_u(q) can only have one of the two forms shown in Figure 2.2. Clearly, if f_u(q) is of the type shown on the right, repeated applications of f_u(q) for any 0 ≤ q < 1/2 will converge to 1/2; if, however, f_u(q) is of the form shown on the left, then repeated applications of f_u(q) will converge to a value q^* that satisfies f_u(q^*) = q^* and p ≤ q^* < 1/2. This can be shown using arguments similar to the ones used in the proof of Theorem 2.1.

In order to distinguish between these two cases, one can calculate the derivative of f_u(q) with respect to q and find the values of p for which this derivative satisfies

f_u'(1/2) > 1 .
In other words, it is required that p is such that

f_u'(1/2) = (1 - 2p) \sum_{k=\frac{u+1}{2}}^{u} \binom{u}{k} \left( k q^{k-1}(1-q)^{u-k} - (u-k) q^k (1-q)^{u-k-1} \right) \Big|_{q=1/2}
          = (1 - 2p) \underbrace{\sum_{k=\frac{u+1}{2}}^{u} \binom{u}{k} \left( k \left(\tfrac{1}{2}\right)^{u-1} - (u-k)\left(\tfrac{1}{2}\right)^{u-1} \right)}_{C}

is strictly larger than one. One can explicitly solve for C to find that

C = u \binom{u-1}{\frac{u-1}{2}} \left(\frac{1}{2}\right)^{u-1} ,

so that p < 1/2 − 1/(2C) is equivalent to Eq. (2.4).

The rest of the argument follows the proof of Theorem 2.1. □
Note that the above line of reasoning can also be used to prove Theorem 2.1.
Having established Theorem 2.2, the argument in [Evans, 1994] follows the construction in [Hajek and Weller, 1991] to calculate arbitrary Boolean functions using unreliable u-input gates. Ignoring the other u − 3 inputs, one can use the 3-input XNAND gate exactly as defined in [Hajek and Weller, 1991]. From there on, the construction is the same as in [Hajek and Weller, 1991], except that u-input voters are now used to restore the different probabilities of error to an equal level.
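The threshold of Eq. (2.4) and the slope condition above can be cross-checked numerically; in the sketch below (names ours), u = 3 recovers the 1/6 bound of [Hajek and Weller, 1991], and the slope f_u'(1/2) equals exactly one at p = p_max.

```python
from math import comb

def p_max(u):
    """Right-hand side of Eq. (2.4): the noise threshold for u-input gates (u odd)."""
    return 0.5 - 2**(u - 2) / (u * comb(u - 1, (u - 1) // 2))

def slope_at_half(u, p):
    """f_u'(1/2) = (1 - 2p) C, with C = u C(u-1, (u-1)/2) (1/2)^(u-1)."""
    return (1 - 2 * p) * u * comb(u - 1, (u - 1) // 2) / 2**(u - 1)
```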
5
RELATED WORK AND FURTHER READING
Considerations regarding the size and depth of reliable circuits were not discussed in this chapter. Several researchers have worked on obtaining bounds for the complexity of such circuits [Dobrushin and Ortyukov, 1977a; Dobrushin and Ortyukov, 1977b; Pippenger, 1988; Pippenger et al., 1991; Evans and Schulman, 1993; Gacs and Gal, 1994; Evans and Schulman, 1999]. The authors of [Evans and Pippenger, 1998] considered the construction of reliable combinational systems from reliably unreliable 2-input NAND gates that fail with probability p. They proved that reliable computation using such gates is possible if and only if p < (3 − √7)/4.
References
Dobrushin, R. L. and Ortyukov, S. I. (1977a). Lower bound for the redundancy of self-correcting arrangements of unreliable functional elements. Problems of Information Transmission, 13(4):59-65.
Dobrushin, R. L. and Ortyukov, S. I. (1977b). Upper bound for the redundancy of self-correcting arrangements of unreliable functional elements. Problems of Information Transmission, 13(4):203-218.
Evans, W. (1994). Information Theory and Noisy Computation. PhD thesis,
EECS Department, University of California at Berkeley, Berkeley, California.
Evans, W. and Pippenger, N. (1998). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory, 44(3):1299-1305.
Evans, W. and Schulman, L. J. (1993). Signal propagation, with application to
a lower bound on the depth of noisy formulas. In Proceedings of the 34th
Annual Symp. on Foundations of Computer Science, pages 594-601.
Evans, W. and Schulman, L. J. (1999). Signal propagation and noisy circuits.
IEEE Transactions on Information Theory, 45(7):2367-2373.
Feder, T. (1989). Reliable computation by networks in the presence of noise.
IEEE Transactions on Information Theory, 35(3):569-571.
Gacs, P. and Gal, A. (1994). Lower bounds on the complexity of reliable
Boolean circuits with noisy gates. IEEE Transactions on Information Theory,
40(2):579-583.
Hajek, B. and Weller, T. (1991). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory,
37(2):388-391.
Pippenger, N. (1985). On networks of noisy gates. In Proceedings of the 26th
IEEE FOCS Symp., pages 30-38.
Pippenger, N. (1988). Reliable computation by formulas in the presence of
noise. IEEE Transactions on Information Theory, 34(2):194-197.
Pippenger, N. (1990). Developments in the synthesis of reliable organisms from
unreliable components. In Proceedings of Symposia in Pure Mathematics,
volume 50, pages 311-324.
Pippenger, N., Stamoulis, G. D., and Tsitsiklis, J. N. (1991). On a lower bound
for the redundancy of reliable networks with noisy gates. IEEE Transactions
on Information Theory, 37(3):639-643.
von Neumann, J. (1956). Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. Princeton University Press, Princeton, New Jersey.
Chapter 3
ALGORITHM-BASED FAULT TOLERANCE FOR
COMBINATIONAL SYSTEMS
1
INTRODUCTION
Modular redundancy schemes are attractive because they are universally applicable and can be implemented without having to develop explicit fault models. Their major drawback is that they can be prohibitively expensive due to the
overhead of replicating the hardware. Arithmetic coding and algorithm-based
fault tolerance (ABFT) schemes partially overcome this problem by offering
sufficient fault coverage while making more efficient use of redundancy. This
comes at the cost of narrower applicability and harder design. In fact, the main
task in arithmetic coding and ABFT schemes is the development of appropriate
fault models and the recognition of the structural features that make a particular
computation or algorithm amenable to efficient utilization of redundancy. A
variety of useful results and constructive procedures for systematically achieving this goal have been obtained for computations that take place in an abelian
group or in a semigroup. This chapter reviews work on arithmetic codes and
ABFT, and describes a systematic approach for protecting combinational systems whose functionality possesses certain algebraic structure.
Arithmetic codes are error-correcting codes with properties that remain invariant under an arithmetic operation of interest [Rao and Fujiwara, 1989]. They are typically used as shown in Figure 3.1. First, one adds "structured" redundancy into the representation of the data by using suitable encodings, denoted by the mappings φ1 and φ2 in the figure. The desired original computation r = x1 ∘ x2 is then replaced by the modified computation ⋄ on the encoded data (∘ and ⋄ denote binary operations). Under fault-free conditions, the modified operation ⋄ produces ρ = φ1(x1) ⋄ φ2(x2), which results in r when decoded through the decoding mapping σ [i.e., r = σ(ρ)]. Due to the possible presence of faults, the result of the redundant computation can be erroneous, ρf instead of ρ. The redundancy in ρf is used by the error corrector α to perform error detection and correction. The output ρ̂ of the error detector/corrector is decoded via the mapping σ. Under fault-free conditions in the detecting/correcting mechanism and with correctable faults, ρ̂ equals ρ, and the final result r̂ = σ(ρ̂) equals r.

Figure 3.1. Arithmetic coding scheme for protecting binary operations.
A common assumption in the model of Figure 3.1 (which closely follows the general model in Figure 1.2 of Chapter 1) is that the error detector/corrector is fault-free. This assumption is reasonable if the implementation of the detector/corrector is simpler than the implementation of the redundant computational unit. Another common assumption is that no fault takes place in the decoder unit. As discussed in Chapter 1, this latter assumption is in some sense inevitable: no matter how much redundancy is added, the output of the overall system will be erroneous if the device that is supposed to provide the output fails (i.e., if there is a fault in the very last stage of the combinational system).
Algorithm-based fault tolerance (ABFT) techniques involve more sophisticated coding schemes that deal with arrays of real/complex data in concurrent multiprocessor systems. They were introduced by Abraham and coworkers starting in 1984 [Huang and Abraham, 1984; Jou and Abraham, 1986; Jou and Abraham, 1988; Nair and Abraham, 1990] and aimed at protecting against a maximum number of pre-specified faults assuming fault-free error correction. The classic example of ABFT is the protection of n × n matrix multiplication on a 2-D systolic array and is discussed in more detail in Section 3. A variety of other computationally intensive algorithms, such as fast Fourier transform (FFT) computational networks [Jou and Abraham, 1988] and convolution [Beckmann and Musicus, 1993], have also been studied in the context of ABFT.
There are three critical steps involved in ABFT schemes:
(i) Encoding of the input data.
(ii) Reformulation of the original algorithm so that it operates on the encoded data.

(iii) Distribution of the computational tasks among the different subsystems of the overall system so that any faults occurring within these subsystems can be detected and, hopefully, corrected.
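A minimal version of the checksum scheme of [Huang and Abraham, 1984] illustrates all three steps; the sketch below uses helper names of our own choosing and assumes that the single error hits the n × n data part of the product.

```python
def column_checksum(A):
    """(n+1) x n version of A: an extra row of column sums is appended."""
    n = len(A)
    return [row[:] for row in A] + [[sum(A[i][j] for i in range(n)) for j in range(n)]]

def row_checksum(B):
    """n x (n+1) version of B: an extra column of row sums is appended."""
    return [row[:] + [sum(row)] for row in B]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

def locate_single_error(C):
    """Locate a single corrupted entry in the data part of the full-checksum
    product C (an (n+1) x (n+1) matrix), or return None if all checksums agree."""
    n = len(C) - 1
    bad_rows = [i for i in range(n) if sum(C[i][:n]) != C[i][n]]
    bad_cols = [j for j in range(n) if sum(C[i][j] for i in range(n)) != C[n][j]]
    if bad_rows and bad_cols:
        return bad_rows[0], bad_cols[0]
    return None

# Multiplying the column-checksum version of A by the row-checksum version of B
# yields the product A*B together with its own checksum row and checksum column.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(column_checksum(A), row_checksum(B))

C[0][1] += 5                      # inject a single processor error
where = locate_single_error(C)    # inconsistent row and column pinpoint it
C[0][1] = C[0][2] - C[0][0]       # correct the entry from its row checksum
```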
The most important challenge in both arithmetic coding and ABFT implementations is the recognition of structure in an algorithm/architecture that is amenable to the introduction of redundancy. A step towards providing a systematic approach for the recognition and exploitation of such special structure was developed for computations that occur in a group or in a semigroup in [Beckmann, 1992; Beckmann and Musicus, 1992; Hadjicostis, 1995; Hadjicostis and Verghese, 1995]. The key observation is that the desired "structured" redundancy can be introduced by a homomorphic embedding into a larger algebraic structure (group or semigroup). These techniques are described in more detail in Section 4; the exposition is self-contained and requires minimal knowledge of group and semigroup theory.
2
ARITHMETIC CODES
While universally applicable and simple to implement, modular redundancy is inherently expensive and inefficient. For example, a TMR implementation triplicates the original system in order to detect and correct a single fault. Arithmetic codes, although more limited in applicability and possibly harder to design and implement, offer a resource-efficient alternative when dealing with the protection of simple operations on integer data, such as addition and multiplication. They can be thought of as a class of error-correcting codes whose properties remain invariant under the operation that needs to be made fault-tolerant. An arithmetic coding scheme follows the model of Figure 3.1: in order to protect the computation of r = x1 ∘ x2, the following four steps are taken.

• Encoding: Redundancy is added to the representation of the data by using suitable encoding mappings:

h1 = φ1(x1) ,
h2 = φ2(x2) .
• Redundant Computation: The operation on the encoded data may be different from the desired operation on the original data. In Figure 3.1, this modified operation is denoted by ⋄ and under fault-free conditions outputs

ρ = φ1(x1) ⋄ φ2(x2) .

When one or more faults take place, the redundant computation on the encoded data outputs an erroneous result ρf which, in general, is a function of the encoded data and the errors that took place, i.e.,

ρf = ρf(φ1(x1), φ2(x2), e) ,

where e denotes the error.
• Error Detection and Correction: If enough redundancy exists in the encoding of the data, one may be able to detect, identify and correct the errors by analyzing their effect on the encoded result ρf. In Figure 3.1 this is done by the error-correcting mapping α, which maps the corrupted result ρf to ρ̂, i.e.,

ρ̂ = α(ρf) .

Note that the error corrector has no access to the original operands; this ensures that the desired calculation does not take place in a part of the redundant construction that is assumed to be fault-free.

• Decoding: The final result r̂ is obtained by decoding ρ̂ using the mapping σ:

r̂ = σ(ρ̂) .

Under fault-free conditions or under correctable faults, the final result r̂ equals the result of the operation on the original data (r = x1 ∘ x2).
EXAMPLE 3.1 Figure 3.2 shows an arithmetic coding scheme for protecting integer addition. Encoding involves multiplication of the operands x_1 and x_2 by a factor of 10. The redundant operation on the encoded data is integer addition (same as the original operation) and error detection involves division by 10. An error is detected if the corrupted result is not divisible by 10. Note that faults under which the result remains a multiple of 10 are undetectable. (Error correction is impossible unless a more detailed fault model is available.) This specific example is an instance of aN coding [Rao and Fujiwara, 1989] with a = 10. Under certain conditions and certain choices of a, aN coding can be used to correct a single error. Note that in the case of aN codes, redundancy is added into the combinational system by increasing the dynamic range of the system (by a factor of a).
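The aN scheme of Example 3.1 can be sketched in a few lines. The snippet below is our own illustration (the function names and the additive fault parameter are not from the text): operands are multiplied by a = 10, added, and the sum is checked for divisibility by 10.

```python
def encode(x, a=10):
    """aN encoding: multiply the operand by the check factor a."""
    return a * x

def add_and_check(x1, x2, a=10, fault=0):
    """Add the encoded operands; 'fault' models an additive error."""
    pf = encode(x1, a) + encode(x2, a) + fault
    if pf % a != 0:          # corrupted result is no longer a multiple of a
        raise ValueError("error detected")
    return pf // a           # decode by dividing out a

assert add_and_check(3, 4) == 7      # fault-free: decodes to 3 + 4
try:
    add_and_check(3, 4, fault=1)     # an error of 1 is caught
except ValueError:
    pass
# An error that is itself a multiple of 10 goes undetected:
assert add_and_check(3, 4, fault=20) == 9
```

The last line mirrors the remark in the example: faults whose effect is a multiple of a keep the result inside the code and escape detection.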
Arithmetic coding schemes need to provide sufficient protection while keeping the associated encoding, error correction and decoding operations simple.
If the latter operations are complicated, then, the code is computationally expensive and impractical (an extreme example would be a code whose encoding/decoding is three times more complicated than the actual computation one
desires to protect; in such a case, it would be more convenient to use TMR).
Figure 3.2. aN arithmetic coding scheme for protecting integer addition.

3 ALGORITHM-BASED FAULT TOLERANCE
Arithmetic codes do not always have the simple structure of Example 3.1. More advanced and more complicated schemes that protect real or complex numbers and involve entire sequences of data are referred to as algorithm-based fault tolerance (ABFT) and usually deal with multiprocessor systems. The term was introduced by J. Abraham and coworkers in 1984 [Huang and Abraham, 1984]. Since then, a variety of signal processing and other computationally intensive algorithms have been adapted to the ABFT framework [Jou and Abraham, 1986; Abraham, 1986; Chen and Abraham, 1986; Abraham et al., 1987; Jou and Abraham, 1988; Nair and Abraham, 1990].
The classic example of ABFT involves the protection of n × n matrix multiplication on an n × n multiprocessor grid [Huang and Abraham, 1984]. The ABFT scheme detects and corrects any single processor fault using an extra checksum row and an extra checksum column. The resulting fault-tolerance scheme requires an (n + 1) × (n + 1) multiprocessor grid on which it performs multiplication of an (n + 1) × n matrix with an n × (n + 1) matrix. The hardware overhead is minimal compared to the naive use of TMR, which offers similar fault protection but triplicates the system. The execution time for the algorithm is slowed down by a negligible amount: it now takes 3n steps instead of 3n − 1.
Figure 3.3 is an illustration of the above ABFT method for the case when n = 3. The top of the figure shows the unprotected computation of the product of two 3 × 3 square matrices A and B on a 3 × 3 multiprocessor grid. The data enters the multiprocessor system as indicated by the arrows in the figure (a "·" indicates that no data is received). Element a_ij corresponds to the element in the ith row, jth column of matrix A = [a_ij], whereas b_ij is the corresponding element of matrix B = [b_ij]. At time step t, each processor executes the following three steps:
1. Processor P_ij (the processor on the ith row, jth column of the multiprocessor grid) receives two pieces of data, one from the processor on the left (namely, P_i(j−1)) and one from the processor at the top (namely, P_(i−1)j). From the processor on the left it gets b_(t−(j+i−1))i, whereas from the processor at the top it gets a_j(t−(j+i−1)). If t − (j + i − 1) is negative, no data is received.

2. Processor P_ij multiplies the data it receives and adds the result to an accumulative sum s_ij stored in its memory. Note that s_ij is initialized to zero. If no data has been received, nothing is done at this step.

3. Processor P_ij passes the data received from the processor on its left to the processor on its right, and the data received from the top processor to the processor below.
It is not hard to see that after 3n − 1 steps, the value of s_ji is

s_ji = Σ_{t=0}^{3n−1} a_i(t−(j+i−1)) × b_(t−(j+i−1))j = c_ij ,

where a_kl, b_kl are taken to be zero for k, l < 0 or k, l > n, and c_ij is the element in the ith row, jth column of matrix C = A × B. Therefore, after 3n − 1 steps, processor P_ji contains the value c_ij. A more detailed description of the algorithm can be found in [Leighton, 1992].
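The closed-form sum above can be checked numerically without simulating the grid. The sketch below is our own verification code: it uses the text's 1-based indices, treats entries outside 1..n as zero, and confirms that the accumulated sum equals the corresponding entry of C = A × B.

```python
n = 3
A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]

def a(i, k):  # a_ik, taken as zero outside 1..n
    return A[i-1][k-1] if 1 <= i <= n and 1 <= k <= n else 0

def b(k, j):  # b_kj, taken as zero outside 1..n
    return B[k-1][j-1] if 1 <= k <= n and 1 <= j <= n else 0

# s_ji accumulated over steps t = 0, ..., 3n-1, following the formula above
for i in range(1, n + 1):
    for j in range(1, n + 1):
        s = sum(a(i, t - (j + i - 1)) * b(t - (j + i - 1), j)
                for t in range(0, 3 * n))
        c_ij = sum(A[i-1][k] * B[k][j-1] for k in range(n))
        assert s == c_ij
```

Every out-of-range product contributes zero, so the sum collapses to the inner product of row i of A with column j of B, exactly as claimed.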
Protected computation is illustrated at the bottom of Figure 3.3. It uses a (3 + 1) × (3 + 1) multiprocessor grid. Matrices A and B are encoded into two new matrices, A′ = [a′_ij] and B′ = [b′_ij] respectively, in the following fashion:
• The 4 × 3 matrix A′ is formed by adding a row of column sums to matrix A. More specifically, a′_ij = a_ij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 3 and

a′_4j = Σ_{i=1}^{3} a_ij ,  j = 1, 2, 3.
• The 3 × 4 matrix B′ is formed by adding a column of row sums to matrix B. More specifically, b′_ij = b_ij for 1 ≤ i ≤ 3, 1 ≤ j ≤ 3 and

b′_i4 = Σ_{j=1}^{3} b_ij ,  i = 1, 2, 3.
Figure 3.3. ABFT scheme for protecting matrix multiplication: unprotected computation on a 3 × 3 processor array (top); protected computation on a 4 × 4 processor array (bottom).

The redundant computation is executed in the usual way on a 4 × 4 multiprocessor grid. The resulting matrix C′ = A′ × B′ is a 4 × 4 matrix. Under fault-free conditions, the matrix C = A × B (i.e., the result of the original computation) is given by the submatrix C′(1:3, 1:3), i.e., the 3 × 3 submatrix that consists of the top three rows and the leftmost three columns of matrix C′. Moreover, the bottom row and the rightmost column of C′ consist of column and row checksums respectively. In other words,
c′_4j = Σ_{i=1}^{3} c′_ij ,  j = 1, 2, 3, 4 ,

c′_i4 = Σ_{j=1}^{3} c′_ij ,  i = 1, 2, 3, 4 .
If one of the processors malfunctions, one can use the row and column checksums to pinpoint the location of the error and then correct it. More specifically, the following are true:
• If the ith row checksum (1 ≤ i ≤ 4) and the jth column checksum (1 ≤ j ≤ 4) of matrix C′ are not satisfied, then, there was a fault in processor P_ji.

• If only the ith row checksum (1 ≤ i ≤ 4) is not satisfied, then, the calculation of the ith row checksum was erroneous (the hardware that performs this calculation is not shown in Figure 3.3).

• If only the jth column checksum (1 ≤ j ≤ 4) is not satisfied, then, the calculation of the jth column checksum was erroneous (the hardware that performs this calculation is not shown in Figure 3.3).
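These rules can be exercised end to end in a small sketch. The code below is our own illustration (function names are ours, not from the text): it appends the checksum row and column, corrupts one entry of C′ to model a single processor fault, then uses the failing row and column checksums to locate and repair it.

```python
def encode_A(A):           # append a row of column sums (the matrix A')
    return A + [[sum(col) for col in zip(*A)]]

def encode_B(B):           # append a column of row sums (the matrix B')
    return [row + [sum(row)] for row in B]

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def correct(C):            # single-error location/correction in C' (4x4)
    bad_rows = [i for i, row in enumerate(C[:-1]) if sum(row[:-1]) != row[-1]]
    bad_cols = [j for j in range(len(C[0]) - 1)
                if sum(C[i][j] for i in range(len(C) - 1)) != C[-1][j]]
    if bad_rows and bad_cols:              # a data entry (i, j) is wrong:
        i, j = bad_rows[0], bad_cols[0]    # repair it from its row checksum
        C[i][j] = C[i][-1] - sum(C[i][k]
                                 for k in range(len(C[0]) - 1) if k != j)
    return C

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[2, 0, 1], [1, 3, 0], [0, 1, 2]]
C = matmul(encode_A(A), encode_B(B))
truth = [row[:] for row in C]
C[1][2] += 5                               # single processor fault
assert correct(C)[1][2] == truth[1][2]     # located and repaired
```

Note that this sketch multiplies the matrices directly rather than simulating the processor grid; the checksum logic is the point being illustrated.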
The basic assumptions in the above analysis are that the propagation of the data (a_ij and b_ij) in the system is flawless and that there is at most one fault in the system. Note that data propagation errors would have been caught by a TMR system. Moreover, TMR would catch multiple faults as long as all of them were confined within one of the three replicas of the multiprocessor system.
The above example, however, illustrates the numerous potential advantages of ABFT over naive modular redundancy methods. By exploiting the structural features of parallel matrix multiplication, this scheme achieves fault protection at a much lower cost. Other examples of efficient ABFT techniques have been developed for signal processing applications [Huang and Abraham, 1984; Jou and Abraham, 1986], systems for computing the fast Fourier transform (FFT) [Nair and Abraham, 1990], analog to digital conversion [Beckmann and Musicus, 1991], digital convolution [Beckmann and Musicus, 1993], fault-tolerant sorting networks [Choi and Malek, 1988; Liang and Kuo, 1990; Sun et al., 1994] and linear operators [Sung and Redinbo, 1996].
4 GENERALIZATIONS OF ARITHMETIC CODING TO OPERATIONS WITH ALGEBRAIC STRUCTURE
Traditionally, arithmetic coding and ABFT schemes have focused on the development of resource-efficient designs for providing fault tolerance to a specific
computational task under a given hardware implementation. The identification
of algorithmic or computational structure that could be exploited to provide
efficient fault coverage to an arbitrary computational task has been more of an
art than an engineering discipline. A step in solving this problem was taken
by Beckmann in [Beckmann, 1992]. Beckmann showed that by concentrating on computational tasks that can be modeled as abelian group operations,
one can impose sufficient structure upon the computation to allow for accurate
characterization of the possible arithmetic codes and the form of redundancy
that is needed. The approach in [Beckmann, 1992] encompasses a number of
previously developed arithmetic codes and ABFT techniques, and also extends
to algebraic structures with an underlying abelian group structure, such as rings,
fields, modules and vector spaces. The results in [Beckmann, 1992] were generalized in [Hadjicostis, 1995] to include operations with abelian semigroup
structure.
This section presents an overview of these results in a way that minimizes the need for background knowledge in group [Herstein, 1975] or semigroup theory [Ljapin, 1974; Lallement, 1979; Higgins, 1992; Grillet, 1995]. The exposition also avoids making an explicit connection to actual hardware implementations or fault models.
4.1 FAULT TOLERANCE FOR ABELIAN GROUP OPERATIONS
A computational task has an underlying group structure if the computation
takes place in a set of elements that form a group.
DEFINITION 3.1 A non-empty set of elements G forms a group (G, ∘) if on G there is a defined binary operation, called the product and denoted by ∘, such that

1. a, b ∈ G implies a ∘ b ∈ G (closure).

2. a, b, c ∈ G implies that a ∘ (b ∘ c) = (a ∘ b) ∘ c (associativity).

3. There exists an element i_0 ∈ G such that a ∘ i_0 = i_0 ∘ a = a for all a ∈ G (i_0 is called the identity element).

4. For every a ∈ G there exists an element a⁻¹ ∈ G such that a ∘ a⁻¹ = a⁻¹ ∘ a = i_0 (the element a⁻¹ is called the inverse of a).
If the group operation ∘ of G is commutative (i.e., for all a, b ∈ G, a ∘ b = b ∘ a), then, G is called an abelian group. The order in which a series of abelian group products is evaluated does not matter because of associativity and commutativity, i.e.,

g_i1 ∘ g_i2 ∘ ... ∘ g_iu = g_1 ∘ g_2 ∘ ... ∘ g_u ,

where g_1, g_2, ..., g_u ∈ G and {i_1, i_2, ..., i_u} is any permutation of {1, 2, ..., u}.
EXAMPLE 3.2 A simple example of an abelian group is the set of integers under addition, denoted by (Z, +). The properties mentioned above can be verified easily. Specifically, the identity element is 0 and the inverse of integer n is the integer −n.

Another example of an abelian group is the set of nonzero rational numbers under multiplication, denoted by (Q − {0}, ×). The identity element in this case is 1 and the inverse of the rational number q = n/d (where n, d are nonzero integers) is the rational number q⁻¹ = d/n.
Suppose that the computation of interest can be modeled as an abelian group operation ∘ with operands {g_1, g_2, ..., g_u}, so that the desired result r is given by

r = g_1 ∘ g_2 ∘ ... ∘ g_u .

Beckmann provides fault tolerance to this group product using the scheme of Figure 3.4 (which is essentially a generalization of Figure 3.1). The encoding, error detecting/correcting and decoding mechanisms are assumed to be fault-free. The redundant computational unit operates on the encoded data via a redundant abelian group operation ◇. Under fault-free conditions, the result of this redundant computation is given by

p = φ_1(g_1) ◇ φ_2(g_2) ◇ ... ◇ φ_u(g_u)    (3.1)

and can be decoded to the original result r via the decoding mapping σ, i.e.,

σ(p) = r .
Due to faults, the output of the redundant computational unit may be erroneous. In [Beckmann, 1992], the effect of a fault f_i is modeled in an additive fashion, i.e., the possibly erroneous output p_f_i is written as

p_f_i = p ◇ e_i ,

where p is the error-free result in Eq. (3.1) and e_i is a suitably chosen operand that captures the effect of fault f_i. Note that, since ◇ is a group operation, the existence of an operand that models the effect of fault f_i in this additive fashion is guaranteed (because one can always choose e_i = p⁻¹ ◇ p_f_i). When the effects
ABFT for Combinational Systems
(j-
-
Error
{gc: §- gc:
Detectorl
§ 0 ctS f - - ' I
"0 ;r. Q)
Corrector
CD '-'
0:
,
,
a.
0
43
I_---<..~
Deccroder
ex
1
I ............................... I
Faults
Figure 3.4.
Fault-tolerant computation of a group operation.
of faults are operand-dependent, however, it may be necessary to use multiple operands to capture the effect of a single fault.
In [Beckmann, 1992], it is assumed that the effect of multiple faults can be captured by the superposition of individual additive error effects, i.e., the possibly corrupted result p_f can be written as

p_f = p ◇ e_i1 ◇ e_i2 ◇ ... ◇ e_iλ = p ◇ e .

If no errors took place, e is the identity element. Given a pre-specified set of faults, one can in principle generate the corresponding set of error operands E = {i_0, e_1, e_2, ...} (the identity element is included for notational simplicity). To detect/correct up to λ such errors, one would need to be able to detect/correct each error e in the set E^(λ) = {e_i1 ◇ e_i2 ◇ ... ◇ e_iλ | e_i1, e_i2, ..., e_iλ ∈ E} (if e = i_0, then, no error has taken place).
The underlying assumptions of the additive error model are that errors are independent of the operands and that their effect on the overall result is independent of the stage the computation is in. The latter assumption is realistic because the additive error model is used with associative and abelian operations, so that the order of evaluating the different products is irrelevant. The term "additive" makes more sense if the group operation ◇ is addition.
Figure 3.5. Fault tolerance using an abelian group homomorphism.

4.1.1 USE OF GROUP HOMOMORPHISMS
Under the formulation described in the previous section, the computation

r = g_1 ∘ g_2 ∘ ... ∘ g_u

in the abelian group (G, ∘) is essentially mapped into the computation

p = φ_1(g_1) ◇ φ_2(g_2) ◇ ... ◇ φ_u(g_u)    (3.2)

in a larger abelian group (H, ◇). The subset of valid results H_v ⊂ H is the set of results obtained under error-free computation, i.e.,

H_v = {φ_1(g_1) ◇ φ_2(g_2) ◇ ... ◇ φ_u(g_u) | g_1, g_2, ..., g_u ∈ G} .

If the decoding mapping σ is one-to-one, then, it is shown in [Beckmann, 1992] that all encoders {φ_i} have to be the same and have to satisfy

φ_i = σ⁻¹ ≡ φ ,  i = 1, 2, ..., u

(notice that σ⁻¹ is well-defined). Eq. (3.2) can then be written as

σ⁻¹(g_1 ∘ g_2 ∘ ... ∘ g_u) = φ(g_1) ◇ φ(g_2) ◇ ... ◇ φ(g_u)

and reduces to

φ(g_1 ∘ g_2) = φ(g_1) ◇ φ(g_2) ,

which is the defining property of a group homomorphism [Herstein, 1975].
Therefore, there is a close relationship between group homomorphisms and arithmetic coding schemes for group operations. Figure 3.5 describes how fault tolerance is achieved: the group homomorphism φ adds redundancy to the computation by mapping the abelian group G, in which the original operation takes place, into a larger (redundant) group H. The subset of valid results, defined earlier as H_v, turns out to be isomorphic to G (i.e., H_v = σ⁻¹(G), where σ⁻¹ is one-to-one) and, for this reason, it is also denoted by G′. Any error e is detected as long as it forces the result into an element p_f that is not in H_v. If enough redundancy exists in H, the error might be correctable; in such a case, the corrected result equals p and decodes to r.
4.1.2 ERROR DETECTION AND CORRECTION
An error e_d ∈ E^(λ) ⊂ H, e_d ≠ i_0, is detectable if every possible valid result g′ ∈ G′ ⊂ H becomes invalid when corrupted by e_d, i.e.,

{g′ ◇ e_d | g′ ∈ G′} ∩ G′ = ∅ .

Using the set notation G′ ◇ e_d ≡ {g′ ◇ e_d | g′ ∈ G′}, the above equation can be re-written as

(G′ ◇ e_d) ∩ G′ = ∅ .    (3.3)

Similarly, an error e_c ∈ E^(λ), e_c ≠ i_0, is correctable (and a fortiori detectable) if it satisfies

(G′ ◇ e_c) ∩ (G′ ◇ e) = ∅  for all e ∈ E^(λ), e ≠ e_c .    (3.4)

Note that since e can also be i_0 (i_0 ∈ E^(λ)), the condition in Eq. (3.3) is a special case of Eq. (3.4).
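Conditions (3.3) and (3.4) can be checked mechanically in any small finite group. The sketch below is our own construction (not an example from the text): it takes H = Z_12 under addition modulo 12, the subgroup of valid results G′ = {0, 4, 8}, a small error set E, and tests both conditions by direct set computation.

```python
H = set(range(12))                      # H = Z_12 under addition mod 12
Gp = {0, 4, 8}                          # G': subgroup of valid results
E = {0, 1, 2}                           # error operands (0 is the identity)

def coset(e):                           # the coset G' ◇ e
    return {(g + e) % 12 for g in Gp}

def detectable(e):                      # Eq. (3.3): (G' ◇ e) ∩ G' = ∅
    return not (coset(e) & Gp)

def correctable(ec):                    # Eq. (3.4): coset disjoint from all others
    return all(not (coset(ec) & coset(e)) for e in E if e != ec)

assert detectable(1) and detectable(2)
assert correctable(1) and correctable(2)   # cosets {1,5,9} and {2,6,10} differ
assert not detectable(0)                   # the identity error stays valid
```

Here errors 1 and 2 land in distinct cosets of G′, so each can be both detected and corrected, matching the coset picture developed below.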
A well-known result from group theory states that sets of the form G′ ◇ e (respectively, e ◇ G′) for any e ∈ H are either identical or have no elements in common. In other words, they form an equivalence class decomposition (partitioning) of H into subsets, known as right (left) cosets. When the collection of right and left cosets under a particular subgroup G′ is the same,¹ this collection of cosets forms a group, denoted by H/G′ and called the quotient group of H under G′. Its group operation ⊛ is defined as

A ⊛ B = {h_1 ◇ h_2 | h_1 ∈ A, h_2 ∈ B} .
Figure 3.6. Coset-based error detection and correction.
Since two cosets are either identical or have no elements in common, Eqs. (3.3) and (3.4) can be written as

(G′ ◇ e_d) ≠ G′ ,  for e_d ≠ i_0 ,

(G′ ◇ e_c) ≠ (G′ ◇ e) ,  for e ∈ E^(λ), e ≠ e_c .
Error detection and correction proceed as shown in Figure 3.6: any error that takes the result out of the subgroup of valid results G′ is detected (in the figure, this is the case for errors e_1, e_2, e_3 because they force the result outside G′). Furthermore, if enough redundancy exists in H, some errors can be corrected. For example, error e_1 in the figure results in h_1 and is correctable because the coset G′ ◇ e_1 is not shared with any other error. Therefore, once one realizes that h_1 lies in the coset G′ ◇ e_1 (the coset of e_1), one can get the error-free result p ∈ G′ by performing the operation h_1 ◇ e_1⁻¹. If h_i lies in a coset shared by more than one error (which is the case for h_2 and h_3), the corresponding errors
are detectable but not correctable. Errors that let the result stay within G′, such as e_4, are not detectable.

To summarize, a correctable error forces the result into a nonzero coset (i.e., a coset other than G′) that is uniquely associated with this particular error. For an error to be detectable, it only has to force the result into a nonzero coset.
4.1.3 SEPARATE GROUP CODES
Separate codes are arithmetic codes in which redundancy is added in a separate "parity" channel [Rao and Fujiwara, 1989]. Error detection and correction
are performed by appropriately comparing the result of the computation channel, which performs the original operation, with the result of the parity channel,
which performs a simpler computation. No interaction between the computation and the parity channels is allowed.
EXAMPLE
3.3 A separate code for protecting integer addition is shown in
Figure 3. 7. The computation channel performs the original operation, whereas
the parity channel performs addition modulo-4. The error detector compares
the results of the two channels, 9 and p respectively. If they agree (modulo-4),
then, the result of the computation channel is accepted as error-free; if they
do not agree, then, an error is detected. The figure also illustrates one of the
important advantages of separate codes over non-separate codes (such as the
a.N code of Figure 3.2): if the result is known to be error-free, then, the output
is available without the need for any further decoding.
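The separate code of Example 3.3 amounts to running the addition twice, once in full precision and once modulo 4, and comparing. A minimal sketch (our own code; the function name and fault parameter are illustrative):

```python
def separate_add(g1, g2, fault=0):
    g = g1 + g2 + fault          # computation channel ('fault' injected here)
    p = (g1 % 4 + g2 % 4) % 4    # parity channel: addition modulo 4
    if g % 4 != p:               # error detector compares the two channels
        raise ValueError("error detected")
    return g                     # no decoding needed: the result is g itself

assert separate_add(13, 29) == 42
try:
    separate_add(13, 29, fault=3)    # channels disagree modulo 4
except ValueError:
    pass
# Faults that preserve the value modulo 4 escape detection:
assert separate_add(13, 29, fault=4) == 46
```

The last assertion shows the limit of this code, analogous to the undetectable multiples of 10 in Example 3.1: any error congruent to 0 modulo 4 passes the check.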
In the case of separate codes, the model of Figure 3.4 reduces to the model shown in Figure 3.8. For simplicity, only two operands are shown, but the discussion applies for the general case of u operands. The group homomorphism φ maps the computation in the group G to a redundant group H which is the cartesian product of G and a parity set P, i.e.,

H = G × P .
The homomorphic mapping satisfies φ(g) = [g, θ(g)], where θ : G ↦ P is the mapping that creates the parity information from operand g (refer to Figure 3.8). The set of valid results H_v is the set of elements of the form {[g, θ(g)] | g ∈ G}. It can be shown that (P, ⊙) is a group and that θ is a group homomorphism from G to P [Beckmann, 1992].
In order to make efficient use of redundancy (i.e., efficient use of parity symbols in group P), a reasonable requirement would be for θ to be onto. In such a case, the problem of finding suitable separate codes reduces to the problem of finding suitable epimorphisms θ from G onto P. A theorem from group theory states that there is a one-to-one correspondence between epimorphisms θ from the abelian group G onto P and subgroups N of G [Herstein, 1975]. In fact,
Figure 3.7. Separate arithmetic coding scheme for protecting integer addition.

Figure 3.8. Separate coding scheme for protecting a group operation.
the quotient group G/N, constructed from G using N as a subgroup, provides an isomorphic image of P. Therefore, by finding all possible subgroups of G, one can find all possible epimorphisms θ from G onto P (and hence all possible parity check codes).
Finding the subgroups of a group is not a trivial task but it is relatively easy
for several group operations of interest. By finding all subgroups of a given
group, one is guaranteed to obtain all separate arithmetic codes that can be
used to provide fault tolerance to the corresponding group computation. Thus,
this systematic procedure results in a complete characterization of the separate
codes for a given abelian group. The result is a generalization of one proved by
Peterson for the special case of integer addition and multiplication [Peterson
and Weldon Jr., 1972; Rao, 1974].
4.2 FAULT TOLERANCE FOR SEMIGROUP OPERATIONS
DEFINITION 3.2 A non-empty set of elements S forms a semigroup (S, ∘) if on S there is a defined binary operation ∘, such that

1. a, b ∈ S implies a ∘ b ∈ S (closure).

2. a, b, c ∈ S implies that a ∘ (b ∘ c) = (a ∘ b) ∘ c (associativity).
A semigroup is called a monoid when it possesses an identity element, i.e., a unique element i_0 that satisfies

a ∘ i_0 = i_0 ∘ a = a  for all a ∈ S.
One can focus on monoids without any loss of generality because an identity element can always be adjoined to a semigroup that does not initially possess one [Ljapin, 1974; Lidl and Pilz, 1985]. Therefore, the words "semigroup" and "monoid" can be used interchangeably unless a distinction needs to be made about the identity element.
EXAMPLE 3.4 Every group is a monoid; familiar examples of monoids that are not groups are the set of positive integers under the operation of multiplication [denoted by (N, ×)], the set of nonnegative integers under addition [denoted by (N_0, +)], and the set of polynomials with real coefficients under the operation of polynomial multiplication.

All of the above examples are abelian monoids, i.e., monoids in which the operation ∘ is commutative (for all a, b ∈ S, a ∘ b = b ∘ a). Examples of non-abelian monoids are the set of polynomials under polynomial substitution, and the set of n × n matrices under matrix multiplication.
4.2.1 USE OF SEMIGROUP HOMOMORPHISMS
The approach in [Hadjicostis, 1995] uses the model in Figure 3.1 to protect a computation in a semigroup (monoid) (S, ∘). To introduce the redundancy needed for fault tolerance, the computation r = s_1 ∘ s_2 in (S, ∘) is mapped into an encoded computation p = φ_1(s_1) ◇ φ_2(s_2) in a larger semigroup (monoid) (H, ◇). After performing the redundant computation φ_1(s_1) ◇ φ_2(s_2) in H, a possibly erroneous result p_f is obtained, which is assumed to still lie in H. Error correction is performed through the mapping α that outputs p = α(p_f); decoding is performed via a one-to-one mapping σ : H_v ↦ S, where H_v = {φ_1(s_1) ◇ φ_2(s_2) | s_1, s_2 ∈ S} is the subset of valid results in H. Under fault-free conditions in the error corrector and under correctable faults, α(p_f) = p and σ(p) = r.
Clearly, the decoding mapping σ needs to satisfy

σ(φ_1(s_1) ◇ φ_2(s_2)) = s_1 ∘ s_2

for all s_1, s_2 ∈ S; since σ is assumed to be one-to-one, the inverse mapping σ⁻¹ : S ↦ H_v is well-defined and satisfies

σ⁻¹(s_1 ∘ s_2) = φ_1(s_1) ◇ φ_2(s_2) .

If one assumes further that both φ_1 and φ_2 map the identity of S to the identity of H, then (by setting s_2 = i_0 or s_1 = i_0), one concludes that σ⁻¹(s_1) = φ_1(s_1) for all s_1 ∈ S and σ⁻¹(s_2) = φ_2(s_2) for all s_2 ∈ S. Therefore, σ⁻¹ = φ_1 = φ_2 ≡ φ and

1. φ(s_1 ∘ s_2) = φ(s_1) ◇ φ(s_2).

2. φ(i_0) = i_0′, where i_0 and i_0′ denote the identities of S and H respectively.

Condition (1) is the defining property of a semigroup homomorphism [Ljapin, 1974; Lidl and Pilz, 1985]. A monoid homomorphism is additionally required to satisfy condition (2) [Jacobson, 1974; Grillet, 1995]. Thus, the mapping φ is an injective monoid homomorphism.
The generalization of the framework of [Beckmann, 1992] to semigroups
allows the study of fault tolerance in non-abelian computations for which inverses might not exist. These include a number of combinational and nonlinear
signal processing applications, such as max/median filtering and minimax operations in sorting. This generalization, however, comes at a cost: error detection
and correction can no longer be based on coset constructions. The problem is
two-fold:
• In a semigroup setting one may be unable to model the possibly erroneous result p_f as

p_f = p ◇ e

for some element e in H (because inverses do not necessarily exist in H and because the semigroup may be non-abelian).

• Unlike the subgroup of valid results, the subsemigroup of valid results H_v = φ(S) does not necessarily induce a partitioning of semigroup H (for instance, it is possible that the set H_v ◇ h is a subset of H_v for all h ∈ H).
4.2.2 ERROR DETECTION AND CORRECTION
To derive necessary and sufficient conditions for error detection and correction within the semigroup framework, one needs to resort to set-based arguments. For simplicity, the erroneous result due to one or more faults from a finite set F = {f_1, f_2, f_3, ...} is assumed to only depend on the error-free result (denoted earlier by p). As argued earlier, there is no loss of generality in making this assumption because the effects of a single fault that produces different erroneous results depending on the operands involved can be modeled through the use of multiple f_i, each of which captures the effect of the fault for a particular pair of operands. Of course, the disadvantage of such an approach is that the set of possible faults F is enlarged (and may become unmanageable).
The erroneous result reached under the occurrence of a single fault f_i ∈ F is given by p_f_i = e(p, f_i), where p is the error-free result, f_i is the fault that occurred and e is an appropriate mapping. The fault model for multiple faults can be defined similarly: the effect of k faults (f_1, f_2, ..., f_k) (where f_j ∈ F for 1 ≤ j ≤ k) can be captured by the mapping e^(k)(p, f^(k)), where multiple faults are denoted by f^(k) ∈ F^(k) = {(f_1, f_2, ..., f_k) | f_j ∈ F, 1 ≤ j ≤ k}.
For full single-fault detection, the computation in the redundant semigroup H needs to meet the following condition:

e(p_1, f_i) ≠ p_2  for all f_i ∈ F and all p_1, p_2 ∈ H_v such that p_1 ≠ p_2 .
In other words, a fault is detected whenever the result p_f lies outside H_v. For single-fault correction, the following additional condition is needed:

e(p_1, F) ∩ e(p_2, F) = ∅  for all p_1, p_2 ∈ H_v such that p_1 ≠ p_2 ,

where e(p, F) = {e(p, f_i) | f_i ∈ F}. The above condition essentially establishes that no two different results p_1 and p_2 can be mapped, perhaps by different faults, to the same erroneous result. The error can be corrected by identifying the unique set e(p_k, F) in which the erroneous result p_f lies; p_k would then be the correct result.
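For a finite semigroup these set-based conditions can be checked exhaustively. The toy sketch below is entirely our own construction (not from the text): it protects a computation over {0, ..., 7} by simple duplication, so H_v is the diagonal {(s, s)}, and models each fault in F as a single bit-flip of the first copy. Both the single-fault detection condition and the disjointness condition for correction are then verified directly.

```python
S = range(8)                              # underlying value set {0,...,7}
Hv = {(s, s) for s in S}                  # valid results under duplication
F = [1, 2, 4]                             # faults: flip one bit of copy 1

def e(p, f):                              # effect of fault f on result p
    x, y = p
    return (x ^ f, y)

# Single-fault detection: no fault maps one valid result onto another
assert all(e(p1, f) != p2
           for f in F for p1 in Hv for p2 in Hv if p1 != p2)

def err_set(p):                           # the set e(p, F)
    return {e(p, f) for f in F}

# Single-fault correction: erroneous-result sets of distinct p's are disjoint
assert all(not (err_set(p1) & err_set(p2))
           for p1 in Hv for p2 in Hv if p1 != p2)
```

Because the second (untouched) copy pins down the error-free value, every corrupted result lands in exactly one set e(p_k, F), which is precisely the correction criterion above.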
These conditions can be generalized for fully detecting up to d faults and correcting up to c faults (c ≤ d):

e^(k)(p_1, F^(k)) ∩ (H_v − {p_1}) = ∅ ,  for all p_1 ∈ H_v and for 1 ≤ k ≤ d ,

{ ∪_{k=1}^{d} e^(k)(p_1, F^(k)) } ∩ e^(j)(p_2, F^(j)) = ∅ ,  for all p_1, p_2 ∈ H_v, p_1 ≠ p_2, and for 1 ≤ j ≤ c .

Note that e^(k)(p, F^(k)) denotes the set {e^(k)(p, f^(k)) | f^(k) ∈ F^(k)}. The first condition guarantees detection of any combination of d or fewer faults (because no k faults, k ≤ d, can cause the erroneous result e^(k)(p_1, f^(k)) to be a valid one). The second condition guarantees correction of up to c faults (no combination of up to c faults on p_2 can result in an erroneous value that can also be produced by up to d faults on a different result p_1).
4.2.3 SEPARATE SEMIGROUP CODES
If the redundant semigroup H is a cartesian product of the form S × P, where (S, ∘) is the original semigroup and (P, ⊙) is the "parity" separate semigroup, then, the corresponding encoding mapping φ can be expressed as φ(s) = [s, θ(s)] for all s ∈ S. In such a case, the set of valid results is given by {[s, θ(s)] | s ∈ S} and error detection is based on verifying that the result is of this particular form.
Using the fact that the mapping φ is a homomorphism, one can easily show that the parity mapping θ is a homomorphism as well. As in the case of abelian groups, when this parity mapping θ is restricted to be surjective, one obtains a characterization of all possible parity mappings and, thus, of all separate codes. However, the role that was played in the abelian group framework by the (normal) subgroup N of the (abelian) group G is now played by a congruence relation on S:
DEFINITION 3.3 An equivalence relation ∼ on the elements of a semigroup (S, ∘) is called a congruence relation if

a ∼ a′, b ∼ b′ ⇒ a ∘ b ∼ a′ ∘ b′ ,

for all a, a′, b, b′ ∈ S. The partitions induced by ∼ are called congruence classes.
Unlike the group case, where a normal subgroup induces a partitioning of a group into cosets, the number of elements in each congruence class is not necessarily the same. The only requirement for congruence classes is that a given partitioning {C_1, C_2, ...} is such that partitions are preserved by the semigroup operation. More specifically, when any element of partition C_j is composed with any element of partition C_k, the result is confined to a single partition C_l. Formally, a given partitioning {C_1, C_2, ...} is a congruence relation if for each pair of partitions C_j, C_k there exists a partition C_l such that

s_j′ ∘ s_k′ ∈ C_l  for all s_j′ ∈ C_j and all s_k′ ∈ C_k.
Let S/∼ denote the set of congruence classes of S under relation ∼. Each congruence class in this set will be denoted as [a] ∈ S/∼, where a ∈ S is an arbitrary element of the congruence class. If ∼ is a congruence relation, the binary operation [a] ⊗ [b] = [a ∘ b] is well-defined [Ljapin, 1974; Lidl and Pilz, 1985]. With this definition, (S/∼, ⊗) is a semigroup, referred to as the factor or quotient semigroup of ∼ in S. The congruence class [i_0] functions as the identity in S/∼.
A well-known homomorphism theorem from semigroup theory states that the surjective homomorphisms from semigroup S onto semigroup P are isomorphic to the canonical surjective homomorphisms, namely the surjective homomorphisms that map S onto its quotient semigroups S/∼, where ∼ denotes a congruence relation in S [Ljapin, 1974; Lidl and Pilz, 1985]. Furthermore, the semigroup (P, ⊙) is isomorphic to (S/∼, ⊗). Thus, for each congruence relation ∼ there is a corresponding surjective homomorphism, and for each surjective homomorphism there is a corresponding congruence relation. Therefore, the problem of finding all possible parity codes reduces to that of finding all possible congruence relations in S [Hadjicostis, 1995].
The major difference between the results in [Hadjicostis, 1995] and the results in [Beckmann, 1992] that were presented earlier is that, for separate abelian group codes, the subgroup N of the given group G completely specifies the parity homomorphism θ (this is simply saying that P ≅ G/N). In the more general setting of a semigroup, however, specifying a normal subsemigroup for S does not completely specify the homomorphism θ (and therefore does not determine the structure of the parity semigroup P). In order to define the surjective homomorphism θ : S ↦ P (or, equivalently, in order to define a congruence relation ∼ on S), one may need to specify all congruence classes. The following examples help make this point clearer.
EXAMPLE 3.5 Figure 3.9 shows an example of a partitioning into congruence classes for the monoid (N, ×) of positive integers under multiplication. Congruence class C_1 contains multiples of 2 and 3 (i.e., multiples of 6); congruence class C_2 contains multiples of 2 but not 3; congruence class C_3 contains multiples of 3 but not 2; and congruence class C_0 contains all the remaining
CODING APPROACHES TO FAULT TOLERANCE
Figure 3.9. Partitioning of semigroup (N, ×) into congruence classes: Congruence Class C0 = {1, 5, 7, 11, 13, ...}; Congruence Class C1 = {6, 12, 18, 24, ...}; Congruence Class C2 = {2, 4, 8, 10, 14, ...}; Congruence Class C3 = {3, 9, 15, 21, ...}.
positive integers (i.e., integers that are neither multiples of 2 nor of 3). One can
check that the partitioning is preserved under the monoid operation ×.
EXAMPLE 3.6 It is proved in [Hadjicostis, 1995] that an encoding mapping θ : (N0, +) → (P, ⊛) can serve as a separate code for (N0, +) if
and only if it has the following form:

Let M > 0 and k ≥ 0 be some fixed integers. Then, the mapping θ is given by:

θ(n) = n, if n < kM,
θ(n) = n_M, otherwise.

The symbol n_M denotes the class of elements that are in the same modulo-M
class as n; there are exactly M such classes, namely {0_M, 1_M, ..., (M−1)_M}.
The parity monoid P consists of (k + 1)M elements: the elements in the index
set {0, 1, ..., kM − 1}, and the elements in the subgroup {0_M, 1_M, ..., (M−1)_M} (which is isomorphic to Z_M, the cyclic group of order M). While under
the threshold kM, the parity operation simply replicates the computation in
N0; once the threshold is exceeded, the parity operation performs modulo-M
addition.
Note that the parity encodings for the group (Z, +) (the group of integers
under addition) can only be of the form θ(n) = n_M for some M > 0, i.e.,
the second of the two expressions listed above. Evidently, relaxing the
structure to a monoid opens up more possibilities for parity encodings; their
construction, however, becomes more intricate.
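The threshold/modular structure of Example 3.6 can be sketched concretely. The snippet below is a minimal illustration (the names theta and parity_op and the parameters M = 4, k = 2 are illustrative choices, not from the text); it checks numerically that θ is a homomorphism from (N0, +) onto the parity monoid P.

```python
# Sketch of the separate code theta for (N0, +) of Example 3.6, with
# illustrative parameters M = 4 and k = 2 (any M > 0, k >= 0 would do).
M, K = 4, 2                        # threshold is k*M = 8

def theta(n):
    """Exact below the threshold kM; the modulo-M class n_M above it."""
    if n < K * M:
        return ('exact', n)        # element of the index set {0, ..., kM-1}
    return ('mod', n % M)          # element of the subgroup Z_M

def parity_op(p, q):
    """The operation of the parity monoid P, mirroring addition in N0."""
    if p[0] == 'exact' and q[0] == 'exact' and p[1] + q[1] < K * M:
        return ('exact', p[1] + q[1])      # still under the threshold
    return ('mod', (p[1] + q[1]) % M)      # otherwise: modulo-M addition

# theta is a homomorphism: theta(a + b) == parity_op(theta(a), theta(b)).
for a in range(30):
    for b in range(30):
        assert theta(a + b) == parity_op(theta(a), theta(b))
```

Setting k = 0 recovers an ordinary residue code, in line with the remark that the encodings for (Z, +) can only be of the modular form.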
ABFT for Combinational Systems
EXAMPLE 3.7 A simple parity check for (N, ×) is the mapping θ : (N, ×) → (P, ⊛)
defined in Figure 3.9. The parity monoid P has the following binary
operation ⊛:

| ⊛  || C0 | C1 | C2 | C3 |
| C0 || C0 | C1 | C2 | C3 |
| C1 || C1 | C1 | C1 | C1 |
| C2 || C2 | C1 | C2 | C1 |
| C3 || C3 | C1 | C1 | C3 |
The parity check determines whether the result is a multiple of 2 and/or of 3.
For instance, when a multiple of 2 but not 3 (i.e., an element in congruence
class C2) is multiplied by a multiple of 6 (an element in class C1), the result is
a multiple of 6 (an element in class C1).
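The parity check of Example 3.7 can be verified mechanically. The sketch below (names cls and REPS are illustrative) classifies a positive integer by divisibility and confirms that the operation ⊛ on classes is well-defined, i.e., that the class of a product depends only on the classes of the factors.

```python
# Sketch of the parity check of Example 3.7: the class of a positive integer
# is determined by divisibility by 2 and 3, and the induced operation on
# classes is well-defined. Names (cls, REPS) are illustrative.
def cls(n):
    if n % 6 == 0: return 'C1'     # multiple of both 2 and 3
    if n % 2 == 0: return 'C2'     # multiple of 2 but not 3
    if n % 3 == 0: return 'C3'     # multiple of 3 but not 2
    return 'C0'                    # multiple of neither

REPS = {'C0': 1, 'C1': 6, 'C2': 2, 'C3': 3}   # one representative per class

# The class of a product depends only on the classes of the factors, so the
# operation table for ⊛ can be computed from representatives.
for a in range(1, 60):
    for b in range(1, 60):
        assert cls(a * b) == cls(REPS[cls(a)] * REPS[cls(b)])
```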
EXAMPLE 3.8 The semigroup of integers under the MAX operation is denoted
by (Z, MAX), where MAX is the binary operation that returns the
larger of its two operands. This semigroup can be made into a monoid by
adding the identity element −∞ to it.
From the definition of a congruence class, one concludes that, if c and c' ≤ c
belong to a congruence class C, then the set {x | c' ≤ x ≤ c} is contained
in C. Thus, any congruence class must consist of all consecutive integers in an
interval. Therefore, every partitioning into congruence classes corresponds to
breaking the integer line into consecutive intervals, each interval constituting
a congruence class. This immediately yields a complete characterization of the
separate codes for (Z ∪ {−∞}, MAX).

A simple choice would be to pick the pair of congruence classes C0 =
{−∞} ∪ {..., −2, −1} and C1 = {0, 1, 2, ...}. The corresponding parity operation ⊛ is defined by:

| ⊛  || C0 | C1 |
| C0 || C0 | C1 |
| C1 || C1 | C1 |
The parity computation simply checks that the sign of the result comes out
correctly.
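The sign-check monitor of Example 3.8 is a few lines of code. The sketch below (function names are illustrative) verifies that the two-class partition is compatible with MAX, so the monitor correctly tracks the sign of a running maximum.

```python
# Sketch of the two-class separate code for (Z ∪ {-inf}, MAX) of Example 3.8:
# C0 holds -inf and the negative integers, C1 the nonnegative ones, and the
# parity operation makes C1 absorbing. Function names are illustrative.
NEG_INF = float('-inf')

def sign_class(x):
    """Congruence class (interval) of x under the chosen partition."""
    return 'C0' if x == NEG_INF or x < 0 else 'C1'

def parity_max(p, q):
    """Parity operation: C1 absorbs C0."""
    return 'C1' if 'C1' in (p, q) else 'C0'

# The monitor tracks the sign of a running MAX.
for a in [NEG_INF, -7, -1, 0, 3, 42]:
    for b in [NEG_INF, -2, 0, 5]:
        assert sign_class(max(a, b)) == parity_max(sign_class(a), sign_class(b))
```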
4.3
EXTENSIONS
The algebraic approach for protecting group and semigroup operations can
be extended straightforwardly to other algebraic systems with the underlying
structure of a group (such as rings, fields and vector spaces) or a semigroup (such
as semirings or semifields). By exploiting the group or semigroup structure
in each of these other systems, one can place the construction of arithmetic
codes for computations in them into the group/semigroup frameworks that were
discussed in this chapter. Therefore, a large set of computational tasks can be
studied using the framework of this chapter, including integer residue codes, real
residue codes, multiplication of nonzero real numbers, linear transformations,
and Gaussian elimination [Beckmann, 1992].
Notes
1 This is always true for the abelian group case because sets of the form G' ∘ e
are the same as sets of the form e ∘ G'.
References
Abraham, J. A. (1986). Fault tolerance techniques for highly parallel signal
processing architectures. In Proceedings of SPIE, pages 49-65.
Abraham, J. A., Banerjee, P., Chen, C.-Y., Fuchs, W. K., Kuo, S.-Y., and Reddy,
A. L. N. (1987). Fault tolerance techniques for systolic arrays. IEEE Computer, 36(7):65-75.
Beckmann, P. E. (1992). Fault-Tolerant Computation Using Algebraic Homomorphisms. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Beckmann, P. E. and Musicus, B. R. (1991). Fault-tolerant round-robin A/D
converter system. IEEE Transactions on Circuits and Systems, 38(12):1420-1429.
Beckmann, P. E. and Musicus, B. R. (1992). A group-theoretic framework
for fault-tolerant computation. In Proceedings of the IEEE Int. Conf. on
Acoustics, Speech, and Signal Processing, pages 557-560.
Beckmann, P. E. and Musicus, B. R. (1993). Fast fault-tolerant digital convolution using a polynomial residue number system. IEEE Transactions on
Signal Processing, 41(7):2300-2313.
Chen, C.-Y. and Abraham, J. A. (1986). Fault tolerance systems for the computation of eigenvalues and singular values. In Proceedings of SPIE, pages
228-237.
Choi, Y.-H. and Malek, M. (1988). A fault-tolerant systolic sorter. IEEE Transactions on Computers, 37(5):621-624.
Grillet, P. A. (1995). Semigroups. Marcel Dekker Inc., New York.
Hadjicostis, C. N. (1995). Fault-Tolerant Computation in Semigroups and Semirings. M.Eng. thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. and Verghese, G. C. (1995). Fault-tolerant computation in
semigroups and semirings. In Proceedings of the Int. Conf. on Digital Signal
Processing, pages 779-784.
Herstein, I. N. (1975). Topics in Algebra. Xerox College Publishing, Lexington,
Massachusetts.
Higgins, P. M. (1992). Techniques of Semigroup Theory. Oxford University
Press, New York.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for
matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Jacobson, N. (1974). Basic Algebra I. W. H. Freeman and Company, San Francisco.
Jou, J.-Y. and Abraham, J. A. (1986). Fault-tolerant matrix arithmetic and signal
processing on highly concurrent parallel structures. Proceedings of the IEEE,
74(5):732-741.
Jou, J.-Y. and Abraham, J. A. (1988). Fault-tolerant FFT networks. IEEE Transactions on Computers, 37(5):548-561.
Lallement, G. (1979). Semigroups and Combinatorial Applications. John Wiley
& Sons, New York.
Leighton, F. T. (1992). Introduction to Parallel Algorithms and Architectures:
Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Mateo, California.
Liang, S. C. and Kuo, S. Y. (1990). Concurrent error detection and correction in
real-time systolic sorting arrays. In Proceedings of 20th IEEE Int. Symp. on
Fault-Tolerant Computing, pages 434-441. IEEE Computer Society Press.
Lidl, R. and Pilz, G. (1985). Applied Abstract Algebra. Undergraduate Texts in
Mathematics. Springer-Verlag, New York.
Ljapin, E. S. (1974). Semigroups, volume 3 of Translations of Mathematical Monographs. American Mathematical Society, Providence, Rhode
Island.
Nair, V. S. S. and Abraham, J. A. (1990). Real-number codes for fault-tolerant
matrix operations on processor arrays. IEEE Transactions on Computers,
39(4):426-435.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT
Press, Cambridge, Massachusetts.
Rao, T. R. N. (1974). Error Coding for Arithmetic Processors. Academic Press,
New York.
Rao, T. R. N. and Fujiwara, E. (1989). Error-Control Coding for Computer
Systems. Prentice-Hall, Englewood Cliffs, New Jersey.
Sun, J., Cerny, E., and Gecsei, J. (1994). Fault tolerance in a class of sorting
networks. IEEE Transactions on Computers, 43(7):827-837.
Sung, J.-L. and Redinbo, G. R. (1996). Algorithm-based fault tolerant synthesis
for linear operations. IEEE Transactions on Computers, 45(4):425-437.
II
FAULT-TOLERANT DYNAMIC SYSTEMS
Chapter 4
REDUNDANT IMPLEMENTATIONS OF
ALGEBRAIC MACHINES
1
INTRODUCTION
This chapter extends the algebraic approach of Chapter 3 in order to provide
fault tolerance to group and semigroup machines. The discussion characterizes
redundant implementations using algebraic homomorphisms and demonstrates
that for a particular error-correcting scheme there exist many possible redundant implementations, each potentially offering different fault coverage [Hadjicostis, 1999]. The fault model assumes that the error detecting/correcting
mechanism is fault-free and considers faults that cause the redundant machine
to transition to an incorrect state. Explicit connections to hardware implementations and hardware faults are addressed in Chapter 5 for linear time-invariant
dynamic systems (implemented using delay, adder and gain elements) and in
Chapter 6 for linear finite-state machines (implemented using XOR gates and
flip-flops). The issue of faults in the error corrector is studied in Chapter 7.
Related work appeared in the context of providing fault tolerance to arbitrary
finite-state machines via external monitoring mechanisms [Iyengar and Kinney,
1985; Leveugle and Saucier, 1990; Parekhji et al., 1991; Robinson and Shen,
1992; Leveugle et al., 1994; Parekhji et al., 1995]. This work, however, was not
formulated in an algebraic setting and does not make use of algebraic properties
and structure.
2
ALGEBRAIC MACHINES: DEFINITIONS AND
DECOMPOSITIONS
DEFINITION 4.1 A semigroup machine is a dynamic system whose states and
inputs are drawn from a finite set S that forms a semigroup under a binary
operation ∘. More specifically, given the current state q[t] = s1 ∈ S and the
current input x[t] = s2 ∈ S, the next state is given by

q[t + 1] = q[t] ∘ x[t] = s1 ∘ s2.

[Figure 4.1. Series-parallel decomposition of a group machine: a coset leader machine (with group G/N) and a subgroup machine (with group N) are interconnected through a memoryless encoder E; the combined state is g1 = n1 ∘ ci1.]
In the special case when (S, ∘) is a group, the machine is known as a group or
permutation machine.

A group machine G with a non-trivial normal subgroup N can be decomposed into two smaller group machines: the coset leader machine with
group G/N and the subgroup machine with group N [Arbib, 1968; Ginzburg,
1968; Arbib, 1969]. Figure 4.1 conveys the basic idea: group machine G,
with current state qg[t] = g1 and input xg[t] = g2, is decomposed into the
"series-parallel" interconnection in the figure. [Figure 4.1 follows a convention
that will be used throughout this chapter: boxes are used to denote machines
(systems with memory) and ovals are used to denote mappings (combinational
systems with no memory).] Note that the input is encoded differently for each
submachine; in particular, the input n' to the subgroup machine is encoded
based on the original input g2 and the state ci1 of the coset leader machine.
Note that the encoder E in the figure has no memory (state) and is implemented
as a combinational system. The overall state of the decomposition is obtained
by combining the states of both submachines (qg[t] = g1 = n1 ∘ ci1, where
n1 is the state of the subgroup machine and ci1 is the state of the coset leader
machine).
The above decomposition is possible because the normal subgroup N induces
a partition of the elements of G into cosets [Arbib, 1968; Arbib, 1969]. More
specifically, each element g of G can be expressed uniquely as

g = n ∘ ci for some n ∈ N, ci ∈ C,

where C = {c1, c2, ..., cl} is the set of distinct (right¹) coset leaders (there is
exactly one representative for each coset). The decomposition in Figure 4.1
simply keeps track of this parameterization. Initially, machine G is in state
g1 = n1 ∘ ci1, and machines G/N and N in the decomposition are in states ci1
and n1 respectively. If input g2 = n2 ∘ ci2 is applied, the new state g = g1 ∘ g2
can be expressed as g = n ∘ ci. One possibility is to take ci = c̄, where c̄ denotes
the coset leader of c = ci1 ∘ g2 = ci1 ∘ n2 ∘ ci2; then, one puts n = n1 ∘ c ∘ c̄⁻¹.
Note that c ∘ c̄⁻¹ is an element of N and is the output n' of encoder E (the
product n1 ∘ n' can be computed within the subgroup machine). The encoders are used to
appropriately encode the input for each machine and to provide the combined
output. The decomposition can continue recursively if either of the groups N
or G/N of the two submachines has a non-trivial normal subgroup. Note that
the above choice of decomposition holds even if N is not a normal subgroup of
G. In such a case, however, the (right) coset leader machine is no simpler than
the original machine; its group is still G [Arbib, 1968].
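The bookkeeping above can be simulated for a small concrete case. The sketch below uses the abelian example G = Z4 with N = {0, 2} (my own illustrative choice); since the group is abelian, the encoder formula reduces to n' = c1 + g2 − (coset leader of c1 + g2), and the combined state of the two submachines is checked against the undecomposed machine.

```python
# One step of the series-parallel decomposition of Figure 4.1 for the small
# abelian example G = Z4 with N = {0, 2}; since G is abelian the encoder
# output simplifies to n' = c1 + g2 - leader(c1 + g2). Names are illustrative.
import random

M = 4
N = {0, 2}                              # normal subgroup, isomorphic to Z2
LEADERS = {0: 0, 1: 1, 2: 0, 3: 1}     # coset leader of each element of Z4

def leader(g):
    return LEADERS[g % M]

def step(state, g2):
    """Advance the (subgroup state, coset leader state) pair by input g2."""
    n1, c1 = state
    c_new = leader((c1 + g2) % M)       # coset leader machine G/N
    n_in = (c1 + g2 - c_new) % M        # encoder E output n' (lies in N)
    assert n_in in N
    return ((n1 + n_in) % M, c_new)     # subgroup machine N advances by n'

# The combined state n o c tracks the undecomposed machine exactly.
state, direct = (0, 0), 0
for _ in range(200):
    g2 = random.randrange(M)
    state = step(state, g2)
    direct = (direct + g2) % M
    assert (state[0] + state[1]) % M == direct
```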
The decomposition of group machines described above has generalizations
to semigroup machines, the most well-known result being the Krohn-Rhodes
theorem [Arbib, 1968; Arbib, 1969]. This theorem states that an arbitrary
semigroup machine (S, ∘) can be decomposed in a non-unique way into a series-parallel
interconnection of simpler components that are either simple-group
machines or are one of four basic types of semigroup machines. The basic
machine components are the following:

• Simple-group machines, i.e., machines whose groups do not have any non-trivial normal subgroups. Each simple-group machine in a Krohn-Rhodes
decomposition has a simple group that is a homomorphic image of some
subgroup of S. It is possible that the decomposition uses multiple copies of
a particular simple-group machine or no copy at all.

• U3 = {1, r1, r2} such that for u, ri ∈ U3, u ∘ 1 = 1 ∘ u = u and u ∘ ri = ri.

• U2 = {r1, r2} such that for u, ri ∈ U2, u ∘ ri = ri.

• U1 = {1, r} such that 1 is the identity element and r ∘ r = r.

• U0 = {1}.
Note that U0, U1 and U2 are in fact subsemigroups of U3. Some further results
and ramifications can be found in [Ginzburg, 1968].
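The defining identities of these basic components are easy to verify by enumeration. The sketch below (a check of my own, not from the text) confirms that U3's operation is associative, and that U0, U1 (taking r := r1) and U2 are closed under it, i.e., subsemigroups of U3.

```python
# A quick check of the basic Krohn-Rhodes components: in U3 = {1, r1, r2}
# the element 1 is an identity and each r_i is a "reset" (u o r_i = r_i).
def op(u, v):
    return u if v == '1' else v    # a reset on the right ignores the left operand

U3 = ['1', 'r1', 'r2']
# The operation is associative, so U3 is indeed a semigroup (a monoid).
for a in U3:
    for b in U3:
        for c in U3:
            assert op(op(a, b), c) == op(a, op(b, c))

# U0 = {1}, U1 = {1, r} (with r := r1) and U2 = {r1, r2} are closed under
# the same operation, i.e., they are subsemigroups of U3.
for sub in [['1'], ['1', 'r1'], ['r1', 'r2']]:
    assert all(op(a, b) in sub for a in sub for b in sub)
```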
Before moving to the construction of redundant algebraic machines, some
comments are in order. Redundant implementations for algebraic machines
are constructed in this chapter using a hardware-independent approach; for
discussion purposes, however, some examples make reference to digital implementations, i.e., implementations that are based on digital circuits. The state of
such implementations is encoded as a binary vector and is stored into an array
of single-bit memory registers (flip-flops). The next-state function is implemented by combinational logic. State transition faults occur when a hardware
fault causes the desired transition to a state si (si ∈ S) with binary encoding
(b1i, b2i, ..., bni) to be replaced by a transition to an incorrect state sj ∈ S with
encoding (b1j, b2j, ..., bnj) (all b's are either "0" or "1"). A single-bit error
occurs when the encoding of si differs from the encoding of sj in exactly one
bit position [Abraham and Fuchs, 1986; Johnson, 1989]. Note that, depending
on the hardware implementation, a single hardware fault can cause multiple-bit
errors. Chapters 5 and 6 describe ways to implement certain types of machines
so that a single hardware fault will result in a single-bit error.
3
REDUNDANT IMPLEMENTATIONS OF GROUP
MACHINES
The next state qg[t + 1] of a group machine G is determined by a state
evolution equation of the following form:

qg[t + 1] = qg[t] ∘ xg[t] = g1 ∘ g2,

where both the current state qg[t] = g1 and input xg[t] = g2 are elements of
a group (G, ∘). Examples of group machines include additive accumulators,
multi-input linear shift registers, counters and cyclic autonomous machines. As
discussed in the previous section, group machines can also play an important
role as essential components of arbitrary state machines.
One approach for constructing a redundant implementation for a given group
machine G is to construct a larger group machine H (with group (H, ⋄), current
state qh[t] = h1 ∈ H, encoded input ξ(xg[t]) = h2 ∈ H and next-state function
δh(h1, h2) = h1 ⋄ h2) that can concurrently simulate machine G, as shown in
Figure 4.2: the current state qg[t] = g1 of the original group machine G can be
recovered from the corresponding state qh[t] = h1 of the redundant machine
H through a one-to-one decoding mapping ℓ (i.e., qg[t] = ℓ(qh[t]) for all time
steps t). The mapping ℓ is only defined for the subset of valid states in H, given
by G' = ℓ⁻¹(G) ⊂ H.
[Figure 4.2. Redundant implementation of a group machine (faults act on the redundant machine H; the state of G is recovered through the decoding mapping ℓ).]

DEFINITION 4.2 A redundant implementation for a group machine (G, ∘) is
a group machine (H, ⋄) for which there exist

(i) an appropriate input encoding mapping ξ : G → H (from G into H), and

(ii) a one-to-one state encoding mapping ℓ⁻¹ : G → G' (where G' =
ℓ⁻¹(G) ⊂ H is the subset of valid states),

such that

ℓ(ℓ⁻¹(g1) ⋄ ξ(g2)) = g1 ∘ g2        (4.1)

for all g1, g2 ∈ G.
Note that when machine H is properly initialized and fault-free, there is a
one-to-one correspondence between the state qh[t] = h1 of machine H and the
corresponding state qg[t] = g1 of G. Specifically, g1 = ℓ(h1) or h1 = ℓ⁻¹(g1)
for all time steps. This can be shown by induction: if at time step t, machine
H is in state qh[t] = h1 and input xg[t] = g2 ∈ G is supplied via ξ, the next
state of H will be given by

qh[t + 1] = h1 ⋄ ξ(g2) = h

for some h in H. Since ℓ is one-to-one, it follows from Eq. (4.1) that h has
to satisfy h = ℓ⁻¹(g1 ∘ g2) = ℓ⁻¹(g), where g = g1 ∘ g2 is the next state of
machine G. Note that h belongs to the subset of valid states G' = ℓ⁻¹(G) ⊂ H.
Faults cause transitions to invalid states in H; at the end of the time step, the
error detector verifies that the newly reached state h is in G' and, if an error
is detected, the necessary correction procedures are initiated and completed before
the next input is supplied.
The concurrent simulation condition of Eq. (4.1) is an instance of the coding
scheme of Figure 3.1: the decoding mapping ℓ plays the role of σ, whereas ξ
corresponds to mapping φ2. (The situation described in Eq. (4.1) is slightly
more restrictive than the one in Figure 3.1, because φ1 is restricted to be ℓ⁻¹.)
[Figure 4.3. Separate redundant implementation of a group machine: the redundant machine H = G × P consists of the original machine G and a separate parity machine P, fed through an input encoder; an error detector/corrector compares the state of P against θ applied to the state g1 of G.]
By invoking the results² in Sections 4.1.1 and 4.2.1 of Chapter 3, one concludes
that the design of redundant implementations for group machines can be studied through group homomorphisms [Hadjicostis, 1999]. More specifically, by
choosing ξ ≡ ℓ⁻¹ to be an injective group homomorphism from G into H,
Eq. (4.1) is automatically satisfied:

ℓ(ℓ⁻¹(g1) ⋄ ξ(g2)) = ℓ(ℓ⁻¹(g1) ⋄ ℓ⁻¹(g2))
                   = ℓ(ℓ⁻¹(g1 ∘ g2))
                   = g1 ∘ g2.

3.1
SEPARATE MONITORS FOR GROUP MACHINES
When the redundant group machine is of the form H = G × P, one recovers
the results in Section 4.1.3 of Chapter 3 for separate group codes: the encoding
homomorphism φ : G → H [where φ(g) = ξ(g) = ℓ⁻¹(g)] is of the form
φ(g) = [g, θ(g)] for an appropriate mapping θ. The redundant machine (H, ⋄)
consists of the original machine (G, ∘) and an independent parity machine
(P, ⊛), as shown in Figure 4.3. Machine P is smaller than G and is referred to
as a (separate) monitor or a monitoring machine (the latter term has been used in
finite-state machines [Iyengar and Kinney, 1985; Parekhji et al., 1991; Robinson
and Shen, 1992; Parekhji et al., 1995; Hadjicostis, 1999] and in other settings).
Mapping θ : G → P is used to produce the encoded input p2 = θ(g2) for the
separate monitor P and is easily shown to be a homomorphism. If machines G
and P are properly initialized and fault-free, then the state qp[t] = p1 ∈ P of the
monitor at a given time step t will satisfy qp[t] = θ(qg[t]), where qg[t] = g1 ∈ G
is the state of the original machine (G, ∘) (see Figure 4.3). Error detection
simply checks whether this condition is satisfied. Depending on the actual hardware
implementation, one may be able to detect and correct certain errors in the
original machine or in the separate monitor.

Using the results in Section 4.1 of Chapter 3, and retaining the assumption
that the mapping θ : G → P (which maps states and inputs in machine
G to states and inputs in machine P) is surjective, one concludes that group
machine (P, ⊛) can monitor group machine (G, ∘) if and only if P is a surjective
homomorphic image of G or, equivalently, if and only if there exists a normal
subgroup N of G such that P ≅ G/N.
EXAMPLE 4.1 The group machine G = Z6 = {0, 1, 2, 3, 4, 5} performs
modulo-6 addition, i.e., its next state is the modulo-6 sum of its current state
and current input. The subgroup N = {0, 2, 4} ≅ Z3 is one possible non-trivial
normal subgroup of G. The corresponding monitor P is isomorphic to
G/N ≅ Z2: it has two states, p0 and p1, each of which can be associated with
one of the two partitions of states of machine Z6. More specifically, (monitor)
state p0 can be associated with (original) states in partition P0 = {0, 2, 4}
and (monitor) state p1 can be associated with (original) states in partition
P1 = {1, 3, 5}.
As the original machine receives inputs, the monitor changes state in a way
that keeps track of the partition in which the state of the original machine lies.
Assuming that the monitor is fault-free,³ one can use this approach to detect
faults that cause transitions to a state in an erroneous partition. For example,
if the current state of the original machine is 1 (so that the state of the monitor
is p1) and input 4 is received, the resulting state of the original machine should
be 5. Under this input, the monitor takes a transition from state p1 to state p1
(agreeing with the fact that the state of the original machine is in partition
P1). A state transition fault in the original machine that results in a state in
partition P0 (i.e., any one of states 0, 2 and 4) will be detected; a state transition
fault that results in a state within partition P1 (i.e., state 1 or 3) will not be
detected.

Assuming fault-free monitors, the detection of single-bit errors in the digital
implementation of Z6 is guaranteed if the binary encodings of states within the
same partition have Hamming⁴ distance greater than one.
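The detection behavior of Example 4.1 can be simulated directly. The sketch below (the names theta and run are illustrative) runs the Z6 machine alongside its Z2 monitor, injects a single state transition fault, and raises an alarm exactly when the fault crosses partitions.

```python
# Sketch of Example 4.1: a modulo-6 accumulator monitored by the quotient
# machine Z6/N (N = {0, 2, 4}), realized here by tracking states modulo 2.
# The names theta and run are illustrative, not from the text.
def theta(g):
    """Homomorphism Z6 -> Z2; theta(g) selects partition P0 or P1."""
    return g % 2

def run(inputs, fault_at=None, fault_state=None):
    """Simulate Z6 and its monitor; optionally inject one state transition
    fault. Returns True as soon as the monitor detects a mismatch."""
    state, monitor = 0, 0
    for t, x in enumerate(inputs):
        state = (state + x) % 6
        if t == fault_at:
            state = fault_state          # fault: transition lands elsewhere
        monitor = (monitor + theta(x)) % 2
        if monitor != theta(state):
            return True                  # states disagree on the partition
    return False

assert not run([4, 3, 5, 1])                      # fault-free: no alarm
assert run([1, 4], fault_at=1, fault_state=2)     # 5 -> 2 crosses partitions
assert not run([1, 4], fault_at=1, fault_state=3) # 5 -> 3 stays in P1: missed
```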
EXAMPLE 4.2 An autonomous machine has only one transition from any
given state and this transition occurs at the next clock pulse. If the number
of states is finite, an autonomous machine will eventually enter a cyclic sequence of states. A cyclic autonomous machine is one whose states form a pure
cycle (i.e., there are no transients involved). The state transition table for the
cyclic autonomous machine with M states is as follows:

| Current State || Next State |
|      0M       ||     1M     |
|      1M       ||     2M     |
|       :       ||      :     |
|   (M − 1)M    ||     0M     |
This machine is essentially the cyclic group machine ZM, but with only one
allowable input (namely element 1M) instead of the whole set {0M, 1M, 2M, ...,
(M − 1)M}.

Using the algebraic framework and some rather standard results from group
theory, one can characterize all possible monitors P for the autonomous machine ZM: each monitor needs to be a group machine (P, ⊛) that is isomorphic
to ZM/N, where N is a normal subgroup of ZM. The (normal) subgroups of
ZM are cyclic groups of order |N| = D that divides M [Jacobson, 1974] (i.e.,
N ≅ ZD for D a divisor of M). Therefore, the monitors correspond to
quotient groups P ≅ ZM/N = ZM/ZD that are cyclic and of order |P| = M/D
(that is, P ≅ Z|P|). Since the only available input to the original machine G
is the clock input, P should also be restricted to only having the clock input.
Therefore, a monitor for a cyclic autonomous machine with M states is another
autonomous cyclic machine whose number of states is a divisor of M.
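A monitor of this kind is just a smaller counter driven by the same clock. The sketch below (M = 12 and D = 4 are illustrative choices, with D dividing M) checks the fault-free invariant that the monitor state always equals the machine state modulo D.

```python
# Sketch of Example 4.2: a cyclic autonomous machine with M states and a
# monitor with D states, D a divisor of M. M and D are illustrative choices.
M, D = 12, 4

def tick(state, monitor):
    """One clock pulse (the only allowed input) for machine and monitor."""
    return (state + 1) % M, (monitor + 1) % D

state, monitor = 0, 0
for _ in range(3 * M):
    state, monitor = tick(state, monitor)
    # Fault-free invariant: the monitor tracks the state modulo D.
    assert monitor == state % D
```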
The discussion in this section established that a group machine (P, ⊛) can
monitor a machine (G, ∘) if and only if there exists a normal subgroup N of
G such that P ≅ G/N. Since N is a normal subgroup of G, according to the
decomposition results in Section 2, one can also decompose the original group
machine G into an interconnection of a subgroup machine N and a coset leader
machine G/N. Therefore, one arrives at an interesting observation: if this
particular decomposition is used, then the monitoring approach corresponds
to partial modular redundancy because P is isomorphic to the coset leader machine.
Error detection in this special case is straightforward because, as shown
in Figure 4.4, faults in P or G/N can be detected by concurrently comparing
their corresponding states. The comparison is a simple equality check (up to
isomorphism) and an error is detected whenever there is a disagreement. Faults
in the subgroup machine N cannot be detected. Note that the error detection
and correction capabilities will be different if G is implemented using a different
decomposition (or not decomposed at all).
[Figure 4.4. Relationship between a separate monitor and a decomposed group machine: the monitor (preceded by an input coder) runs in parallel with the original group machine G, itself decomposed into the subgroup machine N and the coset leader machine G/N; error detection compares the monitor state against the state of the coset leader machine.]
EXAMPLE 4.3 Consider the group machine Z4 = {0, 1, 2, 3} whose next state
is given by the modulo-4 sum of its current state and current input. A normal
subgroup of Z4 is given by N = {0, 2} (N ≅ Z2); the cosets are {0, 2} and
{1, 3}, and the resulting coset leader machine Z4/N ≅ Z4/Z2 is isomorphic
to Z2. Due to the interconnectivity provided by the encoder E between the two
submachines⁵ (see Figure 4.1), the overall functionality is different from Z2 × Z2
(which is expected since Z4 ≠ Z2 × Z2).

The separate monitor P = G/N = Z4/Z2 ≅ Z2 functions as follows: it encodes
the inputs in coset {0, 2} into 02 and those in {1, 3} into 12; then, it adds its
current input to its current state modulo 2. Therefore, the functionality of this
separate monitor is identical to the coset leader machine in the decomposition
described above. As illustrated in Figure 4.4, under this particular decomposition of Z4, the monitor will only be able to detect faults that cause errors in
the least significant bit (i.e., errors in the coset leader machine). Errors in the
most significant bit (which correspond to errors in the subgroup machine) will
remain completely undetected.
3.2
NON-SEPARATE REDUNDANT
IMPLEMENTATIONS FOR GROUP MACHINES
A non-separate redundant implementation for a group machine (G, ∘) uses
a larger group machine (H, ⋄) that preserves the behavior of G in some non-separately
encoded form (as in Figure 4.2). In the beginning of Section 3 it
was argued that such an embedding can be achieved via an injective group
homomorphism φ : G → H that is used to encode the inputs and states of
machine G into those of machine H. The subset of valid states was given
by G' = φ(G) and was shown to be a subgroup of H. Notice that, if G' is a
normal subgroup of H, then it is possible to decompose H into a series-parallel
interconnection of a subgroup machine G' (isomorphic to G) and a coset leader
machine H/G'. If one actually implements H in this decomposed form, then the
fault-tolerance scheme attempts to protect the computation in G by performing
an isomorphic computation (in the subgroup machine G') and a coset leader
computation H/G'. Faults are detected whenever the overall state of H lies
outside G', that is, whenever the state of the coset leader machine deviates from
the identity. Faults in the subgroup machine are not reflected in the state of H/G'
because the coset leader machine is not influenced in any way by the activity in
the subgroup machine G'. Therefore, faults in G' are completely undetected and
the only detectable faults are the ones that force H/G' to a state different from the
identity. In effect, the added redundancy can only check for faults within itself
rather than for faults in the computation in G' and turns out to be rather useless
for error detection or correction. As demonstrated in the following example,
one can avoid this problem by implementing H using a different decomposition;
each such decomposition may offer different fault coverage (while keeping the
same encoding, decoding and error-correcting procedures).
EXAMPLE 4.4 To provide fault tolerance to machine G = Z3 = {0, 1, 2}
using an aM coding scheme with a = 2, one would multiply its input and state
by a factor of 2. The resulting redundant machine H = Z6 = {0, 1, ..., 5}
performs addition modulo-6; its subgroup of valid states is given by G' =
{0, 2, 4} and is isomorphic to G. The quotient group H/G' consists of two
cosets: {0, 2, 4} and {1, 3, 5}. If one chooses 0 and 1 as the coset leaders,
now denoting them by 02 and 12 to avoid confusion, the coset leader machine
is isomorphic to Z2 and has the following state transition function:

| State \ Input || 02 = {0, 2, 4} | 12 = {1, 3, 5} |
|      02       ||       02       |       12       |
|      12       ||       12       |       02       |

For this example, the encoder E in Figure 4.1 (which has no internal state
and provides the input to the subgroup machine based on the current input and
the coset in which the coset leader machine is in) performs the following coding
function:

| State \ Input || 0 | 1 | 2 | 3 | 4 | 5 |
|      02       || 0 | 0 | 2 | 2 | 4 | 4 |
|      12       || 0 | 2 | 2 | 4 | 4 | 0 |
Note that the input to machine H will always be a multiple of 2. Therefore,
as is clear from the table, if one starts from the 02 coset, one will remain there
(at least under fault-free conditions). The input to the subgroup machine will
be the same as in the non-redundant machine (only the symbols used will be
different: {0, 2, 4} instead of {0, 1, 2}).

A fault will be detected whenever the overall state of H does not lie in G',
i.e., whenever the coset leader machine H/G' is in a state different from 02.
Since the coset leader machine does not receive any input from the subgroup
machine, a deviation from the 02 state (coset) reflects a fault in the coset leader
machine. Therefore, the redundancy can only be used to check itself and not
the original machine.
One gets more satisfying results if H is decomposed in other ways. For
example, NH = {0, 3} is a normal subgroup of H and the corresponding coset
decomposition H/NH consists of three cosets: {0, 3}, {1, 4} and {2, 5}. The
state transition function of the coset leader machine is given by the following
table (where coset leaders are denoted by 03, 13 and 23):

| State \ Input || 03 = {0, 3} | 13 = {1, 4} | 23 = {2, 5} |
|      03       ||      03     |      13     |      23     |
|      13       ||      13     |      23     |      03     |
|      23       ||      23     |      03     |      13     |
In this case, the output of the encoder E between the coset leader and the
subgroup machine is given by the following table:

| State \ Input || 0 | 1 | 2 | 3 | 4 | 5 |
|      03       || 0 | 0 | 0 | 3 | 3 | 3 |
|      13       || 0 | 0 | 3 | 3 | 3 | 0 |
|      23       || 0 | 3 | 3 | 3 | 0 | 0 |
This situation is quite different from the one described earlier. The valid
results under fault-free conditions do not lie in the same coset anymore. Instead,
for each state in the subgroup machine, there is exactly one valid state in
the coset leader machine. More specifically, the valid states (the ones that
comprise the subgroup machine G') are given by specific pairs (c, n_h) of a
state c of the coset leader machine and a state n_h of the subgroup machine
N_H. The pairs in this example are (0₃, 0), (1₃, 3) and (2₃, 0). This
structured redundancy can therefore be exploited to perform error detection
and correction.
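The detection rule above can be sketched in a few lines; this is a minimal illustrative simulation (the encoding, subgroup and valid pairs follow the example, while the input sequence and function names are this sketch's own):

```python
# Sketch of the example: the original machine Z3 is protected via 2N coding,
# i.e., it is embedded into H = Z6 (states and inputs multiplied by 2). With
# the normal subgroup N_H = {0, 3}, a state h decomposes into the pair
# (coset leader h mod 3, subgroup part h - (h mod 3)).
VALID_PAIRS = {(0, 0), (2, 0), (1, 3)}    # images of the valid states 0, 2, 4

def step(h, x):
    """One transition of the redundant machine Z6 on the encoded input 2*x."""
    return (h + 2 * x) % 6

def decompose(h):
    c = h % 3                             # state of the coset leader machine
    return (c, h - c)                     # subgroup part lies in {0, 3}

def detect(h):
    """Flag an error when the (leader, subgroup) pair is not a valid pair."""
    return decompose(h) not in VALID_PAIRS

h = 0                                     # encoding of the original state 0
for x in [1, 2, 1]:                       # an input sequence from Z3
    h = step(h, x)
assert not detect(h)                      # fault-free runs stay valid
assert detect(h + 1)                      # a single-state corruption is caught
```

Unlike the first decomposition, here every invalid (leader, subgroup) pair exposes a fault in either component, so the redundancy protects the original computation rather than itself.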
The analysis in this example can be generalized to all cyclic group machines
ZM that are to be protected through aM coding. The encoding of the states
and the inputs involves simple multiplication by a, whereas the computation
needs to be reformulated using a group machine decomposition that does not
have ZM as a (normal) subgroup.
The example above illustrates that non-separate redundancy can provide
varying degrees of protection depending on the group machine decomposition
that is used (or, more generally, on the underlying hardware implementation).
This issue is commonly ignored in research on arithmetic codes because the
focus is on pre-specified (fixed) hardware implementations. For example, aM
codes were applied to arithmetic circuits with a specific architecture in mind
and with the objective of choosing the parameter a so that an acceptable level of
error detection/correction was achieved [Peterson and Weldon Jr., 1972; Rao,
1974]. The approach in the above example is different because it characterizes
the encoding and decoding mappings abstractly, and allows for the possibility
of implementing and decomposing the redundant machine in different ways;
each such decomposition results in a different fault coverage. Chapters 5 and 6
illustrate this point more explicitly for hardware implementations of linear
time-invariant dynamic systems and linear finite-state machines.
4 REDUNDANT IMPLEMENTATIONS OF SEMIGROUP MACHINES
The development in Section 3 can be generalized to semigroup machines
[Hadjicostis, 1999]. For this case, one has the following definition:
DEFINITION 4.3 A redundant implementation for a semigroup machine (S, ∘)
is a semigroup machine (H, ⋄) for which there exist

(i) an appropriate input encoding mapping ξ : S → H (from S into H), and

(ii) a one-to-one mapping ℓ⁻¹ : S → S' (where S' = ℓ⁻¹(S) ⊂ H is the
subset of valid states),

such that

    ℓ⁻¹(s₁) ⋄ ξ(s₂) = ℓ⁻¹(s₁ ∘ s₂)                (4.2)

for all s₁, s₂ ∈ S.
Note that when H is properly initialized and fault-free, there is a one-to-one
correspondence between the state q_h[t] = h₁ of H and the corresponding state
q_s[t] = s₁ of S. Specifically, q_s[t] = ℓ(q_h[t]) and q_h[t] = ℓ⁻¹(q_s[t]) for all
time steps. At the beginning of time step t, input x_s[t] = s₂ ∈ S is supplied to
machine H encoded via ξ, and the next state of H is given by

    q_h[t + 1] = q_h[t] ⋄ ξ(x_s[t]) = ℓ⁻¹(s₁) ⋄ ξ(s₂) = h

for some h in H. Since ℓ is one-to-one, Eq. (4.2) implies that h = ℓ⁻¹(s₁ ∘ s₂) =
ℓ⁻¹(s), where s = s₁ ∘ s₂ is the next state of machine S. Note that h belongs
to the subset of valid states S' = ℓ⁻¹(S) ⊂ H. Faults cause transitions to
invalid states in H; at the end of the time step, the error detector verifies that the
newly reached state h is in S' and, if an error is detected, the necessary correction
procedures are initiated and completed before the next input is supplied.
DEFINITION 4.4 A semigroup machine is called a reset if it corresponds to a
right-zero semigroup R, that is,

    rᵢ ∘ rⱼ = rⱼ

for all rᵢ, rⱼ ∈ R.

A reset-identity machine R¹ = R ∪ {1} corresponds to a right-zero semigroup
R with 1 included as the identity. The reset-identity machine R¹ₙ denotes
a machine with n right zeros {r₁ₙ, r₂ₙ, ..., rₙₙ} and an identity element 1ₙ.
A permutation-reset machine has a semigroup (S, ∘) that is the union of a
set of right zeros R = {r₁, r₂, ..., rₙ} and a group G = {g₁, g₂, ..., g_m}. (The
product rᵢ ∘ gⱼ for i ∈ {1, ..., n} and j ∈ {1, ..., m} is defined to be rᵢ ∘ gⱼ = r_k
for some k ∈ {1, ..., n}. The remaining products are defined so that G forms a
group and R is a set of right zeros.)
A permutation-reset machine can be decomposed into a series-parallel pair
with the group machine G at the front-end and the reset-identity machine
R¹ = R ∪ {1} at the back-end. This construction can be found in [Arbib,
1968]. The Zeiger decomposition, a special case of the Krohn-Rhodes
decomposition, states that any general semigroup machine S may be broken
down into permutation-reset components; all groups involved are homomorphic
images of subgroups of S. More details and an outline of the procedure
may be found in [Arbib, 1968].
Next, the discussion shifts to redundant implementations for reset-identity
machines. By the Zeiger decomposition theorem, these machines together with
simple-group machines are the only building blocks needed to construct all
possible semigroup machines.
4.1 SEPARATE MONITORS FOR RESET MACHINES
For a right-zero semigroup R, any equivalence relation (i.e., any partitioning
of its elements) is a congruence relation [Grillet, 1995]. This result extends
easily to the monoid R¹ = R ∪ {1}: any partitioning of the elements of R¹ is a
congruence relation, as long as the identity forms its own partition. Using this
result, one can characterize and construct all possible (separate) monitors for a
given reset-identity machine R¹.
EXAMPLE 4.5 Consider the standard semigroup machine U₃ defined in
Section 2. Its next-state function is given by the following table:

    State |  Input:  1    r₁    r₂
    ------+---------------------------
    1     |          1    r₁    r₂
    r₁    |          r₁   r₁    r₂
    r₂    |          r₂   r₁    r₂
The only possible non-trivial partitioning is {{1}, {r₁, r₂}}; it results in the
parity semigroup P = {1_P, r}, defined by the surjective homomorphism θ :
U₃ → P with θ(1) = 1_P and θ(r₁) = θ(r₂) = r. Note that P is actually
isomorphic to U₁. As expected, under this monitoring scheme, machine P is
simply a coarser version of the original machine U₃.
EXAMPLE 4.6 Consider the reset-identity machine R¹₇ = {1₇, r₁₇, r₂₇, ..., r₇₇}.
A possible partitioning for it is {{1₇}, {r₁₇, r₂₇, ..., r₇₇}} and it results in the
same parity semigroup P = {1_P, r} as in the previous example. The surjective
homomorphism θ : R¹₇ → P is given by θ(1₇) = 1_P, θ(r₁₇) = θ(r₂₇) = ... =
θ(r₇₇) = r.
Other partitionings are also possible as long as the identity forms its own
class. This flexibility in the choice of partitioning can be exploited depending
on the faults expected in the original machine R¹₇ and the monitor P. For
example, if R¹₇ is implemented digitally (each state being encoded into three
bits), then one could choose the partitions so that they consist of states whose
encodings are separated by large Hamming distances. For example, if the
binary encodings for the states of R¹₇ are 000 for the identity, and 001, 010,
..., 111 for r₁₇ to r₇₇ respectively, then an appropriate partitioning could
be {P₀ = {000}, P₁ = {001, 010, 100, 111}, P₂ = {011, 101, 110}}. This
results in a monitoring machine with semigroup P ≅ U₃: state 000 maps to the
identity of U₃, whereas states in partition P₁ map to r₁ and states in partition P₂
map to r₂. Under this scheme, one can detect faults that cause single-bit errors
in the original machine as long as the monitoring machine operates correctly
(to see this, notice that the Hamming distance within each of the partitions is
larger than 1).
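The partition property invoked at the end of the example can be verified mechanically; the short sketch below uses the example's encodings (the map name theta is this sketch's own):

```python
# Sketch of the partition-based monitor (encodings as in the example).
P0 = {"000"}
P1 = {"001", "010", "100", "111"}
P2 = {"011", "101", "110"}

def hdist(a, b):
    """Hamming distance between two equal-length bit strings."""
    return sum(x != y for x, y in zip(a, b))

# Single-bit errors are detectable because the Hamming distance within each
# partition is larger than 1: a flipped bit moves the state to a different
# partition, which disagrees with the (fault-free) monitor's coarse state.
for part in (P0, P1, P2):
    assert all(hdist(a, b) > 1 for a in part for b in part if a != b)

# The monitor map theta onto P, isomorphic to U3 (0 = identity, 1 = r1, 2 = r2):
theta = {s: i for i, part in enumerate((P0, P1, P2)) for s in part}
assert theta["000"] == 0 and theta["111"] == 1 and theta["110"] == 2
```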
The scheme above can be made c-error correcting by ensuring that the Hamming
distance within any partition is at least 2c + 1 (still assuming no faults in the
monitoring machine). Under more restrictive fault models, other partitionings
could be more effective. For example, if faults in a given implementation cause
bits to stick at "1," then one should aim for partitions with states separated by
a large asymmetric distance [Rao, 1974].
4.2 NON-SEPARATE REDUNDANT IMPLEMENTATIONS FOR RESET MACHINES
A non-separate redundant implementation of a reset-identity machine R¹ₙ can
be based on an injective semigroup homomorphism φ : R¹ₙ → H that reflects
the state and input of R¹ₙ into a larger semigroup machine H so that Eq. (4.2)
is satisfied. Under proper initialization and fault-free conditions, machine H
simulates the reset-identity machine R¹ₙ; furthermore, since φ is injective, there
exists a mapping φ⁻¹ that can decode the state of H into the corresponding
state of R¹ₙ.
An interesting case occurs when the monoid R¹ₙ = {1ₙ, r₁ₙ, r₂ₙ, ..., rₙₙ} is
homomorphically embedded into a larger monoid R¹ₘ = {1ₘ, r₁ₘ, r₂ₘ, ..., rₘₘ}
for m > n (i.e., when H = R¹ₘ). The homomorphism φ : R¹ₙ → R¹ₘ is given
by φ(1ₙ) = 1ₘ and φ(rᵢₙ) ≠ φ(rⱼₙ) for i ≠ j, i, j in {1, 2, ..., n}. Clearly, φ
is injective and there is a one-to-one decoding mapping from the subsemigroup
R' = φ(R¹ₙ) ⊂ R¹ₘ onto R¹ₙ. Assuming that the system is implemented
digitally (i.e., each state is encoded as a binary vector), then, in order to protect
against single-bit errors, one would need to ensure that the encodings of the
states in the set of valid results R' are separated by large Hamming distances.
Bit errors can be detected by checking whether the resulting encoding is in
R'.
EXAMPLE 4.7 One way to add redundancy to the semigroup machine R¹₂ =
{1₂, r₁₂, r₂₂} is by mapping it into machine R¹₇. Any mapping φ of the form
φ(1₂) = 1₇, φ(r₁₂) = rᵢ₇ and φ(r₂₂) = rⱼ₇ (j, i ∈ {1, 2, ..., 7}, j ≠ i) is a
valid embedding. In order to achieve detection of single faults, each fault needs
to result in a state outside the set of valid results S'.
If machine R¹₂ is implemented digitally (with its states encoded into 3-bit
binary vectors), faults that result in single-bit errors can be detected by
choosing the encodings for φ(1₂) = 1₇, φ(r₁₂) = rᵢ₇ and φ(r₂₂) = rⱼ₇
to be separated by a Hamming distance of at
least 2 (e.g., 001 for 1₇, 010 for rᵢ₇ and 100 for rⱼ₇).
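The detection claim of this example can be checked exhaustively; a short sketch using the example's (illustrative) encodings 001, 010 and 100:

```python
# Sketch of Example 4.7's detection argument, with the example's 3-bit
# encodings for phi(1_2), phi(r_i7) and phi(r_j7).
valid = {"001", "010", "100"}

def hdist(a, b):
    return sum(x != y for x, y in zip(a, b))

# Pairwise Hamming distance is at least 2 ...
assert all(hdist(a, b) >= 2 for a in valid for b in valid if a != b)

# ... so every single-bit error lands outside the valid set and is detected.
for s in valid:
    for i in range(3):
        flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
        assert flipped not in valid
```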
5 SUMMARY
This chapter described redundant implementations for algebraic machines
(group and semigroup machines). The approach was hardware-independent
and resulted in redundant implementations that are based on algebraic
homomorphisms. Explicit connections with hardware faults and fault models were
not made. Using these techniques, one can take advantage of algebraic structure
in order to analyze procedures for error correction and avoid decompositions
under which faults in the original machine are always undetectable.
Notes
1 A similar argument can be made for left cosets.
2 Note that a group machine does not necessarily correspond to an abelian
group.
3 This assumption is realistic if the hardware implementation of the monitor
is considerably simpler than the implementation of the actual machine.
4 The Hamming distance between two binary vectors (x₁, x₂, ..., xₙ) and
(y₁, y₂, ..., yₙ) is the number of positions at which they differ. The minimum
Hamming distance of a given set of binary vectors of length n is the minimum
distance between any pair of binary vectors in the set.
5 The output of the encoder E in Figure 4.1 is based on the state of the coset
leader machine (c₁) and the overall input (g₂). In this particular example the
output functions like the carry bit in a binary adder: the coset leader machine
performs the addition of the least significant bits, whereas the subgroup
machine deals with the most significant bits.
References
Abraham, J. A. and Fuchs, W. K. (1986). Fault and error models for VLSI. Proceedings of the IEEE, 74(5):639-654.
Arbib, M. A., editor (1968). Algebraic Theory of Machines, Languages, and Semigroups. Academic Press, New York.
Arbib, M. A. (1969). Theories of Abstract Automata. Prentice-Hall, Englewood Cliffs, New Jersey.
Ginzburg, A. (1968). Algebraic Theory of Automata. Academic Press, New York.
Grillet, P. A. (1995). Semigroups. Marcel Dekker Inc., New York.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Iyengar, V. S. and Kinney, L. L. (1985). Concurrent fault detection in microprogrammed control units. IEEE Transactions on Computers, 34(9):810-821.
Jacobson, N. (1974). Basic Algebra I. W. H. Freeman and Company, San Francisco.
Johnson, B. (1989). Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley, Reading, Massachusetts.
Leveugle, R., Koren, Z., Koren, I., Saucier, G., and Wehn, N. (1994). The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398-1406.
Leveugle, R. and Saucier, G. (1990). Optimized synthesis of concurrently checked controllers. IEEE Transactions on Computers, 39(4):419-425.
Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1991). A methodology for designing optimal self-checking sequential circuits. In Proceedings of the Int. Conf. VLSI Design, pages 283-291. IEEE CS Press.
Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1995). Concurrent error detection using monitoring machines. IEEE Design and Test of Computers, 12(3):24-32.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Rao, T. R. N. (1974). Error Coding for Arithmetic Processors. Academic Press, New York.
Robinson, S. H. and Shen, J. P. (1992). Direct methods for synthesis of self-monitoring state machines. In Proceedings of the 22nd Fault-Tolerant Computing Symposium, pages 306-315. IEEE CS Press.
Chapter 5

REDUNDANT IMPLEMENTATIONS OF DISCRETE-TIME LINEAR TIME-INVARIANT DYNAMIC SYSTEMS

1 INTRODUCTION
This chapter discusses fault tolerance in discrete-time linear time-invariant
(LTI) dynamic systems [Hadjicostis and Verghese, 1997; Hadjicostis and
Verghese, 1999; Hadjicostis, 1999]. It focuses on redundant implementations that
reflect the state of the original system into a larger LTI dynamic system in a
linearly encoded form. In essence, this chapter restricts attention to discrete-time
LTI dynamic systems and linear coding techniques, both of which are rather
standard and well-developed topics in system theory and coding theory respectively.
Interestingly enough, the combination of linear dynamics and coding reveals
some novel aspects of the problem, as summarized by the characterization of
the class of appropriate redundant implementations given in Theorem 5.1. In
most of the fault-tolerance schemes discussed, error detection and correction is
performed at the end of each time step, although examples of non-concurrent
schemes are also presented [Hadjicostis, 2000; Hadjicostis, 2001].
The restriction to LTI dynamic systems allows the development of an explicit
mapping to a hardware implementation and an appropriate fault model. More
specifically, the hardware implementations of the fault-tolerant systems that are
constructed in this chapter are based on a certain class of signal flow graphs
(i.e., interconnections of delay, adder and gain elements) which allow each fault
in a system component (adder or multiplier) to be modeled as a corruption of a
single variable in the state vector.
2 DISCRETE-TIME LTI DYNAMIC SYSTEMS
Linear time-invariant (LTI) dynamic systems are used in digital filter design,
system simulation, model-based control, and other applications [Luenberger,
1979; Kailath, 1980; Roberts and Mullis, 1987]. The state evolution and output
of an LTI dynamic system S are given by
Aqs[t]
Cqs ttl
+ Bx[t] ,
+ Dx[t] ,
(5.1)
(5.2)
where t is the discrete-time index, qs[t] is the d-dimensional state vector, x[t]
is the u-dimensional input vector, y[t] is the v-dimensional output vector, and
A, B, C, D are constant matrices of appropriate dimensions. All vectors and
matrices have real numbers as entries.
Equivalent state-space models (with d-dimensional state vector q'_s[t] and
with the same input and output vectors) can be obtained through a similarity
transformation as described in [Luenberger, 1979; Kailath, 1980]:

    q'_s[t + 1] = (T⁻¹AT) q'_s[t] + (T⁻¹B) x[t] = A' q'_s[t] + B' x[t] ,
    y[t]        = (CT) q'_s[t] + D x[t]         = C' q'_s[t] + D' x[t] ,

where T is an invertible d × d matrix such that q_s[t] = T q'_s[t]. The initial
conditions for the transformed system can be obtained as q'_s[0] = T⁻¹ q_s[0].
Systems related in such a way are known as similar systems.
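Similarity is easy to confirm numerically; the sketch below uses illustrative matrices (not taken from the text) and checks that a system and its transform produce identical outputs for the same inputs:

```python
import numpy as np

# Numerical check that similar systems have identical input-output behavior.
A = np.array([[0.2, 0.0], [0.1, 0.5]]);  B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 1.0]]);              D = np.array([[0.0]])
T = np.array([[1.0, 1.0], [0.0, 1.0]])   # any invertible d x d matrix
Ti = np.linalg.inv(T)
Ap, Bp, Cp = Ti @ A @ T, Ti @ B, C @ T   # A' = T^{-1}AT, B' = T^{-1}B, C' = CT

q  = np.array([[1.0], [2.0]])            # q_s[0]
qp = Ti @ q                              # q'_s[0] = T^{-1} q_s[0]
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.standard_normal((1, 1))
    assert np.allclose(C @ q + D @ x, Cp @ qp + D @ x)   # same output y[t]
    q, qp = A @ q + B @ x, Ap @ qp + Bp @ x              # advance both systems
assert np.allclose(q, T @ qp)            # the relation q_s = T q'_s persists
```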
3 CHARACTERIZATION OF REDUNDANT IMPLEMENTATIONS
A redundant implementation for the LTI dynamic system S [with state
evolution as in Eq. (5.1)] is an LTI dynamic system H with dimension η (η = d + s,
s > 0) and state evolution

    q_h[t + 1] = 𝒜 q_h[t] + ℬ x[t] .        (5.3)

The initial state q_h[0], and the matrices 𝒜 and ℬ of the redundant system H, are
chosen so that there exists an appropriate one-to-one decoding mapping ℓ such
that, during fault-free operation,

    q_s[t] = ℓ(q_h[t])

for all t ≥ 0 [Hadjicostis and Verghese, 1997; Hadjicostis and Verghese, 1999;
Hadjicostis, 1999]. Note that according to the setup in Section 3 of Chapter 1,
ℓ is required to be one-to-one and is only defined from the subset of valid states
V (i.e., the set of states in H that are obtainable under fault-free conditions).
This means that each valid state q_h[t] ∈ V of the redundant system at any
time step t corresponds to a unique state q_s[t] of system S; in other words,
q_s[t] = ℓ(q_h[t]).
Note that faults in the implementation of the output [see Eq. (5.2)] affect the
output at a particular time step but have no propagation effects. For this reason,
they can be treated as faults in a combinational circuit and are not discussed
here. Instead, the analysis in this chapter focuses on protecting the mechanism
which performs the state evolution in Eq. (5.1). To achieve fault tolerance, the
state q_h[t] is encoded using a linear code. In other words, it is assumed that,
under proper initialization and fault-free conditions, there exist
• a d × η decoding matrix L such that q_s[t] = L q_h[t] for all t ≥ 0, q_h[·] ∈ V,
and

• an η × d encoding matrix G such that q_h[t] = G q_s[t] for all t ≥ 0.
The error detector/corrector does not have access to previous states or inputs
and has to make a decision at the end of each time step based solely on the state
q_h[t] of the redundant system. Since the construction of H and the choice of
initial condition ensure that, under fault-free conditions,

    q_h[t] = G q_s[t] ,

the error detection strategy only needs to verify that the redundant state vector
q_h[t] is in the column space of G. Equivalently, one can check that q_h[t] is
in the null space of an appropriate parity check matrix P (so that P q_h[t] = 0
under fault-free conditions). Any fault that forces the state q_h[t] to fall outside
the column space of G (producing a nonzero parity check p[t] = P q_h[t]) will
be detected.
For example, a corruption of the ith state variable at time step t will produce
an erroneous state vector

    q_e[t] = q_h[t] + eᵢ ,

where q_h[t] is the state vector that would have been obtained under fault-free
conditions and eᵢ is a column vector with a unique nonzero entry at the ith
position with value α. The parity check at time step t will then be

    p[t] = P(q_h[t] + eᵢ) = P q_h[t] + P eᵢ = P eᵢ = α P(:, i) ,
where P(:, i) denotes the ith column of matrix P. Single-error correction will
be possible if the columns of P are not multiples of each other. If this condition
is satisfied, one can locate and correct the corrupted state variable by identifying
the column of P that is a multiple of p[t]. The underlying assumption in this
discussion is that the error-detecting and correcting mechanism is fault-free.
This assumption is justified if the evaluation of P q_h[t], and all actions that may
be subsequently required for error correction, are considerably less complex
than the evaluation of 𝒜 q_h[t] + ℬ x[t]. This would be the case, for example,
if the size of P is much smaller than the size of 𝒜, or if P requires simpler
operations.
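The localization rule just described (find the column of P that is a multiple of p[t]) can be sketched as follows; the matrix P here is illustrative, not from the text:

```python
import numpy as np

P = np.array([[1., 1., 0., 1., 2.],
              [1., 0., 1., 2., 1.]])   # no column is a multiple of another

def locate(p, P):
    """Return (index, alpha) such that p = alpha * P[:, index], if any."""
    for i in range(P.shape[1]):
        col = P[:, i]
        k = int(np.argmax(np.abs(col)))         # a guaranteed nonzero entry
        alpha = p[k] / col[k]
        if not np.isclose(alpha, 0.0) and np.allclose(alpha * col, p):
            return i, alpha
    return None                                 # not a single-variable error

e = np.zeros(5)
e[3] = 0.25                            # corrupt state variable 3 by alpha = 0.25
assert locate(P @ e, P) == (3, 0.25)   # the syndrome pinpoints the variable
```

Because the columns are pairwise non-collinear, the matching column (and hence the corrupted variable and its error value) is unique.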
THEOREM 5.1 In the setting described above, the system H [of dimension η =
d + s, s > 0, and state evolution as in Eq. (5.3)] is a redundant implementation
of S if and only if it is similar to a standard redundant system H_σ whose state
evolution equation is given by

    q_σ[t + 1] = [ A   A₁₂ ] q_σ[t] + [ B ] x[t] ,        (5.4)
                 [ 0   A₂₂ ]          [ 0 ]

where the two block matrices are denoted A_σ and B_σ respectively. Here, A and
B are the matrices in Eq. (5.1), A₂₂ is an s × s matrix that describes
the dynamics of the redundant modes that have been added, and A₁₂ is a d × s
matrix that describes the coupling from the redundant to the non-redundant
modes. Associated with this standard redundant system are the standard decoding
matrix L_σ = [I_d  0], the standard encoding matrix G_σ = [I_d ; 0] (I_d stacked
over a zero block), and the standard parity check matrix P_σ = [0  I_s].
Proof: Let H be a redundant implementation of S. Under fault-free conditions,
L G q_s[·] = L q_h[·] = q_s[·]. Since the initial state q_s[0] could be any state,
one concludes that LG = I_d. This implies that L is full-row rank
and G is full-column rank, and that there exists an invertible η × η matrix T
such that LT = [I_d  0] and T⁻¹G = [I_d ; 0] [Hadjicostis and Verghese,
1997; Hadjicostis and Verghese, 1999; Hadjicostis, 1999]. If one applies the
transformation q_h[t] = T q'_h[t] to system H, the resulting similar system H'
has decoding mapping L' = LT = [I_d  0] and encoding mapping G' =
T⁻¹G = [I_d ; 0]. The state evolution of the redundant system H' is given by

    q'_h[t + 1] = (T⁻¹𝒜T) q'_h[t] + (T⁻¹ℬ) x[t] = 𝒜' q'_h[t] + ℬ' x[t] .        (5.5)

For all time steps t and under fault-free conditions,

    q'_h[t] = G' q_s[t] = [ q_s[t] ]
                          [   0    ] .

Combining the state evolution equations of the original and redundant
systems (Eqs. (5.1) and (5.5) respectively), one obtains

    [ q_s[t + 1] ]   [ 𝒜'₁₁  𝒜'₁₂ ] [ q_s[t] ]   [ ℬ'₁ ]
    [     0      ] = [ 𝒜'₂₁  𝒜'₂₂ ] [   0    ] + [ ℬ'₂ ] x[t] ,

with q_s[t + 1] = A q_s[t] + B x[t]. By setting the input x[t] ≡ 0 for all t, one
concludes that 𝒜'₁₁ = A and 𝒜'₂₁ = 0. With the input now allowed to be
nonzero, one can deduce that ℬ'₁ = B and ℬ'₂ = 0. The system H' is therefore
in the form of the standard system H_σ in Eq. (5.4), with the appropriate
decoding, encoding and parity check matrices.
The converse, namely that if H is similar to a standard H_σ as in Eq. (5.4),
then it is a redundant implementation of the system in Eq. (5.1), is easy to show
[Hadjicostis and Verghese, 1997; Hadjicostis, 1999].  □
Theorem 5.1 establishes a complete characterization of all possible redundant
implementations for a given LTI dynamic system, subject to the restriction
that linear encoding and decoding techniques are used. The additional modes
introduced by the redundancy never get excited under fault-free conditions
because they are initialized to zero and because they are unreachable from the
input. Due to the existence of the coupling matrix A₁₂, however, the additional
modes are not necessarily unobservable through the decoding matrix. The
continuous-time version of Theorem 5.1 essentially appears in [Ikeda and Siljak, 1984],
although the proof and the motivation are very different.
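That the redundant modes of the standard form stay unexcited can be checked numerically; the scalar blocks below are illustrative numbers, not values from the text:

```python
import numpy as np

# Standard redundant system H_sigma of Eq. (5.4) with d = s = 1.
A   = np.array([[0.5]]);  B   = np.array([[1.0]])
A12 = np.array([[0.3]]);  A22 = np.array([[0.7]])
A_sig = np.block([[A, A12], [np.zeros((1, 1)), A22]])
B_sig = np.vstack([B, np.zeros((1, 1))])

q = np.zeros((2, 1))                     # proper initialization (zero state)
for x in [1.0, -2.0, 0.5]:               # arbitrary inputs
    q = A_sig @ q + B_sig * x
assert q[1, 0] == 0.0                    # redundant mode is never excited
assert q[0, 0] != 0.0                    # the original mode does evolve
```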
4 HARDWARE IMPLEMENTATION AND FAULT MODEL
In order to demonstrate the implications of Theorem 5.1 to fault tolerance, a
more detailed discussion of the hardware implementation and the corresponding fault model is needed. The assumption made here is that the LTI dynamic
systems of interest [e.g., system S of Eq. (5.1) or system ?i of Eq. (5.3)] are
implemented using appropriately interconnected delays (memory elements),
adders and gain elements (multipliers). These implementations can be
represented by signal flow graphs or, equivalently, by delay-adder-gain diagrams.
These are shown in Figure 5.1 for an LTI dynamic system with state evolution

    q[t + 1] = [ q₁[t + 1] ] = [ 0   a₁ ] q[t] + [ 1 ] x[t] .
               [ q₂[t + 1] ]   [ 1   a₂ ]        [ 0 ]

Figure 5.1. Delay-adder-gain implementation and the corresponding signal flow graph for an
LTI dynamic system.

Nodes in a signal flow graph sum up all of their incoming arcs; delays are
represented by arcs labeled with z⁻¹.
The analysis in this chapter considers both transient and permanent faults
in the gains and adders of hardware implementations. A transient fault at time
step t causes errors at that particular time step but disappears at the following
ones. Therefore, if the errors are corrected before the initiation of time step
t + 1, the system will resume its normal mode of operation. A permanent
fault, on the other hand, causes errors at all remaining time steps. Notice that a
permanent fault can be treated as a transient fault for each of the remaining time
steps (assuming successful error correction at the end of every time step), but in
certain cases one can deal with it in more efficient ways (e.g., by reconfiguring
the system around the faulty component).
A given state evolution equation has multiple possible implementations using
delay, adder and gain elements [Roberts and Mullis, 1987]. In order to define a
unique mapping from a state evolution equation to a hardware implementation,
one can focus on implementations whose signal flow graphs have delay-free
paths of unit length. In other words, any path that does not include a delay has to
have unit length (the signal flow graph in Figure 5.1 is one such example). One
can verify that for implementations whose signal flow graphs have delay-free
paths of unit length, the entries of the matrices in the state evolution equation
are directly reflected as gain constants in the signal flow graph [Roberts and
Mullis, 1987]. In addition to the above property, each of the variables in the
next-state vector q[t + 1] is calculated using separate gain and adder elements
(sharing only the input x[t] and the variables in the previous state vector q[t]).
This means that a fault in a single gain element or in a single adder during time
step t will result in the corruption of a single state variable in the state vector
q[t + 1] (if the error is not accounted for, many more variables may be corrupted
at later time steps). In fact, any combination of faults in the gains or adders that
are used for the calculation of the next value of the ith state variable will only
result in the corruption of the ith state variable. More general descriptions can be
studied via factored state variable techniques [Roberts and Mullis, 1987], or by
employing the computation trees in [Chatterjee and d'Abreu, 1993], or by using
the techniques that will be discussed in Example 5.5; in these implementations,
however, a single fault may corrupt multiple state variables, so one has to be
careful when developing the fault model.
Note that according to the assumptions in this section, the standard redundant
system H_σ of Theorem 5.1 cannot be used to provide fault tolerance to system
S. Since hardware implementations employ delay-adder-gain circuits that have
delay-free paths of unit length, the implementation of H_σ will result in a
system that only identifies faults in the redundant part of the system. The reason
is that state variables in the lower part of q_σ[·] are not influenced by variables
in the upper part, and the parity check matrix, given by P_σ = [0  I_s], only
identifies faults in the added subsystem. The situation is similar to the one in
Example 4.4 in Chapter 4, where, under a particular group machine
decomposition, redundancy was useless because it was essentially protecting itself. The
objective should be to use the redundancy to protect the original system, not
to protect the redundancy itself. Theorem 5.1 is important, however, because
it provides a systematic way for searching among possible redundant
implementations for system S. Specifically, Theorem 5.1 characterizes all possible
redundant implementations that have the given (fixed) encoding, decoding and
parity check matrices (G, L and P respectively). Since the choice of matrices
A₁₂ and A₂₂ is completely free, there is an infinite number of redundant
implementations for system S. All of them have the same encoding, decoding
and parity check matrices, and offer the same concurrent error detection and
correction capabilities: depending on the redundancy in the parity check matrix
P, all of these implementations can detect and/or correct the same number of
errors in the state vector q_h[t].
5 EXAMPLES OF FAULT-TOLERANT SYSTEMS
This section discusses the implications of Theorem 5.1 through several
examples.

EXAMPLE 5.1 Consider the following original system S:
    q_s[t + 1] = [ .2   0   0   0 ]          [  3 ]
                 [  0  .5   0   0 ] q_s[t] + [ -1 ] x[t] .
                 [  0   0  .1   0 ]          [  7 ]
                 [  0   0   0  .6 ]          [  0 ]
One possibility for protecting this system against a single transient fault in
a gain element or in an adder, is to use three additional state variables. More
specifically, the standard redundant system can be
    q_σ[t + 1] = [ .2   0   0   0   0   0   0 ]          [  3 ]
                 [  0  .5   0   0   0   0   0 ]          [ -1 ]
                 [  0   0  .1   0   0   0   0 ]          [  7 ]
                 [  0   0   0  .6   0   0   0 ] q_σ[t] + [  0 ] x[t] ,
                 [  0   0   0   0  .2   0   0 ]          [  0 ]
                 [  0   0   0   0   0  .5   0 ]          [  0 ]
                 [  0   0   0   0   0   0  .3 ]          [  0 ]

i.e.,

    A₁₂ = 0 ,    A₂₂ = [ .2   0   0 ]
                       [  0  .5   0 ]
                       [  0   0  .3 ] .
The parity check matrix of the standard implementation is given by

    P_σ = [ 0 | I₃ ] = [ 0 0 0 0 1 0 0 ]
                       [ 0 0 0 0 0 1 0 ]
                       [ 0 0 0 0 0 0 1 ] .
For error detection, one needs to check whether P_σ q_σ[t] is 0. However, as
argued earlier, redundant systems in standard form cannot be used for detecting
faults that cause errors in the original state variables: given an erroneous state
vector q_e[t], a nonzero parity check (P_σ q_e[t] ≠ 0) would simply mean that a
fault has resulted in an error in the calculation of the redundant variables. The
goal is to protect against errors that affect the original system (i.e., errors that
appear in the original variables). One way to achieve this is to employ a system
similar to the standard redundant system, but with parity check matrix

    P = [ 1 1 1 0 1 0 0 ]
        [ 1 1 0 1 0 1 0 ]        (5.6)
        [ 1 0 1 1 0 0 1 ] .

(This choice of P is motivated by the structure of Hamming codes in
communications [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995].)

Table 5.1. Syndrome-based error detection and identification in Example 5.1.

    Parity check pᵀ[t] = [p₁[t]  p₂[t]  p₃[t]]   |   Erroneous state variable
    ---------------------------------------------+---------------------------
    [ c  c  c ]                                  |   q₁
    [ c  c  0 ]                                  |   q₂
    [ c  0  c ]                                  |   q₃
    [ 0  c  c ]                                  |   q₄
    [ c  0  0 ]                                  |   q₅
    [ 0  c  0 ]                                  |   q₆
    [ 0  0  c ]                                  |   q₇

(Here c denotes a nonzero entry.) With a suitable similarity transformation T
chosen so that P_σ T = P, the corresponding redundant system is
    q_h[t + 1] = (T⁻¹ A_σ T) q_h[t] + (T⁻¹ B_σ) x[t]

                 [ .2    0    0    0    0    0    0 ]          [   3 ]
                 [  0   .5    0    0    0    0    0 ]          [  -1 ]
                 [  0    0   .1    0    0    0    0 ]          [   7 ]
               = [  0    0    0   .6    0    0    0 ] q_h[t] + [   0 ] x[t] .        (5.7)
                 [  0  -.3   .1    0   .2    0    0 ]          [  -9 ]
                 [ .3    0    0  -.1    0   .5    0 ]          [  -2 ]
                 [ .1    0   .2  -.3    0    0   .3 ]          [ -10 ]
The above system can be used to detect and locate transient faults that cause
the value of a single state variable to be incorrect at a particular time step.
Under fault-free conditions, the parity vector p[t] = P q_h[t] is 0; furthermore,
any fault that results in the corruption of a single state variable can be identified
as shown in Table 5.1. For example, if p₁[t] ≠ 0, p₂[t] ≠ 0 and p₃[t] ≠ 0,
then one can conclude that a fault has corrupted q₁[t], the value of the first
state variable in q_h[t]; if p₁[t] ≠ 0, p₂[t] ≠ 0 and p₃[t] = 0, then a fault has
corrupted q₂[t]; and so forth. Once the erroneous variable is located, correction
can be based on any of the parity equations that involve the erroneous state
variable. For example, if q₂[t] is corrupted, one can calculate its correct value
by setting q₂[t] = -q₁[t] - q₃[t] - q₅[t] (i.e., using the parity equation defined by
the first row of matrix P). If faults are transient, the operation of the system
will resume normally in the following time steps.
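The detection and correction procedure of this example can be simulated directly. The sketch below assembles Eqs. (5.6) and (5.7) from their block structure; the encoding matrix G = [I₄ ; -P₁] used here is an assumption of this sketch (it follows from the requirement PG = 0), as are the test inputs:

```python
import numpy as np

# Example 5.1: redundant system of Eq. (5.7) with parity check P of Eq. (5.6).
A = np.diag([0.2, 0.5, 0.1, 0.6])
b = np.array([3.0, -1.0, 7.0, 0.0])
P1 = np.array([[1, 1, 1, 0],
               [1, 1, 0, 1],
               [1, 0, 1, 1]], dtype=float)
A22 = np.diag([0.2, 0.5, 0.3])
P = np.hstack([P1, np.eye(3)])                    # parity check matrix (5.6)
G = np.vstack([np.eye(4), -P1])                   # encoding: qh = G qs
Ah = np.block([[A, np.zeros((4, 3))],
               [A22 @ P1 - P1 @ A, A22]])         # state matrix of Eq. (5.7)
bh = G @ b                                        # input vector of Eq. (5.7)

def correct(qh, P):
    """Locate and undo a single-variable corruption from the syndrome P qh."""
    p = P @ qh
    if np.allclose(p, 0):
        return qh                                 # parity passes: no error
    for i in range(P.shape[1]):
        col = P[:, i]
        k = int(np.argmax(np.abs(col)))
        alpha = p[k] / col[k]
        if np.allclose(alpha * col, p):           # column i is a multiple of p
            out = qh.copy()
            out[i] -= alpha                       # remove the corruption
            return out
    raise ValueError("not a single-variable error")

qh = G @ np.array([1.0, 2.0, 3.0, 4.0])           # properly initialized state
qh = Ah @ qh + bh * 0.5                           # one fault-free step
assert np.allclose(P @ qh, 0)                     # parity check holds

faulty = qh.copy(); faulty[1] += 0.7              # transient fault corrupts q2
assert not np.allclose(P @ faulty, 0)             # ... which is detected
assert np.allclose(correct(faulty, P), qh)        # ... and corrected
```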
Hamming codes like the ones used in the above example allow for the
correction of faults that cause an error in a single variable of the state vector. Instead
of replicating the whole system (as would be required by a modular redundancy
approach), one only needs s additional state variables: as long as 2ˢ - 1 ≥ η
(where η = d + s is the dimension of the redundant system), one can guarantee
the existence of a Hamming code and a redundant implementation that
achieves single-error correction. An alternative approach is developed in
[Chatterjee and d'Abreu, 1993], where the authors used real-number coding schemes
and achieved single-error correction with only two additional state variables.
The methods used in [Chatterjee and d'Abreu, 1993] (as well as in [Chatterjee,
1991], where one of the authors of [Chatterjee and d'Abreu, 1993] analyzes
the continuous-time case) do not consider different similarity transformations
and do not permit the additional modes to be nonzero. The following example
illustrates some of the advantages obtained by using nonzero redundant modes.
EXAMPLE 5.2 Consider the LTI system with state evolution equation and hardware implementation as shown in Figure 5.2. Since the corresponding signal flow graph has delay-free paths of unit length, the entries of A and b are reflected directly as gains in the diagram (entries that are either "0" or "1" do not appear explicitly as gain elements). Furthermore, the only variables that are shared when calculating the next-state vector are the input and the previous state vector; no hardware is shared during the update of different state variables.

In order to detect a single fault in a gain element or in an adder, one can use an extra "checksum" state variable [Huang and Abraham, 1984; Chatterjee and
Redundant Implementations of Discrete-Time LTI Dynamic Systems

Figure 5.2. State evolution equation and hardware implementation of the digital filter in Example 5.2.

$$q_h[t+1] = \underbrace{\begin{bmatrix} 0 & 0 & 0 & -1/4 & 0 \\ 1 & 0 & 0 & 1/2 & 0 \\ 0 & 1 & 0 & -1/4 & 0 \\ 0 & 0 & 1 & 1/2 & 0 \\ 1 & 1 & 1 & 1/2 & 0 \end{bmatrix}}_{\bar{A}} q_h[t] + \underbrace{\begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}}_{\bar{B}} x[t]$$

Figure 5.3. Redundant implementation based on a checksum condition.
d'Abreu, 1993]. The resulting redundant implementation H has state evolution

$$q_h[t+1] = \underbrace{\begin{bmatrix} A & 0 \\ c^T A & 0 \end{bmatrix}}_{\bar{A}} q_h[t] + \underbrace{\begin{bmatrix} b \\ c^T b \end{bmatrix}}_{\bar{B}} x[t]$$

with $c^T = [\,1\;1\;1\;1\,]$. The corresponding delay-adder-gain implementation is shown in Figure 5.3. Note that there are a number of different delay-adder-gain diagrams that are consistent with the above state evolution equation; the one shown in Figure 5.3 is the only one consistent with the requirement that signal flow graphs have delay-free paths of unit length.
Under fault-free conditions, the first four state variables are the same as the
original state variables in system S; the additional state variable is always
equal to the sum of these four state variables. Error detection is based on
verifying the validity of this checksum condition at the end of each time step;
no complicated multiplications are involved, which may make it reasonable to
assume that error detection is fault-free.
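The checksum check is easy to exercise in simulation. The sketch below is illustrative (the input sequence and injected error size are my own choices): it runs the checksum-augmented filter of Example 5.2 and flags the injected error at the step where it occurs.

```python
import numpy as np

# Illustrative run of the concurrent checksum check for the redundant
# filter of Example 5.2: the fifth state variable tracks the sum of
# the first four, so a violated checksum flags an error immediately.
Abar = np.array([[0., 0., 0., -0.25, 0.],
                 [1., 0., 0.,  0.50, 0.],
                 [0., 1., 0., -0.25, 0.],
                 [0., 0., 1.,  0.50, 0.],
                 [1., 1., 1.,  0.50, 0.]])
Bbar = np.array([1., 0., 0., 0., 1.])

q = np.zeros(5)
rng = np.random.default_rng(0)
for t in range(20):
    q = Abar @ q + Bbar * rng.standard_normal()
    if t == 12:
        q[2] += 0.7                    # transient fault corrupts q3
    residual = q[4] - q[:4].sum()      # parity check p[t]
    assert (not np.isclose(residual, 0)) == (t == 12)
# One step after the fault, the (wrong) state satisfies the checksum
# again, which is why the check must be performed at every time step.
```

The last observation is exactly what motivates the parity checks "with memory" of Example 5.3 below.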
The above approach is seen to be consistent with the setup described in this
chapter. Specifically, the encoding, decoding and parity check matrices are
given by
$$G = \begin{bmatrix} I_4 \\ c^T \end{bmatrix} = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 1&1&1&1 \end{bmatrix}, \qquad L = [\,I_4 \mid 0\,] = \begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \end{bmatrix},$$

$$P = [\,-c^T \mid 1\,] = [\,-1\;\;-1\;\;-1\;\;-1 \mid 1\,].$$
Furthermore, if one uses the transformation matrix $T = \begin{bmatrix} I_4 & 0 \\ -c^T & 1 \end{bmatrix}$, one can show that system H is similar to a standard system H_σ with state evolution

$$q_\sigma[t+1] = \begin{bmatrix} A & 0 \\ 0 & 0 \end{bmatrix} q_\sigma[t] + \begin{bmatrix} b \\ 0 \end{bmatrix} x[t],$$

where A, b are the matrices in Figure 5.2. Notice that A12 and A22 in Eq. (5.4) of Theorem 5.1 have been set to zero.

As stated earlier, with each choice of A12 and A22, there is a different redundant implementation with the same encoding, decoding and parity check matrices. If, for example, one sets A12 = 0, A22 = [1] and then transforms back [Hadjicostis, 1999], the resulting redundant implementation H' has state evolution equation

$$q_h'[t+1] = \begin{bmatrix} 0 & 0 & 0 & -1/4 & 0 \\ 1 & 0 & 0 & 1/2 & 0 \\ 0 & 1 & 0 & -1/4 & 0 \\ 0 & 0 & 1 & 1/2 & 0 \\ 0 & 0 & 0 & -1/2 & 1 \end{bmatrix} q_h'[t] + \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} x[t].$$

Figure 5.4. Second redundant implementation based on a checksum condition.

The corresponding hardware implementation is shown in Figure 5.4.
Both redundant implementations H and H' have the same encoding, decoding and parity check matrices. Both are able to detect faults that corrupt a single state variable, such as a single fault in an adder or in a gain element. The (added) complexity in system H', however, is lower than that in system H (because the computation of the redundant state variable is less involved). More generally, as illustrated in this example for the case of a nonzero A22, one can explore different versions of redundant implementations by exploiting the dynamics of the redundant modes (A22) and/or their coupling with the original system (A12). For certain choices of A12 and A22, one may get designs that utilize less hardware than others or designs that can perform non-concurrent error detection and correction, as shown in the next example.
EXAMPLE 5.3 This example shows how nonzero redundant modes can be used to construct parity checks with "memory" [Hadjicostis, 2000]. The resulting parity checks "remember" an error and allow one to perform checking periodically (non-concurrently). Instead of checking at the end of each time step, one only checks once every N time steps and is still able to detect and identify transient faults that took place at earlier time steps.
Suppose that S is an LTI dynamic system as in Eq. (5.1). Starting from the standard redundant system H_σ in Eq. (5.4) with A12 = 0 and using the similarity transformation $q_\sigma[t] = T q_h[t]$ (where $T = \begin{bmatrix} I_d & 0 \\ -C & I_s \end{bmatrix}$ and C is a d × s matrix), one obtains the following redundant implementation H:

$$q_h[t+1] = \underbrace{\begin{bmatrix} A & 0 \\ CA - A_{22}C & A_{22} \end{bmatrix}}_{\bar{A}} q_h[t] + \underbrace{\begin{bmatrix} B \\ CB \end{bmatrix}}_{\bar{B}} x[t]$$
with encoding, decoding and parity check matrices given by

$$G = T^{-1} G_\sigma = \begin{bmatrix} I_d \\ C \end{bmatrix}, \qquad L = L_\sigma T = [\,I_d \;\; 0\,], \qquad P = P_\sigma T = [\,-C \;\; I_s\,].$$
Suppose that a transient fault (e.g., noise) at time step t corrupts the state of system H so that

$$q_h^f[t] = q_h[t] + e,$$

where q_h[t] is the state that would have been obtained under fault-free conditions and e is an additive error vector that models the effect of the transient fault. If the parity check is performed at the end of time step t, the following syndrome is obtained:

$$p[t] = P q_h^f[t] = P q_h[t] + P e = 0 + P e = [\,-C \;\; I_s\,]\, e.$$
For a transient fault that affects a single variable in the state vector, e will be a vector with a single nonzero entry. Therefore, one will be able to detect, identify and correct errors as long as the columns of P = [-C  I_s] are not multiples of each other. For example, if P is the parity check matrix of a Hamming code (as in Eq. (5.6) of Example 5.1), then one can easily perform error correction by first identifying the column of P that is a multiple of the obtained syndrome p[t], then determining e, and finally making the appropriate adjustment to the corrupted state variable.
When the parity check is performed only periodically (e.g., once every N time steps), the syndrome at time step t, given a fault at time step t - m (0 ≤ m ≤ N - 1), will be

$$p[t] = P \bar{A}^m e = A_{22}^m \,[\,-C \;\; I_s\,]\, e$$

(since $P\bar{A} = A_{22}P$, and assuming no other transient faults occurred between time steps t - m and t).
If A22 = 0, then the parity check will be 0 (i.e., e will go undetected). More generally, however, one can choose A22 so that the parity check will be nonzero. For example, if A22 = I_s, then the syndrome is the same as the one that would have been obtained at time step t - m. The problem is, of course, that m is unknown and, even though the initial error has been identified, it cannot be corrected because one does not know when it took place. This situation can be remedied if a different matrix A22 is chosen. For example, if P is the parity check matrix of a Hamming code and A22 is the diagonal matrix

$$A_{22} = \begin{bmatrix} 1 & & & \\ & 1/2 & & \\ & & \ddots & \\ & & & (1/2)^{s-1} \end{bmatrix},$$

then one can identify the corrupted state variable and find out when the corruption took place (i.e., what m is) [Hadjicostis, 2000]. The approach has been extended to handle multiple faults between periodic checks [Hadjicostis, 2001].
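A small numerical sketch of this non-concurrent scheme follows. The numbers are illustrative (a Hamming-type P = [-C I_3] and A22 = diag(1, 1/2, 1/4) of my own choosing): since P Ā = A22 P, a syndrome read m steps after a transient error e equals A22^m (P e), and scanning candidate delays recovers both m and the corrupted variable.

```python
import numpy as np

# Illustrative parity check "with memory": P @ Abar = A22 @ P, so a
# syndrome observed m steps after error e is p = A22**m @ (P @ e).
P = np.array([[1., 1., 1., 0., 1., 0., 0.],
              [1., 1., 0., 1., 0., 1., 0.],
              [1., 0., 1., 1., 0., 0., 1.]])
lam = np.array([1.0, 0.5, 0.25])       # diagonal of A22

def locate(p, n_checks=8):
    """Return (m, corrupted index) consistent with syndrome p."""
    for m in range(n_checks):
        r = p / lam**m                 # undo m applications of A22
        for j in range(P.shape[1]):
            col = P[:, j]
            k = np.flatnonzero(col)[0]
            alpha = r[k] / col[k]
            if alpha != 0 and np.allclose(r, alpha * col):
                return m, j
    return None

e = np.zeros(7); e[1] = 3.0            # error in the second variable
p = lam**4 * (P @ e)                   # syndrome read m = 4 steps later
print(locate(p))  # -> (4, 1)
```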
EXAMPLE 5.4 The TMR scheme in Figure 1.3 of Chapter 1 corresponds to a redundant implementation of the form

$$q_h[t+1] = \begin{bmatrix} q_h^1[t+1] \\ q_h^2[t+1] \\ q_h^3[t+1] \end{bmatrix} = \begin{bmatrix} A & 0 & 0 \\ 0 & A & 0 \\ 0 & 0 & A \end{bmatrix} q_h[t] + \begin{bmatrix} B \\ B \\ B \end{bmatrix} x[t],$$

where q_h^1[t], q_h^2[t] and q_h^3[t] evolve in the same way (because q_h^1[0] = q_h^2[0] = q_h^3[0] = q_s[0]) and each is calculated using a separate set of delays, adders and gains.

The encoding matrix G is given by $\begin{bmatrix} I_d \\ I_d \\ I_d \end{bmatrix}$, the decoding mapping L can be $[\,I_d \;\; 0 \;\; 0\,]$ and the parity check matrix P can be $\begin{bmatrix} -I_d & I_d & 0 \\ -I_d & 0 & I_d \end{bmatrix}$ (other decoding and parity check matrices are also possible). A nonzero entry in the upper (respectively lower) half of P q_h[t] indicates a fault in the second system replica (respectively the third). Nonzero entries in both the top and bottom half-vectors indicate a fault in the first system replica.
The TMR system is shown (for example, with transformation matrix $T = \begin{bmatrix} I_d & 0 & 0 \\ -I_d & I_d & 0 \\ -I_d & 0 & I_d \end{bmatrix}$) to be similar to

$$q_\sigma[t+1] = \begin{bmatrix} A & 0 & 0 \\ 0 & A & 0 \\ 0 & 0 & A \end{bmatrix} q_\sigma[t] + \begin{bmatrix} B \\ 0 \\ 0 \end{bmatrix} x[t],$$

which is of the form depicted in Theorem 5.1. All variables of the original system are replicated twice and no coupling is involved from the redundant to the original modes, i.e., A12 = 0, A22 = $\begin{bmatrix} A & 0 \\ 0 & A \end{bmatrix}$.
Once the encoding matrix G is fixed, the additional freedom in choosing the decoding matrix L can be exploited. For example, if there is a permanent fault in the first system replica, one can change the decoding matrix to L = [0  I_d  0] to ensure that the final output is correct. This idea is discussed further in the next example.
EXAMPLE 5.5 In the TMR case, a permanent fault that corrupts the first system replica (by corrupting gains or adders in its hardware implementation) can be handled by switching the decoding matrix from L = [I_d  0  0] to L = [0  I_d  0] (or L = [0  0  I_d], or others) and by ignoring the state of the first system replica in any subsequent error detection and correction procedures. For instance, once a permanent fault corrupts the first subsystem, error correction becomes impossible, but error detection can be achieved by comparing the states of the two remaining system replicas. This idea was formalized and generalized in [Hadjicostis and Verghese, 1997].
Consider the redundant system H whose state evolution equation is given by Eq. (5.3) and whose hardware implementation uses delay-adder-gain implementations with delay-free paths of unit length. Under fault-free conditions, q_s[t] = L q_h[t] for all t. A permanent fault in a gain element manifests itself as a corrupted entry in matrices A or B. The ith state variable in q_h[t] (and other variables at later time steps) will be corrupted if some of the entries in A(i,:) or in B(i,:) are affected right after time step t - 1 (A(i,:) denotes the ith row of matrix A and B(i,:) denotes the ith row of matrix B). If the erroneous state variable at time t is detected and located (e.g., using the techniques in Example 5.1), one can attempt to adjust the decoding matrix L to a new matrix L_a so that the decoded state is error-free. The question addressed in [Hadjicostis and Verghese, 1997] concentrated on characterizing the possible choices for L_a given the corruption of certain gain or adder elements.
If at time step t the ith state variable is corrupted, then all state variables whose updates directly depend on the ith state variable will be corrupted at time step t + 1 (let M_{i1} be the set of indices of these state variables, including i); at time step t + 2, the state variables with indices in set M_{i1} will corrupt the state variables that depend on them; let their indices be in set M_{i2} (which includes M_{i1}); and so on. Eventually, the final set of indices for all corrupted state variables is given by the set M_{if} (note that M_{if} = M_{iη} = M_{i1} ∪ M_{i2} ∪ M_{i3} ∪ ... ∪ M_{iη}). The sets of indices M_{if} for all i in {1, 2, ..., η} can be pre-calculated in an efficient manner by computing R(A), the reachability matrix of A [Norton, 1980].
Once an error is detected at the ith state variable, the new decoding matrix L_a (if it exists) should not make use of state variables with indices in M_{if}. Equivalently, one can ask the question: does there exist a decoding matrix L_a such that L_a G_a = I_d? Here, G_a is the same as the original encoding matrix G except that G_a(i,:) is set to zero for all i in M_{if}. If G_a is full-column rank, such an L_a exists (in fact, any L_a that satisfies L_a G_a = I_d is suitable) and the redundant system can withstand permanent corruptions in any of the entries in the ith row of A and/or B.
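This rank test is easy to mechanize. The following sketch is illustrative (the TMR check at the end uses a scalar replicated system of my own choosing): it computes the index set of eventually corrupted variables from the nonzero pattern of the state matrix and then checks whether the punctured encoding matrix G_a keeps full column rank.

```python
import numpy as np

# Sketch of the reconfiguration test: zero the rows of G indexed by the
# corrupted set and check that some L_a with L_a G_a = I still exists.

def corrupted_set(A, i):
    # indices reachable from i through the nonzero pattern of A
    # (variable j feeds variable k whenever A[k, j] != 0)
    reach = {i}
    while True:
        new = {int(k) for j in reach for k in np.flatnonzero(A[:, j])}
        if new <= reach:
            return reach
        reach |= new

def can_reconfigure(A, G, i):
    Ga = G.copy()
    Ga[sorted(corrupted_set(A, i)), :] = 0
    return np.linalg.matrix_rank(Ga) == G.shape[1]

# e.g. TMR of a scalar system (a = 0.5, replicated three times): a
# fault in the first replica leaves the other two usable for decoding.
A_tmr = np.diag([0.5, 0.5, 0.5])
G_tmr = np.ones((3, 1))
print(can_reconfigure(A_tmr, G_tmr, 0))  # -> True
```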
TMR is clearly a special case of the above formulation: corruption of a variable in the state of the first system replica is guaranteed to only affect the first system replica. Therefore, M_{if} ⊆ {1, 2, ..., d} and (conservatively) G_a = [0  I_d  I_d]^T. One possibility for L_a is [0  I_d  0].
Less obvious is the following case: consider the system in Example 5.1 with state evolution as in Eq. (5.7). Its decoding matrix is given by L = [I_4  0]. If A(2,2) (whose value is .5) becomes corrupted, then the set of indices of corrupted state variables is M_{2f} = {2, 5}. The original encoding matrix G, the new encoding matrix G_a [resulting after the corruption of entry A(2,2)] and a suitable L_a are shown below:
$$G = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ -1 & -1 & -1 & 0 \\ -1 & -1 & 0 & -1 \\ -1 & 0 & -1 & -1 \end{bmatrix}, \qquad G_a = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ -1 & -1 & 0 & -1 \\ -1 & 0 & -1 & -1 \end{bmatrix},$$

$$L_a = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 0 & 0 \\ -1 & 0 & 0 & -1 & 0 & -1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 & 0 \end{bmatrix}.$$
Using the above L_a, the redundant system can continue to function properly and provide the correct state vector q_s[t] despite the corrupted entry A(2,2). The parity check matrix of Eq. (5.6) can still be used for error detection, except that the checks involving the second and/or fifth state variables (i.e., checks corresponding to the first and second rows of P) will be invalid. Error detection is still an option, but one has to rely solely on the parity check given by the third row of P.
6
SUMMARY
This chapter studied redundant implementations of LTI dynamic systems.
It showed that the set of available redundant implementations for concurrent
error detection and correction is enriched by the dynamics and coupling that
can be introduced by redundancy. The redundant implementation essentially
augments the original system with redundant modes that are unreachable but
observable under fault-free conditions. Because these additional modes are
not excited initially, they manifest themselves only when a fault takes place.
The resulting characterization resembles the treatment of continuous-time LTI
system "inclusion" in [Ikeda and Siljak, 1984].
An explicit mapping to hardware (using delay, adder and gain elements) allowed the development of a fault model that maps a single fault in an adder
or in a multiplier to an error in a single state variable. By employing linear coding techniques, this chapter developed a wide variety of schemes that
can detect/correct a fixed number of faults (or, equivalently, errors in a fixed
number of state variables). This chapter established that for a particular error
detection/correction scheme there exists a class of possible implementations,
some of which make better use of additional hardware or have other desirable
properties, such as reconfigurability or memory. Criteria to "optimally" select
the "best" possible redundant implementation were not directly addressed; the
examples, however, presented a variety of open questions for future research.
Notes
1 The check matrix can be P' = [0  Θ], where Θ is any invertible s × s matrix; a trivial similarity transformation will ensure that the parity check matrix takes the form [0  I_s], while keeping the system in the standard form H_σ in Eq. (5.4), except with A12 = A'_{12}Θ and A22 = Θ^{-1}A'_{22}Θ.
References
Blahut, R. E. (1983). Theory and Practice of Data Transmission Codes. Addison-Wesley, Reading, Massachusetts.
Chatterjee, A. (1991). Concurrent error detection in linear analog and switched-capacitor state variable systems using continuous checksums. In Proceedings of the Int. Test Conference, pages 582-591.
Chatterjee, A. and d'Abreu, M. (1993). The design of fault-tolerant linear digital state variable systems: Theory and techniques. IEEE Transactions on Computers, 42(7):794-808.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. (2000). Fault-tolerant discrete-time linear time-invariant filters. In Proceedings of ICASSP 2000, the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pages 3311-3314.
Hadjicostis, C. N. (2001). Non-concurrent error detection and correction in discrete-time LTI dynamic systems. In Proceedings of the 40th IEEE Conf. on Decision and Control.
Hadjicostis, C. N. and Verghese, G. C. (1997). Fault-tolerant design of linear time-invariant systems in state form. In Proceedings of the 5th IEEE Mediterranean Conf. on Control and Systems.
Hadjicostis, C. N. and Verghese, G. C. (1999). Structured redundancy for fault tolerance in LTI state-space models and Petri nets. Kybernetika, 35(1):39-55.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Ikeda, M. and Siljak, D. D. (1984). An inclusion principle for dynamic systems. IEEE Transactions on Automatic Control, 29(3):244-249.
Kailath, T. (1980). Linear Systems. Prentice-Hall, Englewood Cliffs, New Jersey.
Luenberger, D. G. (1979). Introduction to Dynamic Systems: Theory, Models, & Applications. John Wiley & Sons, New York.
Norton, J. P. (1980). Structural zeros in the modal matrix and its inverse. IEEE Transactions on Automatic Control, 25(10):980-981.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Roberts, R. A. and Mullis, C. T. (1987). Digital Signal Processing. Addison-Wesley, Reading, Massachusetts.
Wicker, S. B. (1995). Error Control Systems. Prentice Hall, Englewood Cliffs, New Jersey.
Chapter 6
REDUNDANT IMPLEMENTATIONS OF
LINEAR FINITE-STATE MACHINES
1
INTRODUCTION
This chapter applies techniques similar to those of Chapter 5 to provide
fault tolerance to linear finite-state machines (LFSM's) [Hadjicostis, 1999].
The discussion focuses on linear encoding techniques and, as in Chapter 5,
results in a complete characterization of the class of appropriate redundant
implementations. It is shown that, for a given LFSM and a given linear encoding,
there exists a variety of possible implementations and that different criteria
can be used to choose the most desirable one [Hadjicostis, 2000; Hadjicostis
and Verghese, 2002]. The implications of this approach are demonstrated by
studying hardware implementations that use interconnections of 2-input XOR
gates and single-bit memory elements (flip-flops). The redundancy in the state
representation (which essentially appears as a linearly encoded binary vector) is
used by an external, fault-free mechanism to perform concurrent error detection
and correction at the end of each time step. The assumption of a fault-free error
corrector is relaxed in Chapter 7.
2
LINEAR FINITE-STATE MACHINES
Linear finite-state machines (LFSM's) form a general class of finite-state machines with a variety of applications [Booth, 1968; Harrison, 1969]. They include linear feedback shift registers [Golomb, 1967; Martin, 1969; Daehn et al., 1990; Damiani et al., 1991], sequence enumerators and random number generators [Golomb, 1967], encoders and decoders for linear error-correcting codes [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995], and cellular automata [Cattell and Muzio, 1996; Chakraborty et al., 1996]. A discussion of the power of LFSM's and related references can be found in [Zeigler, 1973].
Figure 6.1. Hardware implementation of the linear feedback shift register in Example 6.1.
The state evolution of an LFSM is given by
$$\begin{aligned} q_s[t+1] &= A q_s[t] \oplus B x[t], \\ y[t+1] &= C q_s[t] \oplus D x[t], \end{aligned} \qquad (6.1)$$
where t is the discrete-time index, q_s[t] is the d-dimensional state vector, x[t] is the u-dimensional input vector and y[t] is the v-dimensional output vector. Vectors and matrices have entries from GF(2), the Galois field¹ of order 2, i.e., they are either "0" or "1" (more generally, they can be drawn from any finite field). Matrix-vector multiplication and vector-vector addition are performed as usual, except that element-wise addition and multiplication are taken modulo-2. Operation ⊕ in the above equations denotes vector addition modulo-2. As
in Chapter 5, faults in the mechanism that calculates the output (based on the
current state and input) can be treated as faults in a combinational system;
thus, this chapter focuses on protecting against faults in the state evolution
mechanism.
EXAMPLE 6.1 The linear feedback shift register (LFSR) in Figure 6.1 is implemented using single-bit memory elements (flip-flops) and 2-input XOR gates. Flip-flops are capable of storing a single bit ("0" or "1") and 2-input XOR gates, denoted by ⊕ in the figure, perform modulo-2 addition on their binary inputs. The LFSR in Figure 6.1 is an LFSM with state evolution
$$q_s[t+1] = A q_s[t] \oplus b\, x[t] = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 \end{bmatrix} q_s[t] \oplus \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} x[t].$$
Note that when x[·] = 0 and q_s[0] ≠ 0, the LFSR acts as an autonomous sequence enumerator. It goes through all nonzero states (essentially counting from 1 to 31): if initialized at q_s[0] = [1 0 0 0 0]^T, the LFSR goes through states q_s[1] = [0 1 0 0 0]^T, q_s[2] = [0 0 1 0 0]^T, ..., q_s[30] = [0 1 0 0 1]^T, q_s[31] = [1 0 0 0 0]^T, and so forth. In essence, the LFSR acts as an autonomous sequence generator (counter).
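This counting behavior is easy to confirm in simulation; the sketch below steps the autonomous machine (x[·] = 0) with the A matrix of this example over GF(2).

```python
import numpy as np

# The autonomous LFSR of Example 6.1 should visit all 31 nonzero
# states before returning to its starting state.
A = np.array([[0, 0, 0, 0, 1],
              [1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1],
              [0, 0, 1, 0, 0],
              [0, 0, 0, 1, 0]], dtype=np.uint8)

q = np.array([1, 0, 0, 0, 0], dtype=np.uint8)
seen = set()
for _ in range(31):
    seen.add(tuple(q))
    q = (A @ q) % 2                  # state update modulo 2
assert tuple(q) == (1, 0, 0, 0, 0)   # back to q_s[0] after 31 steps
print(len(seen))  # -> 31
```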
For a state evolution of the form of Eq. (6.1), there are a number of implementations with 2-input XOR gates and flip-flops. If the hardware implementations
correspond to signal flow graphs whose delay-free paths are of unit length, then,
the following are true:
(i) Each bit in the next-state vector qs [t + 1] is calculated using a separate set
of 2-input XOR gates; thus, a fault in a single XOR gate can corrupt at most
one bit in the next-state vector qs[t + 1].
(ii) The calculation of each bit in q_s[t + 1] is based on the bits of q_s[t] that are explicitly specified by the "1s" in matrix A of the state evolution equation (e.g., the third bit of q_s[t + 1] in Example 6.1 is calculated based on the second and fifth bits of q_s[t]).
An LFSM S' (with d-dimensional state vector q_s'[t]) is similar to LFSM S [in Eq. (6.1)] if

$$q_s'[t+1] = \underbrace{(T^{-1}AT)}_{A'} q_s'[t] \oplus \underbrace{(T^{-1}B)}_{B'} x[t] = A' q_s'[t] \oplus B' x[t],$$

where T is an invertible d × d binary matrix such that q_s[t] = T q_s'[t] [Booth, 1968; Harrison, 1969]. The initial conditions for the transformed LFSM can be obtained as q_s'[0] = T^{-1} q_s[0].
It can be shown that any LFSM with state evolution as in Eq. (6.1) can be put via a similarity transformation in a form where the matrix A' is in classical canonical form [Booth, 1968]. More specifically, A' has the block-diagonal structure

$$A' = \begin{bmatrix} A_1 & & \\ & \ddots & \\ & & A_p \end{bmatrix},$$

where each A_i (1 ≤ i ≤ p) also has a block-diagonal structure

$$A_i = \begin{bmatrix} C_{i1} & & \\ & \ddots & \\ & & C_{iq} \end{bmatrix},$$

and where each C_{ij} (1 ≤ j ≤ q) looks like

$$C_{ij} = \begin{bmatrix} D_{ij} & & & \\ E_{ij} & D_{ij} & & \\ & \ddots & \ddots & \\ & & E_{ij} & D_{ij} \end{bmatrix},$$

with D_{ij} and E_{ij} as follows:

$$D_{ij} = \begin{bmatrix} 0 & 0 & \cdots & 0 & * \\ 1 & 0 & \cdots & 0 & * \\ 0 & 1 & \cdots & 0 & * \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & * \end{bmatrix}, \qquad E_{ij} = \begin{bmatrix} 0 & \cdots & 0 & 1 \\ 0 & \cdots & 0 & 0 \\ \vdots & & & \vdots \\ 0 & \cdots & 0 & 0 \end{bmatrix}.$$

Each "*" could be a "0" or a "1." What is important about this form is that there are at most two "1s" in each row of A', which implies that each bit in the next-state vector q_s'[t + 1] can be generated based on at most two bits of the current state vector q_s'[t].
3
CHARACTERIZATION OF REDUNDANT
IMPLEMENTATIONS
An LFSM S [with d state variables and state evolution as in Eq. (6.1)] can be embedded into a redundant LFSM H with η state variables (η = d + s, s > 0) and state evolution

$$q_h[t+1] = \bar{A} q_h[t] \oplus \bar{B} x[t], \qquad (6.2)$$

where the initial state q_h[0] and matrices Ā, B̄ are chosen so that the error-free state q_h[t] of H at time step t provides complete information about q_s[t], the state of the original LFSM [Hadjicostis, 1999; Hadjicostis, 2000; Hadjicostis and Verghese, 2002]. More specifically, the redundant machine H concurrently simulates the original machine S so that, for an appropriate decoding mapping ℓ,

$$q_s[t] = \ell(q_h[t])$$

for all time steps t. Furthermore, mapping ℓ is required to be one-to-one so that there is a unique correspondence between the states in S and the states in H, i.e.,

$$q_h[t] = \ell^{-1}(q_s[t])$$

for all time steps t (as long as no faults take place).
As in Chapter 5, the analysis is easier if one restricts decoding and encoding to be linear in GF(2). In other words, there exist

• a d × η binary decoding matrix L such that, under proper initialization and fault-free conditions, q_s[t] = L q_h[t] for all t, and

• an η × d binary encoding matrix G such that, under proper initialization and fault-free conditions, q_h[t] = G q_s[t] for all t.

Note that L and G need to satisfy LG = I_d, where I_d is the d × d identity matrix in GF(2).
Under the above assumptions, the redundant machine H enforces an (η, d) linear code on the state of the original machine [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]. An (η, d) linear code uses η bits to represent d bits of information and is defined in GF(2) by an η × d generator matrix G with full-column rank. When no faults have taken place, the d-dimensional state vector at time t is uniquely represented by the η-dimensional vector (codeword) q_h[t] = G q_s[t].
Error detection is straightforward: under fault-free conditions, the redundant state vector must be in the column space of G; therefore, all that needs to be checked is that the redundant state q_h[t] lies in the column space of G (in coding theory terminology, one needs to check that q_h[t] is a codeword of the linear code that is generated by G [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]). Equivalently, one can check that q_h[t] is in the null space of an appropriate parity check matrix P, so that P q_h[t] = 0. The parity check matrix has row rank η - d = s and satisfies PG = 0. Error correction associates with each valid state in H (of the form G q_s[t]) a unique subset of invalid states that get corrected to that particular valid state. This subset usually contains η-dimensional vectors with small Hamming distance from the associated valid codeword. Error correction can be performed using any of the methods employed in the communications setting (e.g., syndrome table decoding or iterative decoding [Gallager, 1963; Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]).
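These conditions can be checked concretely for the simplest case, a systematic code with a single checksum bit (the construction that reappears in Example 6.2 below). The sketch is illustrative: it verifies PG = 0 over GF(2), that fault-free states pass the parity check, and that any single-bit error is caught.

```python
import numpy as np

# Single-checksum systematic code over GF(2): G = [I; c^T], P = [c^T | 1].
d = 5
G = np.vstack([np.eye(d, dtype=np.uint8),
               np.ones((1, d), dtype=np.uint8)])     # eta x d
P = np.hstack([np.ones((1, d), dtype=np.uint8),
               np.eye(1, dtype=np.uint8)])           # s x eta

assert not ((P @ G) % 2).any()       # PG = 0 over GF(2)

qs = np.array([1, 0, 1, 1, 0], dtype=np.uint8)
qh = (G @ qs) % 2                    # codeword: state plus checksum bit
assert not ((P @ qh) % 2).any()      # fault-free state passes the check
qh[2] ^= 1                           # single-bit error
assert ((P @ qh) % 2).any()          # parity check catches it
```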
The following theorem provides a parameterization of all redundant implementations for a given LFSM under a linear encoding scheme.

THEOREM 6.1 In the setting described above, the LFSM H [of dimension η = d + s, s > 0, and state evolution as in Eq. (6.2)] is a redundant implementation of S if and only if it is similar to a standard redundant LFSM H_σ whose state evolution equation is given by

$$q_\sigma[t+1] = \begin{bmatrix} A & A_{12} \\ 0 & A_{22} \end{bmatrix} q_\sigma[t] \oplus \begin{bmatrix} B \\ 0 \end{bmatrix} x[t]. \qquad (6.3)$$

Here, A and B are the matrices in Eq. (6.1), A22 is an s × s binary matrix that describes the dynamics of the redundant modes that have been added, and A12 is a d × s binary matrix that describes the coupling from the redundant to the non-redundant modes. Associated with this standard redundant LFSM are the standard decoding matrix L_σ = [I_d  0], the standard encoding matrix G_σ = $\begin{bmatrix} I_d \\ 0 \end{bmatrix}$ and the standard parity check matrix P_σ = [0  I_s].
Proof: The proof is similar to the proof of Theorem 5.1 in Chapter 5 and is omitted. □
4
EXAMPLES OF FAULT-TOLERANT SYSTEMS
Given an LFSM S, and appropriate L and G (so that LG = I_d), Theorem 6.1 characterizes all possible redundant LFSM's H. Since the choice of the binary matrices A12 and A22 is completely free, there are multiple redundant implementations of LFSM S for the given L and G. This section demonstrates how different implementations for LFSM's can be exploited to minimize redundant hardware overhead.
EXAMPLE 6.2 In order to detect a single fault in an XOR gate of the LFSR implementation in Figure 6.1, an extra "checksum" state variable can be used. Following what was suggested for linear time-invariant dynamic systems in [Huang and Abraham, 1984] and for LFSM's in [Larsen and Reed, 1972; Sengupta et al., 1981], one obtains the following redundant LFSM H:

$$q_h[t+1] = \begin{bmatrix} A & 0 \\ c^T A & 0 \end{bmatrix} q_h[t] \oplus \begin{bmatrix} b \\ c^T b \end{bmatrix} x[t],$$

where c^T = [1 1 1 1 1], i.e.,

$$q_h[t+1] = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 \end{bmatrix} q_h[t] \oplus \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} x[t].$$
Under fault-free conditions, the added state variable is always the sum
modulo-2 of all other state variables (which are the same as the original state
variables in LFSM S). The encoding, decoding and parity check matrices are
given by

$$G = \begin{bmatrix} I_5 \\ c^T \end{bmatrix} = \begin{bmatrix} 1&0&0&0&0 \\ 0&1&0&0&0 \\ 0&0&1&0&0 \\ 0&0&0&1&0 \\ 0&0&0&0&1 \\ 1&1&1&1&1 \end{bmatrix}, \qquad L = [\,I_5 \mid 0\,] = \begin{bmatrix} 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \end{bmatrix},$$

$$P = [\,-c^T \mid 1\,] = [\,-1\;-1\;-1\;-1\;-1 \mid 1\,] = [\,c^T \mid 1\,] = [\,1\;1\;1\;1\;1 \mid 1\,].$$
(Note that "-1" is the same as "+1" when performing addition and multiplication modulo-2.) Using the similarity transformation q_σ[t] = T q_h[t], where $T = \begin{bmatrix} I_5 & 0 \\ c^T & 1 \end{bmatrix}$, one sees that, just as predicted by Theorem 6.1, H is similar to a standard redundant LFSM H_σ with state evolution given by

$$q_\sigma[t+1] = \begin{bmatrix} A & 0 \\ 0 & 0 \end{bmatrix} q_\sigma[t] \oplus \begin{bmatrix} b \\ 0 \end{bmatrix} x[t].$$

Note that both A12 and A22 have been set to zero.
As stated earlier, there are multiple redundant implementations with the same encoding, decoding and parity check matrices. For the scenario described here, there are exactly 2^6 different LFSM's (each combination of choices for entries in matrices A12 and A22 results in a different redundant implementation). One such choice is to let

$$A_{12} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \qquad A_{22} = [\,1\,],$$
and use a transformation with the same similarity matrix (q_σ[t] = T q_h'[t], $T = \begin{bmatrix} I_5 & 0 \\ c^T & 1 \end{bmatrix}$) to get a redundant LFSM H' with state evolution equation

$$q_h'[t+1] = \begin{bmatrix} A & 0 \\ c^T A \oplus A_{22} c^T & A_{22} \end{bmatrix} q_h'[t] \oplus \begin{bmatrix} b \\ c^T b \end{bmatrix} x[t],$$

or
$$q_h'[t+1] = \begin{bmatrix} 0 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \end{bmatrix} q_h'[t] \oplus \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} x[t].$$
Both redundant LFSM's H and H' have the same encoding, decoding and parity check matrices, and both are able to concurrently detect single-bit errors in the redundant state vector. Furthermore, according to the assumptions about hardware implementation in Section 2, they are both able to detect a fault in a single XOR gate. Evidently, the complexity of H' is lower than the complexity of H. More generally, as illustrated in this example for the case of a nonzero A22, one can obtain more efficient redundant implementations by exploiting the dynamics of the redundant modes (given by A22) and/or their coupling with the original system (given by A12).
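Both claims can be checked numerically. The sketch below is illustrative: it verifies that each machine preserves the checksum bit through a state update, and counts the 2-input XOR gates implied by each state matrix (ones per row, minus one, ignoring the input tap).

```python
import numpy as np

# Example 6.2's two embeddings of the LFSR: [A 0; c^T A + A22 c^T, A22]
# for A22 = 0 (machine H) and A22 = 1 (machine H').
A = np.array([[0,0,0,0,1],[1,0,0,0,0],[0,1,0,0,1],
              [0,0,1,0,0],[0,0,0,1,0]], dtype=np.uint8)
b = np.array([1,0,0,0,0], dtype=np.uint8)
c = np.ones(5, dtype=np.uint8)

def embed(A22):
    top = np.hstack([A, np.zeros((5, 1), np.uint8)])
    bot = np.hstack([(c @ A + A22 * c) % 2, [A22]])
    return np.vstack([top, bot])

H, Hp = embed(0), embed(1)
Bbar = np.append(b, (c @ b) % 2)

q = np.append(np.array([1, 0, 1, 1, 0], np.uint8), 1)  # checksum-consistent
for Abar in (H, Hp):
    qn = (Abar @ q + Bbar) % 2        # one step with input x[t] = 1
    assert qn[5] == qn[:5].sum() % 2  # checksum survives the update

# XOR-gate count per next-state bit: (number of 1s in its row) - 1
cost = lambda M: sum(max(int(r.sum()) - 1, 0) for r in M)
print(cost(H), cost(Hp))  # -> 4 2
```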
EXAMPLE 6.3 A rate-1/3 convolutional encoder takes a binary sequence x[t] and encodes it into three output sequences (y_1[t], y_2[t] and y_3[t]) as shown at the top of Figure 6.2. The encoding mechanism is essentially an LFSM and, for the particular example shown in Figure 6.2, it has a state evolution that is given by

$$q_s[t+1] = \begin{bmatrix} q_1[t+1] \\ q_2[t+1] \\ q_3[t+1] \\ q_4[t+1] \\ q_5[t+1] \\ q_6[t+1] \end{bmatrix} = A q_s[t] \oplus b\, x[t] = \begin{bmatrix} 0&0&0&0&0&0 \\ 1&0&0&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&1&0&0&0 \\ 0&0&0&1&0&0 \\ 0&0&0&0&1&0 \end{bmatrix} q_s[t] \oplus \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix} x[t]$$
Figure 6.2. Different implementations of a convolutional encoder.
and output²

$$y[t+1] = \begin{bmatrix} y_1[t+1] \\ y_2[t+1] \\ y_3[t+1] \end{bmatrix} = \underbrace{\begin{bmatrix} 0&1&1&1&0&1 \\ 1&0&0&1&1&1 \\ 1&1&0&0&1&1 \end{bmatrix}}_{F} q_s[t] \oplus \underbrace{\begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}}_{d} x[t] = F q_s[t] \oplus d\, x[t].$$
If the output values y_1[t], y_2[t] and y_3[t] are saved in designated flip-flops, one obtains a redundant implementation of an LFSM with state evolution equation

$$q_h[t+1] = \begin{bmatrix} q_s[t+1] \\ y[t+1] \end{bmatrix} = \begin{bmatrix} A & 0 \\ F & 0 \end{bmatrix} q_h[t] \oplus \begin{bmatrix} b \\ d \end{bmatrix} x[t] = \begin{bmatrix} 0&0&0&0&0&0&0&0&0 \\ 1&0&0&0&0&0&0&0&0 \\ 0&1&0&0&0&0&0&0&0 \\ 0&0&1&0&0&0&0&0&0 \\ 0&0&0&1&0&0&0&0&0 \\ 0&0&0&0&1&0&0&0&0 \\ 0&1&1&1&0&1&0&0&0 \\ 1&0&0&1&1&1&0&0&0 \\ 1&1&0&0&1&1&0&0&0 \end{bmatrix} q_h[t] \oplus \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 1 \end{bmatrix} x[t],$$
where the encoding and decoding matrices are given by

$$G = \begin{bmatrix} I_6 \\ C \end{bmatrix} \;\;\text{with}\;\; C = \begin{bmatrix} 1&0&1&1&1&0 \\ 1&1&0&0&1&1 \\ 1&1&1&0&0&1 \end{bmatrix}, \qquad L = [\,I_6 \;\; 0\,].$$
By using nonzero redundant dynamics ($A_{22} \neq 0$) and/or coupling ($A_{12} \neq 0$),
one can obtain a number of redundant implementations (for the same L
and G), some of which require a reduced number of 2-input XOR gates. The
encoder at the bottom of Figure 6.2 is the result of such an approach: it uses a
nonzero $A_{22}$ to minimize the use of XOR operations.
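As a concrete illustration, the state-evolution and output equations of this example can be simulated directly. The sketch below uses the matrices A, b, F and d as reconstructed above; it is illustrative only, not the authors' implementation.

```python
# Sketch: the rate-1/3 convolutional encoder of Example 6.3 as an LFSM
# over GF(2).  A is the 6 x 6 shift matrix, b = e1, and F, d produce the
# three output streams (matrices as reconstructed from the example).
A = [[0] * 6 for _ in range(6)]
for i in range(5):
    A[i + 1][i] = 1            # q_{i+1}[t+1] = q_i[t] (pure shift)
b = [1, 0, 0, 0, 0, 0]         # q_1[t+1] = x[t]
F = [[0, 1, 1, 1, 0, 1],
     [1, 0, 0, 1, 1, 1],
     [1, 1, 0, 0, 1, 1]]
d = [1, 1, 1]

def step(q, x):
    """One step of q_s[t+1] = A q_s[t] + b x[t], y[t+1] = F q_s[t] + d x[t] (mod 2)."""
    nq = [(sum(A[i][j] * q[j] for j in range(6)) + b[i] * x) % 2 for i in range(6)]
    y = [(sum(F[i][j] * q[j] for j in range(6)) + d[i] * x) % 2 for i in range(3)]
    return nq, y

q = [0] * 6
for x in [1, 0, 1, 1]:
    q, y = step(q, x)
```

After the four input bits the state holds the reversed input prefix, as expected of a shift register.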
The next section elaborates on this example by describing how to systematically
minimize the number of 2-input XOR gates in a redundant implementation
of an LFSM.
5
HARDWARE MINIMIZATION IN REDUNDANT
LFSM IMPLEMENTATIONS
Given a linear code in systematic form (i.e., a code whose generator matrix is
of the form $G = \left[\begin{smallmatrix} I \\ C \end{smallmatrix}\right]$), Theorem 6.1 can be used to construct all linearly encoded
redundant implementations for a given LFSM S. This section describes
how to algorithmically find the redundant LFSM that uses the minimal number
of 2-input XOR gates [Hadjicostis and Verghese, 2002].
Problem Formulation: Let S be the LFSM in Eq. (6.1) with d state variables.
Construct the redundant LFSM $\mathcal{H}$ [of dimension $\eta = d + s$, $s > 0$, and state
evolution as in Eq. (6.2)] that uses the minimum number of 2-input XOR gates
and has the following encoding, decoding and parity check matrices:
$$
G = \begin{bmatrix} I_d \\ C \end{bmatrix}, \qquad
L = [\,I_d \;\; 0\,]\,, \qquad
P = [\,C \;\; I_s\,]\,,
$$
where C is a known matrix.
Solution: All appropriate redundant implementations are similar to a standard
redundant LFSM $\mathcal{H}_\sigma$. Specifically, there exists an $\eta \times \eta$ matrix $T$ such that
$$
\mathcal{A} = T^{-1} \begin{bmatrix} A & A_{12} \\ 0 & A_{22} \end{bmatrix} T\,,
$$
where $T$ is invertible and the choices for $A_{12}$ and $A_{22}$ are arbitrary.
Moreover, the relations
$$
L = L_\sigma T\,, \qquad P = P_\sigma T\,,
$$
establish that $T$ is given by
$$
T = \begin{bmatrix} I_d & 0 \\ C & I_s \end{bmatrix}.
$$
One can check that $T^{-1} = T$ over GF(2), which is consistent with the choice
of G.
Theorem 6.1 essentially parameterizes matrices $\mathcal{A}$ and $\mathcal{B}$ in terms of $A_{12}$
and $A_{22}$:
$$
\mathcal{A} = T^{-1} \begin{bmatrix} A & A_{12} \\ 0 & A_{22} \end{bmatrix} T
= \begin{bmatrix} A \oplus A_{12}C & A_{12} \\ CA \oplus CA_{12}C \oplus A_{22}C & CA_{12} \oplus A_{22} \end{bmatrix},
$$
$$
\mathcal{B} = T^{-1} \begin{bmatrix} B \\ 0 \end{bmatrix}
= \begin{bmatrix} B \\ CB \end{bmatrix}.
$$
In order to find the system with the minimal number of 2-input XOR gates,
one needs to choose $A_{12}$ and $A_{22}$ so that the number of "1s" in $\mathcal{A}$ is minimized.
Therefore, a straightforward approach would be to search through all $2^{\eta s}$
possibilities (each of the $\eta \cdot s$ entries of $A_{12}$ and $A_{22}$ can be either "0" or "1") and to find the choice that
minimizes the number of "1s" in $\mathcal{A}$. The following approach is more efficient
[Hadjicostis and Verghese, 2002].
Minimization Algorithm:
1. Ignore the bottom s rows of $\mathcal{A}$ (it will soon be shown why this can be done)
and optimize the cost in the top d rows. Each row of matrix $A_{12}$ can be
optimized independently from the other rows (because the jth row of matrix
$A_{12}$ does not influence the structure of the other rows of $\mathcal{A}$). An exhaustive
search of all possibilities in each row will look through $2^s$ different cases.
Thus, the minimization for the top d rows needs to search through $d \cdot 2^s$
different possibilities.
2. Having chosen the entries of $A_{12}$, proceed in the exact same way for the
last s rows of $\mathcal{A}$ (once $A_{12}$ is known, the problem has the same structure
as for the top d rows). Exhaustive search for each row will search $2^s$ cases;
the total number of cases needed will be $s \cdot 2^s$.
The algorithm above searches through a total of $\eta \cdot 2^s = (d+s)\,2^s$ cases
instead of $2^{\eta s}$.
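The two-step search can be sketched as follows. This is a hedged Python reading of the algorithm above (not the authors' implementation); matrix names follow the parameterization of $\mathcal{A}$ given earlier, and matrices are represented as lists of 0/1 rows.

```python
# Sketch of the row-by-row minimization of the number of "1s" in
#   [A + A12 C, A12; C A + C A12 C + A22 C, C A12 + A22]   (over GF(2)).
from itertools import product

def row_times(r, M):
    """Row vector r (length m) times matrix M (m x n), over GF(2)."""
    return [sum(rk * ck for rk, ck in zip(r, col)) % 2 for col in zip(*M)]

def best_row(base_row, mix_row, C):
    """Pick r in GF(2)^s minimizing the ones in [base_row + r C | mix_row + r]."""
    def cost(r):
        rc = row_times(r, C)
        return (sum(x ^ y for x, y in zip(base_row, rc))
                + sum(x ^ y for x, y in zip(mix_row, r)))
    return list(min(product((0, 1), repeat=len(C)), key=cost))

def minimize_Abar(A, C):
    d, s = len(A), len(C)
    # Step 1: each row of A12 affects only the corresponding top row.
    A12 = [best_row(A[i], [0] * s, C) for i in range(d)]
    # Parts of the bottom s rows that are fixed once A12 is chosen:
    CA12 = [row_times(C[i], A12) for i in range(s)]            # C A12
    M = [[x ^ y for x, y in zip(row_times(C[i], A), row_times(CA12[i], C))]
         for i in range(s)]                                    # C A + C A12 C
    # Step 2: each row of A22 affects only the corresponding bottom row.
    A22 = [best_row(M[i], CA12[i], C) for i in range(s)]
    top = [[x ^ y for x, y in zip(A[i], row_times(A12[i], C))] + A12[i]
           for i in range(d)]
    bot = [[x ^ y for x, y in zip(M[i], row_times(A22[i], C))]
           + [x ^ y for x, y in zip(CA12[i], A22[i])] for i in range(s)]
    return top + bot
```

Since r = 0 is always among the candidates for each row, the result is never worse than the choice $A_{12} = A_{22} = 0$.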
The only issue that remains to be resolved is whether choosing $A_{12}$ first
(based only on the top d rows of matrix $\mathcal{A}$) is actually optimal. This will be
shown by contradiction: suppose that one chooses $A_{12}$ as in Step 1 of the
algorithm, but there exists a matrix $A'_{12} \neq A_{12}$ which, together with a choice
of $A'_{22}$, minimizes the number of "1s" in $\mathcal{A}$. Let $\tilde{A}_{22} = A'_{22} \oplus CA'_{12} \oplus CA_{12}$;
for the pair $(A_{12}, \tilde{A}_{22})$, matrix $\mathcal{A}$ is then given by
$$
\mathcal{A} =
\begin{bmatrix} A \oplus A_{12}C & A_{12} \\ CA \oplus CA_{12}C \oplus \tilde{A}_{22}C & CA_{12} \oplus \tilde{A}_{22} \end{bmatrix}
=
\begin{bmatrix} A \oplus A_{12}C & A_{12} \\ CA \oplus CA'_{12}C \oplus A'_{22}C & CA'_{12} \oplus A'_{22} \end{bmatrix}.
$$
This choice of $\tilde{A}_{22}$ has the same effect in the bottom s rows as the choices
$A'_{12}$ and $A'_{22}$. Since by assumption $A_{12}$ was a better choice in minimizing the
number of "1s" in the top d rows, a contradiction has been reached: choices
$A'_{12}$ and $A'_{22}$ are suboptimal (they are worse than choices $A_{12}$ and $\tilde{A}_{22}$). □
EXAMPLE 6.4 Consider the autonomous LFSM with state evolution
$q_s[t+1] = A\, q_s[t]$, where matrix A is
$$
A = \begin{bmatrix}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 \\
1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 1 & 0
\end{bmatrix}.
$$
If initialized in a nonzero state, this LFSM goes through all nonzero 9-bit sequences, essentially counting from 1 to $2^9 - 1$.
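The counting property can be checked by brute force. The sketch below uses the reconstruction of A given above (a maximal-length LFSR reading of the scanned matrix; the reconstruction is an assumption, though either orientation of the scanned matrix yields the same period):

```python
# Brute-force check that the 9-bit LFSM cycles through all 2^9 - 1 nonzero
# states (A as reconstructed above: a shift with feedback from q_9).
A = [[0] * 9 for _ in range(9)]
for i in range(8):
    A[i + 1][i] = 1        # shift: q_{i+1}[t+1] = q_i[t]
A[0][8] = 1                # feedback from q_9 into q_1 ...
A[4][8] = 1                # ... and into q_5

def step(q):
    return [sum(A[i][j] * q[j] for j in range(9)) % 2 for i in range(9)]

q0 = [1] + [0] * 8         # any nonzero initial state works
q, period = step(q0), 1
while q != q0:
    q, period = step(q), period + 1
```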
To be able to detect and correct a single fault using a systematic linear
code, one can use a redundant machine with four additional state variables
and encoding matrix $G = \left[\begin{smallmatrix} I_9 \\ C \end{smallmatrix}\right]$, where C is a $4 \times 9$ binary matrix. The
parity check matrix $P = [\,C \;\; I_4\,]$ allows single-error correction because
all of its columns are different.
The minimization algorithm described earlier results in a (non-unique) choice of $A_{12}$ and $A_{22}$ for which the resulting matrix $\mathcal{A}$
requires only nine 2-input XOR gates (as opposed to sixteen gates required
by the implementation that sets $A_{12}$ and $A_{22}$ to zero). Note that the original,
non-redundant machine uses a single XOR gate.
6
SUMMARY
This chapter extended the ideas of Chapter 5 to LFSM's. This resulted in
a characterization of all redundant implementations for a given LFSM under a
given linear encoding and decoding scheme. The characterization enables the
systematic development of a variety of possible redundant implementations. It
also leads naturally to an algorithm that can be used to minimize the number
of 2-input XOR gates that are required in a redundant LFSM with a specified
systematic encoding scheme.
Notes
1 The finite field GF($\ell$) is the unique set of $\ell$ elements GF which, together
with two binary operations $\oplus$ and $\otimes$, satisfies the following properties:
(i) GF forms a group under operation $\oplus$ with identity 0.
(ii) GF $-$ {0} forms a commutative group under operation $\otimes$ with identity 1.
(iii) Operation $\otimes$ distributes over $\oplus$, i.e., for all $f_1, f_2, f_3 \in$ GF, $f_1 \otimes (f_2 \oplus f_3) =
(f_1 \otimes f_2) \oplus (f_1 \otimes f_3)$.
The order $\ell$ of a finite field has to be a prime number or a power of a prime
number.
2 What is denoted here by y[t+1] is usually denoted by y[t].
References
Blahut, R. E. (1983). Theory and Practice of Data Transmission Codes. Addison-Wesley, Reading, Massachusetts.
Booth, T. L. (1968). Sequential Machines and Automata Theory. Wiley, New York.
Cattell, K. and Muzio, J. C. (1996). Analysis of one-dimensional linear hybrid cellular automata over GF(q). IEEE Transactions on Computers, 45(7):782-792.
Chakraborty, S., Chowdhury, D. R., and Chaudhuri, P. P. (1996). Theory and application of non-group cellular automata for synthesis of easily testable finite state machines. IEEE Transactions on Computers, 45(7):769-781.
Daehn, W., Williams, T. W., and Wagner, K. D. (1990). Aliasing errors in linear automata used as multiple-input signature analyzers. IBM Journal of Research and Development, 34(2-3):363-380.
Damiani, M., Olivo, P., and Ricco, B. (1991). Analysis and design of linear finite state machines for signature analysis testing. IEEE Transactions on Computers, 40(9):1034-1045.
Gallager, R. G. (1963). Low-Density Parity Check Codes. MIT Press, Cambridge, Massachusetts.
Golomb, S. W. (1967). Shift Register Sequences. Holden-Day, San Francisco.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. (2000). Fault-tolerant sequence enumerators. In Proceedings of MED 2000, the 8th IEEE Mediterranean Conf. on Control and Automation.
Hadjicostis, C. N. and Verghese, G. C. (2002). Encoded dynamics for fault tolerance in linear finite-state machines. IEEE Transactions on Automatic Control. To appear.
Harrison, M. A. (1969). Lectures on Linear Sequential Machines. Academic Press, New York/London.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Larsen, R. W. and Reed, I. S. (1972). Redundancy by coding versus redundancy by replication for failure-tolerant sequential circuits. IEEE Transactions on Computers, 21(2):130-137.
Martin, R. L. (1969). Studies in Feedback-Shift-Register Synthesis of Sequential Machines. MIT Press, Cambridge, Massachusetts.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Sengupta, A., Chattopadhyay, D. K., Palit, A., Bandyopadhyay, A. K., and Choudhury, A. K. (1981). Realization of fault-tolerant machines - linear code application. IEEE Transactions on Computers, 30(3):237-240.
Wicker, S. B. (1995). Error Control Systems. Prentice Hall, Englewood Cliffs, New Jersey.
Zeigler, B. P. (1973). Every discrete input machine is linearly simulatable. Journal of Computer and System Sciences, 7(4):161-167.
Chapter 7
UNRELIABLE ERROR CORRECTION IN
DYNAMIC SYSTEMS
1
INTRODUCTION
This chapter focuses on constructing reliable dynamic systems exclusively
out of unreliable components, including unreliable components in the error-correcting
mechanism. At each time step, a particular component can suffer a
transient fault with a probability that is bounded by a constant. Faults between
different components and between different time steps are treated as independent.
Essentially, the chapter considers an extension of the techniques described
in Chapter 2 to a dynamic system setting. Since dynamic systems evolve in
time according to their internal state, the major task is to effectively deal with
the effects of error propagation, i.e., the effects of errors that corrupt the system
state.
The discussion focuses initially on a distributed voting scheme that can be
used to provide fault tolerance to an arbitrary dynamic system [Hadjicostis,
1999; Hadjicostis, 2000]. This approach employs multiple unreliable system
replicas and multiple unreliable voters and is able to improve the reliability of a
dynamic system at the cost of increased redundancy (higher number of system
replicas and voters). More specifically, by increasing the number of systems
and voters by a constant amount, one can double the number of time steps
for which the fault-tolerant implementation will operate within a pre-specified
probability of failure. Equivalently, given a pre-specified number of time steps,
one can decrease the probability of failure by increasing the number of systems
and voters.
Once the distributed voting scheme is analyzed, coding techniques are used to
make this approach more efficient, at least for special types of dynamic systems.
More specifically, by using linear codes that can be corrected with low complexity,
one can obtain interconnections of identical linear finite-state machines that
operate in parallel on distinct input streams and use only a constant amount of
redundant hardware per machine to achieve arbitrarily small probability of failure
[Hadjicostis, 1999; Hadjicostis and Verghese, 1999]. Equivalently, given any
pre-specified, finite number of time steps, one can achieve a pre-specified probability
of failure using a constant amount of redundancy per system.
Constructions of fault-tolerant dynamic systems out of unreliable components have appeared in [Taylor, 1968b; Taylor, 1968a; Larsen and Reed, 1972;
Wang and Redinbo, 1984; Gacs, 1986; Spielman, 1996a]. A number of other
constructions of fault-tolerant dynamic systems have also appeared in the literature (see, for example, [Avizienis, 1981; Bhattacharyya, 1983; Iyengar and
Kinney, 1985; Leveugle and Saucier, 1990; Parekhji et al., 1991; Robinson and
Shen, 1992; Leveugle et al., 1994; Parekhji et al., 1995] and [Johnson, 1989;
Pradhan, 1996; Siewiorek and Swarz, 1998] for a comprehensive overview),
but the following overview is limited to approaches in which all components,
including components in the error-correcting mechanism, suffer transient faults.
• In [Taylor, 1968b], Taylor studied the construction of "stable" memories
out of unreliable memory elements (flip-flops) that are capable of storing
a single bit but can suffer transient faults, independently between different
time steps. Taylor constructed reliable ("stable") memory arrays out of
unreliable flip-flops by using appropriately encoded arrays and unreliable
error-correcting mechanisms. His results for general computation in [Taylor,
1968a] were in error (see [Pippenger, 1990]).
• In [Larsen and Reed, 1972] the focus is on protecting a single finite-state
machine. The approach works by encoding the state of a given finite-state
machine (with less than $2^k$ states) into an n-bit binary vector using
a binary (n, k) code that has a certain minimum Hamming distance and
is majority-logic decodable. The functionality of the state transition and
error-correcting mechanisms are combined into one combinational circuit.
The fault model assumes that the probability of error in each of the bits
in the encoded state vector (of the redundant finite-state machine) can be
bounded by a (small) constant, i.e., the analysis does not directly consider
the probability of a transient fault in each component. Under a number of
assumptions and considering only the probability of failure per time step,
it is concluded that "replication yields better circuit reliability than coding
redundancy."
• A study of the performance of the approach in [Larsen and Reed, 1972] under
low rates of transient ("soft") state transition faults and using the concept
of "cluster states," was shown in [Wang and Redinbo, 1984] to result in
significant improvements.
• Gacs studied fault-tolerant cellular automata in [Gacs, 1986], mostly in the
context of stable memories. He employed cellular automata so that the
cost/complexity of connectivity between different parts of the redundant
implementation remains constant as the amount of redundancy increases.
• The approach in [Spielman, 1996a] was for multiple systems that run in
parallel on k "fine-grained" processors for L time steps. (In this sense, it is
closer to the approach presented in Section 4 of this chapter for LFSM's.)
Spielman showed that the probability of error can go down as $O(L e^{-k^{1/4}})$
while the amount of redundancy is $O(k \log k)$ (i.e., $O(\log k)$ processors per
system). Spielman also introduced the concept of slowdown due to the
redundant implementation.
2
FAULT MODEL FOR DYNAMIC SYSTEMS
In an unreliable dynamic system, an incorrect state transition at a particular
time step will not only affect the output at the immediately following time step,
but will typically also affect the state (and therefore the output of the system)
at later time steps. In Chapters 4-6, structured redundancy was added into
a dynamic system so that error detection and correction could be performed
by detecting and identifying violations of artificially created state constraints.
This approach was shown to work nicely if the error-correcting mechanism
was fault-free; however, it is clear that faults in the error corrector may have
devastating effects. To realize the severity of the problem, recall the example
that was introduced in Chapter 1: assume that in a given dynamic system, the
probability of taking a transition to an incorrect next state on any input is $p_s$ and is
independent between different time steps. Then, the probability that the system
follows the correct state trajectory for L consecutive time steps is $(1-p_s)^L$, and
goes to zero exponentially with L. Using modular redundancy with feedback
(as in Figure 1.3 of Chapter 1) will not be successful if the voter also suffers
transient faults with probability $p_v$. (A fault causes the voter to feed back a state
other than the one agreed upon by the majority of the systems; the assumption
here is that this happens with probability $p_v$, independently between different
time steps.) In such a case, the probability that the system follows the correct
state trajectory for L consecutive time steps is at best $(1-p_v)^L$ and goes down
exponentially with L. The problem is that faults in the voter (or more generally
in the error-correcting mechanism) corrupt the overall redundant system state
and cause error propagation. Note that the bound $(1-p_v)^L$ actually ignores the
possibility that a fault in the voter may result in feeding back the correct state
(when the majority of the system replicas are in an incorrect state). This issue
can be accounted for if a more explicit fault model for the voter is available.
The first question discussed in this chapter is the following: given unreliable
systems and unreliable voters, is there a way to guarantee the correct
operation of a dynamic system for an arbitrarily large (but finite) number of time
steps? Furthermore, what are the trade-offs between redundant hardware and
reliability? The approach discussed here uses a generalization of the scheme
shown in Figure 1.4 of Chapter 1, where faults are allowed in both the redundant
implementation and the error-correcting mechanism. Since the error corrector
also suffers transient faults, the redundant implementation will not necessarily
be in the correct state at the end of a particular time step; if, however, its state
is within a set of states that represent the correct one, then, the system may
be able to evolve in the right fashion. The basic idea is shown in Figure 7.1:
at the end of time step t, the system is not in a valid state but it is in a state
within the set of states that represent (and could be corrected/decoded to) the
correct valid state. During the next state transition stage, a fault-free transition
should result in the (unique) valid state that a fault-free system would be in.
An incorrect transition, however, may end up in an invalid state; the system
performs as desired as long as no overall failure has occurred (i.e., as long as the
error corrector is able to correct the redundant system state so that it is within
the set of states that are associated with the correct one - this is the case for
the corrections labeled "perfect" and "acceptable" in Figure 7.1). Notice that
overall failures can occur both during the state transition stage and during the
error correction stage. (In the approach in [Larsen and Reed, 1972; Wang and
Redinbo, 1984] the two stages are combined into one stage that also suffers
transient faults with some probability; here, the two stages are kept separated
in order to have a handle on the complexity of the corresponding circuits.)
Even when no fault-free decoding mechanism is available, the above approach is desirable because it allows one to guarantee that the probability of
a decoding failure will not increase with time in an unacceptable fashion. As
long as the redundant state is within the set of states that represent the actual
(underlying) state, the decoding at each time step will be incorrect with a fixed
probability, which depends only on the reliability of the decoding mechanism
and does not rapidly diminish as the dynamic system evolves in time. The
resulting method guarantees that the probability of incorrect state evolution
during a certain time interval is much smaller in the redundant dynamic system
than in the original one.
3
RELIABLE DYNAMIC SYSTEMS USING
DISTRIBUTED VOTING SCHEMES
The problem in the modular redundancy scheme in Figure 1.3 of Chapter 1
is that a voter fault corrupts the states of all system replicas. This results in an
overall failure, i.e., a situation where the state of the redundant implementation
does not correctly represent the state of the underlying dynamic system. For
instance, if the majority of the systems agree on an incorrect state, the correct
state of the underlying dynamic system cannot be recovered using a majority
voter.

Figure 7.1. Reliable state evolution subject to faults in the error corrector.

To avoid this situation, one needs to ensure that faults in the voting
mechanism do not have such devastating consequences. One way to achieve
this is by using several voters and by performing error correction in a distributed
fashion, as shown in Figure 7.2 [Hadjicostis, 1999; Hadjicostis, 2000]. The
arrangement in Figure 7.2 uses n system replicas and n voters. All n replicas
are initialized at the same state and receive the same input. Each voter receives
state information from all system replicas and feeds back a correction to only
one of them. This way, a fault in a single voter corrupts the state of only one of
the system replicas and not all of them.
Notice that the redundant implementation of Figure 7.2 is guaranteed to
operate "correctly" as long as $\lceil \frac{n+1}{2} \rceil$ or more systems are in the correct state
($\lceil x \rceil$ denotes the smallest integer that is larger than or equal to x). The reason is
two-fold:

• If $\lceil \frac{n+1}{2} \rceil$ systems are in the correct state, then the majority of the system
replicas are in the right state and a fault-free voter is guaranteed to recover
the correct state.
Figure 7.2. Modular redundancy with distributed voting scheme.
• If $\lceil \frac{n+1}{2} \rceil$ systems are in the correct state, then each voter ideally feeds back
the correct state unless it itself suffers a fault; this implies that a fault in a
particular voter or a particular system may be corrected at future time steps
as long as $\lceil \frac{n+1}{2} \rceil$ or more systems end up in the correct state.
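One time step of this arrangement can be sketched as follows. This is an illustrative Monte-Carlo model, not from the text: each replica's condition is abstracted to a single flag meaning "in the correct state", and $p_s$, $p_v$ are the system and voter fault probabilities.

```python
# Illustrative sketch of one time step of the distributed voting scheme of
# Figure 7.2: n replicas, n voters, voter j corrects only replica j.
# Each replica is abstracted to a flag: True = "in the correct state".
import random

def voting_step(states, ps, pv):
    """Advance all replicas one step, then apply the distributed correction."""
    # State transition stage: a correct replica goes wrong with probability ps
    # (a replica that is already wrong conservatively stays wrong).
    states = [s and (random.random() >= ps) for s in states]
    # Error correction stage: every voter sees all replicas and feeds back
    # the majority state; a faulty voter (probability pv) feeds back the
    # opposite, corrupting only its own replica.
    majority = sum(states) > len(states) // 2
    return [majority if random.random() >= pv else not majority
            for _ in states]
```

With fault-free components ($p_s = p_v = 0$) a single corrupted replica is restored in one step, which is the behavior the bullets above describe.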
The above discussion motivates the following definition of an overall failure.

DEFINITION 7.1 The redundant system of Figure 7.2 suffers an overall failure
when half or more of the systems are in a corrupted state.

A reliable system is one that, with high probability, operates for a pre-specified
finite number of time steps with no overall failure. In this context, a
redundant implementation is reliable if, with high probability, at least $\lceil \frac{n+1}{2} \rceil$
systems are in the correct state at any given time step. Note that it is not necessary
that each of these $\lceil \frac{n+1}{2} \rceil$ systems remains in the correct state for all
consecutive time steps. Also note that the above definition of an overall failure
is conservative because the overall redundant implementation may perform as
expected even if more than half of the systems are in an incorrect state. What
is really needed is that, at any given time step, the majority of the systems are
in the correct state.
THEOREM 7.1 Suppose that each system takes a transition to an incorrect
state with probability $p_s$ and each voter feeds back an incorrect state with
probability $p_v$ (independently between different systems, voters and time steps).
Then, the probability of an overall failure at or before time step L (starting at
time step 0) can be bounded as follows:
$$
\Pr[\textrm{overall failure at or before time step } L] \;\le\; L \sum_{i=\lfloor n/2 \rfloor}^{n} \binom{n}{i} p^i (1-p)^{n-i}\,,
$$
where $p = p_v + (1-p_v)p_s$. This bound goes down exponentially with the
number of systems n if and only if $p < \frac{1}{2}$.
Proof: Given that there is no overall failure at time step T-1, the conditional
probability that system j ends up in an incorrect state at time step T is bounded
by the probability that either voter j suffers a transient fault, or voter j does not
suffer a fault but system j itself takes a transition to an incorrect state, i.e.,
$$
\Pr[\textrm{system } j \textrm{ in incorrect state at } T \mid \textrm{no overall failure at } T{-}1] \;\le\; p_v + (1-p_v)p_s \;=\; p\,.
$$
The probability of an overall failure at time step T given no overall failure at
time step T-1 is bounded by the probability that half or more of the n system
replicas suffer faults:
$$
\Pr[\textrm{overall failure at } T \mid \textrm{no overall failure at } T{-}1] \;\le\; \sum_{i=\lfloor n/2 \rfloor}^{n} \binom{n}{i} p^i (1-p)^{n-i}\,.
$$
Using the union bound, the probability of an overall failure at or before a
certain time step L can be bounded as
$$
\Pr[\textrm{overall failure at or before } L] \;\le\; L \sum_{i=\lfloor n/2 \rfloor}^{n} \binom{n}{i} p^i (1-p)^{n-i}\,.
$$
Note that the bound on the probability of overall failure increases linearly
with the number of time steps (because of the union bound). The bound goes
down exponentially with n if and only if p is less than $\frac{1}{2}$; to see this, one can
use the Stirling approximation and the results on p. 531 of [Gallager, 1968]:
assuming $p < \frac{1}{2}$,
$$
\sum_{i=n/2}^{n} \binom{n}{i} p^i (1-p)^{n-i} \;\le\; \binom{n}{n/2} \underbrace{[p(1-p)]^{n/2}\,\frac{1-p}{1-2p}}_{X}
$$
and
$$
\sum_{i=n/2}^{n} \binom{n}{i} p^i (1-p)^{n-i} \;\ge\; \binom{n}{n/2} \underbrace{[p(1-p)]^{n/2}}_{X}
$$
(where for simplicity n has been assumed to be even). Since
$$
\frac{2^n}{\sqrt{2n}} \;\le\; \binom{n}{n/2} \;\le\; 2^n\,,
$$
one can conclude that $\sum_{i=n/2}^{n} \binom{n}{i} p^i (1-p)^{n-i}$ will decrease exponentially
with n if and only if $p(1-p) < \frac{1}{4}$ (i.e., if and only if p is less than $\frac{1}{2}$). □
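The bound of Theorem 7.1 is easy to evaluate numerically. The sketch below uses illustrative values of $p_s$, $p_v$ and $L$ (these numbers are assumptions, not from the text) and shows the bound shrinking as replicas and voters are added:

```python
# Numerical evaluation of the Theorem 7.1 bound
#   L * sum_{i = floor(n/2)}^{n} C(n, i) p^i (1 - p)^(n - i),
# with p = pv + (1 - pv) ps.  The values of ps, pv, L are illustrative.
from math import comb

def failure_bound(n, L, ps, pv):
    p = pv + (1 - pv) * ps
    return L * sum(comb(n, i) * p**i * (1 - p)**(n - i)
                   for i in range(n // 2, n + 1))

# With p well below 1/2, the bound drops rapidly as n grows.
bounds = [failure_bound(n, L=1000, ps=0.01, pv=0.01) for n in (5, 15, 25)]
```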
A potential problem with the arrangement in Figure 7.2 is the fact that as
n increases, the complexity of each voter (and therefore $p_v$) increases. An
arrangement in which the number of inputs to each voter is fixed is discussed in
the next section. In such schemes, the voter complexity and $p_v$ remain constant
as the number of systems and voters is increased.
Another concern about the approach in Figure 7.2 is that, in order to construct
dynamic systems that suffer transient faults with an acceptably small probability
of overall failure during any pre-specified (finite) time interval, the hardware in
the redundant implementation may have to increase in an unacceptable fashion.
More specifically, if the number of time steps is doubled, the bound in Theorem 7.1 suggests that one may need to increase the number of system replicas
by a constant amount (in order to keep the probability of an overall failure at
the same level). To make an analogy with the problem of digital transmission
through an unreliable communication link, what was shown in Theorem 7.1 is
very similar to what can be achieved in digital communications without coding
techniques. In other words, in the communications setting the probability of a
transmission error can be made arbitrarily small by replicating (retransmitting)
bits, but at the cost of correspondingly reducing the rate at which information
is transmitted. If, however, one is willing to transmit k bits as a block, then
the use of coding techniques can result in an arbitrarily small probability of
transmission error with a constant amount of redundancy per bit.¹ In the next
section, this coding paradigm is transferred to an unreliable dynamic system setting.
Specifically, it is shown that for identical linear finite-state machines that
operate in parallel on distinct input streams one can design a scheme that requires
only a constant amount of redundancy per machine to achieve an arbitrarily
small probability of overall failure over any finite time interval.
4
RELIABLE LINEAR FINITE-STATE MACHINES
This section combines linear coding techniques with the distributed voting
scheme of the previous section in order to protect linear finite-state machines
(LFSM's). The resulting scheme is an interconnection of identical LFSM's
that operate in parallel on distinct input streams and require only a constant
amount of redundancy per machine to achieve an arbitrarily small probability
of overall failure over any pre-specified (finite) time interval. The linear codes
that are used are low-density parity check codes [Gallager, 1963; Sipser and
Spielman, 1996; Spielman, 1996b]. Error correction is of low complexity and
can be implemented using unreliable voters and unreliable XOR gates:

(i) The unreliable voters vote on J-1 bits, where J is a constant, and suffer
transient faults with a probability that is bounded by some constant $p_v$ (transient
faults cause voters to provide an output other than the one agreed upon
by the majority of their inputs). Each voter suffers transient faults independently
from all other components (voters and XOR gates) and independently
between time steps.

(ii) The unreliable XOR gates take two inputs and suffer transient faults with
a probability that is bounded by some constant $p_x$, independently from all
other components and independently between time steps.
The unreliable LFSM's are built out of 2-input XOR gates and single-bit memory
elements (flip-flops):

(i) These XOR gates also suffer transient faults with a probability that is
bounded by $p_x$, independently from all other components and independently
between time steps.

(ii) The flip-flops are assumed to be fault-free, although the same approach can
be extended to also handle this type of faults.
4.1
LOW-DENSITY PARITY CHECK CODES AND
STABLE MEMORIES
An (n, k) low-density parity check (LDPC) code is a linear code that represents k bits of information using n total bits. Just like any linear code, an
LDPC code has an $n \times k$ generator matrix G with entries in GF(2) and with
full column rank; the additional requirement is that the code has a parity check
matrix P that (is generally sparse and) has exactly K "1s" in each row and J
"1s" in each column. It can be easily shown that the ratio $\frac{nJ}{K}$ has to be an integer
and that P has dimension $\frac{nJ}{K} \times n$ [Gallager, 1963]. Each bit in a codeword is
involved in J parity checks, and each of these J parity checks involves K-1
additional bits. Note that the rows of P are allowed to be linearly dependent
(i.e., P can have more than n-k rows) and that the generator matrix G of an
LDPC code is not necessarily sparse.
Gallager studied ways to construct and decode LDPC codes in [Gallager,
1963]. In particular, he constructed sequences of (n, k) LDPC codes for fixed J
and K with rate $\frac{k}{n} \ge 1 - \frac{J}{K}$, and he suggested/analyzed the performance of simple
iterative procedures for correcting erroneous bits in corrupted codewords; these
procedures are summarized below.
Iterative Decoding. For each bit in a corrupted n-bit codeword:
1. Evaluate the J associated parity checks (since each column of P has exactly
J "1s").
2. If more than half of the J parity checks for a particular bit are unsatisfied,
flip the value of that bit; do this for all bits concurrently.
3. Iterate (back to Step 1).
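The three steps above can be sketched as follows. The tiny parity check matrix at the end is an illustrative toy (the three pairwise checks of a length-3 repetition code), not one of Gallager's constructions:

```python
# Sketch of the bit-flipping rule: flip every bit for which more than half
# of its parity checks are unsatisfied, all bits concurrently.
def bit_flip_decode(word, P, iterations=5):
    """word: list of 0/1 bits; P: parity check matrix as a list of 0/1 rows."""
    w = list(word)
    for _ in range(iterations):
        flips = []
        for j in range(len(w)):
            checks = [row for row in P if row[j] == 1]    # the J checks on bit j
            unsat = sum(sum(r * b for r, b in zip(row, w)) % 2 for row in checks)
            if 2 * unsat > len(checks):                   # more than half unsatisfied
                flips.append(j)
        if not flips:                                     # fixed point reached
            break
        for j in flips:                                   # concurrent flips
            w[j] ^= 1
    return w

# Toy example: pairwise parity checks of the length-3 repetition code.
P = [[1, 1, 0], [0, 1, 1], [1, 0, 1]]
decoded = bit_flip_decode([1, 0, 0], P)   # single corrupted bit
```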
In order to analytically evaluate the performance of this iterative scheme,
Gallager slightly modified his approach.
Modified Iterative Decoding. Replace each bit $b_i$ in an n-bit corrupted codeword
with J bit-copies $\{b_i^1, b_i^2, \ldots, b_i^J\}$ (all bit-copies are initially the same);
obtain new estimates of each of these copies (i.e., J estimates $\{\hat{b}_i^1, \hat{b}_i^2, \ldots, \hat{b}_i^J\}$
for bit $b_i$) by executing the following steps:
1. Evaluate J-1 parity checks for each bit-copy; for each bit-copy, exclude a different
parity check from the original set of J checks.
2. Flip the value of a particular bit-copy if half or more of the J-1 parity
checks are unsatisfied.
3. Iterate (back to Step 1).
A hardware implementation of Gallager's modified iterative decoding scheme
can be seen in Figure 7.3 ($\oplus$ denotes a 2-input XOR gate and V denotes a
(J-1)-bit voter). Initially, one starts with J copies of an (n, k) codeword (i.e.,
a total of Jn bits). During each iteration, each bit-copy is corrected using an
error-correcting mechanism of the form shown in the figure: for each bit-copy,
there are a total of J-1 parity checks, each of which involves K-1 other bits
and can be evaluated via K-1 2-input XOR gates. The output of each voter
is "1" if half or more of the J-1 parity checks are nonzero. Correction is
accomplished by XOR-ing the output of the voter with the previous value of
the bit-copy.
DEFINITION 7.2 The number of independent iterations m is the number of iterations
for which no decision about a particular bit-copy is based on a previous
estimate of this same bit-copy.
Unreliable Error Correction in Dynamic Systems
[Figure: J copies of an (n, k) codeword (Jn total bit-copies); for each bit-copy, J − 1 parity checks, each over K − 1 other bit-copies, feed a (J − 1)-bit voter whose output is XOR-ed with the bit-copy to produce the correction.]
Figure 7.3. Hardware implementation of Gallager's modified iterative decoding scheme for
LDPC codes.
Note that in the modified iterative decoding scheme, each parity check requires K − 1 input bits (other than the bit-copy being estimated). Since each
of these input bits has J different copies, one has some flexibility in terms of
which particular copy is used when estimating bit-copy b_i^j of bit b_i (1 ≤ j ≤ J). If
one is careful enough in choosing among these J bit-copies, the number of
independent iterations can be made nonzero. More specifically, one should
ensure that when a bit-copy of b_i is estimated using an estimate of bit b_j, one
uses the bit-copy of b_j that disregarded the parity check involving b_i (otherwise,
the estimate of b_i would immediately depend upon its previous estimate). The
number of independent iterations is important because during the first m iterations, the probability of error in an estimate for a particular bit-copy can be
calculated using independence. It is shown in [Gallager, 1963] that when using
the modified iterative decoding scheme the number of independent iterations
for any LDPC code is upper bounded by
m < log n / log[(K − 1)(J − 1)] .
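As a quick numeric illustration of this bound (the parameter values here are arbitrary, not from a particular construction):

```python
import math

def max_independent_iterations(n, J, K):
    """Upper bound m < log n / log[(K-1)(J-1)]; the logarithm base cancels."""
    return math.log(n) / math.log((K - 1) * (J - 1))

# e.g., block length 10000 with J = 4 checks per bit and row weight K = 8
bound = max_independent_iterations(n=10_000, J=4, K=8)
print(bound)   # roughly 3, so only a few iterations are provably independent
```

Because the bound grows only logarithmically in n, independence can be exploited for just a handful of iterations even for very long codes.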
In his thesis, Gallager suggested a procedure for constructing sequences of
(n, k) LDPC codes with fixed J, K (i.e., with parity check matrices that have
J "1s" in each column and K "1s" in each row) such that the rate is bounded
by k/n ≥ 1 − J/K and the number of independent iterations m is bounded by

m + 1 > [ log n + log( (KJ − K − J)/(2K) ) ] / ( 2 log[(K − 1)(J − 1)] ) > m .   (7.1)
Building on Gallager's work, Taylor considered the construction of reliable
memories out of unreliable memory elements [Taylor, 1968b]. More specifically, Taylor assumed that the unreliable memory elements (flip-flops) store a
single bit ("0" or "1") but can suffer transient faults with probability p_c, independently between different time steps. Taylor constructed reliable (or stable)
memory arrays out of unreliable flip-flops using (n, k) LDPC codes: a reliable
memory array uses n flip-flops to store k bits of information; at the end of each
time step an unreliable error-correcting mechanism re-establishes (or at least
tries to re-establish) the correct state in the memory array. The memory scheme
performs acceptably for L time steps if, at any time step T (0 ≤ T ≤ L), the
k information bits can be recovered from the n memory bits. This means that
the n-bit sequence stored in the memory at time step T has to be within the
set of n-bit sequences that get decoded to the originally stored codeword (i.e.,
if a fault-free iterative decoder was available, one could successfully use it to
obtain the codeword that was stored in the memory array at time step 0).
Note that if error correction is fault-free, the problem of constructing reliable
memory arrays is trivial because it can be viewed as a sequence of transmissions
through identical unreliable binary symmetric channels. Each transmission
involves an n-bit sequence that ideally represents an (n, k) codeword. Each
bit transmission is successful with probability 1 − p_c and unsuccessful with
probability p_c. At the end of each channel transmission, error correction is
performed and the (corrected) n-bit sequence is passed on to the next channel
(i.e., the first node transmits an (n, k) codeword to the second node via an
unreliable communication link; after performing error detection and correction,
the second node transmits the corrected (ideally the same) codeword to the
third node, and so forth). Therefore, if error correction is fault-free, one can
use Shannon's result to establish that, by increasing k (and n), one can obtain
reliable memory arrays as long as k/n ≤ C, where C is given by the channel
capacity of a binary symmetric channel with crossover probability p_c:

C = 1 + p_c log p_c + (1 − p_c) log(1 − p_c) .
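The capacity formula is easy to evaluate directly (base-2 logarithms, so capacity is in bits per channel use):

```python
import math

def bsc_capacity(p):
    """C = 1 + p log2 p + (1 - p) log2(1 - p) for a binary symmetric channel."""
    if p in (0.0, 1.0):
        return 1.0          # a deterministic channel carries one bit per use
    return 1.0 + p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)

print(bsc_capacity(0.0))    # 1.0: noiseless channel
print(bsc_capacity(0.5))    # 0.0: the output is independent of the input
print(bsc_capacity(0.11))   # about 0.5: rates k/n below this are achievable
```

The crossover probability 0.11 is chosen only because it makes the capacity come out near one half; any value of p_c below 1/2 gives a strictly positive capacity.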
When faults may take place in the error-correcting mechanism, however, the
analysis becomes significantly harder. Taylor used LDPC codes and Gallager's
modified iterative decoding procedure to build a correcting mechanism out of
unreliable 2-input XOR gates and unreliable (J − 1)-bit voters that suffer transient faults (i.e., output an incorrect bit) with probabilities p_x and p_v respectively.
The scheme uses Gallager's modified iterative decoding scheme and requires J
estimates for each of the n redundant bits. The corresponding correcting circuit
has one (J − 1)-bit voter and 1 + (K − 1)(J − 1) 2-input XOR gates for each bit
(see Figure 7.3). Taylor constructed reliable memory arrays using (n, k) LDPC
codes (with k/n ≥ 1 − J/K, J < K) such that the probability of an overall failure
increases linearly with the number of time steps T and decreases polynomially
with k (i.e., the probability of overall failure is O(Tk^(−β)) for a positive constant
β). By increasing k, the probability of overall failure can be made arbitrarily
small while keeping k/n ≥ 1 − J/K (and thus the redundancy per bit below a
constant). Note that Taylor's construction of reliable memory arrays uses Jn
voters, Jn flip-flops and Jn[1 + (J − 1)(K − 1)] 2-input XOR gates; since
n/k ≤ 1/(1 − J/K), the overhead per bit (in terms of the overall number of flip-flops,
XOR gates and voters) remains below a constant as k and n increase. Taylor
also showed that one can reliably perform the XOR operation on k pairs of bits
by performing component-wise XOR-ing on two (n, k) codewords. In fact, he
showed that one can reliably perform a sequence of T such component-wise
XOR operations [Taylor, 1968a].
4.2
RELIABLE LINEAR FINITE-STATE MACHINES
USING CONSTANT REDUNDANCY
Consider an LFSM with a single-bit input (u = 1), a d-dimensional state
and state evolution

q[t + 1] = A_c q[t] ⊕ b x[t] .   (7.2)
Without loss of generality, the d × d matrix A_c can be assumed to be in classical canonical form (see the discussion in Section 2 of Chapter 6). Any such
LFSM can be implemented using 2-input XOR gates and flip-flops as outlined
in Chapter 6. In these implementations, each of the d bits in the next-state vector
q[t + 1] is generated using at most two bits from q[t] and at most one bit from
the input; therefore, the calculation of each bit in q[t + 1] can be accomplished
by using at most two 2-input XOR operations (this is a direct consequence of the
fact that the canonical matrix A_c has at most two "1s" in each row).
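A direct simulation of Eq. (7.2) over GF(2) makes the gate count visible. The 3-dimensional A_c below is an illustrative companion-form matrix, not one taken from the text:

```python
import numpy as np

def lfsm_step(A, b, q, x):
    """One step of q[t+1] = A_c q[t] XOR b x[t], all arithmetic modulo 2."""
    return (A.dot(q) + b * x) % 2

# illustrative canonical (companion-form) matrix: at most two 1s per row,
# so each next-state bit costs at most two 2-input XOR operations
A = np.array([[0, 0, 1],
              [1, 0, 1],
              [0, 1, 0]])
b = np.array([1, 0, 0])

q = np.array([1, 0, 1])
for x in [1, 0, 1]:        # an example single-bit input stream
    q = lfsm_step(A, b, q, x)
print(q)
```

Each output bit of `lfsm_step` combines at most two state bits and one input bit, mirroring the two-XOR-per-bit hardware cost described above.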
If k such LFSM's operate in parallel, each with a possibly different initial
state and a possibly different input stream, the result is k parallel instantiations
of the system in Eq. (7.2):

[ q_1[t+1] ... q_k[t+1] ] = A_c [ q_1[t] ... q_k[t] ] ⊕ b [ x_1[t] ... x_k[t] ] .   (7.3)
Let G be the n × k encoding matrix of an (n, k) linear code. If both sides
of Eq. (7.3) are post-multiplied by G^T, one obtains the following n encoded
128
CODING APPROACHES TO FAULT TOLERANCE
[Figure: k identical LFSM's driven by k distinct input streams are replaced by n redundant LFSM's driven by encoded versions of the same inputs.]
Figure 7.4. Replacing k LFSM's with n redundant LFSM's.
parallel instantiations:

( A_c [ q_1[t] ... q_k[t] ] ) G^T ⊕ ( b [ x_1[t] ... x_k[t] ] ) G^T
    = A_c ( [ q_1[t] ... q_k[t] ] G^T ) ⊕ b ( [ x_1[t] ... x_k[t] ] G^T ) ,

or equivalently

[ ξ_1[t+1] ... ξ_n[t+1] ] = A_c [ ξ_1[t] ... ξ_n[t] ] ⊕ b X[t] ,   (7.4)

where X[t] = [ x_1[t] ... x_k[t] ] G^T and [ ξ_1[t] ... ξ_n[t] ] = [ q_1[t] ... q_k[t] ] G^T.
Effectively, n LFSM's with state evolution of the form of Eq. (7.2) are used
to perform k different encoded instantiations of the system in Eq. (7.2). As
shown in Figure 7.4, the operation of k identical LFSM's acting on distinct input
streams has effectively been replaced by n redundant LFSM's acting on encoded
versions of the k original inputs. Input encoding is performed according to an
(n, k) linear code with generator matrix G. Each of the n redundant systems
is implemented using a separate set of flip-flops and XOR gates; for simplicity
flip-flops are assumed to be reliable and encoding is assumed to be instantaneous
and fault-free. Most of these assumptions could be relaxed; the real issue
with the encoding mechanism is its time and hardware complexity - see the
discussion in the next section of this chapter.
At each time step, n encoded inputs are provided to the n redundant LFSM's
and each of them evolves to its corresponding (and possibly erroneous) next
state. At the end of the time step, errors in the new states of the n systems are
corrected by performing error correction on d codewords from the (n, k) linear
code with generator matrix G (the ith codeword is obtained by collecting the
ith bit from each of the n state vectors).
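The commutation that makes this work (evolving encoded states gives the same result as encoding evolved states) can be checked numerically. The (3, 1) repetition-code generator G below is a toy stand-in for an LDPC generator matrix, and all parameter values are illustrative:

```python
import numpy as np

# k original LFSM's versus n redundant LFSM's driven by encoded inputs,
# in the spirit of Eq. (7.4)
A = np.array([[0, 1],
              [1, 1]])                 # a 2-dimensional example A_c
b = np.array([1, 0])
G = np.array([[1], [1], [1]])          # n x k = 3 x 1 repetition-code generator

d, k = A.shape[0], G.shape[1]
rng = np.random.default_rng(0)
Q = rng.integers(0, 2, (d, k))         # states of the k original systems
Xi = Q.dot(G.T) % 2                    # encoded states of the n redundant systems

for _ in range(5):
    x = rng.integers(0, 2, (1, k))     # one input bit per original system
    Q = (A.dot(Q) + b[:, None].dot(x)) % 2              # originals evolve
    Xi = (A.dot(Xi) + b[:, None].dot(x.dot(G.T))) % 2   # redundant copies evolve
    assert (Xi == Q.dot(G.T) % 2).all()  # each bit-row of Xi stays a codeword
print("encoded states track the original states")
```

Because the encoding is linear and multiplies the state on the right while A_c multiplies on the left, the redundant states remain exactly the encodings of the original states at every step, which is what lets error correction be applied codeword by codeword.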
If error correction was fault-free, one could invoke Shannon's result and
argue that, by increasing k (and n), the condition in Eq. (7.5) can be satisfied
with an arbitrarily high probability (at least for a pre-specified, finite number of
time steps and as long as the probability of component faults is below a certain
constant). More specifically, one can make the probability of "error per time
step" (i.e., the probability of an overall failure at a particular time step given
no corruption at the previous time step, denoted by Pr[ error per time step ])
arbitrarily small. Then, using the union bound, the probability of an overall
failure over L consecutive time steps could be bounded by
Pr[ overall failure at or before time step L ] ≤ L · Pr[ error per time step ] .
To make the above argument more precise, one has to bound the probability of
error per bit during each time step. Assuming that there are no corruptions in
any of the n state vectors at the beginning of a given time step, the probability
of a bit error (in any particular bit of the n next-state vectors) can be obtained by
considering the number of XOR operations that are involved. If this bit-error
probability is less than 1/2 and if errors among different bits are independent,
then the problem essentially reduces to an unreliable communication problem.
Fault-free error correction essentially ensures that at the beginning of each time
step the overall redundant state will be correct (unless, of course, an overall
failure took place).
Since fault-free error correction is not an option, the approach taken in the
rest of this chapter is quite different. In order to allow faults in the error-correcting mechanism, one employs LDPC codes and performs error correction
in each bit using the unreliable error-correcting mechanism of Figure 7.3. This
error-correcting mechanism is implemented using different unreliable XOR
gates and different unreliable voters for each bit (so that a single fault in a
component corrupts a single bit). Following Taylor's scheme in [Taylor, 1968b],
one actually needs to have J replicas of each of the n redundant systems (a total
of Jn systems). At the beginning of each time step, these Jn systems evolve to
a (possibly corrupted) next state; at the end of the time step, error correction is
performed using one iteration of Gallager's modified iterative decoding scheme
(see Section 4.1).
Once faults in the error-correcting mechanism are allowed, one can no longer
guarantee that the invariant condition in Eq. (7.5) will be true at the beginning of
each time step. However, as long as no overall failure takes place, the overall
redundant state (i.e., the state of all Jn systems) at a certain time step can
correctly represent the state of the k underlying systems.2 In such a case, one can
recover the exact state of the k underlying systems (e.g., by using an
iterative decoder).
THEOREM 7.2 Consider k distinct instantiations of an LFSM with state evolution as in Eq. (7.2), each instantiation with its own initial state and a distinct
input stream. These k instantiations are embedded into n redundant LFSM's
[also with state evolution as in Eq. (7.2)] using the approach of Eq. (7.4), where
G is the n × k encoding matrix of a linear (n, k) LDPC code. Each redundant
system is properly initialized (so that Eq. (7.5) is satisfied for T = 0) and is
supplied with an encoded input according to X[t] in Eq. (7.4). Each of the n
redundant systems has J realizations (so that there is a total of Jn systems) that
use their own (dedicated) sets of reliable flip-flops and unreliable 2-input XOR
gates. At the beginning of a time step, all Jn redundant systems evolve to a
(possibly corrupted) next state. At the end of the time step, Gallager's modified
iterative decoding scheme is used to correct any errors that may have taken
place. Each bit-copy is corrected using different hardware, i.e., a different set
of 1 + (J − 1)(K − 1) unreliable 2-input XOR gates and one unreliable (J − 1)-bit
voter.
Let J be a fixed even integer greater than 4 and let K be an integer greater than J.
If the 2-input XOR gates suffer transient faults independently with probability
bounded by p_x, the (J − 1)-bit voters suffer transient faults independently (and
independently from the XOR gates) with probability bounded by p_v, and there
exists p such that

p > (J−1 choose J/2) [(K − 1)(2p + 3p_x)]^(J/2) + p_v + p_x ,

then there exists a sequence of (n, k) LDPC codes (with k/n ≥ 1 − J/K) such that
the probability of an overall failure at or before time step L is bounded above
as follows:

Pr[ overall failure at or before time step L ] < L d C k^(−β) ,
where β and C are constants given by

β = −log{ (J − 1)(K − 1) (J−2 choose J/2−1) [(K − 1)(2p + 3p_x)]^(J/2−1) } / ( 2 log[(J − 1)(K − 1)] ) − 3 ,

C = [ J / (1 − J/K)^3 ] (2p + 3p_x) [ 1/(2K) − 1/(2J(K − 1)) ]^(−(β+3)) .

The code redundancy is n/k ≤ 1/(1 − J/K), and the hardware used (including hardware in the error-correcting mechanism) is bounded above by J d (3 + (J − 1)(K − 1)) / (1 − J/K)
XOR gates and by J d / (1 − J/K) voters per system (d is the system dimension).
Proof: The proof follows similar steps as the proofs in [Taylor, 1968a; Taylor,
1968b]. The following discussion provides an overview of the proof; a complete
description can be found in Appendix 7.A.
The state of the overall redundant implementation at a given time step t [i.e.,
the states of the n redundant systems created by the embedding in Eq. (7.4)] is
fully captured by d codewords c_i[t] from an (n, k) LDPC code (1 ≤ i ≤ d). In
other words, the state evolution equation of the n systems can be written as

[ c_1[t+1] ; c_2[t+1] ; ... ; c_d[t+1] ] = A_c [ c_1[t] ; c_2[t] ; ... ; c_d[t] ] ⊕ b X[t] ,

where X[t] = [ x_1[t] x_2[t] ... x_k[t] ] G^T is the encoding of the k inputs
at time step t and A_c, b are the matrices in the state evolution equation (7.2).
Taylor showed that the addition of any two (n, k) codewords modulo-2 can
be done reliably using LDPC codes and Gallager's modified iterative decoding
scheme. Furthermore, he showed that one can reliably perform a sequence of L
such additions by performing a component-wise XOR operation in an array of
n 2-input XOR gates followed by one iteration of Gallager's modified scheme
(using the mechanism shown in Figure 7.3). More specifically, Taylor showed
that

Pr[ overall failure in a sequence of L array additions ] < L C′ k^(−β′)

for constants C′ and β′ that depend on the fault probability of the XOR gates
and the voters, and on the parameters of the LDPC codes used.
Taylor's scheme can be used to perform error correction in the d codewords
from the (n, k) code. This requires, of course, that one maintains J copies of
each codeword (a total of Jd codewords). During each time step, the overall redundant implementation calculates its new state (Jd new codewords) by adding
modulo-2 the corresponding codewords of the current state; this is then followed
by one iteration of error correction based on Gallager's modified scheme.
Since matrix A_c is in canonical form, the computation of each codeword in
the next overall state is based on at most two codewords of the current state (plus
the input modulo-2). So, over L time steps, one essentially has d sequences of
additions modulo-2 in the form that Taylor considered and which he showed
can be protected efficiently via LDPC coding. Using the union bound, the
probability of an overall failure at or before time step L can be bounded as

Pr[ overall failure at or before time step L ] < L d C k^(−β) .

Note that the input is also something that needs to be considered (and one of
the reasons that constants β and C differ from the ones in Taylor's work), but
it is not critical in the proof since the inputs involve no memory. □
5
OTHER ISSUES
In a memoryless binary symmetric channel with crossover probability p, a
bit ("0" or "1") that is provided as input at the transmitting end is corrupted
at the receiving end with probability p. Errors between successive uses of the
channel are assumed to be independent. Shannon studied ways to encode k
input bits into n redundant bits in order to achieve low probability of overall
failure during transmissions. He showed that the probability of error can be
made arbitrarily low using coding techniques, as long as the rate R = k/n of the
code is less than the capacity of the channel, defined as

C = 1 + p log p + (1 − p) log(1 − p)
(for the binary symmetric channel). Moreover, for rates R greater than C, the
probability of error per bit in the transmitted sequence cannot be made arbitrarily small.
Theorem 7.2 looked at embeddings of k distinct instantiations of a particular
LFSM into n redundant systems, each of which is implemented using unreliable components. It was shown that, given certain conditions on the fault
probabilities of components, there exist LDPC codes that allow the n LFSM's
to reliably implement k identical LFSM's (that nevertheless operate on distinct
input streams) and, with nonzero "rate," achieve arbitrarily low probability of
overall failure during any pre-specified time interval. "Rate" in this context
means the amount of redundant hardware that is required per machine instantiation. Specifically, by increasing n and k while keeping k/n ≥ 1 − J/K, the
probability of an overall failure can be made arbitrarily small. An upper bound
on k/n, which might then be called the computational capacity, was not obtained
in Theorem 7.2. Also notice that the bound on the probability of failure that was
obtained in Theorem 7.2 goes down polynomially with the number of systems k
(not exponentially, as was the case for the distributed voting scheme and for
Shannon's approach).
Another issue that was not explicitly addressed in the development of Theorem 7.2 was the encoding of the k original inputs into n inputs according to
X[t] = [ x_1[t] x_2[t] ... x_k[t] ] G^T [see Eq. (7.4)]. Using the generator
matrix G, one sees that each of the n encoded bits can be generated using at
most k information bits (i.e., at most k − 1 2-input XOR gates). This approach,
however, is problematic because as k (and n) increase, each bit will be encoded
incorrectly with probability approaching 1/2 [Gallager, 1963; Taylor, 1968b]. One alternative is to encode using a binary tree of depth log k, where each node performs
a component-wise 2-input XOR operation on two arrays of n bits. This encoding approach requires O(nk) 2-input XOR gates and O(log k) time steps
to complete, but can be done reliably using unreliable XOR gates if at the end
of each stage of the tree evaluation one performs a correcting iteration (of the
type performed at the end of each time step during the operation of the system).
One potential problem is that this encoding approach will reduce the operating
speed of the system by O(log k) steps.
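A sketch of the tree-structured encoder follows (without the correcting iterations, which are omitted here for brevity). The (3, 2) generator matrix is illustrative only:

```python
import numpy as np

def tree_encode(G, u):
    """Encode k information bits u into n bits using a balanced binary tree of
    component-wise XORs on n-bit arrays (depth about log2 k).  In the reliable
    version, a correcting iteration would follow every tree level."""
    n, k = G.shape
    arrays = [G[:, i] * u[i] for i in range(k)]   # leaf i: contribution of u[i]
    while len(arrays) > 1:                        # one tree level per pass
        nxt = [arrays[j] ^ arrays[j + 1] for j in range(0, len(arrays) - 1, 2)]
        if len(arrays) % 2:                       # odd array passes through
            nxt.append(arrays[-1])
        arrays = nxt
    return arrays[0]

G = np.array([[1, 0],
              [0, 1],
              [1, 1]])                 # toy (3, 2) generator matrix
u = np.array([1, 1])
print(tree_encode(G, u))               # matches the direct encoding (u @ G.T) % 2
```

Each `while` pass corresponds to one tree level (and hence one array of n XOR gates), so the loop runs O(log k) times, matching the time-step count quoted above.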
Understanding the constraints due to the "computational capacity" and encoding complexity limitations are issues that are worth exploring in the future.
The idea of having multiple unreliable implementations of the same system,
each operating on distinct inputs, and each offering assistance to the others in
order to achieve reliable computation is a possibility that needs to be explored
further (along the lines of [Taylor, 1968b; Gacs, 1986; Spielman, 1996a; Hadjicostis, 1999]). Another promising direction is to explore the applicability and
effectiveness of encoding the state of an individual system using various types
of codes (along the lines of [Larsen and Reed, 1972; Wang and Redinbo, 1984]).
APPENDIX 7.A: Proof of Theorem 7.2
The proof of Theorem 7.2 appears in [Hadjicostis, 1999] and follows the
steps in [Taylor, 1968a; Taylor, 1968b].
The overall redundant implementation starts operation at time step 0. As
described in Section 4, during each time step, all Jn redundant systems are first
allowed to evolve to their corresponding (and possibly corrupted) next states;
then, error correction is performed using one iteration of Gallager's modified
decoding scheme. This is done in parallel for each of the Jd (n, k) codewords
(recall that each codeword has J copies; see Figure 7.A.1).
The low-density parity check (LDPC) coding scheme was constructed so that
the number of independent iterations m satisfies Eq. (7.1). Therefore, for the
first m time steps, the parity checks that are involved in correcting a particular
bit-copy are guaranteed to be in error with independent probabilities (because
errors within these parity check sets are generated by different components).
After the first m time steps, the independence condition in the parity checks will
[Figure: the d-dimensional state vector of each system is stored as d (n, k) codewords; error correction is performed per codeword (a total of d codewords per copy).]
Figure 7.A.1. Encoded implementation of k LFSM's using n redundant LFSM's.
not necessarily be true. If, however, no component fault influences decisions
for m or more consecutive time steps (i.e., by causing a bit-copy to be incorrect
m or more time steps in the future), then one can guarantee that the J − 1 parity
checks for a particular bit-copy are in error with independent probabilities. The
following definitions make this more precise.
DEFINITION 7.A.3 A propagation failure occurs whenever any of the Jnd
bit-copies in the overall redundant implementation is erroneous due to component faults that occurred more than m time steps in the past.
DEFINITION 7.A.4 The initial propagation failure denotes the first propagation failure that takes place, i.e., the occurrence of the first component fault
that propagates for m + 1 time steps in the future.
It will be shown that a propagation failure is very unlikely and that in
most cases the bit errors in the Jd codewords that represent the encoded state of
all LFSM's will depend only on component faults that occurred within the last
few time steps. To calculate the bound on the probability of propagation failure,
one uses a bound on the probability of error per bit-copy, which is established
in the next section.
Initial Propagation Failure
The probability of error per bit-copy at the end of time step T, 0 ≤ T ≤ m, can
be bounded by a constant p. It will be shown that this is true as long as T ≤ m
and p satisfies the condition in the statement of Theorem 7.2.
To see why this is the case, consider the following:
• In order to calculate a certain bit-copy in its next-state vector, each of the Jn
redundant systems uses at most two bit-copies from a previous state vector
and performs at most two 2-input XOR operations (one XOR-ing involves
the two bit-copies in the previous state vector, the other one involves the
input). Using the union bound, the probability of error per bit-copy at the
end of the state evolution stage is bounded above by

Pr[ error per bit-copy after state evolution at step T ] ≤ 2p + 2p_x ≡ q .

This is simply the union bound of the events that any of the two previous
bit-copies is erroneous and/or that there is a fault in any of the two XOR
gates (for simplicity the input provided is assumed to be correct). Note that
independence is not required here.
• Once all Jn systems transition to their next states, error correction is performed along the Jd codewords (see Figure 7.A.1). Correction involves one
iteration of Gallager's modified decoding scheme. Recall that each bit-copy
is corrected using J − 1 parity checks, each of which involves K − 1 other bit-copies. A parity check associated with a particular bit-copy b_i^j (1 ≤ j ≤ J)
is said to be in error if bit-copy b_i^j is incorrect but the parity check is "0", or
if b_i^j is correct but the parity check is "1." This is because ideally one would
like parity checks to be "0" if their corresponding bit-copy is correct and to
be "1" if the bit-copy is incorrect. Note that this definition decouples the
probability of a parity check being in error from whether or not the associated bit-copy is erroneous. The probability of an error in the calculation of a
parity check (see the error-correcting mechanism in Figure 7.3) is bounded
by

Pr[ parity check in error ] ≤ (K − 1)(q + p_x) = (K − 1)(2p + 3p_x)

(i.e., a parity check for a particular bit-copy is in error if there is an error
in any of the K − 1 other bit-copies or a fault in any of the K − 1 XOR
operations).
• A particular bit-copy will not be corrected if one or more of the following
three events happen: (i) J/2 or more of the associated J − 1 parity checks
are in error, (ii) there is a fault in the voting mechanism, or (iii) there is a
fault in the XOR gate that receives the voter output as input (see Figure 7.3).
If the parity checks associated with a particular bit-copy are in error with
independent probabilities, then

Pr[ error per bit-copy after correction ] ≤ (J−1 choose J/2) [(K − 1)(2p + 3p_x)]^(J/2) + p_v + p_x ≤ p .
Therefore, if the parity checks for each bit-copy are in error with independent
probabilities, then the system ends up with a probability of error per bit-copy
that satisfies

Pr[ error per bit-copy at end of time step T ] ≤ p .
The constant p can be viewed as a bound on the "steady-state" probability of
error per bit-copy at the end/beginning of each time step (at least up to time
step m).
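One can check numerically that such a constant p exists for given component fault rates by iterating the right-hand side of the inequality to a fixed point; the parameter values below are illustrative only:

```python
from math import comb

def steady_state_p(J, K, px, pv, iters=50):
    """Iterate p <- (J-1 choose J/2) [(K-1)(2p + 3px)]^(J/2) + pv + px starting
    from the component-fault floor; convergence exhibits a valid constant p."""
    p = pv + px
    for _ in range(iters):
        p = comb(J - 1, J // 2) * ((K - 1) * (2 * p + 3 * px)) ** (J // 2) + pv + px
    return p

# J even and greater than 4, K greater than J, small component fault rates
p = steady_state_p(J=6, K=8, px=1e-6, pv=1e-6)
print(p)   # about 2e-6: the per-bit-copy error probability stays tiny
```

Starting from p_v + p_x (the floor imposed by the voter and the final XOR gate) and iterating, the correction term contributes only on the order of 1e-12 here, so the sequence converges almost immediately; for fault rates that are too large, the iteration instead diverges and no valid p exists.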
This "steady-state" probability of error per bit-copy remains valid for T >
m as long as the initial propagation failure does not take place. The only
complication is that the probability of error per bit-copy conditional on the event
that no propagation failure has taken place may not necessarily be bounded
by p. Next, it is shown that this bound does remain true; the proof is a direct
consequence of the definition of a propagation failure.
At the end of time step T = m, the probability of error per bit-copy is
bounded by p. However, in order to derive the same bound for the probability
of error per bit-copy at the end of time step T = m + 1, one has to assume that
different parity checks for a particular bit-copy are in error independently. To
ensure this, it is enough to require that no component fault took place at time
step T = 0 and propagated up to time step m (so that it causes a propagation
failure at time step T = m + 1).
The probability that a particular bit-copy b_i^j (1 ≤ j ≤ J) is in error at the
end of time step T = m conditional on no propagation failure (no PF) up to
time step T = m is denoted by

Pr[ error per bit-copy at end of time step T = m | no initial PF at T = m ]
and is smaller than or equal to the "steady-state" probability of error per bit-copy
(i.e., smaller than p). To see this, consider patterns of component faults at time
steps T = 0, 1, ..., m that cause bit-copy b_i^j to be erroneous at the end of time
step m. If this event is called A, then it is clear that Pr(A) ≤ p. Let B
denote the set of primitive events (patterns of component faults at time steps
T = 0, 1, ..., m) that lead to a propagation failure at bit-copy b_i^j. Note that by
definition set B is a subset of A (B ⊂ A) because a propagation failure at time
step T = m has to corrupt bit-copy b_i^j. Therefore,

Pr[ b_i^j is erroneous at end of T = m | no initial PF at b_i^j at time m ] =
    = [ Pr(A) − Pr(B) ] / [ 1 − Pr(B) ]
    ≤ Pr(A) ≤ p .
One easily concludes that the "steady-state" probability of error per bit-copy
remains bounded by p given that no propagation failure takes place. Note that
one actually conditions on the event that "no propagation failure takes place
in any of the bit-copies at time step m," which is different from event B. The
proof goes through in exactly the same way because a pattern of component
faults that causes a propagation failure at a different bit-copy can either cause
an error in the computation of bit-copy b_i^j or not interfere with it at all.
Bounding the Probability of Initial Propagation Failure
Given that no propagation failure has taken place up to time step T, a bound
on the probability of error per bit-copy is available and the parity checks for a
given bit-copy are in error with independent probabilities. Using this, one can
calculate the probability that a component fault that took place at time step T − m
propagates up to time step T, corrupts the value of bit-copy b_i^j (1 ≤ j ≤ J) and
causes the initial propagation failure. This is called the "probability of initial
propagation failure at bit-copy b_i^j."
Note that in order for a component fault to propagate for m time steps it is
necessary that it was critical in causing a wrong decision during the correcting
stages of m consecutive time steps. In other words, without this particular
component fault the decision/correction for all of these time steps would have
had the desired effect.
Let P_m denote the probability that a component fault has propagated for m
consecutive time steps in a way that causes the value of bit-copy b_i^j at time step
T to be incorrect. In order for this situation to happen, both of the following
two independent conditions are required:
1. The value of one or more of the (J − 1)(K − 1) bit-copies involved in the
parity checks of bit-copy b_i^j is incorrect because of a component fault that
has propagated for m − 1 time steps. Since each such bit-copy was generated
during the state evolution stage of time step T based on at most two bit-copies
from the previous state vector (the input bit is irrelevant in error propagation),
the probability of this event is bounded by

(J − 1)(K − 1) · 2P_(m−1) ,

where P_(m−1) is the probability that a component fault has propagated for
m − 1 consecutive time steps (causing an incorrect value in one of the bit-copies used in the parity checks for b_i^j). The factor of two comes in because
the fault that propagates for m − 1 time steps could be in any of the at most
two bit-copies used to generate b_i^j during the state evolution stage. This is
due to the fact that the system matrix A_c in Eq. (7.2) is in standard canonical
form.
2. Since one of the parity checks is associated with the fault that has propagated
for m − 1 time steps, at least J/2 − 1 out of the J − 1 remaining parity checks
would have to be erroneous. The probability of this event is bounded by

(J−2 choose J/2−1) [(K − 1)(q + p_x)]^(J/2−1) .

If no propagation failure has taken place, errors in parity checks will be
independent. Therefore, the probability of a fault propagating for m consecutive
time steps is bounded by

P_m ≤ (J − 1)(K − 1) · 2P_(m−1) · (J−2 choose J/2−1) [(K − 1)(q + p_x)]^(J/2−1) .

Similarly,

P_(m−1) ≤ (J − 1)(K − 1) · 2P_(m−2) · (J−2 choose J/2−1) [(K − 1)(q + p_x)]^(J/2−1) ,

and so forth. One concludes that

P_m ≤ (q + p_x) 2^m { (J − 1)(K − 1) (J−2 choose J/2−1) [(K − 1)(q + p_x)]^(J/2−1) }^m .

The union bound can be used to obtain an upper bound on the probability
that the initial propagation failure takes place at time step T. For this to happen,
a pattern of component faults has to propagate for m time steps in at least one
of the Jnd bit-copies of the redundant construction, i.e.,

Pr[ initial prop. failure at time step T ] ≤ Jnd P_m .
If one uses Gallager's construction in [Gallager, 1963], the LDPC codes will satisfy the following conditions:
$$m > \frac{\log n + \log\frac{KJ-K-J}{2KJ(K-1)}}{2\log[(J-1)(K-1)]} \equiv A(n)\,,$$
$$m < \frac{\log n}{\log[(J-1)(K-1)]}\,,$$
$$n \le \frac{k}{1-J/K}\,.$$
Using the first inequality, one obtains
$$\Pr[\text{initial prop. failure at time step } T] \le Jnd(q+p_x)\,2^m\left[(J-1)(K-1)\binom{J-2}{J/2-1}\left[(K-1)(q+p_x)\right]^{J/2-1}\right]^{A(n)}$$
$$\le Jnd(q+p_x)\,2^m\left\{\left[\frac{1}{2K}-\frac{1}{2J(K-1)}\right]n\right\}^{-\beta'},$$
where $\beta'$ is given by
$$\beta' = -\,\frac{\log\left\{(J-1)(K-1)\binom{J-2}{J/2-1}\left[(K-1)(q+p_x)\right]^{J/2-1}\right\}}{2\log[(J-1)(K-1)]}\,.$$
Since $k \le n$ and $n \le \frac{k}{1-J/K}$, one gets
$$\Pr[\text{initial prop. failure at time step } T] \le \frac{J}{1-J/K}\,d\,k\,(q+p_x)\,2^m\left\{\left[\frac{1}{2K}-\frac{1}{2J(K-1)}\right]k\right\}^{-\beta'}.$$
Clearly,
$$2^m \le 2^{\log n} \le n \le \frac{k}{1-J/K}\,,$$
which leads to
$$\Pr[\text{initial prop. failure at time step } T] < d\,C'\,k^{-\beta'+2}\,,$$
where
$$C' \equiv (q+p_x)\,\frac{J}{(1-J/K)^2}\left[\frac{1}{2K}-\frac{1}{2J(K-1)}\right]^{-\beta'}.$$
Bounding the Probability of Overall Failure
Note that if no propagation failure takes place in the interval from time step 0 to T, a fault-free iterative decoder will be able to correctly decode the state of the overall system (all Jd codewords) at each time step. The reason is that no previous fault can be critical in causing consecutive erroneous decisions in more than m decoding iterations. Using this, one can find an upper bound on the probability that the initial propagation failure takes place at time step T, assuming that no propagation failure has taken place in the time interval from 0 to T.
An upper bound on the probability that the initial propagation or decoding failure takes place is given by
$$\Pr[\text{overall failure at time step } T] < m\,d\,C'\,k^{-\beta'+2}\,,$$
or, since $m \le \log n \le n \le \frac{k}{1-J/K}$,
$$\Pr[\text{overall failure at time step } T] < d\,C\,k^{-\beta}\,,$$
where
$$\beta = \beta' - 3 = -\,\frac{\log\left\{(J-1)(K-1)\binom{J-2}{J/2-1}\left[(K-1)(q+p_x)\right]^{J/2-1}\right\}}{2\log[(J-1)(K-1)]} - 3\,,$$
$$C = \frac{C'}{1-J/K} = (q+p_x)\,\frac{J}{(1-J/K)^3}\left[\frac{1}{2K}-\frac{1}{2J(K-1)}\right]^{-\beta'}.$$
Using the union bound, the probability of an overall failure at or before time step L can be bounded by
$$\Pr[\text{overall failure at or before time step } L] < L\,d\,C\,k^{-\beta}\,.$$
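As a sanity check on these expressions, the exponents and constants can be evaluated numerically. The sketch below is an illustration, not part of the original analysis: it uses natural logarithms (the base cancels in the ratios), illustrative parameters J = 4, K = 5 (J even, J < K), and a single combined probability `q_plus_px` standing for q + p_x; the function name is an assumption.

```python
import math

def propagation_exponents(J, K, q_plus_px):
    # bracketed term from the recursion on p_m:
    # (J-1)(K-1) * binom(J-2, J/2-1) * [(K-1)(q+p_x)]^(J/2-1)
    bracket = ((J - 1) * (K - 1) * math.comb(J - 2, J // 2 - 1)
               * ((K - 1) * q_plus_px) ** (J // 2 - 1))
    beta_prime = -math.log(bracket) / (2 * math.log((J - 1) * (K - 1)))
    beta = beta_prime - 3
    const = 1 / (2 * K) - 1 / (2 * J * (K - 1))
    C_prime = q_plus_px * J / (1 - J / K) ** 2 * const ** (-beta_prime)
    C = C_prime / (1 - J / K)
    return beta_prime, beta, C

bp, b, C = propagation_exponents(J=4, K=5, q_plus_px=1e-12)
print(f"beta' = {bp:.2f}, beta = {b:.2f}")
```

For the bound $L\,d\,C\,k^{-\beta}$ to decay as the number of information bits k grows, $\beta$ must be positive, which forces q + p_x below a (J, K)-dependent threshold.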
Notes
1 This is achieved by encoding k information bits into n > k bits, transmitting these n bits through the channel, receiving n (possibly corrupted) bits, performing error correction and finally decoding the n bits into the original k bits.
2 The overall state is an nd binary vector that represents kd bits of information.
The n redundant systems "perform correctly without an overall failure for
L time steps" if their overall state at time step T (0 ≤ T ≤ L) is within the
set of nd vectors that correspond to the actual kd bits of information at that
particular time step. In other words, if a fault-free (iterative) decoder was
available, one would be able to obtain the correct states of the k underlying
systems.
References
Avizienis, A. (1981). Fault-tolerance by means of external monitoring of computer systems. In Proceedings of the 1981 National Computer Conference, pages 27-40.
Bhattacharyya, A. (1983). On a novel approach of fault detection in an easily testable sequential machine with extra inputs and extra outputs. IEEE
Transactions on Computers, 32(3):323-325.
Gacs, P. (1986). Reliable computation with cellular automata. Journal of Computer and System Sciences, 32(2):15-78.
Gallager, R. G. (1963). Low-Density Parity Check Codes. MIT Press, Cambridge, Massachusetts.
Gallager, R. G. (1968). Information Theory and Reliable Communication. John
Wiley & Sons, New York.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic
Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. (2000). Fault-tolerant dynamic systems. In Proceedings of
ISIT 2000, the Int. Symp. on Information Theory, page 444.
Hadjicostis, C. N. and Verghese, G. C. (1999). Fault-tolerant linear finite state machines. In Proceedings of the 6th IEEE Int. Conf. on Electronics, Circuits and Systems, pages 1085-1088.
Iyengar, V. S. and Kinney, L. L. (1985). Concurrent fault detection in microprogrammed control units. IEEE Transactions on Computers, 34(9):810-821.
Johnson, B. (1989). Design and Analysis of Fault-Tolerant Digital Systems.
Addison-Wesley, Reading, Massachusetts.
Larsen, R. W. and Reed, I. S. (1972). Redundancy by coding versus redundancy
by replication for failure-tolerant sequential circuits. IEEE Transactions on
Computers, 21(2):130-137.
142
CODING APPROACHES TO FAULT TOLERANCE
Leveugle, R., Koren, Z., Koren, I., Saucier, G., and Wehn, N. (1994). The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398-1406.
Leveugle, R. and Saucier, G. (1990). Optimized synthesis of concurrently
checked controllers. IEEE Transactions on Computers, 39(4):419-425.
Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1991). A methodology for designing optimal self-checking sequential circuits. In Proceedings of the Int. Conf. VLSI Design, pages 283-291. IEEE CS Press.
Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1995). Concurrent error detection using monitoring machines. IEEE Design and Test of Computers, 12(3):24-32.
Pippenger, N. (1990). Developments in the synthesis of reliable organisms from
unreliable components. In Proceedings of Symposia in Pure Mathematics,
volume 50, pages 311-324.
Pradhan, D. K. (1996). Fault- Tolerant Computer System Design. Prentice Hall,
Englewood Cliffs, New Jersey.
Robinson, S. H. and Shen, J. P. (1992). Direct methods for synthesis of self-monitoring state machines. In Proceedings of the 22nd Fault-Tolerant Computing Symp., pages 306-315. IEEE CS Press.
Siewiorek, D. and Swarz, R. (1998). Reliable Computer Systems: Design and Evaluation. A. K. Peters.
Sipser, M. and Spielman, D. A. (1996). Expander codes. IEEE Transactions on Information Theory, 42(6):1710-1722.
Spielman, D. A. (1996a). Highly fault-tolerant parallel computation. In Proceedings of the Annual Symp. on Foundations of Computer Science, volume 37, pages 154-160.
Spielman, D. A. (1996b). Linear-time encodable and decodable error-correcting codes. IEEE Transactions on Information Theory, 42(6):1723-1731.
Taylor, M. G. (1968a). Reliable computation in computing systems designed from unreliable components. The Bell System Technical Journal, 47(10):2339-2366.
Taylor, M. G. (1968b). Reliable information storage in memories designed from unreliable components. The Bell System Technical Journal, 47(10):2299-2337.
Wang, G. X. and Redinbo, G. R. (1984). Probability of state transition errors in
a finite state machine containing soft failures. IEEE Transactions on Computers, 33(3):269-277.
Chapter 8

CODING APPROACHES FOR FAULT DETECTION AND IDENTIFICATION IN DISCRETE EVENT SYSTEMS
1 INTRODUCTION
This chapter applies coding techniques in the context of detecting and identifying faults in complex discrete event systems (DES's) that can be modeled as
Petri nets [Hadjicostis, 1999; Hadjicostis and Verghese, 1999]. The approach
is based on replacing the Petri net model of a given DES with a redundant
Petri net model in a way that preserves the state, evolution and properties of the
original system in some encoded form. This redundant Petri net model enables
straightforward fault detection and identification based on simple parity checks
that are used to verify the validity of artificially-imposed invariant conditions.
Criteria and methods for designing redundant Petri net models that achieve the
desired objective while minimizing the cost associated with them (e.g., by minimizing the number of sensors or communication links) are not pursued here,
but several examples illustrate how such problems can be approached.
In many ways, the development in this chapter parallels the discussion on
fault-tolerant redundant implementations in Chapters 5 and 6. The main difference is in terms of the underlying assumptions/constraints and the fault model
that is used. In particular, in the context of fault diagnosis, the objective is to
interpret activity/status information in a way that facilitates fault detection and
identification. In most cases, the system implementation cannot be changed
and the fault diagnosis scheme does not have any flexibility in the choice of
sensor allocation or sensor measurements. Thus, an effective diagnoser is one
that is able to handle the available sensory data and determine (with a reasonable delay) what faults, if any, have taken place; the diagnoser is commonly
assumed to be fault-free.
The usual approach in constructing a diagnoser is to locate invariant properties of the given system, a subset of which is violated soon after a particular
fault takes place. Then, by monitoring the activity in the system, one can detect
violations of such invariant properties (which indicates the presence of a fault)
and correlate them with a unique fault in the system (which then constitutes fault
identification). The task becomes challenging because of potential observability limitations (in terms of the inputs, states or outputs that are observed [Cieslak
et al., 1988]) and various other requirements (such as detection/communication delays [Debouk et al., 2000], sensor allocation limitations [Debouk et al., 1999], distributivity/decentralizability constraints [Aghasaryan et al., 1998; Debouk et al., 1998], or the sheer size of the diagnoser). There is a large volume of related work, especially within the systems/control [Gertler, 1998] and computer engineering communities [Tinghuai, 1992]. More relevant to this chapter are previous works on fault diagnosis in large-scale discrete event systems. This includes the work in [Sampath et al., 1995; Sampath et al., 1998], which studies fault diagnosis in finite-state machines using a language-theoretic approach, the work in [Valette et al., 1989; Cardoso et al., 1995], which models the behavior of a discrete event system as a Petri net and develops state estimation techniques to perform fault diagnosis, and the work in [Pandalai and Holloway, 2000], which performs diagnosis based on timing relations between events. Also relevant are the methodologies for fault diagnosis in complex communication networks that appeared in [Bouloutas et al., 1992; Wang and Schwartz, 1993; Park and Chong, 1995; Aghasaryan et al., 1997a; Aghasaryan et al., 1997b; Aghasaryan et al., 1998].
The presentation in this chapter analyses fault diagnosis schemes that result
from two types of redundant Petri net implementations.
(i) Separate redundant Petri net implementations retain the functionality of the
original Petri net intact and use additional places and tokens in order to
impose invariant conditions.
(ii) Non-separate redundant Petri net implementations only need to retain the
original Petri net functionality in some encoded form, allowing in this way
additional flexibility in the design of diagnosis schemes.
As mentioned earlier, the schemes that result from both separate and non-separate redundant Petri net implementations are attractive because of their
simplicity. They are also able to automatically point out the additional connections that are necessary and they may not require explicit acknowledgments
from each activity. These issues are elaborated upon later on; making additional
connections between coding theory and fault diagnosis is certainly a worthwhile
future direction.
Figure 8.1. Petri net with three places and three transitions. The matrices shown alongside the net in the figure are
$$B^+ = \begin{bmatrix} 0 & 1 & 1 \\ 1 & 0 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \qquad B^- = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}.$$

2 PETRI NET MODELS OF DISCRETE EVENT SYSTEMS
Petri nets are a graphical and mathematical model for a variety of information and processing systems [Murata, 1989]. Due to their power and flexibility, Petri nets are particularly relevant to the study of concurrent, asynchronous, distributed, nondeterministic, and/or stochastic systems [Baccelli et al., 1992; Cassandras et al., 1995]. They are used to model manufacturing systems [Desrochers and Al-Jaar, 1994], communication protocols or other DES's [Cassandras, 1993].

A Petri net S is represented by a directed, bipartite graph with two kinds of nodes: places (denoted by {p1, p2, ..., pd} and drawn as circles) and transitions (denoted by {t1, t2, ..., tu} and drawn as rectangles). Weighted directed arcs connect transitions to places and vice-versa (but there are no connections from a place to a place or from a transition to a transition). The arc weights have to be nonnegative integers: $b^-_{ij}$ denotes the integer weight of the arc from place pi to transition tj and $b^+_{ij}$ denotes the integer weight of the arc from transition tj to place pi. The graph shown in Figure 8.1 is an example of a Petri net with d = 3 and u = 3; its three places are denoted by p1, p2 and p3, and its three transitions by t1, t2 and t3 (arcs with zero weight are not drawn).
Depending on the system modeled by the Petri net, input places can be interpreted as preconditions, input data/signals, resources, or buffers; transitions can be regarded as events, computation steps, tasks, or processors; output places can represent postconditions, output data/signals, conclusions, or buffers. Each place functions as a token holder. Tokens are drawn as black dots and represent resources that are available at different parts of the system. The number of tokens in a place cannot be negative. At any given time instant t, the marking (state) of the Petri net is given by the number of tokens at each of its places; for the Petri net in the figure, the marking (at time instant 0) is given by
$$q_s[0] = \begin{bmatrix} 2 \\ 1 \\ 0 \end{bmatrix}.$$
Transitions model events that cause the rearrangement, generation or disappearance of tokens. Transition tj is enabled (i.e., it is allowed to take place) only if each of its input places pi has at least $b^-_{ij}$ tokens (where, as explained before, $b^-_{ij}$ is the weight of the arc from place pi to transition tj). When transition tj takes place (transition tj is said to fire), it removes $b^-_{ij}$ tokens from each input place pi and adds $b^+_{ij}$ tokens to each output place pi. In the Petri net in Figure 8.1, transitions t1 and t2 are enabled but transition t3 is not. If transition t1 fires, it removes 2 tokens from its input place p1 and adds one token each to its output places p2 and p3; the corresponding state of the Petri net (at the next time instant) will be
$$q_s[1] = \begin{bmatrix} 0 \\ 2 \\ 1 \end{bmatrix}.$$
Let $B^- = [b^-_{ij}]$ (respectively $B^+ = [b^+_{ij}]$) denote the d × u matrix with $b^-_{ij}$ (respectively $b^+_{ij}$) at its ith row, jth column position. The state evolution of a Petri net can then be represented by the following equation:
$$q_s[t+1] = q_s[t] + (B^+ - B^-)\,x[t] \qquad (8.1)$$
$$\phantom{q_s[t+1]} = q_s[t] + B\,x[t]\,, \qquad (8.2)$$
where $B \equiv B^+ - B^-$. (Figure 8.1 shows the corresponding $B^+$ and $B^-$ for that Petri net.) The input x[t] in the above description is u-dimensional and is restricted to have exactly one nonzero entry with value "1." When
$$x[t] = x_j = [0 \;\cdots\; 0 \;\; 1 \;\; 0 \;\cdots\; 0]^T$$
(the single "1" being at the jth position), transition tj fires (j is in {1, 2, ..., u}).

Figure 8.2. Cat-and-mouse maze (five rooms; the cat moves through unidirectional doors c1-c8 and the mouse through unidirectional doors m1-m6).
Note that transition tj is enabled at time instant t if and only if
$$q_s[t] \ge B^-(:,j)\,,$$
where $B^-(:,j)$ denotes the jth column of $B^-$ and the inequality is taken element-wise. A pure Petri net is one in which no place serves as both an input and an output for the same transition (i.e., at most one of $b^+_{ij}$ and $b^-_{ij}$ can be nonzero). The Petri net in Figure 8.1 (with the indicated $B^+$ and $B^-$ matrices) is a pure Petri net. Matrix B has integer entries and its transpose is known as the incidence matrix [Murata, 1989].
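The enabling rule and the update of Eqs. (8.1)-(8.2) can be sketched in a few lines of code. The two-place, one-transition net below is a made-up toy example, not the net of Figure 8.1:

```python
import numpy as np

def enabled(q, Bminus, j):
    """t_j is enabled iff every input place holds at least B-(:, j) tokens."""
    return bool(np.all(q >= Bminus[:, j]))

def fire(q, Bplus, Bminus, j):
    """Fire t_j: consume B-(:, j) tokens, deposit B+(:, j) tokens."""
    assert enabled(q, Bminus, j), "transition not enabled"
    return q + Bplus[:, j] - Bminus[:, j]

# toy pure Petri net: t1 consumes 2 tokens from p1 and puts 1 token in p2
Bminus = np.array([[2], [0]])
Bplus = np.array([[0], [1]])
q0 = np.array([2, 0])
q1 = fire(q0, Bplus, Bminus, 0)
print(q1)  # -> [0 1]
```

After the firing, t1 is no longer enabled (p1 holds fewer than 2 tokens), mirroring the element-wise test $q_s[t] \ge B^-(:,j)$.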
Discrete event systems are often modeled as Petri nets. The following example presents the Petri net version of the popular "cat-and-mouse" problem,
introduced by Ramadge and Wonham in the setting of supervisory control [Ramadge and Wonham, 1989] and described as a Petri net in [Yamalidou et al., 1996]. The authors in [Ramadge and Wonham, 1989; Yamalidou et al., 1996]
were concerned with controlling the doors in the maze so that the two animals
are never in the same room together. The task becomes challenging because
only a subset of the doors may be controllable and because one may wish to
allow maximum freedom in the movement of the two animals (while avoiding
their entrance into the same room). Fault detection and identification in such
systems is discussed later in this chapter.
EXAMPLE 8.1 A cat and a mouse circulate in the maze of Figure 8.2, with the cat moving from room to room through a set of unidirectional doors {c1, c2, ..., c8} and the mouse through a set of unidirectional doors {m1, m2, ..., m6}. The Petri net model is based on two independent subnets, one dealing with the cat's position and movements and the other dealing with the mouse's position and movements. Each subnet has five places, corresponding to the five rooms in the maze. A token in a certain place indicates that the mouse (or the cat) is in the corresponding room. Transitions model the movements of the two animals between different rooms (as allowed by the structure of the maze in Figure 8.2). The subnet that deals with the mouse has a marking with five variables, exactly one of which has the value "1" (the rest are set to zero). The state evolution for this subnet is given by Eqs. (8.1) and (8.2) with
$$B^+ = \begin{bmatrix} 0&0&1&0&0&1 \\ 0&1&0&0&0&0 \\ 1&0&0&0&0&0 \\ 0&0&0&0&1&0 \\ 0&0&0&1&0&0 \end{bmatrix}, \qquad B^- = \begin{bmatrix} 1&0&0&1&0&0 \\ 0&0&1&0&0&0 \\ 0&1&0&0&0&0 \\ 0&0&0&0&0&1 \\ 0&0&0&0&1&0 \end{bmatrix}.$$
For example, state $q_s[t] = [0\; 1\; 0\; 0\; 0]^T$ indicates that at time instant t the mouse is in room 2. Transition t3 takes place when the mouse moves from room 2 to room 1 through door m3; this causes the new state to be $q_s[t+1] = [1\; 0\; 0\; 0\; 0]^T$. In [Yamalidou et al., 1996] the two subnets associated with the mouse and cat movements were combined in an overall Petri net, which was then used to construct a linear controller that achieved the desired objective (i.e., disallowed the presence of the cat and the mouse in the same room while permitting maximum freedom in their movement within the maze).
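To make the mouse-subnet update concrete, the sketch below encodes only the one move spelled out in the text (door m3, i.e., transition t3, taking the mouse from room 2 to room 1); the remaining columns of $B^+$ and $B^-$ depend on the maze layout of Figure 8.2 and are left as zero placeholders here.

```python
import numpy as np

# Mouse subnet: 5 places (rooms), 6 transitions (doors m1-m6).
# Only column 3 (0-based index 2) is filled in: the text states that
# door m3 takes the mouse from room 2 to room 1. Other columns are
# placeholders (zeros), since they depend on the maze layout.
d, u = 5, 6
Bplus = np.zeros((d, u), dtype=int)
Bminus = np.zeros((d, u), dtype=int)
Bminus[1, 2] = 1  # t3 consumes the token in room 2
Bplus[0, 2] = 1   # ... and deposits it in room 1

q = np.array([0, 1, 0, 0, 0])          # mouse in room 2
x = np.zeros(u, dtype=int)
x[2] = 1                               # fire t3, per Eq. (8.1)
q_next = q + (Bplus - Bminus) @ x
print(q_next)  # -> [1 0 0 0 0]
```

This reproduces the transition from $q_s[t] = [0\,1\,0\,0\,0]^T$ to $q_s[t+1] = [1\,0\,0\,0\,0]^T$ described in the example.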
3 FAULT MODELS FOR PETRI NETS
In complex DES's with Petri net models that have a large number of places
and transitions, faults can manifest themselves in a variety of ways, including
malfunctions due to hardware, software or communication components. It is
therefore essential that systems are designed with the ability to detect, locate and
correct these different types of faults. This section discusses the fault models
that will be used in the forthcoming fault detection and identification schemes.
As mentioned in Chapter 1, the fault model needs to capture the underlying
faults in an efficient manner. Since faults in DES's depend on the actual implementation (which varies considerably depending on the application), three
different fault models are considered [Hadjicostis, 1999].
Transition Faults: Transition tj is said to fail to execute its postconditions if no tokens are deposited to its output places, even though the correct number of tokens from the input places have been consumed. Similarly, transition tj is said to fail to execute its preconditions if the tokens that are supposed to be consumed from the input places of the faulty transition are not consumed even though the correct number of tokens are deposited at the corresponding output places. In terms of the state evolution in Eq. (8.1), a fault at transition tj corresponds to transition tj firing, but its preconditions, as given by the jth column of $B^-$ [denoted by $B^-(:,j)$], or its postconditions, as given by $B^+(:,j)$, not taking effect.
Place Faults: Faults that corrupt the number of tokens in a single place of the Petri net are modeled by place faults. In terms of Eq. (8.1), a place fault at time instant t causes the value of a single variable in the d-dimensional state $q_s[t]$ to be incorrect. This fault model is suitable for Petri nets that represent computational systems or finite-state machines and has appeared in earlier work that dealt with fault detection in pure Petri nets [Sifakis, 1979; Silva and Velilla, 1985].
Additive Fault Model: Another approach is to model the error of each fault of interest in terms of its additive effect on the state $q_s[t]$ of the Petri net. In particular, if fault f(i) takes place at time instant t, then, the corrupted state $q_{f(i)}[t]$ of the Petri net can be written as
$$q_{f(i)}[t] = q_s[t] + e_{f(i)}\,,$$
where $e_{f(i)}$ is the additive effect of fault f(i). One can find a priori the additive effect $e_{f(\cdot)}$ for each fault, so that the d × l error matrix
$$E = \begin{bmatrix} e_{f(1)} \mid e_{f(2)} \mid \cdots \mid e_{f(l)} \end{bmatrix}$$
(where l is the total number of faults) summarizes all that is necessary to detect and identify this set of faults in the given Petri net.
Note that the additive fault model captures both transition and place faults:
• If transition tj fails to execute its preconditions, then, $e_{t_j} = B^-(:,j)$, whereas if tj fails to execute its postconditions, then, $e_{t_j} = -B^+(:,j)$.
• The corruption of the number of tokens in place pi is captured by the additive d-dimensional error array
$$e_{p_i} = c \times [0 \;\cdots\; 0 \;\; 1 \;\; 0 \;\cdots\; 0]^T,$$
where c is an integer that denotes the number of tokens that have been added and where the only nonzero entry appears at the ith position.

The big advantage of the additive fault model is that it can easily capture the effects of multiple additively independent faults, that is, faults whose additive effect does not depend on whether other faults have taken place or not. For example, a precondition fault at transition tj and an independent fault at place pi will result in the additive error array $e_{t_j} + e_{p_i}$.

Figure 8.3. Petri net model of a distributed processing system.
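As a small illustration of assembling the error matrix E, the sketch below collects the precondition/postcondition signatures $\pm B^{\mp}(:,j)$ and unit place-fault signatures; the 3-place, 2-transition net is hypothetical, made up for the example.

```python
import numpy as np

# Build the error matrix E whose columns are the a-priori additive
# signatures of the faults of interest: for each transition, a
# precondition failure (+B-(:, j)) and a postcondition failure
# (-B+(:, j)); for each place, a unit token corruption.
Bminus = np.array([[1, 0], [0, 1], [0, 0]])   # hypothetical net
Bplus = np.array([[0, 0], [1, 0], [0, 1]])

columns = []
for j in range(Bminus.shape[1]):
    columns.append(Bminus[:, j])    # t_j precondition failure
    columns.append(-Bplus[:, j])    # t_j postcondition failure
for i in range(3):
    e = np.zeros(3, dtype=int)
    e[i] = 1                        # unit corruption at place p_i
    columns.append(e)
E = np.column_stack(columns)
print(E.shape)  # -> (3, 7)
```

With E in hand, detection and identification amount to matching an observed additive error against the columns (or small integer combinations of columns) of E.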
EXAMPLE 8.2 Consider the Petri net in Figure 8.3 which could be the model of a distributed processing network or a flexible manufacturing system. Transition t2 models a process that takes as input two data packets (or two raw products) from place p2 and produces two different data packets (or intermediate products), one of which gets deposited to place p3 and one of which gets deposited to place p4. Processes t3 and t4 take input packets from places p3 and p4 respectively, and produce final data packets (or final products) in places p5 and p6 respectively. Note that processes t3 and t4 can take effect concurrently; once done, they return separate acknowledgments to places p5 and p6 so that process t2 can be enabled again. Transition t1 models the external input to the system and is always enabled. The state of the Petri net shown in Figure 8.3 is given by $q_s[0] = [2\; 2\; 0\; 0\; 1\; 1]^T$; only transitions t1 and t2 are enabled.

If the process modeled by transition t2 fails to execute its postconditions, tokens will be removed from input places p2, p5 and p6, but no tokens will be deposited at output places p3 and p4. The erroneous state of the Petri net will be $q_f[1] = [2\; 0\; 0\; 0\; 0\; 0]^T$.
Figure 8.4. Concurrent monitoring scheme using a separate Petri net implementation (the original Petri net system supplies state and transition information to a separate monitor; a checker compares the two and flags errors).
If, instead, process t2 fails to execute its preconditions, then, tokens will appear at the output places p3 and p4 but no tokens will be removed from the input places p2, p5 and p6. The erroneous state of the Petri net will be $q_f[1] = [2\; 2\; 1\; 1\; 1\; 1]^T$.

If process t2 executes correctly but there is a fault at place p4, then, the resulting state will be of the form $q_f[1] = [2\; 0\; 1\; 1{+}c\; 0\; 0]^T$ (the number of tokens at place p4 is corrupted by c).
4 SEPARATE MONITORING SCHEMES

4.1 SEPARATE REDUNDANT PETRI NET IMPLEMENTATIONS
In separate monitoring schemes the original Petri net is enhanced by a separate monitor, whose state is updated according to transition activity in the
original system [Hadjicostis, 1999; Hadjicostis and Verghese, 1999]. Faults
can be concurrently detected and identified by comparing the state of the original system and the monitor (see Figure 8.4). A special case of this construction
is when the monitor is a simulator of the original system, so that, given the same
inputs, the monitor and the original Petri net are ideally in the same state; when
this is not the case, a fault is detected. The main disadvantage of this approach
is that it requires access to all activity in the original Petri net and it cannot
easily handle multiple faults (e.g., in distributed DES) and information that is
incorrect or missing. What is studied in this section is an alternative that uses
monitors of reduced sizes and is able to overcome some of these limitations.
DEFINITION 8.1 A separate redundant implementation for Petri net S [with d places, u transitions, marking $q_s[\cdot]$ and state evolution as in Eq. (8.1)] is a Petri net H (with $\eta \equiv d + s$ places, s > 0, and u transitions) that has state evolution
$$q_h[t+1] = q_h[t] + \mathcal{B}^+ x[t] - \mathcal{B}^- x[t] = q_h[t] + \begin{bmatrix} B^+ \\ X^+ \end{bmatrix} x[t] - \begin{bmatrix} B^- \\ X^- \end{bmatrix} x[t] \qquad (8.3)$$
and whose state is given by
$$q_h[t] = \underbrace{\begin{bmatrix} I_d \\ C \end{bmatrix}}_{G}\, q_s[t]$$
for all time instants t. It is required that for any initial marking (state) $q_s[0]$ of S, Petri net H (with initial state $q_h[0] = G q_s[0]$) admits all firing transition sequences that are allowed in S (under initial state $q_s[0]$).
Note that the functionality of Petri net S remains intact within the separate redundant Petri net implementation H. Since all valid states $q_h[t]$ in H have to lie within the column space of the encoding matrix G, there exists a parity check matrix $P = [-C \;\; I_s]$ such that $P q_h[t] = 0$ for all t (at least under fault-free conditions). Since H is a Petri net, matrices $X^+$ and $X^-$, and state $q_h[t]$ (for all t) have nonnegative integer entries. The following theorem characterizes separate redundant Petri net implementations.
THEOREM 8.1 Consider the setting described above. Petri net H is a separate redundant implementation of Petri net S if and only if C is a matrix with nonnegative integer entries and
$$X^+ = CB^+ - D\,, \qquad X^- = CB^- - D\,,$$
where D is any s × u matrix with nonnegative integer entries such that $D \le \mathrm{MIN}(CB^+, CB^-)$ (operations ≤ and MIN are taken element-wise).
Proof: (⇒) The state $q_h[0] = G q_s[0] = \begin{bmatrix} I_d \\ C \end{bmatrix} q_s[0]$ has nonnegative integer entries for all valid $q_s[0]$ (a valid $q_s[0]$ is any marking with nonnegative integer entries). For this to be true, a necessary (and sufficient) condition is that C is a matrix with nonnegative integer entries.
If the state evolution of the redundant Petri net in Eq. (8.3) is combined with the state evolution of the original Petri net in Eq. (8.1), one obtains
$$G q_s[t+1] = q_h[t+1] = \begin{bmatrix} I_d \\ C \end{bmatrix} q_s[t+1] = q_h[t] + \begin{bmatrix} B^+ \\ X^+ \end{bmatrix} x[t] - \begin{bmatrix} B^- \\ X^- \end{bmatrix} x[t] = \begin{bmatrix} I_d \\ C \end{bmatrix} q_s[t] + \begin{bmatrix} B^+ \\ X^+ \end{bmatrix} x[t] - \begin{bmatrix} B^- \\ X^- \end{bmatrix} x[t]\,.$$
Since any transition tj can be enabled [e.g., by choosing $q_s[0] \ge B^-(:,j)$], one concludes that
$$X^+ - X^- = C(B^+ - B^-)\,.$$
Without loss of generality $X^+$ can be set to $CB^+ - D$ and $X^-$ to $CB^- - D$ for some matrix D with integer entries. Petri net H has initial marking
$$q_h[0] = \begin{bmatrix} q_s[0] \\ C q_s[0] \end{bmatrix},$$
where $q_s[0]$ is any initial state for S. In order for H to admit all firing transition sequences that are allowed in S under initial state $q_s[0]$, one needs D to have nonnegative integer entries. This can be proved by contradiction: suppose D has a negative entry in its jth column; if $q_s[0] = B^-(:,j)$, transition tj can be fired in S but cannot be fired in H because
$$C q_s[0] = CB^-(:,j) < CB^-(:,j) - D(:,j) = X^-(:,j)\,.$$
The requirement that $D \le \mathrm{MIN}(CB^+, CB^-)$ follows from $X^+$ and $X^-$ being matrices with nonnegative integer entries.
(⇐) The converse direction follows easily. The only challenge is to show that if D has nonnegative entries, all transitions that are enabled in S at time instant t under state $q_s[t]$ are also enabled in H under state $q_h[t] = G q_s[t]$. To show this, note that if D has nonnegative entries, then,
$$q_s[t] \ge B^-(:,j) \;\Rightarrow\; G q_s[t] \ge G B^-(:,j) \;\Rightarrow\; q_h[t] \ge G B^-(:,j) \;\Rightarrow\; q_h[t] \ge G B^-(:,j) - \begin{bmatrix} 0 \\ D(:,j) \end{bmatrix} \;\Rightarrow\; q_h[t] \ge \begin{bmatrix} B^-(:,j) \\ X^-(:,j) \end{bmatrix}.$$
(This is because matrices G, $B^+$, $B^-$ and D have nonnegative integer entries.) One concludes that, if transition tj is enabled in S, then, it is also enabled in H [i.e., if $q_s[t] \ge B^-(:,j)$, then $q_h[t] \ge \begin{bmatrix} B^-(:,j) \\ X^-(:,j) \end{bmatrix}$]. □
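The construction of Theorem 8.1 can be sketched directly: stack $X^+ = CB^+ - D$ and $X^- = CB^- - D$ under the original matrices after checking the theorem's conditions. The two-place net and the choices of C and D below are hypothetical, made up for the illustration.

```python
import numpy as np

def separate_implementation(Bplus, Bminus, C, D):
    """Build the stacked matrices of a separate redundant implementation
    per Theorem 8.1, enforcing C >= 0 and 0 <= D <= MIN(CB+, CB-)."""
    upper = np.minimum(C @ Bplus, C @ Bminus)
    assert np.all(C >= 0) and np.all(D >= 0) and np.all(D <= upper), \
        "Theorem 8.1's conditions violated"
    return (np.vstack([Bplus, C @ Bplus - D]),
            np.vstack([Bminus, C @ Bminus - D]))

# hypothetical two-place, two-transition net with one checkpoint place
Bplus = np.array([[0, 1], [1, 0]])
Bminus = np.array([[1, 0], [0, 1]])
C = np.array([[1, 1]])          # s = 1, nonnegative integers
D = np.array([[0, 1]])          # distinct columns, D <= MIN(CB+, CB-)
Bh_plus, Bh_minus = separate_implementation(Bplus, Bminus, C, D)
print(Bh_plus[-1], Bh_minus[-1])  # -> [1 0] [1 0]
```

The last rows are the checkpoint-place updates $X^+$ and $X^-$; by construction the invariant $[-C \;\; I_s]\,q_h[t] = 0$ holds under fault-free operation.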
4.2 FAULT DETECTION AND IDENTIFICATION

The separate redundant implementations described in Theorem 8.1 can be used to monitor faults in the original Petri net [with state evolution as in Eq. (8.1)]. The invariant conditions imposed by a separate implementation can be checked by verifying that $[-C \;\; I_s]\, q_h[t]$ is equal to zero. The s additional places in H function as checkpoint places and can either be distributed in the Petri net system or be part of a centralized monitor.¹
Transition Faults: Suppose that at time instant t-1 transition tj fires (that is, $x[t-1] = x_j$). If, due to a fault, the postconditions of transition tj are not executed, the erroneous state at time instant t will be
$$q_f[t] = q_h[t] - \begin{bmatrix} B^+ \\ CB^+ - D \end{bmatrix} x_j\,,$$
where $q_h[t]$ is the state that would have been reached under fault-free conditions. The error syndrome can be calculated to be
$$P q_f[t] = P\left(q_h[t] - \begin{bmatrix} B^+ \\ CB^+ - D \end{bmatrix} x_j\right) = P q_h[t] - P \begin{bmatrix} B^+ \\ CB^+ - D \end{bmatrix} x_j = 0 - [-C \;\; I_s]\begin{bmatrix} B^+ \\ CB^+ - D \end{bmatrix} x_j = -(-CB^+ + CB^+ - D)\,x_j = D x_j \equiv D(:,j)\,.$$
If the preconditions of transition tj are not executed, the erroneous state will be
$$q_f[t] = q_h[t] + \begin{bmatrix} B^- \\ CB^- - D \end{bmatrix} x_j\,,$$
and the error syndrome can be calculated similarly as
$$P q_f[t] = -D x_j \equiv -D(:,j)\,.$$
If all columns of D are distinct, one can detect and identify all single transition faults. In addition, depending on the sign, one can determine whether
preconditions or postconditions were not executed. Of course, given enough
redundancy, one may be able to also identify multiple transition faults. (For
example, if the effect of multiple transitions is additive, their occurrence could
be identified if the columns of D were linearly independent.)
EXAMPLE 8.3 Consider the Petri net in Figure 8.1 with the indicated $B^+$ and $B^-$ matrices. To concurrently detect and identify transition faults, a separate redundant implementation with one additional place will be used (s = 1). If D is set to [3 2 1] and C is set to [2 2 1], one obtains the separate redundant Petri net implementation of Figure 8.5 (the additional connections are shown with dotted lines). Since the columns of matrix D are distinct, identification of single transition faults is possible (the choice of C does not affect the syndromes of transition faults).

Matrices $\begin{bmatrix} B^+ \\ X^+ \end{bmatrix}$ and $\begin{bmatrix} B^- \\ X^- \end{bmatrix}$ are given by
$$\begin{bmatrix} B^+ \\ CB^+ - D \end{bmatrix} = \begin{bmatrix} 0&1&1 \\ 1&0&0 \\ 1&0&0 \\ 0&0&1 \end{bmatrix}, \qquad \begin{bmatrix} B^- \\ CB^- - D \end{bmatrix} = \begin{bmatrix} 2&0&0 \\ 0&1&0 \\ 0&0&1 \\ 1&0&0 \end{bmatrix}.$$
The parity check that is performed concurrently by the checking mechanism (not shown in the figure) is given by
$$[-C \;\; I_1]\, q_h[t] = [-2 \;\; {-2} \;\; {-1} \;\; 1]\, q_h[t]\,.$$
If the parity check is -3 (respectively -2, -1), then, transition t1 (respectively t2, t3) has failed to perform its preconditions. If the parity check is 3 (respectively 2, 1), then, transition t1 (respectively t2, t3) has failed to perform its postconditions.
The additional place p4 is part of the monitoring mechanism: it receives information about the activity in the original Petri net (e.g., which transitions fire)
and appropriately updates its tokens. The linear checker detects and identifies
faults by evaluating a checksum on the state of the overall (redundant) system.
Note that the number of tokens in place p4 is updated correctly regardless of the
activity of transition t2 (which has no connections to p4). More generally, explicit
connections from each transition to the monitoring mechanism may not be required.

CODING APPROACHES TO FAULT TOLERANCE

Figure 8.5. Example of a separate redundant Petri net implementation that identifies single
transition faults in the Petri net of Figure 8.1.
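The parity computations of this example can be sketched numerically. The B+ and B− below are the token-flow matrices assumed here for the small net of Figure 8.1; the C and D are those of Example 8.3:

```python
import numpy as np

# Token-flow matrices assumed for the 3-place, 3-transition net of Figure 8.1.
Bplus  = np.array([[0, 1, 1],
                   [1, 0, 0],
                   [1, 0, 0]])
Bminus = np.array([[2, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1]])

C = np.array([[2, 2, 1]])   # encoding for the one additional place (s = 1)
D = np.array([[3, 2, 1]])

# Separate redundant implementation: stack the extra place's connections.
Bp_r = np.vstack([Bplus,  C @ Bplus  - D])
Bm_r = np.vstack([Bminus, C @ Bminus - D])
P = np.hstack([-C, np.eye(1, dtype=int)])   # parity check matrix [-C  I]

def syndrome(q):
    return int((P @ q).item())

# Properly initialized state: p4 holds C @ qs tokens.
q0 = np.array([2, 0, 0, 4])            # satisfies P q0 = 0
q1 = q0 + Bp_r[:, 0] - Bm_r[:, 0]      # t1 fires correctly
assert syndrome(q1) == 0

# t1 consumes its input tokens but fails its postconditions:
q_bad = q1 - Bp_r[:, 0]
print(syndrome(q_bad))   # 3 -> t1 failed its postconditions
```

A precondition fault (tokens not removed) gives the opposite sign, e.g. `syndrome(q1 + Bm_r[:, 0])` evaluates to −3.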
Place Faults: If, due to a fault, the number of tokens in place pi is increased
by c, the erroneous state will be given by

    qf[t] = qh[t] + e_pi ,

where e_pi is an η-dimensional array with a unique nonzero entry at its ith
position, i.e.,

    e_pi = c × [ 0 ··· 0 1 0 ··· 0 ]ᵀ .

In this case, the parity check will be

    P qf[t] = P (qh[t] + e_pi)
            = P qh[t] + P e_pi
            = 0 + P e_pi
            = c × P(:, i) .
Coding Approaches for Fault Detection and Identification
Figure 8.6. Example of a separate redundant Petri net implementation that identifies single
place faults in the Petri net of Figure 8.1.
If one chooses C so that the columns of P ≡ [ −C  Is ] are not rational multiples
of each other, then one can detect and identify single place faults.²
EXAMPLE 8.4 In order to concurrently detect and identify single place faults
in the Petri net of Figure 8.1, two additional places will be used (s = 2).
Matrix C will be chosen to be

    C = [ 1 2 1 ]
        [ 2 1 1 ]

(so that the columns of the parity check matrix P = [ −C  I2 ] are not multiples
of each other); the choice of D is not critical in the identification of place faults
and in this case D is set to be

    D = [ 2 1 1 ]
        [ 2 1 1 ] .

With these choices, one obtains the separate redundant
implementation shown in Figure 8.6. Matrices B̃+ and B̃− are given by

    B̃+ = [ B+       ]   [ 0 1 1 ]        B̃− = [ B−       ]   [ 2 0 0 ]
         [ CB+ − D  ] = [ 1 0 0 ] ,           [ CB− − D  ] = [ 0 1 0 ] .
                        [ 1 0 0 ]                            [ 0 0 1 ]
                        [ 1 0 0 ]                            [ 0 1 0 ]
                        [ 0 1 1 ]                            [ 2 0 0 ]

The parity check is performed through

    [ −C  I2 ] qh[t] = [ −1 −2 −1 1 0 ] qh[t] .
                       [ −2 −1 −1 0 1 ]

If the result is a nonzero multiple of [ −1 ; −2 ] (respectively [ −2 ; −1 ],
[ −1 ; −1 ], [ 1 ; 0 ], [ 0 ; 1 ]), then the number of tokens in place p1
(respectively p2, p3, p4, p5) has been corrupted.
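Place-fault identification amounts to testing whether the observed syndrome is an integer multiple of some column of P. A minimal sketch follows, using a 2 × 3 encoding matrix C of the shape used in Example 8.4 (the exact entries here are assumptions made for illustration):

```python
import numpy as np

# Assumed 2 x 3 encoding matrix (s = 2 additional places).
C = np.array([[1, 2, 1],
              [2, 1, 1]])
P = np.hstack([-C, np.eye(2, dtype=int)])   # parity check matrix [-C  I2]

def identify_place_fault(s):
    """Return (place index, token change) if syndrome s is a nonzero integer
    multiple of exactly one column of P, else None."""
    hits = []
    for i in range(P.shape[1]):
        col = P[:, i]
        # s parallel to col (2-D cross product is zero) and s nonzero
        if s[0] * col[1] == s[1] * col[0] and np.any(s != 0):
            nz = np.flatnonzero(col)[0]
            ratio = s[nz] // col[nz]
            if s[nz] % col[nz] == 0 and np.array_equal(ratio * col, s):
                hits.append((i, int(ratio)))
    return hits[0] if len(hits) == 1 else None

# Two spurious tokens appear in place p2: syndrome = 2 * P(:,2).
print(identify_place_fault(2 * P[:, 1]))   # (1, 2) -> place p2, +2 tokens
```

Identification is unambiguous precisely because no two columns of this P are rational multiples of each other, which is the design condition stated in the text.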
Through proper choice of C and D, one can perform detection and identification of both place and transition faults. Note that matrices C and D can
be chosen almost independently. The only constraint between the two is that
D ≤ MIN(CB+, CB−). (This constraint can sometimes be relaxed by multiplying matrix C by a large enough integer constant so that the possibilities for
D are increased.³) The following example illustrates how one can detect and
identify both place and transition faults.
EXAMPLE 8.5 Identification of a single transition fault or a single place fault
(but not of both occurring together) can be achieved in the Petri net of Figure 8.1
by using two additional places (s = 2) and by choosing 2 × 3 matrices C and
D so that the columns of D are distinct and the columns of P = [ −C  I2 ] are
pairwise not rational multiples of each other. With these choices, matrices B̃+
and B̃− are given by

    B̃+ = [ B+       ]   [ 0 1 1 ]        B̃− = [ B−       ]   [ 2 0 0 ]
         [ CB+ − D  ] = [ 1 0 0 ] ,           [ CB− − D  ] = [ 0 1 0 ] .
                        [ 1 0 0 ]                            [ 0 0 1 ]
                        [ 0 1 0 ]                            [ 1 0 0 ]
                        [ 2 1 1 ]                            [ 0 2 2 ]

The parity check is performed through P qh[t] = [ −C  I2 ] qh[t].

Figure 8.7. Example of a separate redundant Petri net implementation that identifies single
transition or single place faults in the Petri net of Figure 8.1.
If the parity check is a nonzero multiple of the ith column of P, then there is a
place fault in place pi. If the parity check equals the jth column of D, then
transition tj has failed to perform its postconditions; if it equals the negative of
the jth column of D, then transition tj has failed to perform its preconditions.
The resulting redundant Petri net implementation is shown in Figure 8.7.
It is instructive to consider the interpretation of the monitoring schemes
shown in Figures 8.5, 8.6 and 8.7:
(i) The s additional places, which could be part of a centralized monitor or
could be distributed in the system, are connected to the transitions of the
original Petri net, and the tokens associated with the additional connections
and places act as simple acknowledgment messages.
(ii) The weights of the additional connections are given by matrices CB+ − D
and CB− − D.
(iii) The choice of matrix C specifies detection and identification for place faults,
whereas the choice of D determines detection and identification for transition faults. Coding techniques or simple linear algebra can be used to guide
the choices of C and D.
(iv) The checking mechanism (not shown in any of the figures in Examples 8.3,
8.4 and 8.5) detects and identifies faults by evaluating a linear checksum
on the state of the original Petri net and the added monitor. The implicit
assumption is that this checksum mechanism is fault-free.
Given fault detection and identification requirements, one has a variety of
choices for matrices C and D. Therefore, depending on the underlying system, one could try to optimize certain variables of interest, such as the size of
the monitor memory (number of additional places), the number of additional
connections (from the original system to the additional places), or the number
of tokens involved.
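One way to explore this design freedom is a small brute-force search over candidate C entries, keeping only choices that satisfy the constraint D ≤ MIN(CB+, CB−) and give distinct nonzero columns of D (so that single transition faults can be identified). The sketch below uses the token-flow matrices assumed here for the net of Figure 8.1 and, for each C, takes the largest admissible D:

```python
import itertools
import numpy as np

# Incidence matrices assumed for the net of Figure 8.1.
Bplus  = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
Bminus = np.array([[2, 0, 0], [0, 1, 0], [0, 0, 1]])

def valid(C, D):
    """Design rules for a separate implementation with s = 1:
    0 <= D <= MIN(CB+, CB-), and distinct nonzero columns of D."""
    bound = np.minimum(C @ Bplus, C @ Bminus)
    if np.any(D < 0) or np.any(D > bound):
        return False
    cols = [tuple(D[:, j]) for j in range(D.shape[1])]
    return len(set(cols)) == len(cols) and all(any(c) for c in cols)

# Brute-force search over small nonnegative integer entries of C.
found = []
for c in itertools.product(range(3), repeat=3):
    C = np.array([c])
    D = np.minimum(C @ Bplus, C @ Bminus).copy()   # largest admissible D
    if valid(C, D):
        found.append((c, tuple(int(d) for d in D.ravel())))

print(found)
```

For this net the search recovers, among others, the pair C = [2 2 1], D = [3 2 1] used in Example 8.3; a cost function (tokens, connections, places) could be evaluated over `found` to pick an optimized design.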
Note that, when restricted to pure Petri nets, one has no choice for D. More
specifically, since the resulting Petri net has to be pure, matrix D has to be
chosen so that D = MIN(CB+, CB−). The ability to detect transition faults
may be lost in such cases. The work in [Sifakis, 1979; Silva and Velilla, 1985]
studied this approach in pure Petri nets: given a pure Petri net S as in Eq. (8.2),
one can construct a pure Petri net embedding with state evolution

    qh[t + 1] = qh[t] + [ B ; CB ] x[t]

for an s × d matrix C with nonnegative integer entries. The distance measure
adopted in [Sifakis, 1979] suggests that the redundant Petri net should guard
against place faults (corruption of the number of tokens in individual places).
5
NON-SEPARATE MONITORING SCHEMES
5.1
NON-SEPARATE REDUNDANT PETRI NET
IMPLEMENTATIONS
In the monitoring scheme of Figure 8.8, the state of a redundant Petri net
implementation is a non-separate encoding of the state of the original Petri
net (i.e., an encoding scheme that does not immediately yield the state of the
original system). As in the case of separate redundant implementations, the
redundancy in the state of a non-separate redundant system will result in fault
detection and identification schemes that operate by analyzing violations of
the imposed state restrictions [Hadjicostis, 1999]. Notice that non-separate
redundant implementations can only be used when the designer has flexibility
in re-arranging the structure of the original DES.
DEFINITION 8.2 Let S be a Petri net with d places, u transitions and state
evolution as in Eq. (8.1); let qs[0] be any initial state qs[0] ≥ 0 and let X =
{x[0], x[1], ...} be any admissible firing transition sequence under this initial
state.

A Petri net H with η ≡ d + s places (where s > 0), u transitions, initial state
qh[0] and state evolution equation

    qh[t + 1] = qh[t] + B̃+x[t] − B̃−x[t]
              = qh[t] + (B̃+ − B̃−)x[t]                    (8.4)

is a non-separate redundant implementation for S if it concurrently simulates
S in the following sense: there exist

1. a state decoding mapping ℓ, and
2. a state encoding mapping g,

such that for any initial state qs[0] in S and any admissible firing sequence X
(for qs[0]),

    qs[t] = ℓ(qh[t])

for all time instants t ≥ 0.

Figure 8.8. Concurrent monitoring scheme using a non-separate Petri net implementation.
The non-separate redundant implementation H defined above is a Petri net
that, after proper initialization [i.e., qh[0] = g(qs[0])], admits any firing transition sequence X that is admissible by the original Petri net S under initial
state qs[0]. The state of the original Petri net at time instant t is specified by
the state of the redundant implementation and vice-versa (through mappings ℓ
and g). Note that, regardless of the initial state qs[0] and the firing sequence X,
the state qh[t] of the redundant implementation always lies in a subset of the
redundant state space (namely the image of qs[·] under the mapping g).
The rest of this section focuses on a special class of non-separate redundant
implementations, where encoding and decoding can be performed through linear operations. Specifically, a d × η decoding matrix L and an η × d encoding
matrix G exist such that, under any initial state qs[0] and any admissible firing
transition sequence X = {x[0], x[1], ...},

    qs[t] = L qh[t] ,   qh[t] = G qs[t]

for all time instants t ≥ 0. The state evolution equation of a non-separate
redundant Petri net implementation is then given by

    qh[t + 1] = qh[t] + B̃+x[t] − B̃−x[t]                  (8.5)
              = qh[t] + B̃x[t] ,                           (8.6)

where B̃ ≡ B̃+ − B̃−. The additional structure that is enforced through the non-separate redundant Petri net implementation can be used for fault detection
and identification. In order to systematically construct redundant implementations, one needs to have a starting point. The following theorem characterizes
non-separate redundant Petri net implementations in terms of a similarity transformation and a standard redundant Petri net.
THEOREM 8.2 A Petri net H with η ≡ d + s (s > 0) places, u transitions and
state evolution as in Eqs. (8.5) and (8.6) is a redundant Petri net implementation
for S [with state evolution as in Eqs. (8.1) and (8.2)] only if it is similar (in the
usual sense of change of basis in the state space, see Chapter 5) to a standard
redundant Petri net implementation Hσ whose state evolution equation is given
by

    qσ[t + 1] = qσ[t] + [ B+ ; 0 ] x[t] − [ B− ; 0 ] x[t]
              = qσ[t] + [ B ; 0 ] x[t] .                  (8.7)

Here, B+, B− and B ≡ B+ − B− are the matrices in Eqs. (8.1) and (8.2).
Associated with the standard redundant Petri net implementation are the standard
decoding matrix Lσ and the standard encoding matrix Gσ given by

    Lσ = [ Id  0 ] ,   Gσ = [ Id ; 0 ] .

Note that the standard redundant Petri net implementation is a pure Petri net.
Proof: Under fault-free conditions, LGqs[·] = Lqh[·] = qs[·]. Since the initial state qs[0] can be any array with nonnegative integer entries, one concludes
that LG = Id. In particular, L has full row rank, G has full column rank, and there
exists an η × η invertible matrix T such that LT⁻¹ = [ Id  0 ] and TG = [ Id ; 0 ]. By
employing the similarity transformation q′[t] = Tqh[t], one obtains a similar
system H′ whose state evolution is given by

    q′[t + 1] = q′[t] + TB̃+x[t] − TB̃−x[t] ,

and whose decoding and encoding matrices are given by

    L′ = LT⁻¹ = [ Id  0 ] ,   G′ = TG = [ Id ; 0 ] .

The state q′[t] of system H′ at any time instant t is of the form

    q′[t] = TGqs[t] = [ qs[t] ; 0 ] ;

by combining the state evolution equations of the original Petri net and the
redundant system, and writing T(B̃+ − B̃−) = [ B1 ; B2 ], it is seen that

    [ qs[t] + Bx[t] ; 0 ] = [ qs[t] ; 0 ] + [ B1 ; B2 ] x[t] .

The above equations hold for all initial conditions qs[0]; since all transitions
are enabled under some appropriate initial condition qs[0], one concludes that
B1 = B and B2 = 0.

If system H′ is regarded as a pure Petri net, one sees that any transition
enabled in S is also enabled in H′. Therefore, H′ is a redundant Petri net
implementation. In fact, it is the standard redundant Petri net implementation
Hσ with the decoding and encoding matrices presented in the theorem. The
invariant conditions that are imposed by the added redundancy on the standard
Petri net Hσ are summarized in the transformed coordinates by the parity check
Pσqσ[·], where Pσ = [ 0  Is ] is the parity check matrix.                    □
Theorem 8.2 provides a characterization of the class of non-separate redundant Petri net implementations for the given Petri net S and is a convenient
starting point for systematically constructing such implementations. The following theorem completes this point of view.
THEOREM 8.3 Let S be a Petri net with d places, u transitions and state
evolution as given in Eqs. (8.1) and (8.2). A Petri net H with η ≡ d + s (s > 0)
places, u transitions and state evolution as in Eqs. (8.5) and (8.6) is a redundant
Petri net implementation of S if:

1. It is similar to a standard redundant Petri net implementation Hσ [with state
evolution equation as in Eq. (8.7)] through an η × η invertible matrix T, the
first d columns of whose inverse consist of nonnegative integer entries. The
encoding, decoding and parity check matrices of the Petri net implementation
H are then given by

    L = [ Id  0 ] T ,   G = T⁻¹ [ Id ; 0 ] ,   P = [ 0  Is ] T .

2. Matrices B̃+ and B̃− are given by

    B̃+ = T⁻¹ [ B+ ; 0 ] − V = GB+ − V ,
    B̃− = T⁻¹ [ B− ; 0 ] − V = GB− − V ,

where V is an η × u matrix with nonnegative integer entries. Note that V
has to be chosen so that the entries of B̃+ and B̃− are nonnegative, i.e.,
V ≤ MIN(GB+, GB−).
Proof: From Theorem 8.2, it is clear that any non-separate redundant Petri
net implementation H as in Eqs. (8.5) and (8.6) can be obtained through an
appropriate similarity transformation Tqh[t] = qσ[t] of the standard redundant
implementation Hσ in Eq. (8.7). In the process of constructing H from Hσ,
one needs to ensure that H is a valid redundant Petri net implementation of S,
i.e., one that meets the following requirements:

1. Given any initial condition qs[0] (i.e., given a d-dimensional array with
nonnegative integer entries), the marking

    qh[0] = Gqs[0] = T⁻¹ [ qs[0] ; 0 ]

has nonnegative integer entries.

2. Matrices B̃+ and B̃− have nonnegative integer entries.

3. The set of transitions enabled in S at any time instant t is a subset of the
set of transitions enabled in H (so that, under any initial condition qs[0], a
firing transition sequence X that is admissible in S is also admissible in H).

The first condition has to be satisfied for any array qs[0] with nonnegative
integer entries. It is therefore necessary and sufficient that the first d columns
of T⁻¹ have nonnegative integer entries. This also ensures that the matrix
difference

    B̃+ − B̃− = T⁻¹ [ B+ − B− ; 0 ] = T⁻¹ [ Id ; 0 ] (B+ − B−) = G(B+ − B−)

consists of integer entries. Without loss of generality,

    B̃+ = GB+ − V ,   B̃− = GB− − V ,

where the entries of V are integers chosen so that B̃+ and B̃− have nonnegative
entries (i.e., so that V ≤ GB+ and V ≤ GB−).

To check the third condition, notice that tj is enabled in the original Petri
net S at time instant t if and only if qs[t] ≥ B−(:, j). If V has nonnegative
entries, then

    qs[t] ≥ B−xj ⟹ Gqs[t] ≥ GB−xj
                 ⟹ qh[t] ≥ GB−xj
                 ⟹ qh[t] ≥ (GB− − V)xj
                 ⟹ qh[t] ≥ B̃−xj ,

where B−(:, j) ≡ B−xj (recall that qs[t], B−, G and V have nonnegative
integer entries). Therefore, if transition tj is enabled in the original Petri net S, it
is also enabled in H [transition tj is enabled in H if and only if qh[t] ≥ B̃−(:, j)].
It is not hard to see that it is also necessary for V to have nonnegative integer
entries (otherwise one can find a counterexample by appropriately choosing the
initial condition qs[0]).                                                    □
The following lemma is derived from Theorem 8.3 and simplifies the construction of non-separate redundant Petri net implementations.

LEMMA 8.1 Let S be a Petri net with d places, u transitions and state evolution
as given in Eqs. (8.1) and (8.2). A Petri net H with η ≡ d + s (s > 0) places,
u transitions and state evolution as in Eqs. (8.5) and (8.6) is a non-separate
redundant implementation of S if matrices B̃+ and B̃− have nonnegative integer
entries given by

    B̃+ = GB+ − V ,
    B̃− = GB− − V ,

where G is a full-column rank η × d matrix with nonnegative integer entries
and V is an η × u matrix with nonnegative integer entries.
In cases where one has the flexibility to restructure the original Petri net, non-separate redundant Petri net implementations could offer potential advantages
(e.g., they could use fewer tokens, connections, or places than separate implementations of the same order).
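The construction of Lemma 8.1 can be exercised numerically. The G, V and left inverse L below are illustrative choices (consistent with the small net assumed here for Figure 8.1, with V taken as large as the lemma allows); the simulation checks that every firing sequence of the original net is admissible in the redundant net and that decoding always recovers the original marking:

```python
import numpy as np

rng = np.random.default_rng(0)

# Original net (Figure 8.1, as assumed here).
Bplus  = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
Bminus = np.array([[2, 0, 0], [0, 1, 0], [0, 0, 1]])

# Illustrative choices: full-column-rank nonnegative G, nonnegative V with
# V <= MIN(G B+, G B-) (Lemma 8.1), and a left inverse L of G.
G = np.array([[1, 1, 0], [1, 0, 1], [0, 2, 1], [1, 1, 1]])
V = np.minimum(G @ Bplus, G @ Bminus)
L = np.array([[1, 1, 0, -1], [1, 1, 1, -2], [-3, -4, -2, 7]])
assert np.array_equal(L @ G, np.eye(3, dtype=int))

Bp_r = G @ Bplus  - V      # redundant implementation's token-flow matrices
Bm_r = G @ Bminus - V

qs = np.array([2, 0, 0])   # initial marking of the original net
qh = G @ qs                # encoded initial marking

for _ in range(50):
    enabled = [j for j in range(3) if np.all(qs >= Bminus[:, j])]
    if not enabled:
        break
    j = rng.choice(enabled)
    # Any transition enabled in S is also enabled in the redundant net.
    assert np.all(qh >= Bm_r[:, j])
    qs = qs + Bplus[:, j] - Bminus[:, j]
    qh = qh + Bp_r[:, j] - Bm_r[:, j]
    # Decoding always recovers the original marking.
    assert np.array_equal(L @ qh, qs)
```

Taking V = MIN(GB+, GB−) minimizes the number of additional connections; smaller V would keep more explicit acknowledgment arcs.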
5.2
FAULT DETECTION AND IDENTIFICATION
The invariant conditions imposed by the non-separate redundant Petri net implementations in Theorem 8.3 can be checked through the parity check matrix P = PσT =
[ 0  Is ] T. The following analysis of the fault detection and identification
procedures is close to the development in Section 4.2.
Transition Faults: Suppose a non-separate redundant Petri net implementation
is used to detect and identify transition faults. If transition tj fires at time instant
t − 1 (i.e., x[t − 1] = xj) but fails to execute its postconditions, the erroneous
state will be

    qf[t] = qh[t] − (GB+ − V) xj ,

where qh[t] is the state the Petri net would be in under fault-free conditions.
The error syndrome can be calculated to be

    Pqf[t] = P{qh[t] − (GB+ − V)xj}
           = 0 − P(GB+ − V)xj
           = −(PGB+ − PV)xj
           = PVxj ,

since PG = [ 0  Is ] T T⁻¹ [ Id ; 0 ] = 0. If the preconditions of transition tj are
not executed, the erroneous state will be

    qf[t] = qh[t] + B̃−(:, j)
          = qh[t] + (GB− − V) xj

and the error syndrome can be calculated similarly to be Pqf[t] = −PVxj.
If the columns of matrix PV are distinct, one can detect and identify all single
transition faults. Depending on the sign, one can decide whether postconditions
or preconditions were not executed. Note that, unlike the separate case, the
syndromes in the non-separate case are linear combinations of columns of V.

Figure 8.9. Example of a non-separate redundant Petri net implementation that identifies single
transition faults in the Petri net of Figure 8.1.
EXAMPLE 8.6 The Petri net in Figure 8.9 is a non-separate redundant implementation of the Petri net in Figure 8.1. The additional place p4 is disconnected
from the rest of the network and can be treated as a constant. The scheme can
detect and identify single transition faults.
The transformation matrix T⁻¹ and the matrix V that were used to obtain
the non-separate implementation of Figure 8.9 were as follows:

    T⁻¹ = [ 1 1 0 −1 ]        V = [ 1 1 0 ]
          [ 1 0 1  2 ] ,          [ 1 0 1 ] .
          [ 0 2 1  1 ]            [ 0 0 0 ]
          [ 1 1 1  1 ]            [ 2 1 1 ]

They result in the following matrices B̃+ = GB+ − V and B̃− = GB− − V:

    B̃+ = [ 0 0 1 ]        B̃− = [ 1 0 0 ]
         [ 0 1 0 ] ,           [ 1 0 0 ] .
         [ 3 0 0 ]             [ 0 2 1 ]
         [ 0 0 0 ]             [ 0 0 0 ]
The decoding matrix L = LσT and the parity check matrix P = PσT are
given by

    L = [  1  1  0 −1 ]
        [  1  1  1 −2 ] ,   P = [ 1 2 1 −3 ] .
        [ −3 −4 −2  7 ]

If the parity check Pqh[t] equals −3 (respectively −2, −1), then transition
t1 (respectively t2, t3) has failed to execute its postconditions. If the check is 3
(respectively 2, 1), then transition t1 (respectively t2, t3) has failed to execute
its preconditions.
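The syndrome decoding of Example 8.6 can be sketched numerically; the B+, B−, G and V values below are the ones assumed here for that example, while P = [1 2 1 −3] is the parity check matrix quoted in the text:

```python
import numpy as np

# Values assumed for Example 8.6 (net of Figure 8.1, encoding G, slack V).
Bplus  = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
Bminus = np.array([[2, 0, 0], [0, 1, 0], [0, 0, 1]])
G = np.array([[1, 1, 0], [1, 0, 1], [0, 2, 1], [1, 1, 1]])
V = np.minimum(G @ Bplus, G @ Bminus)
P = np.array([[1, 2, 1, -3]])            # parity check matrix of Example 8.6

Bp_r = G @ Bplus  - V
Bm_r = G @ Bminus - V

assert np.array_equal(P @ G, np.zeros((1, 3)))   # P annihilates valid states
assert (P @ V).ravel().tolist() == [-3, -2, -1]  # transition-fault syndromes

qh = G @ np.array([2, 0, 0])     # encoded fault-free marking
# t1 fires but fails its postconditions: input tokens are removed,
# output tokens never appear; syndrome = (P V)(:,1).
q_bad = qh - Bm_r[:, 0]
print(int((P @ q_bad).item()))   # -3 -> t1 failed its postconditions
```

The distinct entries of PV are exactly the syndrome values −3, −2, −1 listed in the example, with the opposite signs flagging precondition faults.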
Place Faults: Suppose one uses a non-separate redundant Petri net implementation to protect against place faults. If, due to a fault, the number of tokens in
place pi is increased by c, the erroneous state will be given by

    qf[t] = qh[t] + e_pi ,

where e_pi is an η-dimensional array with a unique nonzero entry at its ith
position:

    e_pi = c × [ 0 ··· 0 1 0 ··· 0 ]ᵀ .

The parity check will then be

    Pqf[t] = Pqh[t] + Pe_pi
           = 0 + Pe_pi
           = c × P(:, i) .

Single place faults can be detected if all columns of matrix P ≡ [ 0  Is ] T
are nonzero. If the columns of P are not rational multiples of each other, then
single place faults can be detected and identified.
EXAMPLE 8.7 Figure 8.10 shows a non-separate redundant implementation
of the Petri net in Figure 8.1. The implementation uses two additional places
(s = 2) and is able to identify single place faults. Note that place p4 essentially
acts as a constant.

Figure 8.10. Example of a non-separate redundant Petri net implementation that identifies
single place faults in the Petri net of Figure 8.1.
The implementation is obtained from a 5 × 5 transformation matrix T⁻¹
(the first three columns of which have nonnegative integer entries) and a 5 × 3
matrix V with nonnegative integer entries, which yield the matrices B̃+ =
GB+ − V and B̃− = GB− − V of Figure 8.10. The parity check matrix is

    P = [ 0  I2 ] T ,

whose columns are pairwise not rational multiples of each other, so that single
place faults can be detected and identified.
Note that the syndromes for transition and place faults in non-separate Petri
net embeddings are more complicated than the syndromes in separate embeddings. At the same time, however, some additional flexibility is available and
can potentially be used to construct embeddings that maintain the desired monitoring capabilities while minimizing certain quantities of interest (such as tokens, connections or places).
6
APPLICATIONS IN CONTROL
Discrete event systems (DES's) are usually monitored through separate mechanisms that take appropriate actions based on observations about the state and
activity in the system. Control strategies (such as enabling or disabling transitions and external inputs) are often based on the Petri net that models the
DES of interest [Yamalidou et al., 1996; Moody and Antsaklis, 1997; Moody
and Antsaklis, 1998; Moody and Antsaklis, 2000]. This section uses redundant
Petri net implementations to facilitate the task of the controller by monitoring
active transitions and by identifying "illegal" transitions. One of the biggest
advantages of this approach is that it can be combined with fault detection and
identification, and can perform monitoring despite incomplete or erroneous information.
6.1
MONITORING ACTIVE TRANSITIONS
In order to time decisions appropriately, the controller of a DES may need to
identify ongoing activity in the system. For example, the controller may need
to detect when two or more transitions have fired simultaneously, or it may have
to identify all active transitions, i.e., transitions that have used all tokens at their
input places but have not returned any tokens to their output places (using the
terminology of the transition fault model in Section 3, one can say that active
transitions are the ones that have not completed their postconditions). Employing the techniques of Section 4, one can construct separate redundant Petri net
implementations that allow the controller to detect and locate active transitions
by looking at the state of the redundant implementation. The following example
illustrates this idea.
EXAMPLE 8.8 If one extra place is added to the Petri net of Figure 8.3 (s = 1)
and if matrices C and D are given by

    C = [ 1 1 3 2 3 1 ] ,   D = [ 2 5 3 1 ] ,

one obtains the separate redundant Petri net implementation shown in Figure 8.11: at any given time instant t, the controller of the redundant Petri net
can determine if a transition is under execution by observing the overall state
qh[t] of the system and by performing the parity check

    [ −C  I1 ] qh[t] = [ −1 −1 −3 −2 −3 −1 1 ] qh[t] .

If the result is 2 (respectively 5, 3, 1), then transition t1 (respectively t2, t3, t4)
is under execution. Note that in order to identify whether multiple transitions
are under execution, one needs to use additional places (s > 1).
The additional place p7 in this example acts as a place-holder for special
tokens (which in reality would correspond to acknowledgments): it receives
2 (respectively 1) such tokens whenever transition t1 (respectively t4) is completed; it provides 1 token in order to enable transition t2. Explicit acknowledgments about the initiation and completion of each transition are avoided
(for example, transition t3 does not need to send any acknowledgment). Furthermore, by adding enough extra places, the above monitoring scheme can
be made robust to incomplete or erroneous information (as in the case when a
certain place fails to submit the correct number of tokens).

Figure 8.11. Example of a separate redundant Petri net implementation that enhances control
in the Petri net of Figure 8.3.
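Active-transition detection can be sketched in a few lines. Since the matrices of Figure 8.3 are not reproduced here, the sketch uses the small net assumed for Figure 8.1 together with the C and D of Example 8.3 (s = 1); the net of Example 8.8 would be handled identically:

```python
import numpy as np

# Net assumed for Figure 8.1, with the C and D of Example 8.3 (s = 1).
Bplus  = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
Bminus = np.array([[2, 0, 0], [0, 1, 0], [0, 0, 1]])
C = np.array([[2, 2, 1]])
D = np.array([[3, 2, 1]])

Bp_r = np.vstack([Bplus,  C @ Bplus  - D])
Bm_r = np.vstack([Bminus, C @ Bminus - D])
P = np.hstack([-C, np.eye(1, dtype=int)])

def active_transition(q):
    """Return j if the parity check shows that tj has consumed its input
    tokens but not yet produced its output tokens, else None."""
    s = int((P @ q).item())
    for j in range(D.shape[1]):
        if s == D[0, j]:
            return j
    return None

qs0 = np.array([2, 0, 0])
qh = np.concatenate([qs0, C @ qs0])        # properly initialized state
assert active_transition(qh) is None       # parity check is zero

mid = qh - Bm_r[:, 0]                      # t1 has started but not finished
assert active_transition(mid) == 0         # syndrome D(:,1) flags t1
```

Because s = 1 here, only one transition can be localized at a time, matching the example's remark that identifying multiple concurrently executing transitions requires s > 1.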
6.2
DETECTING ILLEGAL TRANSITIONS
The occurrence of illegal activity in a DES can lead to complete control
failure. This section uses separate redundant Petri net implementations to detect
and identify illegal transitions in DES's. The system modeled by the Petri net
is assumed to be "observable" through two different mechanisms: (i) place
sensors that provide information about the number of tokens in each place, and
(ii) transition sensors that indicate when each transition fires.
Suppose that the DES of interest is modeled by a Petri net with state evolution
equation

    qs[t + 1] = qs[t] + [ B+  B+u ] x[t] − [ B−  B−u ] x[t] ,
where matrices B+u and B−u model the postconditions and preconditions of
illegal transitions and where the input

    x[t] ≡ [ xl[t] ; xu[t] ]

is an input vector that captures both legal and illegal transitions (in xl[t] and
xu[t] respectively).
If a separate redundant implementation of the (legal⁴) part of the network is
constructed, the overall system will have the following state evolution equation:

    qh[t + 1] = qh[t] + [ B+       B+u ] x[t] − [ B−       B−u ] x[t] .
                        [ CB+ − D  0   ]        [ CB− − D  0   ]
The goal then is to choose C and D so that illegal behavior can be detected.
Information about the state of the upper part of the redundant implementation,
with state evolution

    qh1[t + 1] = qh1[t] + [ B+  B+u ] x[t] − [ B−  B−u ] x[t] ,

will be provided to the monitor by the place sensors. Notice that illegal
transitions change the number of tokens in these places, enabling the detection/identification of faults. The additional places, which evolve according to
the equation

    qh2[t + 1] = qh2[t] + [ CB+ − D  0 ] x[t] − [ CB− − D  0 ] x[t] ,

are internal to the controller and act only as test places, i.e., they cannot inhibit
transitions and can have a negative number of tokens. Once the number of
tokens in these test places is initialized appropriately (i.e., qh2[0] = Cqh1[0]),
the controller removes or adds tokens to these places based on which (legal)
transitions take place. Therefore, the state of the bottom part of the system is
controlled by the transition sensors.
If an illegal transition fires at time instant t, the illegal state qf[t] of the
redundant implementation is given by

    qf[t] = qh[t] + [ B+u ; 0 ] xu[t] − [ B−u ; 0 ] xu[t]
          = qh[t] + [ Bu ; 0 ] xu[t] ,

where Bu ≡ B+u − B−u and xu[t] denotes an array with all zero entries, except a
single entry with value "1" that indicates the illegal transition that fired. If the
parity check Pqf[t] is performed, one gets

    Pqf[t] = [ −C  Is ] qf[t]
           = [ −C  Is ] (qh[t] + [ Bu ; 0 ] xu[t])
           = −CBuxu[t] .

Therefore, one can identify which illegal transition has fired if all columns of
CBu are nonzero and distinct.
EXAMPLE 8.9 The controller of the maze in Figure 8.2 obtains information
about the state of the system through a set of detectors. More specifically, each
room is equipped with a "mouse sensor" that indicates whether the mouse is in
that room. In addition, "door sensors" get activated whenever the mouse goes
through the corresponding door.
Suppose that due to a bad choice of materials, the maze of Figure 8.2 is built
in a way that allows the mouse to dig a tunnel connecting rooms 1 and 5 and a
tunnel connecting rooms 1 and 4. This leads to the following set of illegal (i.e.,
non-door) transitions in the network:

    Bu = B+u − B−u = [  1 −1  1 −1 ]
                     [  0  0  0  0 ]
                     [  0  0  0  0 ] .
                     [  0  0 −1  1 ]
                     [ −1  1  0  0 ]
In order to detect the existence of such tunnels, one can use a redundant Petri
net implementation with one additional place (s = 1), C = [1 1 1 2 3]
and D = [1 1 1 1 2 1]. The resulting redundant matrices B̃+ and
B̃− are given in block form by

    B̃+ = [ B+       B+u ] ,   B̃− = [ B−       B−u ] ,
         [ CB+ − D  0   ]         [ CB− − D  0   ]

with

    CB+ − D = [ 0 0 2 0 0 0 ] ,   CB− − D = [ 0 0 0 1 1 0 ] .
The upper part of the network is observed through the place ("mouse")
sensors. The number of tokens in the additional place is updated based on
information from the transition ("door") sensors. More specifically, it receives
two tokens when transition t3 fires; it loses one token each time transition t4
or t5 fires.
The parity check is given by

    [ −1 −1 −1 −2 −3 1 ] qh[t]

and is zero if no illegal activity has taken place. It is 2 (respectively −2, 1,
−1) if illegal transition Bu(:, 1) [respectively Bu(:, 2), Bu(:, 3), Bu(:, 4)] has
taken place. Note that one can detect the existence of a tunnel in the maze using
only three door sensors (since there are only three nonzero entries in matrices
CB+ − D and CB− − D).
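The tunnel-detection logic of Example 8.9 reduces to the map s ↦ column of −CBu. The C below is the one given in the example; the columns of Bu are an assumption, ordered so that the syndromes come out as 2, −2, 1, −1 as stated in the example:

```python
import numpy as np

# C from Example 8.9; Bu columns ordered (5->1, 1->5, 4->1, 1->4), an
# assumption chosen to match the syndrome values quoted in the text.
C = np.array([[1, 1, 1, 2, 3]])
Bu = np.array([[ 1, -1,  1, -1],    # room 1
               [ 0,  0,  0,  0],    # room 2
               [ 0,  0,  0,  0],    # room 3
               [ 0,  0, -1,  1],    # room 4
               [-1,  1,  0,  0]])   # room 5

syndromes = (-C @ Bu).ravel()
print(syndromes.tolist())   # [2, -2, 1, -1]

def identify_tunnel_move(s):
    """Map a nonzero parity-check value to the illegal transition that fired
    (the columns of C Bu must be nonzero and distinct for this to work)."""
    matches = np.flatnonzero(syndromes == s)
    return int(matches[0]) if len(matches) == 1 else None

assert identify_tunnel_move(2) == 0    # mouse moved from room 5 to room 1
```

Because the four syndrome values are nonzero and distinct, each tunnel crossing is both detected and identified from the single test place.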
7
SUMMARY
This chapter constructed monitoring schemes for DES's based on their Petri
net models. The technique systematically incorporates constraints into a given
Petri net by looking at appropriate Petri net embeddings. The resulting monitor
capitalizes on the imposed constraints in order to detect and identify faults via
simple linear checks. Comparisons with existing fault diagnosis techniques in
Petri net systems were made at various points during the analysis in this chapter;
there still remain, however, a number of connections that need to be pursued further
in order to fully understand the role of coding techniques in performing fault
diagnosis. Applications of these techniques in the context of monitoring power
system faults can be found in [Hadjicostis and Verghese, 2000].
Notes
1 Some of the constraints imposed in Theorem 8.1 can be dropped if one adopts
the view in [Silva and Velilla, 1985] and treats additional places only as test
places, i.e., allows them to have a negative number of tokens. In such a case,
C and D can have negative entries.
2 One needs to ensure that for all pairs of columns of P there do not exist
nonzero integers α, β such that α × P(:, i) = β × P(:, j), i ≠ j.
3 Multiplication of C by a constant does not help if, for some i and j,
CB+(i, j) = 0 or CB−(i, j) = 0.
4 Since one has no control over the illegal part of the Petri net, the monitoring
scheme cannot use any acknowledgments from this part of the net.
References
Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1997a).
A Petri net approach to fault detection and diagnosis in distributed systems
(Part I). In Proceedings of the 36th IEEE Conf. on Decision and Control,
pages 720-725.
Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1997b).
A Petri net approach to fault detection and diagnosis in distributed systems
(Part II). In Proceedings of the 36th IEEE Conf. on Decision and Control,
pages 726-731.
Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1998).
Fault detection and diagnosis in distributed systems: an approach by partially
stochastic Petri nets. Discrete Event Dynamic Systems: Theory and
Applications, 8(2):203-231.
Baccelli, F., Cohen, G., Olsder, G. J., and Quadrat, J. P. (1992). Synchronization
and Linearity. Wiley, New York.
Bouloutas, A., Hart, G. W., and Schwartz, M. (1992). Simple finite state fault
detectors for communication networks. IEEE Transactions on Communications, 40(3):477-479.
Cardoso, J., Künzle, L. A., and Valette, R. (1995). Petri net based reasoning for
the diagnosis of dynamic discrete event systems. In Proceedings of the IFSA
'95, the 6th Int. Fuzzy Systems Association World Congress, pages 333-336.
Cassandras, C. G. (1993). Discrete Event Systems. Aksen Associates, Boston.
Cassandras, C. G., Lafortune, S., and Olsder, G. J. (1995). Trends in Control:
A European Perspective. Springer-Verlag, London.
Cieslak, R., Desclaux, C., Fawaz, A. S., and Varaiya, P. (1988). Supervisory
control of discrete-event processes with partial observations. IEEE Transactions on Automatic Control, 33(3):249-260.
Debouk, R., Lafortune, S., and Teneketzis, D. (1998). Coordinated decentralized
protocols for failure diagnosis of discrete event systems. In Proceedings of
the 37th IEEE Conf. on Decision and Control, pages 3763-3768.
Debouk, R., Lafortune, S., and Teneketzis, D. (1999). On an optimization problem in sensor selection for failure diagnosis. In Proceedings of the 38th IEEE
Conf. on Decision and Control, pages 4990-4995.
Debouk, R., Lafortune, S., and Teneketzis, D. (2000). On the effect of communication delays in failure diagnosis of decentralized discrete event systems.
In Proceedings of the 39th IEEE Conf. on Decision and Control, pages 2245-2251.
Desrochers, A. A. and Al-Jaar, R. Y. (1994). Applications of Petri Nets in Manufacturing Systems. IEEE Press.
Gertler, J. (1998). Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic
Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. and Verghese, G. C. (1999). Monitoring discrete event systems using Petri net embeddings. In Application and Theory of Petri Nets
1999, number 1639 in Lecture Notes in Computer Science, pages 188-208.
Hadjicostis, C. N. and Verghese, G. C. (2000). Power system monitoring using Petri net embeddings. IEE Proceedings: Generation, Transmission, Distribution, 147(5):299-303.
Moody, J. O. and Antsaklis, P. J. (1997). Supervisory control using computationally efficient linear techniques: A tutorial introduction. In Proceedings
of MED 1997, the 5th IEEE Mediterranean Conf. on Control and Systems.
Moody, J. O. and Antsaklis, P. J. (1998). Supervisory Control of Discrete Event
Systems Using Petri Nets. Kluwer Academic Publishers, Boston.
Moody, J. O. and Antsaklis, P. J. (2000). Petri net supervisors for DES with uncontrollable and unobservable transitions. IEEE Transactions on Automatic Control, 45(3):462-476.
Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings
of the IEEE, 77(4):541-580.
Pandalai, D. N. and Holloway, L. E. (2000). Template languages for fault monitoring of timed discrete event processes. IEEE Transactions on Automatic Control, 45(5):868-882.
Park, Y. and Chong, E. K. P. (1995). Fault detection and identification in communication networks: a discrete event systems approach. In Proceedings of the 33rd Annual Allerton Conf. on Communication, Control, and Computing, pages 126-135.
Ramadge, P. J. and Wonham, W. M. (1989). The control of discrete event systems. Proceedings of the IEEE, 77(1):81-97.
Sampath, M., Lafortune, S., and Teneketzis, D. (1998). Active diagnosis of discrete-event systems. IEEE Transactions on Automatic Control, 43(7):908-929.
Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., and Teneketzis,
D. (1995). Diagnosability of discrete-event systems. IEEE Transactions on
Automatic Control, 40(9):1555-1575.
Sifakis, J. (1979). Realization of fault-tolerant systems by coding Petri nets.
Journal of Design Automation and Fault-Tolerant Computing, 3(2):93-107.
Silva, M. and Velilla, S. (1985). Error detection and correction in Petri net
models of discrete events control systems. In Proceedings of ISCAS 1985,
the IEEE Int. Symp. on Circuits and Systems, pages 921-924.
Tinghuai, C. (1992). Fault diagnosis and fault tolerance: a systematic approach
to special topics. Springer-Verlag, Berlin.
Valette, R., Cardoso, J., and Dubois, D. (1989). Monitoring manufacturing systems by means of Petri nets with imprecise markings. In Proceedings of the
IEEE Int. Symp. on Intelligent Control, pages 233-238.
Wang, C. and Schwartz, M. (1993). Fault detection with multiple observers.
IEEE/ACM Transactions on Networking, 1(1):48-55.
Yamalidou, K., Moody, J., Lemmon, M., and Antsaklis, P. (1996). Feedback
control of Petri nets based on place invariants. Automatica, 32(1):15-28.
Chapter 9
CONCLUDING REMARKS
1
SUMMARY
This book presented a unifying approach for constructing fault-tolerant combinational and dynamic systems. The underlying motive was to develop resource-efficient alternatives to modular redundancy by constructing appropriate redundant system embeddings. These embeddings preserve the functionality of the
original system and are designed in a way that imposes constraints on the set of
outputs/states that are reachable under fault-free conditions. Violations of these
constraints can then be used by an external mechanism to detect and correct
errors. The faults that cause the errors could be due to hardware malfunctions,
communication faults, incorrect initialization, and so forth.
The book systematically studied this two-stage approach to fault tolerance
and demonstrated its potential and effectiveness for both combinational and
dynamic systems. Combinational systems were studied first by reviewing von
Neumann's ground-breaking approach in Chapter 2. For combinational systems
that perform computations with algebraic structure, Chapter 3 showed that
algebraic constructions (and, in particular, algebraic injective homomorphisms)
can greatly facilitate fault tolerance. Among other results, it was shown that
the development of parity-type protection schemes for computations with an
underlying group or semigroup structure can be posed and solved as an algebraic
problem.
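The flavor of such parity-type schemes can be sketched with a residue check on integer addition. The modulus, function names and injected fault below are illustrative assumptions, not the book's specific construction; the point is that x mod M is a group homomorphism from (Z, +) onto (Z_M, +), so a fault-free sum must agree with an independently computed check symbol.

```python
# Parity-type protection for integer addition via a residue check.
# Since phi(x) = x mod M is a homomorphism from (Z, +) onto (Z_M, +),
# a fault-free sum must satisfy phi(a + b) == (phi(a) + phi(b)) mod M.

M = 13  # illustrative modulus; larger M catches a larger class of errors

def protected_add(a, b, inject_fault=False):
    """Add a and b, then verify the result against the residue channel."""
    result = a + b
    if inject_fault:
        result += 1  # model a transient hardware error in the adder
    # The parity channel computes the check symbol independently.
    check = (a % M + b % M) % M
    ok = (result % M) == check
    return result, ok

# A fault-free computation passes the check ...
value, ok = protected_add(1234, 5678)
assert ok and value == 6912
# ... while an injected error is detected (unless it is a multiple of M).
value, ok = protected_add(1234, 5678, inject_fault=True)
assert not ok
```

Note that the check is computed on the encoded (residue) side only, which is what makes the scheme cheaper than full replication of the adder.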
In the case of dynamic systems, which was studied next, two additional important issues were identified: (i) Redundant dynamics provide
flexibility that can be used to efficiently/reliably enforce state constraints (for
example, in order to build redundant implementations that require less hardware). (ii) Error propagation complicates the task of maintaining correctness
during the operation of a dynamic system, particularly for long time intervals.
This raises questions regarding not only the cost but also the feasibility of constructing reliable dynamic systems exclusively out of unreliable components.
Assuming fault-free error correction, the overarching goal of Chapters 4-6
was to systematically develop alternatives to modular redundancy. It was shown
that, under a particular error detection/correction scheme, a number of redundant implementations are possible. A precise characterization of these different
redundant implementations was obtained for a variety of dynamic systems.
This resulted in diverse schemes for fault tolerance that included embeddings
based on algebraic homomorphisms (see Chapter 4), non-concurrent checking
schemes (see Chapters 4 and 5), reconfiguration methodologies (see Chapter 5)
and redundant implementations that required less hardware (see Chapter 6).
Chapter 7 relaxed the assumption that the error-correcting mechanism be
fault-free. It considered dynamic systems that suffer transient faults in the
state transition mechanism and in the error-correcting mechanism. Due to the
dynamic nature of these systems, transient faults in the error-correcting mechanism propagate in time, resulting in a serious increase in the probability of
overall failure. In order to handle error propagation effectively, modular redundancy schemes that use multiple system replicas and voters were studied. It was
shown that, by increasing the amount of redundancy, one can in principle construct redundant implementations that operate under a specified (low) level of
failure probability for any finite time interval. Furthermore, for the case of unreliable linear finite-state machines (LFSM's), low-complexity error-correcting
codes can be used to obtain interconnections of identical LFSM's that operate
in parallel on distinct input sequences, fail with arbitrarily low probability during a finite time interval and require only a constant amount of redundancy per
machine.
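A minimal sketch of the modular-redundancy idea underlying these constructions is given below, using triple replication of an arbitrary toy state machine with a voter that is assumed fault-free in this sketch. The machine, fault model and names are illustrative assumptions, not the LFSM construction of Chapter 7.

```python
# Triple modular redundancy for a simple finite-state machine:
# three replicas run the same next-state function, and after every step a
# voter restores the majority state, preventing transient errors from
# propagating in time.

import random

def next_state(s, u):
    return (s + u) % 8  # an arbitrary 8-state machine used for illustration

def majority(a, b, c):
    # With at most one corrupted replica, two states always agree.
    return a if a == b or a == c else b

def run(inputs, fault_prob, seed=0):
    rng = random.Random(seed)
    replicas = [0, 0, 0]
    for u in inputs:
        replicas = [next_state(s, u) for s in replicas]
        # Each replica may suffer an independent transient state fault.
        replicas = [rng.randrange(8) if rng.random() < fault_prob else s
                    for s in replicas]
        # Voting masks the fault as long as at most one replica is corrupted.
        v = majority(*replicas)
        replicas = [v, v, v]
    return replicas[0]

assert run([1] * 100, fault_prob=0.0) == 100 % 8
```

In the fault-free case the voted state simply tracks the original machine; the cost, as the chapter notes, is the replication and voting hardware, which is what the coding-based schemes aim to reduce.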
Chapter 8 explored similar ideas in the context of fault diagnosis in discrete
event systems that are modeled by Petri nets. More specifically, by employing embeddings similar to the ones developed in Chapters 4-6, one can obtain
monitoring schemes for complex networked systems, such as manufacturing
systems, communication protocols or power systems. The trade-offs and objectives involved in fault diagnosis, however, can be quite different. For example,
the objective may be to avoid complicated reachability analysis, or to minimize the size of the monitor, or to construct monitoring schemes that require
minimal communication overhead. The resulting methodologies are simple
and allow easy specification of additional places, connections and weights so
that detection/identification of both transition and place faults can be verified
by weighted checksums on the overall state of the redundant Petri net. In addition, the monitoring schemes can be designed to perform reliably despite
erroneous/incomplete information.
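The weighted-checksum idea can be sketched as follows; the net, weights and fault below are illustrative assumptions rather than an example taken from Chapter 8. One redundant place is wired so that a fixed weighted sum of the overall marking is invariant under every legal firing, and any violation of that invariant flags a fault.

```python
# Fault detection in a Petri net via a redundant checksum place.
# Incidence matrix of a small net: rows = 3 places, columns = 2 transitions;
# firing transition t updates the marking as q[p] += D[p][t].
D = [[-1,  0],
     [ 1, -1],
     [ 0,  1]]
w = [1, 2, 1]                      # checksum weights on the original places

# The monitor place's incidence row cancels the weighted column sums w.D,
# so the weighted marking sum never changes under fault-free firings.
monitor_row = [-sum(w[p] * D[p][t] for p in range(3)) for t in range(2)]
D_red = D + [monitor_row]
w_red = w + [1]

q = [2, 0, 0, 5]                   # initial marking; monitor starts with 5 tokens
checksum = sum(wi * qi for wi, qi in zip(w_red, q))

def fire(q, t):
    return [qi + D_red[p][t] for p, qi in enumerate(q)]

q = fire(fire(q, 0), 1)            # two legal firings preserve the checksum
assert sum(wi * qi for wi, qi in zip(w_red, q)) == checksum

q[1] += 1                          # place fault: a spurious token appears ...
assert sum(wi * qi for wi, qi in zip(w_red, q)) != checksum  # ... and is caught
```

The monitor only observes markings and never interferes with the original net, which is the "separate redundant implementation" style of monitoring.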
2
FUTURE RESEARCH DIRECTIONS
There are many important directions for future research in this area. Perhaps
the most exciting one is to explore how techniques for fault tolerance can enable
innovative, possibly less expensive, manufacturing technologies and how they
can lead to novel computational architectures. In particular, one prospect is to
build reliable systems out of presently unreliable technologies (such as quantum
or molecular computers) by developing appropriate coding protection schemes.
Another prospect is to apply fault-tolerance techniques in silicon-based systems
to increase speed or power-efficiency [Shanbhag, 1997].
A number of related open questions pertain to the development of fault-tolerant implementations that allow faults in the error-correcting mechanism.
For example, the encoding techniques in Chapter 7 could potentially be generalized to group machines or other algebraic machines. In addition, different
(easily decodable) coding schemes could be used for simultaneously protecting
parallel simulations of a given system. There are also interesting theoretical
questions regarding how one can define the computational capacity of unreliable LFSM's and, more generally, finite-state machines. Since one concern
about the approach in Chapter 7 is the increasing number of connections, it may
be worthwhile to explore how one can design dynamic systems that limit the
number of connections to neighboring elements (much like Gacs' approach in
[Gacs, 1986]).
The two-stage approach for fault tolerance that was studied in this book operates under the premise that the code (constraints) enforced on the state of the redundant implementation is time-independent. This implies that the error-correcting mechanism has no memory, and it would be interesting to investigate
the applicability of more general approaches. For example, instead of using
block codes, one could try convolutional codes to protect LFSM's (some related work has appeared in [Redinbo, 1987; Holmquist and Kinney, 1991]).
This approach seems promising since convolutional codes can also be decoded
at low cost and appear suitable for a dynamic system setting (see, for example,
the work in [Rosenthal and York, 1999]). In addition, using error-correcting
mechanisms with memory may lead to reduced hardware complexity in these
fault-tolerant implementations. More generally, one can develop a "behavioral"
approach to fault tolerance, where system behaviors (i.e., state trajectories) are
associated with fault-free and faulty systems [Antoulas and Willems, 1993].
Applying these ideas further in specific contexts (e.g., in linear filters for digital signal processing applications or linear systems over groups [Fagnani and
Zampieri, 1996]) can help in the systematic study of optimization criteria (e.g.,
the minimization of redundant hardware) and in the development of efficient
reconfiguration schemes (e.g., for handling permanent faults in integrated circuits). One can also study how these ideas generalize to nonlinear and/or
time-varying systems.
There are a number of future extensions that relate to fault diagnosis in
discrete event systems. These include the development of resource-efficient
hierarchical or distributed fault diagnosis schemes that are robust to uncertainty
in the sensors or in the information communicated to the diagnoser. Also appealing is the explicit study of examples where a subset of the transitions is
uncontrollable and/or unobservable (see, for example, [Moody and Antsaklis,
1997; Moody and Antsaklis, 1998; Moody and Antsaklis, 2000]). Another
promising research direction is the application of these ideas to max-plus systems [Cuningham-Green, 1979; Cohen et al., 1989; Baccelli et al., 1992; Cassandras, 1993; Cassandras et al., 1995]. These systems are "linear" in the
semifield of real numbers under the MAX (additive) and + (multiplicative) operations, and redundancy can be introduced in them in ways analogous to those
for linear dynamic systems. The absence of an inverse for the MAX operation,
however, forces one to consider issues related to error detection and robust performance rather than error correction. These ideas may be useful in building
robust flow networks, real-time systems and scheduling algorithms.
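As a small illustration of the max-plus "linearity" described above (the matrix and values below are illustrative assumptions), the state evolution x(k+1) = A ⊗ x(k) uses max in place of addition and + in place of multiplication; "scaling" an input in the max-plus sense, i.e. adding a constant, scales the output the same way, which is the property that lets redundancy be introduced much as for ordinary linear systems.

```python
# Max-plus "linear" state evolution: x(k+1) = A (x) x(k), where the
# max-plus product replaces multiply-and-add with add-and-max.

NEG_INF = float('-inf')  # the "zero" of the max-plus semifield

def maxplus_matvec(A, x):
    """Max-plus product: (A (x) x)[i] = max_j (A[i][j] + x[j])."""
    return [max(a_ij + x_j for a_ij, x_j in zip(row, x)) for row in A]

A = [[2, 5],
     [3, NEG_INF]]       # illustrative event-timing matrix

# Iterate the event times forward a few steps.
x = [0, 0]
for _ in range(3):
    x = maxplus_matvec(A, x)

# Max-plus linearity: adding c to every input component adds c to every
# output component, since max(a + c, b + c) = max(a, b) + c.
c = 4
y = maxplus_matvec(A, [c, c])
assert y == [v + c for v in maxplus_matvec(A, [0, 0])]
```

As the text notes, max has no inverse, so checks built this way lend themselves to error detection rather than correction.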
References
Antoulas, A. C. and Willems, J. C. (1993). A behavioral approach to linear exact
modeling. IEEE Transactions on Automatic Control, 38(12):1776-1802.
Baccelli, F., Cohen, G., Olsder, G. J., and Quadrat, J. P. (1992). Synchronization
and Linearity. Wiley, New York.
Cassandras, C. G. (1993). Discrete Event Systems. Aksen Associates, Boston.
Cassandras, C. G., Lafortune, S., and Olsder, G. J. (1995). Trends in Control:
A European Perspective. Springer-Verlag, London.
Cohen, G., Moller, P., Quadrat, J.-P., and Viot, M. (1989). Algebraic tools for
the performance evaluation of discrete event systems. Proceedings of the
IEEE, 77(1):39-85.
Cuningham-Green, R. (1979). Minimax Algebra. Springer-Verlag, Berlin.
Fagnani, F. and Zampieri, S. (1996). Dynamical systems and convolutional
codes over finite abelian groups. IEEE Transactions on Information Theory,
42(11):1892-1912.
Gacs, P. (1986). Reliable computation with cellular automata. Journal of Computer and System Sciences, 32(2):15-78.
Holmquist, L. P. and Kinney, L. L. (1991). Concurrent error detection in sequential circuits using convolutional codes. In Proceedings of the 9th Int. Symp. on
Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, pages
183-194. Springer-Verlag.
Moody, 1. O. and Antsaklis, P. J. (1997). Supervisory control using computationally efficient linear techniques: A tutorial introduction. In Proceedings
of MED 1997, the 5th IEEE Mediterranean Conf. on Control and Systems.
Moody, J. O. and Antsaklis, P. J. (1998). Supervisory Control of Discrete Event
Systems Using Petri Nets. Kluwer Academic Publishers, Boston.
Moody, J. O. and Antsaklis, P. J. (2000). Petri net supervisors for DES with uncontrollable and unobservable transitions. IEEE Transactions on Automatic
Control, 45(3):462-476.
Redinbo, G. R. (1987). Finite field fault-tolerant digital filtering architecture.
IEEE Transactions on Computers, 36(10):1236-1242.
Rosenthal, J. and York, E. V. (1999). BCH convolutional codes. IEEE Transactions on Information Theory, 45(6):1833-1844.
Shanbhag, N. R. (1997). A mathematical basis for power-reduction in digital
VLSI systems. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 44(11):935-951.
About the Author
Christoforos Hadjicostis is currently an Assistant Professor in the Department of Electrical and Computer Engineering and a Research Assistant Professor in the Coordinated Science Laboratory at the University of Illinois at
Urbana-Champaign. He received S.B. degrees in Electrical Engineering in
1993, in Computer Science and Engineering in 1993 and in Mathematics in
1999, the M.Eng. degree in Electrical Engineering and Computer Science in
1995, and a Ph.D. in Electrical Engineering and Computer Science in 1999, all
from the Massachusetts Institute of Technology, Cambridge, Massachusetts.
Dr. Hadjicostis was awarded the Faculty Early Career Development (CAREER) Award
from the National Science Foundation in 2001. While at MIT, he served as
president of the MIT Chapter of HKN, received the Harold L. Hazen Teaching Award and the Ernst A. Guillemin Thesis Prize, and received fellowships
from the National Semiconductor Corporation and the Grass Instrument Company. Dr. Hadjicostis' research interests include fault-tolerant computation in
combinational and dynamic systems, fault management and control of complex
systems, and coding and graph theory.
Index
Active transition, 170
Additive fault model, 42, 149
Algorithm-based fault tolerance, 2, 7, 34, 37
Arithmetic code, 33, 35
Associativity, 41
Autonomous machine, 68
Behavioral approach, 181
Binary operation, 41
Binary symmetric channel, 132
Boolean
function, 22
gate, 22
Capacity, 126
channel, 132
computational, 132
Cellular automata, 117
Channel
binary symmetric, 132
capacity, 132
crossover probability, 132
Checksum, 88, 104
Chip-kill, 5
Circuit
combinational, 22
depth,22
reliable, 26, 28
size, 22
Cluster states, 116
Combinational system, 3
Commutativity, 41
Computational capacity, 132
Concurrent error masking, 1
Congruence relation, 52
Convolutional encoder, 106
Coset, 45, 63
nonzero, 47
Decoding, 50
output, 36
Diagnoser, 143
Discrete event system, 143
max-plus, 181
Distributed voting, 115, 119
Dynamic system, 3
redundant implementation, 9-10
reliable state evolution, 7
Encoding
input, 35, 133
matrix, 94
state, 116
Equivalence relation, 52
Error correction, 36, 45, 51, 103
conditions for single error, 51
fault-free, 34, 42, 81
multiple, 51
unreliable, 115-116
Error detection and correction, 87
non-concurrent, 92
periodic, 92
Error detection, 36, 45, 51, 103
conditions for single error, 51
fault-free, 81
multiple, 51
Error, 1
propagation, 12, 115
single-bit, 64, 67
Failure, 1, 171
overall, 118, 120, 130
Fault detection and identification, 154, 166
Fault diagnosis, 13, 143
distributed, 182
hierarchical, 182
Fault model, 148
additive, 42, 149
Fault tolerance, 1
Fault, 1
detection, 144
hardware, 64
identification, 144
model,2
monitoring, 144
permanent, 1, 84
transient, 1, 84, 115
Fault-tolerant FFT, 34, 40
Fault-tolerant convolution, 34
Fault-tolerant integer addition, 36
Fault-tolerant linear operators, 40
Fault-tolerant matrix multiplication, 37
Fault-tolerant sorting networks, 40
Finite field, 112
Flip-flop, 116, 123
Gate
3-input, 27
u-input, 29
Boolean, 22
NAND, 27, 31
XNAND, 27-28
XOR, 99-101, 108, 123
unreliable, 22
Group machine, 62
Group, 41
abelian, 42
canonical surjective homomorphism, 49
coset, 45, 63
cyclic, 68
homomorphism, 45
inverse, 41
non-trivial subgroup, 62
normal subgroup, 62
simple, 63
subgroup, 45
surjective homomorphism, 53
Hamming code, 87-88, 93
Hamming distance, 75-76, 116
Illegal transition, 171
Incidence matrix, 147
Independent iterations, 124
Iterative decoding, 124
LTI dynamic system, 79
hardware implementation, 83
redundant dynamics, 82, 91
redundant implementation, 80, 83
signal flow graph, 83
standard redundant implementation, 82
state evolution, 79
Linear code, 79, 99, 103
Hamming code, 87
encoding matrix, 94
independent iterations, 124
low-density parity check code, 123
parity check, 81
single-error correction, 82
single-error detection, 82
Linear feedback shift register, 100
Linear finite-state machine, 99,123, 127
autonomous, 110
classical canonical form, 101, 132
hardware implementation, 101
parallel instantiations, 127, 132
redundant dynamics, 104
redundant implementation, 102, 109
sequence enumerator, 100, 110
standard redundant implementation, 103
state evolution, 99
Loop-free interconnection, 22
Low-density parity check code, 123
Machine decomposition
Krohn-Rhodes, 63
Zeiger, 74
coset leader, 62
series-parallel, 62
subgroup machine, 62
Machine, 61
algebraic, 61
autonomous, 67
group, 61
permutation-reset, 73
redundant implementation, 64
reset, 73
reset-identity, 73
semigroup, 61
Marking, 145
Modular redundancy, 4, 7, 33, 35, 93, 118
Monitor, 66-68, 151, 160
Monoid, 49
homomorphism, 50
Multiprocessor system, 34, 37
Overall failure, 1, 118, 120, 130
Parallel matrix multiplication, 37
Parity channel, 47
Parity check, 81, 103
Permutation-reset machine, 73
Petri net, 143-144
additive fault model, 149
fault detection and identification, 154, 166
fault model, 148
incidence matrix, 147
input place, 145
marking, 145
non-separate redundant implementation, 144, 162
output place, 145
place fault, 149
place, 145
separate redundant implementation, 144, 151
token, 145
transition fault, 149
transition, 145-146
Place fault, 149
Place, 145
fault, 149
input, 145
output, 145
RAID, 5
Reachability matrix, 95
Reconfiguration, 84
Redundant implementation
LTI dynamic system, 80
algebraic machine, 64
group machine, 64
linear finite-state machine, 109
non-separate, 69, 75
semigroup machine, 73
separate, 66, 74
Reliable state evolution, 118
Reset machine, 73
Reset-identity machine, 73
Restoring organ, 21, 23
Self-checking module, 4
Semigroup, 49
abelian, 49
canonical surjective homomorphism, 53
homomorphism, 50
non-abelian, 49
Separate code
for integer addition, 47, 54
for integer comparison, 55
for integer multiplication, 54
group, 47
semigroup, 52
Signal flow graph, 84
delay-free paths, 85, 88
factored state variables, 85
Similarity transformation, 80, 101
Stable memories, 116, 126
State transition fault, 4
Structured redundancy, 4, 6-7, 35, 41, 117
Supervisory control, 147
Surjective homomorphism, 49
TMR, 7, 35, 93-94
Tolerable noise, 27
Transition fault, 149
Transition, 145-146
active, 170
fault, 149
illegal, 171
Unreliable components, 5, 21, 115, 117
reliably, 21