Coding Approaches to Fault Tolerance in Combinational and Dynamic Systems

THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE

CODING APPROACHES TO FAULT TOLERANCE IN COMBINATIONAL AND DYNAMIC SYSTEMS

CHRISTOFOROS N. HADJICOSTIS
Coordinated Science Laboratory and Department of Electrical and Computer Engineering
University of Illinois at Urbana-Champaign

SPRINGER SCIENCE+BUSINESS MEDIA, LLC

ISBN 978-1-4613-5271-6
ISBN 978-1-4615-0853-3 (eBook)
DOI 10.1007/978-1-4615-0853-3

Library of Congress Cataloging-in-Publication Data
A C.I.P. Catalogue record for this book is available from the Library of Congress.

Copyright © 2002 Springer Science+Business Media New York
Originally published by Kluwer Academic Publishers in 2002
Softcover reprint of the hardcover 1st edition 2002

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher, Springer Science+Business Media, LLC.

Printed on acid-free paper.

To Pani

Contents

List of Figures
List of Tables
Foreword
Preface
Acknowledgments

1. INTRODUCTION
1 Definitions, Motivation and Background
2 Fault-Tolerant Combinational Systems
2.1 Reliable Combinational Systems
2.2 Minimizing Redundant Hardware
3 Fault-Tolerant Dynamic Systems
3.1 Redundant Implementations
3.2 Faults in the Error-Correcting Mechanism
4 Coding Techniques for Fault Diagnosis

Part I Fault-Tolerant Combinational Systems

2. RELIABLE COMBINATIONAL SYSTEMS OUT OF UNRELIABLE COMPONENTS
1 Introduction
2 Computational Models for Combinational Systems
3 Von Neumann's Approach to Fault Tolerance
4 Extensions of Von Neumann's Approach
4.1 Maximum Tolerable Noise for 3-Input Gates
4.2 Maximum Tolerable Noise for u-Input Gates
5 Related Work and Further Reading

3. ABFT FOR COMBINATIONAL SYSTEMS
1 Introduction
2 Arithmetic Codes
3 Algorithm-Based Fault Tolerance
4 Generalizations of Arithmetic Coding to Operations with Algebraic Structure
4.1 Fault Tolerance for Abelian Group Operations
4.1.1 Use of Group Homomorphisms
4.1.2 Error Detection and Correction
4.1.3 Separate Group Codes
4.2 Fault Tolerance for Semigroup Operations
4.2.1 Use of Semigroup Homomorphisms
4.2.2 Error Detection and Correction
4.2.3 Separate Semigroup Codes
4.3 Extensions

Part II Fault-Tolerant Dynamic Systems

4. REDUNDANT IMPLEMENTATIONS OF ALGEBRAIC MACHINES
1 Introduction
2 Algebraic Machines: Definitions and Decompositions
3 Redundant Implementations of Group Machines
3.1 Separate Monitors for Group Machines
3.2 Non-Separate Redundant Implementations for Group Machines
4 Redundant Implementations of Semigroup Machines
4.1 Separate Monitors for Reset-Identity Machines
4.2 Non-Separate Redundant Implementations for Reset-Identity Machines
5 Summary

5. REDUNDANT IMPLEMENTATIONS OF DISCRETE-TIME LTI DYNAMIC SYSTEMS
1 Introduction
2 Discrete-Time LTI Dynamic Systems
3 Characterization of Redundant Implementations
4 Hardware Implementation and Fault Model
5 Examples of Fault-Tolerant Systems
6 Summary

6. REDUNDANT IMPLEMENTATIONS OF LINEAR FINITE-STATE MACHINES
1 Introduction
2 Linear Finite-State Machines
3 Characterization of Redundant Implementations
4 Examples of Fault-Tolerant Systems
5 Hardware Minimization in Redundant LFSM Implementations
6 Summary

7.
UNRELIABLE ERROR CORRECTION IN DYNAMIC SYSTEMS
1 Introduction
2 Fault Model for Dynamic Systems
3 Reliable Dynamic Systems using Distributed Voting Schemes
4 Reliable Linear Finite-State Machines
4.1 Low-Density Parity Check Codes and Stable Memories
4.2 Reliable Linear Finite-State Machines using Constant Redundancy
5 Other Issues

8. CODING APPROACHES FOR FAULT DETECTION AND IDENTIFICATION IN DISCRETE EVENT SYSTEMS
1 Introduction
2 Petri Net Models of Discrete Event Systems
3 Fault Models for Petri Nets
4 Separate Monitoring Schemes
4.1 Separate Redundant Petri Net Implementations
4.2 Fault Detection and Identification
5 Non-Separate Monitoring Schemes
5.1 Non-Separate Redundant Petri Net Implementations
5.2 Fault Detection and Identification
6 Applications in Control
6.1 Monitoring Active Transitions
6.2 Detecting Illegal Transitions
7 Summary

9. CONCLUDING REMARKS
1 Summary
2 Future Research Directions

About the Author

Index

List of Figures
1.1 Triple modular redundancy.
1.2 Fault-tolerant combinational system.
1.3 Triple modular redundancy with correcting feedback.
1.4 Fault-tolerant dynamic system.
2.1 Error correction using a "restoring organ."
2.2 Plots of functions f(q) and g(q) for two different values of p.
2.3 Two successive restoring iterations in von Neumann's construction for fault tolerance.
3.1 Arithmetic coding scheme for protecting binary operations.
3.2 aN arithmetic coding scheme for protecting integer addition.
3.3 ABFT scheme for protecting matrix multiplication.
3.4 Fault-tolerant computation of a group operation.
3.5 Fault tolerance using an abelian group homomorphism.
3.6 Coset-based error detection and correction.
3.7 Separate arithmetic coding scheme for protecting integer addition.
3.8 Separate coding scheme for protecting a group operation.
3.9 Partitioning of semigroup (N, ×) into congruence classes.
4.1 Series-parallel decomposition of a group machine.
4.2 Redundant implementation of a group machine.
4.3 Separate redundant implementation of a group machine.
4.4 Relationship between a separate monitor and a decomposed group machine.
5.1 Delay-adder-gain implementation and the corresponding signal flow graph for an LTI dynamic system.
5.2 State evolution equation and hardware implementation of the digital filter in Example 5.2.
5.3 Redundant implementation based on a checksum condition.
5.4 Second redundant implementation based on a checksum condition.
6.1 Hardware implementation of the linear feedback shift register in Example 6.1.
6.2 Different implementations of a convolutional encoder.
7.1 Reliable state evolution subject to faults in the error corrector.
7.2 Modular redundancy with distributed voting scheme.
7.3 Hardware implementation of Gallager's modified iterative decoding scheme for LDPC codes.
7.4 Replacing k LFSM's with n redundant LFSM's.
7.A.1 Encoded implementation of k LFSM's using n redundant LFSM's.
8.1 Petri net with three places and three transitions.
8.2 Cat-and-mouse maze.
8.3 Petri net model of a distributed processing system.
8.4 Concurrent monitoring scheme using a separate Petri net implementation.
8.5 Example of a separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1.
8.6 Example of a separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1.
8.7 Example of a separate redundant Petri net implementation that identifies single transition or single place faults in the Petri net of Figure 8.1.
8.8 Concurrent monitoring scheme using a non-separate Petri net implementation.
8.9 Example of a non-separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1.
8.10 Example of a non-separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1.
8.11 Example of a separate redundant Petri net implementation that enhances control in the Petri net of Figure 8.3.

List of Tables
2.1 Input-output table for the 3-input XNAND gate.
5.1 Syndrome-based error detection and identification in Example 5.1.

Foreword

Fault tolerance requires redundancy, but redundancy comes at a price. At one extreme of redundancy, fault tolerance may involve running several complete and independent replicas of the desired process; discrepancies then indicate faults, and the majority result is taken as correct. More modest levels of redundancy - for instance, adding parity check bits to the operands of a computation - can still be very effective, but need to be more carefully designed, so as to ensure that the redundancy conforms appropriately to the particular characteristics of the computation or process involved. The latter challenge is the focus of this book, which has grown out of the author's graduate theses at MIT. The original stimulus for the approach taken here comes from the work of Beckmann and Musicus, developed in Beckmann's 1992 doctoral thesis, also at MIT. That work focused on computations having group structure. The essential idea was to map the group in which the computation occurred to a larger group via a homomorphism, thereby preserving the structure of the computation while introducing the necessary redundancy. Hadjicostis has significantly expanded the setting to processes occurring in more general algebraic and dynamic systems.
For combinational (i.e., memoryless) systems, this book shows how to recognize and exploit system structure in a way that leads to resource-efficient arithmetic coding and "ABFT" (algorithm-based fault-tolerant) schemes, and characterizes separate (parity-type) codes. These results are then extended to dynamic systems, providing a unified system-theoretic framework that makes connections with traditional error-correcting methodologies for communication systems, allows coding techniques to be studied in conjunction with the dynamics of the process that is being protected, and enables the development of fault-tolerance techniques that can account for faults in the error corrector itself. Numerous examples throughout the book illustrate how the framework and methodology translate to particular situations of interest, providing a parametrization of the range of possibilities for redundant implementation, and allowing one to examine features of and trade-offs among different possibilities and realizations. The book responds to the growing need to handle faults in complex digital chips and complex networked systems, and to consider the effects of faults at the design stage rather than afterwards. I believe that the approach taken by the author points the way to addressing such needs in a systematic and fruitful fashion. The material here should be of interest to both researchers and practitioners in the area of fault tolerance.

George Verghese
Massachusetts Institute of Technology

Preface

As the complexity of systems and networks grows, the likelihood of faults in certain components or communication links increases significantly and the consequences become highly unpredictable and severe.
Even within a single digital device, the reduction of voltages and capacitances, the shrinking of transistor sizes and the sheer number of gates involved have led to a significant increase in the frequency of so-called "soft errors," and have prompted leading semiconductor manufacturers to admit that they may be facing difficult challenges in the future. The occurrence of faults becomes a major concern when the systems involved are life-critical (such as military, transportation or medical systems), or operate in remote or inaccessible environments (where repair may be difficult or even impossible). A fault-tolerant system is able to tolerate internal faults and preserve desirable overall behavior and output. A necessary condition for a system to be fault-tolerant is that it exhibit redundancy, which enables it to distinguish between correct and incorrect results or between valid and invalid states. Redundancy is expensive and runs counter to the traditional notion of system design; thus, the success of a fault-tolerance design relies on making efficient use of hardware by adding redundancy in those parts of the system that are more liable to faults than others. Traditionally, the design of fault-tolerant systems has considered two quite distinct fault models: one model constructs reliable systems out of unreliable components (all of which may suffer faults with a certain probability), whereas the other model focuses on detecting and correcting a fixed number of faults (aiming at minimizing the required hardware). This book addresses both of these fault models and describes coding approaches that can be used to exploit the algorithmic/evolutionary structure in a particular combinational or dynamic system in order to avoid excessive use of redundancy. The book has grown out of thesis work at the Massachusetts Institute of Technology and research at the University of Illinois at Urbana-Champaign.
Chapters 2 and 3 describe coding approaches for designing fault-tolerant combinational systems, i.e., systems with no internal memory that perform a static function evaluation on their inputs. Chapter 2 reviews von Neumann's work on "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components," which is one of the first systematic approaches to fault tolerance. Subsequent related results on combinational circuits that are constructed as interconnections of unreliable ("noisy") gates are also discussed. In these approaches, a combinational system is built out of components (e.g., gates) that suffer transient faults with constant probability; the goal is to assemble these unreliable components in a way that introduces "structured" redundancy and ensures that, with high probability, the overall functionality is the correct one. Chapter 3 describes a distinctly different approach to fault tolerance, which aims at protecting a given combinational system against a pre-specified number of component faults. Such designs become more dominant once system components are fairly reliable; they generally aim at using a minimal amount of structured redundancy to achieve detection and correction of a pre-specified number of faults. As explained in Chapter 3, coding techniques are particularly successful for arithmetic and linear operations; extensions of these techniques to operations with group or semigroup structure are also discussed. The remainder of the book focuses on fault tolerance in dynamic systems, such as finite-state controllers or computer simulations, whose internal state influences their future behavior. Modular redundancy (system replication) and other traditional techniques for fault tolerance are expensive, and rely heavily - particularly in the case of dynamic systems operating over extended time horizons - on the assumption that the error-correcting mechanism does not fail.
The book describes a systematic methodology for adding structured redundancy to a dynamic system, exposing a wide range of possibilities between no redundancy and full replication. These possibilities can be parameterized in various settings, including algebraic machines (Chapter 4) and linear dynamic systems (Chapters 5 and 6). By adopting specific fault models and, in some cases, by making explicit connections with hardware implementations, the exposition in these chapters describes resource-efficient designs for redundant dynamic systems. Optimization criteria for choosing among different redundant implementations are not explicitly addressed; several examples, however, illustrate how such criteria can be posed and investigated. Chapter 7 relaxes the traditional assumption that the error-correcting mechanism does not fail. The basic idea is to use a distributed error-correcting mechanism so that the effects of faults are dispersed within the redundant system in a non-devastating fashion. As discussed in Chapter 7, one can employ these techniques to obtain a variant of modular redundancy that uses unreliable system replicas and unreliable voters to construct redundant dynamic systems that evolve in time with a low probability of failure. By combining these techniques with low-complexity error-correcting coding, one can efficiently protect identical unreliable linear finite-state machines that operate in parallel on distinct input sequences. The approach requires only a constant amount of redundant hardware per machine to achieve a probability of failure that remains below any pre-specified bound over any given finite time interval. Chapter 8 applies coding techniques in other contexts. In particular, it presents a methodology for diagnosing faults in discrete event systems that are described by Petri net models.
The method is based on embedding the given Petri net model in a larger Petri net that retains the functionality and properties of the given one, while introducing redundancy in a way that facilitates error detection and identification. Chapter 9 concludes with a look into emerging research directions in the areas of fault tolerance, reliable system design and fault diagnosis. Unlike traditional methodologies, which add error detecting and correcting capabilities on top of existing, non-redundant systems, the methodology developed in this book simultaneously considers the design for fault tolerance together with the implementation of a given system. This comprehensive approach to fault tolerance allows the study of a larger class of redundant implementations and can be used to better understand fundamental limitations in terms of system-, coding- and information-theoretic constraints. Future work should also focus on the implications of redundancy for the speed and power efficiency of digital systems, and also on the development of systematic ways to trade off various system parameters of interest, such as redundant hardware, fault coverage, detection/correction complexity and delay.

Christoforos N. Hadjicostis
Urbana, Illinois

Acknowledgments

This book has grown out of research work at the Massachusetts Institute of Technology and the University of Illinois at Urbana-Champaign. There are many colleagues and friends who have been extremely generous with their help and advice during these years, and to whom I am indebted. I am very thankful to many members of the faculty at MIT for their involvement and contribution to my graduate research. In particular, I would like to express my most sincere thanks to George Verghese for his inspiring guidance, and to Alan Oppenheim and Greg Wornell for their support during my tenure at the Digital Signal Processing Group.
Also, the discussions that I had with Sanjoy Mitter, Alex Megretski, Bob Gallager, David Forney and Srinivas Devadas were thought-provoking and helpful in defining my research direction; I am very thankful to all of them. I am also grateful to many members of the faculty at UIUC for their warm support during these first few years. In particular, I would like to thank Steve Kang and Dick Blahut, who served as heads of the Department of Electrical and Computer Engineering, Ravi Iyer, the director of the Coordinated Science Laboratory, and Tamer Başar, the director of the Decision and Control Laboratory, whose advice and direction have been a tremendous motivation for writing this book. I would also like to thank my many friends and colleagues who made academic life at MIT and at UIUC both enjoyable and productive. Special thanks go to Carl Livadas, Babis Papadopoulos and John Apostolopoulos, who were a great source of advice during my graduate studies. At UIUC, Andy Singer, Francesco Bullo and Petros Voulgaris were encouraging and always willing to help in any way they could. Becky Lonberger, Francie Bridges, Darla Chupp, Vivian Mizuno, Maggie Beucler, Janice Zaganjori and Sally Bemus made life a lot simpler by meticulously taking care of administrative matters. I would also like to thank Eleftheria Athanasopoulou, Boon Pang Lim and Yingquan Wu for proof-reading portions of this book. I am very grateful to many research agencies and companies that have supported my work as a graduate student and as a research professor.
These include the Defense Advanced Research Projects Agency for support under the Rapid Prototyping of Application Specific Signal Processors project, the Electric Power Research Institute and the Department of Defense for support under the Complex Interactive Networks/Systems Initiative, the National Science Foundation for support under the Information Technology Research and Career programs, the Air Force Office of Scientific Research for support under their University Research Initiative, the UIUC Campus Research Board, the National Semiconductor Corporation, the Grass Instrument Company and Motorola. Finally, I am extremely thankful to Jennifer Evans and Kluwer Academic Publishers for encouraging me to make these ideas more widely available through the publication of this book.

Chapter 1

INTRODUCTION

1 DEFINITIONS, MOTIVATION AND BACKGROUND

Modern digital systems are subject to a variety of potential faults that can corrupt their output and degrade their performance [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz, 1998]. In this context, a fault is a deviation of a given system from its required or expected behavior. The more complex a computational system is, or the longer an algorithm runs, the higher the risk of a hardware malfunction that renders the overall functionality of the system useless. Depending on the duration of faults, two broad classes are defined [Johnson, 1989]: (i) Permanent faults manifest themselves in a consistent manner and include design or software errors, manufacturing defects, or irreversible physical damage. (ii) Transient faults do not appear on a consistent basis and only manifest themselves in a certain portion of system invocations; transient faults could be due to noise, such as absorption of alpha particles and electromagnetic interference, or environmental factors, such as overheating. An error is the manifestation of a fault and may lead to an overall failure in the system [Johnson, 1989].
A fault-tolerant system is one that tolerates internal faults and prevents them from unacceptably corrupting its overall task, output or final result [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz, 1998]. Concurrent error masking, that is, detection and correction of errors concurrently with system operation, is one of the most desirable forms of fault tolerance because no degradation in the overall performance of the system takes place; at the same time, however, concurrent error masking usually implies a large overhead in terms of error-detecting and correcting operations. Fault tolerance is motivated primarily by applications that require high reliability (such as medical, military or transportation systems), or by systems that operate in remote locations where repair may be difficult or even impossible (as in the case of space missions, hazardous environments and remote sensors) [Pradhan, 1996; Avizienis, 1997]. In addition, fault tolerance can relax design/manufacturing specifications leading, for example, to yield enhancement in integrated circuits [Koren and Singh, 1990; Peercy and Banerjee, 1993; Leveugle et al., 1994]. As the complexity of computational and signal processing systems increases, their vulnerability to faults becomes higher, making fault tolerance necessary rather than simply desirable [Redinbo, 1987]. The current trends towards higher clock speeds, lower power consumption and smaller transistor sizes aggravate this problem even more and lead to a significant increase in the frequency of so-called "soft errors." For the reasons mentioned above, fault tolerance has been addressed in a variety of settings. The most systematic treatment has been for the case of reliable digital transmissions through unreliable ("noisy") communication links.
Shannon's seminal work in [Shannon, 1948a; Shannon, 1948b] demonstrated that error-correcting coding techniques can effectively and efficiently protect against noise in digital communication systems. More specifically, it showed that, contrary to the common perception of that time, the employment of coding techniques can enable reliable transmission of digital messages using only a constant amount of redundancy per bit. This result led to the birth of information and coding theory [Gallager, 1968; Cover and Thomas, 1999; Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]. Following the success of error-correcting coding in digital communication systems, Shannon and other researchers applied similar techniques to protect digital circuits against hardware faults (see for example [Elias, 1958; Winograd and Cowan, 1963; Taylor, 1968b; Larsen and Reed, 1972] and the exposition in [Rao and Fujiwara, 1989]). More recently, related techniques were applied at a higher level to protect special-purpose systems against a fixed number of "functional" faults, which could be hardware, software or other. These ideas were introduced within the context of algorithm-based fault tolerance [Huang and Abraham, 1984; Beckmann and Musicus, 1993; Roy-Chowdhury and Banerjee, 1996]. The development of an appropriate fault model is a significant aspect of all designs for fault tolerance. The fault model describes the consequences of each fault on the state or output of a system, effectively abstracting the cause of a fault and allowing the mathematical study of fault tolerance. For example, in Shannon's work the effect of "noise" in a digital communication channel is captured by the probability that a particular bit gets transmitted erroneously (i.e., its binary value is flipped). Similarly, the corruption of a single bit in the digital representation of the output/state of a system is commonly used to model the effect of faults in digital systems. 
Note that the fault model does not have to mimic the actual fault mechanism; for example, one can model the error due to a fault in a multiplier as additive or the error due to a fault in an adder as multiplicative.¹ Efficient fault models need to be close to reality, yet simple enough to allow algebraic or algorithmic manipulation. If a single hardware fault manifests itself in an unmanageable number of errors in the analytical representation, then the corresponding error detection/correction scheme will be unnecessarily complicated. This book focuses mostly on fault tolerance in combinational systems (Chapters 2 and 3) and dynamic systems (Chapters 4-7). The distinction between combinational and dynamic systems is that the latter evolve in time according to their internal state (memory), whereas the former have no internal state and no evolution with respect to time.

DEFINITION 1.1 A combinational system C performs a function evaluation on its inputs x1, x2, ..., xu. More specifically, the output r of the combinational system depends only on the inputs provided, i.e., it is described by a function λC as

r = λC(x1, x2, ..., xu) .

Examples of combinational systems include adders, arithmetic logic units, and special purpose systems for various signal processing computations. The book focuses on protecting such systems against faults that corrupt the output of the system (i.e., faults that produce an incorrect result but do not cause the system to hang or behave in some other unpredictable way).

DEFINITION 1.2 A dynamic system S evolves in time according to some internal state. More specifically, the state of the system at time step t, denoted by qs[t], together with the input at time step t, denoted by x[t], completely determine the system's next state according to a state evolution equation

qs[t+1] = δS(qs[t], x[t]) .

The output y[t] of the system at time step t is based on the corresponding state and input, and is captured by the output equation

y[t] = λS(qs[t], x[t]) .
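As a concrete (and entirely illustrative) reading of Definitions 1.1 and 1.2, the sketch below contrasts a combinational map, whose output depends only on its current inputs, with a dynamic system driven by a state evolution equation; the particular functions and names are our own assumptions, not taken from the book.

```python
def combinational(x1, x2):
    """Combinational system: r = lambdaC(x1, x2); no internal state."""
    return x1 ^ x2  # an XOR gate serving as the function lambdaC

class DynamicSystem:
    """Dynamic system: qs[t] and x[t] together determine qs[t+1] and y[t]."""

    def __init__(self, q0):
        self.q = q0  # internal state qs[t]

    def step(self, x):
        y = (self.q + x) % 2       # output equation: y[t] = lambdaS(qs[t], x[t])
        self.q = (self.q + x) % 4  # state evolution: qs[t+1] = deltaS(qs[t], x[t])
        return y
```

Driving the dynamic system with the input sequence 1, 1, 1 from initial state 0 produces the outputs 1, 0, 1: the same input value yields different outputs because the internal state evolves, which is exactly what separates Definition 1.2 from Definition 1.1.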
Examples of dynamic systems include finite-state machines, digital filters, convolutional encoders, and more generally algorithms or simulations running on a computer architecture over several time steps. When discussing fault tolerance in dynamic systems, the book focuses on faults that cause an unreliable dynamic system to take a transition to an incorrect state. Depending on the underlying system and its actual implementation, these faults can be permanent or transient, and hardware or software. Due to the nature of dynamic systems, the effects of a state transition fault may last over several time steps; in addition, state corruption at a particular time step generally leads to the corruption of the overall behavior and output at future time steps. Note that faults in the output mechanism of a dynamic system can be treated like faults in a combinational system as long as the representation of the state is correct. For this reason, when discussing fault tolerance in dynamic systems, the book focuses on protecting against state transition faults.

2 FAULT-TOLERANT COMBINATIONAL SYSTEMS

A necessary condition for a system to be fault-tolerant is that it exhibits redundancy. "Structured" redundancy (that is, redundancy that has been intentionally introduced in some systematic way) allows a combinational system to distinguish between valid and invalid results and, if possible, identify the error and perform the necessary error-correcting procedures. Structured redundancy can also be used to guarantee acceptably degraded performance despite faults. A well-designed fault-tolerant system makes efficient use of resources by adding redundancy in those parts of the system that are more liable to faults than others. The traditional way of designing combinational systems that cope with hardware faults is the use of N-modular hardware redundancy [von Neumann, 1956].
By replicating the original system N times, one performs the desired calculation multiple times in parallel. The final result is chosen based on what the majority of the system replicas agree upon. For example, in the triple modular redundancy (TMR) scheme of Figure 1.1, if all three modules agree on a result, then the voter outputs that result; if only two of the modules agree, then the voter outputs that result and declares the third module faulty; if all modules disagree, then the voter flags an error. When using N-modular redundancy with majority voting, one can correct faults in c different systems if N ≥ 2c + 1. If the modules are self-checking (that is, if they have the ability to detect and flag internal errors), then one can detect up to N and correct up to N - 1 errors. An implicit assumption in the above discussion is that the voter is fault-free. A number of commercial and other systems have used modular redundancy schemes [Avizienis et al., 1971; Harper et al., 1988]; several examples can be found in [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz, 1998]. Modular redundancy schemes have been the primary methodology in designs for fault tolerance because they decouple system design from fault tolerance design. Modular redundancy, however, is inherently expensive due to system replication; for this reason, a variety of hybrid methods have evolved, involving hierarchical levels of modular redundancy that only replicate the parts of the system that are more vulnerable to faults. When time delay is not an issue, a popular alternative is N-modular time redundancy, where one uses the same hardware to repeat a calculation N times. If only transient faults take place, this approach has the same effect as N-modular hardware redundancy.

Figure 1.1. Triple modular redundancy.
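The TMR voting rule just described can be sketched in a few lines; this is our own illustrative fragment (the book is concerned with hardware voters), assuming the three module outputs can be compared directly:

```python
def tmr_vote(out1, out2, out3):
    """Majority voter for triple modular redundancy (TMR).

    Returns (final_output, flag): flag is None when all modules agree,
    names the outvoted module when exactly two agree, and reports an
    uncorrectable error when all three disagree.
    """
    if out1 == out2 == out3:
        return out1, None
    if out1 == out2:
        return out1, "module 3 faulty"
    if out1 == out3:
        return out1, "module 2 faulty"
    if out2 == out3:
        return out2, "module 1 faulty"
    return None, "uncorrectable error"
```

For example, `tmr_vote(5, 5, 7)` returns `(5, "module 3 faulty")`: a single faulty replica is outvoted, matching the N ≥ 2c + 1 condition with N = 3 and c = 1. The voter itself, as noted above, is assumed fault-free.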
The success of coding techniques in digital communication systems prompted many researchers to investigate alternative ways for achieving resource-efficient fault tolerance in computational systems. Not surprisingly, these techniques have been successful in protecting digital storage devices, such as random access memory chips and hard drives, "chip-kill" and RAID (Redundant Array of Inexpensive Disks) being perhaps the most successful examples [Patterson et al., 1988]. However, in systems that also involve some simple processing on the data (e.g., Boolean circuits or arithmetic units), the application of such coding ideas becomes far more challenging. The general model of these fault-tolerance schemes consists of multiple interdependent stages, as illustrated in Figure 1.2. These stages include the encoder, the redundant computational unit, the error detector/corrector, and the decoder. Redundancy is incorporated by encoding the operands and by ensuring that the redundant computational unit involves extra outputs that only arise when faults occur. The error detector examines the output of the redundant computational unit and decides whether it is valid or not. Finally, the decoder maps the corrected result back to its non-redundant form. In many cases, there are large overlaps between several of the subsystems shown in Figure 1.2. The model, however, illustrates the basic idea in the design of fault-tolerant systems: at the point where the fault takes place, the representation of the result involves redundancy and enables one to detect and/or correct the corresponding errors. Usually, faults are only allowed in the redundant computational unit and (sometimes) in the encoder; the error corrector and the decoder are commonly assumed to be fault-free.
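To make these stages concrete, the sketch below walks an integer addition through encoder, redundant computation, error detection and decoding using an aN arithmetic code (each operand x is encoded as a·x, so every valid result must be a multiple of a); aN codes are among the schemes discussed in Chapter 3, but the choice a = 3 and the injected fault here are purely our own illustration.

```python
A = 3  # code parameter: valid (fault-free) results are exactly the multiples of A

def encode(x):
    return A * x        # encoder: introduce redundancy into the operand

def redundant_add(cx, cy):
    return cx + cy      # redundant computational unit
                        # (addition maps multiples of A to multiples of A)

def detect(cz):
    return cz % A == 0  # error detector: the result is valid iff A divides it

def decode(cz):
    return cz // A      # decoder: map back to the non-redundant form

# Fault-free run: 12 + 30 survives the round trip.
cz = redundant_add(encode(12), encode(30))
assert detect(cz) and decode(cz) == 42

# A fault in the redundant computational unit (modeled here as an additive
# error of 1) leaves a result that is not a multiple of A, so it is detected.
faulty = cz + 1
assert not detect(faulty)
```

Note how the redundancy lives exactly where the fault occurs: the fault corrupts the encoded result, and it is the encoded representation that makes the corruption visible.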
As pointed out in [Pippenger, 1990; Avizienis, 1997], there have traditionally been two different philosophies for dealing with faults in combinational systems: one focuses on constructing reliable systems out of unreliable components and the other focuses on detecting and correcting an a priori fixed number of faults while minimizing the required hardware overhead.

Figure 1.2. Fault-tolerant combinational system: encoder, redundant computational unit, and error detector/corrector stages (with fault arrows indicating where faults may occur), followed by the final output.

The underlying assumptions in each approach are quite distinct: in the former approach all components suffer faults with a certain probability, whereas in the latter approach the number of faults is fixed. Given enough redundancy, the latter assumption essentially allows parts of the system to be assumed fault-free. The next two sections describe these two approaches in the context of fault-tolerant combinational systems.

2.1 RELIABLE COMBINATIONAL SYSTEMS

One approach towards fault tolerance is the construction of fault-tolerant systems out of unreliable components, i.e., components that fail independently with some nonzero probability. The goal of these designs is to assemble the unreliable components in a way that produces a reliable overall system, that is, a system that performs as desired with high probability. As one adds redundancy into the fault-tolerant system, the probability with which components fail remains constant. Thus, the larger the system, the more faults it has to tolerate on average, but the more flexibility one has in using structured redundancy to ensure that, with high probability, the redundant system will have the desirable behavior.
Work in this direction started with von Neumann [von Neumann, 1956] and was continued by many others, mostly in the context of fault-tolerant Boolean circuits [Winograd and Cowan, 1963; Taylor, 1968b; Gács, 1986; Hajek and Weller, 1991; Evans, 1994; Evans and Pippenger, 1998]. This approach is described in Chapter 2.

2.2 MINIMIZING REDUNDANT HARDWARE

The second approach towards fault tolerance aims at guaranteeing the detection and correction of a fixed number of faults. It closely follows the general model in Figure 1.2 and usually requires that the error-correcting and decoding stages be fault-free. In this particular context, the latter assumption seems to be inevitable because, regardless of how much redundancy is added, a single fault in the very last stage of the system will result in an erroneous output. The TMR system of Figure 1.1 is perhaps the most common example that falls in this category of designs for fault tolerance. It protects against a single hardware fault in any one system replica but not in the voter. Numerous other redundant systems have also been implemented with the capability to detect/correct single faults assuming that error detection/correction is fault-free. The basic premise behind these designs is that the error-correcting mechanism is much simpler than the actual system implementation and that faults are rare; thus, it is reasonable to assume that the error corrector is fault-free and to aim at protecting against a fixed number of faults (for example, if faults are independent and occur with probability p_f ≪ 1, then the probability of two simultaneous faults is of the order of p_f^2, which is very small compared to p_f). Once the validity of the two assumptions above is established, designs for fault tolerance can focus their attention on adding a minimal amount of redundancy in order to detect/correct a pre-specified number of faults in the redundant computational unit.
This approach has been particularly successful when features of a computation or an algorithm can be exploited in order to introduce "structured" redundancy in a way that offers more efficient fault coverage than modular redundancy. Work in this direction includes arithmetic coding schemes, algorithm-based fault tolerance and algebraic techniques, all of which are described in more detail in Chapter 3. Related applications range from arithmetic circuits [Rao, 1974], to 2-D systolic arrays for parallel matrix multiplication [Huang and Abraham, 1984; Jou and Abraham, 1986], fault-tolerant sorting networks [Choi and Malek, 1988; Liang and Kuo, 1990; Sun et al., 1994], and convolution using the fast Fourier transform [Beckmann and Musicus, 1993].

3 FAULT-TOLERANT DYNAMIC SYSTEMS

Traditionally, fault tolerance in dynamic systems has been based on variations of modular redundancy. The technique uses several replicas of the original, unreliable dynamic system, each initialized at the same state and supplied with the same input sequence. Each replica goes through the same sequence of states, unless a fault in its state transition mechanism causes a deviation from the correct behavior. If the majority of the system replicas are in the correct state at a given time step, an external voting mechanism will be able to decide what the correct state is using a majority voting rule; the output can then be computed based on this error-free state. To understand the severity of state transition faults, consider the following scenario: assume that an unreliable dynamic system is subject to transient faults and that the probability of taking an incorrect state transition (on any input at any given time step) is p_s. If faults between different time steps are independent, then the probability that the system follows the correct state trajectory for L consecutive time steps is (1 − p_s)^L and goes to zero exponentially with L.
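The decay is easy to quantify. With a per-step fault probability of just 1% (a value chosen purely for illustration):

```python
p_s = 0.01  # illustrative per-step probability of an incorrect transition

for L in (10, 100, 1000):
    # All L transitions must be correct for the trajectory to be correct.
    print(L, round((1 - p_s) ** L, 4))   # 10 0.9044, 100 0.366, 1000 0.0
```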
In general, the probability of ending up in the correct state after L steps is also low,2 which means that the output of the system at time step L will be erroneous with high probability (because it is calculated based on an erroneous state). Therefore, the first priority in the design of a fault-tolerant dynamic system should be to ensure that the system follows the correct state trajectory. There are several subtle issues that arise when using modular redundancy schemes in the context of dynamic systems [Hadjicostis, 1999]. For instance, in the example above, the use of majority voting at the end of L time steps may be highly unsuccessful. The problem is that after a system replica operates for L time steps, the probability that it has followed the correct sequence of states is only (1 − p_s)^L. Moreover, at time step L, system replicas may be in incorrect states with probabilities that are prohibitively high for a voter to reliably decide what the correct state is. (An extreme example would be the case when an incorrect state is more likely to be reached than the correct one; this would make it impossible for a voter to decide what the correct state is, regardless of the number of system replicas that are used!) A possible solution to this problem is to correct the state of the system replicas at the end of each time step, as shown in Figure 1.3. In this arrangement, the state agreed upon by the majority of the systems is fed back to all systems to reset them to the "correct" state. One does not necessarily have to feed back the correct state at the end of each time step; if a correction is to be fed back once every T steps, however, one needs to ensure that (1 − p_s)^T does not become too small. Another possible way of addressing the above problem is to let the systems evolve for several time steps and then perform error correction using a mechanism that is more complicated than a simple voter.
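Under the simplifying assumptions of a fault-free voter and independent replica faults, the benefit of per-step correction can be computed in closed form. The model below is a back-of-the-envelope lower bound (it demands that a majority of replicas survive every voting interval); it is not the book's analysis:

```python
from math import comb

def p_correct(L, p_s, n=3, vote_every=1):
    """Lower bound on the probability that the voted state is correct after L
    steps: a fault-free voter resets all n replicas every `vote_every` steps,
    and a majority of replicas must survive each interval between resets."""
    q = (1 - p_s) ** vote_every          # one replica survives an interval
    majority = sum(comb(n, k) * q**k * (1 - q)**(n - k)
                   for k in range(n // 2 + 1, n + 1))
    return majority ** (L // vote_every)

p_s, L = 0.01, 1000
print(p_correct(L, p_s))     # voting every step: about 0.74
print((1 - p_s) ** L)        # single unvoted replica: about 4e-5
```

Voting only every T = 10 steps (`vote_every=10`) already degrades the bound substantially, which is the point made above about (1 − p_s)^T.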
For example, one could look at the overall state evolution (not just the final states) of each system replica and then make an educated decision about what the correct state sequence is. One concern about this approach is that, by allowing the system to evolve incorrectly for several time steps, system performance could be compromised in the intervals between error correction. A bigger concern is that the complexity of the error-correcting mechanism may increase, resulting in an unmanageable number of errors in the correcting mechanism itself. The concurrent error correction approach in Figure 1.3 has two major drawbacks:

1. System replication may be unnecessarily expensive. In order to avoid replication, one can employ a redundant implementation, i.e., a version of the dynamic system which is redundant and follows a restricted state evolution [Hadjicostis, 1999]. Faults violate the imposed restrictions, which enables an external mechanism to perform error detection and correction. Redundant implementations range from no redundancy to full replication and provide the means to characterize and parameterize constructions of fault-tolerant dynamic systems. The book discusses redundant implementations in various settings, including algebraic machines (Chapter 4), linear time-invariant dynamic systems (Chapter 5) and linear finite-state machines (Chapter 6).

2. The scheme relies heavily on the assumption that the voter is fault-free.

Figure 1.3. Triple modular redundancy with correcting feedback: the state agreed upon by the majority is fed back to reset all replicas to the "corrected" state.
If the voter also fails independently between time steps (i.e., if the voter outputs a state that, with probability p_v, is different from the state agreed upon by the majority of the systems), one is faced with another problem: after L time steps the probability that the modular redundancy scheme performs correctly is at best (1 − p_v)^L (ignoring the probability that a fault in the voter may accidentally result in feeding back the correct state in cases where most systems are in an incorrect state). Similarly, the probability that the majority of the replicas are in the correct state after L time steps is also very low. Therefore, if voters are not reliable, there appears to be a limit on the number of time steps for which one can guarantee reliable evolution using a simple replication scheme. What is more alarming is that faults in the voting mechanism become more significant as one increases the number of time steps for which the fault-tolerant dynamic system operates. Even if p_v is significantly smaller than p_s (e.g., because the dynamic system is more complex than the voter), the probability that the modular redundancy scheme performs correctly is bounded above by (1 − p_v)^L and can become unacceptably small for a large L. In order to deal with faults in the error-correcting mechanism, one can use distributed error correction, so that the effects of faults in individual components of the error-correcting mechanism do not corrupt the overall system state. The trade-offs involved in such schemes are discussed in Chapter 7.

Figure 1.4. Fault-tolerant dynamic system: the input xs[t] is encoded and drives a redundant implementation whose state is concurrently checked, corrected and decoded.

3.1 REDUNDANT IMPLEMENTATIONS

In order to avoid replication when constructing fault-tolerant dynamic systems, one can replace the original system with a larger, redundant system that preserves the state, evolution and properties of the original system in some encoded form.
An external mechanism can then perform error detection and correction by identifying and analyzing violations of the restrictions on the set of states that are allowed in this larger dynamic system. The larger dynamic system is called a redundant implementation and is part of the overall fault-tolerant structure shown in Figure 1.4: the input to the redundant implementation at time step t, denoted by e(xs[t]), is an encoded version of the input xs[t] to the original system; furthermore, at any given time step t, the state qs[t] of the original dynamic system can be recovered concurrently from the corresponding state qh[t] of the redundant system through a decoding mapping ℓ [i.e., qs[t] = ℓ(qh[t])]. Note that the error detection/correction procedure is input-independent, so that the next-state function is not evaluated in the error corrector. The following definition formalizes the notion of a redundant implementation for a dynamic system [Hadjicostis, 1999]. Note that the definition is independent of the error-detecting or correcting scheme.

DEFINITION 1.3 Let S be a dynamic system with state set Qs, input set Xs, initial state qs[0] and state evolution

qs[t + 1] = δs(qs[t], xs[t]),

where qs[·] ∈ Qs, xs[·] ∈ Xs and δs is the next-state function. Let H be a dynamic system with state set Qh, input set Xh, initial state qh[0] and state evolution equation

qh[t + 1] = δh(qh[t], xh[t]),

where qh[·] ∈ Qh, xh[·] ∈ Xh and δh is the next-state function. System H is a redundant implementation for S if there exist (i) an injective input encoding mapping e : Xs → Xh, and (ii) a one-to-one state decoding mapping ℓ such that, for all input sequences,

ℓ(qh[t]) = qs[t] for all t ≥ 0.

The set Qh^v, defined as Qh^v = ℓ^(-1)(Qs) = {ℓ^(-1)(qs[·]) | qs[·] ∈ Qs}, is called the subset of valid states in H.

If the following two conditions are satisfied for all qs[·] ∈ Qs and all xs[·] ∈ Xs,

ℓ(qh[0]) = qs[0],
ℓ(δh(ℓ^(-1)(qs[t]), e(xs[t]))) = δs(qs[t], xs[t]),

then the state of S at all time steps t ≥ 0 can be recovered from the state of H through the decoding mapping ℓ (under fault-free conditions at least); this can be proved by induction on the number of time steps. Knowledge of the restrictions on the subset of valid states Qh^v allows the external error detecting/correcting mechanism to handle faults. Any faults that cause transitions to invalid states (i.e., states outside the subset Qh^v) will be detected and, if possible, corrected. Assuming no faults in the error corrector and no uncorrectable faults in the state transition mechanism, the redundant implementation will then be able to concurrently simulate the operation of the original dynamic system. One then aims at using a minimal amount of redundancy to construct redundant implementations that are appropriate for protecting the given dynamic system against a pre-specified number of faults. As shown in Chapters 4-6, this general approach can be used to parameterize different redundant implementations in various settings and to make connections with hardware by developing appropriate fault models. Note that the definition of a redundant implementation does not specify next-state transitions when the redundant system is in a state outside the set of valid states (this issue becomes important when the error detector/corrector is not fault-free or when the error-correcting mechanism is combined with the state transition mechanism [Larsen and Reed, 1972; Wang and Redinbo, 1984]).
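The notion of valid states can be made concrete with a small linear finite-state machine over GF(2), the setting of Chapter 6. The redundant machine below appends one parity bit to the state; the valid states are exactly those satisfying the parity constraint, and any single state-bit flip lands outside them. The matrices, and the shortcut of recomputing the check bit from the next state, are illustrative only (Chapter 6 derives genuinely redundant dynamics):

```python
# Original machine S over GF(2): s[t+1] = A s[t] + B x[t] (mod 2).
A = [[0, 1, 0],
     [0, 0, 1],
     [1, 1, 0]]          # illustrative 3-bit machine
B = [[1], [0], [1]]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) % 2 for row in M]

def xor(u, v):
    return [(a + b) % 2 for a, b in zip(u, v)]

def step(s, x):          # next-state function of S
    return xor(matvec(A, s), matvec(B, x))

def step_redundant(q, x):
    """Redundant machine H: state q = (s, one parity bit).  On the fault-free
    trajectory the check bit always equals parity(s): these are the valid
    states in the sense of Definition 1.3."""
    s_next = step(q[:-1], x)
    return s_next + [sum(s_next) % 2]

def valid(q):            # error detector: is q in the valid subset?
    return sum(q[:-1]) % 2 == q[-1]

def decode(q):           # decoding mapping: drop the check bit
    return q[:-1]

s, q = [1, 0, 1], [1, 0, 1, 0]           # parity of [1, 0, 1] is 0
for x in ([1], [0], [1], [1]):
    s, q = step(s, x), step_redundant(q, x)
    assert decode(q) == s and valid(q)   # H tracks S on valid states

q[1] ^= 1                                # a fault flips one state bit...
assert not valid(q)                      # ...violating the parity constraint
```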
Due to this flexibility, there are multiple different redundant implementations for a given error detecting/correcting scheme, and in many cases it may be possible to systematically characterize and exploit this flexibility (e.g., to minimize hardware or to perform error detection/correction periodically).

3.2 FAULTS IN THE ERROR-CORRECTING MECHANISM

Unlike the situation in combinational systems, fault tolerance in dynamic systems requires consideration of error propagation. The problem is that a fault causing a transition to an incorrect next state at a particular time step will not only affect the output at that particular time step (which may be an unavoidable possibility given that one uses fault-prone elements), but will also affect the state and output of the system at later times. In addition, the problem of error propagation intensifies as one increases the number of time steps for which the dynamic system operates. In contrast, faults in a combinational system (as well as faults in the hardware implementation of the output function of a dynamic system) only affect the output at a particular time step but have no after-effects on the future performance of the system. Specifically, they do not intensify as one increases the number of time steps for which the system operates. Chapter 7 describes the handling of transient faults3 in both the next-state transition mechanism and the error detecting/correcting mechanism. The possibility of faults in the error-correcting mechanism implies that one can no longer guarantee that the fault-tolerant system will end up in the right state at the completion of the error-correcting stage. To overcome this problem, one can associate with each state a set of states and ensure that, at any given time step, the fault-tolerant system is, with high probability, within the set of states that represents the actual state [Larsen and Reed, 1972; Wang and Redinbo, 1984; Hadjicostis, 1999].
Employing the above design principle, Chapter 7 analyzes a variant of modular redundancy that uses unreliable system replicas and unreliable voters to construct redundant dynamic systems that evolve reliably for any given finite number of time steps. More specifically, given unreliable system replicas (i.e., dynamic systems that take incorrect state transitions with probability p_s, independently between different time steps) and unreliable voters (that suffer transient faults independently between different time steps with probability p_v), Chapter 7 describes ways to guarantee that the state evolution of a redundant fault-tolerant implementation will be the correct one. This method ensures that, with high probability, the fault-tolerant system will go through a sequence of states that correctly represents the error-free state sequence (i.e., the state of the redundant system at each time step is within a set of states that corresponds to the state the fault-free system would be in). It is shown that, under this very general approach, there is a logarithmic trade-off between the number of time steps and the amount of redundancy that is needed to achieve a given probability of failure [Hadjicostis, 2000]. For the special case of linear finite-state machines, one can combine the above techniques with low-complexity error-correcting codes to make more efficient use of redundancy. More specifically, one can obtain interconnections of identical linear finite-state machines that operate in parallel on distinct input sequences and use a constant amount of hardware per machine to achieve a desired probability of failure (for the given number of time steps) [Hadjicostis and Verghese, 1999a].
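The logarithmic flavor of this trade-off can be seen numerically in a simplified setting with reliable voting: if one demands, via a union bound, that a majority of n replicas take the correct transition at every one of L steps, the exponential decay of the majority-failure probability in n makes the required n grow roughly like log L. All numbers below are illustrative, and the model is far cruder than the analysis in Chapter 7:

```python
from math import comb

def majority_fail(n, p_s):
    """P(a majority of n replicas take an incorrect transition in one step)."""
    return sum(comb(n, k) * p_s**k * (1 - p_s)**(n - k)
               for k in range(n // 2 + 1, n + 1))

def replicas_needed(L, p_s=0.05, delta=1e-6):
    """Smallest odd n with L * majority_fail(n) <= delta (union bound over
    the L steps); decays exponentially in n, so n grows about like log L."""
    n = 1
    while L * majority_fail(n, p_s) > delta:
        n += 2
    return n

for L in (10, 10**3, 10**5, 10**7):
    print(L, replicas_needed(L))   # multiplying L by 100 adds only a few replicas
```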
In other words, by increasing the number of machines that operate in parallel, one can achieve a smaller probability of failure or, equivalently, operate the machines for a longer time interval; the redundancy per machine (including the hardware required in the error-correcting mechanism) remains bounded by a constant. The analysis in Chapter 7 provides a better understanding of the trade-offs involved when designing fault-tolerant systems out of unreliable components. These include constraints on the fault probabilities in the system/corrector, the length of operation and the required amount of redundancy. Furthermore, the analysis effectively demonstrates that the two-stage approach to fault tolerance of Figure 1.4 can be used successfully (and in some cases efficiently) to construct reliable dynamic systems out of unreliable components.

4 CODING TECHNIQUES FOR FAULT DIAGNOSIS

The coding techniques that are studied in this book can also be applied in other contexts. Chapter 8 explores one such direction by employing coding techniques in order to facilitate fault diagnosis in complex discrete event systems (DES's). A diagnoser or a monitoring mechanism operates concurrently with a given DES and is able to detect and identify faults by analyzing available activity and status information. There is a large volume of work on fault diagnosis in dynamic systems and networks, particularly within the systems/control and computer engineering communities.
For example, within the systems and control community, there has been a long-standing interest in fault diagnosis in large-scale dynamic systems, including finite automata [Cieslak et al., 1988; Sampath et al., 1995; Sampath et al., 1998], Petri net models [Silva and Velilla, 1985; Sahraoui et al., 1987; Valette et al., 1989; Cardoso et al., 1995; Hadjicostis and Verghese, 1999b], timed systems [Zad et al., 1999; Pandalai and Holloway, 2000], and communication networks [Bouloutas et al., 1992; Wang and Schwartz, 1993; Park and Chong, 1995]. The goal in all of these approaches is to develop a monitor (diagnoser) that can detect and identify faults from a given, pre-determined set. The usual approach is to locate a set of inherently invariant properties of the system, a subset of which is violated soon after a particular fault takes place. By tracking the activity in the system, one is able to detect violations of such invariant properties (which indicates the presence of a fault) and correlate them with a unique fault in the system (which then constitutes fault identification). The task becomes challenging because of potential observability limitations (in terms of the inputs, states or outputs that are observed [Cieslak et al., 1988]) and various other requirements (such as detection/communication delays [Debouk et al., 2000], sensor allocation limitations [Debouk et al., 1999], distributivity/decentralizability constraints [Aghasaryan et al., 1998; Debouk et al., 1998], or the sheer size of the diagnoser). In Chapter 8, coding techniques are used to design the state evolution of the monitor so that, at any given time step, certain constraints are enforced between its state and the state of the DES. Fault detection and identification is then achieved by analyzing violations of these coding constraints. The approach is very general and can handle a variety of fault models.
There are a number of connections that can be made with the more traditional fault diagnosis techniques mentioned in the previous paragraph; Chapter 8 aims at pointing out some potential connections between coding approaches and fault diagnosis.

Notes

1 The faulty result rf of a multiplier can be written as rf = r + e, where r is the fault-free result (i.e., the result that would have been obtained under no faults) and e is an appropriate real number. Similarly, the faulty result rf of an adder can be written as rf = r × e, where r is the fault-free result and e is an appropriate real number (r ≠ 0).

2 The probability of ending up in the correct state after L steps depends on the dynamic structure of the particular finite-state machine and on whether multiple faults may lead to the correct state. The argument can be made more precise if one chooses a particular implementation for the machine (consider, for example, the linear feedback shift register shown in Figure 6.1 of Chapter 6, with each fault causing a particular bit in the state vector to flip with probability pb).

3 Permanent faults can be handled more efficiently using reconfiguration techniques rather than concurrent error detection and correction. In some sense, permanent faults are easier to deal with than transient faults. For example, when testing for permanent faults in an integrated circuit, it may be reasonable to assume that the testing mechanism (error-detecting mechanism) has been verified to be fault-free. Since such verification only needs to take place once, one can devote large amounts of time and resources in order to test for the absence of permanent faults in this testing/correcting mechanism.

References

Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1998). Fault detection and diagnosis in distributed systems: an approach by partially stochastic Petri nets. Discrete Event Dynamic Systems: Theory and Applications, 8(2):203-231.

Avizienis, A. (1997).
Toward systematic design of fault-tolerant systems. IEEE Computer, 30(4):51-58.

Avizienis, A., Gilley, G. C., Mathur, F. P., Rennels, D. A., Rohr, J. A., and Rubin, D. K. (1971). The STAR (self-testing and repairing) computer: An investigation of the theory and practice of fault-tolerant computer design. In Proceedings of the 1st Int. Conf. on Fault-Tolerant Computing, pages 1312-1321.

Beckmann, P. E. and Musicus, B. R. (1993). Fast fault-tolerant digital convolution using a polynomial residue number system. IEEE Transactions on Signal Processing, 41(7):2300-2313.

Blahut, R. E. (1983). Theory and Practice of Data Transmission Codes. Addison-Wesley, Reading, Massachusetts.

Bouloutas, A., Hart, G. W., and Schwartz, M. (1992). Simple finite state fault detectors for communication networks. IEEE Transactions on Communications, 40(3):477-479.

Cardoso, J., Künzle, L. A., and Valette, R. (1995). Petri net based reasoning for the diagnosis of dynamic discrete event systems. In Proceedings of IFSA '95, the 6th Int. Fuzzy Systems Association World Congress, pages 333-336.

Choi, Y.-H. and Malek, M. (1988). A fault-tolerant systolic sorter. IEEE Transactions on Computers, 37(5):621-624.

Cieslak, R., Desclaux, C., Fawaz, A. S., and Varaiya, P. (1988). Supervisory control of discrete-event processes with partial observations. IEEE Transactions on Automatic Control, 33(3):249-260.

Cover, T. M. and Thomas, J. A. (1999). Elements of Information Theory. John Wiley & Sons, New York.

Debouk, R., Lafortune, S., and Teneketzis, D. (1998). Coordinated decentralized protocols for failure diagnosis of discrete event systems. In Proceedings of the 37th IEEE Conf. on Decision and Control, pages 3763-3768.

Debouk, R., Lafortune, S., and Teneketzis, D. (1999). On an optimization problem in sensor selection for failure diagnosis. In Proceedings of the 38th IEEE Conf. on Decision and Control, pages 4990-4995.

Debouk, R., Lafortune, S., and Teneketzis, D. (2000).
On the effect of communication delays in failure diagnosis of decentralized discrete event systems. In Proceedings of the 39th IEEE Conf. on Decision and Control, pages 2245-2251.

Elias, P. (1958). Computation in the presence of noise. IBM Journal of Research and Development, 2(10):346-353.

Evans, W. (1994). Information Theory and Noisy Computation. PhD thesis, EECS Department, University of California at Berkeley, Berkeley, California.

Evans, W. and Pippenger, N. (1998). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory, 44(3):1299-1305.

Gács, P. (1986). Reliable computation with cellular automata. Journal of Computer and System Sciences, 32(2):15-78.

Gallager, R. G. (1968). Information Theory and Reliable Communication. John Wiley & Sons, New York.

Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.

Hadjicostis, C. N. (2000). Fault-tolerant dynamic systems. In Proceedings of ISIT 2000, the Int. Symp. on Information Theory, page 444.

Hadjicostis, C. N. and Verghese, G. C. (1999a). Fault-tolerant linear finite state machines. In Proceedings of the 6th IEEE Int. Conf. on Electronics, Circuits and Systems, pages 1085-1088.

Hadjicostis, C. N. and Verghese, G. C. (1999b). Monitoring discrete event systems using Petri net embeddings. In Application and Theory of Petri Nets 1999, number 1639 in Lecture Notes in Computer Science, pages 188-208.

Hajek, B. and Weller, T. (1991). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory, 37(2):388-391.

Harper, R. E., Lala, J. H., and Deyst, J. J. (1988). Fault-tolerant parallel processor architecture review. In Eighteenth Int. Symp. on Fault-Tolerant Computing, Digest of Papers, pages 252-257.

Huang, K.-H. and Abraham, J. A. (1984).
Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528.

Johnson, B. (1989). Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley, Reading, Massachusetts.

Jou, J.-Y. and Abraham, J. A. (1986). Fault-tolerant matrix arithmetic and signal processing on highly concurrent parallel structures. Proceedings of the IEEE, 74(5):732-741.

Koren, I. and Singh, A. D. (1990). Fault-tolerance in VLSI circuits. IEEE Computer, 23(7):73-83.

Larsen, R. W. and Reed, I. S. (1972). Redundancy by coding versus redundancy by replication for failure-tolerant sequential circuits. IEEE Transactions on Computers, 21(2):130-137.

Leveugle, R., Koren, Z., Koren, I., Saucier, G., and Wehn, N. (1994). The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398-1406.

Liang, S. C. and Kuo, S. Y. (1990). Concurrent error detection and correction in real-time systolic sorting arrays. In Proceedings of 20th IEEE Int. Symp. on Fault-Tolerant Computing, pages 434-441. IEEE Computer Society Press.

Pandalai, D. N. and Holloway, L. E. (2000). Template languages for fault monitoring of timed discrete event processes. IEEE Transactions on Automatic Control, 45(5):868-882.

Park, Y. and Chong, E. K. P. (1995). Fault detection and identification in communication networks: a discrete event systems approach. In Proceedings of the 33rd Annual Allerton Conf. on Communication, Control, and Computing, pages 126-135.

Patterson, D. A., Gibson, G., and Katz, R. H. (1988). A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD, pages 109-116.

Peercy, M. and Banerjee, P. (1993). Fault-tolerant VLSI systems. Proceedings of the IEEE, 81(5):745-758.

Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.

Pippenger, N. (1990).
Developments in the synthesis of reliable organisms from unreliable components. In Proceedings of Symposia in Pure Mathematics, volume 50, pages 311-324.

Pradhan, D. K. (1996). Fault-Tolerant Computer System Design. Prentice Hall, Englewood Cliffs, New Jersey.

Rao, T. R. N. (1974). Error Coding for Arithmetic Processors. Academic Press, New York.

Rao, T. R. N. and Fujiwara, E. (1989). Error-Control Coding for Computer Systems. Prentice-Hall, Englewood Cliffs, New Jersey.

Redinbo, G. R. (1987). Signal processing architectures containing distributed fault-tolerance. In Conference Record - Twentieth Asilomar Conf. on Signals, Systems & Computers, pages 711-716.

Roy-Chowdhury, A. and Banerjee, P. (1996). Algorithm-based fault location and recovery for matrix computations on multiprocessor systems. IEEE Transactions on Computers, 45(11):1239-1247.

Sahraoui, A., Atabakhche, H., Courvoisier, M., and Valette, R. (1987). Joining Petri nets and knowledge-based systems for monitoring purposes. In Proceedings of the IEEE Int. Conf. on Robotics and Automation, pages 1160-1165.

Sampath, M., Lafortune, S., and Teneketzis, D. (1998). Active diagnosis of discrete-event systems. IEEE Transactions on Automatic Control, 43(7):908-929.

Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., and Teneketzis, D. (1995). Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40(9):1555-1575.

Shannon, C. E. (1948a). A mathematical theory of communication (Part I). Bell System Technical Journal, 27(7):379-423.

Shannon, C. E. (1948b). A mathematical theory of communication (Part II). Bell System Technical Journal, 27(10):623-656.

Siewiorek, D. and Swarz, R. (1998). Reliable Computer Systems: Design and Evaluation. A. K. Peters.

Silva, M. and Velilla, S. (1985). Error detection and correction in Petri net models of discrete events control systems. In Proceedings of ISCAS 1985, the IEEE Int. Symp.
on Circuits and Systems, pages 921-924.

Sun, J., Cerny, E., and Gecsei, J. (1994). Fault tolerance in a class of sorting networks. IEEE Transactions on Computers, 43(7):827-837.

Taylor, M. G. (1968). Reliable information storage in memories designed from unreliable components. Bell System Technical Journal, 47(10):2299-2337.

Valette, R., Cardoso, J., and Dubois, D. (1989). Monitoring manufacturing systems by means of Petri nets with imprecise markings. In Proceedings of the IEEE Int. Symp. on Intelligent Control, pages 233-238.

von Neumann, J. (1956). Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. Princeton University Press, Princeton, New Jersey.

Wang, C. and Schwartz, M. (1993). Fault detection with multiple observers. IEEE/ACM Transactions on Networking, 1(1):48-55.

Wang, G. X. and Redinbo, G. R. (1984). Probability of state transition errors in a finite state machine containing soft failures. IEEE Transactions on Computers, 33(3):269-277.

Wicker, S. B. (1995). Error Control Systems. Prentice Hall, Englewood Cliffs, New Jersey.

Winograd, S. and Cowan, J. D. (1963). Reliable Computation in the Presence of Noise. MIT Press, Cambridge, Massachusetts.

Zad, S. H., Kwong, R. H., and Wonham, W. M. (1999). Fault diagnosis in timed discrete-event systems. In Proceedings of the 38th IEEE Conference on Decision and Control, pages 1756-1761.

I FAULT-TOLERANT COMBINATIONAL SYSTEMS

Chapter 2 RELIABLE COMBINATIONAL SYSTEMS OUT OF UNRELIABLE COMPONENTS

1 INTRODUCTION

In one of his most influential papers, von Neumann considered the construction of reliable combinational systems out of unreliable components [von Neumann, 1956]. He focused on a class of digital systems that performed computation by using appropriately interconnected voting mechanisms.
More specifically, von Neumann constructed reliable systems out of unreliable 3-bit voters, some of which were used to perform computation and some of which were used as "restoring organs" to achieve error correction. The voters used for computational purposes received inputs that were either primary inputs, constants or outputs from other voters; the voters that functioned as "restoring organs" ideally (i.e., under fault-free conditions) received identical inputs. Von Neumann's fault model assumed that a voter fails by providing an output that differs from the value agreed upon by the majority of its inputs. When voter faults are independent and occur with probability exactly p, von Neumann demonstrated a fault-tolerant construction that is successful if p < 0.0073. In fact, using unreliable 3-input gates (including unreliable 3-bit voters) that fail exactly with probability p, it was later shown in [Hajek and Weller, 1991] that it is possible to construct reliable circuits for computing arbitrary Boolean formulas if and only if p < 1/6. The fraction 1/6 can therefore be seen as the maximum tolerable noise in unreliable 3-input gates. These results were extended to interconnections of u-input gates (for u odd) in [Evans, 1994].

This chapter discusses von Neumann's approach for reliable computation and related extensions. The focus is on reliably unreliable components, i.e., components that fail exactly with a known probability p. Extensions of these results to less restrictive fault models, such as models where each component fails with a probability that is bounded by a known constant p, are not explicitly addressed in this chapter; the interested reader can refer to [Pippenger, 1985; Pippenger, 1990] and references therein.

2 COMPUTATIONAL MODELS FOR COMBINATIONAL SYSTEMS
A u-input Boolean gate computes a Boolean function f : {0,1}^u → {0,1}.
Inputs to gates are Boolean variables that are either primary inputs to the circuit, constants or outputs of other gates. A network or a combinational circuit is a loop-free interconnection of gates such that the output of each gate is the input to other gates (except for the last gate, which provides the final output). A formula is a network in which the output of a gate is an input to at most one gate [Pippenger, 1988; Feder, 1989].

Complex combinational circuits may involve a large number of individual components (gates), all of which belong to the set of available types of Boolean gates. In other words, one assumes that there is a given pool or basis of available prototype gates. The depth and size of such combinational circuits are defined using graph-theoretic nomenclature [Pippenger, 1985; Evans and Schulman, 1993; Evans and Schulman, 1999]:
• The depth of the circuit is the maximum number of gates that can be found in a path that connects a primary input to the final output.
• The size of the circuit is the total number of gates.

An unreliable u-input gate is modeled as a gate that with probability 1 − p (0 ≤ p < 1/2) computes the correct output on its inputs and with probability p it fails, i.e., it produces an incorrect output (its binary value is flipped). Note that these unreliable gates are reliably unreliable in the sense that they fail exactly with probability p [Pippenger, 1990]. A more powerful fault model would be one where each gate fails with a probability that is bounded by p (i.e., the ith gate fails with probability p_i ≤ p). When considering interconnections of unreliable gates, it is assumed that different gates fail independently.

A formula or a network is considered reliable if its final output is correct with a "large" probability for all combinations of primary inputs. This probability is usually simply required to be larger than 1/2
(so that it is more likely that the combinational system will produce the correct output rather than an incorrect output).

3 VON NEUMANN'S APPROACH TO FAULT TOLERANCE
In [von Neumann, 1956] von Neumann constructed reliable combinational circuits out of unreliable 3-input voters. Some of the voters were used for computation and some of them were used as "restoring organs" to achieve error correction. In von Neumann's reliable combinational circuits, an unreliable voter takes three bits (Boolean values) as inputs and under fault-free conditions outputs the value that the majority of them have. With probability (exactly) p, however, the unreliable voter outputs an incorrect value (i.e., it flips a "1" to a "0" and vice-versa). Voters that operate as restoring organs ideally receive identical inputs.

The basic picture is shown in Figure 2.1: the restoring organ receives as inputs the outputs of three replicas of the same combinational system (circuit). These replicas are given the same inputs and operate independently from each other.

Figure 2.1. Error correction using a "restoring organ."

If each of their outputs is erroneous with probability q (q < 1/2), the probability that the majority of the inputs to the voter are incorrect will be given by

θ(q) = Σ_{k=2}^{3} (3 choose k) q^k (1 − q)^{3−k} = 3q² − 2q³ .   (2.1)

Figure 2.2. Plots of the functions f(q) and g(q) for two different values of p.

Since the restoring organ is assumed to fail with probability p, independently from the other systems, the probability that the output of the restoring organ is
erroneous is given by

f(q) = p(1 − θ(q)) + (1 − p)θ(q)
     = p + (1 − 2p)θ(q)
     = p + (1 − 2p)(3q² − 2q³) .   (2.2)

Function f(q) (along with the function g(q) = q) is plotted in Figure 2.2 for two different values of p.

The basic approach in von Neumann's scheme is to successively use restoring organs until the final output reaches an acceptable or desirable level of probability of error. More specifically, one builds a ternary tree whose leaves are hardware-independent replicas of an unreliable combinational circuit and whose internal nodes are 3-input restoring organs. This scheme, illustrated in Figure 2.3 for two levels of restoring organs, has a hardware cost that is exponential in the number of levels in the tree. For example, s levels of restoring organs require 3^s replicas of the combinational system and (3^s − 1)/2 voters.

Figure 2.3. Two successive restoring iterations in von Neumann's construction for fault tolerance.

The final output of an s-level ternary tree of voters is erroneous with a probability that is given by

q* = f(f(f(··· f(q) ···)))   (s iterations),

where the number of successive iterations of the function f(·) is the same as the number of levels in the ternary tree. (The simplicity of the above formula is a direct consequence of the components being reliably unreliable. If this was not the case, then the probability θ in Eq. (2.1) would depend on three variables (e.g., q1, q2, q3) rather than a single one (q) and the discussion would become slightly more complicated.) Repeated iterations of the function f(q) in Eq. (2.2) converge monotonically to a value q*, as follows:

THEOREM 2.1
• If 0 ≤ p < 1/6 and 0 ≤ q < 1/2, then q* satisfies p ≤ q* < 1/2.
• If 1/6 < p < 1/2 and 0 ≤ q < 1/2, then q* = 1/2.
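Theorem 2.1 can be checked numerically by iterating Eq. (2.2). The following is a minimal sketch (the function names are ours, not from the text); q0 is the fixed point that appears later in Eq. (2.3).

```python
from math import sqrt, isclose

def theta(q):
    """Eq. (2.1): probability that a majority of three inputs,
    each independently erroneous with probability q, is erroneous."""
    return 3 * q**2 - 2 * q**3

def f(q, p):
    """Eq. (2.2): error probability at the output of a restoring
    organ whose voter itself fails with probability p."""
    return p + (1 - 2 * p) * theta(q)

def iterate(q, p, s):
    """Error probability after s successive levels of restoring organs."""
    for _ in range(s):
        q = f(q, p)
    return q

# For p < 1/6 the iterations converge to the fixed point
# q0 = (1 - sqrt((1 - 6p)/(1 - 2p)))/2 of Eq. (2.3) ...
p = 0.05
q0 = 0.5 * (1 - sqrt((1 - 6 * p) / (1 - 2 * p)))
assert isclose(iterate(0.4, p, 200), q0, abs_tol=1e-9)

# ... while for 1/6 < p < 1/2 they converge to 1/2: the output is
# then no better than a fair coin flip.
assert isclose(iterate(0.1, 0.25, 200), 0.5, abs_tol=1e-9)
```

The two assertions mirror the two cases of the theorem: a sub-threshold voter noise (p = 0.05) drives the error down to q0 ≈ 0.059, while a super-threshold noise (p = 0.25) drives it up to 1/2.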
Proof: The proof in von Neumann's paper first finds the roots of the function q − f(q). Since q = 1/2 is a solution (by inspection), the other two roots can be found as the roots of the quadratic

(q − f(q)) / (q − 1/2) = 2 ((1 − 2p)q² − (1 − 2p)q + p) .

The two solutions are given by

q = (1/2) (1 ± √((1 − 6p)/(1 − 2p)))

and are complex if p > 1/6. In such a case, the form of f(q) is as shown at the right side of Figure 2.2. For p < 1/6, the two solutions are real and the form of f(q) is as shown at the left side of Figure 2.2. The points of intersection are given by q0 and 1 − q0, where

q0 = (1/2) (1 − √((1 − 6p)/(1 − 2p))) .   (2.3)

The following two cases need to be considered:
1. If 0 ≤ p < 1/6 and 0 ≤ q < 1/2, then the monotonicity and continuity of f(q) imply that successive iterations of f(·) will converge to q0 (because, given 0 ≤ q_i < q0, q_{i+1} = f(q_i) satisfies q_i < q_{i+1} < q0, whereas given q0 < q_i < 1/2, q_{i+1} = f(q_i) satisfies q0 < q_{i+1} < q_i).
2. If 1/6 < p < 1/2 and 0 ≤ q < 1/2, then the monotonicity and continuity of f(q) imply that successive iterations of f(·) will converge to 1/2 (because, given 0 ≤ q_i < 1/2, q_{i+1} = f(q_i) satisfies q_i < q_{i+1} < 1/2).
At this point the proof of the theorem is complete. □

Von Neumann's construction in [von Neumann, 1956] demonstrated that if p ≤ 0.0073, then it is possible to construct reliable combinational systems out of unreliable voters. The constant 0.0073 was the result of particular details of his construction and, as von Neumann himself pointed out, it can be improved. The problem was that the ternary tree in Figure 2.3 needed to have leaves with outputs that are erroneous with probability smaller than 1/2; therefore, one had to decompose a combinational system into subsystems in a way that ensures that the probability of error at the output of each subsystem is smaller than 1/2 (so that it can be driven to q* by consecutive stages of restoring organs).

Table 2.1. Input-output table for the 3-input XNAND gate.

  i1  i2  i3  |  XNAND Output
   0   0   0  |  1
   0   0   1  |  1
   0   1   0  |  0
   0   1   1  |  1
   1   0   0  |  1
   1   0   1  |  0
   1   1   0  |  0
   1   1   1  |  0

4 EXTENSIONS OF VON NEUMANN'S APPROACH
4.1 MAXIMUM TOLERABLE NOISE FOR 3-INPUT GATES
By considering reliably unreliable 3-input gates that fail with probability p, Hajek and Weller demonstrated that it is possible to construct reliable combinational circuits that calculate arbitrary Boolean functions if p < 1/6 [Hajek and Weller, 1991]. Using techniques different from the ones described in this chapter, they also proved that, if p > 1/6, then it is impossible to construct reliable circuits out of (reliably) unreliable 3-input gates. Therefore, 1/6 can be seen as the maximum tolerable noise in 3-input components. The latter result also applies to less restrictive fault models of unreliable gates, including gates that fail with probability bounded by p.

The construction in [Hajek and Weller, 1991] goes as follows:
• Any Boolean function can be computed by appropriately interconnected 2-input noiseless NAND gates. In the fault-tolerant construction, each 2-input NAND gate (with inputs x1 and x2) will be emulated by an unreliable 3-input XNAND gate (with inputs i1, i2 and i3) that functions as shown in Table 2.1. One can think of i1 as a noisy version of x1, and i2 and i3 as noisy versions of x2 [Hajek and Weller, 1991].
• Suppose that the following conditions are true:
1. p < 1/6.
2. Each of the inputs i1, i2 and i3 is erroneous (i.e., differs from x1, x2 and x2, respectively) with the same probability q, q < 1/2.
3. All of the above probabilities are independent.
One can verify that the output of an unreliable XNAND gate (with inputs i1, i2, i3) will equal the output of a reliable NAND gate with inputs x1 and x2 with a probability that is larger than 1/2. To show this, all one has to do is to consider all different cases separately. For example, if x1 = 0 and x2 = 0, then the reliable NAND gate should output "1".
The probability that the unreliable gate produces an incorrect output (i.e., "0") can be calculated as the sum of the probabilities of the following eight events:
1. i1 = 0, i2 = 0, i3 = 0 and XNAND gate fault; this event occurs with probability (1 − q)³p.
2. i1 = 0, i2 = 0, i3 = 1 and XNAND gate fault; this event occurs with probability (1 − q)²qp.
3. i1 = 0, i2 = 1, i3 = 0 and no XNAND gate fault; this event occurs with probability q(1 − q)²(1 − p).
4. i1 = 0, i2 = 1, i3 = 1 and XNAND gate fault; this event occurs with probability q²(1 − q)p.
5. i1 = 1, i2 = 0, i3 = 0 and XNAND gate fault; this event occurs with probability q(1 − q)²p.
6. i1 = 1, i2 = 0, i3 = 1 and no XNAND gate fault; this event occurs with probability q²(1 − q)(1 − p).
7. i1 = 1, i2 = 1, i3 = 0 and no XNAND gate fault; this event occurs with probability q²(1 − q)(1 − p).
8. i1 = 1, i2 = 1, i3 = 1 and no XNAND gate fault; this event occurs with probability q³(1 − p).
The sum of the probabilities of all of the above eight events is given by

p [(1 − q)³ + 2q(1 − q)² + q²(1 − q)] + (1 − p) [q(1 − q)² + 2q²(1 − q) + q³] = p(1 − q) + (1 − p)q ,

which can easily be shown to be less than 1/2 (since p(1 − q) + (1 − p)q = p + (1 − 2p)q < p + (1 − 2p)/2 = 1/2 for q < 1/2 and p < 1/2).
• Given an arbitrary Boolean function and its circuit implementation based on reliable 2-input NAND gates, one can construct a reliable circuit from unreliable 3-input XNAND gates using the following strategy: (i) Replace each NAND gate with an XNAND gate. (ii) Replicate the circuit that generates input x2 to each NAND gate; use the first circuit replica to provide i2 and the second circuit replica to provide i3 to the XNAND gate. (iii) Do this recursively, starting from the NAND gate that provides the output of the original circuit and working the way to the primary inputs (until all NAND gates are replaced by XNAND gates).
• Depending on the actual inputs, the outputs of the XNAND gates will be erroneous with different probabilities (that are nevertheless smaller than 1/2). This would be a problem at the next level of XNAND gates because different inputs would be erroneous with unequal probabilities (which invalidates the requirement that each input is erroneous with the same probability).
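The case analysis above can be verified exhaustively. The following sketch (with our own naming, not from the text) enumerates every input-error pattern against the XNAND table of Table 2.1 and checks that the output error probability stays below 1/2 for all four input pairs.

```python
from itertools import product

# XNAND truth table, transcribed from Table 2.1.
XNAND = {(0, 0, 0): 1, (0, 0, 1): 1, (0, 1, 0): 0, (0, 1, 1): 1,
         (1, 0, 0): 1, (1, 0, 1): 0, (1, 1, 0): 0, (1, 1, 1): 0}

def nand(x1, x2):
    return 1 - (x1 & x2)

def error_prob(x1, x2, q, p):
    """P[XNAND output != NAND(x1, x2)] when i1, i2, i3 are noisy copies
    of x1, x2, x2 (each flipped independently with probability q) and
    the gate itself fails with probability p."""
    total = 0.0
    for e1, e2, e3 in product((0, 1), repeat=3):
        weight = ((q if e1 else 1 - q) *
                  (q if e2 else 1 - q) *
                  (q if e3 else 1 - q))
        out = XNAND[(x1 ^ e1, x2 ^ e2, x2 ^ e3)]
        # A correct gate outputs `out`; a faulty gate flips it.
        total += weight * (p if out == nand(x1, x2) else 1 - p)
    return total

p, q = 0.1, 0.3
# For x1 = x2 = 0 the eight-event sum collapses to p(1-q) + (1-p)q.
assert abs(error_prob(0, 0, q, p) - (p * (1 - q) + (1 - p) * q)) < 1e-12
# For every input pair, the output is wrong with probability < 1/2.
assert max(error_prob(x1, x2, q, p)
           for x1, x2 in product((0, 1), repeat=2)) < 0.5
```

With p = 0.1 and q = 0.3 the worst case over the four input pairs is roughly 0.44, comfortably below 1/2, illustrating the unequal per-input error probabilities the text warns about.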
To avoid this problem, one can use the ternary tree strategy in Figure 2.3 with enough levels of 3-input unreliable voters so that the probability of an error under any combination of inputs is close enough to the "steady-state" error probability q0 [see Eq. (2.3)].

4.2 MAXIMUM TOLERABLE NOISE FOR u-INPUT GATES
The results of [Hajek and Weller, 1991] were generalized to u-input gates for odd u in [Evans, 1994]. Again, the assumption was that gates are reliably unreliable and that they fail independently with probability p. It was shown that, using unreliable u-input gates, one can construct circuits that reliably calculate arbitrary Boolean functions if p satisfies

p < 1/2 − 2^{u−2} / (u (u−1 choose (u−1)/2)) .   (2.4)

To prove this statement, consider the following simplified scenario first. Suppose that a (reliably) unreliable u-bit voter (u odd) fails with probability p (a voter fails when it outputs a result that is not equal to the majority of its inputs). Furthermore, assume that the u input bits ideally (under fault-free conditions) have identical values, but each one may be erroneous with probability q < 1/2. Assume that the inputs being erroneous and the voter failing are independent events.

The probability that the majority of the inputs to the voter are incorrect is given by

θ_u(q) = Σ_{k=(u+1)/2}^{u} (u choose k) q^k (1 − q)^{u−k} ,

and the probability that the output of the voter is erroneous is given by

f_u(q) = p(1 − θ_u(q)) + (1 − p)θ_u(q) = p + (1 − 2p)θ_u(q) .

THEOREM 2.2 Let 0 ≤ q < 1/2. If p satisfies Eq. (2.4), repeated applications of the function f_u(q) [i.e., f_u(f_u(··· (f_u(q)) ···))] will converge to a value q* that satisfies p ≤ q* < 1/2; if p_max < p < 1/2, where p_max denotes the bound on the right-hand side of Eq. (2.4), they will converge to 1/2.

Proof: The proof presented here is slightly different from the proof in [Evans, 1994]. First, it is argued that f_u(q) can only take one of the two forms shown in Figure 2.2. The distinction is made based on whether the function f_u(q) intersects the line of the function g(q) = q at one point or at three points (in both cases, the point (1/2, 1/2) is an intersection point).
Then, one finds the maximum possible p for which f_u(q) is guaranteed to have slope larger than one at q = 1/2 (which is sufficient to ensure that f_u(q) is of the form shown at the left side of Figure 2.2). The following facts can be easily verified:
1. f_u(1/2) = 1/2.
2. f_u(q) = 1 − f_u(1 − q) [i.e., f_u(q) is odd-symmetric around the point (1/2, 1/2)].
3. df_u(q)/dq ≡ f_u'(q) ≥ 0 for 0 ≤ q ≤ 1/2 (i.e., f_u(q) is non-decreasing in the interval [0, 1/2]).
4. d²f_u(q)/dq² ≡ f_u''(q) ≥ 0 for 0 ≤ q ≤ 1/2 (i.e., the derivative of f_u(q) is non-decreasing in the interval [0, 1/2]).

The above establish that the function f_u(q) can only have one of the two forms shown in Figure 2.2. Clearly, if f_u(q) is of the type shown on the right, repeated applications of f_u(q) for any 0 ≤ q < 1/2 will converge to 1/2; if, however, f_u(q) is of the form shown on the left, then repeated applications of f_u(q) will converge to a value q* that satisfies f_u(q*) = q* and p ≤ q* < 1/2. This can be shown using arguments similar to the ones used in the proof of Theorem 2.1. In order to distinguish between these two cases, one can calculate the derivative of f_u(q) with respect to q and find the values of p for which this derivative satisfies f_u'(1/2) > 1. In other words, it is required that p is such that

f_u'(1/2) = (1 − 2p) Σ_{k=(u+1)/2}^{u} (u choose k) (k q^{k−1}(1 − q)^{u−k} − (u − k) q^k (1 − q)^{u−k−1}) |_{q=1/2}
          = (1 − 2p) Σ_{k=(u+1)/2}^{u} (u choose k) (k (1/2)^{u−1} − (u − k)(1/2)^{u−1})
          = (1 − 2p) C

is strictly larger than one. One can explicitly solve for C to find that

C = u (u−1 choose (u−1)/2) / 2^{u−1} ,

so that p < 1/2 − 1/(2C) is equivalent to Eq. (2.4). The rest of the argument follows the proof of Theorem 2.1. □

Note that the above line of reasoning can also be used to prove Theorem 2.1. Having established Theorem 2.2, the argument in [Evans, 1994] follows the construction in [Hajek and Weller, 1991] to calculate arbitrary Boolean functions using unreliable u-input gates. Ignoring all but three of its u inputs, one can use a u-input gate as the 3-input XNAND gate exactly as defined in [Hajek and Weller, 1991].
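The bound of Eq. (2.4) and the slope constant C from the proof can be tabulated exactly with rational arithmetic. A small sketch (the names p_max and C are ours):

```python
from math import comb
from fractions import Fraction

def p_max(u):
    """The bound of Eq. (2.4) for a u-input gate, u odd:
    p_max = 1/2 - 2^(u-2) / (u * C(u-1, (u-1)/2))."""
    assert u % 2 == 1 and u >= 3
    return Fraction(1, 2) - Fraction(2**(u - 2), u * comb(u - 1, (u - 1) // 2))

def C(u):
    """The slope factor from the proof: C = f_u'(1/2) / (1 - 2p)."""
    return Fraction(u * comb(u - 1, (u - 1) // 2), 2**(u - 1))

# u = 3 recovers the Hajek-Weller threshold of 1/6.
assert p_max(3) == Fraction(1, 6)

# The two expressions for the bound agree: p_max = 1/2 - 1/(2C).
for u in (3, 5, 7, 9, 11):
    assert p_max(u) == Fraction(1, 2) - Fraction(1, 2) / C(u)
```

As u grows, p_max(u) increases toward 1/2 (e.g., p_max(5) = 7/30), reflecting the intuition that wider voters tolerate noisier components.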
From there on, the construction is the same as in [Hajek and Weller, 1991], except that u-input voters are now used to restore the different probabilities of error to an equal level.

5 RELATED WORK AND FURTHER READING
Considerations regarding the size and depth of reliable circuits were not discussed in this chapter. Several researchers have worked on obtaining bounds for the complexity of such circuits [Dobrushin and Ortyukov, 1977a; Dobrushin and Ortyukov, 1977b; Pippenger, 1988; Pippenger et al., 1991; Evans and Schulman, 1993; Gacs and Gal, 1994; Evans and Schulman, 1999]. The authors of [Evans and Pippenger, 1998] considered the construction of reliable combinational systems from reliably unreliable 2-input NAND gates that fail with probability p. They proved that reliable computation using such gates is possible if and only if p < (3 − √7)/4.

References
Dobrushin, R. L. and Ortyukov, S. I. (1977a). Lower bound for the redundancy of self-correcting arrangements of unreliable functional elements. Problems of Information Transmission, 13(4):59-65.
Dobrushin, R. L. and Ortyukov, S. I. (1977b). Upper bound for the redundancy of self-correcting arrangements of unreliable functional elements. Problems of Information Transmission, 13(4):203-218.
Evans, W. (1994). Information Theory and Noisy Computation. PhD thesis, EECS Department, University of California at Berkeley, Berkeley, California.
Evans, W. and Pippenger, N. (1998). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory, 44(3):1299-1305.
Evans, W. and Schulman, L. J. (1993). Signal propagation, with application to a lower bound on the depth of noisy formulas. In Proceedings of the 34th Annual Symp. on Foundations of Computer Science, pages 594-601.
Evans, W. and Schulman, L. J. (1999). Signal propagation and noisy circuits. IEEE Transactions on Information Theory, 45(7):2367-2373.
Feder, T. (1989).
Reliable computation by networks in the presence of noise. IEEE Transactions on Information Theory, 35(3):569-571.
Gacs, P. and Gal, A. (1994). Lower bounds on the complexity of reliable Boolean circuits with noisy gates. IEEE Transactions on Information Theory, 40(2):579-583.
Hajek, B. and Weller, T. (1991). On the maximum tolerable noise for reliable computation by formulas. IEEE Transactions on Information Theory, 37(2):388-391.
Pippenger, N. (1985). On networks of noisy gates. In Proceedings of the 26th IEEE FOCS Symp., pages 30-38.
Pippenger, N. (1988). Reliable computation by formulas in the presence of noise. IEEE Transactions on Information Theory, 34(2):194-197.
Pippenger, N. (1990). Developments in the synthesis of reliable organisms from unreliable components. In Proceedings of Symposia in Pure Mathematics, volume 50, pages 311-324.
Pippenger, N., Stamoulis, G. D., and Tsitsiklis, J. N. (1991). On a lower bound for the redundancy of reliable networks with noisy gates. IEEE Transactions on Information Theory, 37(3):639-643.
von Neumann, J. (1956). Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components. Princeton University Press, Princeton, New Jersey.

Chapter 3
ALGORITHM-BASED FAULT TOLERANCE FOR COMBINATIONAL SYSTEMS

1 INTRODUCTION
Modular redundancy schemes are attractive because they are universally applicable and can be implemented without having to develop explicit fault models. Their major drawback is that they can be prohibitively expensive due to the overhead of replicating the hardware. Arithmetic coding and algorithm-based fault tolerance (ABFT) schemes partially overcome this problem by offering sufficient fault coverage while making more efficient use of redundancy. This comes at the cost of narrower applicability and harder design.
In fact, the main task in arithmetic coding and ABFT schemes is the development of appropriate fault models and the recognition of the structural features that make a particular computation or algorithm amenable to efficient utilization of redundancy. A variety of useful results and constructive procedures for systematically achieving this goal have been obtained for computations that take place in an abelian group or in a semigroup. This chapter reviews work on arithmetic codes and ABFT, and describes a systematic approach for protecting combinational systems whose functionality possesses certain algebraic structure.

Arithmetic codes are error-correcting codes with properties that remain invariant under an arithmetic operation of interest [Rao and Fujiwara, 1989]. They are typically used as shown in Figure 3.1. First, one adds "structured" redundancy into the representation of the data by using suitable encodings, denoted by the mappings φ1 and φ2 in the figure. The desired original computation r = x1 ∘ x2 is then replaced by the modified computation ⋄ on the encoded data (∘ and ⋄ denote binary operations). Under fault-free conditions, the modified operation produces ρ = φ1(x1) ⋄ φ2(x2), which results in r when decoded through the decoding mapping σ [i.e., r = σ(ρ)].

Figure 3.1. Arithmetic coding scheme for protecting binary operations.

Due to the possible presence of faults, the result of the redundant computation can be erroneous, ρf instead of ρ. The redundancy in ρf is used by the error corrector α to perform error detection and correction. The output ρ̂ of the error detector/corrector is decoded via the mapping σ. Under fault-free conditions in the detecting/correcting mechanism and with correctable faults, ρ̂ equals ρ, and the final result r̂ = σ(ρ̂) equals r.
A common assumption in the model of Figure 3.1 (which closely follows the general model in Figure 1.2 of Chapter 1) is that the error detector/corrector is fault-free. This assumption is reasonable if the implementation of the detector/corrector is simpler than the implementation of the redundant computational unit. Another common assumption is that no fault takes place in the decoder unit. As discussed in Chapter 1, this latter assumption is in some sense inevitable: no matter how much redundancy is added, the output of the overall system will be erroneous if the device that is supposed to provide the output fails (i.e., if there is a fault in the very last stage of the combinational system).

Algorithm-based fault tolerance (ABFT) techniques involve more sophisticated coding schemes that deal with arrays of real/complex data in concurrent multiprocessor systems. They were introduced by Abraham and coworkers starting in 1984 [Huang and Abraham, 1984; Jou and Abraham, 1986; Jou and Abraham, 1988; Nair and Abraham, 1990] and aimed at protecting against a maximum number of pre-specified faults assuming fault-free error correction. The classic example of ABFT is the protection of n × n matrix multiplication on a 2-D systolic array; it is discussed in more detail in Section 3. A variety of other computationally intensive algorithms, such as fast Fourier transform (FFT) computational networks [Jou and Abraham, 1988] and convolution [Beckmann and Musicus, 1993], have also been studied in the context of ABFT.

There are three critical steps involved in ABFT schemes:
(i) Encoding of the input data.
(ii) Reformulation of the original algorithm so that it operates on the encoded data.
(iii) Distribution of the computational tasks among the different subsystems of the overall system so that any faults occurring within these subsystems can be detected and, hopefully, corrected.
The most important challenge in both arithmetic coding and ABFT implementations is the recognition of structure in an algorithm/architecture that is amenable to the introduction of redundancy. A step towards providing a systematic approach for the recognition and exploitation of such special structure was developed for computations that occur in a group or in a semigroup in [Beckmann, 1992; Beckmann and Musicus, 1992; Hadjicostis, 1995; Hadjicostis and Verghese, 1995]. The key observation is that the desired "structured" redundancy can be introduced by a homomorphic embedding into a larger algebraic structure (group or semigroup). These techniques are described in more detail in Section 4; the exposition is self-contained and requires minimal knowledge of group and semigroup theory.

2 ARITHMETIC CODES
While universally applicable and simple to implement, modular redundancy is inherently expensive and inefficient. For example, a TMR implementation triplicates the original system in order to detect and correct a single fault. Arithmetic codes, although more limited in applicability and possibly harder to design and implement, offer a resource-efficient alternative when dealing with the protection of simple operations on integer data, such as addition and multiplication. They can be thought of as a class of error-correcting codes whose properties remain invariant under the operation that needs to be made fault-tolerant.

An arithmetic coding scheme follows the model of Figure 3.1: in order to protect the computation of r = x1 ∘ x2, the following four steps are taken.
• Encoding: Redundancy is added to the representation of the data by using suitable encoding mappings φ1 and φ2: h1 = φ1(x1), h2 = φ2(x2).
• Redundant Computation: The operation on the encoded data may be different from the desired operation on the original data.
In Figure 3.1, this modified operation is denoted by ⋄ and under fault-free conditions it outputs

ρ = h1 ⋄ h2 = φ1(x1) ⋄ φ2(x2) .

When one or more faults take place, the redundant computation on the encoded data outputs an erroneous result ρf which, in general, is a function of the encoded data and the errors that took place, i.e.,

ρf = f(φ1(x1), φ2(x2), e) ,

where e denotes the error.
• Error Detection and Correction: If enough redundancy exists in the encoding of the data, one may be able to detect, identify and correct the errors by analyzing their effect on the encoded result ρf. In Figure 3.1 this is done by the error-correcting mapping α, which maps the corrupted result ρf to ρ̂, i.e., ρ̂ = α(ρf). Note that the error corrector has no access to the original operands; this ensures that the desired calculation does not take place in a part of the redundant construction that is assumed to be fault-free.
• Decoding: The final result r̂ is obtained by decoding ρ̂ using the mapping σ: r̂ = σ(ρ̂). Under fault-free conditions or under correctable faults, the final result r̂ equals the result of the operation on the original data (r = x1 ∘ x2).

EXAMPLE 3.1 Figure 3.2 shows an arithmetic coding scheme for protecting integer addition. Encoding involves multiplication of the operands x1 and x2 by a factor of 10. The redundant operation on the encoded data is integer addition (same as the original operation) and error detection involves division by 10. An error is detected if the corrupted result is not divisible by 10. Note that faults under which the result remains a multiple of 10 are undetectable. (Error correction is impossible unless a more detailed fault model is available.) This specific example is an instance of aN coding [Rao and Fujiwara, 1989] with a = 10. Under certain conditions and certain choices of a, aN coding can be used to correct a single error. Note that in the case of aN codes, redundancy is added into the combinational system by increasing the dynamic range of the system (by a factor of a).
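The aN scheme of Example 3.1 can be exercised directly. The following is a minimal sketch with a = 10 (function names are ours):

```python
# aN arithmetic code for integer addition, a = 10 (Example 3.1 sketch).
A = 10

def encode(x):
    """phi: multiply the operand by a."""
    return A * x

def check_and_decode(r):
    """Error detection (divisibility by a) followed by decoding (division)."""
    if r % A != 0:
        raise ValueError("error detected: result is not a multiple of 10")
    return r // A

x1, x2 = 23, 54
result = encode(x1) + encode(x2)          # redundant computation: 230 + 540
assert check_and_decode(result) == x1 + x2   # fault-free case decodes to 77

# A fault that perturbs the result is caught unless the corrupted value
# happens to remain a multiple of 10 (such faults are undetectable).
corrupted = result + 3
try:
    check_and_decode(corrupted)
except ValueError:
    pass  # error detected, as expected
```

Note how the code detects but cannot correct the injected error, matching the remark in the example that correction requires a more detailed fault model.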
Arithmetic coding schemes need to provide sufficient protection while keeping the associated encoding, error correction and decoding operations simple. If the latter operations are complicated, then the code is computationally expensive and impractical (an extreme example would be a code whose encoding/decoding is three times more complicated than the actual computation one desires to protect; in such a case, it would be more convenient to use TMR).

Figure 3.2. aN arithmetic coding scheme for protecting integer addition.

3 ALGORITHM-BASED FAULT TOLERANCE
Arithmetic codes do not always have the simple structure of Example 3.1. More advanced and more complicated schemes that protect real or complex numbers and involve entire sequences of data are referred to as algorithm-based fault tolerance (ABFT) and usually deal with multiprocessor systems. The term was introduced by J. Abraham and coworkers in 1984 [Huang and Abraham, 1984]. Since then, a variety of signal processing and other computationally intensive algorithms have been adapted to the ABFT framework [Jou and Abraham, 1986; Abraham, 1986; Chen and Abraham, 1986; Abraham et al., 1987; Jou and Abraham, 1988; Nair and Abraham, 1990].

The classic example of ABFT involves the protection of n × n matrix multiplication on an n × n multiprocessor grid [Huang and Abraham, 1984]. The ABFT scheme detects and corrects any single processor fault using an extra checksum row and an extra checksum column. The resulting fault-tolerance scheme requires an (n + 1) × (n + 1) multiprocessor grid on which it performs multiplication of an (n + 1) × n matrix with an n × (n + 1) matrix. The hardware overhead is minimal compared to the naive use of TMR, which offers similar fault protection but triplicates the system. The execution time for the algorithm is slowed down by a negligible amount: it now takes 3n steps, instead of 3n − 1.
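The checksum encoding just outlined can be illustrated with a small sketch in plain Python for n = 3 (our own helper names; the systolic data flow is abstracted away into an ordinary matrix product):

```python
def matmul(A, B):
    """Ordinary matrix product of two lists-of-lists."""
    n, m, k = len(A), len(B[0]), len(B)
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

def add_column_sums(A):
    """A' : append a row of column sums (4 x 3 from 3 x 3)."""
    return [list(row) for row in A] + [[sum(col) for col in zip(*A)]]

def add_row_sums(B):
    """B' : append a column of row sums (3 x 4 from 3 x 3)."""
    return [list(row) + [sum(row)] for row in B]

A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
C = matmul(A, B)
Cp = matmul(add_column_sums(A), add_row_sums(B))   # 4 x 4 matrix C'

# Fault-free: the top-left 3 x 3 block of C' is A x B, and the last
# row/column hold the column/row checksums.
assert [row[:3] for row in Cp[:3]] == C

# Inject a single error; the violated row and column checksums pinpoint
# its position, and the checksum discrepancy corrects it.
Cp[1][2] += 100
bad_row = [i for i in range(4) if sum(Cp[i][:3]) != Cp[i][3]]
bad_col = [j for j in range(4) if sum(Cp[i][j] for i in range(3)) != Cp[3][j]]
i, j = bad_row[0], bad_col[0]
Cp[i][j] -= sum(Cp[i][:3]) - Cp[i][3]
assert [row[:3] for row in Cp[:3]] == C
```

The single-error location step is exactly the intersection argument used in the bullets below Figure 3.3: a failed row check and a failed column check together identify one entry of C'.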
Figure 3.3 is an illustration of the above ABFT method for the case when n = 3. The top of the figure shows the unprotected computation of the product of two 3 × 3 square matrices A and B on a 3 × 3 multiprocessor grid. The data enters the multiprocessor system as indicated by the arrows in the figure (a "·" indicates that no data is received). Element aij corresponds to the element in the ith row, jth column of matrix A = [aij], whereas bij is the corresponding element of matrix B = [bij]. At time step t, each processor executes the following three steps:
1. Processor Pij (the processor on the ith row, jth column of the multiprocessor grid) receives two pieces of data, one from the processor on the left (namely, Pi(j−1)) and one from the processor at the top (namely, P(i−1)j). From the processor on the left it gets b(t−(j+i−1))i, whereas from the processor at the top it gets aj(t−(j+i−1)). If t − (j + i − 1) is negative, no data is received.
2. Processor Pij multiplies the data it receives and adds the result to an accumulative sum sij stored in its memory. Note that sij is initialized to zero. If no data has been received, nothing is done at this step.
3. Processor Pij passes the data received from the processor on its left to the processor on its right, and the data received from the top processor to the processor below.
It is not hard to see that after 3n − 1 steps, the value of sji is

sji = Σ_{t=0}^{3n−1} ai(t−(j+i−1)) × b(t−(j+i−1))j = cij ,

where akl and bkl are zero for k, l ≤ 0 or k, l > n, and cij is the element in the ith row, jth column of matrix C = A × B. Therefore, after 3n − 1 steps, processor Pji contains the value cij. A more detailed description of the algorithm can be found in [Leighton, 1992].

Protected computation is illustrated at the bottom of Figure 3.3. It uses a (3 + 1) × (3 + 1) multiprocessor grid.
Matrices A and B are encoded into two new matrices, A′ = [a′_{ij}] and B′ = [b′_{ij}] respectively, in the following fashion:

• The 4 × 3 matrix A′ is formed by adding a row of column sums to matrix A. More specifically, a′_{ij} = a_{ij} for 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, and

a′_{4j} = Σ_{i=1}^{3} a_{ij},  j = 1, 2, 3.

• The 3 × 4 matrix B′ is formed by adding a column of row sums to matrix B. More specifically, b′_{ij} = b_{ij} for 1 ≤ i ≤ 3, 1 ≤ j ≤ 3, and

b′_{i4} = Σ_{j=1}^{3} b_{ij},  i = 1, 2, 3.

The redundant computation is executed in the usual way on a 4 × 4 multiprocessor grid. The resulting matrix C′ = A′ × B′ is a 4 × 4 matrix.

Figure 3.3. ABFT scheme for protecting matrix multiplication (top: unprotected computation on a 3 × 3 processor array; bottom: protected computation on a 4 × 4 processor array).

Under fault-free conditions, the matrix C = A × B (i.e., the result of the original computation) is given by the submatrix C′(1:3, 1:3), i.e., the 3 × 3 submatrix that consists of the top three rows and the leftmost three columns of matrix C′. Moreover, the bottom row and the rightmost column of C′ consist of column and row checksums respectively. In other words,

c′_{4j} = Σ_{i=1}^{3} c′_{ij},  j = 1, 2, 3, 4,
c′_{i4} = Σ_{j=1}^{3} c′_{ij},  i = 1, 2, 3, 4.

If one of the processors malfunctions, one can detect and correct the error by using the row and column checksums to pinpoint the location of the error and then correct it. More specifically, the following are true:

• If the ith row checksum (1 ≤ i ≤ 4) and the jth column checksum (1 ≤ j ≤ 4) of matrix C′ are not satisfied, then there was a fault in processor P_{ji}.

• If only the ith row checksum (1 ≤ i ≤ 4) is not satisfied, then the calculation of the ith row checksum was erroneous (the hardware that performs this calculation is not shown in Figure 3.3).
• If only the jth column checksum (1 ≤ j ≤ 4) is not satisfied, then the calculation of the jth column checksum was erroneous (the hardware that performs this calculation is not shown in Figure 3.3).

The basic assumptions in the above analysis are that the propagation of the data (a_{ij} and b_{ij}) in the system is flawless and that there is at most one fault in the system. Note that data propagation errors would have been caught by a TMR system. Moreover, TMR would catch multiple faults as long as all of them were confined within one of the three replicas of the multiprocessor system. The above example, however, illustrates the numerous potential advantages of ABFT over naive modular redundancy methods. By exploiting the structural features of parallel matrix multiplication, this scheme achieves fault protection at a much lower cost. Other examples of efficient ABFT techniques have been developed for signal processing applications [Huang and Abraham, 1984; Jou and Abraham, 1986], systems for computing the fast Fourier transform (FFT) [Nair and Abraham, 1990], analog-to-digital conversion [Beckmann and Musicus, 1991], digital convolution [Beckmann and Musicus, 1993], fault-tolerant sorting networks [Choi and Malek, 1988; Liang and Kuo, 1990; Sun et al., 1994] and linear operators [Sung and Redinbo, 1996].

4 GENERALIZATIONS OF ARITHMETIC CODING TO OPERATIONS WITH ALGEBRAIC STRUCTURE

Traditionally, arithmetic coding and ABFT schemes have focused on the development of resource-efficient designs for providing fault tolerance to a specific computational task under a given hardware implementation. The identification of algorithmic or computational structure that could be exploited to provide efficient fault coverage to an arbitrary computational task has been more of an art than an engineering discipline. A step in solving this problem was taken by Beckmann in [Beckmann, 1992].
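Returning to the checksum scheme of Figure 3.3, its encoding, error localization, and correction steps can be sketched with ordinary (non-systolic) matrix products. The injected error position and the test matrices below are our own choices:

```python
# Sketch of the row/column-checksum scheme of Figure 3.3.
# The error location (1, 2) and the matrix values are arbitrary choices.
n = 3
A = [[2, 7, 1], [8, 2, 8], [1, 8, 2]]
B = [[8, 4, 5], [9, 0, 4], [5, 2, 3]]

Ap = A + [[sum(A[i][j] for i in range(n)) for j in range(n)]]     # A': + column sums
Bp = [B[i] + [sum(B[i][j] for j in range(n))] for i in range(n)]  # B': + row sums

# C' = A' x B', a 4 x 4 matrix computed entry by entry
Cp = [[sum(Ap[i][k] * Bp[k][j] for k in range(n)) for j in range(n + 1)]
      for i in range(n + 1)]
Cp[1][2] += 5                                         # inject one processor error

bad_rows = [i for i in range(n + 1)                   # rows whose row checksum fails
            if sum(Cp[i][j] for j in range(n)) != Cp[i][n]]
bad_cols = [j for j in range(n + 1)                   # columns whose column checksum fails
            if sum(Cp[i][j] for i in range(n)) != Cp[n][j]]
i, j = bad_rows[0], bad_cols[0]                       # intersection pinpoints the entry
Cp[i][j] -= sum(Cp[r][j] for r in range(n)) - Cp[n][j]  # correct via the column checksum

C = [[sum(A[r][k] * B[k][c] for k in range(n)) for c in range(n)] for r in range(n)]
assert all(Cp[r][c] == C[r][c] for r in range(n) for c in range(n))
print("single error located at", (i, j), "and corrected")
```

A single erroneous entry violates exactly one row checksum and one column checksum, so their intersection both locates the error and (via the checksum discrepancy) gives the amount by which to correct it.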
Beckmann showed that by concentrating on computational tasks that can be modeled as abelian group operations, one can impose sufficient structure upon the computation to allow for an accurate characterization of the possible arithmetic codes and the form of redundancy that is needed. The approach in [Beckmann, 1992] encompasses a number of previously developed arithmetic codes and ABFT techniques, and also extends to algebraic structures with an underlying abelian group structure, such as rings, fields, modules and vector spaces. The results in [Beckmann, 1992] were generalized in [Hadjicostis, 1995] to include operations with abelian semigroup structure. This section presents an overview of these results in a way that minimizes the need for background knowledge in group [Herstein, 1975] or semigroup theory [Ljapin, 1974; Lallement, 1979; Higgins, 1992; Grillet, 1995]. The exposition also avoids making an explicit connection to actual hardware implementations or fault models.

4.1 FAULT TOLERANCE FOR ABELIAN GROUP OPERATIONS

A computational task has an underlying group structure if the computation takes place in a set of elements that form a group.

DEFINITION 3.1 A non-empty set of elements G forms a group (G, ∘) if on G there is a defined binary operation, called the product and denoted by ∘, such that

1. a, b ∈ G implies a ∘ b ∈ G (closure).
2. a, b, c ∈ G implies that a ∘ (b ∘ c) = (a ∘ b) ∘ c (associativity).
3. There exists an element i_o ∈ G such that a ∘ i_o = i_o ∘ a = a for all a ∈ G (i_o is called the identity element).
4. For every a ∈ G there exists an element a^{-1} ∈ G such that a ∘ a^{-1} = a^{-1} ∘ a = i_o (the element a^{-1} is called the inverse of a).

If the group operation ∘ of G is commutative (i.e., for all a, b ∈ G, a ∘ b = b ∘ a), then G is called an abelian group. The order in which a series of abelian group products is evaluated does not matter because of associativity and commutativity, i.e.,

g_1 ∘ g_2 ∘ ⋯ ∘ g_u = g_{i1} ∘ g_{i2} ∘ ⋯ ∘ g_{iu},

where g_1, g_2, ..., g_u ∈ G and {i_1, i_2,
..., i_u} is any permutation of {1, 2, ..., u}.

EXAMPLE 3.2 A simple example of an abelian group is the set of integers under addition, denoted by (Z, +). The properties mentioned above can be verified easily. Specifically, the identity element is 0 and the inverse of integer n is the integer −n. Another example of an abelian group is the set of nonzero rational numbers under multiplication, denoted by (Q − {0}, ×). The identity element in this case is 1 and the inverse of the rational number q = n/d (where n, d are nonzero integers) is the rational number q^{-1} = d/n.

Suppose that the computation of interest can be modeled as an abelian group operation ∘ with operands {g_1, g_2, ..., g_u}, so that the desired result r is given by

r = g_1 ∘ g_2 ∘ ⋯ ∘ g_u.

Beckmann provides fault tolerance to this group product using the scheme of Figure 3.4 (which is essentially a generalization of Figure 3.1). The encoding, error detecting/correcting and decoding mechanisms are assumed to be fault-free. The redundant computational unit operates on the encoded data via a redundant abelian group operation ⋄. Under fault-free conditions, the result of this redundant computation is given by

p = φ_1(g_1) ⋄ φ_2(g_2) ⋄ ⋯ ⋄ φ_u(g_u)    (3.1)

and can be decoded to the original result r via the decoding mapping σ, i.e., σ(p) = r.

Due to faults, the output of the redundant computational unit may be erroneous. In [Beckmann, 1992], the effect of a fault f_i is modeled in an additive fashion, i.e., the possibly erroneous output p_{f_i} is written as

p_{f_i} = p ⋄ e_i,

where p is the error-free result in Eq. (3.1) and e_i is a suitably chosen operand that captures the effect of fault f_i. Note that, since ⋄ is a group operation, the existence of an operand that models the effect of fault f_i in this additive fashion is guaranteed (because one can always choose e_i = p^{-1} ⋄ p_{f_i}). When the effects
of faults are operand-dependent, however, it may be necessary to use multiple operands to capture the effect of a single fault.

Figure 3.4. Fault-tolerant computation of a group operation.

In [Beckmann, 1992], it is assumed that the effect of multiple faults can be captured by the superposition of individual additive error effects, i.e., the possibly corrupted result p_f can be written as

p_f = p ⋄ (e_{i1} ⋄ e_{i2} ⋄ ⋯ ⋄ e_{iλ}) ≡ p ⋄ e.

If no errors took place, e is the identity element. Given a pre-specified set of faults, one can in principle generate the corresponding set of error operands E = {i_o, e_1, e_2, ...} (the identity element is included for notational simplicity). To detect/correct up to λ such errors, one would need to be able to detect/correct each error e in the set

E^{(λ)} = {e_{i1} ⋄ e_{i2} ⋄ ⋯ ⋄ e_{iλ} | e_{i1}, e_{i2}, ..., e_{iλ} ∈ E}

(if e = i_o, then no error has taken place). The underlying assumptions of the additive error model are that errors are independent of the operands and that their effect on the overall result is independent of the stage in which the computation is. The latter assumption is realistic because the additive error model is used with associative and abelian operations, so that the order of evaluating the different products is irrelevant. The term "additive" makes more sense if the group operation ∘ is addition.

Figure 3.5. Fault tolerance using an abelian group homomorphism.

4.1.1 USE OF GROUP HOMOMORPHISMS

Under the formulation described in the previous section, the computation r = g_1 ∘ g_2 ∘ ⋯ ∘ g_u in the abelian group (G, ∘) is essentially mapped into the computation

p = φ_1(g_1) ⋄ φ_2(g_2) ⋄ ⋯ ⋄ φ_u(g_u)    (3.2)

in a larger abelian group (H, ⋄).
The subset of valid results H_v ⊂ H is the set of results obtained under error-free computation, i.e.,

H_v = {φ_1(g_1) ⋄ φ_2(g_2) ⋄ ⋯ ⋄ φ_u(g_u) | g_1, g_2, ..., g_u ∈ G}.

If the decoding mapping σ is one-to-one, then it is shown in [Beckmann, 1992] that all encoders {φ_i} have to be the same and have to satisfy φ_i = σ^{-1} ≡ φ, i = 1, 2, ..., u (notice that σ^{-1} is well-defined). Eq. (3.2) can then be written as

σ^{-1}(g_1 ∘ g_2 ∘ ⋯ ∘ g_u) = φ(g_1) ⋄ φ(g_2) ⋄ ⋯ ⋄ φ(g_u)

and reduces to

φ(g_1 ∘ g_2) = φ(g_1) ⋄ φ(g_2),

which is the defining property of a group homomorphism [Herstein, 1975]. Therefore, there is a close relationship between group homomorphisms and arithmetic coding schemes for group operations.

Figure 3.5 describes how fault tolerance is achieved: the group homomorphism φ adds redundancy to the computation by mapping the abelian group G, in which the original operation takes place, into a larger (redundant) group H. The subset of valid results, defined earlier as H_v, turns out to be isomorphic to G (i.e., H_v = σ^{-1}(G), where σ^{-1} is one-to-one) and, for this reason, it is also denoted by G′. Any error e is detected as long as it forces the result into an element p_f that is not in H_v. If enough redundancy exists in H, the error might be correctable; in such a case, p̂ = p and r̂ = r.

4.1.2 ERROR DETECTION AND CORRECTION

An error e_d ∈ E^{(λ)} ⊂ H, e_d ≠ i_o, is detectable if every possible valid result g′ ∈ G′ ⊂ H becomes invalid when corrupted by e_d, i.e.,

{g′ ⋄ e_d | g′ ∈ G′} ∩ G′ = ∅.

Using the set notation G′ ⋄ e_d ≡ {g′ ⋄ e_d | g′ ∈ G′}, the above equation can be re-written as

(G′ ⋄ e_d) ∩ G′ = ∅.    (3.3)

Similarly, an error e_c ∈ E^{(λ)}, e_c ≠ i_o, is correctable (and a fortiori detectable) if it satisfies

(G′ ⋄ e_c) ∩ (G′ ⋄ e) = ∅   for all e ∈ E^{(λ)}, e ≠ e_c.    (3.4)

Note that since e can also be i_o (i_o ∈ E^{(λ)}), the condition in Eq. (3.3) is a special case of Eq. (3.4). A well-known result from group theory states that sets of the form G′ ⋄ e (e ⋄ G′) for any e ∈ H are either identical or have no elements in common.
In other words, they form an equivalence class decomposition (partitioning) of H into subsets, known as right (left) cosets. When the collection of right and left cosets under a particular subgroup G′ is the same,¹ this collection of cosets forms a group, denoted by H/G′ and called the quotient group of H under G′. Its group operation ⊛ is defined as

A ⊛ B = {h_1 ⋄ h_2 | h_1 ∈ A, h_2 ∈ B}.

Figure 3.6. Coset-based error detection and correction.

Since two cosets are either identical or have no elements in common, Eqs. (3.3) and (3.4) can be written as

G′ ⋄ e_d ≠ G′,   for e_d ≠ i_o,
G′ ⋄ e_c ≠ G′ ⋄ e,   for e ∈ E^{(λ)}, e ≠ e_c.

Error detection and correction proceed as shown in Figure 3.6: any error that takes the result out of the subgroup of valid results G′ is detected (in the figure, this is the case for errors e_1, e_2, e_3 because they force the result outside G′). Furthermore, if enough redundancy exists in H, some errors can be corrected. For example, error e_1 in the figure results in h_1 and is correctable because the coset G′ ⋄ e_1 is not shared with any other error. Therefore, once one realizes that h_1 lies in the coset G′ ⋄ e_1 (the coset of e_1), one can get the error-free result p ∈ G′ by performing the operation h_1 ⋄ e_1^{-1}. If h_i lies in a coset shared by more than one error (which is the case for h_2 and h_3), the corresponding errors are detectable but not correctable. Errors that let the result stay within G′, such as e_4, are not detectable.
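The coset mechanics above can be made concrete with the redundant group H = (Z, +) and the subgroup of valid results G′ = 7Z (i.e., an aN code with a = 7; both the base and the error set below are illustrative assumptions):

```python
# Coset-based correction in H = (Z, +) with G' = 7Z (an aN code, a = 7).
# The base a = 7 and the additive error set are illustrative assumptions.
A = 7
ERRORS = [0, 1, -1, 2, -2]           # E: identity plus four additive errors

# The coset G' + e is identified by the residue e mod 7; distinct residues
# mean every modeled error sits in its own coset and is therefore correctable.
cosets = {e % A: e for e in ERRORS}
assert len(cosets) == len(ERRORS)

def correct(pf):
    """Map a (possibly corrupted) result back onto the subgroup G'."""
    e = cosets[pf % A]               # which coset does pf lie in?
    return pf - e                    # compose with the error's inverse

p = A * 10                           # a valid result: 70 is in 7Z
for e in ERRORS:
    assert correct(p + e) == p       # every modeled error is corrected
print("all errors corrected")
```

Had two errors shared a coset (e.g., errors 1 and 8), they would be detectable (nonzero residue) but not correctable, exactly as in Figure 3.6.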
To summarize, a correctable error forces the result into a nonzero coset (i.e., a coset other than G′) that is uniquely associated with this particular error. For an error to be detectable, it only has to force the result into a nonzero coset.

4.1.3 SEPARATE GROUP CODES

Separate codes are arithmetic codes in which redundancy is added in a separate "parity" channel [Rao and Fujiwara, 1989]. Error detection and correction are performed by appropriately comparing the result of the computation channel, which performs the original operation, with the result of the parity channel, which performs a simpler computation. No interaction between the computation and the parity channels is allowed.

EXAMPLE 3.3 A separate code for protecting integer addition is shown in Figure 3.7. The computation channel performs the original operation, whereas the parity channel performs addition modulo 4. The error detector compares the results of the two channels, g and p respectively. If they agree (modulo 4), then the result of the computation channel is accepted as error-free; if they do not agree, then an error is detected. The figure also illustrates one of the important advantages of separate codes over non-separate codes (such as the aN code of Figure 3.2): if the result is known to be error-free, then the output is available without the need for any further decoding.

In the case of separate codes, the model of Figure 3.4 reduces to the model shown in Figure 3.8. For simplicity, only two operands are shown, but the discussion applies to the general case of u operands. The group homomorphism φ maps the computation in the group G to a redundant group H which is the cartesian product of G and a parity set P, i.e., H = G × P. The homomorphic mapping satisfies φ(g) = [g, θ(g)], where θ : G → P is the mapping that creates the parity information from operand g (refer to Figure 3.8). The set of valid results H_v is the set of elements of the form {[g, θ(g)] | g ∈ G}.
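A minimal sketch of the separate code of Example 3.3, with the fault injected additively into the computation channel only (the fault model and function names are our own):

```python
# Sketch of the separate code of Figure 3.7: the computation channel adds
# integers, the parity channel adds modulo 4, and detection compares the two.
def parity(g):
    """theta: G -> P, here reduction modulo 4."""
    return g % 4

def protected_add(g1, g2, fault=0):
    g = g1 + g2 + fault                   # computation channel (fault injected)
    p = (parity(g1) + parity(g2)) % 4     # independent parity channel
    if g % 4 != p:
        raise ValueError("error detected")
    return g                              # no decoding needed: g is usable as-is

print(protected_add(19, 23))              # fault-free: 42
try:
    protected_add(19, 23, fault=2)        # fault not a multiple of 4: detected
except ValueError as err:
    print(err)
```

Faults whose additive effect is a multiple of 4 escape this check, which illustrates how the choice of parity group P fixes the set of detectable errors.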
It can be shown that (P, ⊙) is a group and that θ is a group homomorphism from G to P [Beckmann, 1992]. In order to make efficient use of redundancy (i.e., efficient use of parity symbols in group P), a reasonable requirement would be for θ to be onto. In such a case, the problem of finding suitable separate codes reduces to the problem of finding suitable epimorphisms θ from G onto P. A theorem from group theory states that there is a one-to-one correspondence between epimorphisms θ from the abelian group G onto P and subgroups N of G [Herstein, 1975].

Figure 3.7. Separate arithmetic coding scheme for protecting integer addition.

Figure 3.8. Separate coding scheme for protecting a group operation.

In fact, the quotient group G/N, constructed from G using N as a subgroup, provides an isomorphic image of P. Therefore, by finding all possible subgroups of G, one can find all possible epimorphisms θ from G onto P (and hence all possible parity check codes). Finding the subgroups of a group is not a trivial task, but it is relatively easy for several group operations of interest. By finding all subgroups of a given group, one is guaranteed to obtain all separate arithmetic codes that can be used to provide fault tolerance to the corresponding group computation. Thus, this systematic procedure results in a complete characterization of the separate codes for a given abelian group. The result is a generalization of one proved by Peterson for the special case of integer addition and multiplication [Peterson and Weldon Jr., 1972; Rao, 1974].
4.2 FAULT TOLERANCE FOR SEMIGROUP OPERATIONS

DEFINITION 3.2 A non-empty set of elements S forms a semigroup (S, ∘) if on S there is a defined binary operation ∘, such that

1. a, b ∈ S implies a ∘ b ∈ S (closure).
2. a, b, c ∈ S implies that a ∘ (b ∘ c) = (a ∘ b) ∘ c (associativity).

A semigroup is called a monoid when it possesses an identity element, i.e., a unique element i_o that satisfies a ∘ i_o = i_o ∘ a = a for all a ∈ S. One can focus on monoids without any loss of generality because an identity element can always be adjoined to a semigroup that does not initially possess one [Ljapin, 1974; Lidl and Pilz, 1985]. Therefore, the words "semigroup" and "monoid" can be used interchangeably unless a distinction needs to be made about the identity element.

EXAMPLE 3.4 Every group is a monoid; familiar examples of monoids that are not groups are the set of positive integers under the operation of multiplication [denoted by (N, ×)], the set of nonnegative integers under addition [denoted by (N_0, +)], and the set of polynomials with real coefficients under the operation of polynomial multiplication. All of the above examples are abelian monoids, i.e., monoids in which the operation ∘ is commutative (for all a, b ∈ S, a ∘ b = b ∘ a). Examples of non-abelian monoids are the set of polynomials under polynomial substitution, and the set of n × n matrices under matrix multiplication.

4.2.1 USE OF SEMIGROUP HOMOMORPHISMS

The approach in [Hadjicostis, 1995] uses the model in Figure 3.1 to protect a computation in a semigroup (monoid) (S, ∘). To introduce the redundancy needed for fault tolerance, the computation r = s_1 ∘ s_2 in (S, ∘) is mapped into an encoded computation p = φ_1(s_1) ⋄ φ_2(s_2) in a larger semigroup (monoid) (H, ⋄). After performing the redundant computation φ_1(s_1) ⋄ φ_2(s_2) in H, a possibly erroneous result p_f is obtained, which is assumed to still lie in H.
Error correction is performed through the mapping α that outputs p̂ = α(p_f); decoding is performed via a one-to-one mapping σ : H_v → S, where

H_v = {φ_1(s_1) ⋄ φ_2(s_2) | s_1, s_2 ∈ S}

is the subset of valid results in H. Under fault-free conditions in the error corrector and under correctable faults, p̂ = p and σ(p̂) = r. Clearly, the decoding mapping σ needs to satisfy

σ(φ_1(s_1) ⋄ φ_2(s_2)) = s_1 ∘ s_2

for all s_1, s_2 ∈ S; since σ is assumed to be one-to-one, the inverse mapping σ^{-1} : S → H_v is well-defined and satisfies

σ^{-1}(s_1 ∘ s_2) = φ_1(s_1) ⋄ φ_2(s_2).

If one assumes further that both φ_1 and φ_2 map the identity of S to the identity of H, then (by setting s_2 = i_o or s_1 = i_o) one concludes that σ^{-1}(s_1) = φ_1(s_1) for all s_1 ∈ S and σ^{-1}(s_2) = φ_2(s_2) for all s_2 ∈ S. Therefore, σ^{-1} = φ_1 = φ_2 ≡ φ and

1. φ(s_1 ∘ s_2) = φ(s_1) ⋄ φ(s_2),
2. φ(i_o) = i_o′,

where i_o and i_o′ denote the identity elements of S and H respectively. Condition (1) is the defining property of a semigroup homomorphism [Ljapin, 1974; Lidl and Pilz, 1985]. A monoid homomorphism is additionally required to satisfy condition (2) [Jacobson, 1974; Grillet, 1995]. Thus, the mapping φ is an injective monoid homomorphism.

The generalization of the framework of [Beckmann, 1992] to semigroups allows the study of fault tolerance in non-abelian computations for which inverses might not exist. These include a number of combinational and nonlinear signal processing applications, such as max/median filtering and min/max operations in sorting. This generalization, however, comes at a cost: error detection and correction can no longer be based on coset constructions. The problem is two-fold:

• In a semigroup setting one may be unable to model the possibly erroneous result p_f as

p_f = p ⋄ e

for some element e in H (because inverses do not necessarily exist in H and because the semigroup may be non-abelian).
• Unlike the subgroup of valid results, the subsemigroup of valid results H_v = φ(S) does not necessarily induce a partitioning of the semigroup H (for instance, it is possible that the set H_v ⋄ h is a subset of H_v for all h ∈ H).

4.2.2 ERROR DETECTION AND CORRECTION

To derive necessary and sufficient conditions for error detection and correction within the semigroup framework, one needs to resort to set-based arguments. For simplicity, the erroneous result due to one or more faults from a finite set F = {f_1, f_2, f_3, ...} is assumed to only depend on the error-free result (denoted earlier by p). As argued earlier, there is no loss of generality in making this assumption because the effects of a single fault that produces different erroneous results depending on the operands involved can be modeled through the use of multiple f_i, each of which captures the effect of the fault for a particular pair of operands. Of course, the disadvantage of such an approach is that the set of possible faults F is enlarged (and may become unmanageable).

The erroneous result reached under the occurrence of a single fault f_i ∈ F is given by p_{f_i} = e(p, f_i), where p is the error-free result, f_i is the fault that occurred, and e is an appropriate mapping. The fault model for multiple faults can be defined similarly: the effect of k faults (f_1, f_2, ..., f_k) (where f_j ∈ F for 1 ≤ j ≤ k) can be captured by the mapping e^{(k)}(p, f^{(k)}), where multiple faults are denoted by f^{(k)} ∈ F^{(k)} = {(f_1, f_2, ..., f_k) | f_j ∈ F, 1 ≤ j ≤ k}.

For full single-fault detection, the computation in the redundant semigroup H needs to meet the following condition:

e(p_1, f_i) ≠ p_2   for all f_i ∈ F and all p_1, p_2 ∈ H_v such that p_1 ≠ p_2.

In other words, a fault is detected whenever the result p_f lies outside H_v. For single-fault correction, the following additional condition is needed:

e(p_1, F) ∩ e(p_2, F) = ∅   for all p_1, p_2 ∈ H_v such that p_1 ≠ p_2,

where e(p, F) = {e(p, f_i) | f_i ∈ F}.
The above condition essentially establishes that no two different results p_1 and p_2 can be mapped, perhaps by different faults, to the same erroneous result. The error can be corrected by identifying the unique set e(p_k, F) in which the erroneous result p_f lies; p_k would then be the correct result.

These conditions can be generalized for fully detecting up to d faults and correcting up to c faults (c ≤ d):

e^{(k)}(p_1, f^{(k)}) ≠ p_2   for all f^{(k)} ∈ F^{(k)}, all p_1, p_2 ∈ H_v such that p_1 ≠ p_2, and for 1 ≤ k ≤ d;

{∪_{k=1}^{d} e^{(k)}(p_1, F^{(k)})} ∩ e^{(j)}(p_2, F^{(j)}) = ∅   for all p_1, p_2 ∈ H_v such that p_1 ≠ p_2, and for 1 ≤ j ≤ c.

Note that e^{(k)}(p, F^{(k)}) denotes the set {e^{(k)}(p, f^{(k)}) | f^{(k)} ∈ F^{(k)}}. The first condition guarantees detection of any combination of d or fewer faults (because no k faults, k ≤ d, can cause the erroneous result e^{(k)}(p_1, f^{(k)}) to be a valid one). The second condition guarantees correction of up to c faults (no combination of up to c faults on p_2 can result in an erroneous value that can also be produced by up to d faults on a different result p_1).

4.2.3 SEPARATE SEMIGROUP CODES

If the redundant semigroup H is a cartesian product of the form S × P, where (S, ∘) is the original semigroup and (P, ⊙) is the "parity" semigroup, then the corresponding encoding mapping φ can be expressed as φ(s) = [s, θ(s)] for all s ∈ S. In such a case, the set of valid results is given by {[s, θ(s)] | s ∈ S} and error detection is based on verifying that the result is of this particular form. Using the fact that the mapping φ is a homomorphism, one can easily show that the parity mapping θ is a homomorphism as well. As in the case of abelian groups, when this parity mapping θ is restricted to be surjective, one obtains a characterization of all possible parity mappings and, thus, of all separate codes.
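The single-fault conditions of Section 4.2.2 can be verified by exhaustive search for a small separate code of exactly this S × P form. The sketch below takes S = Z_8 under addition with a modulo-2 parity channel and faults that add 1 or 3 to the computation channel; all of these choices are illustrative:

```python
from itertools import product

# Brute-force check of the single-fault detection/correction conditions for a
# toy separate code: H = S x P with S = Z_8 (addition) and theta(s) = s mod 2.
S = range(8)
Hv = {(s, s % 2) for s in S}              # valid results [s, theta(s)]

F = [1, 3]                                # assumed faults: add an odd offset
def e(p, fi):                             # to the computation channel only
    s, par = p
    return ((s + fi) % 8, par)

# Detection: no fault maps a valid result onto a *different* valid result.
detectable = all(e(p1, fi) != p2
                 for p1, p2 in product(Hv, Hv) if p1 != p2
                 for fi in F)

# Correction: the fault images of distinct valid results must never overlap.
correctable = all(not ({e(p1, fi) for fi in F} & {e(p2, fi) for fi in F})
                  for p1, p2 in product(Hv, Hv) if p1 != p2)

print(detectable, correctable)            # prints: True False
```

The mod-2 parity detects these odd-valued faults (they flip the parity of the computation channel but not the parity symbol), yet it cannot correct them: faults 1 and 3 applied to different valid results can collide on the same erroneous value, violating the second condition.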
However, the role that was played in the abelian group framework by the (normal) subgroup N of the (abelian) group G is now played by a congruence relation on S:

DEFINITION 3.3 An equivalence relation ∼ on the elements of a semigroup (S, ∘) is called a congruence relation if

a ∼ a′, b ∼ b′ ⇒ a ∘ b ∼ a′ ∘ b′,

for all a, a′, b, b′ ∈ S. The partitions induced by ∼ are called congruence classes.

Unlike the group case, where a normal subgroup induces a partitioning of a group into cosets, the number of elements in each congruence class is not necessarily the same. The only requirement for congruence classes is that a given partitioning {C_1, C_2, ...} is such that partitions are preserved by the semigroup operation. More specifically, when any element of partition C_j is composed with any element of partition C_k, the result is confined to a single partition C_l. Formally, a given partitioning {C_1, C_2, ...} is a congruence relation if for each pair of partitions C_j, C_k there is a partition C_l such that

s_j′ ∘ s_k′ ∈ C_l

for all s_j′ ∈ C_j and all s_k′ ∈ C_k.

Let S/∼ denote the set of congruence classes of S under the relation ∼. Each congruence class in this set will be denoted as [a] ∈ S/∼, where a ∈ S is an arbitrary element of the congruence class. If ∼ is a congruence relation, the binary operation [a] ⊛ [b] = [a ∘ b] is well-defined [Ljapin, 1974; Lidl and Pilz, 1985]. With this definition, (S/∼, ⊛) is a semigroup, referred to as the factor or quotient semigroup of ∼ in S. The congruence class [i_o] functions as the identity in S/∼.

A well-known homomorphism theorem from semigroup theory states that the surjective homomorphisms from a semigroup S onto a semigroup P are isomorphic to the canonical surjective homomorphisms, namely the surjective homomorphisms that map S onto its quotient semigroups S/∼, where ∼ denotes a congruence relation in S [Ljapin, 1974; Lidl and Pilz, 1985]. Furthermore, the semigroup (P, ⊙) is isomorphic to (S/∼, ⊛).
Thus, for each congruence relation ∼ there is a corresponding surjective homomorphism, and for each surjective homomorphism there is a corresponding congruence relation. Therefore, the problem of finding all possible parity codes reduces to that of finding all possible congruence relations in S [Hadjicostis, 1995].

The major difference between the results in [Hadjicostis, 1995] and the results in [Beckmann, 1992] that were presented earlier is that, for separate abelian group codes, the subgroup N of the given group G completely specifies the parity homomorphism θ (this is simply saying that P ≅ G/N). In the more general setting of a semigroup, however, specifying a subsemigroup of S does not completely specify the homomorphism θ (and therefore does not determine the structure of the parity semigroup P). In order to define the surjective homomorphism θ : S → P (or, equivalently, in order to define a congruence relation ∼ on S), one may need to specify all congruence classes. The following examples help make this point clearer.

EXAMPLE 3.5 Figure 3.9 shows an example of a partitioning into congruence classes for the monoid (N, ×) of positive integers under multiplication.

Figure 3.9. Partitioning of the semigroup (N, ×) into congruence classes: C_0 = {1, 5, 7, 11, 13, ...}, C_1 = {6, 12, 18, 24, ...}, C_2 = {2, 4, 8, 10, 14, ...}, C_3 = {3, 9, 15, 21, ...}.

Congruence class C_1 contains multiples of 2 and 3 (i.e., multiples of 6); congruence class C_2 contains multiples of 2 but not 3; congruence class C_3 contains multiples of 3 but not 2; and congruence class C_0 contains all the remaining
EXAMPLE 3.6 It is proved in {Hadjicostis, I995J that an encoding mapping 8 : (No, +) f---+ (P, 0) can serve as a separate code for (No, +) if and only if it has the following form: Let M > 0 and k 2:: 0 be some fixed integers. Then, the mapping 8 is given by: < kM, 8(n) n ifn 8(n) 8(n mod M) == nM, otherwise. The symbol nM denotes the class of elements that are in the same modulo-M class as n; there are exactly M such classes, namely, {OM, 1M, ... , (M -1)M}. The parity monoid P consists of(k + I)M elements: the elements in the index set {O, 1, ... , (kM -I)}, and the elements in the subgroup {OM, 1M, ... , (MI)M} (which is isomorphic to ZM, the cyclic group of order M). While under the threshold kM, the parity operation simply replicates the computation in No; once the threshold is exceeded, the parity operation performs modulo-M addition. Note that the parity encodings for the group (Z, +) (the group of integers under addition) can only be of the form 8(n) = nM for some M > 0, i.e., the second of the two expressions listed above. Evidently, by relaxing the structure to a monoid, more possibilities for parity encodings are opened; their construction, however, becomes more intricate. ABFT for Combinational Systems 55 3.7 A simple parity checkfor (N, x) is the mapping f) : (N, x) t - t (P, 0) defined in Figure 3.9. The parity monoid P has the following binary EXAMPLE operation 0: 0 II Co C1 C2 C3 II Co C1 C2 C3 C1 II C 1 C1 C1 C1 C2 II C2 C1 C2 C1 C3 II C3 C1 C1 C3 Co The parity check determines whether the result is a multiple of 2 and/or of 3. For instance, when a multiple of 2 but not 3 (i.e., an element in congruence class C 2 ) is multiplied by a multiple of6 (an element in class C 1), the result is a multiple of6 (an element in class C1). EXAMPLE 3.8 The semig roup of intege rs under the MAX ope ration is denoted by (Z, MAX), where operation MAX is the binary operation that returns the larger of its two operands. 
This semigroup can be made into a monoid by adjoining the identity element −∞ to it. From the definition of a congruence class, one concludes that, if c and c′ ≤ c belong to a congruence class C, then the set {x | c′ ≤ x ≤ c} is contained in C. Thus, any congruence class must consist of all consecutive integers in an interval. Therefore, every partitioning into congruence classes corresponds to breaking the integer line into consecutive intervals, each interval constituting a congruence class. This immediately yields a complete characterization of the separate codes for (Z ∪ {−∞}, MAX).

A simple choice would be to pick the pair of congruence classes C_0 = {−∞} ∪ {..., −2, −1} and C_1 = {0, 1, 2, ...}. The corresponding parity operation ⊙ is defined by:

  ⊙  | C_0  C_1
 ----+----------
 C_0 | C_0  C_1
 C_1 | C_1  C_1

The parity computation simply checks that the sign of the result comes out correctly.

4.3 EXTENSIONS

The algebraic approach for protecting group and semigroup operations can be extended straightforwardly to other algebraic systems with the underlying structure of a group (such as rings, fields and vector spaces) or a semigroup (such as semirings or semifields). By exploiting the group or semigroup structure in each of these other structures, one can place the construction of arithmetic codes for computations in them into the group/semigroup frameworks that were discussed in this chapter. Therefore, a large set of computational tasks can be studied using the framework of this chapter, including integer residue codes, real residue codes, multiplication of nonzero real numbers, linear transformations, and Gaussian elimination [Beckmann, 1992].

Notes

1. This is always true for the abelian group case because sets of the form G′ ⋄ e are the same as sets of the form e ⋄ G′.

References

Abraham, J. A. (1986). Fault tolerance techniques for highly parallel signal processing architectures. In Proceedings of SPIE, pages 49-65.

Abraham, J.
A., Banerjee, P., Chen, C.-Y., Fuchs, W. K., Kuo, S.-Y., and Reddy, A. L. N. (1987). Fault tolerance techniques for systolic arrays. IEEE Computer, 36(7):65-75.
Beckmann, P. E. (1992). Fault-Tolerant Computation Using Algebraic Homomorphisms. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Beckmann, P. E. and Musicus, B. R. (1991). Fault-tolerant round-robin A/D converter system. IEEE Transactions on Circuits and Systems, 38(12):1420-1429.
Beckmann, P. E. and Musicus, B. R. (1992). A group-theoretic framework for fault-tolerant computation. In Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, pages 557-560.
Beckmann, P. E. and Musicus, B. R. (1993). Fast fault-tolerant digital convolution using a polynomial residue number system. IEEE Transactions on Signal Processing, 41(7):2300-2313.
Chen, C.-Y. and Abraham, J. A. (1986). Fault tolerance systems for the computation of eigenvalues and singular values. In Proceedings of SPIE, pages 228-237.
Choi, Y.-H. and Malek, M. (1988). A fault-tolerant systolic sorter. IEEE Transactions on Computers, 37(5):621-624.
Grillet, P. A. (1995). Semigroups. Marcel Dekker Inc., New York.
Hadjicostis, C. N. (1995). Fault-Tolerant Computation in Semigroups and Semirings. M.Eng. thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. and Verghese, G. C. (1995). Fault-tolerant computation in semigroups and semirings. In Proceedings of the Int. Conf. on Digital Signal Processing, pages 779-784.
Herstein, I. N. (1975). Topics in Algebra. Xerox College Publishing, Lexington, Massachusetts.
Higgins, P. M. (1992). Techniques of Semigroup Theory. Oxford University Press, New York.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Jacobson, N. (1974). Basic Algebra I. W. H. Freeman and Company, San Francisco.
Jou, J.-Y. and Abraham, J. A. (1986). Fault-tolerant matrix arithmetic and signal processing on highly concurrent parallel structures. Proceedings of the IEEE, 74(5):732-741.
Jou, J.-Y. and Abraham, J. A. (1988). Fault-tolerant FFT networks. IEEE Transactions on Computers, 37(5):548-561.
Lallement, G. (1979). Semigroups and Combinatorial Applications. John Wiley & Sons, New York.
Leighton, F. T. (1992). Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann Publishers, San Mateo, California.
Liang, S. C. and Kuo, S. Y. (1990). Concurrent error detection and correction in real-time systolic sorting arrays. In Proceedings of 20th IEEE Int. Symp. on Fault-Tolerant Computing, pages 434-441. IEEE Computer Society Press.
Lidl, R. and Pilz, G. (1985). Applied Abstract Algebra. Undergraduate Texts in Mathematics. Springer-Verlag, New York.
Ljapin, E. S. (1974). Semigroups, volume Three of Translations of Mathematical Monographs. American Mathematical Society, Providence, Rhode Island.
Nair, V. S. S. and Abraham, J. A. (1990). Real-number codes for fault-tolerant matrix operations on processor arrays. IEEE Transactions on Computers, 39(4):426-435.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Rao, T. R. N. (1974). Error Coding for Arithmetic Processors. Academic Press, New York.
Rao, T. R. N. and Fujiwara, E. (1989). Error-Control Coding for Computer Systems. Prentice-Hall, Englewood Cliffs, New Jersey.
Sun, J., Cerny, E., and Gecsei, J. (1994). Fault tolerance in a class of sorting networks. IEEE Transactions on Computers, 43(7):827-837.
Sung, J.-L. and Redinbo, G. R. (1996). Algorithm-based fault tolerant synthesis for linear operations. IEEE Transactions on Computers, 45(4):425-437.
II FAULT-TOLERANT DYNAMIC SYSTEMS

Chapter 4
REDUNDANT IMPLEMENTATIONS OF ALGEBRAIC MACHINES

1 INTRODUCTION
This chapter extends the algebraic approach of Chapter 3 in order to provide fault tolerance to group and semigroup machines. The discussion characterizes redundant implementations using algebraic homomorphisms and demonstrates that for a particular error-correcting scheme there exist many possible redundant implementations, each potentially offering different fault coverage [Hadjicostis, 1999]. The fault model assumes that the error detecting/correcting mechanism is fault-free and considers faults that cause the redundant machine to transition to an incorrect state. Explicit connections to hardware implementations and hardware faults are addressed in Chapter 5 for linear time-invariant dynamic systems (implemented using delay, adder and gain elements) and in Chapter 6 for linear finite-state machines (implemented using XOR gates and flip-flops). The issue of faults in the error corrector is studied in Chapter 7. Related work appeared in the context of providing fault tolerance to arbitrary finite-state machines via external monitoring mechanisms [Iyengar and Kinney, 1985; Leveugle and Saucier, 1990; Parekhji et al., 1991; Robinson and Shen, 1992; Leveugle et al., 1994; Parekhji et al., 1995]. This work, however, was not formulated in an algebraic setting and does not make use of algebraic properties and structure.

2 ALGEBRAIC MACHINES: DEFINITIONS AND DECOMPOSITIONS

[Figure 4.1. Series-parallel decomposition of a group machine: the coset leader machine G/N and the subgroup machine N, interconnected through the encoder E; their states c_i1 and n1 combine into the overall state g1 = n1 ∘ c_i1.]

DEFINITION 4.1 A semigroup machine is a dynamic system whose states and inputs are drawn from a finite set S that forms a semigroup under a binary operation ∘. More specifically, given the current state q[t] = s1 ∈ S and the current input x[t] = s2
∈ S, the next state is given by

q[t+1] = q[t] ∘ x[t] = s1 ∘ s2 .

In the special case when (S, ∘) is a group, the machine is known as a group or permutation machine.

A group machine G with a non-trivial normal subgroup N can be decomposed into two smaller group machines: the coset leader machine with group G/N and the subgroup machine with group N [Arbib, 1968; Ginzburg, 1968; Arbib, 1969]. Figure 4.1 conveys the basic idea: group machine G, with current state q_g[t] = g1 and input x_g[t] = g2, is decomposed into the "series-parallel" interconnection in the figure. [Figure 4.1 follows a convention that will be used throughout this chapter: boxes are used to denote machines (systems with memory) and ovals are used to denote mappings (combinational systems with no memory).] Note that the input is encoded differently for each submachine; in particular, the input n′ to the subgroup machine is encoded based on the original input g2 and the state c_i1 of the coset leader machine. Note that the encoder E in the figure has no memory (state) and is implemented as a combinational system. The overall state of the decomposition is obtained by combining the states of both submachines (q_g[t] = g1 = n1 ∘ c_i1, where n1 is the state of the subgroup machine and c_i1 is the state of the coset leader machine).

The above decomposition is possible because the normal subgroup N induces a partition of the elements of G into cosets [Arbib, 1968; Arbib, 1969]. More specifically, each element g of G can be expressed uniquely as g = n ∘ c_i for some n ∈ N, c_i ∈ C, where C = {c1, c2, ..., cl} is the set of distinct (right¹) coset leaders (there is exactly one representative for each coset). The decomposition in Figure 4.1 simply keeps track of this parameterization. Initially, machine G is in state g1 = n1 ∘ c_i1, and machines N and G/N in the decomposition are in states n1 and c_i1 respectively.
If input g2 = n2 ∘ c_i2 is applied, the new state g = g1 ∘ g2 can be expressed as g = n ∘ c_j. One possibility is to take c_j = [c_i1 ∘ g2] = [c_i1 ∘ n2 ∘ c_i2] (here, [x] denotes the coset leader of the element x ∈ G); then, one puts n = n1 ∘ c_i1 ∘ g2 ∘ [c_i1 ∘ g2]^(-1). Note that c_i1 ∘ g2 ∘ [c_i1 ∘ g2]^(-1) is an element of N and is the output n′ of encoder E (the product n1 ∘ n′ can be computed within the subgroup machine). The encoders are used to appropriately encode the input for each machine and to provide the combined output. The decomposition can continue recursively if either of the groups N or G/N of the two submachines has a non-trivial normal subgroup.

Note that the above choice of decomposition holds even if N is not a normal subgroup of G. In such case, however, the (right) coset leader machine is no simpler than the original machine; its group is still G [Arbib, 1968].

The decomposition of group machines described above has generalizations to semigroup machines, the most well-known result being the Krohn-Rhodes theorem [Arbib, 1968; Arbib, 1969]. This theorem states that an arbitrary semigroup machine (S, ∘) can be decomposed in a non-unique way into a series-parallel interconnection of simpler components that are either simple-group machines or are one of four basic types of semigroup machines. The basic machine components are the following:

• Simple-group machines, i.e., machines whose groups do not have any non-trivial normal subgroups. Each simple-group machine in a Krohn-Rhodes decomposition has a simple group that is a homomorphic image of some subgroup of S. It is possible that the decomposition uses multiple copies of a particular simple-group machine or no copy at all.
• U3 = {1, r1, r2} such that for u, ri ∈ U3, u ∘ 1 = 1 ∘ u = u and u ∘ ri = ri.
• U2 = {r1, r2} such that for u, ri ∈ U2, u ∘ ri = ri.
• U1 = {1, r} such that 1 is the identity element and r ∘ r = r.
• U0 = {1}.

Note that U0, U1 and U2 are in fact subsemigroups of U3.
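Returning to the group-machine case, the coset-based series-parallel decomposition described above can be checked with a short Python sketch (our illustration, not from the book). Here G = Z_6 and N = {0, 2, 4}, so the coset-leader map [x] is just the parity of an element:

```python
import random

M = 6
N = {0, 2, 4}                     # normal subgroup of Z_6 (isomorphic to Z_3)

def leader(x):                    # coset leader of x (cosets: evens, odds)
    return x % 2

def step(n, c, g2):
    """One transition of the decomposed machine: n is the state of the
    subgroup machine, c the state of the coset leader machine, g2 the input."""
    c_next = leader((c + g2) % M)          # coset leader machine
    n_in = (c + g2 - c_next) % M           # encoder E output (element of N)
    assert n_in in N
    return (n + n_in) % M, c_next          # subgroup machine, new coset leader

# Verify that the combined state n o c tracks the original machine Z_6.
g, n, c = 0, 0, 0
for _ in range(1000):
    g2 = random.randrange(M)
    g = (g + g2) % M                       # original (non-decomposed) machine
    n, c = step(n, c, g2)
    assert g == (n + c) % M
```

The `leader` and `step` names are ours; the update rules are exactly the c_j and n expressions above, specialized to Z_6.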
Some further results and ramifications can be found in [Ginzburg, 1968].

Before moving to the construction of redundant algebraic machines, some comments are in order. Redundant implementations for algebraic machines are constructed in this chapter using a hardware-independent approach; for discussion purposes, however, some examples make reference to digital implementations, i.e., implementations that are based on digital circuits. The state of such implementations is encoded as a binary vector and is stored into an array of single-bit memory registers (flip-flops). The next-state function is implemented by combinational logic. State transition faults occur when a hardware fault causes the desired transition to a state s_i (s_i ∈ S) with binary encoding (b_1i, b_2i, ..., b_ni) to be replaced by a transition to an incorrect state s_j ∈ S with encoding (b_1j, b_2j, ..., b_nj) (all bits b are either "0" or "1"). A single-bit error occurs when the encoding of s_i differs from the encoding of s_j in exactly one bit-position [Abraham and Fuchs, 1986; Johnson, 1989]. Note that, depending on the hardware implementation, a single hardware fault can cause multiple-bit errors. Chapters 5 and 6 describe ways to implement certain types of machines so that a single hardware fault will result in a single-bit error.

3 REDUNDANT IMPLEMENTATIONS OF GROUP MACHINES
The next state q_g[t+1] of a group machine G is determined by a state evolution equation of the following form:

q_g[t+1] = q_g[t] ∘ x_g[t] = g1 ∘ g2 ,

where both the current state q_g[t] = g1 and input x_g[t] = g2 are elements of a group (G, ∘). Examples of group machines include additive accumulators, multi-input linear shift registers, counters and cyclic autonomous machines. As discussed in the previous section, group machines can also play an important role as essential components of arbitrary state machines.
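As a minimal illustration (ours, not from the book), the additive accumulator realizes this state evolution directly, with (G, ∘) = (Z_M, +):

```python
def make_accumulator(M):
    """A cyclic group machine Z_M: q[t+1] = (q[t] + x[t]) mod M."""
    state = {"q": 0}
    def step(x):
        state["q"] = (state["q"] + x) % M   # the group operation of (Z_M, +)
        return state["q"]
    return step

acc = make_accumulator(6)
assert [acc(x) for x in (1, 4, 5, 2)] == [1, 5, 4, 0]
```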
One approach for constructing a redundant implementation for a given group machine G is to construct a larger group machine H (with group (H, ◇), current state q_h[t] = h1 ∈ H, encoded input ξ(x_g[t]) = h2 ∈ H, and next-state function δ_h(h1, h2) = h1 ◇ h2) that can concurrently simulate machine G as shown in Figure 4.2: the current state q_g[t] = g1 of the original group machine G can be recovered from the corresponding state q_h[t] = h1 of the redundant machine H through a one-to-one decoding mapping ℓ (i.e., q_g[t] = ℓ(q_h[t]) for all time steps t). The mapping ℓ is only defined for the subset of valid states in H, given by G′ = ℓ^(-1)(G) ⊂ H.

[Figure 4.2. Redundant implementation of a group machine.]

DEFINITION 4.2 A redundant implementation for a group machine (G, ∘) is a group machine (H, ◇) for which there exist
(i) an appropriate input encoding mapping ξ : G → H (from G into H), and
(ii) a one-to-one state encoding mapping ℓ^(-1) : G → G′ (where G′ = ℓ^(-1)(G) ⊂ H is the subset of valid states),
such that

ℓ^(-1)(g1 ∘ g2) = ℓ^(-1)(g1) ◇ ξ(g2)     (4.1)

for all g1, g2 ∈ G.

Note that when machine H is properly initialized and fault-free, there is a one-to-one correspondence between the state q_h[t] = h1 of machine H and the corresponding state q_g[t] = g1 of G. Specifically, g1 = ℓ(h1) or h1 = ℓ^(-1)(g1) for all time steps. This can be shown by induction: if at time step t, machine H is in state q_h[t] = h1 and input x_g[t] = g2 ∈ G is supplied via ξ, the next state of H will be given by

q_h[t+1] = h = h1 ◇ ξ(g2)

for some h in H. Since ℓ is one-to-one, it follows from Eq. (4.1) that h has to satisfy h = ℓ^(-1)(g1 ∘ g2) = ℓ^(-1)(g), where g = g1 ∘ g2 is the next state of machine G. Note that h belongs to the subset of valid states G′ = ℓ^(-1)(G) ⊂ H.
Faults cause transitions to invalid states in H; at the end of the time step, the error detector verifies that the newly reached state h is in G′ and, if an error is detected, necessary correction procedures are initiated and completed before the next input is supplied.

The concurrent simulation condition of Eq. (4.1) is an instance of the coding scheme of Figure 3.1: the decoding mapping ℓ plays the role of σ, whereas ξ corresponds to mapping φ2. (The situation described in Eq. (4.1) is slightly more restrictive than the one in Figure 3.1, because φ1 is restricted to be ℓ^(-1).)

[Figure 4.3. Separate redundant implementation of a group machine: the redundant machine H = G × P consists of the original machine G and a parity machine P, followed by an error detector/corrector.]

By invoking the results² in Sections 4.1.1 and 4.2.1 of Chapter 3, one concludes that the design of redundant implementations for group machines can be studied through group homomorphisms [Hadjicostis, 1999]. More specifically, by choosing ξ ≡ ℓ^(-1) to be an injective group homomorphism from G into H, Eq. (4.1) is automatically satisfied:

ℓ(ℓ^(-1)(g1) ◇ ξ(g2)) = ℓ(ℓ^(-1)(g1) ◇ ℓ^(-1)(g2)) = ℓ(ℓ^(-1)(g1 ∘ g2)) = g1 ∘ g2 .

3.1 SEPARATE MONITORS FOR GROUP MACHINES
When the redundant group machine is of the form H = G × P, one recovers the results in Section 4.1.3 of Chapter 3 for separate group codes: the encoding homomorphism φ : G → H [where φ(g) = ξ(g) = ℓ^(-1)(g)] is of the form φ(g) = [g, θ(g)] for an appropriate mapping θ. The redundant machine (H, ◇) consists of the original machine (G, ∘) and an independent parity machine (P, ◇) as shown in Figure 4.3.
Machine P is smaller than G and is referred to as a (separate) monitor or a monitoring machine (the latter term has been used in finite-state machines [Iyengar and Kinney, 1985; Parekhji et al., 1991; Robinson and Shen, 1992; Parekhji et al., 1995; Hadjicostis, 1999] and in other settings). Mapping θ : G → P is used to produce the encoded input p2 = θ(g2) for the separate monitor P and is easily shown to be a homomorphism. If machines G and P are properly initialized and fault-free, then the state q_p[t] = p1 ∈ P of the monitor at a given time step t will satisfy q_p[t] = θ(q_g[t]), where q_g[t] = g1 ∈ G is the state of the original machine (G, ∘) (see Figure 4.3). Error detection simply checks if this condition is satisfied. Depending on the actual hardware implementation, one may be able to detect and correct certain errors in the original machine or in the separate monitor.

Using the results in Section 4.1 of Chapter 3, and retaining the assumption that the mapping θ : G → P (which maps states and inputs in machine G to states and inputs in machine P) is surjective, one concludes that group machine (P, ◇) can monitor group machine (G, ∘) if and only if P is a surjective homomorphic image of G or, equivalently, if and only if there exists a normal subgroup N of G such that P ≅ G/N.

EXAMPLE 4.1 The group machine G = Z_6 = {0, 1, 2, 3, 4, 5} performs modulo-6 addition, i.e., its next state is the modulo-6 sum of its current state and current input. The subgroup N = {0, 2, 4} ≅ Z_3 is one possible non-trivial normal subgroup of G. The corresponding monitor P is isomorphic to G/N ≅ Z_2: it has two states, p0 and p1, each of which can be associated with one of the two partitions of states of machine Z_6. More specifically, (monitor) state p0 can be associated with (original) states in partition P0 = {0, 2, 4} and (monitor) state p1 can be associated with (original) states in partition P1 = {1, 3, 5}.
As the original machine receives inputs, the monitor changes state in a way that keeps track of the partition in which the state of the original machine is in. Assuming that the monitor is fault-free,³ one can use this approach to detect faults that cause transitions to a state in an erroneous partition. For example, if the current state of the original machine is 1 (so that the state of the monitor is p1) and input 4 is received, the resulting state of the original machine should be 5. Under input 4, the monitor takes a transition from state p1 to state p1 (agreeing with the fact that the state of the original machine is in partition P1). A state transition fault in the original machine that results into a state in partition P0 (i.e., any one of states 0, 2 and 4) will be detected; a state transition fault that results into a state within partition P1 (i.e., state 1 or 3) will not be detected. Assuming fault-free monitors, the detection of single-bit errors in the digital implementation of Z_6 is guaranteed if the binary encodings of states within the same partition have Hamming⁴ distance greater than one.

EXAMPLE 4.2 An autonomous machine has only one transition from any given state and this transition occurs at the next clock pulse. If the number of states is finite, an autonomous machine will eventually enter a cyclic sequence of states. A cyclic autonomous machine is one whose states form a pure cycle (i.e., there are no transients involved). The state transition table for the cyclic autonomous machine with M states is as follows:

| Current State | Next State |
| 0_M           | 1_M        |
| 1_M           | 2_M        |
| ...           | ...        |
| (M−1)_M       | 0_M        |

This machine is essentially the cyclic group machine Z_M, but with only one allowable input (namely element 1_M) instead of the whole set {0_M, 1_M, 2_M, ..., (M−1)_M}.
Using the algebraic framework and some rather standard results from group theory, one can characterize all possible monitors P for the autonomous machine Z_M: each monitor needs to be a group machine (P, ◇) that is isomorphic to Z_M/N, where N is a normal subgroup of Z_M. The (normal) subgroups of Z_M are cyclic groups of order |N| = D that divides M [Jacobson, 1974] (i.e., N ≅ Z_D for D being a divisor of M). Therefore, the monitors correspond to quotient groups P ≅ Z_M/N = Z_M/Z_D that are cyclic and of order |P| = M/D (that is, P ≅ Z_|P|). Since the only available input to the original machine G is the clock input, P should also be restricted to only having the clock input. Therefore, a monitor for a cyclic autonomous machine with M states is another autonomous cyclic machine whose number of states is a divisor of M.

The discussion in this section established that a group machine (P, ◇) can monitor a machine (G, ∘) if and only if there exists a normal subgroup N of G such that P ≅ G/N. Since N is a normal subgroup of G, according to the decomposition results in Section 2 one can also decompose the original group machine G into an interconnection of a subgroup machine N and a coset leader machine G/N. Therefore, one has arrived at an interesting observation: if this particular decomposition is used, then the monitoring approach will correspond to partial modular redundancy because P is isomorphic to the coset leader machine. Error detection in this special case is straightforward because, as shown in Figure 4.4, faults in P or G/N can be detected by concurrently comparing their corresponding states. The comparison is a simple equality check (up to isomorphism) and an error is detected whenever there is a disagreement. Faults in the subgroup machine N cannot be detected. Note that the error detection and correction capabilities will be different if G is implemented using a different decomposition (or not decomposed at all).
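The separate-monitor scheme of Example 4.1 (Z_6 monitored by Z_2) can be sketched in Python (our illustration; the name theta is ours, the homomorphism is the parity map of the example):

```python
import random

def theta(g):                 # surjective homomorphism Z_6 -> Z_2 (parity)
    return g % 2

q, p = 0, 0                   # states: original machine Z_6 and monitor Z_2
for _ in range(100):
    x = random.randrange(6)
    q = (q + x) % 6           # original machine: modulo-6 addition
    p = (p + theta(x)) % 2    # monitor processes the encoded input theta(x)
    assert theta(q) == p      # fault-free: monitor tracks the partition of q

# A state-transition fault that lands in the wrong partition is detected:
assert theta((q + 3) % 6) != p    # adding 3 flips the parity (coset)
# ...while a fault within the same partition goes undetected:
assert theta((q + 2) % 6) == p    # adding 2 preserves the parity
```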
[Figure 4.4. Relationship between a separate monitor and a decomposed group machine: the monitor P is compared (up to isomorphism) against the coset leader machine G/N of the decomposed original machine.]

EXAMPLE 4.3 Consider the group machine Z_4 = {0, 1, 2, 3} whose next state is given by the modulo-4 sum of its current state and current input. A normal subgroup for Z_4 is given by N = {0, 2} (N ≅ Z_2); the cosets are {0, 2} and {1, 3}, and the resulting coset leader machine Z_4/N ≅ Z_4/Z_2 is isomorphic to Z_2. Due to the interconnectivity provided by the encoder E between the two submachines⁵ (see Figure 4.1) the overall functionality is different from Z_2 × Z_2 (which is expected since Z_4 ≠ Z_2 × Z_2).

Separate monitor P = G/N = Z_4/Z_2 ≅ Z_2 functions as follows: it encodes the inputs in coset {0, 2} into 0_2 and those in {1, 3} into 1_2; then, it adds its current input to its current state modulo 2. Therefore, the functionality of this separate monitor is identical to the coset leader machine in the decomposition described above. As illustrated in Figure 4.4, under this particular decomposition of Z_4, the monitor will only be able to detect faults that cause errors in the least significant bit (i.e., errors in the coset leader machine). Errors in the most significant bit (which correspond to errors in the subgroup machine) will remain completely undetected.

3.2 NON-SEPARATE REDUNDANT IMPLEMENTATIONS FOR GROUP MACHINES
A non-separate redundant implementation for a group machine (G, ∘) uses a larger group machine (H, ◇) that preserves the behavior of G in some non-separately encoded form (as in Figure 4.2).
In the beginning of Section 3 it was argued that such an embedding can be achieved via an injective group homomorphism φ : G → H that is used to encode the inputs and states of machine G into those of machine H. The subset of valid states was given by G′ = φ(G) and was shown to be a subgroup of H. Notice that, if G′ is a normal subgroup of H, then it is possible to decompose H into a series-parallel interconnection of a subgroup machine G′ (isomorphic to G) and a coset leader machine H/G′. If one actually implements H in this decomposed form, then the fault-tolerance scheme attempts to protect the computation in G by performing an isomorphic computation (in the subgroup machine G′) and a coset leader computation H/G′. Faults are detected whenever the overall state of H lies outside G′, that is, whenever the state of the coset leader machine deviates from the identity. Faults in the subgroup machine are not reflected in the state of H/G′ because the coset leader machine is not influenced in any way by the activity in the subgroup machine G′. Therefore, faults in G′ are completely undetected and the only detectable faults are the ones that force H/G′ to a state different from the identity. In effect, the added redundancy can only check for faults within itself rather than for faults in the computation in G′ and turns out to be rather useless for error detection or correction. As demonstrated in the following example, one can avoid this problem by implementing H using a different decomposition; each such decomposition may offer different fault coverage (while keeping the same encoding, decoding and error-correcting procedures).

EXAMPLE 4.4 To provide fault tolerance to machine G = Z_3 = {0, 1, 2} using an aM coding scheme with a = 2, one would multiply its input and state by a factor of 2.
The resulting redundant machine H = Z_6 = {0, 1, ..., 5} performs addition modulo 6; its subgroup of valid states is given by G′ = {0, 2, 4} and is isomorphic to G. The quotient group H/G′ consists of two cosets: {0, 2, 4} and {1, 3, 5}. If one chooses 0 and 1 as the coset leaders, now denoting them by 0_2 and 1_2 to avoid confusion, the coset leader machine is isomorphic to Z_2 and has the following state transition function:

| State \ Input | 0_2 = {0,2,4} | 1_2 = {1,3,5} |
| 0_2           | 0_2           | 1_2           |
| 1_2           | 1_2           | 0_2           |

For this example, the encoder E in Figure 4.1 (which has no internal state and provides the input to the subgroup machine based on the current input and the coset in which the coset leader machine is in) performs the following coding function:

| State \ Input | 0 | 1 | 2 | 3 | 4 | 5 |
| 0_2           | 0 | 0 | 2 | 2 | 4 | 4 |
| 1_2           | 0 | 2 | 2 | 4 | 4 | 0 |

Note that the input to machine H will always be a multiple of 2. Therefore, as is clear from the table, if one starts from the 0_2 coset, one will remain there (at least under fault-free conditions). The input to the subgroup machine will be the same as in the non-redundant machine (only the symbols used will be different: {0, 2, 4} instead of {0, 1, 2}). A fault will be detected whenever the overall state of H does not lie in G′, i.e., whenever the coset leader machine H/G′ is in a state different from 0_2. Since the coset leader machine does not receive any input from the subgroup machine, a deviation from the 0_2 state (coset) reflects a fault in the coset leader machine. Therefore, the redundancy can only be used to check itself and not the original machine.

One gets more satisfying results if H is decomposed in other ways. For example, N_H = {0, 3} is a normal subgroup of H and the corresponding coset decomposition H/N_H consists of three cosets: {0, 3}, {1, 4} and {2, 5}.
The state transition function of the coset leader machine is given by the following table (where coset leaders are denoted by 0_3, 1_3 and 2_3):

| State \ Input | 0_3 = {0,3} | 1_3 = {1,4} | 2_3 = {2,5} |
| 0_3           | 0_3         | 1_3         | 2_3         |
| 1_3           | 1_3         | 2_3         | 0_3         |
| 2_3           | 2_3         | 0_3         | 1_3         |

In this case, the output of the encoder E between the coset leader and the subgroup machine is given by the following table:

| State \ Input | 0 | 1 | 2 | 3 | 4 | 5 |
| 0_3           | 0 | 0 | 0 | 3 | 3 | 3 |
| 1_3           | 0 | 0 | 3 | 3 | 3 | 0 |
| 2_3           | 0 | 3 | 3 | 3 | 0 | 0 |

This situation is quite different from the one described earlier. The valid results under fault-free conditions do not lie in the same coset anymore. Instead, for each state in the subgroup machine, there is exactly one valid state in the coset leader machine. More specifically, the valid states (the ones that comprise the subgroup machine G′) are given by specific pairs (c, n_h) of a state c of the coset leader machine and a state n_h of the subgroup machine N_H. The pairs in this example are given by (0_3, 0), (1_3, 3) and (2_3, 0). This structured redundancy can therefore be exploited to perform error detection and correction.

The analysis in this example can be generalized to all cyclic group machines Z_M that are to be protected through aM coding. The encoding of the states and the inputs involves simple multiplication by a, whereas the computation needs to be reformulated using a group machine decomposition that does not have Z_M as a (normal) subgroup.

The example above illustrates that non-separate redundancy can provide varying degrees of protection depending on the group machine decomposition that is used (or, more generally, on the underlying hardware implementation). This issue is commonly ignored in research on arithmetic codes because the focus is on pre-specified (fixed) hardware implementations.
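The 2M code of Example 4.4 can be checked with a short sketch (our illustration; the names phi and ell are ours): states and inputs of Z_3 are doubled into Z_6, the fault-free state stays in G′ = {0, 2, 4}, and any excursion to an odd state is detectable.

```python
import random

phi = lambda g: 2 * g             # injective homomorphism Z_3 -> Z_6 (a = 2)
ell = lambda h: h // 2            # decoding: 0 -> 0, 2 -> 1, 4 -> 2
VALID = {0, 2, 4}                 # subgroup G' of valid states in H = Z_6

q_g, q_h = 0, 0                   # original machine Z_3, redundant machine Z_6
for _ in range(100):
    x = random.randrange(3)
    q_g = (q_g + x) % 3           # original computation
    q_h = (q_h + phi(x)) % 6      # redundant computation on encoded values
    assert q_h in VALID           # fault-free state stays in G'
    assert ell(q_h) == q_g        # decoding recovers the original state

# Any single step to an odd state (outside G') is detectable:
assert (q_h + 1) % 6 not in VALID
```

Note that this sketch only checks membership in G′; as the example explains, which faults actually drive the state outside G′ depends on how H is decomposed and implemented.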
For example, aM codes were applied to arithmetic circuits with a specific architecture in mind and with the objective of choosing the parameter a so that an acceptable level of error detection/correction was achieved [Peterson and Weldon Jr., 1972; Rao, 1974]. The approach in the above example is different because it characterizes the encoding and decoding mappings abstractly, and allows for the possibility of implementing and decomposing the redundant machine in different ways; each such decomposition results in a different fault coverage. Chapters 5 and 6 illustrate this point more explicitly for hardware implementations of linear time-invariant dynamic systems and linear finite-state machines.

4 REDUNDANT IMPLEMENTATIONS OF SEMIGROUP MACHINES
The development in Section 3 can be generalized to semigroup machines [Hadjicostis, 1999]. For this case, one has the following definition:

DEFINITION 4.3 A redundant implementation for a semigroup machine (S, ∘) is a semigroup machine (H, ◇) for which there exist
(i) an appropriate input encoding mapping ξ : S → H (from S into H), and
(ii) a one-to-one mapping ℓ^(-1) : S → S′ (where S′ = ℓ^(-1)(S) ⊂ H is the subset of valid states),
such that

ℓ^(-1)(s1 ∘ s2) = ℓ^(-1)(s1) ◇ ξ(s2)     (4.2)

for all s1, s2 ∈ S.

Note that when H is properly initialized and fault-free, there is a one-to-one correspondence between the state q_h[t] = h1 of H and the corresponding state q_s[t] = s1 of S. Specifically, q_s[t] = ℓ(q_h[t]) and q_h[t] = ℓ^(-1)(q_s[t]) for all time steps. At the beginning of time step t, input x_s[t] = s2 ∈ S is supplied to machine H encoded via ξ, and the next state of H is given by

q_h[t+1] = h = h1 ◇ ξ(s2)

for some h in H. Since ℓ is one-to-one, Eq. (4.2) implies that h = ℓ^(-1)(s1 ∘ s2) = ℓ^(-1)(s), where s = s1 ∘ s2 is the next state of machine S. Note that h belongs to the subset of valid states S′ = ℓ^(-1)(S) ⊂ H.
Faults cause transitions to invalid states in H; at the end of the time step, the error detector verifies that the newly reached state h is in S′ and, if an error is detected, necessary correction procedures are initiated and completed before the next input is supplied.

DEFINITION 4.4 A semigroup machine is called a reset if it corresponds to a right-zero semigroup R, that is,

r_i ∘ r_j = r_j for all r_i, r_j ∈ R.

A reset-identity machine R^1 = R ∪ {1} corresponds to a right-zero semigroup R with 1 included as the identity. The reset-identity machine R_n^1 denotes a machine with n right zeros {r_1n, r_2n, ..., r_nn} and an identity element 1_n. A permutation-reset machine has a semigroup (S, ∘) that is the union of a set of right zeros R = {r_1, r_2, ..., r_n} and a group G = {g_1, g_2, ..., g_m}. (The product r_i ∘ g_j for i ∈ {1, ..., n} and j ∈ {1, ..., m} is defined to be r_i ∘ g_j = r_k for some k ∈ {1, ..., n}. The remaining products are defined so that G forms a group and R is a set of right zeros.)

A permutation-reset machine can be decomposed into a series-parallel pair with the group machine G at the front-end and the reset-identity machine R^1 = R ∪ {1} at the back-end. This construction can be found in [Arbib, 1968]. The Zeiger decomposition is a special case of the Krohn-Rhodes decomposition. It states that any general semigroup machine S may be broken down into permutation-reset components. All groups involved are homomorphic images of subgroups of S. More details and an outline of the procedure may be found in [Arbib, 1968].

Next, the discussion shifts to redundant implementations for reset-identity machines. By the Zeiger decomposition theorem, these machines together with simple-group machines are the only building blocks needed to construct all possible semigroup machines.
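The reset-identity machine of Definition 4.4 can be modeled in a few lines (our illustration; the integer encoding of the elements is an assumption of the sketch):

```python
def reset_identity_op(a, b, n):
    """Binary operation of the reset-identity machine R_n^1.
    Elements: 0 encodes the identity 1_n; 1..n encode right zeros r_1..r_n."""
    assert 0 <= a <= n and 0 <= b <= n
    return a if b == 0 else b      # a o 1 = a; a o r_j = r_j (right zero)

n = 7
# right-zero law: r_i o r_j = r_j for all right zeros
assert all(reset_identity_op(i, j, n) == j
           for i in range(1, n + 1) for j in range(1, n + 1))
# identity law: 1 o u = u o 1 = u for every element u
assert all(reset_identity_op(0, u, n) == u == reset_identity_op(u, 0, n)
           for u in range(n + 1))
```

In machine terms, the next state is simply the input whenever the input is a right zero, and the state is held whenever the input is the identity.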
4.1 SEPARATE MONITORS FOR RESET-IDENTITY MACHINES
For a right-zero semigroup R, any equivalence relation (i.e., any partitioning of its elements) is a congruence relation [Grillet, 1995]. This result extends easily to the monoid R^1 = R ∪ {1}: any partitioning of the elements of R^1 is a congruence relation, as long as the identity forms its own partition. Using this result, one can characterize and construct all possible (separate) monitors for a given reset-identity machine R^1.

EXAMPLE 4.5 Consider the standard semigroup machine U3 defined in Section 2. Its next-state function is given by the following table:

| State \ Input | 1   | r_1 | r_2 |
| 1             | 1   | r_1 | r_2 |
| r_1           | r_1 | r_1 | r_2 |
| r_2           | r_2 | r_1 | r_2 |

The only possible non-trivial partitioning is {{1}, {r_1, r_2}}; it results in the parity semigroup P = {1_p, r}, defined by the surjective homomorphism θ : U3 → P with θ(1) = 1_p and θ(r_1) = θ(r_2) = r. Note that P is actually isomorphic to U1. As expected, under this monitoring scheme, machine P is simply a coarser version of the original machine U3.

EXAMPLE 4.6 Consider the reset-identity machine R_7^1 = {1_7, r_17, r_27, ..., r_77}. A possible partitioning for it is {{1_7}, {r_17, r_27, ..., r_77}} and it results in the same parity semigroup P = {1_p, r} as in the previous example. The surjective homomorphism θ : R_7^1 → P is given by θ(1_7) = 1_p, θ(r_17) = θ(r_27) = ... = θ(r_77) = r. Other partitionings are also possible as long as the identity forms its own class. This flexibility in the choice of partitioning can be exploited depending on the faults expected in the original machine R_7^1 and the monitor P. For example, if R_7^1 is implemented digitally (each state being encoded to three bits), then one could choose the partitions so that they consist of states whose encodings are separated by large Hamming distances. For example, if the binary encodings for the states of R_7^1 are 000 for the identity, and 001, 010, ...
, 111 for T_17 to T_77 respectively, then an appropriate partitioning could be {P_0 = {000}, P_1 = {001, 010, 100, 111}, P_2 = {011, 101, 110}}. This results in a monitoring machine with semigroup P ≅ U3: state 000 maps to the identity of U3, whereas states in partition P_1 map to T1 and states in partition P_2 map to T2. Under this scheme, one can detect faults that cause single-bit errors in the original machine as long as the monitoring machine operates correctly (to see this, notice that the Hamming distance within each of the partitions is larger than 1). The scheme above can be made c-error correcting by ensuring that the Hamming distance within any partition is at least 2c + 1 (still assuming no faults in the monitoring machine). Under more restrictive fault models, other partitionings could be more effective. For example, if faults in a given implementation cause bits to stick at "1", then one should aim for partitions with states separated by a large asymmetric distance [Rao, 1974].

4.2 NON-SEPARATE REDUNDANT IMPLEMENTATIONS FOR RESET-IDENTITY MACHINES

A non-separate redundant implementation of a reset-identity machine R_n^1 can be based on an injective semigroup homomorphism φ : R_n^1 → H that reflects the state and input of R_n^1 into a larger semigroup machine H so that Eq. (4.2) is satisfied. Under proper initialization and fault-free conditions, machine H simulates the reset-identity machine R_n^1; furthermore, since φ is injective, there exists a mapping φ^{-1} that can decode the state of H into the corresponding state of R_n^1. An interesting case occurs when the monoid R_n^1 = {1_n, T_1n, T_2n, ..., T_nn} is homomorphically embedded into a larger monoid R_m^1 = {1_m, T_1m, T_2m, ..., T_mm} for m > n (i.e., when H = R_m^1). The homomorphism φ : R_n^1 → R_m^1 is given by φ(1_n) = 1_m and φ(T_in) ≠ φ(T_jn) for i ≠ j, i, j in {1, 2, ..., n}.
Clearly, φ is injective and there is a one-to-one decoding mapping from the subsemigroup φ(R_n^1) ⊂ R_m^1 onto R_n^1. Assuming that the system is implemented digitally (i.e., each state is encoded as a binary vector), then, in order to protect against single-bit errors, one would need to ensure that the encodings of the states in the set of valid results φ(R_n^1) are separated by large Hamming distances. Bit errors can be detected by checking whether the resulting encoding is in φ(R_n^1).

EXAMPLE 4.7 One way to add redundancy into the semigroup machine R_2^1 = {1_2, T_12, T_22} is by mapping it into machine R_7^1. Any mapping φ of the form φ(1_2) = 1_7, φ(T_12) = T_i7 and φ(T_22) = T_j7 (j, i in {1, 2, ..., 7}, j ≠ i) is a valid embedding. In order to achieve detection of single faults, each fault needs to result in a state outside the set of valid results S'. If machine R_7^1 is implemented digitally (with its states encoded into 3-bit binary vectors), faults that result in single-bit errors can be detected by choosing the encodings for φ(1_2) = 1_7, φ(T_12) = T_i7 and φ(T_22) = T_j7 to be separated by a Hamming distance of at least 2 (e.g., 001 for 1_7, 010 for T_i7 and 100 for T_j7).

5 SUMMARY

This chapter described redundant implementations for algebraic machines (group and semigroup machines). The approach was hardware-independent and resulted in redundant implementations that are based on algebraic homomorphisms; explicit connections with hardware faults and fault models were not made. Using these techniques, one can take advantage of algebraic structure in order to analyze procedures for error correction and avoid decompositions under which faults in the original machine are always undetectable.

Notes

1 A similar argument can be made for left cosets.
2 Note that a group machine does not necessarily correspond to an abelian group.
3 This assumption is realistic if the hardware implementation of the monitor is considerably simpler than the implementation of the actual machine.
4 The Hamming distance between two binary vectors (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n) is the number of positions at which they differ. The minimum Hamming distance of a given set of binary vectors of length n is the minimum distance between any pair of vectors in the set.
5 The output of the encoder E in Figure 4.1 is based on the state of the coset leader machine and the overall input (g_2). In this particular example the output functions like the carry bit in a binary adder: the coset leader machine performs the addition of the least significant bits, whereas the subgroup machine deals with the most significant bits.

References

Abraham, J. A. and Fuchs, W. K. (1986). Fault and error models for VLSI. Proceedings of the IEEE, 74(5):639-654.
Arbib, M. A., editor (1968). Algebraic Theory of Machines, Languages, and Semigroups. Academic Press, New York.
Arbib, M. A. (1969). Theories of Abstract Automata. Prentice-Hall, Englewood Cliffs, New Jersey.
Ginzburg, A. (1968). Algebraic Theory of Automata. Academic Press, New York.
Grillet, P. A. (1995). Semigroups. Marcel Dekker Inc., New York.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Iyengar, V. S. and Kinney, L. L. (1985). Concurrent fault detection in microprogrammed control units. IEEE Transactions on Computers, 34(9):810-821.
Jacobson, N. (1974). Basic Algebra I. W. H. Freeman and Company, San Francisco.
Johnson, B. (1989). Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley, Reading, Massachusetts.
Leveugle, R., Koren, Z., Koren, I., Saucier, G., and Wehn, N. (1994). The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis.
IEEE Transactions on Computers, 43(12):1398-1406.
Leveugle, R. and Saucier, G. (1990). Optimized synthesis of concurrently checked controllers. IEEE Transactions on Computers, 39(4):419-425.
Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1991). A methodology for designing optimal self-checking sequential circuits. In Proceedings of the Int. Conf. VLSI Design, pages 283-291. IEEE CS Press.
Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1995). Concurrent error detection using monitoring machines. IEEE Design and Test of Computers, 12(3):24-32.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Rao, T. R. N. (1974). Error Coding for Arithmetic Processors. Academic Press, New York.
Robinson, S. H. and Shen, J. P. (1992). Direct methods for synthesis of self-monitoring state machines. In Proceedings of the 22nd Fault-Tolerant Computing Symp., pages 306-315. IEEE CS Press.

Chapter 5

REDUNDANT IMPLEMENTATIONS OF DISCRETE-TIME LINEAR TIME-INVARIANT DYNAMIC SYSTEMS

1 INTRODUCTION

This chapter discusses fault tolerance in discrete-time linear time-invariant (LTI) dynamic systems [Hadjicostis and Verghese, 1997; Hadjicostis and Verghese, 1999; Hadjicostis, 1999]. It focuses on redundant implementations that reflect the state of the original system into a larger LTI dynamic system in a linearly encoded form. In essence, this chapter restricts attention to discrete-time LTI dynamic systems and linear coding techniques, both of which are rather standard and well-developed topics in system theory and coding theory respectively. Interestingly enough, the combination of linear dynamics and coding reveals some novel aspects of the problem, as summarized by the characterization of the class of appropriate redundant implementations given in Theorem 5.1.
In most of the fault-tolerance schemes discussed, error detection and correction is performed at the end of each time step, although examples of non-concurrent schemes are also presented [Hadjicostis, 2000; Hadjicostis, 2001]. The restriction to LTI dynamic systems allows the development of an explicit mapping to a hardware implementation and an appropriate fault model. More specifically, the hardware implementations of the fault-tolerant systems that are constructed in this chapter are based on a certain class of signal flow graphs (i.e., interconnections of delay, adder and gain elements) which allow each fault in a system component (adder or multiplier) to be modeled as a corruption of a single variable in the state vector.

2 DISCRETE-TIME LTI DYNAMIC SYSTEMS

Linear time-invariant (LTI) dynamic systems are used in digital filter design, system simulation, model-based control, and other applications [Luenberger, 1979; Kailath, 1980; Roberts and Mullis, 1987]. The state evolution and output of an LTI dynamic system S are given by

    q_s[t+1] = A q_s[t] + B x[t] ,    (5.1)
    y[t]     = C q_s[t] + D x[t] ,    (5.2)

where t is the discrete-time index, q_s[t] is the d-dimensional state vector, x[t] is the u-dimensional input vector, y[t] is the v-dimensional output vector, and A, B, C, D are constant matrices of appropriate dimensions. All vectors and matrices have real numbers as entries. Equivalent state-space models (with d-dimensional state vector q'_s[t] and with the same input and output vectors) can be obtained through similarity transformation as described in [Luenberger, 1979; Kailath, 1980]:

    q'_s[t+1] = (T^{-1} A T) q'_s[t] + (T^{-1} B) x[t] = A' q'_s[t] + B' x[t] ,
    y[t]      = (C T) q'_s[t] + D x[t]                = C' q'_s[t] + D' x[t] ,

where T is an invertible d x d matrix such that q_s[t] = T q'_s[t] (so that A' = T^{-1} A T, B' = T^{-1} B, C' = C T and D' = D). The initial conditions for the transformed system can be obtained as q'_s[0] = T^{-1} q_s[0]. Systems related in such a way are known as similar systems.
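As a quick numerical illustration of similarity (the matrices below are made up for the sketch, not taken from the text), one can simulate a system and a similar system side by side and confirm that their states remain related by q_s[t] = T q'_s[t]:

```python
import numpy as np

# illustrative 2-state system S: q[t+1] = A q[t] + B x[t]
A = np.array([[0.2, 0.1],
              [0.0, 0.5]])
B = np.array([[1.0],
              [2.0]])
T = np.array([[1.0, 1.0],
              [0.0, 1.0]])          # any invertible T defines a similar system
Tinv = np.linalg.inv(T)
Ap, Bp = Tinv @ A @ T, Tinv @ B     # A' = T^-1 A T, B' = T^-1 B

q = np.array([[1.0], [-1.0]])       # q_s[0]
qp = Tinv @ q                       # q'_s[0] = T^-1 q_s[0]
for t in range(20):
    x = np.array([[np.sin(0.3 * t)]])
    q, qp = A @ q + B @ x, Ap @ qp + Bp @ x
```

Since both trajectories are driven by the same input, q_s[t] = T q'_s[t] holds at every step: the two realizations compute the same input-output behavior in different internal coordinates.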
3 CHARACTERIZATION OF REDUNDANT IMPLEMENTATIONS

A redundant implementation for the LTI dynamic system S [with state evolution as in Eq. (5.1)] is an LTI dynamic system H with dimension η (η = d + s, s > 0) and state evolution

    q_h[t+1] = A_h q_h[t] + B_h x[t] .    (5.3)

The initial state q_h[0], and matrices A_h and B_h of the redundant system H are chosen so that there exists an appropriate one-to-one decoding mapping ℓ such that, during fault-free operation,

    q_s[t] = ℓ(q_h[t]) for all t >= 0

[Hadjicostis and Verghese, 1997; Hadjicostis and Verghese, 1999; Hadjicostis, 1999]. Note that according to the setup in Section 3 of Chapter 1, ℓ is required to be one-to-one and is only defined from the subset of valid states V (i.e., the set of states in H that are obtainable under fault-free conditions). This means that each valid state q_h[t] in V of the redundant system at any time step t corresponds to a unique state q_s[t] of system S; in other words, q_s[t] = ℓ(q_h[t]) for each q_h[t] in V.

Note that faults in the implementation of the output [see Eq. (5.2)] affect the output at a particular time step but have no propagation effects. For this reason, they can be treated as faults in a combinational circuit and are not discussed here. Instead, the analysis in this chapter focuses on protecting the mechanism which performs the state evolution in Eq. (5.1). To achieve fault tolerance, the state q_h[t] is encoded using a linear code. In other words, it is assumed that, under proper initialization and fault-free conditions, there exist

- a d x η decoding matrix L such that q_s[t] = L q_h[t] for all t >= 0, q_h[·] in V, and
- an η x d encoding matrix G such that q_h[t] = G q_s[t] for all t >= 0.

The error detector/corrector does not have access to previous states or inputs and has to make a decision at the end of each time step based solely on the state q_h[t] of the redundant system.
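For a toy instance of such a linear code (the checksum-style matrices below are illustrative, not from the text), the defining identities L G = I_d and P G = 0 can be verified numerically:

```python
import numpy as np

# illustrative linear code for a d = 2 system with s = 1 redundant variable
G = np.array([[1., 0.],
              [0., 1.],
              [1., 1.]])            # eta x d encoding matrix (last row: a checksum)
L = np.array([[1., 0., 0.],
              [0., 1., 0.]])        # d x eta decoding matrix
P = np.array([[-1., -1., 1.]])      # parity check matrix satisfying P G = 0

q_s = np.array([[3.], [4.]])
q_h = G @ q_s                       # a valid (fault-free) redundant state
```

Because L G = I_d, decoding a valid state recovers q_s exactly, and because P G = 0, every valid state passes the parity check; any fault that pushes q_h off the column space of G produces a nonzero P q_h.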
Since the construction of H and the choice of initial condition ensure that q_h[t] = G q_s[t] under fault-free conditions, the error detection strategy only needs to verify that the redundant state vector q_h[t] is in the column space of G. Equivalently, one can check that q_h[t] is in the null space of an appropriate parity check matrix P (so that P q_h[t] = 0 under fault-free conditions). Any fault that forces the state q_h[t] to fall outside the column space of G (producing a nonzero parity check p[t] = P q_h[t]) will be detected. For example, a corruption of the ith state variable at time step t will produce an erroneous state vector

    q_f[t] = q_h[t] + e_i ,

where q_h[t] is the state vector that would have been obtained under fault-free conditions and e_i is a column vector with a unique nonzero entry at the ith position with value α. The parity check at time step t will then be

    p[t] = P (q_h[t] + e_i) = P q_h[t] + P e_i = P e_i = α P(:, i) ,

where P(:, i) denotes the ith column of matrix P. Single-error correction will be possible if the columns of P are not multiples of each other. If this condition is satisfied, one can locate and correct the corrupted state variable by identifying the column of P that is a multiple of p[t]. The underlying assumption in this discussion is that the error-detecting and correcting mechanism is fault-free. This assumption is justified if the evaluation of P q_h[t], and all actions that may be subsequently required for error correction, are considerably less complex than the evaluation of A_h q_h[t] + B_h x[t]. This would be the case, for example, if the size of P is much smaller than the size of A_h, or if P requires simpler operations.

THEOREM 5.1 In the setting described above, the system H [of dimension η = d + s, s > 0, and state evolution as in Eq. (5.3)] is a redundant implementation of S if and only if it is similar to a standard redundant system H_σ whose state evolution equation is given by

    q_σ[t+1] = [ A  A12 ] q_σ[t] + [ B ] x[t] .
               [ 0  A22 ]          [ 0 ]
(5.4)

Here, A and B are the matrices in Eq. (5.1), A22 is an s x s matrix that describes the dynamics of the redundant modes that have been added, and A12 is a d x s matrix that describes the coupling from the redundant to the non-redundant modes. Associated with this standard redundant system are the standard decoding matrix L_σ = [I_d 0], the standard encoding matrix

    G_σ = [ I_d ]
          [  0  ]

and the standard parity check matrix P_σ = [0 I_s].

Proof: Let H be a redundant implementation of S. Under fault-free conditions, L G q_s[·] = L q_h[·] = q_s[·]. Since the initial state q_s[0] could be any state, one concludes that L G = I_d. This implies that L is full-row rank and G is full-column rank, and that there exists an invertible η x η matrix T such that L T = [I_d 0] and

    T^{-1} G = [ I_d ]
               [  0  ]

[Hadjicostis and Verghese, 1997; Hadjicostis and Verghese, 1999; Hadjicostis, 1999]. If one applies the transformation q_h[t] = T q_h'[t] to system H, the resulting similar system H' has decoding mapping L' = L T = [I_d 0] and encoding mapping

    G' = T^{-1} G = [ I_d ]
                    [  0  ] .

The state evolution of the redundant system H' is given by

    q_h'[t+1] = (T^{-1} A_h T) q_h'[t] + (T^{-1} B_h) x[t] = A' q_h'[t] + B' x[t] .    (5.5)

For all time steps t and under fault-free conditions,

    q_h'[t] = G' q_s[t] = [ q_s[t] ]
                          [   0    ] .

Combining the state evolution equations of the original and redundant systems (Eqs. (5.1) and (5.5) respectively), one obtains

    [ q_s[t+1] ] = [ A'_11  A'_12 ] [ q_s[t] ] + [ B'_1 ] x[t] .
    [    0     ]   [ A'_21  A'_22 ] [   0    ]   [ B'_2 ]

By setting the input x[t] = 0 for all t, one concludes that A'_11 = A and A'_21 = 0. With the input now allowed to be nonzero, one can deduce that B'_1 = B and B'_2 = 0. The system H' is therefore in the form of the standard system H_σ in Eq. (5.4), with appropriate decoding, encoding and parity check matrices. The converse, namely that if H is similar to a standard H_σ as in Eq. (5.4), then it is a redundant implementation of the system in Eq.
(5.1), is easy to show [Hadjicostis and Verghese, 1997; Hadjicostis, 1999].

Theorem 5.1 establishes a complete characterization of all possible redundant implementations for a given LTI dynamic system, subject to the restriction that linear encoding and decoding techniques are used. The additional modes introduced by the redundancy never get excited under fault-free conditions because they are initialized to zero and because they are unreachable from the input. Due to the existence of the coupling matrix A12, however, the additional modes are not necessarily unobservable through the decoding matrix. The continuous-time version of Theorem 5.1 essentially appears in [Ikeda and Siljak, 1984], although the proof and the motivation are very different.

4 HARDWARE IMPLEMENTATION AND FAULT MODEL

In order to demonstrate the implications of Theorem 5.1 for fault tolerance, a more detailed discussion of the hardware implementation and the corresponding fault model is needed. The assumption made here is that the LTI dynamic systems of interest [e.g., system S of Eq. (5.1) or system H of Eq. (5.3)] are implemented using appropriately interconnected delays (memory elements), adders and gain elements (multipliers). These implementations can be represented by signal flow graphs or, equivalently, by delay-adder-gain diagrams. Both are shown in Figure 5.1 for an LTI dynamic system with state evolution

    q[t+1] = [ q_1[t+1] ] = [  0    1  ] q[t] + [ 1 ] x[t] .
             [ q_2[t+1] ]   [ α_2  α_1 ]        [ 0 ]

Figure 5.1. Delay-adder-gain implementation and the corresponding signal flow graph for an LTI dynamic system.

Nodes in a signal flow graph sum up all of their incoming arcs; delays are represented by arcs labeled with z^{-1}. The analysis in this chapter considers both transient and permanent faults in the gains and adders of hardware implementations.
A transient fault at time step t causes errors at that particular time step but disappears at the following ones. Therefore, if the errors are corrected before the initiation of time step t+1, the system will resume its normal mode of operation. A permanent fault, on the other hand, causes errors at all remaining time steps. Notice that a permanent fault can be treated as a transient fault for each of the remaining time steps (assuming successful error correction at the end of every time step), but in certain cases one can deal with it in more efficient ways (e.g., by reconfiguring the system around the faulty component).

A given state evolution equation has multiple possible implementations using delay, adder and gain elements [Roberts and Mullis, 1987]. In order to define a unique mapping from a state evolution equation to a hardware implementation, one can focus on implementations whose signal flow graphs have delay-free paths of unit length. In other words, any path that does not include a delay has to have unit length (the signal flow graph in Figure 5.1 is one such example). One can verify that for implementations whose signal flow graphs have delay-free paths of unit length, the entries of the matrices in the state evolution equation are directly reflected as gain constants in the signal flow graph [Roberts and Mullis, 1987]. In addition to the above property, each of the variables in the next-state vector q[t+1] is calculated using separate gain and adder elements (sharing only the input x[t] and the variables in the previous state vector q[t]). This means that a fault in a single gain element or in a single adder during time step t will result in the corruption of a single state variable in the state vector q[t+1] (if the error is not accounted for, many more variables may be corrupted at later time steps).
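This fault model is easy to mimic in software (the gain values below are illustrative): each next-state variable is computed by its own multiply-accumulate chain, so injecting a fault into one gain corrupts exactly one entry of q[t+1]:

```python
import numpy as np

A = np.array([[0.0, 1.0],
              [-0.25, 0.5]])        # illustrative gain constants
b = np.array([1.0, 0.0])

def next_state(q, x, faulty_gain=None, offset=0.0):
    """Row-by-row update with separate gains/adders per next-state variable.
    faulty_gain = (i, j) perturbs the single gain A[i, j] by `offset`."""
    q_next = np.zeros_like(q)
    for i in range(len(q)):         # row i has its own adders and gain elements
        acc = b[i] * x
        for j in range(len(q)):
            gain = A[i, j]
            if faulty_gain == (i, j):
                gain += offset      # fault in exactly one gain element
            acc += gain * q[j]
        q_next[i] = acc
    return q_next

q = np.array([1.0, 2.0])
good = next_state(q, 0.5)
bad = next_state(q, 0.5, faulty_gain=(1, 0), offset=0.1)
corrupted = np.flatnonzero(~np.isclose(good, bad))
```

Only the state variable whose computation uses the faulty gain differs; the rest of q[t+1] is untouched, which is exactly the single-variable corruption assumed in the fault model.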
In fact, any combination of faults in the gains or adders that are used for the calculation of the next value of the ith state variable will only result in the corruption of the ith state variable. More general descriptions can be studied via factored state variable techniques [Roberts and Mullis, 1987], by employing the computation trees in [Chatterjee and d'Abreu, 1993], or by using the techniques that will be discussed in Example 5.5; in these implementations, however, a single fault may corrupt multiple state variables, so one has to be careful when developing the fault model.

Note that according to the assumptions in this section, the standard redundant system H_σ of Theorem 5.1 cannot be used to provide fault tolerance to system S. Since hardware implementations employ delay-adder-gain circuits that have delay-free paths of unit length, the implementation of H_σ will result in a system that only identifies faults in the redundant part of the system. The reason is that state variables in the lower part of q_σ[·] are not influenced by variables in the upper part, and the parity check matrix, given by P_σ = [0 I_s], only identifies faults in the added subsystem. The situation is similar to the one in Example 4.4 in Chapter 4, where, under a particular group machine decomposition, redundancy was useless because it was essentially protecting itself. The objective should be to use the redundancy to protect the original system, not to protect the redundancy itself. Theorem 5.1 is important, however, because it provides a systematic way for searching among possible redundant implementations for system S. Specifically, Theorem 5.1 characterizes all possible redundant implementations that have the given (fixed) encoding, decoding and parity check matrices (G, L and P respectively). Since the choice of matrices A12 and A22 is completely free, there is an infinite number of redundant implementations for system S.
All of them have the same encoding, decoding and parity check matrices, and offer the same concurrent error detection and correction capabilities: depending on the redundancy in the parity check matrix P, all of these implementations can detect and/or correct the same number of errors in the state vector q_h[t].

5 EXAMPLES OF FAULT-TOLERANT SYSTEMS

This section discusses the implications of Theorem 5.1 through several examples.

EXAMPLE 5.1 Consider the following original system S:

    q_s[t+1] = [ .2   0   0   0 ]          [  3 ]
               [  0  .5   0   0 ] q_s[t] + [ -1 ] x[t] .
               [  0   0  .1   0 ]          [  7 ]
               [  0   0   0  .6 ]          [  0 ]

One possibility for protecting this system against a single transient fault in a gain element or in an adder is to use three additional state variables. More specifically, the standard redundant system can be

    q_σ[t+1] = [ A   0  ] q_σ[t] + [ B ] x[t] ,
               [ 0  A22 ]          [ 0 ]

i.e., with A and B as above,

    A12 = 0 ,   A22 = [ .2   0   0 ]
                      [  0  .5   0 ]
                      [  0   0  .3 ] .

The parity check matrix of the standard implementation is given by

    P_σ = [ 0 | I_3 ] = [ 0 0 0 0 1 0 0 ]
                        [ 0 0 0 0 0 1 0 ]
                        [ 0 0 0 0 0 0 1 ] .

For error detection, one needs to check whether P_σ q_σ[t] is 0. However, as argued earlier, redundant systems in standard form cannot be used for detecting faults that cause errors in the original state variables: given an erroneous state vector q_f[t], a nonzero parity check (P_σ q_f[t] ≠ 0) would simply mean that a fault has resulted in an error in the calculation of the redundant variables. The goal is to protect against errors that affect the original system (i.e., errors that appear in the original variables).

    Parity check p^T[t] = [p_1[t]  p_2[t]  p_3[t]]    Erroneous state variable
    [ c  c  c ]                                       q_1
    [ c  c  0 ]                                       q_2
    [ c  0  c ]                                       q_3
    [ 0  c  c ]                                       q_4
    [ c  0  0 ]                                       q_5
    [ 0  c  0 ]                                       q_6
    [ 0  0  c ]                                       q_7

    Table 5.1. Syndrome-based error detection and identification in Example 5.1. (An entry c denotes a nonzero value.)

One way to achieve this is to employ a system
similar to the standard redundant system, but with parity check matrix

    P = [ 1 1 1 0 1 0 0 ]
        [ 1 1 0 1 0 1 0 ]    (5.6)
        [ 1 0 1 1 0 0 1 ] .

(This choice of P is motivated by the structure of Hamming codes in communications [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995].) With a suitable similarity transformation T chosen so that P T = P_σ, the corresponding redundant system is

    q_h[t+1] = (T^{-1} A_σ T) q_h[t] + (T^{-1} B_σ) x[t] ,    (5.7)

where T^{-1} B_σ = [3  -1  7  0  -9  -2  -10]^T (the numerical entries of the 7 x 7 state matrix T^{-1} A_σ T depend on the particular transformation T chosen).

The above system can be used to detect and locate transient faults that cause the value of a single state variable to be incorrect at a particular time step. Under fault-free conditions, the parity vector p[t] = P q_h[t] is 0; furthermore, any fault that results in the corruption of a single state variable can be identified as shown in Table 5.1. For example, if p_1[t] ≠ 0, p_2[t] ≠ 0 and p_3[t] ≠ 0, then one can conclude that a fault has corrupted q_1[t], the value of the first state variable in q_h[t]; if p_1[t] ≠ 0, p_2[t] ≠ 0 and p_3[t] = 0, then a fault has corrupted q_2[t]; and so forth. Once the erroneous variable is located, correction can be based on any of the parity equations that involve the erroneous state variable. For example, if q_2[t] is corrupted, one can calculate its correct value by setting q_2[t] = -q_1[t] - q_3[t] - q_5[t] (i.e., using the parity equation defined by the first row of matrix P). If faults are transient, the operation of the system will resume normally in the following time steps.

Hamming codes like the ones used in the above example allow for the correction of faults that cause an error in a single variable of the state vector.
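The detection-and-correction procedure of Example 5.1 can be exercised end to end. The construction below is a sketch consistent with the example (an assumed similarity choice in which the redundant variables track C q_s[t], with C read off from P; not necessarily the book's exact transformation), using the system matrices of Example 5.1 and the P of Eq. (5.6):

```python
import numpy as np

A = np.diag([0.2, 0.5, 0.1, 0.6])
B = np.array([[3.0], [-1.0], [7.0], [0.0]])
A22 = np.diag([0.2, 0.5, 0.3])
P = np.array([[1., 1., 1., 0., 1., 0., 0.],
              [1., 1., 0., 1., 0., 1., 0.],
              [1., 0., 1., 1., 0., 0., 1.]])
C = -P[:, :4]                        # so that P = [-C  I_3]

# one valid redundant implementation (assumed form): redundant rows track C q_s
Ah = np.block([[A, np.zeros((4, 3))],
               [C @ A - A22 @ C, A22]])
Bh = np.vstack([B, C @ B])           # works out to [3 -1 7 0 -9 -2 -10]^T

qs0 = np.ones((4, 1))
qh = np.vstack([qs0, C @ qs0])       # properly initialized redundant state
for t in range(5):                   # fault-free run: syndrome stays zero
    qh = Ah @ qh + Bh * np.cos(t)
syndrome_ok = np.allclose(P @ qh, 0)

qf = qh.copy()
qf[2] += 0.7                         # transient fault corrupts q_3 (index 2)
p = (P @ qf).ravel()
pattern = tuple(int(abs(v) > 1e-9) for v in p)
# Table 5.1 logic: match the zero/nonzero pattern of p[t] to a column of P
loc = next(i for i in range(7)
           if tuple(int(abs(v) > 1e-9) for v in P[:, i]) == pattern)
row = np.flatnonzero(P[:, loc])[0]   # any parity row involving the variable
qf[loc] -= p[row] / P[row, loc]      # correct the located variable
recovered = np.allclose(qf, qh)
```

The parity vector stays zero on fault-free trajectories because P B_h = 0 and P A_h = A22 P, and the corrupted variable is located from the zero/nonzero pattern of the syndrome exactly as in Table 5.1.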
Instead of replicating the whole system (as would be required by a modular redundancy approach), one only needs s additional state variables: as long as 2^s - 1 >= η (where η = d + s is the dimension of the redundant system), one can guarantee the existence of a Hamming code and a redundant implementation that achieves single-error correction. An alternative approach is developed in [Chatterjee and d'Abreu, 1993], where the authors used real-number coding schemes and achieved single-error correction with only two additional state variables. The methods used in [Chatterjee and d'Abreu, 1993] (as well as in [Chatterjee, 1991], where one of the authors of [Chatterjee and d'Abreu, 1993] analyzes the continuous-time case) do not consider different similarity transformations and do not permit the additional modes to be nonzero. The following example illustrates some of the advantages obtained by using nonzero redundant modes.

Figure 5.2. State evolution equation and hardware implementation of the digital filter in Example 5.2.

Figure 5.3. Redundant implementation based on a checksum condition.

EXAMPLE 5.2 Consider the LTI system with state evolution equation and hardware implementation as shown in Figure 5.2. Since the corresponding signal flow graph has delay-free paths of unit length, the entries of A and b are reflected directly as gains in the diagram (entries that are either "0" or "1" do not appear explicitly as gain elements). Furthermore, the only variables that are shared when calculating the next-state vector are the input and the previous state vector; no hardware is shared during the update of different state variables. In order to detect a single fault in a gain element or in an adder, one can use an extra "checksum" state variable [Huang and Abraham, 1984; Chatterjee and
d'Abreu, 1993]. The resulting redundant implementation H has state evolution

    q_h[t+1] = [   A    0 ] q_h[t] + [   b   ] x[t]
               [ c^T A  0 ]          [ c^T b ]

with c^T = [1 1 1 1]. The corresponding delay-adder-gain implementation is shown in Figure 5.3. Note that there are a number of different delay-adder-gain diagrams that are consistent with the above state evolution equation; the one shown in Figure 5.3 is the only one consistent with the requirement that signal flow graphs have delay-free paths of unit length. Under fault-free conditions, the first four state variables are the same as the original state variables in system S; the additional state variable is always equal to the sum of these four state variables. Error detection is based on verifying the validity of this checksum condition at the end of each time step; no complicated multiplications are involved, which may make it reasonable to assume that error detection is fault-free.

The above approach is seen to be consistent with the setup described in this chapter. Specifically, the encoding, decoding and parity check matrices are given by

    G = [ I_4 ] = [ 1 0 0 0 ]      L = [ I_4 | 0 ] = [ 1 0 0 0 0 ]
        [ c^T ]   [ 0 1 0 0 ]                        [ 0 1 0 0 0 ]
                  [ 0 0 1 0 ]                        [ 0 0 1 0 0 ]
                  [ 0 0 0 1 ]                        [ 0 0 0 1 0 ]
                  [ 1 1 1 1 ]

    P = [ -c^T | 1 ] = [ -1 -1 -1 -1 1 ] .

Furthermore, if one uses the transformation matrix

    T = [ I_4   0 ]
        [ -c^T  1 ] ,

one can show that system H is similar to a standard system H_σ with state evolution

    q_σ[t+1] = [ A  0 ] q_σ[t] + [ b ] x[t] ,
               [ 0  0 ]          [ 0 ]

where A, b are the matrices in Figure 5.2. Notice that A12 and A22 in Eq. (5.4) of Theorem 5.1 have been set to zero. As stated earlier, with each choice of A12 and A22, there is a different redundant implementation with the same encoding, decoding and parity check matrices.

Figure 5.4. Second redundant implementation based on a checksum condition.
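The checksum construction can be made concrete with a small simulation (the fourth-order companion-form coefficients below are assumed for illustration, since Figure 5.2's exact values are not reproduced here): the fifth state variable tracks the sum of the other four, so any single corrupted variable violates the checksum.

```python
import numpy as np

# illustrative 4th-order filter in companion form (coefficients assumed)
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [-0.25, 0.5, -0.25, 0.5]])
b = np.array([[1.], [0.], [0.], [0.]])
c = np.ones((1, 4))                  # checksum weights c^T = [1 1 1 1]

# redundant implementation: extra row computes the checksum of the next state
Ah = np.block([[A, np.zeros((4, 1))],
               [c @ A, np.zeros((1, 1))]])
bh = np.vstack([b, c @ b])

qh = np.zeros((5, 1))                # proper initialization: checksum holds at t = 0
for t in range(10):
    qh = Ah @ qh + bh * np.sin(0.2 * t)
checksum_holds = np.isclose(qh[4, 0], qh[:4, 0].sum())

qf = qh.copy()
qf[1, 0] += 1e-3                     # a single corrupted state variable
checksum_violated = not np.isclose(qf[4, 0], qf[:4, 0].sum())
```

Checking the checksum involves only additions and one comparison per step, which is what makes the fault-free-detector assumption plausible for this scheme.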
If, for example, one sets A12 = 0, A22 = [1] and then transforms back [Hadjicostis, 1999], the resulting redundant implementation H' has state evolution equation

    q_h'[t+1] = [      A       0 ] q_h'[t] + [   b   ] x[t] .
                [ c^T A - c^T  1 ]           [ c^T b ]

The corresponding hardware implementation is shown in Figure 5.4. Both redundant implementations H and H' have the same encoding, decoding and parity check matrices. Both are able to detect faults that corrupt a single state variable, such as a single fault in an adder or in a gain element. The (added) complexity in system H', however, is lower than that in system H (because the computation of the redundant state variable is less involved). More generally, as illustrated in this example for the case of a nonzero A22, one can explore different versions of redundant implementations by exploiting the dynamics of the redundant modes (A22) and/or their coupling with the original system (A12). For certain choices of A12 and A22, one may get designs that utilize less hardware than others, or designs that can perform non-concurrent error detection and correction, as shown in the next example.

EXAMPLE 5.3 This example shows how nonzero redundant modes can be used to construct parity checks with "memory" [Hadjicostis, 2000]. The resulting parity checks "remember" an error and allow one to perform checking periodically (non-concurrently). Instead of checking at the end of each time step, one only checks once every N time steps and is still able to detect and identify transient faults that took place at earlier time steps.

Suppose that S is an LTI dynamic system as in Eq. (5.1). Starting from the standard redundant system H_σ in Eq.
(5.4) with A12 = 0 and using the similarity transformation q_σ[t] = T q_h[t], where

    T = [ I_d   0  ]
        [ -C   I_s ]

and C is an s x d matrix, one obtains the following redundant implementation H:

    q_h[t+1] = [      A        0  ] q_h[t] + [  B  ] x[t] ,
               [ C A - A22 C  A22 ]          [ C B ]

with encoding, decoding and parity check matrices given by

    G = T^{-1} G_σ = [ I_d ] ,    L = L_σ T = [ I_d  0 ] ,    P = P_σ T = [ -C  I_s ] .
                     [  C  ]

Suppose that a transient fault (e.g., noise) at time step t corrupts the state of system H so that

    q_f[t] = q_h[t] + e ,

where q_h[t] is the state that would have been obtained under fault-free conditions and e is an additive error vector that models the effect of the transient fault. If the parity check is performed at the end of time step t, the following syndrome is obtained:

    p[t] = P q_f[t] = P q_h[t] + P e = 0 + P e = [ -C  I_s ] e .

For a transient fault that affects a single variable in the state vector, e will be a vector with a single nonzero entry. Therefore, one will be able to detect, identify and correct errors as long as the columns of P = [ -C  I_s ] are not multiples of each other. For example, if P is the parity check matrix of a Hamming code (as in Eq. (5.6) of Example 5.1), then one can easily perform error correction by first identifying the column of P that is a multiple of the obtained syndrome p[t], then determining e, and finally making the appropriate adjustment to the corrupted state variable.

When the parity check is performed only periodically (e.g., once every N time steps), the syndrome at time step t, given a fault at time step t - m (0 <= m <= N - 1), will be

    p[t] = A22^m [ -C  I_s ] e

(assuming no other transient faults occurred between time steps t - m and t). If A22 = 0, then the parity check will be 0 for any m >= 1 (i.e., e will go undetected). More generally, however, one can choose A22 so that the parity check will be nonzero. For example, if A22 = I_s, then the syndrome is the same as the one that would have been obtained at time step t - m.
The problem is, of course, that m is unknown: even though the initial error has been identified, it cannot be corrected because one does not know when it took place. This situation can be remedied if a different matrix A22 is chosen. For example, if P is the parity check matrix of a Hamming code and A22 is a diagonal matrix with distinct nonzero entries, e.g.,

  A22 = [ 1    0    0
          0   1/2   0
          0    0   1/4 ] ,

then one can identify the corrupted state variable and also find out when the corruption took place (i.e., what m is) [Hadjicostis, 2000]. The approach has been extended to handle multiple faults between periodic checks [Hadjicostis, 2001].

EXAMPLE 5.4 The TMR scheme in Figure 1.3 of Chapter 1 corresponds to a redundant implementation of the form

  q_h[t+1] = [ q_s^1[t+1]     [ A  0  0               [ B
               q_s^2[t+1]  =    0  A  0    q_h[t]  +    B    x[t] ,
               q_s^3[t+1] ]     0  0  A ]               B ]

where q_s^1[t], q_s^2[t] and q_s^3[t] evolve in the same way (because q_s^1[0] = q_s^2[0] = q_s^3[0] = q_s[0]) and each is calculated using a separate set of delays, adders and gains. The encoding matrix G is given by [ I_d ; I_d ; I_d ], the decoding mapping L can be [ I_d  0  0 ] and the parity check matrix P can be

  P = [ -I_d   I_d   0
        -I_d   0     I_d ]

(other decoding and parity check matrices are also possible). A nonzero entry in the upper (respectively lower) half of P q_h[t] indicates a fault in the second system replica (respectively the third); nonzero entries in both the top and bottom half-vectors indicate a fault in the first system replica. The TMR system can be shown (for example, with transformation matrix

  T = [ I_d    0    0
        -I_d   I_d  0
        -I_d   0    I_d ] )

to be similar to a system of the form depicted in Theorem 5.1: all variables of the original system are replicated twice and no coupling is involved from the redundant to the original modes, i.e., A12 = 0 and A22 = [ A  0 ; 0  A ]. Once the encoding matrix G is fixed, the additional freedom in choosing the decoding matrix L can be exploited. For example, if there is a permanent fault in the first system replica, one can change the decoding matrix to L = [ 0  I_d  0 ] to ensure that the final output is correct.
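The TMR syndrome pattern can be checked directly. A minimal sketch with d = 2 (the dimension and state values are illustrative):

```python
import numpy as np

# TMR parity check of Example 5.4: P = [-I I 0; -I 0 I].  A fault in the
# second replica shows up only in the top half of the syndrome; a fault in
# the first replica shows up in both halves.
d = 2
I = np.eye(d)
Z = np.zeros((d, d))
G = np.vstack([I, I, I])                 # each replica stores a copy of q_s
P = np.block([[-I, I, Z], [-I, Z, I]])   # parity check matrix

qs = np.array([1.0, -2.0])
qh = G @ qs                              # fault-free redundant state

e2 = np.zeros(3 * d); e2[d] = 0.3        # corrupt a variable of replica 2
syn = P @ (qh + e2)
assert np.any(syn[:d]) and not np.any(syn[d:])   # only the top half fires

e1 = np.zeros(3 * d); e1[0] = 0.3        # corrupt a variable of replica 1
syn = P @ (qh + e1)
assert np.any(syn[:d]) and np.any(syn[d:])       # both halves fire
```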
This idea is discussed further in the next example.

EXAMPLE 5.5 In the TMR case, a permanent fault that corrupts the first system replica (by corrupting gains or adders in its hardware implementation) can be handled by switching the decoding matrix from L = [ I_d  0  0 ] to L = [ 0  I_d  0 ] (or L = [ 0  0  I_d ], or others) and by ignoring the state of the first system replica in any subsequent error detection and correction procedures. For instance, once a permanent fault corrupts the first subsystem, error correction becomes impossible, but error detection can still be achieved by comparing the states of the two remaining system replicas. This idea was formalized and generalized in [Hadjicostis and Verghese, 1997]. Consider the redundant system H whose state evolution equation is given by Eq. (5.3) and whose hardware implementation uses delay-adder-gain implementations with delay-free paths of unit length. Under fault-free conditions, q_h[t] = G q_s[t] for all t. A permanent fault in a gain element manifests itself as a corrupted entry in matrix A or B. The ith state variable in q_h[t] (and other variables at later time steps) will be corrupted if some of the entries in A(i,:) or in B(i,:) are affected right after time step t - 1 (A(i,:) denotes the ith row of matrix A and B(i,:) denotes the ith row of matrix B). If the erroneous state variable at time t is detected and located (e.g., using the techniques in Example 5.1), one can attempt to adjust the decoding matrix L to a new matrix L_a so that the decoded state is error-free. The question addressed in [Hadjicostis and Verghese, 1997] concentrated on characterizing the possible choices for L_a given the corruption of certain gain or adder elements.
If at time step t the ith state variable is corrupted, then all state variables whose updates directly depend on the ith state variable will be corrupted at time step t + 1 (let M_i1 be the set of indices of these state variables, including i); at time step t + 2, the state variables with indices in set M_i1 will corrupt the state variables that depend on them; let their indices be in set M_i2 (which includes M_i1); and so on. Eventually, the final set of indices of all corrupted state variables is given by the set M_if (note that M_if = M_iη = M_i1 ∪ M_i2 ∪ M_i3 ∪ ... ∪ M_iη). The sets of indices M_if for all i in {1, 2, ..., η} can be precalculated in an efficient manner by computing R(A), the reachability matrix of A [Norton, 1980]. Once an error is detected at the ith state variable, the new decoding matrix L_a (if it exists) should not make use of state variables with indices in M_if. Equivalently, one can ask the question: does there exist a decoding matrix L_a such that L_a G_a = I_d? Here, G_a is the same as the original encoding matrix G except that G_a(i,:) is set to zero for all i in M_if. If G_a is full-column rank, such an L_a exists (in fact, any L_a that satisfies L_a G_a = I_d is suitable) and the redundant system can withstand permanent corruptions in any of the entries in the ith row of A and/or B. TMR is clearly a special case of the above formulation: corruption of a variable in the state of the first system replica is guaranteed to only affect the first system replica. Therefore, M_if ⊆ {1, 2, ..., d} and (conservatively) G_a = [ 0  I_d  I_d ]^T. One possibility for L_a is [ 0  I_d  0 ]. Less obvious is the following case: consider the system in Example 5.1 with state evolution as in Eq. (5.7). Its decoding matrix is given by L = [ I_4  0 ]. If A(2,2) (whose value is .5) becomes corrupted, then the set of indices of corrupted state variables is M_2f = {2, 5}.
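The reachable-set computation and the rank test for L_a can be sketched as follows, using the TMR structure with an illustrative 2 x 2 original system (all matrices here are assumptions for demonstration, not the book's Example 5.1 system):

```python
import numpy as np

# M_if is the set of state variables eventually corrupted by variable i:
# the reachable set of i in the dependency graph of the redundant matrix
# (edge k -> j whenever Abar[j, k] != 0).
def corrupted_set(Abar, i):
    reach, frontier = {i}, {i}
    while frontier:
        frontier = {j for k in frontier
                    for j in np.flatnonzero(Abar[:, k]) if j not in reach}
        reach |= frontier
    return reach

d = 2
A = np.array([[0.5, 1.0], [0.0, 0.5]])   # illustrative original system matrix
Abar = np.kron(np.eye(3), A)             # TMR: three decoupled replicas
G = np.vstack([np.eye(d)] * 3)           # TMR encoding matrix

M = corrupted_set(Abar, 1)               # permanent fault hits variable 2 (index 1)
assert M == {0, 1}                       # ...and spreads only within replica 1

Ga = G.copy()
Ga[sorted(M), :] = 0                     # Ga: G with the corrupted rows zeroed
assert np.linalg.matrix_rank(Ga) == d    # full column rank, so some La exists
La = np.linalg.pinv(Ga)                  # any left inverse of Ga is suitable
assert np.allclose(La @ Ga, np.eye(d))
```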
The original encoding matrix G and the new encoding matrix G_a (resulting after the corruption of entry A(2,2), i.e., with rows 2 and 5 zeroed out) are shown below:

  G = [  1  0  0  0                G_a = [  1  0  0  0
         0  1  0  0                         0  0  0  0
         0  0  1  0                         0  0  1  0
         0  0  0  1                         0  0  0  1
        -1 -1 -1  0                         0  0  0  0
        -1 -1  0 -1                        -1 -1  0 -1
        -1  0 -1 -1 ] ,                    -1  0 -1 -1 ] .

Since G_a has full-column rank, a suitable L_a exists (any matrix satisfying L_a G_a = I_4 will do). Using such an L_a, the redundant system can continue to function properly and provide the correct state vector q_s[t] despite the corrupted entry A(2,2). The parity check matrix of Eq. (5.6) can still be used for error detection, except that the checks involving the second and/or fifth state variables (i.e., the checks corresponding to the first and second rows of P) will be invalid. Error detection is still an option, but one has to rely solely on the parity check given by the third row of P.

6 SUMMARY

This chapter studied redundant implementations of LTI dynamic systems. It showed that the set of available redundant implementations for concurrent error detection and correction is enriched by the dynamics and coupling that can be introduced by redundancy. The redundant implementation essentially augments the original system with redundant modes that are unreachable but observable under fault-free conditions. Because these additional modes are not excited initially, they manifest themselves only when a fault takes place. The resulting characterization resembles the treatment of continuous-time LTI system "inclusion" in [Ikeda and Siljak, 1984]. An explicit mapping to hardware (using delay, adder and gain elements) allowed the development of a fault model that maps a single fault in an adder or in a multiplier to an error in a single state variable. By employing linear coding techniques, this chapter developed a wide variety of schemes that can detect/correct a fixed number of faults (or, equivalently, errors in a fixed number of state variables).
This chapter established that, for a particular error detection/correction scheme, there exists a class of possible implementations, some of which make better use of additional hardware or have other desirable properties, such as reconfigurability or memory. Criteria to "optimally" select the "best" possible redundant implementation were not directly addressed; the examples, however, presented a variety of open questions for future research.

Notes

1 The check matrix can be P' = [ 0  S ], where S is any invertible s x s matrix; a trivial similarity transformation will ensure that the parity check matrix takes the form [ 0  I_s ], while keeping the system in the standard form H_σ in Eq. (5.4), except with A12 = A'_12 S and A22 = S^{-1} A'_22 S.

References

Blahut, R. E. (1983). Theory and Practice of Data Transmission Codes. Addison-Wesley, Reading, Massachusetts.
Chatterjee, A. (1991). Concurrent error detection in linear analog and switched-capacitor state variable systems using continuous checksums. In Proceedings of the Int. Test Conference, pages 582-591.
Chatterjee, A. and d'Abreu, M. (1993). The design of fault-tolerant linear digital state variable systems: Theory and techniques. IEEE Transactions on Computers, 42(7):794-808.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. (2000). Fault-tolerant discrete-time linear time-invariant filters. In Proceedings of ICASSP 2000, the IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pages 3311-3314.
Hadjicostis, C. N. (2001). Non-concurrent error detection and correction in discrete-time LTI dynamic systems. In Proceedings of the 40th IEEE Conf. on Decision and Control.
Hadjicostis, C. N. and Verghese, G. C. (1997). Fault-tolerant design of linear time-invariant systems in state form. In Proceedings of the 5th IEEE Mediterranean Conf. on Control and Systems.
Hadjicostis, C. N.
and Verghese, G. C. (1999). Structured redundancy for fault tolerance in LTI state-space models and Petri nets. Kybernetika, 35(1):39-55.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Ikeda, M. and Siljak, D. D. (1984). An inclusion principle for dynamic systems. IEEE Transactions on Automatic Control, 29(3):244-249.
Kailath, T. (1980). Linear Systems. Prentice-Hall, Englewood Cliffs, New Jersey.
Luenberger, D. G. (1979). Introduction to Dynamic Systems: Theory, Models, & Applications. John Wiley & Sons, New York.
Norton, J. P. (1980). Structural zeros in the modal matrix and its inverse. IEEE Transactions on Automatic Control, 25(10):980-981.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Roberts, R. A. and Mullis, C. T. (1987). Digital Signal Processing. Addison-Wesley, Reading, Massachusetts.
Wicker, S. B. (1995). Error Control Systems. Prentice Hall, Englewood Cliffs, New Jersey.

Chapter 6

REDUNDANT IMPLEMENTATIONS OF LINEAR FINITE-STATE MACHINES

1 INTRODUCTION

This chapter applies techniques similar to those of Chapter 5 to provide fault tolerance to linear finite-state machines (LFSM's) [Hadjicostis, 1999]. The discussion focuses on linear encoding techniques and, as in Chapter 5, results in a complete characterization of the class of appropriate redundant implementations. It is shown that, for a given LFSM and a given linear encoding, there exists a variety of possible implementations and that different criteria can be used to choose the most desirable one [Hadjicostis, 2000; Hadjicostis and Verghese, 2002]. The implications of this approach are demonstrated by studying hardware implementations that use interconnections of 2-input XOR gates and single-bit memory elements (flip-flops).
The redundancy in the state representation (which essentially appears as a linearly encoded binary vector) is used by an external, fault-free mechanism to perform concurrent error detection and correction at the end of each time step. The assumption of a fault-free error corrector is relaxed in Chapter 7.

2 LINEAR FINITE-STATE MACHINES

Linear finite-state machines (LFSM's) form a general class of finite-state machines with a variety of applications [Booth, 1968; Harrison, 1969]. They include linear feedback shift registers [Golomb, 1967; Martin, 1969; Daehn et al., 1990; Damiani et al., 1991], sequence enumerators and random number generators [Golomb, 1967], encoders and decoders for linear error-correcting codes [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995], and cellular automata [Cattell and Muzio, 1996; Chakraborty et al., 1996]. A discussion of the power of LFSM's and related references can be found in [Zeigler, 1973].

Figure 6.1. Hardware implementation of the linear feedback shift register in Example 6.1.

The state evolution of an LFSM is given by

  q_s[t+1] = A q_s[t] ⊕ B x[t] ,
  y[t+1]  = C q_s[t] ⊕ D x[t] ,        (6.1)

where t is the discrete-time index, q_s[t] is the d-dimensional state vector, x[t] is the u-dimensional input vector and y[t] is the v-dimensional output vector. Vectors and matrices have entries from GF(2), the Galois field¹ of order 2, i.e., they are either "0" or "1" (more generally, they can be drawn from any finite field). Matrix-vector multiplication and vector-vector addition are performed as usual, except that element-wise addition and multiplication are taken modulo 2. Operation ⊕ in the above equations denotes vector addition modulo 2. As in Chapter 5, faults in the mechanism that calculates the output (based on the current state and input) can be treated as faults in a combinational system; thus, this chapter focuses on protecting against faults in the state evolution mechanism.
EXAMPLE 6.1 The linear feedback shift register (LFSR) in Figure 6.1 is implemented using single-bit memory elements (flip-flops) and 2-input XOR gates. Flip-flops are capable of storing a single bit ("0" or "1"), and 2-input XOR gates, denoted by ⊕ in the figure, perform modulo-2 addition on their binary inputs. The LFSR in Figure 6.1 is an LFSM with state evolution q_s[t+1] = A q_s[t] ⊕ b x[t], where

  A = [ 0 0 0 0 1            b = [ 1
        1 0 0 0 0                  0
        0 1 0 0 1                  0
        0 0 1 0 0                  0
        0 0 0 1 0 ] ,              0 ] .

Note that when x[·] = 0 and q_s[0] ≠ 0, the LFSR acts as an autonomous sequence enumerator. It goes through all nonzero states (essentially counting from 1 to 31): if initialized at q_s[0] = [1 0 0 0 0]^T, the LFSR goes through states q_s[1] = [0 1 0 0 0]^T, q_s[2] = [0 0 1 0 0]^T, ..., q_s[30] = [0 1 0 0 1]^T, q_s[31] = [1 0 0 0 0]^T, and so forth. In essence, the LFSR acts as an autonomous sequence generator (counter).

For a state evolution of the form of Eq. (6.1), there are a number of implementations with 2-input XOR gates and flip-flops. If the hardware implementations correspond to signal flow graphs whose delay-free paths are of unit length, then the following are true: (i) Each bit in the next-state vector q_s[t+1] is calculated using a separate set of 2-input XOR gates; thus, a fault in a single XOR gate can corrupt at most one bit in the next-state vector q_s[t+1]. (ii) The calculation of each bit in q_s[t+1] is based on the bits of q_s[t] that are explicitly specified by the "1s" in matrix A of the state evolution equation (e.g., the third bit of q_s[t+1] in Example 6.1 is calculated based on the second and fifth bits of q_s[t]).

An LFSM S' (with d-dimensional state vector q'_s[t]) is similar to LFSM S [in Eq. (6.1)] if

  q'_s[t+1] = (T^{-1} A T) q'_s[t] ⊕ (T^{-1} B) x[t] = A' q'_s[t] ⊕ B' x[t] ,

where T is an invertible d x d binary matrix such that q_s[t] = T q'_s[t] [Booth, 1968; Harrison, 1969].
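The counting behavior can be checked by direct simulation. A minimal sketch in Python, taking A to be a feedback matrix consistent with the example's stated properties (bit 3 of the next state is the XOR of bits 2 and 5, and the machine cycles through all 31 nonzero states); the exact matrix entries are an assumption here:

```python
# Autonomous simulation of the 5-bit LFSR of Example 6.1 (x[.] = 0).
A = [[0, 0, 0, 0, 1],
     [1, 0, 0, 0, 0],
     [0, 1, 0, 0, 1],   # bit 3 of q[t+1] = bit 2 XOR bit 5 of q[t]
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 0]]

def step(q):
    # q[t+1] = A q[t] over GF(2)
    return tuple(sum(a * x for a, x in zip(row, q)) % 2 for row in A)

q = (1, 0, 0, 0, 0)
seen = []
for _ in range(31):
    seen.append(q)
    q = step(q)

assert q == (1, 0, 0, 0, 0)          # back to the start after 31 steps...
assert len(set(seen)) == 31          # ...having visited every nonzero state
assert seen[1] == (0, 1, 0, 0, 0) and seen[30] == (0, 1, 0, 0, 1)
```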
The initial conditions for the transformed LFSM can be obtained as q'_s[0] = T^{-1} q_s[0]. It can be shown that any LFSM with state evolution as in Eq. (6.1) can be put via a similarity transformation in a form where the matrix A' is in classical canonical form [Booth, 1968]. More specifically, A' has the block-diagonal structure

  A' = diag( A_1, A_2, ..., A_p ) ,

where each A_i (1 <= i <= p) also has a block-diagonal structure

  A_i = diag( C_i1, C_i2, ..., C_iq ) ,

and where each C_ij (1 <= j <= q) has the block form

  C_ij = [ D_ij
           E_ij  D_ij
                 E_ij  D_ij
                       ...   ] ,

with D_ij a companion-form matrix (ones on the subdiagonal and arbitrary entries "*" in its last column, where each "*" could be a "0" or a "1") and E_ij a matrix with a single "1" in its top-right entry and zeros elsewhere. What is important about this form is that there are at most two "1s" in each row of A', which implies that each bit in the next-state vector q'_s[t+1] can be generated based on at most two bits of the current state vector q'_s[t].

3 CHARACTERIZATION OF REDUNDANT IMPLEMENTATIONS

An LFSM S [with d state variables and state evolution as in Eq. (6.1)] can be embedded into a redundant LFSM H with η state variables (η = d + s, s > 0) and state evolution

  q_h[t+1] = Ā q_h[t] ⊕ B̄ x[t] ,        (6.2)

where the initial state q_h[0] and matrices Ā, B̄ are chosen so that the error-free state q_h[t] of H at time step t provides complete information about q_s[t], the state of the original LFSM [Hadjicostis, 1999; Hadjicostis, 2000; Hadjicostis and Verghese, 2002]. More specifically, the redundant machine H concurrently simulates the original machine S so that, for an appropriate decoding mapping l, q_s[t] = l(q_h[t]) for all time steps t. Furthermore, mapping l is required to be one-to-one so that there is a unique correspondence between the states in S and the states in H for all time steps t (as long as no faults take place). As in Chapter 5, the analysis is easier if one restricts decoding and encoding to be linear in GF(2).
In other words, there exist

• a d x η binary decoding matrix L such that, under proper initialization and fault-free conditions, q_s[t] = L q_h[t] for all t, and

• an η x d binary encoding matrix G such that, under proper initialization and fault-free conditions, q_h[t] = G q_s[t] for all t.

Note that L and G need to satisfy LG = I_d, where I_d is the d x d identity matrix in GF(2). Under the above assumptions, the redundant machine H enforces an (η, d) linear code on the state of the original machine [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]. An (η, d) linear code uses η bits to represent d bits of information and is defined in GF(2) by an η x d generator matrix G with full-column rank. When no faults have taken place, the d-dimensional state vector at time t is uniquely represented by the η-dimensional vector (codeword) q_h[t] = G q_s[t]. Error detection is straightforward: under fault-free conditions, the redundant state vector must be in the column space of G; therefore, all that needs to be checked is that the redundant state q_h[t] lies in the column space of G (in coding theory terminology, one needs to check that q_h[t] is a codeword of the linear code that is generated by G [Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]). Equivalently, one can check that q_h[t] is in the null space of an appropriate parity check matrix P, so that P q_h[t] = 0. The parity check matrix has row rank η - d = s and satisfies PG = 0. Error correction associates with each valid state in H (of the form G q_s[t]) a unique subset of invalid states that get corrected to that particular valid state. This subset usually contains η-dimensional vectors with small Hamming distance from the associated valid codeword. Error correction can be performed using any of the methods employed in the communications setting (e.g., syndrome table decoding or iterative decoding [Gallager, 1963; Peterson and Weldon Jr., 1972; Blahut, 1983; Wicker, 1995]).
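Syndrome table decoding of a single-bit state error can be sketched as follows; the (6, 3) systematic generator matrix below is an illustrative choice, not one from the book:

```python
import numpy as np

# Syndrome-table decoding for an (eta, d) = (6, 3) state code over GF(2).
G = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [1, 0, 1]])
C = G[3:]                                   # G = [I; C] (systematic form)
P = np.hstack([C, np.eye(3, dtype=int)])    # over GF(2), P = [C  I_s] gives PG = 0

# each correctable single-bit error position is indexed by its syndrome P e_j
table = {tuple(P[:, j] % 2): j for j in range(6)}

def correct(qh):
    syn = tuple((P @ qh) % 2)
    if any(syn):
        qh = qh.copy()
        qh[table[syn]] ^= 1                 # flip the located bit
    return qh

qs = np.array([1, 0, 1])
qh = (G @ qs) % 2                           # fault-free redundant state (codeword)
qf = qh.copy(); qf[4] ^= 1                  # single-bit error
assert not any((P @ qh) % 2)                # codewords have zero syndrome
assert np.array_equal(correct(qf), qh)      # the error is located and undone
```

The table construction works because all six columns of P are distinct and nonzero, i.e., the code has minimum distance at least 3.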
The following theorem provides a parameterization of all redundant implementations for a given LFSM under a linear encoding scheme.

THEOREM 6.1 In the setting described above, the LFSM H [of dimension η = d + s, s > 0, and state evolution as in Eq. (6.2)] is a redundant implementation of S if and only if it is similar to a standard redundant LFSM H_σ whose state evolution equation is given by

  q_σ[t+1] = [ A   A12    q_σ[t]  ⊕  [ B    x[t] .        (6.3)
               0   A22 ]              0 ]

Here, A and B are the matrices in Eq. (6.1), A22 is an s x s binary matrix that describes the dynamics of the redundant modes that have been added, and A12 is a d x s binary matrix that describes the coupling from the redundant to the non-redundant modes. Associated with this standard redundant LFSM are the standard decoding matrix L_σ = [ I_d  0 ], the standard encoding matrix G_σ = [ I_d ; 0 ] and the standard parity check matrix P_σ = [ 0  I_s ].

Proof: The proof is similar to the proof of Theorem 5.1 in Chapter 5 and is omitted. □

4 EXAMPLES OF FAULT-TOLERANT SYSTEMS

Given an LFSM S and appropriate L and G (so that LG = I_d), Theorem 6.1 characterizes all possible redundant LFSM's H. Since the choice of the binary matrices A12 and A22 is completely free, there are multiple redundant implementations of LFSM S for the given L and G. This section demonstrates how different implementations for LFSM's can be exploited to minimize redundant hardware overhead.

EXAMPLE 6.2 In order to detect a single fault in an XOR gate of the LFSR implementation in Figure 6.1, an extra "checksum" state variable can be used. Following what was suggested for linear time-invariant dynamic systems in [Huang and Abraham, 1984] and for LFSM's in [Larsen and Reed, 1972; Sengupta et al., 1981], one obtains the following redundant LFSM H:

  q_h[t+1] = [ A      0    q_h[t]  ⊕  [ b      x[t] ,
               c^T A  0 ]              c^T b ]

where c^T = [ 1 1 1 1 1 ], i.e.,

  q_h[t+1] = [ 0 0 0 0 1 0            [ 1
               1 0 0 0 0 0              0
               0 1 0 0 1 0              0
               0 0 1 0 0 0   q_h[t] ⊕   0   x[t] .
               0 0 0 1 0 0              0
               1 1 1 1 0 0 ]            1 ]
Under fault-free conditions, the added state variable is always the sum modulo 2 of all other state variables (which are the same as the original state variables in LFSM S). The encoding, decoding and parity check matrices are given by

  G = [ I_5    = [ 1 0 0 0 0
        c^T ]      0 1 0 0 0
                   0 0 1 0 0
                   0 0 0 1 0
                   0 0 0 0 1
                   1 1 1 1 1 ] ,

  L = [ I_5  0 ] ,   P = [ -c^T  1 ] = [ 1 1 1 1 1 1 ] .

(Note that "-1" is the same as "+1" when performing addition and multiplication modulo 2.) Using the similarity transformation q_σ[t] = T q_h[t], where

  T = [ I_5   0
        c^T   1 ] ,

one sees that, just as predicted by Theorem 6.1, H is similar to a standard redundant LFSM H_σ with state evolution as in Eq. (6.3) and with both A12 and A22 set to zero.

As stated earlier, there are multiple redundant implementations with the same encoding, decoding and parity check matrices. For the scenario described here, there are exactly 2^6 different LFSM's (each combination of choices for the entries in matrices A12 and A22 results in a different redundant implementation). One such choice is to let A12 = [ 0 0 0 0 0 ]^T and A22 = [1], and use a transformation with the same similarity matrix (q_σ[t] = T q_h'[t], with T as above) to get a redundant LFSM H' with state evolution equation

  q_h'[t+1] = [ A                  0     q_h'[t]  ⊕  [ b      x[t] ,
                c^T A ⊕ A22 c^T    A22 ]               c^T b ]

or

  q_h'[t+1] = [ 0 0 0 0 1 0             [ 1
                1 0 0 0 0 0               0
                0 1 0 0 1 0               0
                0 0 1 0 0 0   q_h'[t] ⊕   0   x[t] .
                0 0 0 1 0 0               0
                0 0 0 0 1 1 ]             1 ]

Both redundant LFSM's H and H' have the same encoding, decoding and parity check matrices, and both are able to concurrently detect single-bit errors in the redundant state vector. Furthermore, according to the assumptions about hardware implementation in Section 2, they are both able to detect a fault in a single XOR gate. Evidently, the complexity of H' is lower than the complexity of H.
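The checksum invariant holds for any LFSM, not just this LFSR. A sketch with A12 = 0 and A22 = [1], using a randomly chosen (A, b) rather than the book's machine:

```python
import numpy as np

rng = np.random.default_rng(0)

# Checksum construction of Example 6.2 with nonzero redundant dynamics
# (A12 = 0, A22 = [1]) for a generic LFSM (A, b) over GF(2): the extra state
# variable stays equal to the XOR of all original state variables.
d = 5
A = rng.integers(0, 2, (d, d))
b = rng.integers(0, 2, d)
c = np.ones(d, dtype=int)

Abar = np.zeros((d + 1, d + 1), dtype=int)
Abar[:d, :d] = A
Abar[d, :d] = (c @ A + c) % 2          # bottom row: c^T A  XOR  A22 c^T
Abar[d, d] = 1                         # A22 = [1]
Bbar = np.append(b, (c @ b) % 2)

P = np.append(c, 1)                    # parity check [c^T 1] (= [-c^T 1] mod 2)

q = np.zeros(d + 1, dtype=int)         # properly initialized: parity 0
for _ in range(20):
    x = rng.integers(0, 2)
    q = (Abar @ q + Bbar * x) % 2
    assert (P @ q) % 2 == 0            # fault-free: the checksum always holds

q[3] ^= 1                              # single-bit error (e.g., a faulty XOR gate)
assert (P @ q) % 2 == 1                # ...and it is detected
```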
More generally, as illustrated in this example for the case of a nonzero A22, one can obtain more efficient redundant implementations by exploiting the dynamics of the redundant modes (given by A22) and/or their coupling with the original system (given by A12).

EXAMPLE 6.3 A rate-1/3 convolutional encoder takes a binary sequence x[t] and encodes it into three output sequences (y1[t], y2[t] and y3[t]), as shown at the top of Figure 6.2. The encoding mechanism is essentially an LFSM and, for the particular example shown in Figure 6.2, it has state evolution q_s[t+1] = A q_s[t] ⊕ b x[t] with

  A = [ 0 0 0 0 0 0            b = [ 1
        1 0 0 0 0 0                  0
        0 1 0 0 0 0                  0
        0 0 1 0 0 0                  0
        0 0 0 1 0 0                  0
        0 0 0 0 1 0 ] ,              0 ]

and output²

  y[t+1] = [ y1[t+1] ; y2[t+1] ; y3[t+1] ] = F q_s[t] ⊕ d x[t] ,

where F is a 3 x 6 binary matrix (its first row is [ 0 1 1 1 0 1 ]) and d is a 3 x 1 binary vector (its first entry is 1), both determined by the tap connections in Figure 6.2.

Figure 6.2. Different implementations of a convolutional encoder.

If the output values y1[t], y2[t] and y3[t] are saved in designated flip-flops, one obtains a redundant implementation of an LFSM with state evolution equation

  q_h[t+1] = [ q_s[t+1]   = [ A  0    q_h[t]  ⊕  [ b    x[t] ,
               y[t+1]  ]      F  0 ]              d ]

with encoding and decoding matrices G and L determined accordingly. By using nonzero redundant dynamics (A22 ≠ 0) and/or coupling (A12 ≠ 0), one can obtain a number of redundant implementations (for the same L and G), some of which require a reduced number of 2-input XOR gates.
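The save-the-outputs construction can be sketched generically; the matrices below are random placeholders, not the encoder of Figure 6.2:

```python
import numpy as np

rng = np.random.default_rng(1)

# For a generic LFSM, storing the output y[t+1] = F q_s[t] xor dd x[t] in
# extra flip-flops yields a redundant LFSM with
#   Abar = [[A, 0], [F, 0]],   Bbar = [[b], [dd]].
n, v = 6, 3
A  = rng.integers(0, 2, (n, n))
b  = rng.integers(0, 2, n)
F  = rng.integers(0, 2, (v, n))
dd = rng.integers(0, 2, v)

Abar = np.block([[A, np.zeros((n, v), dtype=int)],
                 [F, np.zeros((v, v), dtype=int)]])
Bbar = np.concatenate([b, dd])

qs = np.zeros(n, dtype=int)
qh = np.zeros(n + v, dtype=int)
for _ in range(15):
    x = rng.integers(0, 2)
    y_next = (F @ qs + dd * x) % 2           # output of the original machine
    qs = (A @ qs + b * x) % 2
    qh = (Abar @ qh + Bbar * x) % 2
    assert np.array_equal(qh[:n], qs)        # the original state is embedded...
    assert np.array_equal(qh[n:], y_next)    # ...alongside the saved outputs
```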
The encoder at the bottom of Figure 6.2 is the result of such an approach: it uses a nonzero A22 to minimize the use of XOR operations. The next section elaborates on this example by describing how to systematically minimize the number of 2-input XOR gates in a redundant implementation of an LFSM.

5 HARDWARE MINIMIZATION IN REDUNDANT LFSM IMPLEMENTATIONS

Given a linear code in systematic form (i.e., a code whose generator matrix is of the form G = [ I_d ; C ]), Theorem 6.1 can be used to construct all linearly encoded redundant implementations for a given LFSM S. This section describes how to algorithmically find the redundant LFSM that uses the minimal number of 2-input XOR gates [Hadjicostis and Verghese, 2002].

Problem Formulation: Let S be the LFSM in Eq. (6.1) with d state variables. Construct the redundant LFSM H [of dimension η = d + s, s > 0, and state evolution as in Eq. (6.2)] that uses the minimum number of 2-input XOR gates and has the following encoding, decoding and parity check matrices:

  G = [ I_d ; C ] ,   L = [ I_d  0 ] ,   P = [ C  I_s ] ,

where C is a known matrix.

Solution: All appropriate redundant implementations are similar to a standard LFSM H_σ. Specifically, there exists an invertible η x η matrix T such that

  Ā = T^{-1} [ A   A12    T ,
               0   A22 ]

where the choices for A12 and A22 are arbitrary. Moreover, the relations L = L_σ T and P = P_σ T establish that T is

  T = [ I_d   0
        C     I_s ] .

One can check that T^{-1} = T over GF(2), which is consistent with the choice of G. Theorem 6.1 essentially parameterizes matrices Ā and B̄ in terms of A12 and A22:

  Ā = T^{-1} [ A   A12    T ,      B̄ = T^{-1} [ B    = [ B
               0   A22 ]                        0 ]      CB ] .

In order to find the system with the minimal number of 2-input XOR gates, one needs to choose A12 and A22 so that the number of "1s" in Ā is minimized.
Therefore, a straightforward approach would be to search through all 2^{η·s} possibilities (each of the η·s entries of A12 and A22 can be either a "0" or a "1") and find the choice that minimizes the number of "1s" in Ā. The following approach is more efficient [Hadjicostis and Verghese, 2002].

Minimization Algorithm:

1. Ignore the bottom s rows of Ā (it will soon be shown why this can be done) and optimize the cost in the top d rows. Each row of matrix A12 can be optimized independently from the other rows (because the jth row of matrix A12 does not influence the structure of the other rows of Ā). An exhaustive search of all possibilities in each row will look through 2^s different cases; thus, the minimization for the top d rows needs to search through d·2^s different possibilities.

2. Having chosen the entries of A12, proceed in the exact same way for the last s rows of Ā (once A12 is known, the problem has the same structure as for the top d rows). Exhaustive search for each row will search 2^s cases; the total number of cases needed will be s·2^s.

The algorithm above searches through a total of η·2^s = (d + s)·2^s cases instead of 2^{η·s}. The only issue that remains to be resolved is whether choosing A12 first (based only on the top d rows of matrix Ā) is actually optimal. This will be shown by contradiction: suppose that one chooses A12 as in Step 1 of the algorithm, but there exists a matrix A'12 ≠ A12 which, together with a choice of A'22, minimizes the number of "1s" in Ā. Let Ā22 = A'22 ⊕ C A'12 ⊕ C A12; matrix Ā is then given by

  Ā = [ A ⊕ A12 C                       A12
        CA ⊕ C A12 C ⊕ Ā22 C           C A12 ⊕ Ā22 ]

    = [ A ⊕ A12 C                       A12
        CA ⊕ C A'12 C ⊕ A'22 C         C A'12 ⊕ A'22 ] .

This choice of Ā22 has the same effect in the bottom s rows as the choices A'12 and A'22. Since, by assumption, A12 was a better choice for minimizing the number of "1s" in the top d rows, a contradiction has been reached: choices A'12 and A'22 are suboptimal (they are no better than choices A12 and Ā22). □

EXAMPLE 6.4 Consider the autonomous LFSM with state evolution q_s[t+1] = A q_s[t],
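The algorithm can be sketched on a small random instance, with an exhaustive search confirming that the row-by-row choice is optimal; the dimensions and matrices below are illustrative:

```python
import numpy as np
from itertools import product

# Row-by-row minimization of the "1s" in Abar = T [[A, A12],[0, A22]] T,
# where T = [[I, 0],[C, I]] is its own inverse over GF(2).
rng = np.random.default_rng(2)
d, s = 3, 2
A = rng.integers(0, 2, (d, d))
C = rng.integers(0, 2, (s, d))

def abar(A12, A22):
    T = np.block([[np.eye(d, dtype=int), np.zeros((d, s), dtype=int)],
                  [C, np.eye(s, dtype=int)]])
    M = np.block([[A, A12], [np.zeros((s, d), dtype=int), A22]])
    return (T @ M @ T) % 2

rows = [np.array(r) for r in product([0, 1], repeat=s)]

# Step 1: row j of A12 only affects top row j of Abar: [A(j,:)+A12(j,:)C | A12(j,:)].
A12 = np.array([min(rows, key=lambda r: ((A[j] + r @ C) % 2).sum() + r.sum())
                for j in range(d)])
# Step 2: with A12 fixed, row k of A22 only affects bottom row k of Abar.
A22 = np.array([min(rows, key=lambda r:
                    ((C[k] @ A + C[k] @ A12 @ C + r @ C) % 2).sum()
                    + ((C[k] @ A12 + r) % 2).sum())
                for k in range(s)])

# Exhaustive check over all 2^(eta*s) = 1024 choices: the greedy result is optimal.
best = min(abar(np.array(a).reshape(d, s), np.array(b).reshape(s, s)).sum()
           for a in product([0, 1], repeat=d * s)
           for b in product([0, 1], repeat=s * s))
assert abar(A12, A22).sum() == best
```

This searches (d + s)·2^s = 20 cases instead of 1024 and, per the contradiction argument above, still attains the global minimum.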
0 EXAMPLE 6.4 Consider the autonomous LFSM with state evolution Redundant Implementations of Linear Finite-State Machines 111 where matrix A is A= 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 0 0 If initialized in a nonzero state, this LFSM goes through all nonzero 9-bit sequences, essentially counting from 1 to 29 - 1. To be able to detect and correct a single fault using a systematic linear code, one can use a redundant machine with four additional state variables and encoding matrix G =[ 6], where matrix C is given by 1 100 1 1 1 0 1 01 100 1 1 [ C= 0 1 1 0 0 1 0 1 1 1 1 1 1 000 The parity check matrix P = [C all of its columns are different. 0 0 1 1 1 . 14] allows single-error correction because The minimization algorithm described earlier results in the following (nonunique) choice of A 12, A 22 : 112 CODING APPROACHES TO FAULT TOLERANCE The resulting matrix A is given by 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 A = 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 and requires only nine 2-input XOR gates (as opposed to sixteen gates required by the implementation that sets A12 and A22 to zero). Note that the original, non-redundant machine uses a single XOR gate. 6 SUMMARY This chapter extended the ideas of Chapter 5 to LFSM's. This resulted in a characterization of all redundant implementations for a given LFSM under a given linear encoding and decoding scheme. The characterization enables the systematic development of a variety of possible redundant implementations. 
It also leads naturally to an algorithm that can be used to minimize the number of 2-input XOR gates that are required in a redundant LFSM with a specified systematic encoding scheme.

Notes

1 The finite field GF(l) is the unique set GF of l elements which, together with two binary operations ⊕ and ⊗, satisfies the following properties: (i) GF forms a group under operation ⊕ with identity 0. (ii) GF - {0} forms a commutative group under operation ⊗ with identity 1. (iii) Operation ⊗ distributes over ⊕, i.e., for all f1, f2, f3 ∈ GF, f1 ⊗ (f2 ⊕ f3) = (f1 ⊗ f2) ⊕ (f1 ⊗ f3). The order l of a finite field has to be a prime number or a power of a prime number.

2 What is denoted here by y[t+1] is usually denoted by y[t].

References

Blahut, R. E. (1983). Theory and Practice of Data Transmission Codes. Addison-Wesley, Reading, Massachusetts.
Booth, T. L. (1968). Sequential Machines and Automata Theory. Wiley, New York.
Cattell, K. and Muzio, J. C. (1996). Analysis of one-dimensional linear hybrid cellular automata over GF(q). IEEE Transactions on Computers, 45(7):782-792.
Chakraborty, S., Chowdhury, D. R., and Chaudhuri, P. P. (1996). Theory and application of non-group cellular automata for synthesis of easily testable finite state machines. IEEE Transactions on Computers, 45(7):769-781.
Daehn, W., Williams, T. W., and Wagner, K. D. (1990). Aliasing errors in linear automata used as multiple-input signature analyzers. IBM Journal of Research and Development, 34(2-3):363-380.
Damiani, M., Olivo, P., and Ricco, B. (1991). Analysis and design of linear finite state machines for signature analysis testing. IEEE Transactions on Computers, 40(9):1034-1045.
Gallager, R. G. (1963). Low-Density Parity Check Codes. MIT Press, Cambridge, Massachusetts.
Golomb, S. W. (1967). Shift Register Sequences. Holden-Day, San Francisco.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems.
PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.
Hadjicostis, C. N. (2000). Fault-tolerant sequence enumerators. In Proceedings of MED 2000, the 8th IEEE Mediterranean Conf. on Control and Automation.
Hadjicostis, C. N. and Verghese, G. C. (2002). Encoded dynamics for fault tolerance in linear finite-state machines. IEEE Transactions on Automatic Control. To appear.
Harrison, M. A. (1969). Lectures on Linear Sequential Machines. Academic Press, New York/London.
Huang, K.-H. and Abraham, J. A. (1984). Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers, 33(6):518-528.
Larsen, R. W. and Reed, I. S. (1972). Redundancy by coding versus redundancy by replication for failure-tolerant sequential circuits. IEEE Transactions on Computers, 21(2):130-137.
Martin, R. L. (1969). Studies in Feedback-Shift-Register Synthesis of Sequential Machines. MIT Press, Cambridge, Massachusetts.
Peterson, W. W. and Weldon Jr., E. J. (1972). Error-Correcting Codes. MIT Press, Cambridge, Massachusetts.
Sengupta, A., Chattopadhyay, D. K., Palit, A., Bandyopadhyay, A. K., and Choudhury, A. K. (1981). Realization of fault-tolerant machines - linear code application. IEEE Transactions on Computers, 30(3):237-240.
Wicker, S. B. (1995). Error Control Systems. Prentice Hall, Englewood Cliffs, New Jersey.
Zeigler, B. P. (1973). Every discrete input machine is linearly simulatable. Journal of Computer and System Sciences, 7(4):161-167.

Chapter 7

UNRELIABLE ERROR CORRECTION IN DYNAMIC SYSTEMS

1 INTRODUCTION

This chapter focuses on constructing reliable dynamic systems exclusively out of unreliable components, including unreliable components in the error-correcting mechanism. At each time step, a particular component can suffer a transient fault with a probability that is bounded by a constant.
Faults in different components and at different time steps are treated as independent. Essentially, the chapter considers an extension of the techniques described in Chapter 2 to a dynamic system setting. Since dynamic systems evolve in time according to their internal state, the major task is to deal effectively with error propagation, i.e., with the effects of errors that corrupt the system state. The discussion focuses initially on a distributed voting scheme that can be used to provide fault tolerance to an arbitrary dynamic system [Hadjicostis, 1999; Hadjicostis, 2000]. This approach employs multiple unreliable system replicas and multiple unreliable voters, and is able to improve the reliability of a dynamic system at the cost of increased redundancy (a higher number of system replicas and voters). More specifically, by increasing the number of systems and voters by a constant amount, one can double the number of time steps for which the fault-tolerant implementation will operate within a pre-specified probability of failure. Equivalently, given a pre-specified number of time steps, one can decrease the probability of failure by increasing the number of systems and voters. Once the distributed voting scheme is analyzed, coding techniques are used to make this approach more efficient, at least for special types of dynamic systems. More specifically, by using linear codes that can be corrected with low complexity, one can obtain interconnections of identical linear finite-state machines that operate in parallel on distinct input streams and use only a constant amount of redundant hardware per machine to achieve arbitrarily small probability of failure [Hadjicostis, 1999; Hadjicostis and Verghese, 1999].
Equivalently, a pre-specified probability of failure can be achieved for any given, finite number of time steps using a constant amount of redundancy per system. Constructions of fault-tolerant dynamic systems out of unreliable components have appeared in [Taylor, 1968b; Taylor, 1968a; Larsen and Reed, 1972; Wang and Redinbo, 1984; Gacs, 1986; Spielman, 1996a]. A number of other constructions of fault-tolerant dynamic systems have also appeared in the literature (see, for example, [Avizienis, 1981; Bhattacharyya, 1983; Iyengar and Kinney, 1985; Leveugle and Saucier, 1990; Parekhji et al., 1991; Robinson and Shen, 1992; Leveugle et al., 1994; Parekhji et al., 1995] and [Johnson, 1989; Pradhan, 1996; Siewiorek and Swarz, 1998] for a comprehensive overview), but the following overview is limited to approaches in which all components, including those in the error-correcting mechanism, suffer transient faults.

• In [Taylor, 1968b], Taylor studied the construction of "stable" memories out of unreliable memory elements (flip-flops) that are capable of storing a single bit but can suffer transient faults, independently between different time steps. Taylor constructed reliable ("stable") memory arrays out of unreliable flip-flops by using appropriately encoded arrays and unreliable error-correcting mechanisms. His results for general computation in [Taylor, 1968a] were in error (see [Pippenger, 1990]).

• In [Larsen and Reed, 1972] the focus is on protecting a single finite-state machine. The approach works by encoding the state of a given finite-state machine (with fewer than 2^k states) into an n-bit binary vector using a binary (n, k) code that has a certain minimum Hamming distance and is majority-logic decodable. The functionality of the state transition and error-correcting mechanisms is combined into one combinational circuit.
The fault model assumes that the probability of error in each of the bits in the encoded state vector (of the redundant finite-state machine) can be bounded by a (small) constant, i.e., the analysis does not directly consider the probability of a transient fault in each component. Under a number of assumptions and considering only the probability of failure per time step, it is concluded that "replication yields better circuit reliability than coding redundancy."

• In [Wang and Redinbo, 1984], the performance of the approach in [Larsen and Reed, 1972] was re-examined under low rates of transient ("soft") state transition faults, using the concept of "cluster states"; this was shown to result in significant improvements.

• Gacs studied fault-tolerant cellular automata in [Gacs, 1986], mostly in the context of stable memories. He employed cellular automata so that the cost/complexity of connectivity between different parts of the redundant implementation remains constant as the amount of redundancy increases.

• The approach in [Spielman, 1996a] was for multiple systems that run in parallel on k "fine-grained" processors for L time steps. (In this sense, it is closer to the approach presented in Section 4 of this chapter for LFSM's.) Spielman showed that the probability of error can go down as O(L e^(-k^(1/4))) but the amount of redundancy is O(k log k) (i.e., O(log k) processors per system). Spielman also introduced the concept of slowdown due to the redundant implementation.

2 FAULT MODEL FOR DYNAMIC SYSTEMS

In an unreliable dynamic system, an incorrect state transition at a particular time step will not only affect the output at the immediately following time step, but will typically also affect the state (and therefore the output of the system) at later time steps.
In Chapters 4-6, structured redundancy was added into a dynamic system so that error detection and correction could be performed by detecting and identifying violations of artificially created state constraints. This approach was shown to work nicely when the error-correcting mechanism was fault-free; it is clear, however, that faults in the error corrector may have devastating effects. To appreciate the severity of the problem, recall the example that was introduced in Chapter 1: assume that in a given dynamic system, the probability of taking a transition to an incorrect next state on any input is p_s, independently between different time steps. Then, the probability that the system follows the correct state trajectory for L consecutive time steps is (1 - p_s)^L, which goes to zero exponentially with L. Using modular redundancy with feedback (as in Figure 1.3 of Chapter 1) will not be successful if the voter also suffers transient faults with probability p_v. (A fault causes the voter to feed back a state other than the one agreed upon by the majority of the systems; the assumption here is that this happens with probability p_v, independently between different time steps.) In such a case, the probability that the system follows the correct state trajectory for L consecutive time steps is at best (1 - p_v)^L, which again goes down exponentially with L. The problem is that faults in the voter (or, more generally, in the error-correcting mechanism) corrupt the overall redundant system state and cause error propagation. Note that the bound (1 - p_v)^L actually ignores the possibility that a fault in the voter may result in feeding back the correct state (when the majority of the system replicas are in an incorrect state). This issue can be accounted for if a more explicit fault model for the voter is available.
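The exponential decay of (1 - p_s)^L (and likewise of (1 - p_v)^L) noted above is easy to check numerically; the per-step error probability used in the sketch below is an illustrative value, not one taken from the text:

```python
# Probability that an unprotected system follows the correct state
# trajectory for L consecutive steps, given per-step error probability p.
def p_correct_trajectory(p, L):
    return (1.0 - p) ** L

# Even a tiny per-step error probability is fatal over long runs:
for L in (10, 100, 1000, 10000):
    print(L, p_correct_trajectory(0.001, L))
```

For p = 0.001 the survival probability is already below e^-1 around L = 1000 and essentially zero by L = 10000, which is why per-step error correction (rather than more replication alone) is needed.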
The first question discussed in this chapter is the following: given unreliable systems and unreliable voters, is there a way to guarantee the correct operation of a dynamic system for an arbitrarily large (but finite) number of time steps? Furthermore, what are the trade-offs between redundant hardware and reliability? The approach discussed here uses a generalization of the scheme shown in Figure 1.4 of Chapter 1, where faults are allowed in both the redundant implementation and the error-correcting mechanism. Since the error corrector also suffers transient faults, the redundant implementation will not necessarily be in the correct state at the end of a particular time step; if, however, its state is within the set of states that represent the correct one, then the system may still be able to evolve in the right fashion. The basic idea is shown in Figure 7.1: at the end of time step t, the system is not in a valid state, but it is in a state within the set of states that represent (and could be corrected/decoded to) the correct valid state. During the next state transition stage, a fault-free transition should result in the (unique) valid state that a fault-free system would be in. An incorrect transition, however, may end up in an invalid state; the system performs as desired as long as no overall failure has occurred (i.e., as long as the error corrector is able to correct the redundant system state so that it is within the set of states that are associated with the correct one - this is the case for the corrections labeled "perfect" and "acceptable" in Figure 7.1). Notice that overall failures can occur both during the state transition stage and during the error correction stage.
(In the approach in [Larsen and Reed, 1972; Wang and Redinbo, 1984] the two stages are combined into one stage that also suffers transient faults with some probability; here, the two stages are kept separate in order to have a handle on the complexity of the corresponding circuits.) Even when no fault-free decoding mechanism is available, the above approach is desirable because it allows one to guarantee that the probability of a decoding failure will not increase with time in an unacceptable fashion. As long as the redundant state is within the set of states that represent the actual (underlying) state, the decoding at each time step will be incorrect with a fixed probability, which depends only on the reliability of the decoding mechanism and does not rapidly diminish as the dynamic system evolves in time. The resulting method guarantees that the probability of incorrect state evolution during a certain time interval is much smaller in the redundant dynamic system than in the original one.

3 RELIABLE DYNAMIC SYSTEMS USING DISTRIBUTED VOTING SCHEMES

The problem with the modular redundancy scheme in Figure 1.3 of Chapter 1 is that a voter fault corrupts the states of all system replicas. This results in an overall failure, i.e., a situation where the state of the redundant implementation does not correctly represent the state of the underlying dynamic system. For instance, if the majority of the systems agree on an incorrect state, the correct state of the underlying dynamic system cannot be recovered using a majority
voter.

Figure 7.1. Reliable state evolution subject to faults in the error corrector. (The figure depicts the current state q[t], a faulty next state q[t+1], and the corrected next state q_c[t+1] across the state transition stage and the error correction stage; the legend distinguishes valid states, invalid states, and the set of states representing a single valid state, and marks a correction as "acceptable" when it lands within that set.)

To avoid this situation, one needs to ensure that faults in the voting mechanism do not have such devastating consequences. One way to achieve this is by using several voters and by performing error correction in a distributed fashion, as shown in Figure 7.2 [Hadjicostis, 1999; Hadjicostis, 2000]. The arrangement in Figure 7.2 uses n system replicas and n voters. All n replicas are initialized at the same state and receive the same input. Each voter receives state information from all system replicas and feeds back a correction to only one of them. This way, a fault in a single voter corrupts the state of only one of the system replicas and not all of them. Notice that the redundant implementation of Figure 7.2 is guaranteed to operate "correctly" as long as ⌈(n+1)/2⌉ or more systems are in the correct state (⌈x⌉ denotes the smallest integer that is larger than or equal to x). The reason is two-fold:

• If ⌈(n+1)/2⌉ systems are in the correct state, then the majority of the system replicas are in the right state and a fault-free voter is guaranteed to recover the correct state.

Figure 7.2. Modular redundancy with distributed voting scheme.

• If ⌈(n+1)/2⌉ systems are in the correct state, then each voter ideally feeds back the correct state unless it itself suffers a fault; this implies that a fault in a particular voter or a particular system may be corrected at future time steps as long as ⌈(n+1)/2⌉ or more systems end up in the correct state.

The above discussion motivates the following definition of an overall failure.

DEFINITION 7.1 The redundant system of Figure 7.2 suffers an overall failure when half or more of the systems are in a corrupted state.
A reliable system is one that, with high probability, operates for a pre-specified finite number of time steps with no overall failure. In this context, a redundant implementation is reliable if, with high probability, at least ⌈(n+1)/2⌉ systems are in the correct state at any given time step. Note that it is not necessary that each of these systems remain in the correct state for all consecutive time steps. Also note that the above definition of an overall failure is conservative, because the overall redundant implementation may perform as expected even if more than half of the systems are in an incorrect state. What is really needed is that, at any given time step, the majority of the systems are in the correct state.

THEOREM 7.1 Suppose that each system takes a transition to an incorrect state with probability p_s and each voter feeds back an incorrect state with probability p_v (independently between different systems, voters and time steps). Then, the probability of an overall failure at or before time step L (starting at time step 0) can be bounded as follows:

    Pr[ overall failure at or before time step L ] <= L * SUM_{i=⌊n/2⌋}^{n} C(n,i) p^i (1-p)^(n-i) ,

where p = p_v + (1-p_v) p_s and C(n,i) denotes the binomial coefficient "n choose i". This bound goes down exponentially with the number of systems n if and only if p < 1/2.

Proof: Given that there is no overall failure at time step T-1, the conditional probability that system j ends up in an incorrect state at time step T is bounded by the probability that either voter j suffers a transient fault, or voter j does not suffer a fault but system j itself takes a transition to an incorrect state, i.e.,

    Pr[ system j in incorrect state at T | no overall failure at T-1 ] <= p_v + (1 - p_v) p_s = p .

The probability of an overall failure at time step T given no overall failure at time step T-1 is bounded by the probability that half or more of the n system replicas suffer faults:
    Pr[ overall failure at T | no overall failure at T-1 ] <= SUM_{i=⌊n/2⌋}^{n} C(n,i) p^i (1-p)^(n-i) .

Using the union bound, the probability of an overall failure at or before a certain time step L can be bounded as

    Pr[ overall failure at or before L ] <= L * SUM_{i=⌊n/2⌋}^{n} C(n,i) p^i (1-p)^(n-i) .

Note that the bound on the probability of overall failure increases linearly with the number of time steps (because of the union bound). The bound goes down exponentially with n if and only if p is less than 1/2; to see this, one can use the Stirling approximation and the results on p. 531 of [Gallager, 1968]. Assuming p < 1/2 and, for simplicity, that n is even, each term with i >= n/2 satisfies p^i (1-p)^(n-i) <= [p(1-p)]^(n/2), so that

    SUM_{i=n/2}^{n} C(n,i) p^i (1-p)^(n-i) <= 2^n [p(1-p)]^(n/2) = [4p(1-p)]^(n/2) ,

while the single term i = n/2, together with the Stirling bounds 2^n / sqrt(2n) <= C(n,n/2) <= 2^n, shows that the sum is at least [4p(1-p)]^(n/2) / sqrt(2n). One can therefore conclude that the sum decreases exponentially with n if and only if p(1-p) < 1/4 (i.e., if and only if p is less than 1/2). □

A potential problem with the arrangement in Figure 7.2 is the fact that, as n increases, the complexity of each voter (and therefore p_v) increases. An arrangement in which the number of inputs to each voter is fixed is discussed in the next section. In such schemes, the voter complexity and p_v remain constant as the number of systems and voters is increased. Another concern about the approach in Figure 7.2 is that, in order to construct dynamic systems that suffer transient faults with an acceptably small probability of overall failure during any pre-specified (finite) time interval, the hardware in the redundant implementation may have to increase in an unacceptable fashion. More specifically, if the number of time steps is doubled, the bound in Theorem 7.1 suggests that one may need to increase the number of system replicas by a constant amount (in order to keep the probability of an overall failure at the same level).
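The bound of Theorem 7.1 is straightforward to evaluate. The sketch below uses illustrative fault probabilities (not values taken from the text) to show both effects: exponential decay of the bound in n when p < 1/2, and the constant increase in n that offsets a doubling of the time horizon L:

```python
# Evaluate the Theorem 7.1 bound
#   L * sum_{i=floor(n/2)}^{n} C(n,i) p^i (1-p)^(n-i),
# with p = p_v + (1 - p_v) * p_s.  Parameter values are illustrative.
from math import comb, floor

def failure_bound(n, L, p_s, p_v):
    p = p_v + (1 - p_v) * p_s
    tail = sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(floor(n / 2), n + 1))
    return L * tail

# The bound drops rapidly as replicas/voters are added (here p < 1/2):
for n in (5, 9, 13, 17):
    print(n, failure_bound(n, L=1000, p_s=0.01, p_v=0.01))

# Doubling the horizon L can be offset by a constant increase in n:
print(failure_bound(17, L=2000, p_s=0.01, p_v=0.01)
      < failure_bound(13, L=1000, p_s=0.01, p_v=0.01))
```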
To make an analogy with the problem of digital transmission through an unreliable communication link, what was shown in Theorem 7.1 is very similar to what can be achieved in digital communications without coding techniques. In other words, in the communications setting the probability of a transmission error can be made arbitrarily small by replicating (retransmitting) bits, but at the cost of correspondingly reducing the rate at which information is transmitted. If, however, one is willing to transmit k bits as a block, then the use of coding techniques can result in an arbitrarily small probability of transmission error with a constant amount of redundancy per bit.1 In the next section, this coding paradigm is transferred to an unreliable dynamic system setting. Specifically, it is shown that for identical linear finite-state machines that operate in parallel on distinct input streams, one can design a scheme that requires only a constant amount of redundancy per machine to achieve an arbitrarily small probability of overall failure over any finite time interval.

4 RELIABLE LINEAR FINITE-STATE MACHINES

This section combines linear coding techniques with the distributed voting scheme of the previous section in order to protect linear finite-state machines (LFSM's). The resulting scheme is an interconnection of identical LFSM's that operate in parallel on distinct input streams and require only a constant amount of redundancy per machine to achieve an arbitrarily small probability of overall failure over any pre-specified (finite) time interval. The linear codes that are used are low-density parity check codes [Gallager, 1963; Sipser and Spielman, 1996; Spielman, 1996b].
Error correction is of low complexity and can be implemented using unreliable voters and unreliable XOR gates: (i) The unreliable voters vote on J-1 bits, where J is a constant, and suffer transient faults with a probability that is bounded by some constant p_v (transient faults cause voters to provide an output other than the one agreed upon by the majority of their inputs). Each voter suffers transient faults independently from all other components (voters and XOR gates) and independently between time steps. (ii) The unreliable XOR gates take two inputs and suffer transient faults with a probability that is bounded by some constant p_x, independently from all other components and independently between time steps. The unreliable LFSM's are built out of 2-input XOR gates and single-bit memory elements (flip-flops): (i) These XOR gates also suffer transient faults with a probability that is bounded by p_x, independently from all other components and independently between time steps. (ii) The flip-flops are assumed to be fault-free, although the same approach can be extended to also handle this type of fault.

4.1 LOW-DENSITY PARITY CHECK CODES AND STABLE MEMORIES

An (n, k) low-density parity check (LDPC) code is a linear code that represents k bits of information using n total bits. Just like any linear code, an LDPC code has an n x k generator matrix G with entries in GF(2) and with full column rank; the additional requirement is that the code has a parity check matrix P that (is generally sparse and) has exactly K "1s" in each row and J "1s" in each column. It can easily be shown that the ratio nJ/K has to be an integer and that P has dimension (nJ/K) x n [Gallager, 1963]. Each bit in a codeword is involved in J parity checks, and each of these J parity checks involves K-1 additional bits.
Note that the rows of P are allowed to be linearly dependent (i.e., P can have more than n-k rows) and that the generator matrix G of an LDPC code is not necessarily sparse. Gallager studied ways to construct and decode LDPC codes in [Gallager, 1963]. In particular, he constructed sequences of (n, k) LDPC codes for fixed J and K with rate k/n >= 1 - J/K, and he suggested and analyzed the performance of simple iterative procedures for correcting erroneous bits in corrupted codewords; these procedures are summarized below.

Iterative Decoding. For each bit in a corrupted n-bit codeword:

1. Evaluate the J associated parity checks (since each column of P has exactly J "1s").
2. If more than half of the J parity checks for a particular bit are unsatisfied, flip the value of that bit; do this for all bits concurrently.
3. Iterate (back to Step 1).

In order to analytically evaluate the performance of this iterative scheme, Gallager slightly modified his approach.

Modified Iterative Decoding. Replace each bit b_i in an n-bit corrupted codeword with J bit-copies {b_i^1, b_i^2, ..., b_i^J} (all bit-copies are initially the same); obtain new estimates of each of these copies (i.e., J estimates for bit b_i) by executing the following steps:

1. Evaluate J-1 parity checks for each bit-copy; for each bit-copy, exclude a different parity check from the original set of J checks.
2. Flip the value of a particular bit-copy if half or more of the J-1 parity checks are unsatisfied.
3. Iterate (back to Step 1).

A hardware implementation of Gallager's modified iterative decoding scheme can be seen in Figure 7.3 (⊕ denotes a 2-input XOR gate and V denotes a (J-1)-bit voter). Initially, one starts with J copies of an (n, k) codeword (i.e., a total of Jn bits).
During each iteration, each bit-copy is corrected using an error-correcting mechanism of the form shown in the figure: for each bit-copy, there are a total of J-1 parity checks, each of which involves K-1 other bits and can be evaluated via K-1 2-input XOR gates. The output of each voter is "1" if half or more of the J-1 parity checks are nonzero. Correction is accomplished by XOR-ing the output of the voter with the previous value of the bit-copy.

DEFINITION 7.2 The number of independent iterations m is the number of iterations for which no decision about a particular bit-copy is based on a previous estimate of this same bit-copy.

Figure 7.3. Hardware implementation of Gallager's modified iterative decoding scheme for LDPC codes. (For each bit-copy, J-1 parity checks - each evaluated from K-1 other bit-copies via 2-input XOR gates - feed a (J-1)-bit voter V, whose output is XOR-ed with the bit-copy to produce the correction; there are J copies of the (n, k) codeword, i.e., Jn total bit-copies.)

Note that in the modified iterative decoding scheme, each parity check requires K-1 input bits (other than the bit-copy being estimated). Since each of these input bits has J different copies, one has some flexibility in terms of which particular copy is used when estimating copy b_i^j of bit b_i (1 <= j <= J). If one is careful enough in choosing among these J bit-copies, the number of independent iterations can be made nonzero. More specifically, one should ensure that when a bit-copy of b_i is estimated using an estimate of bit b_j, one uses the bit-copy of b_j that disregarded the parity check involving b_i (otherwise, the estimate of b_i would immediately depend upon its previous estimate).
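As a toy illustration of the plain iterative (bit-flipping) decoding procedure above - the code used here is a hypothetical choice for illustration, not a construction from the text - take the 9-bit code whose parity checks are the row and column parities of a 3 x 3 bit array, so that J = 2 and K = 3. A single corrupted bit is then the only bit with more than half (here: both) of its checks unsatisfied, and one concurrent flip pass corrects it:

```python
# Toy regular code (illustrative): the 6 parity checks are the row and
# column parities of a 3x3 bit array, so each bit is in J = 2 checks
# and each check involves K = 3 bits.  Each check is a list of the bit
# positions it covers.
CHECKS = [[3 * r + c for c in range(3)] for r in range(3)] + \
         [[3 * r + c for r in range(3)] for c in range(3)]

def bit_flip_decode(word, checks, iterations=5):
    """Gallager's plain iterative decoding: concurrently flip every bit
    for which more than half of its parity checks are unsatisfied."""
    w = list(word)
    for _ in range(iterations):
        unsat = [chk for chk in checks if sum(w[i] for i in chk) % 2 == 1]
        if not unsat:
            break                        # all parity checks satisfied
        for bit in range(len(w)):
            total = sum(1 for chk in checks if bit in chk)
            bad = sum(1 for chk in unsat if bit in chk)
            if 2 * bad > total:          # more than half unsatisfied
                w[bit] ^= 1
    return w

codeword = [1, 1, 0, 1, 1, 0, 0, 0, 0]   # every row/column parity is even
corrupted = list(codeword)
corrupted[4] ^= 1                         # single transient bit error
print(bit_flip_decode(corrupted, CHECKS) == codeword)  # -> True
```

With J = 2, "more than half" means both of a bit's checks must fail before it is flipped, which is exactly the situation of the single erroneous bit; every other bit shares at most one unsatisfied check and is left alone.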
The number of independent iterations is important because during the first m iterations, the probability of error in an estimate for a particular bit-copy can be calculated using independence. It is shown in [Gallager, 1963] that, when using the modified iterative decoding scheme, the number of independent iterations for any LDPC code is upper bounded by

    m < log n / log[(K-1)(J-1)] .

In his thesis, Gallager suggested a procedure for constructing sequences of (n, k) LDPC codes with fixed J, K (i.e., with parity check matrices that have J "1s" in each column and K "1s" in each row) such that the rate is bounded by k/n >= 1 - J/K and the number of independent iterations m is bounded by

    m + 1 > ( log n + log[(KJ-K-J)/(2K)] ) / ( 2 log[(K-1)(J-1)] ) > m .    (7.1)

Building on Gallager's work, Taylor considered the construction of reliable memories out of unreliable memory elements [Taylor, 1968b]. More specifically, Taylor assumed that the unreliable memory elements (flip-flops) store a single bit ("0" or "1") but can suffer transient faults with probability p_c, independently between different time steps. Taylor constructed reliable (or stable) memory arrays out of unreliable flip-flops using (n, k) LDPC codes: a reliable memory array uses n flip-flops to store k bits of information; at the end of each time step an unreliable error-correcting mechanism re-establishes (or at least tries to re-establish) the correct state in the memory array. The memory scheme performs acceptably for L time steps if, at any time step τ (0 <= τ <= L), the k information bits can be recovered from the n memory bits. This means that the n-bit sequence stored in the memory at time step τ has to be within the set of n-bit sequences that get decoded to the originally stored codeword (i.e., if a fault-free iterative decoder were available, one could successfully use it to obtain the codeword that was stored in the memory array at time step 0).
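The logarithmic growth of the independent-iteration bound above can be evaluated directly; the (J, K, n) values in the sketch below are illustrative assumptions, not parameters from the text:

```python
# Upper bound on the number of independent iterations of the modified
# iterative decoding scheme: m < log(n) / log((K-1)(J-1)).
# The (J, K, n) values are illustrative only.
from math import log

def max_independent_iterations(n, J, K):
    return log(n) / log((K - 1) * (J - 1))

for n in (10**3, 10**6):
    print(n, max_independent_iterations(n, J=3, K=6))
```

The bound grows only logarithmically in the block length n, which is why long codes are needed before many iterations can be analyzed under the independence assumption.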
Note that if error correction is fault-free, the problem of constructing reliable memory arrays is trivial because it can be viewed as a sequence of transmissions through identical unreliable binary symmetric channels. Each transmission involves an n-bit sequence that ideally represents an (n, k) codeword. Each bit transmission is successful with probability 1 - p_c and unsuccessful with probability p_c. At the end of each channel transmission, error correction is performed and the (corrected) n-bit sequence is passed on to the next channel (i.e., the first node transmits an (n, k) codeword to the second node via an unreliable communication link; after performing error detection and correction, the second node transmits the corrected (ideally the same) codeword to the third node, and so forth). Therefore, if error correction is fault-free, one can use Shannon's result to establish that, by increasing k (and n), one can obtain reliable memory arrays as long as k/n <= C, where C is the capacity of a binary symmetric channel with crossover probability p_c:

    C = 1 + p_c log p_c + (1 - p_c) log(1 - p_c) .

When faults may take place in the error-correcting mechanism, however, the analysis becomes significantly harder. Taylor used LDPC codes and Gallager's modified iterative decoding procedure to build a correcting mechanism out of unreliable 2-input XOR gates and unreliable (J-1)-bit voters that suffer transient faults (i.e., output an incorrect bit) with probabilities p_x and p_v respectively.
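The capacity expression above (with logarithms base 2) is straightforward to evaluate; the crossover probabilities in this sketch are illustrative values, not ones used in the text:

```python
# Capacity of a binary symmetric channel with crossover probability p_c:
#   C = 1 + p_c*log2(p_c) + (1 - p_c)*log2(1 - p_c).
from math import log2

def bsc_capacity(p_c):
    if p_c in (0.0, 1.0):      # limiting cases: noiseless/inverted channel
        return 1.0
    return 1 + p_c * log2(p_c) + (1 - p_c) * log2(1 - p_c)

# With fault-free correction, reliable storage requires rate k/n <= C:
for p in (0.001, 0.01, 0.1, 0.5):
    print(p, bsc_capacity(p))
```

At p_c = 0.5 the capacity is zero (the channel output is independent of the input), while for small p_c the achievable rate k/n stays close to 1, consistent with the constant-redundancy-per-bit claim.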
Taylor constructed reliable memory arrays using (n, k) LDPC codes (with ~ ;::: 1 J < K) such that the probability of a overall failure increases linearly with the number of time steps T and decreases polynomially with k (i.e., the probability of overall failure is O( Tk- (3 ) for a positive constant /3). By increasing k, the probability of overall failure can be made arbitrarily small while keeping ~ ;::: 1 (and thus the redundancy per bit) below a constant. Note that Taylor's construction of reliable memory arrays uses In voters, In flip-flops and In[1 + (J - 1)(K - 1)] 2-input XOR gates; since f S I-J; K' the overhead per bit (in terms of the overall number of flip-flops, XOR gates and voters) remains below a constant as k and n increase. Taylor also showed that one can reliably perform the XOR operation on k pairs of bits by performing component-wise XOR-ing on two (n, k) codewords. In fact, he showed that one can reliably perform a sequence of T such component-wise XOR operations [Taylor, 1968a]. k, k 4.2 RELIABLE LINEAR FINITE-STATE MACHINES USING CONSTANT REDUNDANCY Consider an LFSM with a single-bit input (u and state evolution = 1), a d-dimensional state q[t + 1J = Acq[t] EB bx[tJ . (7.2) Without loss of generality, the d x d matrix Ac can be assumed to be in classical canonical form (see the discussion in Section 2 of Chapter 6). Any such LFSM can be implemented using 2-input XOR gates and flip-flops as outlined in Chapter 6. In these implementations, each of the d bits in the next-state vector q[t + 1] is generated using at most two bits from q[t] and at most one bit from the input; therefore, the calculation of each bit in q[t + 1J can be accomplished by using at most two 2-input XOR operations (this is direct consequence of the fact that the canonical matrix Ac has at most two "1s" in each row). 
If k such LFSM's operate in parallel, each with a possibly different initial state and a possibly different input stream, the result is k parallel instantiations of the system in Eq. (7.2):

    [ q_1[t+1] ... q_k[t+1] ] = A_c [ q_1[t] ... q_k[t] ] ⊕ b [ x_1[t] ... x_k[t] ] .    (7.3)

Let G be the n x k encoding matrix of an (n, k) linear code. If both sides of Eq. (7.3) are post-multiplied by G^T, one obtains the following n encoded parallel instantiations:

    ( A_c [ q_1[t] ... q_k[t] ] ) G^T ⊕ ( b [ x_1[t] ... x_k[t] ] ) G^T
        = A_c ( [ q_1[t] ... q_k[t] ] G^T ) ⊕ b ( [ x_1[t] ... x_k[t] ] G^T ) ,

or equivalently

    [ ξ_1[t+1] ... ξ_n[t+1] ] = A_c [ ξ_1[t] ... ξ_n[t] ] ⊕ b X[t] ,    (7.4)

where X[t] = [ x_1[t] ... x_k[t] ] G^T and [ ξ_1[t] ... ξ_n[t] ] = [ q_1[t] ... q_k[t] ] G^T. Effectively, n LFSM's with state evolution of the form of Eq. (7.2) are used to perform k different encoded instantiations of the system in Eq. (7.2). As shown in Figure 7.4, the operation of k identical LFSM's acting on distinct input streams has effectively been replaced by n redundant LFSM's acting on encoded versions of the k original inputs.

Figure 7.4. Replacing k LFSM's with n redundant LFSM's. (The k identical LFSM's on k distinct inputs are replaced by n redundant LFSM's driven by encoded inputs.)

Input encoding is performed according to an (n, k) linear code with generator matrix G. Each of the n redundant systems is implemented using a separate set of flip-flops and XOR gates; for simplicity, flip-flops are assumed to be reliable and encoding is assumed to be instantaneous and fault-free. Most of these assumptions could be relaxed; the real issue with the encoding mechanism is its time and hardware complexity - see the discussion in the next section of this chapter. At each time step, n encoded inputs are provided to the n redundant LFSM's and each of them evolves to its corresponding (and possibly erroneous) next state.
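The manipulation leading to Eq. (7.4) can be checked numerically. In the sketch below, the LFSM (A_c, b), the (3, 2) parity code G, the initial states and the input streams are all hypothetical illustrative choices; the check confirms that, in the absence of faults, the encoded states remain equal to the original states post-multiplied by G^T:

```python
# Sketch of Eqs. (7.3)-(7.4) over GF(2): k parallel LFSM copies are
# replaced by n redundant copies whose states stay related to the
# originals via post-multiplication by G^T.  All parameters below are
# hypothetical illustrative choices.
def mat_mul(A, B):                        # matrix product over GF(2)
    return [[sum(a * b for a, b in zip(row, col)) % 2
             for col in zip(*B)] for row in A]

def mat_xor(A, B):                        # entrywise XOR (addition mod 2)
    return [[(x + y) % 2 for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def transpose(M):
    return [list(r) for r in zip(*M)]

A_c = [[0, 1], [1, 1]]                    # d x d state-update matrix (d = 2)
b   = [[1], [0]]                          # input column
G   = [[1, 0], [0, 1], [1, 1]]            # (3, 2) parity code: n = 3, k = 2
GT  = transpose(G)

Q  = [[1, 0], [0, 1]]                     # d x k: one state column per machine
Xi = mat_mul(Q, GT)                       # d x n encoded states

inputs = [[1, 0], [0, 1], [1, 1], [1, 0]]  # (x_1[t], x_2[t]) per time step
for x in inputs:
    X = mat_mul([x], GT)                  # encoded input row X[t] = x G^T
    Q  = mat_xor(mat_mul(A_c, Q),  mat_mul(b, [x]))   # Eq. (7.3)
    Xi = mat_xor(mat_mul(A_c, Xi), mat_mul(b, X))     # Eq. (7.4)

print(Xi == mat_mul(Q, GT))               # code relation preserved -> True
```

This is exactly the linearity argument in the text: A_c (Q G^T) ⊕ b (x G^T) = (A_c Q ⊕ b x) G^T over GF(2), so the n redundant machines carry encoded versions of the k original trajectories for as long as no fault occurs.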
At the end of the time step, errors in the new states of the n systems are corrected by performing error correction on d codewords from the (n, k) linear code with generator matrix G (the ith codeword is obtained by collecting the ith bit from each of the n state vectors). If error correction were fault-free, one could invoke Shannon's result and argue that, by increasing k (and n), the condition in Eq. (7.5) can be satisfied with an arbitrarily high probability (at least for a pre-specified, finite number of time steps and as long as the probability of component faults is below a certain constant). More specifically, one can make the probability of "error per time step" (i.e., the probability of an overall failure at a particular time step given no corruption at the previous time step, denoted by Pr[error per time step]) arbitrarily small. Then, using the union bound, the probability of an overall failure over L consecutive time steps could be bounded by

Pr[overall failure at or before time step L] ≤ L Pr[error per time step] .

To make the above argument more precise, one has to bound the probability of error per bit during each time step. Assuming that there are no corruptions in any of the n state vectors at the beginning of a given time step, the probability of a bit error (in any particular bit of the n next-state vectors) can be obtained by considering the number of XOR operations that are involved. If this bit-error probability is less than 1/2 and if errors among different bits are independent, then the problem essentially reduces to an unreliable communication problem. Fault-free error correction essentially ensures that at the beginning of each time step the overall redundant state will be correct (unless, of course, an overall failure took place). Since fault-free error correction is not an option, the approach taken in the rest of this chapter is quite different.
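The fault-free error-correcting step just described (collect the ith bit of each of the n state vectors into a codeword, then decode) can be sketched with a (7, 4) Hamming code standing in for the LDPC codes; the syndrome decoder here is assumed fault-free, unlike the unreliable mechanism adopted later in the chapter:

```python
# Fault-free stand-in for the per-time-step correction: each of the d
# codewords is assembled across the n = 7 state vectors, and any single
# bit error in it is removed by syndrome decoding of the (7, 4) code.
H = [[1, 1, 0, 1, 1, 0, 0],   # parity-check matrix of the systematic (7,4) code
     [1, 0, 1, 1, 0, 1, 0],
     [0, 1, 1, 1, 0, 0, 1]]

def correct(word):
    """Flip the single position whose column of H matches the syndrome."""
    s = [sum(h * w for h, w in zip(row, word)) % 2 for row in H]
    if any(s):
        for j in range(7):
            if [row[j] for row in H] == s:
                word[j] ^= 1
                break
    return word

# d = 2 codewords collected across the n = 7 redundant state vectors
state = [[1, 0, 1, 1, 0, 1, 0],    # codeword of message 1011
         [1, 1, 0, 1, 1, 0, 0]]    # codeword of message 1101
state[0][3] ^= 1                   # a transient fault corrupts one system's bit
corrected = [correct(list(c)) for c in state]
assert corrected == [[1, 0, 1, 1, 0, 1, 0], [1, 1, 0, 1, 1, 0, 0]]
```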
In order to allow faults in the error-correcting mechanism, one employs LDPC codes and performs error correction in each bit using the unreliable error-correcting mechanism of Figure 7.3. This error-correcting mechanism is implemented using different unreliable XOR gates and different unreliable voters for each bit (so that a single fault in a component corrupts a single bit). Following Taylor's scheme in [Taylor, 1968b], one actually needs to have J replicas of each of the n redundant systems (a total of Jn systems). At the beginning of each time step, these Jn systems evolve to a (possibly corrupted) next state; at the end of the time step, error correction is performed using one iteration of Gallager's modified iterative decoding scheme (see Section 4.1). Once faults in the error-correcting mechanism are allowed, one can no longer guarantee that the invariant condition in Eq. (7.5) will be true at the beginning of each time step. However, as long as no overall failure takes place, the overall redundant state (i.e., the state of all Jn systems) at a certain time step can correctly represent the state of the k underlying systems.2 In such a case, one can recover the exact state of the k underlying systems (e.g., by using an iterative decoder).

THEOREM 7.2 Consider k distinct instantiations of an LFSM with state evolution as in Eq. (7.2), each instantiation with its own initial state and a distinct input stream. These k instantiations are embedded into n redundant LFSM's [also with state evolution as in Eq. (7.2)] using the approach of Eq. (7.4), where G is the n x k encoding matrix of a linear (n, k) LDPC code. Each redundant system is properly initialized (so that Eq. (7.5) is satisfied for T = 0) and is supplied with an encoded input according to X[t] in Eq. (7.4).
Each of the n redundant systems has J realizations (so that there is a total of Jn systems) that use their own (dedicated) sets of reliable flip-flops and unreliable 2-input XOR gates. At the beginning of a time step, all Jn redundant systems evolve to a (possibly corrupted) next state. At the end of the time step, Gallager's modified iterative decoding scheme is used to correct any errors that may have taken place. Each bit-copy is corrected using different hardware, i.e., a different set of 1 + (J−1)(K−1) unreliable 2-input XOR gates and one unreliable (J−1)-bit voter. Let J be a fixed even integer greater than 4 and let K be an integer greater than J. If the 2-input XOR gates suffer transient faults independently with probability bounded by Px, the (J−1)-bit voters suffer transient faults independently (and independently from the XOR gates) with probability bounded by Pv, and there exists p such that

p > ( J−2 choose J/2−1 ) [(K−1)(2p + 3Px)]^{J/2} + Pv + Px ,

then there exists a sequence of (n, k) LDPC codes (with k/n ≥ 1 − J/K) such that the probability of an overall failure at or before time step L is bounded above as follows:

Pr[overall failure at or before time step L] < L d C k^{-β} ,

where β and C are constants given by

β = − log{ (J−1)(K−1) ( J−2 choose J/2−1 ) [(K−1)(2p + 3Px)]^{J/2−1} } / ( 2 log[(J−1)(K−1)] ) − 3 ,

C = ( J (2p + 3Px) / (1 − J/K)^3 ) [ 1/(2K) − 1/(2J(K−1)) ]^{-(β+3)} .

The code redundancy is n/k ≤ 1/(1 − J/K), and the hardware used (including hardware in the error-correcting mechanism) is bounded above by J d (3 + (J−1)(K−1)) / (1 − J/K) XOR gates and by J d / (1 − J/K) voters per system (d is the system dimension).

Proof: The proof follows similar steps as the proofs in [Taylor, 1968a; Taylor, 1968b]. The following discussion provides an overview of the proof; a complete description can be found in Appendix 7.A.

The state of the overall redundant implementation at a given time step T [i.e., the states of the n redundant systems created by the embedding in Eq.
(7.4)] is fully captured by d codewords c_i[t] from an (n, k) LDPC code (1 ≤ i ≤ d). In other words, the state evolution equation of the n systems can be written as

[ c1[t+1] ]        [ c1[t] ]
[ c2[t+1] ]  = Ac  [ c2[t] ]  ⊕ b X[t] ,
[   ...   ]        [  ...  ]
[ cd[t+1] ]        [ cd[t] ]

where X[t] = [ x1[t] x2[t] ... xk[t] ] G^T is the encoding of the k inputs at time step t and Ac, b are the matrices in the state evolution equation (7.2). Taylor showed that the addition of any two (n, k) codewords modulo-2 can be done reliably using LDPC codes and Gallager's modified iterative decoding scheme. Furthermore, he showed that one can reliably perform a sequence of L such additions by performing a component-wise XOR operation in an array of n 2-input XOR gates followed by one iteration of Gallager's modified scheme (using the mechanism shown in Figure 7.3). More specifically, Taylor showed that

Pr[overall failure in a sequence of L array additions] < L C' k^{-β'}

for constants C' and β' that depend on the fault probability of the XOR gates and the voters, and on the parameters of the LDPC codes used. Taylor's scheme can be used to perform error correction in the d codewords from the (n, k) code. This requires, of course, that one maintains J copies of each codeword (a total of Jd codewords). During each time step, the overall redundant implementation calculates its new state (Jd new codewords) by adding modulo-2 the corresponding codewords of the current state; this is then followed by one iteration of error correction based on Gallager's modified scheme. Since matrix Ac is in canonical form, the computation of each codeword in the next overall state is based on at most two codewords of the current state (plus the input modulo-2). So, over L time steps, one essentially has d sequences of additions modulo-2 in the form that Taylor considered and which he showed can be protected efficiently via LDPC coding.
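The existence requirement on p in Theorem 7.2 can be certified numerically by fixed-point iteration. The parameters below (J = 6, K = 8, Px = Pv = 1e-6) are illustrative choices, not values from the text:

```python
# Numerical sanity check of the theorem's condition
#   p > C(J-2, J/2-1) [(K-1)(2p + 3Px)]^{J/2} + Pv + Px
# for one illustrative parameter set: iterating p <- rhs(p) from 0 finds a
# fixed point, and any slightly larger p then satisfies the strict inequality.
from math import comb

J, K = 6, 8
Px = Pv = 1e-6

def rhs(p):
    return comb(J - 2, J // 2 - 1) * ((K - 1) * (2 * p + 3 * Px)) ** (J // 2) + Pv + Px

p = 0.0
for _ in range(20):
    p = rhs(p)          # converges quickly for these small fault rates

candidate = 1.1 * p
assert rhs(candidate) < candidate   # such a p exists, so the theorem applies
```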
Using the union bound, the probability of an overall failure at or before time step L can be bounded as

Pr[overall failure at or before time step L] < L d C k^{-β} .

Note that the input also needs to be considered (and is one of the reasons that the constants β and C differ from the ones in Taylor's work), but it is not critical in the proof since the inputs involve no memory. □

5 OTHER ISSUES

In a memoryless binary symmetric channel with crossover probability p, a bit ("0" or "1") that is provided as input at the transmitting end is corrupted at the receiving end with probability p. Errors between successive uses of the channel are assumed to be independent. Shannon studied ways to encode k input bits into n redundant bits in order to achieve low probability of overall failure during transmissions. He showed that the probability of error can be made arbitrarily low using coding techniques, as long as the rate R = k/n of the code is less than the capacity of the channel, defined as C = 1 + p log p + (1−p) log(1−p) for the binary symmetric channel. Conversely, for rates R greater than C, the probability of error per bit in the transmitted sequence cannot be made arbitrarily small.

Theorem 7.2 looked at embeddings of k distinct instantiations of a particular LFSM into n redundant systems, each of which is implemented using unreliable components. It was shown that, given certain conditions on the fault probabilities of components, there exist LDPC codes that allow the n LFSM's to reliably implement k identical LFSM's (that nevertheless operate on distinct input streams) and, with nonzero "rate," achieve arbitrarily low probability of overall failure during any pre-specified time interval. "Rate" in this context refers to the amount of redundant hardware that is required per machine instantiation. Specifically, by increasing n and k while keeping k/n ≥ 1 − J/K, the probability of an overall failure can be made arbitrarily small.
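The binary symmetric channel capacity formula quoted above is straightforward to evaluate (logs base 2):

```python
# C = 1 + p log2 p + (1 - p) log2(1 - p): the capacity of a BSC with
# crossover probability p, in bits per channel use.
from math import log2

def bsc_capacity(p):
    if p in (0.0, 1.0):
        return 1.0
    return 1 + p * log2(p) + (1 - p) * log2(1 - p)

assert abs(bsc_capacity(0.5)) < 1e-12                     # a useless channel
assert 0.49 < bsc_capacity(0.11) < 0.51                   # roughly 1/2 bit per use
assert abs(bsc_capacity(0.11) - bsc_capacity(0.89)) < 1e-12  # symmetric in p
```

For example, a crossover probability of 0.11 leaves a capacity of about 0.5, so rates k/n below 0.5 admit arbitrarily reliable transmission.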
An upper bound on k/n, which might then be called the computational capacity, was not obtained in Theorem 7.2. Also notice that the bound on the probability of failure that was obtained in Theorem 7.2 goes down polynomially with the number of systems k (not exponentially, as was the case for the distributed voting scheme and for Shannon's approach).

Another issue that was not explicitly addressed in the development of Theorem 7.2 was the encoding of the k original inputs into n inputs according to X[t] = [ x1[t] x2[t] ... xk[t] ] G^T [see Eq. (7.4)]. Using the generator matrix G, one sees that each of the n encoded bits can be generated using at most k information bits (i.e., at most k−1 2-input XOR gates). This approach, however, is problematic because, as k (and n) increase, each bit will be encoded incorrectly with probability approaching 1/2 [Gallager, 1963; Taylor, 1968b]. One alternative is to encode using a binary tree of depth log k, where each node performs a component-wise 2-input XOR operation on two arrays of n bits. This encoding approach requires O(nk) 2-input XOR gates and O(log k) time steps to complete, but can be done reliably using unreliable XOR gates if at the end of each stage of the tree evaluation one performs a correcting iteration (of the type performed at the end of each time step during the operation of the system). One potential problem is that this encoding approach will reduce the operating speed of the system by O(log k) steps. The constraints due to the "computational capacity" and the encoding complexity limitations are issues that are worth exploring in the future.
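The tree-structured encoder can be sketched as follows: each input bit x_i selects an n-bit array (x_i times the corresponding row of G^T), and the k arrays are combined by component-wise 2-input XORs in a balanced tree of depth ceil(log2 k). In the scheme above a correcting iteration would follow each tree stage; this fault-free sketch only checks the structure, again with a (7, 4) code standing in:

```python
# Balanced-tree encoder: O(nk) 2-input XOR gates arranged in ceil(log2 k)
# stages, each stage XOR-ing pairs of n-bit arrays component-wise.
Gt = [[1, 0, 0, 0, 1, 1, 0],   # G^T of a systematic (7,4) code: k = 4 rows of n = 7 bits
      [0, 1, 0, 0, 1, 0, 1],
      [0, 0, 1, 0, 0, 1, 1],
      [0, 0, 0, 1, 1, 1, 1]]

def tree_encode(x):
    arrays = [[xi * g for g in row] for xi, row in zip(x, Gt)]
    depth = 0
    while len(arrays) > 1:                    # one tree stage per pass
        pairs = [arrays[i:i + 2] for i in range(0, len(arrays), 2)]
        arrays = [[a ^ b for a, b in zip(*pr)] if len(pr) == 2 else pr[0]
                  for pr in pairs]
        depth += 1
    return arrays[0], depth

word, depth = tree_encode([1, 0, 1, 1])
direct = [sum(x * g for x, g in zip([1, 0, 1, 1], col)) % 2 for col in zip(*Gt)]
assert word == direct and depth == 2          # log2(4) = 2 stages
```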
The idea of having multiple unreliable implementations of the same system, each operating on distinct inputs and offering assistance to the others in order to achieve reliable computation, is a possibility that needs to be explored further (along the lines of [Taylor, 1968b; Gacs, 1986; Spielman, 1996a; Hadjicostis, 1999]). Another promising direction is to explore the applicability and effectiveness of encoding the state of an individual system using various types of codes (along the lines of [Larsen and Reed, 1972; Wang and Redinbo, 1984]).

APPENDIX 7.A: Proof of Theorem 7.2

The proof of Theorem 7.2 appears in [Hadjicostis, 1999] and follows the steps in [Taylor, 1968a; Taylor, 1968b]. The overall redundant implementation starts operation at time step 0. As described in Section 4, during each time step, all Jn redundant systems are first allowed to evolve to their corresponding (and possibly corrupted) next states; then, error correction is performed using one iteration of Gallager's modified decoding scheme. This is done in parallel for each of the Jd (n, k) codewords (recall that each codeword has J copies; see Figure 7.A.1). The low-density parity check (LDPC) coding scheme was constructed so that the number of independent iterations m satisfies Eq. (7.1). Therefore, for the first m time steps, the parity checks that are involved in correcting a particular bit-copy are guaranteed to be in error with independent probabilities (because errors within these parity check sets are generated by different components). After the first m time steps, the independence condition in the parity checks will

Figure 7.A.1.
Encoded implementation of k LFSM's using n redundant LFSM's.

not necessarily be true. If, however, no component fault influences decisions for m or more consecutive time steps (i.e., by causing a bit-copy to be incorrect m or more time steps in the future), then one can guarantee that the J−1 parity checks for a particular bit-copy are in error with independent probabilities. The following definitions make this more precise.

DEFINITION 7.A.3 A propagation failure occurs whenever any of the Jnd bit-copies in the overall redundant implementation is erroneous due to component faults that occurred more than m time steps in the past.

DEFINITION 7.A.4 The initial propagation failure denotes the first propagation failure that takes place, i.e., the occurrence of the first component fault that propagates for m+1 time steps in the future.

It will be shown that a propagation failure is very unlikely and that in most cases the bit errors in the Jd codewords that represent the encoded state of all LFSM's will depend only on component faults that occurred within the last few time steps. To calculate this bound on the probability of propagation failure, one uses a bound on the probability of error per bit-copy, which is established in the next section.

Initial Propagation Failure

The probability of error per bit-copy at the end of time step T, 0 ≤ T ≤ m, can be bounded by a constant p. It will be shown that this is true as long as

1. T ≤ m, and

2. p satisfies the condition in the statement of Theorem 7.2.

To see why this is the case, consider the following:

• In order to calculate a certain bit-copy in its next-state vector, each of the Jn redundant systems uses at most two bit-copies from a previous state vector and performs at most two 2-input XOR operations (one XOR-ing involves the two bit-copies in the previous state vector, the other one involves the input).
Using the union bound, the probability of error per bit-copy at the end of the state evolution stage is bounded above by

Pr[error per bit-copy after state evolution at step T] ≤ 2p + 2Px ≡ q .

This is simply the union bound of the events that any of the two previous bit-copies is erroneous and/or that there is a fault in any of the two XOR gates (for simplicity the input provided is assumed to be correct). Note that independence is not required here.

• Once all Jn systems transition to their next states, error correction is performed along the Jd codewords (see Figure 7.A.1). Correction involves one iteration of Gallager's modified decoding scheme. Recall that each bit-copy is corrected using J−1 parity checks, each of which involves K−1 other bit-copies. A parity check associated with a particular bit-copy b^j (1 ≤ j ≤ J) is said to be in error if bit-copy b^j is incorrect but the parity check is "0", or if b^j is correct but the parity check is "1." This is because ideally one would like parity checks to be "0" if their corresponding bit-copy is correct and to be "1" if the bit-copy is incorrect. Note that this definition decouples the probability of a parity check being in error from whether or not the associated bit-copy is erroneous. The probability of an error in the calculation of a parity check (see the error-correcting mechanism in Figure 7.3) is bounded by

Pr[parity check in error] ≤ (K−1)(q + Px) = (K−1)(2p + 3Px)

(i.e., a parity check for a particular bit-copy is in error if there is an error in any of the K−1 other bit-copies or a fault in any of the K−1 XOR operations).

• A particular bit-copy will not be corrected if one or more of the following three events happen: (i) J/2 or more of the associated J−1 parity checks are in error, (ii) there is a fault in the voting mechanism, or (iii) there is a fault in the XOR gate that receives the voter output as input (see Figure 7.3).
If the parity checks associated with a particular bit-copy are in error with independent probabilities, then

Pr[error per bit-copy after correction] ≤ ( J−2 choose J/2−1 ) [(K−1)(2p + 3Px)]^{J/2} + Pv + Px ≤ p .

Therefore, if the parity checks for each bit-copy are in error with independent probabilities, then the system ends up with a probability of error per bit-copy that satisfies

Pr[error per bit-copy at end of time step T] ≤ p .

The constant p can be viewed as a bound on the "steady-state" probability of error per bit-copy at the end/beginning of each time step (at least up to time step m). This "steady-state" probability of error per bit-copy remains valid for T > m as long as the initial propagation failure does not take place. The only complication is that the probability of error per bit-copy conditional on the event that no propagation failure has taken place may not necessarily be bounded by p. Next, it is shown that this bound does remain true; the proof is a direct consequence of the definition of a propagation failure.

At the end of time step T = m, the probability of error per bit-copy is bounded by p. However, in order to derive the same bound for the probability of error per bit-copy at the end of time step T = m+1, one has to assume that different parity checks for a particular bit-copy are in error independently. To ensure this, it is enough to require that no component fault took place at time step T = 0 and propagated up to time step m (so that it causes a propagation failure at time step T = m+1). The probability that a particular bit-copy b^j (1 ≤ j ≤ J) is in error at the end of time step T = m conditional on no propagation failure (no PF) up to time step T = m is denoted by

Pr[error per bit-copy at end of time step T = m | no initial PF at T = m]

and is smaller than or equal to the "steady-state" probability of error per bit-copy (i.e., smaller than p).
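The union bound q = 2p + 2Px from the state evolution stage can be checked by a small Monte Carlo experiment: the computed bit is wrong exactly when an odd number of the four fault events (two possibly erroneous source bit-copies, two possibly faulty XOR gates) occur, and that probability never exceeds the sum of the four individual probabilities. The fault rates below are illustrative:

```python
# Monte Carlo check of Pr[error per bit-copy after state evolution] <= 2p + 2Px,
# assuming a correct input bit, as in the text.
import random

random.seed(0)
p, Px, trials = 0.05, 0.02, 200_000

def bern(prob):
    return random.random() < prob

wrong = 0
for _ in range(trials):
    err_a, err_b = bern(p), bern(p)   # errors in the two source bit-copies
    g1, g2 = bern(Px), bern(Px)       # transient faults in the two XOR gates
    # the computed bit differs from the correct one iff an odd number of flips
    wrong += (err_a ^ err_b ^ g1 ^ g2)

assert wrong / trials <= 2 * p + 2 * Px   # the union bound q = 2p + 2Px holds
```

The empirical rate lands near 0.127 here, comfortably below the bound q = 0.14; the slack is exactly the probability of an even number of simultaneous flips, which the union bound ignores.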
To see this, consider patterns of component faults at time steps T = 0, 1, ..., m that cause bit-copy b^j to be erroneous at the end of time step m. If this event is called A, then it is clear that Pr(A) ≤ p. Let B denote the set of primitive events (patterns of component faults at time steps T = 0, 1, ..., m) that lead to a propagation failure at bit-copy b^j. Note that by definition set B is a subset of A (B ⊂ A) because a propagation failure at time step T = m has to corrupt bit-copy b^j. Therefore,

Pr[b^j is erroneous at end of T = m | no initial PF at b^j at time m] = ( Pr(A) − Pr(B) ) / ( 1 − Pr(B) ) ≤ Pr(A) ≤ p .

One easily concludes that the "steady-state" probability of error per bit-copy remains bounded by p given that no propagation failure takes place. Note that one actually conditions on the event that "no propagation failure takes place in any of the bit-copies at time step m," which is different from event B. The proof goes through in exactly the same way because a pattern of component faults that causes a propagation failure at a different bit-copy can either cause an error in the computation of bit-copy b^j or not interfere with it at all.

Bounding the Probability of Initial Propagation Failure

Given that no propagation failure has taken place up to time step T, a bound on the probability of error per bit-copy is available and the parity checks for a given bit-copy are in error with independent probabilities. Using this, one can calculate the probability that a component fault that took place at time step T−m propagates up to time step T, corrupts the value of bit-copy b^j (1 ≤ j ≤ J) and causes the initial propagation failure. This is called the "probability of initial propagation failure at bit-copy b^j." Note that in order for a component fault to propagate for m time steps it is necessary that it was critical in causing a wrong decision during the correcting stages of m consecutive time steps.
In other words, without this particular component fault the decision/correction for all of these time steps would have had the desired effect. Let P_m denote the probability that a component fault has propagated for m consecutive time steps in a way that causes the value of bit-copy b^j at time step T to be incorrect. In order for this situation to happen, both of the following two independent conditions are required:

1. The value of one or more of the (J−1)(K−1) bit-copies involved in the parity checks of bit-copy b^j is incorrect because of a component fault that has propagated for m−1 time steps. Since each such bit-copy was generated during the state evolution stage of time step T based on at most two bit-copies from the previous state vector (the input bit is irrelevant in error propagation), the probability of this event is bounded by

(J−1)(K−1) 2 P_{m−1} ,

where P_{m−1} is the probability that a component fault has propagated for m−1 consecutive time steps (causing an incorrect value in one of the bit-copies used in the parity checks for b^j). The factor of two comes in because the fault that propagates for m−1 time steps could be in any of the at most two bit-copies used to generate b^j during the state evolution stage. This is due to the fact that the system matrix Ac in Eq. (7.2) is in standard canonical form.

2. Since one of the parity checks is associated with the fault that has propagated for m−1 time steps, at least J/2 − 1 out of the J−1 remaining parity checks would have to be erroneous. The probability of this event is bounded by

( J−2 choose J/2−1 ) [(K−1)(q + Px)]^{J/2−1} .

If no propagation failure has taken place, errors in parity checks will be independent. Therefore, the probability of a fault propagating for m consecutive time steps is bounded by

P_m ≤ (J−1)(K−1) 2 P_{m−1} ( J−2 choose J/2−1 ) [(K−1)(q + Px)]^{J/2−1} .

Similarly, P_{m−1} can be bounded in terms of P_{m−2}, and so forth.
One concludes that

P_m ≤ (q + Px) 2^m { (J−1)(K−1) ( J−2 choose J/2−1 ) [(K−1)(q + Px)]^{J/2−1} }^m .

The union bound can be used to obtain an upper bound on the probability that the initial propagation failure takes place at time step T. For this to happen, a pattern of component faults has to propagate for m time steps in at least one of the Jnd bit-copies of the redundant construction, i.e.,

Pr[initial prop. failure at time step T] ≤ J n d P_m .

If one uses Gallager's construction in [Gallager, 1963], the LDPC codes will satisfy the following conditions:

m > ( log n + log[ (KJ − K − J) / (2KJ(K−1)) ] ) / ( 2 log[(J−1)(K−1)] ) ≡ A(n) ,

m < log n / log[(J−1)(K−1)] ,

k ≥ n (1 − J/K) .

Using the first inequality, one obtains

Pr[initial prop. failure at time step T] ≤ J n d (q + Px) 2^m { (J−1)(K−1) ( J−2 choose J/2−1 ) [(K−1)(q + Px)]^{J/2−1} }^{A(n)}
≤ J n d (q + Px) 2^m { [ 1/(2K) − 1/(2J(K−1)) ] n }^{-β'} ,

where β' is given by

β' = − log{ (J−1)(K−1) ( J−2 choose J/2−1 ) [(K−1)(q + Px)]^{J/2−1} } / ( 2 log[(J−1)(K−1)] ) .

Since k ≤ n and n ≤ k/(1 − J/K), one gets

Pr[initial prop. failure at time step T] ≤ ( J / (1 − J/K) ) k d (q + Px) 2^m { [ 1/(2K) − 1/(2J(K−1)) ] k }^{-β'} .

Clearly,

2^m < 2^{log n} = n ≤ k / (1 − J/K) ,

which leads to

Pr[initial prop. failure at T] < d C' k^{-β'+2} ,

where

C' = ( J (q + Px) / (1 − J/K)^2 ) [ 1/(2K) − 1/(2J(K−1)) ]^{-β'} .

Bounding the Probability of Overall Failure

Note that if no propagation failure takes place in the interval from time step 0 to T, a fault-free iterative decoder will be able to correctly decode the state of the overall system (all Jd codewords) at each time step. The reason is that no previous fault can be critical in causing consecutive erroneous decisions in more than m decoding iterations. Using this, one can find an upper bound on the probability that the initial propagation failure takes place at time step T, assuming that no propagation failure has taken place in the time interval from 0 to T.
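As a numerical illustration, one can check that the two requirements on the number of independent iterations m leave room for an integer choice. The values J = 6, K = 8, n = 10^6 below are illustrative (note that the base of the logarithms cancels in both ratios):

```python
# Window for m in Gallager's construction:
#   A(n) < m < log n / log[(J-1)(K-1)],  with
#   A(n) = (log n + log[(KJ-K-J)/(2KJ(K-1))]) / (2 log[(J-1)(K-1)]).
from math import log

J, K, n = 6, 8, 10**6
den = log((J - 1) * (K - 1))
A = (log(n) + log((K * J - K - J) / (2 * K * J * (K - 1)))) / (2 * den)
upper = log(n) / den
valid_m = [m for m in range(1, 20) if A < m < upper]
assert valid_m == [2, 3]   # an integer m exists for these parameters
```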
An upper bound on the probability that the initial propagation or decoding failure takes place is given by

Pr[overall failure at time step T] < m d C' k^{-β'+2} ,

or, since m ≤ log n ≤ n ≤ k/(1 − J/K),

Pr[overall failure at time step T] < d C k^{-β} ,

where

β = β' − 3 = − log{ (J−1)(K−1) ( J−2 choose J/2−1 ) [(K−1)(q + Px)]^{J/2−1} } / ( 2 log[(J−1)(K−1)] ) − 3 ,

C = C' / (1 − J/K) = ( J (q + Px) / (1 − J/K)^3 ) [ 1/(2K) − 1/(2J(K−1)) ]^{-β'} .

Using the union bound, the probability of an overall failure at or before time step L can be bounded by

Pr[overall failure at or before time step L] < L d C k^{-β} .

Notes

1 This is achieved by encoding k information bits into n > k bits, transmitting these n bits through the channel, receiving n (possibly corrupted) bits, performing error correction and finally decoding the n bits into the original k bits.

2 The overall state is an nd binary vector that represents kd bits of information. The n redundant systems "perform correctly without an overall failure for L time steps" if their overall state at time step T (0 ≤ T ≤ L) is within the set of nd vectors that correspond to the actual kd bits of information at that particular time step. In other words, if a fault-free (iterative) decoder were available, one would be able to obtain the correct states of the k underlying systems.

References

Avizienis, A. (1981). Fault-tolerance by means of external monitoring of computer systems. In Proceedings of the 1981 National Computer Conference, pages 27-40.

Bhattacharyya, A. (1983). On a novel approach of fault detection in an easily testable sequential machine with extra inputs and extra outputs. IEEE Transactions on Computers, 32(3):323-325.

Gacs, P. (1986). Reliable computation with cellular automata. Journal of Computer and System Sciences, 32(2):15-78.

Gallager, R. G. (1963). Low-Density Parity Check Codes. MIT Press, Cambridge, Massachusetts.

Gallager, R. G. (1968). Information Theory and Reliable Communication. John Wiley & Sons, New York.
Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.

Hadjicostis, C. N. (2000). Fault-tolerant dynamic systems. In Proceedings of ISIT 2000, the Int. Symp. on Information Theory, page 444.

Hadjicostis, C. N. and Verghese, G. C. (1999). Fault-tolerant linear finite state machines. In Proceedings of the 6th IEEE Int. Conf. on Electronics, Circuits and Systems, pages 1085-1088.

Iyengar, V. S. and Kinney, L. L. (1985). Concurrent fault detection in microprogrammed control units. IEEE Transactions on Computers, 34(9):810-821.

Johnson, B. (1989). Design and Analysis of Fault-Tolerant Digital Systems. Addison-Wesley, Reading, Massachusetts.

Larsen, R. W. and Reed, I. S. (1972). Redundancy by coding versus redundancy by replication for failure-tolerant sequential circuits. IEEE Transactions on Computers, 21(2):130-137.

Leveugle, R., Koren, Z., Koren, I., Saucier, G., and Wehn, N. (1994). The Hyeti defect tolerant microprocessor: A practical experiment and its cost-effectiveness analysis. IEEE Transactions on Computers, 43(12):1398-1406.

Leveugle, R. and Saucier, G. (1990). Optimized synthesis of concurrently checked controllers. IEEE Transactions on Computers, 39(4):419-425.

Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1991). A methodology for designing optimal self-checking sequential circuits. In Proceedings of the Int. Conf. VLSI Design, pages 283-291. IEEE CS Press.

Parekhji, R. A., Venkatesh, G., and Sherlekar, S. D. (1995). Concurrent error detection using monitoring machines. IEEE Design and Test of Computers, 12(3):24-32.

Pippenger, N. (1990). Developments in the synthesis of reliable organisms from unreliable components. In Proceedings of Symposia in Pure Mathematics, volume 50, pages 311-324.

Pradhan, D. K. (1996). Fault-Tolerant Computer System Design.
Prentice Hall, Englewood Cliffs, New Jersey.

Robinson, S. H. and Shen, J. P. (1992). Direct methods for synthesis of self-monitoring state machines. In Proceedings of the 22nd Fault-Tolerant Computing Symp., pages 306-315. IEEE CS Press.

Siewiorek, D. and Swarz, R. (1998). Reliable Computer Systems: Design and Evaluation. A.K. Peters.

Sipser, M. and Spielman, D. A. (1996). Expander codes. IEEE Transactions on Information Theory, 42(6):1710-1722.

Spielman, D. A. (1996a). Highly fault-tolerant parallel computation. In Proceedings of the Annual Symp. on Foundations of Computer Science, volume 37, pages 154-160.

Spielman, D. A. (1996b). Linear-time encodable and decodable error-correcting codes. IEEE Transactions on Information Theory, 42(6):1723-1731.

Taylor, M. G. (1968a). Reliable computation in computing systems designed from unreliable components. The Bell System Technical Journal, 47(10):2339-2366.

Taylor, M. G. (1968b). Reliable information storage in memories designed from unreliable components. The Bell System Technical Journal, 47(10):2299-2337.

Wang, G. X. and Redinbo, G. R. (1984). Probability of state transition errors in a finite state machine containing soft failures. IEEE Transactions on Computers, 33(3):269-277.

Chapter 8

CODING APPROACHES FOR FAULT DETECTION AND IDENTIFICATION IN DISCRETE EVENT SYSTEMS

1 INTRODUCTION

This chapter applies coding techniques in the context of detecting and identifying faults in complex discrete event systems (DES's) that can be modeled as Petri nets [Hadjicostis, 1999; Hadjicostis and Verghese, 1999]. The approach is based on replacing the Petri net model of a given DES with a redundant Petri net model in a way that preserves the state, evolution and properties of the original system in some encoded form. This redundant Petri net model enables straightforward fault detection and identification based on simple parity checks that are used to verify the validity of artificially-imposed invariant conditions.
Criteria and methods for designing redundant Petri net models that achieve the desired objective while minimizing the cost associated with them (e.g., by minimizing the number of sensors or communication links) are not pursued here, but several examples illustrate how such problems can be approached.

In many ways, the development in this chapter parallels the discussion on fault-tolerant redundant implementations in Chapters 5 and 6. The main difference is in terms of the underlying assumptions/constraints and the fault model that is used. In particular, in the context of fault diagnosis, the objective is to interpret activity/status information in a way that facilitates fault detection and identification. In most cases, the system implementation cannot be changed and the fault diagnosis scheme does not have any flexibility in the choice of sensor allocation or sensor measurements. Thus, an effective diagnoser is one that is able to handle the available sensory data and determine (with a reasonable delay) what faults, if any, have taken place; the diagnoser is commonly assumed to be fault-free.

The usual approach in constructing a diagnoser is to locate invariant properties of the given system, a subset of which is violated soon after a particular fault takes place. Then, by monitoring the activity in the system, one can detect violations of such invariant properties (which indicates the presence of a fault) and correlate them with a unique fault in the system (which then constitutes fault identification).
The task becomes challenging because of potential observability limitations (in terms of the inputs, states or outputs that are observed [Cieslak et al., 1988]) and various other requirements (such as detection/communication delays [Debouk et al., 2000], sensor allocation limitations [Debouk et al., 1999], distributivity/decentralizability constraints [Aghasaryan et al., 1998; Debouk et al., 1998], or the sheer size of the diagnoser). There is a large volume of related work, especially within the systems/control [Gertler, 1998] and computer engineering communities [Tinghuai, 1992]. More relevant to this chapter are previous works on fault diagnosis in large-scale discrete event systems. These include the work in [Sampath et al., 1995; Sampath et al., 1998], which studies fault diagnosis in finite-state machines using a language-theoretic approach, the work in [Valette et al., 1989; Cardoso et al., 1995], which models the behavior of a discrete event system as a Petri net and develops state estimation techniques to perform fault diagnosis, and the work in [Pandalai and Holloway, 2000], which performs diagnosis based on timing relations between events. Also relevant are the methodologies for fault diagnosis in complex communication networks that appeared in [Bouloutas et al., 1992; Wang and Schwartz, 1993; Park and Chong, 1995; Aghasaryan et al., 1997a; Aghasaryan et al., 1997b; Aghasaryan et al., 1998]. The presentation in this chapter analyzes fault diagnosis schemes that result from two types of redundant Petri net implementations. (i) Separate redundant Petri net implementations retain the functionality of the original Petri net intact and use additional places and tokens in order to impose invariant conditions. (ii) Non-separate redundant Petri net implementations only need to retain the original Petri net functionality in some encoded form, allowing in this way additional flexibility in the design of diagnosis schemes.
As mentioned earlier, the schemes that result from both separate and non-separate redundant Petri net implementations are attractive because of their simplicity. They are also able to automatically point out the additional connections that are necessary and they may not require explicit acknowledgments from each activity. These issues are elaborated upon later on; making additional connections between coding theory and fault diagnosis is certainly a worthwhile future direction.

Figure 8.1. Petri net with three places and three transitions (the figure also gives the corresponding B+ and B- matrices).

2 PETRI NET MODELS OF DISCRETE EVENT SYSTEMS

Petri nets are a graphical and mathematical model for a variety of information and processing systems [Murata, 1989]. Due to their power and flexibility, Petri nets are particularly relevant to the study of concurrent, asynchronous, distributed, nondeterministic, and/or stochastic systems [Baccelli et al., 1992; Cassandras et al., 1995]. They are used to model manufacturing systems [Desrochers and Al-Jaar, 1994], communication protocols and other DES's [Cassandras, 1993]. A Petri net S is represented by a directed, bipartite graph with two kinds of nodes: places (denoted by {p1, p2, ..., pd} and drawn as circles) and transitions (denoted by {t1, t2, ..., tu} and drawn as rectangles). Weighted directed arcs connect transitions to places and vice-versa (but there are no connections from a place to a place or from a transition to a transition). The arc weights have to be nonnegative integers: b-ij denotes the integer weight of the arc from place pi to transition tj, and b+ij denotes the integer weight of the arc from transition tj to place pi. The graph shown in Figure 8.1 is an example of a Petri net with d = 3 and u = 3; its three places are denoted by p1, p2 and p3, and its three transitions by t1, t2 and t3 (arcs with zero weight are not drawn).
Depending on the system modeled by the Petri net, input places can be interpreted as preconditions, input data/signals, resources, or buffers; transitions can be regarded as events, computation steps, tasks, or processors; output places can represent postconditions, output data/signals, conclusions, or buffers. Each place functions as a token holder. Tokens are drawn as black dots and represent resources that are available at different parts of the system. The number of tokens in a place cannot be negative. At any given time instant t, the marking (state) of the Petri net is given by the number of tokens at each of its places; for the Petri net in the figure, the marking at time instant 0 is

qs[0] = [2 1 0]^T .

Transitions model events that cause the rearrangement, generation or disappearance of tokens. Transition tj is enabled (i.e., it is allowed to take place) only if each of its input places pi has at least b-ij tokens (where, as explained before, b-ij is the weight of the arc from place pi to transition tj). When transition tj takes place (transition tj is said to fire), it removes b-ij tokens from each input place pi and adds b+ij tokens to each output place pi. In the Petri net in Figure 8.1, transitions t1 and t2 are enabled but transition t3 is not. If transition t1 fires, it removes 2 tokens from its input place p1 and adds one token each to its output places p2 and p3; the corresponding state of the Petri net (at the next time instant) will be

qs[1] = [0 2 1]^T .

Let B- = [b-ij] (respectively B+ = [b+ij]) denote the d x u matrix with b-ij (respectively b+ij) at its ith row, jth column position. The state evolution of a Petri net can then be represented by the following equation:

qs[t+1] = qs[t] + (B+ - B-) x[t]          (8.1)
        = qs[t] + B x[t] ,                (8.2)

where B ≡ B+ - B-. (Figure 8.1 shows the corresponding B+ and B- for that Petri net.)
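The enabling rule and the state evolution of Eqs. (8.1)-(8.2) can be sketched in a few lines of Python. The B+ and B- matrices below are an illustrative reading of the net in Figure 8.1: only t1's arcs (2 tokens from p1, one each into p2 and p3) are stated explicitly in the text, and the remaining arcs are assumptions made for the sake of a runnable example.

```python
# Minimal Petri net simulator following Eq. (8.1): q[t+1] = q[t] + (B+ - B-) x[t].
# The matrices are illustrative (a hypothetical 3-place, 3-transition net),
# not necessarily those of Figure 8.1.

def enabled(q, B_minus, j):
    """Transition t_j is enabled iff q >= B-(:, j) element-wise."""
    return all(q[i] >= B_minus[i][j] for i in range(len(q)))

def fire(q, B_plus, B_minus, j):
    """Fire t_j: remove B-(i, j) tokens from each p_i, add B+(i, j)."""
    assert enabled(q, B_minus, j), "transition not enabled"
    return [q[i] - B_minus[i][j] + B_plus[i][j] for i in range(len(q))]

# t1 consumes 2 tokens from p1 and deposits one token each in p2 and p3;
# the arcs of t2 and t3 are assumed.
B_minus = [[2, 0, 0],
           [0, 1, 0],
           [0, 0, 1]]
B_plus  = [[0, 1, 1],
           [1, 0, 0],
           [1, 0, 0]]

q0 = [2, 1, 0]
print(enabled(q0, B_minus, 0))        # t1 enabled -> True
print(enabled(q0, B_minus, 2))        # t3 not enabled (p3 empty) -> False
print(fire(q0, B_plus, B_minus, 0))   # fire t1 -> [0, 2, 1]
```

The state update is just the integer-vector arithmetic of Eq. (8.1); the nonnegativity of the marking is guaranteed by the enabling check.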
The input x[t] in the above description is u-dimensional and is restricted to have exactly one nonzero entry with value "1." When

x[t] = xj = [0 ... 0 1 0 ... 0]^T

(the single "1" being at the jth position), transition tj fires (j is in {1, 2, ..., u}). Note that transition tj is enabled at time instant t if and only if

qs[t] ≥ B-(:, j) ,

where B-(:, j) denotes the jth column of B- and the inequality is taken element-wise. A pure Petri net is one in which no place serves as both an input and an output for the same transition (i.e., at most one of b+ij and b-ij can be nonzero). The Petri net in Figure 8.1 (with the indicated B+ and B- matrices) is a pure Petri net. Matrix B has integer entries and its transpose is known as the incidence matrix [Murata, 1989].

Figure 8.2. Cat-and-mouse maze.

Discrete event systems are often modeled as Petri nets. The following example presents the Petri net version of the popular "cat-and-mouse" problem, introduced by Ramadge and Wonham in the setting of supervisory control [Ramadge and Wonham, 1989] and described as a Petri net in [Yamalidou et al., 1996]. The authors in [Ramadge and Wonham, 1989; Yamalidou et al., 1996] were concerned with controlling the doors in the maze so that the two animals are never in the same room together. The task becomes challenging because only a subset of the doors may be controllable and because one may wish to allow maximum freedom in the movement of the two animals (while avoiding their entrance into the same room). Fault detection and identification in such systems is discussed later in this chapter.

EXAMPLE 8.1 A cat and a mouse circulate in the maze of Figure 8.2, with the cat moving from room to room through a set of unidirectional doors {c1, c2, ...
, c8} and the mouse through a set of unidirectional doors {m1, m2, ..., m6}. The Petri net model is based on two independent subnets, one dealing with the cat's position and movements and the other dealing with the mouse's position and movements. Each subnet has five places, corresponding to the five rooms in the maze. A token in a certain place indicates that the mouse (or the cat) is in the corresponding room. Transitions model the movements of the two animals between different rooms (as allowed by the structure of the maze in Figure 8.2). The subnet that deals with the mouse has a marking with five variables, exactly one of which has the value "1" (the rest are set to zero). The state evolution for this subnet is given by Eqs. (8.1) and (8.2) with

B+ = [ 0 0 1 0 1 0        B- = [ 1 0 0 1 0 0
       0 1 0 0 0 0               0 0 1 0 0 0
       0 0 0 1 0 0               0 1 0 0 0 0
       1 0 0 0 0 0               0 0 0 0 0 1
       0 0 0 0 0 1 ] ,           0 0 0 0 1 0 ] .

For example, state qs[t] = [0 1 0 0 0]^T indicates that at time instant t the mouse is in room 2. Transition t3 takes place when the mouse moves from room 2 to room 1 through door m3; this causes the new state to be qs[t+1] = [1 0 0 0 0]^T. In [Yamalidou et al., 1996] the two subnets associated with the mouse and cat movements were combined in an overall Petri net, which was then used to construct a linear controller that achieved the desired objective (i.e., disallowed the presence of the cat and the mouse in the same room while permitting maximum freedom in their movement within the maze).

3 FAULT MODELS FOR PETRI NETS

In complex DES's with Petri net models that have a large number of places and transitions, faults can manifest themselves in a variety of ways, including malfunctions due to hardware, software or communication components. It is therefore essential that systems are designed with the ability to detect, locate and correct these different types of faults. This section discusses the fault models that will be used in the forthcoming fault detection and identification schemes.
As mentioned in Chapter 1, the fault model needs to capture the underlying faults in an efficient manner. Since faults in DES's depend on the actual implementation (which varies considerably depending on the application), three different fault models are considered [Hadjicostis, 1999].

Transition Faults: Transition tj is said to fail to execute its postconditions if no tokens are deposited in its output places, even though the correct number of tokens has been consumed from the input places. Similarly, transition tj is said to fail to execute its preconditions if the tokens that are supposed to be consumed from the input places of the faulty transition are not consumed, even though the correct number of tokens is deposited at the corresponding output places. In terms of the state evolution in Eq. (8.1), a fault at transition tj corresponds to transition tj firing, but its preconditions, as given by the jth column of B- [denoted by B-(:, j)], or its postconditions, as given by B+(:, j), not taking effect.

Place Faults: Faults that corrupt the number of tokens in a single place of the Petri net are modeled by place faults. In terms of Eq. (8.1), a place fault at time instant t causes the value of a single variable in the d-dimensional state qs[t] to be incorrect. This fault model is suitable for Petri nets that represent computational systems or finite-state machines and has appeared in earlier work that dealt with fault detection in pure Petri nets [Sifakis, 1979; Silva and Velilla, 1985].

Additive Fault Model: Another approach is to model the error of each fault of interest in terms of its additive effect on the state qs[t] of the Petri net. In particular, if fault f(i) takes place at time instant t, then the corrupted state qf(i)[t] of the Petri net can be written as

qf(i)[t] = qs[t] + ef(i) ,

where ef(i) is the additive effect of fault f(i).
One can find a priori the additive effect ef(.) for each fault, so that the d x l error matrix

E = [ ef(1) | ef(2) | ... | ef(l) ]

(where l is the total number of faults) summarizes all that is necessary to detect and identify this set of faults in the given Petri net. Note that the additive fault model captures both transition and place faults:

• If transition tj fails to execute its preconditions, then etj = B-(:, j), whereas if tj fails to execute its postconditions, then etj = -B+(:, j).

• The corruption of the number of tokens in place pi is captured by the additive d-dimensional error array

epi = c x [0 ... 0 1 0 ... 0]^T ,

where c is an integer that denotes the number of tokens that have been added and where the only nonzero entry appears at the ith position.

The big advantage of the additive fault model is that it can easily capture the effects of multiple additively independent faults, that is, faults whose additive effect does not depend on whether other faults have taken place or not. For example, a precondition fault at transition tj and an independent fault at place pi will result in the additive error array etj + epi.

Figure 8.3. Petri net model of a distributed processing system.

EXAMPLE 8.2 Consider the Petri net in Figure 8.3, which could be the model of a distributed processing network or a flexible manufacturing system. Transition t2 models a process that takes as input two data packets (or two raw products) from place p2 and produces two different data packets (or intermediate products), one of which gets deposited in place p3 and one of which gets deposited in place p4. Processes t3 and t4 take input packets from places p3 and p4, respectively, and produce final data packets (or final products) in places p5 and p6, respectively. Note that processes t3 and t4 can take effect concurrently; once done, they return separate acknowledgments to places p5 and p6 so that process t2 can be enabled again.
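The construction of the error matrix E from the three fault types can be sketched directly; the B+/B- matrices below are the same hypothetical 3-place, 3-transition matrices used in the earlier sketch, not values given in the text.

```python
# Additive fault model: each fault contributes a fixed error vector to the
# state, and the d x l matrix E collects one column per fault of interest.
# B+ / B- are hypothetical 3-place, 3-transition matrices.

def column(M, j):
    return [row[j] for row in M]

B_plus  = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]
B_minus = [[2, 0, 0], [0, 1, 0], [0, 0, 1]]
d = 3

def e_pre(j):   # t_j fired, tokens not consumed: e = +B-(:, j)
    return column(B_minus, j)

def e_post(j):  # t_j fired, tokens not deposited: e = -B+(:, j)
    return [-x for x in column(B_plus, j)]

def e_place(i, c):  # c extra tokens in place p_i
    e = [0] * d
    e[i] = c
    return e

# Error matrix E for: precondition faults, postcondition faults, +1 place faults.
faults = [e_pre(j) for j in range(3)] + [e_post(j) for j in range(3)] \
       + [e_place(i, 1) for i in range(3)]
E = [[f[i] for f in faults] for i in range(d)]   # d x l, one column per fault

# Two additively independent faults simply add their error vectors:
q = [2, 1, 0]
corrupted = [a + b + c for a, b, c in zip(q, e_pre(0), e_place(1, 1))]
print(corrupted)  # q + B-(:,1) + e_p2 = [4, 2, 0]
```

Each column of E is fixed ahead of time, which is what makes table-driven detection and identification possible later on.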
Transition t1 models the external input to the system and is always enabled. The state of the Petri net shown in Figure 8.3 is given by qs[0] = [2 2 0 0 1 1]^T; only transitions t1 and t2 are enabled. If the process modeled by transition t2 fails to execute its postconditions, tokens will be removed from input places p2, p5 and p6, but no tokens will be deposited at output places p3 and p4. The erroneous state of the Petri net will be qf[1] = [2 0 0 0 0 0]^T.

Figure 8.4. Concurrent monitoring scheme using a separate Petri net implementation.

4 SEPARATE MONITORING SCHEMES

4.1 SEPARATE REDUNDANT PETRI NET IMPLEMENTATIONS

In separate monitoring schemes the original Petri net is enhanced by a separate monitor, whose state is updated according to transition activity in the original system [Hadjicostis, 1999; Hadjicostis and Verghese, 1999]. Faults can be concurrently detected and identified by comparing the state of the original system and the monitor (see Figure 8.4).
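The t2 fault of Example 8.2 can be worked through numerically; only the arcs of t2 are taken from the text (2 tokens from p2, one each from p5 and p6 in; one each into p3 and p4 out), and the rest of the net is omitted.

```python
# The t2 faults of Example 8.2, worked numerically.

q0 = [2, 2, 0, 0, 1, 1]          # [p1 .. p6]
b_minus_t2 = [0, 2, 0, 0, 1, 1]  # t2 consumes 2 from p2, 1 each from p5, p6
b_plus_t2  = [0, 0, 1, 1, 0, 0]  # t2 deposits 1 each into p3, p4

fault_free = [q - m + p for q, m, p in zip(q0, b_minus_t2, b_plus_t2)]
postcond_fault = [q - m for q, m in zip(q0, b_minus_t2)]   # deposits lost
precond_fault  = [q + p for q, p in zip(q0, b_plus_t2)]    # consumption lost

print(fault_free)      # [2, 0, 1, 1, 0, 0]
print(postcond_fault)  # [2, 0, 0, 0, 0, 0]
print(precond_fault)   # [2, 2, 1, 1, 1, 1]
```

The postcondition-fault state matches the erroneous state given in Example 8.2, and the precondition-fault state matches the one given in the continuation of the example below.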
A special case of this construction is when the monitor is a simulator of the original system, so that, given the same inputs, the monitor and the original Petri net are ideally in the same state; when this is not the case, a fault is detected. The main disadvantage of this approach is that it requires access to all activity in the original Petri net and it cannot easily handle multiple faults (e.g., in distributed DES's) and information that is incorrect or missing. What is studied in this section is an alternative that uses monitors of reduced size and is able to overcome some of these limitations.

If, instead, process t2 fails to execute its preconditions, then tokens will appear at the output places p3 and p4 but no tokens will be removed from the input places p2, p5 and p6. The erroneous state of the Petri net will be qf[1] = [2 2 1 1 1 1]^T. If process t2 executes correctly but there is a fault at place p4, then the resulting state will be of the form qf[1] = [2 0 1 1+c 0 0]^T (the number of tokens at place p4 is corrupted by c).

DEFINITION 8.1 A separate redundant implementation for Petri net S [with d places, u transitions, marking qs[.] and state evolution as in Eq. (8.1)] is a Petri net H (with η ≡ d + s places, s > 0, and u transitions) that has state evolution

qh[t+1] = qh[t] + [ B+ ] x[t] - [ B- ] x[t]          (8.3)
                  [ X+ ]        [ X- ]

and whose state is given by

qh[t] = [ Id ] qs[t] ≡ G qs[t]
        [ C  ]

for all time instants t (here [B+; X+] denotes the (d+s) x u matrix obtained by stacking X+ below B+). It is required that for any initial marking (state) qs[0] of S, Petri net H (with initial state qh[0] = G qs[0]) admits all firing transition sequences that are allowed in S (under initial state qs[0]).

Note that the functionality of Petri net S remains intact within the separate redundant Petri net implementation H. Since all valid states qh[t] in H have to lie within the column space of the encoding matrix G, there exists a parity check matrix P = [-C  Is] such that P qh[t] = 0 for all t (at least under fault-free conditions). Since H is a Petri net, the matrices X+ and X-, and the state qh[t] (for all t), have nonnegative integer entries. The following theorem characterizes separate redundant Petri net implementations.

THEOREM 8.1 Consider the setting described above.
Petri net H is a separate redundant implementation of Petri net S if and only if C is a matrix with nonnegative integer entries and

X+ = C B+ - D ,    X- = C B- - D ,

where D is any s x u matrix with nonnegative integer entries such that D ≤ MIN(C B+, C B-) (the operations ≤ and MIN are taken element-wise).

Proof: (=>) The state qh[0] = G qs[0] = [Id; C] qs[0] has nonnegative integer entries for all valid qs[0] (a valid qs[0] is any marking with nonnegative integer entries). For this to be true, a necessary (and sufficient) condition is that C is a matrix with nonnegative integer entries.

If the state evolution of the redundant Petri net in Eq. (8.3) is combined with the state evolution of the original Petri net in Eq. (8.1), one obtains

G qs[t+1] = qh[t+1] ,   i.e.,

[ Id ] qs[t+1] = [ Id ] qs[t] + [ B+ ] x[t] - [ B- ] x[t] .
[ C  ]           [ C  ]         [ X+ ]        [ X- ]

Since any transition tj can be enabled [e.g., by choosing qs[0] ≥ B-(:, j)], one concludes that

X+ - X- = C (B+ - B-) .

Without loss of generality, X+ can be set to C B+ - D and X- to C B- - D for some matrix D with integer entries. Petri net H has initial marking

qh[0] = [ qs[0]   ] ,
        [ C qs[0] ]

where qs[0] is any initial state for S. In order for H to admit all firing transition sequences that are allowed in S under initial state qs[0], one needs D to have nonnegative integer entries. This can be proved by contradiction: suppose D has a negative entry in its jth column; if qs[0] = B-(:, j), transition tj can be fired in S but cannot be fired in H because, in the offending entry,

C qs[0] = C B-(:, j) < C B-(:, j) - D(:, j) = X-(:, j) .

The requirement that D ≤ MIN(C B+, C B-) follows from X+ and X- being matrices with nonnegative integer entries.

(<=) The converse direction follows easily.
The only challenge is to show that if D has nonnegative entries, all transitions that are enabled in S at time instant t under state qs[t] are also enabled in H under state qh[t] = G qs[t]. To show this, note that if D has nonnegative entries, then

qs[t] ≥ B-(:, j)  =>  [ Id ] qs[t] ≥ [ Id ] B-(:, j)
                      [ C  ]         [ C  ]

                  =>  qh[t] ≥ [ B-(:, j)   ]
                              [ C B-(:, j) ]

                  =>  qh[t] ≥ [ B-(:, j)             ] .
                              [ C B-(:, j) - D(:, j) ]

(This is because matrices C, B+, B- and D have nonnegative integer entries.) One concludes that, if transition tj is enabled in S, then it is also enabled in H.  □

4.2 FAULT DETECTION AND IDENTIFICATION

The separate redundant implementations described in Theorem 8.1 can be used to monitor faults in the original Petri net [with state evolution as in Eq. (8.1)]. The invariant conditions imposed by a separate implementation can be checked by verifying that [-C  Is] qh[t] is equal to zero. The s additional places in H function as checkpoint places and can either be distributed in the Petri net system or be part of a centralized monitor.

Transition Faults: Suppose that at time instant t-1 transition tj fires (that is, x[t-1] = xj). If, due to a fault, the postconditions of transition tj are not executed, the erroneous state at time instant t will be

qf[t] = qh[t] - [ B+        ] xj ,
                [ C B+ - D  ]

where qh[t] is the state that would have been reached under fault-free conditions. The error syndrome can be calculated to be

P qf[t] = P qh[t] - P [ B+       ] xj
                      [ C B+ - D ]
        = 0 - (-C B+ + C B+ - D) xj
        = D xj ≡ D(:, j) .

If the preconditions of transition tj are not executed, the erroneous state will be

qf[t] = qh[t] + [ B-       ] xj ,
                [ C B- - D ]

and the error syndrome can be calculated similarly as

P qf[t] = -D xj ≡ -D(:, j) .

If all columns of D are distinct, one can detect and identify all single transition faults. In addition, depending on the sign, one can determine whether preconditions or postconditions were not executed.
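The Theorem 8.1 construction can be sketched mechanically: build X+ and X- from C and D, checking the nonnegativity constraint. The B+/B- matrices are the hypothetical ones used in the earlier sketches.

```python
# Sketch of the Theorem 8.1 construction: X+ = C B+ - D and X- = C B- - D
# must have nonnegative integer entries, i.e., D <= MIN(C B+, C B-).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matsub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def monitor_arcs(B_plus, B_minus, C, D):
    X_plus  = matsub(matmul(C, B_plus),  D)
    X_minus = matsub(matmul(C, B_minus), D)
    ok = all(x >= 0 for row in X_plus + X_minus for x in row)
    assert ok, "need D <= MIN(C B+, C B-) element-wise"
    return X_plus, X_minus

B_plus  = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # hypothetical, as before
B_minus = [[2, 0, 0], [0, 1, 0], [0, 0, 1]]
C, D = [[2, 2, 1]], [[3, 2, 1]]               # the choices of Example 8.3

X_plus, X_minus = monitor_arcs(B_plus, B_minus, C, D)
print(X_plus, X_minus)                        # [[0, 0, 1]] [[1, 0, 0]]
```

The returned X+ and X- give the weights of the arcs between the transitions of the original net and the added checkpoint places.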
Of course, given enough redundancy, one may be able to also identify multiple transition faults. (For example, if the effect of multiple transitions is additive, their occurrence could be identified if the columns of D were linearly independent.)

EXAMPLE 8.3 Consider the Petri net in Figure 8.1 with the indicated B+ and B- matrices. To concurrently detect and identify transition faults, a separate redundant implementation with one additional place will be used (s = 1). If D is set to [3 2 1] and C is set to [2 2 1], one obtains the separate redundant Petri net implementation of Figure 8.5 (the additional connections are shown with dotted lines). Since the columns of matrix D are distinct, identification of single transition faults is possible (the choice of C does not affect the syndromes of transition faults). The matrices of the redundant implementation are given by

B̄+ = [ B+       ] ,    B̄- = [ B-       ] .
     [ C B+ - D ]           [ C B- - D ]

The parity check that is performed concurrently by the checking mechanism (not shown in the figure) is given by

P qh[t] = [-C  I1] qh[t] = [-2 -2 -1 1] qh[t] .

If the parity check is -3 (respectively -2, -1), then transition t1 (respectively t2, t3) has failed to perform its preconditions. If the parity check is 3 (respectively 2, 1), then transition t1 (respectively t2, t3) has failed to perform its postconditions.

Figure 8.5. Example of a separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1.

The additional place p4 is part of the monitoring mechanism: it receives information about the activity in the original Petri net (e.g., which transitions fire) and appropriately updates its tokens. The linear checker detects and identifies faults by evaluating a checksum on the state of the overall (redundant) system.
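The syndromes of Example 8.3 can be checked numerically. Here s = 1, C = [2 2 1], D = [3 2 1] and the parity check row is P = [-2 -2 -1 1]; the Figure 8.1 matrices are the same hypothetical reading used in the earlier sketches.

```python
# Numerical check of the syndromes in Example 8.3.

B_plus  = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # hypothetical Figure 8.1 matrices
B_minus = [[2, 0, 0], [0, 1, 0], [0, 0, 1]]
C, D = [2, 2, 1], [3, 2, 1]
P = [-2, -2, -1, 1]                       # [-C  I1]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q0 = [2, 1, 0]
qh0 = q0 + [dot(C, q0)]                   # encoded state [q; C q]

# Redundant columns for t1: B_bar+(:,1) = [B+(:,1); C B+(:,1) - D(1)], etc.
bplus1  = [0, 1, 1] + [dot(C, [0, 1, 1]) - D[0]]   # [0, 1, 1, 0]
bminus1 = [2, 0, 0] + [dot(C, [2, 0, 0]) - D[0]]   # [2, 0, 0, 1]

qh1 = [a - m + p for a, m, p in zip(qh0, bminus1, bplus1)]
print(dot(P, qh1))                        # fault-free firing of t1: parity 0

# Postcondition fault at t1: the deposits (including the monitor's) are lost.
qf = [a - p for a, p in zip(qh1, bplus1)]
print(dot(P, qf))                         # syndrome D(:,1) = 3
```

The fault-free parity check is zero, and the postcondition fault at t1 produces exactly the syndrome 3 listed in the example.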
Note that the number of tokens in place p4 is updated regardless of the activity in transition t2. More generally, explicit connections from each transition to the monitoring mechanism may not be required.

Place Faults: If, due to a fault, the number of tokens in place pi is increased by c, the erroneous state will be given by

qf[t] = qh[t] + epi ,

where epi is an η-dimensional array with a unique nonzero entry at its ith position, i.e.,

epi = c x [0 ... 0 1 0 ... 0]^T .

In this case, the parity check will be

P qf[t] = P (qh[t] + epi) = P qh[t] + P epi = 0 + P epi = c P(:, i) .

If one chooses C so that the columns of P ≡ [-C  Is] are not rational multiples of each other, then one can detect and identify single place faults.

Figure 8.6. Example of a separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1.

EXAMPLE 8.4 In order to concurrently detect and identify single place faults in the Petri net of Figure 8.1, two additional places will be used (s = 2). Matrix C will be chosen to be

C = [ 2 1 1 ]
    [ 1 2 1 ]

(so that the columns of the parity check matrix P = [-C  I2] are not multiples of each other); the choice for D is not critical in the identification of place faults, and any D with nonnegative integer entries satisfying D ≤ MIN(C B+, C B-) will do. With these choices, one obtains the separate redundant implementation shown in Figure 8.6, with matrices

B̄+ = [ B+       ] ,    B̄- = [ B-       ] .
     [ C B+ - D ]           [ C B- - D ]

The parity check is performed through

[-C  I2] qh[t] = [ -2 -1 -1 1 0 ] qh[t] .
                 [ -1 -2 -1 0 1 ]

If the result is a multiple of [2; 1] (respectively [1; 2], [1; 1], [1; 0], [0; 1]), then the number of tokens in place p1 (respectively p2, p3, p4, p5) has been corrupted.

Through proper choice of C and D, one can perform detection and identification of both place and transition faults. Note that matrices C and D can be chosen almost independently.
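Place-fault identification in Example 8.4 reduces to matching the syndrome against the columns of P, which are pairwise non-parallel by construction. A sketch, using the P matrix of the example:

```python
# Place-fault identification: the syndrome c * P(:, i) is matched against
# the columns of P = [-C  I2].

P = [[-2, -1, -1, 1, 0],
     [-1, -2, -1, 0, 1]]

def parallel(u, v):
    # True iff u = c*v for some nonzero scalar c (2-D cross-product test)
    return u != [0, 0] and u[0] * v[1] - u[1] * v[0] == 0

def identify_place(syndrome):
    hits = [i for i in range(5) if parallel(syndrome, [P[0][i], P[1][i]])]
    return hits[0] if len(hits) == 1 else None   # None: no fault or ambiguous

# Two spurious tokens in p2 produce syndrome 2 * P(:,2) = [-2, -4]:
print(identify_place([-2, -4]))   # index 1, i.e., place p2
# Three spurious tokens in p5 produce syndrome 3 * P(:,5) = [0, 3]:
print(identify_place([0, 3]))     # index 4, i.e., place p5
```

Because no two columns of P are rational multiples of each other, the matching place index is unique whenever a single place fault has occurred.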
The only constraint between the two is that D ≤ MIN(C B+, C B-) (this constraint can sometimes be relaxed by multiplying matrix C by a large enough integer constant so that the possibilities for D are increased). The following example illustrates how one can detect and identify both place and transition faults.

EXAMPLE 8.5 Identification of a single transition fault or a single place fault (but not of both occurring together) can be achieved in the Petri net of Figure 8.1 by using two additional places (s = 2) and by choosing C and D appropriately. With these choices, the matrices of the redundant implementation are given by

B̄+ = [ B+       ] = [ 0 1 1        B̄- = [ B-       ] = [ 2 0 0
     [ C B+ - D ]     1 0 0             [ C B- - D ]     0 1 0
                      1 0 0                              0 0 1
                      0 1 0                              1 0 0
                      2 1 1 ] ,                          0 2 2 ] .

The parity check is performed through P qh[t], with P = [-C  I2]. If the parity check is a multiple of the column P(:, i), then there is a place fault in pi (for i = 1, ..., 5). If the parity check is D(:, 1) (respectively D(:, 2), D(:, 3)), then transition t1 (respectively t2, t3) has failed to perform its postconditions; if the parity check is -D(:, 1) (respectively -D(:, 2), -D(:, 3)), then transition t1 (respectively t2, t3) has failed to perform its preconditions. The resulting redundant Petri net implementation is shown in Figure 8.7.

Figure 8.7. Example of a separate redundant Petri net implementation that identifies single transition or single place faults in the Petri net of Figure 8.1.

It is instructive to consider the interpretation of the monitoring schemes shown in Figures 8.5, 8.6 and 8.7: (i) The s additional places, which could be part of a centralized monitor or could be distributed in the system, are connected to the transitions of the original Petri net, and the tokens associated with the additional connections and places act as simple acknowledgment messages. (ii) The weights of the additional connections are given by the matrices C B+ - D and C B- - D.
(iii) The choice of matrix C specifies detection and identification for place faults, whereas the choice of D determines detection and identification for transition faults. Coding techniques or simple linear algebra can be used to guide the choices of C and D. (iv) The checking mechanism (not shown in any of the figures in Examples 8.3, 8.4 and 8.5) detects and identifies faults by evaluating a linear checksum on the state of the original Petri net and the added monitor. The implicit assumption is that this checksum mechanism is fault-free.

Given fault detection and identification requirements, one has a variety of choices for matrices C and D. Therefore, depending on the underlying system, one could try to optimize certain variables of interest, such as the size of the monitor memory (number of additional places), the number of additional connections (from the original system to the additional places), or the number of tokens involved.

Note that, when restricted to pure Petri nets, one has no choice for D. More specifically, since the resulting Petri net has to be pure, matrix D has to be chosen so that D = MIN(C B+, C B-). The ability to detect transition faults may be lost in such cases. The work in [Sifakis, 1979; Silva and Velilla, 1985] studied this approach in pure Petri nets: given a pure Petri net S as in Eq. (8.2), one can construct a pure Petri net embedding in which the s additional places evolve according to

qc[t+1] = qc[t] + C B x[t]

for an s x d matrix C with nonnegative integer entries. The distance measure adopted in [Sifakis, 1979] suggests that the redundant Petri net should guard against place faults (corruption of the number of tokens in individual places).
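The purity restriction can be made concrete: with D = MIN(C B+, C B-) taken element-wise, at most one of X+(k, j) and X-(k, j) is nonzero, so no added place is both an input and an output of the same transition. A sketch, again using the hypothetical Figure 8.1 matrices from the earlier examples:

```python
# With the purity restriction, D is forced to MIN(C B+, C B-) element-wise,
# and the resulting monitor arcs X+ and X- never overlap.

def mat_elem(f, A, B):
    return [[f(a, b) for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

B_plus  = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]   # hypothetical, as before
B_minus = [[2, 0, 0], [0, 1, 0], [0, 0, 1]]
C = [[2, 2, 1]]

CBp, CBm = matmul(C, B_plus), matmul(C, B_minus)
D = mat_elem(min, CBp, CBm)                    # forced choice for a pure net
X_plus  = mat_elem(lambda a, b: a - b, CBp, D)
X_minus = mat_elem(lambda a, b: a - b, CBm, D)

pure = all(xp == 0 or xm == 0
           for rp, rm in zip(X_plus, X_minus) for xp, xm in zip(rp, rm))
print(D, X_plus, X_minus, pure)
```

The loss of freedom in D is what may cost transition-fault detectability: its columns are now dictated entirely by C and the original net.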
5 NON-SEPARATE MONITORING SCHEMES

5.1 NON-SEPARATE REDUNDANT PETRI NET IMPLEMENTATIONS

In the monitoring scheme of Figure 8.8, the state of a redundant Petri net implementation is a non-separate encoding of the state of the original Petri net (i.e., an encoding that does not immediately yield the state of the original system). As in the case of separate redundant implementations, the redundancy in the state of a non-separate redundant system will result in fault detection and identification schemes that operate by analyzing violations of the imposed state restrictions [Hadjicostis, 1999]. Notice that non-separate redundant implementations can only be used when the designer has flexibility in re-arranging the structure of the original DES.

Figure 8.8. Concurrent monitoring scheme using a non-separate Petri net implementation.

DEFINITION 8.2 Let S be a Petri net with d places, u transitions and state evolution as in Eq. (8.1); let qs[0] be any initial state qs[0] ≥ 0 and X = {x[0], x[1], ...} be any admissible firing transition sequence under this initial state. A Petri net H with η ≡ d + s places (where s > 0), u transitions, initial state qh[0] and state evolution equation

qh[t+1] = qh[t] + B̄+ x[t] - B̄- x[t] = qh[t] + (B̄+ - B̄-) x[t]          (8.4)

(where B̄+ and B̄- denote the arc-weight matrices of H) is a non-separate redundant implementation for S if it concurrently simulates S in the following sense: there exist

1. a state decoding mapping ℓ, and
2. a state encoding mapping g,

such that, for any initial state qs[0] in S and any admissible firing sequence X (for qs[0]),

qs[t] = ℓ(qh[t])  and  qh[t] = g(qs[t])

for all time instants t ≥ 0.
The non-separate redundant implementation H defined above is a Petri net that, after proper initialization [i.e., qh[0] = g(qs[0])], admits any firing transition sequence X that is admissible by the original Petri net S under initial state qs[0]. The state of the original Petri net at time instant t is specified by the state of the redundant implementation and vice-versa (through the mappings ℓ and g). Note that, regardless of the initial state qs[0] and the firing sequence X, the state qh[t] of the redundant implementation always lies in a subset of the redundant state space (namely the image of qs[.] under the mapping g).

The rest of this section focuses on a special class of non-separate redundant implementations, where encoding and decoding can be performed through linear operations. Specifically, a d x η decoding matrix L and an η x d encoding matrix G exist such that, under any initial state qs[0] and any admissible firing transition sequence X = {x[0], x[1], ...},

qs[t] = L qh[t]  and  qh[t] = G qs[t]

for all time instants t ≥ 0. The state evolution equation of a non-separate redundant Petri net implementation is then given by

qh[t+1] = qh[t] + B̄+ x[t] - B̄- x[t]          (8.5)
        = qh[t] + B̄ x[t] ,                   (8.6)

where B̄ ≡ B̄+ - B̄-. The additional structure that is enforced through the non-separate redundant Petri net implementation can be used for fault detection and identification.

In order to systematically construct redundant implementations, one needs to have a starting point. The following theorem characterizes non-separate redundant Petri net implementations in terms of a similarity transformation and a standard redundant Petri net.

THEOREM 8.2 A Petri net H with η ≡ d + s (s > 0) places, u transitions and state evolution as in Eqs. (8.5) and (8.6) is a redundant Petri net implementation for S [with state evolution as in Eqs.
(8.1) and (8.2)] only if it is similar (in the usual sense of change of basis in the state space, see Chapter 5) to a standard redundant Petri net implementation Hσ whose state evolution equation is given by

qσ[t + 1] = qσ[t] + [ B+ ; 0 ] x[t] − [ B− ; 0 ] x[t].   (8.7)

Here, B+, B− and B = B+ − B− are the matrices in Eqs. (8.1) and (8.2), and [M ; 0] denotes the matrix M stacked above an s × u zero block. Associated with the standard redundant Petri net implementation are the standard decoding matrix Lσ and the standard encoding matrix Gσ given by

Lσ = [Id 0],  Gσ = [Id ; 0].

Note that the standard redundant Petri net implementation is a pure Petri net.

Proof: Under fault-free conditions, LG qs[·] = L qh[·] = qs[·]. Since the initial state qs[0] can be any array with nonnegative integer entries, one concludes that LG = Id. In particular, L has full row rank, G has full column rank, and there exists an η × η invertible matrix T such that L T⁻¹ = [Id 0] and T G = [Id ; 0]. By employing the similarity transformation q′h[t] = T qh[t], one obtains a similar system H′ whose state evolution is given by ℬ′ = T ℬ and whose decoding and encoding matrices are given by

L′ = L T⁻¹ = [Id 0],  G′ = T G = [Id ; 0].

The state q′h[t] of system H′ at any time instant t is of the form q′h[t] = [ qs[t] ; q2[t] ]; by combining the state evolution equations of the original Petri net and the redundant system, it is seen that

[ qs[t] + B x[t] ; q2[t + 1] ] = [ qs[t] ; q2[t] ] + [ ℬ′1 ; ℬ′2 ] x[t].

The above equations hold for all initial conditions qs[0]; since all transitions are enabled under some appropriate initial condition qs[0], one concludes that ℬ′1 = B and ℬ′2 = 0. If system H′ is regarded as a pure Petri net, one sees that any transition enabled in S is also enabled in H′. Therefore, H′ is a redundant Petri net implementation. In fact, it is the standard redundant Petri net implementation Hσ with the decoding and encoding matrices presented in the theorem. The invariant conditions that are imposed by the added redundancy on the standard Petri net Hσ are summarized in the transformed coordinates by the parity check Pσ qσ[·], where Pσ = [0 Is] is the parity check matrix. □
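The characterization above can be checked numerically. The sketch below builds a redundant implementation from a hypothetical transformation T (all matrices are made up for illustration) and verifies that decoding recovers the original state while the parity check stays at zero under fault-free operation:

```python
import numpy as np

# Hypothetical original net (d = 2 places, u = 2 transitions).
Bp = np.array([[0, 1], [1, 0]])   # B+
Bm = np.array([[1, 0], [0, 1]])   # B-

# One redundant place (s = 1): T is invertible and T^{-1} has nonnegative
# integer entries in its first d columns.
Tinv = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [1, 1, 1]])
T = np.round(np.linalg.inv(Tinv)).astype(int)

L = np.hstack([np.eye(2, dtype=int), np.zeros((2, 1), dtype=int)]) @ T  # [I_d 0] T
G = Tinv[:, :2]                                                         # T^{-1} [I_d ; 0]
P = np.array([[0, 0, 1]]) @ T                                           # [0 I_s] T

assert (L @ G == np.eye(2, dtype=int)).all()   # LG = I_d
assert (P @ G == 0).all()                      # parity annihilates encoded states

Bpr, Bmr = G @ Bp, G @ Bm                      # redundant matrices (V = 0 here)
qs, qh = np.array([1, 0]), G @ np.array([1, 0])
for j in [0, 1, 0]:                            # an admissible firing sequence
    x = np.eye(2, dtype=int)[j]
    qs = qs + (Bp - Bm) @ x
    qh = qh + (Bpr - Bmr) @ x
    assert (L @ qh == qs).all()                # decoding recovers the state
    assert (P @ qh == 0).all()                 # invariant holds fault-free
```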
□

Theorem 8.2 provides a characterization of the class of non-separate redundant Petri net implementations for the given Petri net S and is a convenient starting point for systematically constructing such implementations. The following theorem completes this point of view.

THEOREM 8.3 Let S be a Petri net with d places, u transitions and state evolution as given in Eqs. (8.1) and (8.2). A Petri net H with η = d + s (s > 0) places, u transitions and state evolution as in Eqs. (8.5) and (8.6) is a redundant Petri net implementation of S if:

1. H is similar to a standard redundant Petri net implementation Hσ [with state evolution equation as in Eq. (8.7)] through an η × η invertible matrix T whose inverse T⁻¹ has nonnegative integer entries in its first d columns. The decoding, encoding and parity check matrices of the Petri net implementation H are then given by

L = [Id 0] T,  G = T⁻¹ [ Id ; 0 ],  P = [0 Is] T.

2. Matrices ℬ+ and ℬ− are given by

ℬ+ = T⁻¹ [ B+ ; 0 ] − V = G B+ − V,
ℬ− = T⁻¹ [ B− ; 0 ] − V = G B− − V,

where V is an η × u matrix with nonnegative integer entries. Note that V has to be chosen so that the entries of ℬ+ and ℬ− are nonnegative, i.e., V ≤ min(G B+, G B−) entrywise.

Proof: From Theorem 8.2, it is clear that any non-separate redundant Petri net implementation H as in Eqs. (8.5) and (8.6) can be obtained through an appropriate similarity transformation T qh[t] = qσ[t] of the standard redundant implementation Hσ in Eq. (8.7). In the process of constructing H from Hσ, one needs to ensure that H is a valid redundant Petri net implementation of S, i.e., one that meets the following requirements:

1. Given any initial condition qs[0] (i.e., given a d-dimensional array with nonnegative integer entries), the marking qh[t] has nonnegative integer entries.

2. Matrices ℬ+ and ℬ− have nonnegative integer entries.

3.
The set of transitions enabled in S at any time instant t is a subset of the set of transitions enabled in H (so that, under any initial condition qs[0], a firing transition sequence X that is admissible in S is also admissible in H).

The first condition has to be satisfied for any array qs[0] with nonnegative integer entries. It is therefore necessary and sufficient that the first d columns of T⁻¹ have nonnegative integer entries. This also ensures that the matrix difference

ℬ+ − ℬ− = T⁻¹ [ B+ − B− ; 0 ] = G(B+ − B−)

consists of integer entries. Without loss of generality, ℬ+ = G B+ − V and ℬ− = G B− − V, where the entries of V are integers chosen so that ℬ+ and ℬ− have nonnegative entries (i.e., so that V ≤ GB+ and V ≤ GB−).

To check the third condition, notice that tj is enabled in the original Petri net S at time instant t if and only if qs[t] ≥ B−(:, j). If V has nonnegative entries, then

qs[t] ≥ B− xj ⇒ G qs[t] ≥ GB− xj ⇒ qh[t] ≥ GB− xj ⇒ qh[t] ≥ (GB− − V) xj ⇒ qh[t] ≥ ℬ− xj,

where B−(:, j) = B− xj (recall that qs[t], B−, G and V have nonnegative integer entries). Therefore, if transition tj is enabled in the original Petri net S, it is also enabled in H [transition tj is enabled in H if and only if qh[t] ≥ ℬ−(:, j)]. It is not hard to see that it is also necessary for V to have nonnegative integer entries (otherwise one can find a counterexample by appropriately choosing the initial condition qs[0]). □

The following lemma is derived from Theorem 8.3 and simplifies the construction of non-separate redundant Petri net implementations.

LEMMA 8.1 Let S be a Petri net with d places, u transitions and state evolution as given in Eqs. (8.1) and (8.2). A Petri net H with η = d + s (s > 0) places, u transitions and state evolution as in Eqs.
(8.5) and (8.6) is a non-separate redundant implementation of S if matrices ℬ+ and ℬ− have nonnegative integer entries given by

ℬ+ = GB+ − V,  ℬ− = GB− − V,

where G is a full-column rank η × d matrix with nonnegative integer entries and V is an η × u matrix with nonnegative integer entries.

In cases where one has the flexibility to restructure the original Petri net, non-separate redundant Petri net implementations could offer potential advantages (e.g., they could use fewer tokens, connections, or places than separate implementations of the same order).

5.2 FAULT DETECTION AND IDENTIFICATION

The invariant conditions imposed by the non-separate redundant Petri net implementations in Theorem 8.3 can be checked by the parity matrix P = Pσ T = [0 Is] T. The following analysis of the fault detection and identification procedures is close to the development in Section 4.2.

Transition Faults: Suppose a non-separate redundant Petri net implementation is used to detect and identify transition faults. If transition tj fires at time instant t − 1 (i.e., x[t − 1] = xj) but fails to execute its postconditions, the erroneous state will be

qf[t] = qh[t] − (GB+ − V) xj,

where qh[t] is the state the Petri net would be in under fault-free conditions. The error syndrome can be calculated to be

P qf[t] = P{qh[t] − (GB+ − V) xj} = 0 − P(GB+ − V) xj = P V xj

(since PG = 0). If the preconditions of transition tj are not executed, the erroneous state will be

qf[t] = qh[t] + ℬ−(:, j) = qh[t] + (GB− − V) xj

and the error syndrome can be calculated similarly to be P qf[t] = −P V xj. If the columns of matrix PV are distinct, one can detect and identify all single transition faults.
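This syndrome test can be sketched numerically. All matrices below are hypothetical, chosen to satisfy Lemma 8.1 (G nonnegative with full column rank, 0 ≤ V ≤ min(GB+, GB−) entrywise, PG = 0, and distinct columns in PV):

```python
import numpy as np

# Hypothetical data satisfying Lemma 8.1: d = 2, u = 2, s = 2.
Bp = np.array([[0, 1], [1, 0]])                  # original B+
Bm = np.array([[1, 0], [0, 1]])                  # original B-
G  = np.array([[1, 0], [0, 1], [1, 1], [2, 1]])  # nonnegative, full column rank
P  = np.array([[-1, -1, 1, 0],
               [-2, -1, 0, 1]])                  # parity matrix, PG = 0
V  = np.array([[0, 0], [0, 0], [1, 0], [0, 1]])  # 0 <= V <= min(GB+, GB-)

Bpr, Bmr = G @ Bp - V, G @ Bm - V                # redundant incidence matrices
assert (P @ G == 0).all() and (Bpr >= 0).all() and (Bmr >= 0).all()

qh = G @ np.array([1, 0])                        # encoded fault-free state
x = np.array([1, 0])                             # transition t0 fires ...
qf = qh + (Bpr - Bmr) @ x - Bpr @ x              # ... but its postconditions fail

syndrome = P @ qf                                # equals the column PV(:, 0)
assert list(syndrome) == list((P @ V)[:, 0])     # identifies t0, postconditions
```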
Depending on the sign, one can decide whether postconditions or preconditions were not executed. Note that, unlike the separate case, the syndromes in the non-separate case are linear combinations of the columns of V.

EXAMPLE 8.6 The Petri net in Figure 8.9 is a non-separate redundant implementation of the Petri net in Figure 8.1. The additional place p4 is disconnected from the rest of the network and can be treated as a constant. The scheme can detect and identify single transition faults. The transformation matrix T⁻¹ and the matrix V that were used to obtain the non-separate implementation of Figure 8.9 were as follows. They result in the following matrices ℬ+ = GB+ − V and ℬ− = GB− − V.

Figure 8.9. Example of a non-separate redundant Petri net implementation that identifies single transition faults in the Petri net of Figure 8.1.

The decoding matrix L = Lσ T and the parity check matrix P = Pσ T are given by

L = [ 1  1  0 −1
      1  1  1 −2
     −3 −4 −2  7 ],   P = [ 1  2  1 −3 ].

If the parity check P qh[t] equals −3 (respectively −2, −1), then transition t1 (respectively t2, t3) has failed to execute its postconditions. If the check is 3 (respectively 2, 1), then transition t1 (respectively t2, t3) has failed to execute its preconditions.

Place Faults: Suppose one uses a non-separate redundant Petri net implementation to protect against place faults. If, due to a fault, the number of tokens in place pi is increased by c, the erroneous state will be given by

qf[t] = qh[t] + e_pi,

where e_pi is an η-dimensional array with a unique nonzero entry (of value c) at its ith position, i.e., e_pi = c × ei with ei the ith standard basis vector. The parity check will then be

P qf[t] = P qh[t] + P e_pi = 0 + P e_pi = c × P(:, i).

Single place faults can be detected if all columns of matrix P = [0 Is] T are nonzero. If the columns of P are not rational multiples of each other, then single place faults can be detected and identified.

EXAMPLE 8.7 Figure 8.10 shows a non-separate redundant implementation of the Petri net in Figure 8.1. The implementation uses two additional places (s = 2) and is able to identify single place faults. Note that place p4 essentially acts as a constant.
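The place-fault identification rule P qf[t] = c × P(:, i) can be sketched as follows; the parity matrix below is hypothetical, chosen so that its columns are nonzero and pairwise non-proportional:

```python
import numpy as np

# Hypothetical parity matrix with s = 2 check rows; its columns are nonzero
# and pairwise non-proportional, so single place faults are identifiable.
P = np.array([[-1, -1, 1, 0],
              [-2, -1, 0, 1]])

def identify_place_fault(syndrome):
    """Return (place index i, token change c) with syndrome == c * P[:, i]."""
    for i in range(P.shape[1]):
        col = P[:, i]
        nz = np.flatnonzero(col)[0]
        c, rem = divmod(syndrome[nz], col[nz])
        if rem == 0 and c != 0 and (syndrome == c * col).all():
            return i, int(c)
    return None

qh = np.array([1, 0, 1, 2])          # fault-free state satisfies P @ qh == 0
assert (P @ qh == 0).all()
qf = qh + np.array([0, 3, 0, 0])     # fault: place p2 gains 3 tokens
assert identify_place_fault(P @ qf) == (1, 3)
```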
Figure 8.10. Example of a non-separate redundant Petri net implementation that identifies single place faults in the Petri net of Figure 8.1.

The transformation matrix T⁻¹ and matrix V that were used, as well as matrices ℬ+ and ℬ−, are given by

T⁻¹ = 1 1 1 1 1
ℬ+ = 2 0 -1 1 1 1 1 2 1 0 1 1 0 0 0 1 0 1 1 3 0 0 1 0 1 0 1 0 0 0 1 0 0 0 1
V = 2 1 2 2 2
ℬ− = 0 1 0 0 0 1 0 1 1 1 0 1 1 1 0 1 0 0 0 0 0 1 0 2 0

The parity check matrix is

P = [0 I2] T = 1 -1 1 1 × [-3 4 -1 3 1 -5].

Note that the syndromes for transition and place faults in non-separate Petri net embeddings are more complicated than the syndromes in separate embeddings. At the same time, however, some additional flexibility is available and can potentially be used to construct embeddings that maintain the desired monitoring capabilities while minimizing certain quantities of interest (such as tokens, connections or places).

6 APPLICATIONS IN CONTROL

Discrete event systems (DES's) are usually monitored through separate mechanisms that take appropriate actions based on observations about the state and activity in the system. Control strategies (such as enabling or disabling transitions and external inputs) are often based on the Petri net that models the DES of interest [Yamalidou et al., 1996; Moody and Antsaklis, 1997; Moody and Antsaklis, 1998; Moody and Antsaklis, 2000]. This section uses redundant Petri net implementations to facilitate the task of the controller by monitoring active transitions and by identifying "illegal" transitions. One of the biggest advantages of this approach is that it can be combined with fault detection and identification, and can perform monitoring despite incomplete or erroneous information.

6.1 MONITORING ACTIVE TRANSITIONS

In order to time decisions appropriately, the controller of a DES may need to identify ongoing activity in the system.
For example, the controller may need to detect when two or more transitions have fired simultaneously, or it may have to identify all active transitions, i.e., transitions that have used all tokens at their input places but have not returned any tokens to their output places (using the terminology of the transition fault model in Section 3, one can say that active transitions are the ones that have not completed their postconditions). Employing the techniques of Section 4, one can construct separate redundant Petri net implementations that allow the controller to detect and locate active transitions by looking at the state of the redundant implementation. The following example illustrates this idea.

EXAMPLE 8.8 If one extra place is added to the Petri net of Figure 8.3 (s = 1) and if matrices C and D are given by

C = [1 1 3 2 3 1],  D = [2 5 3 1],

one obtains the separate redundant Petri net implementation shown in Figure 8.11: at any given time instant t, the controller of the redundant Petri net can determine if a transition is under execution by observing the overall state qh[t] of the system and by performing the parity check

[−C 1] qh[t] = [−1 −1 −3 −2 −3 −1 1] qh[t].

If the result is 2 (respectively 5, 3, 1), then transition t1 (respectively t2, t3, t4) is under execution. Note that in order to identify whether multiple transitions are under execution, one needs to use additional places (s > 1).

The additional place p7 in this example acts as a place-holder for special tokens (which in reality would correspond to acknowledgments): it receives

Figure 8.11. Example of a separate redundant Petri net implementation that enhances control in the Petri net of Figure 8.3.
2 (respectively 1) such tokens whenever transition t1 (respectively t4) is completed; it provides 1 token in order to enable transition t2. Explicit acknowledgments about the initiation and completion of each transition are avoided (for example, transition t3 does not need to send any acknowledgment). Furthermore, by adding enough extra places, the above monitoring scheme can be made robust to incomplete or erroneous information (as in the case when a certain place fails to submit the correct number of tokens).

6.2 DETECTING ILLEGAL TRANSITIONS

The occurrence of illegal activity in a DES can lead to complete control failure. This section uses separate redundant Petri net implementations to detect and identify illegal transitions in DES's. The system modeled by the Petri net is assumed to be "observable" through two different mechanisms: (i) place sensors that provide information about the number of tokens in each place, and (ii) transition sensors that indicate when each transition fires.

Suppose that the DES of interest is modeled by a Petri net with state evolution equation

qs[t + 1] = qs[t] + [ B+ | Bu+ ] x[t] − [ B− | Bu− ] x[t],

where matrices Bu+ and Bu− model the postconditions and preconditions of the illegal transitions, and where the input

x[t] = [ xl[t] ; xu[t] ]

is an input vector that captures both legal and illegal transitions (in xl[t] and xu[t], respectively). If a separate redundant implementation of the (legal⁴) part of the network is constructed, the overall system will have the following state evolution equation:

qh[t + 1] = qh[t] + [ B+ | Bu+ ; CB+ − D | 0 ] x[t] − [ B− | Bu− ; CB− − D | 0 ] x[t].

The goal then is to choose C and D so that illegal behavior can be detected. Information about the state of the upper part of the redundant implementation, with state evolution

qh1[t + 1] = qh1[t] + [ B+ | Bu+ ] x[t] − [ B− | Bu− ] x[t],

will be provided to the monitor by the place sensors.
Notice that illegal transitions change the number of tokens in these places, enabling the detection/identification of faults. The additional places, which evolve according to the equation

qh2[t + 1] = qh2[t] + [ CB+ − D | 0 ] x[t] − [ CB− − D | 0 ] x[t],

are internal to the controller and act only as test places, i.e., they cannot inhibit transitions and can have a negative number of tokens. Once the number of tokens in these test places is initialized appropriately (i.e., qh2[0] = C qh1[0]), the controller removes or adds tokens to these places based on which (legal) transitions take place. Therefore, the state of the bottom part of the system is controlled by the transition sensors.

If an illegal transition fires at time instant t, the illegal state qf[t] of the redundant implementation is given by

qf[t] = qh[t] + [ Bu+ ; 0 ] xu[t] − [ Bu− ; 0 ] xu[t] = qh[t] + [ Bu ; 0 ] xu[t],

where Bu = Bu+ − Bu− and xu[t] denotes an array with all zero entries, except a single entry with value "1" that indicates the illegal transition that fired. If the parity check P qf[t] is performed, one gets

P qf[t] = [−C Is] qf[t] = [−C Is] (qh[t] + [ Bu ; 0 ] xu[t]) = −C Bu xu[t].

Therefore, one can identify which illegal transition has fired if all columns of CBu are unique.

EXAMPLE 8.9 The controller of the maze in Figure 8.2 obtains information about the state of the system through a set of detectors. More specifically, each room is equipped with a "mouse sensor" that indicates whether the mouse is in that room. In addition, "door sensors" are activated whenever the mouse goes through the corresponding door. Suppose that, due to a bad choice of materials, the maze of Figure 8.2 is built in a way that allows the mouse to dig a tunnel connecting rooms 1 and 5 and a tunnel connecting rooms 1 and 4. This leads to the following set of illegal (i.
e., non-door) transitions in the network:

Bu = Bu+ − Bu− =
[  1  −1   1  −1
   0    0   0    0
   0    0   0    0
   0    0  −1   1
  −1   1   0    0 ],

with one column per illegal crossing direction of the two tunnels. In order to detect the existence of such tunnels, one can use a redundant Petri net implementation with one additional place (s = 1), C = [1 1 1 2 3] and D = [1 1 1 1 2 1]. The resulting redundant matrices are

ℬ+ = [ B+ | Bu+ ; CB+ − D | 0 ] =
0 0 0 1 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 1 0

ℬ− = [ B− | Bu− ; CB− − D | 0 ] =
1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0

The upper part of the network is observed through the place ("mouse") sensors. The number of tokens in the additional place is updated based on information from the transition ("door") sensors. More specifically, it receives two tokens when transition t3 fires; it loses one token each time transition t4 or t5 fires. The parity check is given by

[−1 −1 −1 −2 −3 1] qh[t]

and is zero if no illegal activity has taken place. It is 2 (respectively −2, 1, −1) if illegal transition Bu(:, 1) [respectively Bu(:, 2), Bu(:, 3), Bu(:, 4)] has taken place. Note that one can detect the existence of a tunnel in the maze using only three door sensors (since there are only three nonzero entries in matrices CB+ − D and CB− − D).

7 SUMMARY

This chapter constructed monitoring schemes for DES's based on their Petri net models. The technique systematically incorporates constraints into a given Petri net by looking at appropriate Petri net embeddings. The resulting monitor capitalizes on the imposed constraints in order to detect and identify faults via simple linear checks.
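The parity values quoted in Example 8.9 can be reproduced from C and a Bu matrix assembled (as an assumption here) from the two tunnels 1↔5 and 1↔4, one column per crossing direction:

```python
import numpy as np

# C is taken from Example 8.9; Bu is assembled here as an assumption, one
# column per illegal crossing direction of the tunnels 1<->5 and 1<->4.
C  = np.array([1, 1, 1, 2, 3])
Bu = np.array([[ 1, -1,  1, -1],   # room 1
               [ 0,  0,  0,  0],   # room 2
               [ 0,  0,  0,  0],   # room 3
               [ 0,  0, -1,  1],   # room 4
               [-1,  1,  0,  0]])  # room 5

syndromes = -C @ Bu                # parity value produced by each illegal firing
assert list(syndromes) == [2, -2, 1, -1]   # the values quoted in the text

def identify_illegal(parity_value):
    """Map a nonzero parity value back to the illegal transition (0-indexed)."""
    matches = np.flatnonzero(syndromes == parity_value)
    return int(matches[0]) if len(matches) == 1 else None

assert identify_illegal(-2) == 1   # Bu(:, 2) in the book's 1-based indexing
```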
Comparisons with existing fault diagnosis techniques in Petri net systems were made at various points during the analysis in this chapter; there still remain, however, a number of connections that need to be pursued further in order to fully understand the role of coding techniques in performing fault diagnosis. Applications of these techniques in the context of monitoring power system faults can be found in [Hadjicostis and Verghese, 2000].

Notes

1 Some of the constraints imposed in Theorem 8.1 can be dropped if one adopts the view in [Silva and Velilla, 1985] and treats additional places only as test places, i.e., allows them to have a negative number of tokens. In such a case, C and D can have negative entries.

2 One needs to ensure that for all pairs of columns of P there do not exist nonzero integers α, β such that α × P(:, i) = β × P(:, j), i ≠ j.

3 Multiplication of C by a constant does not help if, for some i and j, CB+(i, j) = 0 or CB−(i, j) = 0.

4 Since one has no control over the illegal part of the Petri net, the monitoring scheme cannot use any acknowledgments from this part of the net.

References

Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1997a). A Petri net approach to fault detection and diagnosis in distributed systems (Part I). In Proceedings of the 36th IEEE Conf. on Decision and Control, pages 720-725.

Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1997b). A Petri net approach to fault detection and diagnosis in distributed systems (Part II). In Proceedings of the 36th IEEE Conf. on Decision and Control, pages 726-731.

Aghasaryan, A., Fabre, E., Benveniste, A., Boubour, R., and Jard, C. (1998). Fault detection and diagnosis in distributed systems: an approach by partially stochastic Petri nets. Discrete Event Dynamic Systems: Theory and Applications, 8(2):203-231.

Baccelli, F., Cohen, G., Olsder, G. J., and Quadrat, J. P. (1992). Synchronization and Linearity. Wiley, New York.
Bouloutas, A., Hart, G. W., and Schwartz, M. (1992). Simple finite state fault detectors for communication networks. IEEE Transactions on Communications, 40(3):477-479.

Cardoso, J., Künzle, L. A., and Valette, R. (1995). Petri net based reasoning for the diagnosis of dynamic discrete event systems. In Proceedings of IFSA '95, the 6th Int. Fuzzy Systems Association World Congress, pages 333-336.

Cassandras, C. G. (1993). Discrete Event Systems. Aksen Associates, Boston.

Cassandras, C. G., Lafortune, S., and Olsder, G. J. (1995). Trends in Control: A European Perspective. Springer-Verlag, London.

Cieslak, R., Desclaux, C., Fawaz, A. S., and Varaiya, P. (1988). Supervisory control of discrete-event processes with partial observations. IEEE Transactions on Automatic Control, 33(3):249-260.

Debouk, R., Lafortune, S., and Teneketzis, D. (1998). Coordinated decentralized protocols for failure diagnosis of discrete event systems. In Proceedings of the 37th IEEE Conf. on Decision and Control, pages 3763-3768.

Debouk, R., Lafortune, S., and Teneketzis, D. (1999). On an optimization problem in sensor selection for failure diagnosis. In Proceedings of the 38th IEEE Conf. on Decision and Control, pages 4990-4995.

Debouk, R., Lafortune, S., and Teneketzis, D. (2000). On the effect of communication delays in failure diagnosis of decentralized discrete event systems. In Proceedings of the 39th IEEE Conf. on Decision and Control, pages 2245-2251.

Desrochers, A. A. and Al-Jaar, R. Y. (1994). Applications of Petri Nets in Manufacturing Systems. IEEE Press.

Gertler, J. (1998). Fault Detection and Diagnosis in Engineering Systems. Marcel Dekker, New York.

Hadjicostis, C. N. (1999). Coding Approaches to Fault Tolerance in Dynamic Systems. PhD thesis, EECS Department, Massachusetts Institute of Technology, Cambridge, Massachusetts.

Hadjicostis, C. N. and Verghese, G. C. (1999). Monitoring discrete event systems using Petri net embeddings.
In Application and Theory of Petri Nets 1999, number 1639 in Lecture Notes in Computer Science, pages 188-208.

Hadjicostis, C. N. and Verghese, G. C. (2000). Power system monitoring using Petri net embeddings. IEE Proceedings: Generation, Transmission, Distribution, 147(5):299-303.

Moody, J. O. and Antsaklis, P. J. (1997). Supervisory control using computationally efficient linear techniques: A tutorial introduction. In Proceedings of MED 1997, the 5th IEEE Mediterranean Conf. on Control and Systems.

Moody, J. O. and Antsaklis, P. J. (1998). Supervisory Control of Discrete Event Systems Using Petri Nets. Kluwer Academic Publishers, Boston.

Moody, J. O. and Antsaklis, P. J. (2000). Petri net supervisors for DES with uncontrollable and unobservable transitions. IEEE Transactions on Automatic Control, 45(3):462-476.

Murata, T. (1989). Petri nets: Properties, analysis and applications. Proceedings of the IEEE, 77(4):541-580.

Pandalai, D. N. and Holloway, L. E. (2000). Template languages for fault monitoring of timed discrete event processes. IEEE Transactions on Automatic Control, 45(5):868-882.

Park, Y. and Chong, E. K. P. (1995). Fault detection and identification in communication networks: a discrete event systems approach. In Proceedings of the 33rd Annual Allerton Conf. on Communication, Control, and Computing, pages 126-135.

Ramadge, P. J. and Wonham, W. M. (1989). The control of discrete event systems. Proceedings of the IEEE, 77(1):81-97.

Sampath, M., Lafortune, S., and Teneketzis, D. (1998). Active diagnosis of discrete-event systems. IEEE Transactions on Automatic Control, 43(7):908-929.

Sampath, M., Sengupta, R., Lafortune, S., Sinnamohideen, K., and Teneketzis, D. (1995). Diagnosability of discrete-event systems. IEEE Transactions on Automatic Control, 40(9):1555-1575.

Sifakis, J. (1979). Realization of fault-tolerant systems by coding Petri nets. Journal of Design Automation and Fault-Tolerant Computing, 3(2):93-107.

Silva, M. and Velilla, S.
(1985). Error detection and correction in Petri net models of discrete event control systems. In Proceedings of ISCAS 1985, the IEEE Int. Symp. on Circuits and Systems, pages 921-924.

Tinghuai, C. (1992). Fault Diagnosis and Fault Tolerance: A Systematic Approach to Special Topics. Springer-Verlag, Berlin.

Valette, R., Cardoso, J., and Dubois, D. (1989). Monitoring manufacturing systems by means of Petri nets with imprecise markings. In Proceedings of the IEEE Int. Symp. on Intelligent Control, pages 233-238.

Wang, C. and Schwartz, M. (1993). Fault detection with multiple observers. IEEE/ACM Transactions on Networking, 1(1):48-55.

Yamalidou, K., Moody, J., Lemmon, M., and Antsaklis, P. (1996). Feedback control of Petri nets based on place invariants. Automatica, 32(1):15-28.

Chapter 9

CONCLUDING REMARKS

1 SUMMARY

This book presented a unifying approach for constructing fault-tolerant combinational and dynamic systems. The underlying motive was to develop resource-efficient alternatives to modular redundancy by constructing appropriate redundant system embeddings. These embeddings preserve the functionality of the original system and are designed in a way that imposes constraints on the set of outputs/states that are reachable under fault-free conditions. Violations of these constraints can then be used by an external mechanism to detect and correct errors. The faults that cause the errors could be due to hardware malfunctions, communication faults, incorrect initialization, and so forth. The book systematically studied this two-stage approach to fault tolerance and demonstrated its potential and effectiveness for both combinational and dynamic systems.

Combinational systems were studied first by reviewing von Neumann's ground-breaking approach in Chapter 2.
For combinational systems that perform computations with algebraic structure, Chapter 3 showed that algebraic constructions (and, in particular, algebraic injective homomorphisms) can greatly facilitate fault tolerance. Among other results, it was shown that the development of parity-type protection schemes for computations with an underlying group or semigroup structure can be posed and solved as an algebraic problem.

In the case of dynamic systems, which was studied next, a couple of additional, important issues were identified: (i) redundant dynamics provide flexibility that can be used to efficiently/reliably enforce state constraints (for example, in order to build redundant implementations that require less hardware); (ii) error propagation complicates the task of maintaining correctness during the operation of a dynamic system, particularly over long time intervals. This raises questions regarding not only the cost but also the feasibility of constructing reliable dynamic systems exclusively out of unreliable components.

Assuming fault-free error correction, the overarching goal of Chapters 4-6 was to systematically develop alternatives to modular redundancy. It was shown that, under a particular error detection/correction scheme, a number of redundant implementations is possible. A precise characterization of these different redundant implementations was obtained for a variety of dynamic systems. This resulted in diverse schemes for fault tolerance that included embeddings based on algebraic homomorphisms (see Chapter 4), non-concurrent checking schemes (see Chapters 4 and 5), reconfiguration methodologies (see Chapter 5) and redundant implementations that require less hardware (see Chapter 6).

Chapter 7 relaxed the assumption that the error-correcting mechanism be fault-free. It considered dynamic systems that suffer transient faults in the state transition mechanism and in the error-correcting mechanism.
Due to the dynamic nature of these systems, transient faults in the error-correcting mechanism propagate in time, resulting in a serious increase in the probability of overall failure. In order to handle error propagation effectively, modular redundancy schemes that use multiple system replicas and voters were studied. It was shown that, by increasing the amount of redundancy, one can in principle construct redundant implementations that operate under a specified (low) level of failure probability for any finite time interval. Furthermore, for the case of unreliable linear finite-state machines (LFSM's), low-complexity error-correcting codes can be used to obtain interconnections of identical LFSM's that operate in parallel on distinct input sequences, fail with arbitrarily low probability during a finite time interval, and require only a constant amount of redundancy per machine.

Chapter 8 explored similar ideas in the context of fault diagnosis in discrete event systems that are modeled by Petri nets. More specifically, by employing embeddings similar to the ones developed in Chapters 4-6, one can obtain monitoring schemes for complex networked systems, such as manufacturing systems, communication protocols or power systems. The trade-offs and objectives involved in fault diagnosis, however, can be quite different. For example, the objective may be to avoid complicated reachability analysis, or to minimize the size of the monitor, or to construct monitoring schemes that require minimal communication overhead. The resulting methodologies are simple and allow easy specification of additional places, connections and weights, so that detection/identification of both transition and place faults can be verified by weighted checksums on the overall state of the redundant Petri net. In addition, the monitoring schemes can be designed to perform reliably despite erroneous/incomplete information.
2 FUTURE RESEARCH DIRECTIONS

There are many important directions for future research in this area. Perhaps the most exciting one is to explore how techniques for fault tolerance can enable innovative, possibly less expensive, manufacturing technologies and how they can lead to novel computational architectures. In particular, one prospect is to build reliable systems out of presently unreliable technologies (such as quantum or molecular computers) by developing appropriate coding protection schemes. Another prospect is to apply fault-tolerance techniques in silicon-based systems to increase speed or power-efficiency [Shanbhag, 1997].

A number of related open questions pertain to the development of fault-tolerant implementations that allow faults in the error-correcting mechanism. For example, the encoding techniques in Chapter 7 could potentially be generalized to group machines or other algebraic machines. In addition, different (easily decodable) coding schemes could be used for simultaneously protecting parallel simulations of a given system. There are also interesting theoretical questions regarding how one can define the computational capacity of unreliable LFSM's and, more generally, finite-state machines. Since one concern about the approach in Chapter 7 is the increasing number of connections, it may be worthwhile to explore how one can design dynamic systems that limit the number of connections to neighboring elements (much like Gacs' approach in [Gacs, 1986]).

The two-stage approach for fault tolerance that was studied in this book operates under the premise that the code (constraints) enforced on the state of the redundant implementation is time-independent. This implies that the error-correcting mechanism has no memory, and it would be interesting to investigate the applicability of more general approaches.
For example, instead of using block codes, one could try convolutional codes to protect LFSM's (some related work has appeared in [Redinbo, 1987; Holmquist and Kinney, 1991]). This approach seems promising since convolutional codes can also be decoded at low cost and appear suitable for a dynamic system setting (see, for example, the work in [Rosenthal and York, 1999]). In addition, using error-correcting mechanisms with memory may lead to reduced hardware complexity in these fault-tolerant implementations. More generally, one can develop a "behavioral" approach to fault tolerance, where system behaviors (i.e., state trajectories) are associated with fault-free and faulty systems [Antoulas and Willems, 1993]. Applying these ideas in specific contexts (e.g., in linear filters for digital signal processing applications or in linear systems over groups [Fagnani and Zampieri, 1996]) can help in the systematic study of optimization criteria (e.g., the minimization of redundant hardware) and in the development of efficient reconfiguration schemes (e.g., for handling permanent faults in integrated circuits). One can also study how these ideas generalize to nonlinear and/or time-varying systems.

There are a number of future extensions that relate to fault diagnosis in discrete event systems. These include the development of resource-efficient hierarchical or distributed fault diagnosis schemes that are robust to uncertainty in the sensors or in the information communicated to the diagnoser. Also appealing is the explicit study of examples where a subset of the transitions is uncontrollable and/or unobservable (see, for example, [Moody and Antsaklis, 1997; Moody and Antsaklis, 1998; Moody and Antsaklis, 2000]). Another promising research direction is the application of these ideas to max-plus systems [Cuningham-Green, 1979; Cohen et al., 1989; Baccelli et al., 1992; Cassandras, 1993; Cassandras et al., 1995].
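To make the max-plus direction concrete: in a max-plus system the state evolves as x(k+1)_i = max_j (A_ij + x(k)_j). The sketch below (with matrices invented for illustration) replicates one state variable; the copies must agree in the fault-free case, which gives error detection, but because MAX has no inverse an erroneous contribution cannot be "subtracted out" the way it can with a conventional linear checksum, so correction is not directly available.

```python
import numpy as np

NEG_INF = -np.inf  # the "zero" element of the max-plus semiring

def maxplus_mul(A, x):
    """Max-plus matrix-vector product: y_i = max_j (A_ij + x_j)."""
    return np.max(A + x, axis=1)

# Hypothetical 2-state max-plus system (values are illustrative).
# Redundant implementation: state 2 replicates state 0, using the same
# arc weights but reading its own copy, so fault-free it tracks state 0.
A_red = np.array([[3.0,     7.0, NEG_INF],
                  [2.0,     4.0, NEG_INF],
                  [NEG_INF, 7.0, 3.0    ]])

x = np.array([0.0, 0.0, 0.0])
for _ in range(3):
    x = maxplus_mul(A_red, x)

# Error detection: the two copies must agree; a disagreement flags a fault.
assert x[2] == x[0]

# A transient fault in one copy is detectable ...
x_faulty = x + np.array([0.0, 0.0, 1.0])
assert x_faulty[2] != x_faulty[0]
# ... but there is no max-plus analogue of subtracting the error out,
# so this redundancy supports detection rather than correction.
```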
These systems are "linear" in the semifield of real numbers under the MAX (additive) and + (multiplicative) operations, and redundancy can be introduced in them in ways analogous to those for linear dynamic systems. The absence of an inverse for the MAX operation, however, forces one to consider issues related to error detection and robust performance rather than error correction. These ideas may be useful in building robust flow networks, real-time systems and scheduling algorithms.

References

Antoulas, A. C. and Willems, J. C. (1993). A behavioral approach to linear exact modeling. IEEE Transactions on Automatic Control, 38(12):1776-1802.

Baccelli, F., Cohen, G., Olsder, G. J., and Quadrat, J.-P. (1992). Synchronization and Linearity. Wiley, New York.

Cassandras, C. G. (1993). Discrete Event Systems. Aksen Associates, Boston.

Cassandras, C. G., Lafortune, S., and Olsder, G. J. (1995). Trends in Control: A European Perspective. Springer-Verlag, London.

Cohen, G., Moller, P., Quadrat, J.-P., and Viot, M. (1989). Algebraic tools for the performance evaluation of discrete event systems. Proceedings of the IEEE, 77(1):39-85.

Cuningham-Green, R. (1979). Minimax Algebra. Springer-Verlag, Berlin.

Fagnani, F. and Zampieri, S. (1996). Dynamical systems and convolutional codes over finite abelian groups. IEEE Transactions on Information Theory, 42(6):1892-1912.

Gacs, P. (1986). Reliable computation with cellular automata. Journal of Computer and System Sciences, 32(2):15-78.

Holmquist, L. P. and Kinney, L. L. (1991). Concurrent error detection in sequential circuits using convolutional codes. In Proceedings of the 9th Int. Symp. on Applied Algebra, Algebraic Algorithms and Error-Correcting Codes, pages 183-194. Springer-Verlag.

Moody, J. O. and Antsaklis, P. J. (1997). Supervisory control using computationally efficient linear techniques: A tutorial introduction. In Proceedings of MED 1997, the 5th IEEE Mediterranean Conf. on Control and Systems.

Moody, J. O. and Antsaklis, P. J. (1998). Supervisory Control of Discrete Event Systems Using Petri Nets. Kluwer Academic Publishers, Boston.

Moody, J. O. and Antsaklis, P. J. (2000). Petri net supervisors for DES with uncontrollable and unobservable transitions. IEEE Transactions on Automatic Control, 45(3):462-476.

Redinbo, G. R. (1987). Finite field fault-tolerant digital filtering architecture. IEEE Transactions on Computers, 36(10):1236-1242.

Rosenthal, J. and York, E. V. (1999). BCH convolutional codes. IEEE Transactions on Information Theory, 45(6):1833-1844.

Shanbhag, N. R. (1997). A mathematical basis for power-reduction in digital VLSI systems. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 44(11):935-951.

About the Author

Christoforos Hadjicostis is currently an Assistant Professor in the Department of Electrical and Computer Engineering and a Research Assistant Professor in the Coordinated Science Laboratory at the University of Illinois at Urbana-Champaign. He received S.B. degrees in Electrical Engineering in 1993, in Computer Science and Engineering in 1993 and in Mathematics in 1999, the M.Eng. degree in Electrical Engineering and Computer Science in 1995, and a Ph.D. in Electrical Engineering and Computer Science in 1999, all from the Massachusetts Institute of Technology, Cambridge, Massachusetts. Dr. Hadjicostis was awarded the Faculty Early Career Development (CAREER) Award from the National Science Foundation in 2001. While at MIT, he served as president of the MIT Chapter of HKN, received the Harold L. Hazen Teaching Award and the Ernst A. Guillemin Thesis Prize, and received fellowships from the National Semiconductor Corporation and the Grass Instrument Company. Dr. Hadjicostis' research interests include fault-tolerant computation in combinational and dynamic systems, fault management and control of complex systems, and coding and graph theory.
Index

Active transition, 170 Additive fault model, 42, 149 Algorithm-based fault tolerance, 2, 7, 34, 37 Arithmetic code, 33, 35 Associativity, 41 Autonomous machine, 68 Behavioral approach, 181 Binary operation, 41 Binary symmetric channel, 132 Boolean function, 22 gate, 22 Capacity, 126 channel, 132 computational, 132 Cellular automata, 117 Channel binary symmetric, 132 capacity, 132 crossover probability, 132 Checksum, 88, 104 Chip-kill, 5 Circuit combinational, 22 depth, 22 reliable, 26, 28 size, 22 Cluster states, 116 Combinational system, 3 Commutativity, 41 Computational capacity, 132 Concurrent error masking, 1 Congruence relation, 52 Convolutional encoder, 106 Coset, 45, 63 nonzero, 47 Decoding, 50 output, 36 Diagnoser, 143 Discrete event system, 143 max-plus, 181 Distributed voting, 115, 119 Dynamic system, 3 redundant implementation, 9-10 reliable state evolution, 7 Encoding input, 35, 133 matrix, 94 state, 116 Equivalence relation, 52 Error correction, 36, 45, 51, 103 conditions for single error, 51 fault-free, 34, 42, 81 multiple, 51 unreliable, 115-116 Error detection and correction, 87 non-concurrent, 92 periodic, 92 Error detection, 36, 45, 51, 103 conditions for single error, 51 fault-free, 81 multiple, 51 Error, 1 propagation, 12, 115 single-bit, 64, 67 Failure, 1, 171 overall, 118, 120, 130 Fault detection and identification, 154, 166 Fault diagnosis, 13, 143 distributed, 182 hierarchical, 182 Fault model, 148 additive, 42, 149 Fault tolerance, 1 Fault, 1 detection, 144 hardware, 64 identification, 144 model, 2 monitoring, 144 permanent, 1, 84 transient, 1, 84, 115 Fault-tolerant FFT, 34, 40 Fault-tolerant convolution, 34 Fault-tolerant integer addition, 36 Fault-tolerant linear operators, 40 Fault-tolerant matrix multiplication, 37 Fault-tolerant sorting networks, 40 Finite field, 112 Flip-flop, 116, 123 Gate 3-input, 27 u-input, 29 Boolean, 22 NAND, 27, 31 XNAND, 27-28 XOR, 99-101, 108, 123 unreliable, 22
Group machine, 62 Group, 41 abelian, 42 canonical surjective homomorphism, 49 coset, 45, 63 cyclic, 68 homomorphism, 45 inverse, 41 non-trivial subgroup, 62 normal subgroup, 62 simple, 63 subgroup, 45 surjective homomorphism, 53 Hamming code, 87-88, 93 Hamming distance, 75-76, 116 Illegal transition, 171 Incidence matrix, 147 Independent iterations, 124 Iterative decoding, 124 LTI dynamic system, 79 hardware implementation, 83 redundant dynamics, 82, 91 redundant implementation, 80, 83 signal flow graph, 83 standard redundant implementation, 82 state evolution, 79 Linear code, 79, 99, 103 Hamming code, 87 encoding matrix, 94 independent iterations, 124 low-density parity check code, 123 parity check, 81 single-error correction, 82 single-error detection, 82 Linear feedback shift register, 100 Linear finite-state machine, 99, 123, 127 autonomous, 110 classical canonical form, 101, 132 hardware implementation, 101 parallel instantiations, 127, 132 redundant dynamics, 104 redundant implementation, 102, 109 sequence enumerator, 100, 110 standard redundant implementation, 103 state evolution, 99 Loop-free interconnection, 22 Low-density parity check code, 123 Machine decomposition Krohn-Rhodes, 63 Zeiger, 74 coset leader, 62 series-parallel, 62 subgroup machine, 62 Machine, 61 algebraic, 61 autonomous, 67 group, 61 permutation-reset, 73 redundant implementation, 64 reset, 73 reset-identity, 73 semigroup, 61 Marking, 145 Modular redundancy, 4, 7, 33, 35, 93, 118 Monitor, 66-68, 151, 160 Monoid, 49 homomorphism, 50 Multiprocessor system, 34, 37 Overall failure, 1, 118, 120, 130 Parallel matrix multiplication, 37 Parity channel, 47 Parity check, 81, 103 Permutation-reset machine, 73 Petri net, 143-144 additive fault model, 149 fault detection and identification, 154, 166 fault model, 148 incidence matrix, 147 input place, 145 marking, 145 non-separate redundant implementation, 144, 162 output place, 145 place fault, 149 place, 145 separate redundant implementation, 144, 151
token, 145 transition fault, 149 transition, 145-146 Place fault, 149 Place, 145 fault, 149 input, 145 output, 145 RAID, 5 Reachability matrix, 95 Reconfiguration, 84 Redundant implementation LTI dynamic system, 80 algebraic machine, 64 group machine, 64 linear finite-state machine, 109 non-separate, 69, 75 semigroup machine, 73 separate, 66, 74 Reliable state evolution, 118 Reset machine, 73 Reset-identity machine, 73 Restoring organ, 21, 23 Self-checking module, 4 Semigroup, 49 abelian, 49 canonical surjective homomorphism, 53 homomorphism, 50 non-abelian, 49 Separate code for integer addition, 47, 54 for integer comparison, 55 for integer multiplication, 54 group, 47 semigroup, 52 Signal flow graph, 84 delay-free paths, 85, 88 factored state variables, 85 Similarity transformation, 80, 101 Stable memories, 116, 126 State transition fault, 4 Structured redundancy, 4, 6-7, 35, 41, 117 Supervisory control, 147 Surjective homomorphism, 49 TMR, 7, 35, 93-94 Tolerable noise, 27 Transition fault, 149 Transition, 145-146 active, 170 fault, 149 illegal, 171 Unreliable components, 5, 21, 115, 117 reliably, 21