Issues on the Design of Efficient Fail-Safe Fault Tolerance

Arshad Jhumka
Department of Computer Science, University of Warwick, Coventry, UK, CV4 7AL
arshad@dcs.warwick.ac.uk

Matthew Leeke
Department of Computer Science, University of Warwick, Coventry, UK, CV4 7AL
matt@dcs.warwick.ac.uk

Abstract

The design of a fault-tolerant program is known to be an inherently difficult task. Decisions taken during the design process will invariably have an impact on the efficiency of the resulting fault-tolerant program. In this paper, we focus on two such decisions, namely (i) the class of faults the program is to tolerate, and (ii) the variables that can be read and written. The impact these design issues have on the overall fault tolerance of the system needs to be well understood; failure to do so can lead to costly redesigns. For the problem of understanding the impact of fault classes on the efficiency of fail-safe fault tolerance, we show that, under the assumption of a general fault model, it is impossible to preserve the original behavior of the fault-intolerant program. For the second problem of read and write constraints on variables, we again show that it is impossible to preserve the original behavior of the fault-intolerant program. We analyze the reasons that lead to these impossibility results, and suggest possible ways of circumventing them.

1. Introduction

Computer systems are becoming increasingly pervasive, being deployed in consumer-oriented products such as mobile phones and PDAs, as well as in safety-critical systems such as car engine control and avionics. Such increasing pervasiveness has led to a corresponding increase in our reliance on these systems to continually provide correct services, in spite of external perturbations such as security intrusions and faults. In other words, we want such systems to be dependable [13].

There is no single approach for developing a fault-tolerant program (e.g., [2], [8], [15], [16], [11]). In [11], the authors analyzed the complexity of designing fail-safe fault tolerance (i.e., the efficient design of fail-safe fault tolerance), whereas, in this paper, by way of contrast, we analyze the problem of designing efficient fail-safe fault tolerance. We focus on transformational approaches, whereby an initially fault-intolerant program is transformed into a corresponding (fail-safe) fault-tolerant one [2], [8], [15]. For such approaches, it is very desirable that the properties of the initial fault-intolerant program are preserved in the corresponding fault-tolerant program in the absence of faults. In other words, it is desirable for the fault tolerance components to be transparent in the fault-tolerant program. This means that, in the absence of faults, the fault-tolerant program behaves as the fault-intolerant one (thus preserving its properties), while in the presence of faults, the fault-tolerant program handles the faults. However, there are design issues that may have an impact on whether these properties are preserved or not. In this paper, we consider two such issues, namely (i) the assumed fault model, and (ii) the read/write constraints.

We focus on two important properties of fault-tolerant programs, namely accuracy and completeness [8]. Accuracy is the ability of a program to avoid false positives (i.e., wrongly flagging the existence of an error), while completeness is the ability of the program to avoid false negatives (i.e., to detect every error). We refer to these two properties as the detection efficiency of the fault-tolerant program.
Whenever it is clear from the context, we will refer to detection efficiency simply as efficiency.

In the design of a fault-tolerant program, it is common practice to develop a fault model which the program is supposed to handle. In a formal sense, a fault model F affecting a program p can be seen as a program transformation [5]. Such a fault model can be factorized along two dimensions, namely (i) a local dimension, and (ii) a global dimension. Völzer [18] calls the local dimension an impact model, and the global dimension a rely specification. The impact model specifies the additional faulty behavior of the system, whereas the rely specification dictates the extent of the faulty behavior. For example, an impact model may specify that "nodes may fail by crashing", whereas the rely specification will specify that "no more than t nodes may crash". The impact model causes an "enlargement" of the system's behavior, while the rely specification constrains that "enlargement".

Distributed programs have read/write constraints imposed on them; in a similar manner, read/write constraints are imposed through encapsulation. This inability to read or write certain variables not only has an impact on the properties of the resulting fault tolerance, but also makes the complexity of designing fault-tolerant programs very high [11]. Thus, it becomes important to be able to determine the properties of the resulting fault-tolerant program.

Three commonly studied fault tolerance levels, namely (i) fail-safe fault tolerance, (ii) non-masking fault tolerance, and (iii) masking fault tolerance, can be designed through the addition of error detection mechanisms (a.k.a. detectors) and error recovery mechanisms (a.k.a. correctors) [2]. Specifically relevant for this paper is the result that it is both necessary and sufficient to add detectors to a program to ensure that the resulting program never violates safety in the presence of faults, i.e., to design fail-safe fault-tolerant programs [2].

The main goal of this paper is to analyze (i) the impact of an assumed fault model on the design of efficient fail-safe fault tolerance, and (ii) the impact of read/write restrictions on the design of efficient fail-safe fault tolerance.
Very often, to develop a fault-tolerant program, a programmer adopts a defensive programming style to ensure that the system never violates safety in the presence of faults, leading the program to trigger false positives [14]. These false positives may reduce the efficiency of the program, since recovery actions will then be triggered and the program spends more "time" recovering. Another problem is that of false negatives, which can seriously threaten the safety of the program in the presence of faults. These problems are not easily addressed, requiring extensive knowledge of the program and/or extensive experience on the part of the programmers [14]. The notions of accuracy and completeness are analogous to the definitions of accuracy and completeness of Chandra and Toueg [4].

In this paper, we first formalize the problem of adding (detection-)efficient fail-safe fault tolerance to an initially fault-intolerant program. We then show that, in the presence of a general fault model (as defined later in the paper), it is impossible to add efficient fail-safe fault tolerance. We also show that it is impossible to add efficient fail-safe fault tolerance when there are read/write constraints defined. Consequently, we study some of the possibilities for circumventing these impossibility results. We develop a framework that allows us to study and relate various existing research efforts aimed at circumventing these results. In so doing, we further identify several potential research areas.

The paper is structured as follows: In Section 2, we provide the formal underpinnings of the paper. In Section 3, we explain the role of detectors in fail-safe fault tolerance addition. In Section 4, we define the problem of adding efficient fail-safe fault tolerance to a fault-intolerant program, and study the complexity of solving the problem in the presence of a general fault model. Section 5 provides some potential solutions to variants of the efficient fail-safe fault tolerance addition problem. In Section 6, we analyze the problem of designing efficient fail-safe fault tolerance in the presence of read restrictions imposed on processes. In Section 7, we provide a necessary and sufficient condition to solve the problem. We conclude in Section 8.

2. Formal Preliminaries

In this section, we summarize the formal terminology that will be used throughout this paper. This work assumes an interleaved execution semantics, i.e., state transitions are atomic events and an execution is regarded as a linear sequence of states. We assume a shared-variable communication model, i.e., processes communicate with each other by writing data into memory locations accessible by the receiver. Syntactically, a program is represented as a set of guarded commands; semantically, the program is interpreted as a state transition system.

2.1. Programs

Program in Guarded Command Notation

A program p consists of a finite set of processes {p1, ..., pn}. Each process pi consists of a (non-empty) set of actions Ai and variables Vi. An action has the form

⟨guard⟩ → ⟨statement⟩

where the guard is a boolean expression over the program variables and the statement is either the empty statement or an instantaneous assignment to one or more variables. Each variable stores a value from a non-empty finite domain and is associated with a predefined set of initial values. A state of p is an assignment of values to the variables of p. The state space of p is the set of all possible value assignments to the variables of p. An action ac of p is enabled in a state s if the guard of ac evaluates to "true" in s. An action ac can be represented by a set of state pairs. We assume that actions are deterministic.
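To make the guarded-command notation concrete, the following sketch (an illustration of ours, not part of the original formalism) represents a small two-variable program as a list of (guard, statement) pairs and enumerates the transitions each action induces over its finite state space; all identifiers in it are hypothetical.

```python
from itertools import product

# A toy two-variable program in guarded command notation, represented as data.
# Each action is a (guard, statement) pair: the guard is a predicate over a
# state (here a dict of variable values), the statement maps a state to its
# successor. All names below (DOMAIN, VARS, actions, ...) are ours.
DOMAIN = range(3)
VARS = ("x", "y")

actions = [
    # x < 2 -> x := x + 1
    (lambda s: s["x"] < 2, lambda s: {**s, "x": s["x"] + 1}),
    # x = 2 and y < 2 -> y := y + 1
    (lambda s: s["x"] == 2 and s["y"] < 2, lambda s: {**s, "y": s["y"] + 1}),
]

# The state space: every assignment of domain values to the variables.
states = [dict(zip(VARS, v)) for v in product(DOMAIN, repeat=len(VARS))]

def induced_transitions(action, states):
    """State pairs (s, s') induced by one deterministic action."""
    guard, stmt = action
    return {(tuple(s.values()), tuple(stmt(s).values())) for s in states if guard(s)}

delta_p = set().union(*(induced_transitions(a, states) for a in actions))
print(sorted(delta_p))
```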
Program as State Transition System

To recall, an assignment of values to the variables of p is called a state of p, and the state space Sp of a program p is the set of all possible assignments of values to the variables of p. A state predicate of p (predicate for short) is a boolean expression over the state space of p. We write s(x) for the value of a predicate x evaluated in state s. A predicate x can alternatively be represented as the set of states in which it evaluates to true, i.e., x = {s : Sp | s(x) = true}. The set of initial states Ip of p is defined by the set of all possible assignments of initial values to the variables. Alternatively, we refer to the set of initial states of p as the initial condition of p.

In this model, a transition is a state pair. A computation of p is a weakly fair (finite or infinite) sequence of states s0 · s1 · ... such that s0 ∈ Ip and, for each j ≥ 0, sj+1 results from sj by executing a single action that is enabled in sj (the · operator in a sequence denotes concatenation). Weak fairness means that if a program action ac is continuously enabled along the states of an execution, then ac is eventually chosen to be executed. Weak fairness implies that a computation is maximal with respect to program actions, i.e., if the computation is finite then no program action is enabled in its final state. We say that state s occurs in a computation s0, s1, ... iff there exists an i such that s = si. Similarly, a transition (s, s′) occurs in a computation s0, s1, ... iff there exists an i such that s = si and s′ = si+1.

In this setting, a program can equivalently be represented as a state machine, i.e., a program is a tuple p = (Sp, Ip, δp), where Sp is the state space and Ip ⊆ Sp is the set of initial states. The state transition relation δp ⊆ Sp × Sp is defined by the set of actions as follows: every action ac implicitly defines a set of transitions which is added to δp. Transition (s, s′) ∈ δp iff there exists an action ac ∈ p that is enabled in state s and execution of its statement results in state s′. We say that ac induces these transitions. State s is called the start state and s′ the end state of the transition. A state s ∈ Sp is said to be reachable if there exists a computation c of p such that s occurs in c.

2.2. Read/Write Constraints

In this section, we identify how read/write constraints impact the transitions a process can take.

Write Restrictions: Given a transition (s0, s1), it is easy to determine which variables need to be written for the transition to take place. If there is a variable v such that the value of v in state s0 (denoted by v(s0)) differs from its value in state s1, i.e., v(s0) ≠ v(s1), then variable v has to be written for transition (s0, s1) to take place. The write restrictions then amount to checking that the transitions of a process only write to variables that the process is allowed to write. Denoting by wi the set of variables that process pi can write, pi cannot use the transitions

nw(wi) = {(s0, s1) : (∃t : t ∉ wi : t(s0) ≠ t(s1))}.

Read Restrictions: In general, all variables need to be read for a transition to take place. When read restrictions are imposed on a process for certain variables, transitions have to be grouped. Specifically, when a given transition is being considered for inclusion in or exclusion from the program, the transition cannot be considered on its own; rather, a set of transitions needs to be considered. This is best illustrated through an example. Consider a program with two processes pi and pj. Process pi has a variable i, while process pj has a variable j. Process pi (resp. pj) cannot read variable j (resp. i). Each variable has domain {0, 1}. Now, if process pi wants to include transition (⟨i = 0, j = 0⟩, ⟨i = 1, j = 0⟩), then it needs to include transition (⟨i = 0, j = 1⟩, ⟨i = 1, j = 1⟩) too, so that reading variable j becomes irrelevant from pi's point of view. If only one of the two transitions were to be included, then pi would need to read variable j to decide which transition to take. Denoting the set of variables that a process pj can read by rj, for a given transition (s0, s1) the set of transitions that needs to be considered together with (s0, s1) is given by

set(rj)(s0, s1) = {(s0′, s1′) : (∀x : x ∈ rj : x(s0) = x(s0′) ∧ x(s1) = x(s1′)) ∧ (∀x : x ∉ rj : x(s0′) = x(s1′) ∧ x(s0) = x(s1))}.

Later, we will show how the fact that sets of transitions (rather than individual transitions) need to be considered impacts the design of efficient fail-safe fault tolerance.
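The following sketch (again an illustration of ours) computes the nw(wi) and set(rj) sets defined above for the two-process example from the text, assuming states are simply tuples of variable values.

```python
from itertools import product

# Illustration (ours) of the nw(w_i) and set(r_j) definitions above, on the
# two-process example from the text: p_i owns i, p_j owns j, domains {0, 1},
# and neither process can read the other's variable.
VARS = ("i", "j")
DOMAIN = (0, 1)
states = list(product(DOMAIN, repeat=len(VARS)))        # states as (i, j) tuples

def val(s, x):
    return s[VARS.index(x)]

def nw(w):
    """Transitions a process with write set w cannot use (they write outside w)."""
    return {(s0, s1) for s0 in states for s1 in states
            if any(val(s0, x) != val(s1, x) for x in VARS if x not in w)}

def group(r, t):
    """set(r)(s0, s1): the transitions that must be included or excluded
    together with t = (s0, s1) by a process whose read set is r."""
    s0, s1 = t
    return {(u0, u1) for u0 in states for u1 in states
            if all(val(s0, x) == val(u0, x) and val(s1, x) == val(u1, x)
                   for x in r)
            and all(val(u0, x) == val(u1, x) and val(s0, x) == val(s1, x)
                    for x in VARS if x not in r)}

# p_i reads only i: including (<i=0,j=0>, <i=1,j=0>) forces the grouped
# transition (<i=0,j=1>, <i=1,j=1>) to be included as well.
print(group({"i"}, ((0, 0), (1, 0))))
```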
2.3. Specifications

A specification for a program p is a set of computations which is fusion-closed. A specification S is fusion-closed iff the following holds for computations α and β: if α = γ · s · σ ∈ S and β = τ · s · ξ ∈ S, then γ · s · ξ ∈ S and τ · s · σ ∈ S. A computation cp of p satisfies a specification S iff cp ∈ S; otherwise cp violates S. Program p satisfies a specification S iff every computation of p satisfies S. Intuitively, a fusion-closed specification allows a program to make decisions about future state transitions by looking at its current state only, i.e., the history of the state sequence is encoded within the state. Fusion-closed specifications are non-restrictive in the sense that every specification which is not fusion-closed can be transformed into an equivalent fusion-closed specification by adding history variables. It was further shown in [6] how this transformation can be done efficiently.

Alpern and Schneider [1] have shown that every specification can be written as the intersection of a safety specification and a liveness specification. A safety specification demands that "something bad never happens" [12]. Formally, it defines a set of "bad" finite computation prefixes that should not be found in any computation. Since we are mainly interested in detectors, we focus on safety specifications, and present a definition here.

Definition 1 (Safety specification): A specification S of a program p is a safety specification iff the following condition holds: for every computation σ that violates S, there exists a prefix α of σ such that for all state sequences β, α · β violates S.

The notion of a finite computation not being "bad", i.e., the possibility of extending it so that it remains in the specification, is captured by the definition of maintains.

Definition 2 (Maintains): Let p be a program, S be a specification and α be a finite computation of p. We say that α maintains S iff there exists a sequence of states β such that α · β ∈ S.

A safety specification can thus be represented by a set of computation prefixes that should not occur in any computation of the program, i.e., the program must prevent invalid prefixes from occurring. However, to ensure that no computation displays such a prefix, the program would need to keep track of the whole execution history so as to decide whether such a prefix is about to occur. Detecting such invalid prefixes becomes computationally expensive. However, if the specification is fusion-closed, then rather than keeping track of computation prefixes, the program only needs to inspect the current state before deciding whether to proceed. Formally, this translates into keeping track of only invalid transitions (or bad transitions), rather than invalid prefixes.

Definition 3 (Bad transition): Let p be a program, SSPEC be a safety specification and α be a finite computation of p. We say that transition (s, s′) of p is bad for SSPEC iff α · (s, s′) violates SSPEC.

It has been shown that any specification that is not fusion-closed can be transformed into an equivalent fusion-closed specification through the addition of history variables [10]. Calculating the bad transitions can be achieved in polynomial time in the size of the state space of the program. Gärtner and Jhumka [6] showed how to circumvent the requirement of fusion-closed specifications, thereby minimizing the expansion of the state space due to the fusion closure requirement. Also, note that the notion of satisfies deals with infinite computations, whereas maintains deals with computation prefixes.
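Assuming, for illustration, a fusion-closed safety specification given as a state predicate ("x + y never exceeds 3"), the following sketch (ours) shows how such a specification reduces to a set of bad transitions and how a finite computation can be checked against it; the predicate and all names are our assumptions.

```python
# A fusion-closed safety specification over states (x, y): "x + y never exceeds 3".
# For such a specification, a finite computation maintains safety iff none of its
# transitions is bad, so it suffices to enumerate the bad transitions.
DOMAIN = range(3)
states = [(x, y) for x in DOMAIN for y in DOMAIN]

def violates(s):
    x, y = s
    return x + y > 3

def bad_transitions(delta):
    """Transitions whose end state breaks the safety predicate while the start state did not."""
    return {(s, t) for (s, t) in delta if not violates(s) and violates(t)}

def maintains(computation, bad):
    """A finite computation maintains the specification iff it contains no bad transition."""
    return all((computation[k], computation[k + 1]) not in bad
               for k in range(len(computation) - 1))

delta = {(s, t) for s in states for t in states}          # all conceivable transitions
bad = bad_transitions(delta)
print(maintains([(0, 0), (1, 0), (2, 1)], bad))            # True: stays within x + y <= 3
print(maintains([(0, 0), (2, 1), (2, 2)], bad))            # False: the last step is bad
```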
2.4. Fault Models and Fault Tolerance

A fault model precisely describes the way in which components of the system may fail. Fault models have been categorised into different domains [17]: time faults and value faults. Traditional stopping faults cannot by themselves lead to a violation of safety: to violate a safety specification, a system must exhibit one of the disallowed computation prefixes. The standard value faults from practice (e.g., bit-flips, stuck-at faults), however, can directly or indirectly lead to a violation of safety. Formally, a fault model for a program defines a set of transitions for the program. Faults that can directly cause a violation of safety will be called strong faults, while those that cannot directly violate safety will be called weak faults. A general fault model consists of both a weak fault model and a strong fault model. We provide the definitions below.

Definition 4 (Weak fault model): A weak fault model F for program p and safety specification SSPEC is a set of actions over the variables of p that do not violate SSPEC, i.e., if transition (sj, sj+1) is a transition induced by F and s0, s1, ..., sj maintains SSPEC, then s0, s1, ..., sj, sj+1 also maintains SSPEC.

Definition 5 (Strong fault model): A strong fault model F for program p and safety specification SSPEC is a set of actions over the variables of p that violate SSPEC, i.e., if transition (sj, sj+1) is a transition induced by F and s0, s1, ..., sj maintains SSPEC, then s0, s1, ..., sj, sj+1 violates SSPEC.

We call the actions of F faulty actions (or faults). Actions (or transitions) of a weak fault model are weak faults, and those of a strong fault model are strong faults. A fault occurs if a faulty action (transition) is executed.

Definition 6 (General fault model): A general fault model F for program p and safety specification SSPEC is a set of actions over the variables of p that can be partitioned into a non-empty weak fault model and a non-empty strong fault model.

Unless specified otherwise, we denote a general fault model by F, its strong part by Fs and its weak part by Fw, such that F = Fs ∪ Fw and Fs ∩ Fw = ∅ for non-empty Fs and Fw. In this paper, we assume a general fault model.

Definition 7 (Computation in the presence of faults): A computation of p in the presence of F is a weakly p-fair sequence of states s0, s1, ... such that s0 is an initial state of p and, for each j ≥ 0, sj+1 results from sj by executing a program action of p or a fault action of F.

Note: By weakly p-fair, we mean that the actions of p are treated weakly fairly, but fault actions are not. Note also that faults do not cause a violation of the initial condition. Rephrased in the transition system view, a fault model adds a set of transitions to the transition relation of p. We denote the modified transition relation by δpF.

Definition 8 (Fail-safe fault tolerance): Let S be a specification, SSPEC be the smallest safety specification including S, and let F be a fault model. A program p is said to be fail-safe F-tolerant for specification S iff all computations of p in the presence of F satisfy SSPEC. We say that a program p is F-intolerant for SSPEC iff p satisfies SSPEC in the absence of F but violates SSPEC in the presence of F. We will also write fault-intolerant instead of F-intolerant for SSPEC if F and SSPEC are clear from the context. A state s ∈ Sp is reachable in the presence of faults F if there exists a computation c of p in the presence of F such that s occurs in c.
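As a small illustration of Definitions 4-6 (ours, with a hypothetical fault set and the same toy safety predicate used above), the following sketch partitions a set of fault transitions into its weak and strong parts.

```python
# Partition a set of fault transitions into weak faults (end state still safe)
# and strong faults (end state already violates safety), mirroring Defs. 4-6.
# The fault set below is hypothetical.
def violates(s):
    x, y = s
    return x + y > 3

fault_transitions = {
    ((0, 0), (2, 0)),   # value fault on x: still safe afterwards -> weak
    ((1, 0), (1, 2)),   # value fault on y: still safe afterwards -> weak
    ((2, 1), (2, 2)),   # bit-flip pushing x + y to 4 -> strong
}

F_w = {t for t in fault_transitions if not violates(t[1])}   # weak fault model
F_s = fault_transitions - F_w                                # strong fault model
assert F_w | F_s == fault_transitions and not (F_w & F_s)    # a partition, as in Def. 6
print("weak faults:", F_w)
print("strong faults:", F_s)
```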
3. Detectors: Role and Design

In this section, we briefly review the role of detectors in the design of fault tolerance, and subsequently review the basis underpinning their design. Arora and Kulkarni [2] showed that a class of program components called detectors is necessary and sufficient to establish fail-safe fault tolerance in the context of fusion-closed specifications. The main idea of the result is to use detectors to simply "halt" the program in a state where it is about to violate the safety specification, i.e., to "halt" the program in a safe state. An important prerequisite for this sufficiency result is that specifications are fusion-closed. Fusion-closed specifications allow a safety specification to be characterised as a set of disallowed "bad" transitions (instead of a set of disallowed computation prefixes).

Definition 9 (Bad transition in the presence of faults): Given a program p, a fault model F, and a safety specification SSPEC, a transition τ ∈ δpF is said to be bad for p in the presence of F for SSPEC if, for all computations σ of p in the presence of F, if τ occurs in σ then σ ∉ SSPEC.

Note that, under our fault model assumption, a bad transition can be either a program transition or a fault transition. Under such a fault model, for a program to remain safe, the program needs to avoid reaching a state where safety has already been violated. In other words, checks need to be performed before transitions are executed to ensure that the program does not end up in a safety-violating state. This translates into checking whether a transition can lead to a safety-violating state. The check is performed by evaluating a predicate which enables the correct transitions and disables the bad program transitions. Observe that fault transitions can occur at any time and cannot be disabled, unless fault avoidance techniques such as system redesign are used. The predicate is implemented using a program component called a detector, which is defined below (Def. 10).

Definition 10 (Detector for an action): Let SSPEC be a safety specification. An SSPEC-detector d monitoring program action ac of p is a state predicate of p such that executing ac in a state where d holds maintains SSPEC.

We will simply talk about detectors instead of SSPEC-detectors if the relevant safety specification is clear from the context. When a detector d refines the guard of an action ac of p, we say that we compose ac with d. We will also sometimes say that the detector d is located at location ac. We say that we compose p with d if there exists an action ac ∈ p such that ac is composed with d. We say that we compose a program p with a set D of detectors (denoted p[]D) iff for every d ∈ D there exists an action ac ∈ p such that ac is composed with d.

Formally, in the transition system view of a program p, a state s is reachable by p iff, starting from an initial state of p, there exists a computation which contains s using only transitions from δp; otherwise s is unreachable. Similarly, the notions of a state or transition being reachable in the presence of faults can be defined by referring to δpF. Using this terminology, composing a program p with detectors results in some transitions of p becoming unreachable in the presence of faults. Observe that bad transitions are only reachable in the presence of faults. As mentioned earlier, the design of detectors has usually been achieved through experience and intuition [14], [2].
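The following sketch (ours) illustrates Definition 10: a detector is a state predicate conjoined with an action's guard so that the action only executes when doing so maintains safety. The action, predicate and names are hypothetical, and this is only a minimal sketch of guard strengthening, not the Arora-Kulkarni construction itself.

```python
# Compose an action ac with an SSPEC-detector d (Def. 10): refine the guard so
# that ac only executes in states where doing so maintains the safety predicate.
def compose(action, detector):
    guard, stmt = action
    return (lambda s: guard(s) and detector(s), stmt)

def safe(s):
    x, y = s
    return x + y <= 3

# Hypothetical action: (x <= 1) -> x := x + 2 (may push x + y past the bound).
ac = (lambda s: s[0] <= 1, lambda s: (s[0] + 2, s[1]))

# Detector: executing ac from s maintains safety, i.e. the successor is safe.
d = lambda s: safe(ac[1](s))

ac_detected = compose(ac, d)
for s in [(0, 0), (1, 1), (1, 2)]:
    print(s, "enabled:", ac_detected[0](s))
# (1, 1) and (1, 2) are disabled: x := x + 2 would overshoot x + y <= 3.
```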
4. Addition of Efficient Fail-Safe Fault Tolerance (FSFT)

In this section, we first define the problem of adding efficient fail-safe fault tolerance to an initially fault-intolerant program.

Definition 11 (Addition of FSFT): Given a fault-intolerant program p, a general fault model F, and a safety specification SSPEC, design a program p′ such that:
E1. Every computation of p is a computation of p′ in the absence of F.
E2. Every computation of p′ is a computation of p in the absence of F.
E3. In the presence of F, p′ satisfies SSPEC.
If such a p′ exists, then we say that p′ solves the efficient fail-safe fault tolerance design problem.

The first two conditions characterise the efficiency issue that we address in this paper. Since the fault-intolerant program may have been tested for performance and high efficiency, conditions E1 and E2 ensure that, in the absence of faults, the fail-safe fault-tolerant program p′ retains the same high performance as p. A program p′ satisfying the above conditions is said to solve the efficient fail-safe fault tolerance addition problem. The first two conditions imply that the fault tolerance mechanisms are transparent, i.e., if no faults occur, then p′ behaves exactly as p and hence pays no "price". The third condition says that p′ extends p in being fail-safe fault-tolerant, which p is not.

We now present the first contribution of the paper.

Theorem 1 (Impossibility): Given a fault-intolerant program p, a general fault model F, and a safety specification SSPEC, it is impossible to design a program p′ such that p′ solves the efficient fail-safe fault tolerance addition problem.

Proof: Since it is always possible to design a fail-safe fault-tolerant program p′ by strengthening the guard of each action of p with the predicate "false", thus obtaining the empty program, the proof of impossibility is based on showing that it is impossible to devise a p′ that simultaneously satisfies the first two conditions (E1 and E2) and the third condition (E3).

Given: a fault-intolerant program p, a general fault model F, and a safety specification SSPEC.
Prove: there is no p′ that solves the efficient fail-safe fault tolerance addition problem.

The proof consists of two parts:
P1. If a program p′ satisfies the first two conditions of the efficient fail-safe fault tolerance addition problem (E1 and E2), then p′ cannot satisfy condition E3.
P2. If p′ satisfies E3, then it cannot satisfy E1 or E2.

We first prove part P1, by contradiction, using the following two assumptions:
1) Assume p′ satisfies E1 and E2.
2) Assume p′ satisfies E3.

We also make use of the following notation. Given a program T, the set of reachable states of T is denoted by Reachable(T) and is defined as Reachable(T) = {s : ST | s occurs in some computation of T}. We denote the set of states reachable by a program T in the presence of faults F by Reachable(T, F), defined as Reachable(T, F) = {s : ST | s occurs in some computation of T in the presence of F}.
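Before the proof, the Reachable(T) and Reachable(T, F) sets just defined can be illustrated by a small sketch (ours), which computes them as fixpoints of the transition relation with and without the fault transitions; the toy states and transitions are hypothetical.

```python
from collections import deque

def reachable(init, delta):
    """Reachable(T): states reachable from init using transitions in delta."""
    seen, frontier = set(init), deque(init)
    while frontier:
        s = frontier.popleft()
        for (u, v) in delta:
            if u == s and v not in seen:
                seen.add(v)
                frontier.append(v)
    return seen

# Toy transition relations (states are plain integers here).
init = {0}
delta_p = {(0, 1), (1, 2)}                 # program transitions
faults = {(2, 3)}                          # a fault transition
print(reachable(init, delta_p))            # Reachable(p)    -> {0, 1, 2}
print(reachable(init, delta_p | faults))   # Reachable(p, F) -> {0, 1, 2, 3}
```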
Proof of P1:
a.1 From assumption 1, Reachable(p) = Reachable(p′).
a.2 Since p is fault-intolerant and F is a general fault model, there exist s ∈ Reachable(p, F) and s′ ∈ Sp such that (s, s′) ∈ Fs (recall that a general fault model contains a non-empty strong fault model).
a.3 From a.1 and a.2, p′ is fault-intolerant, since state s can also be reached in p′.
a.4 From a.3 and assumption 2 above, we have a contradiction.
a.5 From a.4, we conclude that E3 cannot be satisfied by p′.

We now prove the second part (P2), by constructing an appropriate p′ and showing that it differs from p in the absence of faults F.

Proof of P2:
a.1 From assumption 2, p′ has no bad transition in the presence of F for SSPEC.
a.2 From a.1, and from the assumption of a general fault model, it follows that for all s ∈ Reachable(p′) there is no s′ ∈ Reachable(p′, F) such that (s, s′) is a bad transition (note that (s, s′) can only be a fault transition in this case).
a.3 From a.2, it follows that no fault transition of p′ is a bad transition (even though there are fault transitions from a strong fault model).
a.4 Since p is fault-intolerant to F, and F is a general fault model, there exist s ∈ Reachable(p) and s′ ∈ Reachable(p, F) such that (s, s′) is a bad transition (note: here we focus on a fault transition that is bad due to a strong fault).
a.5 From a.3 and a.4, it follows that a bad transition (s, s′) of p in the presence of F is not present in p′.
a.6 From a.5, state s ∉ Reachable(p′) (because if it were, then p and p′ would have the same bad transition).
a.7 From a.6, since Reachable(p′) ⊂ Reachable(p), the set of computations of p′ is not equal to the set of computations of p, contradicting E1 and E2.

The proof is based on the fact that all three requirements cannot be simultaneously satisfied. Thus, it is impossible to solve the efficient fail-safe fault tolerance addition problem under a general fault model.

There can be several reasons behind this impossibility result. We investigate some of them in the next section, and identify potential solutions to some variants of efficient fail-safe fault tolerance addition. We survey the field of fault tolerance synthesis to provide potential approaches to circumventing the impossibility result.

5. Possible Attempts at Circumventing the Impossibility Results

In this section, we study several possible ways of circumventing these impossibility results. In fact, we relate existing work in various areas of fault tolerance to possible ways of circumventing the impossibility result.

5.1. Weaker Fault Model

One of the problems with finding a program p′ that solves the efficient fail-safe fault tolerance design problem is that a general fault model is too strong: such a fault model allows the specification to be violated by fault actions alone. Given that a general fault model is composed of a non-empty weak fault model and a non-empty strong fault model, choosing to solve efficient fail-safe fault tolerance addition under a weak fault model makes the problem solvable. In fact, Jhumka et al. [8] developed a theory, and an associated polynomial-time algorithm, that transforms an initially fault-intolerant program into an efficient fail-safe fault-tolerant one. The fact that a fault transition cannot directly violate safety allows all the states reachable in the absence of faults to be retained. Hence, efficient fail-safe fault tolerance can be achieved in the presence of a weak fault model.
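The following sketch (ours, and not the algorithm of Jhumka et al. [8]) illustrates why the weak-fault-model case is benign: bad transitions can only start in states that are reachable via faults, so removing them leaves the fault-free behaviour intact. The toy transition relation and safety predicate are hypothetical.

```python
# A minimal sketch of fail-safe addition under a weak fault model: drop the bad
# program transitions; their start states are only reachable via faults, so the
# fault-free behaviour is preserved while safety is no longer violated.
def reachable(init, delta):
    seen, changed = set(init), True
    while changed:
        changed = False
        for (u, v) in delta:
            if u in seen and v not in seen:
                seen.add(v)
                changed = True
    return seen

def remove_bad(delta_p, violates):
    """Drop program transitions whose end state violates safety."""
    return {(s, t) for (s, t) in delta_p if not violates(t)}

init = {0}
delta_p = {(0, 1), (1, 2), (4, 5)}    # (4, 5) is bad; state 4 is only fault-reachable
faults = {(1, 4)}                     # weak fault: state 4 itself is still safe
violates = lambda s: s == 5

delta_fs = remove_bad(delta_p, violates)
# Fault-free reachable states are unchanged (no fault-free behaviour removed) ...
assert reachable(init, delta_fs) == reachable(init, delta_p)
# ... and no safety-violating state is reachable even when faults occur.
assert not any(violates(s) for s in reachable(init, delta_fs | faults))
print("fail-safe transition relation:", delta_fs)
```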
5.2. Fault-Safe Specification

In this section, we focus on the concept of a fault-safe specification, introduced independently by Kulkarni and Ebnenasir [11] and by Jhumka et al. [9]. For a fault-safe specification, a fault cannot in itself cause a violation of safety; a fault can only disturb the state space in such a way that subsequent execution of program transitions leads to a violation of safety. Examples of fault-safe specifications abound in the area of fault tolerance, e.g., consensus, mutual exclusion, and two-phase commit. For example, in mutual exclusion, a fault cannot force a second process to access the critical section. However, a fault can modify the value of a lock (i.e., reset the value of the lock), causing another process to believe the critical section is free; that process then accesses it, violating safety.

Reinterpreted in our context, a fault-safe specification does not admit a strong fault model, i.e., there is no strong fault model for such a specification and a program satisfying it. Thus, such a specification admits only a weak fault model and, from Section 5.1, solving efficient fail-safe fault tolerance addition under a weak fault model is possible, and can be achieved in polynomial time.

5.3. Developing a Weaker Specification

One problem that contributes to the impossibility result is that the requirement for the initial fault-intolerant program and the fail-safe fault-tolerant program to display exactly the same behavior in the absence of faults is strong. This is highlighted in the proof, where a potential fail-safe fault-tolerant program p′ attempts to match the steps of the fault-intolerant program p, yet needs to avoid those "tricky" states from which a strong fault can cause it to violate safety. Hence, there are two possible ways of weakening the efficient fail-safe fault tolerance addition problem. The first possibility is to weaken the requirement that p and p′ have exactly matching steps in the absence of faults.

One way to achieve this is to require p′ to display a subset of the computation set of p, such that p′ displays only those computations that avoid the tricky states from which a strong fault can cause a direct violation of safety. Such a line of approach has been adopted by Arora and Kulkarni [2]. Another way is to allow p′ to display as many computations of p as possible, but to allow it to "divert" to a different computation whenever it is about to reach a tricky state. In the extreme case where all states s ∈ Reachable(p) are tricky, this involves rewriting the program from scratch, else the only fail-safe fault-tolerant program is the trivial "null" program. This is an area for future work. An interesting piece of work in this respect is that on edit automata in the area of computer security [3]. Bauer et al. [3] introduce the concept of effective enforcement, whereby an action (or group of actions) ac in p is replaced by another action (or group of actions) ac′ that is syntactically different from ac but semantically equivalent to it. For example, an action x := x + 2 can be replaced by the following three actions: y := x; y := y + 2; x := y, where ";" denotes sequential execution. An interesting area for investigation is the implication of effective enforcement for fault tolerance, especially for real-time systems.

The second way of weakening the problem specification is to require exact behavior in the absence of faults from both the fault-intolerant program p and p′, but not to require p′ to be fail-safe fault-tolerant. This means that, in certain circumstances, it is acceptable for safety to be violated. This situation can occur in non-safety-critical systems, where it is more important to maintain a certain level of performance than to maintain system safety.
5.4. Choosing Another Fault Tolerance: Safe Stabilization

Following the previous section, the problem specification may be weakened by not requiring the program p′ to be fail-safe fault-tolerant. Alternatively, one may require p′ to satisfy some other fault tolerance property while still satisfying safety. Ghosh and Bejan [7] addressed this issue by developing a framework for safe stabilization. Stabilization is the property of a program to satisfy liveness (as opposed to safety) even in the presence of faults. Safe stabilization is the property of a program to satisfy both liveness and safety in the presence of faults. However, for a program to satisfy safe stabilization, it needs to be designed conservatively, so as to satisfy certain conditions under specific fault models. As future work, we plan to investigate the conditions under which safe stabilization is realistic, i.e., under which the program can display a "big" enough computation set.

6. Issues of Read/Write Constraints

In this section, we analyse the problem of designing efficient fail-safe fault tolerance when read/write constraints are imposed on variables. We now present another main contribution of the paper.

Theorem 2 (Impossibility): Given a fault-intolerant program p in which there are read/write constraints imposed on variables, a weak fault model F, and a safety specification SSPEC, it is impossible to design a program p′ such that p′ solves the efficient fail-safe fault tolerance addition problem.

Notice that here we consider a weak fault model, since a general fault model would automatically have led to impossibility. Recall that weak faults cannot themselves cause a violation of safety (in contrast to strong faults), but they may ultimately lead to one.

Proof: The impossibility proof is based on issues of distribution. It shows that it is impossible to devise a p′ that simultaneously satisfies the first two conditions (E1 and E2) and the third condition (E3) of the efficient fail-safe fault tolerance design problem. Given a fault-intolerant program p, a weak fault model Fw, and a safety specification SSPEC, we prove that there is no program p′ that solves the efficient fail-safe fault tolerance design problem.

We provide a construction for p′. Assume a process p′j in p′ has to include a transition (s, s′). Because of the read restrictions, p′j either has to include the whole set set(rj)(s, s′) or exclude the whole set. Assume that (s, s′) is not a bad transition, but that set(rj)(s, s′) contains a bad transition, which we denote by δ. Note that program p contains δ, since p is fault-intolerant. Thus, for fail-safe fault tolerance, p′j has to exclude set(rj)(s, s′). By excluding set(rj)(s, s′), the set of reachable states of p′ becomes a strict subset of that of p, i.e., Reachable(p′) ⊂ Reachable(p). Thus, the set of computations of p′ is a strict subset of that of p, violating conditions E1 and E2 of the efficient fail-safe fault tolerance addition problem. Now, assume that instead of excluding set(rj)(s, s′), p′j includes set(rj)(s, s′) in its transition set. The set of states reachable by p′ then remains the same as that of p, but p′j will contain the bad transition δ, thus making p′ fault-intolerant and violating E3. Thus, p′ cannot satisfy E1, E2 and E3 at the same time. Hence, no such p′ exists.
6.1. An Example Illustrating the Impossibility

[Figure 1. Illustrating the impossibility result: state-transition diagrams of the fault-intolerant program P and the fail-safe fault-tolerant program P′ over states xyz, with fault transitions labelled F, for the specification x + y + z ≤ 4. Diagram omitted.]

In Figure 1, there are two processes involved, namely p1 and p2. The first two values in each state represent the values of the two variables x and y respectively, which belong to process p1. The last value is that of variable z, which belongs to process p2. Process p1 (resp. p2) cannot read variable z (resp. x). The specification for the program is (x + y + z ≤ 4). Based on this specification, the bad transition in program P (which is fault-intolerant) is ⟨(0, 1, 3) → (2, 0, 3)⟩. In the fail-safe fault-tolerant program P′, if the bad transition is to be removed then, since the transition is executed by process p1, which cannot read variable z, two transitions are affected, namely ⟨(0, 1, 3) → (2, 0, 3)⟩ and ⟨(0, 1, 1) → (2, 0, 1)⟩. When transition ⟨(0, 1, 1) → (2, 0, 1)⟩ is removed, some computations in the absence of faults are no longer possible, hence the original behavior is not preserved.
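The grouping underlying this example can be reproduced mechanically. The following sketch (ours) computes set(r1) around the bad transition of P under p1's read set {x, y} and marks which members of the group violate the specification; the encoding of states as value tuples over domain 0..3 is an assumption of the sketch.

```python
from itertools import product

# A sketch of the Section 6.1 example: x and y belong to p1, z to p2,
# p1 cannot read z, and the specification is x + y + z <= 4.
VARS = ("x", "y", "z")
DOM = range(4)
r1 = {"x", "y"}                                     # p1's read set (no z)

def val(s, v):
    return s[VARS.index(v)]

def bad(s0, s1):
    return sum(s0) <= 4 < sum(s1)                   # step that breaks x + y + z <= 4

def group(r, t):
    """set(r)(s0, s1) from Section 2.2, computed over the full state space."""
    s0, s1 = t
    return {(u0, u1)
            for u0 in product(DOM, repeat=3) for u1 in product(DOM, repeat=3)
            if all(val(s0, v) == val(u0, v) and val(s1, v) == val(u1, v) for v in r)
            and all(val(u0, v) == val(u1, v) and val(s0, v) == val(s1, v)
                    for v in VARS if v not in r)}

bad_t = ((0, 1, 3), (2, 0, 3))                      # the bad transition of P
for (u0, u1) in sorted(group(r1, bad_t)):
    print(u0, "->", u1, "(bad)" if bad(u0, u1) else "(ok)")
# The group contains <(0,1,1) -> (2,0,1)>, which P uses in fault-free runs:
# removing the bad transition forces this good transition out as well.
```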
7. Circumventing the Impossibility Result

We have shown that it is impossible to solve the problem of designing efficient fail-safe fault tolerance when read/write restrictions are imposed on processes. However, there are various ways to circumvent the impossibility. In this paper, we focus on identifying a special case in which it is possible to design efficient fail-safe fault tolerance.

7.1. Critical Variables

We observe that, in the proof of impossibility, the problem is caused by the fact that a variable that cannot be read by a process can give rise to groups of transitions containing both good and bad transitions. To capture this, we develop the notion of a critical variable.

Definition 12 (Critical variable): Given a program p, a weak fault model F, a safety specification SSPEC, and a process pj of p, denote the set of variables of p by Vp. A variable v ∉ rj is critical for a transition (s, s′) of pj for SSPEC if and only if set(Vp \ {v})(s, s′) contains at least one transition that is bad and at least one transition that is not bad. We say that a variable v ∈ Vp is critical for p iff there exists a process pj of p with a transition (s, s′) such that v is critical for (s, s′) of pj for SSPEC. Here, Vp \ {v} means that pj can read all variables of p except v.

This definition implies that deciding whether to include the set of transitions set(Vp \ {v})(s, s′) becomes problematic, since the set contains both bad and good transitions. Using the concept of critical variables, we present our next contribution.

Theorem 3 (Reading critical variables): Given a program p, a weak fault model F, a safety specification SSPEC, and Cp, the set of critical variables of p, there exists a p′ that solves the efficient fail-safe fault tolerance design problem iff ∀pj ∈ p : Cp ⊆ rj.

Proof: Consider a process pj of p. Since Cp ⊆ rj, the read set rj contains all critical variables of p (and possibly non-critical ones). This means that, for any transition (s, s′), the set set(rj)(s, s′) contains either only bad transitions or only good transitions. If the set contains only bad transitions, then the set of transitions set(rj)(s, s′) is excluded; in that case only bad transitions are removed, and their start states are only reachable in the presence of faults. If the set set(rj)(s, s′) contains only good transitions, then the set is included, and no good transition is removed. This is done for every process and every transition of each process. Therefore, every computation of p′ is a computation of p in the absence of faults, and vice versa. Also, p′ is fail-safe fault-tolerant, since all bad transitions are removed.

Thus, to solve the efficient fail-safe fault tolerance design problem, it is crucial that all the critical variables can be read by all the processes.
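The critical-variable condition of Theorem 3 can likewise be checked mechanically on the example of Section 6.1. The sketch below (ours; it repeats the helper definitions from the previous sketch so that it stands alone) implements Definition 12 and shows that z is critical for p1's bad transition, so p1 must be able to read z.

```python
from itertools import product

# Definition 12 on the Section 6.1 example (x + y + z <= 4, domain 0..3).
VARS = ("x", "y", "z")
DOM = range(4)

def val(s, v):
    return s[VARS.index(v)]

def bad(s0, s1):
    return sum(s0) <= 4 < sum(s1)

def group(r, t):
    s0, s1 = t
    return {(u0, u1)
            for u0 in product(DOM, repeat=3) for u1 in product(DOM, repeat=3)
            if all(val(s0, v) == val(u0, v) and val(s1, v) == val(u1, v) for v in r)
            and all(val(u0, v) == val(u1, v) and val(s0, v) == val(s1, v)
                    for v in VARS if v not in r)}

def critical(v, t):
    """v is critical for t (Def. 12) iff set(Vp \\ {v})(t) mixes bad and good transitions."""
    g = group(set(VARS) - {v}, t)
    return any(bad(*u) for u in g) and any(not bad(*u) for u in g)

# z is critical for p1's transition <(0,1,3) -> (2,0,3)>, so by Theorem 3 an
# efficient fail-safe fault-tolerant p' exists only if p1 can read z.
print(critical("z", ((0, 1, 3), (2, 0, 3))))    # True
print(critical("z", ((0, 0, 0), (1, 0, 0))))    # False: that group has no bad transition
```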
8. Conclusion

In this paper our main contributions are: (i) we have shown that, under a general fault model, the addition of efficient fail-safe fault tolerance cannot be solved; (ii) we have investigated several strands of work that are relevant to circumventing this impossibility result; (iii) we have proved that it is impossible to design efficient fail-safe fault tolerance if restrictions are imposed on processes' ability to read variables; and (iv) we have identified a necessary and sufficient condition that allows efficient fail-safe fault tolerance to be designed even when read restrictions are imposed.

References

[1] Bowen Alpern and Fred B. Schneider. Defining liveness. Information Processing Letters, 21:181-185, 1985.
[2] Anish Arora and Sandeep S. Kulkarni. Detectors and correctors: A theory of fault-tolerance components. In Proceedings of the 18th IEEE International Conference on Distributed Computing Systems (ICDCS 1998), May 1998.
[3] L. Bauer, J. Ligatti, and D. Walker. More enforceable security policies. In Proceedings of the Computer Security Foundations Workshop, July 2002.
[4] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225-267, March 1996.
[5] Felix C. Gärtner. Transformational approaches to the specification and verification of fault-tolerant systems: Formal background and classification. Journal of Universal Computer Science (J.UCS), 5(10):668-692, October 1999. Special Issue on Dependability Evaluation and Assessment.
[6] Felix C. Gärtner and Arshad Jhumka. Automating the addition of fail-safe fault-tolerance: Beyond fusion-closed specifications. In Proceedings of Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT), Grenoble, France, September 2004.
[7] S. Ghosh and A. Bejan. A framework for safe stabilization. In Proceedings of the Symposium on Self-Stabilization, 2003.
[8] A. Jhumka, F. Freiling, C. Fetzer, and N. Suri. An approach to synthesize safe systems. International Journal of Security and Networks, 1(1), 2006.
[9] Arshad Jhumka, Felix C. Gärtner, Christof Fetzer, and Neeraj Suri. On systematic design of fast and perfect detectors. Technical Report 200263, Swiss Federal Institute of Technology (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland, September 2002.
[10] Sandeep S. Kulkarni. Component Based Design of Fault-Tolerance. PhD thesis, Department of Computer and Information Science, The Ohio State University, 1999.
[11] Sandeep S. Kulkarni and A. Ebnenasir. Complexity of adding failsafe fault-tolerance. In Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS 2002), pages 337-344. IEEE Computer Society Press, July 2002.
[12] Leslie Lamport. Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering, 3(2):125-143, March 1977.
[13] Jean-Claude Laprie, editor. Dependability: Basic Concepts and Terminology, volume 5 of Dependable Computing and Fault-Tolerant Systems. Springer-Verlag, 1992.
[14] Nancy G. Leveson, Stephen S. Cha, John C. Knight, and Timothy J. Shimeall. The use of self checks and voting in software error detection: An empirical study. IEEE Transactions on Software Engineering, 16(4):432-443, 1990.
[15] Zhiming Liu and Mathai Joseph. Transformation of programs for fault-tolerance. Formal Aspects of Computing, 4(5):442-469, 1992.
[16] Zhiming Liu and Mathai Joseph. Stepwise development of fault-tolerant reactive systems. In Formal Techniques in Real-Time and Fault-Tolerant Systems, number 863 in Lecture Notes in Computer Science, pages 529-546. Springer-Verlag, 1994.
[17] David Powell. Failure mode assumptions and assumption coverage. In Dhiraj K. Pradhan, editor, Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing (FTCS '92), pages 386-395, Boston, MA, July 1992. IEEE Computer Society Press.
[18] Hagen Völzer. Verifying fault tolerance of distributed algorithms formally: An example. In Proceedings of the International Conference on Application of Concurrency to System Design (CSD 1998), pages 187-197, Fukushima, Japan, March 1998. IEEE Computer Society Press.