Issues on the Design of Efficient Fail-Safe Fault Tolerance
Arshad Jhumka
Department of Computer Science
University of Warwick
Coventry, UK, CV4 7AL
arshad@dcs.warwick.ac.uk
Matthew Leeke
Department of Computer Science
University of Warwick
Coventry, UK, CV4 7AL
matt@dcs.warwick.ac.uk

Abstract

The design of a fault-tolerant program is known to be an inherently difficult task. Decisions taken during the design process will invariably have an impact on the efficiency of the resulting fault-tolerant program. In this paper, we focus on two such decisions, namely (i) the class of faults the program is to tolerate, and (ii) the variables that can be read and written. The impact these design issues have on the overall fault tolerance of the system needs to be well understood, as a failure to do so can lead to costly redesigns. For the problem of understanding the impact of fault classes on the efficiency of fail-safe fault tolerance, we show that, under the assumption of a general fault model, it is impossible to preserve the original behavior of the fault-intolerant program. For the second problem of read and write constraints on variables, we again show that it is impossible to preserve the original behavior of the fault-intolerant program. We analyze the reasons that lead to these impossibility results, and suggest possible ways of circumventing them.

1. Introduction

Computer systems are becoming increasingly pervasive, being deployed in everything from consumer-oriented products such as mobile phones and PDAs to safety-critical systems such as car engine control and avionics. This increasing pervasiveness has led to a corresponding increase in our reliance on these systems to continually provide correct service, in spite of external perturbations such as security intrusions and faults. In other words, we want such systems to be dependable [13].
There is no single approach for developing a fault-tolerant program (e.g., [2], [8], [15], [16], [11]). In [11], the authors analyzed the complexity of designing fail-safe fault tolerance (i.e., the efficient design of fail-safe fault tolerance), whereas, in this paper, by way of contrast, we analyze the problem of designing efficient fail-safe fault tolerance. We focus on transformational approaches, whereby an initially fault-intolerant program is transformed into a corresponding (fail-safe) fault-tolerant one [2], [8], [15]. For such approaches, it is very desirable that the properties of the initial fault-intolerant program are preserved in the corresponding
fault-tolerant program in the absence of faults. In other
words, it is desirable for the fault tolerance components
to be transparent in the fault-tolerant program. This
means that, in the absence of faults, the fault-tolerant
program behaves as the fault-intolerant one (thus preserving its properties), but in the presence of faults,
the fault-tolerant program handles the faults. However,
there are design issues that may end up having an
impact on whether the properties are preserved or not.
In this paper, we will consider two such issues, namely
(i) the assumed fault model, and (ii) the read/write
constraints. We focus on two important properties of fault-tolerant programs, namely accuracy and completeness [8]. Accuracy refers to the ability of a program to avoid false positives (i.e., wrongly flagging the existence of an error), while completeness is the ability of the program to avoid false negatives (i.e., to detect every error). We refer to these two properties as the detection efficiency of the fault-tolerant program. Whenever it is clear from the context, we will refer to detection efficiency simply as efficiency.
In the design of a fault-tolerant program, it is
common practice to develop a fault model which the
program is supposed to handle. In a formal sense, a
fault model F affecting a program p can be seen as a
program transformation [5]. Such a fault model can be
factorized along two dimensions, namely (i) a local dimension, and (ii) a global dimension. Voelzer [18] calls
the local dimension an impact model, and the global
dimension a rely specification. The impact model specifies the additional faulty behavior of the system, whereas the rely specification dictates the extent of the faulty behavior. For example, an impact model may specify that "nodes may fail by crashing", whereas the rely specification will specify that "no more than t nodes may crash". The impact model causes an "enlargement" of the system's behavior, while the rely specification constrains that "enlargement".
Distributed programs have read/write constraints imposed on them; in a similar manner, read/write constraints can be imposed through encapsulation. This inability to read or write certain variables not only has an impact on the properties of the resulting fault tolerance, but also makes the complexity of designing fault-tolerant programs under read/write constraints very high [11]. Thus, it becomes important to be able to determine the properties of the resulting fault-tolerant program.
Three commonly studied fault tolerance levels, namely (i) fail-safe fault tolerance, (ii) non-masking fault tolerance, and (iii) masking fault tolerance, can be designed through the addition of error detection mechanisms (a.k.a. detectors) and error recovery mechanisms (a.k.a. correctors) [2]. Specifically relevant for this paper is that it is both necessary and sufficient to add detectors to a program to ensure that the resulting program never violates safety in the presence of faults, i.e., to design fail-safe fault-tolerant programs [2].
The main goal of this paper is to analyze (i) the impact of an assumed fault model on the design of efficient fail-safe fault tolerance, and (ii) the impact of read/write restrictions on the design of efficient fail-safe fault tolerance. Very often, to develop a fault-tolerant program, a programmer adopts a defensive programming style to ensure that the system never violates safety in the presence of faults, leading the program to trigger false positives [14]. These false positives may reduce the efficiency of the program, since recovery actions will then be triggered, and the program spends more "time" recovering. Another problem is that of false negatives, which can seriously threaten the safety of the program in the presence of faults. These problems are not easily addressed, requiring extensive knowledge of the program and/or extensive experience on the part of the programmer [14]. Our notions of accuracy and completeness are analogous to the definitions of accuracy and completeness of Chandra and Toueg [4].
In this paper, we first formalize the problem of adding (detection-)efficient fail-safe fault tolerance to an initially fault-intolerant program. We then show that, in the presence of a general fault model (as defined later in the paper), it is impossible to add efficient fail-safe fault tolerance. We also show that it is impossible to add efficient fail-safe fault tolerance when read/write constraints are defined. Consequently, we study some of the possibilities for circumventing these impossibility results. We develop a framework that allows us to study and relate various existing research efforts that try to circumvent the impossibility result. In so doing, we further identify several potential research areas.
The paper is structured as follows: In Section 2, we provide the formal underpinnings of the paper. In Section 3, we explain the role of detectors in fail-safe fault tolerance addition. In Section 4, we explain the problem of adding efficient fail-safe fault tolerance to a fault-intolerant program, and study the complexity of solving the problem in the presence of a general fault model. Section 5 provides some potential solutions to some variants of the efficient fail-safe fault tolerance addition problem. In Section 6, we analyze the problem of designing efficient fail-safe fault tolerance in the presence of read restrictions imposed on processes. In Section 7, we provide a necessary and sufficient condition to solve the problem. We conclude in Section 8.
2. Formal Preliminaries
In this section, we summarize the formal terminology that will be used throughout this paper.
This work assumes an interleaved execution semantics, i.e., state transitions are atomic events and an
execution is regarded as a linear sequence of states.
We assume a shared variables communication model,
i.e., processes communicate with each other by writing
data into memory locations accessible by the receiver.
Syntactically, a program will be represented as a set
of guarded commands, and semantically, the program
is interpreted as a state transition system.
2.1. Programs
Program in Guarded Command Notation
A program p consists of a finite set of processes
{p1 , . . . , pn }. Each process pi consists of a (nonempty) set of actions Ai and variables Vi . An action
has the form
⟨guard⟩ → ⟨statement⟩
where the guard is a boolean expression over the
program variables and the statement is either the empty
statement or an instantaneous assignment to one or
more variables. Each variable stores a value from a nonempty finite domain and is associated with a predefined set of initial values. A state of p is an assignment of values to the variables of p. The state space of p is the set of all possible value assignments to the variables of p.
An action ac of p is enabled in a state s if the guard of ac evaluates to "true" in s. An action ac can be represented by a set of state pairs. We assume that actions are deterministic.
Program as State Transition System
To recall, an assignment of values to variables of p is called a state of p, and the state space Sp of a program p is the set of all possible assignments of values to variables of p. A state predicate of p (predicate for short) is a boolean expression over the state space of p. We write s(x) when a predicate x is evaluated in state s. A predicate x can alternatively be represented as the set of states in which it evaluates to true, i.e., x = {s : Sp | s(x) = T}. The set of initial states Ip of p is defined as the set of all possible assignments of initial values to variables. Alternatively, we refer to the set of initial states of p as the initial condition of p. In this model, a transition is a state pair.
A computation of p is a weakly fair (finite or infinite) sequence of states s0 · s1 . . . such that s0 ∈ Ip and, for each j ≥ 0, sj+1 results from sj by executing a single action that is enabled in sj.¹ Weak fairness means that if a program action ac is continuously enabled along the states of an execution, then ac is eventually chosen to be executed. Weak fairness implies that a computation is maximal with respect to program actions, i.e., if the computation is finite, then no program action is enabled in the final state.
We say state s occurs in a computation s0, s1, . . . iff there exists an i such that s = si. Similarly, a transition (s, s′) occurs in a computation s0, s1, . . . iff there exists an i such that s = si and s′ = si+1.
In this setting, a program can equivalently be represented as a state machine, i.e., a program is a tuple p = (Sp, Ip, δp), where Sp is the state space and Ip ⊆ Sp is the set of initial states. The state transition relation δp ⊆ Sp × Sp is defined by the set of actions as follows: every action ac implicitly defines a set of transitions which is added to δp. Transition (s, s′) ∈ δp iff ∃ac ∈ p that is enabled in state s and whose statement, when executed, results in state s′. We say that ac induces these transitions. State s is called the start state and s′ the end state of the transition.
A state s ∈ Sp is said to be reachable if there exists a computation c of p such that s occurs in c.
2.2. Read/Write Constraints
In this section, we identify how read/write constraints impact the transitions a process can take.
Write Restrictions: Given a transition (s0, s1), it is easy to determine which variables need to be written for the transition to take place. If there is a variable v such that the value of v in state s0 (denoted by v(s0)) differs from its value in state s1, i.e., v(s0) ≠ v(s1), then variable v has to be written for transition (s0, s1) to take place. The write restrictions then amount to checking that the transitions of a process only write to variables that the process is allowed to write.
Denoting by wi the set of variables that process pi can write to, pi cannot use the following transitions:
nw(wi) = {(s0, s1) : (∃t : t ∉ wi : t(s0) ≠ t(s1))}
Read Restrictions: It appears that all variables need
to be read for a transition to take place. When read
restrictions are imposed on a process for certain variables, then transitions have to be grouped. Specifically,
when a given transition is being considered for inclusion or exclusion from the program, the transition
cannot be considered on its own. Rather, a set of
transitions need to be considered. This can be best
illustrated through an example. Consider a program
with two processes pi and pj . Process pi has a variable
i, while process pj has variable j. Process pi (resp. pj)
cannot read variable j (resp. i). Each variable has
domain {0, 1}. Now, if process pi wants to include
transition (hi = 0, j = 0i, hi = 1, j = 0i), then it needs
to include transition (hi = 0, j = 1i, hi = 1, j = 1i)
too, so that the need to read variable j is irrelevant
from pi's point of view. If only one of these transitions were to be included, it would be important for pi to be able to read variable j so as to decide which transition to take.
Denoting the set of variables process pi can read by ri, for a given transition (s0, s1), the set of transitions that need to be considered together with (s0, s1) is given by:
set(ri)(s0, s1) = {(s′0, s′1) : (∀x : x ∈ ri : (x(s0) = x(s′0)) ∧ (x(s1) = x(s′1))) ∧ (∀x : x ∉ ri : (x(s′0) = x(s′1)) ∧ (x(s0) = x(s1)))}
Later, we will show how the fact that sets of transitions (rather than individual transitions) need to be considered impacts the design of efficient fail-safe fault tolerance.
¹ The · operator in a sequence represents concatenation.
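The grouping that read restrictions induce can be made concrete with a small sketch. The following Python fragment is illustrative only (the function name, state encoding as dictionaries, and the shared-domain simplification are ours, not the paper's); it enumerates set(ri)(s0, s1) for a process that can read only the variables in `readable`:

```python
from itertools import product

def transition_group(readable, s0, s1, domain):
    """Compute set(ri)(s0, s1): the transitions that must be included
    together with (s0, s1) when a process can read only the variables
    in `readable`. States are dicts from variable names to values; all
    variables share the same finite `domain` (a simplification)."""
    variables = list(s0)
    unreadable = [x for x in variables if x not in readable]
    # (s0, s1) itself must leave every unreadable variable unchanged,
    # otherwise no transition satisfies the definition
    if any(s0[x] != s1[x] for x in unreadable):
        return []
    group = []
    for vals in product(domain, repeat=len(unreadable)):
        free = dict(zip(unreadable, vals))
        # readable variables match (s0, s1); each unreadable variable
        # takes the same arbitrary value in the start and end state
        t0 = {x: (s0[x] if x in readable else free[x]) for x in variables}
        t1 = {x: (s1[x] if x in readable else free[x]) for x in variables}
        group.append((t0, t1))
    return group

# The two-process example from Section 2.2: p_i reads only i, so including
# (<i=0,j=0>, <i=1,j=0>) forces (<i=0,j=1>, <i=1,j=1>) into the group too.
g = transition_group({"i"}, {"i": 0, "j": 0}, {"i": 1, "j": 0}, (0, 1))
```

On the paper's example the group contains exactly the two transitions that the text says must be included together.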
2.3. Specifications
A specification for a program p is a set of computations which is fusion-closed. A specification S is fusion-closed iff the following holds for computations α and β: if α = γ · s · ε ∈ S and β = τ · s · ξ ∈ S, then γ · s · ξ ∈ S and τ · s · ε ∈ S.
A computation cp of p satisfies a specification S iff cp ∈ S; otherwise, cp violates S. Program p satisfies a specification S iff every computation of p satisfies S.
Intuitively, a fusion-closed specification allows a program to make decisions about future state transitions by looking only at its current state, i.e., the history of the state sequence is encoded within a state. Fusion-closed specifications are non-restrictive in the sense that every specification which is not fusion-closed can be transformed into an equivalent fusion-closed specification by adding history variables. It was further shown in [6] how this transformation can be done efficiently.
Alpern and Schneider [1] have shown that every specification can be written as the intersection of a safety specification and a liveness specification. A safety specification demands that "something bad never happens" [12]. Formally, it defines a set of "bad" finite computation prefixes that should not occur in any computation. Since we are mainly interested in detectors, we focus on safety specifications, and present a definition here.
Definition 1 (Safety specification): A specification S of a program p is a safety specification iff the following condition holds: for every computation σ that violates S, there exists a prefix α of σ such that for all state sequences β, α · β violates S.
The notion of a finite computation not being "bad", i.e., the possibility of extending it so as to remain in the specification, is captured by the definition of maintains.
Definition 2 (Maintains): Let p be a program, S be a specification, and α be a finite computation of p. We say that α maintains S iff there exists a sequence of states β such that α · β ∈ S.
A safety specification can thus be represented by a set of computation prefixes that should not occur in any computation of the program, i.e., the program must prevent invalid prefixes from occurring. However, to ensure that no computation displays such prefixes, the program needs to keep track of its whole execution history so as to be able to decide whether such a prefix is about to occur. Detecting such invalid prefixes is computationally expensive.
However, if the specification is fusion-closed, rather than keeping track of computation prefixes, the program only needs to keep track of the current state before deciding whether to proceed. Formally, this translates into keeping track of only invalid transitions (or bad transitions), rather than invalid prefixes.
Definition 3 (Bad transition): Let p be a program, SSPEC be a safety specification, and α be a finite computation of p. We say that transition (s, s′) of p is bad for SSPEC iff α · (s, s′) violates SSPEC.
It has been shown that any specification that is not fusion closed can be transformed into an equivalent fusion-closed specification through the addition of history variables [10]. Calculating the bad transitions can be achieved in time polynomial in the size of the state space of the program. Gaertner and Jhumka [6] showed how to circumvent the requirement of fusion-closed specifications so as to minimize the expansion of the state space that the fusion closure requirement causes. Also, note that the notion of satisfies deals with infinite computations, whereas maintains deals with computation prefixes.
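For a fusion-closed safety specification represented by its bad transitions, the maintains check of Definition 2 reduces to scanning a finite computation for a bad transition. A minimal sketch, with illustrative names and states encoded as plain strings (neither is the paper's notation):

```python
def maintains(computation, bad_transitions):
    """Definition 2 specialised to a fusion-closed safety specification
    given by its set of bad transitions (Definition 3): a finite
    computation maintains the specification iff none of its transitions
    is bad. States are plain hashable values; `bad_transitions` is a
    set of state pairs."""
    return all((s, t) not in bad_transitions
               for s, t in zip(computation, computation[1:]))

# Toy specification: moving from "armed" to "fired" is the bad transition.
bad = {("armed", "fired")}
maintains(["idle", "armed", "safe"], bad)   # True: no bad transition occurs
maintains(["idle", "armed", "fired"], bad)  # False: the bad transition occurs
```

This linear scan is exactly the saving that fusion closure buys: no prefix set, only a transition set, needs to be consulted.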
2.4. Fault Models and Fault Tolerance
A fault model precisely describes the way in which
components of the system may fail. Fault models
have been categorised into different domains [17]:
time faults, and value faults. Traditional stopping faults
cannot lead by themselves to a violation of safety. To
violate a safety specification, a system must exhibit one
of the disallowed computation prefixes. The standard
value faults from practice (i.e., bit-flips, stuck-at faults)
can directly or indirectly lead to a violation of safety.
Formally, a fault model for a program defines a set
of transitions for the program. Faults that can directly
cause a violation of safety will be called strong faults,
while those that cannot directly violate safety will be
called weak faults. A general fault model consists of
both a weak fault model and a strong fault model.
We provide the definitions below.
Definition 4 (Weak fault model): A weak fault model F for program p and safety specification SSPEC is a set of actions over the variables of p that do not violate SSPEC, i.e., if transition (sj, sj+1) is a transition induced by F and s0, s1, . . . , sj maintains SSPEC, then s0, s1, . . . , sj, sj+1 also maintains SSPEC.
Definition 5 (Strong fault model): A strong fault model F for program p and safety specification SSPEC is a set of actions over the variables of p that violate SSPEC, i.e., if transition (sj, sj+1) is a transition induced by F and s0, s1, . . . , sj maintains SSPEC, then s0, s1, . . . , sj, sj+1 violates SSPEC.
We call actions of F faulty actions (or faults).
Actions (or transitions) of a weak fault model are weak
faults, and those of a strong fault model are strong
faults. A fault occurs if a faulty action (transition) is
executed.
Definition 6 (General fault model): A general fault model F for program p and safety specification SSPEC is a set of actions over the variables of p that can be partitioned into a non-empty weak fault model and a non-empty strong fault model.
Unless specified otherwise, we denote a general fault model by F, the strong part of the fault model by Fs, and the weak part by Fw, such that F = Fs ∪ Fw and Fs ∩ Fw = ∅, for non-empty Fs and Fw. In this paper, we assume a general fault model.
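Under the bad-transition encoding of a fusion-closed safety specification (Section 2.3), splitting a general fault model into its weak and strong parts is straightforward: a fault transition is strong iff it is itself bad. A hypothetical sketch (function name and state encoding are ours, not the paper's):

```python
def partition_fault_model(fault_transitions, bad_transitions):
    """Split a fault model F into (Fw, Fs) per Definitions 4-6,
    assuming a fusion-closed safety specification given by its set of
    bad transitions: a fault transition is strong iff executing it
    immediately violates safety, i.e., iff it is itself bad."""
    strong = {t for t in fault_transitions if t in bad_transitions}
    weak = fault_transitions - strong
    return weak, strong

# A fault transition that directly violates safety is a strong fault;
# a perturbation that merely corrupts state is a weak fault.
faults = {("s0", "s0_corrupt"), ("s1", "crash")}
bad = {("s1", "crash")}
Fw, Fs = partition_fault_model(faults, bad)
```

With both returned sets non-empty, as here, the fault model is general in the sense of Definition 6.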
Definition 7 (Computation in presence of faults):
A computation of p in the presence of F is a weakly
p-fair sequence of states s0 , s1 , . . . such that s0 is an
initial state of p and for each j ≥ 0, sj+1 results from
sj by executing a program action from p or a fault
action from F .
Note: By weakly p-fair, we mean that the actions of
p are treated weakly fair, but not fault actions. Note
also that faults do not cause violation of the initial
condition.
Rephrased in the transition system view, a fault
model adds a set of transitions to the transition relation
of p. We denote the modified transition relation by δpF .
Definition 8 (Fail-safe fault-tolerance): Let S be a
specification, SSPEC be the smallest safety specification including S, and let F be a fault model. A program
p is said to be fail-safe F -tolerant for specification S
iff all computations of p in the presence of F satisfy
SSPEC.
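Definition 8 can be checked directly on a finite-state program when SSPEC is given by its bad transitions (which assumes a fusion-closed specification, per Section 2.3): p is fail-safe F-tolerant iff no bad transition can be executed in the presence of F. A small sketch under these assumptions (the function name and encoding are illustrative, not from the paper):

```python
def is_fail_safe_tolerant(init, delta_p, delta_f, bad):
    """Check Definition 8 on a finite-state program: p is fail-safe
    F-tolerant iff no bad transition can occur in any computation in
    the presence of F. `delta_p`, `delta_f`, and `bad` are sets of
    state pairs; `init` is the set of initial states."""
    reached, stack = set(init), list(init)
    delta = delta_p | delta_f          # program and fault transitions
    while stack:
        s = stack.pop()
        for a, b in delta:
            if a == s:
                if (a, b) in bad:
                    return False       # a bad transition is reachable
                if b not in reached:
                    reached.add(b)
                    stack.append(b)
    return True

# Without the fault transition the program is safe; adding a fault
# that enables the bad transition breaks fail-safe tolerance.
safe = is_fail_safe_tolerant({"s0"}, {("s0", "s1")}, set(), {("s1", "err")})
unsafe = is_fail_safe_tolerant({"s0"}, {("s0", "s1")}, {("s1", "err")}, {("s1", "err")})
```

Passing an empty `delta_f` checks tolerance in the absence of faults, which is how F-intolerance (satisfying SSPEC without F but not with it) can be exhibited.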
We say that a program p is F -intolerant for SSPEC
iff p satisfies SSPEC in the absence of F but violates
SSPEC in the presence of F. We will also write fault-intolerant instead of F-intolerant for SSPEC if F and SSPEC are clear from the context.
A state s ∈ Sp is reachable in the presence of faults F if there exists a computation c of p in the presence of faults F such that s occurs in c.
3. Detectors: Role and Design
In this section, we briefly review the role of detectors in the design of fault tolerance, and subsequently the basis underpinning their design.
Arora and Kulkarni [2] showed that a class of program components called detectors is necessary and sufficient to establish fail-safe fault tolerance in the context of fusion-closed specifications. The main idea of the result is to use detectors to simply "halt" the program in a state where it is about to violate the safety specification, i.e., to "halt" the program in a safe state. An important prerequisite for this sufficiency result is that specifications are fusion-closed. Fusion-closed specifications allow a safety specification to be characterised as a set of disallowed "bad" transitions (instead of a set of disallowed computation prefixes).
Definition 9 (Bad transition): Given a program p, a fault model F, and a safety specification SSPEC, a transition τ ∈ δpF is said to be bad for p in the presence of F for SSPEC if, for all computations σ of p in the presence of F, if τ occurs in σ then σ ∉ SSPEC.
Note that, under our fault model assumption, a bad transition can be either a program transition or a fault transition.
Under such a fault model, for a program to remain safe, it needs to avoid reaching a state where safety has already been violated. In other words, checks need to be performed before transitions are executed to ensure that the program does not end up in a safety-violating state. This translates into checking whether a transition can lead to a safety-violating state. The check is performed by evaluating a predicate which enables the correct transitions and disables the bad program transitions. Observe that fault transitions can occur at any time and cannot be disabled, unless fault avoidance techniques such as system redesign are used. The predicate is implemented using a program component called a detector, defined below (Def. 10).
Definition 10 (Detector for an action): Let SSPEC be a safety specification. An SSPEC-detector d monitoring program action ac of p is a state predicate of p such that executing ac in a state where d holds maintains SSPEC.
We will simply talk about detectors instead of SSPEC-detectors if the relevant safety specification is clear from the context. When a detector d refines the guard of an action ac of p, we say that we compose ac with d. We will also sometimes say that the detector d is located at ac. We also say that we compose p with d if ∃ac ∈ p such that ac is composed with d. We say that we compose a program p with a set D of detectors (denoted p[]D) iff ∀d ∈ D · ∃ac ∈ p s.t. ac is composed with d.
Formally, in the transition system view of a program p, a state s is reachable by p iff, starting from an initial state of p, there exists a computation containing s which uses only transitions from δp. Otherwise, s is unreachable. Similarly, the notions of a state or transition being reachable in the presence of faults can be defined by referring to δpF. Using this terminology, composing a program p with detectors results in some transitions of p becoming unreachable in the presence of faults. Observe that bad transitions are only reachable in the presence of faults.
As mentioned earlier, the design of detectors has usually been achieved through experience and intuition [14], [2].
4. Addition of Efficient Fail-Safe Fault Tolerance (FSFT)
In this section, we first define the problem of adding efficient fail-safe fault tolerance to an initially fault-intolerant program.
Definition 11 (Addition of FSFT): Given a fault-intolerant program p, a general fault model F, and a safety specification SSPEC, design a program p′ such that:
E1. Every computation of p is a computation of p′ in the absence of F.
E2. Every computation of p′ is a computation of p in the absence of F.
E3. In the presence of F, p′ satisfies SSPEC.
If such a p′ exists, then we say that p′ solves the efficient fail-safe fault tolerance design problem.
The first two conditions characterise the efficiency issue that we address in this paper. Since the fault-intolerant program may have been tested for performance and high efficiency, conditions E1 and E2 ensure that, in the absence of faults, the fail-safe fault-tolerant program p′ retains the same high performance as p. A program p′ satisfying the above conditions is said to solve the efficient fail-safe fault tolerance addition problem. The first two conditions imply that the fault tolerance mechanisms are transparent, i.e., if no faults are present, then p′ behaves exactly as p, and hence pays no "price". The third condition, on the other hand, says that p′ extends p in being fail-safe fault-tolerant, which p is not.
We now present the first contribution of the paper.
Theorem 1 (Impossibility): Given a fault-intolerant program p, a general fault model F, and a safety specification SSPEC, it is impossible to design a program p′ such that p′ solves the efficient fail-safe fault-tolerance addition problem.
Proof
Since it is always possible to design a fail-safe fault-tolerant program p′ by strengthening the guard of each action of p with the predicate "false", thus giving the empty program, the proof of impossibility is based on showing that it is impossible to devise a p′ that simultaneously satisfies the first two conditions (E1 and E2) and the third condition (E3).
Given: a fault-intolerant program p, a general fault model F, and a safety specification SSPEC.
Prove: there is no p′ that solves the efficient fail-safe fault tolerance addition problem.
The proof consists of two parts:
P1. If a program p′ satisfies the first two conditions of the efficient fail-safe fault tolerance addition problem (E1 and E2), then p′ cannot satisfy condition E3.
P2. If p′ satisfies E3, then it cannot satisfy E1 or E2.
Here, we prove the first part, i.e., P1. We make two assumptions.
Assumptions
1) Assume p′ satisfies E1 and E2.
2) Assume that p′ satisfies E3.
We will prove this by contradiction, i.e., if p′ satisfies E1 and E2, then it cannot satisfy E3, and vice versa.
We also make use of the following: given a program T, the set of reachable states of T is denoted by Reachable(T), defined as Reachable(T) = {s : ST | s occurs in some computation of T}. We denote the set of states reachable by a program T in the presence of faults F by Reachable(T, F), defined as Reachable(T, F) = {s : ST | s occurs in some computation of T in the presence of F}.
Proof
a.1 From assumption 1, Reachable(p) = Reachable(p′).
a.2 Given that p is fault-intolerant and F is a general fault model, ∃s ∈ Reachable(p, F) such that ∃s′ ∈ Sp with (s, s′) ∈ Fs (recall that a general fault model contains a non-empty strong fault model).
a.3 From a.1 and a.2, p′ is fault-intolerant, since state s can also be reached in p′.
a.4 From a.3 and assumption 2 above, we have a contradiction.
a.5 From a.4, we conclude that E3 cannot be satisfied by p′.
We now prove the second part (P2), by constructing an appropriate p′ and showing that it differs from p in the absence of faults F.
a.1 From assumption 2, p′ has no bad transition in the presence of F for SSPEC.
6
to solve efficient fail-safe fault tolerance addition in a
weak fault model makes the problem solvable. In fact,
Jhumka et al. [8] developed a theory, and associated
polynomial-time algorithm that transforms an initially
fault-intolerant program into an efficient fail-safe faulttolerant one. The fact that a fault transition cannot
directly violate safety allows all the states reachable
in absence of faults to be retained. Hence, efficient
fail-safe fault tolerance can be achieved in presence of
a weak fault model only.
a.2 From a.1, and from the assumption of a general
fault model, it follows that, ∀s ∈Reachable(p0 )
6 ∃s0 ∈Reachable(p0 , F ) s.t (s, s0 ) is a bad
transition (Note that (s, s0 ) can only be a fault
transition in this case).
a.3 From a.2, it follows that there is no fault transition that is a bad transition (though there are
fault transitions from a strong fault model).
a.4 Since p is fault-intolerant to F , and F is a general fault model, it means that ∃s ∈Reachable(p)
s.t ∃s0 ∈Reachable(p, F ) s.t (s, s0 ) is a bad
transition (Note: here we are focusing on a fault
transition that is bad due to a strong fault).
a.5 From a.3 and a.4, it follows that a bad transition
(s, s0 ) of p in presence of F is not present in p0 .
a.6 From a.5, state s 6∈Reachable(p0 ) (because if it
was, then p and p0 would have the same bad
transition).
a.7 From a.6, since Reachable(p0 )⊂ Reachable(p),
then the set of computation of p is not equal to
set of computation of p0 .
The proof is based on the fact that all three requirements cannot be simultaneously satisfied. Thus, it is
impossible to solve the efficient fail-safe fault tolerance
addition problem under a general fault model. There
can be several reasons behind that impossibility result.
We will investigate some of the possible reasons in
the next section, and identify potential solutions to
some variants of the efficient fail-safe fault tolerance
addition. We survey the field of fault tolerance synthesis to provide potential approaches to circumvent the
impossibility result.
5.2. Fault-Safe Specification
In this section, we focus on the concept of fault-safe
specification, independently introduced by Kulkarni
and Ebnenasir [11], and by Jhumka et.al [9]. For
fault-safe specification, a fault cannot in itself cause
a violation of safety. A fault can only disturb the state
space in such a way such that subsequent execution
of program transitions lead to a violation of safety.
Examples of fault-safe specification abound in the area
of fault tolerance, such as consensus, mutual exclusion,
2-phase commit etc. For example, in mutual exclusion,
a fault cannot force a second process to access the
critical section. However, a fault can modify the value
of a lock (i.e., reset the value of the lock), causing
another process to believe the critical section is free,
which then access it, violating safety.
Reinterpreted in our context, a fault-safe specification does not admit a strong fault model, i.e., there
is no strong fault model for such a specification, and
a program satisfying the specification. Thus, such a
specification admits only a weak fault model, and from
section 5.1, solving efficient fail-safe fault tolerance
addition under a weak fault model is possible, and can
be achieved in polynomial time.
5. Possible Attempts at Circumventing the
Impossibility Results
In this section, we study several possible ways of circumventing these impossibility results. In fact, we relate several strands of existing work in various areas of fault tolerance to possible ways of circumventing the impossibility results.
5.3. Developing a Weaker Specification
One problem that contributes to the impossibility result is that the requirement that the initial fault-intolerant program and the fail-safe fault-tolerant program display exactly the same behavior in the absence of faults is strong. This fact is highlighted in the proof, whereby a potential fail-safe fault-tolerant program p0 attempts to match the steps of the fault-intolerant program p, but needs to avoid those potentially tricky states from where a strong fault can cause it to violate safety.
Hence, there are two possible ways of weakening the efficient fail-safe fault tolerance addition problem. The first possibility is to weaken the requirement that p and p0 have exactly matching steps in the absence of faults.
5.1. Weaker Fault Model
One of the problems with finding a program p0 that solves the efficient fail-safe fault tolerance design problem is that a general fault model is too strong: such a fault model allows the specification to be violated by fault actions alone. Given that a general fault model is composed of a non-empty weak fault model and a non-empty strong fault model, choosing
One way to achieve this is to require p0 to display a subset of the computation set of p, such that p0 displays only those computations that avoid the potentially tricky states from where a strong fault can cause it to directly violate safety. Such an approach has been adopted by Arora and Kulkarni [2].
Another way is to allow p0 to display as many of the computations of p as possible, but allow it to “divert” to a different computation whenever it is about to reach a tricky state. In the extreme case where all states s ∈ Reachable(p) are tricky, this involves rewriting the new program from scratch, as otherwise the only fail-safe fault-tolerant program is the trivial “null” program. This is an area for future work. An interesting work in relation to this is the work in the area of computer security on edit automata [3]. Bauer et al. [3] introduce the concept of effective enforcement, whereby an action (or group of actions) ac in p is replaced by another action (or group of actions) ac’ that is syntactically different from ac, but semantically equivalent to it. For example, an action x := x + 2 can be replaced by the following three actions: y := x; y := y + 2; x := y, where ; denotes sequential execution. An interesting area for investigation is the implication of effective enforcement for fault tolerance, especially for real-time systems.
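The rewriting in the example above can be checked mechanically. The sketch below (the variables x and y are from the paper's example; the dictionary state encoding is an assumption) verifies that the three-action sequence agrees with the original action on x, provided y is treated as a scratch variable whose final value is not observed:

```python
def original(state):
    """The original action: x := x + 2."""
    s = dict(state)
    s["x"] = s["x"] + 2
    return s

def rewritten(state):
    """The syntactically different rewriting: y := x; y := y + 2; x := y,
    executed sequentially."""
    s = dict(state)
    s["y"] = s["x"]
    s["y"] = s["y"] + 2
    s["x"] = s["y"]
    return s

# Semantic equivalence with respect to the observable variable x:
for x0 in range(-5, 6):
    state = {"x": x0, "y": 0}
    assert original(state)["x"] == rewritten(state)["x"] == x0 + 2
```

Note that the two versions differ on y (the rewriting leaves y equal to x + 2), so the equivalence holds only with respect to the observed variable x; this is exactly why the replacement is syntactically different yet semantically equivalent in the intended sense.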
The second way of weakening the problem specification is to require exact behavior in the absence of faults from both the fault-intolerant program p and p0 , but not to require p0 to be fail-safe fault-tolerant. This means that, in certain circumstances, it is acceptable for safety to be violated. This situation can occur in non-safety-critical systems, where it is more important to maintain a certain level of performance than to maintain system safety.
As future work, we plan to investigate the conditions under which safe stabilization is realistic, i.e., when the program can display a sufficiently large computation set.
6. Issues of Read/Write Constraints
In this section, we analyse the problem of designing efficient fail-safe fault tolerance when read/write
constraints are imposed on variables. We now present
another main contribution of the paper:
Theorem 2 (Impossibility): Given a fault-intolerant program p in which there are read/write constraints imposed on variables, a weak fault model F , and a safety specification SSPEC. Then, it is impossible to design a program p0 such that p0 solves the efficient fail-safe fault-tolerance addition problem.
Notice that we consider a weak fault model here, since a general fault model would have automatically led to the impossibility. Recall that a weak fault model does not by itself cause a violation of safety (in contrast to a strong fault model), but it may ultimately lead to a violation of safety.
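The distinction between the two fault models can be phrased operationally: a fault transition belongs to the strong part of a fault model if it is itself a bad transition, and to the weak part otherwise. A small sketch, using the specification x + y + z ≤ 4 from the example in Section 6.1 and tuple-encoded states (both assumptions made for illustration):

```python
def is_bad(transition):
    """Bad transition w.r.t. SSPEC: the target state violates x + y + z <= 4."""
    _src, dst = transition
    return sum(dst) > 4

def split_fault_model(fault_transitions):
    """Partition a fault model into its weak part (transitions that do not
    themselves violate safety) and its strong part (transitions that do).
    A general fault model has both parts non-empty."""
    weak = [t for t in fault_transitions if not is_bad(t)]
    strong = [t for t in fault_transitions if is_bad(t)]
    return weak, strong

faults = [
    ((0, 0, 0), (0, 0, 3)),  # perturbs z; target sum is 3: weak
    ((0, 1, 3), (2, 1, 3)),  # perturbs x; target sum is 6: strong
]
weak, strong = split_fault_model(faults)
assert len(weak) == 1 and len(strong) == 1  # a general fault model
```

A fault model whose strong part is empty is exactly the weak fault model assumed by Theorem 2.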
Proof
The impossibility proof is based on issues of distribution: we show that it is impossible to devise a p0 that simultaneously satisfies the first two conditions (E1 and E2) and the third condition (E3) of the problem of designing efficient fail-safe fault tolerance.
Given a fault-intolerant program p, a weak fault model Fw , and a safety specification SSPEC, we prove that there is no program p0 that solves the efficient fail-safe fault tolerance design problem.
We attempt a construction for p0 . Assume a process p0j in p0 has to include a transition (s, s0 ). The problem is then that p0j either has to include the whole set set(rj )(s, s0 ) or exclude the whole set. Assume that (s, s0 ) is not a bad transition, but that set(rj )(s, s0 ) contains a bad transition, which we denote by δ. Note that program p contains δ, since p is fault-intolerant. Thus, for fail-safe fault tolerance, p0j has to exclude set(rj )(s, s0 ). By excluding set(rj )(s, s0 ), the set of reachable states of p0 becomes a proper subset of that of p, i.e., reachable(p0 ) ⊂ reachable(p). Thus, the set of computations of p0 is a proper subset of that of p, violating conditions E1 and E2 of the efficient fail-safe fault tolerance addition problem.
Now, assume that, instead of excluding set(rj )(s, s0 ), p0j includes set(rj )(s, s0 ) in its transition
5.4. Choosing Another Fault Tolerance: Safe
Stabilization
From the previous section, the problem specification may be weakened by not requiring the program p0 to be fail-safe fault-tolerant. On the other hand, one may require p0 to satisfy some other fault tolerance property, while still satisfying safety. Ghosh and Bejan [7] addressed this issue by developing a framework for safe stabilization. Stabilization is the property of a program to satisfy liveness (as opposed to safety) even in the presence of faults. Safe stabilization is the property of a program to satisfy both liveness and safety in the presence of faults. However, for a program to satisfy safe stabilization, it needs to be designed conservatively, so as to satisfy certain conditions under specific fault models.
7.1. Critical Variables
We observe that, in the proof of impossibility, the problem is caused by the fact that a variable that cannot be read by a process can lead to transition groups that contain good as well as bad transitions. To capture this, we introduce the notion of a critical variable.
Definition 12 (Critical Variable): Given a program p, a weak fault model F , and a safety specification SSPEC. Given also a process pj of program p. Denote the set of variables of p by Vp . A variable v ∉ rj is critical for a transition (s, s0 ) of pj for SSPEC if and only if set(Vp \ {v})(s, s0 ) contains at least one transition that is bad, and at least one transition that is not bad. We say that a variable v ∈ Vp is critical for p iff there exists a process pj of p with a transition (s, s0 ) such that v is critical for (s, s0 ) of pj for SSPEC.
Here, (Vp \ {v}) means that pj can read all variables in p except v. This definition implies that deciding whether to include the set of transitions set(Vp \ {v})(s, s0 ) becomes problematic, since it contains both bad and good transitions.
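Definition 12 can be read as an executable test. The sketch below assumes a finite domain for every variable and tuple-encoded states (illustrative choices, not the paper's notation); it hides a single variable v and checks whether the resulting transition group mixes bad and good transitions:

```python
DOMAIN = range(4)  # assumed finite domain for every variable

def is_bad(transition):
    """Bad w.r.t. SSPEC (x + y + z <= 4): the target state violates safety."""
    _src, dst = transition
    return sum(dst) > 4

def group_hiding(v, transition):
    """set(Vp \\ {v})(s, s0): the transitions agreeing with (s, s0) on every
    variable except v; v takes any value but is unchanged by the step."""
    s, s2 = transition
    result = []
    for val in DOMAIN:
        t, t2 = list(s), list(s2)
        t[v] = t2[v] = val
        result.append((tuple(t), tuple(t2)))
    return result

def is_critical(v, transition):
    """Definition 12: v is critical for (s, s0) iff hiding v yields a group
    with at least one bad and at least one non-bad transition."""
    g = group_hiding(v, transition)
    return any(is_bad(t) for t in g) and any(not is_bad(t) for t in g)

# Hiding z (index 2) from (0,1,1) -> (2,0,1): for z = 3 the grouped
# transition reaches (2,0,3), which is bad, so z is critical here.
assert is_critical(2, ((0, 1, 1), (2, 0, 1)))
# Hiding z from (0,0,0) -> (1,0,0): no value of z in the domain makes the
# target exceed 4, so z is not critical for this transition.
assert not is_critical(2, ((0, 0, 0), (1, 0, 0)))
```

The first assertion is exactly the situation of the example in Section 6.1; the second shows a transition for which the same unreadable variable is harmless.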
Using the concept of critical variables, we present
our next contribution.
Theorem 3 (Reading critical variables): Given a program p, a weak fault model F , and a safety specification SSPEC. Given also Cp , the set of critical variables of p. Then, there exists a p0 that solves the efficient fail-safe fault tolerance design problem iff ∀pj ∈ p : Cp ⊆ rj .
Figure 1. Illustrating the impossibility result
set. The set of reachable states of p0 remains the same as that of p, but p0j will contain the bad transition δ, thus making p0 fault-intolerant.
Thus, p0 cannot satisfy E1, E2, and E3 at the same time. Hence, no such p0 exists.
6.1. An Example Illustrating the Impossibility
In Figure 1, there are two processes involved, namely p1 and p2 . The first two values in each state represent the values of the two variables x and y respectively, which belong to process p1 . The last value is that of variable z, which belongs to process p2 . Process p1 (resp. p2 ) cannot read variable z (resp. x). The specification for the program is (x + y + z ≤ 4).
Based on this specification, the bad transition in program P (which is fault-intolerant) is ⟨(0, 1, 3) → (2, 0, 3)⟩. In the fail-safe fault-tolerant program P’, if the bad transition is to be removed, then, since the transition is executed by process p1 , which cannot read variable z, two transitions are affected, namely ⟨(0, 1, 3) → (2, 0, 3)⟩ and ⟨(0, 1, 1) → (2, 0, 1)⟩. When transition ⟨(0, 1, 1) → (2, 0, 1)⟩ is removed, some computations in the absence of faults are no longer possible, hence the original behavior is not preserved.
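The grouping in this example can be reproduced directly. In the sketch below (z ranges over the values 0–3 appearing in Figure 1; the tuple encoding is an assumption), removing the bad transition forces p1, which cannot read z, to drop the good transition ⟨(0, 1, 1) → (2, 0, 1)⟩ as well:

```python
def group_without_z(transition, z_values=range(4)):
    """Transitions p1 cannot distinguish from (s, s0): p1 does not read z,
    so z may take any value but is left unchanged by p1's step."""
    (x, y, _z), (x2, y2, _z2) = transition
    return [((x, y, z), (x2, y2, z)) for z in z_values]

bad = ((0, 1, 3), (2, 0, 3))
grp = group_without_z(bad)

# The good transition (0,1,1) -> (2,0,1) sits in the same group as the
# bad one: excluding the group removes fault-free behavior of p, while
# including it keeps the bad transition.
assert bad in grp
assert ((0, 1, 1), (2, 0, 1)) in grp
```

Either choice violates one of the conditions E1–E3, which is the content of Theorem 2.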
Proof
Given a process pj of p. Since Cp ⊆ rj , any variable that pj cannot read is non-critical for p. This means that, for a transition (s, s0 ), the set set(rj )(s, s0 ) contains either only bad transitions or only good transitions. If the set contains only bad transitions, then the set of transitions set(rj )(s, s0 ) is excluded; thus, only bad transitions are removed, whose starting states are only reachable in the presence of faults. If the set set(rj )(s, s0 ) contains only good transitions, then the set is included, and no good transition is removed. This is done for all processes, and for every transition of each process. Therefore, every computation of p0 is a computation of p in the absence of faults, and vice versa. Also, p0 is fail-safe fault-tolerant, since all bad transitions are removed.
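The construction in this proof suggests a simple synthesis loop, sketched below under the same assumed encodings as before: a process keeps a transition iff its read-restricted group contains no bad transition. When Cp ⊆ rj , every group is uniformly bad or uniformly good, so exactly the bad transitions are removed:

```python
def is_bad(transition):
    """Bad w.r.t. SSPEC (x + y + z <= 4)."""
    _src, dst = transition
    return sum(dst) > 4

def group_without_z(transition, z_values=range(4)):
    """Group of transitions indistinguishable to a process that cannot
    read z (z arbitrary but unchanged across the step)."""
    (x, y, _), (x2, y2, _) = transition
    return [((x, y, z), (x2, y2, z)) for z in z_values]

def add_failsafe(transitions, group_of):
    """Keep a transition iff its whole group is free of bad transitions."""
    return [t for t in transitions
            if not any(is_bad(g) for g in group_of(t))]

# When the critical variable z IS readable, groups are singletons, and
# only the bad transition is removed:
p1 = [((0, 0, 0), (0, 1, 0)), ((0, 1, 3), (2, 0, 3))]
assert add_failsafe(p1, lambda t: [t]) == [((0, 0, 0), (0, 1, 0))]

# When z is NOT readable, the good transition (0,1,1) -> (2,0,1) is also
# dropped, because it shares a group with the bad one:
p1_restricted = [((0, 1, 1), (2, 0, 1))]
assert add_failsafe(p1_restricted, group_without_z) == []
```

The first case illustrates the "if" direction of Theorem 3; the second shows how an unreadable critical variable destroys fault-free behavior, as in the impossibility proof.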
7. Circumventing the Impossibility Result
We have shown that it is impossible to solve the problem of designing efficient fail-safe fault tolerance when read/write restrictions are imposed on processes. However, there are various ways to circumvent the impossibility. In this paper, we focus on identifying a special case where it is possible to design efficient fail-safe fault tolerance.
Thus, to solve the efficient fail-safe fault tolerance design problem, it is crucial that all the critical variables can be read by all the processes.
8. Conclusion
In this paper our main contributions are: (i) we have shown that, under a general fault model, addition of efficient fail-safe fault tolerance cannot be solved, (ii) we have investigated several strands of work that are relevant to circumventing the impossibility result, (iii) we have proved that it is impossible to design efficient fail-safe fault tolerance if restrictions are imposed on processes’ ability to read variables, and (iv) we have identified a necessary and sufficient condition that allows efficient fail-safe fault tolerance to be designed, even when read restrictions are imposed.
References
[1] Bowen Alpern and Fred B. Schneider. Defining liveness. Information Processing Letters, 21:181–185, 1985.
[2] Anish Arora and Sandeep S. Kulkarni. Detectors and correctors: A theory of fault-tolerance components. In Proceedings of the 18th IEEE International Conference on Distributed Computing Systems (ICDCS98), May 1998.
[3] L. Bauer, J. Ligatti, and D. Walker. More enforceable security policies. In Proceedings of the Computer Security Foundations Workshop, July 2002.
[4] Tushar Deepak Chandra and Sam Toueg. Unreliable failure detectors for reliable distributed systems. Journal of the ACM, 43(2):225–267, March 1996.
[5] Felix C. Gärtner. Transformational approaches to the specification and verification of fault-tolerant systems: Formal background and classification. Journal of Universal Computer Science (J.UCS), 5(10):668–692, October 1999. Special Issue on Dependability Evaluation and Assessment.
[6] Felix C. Gärtner and Arshad Jhumka. Automating the addition of fail-safe fault-tolerance: Beyond fusion-closed specifications. In Proceedings of Formal Techniques in Real-Time and Fault-Tolerant Systems (FTRTFT), Grenoble, France, September 2004.
[7] S. Ghosh and A. Bejan. A framework for safe stabilization. In Proceedings of the Symposium on Self-Stabilization, 2003.
[8] A. Jhumka, F. Freiling, C. Fetzer, and N. Suri. An approach to synthesize safe systems. International Journal on Security and Networks, 1(1), 2006.
[9] Arshad Jhumka, Felix C. Gärtner, Christof Fetzer, and Neeraj Suri. On systematic design of fast and perfect detectors. Technical Report 200263, Swiss Federal Institute of Technology (EPFL), School of Computer and Communication Sciences, Lausanne, Switzerland, September 2002.
[10] Sandeep S. Kulkarni. Component Based Design of Fault-Tolerance. PhD thesis, Department of Computer and Information Science, The Ohio State University, 1999.
[11] Sandeep S. Kulkarni and A. Ebnenasir. Complexity of adding failsafe fault-tolerance. In Proceedings of the 22nd IEEE International Conference on Distributed Computing Systems (ICDCS 2002), pages 337–344. IEEE Computer Society Press, July 2002.
[12] Leslie Lamport. Proving the correctness of multiprocess programs. IEEE Transactions on Software Engineering, 3(2):125–143, March 1977.
[13] Jean-Claude Laprie, editor. Dependability: Basic Concepts and Terminology, volume 5 of Dependable Computing and Fault-Tolerant Systems. Springer-Verlag, 1992.
[14] Nancy G. Leveson, Stephen S. Cha, John C. Knight, and Timothy J. Shimeall. The use of self checks and voting in software error detection: An empirical study. IEEE Transactions on Software Engineering, 16(4):432–443, 1990.
[15] Zhiming Liu and Mathai Joseph. Transformation of programs for fault-tolerance. Formal Aspects of Computing, 4(5):442–469, 1992.
[16] Zhiming Liu and Mathai Joseph. Stepwise development of fault-tolerant reactive systems. In Formal Techniques in Real-Time and Fault-Tolerant Systems, number 863 in Lecture Notes in Computer Science, pages 529–546. Springer-Verlag, 1994.
[17] David Powell. Failure mode assumptions and assumption coverage. In Dhiraj K. Pradhan, editor, Proceedings of the 22nd Annual International Symposium on Fault-Tolerant Computing (FTCS ’92), pages 386–395, Boston, MA, July 1992. IEEE Computer Society Press.
[18] Hagen Völzer. Verifying fault tolerance of distributed algorithms formally: An example. In Proceedings of the International Conference on Application of Concurrency to System Design (CSD98), pages 187–197, Fukushima, Japan, March 1998. IEEE Computer Society Press.