3. Evaluating system reliability and performance indices

advertisement
Reliability and Performance Analysis of Hardware-Software Systems with
Fault-Tolerant Software Components
Gregory Levitin
Reliability Department, Planning, Development and Technology Division,
Israel Electric Corporation Ltd., P.O. Box 10, Haifa, 31000 Israel
E-mail: levitin@iec.co.il
Abstract
This paper presents an algorithm for evaluating reliability and expected execution time for
systems consisting of fault-tolerant software components running on several hardware units.
The components are built from functionally equivalent but independently developed versions
characterized by different reliability and execution time. Different number of versions can be
executed simultaneously depending on the number of available units. The system reliability is
defined as the probability that the system produces a correct output in a specified time.
Keywords: Hardware-software systems; fault-tolerant programming; performance; reliability;
expected execution time.
Acronyms1
ATB
acceptance test block
NVP
N-version programming
RBS
recovery block scheme
MGF moment generating function
r.v.
1
random variable
The singular and plural of an acronym are always spelled the same.
1
pmf
probability mass function
Notation
Pr{e} probability of event e
1(x)
unity function: 1(TRUE)=1, 1(FALSE)=0
C
number of components in the software system
Hc
number of parallel hardware units in component c
hc
random number of available hardware units in component c
Qc(x) Pr{hc=x}
Ac
availability of hardware units in component c
Nc
number of software versions in component c
Mc
number of correct outputs needed for component c to succeed
Lc
random number of software versions in component c that can be executed
simultaneously
lc(x)
realization of Lc corresponding to hc=x
rci
reliability of i-th version in component c
ci
execution time of i-th software version in component c
tci(lc) termination time of i-th software version in component c when lc versions are executed
simultaneously
pcj(lc) probability that system component c produces correct output after termination of j first
versions in component c when lc versions are executed simultaneously
k(i)
number of version that produces i-th output in the component
sck (i ) random success indicator of version that produces i-th output in component c
Tc
random execution time for system component c
T
random execution time for the entire system
tx
x-th realization of T
2
qx
Pr{T=tx}
X
number of different realizations of T
T*
upper bound for system execution time
R(T*) system reliability (probability that the correct output is produced in time less than T*)
E
expected system execution time
uck (i ) ( z ) MGF representing pmf of success indicator scki
U cj ( z, lc ) MGF representing the pmf of the number of correct outputs in component c after
the execution of a group of first j versions when lc versions are executed
simultaneously
vc(z,lc) MGF representing conditional pmf of r.v. Tc when lc versions are executed
simultaneously
Vc(z)
MGF representing pmf of r.v. Tc
V(z)
MGF representing pmf of r.v. T
1. Introduction
Software failures are caused by errors made in various phases of program development.
When the software reliability is of critical importance, special programming techniques are
used in order to achieve its fault tolerance. Two of the best-known fault-tolerant software
design methods are N-version programming (NVP) and recovery block scheme (RBS) [1].
Both methods are based on the redundancy of software modules (functionally equivalent but
independently developed) and the assumption that coincident failures of modules are rare.
The fault tolerance usually requires additional resources and results in performance penalties
(particularly with regard to computation time), which constitutes a tradeoff between software
performance and reliability.
3
NVP was proposed by Chen and Avizienis [2]. This approach presumes the execution of
N functionally equivalent software modules (called versions) that receive the same input and
send their outputs to a voter, which is aimed at determining the system output. The voter
produces an output if at least M out of N outputs agree (it is presumed that the probability that
M wrong outputs agree is negligibly small). Otherwise, the system fails. Usually majority
voting is used in which N is odd and M=(N+1)/2.
In some applications, the available computational resources do not allow all of the
versions to be executed simultaneously. In these cases, the versions are executed according to
some predefined sequence and the program execution terminates either when M versions
produce the same output (success) or when after the execution of all the N versions the
number of equivalent outputs is less than M (failure). The entire program execution time is a
random variable depending on the parameters of the versions and on the number of versions
that can be executed simultaneously.
RBS was proposed by Randell [3]. In this approach after execution of each version, its
output is tested by an acceptance test block (ATB). If the ATB accepts the version output, the
process is terminated and the version output becomes the output of the entire system. If all N
versions do not produce the accepted output, the system fails. If the computational resources
allow simultaneous execution of several versions, the versions are executed according to
some predefined sequence and the entire program terminates either when one of versions
produces the output accepted by the ATB (success) or when after the execution of all the N
versions no output is accepted by the ATB (failure). If the acceptance test time is included
into the execution time of each version, the RBS performance model becomes identical to the
performance model of the NVP with M=1.
Estimating the effect of the fault-tolerant programming on system performance is
especially important in safety critical real-time computer applications. This effect has been
4
studied by Tai et al. in [4] and by Goseva-Popstojanova and Grnarov in [5, 6]. While in [4] a
basic realization of NVP (N=3, M=2) consisting of versions with identical fault probabilities
and different execution times has been considered, in [5] and [6] NVP with arbitrary N has
been studied in which both times to failure and execution times of different versions are
identically distributed random variables.
In many cases, the information about version reliability and execution time is available
from separate testing and/or reliability prediction models [7]. This information can be
incorporated into a fault-tolerant program model in order to obtain an evaluation of its
reliability and performance. The reliability model of NVP with versions having different
reliability has been considered in [8]. However, in this study, the system performance
evaluation problem has not been addressed and a general algorithm for evaluating NVP
reliability for arbitrary N and M has not been suggested.
Since the performance of fault-tolerant programs depends on availability of
computational resources, the impact of hardware availability should be taken into account
when the system availability is evaluated. In [9] several simple configurations of hardwaresoftware systems with N3 and number of hardware units not greater than 3 have been studied
also without considering system performance.
This paper presents an algorithm for finding the reliability and performance measures for
arbitrary fault-tolerant hardware-software systems with given hardware structure and
availability of hardware units and given parameters of software versions. The novelty of the
presented algorithm lies in its ability to take into account both hardware and software
reliability for arbitrary number of software versions and hardware units and to evaluate both
system reliability and performance measures.
The algorithm does not take into account the common cause failures which leads to
overestimation of system reliability. However even such optimistic estimates can be used for
5
comparison of different system architectures and for optimization of software system
structure as it has been done in [8, 10-12].
The probabilities of common cause failures can be evaluated separately (elicited from
experimental study) and added to system unreliability in order to obtain more accurate
reliability estimates.
2. Model
According to the model presented in [8] the software system consists of C components.
Each component performs a subtask and the sequential execution of the components performs
a major task. Such series architecture can be found in many applications where the output of a
component is fed to the next component as its input. An example of such architecture is a
speech recognition system presented in [9].
Performance of different applications and
services in the grid systems can also be represented using the series architecture [9].
It is assumed that Nc functionally equivalent versions are available for each component c.
Each version i has an estimated reliability rci and constant execution time ci. Failures of
versions for each component are statistically independent as well as the total failures of the
different components.
The software versions in each component c run on parallel hardware units. The total
number of units is Hc. The units are s-independent and identical. The availability of each unit
is Ac. The number of units available at the moment hc determines the amount of available
computational resources and, therefore, the number of versions that can be executed
simultaneously Lc(hc). No hardware unit can change its state during the software execution.
The versions of each component c start their execution in accordance with a
predetermined ordered list. Lc first versions from the list start their execution simultaneously
(at time 0). If the number of terminated versions is less than Mc, after termination of each
6
version a new version from the list starts its execution immediately. If the number of
terminated versions is not less than Mc, after termination of each version the voter compares
the outputs. If Mc outputs are identical, the component terminates its execution (terminating
all the versions that are still executed), otherwise a new version from the list is executed
immediately.
If after termination of Nc versions the number of identical outputs is less than Mc the
component and the entire system fail.
In the case of component success, the time of the entire component execution is equal to
the termination time of the version that has produced the Mc-th correct output (in most cases
the time needed by the voter to make the decision can be neglected). It can be seen that the
component execution time is a random variable depending on the reliability and the execution
time of the component versions and on the availability of the hardware units.
The examples of time diagrams (corresponding to components with Nc=5, Mc=3 and
Nc=3, Mc=2) for a given sequence of versions execution (the versions are numbered according
to this sequence) and different values of Lc are presented in Fig. 1.
The sum of the random execution times of each component gives the random task
execution time for the entire system. In order to estimate both the system's reliability and its
performance, different measures can be used depending on the application.
In applications where the execution time of each task is of critical importance, the system
reliability R(T*) is defined (according to performability concept [4, 13]) as a probability that
the correct output is produced in time less than T*.
In applications where the average system productivity (the number of executed tasks)
over a fixed mission time is of interest [14], the system reliability is defined as the probability
that it produces correct outputs without respect to the total execution time, (this index can be
referred to as R()) while the conditional expected system execution time E (given the system
7
produces correct output) is considered to be a measure of its performance. This index
determines the system expected execution time given that the system does not fail.
3. Evaluating system reliability and performance indices
3.1. Number of versions that can be executed simultaneously
The number of available hardware units in component c can vary from 0 to Hc. Given all
of the units are identical and have availability Ac, one can easily obtain probabilities Pr{hc=x}
for 1xHc :
H 
Qc(x)=Pr{hc=x}=  c  Ac x (1  Ac ) H c  x .
 x 
(1)
The number of available hardware units x determines the number of versions that can be
executed simultaneously: lc(x). Therefore
Pr{Lc=lc(x)}=Qc(x).
(2)
The pairs Qc(x), lc(x) for 1xHc determine the pmf of the discrete r.v. Lc.
3.2. Version termination times
In each component c a sequence in which the versions start their execution is defined by
the numbers of the versions. This means that each version i starts execution not earlier than
versions 1,…, i-1 and not latter than versions i+1,…,Nc. Having the execution time of each
version ci (1iNc) and the number of versions that can run simultaneously lc, one can obtain
termination time tci(lc) for each version using the following simple algorithm:
1. Assign 1=…=lc=0;
2. For i=1,…,Nc repeat:
2.1. Find any k (1klc): k=min{ 1, …,  l c };
2.2. Obtain tci(lc)=k+ci and assign k=tci(lc).
8
Times tci(lc), 1iNc correspond to intervals between the beginning of component
execution and the moment when the versions produce their outputs. Observe that the versions
that start execution earlier can terminate later: j<w does not guarantee that tcj(lc)tcw(lc). In
order to obtain the sequence, in which the versions produce their outputs, the termination
times should be sorted in increasing order tck (1) (lc )  tck (2) (lc ) … tck ( N c ) (lc ) , which gives
the order of versions k(1), k(2),…, k(Nc), corresponding to times of their termination.
The list k(1), k(2),…, k(Nc) determines the sequence of version outputs in which they
arrive to the voter. Now one can consider the component c as a system in which the Nc
versions are executed consecutively according to the order k(1), k(2),…, k(Nc) and produce
their outputs at times tck (1) (lc ) , tck (2) (lc ) ,…, tck ( N c ) (lc ) .
3.3. Reliability and performance of components and entire system
Consider the probability that m out of n first versions of component c succeed. This
probability can be obtained as:
n  m 1
n
 (1  rck (i ) )[ 
i1 1
i 1
rck (i1 )
nm 2
rck (i2 )

rck (im )
n

...
(1  rck (i1 ) ) i2  i1 1 (1  rck (i2 ) ) im  im1 1 (1  rck (im ) )
]
(3)
The component c produces the correct output directly after the end of the execution of j
versions (Mc jNc) iff k(j)-th version succeeds and exactly Mc-1 out of the first executed j-1
versions succeed.
The probability of such event pcj(lc) is:
j 1
pcj (lc )  rck ( j )  (1  rck (i ) )[
i 1
j  M c 1

i1 1
j M c 2
rck (i1 )

rck (i2 )
j 1
...

rck (iMc1 )
(1  rck (i1 ) ) i2 i1 1 (1  rck (i2 ) ) iM 1 iM 2 1 (1  rck (iM 1 ) )
c
c
c
(4)
Observe that pcj(lc) is the conditional probability that the component execution time is
tck ( j ) (lc ) given lc versions are executed simultaneously:
pcj(lc)=Pr{Tc= tck ( j ) (lc ) | Lc=lc}.
9
(5)
]
Having pmf of r.v. Lc we can now obtain for 1xHc:
Pr{Tc= tck ( j ) (lc ( x)) }=Pr{Tc= tck ( j ) (lc ( x)) | Lc=lc(x)}Pr{Lc=lc(x)}= pcj(lc(x))Qc(x).
(6)
The pairs tck ( j ) (lc ( x)) , pcj(lc(x))Qc(x), obtained for 1xHc and McjNc determine pmf of
r.v. Tc.
Since the events of successful component execution termination for different j and x are
mutually exclusive, we can express the probability of component c success as
Hc
Nc
Rc()=Pr{Tc<}= [Qc ( x)  pcj (lc ( x))] .
x 1
(7)
jM c
From the pmf of execution times Tc for each component c one can obtain pmf of
execution time of the entire system, which is equal to the sum of the execution times of
components:
C
T=  Tc .
(8)
c 1
Having the pmf of T in the form qx, tx (1xX) one can evaluate the system reliability
defined as the probability that it produces the correct output in less time than a specified value
T*:
X
R(T *)   q x 1(t x  T *) .
(9)
x 1
The probability that the correct output is obtained without respect to the execution time is
defined as
X
R ( )   q x .
(10)
x 1
The conditional expected system execution time (the expected system execution time on
condition that the system does not fail) takes the form
10
E ( ) 
1 X
 t xqx .
R() x 1
(11)
In addition one can obtain the conditional expected system execution time E(T*) (the
expected system execution time on condition that the system produces the correct output in
less time than a specified value T*) in the form
E (T *) 
X
1
 t x q x 1(t x  T *) .
R(T *) x 1
(12)
3.4. Algorithm for determining the execution time distribution for the system component
In order to obtain the execution time distribution for a component c and a given lc in the
form pcj(lc), tck ( j ) (lc ) (McjNc) one can determine the realizations tck ( j ) (lc ) of the
execution time Tc(lc) using the algorithm presented in section 3.2 and the corresponding
probabilities pcj(lc) using Eq. (4).
However, the probabilities pcj(lc) can be obtained in a much simpler way using a
procedure based on the moment generating function (MGF) technique.
The MGF of a discrete random variable Y is defined as a polynomial
K
u ( z )   ak z y k ,
(13)
k 1
where the variable Y has K possible values and ak is the probability that Y is equal to yk.
Let the random binary variable sck (i ) be an indicator of the success of version k(i) in
component c such that sck (i ) =1 if the version produces the correct output and sck (i ) =0 if it
produces the wrong output. The pmf of sck (i ) can be represented by the MGF
uck (i ) ( z )  rck (i ) z1  (1  rck (i ) ) z 0
It can be easily seen that the product of polynomials
11
(14)
j
j
i 1
i 1
U cj ( z , lc )   uck (i ) ( z )   [rck (i ) z1  (1  rck (i ) ) z 0 ]
(15)
represents the pmf of the number of correct outputs in component c after the execution of a
group of first j versions (note that the order of elements k(1), k(2),…, k(Nc) and, therefore,
Ucj(z,lc) depends on lc). Indeed the resulting polynomial relates the probabilities of
combinations of correct and wrong outputs (the product of corresponding probabilities) with
the number of correct outputs in these combinations (the sum of success indicators). Observe
that after collecting the like terms (corresponding to obtaining the overall probability of a
different combination with the same number of correct outputs) Ucj(z,lc) takes the form
j
U cj ( z , lc )    jm z m ,
(16)
m0
where jm is the probability that the group of first j versions produces m correct outputs.
Note that Ucj(z,lc) can be obtained by using the recurrent expression
U cj ( z, lc )  U c j 1 ( z )[rck ( j ) z1  (1  rck ( j ) ) z 0 ] .
(17)
According to its definition, pcj(lc) is the probability that the group of first j versions
produces Mc correct outputs given the group of first j-1 versions has produced Mc-1 correct
outputs. The coefficient jMc in polynomial Ucj(z,lc) is equal to the unconditional probability
that the group of first j versions produces Mc correct outputs.
In order to let the coefficient jMc in polynomial Ucj(z,lc) be equal to pcj(lc), the term with
the exponent equal to Mc should be removed from Uc
j-1(z,lc)
before applying Eq. (17)
(excluding the combination in which j first versions produce Mc correct outputs while the kj-th
version fails).
The above considerations lie at the base of the following algorithm for determining all of
the probabilities pcj(lc) (McjNc):
12
1.
For the given lc determine the order of version termination k(1), k(2),…, k(Nc) using the
algorithm from section 3.2.
2.
Determine the MGF of each version of component c according to Eq. (14);
3.
Define Uc0(z,lc) =1;
4.
For j=1,2,…,Nc:
3.1 Obtain Ucj(z,lc) using Eq. (17) and after collecting like terms represent it in the
form (16);
3.2. If jMc assign: pcj(lc)=jMc and remove term jMc zMc from Ucj(z,lc).
This algorithm is much simpler than using Eq. (4). Since after collecting like terms each
MGF Ucj(z,lc) can have no more than Mc terms, the computational complexity of the algorithm
(number of term multiplications in the procedure (17)) is not greater than 2(Nc-1)Mc.
3.5. Execution time distribution for the entire system
Having the pairs pcj(lc(x)), tck ( j ) (lc ( x)) for each possible realization lc(x) of Lc (1xHc)
and probabilities Pr{Lc=lc(x)}=Qc(x) one can obtain pmf of random execution times Tc for
each component by applying Eq. (6). If the conditional pmf pcj(lc(x)), tck ( j ) (lc ( x)) are
represented by the following MGF:
Nc
vc ( z, lc ( x))  
j M c
pcj (lc ( x))z
t ck ( j ) (l c ( x ))
(18)
the MGF representation of r.v. Tc takes the form:
Hc
Vc ( z )   Qc ( x)vc ( z, lc ( x)).
(19)
x 1
The random system execution time T is equal to sum of execution times of all of C
components. Therefore the MGF V(z) representing the pmf of T can be obtained as a product
of MGF Vc(z) representing pmf of Tc for 1cC:
13
C
C Hc
c 1
c 1 x 1
V(z)= Vc ( z )   (  Qc ( x)vc ( z, lc ( x))
(20)
V(Z) defines the pmf of the random execution time of the entire system by relating the
probabilities of combinations of different realizations of component execution times with the
total system execution time for these combinations.
After collecting the like terms (corresponding to different combinations resulting in the
same total system execution time), the MGF V(z) takes the form
X
V ( z)   qx z t x ,
(21)
x 1
where tx are possible realizations of the random variable T, qx are the probabilities of these
realizations and X is the total number of different realizations.
After obtaining the pmf of T in the form (21), representing the pairs qx, tx for 1xX one
can determine the system reliability and performance indices using Eq. (9)-(12).
3.6. Different components executed on the same hardware
Now consider a case when all of the software components are consecutively executed on
the same hardware consisting of H parallel identical modules with availability A. The number
of available parallel hardware modules h is a r.v. with pmf defined by Eq. (1).
When h=x the number of versions that can be executed simultaneously in each
component c is lc(x). The MGF that represent pmf of corresponding component execution
times are vc ( z, lc ( x)) defined by Eq. (18). The MGF that represents the conditional pmf of the
system execution time (given the number of available hardware modules is x) can be obtained
for any x (1xH) as a product of corresponding component MGF:
C
 vc ( z, lc ( x)) .
c 1
14
(22)
Having the pmf of r.v. h we obtain the MGF representing the pmf of r.v. T as
H
C
x 1
c 1
V ( z )   Qc ( x)  vc ( z, lc ( x)) .
4.
(23)
Examples
4.1. Analytical example
Consider a system consisting of two components. First component consists of H1=2
hardware units with availability A1=0.9 on which N1=5 software versions with M1=3 are
executed. Second component consists of H2=3 hardware units with availability A2=0.8 on
which and N2=3 software versions with M2=2 are executed. The parameters of versions rci and
ci are presented in Table 1.
One software version can be executed on each hardware unit: lc(hc)=hc.
The terminations times obtained for different possible values of L1 and L2 using the
algorithm described in section 3.2 are also presented in this table. The version execution
diagrams for different values of L1 and L2 are presented in Fig. 1.
According to the given parameters, define the MGF of software versions:
For component 1:
u11(z)=0.3z0+0.7z1; u12(z)=0.4z0+0.6z1; u13(z)=0.2z0+0.8z1;
u14(z)=0.4z0+0.6z1; u15(z)=0.1z0+0.9z1;
For component 2:
u21(z)=0.2z0+0.8z1; u22(z)=0.2z0+0.8z1; u23(z)=0.3z0+0.7z1.
The order of version termination in the first component is 1, 2, 3, 4, 5 for both L1=1 and
L1=2 (see Table 1).
According to the algorithm presented in section 3.4, determine the MGF for the groups of
versions and corresponding probabilities p1j(1) and p1j(2).
15
U10(z,1)=U10(z,2)=1; U11(z,1)=U11(z,2)=u11(z)=0.3z0+0.7z1;
U12(z,1)=U12(z,2)=u11(z)u12(z)=0.12z0+0.46z1+0.42z2;
U13(z,1)=U13(z,2)=U12(z,1)u13(z)=0.024z0+0.188z1+0.452z2+0.336z3;
Remove the term 0.336z3 from U13(z,1) and U13(z,2) and obtain p13(1)=p13(2)=0.336.
U14(z,1)=U14(z,2)=U13(z,1)u14(z)=0.0096z0+0.0896z1+0.2936z2+0.2712z3;
Remove the term 0.2712z3 from U14(z,1) and U14(z,2) and obtain p14(1)=p14(2)=0.2712;
U15(z,1)=U15(z,2)=U14(z,1)u15(z)=0.00096z0+0.0176z1+0.11z2+0.26424z3.
Finally, obtain p15(1)=p15(2)=0.26424.
Having the probabilities p1j(1), p1j(2) and the corresponding termination times t1j(1),
t1j(2), define the MGF representing the component execution time distributions
v1(z,1)=0.336z28+0.271z46+0.264z60, v1(z,2)=0.336z16+0.535z30,
The pmf of L1 is
Pr{L1=1}=Pr{h1=1}=2A1(1- A1)=0.18, Pr{L1=2}=Pr{h1=2}=A12=0.81;
According to Eq. (19) we obtain:
V1(z)= 0.18v1(z,1)+0.81v1(z,2)=0.06z28+0.049z46+0.048z60+0.272z16+0.434z30.
The order of version termination in the second component is 1, 2, 3 for both L2=1 and
L2=2 and 1, 3, 2 for L2=3 (see Table 1).
According to the algorithm presented in section 3.4, determine the MGF for the groups of
versions and corresponding probabilities p2j(1), p2j(2) and p2j(3).
For L2=1 and L2=2:
U20(z,1)=U20(z,2)=1; U21(z,1)=U21(z,2)=u21(z)=0.2z0+0.8z1;
U22(z,1)= U22(z,2)=u21(z)u22(z)=0.04z0+0.32z1+0.64z2;
Remove the term 0.64z2 from U22(z,1) and U22(z,2) and obtain p22(1)=p22(2)=0.64;
U23(z,1)=U23(z,2)=U22(z,1)u23(z)=0.012z0+0.124z1+0.224z2;
Obtain p23(1)=p23(2)=0.224.
16
Having the probabilities p2j(1), p2j(2) and the corresponding termination times t2j(1),
t2j(2), define the MGF representing the component execution time distributions
v2(z,1)=0.64z26+0.224z38, v2(z,2)=0.64z16+0.224z22.
U20(z,3)=1; U21(z,3)=u21(z)=0.2z0+0.8z1;
For L2=3:
U22(z,3)=u21(z)u23(z)=0.06z0+0.38z1+0.56z2;
Remove the term 0.56z2 from U22(z) and obtain p22(3)=0.56;
U23(z,3)=U22(z,3)u22(z)=0.012z0+0.124z1+0.304z2;
Obtain p23(3)=0.304.
The MGF representing the corresponding component execution time distribution takes
the form v2(z,3)=0.56z12+0.304z16. The distribution of L2 is Pr{L2=1}=Pr{h2=1}
=3A2(1A2)2=0.096, Pr{L2=2}=Pr{h2=2}=3A22(1-A2)=0.384, Pr{L2=3}=Pr{h2=3}=A23=0.512;
According to Eq. (19) we obtain:
V2(z)=0.096v2(z,1)+0.384v2(z,2)+0.512v2(z,3) =
0.287z12+0.401z16+0.086z22+0.0614z26+0.0215z38.
The MGF representing the execution time distribution for the entire system takes the form:
V(z)=V1(z)V2(z)=
0.078z28+0.109z32+0.023z38+0.017z40+0.141z42+0.024z44+0.174z46+
0.005z50+0.037z52+0.007z54+0.027z56+0.014z58+0.020z62+0.001z66+
0.012z68+0.020z72+0.019z76+0.004z82+0.001z84+0.003z86+0.001z98.
Having this MGF, one can obtain the system reliability and conditional expected execution
time for different time constraints T* using Eq. (9) - (12): R()=0.739; E()=44.559.
For T*=50: R(50)=0.567; E(50)=22.24/0.567=39.24
4.2. Numerical example
A. System with different software components
17
Consider a system consisting of five components. The number of parallel software units,
availability of these units and the parameters Nc and Mc of the fault-tolerant programs for each
component are presented in Table 2. The components 1, 3 and 5 are of NVP type, the
components 2 and 4 are of RBS type (M2=M4=1). The reliability and execution times of
software versions are presented in Table 3.
The entire system execution time varies in the range 81≤T≤241. The system reliability
and expected execution time (without respect to execution time constraint) obtained by the
suggested algorithm are respectively R()=0.77 and E()=96.64. The indices R(T*) and
E(T*) are presented in Fig. 2 as functions of maximum allowable system execution time T*.
In order to estimate the effect of fault-tolerant software architecture on the entire system
reliability we assume that each component consists of single version (most reliable version
from Table 3). Having the execution times of these most reliable versions we obtain the
system execution time as sum of versions' execution times: T=130. The system reliability is a
product of reliabilities of each component. The reliability of each component c is equal to
version reliability multiplied by the probability that at least lc(1) hardware units out of Hc are
available. For the given system parameters the single version system reliability is
R(130)=0.7699 . The fault tolerant system with structure of the software components
presented in Table 2 provides for the same execution time T*=130 reliability R(130)=0.75. It
can be seen that in the considered case adding less reliable versions to the most reliable ones
does not improve system reliability defined as the probability of obtaining the correct output.
Even for T=241 the multi-version system does not provide much greater reliability than the
single version one. However in the single version system the wrong outputs in components 1,
3 and 5 cannot be detected since these components have no ATB. The probability of
undetected wrong output in the single version system is 0.0972 while in the multi-version
system it is negligible small.
18
B.
System with single software component
Now consider the same fault-tolerant software system running on a single hardware block
consisting of three parallel units with availability A=0.9. The functions lc(x) for each software
component are presented in Table 4. Note that the system can operate only when h2 since
single hardware unit has not enough resources for execution of the second software
component.
The entire system execution time varies in the range 81≤T≤195. The system reliability
and expected execution time (without respect to execution time constraint) are respectively
R()=0.917 and E()=103.18. The indices R(T*) and E(T*) are presented in Fig. 3 as
functions of maximum allowable system execution time T*.
If each software component consists of single version (most reliable version from Table
3), the system execution time is T=130. The system reliability obtained as the product of
reliabilities of the versions multiplied by probability that at least two out of three hardware
units are available is R=0.767. The fault tolerant system with structure of the software
components presented in Table 2 provides for the same execution time T*=130 greater
reliability R(130)=0.804 .
5.
Summary and further work
The considered model considers fault-tolerant systems with series architecture and arbitrary
number of hardware units and software versions. The presented algorithm is aimed at
evaluating system reliability and performance indices that can be used for comparison of
different system configurations and for solving system structure optimization problems.
The model and the algorithm in their present form have the following limitations:
19
1. The common cause failures are not taken into account. This leads to overestimation of
system reliability.
Further research should be devoted to incorporating common cause
failures into the system reliability and performance model. This can be done based on works
[15-18].
2. The version execution times are assumed to be constant. Such assumption is justified
when the subtasks are small and versions have no logical branching or when the possible time
variation of each version is much smaller than difference in execution times for different
versions. In further work the version execution time distributions should be incorporated into
the model. The universal generating function technique [19] can be used for analysis of
versions with variable execution times.
3. The structure of hardware is assumed to be homogenous. The components of real systems
can consist of different hardware units with different reliability and processing speed. The
model should be further extended to analyze systems with components consisting of an
arbitrary set of hardware units.
4. The system is assumed to consist of components connected in series. In many real systems
the components have more complex interaction than sequential task execution. The possible
ways of the model extension to more complex structures should be studied.
5. It is assumed that the system performs a single task. When the same system is aimed at
running different applications (providing different services) different software modules can be
used for different applications or the same software versions can perform differently while
solving different subtasks. The extension of the model to multitask applications can be done
using ideas presented in [9].
6. References
[1] Teng X., Pham H., Software fault tolerance, in Reliability Engineering Handbook, H. Pham (Ed.), Springer,
2003, 585-611
20
[2] Chen L., Avizienis A., N-version programming: a fault tolerance approach to the reliable software, Proc. 8t h
Int. Sym. Foult-Tolerant Computing, Toulouse, France; 1978, p. 3-9.
[3] Randell B., System structure for software fault tolerance, IEEE Trans. Software Eng., 1(2), 1975, p. 220-232.
[4] Tai A., Meyer J., Avizienis A., Performability enhancement of fault-tolerant software, IEEE Transactions on
Reliability, 42 (2), 1993, p. 227-237.
[5] Goseva-Popstojanova K., Grnarov A., Performability Modeling of N Version Programming Technique, Proc.
6th IEEE International Symposium on Software Reliability Engineering (ISSRE'95), Toulouse, France,
Nov.1995.
[6] K.Goseva-Popstojanova, A.Grnarov, Performability and Reliability Modeling of N Version Fault Tolerant
Software in Real-Time Systems, Proc. Euromicro'97 Conference, Budapest, Hungary, Sep. 1997.
[7] Belli, F., Jedrzejowicz, P., Fault-tolerant programs and their reliability, IEEE Transactions on Reliability, 39
(2), 1990 , 184 –192.
[8] Ashrafi, N., Berman, O., Cutler, M., Optimal design of large software-systems using N-version
programming, IEEE Transactions on Reliability, Vol. 43 (2), 1994, 344 -350.
[9] Wattanapongsakorn, N., Levitan. S., Reliability optimization models for embedded systems with multiple
applications, IEEE Transactions on Reliability, Vol. 53 (3), 2004, 406 -416.
[10] Tsujimura, Y., Yamamoto, H., Takeno, T. and Yamazaki, G., Evolutionary design of N-version fault
tolerant software system, Proceedings of the 33rd International Conference on Computers and Industrial
Engineering (CD-ROM), Jeju, Korea, 2004
[11] Tsujimura, Y., Yamamoto, H., Takeno, T. and Yamazaki, G., A genetic algorithm for designing N-version
program, Proceedings of the Fifth Asia-Pacific Industrial Engineering and Management Systems Conference ,
Gold Coast, Australia, 2004.
[12] Levitin G., Optimal version sequencing in fault-tolerant programs, Asia-Pacific Journal of Operational
Research, 22(1), 2005 pp. 1-18.
[13] Meyer, J.F. On evaluating the performability of degradable computing systems, IEEE Transactions on
Computers, Vol. 29, 1980, 720-731.
[14] Grassi V., Donatiello L., Iazeolla G., Performability evaluation of multicomponent fault-tolerant systems,
IEEE Transactions on Reliability, 37 (2), 1988, p. 216-222.
[15] Nicola V.F., Goyal A. Modeling of correlated failures and community error recovery in multiversion
software. IEEE Trans Software Eng, vol. 16, 1990, 350-359.
21
[16] Scott R.K., Gault J.W., McAllister D. F. Fault-tolerant reliability modelling. IEEE Trans Software Eng, vol.
13(5), 1987, 582-592.
[17] Eckhardt D. E. and Lee L. D., A Theoretical Basis of Multiversion Software Subject to Coincident Errors,
IEEE Trans. on Software Engineering, vol. 11, 1985, 1511-1517.
[18] Littlewood B.and Miller D. R., Conceptual Modelling of Coincident Failures in Multi-Version Software,
IEEE Trans on Software Engineering, vol. 15, 1989, 1596-1614.
[19] A. Lisnianski, G. Levitin, Multi-state system reliability, World Scientific, 2003.
22
Table 1. Parameters of system components for the analytical example.
component
c=1
c=2
version
1
2
3
4
5
1
2
3
rci
0.7
0.6
0.8
0.6
0.9
0.8
0.8
0.7
ci
6
12
10
18
14
10
16
12
tci(1)
6
18
28
46
60
10
26
38
tci(2)
6
12
16
30
30
10
16
22
tci(3)
-
-
-
-
-
10
16
12
Table 2. Parameters of system components for the numerical example.
Component
1
2
3
4
5
Hc
3
4
2
3
2
Ac
0.80
0.95
0.90
0.85
0.95
Nc
5
2
3
3
5
Mc
3
1
2
1
3
lc(1)
2
0
1
1
3
lc(2)
4
1
3
2
5
lc(3)
5
1
-
3
-
lc(4)
-
2
-
-
-
23
Table 3. Parameters of software versions for the numerical example.
c=1
c=2
c=3
c=4
c=5
version
1
2
3
4
5
r1i
0.86
0.77
0.98
0.93
0.91
1i
10
12
25
22
15
r2i
0.85
0.92
-
-
-
2i
30
45
-
-
-
r3i
0.87
0.94
0.98
-
-
3i
6
6
10
-
-
r4i
0.95
0.85
0.85
-
-
4i
25
15
20
-
-
r5i
0.68
0.79
0.90
0.90
0.94
5i
10
10
15
20
25
Table 4. Number of parallel software versions executed simultaneously as function of number
of available hardware units.
Component
1
2
3
4
5
lc(1)
2
0
3
1
1
lc(2)
4
1
3
2
2
lc(3)
5
1
3
3
3
24
L1=1
6
12
10
L1=2
18
14
1
1
2
3
4
3
5
5
2
28
46
60
4
16
30
Component 1
L2=1
L2=2
L2=3
1
10
16
12
1
1
2
2
3
3
3
2
26
38
16
22
12 16
Component 2
Figure 1. Time diagrams for software components with different number of versions executed
simultaneously.
25
100
0.8
95
0.6
90
0.4
85
0.2
80
0
80
100
120
140
160
180
200
220
R(T*)
E(T*)
Fig. 2
240
T*
E(T*)
R(T*)
Figure 2. R(T*) and E(T*) functions for fault-tolerant system running on different hardware
components.
110
1
105
0.8
E(T*)
100
0.6
95
0.4
90
R(T*)
Fig. 3
0.2
85
80
0
80
100
120
140
160
180
200
T*
E(T*)
R(T*)
Figure 3. R(T*) and E(T*) functions for fault-tolerant system running on a single hardware
component.
26
Download