Reliability and Performance Analysis of Hardware-Software Systems with Fault-Tolerant Software Components Gregory Levitin Reliability Department, Planning, Development and Technology Division, Israel Electric Corporation Ltd., P.O. Box 10, Haifa, 31000 Israel E-mail: levitin@iec.co.il Abstract This paper presents an algorithm for evaluating reliability and expected execution time for systems consisting of fault-tolerant software components running on several hardware units. The components are built from functionally equivalent but independently developed versions characterized by different reliability and execution time. Different number of versions can be executed simultaneously depending on the number of available units. The system reliability is defined as the probability that the system produces a correct output in a specified time. Keywords: Hardware-software systems; fault-tolerant programming; performance; reliability; expected execution time. Acronyms1 ATB acceptance test block NVP N-version programming RBS recovery block scheme MGF moment generating function r.v. 1 random variable The singular and plural of an acronym are always spelled the same. 1 pmf probability mass function Notation Pr{e} probability of event e 1(x) unity function: 1(TRUE)=1, 1(FALSE)=0 C number of components in the software system Hc number of parallel hardware units in component c hc random number of available hardware units in component c Qc(x) Pr{hc=x} Ac availability of hardware units in component c Nc number of software versions in component c Mc number of correct outputs needed for component c to succeed Lc random number of software versions in component c that can be executed simultaneously lc(x) realization of Lc corresponding to hc=x rci reliability of i-th version in component c ci execution time of i-th software version in component c tci(lc) termination time of i-th software version in component c when lc versions are executed simultaneously pcj(lc) probability that system component c produces correct output after termination of j first versions in component c when lc versions are executed simultaneously k(i) number of version that produces i-th output in the component sck (i ) random success indicator of version that produces i-th output in component c Tc random execution time for system component c T random execution time for the entire system tx x-th realization of T 2 qx Pr{T=tx} X number of different realizations of T T* upper bound for system execution time R(T*) system reliability (probability that the correct output is produced in time less than T*) E expected system execution time uck (i ) ( z ) MGF representing pmf of success indicator scki U cj ( z, lc ) MGF representing the pmf of the number of correct outputs in component c after the execution of a group of first j versions when lc versions are executed simultaneously vc(z,lc) MGF representing conditional pmf of r.v. Tc when lc versions are executed simultaneously Vc(z) MGF representing pmf of r.v. Tc V(z) MGF representing pmf of r.v. T 1. Introduction Software failures are caused by errors made in various phases of program development. When the software reliability is of critical importance, special programming techniques are used in order to achieve its fault tolerance. Two of the best-known fault-tolerant software design methods are N-version programming (NVP) and recovery block scheme (RBS) [1]. Both methods are based on the redundancy of software modules (functionally equivalent but independently developed) and the assumption that coincident failures of modules are rare. The fault tolerance usually requires additional resources and results in performance penalties (particularly with regard to computation time), which constitutes a tradeoff between software performance and reliability. 3 NVP was proposed by Chen and Avizienis [2]. This approach presumes the execution of N functionally equivalent software modules (called versions) that receive the same input and send their outputs to a voter, which is aimed at determining the system output. The voter produces an output if at least M out of N outputs agree (it is presumed that the probability that M wrong outputs agree is negligibly small). Otherwise, the system fails. Usually majority voting is used in which N is odd and M=(N+1)/2. In some applications, the available computational resources do not allow all of the versions to be executed simultaneously. In these cases, the versions are executed according to some predefined sequence and the program execution terminates either when M versions produce the same output (success) or when after the execution of all the N versions the number of equivalent outputs is less than M (failure). The entire program execution time is a random variable depending on the parameters of the versions and on the number of versions that can be executed simultaneously. RBS was proposed by Randell [3]. In this approach after execution of each version, its output is tested by an acceptance test block (ATB). If the ATB accepts the version output, the process is terminated and the version output becomes the output of the entire system. If all N versions do not produce the accepted output, the system fails. If the computational resources allow simultaneous execution of several versions, the versions are executed according to some predefined sequence and the entire program terminates either when one of versions produces the output accepted by the ATB (success) or when after the execution of all the N versions no output is accepted by the ATB (failure). If the acceptance test time is included into the execution time of each version, the RBS performance model becomes identical to the performance model of the NVP with M=1. Estimating the effect of the fault-tolerant programming on system performance is especially important in safety critical real-time computer applications. This effect has been 4 studied by Tai et al. in [4] and by Goseva-Popstojanova and Grnarov in [5, 6]. While in [4] a basic realization of NVP (N=3, M=2) consisting of versions with identical fault probabilities and different execution times has been considered, in [5] and [6] NVP with arbitrary N has been studied in which both times to failure and execution times of different versions are identically distributed random variables. In many cases, the information about version reliability and execution time is available from separate testing and/or reliability prediction models [7]. This information can be incorporated into a fault-tolerant program model in order to obtain an evaluation of its reliability and performance. The reliability model of NVP with versions having different reliability has been considered in [8]. However, in this study, the system performance evaluation problem has not been addressed and a general algorithm for evaluating NVP reliability for arbitrary N and M has not been suggested. Since the performance of fault-tolerant programs depends on availability of computational resources, the impact of hardware availability should be taken into account when the system availability is evaluated. In [9] several simple configurations of hardwaresoftware systems with N3 and number of hardware units not greater than 3 have been studied also without considering system performance. This paper presents an algorithm for finding the reliability and performance measures for arbitrary fault-tolerant hardware-software systems with given hardware structure and availability of hardware units and given parameters of software versions. The novelty of the presented algorithm lies in its ability to take into account both hardware and software reliability for arbitrary number of software versions and hardware units and to evaluate both system reliability and performance measures. The algorithm does not take into account the common cause failures which leads to overestimation of system reliability. However even such optimistic estimates can be used for 5 comparison of different system architectures and for optimization of software system structure as it has been done in [8, 10-12]. The probabilities of common cause failures can be evaluated separately (elicited from experimental study) and added to system unreliability in order to obtain more accurate reliability estimates. 2. Model According to the model presented in [8] the software system consists of C components. Each component performs a subtask and the sequential execution of the components performs a major task. Such series architecture can be found in many applications where the output of a component is fed to the next component as its input. An example of such architecture is a speech recognition system presented in [9]. Performance of different applications and services in the grid systems can also be represented using the series architecture [9]. It is assumed that Nc functionally equivalent versions are available for each component c. Each version i has an estimated reliability rci and constant execution time ci. Failures of versions for each component are statistically independent as well as the total failures of the different components. The software versions in each component c run on parallel hardware units. The total number of units is Hc. The units are s-independent and identical. The availability of each unit is Ac. The number of units available at the moment hc determines the amount of available computational resources and, therefore, the number of versions that can be executed simultaneously Lc(hc). No hardware unit can change its state during the software execution. The versions of each component c start their execution in accordance with a predetermined ordered list. Lc first versions from the list start their execution simultaneously (at time 0). If the number of terminated versions is less than Mc, after termination of each 6 version a new version from the list starts its execution immediately. If the number of terminated versions is not less than Mc, after termination of each version the voter compares the outputs. If Mc outputs are identical, the component terminates its execution (terminating all the versions that are still executed), otherwise a new version from the list is executed immediately. If after termination of Nc versions the number of identical outputs is less than Mc the component and the entire system fail. In the case of component success, the time of the entire component execution is equal to the termination time of the version that has produced the Mc-th correct output (in most cases the time needed by the voter to make the decision can be neglected). It can be seen that the component execution time is a random variable depending on the reliability and the execution time of the component versions and on the availability of the hardware units. The examples of time diagrams (corresponding to components with Nc=5, Mc=3 and Nc=3, Mc=2) for a given sequence of versions execution (the versions are numbered according to this sequence) and different values of Lc are presented in Fig. 1. The sum of the random execution times of each component gives the random task execution time for the entire system. In order to estimate both the system's reliability and its performance, different measures can be used depending on the application. In applications where the execution time of each task is of critical importance, the system reliability R(T*) is defined (according to performability concept [4, 13]) as a probability that the correct output is produced in time less than T*. In applications where the average system productivity (the number of executed tasks) over a fixed mission time is of interest [14], the system reliability is defined as the probability that it produces correct outputs without respect to the total execution time, (this index can be referred to as R()) while the conditional expected system execution time E (given the system 7 produces correct output) is considered to be a measure of its performance. This index determines the system expected execution time given that the system does not fail. 3. Evaluating system reliability and performance indices 3.1. Number of versions that can be executed simultaneously The number of available hardware units in component c can vary from 0 to Hc. Given all of the units are identical and have availability Ac, one can easily obtain probabilities Pr{hc=x} for 1xHc : H Qc(x)=Pr{hc=x}= c Ac x (1 Ac ) H c x . x (1) The number of available hardware units x determines the number of versions that can be executed simultaneously: lc(x). Therefore Pr{Lc=lc(x)}=Qc(x). (2) The pairs Qc(x), lc(x) for 1xHc determine the pmf of the discrete r.v. Lc. 3.2. Version termination times In each component c a sequence in which the versions start their execution is defined by the numbers of the versions. This means that each version i starts execution not earlier than versions 1,…, i-1 and not latter than versions i+1,…,Nc. Having the execution time of each version ci (1iNc) and the number of versions that can run simultaneously lc, one can obtain termination time tci(lc) for each version using the following simple algorithm: 1. Assign 1=…=lc=0; 2. For i=1,…,Nc repeat: 2.1. Find any k (1klc): k=min{ 1, …, l c }; 2.2. Obtain tci(lc)=k+ci and assign k=tci(lc). 8 Times tci(lc), 1iNc correspond to intervals between the beginning of component execution and the moment when the versions produce their outputs. Observe that the versions that start execution earlier can terminate later: j<w does not guarantee that tcj(lc)tcw(lc). In order to obtain the sequence, in which the versions produce their outputs, the termination times should be sorted in increasing order tck (1) (lc ) tck (2) (lc ) … tck ( N c ) (lc ) , which gives the order of versions k(1), k(2),…, k(Nc), corresponding to times of their termination. The list k(1), k(2),…, k(Nc) determines the sequence of version outputs in which they arrive to the voter. Now one can consider the component c as a system in which the Nc versions are executed consecutively according to the order k(1), k(2),…, k(Nc) and produce their outputs at times tck (1) (lc ) , tck (2) (lc ) ,…, tck ( N c ) (lc ) . 3.3. Reliability and performance of components and entire system Consider the probability that m out of n first versions of component c succeed. This probability can be obtained as: n m 1 n (1 rck (i ) )[ i1 1 i 1 rck (i1 ) nm 2 rck (i2 ) rck (im ) n ... (1 rck (i1 ) ) i2 i1 1 (1 rck (i2 ) ) im im1 1 (1 rck (im ) ) ] (3) The component c produces the correct output directly after the end of the execution of j versions (Mc jNc) iff k(j)-th version succeeds and exactly Mc-1 out of the first executed j-1 versions succeed. The probability of such event pcj(lc) is: j 1 pcj (lc ) rck ( j ) (1 rck (i ) )[ i 1 j M c 1 i1 1 j M c 2 rck (i1 ) rck (i2 ) j 1 ... rck (iMc1 ) (1 rck (i1 ) ) i2 i1 1 (1 rck (i2 ) ) iM 1 iM 2 1 (1 rck (iM 1 ) ) c c c (4) Observe that pcj(lc) is the conditional probability that the component execution time is tck ( j ) (lc ) given lc versions are executed simultaneously: pcj(lc)=Pr{Tc= tck ( j ) (lc ) | Lc=lc}. 9 (5) ] Having pmf of r.v. Lc we can now obtain for 1xHc: Pr{Tc= tck ( j ) (lc ( x)) }=Pr{Tc= tck ( j ) (lc ( x)) | Lc=lc(x)}Pr{Lc=lc(x)}= pcj(lc(x))Qc(x). (6) The pairs tck ( j ) (lc ( x)) , pcj(lc(x))Qc(x), obtained for 1xHc and McjNc determine pmf of r.v. Tc. Since the events of successful component execution termination for different j and x are mutually exclusive, we can express the probability of component c success as Hc Nc Rc()=Pr{Tc<}= [Qc ( x) pcj (lc ( x))] . x 1 (7) jM c From the pmf of execution times Tc for each component c one can obtain pmf of execution time of the entire system, which is equal to the sum of the execution times of components: C T= Tc . (8) c 1 Having the pmf of T in the form qx, tx (1xX) one can evaluate the system reliability defined as the probability that it produces the correct output in less time than a specified value T*: X R(T *) q x 1(t x T *) . (9) x 1 The probability that the correct output is obtained without respect to the execution time is defined as X R ( ) q x . (10) x 1 The conditional expected system execution time (the expected system execution time on condition that the system does not fail) takes the form 10 E ( ) 1 X t xqx . R() x 1 (11) In addition one can obtain the conditional expected system execution time E(T*) (the expected system execution time on condition that the system produces the correct output in less time than a specified value T*) in the form E (T *) X 1 t x q x 1(t x T *) . R(T *) x 1 (12) 3.4. Algorithm for determining the execution time distribution for the system component In order to obtain the execution time distribution for a component c and a given lc in the form pcj(lc), tck ( j ) (lc ) (McjNc) one can determine the realizations tck ( j ) (lc ) of the execution time Tc(lc) using the algorithm presented in section 3.2 and the corresponding probabilities pcj(lc) using Eq. (4). However, the probabilities pcj(lc) can be obtained in a much simpler way using a procedure based on the moment generating function (MGF) technique. The MGF of a discrete random variable Y is defined as a polynomial K u ( z ) ak z y k , (13) k 1 where the variable Y has K possible values and ak is the probability that Y is equal to yk. Let the random binary variable sck (i ) be an indicator of the success of version k(i) in component c such that sck (i ) =1 if the version produces the correct output and sck (i ) =0 if it produces the wrong output. The pmf of sck (i ) can be represented by the MGF uck (i ) ( z ) rck (i ) z1 (1 rck (i ) ) z 0 It can be easily seen that the product of polynomials 11 (14) j j i 1 i 1 U cj ( z , lc ) uck (i ) ( z ) [rck (i ) z1 (1 rck (i ) ) z 0 ] (15) represents the pmf of the number of correct outputs in component c after the execution of a group of first j versions (note that the order of elements k(1), k(2),…, k(Nc) and, therefore, Ucj(z,lc) depends on lc). Indeed the resulting polynomial relates the probabilities of combinations of correct and wrong outputs (the product of corresponding probabilities) with the number of correct outputs in these combinations (the sum of success indicators). Observe that after collecting the like terms (corresponding to obtaining the overall probability of a different combination with the same number of correct outputs) Ucj(z,lc) takes the form j U cj ( z , lc ) jm z m , (16) m0 where jm is the probability that the group of first j versions produces m correct outputs. Note that Ucj(z,lc) can be obtained by using the recurrent expression U cj ( z, lc ) U c j 1 ( z )[rck ( j ) z1 (1 rck ( j ) ) z 0 ] . (17) According to its definition, pcj(lc) is the probability that the group of first j versions produces Mc correct outputs given the group of first j-1 versions has produced Mc-1 correct outputs. The coefficient jMc in polynomial Ucj(z,lc) is equal to the unconditional probability that the group of first j versions produces Mc correct outputs. In order to let the coefficient jMc in polynomial Ucj(z,lc) be equal to pcj(lc), the term with the exponent equal to Mc should be removed from Uc j-1(z,lc) before applying Eq. (17) (excluding the combination in which j first versions produce Mc correct outputs while the kj-th version fails). The above considerations lie at the base of the following algorithm for determining all of the probabilities pcj(lc) (McjNc): 12 1. For the given lc determine the order of version termination k(1), k(2),…, k(Nc) using the algorithm from section 3.2. 2. Determine the MGF of each version of component c according to Eq. (14); 3. Define Uc0(z,lc) =1; 4. For j=1,2,…,Nc: 3.1 Obtain Ucj(z,lc) using Eq. (17) and after collecting like terms represent it in the form (16); 3.2. If jMc assign: pcj(lc)=jMc and remove term jMc zMc from Ucj(z,lc). This algorithm is much simpler than using Eq. (4). Since after collecting like terms each MGF Ucj(z,lc) can have no more than Mc terms, the computational complexity of the algorithm (number of term multiplications in the procedure (17)) is not greater than 2(Nc-1)Mc. 3.5. Execution time distribution for the entire system Having the pairs pcj(lc(x)), tck ( j ) (lc ( x)) for each possible realization lc(x) of Lc (1xHc) and probabilities Pr{Lc=lc(x)}=Qc(x) one can obtain pmf of random execution times Tc for each component by applying Eq. (6). If the conditional pmf pcj(lc(x)), tck ( j ) (lc ( x)) are represented by the following MGF: Nc vc ( z, lc ( x)) j M c pcj (lc ( x))z t ck ( j ) (l c ( x )) (18) the MGF representation of r.v. Tc takes the form: Hc Vc ( z ) Qc ( x)vc ( z, lc ( x)). (19) x 1 The random system execution time T is equal to sum of execution times of all of C components. Therefore the MGF V(z) representing the pmf of T can be obtained as a product of MGF Vc(z) representing pmf of Tc for 1cC: 13 C C Hc c 1 c 1 x 1 V(z)= Vc ( z ) ( Qc ( x)vc ( z, lc ( x)) (20) V(Z) defines the pmf of the random execution time of the entire system by relating the probabilities of combinations of different realizations of component execution times with the total system execution time for these combinations. After collecting the like terms (corresponding to different combinations resulting in the same total system execution time), the MGF V(z) takes the form X V ( z) qx z t x , (21) x 1 where tx are possible realizations of the random variable T, qx are the probabilities of these realizations and X is the total number of different realizations. After obtaining the pmf of T in the form (21), representing the pairs qx, tx for 1xX one can determine the system reliability and performance indices using Eq. (9)-(12). 3.6. Different components executed on the same hardware Now consider a case when all of the software components are consecutively executed on the same hardware consisting of H parallel identical modules with availability A. The number of available parallel hardware modules h is a r.v. with pmf defined by Eq. (1). When h=x the number of versions that can be executed simultaneously in each component c is lc(x). The MGF that represent pmf of corresponding component execution times are vc ( z, lc ( x)) defined by Eq. (18). The MGF that represents the conditional pmf of the system execution time (given the number of available hardware modules is x) can be obtained for any x (1xH) as a product of corresponding component MGF: C vc ( z, lc ( x)) . c 1 14 (22) Having the pmf of r.v. h we obtain the MGF representing the pmf of r.v. T as H C x 1 c 1 V ( z ) Qc ( x) vc ( z, lc ( x)) . 4. (23) Examples 4.1. Analytical example Consider a system consisting of two components. First component consists of H1=2 hardware units with availability A1=0.9 on which N1=5 software versions with M1=3 are executed. Second component consists of H2=3 hardware units with availability A2=0.8 on which and N2=3 software versions with M2=2 are executed. The parameters of versions rci and ci are presented in Table 1. One software version can be executed on each hardware unit: lc(hc)=hc. The terminations times obtained for different possible values of L1 and L2 using the algorithm described in section 3.2 are also presented in this table. The version execution diagrams for different values of L1 and L2 are presented in Fig. 1. According to the given parameters, define the MGF of software versions: For component 1: u11(z)=0.3z0+0.7z1; u12(z)=0.4z0+0.6z1; u13(z)=0.2z0+0.8z1; u14(z)=0.4z0+0.6z1; u15(z)=0.1z0+0.9z1; For component 2: u21(z)=0.2z0+0.8z1; u22(z)=0.2z0+0.8z1; u23(z)=0.3z0+0.7z1. The order of version termination in the first component is 1, 2, 3, 4, 5 for both L1=1 and L1=2 (see Table 1). According to the algorithm presented in section 3.4, determine the MGF for the groups of versions and corresponding probabilities p1j(1) and p1j(2). 15 U10(z,1)=U10(z,2)=1; U11(z,1)=U11(z,2)=u11(z)=0.3z0+0.7z1; U12(z,1)=U12(z,2)=u11(z)u12(z)=0.12z0+0.46z1+0.42z2; U13(z,1)=U13(z,2)=U12(z,1)u13(z)=0.024z0+0.188z1+0.452z2+0.336z3; Remove the term 0.336z3 from U13(z,1) and U13(z,2) and obtain p13(1)=p13(2)=0.336. U14(z,1)=U14(z,2)=U13(z,1)u14(z)=0.0096z0+0.0896z1+0.2936z2+0.2712z3; Remove the term 0.2712z3 from U14(z,1) and U14(z,2) and obtain p14(1)=p14(2)=0.2712; U15(z,1)=U15(z,2)=U14(z,1)u15(z)=0.00096z0+0.0176z1+0.11z2+0.26424z3. Finally, obtain p15(1)=p15(2)=0.26424. Having the probabilities p1j(1), p1j(2) and the corresponding termination times t1j(1), t1j(2), define the MGF representing the component execution time distributions v1(z,1)=0.336z28+0.271z46+0.264z60, v1(z,2)=0.336z16+0.535z30, The pmf of L1 is Pr{L1=1}=Pr{h1=1}=2A1(1- A1)=0.18, Pr{L1=2}=Pr{h1=2}=A12=0.81; According to Eq. (19) we obtain: V1(z)= 0.18v1(z,1)+0.81v1(z,2)=0.06z28+0.049z46+0.048z60+0.272z16+0.434z30. The order of version termination in the second component is 1, 2, 3 for both L2=1 and L2=2 and 1, 3, 2 for L2=3 (see Table 1). According to the algorithm presented in section 3.4, determine the MGF for the groups of versions and corresponding probabilities p2j(1), p2j(2) and p2j(3). For L2=1 and L2=2: U20(z,1)=U20(z,2)=1; U21(z,1)=U21(z,2)=u21(z)=0.2z0+0.8z1; U22(z,1)= U22(z,2)=u21(z)u22(z)=0.04z0+0.32z1+0.64z2; Remove the term 0.64z2 from U22(z,1) and U22(z,2) and obtain p22(1)=p22(2)=0.64; U23(z,1)=U23(z,2)=U22(z,1)u23(z)=0.012z0+0.124z1+0.224z2; Obtain p23(1)=p23(2)=0.224. 16 Having the probabilities p2j(1), p2j(2) and the corresponding termination times t2j(1), t2j(2), define the MGF representing the component execution time distributions v2(z,1)=0.64z26+0.224z38, v2(z,2)=0.64z16+0.224z22. U20(z,3)=1; U21(z,3)=u21(z)=0.2z0+0.8z1; For L2=3: U22(z,3)=u21(z)u23(z)=0.06z0+0.38z1+0.56z2; Remove the term 0.56z2 from U22(z) and obtain p22(3)=0.56; U23(z,3)=U22(z,3)u22(z)=0.012z0+0.124z1+0.304z2; Obtain p23(3)=0.304. The MGF representing the corresponding component execution time distribution takes the form v2(z,3)=0.56z12+0.304z16. The distribution of L2 is Pr{L2=1}=Pr{h2=1} =3A2(1A2)2=0.096, Pr{L2=2}=Pr{h2=2}=3A22(1-A2)=0.384, Pr{L2=3}=Pr{h2=3}=A23=0.512; According to Eq. (19) we obtain: V2(z)=0.096v2(z,1)+0.384v2(z,2)+0.512v2(z,3) = 0.287z12+0.401z16+0.086z22+0.0614z26+0.0215z38. The MGF representing the execution time distribution for the entire system takes the form: V(z)=V1(z)V2(z)= 0.078z28+0.109z32+0.023z38+0.017z40+0.141z42+0.024z44+0.174z46+ 0.005z50+0.037z52+0.007z54+0.027z56+0.014z58+0.020z62+0.001z66+ 0.012z68+0.020z72+0.019z76+0.004z82+0.001z84+0.003z86+0.001z98. Having this MGF, one can obtain the system reliability and conditional expected execution time for different time constraints T* using Eq. (9) - (12): R()=0.739; E()=44.559. For T*=50: R(50)=0.567; E(50)=22.24/0.567=39.24 4.2. Numerical example A. System with different software components 17 Consider a system consisting of five components. The number of parallel software units, availability of these units and the parameters Nc and Mc of the fault-tolerant programs for each component are presented in Table 2. The components 1, 3 and 5 are of NVP type, the components 2 and 4 are of RBS type (M2=M4=1). The reliability and execution times of software versions are presented in Table 3. The entire system execution time varies in the range 81≤T≤241. The system reliability and expected execution time (without respect to execution time constraint) obtained by the suggested algorithm are respectively R()=0.77 and E()=96.64. The indices R(T*) and E(T*) are presented in Fig. 2 as functions of maximum allowable system execution time T*. In order to estimate the effect of fault-tolerant software architecture on the entire system reliability we assume that each component consists of single version (most reliable version from Table 3). Having the execution times of these most reliable versions we obtain the system execution time as sum of versions' execution times: T=130. The system reliability is a product of reliabilities of each component. The reliability of each component c is equal to version reliability multiplied by the probability that at least lc(1) hardware units out of Hc are available. For the given system parameters the single version system reliability is R(130)=0.7699 . The fault tolerant system with structure of the software components presented in Table 2 provides for the same execution time T*=130 reliability R(130)=0.75. It can be seen that in the considered case adding less reliable versions to the most reliable ones does not improve system reliability defined as the probability of obtaining the correct output. Even for T=241 the multi-version system does not provide much greater reliability than the single version one. However in the single version system the wrong outputs in components 1, 3 and 5 cannot be detected since these components have no ATB. The probability of undetected wrong output in the single version system is 0.0972 while in the multi-version system it is negligible small. 18 B. System with single software component Now consider the same fault-tolerant software system running on a single hardware block consisting of three parallel units with availability A=0.9. The functions lc(x) for each software component are presented in Table 4. Note that the system can operate only when h2 since single hardware unit has not enough resources for execution of the second software component. The entire system execution time varies in the range 81≤T≤195. The system reliability and expected execution time (without respect to execution time constraint) are respectively R()=0.917 and E()=103.18. The indices R(T*) and E(T*) are presented in Fig. 3 as functions of maximum allowable system execution time T*. If each software component consists of single version (most reliable version from Table 3), the system execution time is T=130. The system reliability obtained as the product of reliabilities of the versions multiplied by probability that at least two out of three hardware units are available is R=0.767. The fault tolerant system with structure of the software components presented in Table 2 provides for the same execution time T*=130 greater reliability R(130)=0.804 . 5. Summary and further work The considered model considers fault-tolerant systems with series architecture and arbitrary number of hardware units and software versions. The presented algorithm is aimed at evaluating system reliability and performance indices that can be used for comparison of different system configurations and for solving system structure optimization problems. The model and the algorithm in their present form have the following limitations: 19 1. The common cause failures are not taken into account. This leads to overestimation of system reliability. Further research should be devoted to incorporating common cause failures into the system reliability and performance model. This can be done based on works [15-18]. 2. The version execution times are assumed to be constant. Such assumption is justified when the subtasks are small and versions have no logical branching or when the possible time variation of each version is much smaller than difference in execution times for different versions. In further work the version execution time distributions should be incorporated into the model. The universal generating function technique [19] can be used for analysis of versions with variable execution times. 3. The structure of hardware is assumed to be homogenous. The components of real systems can consist of different hardware units with different reliability and processing speed. The model should be further extended to analyze systems with components consisting of an arbitrary set of hardware units. 4. The system is assumed to consist of components connected in series. In many real systems the components have more complex interaction than sequential task execution. The possible ways of the model extension to more complex structures should be studied. 5. It is assumed that the system performs a single task. When the same system is aimed at running different applications (providing different services) different software modules can be used for different applications or the same software versions can perform differently while solving different subtasks. The extension of the model to multitask applications can be done using ideas presented in [9]. 6. References [1] Teng X., Pham H., Software fault tolerance, in Reliability Engineering Handbook, H. Pham (Ed.), Springer, 2003, 585-611 20 [2] Chen L., Avizienis A., N-version programming: a fault tolerance approach to the reliable software, Proc. 8t h Int. Sym. Foult-Tolerant Computing, Toulouse, France; 1978, p. 3-9. [3] Randell B., System structure for software fault tolerance, IEEE Trans. Software Eng., 1(2), 1975, p. 220-232. [4] Tai A., Meyer J., Avizienis A., Performability enhancement of fault-tolerant software, IEEE Transactions on Reliability, 42 (2), 1993, p. 227-237. [5] Goseva-Popstojanova K., Grnarov A., Performability Modeling of N Version Programming Technique, Proc. 6th IEEE International Symposium on Software Reliability Engineering (ISSRE'95), Toulouse, France, Nov.1995. [6] K.Goseva-Popstojanova, A.Grnarov, Performability and Reliability Modeling of N Version Fault Tolerant Software in Real-Time Systems, Proc. Euromicro'97 Conference, Budapest, Hungary, Sep. 1997. [7] Belli, F., Jedrzejowicz, P., Fault-tolerant programs and their reliability, IEEE Transactions on Reliability, 39 (2), 1990 , 184 –192. [8] Ashrafi, N., Berman, O., Cutler, M., Optimal design of large software-systems using N-version programming, IEEE Transactions on Reliability, Vol. 43 (2), 1994, 344 -350. [9] Wattanapongsakorn, N., Levitan. S., Reliability optimization models for embedded systems with multiple applications, IEEE Transactions on Reliability, Vol. 53 (3), 2004, 406 -416. [10] Tsujimura, Y., Yamamoto, H., Takeno, T. and Yamazaki, G., Evolutionary design of N-version fault tolerant software system, Proceedings of the 33rd International Conference on Computers and Industrial Engineering (CD-ROM), Jeju, Korea, 2004 [11] Tsujimura, Y., Yamamoto, H., Takeno, T. and Yamazaki, G., A genetic algorithm for designing N-version program, Proceedings of the Fifth Asia-Pacific Industrial Engineering and Management Systems Conference , Gold Coast, Australia, 2004. [12] Levitin G., Optimal version sequencing in fault-tolerant programs, Asia-Pacific Journal of Operational Research, 22(1), 2005 pp. 1-18. [13] Meyer, J.F. On evaluating the performability of degradable computing systems, IEEE Transactions on Computers, Vol. 29, 1980, 720-731. [14] Grassi V., Donatiello L., Iazeolla G., Performability evaluation of multicomponent fault-tolerant systems, IEEE Transactions on Reliability, 37 (2), 1988, p. 216-222. [15] Nicola V.F., Goyal A. Modeling of correlated failures and community error recovery in multiversion software. IEEE Trans Software Eng, vol. 16, 1990, 350-359. 21 [16] Scott R.K., Gault J.W., McAllister D. F. Fault-tolerant reliability modelling. IEEE Trans Software Eng, vol. 13(5), 1987, 582-592. [17] Eckhardt D. E. and Lee L. D., A Theoretical Basis of Multiversion Software Subject to Coincident Errors, IEEE Trans. on Software Engineering, vol. 11, 1985, 1511-1517. [18] Littlewood B.and Miller D. R., Conceptual Modelling of Coincident Failures in Multi-Version Software, IEEE Trans on Software Engineering, vol. 15, 1989, 1596-1614. [19] A. Lisnianski, G. Levitin, Multi-state system reliability, World Scientific, 2003. 22 Table 1. Parameters of system components for the analytical example. component c=1 c=2 version 1 2 3 4 5 1 2 3 rci 0.7 0.6 0.8 0.6 0.9 0.8 0.8 0.7 ci 6 12 10 18 14 10 16 12 tci(1) 6 18 28 46 60 10 26 38 tci(2) 6 12 16 30 30 10 16 22 tci(3) - - - - - 10 16 12 Table 2. Parameters of system components for the numerical example. Component 1 2 3 4 5 Hc 3 4 2 3 2 Ac 0.80 0.95 0.90 0.85 0.95 Nc 5 2 3 3 5 Mc 3 1 2 1 3 lc(1) 2 0 1 1 3 lc(2) 4 1 3 2 5 lc(3) 5 1 - 3 - lc(4) - 2 - - - 23 Table 3. Parameters of software versions for the numerical example. c=1 c=2 c=3 c=4 c=5 version 1 2 3 4 5 r1i 0.86 0.77 0.98 0.93 0.91 1i 10 12 25 22 15 r2i 0.85 0.92 - - - 2i 30 45 - - - r3i 0.87 0.94 0.98 - - 3i 6 6 10 - - r4i 0.95 0.85 0.85 - - 4i 25 15 20 - - r5i 0.68 0.79 0.90 0.90 0.94 5i 10 10 15 20 25 Table 4. Number of parallel software versions executed simultaneously as function of number of available hardware units. Component 1 2 3 4 5 lc(1) 2 0 3 1 1 lc(2) 4 1 3 2 2 lc(3) 5 1 3 3 3 24 L1=1 6 12 10 L1=2 18 14 1 1 2 3 4 3 5 5 2 28 46 60 4 16 30 Component 1 L2=1 L2=2 L2=3 1 10 16 12 1 1 2 2 3 3 3 2 26 38 16 22 12 16 Component 2 Figure 1. Time diagrams for software components with different number of versions executed simultaneously. 25 100 0.8 95 0.6 90 0.4 85 0.2 80 0 80 100 120 140 160 180 200 220 R(T*) E(T*) Fig. 2 240 T* E(T*) R(T*) Figure 2. R(T*) and E(T*) functions for fault-tolerant system running on different hardware components. 110 1 105 0.8 E(T*) 100 0.6 95 0.4 90 R(T*) Fig. 3 0.2 85 80 0 80 100 120 140 160 180 200 T* E(T*) R(T*) Figure 3. R(T*) and E(T*) functions for fault-tolerant system running on a single hardware component. 26