SOFTWARE ARCHITECTURE FOR DISTRIBUTED SYSTEMS (SADS): NN AND EV APPROACHES

Fl. Popentiu Vladicescu, City University, London, DEEIE, UK, e-mail: Fl.Popentiu@city.ac.uk
G. Albeanu, Bucharest University, RO, e-mail: albeanu@math.math.unibuc.ro
Pierre Sens, Université Pierre & Marie Curie, LIP6, Paris 6, e-mail: Pierre.Sens@lip6.fr
Poul Thyregod, Technical University of Denmark, IMM, e-mail: pt@imm.dtu.dk

ABSTRACT

The problem of software architecture driven by software reliability forecasting is considered. The proposed technique is based on a chain of automatic data collection, which makes it possible to adjust the fault-management strategy during execution. Four modules are chained: monitoring, statistical, prediction, and selection. The statistical module deals with both artificial neural network (NN) and explanatory variable (EV) approaches. The predictions are used to recalibrate/retrain the initial model. The selection process identifies the appropriate algorithm for adaptive fault management. This paper addresses the structure of the statistical module from the viewpoint of the "predictions" module. The main advantage of the proposed approach is that it allows information concerning the structure and the history of the distributed system's behavior to be included in the prediction process.

1 INTRODUCTION

The software (operating system and various client-server applications) that controls distributed systems and environments upon which human (economic and social) life critically depends must receive special treatment throughout its life-cycle, in order to assure that the demanded safety, reliability, and quality levels are attained. Software reliability forecasting is a problem of increasing importance for many critical applications.
Not only must the reliability parameters of the components and their communications be evaluated, but also their performance under overload and reconfiguration. Statistical methods are important for improving the ability of current software reliability growth models to give measurements and predictions of reliability that can be trusted for adaptive tolerance algorithms (Burtschy et al., 1997). In addition, neural network methods can be applied to model software reliability growth and to achieve ultra-high reliability in a specific environment. Sitte (1999) showed that neural networks are not only much simpler to use than statistically based recalibration methods, but are equal or better trend (variable-term) predictors. The problem of software architecture driven by software reliability forecasting was presented at the ESREL'99 Conference by Fl. Popentiu Vladicescu et al. (1999). It was observed that good software architecture facilitates application system development and promotes system configuration. The main objective of SADS is to improve the software architecture for monitoring fault-tolerant design in distributed systems. The proposed technique is based on a chain of automatic data collection, which makes it possible to adjust the fault-management strategy during execution (Fig. 1).

Fig. 1 Forecasting architecture

The proposed SADS architecture consists of the following modules:

1. The module "MONITOR - monitor of the application" collects information about the evolution of the applications and the environment, such as the use of resources in terms of memory, files, and communication. The distributed monitoring architecture used (Fl. Popentiu et al., 1999) is briefly described here. A specific host, the collector, gathers information from the remote hosts involved. The collector is subscribed to specific information chosen by a "reliability administrator".
On each host a server called the monitor agent collects information concerning local processes and sends it back to the collector. The servers rely on a high-level communication layer, the Network Message Server.

2. The module "STAT - statistical approach" filters the data supplied by the former module and produces a model that can be used for prediction. The structure of this module includes both parametric and non-parametric statistical resources and artificial neural networks, utilizing information on the state of the operational environment, such as proportional hazards.

3. The module "PRED - predictions" makes predictions about the future failure behavior of the application and the operational environment. Moreover, the predictions are used to calibrate the statistical models.

4. The module "SEL - selection" uses the information of the prediction module to select the most appropriate algorithm for adaptive fault management. A comparative analysis of the predictions allows us to choose between a pessimistic approach, which favors the recovery delay, and an optimistic one, which reduces the cost of fault management during failure-free execution.

This paper addresses the structure of the "STAT" module from the perspective of the prediction process. The main advantage of the proposed approach is that it allows information concerning the structure and the behavior of the distributed system to be included in the prediction process.

2 THE KERNEL OF THE "STAT" MODULE

During the past two decades, numerous software reliability models and measurement procedures have been proposed for the prediction, estimation, and engineering of software reliability. For a detailed description of most of these models, see Musa (1999). The majority are based on the concept of random failures, and variants of these models take into account that, when a fault is removed, other errors may be introduced into the software.
An approach to early software reliability prediction, based on the systematic identification of the failure modes of the software development process, is developed by Smidts et al. (1998). Two concepts are used in the failure-mode-based modeling of software: process failure modes and product failure modes. Process failure modes are failures to carry out the steps of the software development process correctly, in contrast to product failure modes, which are failures of the software itself. In the context of this paper, only failures of the software are considered; the behavior of the software can be studied with regard to its operational profile. The operational profile is a quantitative characterization of how the software system is used, assigning probabilities to software operations. The assignment process uses the information provided by the "MONITOR". The module "STAT" consists of the following components:

SRGM - the general software reliability growth model developed by Burtschy et al. (1997). The supermodel considered is built as a weighted sum of several software reliability growth models: Jelinski-Moranda, Goel-Okumoto, Duane, Littlewood-Verrall, Keiller-Littlewood, etc. Three methods to obtain the weight factors are considered: (1) the maximum likelihood method; (2) the Bayesian inference weighted decision; and (3) the prequential likelihood function method.

StEx - the methodology developed by Kaufman et al. (1999) for estimating the mean time to failure, the time to next failure, and similar statistics using the statistics of the extremes. This component is added for the following reason: for a highly reliable software system, the occurrence of a software failure is a rare event; therefore, this kind of failure data is found in the tail of the parent distribution. Lambert et al. (1994) show that fitting a separate distribution to the tail of the parent distribution can provide a more accurate representation of an extreme event.
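The tail-fitting idea behind StEx can be sketched in a few lines of Python: fit a Gumbel Type I distribution to time-to-failure data and read off a survival probability. The module's implementation uses an analytical estimator; the simpler method-of-moments fit below is an illustrative stand-in, and all data are synthetic.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329

def fit_gumbel_moments(sample):
    """Return (mu, beta) for a Gumbel Type I law via method of moments:
    mean = mu + gamma_E * beta, var = (pi * beta)**2 / 6."""
    m = statistics.mean(sample)
    s = statistics.stdev(sample)
    beta = s * math.sqrt(6.0) / math.pi
    mu = m - EULER_GAMMA * beta
    return mu, beta

def gumbel_cdf(x, mu, beta):
    return math.exp(-math.exp(-(x - mu) / beta))

random.seed(1)
# Synthetic ttf data drawn from Gumbel(mu=100, beta=15) by CDF inversion.
ttf = [100.0 - 15.0 * math.log(-math.log(random.random())) for _ in range(2000)]
mu, beta = fit_gumbel_moments(ttf)
print(round(mu), round(beta))

# Estimated probability that the software survives past t = 150 time units.
p_surv = 1.0 - gumbel_cdf(150.0, mu, beta)
print(round(p_surv, 3))
```

The fitted parameters recover the generating values closely on a sample of this size, which is the situation StEx assumes: many independent, identically distributed restart-to-failure times.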
Such an approach provides important information for the "predictions" module. StEx is applied to the random variable representing the time from restart to software failure (ttf - time to failure). The basic assumption of the estimation process is that the sequence of ttf random variables from successive software restarts is statistically independent and identically distributed (few faults remain after extensive testing, and their time to exposure is nominally the same). Parameter estimates can be made either graphically or analytically. The analytical method selected for implementation is the Lieblein method (Castillo, 1988). If it is assumed that the software is never completely fault free, then only the Gumbel Type I and Gumbel Type II distributions are appropriate. However, if there is a finite upper limit on the number of failures, then the Gumbel Type III distribution is appropriate.

SST - the stratified sampling techniques used to estimate population parameters efficiently when there is substantial variability between subpopulations. For SST to be useful for estimating software reliability, program executions must be stratified so that some failures are concentrated in one or more strata or are isolated by themselves (Podgurski et al., 1999). The history of the software system provides the opportunity to classify the failures, and SST estimates the proportion of failures belonging to some class within the entire population of possible failures. Such an approach is appropriate for large software packages that integrate different applications which can be executed independently.

EV - the explanatory variables module, based on an extended hazard regression model proposed by Lousada-Neto (1997). This module uses the data concerning the history of the software system and estimates the hazard regression parameters.

NN - the module which implements an artificial neural network approach for a general reliability growth model.
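Returning to the SRGM component listed above, its weighted-sum idea can be sketched as follows. The two component models (Goel-Okumoto and Duane), their parameter values, and the weighting (normalized hypothetical prequential log-likelihoods) are simplified stand-ins for the Burtschy et al. estimators.

```python
import math

def goel_okumoto(t, a=120.0, b=0.05):
    """Goel-Okumoto expected cumulative failures: E[N(t)] = a(1 - exp(-bt))."""
    return a * (1.0 - math.exp(-b * t))

def duane(t, alpha=5.0, beta=0.6):
    """Duane model: E[N(t)] = alpha * t**beta."""
    return alpha * t ** beta

def supermodel(t, models, log_pl):
    """Weighted sum of component-model predictions; each weight is the
    normalized (prequential) likelihood of the corresponding model."""
    w = [math.exp(l) for l in log_pl]
    s = sum(w)
    return sum(wi / s * m(t) for wi, m in zip(w, models))

models = [goel_okumoto, duane]
log_pl = [-10.2, -11.5]   # hypothetical prequential log-likelihoods
pred = supermodel(50.0, models, log_pl)
print(round(pred, 1))
```

The combined prediction always lies between the component predictions, and a model that has tracked the observed failure data better (higher prequential likelihood) pulls the supermodel toward itself.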
A detailed study of the application of NNs to software reliability growth prediction is given by Karunanithi et al. (1992). From the architectural point of view, the following attributes are available: (1) the number of layers in the network and (2) the type of network topology used. The feed-forward net architecture (Karunanithi et al., 1992) is selected for the software reliability system. The simplest NN has three layers: the input layer, the hidden layer, and the output layer. Other types of architectures will be included in the next version of the SADS system.

These modules are in different stages of implementation. In the following, more attention is given to the modules EV and NN.

3 THE EXTENDED HAZARD REGRESSION

The extended hazard regression model includes the most common hazard models, such as the proportional hazards model, the accelerated failure time model, and a proportional hazards/accelerated time hybrid model with constant spread parameter. Although proportional hazards models have been used for some time to model the occurrence of hardware failures, such models describing the dependence on covariates have not previously been used for modeling software failures. Proportional hazards and accelerated time models may be used to explain the part of the spread of the failure distribution that is due to the variation of the covariates.

Let h_0(.) be a baseline hazard function, θ = (β_1, β_2, γ_1, γ_2) be the vector of unknown parameters, and u_1(.), u_2(.), v_1(.), v_2(.) be known monotone functions equal to one when their arguments are zero. The extended hazard regression model is given by

h_c(t | z) = T_1(t | z) T_2(t | z),

where

T_1(t | z) = u_1(β_1ᵀz) v_1(γ_1ᵀz) [u_1(β_1ᵀz) t]^(v_1(γ_1ᵀz) − 1)

and

T_2(t | z) = h_0([u_2(β_2ᵀz) t]^(v_2(γ_2ᵀz))).

Assuming u_1(.) = u_2(.), v_1(.) = v_2(.), and

h_0(x) = {Γ(k)}^(−1) x^(k−1) e^(−x) / [1 − I(k; x)], with I(k; x) = {Γ(k)}^(−1) ∫_0^x t^(k−1) e^(−t) dt,

the above model corresponds to the hazard function of a random variable with a generalized gamma distribution with three parameters, two of them depending on the covariates z. For v_1(.) = v_2(.) = k = 1, the exponential model is obtained; for k = 1 alone, a Weibull model is obtained. For h_0(x) = 1/(1 + x), we obtain the log-logistic distribution with two parameters depending on the covariates. The baseline hazard function can also be a polynomial approximation or a piecewise polynomial approximation.

To estimate the parameters, let

H_c(t | z) = ∫_0^t h_c(u | z) du

and t_1, t_2, … be a time sequence. Each t_i has an associated covariate vector z_i and an indicator variable defined by δ_i = 1 if at time t_i the system is in failure, and δ_i = 0 otherwise. The log-likelihood function is

log L = Σ_(i=1)^n [δ_i log h_c(t_i | z_i) − H_c(t_i | z_i)].

Then θ* = (β_1*, β_2*, γ_1*, γ_2*), the maximum likelihood estimate of θ, is obtained by solving the system of nonlinear equations ∂ log L / ∂θ = 0.

In the software system under development, the covariates are provided by the "MONITOR" and describe the software history: the number of previous failures, the time of failure, the number of files under processing, the global memory usage, the number of swapped-out pages, the number of processes in the run queue, etc. The structure of the file system which supports the interface with "STAT" is given by Sens (1998).

4 USING NN FOR SOFTWARE RELIABILITY GROWTH MODELING

The predictive power of the parametric prediction model provided by the extended hazard regression model may be further improved by a neural network.
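The estimation machinery of Section 3 can be sketched numerically for the Weibull special case, taking u(x) = v(x) = exp(x), which is monotone and equal to one at zero, as the model requires. The covariate values (taken here to stand for scaled MONITOR readings) and the parameter values are hypothetical.

```python
import math

def _uv(z, coef):
    """exp(coef' z), the common choice for the monotone link functions."""
    return math.exp(sum(c * x for c, x in zip(coef, z)))

def hazard(t, z, beta, gamma):
    """h_c(t|z) = u v (u t)**(v - 1), with u = exp(beta'z), v = exp(gamma'z)."""
    u, v = _uv(z, beta), _uv(z, gamma)
    return u * v * (u * t) ** (v - 1.0)

def cum_hazard(t, z, beta, gamma):
    """H_c(t|z): in this special case the integral has closed form (u t)**v."""
    u, v = _uv(z, beta), _uv(z, gamma)
    return (u * t) ** v

def log_likelihood(data, beta, gamma):
    """log L = sum_i [delta_i * log h_c(t_i|z_i) - H_c(t_i|z_i)]."""
    ll = 0.0
    for t, z, delta in data:
        if delta:
            ll += math.log(hazard(t, z, beta, gamma))
        ll -= cum_hazard(t, z, beta, gamma)
    return ll

# Hypothetical observations: (time, covariates, failure indicator), where the
# covariates might be scaled memory usage and run-queue length from MONITOR.
data = [(12.0, (0.3, 0.1), 1),
        (30.0, (0.5, 0.4), 0),
        (45.0, (0.7, 0.2), 1)]
print(round(log_likelihood(data, beta=(0.2, 0.1), gamma=(-0.1, 0.05)), 3))
```

Maximizing this function over (β, γ), for instance with a numerical optimizer, yields the maximum likelihood estimates described above; with β = γ = 0 the model collapses to the unit-rate exponential case.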
It is, however, well known that the predictive capability of a neural network can be affected by the type of neural network model used to describe the failure data, how the input and output variables are represented, the order in which the input and output values are presented during the training process, and the complexity of the network. A software reliability growth model can be expressed in terms of the neural network mapping as

F̂_(k+h) = Mapping((T_k, F_k), t_(k+h)),

where T_k is the sequence of cumulative execution times (t_1, t_2, …, t_k), F_k is the corresponding sequence of observed accumulated failures (f_1, …, f_k) up to time t_k used to train the network, t_(k+h) = t_k + Δ is the cumulative execution time at the end of a future test session k + h, and F̂_(k+h) is the prediction of the network.

The following attributes are important for the architecture of a neural network: the number of layers in the network and the type of network topology used. A NN can be a single-layer network (no hidden layer, only input and output layers) or a multilayer network (one or more hidden layers). Based on the connectivity and the direction of the links, a network can employ only forward-feeding connections (feed-forward networks, FFN) or, in the multilayer case, feedback connections (recurrent networks). The predictive ability of a NN can also be affected by the learning process. For the NN module, two training regimes are available: generalized training and prediction training. The generalized training method is the standard way of training feed-forward networks; prediction training is appropriate for training recurrent networks. The data for training at time t_i consist of the complete failure history of the system since the beginning of its run. The representation method for the input and output variables is based on a scaling to the interval [0, 1]. The number of possible values (an equidistant division) can be selected depending on the software project under study.
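A minimal sketch of this mapping, assuming a tiny three-layer feed-forward network trained by gradient descent ("generalized training") on failure data scaled into [0, 1]; the architecture, learning rate, and failure history below are illustrative, not the module's actual configuration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyFFN:
    """Three-layer feed-forward net: one input, a few hidden units, one output."""
    def __init__(self, hidden=4, seed=0):
        rnd = random.Random(seed)
        self.w1 = [rnd.uniform(-1, 1) for _ in range(hidden)]
        self.b1 = [rnd.uniform(-1, 1) for _ in range(hidden)]
        self.w2 = [rnd.uniform(-1, 1) for _ in range(hidden)]
        self.b2 = rnd.uniform(-1, 1)

    def forward(self, x):
        self.h = [sigmoid(w * x + b) for w, b in zip(self.w1, self.b1)]
        return sigmoid(sum(w * h for w, h in zip(self.w2, self.h)) + self.b2)

    def train(self, data, epochs=4000, lr=0.5):
        # Online gradient descent on squared error (backpropagation).
        for _ in range(epochs):
            for x, y in data:
                out = self.forward(x)
                d_out = (out - y) * out * (1.0 - out)
                for j, h in enumerate(self.h):
                    d_h = d_out * self.w2[j] * h * (1.0 - h)
                    self.w2[j] -= lr * d_out * h
                    self.w1[j] -= lr * d_h * x
                    self.b1[j] -= lr * d_h
                self.b2 -= lr * d_out

# Hypothetical failure history (cumulative execution time in hours ->
# cumulative failures), scaled into [0, 1] using expected maxima of
# 100 h and 50 failures.
history = [(10, 12), (25, 22), (40, 30), (60, 38), (80, 43)]
data = [(t / 100.0, f / 50.0) for t, f in history]

net = TinyFFN()
net.train(data)
# Query the trained net at a future cumulative time t_{k+h} = 90 h and
# scale the output back to a failure count.
pred = net.forward(90.0 / 100.0) * 50.0
print(round(pred, 1))
```

After training on the (T_k, F_k) pairs, feeding the net a future t_(k+h) gives the predicted accumulated failure count F̂_(k+h), exactly the mapping described above.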
The expected maximum values of both the cumulative faults and the cumulative execution time must be mapped to a positive value less than 1. In the training process, at least three observed data points must initially be used. In practice, at the end of each training session, the network is fed with future inputs to measure its prediction of the total number of defects.

5 ADAPTABILITY OF FAULT MANAGEMENT

A substantial body of work has been published regarding fault tolerance, and we should find the best algorithm with respect to the dependability and performance requirements of each specific application. On the other hand, environmental parameters directly influence the choice of the fault-tolerance policy. For instance, if the failure rate becomes too low and the recovery delay is not so important, we can reasonably choose an optimistic approach such as checkpointing.

6 CONCLUSIONS

This paper describes the module "STAT" of the SADS architecture - the architecture of a software platform for monitoring fault-tolerant design in distributed systems. The module under discussion filters the data supplied by the monitoring structure and produces a model that can be used for prediction. The different components of this module are introduced and their basic functions are given. The presented structure is suited to the "predictions" module and is under implementation.

REFERENCES

[1] Burtschy, B., Albeanu, G., Boros, D.N. & Popentiu, Fl., Improving Software Reliability Forecasting, Microelectronics and Reliability, 37, 6(1997), 901-907.
[2] Sitte, R., Comparison of Software-Reliability-Growth Predictions: Neural Networks vs Parametric-Recalibration, IEEE Transactions on Reliability, 48, 3(1999), 285-291.
[3] Popentiu-Vladicescu, Fl. & Sens, P., A Software Architecture for Monitoring the Reliability in Distributed Systems, ESREL'99, September 13-17, TUM Munich-Garching, Germany, 615-620, 1999.
[4] Musa, J.D., Software Reliability Engineering, McGraw-Hill, New York, 1999.
[5] Smidts, C., Stutzke, M. & Stoddard, R.W., Software Reliability Modeling: An Approach to Early Reliability Prediction, IEEE Transactions on Reliability, 47, 3(1998), 268-278.
[6] Jelinski, Z. & Moranda, P.B., Software reliability research. In Statistical Computer Performance Analysis, Academic Press, N.Y., 465-484, 1972.
[7] Goel, A.L. & Okumoto, K., Time-dependent error detection rate model for software reliability and other performance measures. IEEE Transactions on Reliability, R-28, 1979.
[8] Duane, J.T., Learning curve approach to reliability monitoring. IEEE Transactions on Aerospace, 2(1964), 563-566.
[9] Littlewood, B. & Verrall, J.L., A Bayesian reliability growth model for computer software. J. Royal Statist. Soc., C22(1973), 332-346.
[10] Keiller, P.A., Littlewood, B., Miller, D.R. & Sofer, A., Comparison of software reliability predictions. Digest FTCS-13 (13th International Symposium on Fault-Tolerant Computing), 138-144, 1983.
[11] Kaufman, L.M., Dugan, J.B. & Johnson, B.W., Using Statistics of the Extremes for Software Reliability Analysis. IEEE Transactions on Reliability, 48, 3(1999), 292-299.
[12] Lambert, J.H., Matalas, N.C., Ling, C.W., et al., Selection of probability distributions in characterizing of extreme events, Risk Analysis, 14, 5(1994), 731-742.
[13] Castillo, E., Extreme Value Theory in Engineering, Academic Press, 1988.
[14] Podgurski, A., Masri, W., McCleese, Y., Wolf, F.G. & Yang, C., Estimation of Software Reliability by Stratified Sampling. ACM Transactions on Software Engineering and Methodology, 8, 3(1999), 263-283.
[15] Lousada-Neto, F., Extended Hazard Regression Model for Reliability and Survival Analysis. Lifetime Data Analysis, 3(1997), 367-389.
[16] Karunanithi, N., Whitley, D. & Malaiya, Y.K., Prediction of Software Reliability Using Connectionist Models, IEEE Transactions on Software Engineering, 18, 7(1992), 563-574.
[17] Sens, P. & Folliot, B., The STAR Fault Tolerant Manager for Distributed Operating Environments. Software Practice and Experience, 28, 10(1998), 1079-1099.
[18] Hansen, K. & Thyregod, P., On the analysis of failure data for repairable systems, Reliability Engineering and System Safety, 36(1992), 47-51.