SOFTWARE ARCHITECTURE FOR DISTRIBUTED SYSTEMS
(SADS): NN AND EV APPROACHES
Fl. Popentiu Vladicescu
City University, London, DEEIE, UK, e-mail: Fl.Popentiu@city.ac.uk
G. Albeanu
Bucharest University, RO, e-mail: albeanu@math.math.unibuc.ro
Pierre Sens
Université Pierre & Marie Curie, LIP6, Paris 6, e-mail: Pierre.Sens@lip6.fr
Poul Thyregod
Technical University of Denmark, IMM, e-mail: pt@imm.dtu.dk
ABSTRACT
The problem of designing a software architecture driven by software reliability forecasting is
considered. The proposed technique is based on a chain of automatic data collection, which
makes it possible to adjust the fault-management strategy during execution. Four modules are
chained: monitoring, statistical, prediction, and selection. The statistical module deals with
both artificial neural network (NN) and explanatory variable (EV) approaches. The predictions
are used to recalibrate and retrain the initial model. The selection process identifies the
appropriate algorithm for adaptive fault management. This paper addresses the structure of the
statistical module from the viewpoint of the “predictions” module. The main advantage of the
proposed approach is that it makes it possible to include, in the prediction process,
information concerning the structure and the behavioral history of the distributed system.
1 INTRODUCTION
The software (operating system and various client-server applications) that is in
control of distributed systems and environments upon which human (economic and social)
life is critically dependent must receive special treatment throughout its life cycle in order to
ensure that the demanded safety, reliability and quality levels are attained.
Software reliability forecasting is a problem of increasing importance for many
critical applications. Not only must the reliability parameters of the components and their
communications be evaluated, but also their performance under overload and
reconfiguration. Statistical methods are important for improving the ability of current
software reliability growth models to give measurements and predictions of reliability that can
be trusted for adaptive fault-tolerance algorithms (Burtschy et al., 1997). In addition, neural
network methods can be applied for modeling software reliability growth and for achieving
ultra-high reliability in a specific environment. It has been shown (Sitte, 1999) that neural
networks are not only much simpler to use than the statistically based recalibration methods,
but are also equal or better trend (variable-term) predictors.
The problem of software architecture driven by software reliability forecasting
was presented at the ESREL’99 Conference by Fl. Popentiu Vladicescu et al. (1999). It
was observed that good software architecture facilitates application system development
and promotes system configuration. The main objective of SADS is to improve the software
architecture for monitoring fault-tolerant design in distributed systems. The proposed
technique is based on a chain of automatic data collection, which makes it possible to
adjust the fault-management strategy during execution (Fig. 1).
Fig.1 Forecasting architecture
The SADS architecture proposed consists of the following modules:
1. The module “MONITOR - monitor of the application” collects information
about the evolution of the applications and the environment, such as the use of
resources in terms of memory, files and communication. The distributed monitoring
architecture used (Fl. Popentiu et al., 1999) is briefly described here. A specific host, namely
the collector, collects information from the remote hosts involved. The collector is
subscribed to specific information chosen by a “reliability administrator”. On each host a
server called the monitor agent collects information concerning local processes and sends
it back to the collector. Servers rely on a high-level communication layer, the Network
Message Server.
2. The module “STAT - statistical approach” filters the data supplied by the
former module and produces a model that can be used for prediction. The structure of this
module includes both parametric and non-parametric statistical resources and artificial neural
networks, utilizing information on the state of the operational environment, such as
proportional hazards.
3. The module “PRED - predictions” makes predictions about the future failure
behavior of the application and the operational environment. Moreover, the predictions are
used to recalibrate the statistical models.
4. The module “SEL - selection” uses the information from the prediction module to select
the most appropriate algorithm for adaptive fault management. A comparative analysis
of the predictions allows us to choose between the pessimistic approach, which favors the
recovery delay, and the optimistic one, which reduces the cost of fault management during
failure-free execution.
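The four-module chain described above can be sketched as a simple composition of stages. The callables below are illustrative placeholders standing in for the real SADS modules, not their actual implementations:

```python
# Minimal sketch of the SADS module chain. Each stage is a placeholder
# callable; the real modules are described in the surrounding text.
def sads_pipeline(monitor, stat, pred, sel, raw_events):
    observations = monitor(raw_events)  # MONITOR: collect host/process data
    model = stat(observations)          # STAT: filter data, fit a model
    forecast = pred(model)              # PRED: predict future failure behavior
    return sel(forecast)                # SEL: pick a fault-management policy


# Hypothetical usage: the stat stage fits a trivial "mean failure rate" model,
# pred inflates it slightly, and sel chooses between the two policies.
choice = sads_pipeline(
    lambda events: events,
    lambda obs: sum(obs) / len(obs),
    lambda model: model * 1.1,
    lambda f: "optimistic" if f < 0.5 else "pessimistic",
    [0.1, 0.2, 0.3],
)
```

The point of the sketch is only the data flow: each module consumes the output of its predecessor, so the fault-management strategy can be adjusted as new monitoring data arrive.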
This paper addresses the structure of the “STAT” module from the perspective of the
prediction process. The main advantage of the proposed approach is that it makes it
possible to include, in the prediction process, information concerning the structure and the
behavior of the distributed system.
2 THE KERNEL OF THE “STAT” MODULE
During the past two decades, numerous software reliability models and measurement
procedures have been proposed for the prediction, estimation, and engineering of software
reliability. For a detailed description of most of these models, see Musa (1999). The majority
of them are based on the concept of random failures, and variants of these models take into
account that, by removing a fault, other errors may be introduced into the software. An approach to early
software reliability prediction, based on the systematic identification of the failure modes of
the software development process, is developed by Smidts et al. (1998). Two concepts are
used in failure-mode-based modeling of software: process failure modes and product
failure modes. Process failure modes are failures to carry out the steps of the software
development process correctly, in contrast to product failure modes, which are failures of the
software itself.
In the context of this paper, only failures of the software are considered; the behavior
of the software can be studied with regard to its operational profile. The operational profile is a
quantitative characterization of how the software system is used, assigning probabilities to
software operations. The assignment process uses the information provided by the
“MONITOR”.
The module “STAT” consists of the following components:
SRGM - the general software reliability growth model developed by Burtschy et al.
(1997). The supermodel considered is built as a weighted sum of several software reliability
growth models: Jelinski-Moranda, Goel-Okumoto, Duane, Littlewood-Verrall, Keiller-Littlewood
etc. Three methods to obtain the weight factors are considered: (1) the maximum
likelihood method; (2) the Bayesian inference weighted decision; and (3) the prequential
likelihood function method.
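As a small illustration of the supermodel idea, the forecast below is a weighted sum of the forecasts of several component growth models, with weights proportional to per-model prequential likelihoods (one of the three weighting options mentioned above). The numbers and function names are illustrative, not the actual SRGM implementation of Burtschy et al. (1997):

```python
# Sketch of a supermodel: combine component-model forecasts with weights
# derived from their prequential likelihoods. Illustrative only.
def prequential_weights(likelihoods):
    """Normalize per-model prequential likelihoods into weights summing to 1."""
    total = sum(likelihoods)
    return [lk / total for lk in likelihoods]


def supermodel_prediction(predictions, weights):
    """Weighted sum of the component-model reliability predictions."""
    return sum(w * p for w, p in zip(weights, predictions))
```

A model that explained the observed failure history better (higher prequential likelihood) thus contributes more to the combined forecast.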
StEx - the methodology developed by Kaufman et al. (1999) for estimating mean
time to failure, time to next failure and similar statistics using the statistics of the
extremes. This component is added for the following reason: for a highly reliable software
system, the occurrence of a software failure is a rare event; therefore, this kind of failure
data is found in the tail of the parent distribution. Lambert et al. (1994) show that fitting a
separate distribution to the tail of the parent distribution can provide a more accurate
representation of an extreme event. Such an approach will provide important information
for the “predictions” module.
StEx is applied to the random variable representing the time from
restart to software failure (ttf - time to failure). The basic assumption of the estimation
process is that the sequence of ttf random variables from successive software restarts is
statistically independent and identically distributed (few faults remain after extensive testing
and their time to exposure is nominally the same).
Parameter estimates can be made either graphically or analytically. The analytical
method selected for implementation is the Lieblein method (Castillo, 1988). If it is assumed
that software is never completely fault free, then only the Gumbel Type I and Gumbel Type II
distributions are appropriate. However, if there is a finite upper limit on the number of
failures, then the Gumbel Type III distribution is appropriate.
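As a simple illustration of fitting a Gumbel Type I distribution to ttf data, the following sketch uses moment-based estimates (the Lieblein procedure itself is based on order statistics and is more involved). The scale is β = s√6/π and the location is μ = x̄ − γβ, where γ is the Euler-Mascheroni constant; the sample data are hypothetical:

```python
import math
import statistics

# Moment-based fit of a Gumbel Type I (maximum) distribution to a sample of
# times to failure. A stand-in sketch, not the Lieblein method.
EULER_GAMMA = 0.5772156649015329


def fit_gumbel_moments(ttf):
    """Return (location mu, scale beta) from sample mean and std deviation."""
    s = statistics.stdev(ttf)
    beta = s * math.sqrt(6.0) / math.pi       # scale: Var = (pi * beta)^2 / 6
    mu = statistics.mean(ttf) - EULER_GAMMA * beta  # location: E = mu + gamma*beta
    return mu, beta
```

The fitted tail distribution can then feed the “predictions” module with estimates of extreme (rare-failure) behavior.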
SST - the stratified sampling techniques used to estimate population parameters
efficiently when there is substantial variability between subpopulations. In order for SST to be
useful for estimating software reliability, program executions must be stratified so that some
failures are concentrated in one or more strata or are isolated by themselves (Podgurski et al.,
1999). The history of the software system provides the opportunity to classify the failures and
the SST will estimate the proportion of the failures belonging to some class in the entire
population of possible failures. Such an approach is appropriate for large software packages
which integrate different applications that can be executed independently.
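As a small illustration of the stratified estimate (not the actual procedure of Podgurski et al., 1999), the overall failure proportion can be computed as a weighted average of per-stratum sample proportions, with weights given by the relative stratum sizes; the strata below are hypothetical:

```python
# Sketch of a stratified estimate of a failure proportion: executions are
# grouped into strata, and the population proportion is the size-weighted
# average of the per-stratum sample proportions.
def stratified_proportion(strata):
    """strata: list of (N_h, sample) where N_h is the stratum size and
    sample is a list of 0/1 failure indicators observed in that stratum."""
    N = sum(nh for nh, _ in strata)
    return sum((nh / N) * (sum(s) / len(s)) for nh, s in strata)
```

This is efficient precisely when the failures are concentrated in a few strata, as the text requires: the variance within each stratum is then small.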
EV - the explanatory variables module based on an extended hazard regression model
proposed by Lousada-Neto (1997). This module uses the data concerning the history of the
software system and estimates the hazard regression parameters.
NN - the module which implements an artificial neural network approach for a
general reliability growth model. A detailed study of the application of NNs to software
reliability growth prediction is given by Karunanithi et al. (1992). From the architectural
point of view the following attributes are available: (1) the number of layers in the network and
(2) the type of network topology used. The feed-forward network architecture (Karunanithi et al.,
1992) is selected for the software reliability system. The simplest such NN has three layers: the input
layer, the hidden layer and the output layer. Other types of architectures will be included in
the next version of the SADS system.
These modules are in different stages of implementation. In the following, more
attention is given to the modules EV and NN.

3 THE EXTENDED HAZARD REGRESSION
The extended hazard regression model includes the most common hazard models, such
as the proportional hazards model, the accelerated failure time model and a proportional
hazards/accelerated time hybrid model with constant spread parameter. Although proportional
hazards models have been used for some time to model the occurrence of hardware failures,
such models describing the dependence on covariates have not previously been used for
modeling software failures. Proportional hazards and accelerated time models may be used to
explain that part of the spread of the failure distribution that is due to variation of the covariates.
Let $h_0(\cdot)$ be a baseline hazard function, $\theta = (\alpha_1, \alpha_2, \beta_1, \beta_2)$ be the vector of unknown
parameters, and $u_1(\cdot), u_2(\cdot), v_1(\cdot), v_2(\cdot)$ be known monotone functions equal to one when their
arguments are zero. The extended hazard regression model is given by

$$h_c(t \mid z) = T_1(t \mid z)\, T_2(t \mid z),$$

where

$$T_1(t \mid z) = u_1(\alpha_1^T z)\, v_1(\beta_1^T z)\, [u_1(\alpha_1^T z)\, t]^{v_1(\beta_1^T z) - 1}$$

and

$$T_2(t \mid z) = h_0\big([u_2(\alpha_2^T z)\, t]^{v_2(\beta_2^T z)}\big).$$
Assuming $u_1(\cdot) = u_2(\cdot)$, $v_1(\cdot) = v_2(\cdot)$ and

$$h_0(x) = \frac{\{\Gamma(k)\}^{-1} x^{k-1} e^{-x}}{1 - I(k; x)},$$

with

$$I(k; x) = \{\Gamma(k)\}^{-1} \int_0^x t^{k-1} e^{-t}\, dt,$$

the above model corresponds to the hazard function of a random variable with a generalized
gamma distribution with three parameters, two of them depending on the covariates $z$. For $v_1(\cdot) =
v_2(\cdot) = k = 1$, the exponential model is obtained. When $k = 1$, a Weibull model is obtained. For
$h_0(x) = 1/(1+x)$, we obtain the log-logistic distribution with two parameters depending on
covariates.
The baseline hazard function can also be a polynomial approximation or a
piecewise polynomial approximation.
To estimate the parameters, let

$$H_c(t \mid z) = \int_0^t h_c(u \mid z)\, du$$

and let $t_1, t_2, \ldots$ be a time sequence. Each $t_i$ has an associated covariate vector $z_i$ and an indicator
variable defined by $\delta_i = 1$ if the system is in failure at time $t_i$, and $\delta_i = 0$ otherwise. The log-likelihood function is

$$\log L = \sum_{i=1}^{n} \big[\delta_i \log h_c(t_i \mid z_i) - H_c(t_i \mid z_i)\big].$$

Then $\theta^* = (\alpha_1^*, \alpha_2^*, \beta_1^*, \beta_2^*)$, the maximum
likelihood estimate of $\theta$, is obtained by solving the system of nonlinear equations $\partial \log L / \partial \theta = 0$.
In the software system under development, the covariates are provided by the
“MONITOR” and describe the software history: the number of previous failures, the time of
failure, the number of files under process, the global memory usage, the number of swapped
out pages, the number of processes in the running queue etc. The structure of the file system,
which supports the interface with the “STAT”, is given by Sens (1998).
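To make the estimation step concrete, the sketch below evaluates the log-likelihood for the Weibull special case of the model (k = 1, so h0 ≡ 1), with hypothetical link functions u(x) = v(x) = e^x, which are monotone and equal to one at zero, as the model requires. In this case the hazard is h = u·v·(u·t)^(v−1) and the cumulative hazard is H = (u·t)^v:

```python
import math

# Log-likelihood of the extended hazard regression model in the Weibull
# special case (k = 1, baseline hazard h0 = 1). Link functions u = v = exp
# are an illustrative assumption, not the paper's fixed choice.
def log_likelihood(alpha, beta, data):
    """data: list of (t_i, z_i, delta_i) with t_i > 0, z_i a covariate
    vector and delta_i the failure indicator (1 = failure at t_i)."""
    total = 0.0
    for t, z, delta in data:
        u = math.exp(sum(a * zj for a, zj in zip(alpha, z)))  # scale link
        v = math.exp(sum(b * zj for b, zj in zip(beta, z)))   # shape link
        h = u * v * (u * t) ** (v - 1.0)   # hazard h_c(t | z)
        H = (u * t) ** v                   # cumulative hazard H_c(t | z)
        total += delta * math.log(h) - H
    return total
```

Maximizing this function over the coefficient vectors (e.g. with a generic numerical optimizer) yields the maximum likelihood estimates discussed above.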
4 USING NN FOR SOFTWARE RELIABILITY GROWTH MODELING
The predictive power of the parametric prediction model provided by the extended
hazard regression model may be further improved by a neural network. It is, however, well
known that the predictive capability of a neural network can be affected by the type of
neural network model used to describe the failure data, how the input and output variables
are represented, the order in which the input and output values are presented during the
training process, and the complexity of the network.
A software reliability growth model can be expressed in terms of the neural network
mapping as $\hat{\mu}_{k+h} = \mathrm{Mapping}((T_k, F_k), t_{k+h})$, where $T_k$ is the sequence of cumulative execution
times $(t_1, t_2, \ldots, t_k)$, $F_k$ is the corresponding sequence of observed accumulated failure counts $(\mu_1, \ldots, \mu_k)$ up to
time $t_k$ used to train the network, $t_{k+h} = t_k + \Delta$ is the cumulative execution time at the end of a
future test session $k+h$, and $\hat{\mu}_{k+h}$ is the prediction of the network. The following attributes are
important for the architecture of a neural network: the number of layers in the network and the
type of network topology used. A NN can be a single-layer network (no hidden layer,
only input and output layers) or a multilayer network (with one or more hidden layers). Based
on the connectivity and the direction of the links, a network can employ only forward-feeding
connections (a feed-forward network, FFN) or also feedback connections (a recurrent network).
The predictive ability of a NN can also be affected by the learning process.
Two training regimes are available for the NN module: generalized training and prediction
training. Generalized training is the standard method for training feed-forward networks;
prediction training is appropriate for training recurrent networks.
The training data at time $t_i$ consist of the complete failure history of the system
since the beginning of execution. The representation method for input and output variables is
based on scaling to the interval [0, 1]. The number of possible values (equidistant divisions)
can be selected depending on the software project under study. The expected maximum
values of both the cumulative faults and the cumulative execution time must be mapped to
a positive value less than 1. Initially, at least three observed data points must be used in
the training process. In practice, at the end of each training run, the network is fed with future
inputs to measure its prediction of the total number of defects.
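The scaling step described above can be sketched as follows. The assumed expected maxima `t_max` and `f_max` (hypothetical names) map cumulative execution times and failure counts into [0, 1] before the (time, count) pairs are presented to the network:

```python
# Sketch of preparing training pairs for the feed-forward network: scale
# cumulative times and failure counts into [0, 1] using assumed expected
# maxima, then pair each scaled time with its scaled failure count.
def make_training_pairs(times, failures, t_max, f_max):
    """times, failures: cumulative observations; t_max, f_max: expected
    maxima chosen so all scaled values stay strictly below 1."""
    xs = [t / t_max for t in times]
    ys = [f / f_max for f in failures]
    return list(zip(xs, ys))
```

After training on such pairs, feeding the network a scaled future time $t_{k+h}/t_{\max}$ yields a scaled prediction that is converted back by multiplying with $f_{\max}$.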
5 ADAPTABILITY OF FAULT MANAGEMENT
A substantial body of work has been published regarding fault tolerance, and we
should find the best algorithm with respect to the dependability and performance requirements of
each specific application. On the other hand, environmental parameters directly influence the
choice of the fault-tolerance policy. For instance, if the failure rate is low and the
recovery delay is not critical, we can reasonably choose an optimistic approach such
as checkpointing.
6 CONCLUSIONS
This paper describes the module “STAT” of the SADS architecture - an architecture of
a software platform for monitoring fault-tolerant design in distributed systems. The module
under discussion filters the data supplied by the monitoring structure and produces a
model that can be used for prediction. The different components of this module are introduced
and their basic functions are given.
The presented structure is suited to the “predictions” module and is under
implementation.
REFERENCES
[1] Burtschy, B., Albeanu, G., Boros, D.N. & Popentiu, Fl., Improving Software Reliability Forecasting, Microelectronics and Reliability, 37, 6(1997), 901-907.
[2] Sitte, R., Comparison of Software-Reliability-Growth Predictions: Neural Networks vs Parametric-Recalibration, IEEE Transactions on Reliability, 48, 3(1999), 285-291.
[3] Popentiu-Vladicescu, Fl. & Sens, P., A Software Architecture for Monitoring the Reliability in Distributed Systems, ESREL’99, September 13-17, TUM Munich-Garching, Germany, 615-620, 1999.
[4] Musa, J.D., Software Reliability Engineering, McGraw-Hill, New York, 1999.
[5] Smidts, C., Stutzke, M. & Stoddard, R.W., Software Reliability Modeling: An Approach to Early Reliability Prediction, IEEE Transactions on Reliability, 47, 3(1998), 268-278.
[6] Jelinski, Z. & Moranda, P.B., Software reliability research. In Statistical Computer Performance Analysis, Academic Press, N.Y., 465-484, 1972.
[7] Goel, A.L. & Okumoto, K., Time-dependent error detection rate model for software reliability and other performance measures. IEEE Transactions on Reliability, R-28, 1979.
[8] Duane, J.T., Learning curve approach to reliability monitoring. IEEE Transactions on Aerospace, 2(1964), 563-566.
[9] Littlewood, B. & Verrall, J.L., A Bayesian reliability growth model for computer software. J. Royal Statist. Soc., C22(1973), 332-346.
[10] Keiller, P.A., Littlewood, B., Miller, D.R. & Sofer, A., Comparison of software reliability predictions. Digest FTCS-13 (13th International Symposium on Fault-Tolerant Computing), 138-144, 1983.
[11] Kaufman, L.M., Dugan, J.B. & Johnson, B.W., Using Statistics of the Extremes for Software Reliability Analysis. IEEE Transactions on Reliability, 48, 3(1999), 292-299.
[12] Lambert, J.H., Matalas, N.C., Ling, C.W., et al., Selection of probability distributions in characterizing of extreme events, Risk Analysis, 14, 5(1994), 731-742.
[13] Castillo, E., Extreme Value Theory in Engineering, Academic Press, 1988.
[14] Podgurski, A., Masri, W., McCleese, Y., Wolf, F.G. & Yang, C., Estimation of Software Reliability by Stratified Sampling. ACM Transactions on Software Engineering and Methodology, 8, 3(1999), 263-283.
[15] Lousada-Neto, F., Extended Hazard Regression Model for Reliability and Survival Analysis. Lifetime Data Analysis, 3(1997), 367-389.
[16] Karunanithi, N., Whitely, D. & Malaiya, K., Prediction of Software Reliability Using Connectionist Models, IEEE Transactions on Software Engineering, 18, 7(1992), 563-574.
[17] Sens, P. & Folliot, B., The STAR Fault Tolerant Manager for Distributed Operating Environments. Software Practice and Experience, 28, 10(1998), 1079-1099.
[18] Hansen, K. & Thyregod, P., On the analysis of failure data for repairable systems, Reliability Engineering and System Safety, 36(1992), 47-51.