
Using Machine Learning for Software Criticality Evaluation.
Miyoung Shin1 and Amrit L. Goel2
1 Bioinformatics Team, Future Technology Research Division, ETRI, Daejeon 305-350, Korea
shinmy@etri.re.kr
2 Dept. of Electrical Engineering and Computer Science, Syracuse University, Syracuse, New York 13244, USA
goel@ecs.syr.edu
Abstract. During software development, early identification of critical components is necessary in order to allocate adequate resources and ensure high quality of the delivered system. The purpose of this paper is to describe a methodology for modeling component criticality based on its software characteristics. This involves developing a relationship between the input and the output variables without knowledge of their joint probability distribution. Machine learning techniques have been remarkably successful in such applications, and in this paper we employ one such technique. In particular, we use software component data from the NASA metrics database and develop radial basis function classifiers using our new algebraic algorithm for determining the model parameters. Using a principled approach for classifier development and evaluation, we show that design characteristics alone can yield parsimonious classifiers with impressive classification performance on test data.
1. Introduction
In spite of much progress in the theory and practice of software engineering, software systems continue to
suffer from cost overruns, schedule slips and poor quality. Due to the lack of a sound mathematical basis
for this discipline, empirical models play a crucial role in the planning, monitoring and control of software system development.
Machine learning techniques are concerned with programs that learn relationships between the inputs and outputs of a system without requiring knowledge of their distributions. Such techniques have been found very useful in many applications, including some aspects of software engineering. In this paper our interest is in the relationships between the characteristics of software components and their criticality levels: the inputs are the component characteristics and the output is the criticality level. In most software engineering applications the underlying distributions are not known, and hence it is highly desirable to employ machine learning techniques to develop input-output relationships.
Two main classes of models used in software engineering applications are effort prediction [1] and module classification [2]. A number of studies have dealt with both types of models. In this paper we address the second class, viz., two-class classifier development using machine learning.
A large number of studies have been reported in the literature on the use of classification techniques for software components [see, e.g., 2,3,4,5,6,7]. In most of these, a classification model is built using the known values of the relevant software characteristics, called metrics, and the known class labels. Then, based on the metrics data of a previously unseen software component, the classification model is used to assess its class, viz. high or low criticality. A recent case study [2] provides a good review and summary of the main techniques used for this purpose.
Some of the commonly used techniques are listed below. For applications of these and related techniques in software engineering, see [7,8].
- Logistic regression
- Case-based reasoning (CBR)
- Classification and regression trees (CART)
- Quinlan's classification trees (C4.5, See5)
Each of these approaches, and others, has its own pros and cons. The usual criteria for evaluating classifiers are classification accuracy, time to learn, time to use, robustness, interpretability, and classifier complexity [9, 10]. For software applications, accuracy, robustness, interpretability and complexity are of primary concern.
In developing software classifiers, an important consideration is which metrics to use as features. A key goal of this paper is to evaluate the efficacy and accuracy of component classifiers that use only a few software metrics available at early stages of the software development life cycle. In particular, we are interested in developing good machine learning classifiers based on software design metrics and evaluating their efficacy relative to classifiers that employ coding metrics or combined design and coding metrics. We employ radial basis function (RBF) classifiers for three reasons: RBF is a versatile modeling technique by virtue of its separate treatment of the non-linear and linear mappings; it possesses the impressive mathematical properties of universal and best approximation [1, 11]; and our recent algebraic algorithm for RBF design [10,11] provides an easy-to-use and principled methodology for developing classifiers. Finally, we employ Gaussian kernels, as they are the most popular kernels in RBF applications and also possess the above mathematical properties.
This paper is organized as follows. In Section 2, we provide a brief description of the data set employed.
An outline of the RBF model and classification methodology is given in Section 3. The main results of the
empirical study are discussed in Section 4. Finally, some concluding remarks are presented in Section 5.
2. Data Description
The data for this study was obtained from a NASA software metrics database available in the public
domain. It provides various project and process measures for space related software systems. The particular
data set used here consists of eight systems with a total of 796 components. The number of components in a
system varies from 45 to 166 with an average of about 100 components per system. The size of the data set
is a little less than half a million program lines. Since our primary interest in this study is to evaluate
classification performance based on design and coding measures, we only considered the relevant metrics.
They are listed in Table 1. The first three metrics (x1, x2, x3) represent design measures and the last three
(x4, x5, x6) are coding measures. Two statistics, the average and the standard deviation, for each of these metrics are also listed in Table 1. The design metrics are available to software engineers much earlier in the life cycle than the coding metrics, so predictive models based on design metrics are more useful than those based on coding metrics, which only become available much later in the development life cycle. Typical values of the six metrics for five components in the database are shown in Table 2, along with the component class label, where 1 refers to high criticality and 0 to low criticality.
Table 1: List of Metrics

Variable   Description                           Avg      Std. dev
X1         Function calls from this component      9.51     11.94
X2         Function calls to this component        3.91     27.28
X3         Input/Output component parameters       8.45     23.37
X4         Size in lines                         257.49    171.22
X5         Comment lines                         138.94    102.22
X6         Number of decisions                    21.18     21.22
Y          Component class (0 or 1)
Table 2: Values of selected metrics for five components from the database.

Component    x1    x2    x3    x4    x5    x6    Class
1            16    39     1   689   388    25      1
2             0     4     0   361   198    25      0
3             5     0     3   276   222     2      0
4             4    45     1   635   305    32      0
5            38    41    22   891   407   130      1
The metrics data for software systems tend to be very skewed: a large number of components have small values and a few components have large values. This is also true for the data analyzed in this case study. In general, metrics data tend to follow skewed count distributions such as the Poisson. Some evidence of this is provided by the fact that, for several metrics in Table 1, the standard deviation is comparable to or larger than the average.
The classification of components into high-criticality (class 1) and low-criticality (class 0) was determined on the basis of the actual number of faults detected. Components with 5 or fewer faults were labeled class 0, and the others class 1. This classification resulted in about 25% of the components in class 1 and about 75% in class 0. The appropriate division into high- and low-criticality classes is determined by the application domain; however, the classification used here is quite typical of real-world applications.
The 796-component data set was randomly divided into three subsets: 398 components for classifier development (training), 199 for validation, and 199 for test. Further, ten random permutations of the data set were prepared to study the variability in classification accuracy and model complexity due to the random partitioning of the data into the three sets.
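Since the NASA data set itself is not reproduced here, the following Python sketch merely illustrates the labeling rule and the 398/199/199 partitioning described above; the array names, the seed handling, and the use of NumPy are illustrative assumptions rather than part of the original study.

    import numpy as np

    def label_and_split(X, faults, seed=0):
        """Label components by fault count and split into training/validation/test sets.

        X      : (796, 6) array of the metrics x1..x6 (hypothetical layout)
        faults : (796,) array of detected fault counts per component
        """
        # Components with more than 5 faults are labeled class 1 (high criticality),
        # the rest class 0, as described above.
        y = (faults > 5).astype(int)

        # One random permutation of the 796 components; different seeds give the
        # ten permutations used to study variability.
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(X))

        train, val, test = order[:398], order[398:597], order[597:]
        return (X[train], y[train]), (X[val], y[val]), (X[test], y[test])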
3. Classification Methodology
The objective of this study is to construct a model that captures an unknown input-output mapping on the basis of limited evidence about its nature. The evidence available is a set of labeled historical data, called the training set, denoted as

D = {(xi, yi), i = 1, …, n}.

Here both the d-dimensional inputs xi and their corresponding outputs yi are available, and the outputs represent the class label, zero or one. The RBF model expresses the mapping as a weighted sum of basis function outputs,

f(x) = Σj=1..m wj Φ(||x − μj|| / σj),

where Φ(·) is called a basis function and μj and σj are called the centre and the width of the jth basis function, respectively. Also, wj is the weight associated with the jth basis function output and m is the number of basis functions. For the common case where σj = σ, j = 1, …, m, the RBF model is fully determined by the parameter set P = (m, μ, σ, w). We employ Gaussian kernels in this study, so that

Φ(||x − μj|| / σ) = exp(−||x − μj||² / σ²).
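As a concrete illustration of this mapping, the sketch below evaluates the Gaussian RBF output and turns it into a class decision; the centres, width, weights and the 0.5 decision threshold are placeholders, since in the study these parameters are produced by the design algorithm of [11].

    import numpy as np

    def rbf_output(X, centers, sigma, weights):
        """Evaluate f(x) = sum_j w_j * exp(-||x - mu_j||^2 / sigma^2) for each row of X.

        X       : (n, d) input metric vectors
        centers : (m, d) basis function centres mu_j
        sigma   : common width sigma (the case sigma_j = sigma)
        weights : (m,) weights w_j
        """
        # Squared Euclidean distance between every input and every centre: shape (n, m)
        sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        Phi = np.exp(-sq_dist / sigma ** 2)   # Gaussian basis function outputs
        return Phi @ weights                  # linear combination of the basis functions

    def classify(X, centers, sigma, weights, threshold=0.5):
        """Assign class 1 (high criticality) when the RBF output exceeds the threshold
        (the threshold value is an assumption, not specified in the paper)."""
        return (rbf_output(X, centers, sigma, weights) >= threshold).astype(int)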
In practice, we seek a classifier that is neither too simple nor too complex. A classifier that is too simple will suffer from underfitting because it does not learn enough from the data and hence provides a poor fit. On the other hand, a classifier that is too complex will learn too many details of the data, including noise, and thus suffer from overfitting; it cannot provide good generalization on unseen data. Hence an important issue is to determine an appropriate number of basis functions (m) that represents a good compromise between these competing goals.
In this study we employed our recently developed algorithm [1, 11] to strike this compromise. In this algorithm, a complexity control parameter is specified by the user; we used a default value of 0.5 percent. The classifier width values depend on the dimensionality of the input data. The algorithm automatically computes the other model parameters m, μ and w. The mathematical and computational details of the algorithm are beyond the scope of this paper; the reader is referred to [11].
Classifiers for specified values of σ are developed from the training data, and the classification error (CE) of each model is computed as the fraction of incorrectly classified components. Next, for each model, the validation data are used to evaluate the validation error. The model with the smallest validation error is selected as the classifier. Finally, the classification error on the test set is computed; it provides an estimate of the classification error for future components, for which only the metrics data would be available and the class label would need to be determined.
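The algebraic design algorithm of [1, 11] is not reproduced in this paper, so the sketch below should be read only as an illustration of the train/validate/test protocol just described: a generic stand-in fit (randomly chosen centres with least-squares weights on standardized metrics) replaces the authors' algorithm, and the candidate widths are arbitrary assumptions.

    import numpy as np

    def fit_rbf(X, y, m, sigma, rng):
        """Stand-in RBF fit: m randomly chosen training points serve as centres and the
        weights are obtained by linear least squares. (The paper's algebraic algorithm
        determines m, mu and w quite differently.)"""
        centers = X[rng.choice(len(X), size=m, replace=False)]
        Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / sigma ** 2)
        weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
        return centers, weights

    def classification_error(X, y, centers, sigma, weights):
        """Classification error (CE): fraction of incorrectly classified components."""
        Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2) / sigma ** 2)
        return float(np.mean((Phi @ weights >= 0.5).astype(int) != y))

    def select_classifier(train, val, test, m=5, sigmas=(0.5, 1.0, 2.0, 4.0)):
        """Develop one classifier per candidate width, keep the one with the smallest
        validation error, and report its test error. Metrics are standardized with
        training-set statistics (a preprocessing assumption, not stated in the paper)."""
        (Xtr, ytr), (Xva, yva), (Xte, yte) = train, val, test
        mu, sd = Xtr.mean(axis=0), Xtr.std(axis=0)
        Xtr, Xva, Xte = (Xtr - mu) / sd, (Xva - mu) / sd, (Xte - mu) / sd

        rng = np.random.default_rng(0)
        best = None
        for sigma in sigmas:
            centers, weights = fit_rbf(Xtr, ytr, m, sigma, rng)
            val_err = classification_error(Xva, yva, centers, sigma, weights)
            if best is None or val_err < best[0]:
                best = (val_err, sigma, centers, weights)

        _, sigma, centers, weights = best
        return sigma, classification_error(Xte, yte, centers, sigma, weights)

Repeating such a selection for each of the ten permutations would yield the per-permutation results summarized in the next section.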
4. Classification Results
In this section we present and discuss the results for three experiments using (1) design metrics, (2) coding
metrics, and (3) combined design and coding metrics. In each case, we develop RBF classifiers according to the methodology described above: develop classifiers from the training data, select the best one based on the validation data, and estimate future performance using the test data. The model development algorithm
used in this study provides consistent models without any statistical variability. Therefore, we do not need
to perform multiple runs for a given permutation of the data set. In other words, the results for one run are
repeatable. The results for the ten permutations, of course, will vary due to different random assignments of
components amongst the training, validation and test sets.
In evaluating the results, we are primarily interested in classifier complexity and classification error. A commonly used measure of RBF model complexity is the number of basis functions (m), and it is used here as well. The results are presented below.
4.1 Design Metrics
Model complexity and classification errors based on the design metrics data (x1, x2, x3) are listed in Table 3. We note that m varies from 3 to 7 with an average value of 5.1. The training error varies from 21.6% to 27.1% with an average of 24.3%. The validation error varies from 21.1% to 29.2% with an average of 25.9%, and the test error varies from 21.6% to 28.1% with an average of 25.0%. Clearly, there is noticeable variation among the results from different permutations. A plot of the three error measures is given in Figure 1 and illustrates the variability seen in Table 3.
Table 3. Classification results for design metrics.

                         Classification Error (%)
Permutation    m     Training    Validation    Test
1              4       27.1         29.2       21.6
2              6       25.2         23.6       24.6
3              4       25.6         21.1       26.1
4              7       24.9         26.6       22.6
5              4       21.6         27.6       28.1
6              7       24.1         25.1       24.6
7              3       22.6         26.6       24.6
8              5       24.4         28.6       24.1
9              7       24.4         28.6       24.1
10             4       23.1         24.6       27.1
In classification studies, the test data accuracy is of primary interest, while the other values provide useful insights into the nature of the classifier and the model development process. Therefore, we concentrate here on the test error. Below, we compute confidence bounds for the true test error based on the results from the ten permutations. It should be noted that these confidence bounds are not the same as those that might be obtained from the data of a single permutation using techniques such as the bootstrap.
The standard deviation (SD) of the test errors in Table 3 is 1.97%. Using the well-known t-statistic, 95% confidence bounds for the true, unknown test error are given by
{Average Test Error ± t(9; 0.025) × (SD of Test Error) / √10}
= {24.95 ± 2.26 × 1.97 / √10}
= {23.60, 26.40}
These values are interpreted as follows: based on the test errors presented in Table 3, we are 95% confident that the true test error for the RBF model is between 23.6% and 26.4%. The 95% level is commonly used; bounds for other levels can easily be computed. For example, the 90% confidence bounds for the test error here are {23.81, 26.05}. The bounds get narrower as the confidence level decreases and wider as it increases.
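These bounds can be reproduced mechanically from the ten per-permutation test errors; a small sketch using SciPy's Student-t distribution is given below. The error values are the (rounded) test errors from Table 3, so the computed bounds may differ slightly from the figures quoted above.

    import numpy as np
    from scipy import stats

    def t_confidence_bounds(test_errors, confidence=0.95):
        """Two-sided t-based confidence bounds for the true test error, computed
        from per-permutation test errors (here, n = 10 permutations)."""
        errs = np.asarray(test_errors, dtype=float)
        n = len(errs)
        t_crit = stats.t.ppf(1 - (1 - confidence) / 2, df=n - 1)  # e.g. t(9; 0.025) ~ 2.26
        half_width = t_crit * errs.std(ddof=1) / np.sqrt(n)       # t * SD / sqrt(n)
        return errs.mean() - half_width, errs.mean() + half_width

    # Test errors (%) for the design-metric classifiers, as listed in Table 3:
    design_test_errors = [21.6, 24.6, 26.1, 22.6, 28.1, 24.6, 24.6, 24.1, 24.1, 27.1]
    print(t_confidence_bounds(design_test_errors, 0.95))
    print(t_confidence_bounds(design_test_errors, 0.90))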
4.2 Coding and Combined Measures
Since our focus is on model complexity and test error, we now present only these values in Table 4 for the coding measures (x4, x5, x6). Also given are the values for the combined (design plus coding) measures (x1 to x6). We note that model complexity here shows much more variability than in the case of the design measures. Standard deviations and confidence bounds for these cases were obtained as above. A summary of the test error results for the three cases is given in Table 5.
Table 4. Model complexity and test errors.

                Coding Measures            Combined Metrics
Permutation     m     Test Error (%)       m     Test Error (%)
1                2        14.6              6        20.6
2                2        26.1              2        26.1
3               14        24.1             17        26.1
4               22        18.6              6        21.1
5                4        26.1              4        27.6
6               12        25.1             16        24.6
7                4        23.6              6        22.6
8                9        23.6             16        24.6
9                3        23.6              6        22.6
10              21        24.6             28        27.6
Table 5. Summary of test error results.

Metrics             Average     SD      90% Confidence Bounds    95% Confidence Bounds
Design Metrics       24.95     1.97     {23.81, 26.05}           {23.60, 26.40}
Coding Metrics       23.00     3.63     {20.89, 25.11}           {21.40, 25.80}
Combined Metrics     24.35     2.54     {22.89, 25.81}           {22.55, 26.15}
4.3 Discussion
The results in Table 5 provide insights into the two issues addressed in this paper. The first issue, the variability due to the random assignment of components to the training, validation and test sets, is indicated by the widths of the confidence intervals. For both the 90% and 95% bounds, the widths are quite reasonable. This is especially noteworthy in light of the fact that software engineering data tend to be quite noisy. Further, the variability for the coding metrics alone is considerably higher than for the other two cases. Regarding the second issue, the relative classification accuracies for the three cases, it appears that classification based on the design metrics alone is comparable to the other two cases. Overall, a test error of about 24% is a reasonable summary for the data analyzed in this study. We note that this value of 24% is contained in all the confidence bounds listed in Table 5.
5. Concluding Remarks
In this paper we studied an important software engineering issue: identifying fault-prone components using selected software metrics as predictor variables. A software criticality evaluation model that is parsimonious and employs easily available metrics can be an effective analytical tool for reducing the likelihood of operational failures. However, the efficacy of such a tool depends on the mathematical properties of the classifier and its design algorithm. Therefore, in this study we employed Gaussian radial basis function classifiers, which possess the powerful properties of best and universal approximation. Further, our new design algorithm yields consistent results using only algebraic methods. In addition, we used ten input data permutations, with a classifier developed for each permutation, to compute confidence bounds for the test error. A comparison of these bounds for design, coding and combined metrics indicated that the errors in the three cases could be considered statistically equal. Based on the analyses presented here, and on the combination of model and algorithm properties, we believe we are able to establish, at least empirically, that design metrics, which are easily available early in the software development cycle, can be effective predictors of potentially critical components. Early attention and additional resources allocated to such components can be instrumental in minimizing operational failures and thus improving system quality.
References
1. Shin, M. and Goel, A. L.: Empirical Data Modeling in Software Engineering Using Radial Basis Functions. IEEE Transactions on Software Engineering, 28 (2002) 567-576.
2. Khoshgoftaar, T. and Seliya, N.: Comparative Assessment of Software Quality Classification Techniques: An Empirical Case Study. Empirical Software Engineering, 9 (2004) 229-257.
3. Khoshgoftaar, T., Yuan, X., Allen, E. B., Jones, W. D., and Hudepohl, J. P.: Uncertain Classification of Fault-Prone Software Modules. Empirical Software Engineering, 7 (2002) 297-318.
4. Lanubile, F., Lonigro, A., and Visaggio, G.: Comparing Models for Identifying Fault-Prone Software Components. 7th International Conference on Software Engineering and Knowledge Engineering, Rockville, Maryland, June 1995, 312-319.
5. Pedrycz, W.: Computational Intelligence as an Emerging Paradigm of Software Engineering. 14th International Conference on Software Engineering and Knowledge Engineering, Ischia, Italy, July 2002, 7-14.
6. Pighin, M. and Zamolo, R.: A Predictive Metric Based on Discriminant Statistical Analysis. International Conference on Software Engineering, Boston, MA, 1997, 262-270.
7. Zhang, D. and Tsai, J. J. P.: Machine Learning and Software Engineering. Software Quality Journal, 11 (2003) 87-119.
8. Shull, F., Mendonça, M. G., Basili, V., Carver, J., Maldonado, J. C., Fabbri, S., Travassos, G. H., and Ferreira, M. C.: Knowledge-Sharing Issues in Experimental Software Engineering. Empirical Software Engineering, 9 (2004) 111-137.
9. Goel, A. L. and Shin, M.: Tutorial on Software Models and Metrics. International Conference on Software Engineering, Boston, MA (1997).
10. Han, J. and Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
11. Shin, M. and Goel, A. L.: Design and Evaluation of RBF Models Based on RC Criterion. Technical Report, Syracuse University, 2003.