
Using Machine Learning for Software Criticality Evaluation

Miyoung Shin(1) and Amrit L. Goel(2)

(1) Bioinformatics Team, Future Technology Research Division, ETRI, Daejeon 305-350, Korea; shinmy@etri.re.kr
(2) Dept. of Electrical Engineering and Computer Science, Syracuse University, Syracuse, New York 13244, USA; goel@ecs.syr.edu

Abstract. During software development, early identification of critical components is necessary to allocate adequate resources and ensure high quality of the delivered system. The purpose of this paper is to describe a methodology for modeling component criticality based on its software characteristics. This involves developing a relationship between the input and output variables without knowledge of their joint probability distribution. Machine learning techniques have been remarkably successful in such applications, and one such technique is employed here. In particular, we use software component data from the NASA metrics database and develop radial basis function classifiers using our new algebraic algorithm for determining the model parameters. Using a principled approach for classifier development and evaluation, we show that design characteristics can yield parsimonious classifiers with impressive classification performance on test data.

1. Introduction

In spite of much progress in the theory and practice of software engineering, software systems continue to suffer from cost overruns, schedule slips and poor quality. Due to the lack of a sound mathematical basis for this discipline, empirical models play a crucial role in planning, monitoring and control during software system development. Machine learning techniques address programs that learn relationships between the inputs and outputs of a system without involving their distributions. Such techniques have been found very useful in many applications, including some aspects of software engineering.
In this paper our interest is in relationships between the characteristics of software components and their criticality levels. Here the inputs are module characteristics and the output is the module's criticality level. In most software engineering applications the underlying distributions are not known, and hence it is highly desirable to employ machine learning techniques to develop input-output relationships. Two main classes of models used in software engineering applications are effort prediction [1] and module classification [2], and a number of studies have dealt with both types. In this paper we address the second class, viz. two-class classifier development using machine learning.

A large number of studies have been reported in the literature on the use of classification techniques for software components [see, e.g., 2,3,4,5,6,7]. In most of these, a classification model is built using the known values of the relevant software characteristics, called metrics, and known class labels. Then, based on metrics data about a previously unseen software component, the classification model is used to assess its class, viz. high or low criticality. A recent case study [2] provides a good review and summary of the main techniques used for this purpose. Some of the commonly used techniques are listed below. For applications of these and related techniques in software engineering, see [7,8].

- Logistic regression
- Case based reasoning (CBR)
- Classification and regression trees (CART)
- Quinlan's classification trees (C4.5, See5.0)

Each of these, and other approaches, has its own pros and cons. The usual criteria for evaluating classifiers are classification accuracy, time to learn, time to use, robustness, interpretability, and classifier complexity [9,10]. For software applications, accuracy, robustness, interpretability and complexity are of primary concern. In developing software classifiers, an important consideration is what metrics to use as features.
An important goal of this paper is to evaluate the efficacy and accuracy of component classifiers that use only a few software metrics available at early stages of the software development life cycle. In particular, we are interested in developing good machine learning classifiers based on software design metrics and evaluating their efficacy relative to those that employ coding metrics or combined design and coding metrics. We employ radial basis function (RBF) classifiers for three reasons: RBF is a versatile modeling technique by virtue of its separate treatment of nonlinear and linear mapping; it possesses the impressive mathematical properties of universal and best approximation [1,11]; and our recent algebraic algorithm for RBF design [10,11] provides an easy-to-use and principled methodology for developing classifiers. Finally, we employ Gaussian kernels, as they are the most popular kernels in RBF applications and also possess the above mathematical properties.

This paper is organized as follows. In Section 2, we provide a brief description of the data set employed. An outline of the RBF model and classification methodology is given in Section 3. The main results of the empirical study are discussed in Section 4. Finally, some concluding remarks are presented in Section 5.

2. Data Description

The data for this study was obtained from a NASA software metrics database available in the public domain. It provides various project and process measures for space-related software systems. The particular data set used here consists of eight systems with a total of 796 components. The number of components in a system varies from 45 to 166, with an average of about 100 components per system. The size of the data set is a little less than half a million program lines. Since our primary interest in this study is to evaluate classification performance based on design and coding measures, we considered only the relevant metrics. They are listed in Table 1.
The first three metrics (x1, x2, x3) represent design measures and the last three (x4, x5, x6) are coding measures. Two statistics, average and standard deviation, for each of these metrics are also listed in Table 1. The design metrics are available to software engineers much earlier in the life cycle than the coding metrics, so predictive models based on design metrics are more useful than those based on, say, coding metrics that become available much later in the development life cycle. Typical values of the six metrics for five components in the database are shown in Table 2, along with the component class label, where 1 refers to high criticality and 0 to low criticality.

Table 1: List of metrics.

Variable | Description                        | Avg    | Std. dev
x1       | Function calls from this component | 9.51   | 11.94
x2       | Function calls to this component   | 3.91   | 27.28
x3       | Input/output component parameters  | 8.45   | 23.37
x4       | Size in lines                      | 257.49 | 138.94
x5       | Comment lines                      | 171.22 | 102.22
x6       | Number of decisions                | 21.18  | 21.22
y        | Component class (0 or 1)           |        |

Table 2: Values of selected metrics for five components from the database.

Component | x1 | x2 | x3 | x4  | x5  | x6  | Class
1         | 16 | 39 | 1  | 689 | 388 | 25  | 1
2         | 0  | 4  | 0  | 361 | 198 | 25  | 0
3         | 5  | 0  | 3  | 276 | 222 | 2   | 0
4         | 4  | 45 | 1  | 635 | 305 | 32  | 0
5         | 38 | 41 | 22 | 891 | 407 | 130 | 1

Metrics data for software systems tend to be very skewed, so that a large number of components have small values and a few components have large values. This is also true for the data analyzed in this case study. In general, metrics data tend to follow heavily skewed distributions, such as the exponential, for which the mean equals the standard deviation. Some evidence in support of this observation is provided by the fact that for several metrics in Table 1 the average is almost equal to the standard deviation. The classification of modules into high criticality (class 1) and low criticality (class 0) was determined on the basis of the actual number of faults detected: modules with 5 or fewer faults were labeled class 0 and the others class 1.
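The fault-based labeling rule can be expressed in a few lines. This is a minimal sketch; the fault counts below are illustrative, not values taken from the NASA data set.

```python
# Illustrative fault counts for ten hypothetical components
# (not actual NASA data).
fault_counts = [0, 2, 7, 12, 5, 1, 9, 3, 6, 0]

# Components with 5 or fewer detected faults -> class 0 (low criticality),
# otherwise -> class 1 (high criticality).
labels = [0 if faults <= 5 else 1 for faults in fault_counts]

print(labels)                      # class label per component
print(sum(labels) / len(labels))   # fraction of high-criticality components
```

In the actual data set this rule placed about a quarter of the 796 components in class 1.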
This classification resulted in about 25% of the components in class 1 and about 75% in class 0. The division into high- and low-criticality classes is determined by the application domain; however, the classification used here is quite typical of real-world applications. The 796-module data set was randomly divided into three subsets: 398 components for classifier development (training), 199 for validation, and 199 for test. Further, ten random permutations of the data set were prepared to study the variability in classification accuracy and model complexity due to the random partitioning of the data into the three sets.

3. Classification Methodology

The objective of this study is to construct a model that captures an unknown input-output mapping on the basis of limited evidence about its nature. The evidence available is a set of labeled historic data, called the training set, denoted as D = {(xi, yi), i = 1, ..., n}. Here both the d-dimensional inputs xi and their corresponding outputs yi are available, and the outputs represent the class label, zero or one. The RBF model takes the form

f(x) = Σ (j = 1 to m) wj Φ(||x − μj|| / σj),

where Φ(.) is called a basis function, and μj and σj are called the center and the width of the jth basis function, respectively. Also, wj is the weight associated with the jth basis function output and m is the number of basis functions. For the common case where σj = σ, j = 1, ..., m, the RBF model is fully determined by the parameter set P = (m, μ, σ, w). We employ Gaussian kernels in this study, so that

Φ(||x − μj|| / σ) = exp(−||x − μj||² / σ²).

In practice, we seek a classifier that is neither too simple nor too complex. A classifier that is too simple suffers from underfitting because it does not learn enough from the data and hence provides a poor fit. On the other hand, a classifier that is too complex learns too many details of the data, including even noise, and thus suffers from overfitting; it cannot provide good generalization on unseen data.
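As a concrete illustration, the Gaussian RBF model above can be sketched as follows. The centers, weights and width used here are made-up values for demonstration, not parameters produced by the authors' algorithm.

```python
import numpy as np

def rbf_output(x, centers, sigma, weights):
    """Gaussian RBF model: f(x) = sum_j w_j * exp(-||x - mu_j||^2 / sigma^2)."""
    dists_sq = np.sum((centers - x) ** 2, axis=1)  # squared distance to each center
    return float(weights @ np.exp(-dists_sq / sigma ** 2))

def classify(x, centers, sigma, weights, threshold=0.5):
    # Threshold the real-valued model output to obtain a 0/1 class label.
    return int(rbf_output(x, centers, sigma, weights) >= threshold)

# Tiny illustrative example: m = 2 basis functions, d = 3 input metrics.
centers = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
weights = np.array([0.9, 0.1])
print(classify(np.array([0.1, 0.0, 0.1]), centers, 1.0, weights))  # prints 1
```

The input lies close to the first center, whose weight dominates, so the output exceeds the threshold and the component is labeled class 1.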
Hence an important issue is to determine an appropriate number of basis functions (m) that represents a good compromise between these competing goals. In this study we employed our recently developed algorithm [1,11] for this purpose. In this algorithm, a complexity control parameter is specified by the user; we used a default value of 0.5 percent. The classifier width values depend on the dimensionality of the input data. The algorithm automatically computes the other model parameters m, μ and w. The mathematical and computational details of the algorithm are beyond the scope of this paper; for details, the reader is referred to [11].

Classifiers for specified values of σ are developed on the training data, and the classification error (CE) of each model is computed as the fraction of incorrectly classified components. Next, for each model, the validation data is used to evaluate the validation error. The model with the smallest validation error is the selected classifier. Finally, the classification error on the test set is computed; it provides an estimate of the classification error for future components, for which only the metrics data would be available and the class label would need to be determined.

4. Classification Results

In this section we present and discuss the results of three experiments using (1) design metrics, (2) coding metrics, and (3) combined design and coding metrics. In each case, we develop RBF classifiers according to the methodology described above, viz. develop classifiers on the training data, select the best one based on the validation data, and estimate future performance using the test data. The model development algorithm used in this study provides consistent models without any statistical variability; therefore, we do not need to perform multiple runs for a given permutation of the data set. In other words, the results for one run are repeatable.
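The validation-based selection of σ described in Section 3 can be sketched as below. Here `fit_rbf` is a hypothetical stand-in for the authors' algebraic training algorithm: any callable that, given training data and a width σ, returns a trained classifier with a `.predict(X)` method.

```python
import numpy as np

def classification_error(y_true, y_pred):
    """Fraction of incorrectly classified components."""
    return float(np.mean(np.asarray(y_true) != np.asarray(y_pred)))

def select_classifier(fit_rbf, X_train, y_train, X_val, y_val, sigma_grid):
    """Train one classifier per candidate sigma; keep the one with the
    smallest validation error, as described in Section 3."""
    best = None
    for sigma in sigma_grid:
        model = fit_rbf(X_train, y_train, sigma)
        val_err = classification_error(y_val, model.predict(X_val))
        if best is None or val_err < best[1]:
            best = (model, val_err)
    return best  # (selected classifier, its validation error)
```

The test error of the selected classifier would then be computed once, on the held-out test set, as the estimate of future performance.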
The results for the ten permutations, of course, will vary due to the different random assignments of components among the training, validation and test sets. In evaluating the results, we are primarily interested in classifier complexity and classification error. A commonly used measure of RBF model complexity is the number of basis functions (m), and it is also used here. The results are presented below.

4.1 Design Metrics

Model complexity and classification errors based on the design metrics (x1, x2, x3) are listed in Table 3. We note that m varies from 3 to 7 with an average value of 5.1. The training error varies from 21.6% to 27.1% with an average of 24.3%. The validation error varies from 21.1% to 29.2% with an average of 25.9%, and the test error varies from 21.6% to 28.1% with an average of 25.0%. Clearly, there is noticeable variation among the results from different permutations. A plot of the three error measures is given in Figure 1 and illustrates the variability seen in Table 3.

Table 3: Classification results for design metrics.

            |   |        Classification Error (%)
Permutation | m | Training | Validation | Test
1           | 4 | 27.1     | 29.2       | 21.6
2           | 6 | 25.2     | 23.6       | 24.6
3           | 4 | 25.6     | 21.1       | 26.1
4           | 7 | 24.9     | 26.6       | 22.6
5           | 4 | 21.6     | 27.6       | 28.1
6           | 7 | 24.1     | 25.1       | 24.6
7           | 3 | 22.6     | 26.6       | 24.6
8           | 5 | 24.4     | 28.6       | 24.1
9           | 7 | 24.4     | 28.6       | 24.1
10          | 4 | 23.1     | 24.6       | 27.1

In classification studies, the test data accuracy is of primary interest, while the other values provide useful insights into the nature of the classifier and the model development process. Therefore, we concentrate only on the test error. Below, we compute confidence bounds for the true test error based on the results from the ten permutations. It should be noted that these confidence bounds are not the same as those that might be obtained from the data of a single permutation using techniques such as the bootstrap. The standard deviation (SD) of the test errors in Table 3 is 1.97%.
Using the well-known t-statistic, 95% confidence bounds for the true, unknown test error are given by

{Average test error ± t(9; 0.025) × (SD of test error) / √10} = {24.95 ± 2.26 × 1.97 / √10} = {23.60, 26.40}.

The above values are interpreted as follows: based on the test errors presented in Table 3, we are 95% confident that the true test error for the RBF model is between 23.6% and 26.4%. The 95% level is a commonly used value; bounds for other levels can be easily computed. For example, the 90% confidence bounds for the test error here are {23.81, 26.05}. The bounds get narrower as the confidence level decreases and wider as it increases.

4.2 Coding and Combined Measures

Since our focus is on model complexity and test error, we present only these values in Table 4 for the coding measures (x4, x5, x6). Also given are the values for the combined (design plus coding) measures (x1 to x6). We note that model complexity here has much more variability than in the case of the design measures. Standard deviations and confidence bounds for these cases were obtained as above. A summary of the test error results for the three cases is given in Table 5.

Table 4: Model complexity and test errors.

            |    Coding Measures  |   Combined Metrics
Permutation | m  | Test Error (%) | m  | Test Error (%)
1           | 2  | 14.6           | 6  | 20.6
2           | 2  | 26.1           | 2  | 26.1
3           | 14 | 24.1           | 17 | 26.1
4           | 22 | 18.6           | 6  | 21.1
5           | 4  | 26.1           | 4  | 27.6
6           | 12 | 25.1           | 16 | 24.6
7           | 4  | 23.6           | 6  | 22.6
8           | 9  | 23.6           | 16 | 24.6
9           | 3  | 23.6           | 6  | 22.6
10          | 21 | 24.6           | 28 | 27.6

Table 5: Summary of test error results.

Metrics          | Average | SD   | 90% Bounds     | 95% Bounds
Design metrics   | 24.95   | 1.97 | {23.81, 26.05} | {23.60, 26.40}
Coding metrics   | 23.00   | 3.63 | {20.89, 25.11} | {21.40, 25.80}
Combined metrics | 24.35   | 2.54 | {22.89, 25.81} | {22.55, 26.15}

4.3 Discussion

The results in Table 5 provide insights into the two issues addressed in this paper.
The first issue, the variability due to the random assignment of components to the training, validation and test sets, is indicated by the widths of the confidence intervals. For both the 90% and 95% bounds, the widths are quite reasonable. This is especially noticeable in light of the fact that software engineering data tend to be quite noisy. Further, the variability for the coding metrics alone is considerably higher than for the other two cases. Regarding the second issue, the relative classification accuracies of the three cases, it appears that classification based on the design metrics alone is comparable to the other two cases. Overall, a test error of about 24% is a reasonable summary for the data analyzed in this study; we note that this value is contained in all the confidence bounds listed in Table 5.

5. Concluding Remarks

In this paper we studied an important software engineering issue: identifying fault-prone components using selected software metrics as predictor variables. A software criticality evaluation model that is parsimonious and employs easily available metrics can be an effective analytical tool for reducing the likelihood of operational failures. However, the efficacy of such a tool depends on the mathematical properties of the classifier and its design algorithm. Therefore, in this study we employed Gaussian radial basis function classifiers, which possess the powerful properties of best and universal approximation. Further, our new design algorithm yields consistent results using only algebraic methods. Finally, we used input data permutations and classifiers for each permutation to compute confidence bounds for the test error.
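The confidence-bound computation used in Section 4.1 can be reproduced with a short script. Note that the mean and SD recomputed from the tabulated test errors come out slightly lower (about 24.75 and 1.94) than the 24.95 and 1.97 quoted in the text, presumably due to rounding or transcription in the tabulated entries.

```python
import math

# Test errors for the ten permutations of the design-metrics experiment (Table 3).
test_errors = [21.6, 24.6, 26.1, 22.6, 28.1, 24.6, 24.6, 24.1, 24.1, 27.1]

n = len(test_errors)
mean = sum(test_errors) / n
# Sample standard deviation (n - 1 in the denominator).
sd = math.sqrt(sum((e - mean) ** 2 for e in test_errors) / (n - 1))
t_crit = 2.262  # two-sided 95% critical value of t with 9 degrees of freedom

half_width = t_crit * sd / math.sqrt(n)
print(f"mean = {mean:.2f}, SD = {sd:.2f}")
print(f"95% bounds: {{{mean - half_width:.2f}, {mean + half_width:.2f}}}")
```

The 90% bounds follow the same formula with t(9; 0.05) ≈ 1.833 in place of 2.262.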
A comparison of these bounds based on the design, coding and combined metrics indicated that the errors in the three cases can be considered statistically equal. Based on the analyses presented here, and on the combination of model and algorithm properties, we believe we were able to establish, at least empirically, that design metrics, which are easily available early in the software development cycle, can be effective predictors of potentially critical components. Early attention and additional resources allocated to such components can be instrumental in minimizing operational failures and thus improving system quality.

References

1. Shin, M. and Goel, A.: Empirical Data Modeling in Software Engineering using Radial Basis Functions. IEEE Transactions on Software Engineering, 28 (2002) 567-576.
2. Khoshgoftaar, T. and Seliya, N.: Comparative Assessment of Empirical Software Quality Classification Techniques: An Empirical Case Study. Empirical Software Engineering, 9 (2004) 229-257.
3. Khoshgoftaar, T., Yuan, X., Allen, E. B., Jones, W. D., and Hudepohl, J. P.: Uncertain Classification of Fault-Prone Software Modules. Empirical Software Engineering, 7 (2002) 297-318.
4. Lanubile, F., Lonigro, A., and Visaggio, G.: Comparing Models for Identifying Fault-Prone Software Components. 7th International Conference on Software Engineering and Knowledge Engineering, Rockville, Maryland, June 1995, 312-319.
5. Pedrycz, W.: Computational Intelligence as an Emerging Paradigm of Software Engineering. Fourteenth International Conference on Software Engineering and Knowledge Engineering, Ischia, Italy, July 2002, 7-14.
6. Pighin, M. and Zamolo, R.: A Predictive Metric based on Discriminant Statistical Analysis. International Conference on Software Engineering, Boston, MA, 1997, 262-270.
7. Zhang, D. and Tsai, J. J. P.: Machine Learning and Software Engineering. Software Quality Journal, 11 (2003) 87-119.
8. Shull, F., Mendonca, M. G., Basili, V., Carver, J., Maldonado, J. C., Fabbri, S., Travassos, G. H., and Ferreira, M. C.: Knowledge-Sharing Issues in Experimental Software Engineering. Empirical Software Engineering, 9 (2004) 111-137.
9. Goel, A. L. and Shin, M.: Tutorial on Software Models and Metrics. International Conference on Software Engineering, Boston, MA (1997).
10. Han, J. and Kamber, M.: Data Mining. Morgan Kaufmann (2001).
11. Shin, M. and Goel, A.: Design and Evaluation of RBF Models based on RC Criterion. Technical Report, Syracuse University (2003).