Associative Classifier for Software Fault Tolerance in presence of Class Imbalance

Vijaya Bharathi Manjeti
GMR Institute of Technology, Rajam, Srikakulam
[email protected]

Sireesha Rodda
GITAM Institute of Technology, GITAM University, Visakhapatnam
[email protected]
ABSTRACT
Software fault prediction is crucial for reducing the overall cost of developing a software product and for assuring the quality of the finished product. Different software quality models based on data mining techniques exist for identifying fault-prone software modules. However, the presence of the class imbalance problem reduces the overall quality of the developed software product. This paper addresses the effects of class imbalance on classification algorithms intended to perform software fault prediction. An ensemble-based classifier is proposed to mitigate the effects of class imbalance. As the results demonstrate, this classifier learns defect prediction efficiently.
Keywords- Software defect prediction, class imbalance
learning, ensemble classifiers.
1.INTRODUCTION
The presence of software faults can prove expensive during software development, in terms of both quality and cost [1]. The conventional process of manual software reviews and testing activities can only detect 60% of the faults [2]. Menzies et al. [3] found that defect predictors can increase the probability of detection to 71%.
Various Machine learning and statistical
approaches have been investigated for Software
Defect Prediction. Classification is a popular
option for performing software defect prediction.
The classification algorithm identifies which modules are more prone to defects, based on a classifier developed from existing data culled from previous development projects.
Association Mining (AM)[4] refers to the task of
finding the complete set of frequent itemsets from
which class association rules are generated based
on their association with the pertinent class labels.
Associative Classification [5] treats the set of features as itemsets and applies Association Mining techniques to discover the set of frequent itemsets that occur in the training dataset, subject to a user-specified minimum support threshold. An associative classifier uses the Class Association Rules (CARs) generated by Association Mining to predict the class label of an unseen instance. Once the classification model is built using CARs, it is evaluated on the test data. Associative Classifiers have been shown to perform better than other classifiers, and the rules generated by the classifier are understandable to the human user.
Software Defect Prediction features an imbalance between the defect and non-defect class labels of the dataset. Generally, the number of non-defect samples (majority class) is much larger than that of defective ones (minority class). This imbalanced distribution of data contributes to the poor performance of the classifier, negatively affecting the classification of defective samples. Arunasalam et al. [6] prove that accuracy is not a suitable metric for evaluating the efficiency of a classifier, particularly when it concerns imbalanced data. They also prove that the support-confidence framework is biased towards the majority class.
The presence of class imbalance in Software Defect Prediction demands that greater importance be given to identifying minority class elements, even at the cost of accuracy. Therefore, specialized techniques custom-made for imbalanced data must be used. This paper uses a Partition-based Associative Classification technique for handling imbalanced datasets.
The rest of the paper is organized as follows.
Section 2 provides a brief review of the recent
developments in software defect prediction for
imbalanced datasets. Section 3 discusses the
methodology and algorithm for Partition based
Associative Classifier. Section 4 discusses the evaluation metrics used for comparing the performance of various classifiers with respect to imbalanced datasets. Section 5 presents and
analyzes the results obtained while comparing them
with the performance of other classifiers. Section 6
presents conclusions.
2. RELATED WORK
Existing works on Software Defect Prediction (SDP) have been based on classification algorithms such as Naïve Bayes [7], Decision Trees [8], Random Forest [9], AdaBoost [10], and Neural Networks [11]. Naïve Bayes is the simplest form of Bayesian classification. Bayesian classification develops probabilistic models which best fit the training data; the networks thus learn to approximate the dependency patterns in the data using probabilities. In software defect prediction in the presence of imbalanced data, the dependency patterns in the rare class are not significant and are usually insufficient to encode into the networks. Hence, small classes are often misclassified by Bayesian classification.
A decision tree uses a tree-like data structure where the non-leaf nodes are labeled with attributes, the arcs out of a node labeled by a given attribute are each labeled with a possible value of that attribute, and the leaf nodes are labeled with the classes, indicating whether the current module is fault-prone or not. In the presence of imbalanced data, pruning might remove the branches predicting the minority class, and the class label might then be relabeled to the majority class. Pruning is based on prediction error; to reduce the error rate, pruning might remove the branches leading to the minority class. The stopping criterion also might not allow the decision tree to grow until the minority class instances are detected. Hence, decision trees cannot handle imbalanced datasets successfully.
Other conventional classifiers based on accuracy or on reducing the error rate ignore the classification rules pertaining to the minority class. It has been observed that the imbalanced distribution between fault-prone and non-faulty modules can degrade a classifier's performance. Some researchers have attempted to use class imbalance learning based approaches to alleviate this effect. Menzies et al. [12] used undersampling to reduce the set of non-faulty modules to the same size as that of the fault-prone modules. Ensemble-based algorithms and cost-sensitive learning algorithms have also been proposed to alleviate the effect of class imbalance on SDP [13, 14]. This paper investigates an ensemble-based Associative Classifier for handling class imbalance; the methodology is presented in the next section.

3.METHODOLOGY
This paper uses Partition-based Associative
Classification framework for performing SDP of
imbalanced datasets. The dataset is divided into two partitions based on the class label: a Majority Partition (non-faulty modules) and a Minority Partition (fault-prone modules). There is no need to represent the class attribute in either partition. Locally frequent itemsets are then generated using any Frequent Itemset Mining algorithm; in this paper, the Apriori algorithm was used to generate frequent itemsets. The minimum support threshold is specified by the user.
As the majority and minority samples are
considered independently, all the locally significant
itemsets which pass the percentage minimum
support threshold will be generated. This results in
the generation of the frequent itemsets of the
majority partition and frequent itemsets of the
minority partition.
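A minimal sketch of this partition-and-mine step (the attribute items are toy placeholders, not actual PROMISE attributes, and all names are ours, not the paper's):

```python
def partition(dataset):
    """Split labeled transactions into majority/minority partitions,
    dropping the class label from each transaction."""
    parts = {}
    for items, label in dataset:
        parts.setdefault(label, []).append(frozenset(items))
    maj = max(parts, key=lambda c: len(parts[c]))
    p_maj = parts.pop(maj)
    p_min = [t for ts in parts.values() for t in ts]
    return p_maj, p_min

def apriori(transactions, min_sup):
    """Return locally frequent itemsets with support >= min_sup (a fraction)."""
    n = len(transactions)
    support = lambda iset: sum(iset <= t for t in transactions) / n
    items = {frozenset([i]) for t in transactions for i in t}
    freq = {}
    level = {i for i in items if support(i) >= min_sup}
    while level:
        freq.update({i: support(i) for i in level})
        # candidate generation: join frequent k-itemsets into (k+1)-itemsets
        cands = {a | b for a in level for b in level if len(a | b) == len(a) + 1}
        level = {c for c in cands if support(c) >= min_sup}
    return freq

data = [({"loc_high", "cc_high"}, "defect"),
        ({"loc_low", "cc_low"}, "clean"),
        ({"loc_low", "cc_low"}, "clean"),
        ({"loc_high", "cc_low"}, "clean")]
p_maj, p_min = partition(data)
f_maj = apriori(p_maj, 0.5)   # locally frequent in the non-faulty partition
f_min = apriori(p_min, 0.5)   # locally frequent in the fault-prone partition
```

Because each partition is mined independently, an itemset that is rare globally can still pass the local threshold inside the minority partition, which is the point of the partitioned design.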
The frequent itemsets directly represent CARs, as the right-hand side of each CAR is simply the label of the partition from which the frequent itemset was generated. In the classification phase, the number of occurrences of each frequent itemset in the partition other than the one it was generated from is calculated. Using this information, the Complement Class Support [6], Confidence, and Strength Score [6] of every rule are computed. While classifying a test instance, all the rules belonging to both the majority class and the minority class are found. The percentage of the minority class with respect to the majority class is captured as a constant 'k', using which a Scoring Function [6] parameter is calculated. If this value is greater than a user-specified threshold, the test instance is assigned to the minority class; otherwise it is assigned to the majority class. The details of the algorithm, in terms of two phases, a Learning Phase and a Classification Phase, are given below.
3.1 Algorithm
Learning Phase:
1. The majority (or negative) and minority (or positive) class labels of the dataset are identified depending on their frequency of occurrence.
2. The training dataset is then divided into two partitions: Pmaj (the training instances belonging to the majority class) and Pmin (the training instances belonging to the minority class). Within each partition, each training instance is represented as a transaction after removing its class label.
3. Locally frequent itemsets are generated for every partition using the Apriori algorithm. Let Ai be an itemset belonging to the partition with class label Cj. The Class Support of Ai is calculated using the following equation:
ClSup(Ai → Cj) = σ(Ai ∪ Cj) / σ(Cj)    Eq.(1)
The local support of an itemset corresponds to the
fraction of instances containing the itemset in that
partition. If the support of an itemset is greater than
some user-defined threshold, it is considered to be
frequent.
4. Once frequent itemsets are identified, generation of CARs is straightforward: the right-hand side of the CAR is the class label of the partition currently being used, and its left-hand side is the locally frequent itemset.
5. Load the partition Pmin into main memory. For each frequent itemset in Pmaj, find its conditional support count in Pmin. Using that, the global frequency of the itemset in the training dataset can be found. This value is used to obtain Confidence (Conf), Complement Class Support (CCS), and Strength Score (SS) using the formulas given below.
CCS(Ai → Cj) = σ(Ai ∪ ¬Cj) / σ(¬Cj)    Eq.(2)

Conf(Ai → Cj) = σ(Ai ∪ Cj) / σ(Ai)    Eq.(3)

SS(Ai → Cj) = (Conf(Ai → Cj) × ClSup(Ai → Cj)) / (CCS(Ai → Cj) + t), where t = 0.01    Eq.(4)

Strength Score represents the accuracy with which Ai indicates membership of the class Cj.
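As a concrete illustration of Eqs. (1)–(4), the rule statistics can be computed directly from itemset counts in the two partitions. This is a minimal sketch; the function, variable, and toy item names are ours, not the paper's:

```python
def rule_stats(itemset, own_part, other_part, t=0.01):
    """Compute ClSup, Conf, CCS and Strength Score for a CAR
    itemset -> label(own_part), per Eqs. (1)-(4)."""
    own_count = sum(itemset <= tr for tr in own_part)      # sigma(Ai U Cj)
    other_count = sum(itemset <= tr for tr in other_part)  # sigma(Ai U not-Cj)
    clsup = own_count / len(own_part)                      # Eq. (1)
    ccs = other_count / len(other_part)                    # Eq. (2)
    conf = own_count / (own_count + other_count)           # Eq. (3), sigma(Ai) is global
    ss = conf * clsup / (ccs + t)                          # Eq. (4)
    return clsup, conf, ccs, ss

# toy partitions: Pmin = fault-prone transactions, Pmaj = non-faulty ones
p_min = [frozenset({"loc_high", "cc_high"}), frozenset({"loc_high", "cc_low"})]
p_maj = [frozenset({"loc_low", "cc_low"})] * 8
clsup, conf, ccs, ss = rule_stats(frozenset({"loc_high"}), p_min, p_maj)
```

The smoothing constant t keeps the Strength Score finite when a rule never occurs in the complement partition (CCS = 0), which is exactly the case for strongly discriminative minority rules.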
6. Repeat step 5 after loading partition Pmaj, to find the global counts of the frequent itemsets in Pmin.
Classification Phase:
7. For every test instance, find the set of CARs applicable from the majority class (negative class) and the set of CARs applicable from the minority class (positive class).
Calculate the Scoring Function [6]:

S = (k · Σ_{i∈pos} SS_i^pos) / (k · Σ_{i∈pos} SS_i^pos + Σ_{i∈neg} SS_i^neg)    Eq.(5)

where k = ∛(Per(Cmajority) / Per(Cminority)). If k > 1, this value of k is substituted; otherwise k = 1 is substituted. S ∈ [0, 1]. If the 'S' value of the test instance is greater than some cutoff value, the minority class label is assigned to the instance; otherwise the majority class label is assigned. Values of S close to one thus indicate the minority class.
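A sketch of the classification decision of Eq. (5), assuming the cube-root form of k given above; the function name, the example Strength Scores, and the 0.5 cutoff are illustrative only:

```python
def score(pos_ss, neg_ss, per_majority, per_minority):
    """Scoring function of Eq. (5): S in [0, 1]; values near 1
    suggest the minority (positive) class."""
    k = (per_majority / per_minority) ** (1 / 3)   # weight favouring minority rules
    k = max(k, 1.0)                                # k = 1 is used when the ratio is small
    num = k * sum(pos_ss)
    denom = num + sum(neg_ss)
    return num / denom if denom else 0.0

# Strength Scores of the matching minority (pos) and majority (neg) CARs
s = score(pos_ss=[3.2, 1.1], neg_ss=[0.9, 0.4, 0.2],
          per_majority=0.93, per_minority=0.07)
label = "fault-prone" if s > 0.5 else "non-faulty"   # user-specified cutoff
```

Weighting only the minority-rule side by k is what lets a handful of fault-prone rules outvote a larger set of weak majority rules on skewed data.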
4.PERFORMANCE METRICS
While learning from an extremely imbalanced dataset, overall accuracy is not an appropriate measure of performance: a classifier which predicts every test instance as the majority class can still achieve high accuracy. [6, 15] show that accuracy is not a proper metric for evaluating classifiers on imbalanced datasets, and [16, 17] discuss that Precision, Recall, and F-measure are the metrics commonly used to evaluate imbalanced dataset classification models. Hence, the proposed classifier's performance is analyzed using Classification Accuracy, Precision, Recall, True Positive Rate, False Positive Rate, and F-measure, which are defined in terms of the entries of the confusion matrix shown in Table I. The rows of the confusion matrix correspond to actual classes while the columns correspond to predicted classes. In the test set, let 'P' indicate the test instances belonging to the positive class and 'N' the test instances belonging to the negative class. According to Table I, the performance metrics are defined as follows:
Accuracy = (TP + TN) / (TP + FP + TN + FN)    Eq.(6)

True Positive Rate (TPR) = TP / (TP + FN)    Eq.(7)

False Positive Rate (FPR) = FP / (FP + TN)    Eq.(8)

Precision = TP / (TP + FP)    Eq.(9)

Recall = TP / (TP + FN)    Eq.(10)

F-measure = (2 × Precision × Recall) / (Precision + Recall)    Eq.(11)
5.RESULTS
In this section, the performance of different classifiers is evaluated on datasets obtained from the publicly available PROMISE repository, whose data is collated from practical projects [18]. Three datasets were selected from those available in the repository for use in our study. The characteristics of the datasets under consideration are presented in Table 1.
Data   Language   Examples   Attributes   % Imbalance
jm1    C          10885      21           19.35
cm1    C          498        21           9.83
pc1    C          1109       21           6.94
Table 1: PROMISE Datasets
Each sample in the datasets describes the attributes of one module or method, together with its class label mentioning whether the module is fault-prone or not. The non-class attributes include information such as McCabe metrics, Halstead metrics, lines of code and other attributes. As most classifiers can deal only with discrete or categorical values, the three datasets have been discretized using the WEKA package. The missing values in the jm1 dataset have been handled using the Replace Missing Values option in the WEKA package. The default options in both preprocessing procedures are retained.
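For readers working outside WEKA, the same preprocessing can be approximated in a few lines. The metric values below are toy data, and the equal-width binning is only a stand-in for WEKA's default unsupervised discretization, not its exact behaviour:

```python
import numpy as np

# toy module-metric matrix with one missing value (rows = modules, cols = metrics)
X = np.array([[10.0, 1.0], [250.0, np.nan], [40.0, 4.0], [300.0, 12.0]])

# replace missing values with the column mean (WEKA ReplaceMissingValues analogue)
col_mean = np.nanmean(X, axis=0)
X_filled = np.where(np.isnan(X), col_mean, X)

# equal-width discretization into 3 bins per column, yielding ordinal codes 0..2
lo, hi = X_filled.min(axis=0), X_filled.max(axis=0)
bins = np.clip(((X_filled - lo) / (hi - lo) * 3).astype(int), 0, 2)
```

Each discretized value then acts as one "item" (attribute = bin) in the transactions mined by the associative classifier.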
The classifiers used for comparison include Naïve Bayes, AdaBoost, PART, ID3, J48 (C4.5), CBA and the proposed Partition-based Associative Classifier (Partition).
In this section, the performance of
different classifiers is evaluated on three
classification datasets using performance metrics
like Precision, Recall, True Positive Rate, False
Positive Rate, Accuracy and F-measure. For the
sake of comparison, a minimum support threshold
of 0.01 is used for Partition-based Associative
Classifier and CBA.
The performance of the different classifiers on the jm1, cm1 and pc1 datasets is presented in Tables II, III, and IV respectively.
As the jm1 dataset is not very imbalanced, most of the considered classifiers (except CBA) classify the minority class samples (fault-prone/defective modules) efficiently. However, the cm1 and pc1 datasets are both imbalanced in nature, consisting of only a small portion of minority class samples, as shown in Table 1.
For the cm1 dataset, all algorithms except the proposed Partition algorithm show sub-optimal performance in classifying defective modules, as shown in Table III.
The pc1 dataset is the most imbalanced among the three datasets considered. The results in Table IV show that, though the Partition-based Associative Classifier does not return the best overall accuracy, it still returns the best values for Precision, Recall, and F-measure. Hence we can conclude that the Partition-based Associative Classifier performs well for pc1, which is an imbalanced dataset. Its performance is the same as that of the other classifiers for the jm1 dataset, in which the class distribution is roughly balanced. In the case of cm1, the Partition-based Associative Classifier outperforms all the other classifiers considered.
6.CONCLUSIONS
A Partition-based Associative Classifier is presented in this paper. Unlike most classifiers, which assume approximately balanced class distributions, this classifier is specifically designed to predict software defects in development projects in the presence of class imbalance. The classifier's performance is compared with that of other classifiers with respect to six performance metrics. Results show that the classifier performs better than the other classifiers when the dataset is skewed, and shows comparable performance when the dataset is balanced in nature.
7.REFERENCES
[1] Jones, C., & Bonsignour, O. (2012). The
Economics of Software Quality. Pearson
Education, Inc.
[2] Shull, F., Basili, V., Boehm, B., Brown, A. W.,
Costa, P., Lindvall, M., … Zelkowitz, M. (2002).
What we have learned about fighting defects. In
Proceedings Eighth IEEE Symposium on Software
Metrics 2002 (pp. 249–258).
[3] Menzies, T., Milton, Z., Turhan, B., Cukic, B.,
Jiang, Y., & Bener, A. (2010). Defect prediction
from static code features: current results,
limitations, new approaches. Automated Software
Engineering, 17(4), 375–407.
[4] R. Agrawal and R. Srikant (1994). Fast
Algorithm for mining association rules. In Proc. Of
VLDB’94, Santiago, Chile, Sept. 1994.
[5] Janssens D, Wets G, Brijs T and Vanhoof K
(2003). “Integrating classification and association
rules by proposing adaptations to CBA”. In: Proc.
of the 10th International Conference on Recent
Advances in Retailing and Services Science,
Portland, Oregon.
[6] Arunasalem, Bavani. & Chawla, Sanjay. &
University of Sydney. 2006 Parameter-free
classification for imbalanced data scoring using
complement class support, School of Information
Technologies, The University of Sydney, [Sydney,
N.S.W.]
[7] T. Menzies, J. Greenwald, and A. Frank, "Data mining static code attributes to learn defect predictors", IEEE Transactions on Software Engineering, 33(1), pp. 2–13, Jan. 2007.
[8] Khoshgoftaar, T. M., Seliya, N., & Gao, K.
Assessment of a New Three-Group Software
Quality Classification Technique: An Empirical
Case Study. Empirical Software Engineering,
10(2), 183–218, 2005.
[9] Lan Guo, Yan Ma, Bojan Cukic, Harshinder
Singh, "Robust Prediction of Fault-Proneness by
Random Forests", ISSRE, 2004, 15th International
Symposium on Software Reliability Engineering,
15th International Symposium on Software
Reliability Engineering 2004, pp. 417-428,
doi:10.1109/ISSRE.2004.35
[10] Zheng, Jun. (2010). Cost-sensitive boosting
neural networks for software defect prediction.
Elsevier Journal Expert Systems with Application,
37(6), pp.4437-4543.
[11] Singh, M., Salaria, D.S., (2013). Software
defect prediction tool based on Neural Network,
International Journal of Computer Applications,
Vol. 70(1), May 2013.
[12] T. Menzies, B. Turhan, A. Bener, G. Gay, B. Cukic, and Y. Jiang, "Implications of ceiling effects in defect predictors," in The 4th International Workshop on Predictor Models in Software Engineering (PROMISE 08), 2008, pp. 47–54.
[13] Seiffert, C., Khoshgoftaar, T. M., & Van Hulse, J. (2009). Improving Software-Quality Predictions With Data Sampling and Boosting. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 39(6), 1283–1294.
[14] Wang, S., & Yao, X. (2013). Using Class Imbalance Learning for Software Defect Prediction. IEEE Transactions on Reliability, 62(2), 434–443.
[15] Cheng G. Weng, Josiah Poon. A New Evaluation Measure for Imbalanced Datasets, Conferences in Research and Practice in Information Technology, Vol. 87, pp. 27–32.
[16] Nitesh V. Chawla, C4.5 and Imbalanced datasets: investigating the effect of sampling method, probabilistic estimate and tree structure, In Proc. of ICML'03 workshop on class imbalances, 2003.
[17] Qiong Gu, Zhihua Cai, Li Ziu, Classification of Imbalanced Data Sets by Using the Hybrid Re-sampling Algorithm Based on Isomap, In LNCS, Advances in Computation and Intelligence, vol. 5821, pp. 287–296, 2009.
[18] G. Boetticher, T. Menzies, T. J. Ostrand (2007). Promise repository of empirical software engineering data. [Online]. Available: http://promisedata.org/repository
                        Predicted Positive Class   Predicted Negative Class
Actual Positive Class   TP (True Positive)         FN (False Negative)
Actual Negative Class   FP (False Positive)        TN (True Negative)
Table I: Confusion Matrix
Classifier    TPR     FPR     Precision   Recall   F-measure   Accuracy %
Partition     1       0.085   0.8518      1        0.92        94.2857
Naïve Bayes   1       0.085   0.852       1        0.92        94.2857
AdaBoost      1       0.085   0.852       1        0.92        94.2857
PART          1       0.085   0.852       1        0.92        94.2857
ID3           1       0.085   0.852       1        0.92        94.2857
J48 (C4.5)    1       0.085   0.852       1        0.92        94.2857
CBA           0.9583  0.045   0.92        0.95     0.938       94.2028
Table II: Performance Comparison for jm1 Dataset

Classifier    TPR     FPR     Precision   Recall   F-measure   Accuracy %
Partition     0.4893  0.0209  0.7931      0.4893   0.6052      91.0179
Naïve Bayes   0.191   0.024   0.563       0.191    0.286       86.53
AdaBoost      0       0       0           0        0           85.9281
PART          0.298   0.045   0.519       0.298    0.378       86.2275
ID3           0.261   0.061   0.414       0.261    0.32        81.4371
J48 (C4.5)    0.191   0.024   0.563       0.191    0.286       86.5269
CBA           0.0667  0.069   0.6         0.0667   0.12        86.486
Table III: Performance Comparison for cm1 Dataset

Classifier    TPR     FPR     Precision   Recall   F-measure   Accuracy %
Partition     0.5     0.065   0.25        0.5      0.3333      91.6667
Naïve Bayes   0       0.065   0           0        0           89.5833
AdaBoost      0       0.065   0           0        0           89.5833
PART          0       0       0           0        0           95.8333
ID3           0       0       0           0        0           95.8333
J48 (C4.5)    0       0       0           0        0           95.8333
CBA           0       0       0           0        0           76.5957
Table IV: Performance Comparison for pc1 Dataset