A Novel Prominent Recursive Minority Over-Sampling Technique

International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 5, May 2013
ISSN 2319 - 4847
K.P.N.V.SATYA SREE1, Dr. J.V.R. MURTHY2
1 Assistant Professor, Department of Computer Science and Engineering,
Vignan's Nirula Institute of Technology and Science for Women,
Guntur, A.P.
2 Professor, Department of Computer Science and Engineering,
Jawaharlal Nehru Technological University Kakinada,
Kakinada, A.P.
ABSTRACT
In supervised classification, Class Imbalance Learning (CIL) has emerged as a new and pressing challenge for the data
mining research community. The problem of class imbalance learning is of great significance when dealing with real-world
datasets. The data imbalance problem is most serious in the binary-class case, where the number of instances in one class
predominantly outnumbers the number of instances in the other class.
In this paper, we present a practical algorithm to deal with the data imbalance classification problem across datasets
with different imbalance ratios. A new algorithm, called the Prominent Recursive Minority Over-sampling Technique (PRMOTE), is
proposed. We propose to recursively oversample the most prominent examples in the minority set to handle the problem of
class imbalance. An experimental analysis is carried out over a wide range of fifteen highly imbalanced data sets, using the
statistical tests suggested in the specialized literature. The results obtained show that our proposed method outperforms other
classic and recent models in terms of AUC, precision, F-measure, recall, TN rate and accuracy, and requires storing a lower
number of generalized examples.
Keywords— Classification, class imbalance, prominent metric, re-sampling, PRMOTE.
1. INTRODUCTION
A dataset is class imbalanced if the classification categories are not approximately equally represented. The level of
imbalance (the ratio of the size of the majority class to the minority class) can be as huge as 1:99 [1]. It is noteworthy that class
imbalance is emerging as an important issue in designing classifiers [2], [3], [4]. Furthermore, the class with the lowest
number of instances is usually the class of interest from the point of view of the learning task [5]. This problem is of
great interest because it turns up in many real-world classification problems, such as remote-sensing [6], pollution
detection [7], risk management [8], fraud detection [9], and especially medical diagnosis [10]–[13].
Whenever a class in a classification task is underrepresented (i.e., has a lower prior probability) compared to
other classes, we consider the data as imbalanced [14], [15], [16]. The main problem in imbalanced data is that the
majority classes that are represented by large numbers of patterns rule the classifier decision boundaries at the expense
of the minority classes that are represented by small numbers of patterns. This leads to high and low accuracies in
classifying the majority and minority classes, respectively, which do not necessarily reflect the true difficulty in
classifying these classes. Most common solutions to this problem balance the number of patterns in the minority or
majority classes.
Resampling techniques can be categorized into three groups: undersampling methods, which create a subset of the
original data set by eliminating instances (usually majority class instances); oversampling methods, which create a
superset of the original data set by replicating some instances or creating new instances from existing ones; and finally,
hybrid methods, which combine both. Among these categories there exist several different proposals;
from this point, we center our attention only on those that have been used in oversampling. Either way, balancing the
data has been found to alleviate the problem of imbalanced data and enhance accuracy [15], [16], [17].
Data balancing is performed by, e.g., oversampling patterns of minority classes either randomly or from areas close to
the decision boundaries. Interestingly, random oversampling is found comparable to more sophisticated
oversampling methods [17]. Alternatively, undersampling is performed on majority classes either randomly or
from areas far away from the decision boundaries. We note that random undersampling may remove significant
patterns and random oversampling may lead to overfitting, so random sampling should be performed with care. We
also note that, usually, oversampling of minority classes is more accurate than undersampling of majority classes
[17]. In this paper, we lay particular stress on proposing an oversampling algorithm for solving the class imbalance
problem.
This paper is organized as follows. Section 2 briefly reviews the data balancing literature and its measures. In Section 3,
we discuss the proposed PRMOTE (Prominent Recursive Minority Over-sampling Technique) for CIL. Section 4 presents
the evaluation metrics. Section 5 presents the imbalanced datasets used to validate the proposed method and the
experimental setting. In Section 6, we discuss in detail the classification results obtained by the proposed
method and compare them with the results obtained by different existing methods. Finally, in Section 7 we conclude
the paper by specifying a future extension.
2. LITERATURE REVIEW
A comprehensive review of different CIL methods can be found in [18]. The following paragraphs briefly discuss
external and internal imbalance-learning methods. The external methods are independent of the learning
algorithm being used, and they involve preprocessing of the training datasets to balance them before training the
classifiers. Different resampling methods, such as random and focused oversampling and undersampling, fall into
this category. In random undersampling, the majority-class examples are removed randomly, until a particular class
ratio is met [19]. In random oversampling, the minority-class examples are randomly duplicated, until a particular class
ratio is met [18]. The synthetic minority oversampling technique (SMOTE) [20] is an oversampling method, where new
synthetic examples are generated in the neighborhood of the existing minority-class examples rather than directly
duplicating them. In addition, several informed sampling methods have been introduced in [21].
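As a reference point for the oversampling methods discussed here, SMOTE's core interpolation step can be sketched as follows. This is a minimal illustration rather than the full published algorithm; the function name and parameters are our own:

```python
import numpy as np

def smote_samples(minority, n_new, k=5, rng=None):
    """Minimal SMOTE-style interpolation: each synthetic point lies on the
    line segment between a minority example and one of its k nearest
    minority neighbors (Euclidean distance)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    # pairwise distances within the minority class only
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest minority neighbors
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))               # pick a seed minority example
        j = nn[i, rng.integers(k)]             # pick one of its neighbors
        gap = rng.random()                     # random point on the segment
        out.append(X[i] + gap * (X[j] - X[i]))
    return np.array(out)
```

Because every synthetic point is an interpolation between two existing minority points, the new examples stay inside the convex hull of the minority class, which is precisely what distinguishes SMOTE from plain duplication.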
Currently, research in class imbalance learning mainly focuses on the integration of class imbalance learning with
other AI techniques. How to integrate class imbalance learning with other new techniques is one of the hottest
topics in class imbalance learning research. Some recent research directions for class imbalance
learning are as follows:
In [22] the authors have proposed a minority sampling method based on k-means clustering and a genetic algorithm. The
proposed algorithm works in two stages: in the first stage, k-means clustering is used to find clusters in the minority set,
and in the second stage, a genetic algorithm is used to choose the best minority samples for resampling. Classifier
ensemble techniques can also be used efficiently for handling the problem of class imbalance learning. In [23] the authors
have proposed another variant of a clustering-based sampling method for handling the class imbalance problem. In [24]
the authors have proposed a purely genetic-algorithm-based sampling method for handling the class imbalance problem.
Evolutionary algorithms are also of great use for handling the problem of class imbalance. In [25] the authors have
proposed an evolutionary method of the nested generalized exemplar family, which uses Euclidean n-space to store objects
when computing distances to the nearest generalized exemplar. This method uses an evolutionary algorithm to select
the most suitable generalized exemplars for resampling. In [26] the authors have proposed an evolutionary
cooperative-competitive strategy, CO2RBFN, for the design of radial-basis function networks on imbalanced datasets. In the
CO2RBFN framework, an individual of the population represents only a part of the solution, competing for survival and at the
same time cooperating in order to build the whole RBFN, which achieves good generalization for new patterns by representing
the complete knowledge about the problem space.
In [27] the authors have proposed a dynamic classifier ensemble method (DCEID) that combines ensemble techniques with
cost-sensitive learning. A new cost-sensitive measure is proposed to combine the outputs of the ensemble classifiers. A
comparative analysis of external and internal methods for handling class imbalance learning is conducted in [28], whose
results call for a deeper study of intrinsic data properties such as data shift and class overlap.
In [29] the authors have suggested applying gradient boosting and random forest classifiers for better
performance on class-imbalanced datasets. In [30] the authors have introduced a new hybrid sampling/boosting algorithm
called RUSBoost, built from its individual components AdaBoost and SMOTEBoost; like SMOTEBoost, it
combines boosting and data sampling for learning from skewed training data.
In [31] the authors have proposed a max-min technique for extracting maximum negative features and minimum positive
features to handle class-imbalanced datasets in binary classification. The two proposed models, which perform
the max-min extraction of features, have been implemented simultaneously, thereby producing effective results.
The application of fuzzy rule-based techniques for handling class-imbalanced datasets has been proposed as Fuzzy Rule Based
Classification Systems (FRBCSs) [32] and the Fuzzy Hybrid Genetic Based Machine Learning (FH-GBML) algorithm [33].
These fuzzy algorithmic approaches have also shown good performance in terms of metric learning.
In [34] the authors have proposed to deal with imbalanced datasets by using a preprocessing step with fuzzy rule-based
classification systems through the application of an adaptive inference system with parametric conjunction operators. In [35]
the authors have investigated the applicability and performance of the K-Nearest Neighbor (k-NN) classifier on class-
imbalanced datasets. In [36] the authors have conducted a systematic study to investigate the effects of different classifiers
with different resampling strategies on imbalanced datasets with different imbalance ratios. Obviously, there are many
other algorithms not included in this review. A profound comparison of the above algorithms and many
others can be gathered from the reference list.
The bottom line is that when studying problems with imbalanced data, using the classifiers produced by standard
machine learning algorithms without adjusting the output threshold may well be a critical mistake. The skew against the
minority (positive) class generally causes the generation of a high number of false-negative predictions, which
lowers the model's performance on the positive class compared with its performance on the negative (majority) class.
3. A New Over-Sampling Method: Prominent Recursive Minority Oversampling Technique
(PRMOTE)
In order to achieve better prediction, most classification algorithms attempt to learn a model through a
training process, and the model so built is used for future prediction of the testing instances. The training model
built by the classifier will be more robust if the training data contains only quality instances. Examples
that are outliers or of a noisy nature are more apt to be misclassified than those far from the borderline, and thus
harder to classify. Based on this analysis, examples which fall into the misclassification
category may contribute little to classification. We present a novel method, dubbed the Prominent Recursive Minority
Oversampling Technique (PRMOTE), in which only the prominent examples of the minority class are recursively
oversampled. Our method differs from existing oversampling methods, in which all the minority examples or a
random subset of the minority class are oversampled [31] [38] [39].
Our method can be easily understood with the following Pima diabetes data set, which has two classes. Fig. 1(a)
shows the original distribution of the data set; the blue points represent majority examples and the red
points are minority examples. First, we apply a base classifier on the whole dataset to find the misclassified
examples of the majority and minority classes, which are denoted by circled dots in Fig. 1(b). Then, any remaining noise
or borderline examples are removed from the minority set. The purified minority set is used to generate new
synthetic examples recursively from the remaining prominent examples of the minority class. The synthetic examples are
shown in Fig. 1(c) with hollow squares. It is easy to see from the figures that, different from SMOTE, PRMOTE generates
synthetic examples only around the prominent minority examples.
Fig. 1. (a) The original distribution of the Pima diabetes data set. (b) The misclassified instances to be removed from the
majority subset, and the misclassified, borderline, noisy and weak instances to be removed from the minority subset (circled
dots, both blue and red). (c) After applying PRMOTE: only the highly prominent synthetic minority instances are
resampled (high density of minority instances in the squared box).
Suppose that the whole training set is T, the minority class is P and the majority class is N, with
P = {p1, p2, ..., p_pnum}, N = {n1, n2, ..., n_nnum},
where pnum and nnum are the numbers of minority and majority examples, respectively. The detailed procedure of PRMOTE
is as follows.
_____________________________________________
Algorithm: PRMOTE
_____________________________________________
Step 1. For every pi (i = 1, 2, ..., pnum) in the minority class P, calculate its m nearest neighbors from the whole
training set T. The number of majority examples among the m nearest neighbors is denoted by m' (0 <= m' <= m).
Step 2. If m/2 <= m' < m, i.e., the number of pi's majority nearest neighbors is larger than the number of its minority
ones, pi is considered easily misclassified and is put into the set MISCLASS. Remove the instances in MISCLASS from
the minority set.
Step 3. For every ni (i = 1, 2, ..., nnum) in the majority class N, calculate its m nearest neighbors from the whole
training set T. The number of minority examples among the m nearest neighbors is denoted by m' (0 <= m' <= m).
Step 4. If m/2 <= m' < m, i.e., the number of ni's minority nearest neighbors is larger than the number of its majority
ones, ni is considered easily misclassified and is put into the set MISCLASS. Remove the instances in MISCLASS from
the majority set.
Step 5. For every pi' (i = 1, 2, ..., pnum') remaining in the minority class P, calculate its m nearest neighbors from the
whole training set T, with m' again denoting the number of majority examples among them (0 <= m' <= m).
If m' = m, i.e., all the m nearest neighbors of pi' are majority examples, pi' is considered noise or an outlier and is
removed.
Step 6. For every pi'' (i = 1, 2, ..., pnum'') remaining in the minority class P, calculate m' as above (0 <= m' <= m).
If 0 <= m' < m/2, pi'' is a prominent example and is kept in the minority set for resampling.
Step 7. The examples remaining in the minority set are the prominent examples of the minority class P, so that
PROMINENT ⊆ P. We set
PROMINENT = {p'1, p'2, ..., p'dnum}, 0 <= dnum <= pnum.
Step 8. In this step, we generate s × dnum synthetic positive examples from the prominent examples in the minority set,
where s is an integer between 1 and k. One percentage of the synthetic examples generated are replicas of prominent
examples and the others are hybrids of prominent examples.
_____________________________________________
We repeat the above procedure for each prominent example p'i in the minority set and thus attain s × dnum synthetic
examples.
In the procedure above, pi, ni and p'i are vectors. New synthetic data are generated using the prominent examples
remaining in the minority set; the quality of the synthetic examples thus strengthens the minority set, and the
oversampling of examples minimizes the problem of class imbalance.
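The procedure above can be sketched in code. This is an illustration of our reading of the steps, not the exact implementation used in the experiments; in particular, the split between replicas and hybrids (replica_frac) is an assumption, since the text only states that one percentage of the synthetic examples are replicas:

```python
import numpy as np

def prmote(X_maj, X_min, m=5, s=2, replica_frac=0.5, rng=None):
    """Sketch of PRMOTE: filter easily-misclassified examples from both
    classes by nearest-neighbor voting, keep the prominent minority
    examples, then oversample them by replication and interpolation."""
    rng = np.random.default_rng(rng)
    T = np.vstack([X_maj, X_min])
    y = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min))]   # 1 = minority

    def opposite_neighbor_count(x, own_label):
        d = np.linalg.norm(T - x, axis=1)
        nn = np.argsort(d)[1:m + 1]            # m nearest neighbors, skip self
        return int((y[nn] != own_label).sum())

    # Steps 3-4: drop majority examples dominated by minority neighbors
    maj_keep = np.array([n for n in X_maj
                         if opposite_neighbor_count(n, 0) < m / 2])
    # Steps 1-2 and 5-6 combined: a minority example survives all filters
    # exactly when fewer than m/2 of its neighbors are majority examples
    prominent = np.array([p for p in X_min
                          if opposite_neighbor_count(p, 1) < m / 2])

    # Step 8: generate s * dnum synthetic minority examples
    synth = []
    for _ in range(s * len(prominent)):
        i = rng.integers(len(prominent))
        if rng.random() < replica_frac:        # replica of a prominent example
            synth.append(prominent[i])
        else:                                  # hybrid of two prominent examples
            j = rng.integers(len(prominent))
            gap = rng.random()
            synth.append(prominent[i] + gap * (prominent[j] - prominent[i]))
    return maj_keep, np.vstack([prominent, synth])
```

Note that because Steps 2, 5 and 6 all key on the same count m', their net effect on the minority set is to keep exactly the examples with m' < m/2, which is what the sketch implements in one pass.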
The different components of our new proposed framework are elaborated in the next subsections.
3.1 Applying dataset to a base classifier
In the initial stage of our frame work the dataset is applied to a base algorithm for identifying mostly misclassified
instances in both majority and minority classes. The instances which are misclassified are mostly weak instances and
removing those instances from the majority and minority classes will not harm the dataset. In fact it will be helpful for
improving the quality of the datasets in two fold; one way by removing weak instances from majority class will help to
reduce the problem of class imbalance to a minor extend. Another is the removal of weak instances from minority class
for the purpose of finding prominent instances to recursively replicate and hybridized for oversampling is also the part
of the goal of the framework.The mostly misclassified instances are identified by using a base algorithm in this case
C4.5 [40] is used. C4.5 is one of the best performing algorithms in the area of supervised learning. In our framework
C4.5 is used as base algorithm for both initial and final stage. Obviously this had boosted the performance of our
PRMOTE approach.Our approach PRMOTE is classifier independent, i.e there is no constraint that the same classifier
(in this case C4.5) has to be implemented for identifying mostly misclassified instances.
3.2 Preparation of the majority and minority subsets
The dataset is partitioned into majority and minority subsets. As we concentrate on oversampling, we take the
minority subset for further analysis to generate synthetic instances.
3.3 Selection of a novel subset of instances
The minority subset can be further analyzed to find noisy instances and instances with missing values so that we can
eliminate them. One way of finding noisy, borderline and missing-value instances and generating a pure minority set is
to go through a preprocessing process.
3.4 Generating synthetic instances from the novel subset
The prominent instances remaining in the minority subset are resampled, i.e., both replicated and hybridized
instances are generated. The percentage of synthetic instances generated ranges from 0 to 100% depending upon the
difference between the majority- and minority-class proportions in the original dataset. The synthetic minority instances
generated can contain a percentage of instances that are replicas of the pure instances, while the remaining
instances are hybrid synthetic instances generated by combining two or more instances from the pure
minority subset.
3.5 Forming the strong dataset
The oversampled minority subset and the stronger majority subset are combined to form a strong and balanced dataset,
which is used for learning by a base algorithm; here again we have used C4.5 as the base algorithm. Combining
the synthetic minority instances with the original dataset results in an improved and almost balanced dataset. The
imbalanced dataset can be made balanced or almost balanced depending upon the pure majority subset generated. The
maximum number of synthetic minority instances generated is limited to 100% of the pure minority set formed. Our
method is superior to other oversampling methods since our approach uses only the pure instances available in the
existing minority set for generating synthetic instances.
4. Evaluation Metrics
To assess the classification results, we count the numbers of true positive (TP), true negative (TN), false positive (FP)
(actually negative, but classified as positive) and false negative (FN) (actually positive, but classified as negative)
examples. It is now well known that the error rate is not an appropriate evaluation criterion when there is class imbalance
or unequal costs. In this paper, we use AUC, Precision, F-measure, Recall, TN Rate and Accuracy as performance
evaluation measures [41].
Table 1. Confusion matrix for a two-class problem
-------------------------------------------------------------------
              Predicted Positives      Predicted Negatives
-------------------------------------------------------------------
Positives     TP                       FN
Negatives     FP                       TN
-------------------------------------------------------------------

AUC = (1 + TPrate - FPrate) / 2                                  (1)
Precision = TP / (TP + FP)                                       (2)
F-measure = (2 × Precision × Recall) / (Precision + Recall)      (3)
Recall = TP / (TP + FN)                                          (4)
TrueNegativeRate = TN / (TN + FP)                                (5)
Accuracy = (TP + TN) / (TP + FN + FP + TN)                       (6)
The evaluation metrics defined above can reasonably evaluate a learner on imbalanced data sets because their
formulae take the minority class into account.
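Under the definitions in Eqs. (1)-(6), all six measures follow directly from the four confusion-matrix counts; the helper below is a small illustrative implementation:

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Compute the six evaluation measures (1)-(6) from a two-class
    confusion matrix with counts tp, fn, fp, tn."""
    recall = tp / (tp + fn)                     # TP rate, Eq. (4)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)                  # Eq. (2)
    return {
        "AUC": (1 + recall - fp_rate) / 2,      # Eq. (1)
        "Precision": precision,
        "F-measure": 2 * precision * recall / (precision + recall),  # Eq. (3)
        "Recall": recall,
        "TN rate": tn / (tn + fp),              # specificity, Eq. (5)
        "Accuracy": (tp + tn) / (tp + fn + fp + tn),                 # Eq. (6)
    }
```

For example, a classifier with TP = 40, FN = 10, FP = 20, TN = 130 has an accuracy of 0.85 but a recall of only 0.8, illustrating why accuracy alone is uninformative under imbalance.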
5. Experimental framework
An empirical comparison between the proposed oversampling method and other benchmark and over-sampling
algorithms has been performed over a total of 15 data sets taken from the UCI Data Set Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.htm) [42]. Note that all the original binary databases have been used
for experimental validation for significant analysis of two-class problems. Table 2summarizes the main characteristics
of the data sets, including the imbalance ratio (IR), that is, the number of negative (majority) examples divided by the
number of positive (minority)examples. The fifth and sixth columns in Table 2indicate the majority and minority
classes of each dataset with respective IR.
We performed theimplementation using WEKAon Windows XP with 2Duo CPU runningon 3.16 GHz PC with 3.25
GB RAM.We have adopted a tenfold cross-validationmethodto estimate theAUC and all other measures: each data
sethas been divided into ten stratified blocks of size n/10 (wheren denotes the total number of examples in the data set),
using nine folds for training the classifiers and the remaining block as an independent test set. Therefore, the results
correspondto the average over the ten runs.Each classifier has been applied to the original (imbalanced)training sets
and also to sets that have been preprocessed by the implementations of the proposed technique SMOTEand five stateof-the-art and over-sampling approaches takenfrom the WEKA data mining software tool [43]. In SMOTE algorithm
the data setshave been balanced to the 100 % distribution.
5.1 Evaluation on fifteen real-world datasets
In this study PRMOTE is applied to fifteen binary data sets from the UCI repository with different imbalance ratios (IR).
Table 2 summarizes the data selected in this study and shows, for each data set, the number of examples (#Ex.), number
of attributes (#Atts.), the class name of each class (minority and majority) and the IR.
Table 2: Summary of benchmark imbalanced datasets
S.No   Dataset      #Ex.   #Atts.   Class (-; +)                    IR
1      Breast       268    9        (recurrence; no-recurrence)     2.37
2      Breast_w     699    9        (benign; malignant)             1.90
3      Colic        368    22       (yes; no)                       1.71
4      Credit-g     1000   21       (good; bad)                     2.33
5      Diabetes     768    8        (tested-potv; tested-negtv)     1.87
6      Heart-c      303    14       (<50; >50_1)                    1.19
7      Heart-h      294    14       (<50; >50_1)                    1.77
8      Heart-stat   270    14       (absent; present)               1.25
9      Hepatitis    155    19       (die; live)                     3.85
10     Ionosphere   351    34       (b; g)                          1.79
11     Kr-vs-kp     3196   37       (won; nowin)                    1.09
12     Labor        56     16       (bad; good)                     1.85
13     Mushroom     8124   23       (e; p)                          1.08
14     Sick         3772   29       (negative; sick)                15.32
15     Sonar        208    60       (rock; mine)                    1.15
5.2 Algorithms for comparison and parameters
To validate the proposed PRMOTE algorithm, all the experiments have been carried out using the WEKA learning
environment [43] with the C4.5 [40] decision tree, Classification and Regression Trees (CART) [44], functional
trees (FT), the reduced error pruning tree (REP) and the synthetic minority oversampling technique (SMOTE) [45],
whose parameter values used in the experiments are given in Table 3.
Specifically, we consider five different algorithmic approaches for comparison:
• C4.5: we have selected the C4.5 algorithm as a well-known classifier that has been widely used for imbalanced data.
A decision tree consists of internal nodes that specify tests on individual input variables or attributes, splitting the data
into smaller subsets, and a series of leaf nodes assigning a class to each of the observations in the resulting segments.
For our study, we chose the popular decision tree classifier C4.5, which builds decision trees using the concept of
information entropy, computed over the class proportions of a sample S of classified observations. C4.5 examines the
normalized information gain (entropy difference) that results from choosing an attribute for splitting the data. The
attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recurs
on the smaller subsets.
• CART: The CART methodology is technically known as binary recursive partitioning. The process is binary because
parent nodes are always split into exactly two child nodes and recursive because the process can be repeated by treating
each child node as a parent. The key elements of a CART analysis are a set of rules for:
i. Splitting each node in a tree;
ii. Deciding when a tree is complete; and
iii. Assigning each terminal node to a class outcome (or a predicted value for regression).
• FT: Functional trees (FT) is a classifier for building functional trees, which are classification trees that can
have logistic regression functions at the inner nodes and/or leaves. The algorithm can deal with binary and multi-class
target variables, numeric and nominal attributes, and missing values.
• REP: One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced
with its most popular class. If the prediction accuracy is not affected, the change is kept. While somewhat naive,
reduced error pruning has the advantage of simplicity and speed.
• SMOTE: Regarding the use of the SMOTE preprocessing method [45], we consider only the 1-nearest neighbor
(using the Euclidean distance) to generate the synthetic samples.
Table 3: Parameters used in the classifiers
Table 3 presents the complete experimental setting used for all five comparative algorithms.
6. Results and Discussion
We evaluated the performance of the proposed PRMOTE approach on a number of real-world classification problems.
The goal is to examine whether the newly proposed learning framework achieves better AUC and other evaluation metrics
than a number of existing learning algorithms.
In all the experiments we estimate AUC, Precision, F-measure, Recall, TN rate and accuracy using 10-fold
cross-validation. We experimented with 15 standard datasets from the UCI repository; these datasets are standard
benchmarks used in the context of high-dimensional imbalance learning. The experiments on these datasets have two
goals. First, we study the class imbalance properties of the datasets using the proposed PRMOTE learning algorithm.
Second, we compare the classification performance of our proposed PRMOTE algorithm with traditional and class
imbalance learning methods on all datasets.
Figure 2: a) A zoomed-in view of the original sick dataset: the minority class is the sick class (red) and the majority
class is the negative class (blue). b) A zoomed-in view of the dataset generated with the proposed oversampling
technique: the minority class is the sick class (red) and the majority class is the negative class (blue); the
concentration of more red dots in the minority class is the result of oversampling the minority class with the prominent
instances. c) A zoomed-in view of the original hepatitis dataset: the minority class is the Die class (blue) and the
majority class is the Live class (red). d) A zoomed-in view of the dataset generated with the proposed oversampling
technique: the minority class is Die (blue) and the majority class is Live (red); the concentration of more blue dots in
the minority class is the result of oversampling the minority class with the prominent technique.
Table 4: Summary of tenfold cross validation performance for AUC on all the datasets
Table 5: Summary of tenfold cross validation performance for Precision on all the datasets
Table 6: Summary of tenfold cross validation performance for F-measure on all the datasets
Table 7: Summary of tenfold cross validation performance for Recall on all the datasets
Table 8: Summary of tenfold cross validation performance for TN Rate (Specificity) on all the datasets
Table 9: Summary of tenfold cross validation performance for Accuracy on all the datasets
Fig. 3(a) and 3(b): Test results on AUC and Accuracy for C4.5, CART, FT, REP, SMOTE and PRMOTE on
all fifteen datasets from UCI.
In this part of the experimental study, we use SMOTE to preprocess the data sets used in this paper and obtain a balanced
distribution of classes. In imbalanced classification, SMOTE has proved to be an excellent preprocessing step, suitable for
improving almost any learning algorithm, and it is considered a standard in the topic.
First of all, we check whether or not our approach improves on the benchmark and SMOTE
algorithms. Tables 4, 5, 6, 7, 8 and 9 report the results of AUC, Precision, F-measure, Recall, TN Rate and Accuracy,
respectively, for all the fifteen datasets from UCI. A two-tailed corrected resampled paired t-test [46] is used in this paper
to determine whether the cross-validation results show a significant difference between two algorithms.
A difference in accuracy is considered significant when the p-value is less than 0.05 (confidence level
greater than 95%). In the discussion of results, if one algorithm is stated to be better or worse than another, then it is
significantly better or worse at the 0.05 level. The bold dot ‘●’ indicates a win of the PRMOTE method over C4.5, CART,
FN, REP or SMOTE, and ‘○’ indicates a loss of the PRMOTE method against those algorithms. The results in the tables
show that PRMOTE gives a good improvement on all the measures of class imbalance learning.
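The corrected resampled paired t-test cited above adjusts the variance of the per-fold differences for the overlap between training sets across folds, following the Nadeau–Bengio correction. A hedged Python sketch (the function name is ours; for tenfold cross-validation the test/train ratio is 1/9):

```python
import math

def corrected_resampled_t(diffs, test_train_ratio=1 / 9):
    """Corrected resampled paired t statistic for comparing two
    classifiers over k cross-validation folds (Nadeau-Bengio).
    diffs: per-fold differences in a metric (e.g., accuracy)."""
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    # The 1/k term gives the classic resampled t; test_train_ratio
    # corrects for the correlation between overlapping training sets.
    t = mean / math.sqrt((1.0 / k + test_train_ratio) * var)
    return t  # compare against Student's t with k - 1 degrees of freedom

# Ten hypothetical per-fold accuracy differences between two classifiers.
t = corrected_resampled_t([0.02] * 5 + [0.04] * 5)
```

The statistic is then compared against the Student t distribution with k − 1 degrees of freedom to obtain the p-value used at the 0.05 level above.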
Next, we check whether or not our approach improves its behavior when SMOTE is previously used.
Fig. 3(a) and Fig. 3(b) plot star graphics that represent the AUC and Accuracy obtained for each data set and
allow us to see more easily how the five comparative algorithms and our proposed PRMOTE algorithm behave in the same
domains. We can observe that the proposed PRMOTE algorithm dominated all five benchmark
algorithms in terms of performance in most of the data sets. In fact, the results obtained by PRMOTE, with a p-value of
0.05, show that it is recommendable for most of the imbalanced datasets.
Two main reasons support the conclusion reached above. The first is that the decrease in the number of instances in the
majority subset contributes to the better performance of our proposed PRMOTE algorithm. The second is that, although
resampling synthetic instances in the minority subset is the essence of oversampling, conducting a proper
exploration–exploitation of the prominent instances in the minority subset is the key to the success of our
algorithm. A further reason is the deletion of noisy instances by the interpolation mechanism of PRMOTE.
Finally, we can make a global analysis of results combining the results offered by Tables 4–9 and Fig. 3(a) and (b):
• Our proposal, PRMOTE, is the best performing method when the data sets are imbalanced. It outperforms SMOTE, one
of the best pre-processing methods, and this conclusion is confirmed when the standard deviation
variations are included. We have considered a complete and competitive set of methods, and an improvement of results is expected from the
benchmark algorithms, i.e. C4.5, CART, FN and REP; however, they are not able to outperform PRMOTE. In this sense,
the competitive edge of PRMOTE can be seen.
• Where PRMOTE behaves similarly to, or less effectively than, SMOTE, this reflects the unique properties of
datasets in which the scope for improvement lies in the majority subset rather than in the minority subset. PRMOTE mainly
focuses on improvements in the minority subset, which is less effective for such datasets.
The strength of our model is that PRMOTE over-samples only the most prominent examples, recursively, thereby
strengthening the minority class. A further point to consider is that our method tries to remove the most misclassified
instances from both the majority and minority sets. First, the removal of some weak instances from the majority set does not
harm the dataset; in fact, it reduces the root cause of the class imbalance problem as a whole by reducing the majority
samples in a small proportion. Second, the removal of weak instances from the minority set again helps in the better
generation of synthetic examples of both the same and hybrid types.
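The weak-instance removal described above is not specified in detail in this excerpt; one plausible reading is an edited-nearest-neighbour style filter, sketched here under that assumption (the function names and the voting rule are ours):

```python
def euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def remove_weak(X, y, k=3):
    """Drop instances whose k nearest neighbours mostly vote for the
    other class -- an ENN-style stand-in for the 'weak instance'
    removal applied to both the majority and minority sets."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to every other instance, paired with their labels.
        dists = sorted((euclidean(xi, xj), yj)
                       for j, (xj, yj) in enumerate(zip(X, y)) if j != i)
        votes = [label for _, label in dists[:k]]
        if 2 * votes.count(yi) >= k:  # at least half the neighbours agree
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y

# A class-1 point stranded inside the class-0 cluster gets filtered out.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [0.05, 0.05]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
Xc, yc = remove_weak(X, y)
```

Cleaning before oversampling matters here: a mislabelled or noisy minority point that survives would otherwise be interpolated against, multiplying the noise in the synthetic examples.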
Finally, we can say that PRMOTE is one of the best alternatives for handling class imbalance problems effectively. This
experimental study supports the conclusion that a prominent recursive oversampling approach can improve
CIL behavior when dealing with imbalanced data sets, as it has helped the PRMOTE method to be the best
performing algorithm when compared with four classical and well-known algorithms (C4.5, CART, FN and REP) and the
well-established pre-processing technique SMOTE.
7. Conclusion
The class imbalance problem has given scope for a new paradigm of algorithms in data mining. The data imbalance
problem is more serious in the binary-class case, where the number of instances in one class predominantly
outnumbers the number of instances in the other class. A new algorithm called the Prominent Recursive Minority
Oversampling Technique (PRMOTE) is proposed. We propose to recursively oversample the most prominent examples
in the minority set to handle the problem of class imbalance. The results obtained show that our proposed method
outperforms other classic and recent models in terms of AUC, precision, F-measure, recall, TN rate and accuracy, and
requires storing a lower number of generalized examples.
In our future work, we will apply PRMOTE to multi-class datasets and especially to high-dimensional feature learning
tasks.
References:
[1.] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg, “Fast asymmetric learning for cascade face detection,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 369–382, Mar. 2008.
[2.] N. V. Chawla, N. Japkowicz, and A. Kotcz, Eds., Proc. ICML Workshop Learn. Imbalanced Data Sets, 2003.
[3.] N. Japkowicz, Ed., Proc. AAAI Workshop Learn. Imbalanced Data Sets, 2000.
[4.] G. M.Weiss, “Mining with rarity: A unifying framework,” ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 7–
19, Jun. 2004.
[5.] N. V. Chawla, N. Japkowicz, and A. Kolcz, Eds., Special Issue Learning Imbalanced Datasets, SIGKDD Explor.
Newsl.,vol. 6, no. 1, 2004.
[6.] W.-Z. Lu and D. Wang, “Ground-level ozone prediction by support vector machine approach with a cost-sensitive
classification scheme,” Sci. Total Environ., vol. 395, no. 2-3, pp. 109–116, 2008.
[7.] Y.-M. Huang, C.-M. Hung, and H. C. Jiau, “Evaluation of neural networks and data mining methods on a credit
assessment task for class imbalance problem,” Nonlinear Anal. R. World Appl., vol. 7, no. 4, pp. 720–747, 2006.
[8.] D. Cieslak, N. Chawla, and A. Striegel, “Combating imbalance in network intrusion datasets,” in IEEE Int. Conf.
Granular Comput., 2006, pp. 732–737.
[9.] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker, and G. D. Tourassi, “Training neural network
classifiers for medical decision making: The effects of imbalanced datasets on classification performance,” Neural
Netw., vol. 21, no. 2–3, pp. 427–436, 2008.
[10.] A. Freitas, A. Costa-Pereira, and P. Brazdil, “Cost-sensitive decision trees applied to medical data,” in Data
Warehousing Knowl. Discov. (Lecture Notes in Computer Science), I. Song, J. Eder, and T. Nguyen, Eds.,
Berlin/Heidelberg, Germany: Springer, 2007, vol. 4654, pp. 303–312.
[11.] K. Kılıç, Ö. Uncu and I. B. Türksen, “Comparison of different strategies of utilizing fuzzy clustering in
structure identification,” Inf. Sci., vol. 177, no. 23, pp. 5153–5162, 2007.
[12.] M. E. Celebi, H. A. Kingravi, B. Uddin, H. Iyatomi, Y. A. Aslandogan, W. V. Stoecker, and R. H. Moss, “A
methodological approach to the classification of dermoscopy images,” Comput. Med. Imag. Graph., vol. 31, no. 6,
pp. 362–373, 2007.
[13.] X. Peng and I. King, “Robust BMPM training based on second-order cone programming and its application in
medical diagnosis,” Neural Netw., vol. 21, no. 2–3, pp. 450–457, 2008.
[14.] Rukshan Batuwita and Vasile Palade, “FSVM-CIL: Fuzzy support vector machines for class imbalance
learning,” IEEE Trans. Fuzzy Syst., vol. 18, no. 3, pp. 558–571, Jun. 2010.
[15.] N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol.
6, pp. 429-450, 2002.
[16.] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training Sets: One-Sided Selection,” Proc. 14th
Int’l Conf. Machine Learning, pp. 179-186, 1997.
[17.] G.E.A.P.A. Batista, R.C. Prati, and M.C. Monard, “A Study of the Behavior of Several Methods for Balancing
Machine Learning Training Data,” SIGKDD Explorations, vol. 6, pp. 20-29, 2004.
[18.] D. Cieslak and N. Chawla, “Learning decision trees for unbalanced data,” in Machine Learning and Knowledge
Discovery in Databases. Berlin, Germany: Springer-Verlag, 2008, pp. 241–256.
[19.] G.Weiss, “Mining with rarity: A unifying framework,” SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 7–19, 2004.
[20.] N. Chawla, K. Bowyer, and P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif.
Intell. Res., vol. 16, pp. 321–357, 2002.
[21.] J. Zhang and I. Mani, “KNN approach to unbalanced data distributions: A case study involving information
extraction,” in Proc. Int. Conf. Mach. Learning, Workshop: Learning Imbalanced Data Sets, Washington, DC,
2003, pp. 42–48.
[22.] Yang Yong, “The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and
Genetic Algorithm”, Energy Procedia 17 (2012) 164 – 170.
[23.] T. Jo and N. Japkowicz, “Class imbalances versus small disjuncts,” ACM SIGKDD Explor. Newslett., vol. 6, no. 1,
pp. 40–49, 2004.
[24.] S. Zou, Y. Huang, Y. Wang, J. Wang, and C. Zhou, “SVM learning from imbalanced data by GA sampling for
protein domain prediction,” in Proc. 9th Int. Conf. Young Comput. Sci., Hunan, China, 2008, pp. 982– 987.
[25.] Salvador García, Joaquín Derrac, Isaac Triguero, Cristóbal J. Carmona, Francisco Herrera, “Evolutionary-based
selection of generalized instances for imbalanced classification”, Knowledge-Based Systems 25 (2012) 3–12.
[26.] María Dolores Pérez-Godoy, Alberto Fernández, Antonio Jesús Rivera, María José del Jesus, “Analysis of an
evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets”, Pattern Recognition Letters 31 (2010)
2375–2388.
[27.] Jin Xiao, Ling Xie, Changzheng He, Xiaoyi Jiang, “Dynamic classifier ensemble model for customer classification
with imbalanced class distribution”, Expert Systems with Applications 39 (2012) 3668–3675.
[28.] Z. Chi, H. Yan, T. Pham, Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition,
World Scientific, 1996.
[29.] V. Garcia, J.S. Sanchez, R.A. Mollineda, “On the effectiveness of preprocessing methods when dealing with
different levels of class imbalance”, Knowledge-Based Systems 25 (2012) 13–21.
[30.] Alberto Fernández, María José del Jesus, Francisco Herrera, “On the 2-tuples based genetic tuning performance
for fuzzy rule based classification systems in imbalanced data-sets”, Information Sciences 180 (2010) 1268–1291.
[31.] Jinghua Wang, Jane You, Qin Li, Yong Xu, “Extract minimum positive and maximum negative features for
imbalanced binary classification”, Pattern Recognition 45 (2012) 1136–1145.
[32.] Alberto Fernández, María José del Jesus, Francisco Herrera, “On the influence of an adaptive inference system in
fuzzy rule based classification systems for imbalanced data-sets”, Expert Systems with Applications 36 (2009)
9805–9812.
[33.] Victoria López, Alberto Fernández, Jose G. Moreno-Torres, Francisco Herrera, “Analysis of preprocessing vs.
cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics”, Expert
Systems with Applications 39 (2012) 6585–6608.
[34.] Alberto Fernández, María José del Jesus, Francisco Herrera, “On the influence of an adaptive inference system in
fuzzy rule based classification systems for imbalanced data-sets”, Expert Systems with Applications 36 (2009)
9805–9812.
[35.] Jordan M. Malof, Maciej A. Mazurowski, Georgia D. Tourassi, “The effect of class imbalance on case selection for
case-based classifiers: An empirical study in the context of medical decision support”, Neural Networks 25 (2012)
141–145.
[36.] V. Garcia, J.S. Sanchez, R.A. Mollineda, “On the effectiveness of preprocessing methods when dealing with
different levels of class imbalance”, Knowledge-Based Systems 25 (2012) 13–21.
[37.] Nitesh V. Chawla, Nathalie Japkowicz and Aleksander Kolcz: Editorial: Special Issue on Learning from
Imbalanced Data Sets. SIGKDD Explorations 6 (1) (2004) 1-6.
[38.] G. Weiss: Mining with rarity: A unifying framework. SIGKDD Explorations 6 (1) (2004) 7-19
[39.] Chawla, N.V., Lazarevic, A., Hall, L.O. and Bowyer, K.: SMOTEBoost: Improving prediction of the Minority
Class in Boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases,
Cavtat Dubrovnik, Croatia (2003) 107-119.
[40.] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. San Mateo, CA: Morgan Kaufmann Publishers,
1993.
[41.] Bradley A.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern
Recognition 30 (7) (1997) 1145-1159.
[42.] A. Asuncion and D. Newman. (2007). UCI Repository of Machine Learning Databases (School of Information and
Computer Science), Irvine, CA: Univ. of California [Online]. Available:
http://www.ics.uci.edu/∼mlearn/MLRepository.html
[43.] Witten, I.H. and Frank, E. (2005) Data Mining: Practical machine learning tools and techniques. 2nd edition
Morgan Kaufmann, San Francisco.
[44.] Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. Wadsworth, Belmont.
[45.] Chawla, N., Bowyer, K., Kegelmeyer, P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
16, 321–357 (2002).
Authors
K.P.N.V.SATYA SREE, Assistant Professor, Department of Computer Science and Engineering,
Vignan’s Nirula Institute of Technology in Science for WOMEN, Guntur, A.P
Dr. J.V.R. MURTHY, Professor, Department of Computer Science and Engineering, Jawaharlal Nehru Technological
University Kakinada, Kakinada, A.P