A Novel Prominent Recursive Minority Over-Sampling Technique

International Journal of Application or Innovation in Engineering & Management (IJAIEM)
Web Site: www.ijaiem.org Email: editor@ijaiem.org, editorijaiem@gmail.com
Volume 2, Issue 5, May 2013
ISSN 2319 - 4847
K.P.N.V.SATYA SREE1, Dr. J.V.R. MURTHY2
1 Assistant Professor, Department of Computer Science and Engineering,
Vignan's Nirula Institute of Technology and Science for Women,
Guntur, A.P.
2 Professor, Department of Computer Science and Engineering,
Jawaharlal Nehru Technological University Kakinada,
Kakinada, A.P.
ABSTRACT
In supervised classification, Class Imbalance Learning (CIL) has emerged as a new and pressing challenge for the data
mining research community. The problem of class imbalance learning is of great significance when dealing with real-world
datasets. The data imbalance problem is most serious in the binary-class case, where the number of instances in one class
predominantly outnumbers the number of instances in the other class.
In this paper, we present a practical algorithm to deal with the data imbalance classification problem across datasets
with different imbalance ratios. A new algorithm, called the Prominent Recursive Minority Over-sampling Technique (PRMOTE), is
proposed. We propose to recursively oversample the most prominent examples in the minority set to handle the problem of
class imbalance. An experimental analysis is carried out over a wide range of fifteen highly imbalanced data sets, using the
statistical tests suggested in the specialized literature. The results obtained show that our proposed method outperforms other
classic and recent models in terms of AUC, precision, F-measure, recall, TN rate and accuracy, and requires storing a lower
number of generalized examples.
Keywords— Classification, class imbalance, prominent metric, re-sampling, PRMOTE.
1. INTRODUCTION
A dataset is class imbalanced if the classification categories are not approximately equally represented. The level of
imbalance (the ratio of the size of the majority class to the minority class) can be as huge as 1:99 [1]. It is noteworthy that class
imbalance is emerging as an important issue in designing classifiers [2], [3], [4]. Furthermore, the class with the lowest
number of instances is usually the class of interest from the point of view of the learning task [5]. This problem is of
great interest because it turns up in many real-world classification problems, such as remote-sensing [6], pollution
detection [7], risk management [8], fraud detection [9], and especially medical diagnosis [10]–[13].
Whenever a class in a classification task is underrepresented (i.e., has a lower prior probability) compared to
other classes, we consider the data as imbalanced [14], [15], [16]. The main problem in imbalanced data is that the
majority classes that are represented by large numbers of patterns rule the classifier decision boundaries at the expense
of the minority classes that are represented by small numbers of patterns. This leads to high and low accuracies in
classifying the majority and minority classes, respectively, which do not necessarily reflect the true difficulty in
classifying these classes. Most common solutions to this problem balance the number of patterns in the minority or
majority classes.
Resampling techniques can be categorized into three groups: undersampling methods, which create a subset of the
original data set by eliminating instances (usually majority class instances); oversampling methods, which create a
superset of the original data set by replicating some instances or creating new instances from existing ones; and finally,
hybrid methods, which combine both. Among these categories there exist several different proposals;
from this point, we center our attention only on those that have been used in oversampling. Either way, balancing the
data has been found to alleviate the problem of imbalanced data and enhance accuracy [15], [16], [17].
Data balancing is performed by, e.g., oversampling patterns of minority classes either randomly or from areas close to
the decision boundaries. Interestingly, random oversampling is found comparable to more sophisticated
oversampling methods [17]. Alternatively, undersampling is performed on majority classes either randomly or
from areas far away from the decision boundaries. We note that random undersampling may remove significant
patterns and random oversampling may lead to overfitting, so random sampling should be performed with care. We
also note that, usually, oversampling of minority classes is more accurate than undersampling of majority classes
[17]. In this paper, we lay particular stress on proposing an oversampling algorithm for solving the class imbalance
problem.
This paper is organized as follows. Section 2 briefly reviews the data balancing literature and its measures. In Section 3,
we discuss the proposed PRMOTE (Prominent Recursive Minority Over-sampling Technique) for CIL. Section 4 presents
the evaluation metrics. Section 5 presents the imbalanced datasets used to validate the proposed method and the
experimental setting. In Section 6, we discuss in detail the classification results obtained by the proposed
method and compare them with the results obtained by different existing methods. Finally, in Section 7 we conclude
the paper by specifying a future extension.
2. LITERATURE REVIEW
A comprehensive review of different CIL methods can be found in [18]. The following paragraphs briefly discuss
external and internal imbalance-learning methods. The external methods are independent of the learning
algorithm being used, and they involve preprocessing of the training datasets to balance them before training the
classifiers. Different resampling methods, such as random and focused oversampling and undersampling, fall into
this category. In random undersampling, the majority-class examples are removed randomly, until a particular class
ratio is met [19]. In random oversampling, the minority-class examples are randomly duplicated, until a particular class
ratio is met [18]. The synthetic minority oversampling technique (SMOTE) [20] is an oversampling method, where new
synthetic examples are generated in the neighborhood of the existing minority-class examples rather than directly
duplicating them. In addition, several informed sampling methods have been introduced in [21].
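As a reference point for the oversampling methods discussed here, SMOTE's core interpolation step can be sketched as follows. This is a minimal illustration rather than the full published algorithm; the function name and parameters are our own:

```python
import numpy as np

def smote_samples(minority, n_new, k=5, rng=None):
    """Minimal SMOTE-style interpolation: each synthetic point lies on the
    line segment between a minority example and one of its k nearest
    minority neighbors (Euclidean distance)."""
    rng = np.random.default_rng(rng)
    X = np.asarray(minority, dtype=float)
    # pairwise distances within the minority class only
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)                # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]          # k nearest minority neighbors
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X))               # pick a seed minority example
        j = nn[i, rng.integers(k)]             # pick one of its neighbors
        gap = rng.random()                     # random point on the segment
        out.append(X[i] + gap * (X[j] - X[i]))
    return np.array(out)
```

Because every synthetic point is an interpolation between two existing minority points, the new examples stay inside the convex hull of the minority class, which is precisely what distinguishes SMOTE from plain duplication.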
Currently, research in class imbalance learning mainly focuses on the integration of class imbalance learning with
other AI techniques. How to integrate class imbalance learning with other new techniques is one of the hottest
topics in class imbalance learning research. Some recent research directions for class imbalance
learning are as follows:
In [22] the authors have proposed a minority sampling method based on k-means clustering and a genetic algorithm. The
proposed algorithm works in two stages: in the first stage, k-means clustering is used to find clusters in the minority set,
and in the second stage, a genetic algorithm is used to choose the best minority samples for resampling. Classifier
ensemble techniques can also be used efficiently for handling the problem of class imbalance learning. In [23] the authors
have proposed another variant of a clustering-based sampling method for handling the class imbalance problem. In [24]
the authors have proposed a purely genetic-algorithm-based sampling method for handling the class imbalance problem.
Evolutionary algorithms are also of great use for handling the problem of class imbalance. In [25] the authors have
proposed an evolutionary method of the nested generalized exemplar family, which uses Euclidean n-space to store objects
when computing distances to the nearest generalized exemplar. This method uses an evolutionary algorithm to select
the most suitable generalized exemplars for resampling. In [26] the authors have proposed an evolutionary
cooperative-competitive strategy, CO2RBFN, for the design of radial-basis function networks on imbalanced datasets. In the
CO2RBFN framework, an individual of the population represents only a part of the solution, competing for survival and at the
same time cooperating in order to build the whole RBFN, which achieves good generalization for new patterns by representing
the complete knowledge about the problem space.
In [27] the authors have proposed a dynamic classifier ensemble method (DCEID) that combines ensemble techniques with
cost-sensitive learning. A new cost-sensitive measure is proposed to combine the outputs of the ensemble classifiers. A
comparative analysis of external and internal methods for handling class imbalance learning is conducted in [28], whose
results call for a deeper study of intrinsic data properties such as data shift and class overlap.
In [29] the authors have suggested applying gradient boosting and random forest classifiers for better
performance on class-imbalanced datasets. In [30] the authors have introduced a new hybrid sampling/boosting algorithm
called RUSBoost, built from its individual components AdaBoost and SMOTEBoost; like SMOTEBoost, it
combines boosting and data sampling for learning from skewed training data.
In [31] the authors have proposed a max-min technique for extracting maximum negative features and minimum positive
features to handle class-imbalanced datasets in binary classification. The two proposed models, which perform
the max-min extraction of features, have been implemented simultaneously, thereby producing effective results.
The application of fuzzy rule-based techniques for handling class-imbalanced datasets has been proposed as Fuzzy Rule Based
Classification Systems (FRBCSs) [32] and the Fuzzy Hybrid Genetic Based Machine Learning (FH-GBML) algorithm [33].
These fuzzy algorithmic approaches have also shown good performance in terms of metric learning.
In [34] the authors have proposed to deal with imbalanced datasets by using a preprocessing step with fuzzy rule-based
classification systems through the application of an adaptive inference system with parametric conjunction operators. In [35]
the authors have investigated the applicability and performance of the K-Nearest Neighbor (k-NN) classifier on class-
imbalanced datasets. In [36] the authors have conducted a systematic study to investigate the effects of different classifiers
with different resampling strategies on imbalanced datasets with different imbalance ratios. Obviously, there are many
other algorithms not included in this review. A profound comparison of the above algorithms and many
others can be gathered from the reference list.
The bottom line is that when studying problems with imbalanced data, using the classifiers produced by standard
machine learning algorithms without adjusting the output threshold may well be a critical mistake. The skew against the
minority (positive) class generally causes the generation of a high number of false-negative predictions, which
lowers the model's performance on the positive class compared with its performance on the negative (majority) class.
3. A New Over-Sampling Method: Prominent Recursive Minority Oversampling Technique
(PRMOTE)
In order to achieve better prediction, most classification algorithms attempt to learn a model through a
training process, and the model so built is used for future prediction of the testing instances. The training model
built by the classifier will be more robust if the training data contains only quality instances. Examples
that are outliers or of a noisy nature are more apt to be misclassified than those far from the borderline, and thus
harder to classify. Based on this analysis, examples which fall into the misclassification
category may contribute little to classification. We present a novel method, dubbed the Prominent Recursive Minority
Oversampling Technique (PRMOTE), in which only the prominent examples of the minority class are recursively
oversampled. Our method differs from existing oversampling methods, in which all the minority examples or a
random subset of the minority class are oversampled [31] [38] [39].
Our method can be easily understood with the following Pima diabetes data set, which has two classes. Fig. 1(a)
shows the original distribution of the data set; the blue points represent majority examples and the red
points are minority examples. First, we apply a base classifier on the whole dataset to find the misclassified
examples of the majority and minority classes, which are denoted by circled dots in Fig. 1(b). Then, any remaining noise
or borderline examples are removed from the minority set. The purified minority set is used to generate new
synthetic examples recursively from the remaining prominent examples of the minority class. The synthetic examples are
shown in Fig. 1(c) with hollow squares. It is easy to see from the figures that, different from SMOTE, PRMOTE generates
synthetic examples only around the prominent minority examples.
Fig. 1. (a) The original distribution of the Pima diabetes data set. (b) The misclassified instances to be removed from the
majority subset, and the misclassified, borderline, noisy and weak instances to be removed from the minority subset (circled
dots, both blue and red). (c) After applying PRMOTE: only the highly prominent synthetic minority instances are
resampled (high density of minority instances in the squared box).
Suppose that the whole training set is T, the minority class is P and the majority class is N, with
P = {p1, p2, ..., p_pnum}, N = {n1, n2, ..., n_nnum},
where pnum and nnum are the numbers of minority and majority examples, respectively. The detailed procedure of PRMOTE
is as follows.
_____________________________________________
Algorithm: PRMOTE
_____________________________________________
Step 1. For every pi (i = 1, 2, ..., pnum) in the minority class P, calculate its m nearest neighbors from the whole
training set T. The number of majority examples among the m nearest neighbors is denoted by m' (0 <= m' <= m).
Step 2. If m/2 <= m' < m, i.e., the number of pi's majority nearest neighbors is larger than the number of its minority
ones, pi is considered easily misclassified and is put into the set MISCLASS. Remove the instances in MISCLASS from
the minority set.
Step 3. For every ni (i = 1, 2, ..., nnum) in the majority class N, calculate its m nearest neighbors from the whole
training set T. The number of minority examples among the m nearest neighbors is denoted by m' (0 <= m' <= m).
Step 4. If m/2 <= m' < m, i.e., the number of ni's minority nearest neighbors is larger than the number of its majority
ones, ni is considered easily misclassified and is put into the set MISCLASS. Remove the instances in MISCLASS from
the majority set.
Step 5. For every pi' (i = 1, 2, ..., pnum') remaining in the minority class P, calculate its m nearest neighbors from the
whole training set T, with m' again denoting the number of majority examples among them (0 <= m' <= m).
If m' = m, i.e., all the m nearest neighbors of pi' are majority examples, pi' is considered noise or an outlier and is
removed.
Step 6. For every pi'' (i = 1, 2, ..., pnum'') remaining in the minority class P, calculate m' as above (0 <= m' <= m).
If 0 <= m' < m/2, pi'' is a prominent example and is kept in the minority set for resampling.
Step 7. The examples remaining in the minority set are the prominent examples of the minority class P, so that
PROMINENT ⊆ P. We set
PROMINENT = {p'1, p'2, ..., p'dnum}, 0 <= dnum <= pnum.
Step 8. In this step, we generate s × dnum synthetic positive examples from the prominent examples in the minority set,
where s is an integer between 1 and k. One percentage of the synthetic examples generated are replicas of prominent
examples and the others are hybrids of prominent examples.
_____________________________________________
We repeat the above procedure for each prominent example p'i in the minority set and thus attain s × dnum synthetic
examples.
In the procedure above, pi, ni and p'i are vectors. New synthetic data are generated using the prominent examples
remaining in the minority set; the quality of the synthetic examples thus strengthens the minority set, and the
oversampling of examples minimizes the problem of class imbalance.
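The procedure above can be sketched in code. This is an illustration of our reading of the steps, not the exact implementation used in the experiments; in particular, the split between replicas and hybrids (replica_frac) is an assumption, since the text only states that one percentage of the synthetic examples are replicas:

```python
import numpy as np

def prmote(X_maj, X_min, m=5, s=2, replica_frac=0.5, rng=None):
    """Sketch of PRMOTE: filter easily-misclassified examples from both
    classes by nearest-neighbor voting, keep the prominent minority
    examples, then oversample them by replication and interpolation."""
    rng = np.random.default_rng(rng)
    T = np.vstack([X_maj, X_min])
    y = np.r_[np.zeros(len(X_maj)), np.ones(len(X_min))]   # 1 = minority

    def opposite_neighbor_count(x, own_label):
        d = np.linalg.norm(T - x, axis=1)
        nn = np.argsort(d)[1:m + 1]            # m nearest neighbors, skip self
        return int((y[nn] != own_label).sum())

    # Steps 3-4: drop majority examples dominated by minority neighbors
    maj_keep = np.array([n for n in X_maj
                         if opposite_neighbor_count(n, 0) < m / 2])
    # Steps 1-2 and 5-6 combined: a minority example survives all filters
    # exactly when fewer than m/2 of its neighbors are majority examples
    prominent = np.array([p for p in X_min
                          if opposite_neighbor_count(p, 1) < m / 2])

    # Step 8: generate s * dnum synthetic minority examples
    synth = []
    for _ in range(s * len(prominent)):
        i = rng.integers(len(prominent))
        if rng.random() < replica_frac:        # replica of a prominent example
            synth.append(prominent[i])
        else:                                  # hybrid of two prominent examples
            j = rng.integers(len(prominent))
            gap = rng.random()
            synth.append(prominent[i] + gap * (prominent[j] - prominent[i]))
    return maj_keep, np.vstack([prominent, synth])
```

Note that because Steps 2, 5 and 6 all key on the same count m', their net effect on the minority set is to keep exactly the examples with m' < m/2, which is what the sketch implements in one pass.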
The different components of our new proposed framework are elaborated in the next subsections.
3.1 Applying dataset to a base classifier
In the initial stage of our frame work the dataset is applied to a base algorithm for identifying mostly misclassified
instances in both majority and minority classes. The instances which are misclassified are mostly weak instances and
removing those instances from the majority and minority classes will not harm the dataset. In fact it will be helpful for
improving the quality of the datasets in two fold; one way by removing weak instances from majority class will help to
reduce the problem of class imbalance to a minor extend. Another is the removal of weak instances from minority class
for the purpose of finding prominent instances to recursively replicate and hybridized for oversampling is also the part
of the goal of the framework.The mostly misclassified instances are identified by using a base algorithm in this case
C4.5 [40] is used. C4.5 is one of the best performing algorithms in the area of supervised learning. In our framework
C4.5 is used as base algorithm for both initial and final stage. Obviously this had boosted the performance of our
PRMOTE approach.Our approach PRMOTE is classifier independent, i.e there is no constraint that the same classifier
(in this case C4.5) has to be implemented for identifying mostly misclassified instances.
3.2 Preparation of the majority and minority subsets
The dataset is partitioned into majority and minority subsets. As we concentrate on oversampling, we take the
minority subset for further analysis to generate synthetic instances.
3.3 Selection of a novel subset of instances
The minority subset can be further analyzed to find noisy instances and instances with missing values so that we can
eliminate them. One way of finding noisy, borderline and missing-value instances and generating a pure minority set is
to go through a preprocessing process.
3.4 Generating synthetic instances from the novel subset
The prominent instances remaining in the minority subset are resampled, i.e., both replicated and hybridized
instances are generated. The percentage of synthetic instances generated ranges from 0 to 100% depending upon the
difference between the majority- and minority-class proportions in the original dataset. The synthetic minority instances
generated can contain a percentage of instances that are replicas of the pure instances, while the remaining
instances are hybrid synthetic instances generated by combining two or more instances from the pure
minority subset.
3.5 Forming the strong dataset
The oversampled minority subset and the stronger majority subset are combined to form a strong and balanced dataset,
which is used for learning by a base algorithm; here again we have used C4.5 as the base algorithm. Combining
the synthetic minority instances with the original dataset results in an improved and almost balanced dataset. The
imbalanced dataset can be made balanced or almost balanced depending upon the pure majority subset generated. The
maximum number of synthetic minority instances generated is limited to 100% of the pure minority set formed. Our
method is superior to other oversampling methods since our approach uses only the pure instances available in the
existing minority set for generating synthetic instances.
4. Evaluation Metrics
To assess the classification results, we count the numbers of true positive (TP), true negative (TN), false positive (FP)
(actually negative, but classified as positive) and false negative (FN) (actually positive, but classified as negative)
examples. It is now well known that the error rate is not an appropriate evaluation criterion when there is class imbalance
or unequal costs. In this paper, we use AUC, Precision, F-measure, Recall, TN Rate and Accuracy as performance
evaluation measures [41].
Table 1. Confusion matrix for a two-class problem
-------------------------------------------------------------------
              Predicted Positives      Predicted Negatives
-------------------------------------------------------------------
Positives     TP                       FN
Negatives     FP                       TN
-------------------------------------------------------------------

AUC = (1 + TPrate - FPrate) / 2                                  (1)
Precision = TP / (TP + FP)                                       (2)
F-measure = (2 × Precision × Recall) / (Precision + Recall)      (3)
Recall = TP / (TP + FN)                                          (4)
TrueNegativeRate = TN / (TN + FP)                                (5)
Accuracy = (TP + TN) / (TP + FN + FP + TN)                       (6)
The evaluation metrics defined above can reasonably evaluate a learner on imbalanced data sets because their
formulae take the minority class into account.
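Under the definitions in Eqs. (1)-(6), all six measures follow directly from the four confusion-matrix counts; the helper below is a small illustrative implementation:

```python
def imbalance_metrics(tp, fn, fp, tn):
    """Compute the six evaluation measures (1)-(6) from a two-class
    confusion matrix with counts tp, fn, fp, tn."""
    recall = tp / (tp + fn)                     # TP rate, Eq. (4)
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)                  # Eq. (2)
    return {
        "AUC": (1 + recall - fp_rate) / 2,      # Eq. (1)
        "Precision": precision,
        "F-measure": 2 * precision * recall / (precision + recall),  # Eq. (3)
        "Recall": recall,
        "TN rate": tn / (tn + fp),              # specificity, Eq. (5)
        "Accuracy": (tp + tn) / (tp + fn + fp + tn),                 # Eq. (6)
    }
```

For example, a classifier with TP = 40, FN = 10, FP = 20, TN = 130 has an accuracy of 0.85 but a recall of only 0.8, illustrating why accuracy alone is uninformative under imbalance.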
5. Experimental framework
An empirical comparison between the proposed oversampling method and other benchmark and over-sampling
algorithms has been performed over a total of 15 data sets taken from the UCI Data Set Repository
(http://www.ics.uci.edu/~mlearn/MLRepository.htm) [42]. Note that all the original binary databases have been used
for experimental validation for significant analysis of two-class problems. Table 2summarizes the main characteristics
of the data sets, including the imbalance ratio (IR), that is, the number of negative (majority) examples divided by the
number of positive (minority)examples. The fifth and sixth columns in Table 2indicate the majority and minority
classes of each dataset with respective IR.
We performed theimplementation using WEKAon Windows XP with 2Duo CPU runningon 3.16 GHz PC with 3.25
GB RAM.We have adopted a tenfold cross-validationmethodto estimate theAUC and all other measures: each data
sethas been divided into ten stratified blocks of size n/10 (wheren denotes the total number of examples in the data set),
using nine folds for training the classifiers and the remaining block as an independent test set. Therefore, the results
correspondto the average over the ten runs.Each classifier has been applied to the original (imbalanced)training sets
and also to sets that have been preprocessed by the implementations of the proposed technique SMOTEand five stateof-the-art and over-sampling approaches takenfrom the WEKA data mining software tool [43]. In SMOTE algorithm
the data setshave been balanced to the 100 % distribution.
5.1 Evaluation on fifteen real-world datasets
In this study PRMOTE is applied to fifteen binary data sets from the UCI repository with different imbalance ratios (IR).
Table 2 summarizes the data selected in this study and shows, for each data set, the number of examples (#Ex.), number
of attributes (#Atts.), the class name of each class (minority and majority) and the IR.
Table 2: Summary of benchmark imbalanced datasets
S.No   Dataset      #Ex.   #Atts.   Class (-; +)                    IR
1      Breast       268    9        (recurrence; no-recurrence)     2.37
2      Breast_w     699    9        (benign; malignant)             1.90
3      Colic        368    22       (yes; no)                       1.71
4      Credit-g     1000   21       (good; bad)                     2.33
5      Diabetes     768    8        (tested-potv; tested-negtv)     1.87
6      Heart-c      303    14       (<50; >50_1)                    1.19
7      Heart-h      294    14       (<50; >50_1)                    1.77
8      Heart-stat   270    14       (absent; present)               1.25
9      Hepatitis    155    19       (die; live)                     3.85
10     Ionosphere   351    34       (b; g)                          1.79
11     Kr-vs-kp     3196   37       (won; nowin)                    1.09
12     Labor        56     16       (bad; good)                     1.85
13     Mushroom     8124   23       (e; p)                          1.08
14     Sick         3772   29       (negative; sick)                15.32
15     Sonar        208    60       (rock; mine)                    1.15
5.2 Algorithms for comparison and parameters
To validate the proposed PRMOTE algorithm, all the experiments have been carried out using the WEKA learning
environment [43] with the C4.5 [40] decision tree, Classification and Regression Trees (CART) [44], functional
trees (FT), the reduced error pruning tree (REP) and the synthetic minority oversampling technique (SMOTE) [45],
whose parameter values used in the experiments are given in Table 3.
Specifically, we consider five different algorithmic approaches for comparison:
• C4.5: we have selected the C4.5 algorithm as a well-known classifier that has been widely used for imbalanced data.
A decision tree consists of internal nodes that specify tests on individual input variables or attributes, splitting the data
into smaller subsets, and a series of leaf nodes assigning a class to each of the observations in the resulting segments.
For our study, we chose the popular decision tree classifier C4.5, which builds decision trees using the concept of
information entropy, computed over the class proportions of a sample S of classified observations. C4.5 examines the
normalized information gain (entropy difference) that results from choosing an attribute for splitting the data. The
attribute with the highest normalized information gain is the one used to make the decision. The algorithm then recurs
on the smaller subsets.
• CART: The CART methodology is technically known as binary recursive partitioning. The process is binary because
parent nodes are always split into exactly two child nodes and recursive because the process can be repeated by treating
each child node as a parent. The key elements of a CART analysis are a set of rules for:
i. Splitting each node in a tree;
ii. Deciding when a tree is complete; and
iii. Assigning each terminal node to a class outcome (or a predicted value for regression).
• FT: Functional trees (FT) is a classifier for building functional trees, which are classification trees that can
have logistic regression functions at the inner nodes and/or leaves. The algorithm can deal with binary and multi-class
target variables, numeric and nominal attributes, and missing values.
• REP: One of the simplest forms of pruning is reduced error pruning. Starting at the leaves, each node is replaced
with its most popular class. If the prediction accuracy is not affected, the change is kept. While somewhat naive,
reduced error pruning has the advantage of simplicity and speed.
• SMOTE: Regarding the use of the SMOTE preprocessing method [45], we consider only the 1-nearest neighbor
(using the Euclidean distance) to generate the synthetic samples.
Table 3: Parameters used in the classifiers
Table 3 presents the complete experimental setting used for all five comparative algorithms.
6. Results and Discussion
We evaluated the performance of the proposed PRMOTE approach on a number of real-world classification problems.
The goal is to examine whether the newly proposed learning framework achieves better AUC and other evaluation metrics
than a number of existing learning algorithms.
In all the experiments we estimate AUC, Precision, F-measure, Recall, TN rate and accuracy using 10-fold
cross-validation. We experimented with 15 standard datasets from the UCI repository; these datasets are standard
benchmarks used in the context of high-dimensional imbalance learning. The experiments on these datasets have two
goals. First, we study the class imbalance properties of the datasets using the proposed PRMOTE learning algorithm.
Second, we compare the classification performance of our proposed PRMOTE algorithm with traditional and class
imbalance learning methods on all datasets.
Figure 2: a) A zoomed-in view of the original sick dataset: the minority class is the sick class (red) and the majority
class is the negative class (blue). b) A zoomed-in view of the dataset generated with the proposed oversampling
technique: the minority class is the sick class (red) and the majority class is the negative class (blue); the
concentration of more red dots in the minority class is the result of oversampling the minority class with the prominent
instances. c) A zoomed-in view of the original hepatitis dataset: the minority class is the Die class (blue) and the
majority class is the Live class (red). d) A zoomed-in view of the dataset generated with the proposed oversampling
technique: the minority class is Die (blue) and the majority class is Live (red); the concentration of more blue dots in
the minority class is the result of oversampling the minority class with the prominent technique.
Table 4: Summary of tenfold cross validation performance for AUC on all the datasets
Table 5: Summary of tenfold cross validation performance for Precision on all the datasets
Table 6: Summary of tenfold cross validation performance for F-measure on all the datasets
Table 7: Summary of tenfold cross validation performance for Recall on all the datasets
Table 8: Summary of tenfold cross validation performance for TN Rate (Specificity) on all the datasets
Table 9: Summary of tenfold cross validation performance for Accuracy on all the datasets
Fig. 3(a) and 3(b): Test results on AUC and Accuracy for C4.5, CART, FT, REP, SMOTE and PRMOTE on
all fifteen datasets from UCI.
In this part of the experimental study, we use SMOTE to preprocess the data sets used in this paper and obtain a balanced
distribution of classes. In imbalanced classification, SMOTE has proved to be an excellent preprocessing step, suitable for
improving almost any learning algorithm, and it is considered a standard in the topic.
First of all, we check whether or not our approach improves on the benchmark and SMOTE
algorithms. Tables 4, 5, 6, 7, 8 and 9 report the results of AUC, Precision, F-measure, Recall, TN Rate and Accuracy,
respectively, for all the fifteen datasets from UCI. A two-tailed corrected resampled paired t-test [46] is used in this paper
to determine whether the cross-validation results show a significant difference between two algorithms.
A difference in accuracy is considered significant when the p-value is less than 0.05 (confidence level
greater than 95%). In the discussion of results, if one algorithm is stated to be better or worse than another, then it is
significantly better or worse at the 0.05 level. The bold dot ‘●’ indicates a win of the PRMOTE method over C4.5, CART,
FN, REP or SMOTE, and ‘○’ indicates a loss of the PRMOTE method against those algorithms. The results in the tables
show that PRMOTE gives a good improvement on all the measures of class imbalance learning.
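The corrected resampled paired t-test cited above adjusts the variance of the per-fold differences for the overlap between training sets across folds, following the Nadeau–Bengio correction. A hedged Python sketch (the function name is ours; for tenfold cross-validation the test/train ratio is 1/9):

```python
import math

def corrected_resampled_t(diffs, test_train_ratio=1 / 9):
    """Corrected resampled paired t statistic for comparing two
    classifiers over k cross-validation folds (Nadeau-Bengio).
    diffs: per-fold differences in a metric (e.g., accuracy)."""
    k = len(diffs)
    mean = sum(diffs) / k
    var = sum((d - mean) ** 2 for d in diffs) / (k - 1)  # sample variance
    # The 1/k term gives the classic resampled t; test_train_ratio
    # corrects for the correlation between overlapping training sets.
    t = mean / math.sqrt((1.0 / k + test_train_ratio) * var)
    return t  # compare against Student's t with k - 1 degrees of freedom

# Ten hypothetical per-fold accuracy differences between two classifiers.
t = corrected_resampled_t([0.02] * 5 + [0.04] * 5)
```

The statistic is then compared against the Student t distribution with k − 1 degrees of freedom to obtain the p-value used at the 0.05 level above.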
Next, we check whether or not our approach improves its behavior when SMOTE is previously used.
Fig. 3(a) and Fig. 3(b) plot star graphics that represent the AUC and Accuracy obtained for each data set and
allow us to see more easily how the five comparative algorithms and our proposed PRMOTE algorithm behave in the same
domains. We can observe that the proposed PRMOTE algorithm dominated all five benchmark
algorithms in terms of performance in most of the data sets. In fact, the results obtained by PRMOTE, with a p-value of
0.05, show that it is recommendable for most of the imbalanced datasets.
Two main reasons support the conclusion reached above. The first is that the decrease in the number of instances in the
majority subset contributes to the better performance of our proposed PRMOTE algorithm. The second is that, although
resampling synthetic instances in the minority subset is the essence of oversampling, conducting a proper
exploration–exploitation of the prominent instances in the minority subset is the key to the success of our
algorithm. A further reason is the deletion of noisy instances by the interpolation mechanism of PRMOTE.
Finally, we can make a global analysis of results combining the results offered by Tables 4–9 and Fig. 3(a) and (b):
• Our proposal, PRMOTE, is the best performing method when the data sets are imbalanced. It outperforms SMOTE, one
of the best pre-processing methods, and this conclusion is confirmed when the standard deviation
variations are included. We have considered a complete and competitive set of methods, and an improvement of results is expected from the
benchmark algorithms, i.e. C4.5, CART, FN and REP; however, they are not able to outperform PRMOTE. In this sense,
the competitive edge of PRMOTE can be seen.
• Where PRMOTE behaves similarly to, or less effectively than, SMOTE, this reflects the unique properties of
datasets in which the scope for improvement lies in the majority subset rather than in the minority subset. PRMOTE mainly
focuses on improvements in the minority subset, which is less effective for such datasets.
The strength of our model is that PRMOTE over-samples only the most prominent examples, recursively, thereby
strengthening the minority class. A further point to consider is that our method tries to remove the most misclassified
instances from both the majority and minority sets. First, the removal of some weak instances from the majority set does not
harm the dataset; in fact, it reduces the root cause of the class imbalance problem as a whole by reducing the majority
samples in a small proportion. Second, the removal of weak instances from the minority set again helps in the better
generation of synthetic examples of both the same and hybrid types.
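The weak-instance removal described above is not specified in detail in this excerpt; one plausible reading is an edited-nearest-neighbour style filter, sketched here under that assumption (the function names and the voting rule are ours):

```python
def euclidean(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5

def remove_weak(X, y, k=3):
    """Drop instances whose k nearest neighbours mostly vote for the
    other class -- an ENN-style stand-in for the 'weak instance'
    removal applied to both the majority and minority sets."""
    keep_X, keep_y = [], []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to every other instance, paired with their labels.
        dists = sorted((euclidean(xi, xj), yj)
                       for j, (xj, yj) in enumerate(zip(X, y)) if j != i)
        votes = [label for _, label in dists[:k]]
        if 2 * votes.count(yi) >= k:  # at least half the neighbours agree
            keep_X.append(xi)
            keep_y.append(yi)
    return keep_X, keep_y

# A class-1 point stranded inside the class-0 cluster gets filtered out.
X = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
     [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [0.05, 0.05]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
Xc, yc = remove_weak(X, y)
```

Cleaning before oversampling matters here: a mislabelled or noisy minority point that survives would otherwise be interpolated against, multiplying the noise in the synthetic examples.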
Finally, we can say that PRMOTE is one of the best alternatives for handling class imbalance problems effectively. This
experimental study supports the conclusion that a prominent recursive oversampling approach can improve
CIL behavior when dealing with imbalanced data sets, as it has helped the PRMOTE method to be the best
performing algorithm when compared with four classical and well-known algorithms (C4.5, CART, FN and REP) and the
well-established pre-processing technique SMOTE.
7. Conclusion
The class imbalance problem has given scope for a new paradigm of algorithms in data mining. The data imbalance
problem is more serious in the binary-class case, where the number of instances in one class predominantly
outnumbers the number of instances in the other class. A new algorithm called the Prominent Recursive Minority
Oversampling Technique (PRMOTE) is proposed. We propose to recursively oversample the most prominent examples
in the minority set to handle the problem of class imbalance. The results obtained show that our proposed method
outperforms other classic and recent models in terms of AUC, precision, F-measure, recall, TN rate and accuracy, and
requires storing a lower number of generalized examples.
In our future work, we will apply PRMOTE to multi-class datasets and especially to high-dimensional feature learning
tasks.
References:
[1.] J. Wu, S. C. Brubaker, M. D. Mullin, and J. M. Rehg, “Fast asymmetric learning for cascade face detection,” IEEE
Trans. Pattern Anal. Mach. Intell., vol. 30, no. 3, pp. 369–382, Mar. 2008.
[2.] N. V. Chawla, N. Japkowicz, and A. Kotcz, Eds., Proc. ICML Workshop Learn. Imbalanced Data Sets, 2003.
[3.] N. Japkowicz, Ed., Proc. AAAI Workshop Learn. Imbalanced Data Sets, 2000.
[4.] G. M.Weiss, “Mining with rarity: A unifying framework,” ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 7–
19, Jun. 2004.
[5.] N. V. Chawla, N. Japkowicz, and A. Kolcz, Eds., Special Issue Learning Imbalanced Datasets, SIGKDD Explor.
Newsl.,vol. 6, no. 1, 2004.
[6.] W.-Z. Lu and D. Wang, “Ground-level ozone prediction by support vector machine approach with a cost-sensitive
classification scheme,” Sci. Total Environ., vol. 395, no. 2-3, pp. 109–116, 2008.
[7.] Y.-M. Huang, C.-M. Hung, and H. C. Jiau, “Evaluation of neural networks and data mining methods on a credit
assessment task for class imbalance problem,” Nonlinear Anal. R. World Appl., vol. 7, no. 4, pp. 720–747, 2006.
[8.] D. Cieslak, N. Chawla, and A. Striegel, “Combating imbalance in network intrusion datasets,” in IEEE Int. Conf.
Granular Comput., 2006, pp. 732–737.
[9.] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker, and G. D. Tourassi, “Training neural network
classifiers for medical decision making: The effects of imbalanced datasets on classification performance,” Neural
Netw., vol. 21, no. 2–3, pp. 427–436, 2008.
[10.] A. Freitas, A. Costa-Pereira, and P. Brazdil, “Cost-sensitive decision trees applied to medical data,” in Data
Warehousing Knowl. Discov. (Lecture Notes in Computer Science), I. Song, J. Eder, and T. Nguyen, Eds.,
Berlin/Heidelberg, Germany: Springer, 2007, vol. 4654, pp. 303–312.
[11.] K. Kılıç, Ö. Uncu and I. B. Türksen, “Comparison of different strategies of utilizing fuzzy clustering in
structure identification,” Inf. Sci., vol. 177, no. 23, pp. 5153–5162, 2007.
[12.] M. E. Celebi, H. A. Kingravi, B. Uddin, H. Iyatomi, Y. A. Aslandogan, W. V. Stoecker, and R. H. Moss, “A
methodological approach to the classification of dermoscopy images,” Comput. Med. Imag. Graph., vol. 31, no. 6,
pp. 362–373, 2007.
[13.] X. Peng and I. King, “Robust BMPM training based on second-order cone programming and its application in
medical diagnosis,” Neural Netw., vol. 21, no. 2–3, pp. 450–457, 2008.
[14.] Rukshan Batuwita and Vasile Palade, “FSVM-CIL: Fuzzy support vector machines for class imbalance
learning,” IEEE Trans. Fuzzy Syst., vol. 18, no. 3, pp. 558–571, Jun. 2010.
[15.] N. Japkowicz and S. Stephen, “The Class Imbalance Problem: A Systematic Study,” Intelligent Data Analysis, vol.
6, pp. 429-450, 2002.
[16.] M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training Sets: One-Sided Selection,” Proc. 14th
Int’l Conf. Machine Learning, pp. 179-186, 1997.
[17.] G.E.A.P.A. Batista, R.C. Prati, and M.C. Monard, “A Study of the Behavior of Several Methods for Balancing
Machine Learning Training Data,” SIGKDD Explorations, vol. 6, pp. 20-29, 2004.
[18.] D. Cieslak and N. Chawla, “Learning decision trees for unbalanced data,” in Machine Learning and Knowledge
Discovery in Databases. Berlin, Germany: Springer-Verlag, 2008, pp. 241–256.
[19.] G.Weiss, “Mining with rarity: A unifying framework,” SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 7–19, 2004.
[20.] N. Chawla, K. Bowyer, and P. Kegelmeyer, “SMOTE: Synthetic minority over-sampling technique,” J. Artif.
Intell. Res., vol. 16, pp. 321–357, 2002.
[21.] J. Zhang and I. Mani, “KNN approach to unbalanced data distributions: A case study involving information
extraction,” in Proc. Int. Conf. Mach. Learning, Workshop: Learning Imbalanced Data Sets, Washington, DC,
2003, pp. 42–48.
[22.] Yang Yong, “The Research of Imbalanced Data Set of Sample Sampling Method Based on K-Means Cluster and
Genetic Algorithm”, Energy Procedia 17 (2012) 164 – 170.
[23.] T. Jo and N. Japkowicz, “Class imbalances versus small disjuncts,” ACM SIGKDD Explor. Newslett., vol. 6, no. 1,
pp. 40–49, 2004.
[24.] S. Zou, Y. Huang, Y. Wang, J. Wang, and C. Zhou, “SVM learning from imbalanced data by GA sampling for
protein domain prediction,” in Proc. 9th Int. Conf. Young Comput. Sci., Hunan, China, 2008, pp. 982– 987.
[25.] Salvador García, Joaquín Derrac, Isaac Triguero, Cristóbal J. Carmona, Francisco Herrera, “Evolutionary-based
selection of generalized instances for imbalanced classification”, Knowledge-Based Systems 25 (2012) 3–12.
[26.] María Dolores Pérez-Godoy, Alberto Fernández, Antonio Jesús Rivera, María José del Jesus, “Analysis of an
evolutionary RBFN design algorithm, CO2RBFN, for imbalanced data sets”, Pattern Recognition Letters 31 (2010)
2375–2388.
[27.] Jin Xiao, Ling Xie, Changzheng He, Xiaoyi Jiang, “Dynamic classifier ensemble model for customer classification
with imbalanced class distribution”, Expert Systems with Applications 39 (2012) 3668–3675.
[28.] Z. Chi, H. Yan, T. Pham, Fuzzy Algorithms with Applications to Image Processing and Pattern Recognition,
World Scientific, 1996.
[29.] V. Garcia, J.S. Sanchez, R.A. Mollineda, “On the effectiveness of preprocessing methods when dealing with
different levels of class imbalance”, Knowledge-Based Systems 25 (2012) 13–21.
[30.] Alberto Fernández, María José del Jesus, Francisco Herrera, “On the 2-tuples based genetic tuning performance
for fuzzy rule based classification systems in imbalanced data-sets”, Information Sciences 180 (2010) 1268–1291.
[31.] Jinghua Wang, Jane You, Qin Li, Yong Xu, “Extract minimum positive and maximum negative features for
imbalanced binary classification”, Pattern Recognition 45 (2012) 1136–1145.
[32.] Alberto Fernández, María José del Jesus, Francisco Herrera, “On the influence of an adaptive inference system in
fuzzy rule based classification systems for imbalanced data-sets”, Expert Systems with Applications 36 (2009)
9805–9812.
[33.] Victoria López, Alberto Fernández, Jose G. Moreno-Torres, Francisco Herrera, “Analysis of preprocessing vs.
cost-sensitive learning for imbalanced classification. Open problems on intrinsic data characteristics”, Expert
Systems with Applications 39 (2012) 6585–6608.
[34.] Alberto Fernández, María José del Jesus, Francisco Herrera, “On the influence of an adaptive inference system in
fuzzy rule based classification systems for imbalanced data-sets”, Expert Systems with Applications 36 (2009)
9805–9812.
[35.] Jordan M. Malof, Maciej A. Mazurowski, Georgia D. Tourassi, “The effect of class imbalance on case selection for
case-based classifiers: An empirical study in the context of medical decision support”, Neural Networks 25 (2012)
141–145.
[36.] V. Garcia, J.S. Sanchez, R.A. Mollineda, “On the effectiveness of preprocessing methods when dealing with
different levels of class imbalance”, Knowledge-Based Systems 25 (2012) 13–21.
[37.] Nitesh V. Chawla, Nathalie Japkowicz and Aleksander Kolcz: Editorial: Special Issue on Learning from
Imbalanced Data Sets. SIGKDD Explorations 6 (1) (2004) 1-6.
[38.] G. Weiss: Mining with rarity: A unifying framework. SIGKDD Explorations 6 (1) (2004) 7-19
[39.] Chawla, N.V., Lazarevic, A., Hall, L.O. and Bowyer, K.: SMOTEBoost: Improving prediction of the Minority
Class in Boosting. 7th European Conference on Principles and Practice of Knowledge Discovery in Databases,
Cavtat Dubrovnik, Croatia (2003) 107-119.
[40.] J. R. Quinlan, C4.5: Programs for Machine Learning, 1st ed. San Mateo, CA: Morgan Kaufmann Publishers,
1993.
[41.] Bradley A.: The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern
Recognition 30 (7) (1997) 1145-1159.
[42.] A. Asuncion and D. Newman. (2007). UCI Repository of Machine Learning Databases (School of Information and
Computer Science), Irvine, CA: Univ. of California [Online]. Available:
http://www.ics.uci.edu/∼mlearn/MLRepository.html
[43.] Witten, I.H. and Frank, E. (2005) Data Mining: Practical machine learning tools and techniques. 2nd edition
Morgan Kaufmann, San Francisco.
[44.] Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and Regression Trees. Wadsworth, Belmont.
[45.] Chawla, N., Bowyer, K., Kegelmeyer, P.: SMOTE: synthetic minority over-sampling technique. J. Artif. Intell. Res.
16, 321–357 (2002).
Authors
K.P.N.V.SATYA SREE, Assistant Professor, Department of Computer Science and Engineering,
Vignan’s Nirula Institute of Technology in Science for WOMEN, Guntur, A.P
Dr. J.V.R. MURTHY, Professor, Department of Computer Science and Engineering, Jawaharlal Nehru Technological
University Kakinada, Kakinada, A.P