Multiclass classifiers vs multiple binary classifiers using filters for feature selection

N. Sánchez-Maroño, A. Alonso-Betanzos, Member, IEEE, P. García-González and V. Bolón-Canedo

Department of Computer Science, University of A Coruña, Spain (email: nsanchez@udc.es, ciamparo@udc.es, pablo.garcia.gonzalez86@gmail.com, vbolon@udc.es). This work was supported by the Spanish Ministerio de Ciencia e Innovación (under projects TIN 2006-02402 and TIN 2009-10748) and the Xunta de Galicia (under project 2007/000134-0), all of them partially supported by the European Union ERDF.

Abstract— There are two classical approaches for dealing with multiple class data sets: a classifier that can deal directly with them or, alternatively, dividing the problem into multiple binary sub-problems. While studies on feature selection using the first approach are relatively frequent in the scientific literature, very few studies employ the latter one. Out of the four classical methods that can be employed for generating binary problems from a multiple class data set (random, exhaustive, one-vs-one and one-vs-rest), the last two were employed in this work. Besides, four different methods were used for joining the results of these binary classifiers (sum, sum with threshold, Hamming distance and loss-based function). In this paper, both approaches (multiclass and multiple binary classifiers) are carried out using a combination method composed of a discretizer (four different ones were employed), a filter for feature selection (two methods were chosen), and a classifier (two classifiers were tested). The different combinations of the previous methods, with and without feature selection, were tested over 21 different multiclass data sets. An exhaustive study of the results and a comparison between the described methods and some others in the literature is carried out.

I. INTRODUCTION

Classification problems with multiple classes are common in real life applications. However, while binary classification problems have been studied intensively, only very few works have been devoted to studying multiclass classification [1], [2], [3], [4]. There are two basic approaches to deal with classifying multiple classes: one is to use classification algorithms that can deal directly with multiple classes; the alternative is to divide the original problem into several binary classification problems. Some authors conclude that there is no multiclass method that outperforms every other one, and that the method to be used in order to obtain the best results will depend on the problem, and also on other user-defined constraints, such as the desired level of accuracy, the time available for obtaining a solution, etc. [5]

Besides the multiclass characteristic, nowadays the majority of applications deal with data sets of high dimensionality. Problems with multiple classes and high dimensionality have been even less studied. Feature selection is one of the methods that can be used to reduce dimensionality. These methods aim to reduce the number of input attributes of a given problem, eliminating those that are unnecessary or redundant, obtaining a reduction in the computational resources needed and, most of the time, an improvement in the performance of the classification algorithms employed. Among the different approaches that can be employed, filters are the common option when the number of input features is very high, and they present two other interesting advantages: they are independent of the evaluation function used and they consume fewer computational resources than the alternative wrapper methods [6], [7].

In this paper, a combination method that uses a discretizer, a filter and a classification algorithm has been tested over 21 different multiclass problems. Four different discretizers, two filters and two classifiers have been selected to carry out an exhaustive comparative study. In order to check the adequacy of the feature selection step, the results of the classification method without previous filtering have also been obtained for the first approach (multiclass problem) and the second approach (multiple binary classifiers). Regarding the latter, there are several strategies that can be used for dividing the multiclass problem into several binary problems, such as one-versus-rest, one-versus-one, exhaustive, etc. In this work, the results of the first two methods are included in the comparative study. Also, there are different algorithms that can be used to integrate the information of the various binary classifiers that select the same class. In this work, sum, sum with threshold, Hamming decoding and loss-based decoding have been used [8]. In the end, 12 different combinations were tested over the 21 different data sets, and the results obtained were compared, so as to be able to reach conclusions about their performance.

II. MULTIPLE CLASS APPROACHES

There are two main approaches for classification problems involving more than two classes. One transforms the multiclass problem into several binary problems, while the other deals directly with the multiclass problem. The latter strategy has as a disadvantage a possible overtraining of the algorithm, favoring those classes that are most represented in the sample, or that are easier to separate [9]. Besides, another possible problem is that, as the number of classes increases, it can become more difficult for a feature set to provide an adequate separation among them [7]. However, the first strategy also has its drawbacks, such as how to integrate the information that comes from each of the binary classifiers, or ensuring that there is enough representation of a specific class in the training set generated from the original one to train the binary classifier.

There are different schemes to transform the multiple class problem into several binary problems, but all of them can be generalized by the Error-Correcting Output-Codes (ECOC) method [10]. In this method, instead of providing each classifier with a set of inputs and its corresponding output, the output is transformed by using a matrix M of dimension l × c, whose columns correspond to the classes (c) and whose rows correspond to the classifiers (l). There are four different schemes:
• Random: it forms l different classifiers. Each classifier randomly selects a subset of classes as the positive class and another, mutually exclusive, subset as the negative class.
• One-versus-rest: it transforms a problem with c classes into c binary problems, i.e., it uses one classifier for each class (l = c). The selected class provides the positive samples and the remaining classes the negative ones.
• One-versus-one: it generates a classifier for each pair of classes, i.e., each classifier compares two different classes. Therefore, there are l = c(c−1)/2 classifiers.
• Exhaustive: it generates one classifier for each possible combination of classes. Meaningless combinations are not taken into account, for example, considering all classes as positive. Notice that this approach includes the one-versus-rest and one-versus-one schemes.

Transforming the multiple class problem into several binary problems is a coding technique that requires a counterpart, i.e., a decoding technique. This decoding technique should integrate the information provided by all the classifiers into a unique solution, that is, the sample is finally classified into only one class. The most common decoding techniques are the Hamming distance and loss-based functions [8]; the subsequent paragraphs briefly explain both. An illustrative example that shows how different decoding techniques can lead to different results can be consulted in [8].

Consider a problem with c classes that has been codified using a matrix M of dimension l × c with l classifiers. Each class i (i = 1, . . . , c) has one of three possible values for each classifier j (j = 1, . . . , l): positive (+), negative (−) or ignored (∗). Then, the pattern of class i is a vector of length l where each position j denotes the value of class i for classifier j. For example, considering the one-vs-rest scheme, the pattern for class 1 would be {+, −, . . . , −, −}, where the number of negative entries equals l − 1 (there are no ignored classes in this scheme). Once the classifiers are trained and a new pattern is provided to them, each classifier returns a solution, forming an "objective" pattern. Then, this "objective" pattern is compared to each class pattern. The class whose pattern is at the smallest "distance" from the "objective" pattern is elected as the output class. There are different ways to measure this "distance":
• The Hamming distance between two vectors of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one vector into the other, or the number of errors that transformed one vector into the other. In this case, the differences are weighted so as to assign a higher "distance" value to those positions that differ in two values ({+, −}) than to those that differ in only one ({+, ∗} or {−, ∗}).
• The previous distance measure does not consider the actual values returned by the classifiers, causing a loss of information. The loss-based function takes this information into consideration and is adapted to each classifier. In this work, due to the classifiers employed, the logistic regression function was chosen, defined as L(z) = log(1 + e^(−2z)), where z is equal to the product of each position of the "objective" pattern and the same position of the class pattern. The value L(z) is computed for each position of a class pattern and all these values are added, thus obtaining the "distance" for that class. Notice that the "objective" patterns are of the form {0.90, 0.30, . . . , −0.40}, i.e., they reflect the values that the classifiers provided.
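As a concrete reading of the coding and decoding steps just described, the sketch below (not the authors' implementation; the NumPy-based helper names are illustrative) builds a one-vs-one code matrix and applies both the weighted Hamming rule and the loss-based rule with L(z) = log(1 + e^(−2z)).

```python
# Illustrative sketch of ECOC-style coding/decoding for the one-vs-one scheme.
# Code matrix entries: +1 (positive), -1 (negative), 0 (ignored, the '*' above).
from itertools import combinations
import numpy as np

def one_vs_one_matrix(c):
    """Build the l x c code matrix with l = c(c-1)/2 pairwise classifiers."""
    pairs = list(combinations(range(c), 2))
    M = np.zeros((len(pairs), c), dtype=int)
    for j, (a, b) in enumerate(pairs):
        M[j, a], M[j, b] = +1, -1
    return M

def hamming_decode(M, outputs):
    """Weighted Hamming decoding: a full disagreement ({+,-}) counts 1,
    an ignored entry ({+,*} or {-,*}) counts 1/2."""
    signs = np.sign(outputs)                          # hard decisions in {-1, +1}
    return 0.5 * (1.0 - M * signs[:, None]).sum(axis=0)

def loss_based_decode(M, outputs):
    """Loss-based decoding: sum of L(z) = log(1 + exp(-2z)) with z = output * pattern."""
    Z = outputs[:, None] * M
    return np.logaddexp(0.0, -2.0 * Z).sum(axis=0)    # stable form of log(1 + e^(-2z))

M = one_vs_one_matrix(4)                              # 6 classifiers for 4 classes
scores = np.array([0.9, 0.3, -0.4, 0.7, -0.2, 0.1])   # hypothetical classifier outputs
print(np.argmin(hamming_decode(M, scores)), np.argmin(loss_based_decode(M, scores)))
```

In the one-vs-one matrix every class has the same number of ignored entries, so their constant contribution to the loss-based distance does not change which class is selected.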
Another problem of the multiple binary classifiers approach is the need for a sufficient number of samples in the training set. This aspect is critical in those cases in which there is a great difference between the number of samples of one class and the number of samples of the others. In those cases, the generalization capacity of the learning algorithm can be seriously affected, as it could end up ignoring the samples of the minority class rather than trying to learn it. In order to mitigate this problem, there exist different alternatives that consist of downsizing the majority class (undersampling), upsizing the minority class (oversampling) or altering the relative costs of misclassifying the small and the large classes [11]. In this work the second strategy, oversampling, was adopted, because the unbalanced data sets employed do not have many samples available. Thus, some samples of the minority class were randomly replicated so as to obtain the same number of samples as the majority class.

III. THE METHODOLOGY

In this paper a combination method for multiclass classification problems has been tested. The method is divided into three steps:
• First, a discretization method is applied over the input data, with the aim of solving problems of unbalanced values and preparing the attributes of the sample to be processed by the feature selection algorithm of the next step. Several discretizers have been chosen to test their influence on the classification problem, specifically EWD (Equal Width Discretization), EFD (Equal Frequency Discretization), PKID (Proportional K-Interval Discretization) [12] and EM (Entropy Minimization) [13].
• After discretization, feature selection is carried out using filters. Two different methods based on different metrics were tested: the Consistency-based filter (CBF) [14] and Correlation-based feature selection (CFS) [15].
• Finally, a classifier is applied. In our case, C4.5 [16] and naïve Bayes [17] were selected, because both can be used for the direct multiclass approach. Besides, their use of computer resources is affordable, an important factor in our study due to the high dimensionality of some of the data sets employed.

Fig. 1. A scheme of the several approaches tested in this work.

The combination method above has been used over the two different general approaches that can be used for multiclass classification problems:
• A multiclass approach. This is feasible for both of the classification methods selected, C4.5 and naïve Bayes.
• A multiple binary classification approach. In this work the one-versus-rest and one-versus-one approaches have been chosen. The random scheme is not expected to return good performance results, whereas the exhaustive scheme is very computer-resource demanding due to its requirement of training a high number of classifiers. So, one-versus-rest and one-versus-one are the coding techniques adopted.

The rest of the section describes the decoding techniques used. Initially, it is necessary to explain that the results returned by the classifiers are probability values. For example, a possible output value of a classifier could be 0.9, which means that a pattern should be assigned to the positive class with a high probability. An opposite value could be 0.4, which means that this pattern should be considered negative, with reservations. The decoding technique used for the one-versus-rest scheme consists of assigning as the result class the class with the highest probability value.
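The three-step pipeline can be sketched with off-the-shelf components, under the assumption that stand-ins are acceptable: scikit-learn's KBinsDiscretizer with the 'uniform' or 'quantile' strategy plays the role of EWD or EFD, SelectKBest with mutual information stands in for the CFS and consistency-based filters (neither is available in scikit-learn), a decision tree stands in for C4.5, and the OneVsRestClassifier/OneVsOneClassifier wrappers use the library's own decoding rather than the union methods described in the next section.

```python
# Sketch of the three-step pipeline (discretize -> filter -> classify) wrapped in the
# two multiple-binary schemes; the components are stand-ins, not the methods of the paper.
from sklearn.datasets import load_iris
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.tree import DecisionTreeClassifier
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)        # Iris is one of the 21 data sets used

base = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="uniform")),
    ("filter", SelectKBest(mutual_info_classif, k=2)),
    ("classify", DecisionTreeClassifier(random_state=0)),
])

for name, clf in [("multiclass", base),
                  ("one-vs-rest", OneVsRestClassifier(base)),
                  ("one-vs-one", OneVsOneClassifier(base))]:
    scores = cross_val_score(clf, X, y, cv=10)      # 10-fold CV as in the paper
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Wrapping the whole pipeline means that each binary sub-problem repeats its own discretization and filtering, which matches the per-classifier processing described below.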
Notice that those classifiers that return the negative ("rest") class as the "winning" class are ignored. For the one-versus-one scheme, Hamming and loss-based decoding (see Section II) were used as decoding techniques. A threshold was used for computing both distances. The threshold is used to decide whether a probability value should be considered positive, negative or ignored for the Hamming distance, or to weight this probability value for the loss-based function. Apart from these distances, we developed and applied two more "ad hoc" distance measurements:
• Sum is a union method based on the probability assigned by the binary classifier to the "winning" class for each sample. Therefore, instead of calculating distances to determine the class from the l different results, the accumulative probability sum of each class is computed. Then, the desired output is the one with the highest value. Notice that this measure is similar to the one used for the one-versus-rest scheme.
• Sum with threshold is a method that modifies the previous one and takes into consideration the fact that test patterns include "ignored" classes, i.e., classes not used for the learning of the classifier. This technique only accumulates those probabilities that are over an established threshold, to guarantee that only clearly winning classes are computed.

As we have seen, there are two multiple class classification approaches. In the first one, called "multiclass", a data set with c classes is discretized, filtered and classified, and no union method is required since the prediction results are directly obtained. On the other hand, when the multiple binary classification approach is chosen, the problem turns into several binary problems, depending on the scheme adopted, one-vs-one or one-vs-rest. Each classifier requires the previous steps of discretization and filtering and, besides, after obtaining the outputs of the l binary classifiers, a union method is required in order to join the data and achieve a unique prediction. At this point, as explained above, four union methods are available for the one-vs-one scheme: sum, sum with threshold, Hamming and loss-based, and they provide the final result (see Figure 1). Besides, for all those approaches, in order to compare the performance with and without filtering and see the benefits related to feature selection, we can eliminate the filter step (dashed in the center of Figure 1). Therefore, there are 16 different combinations (4 discretizers × 2 filters × 2 classifiers) when using feature selection, plus 8 more without using it (4 discretizers × 2 classifiers). Each combination is applied for each approach, i.e., multiclass, one-vs-rest and one-vs-one with 4 different decoding techniques. The number of results achieved is enormous even for only one data set and cannot be included in a paper, so the next section tries to summarize them.

IV. THE EXPERIMENTAL RESULTS

The direct multiclass approach and the two multiple binary approaches, with and without feature selection, and with the different combinations of discretizers, filters and classifiers, were tested using Weka [18] and Matlab [19]. The 21 different data sets shown in Table I were selected [20], attempting to include several aspects such as different numbers of classes, different numbers of samples, different ratios of features to samples, unbalanced data sets, etc. Specifically, there are 4 data sets with clearly unbalanced classes: Glass, Connect-4, Dermatology and Thyroid.
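For unbalanced data sets such as these, the random replication described in Section II amounts to resampling minority-class indices with replacement; the following sketch is an illustration of that idea, not the code used in the experiments.

```python
# Random oversampling by replication: minority classes are resampled with
# replacement until every class matches the size of the largest class.
import numpy as np

def oversample_by_replication(X, y, seed=None):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    keep = [np.arange(len(y))]                         # keep all original samples
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    sel = np.concatenate(keep)
    return X[sel], y[sel]
```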
The oversampling technique was applied to the Glass data set, which has 3 of its 6 classes with a very reduced number of samples (lower than 20), making the adequate learning of the classifiers extremely difficult. A 10-fold cross-validation was used to obtain results in terms of percentage of correct classification and of the features selected by the methods that use filters. For each data set, in order to check whether the several methods used exhibit statistically significant differences in performance, a multiple comparison statistical test was employed, using ANOVA (ANalysis Of VAriance) [21] if the normality hypothesis is assumed, or an alternative non-parametric procedure, the Kruskal-Wallis test [22], otherwise. The prefix MFeat employed in some data sets of Table I means Multi-feature, whereas the prefix MLL in the Leukemia data set denotes the type of leukemia being tackled. Both prefixes will be ignored in the rest of this paper.

TABLE I
THE DATA SETS EMPLOYED IN THE EXPERIMENTAL STUDY

Data set | Classes | Samples | Features
Iris | 3 | 150 | 4
Vehicle | 4 | 846 | 18
Wine | 3 | 178 | 13
Waveform | 3 | 5000 | 21
Segment | 7 | 2310 | 19
Glass | 6 | 214 | 10
Connect 4 | 3 | 67557 | 42
Dermatology | 6 | 366 | 34
Vowel | 11 | 990 | 13
KDD SC | 6 | 600 | 60
Splice | 3 | 3190 | 61
Thyroid | 3 | 3772 | 21
Optdigits | 10 | 3823 | 64
Pendigits | 10 | 7494 | 16
Landsat | 6 | 4435 | 36
MFeat-Fourier | 10 | 2000 | 76
MFeat-Factor | 10 | 2000 | 216
MFeat-Karhounen | 10 | 2000 | 64
MFeat-Pixel | 10 | 2000 | 240
MFeat-Zernike | 10 | 2000 | 47
MLL-Leukemia | 3 | 57 | 12582

For each data set, 16 different combinations with feature selection and 8 without filtering were used for each approach considered. Then, there are 24 performance results for both the multiclass and the one-vs-rest approach and, moreover, 96 (24 × 4 union techniques) one-vs-one results. Trying to show all results becomes intractable, so only the best results for each scheme with and without feature selection are shown. This yields 12 different results for each data set. As an example, in Figure 2 the results obtained for the Thyroid data set are shown. As can be seen, the best performance with the smallest set of features is obtained by the one-vs-one approach using the sum method to make the union of the multiple binary classifiers and the combination EM + Consistency-Based Filter + C4.5. The precision obtained was 99.52 ± 0.34 using 5 features (the number of attributes used by this approach is 23.81% of the total).

Fig. 2. Best results obtained for the Thyroid data set. The percentage of features selected is represented on the left y axis, while the accuracy obtained by the approach is drawn on the right. On the x axis, the best results for each approach without and with feature selection are displayed.

In any case, Figure 2 shows that all approaches in this data set obtained good results when using feature selection. Both the multiclass and one-vs-one approaches (using sum and sum with threshold as union techniques) achieved better accuracy values using feature selection than not using it. On the other hand, two of the one-vs-one versions (using Hamming and loss-based) and the one-vs-rest approach obtained lower accuracy values when using feature selection. This latter case is the one with the lowest accuracy value when feature selection is applied (99.23 ± 0.36), a slightly lower value than using the same approach without feature selection (99.44 ± 0.34). However, the difference is not statistically significant and the reduction in the number of features used is important (21 vs 8).
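The per-data-set significance check described above can be sketched with SciPy, assuming the per-fold accuracies of each approach are available; the fold scores below are made-up placeholders.

```python
# Sketch of the per-data-set comparison: ANOVA on the 10 fold accuracies when the
# normality assumption holds, Kruskal-Wallis otherwise. The fold scores are illustrative.
import numpy as np
from scipy import stats

folds = {
    "multiclass":  np.array([.991, .993, .990, .995, .992, .994, .993, .991, .996, .994]),
    "one-vs-rest": np.array([.990, .992, .991, .993, .991, .992, .994, .990, .993, .992]),
    "one-vs-one":  np.array([.995, .996, .994, .997, .995, .996, .995, .994, .997, .996]),
}

groups = list(folds.values())
normal = all(stats.shapiro(g).pvalue > 0.05 for g in groups)   # crude normality screen
stat, p = stats.f_oneway(*groups) if normal else stats.kruskal(*groups)
print("ANOVA" if normal else "Kruskal-Wallis", f"p = {p:.4f}")
```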
Figure 3 shows the result of applying a multiple comparison function (0.05 significance level) to the number of features selected by the multiclass, one-vs-rest and one-vs-one Sum approaches (for the sake of completeness, the set containing all features was also included) for the Thyroid data set. The graph displays each group mean represented by a symbol and an interval around the symbol. Two means are significantly different if their intervals are disjoint, and are not significantly different if their intervals overlap. Therefore, it can be seen that there is a significant statistical difference between the approaches with and without feature selection. Notice that these differences are not as pronounced as the bars in Figure 2 may suggest, because the x axis denotes average group ranks. Ranks are found by ordering the data from smallest to largest across all groups and taking the numeric index of this ordering. The rank for a tied observation is equal to the average rank of all observations tied with it. So, in Figure 3, 40 different values were ranked (10 values, one per fold, for each one of the 4 approaches considered). Clearly, there are 10 tied values at the last positions of this ranking, one per fold of the approach with all features.

Fig. 3. Multiple comparison results using all features (ALL) and the features selected by the multiclass (MC), one-vs-rest (1R) and one-vs-one (11) approaches for the Thyroid data set.

On the other hand, the feature subset returned by the one-vs-one approach is most of the time at the top. The important reduction in the number of necessary attributes achieved by the feature selection methods makes their use worthwhile, especially in those cases in which the elimination of features can contribute to a better explanation of clinical situations, as is the case in some of the data sets used in this work (i.e., Leukemia, Thyroid, ...). So, another interesting example is the Leukemia data set, which has a much higher number of features (12582) than samples (57). Besides, it is representative of a type of data set that is receiving considerable attention in research, the microarray data sets. In this type of data set, feature selection can have a high impact, because eliminating unnecessary features can help biologists to better explain the behavior of the genes involved in cancer research. The results obtained for the Leukemia data set can be seen in Figure 4. Again, better results are obtained by the feature selection versions of the methods, and the best is that of the one-vs-rest approach. Although the accuracy is the same for multiclass with feature selection and for one-vs-rest with and without feature selection, the drastic reduction in the number of features needed (108 out of 12582) makes it worthwhile to use EWD+CFS+NB.

Fig. 4. Best results obtained for the Leukemia data set. The percentage of features selected is represented on the left y axis, while the accuracy obtained by the approach is drawn on the right. On the x axis, the best results for each approach without and with feature selection are displayed.

TABLE II
BEST RESULTS FOR EACH DATA SET: PERFORMANCE WITH AND WITHOUT FEATURE SELECTION (FS AND WITHOUT FS). FOR EACH ONE, THE MEAN AND STANDARD DEVIATION OF THE ACCURACY OBTAINED FROM THE 10-FOLD CV (ACC) AND THE RANKING OCCUPIED BY THE APPROACH (RK); WHEN USING FS, THE PERCENTAGE OF FEATURES SELECTED (%FEAT). OF THE ONE-VS-ONE APPROACHES, ONLY THE RESULTS USING THE SUM ARE DISPLAYED, BECAUSE THEY TEND TO SHOW THE BEST PERFORMANCES. AT THE END OF THE TABLE, THE AVERAGE ACCURACY AND AVERAGE RANKING ARE SHOWN FOR THE SIX APPROACHES.

Data set | Multiclass Without FS: Acc (Rk) | Multiclass FS: Acc (Rk), %Feat | 1vsRest Without FS: Acc (Rk) | 1vsRest FS: Acc (Rk), %Feat | 1vs1 Sum Without FS: Acc (Rk) | 1vs1 Sum FS: Acc (Rk), %Feat
Iris | 95.99±3.44 (6) | 97.99±3.22 (1), 50.00 | 96.00±6.44 (3) | 96.65±3.51 (2), 75.00 | 94.00±7.34 (9) | 96.00±4.66 (3), 50.00
Vehicle | 69.73±6.02 (8) | 71.62±4.33 (2), 38.89 | 72.11±4.83 (1) | 67.60±6.12 (12), 100 | 71.27±4.70 (3) | 71.05±6.23 (4), 100
Wine | 98.88±2.34 (2) | 98.33±2.68 (3), 84.62 | 98.3±2.74 (6) | 97.71±4.83 (10), 100 | 97.77±3.88 (7) | 97.19±2.96 (11), 53.85
Waveform | 80.98±1.72 (10) | 80.86±1.82 (11), 71.42 | 79.78±1.78 (12) | 82.36±1.76 (6), 95.23 | 81.02±1.72 (8) | 82.64±1.50 (4), 90.47
Segment | 91.38±1.99 (12) | 92.42±2.07 (9), 63.15 | 92.12±1.96 (11) | 92.42±2.25 (9), 94.73 | 94.31±1.61 (5) | 94.37±1.54 (4), 84.21
Glass | 73.83±8.56 (9) | 70.15±9.06 (11), 77.77 | 72.01±8.20 (10) | 68.83±8.51 (12), 100 | 85.81±4.58 (3) | 85.34±8.51 (4), 100
Connect 4 | 80.94±0.74 (2) | 81.16±0.51 (1), 83.33 | 80.61±0.50 (3) | 80.61±0.51 (3), 97.62 | 80.38±0.56 (5) | 80.26±0.55 (8), 97.62
Dermat. | 98.35±2.31 (4) | 98.91±1.89 (1), 67.64 | 97.79±2.87 (11) | 97.00±2.36 (12), 76.47 | 98.07±2.28 (8) | 98.34±1.93 (7), 94.11
Vowel | 74.04±5.15 (2) | 74.64±3.67 (1), 84.61 | 54.04±5.37 (4) | 64.44±4.82 (3), 84.61 | - | -
KDD SC | 96.99±2.19 (9) | 96.16±2.83 (10), 91.66 | 95.00±1.76 (11) | 82.33±19.11 (12), 96.66 | 97.00±2.19 (6) | 98.33±1.11 (1), 95.00
Splice | 95.74±0.94 (9) | 96.08±1.57 (6), 55.74 | 96.14±1.00 (3) | 95.89±1.63 (8), 55.73 | 96.21±0.82 (1) | 96.14±1.12 (3), 63.93
Thyroid | 99.28±0.60 (10) | 99.44±0.56 (3), 28.57 | 99.23±0.34 (11) | 99.44±0.36 (3), 38.09 | 99.52±0.44 (1) | 99.44±0.34 (3), 23.81
Optdigits | 92.67±1.05 (8) | 92.62±0.69 (10), 57.81 | 92.62±1.91 (10) | 91.68±1.55 (12), 85.93 | 93.83±1.94 (5) | 94.19±1.04 (3), 84.37
Pendigits | 89.61±0.93 (11) | 89.32±0.93 (12), 62.50 | 93.50±1.21 (8) | 93.40±1.10 (9), 100 | 95.78±0.65 (1) | 94.15±1.40 (6), 100
Landsat | 85.27±1.40 (8) | 84.75±1.45 (12), 88.88 | 84.82±2.14 (11) | 84.98±1.57 (9), 100 | 86.89±1.21 (3) | 87.10±1.36 (1), 100
Karhounen | 92.50±1.90 (4) | 93.10±1.64 (1), 92.18 | 92.50±1.56 (4) | 90.95±1.78 (8), 100 | 89.75±2.58 (10) | 92.50±2.27 (4), 100
Factor | 93.30±1.70 (6) | 95.90±1.32 (1), 49.53 | 88.70±2.2 (8) | 92.90±2.52 (7), 84.72 | 88.00±2.51 (10) | 95.65±1.85 (2), 99.07
Fourier | 77.40±2.75 (9) | 78.60±1.76 (3), 69.73 | 77.10±2.54 (10) | 76.25±2.40 (12), 98.68 | 78.40±2.36 (4) | 78.90±3.71 (1), 88.15
Pixel | 93.50±1.24 (4) | 93.30±1.18 (5), 57.50 | 91.70±1.33 (11) | 91.15±1.39 (12), 90.00 | 92.95±1.42 (7) | 94.15±1.05 (2), 97.91
Zernike | 72.05±1.97 (11) | 72.35±3.68 (9), 68.08 | 71.65±2.40 (12) | 72.15±3.44 (10), 95.74 | 75.20±2.62 (1) | 74.90±4.21 (2), 100
Leukemia | 91.33±12.09 (12) | 96.67±7.03 (1), 4.79 | 96.67±7.02 (1) | 96.67±7.03 (1), 0.86 | 93.33±8.61 (5) | 93.33±11.65 (5), 0.70
Average accuracy | 87.80 | 88.30 | 86.78 | 86.45 | 89.47 | 90.20
Average ranking | 7.43 | 5.48 | 7.67 | 8.19 | 5.10 | 3.90

A. Analysis of multiclass versus multiple binary approaches

In order to be able to establish a general comparative picture between all methods and both alternatives, Table II shows the results obtained for all 21 data sets by the multiclass, one-vs-rest and one-vs-one approaches with (FS) and without (Without FS) feature selection. For the latter approach, only the results achieved by the sum union method are shown, because it is the one obtaining the best results on average. A column labeled Rk has been added with the idea of detecting which approach is the best on average. To be able to do this, the 12 different approaches are listed in order of the percentage obtained for each data set. Subsequently, the average ranking position is computed for each of the 12 approaches, so as to compare the different methods. This value is shown in the last row of the table. Also, the average accuracy is displayed for each method in the previous row. It is necessary to note that the ranks are computed over all 12 approaches, although only the best 6 of them are shown in the table. From the results in Table II it can be concluded that, although the multiclass approach appears to behave better in the number of features selected, with accuracy similar to the alternative multiple binary classifiers, the approach with the best average results, in both ranking and accuracy, is one-vs-one using sum and feature selection. The ranking value of the latter, 3.9, is clearly separated from the rest, which obtain values higher than 5. The same behavior can be observed for the accuracy, in which the best value is again obtained by one-vs-one using sum and feature selection, although in this case the difference with the other methods is smaller. Notice that the one-vs-one scheme was not carried out for the Vowel data set, because its 10 different classes would imply the training of 45 classifiers in this approach. A deeper analysis of Table I suggests that the one-vs-one approach clearly surpasses the others when there exists a high ratio of samples to features and enough samples per class. In general, the one-vs-rest approach exhibits a poor behavior, even when using feature selection, but surprisingly it achieves the best results when dealing with the Leukemia data set, which has a very low ratio of samples to features. Besides, the results obtained by this approach worsen as the number of classes increases. This is due to the consequent unbalanced division into positive and negative classes for each of its classifiers.
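The Rk columns and the average-ranking row of Table II follow the usual rank-then-average recipe; a minimal sketch, with an illustrative accuracy matrix, could look as follows (ties receive the average rank, as in the fold-level comparison of Figure 3).

```python
# Sketch of the average-ranking comparison used in Table II: for every data set the
# approaches are ranked by accuracy (rank 1 = best, ties share the average rank) and
# the ranks are then averaged over data sets. The accuracy matrix is illustrative.
import numpy as np
from scipy.stats import rankdata

def average_ranking(acc):
    """acc[d, a]: accuracy of approach a on data set d. Returns the mean rank per approach."""
    ranks = np.vstack([rankdata(-row, method="average") for row in acc])
    return ranks.mean(axis=0)

acc = np.array([[97.99, 96.65, 96.00],   # data set 1: three approaches, no ties
                [99.44, 99.44, 99.44]])  # data set 2: a three-way tie
print(average_ranking(acc))              # -> [1.5, 2.0, 2.5]
```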
The good results achieved by the multiclass approach were unexpected, and a further analysis was done in order to gain more insight. It is important to remember that different combinations of discretizer, filter and classifier were run for each approach and that the best combination was selected for each one. An example showing all the combinations for the Factor data set is depicted in Table III. The first block of rows of this table shows the results achieved when using the C4.5 classifier, while the last block is devoted to the naïve Bayes classifier. The best accuracy obtained for each combination is emphasized in bold font. The last row indicates the average accuracy achieved by each approach. It can be seen that the best accuracy (marked with a grey background) is obtained by the combination of the minimum entropy discretizer + correlation-based filter + C4.5 classifier using the multiclass approach. However, the last row shows that the one-vs-one scheme achieves the best result on average and, moreover, this scheme obtains the best accuracy in 12 of the 16 combinations, whereas the multiclass approach only gets the best values in 3 of them.

TABLE III
RESULTS IN ACCURACY OBTAINED FOR EACH OF THE APPROACHES TESTED FOR THE Factor DATA SET. THE AVERAGE ACCURACY OBTAINED FOR THE MULTICLASS AND BOTH MULTIPLE BINARY CLASS APPROACHES IS SHOWN IN THE LAST ROW OF THE TABLE.

C4.5 classifier
Combin. | Multiclass | 1vsRest | 1vs1 Sum
EWD+CFS | 95.00 ± 1.81 | 92.15 ± 2.30 | 94.65 ± 2.07
EFD+CFS | 94.65 ± 1.47 | 91.50 ± 1.83 | 94.65 ± 1.55
PKID+CFS | 92.50 ± 2.12 | 90.85 ± 1.93 | 94.25 ± 1.21
EM+CFS | 95.90 ± 1.33 | 92.90 ± 2.53 | 95.65 ± 1.86
EWD+CBF | 84.90 ± 3.00 | 90.45 ± 2.73 | 92.35 ± 1.29
EFD+CBF | 82.10 ± 1.90 | 88.65 ± 1.68 | 90.90 ± 1.88
PKID+CBF | 69.75 ± 3.84 | 84.35 ± 1.89 | 90.95 ± 2.78
EM+CBF | 84.90 ± 3.34 | 92.55 ± 1.74 | 92.15 ± 1.97

Naïve Bayes classifier
Combin. | Multiclass | 1vsRest | 1vs1 Sum
EWD+CFS | 79.20 ± 1.89 | 84.60 ± 2.85 | 89.75 ± 2.53
EFD+CFS | 74.10 ± 2.53 | 85.45 ± 1.76 | 88.95 ± 2.06
PKID+CFS | 52.75 ± 3.07 | 71.00 ± 4.60 | 88.05 ± 2.31
EM+CFS | 80.65 ± 3.07 | 87.75 ± 3.16 | 88.95 ± 2.60
EWD+CBF | 78.60 ± 3.70 | 85.60 ± 2.39 | 89.20 ± 1.89
EFD+CBF | 76.05 ± 3.13 | 85.00 ± 2.20 | 88.75 ± 2.23
PKID+CBF | 53.95 ± 2.99 | 71.50 ± 2.59 | 89.40 ± 2.13
EM+CBF | 80.80 ± 3.15 | 87.30 ± 2.39 | 89.55 ± 2.24
Average | 79.74 ± 2.65 | 86.35 ± 2.41 | 91.13 ± 2.04

Table IV shows similar results for the Dermatology and Karhounen data sets, but summarized. For each data set, the first row shows the values for each approach in the combination where the best accuracy is achieved (the EFD+CFS+NB combination for the Dermatology data set and the EWD+CFS+NB combination for the Karhounen data set). The second row indicates the average over all combinations. Again, the multiclass approach gets the best performance result in a single combination, but it is clearly surpassed by the one-vs-one scheme when focusing on averages.

TABLE IV
BEST ACCURACY AND AVERAGE ACCURACY OBTAINED FOR THE MULTICLASS AND BOTH MULTIPLE BINARY CLASS APPROACHES FOR THE Dermatology AND Karhounen DATA SETS.

Data set | Multiclass | 1vsRest | 1vs1 Sum
Dermat. (best) | 98.91 ± 1.90 | 96.70 ± 3.61 | 98.08 ± 3.43
Dermat. (average) | 94.34 ± 4.06 | 94.66 ± 4.16 | 96.62 ± 3.52
Karhounen (best) | 93.10 ± 1.64 | 89.60 ± 2.20 | 91.60 ± 1.43
Karhounen (average) | 71.72 ± 3.16 | 73.73 ± 2.85 | 87.12 ± 2.33

B. Best discretizer, filter and classifier combination

In this work, 16 different combinations of discretizer, filter and classifier were tried over 21 data sets. Moreover, for reasons of completeness, the filtering step was considered optional. Table V attempts to determine which combination gets the best accuracy values.
If two combinations obtain identical accuracy, both are counted in this table, and so all values in Table V add up to 24, not to 21.

TABLE V
NUMBER OF TIMES A COMBINATION GETS THE BEST RESULTS. W, F, K AND M STAND FOR THE EWD, EFD, PKID AND EM DISCRETIZERS, AND CFS AND CBF FOR THE TWO FILTERS, RESPECTIVELY.

CFS: W 4, M 4; W 0, F 0, K 1, M 0
Naïve Bayes classifier: CBF (W, F, K, M) = 1, 0, 0, 2; Without FS (W, F, K, M) = 1, 0, 0, 2
C4.5 classifier: CBF (W, F, K, M) = 2, 0, 0, 2; Without FS (W, F, K, M) = 1, 0, 0, 3

Several conclusions can be extracted from this table, attending to different issues:
• Feature selection (with or without): the number of combinations using feature selection and reaching the best performance values is 17 out of 24, which indicates the adequacy of its use.
• Discretizer: the entropy minimization discretizer forms part of the "best" combination on 13 occasions. On the other hand, PKID is only included in a "best" combination using C4.5; although it is suited to the naïve Bayes classifier, it is suboptimal when learning from training data of small size [12].
• Filter: CFS seems to be a good filter combined with the NB classifier, using either the EWD or the EM discretizer. However, this filter does not achieve the best results when it is applied together with the C4.5 classifier; in this case, the consistency-based filter is preferred.
• Classifier: naïve Bayes obtains better results in more data sets than C4.5. Specifically, naïve Bayes gets the best values for 12 different data sets, while C4.5 does only for 8 of them (both classifiers achieve the same accuracy for the Dermatology data set).
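The tie handling behind Table V (both combinations counted when they reach the same best accuracy, so the totals add up to 24 rather than 21) can be expressed in a few lines; the data-set and combination entries below are purely illustrative.

```python
# Sketch of the tie-aware bookkeeping behind Table V: for every data set, every
# combination whose accuracy equals the data set's overall best gets one count.
from collections import Counter
import numpy as np

# best_acc[data set][combination] = best accuracy reached over the approaches tested
best_acc = {
    "data set A": {"EM+CFS+C4.5": 95.9, "EWD+CFS+NB": 93.1},
    "data set B": {"EM+CFS+C4.5": 99.4, "EM+CBF+C4.5": 99.4},   # a tie
}

counts = Counter()
for dataset, combos in best_acc.items():
    top = max(combos.values())
    counts.update(c for c, acc in combos.items() if np.isclose(acc, top))
print(counts)   # the tie in "data set B" adds one count to each tied combination
```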
C. Comparative study with other methods

In this section we try to compare our results with those existing in the literature. Notice that it is not a completely "fair" study, in the sense that the validation methodologies may differ from one study to another. As can be seen in Table VI, we obtained better results than those achieved by other methods for 6 data sets. This is a remarkable fact, because we are comparing our method with a group of methods that in some cases were specially designed for a specific problem, such as the Parzen method for image data sets [28]. It is important to notice the difference in accuracy for the Glass data set (up to a 17% improvement). It is also worth mentioning the reduced set of features used for the Thyroid data set, while there is a slight increment in the accuracy. Analogously, in the Connect 4 results, accuracy is improved while the number of features is reduced by 16.7%.

TABLE VI
DATA SETS FOR WHICH THE PROPOSED COMBINATIONS IMPROVE THE RESULTS ACHIEVED BY METHODS EXISTING IN THE BIBLIOGRAPHY. ACC STANDS FOR ACCURACY AND %FEAT FOR THE PERCENTAGE OF FEATURES EMPLOYED.

Data | Best value in bibliography: Acc, Method | Best value achieved: Acc, %Feat
Wine | 97.8, NB-Back [26] | 98.9, 100
Glass | 70.9, C4.5+EECOCs [23] | 87.7, 100
Connect4 | 79.2, C4.5 [24] | 81.2, 83.33
Dermatol. | 97.5, NB [25] | 98.9, 100
Splice | 95.4, NB [25] | 96.2, 100
Thyroid | 99.4, C4.5+DMIFS [25] | 99.5, 23.81

Table VII reflects those data sets where our performance results did not surpass the existing ones. It is important to remark that some methods are highly computer-resource demanding, for example the EECOC method, that is, the exhaustive ECOC commented on in Section II.

TABLE VII
DATA SETS FOR WHICH THE PROPOSED COMBINATIONS DO NOT IMPROVE THE RESULTS ACHIEVED BY METHODS EXISTING IN THE BIBLIOGRAPHY. ACC STANDS FOR ACCURACY AND %FEAT FOR THE PERCENTAGE OF FEATURES EMPLOYED. LR MEANS LOGISTIC REGRESSION.

Data | Best value in bibliography: Acc, Method | Best value achieved: Acc, %Feat
Iris | 100, RNA [25] | 98.0, 50
Vehicle | 75.8, AFN-FS [26] | 72.1, 100
Waveform | 86.6, LR+EECOCs [23] | 84.1, 90.47
Segment | 97.5, C4.5+EECOCs [23] | 94.8, 100
Vowel | 93.2, C4.5+EECOCs [23] | 74.6, 84.61
KDD SC | 98.4, Naïve Bayes [25] | 98.3, 95.00
Optdigits | 98.1, C4.5+EECOCs [23] | 94.3, 84.67
Pendigits | 99.1, C4.5+EECOCs [23] | 95.8, 100
Landsat | 91.0, MLP+SCG [27] | 87.1, 100
Karhounen | 96.3, Parzen [28] | 93.1, 92.18
Factor | 96.6, linear Bayes [28] | 95.9, 49.53
Fourier | 82.9, Parzen [28] | 78.9, 88.15
Pixel | 96.3, linear Bayes [28] | 94.3, 97.91
Zernike | 82.0, Parzen [28] | 75.2, 100
Leukemia | 98.2, SVM+3NN [29] | 96.7, 0.86

For some data sets, the performance accuracy obtained by the combination method proposed in this paper is lower than the other methods' accuracy; however, the reduction in the number of features is very significant, see for example the Factor or Leukemia data sets in Table VII. Notice that this latter data set has a very high ratio of features to samples, making the adequate learning of any classifier extremely difficult. Nevertheless, the combination only employs 0.86% of the features and obtains a good level of accuracy on the test data.

V. CONCLUSIONS

The main goal of this paper was to study the combination of discretizers and feature selection methods in multiple class problems. Four discretizers, two filters and two classifiers were taken into account, obtaining 16 different combinations that were tested over 21 data sets broadly used in the bibliography. Moreover, two approaches were considered to deal with multiple class problems: the first one consists of applying a suitable classifier for those problems; the second one divides the problem into several binary problems, and the one-vs-rest and one-vs-one schemes were used to generate this division. In the latter scheme, the results provided by each classifier were gathered using 4 decoding techniques, although only the results obtained by the best of them are shown in this paper. Therefore, 16 combinations were applied using three different approaches, one of them using 4 ways to join its results. Besides, for the sake of completeness, the same approaches were also tested without using the feature selection step, i.e., without filtering.

The experimental results support the hypothesis that using feature selection leads to better performance results than not using it. Moreover, it eliminates the associated cost of acquisition and storage of the discarded features. On the other hand, comparing the different approaches to deal with multiple class problems, the one-vs-one scheme obtains better accuracy results on average than the others, although using a higher number of features. This approach is also more computationally demanding than the others, because each sample must be tested by the different classifiers, unifying their results later to obtain the desired output. From the experimental results achieved, the one-vs-one scheme should be used when there are enough samples per class and features, while, on the contrary, the one-vs-rest scheme can be used with data sets with a large number of features and a reduced number of samples. Nevertheless, a deeper theoretical analysis needs to be done to support these hypotheses, trying to relate some properties of the data set (separability, number of classes, ratios of samples to features, etc.) to the adequacy of a given approach. Regarding the numerous combinations checked, several of them exhibited a good behavior, so it becomes difficult to select one.
The entropy minimization discretizer seems to be more adequate than the rest of the discretizers, and the CFS filter is preferred when using the naïve Bayes classifier, while the consistency-based filter is the best when applying C4.5.

A comparative study was carried out to test the effectiveness of the proposed combinations against other existing methods. Six data sets obtained better performance results than those provided by other authors. As was shown in this study, the exhaustive methods for generating binary problems achieved good performance results and suggest a future line of research. Another interesting line can be the application of more advanced classifiers, such as Support Vector Machines, that may obtain better performance results, also using them as part of the feature selection process. Finally, the combinations return very good results for the Leukemia data set, which has a very high number of features with a reduced set of samples; therefore, in a more exhaustive study, the combinations would be applied to this appealing type of problem.

REFERENCES

[1] T. Li, C. Zhang and M. Ogihara. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics, vol. 20, no. 15, pp. 2429-2437, 2004
[2] C. H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R. M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov and T. Golub. Molecular classification of multiple tumor types, Bioinformatics, vol. 17, pp. 316-322, 2001
[3] G. Madzarov, D. Gjorgjevikj and J. Chorbev. A multiclass SVM classifier using binary decision tree, Informatica, vol. 33, pp. 233-241, 2009
[4] Y. Ivar Chang and S. Lin. Synergy of logistic regression and Support Vector Machines in multiple-class classification, In Proc. IDEAL, LNCS, vol. 3177, pp. 132-141, 2004
[5] A. Golestani, K. Almadian, A. Amiri and M. JahedMotlagh. A novel adaptive-boost-based strategy for combining classifiers using diversity concept, 6th IEEE/ACIS Int. Conf. on Computer and Information Science (ICIS), 2007
[6] R. Kohavi and G. H. John. Wrappers for feature subset selection, Artificial Intelligence Journal, Special issue on relevance, vol. 97, no. 1-2, pp. 273-324, 1997
[7] I. Guyon, S. Gunn, M. Nikravesh and L. Zadeh, Feature Extraction. Foundations and Applications, Springer, 2006
[8] E. L. Allwein, R. E. Schapire and Y. Singer, Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, Journal of Machine Learning Research, vol. 1, pp. 113-141, 2001
[9] G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, pp. 1289-1305, 2003
[10] T. G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research, vol. 2, pp. 263-285, 1995
[11] N. Japkowicz and S. Stephen, The class imbalance problem: A systematic study, Intelligent Data Analysis, vol. 6, no. 5, 2002
[12] Y. Yang and G. I. Webb, Proportional k-Interval Discretization for Naive-Bayes Classifiers, In Proceedings of the 12th European Conference on Machine Learning, pp. 564-575, 2001
[13] U. M. Fayyad and K. B. Irani, Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning, Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1029, Morgan Kaufmann, 1993
[14] M. Dash and H. Liu, Consistency-based Search in Feature Selection, Artificial Intelligence, vol. 151, no. 1-2, pp. 155-176, 2003
[15] M. A. Hall, Correlation-based Feature Selection for Machine Learning, PhD thesis, University of Waikato, Hamilton, New Zealand, 1999
[16] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993
[17] I. Rish, An Empirical Study of the naïve Bayes Classifier, Proceedings of the IJCAI-01 Workshop on Empirical Methods in Artificial Intelligence, vol. 335, 2001
[18] I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005. http://www.cs.waikato.ac.nz/ml/weka/. Last access: February 2010
[19] The Mathworks. Matlab tutorial, 1984. http://www.mathworks.com/academia/student_center/tutorials/. Last access: February 2010
[20] A. Asuncion and D. J. Newman. UCI machine learning repository, University of California, Irvine, School of Information and Computer Sciences, http://mlearn.ics.uci.edu/MLRepository.html. Last access: February 2010
[21] R. Fisher, Statistical Methods for Research Workers, Oliver and Boyd, 1925
[22] W. H. Kruskal and W. A. Wallis, Use of ranks in one-criterion variance analysis, Journal of the American Statistical Association, vol. 47, no. 260, pp. 583-621, 1952
[23] E. Frank and S. Kramer. Ensembles of Nested Dichotomies for Multi-Class Problems, Proceedings of the International Conference on Machine Learning, ACM Press, pp. 305-312, 2004
[24] N. Kerdprasop and K. Kerdprasop. Data partitioning for incremental data mining, The 1st International Forum on Information and Computer Science (IFICT), Shizuoka University, Japan, pp. 114-118, 2003
[25] H. Liu and H. Zhang. Feature selection with dynamic mutual information, Pattern Recognition, vol. 42, no. 7, pp. 1330-1339, 2009
[26] N. Sánchez-Maroño, A. Alonso-Betanzos and R. M. Calvo. A wrapper method for feature selection in multiple classes datasets, J. Cabestany et al. (Eds.), IWANN, Part I, LNCS 5517, pp. 456-463, 2009
[27] http://www.mathworks.com/academia/student_center/tutorials/. Last access: February 2010
[28] P. W. Duin, A. K. Jain and J. Mao. Statistical Pattern Recognition: A Review, 2000
[29] C. J. Alonso-González, Q. I. Moro, O. J. Prieto and M. Aránzazu Simón. Selecting few genes for microarray gene expression classification, Actas de la Conferencia Española para la Inteligencia Artificial, pp. 21-31, 2009