Multiclass classifiers vs multiple binary classifiers using filters for
feature selection
N. Sánchez-Maroño, A. Alonso-Betanzos, Member, IEEE, P. García-González and V. Bolón-Canedo
Abstract— There are two classical approaches for dealing with multiple class data sets: using a classifier that can deal directly with them or, alternatively, dividing the problem into multiple binary sub-problems. While studies on feature selection using the first approach are relatively frequent in the scientific literature, very few studies employ the latter. Out of the four classical methods that can be employed for generating binary problems from a multiple class data set (random, exhaustive, one-vs-one and one-vs-rest), the last two were employed in this work. Besides, four different methods were used for joining the results of these binary classifiers (sum, sum with threshold, Hamming distance and loss-based function). In this paper, both approaches (multiclass and multiple binary classifiers) are carried out using a combination method composed of a discretizer (two different ones were employed), a filter for feature selection (two methods were chosen), and a classifier (two classifiers were tested). The different combinations of the previous methods, with and without feature selection, were tested over 21 different multiclass data sets. An exhaustive study of the results and a comparison between the described methods and some others in the literature is carried out.
I. INTRODUCTION
Classification problems with multiple classes are common in real life applications. However, while binary classification problems have been studied intensively, only a few works have studied multiclass classification [1], [2], [3], [4]. There are two basic approaches to deal with classifying multiple classes: one is to use classification algorithms that can deal directly with multiple classes; the alternative is to divide the original problem into several binary classification problems. Some authors conclude that there is no multiclass method that outperforms every other one, and that the method to be used in order to obtain the best results will depend on the problem, and also on other user-defined constraints, such as the desired level of accuracy, the time available for obtaining a solution, etc. [5]
Besides the multiclass characteristic, nowadays the majority of applications deal with data sets of high dimensionality. Problems with multiple classes and high dimensionality have been even less studied. Feature selection is one of the methods that can be used to reduce dimensionality. These methods aim to reduce the number of input attributes of a given problem, eliminating those that are unnecessary or redundant, obtaining a reduction in the computational resources needed and, most of the time, improving the performance of the classification algorithms employed. Among the different approaches that can be employed, filters are the common option when the number of input features is very high, presenting also two other interesting advantages: they are independent of the evaluation function used and they consume fewer computational resources than the alternative wrapper methods [6], [7].
Department of Computer Science, University of A Coruña, Spain (email: nsanchez@udc.es, ciamparo@udc.es, pablo.garcia.gonzalez86@gmail.com, vbolon@udc.es).
This work was supported by the Spanish Ministerio de Ciencia e Innovación (under projects TIN 2006-02402 and TIN 2009-10748) and the Xunta de Galicia (under project 2007/000134-0), all of them partially supported by the European Union ERDF.
In this paper, a combination method that uses a discretizer, a filter and a classification algorithm has been tested over 21 different multiclass problems. Four different discretizers, two filters and two classifiers have been selected to carry out an exhaustive comparative study. In order to check the adequacy of the feature selection step, the results of the classification method without previous filtering have been obtained for the first approach (multiclass problem) and the second approach (multiple binary classifiers). Regarding the latter, there are several strategies that can be used for dividing the multiclass problem into several binary problems, such as one-versus-rest, one-versus-one, exhaustive, etc. In this work, the results of the first two methods are included in the comparative study. Also, there are different algorithms that can be used to integrate the information of the various binary classifiers that select the same class. In this work, sum, sum with threshold, Hamming decoding and loss-based decoding have been used [8]. In the end, 12 different combinations were tested over the 21 different data sets, and the results obtained were compared, so as to be able to reach conclusions about their performance.
II. MULTIPLE CLASS APPROACHES
There are two main approaches for classification problems involving more than two classes. One transforms the multiclass problem into several binary problems, while the other deals directly with the multiclass problem. The latter strategy has as a disadvantage the possible overtraining of the algorithm, favoring those classes that are most represented in the sample, or that are easier to separate [9]. Besides, another possible problem is that, as the number of classes increases, it becomes more difficult for a feature set to provide an adequate separation among them [7]. However, the first strategy also has its drawbacks, such as how to integrate the information that comes from each of the binary classifiers, or whether there is enough representation of a specific class in the training set generated from the original one to train the binary classifier.
There are different schemes to transform the multiple class problem into several binary problems, but all of them can be generalized by the Error-Correcting Output-Codes (ECOC) method [10]. In this method, instead of providing each classifier with a set of inputs and its corresponding output, the output is transformed by using a matrix M of dimension l × c, whose columns correspond to the classes (c) and whose rows to the classifiers (l). There are four different schemes (a small coding-matrix sketch follows the list):
• Random: it forms l different classifiers. Each classifier randomly selects a subset of classes as the positive class and another, mutually exclusive, subset as the negative class.
• One-versus-rest: it transforms a problem with c classes into c binary problems, i.e., it uses one classifier for each class (l = c). The selected class provides the positive samples and the remaining classes the negative ones.
• One-versus-one: it generates a classifier for each pair of classes, i.e., each classifier compares two different classes. Therefore, there are l = c(c-1)/2 classifiers.
• Exhaustive: it generates one classifier for each possible combination of classes. Meaningless combinations, such as considering all classes as positive, are not taken into account. Notice that this approach includes the one-versus-rest and one-versus-one schemes.
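To make the coding step concrete, the following minimal sketch (not taken from the paper; plain Python/NumPy, with +1, -1 and 0 standing for the +, - and * entries) builds the coding matrix M for the one-vs-rest and one-vs-one schemes described above.

# Illustrative sketch: ECOC coding matrices M (l classifiers x c classes)
# for the one-vs-rest and one-vs-one schemes. +1 = positive, -1 = negative,
# 0 = ignored (*).
from itertools import combinations
import numpy as np

def one_vs_rest_matrix(c):
    # One classifier per class: that class is positive, all others negative.
    M = -np.ones((c, c), dtype=int)
    np.fill_diagonal(M, 1)
    return M

def one_vs_one_matrix(c):
    # One classifier per pair of classes: l = c*(c-1)/2 rows.
    pairs = list(combinations(range(c), 2))
    M = np.zeros((len(pairs), c), dtype=int)
    for row, (a, b) in enumerate(pairs):
        M[row, a], M[row, b] = 1, -1
    return M

print(one_vs_rest_matrix(3))   # 3 classifiers (rows)
print(one_vs_one_matrix(4))    # 6 classifiers (rows)

For c = 10 classes, one-vs-rest yields 10 rows while one-vs-one yields 45, which illustrates why the exhaustive scheme quickly becomes impractical.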
Transforming the multiple class problem into several binary problems is a coding technique that requires a counterpart, i.e., a decoding technique. This decoding technique should integrate the information provided by all the classifiers into a unique solution, that is, the sample is finally classified into only one class. The most common decoding techniques are the Hamming distance and loss-based functions [8]; the subsequent paragraphs briefly explain both (a small decoding sketch follows their description). An illustrative example that shows how different decoding techniques can lead to different results can be consulted in [8].
Consider a problem with c classes that has been codified using a matrix M of dimension l × c, with l classifiers. Each class i (i = 1, ..., c) has one of three possible values for each classifier j (j = 1, ..., l): positive (+), negative (-) or ignored (*). Then, the pattern of class i is a vector of length l where each position j denotes the value of class i for classifier j. For example, considering the one-vs-rest scheme, the pattern for class 1 would be {+, -, ..., -, -}, with the number of negative entries equal to l - 1 (there are no ignored classes in this scheme). Once the classifiers are trained and a new pattern is provided to them, each classifier returns a solution, forming an "objective" pattern. Then, this "objective" pattern is compared to each class pattern. The class whose "distance" to the "objective" pattern is smallest is selected as the output class. There are different ways to measure this "distance":
• The Hamming distance between two vectors of equal length is the number of positions at which the corresponding symbols are different. Put another way, it measures the minimum number of substitutions required to change one vector into the other, or the number of errors that transformed one vector into the other. In this case, the differences are weighted to assign a higher "distance" value to those positions that differ in two values ({+, -}) than to those that only differ in one ({+, *} or {-, *}).
• The previous distance measure does not consider the actual values returned by the classifiers, causing a loss of information. The loss-based function takes this information into consideration and is adapted to each classifier. In this work, due to the classifiers employed, the logistic regression loss was chosen, defined as L(z) = log(1 + exp(-2z)), where z is equal to the product of each position of the "objective" pattern and the same position of the class pattern. The value L(z) is computed for each position of a class pattern and all these values are added, yielding the "distance" for the class. Notice that the "objective" patterns are of the form {0.90, 0.30, ..., -0.40}, i.e., they reflect the values that the classifiers provided.
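The decoding sketch below illustrates the two distances; it is not the authors' implementation. It assumes each classifier outputs the probability p of its positive class, mapped to a signed score 2p - 1 for the loss-based distance and thresholded at 0.5 for the Hamming-style distance; the 0.5 cost for ignored positions is a common convention. The paper additionally uses the threshold to mark outputs as ignored and to weight the loss-based values; those refinements are omitted here.

# Illustrative decoding sketch for the weighted Hamming and loss-based distances.
import numpy as np

def hamming_distance(scores, codeword, threshold=0.5):
    """Weighted Hamming-style distance: a {+, -} disagreement costs 1,
    any comparison against an ignored (*) position costs 0.5."""
    d = 0.0
    for p, m in zip(scores, codeword):
        if m == 0:                               # ignored position (*)
            d += 0.5
        else:
            pred = 1 if p >= threshold else -1   # threshold the probability
            d += 1.0 if pred != m else 0.0
    return d

def loss_based_distance(scores, codeword):
    """Loss-based distance with the logistic loss L(z) = log(1 + exp(-2z))."""
    cw = np.asarray(codeword, dtype=float)
    z = (2.0 * np.asarray(scores, dtype=float) - 1.0) * cw
    return float(np.sum(np.log1p(np.exp(-2.0 * z[cw != 0]))))

def decode(scores, M, distance):
    """M is the l x c coding matrix; each column is a class pattern."""
    dists = [distance(scores, M[:, k]) for k in range(M.shape[1])]
    return int(np.argmin(dists))                 # class with the closest pattern

# Example: M = one_vs_one_matrix(3); decode([0.9, 0.3, 0.6], M, hamming_distance)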
Another problem of the multiple binary classifiers approach is the need for a sufficient number of samples in the training set. This aspect is critical in those cases in which there is a great difference between the number of samples of one class and the number of samples of the others. In those cases, the generalization capacity of the learning algorithm can be seriously affected, as it could end up ignoring the samples of the minority class rather than trying to learn it. In order to mitigate this problem, there exist different alternatives that consist of downsizing the majority class (undersampling), upsizing the minority class (oversampling) or altering the relative costs of misclassifying the small and the large classes [11]. In this work the oversampling strategy was adopted, because the unbalanced data sets employed do not have many samples available. Thus, some samples of the minority class were randomly replicated so as to obtain the same number of samples as the majority class.
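A minimal sketch of random oversampling follows; the paper only states that minority-class samples are randomly replicated until they match the majority class, so the exact sampling details below are an assumption.

# Random oversampling sketch: replicate minority-class samples until every
# class reaches the size of the majority class.
import numpy as np

def random_oversample(X, y, rng=None):
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [X], [y]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            extra = rng.choice(idx, size=target - count, replace=True)
            X_parts.append(X[extra])
            y_parts.append(y[extra])
    return np.concatenate(X_parts), np.concatenate(y_parts)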
III. THE METHODOLOGY
In this paper a combination method for multiclass classification problems has been tested. The method is divided into three steps (a toy pipeline sketch follows the list):
• First, a discretization method is applied over the input data, with the aim of solving problems of unbalanced values and of preparing the attributes of the sample to be processed by the feature selection algorithm of the next step. Several discretizers have been chosen to test their influence on the classification problem, specifically EWD (Equal Width Discretization), EFD (Equal Frequency Discretization), PKID (Proportional K-Interval Discretization) [12] and EM (Entropy Minimization) [13].
• After discretization, feature selection is carried out using filters. Two different methods based on different metrics were tested: the consistency-based filter (CBF) [14] and correlation-based feature selection (CFS) [15].
• Finally, a classifier is applied. In our case, C4.5 [16] and naïve Bayes [17] were selected, because both can be used for the direct multiclass approach. Besides, their use of computer resources is affordable, an important factor in our study due to the high dimensionality of some of the data sets employed.
Fig. 1. A scheme of the several approaches tested in this work.
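The following sketch mirrors the three-step pipeline with scikit-learn components; the paper itself relies on Weka/Matlab implementations of EWD/EFD/PKID/EM, CFS/CBF and C4.5/naïve Bayes, so the discretizer, filter and classifier below are stand-ins rather than the exact methods used.

# Rough analogue of the discretize -> filter -> classify combination method.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import KBinsDiscretizer                       # 'uniform' ~ EWD, 'quantile' ~ EFD
from sklearn.feature_selection import SelectKBest, mutual_info_classif   # stand-in for CFS/CBF
from sklearn.tree import DecisionTreeClassifier                          # stand-in for C4.5

pipeline = Pipeline([
    ("discretize", KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="uniform")),
    ("filter", SelectKBest(mutual_info_classif, k=5)),
    ("classify", DecisionTreeClassifier()),
])
# Usage: pipeline.fit(X_train, y_train); pipeline.predict(X_test)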
The combination method above has been used over the two different general approaches that can be used for multiclass classification problems:
• A multiclass approach. This is feasible for both of the classification methods selected, C4.5 and naïve Bayes.
• A multiple binary classification approach. In this work the one-versus-rest and one-versus-one approaches have been chosen. The random scheme is not expected to return good performance results, whereas the exhaustive scheme is very computer-resource demanding due to its requirement of training a high number of classifiers.
So, one-versus-rest and one-versus-one are the coding techniques adopted. The rest of the section concerns the decoding techniques used. Initially, it is necessary to explain that the results returned by the classifiers are probability values. For example, a possible output value of a classifier could be 0.9, which means that a pattern should be assigned to the positive class with a high probability. An opposite value could be 0.4, which means that this pattern should be considered negative with reservations. The decoding technique used for the one-versus-rest scheme consists of assigning as output class the class with the highest probability value. Notice that those classifiers that return the negative class ("rest" class) as the "winning" class are ignored. For the one-versus-one scheme, the Hamming and loss-based distances (see Section II) were used as decoding techniques. A threshold was used for computing both distances: it denotes whether a probability value should be considered positive, negative or ignored for the Hamming distance, and it weights the probability value for the loss-based function. Apart from these distances, we developed and applied two more "ad-hoc" distance measurements (a sketch of both follows the list):
• Sum is a union method based on the probability assigned by the binary classifier to the "winning" class for each sample. Therefore, instead of calculating distances to determine the class from the l different results, the accumulated probability sum of each class is computed. Then, the desired output is the one with the highest value. Notice that this measure is similar to the one used for the one-versus-rest scheme.
• Sum with threshold is a method that modifies the previous one and takes into consideration the fact that test patterns include "ignored" classes, i.e., classes not used for the learning of the classifier. Then, this technique only accumulates those probabilities that are over an established threshold, to guarantee that only clearly winning classes are counted.
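The sketch below illustrates the two ad-hoc union methods, assuming that classifier j, trained on the pair (a_j, b_j), returns the probability that a sample belongs to a_j; the exact bookkeeping in the paper may differ slightly.

# Sum and sum-with-threshold union methods for the one-vs-one scheme.
import numpy as np

def sum_union(pairs, probs, n_classes, threshold=None):
    """pairs[j] = (a_j, b_j); probs[j] = P(class a_j) from classifier j."""
    score = np.zeros(n_classes)
    for (a, b), p in zip(pairs, probs):
        winner, win_prob = (a, p) if p >= 0.5 else (b, 1.0 - p)
        # With a threshold, only clearly winning classes contribute.
        if threshold is None or win_prob >= threshold:
            score[winner] += win_prob
    return int(np.argmax(score))

# Plain sum:          sum_union(pairs, probs, c)
# Sum with threshold: sum_union(pairs, probs, c, threshold=0.7)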
As we have seen, there are two multiple class classification approaches. In the first one, called "multiclass", a data set with c classes is discretized, filtered and classified, and no union method is required since the prediction results are directly obtained. On the other hand, when the multiple binary classification approach is chosen, the problem turns into several binary problems, depending on the scheme adopted, one-vs-one or one-vs-rest. Each classifier requires the previous steps of discretization and filtering and, besides, after obtaining the outputs of the l binary classifiers, a union method is required in order to join the data and achieve a unique prediction. At this point, as explained above, four union methods are available for the one-vs-one scheme: sum, sum with threshold, Hamming and loss-based, and they provide us with the final result (see Figure 1).
Besides, for all those approaches, in order to compare the performance with and without filtering and see the benefits related to feature selection, we can eliminate the filter step (dashed in the center of Figure 1). Therefore, there are 16 different combinations (4 discretizers × 2 filters × 2 classifiers) when using feature selection, plus 8 more without using it (4 discretizers × 2 classifiers). Each combination is applied for each approach, i.e., multiclass, one-vs-rest and one-vs-one with 4 different decoding techniques. The number of results achieved is enormous even for only one data set and cannot be included in a paper, so the next section will try to summarize them.
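The combination count is easy to verify; a tiny enumeration (illustrative only) follows.

# 4 discretizers x 2 filters x 2 classifiers (with FS) plus 4 x 2 (without FS).
from itertools import product

discretizers = ["EWD", "EFD", "PKID", "EM"]
filters = ["CFS", "CBF"]
classifiers = ["C4.5", "NB"]

with_fs = list(product(discretizers, filters, classifiers))   # 16 combinations
without_fs = list(product(discretizers, classifiers))         # 8 combinations
print(len(with_fs) + len(without_fs))                          # 24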
IV. THE EXPERIMENTAL RESULTS
The direct multiclass approach and the two multiple binary approaches, with and without feature selection, and with the different discretizers, filters and classifiers, were tested using Weka [18] and MatLab [19]. The 21 different data sets shown in Table I were selected [20], attempting to include several aspects such as different numbers of classes, different numbers of samples, different ratios of features to samples, unbalanced data sets, etc. Specifically, there are 4 data sets with clearly unbalanced classes: Glass, Connect-4, Dermatology and Thyroid. The oversampling technique was applied to the Glass data set, which has 3 of 6 classes with a very reduced number of samples (lower than 20), making the adequate learning of the classifiers extremely difficult. A 10-fold cross-validation was used to obtain results in percentage of correct classification and in features selected by the methods that use filters. For each data set, in order to check if the several methods used exhibit statistically significant differences in performance, a multiple comparison statistical test was employed, using ANOVA (ANalysis Of VAriance) [21] if the normality hypothesis is assumed, or an alternative non-parametric procedure, the Kruskal-Wallis test [22], otherwise. The prefix MFeat employed in some data sets of Table I means Multi-feature, whereas the prefix MLL in the Leukemia data set denotes the type of leukemia being tackled. Both prefixes will be ignored in the rest of this paper.
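As an illustration of the per-data-set statistical comparison, the sketch below contrasts the fold accuracies of two approaches with both tests using SciPy; the fold values are invented for the example and are not results from the paper.

# Per-data-set comparison of 10-fold CV accuracies: one-way ANOVA if normality
# is assumed, Kruskal-Wallis otherwise.
from scipy import stats

acc_multiclass = [95.1, 94.8, 95.5, 94.9, 95.2, 95.0, 94.7, 95.3, 95.1, 94.9]
acc_one_vs_one = [96.0, 95.8, 96.2, 95.9, 96.1, 95.7, 96.0, 96.3, 95.9, 96.1]

f_stat, p_anova = stats.f_oneway(acc_multiclass, acc_one_vs_one)
h_stat, p_kw = stats.kruskal(acc_multiclass, acc_one_vs_one)
print(p_anova, p_kw)   # small p-values suggest a significant difference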
TABLE I
THE DATA SETS EMPLOYED IN THE EXPERIMENTAL STUDY

Data set            Classes   Samples   Features
Iris                3         150       4
Vehicle             4         846       18
Wine                3         178       13
Waveform            3         5000      21
Segment             7         2310      19
Glass               6         214       10
Connect 4           3         67557     42
Dermatology         6         366       34
Vowel               11        990       13
KDD SC              6         600       60
Splice              3         3190      61
Thyroid             3         3772      21
Optdigits           10        3823      64
Pendigits           10        7494      16
Landsat             6         4435      36
MFeat-Fourier       10        2000      76
MFeat-Factor        10        2000      216
MFeat-Karhounen     10        2000      64
MFeat-Pixel         10        2000      240
MFeat-Zernike       10        2000      47
MLL-Leukemia        3         57        12582
For each data set, 16 different combinations with feature selection and 8 without filtering were used for each approach considered. Then, there are 24 performance results for both the multiclass and one-vs-rest approaches and, moreover, 96 (24 × 4 union techniques) one-vs-one results. Trying to show all results becomes intractable, so the best results for each scheme with and without feature selection are shown. This gives 12 different results for each data set. As an example, in Figure 2 the results obtained for the Thyroid data set are shown. As can be seen, the best performance with the smallest set of features is obtained by the one-vs-one approach using the sum method to make the union of the multiple binary classifiers and the combination EM + consistency-based filter + C4.5. The precision obtained was 99.52 ± 0.34 using 5 features (the number of attributes used by this approach is 23.81% of the total).
Fig. 2. Best results obtained for the Thyroid data set. The percentage of features selected is represented on the left y-axis, while the accuracy obtained by each approach is shown on the right y-axis. On the x-axis, the best results for each approach without and with feature selection are displayed.
In any case, Figure 2 shows that all approaches in this data set obtained good results when using feature selection. Both the multiclass and one-vs-one approaches (using sum and sum with threshold as union techniques) achieved better accuracy values using feature selection than not using it. On the other hand, two of the one-vs-one versions (using Hamming and loss-based) and the one-vs-rest approach obtained lower accuracy values when using feature selection. This latter case is the one with the lowest accuracy value when feature selection is applied (99.23 ± 0.36), a slightly lower value than using the same approach without feature selection (99.44 ± 0.34). However, the difference is not statistically significant and the reduction in the number of features used is important (21 vs 8).
Figure 3 shows the result of applying a multiple comparison function (0.05 significance level) between the numbers of features selected by the multiclass, one-vs-rest and one-vs-one sum approaches (for the sake of completeness, the set containing all features was also included) for the Thyroid data set. The graph displays each group mean represented by a symbol and an interval around the symbol. Two means are significantly different if their intervals are disjoint, and are not significantly different if their intervals overlap. Therefore, it can be seen that there is a statistically significant difference between the approaches with and without feature selection. Notice that these differences are not as pronounced as the bars in Figure 2 may suggest, because the x-axis denotes average group ranks. Ranks are found by ordering the data from smallest to largest across all groups and taking the numeric index of this ordering; the rank for a tied observation is equal to the average rank of all observations tied with it. So, in Figure 3, 40 different values were ranked (10 values, one per fold, for each of the 4 approaches considered). Clearly, there are 10 tied values at the last positions of this rank, one per fold of the approach with all features. On the other hand, the feature subset returned by the one-vs-one approach is most of the time at the top.
Fig. 3. Multiple comparison results using all features (ALL) and the features selected by the multiclass (MC), one-vs-rest (1R) and one-vs-one (11) approaches for the Thyroid data set.
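The ranking just described is the standard average-rank treatment of ties; the toy example below (made-up numbers, not the paper's measurements) shows how pooled values are ranked.

# Tied observations receive the average of the ranks they span.
from scipy.stats import rankdata

pooled = [5, 5, 5, 8, 8, 12, 21, 21, 21, 21]   # e.g. features kept per fold
print(rankdata(pooled))   # [2. 2. 2. 4.5 4.5 6. 8.5 8.5 8.5 8.5]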
TABLE II
BEST RESULTS FOR EACH DATA SET, WITH AND WITHOUT FEATURE SELECTION (FS AND WITHOUT FS). FOR EACH APPROACH, THE MEAN AND STANDARD DEVIATION OF THE ACCURACY OBTAINED FROM THE 10-FOLD CV (ACC) AND THE RANKING OCCUPIED BY THE APPROACH (RK) ARE GIVEN; WHEN USING FS, ALSO THE PERCENTAGE OF FEATURES SELECTED (%FEAT). OF THE ONE-VS-ONE APPROACHES, ONLY THE RESULTS USING THE SUM ARE DISPLAYED, BECAUSE THEY TEND TO SHOW THE BEST PERFORMANCES. AT THE END OF EACH BLOCK THE AVERAGE ACCURACY AND AVERAGE RANKING ARE SHOWN.

Multiclass
Data set      Acc (Without FS)  Rk    Acc (FS)       Rk    %Feat
Iris          95.99±3.44        6     97.99±3.22     1     50.00
Vehicle       69.73±6.02        8     71.62±4.33     2     38.89
Wine          98.88±2.34        2     98.33±2.68     3     84.62
Waveform      80.98±1.72        10    80.86±1.82     11    71.42
Segment       91.38±1.99        12    92.42±2.07     9     63.15
Glass         73.83±8.56        9     70.15±9.06     11    77.77
Connect 4     80.94±0.74        2     81.16±0.51     1     83.33
Dermat.       98.35±2.31        4     98.91±1.89     1     67.64
Vowel         74.04±5.15        2     74.64±3.67     1     84.61
KDD SC        96.99±2.19        9     96.16±2.83     10    91.66
Splice        95.74±0.94        9     96.08±1.57     6     55.74
Thyroid       99.28±0.60        10    99.44±0.56     3     28.57
Optdigits     92.67±1.05        8     92.62±0.69     10    57.81
Pendigits     89.61±0.93        11    89.32±0.93     12    62.50
Landsat       85.27±1.40        8     84.75±1.45     12    88.88
Karhounen     92.50±1.90        4     93.10±1.64     1     92.18
Factor        93.30±1.70        6     95.90±1.32     1     49.53
Fourier       77.40±2.75        9     78.60±1.76     3     69.73
Pixel         93.50±1.24        4     93.30±1.18     5     57.50
Zernike       72.05±1.97        11    72.35±3.68     9     68.08
Leukemia      91.33±12.09       12    96.67±7.03     1     4.79
Accuracy      87.80                   88.30
Ranking       7.43                    5.48

One-vs-rest
Data set      Acc (Without FS)  Rk    Acc (FS)       Rk    %Feat
Iris          96.00±6.44        3     96.65±3.51     2     75.00
Vehicle       72.11±4.83        1     67.60±6.12     12    100
Wine          98.3±2.74         6     97.71±4.83     10    100
Waveform      79.78±1.78        12    82.36±1.76     6     95.23
Segment       92.12±1.96        11    92.42±2.25     9     94.73
Glass         72.01±8.20        10    68.83±8.51     12    100
Connect 4     80.61±0.50        3     80.61±0.51     3     97.62
Dermat.       97.79±2.87        11    97.00±2.36     12    76.47
Vowel         54.04±5.37        4     64.44±4.82     3     84.61
KDD SC        95.00±1.76        11    82.33±19.11    12    96.66
Splice        96.14±1.00        3     95.89±1.63     8     55.73
Thyroid       99.23±0.34        11    99.44±0.36     3     38.09
Optdigits     92.62±1.91        10    91.68±1.55     12    85.93
Pendigits     93.50±1.21        8     93.40±1.10     9     100
Landsat       84.82±2.14        11    84.98±1.57     9     100
Karhounen     92.50±1.56        4     90.95±1.78     8     100
Factor        88.70±2.2         8     92.90±2.52     7     84.72
Fourier       77.10±2.54        10    76.25±2.40     12    98.68
Pixel         91.70±1.33        11    91.15±1.39     12    90.00
Zernike       71.65±2.40        12    72.15±3.44     10    95.74
Leukemia      96.67±7.02        1     96.67±7.03     1     0.86
Accuracy      86.78                   86.45
Ranking       7.67                    8.19

One-vs-one (sum)
Data set      Acc (Without FS)  Rk    Acc (FS)       Rk    %Feat
Iris          94.00±7.34        9     96.00±4.66     3     50.00
Vehicle       71.27±4.70        3     71.05±6.23     4     100
Wine          97.77±3.88        7     97.19±2.96     11    53.85
Waveform      81.02±1.72        8     82.64±1.50     4     90.47
Segment       94.31±1.61        5     94.37±1.54     4     84.21
Glass         85.81±4.58        3     85.34±8.51     4     100
Connect 4     80.38±0.56        5     80.26±0.55     8     97.62
Dermat.       98.07±2.28        8     98.34±1.93     7     94.11
Vowel         -                 -     -              -     -
KDD SC        97.00±2.19        6     98.33±1.11     1     95.00
Splice        96.21±0.82        1     96.14±1.12     3     63.93
Thyroid       99.52±0.44        1     99.44±0.34     3     23.81
Optdigits     93.83±1.94        5     94.19±1.04     3     84.37
Pendigits     95.78±0.65        1     94.15±1.40     6     100
Landsat       86.89±1.21        3     87.10±1.36     1     100
Karhounen     89.75±2.58        10    92.50±2.27     4     100
Factor        88.00±2.51        10    95.65±1.85     2     99.07
Fourier       78.40±2.36        4     78.90±3.71     1     88.15
Pixel         92.95±1.42        7     94.15±1.05     2     97.91
Zernike       75.20±2.62        1     74.90±4.21     2     100
Leukemia      93.33±8.61        5     93.33±11.65    5     0.70
Accuracy      89.47                   90.20
Ranking       5.10                    3.90
The important reduction in the number of necessary attributes achieved by the feature selection methods makes their use worthwhile, especially in those cases in which the elimination of features can contribute to a better explanation of clinical situations, as is the case for some of the data sets used in this work (e.g., Leukemia, Thyroid). So, another interesting example is the Leukemia data set, which has a much higher number of features (12582) than samples (57). Besides, it is representative of a type of data set that is receiving considerable attention in research, the microarray data sets. In this type of data set, feature selection can have a high impact, because eliminating unnecessary features can help biologists to better explain the behavior of the genes involved in cancer research. The results obtained for the Leukemia data set can be seen in Figure 4. Again, better results are obtained by the feature selection versions of the methods, and the best is that of the one-vs-rest approach. Although the accuracy is the same for multiclass with feature selection and for one-vs-rest with and without feature selection, the drastic reduction in the number of features needed (108 out of 12582) makes it worthwhile to use EWD+CFS+NB.
Fig. 4. Best results obtained for the Leukemia data set. The percentage of features selected is represented on the left y-axis, while the accuracy obtained by each approach is shown on the right y-axis. On the x-axis, the best results for each approach without and with feature selection are displayed.
A. Analysis of multiclass versus multiple binary approaches
In order to be able to establish a general comparative picture between all methods and both alternatives, Table II shows the results obtained for all 21 data sets by the multiclass, one-vs-rest and one-vs-one approaches with (FS) and without (Without FS) feature selection. For the latter approach, only the results achieved with the sum as union method are shown, because it is the one obtaining the best results on average. A column labeled Rk has been added with the idea of detecting which approximation is the best on average. To this end, the 12 different approaches are listed in order of the percentage obtained for each data set. Subsequently, the average ranking position is computed for each of the 12 approaches, so as to compare the different methods. This value is shown in the last row of the table. Also, the average accuracy is displayed for each method in the previous row. It is necessary to note that the ranks are computed over all 12 approaches, although only the best 6 of them are shown in the table. From the results in Table II it can be concluded that, although the multiclass approach appears to behave better in the number of features selected, with accuracy similar to that of the alternative multiple binary classifiers, the approach with the best average results, in both ranking and accuracy, is one-vs-one using sum and feature selection. The ranking value of the latter, 3.90, is clearly separated from the rest, which obtain values higher than 5. The same behavior can be observed for the accuracy, where the best value is again obtained by one-vs-one using sum and feature selection, although in this case the difference with the other methods is smaller. Notice that the one-vs-one scheme was not carried out for the Vowel data set, because its 10 different classes would imply the training of 45 classifiers in this approximation. A deeper analysis of Table I suggests that the one-vs-one approach clearly surpasses the others when there exists a high ratio of samples to features and enough samples per class. In general, the one-vs-rest approximation exhibits poor behavior, even when using feature selection, but surprisingly it achieves the best results when dealing with the Leukemia data set, which has a very low ratio of samples to features. Besides, the results obtained by this approach worsen as the number of classes increases. This is due to the consequent unbalanced division into positive and negative classes for each of its classifiers.
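The Rk column can be reproduced with a few lines of code: rank the approaches per data set by accuracy and average the ranks. The sketch below uses a toy accuracy matrix, not the paper's results.

# Per-data-set ranking of approaches (rank 1 = best) averaged over data sets.
import numpy as np
from scipy.stats import rankdata

acc = np.array([[97.9, 96.6, 96.0],      # rows = data sets, columns = approaches
                [71.6, 67.6, 71.0],
                [99.4, 99.4, 99.4]])

ranks = np.vstack([rankdata(-row) for row in acc])   # negate: higher accuracy = rank 1
print(ranks.mean(axis=0))                            # average rank per approach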
The good results achieved by the multiclass approach were unexpected, and a further analysis was done in order to gain more insight. It is important to remember that different combinations of discretizer, filter and classifier were tried for each approach, and the best combination was selected for each one. An example showing all the combinations for the Factor data set is depicted in Table III. The first block of rows of this table shows the results achieved when using the C4.5 classifier, while the second block is devoted to the naïve Bayes classifier. The best accuracy obtained for each combination is marked with an asterisk. The last row indicates the average accuracy achieved by each approach. It can be seen that the best overall accuracy is obtained by the combination entropy minimization discretizer + correlation-based filter + C4.5 classifier using the multiclass approach. However, the last row shows that the one-vs-one scheme achieves the best result on average; moreover, this scheme obtains the best accuracy in 12 of the 16 combinations, whereas the multiclass approach only gets the best values in 3 of them.
TABLE III
RESULTS IN ACCURACY OBTAINED FOR EACH OF THE APPROACHES TESTED FOR THE Factor DATA SET. THE AVERAGE ACCURACY OBTAINED FOR THE MULTICLASS AND BOTH MULTIPLE BINARY CLASS APPROACHES IS SHOWN IN THE LAST ROW OF THE TABLE. THE BEST ACCURACY FOR EACH COMBINATION IS MARKED WITH *.

C4.5 classifier
Combin.      Multiclass       1vsRest          1vs1 Sum
EWD+CFS      95.00 ± 1.81 *   92.15 ± 2.30     94.65 ± 2.07
EFD+CFS      94.65 ± 1.47 *   91.50 ± 1.83     94.65 ± 1.55
PKID+CFS     92.50 ± 2.12     90.85 ± 1.93     94.25 ± 1.21 *
EM+CFS       95.90 ± 1.33 *   92.90 ± 2.53     95.65 ± 1.86
EWD+CBF      84.90 ± 3.00     90.45 ± 2.73     92.35 ± 1.29 *
EFD+CBF      82.10 ± 1.90     88.65 ± 1.68     90.90 ± 1.88 *
PKID+CBF     69.75 ± 3.84     84.35 ± 1.89     90.95 ± 2.78 *
EM+CBF       84.90 ± 3.34     92.55 ± 1.74 *   92.15 ± 1.97

NB classifier
Combin.      Multiclass       1vsRest          1vs1 Sum
EWD+CFS      79.20 ± 1.89     84.60 ± 2.85     89.75 ± 2.53 *
EFD+CFS      74.10 ± 2.53     85.45 ± 1.76     88.95 ± 2.06 *
PKID+CFS     52.75 ± 3.07     71.00 ± 4.60     88.05 ± 2.31 *
EM+CFS       80.65 ± 3.07     87.75 ± 3.16     88.95 ± 2.60 *
EWD+CBF      78.60 ± 3.70     85.60 ± 2.39     89.20 ± 1.89 *
EFD+CBF      76.05 ± 3.13     85.00 ± 2.20     88.75 ± 2.23 *
PKID+CBF     53.95 ± 2.99     71.50 ± 2.59     89.40 ± 2.13 *
EM+CBF       80.80 ± 3.15     87.30 ± 2.39     89.55 ± 2.24 *
Average      79.74 ± 2.65     86.35 ± 2.41     91.13 ± 2.04
TABLE IV
BEST ACCURACY AND AVERAGE ACCURACY OBTAINED FOR THE MULTICLASS AND BOTH MULTIPLE BINARY CLASS APPROACHES FOR THE Dermatology AND Karhounen DATA SETS.

Data set               Multiclass      1vsRest         1vs1 Sum
Dermat.     best       98.91 ± 1.90    96.70 ± 3.61    98.08 ± 3.43
            average    94.34 ± 4.06    94.66 ± 4.16    96.62 ± 3.52
Karhounen   best       93.10 ± 1.64    89.60 ± 2.20    91.60 ± 1.43
            average    71.72 ± 3.16    73.73 ± 2.85    87.12 ± 2.33
Table IV shows similar but summarized results for the Dermatology and Karhounen data sets. For each data set, the first row shows the values for each approach when the best accuracy is achieved (the EFD+CFS+NB combination for the Dermatology data set and the EWD+CFS+NB combination for the Karhounen data set). The second row indicates the average accuracy. Again, the multiclass approach gets the best performance result for one combination; however, it is clearly surpassed by the one-vs-one scheme when focusing on averages.
B. Best discretizer, filter and classifier combination
In this work, 16 different combinations of discretizer, filter and classifier were tried over 21 data sets. Moreover, for reasons of completeness, the filtering step was considered optional. Table V attempts to determine which combination gets the best accuracy values. If two combinations obtain identical accuracy, both are counted in this table, and so all values in Table V add up to 24, not to 21. Several conclusions can be extracted from this table attending to different issues:
• Feature selection (with or without): the number of combinations using feature selection and reaching the best performance values is 17 out of 24, which denotes the adequacy of its use.
• Discretizer: the entropy minimization discretizer forms part of the "best" combination on 13 occasions. On the other hand, PKID is only included in a "best" combination using C4.5; although it is suited to the naïve Bayes classifier, it is suboptimal when learning from training data of small size [12].
• Filter: CFS seems to be a good filter combined with the NB classifier, using either the EWD or EM discretizers. However, this filter does not achieve the best results when it is applied together with the C4.5 classifier; in this case, the consistency-based filter is preferred.
• Classifier: naïve Bayes obtains better results in more data sets than C4.5. Specifically, naïve Bayes gets the best values for 12 different data sets, while C4.5 only for 8 of them (both classifiers achieve the same accuracy for the Dermatology data set).
TABLE V
NUMBER OF TIMES A COMBINATION GETS THE BEST RESULTS. W, F, K AND M STAND FOR THE EWD, EFD, PKID AND EM DISCRETIZERS, RESPECTIVELY.

                          CFS            CBF            Without FS
                        W  F  K  M     W  F  K  M     W  F  K  M
Naïve Bayes classifier  4  1  0  4     1  0  0  2     1  0  0  2
C4.5 classifier         0  0  1  0     2  0  0  2     1  0  0  3
C. Comparative study with other methods
In this section we will try to compare our results with those existing in the literature. Notice that it is not a "fair" study, in the sense that the validation methodologies may differ from one study to another. As can be seen in Table VI, we obtained better results than those achieved by other methods for 6 data sets. This is a remarkable fact, because we are comparing our method with a group of methods that in some cases were specially designed for a specific problem, such as the Parzen method for image data sets [28]. It is important to notice the difference in accuracy for the Glass data set (up to a 17% improvement). It is also worth mentioning the reduced set of features used for the Thyroid data set while there is a slight increase in accuracy. Analogously, for Connect 4, accuracy is improved while the number of features is reduced by 16.7%.
TABLE VI
DATA SETS FOR WHICH THE RESULTS ACHIEVED IMPROVE ON THE METHODS EXISTING IN THE BIBLIOGRAPHY. ACC STANDS FOR ACCURACY AND %FEAT FOR THE PERCENTAGE OF FEATURES EMPLOYED.

            Best value in bibliography        Best value achieved
Data        Acc     Method                    Acc     %Feat
Wine        97.8    NB-Back [26]              98.9    100
Glass       70.9    C4.5+EECOCs [23]          87.7    100
Connect4    79.2    C4.5 [24]                 81.2    83.33
Dermatol.   97.5    NB [25]                   98.9    100
Splice      95.4    NB [25]                   96.2    100
Thyroid     99.4    C4.5+DMIFS [25]           99.5    23.81
TABLE VII
DATA SETS FOR WHICH THE RESULTS ACHIEVED DO NOT IMPROVE ON THE METHODS EXISTING IN THE BIBLIOGRAPHY. ACC STANDS FOR ACCURACY AND %FEAT FOR THE PERCENTAGE OF FEATURES EMPLOYED. LR MEANS LOGISTIC REGRESSION.

            Best value in bibliography        Best value achieved
Data        Acc     Method                    Acc     %Feat
Iris        100     RNA [25]                  98.0    50
Vehicle     75.8    AFN-FS [26]               72.1    100
Waveform    86.6    LR+EECOCs [23]            84.1    90.47
Segment     97.5    C4.5+EECOCs [23]          94.8    100
Vowel       93.2    C4.5+EECOCs [23]          74.6    84.61
KDD SC      98.4    Naïve Bayes [25]          98.3    95.00
Optdigits   98.1    C4.5+EECOCs [23]          94.3    84.67
Pendigits   99.1    C4.5+EECOCs [23]          95.8    100
Landsat     91.0    MLP+SCG [27]              87.1    100
Karhounen   96.3    Parzen [28]               93.1    92.18
Factor      96.6    Linear Bayes [28]         95.9    49.53
Fourier     82.9    Parzen [28]               78.9    88.15
Pixel       96.3    Linear Bayes [28]         94.3    97.91
Zernike     82.0    Parzen [28]               75.2    100
Leukemia    98.2    SVM+3NN [29]              96.7    0.86
Table VII reflects those data sets where our performance results did not surpass the existing ones. It is important to remark that some methods are highly computer-resource demanding, for example the EECOC method, that is, the exhaustive ECOC commented on in Section II. For some data sets, the performance accuracy obtained by the combination method proposed in this paper is lower than the other methods' accuracy; however, the reduction in the number of features is very significant; see, for example, the Factor or Leukemia data sets in Table VII. Notice that this latter data set has a very high ratio of features to samples, making the adequate learning of any classifier extremely difficult. Nevertheless, the combination employs only 0.86% of the features and obtains a good level of accuracy on the test data.
V. CONCLUSIONS
The main goal of this paper was to study the combination of discretizers and feature selection methods in multiple class problems. Four discretizers, two filters and two classifiers were taken into account, obtaining 16 different combinations that were tested over 21 data sets broadly used in the bibliography. Moreover, two approaches were considered to deal with multiple class problems: the first one consists of applying a classifier suitable for those problems; the second one divides the problem into several binary problems, and the one-vs-rest and one-vs-one schemes were used to generate this division. In the latter scheme, the results provided by each classifier were gathered using 4 decoding techniques, although only the results obtained by the best of them are shown in this paper. Therefore, 16 combinations were applied using three different approaches, one of them using 4 ways to join its results. Besides, for the sake of completeness, the same approaches were also tested without using the feature selection step, i.e., without filtering.
The experimental results support the hypothesis that using feature selection leads to better performance results than not using it. Moreover, it eliminates the associated cost of acquisition and storage of the discarded features. On the other hand, comparing the different approaches to deal with multiple class problems, the one-vs-one scheme obtains better accuracy results on average than the others, although using a higher number of features. This approach is also more computationally demanding than the others, because each sample must be tested by several classifiers, whose results are later unified to obtain the desired output. From the experimental results achieved, the one-vs-one scheme should be used when there are enough samples per class and features, while, on the contrary, one-vs-rest can be used with data sets with a large number of features and a reduced number of samples. Nevertheless, a deeper theoretical analysis needs to be done to support these hypotheses, trying to relate some properties of the data set (separability, number of classes, ratio of samples to features, etc.) with the adequacy of a given approach.
Regarding the numerous combinations checked, several of them exhibited good behavior, so it becomes difficult to select one. The entropy minimization discretizer seems to be more adequate than the rest of the discretizers, and the CFS filter is preferred when using the naïve Bayes classifier, while the consistency-based filter is the best when applying C4.5.
A comparative study was done to test the effectiveness of the combinations proposed compared to other existing methods. For six data sets, better performance results were obtained than those provided by other authors. As was shown in this study, the exhaustive methods for generating binary problems achieved good performance results, which suggests a future line of research. Another interesting line could be the application of more advanced classifiers, such as Support Vector Machines, that may obtain better performance results, also using them as part of the feature selection process. Finally, the combinations return very good results for the Leukemia data set, which has a very high number of features with a reduced set of samples; therefore, in a more exhaustive study, the combinations should be applied to this appealing type of problem.
REFERENCES
[1] T. Li, C. Zhang and M. Ogihara, A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression, Bioinformatics, vol. 20, no. 15, pp. 2429-2437, 2004.
[2] C. H. Yeang, S. Ramaswamy, P. Tamayo, S. Mukherjee, R. M. Rifkin, M. Angelo, M. Reich, E. Lander, J. Mesirov and T. Golub, Molecular classification of multiple tumor types, Bioinformatics, vol. 17, pp. 316-322, 2001.
[3] G. Madzarov, D. Gjorgjevikj and J. Chorbev, A multiclass SVM classifier using binary decision tree, Informatica, vol. 33, pp. 233-241, 2009.
[4] Y. Ivar Chang and S. Lin, Synergy of logistic regression and Support Vector Machines in multiple-class classification, in Proc. IDEAL, LNCS, vol. 3177, pp. 132-141, 2004.
[5] A. Golestani, K. Almadian, A. Amiri and M. JahedMotlagh, A novel adaptive-boost-based strategy for combining classifiers using diversity concept, 6th IEEE/ACIS Int. Conf. on Computer and Information Science (ICIS), 2007.
[6] R. Kohavi and G. H. John, Wrappers for feature subset selection, Artificial Intelligence Journal, Special issue on relevance, vol. 97, no. 1-2, pp. 273-324, 1997.
[7] I. Guyon, S. Gunn, M. Nikravesh and L. Zadeh, Feature Extraction. Foundations and Applications, Springer, 2006.
[8] E. L. Allwein, R. E. Schapire and Y. Singer, Reducing multiclass to binary: a unifying approach for margin classifiers, Journal of Machine Learning Research, vol. 1, pp. 113-141, 2001.
[9] G. Forman, An extensive empirical study of feature selection metrics for text classification, Journal of Machine Learning Research, pp. 1289-1305, 2003.
[10] T. G. Dietterich and G. Bakiri, Solving multiclass learning problems via error-correcting output codes, Journal of Artificial Intelligence Research, vol. 2, pp. 263-285, 1995.
[11] N. Japkowicz and S. Stephen, The class imbalance problem: a systematic study, Intelligent Data Analysis, vol. 6, no. 5, 2002.
[12] Y. Yang and G. I. Webb, Proportional k-interval discretization for naive-Bayes classifiers, in Proceedings of the 12th European Conference on Machine Learning, pp. 564-575, 2001.
[13] U. M. Fayyad and K. B. Irani, Multi-interval discretization of continuous-valued attributes for classification learning, in Proceedings of the 13th International Joint Conference on Artificial Intelligence, pp. 1022-1029, Morgan Kaufmann, 1993.
[14] M. Dash and H. Liu, Consistency-based search in feature selection, Artificial Intelligence, vol. 151, no. 1-2, pp. 155-176, 2003.
[15] M. A. Hall, Correlation-based Feature Selection for Machine Learning, PhD thesis, University of Waikato, Hamilton, New Zealand, 1999.
[16] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
[17] I. Rish, An empirical study of the naïve Bayes classifier, in Proceedings of the IJCAI-01 Workshop on Empirical Methods in Artificial Intelligence, vol. 335, 2001.
[18] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, San Francisco, 2005. http://www.cs.waikato.ac.nz/ml/weka/. Last access: February 2010.
[19] The Mathworks, Matlab tutorial, 1984. http://www.mathworks.com/academia/student_center/tutorials/. Last access: February 2010.
[20] A. Asuncion and D. J. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and Computer Sciences. http://mlearn.ics.uci.edu/MLRepository.html. Last access: February 2010.
[21] R. Fisher, Statistical Methods for Research Workers, Oliver and Boyd, 1925.
[22] W. H. Kruskal and W. A. Wallis, Use of ranks in one-criterion variance analysis, Journal of the American Statistical Association, vol. 47, no. 260, pp. 583-621, 1952.
[23] E. Frank and S. Kramer, Ensembles of nested dichotomies for multi-class problems, in Proceedings of the International Conference on Machine Learning, ACM Press, pp. 305-312, 2004.
[24] N. Kerdprasop and K. Kerdprasop, Data partitioning for incremental data mining, The 1st International Forum on Information and Computer Science (IFICT), Shizuoka University, Japan, pp. 114-118, 2003.
[25] H. Liu and H. Zhang, Feature selection with dynamic mutual information, Pattern Recognition, vol. 42, no. 7, pp. 1330-1339, 2009.
[26] N. Sánchez-Maroño, A. Alonso-Betanzos and R. M. Calvo, A wrapper method for feature selection in multiple classes datasets, in J. Cabestany et al. (Eds.), IWANN, Part I, LNCS 5517, pp. 456-463, 2009.
[27] http://www.mathworks.com/academia/student_center/tutorials/. Last access: February 2010.
[28] A. K. Jain, R. P. W. Duin and J. Mao, Statistical pattern recognition: a review, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, 2000.
[29] C. J. Alonso-González, Q. I. Moro, O. J. Prieto and M. Aránzazu Simón, Selecting few genes for micro-array gene expression classification, Actas Conferencia Española para la Inteligencia Artificial, pp. 21-31, 2009.