Information Sciences 520 (2020) 31–45
Ensembles of cost-diverse Bayesian neural learners for
imbalanced binary classification
Marcelino Lázaro a,∗, Francisco Herrera b, Aníbal R. Figueiras-Vidal a
a Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Av. Universidad 30, Leganés 28911, Madrid, Spain
b Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain
Article info
Article history:
Received 14 June 2018
Revised 23 May 2019
Accepted 22 December 2019
Available online 29 January 2020
Keywords:
Imbalanced classification
Ensembles
Bayes risk
Parzen windows
Abstract
Combining traditional diversity and re-balancing techniques serves to design effective ensembles for solving imbalanced classification problems. Exploring new diversification procedures and new re-balancing methods is therefore an attractive research subject that can provide even better performance. In this contribution, we propose to create ensembles of the recently introduced binary Bayesian classifiers, which show intrinsic re-balancing capacities, by means of a diversification mechanism based on applying a different cost policy to each ensemble learner, together with appropriate aggregation schemes. Experiments with an extensive number of representative imbalanced datasets, and their comparison with several selected high-performance classifiers, show that the proposed approach provides the best overall results.
© 2020 Elsevier Inc. All rights reserved.
1. Introduction
Imbalanced classification problems are relevant in the real world. Not only are the well-known cases (fraud, credit, diagnosis, intrusion, recognition, ...) frequent, but there are also many specialized applications that deal with this kind of task [1–3]. In most practical situations there is no statistical model, but only a finite register of observations and labels. Under imbalance conditions, i.e., when the class populations are very different, a discriminative machine cannot be designed following conventional procedures if detecting minority samples is important, and re-balancing methods must be applied. Since there is no room here even for a brief overview of these methods, we refer the interested reader to the tutorials [4–6] and the text [7].
In this paper, we propose to combine two methodologies to solve imbalanced binary classification problems. The first is a new form of building ensembles, based on imposing different classification costs to generate diversity. This is one more step in the direction that considers ensembles as an effective approach to solve imbalanced problems [8–11]. The second is to adopt the recently introduced Bayesian neural classifiers [12] as learners. These learners show intrinsic resistance to imbalance difficulties, because they are trained by minimizing the sampled Bayes risk and not a surrogate cost. On the other hand, such a formulation makes it possible to establish a direct connection between the classification costs and an estimate of the theoretical Receiver Operating Characteristic (ROC), providing a new perspective on that diversity and suggesting ways to aggregate the ensemble outputs. Our approach is very different from the most common approaches to build ensembles for
∗ Corresponding author at: Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Spain.
E-mail address: mlazaro@tsc.uc3m.es (M. Lázaro).
https://doi.org/10.1016/j.ins.2019.12.050
0020-0255/© 2020 Elsevier Inc. All rights reserved.
imbalanced data. In these methods, such as in [8–11], the diversity of the individual classifiers is obtained by modifying the
available training data, by means of resampling (over-sampling of the minority class or under-sampling of the majority class
[8,9], or a combination of both methodologies [10]) or switching (random modification of some labels of the majority class
[11]). Although these methodologies have shown that they can provide good results in many cases, they have the potential drawback of sampling techniques, which can modify the problem by reducing the influence of critical samples and/or
emphasizing unimportant instances [6]. The proposed approach does not modify the available training set. Diversity is introduced through the probabilistic definition of the relative costs in the errors of samples of different classes, using a Bayesian
formulation. The different cost definitions yield a set of problems that are similar enough to share structure, but different enough to induce diversity. Therefore, the combination of the solutions to this set of problems helps to improve the average performance and to reduce the variance. In the design of ensembles, typically all the classifiers provide the same decision for a relatively high number of patterns (those that are relatively far away from the decision boundary), and different decisions are restricted to patterns that are relatively close to the boundary. For this reason, many ensemble techniques are designed with the aim of diversifying the decisions in the proximity of the decision boundary. The proposed approach naturally modifies the boundary of each classifier according to the definition of the costs, which provides diversity in the decisions for samples that are close to the decision boundary of the whole problem.
The paper is organized as follows. In Section 2, the basic classification problem will be stated, the Bayesian formulation
will be surveyed, and the usual procedure used to train neural networks will be outlined. In Section 3, the main features of the training algorithm proposed in [12] are summarized, including an alternative notation that is more convenient for the description of the ensemble, which is given in Section 4. In that section, the architecture of the ensemble is presented and several fusion rules to combine the outputs of the constituent classifiers are provided. Section 5 presents experiments to evaluate the performance of the proposed method on different imbalanced datasets, and finally Section 6 presents the main conclusions of this work.
2. Bayesian formulation and conventional training of neural networks
A classification problem consists of assigning a D-dimensional pattern x (instance or sample) to one out of a known set H of possible classes or hypotheses. In a binary classification problem only two classes are possible, namely H = {H_{-1}, H_{+1}} ≡ {-1, +1}. The main goal of a classifier is to provide the best possible estimate of the correct class according to some pre-established figure of merit, such as the average probability of misclassification, to mention one of the most common examples.
Depending on the available information about the classification problem, several approaches can be used to solve it.
When conditional distributions of the observations under each hypothesis are known, f X |Ht (x ) for t ∈ { ± 1}, statistical
detection techniques can be used. In particular, Bayesian formulation has been commonly used in this framework. This
formulation considers a general figure of merit involving the a priori class probabilities along with the different costs of
each possible decision for samples of every class. The goal of a Bayesian classifier is to minimize the Bayesian risk function,
which is the statistical average of these costs [13,14]:

R = \sum_{t \in H} \sum_{d \in H} \pi_t \, c_{d,t} \, p_{\hat{H}|H_t}(d)    (1)

where \pi_t denotes the prior probability of hypothesis t, c_{d,t} is the cost of deciding hypothesis d when the true hypothesis is t, and p_{\hat{H}|H_t}(d) denotes the conditional probability of this decision,

p_{\hat{H}|H_t}(d) \equiv P(\text{Decide } d \mid t \text{ is true})    (2)
The classifier minimizing the Bayesian risk is defined by the likelihoods (the conditional distributions of the input pattern under both hypotheses, f_{X|H_t}(x), for t ∈ {-1, +1}), as well as by the a priori class probabilities and the decision costs. The optimal decision rule minimizing the Bayesian risk is
\Lambda(x) = \frac{f_{X|H_{+1}}(x)}{f_{X|H_{-1}}(x)} \;\underset{\hat{H}=-1}{\overset{\hat{H}=+1}{\gtrless}}\; \frac{c_{+1,-1} - c_{-1,-1}}{c_{-1,+1} - c_{+1,+1}} \cdot \frac{\pi_{-1}}{\pi_{+1}} = \gamma \;\;(>0)    (3)
i.e., a test comparing the likelihood ratio (LR) with a threshold γ given by costs cd,t and prior probabilities π t .
The performance of a binary classifier is usually characterized by the false alarm and miss probabilities (the probabilities of erroneous decisions under both hypotheses)

p_{FA} = p_{\hat{H}|H_{-1}}(+1) \equiv P(\text{Decide } +1 \mid -1 \text{ is true})    (4)

p_{M} = p_{\hat{H}|H_{+1}}(-1) \equiv P(\text{Decide } -1 \mid +1 \text{ is true})    (5)
Obviously, different values of decision threshold γ produce different pairs of values for pFA and pM , which show a compromise: reducing one of them means increasing the other.
In binary classification, the Bayesian formulation is usually simplified by assuming that the costs associated with correct decisions are null, c_{+1,+1} = c_{-1,-1} = 0, which makes it possible to parameterize the other two costs by means of a single parameter α as follows

c_{+1,-1} = \frac{\alpha}{\pi_{-1}}, \qquad c_{-1,+1} = \frac{1-\alpha}{\pi_{+1}}    (6)

Using this parameterization, the Bayesian risk becomes

R(\alpha) = \alpha\, p_{FA} + (1-\alpha)\, p_{M}    (7)
and the decision threshold is now given by γ = α/(1 − α). This parameterization can be helpful in imbalanced scenarios because the parameter α establishes the relative importance given to errors in samples of both classes, independently of the prior probabilities of each class.
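To make the role of α more concrete, the following minimal sketch (Python, not part of the original paper) implements the likelihood-ratio test (3) with the parameterization (6) and evaluates the empirical version of the risk (7) on synthetic data. The unit-variance Gaussian likelihoods, the 90%/10% class prior and all function names are assumptions chosen only for illustration.

```python
import numpy as np
from scipy.stats import norm

def bayes_decision(x, alpha, mu_pos=1.0, mu_neg=-1.0, sigma=1.0):
    """Likelihood-ratio test (3) with the cost parameterization (6):
    decide +1 when Lambda(x) >= gamma, with gamma = alpha / (1 - alpha)."""
    lr = norm.pdf(x, mu_pos, sigma) / norm.pdf(x, mu_neg, sigma)   # Lambda(x)
    gamma = alpha / (1.0 - alpha)                                   # decision threshold
    return np.where(lr >= gamma, 1, -1)

def empirical_risk(alpha, y_true, y_pred):
    """Empirical version of R(alpha) = alpha*pFA + (1 - alpha)*pM in (7)."""
    p_fa = np.mean(y_pred[y_true == -1] == 1)   # false-alarm rate on class -1
    p_m = np.mean(y_pred[y_true == 1] == -1)    # miss rate on class +1
    return alpha * p_fa + (1.0 - alpha) * p_m

# toy imbalanced sample: about 90% from class -1 and 10% from class +1
rng = np.random.default_rng(0)
y = np.where(rng.random(2000) < 0.9, -1, 1)
x = rng.normal(np.where(y == 1, 1.0, -1.0), 1.0)
for alpha in (0.3, 0.5, 0.7):
    print(alpha, empirical_risk(alpha, y, bayes_decision(x, alpha)))
```

Larger values of α penalize false alarms more heavily and move the threshold accordingly, which is the mechanism exploited later to diversify the ensemble learners.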
From (3) it is evident that the design of a Bayesian classifier requires the knowledge of likelihoods and prior probabilities
for each class. In practice, in many real problems these distributions are unknown, and the only available knowledge is a
set of N labeled examples of the problem at hand, i.e.
\{x_n, y_n\}, \quad n \in \{1, 2, \ldots, N\}    (8)
where binary class label yn ∈ { ± 1} indicates the class of pattern xn . Machine learning methods, such as decision trees
[15], Support Vector Machines (SVMs) [16], or neural networks [13], can be used in several different ways to solve binary
classification problems from a labeled data set. In this work we will use neural networks to solve the problem.
Binary discriminative machine classifiers usually provide a continuous variable z, or soft output, and the classification result is obtained by applying a hard threshold to it. There are different ways of obtaining z, which is a nonlinear transformation of the input sample x. SVMs impose a maximal margin between sample values for each class. Neural networks, such as Multi-Layer Perceptrons (MLPs) and Radial Basis Function Networks (RBFNs), include activation functions to obtain projections of input samples and minimize the sampled version of a surrogate cost (i.e., a measure of the separation between the target and the output of the network) by means of search algorithms, such as stochastic gradient search. When there are several layers, as in the case of MLPs with hidden layers, the well-known Back-Propagation (BP) algorithm has to be applied.
In principle, none of the above approaches has a direct connection with Bayesian learning, with only one exception: if the surrogate cost is a Bregman divergence [17,18], the neural network output provides an estimate of the Bayesian posterior probability of class +1. This opens an avenue toward other new re-balancing algorithms, which other works explore [19]. As we will see below, our approach connects with Bayes' theory through a different route.
3. Bayesian neural network classifier
Recently, we have proposed a new cost function to train binary neural network classifiers [12]. This cost function is an estimate of the Bayesian risk (7),

J_{Bayes}(w) = \alpha\, \hat{p}_{FA} + (1 - \alpha)\, \hat{p}_{M}    (9)

where the estimates of p_{FA} and p_{M}, \hat{p}_{FA} and \hat{p}_{M}, respectively, are obtained from the soft output of the neural network. If the decision threshold is λ = 0, p_{FA} and p_{M} for the neural classifier are
p_{FA} = \int_{0}^{\infty} f_{Z|H_{-1}}(z)\, dz, \qquad p_{M} = \int_{-\infty}^{0} f_{Z|H_{+1}}(z)\, dz    (10)
In general, conditional distributions of Z, which models the soft output of the network, are unknown. The training method
proposed in [12] estimates these distributions from the available data using the Parzen window estimator [20]. If sets S−1
and S+1 contain indexes for data corresponding to hypothesis H−1 and H+1 , respectively
S_{-1} = \{n : y_n = -1\} \quad \text{and} \quad S_{+1} = \{n : y_n = +1\}    (11)

and N_{-1} and N_{+1} denote the number of samples in each set, the Parzen window estimate of the conditional distribution of Z given H = t is

\hat{f}_{Z|H_t}(z) = \frac{1}{N_t} \sum_{n \in S_t} k_t(z - z_n), \quad \text{with } t \in \{-1, +1\}    (12)
The Parzen window kt (z) is any valid probability density function (PDF). It is important to remark that Z is one-dimensional,
therefore the windows kt (z) are functions of a one-dimensional variable.
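As an illustration of how these estimates can be computed, the sketch below (Python, not the authors' code) builds the Parzen densities of (12) from toy soft outputs and obtains plug-in estimates of p_FA and p_M in (10) by numerical integration. The triangular window and the synthetic outputs are assumptions used only for illustration.

```python
import numpy as np

def triangle_kernel(u):
    """One valid one-dimensional Parzen window with support [-1, 1]
    (an assumed shape for illustration; the paper's kernels are in [12])."""
    return np.clip(1.0 - np.abs(u), 0.0, None)

def parzen_pdf(z, z_class, kernel):
    """Estimate f_Z|Ht(z) as in (12): average of windows centred at the soft outputs."""
    z = np.atleast_1d(z)
    return np.mean(kernel(z[:, None] - z_class[None, :]), axis=1)

def estimate_pfa_pm(z_neg, z_pos, kernel, grid=np.linspace(-3, 3, 6001)):
    """Plug-in estimates of pFA and pM in (10), integrating the Parzen densities
    numerically over z > 0 and z < 0, respectively."""
    dz = grid[1] - grid[0]
    f_neg = parzen_pdf(grid, z_neg, kernel)   # estimated f_Z|H-1
    f_pos = parzen_pdf(grid, z_pos, kernel)   # estimated f_Z|H+1
    p_fa = np.sum(f_neg[grid > 0]) * dz
    p_m = np.sum(f_pos[grid < 0]) * dz
    return p_fa, p_m

# toy soft outputs of a classifier for each class (imbalanced sample sizes)
rng = np.random.default_rng(1)
z_neg = np.tanh(rng.normal(-0.5, 0.6, 300))   # outputs for class -1 samples
z_pos = np.tanh(rng.normal(0.5, 0.6, 40))     # outputs for class +1 samples
print(estimate_pfa_pm(z_neg, z_pos, triangle_kernel))
```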
Using these estimates, cost function (9) is minimized iteratively by a gradient descent algorithm, with

\frac{\partial J_{Bayes}(w)}{\partial z_n} =
\begin{cases}
-\dfrac{1-\alpha}{N_{+1}}\, k_{+1}(-z_n), & \text{if } y_n = +1 \\[6pt]
+\dfrac{\alpha}{N_{-1}}\, k_{-1}(-z_n), & \text{if } y_n = -1
\end{cases}    (13)
where ∂z_n/∂w can be calculated in the same manner as for conventional neural networks, using BP when needed. It can be seen that the gradient for a pattern of class t is proportional to the Parzen window used to estimate the conditional distribution of Z for this class, i.e., k_t(-z_n). Further details about the updating equations to minimize (9), and about the role of the kernels k_t(z) in the proposed cost function, can be found in [12].
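The following sketch shows how (13) combines with the chain rule ∂z_n/∂w for the simplest case, a linear classifier z_n = w^T [1, x_n^T]^T. It is a toy illustration consistent with (9)–(13) as written above, not the authors' MATLAB implementation; the symmetric triangular window, the fixed step size and the synthetic data are assumptions.

```python
import numpy as np

def triangle_kernel(u):
    # assumed symmetric window with support [-1, 1], so k_{+1} and k_{-1} coincide
    return np.clip(1.0 - np.abs(u), 0.0, None)

def grad_JBayes_linear(w, X, y, alpha, kernel=triangle_kernel):
    """Gradient of the sampled Bayes risk (9) for a linear classifier
    z_n = w^T [1, x_n]^T, combining (13) with dz_n/dw = [1, x_n]^T."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend bias input
    z = Xb @ w                                           # soft outputs z_n
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    dJ_dz = np.where(y == 1,
                     -(1.0 - alpha) / n_pos * kernel(-z),   # class +1 terms of (13)
                     alpha / n_neg * kernel(-z))             # class -1 terms of (13)
    return Xb.T @ dJ_dz                                  # chain rule through z_n

# a few plain gradient-descent steps on toy data (fixed step size for brevity)
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))
y = np.where(rng.random(200) < 0.9, -1, 1)
X[y == 1] += 1.0                                         # shift the minority class
w = rng.uniform(-0.1, 0.1, size=4)
for _ in range(200):
    w -= 1e-2 * grad_JBayes_linear(w, X, y, alpha=0.5)
print(w)
```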
Fig. 1. Architecture of the proposed ensemble of Nc Bayesian neural network classifiers.
We want to remark here that α is a parameter of the training method. This parameter makes it possible to establish the trade-off between the expected performance in the classification of samples of both classes, independently of the number of samples of each class in the available data set. Obviously, this can be helpful in the context of imbalanced classification problems.
This training method is valid for any network architecture, such as an MLP with one or several hidden layers or an RBFN, and with any activation function in the neurons of the network (hyperbolic tangent, rectified linear units, Gaussian units for RBFNs, etc.). It can also be applied to a linear classifier, i.e., z = w^T [1, x^T]^T.
4. An ensemble of Bayesian neural network classifiers
Combining classifiers is a well known technique to improve the result obtained by individual classifiers. The use of
ensembles, also known as teams of learners, has shown excellent results in many applications, including imbalanced classification problems [8–10].
Diversity plays a key role in the design of ensembles, because the learners have to provide different decisions for some patterns in order to improve on the performance of the individual classifiers. However, there is no strict definition of what is considered diversity (see [21] and the references therein for a discussion of this topic), and several different techniques have been used to build ensembles of diverse classifiers: bagging [22], boosting [23], random forests [24], or output flipping [25], also known as class switching, are some well-known examples.
In this work we propose a novel approach to introduce diversity. The proposed classifier is an ensemble of Nc Bayesian
neural network classifiers. Each neural classifier will be trained with the training algorithm proposed in [12] (summarized
in Section 3) for a different value of the parameter α weighting pFA vs pM in the Bayesian cost function (9). This architecture
is shown in Fig. 1.
The values of α used to train each individual classifier will be denoted as α^(j), with j ∈ {1, 2, ..., N_c}. This means that each constituent classifier is intended to work with a different trade-off between the probabilities of error under the two hypotheses,

\left( p_{FA}^{(j)}, p_{M}^{(j)} \right), \quad \text{for } j \in \{1, 2, \ldots, N_c\}    (14)
In the context of statistical decision theory, these different trade-offs can be seen as different operation points in the Receiver Operating Characteristic (ROC). The ROC is the curve that represents the detection probability, which is the complement of the miss probability, pD = 1 − pM , versus the false alarm probability pFA for every possible value of the decision
threshold, i.e. 0 ≤ γ < ∞, in a Bayesian classifier [26]. Therefore, the Nc neural classifiers in the ensemble will be working
at different points of the ROC of the classification problem (see Fig. 2), which are given by different pairs of probabilities of
false alarm and detection, thus having
\left( p_{FA}^{(j)}, p_{D}^{(j)} \right), \quad \text{for } j \in \{1, 2, \ldots, N_c\}    (15)
These different compromises between pFA and pD (or equivalently pM ) will provide diversity in the decisions of the individual
classifiers. Obviously, all these operation points are below the optimal ROC, as shown in the example of Fig. 2.
Fig. 2. Example of operation points in the ROC for individual Bayesian neural classifiers.
The optimal ROC is the one associated with the Bayesian classifier (7), which would require the knowledge of the likelihoods f_{X|H_t}(x), for t ∈ {-1, +1}.
Once each individual classifier has been trained, and is therefore able to provide an appropriate soft output for each input pattern, it is necessary to define the decision rule for the ensemble (aggregation or fusion rule). In this work we will compare the performance obtained by using four different aggregation methods:
1. Addition of soft outputs:

   \hat{y}_n^{E}(\text{Soft}) = \text{sgn}\left( \sum_{j=1}^{N_c} z_n^{(j)} \right)    (16)

2. Addition of hard outputs (or majority voting):

   \hat{y}_n^{E}(\text{Hard}) = \text{sgn}\left( \sum_{j=1}^{N_c} \hat{y}_n^{(j)} \right), \quad \text{with } \hat{y}_n^{(j)} = \text{sgn}\left(z_n^{(j)}\right)    (17)

3. Bayesian aggregation of hard outputs:

   \hat{y}_n^{E}(\text{Bayes}) = \begin{cases} +1, & \text{if } \Lambda(x_n^{E}) \ge \gamma^{E} \\ -1, & \text{if } \Lambda(x_n^{E}) < \gamma^{E} \end{cases}, \quad \text{with } \gamma^{E} = \frac{\alpha^{E}}{1-\alpha^{E}}    (18)

   where α^E is the trade-off parameter between p_FA and p_M in the Bayesian ensemble, and Λ(x_n^E) is the likelihood ratio for the ensemble (see the Appendix for its analytical expression). This is the optimal Bayesian fusion rule to combine the binary decisions (hard outputs) of the individual classifiers.

4. Majority voting over the 3 previous rules:

   \hat{y}_n^{E}(\text{Maj}) = \text{sgn}\left( \hat{y}_n^{E}(\text{Soft}) + \hat{y}_n^{E}(\text{Hard}) + \hat{y}_n^{E}(\text{Bayes}) \right)    (19)
To apply the third fusion rule it is necessary to estimate the operation point of each individual classifier, (p_{FA}^{(j)}, p_{D}^{(j)}). These values have to be estimated from the training set by cross-validation (details will be provided in Section 5.2). It is interesting to remark that, theoretically, and assuming that the true values of (p_{FA}^{(j)}, p_{D}^{(j)}) are known, the third rule must be better than the second one if the figure of merit is the Bayes risk (in fact, (18) is the optimal Bayesian decision rule to fuse hard decisions). However, if the estimates of the operation points (p_{FA}^{(j)}, p_{D}^{(j)}) are not accurate enough, the performance of this rule can decay.
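A compact sketch of the four fusion rules (16)–(19) is given below (Python, for illustration only; not the authors' code). Z holds the soft outputs z_n^{(j)} of the N_c classifiers for a batch of patterns, and the operation points passed to the Bayesian rule are hypothetical values standing in for the cross-validated estimates.

```python
import numpy as np

def fuse_soft(Z):
    """Rule 1, eq. (16): sign of the sum of soft outputs. Z has shape (N, Nc)."""
    return np.sign(Z.sum(axis=1))

def fuse_hard(Z):
    """Rule 2, eq. (17): majority voting over the hard outputs."""
    return np.sign(np.sign(Z).sum(axis=1))

def fuse_bayes(Z, p_fa, p_d, alpha_e=0.5):
    """Rule 3, eq. (18): Bayesian aggregation of hard outputs using the estimated
    operation points (p_fa[j], p_d[j]) of each classifier and the threshold
    gamma_E = alpha_E / (1 - alpha_E); the likelihood ratio follows eq. (26)."""
    yhat = np.sign(Z)
    num = np.where(yhat == 1, p_d, 1.0 - p_d)        # p(yhat_j | +1)
    den = np.where(yhat == 1, p_fa, 1.0 - p_fa)      # p(yhat_j | -1)
    lr = np.prod(num / den, axis=1)                  # Lambda(x^E)
    gamma_e = alpha_e / (1.0 - alpha_e)
    return np.where(lr >= gamma_e, 1.0, -1.0)

def fuse_majority(Z, p_fa, p_d, alpha_e=0.5):
    """Rule 4, eq. (19): majority vote over the three previous rules."""
    return np.sign(fuse_soft(Z) + fuse_hard(Z) + fuse_bayes(Z, p_fa, p_d, alpha_e))

# toy usage: 5 patterns, Nc = 7 classifiers with hypothetical operation points
rng = np.random.default_rng(3)
Z = rng.uniform(-1, 1, size=(5, 7))
p_fa = np.full(7, 0.15)
p_d = np.full(7, 0.85)
print(fuse_majority(Z, p_fa, p_d))
```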
5. Experiments
This section presents the results obtained with the proposed method in the classification of several imbalanced databases.
5.1. Databases
We have tested the proposed method with several imbalanced real-world databases obtained from the KEEL-dataset
repository [27]. Data and information about these data sets can be found at http://www.keel.es/dataset.php. Data sets in
[27] are organized in different k-fold partitions for training and test data. Here, we have worked with the 5-fold partition provided in the KEEL-dataset repository, thus making it easier to compare results. Each fold splits the data into a train set and a test set with approximately an 80%–20% proportion.
Table 1
Description of the 33 KEEL datasets, including dimension (D), number of patterns (Np), imbalance ratio (IR), and percentages of types of minority examples according to the definitions in [28] (S: Safe, B: Boundary, R: Rare, O: Outlier).

No | Dataset | D | Np | IR | S (%) / B (%) / R (%) / O (%)
1 | glass04vs5 | 9 | 92 | 9.22 | 33.33 / 55.56 / 0.00 / 11.11
2 | ecoli0346vs5 | 7 | 205 | 9.25 | 75.00 / 10.00 / 0.00 / 15.00
3 | ecoli0347vs56 | 7 | 257 | 9.28 | 68.00 / 20.00 / 0.00 / 12.00
4 | yeast05679vs4 | 8 | 528 | 9.35 | 7.84 / 41.18 / 19.61 / 31.37
5 | ecoli067vs5 | 6 | 220 | 10.00 | 45.00 / 35.00 / 0.00 / 20.00
6 | vowel0 | 13 | 988 | 10.10 | 98.89 / 1.11 / 0.00 / 0.00
7 | glass016vs2 | 9 | 192 | 10.76 | 0.00 / 23.53 / 35.29 / 41.18
8 | glass2 | 9 | 214 | 10.39 | 0.00 / 23.53 / 47.06 / 29.41
9 | ecoli0147vs2356 | 7 | 336 | 10.59 | 65.52 / 17.24 / 0.00 / 17.24
10 | led7digit02456789vs1 | 7 | 443 | 10.89 | 10.81 / 18.92 / 13.51 / 56.76
11 | ecoli01vs5 | 6 | 240 | 11.00 | 75.00 / 10.00 / 0.00 / 15.00
12 | glass06vs5 | 9 | 108 | 11.00 | 55.56 / 22.22 / 11.11 / 11.11
13 | glass0146vs2 | 9 | 205 | 11.06 | 0.00 / 23.53 / 35.29 / 41.18
14 | ecoli0147vs56 | 6 | 332 | 12.28 | 72.00 / 16.00 / 0.00 / 12.00
15 | cleveland0vs4 | 13 | 173 | 12.62 | 7.69 / 69.23 / 15.38 / 7.69
16 | ecoli0146vs5 | 6 | 280 | 13.01 | 70.00 / 15.00 / 0.00 / 15.00
17 | ecoli4 | 7 | 336 | 13.84 | 70.00 / 20.00 / 5.00 / 5.00
18 | shuttlec0vsc4 | 9 | 1829 | 13.88 | 98.38 / 0.81 / 0.00 / 0.81
19 | yeast1vs7 | 7 | 459 | 13.88 | 6.67 / 40.00 / 20.00 / 33.33
20 | glass4 | 9 | 214 | 15.47 | 30.78 / 46.15 / 7.69 / 15.38
21 | pageblocks13vs4 | 10 | 472 | 15.86 | 78.57 / 14.29 / 7.14 / 0.00
22 | abalone918 | 8 | 731 | 16.67 | 11.90 / 21.43 / 21.43 / 45.24
23 | glass016vs5 | 9 | 184 | 19.45 | 33.33 / 55.56 / 0.00 / 11.11
24 | shuttlec2vsc4 | 9 | 129 | 20.51 | 83.33 / 0.00 / 0.00 / 16.67
25 | yeast1458vs7 | 8 | 693 | 22.09 | 0.00 / 6.67 / 40.00 / 53.33
26 | glass5 | 9 | 214 | 22.75 | 33.33 / 55.56 / 0.00 / 11.11
27 | yeast2vs8 | 8 | 482 | 23.10 | 55.00 / 0.00 / 10.00 / 35.00
28 | yeast4 | 8 | 1484 | 28.15 | 5.88 / 35.29 / 19.61 / 39.22
29 | yeast1289vs7 | 8 | 947 | 30.55 | 0.00 / 26.67 / 23.33 / 50.00
30 | yeast5 | 8 | 1484 | 32.78 | 34.10 / 50.00 / 11.36 / 4.54
31 | ecoli0137vs26 | 7 | 281 | 39.16 | 71.43 / 0.00 / 0.00 / 28.57
32 | yeast6 | 8 | 1484 | 39.16 | 37.14 / 22.86 / 11.43 / 28.57
33 | abalone19 | 8 | 4174 | 128.87 | 0.00 / 0.00 / 12.50 / 87.50
The different methods will be tested independently in each fold, and the results obtained for the 5 folds will be averaged. Table 1 shows the main characteristics of the tested databases: dimension (D), number of patterns (Np), and the imbalance ratio (IR), defined as the ratio between the probabilities of the two classes,

IR = \frac{\pi_{-1}}{\pi_{+1}}

Datasets in Table 1 are sorted according to this ratio. The 33 chosen datasets, which are the same ones tested in [9,10], have a high imbalance (IR > 9, which means less than 10% of samples for the minority class). Finally, the table also contains the percentages of the types of minority examples as they are defined in [28]. That work shows that the difficulties in learning from imbalanced data are related to the location of the samples of the minority class with respect to the samples of the majority class. Therefore, this information can be useful to analyze the results obtained with the different classification methods.
5.2. Implementation details
Since the objective of this paper is to show the intrinsic potential of the proposed method to work with different data, instead of looking for a specific configuration for each database, a generic setup has been used for all databases.
The only pre-processing of the input data is normalization, forcing zero mean and unit variance for each dimension. Two architectures have been tested for the individual Bayesian classifiers:
• Linear classifiers.
• MLPs with a single hidden layer with Nn neurons, a single neuron in the output layer, and hyperbolic tangent activation functions.
Transfer function and gradient expressions ∂z/∂w_k are well known for these architectures, both for the linear classifier and for the MLP [29,30].
An adaptive step size μ has been used in the gradient updating. After each epoch, the cost J_Bayes(w^(i)) is evaluated and compared with J_Bayes(w^(i−1)):
• If J_Bayes(w^(i)) < J_Bayes(w^(i−1)): the step size is increased, μ = c_I μ.
• If J_Bayes(w^(i)) ≥ J_Bayes(w^(i−1)): the step size is decreased, μ = μ/c_D, and the weights w^(i) are re-computed with the new step size.
c_I = 1.01 and c_D = 2 have been used in all experiments, with initial value μ = 10^−3. The Bayesian classifier has been implemented in MATLAB (code available at www.tsc.uc3m.es/~mlazaro/).
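A minimal sketch of this adaptive step-size scheme is shown below (Python, for illustration only, not the MATLAB implementation). The quadratic toy cost and the function names are assumptions; in the paper the cost is J_Bayes(w) and the gradient is the one given in Section 3.

```python
import numpy as np

def train_adaptive(w, grad_fn, cost_fn, epochs=100, mu=1e-3, c_i=1.01, c_d=2.0):
    """Adaptive step-size scheme described above: increase mu after an epoch that
    lowers the cost; otherwise shrink mu and recompute the epoch with the new mu."""
    cost_prev = cost_fn(w)
    for _ in range(epochs):
        w_new = w - mu * grad_fn(w)          # one epoch of gradient descent
        cost_new = cost_fn(w_new)
        if cost_new < cost_prev:             # improvement: accept and speed up
            w, cost_prev, mu = w_new, cost_new, mu * c_i
        else:                                # no improvement: discard and slow down
            mu /= c_d                        # the weights are re-computed next pass
    return w

# toy usage on a quadratic surrogate (hypothetical cost, for illustration only)
w0 = np.array([3.0, -2.0])
w_star = train_adaptive(w0, grad_fn=lambda w: 2 * w, cost_fn=lambda w: np.sum(w**2))
print(w_star)
```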
Individual Bayesian classifiers for 9 different values of the parameter α have been trained for each dataset. In particular, the parameters defining the Bayes risk objective for each one of the constituent classifiers are

\alpha^{(j)} = 0.1 \times j, \quad \text{for } j \in \{1, 2, \ldots, 9\}    (20)

Experiments have shown that in some datasets the performance of some of the individual classifiers working at the extreme (lowest or highest) values of α^(j) can degrade. For this reason we will compare the performance obtained with two different numbers of individual classifiers in the ensemble:
• Ensembles combining 9 individual classifiers, for the 9 values of α^(j) given in (20).
• Ensembles combining 7 individual classifiers, for the 7 consecutive values with the best performance. This means discarding for each dataset two values of α^(j), which depending on the dataset can be α^(j) ∈ {0.1, 0.2}, α^(j) ∈ {0.8, 0.9}, or α^(j) ∈ {0.1, 0.9}. The two values that are discarded for each dataset are obtained by cross-validation.
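The selection of the 7 consecutive α^(j) values can be sketched as follows (an illustration, not the authors' code); the validation scores are hypothetical numbers standing in for the cross-validated pS of each individual classifier.

```python
import numpy as np

alphas = 0.1 * np.arange(1, 10)              # alpha^(j) = 0.1*j, j = 1..9, eq. (20)

def select_7_consecutive(val_ps):
    """Pick the 7 consecutive alpha values with the best average validation p_S.
    val_ps[j] is the validated p_S of the classifier trained with alphas[j]; the
    candidates correspond to discarding {0.1, 0.2}, {0.1, 0.9} or {0.8, 0.9}."""
    windows = {(0, 1): slice(2, 9), (0, 8): np.r_[1:8], (7, 8): slice(0, 7)}
    discarded, kept = max(windows.items(), key=lambda kv: np.mean(val_ps[kv[1]]))
    return alphas[kept], [alphas[i] for i in discarded]

# hypothetical validation scores for the 9 individual classifiers
val_ps = np.array([0.71, 0.74, 0.78, 0.80, 0.81, 0.80, 0.79, 0.77, 0.72])
kept, dropped = select_7_consecutive(val_ps)
print(kept, dropped)
```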
The network parameters w are randomly initialized, with values drawn from independent uniform distributions in
[−0.1, 0.1] for each parameter. 100 independent Monte Carlo simulations, starting with different initial parameters for each
classifier, have been performed for each one of the 5 folds of every database. Average results obtained in the 5 folds will be
presented.
Several aspects have to be cross-validated during training, such as the number of epochs in the training algorithm, the number of neurons in the hidden layer of the MLPs, the values of (p_{FA}^{(j)}, p_{D}^{(j)}) necessary to implement the Bayesian ensemble rule (18), and the two values of α^(j) that are discarded in the ensembles with 7 classifiers. As happens in many real problems, the number of samples in the training data set is relatively low for most of the datasets, especially for the minority class. For instance, the first dataset, glass04vs5, depending on the specific fold, has only 73 or 74 patterns in the training set (73 patterns in folds 1 and 2, and 74 patterns in the remaining folds), and the number of samples of the minority class in these training sets is only 7 or 8 (7 patterns in folds 1, 2, 3, and 4, and 8 patterns in fold 5). In this scenario, the strategy of splitting the available training data into a single train-set/validation-set partition has the drawback of providing a very low number of samples of the minority class in the validation set, or of drastically reducing the number of samples of the minority class in the train set if the size of the validation set is increased. For this reason, a 10-fold cross-validation strategy has been chosen for each fold of the KEEL dataset. The design and evaluation of each method is done independently for each fold, which contains a train set and a test set. For the design of the classifier in a given fold, we have used the following methodology:
• The train set of that fold is randomly split into 10 sub-sets, maintaining the class proportions, and the network is then trained 10 times. Each time, one of the sub-sets is used as the validation set and the remaining 9 sub-sets are used as the train set.
• The validation results obtained in the 10 sub-sets are averaged, which allows the network configuration for the fold to be selected.
• After that, the validated configuration is trained using the whole train set.
• Finally, the designed classifier is evaluated using the test set of the fold. Of course, the samples in this test set were not used during the validation and training phases of the design of the classifiers for that fold, which were carried out using only the train set of the fold.
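The sketch below illustrates this stratified 10-fold validation strategy using scikit-learn's StratifiedKFold (an assumption about tooling; the paper's implementation is in MATLAB). The dummy evaluation function and the toy data are placeholders for training a configuration and measuring its validation pS.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def cross_validated_ps(train_and_eval, X, y, n_splits=10, seed=0):
    """10-fold cross-validation on the train set of a KEEL fold, keeping the class
    proportions in each sub-set; train_and_eval is any function that fits a
    configuration on the train part and returns p_S on the validation part."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = [train_and_eval(X[tr], y[tr], X[va], y[va]) for tr, va in skf.split(X, y)]
    return np.mean(scores)   # averaged validation p_S used to pick the configuration

# hypothetical usage with a dummy evaluation function on toy imbalanced data
rng = np.random.default_rng(4)
X = rng.normal(size=(120, 5))
y = np.r_[np.ones(20, dtype=int), -np.ones(100, dtype=int)]
dummy = lambda Xtr, ytr, Xva, yva: rng.uniform(0.7, 0.9)
print(cross_validated_ps(dummy, X, y))
```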
The figure of merit that will be used to compare the performance of the different methods is the average probability of successful classification of samples of both classes, measured as

p_S = 1 - \frac{p_{FA} + p_{M}}{2}    (21)

Note that this figure of merit corresponds geometrically to the area under the trapezoid given by the operation point in the ROC and the points (0,0) and (1,1), as shown in Fig. 3. For this reason, in some works it is called Area Under the Curve (AUC), although strictly speaking it does not correspond to the area under a ROC curve, which is sometimes used as a figure of merit for neural classifiers [31,32]. The figure of merit (21) corresponds to assigning the same importance to errors in both classes, independently of the prior probabilities of the classes, which in the Bayesian risk corresponds to α = 1/2. Therefore, α_E = 1/2 will be used in the ensemble for the Bayesian aggregation rule (18), which makes the decision threshold γ_E = 1.
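For clarity, the figure of merit (21) can be computed from hard decisions as in the following small example (illustrative only; the counts are made up).

```python
import numpy as np

def p_s(y_true, y_pred):
    """Figure of merit (21): p_S = 1 - (p_FA + p_M)/2, giving the same importance
    to errors of both classes (alpha = 1/2)."""
    p_fa = np.mean(y_pred[y_true == -1] == 1)   # false-alarm probability estimate
    p_m = np.mean(y_pred[y_true == 1] == -1)    # miss probability estimate
    return 1.0 - 0.5 * (p_fa + p_m)

# example: 90 negatives with 9 false alarms, 10 positives with 2 misses
y_true = np.r_[np.full(90, -1), np.full(10, 1)]
y_pred = np.r_[np.full(9, 1), np.full(81, -1), np.full(8, 1), np.full(2, -1)]
print(p_s(y_true, y_pred))   # 1 - (0.10 + 0.20)/2 = 0.85
```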
Fig. 3. Value of the figure of merit pS seen as the area under the trapezoid defined by the operation point (pFA , pD ).
Fig. 4. Kernels under test for experiments with complementary windows.
The Parzen windows used for each class will be complementary, i.e.,

k_{+1}(z) = k(z), \qquad k_{-1}(z) = k(-z)    (22)

In the following, we will call k(z) the Parzen kernel, and the four kernels shown in Fig. 4 are tested. The superscripts denote uniform (U), linear (L), triangle (T), and absolute value (A), respectively. All of them have support [−1, 1], which is the range of the output z if a hyperbolic tangent activation function is used to saturate the output, as is the case with the MLPs in this work.
As discussed in [12], the choice of the kernel is not aimed at obtaining the best possible estimates of the Bayes risk in the cost function (9), but at providing good performance. These four kernels have different properties that can be appropriate under different characteristics of the problem at hand, as will be shown in the experiments. The best kernel for a given problem will depend on the characteristics of the problem itself, but it will also depend on the architecture of the network that is trained. This is because the architecture defines the kind of input/output projection that is performed and also the sensitivity to overfitting, which can be balanced with an appropriate choice of the kernel. More details about the role of the kernel function in the learning process and in the performance of a binary classifier are given in [12].
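The exact analytical forms of the four kernels are given in [12] and are not reproduced in this paper, so the sketch below only defines four valid one-dimensional PDFs with support [−1, 1] that match the kernel names of Fig. 4 (uniform, linear, triangle, absolute value). These expressions are assumptions for illustration, not a transcription of the paper's kernels.

```python
import numpy as np

def k_uniform(z):   # constant density on [-1, 1]
    return np.where(np.abs(z) <= 1, 0.5, 0.0)

def k_linear(z):    # linearly increasing density on [-1, 1]
    return np.where(np.abs(z) <= 1, (1.0 + z) / 2.0, 0.0)

def k_triangle(z):  # triangular density peaked at 0
    return np.where(np.abs(z) <= 1, 1.0 - np.abs(z), 0.0)

def k_absolute(z):  # density proportional to |z|, peaked at the borders
    return np.where(np.abs(z) <= 1, np.abs(z), 0.0)

# each candidate integrates to one over [-1, 1], so it is a valid Parzen window
grid = np.linspace(-1, 1, 200001)
dz = grid[1] - grid[0]
for k in (k_uniform, k_linear, k_triangle, k_absolute):
    print(k.__name__, np.sum(k(grid)) * dz)
```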
5.3. Experimental results
As a baseline for comparison, we include the best results obtained in [9] and in [10], where the same datasets were used to evaluate different classification methods. In [9], several ensemble methods combining from 10 to 40 classifiers (with the best number selected for each method) using bagging and boosting techniques, along with several preprocessing techniques to balance the data sets before constructing the ensemble, like the Synthetic Minority Over-sampling Technique (SMOTE) [33] or the Evolutionary Under-Sampling (EUS) [34], are evaluated. In [10], several class switching algorithms, using preprocessing techniques to relatively balance the datasets, are evaluated and compared with other baseline methods.
First of all, we want to show that the results depend on the Parzen kernel k(z) selected for the training algorithm of the constituent classifiers. To illustrate this, Table 2 compares the average value of pS obtained by an individual classifier trained with α^(j) = 0.5, the value that is fitted to the figure of merit pS, using the 4 kernels under evaluation. In this case, the network architecture is an MLP with Nn = 4 neurons in the hidden layer. The last row in the table contains, for each kernel, the number of wins (number of datasets where it uniquely achieves the best performance) and ties (number of datasets where it achieves the best performance along with other kernels).
Table 2
Comparison of the four Parzen kernels in the individual classifier training for α^(j) = 0.5 and an MLP network with Nn = 4 neurons in the hidden layer.

Dataset | kU(z) | kL(z) | kT(z) | kA(z)
glass04vs5 | 95.67 | 98.50 | 95.36 | 95.57
ecoli0346vs5 | 88.15 | 87.57 | 87.47 | 87.34
ecoli0347vs56 | 89.45 | 90.36 | 89.40 | 89.15
yeast05679vs4 | 77.90 | 77.91 | 77.55 | 77.94
ecoli067vs5 | 86.93 | 87.89 | 87.22 | 87.29
vowel0 | 97.01 | 98.99 | 97.41 | 96.41
glass016vs2 | 73.94 | 75.97 | 73.57 | 74.43
glass2 | 82.34 | 83.31 | 80.78 | 82.95
ecoli0147vs2356 | 88.11 | 86.48 | 88.13 | 87.65
led7digit02456789vs1 | 87.26 | 86.89 | 86.30 | 87.21
ecoli01vs5 | 87.25 | 87.38 | 87.32 | 87.14
glass06vs5 | 98.77 | 99.99 | 99.93 | 97.28
glass0146vs2 | 73.08 | 72.95 | 73.45 | 73.97
ecoli0147vs56 | 87.27 | 87.24 | 87.27 | 87.14
cleveland0vs4 | 90.80 | 91.64 | 90.11 | 90.52
ecoli0146vs5 | 83.17 | 84.64 | 83.11 | 83.04
ecoli4 | 85.13 | 84.85 | 84.93 | 83.07
shuttlec0vsc4 | 99.62 | 99.63 | 99.63 | 99.61
yeast1vs7 | 73.92 | 75.12 | 74.02 | 74.44
glass4 | 92.45 | 92.47 | 92.52 | 92.53
pageblocks13vs4 | 95.89 | 98.24 | 96.55 | 95.80
abalone918 | 86.29 | 87.98 | 86.42 | 84.91
glass016vs5 | 92.35 | 93.00 | 92.91 | 92.40
shuttlec2vsc4 | 99.52 | 99.54 | 99.64 | 99.54
yeast1458vs7 | 62.05 | 65.49 | 63.07 | 63.63
glass5 | 94.25 | 94.42 | 92.35 | 93.12
yeast2vs8 | 78.31 | 74.38 | 78.94 | 77.61
yeast4 | 78.44 | 78.55 | 78.61 | 79.98
yeast1289vs7 | 70.63 | 68.67 | 70.20 | 69.93
yeast5 | 96.94 | 97.38 | 97.00 | 97.04
ecoli0137vs26 | 83.64 | 81.16 | 84.64 | 86.62
yeast6 | 85.47 | 85.79 | 86.00 | 86.11
abalone19 | 77.71 | 75.67 | 77.58 | 77.86
Wins / Ties | 4 / 1 | 17 / 1 | 3 / 2 | 7 / 0
It can be seen that, although in some datasets the choice of the kernel is not critical, because similar results are obtained with all the kernels (for instance, yeast05679vs4 or yeast5), there are also datasets where the difference is relevant
(for instance, glass04vs5 or ecoli0137vs26), and therefore it is important to select an appropriate kernel to obtain the best
possible results. The best kernel for each configuration can be obtained by cross-validation.
In Table 3 we compare the performance of the four aggregation rules combining 9 or 7 individual classifiers. In this case
a linear kernel, kL (z), is used in an MLP architecture with Nn = 4 neurons in the hidden layer.
It can be seen that for both types of ensembles, using 9 or 7 components, the rule having the best performance in most
datasets is the first one, the aggregation based on the addition of soft outputs (16).
Comparing the results for ensembles with 9 vs ensembles with 7 components, we can see that differences are in general
relatively small, but the ensemble with 7 components obtains the best pS value in a slightly higher number of problems.
A similar behavior has been observed for every architecture and for all Parzen kernels. Therefore, and for the sake of
simpler comparisons, in the following we will consider a single configuration of ensembles with 7 components and the first
aggregation rule. In particular, we will compare the results obtained with the best schemes evaluated in [9] and [10] with the results provided by the proposed method, selecting the best kernel by cross-validation. For the sake of completeness, the comparison will also include the results obtained with the individual classifier trained with α^(j) = 0.5, which is the value fitted to the figure of merit being considered.
Table 4 shows the results obtained when the individual classifiers used to build the ensemble are linear Bayesian classifiers (linear classifiers trained with the method proposed in [12]). The table also indicates the kernel that has been used for
each dataset.
Table 5 shows the results obtained using MLPs with Nn neurons in the hidden layer as constituent classifiers of the
ensemble. The number of neurons and the Parzen kernel for each dataset, which are included in the table, have been
obtained by cross-validation. Networks with 1, 2, 4, 6, 8, 10, 12, 16, 20, 30, 40, and 50 neurons have been validated.
The proposed method, using simple linear classifiers, provides competitive results.
Table 3
Comparison of ensembles of 9 and 7 MLP Bayesian classifiers with Nn = 4 neurons in the hidden layer using kL(z).

Dataset | Soft(9) | Hard(9) | Bayes(9) | Maj(9) | Soft(7) | Hard(7) | Bayes(7) | Maj(7)
glass04vs5 | 98.65 | 98.51 | 98.31 | 98.51 | 98.49 | 98.49 | 98.47 | 98.49
ecoli0346vs5 | 88.24 | 87.92 | 88.24 | 88.18 | 88.16 | 88.02 | 88.20 | 88.22
ecoli0347vs56 | 91.08 | 90.93 | 90.91 | 90.93 | 91.26 | 91.15 | 90.99 | 91.21
yeast05679vs4 | 78.65 | 78.72 | 78.02 | 78.71 | 78.72 | 78.71 | 78.47 | 78.71
ecoli067vs5 | 88.08 | 88.05 | 87.46 | 87.94 | 88.58 | 88.58 | 87.81 | 88.42
vowel0 | 99.11 | 99.17 | 99.17 | 99.17 | 99.08 | 99.06 | 99.07 | 99.06
glass016vs2 | 77.20 | 76.81 | 76.89 | 77.07 | 77.10 | 76.91 | 76.90 | 76.91
glass2 | 84.14 | 83.36 | 84.37 | 84.08 | 84.21 | 84.22 | 83.99 | 84.18
ecoli0147vs2356 | 87.27 | 87.22 | 87.23 | 87.20 | 87.87 | 87.58 | 87.23 | 87.71
led7digit02456789vs1 | 87.13 | 87.08 | 87.00 | 87.06 | 87.81 | 87.78 | 87.64 | 87.76
ecoli01vs5 | 87.99 | 87.95 | 87.70 | 87.94 | 87.97 | 87.93 | 87.62 | 87.92
glass06vs5 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00
glass0146vs2 | 72.65 | 72.83 | 74.20 | 73.14 | 73.94 | 74.15 | 75.85 | 74.29
ecoli0147vs56 | 87.97 | 87.60 | 87.14 | 87.56 | 88.33 | 88.17 | 87.52 | 88.13
cleveland0vs4 | 92.76 | 91.71 | 87.84 | 91.24 | 93.80 | 93.56 | 91.84 | 93.57
ecoli0146vs5 | 85.62 | 85.48 | 85.43 | 85.55 | 85.57 | 85.40 | 85.53 | 85.50
ecoli4 | 84.56 | 85.06 | 85.06 | 85.06 | 84.54 | 85.15 | 85.15 | 85.15
shuttlec0vsc4 | 99.61 | 99.60 | 99.61 | 99.61 | 99.61 | 99.61 | 99.61 | 99.61
yeast1vs7 | 75.91 | 75.81 | 75.78 | 75.82 | 75.67 | 75.70 | 75.66 | 75.68
glass4 | 92.51 | 92.51 | 92.34 | 92.47 | 92.31 | 92.39 | 92.35 | 92.37
pageblocks13vs4 | 98.71 | 98.58 | 98.03 | 98.55 | 98.63 | 98.47 | 97.87 | 98.40
abalone918 | 88.18 | 88.38 | 87.97 | 88.35 | 89.74 | 89.07 | 89.07 | 89.07
glass016vs5 | 93.12 | 92.75 | 91.93 | 92.66 | 93.01 | 92.56 | 91.77 | 92.37
shuttlec2vsc4 | 99.69 | 99.60 | 99.60 | 99.60 | 99.74 | 99.66 | 99.56 | 99.66
yeast1458vs7 | 65.15 | 65.25 | 66.85 | 65.39 | 66.89 | 66.64 | 66.74 | 66.53
glass5 | 93.73 | 93.74 | 94.26 | 93.79 | 93.71 | 93.84 | 94.36 | 93.90
yeast2vs8 | 74.46 | 74.71 | 74.14 | 74.38 | 75.60 | 75.70 | 74.65 | 75.42
yeast4 | 78.22 | 78.41 | 78.86 | 78.41 | 78.83 | 78.72 | 78.72 | 78.72
yeast1289vs7 | 70.52 | 69.52 | 70.93 | 70.18 | 71.14 | 70.64 | 70.66 | 70.58
yeast5 | 97.52 | 97.50 | 97.52 | 97.53 | 97.53 | 97.50 | 97.54 | 97.53
ecoli0137vs26 | 82.28 | 82.11 | 82.12 | 82.26 | 82.20 | 82.24 | 82.09 | 82.19
yeast6 | 85.87 | 85.87 | 85.63 | 85.84 | 85.67 | 85.88 | 85.65 | 85.86
abalone19 | 76.91 | 76.07 | 76.86 | 76.74 | 77.93 | 77.35 | 77.36 | 77.36
Wins / Ties | 9 / 3 | 1 / 3 | 2 / 3 | 0 / 3 | 11 / 2 | 2 / 2 | 3 / 2 | 0 / 2
In the 33 datasets, the best method in [9] gets the best performance (win or tie) in 14 datasets, the best method in [10] in 12 datasets, and the proposed method also in 12 datasets. In 3 of these 12 datasets, the individual classifier for α^(j) = 0.5 gets a slightly better average pS than the ensemble, but in all those cases the ensemble has a higher average pS than the baseline methods.
Using MLP Bayesian classifiers, the best method for each database among all those of [9] achieves the maximum value of pS (win or tie) in 13 datasets, the best method in [10] in 11 datasets, and the proposed method in 14 datasets; only in 1 of these 14 datasets (yeast5) does the individual classifier for α^(j) = 0.5 get a slightly better average pS than the ensemble (although, again, the ensemble has a higher pS than the best method in both [9] and [10]). Interestingly, the proposed method obtains the best result in the last 4 datasets, which are those with the highest imbalance ratios (IR > 32, which means less than 3% of samples of the minority class), with an improvement of more than 8% in the most imbalanced dataset, abalone19.
Although for the sake of an easier comparison results in Tables 4 and 5 include only aggregation rule (16) for ensembles
of 7 classifiers, similar results have been obtained with the other proposed fusion rules and with ensembles of 9 classifiers
(see Table 3).
It is necessary to remark that the best results in [9] and [10] are obtained by several different methods. In particular, the 9 methods that obtain the best result in at least one dataset are included in Table 6, where they are compared with Soft (7). The details of each method can be found in [9] and [10].
The second column of Table 6 shows the number of datasets where each method wins, i.e., the method provides globally the best result when compared with the other 9 methods, or ties, i.e., the method reaches the best result along with other methods. The relatively high number of ties for the benchmark methods is because most of them provide the same results for the datasets glass04vs5, shuttlec0vsc4 and shuttlec2vsc4. It can be seen that a single benchmark method wins in at most 3 datasets, with 2 additional ties, while the proposed method wins in 14 datasets.
The third column presents the number of wins and losses for every method in a pairwise comparison with Soft (7). There are no ties in this comparison. All the benchmark methods have more losses than wins with respect to the proposed method. The maximum number of wins, 15, is obtained by USwitchingNED, which loses in 18 datasets.
The fourth and fifth columns show the results of two statistical tests that are commonly used to compare two classifiers, the paired T-test and the Wilcoxon signed-ranks test [35]. The p-value and the result of the hypothesis that the mean difference between the two sets of observations is not zero, for a statistical significance of 5%, are presented for a pairwise comparison of every method against Soft (7).
Table 4
Comparison of the proposed method using linear Bayesian classifiers with the best method for each dataset in [9] and [10]. "ind" denotes the individual classifier trained with α^(j) = 0.5.

Dataset | Kernel | Best [9] | Best [10] | ind | Soft (7)
glass04vs5 | U | 99.41 | 99.41 | 95.75 | 95.26
ecoli0346vs5 | L | 92.75 | 91.80 | 89.84 | 89.75
ecoli0347vs56 | U | 89.28 | 90.76 | 90.89 | 91.01
yeast05679vs4 | L | 81.44 | 80.93 | 79.72 | 79.63
ecoli067vs5 | U | 89.00 | 90.25 | 87.67 | 87.32
vowel0 | L | 98.87 | 98.26 | 96.77 | 96.53
glass016vs2 | U | 74.88 | 75.33 | 77.91 | 81.35
glass2 | T | 80.45 | 79.51 | 82.26 | 83.93
ecoli0147vs2356 | U | 89.43 | 88.72 | 88.15 | 88.30
led7digit02456789vs1 | L | 88.80 | 90.80 | 87.07 | 88.71
ecoli01vs5 | T | 92.35 | 91.14 | 87.37 | 87.26
glass06vs5 | T | 99.50 | 99.50 | 99.82 | 99.57
glass0146vs2 | T | 77.36 | 78.78 | 80.11 | 81.26
ecoli0147vs56 | L | 89.24 | 90.96 | 88.23 | 87.50
cleveland0vs4 | U | 82.80 | 88.40 | 94.92 | 95.58
ecoli0146vs5 | A | 92.95 | 91.60 | 87.57 | 88.22
ecoli4 | T | 93.09 | 92.15 | 85.47 | 85.36
shuttlec0vsc4 | T | 100.00 | 100.00 | 99.66 | 99.66
yeast1vs7 | U | 77.71 | 77.82 | 76.59 | 75.39
glass4 | T | 91.92 | 90.09 | 92.82 | 92.70
pageblocks13vs4 | L | 99.06 | 99.81 | 96.44 | 96.79
abalone918 | L | 74.24 | 72.24 | 90.15 | 90.51
glass016vs5 | L | 98.86 | 98.86 | 97.07 | 95.70
shuttlec2vsc4 | U | 100.00 | 100.00 | 99.67 | 99.64
yeast1458vs7 | U | 63.43 | 62.38 | 65.62 | 66.89
glass5 | T | 98.78 | 98.78 | 96.66 | 96.81
yeast2vs8 | L | 80.19 | 78.67 | 79.18 | 79.09
yeast4 | U | 84.89 | 84.49 | 81.56 | 80.88
yeast1289vs7 | L | 74.91 | 76.42 | 72.18 | 72.09
yeast5 | L | 96.61 | 97.40 | 97.53 | 97.59
ecoli0137vs26 | T | 83.60 | 83.60 | 86.53 | 86.66
yeast6 | A | 86.78 | 88.08 | 87.45 | 87.25
abalone19 | T | 70.81 | 69.91 | 78.11 | 77.47
Wins / Ties | - | 9 / 5 | 7 / 5 | 3 / 0 | 9 / 0
It can be seen that the hypothesis is false only for 1 method in the T-test (in this case USwitchingNED, with a p-value of 5.06%, close to the 5% limit) and for 2 methods in the Wilcoxon test (EUBQ, with a p-value of 5.04%, and again USwitchingNED).
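For reference, tests of this kind can be reproduced with standard statistical software; the sketch below uses SciPy on made-up per-dataset scores (the numbers are not the paper's results) to show the calls for the paired T-test, the Wilcoxon signed-ranks test and the Friedman test.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon, friedmanchisquare

# hypothetical per-dataset p_S values (33 entries each) for the compared methods
rng = np.random.default_rng(5)
ps_soft7 = rng.uniform(0.7, 1.0, 33)
ps_baseline = ps_soft7 - rng.normal(0.02, 0.03, 33)

# paired T-test and Wilcoxon signed-ranks test over the 33 datasets [35]
print(ttest_rel(ps_soft7, ps_baseline))
print(wilcoxon(ps_soft7, ps_baseline))

# the Friedman test needs the per-dataset scores of all compared methods
ps_other = ps_soft7 - rng.normal(0.03, 0.03, 33)
print(friedmanchisquare(ps_soft7, ps_baseline, ps_other))
```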
To test the differences between more than two models, two typical tests are the Dunnett test for ANOVA and the Friedman test [35]. Fig. 5(a) plots the means (with the range of two standard deviations in each direction) obtained by ANOVA, and Fig. 5(b) plots the average ranks (with the range of two standard deviations in each direction) obtained for the Friedman test.
For ANOVA, Soft (7) has the best mean value. The Dunnett test provides p-values comparing Soft (7), as the control method, with the benchmark methods. These p-values and the hypothesis (with 5% significance) are shown in Table 7, and it can be seen that they only allow a significant difference to be established between Soft (7) and SBAG4, Switching-I and SwitchingNED. For the Friedman ranks, Soft (7) has the best rank. The critical difference for the Bonferroni-Dunn test [35] with 5% significance is CD = 2.1325, and therefore it is only possible to establish a significant difference of Soft (7) with respect to Switching-I and SwitchingNED, which have ranks with a difference higher than CD with respect to the rank of Soft (7). The p-values and the hypothesis (with 5% significance) using Soft (7) as the control method are shown in Table 7.
The above results permit us to conclude that the proposed approach has a very competitive performance. In a pairwise comparison it provides the best average results and the best wins/losses figures against every benchmark method. The difference in performance is statistically significant with respect to 8 out of the 9 benchmark methods (and the remaining method is very close to the threshold), according to the paired T-test, and with respect to 7 of those methods, according to the Wilcoxon test.
In a multiple comparison, again the proposed method provides the best average mean for ANOVA and the best average rank for the Friedman test, although the Dunnett and Bonferroni-Dunn tests limit the statistical significance of the difference with respect to several benchmark methods.
Another interesting result is related to the imbalance ratio and the inherent difficulty of the problem. The proposed method provides the best results in the 4 most imbalanced datasets, and these datasets have a wide variety of difficulty in terms of the percentages of types of the minority samples (see Table 1):
• yeast5 has most samples in the safe and boundary categories (84.10%).
Table 5
Comparison of the proposed method using MLP Bayesian classifiers with Nn neurons in the hidden layer with the best method for each dataset in [9] and [10]. "ind" denotes the individual classifier trained with α^(j) = 0.5.

Dataset | Kernel | Nn | Best [9] | Best [10] | ind | Soft (7)
glass04vs5 | L | 4 | 99.41 | 99.41 | 98.50 | 98.49
ecoli0346vs5 | L | 2 | 92.75 | 91.80 | 89.04 | 89.28
ecoli0347vs56 | L | 1 | 89.28 | 90.76 | 90.86 | 91.22
yeast05679vs4 | T | 1 | 81.44 | 80.93 | 79.52 | 80.07
ecoli067vs5 | L | 16 | 89.00 | 90.25 | 88.66 | 89.25
vowel0 | L | 30 | 98.87 | 98.26 | 99.89 | 99.92
glass016vs2 | L | 30 | 74.88 | 75.33 | 77.48 | 77.78
glass2 | L | 4 | 80.45 | 79.51 | 83.31 | 84.21
ecoli0147vs2356 | T | 8 | 89.43 | 88.72 | 88.37 | 87.72
led7digit02456789vs1 | U | 2 | 88.80 | 90.80 | 87.92 | 87.69
ecoli01vs5 | L | 30 | 92.35 | 91.14 | 88.13 | 88.14
glass06vs5 | L | 6 | 99.50 | 99.50 | 100.00 | 100.00
glass0146vs2 | T | 1 | 77.36 | 78.78 | 80.49 | 81.28
ecoli0147vs56 | L | 30 | 89.24 | 90.96 | 87.93 | 88.81
cleveland0vs4 | A | 20 | 82.80 | 88.40 | 92.94 | 94.12
ecoli0146vs5 | L | 6 | 92.95 | 91.60 | 85.15 | 85.32
ecoli4 | U | 10 | 93.09 | 92.15 | 85.79 | 85.45
shuttlec0vsc4 | L | 6 | 100.00 | 100.00 | 99.66 | 99.64
yeast1vs7 | A | 30 | 77.71 | 77.82 | 75.17 | 77.64
glass4 | A | 12 | 91.92 | 90.09 | 92.64 | 92.76
pageblocks13vs4 | L | 20 | 99.06 | 99.81 | 99.15 | 99.30
abalone918 | L | 1 | 74.24 | 72.24 | 90.61 | 90.68
glass016vs5 | T | 16 | 98.86 | 98.86 | 94.68 | 94.95
shuttlec2vsc4 | A | 8 | 100.00 | 100.00 | 99.74 | 99.74
yeast1458vs7 | L | 12 | 63.43 | 62.38 | 66.22 | 66.95
glass5 | L | 8 | 98.78 | 98.78 | 95.14 | 94.52
yeast2vs8 | T | 4 | 80.19 | 78.67 | 78.94 | 79.19
yeast4 | U | 50 | 84.89 | 84.49 | 82.69 | 83.42
yeast1289vs7 | U | 4 | 74.91 | 76.42 | 70.63 | 70.71
yeast5 | L | 2 | 96.61 | 97.40 | 97.63 | 97.59
ecoli0137vs26 | A | 2 | 83.60 | 83.60 | 86.66 | 87.01
yeast6 | L | 1 | 86.78 | 88.08 | 87.93 | 88.11
abalone19 | T | 6 | 70.81 | 69.91 | 78.19 | 78.87
Wins / Ties | - | - | 8 / 5 | 6 / 5 | 1 / 1 | 12 / 1
Table 6
For each one of the 10 methods under comparison: number of wins/ties against all the methods in the 33 datasets (second column), number of wins/losses against Soft (7) in the 33 datasets (third column), paired T-test against Soft (7), p-value and hypothesis at the 5% significance level (fourth column), and Wilcoxon test against Soft (7), p-value and hypothesis at the 5% significance level (fifth column).

Method | Wins/Ties vs all | Wins/Losses vs Soft (7) | T-Test p-value (H) | Wilcoxon p-value (H)
SBAG4 [9] | 3 / 1 | 9 / 24 | 0.0012 (1) | 0.0011 (1)
RUS1 [9] | 2 / 3 | 11 / 22 | 0.0106 (1) | 0.0084 (1)
UB4 [9] | 1 / 3 | 10 / 23 | 0.0011 (1) | 0.0014 (1)
EUSBoost [10] | 1 / 4 | 11 / 22 | 0.0172 (1) | 0.0188 (1)
EUBQ [9] | 1 / 4 | 12 / 21 | 0.0326 (1) | 0.0504 (0)
EUBH [9] | 2 / 4 | 13 / 20 | 0.0172 (1) | 0.0444 (1)
Switching-I [9] | 1 / 2 | 4 / 29 | 1.45×10−6 (1) | 3.86×10−6 (1)
SwitchingNED [10] | 1 / 2 | 7 / 26 | 1.65×10−5 (1) | 2.03×10−5 (1)
USwitchingNED [10] | 3 / 2 | 15 / 18 | 0.0506 (0) | 0.1891 (0)
Soft (7) | 14 / 0 | - | - | -
• ecoli0137vs26 has most samples in the safe category, but with the remaining samples in the outlier category (28.57%).
• yeast6 has the most uniform distribution across the four categories, ranging from 11.43% of rare patterns to 37.14% of safe patterns.
• abalone19 has all samples in the rare and outlier categories, with a higher percentage of outliers (87.5%), which makes the classification task especially difficult.
The proposed method was competitive in all these diverse scenarios with the highest imbalance ratios. It can be seen that this variety also appears in the remaining 10 datasets where the proposed method obtains the best results (with vowel0 having 98.89% of safe patterns, and glass016vs2 having no safe patterns and most of the patterns in the rare (35.29%) and outlier (41.18%) categories, just to mention two very different situations).
Fig. 5. Average means for ANOVA (a) and ranks for the Friedman test (b) applied to the 10 methods under comparison.
Table 7
For each one of the 9 benchmark methods, p-value and hypothesis at the 5% significance level for the Dunnett and the Bonferroni-Dunn tests in the multiple comparison with Soft (7) as the control classifier.

Method | Dunnett test p-value (H) | Bonferroni-Dunn test p-value (H)
SBAG4 [9] | 0.0345 (1) | 0.2389 (0)
RUS1 [9] | 0.3991 (0) | 0.4403 (0)
UB4 [9] | 0.2028 (0) | 0.2285 (0)
EUSBoost [10] | 0.8366 (0) | 0.7169 (0)
EUBQ [9] | 0.9856 (0) | 0.9998 (0)
EUBH [9] | 0.6445 (0) | 0.6610 (0)
Switching-I [9] | ≈ 0 (1) | 1.69×10−7 (1)
SwitchingNED [10] | 1.12×10−12 (1) | 4.57×10−5 (1)
USwitchingNED [10] | 0.9961 (0) | 0.9992 (0)
The difficulty of the problem given by the location of the minority patterns is related to the performance, with high pS values of 99.92% or 97.63% in easy problems such as vowel0 or yeast5, and lower values of 77.78% and 78.87% in more difficult problems such as glass016vs2 or abalone19.
6. Conclusions
In this paper, we propose to design new machine ensembles for solving binary imbalanced classification problems by employing Bayesian neural networks that are trained to minimize a sampled version of the Bayes risk. Consequently, they offer
an intrinsic resistance to imbalance effects, as ensemble learners, diversifying them in a new form, consisting of applying
appropriately selected different cost policies. Several output aggregation schemes are also considered.
The results of extensive experiments with 33 benchmark databases, comparing accuracy rates with those of 9 highperformance designs under the assumption of equal importance for both types of errors, clearly show the excellent performance fo the proposed algorithms: They are the absolute winners for more than one third of the databases both with
linear and one-hidden layer MLP learners. The best individual alternative design offers less than one half absolute win+ties
when compared to the proposed design with MLPs. Statistical analysis confirms that the proposed method is competitive for imbalanced classification in very diverse problems. It is also worth remarking that adapting the proposed designs to other classification error costs is a trivial task.
Among the research directions that this study opens, we are at present working on using deep learners and on extending the formulation to multiclass problems.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have
appeared to influence the work reported in this paper.
Acknowledgments
This work was partly supported by research Grant TEC-2015-67719-P “Macro-ADOBE” (MINECO-FEDER, EU), and the research network TIN 2015-70808-REDT, “DAMA” (MINECO-FEDER, EU), for M. Lázaro and A.R. Figueiras-Vidal, as well as by
research Grant TIN2017-89517-P, “Smart-DaSCI” (MINECO-FEDER, EU), for F. Herrera.
Appendix. Likelihood ratio for Bayesian aggregation of hard outputs
The likelihood ratio involved in the Bayesian aggregation rule (18) is obtained as follows. If ŷ_k^{(j)} denotes, as in (17), the binary decision of the jth individual classifier for input pattern x_k, the input of the ensemble for pattern x_k is

x_k^{E} \equiv x_k^{E}(x_k) = \left[ \hat{y}_k^{(1)}, \hat{y}_k^{(2)}, \ldots, \hat{y}_k^{(N_c)} \right]    (23)

A likelihood ratio can be defined for this input of the ensemble. The conditional distributions of the decisions of the individual classifiers are given by

p_{\hat{Y}|H}^{(j)}\left( \hat{y}_k^{(j)} \mid -1 \right) = \begin{cases} p_{FA}^{(j)}, & \text{if } \hat{y}_k^{(j)} = +1 \\ 1 - p_{FA}^{(j)}, & \text{if } \hat{y}_k^{(j)} = -1 \end{cases}    (24)

and

p_{\hat{Y}|H}^{(j)}\left( \hat{y}_k^{(j)} \mid +1 \right) = \begin{cases} p_{D}^{(j)}, & \text{if } \hat{y}_k^{(j)} = +1 \\ 1 - p_{D}^{(j)}, & \text{if } \hat{y}_k^{(j)} = -1 \end{cases}    (25)

Using these conditional distributions, assuming conditional independence between the outputs of the N_c classifiers, and with δ[n] denoting the discrete-time delta function to provide a compact expression for (24) and (25), the likelihood ratio for x_k^{E} is given by

\Lambda\left(x_k^{E}\right) = \prod_{j=1}^{N_c} \frac{p_{\hat{Y}|H}^{(j)}\left( \hat{y}_k^{(j)} \mid +1 \right)}{p_{\hat{Y}|H}^{(j)}\left( \hat{y}_k^{(j)} \mid -1 \right)} = \prod_{j=1}^{N_c} \frac{p_{D}^{(j)}\,\delta[\hat{y}_k^{(j)} - 1] + \left(1 - p_{D}^{(j)}\right)\delta[\hat{y}_k^{(j)} + 1]}{p_{FA}^{(j)}\,\delta[\hat{y}_k^{(j)} - 1] + \left(1 - p_{FA}^{(j)}\right)\delta[\hat{y}_k^{(j)} + 1]}    (26)
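A direct implementation of (26) is straightforward; the sketch below (Python, illustrative only) evaluates the likelihood ratio for one vector of hard decisions, with hypothetical operation points, and compares it with γ_E = 1 (the threshold used in the experiments for α_E = 1/2).

```python
import numpy as np

def ensemble_likelihood_ratio(y_hat, p_fa, p_d):
    """Likelihood ratio (26) for the vector of hard decisions x^E = [yhat_1..yhat_Nc],
    assuming conditional independence between the individual classifiers."""
    num = np.where(y_hat == 1, p_d, 1.0 - p_d)     # p(yhat_j | H_{+1}), eq. (25)
    den = np.where(y_hat == 1, p_fa, 1.0 - p_fa)   # p(yhat_j | H_{-1}), eq. (24)
    return np.prod(num / den)

# hypothetical operation points of Nc = 7 classifiers and one vector of decisions
p_fa = np.array([0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35])
p_d = np.array([0.60, 0.70, 0.75, 0.80, 0.85, 0.90, 0.95])
y_hat = np.array([1, 1, -1, 1, -1, 1, 1])
lr = ensemble_likelihood_ratio(y_hat, p_fa, p_d)
print(lr, "-> decide +1" if lr >= 1.0 else "-> decide -1")   # gamma_E = 1 for alpha_E = 1/2
```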
References
[1] B.J. Park, S.K. Oh, W. Pedrycz, The design of polynomial function-based neural network predictors for detection of software defects, Inf. Sci. 229 (2013) 40–57.
[2] P. González, E. Álvarez, J. Díez, R. González-Quinteros, E. Nogueira, A. López-Urrutia, J.J. del Coz, Multiclass support vector machines with example dependent costs applied to plankton biomass estimation, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 1901–1905.
[3] C. Seiffert, T.M. Khoshgoftaar, J. Van Hulse, F. Folleco, An empirical study of the classification performance of learners on imbalanced and noisy software quality data, Inf. Sci. 259 (2014) 571–595.
[4] Y. Sun, A.K.C. Wong, M.S. Kamel, Classification of imbalanced data: a review, Int. J. Pattern Recognit. Artif. Intell. 23 (2009) 687–719.
[5] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
[6] P. Branco, L. Torgo, R.P. Ribeiro, A survey of predictive modeling on imbalanced domains, ACM Comput. Surv. 49 (2016) 31:1–31:50.
[7] H. He, Y. Ma (Eds.), Imbalanced Learning: Foundations, Algorithms, and Applications, IEEE Press - Wiley, 2013.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. 42 (2012) 463–484.
[9] M. Galar, A. Fernández, E. Barrenechea, F. Herrera, EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit. (2013) 3460–3471.
[10] S. González, S. García, M. Lázaro, A.R. Figueiras-Vidal, F. Herrera, Class switching according to nearest enemy distance for learning from highly imbalanced data, Pattern Recognit. 70 (2017) 12–24.
[11] L. Nanni, C. Fantozzi, N. Lazzarini, Coupling different methods for overcoming the class imbalance problem, Neurocomputing 158 (2015) 48–61.
[12] M. Lázaro, M.H. Hayes, A.R. Figueiras-Vidal, Training neural network classifiers through Bayes risk minimization applying unidimensional Parzen windows, Pattern Recognit. 77 (2018) 204–215.
[13] C. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1997.
[14] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., John Wiley & Sons, 2001.
[15] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth Inc., 1984.
[16] B. Schölkopf, C. Burges, A. Smola, Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[17] L.M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Comput. Math. Math. Phys. 7 (1967) 200–217.
[18] J. Cid-Sueiro, J.I. Arribas, S. Urbán-Muñoz, A.R. Figueiras-Vidal, Cost functions to estimate a posteriori probabilities in multiclass problems, IEEE Trans. Neural Netw. 10 (1999) 645–656.
[19] A. Benitez-Buenache, L.A. Pérez, V.J. Mathews, A.R. Figueiras-Vidal, Likelihood ratio equivalence and imbalanced binary classification, Expert Syst. Appl. 130 (2019) 84–96.
[20] E. Parzen, On the estimation of a probability density function and the mode, Ann. Math. Stat. 33 (1962) 1065–1076.
[21] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2003) 181–207.
[22] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[23] R.E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, MA, 2012.
[24] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[25] L. Breiman, Randomizing outputs to increase prediction accuracy, Mach. Learn. 40 (2000) 229–242.
[26] H.L. Van Trees, Detection, Estimation, and Modulation Theory: Part I, John Wiley and Sons, New York, 1968.
[27] A. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Multiple-Valued Logic Soft Comput. 17 (2011) 255–287.
[28] K. Napierala, J. Stefanowski, Types of minority class examples and their influence on learning classifiers from imbalanced data, J. Intell. Inf. Syst. 46 (2016) 563–597.
[29] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature (London) 323 (1986) 533–536.
[30] B. Widrow, M.A. Lehr, 30 years of adaptive neural networks: perceptron, Madaline and backpropagation, Proc. IEEE 78 (1990) 1415–1441.
[31] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861–874.
[32] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (1997) 1145–1159.
[33] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[34] S. García, F. Herrera, Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy, Evol. Comput. 17 (2009) 275–306.
[35] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.