Information Sciences 520 (2020) 31–45

Ensembles of cost-diverse Bayesian neural learners for imbalanced binary classification

Marcelino Lázaro a,∗, Francisco Herrera b, Aníbal R. Figueiras-Vidal a

a Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Av. Universidad 30, Leganés 28911, Madrid, Spain
b Department of Computer Science and Artificial Intelligence, University of Granada, Granada 18071, Spain

∗ Corresponding author at: Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, Spain. E-mail address: mlazaro@tsc.uc3m.es (M. Lázaro).

Article history: Received 14 June 2018; Revised 23 May 2019; Accepted 22 December 2019; Available online 29 January 2020.

Keywords: Imbalanced classification; Ensembles; Bayes risk; Parzen windows

https://doi.org/10.1016/j.ins.2019.12.050
0020-0255/© 2020 Elsevier Inc. All rights reserved.

Abstract

Combining traditional diversity and re-balancing techniques serves to design effective ensembles for solving imbalanced classification problems. Exploring new diversification procedures and new re-balancing methods is therefore an attractive research subject that can provide even better performance. In this contribution, we propose to create ensembles of the recently introduced binary Bayesian classifiers, which show intrinsic re-balancing capacities, by means of a diversification mechanism based on applying different cost policies to each ensemble learner, together with appropriate aggregation schemes. Experiments with an extensive number of representative imbalanced datasets, and their comparison with several selected high-performance classifiers, show that the proposed approach provides the best overall results.

1. Introduction

Imbalanced classification problems are relevant in the real world. Not only are the well-known cases (fraud, credit, diagnosis, intrusion, recognition, etc.) frequent, but there are also many specialized applications that deal with this kind of task [1–3]. In most practical situations there is no statistical model, but only a finite register of observations and labels. Under imbalance conditions, when the class populations are very different, a discriminative machine cannot be designed following conventional procedures if detecting minority samples is important, and re-balancing methods must be applied. Since there is not room here even for a brief overview of these methods, we refer the interested reader to the tutorials [4–6] and the text [7].

In this paper, we propose to combine two methodologies to solve imbalanced binary classification problems. First, a new form of building ensembles, based on imposing different classification costs to generate diversity. This is one more step along the direction that considers ensembles as an effective approach to solve imbalanced problems [8–11]. Second, we adopt the recently introduced Bayesian neural classifiers [12] as learners. These learners show intrinsic resistance to imbalance difficulties, because they are trained by minimizing the sampled Bayes risk and not a surrogate cost. In addition, such a formulation makes it possible to establish a direct connection between the classification costs and an estimate of the theoretical Receiver Operating Characteristic (ROC), providing a new perspective of that diversity and suggesting ways to aggregate the ensemble outputs. Our approach is very different from the most common approaches to build ensembles for imbalanced data.
In these methods, such as in [8–11], the diversity of the individual classifiers is obtained by modifying the available training data, either by resampling (over-sampling of the minority class or under-sampling of the majority class [8,9], or a combination of both methodologies [10]) or by switching (random modification of some labels of the majority class [11]). Although these methodologies have shown that they can provide good results in many cases, they carry the potential drawback of sampling techniques, which can modify the problem by reducing the influence of critical samples and/or emphasizing unimportant instances [6].

The proposed approach does not modify the available training set. Diversity is introduced through the probabilistic definition of the relative costs of errors on samples of the different classes, using a Bayesian formulation. The different cost definitions create a set of problems that are similar enough, yet diverse. Therefore, combining the solutions to this set of problems helps to improve the average performance and to reduce the variance. In the design of ensembles, typically all the classifiers provide the same decision for a relatively high number of the patterns (those that are relatively far away from the decision boundary), and different decisions are constrained to patterns that are relatively close to the boundary. For this reason, many ensemble techniques are designed with the aim of diversifying the decisions in the proximity of the decision boundary. The proposed approach naturally modifies the boundary of each classifier according to the definition of the costs, which provides diversity in the decisions for samples that are close to the decision boundary of the whole problem.

The paper is organized as follows. In Section 2, the basic classification problem is stated, the Bayesian formulation is surveyed, and the usual procedure used to train neural networks is outlined. In Section 3, the main features of the training algorithm proposed in [12] are summarized, including an alternative notation that is more appropriate for the description of the ensemble, which is given in Section 4. In that section, the architecture of the ensemble is presented and several fusion rules to combine the outputs of the constituent classifiers are provided. Section 5 presents experiments to evaluate the performance of the proposed method on different imbalanced datasets, and finally Section 6 discusses the main conclusions of this work.

2. Bayesian formulation and conventional training of neural networks

A classification problem consists of assigning a D-dimensional pattern x (instance or sample) to one out of a known set H of possible classes or hypotheses. In a binary classification problem only two classes are possible, namely H = {H_{-1}, H_{+1}} ≡ {−1, +1}. The main goal of a classifier is to provide the best possible estimation of the correct class according to some pre-established figure of merit, such as the average probability of misclassification, to mention one of the most common examples. Depending on the available information about the classification problem, several approaches can be used to solve it.
When the conditional distributions of the observations under each hypothesis are known, f_{X|H_t}(x) for t ∈ {±1}, statistical detection techniques can be used. In particular, the Bayesian formulation has been commonly used in this framework. This formulation considers a general figure of merit involving the a priori class probabilities along with the different costs of each possible decision for samples of every class. The goal of a Bayesian classifier is to minimize the Bayesian risk function, which includes the statistical average of these costs [13,14]

R = \sum_{t \in H} \sum_{d \in H} \pi_t \, c_{d,t} \, p_{\hat{H}|H_t}(d)   (1)

where π_t denotes the prior probability of hypothesis t, c_{d,t} is the cost of deciding hypothesis d when the true hypothesis is t, and p_{\hat{H}|H_t}(d) denotes the conditional probability of this decision

p_{\hat{H}|H_t}(d) \equiv P(\text{Decide } d \mid t \text{ is true})   (2)

The classifier minimizing the Bayesian risk is defined by the likelihoods (conditional distributions of the input pattern under both hypotheses, f_{X|H_t}(x), for t ∈ {−1,+1}), as well as by the a priori class probabilities and the decision costs. The optimal decision rule minimizing the Bayesian risk is

\Lambda(x) = \frac{f_{X|H_{+1}}(x)}{f_{X|H_{-1}}(x)} \;\underset{\hat{H}=-1}{\overset{\hat{H}=+1}{\gtrless}}\; \frac{c_{+1,-1}-c_{-1,-1}}{c_{-1,+1}-c_{+1,+1}} \cdot \frac{\pi_{-1}}{\pi_{+1}} = \gamma \;(>0)   (3)

i.e., a test comparing the likelihood ratio (LR) with a threshold γ given by the costs c_{d,t} and the prior probabilities π_t. The performance of a binary classifier is usually characterized by the false alarm and miss probabilities (the probabilities of erroneous decisions under both hypotheses)

p_{FA} = p_{\hat{H}|H_{-1}}(+1) \equiv P(\text{Decide } +1 \mid -1 \text{ is true})   (4)

p_{M} = p_{\hat{H}|H_{+1}}(-1) \equiv P(\text{Decide } -1 \mid +1 \text{ is true})   (5)

Obviously, different values of the decision threshold γ produce different pairs of values for p_FA and p_M, which show a compromise: reducing one of them means increasing the other. In binary classification, the Bayesian formulation is usually simplified by assuming that the costs associated with correct decisions are null, c_{+1,+1} = c_{-1,-1} = 0, which allows the other two costs to be parameterized by means of a single parameter α as follows

c_{+1,-1} = \frac{\alpha}{\pi_{-1}}, \qquad c_{-1,+1} = \frac{1-\alpha}{\pi_{+1}}   (6)

Using this parameterization, the Bayesian risk becomes

R(\alpha) = \alpha\, p_{FA} + (1-\alpha)\, p_{M}   (7)

and the decision threshold is now given by γ = α/(1 − α). This parameterization can be helpful in imbalanced scenarios because the parameter α establishes the relative importance given to errors on samples of both classes, independently of the prior probabilities of each class.
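As a purely illustrative aside (not part of the original paper), the following sketch applies the likelihood-ratio test of Eq. (3) with the cost parameterization of Eqs. (6) and (7), so that the threshold is γ = α/(1 − α). The one-dimensional Gaussian likelihoods, the class means and variance, and the imbalance level are assumptions made only for this toy example.

```python
import numpy as np
from scipy.stats import norm

def bayes_decision(x, mu_neg, mu_pos, sigma, alpha):
    """Likelihood-ratio test of Eq. (3): decide +1 when
    f(x|H+1)/f(x|H-1) >= gamma = alpha/(1 - alpha) (Eq. (6))."""
    lr = norm.pdf(x, mu_pos, sigma) / norm.pdf(x, mu_neg, sigma)
    gamma = alpha / (1.0 - alpha)
    return np.where(lr >= gamma, 1, -1)

def bayes_risk(decisions, labels, alpha):
    """Sampled version of Eq. (7): R(alpha) = alpha*pFA + (1-alpha)*pM."""
    p_fa = np.mean(decisions[labels == -1] == 1)   # false alarms on class -1
    p_m = np.mean(decisions[labels == 1] == -1)    # misses on class +1
    return alpha * p_fa + (1.0 - alpha) * p_m, p_fa, p_m

# Toy imbalanced problem: roughly 10:1 priors, Gaussian likelihoods N(0,1) and N(2,1).
rng = np.random.default_rng(0)
labels = np.where(rng.random(5000) < 1.0 / 11.0, 1, -1)
x = np.where(labels == 1, rng.normal(2.0, 1.0, 5000), rng.normal(0.0, 1.0, 5000))
for alpha in (0.3, 0.5, 0.7):
    r, p_fa, p_m = bayes_risk(bayes_decision(x, 0.0, 2.0, 1.0, alpha), labels, alpha)
    print(f"alpha={alpha:.1f}: pFA={p_fa:.3f}, pM={p_m:.3f}, R={r:.3f}")
```

Sweeping α moves the operating point along the p_FA versus p_M trade-off; this is the same mechanism used later in the paper to diversify the ensemble learners.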
From (3) it is evident that the design of a Bayesian classifier requires the knowledge of the likelihoods and the prior probabilities of each class. In practice, in many real problems these distributions are unknown, and the only available knowledge is a set of N labeled examples of the problem at hand, i.e.,

\{x_n, y_n\}, \quad n \in \{1, 2, \ldots, N\}   (8)

where the binary class label y_n ∈ {±1} indicates the class of pattern x_n. Machine learning methods, such as decision trees [15], Support Vector Machines (SVMs) [16], or neural networks [13], can be used in several different ways to solve binary classification problems from a labeled data set. In this work we will use neural networks to solve the problem. Binary discriminative machine classifiers usually provide a continuous variable z, or soft output, and the classification result is obtained by applying a hard threshold to it. There are different forms of obtaining z, which is a nonlinear transformation of the input sample x. SVMs impose a maximal margin between sample values of each class. Neural networks, such as Multi-Layer Perceptrons (MLPs) and Radial Basis Function Networks (RBFNs), include activation functions to obtain projections of the input samples and minimize the sampled version of a surrogate cost, i.e., a measure of the separation between the target and the output of the network, by means of search algorithms such as stochastic gradient search. When there are several layers, as in the case of MLPs with hidden layers, the well-known Back-Propagation (BP) algorithm has to be applied.

In principle, none of the above approaches has a direct connection with Bayesian learning, with only one exception: if the surrogate cost is a Bregman divergence [17,18], the neural network output provides an estimate of the Bayesian posterior class probability. This opens an avenue to other new re-balancing algorithms, which other works explore [19]. As we will see below, our approach connects with Bayes' theory following a different route.

3. Bayesian neural network classifier

Recently, we have proposed a new cost function to train binary neural network classifiers [12]. This cost function is an estimate of the Bayesian risk (7)

J_{\text{Bayes}}(w) = \alpha\, \hat{p}_{FA} + (1-\alpha)\, \hat{p}_{M}   (9)

where the estimates of p_FA and p_M, \hat{p}_{FA} and \hat{p}_{M}, respectively, are obtained from the soft output of the neural network. If the decision threshold is λ = 0, p_FA and p_M for the neural classifier are

p_{FA} = \int_{0}^{\infty} f_{Z|H_{-1}}(z)\, dz, \qquad p_{M} = \int_{-\infty}^{0} f_{Z|H_{+1}}(z)\, dz   (10)

In general, the conditional distributions of Z, which models the soft output of the network, are unknown. The training method proposed in [12] estimates these distributions from the available data using the Parzen window estimator [20]. If the sets S_{-1} and S_{+1} contain the indexes of the data corresponding to hypotheses H_{-1} and H_{+1}, respectively,

S_{-1} = \{n : y_n = -1\} \quad \text{and} \quad S_{+1} = \{n : y_n = +1\}   (11)

and N_{-1} and N_{+1} denote the number of samples in each set, the Parzen window estimate of the conditional distribution of Z given H_t is

\hat{f}_{Z|H_t}(z) = \frac{1}{N_t} \sum_{n \in S_t} k_t(z - z_n), \quad \text{with } t \in \{-1,+1\}   (12)

The Parzen window k_t(z) is any valid probability density function (PDF). It is important to remark that Z is one-dimensional; therefore the windows k_t(z) are functions of a one-dimensional variable. Using these estimates, the cost function (9) is minimized iteratively by a gradient descent algorithm, with

\frac{\partial J_{\text{Bayes}}(w)}{\partial z_n} = \begin{cases} -\dfrac{1-\alpha}{N_{+1}}\, k_{+1}(-z_n), & \text{if } y_n = +1 \\[2mm] \;\;\dfrac{\alpha}{N_{-1}}\, k_{-1}(-z_n), & \text{if } y_n = -1 \end{cases}   (13)

where ∂z_n/∂w can be calculated in the same manner as for conventional neural networks, using BP when needed. It can be seen that the gradient for a pattern of class t is proportional to the Parzen window used to estimate the conditional distribution of Z for this class, i.e., k_t(−z_n). Further details about the updating equations to minimize (9), and about the role of the kernels k_t(z) in the proposed cost function, can be found in [12].

We want to remark here that α is a parameter of the training method. This parameter allows the trade-off between the expected classification performance on samples of both classes to be established independently of the number of samples of each class in the available data set. Obviously, this can be helpful in the context of imbalanced classification problems.
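To make the training procedure tangible, here is a minimal, illustrative sketch of the cost (9) and the gradient (13) for the simplest learner mentioned below, a linear classifier z = w^T [1, x^T]^T, using batch gradient descent and a uniform Parzen window. The kernel choice, learning rate, and function names are assumptions made for this example only; this is not the authors' MATLAB implementation.

```python
import numpy as np

def uniform_kernel(z, width=1.0):
    """One valid Parzen window for Eq. (12): a uniform PDF on [-width, width]."""
    return (np.abs(z) <= width) / (2.0 * width)

def train_bayes_linear(X, y, alpha, lr=1e-3, epochs=200, width=1.0):
    """Gradient-descent sketch of Eqs. (9)-(13) for a linear classifier z = w^T [1, x]."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])        # prepend a bias input
    w = np.zeros(Xb.shape[1])
    n_pos, n_neg = np.sum(y == 1), np.sum(y == -1)
    for _ in range(epochs):
        z = Xb @ w
        # dJ/dz_n from Eq. (13): sign and weight depend on the class of each sample
        g = np.where(y == 1,
                     -(1.0 - alpha) / n_pos * uniform_kernel(-z, width),
                     alpha / n_neg * uniform_kernel(-z, width))
        w -= lr * (Xb.T @ g)                              # chain rule: dz_n/dw = [1, x_n]
    return w

def bayes_cost(z, y, alpha):
    """Hard-decision estimate of Eq. (9); the training cost itself uses the
    Parzen-smoothed estimates of pFA and pM instead."""
    p_fa = np.mean(z[y == -1] > 0)
    p_m = np.mean(z[y == 1] <= 0)
    return alpha * p_fa + (1.0 - alpha) * p_m
```

An actual reproduction of the paper would also include the adaptive step size and the cross-validation choices described in Section 5.2.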
This training method is valid for every possible network architecture, such as an MLP with one or several hidden layers, or an RBFN, and with every possible activation function in the neurons of the network (hyperbolic tangent, rectified linear units, Gaussian units for RBFNs, etc.). But it can also be applied to a linear classifier, i.e., z = w^T [1, x^T]^T.

4. An ensemble of Bayesian neural network classifiers

Combining classifiers is a well known technique to improve the results obtained by individual classifiers. The use of ensembles, also known as teams of learners, has shown excellent results in many applications, including imbalanced classification problems [8–10]. Diversity plays a key role in the design of ensembles, because the learners have to provide different decisions for some patterns in order to improve on the performance of the individual classifiers. However, there is no strict definition of what is considered diversity (see [21] and references therein for a discussion of this topic), and several different techniques have been used to build ensembles of diverse classifiers: bagging [22], boosting [23], random forests [24], or output flipping [25], also known as class switching, are some well known examples. In this work we propose a novel approach to introduce diversity.

The proposed classifier is an ensemble of N_c Bayesian neural network classifiers. Each neural classifier is trained with the training algorithm proposed in [12] (summarized in Section 3) for a different value of the parameter α weighting p_FA vs. p_M in the Bayesian cost function (9). This architecture is shown in Fig. 1.

Fig. 1. Architecture of the proposed ensemble of N_c Bayesian neural network classifiers.

The values of α used to train each individual classifier will be denoted as α^{(j)}, with j ∈ {1, 2, …, N_c}. This means that each constituent classifier is intended to work with a different trade-off between the probabilities of error under the two hypotheses

\left( p_{FA}^{(j)}, p_{M}^{(j)} \right) \quad \text{for } j \in \{1, 2, \ldots, N_c\}   (14)

In the context of statistical decision theory, these different trade-offs can be seen as different operation points in the Receiver Operating Characteristic (ROC). The ROC is the curve that represents the detection probability, which is the complement of the miss probability, p_D = 1 − p_M, versus the false alarm probability p_FA for every possible value of the decision threshold, i.e., 0 ≤ γ < ∞, in a Bayesian classifier [26]. Therefore, the N_c neural classifiers in the ensemble will be working at different points of the ROC of the classification problem (see Fig. 2), which are given by different pairs of probabilities of false alarm and detection, thus having

\left( p_{FA}^{(j)}, p_{D}^{(j)} \right) \quad \text{for } j \in \{1, 2, \ldots, N_c\}   (15)

These different compromises between p_FA and p_D (or, equivalently, p_M) provide diversity in the decisions of the individual classifiers. Obviously, all these operation points are below the optimal ROC, as shown in the example of Fig. 2.

Fig. 2. Example of operation points in the ROC for individual Bayesian neural classifiers.

The optimal ROC is that associated with the Bayesian classifier (7), which would require the knowledge of the likelihoods f_{X|H_t}(x), for t ∈ {−1,+1}. Once each individual classifier is trained, and, therefore, it is able to provide an appropriate soft output for each input pattern, it is necessary to define the decision rule for the ensemble (aggregation or fusion rule).
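Before turning to the fusion rules, a short sketch of the diversification mechanism itself may help: each member is trained on the same data but with a different cost policy α^(j) (the grid of Eq. (20) used later in the experiments). It reuses the hypothetical train_bayes_linear helper from the previous sketch and is purely illustrative.

```python
import numpy as np

def build_cost_diverse_ensemble(X, y, n_classifiers=9, **train_kwargs):
    """One Bayesian learner per cost policy alpha_j = 0.1*j; diversity comes from
    the costs, not from resampling or relabeling the training data."""
    alphas = [0.1 * j for j in range(1, n_classifiers + 1)]
    weights = [train_bayes_linear(X, y, a, **train_kwargs) for a in alphas]
    return alphas, weights

def soft_outputs(weights, X):
    """Soft outputs z^(j) of every ensemble member for every input pattern."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    return np.column_stack([Xb @ w for w in weights])
```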
In this work we will compare the performance obtained by using four different aggregation methods:

1. Addition of soft outputs

\hat{y}_n^{E}(\text{Soft}) = \operatorname{sgn}\left( \sum_{j=1}^{N_c} z_n^{(j)} \right)   (16)

2. Addition of hard outputs (or majority voting)

\hat{y}_n^{E}(\text{Hard}) = \operatorname{sgn}\left( \sum_{j=1}^{N_c} \hat{y}_n^{(j)} \right), \quad \text{with } \hat{y}_n^{(j)} = \operatorname{sgn}\left(z_n^{(j)}\right)   (17)

3. Bayesian aggregation of hard outputs

\hat{y}_n^{E}(\text{Bayes}) = \begin{cases} +1, & \text{if } \Lambda(x_n^{E}) \geq \gamma^{E} \\ -1, & \text{if } \Lambda(x_n^{E}) < \gamma^{E} \end{cases}, \quad \text{with } \gamma^{E} = \frac{\alpha^{E}}{1-\alpha^{E}}   (18)

where α^E is the trade-off parameter between p_FA and p_M in the Bayesian ensemble, and Λ(x_n^E) is the likelihood ratio for the ensemble (see the Appendix for its analytical expression). This is the optimal Bayesian fusion rule to combine the binary decisions (hard outputs) of the individual classifiers.

4. Majority voting over the 3 previous rules

\hat{y}_n^{E}(\text{Maj}) = \operatorname{sgn}\left( \hat{y}_n^{E}(\text{Soft}) + \hat{y}_n^{E}(\text{Hard}) + \hat{y}_n^{E}(\text{Bayes}) \right)   (19)

To apply the third fusion rule it is necessary to estimate the operation point of each individual classifier, (p_{FA}^{(j)}, p_{D}^{(j)}). These values have to be estimated from the training set by cross-validation (details will be provided in Section 5.2). It is interesting to remark that, theoretically, and assuming that the true values of (p_{FA}^{(j)}, p_{D}^{(j)}) are known, the third rule must be better than the second one if the figure of merit is the Bayes risk (in fact, (18) is the optimal Bayesian decision rule to fuse hard decisions). However, if the estimates of the operation points (p_{FA}^{(j)}, p_{D}^{(j)}) are not accurate enough, the performance of this rule can decay.
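The four rules can be written compactly as in the following illustrative sketch, with ties broken in favor of +1 (an arbitrary choice); the Bayesian rule uses the likelihood ratio of Eq. (26) derived in the Appendix, fed with externally estimated operating points (p_FA^(j), p_D^(j)).

```python
import numpy as np

def _sgn(v):
    """Sign with ties broken towards +1 (an arbitrary choice for this sketch)."""
    return 1 if v >= 0 else -1

def fuse(soft, p_fa=None, p_d=None, alpha_e=0.5):
    """The four aggregation rules of Eqs. (16)-(19) for one input pattern.
    `soft` holds the members' soft outputs z^(j); p_fa/p_d are the estimated
    operating points, needed only by the Bayesian rule (Eq. (18))."""
    soft = np.asarray(soft, dtype=float)
    hard = np.array([_sgn(z) for z in soft])              # members' hard decisions

    y_soft = _sgn(np.sum(soft))                           # Eq. (16): add soft outputs
    y_hard = _sgn(np.sum(hard))                           # Eq. (17): majority voting

    if p_fa is not None and p_d is not None:              # Eq. (18): Bayesian fusion
        p_fa, p_d = np.asarray(p_fa), np.asarray(p_d)
        num = np.where(hard == 1, p_d, 1.0 - p_d)         # P(hard_j | H_+1), Eq. (25)
        den = np.where(hard == 1, p_fa, 1.0 - p_fa)       # P(hard_j | H_-1), Eq. (24)
        lr = np.prod(num / den)                           # Eq. (26), conditional independence
        y_bayes = 1 if lr >= alpha_e / (1.0 - alpha_e) else -1
    else:
        y_bayes = y_hard                                  # fallback when no estimates exist

    y_maj = _sgn(y_soft + y_hard + y_bayes)               # Eq. (19): vote over the 3 rules
    return {"Soft": y_soft, "Hard": y_hard, "Bayes": y_bayes, "Maj": y_maj}
```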
5. Experiments

This section presents the results obtained with the proposed method in the classification of several imbalanced databases.

5.1. Databases

We have tested the proposed method with several imbalanced real-world databases obtained from the KEEL-dataset repository [27]. Data and information about these data sets can be found at http://www.keel.es/dataset.php. The data sets in [27] are organized in different k-fold partitions for training and test data. Here, we have worked with the 5-fold partition provided in the KEEL-dataset repository, which makes it easier to compare results. Each fold splits the data into a train set and a test set with around an 80%–20% proportion. The different methods will be tested independently in each fold and the results obtained for the 5 folds will be averaged.

Table 1 shows the main characteristics of the tested databases: dimension (D), number of patterns (N_p), and the imbalance ratio (IR), defined as the ratio between the probabilities of the two classes

IR = \frac{\pi_{-1}}{\pi_{+1}}

The datasets in Table 1 are sorted according to this ratio. The 33 chosen datasets, which are the same ones tested in [9,10], have a high imbalance (IR > 9, which means less than 10% of samples for the minority class). Finally, the table also contains the percentages of types of minority examples as defined in [28]. That work shows that the difficulties in learning from imbalanced data are related to the location of the samples of the minority class with respect to the samples of the majority class. Therefore, this information can be useful to analyze the results obtained with the different classification methods.

Table 1. Description of the 33 KEEL datasets, including dimension (D), number of patterns (Np), imbalance ratio (IR), and percentages of types of minority examples according to the definitions in [28] (S: Safe, B: Boundary, R: Rare, O: Outlier).

No | Dataset | D | Np | IR | S (%) / B (%) / R (%) / O (%)
1 | glass04vs5 | 9 | 92 | 9.22 | 33.33 / 55.56 / 0.00 / 11.11
2 | ecoli0346vs5 | 7 | 205 | 9.25 | 75.00 / 10.00 / 0.00 / 15.00
3 | ecoli0347vs56 | 7 | 257 | 9.28 | 68.00 / 20.00 / 0.00 / 12.00
4 | yeast05679vs4 | 8 | 528 | 9.35 | 7.84 / 41.18 / 19.61 / 31.37
5 | ecoli067vs5 | 6 | 220 | 10.00 | 45.00 / 35.00 / 0.00 / 20.00
6 | vowel0 | 13 | 988 | 10.10 | 98.89 / 1.11 / 0.00 / 0.00
7 | glass016vs2 | 9 | 192 | 10.76 | 0.00 / 23.53 / 35.29 / 41.18
8 | glass2 | 9 | 214 | 10.39 | 0.00 / 23.53 / 47.06 / 29.41
9 | ecoli0147vs2356 | 7 | 336 | 10.59 | 65.52 / 17.24 / 0.00 / 17.24
10 | led7digit02456789vs1 | 7 | 443 | 10.89 | 10.81 / 18.92 / 13.51 / 56.76
11 | ecoli01vs5 | 6 | 240 | 11.00 | 75.00 / 10.00 / 0.00 / 15.00
12 | glass06vs5 | 9 | 108 | 11.00 | 55.56 / 22.22 / 11.11 / 11.11
13 | glass0146vs2 | 9 | 205 | 11.06 | 0.00 / 23.53 / 35.29 / 41.18
14 | ecoli0147vs56 | 6 | 332 | 12.28 | 72.00 / 16.00 / 0.00 / 12.00
15 | cleveland0vs4 | 13 | 173 | 12.62 | 7.69 / 69.23 / 15.38 / 7.69
16 | ecoli0146vs5 | 6 | 280 | 13.01 | 70.00 / 15.00 / 0.00 / 15.00
17 | ecoli4 | 7 | 336 | 13.84 | 70.00 / 20.00 / 5.00 / 5.00
18 | shuttlec0vsc4 | 9 | 1829 | 13.88 | 98.38 / 0.81 / 0.00 / 0.81
19 | yeast1vs7 | 7 | 459 | 13.88 | 6.67 / 40.00 / 20.00 / 33.33
20 | glass4 | 9 | 214 | 15.47 | 30.78 / 46.15 / 7.69 / 15.38
21 | pageblocks13vs4 | 10 | 472 | 15.86 | 78.57 / 14.29 / 7.14 / 0.00
22 | abalone918 | 8 | 731 | 16.67 | 11.90 / 21.43 / 21.43 / 45.24
23 | glass016vs5 | 9 | 184 | 19.45 | 33.33 / 55.56 / 0.00 / 11.11
24 | shuttlec2vsc4 | 9 | 129 | 20.51 | 83.33 / 0.00 / 0.00 / 16.67
25 | yeast1458vs7 | 8 | 693 | 22.09 | 0.00 / 6.67 / 40.00 / 53.33
26 | glass5 | 9 | 214 | 22.75 | 33.33 / 55.56 / 0.00 / 11.11
27 | yeast2vs8 | 8 | 482 | 23.10 | 55.00 / 0.00 / 10.00 / 35.00
28 | yeast4 | 8 | 1484 | 28.15 | 5.88 / 35.29 / 19.61 / 39.22
29 | yeast1289vs7 | 8 | 947 | 30.55 | 0.00 / 26.67 / 23.33 / 50.00
30 | yeast5 | 8 | 1484 | 32.78 | 34.10 / 50.00 / 11.36 / 4.54
31 | ecoli0137vs26 | 7 | 281 | 39.16 | 71.43 / 0.00 / 0.00 / 28.57
32 | yeast6 | 8 | 1484 | 39.16 | 37.14 / 22.86 / 11.43 / 28.57
33 | abalone19 | 8 | 4174 | 128.87 | 0.00 / 0.00 / 12.50 / 87.50

5.2. Implementation details

Since the objective of this paper is to show the intrinsic potential of the proposed method to work with different data, instead of looking for a specific configuration for each database, a generic setup has been used for all databases. The only pre-processing of the input data is normalization, forcing zero mean and unit variance for each dimension. Two architectures have been tested for the individual Bayesian classifiers:

• Linear classifiers.
• MLPs with a single hidden layer with N_n neurons, a single neuron in the output layer, and hyperbolic tangent activation functions.

The transfer function and the gradient expressions ∂z/∂w_k are well known for these architectures, both for the linear classifier and for the MLP [29,30]. An adaptive step size μ has been used in the gradient updating. After each epoch, the cost J_Bayes(w^{(i)}) is evaluated and compared with J_Bayes(w^{(i−1)}):

• If J_Bayes(w^{(i)}) < J_Bayes(w^{(i−1)}): the step size is increased, μ = c_I μ.
• If J_Bayes(w^{(i)}) ≥ J_Bayes(w^{(i−1)}): the step size is decreased, μ = μ/c_D, and the weights w^{(i)} are re-computed with the new step size.

c_I = 1.01 and c_D = 2 have been used in all experiments, with initial value μ = 10^{−3}. The Bayesian classifier has been implemented in MATLAB (code is available at www.tsc.uc3m.es/~mlazaro/).

Individual Bayesian classifiers for 9 different values of the parameter α have been trained for each dataset. In particular, the parameters defining the Bayes risk objective of each one of the constituent classifiers are

\alpha^{(j)} = 0.1 \times j, \quad \text{for } j \in \{1, 2, \ldots, 9\}   (20)

Experiments have shown that in some datasets the performance of the individual classifiers working at the extreme (lowest or highest) values of α^{(j)} can degrade. For this reason we will compare the performance obtained with two different numbers of individual classifiers in the ensemble:

• Ensembles combining 9 individual classifiers, for the 9 values of α^{(j)} given in (20).
• Ensembles combining 7 individual classifiers, for the 7 consecutive values with the best performance. This means discarding two values of α^{(j)}, which depending on the dataset can be
  – α^{(j)} ∈ {0.1, 0.2},
  – α^{(j)} ∈ {0.8, 0.9}, or
  – α^{(j)} ∈ {0.1, 0.9}.
The two values that are discarded for each dataset are obtained by cross-validation.

The network parameters w are randomly initialized, with values drawn from independent uniform distributions in [−0.1, 0.1] for each parameter. 100 independent Monte Carlo simulations, starting with different initial parameters for each classifier, have been performed for each one of the 5 folds of every database. Average results obtained in the 5 folds will be presented.

Several aspects have to be cross-validated during training, such as the number of epochs of the training algorithm, the number of neurons in the hidden layer of the MLPs, the values of (p_{FA}^{(j)}, p_{D}^{(j)}) necessary to implement the Bayesian ensemble rule (18), and the two values of α^{(j)} that are discarded in the ensembles with 7 classifiers. As happens in many real problems, the number of samples in the training data set is relatively low for most of the datasets, especially for the minority class. For instance, the first dataset, glass04vs5, has only 73 or 74 patterns in the training set, depending on the specific fold (73 patterns in folds 1 and 2, and 74 patterns in the remaining folds). And the number of samples of the minority class in these training sets is only 7 or 8 (7 patterns in folds 1, 2, 3, and 4, and 8 patterns in fold 5). In this scenario, the strategy of splitting the available training data set into a single train-set/validation-set partition has the drawback of providing a very low number of samples of the minority class in the validation set, or of drastically reducing the number of samples of the minority class in the train set if the size of the validation set is increased.
For this reason, a 10-fold cross-validation strategy has been chosen for each fold of the KEEL dataset. The design and evaluation of each method is done independently for each fold, which contains a train set and a test set. For the design of the classifier in a given fold, we have used the following methodology:

• The train set of that fold is randomly split into 10 sub-sets, maintaining the class proportions, and the network is then trained 10 times. Each time, one of the sub-sets is used as the validation set and the remaining 9 sub-sets are used as the train set.
• Validation results obtained in the 10 sub-sets are averaged, which is used to select the network configuration for the fold.
• After that, the validated solution is trained using the whole train set.
• Finally, the designed classifier is evaluated using the test set of the fold. Of course, the samples in this test set were not used during the validation and training phases of the design of the classifiers for that fold, which were carried out working only with the train set of the fold.

The figure of merit that will be used to compare the performance of the different methods is the average probability of successful classification of samples of both classes, measured as

p_S = 1 - \frac{p_{FA} + p_{M}}{2}   (21)

Note that this figure of merit corresponds geometrically to the area under the trapezoid given by the operation point in the ROC and the points (0,0) and (1,1), as shown in Fig. 3. For this reason, in some works it is called Area Under the Curve (AUC), although strictly speaking it does not correspond to the area under a ROC curve, which is itself sometimes used as a figure of merit for neural classifiers [31,32]. The figure of merit (21) corresponds to assigning the same importance to errors in both classes, independently of the prior probabilities of the classes, which in the Bayesian risk corresponds to α = 1/2. Therefore, α^E = 1/2 will be used in the ensemble for the Bayesian aggregation rule (18), which makes the decision threshold γ^E = 1.

Fig. 3. Value of the figure of merit p_S seen as the area under the trapezoid defined by the operation point (p_FA, p_D).
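For reference, the figure of merit of Eq. (21) can be computed from hard decisions as in the following short, illustrative sketch (the function name is an assumption):

```python
import numpy as np

def figure_of_merit(y_true, y_pred):
    """Eq. (21): p_S = 1 - (p_FA + p_M)/2, i.e. the balanced success rate that
    weights the errors of both classes equally, regardless of the class priors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_fa = np.mean(y_pred[y_true == -1] == 1)   # false-alarm rate (majority-class errors)
    p_m = np.mean(y_pred[y_true == 1] == -1)    # miss rate (minority-class errors)
    return 1.0 - (p_fa + p_m) / 2.0
```

In the experiments below, this value would be averaged over the 5 KEEL folds (and over the Monte Carlo runs) for each dataset.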
Fig. 4. Kernels under test for experiments with complementary windows.

The Parzen windows used for each class will be complementary, i.e.,

k_{+1}(z) = k(z), \qquad k_{-1}(z) = k(-z)   (22)

In the following, we will call k(z) the Parzen kernel, and the four kernels shown in Fig. 4 are tested. Superscripts denote uniform (U), linear (L), triangle (T), and absolute value (A), respectively. All of them have support [−1, 1], which is the range of the output z if a hyperbolic tangent activation function is used to saturate the output, as is the case with the MLPs in this work. As discussed in [12], the choice of the kernel is not aimed at obtaining the best possible estimates of the Bayes risk in the cost function (9), but at providing good performance. These four kernels have different properties that can be appropriate under different characteristics of the problem at hand, as will be shown in the experiments. The best kernel for a given problem will depend on the characteristics of the problem itself, but it will also depend on the architecture of the network that is trained. This is because the architecture defines the kind of input/output projection that is performed and also the sensitivity to overfitting, which can be balanced with an appropriate choice of the kernel. More details about the role of the kernel function in the learning process and in the performance of a binary classifier are given in [12].

5.3. Experimental results

As a baseline for comparison, we include the best results obtained in [9] and in [10], where the same datasets were used to evaluate different classification methods. In [9], several ensemble methods combining from 10 to 40 classifiers (with the best number selected for each method) using bagging and boosting techniques, along with several preprocessing techniques to balance the data sets before constructing the ensemble, such as the Synthetic Minority Over-sampling Technique (SMOTE) [33] or Evolutionary Under-Sampling (EUS) [34], are evaluated. In [10], several class switching algorithms, using preprocessing techniques to relatively balance the datasets, are evaluated and compared with other baseline methods.

First of all, we want to show that the results will depend on the Parzen kernel k(z) selected for the training algorithm of the constituent classifiers. To illustrate this, Table 2 compares the average value of p_S obtained by an individual classifier trained with α^{(j)} = 0.5, the value matched to the figure of merit p_S, using the 4 kernels under evaluation. In this case, the network architecture is an MLP with N_n = 4 neurons in the hidden layer. The last row in the table contains, for each kernel, the number of wins (number of datasets where it uniquely achieves the best performance) and ties (number of datasets where it achieves the best performance along with other kernels).

Table 2. Comparison of the four Parzen kernels in the individual classifier training for α^{(j)} = 0.5 and an MLP network with Nn = 4 neurons in the hidden layer.
Dataset kU (z) kL (z) kT (z) kA (z) glass04vs5 ecoli0346vs5 ecoli0347vs56 yeast05679vs4 ecoli067vs5 vowel0 glass016vs2 glass2 ecoli0147vs2356 led7digit02456789vs1 ecoli01vs5 glass06vs5 glass0146vs2 ecoli0147vs56 cleveland0vs4 ecoli0146vs5 ecoli4 shuttlec0vsc4 yeast1vs7 glass4 pageblocks13vs4 abalone918 glass016vs5 shuttlec2vsc4 yeast1458vs7 glass5 yeast2vs8 yeast4 yeast1289vs7 yeast5 ecoli0137vs26 yeast6 abalone19 Wins / Ties 95.67 88.15 89.45 77.90 86.93 97.01 73.94 82.34 88.11 87.26 87.25 98.77 73.08 87.27 90.80 83.17 85.13 99.62 73.92 92.45 95.89 86.29 92.35 99.52 62.05 94.25 78.31 78.44 70.63 96.94 83.64 85.47 77.71 4/1 98.50 87.57 90.36 77.91 87.89 98.99 75.97 83.31 86.48 86.89 87.38 99.99 72.95 87.24 91.64 84.64 84.85 99.63 75.12 92.47 98.24 87.98 93.00 99.54 65.49 94.42 74.38 78.55 68.67 97.38 81.16 85.79 75.67 17 / 1 95.36 87.47 89.40 77.55 87.22 97.41 73.57 80.78 88.13 86.30 87.32 99.93 73.45 87.27 90.11 83.11 84.93 99.63 74.02 92.52 96.55 86.42 92.91 99.64 63.07 92.35 78.94 78.61 70.20 97.00 84.64 86.00 77.58 3/2 95.57 87.34 89.15 77.94 87.29 96.41 74.43 82.95 87.65 87.21 87.14 97.28 73.97 87.14 90.52 83.04 83.07 99.61 74.44 92.53 95.80 84.91 92.40 99.54 63.63 93.12 77.61 79.98 69.93 97.04 86.62 86.11 77.86 7/0 It can be seen that, although in some datasets the choice of the kernel is not critical, because similar results are obtained with all the kernels (for instance, yeast05679vs4 or yeast5), there are also datasets where the difference is relevant (for instance, glass04vs5 or ecoli0137vs26), and therefore it is important to select an appropriate kernel to obtain the best possible results. The best kernel for each configuration can be obtained by cross-validation. In Table 3 we compare the performance of the four aggregation rules combining 9 or 7 individual classifiers. In this case a linear kernel, kL (z), is used in an MLP architecture with Nn = 4 neurons in the hidden layer. It can be seen that for both types of ensembles, using 9 or 7 components, the rule having the best performance in most datasets is the first one, the aggregation based on the addition of soft outputs (16). Comparing the results for ensembles with 9 vs ensembles with 7 components, we can see that differences are in general relatively small, but the ensemble with 7 components obtains the best pS value in a slightly higher number of problems. A similar behavior has been observed for every architecture and for all Parzen kernels. Therefore, and for the sake of simpler comparisons, in the following we will consider a single configuration of ensembles with 7 components and the first aggregation rule. In particular, we will compare the results obtained with the best schemes evaluated in [9] and [10], with the results provided by the proposed method selecting the best kernel by cross-validation. For the sake of completeness, the comparison will also include the results obtained with the individual classifier trained with α ( j ) = 0.5, which is that fitted to the figure of merit which is being considered Table 4 shows the results obtained when the individual classifiers used to build the ensemble are linear Bayesian classifiers (linear classifiers trained with the method proposed in [12]). The table also indicates the kernel that has been used for each dataset. Table 5 shows the results obtained using MLPs with Nn neurons in the hidden layer as constituent classifiers of the ensemble. 
The number of neurons and the Parzen kernel for each dataset, which are included in the table, have been obtained by cross-validation. Networks with 1, 2, 4, 6, 8, 10, 12, 16, 20, 30, 40, and 50 neurons have been validated. The proposed method, using simple linear classifiers, provides competitive results. In the 33 datasets, the best method in [9] gets the best performance (win or tie) in 14 datasets, the best method in [10] in 12 datasets, and the proposed method 40 M. Lázaro, F. Herrera and A.R. Figueiras-Vidal / Information Sciences 520 (2020) 31–45 Table 3 Comparison of ensembles of 9 and 7 MLP Bayesian classifiers with Nn = 4 neurons in the hidden layer using kL (z). Dataset Soft(9) Hard(9) Bayes(9) Maj(9) Soft (7) Hard(7) Bayes(7) Maj(7) glass04vs5 ecoli0346vs5 ecoli0347vs56 yeast05679vs4 ecoli067vs5 vowel0 glass016vs2 glass2 ecoli0147vs2356 led7digit02456789vs1 ecoli01vs5 glass06vs5 glass0146vs2 ecoli0147vs56 cleveland0vs4 ecoli0146vs5 ecoli4 shuttlec0vsc4 yeast1vs7 glass4 pageblocks13vs4 abalone918 glass016vs5 shuttlec2vsc4 yeast1458vs7 glass5 yeast2vs8 yeast4 yeast1289vs7 yeast5 ecoli0137vs26 yeast6 abalone19 Wins / Ties 98.65 88.24 91.08 78.65 88.08 99.11 77.20 84.14 87.27 87.13 87.99 100.00 72.65 87.97 92.76 85.62 84.56 99.61 75.91 92.51 98.71 88.18 93.12 99.69 65.15 93.73 74.46 78.22 70.52 97.52 82.28 85.87 76.91 9/3 98.51 87.92 90.93 78.72 88.05 99.17 76.81 83.36 87.22 87.08 87.95 100.00 72.83 87.60 91.71 85.48 85.06 99.60 75.81 92.51 98.58 88.38 92.75 99.60 65.25 93.74 74.71 78.41 69.52 97.50 82.11 85.87 76.07 1/3 98.31 88.24 90.91 78.02 87.46 99.17 76.89 84.37 87.23 87.00 87.70 100.00 74.20 87.14 87.84 85.43 85.06 99.61 75.78 92.34 98.03 87.97 91.93 99.60 66.85 94.26 74.14 78.86 70.93 97.52 82.12 85.63 76.86 2/3 98.51 88.18 90.93 78.71 87.94 99.17 77.07 84.08 87.20 87.06 87.94 100.00 73.14 87.56 91.24 85.55 85.06 99.61 75.82 92.47 98.55 88.35 92.66 99.60 65.39 93.79 74.38 78.41 70.18 97.53 82.26 85.84 76.74 0/3 98.49 88.16 91.26 78.72 88.58 99.08 77.10 84.21 87.87 87.81 87.97 100.00 73.94 88.33 93.80 85.57 84.54 99.61 75.67 92.31 98.63 89.74 93.01 99.74 66.89 93.71 75.60 78.83 71.14 97.53 82.20 85.67 77.93 11 / 2 98.49 88.02 91.15 78.71 88.58 99.06 76.91 84.22 87.58 87.78 87.93 100.00 74.15 88.17 93.56 85.40 85.15 99.61 75.70 92.39 98.47 89.07 92.56 99.66 66.64 93.84 75.70 78.72 70.64 97.50 82.24 85.88 77.35 2/2 98.47 88.20 90.99 78.47 87.81 99.07 76.90 83.99 87.23 87.64 87.62 100.00 75.85 87.52 91.84 85.53 85.15 99.61 75.66 92.35 97.87 89.07 91.77 99.56 66.74 94.36 74.65 78.72 70.66 97.54 82.09 85.65 77.36 3/2 98.49 88.22 91.21 78.71 88.42 99.06 76.91 84.18 87.71 87.76 87.92 100.00 74.29 88.13 93.57 85.50 85.15 99.61 75.68 92.37 98.40 89.07 92.37 99.66 66.53 93.90 75.42 78.72 70.58 97.53 82.19 85.86 77.36 0/2 also in 12 datasets. In 3 of these 12 datasets, the individual classifier for α ( j ) = 0.5 gets a slightly better average pS than the ensemble, but in all those cases the ensemble has a higher average pS than the baseline methods. Using MLP Bayesian classifiers, the best method for each database among all those of [9] achieves the maximum value of pS (win or tie) in 13 datasets, the best method in [10] in 11 datasets, and the proposed method in 14 datasets, and only in 1 of these 14 datasets (yeast5) the individual classifier for α ( j ) = 0.5 gets a slightly better average pS than the ensemble (although again the ensemble has a higher pS than the best method in both [9] and [10]). 
Interestingly, the proposed method obtains the best result in the last 4 datasets, which are those with the highest imbalance ratios (IR > 32, which means less than 3% of samples of the minority class), with an improvement of more than 8% in the most imbalanced dataset, abalone19. Although, for the sake of an easier comparison, the results in Tables 4 and 5 include only aggregation rule (16) for ensembles of 7 classifiers, similar results have been obtained with the other proposed fusion rules and with ensembles of 9 classifiers (see Table 3).

It is necessary to remark that the best results in [9] and [10] are obtained by several different methods. In particular, the 9 methods that obtain the best result in at least one dataset are included in Table 6, where they are compared with Soft (7). The details of each method can be found in [9] and [10]. The second column of Table 6 shows the number of datasets where each method wins, i.e., provides globally the best result when compared with the other 9 methods, or ties, i.e., reaches the best result along with other methods. The relatively high number of ties for the benchmark methods is because most of them provide the same results for datasets glass04vs5, shuttlec0vsc4 and shuttlec2vsc4. It can be seen that a single benchmark method wins in at most 3 datasets, with 2 additional ties, while the proposed method obtains 14 wins. The third column presents the number of wins and losses of every method in a pair comparison with Soft (7). There are no ties in this comparison. All the benchmark methods have more losses than wins with respect to the proposed method. The maximum number of wins, 15, is obtained by USwitchingNED, which loses in 18 datasets. The fourth and fifth columns show the results of two statistical tests that are commonly used to compare two classifiers, the paired T-test and the Wilcoxon signed-ranks test [35]. The p-value and the result of the hypothesis that the mean difference between the two sets of observations is not zero, for a statistical significance of 5%, are presented in a pair comparison of every method against Soft (7). It can be seen that the hypothesis is false only for 1 method in the T-test (in this case
Dataset Kernel Best [9] Best [10] ind Soft (7) glass04vs5 ecoli0346vs5 ecoli0347vs56 yeast05679vs4 ecoli067vs5 vowel0 glass016vs2 glass2 ecoli0147vs2356 led7digit02456789vs1 ecoli01vs5 glass06vs5 glass0146vs2 ecoli0147vs56 cleveland0vs4 ecoli0146vs5 ecoli4 shuttlec0vsc4 yeast1vs7 glass4 pageblocks13vs4 abalone918 glass016vs5 shuttlec2vsc4 yeast1458vs7 glass5 yeast2vs8 yeast4 yeast1289vs7 yeast5 ecoli0137vs26 yeast6 abalone19 Wins / Ties U L U L U L U T U L T T T L U A T T U T L L L U U T L U L L T A T 99.41 92.75 89.28 81.44 89.00 98.87 74.88 80.45 89.43 88.80 92.35 99.50 77.36 89.24 82.80 92.95 93.09 100.00 77.71 91.92 99.06 74.24 98.86 100.00 63.43 98.78 80.19 84.89 74.91 96.61 83.60 86.78 70.81 9/5 99.41 91.80 90.76 80.93 90.25 98.26 75.33 79.51 88.72 90.80 91.14 99.50 78.78 90.96 88.40 91.60 92.15 100.00 77.82 90.09 99.81 72.24 98.86 100.00 62.38 98.78 78.67 84.49 76.42 97.40 83.60 88.08 69.91 7/5 95.75 89.84 90.89 79.72 87.67 96.77 77.91 82.26 88.15 87.07 87.37 99.82 80.11 88.23 94.92 87.57 85.47 99.66 76.59 92.82 96.44 90.15 97.07 99.67 65.62 96.66 79.18 81.56 72.18 97.53 86.53 87.45 78.11 3/0 95.26 89.75 91.01 79.63 87.32 96.53 81.35 83.93 88.30 88.71 87.26 99.57 81.26 87.50 95.58 88.22 85.36 99.66 75.39 92.70 96.79 90.51 95.70 99.64 66.89 96.81 79.09 80.88 72.09 97.59 86.66 87.25 77.47 9/0 USwitchingNED with a p-value of 5.06%, close to the limit of 5%) and for 2 methods in the Wilcoxon test (EUBQ with a p-value of 5.04%, and again USwitchingNED). To test the differences between more than two models, two typical tests are the Dunnett test for ANOVA and the Friedman test [35]. Fig. 5(a) plots the means (with the range of two standard deviations in each direction) obtained by ANOVA, and Fig. 5(b) plots the average ranks (with the range of two stardard deviations in each direction) obtained for the Friedman test. For ANOVA, Soft (7) has the best mean value. The Dunnett test provides p-values comparing Soft (7) as the control method with the benchmark methods. These p-values and the hypothesis (with 5% of significance) are shown in Table 7, and it can be seen that they only allow to establish a significant difference of Soft (7) with respect to SBAG4, Switching-I and SwitchingNED. For the Friedman ranks, Soft (7) has the best rank. The critical difference for the Bonferroni-Dunn test [35] with 5% of significance is CD = 2.1325, and therefore it is only possible to establish a significant difference of Soft (7) with respect to Switching-I and SwitchingNED, which have ranks with a difference higher than CD with respect to the rank of Soft (7). The p-values and the hypothesis (with 5% of significance) using Soft (7) as the control method are shown in Table 7. The above results permit to conclude that the proposed approach has a very competitive performance. In a pair comparison it provides the best average results and the best wins/losses figure against every benchmark method. The difference in performance is statistically significant with respect to 8 out of the 9 benchmark methods (and the other method is very close to the threshold), according to the paired T-Test, and with respect to 7 of those methods, according to the Wilcoxon test. In a multiple comparison, again the proposed method provides the best average mean for ANOVA and the best average rank for Friedman test, although the Dunnet and Bonferroni-Dunn tests limit the statistical significance of the difference with respect to several benchmark methods. 
Another interesting result is related with the imbalance ratio and the inherent difficulty of the problem. The proposed method provides the best results in the 4 most imbalanced datasets, and these datasets have a wide variety of difficulty in terms of the percentages of types of the minority samples (see Table 1): • yeast5 has most samples in the safe and boundary categories (84.10%). 42 M. Lázaro, F. Herrera and A.R. Figueiras-Vidal / Information Sciences 520 (2020) 31–45 Table 5 Comparison of the proposed method using MLP Bayesian classifiers with Nn neurons in the hidden layer with the best method for each dataset in [9] and [10]. Dataset Kernel Nn Best [9] Best [10] ind Soft (7) glass04vs5 ecoli0346vs5 ecoli0347vs56 yeast05679vs4 ecoli067vs5 vowel0 glass016vs2 glass2 ecoli0147vs2356 led7digit02456789vs1 ecoli01vs5 glass06vs5 glass0146vs2 ecoli0147vs56 cleveland0vs4 ecoli0146vs5 ecoli4 shuttlec0vsc4 yeast1vs7 glass4 pageblocks13vs4 abalone918 glass016vs5 shuttlec2vsc4 yeast1458vs7 glass5 yeast2vs8 yeast4 yeast1289vs7 yeast5 ecoli0137vs26 yeast6 abalone19 Wins / Ties L L L T L L L L T U L L T L A L U L A A L L T A L L T U U L A L T 4 2 1 1 16 30 30 4 8 2 30 6 1 30 20 6 10 6 30 12 20 1 16 8 12 8 4 50 4 2 2 1 6 99.41 92.75 89.28 81.44 89.00 98.87 74.88 80.45 89.43 88.80 92.35 99.50 77.36 89.24 82.80 92.95 93.09 100.00 77.71 91.92 99.06 74.24 98.86 100.00 63.43 98.78 80.19 84.89 74.91 96.61 83.60 86.78 70.81 8/5 99.41 91.80 90.76 80.93 90.25 98.26 75.33 79.51 88.72 90.80 91.14 99.50 78.78 90.96 88.40 91.60 92.15 100.00 77.82 90.09 99.81 72.24 98.86 100.00 62.38 98.78 78.67 84.49 76.42 97.40 83.60 88.08 69.91 6/5 98.50 89.04 90.86 79.52 88.66 99.89 77.48 83.31 88.37 87.92 88.13 100.00 80.49 87.93 92.94 85.15 85.79 99.66 75.17 92.64 99.15 90.61 94.68 99.74 66.22 95.14 78.94 82.69 70.63 97.63 86.66 87.93 78.19 1/1 98.49 89.28 91.22 80.07 89.25 99.92 77.78 84.21 87.72 87.69 88.14 100.00 81.28 88.81 94.12 85.32 85.45 99.64 77.64 92.76 99.30 90.68 94.95 99.74 66.95 94.52 79.19 83.42 70.71 97.59 87.01 88.11 78.87 12 / 1 Table 6 For each one of the 10 methods under comparison, number of wins/ties against all the methods in the 33 datasets (second column), number of wins/loses against Soft (7) in the 33 datasets (third column), T paired test against Soft (7), p-value and the hypothesis for the 5% of significance (fourth column), and Wilcoxon test against Soft (7), p-value and the hypothesis for the 5% of significance (fifth column). Method Wins/Ties vs all Wins/Losses vs Soft (7) T-Test p-value (H) Wilcoxon p-value (H) SBAG4 [9] RUS1 [9] UB4 [9] EUSBoost [10] EUBQ [9] EUBH [9] Switching-I [9] SwitchingNED [10] USwitchingNED [10] Soft (7) 3/1 2/3 1/3 1/4 1/4 2/4 1/2 1/2 3/2 14 / 0 9 / 24 11 / 22 10 / 23 11 / 22 12 / 21 13 / 20 4 / 29 7 / 26 15 / 18 0.0012 (1) 0.0106 (1) 0.0011 (1) 0.0172 (1) 0.0326 (1) 0.0172 (1) 1.45 10−6 (1) 1.65 10−5 (1) 0.0506 (0) 0.0011 (1) 0.0084 (1) 0.0014 (1) 0.0188 (1) 0.0504 (0) 0.0444 (1) 3.86 10−6 (1) 2.03 10−5 (1) 0.1891 (0) • ecoli137vs26 have most samples in the safe category, but with the remaining samples in the outlier category (28.57%). • yeast6 has the most relatively uniform distribution in the four categories, ranging from 11.43% of rare patterns to 37.14% of safe patterns. • abalone19 has all samples in the rare and outlier categories, with a higher percentage of outliers (87,5%), which makes specially difficult the classification task. The proposed method was competitive in all these diverse scenarios with the highest imbalance ratios. 
It can be seen that this variety also appears in the remaining 10 datasets where the proposed method obtains the best results (with vowel0 having 98.89% of safe patterns, and glass016vs2 having no safe patterns and most of the patterns in the rare (35.29%) and outlier (41.18%) categories, just to mention two very different situations). The difficulty of the problem given by the location of the minority patterns is related to the performance, with high p_S values of 99.92% or 97.63% in easy problems such as vowel0 or yeast5, and lower values of 77.78% and 78.87% in more difficult problems such as glass016vs2 or abalone19.

Fig. 5. Average means for ANOVA (a) and ranks for the Friedman test (b) applied to the 10 methods under comparison.

Table 7. For each one of the 9 benchmark methods, p-value and hypothesis with 5% significance for the Dunnett and Bonferroni-Dunn tests in the multiple comparison with Soft (7) as the control classifier.

Method | Dunnett test p-value (H) | Bonferroni-Dunn test p-value (H)
SBAG4 [9] | 0.0345 (1) | 0.2389 (0)
RUS1 [9] | 0.3991 (0) | 0.4403 (0)
UB4 [9] | 0.2028 (0) | 0.2285 (0)
EUSBoost [10] | 0.8366 (0) | 0.7169 (0)
EUBQ [9] | 0.9856 (0) | 0.9998 (0)
EUBH [9] | 0.6445 (0) | 0.6610 (0)
Switching-I [9] | ≈ 0 (1) | 1.69×10−7 (1)
SwitchingNED [10] | 1.12×10−12 (1) | 4.57×10−5 (1)
USwitchingNED [10] | 0.9961 (0) | 0.9992 (0)

6. Conclusions

In this paper, we propose to design new machine ensembles for solving binary imbalanced classification problems by employing Bayesian neural networks that are trained to minimize a sampled version of the Bayes risk, and that consequently offer an intrinsic resistance to imbalance effects, as ensemble learners, diversifying them in a new form that consists of applying appropriately selected, different cost policies. Several output aggregation schemes are also considered.

The results of extensive experiments with 33 benchmark databases, comparing accuracy rates with those of 9 high-performance designs under the assumption of equal importance for both types of errors, clearly show the excellent performance of the proposed algorithms: they are the absolute winners for more than one third of the databases, both with linear and with one-hidden-layer MLP learners. The best individual alternative design offers less than one half of the absolute wins plus ties when compared to the proposed design with MLPs. The statistical analysis confirms that the proposed method is competitive for imbalanced classification in very diverse problems.

It is also worth remarking that adapting the proposed designs to other classification error costs is a trivial task.

Among the research directions that this study opens, we are currently working on using deep learners and on extending the formulation to multiclass problems.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was partly supported by research Grant TEC-2015-67719-P "Macro-ADOBE" (MINECO-FEDER, EU) and the research network TIN 2015-70808-REDT "DAMA" (MINECO-FEDER, EU), for M. Lázaro and A.R. Figueiras-Vidal, as well as by research Grant TIN2017-89517-P "Smart-DaSCI" (MINECO-FEDER, EU), for F. Herrera.

Appendix.
Likelihood ratio for Bayesian aggregation of hard outputs

The likelihood ratio involved in the Bayesian aggregation rule (18) is obtained as follows. If ŷ_k^{(j)} denotes, as in (17), the binary decision of the jth individual classifier for the input pattern x_k, the input of the ensemble for pattern x_k is

x_k^{E} \equiv x_k^{E}(x_k) = \left[ \hat{y}_k^{(1)}, \hat{y}_k^{(2)}, \ldots, \hat{y}_k^{(N_c)} \right]   (23)

A likelihood ratio can be defined for this input of the ensemble. The conditional distributions of the decisions of the individual classifiers are given by

p_{\hat{Y}^{(j)}|H}\left( \hat{y}_k^{(j)} \mid -1 \right) = \begin{cases} p_{FA}^{(j)}, & \text{if } \hat{y}_k^{(j)} = +1 \\ 1 - p_{FA}^{(j)}, & \text{if } \hat{y}_k^{(j)} = -1 \end{cases}   (24)

and

p_{\hat{Y}^{(j)}|H}\left( \hat{y}_k^{(j)} \mid +1 \right) = \begin{cases} p_{D}^{(j)}, & \text{if } \hat{y}_k^{(j)} = +1 \\ 1 - p_{D}^{(j)}, & \text{if } \hat{y}_k^{(j)} = -1 \end{cases}   (25)

Using these conditional distributions, assuming conditional independence between the outputs of the N_c classifiers, and with δ[n] denoting the discrete-time delta function to provide a compact expression for (24) and (25), the likelihood ratio for x_k^{E} is given by

\Lambda\left( x_k^{E} \right) = \prod_{j=1}^{N_c} \frac{p_{\hat{Y}^{(j)}|H}\left( \hat{y}_k^{(j)} \mid +1 \right)}{p_{\hat{Y}^{(j)}|H}\left( \hat{y}_k^{(j)} \mid -1 \right)} = \prod_{j=1}^{N_c} \frac{p_{D}^{(j)}\, \delta[\hat{y}_k^{(j)} - 1] + \left(1 - p_{D}^{(j)}\right) \delta[\hat{y}_k^{(j)} + 1]}{p_{FA}^{(j)}\, \delta[\hat{y}_k^{(j)} - 1] + \left(1 - p_{FA}^{(j)}\right) \delta[\hat{y}_k^{(j)} + 1]}   (26)

References

[1] B.J. Park, S.K. Oh, W. Pedrycz, The design of polynomial function-based neural network predictors for detection of software defects, Inf. Sci. 229 (2013) 40–57.
[2] P. González, E. Álvarez, J. Díez, R. González-Quinteros, E. Nogueira, A. López-Urrutia, J.J. del Coz, Multiclass support vector machines with example dependent costs applied to plankton biomass estimation, IEEE Trans. Neural Netw. Learn. Syst. 24 (2013) 1901–1905.
[3] C. Seiffert, T.M. Khoshgoftaar, J. Van Hulse, F. Folleco, An empirical study of the classification performance of learners on imbalanced and noisy software quality data, Inf. Sci. 259 (2014) 571–595.
[4] Y. Sun, A.K.C. Wong, M.S. Kamel, Classification of imbalanced data: a review, Int. J. Pattern Recognit. Artif. Intell. 23 (2009) 687–719.
[5] V. López, A. Fernández, S. García, V. Palade, F. Herrera, An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics, Inf. Sci. 250 (2013) 113–141.
[6] P. Branco, L. Torgo, R.P. Ribeiro, A survey of predictive modeling on imbalanced domains, ACM Comput. Surv. 49 (2016) 31:1–31:50.
[7] H. He, Y. Ma (Eds.), Imbalanced Learning: Foundations, Algorithms, and Applications, IEEE Press - Wiley, 2013.
[8] M. Galar, A. Fernández, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Trans. Syst. Man Cybern. 42 (2012) 463–484.
[9] M. Galar, A. Fernández, E. Barrenechea, F. Herrera, EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling, Pattern Recognit. (2013) 3460–3471.
[10] S. González, S. García, M. Lázaro, A.R. Figueiras-Vidal, F. Herrera, Class switching according to nearest enemy distance for learning from highly imbalanced data, Pattern Recognit. 70 (2017) 12–24.
[11] L. Nanni, C. Fantozzi, N. Lazzarini, Coupling different methods for overcoming the class imbalance problem, Neurocomputing 158 (2015) 48–61.
[12] M. Lázaro, M.H. Hayes, A.R. Figueiras-Vidal, Training neural network classifiers through Bayes risk minimization applying unidimensional Parzen windows, Pattern Recognit. 77 (2018) 204–215.
[13] C. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1997.
[14] R.O. Duda, P.E. Hart, D.G. Stork, Pattern Classification, second ed., John Wiley & Sons, 2001.
[15] L. Breiman, J. Friedman, R. Olshen, C. Stone, Classification and Regression Trees, Wadsworth Inc, 1984.
[16] B. Schölkopf, C. Burges, A. Smola, Advances in Kernel Methods - Support Vector Learning, MIT Press, Cambridge, MA, 1999.
[17] L.M. Bregman, The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming, USSR Comput. Math. Math. Phys. 7 (1967) 200–217.
[18] J. Cid-Sueiro, J.I. Arribas, S. Urbán-Muñoz, A.R. Figueiras-Vidal, Cost functions to estimate a posteriori probabilities in multiclass problems, IEEE Trans. Neural Netw. 10 (1999) 645–656.
[19] A. Benitez-Buenache, L.A. Pérez, V.J. Mathews, A.R. Figueiras-Vidal, Likelihood ratio equivalence and imbalanced binary classification, Expert Syst. Appl. 130 (2019) 84–96.
[20] E. Parzen, On the estimation of a probability density function and the mode, Ann. Math. Stat. 33 (1962) 1065–1076.
[21] L.I. Kuncheva, C.J. Whitaker, Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy, Mach. Learn. 51 (2003) 181–207.
[22] L. Breiman, Bagging predictors, Mach. Learn. 24 (1996) 123–140.
[23] R.E. Schapire, Y. Freund, Boosting: Foundations and Algorithms, MIT Press, Cambridge, MA, 2012.
[24] L. Breiman, Random forests, Mach. Learn. 45 (2001) 5–32.
[25] L. Breiman, Randomizing outputs to increase prediction accuracy, Mach. Learn. 40 (2000) 229–242.
[26] H.L. Van Trees, Detection, Estimation, and Modulation Theory: Part I, John Wiley and Sons, New York, 1968.
[27] A. Alcalá-Fdez, A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, F. Herrera, KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework, J. Multiple-Valued Logic Soft Comput. 17 (2011) 255–287.
[28] K. Napierala, J. Stefanowski, Types of minority class examples and their influence on learning classifiers from imbalanced data, J. Intell. Inf. Syst. 46 (2016) 563–597.
[29] D.E. Rumelhart, G.E. Hinton, R.J. Williams, Learning representations by back-propagating errors, Nature (London) 323 (1986) 533–536.
[30] B. Widrow, M.A. Lehr, 30 years of adaptive neural networks: perceptron, Madaline and backpropagation, Proc. IEEE 78 (1990) 1415–1441.
[31] T. Fawcett, An introduction to ROC analysis, Pattern Recognit. Lett. 27 (2006) 861–874.
[32] A.P. Bradley, The use of the area under the ROC curve in the evaluation of machine learning algorithms, Pattern Recognit. 30 (1997) 1145–1159.
[33] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (2002) 321–357.
[34] S. García, F. Herrera, Evolutionary undersampling for classification with imbalanced datasets: proposals and taxonomy, Evol. Comput. 17 (2009) 275–306.
[35] J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.