Learning ELM-Tree from big data based on uncertainty reduction
Ran Wang a , Yu-Lin He b,∗ , Chi-Yin Chow a , Fang-Fang Ou b , Jian Zhang b
a Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
b Key Laboratory in Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University,
Baoding 071002, Hebei, China
Received 21 September 2013; received in revised form 22 April 2014; accepted 23 April 2014
Abstract
A challenge in big data classification is the design of highly parallelized learning algorithms. One solution to this problem is
applying parallel computation to different components of a learning model. In this paper, we first propose an extreme learning
machine tree (ELM-Tree) model based on the heuristics of uncertainty reduction. In the ELM-Tree model, information entropy
and ambiguity are used as the uncertainty measures for splitting decision tree (DT) nodes. Besides, in order to resolve the over-partitioning problem in the DT induction, ELMs are embedded as the leaf nodes when the gain ratios of all the available splits are
smaller than a given threshold. Then, we apply parallel computation to five components of the ELM-Tree model, which effectively
reduces the computational time for big data classification. Experimental studies demonstrate the effectiveness of the proposed
method.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Big data classification; Decision tree; ELM-Tree; Extreme learning machine; Uncertainty reduction
1. Introduction
With the arrival of the big data era, learning from big data has become unavoidable in many fields such as machine learning, pattern recognition, image processing, and information retrieval. A formal definition of big data has not been proposed yet; however, several descriptions can be found in the recent literature [5,10,19]. The main
difficulties in learning from big data include the following three aspects. First, it is hard to finish the computation
on a single computer within a tolerable time. Second, the high-dimensional and multi-modal features may degrade
the performance and efficiency of the learning algorithm. Finally, the transformation of learning concepts is hard
to realize due to the dynamic increase of data volume. In order to overcome these difficulties, the parallelization of
sequential classification algorithms is widely adopted. Decision tree (DT) induction [21,22], which has the advantages
of simple implementation, few parameters, and low computational load, is a promising classification algorithm for
parallelization.
* Corresponding author. Tel.: +86 185 31315747.
E-mail addresses: ranwang3-c@my.cityu.edu.hk (R. Wang), csylhe@gmail.com (Y.-L. He).
http://dx.doi.org/10.1016/j.fss.2014.04.028
0165-0114/© 2014 Elsevier B.V. All rights reserved.
The generalization capability of a DT is largely influenced by the scale of the tree. Generally, the scale of a
tree is affected by two factors, i.e., the degree of over-partitioning in the induction process, and the selection of
heuristic measure for splitting nodes. Information entropy [18,22] and ambiguity [29,34], which respectively reflect
the impurity of classes and the uncertainty of the split, are the two most widely used heuristics for splitting nodes
during the tree growth. The performances of these two heuristics may differ considerably; however, both of them are uncertainty-reduction-based methods [31,32] that can be applied to the parallelization of sequential DTs.
Several works on the parallelization of sequential DT could be found from the literature. Shafer et al. [23] first
proposed the SPRINT algorithm, which tries to remove all the memory restrictions by improving the data structures
used in DT growth. Srivastava et al. [25] later described two parallel DT models, which are respectively based on
synchronous construction approach and partitioned construction approach. Sheng et al. [24] developed a parallel DT
by splitting the input vector into four sub-vectors, and applied it to user authentication. Panda et al. [20] proposed
the PLANET method based on the MapReduce model of distributed computation. Ben-Haim and Tom-Tov [2] designed a parallel DT algorithm for classifying large streaming data by constructing histograms at the processors and
compressing the data to a fixed amount of memory. These works have proved to be effective and have exhibited good performance on big data classification. However, they neglect the over-partitioning problem, which may lead to a redundant and over-fitted tree [30], especially for big data.
In order to resolve the over-partitioning problem, a hybrid DT induction scheme, named Extreme Learning Machine Tree (ELM-Tree), is proposed in this paper. The proposed ELM-Tree is similar to the model tree given in [7]. The difference between a model tree and an ELM-Tree is that the leaf nodes of a model tree are linear regression functions, while in an ELM-Tree each leaf node is an ELM. Given a model tree, a new instance is classified by traversing the tree from the root to a leaf and determining the prediction value with the linear regression function in that leaf node. Some well-known modifications and extensions of the model tree include the functional tree (FT) [9], the Naive Bayes tree (NBTree) [16], the logistic model tree (LMT) [17], and LMT_FAM+WT [26], which improves LMT by using the first AIC minimum (FAM) method and the weight trimming (WT) strategy. Basically, our ELM-Tree is a new modification of the model tree.
All the above-mentioned methods construct decision trees with the classical top-down recursive partitioning scheme. The main difference among FT, NBTree, LMT and LMT_FAM+WT is the type of leaf node in the tree: linear discriminants or logistic regressions for FT, Naive Bayes classifiers for NBTree, and linear logistic regressions for LMT and LMT_FAM+WT. Although FT, NBTree, LMT and LMT_FAM+WT improve the classification performance of decision trees to some extent, an obvious drawback of these algorithms is the high time consumption of building them, e.g., pruning back with a bottom-up procedure in FT, computing a 5-fold cross-validation accuracy estimate to determine a node's utility in NBTree, and learning the weights of the logistic regressions iteratively with optimization algorithms in FT, LMT and LMT_FAM+WT. The high computational complexity of FT, NBTree, LMT and LMT_FAM+WT seriously limits the ability of these model trees to handle big data. Meanwhile, the complex training processes of these learning algorithms make it very difficult to implement their parallelization.
ELMs are emergent techniques for training single-hidden-layer feedforward neural networks (SLFNs) [6]. In an ELM, the input weights are randomly assigned, and the output weights are analytically determined via the pseudo-inverse of the hidden layer output matrix [13–15]. Unlike the time-consuming weight optimization in FT/LMT and the cross-validation-based utility determination in NBTree, the ELMs embedded in the ELM-Tree are extremely fast to train and therefore have great potential for learning from big data. In our ELM-Tree scheme, a threshold is given to determine whether a node should be split further or not. If the learner decides to stop splitting a node, this node becomes either a traditional leaf node or an ELM leaf node based on its class impurity. Then, by applying parallel computation to five components of the ELM-Tree, a parallel ELM-Tree model is developed for big data classification. Experimental results validate that the ELM-Tree scheme can resolve the over-partitioning problem well, and that the parallel ELM-Tree model is effective in reducing the computational load for big data classification.
The rest of this paper is organized as follows. In Section 2, a brief discussion on the two uncertainties, i.e., information entropy and ambiguity, is given. In Section 3, the ELM-Tree scheme is proposed. In Section 4, the parallel
ELM-Tree model is developed. In Section 5, experimental comparisons are conducted to show the feasibility of the
proposed method. Finally, conclusions are given in Section 6.
2. Two types of uncertainty
This section reviews the two most widely used uncertainties for splitting DT nodes, i.e., information entropy and
ambiguity.
In a DT model, each node corresponds to a probability distribution E = (p1 , . . . , pm ), where pi , i = 1, . . . , m
represents the proportion of the i-th class in the node. The uncertainty of a node is defined on E, and the learner tends
to split the node by using the attribute with the minimum average uncertainty. Such an attribute is called the expanded attribute.
2.1. Information entropy
Generally, information entropy is defined on a probability distribution, and represents the impurity of classes in a node. It is a type of statistical uncertainty that arises from the random behavior of physical systems. Given that $E = (p_1, \ldots, p_m)$ with $\sum_{i=1}^{m} p_i = 1$, information entropy is defined as
$$\mathrm{Entr}(E) = -\sum_{i=1}^{m} p_i \log_2(p_i). \tag{1}$$
Without loss of generality, we assume that $p_1$ is a single variable and $p_2, \ldots, p_{m-1}$ are $m-2$ constants with $c = p_2 + \ldots + p_{m-1}$; then $p_m = 1 - p_1 - c$, and Eq. (1) degenerates to
$$\mathrm{Entr}(E) = -p_1 \log_2(p_1) - (1 - p_1 - c)\log_2(1 - p_1 - c) + C_E, \tag{2}$$
where $C_E$ is a constant independent of the variable $p_1$. By solving
$$\frac{\partial\,\mathrm{Entr}(E)}{\partial p_1} = \log_2\frac{1 - p_1 - c}{p_1} = 0, \tag{3}$$
we get that $\mathrm{Entr}(E)$ attains its maximum at $p_1 = p_m = \frac{1-c}{2}$. Due to the symmetry of $p_1, \ldots, p_m$, we conclude that the entropy of an $m$-dimensional probability distribution attains its maximum at
$$p_1 = p_2 = \ldots = p_m = \frac{1}{m}. \tag{4}$$
2.2. Ambiguity
Ambiguity is defined on a possibility distribution, and denotes a type of non-specificity when there is a need to specify one object from a group. Since a probability distribution is a special case of a possibility distribution, we can also discuss the ambiguity in terms of the probability distribution $E = (p_1, \ldots, p_m)$. Ambiguity describes the uncertainty that arises from human reasoning and cognition, which is defined as
$$\mathrm{Ambig}(E) = \frac{1}{m}\sum_{i=1}^{m}\left(p_i^* - p_{i+1}^*\right)\ln(i), \tag{5}$$
where $(p_1^*, \ldots, p_m^*)$ is the normalization of $(p_1, \ldots, p_m)$ with $1 = p_1^* \geq p_2^* \geq \ldots \geq p_m^* \geq p_{m+1}^* = 0$. More specifically, the values of $p_1, \ldots, p_m, 0$ are normalized to the interval $[0, 1]$ by multiplying them with the factor $\frac{1}{\max\{p_1, \ldots, p_m\}}$, and are then sorted into $p_1^*, \ldots, p_m^*, p_{m+1}^*$ in descending order.
Similarly, due to the monotonicity, continuity, and symmetry [12], the maximum of $\mathrm{Ambig}(E)$ is also attained at $p_1 = p_2 = \ldots = p_m = \frac{1}{m}$.
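For concreteness, the following Python sketch (our own illustration, not code from the paper; the function names entropy and ambiguity are ours) evaluates Eqs. (1) and (5) for a class-probability distribution.

```python
import numpy as np

def entropy(p):
    """Information entropy of a probability distribution, Eq. (1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # treat 0*log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

def ambiguity(p):
    """Ambiguity of a probability distribution, Eq. (5)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    p_star = np.sort(p / p.max())[::-1]            # normalize by the maximum, sort descending
    p_star = np.append(p_star, 0.0)                # p*_{m+1} = 0
    i = np.arange(1, m + 1)
    return float(np.sum((p_star[:-1] - p_star[1:]) * np.log(i)) / m)

# Both measures peak at the uniform distribution, in agreement with Eq. (4):
print(entropy([0.5, 0.5]), ambiguity([0.5, 0.5]))  # 1.0 and (1/2)ln(2) ~ 0.347
```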
2.3. Specificity under binary cases
We discuss the two types of uncertainty for a binary classification problem. When m = 2, Eqs. (1) and (5) respectively degenerate to
$$\mathrm{Entr}(E) = -p_1 \log_2(p_1) - p_2 \log_2(p_2), \tag{6}$$
and
$$\mathrm{Ambig}(E) = \begin{cases} \frac{1}{2}\,(p_1/p_2)\ln(2), & \text{if } p_1 < p_2, \\ \frac{1}{2}\ln(2), & \text{if } p_1 = p_2, \\ \frac{1}{2}\,(p_2/p_1)\ln(2), & \text{if } p_1 > p_2, \end{cases} \tag{7}$$
where $p_1 + p_2 = 1$.

Fig. 1. Entropy and ambiguity when m = 2.
Fig. 1 depicts the curves of Eqs. (6) and (7) by taking p1 as the function variable under the condition p1 + p2 = 1. It is easy to observe that entropy and ambiguity have several common features, i.e., they are defined on [0, 1], symmetric about p1 = 0.5, strictly increasing on [0, 0.5], and strictly decreasing on [0.5, 1]. Besides, they attain their maxima at p1 = 0.5 and their minima at p1 = 0 and p1 = 1. However, entropy is a convex function, while ambiguity is a concave function. Given p1, the value of entropy is always larger than that of ambiguity. Furthermore, when p1 ∈ [0, 0.5], the increase of entropy is faster than that of ambiguity, and when p1 ∈ [0.5, 1], the decrease of entropy is slower than that of ambiguity.
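As a quick numerical illustration of this ordering (our own example, not from the paper), evaluating Eqs. (6) and (7) at $p_1 = 0.3$, $p_2 = 0.7$ gives
$$\mathrm{Entr}(E) = -0.3\log_2 0.3 - 0.7\log_2 0.7 \approx 0.881, \qquad \mathrm{Ambig}(E) = \tfrac{1}{2}\,\tfrac{0.3}{0.7}\,\ln 2 \approx 0.149,$$
so the entropy value indeed exceeds the ambiguity value at this point.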
3. The ELM-Tree approach
3.1. Extreme learning machine
Given a training set X that contains N distinct instances with n inputs and m outputs, i.e., $X = \{(\mathbf{x}_i, \mathbf{y}_i)\,|\,\mathbf{x}_i = [x_{i1}, \ldots, x_{in}]^T \in \mathbb{R}^n,\ \mathbf{y}_i = [y_{i1}, \ldots, y_{im}]^T \in \mathbb{R}^m,\ i = 1, \ldots, N\}$, the SLFNs with $\tilde{N}$ hidden nodes and activation function $g(x)$ are formulated as
$$\sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{o}_i, \quad i = 1, \ldots, N, \tag{8}$$
where $\mathbf{w}_j = [w_{j1}, \ldots, w_{jn}]^T$ is the weight vector connecting the input nodes and the $j$-th hidden node, $b_j$ is the bias of the $j$-th hidden node, $\boldsymbol{\beta}_j = [\beta_{j1}, \ldots, \beta_{jm}]^T$ is the weight vector connecting the $j$-th hidden node and the output nodes, and $\mathbf{o}_i$ is the output of $\mathbf{x}_i$ in the network.
The standard SLFNs can approximate the N training instances with zero error, i.e., there exist $\mathbf{w}_j$, $b_j$, and $\boldsymbol{\beta}_j$ such that
$$\sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{y}_i, \quad i = 1, \ldots, N. \tag{9}$$
Eq. (9) can be written in the matrix form $\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}$, where
$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_1 + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_1 + b_{\tilde{N}}) \\ g(\mathbf{w}_1 \cdot \mathbf{x}_2 + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_2 + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_2 + b_{\tilde{N}}) \\ \vdots & \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_N + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}} \tag{10}$$
is called the hidden layer output matrix, and $\boldsymbol{\beta}$ and $\mathbf{Y}$ are respectively represented by
$$\boldsymbol{\beta} = \begin{bmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{\tilde{N}1} & \beta_{\tilde{N}2} & \cdots & \beta_{\tilde{N}m} \end{bmatrix}_{\tilde{N} \times m}, \tag{11}$$
$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{Nm} \end{bmatrix}_{N \times m}. \tag{12}$$
Most traditional training methods for neural networks are gradient descent algorithms. They try to minimize the following cost function
$$CE = \sum_{i=1}^{N} \left\| \sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) - \mathbf{y}_i \right\|^2, \tag{13}$$
and the parameter vector $\boldsymbol{\alpha}_j = (\mathbf{w}_j, b_j, \boldsymbol{\beta}_j)$ is adjusted iteratively by
$$\boldsymbol{\alpha}_{k+1} = \boldsymbol{\alpha}_k - \eta\, \frac{\partial CE(\boldsymbol{\alpha}_k)}{\partial \boldsymbol{\alpha}_k}, \tag{14}$$
where $k$ is the iteration index and $\eta$ is the learning rate.
Although gradient descent algorithms [36] have exhibited good performance in different learning domains, it is still hard to overcome their high complexity [1,3]. Besides, due to the iterative learning mechanism, they may also suffer from local optima and over-fitting problems. Recently, the ELM was proposed by Huang et al. [15] for training SLFNs. It has been theoretically proved that, in order to approximate the training instances, the input weights and hidden biases can be randomly assigned if the activation function is infinitely differentiable [13,14]. Motivated by these facts, the ELM treats the network as a linear system: it randomly chooses the input weights and analytically determines the output weights by the pseudo-inverse of the hidden layer output matrix. The details of the ELM are described in Algorithm 1.
Algorithm 1: Extreme Learning Machine – ELM.
Input: Training set {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R^m, i = 1, ..., N}; activation function g(x); the number of hidden nodes Ñ.
Output: Input weights w_j, input biases b_j, and output weight β.
1. Randomly assign the input weights w_j and biases b_j, where j = 1, ..., Ñ;
2. Calculate the hidden layer output matrix H;
3. Calculate the output weight β = H†Y, where H† is the Moore–Penrose generalized inverse of the matrix H.
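A minimal Python sketch of Algorithm 1 is shown below. It is our own illustration (sigmoid activation, NumPy pseudo-inverse) rather than the authors' implementation, and the function names are ours. Here Y is the N × m target matrix (e.g., one-hot class indicators), matching Eq. (12).

```python
import numpy as np

def elm_train(X, Y, n_hidden, rng=None):
    """Algorithm 1: random input weights/biases, analytic output weights beta = H† Y."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((n_hidden, X.shape[1]))    # input weights w_j
    b = rng.standard_normal(n_hidden)                  # hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))           # hidden layer output matrix, Eq. (10)
    beta = np.linalg.pinv(H) @ Y                       # Moore-Penrose generalized inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network outputs o_i of Eq. (8); for classification, take argmax over the output columns."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```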
3.2. ELM-Tree with uncertainty reduction
In this section, a new classification model named ELM-Tree is developed to handle the over-partitioning problem.
The key difference between ELM-Tree and traditional DT lies in the determination of the leaf nodes. In the ELM-Tree
model, ELMs are embedded as the leaf nodes when certain conditions are met. Thus, there are two key steps in the
construction of ELM-Tree, i.e., splitting a non-leaf node and determining an ELM leaf node.
Given a node X = {x_1, ..., x_N} with N instances from m classes, each instance is described by n continuous attributes, where the i-th instance is denoted by x_i = {x_{ij}}_{j=1}^{n} and A_j represents the j-th attribute. We first discuss how to split the non-leaf node X into two child nodes X_1 and X_2. This problem can be equivalently transformed into selecting the optimal attribute and its cut-point to divide X into two sets, where the selected cut-point is the one that maximizes the information gain. Algorithm 2 depicts the procedure of splitting X.
Algorithm 2: Split a DT Node.
Input: Node $X = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i = \{x_{i1}, \ldots, x_{in}\} \in \mathbb{R}^n$ is the $i$-th instance, $y_i \in \{1, \ldots, m\}$ is the class label of $\mathbf{x}_i$, $m$ is the number of classes, and $A_j$, $j = 1, \ldots, n$, is the $j$-th attribute.
Output: Two child nodes $X_1$ and $X_2$.
1. For each attribute $A_j$, sort its values $x_{1j}, \ldots, x_{Nj}$ in ascending order; the sorted values are recorded as $x_{1j}^* \leq \ldots \leq x_{Nj}^*$;
2. Get all the available cut-points of each $A_j$, i.e., $cut_{ij} = \frac{x_{ij}^* + x_{i+1,j}^*}{2}$, $i = 1, 2, \ldots, N-1$;
3. Calculate the information gain of each $A_j$ and its cut-point $cut_{ij}$:
$$\mathrm{Gain}(X, cut_{ij}) = \mathrm{Info}(X) - I(cut_{ij}) = \mathrm{Info}(X) - \left[ \frac{|X_{ij1}|}{|X|}\mathrm{Info}(X_{ij1}) + \frac{|X_{ij2}|}{|X|}\mathrm{Info}(X_{ij2}) \right], \tag{15}$$
where $X_{ij1} = \{\mathbf{x}_i \in X \mid x_{ij} \leq cut_{ij}\}$ and $X_{ij2} = \{\mathbf{x}_i \in X \mid x_{ij} > cut_{ij}\}$ are the two subsets of $X$ divided by $cut_{ij}$, $|X|$ is the size of $X$, and $I(cut_{ij})$ is the expected information of $cut_{ij}$;
4. For each attribute $A_j$, select its optimal cut-point $cut_{i(j)j}$, where
$$i(j) = \mathop{\mathrm{argmax}}_{i=1,\ldots,N-1} \big\{\mathrm{Gain}(X, cut_{ij})\big\}; \tag{16}$$
5. Calculate the split information of the optimal cut-point of each $A_j$:
$$\mathrm{Split}(X, cut_{i(j)j}) = -\left[ \frac{|X_{i(j)j1}|}{|X|}\log_2\frac{|X_{i(j)j1}|}{|X|} + \frac{|X_{i(j)j2}|}{|X|}\log_2\frac{|X_{i(j)j2}|}{|X|} \right]; \tag{17}$$
6. Calculate the gain ratio of $A_j$:
$$\mathrm{Ratio}(X, A_j) = \frac{\mathrm{Gain}(X, cut_{i(j)j})}{\mathrm{Split}(X, cut_{i(j)j})}; \tag{18}$$
7. Select the optimal attribute $A_{j^*}$, where
$$j^* = \mathop{\mathrm{argmax}}_{j=1,\ldots,n} \big\{\mathrm{Ratio}(X, A_j)\big\}; \tag{19}$$
8. Split $X$ into $X_1$ and $X_2$ by $A_{j^*}$ and its optimal cut-point $cut_{i(j^*)j^*}$, where $X_1 = \{\mathbf{x}_i \in X \mid x_{ij^*} \leq cut_{i(j^*)j^*}\}$ and $X_2 = \{\mathbf{x}_i \in X \mid x_{ij^*} > cut_{i(j^*)j^*}\}$.
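The following Python sketch (our own illustration of Algorithm 2 with the entropy-based measure; the helper names info and best_split are ours) selects, for each attribute, the cut-point with the largest information gain and then picks the attribute with the largest gain ratio.

```python
import numpy as np

def info(y):
    """Entropy-based Info(.) of the class labels in a (sub)set."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(X, y):
    """Return (attribute index, cut-point, gain ratio) chosen as in steps 1-7 of Algorithm 2."""
    n, d = X.shape
    base = info(y)
    best_attr, best_cut, best_ratio = None, None, -np.inf
    for j in range(d):
        vals = np.sort(np.unique(X[:, j]))
        cuts = (vals[:-1] + vals[1:]) / 2.0                     # candidate cut-points of A_j
        gain_j, cut_j, frac_j = -np.inf, None, None
        for cut in cuts:                                        # step 4: maximize Gain, Eq. (16)
            left = X[:, j] <= cut
            nl, nr = int(left.sum()), n - int(left.sum())
            expected = (nl / n) * info(y[left]) + (nr / n) * info(y[~left])   # I(cut_ij)
            gain = base - expected                              # Eq. (15)
            if gain > gain_j:
                gain_j, cut_j, frac_j = gain, cut, (nl / n, nr / n)
        if cut_j is None:
            continue                                            # constant attribute, no cut-point
        split_info = -sum(f * np.log2(f) for f in frac_j if f > 0)            # Eq. (17)
        ratio = gain_j / split_info if split_info > 0 else 0.0                # Eq. (18)
        if ratio > best_ratio:                                  # step 7: maximize Ratio, Eq. (19)
            best_attr, best_cut, best_ratio = j, cut_j, ratio
    return best_attr, best_cut, best_ratio
```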
The term $\mathrm{Info}(X)$ in Eq. (15) is the residual uncertainty of the class information in $X$, which measures its class impurity and the amount of information it contains. Similarly, $\mathrm{Info}(X_{ij1})$ and $\mathrm{Info}(X_{ij2})$ reflect the amounts of information in $X_{ij1}$ and $X_{ij2}$. Thus, Eq. (15) represents the information gain obtained when splitting $X$ into $X_{ij1}$ and $X_{ij2}$ by $cut_{ij}$. After determining the optimal cut-point $cut_{i(j)j}$ for attribute $A_j$, Eq. (18) is used to calculate the information gain ratio of $A_j$. In fact, the gain ratio is a compensation for the information gain, which reduces the bias towards high-branching attributes by considering the intrinsic information [33] in Eq. (17), i.e., the split information of the optimal cut-point $cut_{i(j)j}$ of $A_j$. Thus, the gain ratio is the normalization of the information gain by the intrinsic information.
For the given node $X$, $\mathrm{Info}(X)$ in Eq. (15) is a constant that can be neglected; thus, the information gain of $cut_{ij}$ is determined only by $\mathrm{Info}(X_{ij1})$ and $\mathrm{Info}(X_{ij2})$. Taking $\mathrm{Info}(X_{ij1})$ as an example, we now apply the two types of uncertainty, i.e., information entropy and ambiguity, to its computation:
1. The entropy-based heuristic measure, denoted by $\mathrm{Info}_E(X_{ij1})$, is calculated as
$$\mathrm{Info}_E(X_{ij1}) = -\sum_{l=1}^{m} p_{ijl}^{(1)} \log_2 p_{ijl}^{(1)}, \tag{20}$$
where $p_{ijl}^{(1)}$ is the probability of the $l$-th class in $X_{ij1}$.
2. The ambiguity-based heuristic measure, denoted by $\mathrm{Info}_A(X_{ij1})$, is calculated as
$$\mathrm{Info}_A(X_{ij1}) = \frac{1}{m}\sum_{l=1}^{m}\left(p_{ijl}^{(1*)} - p_{ij,l+1}^{(1*)}\right)\ln(l), \tag{21}$$
where $(p_{ij1}^{(1*)}, p_{ij2}^{(1*)}, \ldots, p_{ijm}^{(1*)})$ is the normalization of $(p_{ij1}^{(1)}, p_{ij2}^{(1)}, \ldots, p_{ijm}^{(1)})$ with $1 = p_{ij1}^{(1*)} \geq p_{ij2}^{(1*)} \geq \ldots \geq p_{ijm}^{(1*)} \geq p_{ij,m+1}^{(1*)} = 0$.

According to the analysis in Section 2, we have
$$0 \leq \mathrm{Info}_E(X_{ij1}) \leq \log_2(m), \qquad 0 \leq \mathrm{Info}_A(X_{ij1}) \leq \ln(m). \tag{22}$$
By applying (22) to (15), we have
$$0 \leq I_E(cut_{ij}) \leq \log_2(m), \qquad 0 \leq I_A(cut_{ij}) \leq \ln(m). \tag{23}$$
Obviously, log2 (m) and ln(m) are the maxima of IE (cutij ) and IA (cutij ), which can generate the most uncertain
partitions of X.
With the above-mentioned heuristic measures, Algorithm 2 is iteratively applied to the generated child nodes, and a DT is finally constructed. However, without a proper stopping criterion, this procedure continues until no node can be split further, which may lead to an over-partitioned tree. In order to resolve this over-partitioning problem, we propose the ELM-Tree model. If the largest class probability in $X$, i.e., $\max\{p_i\}_{i=1}^{m}$, is larger than a given threshold $\theta$, or the size of $X$, i.e., $|X|$, is smaller than a given number $N^*$, then $X$ is considered a leaf node, and its decision label is determined as $\mathrm{argmax}_i\{p_i\}_{i=1}^{m}$. Otherwise, further partitioning is performed on $X$. However, if the expected information $I(cut_{ij})$ of every cut-point, $i = 1, \ldots, N-1$, $j = 1, \ldots, n$, is smaller than a given uncertainty coefficient $\varepsilon$, an ELM is introduced to classify the instances in $X$. The induction of the ELM-Tree is described in Algorithm 3. Obviously, there are two important parameters in Algorithm 3, i.e., the truth level threshold $\theta \in [0, 1]$ and the uncertainty coefficient $\varepsilon \in [0, 1]$. For implementation, we further modify the uncertainty coefficient to $\varepsilon \times \log_2(m)$ for the entropy-based ELM-Tree and $\varepsilon \times \ln(m)$ for the ambiguity-based ELM-Tree, where $m$ is the number of classes in $X$.
Algorithm 3: Construct an ELM-Tree.
Input: A training set with N instances, n attributes and m classes; truth level threshold θ ∈ [0, 1]; uncertainty coefficient ε ∈ [0, 1]; and integer parameter N* ∈ {1, ..., N}.
Output: An ELM-Tree.
1.  Ω is initialized as an empty set;
2.  Consider the original training set as the root node, and add it to Ω;
3.  while Ω is not empty do
4.      Select one node from Ω, denoted by X;
5.      if max{p_i}_{i=1}^m > θ or |X| < N* then
6.          Assign X a label, i.e., argmax_i {p_i}_{i=1}^m, and remove it from Ω;
7.      else
8.          if I(cut_ij) < ε for i = 1, ..., N − 1, j = 1, ..., n then
9.              Train an ELM for X and remove it from Ω;
10.         else
11.             Split X into two child nodes X_1 and X_2 by Algorithm 2;
12.             Remove X from Ω, and add X_1 and X_2 to Ω;
13.         end
14.     end
15. end
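To make the induction loop concrete, the sketch below gives a recursive Python rendering of Algorithm 3 with the entropy-based measure and the implementation-level threshold ε × log2(m). It is our own illustration, reusing the info, best_split and elm_train helpers sketched above, and is not the authors' code.

```python
import numpy as np

def build_elm_tree(X, y, theta=0.90, eps=0.05, n_min=5, n_hidden=20):
    """Algorithm 3: grow an ELM-Tree, returning a nested dict of nodes."""
    classes, counts = np.unique(y, return_counts=True)
    if counts.max() / counts.sum() > theta or len(y) < n_min:        # lines 5-6: ordinary leaf
        return {"type": "leaf", "label": classes[counts.argmax()]}
    # lines 8-9: expected information I(cut_ij) of every candidate cut-point
    n, infos = len(y), []
    for j in range(X.shape[1]):
        vals = np.sort(np.unique(X[:, j]))
        for cut in (vals[:-1] + vals[1:]) / 2.0:
            left = X[:, j] <= cut
            infos.append((left.sum() / n) * info(y[left])
                         + ((~left).sum() / n) * info(y[~left]))
    if not infos or max(infos) < eps * np.log2(len(classes)):        # all splits too uncertain
        Y = np.eye(len(classes))[np.searchsorted(classes, y)]        # one-hot targets
        return {"type": "elm_leaf", "model": elm_train(X, Y, n_hidden), "classes": classes}
    j, cut, _ = best_split(X, y)                                     # lines 11-12: split by Algorithm 2
    left = X[:, j] <= cut
    return {"type": "internal", "attr": j, "cut": cut,
            "left":  build_elm_tree(X[left],  y[left],  theta, eps, n_min, n_hidden),
            "right": build_elm_tree(X[~left], y[~left], theta, eps, n_min, n_hidden)}
```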
Fig. 2 gives an illustrative structure of ELM-Tree. There are two types of leaf nodes in this tree, i.e., ELM leaf
nodes and non-ELM leaf nodes. Obviously, LN1 , LN2 and LN4 are non-ELM leaf nodes with a certain decision label,
while LN3 , LN5 , LN6 , LN7 and LN8 are ELM leaf nodes which classify the instances with an ELM classifier.
Fig. 2. The structure of ELM-Tree.
Fig. 3. Different partitions to a sample set by using C4.5 and ELM-Tree (positive training sample; • negative training sample; positive testing sample).
We now explain the main difference between the ELM-Tree and the C4.5 tree by observing Fig. 3. Consider a dataset X that contains 6 positive instances and 3 negative instances. Fig. 3(a) demonstrates the partition of C4.5 on X. This partition includes 5 leaf nodes, and the training accuracy is 1. However, a wrong prediction is made on the new testing instance. Fig. 3(b) demonstrates a partition of X by the ELM-Tree. In this partition, an ELM is trained on X and a curve is obtained as the classification boundary, which correctly predicts the new instance. Besides, there are 5 leaf nodes in C4.5, while the ELM-Tree has just one leaf node. In addition, a real numerical example on selected instances from the Auto Mpg dataset (details in Table 2) is given to show the process of generating an ELM leaf node. In this example, θ is set to 0.90 and ε is set to 0.08. There are 27 instances in X, as listed in Table 1, and the largest class probability of X is 0.667, which is smaller than θ. It can also be seen from Table 1 that the entropy-based expected information I(cut_ij) of every cut-point is smaller than ε × log2(m). Note that in Table 1, the repetitive values of I(cut_ij) are deleted. Thus, X can be regarded as an ELM leaf node and an ELM classifier is trained.
3.3. Time complexity
We now analyze the time complexity of the ELM-Tree. For a training dataset with N instances and n continuous attributes, the training complexities of ELM and the C4.5 DT are O(N) [15] and O(nN log2 N) [22], respectively. Since it costs O(nN log2 N) to partition the training dataset iteratively and O(nN) to train the determined ELM leaf nodes, we can conclude that the complexity of the ELM-Tree model is O(nN log2 N) + O(nN) = O(nN log2 N).
4. Parallelization of ELM-Tree
In Algorithm 3, the computations of information gain and gain ratio for different cut-points are all independent,
which could be finished in parallel. Besides, the main task of ELM is to calculate the Moore–Penrose generalized
inverse of the hidden layer output matrix H, i.e.,
$$\mathbf{H}^{\dagger} = \left(\mathbf{H}^T \mathbf{H}\right)^{-1} \mathbf{H}^T, \tag{24}$$
where $\mathbf{H}^T$ is the transpose of $\mathbf{H}$. Thus,
$$\boldsymbol{\beta} = \left(\mathbf{H}^T \mathbf{H}\right)^{-1} \mathbf{H}^T \mathbf{Y}. \tag{25}$$

Table 1
Example for generating an ELM leaf node in ELM-Tree.

X     A1      A2      A3      A4      A5      Class
x1    0.7101  0.0775  0.1848  0.0856  0.3810  1
x2    0.4521  0.1395  0.1848  0.2376  0.5060  1
x3    0.3723  0.1395  0.2174  0.1721  0.3571  1
x4    0.5053  0.0775  0.1848  0.1562  0.4167  1
x5    0.4787  0.1137  0.2283  0.2912  0.6310  1
x6    0.4255  0.0762  0.1848  0.1454  0.5357  1
x7    0.3191  0.1395  0.2174  0.1738  0.5060  1
x8    0.6516  0.0775  0.2011  0.1310  0.4702  1
x9    0.5585  0.1111  0.1848  0.1537  0.4048  1
x10   0.4894  0.1370  0.1848  0.2997  0.4167  1
x11   0.5851  0.1137  0.2120  0.2728  0.4881  1
x12   0.3457  0.1395  0.2174  0.2217  0.4762  1
x13   0.6649  0.1137  0.2283  0.2217  0.5952  1
x14   0.5053  0.1344  0.1793  0.2869  0.6310  1
x15   0.2660  0.1395  0.2120  0.1976  0.6250  1
x16   0.5851  0.1318  0.1957  0.3139  0.6786  1
x17   0.5053  0.1137  0.2283  0.2813  0.6905  1
x18   0.4521  0.0775  0.1793  0.1820  0.5774  1
x19   0.3191  0.1344  0.2228  0.3873  0.6845  2
x20   0.4787  0.0853  0.2011  0.1670  0.4345  2
x21   0.4255  0.1240  0.1902  0.1721  0.5298  2
x22   0.6915  0.1395  0.2283  0.2515  0.4226  2
x23   0.3723  0.1344  0.2283  0.3811  0.5357  2
x24   0.5319  0.0775  0.2011  0.1718  0.5060  2
x25   0.3457  0.1370  0.1630  0.2546  0.5952  2
x26   0.4255  0.1085  0.2228  0.3003  0.5655  2
x27   0.5053  0.1008  0.2174  0.2413  0.4464  2

max{p1, p2} = 0.667 < θ. Entropy-based expected information I(cut_ij) of the cut-points of A1–A5 (repetitive values deleted); every value satisfies I(cut_ij) < ε × log2(m):
0.0058, 0.0008, 0.0026, 0.0136, 0.0047, 0.0084, 0.0048, 0.0025, 0.0119, 0.0073, 0.0011, 0.0018, 0.0058, 0.0069,
0.0088, 0.0008, 0.0007, 0.0061, 0.0026, 0.0025, 0.0006, 0.0031, 0.0088, 0.0199, 0.0294, 0.0116, 0.0176, 0.0026,
0.0107, 0.0203, 0.0010, 0.0048, 0.0099, 0.0152, 0.0208, 0.0267, 0.0061, 0.0006, 0.0005, 0.0018, 0.0040, 0.0109,
0.0158, 0.0071, 0.0018, 0.0005, 0.0022, 0.0054, 0.0108, 0.0201, 0.0092, 0.0274, 0.0132, 0.0052, 0.0106, 0.0163,
0.0287, 0.0066, 0.0006, 0.0005, 0.0005, 0.0019, 0.0019, 0.0005, 0.0049, 0.0023, 0.0066, 0.0033, 0.0016, 0.0052.
For big data problems, the matrices H, H^T H and H^T Y are too large, which makes their computation infeasible. In this case, their calculations can also be carried out in parallel.
Assume there are M + 1 computers in a parallel system: one computer is the host node and the other M computers are computing nodes. The host node is in charge of assigning tasks and collecting results [35]. Basically, parallel computation can be applied to five components of the ELM-Tree, which reduces its training complexity to $O(\frac{nN\log_2 N}{M})$. The five components are introduced as follows:
1. Calculation of information gain. We simply denote the information gain of attribute A_j and its i-th cut-point as Gain_ij, where i = 1, ..., N − 1 and j = 1, ..., n. Since M ≪ N, the host node first assigns the calculations of Gain_{1j}, Gain_{2j}, ..., Gain_{Mj} to the M computing nodes. Once any of the M calculations is finished, the host node stores the result and assigns the remaining calculations Gain_{M+1,j}, Gain_{M+2,j}, ..., Gain_{N−1,j} in order to the free computing nodes. When the computation on A_j is finished, the computation on A_{j+1} begins.
2. Calculation of gain ratio. We simply denote the gain ratio of attribute A_j as Ratio_j, where j = 1, ..., n. When n ≤ M, the calculations of Ratio_1, ..., Ratio_n are directly finished in parallel. When n > M, a procedure similar to that used for the information gain is applied.
3. Calculation of H_{N×Ñ}. For big data classification, N is a very large number, so the N × Ñ matrix can be decomposed into N disjoint 1 × Ñ sub-matrices. The calculations of these N sub-matrices are assigned to the M computing nodes in turn. For the i-th sub-matrix, the computation of g(w_j · x_i + b_j), j = 1, ..., Ñ, is carried out Ñ times. Once a computing node is free, the host node assigns a new sub-matrix to it until all the calculations are finished.
4. Calculation of H^T H_{Ñ×Ñ}. We denote $\mathbf{H}^T\mathbf{H}_{\tilde{N}\times\tilde{N}}$ as $[hh_{ij}]_{\tilde{N}\times\tilde{N}}$, where
$$hh_{ij} = \sum_{k=1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j), \quad i, j = 1, \ldots, \tilde{N}. \tag{26}$$
The host node splits the computation of $hh_{ij}$ into the summation of M disjoint units, i.e.,
$$hh_{ij} = hh_{ij}^{(1)} + hh_{ij}^{(2)} + \ldots + hh_{ij}^{(M)}, \tag{27}$$
where
$$hh_{ij}^{(1)} = \sum_{k=1}^{\frac{N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j), \quad
hh_{ij}^{(2)} = \sum_{k=\frac{N}{M}+1}^{\frac{2N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j), \quad \ldots, \quad
hh_{ij}^{(M)} = \sum_{k=(M-1)\frac{N}{M}+1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j). \tag{28}$$
Then, the calculations of $hh_{ij}^{(1)}, hh_{ij}^{(2)}, \ldots, hh_{ij}^{(M)}$ are finished on the M computing nodes respectively.
5. Calculation of H^T Y_{Ñ×m}. We denote $\mathbf{H}^T\mathbf{Y}_{\tilde{N}\times m}$ as $[hy_{ij}]_{\tilde{N}\times m}$, where
$$hy_{ij} = \sum_{k=1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}, \quad i = 1, \ldots, \tilde{N},\ j = 1, \ldots, m. \tag{29}$$
The host node splits the computation of $hy_{ij}$ into the summation of M disjoint units, i.e.,
$$hy_{ij} = hy_{ij}^{(1)} + hy_{ij}^{(2)} + \ldots + hy_{ij}^{(M)}, \tag{30}$$
where
$$hy_{ij}^{(1)} = \sum_{k=1}^{\frac{N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}, \quad
hy_{ij}^{(2)} = \sum_{k=\frac{N}{M}+1}^{\frac{2N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}, \quad \ldots, \quad
hy_{ij}^{(M)} = \sum_{k=(M-1)\frac{N}{M}+1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}. \tag{31}$$
Then, the calculations of $hy_{ij}^{(1)}, hy_{ij}^{(2)}, \ldots, hy_{ij}^{(M)}$ are finished on the M computing nodes respectively (a small sketch of this partial-sum scheme is given after Table 2 below).

Table 2
The specification of 15 small benchmark datasets.

No.  Dataset               # Attribute  # Class  Class distribution                     # Instance
1    Auto Mpg              5            3        245/79/68                              392
2    Breast Cancer         10           2        458/241                                699
3    Breast Cancer W-D     30           2        357/212                                569
4    Breast Cancer W-P     33           2        151/47                                 198
5    Credit Approval       15           2        383/307                                690
6    Glass Identification  9            7        76/70/29/17/13/9/0                     214
7    Ionosphere            33           2        225/126                                351
8    New Thyroid Gland     5            3        150/35/30                              215
9    Parkinsons            22           2        147/48                                 195
10   Pima Indian Diabetes  8            2        500/268                                768
11   Sonar                 60           2        111/97                                 208
12   SPECTF Heart          44           2        212/55                                 267
13   Vehicle Silhouettes   18           4        218/217/212/199                        846
14   Wine                  13           3        91/59/48                               178
15   Yeast                 8            10       463/429/244/163/51/44/35/30/20/5       1484
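The partial-sum scheme of Eqs. (26)–(31) can be illustrated with the following Python sketch, which splits the instances into M chunks and accumulates the per-chunk contributions to H^T H and H^T Y before solving for β as in Eq. (25). This is our own single-machine multiprocessing illustration; the paper itself distributes the same sums over M computing nodes coordinated by a host node via MPI.

```python
import numpy as np
from multiprocessing import Pool

def _partial_sums(args):
    """One computing node: partial H^T H and H^T Y over a chunk of instances, Eqs. (28) and (31)."""
    X_chunk, Y_chunk, W, b = args
    H = 1.0 / (1.0 + np.exp(-(X_chunk @ W.T + b)))   # this chunk's rows of the hidden layer output matrix
    return H.T @ H, H.T @ Y_chunk

def parallel_output_weights(X, Y, W, b, M=4):
    """Accumulate the M partial sums and solve (H^T H) beta = H^T Y, cf. Eq. (25)."""
    chunks = list(zip(np.array_split(X, M), np.array_split(Y, M), [W] * M, [b] * M))
    with Pool(M) as pool:
        parts = pool.map(_partial_sums, chunks)
    HtH = sum(p[0] for p in parts)                   # Eq. (27)
    HtY = sum(p[1] for p in parts)                   # Eq. (30)
    return np.linalg.solve(HtH, HtY)                 # assumes H^T H is invertible
```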
5. Experimental comparisons
In this section, we first validate the feasibility of the sequential ELM-Trees on 15 small datasets [27] (small datasets are used because of the running time of the sequential algorithms) and then demonstrate the effectiveness of the parallel ELM-Trees on 4 big datasets. The main objective of testing the performance of the sequential ELM-Trees on small datasets is to check the impact of the learning parameters (i.e., θ and ε). For each UCI benchmark dataset listed in Tables 2 and 10, the nominal attributes are discarded, and the missing values are replaced by the unsupervised filter ReplaceMissingValues in Weka 3.6 [33]. In addition, the sequential algorithms are implemented in Matlab and the parallel algorithms are implemented in the C language. The experiments are conducted on a PC with the Windows XP operating system, a Pentium 4 2.8 GHz CPU, and 2 GB of RAM.
5.1. Impact of uncertainty parameters θ and ε on ELM-Tree
We first investigate the effect of the parameter ε on the training time, testing time, training accuracy, testing accuracy and scale of the tree. The value of θ in Algorithm 3 is fixed at 0.90, N* is set to 5, and ε ranges from 0.01 to 0.15 with a step of 0.01. For each ε, we conduct 10-fold cross-validation 10 times on a representative dataset, i.e., Breast Cancer W-P. Besides, the number of hidden nodes in ELM1 is set to 20.
1 http://www.ntu.edu.sg/home/egbhuang/elm-codes.html.
Fig. 4. The impact of uncertainty coefficient on performances of entropy-based ELM-Tree in Breast Cancer W-P dataset (θ = 0.90).
Fig. 5. The impact of uncertainty coefficient on performances of ambiguity-based ELM-Tree in Breast Cancer W-P dataset (θ = 0.90).
Fig. 4 and Fig. 5 depict the effects of ε on the performances of the entropy-based ELM-Tree and the ambiguity-based ELM-Tree respectively. Three observations can be made. First, as shown in Fig. 4(a) and Fig. 5(a), the training time gradually decreases as ε increases, while the change in testing time is not obvious. Second, as shown in Fig. 4(b) and Fig. 5(b), the training accuracy gradually decreases as ε increases, while the testing accuracy first increases and then fluctuates. Third, from Fig. 4(c) and Fig. 5(c), we can see that the numbers of all types of nodes decrease obviously as ε increases.
It is clear that the larger ε is, the earlier the induction stops, so fewer nodes (including ELM leaf nodes) are generated; thus, the increase of ε leads to a reduction of the tree size. For example, assume there are three candidate attributes A1, A2 and A3 in the current node, and each attribute has m candidate cut-points: cut_11, ..., cut_1m for A1; cut_21, ..., cut_2m for A2; and cut_31, ..., cut_3m for A3. The information gains of these candidate cuts are Gain_11, ..., Gain_1m; Gain_21, ..., Gain_2m; and Gain_31, ..., Gain_3m, respectively. If there exists any Gain_ij > ε, further partitioning of this node is needed, which increases the number of nodes. However, if a larger ε is given such that Gain_ij < ε for all i = 1, 2, 3 and j = 1, ..., m, an ELM will be trained, and the number of nodes will be smaller. Besides, the training accuracy and testing accuracy of the classical C4.5 algorithm on the Breast Cancer W-P dataset are 0.981 and 0.631 respectively, while the average training accuracy and testing accuracy of the ELM-Tree are 0.834 and 0.778 respectively, which demonstrates that the ELM-Tree model significantly reduces the degree of over-fitting.
Then, we fix the value of ε at 0.05 and investigate the effect of θ. Fig. 6 and Fig. 7 depict the effects of θ on the performances of the entropy-based ELM-Tree and the ambiguity-based ELM-Tree respectively. From these figures, we can see that θ has no obvious influence on the accuracy, time, or tree scale. The classification truth levels of some expanded nodes in the ELM-Tree are usually smaller than θ; that is to say, a smaller θ cannot increase the probability of generating a non-ELM leaf node. The termination of the induction process mainly depends on the number of instances falling into the expanded nodes or on the uncertainty coefficient ε.
Overall, the uncertainty coefficient ε has a great influence on the performance of the ELM-Tree, while the impact of θ is not obvious.
Fig. 6. The impact of truth level on performances of entropy-based ELM-Tree in Breast Cancer W-P dataset (ε = 0.05).
Fig. 7. The impact of truth level on performances of ambiguity-based ELM-Tree in Breast Cancer W-P dataset (ε = 0.05).
5.2. Comparisons among ELM-Trees, C4.5, and ELM
In this section, we compare the performances of the entropy-based ELM-Tree, the ambiguity-based ELM-Tree, C4.5,2 and ELM. We conduct 10-fold cross-validation 10 times and observe the average results. Three pairs of (θ, ε) are tested, i.e., (0.90, 0.01), (0.90, 0.05) and (0.90, 0.1).
Table 3 and Table 4 respectively report the training accuracy and testing accuracy of the compared methods. Several observations can be made.
1. First, C4.5 shows a high degree of over-fitting (high training accuracy and low testing accuracy), which is mainly caused by the over-partitioning of tree nodes, while ELM does not suffer from this problem. Thus, incorporating ELM into C4.5 can greatly reduce the degree of over-fitting of C4.5.
2. Second, the training accuracy of the entropy-based ELM-Tree is higher than that of the ambiguity-based ELM-Tree, while its testing accuracy is significantly lower. The reason for this observation can be discovered by investigating the number of nodes induced by the trees, which is reported in Table 7. Taking (θ, ε) = (0.90, 0.01) as an example, it can be calculated from Table 7 that the ratios of ELM leaf nodes to all the leaf nodes in the ambiguity-based ELM-Tree are (0.56, 0.78, 0.42, 0.57, 0.62, 0.40, 0.67, 0.71, 0.63, 0.53, 0.43, 0.73, 0.57, 0.70, 0.42) on these 15 datasets, while the corresponding ratios of the entropy-based ELM-Tree are (0.31, 0.47, 0.56, 0.41, 0.32, 0.24, 0.38, 0.44, 0.50, 0.31, 0.56, 0.52, 0.28, 0.57, 0.17), which are much lower. That is to say, the ambiguity-based ELM-Tree sacrifices a certain degree of training accuracy, but the testing instances wrongly classified by the non-ELM leaf nodes of the entropy-based ELM-Tree may be correctly classified by the ELM leaf nodes of the ambiguity-based ELM-Tree. As a result, the ambiguity-based ELM-Tree obtains a lower training accuracy but a higher testing accuracy.
2 http://read.pudn.com/downloads139/sourcecode/math/599191/c4.5matlab/UseC45.m.htm.
Table 3
Comparison of training accuracy among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
avg.
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
C4.5
ELM
0.807
0.970
0.971
0.834
0.767
0.736
0.885
0.934
0.884
0.786
0.794
0.796
0.670
0.997
0.602
0.973
0.990
0.998
0.981
0.965
0.954
0.997
0.988
0.991
0.972
0.994
0.990
0.954
0.996
0.899
0.804
0.969
0.966
0.834
0.767
0.728
0.885
0.940
0.892
0.784
0.803
0.795
0.670
0.999
0.602
0.829
0.976
0.829
C4.5
ELM
0.819
0.937
0.921
0.631
0.693
0.635
0.889
0.917
0.816
0.702
0.717
0.741
0.666
0.921
0.500
0.765
0.964
0.963
0.778
0.746
0.637
0.866
0.902
0.851
0.772
0.756
0.791
0.627
0.978
0.589
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.891
0.976
0.973
0.906
0.861
0.885
0.921
0.959
0.920
0.893
0.837
0.851
0.810
0.993
0.775
0.887
0.974
0.970
0.883
0.848
0.804
0.911
0.956
0.911
0.839
0.816
0.855
0.769
0.992
0.694
0.889
0.968
0.969
0.846
0.776
0.731
0.884
0.956
0.899
0.790
0.817
0.807
0.747
0.991
0.597
0.815
0.970
0.971
0.837
0.771
0.794
0.891
0.942
0.888
0.788
0.841
0.801
0.692
0.999
0.624
0.807
0.969
0.967
0.826
0.768
0.749
0.889
0.934
0.898
0.785
0.824
0.795
0.679
0.998
0.598
0.897
0.874
0.844
0.842
0.832
Note: For each dataset, the highest training accuracy is in bold face.
Table 4
Comparison of testing accuracy among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
avg.
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.801
√
0.950
√
0.933
√
0.667
0.680
0.631
0.858
0.908
0.811
√
0.706
0.693
√
0.772
0.609
√
0.939
√
0.510
0.791
√
0.953
√
0.951
√
0.717
√
0.707
√
0.654
0.832
0.912
√
0.862
√
0.737
0.688
√
0.761
0.632
√
0.961
√
0.557
0.801
√
0.961
√
0.961
√
0.788
√
0.735
0.624
0.861
0.917
√
0.832
√
0.771
0.697
√
0.775
0.619
√
0.961
√
0.582
0.763
√
0.960
√
0.958
√
0.777
√
0.748
√
0.637
0.846
0.912
√
0.862
√
0.764
√
0.750
√
0.783
0.634
√
0.983
√
0.585
0.773
√
0.961
√
0.965
√
0.753
√
0.759
0.612
0.852
0.889
√
0.842
√
0.771
0.688
√
0.794
0.632
√
0.972
√
0.583
0.791
√
0.969
√
0.954
√
0.762
√
0.746
√
0.641
0.832
0.902
√
0.857
√
0.776
0.716
√
0.798
0.643
√
0.977
√
0.584
0.765
(7/15)
0.781
(10/15)
0.792
(9/15)
0.797
0.790
0.797
0.767
0.799
(11/15)
(9/15)
(10/15)
√
Note: For each dataset, the highest testing accuracy is in bold face. For each result, represents that the method is significantly better than C4.5
by Wilcoxon signed-rank test.
3. Third, we perform Wilcoxon signed-rank tests [4,28] to compare the performances of the ELM-Trees and C4.5. The Wilcoxon signed-rank test is often used as an alternative to the paired t-test. It ranks the absolute values of the differences between the testing accuracies of two classifiers on every run of 10-fold cross-validation and compares the ranks of the positive and negative differences. In our experiments, all statistical comparisons are conducted at the significance level 0.1. As depicted in Table 4, for the three pairs of (θ, ε), i.e., (0.90, 0.01), (0.90, 0.05) and (0.90, 0.1), the entropy-based ELM-Tree outperforms (i.e., is statistically better than) C4.5 in testing accuracy on 7, 10 and 9 datasets out of 15, while the ambiguity-based ELM-Tree outperforms C4.5 on 11, 9 and 10 datasets out of 15, which validates the effectiveness of incorporating ELM into C4.5. Besides, the sign test [4] is performed to validate whether the proposed ELM-Tree is statistically
Table 5
Comparison of training time among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
avg.
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
C4.5
ELM
0.18594
0.29688
4.23906
0.80625
0.42031
0.40156
1.92656
0.08594
0.91563
0.45625
4.54219
0.59375
0.98906
0.42188
0.68750
0.27344
0.24688
3.45469
1.35000
0.83281
0.55625
2.44844
0.08281
0.81563
1.06719
2.30000
0.89687
1.23125
0.25469
2.50156
0.00625
0.00625
0.00469
0.00156
0.00000
0.00156
0.00469
0.00156
0.00156
0.00469
0.00313
0.00156
0.00625
0.00156
0.00625
1.13125
1.22083
0.00344
C4.5
ELM
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.31094
0.26406
4.48281
1.42188
0.91719
0.56563
2.79219
0.10781
0.88125
1.23125
2.72969
1.08750
1.33281
0.30156
2.38437
0.30938
0.24219
3.53594
1.27500
0.82656
0.39844
2.61094
0.10469
0.89063
0.87187
2.75000
1.06250
1.17031
0.28594
1.77500
0.29844
0.15937
2.69375
0.41094
0.35000
0.19687
1.48125
0.10781
0.84062
0.35156
2.01094
0.36719
0.95313
0.30312
0.19219
0.29219
0.39687
7.02031
2.59844
0.62344
0.73594
4.05781
0.14063
1.23594
0.95000
7.37500
1.46250
1.67500
0.51719
1.90469
0.19219
0.31875
4.22344
1.26719
0.52500
0.50313
2.87969
0.10938
0.90781
0.57031
6.76406
0.86094
1.24688
0.45000
1.00469
1.38740
1.20729
0.71448
2.06573
1.45490
Note: For each dataset, the lowest training time is in bold face.
Table 6
Comparison of testing time among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
avg.
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.00781
0.00469
0.00625
0.00156
0.00469
0.00156
0.00313
0.00156
0.00625
0.00625
0.00313
0.00313
0.00781
0.00313
0.01875
0.00000
0.00000
0.00000
0.00000
0.00469
0.00313
0.00469
0.00156
0.00156
0.00313
0.00156
0.00313
0.01250
0.00000
0.00781
0.00313
0.00000
0.00313
0.00000
0.00156
0.00000
0.00000
0.00000
0.00156
0.00313
0.00000
0.00000
0.01094
0.00000
0.00000
0.00313
0.00469
0.00313
0.00156
0.00156
0.00156
0.00000
0.00000
0.00313
0.00156
0.00313
0.00156
0.00781
0.00156
0.01094
0.00000
0.00156
0.00156
0.00000
0.00000
0.00000
0.00156
0.00000
0.00000
0.00156
0.00469
0.00156
0.00625
0.00156
0.00000
0.00156
0.00000
0.00000
0.00156
0.00156
0.00000
0.00000
0.00000
0.00156
0.00156
0.00156
0.00156
0.00469
0.00156
0.00000
0.00313
0.00156
0.00156
0.00000
0.00625
0.00156
0.00000
0.00156
0.00313
0.00937
0.00313
0.00313
0.01250
0.00313
0.02656
0.00000
0.00156
0.00156
0.00000
0.00000
0.00000
0.00156
0.00313
0.00000
0.00156
0.00000
0.00000
0.00000
0.00000
0.00000
0.00531
0.00292
0.00156
0.00302
0.00135
0.00114
0.00510
0.00062
Note: For each dataset, the lowest testing time is in bold face.
meaningful on all the datasets. The sign test assumes that the win number of a given learning model in a specific comparison obeys the normal distribution $N(\frac{m}{2}, \frac{\sqrt{m}}{2})$ under the null hypothesis, where $m$ is the number of used datasets. If the win number is at least $\frac{m}{2} + z_{\alpha/2} \times \frac{\sqrt{m}}{2}$, it can be concluded that the given learning model is statistically meaningful under the significance level $\alpha$. In our study, $m = 15$. Letting $\alpha = 0.1$, we have $\frac{m}{2} + z_{\alpha/2} \times \frac{\sqrt{m}}{2} = \frac{15}{2} + 1.645 \times \frac{\sqrt{15}}{2} \approx 10$. Thus, it can be concluded that the entropy-based ELM-Tree with (θ, ε) = (0.90, 0.05), and the ambiguity-based ELM-Tree with (θ, ε) = (0.90, 0.01) and (θ, ε) = (0.90, 0.1), have significantly better performances than C4.5.
Table 5 and Table 6 respectively report the training and testing time of the compared methods. Basically, ELM is
the fastest method in both training and testing. As shown in Table 5, the training time of ELM-Tree and C4.5 have
no obvious difference. However, from the average results in Table 6, we can see that ELM-Tree has higher testing
efficiency than C4.5 in most cases. Besides, with the same (θ, ε), ambiguity-based ELM-Tree is more efficient than
entropy-based ELM-Tree.
Finally, we make some observations on the scale of the tree from Table 7. It is obvious that the size of the entropy-based ELM-Tree is larger than that of the ambiguity-based ELM-Tree, while both of them are much smaller than C4.5 on all 15 datasets. From Fig. 1(c) and Fig. 1(d), we can see that for the same probability distribution E, Ambig(E) ≤ Entr(E); specifically, when p1 = p2 = ... = pm, Ambig(E) = Entr(E), and otherwise Ambig(E) < Entr(E). For a given attribute A_j and one of its cut-points cut_ij, suppose that I_E(cut_ij) > ε. It is still possible that I_A(cut_ij) < ε, which increases the probability of generating ELM leaf nodes. Thus, the induction process can terminate earlier with the ambiguity-based heuristic, which leads to a smaller ELM-Tree.
5.3. Comparisons among ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT
In this section, we compare the testing accuracies and training times (i.e., the time taken to build the models in Weka 3.6 [33]) of the ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT. All the experimental results in this section are obtained on the Weka 3.6 platform, which is run on a PC with the Windows XP operating system, a Pentium 4 2.8 GHz CPU, and 2 GB of RAM. We also conduct 10-fold cross-validation 10 times and observe the average results corresponding to FT (FT with logistic regression functions at both the leaf and inner nodes), FTLeaves (FT with logistic regression functions only at the leaf nodes), FTInner (FT with logistic regression functions only at the inner nodes), NBTree, LMT, and LMT_FAM+WT. The detailed parameter setups of these learning algorithms in Weka 3.6 are as follows:
1. FT-weka.classifiers.trees.FT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False and
weightTrimBeta = 0.0;
2. FTLeaves-weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False
and weightTrimBeta = 0.0;
3. FTInner-weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False and
weightTrimBeta = 0.0;
4. NBTree-weka.classifiers.trees.NBTree: Default value;
5. LMT-weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False and
weightTrimBeta = 0.0;
6. LMTFAM+WT -weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC =
True and weightTrimBeta = 0.1.
The comparative results on testing accuracy and training time are summarized in Tables 8 and 9 respectively, where the testing accuracies and training times corresponding to the ELM-Trees are extracted from Tables 4 and 5. Here, we also use the sign test to compare the performances of the ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT.
From Table 8 we can see that, in comparison with the ELM-Trees, FT, FTLeaves, FTInner, NBTree, LMT, and LMT_FAM+WT obtain better performances on 4, 5, 4, 4, 7, and 5 datasets respectively. According to the sign test, one learning algorithm is significantly better than another on 15 test datasets only if the win number reaches 10. Thus, we conclude that these existing model tree variants are not obviously better than the ELM-Trees on the 15 small benchmark datasets listed in Table 2. However, the comparison of training time in Table 9 shows that our proposed ELM-Trees are obviously better than the other model tree variants, i.e., the training times of the ELM-Trees are significantly lower than those of FT, FTLeaves, FTInner, NBTree, LMT, and LMT_FAM+WT. Meanwhile, we plot the variation trend of the training time of all compared algorithms with the change of dataset size in Fig. 8. From Fig. 8 we can see that the training times of FT, FTLeaves, FTInner, LMT, and LMT_FAM+WT keep an obviously ascending trend as the dataset size increases, while the training times of the ELM-Trees and NBTree do not change drastically with the dataset size. This indicates that the sequential FT, FTLeaves, FTInner, LMT, and LMT_FAM+WT are unsuitable for big data classification due to their higher computational complexities, which are caused by the time-consuming LogitBoost algorithm [8] being called repeatedly for a fixed number of iterations under the
Table 7
Comparison of node number between ELM-Tree (θ = 0.90) and C4.5 tree on the 15 small benchmark datasets.
No.
Dataset
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
ε = 0.1
ε = 0.01
ε = 0.05
C4.5
ε = 0.01
ε = 0.05
ε = 0.1
39
17
16
22
91
38
21
9
14
101
16
21
126
7
266
38
15
12
18
74
19
20
9
12
56
16
21
99
7
150
38
4
6
3
14
3
9
9
11
7
11
5
72
8
1
9
9
12
7
13
30
9
7
8
15
23
11
77
10
79
8
7
6
4
9
14
6
5
7
10
21
7
52
8
13
6
7
6
3
7
6
4
4
7
6
13
4
35
7
5
42
26
16
22
106
40
21
10
17
109
16
22
142
8
350
63
# Leaf node
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
avg.
54
38
13
21
12
8
# ELM leaf node
1
Auto Mpg
2
Breast Cancer
3
Breast Cancer W-D
4
Breast Cancer W-P
5
Credit Approval
6
Glass Identification
7
Ionosphere
8
New Thyroid Gland
9
Parkinsons
10
Pima Indian Diabetes
11
Sonar
12
SPECTF Heart
13
Vehicle Silhouettes
14
Wine
15
Yeast
12
8
9
9
29
9
8
4
7
31
9
11
35
4
45
12
8
7
9
24
6
8
5
6
20
9
10
31
4
30
12
3
4
2
7
2
5
4
5
4
6
3
22
4
1
5
7
5
4
8
12
6
5
5
8
10
8
44
7
33
6
6
4
3
7
8
6
4
5
6
11
5
31
7
9
6
6
4
2
6
5
4
3
5
4
10
4
22
6
4
avg.
15
13
6
11
8
6
77
34
32
42
180
74
41
17
26
201
31
41
251
14
532
75
28
22
36
147
37
39
17
23
111
31
40
197
13
298
76
7
11
5
27
6
17
17
21
13
21
9
142
14
1
16
18
22
14
24
59
16
13
16
29
45
21
152
18
156
14
14
12
7
17
28
11
9
13
19
42
12
103
16
25
11
13
11
4
13
11
7
7
13
10
26
8
68
13
8
125
77
47
65
315
119
60
27
48
325
47
64
423
23
1049
106
74
26
41
23
15
188
# All node
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
avg.
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
Note: For each dataset, the lowest number is in bold face.
Table 8
Comparison of testing accuracy of ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT on the 15 small benchmark datasets.

No.  Dataset               ELM-Trees  FT     FTLeaves  FTInner  NBTree  LMT    LMT_FAM+WT
1    Auto Mpg              0.801      0.755  0.793     0.745    0.801   0.819  0.819
2    Breast Cancer         0.969      0.959  0.957     0.967    0.963   0.964  0.966
3    Breast Cancer W-D     0.965      0.954  0.953     0.963    0.926   0.963  0.965
4    Breast Cancer W-P     0.788      0.742  0.747     0.747    0.722   0.773  0.793
5    Credit Approval       0.759      0.772  0.767     0.772    0.764   0.772  0.752
6    Glass Identification  0.654      0.626  0.650     0.593    0.696   0.645  0.645
7    Ionosphere            0.858      0.838  0.917     0.858    0.917   0.915  0.901
8    New Thyroid Gland     0.917      0.953  0.953     0.958    0.930   0.953  0.940
9    Parkinsons            0.862      0.815  0.846     0.821    0.903   0.877  0.862
10   Pima Indian Diabetes  0.776      0.751  0.766     0.750    0.760   0.768  0.771
11   Sonar                 0.750      0.764  0.788     0.764    0.755   0.769  0.788
12   SPECTF Heart          0.798      0.768  0.790     0.775    0.723   0.783  0.801
13   Vehicle Silhouettes   0.643      0.729  0.766     0.704    0.650   0.774  0.754
14   Wine                  0.983      0.972  0.972     0.961    0.949   0.972  0.978
15   Yeast                 0.585      0.576  0.582     0.572    0.586   0.590  0.588
Significantly better than ELM-Trees   (4/15) (5/15)    (4/15)   (4/15)  (7/15) (5/15)
Table 9
Comparison of training time (seconds) of ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT on the 15 small benchmark datasets.

No.  Dataset               ELM-Trees  FT     FTLeaves  FTInner  NBTree  LMT     LMT_FAM+WT
1    Auto Mpg              0.19       0.91   1.05      0.91     0.33    5.66    4.97
2    Breast Cancer         0.16       0.89   1.34      0.89     0.66    7.06    4.25
3    Breast Cancer W-D     2.69       1.08   2.23      1.31     4.84    11.34   3.38
4    Breast Cancer W-P     0.41       1.13   1.24      1.08     0.15    5.42    3.55
5    Credit Approval       0.35       1.31   1.52      1.31     0.39    8.58    12.13
6    Glass Identification  0.20       2.52   1.81      1.92     0.42    10.22   2.75
7    Ionosphere            1.48       1.17   2.14      1.41     4.14    11.02   5.5
8    New Thyroid Gland     0.11       0.39   0.42      0.16     0.19    2.52    0.88
9    Parkinsons            0.84       0.48   0.63      0.69     0.89    3.30    2.28
10   Pima Indian Diabetes  0.35       1.38   1.98      1.39     0.39    12.16   10.33
11   Sonar                 2.01       1.14   1.78      1.14     3.72    9.25    4.00
12   SPECTF Heart          0.37       0.91   1.75      1.22     3.58    11.67   13.41
13   Vehicle Silhouettes   0.95       6.25   9.30      6.72     2.70    51.48   24.57
14   Wine                  0.29       0.27   0.50      0.63     0.28    2.47    0.56
15   Yeast                 0.19       38.64  28.22     29.48    1.33    160.76  37.20
Significantly better than ELM-Trees   (5/15) (3/15)    (4/15)   (2/15)  (0/15)  (0/15)
framework of 5-fold cross-validation. Although NBTree's training time is also not obviously affected by the dataset size, its computational time is significantly higher than that of the ELM-Trees according to the statistical result obtained with the sign test.
5.4. Parallel performances of ELM-Tree
In this section, we test the performance (i.e., running time, speedup and scaleup [11]) of the parallel ELM-Tree on four big datasets, which are constructed from four UCI benchmark datasets [27]: Magic Telescope, Image Segment, Page Blocks and Wine Quality-White. Taking Magic Telescope as an example, we introduce how to generate a big dataset based on it (a sketch of this procedure is given after Table 10). For every instance (x1, x2, ..., xn, y) in Magic Telescope, we generate 1000 new instances according to the following operation: (x1 ± e_i1, x2 ± e_i2, ..., xn ± e_in), where e_ij, i = 1, 2, ..., 1000, j = 1, 2, ..., n, is a random number obeying the uniform distribution U(0, 0.01). After all the 19 020 instances in Magic Telescope are considered, a big Magic Telescope dataset with 19 020 000 instances is obtained. Similarly, the big Image Segment, big Page
Fig. 8. The variation trend of training time of different model tree algorithms with the change of datasets’ size.
Table 10
The specification of 4 big datasets.
Dataset
# Attribute
# Class
Class distribution
# Instance
Image Segment
Magic Telescope
Page Blocks
Wine Quality-White
19
10
10
11
7
2
5
6
330 000 × 7
12 332 000/6 688 000
4 913 000/329 000/115 000/88 000/28 000
2 198 000/1 457 000/880 000/175 000/163 000/20 000
2 310 000
19 020 000
5 473 000
4 898 000
Blocks and big Wine Quality-White datasets with 2 310 000, 5 473 000 and 4 898 000 instances can also be generated.
The details of these four big datasets are listed in Table 10.
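As a concrete illustration of the enlargement step described above, the following C sketch (not the authors' code) reads instances from standard input and writes 1000 perturbed copies of each, adding or subtracting a $U(0, 0.01)$ noise term per attribute while keeping the class label. The per-attribute random sign, the whitespace-separated text format, and the fixed attribute count are assumptions made for this example.

```c
/* Enlarge a dataset by perturbing every instance 1000 times. */
#include <stdio.h>
#include <stdlib.h>

#define N_ATTR   10      /* e.g., Magic Telescope has 10 attributes */
#define N_COPIES 1000    /* new instances generated per original instance */

/* uniform random number in [0, 0.01) */
static double noise(void)
{
    return 0.01 * ((double)rand() / ((double)RAND_MAX + 1.0));
}

int main(void)
{
    double x[N_ATTR];
    int y;

    srand(1u);  /* fixed seed so the enlarged dataset is reproducible */

    /* read "x1 ... xn y" per instance from stdin, write enlarged data to stdout */
    for (;;) {
        for (int j = 0; j < N_ATTR; j++)
            if (scanf("%lf", &x[j]) != 1) return 0;
        if (scanf("%d", &y) != 1) return 0;

        for (int i = 0; i < N_COPIES; i++) {
            for (int j = 0; j < N_ATTR; j++) {
                /* x_j + e_ij or x_j - e_ij, with e_ij ~ U(0, 0.01) */
                double sign = (rand() % 2 == 0) ? 1.0 : -1.0;
                printf("%g ", x[j] + sign * noise());
            }
            printf("%d\n", y);
        }
    }
}
```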
We adopt the ambiguity-based heuristic measure and implement it with 1 host computer and 8 servant computers. Each computer has a Pentium 4 Xeon 3.06 GHz CPU and 512 MB RAM, and runs the Red Hat Linux 9.0 operating system. The parallel system is configured based on the message passing interface (MPI) programming standard with the C language. For simplicity, parallel computation is applied to two components, i.e., the information gain and the gain ratio. The recorded time includes only the computing time and neglects the communication time among the computers.
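The paper does not list its MPI code, but the data-parallel pattern it describes, where each computer evaluates split statistics on its own slice of the data and the host aggregates them, can be sketched as follows. The toy data, the single attribute, the fixed cut-point, and the two-class setting are assumptions for illustration only; they are not the authors' implementation.

```c
/* Sketch: each process counts classes on both sides of a candidate cut-point;
 * the host sums the counts with MPI_Reduce and computes the information gain. */
#include <mpi.h>
#include <stdio.h>
#include <math.h>

#define N_CLASS 2

static double entropy(const long cnt[N_CLASS], long total)
{
    double h = 0.0;
    for (int c = 0; c < N_CLASS; c++)
        if (cnt[c] > 0) {
            double p = (double)cnt[c] / (double)total;
            h -= p * log2(p);
        }
    return h;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* toy local slice: one attribute value and one class label per instance */
    double attr[4]  = { 0.1 * rank, 0.2, 0.6, 0.9 };
    int    label[4] = { 0, 0, 1, 1 };
    double cut = 0.5;                        /* candidate cut-point */

    long local[2 * N_CLASS] = { 0 };         /* [left counts | right counts] */
    for (int i = 0; i < 4; i++) {
        int side = attr[i] <= cut ? 0 : 1;
        local[side * N_CLASS + label[i]]++;
    }

    long global[2 * N_CLASS] = { 0 };
    MPI_Reduce(local, global, 2 * N_CLASS, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {                         /* host computes the information gain */
        long nl = global[0] + global[1], nr = global[2] + global[3], n = nl + nr;
        long all[N_CLASS] = { global[0] + global[2], global[1] + global[3] };
        double gain = entropy(all, n)
                    - ((double)nl / n) * entropy(&global[0], nl)
                    - ((double)nr / n) * entropy(&global[2], nr);
        printf("information gain at cut %.2f: %.4f (on %d processes)\n",
               cut, gain, size);
    }
    MPI_Finalize();
    return 0;
}
```

The same reduce-then-evaluate pattern applies to each candidate cut-point and to the gain ratio, which only requires the additional split-information term at the host.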
Besides, we also use the criteria of speedup and scaleup [11] to measure the performance. Speedup measures how much faster a parallel algorithm is than the corresponding sequential algorithm, and can be expressed by
\[
\text{Speedup} = \frac{\text{Computing time on 1 computer}}{\text{Computing time on } M \text{ computers}}. \tag{32}
\]
Scaleup measures the ability of an M-times larger system to perform an M-times larger task in the same computing time as the original system, i.e., how much more work can be done in the same time period by the parallel system. Scaleup can be expressed by
\[
\text{Scaleup} = \frac{\text{Computing time for processing data on 1 computer}}{\text{Computing time for processing } M \times \text{data on } M \text{ computers}}. \tag{33}
\]
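To make Eqs. (32) and (33) concrete, the short program below evaluates both criteria from made-up timings; the numbers are placeholders for illustration, not measurements from Table 11.

```c
/* Tiny worked example of the speedup and scaleup definitions. */
#include <stdio.h>

int main(void)
{
    int    M      = 8;
    double t1     = 4000.0;   /* seconds on 1 computer, base dataset        */
    double tM     = 600.0;    /* seconds on M computers, same base dataset  */
    double tM_big = 4400.0;   /* seconds on M computers, M-times dataset    */

    printf("speedup = %.2f (ideal: %d)\n", t1 / tM, M);       /* Eq. (32) */
    printf("scaleup = %.2f (ideal: 1.0)\n", t1 / tM_big);     /* Eq. (33) */
    return 0;
}
```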
We implement the parallel ELM-Tree algorithm on 2, 4, 6, and 8 computers, with training datasets that are 2, 4, 6, and 8 times the size of the original dataset. The execution times are summarized in Table 11, and the speedup and scaleup values are shown in Fig. 9. As shown in Table 11, the execution time of the parallel ELM-Tree decreases as the number of computers increases, which indicates that parallelization is feasible for reducing the computational time.
Table 11
Execution time (seconds) of parallel ambiguity-based ELM-Tree on the 4 big datasets.

Dataset              Data size  2 computers  4 computers  6 computers  8 computers
Image Segment        2-times    3143.89      1977.50      945.01       488.23
                     4-times    8340.78      5159.94      2439.82      1148.51
                     6-times    11 922.16    7261.19      3646.30      1752.65
                     8-times    14 890.93    9034.98      4393.87      2738.84
Magic Telescope      2-times    66 615.66    38 604.97    10 776.83    3513.07
                     4-times    170 802.54   94 024.51    28 241.36    8315.77
                     6-times    235 026.29   131 217.35   40 562.60    12 446.06
                     8-times    293 555.18   164 999.23   48 942.07    24 052.94
Page Blocks          2-times    5756.21      2592.83      860.94       268.98
                     4-times    14 494.77    6516.00      2110.56      661.30
                     6-times    21 217.74    9475.43      3149.15      991.34
                     8-times    25 652.75    11 972.18    3829.53      1947.15
Wine Quality-White   2-times    6239.44      4195.61      2738.31      1969.97
                     4-times    16 040.40    10 713.09    6973.18      4703.07
                     6-times    23 659.38    16 009.42    10 335.04    7061.85
                     8-times    28 012.47    18 839.69    12 496.50    9875.35
It follows from Eq. (32) that a good parallel system usually exhibits a linear speedup. As shown in Figs. 9(b), 9(e), 9(h) and 9(k), the speedup of the parallel ELM-Tree becomes closer to linear as the data size increases; the larger the data size, the higher the speedup. Besides, it follows from Eq. (33) that the scaleup of an ideal parallel system should be constantly equal to 1. However, as demonstrated in Figs. 9(c), 9(f), 9(i) and 9(l), the scaleup of a parallel algorithm usually declines as the data size and the number of computers increase. All the above results indicate that the proposed ELM-Tree algorithm can be highly parallelized and used to handle big data classification problems.
6. Conclusion and future work

In this paper, a new hybrid learning algorithm named ELM-Tree is proposed to deal with the over-partitioning problem in DT induction. It adopts uncertainty-reduction heuristics and embeds ELMs as leaf nodes when the information gain ratios of all the cut-points are smaller than a given uncertainty coefficient. Besides, a parallel ELM-Tree model is proposed for big data classification and is shown to be effective in reducing the computational time. Our future work on this topic may include three aspects: 1) further study of the incremental mechanism of ELM-Tree; 2) ELM-Tree induction with mixed types of attributes; and 3) parallel ELM-Tree with the MapReduce framework.
Acknowledgements
The authors would like to thank Prof. Xi-Zhao Wang for his initial idea and step-by-step guidance during the completion of this paper. This research is supported by the National Natural Science Foundation of China (71371063, 61170040 and 60903089), by the Natural Science Foundation of Hebei Province (F2013201110, F2012201023 and F2011201063), and by the Key Scientific Research Foundation of the Education Department of Hebei Province (ZD2010139). This work is also supported by the Shenzhen New Industry Development Fund under grant No. JCYJ20120617120716224.
Fig. 9. Performances of parallel ELM-Tree on the 4 big datasets.
References
[1] M. Barakat, D. Lefebvre, M. Khalil, F. Druaux, O. Mustapha, Parameter selection algorithm with self adaptive growing neural network
classifier for diagnosis issues, Int. J. Mach. Learn. Cybern. 4 (3) (2013) 217–233.
[2] Y. Ben-Haim, E. Tom-Tov, A streaming parallel decision tree algorithm, J. Mach. Learn. Res. 11 (2010) 849–872.
[3] C.J. Chen, Structural vibration suppression by using neural classifier with genetic algorithm, Int. J. Mach. Learn. Cybern. 3 (3) (2012) 215–221.
[4] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[5] X.L. Dong, D. Srivastava, Big data integration, in: Proceedings of ICDE’13, 2013, pp. 1245–1248.
[6] S. Ferrari, R.F. Stengel, Smooth function approximation using neural networks, IEEE Trans. Neural Netw. 16 (1) (2005) 24–38.
[7] E. Frank, Y. Wang, S. Inglis, G. Holmes, I.H. Witten, Using model trees for classification, Mach. Learn. 32 (1) (1998) 63–76.
[8] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 38 (2) (2000) 337–374.
[9] J. Gama, Functional trees, Mach. Learn. 55 (3) (2004) 219–250.
[10] A. Gattiker, F.H. Gebara, H.P. Hofstee, J.D. Hayes, A. Hylick, Big Data text-oriented benchmark creation for Hadoop, IBM J. Res. Dev.
57 (3–4) (2013) 10:1–10:6.
[11] Q. He, T.F. Shang, F.Z. Zhuang, Z.Z. Shi, Parallel extreme learning machine for regression based on MapReduce, Neurocomputing 102 (2013)
52–58.
[12] M. Higashi, G.J. Klir, Measures of uncertainty and information based on possibility distributions, Int. J. Gen. Syst. 9 (1) (1982) 43–58.
[13] G.B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2) (2011) 107–122.
[14] G.B. Huang, H.M. Zhou, X.J. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man
Cybern., Part B, Cybern. 42 (2) (2012) 513–529.
[15] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[16] R. Kohavi, Scaling up the accuracy of naive Bayes classifiers: a decision-tree hybrid, in: Proceedings of KDD’96, 1996, pp. 202–207.
[17] N. Landwehr, M. Hall, E. Frank, Logistic model trees, Mach. Learn. 59 (1–2) (2005) 161–205.
[18] S. Lomax, S. Vadera, A survey of cost-sensitive decision tree induction algorithms, ACM Comput. Surv. 45 (2) (2013) 16:1–16:35.
[19] P. Malik, Governing Big Data: principles and practices, IBM J. Res. Dev. 57 (3–4) (2013) 1:1–1:13.
[20] B. Panda, J.S. Herbach, S. Basu, R.J. Bayardo, PLANET: massively parallel learning of tree ensembles with MapReduce, Proc. VLDB Endow.
2 (2) (2009) 1426–1437.
[21] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[22] J.R. Quinlan, Improved use of continuous attributes in C4.5, J. Artif. Intell. Res. 4 (1996) 77–90.
[23] J. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in: Proceedings of VLDB’96, 1996, pp. 544–555.
[24] Y. Sheng, V.V. Phoha, S.M. Rovnyak, A parallel decision tree-based method for user authentication based on keystroke patterns, IEEE Trans.
Syst. Man Cybern., Part B, Cybern. 35 (4) (2005) 826–833.
[25] A. Srivastava, E.H. Han, V. Kumar, V. Singh, Parallel formulations of decision-tree classification algorithms, Data Min. Knowl. Discov. 3 (3)
(1999) 237–261.
[26] M. Sumner, E. Frank, M. Hall, Speeding up logistic model tree induction, in: Knowledge Discovery in Databases, PKDD 2005, in: Lect. Notes
Comput. Sci., vol. 3721, 2005, pp. 675–683.
[27] UCI Machine Learning Repository, available online: http://archive.ics.uci.edu/ml/.
[28] X.Z. Wang, Y.L. He, D.D. Wang, Non-naive Bayesian classifiers for classification problems with continuous attributes, IEEE Trans. Cybern.
44 (1) (2014) 21–39.
[29] X.Z. Wang, J.R. Hong, On the handling of fuzziness for continuous-valued attributes in decision tree generation, Fuzzy Sets Syst. 99 (3)
(1998) 283–290.
[30] T. Wang, Z.X. Qin, Z. Jin, S.C. Zhang, Handling over-fitting in test cost-sensitive decision tree learning by feature selection, smoothing and
pruning, J. Syst. Softw. 83 (7) (2010) 1137–1147.
[31] X.Z. Wang, D.S. Yeung, E.C.C. Tsang, A comparative study on heuristic algorithms for generating fuzzy decision trees, IEEE Trans. Syst.
Man Cybern., Part B, Cybern. 31 (2) (2001) 215–226.
[32] X.Z. Wang, J.H. Zhai, S.X. Lu, Induction of multiple fuzzy decision trees based on rough set technique, Inf. Sci. 178 (16) (2008) 3188–3202.
[33] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[34] Y.F. Yuan, M.J. Shaw, Induction of fuzzy decision tree, Fuzzy Sets Syst. 69 (2) (1995) 125–139.
[35] M.J. Zaki, Parallel and distributed association mining: a survey, IEEE Concurr. 7 (4) (1999) 14–25.
[36] S.F. Zheng, Gradient descent algorithms for quantile regression with smooth approximation, Int. J. Mach. Learn. Cybern. 2 (3) (2011) 191–207.