Learning ELM-Tree from big data based on uncertainty reduction
Ran Wang a , Yu-Lin He b,∗ , Chi-Yin Chow a , Fang-Fang Ou b , Jian Zhang b
a Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong
b Key Laboratory in Machine Learning and Computational Intelligence, College of Mathematics and Computer Science, Hebei University,
Baoding 071002, Hebei, China
Received 21 September 2013; received in revised form 22 April 2014; accepted 23 April 2014
Abstract
A challenge in big data classification is the design of highly parallelized learning algorithms. One solution to this problem is
applying parallel computation to different components of a learning model. In this paper, we first propose an extreme learning
machine tree (ELM-Tree) model based on the heuristics of uncertainty reduction. In the ELM-Tree model, information entropy
and ambiguity are used as the uncertainty measures for splitting decision tree (DT) nodes. Besides, in order to resolve the over-partitioning problem in the DT induction, ELMs are embedded as the leaf nodes when the gain ratios of all the available splits are
smaller than a given threshold. Then, we apply parallel computation to five components of the ELM-Tree model, which effectively
reduces the computational time for big data classification. Experimental studies demonstrate the effectiveness of the proposed
method.
© 2014 Elsevier B.V. All rights reserved.
Keywords: Big data classification; Decision tree; ELM-Tree; Extreme learning machine; Uncertainty reduction
1. Introduction
With the arrival of the big data era, learning from big data has become unavoidable in many fields such as machine learning, pattern recognition, image processing, and information retrieval. A formal definition of big data has not been proposed yet; however, several descriptions can be found in the recent literature [5,10,19]. The main
difficulties in learning from big data include the following three aspects. First, it is hard to finish the computation
on a single computer within a tolerable time. Second, the high-dimensional and multi-modal features may degrade
the performance and efficiency of the learning algorithm. Finally, the transformation of learning concepts is hard
to realize due to the dynamic increase of data volume. In order to overcome these difficulties, the parallelization of
sequential classification algorithms is widely adopted. Decision tree (DT) induction [21,22], which has the advantages
of simple implementation, few parameters, and low computational load, is a promising classification algorithm for
parallelization.
* Corresponding author. Tel.: +86 185 31315747.
E-mail addresses: ranwang3-c@my.cityu.edu.hk (R. Wang), csylhe@gmail.com (Y.-L. He).
http://dx.doi.org/10.1016/j.fss.2014.04.028
0165-0114/© 2014 Elsevier B.V. All rights reserved.
The generalization capability of a DT is largely influenced by the scale of the tree. Generally, the scale of a
tree is affected by two factors, i.e., the degree of over-partitioning in the induction process, and the selection of
heuristic measure for splitting nodes. Information entropy [18,22] and ambiguity [29,34], which respectively reflect
the impurity of classes and the uncertainty of the split, are the two most widely used heuristics for splitting nodes
during the tree growth. The performances of these two heuristics may differ considerably; however, both of them are uncertainty-reduction-based methods [31,32] that can be applied to the parallelization of sequential DTs.
Several works on the parallelization of sequential DT could be found from the literature. Shafer et al. [23] first
proposed the SPRINT algorithm, which tries to remove all the memory restrictions by improving the data structures
used in DT growth. Srivastava et al. [25] later described two parallel DT models, which are respectively based on
synchronous construction approach and partitioned construction approach. Sheng et al. [24] developed a parallel DT
by splitting the input vector into four sub-vectors, and applied it to user authentication. Panda et al. [20] proposed
the PLANET method based on the MapReduce model of distributed computation. Ben-Haim and Tom-Tov [2] designed a parallel DT algorithm for classifying large streaming data by constructing histograms at the processors and
compressing the data to a fixed amount of memory. These works have proved to be effective and have exhibited good performance on big data classification. However, they neglect the over-partitioning problem, which may lead to a redundant and over-fitted tree [30], especially for big data.
In order to resolve the over-partitioning problem, a hybrid DT induction scheme, named Extreme Learning Machine Tree (ELM-Tree), is proposed in this paper. The proposed ELM-Tree is similar to the model tree given in [7]. The difference between a model tree and an ELM-Tree is that the leaf nodes of a model tree are linear regression functions, while in an ELM-Tree each leaf node is an ELM. Given a model tree, a new instance is classified by traversing the tree from the root to a leaf and determining the prediction value with the linear regression function in that leaf node. Some well-known modifications and extensions of the model tree include the functional tree (FT) [9], the Naive Bayes tree (NBTree) [16], the logistic model tree (LMT) [17], and LMT_FAM+WT [26], which improves LMT by using the first AIC minimum (FAM) method and the weight trimming (WT) strategy. Basically, our ELM-Tree is a new modification of the model tree.
All the above-mentioned methods construct decision trees with the classical top-down recursive partitioning scheme. The main difference among FT, NBTree, LMT and LMT_FAM+WT is the type of leaf node in the tree: linear discriminants or logistic regressions for FT, Naive Bayes classifiers for NBTree, and linear logistic regressions for LMT and LMT_FAM+WT. Although FT, NBTree, LMT and LMT_FAM+WT improve the classification performance of decision trees to some extent, an obvious drawback of these algorithms is the high time consumption of building them, e.g., pruning back with a bottom-up procedure in FT, computing a 5-fold cross-validation accuracy estimate to determine a node's utility in NBTree, and learning the weights of the logistic regressions iteratively with optimization algorithms in FT, LMT and LMT_FAM+WT. The high computational complexity of FT, NBTree, LMT and LMT_FAM+WT seriously limits the ability of these model trees to handle big data. Meanwhile, the complex training processes of these learning algorithms make it very difficult to implement their parallelization.
ELMs are emergent techniques for training single-hidden-layer feedforward neural networks (SLFNs) [6]. In an ELM, the input weights are randomly assigned, and the output weights are analytically determined via the pseudo-inverse of the hidden layer output matrix [13–15]. Unlike the time-consuming weight optimization in FT/LMT and the cross-validation-based utility determination in NBTree, the ELMs embedded in the ELM-Tree are extremely fast to train and therefore have great potential for learning from big data. In our ELM-Tree scheme, a threshold is given to determine whether a node should be split further or not. If the learner decides to stop splitting a node, this node becomes either a traditional leaf node or an ELM leaf node based on its class impurity. Then, by applying parallel computation to five components of the ELM-Tree, a parallel ELM-Tree model is developed for big data classification. Experimental results validate that the ELM-Tree scheme can resolve the over-partitioning problem well, and that the parallel ELM-Tree model is effective in reducing the computational load for big data classification.
The rest of this paper is organized as follows. In Section 2, a brief discussion on the two uncertainties, i.e., information entropy and ambiguity, is given. In Section 3, the ELM-Tree scheme is proposed. In Section 4, the parallel
ELM-Tree model is developed. In Section 5, experimental comparisons are conducted to show the feasibility of the
proposed method. Finally, conclusions are given in Section 6.
2. Two types of uncertainty
This section reviews the two most widely used uncertainties for splitting DT nodes, i.e., information entropy and
ambiguity.
In a DT model, each node corresponds to a probability distribution E = (p1 , . . . , pm ), where pi , i = 1, . . . , m
represents the proportion of the i-th class in the node. The uncertainty of a node is defined on E, and the learner tends
to split the node by using the attribute with the minimum average uncertainty. Such an attribute is called the expanded attribute.
2.1. Information entropy
Generally, information entropy is defined on a probability distribution, and represents the impurity of classes in a node. It is a type of statistical uncertainty that arises from the random behavior of physical systems. Given that $E = (p_1, \ldots, p_m)$ with $\sum_{i=1}^{m} p_i = 1$, information entropy is defined as
$$\mathrm{Entr}(E) = -\sum_{i=1}^{m} p_i \log_2(p_i). \tag{1}$$
Without loss of generality, we assume that $p_1$ is a single variable and $p_2, \ldots, p_{m-1}$ are $m-2$ constants with $c = p_2 + \ldots + p_{m-1}$; then $p_m = 1 - p_1 - c$, and Eq. (1) degenerates to
$$\mathrm{Entr}(E) = -p_1 \log_2(p_1) - (1 - p_1 - c)\log_2(1 - p_1 - c) + C_E, \tag{2}$$
where $C_E$ is a constant independent of the variable $p_1$. By solving
$$\frac{\partial\,\mathrm{Entr}(E)}{\partial p_1} = \log_2\frac{1 - p_1 - c}{p_1} = 0, \tag{3}$$
we get that $\mathrm{Entr}(E)$ attains its maximum at $p_1 = p_m = \frac{1-c}{2}$. Due to the symmetry of $p_1, \ldots, p_m$, we conclude that the entropy of an $m$-dimensional probability distribution attains its maximum at
$$p_1 = p_2 = \ldots = p_m = \frac{1}{m}. \tag{4}$$
2.2. Ambiguity
Ambiguity is defined on a possibility distribution, and denotes a type of non-specificity when there is a need to specify one object from a group. Since a probability distribution is a special case of a possibility distribution, we can also discuss the ambiguity in terms of the probability distribution $E = (p_1, \ldots, p_m)$. Ambiguity describes the uncertainty that arises from human reasoning and cognition, which is defined as
$$\mathrm{Ambig}(E) = \frac{1}{m}\sum_{i=1}^{m}\left(p_i^* - p_{i+1}^*\right)\ln(i), \tag{5}$$
where $(p_1^*, \ldots, p_m^*)$ is the normalization of $(p_1, \ldots, p_m)$ with $1 = p_1^* \geq p_2^* \geq \ldots \geq p_m^* \geq p_{m+1}^* = 0$. More specifically, the values of $p_1, \ldots, p_m, 0$ are normalized to the interval $[0, 1]$ by multiplying them with the factor $\frac{1}{\max\{p_1, \ldots, p_m\}}$, and are then sorted into $p_1^*, \ldots, p_m^*, p_{m+1}^*$ in descending order.
Similarly, due to the monotonicity, continuity, and symmetry [12], the maximum of $\mathrm{Ambig}(E)$ is also attained at $p_1 = p_2 = \ldots = p_m = \frac{1}{m}$.
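For concreteness, the following Python sketch (our own illustration, not code from the paper; the function names entropy and ambiguity are ours) evaluates Eqs. (1) and (5) for a class-probability distribution.

```python
import numpy as np

def entropy(p):
    """Information entropy of a probability distribution, Eq. (1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                                   # treat 0*log2(0) as 0
    return float(-np.sum(p * np.log2(p)))

def ambiguity(p):
    """Ambiguity of a probability distribution, Eq. (5)."""
    p = np.asarray(p, dtype=float)
    m = len(p)
    p_star = np.sort(p / p.max())[::-1]            # normalize by the maximum, sort descending
    p_star = np.append(p_star, 0.0)                # p*_{m+1} = 0
    i = np.arange(1, m + 1)
    return float(np.sum((p_star[:-1] - p_star[1:]) * np.log(i)) / m)

# Both measures peak at the uniform distribution, in agreement with Eq. (4):
print(entropy([0.5, 0.5]), ambiguity([0.5, 0.5]))  # 1.0 and (1/2)ln(2) ~ 0.347
```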
2.3. Specificity under binary cases
We discuss the two types of uncertainty for a binary classification problem. When m = 2, Eqs. (1) and (5) respectively degenerate to
$$\mathrm{Entr}(E) = -p_1 \log_2(p_1) - p_2 \log_2(p_2), \tag{6}$$
and
$$\mathrm{Ambig}(E) = \begin{cases} \frac{1}{2}\,(p_1/p_2)\ln(2), & \text{if } p_1 < p_2, \\ \frac{1}{2}\ln(2), & \text{if } p_1 = p_2, \\ \frac{1}{2}\,(p_2/p_1)\ln(2), & \text{if } p_1 > p_2, \end{cases} \tag{7}$$
where $p_1 + p_2 = 1$.

Fig. 1. Entropy and ambiguity when m = 2.
Fig. 1 depicts the curves of Eqs. (6) and (7) by taking p1 as the function variable under the condition p1 + p2 = 1. It is easy to observe that entropy and ambiguity have several common features, i.e., they are defined on [0, 1], symmetric about p1 = 0.5, strictly increasing on [0, 0.5], and strictly decreasing on [0.5, 1]. Besides, they attain their maxima at p1 = 0.5 and their minima at p1 = 0 and p1 = 1. However, entropy is a convex function, while ambiguity is a concave function. Given p1, the value of entropy is always larger than that of ambiguity. Furthermore, when p1 ∈ [0, 0.5], the increase of entropy is faster than that of ambiguity, and when p1 ∈ [0.5, 1], the decrease of entropy is slower than that of ambiguity.
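As a quick numerical illustration of this ordering (our own example, not from the paper), evaluating Eqs. (6) and (7) at $p_1 = 0.3$, $p_2 = 0.7$ gives
$$\mathrm{Entr}(E) = -0.3\log_2 0.3 - 0.7\log_2 0.7 \approx 0.881, \qquad \mathrm{Ambig}(E) = \tfrac{1}{2}\,\tfrac{0.3}{0.7}\,\ln 2 \approx 0.149,$$
so the entropy value indeed exceeds the ambiguity value at this point.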
3. The ELM-Tree approach
3.1. Extreme learning machine
Given a training set X that contains N distinct instances with n inputs and m outputs, i.e., $X = \{(\mathbf{x}_i, \mathbf{y}_i)\,|\,\mathbf{x}_i = [x_{i1}, \ldots, x_{in}]^T \in \mathbb{R}^n,\ \mathbf{y}_i = [y_{i1}, \ldots, y_{im}]^T \in \mathbb{R}^m,\ i = 1, \ldots, N\}$, the SLFNs with $\tilde{N}$ hidden nodes and activation function $g(x)$ are formulated as
$$\sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{o}_i, \quad i = 1, \ldots, N, \tag{8}$$
where $\mathbf{w}_j = [w_{j1}, \ldots, w_{jn}]^T$ is the weight vector connecting the input nodes and the $j$-th hidden node, $b_j$ is the bias of the $j$-th hidden node, $\boldsymbol{\beta}_j = [\beta_{j1}, \ldots, \beta_{jm}]^T$ is the weight vector connecting the $j$-th hidden node and the output nodes, and $\mathbf{o}_i$ is the output of $\mathbf{x}_i$ in the network.
The standard SLFNs can approximate the N training instances with zero error, i.e., there exist $\mathbf{w}_j$, $b_j$, and $\boldsymbol{\beta}_j$ such that
$$\sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) = \mathbf{y}_i, \quad i = 1, \ldots, N. \tag{9}$$
Eq. (9) can be written in the matrix form $\mathbf{H}\boldsymbol{\beta} = \mathbf{Y}$, where
$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_1 + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_1 + b_{\tilde{N}}) \\ g(\mathbf{w}_1 \cdot \mathbf{x}_2 + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_2 + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_2 + b_{\tilde{N}}) \\ \vdots & \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & g(\mathbf{w}_2 \cdot \mathbf{x}_N + b_2) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}} \tag{10}$$
is called the hidden layer output matrix, and $\boldsymbol{\beta}$ and $\mathbf{Y}$ are respectively represented by
$$\boldsymbol{\beta} = \begin{bmatrix} \beta_{11} & \beta_{12} & \cdots & \beta_{1m} \\ \beta_{21} & \beta_{22} & \cdots & \beta_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ \beta_{\tilde{N}1} & \beta_{\tilde{N}2} & \cdots & \beta_{\tilde{N}m} \end{bmatrix}_{\tilde{N} \times m}, \tag{11}$$
$$\mathbf{Y} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{Nm} \end{bmatrix}_{N \times m}. \tag{12}$$
Most traditional training methods for neural networks are gradient descent algorithms. They try to minimize the following cost function
$$CE = \sum_{i=1}^{N} \left\| \sum_{j=1}^{\tilde{N}} \boldsymbol{\beta}_j\, g(\mathbf{w}_j \cdot \mathbf{x}_i + b_j) - \mathbf{y}_i \right\|^2, \tag{13}$$
and the parameter vector $\boldsymbol{\alpha}_j = (\mathbf{w}_j, b_j, \boldsymbol{\beta}_j)$ is adjusted iteratively by
$$\boldsymbol{\alpha}_{k+1} = \boldsymbol{\alpha}_k - \eta\, \frac{\partial CE(\boldsymbol{\alpha}_k)}{\partial \boldsymbol{\alpha}_k}, \tag{14}$$
where $k$ is the iteration index and $\eta$ is the learning rate.
Although gradient descent algorithms [36] have exhibited good performance in different learning domains, it is still hard to overcome their high complexity [1,3]. Besides, due to the iterative learning mechanism, they may also suffer from local optima and over-fitting problems. Recently, the ELM was proposed by Huang et al. [15] for training SLFNs. It has been theoretically proved that, in order to approximate the training instances, the input weights and hidden biases can be randomly assigned if the activation function is infinitely differentiable [13,14]. Motivated by these facts, the ELM treats the network as a linear system: it randomly chooses the input weights and analytically determines the output weights by the pseudo-inverse of the hidden layer output matrix. The details of the ELM are described in Algorithm 1.
Algorithm 1: Extreme Learning Machine – ELM.
Input: Training set {(x_i, y_i) | x_i ∈ R^n, y_i ∈ R^m, i = 1, ..., N}; activation function g(x); the number of hidden nodes Ñ.
Output: Input weights w_j, input biases b_j, and output weight β.
1. Randomly assign the input weights w_j and biases b_j, where j = 1, ..., Ñ;
2. Calculate the hidden layer output matrix H;
3. Calculate the output weight β = H†Y, where H† is the Moore–Penrose generalized inverse of the matrix H.
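A minimal Python sketch of Algorithm 1 is shown below. It is our own illustration (sigmoid activation, NumPy pseudo-inverse) rather than the authors' implementation, and the function names are ours. Here Y is the N × m target matrix (e.g., one-hot class indicators), matching Eq. (12).

```python
import numpy as np

def elm_train(X, Y, n_hidden, rng=None):
    """Algorithm 1: random input weights/biases, analytic output weights beta = H† Y."""
    rng = rng or np.random.default_rng(0)
    W = rng.standard_normal((n_hidden, X.shape[1]))    # input weights w_j
    b = rng.standard_normal(n_hidden)                  # hidden biases b_j
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))           # hidden layer output matrix, Eq. (10)
    beta = np.linalg.pinv(H) @ Y                       # Moore-Penrose generalized inverse
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network outputs o_i of Eq. (8); for classification, take argmax over the output columns."""
    H = 1.0 / (1.0 + np.exp(-(X @ W.T + b)))
    return H @ beta
```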
3.2. ELM-Tree with uncertainty reduction
In this section, a new classification model named ELM-Tree is developed to handle the over-partitioning problem.
The key difference between ELM-Tree and traditional DT lies in the determination of the leaf nodes. In the ELM-Tree
model, ELMs are embedded as the leaf nodes when certain conditions are met. Thus, there are two key steps in the
construction of ELM-Tree, i.e., splitting a non-leaf node and determining an ELM leaf node.
Given a node X = {x_1, ..., x_N} with N instances from m classes, each instance is described by n continuous attributes, where the i-th instance is denoted by x_i = {x_{ij}}_{j=1}^{n} and A_j represents the j-th attribute. We first discuss how to split the non-leaf node X into two child nodes X_1 and X_2. This problem can be equivalently transformed into selecting the optimal attribute and its cut-point to divide X into two sets, where the selected cut-point is the one that maximizes the information gain. Algorithm 2 depicts the procedure of splitting X.
Algorithm 2: Split a DT Node.
Input: Node $X = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i = \{x_{i1}, \ldots, x_{in}\} \in \mathbb{R}^n$ is the $i$-th instance, $y_i \in \{1, \ldots, m\}$ is the class label of $\mathbf{x}_i$, $m$ is the number of classes, and $A_j$, $j = 1, \ldots, n$, is the $j$-th attribute.
Output: Two child nodes $X_1$ and $X_2$.
1. For each attribute $A_j$, sort its values $x_{1j}, \ldots, x_{Nj}$ in ascending order; the sorted values are recorded as $x_{1j}^* \leq \ldots \leq x_{Nj}^*$;
2. Get all the available cut-points of each $A_j$, i.e., $cut_{ij} = \frac{x_{ij}^* + x_{i+1,j}^*}{2}$, $i = 1, 2, \ldots, N-1$;
3. Calculate the information gain of each $A_j$ and its cut-point $cut_{ij}$:
$$\mathrm{Gain}(X, cut_{ij}) = \mathrm{Info}(X) - I(cut_{ij}) = \mathrm{Info}(X) - \left[ \frac{|X_{ij1}|}{|X|}\mathrm{Info}(X_{ij1}) + \frac{|X_{ij2}|}{|X|}\mathrm{Info}(X_{ij2}) \right], \tag{15}$$
where $X_{ij1} = \{\mathbf{x}_i \in X \mid x_{ij} \leq cut_{ij}\}$ and $X_{ij2} = \{\mathbf{x}_i \in X \mid x_{ij} > cut_{ij}\}$ are the two subsets of $X$ divided by $cut_{ij}$, $|X|$ is the size of $X$, and $I(cut_{ij})$ is the expected information of $cut_{ij}$;
4. For each attribute $A_j$, select its optimal cut-point $cut_{i(j)j}$, where
$$i(j) = \mathop{\mathrm{argmax}}_{i=1,\ldots,N-1} \big\{\mathrm{Gain}(X, cut_{ij})\big\}; \tag{16}$$
5. Calculate the split information of the optimal cut-point of each $A_j$:
$$\mathrm{Split}(X, cut_{i(j)j}) = -\left[ \frac{|X_{i(j)j1}|}{|X|}\log_2\frac{|X_{i(j)j1}|}{|X|} + \frac{|X_{i(j)j2}|}{|X|}\log_2\frac{|X_{i(j)j2}|}{|X|} \right]; \tag{17}$$
6. Calculate the gain ratio of $A_j$:
$$\mathrm{Ratio}(X, A_j) = \frac{\mathrm{Gain}(X, cut_{i(j)j})}{\mathrm{Split}(X, cut_{i(j)j})}; \tag{18}$$
7. Select the optimal attribute $A_{j^*}$, where
$$j^* = \mathop{\mathrm{argmax}}_{j=1,\ldots,n} \big\{\mathrm{Ratio}(X, A_j)\big\}; \tag{19}$$
8. Split $X$ into $X_1$ and $X_2$ by $A_{j^*}$ and its optimal cut-point $cut_{i(j^*)j^*}$, where $X_1 = \{\mathbf{x}_i \in X \mid x_{ij^*} \leq cut_{i(j^*)j^*}\}$ and $X_2 = \{\mathbf{x}_i \in X \mid x_{ij^*} > cut_{i(j^*)j^*}\}$.
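The following Python sketch (our own illustration of Algorithm 2 with the entropy-based measure; the helper names info and best_split are ours) selects, for each attribute, the cut-point with the largest information gain and then picks the attribute with the largest gain ratio.

```python
import numpy as np

def info(y):
    """Entropy-based Info(.) of the class labels in a (sub)set."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split(X, y):
    """Return (attribute index, cut-point, gain ratio) chosen as in steps 1-7 of Algorithm 2."""
    n, d = X.shape
    base = info(y)
    best_attr, best_cut, best_ratio = None, None, -np.inf
    for j in range(d):
        vals = np.sort(np.unique(X[:, j]))
        cuts = (vals[:-1] + vals[1:]) / 2.0                     # candidate cut-points of A_j
        gain_j, cut_j, frac_j = -np.inf, None, None
        for cut in cuts:                                        # step 4: maximize Gain, Eq. (16)
            left = X[:, j] <= cut
            nl, nr = int(left.sum()), n - int(left.sum())
            expected = (nl / n) * info(y[left]) + (nr / n) * info(y[~left])   # I(cut_ij)
            gain = base - expected                              # Eq. (15)
            if gain > gain_j:
                gain_j, cut_j, frac_j = gain, cut, (nl / n, nr / n)
        if cut_j is None:
            continue                                            # constant attribute, no cut-point
        split_info = -sum(f * np.log2(f) for f in frac_j if f > 0)            # Eq. (17)
        ratio = gain_j / split_info if split_info > 0 else 0.0                # Eq. (18)
        if ratio > best_ratio:                                  # step 7: maximize Ratio, Eq. (19)
            best_attr, best_cut, best_ratio = j, cut_j, ratio
    return best_attr, best_cut, best_ratio
```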
The term $\mathrm{Info}(X)$ in Eq. (15) is the residual uncertainty of the class information in $X$, which measures its class impurity and the amount of information it contains. Similarly, $\mathrm{Info}(X_{ij1})$ and $\mathrm{Info}(X_{ij2})$ reflect the amounts of information in $X_{ij1}$ and $X_{ij2}$. Thus, Eq. (15) represents the information gain obtained when splitting $X$ into $X_{ij1}$ and $X_{ij2}$ by $cut_{ij}$. After determining the optimal cut-point $cut_{i(j)j}$ for attribute $A_j$, Eq. (18) is used to calculate the information gain ratio of $A_j$. In fact, the gain ratio is a compensation for the information gain, which reduces the bias towards high-branching attributes by considering the intrinsic information [33] in Eq. (17), i.e., the split information of the optimal cut-point $cut_{i(j)j}$ of $A_j$. Thus, the gain ratio is the normalization of the information gain by the intrinsic information.
For the given node $X$, $\mathrm{Info}(X)$ in Eq. (15) is a constant that can be neglected; thus, the information gain of $cut_{ij}$ is determined only by $\mathrm{Info}(X_{ij1})$ and $\mathrm{Info}(X_{ij2})$. Taking $\mathrm{Info}(X_{ij1})$ as an example, we now apply the two types of uncertainty, i.e., information entropy and ambiguity, to its computation:
1. The entropy-based heuristic measure, denoted by $\mathrm{Info}_E(X_{ij1})$, is calculated as
$$\mathrm{Info}_E(X_{ij1}) = -\sum_{l=1}^{m} p_{ijl}^{(1)} \log_2 p_{ijl}^{(1)}, \tag{20}$$
where $p_{ijl}^{(1)}$ is the probability of the $l$-th class in $X_{ij1}$.
2. The ambiguity-based heuristic measure, denoted by $\mathrm{Info}_A(X_{ij1})$, is calculated as
$$\mathrm{Info}_A(X_{ij1}) = \frac{1}{m}\sum_{l=1}^{m}\left(p_{ijl}^{(1*)} - p_{ij,l+1}^{(1*)}\right)\ln(l), \tag{21}$$
where $(p_{ij1}^{(1*)}, p_{ij2}^{(1*)}, \ldots, p_{ijm}^{(1*)})$ is the normalization of $(p_{ij1}^{(1)}, p_{ij2}^{(1)}, \ldots, p_{ijm}^{(1)})$ with $1 = p_{ij1}^{(1*)} \geq p_{ij2}^{(1*)} \geq \ldots \geq p_{ijm}^{(1*)} \geq p_{ij,m+1}^{(1*)} = 0$.

According to the analysis in Section 2, we have
$$0 \leq \mathrm{Info}_E(X_{ij1}) \leq \log_2(m), \qquad 0 \leq \mathrm{Info}_A(X_{ij1}) \leq \ln(m). \tag{22}$$
By applying (22) to (15), we have
$$0 \leq I_E(cut_{ij}) \leq \log_2(m), \qquad 0 \leq I_A(cut_{ij}) \leq \ln(m). \tag{23}$$
Obviously, log2 (m) and ln(m) are the maxima of IE (cutij ) and IA (cutij ), which can generate the most uncertain
partitions of X.
With the above-mentioned heuristic measures, Algorithm 2 is iteratively applied to the generated child nodes, and a DT is finally constructed. However, without a proper stopping criterion, this procedure continues until no node can be split further, which may lead to an over-partitioned tree. In order to resolve this over-partitioning problem, we propose the ELM-Tree model. If the largest class probability in $X$, i.e., $\max\{p_i\}_{i=1}^{m}$, is larger than a given threshold $\theta$, or the size of $X$, i.e., $|X|$, is smaller than a given number $N^*$, then $X$ is considered a leaf node, and its decision label is determined as $\mathrm{argmax}_i\{p_i\}_{i=1}^{m}$. Otherwise, further partitioning is performed on $X$. However, if the expected information $I(cut_{ij})$ of every cut-point, $i = 1, \ldots, N-1$, $j = 1, \ldots, n$, is smaller than a given uncertainty coefficient $\varepsilon$, an ELM is introduced to classify the instances in $X$. The induction of the ELM-Tree is described in Algorithm 3. Obviously, there are two important parameters in Algorithm 3, i.e., the truth level threshold $\theta \in [0, 1]$ and the uncertainty coefficient $\varepsilon \in [0, 1]$. For implementation, we further modify the uncertainty coefficient to $\varepsilon \times \log_2(m)$ for the entropy-based ELM-Tree and $\varepsilon \times \ln(m)$ for the ambiguity-based ELM-Tree, where $m$ is the number of classes in $X$.
Algorithm 3: Construct an ELM-Tree.
Input: A training set with N instances, n attributes and m classes; truth level threshold θ ∈ [0, 1]; uncertainty coefficient ε ∈ [0, 1]; and integer parameter N* ∈ {1, ..., N}.
Output: An ELM-Tree.
1.  Ω is initialized as an empty set;
2.  Consider the original training set as the root node, and add it to Ω;
3.  while Ω is not empty do
4.      Select one node from Ω, denoted by X;
5.      if max{p_i}_{i=1}^m > θ or |X| < N* then
6.          Assign X a label, i.e., argmax_i {p_i}_{i=1}^m, and remove it from Ω;
7.      else
8.          if I(cut_ij) < ε for i = 1, ..., N − 1, j = 1, ..., n then
9.              Train an ELM for X and remove it from Ω;
10.         else
11.             Split X into two child nodes X_1 and X_2 by Algorithm 2;
12.             Remove X from Ω, and add X_1 and X_2 to Ω;
13.         end
14.     end
15. end
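To make the induction loop concrete, the sketch below gives a recursive Python rendering of Algorithm 3 with the entropy-based measure and the implementation-level threshold ε × log2(m). It is our own illustration, reusing the info, best_split and elm_train helpers sketched above, and is not the authors' code.

```python
import numpy as np

def build_elm_tree(X, y, theta=0.90, eps=0.05, n_min=5, n_hidden=20):
    """Algorithm 3: grow an ELM-Tree, returning a nested dict of nodes."""
    classes, counts = np.unique(y, return_counts=True)
    if counts.max() / counts.sum() > theta or len(y) < n_min:        # lines 5-6: ordinary leaf
        return {"type": "leaf", "label": classes[counts.argmax()]}
    # lines 8-9: expected information I(cut_ij) of every candidate cut-point
    n, infos = len(y), []
    for j in range(X.shape[1]):
        vals = np.sort(np.unique(X[:, j]))
        for cut in (vals[:-1] + vals[1:]) / 2.0:
            left = X[:, j] <= cut
            infos.append((left.sum() / n) * info(y[left])
                         + ((~left).sum() / n) * info(y[~left]))
    if not infos or max(infos) < eps * np.log2(len(classes)):        # all splits too uncertain
        Y = np.eye(len(classes))[np.searchsorted(classes, y)]        # one-hot targets
        return {"type": "elm_leaf", "model": elm_train(X, Y, n_hidden), "classes": classes}
    j, cut, _ = best_split(X, y)                                     # lines 11-12: split by Algorithm 2
    left = X[:, j] <= cut
    return {"type": "internal", "attr": j, "cut": cut,
            "left":  build_elm_tree(X[left],  y[left],  theta, eps, n_min, n_hidden),
            "right": build_elm_tree(X[~left], y[~left], theta, eps, n_min, n_hidden)}
```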
Fig. 2 gives an illustrative structure of ELM-Tree. There are two types of leaf nodes in this tree, i.e., ELM leaf
nodes and non-ELM leaf nodes. Obviously, LN1 , LN2 and LN4 are non-ELM leaf nodes with a certain decision label,
while LN3 , LN5 , LN6 , LN7 and LN8 are ELM leaf nodes which classify the instances with an ELM classifier.
Fig. 2. The structure of ELM-Tree.
Fig. 3. Different partitions to a sample set by using C4.5 and ELM-Tree (positive training sample; • negative training sample; positive testing sample).
We now explain the main difference between the ELM-Tree and the C4.5 tree by observing Fig. 3. Consider a dataset X that contains 6 positive instances and 3 negative instances. Fig. 3(a) demonstrates the partition of C4.5 on X. This partition includes 5 leaf nodes, and the training accuracy is 1. However, a wrong prediction is made on the new testing instance. Fig. 3(b) demonstrates a partition of X by the ELM-Tree. In this partition, an ELM is trained on X and a curve is obtained as the classification boundary, which correctly predicts the new instance. Besides, there are 5 leaf nodes in C4.5, while the ELM-Tree has just one leaf node. In addition, a real numerical example on selected instances from the Auto Mpg dataset (details in Table 2) is given to show the process of generating an ELM leaf node. In this example, θ is set to 0.90 and ε is set to 0.08. There are 27 instances in X, as listed in Table 1, and the largest class probability of X is 0.667, which is smaller than θ. It can also be seen from Table 1 that the entropy-based expected information I(cut_ij) of every cut-point is smaller than ε × log2(m). Note that in Table 1, the repetitive values of I(cut_ij) are deleted. Thus, X can be regarded as an ELM leaf node and an ELM classifier is trained.
3.3. Time complexity
We now analyze the time complexity of the ELM-Tree. For a training dataset with N instances and n continuous attributes, the training complexities of ELM and the C4.5 DT are O(N) [15] and O(nN log2 N) [22], respectively. Since it costs O(nN log2 N) to partition the training dataset iteratively and O(nN) to train the determined ELM leaf nodes, we can conclude that the complexity of the ELM-Tree model is O(nN log2 N) + O(nN) = O(nN log2 N).
4. Parallelization of ELM-Tree
In Algorithm 3, the computations of information gain and gain ratio for different cut-points are all independent,
which could be finished in parallel. Besides, the main task of ELM is to calculate the Moore–Penrose generalized
inverse of the hidden layer output matrix H, i.e.,
$$\mathbf{H}^{\dagger} = \left(\mathbf{H}^T \mathbf{H}\right)^{-1} \mathbf{H}^T, \tag{24}$$
where $\mathbf{H}^T$ is the transpose of $\mathbf{H}$. Thus,
$$\boldsymbol{\beta} = \left(\mathbf{H}^T \mathbf{H}\right)^{-1} \mathbf{H}^T \mathbf{Y}. \tag{25}$$

Table 1
Example for generating an ELM leaf node in ELM-Tree.

X     A1      A2      A3      A4      A5      Class
x1    0.7101  0.0775  0.1848  0.0856  0.3810  1
x2    0.4521  0.1395  0.1848  0.2376  0.5060  1
x3    0.3723  0.1395  0.2174  0.1721  0.3571  1
x4    0.5053  0.0775  0.1848  0.1562  0.4167  1
x5    0.4787  0.1137  0.2283  0.2912  0.6310  1
x6    0.4255  0.0762  0.1848  0.1454  0.5357  1
x7    0.3191  0.1395  0.2174  0.1738  0.5060  1
x8    0.6516  0.0775  0.2011  0.1310  0.4702  1
x9    0.5585  0.1111  0.1848  0.1537  0.4048  1
x10   0.4894  0.1370  0.1848  0.2997  0.4167  1
x11   0.5851  0.1137  0.2120  0.2728  0.4881  1
x12   0.3457  0.1395  0.2174  0.2217  0.4762  1
x13   0.6649  0.1137  0.2283  0.2217  0.5952  1
x14   0.5053  0.1344  0.1793  0.2869  0.6310  1
x15   0.2660  0.1395  0.2120  0.1976  0.6250  1
x16   0.5851  0.1318  0.1957  0.3139  0.6786  1
x17   0.5053  0.1137  0.2283  0.2813  0.6905  1
x18   0.4521  0.0775  0.1793  0.1820  0.5774  1
x19   0.3191  0.1344  0.2228  0.3873  0.6845  2
x20   0.4787  0.0853  0.2011  0.1670  0.4345  2
x21   0.4255  0.1240  0.1902  0.1721  0.5298  2
x22   0.6915  0.1395  0.2283  0.2515  0.4226  2
x23   0.3723  0.1344  0.2283  0.3811  0.5357  2
x24   0.5319  0.0775  0.2011  0.1718  0.5060  2
x25   0.3457  0.1370  0.1630  0.2546  0.5952  2
x26   0.4255  0.1085  0.2228  0.3003  0.5655  2
x27   0.5053  0.1008  0.2174  0.2413  0.4464  2

max{p1, p2} = 0.667 < θ. Entropy-based expected information I(cut_ij) of the cut-points of A1–A5 (repetitive values deleted); every value satisfies I(cut_ij) < ε × log2(m):
0.0058, 0.0008, 0.0026, 0.0136, 0.0047, 0.0084, 0.0048, 0.0025, 0.0119, 0.0073, 0.0011, 0.0018, 0.0058, 0.0069,
0.0088, 0.0008, 0.0007, 0.0061, 0.0026, 0.0025, 0.0006, 0.0031, 0.0088, 0.0199, 0.0294, 0.0116, 0.0176, 0.0026,
0.0107, 0.0203, 0.0010, 0.0048, 0.0099, 0.0152, 0.0208, 0.0267, 0.0061, 0.0006, 0.0005, 0.0018, 0.0040, 0.0109,
0.0158, 0.0071, 0.0018, 0.0005, 0.0022, 0.0054, 0.0108, 0.0201, 0.0092, 0.0274, 0.0132, 0.0052, 0.0106, 0.0163,
0.0287, 0.0066, 0.0006, 0.0005, 0.0005, 0.0019, 0.0019, 0.0005, 0.0049, 0.0023, 0.0066, 0.0033, 0.0016, 0.0052.
For big data problems, the matrices H, H^T H and H^T Y are too large, which makes their computation infeasible. In this case, their calculations can also be carried out in parallel.
Assume there are M + 1 computers in a parallel system: one computer is the host node and the other M computers are computing nodes. The host node is in charge of assigning tasks and collecting results [35]. Basically, parallel computation can be applied to five components of the ELM-Tree, which reduces its training complexity to $O(\frac{nN\log_2 N}{M})$. The five components are introduced as follows:
1. Calculation of information gain. We simply denote the information gain of attribute A_j and its i-th cut-point as Gain_ij, where i = 1, ..., N − 1 and j = 1, ..., n. Since M ≪ N, the host node first assigns the calculations of Gain_{1j}, Gain_{2j}, ..., Gain_{Mj} to the M computing nodes. Once any of the M calculations is finished, the host node stores the result and assigns the remaining calculations Gain_{M+1,j}, Gain_{M+2,j}, ..., Gain_{N−1,j} in order to the free computing nodes. When the computation on A_j is finished, the computation on A_{j+1} begins.
2. Calculation of gain ratio. We simply denote the gain ratio of attribute A_j as Ratio_j, where j = 1, ..., n. When n ≤ M, the calculations of Ratio_1, ..., Ratio_n are directly finished in parallel. When n > M, a procedure similar to that used for the information gain is applied.
3. Calculation of H_{N×Ñ}. For big data classification, N is a very large number, so the N × Ñ matrix can be decomposed into N disjoint 1 × Ñ sub-matrices. The calculations of these N sub-matrices are assigned to the M computing nodes in turn. For the i-th sub-matrix, the computation of g(w_j · x_i + b_j), j = 1, ..., Ñ, is carried out Ñ times. Once a computing node is free, the host node assigns a new sub-matrix to it until all the calculations are finished.
4. Calculation of H^T H_{Ñ×Ñ}. We denote $\mathbf{H}^T\mathbf{H}_{\tilde{N}\times\tilde{N}}$ as $[hh_{ij}]_{\tilde{N}\times\tilde{N}}$, where
$$hh_{ij} = \sum_{k=1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j), \quad i, j = 1, \ldots, \tilde{N}. \tag{26}$$
The host node splits the computation of $hh_{ij}$ into the summation of M disjoint units, i.e.,
$$hh_{ij} = hh_{ij}^{(1)} + hh_{ij}^{(2)} + \ldots + hh_{ij}^{(M)}, \tag{27}$$
where
$$hh_{ij}^{(1)} = \sum_{k=1}^{\frac{N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j), \quad
hh_{ij}^{(2)} = \sum_{k=\frac{N}{M}+1}^{\frac{2N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j), \quad \ldots, \quad
hh_{ij}^{(M)} = \sum_{k=(M-1)\frac{N}{M}+1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, g(\mathbf{w}_j \cdot \mathbf{x}_k + b_j). \tag{28}$$
Then, the calculations of $hh_{ij}^{(1)}, hh_{ij}^{(2)}, \ldots, hh_{ij}^{(M)}$ are finished on the M computing nodes respectively.
5. Calculation of H^T Y_{Ñ×m}. We denote $\mathbf{H}^T\mathbf{Y}_{\tilde{N}\times m}$ as $[hy_{ij}]_{\tilde{N}\times m}$, where
$$hy_{ij} = \sum_{k=1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}, \quad i = 1, \ldots, \tilde{N},\ j = 1, \ldots, m. \tag{29}$$
The host node splits the computation of $hy_{ij}$ into the summation of M disjoint units, i.e.,
$$hy_{ij} = hy_{ij}^{(1)} + hy_{ij}^{(2)} + \ldots + hy_{ij}^{(M)}, \tag{30}$$
where
$$hy_{ij}^{(1)} = \sum_{k=1}^{\frac{N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}, \quad
hy_{ij}^{(2)} = \sum_{k=\frac{N}{M}+1}^{\frac{2N}{M}} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}, \quad \ldots, \quad
hy_{ij}^{(M)} = \sum_{k=(M-1)\frac{N}{M}+1}^{N} g(\mathbf{w}_i \cdot \mathbf{x}_k + b_i)\, y_{kj}. \tag{31}$$
Then, the calculations of $hy_{ij}^{(1)}, hy_{ij}^{(2)}, \ldots, hy_{ij}^{(M)}$ are finished on the M computing nodes respectively (a small sketch of this partial-sum scheme is given after Table 2 below).

Table 2
The specification of 15 small benchmark datasets.

No.  Dataset               # Attribute  # Class  Class distribution                     # Instance
1    Auto Mpg              5            3        245/79/68                              392
2    Breast Cancer         10           2        458/241                                699
3    Breast Cancer W-D     30           2        357/212                                569
4    Breast Cancer W-P     33           2        151/47                                 198
5    Credit Approval       15           2        383/307                                690
6    Glass Identification  9            7        76/70/29/17/13/9/0                     214
7    Ionosphere            33           2        225/126                                351
8    New Thyroid Gland     5            3        150/35/30                              215
9    Parkinsons            22           2        147/48                                 195
10   Pima Indian Diabetes  8            2        500/268                                768
11   Sonar                 60           2        111/97                                 208
12   SPECTF Heart          44           2        212/55                                 267
13   Vehicle Silhouettes   18           4        218/217/212/199                        846
14   Wine                  13           3        91/59/48                               178
15   Yeast                 8            10       463/429/244/163/51/44/35/30/20/5       1484
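The partial-sum scheme of Eqs. (26)–(31) can be illustrated with the following Python sketch, which splits the instances into M chunks and accumulates the per-chunk contributions to H^T H and H^T Y before solving for β as in Eq. (25). This is our own single-machine multiprocessing illustration; the paper itself distributes the same sums over M computing nodes coordinated by a host node via MPI.

```python
import numpy as np
from multiprocessing import Pool

def _partial_sums(args):
    """One computing node: partial H^T H and H^T Y over a chunk of instances, Eqs. (28) and (31)."""
    X_chunk, Y_chunk, W, b = args
    H = 1.0 / (1.0 + np.exp(-(X_chunk @ W.T + b)))   # this chunk's rows of the hidden layer output matrix
    return H.T @ H, H.T @ Y_chunk

def parallel_output_weights(X, Y, W, b, M=4):
    """Accumulate the M partial sums and solve (H^T H) beta = H^T Y, cf. Eq. (25)."""
    chunks = list(zip(np.array_split(X, M), np.array_split(Y, M), [W] * M, [b] * M))
    with Pool(M) as pool:
        parts = pool.map(_partial_sums, chunks)
    HtH = sum(p[0] for p in parts)                   # Eq. (27)
    HtY = sum(p[1] for p in parts)                   # Eq. (30)
    return np.linalg.solve(HtH, HtY)                 # assumes H^T H is invertible
```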
5. Experimental comparisons
In this section, we first validate the feasibility of the sequential ELM-Trees on 15 small datasets [27] (small datasets are used because of the running time of the sequential algorithms) and then demonstrate the effectiveness of the parallel ELM-Trees on 4 big datasets. The main objective of testing the performance of the sequential ELM-Trees on small datasets is to check the impact of the learning parameters (i.e., θ and ε). For each UCI benchmark dataset listed in Tables 2 and 10, the nominal attributes are discarded, and the missing values are replaced by the unsupervised filter ReplaceMissingValues in Weka 3.6 [33]. In addition, the sequential algorithms are implemented in Matlab and the parallel algorithms are implemented in the C language. The experiments are conducted on a PC with the Windows XP operating system, a Pentium 4 2.8 GHz CPU, and 2 GB of RAM.
5.1. Impact of uncertainty parameters θ and ε on ELM-Tree
We first investigate the effect of the parameter ε on the training time, testing time, training accuracy, testing accuracy and scale of the tree. The value of θ in Algorithm 3 is fixed at 0.90, N* is set to 5, and ε ranges from 0.01 to 0.15 with a step of 0.01. For each ε, we conduct 10-fold cross-validation 10 times on a representative dataset, i.e., Breast Cancer W-P. Besides, the number of hidden nodes in ELM1 is set to 20.
1 http://www.ntu.edu.sg/home/egbhuang/elm-codes.html.
Fig. 4. The impact of uncertainty coefficient on performances of entropy-based ELM-Tree in Breast Cancer W-P dataset (θ = 0.90).
Fig. 5. The impact of uncertainty coefficient on performances of ambiguity-based ELM-Tree in Breast Cancer W-P dataset (θ = 0.90).
Fig. 4 and Fig. 5 depict the effects of ε on the performances of the entropy-based ELM-Tree and the ambiguity-based ELM-Tree respectively. Three observations can be made. First, as shown in Fig. 4(a) and Fig. 5(a), the training time gradually decreases as ε increases, while the change in testing time is not obvious. Second, as shown in Fig. 4(b) and Fig. 5(b), the training accuracy gradually decreases as ε increases, while the testing accuracy first increases and then fluctuates. Third, from Fig. 4(c) and Fig. 5(c), we can see that the numbers of all types of nodes decrease obviously as ε increases.
It is clear that the larger ε is, the earlier the induction stops, so fewer nodes (including ELM leaf nodes) are generated; thus, the increase of ε leads to a reduction of the tree size. For example, assume there are three candidate attributes A1, A2 and A3 in the current node, and each attribute has m candidate cut-points: cut_11, ..., cut_1m for A1; cut_21, ..., cut_2m for A2; and cut_31, ..., cut_3m for A3. The information gains of these candidate cuts are Gain_11, ..., Gain_1m; Gain_21, ..., Gain_2m; and Gain_31, ..., Gain_3m, respectively. If there exists any Gain_ij > ε, further partitioning of this node is needed, which increases the number of nodes. However, if a larger ε is given such that Gain_ij < ε for all i = 1, 2, 3 and j = 1, ..., m, an ELM will be trained, and the number of nodes will be smaller. Besides, the training accuracy and testing accuracy of the classical C4.5 algorithm on the Breast Cancer W-P dataset are 0.981 and 0.631 respectively, while the average training accuracy and testing accuracy of the ELM-Tree are 0.834 and 0.778 respectively, which demonstrates that the ELM-Tree model significantly reduces the degree of over-fitting.
Then, we fix the value of ε at 0.05 and investigate the effect of θ. Fig. 6 and Fig. 7 depict the effects of θ on the performances of the entropy-based ELM-Tree and the ambiguity-based ELM-Tree respectively. From these figures, we can see that θ has no obvious influence on the accuracy, time, or tree scale. The classification truth levels of some expanded nodes in the ELM-Tree are usually smaller than θ; that is to say, a smaller θ cannot increase the probability of generating a non-ELM leaf node. The termination of the induction process mainly depends on the number of instances falling into the expanded nodes or on the uncertainty coefficient ε.
Overall, the uncertainty coefficient ε has a great influence on the performance of the ELM-Tree, while the impact of θ is not obvious.
Fig. 6. The impact of truth level on performances of entropy-based ELM-Tree in Breast Cancer W-P dataset (ε = 0.05).
Fig. 7. The impact of truth level on performances of ambiguity-based ELM-Tree in Breast Cancer W-P dataset (ε = 0.05).
5.2. Comparisons among ELM-Trees, C4.5, and ELM
In this section, we compare the performances of the entropy-based ELM-Tree, the ambiguity-based ELM-Tree, C4.5,2 and ELM. We conduct 10-fold cross-validation 10 times and observe the average results. Three pairs of (θ, ε) are tested, i.e., (0.90, 0.01), (0.90, 0.05) and (0.90, 0.1).
Table 3 and Table 4 respectively report the training accuracy and testing accuracy of the compared methods. Several observations can be made.
1. First, C4.5 shows a high degree of over-fitting (high training accuracy and low testing accuracy), which is mainly caused by the over-partitioning of tree nodes, while ELM does not suffer from this problem. Thus, incorporating ELM into C4.5 can greatly reduce the degree of over-fitting of C4.5.
2. Second, the training accuracy of the entropy-based ELM-Tree is higher than that of the ambiguity-based ELM-Tree, while its testing accuracy is significantly lower. The reason for this observation can be discovered by investigating the number of nodes induced by the trees, which is reported in Table 7. Taking (θ, ε) = (0.90, 0.01) as an example, it can be calculated from Table 7 that the ratios of ELM leaf nodes to all the leaf nodes in the ambiguity-based ELM-Tree are (0.56, 0.78, 0.42, 0.57, 0.62, 0.40, 0.67, 0.71, 0.63, 0.53, 0.43, 0.73, 0.57, 0.70, 0.42) on these 15 datasets, while the corresponding ratios of the entropy-based ELM-Tree are (0.31, 0.47, 0.56, 0.41, 0.32, 0.24, 0.38, 0.44, 0.50, 0.31, 0.56, 0.52, 0.28, 0.57, 0.17), which are much lower. That is to say, the ambiguity-based ELM-Tree sacrifices a certain degree of training accuracy, but the testing instances wrongly classified by the non-ELM leaf nodes of the entropy-based ELM-Tree may be correctly classified by the ELM leaf nodes of the ambiguity-based ELM-Tree. As a result, the ambiguity-based ELM-Tree obtains a lower training accuracy but a higher testing accuracy.
2 http://read.pudn.com/downloads139/sourcecode/math/599191/c4.5matlab/UseC45.m.htm.
Table 3
Comparison of training accuracy among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
avg.
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
C4.5
ELM
0.807
0.970
0.971
0.834
0.767
0.736
0.885
0.934
0.884
0.786
0.794
0.796
0.670
0.997
0.602
0.973
0.990
0.998
0.981
0.965
0.954
0.997
0.988
0.991
0.972
0.994
0.990
0.954
0.996
0.899
0.804
0.969
0.966
0.834
0.767
0.728
0.885
0.940
0.892
0.784
0.803
0.795
0.670
0.999
0.602
0.829
0.976
0.829
C4.5
ELM
0.819
0.937
0.921
0.631
0.693
0.635
0.889
0.917
0.816
0.702
0.717
0.741
0.666
0.921
0.500
0.765
0.964
0.963
0.778
0.746
0.637
0.866
0.902
0.851
0.772
0.756
0.791
0.627
0.978
0.589
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.891
0.976
0.973
0.906
0.861
0.885
0.921
0.959
0.920
0.893
0.837
0.851
0.810
0.993
0.775
0.887
0.974
0.970
0.883
0.848
0.804
0.911
0.956
0.911
0.839
0.816
0.855
0.769
0.992
0.694
0.889
0.968
0.969
0.846
0.776
0.731
0.884
0.956
0.899
0.790
0.817
0.807
0.747
0.991
0.597
0.815
0.970
0.971
0.837
0.771
0.794
0.891
0.942
0.888
0.788
0.841
0.801
0.692
0.999
0.624
0.807
0.969
0.967
0.826
0.768
0.749
0.889
0.934
0.898
0.785
0.824
0.795
0.679
0.998
0.598
0.897
0.874
0.844
0.842
0.832
Note: For each dataset, the highest training accuracy is in bold face.
Table 4
Comparison of testing accuracy among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
avg.
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.801
√
0.950
√
0.933
√
0.667
0.680
0.631
0.858
0.908
0.811
√
0.706
0.693
√
0.772
0.609
√
0.939
√
0.510
0.791
√
0.953
√
0.951
√
0.717
√
0.707
√
0.654
0.832
0.912
√
0.862
√
0.737
0.688
√
0.761
0.632
√
0.961
√
0.557
0.801
√
0.961
√
0.961
√
0.788
√
0.735
0.624
0.861
0.917
√
0.832
√
0.771
0.697
√
0.775
0.619
√
0.961
√
0.582
0.763
√
0.960
√
0.958
√
0.777
√
0.748
√
0.637
0.846
0.912
√
0.862
√
0.764
√
0.750
√
0.783
0.634
√
0.983
√
0.585
0.773
√
0.961
√
0.965
√
0.753
√
0.759
0.612
0.852
0.889
√
0.842
√
0.771
0.688
√
0.794
0.632
√
0.972
√
0.583
0.791
√
0.969
√
0.954
√
0.762
√
0.746
√
0.641
0.832
0.902
√
0.857
√
0.776
0.716
√
0.798
0.643
√
0.977
√
0.584
0.765
(7/15)
0.781
(10/15)
0.792
(9/15)
0.797
0.790
0.797
0.767
0.799
(11/15)
(9/15)
(10/15)
√
Note: For each dataset, the highest testing accuracy is in bold face. For each result, represents that the method is significantly better than C4.5
by Wilcoxon signed-rank test.
3. Third, we perform Wilcoxon signed-rank tests [4,28] to compare the performances of the ELM-Trees and C4.5. The Wilcoxon signed-rank test is often used as an alternative to the paired t-test. It ranks the absolute values of the differences between the testing accuracies of two classifiers on every run of 10-fold cross-validation and compares the ranks of the positive and negative differences. In our experiments, all statistical comparisons are conducted at the significance level 0.1. As depicted in Table 4, for the three pairs of (θ, ε), i.e., (0.90, 0.01), (0.90, 0.05) and (0.90, 0.1), the entropy-based ELM-Tree outperforms (i.e., is statistically better than) C4.5 in testing accuracy on 7, 10 and 9 datasets out of 15, while the ambiguity-based ELM-Tree outperforms C4.5 on 11, 9 and 10 datasets out of 15, which validates the effectiveness of incorporating ELM into C4.5. Besides, the sign test [4] is performed to validate whether the proposed ELM-Tree is statistically
Table 5
Comparison of training time among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
avg.
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
C4.5
ELM
0.18594
0.29688
4.23906
0.80625
0.42031
0.40156
1.92656
0.08594
0.91563
0.45625
4.54219
0.59375
0.98906
0.42188
0.68750
0.27344
0.24688
3.45469
1.35000
0.83281
0.55625
2.44844
0.08281
0.81563
1.06719
2.30000
0.89687
1.23125
0.25469
2.50156
0.00625
0.00625
0.00469
0.00156
0.00000
0.00156
0.00469
0.00156
0.00156
0.00469
0.00313
0.00156
0.00625
0.00156
0.00625
1.13125
1.22083
0.00344
C4.5
ELM
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.31094
0.26406
4.48281
1.42188
0.91719
0.56563
2.79219
0.10781
0.88125
1.23125
2.72969
1.08750
1.33281
0.30156
2.38437
0.30938
0.24219
3.53594
1.27500
0.82656
0.39844
2.61094
0.10469
0.89063
0.87187
2.75000
1.06250
1.17031
0.28594
1.77500
0.29844
0.15937
2.69375
0.41094
0.35000
0.19687
1.48125
0.10781
0.84062
0.35156
2.01094
0.36719
0.95313
0.30312
0.19219
0.29219
0.39687
7.02031
2.59844
0.62344
0.73594
4.05781
0.14063
1.23594
0.95000
7.37500
1.46250
1.67500
0.51719
1.90469
0.19219
0.31875
4.22344
1.26719
0.52500
0.50313
2.87969
0.10938
0.90781
0.57031
6.76406
0.86094
1.24688
0.45000
1.00469
1.38740
1.20729
0.71448
2.06573
1.45490
Note: For each dataset, the lowest training time is in bold face.
Table 6
Comparison of testing time among ELM-Tree (θ = 0.90), C4.5 tree and ELM on the 15 small benchmark datasets.
No.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
avg.
Dataset
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
ε = 0.01
ε = 0.05
ε = 0.1
ε = 0.01
ε = 0.05
ε = 0.1
0.00781
0.00469
0.00625
0.00156
0.00469
0.00156
0.00313
0.00156
0.00625
0.00625
0.00313
0.00313
0.00781
0.00313
0.01875
0.00000
0.00000
0.00000
0.00000
0.00469
0.00313
0.00469
0.00156
0.00156
0.00313
0.00156
0.00313
0.01250
0.00000
0.00781
0.00313
0.00000
0.00313
0.00000
0.00156
0.00000
0.00000
0.00000
0.00156
0.00313
0.00000
0.00000
0.01094
0.00000
0.00000
0.00313
0.00469
0.00313
0.00156
0.00156
0.00156
0.00000
0.00000
0.00313
0.00156
0.00313
0.00156
0.00781
0.00156
0.01094
0.00000
0.00156
0.00156
0.00000
0.00000
0.00000
0.00156
0.00000
0.00000
0.00156
0.00469
0.00156
0.00625
0.00156
0.00000
0.00156
0.00000
0.00000
0.00156
0.00156
0.00000
0.00000
0.00000
0.00156
0.00156
0.00156
0.00156
0.00469
0.00156
0.00000
0.00313
0.00156
0.00156
0.00000
0.00625
0.00156
0.00000
0.00156
0.00313
0.00937
0.00313
0.00313
0.01250
0.00313
0.02656
0.00000
0.00156
0.00156
0.00000
0.00000
0.00000
0.00156
0.00313
0.00000
0.00156
0.00000
0.00000
0.00000
0.00000
0.00000
0.00531
0.00292
0.00156
0.00302
0.00135
0.00114
0.00510
0.00062
Note: For each dataset, the lowest testing time is in bold face.
meaningful on all the datasets. The sign test assumes that the win number of a given learning model in a specific comparison obeys the normal distribution $N(\frac{m}{2}, \frac{\sqrt{m}}{2})$ under the null hypothesis, where $m$ is the number of used datasets. If the win number is at least $\frac{m}{2} + z_{\alpha/2} \times \frac{\sqrt{m}}{2}$, it can be concluded that the given learning model is statistically meaningful under the significance level $\alpha$. In our study, $m = 15$. Letting $\alpha = 0.1$, we have $\frac{m}{2} + z_{\alpha/2} \times \frac{\sqrt{m}}{2} = \frac{15}{2} + 1.645 \times \frac{\sqrt{15}}{2} \approx 10$. Thus, it can be concluded that the entropy-based ELM-Tree with (θ, ε) = (0.90, 0.05), and the ambiguity-based ELM-Tree with (θ, ε) = (0.90, 0.01) and (θ, ε) = (0.90, 0.1), have significantly better performances than C4.5.
Table 5 and Table 6 respectively report the training and testing time of the compared methods. Basically, ELM is
the fastest method in both training and testing. As shown in Table 5, the training time of ELM-Tree and C4.5 have
no obvious difference. However, from the average results in Table 6, we can see that ELM-Tree has higher testing
efficiency than C4.5 in most cases. Besides, with the same (θ, ε), ambiguity-based ELM-Tree is more efficient than
entropy-based ELM-Tree.
Finally, we make some observations on the scale of the tree from Table 7. It is obvious that the size of the entropy-based ELM-Tree is larger than that of the ambiguity-based ELM-Tree, while both of them are much smaller than C4.5 on all 15 datasets. From Fig. 1(c) and Fig. 1(d), we can see that for the same probability distribution E, Ambig(E) ≤ Entr(E); specifically, when p1 = p2 = ... = pm, Ambig(E) = Entr(E), and otherwise Ambig(E) < Entr(E). For a given attribute A_j and one of its cut-points cut_ij, suppose that I_E(cut_ij) > ε. It is still possible that I_A(cut_ij) < ε, which increases the probability of generating ELM leaf nodes. Thus, the induction process can terminate earlier with the ambiguity-based heuristic, which leads to a smaller ELM-Tree.
5.3. Comparisons among ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT
In this section, we compare the testing accuracies and training times (i.e., the time taken to build the models in Weka 3.6 [33]) of the ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT. All the experimental results in this section are obtained on the Weka 3.6 platform, which is run on a PC with the Windows XP operating system, a Pentium 4 2.8 GHz CPU, and 2 GB of RAM. We also conduct 10-fold cross-validation 10 times and observe the average results corresponding to FT (FT with logistic regression functions at both the leaf and inner nodes), FTLeaves (FT with logistic regression functions only at the leaf nodes), FTInner (FT with logistic regression functions only at the inner nodes), NBTree, LMT, and LMT_FAM+WT. The detailed parameter setups of these learning algorithms in Weka 3.6 are as follows:
1. FT-weka.classifiers.trees.FT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False and
weightTrimBeta = 0.0;
2. FTLeaves-weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False
and weightTrimBeta = 0.0;
3. FTInner-weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False and
weightTrimBeta = 0.0;
4. NBTree-weka.classifiers.trees.NBTree: Default value;
5. LMT-weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC = False and
weightTrimBeta = 0.0;
6. LMTFAM+WT -weka.classifiers.trees.LMT: minNumInstances = 15, numBoostingIterations = 100, useAIC =
True and weightTrimBeta = 0.1.
The comparative results on testing accuracy and training time are summarized in Tables 8 and 9 respectively, where the testing accuracies and training times corresponding to the ELM-Trees are extracted from Tables 4 and 5. Here, we also use the sign test to compare the performances of the ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT.
From Table 8 we can see that, in comparison with the ELM-Trees, FT, FTLeaves, FTInner, NBTree, LMT, and LMT_FAM+WT obtain better performances on 4, 5, 4, 4, 7, and 5 datasets respectively. According to the sign test, one learning algorithm is significantly better than another on 15 test datasets only if the win number reaches 10. Thus, we conclude that these existing model tree variants are not obviously better than the ELM-Trees on the 15 small benchmark datasets listed in Table 2. However, the comparison of training time in Table 9 shows that our proposed ELM-Trees are obviously better than the other model tree variants, i.e., the training times of the ELM-Trees are significantly lower than those of FT, FTLeaves, FTInner, NBTree, LMT, and LMT_FAM+WT. Meanwhile, we plot the variation trend of the training time of all compared algorithms with the change of dataset size in Fig. 8. From Fig. 8 we can see that the training times of FT, FTLeaves, FTInner, LMT, and LMT_FAM+WT keep an obviously ascending trend as the dataset size increases, while the training times of the ELM-Trees and NBTree do not change drastically with the dataset size. This indicates that the sequential FT, FTLeaves, FTInner, LMT, and LMT_FAM+WT are unsuitable for big data classification due to their higher computational complexities, which are caused by the time-consuming LogitBoost algorithm [8] being called repeatedly for a fixed number of iterations under the
Table 7
Comparison of node number between ELM-Tree (θ = 0.90) and C4.5 tree on the 15 small benchmark datasets.
No.
Dataset
Entropy-based ELM-Tree
Ambiguity-based ELM-Tree
ε = 0.1
ε = 0.01
ε = 0.05
C4.5
ε = 0.01
ε = 0.05
ε = 0.1
39
17
16
22
91
38
21
9
14
101
16
21
126
7
266
38
15
12
18
74
19
20
9
12
56
16
21
99
7
150
38
4
6
3
14
3
9
9
11
7
11
5
72
8
1
9
9
12
7
13
30
9
7
8
15
23
11
77
10
79
8
7
6
4
9
14
6
5
7
10
21
7
52
8
13
6
7
6
3
7
6
4
4
7
6
13
4
35
7
5
42
26
16
22
106
40
21
10
17
109
16
22
142
8
350
63
# Leaf node
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
avg.
54
38
13
21
12
8
# ELM leaf node
1
Auto Mpg
2
Breast Cancer
3
Breast Cancer W-D
4
Breast Cancer W-P
5
Credit Approval
6
Glass Identification
7
Ionosphere
8
New Thyroid Gland
9
Parkinsons
10
Pima Indian Diabetes
11
Sonar
12
SPECTF Heart
13
Vehicle Silhouettes
14
Wine
15
Yeast
12
8
9
9
29
9
8
4
7
31
9
11
35
4
45
12
8
7
9
24
6
8
5
6
20
9
10
31
4
30
12
3
4
2
7
2
5
4
5
4
6
3
22
4
1
5
7
5
4
8
12
6
5
5
8
10
8
44
7
33
6
6
4
3
7
8
6
4
5
6
11
5
31
7
9
6
6
4
2
6
5
4
3
5
4
10
4
22
6
4
avg.
15
13
6
11
8
6
77
34
32
42
180
74
41
17
26
201
31
41
251
14
532
75
28
22
36
147
37
39
17
23
111
31
40
197
13
298
76
7
11
5
27
6
17
17
21
13
21
9
142
14
1
16
18
22
14
24
59
16
13
16
29
45
21
152
18
156
14
14
12
7
17
28
11
9
13
19
42
12
103
16
25
11
13
11
4
13
11
7
7
13
10
26
8
68
13
8
125
77
47
65
315
119
60
27
48
325
47
64
423
23
1049
106
74
26
41
23
15
188
# All node
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
avg.
Auto Mpg
Breast Cancer
Breast Cancer W-D
Breast Cancer W-P
Credit Approval
Glass Identification
Ionosphere
New Thyroid Gland
Parkinsons
Pima Indian Diabetes
Sonar
SPECTF Heart
Vehicle Silhouettes
Wine
Yeast
Note: For each dataset, the lowest number is in bold face.
Table 8
Comparison of testing accuracy of ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT on the 15 small benchmark datasets.

No.  Dataset               ELM-Trees  FT     FTLeaves  FTInner  NBTree  LMT    LMT_FAM+WT
1    Auto Mpg              0.801      0.755  0.793     0.745    0.801   0.819  0.819
2    Breast Cancer         0.969      0.959  0.957     0.967    0.963   0.964  0.966
3    Breast Cancer W-D     0.965      0.954  0.953     0.963    0.926   0.963  0.965
4    Breast Cancer W-P     0.788      0.742  0.747     0.747    0.722   0.773  0.793
5    Credit Approval       0.759      0.772  0.767     0.772    0.764   0.772  0.752
6    Glass Identification  0.654      0.626  0.650     0.593    0.696   0.645  0.645
7    Ionosphere            0.858      0.838  0.917     0.858    0.917   0.915  0.901
8    New Thyroid Gland     0.917      0.953  0.953     0.958    0.930   0.953  0.940
9    Parkinsons            0.862      0.815  0.846     0.821    0.903   0.877  0.862
10   Pima Indian Diabetes  0.776      0.751  0.766     0.750    0.760   0.768  0.771
11   Sonar                 0.750      0.764  0.788     0.764    0.755   0.769  0.788
12   SPECTF Heart          0.798      0.768  0.790     0.775    0.723   0.783  0.801
13   Vehicle Silhouettes   0.643      0.729  0.766     0.704    0.650   0.774  0.754
14   Wine                  0.983      0.972  0.972     0.961    0.949   0.972  0.978
15   Yeast                 0.585      0.576  0.582     0.572    0.586   0.590  0.588
Significantly better than ELM-Trees   (4/15) (5/15)    (4/15)   (4/15)  (7/15) (5/15)
Table 9
Comparison of training time (seconds) of ELM-Trees, FTs, NBTree, LMT, and LMT_FAM+WT on the 15 small benchmark datasets.

No.  Dataset               ELM-Trees  FT     FTLeaves  FTInner  NBTree  LMT     LMT_FAM+WT
1    Auto Mpg              0.19       0.91   1.05      0.91     0.33    5.66    4.97
2    Breast Cancer         0.16       0.89   1.34      0.89     0.66    7.06    4.25
3    Breast Cancer W-D     2.69       1.08   2.23      1.31     4.84    11.34   3.38
4    Breast Cancer W-P     0.41       1.13   1.24      1.08     0.15    5.42    3.55
5    Credit Approval       0.35       1.31   1.52      1.31     0.39    8.58    12.13
6    Glass Identification  0.20       2.52   1.81      1.92     0.42    10.22   2.75
7    Ionosphere            1.48       1.17   2.14      1.41     4.14    11.02   5.5
8    New Thyroid Gland     0.11       0.39   0.42      0.16     0.19    2.52    0.88
9    Parkinsons            0.84       0.48   0.63      0.69     0.89    3.30    2.28
10   Pima Indian Diabetes  0.35       1.38   1.98      1.39     0.39    12.16   10.33
11   Sonar                 2.01       1.14   1.78      1.14     3.72    9.25    4.00
12   SPECTF Heart          0.37       0.91   1.75      1.22     3.58    11.67   13.41
13   Vehicle Silhouettes   0.95       6.25   9.30      6.72     2.70    51.48   24.57
14   Wine                  0.29       0.27   0.50      0.63     0.28    2.47    0.56
15   Yeast                 0.19       38.64  28.22     29.48    1.33    160.76  37.20
Significantly better than ELM-Trees   (5/15) (3/15)    (4/15)   (2/15)  (0/15)  (0/15)
framework of 5-fold cross-validation. Although NBTree's training time is also not obviously affected by the dataset size, its computational time is significantly higher than that of the ELM-Trees according to the statistical result obtained with the sign test.
5.4. Parallel performances of ELM-Tree
In this section, we test the performance (i.e., running time, speedup and scaleup [11]) of the parallel ELM-Tree on four big datasets, which are constructed from four UCI benchmark datasets [27]: Magic Telescope, Image Segment, Page Blocks and Wine Quality-White. Taking Magic Telescope as an example, we introduce how to generate a big dataset based on it (a sketch of this procedure is given after Table 10). For every instance (x1, x2, ..., xn, y) in Magic Telescope, we generate 1000 new instances according to the following operation: (x1 ± e_i1, x2 ± e_i2, ..., xn ± e_in), where e_ij, i = 1, 2, ..., 1000, j = 1, 2, ..., n, is a random number obeying the uniform distribution U(0, 0.01). After all the 19 020 instances in Magic Telescope are considered, a big Magic Telescope dataset with 19 020 000 instances is obtained. Similarly, the big Image Segment, big Page
Fig. 8. The variation trend of training time of different model tree algorithms with the change of datasets’ size.
Table 10
The specification of 4 big datasets.
Dataset
# Attribute
# Class
Class distribution
# Instance
Image Segment
Magic Telescope
Page Blocks
Wine Quality-White
19
10
10
11
7
2
5
6
330 000 × 7
12 332 000/6 688 000
4 913 000/329 000/115 000/88 000/28 000
2 198 000/1 457 000/880 000/175 000/163 000/20 000
2 310 000
19 020 000
5 473 000
4 898 000
Blocks and big Wine Quality-White datasets with 2 310 000, 5 473 000 and 4 898 000 instances can also be generated.
The details of these four big datasets are listed in Table 10.
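As a concrete illustration of the enlargement step described above, the following C sketch (not the authors' code) reads instances from standard input and writes 1000 perturbed copies of each, adding or subtracting a $U(0, 0.01)$ noise term per attribute while keeping the class label. The per-attribute random sign, the whitespace-separated text format, and the fixed attribute count are assumptions made for this example.

```c
/* Enlarge a dataset by perturbing every instance 1000 times. */
#include <stdio.h>
#include <stdlib.h>

#define N_ATTR   10      /* e.g., Magic Telescope has 10 attributes */
#define N_COPIES 1000    /* new instances generated per original instance */

/* uniform random number in [0, 0.01) */
static double noise(void)
{
    return 0.01 * ((double)rand() / ((double)RAND_MAX + 1.0));
}

int main(void)
{
    double x[N_ATTR];
    int y;

    srand(1u);  /* fixed seed so the enlarged dataset is reproducible */

    /* read "x1 ... xn y" per instance from stdin, write enlarged data to stdout */
    for (;;) {
        for (int j = 0; j < N_ATTR; j++)
            if (scanf("%lf", &x[j]) != 1) return 0;
        if (scanf("%d", &y) != 1) return 0;

        for (int i = 0; i < N_COPIES; i++) {
            for (int j = 0; j < N_ATTR; j++) {
                /* x_j + e_ij or x_j - e_ij, with e_ij ~ U(0, 0.01) */
                double sign = (rand() % 2 == 0) ? 1.0 : -1.0;
                printf("%g ", x[j] + sign * noise());
            }
            printf("%d\n", y);
        }
    }
}
```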
We adopt the ambiguity-based heuristic measure and implement it with 1 host computer and 8 servant computers. Each computer has a Pentium 4 Xeon 3.06 GHz CPU and 512 MB RAM, and runs the Red Hat Linux 9.0 operating system. The parallel system is configured based on the message passing interface (MPI) programming standard with the C language. For simplicity, parallel computation is applied to two components, i.e., the information gain and the gain ratio. The recorded time includes only the computing time and neglects the communication time among the computers.
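The paper does not list its MPI code, but the data-parallel pattern it describes, where each computer evaluates split statistics on its own slice of the data and the host aggregates them, can be sketched as follows. The toy data, the single attribute, the fixed cut-point, and the two-class setting are assumptions for illustration only; they are not the authors' implementation.

```c
/* Sketch: each process counts classes on both sides of a candidate cut-point;
 * the host sums the counts with MPI_Reduce and computes the information gain. */
#include <mpi.h>
#include <stdio.h>
#include <math.h>

#define N_CLASS 2

static double entropy(const long cnt[N_CLASS], long total)
{
    double h = 0.0;
    for (int c = 0; c < N_CLASS; c++)
        if (cnt[c] > 0) {
            double p = (double)cnt[c] / (double)total;
            h -= p * log2(p);
        }
    return h;
}

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* toy local slice: one attribute value and one class label per instance */
    double attr[4]  = { 0.1 * rank, 0.2, 0.6, 0.9 };
    int    label[4] = { 0, 0, 1, 1 };
    double cut = 0.5;                        /* candidate cut-point */

    long local[2 * N_CLASS] = { 0 };         /* [left counts | right counts] */
    for (int i = 0; i < 4; i++) {
        int side = attr[i] <= cut ? 0 : 1;
        local[side * N_CLASS + label[i]]++;
    }

    long global[2 * N_CLASS] = { 0 };
    MPI_Reduce(local, global, 2 * N_CLASS, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {                         /* host computes the information gain */
        long nl = global[0] + global[1], nr = global[2] + global[3], n = nl + nr;
        long all[N_CLASS] = { global[0] + global[2], global[1] + global[3] };
        double gain = entropy(all, n)
                    - ((double)nl / n) * entropy(&global[0], nl)
                    - ((double)nr / n) * entropy(&global[2], nr);
        printf("information gain at cut %.2f: %.4f (on %d processes)\n",
               cut, gain, size);
    }
    MPI_Finalize();
    return 0;
}
```

The same reduce-then-evaluate pattern applies to each candidate cut-point and to the gain ratio, which only requires the additional split-information term at the host.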
Besides, we also use the criteria of speedup and scaleup [11] to measure the performance. Speedup measures how much faster a parallel algorithm is than the corresponding sequential algorithm, and can be expressed by
\[
\text{Speedup} = \frac{\text{Computing time on 1 computer}}{\text{Computing time on } M \text{ computers}}. \tag{32}
\]
Scaleup measures the ability of an M-times larger system to perform an M-times larger task in the same computing time as the original system, i.e., how much more work can be done in the same time period by the parallel system. Scaleup can be expressed by
\[
\text{Scaleup} = \frac{\text{Computing time for processing data on 1 computer}}{\text{Computing time for processing } M \times \text{data on } M \text{ computers}}. \tag{33}
\]
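To make Eqs. (32) and (33) concrete, the short program below evaluates both criteria from made-up timings; the numbers are placeholders for illustration, not measurements from Table 11.

```c
/* Tiny worked example of the speedup and scaleup definitions. */
#include <stdio.h>

int main(void)
{
    int    M      = 8;
    double t1     = 4000.0;   /* seconds on 1 computer, base dataset        */
    double tM     = 600.0;    /* seconds on M computers, same base dataset  */
    double tM_big = 4400.0;   /* seconds on M computers, M-times dataset    */

    printf("speedup = %.2f (ideal: %d)\n", t1 / tM, M);       /* Eq. (32) */
    printf("scaleup = %.2f (ideal: 1.0)\n", t1 / tM_big);     /* Eq. (33) */
    return 0;
}
```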
We implement the parallel ELM-Tree algorithm on 2, 4, 6, and 8 computers, with training datasets that are 2, 4, 6, and 8 times the size of the original dataset. The execution times are summarized in Table 11, and the speedup and scaleup values are shown in Fig. 9. As shown in Table 11, the execution time of the parallel ELM-Tree decreases as the number of computers increases, which indicates that parallelization is feasible for reducing the computational time.
Table 11
Execution time (seconds) of parallel ambiguity-based ELM-Tree on the 4 big datasets.

Dataset              Data size  2 computers  4 computers  6 computers  8 computers
Image Segment        2-times    3143.89      1977.50      945.01       488.23
                     4-times    8340.78      5159.94      2439.82      1148.51
                     6-times    11 922.16    7261.19      3646.30      1752.65
                     8-times    14 890.93    9034.98      4393.87      2738.84
Magic Telescope      2-times    66 615.66    38 604.97    10 776.83    3513.07
                     4-times    170 802.54   94 024.51    28 241.36    8315.77
                     6-times    235 026.29   131 217.35   40 562.60    12 446.06
                     8-times    293 555.18   164 999.23   48 942.07    24 052.94
Page Blocks          2-times    5756.21      2592.83      860.94       268.98
                     4-times    14 494.77    6516.00      2110.56      661.30
                     6-times    21 217.74    9475.43      3149.15      991.34
                     8-times    25 652.75    11 972.18    3829.53      1947.15
Wine Quality-White   2-times    6239.44      4195.61      2738.31      1969.97
                     4-times    16 040.40    10 713.09    6973.18      4703.07
                     6-times    23 659.38    16 009.42    10 335.04    7061.85
                     8-times    28 012.47    18 839.69    12 496.50    9875.35
It follows from Eq. (32) that a good parallel system usually exhibits a linear speedup. As shown in Figs. 9(b), 9(e), 9(h) and 9(k), the speedup of the parallel ELM-Tree becomes closer to linear as the data size increases; the larger the data size, the higher the speedup. Besides, it follows from Eq. (33) that the scaleup of an ideal parallel system should be constantly equal to 1. However, as demonstrated in Figs. 9(c), 9(f), 9(i) and 9(l), the scaleup of a parallel algorithm usually declines as the data size and the number of computers increase. All the above results indicate that the proposed ELM-Tree algorithm can be highly parallelized and used to handle big data classification problems.
6. Conclusion and future work

In this paper, a new hybrid learning algorithm named ELM-Tree is proposed to deal with the over-partitioning problem in DT induction. It adopts uncertainty-reduction heuristics and embeds ELMs as leaf nodes when the information gain ratios of all the cut-points are smaller than a given uncertainty coefficient. Besides, a parallel ELM-Tree model is proposed for big data classification and is shown to be effective in reducing the computational time. Our future work on this topic may include three aspects: 1) further study of the incremental mechanism of ELM-Tree; 2) ELM-Tree induction with mixed types of attributes; and 3) parallel ELM-Tree with the MapReduce framework.
Acknowledgements
The authors would like to thank Prof. Xi-Zhao Wang for his initial idea and step-by-step guidance during the completion of this paper. This research is supported by the National Natural Science Foundation of China (71371063, 61170040 and 60903089), by the Natural Science Foundation of Hebei Province (F2013201110, F2012201023 and F2011201063), and by the Key Scientific Research Foundation of the Education Department of Hebei Province (ZD2010139). This work is also supported by the Shenzhen New Industry Development Fund under grant No. JCYJ20120617120716224.
Fig. 9. Performances of parallel ELM-Tree on the 4 big datasets.
References
[1] M. Barakat, D. Lefebvre, M. Khalil, F. Druaux, O. Mustapha, Parameter selection algorithm with self adaptive growing neural network
classifier for diagnosis issues, Int. J. Mach. Learn. Cybern. 4 (3) (2013) 217–233.
[2] Y. Ben-Haim, E. Tom-Tov, A streaming parallel decision tree algorithm, J. Mach. Learn. Res. 11 (2010) 849–872.
[3] C.J. Chen, Structural vibration suppression by using neural classifier with genetic algorithm, Int. J. Mach. Learn. Cybern. 3 (3) (2012) 215–221.
[4] J. Demšar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. 7 (2006) 1–30.
[5] X.L. Dong, D. Srivastava, Big data integration, in: Proceedings of ICDE’13, 2013, pp. 1245–1248.
[6] S. Ferrari, R.F. Stengel, Smooth function approximation using neural networks, IEEE Trans. Neural Netw. 16 (1) (2005) 24–38.
[7] E. Frank, Y. Wang, S. Inglis, G. Holmes, I.H. Witten, Using model trees for classification, Mach. Learn. 32 (1) (1998) 63–76.
[8] J. Friedman, T. Hastie, R. Tibshirani, Additive logistic regression: a statistical view of boosting, Ann. Stat. 38 (2) (2000) 337–374.
[9] J. Gama, Functional trees, Mach. Learn. 55 (3) (2004) 219–250.
[10] A. Gattiker, F.H. Gebara, H.P. Hofstee, J.D. Hayes, A. Hylick, Big Data text-oriented benchmark creation for Hadoop, IBM J. Res. Dev.
57 (3–4) (2013) 10:1–10:6.
[11] Q. He, T.F. Shang, F.Z. Zhuang, Z.Z. Shi, Parallel extreme learning machine for regression based on MapReduce, Neurocomputing 102 (2013)
52–58.
[12] M. Higashi, G.J. Klir, Measures of uncertainty and information based on possibility distributions, Int. J. Gen. Syst. 9 (1) (1982) 43–58.
[13] G.B. Huang, D.H. Wang, Y. Lan, Extreme learning machines: a survey, Int. J. Mach. Learn. Cybern. 2 (2) (2011) 107–122.
[14] G.B. Huang, H.M. Zhou, X.J. Ding, R. Zhang, Extreme learning machine for regression and multiclass classification, IEEE Trans. Syst. Man
Cybern., Part B, Cybern. 42 (2) (2012) 513–529.
[15] G.B. Huang, Q.Y. Zhu, C.K. Siew, Extreme learning machine: theory and applications, Neurocomputing 70 (2006) 489–501.
[16] R. Kohavi, Scaling up the accuracy of naive Bayes classifiers: a decision-tree hybrid, in: Proceedings of KDD’96, 1996, pp. 202–207.
[17] N. Landwehr, M. Hall, E. Frank, Logistic model trees, Mach. Learn. 59 (1–2) (2005) 161–205.
[18] S. Lomax, S. Vadera, A survey of cost-sensitive decision tree induction algorithms, ACM Comput. Surv. 45 (2) (2013) 16:1–16:35.
[19] P. Malik, Governing Big Data: principles and practices, IBM J. Res. Dev. 57 (3–4) (2013) 1:1–1:13.
[20] B. Panda, J.S. Herbach, S. Basu, R.J. Bayardo, PLANET: massively parallel learning of tree ensembles with MapReduce, Proc. VLDB Endow.
2 (2) (2009) 1426–1437.
[21] J.R. Quinlan, Induction of decision trees, Mach. Learn. 1 (1) (1986) 81–106.
[22] J.R. Quinlan, Improved use of continuous attributes in C4.5, J. Artif. Intell. Res. 4 (1996) 77–90.
[23] J. Shafer, R. Agrawal, M. Mehta, SPRINT: a scalable parallel classifier for data mining, in: Proceedings of VLDB’96, 1996, pp. 544–555.
[24] Y. Sheng, V.V. Phoha, S.M. Rovnyak, A parallel decision tree-based method for user authentication based on keystroke patterns, IEEE Trans.
Syst. Man Cybern., Part B, Cybern. 35 (4) (2005) 826–833.
[25] A. Srivastava, E.H. Han, V. Kumar, V. Singh, Parallel formulations of decision-tree classification algorithms, Data Min. Knowl. Discov. 3 (3)
(1999) 237–261.
[26] M. Sumner, E. Frank, M. Hall, Speeding up logistic model tree induction, in: Knowledge Discovery in Databases, PKDD 2005, in: Lect. Notes
Comput. Sci., vol. 3721, 2005, pp. 675–683.
[27] UCI Machine Learning Repository, available online: http://archive.ics.uci.edu/ml/.
[28] X.Z. Wang, Y.L. He, D.D. Wang, Non-naive Bayesian classifiers for classification problems with continuous attributes, IEEE Trans. Cybern.
44 (1) (2014) 21–39.
[29] X.Z. Wang, J.R. Hong, On the handling of fuzziness for continuous-valued attributes in decision tree generation, Fuzzy Sets Syst. 99 (3)
(1998) 283–290.
[30] T. Wang, Z.X. Qin, Z. Jin, S.C. Zhang, Handling over-fitting in test cost-sensitive decision tree learning by feature selection, smoothing and
pruning, J. Syst. Softw. 83 (7) (2010) 1137–1147.
[31] X.Z. Wang, D.S. Yeung, E.C.C. Tsang, A comparative study on heuristic algorithms for generating fuzzy decision trees, IEEE Trans. Syst.
Man Cybern., Part B, Cybern. 31 (2) (2001) 215–226.
[32] X.Z. Wang, J.H. Zhai, S.X. Lu, Induction of multiple fuzzy decision trees based on rough set technique, Inf. Sci. 178 (16) (2008) 3188–3202.
[33] I.H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann, 2005.
[34] Y.F. Yuan, M.J. Shaw, Induction of fuzzy decision tree, Fuzzy Sets Syst. 69 (2) (1995) 125–139.
[35] M.J. Zaki, Parallel and distributed association mining: a survey, IEEE Concurr. 7 (4) (1999) 14–25.
[36] S.F. Zheng, Gradient descent algorithms for quantile regression with smooth approximation, Int. J. Mach. Learn. Cybern. 2 (3) (2011) 191–207.