Evaluation of data mining algorithms in medicine 2013

Abstract

This paper presents a novel and efficient decision tree construction approach based on C4.5. C4.5 constructs decision trees using the information gain ratio and can deal with missing values and noise. ID3 and its improvement, C4.5, both select a single attribute as the splitting criterion at each step of tree construction, i.e., they look one step forward. In contrast, the proposed algorithm, ASFIDT, uses either one attribute or two attributes as the splitting criterion when establishing a tree node, adopting an adaptive step forward that improves the chance of finding the optimum. Experimental results on 3 UCI standard datasets demonstrate its performance and efficiency in constructing decision trees.

Introduction

The decision tree is an understandable and popular classification algorithm in machine learning because of its flexibility and clarity in representing the classification process [1]. Among the algorithms for building decision trees, ID3 [2], proposed by Quinlan in 1981, and its improved version C4.5 [3] are two of the most popular. Because C4.5 can handle missing values [4] and noise, avoid overfitting, and so on, it generates better decision trees than ID3 and is more commonly used. However, both algorithms select a single attribute from the attribute list as the splitting criterion when generating a node. To improve classification accuracy and reduce tree depth, this paper proposes a novel extension of C4.5 that selects either one or two attributes as the splitting criterion, whichever yields the larger information gain ratio. The new algorithm is denoted ASFIDT, Adaptive Step Forward decision tree construction.

The remainder of this paper is organized as follows. Section 2 reviews the concepts of information entropy, ID3, and C4.5 with their mathematical descriptions. Section 3 presents the new approach to constructing decision trees, ASFIDT. Section 4 presents the experimental results and analyzes them. Finally, Section 5 gives the conclusions and directions for future work.

Related Works

In decision tree construction, finding the splitting criterion used to generate a node is the most important step [5]. The information gain used by ID3 and the information gain ratio used by C4.5 are the two common measures for selecting the splitting criterion from the attribute list; both come from information entropy theory. During tree building, the information gain and gain ratio quantify each attribute's impurity, which determines its chance of being selected as the best splitting criterion: the lower the remaining impurity, the better.

ID3

The Iterative Dichotomiser 3 (ID3) is a greedy induction algorithm that constructs decision trees in a top-down, recursive, divide-and-conquer manner and works only with categorical input and output data. ID3 splits on all categories of an attribute, producing shallow and wide trees. It grows tree classifiers in three steps:

1. Split creation in the form of multi-way splits, i.e., for every attribute a single split is created whose branches are the attribute's categories;
2. Evaluation of the best split for tree branching based on the information gain measure; and
3. Checking of the stop criteria, and recursive application of the steps to the new branches.

These three steps iterate and are executed at every node of the decision tree classifier. The information gain measure (2) is based on the well-known Shannon entropy (1):

Entropy(S) = - \sum_{i=1}^{k} p_i \log_2 p_i    (1)

where S is the dataset, k is the number of classes of the output variable, and p_i is the probability of the i-th class. ID3 uses information gain (2) as the measure of split quality:

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)    (2)

where Values(A) is the set of all possible values of attribute A, S_v is the subset of S whose samples take value v for A, Entropy(S_v) is the entropy of that subset with respect to the output attribute, and |S_v| / |S| is the probability of value v in the attribute. The information gain of an attribute is thus the difference between the entropy of the dataset (or node) and the expected entropy after splitting on the attribute; it represents the amount of information the attribute holds for class disambiguation.
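To make equations (1) and (2) concrete, here is a minimal Python sketch, not from the paper, assuming the dataset is given as a list of attribute-value rows plus a parallel list of class labels (the row/label representation and function names are our own):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, eq. (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Information gain of the attribute at index `attr`, eq. (2)."""
    n = len(labels)
    by_value = {}                      # group the labels by the attribute's value
    for row, label in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(label)
    return entropy(labels) - sum(
        (len(subset) / n) * entropy(subset) for subset in by_value.values())
```

ID3's split selection is then simply max(range(num_attrs), key=lambda a: information_gain(rows, labels, a)).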
C4.5

The C4.5 algorithm is an improvement of ID3. It can also work with numerical input attributes [6]. It follows three steps during tree growth:

1. Split creation for categorical attributes is the same as in ID3. For numerical attributes, all possible binary splits have to be considered; numerical splits are always binary.
2. Evaluation of the best split for tree branching based on the gain ratio measure; and
3. Checking of the stop criteria, and recursive application of the steps to the new branches.

The algorithm introduces a new, less biased split evaluation measure, the gain ratio. It can work with missing values, has a pruning option, can group attribute values, can generate rules, etc. The gain ratio selection criterion (4) is a measure that is less biased towards selecting attributes with many categories. After (1) calculates the information entropy and (2) calculates the information gain Gain(S, A) of attribute A in dataset S, (3) calculates the split information SplitInformation(S, A) and finally (4) calculates the information gain ratio GainRatio(S, A) of each attribute:

SplitInformation(S, A) = - \sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}    (3)

GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}    (4)

where S_1, ..., S_k are the k sample subsets of S induced by the k different values of attribute A. The gain ratio divides the information gain of an attribute by its split information, defined by (3), a measure that depends on the number of categories k in the attribute.

C4.5 can work with categorical and numerical attributes: categorical attributes produce multi-way splits, and numerical attributes binary splits. C4.5 includes three pruning algorithms, namely reduced error pruning, pessimistic error pruning, and error-based pruning. In summary, it has the following merits:

1. Handles both continuous and discrete attributes;
2. Deals with missing attribute values;
3. Copes with attributes with differing costs;
4. Prunes the decision tree after it is created.
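A short sketch of equations (3) and (4), reusing the entropy and information_gain helpers from the sketch above; a convenient observation (ours, not the paper's) is that the split information is simply the entropy of the attribute's value distribution:

```python
def split_information(rows, attr):
    """SplitInformation(S, A), eq. (3): entropy of A's value distribution."""
    return entropy([row[attr] for row in rows])

def gain_ratio(rows, labels, attr):
    """GainRatio(S, A), eq. (4); guard against a zero split information."""
    si = split_information(rows, attr)
    return information_gain(rows, labels, attr) / si if si > 0 else 0.0
```

C4.5's split selection then takes the attribute with the largest gain ratio instead of the largest gain.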
THE IMPROVED ALGORITHM, ASFIDT

The goal of classification is to assign data to given classes according to their attributes and a priori knowledge; it is one of the important tools in data analysis. Decision tree induction [7] is a common classification method with a tree structure similar to a flow chart; its merits include fast classification, high accuracy, the ability to deal with high-dimensional data, suitability for inductive knowledge discovery, and so on.

The classical ID3 algorithm selects attributes using information gain as the inductive function, but it tends to select attributes that have many values. There are two criteria for evaluating the quality of a decision tree: (1) few leaf nodes, low depth, and little redundancy; (2) high classification accuracy. C4.5 selects splitting attributes using the information gain ratio. It inherits all the merits of ID3 and adds techniques such as discretizing continuous attributes, dealing with missing attribute values [8], and pruning the decision tree. C4.5 constructs the decision tree top-down with a one-step greedy search strategy, so it only finds a locally optimal solution.

To improve the chance of finding the globally optimal solution, we first proposed an algorithm that constructs the decision tree with two forward steps, similar to [9]. When selecting attributes, that algorithm takes into account the information gain of selecting two attributes simultaneously, not only the information gain of selecting a single attribute. Considering two attributes jointly, rather than a single optimal attribute, improves the chance of finding the globally optimal solution; the same idea is used in [10]. In experiments on 3 UCI standard datasets, the results showed that this algorithm is clearly better than C4.5: it selects splitting attributes more accurately, improves the classification result, and constructs decision trees with lower average depth. However, when the values of an attribute are distributed very unevenly in the dataset, the two-step algorithm has a flaw: most samples concentrate on some branch, and its performance becomes worse than C4.5. It also still cannot avoid being trapped in a locally optimal solution, although it improves the chance of finding the globally optimal one; this remains an open problem that needs further study. To address it, we propose a new way of constructing decision trees with adaptive steps. The ASFIDT algorithm is described as follows.

Given the training set and the label attribute set, the steps of ASFIDT are (a sketch of the selection in step (c) follows the list):

(a) Preprocess each item in the training set in case some attribute values are missing. Following the strategy for missing values used in C4.5, fill in the missing values of an attribute: assign a probabilistic value to each attribute by computing the weight of each class along the multiple paths to the leaf nodes.
(b) Using the formulas below, calculate the information gain ratio of each single attribute (one step forward) and of each pair of attributes (two steps forward).
(c) Compare the one-step-forward information gain ratio with the average two-step-forward information gain ratio, and select the greater to construct the current tree node.
(d) For each branch of the current node, select the optimal attribute or attribute pair from the remaining attributes by Step (b) as the successor node, and set it as the current node.
(e) Repeat Step (d) until all attributes have been selected. If only one attribute is left, set it directly as the successor of the current node.
(f) Prune the constructed tree with the same rule post-pruning method used by C4.5: grow a decision tree from the training dataset until it fits the training data best, allowing overfitting; convert the decision tree into an equivalent rule set and modify every rule by deleting any precondition whose removal improves the estimated accuracy; sort the pruned rules by their estimated accuracy and classify samples with the rules in that order.
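As noted above, here is a minimal sketch of the adaptive selection in step (c). It assumes the gain_ratio helper from the C4.5 sketch and a pair_gain_ratio helper implementing equations (5)-(7) below; treating the pair's "average" gain ratio as the pair's gain ratio divided by the two attributes it consumes is our reading of the paper, not a detail it states explicitly:

```python
from itertools import combinations

def choose_split(rows, labels, attrs):
    """Adaptive step forward (step (c)): return a 1- or 2-tuple of attribute
    indices, whichever scores the higher gain ratio."""
    best_single = max(attrs, key=lambda a: gain_ratio(rows, labels, a))
    single_score = gain_ratio(rows, labels, best_single)
    if len(attrs) < 2:
        return (best_single,)
    best_pair = max(combinations(attrs, 2),
                    key=lambda p: pair_gain_ratio(rows, labels, *p))
    # Our reading of "average": the pair's gain ratio divided by two, so the
    # comparison with the single attribute is made per selected attribute.
    pair_score = pair_gain_ratio(rows, labels, *best_pair) / 2
    return best_pair if pair_score > single_score else (best_single,)
```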
Assume the dataset is S and A = {A_1, A_2, ...} is the set of candidate attributes. The information gain of the attribute pair {A_i, A_j} in dataset S is defined as

Gain(S, A_i, A_j) = Entropy(S) - \sum_{v \in Values(A_i)} \sum_{u \in Values(A_j)} \frac{|S_{v,u}|}{|S|} Entropy(S_{v,u})    (5)

where Values(A_k) (k = i, j) is the set of all possible values of attribute A_k, and S_{v,u} is the subset of dataset S whose samples take value v for attribute A_i and value u for attribute A_j. Assuming that attribute A_i has n different values and attribute A_j has m different values, the split information is defined as

SplitInformation(S, A_i, A_j) = - \sum_{t=1}^{m \times n} \frac{|S_t|}{|S|} \log_2 \frac{|S_t|}{|S|}    (6)

where S_1, ..., S_{m \times n} are the m x n sample subsets of dataset S induced by the combinations of all possible values of attributes A_i and A_j. The information gain ratio of the attribute pair {A_i, A_j} in dataset S is defined as

GainRatio(S, A_i, A_j) = \frac{Gain(S, A_i, A_j)}{SplitInformation(S, A_i, A_j)}    (7)

where Gain(S, A_i, A_j) and SplitInformation(S, A_i, A_j) are calculated by (5) and (6).
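A companion sketch of equations (5)-(7), again reusing the entropy helper from the ID3 sketch; partitioning the dataset by the joint value (v, u) of the two attributes makes the pair computations mirror the single-attribute ones:

```python
import math

def pair_gain_ratio(rows, labels, ai, aj):
    """Gain ratio of the attribute pair {A_i, A_j}, eqs. (5)-(7): the dataset
    is partitioned by the joint value (v, u) of the two attributes."""
    n = len(labels)
    by_pair = {}
    for row, label in zip(rows, labels):
        by_pair.setdefault((row[ai], row[aj]), []).append(label)
    gain = entropy(labels)      # eq. (5) starts from the dataset entropy
    split_info = 0.0            # eq. (6): entropy of the joint value distribution
    for subset in by_pair.values():
        w = len(subset) / n
        gain -= w * entropy(subset)
        split_info -= w * math.log2(w)
    return gain / split_info if split_info > 0 else 0.0   # eq. (7)
```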