Evaluation of data mining algorithms in medicine

Abstract
This paper presents a novel and efficient decision tree construction approach based on C4.5.
C4.5 constructs decision trees using the information gain ratio and can deal with missing
values and noise. ID3 and its improvement, C4.5, both select a single attribute as the splitting
criterion each time a node is constructed, i.e., they look one step forward. In contrast, the
algorithm proposed in this paper, ASFIDT, uses either one attribute or two attributes as the
splitting criterion when establishing a tree node, adopting an adaptive step forward that
improves the chance of finding the optimum. Experimental results on three UCI standard
datasets demonstrate its performance and efficiency in constructing decision trees.
Introduction
The decision tree is an understandable and popular classification algorithm in machine
learning because of its flexibility and clarity in representing the classification process [1].
Among the algorithms for building decision trees, ID3 [2], proposed by Quinlan in 1981, and
its improved version C4.5 [3] are two of the most popular. Because C4.5 can handle missing
values [4] and noise, avoid overfitting, and so on, it is better than ID3 at generating decision
trees and is more commonly used. However, both algorithms select one attribute from the
attribute list as the splitting criterion each time a node is generated. To improve classification
accuracy and reduce the depth of the tree, this paper proposes a novel extension of C4.5 that
selects either one or two attributes as the splitting criterion, depending on which yields the
larger information gain ratio. The new algorithm is denoted ASFIDT, Adaptive Step Forward
Decision Tree construction.
The remainder of this paper is organized as follows. In section 2, we introduce the concepts
of information entropy, ID3, and C4.5 together with their mathematical descriptions. In
section 3, we present the new approach to constructing decision trees, ASFIDT (Adaptive
Step Forward Decision Tree construction). In section 4, we present the experimental results
and analyze them. Finally, conclusions and directions for future work are given in section 5.
Related Works
In the construction of a decision tree, finding the splitting criterion used to generate each
node is the most important and necessary step [5]. The information gain used by ID3 and the
information gain ratio used by C4.5 are the common measures for selecting the splitting
criterion from the attribute list; both are concepts from information entropy theory. During
tree building, the information gain and gain ratio quantify how much a split on each attribute
reduces impurity, and thus how likely the attribute is to be selected as the best splitting
criterion: the purer the resulting subsets, the better.
ID3
The "Iterative Dichotomiser Tree" (ID3) is a greedy induction algorithm, constructing
decision trees in a top-to-down recursive divide-and-conquer manner and works only with
categorical input and output data. ID3 grows trees splitting on all categories of an attribute,
thus producing shallow and wide trees. It grows tree classifiers
in three steps:
1. Split creation in the form of multi-way splits, i.e., for every attribute a single split is
created in which the attribute's categories are the branches of the proposed split.
2. Evaluation of the best split for tree branching based on the information gain measure, and
3. Checking of the stop criteria, and recursively applying the steps to new branches.
These three steps are iterated in every node of the decision tree classifier. The information
gain measure (2) is based on the well-known Shannon entropy shown in (1):

Entropy(S) = - \sum_{i=1}^{k} p_i \log_2 p_i        (1)

where k represents the number of classes of the output variable, p_i the probability of the
i-th class, and S the dataset. ID3 uses information gain (2) as a measure of split quality:
Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)        (2)

where Values(A) is the set of all possible values of attribute A, and S_v is the subset of
dataset S whose samples have value v for A. Entropy(S) is the entropy of the dataset (or
node), Entropy(S_v) is the entropy of the subset of samples in category v with respect to the
output attribute, and |S_v| / |S| is the probability of that category. The information gain of an
attribute is the difference between the entropy of the system, or node, and the expected
entropy after splitting on the attribute. It represents the amount of information an attribute
holds for class disambiguation.
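As a concrete, hedged illustration of (1) and (2), the following Python sketch computes the
entropy of a dataset and the information gain of a single categorical attribute. The function
names, the row layout, and the toy data are our own illustrative assumptions, not part of the
original algorithm description.

    import math
    from collections import Counter

    def entropy(labels):
        # Entropy(S) = -sum_i p_i * log2(p_i), eq. (1).
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(rows, labels, attr_index):
        # Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v), eq. (2).
        total = len(labels)
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault(row[attr_index], []).append(label)
        expected = sum(len(sub) / total * entropy(sub) for sub in subsets.values())
        return entropy(labels) - expected

    # Toy example: one categorical attribute (index 0) and a binary class label.
    rows = [("sunny",), ("sunny",), ("overcast",), ("rain",), ("rain",)]
    labels = ["no", "no", "yes", "yes", "no"]
    print(information_gain(rows, labels, 0))   # gain of splitting on attribute 0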
C4.5
The C4.5 algorithm is an improvement of ID3. It can also work with numerical input
attributes [6]. It follows three steps during tree growth:
1. Split creation for categorical attributes is the same as in ID3. For numerical attributes all
possible binary splits have to be considered; numerical attribute splits are always binary.
2. Evaluation of the best split for tree branching based on the gain ratio measure, and
3. Checking of the stop criteria, and recursively applying the steps to new branches.
The algorithm introduces a new, less biased split evaluation measure, the gain ratio. It can
work with missing values, offers pruning, groups attribute values, generates rules, etc. The
gain ratio selection criterion (4) is a measure that is less biased towards selecting attributes
with more categories:
SplitInformation(S, A) = - \sum_{i=1}^{k} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}        (3)

GainRatio(S, A) = \frac{Gain(S, A)}{SplitInformation(S, A)}        (4)

where S_1, ..., S_k are the k sample subsets into which dataset S is divided by the k different
values of attribute A. Equation (1) calculates the information entropy, (2) calculates the
information gain Gain(S, A) of attribute A in dataset S, (3) calculates the splitting information
SplitInformation(S, A), and finally (4) calculates the information gain ratio GainRatio(S, A)
of each attribute. The gain ratio divides the information gain of an attribute by the split
information SplitInformation(S, A) defined in (3), a measure that depends on the number of
categories k in the attribute. C4.5 can work with categorical and numerical attributes.
Categorical attributes can produce multi-way splits, and numerical attributes binary splits.
C4.5 includes three pruning algorithms, namely reduced error pruning, pessimistic error
pruning, and error-based pruning. In summary, it has the following merits:
1. It handles both continuous and discrete attributes;
2. It deals with missing attribute values;
3. It copes with attributes having differing costs;
4. It prunes the decision tree after it is created.
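As a hedged illustration of (3) and (4), the following Python sketch computes the split
information and gain ratio of a categorical attribute. It reuses the entropy and
information_gain helpers from the previous sketch, and all names are again our own
assumptions rather than C4.5's reference implementation.

    import math
    from collections import Counter

    def split_information(rows, attr_index):
        # SplitInformation(S, A) = -sum_i |S_i|/|S| * log2(|S_i|/|S|), eq. (3).
        total = len(rows)
        counts = Counter(row[attr_index] for row in rows)
        return -sum((n / total) * math.log2(n / total) for n in counts.values())

    def gain_ratio(rows, labels, attr_index):
        # GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A), eq. (4).
        split_info = split_information(rows, attr_index)
        if split_info == 0.0:          # every sample shares one value of A
            return 0.0
        return information_gain(rows, labels, attr_index) / split_info

    # The attribute with the largest gain ratio is chosen as the splitting criterion:
    # best = max(range(len(rows[0])), key=lambda a: gain_ratio(rows, labels, a))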
THE IMPROVED ALGORITHM, ASFIDT
Classification, one of the important tools in data analysis, aims to assign data to a given class
according to its attributes and a priori knowledge. Decision tree induction [7] is a common
classification method based on a flow-chart-like tree structure; its merits include fast
classification, high accuracy, the ability to handle high-dimensional data, suitability for
inductive knowledge discovery, and so on. The classical ID3 algorithm selects attributes
using information gain as the inductive function, but it tends to favor attributes with many
values. There are two criteria for evaluating the quality of a decision tree: (1) fewer leaf
nodes, lower depth, and less redundancy; and (2) higher classification accuracy. C4.5 selects
splitting attributes using the information gain ratio. It inherits the merits of ID3 and adds
techniques such as discretizing continuous attributes, handling missing attribute values [8],
and pruning the decision tree. C4.5 constructs the decision tree top-down with a one-step
greedy search strategy, so it finds only a locally optimal solution. To improve the chance of
finding the globally optimal solution, we proposed an algorithm that constructs the decision
tree with two forward steps, similar to [9]. When selecting attributes, that algorithm takes
into account the information gain of selecting two attributes simultaneously, not only the
information gain of selecting one attribute. Considering the two best attributes jointly, rather
than only the single best attribute, improves the chance of finding the globally optimal
solution, an idea also used in [10].
In experiments on three UCI standard data sets, the results showed that the two-step
algorithm is clearly better than C4.5: it selects splitting attributes more accurately, improves
the classification result, and constructs decision trees with a lower average depth. However,
when the distribution of attribute values in a data set is imbalanced, the two-step algorithm
has flaws; most samples may be concentrated in a few branches, which makes it perform
worse than C4.5. Although it improves the chance of finding the globally optimal solution,
it still cannot avoid being trapped in a locally optimal solution, a problem that needs further
study. To address this problem, we propose a new way of constructing decision trees with
adaptive steps.
The ASFIDT algorithm is described as follows. Given the training set and the label attribute
set, the steps of ASFIDT are:
(a) Preprocess each item in the training set in case some attribute values are missing.
Following the strategy for dealing with missing values used in C4.5, fill in the missing
attribute values: each missing value is given a probabilistic value obtained by calculating the
weight of each class over the multiple paths to the leaf nodes.
(b) Using the formulas below, calculate the information gain ratio of each attribute (one step
forward) and of each pair of attributes (two steps forward);
(c) Compare the one-step-forward information gain ratio with the average two-step-forward
information gain ratio and select the greater to construct the current tree node;
(d) For each branch of the current node, select the optimal attribute or attribute pair from the
remaining attribute(s) using Step (b) as the successor node, and set it as the current node;
(e) Repeat Step (d) until all attributes have been selected. If only one attribute is left, set it
directly as the successor of the current node;
(f) Prune the constructed tree using the same rule post-pruning method as C4.5: construct a
decision tree from the training data set and grow it until it fits the training data best, allowing
overfitting; convert the decision tree into the equivalent rule set and improve every rule by
deleting any precondition whose removal improves the estimated accuracy;
then sort the rules in terms of the estimated accuracy of the pruned rules and classify the
samples with the rules in that order.
Assuming that the dataset is S and A = {A_1, A_2, ...} is the set of candidate attributes, the
information gain of the attribute pair {A_i, A_j} in dataset S is defined as:

Gain(S, A_i, A_j) = Entropy(S) - \sum_{v \in Values(A_i)} \sum_{u \in Values(A_j)} \frac{|S_{v,u}|}{|S|} Entropy(S_{v,u})        (5)

where Values(A_k) (k = i, j) is the set of all possible values of attribute A_k, and S_{v,u} is
the subset of dataset S whose samples have value v for attribute A_i and value u for attribute
A_j.
Assuming that attribute A_i has n different values and attribute A_j has m different values,
the splitting information is defined as:

SplitInformation(S, A_i, A_j) = - \sum_{l=1}^{m \times n} \frac{|S_l|}{|S|} \log_2 \frac{|S_l|}{|S|}        (6)
where S_1, ..., S_{m \times n} are the m × n sample subsets of dataset S divided by the
combinations of all possible values of attributes A_i and A_j. The information gain ratio of
the attribute pair {A_i, A_j} in dataset S is defined as:

GainRatio(S, A_i, A_j) = \frac{Gain(S, A_i, A_j)}{SplitInformation(S, A_i, A_j)}        (7)

where Gain(S, A_i, A_j) and SplitInformation(S, A_i, A_j) are calculated by (5) and (6).
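To make (5)-(7) and the adaptive choice in Step (c) concrete, the following Python sketch
evaluates every single attribute and every attribute pair by gain ratio and returns whichever
split scores higher. It reuses the entropy, information_gain, and gain_ratio helpers from the
earlier sketches; the function names and the direct comparison of single-attribute and pair
gain ratios (rather than the paper's exact averaging scheme, which is not fully specified here)
are our own simplifying assumptions, not the authors' reference implementation.

    import math
    from itertools import combinations

    def pair_gain_ratio(rows, labels, i, j):
        # Gain(S, Ai, Aj): split S by the joint values (v, u) of attributes Ai and Aj, eq. (5).
        total = len(labels)
        subsets = {}
        for row, label in zip(rows, labels):
            subsets.setdefault((row[i], row[j]), []).append(label)
        gain = entropy(labels) - sum(len(s) / total * entropy(s) for s in subsets.values())
        # SplitInformation(S, Ai, Aj) over the (at most m x n) joint subsets, eq. (6).
        split_info = -sum((len(s) / total) * math.log2(len(s) / total)
                          for s in subsets.values())
        return gain / split_info if split_info > 0 else 0.0   # GainRatio, eq. (7)

    def choose_split(rows, labels, attr_indices):
        # One step forward: best single attribute by gain ratio (as in C4.5).
        best_single = max(attr_indices, key=lambda a: gain_ratio(rows, labels, a))
        single_score = gain_ratio(rows, labels, best_single)
        # Two steps forward: best attribute pair by joint gain ratio.
        best_pair, pair_score = None, float("-inf")
        for i, j in combinations(attr_indices, 2):
            score = pair_gain_ratio(rows, labels, i, j)
            if score > pair_score:
                best_pair, pair_score = (i, j), score
        # Adaptive step (Step (c)): keep whichever candidate scores higher.
        if best_pair is not None and pair_score > single_score:
            return best_pair
        return (best_single,)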