DECISION TREE ALGORITHMS

A particularly efficient method for producing classifiers from data is to generate a decision tree. The decision-tree representation is the most widely used logic method. A large number of decision-tree induction algorithms are described, primarily in the machine-learning and applied-statistics literature. They are supervised learning methods that construct decision trees from a set of input-output samples. A typical decision-tree learning system adopts a top-down strategy that searches for a solution in only a part of the search space.

The skeleton of the C4.5 algorithm is based on Hunt's CLS method for constructing a decision tree from a set T of training samples. Let the classes be denoted as {C1, C2, …, Ck}. There are three possibilities for the content of the set T:

1. T contains one or more samples, all belonging to a single class Cj. The decision tree for T is a leaf identifying class Cj.
2. T contains no samples. The decision tree is again a leaf, but the class to be associated with the leaf must be determined from information other than T, such as the overall majority class in the training data. The C4.5 algorithm uses as a criterion the most frequent class at the parent of the given node.
3. T contains samples that belong to a mixture of classes. In this situation, the idea is to refine T into subsets of samples that are heading towards a single-class collection of samples. Based on a single attribute, an appropriate test that has one or more mutually exclusive outcomes {O1, O2, …, On} is chosen. T is partitioned into subsets T1, T2, …, Tn, where Ti contains all the samples in T that have outcome Oi of the chosen test. The decision tree for T consists of a decision node identifying the test and one branch for each possible outcome (examples of this type of node are nodes A, B, and C in the decision tree in the figure).
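The three cases above can be sketched as a short recursive procedure. This is only an illustrative sketch, not the C4.5 implementation: the names `build_tree`, `majority_class`, `choose_attribute`, and `domains`, the dictionary representation of the tree, and the extra stopping rule for exhausted attributes are all assumptions made for the example. The test-selection step is left as a pluggable function that here simply takes the first available attribute.

```python
from collections import Counter

def majority_class(samples):
    """Most frequent class label among (features, label) pairs."""
    return Counter(label for _, label in samples).most_common(1)[0][0]

def build_tree(samples, attributes, domains, parent_majority=None,
               choose_attribute=lambda samples, attributes: attributes[0]):
    # Case 2: T is empty -> leaf labelled with the parent's most frequent class.
    if not samples:
        return {"leaf": parent_majority}
    labels = {label for _, label in samples}
    # Case 1: all samples belong to one class -> leaf for that class.
    if len(labels) == 1:
        return {"leaf": labels.pop()}
    # Extra stopping rule (an assumption of this sketch, not stated in the text):
    # with no attributes left to test, fall back to the majority class.
    if not attributes:
        return {"leaf": majority_class(samples)}
    # Case 3: mixed classes -> choose a test, partition T, and recurse.
    attr = choose_attribute(samples, attributes)
    remaining = [a for a in attributes if a != attr]
    here = majority_class(samples)
    branches = {}
    for value in domains[attr]:  # one branch per possible outcome
        subset = [(x, y) for x, y in samples if x[attr] == value]
        branches[value] = build_tree(subset, remaining, domains,
                                     here, choose_attribute)
    return {"test": attr, "branches": branches}
```

Passing the full set of possible values in `domains` is what allows an empty subset (case 2) to arise: a value that no training sample takes still gets a branch, labelled with the parent's majority class.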
The same tree-building procedure is applied recursively to each subset of training samples, so that the ith branch leads to the decision tree constructed from the subset Ti of training samples. The successive division of the set of training samples proceeds until all the subsets consist of samples belonging to a single class.

Suppose we have the task of selecting a possible test with n outcomes (n values for a given feature) that partitions the set T of training samples into subsets T1, T2, …, Tn. The only information available for guidance is the distribution of classes in T and its subsets Ti. If S is any set of samples, let freq(Ci, S) stand for the number of samples in S that belong to class Ci (out of k possible classes), and let |S| denote the number of samples in the set S.

The original ID3 algorithm used a criterion called gain to select the attribute to be tested. The criterion is based on the information-theoretic concept of entropy. The following relation gives the entropy of the set S (measured in bits):

$$\mathrm{Info}(S) = -\sum_{i=1}^{k} \frac{\mathrm{freq}(C_i, S)}{|S|} \log_2 \frac{\mathrm{freq}(C_i, S)}{|S|}$$

Now consider a similar measurement after T has been partitioned in accordance with the n outcomes of one attribute test X. The expected information requirement can be found as the weighted sum of entropies over the subsets:

$$\mathrm{Info}_X(T) = \sum_{i=1}^{n} \frac{|T_i|}{|T|} \, \mathrm{Info}(T_i)$$

The quantity

$$\mathrm{Gain}(X) = \mathrm{Info}(T) - \mathrm{Info}_X(T)$$

measures the information that is gained by partitioning T in accordance with the test X. The gain criterion selects a test X to maximize Gain(X); that is, it selects the attribute with the highest information gain.

Let us analyze the application of these measures and the creation of a decision tree for one simple example. Suppose that the database T is given in a flat form in which each of fourteen examples (cases) is described by three input attributes and belongs to one of two given classes: CLASS1 or CLASS2.
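As a minimal sketch of the entropy and gain definitions above, the three quantities can be computed directly from lists of class labels. The function names `info`, `info_x`, and `gain` and the toy labels in the usage note are assumptions made for illustration; they are not taken from the database T.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy Info(S), in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def info_x(subsets):
    """Weighted entropy Info_X(T) over the subsets T1..Tn produced by a test X."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * info(s) for s in subsets if s)

def gain(labels, subsets):
    """Gain(X) = Info(T) - Info_X(T)."""
    return info(labels) - info_x(subsets)
```

For a hypothetical set with two samples in each of two classes, `info` gives 1 bit. A test that separates the classes perfectly has a gain of 1 bit, while a test whose subsets keep the same class mix as the parent has a gain of 0.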
Nine samples belong to CLASS1 and five samples to CLASS2, so the entropy before splitting is

$$\mathrm{Info}(T) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940 \text{ bits}$$

After using Attribute1 to divide the initial set of samples T into three subsets (test x1 represents the selection of one of three values A, B, or C), the resulting information is given by:

$$\mathrm{Info}_{x_1}(T) = \frac{5}{14}\left(-\frac{2}{5}\log_2\frac{2}{5} - \frac{3}{5}\log_2\frac{3}{5}\right) + \frac{4}{14}\left(-\frac{4}{4}\log_2\frac{4}{4}\right) + \frac{5}{14}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) = 0.694 \text{ bits}$$

The information gained by this test x1 is

$$\mathrm{Gain}(x_1) = 0.940 - 0.694 = 0.246 \text{ bits}$$

If the test and splitting are based on Attribute3 (test x2 represents the selection of one of two values, True or False), a similar computation gives

$$\mathrm{Info}_{x_2}(T) = \frac{|T_{\mathrm{True}}|}{|T|}\,\mathrm{Info}(T_{\mathrm{True}}) + \frac{|T_{\mathrm{False}}|}{|T|}\,\mathrm{Info}(T_{\mathrm{False}})$$

and the corresponding gain is

$$\mathrm{Gain}(x_2) = \mathrm{Info}(T) - \mathrm{Info}_{x_2}(T)$$

Based on the gain criterion, the decision-tree algorithm will select test x1 as the initial test for splitting the database T because its gain is higher. To find the optimal test, it is also necessary to analyze a test on Attribute2, which is a numeric feature with continuous values.

In general, C4.5 contains mechanisms for proposing three types of tests:

1. The "standard" test on a discrete attribute, with one outcome and one branch for each possible value of that attribute (in our example, both test x1 on Attribute1 and test x2 on Attribute3 are of this type).
2. If attribute Y has continuous numeric values, a binary test with outcomes Y ≤ Z and Y > Z, defined by comparing the attribute value against a threshold value Z.
3. A more complex test, also based on a discrete attribute, in which the possible values are allocated to a variable number of groups, with one outcome and branch for each group.

While we have already explained the standard test for categorical attributes, the procedure for establishing tests on attributes with numeric values needs additional explanation. It might seem that tests on continuous attributes would be difficult to formulate, since they contain an arbitrary threshold for splitting all values into two intervals. But there is an algorithm for the computation of the optimal threshold value Z. The training samples are first sorted on the values of the attribute Y being considered.
There are only a finite number of these values, so let us denote them in sorted order as {v1, v2, …, vm}. Any threshold value lying between vi and vi+1 will have the same effect: it divides the cases into those whose value of the attribute Y lies in {v1, v2, …, vi} and those whose value is in {vi+1, vi+2, …, vm}. There are thus only m − 1 possible splits on Y, all of which should be examined systematically to obtain an optimal split. It is usual to choose the midpoint of each interval, (vi + vi+1)/2, as the representative threshold. The C4.5 algorithm differs in choosing as the threshold the smaller value vi of every interval {vi, vi+1}, rather than the midpoint itself. This ensures that the threshold values appearing in the final decision tree, the rules, or both actually occur in the database.

To illustrate this threshold-finding process, we can analyze the possibilities of splitting on Attribute2 for our example database T. After sorting, the set of values for Attribute2 is {65, 70, 75, 78, 80, 85, 90, 95, 96} and the set of potential threshold values Z is {65, 70, 75, 78, 80, 85, 90, 95}. Out of these eight values, the optimal Z (the one with the highest information gain) should be selected. For our example, the optimal value is Z = 80, and the information-gain computation for the test x3 (Attribute2 ≤ 80 or Attribute2 > 80) follows the same pattern as before:

$$\mathrm{Info}_{x_3}(T) = \frac{|T_{\le 80}|}{|T|}\,\mathrm{Info}(T_{\le 80}) + \frac{|T_{> 80}|}{|T|}\,\mathrm{Info}(T_{> 80})$$

$$\mathrm{Gain}(x_3) = \mathrm{Info}(T) - \mathrm{Info}_{x_3}(T)$$

Now, if we compare the information gain for the three attributes in our example, we can see that Attribute1 still gives the highest gain of 0.246 bits, and therefore this attribute will be selected for the first splitting in the construction of a decision tree. The root node will have the test for the values of Attribute1, and three branches will be created, one for each of the attribute values.
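The threshold search described above for a continuous attribute can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the C4.5 source: the names `candidate_thresholds` and `best_threshold` are hypothetical, candidates are taken as the smaller value of each interval as the text describes, and plain information gain is used as the selection criterion.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total)
                for n in Counter(labels).values())

def candidate_thresholds(values):
    """Smaller endpoint v_i of each interval {v_i, v_i+1}: all distinct values but the largest."""
    return sorted(set(values))[:-1]

def best_threshold(values, labels):
    """Return (Z, gain) with the highest gain for the test Y <= Z versus Y > Z."""
    n = len(labels)
    base = info(labels)
    best = None
    for z in candidate_thresholds(values):
        left = [c for v, c in zip(values, labels) if v <= z]
        right = [c for v, c in zip(values, labels) if v > z]
        split_info = len(left) / n * info(left) + len(right) / n * info(right)
        g = base - split_info
        if best is None or g > best[1]:
            best = (z, g)
    return best  # None when fewer than two distinct values exist
```

Applied to the sorted Attribute2 values listed above, `candidate_thresholds` yields the same eight candidates, 65 through 95. Ties in gain are resolved here in favor of the smallest threshold, which is a choice of this sketch.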
This initial tree, with the corresponding subsets of samples in the child nodes, is represented in the figure.

After the initial splitting, every child node has several samples from the database, and the entire process of test selection and optimization is repeated for every child node. Because the child node for test x1: Attribute1 = B has four cases and all of them are in CLASS1, this node will be a leaf node, and no additional tests are necessary for this branch of the tree.

For the child node holding the five cases of subset T1, tests on the remaining attributes can be performed; an optimal test (with maximum information gain) is test x4 with two alternatives: Attribute2 ≤ 70 or Attribute2 > 70. Using Attribute2 to divide T1 into two subsets (test x4 represents the selection of one of two intervals), each resulting subset contains cases of a single class, so the resulting information is

$$\mathrm{Info}_{x_4}(T_1) = 0 \text{ bits}$$

The information gained by this test is maximal:

$$\mathrm{Gain}(x_4) = \mathrm{Info}(T_1) - \mathrm{Info}_{x_4}(T_1) = \mathrm{Info}(T_1)$$

and the two branches will create the final leaf nodes because the subsets of cases in each of the branches belong to the same class.

A similar computation is carried out for the third child of the root node. For the subset T3 of the database T, the selected optimal test x5 is the test on Attribute3 values. The branches of the tree, Attribute3 = True and Attribute3 = False, create uniform subsets of cases that belong to the same class. The final decision tree for database T is represented in the figure.

Exercises for students

1. Given a data set X with 3-dimensional categorical samples:

   X:  Attribute1  Attribute2  Class
       T           1           C2
       T           2           C1
       F           1           C2
       F           2           C2

   Construct a decision tree using the computation steps given in the C4.5 algorithm.

2. Given a training data set Y:

   Y:  A   B  C  Class
       15  1  A  C1
       20  3  B  C2
       25  2  A  C1
       30  4  A  C1
       35  2  B  C2
       25  4  A  C1
       15  2  B  C2
       20  3  B  C2

   a. Find the best threshold (for the maximal gain) for attribute A.
   b. Find the best threshold (for the maximal gain) for attribute B.
   c. Find a decision tree for data set Y.
   d. If the testing set is:

       A   B  C  Class
       10  2  A  C2
       20  1  B  C1
       30  3  A  C2
       40  2  B  C2
       15  1  B  C1

      I. What is the percentage of correct classifications using the decision tree developed in c)?
      II. Derive decision rules from the decision tree.

3. Use the C4.5 algorithm to build a decision tree for classifying the following objects: