Decision Tree Learning

Introduction
● Decision tree learning is one of the most widely used and practical methods for inductive classification tasks.
● The trained model (i.e., the approximation of the target function) is a decision tree.
● "Play tennis" example
  - Each internal node tests an attribute
  - Each branch corresponds to an attribute value
  - Each leaf node assigns a class
● A decision tree takes <attribute values> as input and outputs a predicted class.
  - Predict the class (Yes or No) of a new example <Outlook=Sunny, Temperature=Hot, Humidity=High, Wind=Strong>
● A training example consists of <attribute values, class>.
● A decision tree can also be represented as a set of "if-then rules" to improve human readability. For the "play tennis" example, create the if-then rules.
● Create your own decision tree for the following training set:
  - A training example is composed of <x, y, c>, where x, y are attributes and c is the class.
  - Training set = { <x1, y1, c1>, <x2, y1, c1>, <x3, y1, c2>, <x1, y2, c2>, <x2, y2, c2>, <x3, y2, c2> }
  - From this example, we can see that decision tree learning recursively splits the attribute space with axis-parallel (vertical and horizontal) boundaries so that each segmented region contains examples of a single class as far as possible.
● Alternative description of a decision tree
  - For predicting C-section risk

When to Consider Decision Trees
● Training examples (also called instances) are represented by attribute-class pairs.
  - Attribute values can be categorical, discrete, or continuous (continuous values require discretization).
  - The class must be a categorical value.
● Training data may contain errors: decision tree learning is robust to errors, both in class values and in attribute values.
● Training data may contain missing attribute values.
● Application examples
  - Equipment or medical diagnosis
  - Credit risk analysis

How Do Machines Learn Decision Trees
● Decision tree learning is an optimization problem: find the best tree in the set of all possible trees (the hypothesis space) that maximizes the classification accuracy on the training examples.
● Widely used decision tree learning algorithms are ID3 and its revisions (C4.5 and C5.0). We will review ID3 in this chapter.
● The ID3 algorithm employs a top-down, greedy search through the hypothesis space.
  Main loop:
  1. A ← the "best" decision attribute for the next node
  2. Assign A as the decision attribute for the node
  3. For each value of A, create a new descendant of the node
  4. Sort the training examples to the leaf nodes
  5. If the training examples are perfectly classified, then stop. Otherwise, iterate over the new leaf nodes.
  End loop
● The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree.
  - Which attribute is best?

Entropy & Information Gain
● We would like to select the attribute that is most useful for classifying the training examples. For this purpose we define a statistical property, called information gain, that measures how well a given attribute separates the training examples according to their target classification. Information gain is based on entropy.
● Entropy measures the impurity (uncertainty) of an arbitrary collection of training examples. Given a collection S containing positive and negative examples, the entropy of S relative to this Boolean classification is
  Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
  where p_+ is the proportion of positive examples in S and p_- is the proportion of negative examples in S. Note that we define 0 \log_2 0 to be 0.
  - The entropy is 0 when the collection S contains only a single class (all positive or all negative).
  - The entropy is 1 when the collection S contains equal numbers of positive and negative examples.
  - The entropy is between 0 and 1 when the collection S contains unequal numbers of positive and negative examples.
  - Example: calculate the entropy of a collection S that contains 9 positive and 5 negative examples ([9+, 5-]).
● (Generalization) If the class can take c different values, then the entropy of S is defined as
  Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i
● Given an attribute, information gain is the expected reduction in entropy caused by partitioning the training examples according to this attribute. More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as
  Gain(S, A) = Entropy(S) - \sum_{v \in values(A)} (|S_v| / |S|) \, Entropy(S_v)
  where values(A) is the set of all possible values for attribute A and S_v is the subset of S for which attribute A has value v.
  - Calculate the information gain for attributes A1 and A2. Which one is the better classifier?
● Example: a sketch of the entropy and information gain calculations follows below.
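As a worked companion to the definitions above, here is a minimal Python sketch that computes the entropy of the [9+, 5-] collection and one information gain. The Wind split counts ([6+, 2-] for Weak, [3+, 3-] for Strong) are assumed from the standard play-tennis data and are used only to illustrate Gain(S, A).

import math

def entropy(pos, neg):
    """Entropy of a collection with `pos` positive and `neg` negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                       # we define 0 * log2(0) = 0
            result -= p * math.log2(p)
    return result

# Entropy of S = [9+, 5-]
e_s = entropy(9, 5)
print(f"Entropy([9+, 5-]) = {e_s:.3f}")         # 0.940

# Gain(S, Wind) = Entropy(S) - sum over v of (|S_v|/|S|) * Entropy(S_v),
# with Wind=Weak -> [6+, 2-] and Wind=Strong -> [3+, 3-] (assumed counts).
gain_wind = e_s - (8 / 14) * entropy(6, 2) - (6 / 14) * entropy(3, 3)
print(f"Gain(S, Wind) = {gain_wind:.3f}")       # about 0.048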
ID3 Algorithm
● Complete ID3 algorithm (recursive form)
  ID3(Examples, Attribute Set)
  1. Create a Root node for the tree.
  2. If all Examples have a single class value, return the single-node tree Root with label = the class value.
  3. If Attribute Set is empty, return the single-node tree Root with label = the most common class value in Examples.
  4. Otherwise begin
     4.1 A ← the attribute from Attribute Set that best classifies the Examples.
     4.2 Decision attribute for Root ← A.
     4.3 For each value vi of A,
         4.3.1 Add a new tree branch below Root with A = vi.
         4.3.2 If Examples(vi) is empty,
               then below this new branch add a leaf node with label = the most common class value in Examples,
               else below this new branch add the subtree ID3(Examples(vi), Attribute Set - {A}).
     End
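The following is a minimal Python sketch of the recursive ID3 procedure above, for categorical attributes only. The list-of-dicts data representation and the nested-dictionary tree are illustrative choices, not part of the original algorithm.

import math
from collections import Counter

def entropy(examples, target):
    """Entropy of the class distribution of `examples` (a list of dicts)."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute, target):
    """Expected reduction in entropy from splitting `examples` on `attribute`."""
    total = len(examples)
    remainder = 0.0
    for v in set(ex[attribute] for ex in examples):
        subset = [ex for ex in examples if ex[attribute] == v]
        remainder += (len(subset) / total) * entropy(subset, target)
    return entropy(examples, target) - remainder

def id3(examples, attributes, target):
    classes = [ex[target] for ex in examples]
    # Step 2: every example has the same class value -> return a leaf.
    if len(set(classes)) == 1:
        return classes[0]
    # Step 3: no attributes left to test -> most common class value.
    if not attributes:
        return Counter(classes).most_common(1)[0][0]
    # Step 4.1: choose the attribute that best classifies the examples.
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    # Steps 4.3.1-4.3.2: one branch per observed value of the chosen attribute
    # (branching only on observed values, so the empty-branch case of 4.3.2
    # does not arise in this sketch).
    for v in set(ex[best] for ex in examples):
        subset = [ex for ex in examples if ex[best] == v]
        tree[best][v] = id3(subset, [a for a in attributes if a != best], target)
    return tree

Calling id3(examples, attributes, target) returns either a class label (a leaf) or a nested dictionary keyed first by the chosen attribute and then by its values, mirroring the tree structure.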
Hypothesis Space Search by ID3
● The hypothesis space searched by ID3 is the set of all possible decision trees.
● ID3 performs a top-down, greedy search (also called hill-climbing search) guided by information gain.
  - Only one decision tree is generated.
  - No backtracking is allowed (local minima).
● ID3 searches for the best decision tree in the hypothesis space, i.e., one that classifies the training examples as correctly as possible.

Property of ID3
● Given a collection of training examples, there are typically many decision trees consistent with these examples. Which of these decision trees does ID3 prefer? Shorter trees are preferred over larger trees.
● Why prefer shorter decision trees?
  - Scientists sometimes appear to follow this inductive bias. For example, physicists prefer simple explanations for the motions of planets over more complex explanations.
  - Why? One argument is that because there are fewer short hypotheses than long ones, it is less likely that one will find a short hypothesis that coincidentally fits the training data. In contrast, there are often many very complex hypotheses that fit the current training data but fail to generalize correctly to subsequent data.
  - Occam's razor: prefer the simplest hypothesis that fits the data.

Overfitting in Decision Trees
● Consider the error of hypothesis h over
  - the training data: error_train(h)
  - the entire distribution D of data: error_D(h)
  A hypothesis h ∈ H overfits the training data if there is an alternative hypothesis h' ∈ H such that
  error_train(h) < error_train(h') and error_D(h) > error_D(h')
● Overfitting can lead to difficulties
  - When there is noise in the data. Consider adding the noisy training example #15 <Outlook=Sunny, Temperature=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No>. What effect does it have on the earlier tree?
  - When the number of training examples is too small to produce a representative sample.
● How can we avoid overfitting?
  - First approach: stop growing the tree earlier, before it reaches the point where it perfectly classifies the training data.
  - Second approach: grow the full-size tree, then post-prune it.
  - The second approach is considered more practical, because the first approach has difficulty determining when to stop growing the tree.
● Reduced error pruning
  - Split the data into a training set and a validation set.
  - Do until further pruning is harmful:
    1. Starting from the bottom decision nodes, evaluate the impact on the validation set of pruning each possible node (bottom-up pruning).
    2. Remove the decision node, together with its subtree, whose removal most improves validation set accuracy, and replace the subtree with a leaf node.
  - Suppose that the decision node named A has a set S of examples (see the figure below), and its two branches (v1 and v2) have example sets Sv1 and Sv2. Then, if
    Error(S, A replaced by a leaf) < Error(Sv1, subtree below v1) + Error(Sv2, subtree below v2)
    replace the subtree with a leaf node labeled with the most common class value in the example set S of the decision node A.
    [Figure: bottom-up pruning. A decision node A with example set S and branches v1 (example set Sv1) and v2 (example set Sv2) has its subtree replaced by a leaf node labeled with the most common class value in S.]
  - The pruning effect on accuracy: as learning continues, the tree's accuracy on the training set improves, but its accuracy on the test data decreases (the overfitting effect). With reduced error pruning, however, we can detect the tree size that gives high accuracy on the test data.
  - The major drawback is that when data is limited, withholding part of it for the validation set further reduces the number of examples available for training.
● How do we measure the performance (error rate or accuracy) of a decision tree algorithm? Moreover, suppose that you have developed a new classifier. How do you test its performance with a small training set? (A sketch of the first method follows below.)
  (1) k-fold cross validation (leave-one-out in the extreme case)
  (2) Bootstrapping
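As an illustration of option (1), here is a minimal sketch of k-fold cross-validation for a decision tree classifier. It uses scikit-learn and its bundled iris dataset purely as convenient stand-ins; neither is part of the original lecture material.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, LeaveOneOut
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="entropy")  # entropy-based splits, as in ID3/C4.5

# 10-fold cross-validation: train on 9 folds, test on the held-out fold, repeat.
scores = cross_val_score(tree, X, y, cv=10)
print(f"10-fold CV accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")

# Leave-one-out: the extreme case where each fold contains a single example.
loo_scores = cross_val_score(tree, X, y, cv=LeaveOneOut())
print(f"Leave-one-out accuracy: {loo_scores.mean():.3f}")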
Alternative Measures for Selecting Attributes
● Gain ratio
  - The gain ratio is used by the C4.5 algorithm, an updated version of ID3 (the latest version is C5.0).
  - It was developed mainly to cope with test attributes that have many attribute values.
  - For example, suppose that the temperature (a test attribute) in the play-tennis example takes many discrete (or continuous) values (e.g., 24 °C, 25 °C, ...). Splitting on temperature then produces many lower nodes, one for each temperature value, each with a small entropy value. This result is not what we want (we prefer simple trees: Occam's razor!).
  - The gain ratio is defined as
    Gain\,Ratio(S, A) = Gain(S, A) / Split\,Information(S, A)
    where Gain(S, A) is the information gain and Split Information(S, A) is given by
    Split\,Information(S, A) = - \sum_{v \in values(A)} (|S_v| / |S|) \log_2 (|S_v| / |S|)
  - The split information is high when the number of attribute values, |values(A)|, is large and the lower nodes have similar numbers of examples.
  - Therefore, the gain ratio prefers test attributes that have high information gain, Gain(S, A), and low split information, Split Information(S, A).
● Gini index
  - The gini index is used by CART (classification and regression trees), an early decision tree model developed by Breiman et al. (1984).
  - CART can handle both categorical class values and numeric class values.
  - When the class value is categorical, the gini index is given by
    g(t) = \sum_{n \neq l} f(n|t) f(l|t) = \sum_{n} f(n|t) \{1 - f(n|t)\} = 1 - \sum_{n} f(n|t)^2
    where t is a node and f(n|t) is the fraction of examples labeled with category value n at node t.
  - The gini index measures the impurity of the node t. It is low when \sum_{n} f(n|t)^2 is high, which happens when most examples in the node fall into a single category value.
  - Given a parent node t_p, the change of impurity obtained by splitting with attribute A, \Delta g(t_p, A), is
    \Delta g(t_p, A) = g(t_p) - E_A[g(t_c)] = g(t_p) - \sum_{v \in values(A)} (|S_v| / |S|) g(t_v)
    where S is the set of examples in node t_p and S_v is the set of examples in the child node t_v in which all examples have value v for attribute A.
  - CART selects the best attribute A* with the maximum \Delta g and splits the tree at the parent node using the attribute A*.
  - When the class value is numeric, a variance (error) criterion is used for splitting.
  - At a node t, the variance (sum of squared errors) is defined as
    \sum_{i=1}^{n} (y_i - \bar{y})^2
    where n is the number of examples in node t, y_i is the class value of the i-th example, and \bar{y} is the average class value of the examples in node t.
  - At the children of node t obtained by splitting on attribute A, the variance (weighted sum of squared errors) is defined as
    \sum_{v \in values(A)} (|S_v| / |S|) \sum_{y_i \in t_v} (y_i - \bar{y}_{t_v})^2
    where \bar{y}_{t_v} is the average class value of the examples in a child node t_v.
  - CART selects the best attribute A* with the maximum variance difference, defined as
    Variance difference = \sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{v \in values(A)} (|S_v| / |S|) \sum_{y_i \in t_v} (y_i - \bar{y}_{t_v})^2
    and splits the tree at the parent node using the attribute A*.
  - When the class value is numeric, the decision tree is called a regression tree.
  - In a regression tree, the representative class value of a leaf node is the average class value of the examples in that node. Therefore, unlike popular statistical regression models such as linear regression and neural networks, a regression tree predicts discontinuous (piecewise-constant) class values.
  - The variance criterion plays the same role for numeric class values that the gini index plays for categorical class values.

Miscellaneous
● Incorporating continuous-valued attributes
  - For a continuous-valued attribute, define a new discrete-valued attribute that partitions the continuous values into a discrete set of intervals.
  - Thresholds {c} must be determined for the discretization. It is usual to choose the midpoint between two consecutive values (i.e., c = (v_i + v_{i+1}) / 2) as a threshold.
  [Figure: the single-threshold case splits 6 examples into T ≤ 54 (2 examples) and T > 54 = (48+60)/2 (4 examples); the two-threshold case splits them into T ≤ 54 (2 examples), 54 < T ≤ 85 (3 examples), and T > 85 = (80+90)/2 (1 example).]
  - For the single-threshold case, pick the threshold c that produces the greatest information gain (a sketch follows below).
  - The multiple-threshold case is more complex.
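Below is a minimal sketch of single-threshold selection by information gain. The six (Temperature, PlayTennis) pairs are an assumption chosen to match the 54 = (48+60)/2 and 85 = (80+90)/2 midpoints mentioned above; only the procedure itself comes from the text.

import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def best_threshold(values, labels):
    """Pick the binary threshold on a continuous attribute with the greatest gain."""
    pairs = sorted(zip(values, labels))
    base = entropy(labels)
    best_c, best_gain = None, -1.0
    # Candidate thresholds: midpoints between consecutive distinct values.
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        c = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= c]
        right = [lab for v, lab in pairs if v > c]
        gain = base - (len(left) / len(pairs)) * entropy(left) \
                    - (len(right) / len(pairs)) * entropy(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

temperature = [40, 48, 60, 72, 80, 90]                      # assumed example data
play_tennis = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temperature, play_tennis))              # picks c = 54.0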
● Handling training examples with unknown attribute values
  - Consider a situation in which Gain(S, A_i) is to be calculated at node n to evaluate whether the attribute A_i is the best attribute to test at this decision node. Suppose that <a_1, ..., a_n, c> is a training example and that the value a_i is unknown.
  [Figure: node n tests attribute A_i on example set S = {example 1: <a1, ..., ai=T, ..., an, c2>, example 2: <a1, ..., ai=?, ..., an, c1>, example 3: <a1, ..., ai=T, ..., an, c2>, example 4: <a1, ..., ai=F, ..., an, c1>}.]
  - Three strategies for this case (strategies 1 and 2 are sketched below):
    1. Assign the most common value of A_i among the other examples sorted to node n.
    2. Assign the most common value of A_i among the other examples with the same class value c sorted to node n.
    3. Assign a probability p_i to each possible value v_i of A_i and estimate a_i with a probabilistic choice.
  - Classify new examples in the same fashion.
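A minimal sketch of strategies 1 and 2 follows. The dictionary-based example representation and the attribute/class field names are illustrative assumptions, not part of the original material.

from collections import Counter

def fill_missing(examples, attribute, target=None, target_value=None):
    """Return a copy of `examples` with missing values of `attribute` (stored
    as None) replaced by the most common observed value (strategy 1).
    If `target`/`target_value` are given, only examples with that class value
    are used to determine the most common value (strategy 2)."""
    pool = [ex for ex in examples if ex[attribute] is not None]
    if target is not None:
        pool = [ex for ex in pool if ex[target] == target_value] or pool
    most_common = Counter(ex[attribute] for ex in pool).most_common(1)[0][0]
    filled = []
    for ex in examples:
        ex = dict(ex)
        if ex[attribute] is None:
            ex[attribute] = most_common
        filled.append(ex)
    return filled

# Example set at node n, mirroring the figure above (A_i of example 2 unknown):
S = [
    {"Ai": "T", "class": "c2"},
    {"Ai": None, "class": "c1"},   # missing A_i value
    {"Ai": "T", "class": "c2"},
    {"Ai": "F", "class": "c1"},
]
print(fill_missing(S, "Ai"))                                     # strategy 1: fills with "T"
print(fill_missing(S, "Ai", target="class", target_value="c1"))  # strategy 2: fills with "F"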