DSCI 4520/5240: Data Mining, Fall 2013 – Dr. Nick Evangelopoulos
Lecture 5: Decision Tree Algorithms
Material based on: Witten & Frank 2000, Olson & Shi 2007, de Ville 2006, SAS Education 2005

Decision Tree Algorithms
Decision tree algorithms were developed by the Machine Learning research community (part of Computer Science / Artificial Intelligence).

Agenda
• Rule evaluation criteria in Machine Learning
• Decision Tree algorithms: 1R, ID3 – Entropy
• Naïve Bayes classification (optional)

Decision Trees: Credit risk example
This example concerns credit risk. We have a total of 10 people: 6 are good risks and 4 are bad. We first split the tree on employment status and find 7 employed and 3 not employed. All 3 of the not-employed applicants are bad credit risks, so we have learned something about our data. That node cannot be split any further, since all of its data belong to one class; it is called a pure node. The other node, however, can be split again on a different attribute (here, marital status), so we can continue to grow the tree on that side.
CORRESPONDING RULES:
• IF employed = yes AND married = yes THEN risk = good
• IF employed = yes AND married = no THEN risk = good
• IF employed = no THEN risk = bad

Decision Tree performance
Confidence is the degree of accuracy of a rule. Support is the degree to which the rule conditions occur in the data.
EXAMPLE: If 10 customers purchased Zane Grey’s The Young Pitcher and 8 of them also purchased The Short Stop, the rule {IF basket has The Young Pitcher THEN basket has The Short Stop} has confidence 0.80. If these purchases were the only 10 to cover these books out of 10,000,000 purchases, the support is only 0.000001.

Rule Interestingness
Interestingness is the idea that Data Mining discovers something unexpected. Consider the rule {IF basket has eggs THEN basket has bacon}, with confidence 0.90 and support 0.20. This may be a useful rule, but it is not interesting if the grocer was already aware of the association. Recall the definition of DM as the discovery of previously unknown knowledge!
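To make the confidence and support arithmetic concrete, here is a minimal Python sketch (not from the slides) that reproduces the book-purchase numbers. It follows the slide's definitions, where support is the fraction of all purchases that satisfy the rule's condition; the function name rule_metrics is just illustrative.

```python
def rule_metrics(n_condition, n_condition_and_result, n_total):
    """Confidence and support for a rule IF condition THEN result,
    using the definitions on the slide above."""
    confidence = n_condition_and_result / n_condition  # accuracy of the rule
    support = n_condition / n_total                    # how often the condition occurs
    return confidence, support

# Zane Grey example: 10 baskets contain The Young Pitcher, 8 of those
# also contain The Short Stop, out of 10,000,000 total purchases.
conf, supp = rule_metrics(10, 8, 10_000_000)
print(f"confidence = {conf:.2f}")   # 0.80
print(f"support    = {supp:.7f}")   # 0.0000010
```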
Rule Induction algorithms
These are recursive algorithms that identify data partitions of progressively better separation with respect to the outcome. The partitions are then organized into a decision tree.
Common algorithms:
• 1R
• ID3
• C4.5 / C5.0
• CART
• CHAID
• CN2
• BruteDL
• SDL

Illustration of two Tree algorithms
• 1R, and discretization in 1R
• ID3: minimum entropy and maximum information gain

1R: Inferring Rudimentary Rules
1R learns a 1-level decision tree; in other words, it generates a set of rules that all test one particular attribute.
Basic version (assuming nominal attributes):
• One branch for each of the attribute’s values
• Each branch assigns the most frequent class
• Error rate: the proportion of instances that do not belong to the majority class of their corresponding branch
• Choose the attribute with the lowest error rate

Pseudo-code for 1R
For each attribute,
    For each value of the attribute, make a rule as follows:
        count how often each class appears
        find the most frequent class
        make the rule assign that class to this attribute-value
    Calculate the error rate of the rules
Choose the rules with the smallest error rate
Let’s apply 1R to the weather data: consider the first of the 4 attributes (outlook, temp, humidity, windy). Consider all of its values (sunny, overcast, rainy) and make 3 corresponding rules. Continue until you have all 4 sets of rules. (A runnable sketch of this procedure appears at the end of the 1R section.)

A simple example: Weather Data
Outlook   Temp  Humidity  Windy  Play?
Sunny     Hot   High      False  No
Sunny     Hot   High      True   No
Overcast  Hot   High      False  Yes
Rainy     Mild  High      False  Yes
Rainy     Cool  Normal    False  Yes
Rainy     Cool  Normal    True   No
Overcast  Cool  Normal    True   Yes
Sunny     Mild  High      False  No
Sunny     Cool  Normal    False  Yes
Rainy     Mild  Normal    False  Yes
Sunny     Mild  Normal    True   Yes
Overcast  Mild  High      True   Yes
Overcast  Hot   Normal    False  Yes
Rainy     Mild  High      True   No

Evaluating the Weather Attributes in 1R
[Table of 1R rules and error rates for each attribute; not reproduced here.]
(*) indicates a random choice between two equally likely outcomes.

Decision tree for the weather data
Outlook = sunny:
    Humidity = high: no
    Humidity = normal: yes
Outlook = overcast: yes
Outlook = rainy:
    Windy = false: yes
    Windy = true: no

Discretization in 1R
Consider the continuous Temperature data, sorted in ascending order:
64  65  68  69  70  71  72  72  75  75  80  81  83  85
Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
One way to discretize temperature is to place breakpoints wherever the class changes:
Yes | No | Yes Yes Yes | No No | Yes Yes Yes | No | Yes Yes | No
To avoid overfitting, 1R requires that each partition contain a minimum number of observations of the majority class (here 3, where available), unless there is a “run”:
Yes No Yes Yes Yes | No No Yes Yes Yes | No Yes Yes No
If adjacent partitions have the same majority class, the partitions are merged:
Yes No Yes Yes Yes No No Yes Yes Yes | No Yes Yes No
The final discretization leads to the rule set:
IF temperature <= 77.5 THEN Yes
IF temperature > 77.5 THEN No

Comments on 1R
• 1R was described in a paper by Holte (1993), which contains an experimental evaluation on 16 datasets (using cross-validation so that the results were representative of performance on future data).
• The minimum number of instances was set to 6 after some experimentation.
• 1R’s simple rules performed not much worse than much more complex decision trees.
• Simplicity first pays off!
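The following Python sketch (illustrative, not from the slides) implements the 1R pseudo-code on the 14-instance weather data. Ties between attributes and between classes are broken by first occurrence, which stands in for the random choice marked with (*) in the evaluation slide.

```python
from collections import Counter

weather = [  # (Outlook, Temp, Humidity, Windy, Play?)
    ("Sunny", "Hot", "High", False, "No"),       ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]
attribute_names = ["Outlook", "Temp", "Humidity", "Windy"]

def one_r(data, n_attributes):
    """Return (attribute index, error count, rules) chosen by 1R."""
    best = None
    for a in range(n_attributes):
        # Count class frequencies for every value of attribute a.
        counts = {}
        for row in data:
            counts.setdefault(row[a], Counter())[row[-1]] += 1
        # For each value, the rule predicts the most frequent class.
        rules = {value: cls.most_common(1)[0][0] for value, cls in counts.items()}
        # Errors: instances not in the majority class of their branch.
        errors = sum(sum(cls.values()) - cls[rules[value]]
                     for value, cls in counts.items())
        if best is None or errors < best[1]:
            best = (a, errors, rules)
    return best

a, errors, rules = one_r(weather, len(attribute_names))
print(f"1R picks {attribute_names[a]} with {errors}/14 errors: {rules}")
# Expected: Outlook with 4/14 errors
# {'Sunny': 'No', 'Overcast': 'Yes', 'Rainy': 'Yes'}
```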
Entropy and Information Gain

Constructing Decision Trees in ID3, C4.5, C5.0
Normal procedure: top-down, in recursive divide-and-conquer fashion.
• First: an attribute is selected for the root node and a branch is created for each possible attribute value.
• Then: the instances are split into subsets (one for each branch extending from the node).
• Finally: the procedure is repeated recursively for each branch, using only the instances that reach the branch.
• The process stops if all instances have the same class.

Which attribute to select?
[Figure: the four candidate splits of the weather data, Outlook (sunny / overcast / rainy), Temperature (hot / mild / cool), Humidity (high / normal), and Windy (false / true), each shown with the yes/no class distribution in its branches.]

A criterion for attribute selection
• Which is the best attribute? The one that will result in the smallest tree.
• Heuristic: choose the attribute that produces the “purest” nodes.
• A popular impurity criterion is Information: the extra information needed to classify an instance. It takes a low value for pure nodes and a high value for impure nodes.
• We can then compare the tree before and after the split using Information Gain = Info(before) – Info(after). Information Gain increases with the average purity of the subsets that an attribute produces.
• Strategy: choose the attribute that results in the greatest information gain. However, it is equivalent (and faster) to select the attribute with the smallest information (= total weighted entropy).

Computing Information
• Information is measured in bits.
• Given a probability distribution, the information required to predict an event is the distribution’s entropy.
• Entropy gives the additional required information (i.e., the information deficit) in bits. This can involve fractions of bits!
• The negative sign in the entropy formula is needed to convert all the negative logs back to positive values.
Formula for computing the entropy (all logs are base 2):
Entropy(p1, p2, …, pn) = –p1·log2(p1) – p2·log2(p2) – … – pn·log2(pn)

Weather example: attribute “Outlook”
• Outlook = “Sunny”: Info([2,3]) = entropy(2/5, 3/5) = –(2/5)·log2(2/5) – (3/5)·log2(3/5) = 0.971 bits
• Outlook = “Overcast”: Info([4,0]) = entropy(1, 0) = –1·log2(1) – 0·log2(0) = 0 bits (by convention, 0·log2(0) = 0)
• Outlook = “Rainy”: Info([3,2]) = entropy(3/5, 2/5) = –(3/5)·log2(3/5) – (2/5)·log2(2/5) = 0.971 bits
Information (= total weighted entropy) for attribute Outlook:
Info([2,3], [4,0], [3,2]) = (5/14)×0.971 + (4/14)×0 + (5/14)×0.971 = 0.693 bits
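As a sanity check of the entropy and information numbers above (all logs base 2), here is a small Python sketch; the helper names entropy and information are illustrative, not part of the course materials.

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information(subsets):
    """Total weighted entropy after splitting into the given subsets."""
    n = sum(sum(s) for s in subsets)
    return sum(sum(s) / n * entropy(s) for s in subsets)

outlook = [[2, 3], [4, 0], [3, 2]]                      # sunny, overcast, rainy
print(round(entropy([2, 3]), 3))                        # 0.971  Outlook = Sunny
print(round(entropy([4, 0]), 3))                        # 0.0    Outlook = Overcast
print(round(information(outlook), 3))                   # 0.693  attribute Outlook
print(round(entropy([9, 5]) - information(outlook), 3)) # 0.247  information gain
```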
Comparing Information Gains (OPTIONAL), or total weighted entropies
• Information Gain = Information Before – Information After
Gain(Outlook) = info([9,5]) – info([2,3], [4,0], [3,2]) = 0.940 – 0.693 = 0.247 bits
• Information Gain for the attributes of the Weather Data:
Gain(Outlook) = 0.247 bits
Gain(Temperature) = 0.029 bits
Gain(Humidity) = 0.152 bits
Gain(Windy) = 0.048 bits
Outlook also has the minimum weighted entropy, at 0.693 bits.

Continuing to split
[Figure: under the Outlook = sunny branch, the three candidate splits Temperature (hot / mild / cool), Windy (false / true), and Humidity (high / normal) are shown with their yes/no class distributions.]
Gain(Temperature) = 0.571 bits
Gain(Humidity) = 0.971 bits
Gain(Windy) = 0.020 bits

Final Decision Tree
Outlook = sunny:
    Humidity = high: no
    Humidity = normal: yes
Outlook = overcast: yes
Outlook = rainy:
    Windy = false: yes
    Windy = true: no
• Not all leaves need to be pure; sometimes identical instances belong to different classes.
• Splitting stops when the data cannot be split any further.

Another example: Loan Application Data
Twenty loan application cases are presented. The target variable OnTime? indicates whether the loan was paid off on time.
Age     Income   Risk     OnTime?
young   low      high     yes
young   low      high     yes
young   low      high     no
young   low      high     no
young   low      low      yes
young   average  high     no
young   average  high     yes
young   average  high     yes
young   average  high     yes
young   average  low      yes
young   average  low      yes
young   high     high     no
middle  average  average  no
middle  average  low      yes
middle  average  low      yes
middle  high     low      yes
middle  high     low      yes
old     high     low      yes
old     high     low      yes
old     high     low      yes

Loan Example: probability calculations
All possible values of the three attributes (Age, Income, Risk) are shown below. For each value, the probability that the loan is paid on time (OnTime = yes) is calculated:
Variable  Value    Cases  OnTime=yes  Prob(yes)
Age       young    12     8           0.67
Age       middle   5      4           0.80
Age       old      3      3           1.00
Income    low      5      3           0.60
Income    average  9      7           0.78
Income    high     6      5           0.83
Risk      high     9      5           0.56
Risk      average  1      0           0.00
Risk      low      10     10          1.00

Loan Example: Entropy calculations
The information calculations for attribute Age are shown below.
• First we calculate the probability that each value results in Yes, and the probability that it results in No.
• Then we compute the entropy for this value as E = –p(yes)·log2 p(yes) – p(no)·log2 p(no).
• Finally we calculate the Information (= weighted entropy) for the entire attribute: Inform = E1·p1 + E2·p2 + E3·p3, where pi is the proportion of cases with that attribute value.
Value   prob(yes)  log2(p_yes)  prob(no)  log2(p_no)  Entropy  Prob(value)  Product
young   8/12       -0.585       4/12      -1.585      0.918    12/20        0.551
middle  4/5        -0.322       1/5       -2.322      0.722    5/20         0.180
old     3/3        0            0/3       0           0        3/20         0
                                                      Information (Age) =   0.731

Loan Example: The first split
The calculations continue until we have, for each competing attribute, the Information required to predict the outcome. The attribute with the lowest required information is also the attribute with the largest information gain, when we compare the required information before and after the split.
Attribute  Information (in bits)
Age        0.731
Income     0.782
Risk       0.446
The first split is therefore on Risk, with branches low, high, and average.
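The loan-data numbers can be verified with a short script. The sketch below (illustrative Python, not the course's SAS output) rebuilds the 20 cases from the table above, recomputes the information required by each attribute, and confirms that Risk gives the first split.

```python
from collections import Counter
from math import log2

loans = (  # 20 cases: (Age, Income, Risk, OnTime?)
    [("young", "low", "high", "yes")] * 2
    + [("young", "low", "high", "no")] * 2
    + [("young", "low", "low", "yes")]
    + [("young", "average", "high", "no")]
    + [("young", "average", "high", "yes")] * 3
    + [("young", "average", "low", "yes")] * 2
    + [("young", "high", "high", "no")]
    + [("middle", "average", "average", "no")]
    + [("middle", "average", "low", "yes")] * 2
    + [("middle", "high", "low", "yes")] * 2
    + [("old", "high", "low", "yes")] * 3
)

def entropy(counts):
    """Entropy (in bits) of a class distribution given as a list of counts."""
    total = sum(counts)
    return sum(-(c / total) * log2(c / total) for c in counts if c > 0)

def information(data, attr):
    """Weighted entropy of OnTime? after splitting on attribute index attr."""
    groups = {}
    for row in data:
        groups.setdefault(row[attr], Counter())[row[-1]] += 1
    return sum(sum(g.values()) / len(data) * entropy(list(g.values()))
               for g in groups.values())

for name, i in [("Age", 0), ("Income", 1), ("Risk", 2)]:
    print(f"{name}: {information(loans, i):.3f} bits")
# Age: 0.731   Income: 0.782   Risk: 0.446  -> split on Risk first
```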
Naïve Bayes Classification (optional topic)

Statistical Decision Tree Modeling
1R uses one attribute at a time and chooses the one that works best. Now consider the “opposite” of 1R: use all the attributes. Let’s first make two assumptions about the attributes:
• They are equally important
• They are statistically independent
Although these assumptions are almost never correct, this scheme works well in practice!

Probabilities for the Weather Data
[Table showing the counts and conditional probabilities (contingency tables) for each attribute value by class; not reproduced here.]
A new day is presented: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True. Suppose the answer is Play = Yes. How likely is it to get the attribute values of this new day?

Bayes’ Rule
Probability of event H given evidence E:
P(H | E) = P(E | H) × P(H) / P(E)
where H = target value and E = input variable values.
• “A priori” probability of H, P(H): the probability of the event before the evidence has been seen.
• “A posteriori” probability of H, P(H | E): the probability of the event after the evidence has been seen.

Naïve Bayes Classification
Classification learning: what is the probability of the class given an instance?
• Evidence E = the instance
• Event H = the class value for the instance
Naïve Bayes assumption: the evidence can be split into independent parts (i.e., the attributes of the instance!):
P(H | E) = [P(E1 | H) × P(E2 | H) × … × P(En | H) × P(H)] / P(E)

Naïve Bayes on the Weather Data
Evidence E: the new day (Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True).
P(Yes | E) = [P(Outlook = Sunny | Yes) × P(Temperature = Cool | Yes) × P(Humidity = High | Yes) × P(Windy = True | Yes) × P(Yes)] / P(E)
P(Yes | E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E) = 0.0053 / P(E)
P(No | E) = (3/5 × 1/5 × 4/5 × 3/5 × 5/14) / P(E) = 0.0206 / P(E)
Note that P(E) will disappear when we normalize!

Comments on Naïve Bayes Classification
• Naïve Bayes works surprisingly well (even when the independence assumption is clearly violated).
• Why? Because classification doesn’t require accurate probability estimates, as long as the maximum probability is assigned to the correct class.
• However, adding too many redundant attributes will cause problems (e.g., identical attributes).

Suggested readings
• Verify the entropy, information, and information gain calculations we did in these slides. Hint: all logs are base 2!
• Read the SAS GSEM 5.3 text, chapter 4 (pp. 61-102).
• Read the Sarma text, chapter 4 (pp. 113-168). Pay particular attention to:
  • Entropy calculations (p. 126)
  • Profit Matrix (p. 136)
  • Expected profit calculations (p. 137)
  • How to use SAS EM and grow a decision tree (pp. 143-158)
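For readers who want to check the Naïve Bayes numbers programmatically (as the first reading suggestion above recommends), here is a small Python sketch. It uses raw relative frequencies, exactly like the hand calculation (no Laplace smoothing); the variable names are illustrative.

```python
from collections import Counter

weather = [  # (Outlook, Temp, Humidity, Windy, Play?)
    ("Sunny", "Hot", "High", False, "No"),       ("Sunny", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),   ("Rainy", "Mild", "High", False, "Yes"),
    ("Rainy", "Cool", "Normal", False, "Yes"),   ("Rainy", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"), ("Sunny", "Mild", "High", False, "No"),
    ("Sunny", "Cool", "Normal", False, "Yes"),   ("Rainy", "Mild", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", True, "Yes"),    ("Overcast", "Mild", "High", True, "Yes"),
    ("Overcast", "Hot", "Normal", False, "Yes"), ("Rainy", "Mild", "High", True, "No"),
]

new_day = ("Sunny", "Cool", "High", True)   # the new day from the slides

class_counts = Counter(row[-1] for row in weather)
scores = {}
for cls, n_cls in class_counts.items():
    score = n_cls / len(weather)            # prior P(class)
    for i, value in enumerate(new_day):     # likelihoods P(E_i | class)
        n_match = sum(1 for row in weather if row[-1] == cls and row[i] == value)
        score *= n_match / n_cls
    scores[cls] = score

total = sum(scores.values())                # P(E) cancels on normalization
for cls, score in scores.items():
    print(f"{cls}: raw = {score:.4f}, normalized = {score / total:.3f}")
# No:  raw = 0.0206, normalized = 0.795
# Yes: raw = 0.0053, normalized = 0.205
```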