DECISION TREES AND CLASSIFICATION
MIS2502 Data Analytics
Adapted from Tan, Steinbach, and Kumar (2004). Introduction to Data Mining. http://www-users.cs.umn.edu/~kumar/dmbook/

What is classification?
• Determining to what group a data element belongs
  • Based on the data contained within that element
  • Or the “attributes” of that “entity”
• Examples
  • Determining whether a customer should be given a loan
  • Flagging a credit card transaction as a fraudulent charge
  • Categorizing a news story as finance, entertainment, or sports

How classification works
1. Choose a set of records for the training set
2. Choose a set of records for the test set
3. Choose a class attribute (classifier, fact of interest, dependent variable, outcome variable)
4. Find a model that predicts the outcome variable as a function of the other attributes
5. Apply that model to the test set to check its accuracy
6. Apply the final model to future records to classify them

Decision Tree Learning

Training Set
Trans. ID | Charge Amount | Avg. Charge, 6 months | Item        | Same state as billing | Classification
1         | $800          | $100                  | Electronics | No                    | Fraudulent
2         | $60           | $100                  | Gas         | Yes                   | Legitimate
3         | $1            | $50                   | Gas         | No                    | Fraudulent
4         | $200          | $100                  | Restaurant  | Yes                   | Legitimate
5         | $50           | $40                   | Gas         | No                    | Legitimate
6         | $80           | $80                   | Groceries   | Yes                   | Legitimate
7         | $140          | $100                  | Retail      | No                    | Legitimate
8         | $140          | $100                  | Retail      | No                    | Fraudulent

The training set goes into classification software to derive a model; the model is then applied to the test set.

Test Set
Trans. ID | Charge Amount | Avg. Charge, 6 months | Item        | Same state as billing | Classification
101       | $100          | $200                  | Electronics | Yes                   | ?
102       | $200          | $100                  | Groceries   | No                    | ?
103       | $1            | $100                  | Gas         | Yes                   | ?
104       | $30           | $25                   | Restaurant  | Yes                   | ?

Goals
• The model developed using the training set should assign test set records to the right category
  • It won’t be 100% accurate, but it should be as close as possible
• Once this is done, the model’s rules can be applied to new records as they come along
• We want an automated, reliable way to predict the outcome

Classification Method: The Decision Tree
• A model that predicts the class of a case, or the value of a dependent variable, based on one or more predictor variables (Tan, Steinbach, and Kumar 2004)
• Presents decision rules in plain English

Some Terms
• What is a “variable”?
• Outcome variable (fact of interest, dependent variable, classification, classifier)
  • Default/no default
  • Sales performance
  • Credit decision
• Predictor variable (factor, independent variable)
  • Age
  • Income
  • Marital status

Example: Credit Card Default

Training Data
TID | Income | Debt | Owns/Rents | Outcome
1   | 25k    | 35%  | Owns       | Default
2   | 35k    | 40%  | Rents      | Default
3   | 33k    | 15%  | Owns       | No default
4   | 28k    | 19%  | Rents      | Default
5   | 55k    | 30%  | Owns       | No default
6   | 48k    | 35%  | Rents      | Default
7   | 65k    | 17%  | Owns       | No default
8   | 85k    | 10%  | Rents      | No default
…

• We create the tree from a set of training data
• Each unique combination of predictors is associated with an outcome
• This set was “rigged” so that every combination is accounted for and has an outcome

Example: Credit Card Default
The tree built from this training data. The root node sits at the top, each split creates child nodes, and the leaf nodes hold the outcomes:

Credit Approval                        ← root node
├── Income < 40k                       ← child node
│   ├── Debt > 20%
│   │   ├── Owns house → Default       ← leaf node
│   │   └── Rents → Default
│   └── Debt < 20%
│       ├── Owns house → No default
│       └── Rents → Default
└── Income > 40k
    ├── Debt > 20%
    │   ├── Owns house → No default
    │   └── Rents → Default
    └── Debt < 20%
        ├── Owns house → No default
        └── Rents → No default
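To make the “derive model” step concrete, here is a minimal sketch of it in Python; scikit-learn and the 0/1 encoding of Owns/Rents are assumptions for illustration, since the slides don’t name a specific tool.

    # Sketch: deriving a decision tree from the credit card default
    # training data (assumes pandas and scikit-learn are installed)
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier, export_text

    train = pd.DataFrame({
        "income": [25, 35, 33, 28, 55, 48, 65, 85],  # in $1,000s
        "debt":   [35, 40, 15, 19, 30, 35, 17, 10],  # % of income
        "owns":   [1, 0, 1, 0, 1, 0, 1, 0],          # 1 = owns, 0 = rents
        "outcome": ["Default", "Default", "No default", "Default",
                    "No default", "Default", "No default", "No default"],
    })

    model = DecisionTreeClassifier()  # splits using the Gini index by default (covered later)
    model.fit(train[["income", "debt", "owns"]], train["outcome"])

    # Print the derived decision rules “in plain English”
    print(export_text(model, feature_names=["income", "debt", "owns"]))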
Same Data, Different Tree

Training Data
TID | Income | Debt | Owns/Rents | Decision
1   | 25k    | 35%  | Owns       | Default
2   | 35k    | 40%  | Rents      | Default
3   | 33k    | 15%  | Owns       | No default
4   | 28k    | 19%  | Rents      | Default
5   | 55k    | 30%  | Owns       | No default
6   | 48k    | 35%  | Rents      | Default
7   | 65k    | 17%  | Owns       | No default
8   | 85k    | 10%  | Rents      | No default
…

We just changed the order of the predictors (the concept of conditional probability!):

Credit Approval
├── Owns house
│   ├── Income < 40k
│   │   ├── Debt > 20% → Default
│   │   └── Debt < 20% → No default
│   └── Income > 40k
│       ├── Debt > 20% → No default
│       └── Debt < 20% → No default
└── Rents
    ├── Income < 40k
    │   ├── Debt > 20% → Default
    │   └── Debt < 20% → Default
    └── Income > 40k
        ├── Debt > 20% → Default
        └── Debt < 20% → No default

Apply the model to new (test) data
Using the first tree (split on income, then debt, then Owns/Rents):

Test Data
TID | Income | Debt | Owns/Rents | Decision (Predicted) | Decision (Actual)
1   | 80k    | 35%  | Rents      | Default              | Default
2   | 20k    | 40%  | Owns       | Default              | Default
3   | 15k    | 15%  | Owns       | No default           | No default
4   | 50k    | 19%  | Rents      | No default           | Default
5   | 35k    | 30%  | Owns       | Default              | Default

How well did the decision tree do in predicting the outcome? (It classified 4 of the 5 test records correctly, missing only TID 4.)

This assumes some things
• For real data sets with many records, the tree is built using software
• The software has to deal with several issues:
  • When the same set of predictors results in different outcomes (probability!)
  • When not every combination of predictors is in the training set
  • When multiple paths result in the same outcome

Tree Induction Algorithms
• Tree induction algorithms take large sets of data and compute the tree
• Similar cases may have different outcomes, so the probability of an outcome is computed for each node
• For instance, you may find that when income > 40k, debt < 20%, and the customer rents, no default occurs 80% of the time (a probability of 0.8)

How the induction algorithm works (a code sketch follows below)
1. Start with a single node containing all the training data.
2. Are the samples in the node all of the same classification?
   • Yes → go to step 4
   • No → continue
3. Are there predictor(s) that will split the data?
   • Yes → partition the node into child nodes according to the predictor(s)
   • No → continue
4. Are there more nodes (i.e., new child nodes)?
   • Yes → go to the next node and repeat from step 2
   • No → DONE!
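Here is one way the loop above could look in code; this is a minimal sketch in plain Python over categorical predictors, not the exact algorithm any particular package uses. Real software also scores candidate splits (see the chi-squared and Gini discussions below); this sketch simply takes the first predictor that separates the data.

    from collections import Counter

    def induce_tree(records, predictors, outcome="outcome"):
        classes = Counter(r[outcome] for r in records)
        # Step 2: if all samples share one classification, stop at a leaf
        if len(classes) == 1 or not predictors:
            return {"leaf": dict(classes)}
        # Step 3: look for a predictor that actually splits the data
        for p in predictors:
            values = {r[p] for r in records}
            if len(values) > 1:
                remaining = [q for q in predictors if q != p]
                # Partition the node into child nodes according to p
                return {"split on": p,
                        "children": {v: induce_tree([r for r in records if r[p] == v],
                                                    remaining, outcome)
                                     for v in values}}
        # No predictor splits the data: a leaf with mixed outcomes (probability!)
        return {"leaf": dict(classes)}

    training = [
        {"income": "<40k", "debt": ">20%", "ownership": "Owns",  "outcome": "Default"},
        {"income": "<40k", "debt": ">20%", "ownership": "Rents", "outcome": "Default"},
        {"income": "<40k", "debt": "<20%", "ownership": "Owns",  "outcome": "No default"},
        {"income": "<40k", "debt": "<20%", "ownership": "Rents", "outcome": "Default"},
        {"income": ">40k", "debt": ">20%", "ownership": "Owns",  "outcome": "No default"},
        {"income": ">40k", "debt": ">20%", "ownership": "Rents", "outcome": "Default"},
        {"income": ">40k", "debt": "<20%", "ownership": "Owns",  "outcome": "No default"},
        {"income": ">40k", "debt": "<20%", "ownership": "Rents", "outcome": "No default"},
    ]
    print(induce_tree(training, ["income", "debt", "ownership"]))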
Start with the root node
• The root node (“Credit Approval”) holds all of the training data
• There are both “defaults” and “no defaults” in the set
• So we need to look for predictors to split the data

Split on income

Credit Approval
├── Income < 40k
└── Income > 40k

• Income is a factor
  • (Income < 40k, Debt > 20%, Owns → Default) but (Income > 40k, Debt > 20%, Owns → No default)
• But there is also a combination of defaults and no defaults within each income group
• So look for another split

Split on debt

Credit Approval
├── Income < 40k
│   ├── Debt > 20%
│   └── Debt < 20%
└── Income > 40k
    ├── Debt > 20%
    └── Debt < 20%

• Debt is a factor
  • (Income < 40k, Debt < 20%, Owns → No default) but (Income < 40k, Debt < 20%, Rents → Default)
• But there is also a combination of defaults and no defaults within some debt groups
• So look for another split

Split on Owns/Rents
• Owns/Rents is a factor; splitting on it completes the full tree shown earlier
• For some cases it doesn’t matter, but for some it does
  • So you group similar branches
• …And we stop because we’re out of predictors!

How does it know when and how to split?
• Our example was simple because we had one case per branch
  • So there was 100% agreement within every branch
• When you have a lot of data, you’ll get different outcomes for the same predictors
• So you need criteria to decide when to split a branch using a predictor
• And you need a way of figuring out when your leaf nodes
  • Describe primarily one type of outcome
  • Are different from one another

Hypothesis Testing
• Suppose we want to test the following statement (hypothesis):
  • The split between “owns” and “rents” will make no difference to the chance of default (a 50-50 chance of default for each group)
• How do we test it? Mathematically:
  • Default(owns) = No default(owns)
  • Default(rents) = No default(rents)
• We need a joint test statistic for these two equations; we want enough confidence to reject at least one of them!

Chi-squared test
• Asks: is the proportion of each outcome class the same in each child node?
• It shouldn’t be, or the classification isn’t very helpful

χ² = Σ (O_ij − E_ij)² / E_ij    (O = observed counts, E = expected counts)

Root (n = 1,500): Default = 750, No default = 750
Owns (n = 850): Default = 300, No default = 550
Rents (n = 650): Default = 450, No default = 200

Observed
           | Owns | Rents | Total
Default    | 300  | 450   | 750
No default | 550  | 200   | 750
Total      | 850  | 650   | 1,500

Hypothesized (Expected)
           | Owns | Rents | Total
Default    | 425  | 325   | 750
No default | 425  | 325   | 750
Total      | 850  | 650   | 1,500

χ² = (300 − 425)²/425 + (550 − 425)²/425 + (450 − 325)²/325 + (200 − 325)²/325
   = 36.7 + 36.7 + 48.0 + 48.0
   = 169.4,  p < 0.0001

• If the groups were the same, you’d expect an even split (the hypothesized, or expected, counts)
• But we can see the outcomes aren’t distributed evenly (the observed counts)
• But is the difference enough (i.e., statistically significant)?
• Small p-values (less than 0.05) mean it’s very unlikely the groups are the same
• So Owns/Rents is a predictor that creates two different groups
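As a quick check of the arithmetic above, the same test can be run in Python; SciPy is an assumed tool here, not something the slides use.

    from scipy.stats import chi2_contingency

    observed = [[300, 450],   # Default:    Owns, Rents
                [550, 200]]   # No default: Owns, Rents

    chi2, p, dof, expected = chi2_contingency(observed, correction=False)
    print(chi2)      # ≈ 169.7 (the slide’s 169.4 sums rounded terms)
    print(p)         # far below 0.05, so the groups really do differ
    print(expected)  # [[425, 325], [425, 325]]: the hypothesized counts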
Gini index
• Also called the “Simpson Diversity Index”
• Measures how much the split lowered the diversity within the groups
• A node where all cases have the same outcome (e.g., all “No default”) has a Gini index of 0
• The more mixed a node is, the closer its index gets to 1

Computing the Gini index

Gini(t) = 1 − Σ [p(j | t)]²

This means: add up the squared proportions for each possible value j in node t, then subtract that total from 1.

Observed
           | Owns | Rents | Total
Default    | 300  | 450   | 750
No default | 550  | 200   | 750
Total      | 850  | 650   | 1,500

Gini(owns)  = 1 − (300/850)² − (550/850)² = 1 − 0.124 − 0.418 = 0.458
Gini(rents) = 1 − (450/650)² − (200/650)² = 1 − 0.479 − 0.094 = 0.427

Neither group is very “pure.” Both have a lot of both classifications in them.

A “pure” example

Gini(t) = 1 − Σ [p(j | t)]²

Observed
           | Owns | Rents | Total
Default    | 850  | 0     | 850
No default | 0    | 650   | 650
Total      | 850  | 650   | 1,500

Gini(owns)  = 1 − (850/850)² − (0/850)² = 1 − 1 − 0 = 0
Gini(rents) = 1 − (0/650)² − (650/650)² = 1 − 0 − 1 = 0

This means both groups are perfectly “pure.” Each group only has one type of classification.
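The same computation takes a few lines of Python (a sketch for illustration, not part of the slides):

    def gini(counts):
        """Gini(t) = 1 minus the sum of squared class proportions in node t."""
        total = sum(counts)
        return 1 - sum((c / total) ** 2 for c in counts)

    print(gini([300, 550]))  # Owns:  ≈ 0.457 (0.458 on the slide, from rounded terms)
    print(gini([450, 200]))  # Rents: ≈ 0.426, also very mixed
    print(gini([850, 0]))    # a “pure” node: 0.0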
Bottom line: Interpreting the model tests
• A high result on the chi-squared test (a low p-value)
  • Means the groups are different
• A low result on the Gini index
  • Means the groups have mostly one classification in them
  • And you want the Gini statistic for the leaf nodes to get smaller as you add branches
• Data mining software computes these metrics for you, but it’s good to know where they come from
  • And you definitely need to understand how to interpret them!

One more thing…
• Outcome cases are not 100% certain
• There are probabilities attached to each outcome in a node
• So let’s code “Default” as 1 and “No default” as 0

Credit Approval
├── Income < 40k
│   ├── Debt > 20% → 0: 15%, 1: 85%
│   └── Debt < 20%
│       ├── Owns house → 0: 80%, 1: 20%
│       └── Rents → 0: 70%, 1: 30%
└── Income > 40k
    ├── Debt > 20%
    │   ├── Owns house → 0: 78%, 1: 22%
    │   └── Rents → 0: 40%, 1: 60%
    └── Debt < 20% → 0: 74%, 1: 26%

So what is the chance that:
• A renter making more than $40,000 with debt more than 20% of income will default? (60%)
• A home owner making less than $40,000 with debt more than 20% of income will default? (85%)

Is this the only way to analyze the data?
• No; alternatives include:
  • Conditional probability/expectations
  • Regression
    • Likelihood of default = a0 + a1*Owns (or Rents) + a2*Income + a3*Debt + error
    • Fit this equation to historical data with a regression technique: either the least squares method or the maximum likelihood method

Simple linear regression
• An example of a fitted model: Love of Math = 5 + .01 * (math SAT score)
  • 5 is the intercept and .01 is the slope
  • The fit here is not significant (p = .22)
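A closing sketch (tooling and encoding assumed, not from the slides) of the regression alternative applied to the credit data: because the outcome is binary, the maximum likelihood method, via logistic regression, is the usual way to fit the default-likelihood equation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Columns: owns (1 = owns, 0 = rents), income ($1,000s), debt (% of income)
    X = np.array([
        [1, 25, 35], [0, 35, 40], [1, 33, 15], [0, 28, 19],
        [1, 55, 30], [0, 48, 35], [1, 65, 17], [0, 85, 10],
    ])
    y = np.array([1, 1, 0, 1, 0, 1, 0, 0])  # 1 = Default, 0 = No default

    model = LogisticRegression().fit(X, y)
    print(model.intercept_, model.coef_)      # a0, then a1 a2 a3
    # Predicted [P(no default), P(default)] for a renter: income 50k, debt 25%
    print(model.predict_proba([[0, 50, 25]]))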