BUSI4359 Accounting Analytics
Instructor: Dr. Qingxin Meng
Week 2: Logistic Regression and Decision Trees (and Node Purity)

7. Structure of an Analytics Problem

Model Name            Inputs → Outputs             Fixed Structure   Learnt Parameters        Training Data
LINEAR REGRESSION     House Size → House Price     Straight line     1. Slope  2. Intercept   Boston Housing Data
LOGISTIC REGRESSION   Vitamin Sales → Pregnancy    Hyperplane        Placement of the plane   Retail Transaction Data

A Fixed Structure + Learnt Parameters + Training Data → A Final Model

Model structure: underfitting and overfitting

1. Example of incorrect structure
• When we are doing a regression, for example, there is no reason to think the data will follow a straight line…
• Often a "linear" model's structure is too tight.

2. Looser structures?
• A CURVED LINE for regression…
• MULTIPLE HYPERPLANES for classification…

2. Deciding on Model Structure
• TOO MUCH STRUCTURE = overfitting
• TOO LITTLE STRUCTURE = poor fit
Either will generate BAD BUSINESS PERFORMANCE!
• Consider setting a price point for a new product.
• As an analytics problem, this requires looking at previous, comparable products and modelling the relationship between price and revenue.
• This is a non-linear relationship, so you instinctively would not favour a model like linear regression.

3. Just enough structure to model the data
• All our data points here are previous products from similar categories.
• Discount the price point too low, and no amount of sales will save overall revenue.
• And once we set the price too high, sales drop off too rapidly to balance the per-item gain.
[Figure: revenue plotted against item price point; the blue curve denotes the true relationship, which can be modelled with a polynomial regression.]

3. Overfitting produces poor performance
• However, consider if we had allowed a super-curvy line that went through almost every point in our data…
• Despite being a great fit, that line's predictive performance would be poor. The simpler blue line was a far better model to use for new products.
[Figure: the same revenue/price data with a wildly oscillating curve passing through nearly every point; in polynomial regression we can add too many degrees of freedom.]

3. Underfitting can be even worse
• Overfitting is an important issue (and one that you will see come up frequently in business analytics problems).
• However, underfitting is potentially an even worse problem, and it occurs when your model is just too simple.

3. Underfitting and Overfitting
• It is very hard to know how complex or simple a parametric model needs to be without experience of the problem domain. As a result, business analysts will tend to try lots of models and assess which is best.
• Bad analysts will assess this by fit (and overfit!).
• Good analysts will see how well the candidate models generalize to new data. But they will also use "non-parametric models" too…

Training and Testing
• Testing the performance of a model on the training data evaluates Training Accuracy.
• Testing the performance of a model on a NEVER SEEN BEFORE test set evaluates Testing Accuracy.
Which one (training accuracy, testing accuracy) is the better measurement of model performance?
• Overfitting and Underfitting are two fundamental concepts in Machine Learning. From the following statements, which ones are correct representations of Overfitting and Underfitting?
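To make the training-versus-testing distinction concrete, here is a minimal sketch, assuming scikit-learn and NumPy are available. The price/revenue data and the polynomial degrees (1, 2, 15) are invented purely for illustration; they are not the lecture's figures.

```python
# Minimal sketch: training accuracy vs. testing accuracy on synthetic
# price/revenue data (an inverted-U "true" curve plus noise).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
price = rng.uniform(1, 10, 60).reshape(-1, 1)
revenue = -2 * (price.ravel() - 5) ** 2 + 50 + rng.normal(0, 4, 60)

X_train, X_test, y_train, y_test = train_test_split(
    price, revenue, test_size=0.3, random_state=0)

for degree in (1, 2, 15):  # too little, "just enough", too much structure
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    print(f"degree={degree:2d}  "
          f"train R^2={r2_score(y_train, model.predict(X_train)):.2f}  "
          f"test R^2={r2_score(y_test, model.predict(X_test)):.2f}")
```

Testing accuracy is the better measurement: the degree-15 model typically scores best on the training data while generalizing worst to the held-out test set, which is exactly the overfitting pattern described above.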
4. Underfitting (in Classifiers)
• Both underfitting and overfitting are serious challenges to both classifiers and regressors.
• Here you can see that logistic regression can never be a solution to the business task.
• Why?

4. Underfitting (in Classifiers)
• The reason is that logistic regression can only construct a single partitioning hyperplane.
• And in this example no single line can separate the classes our business cares about…
• … but two lines could!

Model 4: DECISION TREES (finally, more than one line)

5. Decision Trees
For classification tasks (e.g. will a customer leave within 3 months) I want you to do the following:
• Create a Logistic Regression model.
• Examine how accurate your logistic regression model is for use in the real-world business case.
• Try and beat it using a Decision Tree (a sketch of this comparison follows at the end of this section).

5. Decision Tree Example
A decision tree (of which there are many varieties, but which are predominantly used for classification) functions as follows:
• Find the feature where we can find a single point which will linearly discriminate the different classes the most.
• Take each subspace created and repeat this process.
• Keep iterating until an 'end condition' is reached.

5. Decision Tree Example
• Let's consider an example business task that a lot of retailing companies would be interested in.
• How does the psychology of a customer affect the impact of advertising on them?
• Given how "agreeable" a customer is, and how often they already shop at a retailing store, can we predict if an advertising campaign will increase their visits?

5. Decision Tree Example
[Figure: a decision tree over the features agreeableness and total spend, splitting at thresholds such as agreeableness > -2.24, agreeableness > 1.15, agreeableness > -1.79, spend > 0.78 and spend > -0.98, with each leaf predicting "visited more" or "no change".]

5. Decision Tree Example
[Figure: the same tree with its leaves relabelled as business actions: "advertise to" where the model predicts "visited more", and "ignore" where it predicts "no change".]

5. Decision Tree Example
• The original algorithm for decision trees, called ID3, only used categorical features (categorical features have a smaller search space, so are easier to analyse).
• This algorithm was extended to continuous variables, and called C4.5. This algorithm allowed the use of input features which were continuous (searching across their values for optimal placement of hyperplanes).
• Note these techniques are "non-parametric".
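Here is a minimal sketch of the exercise above, assuming scikit-learn. The XOR-style dataset is invented to mirror the "no single line, but two lines could" situation; it is not the lecture's retail data.

```python
# Minimal sketch of the exercise: logistic regression vs. a decision tree
# on data where no single line separates the classes (XOR-style labels).
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)  # class flips across two lines

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

logit = LogisticRegression().fit(X_train, y_train)
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# One hyperplane cannot separate these classes, so accuracy sits near chance;
# the tree can stack several axis-aligned splits and does far better.
print("logistic regression:", accuracy_score(y_test, logit.predict(X_test)))
print("decision tree:      ", accuracy_score(y_test, tree.predict(X_test)))
```

The tree wins here because it can combine multiple partitioning "lines", while logistic regression is restricted to a single hyperplane.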
6. Decision Tree example – Customer Churn!
• Let's work through a canonical example of how a decision tree algorithm works.
• All types of decision tree split using a notion of "purity".

6. Decision Tree example – Customer Churn!
[Figure: twelve cartoon customers, numbered 1 to 12, each with a head shape, body shape, shirt colour and wellbeing, labelled with whether they exited: 1 No, 2 Yes, 3 Yes, 4 Yes, 5 Yes, 6 No, 7 Yes, 8 No, 9 No, 10 Yes, 11 No, 12 Yes.]
1. What style of problem is this?
2. What is the output feature/variable?
3. How many classes are we trying to predict?
4. What are the input variables and their values?
5. What model would be good to use?

Task I. Decide which feature to split on
To work this out, create a table and analyse each input (independent) feature: if you split on it, how "pure" would the resulting groups be in the output class (Exited)?
Answer: BODY SHAPE!
If we split on the head shape category, then these will be the subgroups that result. If someone fell into the "round" class we would predict the mode: "YES". That notion allows us to calculate a measure that considers how often the model could guess right if we split on this category.

7. Classification Accuracy
• Classification accuracy is one way of measuring the concept of node purity (a measure which allows us to pick between different features, and values, to split our decision tree on).
• In each node we would predict the most probable class, so our error rate for a node will be:
Classification Error = 1 − max(p_i)
• We assess a split by looking at the percentage of items flowing into each node it produces, weighting each node's classification error by that percentage, and summing the results across the nodes.
• Thus if our decision tree used classification error as a measure of node impurity, it would indeed pick body shape to split on; it has the minimum expected error (the sketch below verifies this).
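A minimal sketch of that calculation in plain Python, using the (Yes, No) counts per subgroup from the twelve-customer example:

```python
# Expected classification error of a split: weight each node's error
# (1 - max(p_i)) by the fraction of items flowing into that node.

def node_error(counts):
    """Classification error of one node: 1 - max(p_i)."""
    return 1 - max(counts) / sum(counts)

def expected_error(split):
    """Sum of per-node errors, weighted by the share of items per node."""
    grand_total = sum(sum(c) for c in split.values())
    return sum(sum(c) / grand_total * node_error(c) for c in split.values())

# (yes, no) counts per subgroup for each candidate feature.
candidate_splits = {
    "head shape":   {"square": (5, 4), "round": (2, 1)},
    "body shape":   {"oval": (2, 4), "square": (5, 1)},
    "shirt colour": {"black": (6, 4), "white": (1, 1)},
    "wellbeing":    {"sad": (2, 3), "happy": (3, 2), "nonplussed": (2, 0)},
}

for feature, split in candidate_splits.items():
    print(f"{feature:13s} expected error = {expected_error(split):.3f}")
# body shape has the minimum expected error (0.250), so it is split on first.
```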
Now we have our first split, take the sub-groups it produces and do it all again!
[Figure: the first split. Node A (all the people) splits on Body Shape into Node B (square bodies: 5 YES, 1 NO) and Node C (oval bodies: 2 YES, 4 NO).]
[Figure: the complete tree. Node B (square bodies) splits on Shirt Colour into Node B1 (black shirts: 5 YES, 0 NO) and Node B2 (white shirts: 0 YES, 1 NO); Node C (oval bodies) splits on Head Shape into Node C1 (square heads: 0 YES, 4 NO) and Node C2 (round heads: 2 YES, 0 NO).]

UH OH! New data…
[Figure: the same twelve customers with the same Exited labels, but things have changed in the Wellbeing column.]

Task II. Now which feature to split on?
Things have changed in the Wellbeing column… now which feature choice would create the most "pure" groups based on classification error?
With that small change in the data, we end up with a completely different tree, splitting on Wellbeing first…
[Figure: Node A (all the people) splits on Wellbeing into Node B (sad: customers 1, 6, 9, 11; 0 YES, 4 NO), Node C (neither: customers 7, 10; 2 YES, 0 NO) and Node D (happy: customers 2, 3, 4, 5, 8, 12; 5 YES, 1 NO). Node D then splits on Head Shape into Node D1 (round heads: 2 YES, 0 NO) and Node D2 (square heads: 3 YES, 1 NO), and Node D2 splits again on Body Shape into a pure 3 YES, 0 NO node and a 0 YES, 1 NO node.]
With that small change in the data, we ultimately ended up with a completely different tree.

4. Conclusion
• Decision Trees are very robust analytics models. Their principles underpin many real-world analytics solutions, from credit default prediction to customer churn.
• However, they are potentially volatile, and require some arbitrary selection of a method of "node purity".
• They make it hard to underfit your data…
• But they have to be told when to stop splitting nodes, or they will ultimately completely overfit the data.

viii. A note on terminology

                                  One Input Variable                Many Input Variables
Predicting from 2 classes         Binary Logistic Regression        Multiple Logistic Regression
Predicting from several classes   Multinomial Logistic Regression   Multiple Multinomial Logistic Regression

ii. Decision Trees + continuous variables
[Figure: recap of the agreeableness/total-spend decision tree, with leaves mapped to the actions "advertise to" and "ignore".]

iii. Decision Trees + discrete variables
[Figure: recap of a discrete-variable split from the customer churn example, e.g. a 5 YES, 0 NO node versus a 0 YES, 1 NO node.]

1. Gini Impurity / Index
• Gini Impurity is the likelihood of an incorrect classification of a new instance of a random variable, if that new instance were randomly classified according to the distribution of class labels from the data set.
• The Gini Impurity of each node is calculated as:
Gini Index = 1 − Σ_j (p_j)²
• So you simply calculate the probability (relative frequency) of each class label occurring in a node, sum the squares, and subtract from one.
• A split, as before, is a good one if it has the lowest "expected" Gini impurity.

1. Gini Impurity example
Consider Head Shape:
• For SQUARE we have: 1 − (5/9)² − (4/9)² = 0.494
• For ROUND we have: 1 − (2/3)² − (1/3)² = 0.444
So our expected value is:
• (9/12 × 0.494) (SQUARE) + (3/12 × 0.444) (ROUND) = 0.481
Note again Body Shape has lower Gini Impurity!

Feature        Subgroup      Yes   No   Total   Gini Index   Weighted
Head Shape     SQUARE         5     4     9       0.494        0.370
               ROUND          2     1     3       0.444        0.111
               Expected Gini: 0.481
Body Shape     OVAL           2     4     6       0.444        0.222
               SQUARE         5     1     6       0.278        0.139
               Expected Gini: 0.361
Shirt Colour   BLACK          6     4    10       0.480        0.400
               WHITE          1     1     2       0.500        0.083
               Expected Gini: 0.483
Wellbeing      SAD            2     3     5       0.480        0.200
               HAPPY          3     2     5       0.480        0.200
               NONPLUSSED     2     0     2       0.000        0.000
               Expected Gini: 0.400

1. Gini Impurity / Index
• The CART algorithm uses Gini Impurity, and therefore produces a different tree than if it had used classification error.
• Often this will produce a better tree, because its predictions will overfit less; it will be more generalizable to unseen real-world data.
• Remember this: it is crucial, as we care not about 'fitting' but about new predictions.
• However, potentially we have an even better option… entropy.

2. Entropy
• Entropy as a concept was invented by Claude Shannon.
• It underpins a huge mathematical field called "Information Theory".
• Information theory is highly related to prediction, and hence of immense importance to analytics.
• It can be slightly dry.

2. Entropy
This concept can be considered in several ways:
• Entropy as a measure of 'node impurity'.
• Entropy as a measure of randomness.
• Entropy as a measure of predictability.
An entropy score of zero means we have complete uniformity: the unknown is perfectly predictable. As entropy increases, the probabilities of the potential outcomes start to match each other; they become equally likely and impossible to predict.

3. Entropy in Decision Trees
Entropy = −Σ_x P(x) log₂ P(x)
Consider that the entropy of this whole dataset is:
−0.3 log₂ 0.3 − 0.7 log₂ 0.7 = 0.88132
→ What is the best node to split on if we use entropy as our measure of node impurity?
Note: in entropy calculations, 0 log₂ 0 = 0.
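To close the loop, here is a minimal sketch computing both expected Gini impurity and expected entropy for the four candidate splits, reusing the (Yes, No) counts from the worked example above; the 0.3/0.7 line mirrors the slide's entropy calculation.

```python
# Minimal sketch: Gini impurity and entropy as measures of node impurity.
from math import log2

def gini(counts):
    """Gini impurity of one node: 1 - sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    """Shannon entropy of one node, using the convention 0*log2(0) = 0."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def expected(measure, split):
    """Weight each node's impurity by the fraction of items it receives."""
    grand_total = sum(sum(c) for c in split.values())
    return sum(sum(c) / grand_total * measure(c) for c in split.values())

candidate_splits = {
    "head shape":   {"square": (5, 4), "round": (2, 1)},
    "body shape":   {"oval": (2, 4), "square": (5, 1)},
    "shirt colour": {"black": (6, 4), "white": (1, 1)},
    "wellbeing":    {"sad": (2, 3), "happy": (3, 2), "nonplussed": (2, 0)},
}

print(f"0.3/0.7 entropy example: {entropy((3, 7)):.4f}")  # prints 0.8813
for feature, split in candidate_splits.items():
    print(f"{feature:13s} expected Gini = {expected(gini, split):.3f}  "
          f"expected entropy = {expected(entropy, split):.3f}")
# Body shape gives the lowest expected impurity under both measures,
# matching the 0.361 expected Gini in the table above.
```

For this data, Gini and entropy agree that body shape is the best first split; the two measures often coincide, though they can rank borderline splits differently.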