Chapter 5
Predictive Analytics I: Trees, k-Nearest Neighbors, Naive Bayes’, and Ensemble Estimates

Copyright ©2018 McGraw-Hill Education. All rights reserved.

Chapter Outline
- 5.1 Decision Trees I: Classification Trees
- 5.2 Decision Trees II: Regression Trees
- 5.3 k-Nearest Neighbors
- 5.4 Naive Bayes’ Classification
- 5.5 An Introduction to Ensemble Estimates

LO5-1: Interpret the information provided by classification trees.

5.1 Decision Trees I: Classification Trees
- Decision trees
  - Regression tree: predicts a quantitative response variable
  - Classification tree: predicts a qualitative or categorical response variable
- Dummy variable: a quantitative variable used to represent a qualitative variable
- Training data: the portion of the data used to fit the analytic
- Validation data: the portion of the data used to assess how well the analytic fitted to the training data fits data different from the training data

Decision Trees I: Classification Trees (the card upgrade example)
- Goal: predict whether a customer will upgrade for a fee
- Studied 40 existing customers who were offered the upgrade
- Response variable
  - 1 – upgraded
  - 0 – did not upgrade
- Predictor variables
  - Purchases: recorded in thousands of dollars
  - Profile: 1 – fits profile, 0 – does not fit profile

Decision Trees I: Classification Trees (continued)
- Sample proportion p̂
- Examine each potential predictor
  - p̂ with purchases ≥ a candidate split value who upgraded
  - p̂ with purchases < that value who upgraded
  - p̂ conforming to profile (1) who upgraded
  - p̂ not conforming to profile (0) who upgraded
- p̂ that upgraded among all 40 customers: p̂ = 19/40 = 0.4750, or 47.50 percent

A JMP Classification Tree for the Card Upgrade Data
- Figure 5.1 (a)

Decision Trees I: Classification Trees (continued)
- The tree chooses a combination of predictor variable and split point
  - Intuitively, the chosen combination produces the greatest difference between the upgrade proportions of the two resulting groups
- The search then continues on each of the two resulting groups
- Splitting stops at a leaf (terminal leaf) when
  - a further split would produce a leaf smaller than the specified minimum split size, or
  - p̂ is either 1 or 0 (a pure leaf, so no splitting is possible)
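The split search described above can be sketched in Python. This is a minimal illustration, not JMP's procedure: the eight customers are made up (they are not the 40-customer card upgrade data), the names `upgrade_proportion` and `best_split` are hypothetical, and the criterion used (largest absolute difference in upgrade proportions between the two groups) follows only the intuitive description on the slide.

```python
def upgrade_proportion(rows):
    """Sample proportion of rows whose response (upgrade) equals 1."""
    return sum(r["upgrade"] for r in rows) / len(rows)


def best_split(rows, predictors, min_size=1):
    """Return (gap, predictor, split_value) for the split with the largest
    difference in upgrade proportions between the two resulting groups.

    A split sends rows with predictor >= split_value to one group and the
    remaining rows to the other. Splits that would create a group smaller
    than min_size are skipped (an assumed stand-in for a minimum split size).
    """
    best = None
    for name in predictors:
        for value in sorted({r[name] for r in rows}):
            upper = [r for r in rows if r[name] >= value]
            lower = [r for r in rows if r[name] < value]
            if len(upper) < min_size or len(lower) < min_size:
                continue
            gap = abs(upgrade_proportion(upper) - upgrade_proportion(lower))
            if best is None or gap > best[0]:
                best = (gap, name, value)
    return best


# Hypothetical toy data: purchases in thousands of dollars, profile and
# upgrade coded as 0/1 dummy variables.
customers = [
    {"purchases": 7.5, "profile": 0, "upgrade": 0},
    {"purchases": 12.0, "profile": 0, "upgrade": 0},
    {"purchases": 18.3, "profile": 1, "upgrade": 0},
    {"purchases": 26.1, "profile": 0, "upgrade": 0},
    {"purchases": 31.4, "profile": 1, "upgrade": 1},
    {"purchases": 39.9, "profile": 0, "upgrade": 1},
    {"purchases": 45.2, "profile": 1, "upgrade": 1},
    {"purchases": 52.8, "profile": 1, "upgrade": 1},
]

gap, predictor, value = best_split(customers, ["purchases", "profile"])
print(f"Split on {predictor} >= {value}: difference in proportions = {gap:.2f}")
```

In this toy data the split on purchases leaves two pure leaves (upgrade proportions of 1 and 0), so under the stopping rules above no further splitting would be possible.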
Decision Trees I: Classification Trees (continued)
- Confusion matrix: summarizes a classification analytic's success in classifying observations in the training data set and/or the validation data set
- Entropy RSquare: the square of the simple correlation coefficient between the observed 0 and 1 upgrade values and the corresponding upgrade probability estimates

LO5-2: Interpret the information provided by regression trees.

5.2 Decision Trees II: Regression Trees
- 705 applicants studied to predict college GPA
  - 50% – training data set (352)
  - 50% – validation data set (353)
- Compute 𝜇 for each group
- Use the prediction(s) to calculate three quantities
  - MSE
  - RMSE
  - RSquare
- Examine each predictor variable and every possible way of splitting the values of each predictor variable into two groups

Final Regression Tree
- Figure 5.12 (c)

LO5-3: Interpret the information provided by k-nearest neighbors.

5.3 k-Nearest Neighbors
- The nearest neighbors to an observation are determined by measuring the distance between the set of predictor variable values for that observation and the set of predictor variable values for every other observation
- Predicting a quantitative response variable using k-nearest neighbors works the same way as classifying a qualitative response variable, except that the prediction is the average of the response variable values for the k nearest neighbors

Nearest Neighbors in the Upgrade Example
- Figure 5.26 (partial)

Classification Using Nearest Neighbors in the Upgrade Example
- Figure 5.27 (partial)

LO5-4: Interpret the information provided by naive Bayes’ classification.

5.4 Naive Bayes’ Classification
- Uses a “naive” version of Bayes’ Theorem to classify observations
  - Full version of Bayes’ Theorem
  - Naive version of Bayes’ Theorem

LO5-5: Interpret the information provided by ensemble models.
5.5 An Introduction to Ensemble Estimates
- Ensemble estimate: combines the estimates or predictions obtained from different analytics to arrive at an overall result
- Table 5.3
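One simple way to combine estimates from different analytics is to average them. The Python sketch below assumes that approach; the slide does not say which combining rule is used, the three probability estimates are made up for illustration (they are not taken from Table 5.3), and the 0.5 classification cutoff and the name `ensemble_estimate` are likewise assumptions.

```python
from statistics import mean


def ensemble_estimate(estimates):
    """Average the estimates produced by several analytics (assumed rule)."""
    return mean(estimates.values())


# Hypothetical upgrade probability estimates for a single customer,
# one from each analytic.
estimates = {
    "classification_tree": 0.80,
    "k_nearest_neighbors": 0.60,
    "naive_bayes": 0.70,
}

p_ensemble = ensemble_estimate(estimates)

# Assumed cutoff: classify as an upgrader when the combined estimate
# is at least 0.5.
predicted_class = 1 if p_ensemble >= 0.5 else 0

print(f"Ensemble upgrade probability estimate: {p_ensemble:.2f}")
print(f"Predicted class: {predicted_class}")
```

Averaging gives each analytic equal weight; other combining rules, such as a majority vote over the individual classifications, fit the same definition of an ensemble estimate.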