Chapter 5

Predictive Analytics I: Trees, k-Nearest Neighbors, Naive Bayes’, and Ensemble Estimates
Copyright ©2018 McGraw-Hill Education. All rights reserved.
Chapter Outline
5.1 Decision Trees I: Classification Trees
5.2 Decision Trees II: Regression Trees
5.3 k-Nearest Neighbors
5.4 Naive Bayes’ Classification
5.5 An Introduction to Ensemble Estimates
LO5-1: Interpret the information provided by classification trees.
5.1 Decision Trees I: Classification Trees
 Decision trees
   Regression tree: predicts a quantitative response variable
   Classification tree: predicts a qualitative or categorical response variable
 Dummy variable: a quantitative variable used to represent a qualitative variable
 Training data: the portion of the data used to fit the analytic
 Validation data: the portion of the data used to assess how well the analytic fitted to the training data fits data different from the training data
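The definitions above can be illustrated with a minimal sketch. The function names, the 50% split fraction, the random seed, and the ten made-up observation ids are all assumptions for illustration; the slides do not say how the data were partitioned or coded.

```python
import random

def split_train_validation(rows, train_fraction=0.5, seed=0):
    """Shuffle a copy of the rows and split it into (training, validation)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]

def to_dummy(value, category_coded_as_one):
    """Dummy variable: 1 if the qualitative value matches the category, else 0."""
    return 1 if value == category_coded_as_one else 0

rows = list(range(10))  # hypothetical observation ids
train, valid = split_train_validation(rows)
```

The analytic would be fitted to `train` only, and `valid` would be held back to check how well that fit carries over to data it has not seen.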
Decision Trees I: Classification Trees
 Prediction of upgrade for a fee
   Studied 40 existing customers
   Offered the upgrade
 Response variable
   1 – upgraded
   0 – did not upgrade
 Predictor variables
   Purchases: recorded in thousands of dollars
   Profile: 1 – fits profile, 0 – did not fit profile
Decision Trees I: Classification Trees Continued
 Sample proportion 𝑝
 Examine each potential predictor
   𝑝 with purchases ≥ that value who upgraded
   𝑝 with purchases < that value who upgraded
   𝑝 conforming to profile (1) who upgraded
   𝑝 not conforming to profile (0) who upgraded
 𝑝 that upgraded
   𝑝 = 19/40 = .4750 or 47.50 percent
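The proportions listed above can be computed as in the following sketch. Only the overall figure (19 of 40) comes from the slide; the six purchase and upgrade values and the function names are made up for illustration, and the code assumes both sides of the split are non-empty.

```python
def proportion_upgraded(upgraded):
    """Sample proportion of 1s in a list of 0/1 upgrade values."""
    return sum(upgraded) / len(upgraded)

def proportions_for_split(purchases, upgraded, value):
    """Proportion who upgraded with purchases >= value and with purchases < value."""
    at_or_above = [u for x, u in zip(purchases, upgraded) if x >= value]
    below = [u for x, u in zip(purchases, upgraded) if x < value]
    return proportion_upgraded(at_or_above), proportion_upgraded(below)

# Overall proportion from the slide: 19 of the 40 customers upgraded.
overall = proportion_upgraded([1] * 19 + [0] * 21)

# Made-up data (purchases in thousands of dollars), not the textbook data set.
purchases = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
upgraded = [0, 0, 1, 0, 1, 1]
p_high, p_low = proportions_for_split(purchases, upgraded, 30.0)
```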
A JMP Classification Tree for the Card Upgrade Data
Figure 5.1 (a)
Decision Trees I: Classification Trees Continued
 Each split is a combination of a predictor variable and a split point
 Intuitively, the chosen split produces the greatest difference in the proportion who upgraded between the two resulting groups
 The search then continues on each of the two resulting groups
 Splitting stops at a leaf (terminal leaf) when
   a further split would produce a leaf smaller than the specified minimum split size, or
   𝑝 is either 1 or 0 (a pure leaf: no splitting possible)
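The split search and stopping rules above can be sketched for a single quantitative predictor. This uses the intuitive proportion gap as the criterion, which is an assumption: the slides do not give JMP's actual splitting statistic, and `best_split`, `min_size`, and the data are hypothetical.

```python
def proportion(values):
    """Sample proportion of 1s in a list of 0/1 values."""
    return sum(values) / len(values)

def best_split(x, y, min_size=1):
    """Search one predictor for the split value with the largest proportion gap.

    Returns (split_value, gap), or None when the group is a leaf: it is pure
    (proportion 0 or 1) or every split would leave a group below min_size.
    """
    if proportion(y) in (0.0, 1.0):
        return None  # pure leaf: no splitting possible
    best = None
    for value in sorted(set(x))[1:]:  # skip the smallest value so "below" is non-empty
        high = [yi for xi, yi in zip(x, y) if xi >= value]
        low = [yi for xi, yi in zip(x, y) if xi < value]
        if min(len(high), len(low)) < min_size:
            continue  # would produce a leaf smaller than the minimum size
        gap = abs(proportion(high) - proportion(low))
        if best is None or gap > best[1]:
            best = (value, gap)
    return best

# Made-up data: purchases (thousands of dollars) and 0/1 upgrade values.
x = [10, 20, 30, 40, 50, 60]
y = [0, 0, 1, 0, 1, 1]
split = best_split(x, y, min_size=2)
```

A full tree would apply `best_split` to every predictor, keep the best overall, and then repeat on each of the two resulting groups until every group is a leaf.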
Decision Trees I: Classification Trees Continued
 Confusion matrix: summarizes a classification analytic's success in classifying observations in the training data set and/or validation data set
 Entropy RSquare: the square of the simple correlation coefficient between the observed 0 and 1 upgrade values and the corresponding upgrade probability estimates
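Both summaries can be sketched on made-up numbers. The 0.5 classification cutoff, the function names, and the four observations are assumptions; the squared correlation follows the wording of the slide and is not claimed to reproduce JMP's output.

```python
def confusion_matrix(actual, predicted):
    """Counts keyed by (actual, predicted) for 0/1 labels."""
    counts = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        counts[(a, p)] += 1
    return counts

def squared_correlation(observed, estimates):
    """Square of the simple correlation between two equal-length lists."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_e = sum(estimates) / n
    sxy = sum((o - mean_o) * (e - mean_e) for o, e in zip(observed, estimates))
    sxx = sum((o - mean_o) ** 2 for o in observed)
    syy = sum((e - mean_e) ** 2 for e in estimates)
    return sxy ** 2 / (sxx * syy)

# Made-up observed upgrade values and upgrade probability estimates.
actual = [1, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.1]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]  # assumed 0.5 cutoff

matrix = confusion_matrix(actual, predicted)
r_square = squared_correlation(actual, probabilities)
```

Here `matrix[(1, 0)]` counts customers who upgraded but were classified as non-upgraders, and `matrix[(0, 1)]` counts the opposite error.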
LO5-2: Interpret the information provided by regression trees.
5.2 Decision Trees II: Regression Trees
 705 applicants studied to predict college GPA
   50% – training data set (352)
   50% – validation data set (353)
 Compute 𝜇 for each group
 Use prediction(s) to calculate three quantities
   MSE
   RMSE
   RSquare
 Examine each predictor variable and every possible way of splitting the values of each predictor variable into two groups
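The steps above can be sketched for one predictor: try every two-group split, predict each observation by its group mean, and score the result. Choosing the split with the smallest sum of squared errors, dividing by n in the MSE, and the four made-up observations are assumptions; the slides do not state the exact formulas used by the software.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def fit_metrics(actual, predicted):
    """MSE, RMSE, and RSquare (1 - SSE/SST) for a set of predictions."""
    sse = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    sst = sum((a - mean(actual)) ** 2 for a in actual)
    mse = sse / len(actual)  # assumed divisor: n
    return {"MSE": mse, "RMSE": sqrt(mse), "RSquare": 1 - sse / sst}

def best_regression_split(x, y):
    """Return (split_value, sse) for the two-group split with the smallest SSE."""
    best = None
    for value in sorted(set(x))[1:]:  # skip the smallest value so "low" is non-empty
        low = [yi for xi, yi in zip(x, y) if xi < value]
        high = [yi for xi, yi in zip(x, y) if xi >= value]
        sse = (sum((yi - mean(low)) ** 2 for yi in low)
               + sum((yi - mean(high)) ** 2 for yi in high))
        if best is None or sse < best[1]:
            best = (value, sse)
    return best

# Made-up predictor values and college GPAs, not the 705-applicant data set.
x = [1, 2, 3, 4]
gpa = [2.0, 2.2, 3.8, 4.0]
split_value, _ = best_regression_split(x, gpa)

# Predict each observation with the mean of the group it falls in.
low_mean = mean([g for xi, g in zip(x, gpa) if xi < split_value])
high_mean = mean([g for xi, g in zip(x, gpa) if xi >= split_value])
predictions = [low_mean if xi < split_value else high_mean for xi in x]
metrics = fit_metrics(gpa, predictions)
```

Computing the same metrics on the validation data, using group means learned from the training data, shows how well the tree generalizes.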
Final Regression Tree
Figure 5.12 (c)
LO5-3: Interpret the information provided by k-nearest neighbors.
5.3 k-Nearest Neighbors
 Nearest neighbors to an observation are determined by measuring the distance between the set of predictor variable values for that observation and the set of predictor variable values for every other observation
 Predicting a quantitative response variable using k-nearest neighbors is the same as classifying a qualitative response variable except that we predict the quantitative response variable by averaging the response variable values for the k-nearest neighbors
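Both uses of k-nearest neighbors described above can be sketched as follows. Euclidean distance on unscaled predictors, majority voting for classification, and the five made-up observations are assumptions; the slides say only that a distance between predictor values is measured.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+

def nearest_neighbors(point, xs, k):
    """Indices of the k rows of xs closest to point."""
    order = sorted(range(len(xs)), key=lambda i: dist(point, xs[i]))
    return order[:k]

def knn_classify(point, xs, labels, k):
    """Qualitative response: most common label among the k nearest neighbors."""
    votes = Counter(labels[i] for i in nearest_neighbors(point, xs, k))
    return votes.most_common(1)[0][0]

def knn_predict(point, xs, ys, k):
    """Quantitative response: average response of the k nearest neighbors."""
    neighbors = nearest_neighbors(point, xs, k)
    return sum(ys[i] for i in neighbors) / k

# Made-up predictor values: (purchases in thousands of dollars, profile 0/1).
xs = [(1, 0), (2, 0), (3, 1), (8, 1), (9, 1)]
labels = [0, 0, 0, 1, 1]              # hypothetical 0/1 upgrade values
responses = [10, 20, 30, 80, 100]     # hypothetical quantitative response

new_point = (7, 1)
predicted_class = knn_classify(new_point, xs, labels, k=3)
predicted_value = knn_predict(new_point, xs, responses, k=3)
```

In practice predictors measured on different scales are often standardized before distances are computed, so that no single variable dominates.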
Nearest Neighbors in the Upgrade Example
Figure 5.26 partial
Classification Using Nearest Neighbors in the Upgrade Example
Figure 5.27 partial
LO5-4: Interpret the information provided by naive Bayes’ classification.
5.4 Naive Bayes’ Classification
 Uses a “naive” version of Bayes’ Theorem to classify observations
 Full version of Bayes’ Theorem
 Naive version of Bayes’ Theorem
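The formulas for the two versions did not survive in this copy of the slides. As a hedged reconstruction in standard notation (the symbols are assumptions: classes C_j, predictor values x_1, ..., x_p), the full version gives the probability of class C_k given the predictor values, and the naive version replaces the joint conditional probability with a product, assuming the predictors are independent within each class.

```latex
% Full version of Bayes' Theorem
\[
P(C_k \mid x_1, \dots, x_p)
  = \frac{P(C_k)\, P(x_1, \dots, x_p \mid C_k)}
         {\sum_{j} P(C_j)\, P(x_1, \dots, x_p \mid C_j)}
\]

% Naive version: assumes P(x_1, ..., x_p | C_j) = \prod_i P(x_i | C_j)
\[
P(C_k \mid x_1, \dots, x_p)
  \approx \frac{P(C_k) \prod_{i=1}^{p} P(x_i \mid C_k)}
               {\sum_{j} P(C_j) \prod_{i=1}^{p} P(x_i \mid C_j)}
\]
```

An observation is then classified into the class with the largest estimated probability.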
LO5-5: Interpret the information provided by ensemble models.
5.5 An Introduction to Ensemble Estimates
 Ensemble Estimate: combines the estimates or predictions obtained from different analytics to arrive at an overall result
Table 5.3
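One simple way to combine estimates is to average them, optionally with weights. Averaging is an assumption here, since the slide does not say how the estimates are combined; the three probabilities below are made up and are not the values in Table 5.3.

```python
def ensemble_estimate(estimates, weights=None):
    """Weighted average of several estimates (equal weights by default)."""
    if weights is None:
        weights = [1] * len(estimates)
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Made-up upgrade probability estimates for one customer from three analytics.
estimates = {
    "classification tree": 0.70,
    "k-nearest neighbors": 0.60,
    "naive Bayes": 0.80,
}
overall = ensemble_estimate(list(estimates.values()))
```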