R: 2/6/15
KNN Homework Problems
For all models conducted in these homework problems, select Fix random sequence.
Problem 7.1
Download the following two files: data: UniversalBank.csv; Ch7hwStudent.xlsx
Read the problem description at the top of the problem on page 146 of the textbook
but answer the questions below instead of the questions shown in the textbook. In
this problem, you will build a series of classifiers, that is, algorithms to make
categorical predictions.
This is a KNN classification problem (predicting a category); do not use the KNN
under the regression option.
Data Description
Field Name           Field Description
ID                   Customer ID
ZIPCode              Home Address ZIP code.
Age                  Customer's age
Experience           Number of years of professional experience
Education            Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
Family               Customer's family size
Income               Annual income of the customer ($000)
CCAvg                Avg. spending on credit cards from all banks combined per month ($000)
Mortgage             Value of house mortgage if any. ($000)
CreditCard           Does the customer use a credit card issued by UniversalBank?
yPersonal Loan       Did this customer accept the personal loan offered in the last campaign?
Securities Account   Does the customer have a securities account with the bank?
CD Account           Does the customer have a certificate of deposit (CD) account with the bank?
Online               Does the customer use internet banking facilities?
Data Preparation: Prepare the dataset for modeling by removing ID and ZIPCode.
Partition the data into 60% training and 40% validation. If there are collinear input
variables (IVs) where the correlation between the IVs exceeds ± .75, retain only the
one most strongly correlated with the outcome variable.
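The collinearity rule above can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (pearson, drop_collinear) and the dict-of-lists data layout are hypothetical, the values are toy numbers rather than rows from UniversalBank.csv, and the modeling tool used in class may handle this step differently.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric lists.

    Assumes neither list is constant (a constant list has zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def drop_collinear(columns, outcome, threshold=0.75):
    """Return the input names to keep.

    columns: dict mapping input name -> list of values (hypothetical layout).
    outcome: list of outcome values, one per row.
    For each pair of inputs with |r| > threshold, the input that is less
    strongly correlated with the outcome is dropped.
    """
    strength = {name: abs(pearson(vals, outcome)) for name, vals in columns.items()}
    names = list(columns)
    kept = list(columns)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if a in kept and b in kept and abs(pearson(columns[a], columns[b])) > threshold:
                kept.remove(a if strength[a] < strength[b] else b)
    return kept


# Toy data: x1 and x2 are nearly collinear, x3 is unrelated to both.
toy_inputs = {
    "x1": [1, 2, 3, 4, 5, 6],
    "x2": [2, 4, 6, 8, 10, 13],
    "x3": [5, 1, 4, 2, 6, 3],
}
toy_outcome = [0, 0, 0, 0, 0, 1]
print(drop_collinear(toy_inputs, toy_outcome))
```

In the toy data, x2 is slightly more correlated with the outcome than x1, so x1 is the one that is dropped.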
Where to put your answers: For each of the questions below, fill in the classification
matrix cells and the AUC values highlighted in blue in the Ch7hwStudent.xlsx
file. Record the rest of your answer in this document.
a. A classifier that follows the Naïve rule assigns every observation to
the most frequently occurring class. What PersonalLoan class would this
default classifier pick, and what would its error rate be? Would the errors
be made up of false positives, false negatives, or both? Explain. (Note: If a
classifier does not perform better than the Naïve rule classifier, then it is
of no value.)
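As a minimal sketch of how the Naïve rule works, the function below finds the majority class in a list of labels and the error rate of always predicting it. The name naive_rule is hypothetical, and the labels are toy values, not counts from UniversalBank.csv.

```python
from collections import Counter


def naive_rule(labels):
    """Return (majority_class, error_rate) for a list of class labels."""
    counts = Counter(labels)
    majority_class, majority_count = counts.most_common(1)[0]
    # Every observation outside the majority class is misclassified.
    error_rate = 1 - majority_count / len(labels)
    return majority_class, error_rate


# Toy labels (made up for illustration): eight 0s and two 1s.
toy_labels = [0] * 8 + [1] * 2
print(naive_rule(toy_labels))
```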
The table below summarizes the KNN algorithm settings for each of the following
KNN portions of this problem. Use these settings to create a KNN classifier for each
problem shown in the table. Additional directions are provided as needed.
KNN Algorithm Settings
Problem   Max K   Observation Weights    Attribute Weights
7.1.b     15      Equal                  None
7.1.c     15      Relative to distance   None
7.1.d     15      Relative to distance   Based on Correlation matrix
7.1.f     15      Relative to distance   Based on best predictor scores found in 7.1.e
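The difference between the two Observation Weights options can be illustrated with a small sketch. The function name knn_classify and the 1/distance vote formula are assumptions made for illustration; the modeling tool may weight neighbors with a different formula.

```python
from collections import defaultdict
from math import dist


def knn_classify(train_x, train_y, query, k, weighting="equal"):
    """Classify `query` by a vote among its k nearest training points.

    weighting="equal":    every neighbor casts one vote.
    weighting="distance": each neighbor's vote is 1 / distance
                          (an assumed formula, not necessarily the tool's).
    """
    neighbors = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    votes = defaultdict(float)
    for point, label in neighbors:
        if weighting == "distance":
            # Small constant avoids division by zero for exact matches.
            votes[label] += 1.0 / (dist(point, query) + 1e-9)
        else:
            votes[label] += 1.0
    return max(votes, key=votes.get)


# Toy data: one very close neighbor of class 1, two farther ones of class 0.
toy_x = [(0.5, 0.0), (3.0, 0.0), (0.0, 3.0)]
toy_y = [1, 0, 0]
print(knn_classify(toy_x, toy_y, (0.0, 0.0), k=3, weighting="equal"))
print(knn_classify(toy_x, toy_y, (0.0, 0.0), k=3, weighting="distance"))
```

With equal weights the two class-0 neighbors outvote the single class-1 neighbor; with distance weights the very close class-1 neighbor carries the vote.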
b. Create the classifier. Based on the results, is the classifier more likely to
misclassify an actual positive (1) or actual negative (0)?
c. Create the classifier. Did setting the model to Relative to distance
substantially improve model performance?
d. Create a correlation matrix to evaluate inputs most correlated with the
response variable yPersonalLoan. Use your judgment to set attribute
weights in the KNN model specification according to information gleaned
from the correlation matrix. Insert a screen capture of your attribute
weights. Create the classifier. Did the weighting of attributes improve this
classifier’s performance over the previous two classifiers?
e. Build a classifier using the LogRegression modeler. (Note: Even though the
Log Regression modeler has not yet been introduced, you can create a
LogRegression model by dragging that modeler icon onto the partitioned
dataset.) How does the classification performance of the LogRegression
model compare to the previously built KNN models? Drag and drop the
model to a display and select "Predictor Scores." Predictor scores show how
influential variables were in helping a model predict outcomes. Insert a
screen capture of the predictor scores in this document.
f. Use the predictor scores of the Log Regression model produced in (e) to set
attribute weights. Insert a screen capture of your attribute weights.
g. Compare the performance of the resulting classifier to the classifier in
problem (d) that used correlations to set attribute weights. Explain how they
are different.
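Attribute weights change how much each input contributes to the distance between two observations. The sketch below shows one possible way to do this: a weighted Euclidean distance, plus one hypothetical scheme for turning correlations into weights (absolute values scaled to sum to 1). Both function names and the scaling scheme are assumptions; the problems above ask you to use your own judgment, and the modeling tool may apply weights differently.

```python
from math import sqrt


def weighted_distance(a, b, weights):
    """Euclidean distance in which attribute i's squared difference is
    multiplied by weights[i]. A weight of 0 ignores that attribute."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def weights_from_correlations(correlations):
    """One hypothetical scheme: absolute correlations scaled to sum to 1.

    correlations: dict mapping input name -> correlation with the outcome.
    """
    total = sum(abs(r) for r in correlations.values())
    return {name: abs(r) / total for name, r in correlations.items()}


# Toy values, made up for illustration.
print(weighted_distance((0, 0), (3, 4), (1, 1)))  # both attributes count
print(weighted_distance((0, 0), (3, 4), (1, 0)))  # second attribute ignored
print(weights_from_correlations({"a": 0.5, "b": -0.25, "c": 0.25}))
```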
Problem 7.2
Download the following data file: data: BostonHousing.csv
Read the problem description at the top of the problem on page 146 of the textbook
but answer the questions below instead of the questions shown in the textbook.
This is a KNN regression problem (predicting a continuous number value); do not
use the K-NN under the classification option.
Data Description
Field Name         Expected Min   Expected Max   Description
IDNum              1              506            Unique ID for each record
ByRiver            0              1              Charles River dummy variable (1 if tract bounds river; 0 otherwise)
CrimeRate          0              100            per capita crime rate by town
DistToWork         1              15             weighted distances to five Boston employment centres
HiwayAccess        1              25             index of accessibility to radial highways
IndustrialPerc     5              30             proportion of non-retail business acres per town.
LargeLotsPerc      0              100            proportion of residential land zoned for lots over 25,000 sq.ft.
LowIncomePerc      1              40             % lower income status of the population
NoxAirPerc         0              1              nitric oxides concentration (parts per 10 million)
OldHomePerc        0              100            proportion of owner-occupied units built prior to 1940
ProptyTaxRate      100            800            full-value property-tax rate per $10,000
PupilsPerTeach     10             25             pupil-teacher ratio by town
RoomsPerHome       2              10             average number of rooms per dwelling
yHomeMedVal        3              150            Median value of owner-occupied homes in $1000
yHomeMedValOv100   0              1              Median value of owner-occupied homes > $100K (1 = Yes, 2 = No)
Data Preparation: Prepare the dataset for modeling by removing any unnecessary
input columns (be sure to remove yHomeMedValOver100). Remove collinear input
variables (IVs) where correlation between the IVs exceeds ± .75. Partition the data
into 60% training and 40% validation.
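A reproducible 60/40 partition can be sketched as follows. The function name partition and the seed value are hypothetical, and treating a fixed seed as the equivalent of the tool's Fix random sequence option is an assumption.

```python
import random


def partition(rows, train_fraction=0.6, seed=12345):
    """Shuffle a copy of `rows` with a fixed seed, then split it.

    Returns (training_rows, validation_rows). The same seed always
    produces the same split, which keeps results repeatable.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Toy example: ten row IDs instead of a real dataset.
toy_rows = list(range(10))
train, valid = partition(toy_rows)
print(len(train), len(valid))
```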
Where to put your answers: For each of the questions below, fill in the spreadsheet
cells (Best K, RMSE, and R²) in the Ch7hwAns.xlsx file. Record the rest of your
answer in this document.
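For reference, RMSE and R² can be computed from actual and predicted values as in the sketch below. It uses the common definitions (root of the mean squared error, and 1 minus the ratio of residual to total sum of squares); the function names are hypothetical, the numbers are toy values, and the modeling tool may compute R² in a slightly different way.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def r_squared(actual, predicted):
    """1 - SS_residual / SS_total, where SS_total is measured around
    the mean of the actual values."""
    mean_a = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_a) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Toy values, made up for illustration.
toy_actual = [1, 2, 3, 4]
toy_predicted = [1, 2, 3, 6]
print(rmse(toy_actual, toy_predicted))
print(r_squared(toy_actual, toy_predicted))
```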
The table below summarizes the KNN algorithm settings for specific problems.
Additional directions are provided as needed.
KNN Algorithm Settings
Problem   Max K   Observation Weights    Attribute Weights
7.2.b     15      Equal                  None
7.2.c     15      Relative to distance   None
7.2.d     15      Relative to distance   Based on Correlation matrix
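In KNN regression, the prediction is an average of the neighbors' outcome values rather than a vote. The sketch below contrasts the Equal and Relative to distance settings from the table. The function name knn_regress and the 1/distance weighting formula are assumptions for illustration and may not match the modeling tool's calculation.

```python
from math import dist


def knn_regress(train_x, train_y, query, k, weighting="equal"):
    """Predict a numeric value for `query` from its k nearest neighbors.

    weighting="equal":    plain average of the neighbors' values.
    weighting="distance": average weighted by 1 / distance
                          (an assumed formula, not necessarily the tool's).
    """
    neighbors = sorted(zip(train_x, train_y), key=lambda p: dist(p[0], query))[:k]
    if weighting == "distance":
        # Small constant avoids division by zero for exact matches.
        weights = [1.0 / (dist(x, query) + 1e-9) for x, _ in neighbors]
    else:
        weights = [1.0] * len(neighbors)
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)


# Toy data with a single input, made up for illustration.
toy_x = [(1.0,), (2.0,), (4.0,)]
toy_y = [10.0, 20.0, 40.0]
print(knn_regress(toy_x, toy_y, (0.0,), k=2, weighting="equal"))
print(knn_regress(toy_x, toy_y, (0.0,), k=2, weighting="distance"))
```

With distance weighting, the prediction is pulled toward the value of the closer neighbor.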
a. In addition to yHomeMedValOver100, what other columns did you remove?
Why?
For each problem below, build the model as directed and fill in the results in this
table. Also answer the questions. Note: To get KNN Regression results, mouse over
the model.
b. Build the KNN regression model as specified in the KNN Algorithm Settings
table.
c. Build a second KNN regression model as specified in the KNN Algorithm
Settings table.
d. Use the correlation matrix to evaluate inputs most correlated with the
response variable yHomeMedVal. Build a third KNN regression model with
attribute weights set according to information gleaned from the correlation
matrix. Save a screen capture of your attribute weights. Did the weighting of
attributes improve KNN regression performance over the previous two models?
e. Build a multiple linear regression model using the same variables. Compare
its performance to that of the three KNN models just built.
f. Remove any MLR predictors whose coefficients are not statistically significant
at the .05 level and rerun the MLR model. Does the best MLR model perform as well
as the best KNN model on this problem?
g. Read and answer question 7.2e from the textbook on page 147.