R: 2/6/15 KNN Homework Problems For all models conducted in these homework problems, select Fix random sequence. Problem 7.1 Download the following two files: data: UniversalBank.csv; Ch7hwStudent.xlsx Read the problem description at the top of the problem on page 146 of the textbook but answer the questions below instead of the questions shown in the textbook. In this problem, you will build a series of classifiers, that is, algorithms to make categorical predictions. This is a KNN classification problem (predicting a category); do not use the KNN under the regression option. Data Description Field Name ID ZIPCode Age Experience Education Family Income CCAvg Mortgage CreditCard yPersonal Loan Securities Account CD Account Online Field Description Customer ID Home Address ZIP code. Customer's age Number of years of professional experience Education Level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional Customer's family size Annual income of the customer ($000) Avg. spending on credit cards from all banks combined per month ($000) Value of house mortgage if any. ($000) Does the customer use a credit card issued by UniversalBank? Did this customer accept the personal loan offered in the last campaign? Does the customer have a securities account with the bank? Does the customer have a certificate of deposit (CD) account with the bank? Does the customer use internet banking facilities? Data Preparation: Prepare the dataset for modeling by removing ID and ZIPCode. Partition the data into 60% training and 40% validation. If there are collinear input variables (IVs) where the correlation between the IVs exceeds ± .75, retain only the one most strongly correlated with the outcome variable. Where to put your answers: For each of questions below, fill in the classification matrix related cells and the AUC values highlighted in blue in the Ch7hwStudent.xlsx file. Record the rest of your answer in this document. a. A classifier that follows the Naïve rule always predicts all observation to be the most frequently occurring class. What PersonalLoan class would this default classifier pick and what would be the error rate? Would the errors be made up of false positive, false negatives, or both? Explain. (Note: If the classifier built does not perform better than the Naïve rule classifier then it is of no value.) The table below summarizes the KNN algorithm settings for each of the following KNN portions of this problem. Use these settings to create a KNN classifier for each problem shown in the table. Additional directions are provided as needed. KNN Algorithm Settings 7.1.b Max K 15 7.1.c 15 7.1.d 15 Relative to distance Based on Correlation matrix 7.1.d 15 Relative to distance Based on best predictor scores found in 7.1.e Problem Observation Weights Equal Relative to distance Attribute Weights None None b. Create the classifier. Based on the results, is the classifier more likely to misclassify an actual positive (1) or actual negative (0)? c. Create the classifier. Did setting the model to Relative to distance substantially improve model performance? d. Create a correlation matrix to evaluate inputs most correlated with the response variable yPersonalLoan. Use your judgment to set attribute weights in the KNN model specification according to information gleaned from the correlation matrix. Insert a screen capture of your attribute weights. Create the classifier. Did the weighting of attributes improve this classifier’s performance over the previous two classifiers? e. Build a classifier using the LogRegression modeler. (Note: Even though the Log Regression modeler has not yet been introduced, you can create a LogRegression model by dragging that modeler icon onto the partitioned dataset.) How does the classification performance of the LogRegression model compare to the previously built KNN models? Drag and drop the 2 model to a display and select "Predictor Scores." Predictor scores show how influential variables were in helping a model predict outcomes. Insert a screen capture of the predictor scores in this document. f. Use the predictor scores of the Log Regression model produced in (e) to set attribute weights. Insert a screen capture of your attribute weights. g. Compare the performance of the resulting classifier to the classifier in problem (d) that used correlations to set attribute weights. Explain how they are different. Problem 7.2 Download the following data file: data: BostonHousing.csv Read the problem description at the top of the problem on page 146 of the textbook but answer the questions below instead of the questions shown in the textbook. This is a KNN regression problem (predicting a continuous number value); do not use the K-NN under the classification option. Data Description Expected Min Max 1 506 0 1 0 100 1 15 1 25 5 30 0 100 1 40 0 1 0 100 100 800 10 25 2 10 3 150 0 1 Field Name IDNum ByRiver CrimeRate DistToWork HiwayAccess IndustrialPerc LargeLotsPerc LowIncomePerc NoxAirPerc OldHomePerc ProptyTaxRate PupilsPerTeach RoomsPerHome yHomeMedVal yHomeMedValOv100 Description Unique ID for each record Charles River dummy variable (1 if tract bounds river; 0 otherwise) per capita crime rate by town weighted distances to five Boston employment centres index of accessibility to radial highways proportion of non-retail business acres per town. proportion of residential land zoned for lots over 25,000 sq.ft. % lower income status of the population nitric oxides concentration (parts per 10 million) proportion of owner-occupied units built prior to 1940 full-value property-tax rate per $10,000 pupil-teacher ratio by town average number of rooms per dwelling Median value of owner-occupied homes in $1000 Median value of owner-occupied homes > $100K (1 = Yes, 2 = No) Data Preparation: Prepare the dataset for modeling by removing any unnecessary input columns (be sure to remove yHomeMedValOver100). Remove collinear input 3 variables (IVs) where correlation between the IVs exceeds ± .75. Partition the data into 60% training and 40% validation. Where to put your answers: For each of questions below, fill in the spreadsheet cells (Best K, RMSE, and R2) in the Ch7hwAns.xlsx file. Record the rest of your answer in this document. The table below summarizes the KNN algorithm settings for specific problems. Additional directions are provided as needed. KNN Algorithm Settings 7.2.b Max K 15 7.2.c 15 7.2.d 15 Problem Observation Weights Equal Relative to distance Relative to distance Attribute Weights None None Based on Correlation matrix a. In addition to yHomeMedValOver100, what other columns did you remove? Why? For each problem below, build the model as directed and fill in the results in this table. Also answer the questions. Note: To get KNN Regression results, mouse over the model. b. Build the KNN regression model as specified in the KNN Algorithm Settings table. c. Build a second KNN regression model as specified in the KNN Algorithm Settings table. d. Use the correlation matrix to evaluate inputs most correlated with the response variable yHomemedval. Build a third KNN regression with attribute weights set according to information gleaned from the correlation matrix. Save a screen capture of your attribute weights. Did the weighting of attributes improve KNN regression performance over the previous two? e. Build a multiple linear regression model using the same variables. Compare its performance to that of the three KNN models just built. 4 f. Remove any MLR coefficients that are not statistically significant at .05 and rerun the MLR model. Does the best MLR model perform as well as the best KNN model on this problem? g. Read and answer question 7.2e from the textbook on page 147. 5