732A20 Data Mining and Statistical Learning
Department of Computer and Information Science

Computer lab 4: Linear classification methods

Learning objectives
The main objective of this computer lab is to make the student familiar with linear classification methods, in particular logistic regression and linear discriminant analysis. After completing the lab the student shall be able to:
(i) Undertake a logistic regression in SAS Enterprise Miner and a linear discriminant analysis using Proc DISCRIM in SAS
(ii) Explain the concept of odds ratio
(iii) Explain the concept of discriminant function

Recommended reading
Chapter 4.1-4.4 in Hastie et al.

Assignment 1: Prediction of customer behaviour using logistic regression
The data file crm.xls contains data retrieved from a database in a private enterprise. Each row contains information about one customer. The variable Y indicates whether the customer has made a single purchase (Y = 0) or two or more purchases (Y = 1). Other variables provide information about socioeconomic characteristics and previous transactions. Your task is to use the Regression node in SAS Enterprise Miner to derive a predictive model of Y.

Import data and draw a workflow diagram
Import crm.xls to SAS and check that the file has been correctly imported. Create a workflow diagram with an Input Data Source node, a Partition node, and a Regression node. Assign the imported data to the Input Data Source node and edit the Model role column so that you have one target variable and the predictors are classified as inputs.

Prepare for running the regression node
Use the Partition node to divide the data into a training set (70%) and a test set (30%). Then use the Model options tab to specify a logistic regression model.

Run the regression node and interpret the results
a) Fit a logistic regression model to your training data using all the variables you have classified as inputs.
Note what fraction of the customers in the test set have been misclassified with respect to the value of Y. Inspect the odds ratios and associated p-values to examine which of the explanatory variables seem to contribute the most to the classification of customers.
b) Select a few subsets of your input variables and repeat the model fitting and estimation of misclassification rate. How does the predictive power vary with the subset of inputs? Which of the tested logistic regression models has the highest predictive power?
c) Use forward selection to select a suitable logistic regression model. Compare the obtained misclassification rate with the results obtained in assignment 1b. Did forward selection provide a good prediction model?

Assignment 2: Classification of e-mails using linear discriminant analysis (LDA)
The data file spambase.xls contains information about the frequency of various words, characters, etc. for a total of 4601 e-mails. Furthermore, these e-mails have been classified as spam (spam = 1) or regular e-mail (spam = 0). Your task is to use LDA to develop a model that can be used as a spam filter. Detailed information about the mail attributes is given below.

Attribute information for spambase
The last column of 'spambase.xls' denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0). Most of the attributes indicate whether a particular word or character was frequently occurring in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. For the statistical measures of each attribute, see the end of this file.

Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e.
100 * (number of times the WORD appears in the e-mail) / total number of words in e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_longest = length of longest uninterrupted sequence of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_total = sum of length of uninterrupted sequences of capital letters = total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1), i.e. unsolicited commercial e-mail, or not (0).

Import data and write SAS code
Import the Excel file spambase.xls to SAS and make sure that the file has been correctly imported. Use the help file in SAS 9.1 to examine how you can employ proc DISCRIM to undertake a linear discriminant analysis. Write SAS code in which this procedure is employed to derive a classification model in which all available predictors are used.

Run proc DISCRIM and interpret the results
Fit an LDA model to your data using all the variables you have classified as inputs. Explain how the discriminant functions are used for the classification of e-mails.

Assignment 3: Classification of e-mails using linear discriminant analysis (LDA) and model validation by using training and test sets
Examine the help file in SAS 9.1 to find out how you can introduce a test set in proc DISCRIM. Split the data set in assignment 2 into two parts (70% training, 30% test).
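The mechanics behind a 70/30 split, the linear discriminant functions, and the test-set misclassification rate can be sketched outside SAS. The Python sketch below is only an illustration under stated assumptions: it uses synthetic two-class data with two predictors (not spambase.xls), all function names are hypothetical, and it is not a substitute for the proc DISCRIM code requested in the lab. It restricts itself to two predictors so that the pooled covariance matrix can be inverted by hand.

```python
import math
import random


def split_70_30(rows, seed=0):
    """Shuffle the rows and return (training 70%, test 30%)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_lda(train):
    """Fit LDA on rows (x1, x2, label) with a pooled covariance matrix S.

    For each class k the linear discriminant function is
        delta_k(x) = x' S^-1 mu_k - 0.5 * mu_k' S^-1 mu_k + log(pi_k),
    stored here as the coefficients (w1, w2, b).
    """
    labels = sorted({r[2] for r in train})
    n = len(train)
    means, priors = {}, {}
    sxx = sxy = syy = 0.0
    for k in labels:
        pts = [(r[0], r[1]) for r in train if r[2] == k]
        m1 = sum(p[0] for p in pts) / len(pts)
        m2 = sum(p[1] for p in pts) / len(pts)
        means[k] = (m1, m2)
        priors[k] = len(pts) / n  # class proportions used as priors
        for x1, x2 in pts:  # within-class sums of squares and products
            sxx += (x1 - m1) ** 2
            sxy += (x1 - m1) * (x2 - m2)
            syy += (x2 - m2) ** 2
    dof = n - len(labels)
    sxx, sxy, syy = sxx / dof, sxy / dof, syy / dof
    det = sxx * syy - sxy ** 2
    # Inverse of the symmetric 2x2 matrix [[sxx, sxy], [sxy, syy]]
    a, b, d = syy / det, -sxy / det, sxx / det
    model = {}
    for k in labels:
        m1, m2 = means[k]
        w1 = a * m1 + b * m2
        w2 = b * m1 + d * m2
        const = -0.5 * (w1 * m1 + w2 * m2) + math.log(priors[k])
        model[k] = (w1, w2, const)
    return model


def predict(model, x1, x2):
    """Assign x to the class whose discriminant function is largest."""
    return max(model, key=lambda k: model[k][0] * x1 + model[k][1] * x2 + model[k][2])


def misclassification_rate(model, rows):
    """Fraction of rows whose predicted class differs from the true class."""
    errors = sum(predict(model, r[0], r[1]) != r[2] for r in rows)
    return errors / len(rows)


# Synthetic data (an assumption for illustration): two Gaussian clouds.
rng = random.Random(1)
data = [(rng.gauss(0, 1), rng.gauss(0, 1), 0) for _ in range(200)]
data += [(rng.gauss(3, 1), rng.gauss(3, 1), 1) for _ in range(200)]

train, test = split_70_30(data, seed=42)
model = fit_lda(train)
test_error = misclassification_rate(model, test)
print("test misclassification rate:", test_error)
```

The model is fitted on the training rows only and the error is computed on the held-out rows, which is the same idea as supplying a separate test data set to the classification procedure.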
Write and run SAS code in which a test set option is introduced into proc DISCRIM.

To hand in
Assignment 1: Answers to tasks a, b and c
Assignment 2: Interpretation of the output from proc DISCRIM
Assignment 3: SAS code and edited outputs from a run of proc DISCRIM involving a test data set
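
Note on the odds ratio
As a reminder for learning objective (ii) and the odds ratios inspected in Assignment 1a, the standard definitions can be written compactly. The symbols p(x), beta_j and x_j are notation assumed here for illustration; they follow the usual logistic regression convention and are not taken from the SAS output.

```latex
\[
\operatorname{odds}(Y = 1 \mid x) = \frac{p(x)}{1 - p(x)}, \qquad
\log \frac{p(x)}{1 - p(x)} = \beta_0 + \sum_{j=1}^{k} \beta_j x_j, \qquad
\mathrm{OR}_j = e^{\beta_j}
\]
```

Here OR_j is the factor by which the odds of Y = 1 are multiplied when x_j increases by one unit while the other inputs are held fixed. A value of 1 corresponds to no change in the odds.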