Computer lab 4: Linear classification methods

732A20 Data Mining and Statistical Learning
Department of Computer and Information Science
Learning objectives
The main objective of this computer lab is to make the student familiar with linear
classification methods, in particular logistic regression and linear discriminant analysis.
After completing the lab the student shall be able to:
(i) Undertake a logistic regression in SAS Enterprise Miner and a linear
discriminant analysis using Proc DISCRIM in SAS
(ii) Explain the concept of odds ratio
(iii) Explain the concept of discriminant function
Recommended reading
Sections 4.1-4.4 in Hastie et al.
Assignment 1: Prediction of customer behaviour using logistic
regression
The data file crm.xls contains data retrieved from a database in a private enterprise. Each
row contains information about one customer. The variable Y indicates whether the
customer has made a single purchase (Y = 0) or two or more purchases (Y = 1). Other
variables provide information about socioeconomic characteristics and previous
transactions. Your task is to use the Regression node in SAS Enterprise Miner to derive a
predictive model of Y.
Import data and draw a workflow diagram
Import crm.xls to SAS and check that the file has been correctly imported.
Create a workflow diagram with an Input Data Source node, a Partition node, and a
Regression node. Assign the imported data to the Input Data Source node and edit the
Model role column so that you have one target variable and the predictors are classified
as inputs.
Prepare for running the regression node
Use the Partition node to divide the data into a training set (70%) and a test set (30%).
Then use the Model options tab to specify a logistic regression model.
Run the regression node and interpret the results
a) Fit a logistic regression model to your training data using all the variables you
have classified as inputs. Note what fraction of the customers in the test set
have been misclassified with respect to the value of Y. Inspect odds ratios and
associated p-values to examine which of the explanatory variables seem to
contribute the most to the classification of customers.
b) Select a few subsets of your input variables and repeat the model fitting and
estimation of misclassification rate. How does the predictive power vary with the
subset of inputs? Which of the tested logistic regression models has the highest
predictive power?
c) Use Forward selection to select a suitable logistic regression model. Compare the
obtained misclassification rate with the results obtained in assignment 1b. Did
forward selection provide a good prediction model?
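The quantities inspected in tasks a to c can be illustrated outside SAS. The following Python sketch is not Enterprise Miner output; it only shows, under stated assumptions, how an odds ratio follows from a logistic regression coefficient and how a misclassification rate is computed from actual and predicted classes. All function names and numeric values are hypothetical and are not taken from crm.xls.

```python
import math


def odds_ratio(coefficient, delta=1.0):
    """Factor by which the odds of Y = 1 change when an input increases by delta.

    In a logistic regression, log(odds) is linear in the inputs, so the
    odds ratio for a change of delta in one input is exp(coefficient * delta).
    """
    return math.exp(coefficient * delta)


def predicted_probability(intercept, coefficients, inputs):
    """P(Y = 1 | inputs) under a fitted logistic regression model."""
    eta = intercept + sum(b * x for b, x in zip(coefficients, inputs))
    return 1.0 / (1.0 + math.exp(-eta))


def misclassification_rate(actual, predicted):
    """Fraction of cases whose predicted class differs from the actual class."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equally long")
    errors = sum(1 for a, p in zip(actual, predicted) if a != p)
    return errors / len(actual)


# Hypothetical coefficient for one input (an assumption, not a fitted value).
example_coefficient = 0.4
print("odds ratio per unit increase:", odds_ratio(example_coefficient))

# Classify as Y = 1 when the predicted probability exceeds 0.5 (a common default cut-off).
p = predicted_probability(-1.0, [example_coefficient], [3.0])
print("predicted class:", 1 if p > 0.5 else 0)
```

An odds ratio above 1 means that the odds of two or more purchases increase with the input, an odds ratio below 1 means that they decrease, and an odds ratio close to 1 suggests that the input contributes little.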
Assignment 2: Classification of e-mails using linear discriminant
analysis (LDA)
The data file spambase.xls contains information about the frequency of various words,
characters, etc., for a total of 4601 e-mails. Furthermore, these e-mails have been classified
as spam (spam = 1) or regular e-mails (spam = 0). Your task is to use LDA to develop a
model that can be used as a spam filter. Detailed information about the mail attributes is
given below.
Attribute Information for spambase:
The last column of 'spambase.xls' denotes whether the e-mail was
considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Most of the attributes indicate whether a particular word or
character was frequently occurring in the e-mail. The run-length
attributes (55-57) measure the length of sequences of consecutive
capital letters. For the statistical measures of each attribute,
see the end of this file. Here are the definitions of the attributes:
48 continuous real [0,100] attributes of type word_freq_WORD
= percentage of words in the e-mail that match WORD,
i.e. 100 * (number of times the WORD appears in the e-mail) /
total number of words in e-mail. A "word" in this case is any
string of alphanumeric characters bounded by non-alphanumeric
characters or end-of-string.
6 continuous real [0,100] attributes of type char_freq_CHAR
= percentage of characters in the e-mail that match CHAR,
i.e. 100 * (number of CHAR occurrences) / total characters in e-mail
1 continuous real [1,...] attribute of type capital_run_length_average
= average length of uninterrupted sequences of capital letters
1 continuous integer [1,...] attribute of type
capital_run_length_longest
= length of longest uninterrupted sequence of capital letters
1 continuous integer [1,...] attribute of type capital_run_length_total
= sum of length of uninterrupted sequences of capital letters
= total number of capital letters in the e-mail
1 nominal {0,1} class attribute of type spam
= denotes whether the e-mail was considered spam (1) or not (0),
i.e. unsolicited commercial e-mail.
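To make the attribute definitions above concrete, the following Python sketch computes the three kinds of attributes for a single text. It is an illustration only, not the code that was used to build spambase.xls. The function names are hypothetical, and two choices are assumptions that the definitions do not settle: word matching is done case-insensitively, and a text without capital letters gets run-length values of 0.

```python
import re


def word_freq(text, word):
    """Percentage of words in the text that match `word`.

    A word is any string of alphanumeric characters bounded by
    non-alphanumeric characters or end-of-string.
    Case-insensitive matching is an assumption.
    """
    words = re.findall(r"[A-Za-z0-9]+", text)
    if not words:
        return 0.0
    matches = sum(1 for w in words if w.lower() == word.lower())
    return 100.0 * matches / len(words)


def char_freq(text, char):
    """Percentage of characters in the text that match `char`."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


def capital_run_lengths(text):
    """Average, longest and total length of uninterrupted runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:
        # Assumption: no capital letters gives zeros.
        return {"average": 0.0, "longest": 0, "total": 0}
    return {
        "average": sum(runs) / len(runs),
        "longest": max(runs),
        "total": sum(runs),
    }


sample = "FREE money!!! Call NOW, free offer"
print(word_freq(sample, "free"))
print(char_freq(sample, "!"))
print(capital_run_lengths(sample))
```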
Import data and write SAS code
Import the Excel file spambase.xls to SAS and make sure that the file has been correctly
imported. Use the help file in SAS 9.1 to examine how you can employ proc DISCRIM
to undertake a linear discriminant analysis. Write SAS code in which this procedure is
employed to derive a classification model in which all available predictors are used.
Run proc DISCRIM and interpret the results
Fit an LDA model to your data using all the variables you have classified as inputs.
Explain how the discriminant functions are used for the classification of e-mails.
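As a reminder of the underlying idea, LDA assigns an observation x to the class k with the largest linear discriminant function, which in the notation of Hastie et al. (Section 4.3) is delta_k(x) = x' S^-1 m_k - (1/2) m_k' S^-1 m_k + log(p_k), where m_k is the class mean, S the pooled covariance matrix and p_k the prior probability of class k. The Python sketch below evaluates this rule for two inputs only. The means, covariance matrix and priors are made-up values, not estimates from spambase.xls, and the function names are hypothetical. The layout of the discriminant functions printed by proc DISCRIM may differ, so compare with your own output.

```python
import math


def invert_2x2(m):
    """Inverse of a 2 x 2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def mat_vec(m, v):
    """Matrix-vector product."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def dot(u, v):
    """Inner product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def discriminant(x, mean, cov_inv, prior):
    """Linear discriminant function delta_k(x) for one class."""
    w = mat_vec(cov_inv, mean)  # S^-1 m_k, the coefficient vector
    return dot(x, w) - 0.5 * dot(mean, w) + math.log(prior)


def classify(x, means, cov, priors):
    """Return the class with the largest discriminant score, and all scores."""
    cov_inv = invert_2x2(cov)
    scores = {
        label: discriminant(x, means[label], cov_inv, priors[label])
        for label in means
    }
    return max(scores, key=scores.get), scores


# Made-up parameters for two classes (0 = regular e-mail, 1 = spam).
means = {0: [0.0, 0.0], 1: [2.0, 2.0]}
cov = [[1.0, 0.0], [0.0, 1.0]]
priors = {0: 0.6, 1: 0.4}

label, scores = classify([1.5, 1.5], means, cov, priors)
print(label, scores)
```

Because the covariance matrix is shared by the classes, the scores are linear in x, and the boundary between the two classes is a straight line (a hyperplane when there are more inputs).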
Assignment 3: Classification of e-mails using linear discriminant
analysis (LDA) and model validation by using training and test
sets
Examine the help file in SAS 9.1 to find out how you can introduce a test set in proc
DISCRIM.
Split the data set in assignment 2 into two parts (70% training, 30% test). Write and run
SAS code in which a test set option is introduced into proc DISCRIM.
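The logic of the split and of the evaluation on the test set can be sketched independently of SAS. The Python code below is an illustration under stated assumptions, not a replacement for the SAS code to be handed in: it shuffles the rows with a fixed seed, keeps 70% for training, and cross-tabulates actual against predicted classes in the way a classification summary does. The function names and the seed are hypothetical.

```python
import random


def split_train_test(rows, train_fraction=0.7, seed=12345):
    """Randomly split rows into a training part and a test part.

    A fixed seed makes the split reproducible between runs.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


def cross_tabulate(actual, predicted):
    """Count cases for each (actual class, predicted class) pair, classes 0 and 1."""
    table = {(a, p): 0 for a in (0, 1) for p in (0, 1)}
    for a, p in zip(actual, predicted):
        table[(a, p)] += 1
    return table


# The spambase data has 4601 rows; row indices stand in for the e-mails here.
train, test = split_train_test(range(4601))
print(len(train), len(test))

# Made-up actual and predicted classes for four test e-mails.
print(cross_tabulate([0, 0, 1, 1], [0, 1, 1, 1]))
```

The error rate on the test set is usually somewhat higher than the error rate on the training set, since the model was fitted to the training data.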
To hand in
Assignment 1: Answers to tasks a, b and c
Assignment 2: Interpretation of the output from proc DISCRIM
Assignment 3: SAS code and edited outputs from a run of proc DISCRIM involving a
test data set