STAT 425 – Modern Methods of Data Analysis (37 pts.)
Assignment 11 – Naïve Bayes Classifiers, Logistic Regression, and
Neural Networks, LDA, QDA, and RDA
PROBLEM 1 –– CLEVELAND HEART DISEASE STUDY
The goal here is to predict heart disease status and possibly severity of heart disease using
demographic and diagnostic information on the patients. The variable descriptions are given
below.
Variable Name   Description
age             age (yrs.)
gender          gender (male or female)
cp              chest pain type
                -- typical angina (angina)
                -- atypical angina (abang)
                -- non-anginal pain (notang)
                -- asymptomatic (asympt)
trestbps        resting blood pressure (in mm Hg on admission to the hospital)
chol            serum cholesterol level in mg/dl
fbs             fasting blood sugar > 120 mg/dl (true=true or fal=false)
restecg         resting electrocardiographic results
                -- normal (norm)
                -- having ST-T wave abnormality (abn)
                -- showing probable or definite left ventricular hypertrophy by Estes' criteria (hyp)
thalach         maximum heart rate achieved
exang           exercise induced angina (true=true or fal=false)
oldpeak         ST depression induced by exercise relative to rest
slope           the slope of the peak exercise ST segment
                -- upsloping (up)
                -- flat (flat)
                -- downsloping (down)
ca              number of major vessels (0-3) colored by fluoroscopy
thal            normal (norm), fixed defect (fix), reversible defect (rev)

Responses:
diag = sick or buff (healthy)
grp  = H (healthy), S1, S2, S3, S4 (higher number = more sick)
> head(Cleveland)
  age gender     cp trestbps chol  fbs restecg thalach exang oldpeak slope ca thal diag grp
1  63   male angina      145  233 true     hyp     150   fal     2.3  down  0  fix buff   H
2  67   male asympt      160  286  fal     hyp     108  true     1.5  flat  3 norm sick  S2
3  67   male asympt      120  229  fal     hyp     129  true     2.6  flat  2  rev sick  S1
4  37   male notang      130  250  fal    norm     187   fal     3.5  down  0 norm buff   H
5  41    fem abnang      130  204  fal     hyp     172   fal     1.4    up  0 norm buff   H
6  56   male abnang      120  236  fal    norm     178   fal     0.8    up  0 norm buff   H
a) The file cleveland.txt in the Share folder contains the raw data in comma-delimited
format. Read these data into a data frame called Cleveland in R. Next, form one data
frame called Cleve.diag, which drops the variable grp from the original database, and
one data frame called Cleve.grp, which drops the variable diag from the original
database. The data frame Cleve.diag will be used to predict heart disease status (sick
or buff (healthy)), while the data frame Cleve.grp can be used to predict heart disease
status and severity (H or S1, S2, S3, S4). (2 pts.)
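A minimal sketch for part (a), assuming the file has a header row and that you browse to cleveland.txt with file.choose() as is done for the satellite data later in this assignment:

> Cleveland  = read.table(file.choose(), header=T, sep=",")   # comma-delimited raw data
> Cleve.diag = subset(Cleveland, select=-grp)                 # keep diag as the response
> Cleve.grp  = subset(Cleveland, select=-diag)                # keep grp as the response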
b) Use logistic regression to predict heart disease status (buff or sick) using the
Cleve.diag data. What is the APER based on your final model? You do not need to
use cross-validation for this estimate. (3 pts.)
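One possible sketch for part (b), assuming diag was read in as a factor with levels buff and sick (so the fitted probabilities are P(sick)); the object name fit.diag and the 0.5 cutoff are illustrative:

> fit.diag = glm(diag ~ ., data=Cleve.diag, family=binomial)
> phat = predict(fit.diag, type="response")        # fitted P(sick) for each patient
> pred = ifelse(phat > 0.5, "sick", "buff")
> mean(pred != Cleve.diag$diag)                    # APER = apparent (training) error rate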
c) From your model in part (b) what are the most important factors in determining the heart
disease status of a patient? Justify your answer. (3 pts.)
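For part (c), the coefficient tests on the fitted model are one way to judge importance (fit.diag is the illustrative model object from the sketch above):

> summary(fit.diag)               # Wald z-tests for the individual coefficients
> drop1(fit.diag, test="Chisq")   # likelihood-ratio test for dropping each term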
d) Using the last 50 observations in the Cleve.diag data set as a test set, estimate the APER
for logistic regression by fitting a model to the first 246 observations. Code to create the
test and training sets is given below.
> Cleve.test = Cleve.diag[247:296,]
> Cleve.train = Cleve.diag[1:246,]
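A sketch of the rest of part (d), refitting the logistic model on Cleve.train and scoring Cleve.test (same cutoff and factor-level assumptions as above):

> fit.train = glm(diag ~ ., data=Cleve.train, family=binomial)
> phat.test = predict(fit.train, newdata=Cleve.test, type="response")
> pred.test = ifelse(phat.test > 0.5, "sick", "buff")
> mean(pred.test != Cleve.test$diag)               # APER estimated from the 50 test cases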
e) Again using the last 50 observations as a test set, develop a neural network model using the
training set. Try different size neural networks and choose what you think is best. What is
the APER for predicting the test cases for your final neural network model? (3 pts.)
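A minimal sketch for part (e) using the nnet package; size and decay are illustrative values you would tune, and because the starting weights are random the results change from run to run:

> library(nnet)
> set.seed(1)                                      # nnet starts from random weights
> fit.nn = nnet(diag ~ ., data=Cleve.train, size=5, decay=0.01, maxit=1000)
> pred.nn = predict(fit.nn, newdata=Cleve.test, type="class")
> mean(pred.nn != Cleve.test$diag)                 # test-case APER for this network size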
f) Use a naïve Bayes classifier (from the e1071 package) to develop a prediction rule using the
training data set and predict the test cases. What is the APER based on your test case predictions?
(3 pts.)
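A sketch for part (f) with the naiveBayes() classifier in e1071:

> library(e1071)
> fit.nb = naiveBayes(diag ~ ., data=Cleve.train)
> pred.nb = predict(fit.nb, newdata=Cleve.test)    # predicted class labels for the test cases
> mean(pred.nb != Cleve.test$diag)                 # test-case APER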
PROBLEM 2 –– SATELLITE IMAGE DATA
The goal here is to predict the type of ground cover from a satellite image broken up into pixels.
Description from UCI Machine Learning database:
The database consists of the multi-spectral values of pixels in 3x3 neighborhoods in a satellite image, and
the classification associated with the central pixel in each neighborhood. The aim is to predict this
classification, given the multi-spectral values. In the sample database, the class of a pixel is coded as a
number.
The Landsat satellite data is one of the many sources of information available for a scene. The
interpretation of a scene by integrating spatial data of diverse types and resolutions including
multispectral and radar data, maps indicating topography, land use etc. is expected to assume significant
importance with the onset of an era characterized by integrative approaches to remote sensing (for
example, NASA's Earth Observing System commencing this decade). Existing statistical methods are ill-equipped for handling such diverse data types. Note that this is not true for Landsat MSS data considered
in isolation (as in this sample database). This data satisfies the important requirements of being numerical
and at a single resolution, and standard maximum-likelihood classification performs very well.
Consequently, for this data, it should be interesting to compare the performance of other methods against
the statistical approach.
One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral
bands. Two of these are in the visible region (corresponding approximately to green and red regions of
the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0
corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each
image contains 2340 x 3380 such pixels.
The database is a (tiny) sub-area of a scene, consisting of 82 x 100 pixels. Each line of data corresponds
to a 3x3 square neighborhood of pixels completely contained within the 82x100 sub-area. Each line
contains the pixel values in the four spectral bands (converted to ASCII) of each of the 9 pixels in the 3x3
neighborhood and a number indicating the classification label of the central pixel. The number is a code
for the following classes:
Number Class
1 red soil
2 cotton crop
3 grey soil
4 damp grey soil
5 soil with vegetation stubble
6 mixture class (all types present)
7 very damp grey soil
Note: There are no examples with class 6 in this dataset.
The data is given in random order and certain lines of data have been removed so you cannot reconstruct
the original image from this dataset.
In each line of data the four spectral values for the top-left pixel are given first followed by the four
spectral values for the top-middle pixel and then those for the top-right pixel, and so on with the pixels
read out in sequence left-to-right and top-to-bottom. Thus, the four spectral values for the central pixel are
given by attributes 17,18,19 and 20.
You can read the data into R from the file satimage.txt in the Shared folder on Class Storage
using the command below:
> SATimage = read.table(file.choose(), header=T, sep=" ")   #  be sure to put a space between the quotes!
> SATimage = data.frame(SATimage[,1:36], class=as.factor(SATimage$class))
This command makes sure that the response is interpreted as a factor (categorical) rather than as a
number. Use SATimage as the data frame throughout.
Create a test and training set using the code below:
> set.seed(888)                                           #  this ensures you all have the same data!!!
> testcases = sample(1:dim(SATimage)[1],1000,replace=F)
> SATtest = SATimage[testcases,]
> SATtrain = SATimage[-testcases,]
a) Compare sknn, naïve Bayes, neural network, lda, qda, and rda classification of the test cases.
Which method performs best for these data?
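A sketch of the lda/qda/rda portion of part (a), assuming lda and qda come from the MASS package and rda and sknn from klaR; sknn, naiveBayes (e1071), and nnet follow the same fit-then-predict pattern on SATtrain and SATtest:

> library(MASS)                                            # lda, qda
> library(klaR)                                            # rda, sknn
> fit.lda = lda(class ~ ., data=SATtrain)
> mean(predict(fit.lda, SATtest)$class != SATtest$class)   # LDA test error rate
> fit.qda = qda(class ~ ., data=SATtrain)
> mean(predict(fit.qda, SATtest)$class != SATtest$class)   # QDA test error rate
> fit.rda = rda(class ~ ., data=SATtrain)
> mean(predict(fit.rda, SATtest)$class != SATtest$class)   # RDA test error rate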
b) Write your own Monte Carlo cross-validation (MCCV) routines for lda, qda, and rda
classification. Demonstrate their use with the full SATimage data set. Which of these methods
performs best?
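One way to structure the MCCV routine for part (b), shown here for lda only (the qda and rda versions would change only the fitting call); the hold-out fraction p, the number of splits B, and the function name lda.mccv are all illustrative choices:

> lda.mccv = function(data, p=0.25, B=100) {
+   n = dim(data)[1]
+   errs = rep(0, B)
+   for (b in 1:B) {
+     test = sample(1:n, floor(p*n), replace=F)   # new random test set each repetition
+     fit = lda(class ~ ., data=data[-test,])
+     pred = predict(fit, data[test,])$class
+     errs[b] = mean(pred != data$class[test])
+   }
+   mean(errs)                                    # MCCV estimate of the error rate
+ }
> lda.mccv(SATimage)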