Assignment #7 - Winona State University

STAT 425 - ASSIGNMENT 7
(29 pts.)
Using Logistic Regression Methods to Classify a Dichotomous Categorical Response
PROBLEM 1 - Wisconsin Diagnostic Breast Cancer Data (WDBC)
Researchers who created these data:
Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu
Medical literature citations:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
Machine learning techniques to diagnose breast cancer from
fine-needle aspirates. Cancer Letters, 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
Image analysis and machine learning applied to breast cancer
diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17
No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
Computerized breast cancer diagnosis and prognosis from fine
needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
Computer-derived nuclear features distinguish malignant from
benign breast cytology. Human Pathology, 26:792-796, 1995.
See also:
http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
http://www.cs.wisc.edu/~olvi/uwmp/cancer.html
Data Description:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
Response: Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed as the mean value, based on three cells, for each of the
following cell characteristics:
Radius = radius (mean of distances from center to points on the perimeter)
Texture = texture (standard deviation of gray-scale values)
Perimeter = perimeter of the cell nucleus
Area = area of the cell nucleus
Smoothness = smoothness (local variation in radius lengths)
Compactness = compactness (perimeter^2 / area - 1.0)
Concavity = concavity (severity of concave portions of the contour)
Concavepts = concave points (number of concave portions of the contour)
Symmetry = symmetry (measure of symmetry of the cell nucleus)
FracDim = fractal dimension ("coastline approximation" - 1)
The full data set also contains the standard errors of the cell measurements (e.g., serad is the
standard error based on the three cell radius measurements) and the worst-case (maximum)
value for each (e.g., wrad = maximum cell radius of the three cells sampled).
Several of the papers listed above contain detailed descriptions of how these features are
measured and computed if you are interested.
Questions and Tasks:
a) Fit a logistic regression model to classify a breast tumor as malignant or benign using all
available predictors in the data frame (BD.df) that you will need to create using the code below.
The BreastDiag data frame is in the original R data directory I sent you at the beginning of the
course.
> BreastDiag = BreastDiag[,-1]   # you will use this data frame in part (c)
> BD.df = BreastDiag[,1:11]
> bc.log = glm(Diagnosis~., data=BD.df, family="binomial")
Note: You will get warning messages regarding the convergence of this default model!
What is the misclassification rate of this model using the following classification rule?
If P̂(Y = M | X) > 0.50, then classify as Y = M.
Recall: The cutoff probability is usually taken to be 0.50 for obvious reasons, but other values can be used. (10 pts.)
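A minimal sketch of one way to compute this misclassification rate, assuming the bc.log fit above and that M is the second level of the Diagnosis factor (so fitted() returns the estimated P(Y = M | X)):
> phat = fitted(bc.log)                  # estimated P(Y = M | X)
> yhat = ifelse(phat > 0.5, "M", "B")    # apply the 0.50 cutoff rule
> misclass = table(Predicted = yhat, Actual = BD.df$Diagnosis)
> misclass
> 1 - sum(diag(misclass))/nrow(BD.df)    # misclassification rate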
b) Obtain the ROC curve for your final model using functions in the ROCR package. What does
this curve tell you about the predictive abilities of your model? (3 pts.)
> library(ROCR)
> pred = prediction(fitted(bc.log), BD.df$Diagnosis)
> perf = performance(pred, "tpr", "fpr")
> plot(perf)
> perf2 = performance(pred, "auc")
> perf2
> text(locator(), "AUC = ????")
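If you want the AUC as a number to paste into the plot label (rather than reading it off the printed perf2 object), it is stored in the y.values slot of the ROCR performance object:
> auc = perf2@y.values[[1]]    # numeric AUC value
> auc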
c) Now fit a logistic regression model to the full data set (means, SE’s, and worst case values)
contained in the data frame BreastDiag you created above. What happens? (2 pts.)
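A minimal sketch of this fit, mirroring part (a) but using the full BreastDiag data frame (the object name bc.full is just illustrative):
> bc.full = glm(Diagnosis~., data=BreastDiag, family="binomial")
> summary(bc.full)    # examine the coefficients and their standard errors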
d) Now fit ridge and Lasso logistic regression models on the full data set. What are the
misclassification rates for these two methods using the optimal λ values chosen via cross-validation? (6 pts.)
> library(glmnet)
> X = model.matrix(Diagnosis~., data=BreastDiag)[,-1]
> y = as.numeric(BreastDiag$Diagnosis) - 1
> ridge.log = glmnet(X, y, alpha=0, family="binomial")
> ridge.cv = cv.glmnet(X, y, alpha=0, family="binomial")
> plot(ridge.cv); bestlam = ridge.cv$lambda.min
> ypred = predict(ridge.log, newx=X, s=bestlam, type="response")
> table(ypred > .5, y)
Etc…
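The Lasso fit follows the same pattern with alpha = 1; a sketch of the analogous code (the object names lasso.log, lasso.cv, etc. are illustrative):
> lasso.log = glmnet(X, y, alpha=1, family="binomial")
> lasso.cv = cv.glmnet(X, y, alpha=1, family="binomial")
> plot(lasso.cv); bestlam.lasso = lasso.cv$lambda.min
> ypred.lasso = predict(lasso.log, newx=X, s=bestlam.lasso, type="response")
> table(ypred.lasso > .5, y)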
e) Form training and test sets from the BreastDiag data frame (2/3 vs. 1/3) and fit ridge and Lasso
logistic regression models to the training data and then predict the breast cancer diagnosis for the
women in the test set. Give the misclassification rates on the test data using both methods. Which
method, ridge or Lasso, performs the best? Which characteristics appear to be the most important
in classifying malignancy? (8 pts.)
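One possible workflow, sketched under the assumption that X and y are the model matrix and 0/1 response built in part (d); the random split and the object names (train, ridge.cv.tr, etc.) are illustrative, not required:
> set.seed(1)                                    # for a reproducible split
> n = nrow(BreastDiag)
> train = sample(1:n, size = floor(2*n/3))       # 2/3 training, 1/3 test
> Xtrain = X[train, ]; ytrain = y[train]
> Xtest = X[-train, ]; ytest = y[-train]
> ridge.cv.tr = cv.glmnet(Xtrain, ytrain, alpha=0, family="binomial")
> ridge.pred = predict(ridge.cv.tr, newx=Xtest, s="lambda.min", type="response")
> mean((ridge.pred > .5) != ytest)               # ridge test misclassification rate
> lasso.cv.tr = cv.glmnet(Xtrain, ytrain, alpha=1, family="binomial")
> lasso.pred = predict(lasso.cv.tr, newx=Xtest, s="lambda.min", type="response")
> mean((lasso.pred > .5) != ytest)               # Lasso test misclassification rate
> coef(lasso.cv.tr, s="lambda.min")              # nonzero coefficients suggest influential features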