STAT 425 - ASSIGNMENT 7 (29 pts.)
Using Logistic Regression Methods to Classify a Dichotomous Categorical Response

PROBLEM 1 - Wisconsin Diagnostic Breast Cancer Data (WDBC)

Researchers who created these data:
Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison, WI 53792, wolberg@eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706, street@cs.wisc.edu, 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706, olvi@cs.wisc.edu

Medical literature citations:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters, 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17, No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery, 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792-796, 1995.

See also:
http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
http://www.cs.wisc.edu/~olvi/uwmp/cancer.html

Data Description:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A sample image is shown above.

Response: Diagnosis (M = malignant, B = benign)

The ten real-valued features below are mean values based on three sampled cells:
Radius = radius (mean of distances from center to points on the perimeter)
Texture = texture (standard deviation of gray-scale values)
Perimeter = perimeter of the cell nucleus
Area = area of the cell nucleus
Smoothness = smoothness (local variation in radius lengths)
Compactness = compactness (perimeter^2 / area - 1.0)
Concavity = concavity (severity of concave portions of the contour)
Concavepts = concave points (number of concave portions of the contour)
Symmetry = symmetry (measure of symmetry of the cell nucleus)
FracDim = fractal dimension ("coastline approximation" - 1)

The full data set also contains the standard error of each cell measurement (e.g. serad is the standard error based on the three cell radius measurements) and the worst-case (maximum) value of each (e.g. wrad = maximum cell radius of the three cells sampled). Several of the papers listed above contain detailed descriptions of how these features are measured and computed, if you are interested.

Questions and Tasks:

a) Fit a logistic regression model to classify a breast tumor as malignant or benign using all available predictors in the data frame (BD.df) that you will need to create using the code below. The BreastDiag data frame is in the original R data directory I sent you at the beginning of the course.

> BreastDiag = BreastDiag[,-1]   # you will use this data frame in part (c)
> BD.df = BreastDiag[,1:11]
> bc.log = glm(Diagnosis~.,data=BD.df,family="binomial")

Note: You will get warning messages regarding the convergence of this default model!

What is the misclassification rate of this model using the following classification rule? If P̂(Y = M | X) > 0.50, then classify as Y = M.
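A minimal sketch of one way to compute this misclassification rate, assuming BD.df and bc.log have been created as above and that fitted(bc.log) returns P̂(Y = M | X) because M is the second level of the Diagnosis factor:

> phat = fitted(bc.log)                    # estimated P(Y = M | X)
> yhat = ifelse(phat > 0.50, "M", "B")     # apply the 0.50 cutoff rule
> table(Predicted = yhat, Observed = BD.df$Diagnosis)
> mean(yhat != BD.df$Diagnosis)            # proportion of tumors misclassified

The diagonal of the table counts the correct classifications; the off-diagonal entries are the misclassified tumors.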
Recall: The cutoff probability is usually taken to be 0.50 for obvious reasons, but other values can be used. (10 pts.)

b) Obtain the ROC curve for your final model using functions in the ROCR package. What does this curve tell you about the predictive abilities of your model? (3 pts.)

> library(ROCR)
> pred = prediction(fitted(bc.log),BD.df$Diagnosis)
> perf = performance(pred,"tpr","fpr")
> plot(perf)
> perf2 = performance(pred,"auc")
> perf2
> text(locator(),"AUC = ????")

c) Now fit a logistic regression model to the full data set (means, SEs, and worst-case values) contained in the data frame BreastDiag you created above. What happens? (2 pts.)

d) Now fit ridge and Lasso logistic regression models on the full data set. What are the misclassification rates for these two methods using the optimal values chosen via cross-validation? (6 pts.)

> library(glmnet)
> X = model.matrix(Diagnosis~.,data=BreastDiag)[,-1]
> y = as.numeric(BreastDiag$Diagnosis)-1
> ridge.log = glmnet(X,y,alpha=0,family="binomial")
> ridge.cv = cv.glmnet(X,y,alpha=0,family="binomial")
> plot(ridge.cv); bestlam = ridge.cv$lambda.min
> ypred = predict(ridge.log,newx=X,s=bestlam,type="response")
> table(ypred > .5, y)

Etc…

e) Form training and test sets from the BreastDiag data frame (2/3 vs. 1/3), fit ridge and Lasso logistic regression models to the training data, and then predict the breast cancer diagnosis for the women in the test set. Give the misclassification rates on the test data using both methods. Which method, ridge or Lasso, performs better? Which characteristics appear to be the most important in classifying malignancy? (8 pts.)
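For part (b), one way to fill in the AUC annotation is to pull the value out of the perf2 object rather than typing it by hand. This sketch assumes the pred, perf, and perf2 objects from the part (b) code; the text coordinates (0.6, 0.4) are an arbitrary placement choice:

> auc = round(perf2@y.values[[1]], 3)    # the AUC is stored in the y.values slot
> plot(perf)
> abline(0, 1, lty = 2)                  # reference line for a no-skill classifier
> text(0.6, 0.4, paste("AUC =", auc))    # or click a position with locator(1) as in the handout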
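For part (d), the "Etc…" indicates repeating the ridge steps for the Lasso. A sketch under the assumption that X, y, bestlam, and ypred follow the code above (alpha = 1 switches glmnet from ridge to the Lasso):

> lasso.log = glmnet(X,y,alpha=1,family="binomial")
> lasso.cv = cv.glmnet(X,y,alpha=1,family="binomial")
> plot(lasso.cv); bestlam.lasso = lasso.cv$lambda.min
> ypred.lasso = predict(lasso.log,newx=X,s=bestlam.lasso,type="response")
> table(ypred.lasso > .5, y)              # rows = classified malignant?, columns = observed (0 = B, 1 = M)
> mean((ypred.lasso > .5) != y)           # Lasso misclassification rate
> mean((ypred > .5) != y)                 # ridge misclassification rate, for comparison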
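For part (e), a sketch of one possible train/test split. It assumes X and y are built from the full BreastDiag data frame as in part (d); the seed and object names (train, ridge.cv.tr, etc.) are illustrative choices, not required ones:

> set.seed(1)                                    # any seed; makes the split reproducible
> n = nrow(BreastDiag)
> train = sample(1:n, size = round(2*n/3))       # 2/3 training, 1/3 test
> ridge.cv.tr = cv.glmnet(X[train,], y[train], alpha=0, family="binomial")
> ridge.pred = predict(ridge.cv.tr, newx=X[-train,], s="lambda.min", type="response")
> mean((ridge.pred > .5) != y[-train])           # ridge test misclassification rate
> lasso.cv.tr = cv.glmnet(X[train,], y[train], alpha=1, family="binomial")
> lasso.pred = predict(lasso.cv.tr, newx=X[-train,], s="lambda.min", type="response")
> mean((lasso.pred > .5) != y[-train])           # Lasso test misclassification rate
> coef(lasso.cv.tr, s="lambda.min")              # nonzero Lasso coefficients point to the most important features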