Biostatistics – Midterm Exam (Part 2) (65 points)

1) Wisconsin Diagnostic Breast Cancer Data (WDBC)

Researchers who created these data:

Dr. William H. Wolberg, General Surgery Dept., University of Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu

W. Nick Street, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619

Olvi L. Mangasarian, Computer Sciences Dept., University of Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu

Medical literature citations:

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters, 77 (1994) 163-171.

W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17 No. 2, pages 77-87, April 1995.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computerized breast cancer diagnosis and prognosis from fine needle aspirates. Archives of Surgery 1995;130:511-516.

W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian. Computer-derived nuclear features distinguish malignant from benign breast cytology. Human Pathology, 26:792-796, 1995.

See also:
http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
http://www.cs.wisc.edu/~olvi/uwmp/cancer.html

Data Description:

Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image. A sample image is shown above.
Response: Diagnosis (M = malignant, B = benign)

Ten real-valued features are computed for each cell nucleus:

Radius = radius (mean of distances from center to points on the perimeter)
Texture = texture (standard deviation of gray-scale values)
Perimeter = perimeter of the cell nucleus
Area = area of the cell nucleus
Smoothness = smoothness (local variation in radius lengths)
Compactness = compactness (perimeter^2 / area - 1.0)
Concavity = concavity (severity of concave portions of the contour)
Concavepts = concave points (number of concave portions of the contour)
Symmetry = symmetry (measure of symmetry of the cell nucleus)
FracDim = fractal dimension ("coastline approximation" - 1)

Several of the papers listed above contain detailed descriptions of how these features are computed.

Questions and Tasks:

a) Develop a logistic regression model for diagnosis in R. Use the transformation guidelines we went through in class. Include the plots that you used to choose appropriate terms for your model in your output. Examine case diagnostics, the plots used to assess model adequacy (mmps), and an ROC curve for your “final” model. Summarize your model development process and your findings. You do not need to address OR interpretations for this model! (20 pts.)

To obtain an ROC curve in R, you will need to install the package epicalc from CRAN and use the function lroc to plot the ROC curve and find the area under the curve (AUC). Use the R commands below:

> roc.mymodel = lroc(mymodel)    draws curve and saves results
> roc.mymodel$auc                gives the area under the curve (AUC)

b) What does the ROC curve tell you about the predictive abilities of your model? (3 pts.)

c) Fit your final model from R in JMP using the data file: BreastDiag.JMP. Save the fitted probabilities into your spreadsheet and cross-classify Most Likely Diagnosis (X) with actual Diagnosis (Y). What is the apparent error rate (AER) when your model is used to classify the tumor as malignant or benign? (5 pts.)
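As a rough illustration of the quantities behind parts (b) and (c), the sketch below uses base R only (it does not use epicalc or lroc) on simulated data, not the WDBC data. It classifies each case by comparing the fitted probability to a cutoff of 0.5, cross-classifies predicted against actual response, computes the apparent error rate, and traces sensitivity and specificity over a grid of cutoffs. The names sim, classify, and roc_points are hypothetical, and the trapezoidal AUC is only an approximation on that grid.

```r
# Simulated stand-in data (an assumption for illustration; NOT the WDBC data)
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 2 * x))
sim <- data.frame(x = x, y = y)

# Logistic regression fit and fitted probabilities P-hat(Y = 1 | x)
fit <- glm(y ~ x, family = binomial, data = sim)
phat <- fitted(fit)

# Classify as Y = 1 when the fitted probability exceeds the cutoff
classify <- function(p, cutoff = 0.5) as.integer(p > cutoff)
pred <- classify(phat)

# Cross-classify predicted vs. actual, then compute the apparent error rate:
# the proportion of cases in the fitting data that are misclassified
confusion <- table(Predicted = factor(pred, levels = 0:1),
                   Actual = factor(sim$y, levels = 0:1))
aer <- mean(pred != sim$y)

# Sensitivity and specificity at each cutoff (these are the ROC curve points)
roc_points <- function(p, y, cutoffs = seq(0, 1, by = 0.01)) {
  t(sapply(cutoffs, function(cut) {
    pr <- as.integer(p > cut)
    c(cutoff = cut,
      sensitivity = sum(pr == 1 & y == 1) / sum(y == 1),
      specificity = sum(pr == 0 & y == 0) / sum(y == 0))
  }))
}
roc <- roc_points(phat, sim$y)

# Approximate AUC by the trapezoidal rule over (1 - specificity, sensitivity)
fpr <- 1 - roc[, "specificity"]
tpr <- roc[, "sensitivity"]
ord <- order(fpr, tpr)
fpr <- fpr[ord]
tpr <- tpr[ord]
auc <- sum(diff(fpr) * (tpr[-length(tpr)] + tpr[-1]) / 2)

print(confusion)
print(c(AER = aer, AUC = auc))
```

Because the AER here is computed on the same cases used to fit the model, it tends to be optimistic about performance on new cases.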
Note: In logistic regression, if classification is one goal of the analysis, we can use the following rule to perform the actual classification:

If $\hat{P}(Y = 1 \mid \mathbf{x}) > \text{cutoff}$, then classify as Y = 1.

The cutoff probability is usually taken to be .50 for obvious reasons, but other values can be used. The ROC curve is constructed by looking at a sequence of values for the cutoff described above. For each cutoff value, we can easily compute the sensitivity and specificity, so the ROC curve and the area beneath it can be found.

2) Right Heart Catheterization Study

The effectiveness of right heart catheterization in the initial care of critically ill patients. SUPPORT Investigators. Connors AF, et al. Department of Medicine, Case Western Reserve University at Metro Health Medical Center, Cleveland, Ohio, USA.

OBJECTIVE: To examine the association between the use of right heart catheterization (RHC) during the first 24 hours of care in the intensive care unit (ICU) and subsequent survival, length of stay, intensity of care, and cost of care.

DESIGN: Prospective cohort study.

SETTING: Five US teaching hospitals between 1989 and 1994.

SUBJECTS: A total of 5735 critically ill adult patients receiving care in an ICU for 1 of 9 prespecified disease categories.

MAIN OUTCOME MEASURES: Survival time, cost of care, intensity of care, and length of stay in the ICU and hospital, determined from the clinical record and from the National Death Index.
Variable name    Variable Definition

Age              Age
Sex              Sex
Race             Race: white, black, other
Edu              Years of education
Income           Income (under 11k, 11-25k, 26-50k, > 50k)
Ninsclas         Medical insurance status: No insurance, Medicare, Medicaid, Medicaid & Medicare, Private & Medicare, Private
Cat1             Primary disease category: MOSF w/sepsis, MOSF w/malignancy, lung cancer, COPD, coma, colon cancer, cirrhosis, CHF, ARF

Categories of admission diagnosis:

Resp             Respiratory Diagnosis (yes or no)
Card             Cardiovascular Diagnosis (yes or no)
Neuro            Neurological Diagnosis (yes or no)
Gastr            Gastrointestinal Diagnosis (yes or no)
Renal            Renal Diagnosis (yes or no)
Meta             Metabolic Diagnosis (yes or no)
Hema             Hematologic Diagnosis (yes or no)
Seps             Sepsis Diagnosis (yes or no)
Trauma           Trauma Diagnosis (yes or no)
Ortho            Orthopedic Diagnosis (yes or no)

Das2d3pc         DASI (Duke Activity Status Index)
Dnr1             DNR status on day 1 (yes or no)
Ca               Cancer (3 levels = yes, no, or metastatic)
Surv2md1         SUPPORT model estimate of the probability of surviving 2 months
Aps1             APACHE score
Wtkilo1          Weight
Temp1            Temperature
Meanbp1          Mean blood pressure
Resp1            Respiratory rate
Hrt1             Heart rate
Pafi1            PaO2/FIO2 ratio
Paco21           PaCO2
Ph1              pH
Wblc1            WBC
Hema1            Hematocrit
Sod1             Sodium
Pot1             Potassium
Crea1            Creatinine
Bili1            Bilirubin
Alb1             Albumin

Categories of comorbid illness (all coded as 0 = No, 1 = Yes):

Cardiohx         Acute MI, Peripheral Vascular Disease, Severe Cardiovascular Symptoms (NYHA-Class III), Very Severe Cardiovascular Symptoms (NYHA-Class IV)
Chfhx            Congestive Heart Failure
Dementhx         Dementia, Stroke or Cerebral Infarct, Parkinson’s Disease
Psychhx          Psychiatric History, Active Psychosis or Severe Depression
Chrpulhx         Chronic Pulmonary Disease, Severe Pulmonary Disease, Very Severe Pulmonary Disease
Renalhx          Chronic Renal Disease, Chronic Hemodialysis or Peritoneal Dialysis
Liverhx          Cirrhosis, Hepatic Failure
Gibledhx         Upper GI Bleeding
Malighx          Solid Tumor, Metastatic Disease, Chronic Leukemia/Myeloma, Acute Leukemia, Lymphoma
Immunhx          Immunosuppression, Organ Transplant, HIV Positivity, Diabetes Mellitus Without End Organ Damage, Diabetes Mellitus With End Organ Damage, Connective Tissue Disease
Transhx          Transfer (> 24 Hours) from Another Hospital
Amihx            Definite Myocardial Infarction

More Important Variables:

Swang1           Right Heart Catheterization (RHC) (yes or no)
Death            Death at any time up to 180 Days (yes or no)
Dth30            Death at any time up to 30 Days (yes or no)  RESPONSE

The researchers found that heart attack patients who had a right heart catheter (Swan-Ganz line) put in had a 24% higher risk of 30-day mortality than patients who did not have the procedure performed. Given that this procedure is generally used when doctors are in some sense perplexed about what course of treatment to follow, one could argue that this result is expected, because patients for whom the course of treatment is not obvious may be more severely ill.
In a study such as this, we can try to eliminate this potential confounding by using information about the physiological state of the patient at the time of treatment. All of the additional variables can be used to accomplish this goal.

Build a logistic regression model for 30-day mortality using these data. Keep in mind that quantification of the risk associated with right heart catheterization (swang1) is of primary interest, so no matter what, don’t take it out of the model. Carefully and thoroughly summarize the following:

a) The model development process. You are going to have to take a more hands-on approach here. Examine diagnostics, model adequacy plots, and the ROC curve for your “final” model. (25 pts.)

b) Create a table giving the OR and associated CI for each term in your final model. For continuous predictors, pick an increment (e.g. c = SD(x)) and give the OR and CI for that increment value. (20 pts.)

c) The risk associated with the Swan-Ganz line procedure. Does the 24% increase in risk sound reasonable given your model? Use your OR and CI from part (b) to answer this question. (3 pts.)

All the necessary data files are on the course website in the Data Sets list.
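One possible way to build the kind of table described for the odds ratios is sketched below in base R on simulated data, not the RHC data. For a term with coefficient estimate b and standard error SE, the OR for a c-unit increase is exp(c * b), with an approximate Wald 95% CI of exp(c * (b ± 1.96 * SE)). The variables treat and score and the function or_table are hypothetical names introduced only for this sketch. Wald intervals are an assumption here, and profile-likelihood intervals are an alternative.

```r
# Simulated stand-in data (an assumption for illustration; NOT the RHC data)
set.seed(2)
n <- 500
treat <- rbinom(n, 1, 0.4)              # hypothetical binary exposure
score <- rnorm(n, mean = 50, sd = 10)   # hypothetical continuous predictor
death <- rbinom(n, 1, plogis(-3 + 0.3 * treat + 0.04 * score))
sim <- data.frame(death = death, treat = treat, score = score)

fit <- glm(death ~ treat + score, family = binomial, data = sim)

# OR and Wald CI for a c-unit increase in each named term.
# 'increments' is a named vector of positive increments, one per model term.
or_table <- function(fit, increments, level = 0.95) {
  est <- coef(summary(fit))
  z <- qnorm(1 - (1 - level) / 2)
  terms <- names(increments)
  b <- unname(est[terms, "Estimate"])
  se <- unname(est[terms, "Std. Error"])
  inc <- unname(increments)
  data.frame(term = terms,
             increment = inc,
             OR = exp(inc * b),
             lower = exp(inc * (b - z * se)),
             upper = exp(inc * (b + z * se)))
}

# Increment of 1 for the binary term, one standard deviation for the continuous term
tab <- or_table(fit, c(treat = 1, score = sd(sim$score)))
print(tab)
```

An interval for the exposure term that contains 1 would indicate that the data are compatible with no association after adjustment. Note that an OR is a ratio of odds, so it matches a ratio of risks only approximately, and only when the outcome is uncommon.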