Wisconsin Diagnostic Breast Cancer Data (WDBC)

Biostatistics – Midterm Exam (Part 2)
(65 points)
1) Wisconsin Diagnostic Breast Cancer Data (WDBC)
Researchers who created these data:
Dr. William H. Wolberg, General Surgery Dept., University of
Wisconsin, Clinical Sciences Center, Madison, WI 53792
wolberg@eagle.surgery.wisc.edu
W. Nick Street, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
street@cs.wisc.edu 608-262-6619
Olvi L. Mangasarian, Computer Sciences Dept., University of
Wisconsin, 1210 West Dayton St., Madison, WI 53706
olvi@cs.wisc.edu
Medical literature citations:
W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
Machine learning techniques to diagnose breast cancer from
fine-needle aspirates. Cancer Letters, 77 (1994) 163-171.
W.H. Wolberg, W.N. Street, and O.L. Mangasarian.
Image analysis and machine learning applied to breast cancer
diagnosis and prognosis. Analytical and Quantitative Cytology and Histology, Vol. 17
No. 2, pages 77-87, April 1995.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
Computerized breast cancer diagnosis and prognosis from fine
needle aspirates. Archives of Surgery 1995;130:511-516.
W.H. Wolberg, W.N. Street, D.M. Heisey, and O.L. Mangasarian.
Computer-derived nuclear features distinguish malignant from
benign breast cytology. Human Pathology, 26:792-796, 1995.
See also:
http://www.cs.wisc.edu/~olvi/uwmp/mpml.html
http://www.cs.wisc.edu/~olvi/uwmp/cancer.html
Data Description:
Features are computed from a digitized image of a fine needle aspirate (FNA) of a breast
mass. They describe characteristics of the cell nuclei present in the image.
A sample image is shown above.
Response: Diagnosis (M = malignant, B = benign)
Ten real-valued features are computed for each cell nucleus:
Radius = radius (mean of distances from center to points on the perimeter)
Texture = texture (standard deviation of gray-scale values)
Perimeter = perimeter of the cell nucleus
Area = area of the cell nucleus
Smoothness = smoothness (local variation in radius lengths)
Compactness = compactness (perimeter^2 / area - 1.0)
Concavity = concavity (severity of concave portions of the contour)
Concavepts = concave points (number of concave portions of the contour)
Symmetry = symmetry (measure of symmetry of the cell nucleus)
FracDim = fractal dimension ("coastline approximation" - 1)
Several of the papers listed above contain detailed descriptions of how these features are
computed.
Questions and Tasks:
a) Develop a logistic regression model for diagnosis in R. Use the transformation guidelines we
went through in class. Include the plots that you used to choose appropriate terms for your model
in your output. Examine case diagnostics, plots used to assess model adequacy (mmps), and an
ROC curve for your “final” model. Summarize your model development process and your
findings. You do not need to address OR interpretations for this model! (20 pts.)
To obtain an ROC curve in R, you will need to install the package epicalc from CRAN and use
the function lroc to plot the ROC and find the area under the curve (AUC). Use the R commands
below:
> library(epicalc)               # load the package that provides lroc
> roc.mymodel = lroc(mymodel)    # draws curve and saves results
> roc.mymodel$auc                # gives the area under the curve (AUC)
b) What does the ROC curve tell you about the predictive abilities of your model? (3 pts.)
c) Fit your final model from R in JMP using the data file: BreastDiag.JMP. Save the fitted
probabilities into your spreadsheet and cross-classify Most Likely Diagnosis (X) with actual
Diagnosis (Y). What is the apparent error rate (AER) when your model is used to classify tumors as
malignant or benign? (5 pts.)
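The cross-classification and AER calculation can be sketched in R as follows. This is only an illustration on invented toy values, not the BreastDiag data: the vectors p_hat and actual, and the 0.50 cutoff, are assumptions made for the example.

```r
# Toy fitted probabilities of malignancy and actual diagnoses
# (invented for illustration; not taken from BreastDiag.JMP).
p_hat  <- c(0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.05)
actual <- c("M", "M", "B", "M", "B", "M", "B", "B")

# Most likely diagnosis: malignant when the fitted probability
# is at least the assumed cutoff of 0.50.
cutoff    <- 0.50
predicted <- ifelse(p_hat >= cutoff, "M", "B")

# Cross-classify predicted diagnosis with actual diagnosis.
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)

# Apparent error rate: proportion of cases that are misclassified.
aer <- mean(predicted != actual)
print(aer)
```

The AER is called "apparent" because it is computed on the same cases used to fit the model, so it tends to understate the error rate on new cases.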
Note: In logistic regression, if classification is one goal of the analysis, we can use the following rule to perform the
actual classification:
If $\hat{P}(Y = 1 \mid \tilde{x}) \ge \pi_{\text{cutoff}}$, then classify as Y = 1.
The cutoff probability $\pi_{\text{cutoff}}$ is usually taken to be .50, which assigns each case to its more probable class, but other values can be used.
The ROC curve is constructed by looking at a sequence of values for the cutoff $\pi_{\text{cutoff}}$ described above. For each cutoff value, we
can easily compute the sensitivity and specificity, so the ROC curve and the area beneath it can be found.
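The construction just described can be sketched in base R. This is a minimal illustration on invented toy values, not a substitute for lroc: the vectors p_hat and y and the helper sens_spec are hypothetical, and the area is approximated with the trapezoid rule.

```r
# Toy fitted probabilities and true classes (invented for illustration).
p_hat <- c(0.95, 0.80, 0.70, 0.55, 0.45, 0.30, 0.20, 0.05)
y     <- c(1, 1, 0, 1, 0, 1, 0, 0)

# Cutoffs from strictest (nothing classified as Y = 1) to most lenient.
cutoffs <- c(Inf, sort(unique(p_hat), decreasing = TRUE))

# Sensitivity and specificity of the rule
# "classify as Y = 1 if p_hat >= cutoff".
sens_spec <- function(cutoff) {
  pred <- as.integer(p_hat >= cutoff)
  c(sens = sum(pred == 1 & y == 1) / sum(y == 1),
    spec = sum(pred == 0 & y == 0) / sum(y == 0))
}
roc <- t(sapply(cutoffs, sens_spec))

# ROC curve: sensitivity against 1 - specificity.
fpr <- 1 - roc[, "spec"]
tpr <- roc[, "sens"]
plot(fpr, tpr, type = "l",
     xlab = "1 - Specificity", ylab = "Sensitivity")

# Area under the curve by the trapezoid rule.
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
print(auc)
```

An AUC near 1 means the fitted probabilities rank almost every Y = 1 case above every Y = 0 case, while an AUC near 0.5 means the model ranks cases no better than chance.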
2) Right Heart Catheterization Study
The effectiveness of right heart catheterization in the initial care of critically ill patients.
SUPPORT Investigators. Connors AF, et al.
Department of Medicine, Case Western Reserve University at Metro Health Medical Center,
Cleveland, Ohio, USA.
OBJECTIVE: To examine the association between the use of right heart catheterization (RHC)
during the first 24 hours of care in the intensive care unit (ICU) and subsequent survival, length of
stay, intensity of care, and cost of care. DESIGN: Prospective cohort study. SETTING: Five US
teaching hospitals between 1989 and 1994. SUBJECTS: A total of 5735 critically ill adult patients
receiving care in an ICU for 1 of 9 prespecified disease categories. MAIN OUTCOME
MEASURES: Survival time, cost of care, intensity of care, and length of stay in the ICU and
hospital, determined from the clinical record and from the National Death Index.
Variable name   Variable Definition
Age             Age
Sex             Sex
Race            Race: white, black, other
Edu             Years of education
Income          Income (under 11k, 11-25k, 26-50k, > 50k)
Ninsclas        Medical insurance status: No insurance, Medicare, Medicaid,
                Medicaid & Medicare, Private & Medicare, Private
Cat1            Primary disease category: MOSF w/sepsis, MOSF
                w/malignancy, lung cancer, COPD, coma, colon cancer,
                cirrhosis, CHF, ARF
Categories of admission diagnosis:
Resp            Respiratory Diagnosis (yes or no)
Card            Cardiovascular Diagnosis (yes or no)
Neuro           Neurological Diagnosis (yes or no)
Gastr           Gastrointestinal Diagnosis (yes or no)
Renal           Renal Diagnosis (yes or no)
Meta            Metabolic Diagnosis (yes or no)
Hema            Hematologic Diagnosis (yes or no)
Seps            Sepsis Diagnosis (yes or no)
Trauma          Trauma Diagnosis (yes or no)
Ortho           Orthopedic Diagnosis (yes or no)
Das2d3pc        DASI (Duke Activity Status Index)
Dnr1            DNR status on day 1 (yes or no)
Ca              Cancer (3 levels = yes, no, or metastatic)
Surv2md1        Support model estimate of the prob. of surviving 2 months
Aps1            APACHE score
Wtkilo1         Weight
Temp1           Temperature
Meanbp1         Mean blood pressure
Resp1           Respiratory rate
Hrt1            Heart rate
Pafi1           PaO2/FIO2 ratio
Paco21          PaCO2
Ph1             pH
Wblc1           WBC
Hema1           Hematocrit
Sod1            Sodium
Pot1            Potassium
Crea1           Creatinine
Bili1           Bilirubin
Alb1            Albumin
Categories of comorbid illness:
These are all coded as: 0 = No, 1 = Yes
Cardiohx        Acute MI, Peripheral Vascular Disease, Severe
                Cardiovascular Symptoms (NYHA-Class III), Very Severe
                Cardiovascular Symptoms (NYHA-Class IV)
Chfhx           Congestive Heart Failure
Dementhx        Dementia, Stroke or Cerebral Infarct, Parkinson’s Disease
Psychhx         Psychiatric History, Active Psychosis or Severe Depression
Chrpulhx        Chronic Pulmonary Disease, Severe Pulmonary Disease,
                Very Severe Pulmonary Disease
Renalhx         Chronic Renal Disease, Chronic Hemodialysis or Peritoneal
                Dialysis
Liverhx         Cirrhosis, Hepatic Failure
Gibledhx        Upper GI Bleeding
Malighx         Solid Tumor, Metastatic Disease, Chronic
                Leukemia/Myeloma, Acute Leukemia, Lymphoma
Immunhx         Immunosuppression, Organ Transplant, HIV Positivity,
                Diabetes Mellitus Without End Organ Damage, Diabetes
                Mellitus With End Organ Damage, Connective Tissue
                Disease
Transhx         Transfer (> 24 Hours) from Another Hospital
Amihx           Definite Myocardial Infarction
More Important Variables
Swang1          Right Heart Catheterization (RHC) (yes or no)
Death           Death at any time up to 180 Days (yes or no)
Dth30           Death at any time up to 30 Days (yes or no) (RESPONSE)
The researchers found that heart attack patients who had a right heart catheter (Swan-Ganz line) put
in had a 24% higher risk of 30-day mortality than patients who did not have the procedure
performed. Given that this procedure is generally used when doctors are in some sense perplexed
about what course of treatment to follow, one could argue that this result is expected because
patients for whom the course of treatment is not obvious may be more severely ill. In a study such as
this, we can try to eliminate this potential confounding by using information about the physiological
state of the patient at the time of treatment. All of the additional variables can be used to
accomplish this goal.
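To see how adjusting for severity of illness can change the estimated risk, consider the following sketch. The counts are invented purely for illustration and are not taken from the SUPPORT data: in this made-up example the procedure has no effect within either severity group, but sicker patients receive it far more often. The names strata, severity, rhc, deaths, and n are hypothetical.

```r
# Invented grouped counts (NOT the SUPPORT data): within each severity
# group the death rate is the same with or without the procedure,
# but the procedure is used mostly on the sicker patients.
strata <- data.frame(
  severity = c("low", "low", "high", "high"),
  rhc      = c(0, 1, 0, 1),
  deaths   = c(100, 20, 80, 400),
  n        = c(1000, 200, 200, 1000)
)

# Crude model: ignores severity.
crude <- glm(cbind(deaths, n - deaths) ~ rhc,
             family = binomial, data = strata)

# Adjusted model: includes severity as a potential confounder.
adjusted <- glm(cbind(deaths, n - deaths) ~ rhc + severity,
                family = binomial, data = strata)

crude_or    <- exp(coef(crude)[["rhc"]])
adjusted_or <- exp(coef(adjusted)[["rhc"]])
print(c(crude = crude_or, adjusted = adjusted_or))
```

In this artificial setup the crude odds ratio is about 3, while the adjusted odds ratio is essentially 1, because all of the apparent excess risk comes from who receives the procedure. Real data will rarely separate this cleanly, but the same comparison of crude and adjusted estimates applies.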
Build a logistic regression model for 30-day mortality using these data. Keep in mind that quantifying
the risk associated with right heart catheterization (Swang1) is of primary interest, so no matter
what, do not take it out of the model.
Carefully and thoroughly summarize the following:
a) The model development process. You are going to have to take a more hands-on approach here.
Examine diagnostics, model adequacy plots, and the ROC curve for your “final” model. (25 pts.)
b) Create a table giving the OR and associated CI for each term in your final model. For
continuous predictors pick an increment (e.g. c = SD(x)) giving the OR and CI for that increment
value. (20 pts.)
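For a continuous predictor with coefficient estimate b and standard error se, the OR for an increment c is exp(c*b), and an approximate CI is exp(c*(b ± z*se)). The sketch below assumes a Wald-type interval, which may differ from the interval method used in class. The function or_for_increment and the numbers passed to it are hypothetical.

```r
# OR and Wald-type CI for a c-unit increase in a continuous predictor.
# beta and se are the coefficient estimate and its standard error
# from a fitted logistic regression; increment is the chosen c.
or_for_increment <- function(beta, se, increment, level = 0.95) {
  z   <- qnorm(1 - (1 - level) / 2)
  est <- exp(increment * beta)
  ci  <- exp(increment * (beta + c(-1, 1) * z * se))
  c(OR = est, lower = ci[1], upper = ci[2])
}

# Made-up example: beta = 0.02 per unit, se = 0.005, increment c = 10.
res <- or_for_increment(beta = 0.02, se = 0.005, increment = 10)
print(round(res, 3))
```

With these made-up inputs the OR for a 10-unit increase is exp(0.2), roughly 1.22. The same function applies to a 0/1 predictor by setting increment = 1.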
c) The risk associated with the Swan-Ganz line procedure. Does the 24% increase in risk sound
reasonable given your model? Use your OR and CI from part (b) to answer this question. (3 pts.)
All the necessary data files are on the course website in the Data Sets list.