STAT 405 Fall 2012 - Homework 6 ( points)
Due Thursday, October 11th

1 - Pesticides in the Breast Milk of Mothers in Western Australia

These data are from a study of breast-feeding mothers in Western Australia. Earlier research discovered surprisingly high levels of pesticides in human breast milk. The researchers hoped to show that the levels had decreased as a result of stricter government regulations on the use of pesticides on food crops. They did find decreases for several types of pesticides. Levels of the pesticide Dieldrin, however, had substantially increased.

For 45 breast milk donors, we have information on the mother's age in years, whether they lived in a new suburb (0 = no, 1 = yes), whether their house had been treated for termites within the past three years (0 = no, 1 = yes), and whether their breast milk contained above average levels (> 0.009 ppm) of the pesticide Dieldrin. Note that by law new houses are treated for termites in Australia.

The variables in the Pestmilk.sas & Pestmilk.JMP data files found on my website are:
• New Suburb - new suburb indicator (New, Old)
• Age - age of mother (yrs)
• Treated - house treated for termites in the last 3 years (HT, No)
• Dieldrin - Dieldrin level (High, Normal)

a. Fit a logistic regression model to predict Dieldrin levels using only whether or not the house had been treated for termites in the last 3 years. Provide your logistic regression parameter estimates, η̂0 and η̂1. (1 pt)
b. Use the model from part a to estimate the probability that a woman has high Dieldrin levels if her house has been treated for termites in the last 3 years. (2 pts)
c. Find and interpret the odds ratio for having high Dieldrin levels associated with having one's house treated for termites in the last 3 years. (3 pts)
d. Fit a logistic regression model to predict Dieldrin levels using only whether or not the house is in a new suburb. Provide your logistic regression parameter estimates, η̂0 and η̂1. (1 pt)
e.
Use the model from part d to estimate the probability that a woman has high Dieldrin levels if her house is in a new suburb. (2 pts)
f. Find and interpret the odds ratio for having high Dieldrin levels associated with living in a new suburb. (3 pts)
g. Fit a logistic regression model to predict Dieldrin levels using only the mother's age. Provide your logistic regression parameter estimates, η̂0 and η̂1. (1 pt)
h. Find the odds ratio for having high Dieldrin levels associated with a one-year increase in age. (2 pts)
i. Find and interpret a 95% confidence interval for the odds ratio in part h. (2 pts)
j. Find the odds ratio for having high Dieldrin levels associated with a ten-year increase in age. (2 pts)
k. Find and interpret a 95% confidence interval for the odds ratio in part j. (2 pts)
l. Obtain a plot showing the ROC curve for the model that uses only age to predict Dieldrin levels. (2 pts)
m. Interpret the area under the ROC curve in the context of the problem. (2 pts)

2 – Prostate Cancer Study

It is believed that men's prostate glands grow with age. Therefore levels of prostate-specific antigen (PSA), a substance made in the prostate, increase slightly as men become older. Oncologists have established that prostate cancer cells are more permeable than normal prostate cells, causing the PSA level to rise in most cases of prostate cancer. When prostate cancer is not detected early enough, the tumor can develop and penetrate the prostatic capsule.

Dr. Donn Young, at The Ohio State University Comprehensive Cancer Center, collected data, including demographic variables and various test results, on a cohort of men with prostate cancer. The interest is in how to use various test results among patients, and in particular, whether variables measured at the baseline exam can be used to predict whether or not the tumor has penetrated the prostatic capsule.
You will use the binary logistic regression tool to help identify the indicators of tumor penetration of the prostatic capsule in cancer patients.

Data Files: Prostate Logistic.TXT, Prostate.JMP

Drexam (1 = nodule, 0 = no nodule)

Additional predictors added:
10   log10PSA   Log base 10 of PSA Level             log10(mg/ml)
11   CapPen     Alpha-numeric recoding of CAPSULE    NP = no penetration, CP = penetration

a) Use a scatterplot to examine the relationship between PSA level and age. Use a different plotting symbol for cases with capsule penetration and those without. Does there appear to be an association between PSA and age? In particular, do these data suggest that PSA increases with age? (2 pts.)
b) Now plot the log base 10 of the PSA levels vs. age. Comment on the advantage of converting to the log scale for PSA. Does your answer to part (a) change? (1 pt.)
c) Is there an association between digital rectal exam (Drexam) and capsule penetration? Justify your answer using the appropriate statistical test and reporting the p-value. (3 pts.)
d) What is the OR and associated CI for capsule penetration associated with the detection of a nodule via digital rectal exam (Drexam) at baseline? (3 pts.)
e) Fit the simple logistic model for capsule penetration using PSA level as the predictor. Is PSA level a significant predictor of capsule penetration status? Explain. (2 pts.)
f) What is the multiplicative effect of a 20 mg/ml increase in the PSA level on the odds of capsule penetration? Give a 95% CI for this effect as well. Discuss these results. (3 pts.)
g) Write out the logistic model for capsule penetration using PSA, age, and the results of the digital rectal exam (Drexam) as the predictors. (1 pt.)
h) Fit the model from part (g) and include the output. Test the utility of the logistic model. Discuss. (3 pts.)
i) After adjusting for PSA level and the results of the digital rectal exam at baseline, is the age of the subject of any value in predicting capsule penetration? Explain. (2 pts.)
j) What is a Gleason Score? Google it!
(1 pt.)
k) Fit the logistic model for capsule penetration using the following predictors: Age, Race, Dpros, Dcaps, PSA, Vol, and Gleason. Summarize the model for Dr. Young. (4 pts.)
l) Use backward elimination or forward selection to obtain a suitable reduced model. Summarize each predictor in the final model in terms of odds ratios. If you use JMP to fit the stepwise model, choose the Whole Effects option from the Rules drop-down menu as shown below. This will force JMP not to combine levels of the Dpros variable. Summarize these results as if you were writing them for Dr. Young. Be sure to discuss the ORs associated with each term in your final model. For Dpros you will probably want to focus on the ORs that are greater than 1, as they are easier to discuss. (10 pts.)
m) Examine the case diagnostics for your final reduced model. Identify the case with the highest Cook's distance (or some other suitable measure of influence) and discuss why you think this case stands out. In order to do this you will need to run the model in R. Here are some commands to help you out substantially in this process. (3 pts.)

Read in Prostate.txt from my website:
    Prostate = read.table(file.choose(), header=TRUE, sep=",")
    names(Prostate)
     [1] "ID"       "Capsule"  "Age"      "Race"     "Dpros"    "Drexam"   "Dcaps"    "PSA"
     [9] "Vol"      "Gleason"  "log10PSA" "CapPen"

Fit the full logistic model for part (k):
    fullmodel = glm(Capsule ~ Age + as.factor(Race) + as.factor(Dpros) + as.factor(Dcaps) + PSA + Vol + Gleason,
                    family="binomial", data=Prostate)

The use of as.factor() is necessary to make sure that the nominal variables Race, Dpros, and Dcaps (which are coded numerically) are NOT treated as numeric in the logistic model fit.

Perform backward elimination to obtain a reduced model:
    stepmodel = step(fullmodel)
    summary(stepmodel)

Note: this model will not be the same as the one obtained using stepwise selection in JMP. Removing Vol from it with the update() command given below will give the same model you obtained in JMP.

    Diagplot.log(stepmodel)    # identify points in the two plots that stand out.
Be sure to hit Esc when you are finished identifying points in each plot.

    stepmodel2 = update(stepmodel, .~. - Vol)    # removes Vol from the stepwise model

n) Identify the case which is most poorly fit using a suitable measure and discuss why you think this case is identified as being poorly fit. See the plots obtained in part (m) above. (3 pts.)

3 – ICU Study

Descriptive Abstract of the ICU data set: The ICU data set consists of a sample of 200 subjects who were part of a much larger study on survival of patients following admission to an adult intensive care unit (ICU). The major goal of this study was to develop a logistic regression model to predict the probability of survival to hospital discharge of these patients and to study the risk factors associated with ICU mortality.

Source: Data were collected at Baystate Medical Center in Springfield, Massachusetts.

Data Files: ICU.JMP, ICU.TXT

Table: Code Sheet for the ICU Data.

NAME  Description                                              Codes/Values/Units
--------------------------------------------------------------------------------
ID    Identification Code
STA   Vital Status                                             0 = Lived, 1 = Died
Age   Age                                                      years
Sex   Gender                                                   0 = Male, 1 = Female
Race  Ethnicity of patient                                     1 = White, 2 = Black, 3 = Other
SER   Service at ICU Admission                                 0 = Medical, 1 = Surgical
CAN   Cancer Part of Present Problem                           0 = No, 1 = Yes
CRN   History of Chronic Renal Failure                         0 = No, 1 = Yes
INF   Infection Probable at ICU Admission                      0 = No, 1 = Yes
CPR   CPR Prior to ICU Admission                               0 = No, 1 = Yes
SYS   Systolic Blood Pressure at ICU Admission                 mmHg
HRA   Heart Rate at ICU Admission                              Beats/min
PRE   Previous Admission to an ICU within 6 Months             0 = No, 1 = Yes
TYP   Type of Admission                                        0 = Elective, 1 = Emergency
FRA   Long Bone, Multiple, Neck, Single Area, or Hip Fracture  0 = No, 1 = Yes
PO2   PO2 from Initial Blood Gases                             0 = > 60, 1 = < 60
PH    PH from Initial Blood Gases                              0 => 7.25, 1 =< 7.25
PCO   PCO2 from Initial Blood Gases                            0 = < 45, 1 = > 45
BIC   Bicarbonate from Initial Blood Gases                     0 = > 18, 1 = < 18
CRE   Creatinine from Initial Blood Gases                      0 = < 2.0, 1 = > 2.0
LOC   Level of Consciousness at ICU Admission                  0 = No Coma or Stupor, 1 = Deep Stupor or Coma

Problems and Tasks:

a) Build a simple logistic model for outcome status (STA) using only age (Age) as the predictor. Give two interpretations of the effect of age in terms of odds ratios, using one year and an increment of your choosing. Also construct a plot of P(STA=1|Age). Discuss your results in a well-written paragraph. (6 pts.)
b) Build a simple logistic model for outcome status (STA) using only type of admission (TYP) as the covariate. Compare the estimated OR, SE(OR), and associated CI obtained from the logistic model to those obtained via the usual method for 2 X 2 tables. Do they agree? Should they? Interpret the OR associated with type of admission. (6 pts.)
c) The variable Race is coded at three levels. Prepare a table showing the coding of the two dummy variables necessary for including this variable in a logistic regression model. (2 pts.)
d) Write down the equation for the logistic regression model of STA on age, cancer status, CPR status, infection status, and race. How many parameters does this model contain? (2 pts.)
e) Fit the model from part (d). Using the estimates obtained, write down the equation for the fitted values, i.e. the estimated probabilities P(STA = 1 | x). (5 pts.)
   Estimate the probability of death for the following patient types:
   i) black patients 70 years old without cancer, who did not have CPR performed, and were infection free.
   ii) white patients 60 years old with cancer, who did not have CPR performed, and had an infection at the time of admission.
   iii) white patients 80 years old with cancer, who had CPR performed, and were infection free.
   iv) Asian patients 85 years old without cancer, who did not have CPR performed, and had an infection at the time of admission.
f) Using the model from part (d), plot P(STA=1|Age) for the following patient groups:
   i) black patients, who did not have cancer, had CPR, and did not have an infection.
   ii) white patients, who had cancer, did not have CPR, and had an infection.
   iii) Hispanic patients, who did not have cancer, had CPR, and did not have an infection.

   The function and R code below should be helpful:

   PrAge <- function(age,race2,race3,can1,cpr1,inf1) {
       # linear predictor (log-odds of death) from the fitted model
       L <- -3.51152 + .02712*age - .95703*race2 + .25975*race3 +
            .24451*can1 + 1.6465*cpr1 + .68067*inf1
       # convert log-odds to a probability
       exp(L)/(1+exp(L))
   }
   > x <- seq(min(age),max(age),.5)    # age is the vector of patient ages

   The commands below will plot P(STA=1|Age) for patients described in part (i)
   ===================================================================
   > plot(x,PrAge(x,1,0,0,1,0),ylab="P(Death|Age)",xlab="Age",
          ylim=c(0,1),type="l")
   > lines(x,PrAge(x,0,0,1,0,1),lty=2,col="blue")   # plot for (ii) patients
   > lines(x,PrAge(x,0,1,0,1,0),lty=3,col="red")    # plot for (iii) patients

g) Use similar plots to those constructed in part (f) to show that there is not a large race effect. To do this, choose arbitrary levels for the other factors and plot P(STA=1|Age,Race). (3 pts.)
h) Use similar plots to those constructed in parts (f) & (g) to show the CPR effect. To do this, choose arbitrary levels for the other factors and plot P(STA=1|Age,CPR). (2 pts.)
i) Test the significance of the terms in the model from part (d). Some of the terms are not significant and could be removed. Build a reduced model and justify that your reduced model does not have a significantly degraded fit via the General χ² Test. (4 pts.)
j) Create a table giving the ORs associated with each coefficient. For continuous predictors choose a reasonable increment (perhaps c = SD), and for categorical predictors with more than two levels, i.e. Race, use White as the reference group when reporting the odds ratios.
k) Consider the logistic regression of vital status (STA) on service at ICU admission (SER) and age. Fit the model that includes only service at admission, the model that includes both service at admission & age, and the model that includes both service at admission & age plus the interaction between service at admission and age.
Use the General χ² Test to decide which of these models you would choose. For the interaction model, construct a plot that shows P(STA=1|SER,Age) for both levels of service at admission. Discuss. (5 pts.)
l) Consider the variable level of consciousness at ICU admission (LOC) as a covariate and vital status (STA) as the outcome variable. Compare estimates of the odds ratios obtained from the cross-classification of STA by LOC and from the logistic regression of STA on LOC. Use LOC=0 as the reference group for both methods. How well did the logistic regression deal with the zero cell? What strategy would you recommend for using LOC in future models? (5 pts.)
m) Using vital status (STA) as the outcome variable and the remainder of the variables in the ICUpool data set as possible covariates, develop a logistic regression model using main effect terms only. Use model selection methods to arrive at a "best" model. Document thoroughly the rationale for each step in the process you follow. Display the results of your final model in a table. Also include a table of point and 95% CI estimates of all relevant odds ratios. Discuss your results in a well-written paragraph. (10 pts.)
n) For your final model from part (m), examine case diagnostics, residual plots, and model checking plots. Discuss any "unusual" cases and examine the potential effect these cases might have on the final model and its interpretation. Discuss the other plots in terms of what they tell you about the adequacy of your fitted model. (5 pts.)
o) Use model selection methods to arrive at a "best" model which allows for possible interactions between the covariates. It is likely that in the process of fitting potential models you will obtain warnings from R about the fitted values being close to 0 and 1. This happens when you are "overfitting", i.e. including too many terms in your model.
To avoid this problem, I would recommend starting from your final model from part (m) and then considering adding interaction terms to that model. This should help alleviate the overfitting problems. Discuss your model and use plots of P(STA = 1 | x) to aid in the interpretation of any of the interaction effects. (10 pts.)
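
The workflow in parts (k) and (o) of the ICU problem (add an interaction to a main-effects model, compare the nested fits with a deviance-difference chi-square test, then plot the estimated probabilities) might be sketched in R as follows. This is only an illustration under stated assumptions: the data frame icu is simulated here and is NOT the real ICU data, and the names icu, main_fit, int_fit, lrt, p_med, and p_surg are hypothetical. With the real data you would replace the simulated data frame with the one read from ICU.TXT.

```r
# Simulated stand-in for the ICU data (NOT the real data): Age, SER, STA only.
set.seed(405)
n <- 200
icu <- data.frame(
  Age = round(runif(n, 16, 92)),   # ages in years
  SER = rbinom(n, 1, 0.5)          # 0 = Medical, 1 = Surgical
)
# Assumed "true" log-odds used only to generate the fake outcomes.
eta <- -3 + 0.03 * icu$Age - 0.8 * icu$SER
icu$STA <- rbinom(n, 1, plogis(eta))   # 0 = Lived, 1 = Died

# Nested models: main effects only, then main effects plus the interaction.
main_fit <- glm(STA ~ SER + Age, family = binomial, data = icu)
int_fit  <- glm(STA ~ SER * Age, family = binomial, data = icu)

# Deviance-difference (likelihood-ratio) chi-square test of the interaction term.
lrt <- anova(main_fit, int_fit, test = "Chisq")
print(lrt)

# Estimated P(STA = 1 | SER, Age) from the interaction model over a grid of ages.
ages   <- seq(min(icu$Age), max(icu$Age), by = 0.5)
p_med  <- predict(int_fit, newdata = data.frame(Age = ages, SER = 0),
                  type = "response")
p_surg <- predict(int_fit, newdata = data.frame(Age = ages, SER = 1),
                  type = "response")

# One curve per level of service at admission.
plot(ages, p_med, type = "l", ylim = c(0, 1),
     xlab = "Age", ylab = "P(Death | SER, Age)")
lines(ages, p_surg, lty = 2, col = "blue")
legend("topleft", legend = c("Medical (SER = 0)", "Surgical (SER = 1)"),
       lty = 1:2, col = c("black", "blue"))
```

Using predict() with type = "response" avoids retyping the fitted coefficients by hand, so the curves stay in sync with whichever model is currently fitted. The same pattern extends to other interaction terms by changing the columns supplied in newdata.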