Assignment #6 - Winona State University

STAT 405 Fall 2012 - Homework 6 ( points)
Due Thursday, October 11th
1 - Pesticides in the Breast Milk of Mothers in Western Australia
These data are from a study of breast-feeding mothers in Western Australia. Earlier research
discovered surprisingly high levels of pesticides in human breast milk. The researchers
hoped to show that the levels had decreased as a result of stricter government
regulations on the use of pesticides on food crops. They did find decreases for several types of
pesticides. Levels of the pesticide Dieldrin, however, had substantially increased.
For 45 breast milk donors, we have information on the mother's age in years, whether they lived
in a new suburb (0 = no, 1 = yes), whether their house had been treated for termites within the
past three years (0 = no, 1 = yes), and whether their breast milk contained above average levels
(> .009 ppm) of the pesticide Dieldrin. Note that by law new houses are treated for termites in
Australia.
The variables in the Pestmilk.sas & Pestmilk.JMP data files found on my
website are:
• New Suburb - new suburb indicator (New, Old)
• Age - Age of Mother (yrs)
• Treated - house treated for termites in the last 3 years (HT, No)
• Dieldrin - Dieldrin level (High, Normal)
a. Fit a logistic regression model to predict Dieldrin levels using only whether or not
the house had been treated for termites in the last 3 years. Provide your logistic
regression parameter estimates, η̂₀ and η̂₁. (1 pt)
b. Use the model from part a to estimate the probability that a woman has high Dieldrin
levels if her house has been treated for termites in the last 3 years. (2 pts)
c. Find and interpret the odds ratio for having High Dieldrin levels associated with
having one’s house treated for termites in the last 3 years. (3 pts)
d. Fit a logistic regression model to predict Dieldrin levels using only whether or not
the house is in a new suburb. Provide your logistic regression parameter estimates,
η̂₀ and η̂₁. (1 pt)
e. Use the model from part d to estimate the probability that a woman has high Dieldrin
levels if her house is in a new suburb. (2 pts)
f. Find and interpret the odds ratio for having High Dieldrin levels associated with
living in a new suburb. (3 pts)
g. Fit a logistic regression model to predict Dieldrin levels using only the mother’s age.
Provide your logistic regression parameter estimates, η̂₀ and η̂₁. (1 pt)
h. Find the odds ratio for having High Dieldrin levels associated with a one-year
increase in age. (2 pts)
i. Find and interpret a 95% confidence interval for the odds ratio in part h. (2 pts)
j. Find the odds ratio for having High Dieldrin levels associated with a ten-year
increase in age. (2 pts)
k. Find and interpret a 95% confidence interval for the odds ratio in part j. (2 pts)
l. Obtain a plot showing the ROC curve for the model that uses only age to predict
Dieldrin levels. (2 pts)
m. Interpret the area under the ROC curve in the context of the problem. (2 pts)
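The quantities requested in parts (g) through (m) can be sketched in R as follows. This is only an illustration on simulated data, not the Pestmilk data: the data frame `sim`, its columns `age` and `high`, the coefficients used to simulate, and the helper `or_ci()` are all assumptions made for the example. The interval shown is a Wald interval; other software may report a profile-likelihood interval instead, so the numbers need not match exactly.

```r
# Simulated stand-in for the homework data (NOT the Pestmilk values).
set.seed(405)
n    <- 200
age  <- round(runif(n, 20, 40))
high <- rbinom(n, 1, plogis(-4 + 0.12 * age))  # 1 = "High", 0 = "Normal"
sim  <- data.frame(age = age, high = high)

# Simple logistic regression: log-odds of High as a linear function of age.
fit <- glm(high ~ age, family = binomial, data = sim)
b   <- coef(fit)                               # b[1] = intercept, b[2] = slope

# Estimated probability of High at age 30, via the inverse logit.
p30 <- unname(plogis(b[1] + b[2] * 30))

# Hypothetical helper: odds ratio and Wald CI for an increase of `units`.
or_ci <- function(model, term, units = 1, level = 0.95) {
  est <- unname(coef(model)[term])
  se  <- unname(sqrt(diag(vcov(model)))[term])
  z   <- qnorm(1 - (1 - level) / 2)
  exp(units * c(OR = est, lower = est - z * se, upper = est + z * se))
}

or1  <- or_ci(fit, "age", units = 1)    # one-year increase
or10 <- or_ci(fit, "age", units = 10)   # ten-year increase

# AUC computed directly: the proportion of (High, Normal) pairs in which the
# High case has the larger fitted probability, counting ties as one half.
phat <- fitted(fit)
wins <- outer(phat[sim$high == 1], phat[sim$high == 0],
              function(a, b) (a > b) + 0.5 * (a == b))
auc  <- mean(wins)
```

Because the log-odds are linear in age, the ten-year odds ratio is the one-year odds ratio raised to the tenth power, and the same holds for the interval endpoints.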
2 – Prostate Cancer Study
It is believed that men’s prostate glands grow with age. Therefore the level of prostate-specific
antigen (PSA), a substance made in the prostate, increases slightly as men become older.
Oncologists have established that prostate cancer cells are more permeable than normal
prostate cells, causing the PSA level to rise in most cases of prostate cancer. When
prostate cancer is not detected early enough the tumor can develop and penetrate the
prostatic capsule.
Dr. Donn Young, at The Ohio State University Comprehensive Cancer Center,
collected data, including demographic variables, and various test results on a cohort of
men with prostate cancer. The interest is in how to use various test results among
patients, and in particular, whether variables measured at the baseline exam can be
used to predict whether or not the tumor has penetrated the prostatic capsule.
You will use the binary logistic regression tool to help identify the indicators of tumor
penetration of the prostatic capsule in cancer patients.
Data Files: Prostate Logistic.TXT, Prostate.JMP
Drexam      (1 = nodule, 0 = no nodule)

Additional predictors added:
10   log10PSA   Log base 10 of PSA Level            log10(mg/ml)
11   CapPen     Alpha-numeric recoding of CAPSULE   NP = no penetration, CP = penetration
a) Use a scatterplot to examine the relationship between PSA level and age. Use a
different plotting symbol for cases with capsule penetration and those without
penetration. Does there appear to be an association between PSA and age? In
particular, do these data suggest that PSA increases with age? (2 pts.)
b) Now plot the log base 10 of the PSA levels vs. age. Comment on the advantage
of converting to the log scale for PSA. Does your answer to part (a) change? (1 pt.)
c) Is there an association between digital rectal exam (Drexam) and capsule
penetration? Justify your answer using the appropriate statistical test and
reporting the p-value. (3 pts.)
d) What is the OR and associated CI for capsule penetration associated with the
detection of a nodule via digital rectal exam (Drexam) at baseline? (3 pts.)
e) Fit the simple logistic model for capsule penetration using PSA level as the
predictor. Is PSA level a significant predictor of capsule penetration status?
Explain. (2 pts.)
f) What is the multiplicative effect of a 20 mg/ml increase in the PSA level? Give a 95%
CI for this effect as well. Discuss these results. (3 pts.)
g) Write out the logistic model for capsule penetration using PSA, age, and the
results of the digital rectal exam (Drexam) as the predictors. (1 pt.)
h) Fit the model from part (g) and include the output. Test the utility of the logistic
model. Discuss. (3 pts.)
i) After adjusting for PSA level and the results of the digital rectal exam at baseline, is
the age of the subject of any value in predicting capsule penetration? Explain. (2 pts.)
j) What is a Gleason Score? Google it! (1 pt.)
k) Fit the logistic model for capsule penetration using the following predictors:
Age, Race, Dpros, Dcaps, PSA, Vol, and Gleason
Summarize the model for Dr. Young. (4 pts.)
l) Use backward elimination or forward selection to obtain a suitable reduced
model. Summarize each predictor in the final model in terms of odds ratios. If
you use JMP to fit the stepwise model, choose the Whole Effects option from the
Rules drop-down menu. This will force JMP not to combine levels of the Dpros
variable. Summarize these results as if you were writing them for
Dr. Young.
Be sure to discuss the ORs associated with each term in your final model. For
Dpros you will probably want to focus on the ORs that are greater than 1, as they
are easier to discuss. (10 pts.)
m) Examine the case diagnostics for your final reduced model. Identify the case
with the highest Cook’s distance (or some other suitable measure of influence)
and discuss why you think this case stands out. In order to do this you will need
to run the model in R. Here are some commands to help you out substantially in
this process. (3 pts.)
Read in Prostate.txt from my website:
Prostate = read.table(file.choose(),header=T,sep=",")
names(Prostate)
[1] "ID"       "Capsule"  "Age"      "Race"     "Dpros"    "Drexam"   "Dcaps"    "PSA"
[9] "Vol"      "Gleason"  "log10PSA" "CapPen"
Fit the full logistic model for part (k)
fullmodel = glm(Capsule~Age+as.factor(Race)+as.factor(Dpros)+as.factor(Dcaps)+PSA+Vol
+Gleason, family="binomial",data=Prostate)
The use of as.factor() is necessary to make sure that the nominal variables: Race, Dpros,
and Dcaps (which are coded numerically) are NOT treated as numeric in the logistic model fit.
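To see what as.factor() changes, the sketch below compares the design-matrix columns R builds for a numerically coded three-level variable with and without it. The vector `race` is made up for illustration and is not taken from the Prostate file.

```r
# Hypothetical three-level nominal variable coded 1, 2, 3.
race <- c(1, 2, 3, 1, 2, 3)

# Treated as numeric: a single column, so the model would assume a linear
# trend in the log-odds across the codes 1, 2, 3.
X_num <- model.matrix(~ race)

# Treated as a factor: two 0/1 dummy columns, with level 1 as the reference.
X_fac <- model.matrix(~ as.factor(race))

colnames(X_num)   # "(Intercept)" "race"
colnames(X_fac)   # "(Intercept)" "as.factor(race)2" "as.factor(race)3"
```

With the factor coding, each non-reference level gets its own coefficient, so its odds ratio is interpreted relative to the reference level rather than per unit of the numeric code.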
Perform backward elimination to obtain a reduced model
stepmodel = step(fullmodel)
summary(stepmodel)   # Note: this model will not be the same as the one obtained using stepwise selection in JMP.
Diagplot.log(stepmodel)   # Identify points in the two plots that stand out. Be sure to hit Esc when you are finished identifying points in each plot.
stepmodel2 = update(stepmodel,.~.-Vol)   # This will give the same model you obtained in JMP.
n) Identify the case which is most poorly fit using a suitable measure and discuss
why you think this case is identified as being poorly fit. See the plots obtained in part
(m) above. (3 pts.)
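If Diagplot.log() is not available in your R session, the influence and lack-of-fit measures asked about in parts (m) and (n) can also be obtained with base R functions. The sketch below uses cooks.distance() and deviance residuals on simulated data; the objects `d` and `fit` are assumptions for the example and stand in for the Prostate data and your reduced model.

```r
# Simulated data standing in for the Prostate file (values are made up).
set.seed(1)
n   <- 150
d   <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-0.5 + 1.2 * d$x))

fit <- glm(y ~ x, family = binomial, data = d)

cd   <- cooks.distance(fit)                 # influence of each case
dres <- residuals(fit, type = "deviance")   # lack of fit for each case

most_influential <- which.max(cd)           # case with the largest Cook's distance
worst_fit        <- which.max(abs(dres))    # case with the largest |deviance residual|

# Inspect the flagged rows alongside their fitted probabilities.
flagged <- c(most_influential, worst_fit)
cbind(d[flagged, ], phat = fitted(fit)[flagged])
```

Looking at the flagged rows next to their fitted probabilities usually makes it clear why a case stands out, for example an observed outcome of 1 where the model predicts a probability near 0.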
3 – ICU Study
Descriptive Abstract of the ICU data set:
The ICU data set consists of a sample of 200 subjects who were part of a much larger study on
survival of patients following admission to an adult intensive care unit (ICU). The major goal of
this study was to develop a logistic regression model to predict the probability of survival to
hospital discharge of these patients and to study the risk factors associated with ICU mortality.
Source: Data were collected at Baystate Medical Center in Springfield,
Massachusetts.
Data Files: ICU.JMP, ICU.TXT
Table: Code Sheet for the ICU Data.

NAME   Description                                               Codes/Values/Units
--------------------------------------------------------------------------------------------
ID     Identification Code
STA    Vital Status                                              0 = Lived, 1 = Died
Age    Age                                                       years
Sex    Gender                                                    0 = Male, 1 = Female
Race   Ethnicity of patient                                      1 = White, 2 = Black, 3 = Other
SER    Service at ICU Admission                                  0 = Medical, 1 = Surgical
CAN    Cancer Part of Present Problem                            0 = No, 1 = Yes
CRN    History of Chronic Renal Failure                          0 = No, 1 = Yes
INF    Infection Probable at ICU Admission                       0 = No, 1 = Yes
CPR    CPR Prior to ICU Admission                                0 = No, 1 = Yes
SYS    Systolic Blood Pressure at ICU Admission                  mmHg
HRA    Heart Rate at ICU Admission                               Beats/min
PRE    Previous Admission to an ICU within 6 Months              0 = No, 1 = Yes
TYP    Type of Admission                                         0 = Elective, 1 = Emergency
FRA    Long Bone, Multiple, Neck, Single Area, or Hip Fracture   0 = No, 1 = Yes
PO2    PO2 from Initial Blood Gases                              0 = > 60, 1 = < 60
PH     PH from Initial Blood Gases                               0 = > 7.25, 1 = < 7.25
PCO    PCO2 from Initial Blood Gases                             0 = < 45, 1 = > 45
BIC    Bicarbonate from Initial Blood Gases                      0 = > 18, 1 = < 18
CRE    Creatinine from Initial Blood Gases                       0 = < 2.0, 1 = > 2.0
LOC    Level of Consciousness at ICU Admission                   0 = No Coma or Stupor, 1 = Deep stupor or Coma

Problems and Tasks:
a) Build a simple logistic model for outcome status (STA) using only age (Age) as the predictor.
Give two interpretations of the effect of age in terms of odds ratios, using one year and an increment
of your choosing. Also construct a plot of P(STA=1|Age). Discuss your results in a well-written
paragraph. (6 pts.)
b) Build a simple logistic model for outcome status (STA) using only type of admission (TYP)
as the covariate. Compare the estimated OR, SE(OR), and associated CI obtained from the
logistic model to that obtained via the usual method for 2 X 2 tables. Do they agree? Should
they? Interpret the OR associated with type of admission. (6 pts.)
c) The variable Race is coded at three levels. Prepare a table showing the coding of the two
dummy variables necessary for including this variable in a logistic regression model. (2 pts.)
d) Write down the equation for the logistic regression model of STA on age, cancer status, CPR
status, infection status and race. How many parameters does this model contain? (2 pts.)
e) Fit the model from part (d). Using the estimates obtained, write down the equation for the
fitted values, i.e. the estimated probabilities P(STA = 1 | x). (5 pts.)
Estimate the probability of death for the following patient types:
i) black patients 70 years old without cancer, who did not have CPR performed, and were
infection free.
ii) white patients 60 years old with cancer, who did not have CPR performed, and had an
infection at the time of admission.
iii) white patients 80 years old with cancer, who did have CPR performed, and were infection free.
iv) Asian patients 85 years old without cancer, who did not have CPR performed, and had an
infection at the time of admission.
f) Using the model from part (d), plot the P(STA=1|Age) for the following patient groups:
i) black patients, who did not have cancer, had CPR, and did not have an infection.
ii) white patients, who had cancer, did not have CPR, and had an infection.
iii) Hispanic patients, who did not have cancer, had CPR, and did not have an infection.
The function and R code below should be helpful:
PrAge <- function(age,race2,race3,can1,cpr1,inf1) {
  # Linear predictor (log-odds of death)
  L <- -3.51152 + .02712*age - .95703*race2 + .25975*race3 + .24451*can1 +
       1.6465*cpr1 + .68067*inf1
  exp(L)/(1+exp(L))   # inverse logit converts the log-odds to a probability
}
> x <- seq(min(age),max(age),.5)
The commands below will plot P(STA=1|Age) for patients described in part (i)
===================================================================
> plot(x,PrAge(x,1,0,0,1,0),ylab="P(Death|Age)",xlab="Age",
ylim=c(0,1),type="l")
> lines(x,PrAge(x,0,0,1,0,1),lty=2,col="blue")   # plot for (ii) patients
> lines(x,PrAge(x,0,1,0,1,0),lty=3,col="red")    # plot for (iii) patients
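The same function can return a single estimated probability, as needed in part (e), when it is given one age instead of a grid. The sketch below restates PrAge() with the coefficients printed above so that it runs on its own. The argument coding (race2 = 1 for Black, race3 = 1 for Other) is read off the plotting calls and should be checked against your own fitted model.

```r
# PrAge() restated with the coefficients given above (verify against your fit).
PrAge <- function(age, race2, race3, can1, cpr1, inf1) {
  L <- -3.51152 + 0.02712 * age - 0.95703 * race2 + 0.25975 * race3 +
    0.24451 * can1 + 1.6465 * cpr1 + 0.68067 * inf1
  exp(L) / (1 + exp(L))
}

# Patient type (i): Black, 70 years old, no cancer, no CPR, no infection.
p_i <- PrAge(70, 1, 0, 0, 0, 0)

# exp(L)/(1 + exp(L)) is the inverse logit, which base R provides as plogis().
L_i <- -3.51152 + 0.02712 * 70 - 0.95703
all.equal(p_i, plogis(L_i))
```

The other patient types follow the same pattern by changing the age and the 0/1 indicators.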
g) Use similar plots to those constructed in part (f) to show that there is not a large race effect.
To do this choose arbitrary levels for the other factors and plot P(STA=1|Age,Race). (3 pts.)
h) Use similar plots to those constructed in parts (f) & (g) to show the CPR effect. To do this
choose arbitrary levels for the other factors and plot P(STA=1|Age,CPR). (2 pts.)
i) Test the significance of the terms in the model from part (d). Some of the terms are not
significant and could be removed. Build a reduced model and justify that your reduced model
does not have a significantly degraded fit via the General χ² Test. (4 pts.)
j) Create a table giving the OR’s associated with each coefficient. For continuous predictors
choose a reasonable increment (perhaps c = SD) and for categorical predictors with more than
two levels, i.e. Race, use White as the reference group when reporting the odds ratios.
k) Consider the logistic regression of vital status (STA), service at ICU admission (SER), and
age. Fit the models that include only service at admission, both service at admission & age, and
the model that includes both service at admission & age, plus the interaction between service at
admission and age. Use the General χ² Test to decide which of these models you would
choose. For the interaction model, construct a plot that shows the P(STA=1|SER,Age) for both
levels of service at admission. Discuss. (5 pts.)
l) Consider the variable level of consciousness at ICU admission (LOC) as a covariate and vital
status (STA) as the outcome variable. Compare estimates of the odds ratios obtained from the
cross-classification of STA by LOC and logistic regression of STA on LOC. Use LOC=0 as the
reference group for both methods. How well did the logistic regression deal with the zero cell?
What strategy would you recommend for using LOC in future models. (5 pts.)
m) Using vital status (STA) as the outcome variable and the remainder of the variables in the
ICUpool data set as possible covariates, develop a logistic regression model using main effect
terms only. Use model selection methods to arrive at a “best” model. Document thoroughly the
rationale for each step in the process you follow. Display the results of your final model in a table.
Also include a table of point and 95% CI estimates of all relevant odds ratios. Discuss your
results in a well-written paragraph. (10 pts.)
n) For your final model from part (m) examine case diagnostics, residual plots and model
checking plots. Discuss any “unusual” cases and examine the potential effect these cases might
have on the final model and its interpretation. Discuss the other plots in terms of what they tell
you about the adequacy of your fitted model. (5 pts.)
o) Use model selection methods to arrive at a “best” model which allows for possible
interactions between the covariates. It is likely that in the process of fitting potential models you
will obtain warnings from R about the fitted values being close to 0 and 1. This happens when
you are “overfitting”, i.e. including too many terms in your model. To avoid this problem, I
would recommend starting from your final model from part (m) and then consider adding
interaction terms to that model. This should help alleviate the overfitting problems. Discuss
your model and use plots of P(STA = 1 | x) to aid in the interpretation of any of the interaction
effects. (10 pts.)
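The General χ² Test used in parts (i), (k), and (o) compares nested logistic models through the drop in residual deviance. A minimal sketch with anova() on simulated data follows. The variables `sta`, `age`, and `ser` mimic the ICU names, but the values are generated for illustration and are not read from ICU.TXT.

```r
# Simulated data standing in for the ICU file (values are made up).
set.seed(2012)
n   <- 200
age <- round(runif(n, 16, 92))
ser <- rbinom(n, 1, 0.5)
sta <- rbinom(n, 1, plogis(-3 + 0.03 * age - 0.8 * ser))
icu_sim <- data.frame(sta, age, ser)

# Three nested models: service only, service + age, service * age.
m1 <- glm(sta ~ ser,       family = binomial, data = icu_sim)
m2 <- glm(sta ~ ser + age, family = binomial, data = icu_sim)
m3 <- glm(sta ~ ser * age, family = binomial, data = icu_sim)

# Sequential deviance (likelihood ratio) tests: m1 vs m2, then m2 vs m3.
tab <- anova(m1, m2, m3, test = "Chisq")

# The same statistic by hand for m2 vs m3.
G  <- deviance(m2) - deviance(m3)          # drop in residual deviance
df <- df.residual(m2) - df.residual(m3)    # number of added parameters
p  <- pchisq(G, df, lower.tail = FALSE)
```

A small p-value says the larger model fits significantly better. A large p-value supports keeping the smaller model, which is the argument needed when justifying a reduced model.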