Applied Statistical Methods

HSRP 734 - Summer 2008
Homework 3 (60 points total)
Due: 6/12/2008
Solution key
General instructions: 1) you may discuss any and all portions of the assignment with
other members of the class. However, the homework you turn in must be your own. 2)
For problems that require a statistical package you must do your own programming and
provide output with your answers. 3) A final answer is not sufficient. Show all your
work with SAS output. If you give SAS output, clearly indicate your answers to the
questions.
Q1. Give a reasonably detailed explanation of what Maximum Likelihood is (3 points).
Maximum likelihood is a technique used to estimate the parameters of statistical
models such as logistic regression (and any other generalized linear model
(GLM)). The idea behind maximum likelihood is to choose the parameter estimates
that maximize the probability of observing the data that were in fact observed
(that is, they maximize the likelihood). This is achieved by the usual method
for maximizing a mathematical function: 1) write down a probability expression
for the likelihood, 2) take the derivative of the likelihood function (or the
log likelihood, as it is easier to work with), 3) set the derivative equal to 0
and solve for the maximum, as in calculus. For most models, including logistic
regression, these equations have no closed-form solution, so the maximum is
found using iterative methods. Maximum likelihood estimates have good
large-sample properties: they are asymptotically unbiased, asymptotically
efficient (smallest variance) and asymptotically Normally distributed.
Because of these desirable properties, ML has become a popular technique for
fitting these regression models.
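As a small illustration of the iterative idea, the sketch below maximizes the log likelihood of an intercept-only logistic model by Newton-Raphson (which coincides with Fisher scoring for this model). The function name and the data (30 events among 100 subjects) are hypothetical and are not taken from the mistudy dataset; this is only a minimal sketch of the technique, not what SAS does internally.

```python
import math

def fit_intercept_only_logit(events, n, tol=1e-10, max_iter=50):
    """Maximize the Bernoulli log likelihood over the log-odds b0.

    log L(b0) = events * b0 - n * log(1 + exp(b0))
    """
    b0 = 0.0  # starting value, corresponds to p = 0.5
    for _ in range(max_iter):
        p = 1.0 / (1.0 + math.exp(-b0))
        score = events - n * p        # first derivative of log L
        info = n * p * (1.0 - p)      # minus the second derivative of log L
        step = score / info           # Newton-Raphson update
        b0 += step
        if abs(step) < tol:
            break
    return b0

# Hypothetical data: 30 events among 100 subjects.
b0_hat = fit_intercept_only_logit(30, 100)
p_hat = 1.0 / (1.0 + math.exp(-b0_hat))
print(round(b0_hat, 4), round(p_hat, 4))
```

For this simple model the answer can be checked in closed form: the estimated probability equals the sample proportion, and the estimated intercept equals the log of the sample odds.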
*For Questions 2 & 3 use the SAS dataset: mistudy.sas7bdat
(dataset description=mistudy_notes.doc)
Q2. Fit a simple logistic regression model to these data using treatment group as a
predictor for myocardial infarction at 5 years (MI).
a. What is the form of the estimated logistic regression equation (2 points)?
We can use SAS Enterprise to fit a simple logistic regression model for MI with
treatment group by going to Analyze->Regression->Logistic.
Doing so gives the following relevant output:
Model Information
Data Set                     WORK.SORTTEMPTABLESORTED
Response Variable            mi
Number of Response Levels    2
Model                        binary logit
Optimization Technique       Fisher's scoring

Probability modeled is mi=1.

Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -0.4382           0.2868            2.3348       0.1265
itrt         1    -0.5268           0.4106            1.6464       0.1995
Thus, the form of the estimated logistic regression model is:
$$\ln\left(\frac{\Pr(MI=1)}{1-\Pr(MI=1)}\right) = -0.4382 - 0.5268 \times \text{itrt}$$
b. What is the estimated odds ratio and 95% CI for the OR for the treatment
group effect (3 points)? Interpret the odds ratio (2 points).
From the same output this OR is given by:
Odds Ratio Estimates
Effect    Point Estimate    95% Wald Confidence Limits
itrt               0.590         0.264         1.320
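Thus, the OR for treatment vs. control = 0.590, with a 95% Confidence Interval =
(0.264, 1.320). Therefore, the estimated odds of an MI were 41% lower for patients
in the treatment group compared to the control group. Note that the CI includes 1,
so this difference is not statistically significant at the 5% level.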
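The odds ratio and its Wald limits can be reproduced by hand from the itrt estimate and standard error in the SAS output. The following is a minimal sketch; the function name is hypothetical, and 1.96 is the usual Normal quantile for a 95% interval.

```python
import math

def odds_ratio_with_wald_ci(beta, se, z=1.96):
    """Return (OR, lower, upper) from a logit coefficient and its standard error."""
    return (math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se))

# Estimate and standard error for itrt from the simple logistic model.
or_hat, lower, upper = odds_ratio_with_wald_ci(-0.5268, 0.4106)
print(f"OR = {or_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

To three decimals the result should agree with the Odds Ratio Estimates table (0.590, 0.264, 1.320).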
c. Add gender to the logistic regression model. Is there possible evidence of
gender confounding on the treatment effects for MI (3 points)? How did
adding gender affect the predictive ability of the model (3 points)?
If we’re using OR’s to quantify effects in a logistic regression model, then we
need to compare the OR of itrt in models with and without ifemale in order to
determine if there is confounding due to gender.
After adding ifemale to the model, the odds ratios are given by:
Odds Ratio Estimates
Effect     Point Estimate    95% Wald Confidence Limits
itrt                0.664         0.290         1.523
ifemale             1.825         0.741         4.495
Thus, the OR for itrt went from 0.590 to 0.664, which is a 12.5% change. A change
of this size (larger than the roughly 10% often used as a rule of thumb) is
possible evidence of confounding of the treatment effect by gender, so we would
keep gender in the model.
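The percent change in the odds ratio is simple arithmetic on the two point estimates. A minimal sketch follows; the function name is hypothetical, and the 10% cut-off is an assumed rule of thumb rather than something stated in the assignment.

```python
def percent_change(crude, adjusted):
    """Percent change from the crude (unadjusted) to the adjusted estimate."""
    return 100.0 * (adjusted - crude) / crude

# OR for itrt without and with ifemale in the model.
change = percent_change(0.590, 0.664)
RULE_OF_THUMB = 10.0  # assumed threshold, not from the assignment
print(f"{change:.1f}% change, exceeds threshold: {abs(change) > RULE_OF_THUMB}")
```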
The Area under the ROC curve (c statistic) without ifemale in the model was
0.565 and with ifemale was 0.620. So although the predictive ability of the model
improved somewhat by adding gender this was not a dramatic improvement.
Q3. Fit a multiple logistic regression model for response with treatment arm, systolic
blood pressure (SBP) at baseline and gender. Use the “centered” version of SBP (which
is centered at its mean value) and include this centered variable in your model.
a. What is the form of the estimated logistic regression equation (4 points)?
Using SAS Enterprise to fit this model gives the following relevant output:
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -0.5959           0.3481            2.9308       0.0869
itrt         1    -0.5354           0.4388            1.4889       0.2224
ifemale      1     0.5220           0.4650            1.2600       0.2617
Csbp         1    -0.0249           0.0203            1.4997       0.2207
Thus, the form of the estimated logistic regression model is:
$$\ln\left(\frac{\Pr(MI=1)}{1-\Pr(MI=1)}\right) = -0.5959 - 0.5354 \times \text{itrt} + 0.5220 \times \text{ifemale} - 0.0249 \times (\text{SBP} - 138.0734)$$
b. What is the estimated odds ratio and 95% CI for the OR for the gender effect
(3 points)? Interpret the odds ratio (2 points).
The OR’s are given in the following output:
Odds Ratio Estimates
Effect     Point Estimate    95% Wald Confidence Limits
itrt                0.585         0.248         1.383
ifemale             1.685         0.677         4.193
Csbp                0.975         0.937         1.015
The adjusted OR for ifemale is 1.685, with a 95% CI = (0.677, 4.193). Thus, the
estimated odds of an MI are 68.5% higher for females compared to males, adjusting
for treatment group and systolic blood pressure, although the CI includes 1.
c. Give and interpret the odds ratio for baseline SBP (3 points).
The odds ratio for SBP centered at its mean is 0.975. Thus, for every unit
increase in SBP (1 mmHg) the odds of MI decrease by 2.5%, controlling for
treatment group and gender.
However, the 95% confidence interval for this adjusted OR, (0.937, 1.015), includes
1. So although the point estimate suggests an inverse association, this
association is not statistically significant (more about this later).
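Because the odds ratio is the exponentiated coefficient, the effect of a larger change in SBP follows by scaling the coefficient before exponentiating. The sketch below uses the Csbp estimate from the SAS output; the 10 mmHg increment is only an illustrative choice and was not asked for in the question.

```python
import math

B_CSBP = -0.0249  # Csbp estimate from the fitted model

def odds_ratio_for_increase(beta, delta):
    """Odds ratio for an increase of `delta` units in the predictor."""
    return math.exp(beta * delta)

or_1 = odds_ratio_for_increase(B_CSBP, 1)    # per 1 mmHg
or_10 = odds_ratio_for_increase(B_CSBP, 10)  # per 10 mmHg (illustrative)
print(f"per 1 mmHg: {or_1:.3f} ({100 * (or_1 - 1):.1f}% change in odds)")
print(f"per 10 mmHg: {or_10:.3f} ({100 * (or_10 - 1):.1f}% change in odds)")
```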
d. Conduct a Likelihood ratio test and a Wald test to see if there are significant
treatment arm effects (8 points).
The Wald test for treatment arm effects is given directly off the SAS output:
Analysis of Maximum Likelihood Estimates
Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept    1    -0.5959           0.3481            2.9308       0.0869
itrt         1    -0.5354           0.4388            1.4889       0.2224
ifemale      1     0.5220           0.4650            1.2600       0.2617
Csbp         1    -0.0249           0.0203            1.4997       0.2207
Here, the Wald Chi-square = 1.4889 with df = 1 and p-value = 0.2224. Thus
using the Wald test we would fail to reject the null hypothesis of no treatment
group effect at the 5% significance level. We conclude there is not sufficient
evidence of a treatment effect on the odds of having an MI.
To conduct a Likelihood ratio test we must compare the -2lnL value from the
current (full) model with treatment group:
Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC                 140.293                     141.390
SC                  142.985                     152.155
-2 Log L            138.293                     133.390
To a reduced model without treatment group (but with ifemale and Csbp):
Model Fit Statistics
Criterion    Intercept Only    Intercept and Covariates
AIC                 140.293                     140.889
SC                  142.985                     148.963
-2 Log L            138.293                     134.889
Thus the Likelihood ratio test statistic = 134.889 - 133.390 = 1.499, with df = 1.
Since this does not exceed the chi-square critical value of 3.84, we fail to
reject the null hypothesis that the reduced model fits as well as the full model.
Thus there is not evidence of a treatment group effect on the odds of having an MI.
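The likelihood ratio statistic and its p-value can be checked directly from the two -2 Log L values. The sketch below is only valid for a 1 df test, where the chi-square tail probability reduces to a Normal tail that the standard library can evaluate with `math.erfc`; the function name is hypothetical.

```python
import math

def likelihood_ratio_test_1df(neg2logl_reduced, neg2logl_full):
    """LR statistic and p-value for dropping a single parameter (1 df only)."""
    stat = neg2logl_reduced - neg2logl_full
    # For 1 df: P(chi-square > stat) = P(|Z| > sqrt(stat)) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value

# -2 Log L without itrt (reduced) and with itrt (full).
stat, p_value = likelihood_ratio_test_1df(134.889, 133.390)
print(f"LR statistic = {stat:.3f}, p-value = {p_value:.3f}")
```

The statistic works out to 1.499, and the p-value is close to the Wald p-value of 0.2224, as expected when both tests address the same hypothesis.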
e. How did adding SBP affect the predictive ability of the model (3 points)?
Give a plot of the corresponding ROC curve which helps summarize this
graphically (1 point).
The Area under the ROC curve without SBP was 0.620 and with SBP was 0.637.
So again the model’s predictive ability was just slightly improved.
SAS Enterprise gives the following ROC curve
plot:
f. Does the treatment effect depend on gender (in a model with itrt, ifemale,
Csbp)? Conduct an investigation using the data in order to address this
question and justify your answer (6 points).
In order to answer this question, we need to fit a model with itrt, ifemale, Csbp
and an interaction effect between itrt and ifemale (i.e., include INTitrtifemale,
which is INTitrtifemale = itrt * ifemale). If the treatment effect depends upon
gender this means there is an interaction between itrt and ifemale.
Doing this gives the following relevant SAS Enterprise output:
Analysis of Maximum Likelihood Estimates
Parameter        DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
Intercept         1    -0.3333           0.3647            0.8353       0.3607
itrt              1    -1.0270           0.5148            3.9799       0.0460
ifemale           1    -0.1819           0.5996            0.0920       0.7616
Csbp              1    -0.0203           0.0206            0.9677       0.3252
INTitrtifemale    1     1.7870           0.9572            3.4852       0.0619
Here, the Wald test for interaction gives a p-value = 0.0619. Thus, there is not
evidence of interaction at the 5% significance level, although the p-value was
close. Thus, we conclude there is not significant evidence that the treatment
effect depends upon gender.
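To see what the interaction term implies, the treatment odds ratio can be computed separately for males and females from the estimates in the output. This is a sketch of point estimates only: confidence limits for the female odds ratio would need the covariance between the itrt and INTitrtifemale estimates, which is not shown in the output.

```python
import math

B_ITRT = -1.0270         # itrt estimate (treatment effect when ifemale = 0)
B_INTERACTION = 1.7870   # INTitrtifemale estimate

# Treatment vs. control odds ratio within each gender, holding Csbp fixed.
or_trt_males = math.exp(B_ITRT)
or_trt_females = math.exp(B_ITRT + B_INTERACTION)
print(f"males: {or_trt_males:.3f}, females: {or_trt_females:.3f}")
```

The point estimates differ in direction between the two groups, but since the interaction p-value is 0.0619 this difference is not significant at the 5% level.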
Q4. Answer the following questions by hand (not using SAS). The fitted multiple
logistic regression model for baseline SBP and gender for a subset of the study
participants looks like:
$$\ln\left(\frac{\Pr(mi)}{1-\Pr(mi)}\right) = -0.87 + 0.68 \times \text{ifemale} - 0.03 \times (\text{sbp} - 137.2)$$
and $\mathrm{SE}(\hat{\beta}_{\text{ifemale}}) = 0.46$.
a. Estimate the odds ratio and 95% CI for gender (4 points). Is there evidence of a
gender effect? (2 points).
The odds ratio can be estimated by exp(0.68)=1.97 . Since SE=0.46 then a 95% CI is
given by:
$$\begin{aligned}
95\%\ \text{CI} &= \left(\exp(0.68 - 1.96 \times 0.46),\ \exp(0.68 + 1.96 \times 0.46)\right) \\
&= \left(\exp(-0.2216),\ \exp(1.5816)\right) \\
&= (0.80,\ 4.86)
\end{aligned}$$

The 95% CI is pretty wide, indicating there was lower precision in estimating gender
effects. Since the 95% CI for the OR includes 1, we can deduce that there is not a
significant gender effect at the 5% level.
However we could formally carry out a Wald test by calculating Z = 0.68/0.46 =
1.478. Since this does not exceed the critical value of 1.96 at the 5% significance
level, we fail to reject the null hypothesis of no gender effect.
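The hand calculations for the Wald test can be verified with a few lines of code. This is a minimal sketch with a hypothetical function name; the two-sided p-value uses the standard Normal tail via `math.erfc`.

```python
import math

def wald_z_test(beta, se):
    """Wald Z statistic and two-sided p-value for H0: beta = 0."""
    z = beta / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_value

beta, se = 0.68, 0.46  # ifemale estimate and standard error from Q4
z, p_value = wald_z_test(beta, se)
ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
print(f"Z = {z:.3f}, p = {p_value:.3f}, 95% CI for OR = ({ci[0]:.2f}, {ci[1]:.2f})")
```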
b. Estimate the probability that a Female with baseline SBP=150 has a MI by 5
years (4 points).
We can estimate this probability by simply plugging in the covariate information into
the probability form of the estimated logistic regression model:
$$\begin{aligned}
\Pr(MI=1) &= \frac{\exp\left(-0.87 + 0.68 \times 1 - 0.03 \times (150 - 137.2)\right)}{1 + \exp\left(-0.87 + 0.68 \times 1 - 0.03 \times (150 - 137.2)\right)} \\
&= \frac{\exp(-0.574)}{1 + \exp(-0.574)} \\
&= 0.3603
\end{aligned}$$
Thus a female with SBP=150 at baseline has about a 36% chance of having a MI.
c. Estimate the probability that a Male at the average SBP of the sample (137.2)
will have a MI by 5 years (4 points).
Using the same formula we can calculate this probability:
$$\begin{aligned}
\Pr(MI=1) &= \frac{\exp\left(-0.87 + 0.68 \times 0 - 0.03 \times (137.2 - 137.2)\right)}{1 + \exp\left(-0.87 + 0.68 \times 0 - 0.03 \times (137.2 - 137.2)\right)} \\
&= \frac{\exp(-0.87)}{1 + \exp(-0.87)} \\
&= 0.2953
\end{aligned}$$
Thus a male with average SBP has about a 29.5% chance of having a MI.
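Both predicted probabilities come from the same inverse-logit calculation, which can be sketched as a short function. The coefficients are those of the fitted model given in Q4; the function name is hypothetical.

```python
import math

def predicted_probability(ifemale, sbp):
    """Predicted 5-year MI probability from the fitted model in Q4."""
    logit = -0.87 + 0.68 * ifemale - 0.03 * (sbp - 137.2)
    return 1.0 / (1.0 + math.exp(-logit))  # same as exp(logit) / (1 + exp(logit))

p_female_150 = predicted_probability(ifemale=1, sbp=150)
p_male_mean = predicted_probability(ifemale=0, sbp=137.2)
print(f"female, SBP 150: {p_female_150:.4f}")
print(f"male, SBP 137.2: {p_male_mean:.4f}")
```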
B. Explain what quasi-complete separation is and how you would remediate it if you had
it in a multiple logistic regression model with 3 categorical predictor variables. Include
the detailed steps you would take (Bonus up to 2 points).
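Quasi-complete separation occurs when some linear combination of the predictors
perfectly predicts the dichotomous outcome for a subset of the observations (for
example, one category of a predictor contains only events or only non-events),
while the outcome still varies among the remaining observations. When this happens
the maximum likelihood estimate for the affected parameter does not exist, the
iterative estimation fails to converge properly, and SE's and CI's become extremely
inflated. SAS prints a warning in the SAS Log file when this occurs, and it can
also be diagnosed by observing extreme, nonsensical SE's and very wide CI's.
Detailed steps: 1) cross-tabulate each categorical predictor against the outcome
to find the category with a zero cell, 2) collapse the problematic category with
a neighboring category to reduce the number of groups, 3) refit the model and
check that the warning and the inflated SE's are gone. If convergence problems
remain, other options include dropping the variable or excluding the cases where
perfect outcome prediction is observed.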
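The first diagnostic step, looking for categories in which only one outcome value occurs, can be sketched as follows. The function name and the small dataset are hypothetical and serve only to illustrate the check; in SAS the same information would come from a cross-tabulation of each predictor by the outcome.

```python
from collections import defaultdict

def find_separated_levels(levels, outcomes):
    """Return the categories in which only events or only non-events occur."""
    counts = defaultdict(lambda: [0, 0])  # level -> [non-events, events]
    for level, y in zip(levels, outcomes):
        counts[level][y] += 1
    return sorted(level for level, (n0, n1) in counts.items()
                  if n0 == 0 or n1 == 0)

# Hypothetical data: every subject in the "high" category has the event.
levels = ["low", "low", "low", "mid", "mid", "mid", "high", "high", "high"]
outcomes = [0, 1, 0, 1, 0, 1, 1, 1, 1]
print(find_separated_levels(levels, outcomes))  # ['high']
```

Any category flagged this way is a candidate for collapsing with a neighboring category before refitting the model.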