Applied Statistical Methods
HSRP 734 - Summer 2008
Homework 3 (60 points total)
Due: 6/12/2008
Solution key

General instructions:
1) You may discuss any and all portions of the assignment with other members of the class. However, the homework you turn in must be your own.
2) For problems that require a statistical package you must do your own programming and provide output with your answers.
3) A final answer is not sufficient. Show all your work with SAS output. If you give SAS output, clearly indicate your answers to the questions.

Q1. Give a reasonably detailed explanation of what Maximum Likelihood is (3 points).

Maximum likelihood is a technique used to estimate statistical models such as logistic regression (and any other generalized linear model (GLM)). The idea behind maximum likelihood is to choose the parameter estimates that maximize the probability of observing the data that were in fact observed (that is, that maximize the likelihood). This is achieved by the usual method for maximizing any mathematical function: 1) write down a probability expression for the likelihood, and 2) take the derivative of the likelihood function (or of the log likelihood, which is easier to work with), set it equal to 0, and solve for the maximum, as in calculus. In practice, however, the maximum is usually found using iterative numerical methods. Maximum likelihood estimates have good large-sample properties: they are asymptotically unbiased, have the smallest variance among comparable estimators (they are asymptotically efficient), and are approximately Normally distributed. Because of these desirable properties, ML has become a popular technique for fitting these regression models.

*For Questions 2 & 3 use the SAS dataset: mistudy.sas7bdat (dataset description = mistudy_notes.doc)

Q2. Fit a simple logistic regression model to these data using treatment group as a predictor for myocardial infarction at 5 years (MI).

a. What is the form of the estimated logistic regression equation (2 points)?
We can use SAS Enterprise to fit a simple logistic regression model for MI with treatment group by going to Analyze->Regression->Logistic. Doing so gives the following relevant output:

Model Information
  Data Set                     WORK.SORTTEMPTABLESORTED
  Response Variable            mi
  Number of Response Levels    2
  Model                        binary logit
  Optimization Technique       Fisher's scoring

Probability modeled is mi=1.

Analysis of Maximum Likelihood Estimates
  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -0.4382           0.2868            2.3348       0.1265
  itrt         1    -0.5268           0.4106            1.6464       0.1995

Thus, the form of the estimated logistic regression model is:

$$\ln\left(\frac{\Pr(\mathrm{MI}=1)}{1-\Pr(\mathrm{MI}=1)}\right) = -0.4382 - 0.5268 \cdot \mathrm{itrt}$$

b. What is the estimated odds ratio and 95% CI for the OR for the treatment group effect (3 points)? Interpret the odds ratio (2 points).

From the same output this OR is given by:

Odds Ratio Estimates
  Effect   Point Estimate   95% Wald Confidence Limits
  itrt              0.590        0.264        1.320

Thus, the OR for treatment vs. control = 0.590, with a 95% Confidence Interval = (0.264, 1.320). Therefore, the estimated odds of an MI were 41% lower for patients in the treatment group compared to the control group. Because the 95% CI includes 1, this reduction is not statistically significant at the 5% level.

c. Add gender to the logistic regression model. Is there possible evidence of gender confounding on the treatment effects for MI (3 points)? How did adding gender affect the predictive ability of the model (3 points)?

If we're using ORs to quantify effects in a logistic regression model, then we need to compare the OR of itrt in models with and without ifemale in order to determine whether there is confounding due to gender. After adding ifemale to the model, the odds ratios are given by:

Odds Ratio Estimates
  Effect    Point Estimate   95% Wald Confidence Limits
  itrt               0.664        0.290        1.523
  ifemale            1.825        0.741        4.495

Thus, the OR for itrt went from 0.590 to 0.664, which is a 12.5% change (larger than the 10% change often used as a rule of thumb). Thus, we conclude there is confounding of the treatment effect by gender (and thus we would include gender in the model).
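The odds ratio and Wald interval in Q2b, and the percent change in the OR used to judge confounding in Q2c, can be reproduced by hand from the reported estimate and standard error. The following is a minimal Python sketch of that arithmetic, not part of the SAS workflow the homework requires; the function names are hypothetical and the inputs are copied from the output above.

```python
import math


def wald_or_ci(estimate, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit) for a logit coefficient.

    The interval is built on the log-odds scale and then exponentiated,
    which is how a Wald confidence interval for an OR is formed.
    """
    return (
        math.exp(estimate),
        math.exp(estimate - z * se),
        math.exp(estimate + z * se),
    )


def percent_change(or_unadjusted, or_adjusted):
    """Percent change in the OR after adding a potential confounder."""
    return 100.0 * (or_adjusted - or_unadjusted) / or_unadjusted


# Q2b: itrt estimate and standard error from the simple model.
odds_ratio, lower, upper = wald_or_ci(-0.5268, 0.4106)
print(f"OR = {odds_ratio:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")

# Q2c: OR for itrt without and with ifemale in the model.
change = percent_change(0.590, 0.664)
print(f"Change in OR after adding ifemale = {change:.1f}%")
```

Working from the rounded estimates reproduces the SAS odds ratio table to three decimals, which is a quick way to confirm that the right rows were read off the output.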
The area under the ROC curve (c statistic) without ifemale in the model was 0.565 and with ifemale was 0.620. So although the predictive ability of the model improved somewhat by adding gender, this was not a dramatic improvement.

Q3. Fit a multiple logistic regression model for response with treatment arm, systolic blood pressure (SBP) at baseline and gender. Use the "centered" version of SBP (which is centered at its mean value) and include this centered variable in your model.

a. What is the form of the estimated logistic regression equation (4 points)?

Using SAS Enterprise to fit this model gives the following relevant output:

Analysis of Maximum Likelihood Estimates
  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -0.5959           0.3481            2.9308       0.0869
  itrt         1    -0.5354           0.4388            1.4889       0.2224
  ifemale      1     0.5220           0.4650            1.2600       0.2617
  Csbp         1    -0.0249           0.0203            1.4997       0.2207

Thus, the form of the estimated logistic regression model is:

$$\ln\left(\frac{\Pr(\mathrm{MI}=1)}{1-\Pr(\mathrm{MI}=1)}\right) = -0.5959 - 0.5354 \cdot \mathrm{itrt} + 0.5220 \cdot \mathrm{ifemale} - 0.0249 \cdot (\mathrm{SBP} - 138.0734)$$

b. What is the estimated odds ratio and 95% CI for the OR for the gender effect (3 points)? Interpret the odds ratio (2 points).

The ORs are given in the following output:

Odds Ratio Estimates
  Effect    Point Estimate   95% Wald Confidence Limits
  itrt               0.585        0.248        1.383
  ifemale            1.685        0.677        4.193
  Csbp               0.975        0.937        1.015

The adjusted OR for ifemale is 1.685, with a 95% CI = (0.677, 4.193). Thus, the odds of an MI are 68.5% higher for females compared to males, adjusting for treatment group and systolic blood pressure.

c. Give and interpret the odds ratio for baseline SBP (3 points).

The odds ratio for SBP centered at its mean is 0.975. Thus, for every unit increase in SBP (1 mmHg) the odds of MI decrease by 2.5%, controlling for treatment group and gender. However, the 95% confidence interval for this adjusted OR includes 1 (and thus stretches both below and above 1).
So although the point estimate suggests an inverse association, this association is not statistically significant (more about this later).

d. Conduct a Likelihood ratio test and a Wald test to see if there are significant treatment arm effects (8 points).

The Wald test for treatment arm effects is given directly in the SAS output:

Analysis of Maximum Likelihood Estimates
  Parameter   DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept    1    -0.5959           0.3481            2.9308       0.0869
  itrt         1    -0.5354           0.4388            1.4889       0.2224
  ifemale      1     0.5220           0.4650            1.2600       0.2617
  Csbp         1    -0.0249           0.0203            1.4997       0.2207

Here, the Wald Chi-square = 1.4889 with df = 1 and p-value = 0.2224. Thus, using the Wald test we would fail to reject the null hypothesis of no treatment group effect at the 5% significance level. We conclude there is no significant evidence of a treatment effect on the odds of having an MI.

To conduct a Likelihood ratio test we must compare the -2lnL value from the current (full) model with treatment group:

Model Fit Statistics
  Criterion   Intercept Only   Intercept and Covariates
  AIC                140.293                    141.390
  SC                 142.985                    152.155
  -2 Log L           138.293                    133.390

to that of a reduced model without treatment group (but with ifemale and Csbp):

Model Fit Statistics
  Criterion   Intercept Only   Intercept and Covariates
  AIC                140.293                    140.889
  SC                 142.985                    148.963
  -2 Log L           138.293                    134.889

Thus the Likelihood ratio test statistic = 134.889 - 133.390 = 1.499, with df = 1 (the models differ by one parameter). Since this does not exceed the chi-square critical value of 3.84 (df = 1, 5% level), we fail to reject the null hypothesis that the reduced model fits as well as the full model. Thus there is no significant evidence of a treatment group effect on the odds of having an MI, which agrees with the Wald test (Wald Chi-square = 1.4889).

e. How did adding SBP affect the predictive ability of the model (3 points)? Give a plot of the corresponding ROC curve which helps summarize this graphically (1 point).

The area under the ROC curve without SBP was 0.620 and with SBP was 0.637. So again the model's predictive ability was just slightly improved. SAS Enterprise gives the following ROC curve plot:

[Figure: ROC curve for the model with itrt, ifemale, and Csbp]

f.
Does the treatment effect depend on gender (in a model with itrt, ifemale, Csbp)? Conduct an investigation using the data in order to address this question and justify your answer (6 points).

In order to answer this question, we need to fit a model with itrt, ifemale, Csbp and an interaction effect between itrt and ifemale (i.e., include INTitrtifemale, where INTitrtifemale = itrt * ifemale). If the treatment effect depends upon gender, this means there is an interaction between itrt and ifemale. Doing this gives the following relevant SAS Enterprise output:

Analysis of Maximum Likelihood Estimates
  Parameter        DF   Estimate   Standard Error   Wald Chi-Square   Pr > ChiSq
  Intercept         1    -0.3333           0.3647            0.8353       0.3607
  itrt              1    -1.0270           0.5148            3.9799       0.0460
  ifemale           1    -0.1819           0.5996            0.0920       0.7616
  Csbp              1    -0.0203           0.0206            0.9677       0.3252
  INTitrtifemale    1     1.7870           0.9572            3.4852       0.0619

Here, the Wald test for interaction gives a p-value = 0.0619. Thus, there is not evidence of interaction at the 5% significance level, although the p-value was close. Thus, we conclude there is not significant evidence that the treatment effect depends upon gender.

Q4. Answer the following questions by hand (not using SAS). The fitted multiple logistic regression model for baseline SBP and gender for a subset of the study participants looks like:

$$\ln\left(\frac{\Pr(\mathrm{mi})}{1-\Pr(\mathrm{mi})}\right) = -0.87 + 0.68 \cdot \mathrm{ifemale} - 0.03 \cdot (\mathrm{sbp} - 137.2)$$

and SE(ifemale) = 0.46.

a. Estimate the odds ratio and 95% CI for gender (4 points). Is there evidence of a gender effect? (2 points).

The odds ratio can be estimated by exp(0.68) = 1.97. Since SE = 0.46, a 95% CI is given by:

$$95\%\ \mathrm{CI} = \left(\exp(0.68 - 1.96 \cdot 0.46),\ \exp(0.68 + 1.96 \cdot 0.46)\right) = \left(\exp(-0.2216),\ \exp(1.5816)\right) = (0.80,\ 4.86)$$

The 95% CI is pretty wide, indicating there was lower precision in estimating gender effects. Since the 95% CI for the OR includes 1, we can deduce that there is not a significant gender effect at the 5% level. However, we could formally carry out a Wald test by calculating Z = 0.68/0.46 = 1.478.
Since this does not exceed the critical value of 1.96 at the 5% significance level, we fail to reject the null hypothesis of no gender effect.

b. Estimate the probability that a Female with baseline SBP=150 has an MI by 5 years (4 points).

We can estimate this probability by plugging the covariate information into the probability form of the estimated logistic regression model:

$$\widehat{\Pr}(\mathrm{MI}=1) = \frac{\exp\left[-0.87 + 0.68 \cdot 1 - 0.03 \cdot (150 - 137.2)\right]}{1 + \exp\left[-0.87 + 0.68 \cdot 1 - 0.03 \cdot (150 - 137.2)\right]} = \frac{\exp(-0.574)}{1 + \exp(-0.574)} = 0.3603$$

Thus a female with SBP=150 at baseline has about a 36% chance of having an MI.

c. Estimate the probability that a Male at the average SBP of the sample (137.2) will have an MI by 5 years (4 points).

Using the same formula we can calculate this probability:

$$\widehat{\Pr}(\mathrm{MI}=1) = \frac{\exp\left[-0.87 + 0.68 \cdot 0 - 0.03 \cdot (137.2 - 137.2)\right]}{1 + \exp\left[-0.87 + 0.68 \cdot 0 - 0.03 \cdot (137.2 - 137.2)\right]} = \frac{\exp(-0.87)}{1 + \exp(-0.87)} = 0.2953$$

Thus a male with average SBP has about a 29.5% chance of having an MI.

B. Explain what quasi-complete separation is and how you would remediate it if you had it in a multiple logistic regression model with 3 categorical predictor variables. Include the detailed steps you would take (Bonus up to 2 points).

Quasi-complete separation occurs when a predictor variable (or some linear combination of the predictors) is so good at predicting the dichotomous outcome that the outcome is perfectly predicted for a subset of the observations, though not for all of them. A typical case is a category of a predictor in which every subject has the same outcome. When this happens the maximum likelihood estimate for the affected parameter does not exist, so statistical estimation breaks down and SEs and CIs become extremely inflated. SAS prints a warning in the SAS Log file when this occurs, and the problem can also be diagnosed by observing extreme, nonsensical SEs and very wide CIs. Cross-tabulating each categorical predictor against the outcome shows which category has an empty cell. To remediate for a categorical predictor, one could collapse the problematic categories, reducing the number of groups. If convergence problems remain, other options include dropping the variable or excluding the cases where perfect outcome prediction is observed.
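The hand calculations in Q3d and Q4 follow a small number of formulas, so they are easy to check numerically. The sketch below is one way to do that in Python, under the assumption that the rounded coefficients printed in this key are used as inputs; the function names are hypothetical and nothing here replaces the SAS output the assignment asks for.

```python
import math


def lrt_one_df(neg2logl_reduced, neg2logl_full):
    """Likelihood ratio statistic and p-value for nested models differing by 1 df.

    For a chi-square variable with 1 df, P(X > x) = erfc(sqrt(x / 2)).
    """
    stat = neg2logl_reduced - neg2logl_full
    return stat, math.erfc(math.sqrt(stat / 2.0))


def inv_logit(eta):
    """Convert a log-odds value into a probability."""
    return math.exp(eta) / (1.0 + math.exp(eta))


def q4_log_odds(ifemale, sbp):
    """Linear predictor of the Q4 model: -0.87 + 0.68*ifemale - 0.03*(sbp - 137.2)."""
    return -0.87 + 0.68 * ifemale - 0.03 * (sbp - 137.2)


# Q3d: -2 Log L of the reduced model (no itrt) minus that of the full model.
lrt_stat, lrt_p = lrt_one_df(134.889, 133.390)
print(f"LRT statistic = {lrt_stat:.3f}, p = {lrt_p:.4f}")

# Q4a: odds ratio, 95% Wald CI, and Wald Z for ifemale (estimate 0.68, SE 0.46).
beta, se = 0.68, 0.46
or_female = math.exp(beta)
ci_low = math.exp(beta - 1.96 * se)
ci_high = math.exp(beta + 1.96 * se)
wald_z = beta / se
print(f"OR = {or_female:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f}), Z = {wald_z:.3f}")

# Q4b: female with baseline SBP = 150.
p_female_150 = inv_logit(q4_log_odds(ifemale=1, sbp=150))
# Q4c: male at the sample mean SBP of 137.2.
p_male_mean = inv_logit(q4_log_odds(ifemale=0, sbp=137.2))
print(f"P(MI | female, SBP=150) = {p_female_150:.4f}")
print(f"P(MI | male, SBP=137.2) = {p_male_mean:.4f}")
```

Because the likelihood ratio and Wald tests address the same hypothesis about itrt, their statistics and p-values should be close to one another, which gives a quick sanity check on the subtraction in Q3d.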