Multilevel Modeling-Logistic
Raul Cruz-Cano, HLTH653 Spring 2013

Schedule

3/18/2013 = Spring Break
3/25/2013 = Longitudinal Analysis
4/1/2013 = Midterm (Exercises 1-5, not Longitudinal)

Introduction

Just as with linear regression, logistic regression allows you to look at the effect of multiple predictors on an outcome.

Consider the following example: 15- and 16-year-old adolescents were asked if they have ever had sexual intercourse. The outcome of interest is intercourse. The predictors are race (white and black) and gender (male and female).

Example from Agresti, A. Categorical Data Analysis, 2nd ed. 2002.

Here is a table of the data:

                      Intercourse
    Race    Gender    Yes     No
    White   Male       43    134
            Female     26    149
    Black   Male       29     23
            Female     22     36

Data Set Intercourse

    DATA intercourse;
      INPUT white male intercourse count;
      DATALINES;
    1 1 1 43
    1 1 0 134
    1 0 1 26
    1 0 0 149
    0 1 1 29
    0 1 0 23
    0 0 1 22
    0 0 0 36
    ;
    RUN;

SAS:

    PROC LOGISTIC DATA = intercourse descending;
      weight count;
      MODEL intercourse = white male / rsquare lackfit;
    RUN;

“descending” models the probability that intercourse = 1 (yes) rather than = 0 (no).

“rsquare” requests a generalized R2 value from SAS; it is read in the same spirit as the R2 from linear regression.

“lackfit” requests the Hosmer and Lemeshow Goodness-of-Fit Test. This tells you whether the model you have created is a good fit for the data.

[SAS output: R2]

Interpreting the R2 value

The R2 value is 0.9907. Read as in linear regression, this says that 99.07% of the variability in our outcome (intercourse) is explained by including gender and race in our model. Treat this reading with caution: the generalized R2 is computed from the likelihood ratio and the number of observations, and because the counts enter through WEIGHT rather than FREQ, SAS most likely treats the data as 8 observations, which pushes the statistic toward 1.

PROC LOGISTIC Output

The odds of having had intercourse are 1.911 times as high for males as for females, adjusting for race.
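As a rough check on the reported odds ratio of 1.911, the male-versus-female odds ratio can be computed by hand within each race group from the table of counts above. This is a back-of-the-envelope sketch rather than SAS output, and the symbols OR_White and OR_Black are introduced here only for illustration.

```latex
\[
\mathrm{OR}_{\text{White}} = \frac{43 \times 149}{134 \times 26} = \frac{6407}{3484} \approx 1.84,
\qquad
\mathrm{OR}_{\text{Black}} = \frac{29 \times 36}{23 \times 22} = \frac{1044}{506} \approx 2.06
\]
```

The model-based estimate of 1.911 falls between the two group-specific values, which is what one would expect from a main-effects model that assumes a single gender odds ratio across race groups.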
Hosmer and Lemeshow GOF Test

The Hosmer and Lemeshow Goodness-of-Fit Test tests the hypotheses:

Ho: the model is a good fit, vs.
Ha: the model is NOT a good fit

With this test, we want to FAIL to reject the null hypothesis, because that means our model is a good fit (this is different from most of the hypothesis testing you have seen).

Look for a p-value > 0.10 in the H-L GOF test. This indicates the model is a good fit.

In this case, the p-value = 0.2419, so we do NOT reject the null hypothesis, and we conclude the model is a good fit.

Model Selection in SAS

When you have multiple candidate predictors and interactions, SAS can systematically select the significant ones using forward selection, backward selection, or stepwise selection.

In forward selection, SAS starts with no predictors in the model. It then selects the predictor with the smallest p-value and adds it to the model. Next it selects, from the remaining variables, the predictor with the smallest p-value and adds it to the model. It continues doing this until no remaining predictor has a p-value less than 0.05.

In backward selection, SAS starts with all of the predictors in the model and eliminates the non-significant predictors one at a time, refitting the model after each elimination. It stops once all the predictors remaining in the model are statistically significant.

Forward Selection in SAS

We will let SAS select a model for us out of the three predictors: white, male, white*male.
Type the following code into SAS:

    PROC LOGISTIC DATA = intercourse descending;
      weight count;
      MODEL intercourse = white male white*male / selection = forward lackfit;
    RUN;

Output from Forward Selection:

[SAS output: “white” is added to the model]
[SAS output: “male” is added to the model]
[SAS output: no more predictors are found to be statistically significant]
[SAS output: the final model]
[SAS output: Hosmer and Lemeshow GOF Test; the model is a good fit]

Multilevel Modeling (refresher)

Multilevel modeling takes into account the hierarchical structure of the data (e.g. decedents clustered within occupations, as in our data).

Such a data structure is subject to intra-class correlation, whereby individuals within the same group are more alike than individuals across groups.

An analysis that ignores this intra-class correlation may underestimate the standard error of the regression coefficient of the aggregate risk factor, leading to overestimation of the significance of the risk factor.

To illustrate the above point, we conducted our analysis using two approaches.

1st Approach

Fit a multiple logistic regression model on the combined data with PROC LOGISTIC.

The dependent variable is death from injury (yes/no); the risk factor of interest is exposure to hazardous equipment at work (high/low); the confounders included are gender, race (white/black/other), age (continuous, centered) and a quadratic term for age.

This model ignores the hierarchical structure of the data, and treats aggregate exposure as if it were measured at the individual level.
The model is expressed by the following equation:

\[
\operatorname{logit}(p_{ij}) = \log\!\left(\frac{p_{ij}}{1 - p_{ij}}\right) = \alpha + \beta\,\mathrm{Exposure}_{i} + \gamma\,\mathrm{Gender}_{ij} + \delta\,\mathrm{Race}_{ij} + \theta_{1}\,\mathrm{Age}_{ij} + \theta_{2}\,\mathrm{Age}_{ij}^{2}
\]

1st Approach

Here p_ij is the expected probability of death from injury for the jth individual of the ith occupation, conditional on the predictor variables, and α, β, γ, δ, θ1 and θ2 are the regression coefficients.

    proc logistic data=noms.combined descending;
      class exposure gender race;
      model injury = exposure gender race age age*age;
    run;

Multilevel Example

Allison, 2006

The sample consists of 1151 girls from the National Longitudinal Survey of Youth who were interviewed annually for nine years, beginning in 1979.

This example uses data from the first five years.

The response variable POV is measured in each of the years and has a value of 1 if the girl’s household was in poverty (as defined by U.S. federal standards) in that year, otherwise 0.

The predictor variables are:

AGE: Age in years at the first interview
BLACK: 1 if respondent is black, otherwise 0
MOTHER: 1 if respondent currently has at least one child, otherwise 0
SPOUSE: 1 if respondent is currently living with a spouse, otherwise 0
INSCHOOL: 1 if respondent is currently enrolled in school, otherwise 0
HOURS: Hours worked during the week of the survey

Multilevel Example

The data set has 5755 observations, five for each of the 1151 girls.

The CLASS statement declares YEAR to be a categorical variable, with the highest year (year 5) being the reference category.

The STRATA statement says that each girl is a separate stratum, which has the consequence of grouping together the five observations for each girl in the process of constructing the likelihood function.
    PROC LOGISTIC DATA=teenyrs5 DESC;
      CLASS year;
      MODEL pov = year mother spouse inschool hours;
      STRATA id;
    RUN;

In PROC LOGISTIC there is no CLUSTER statement, just CLASS and STRATA.

Multilevel Example

In the “Analysis of Maximum Likelihood Estimates” panel, we see that motherhood and school enrollment increase the risk of poverty, while living with a husband and working more hours reduce the risk.

The last panel gives the odds ratios. We see that motherhood increases the odds of poverty by an estimated 79 percent. Living with a husband cuts the odds approximately in half. Each additional hour of employment per week reduces the odds by about 2 percent.

Keep in mind that these estimates control for all stable characteristics of the girls, including such things as race, intelligence, place of birth and parents’ education.

Multilevel Example

The next model, for example, includes the interaction between MOTHER and BLACK.

    PROC LOGISTIC DATA=teenyrs5 DESC;
      CLASS year;
      MODEL pov = year mother spouse inschool hours mother*black;
      STRATA id;
    RUN;

Multilevel Example

The interaction is statistically significant at the .05 level.

For nonblack girls, the effect of motherhood is to increase the odds of poverty by a factor of exp(.9821) = 2.67.

For black girls, on the other hand, the effect of motherhood is to increase the odds of poverty by a factor of exp(.9821 - .5989) = 1.47.

Thus, motherhood has a larger effect on poverty status among nonblack girls than among black girls.

SAS Weighted Example

A random sample of 300 students is drawn from each of the four classes: freshman, sophomore, junior, and senior.
    /* Formats that label the coded values */
    proc format;
      value Design 1='A' 2='B' 3='C';
      value Rating 1='dislike very much' 2='dislike' 3='neutral'
                   4='like' 5='like very much';
      value Class 1='Freshman' 2='Sophomore' 3='Junior' 4='Senior';
    run;

    /* Population size of each class (stratum totals) */
    data Enrollment;
      format Class Class.;
      input Class _TOTAL_;
      datalines;
    1 3734
    2 3565
    3 3903
    4 4196
    ;
    run;

    /* One count per Class x Design x Rating cell */
    data WebSurvey;
      format Class Class. Design Design. Rating Rating.;
      do Class=1 to 4;
        do Design=1 to 3;
          do Rating=1 to 5;
            input Count @@;
            output;
          end;
        end;
      end;
      datalines;
    10 34 35 16 15   8 21 23 26 22   5 10 24 30 21
     1 14 25 23 37  11 14 20 34 21  16 19 30 23 12
    19 12 26 18 25  11 14 24 33 18  10 18 32 23 17
     8 15 35 30 12  15 22 34  9 20   2 34 30 18 16
    ;
    run;

    /* Sampling weight = class enrollment / 300 sampled students */
    data WebSurvey;
      set WebSurvey;
      if Class=1 then Weight=3734/300;
      if Class=2 then Weight=3565/300;
      if Class=3 then Weight=3903/300;
      if Class=4 then Weight=4196/300;
    run;

PROC LOGISTIC

    /* REF= names the reference category for a generalized logit model; */
    /* if that is the intent, add / link=glogit to the MODEL statement. */
    proc logistic data=WebSurvey;
      freq Count;
      class Design;
      model Rating (ref='neutral') = Design;
      weight Weight;
    run;

PROC SURVEYLOGISTIC

If you want “better” results, use PROC SURVEYLOGISTIC, which takes the stratified sampling design into account:

    proc surveylogistic data=WebSurvey total=Enrollment;
      freq Count;
      class Design;
      model Rating (ref='neutral') = Design;
      stratum Class;
      weight Weight;
    run;

For the ratings of Design B vs. Design C, compare the two procedures on:

1. The point estimate
2. The 95% confidence interval

More to come…

There are also mixed effects logistic models, which will be studied later.

References

Allison, Paul D. (2006). Fixed Effects Regression Methods in SAS. SUGI 31 Proceedings, paper 184-31.

Li, Jia, Toni Alterman, and James A. Deddens (2006). Analysis of Large Hierarchical Data with Multilevel Logistic Modeling Using PROC GLIMMIX in SAS. SUGI 31 Proceedings, paper 151-31.

Cassell, David L. (2006). Wait Wait, Don't Tell Me… You're Using the Wrong Proc! SUGI 31 Proceedings, paper 193-31.