Multiple Logistic Regression

Qualitative Explanatory Variables

When using logistic regression to model a binary outcome we can have explanatory variables that are continuous, nominal (discrete), ordinal (discrete), or a combination of discrete and continuous variables. We can incorporate nominal variables into our model by creating dummy variables that represent the categories of the variable. For example, if I wanted to incorporate SES into a model using free and reduced lunch status, I might assign a 0 if one did not have free and reduced lunch and a 1 if one did.

When an explanatory variable has only 2 levels, a single dummy variable is needed. When an explanatory variable has more than two levels, say I levels, then I - 1 dummy variables are needed in the model. For example, suppose I have an SES variable with three levels: low, middle, and high. Then I would create two dummy variables such that:

Variable 1 = SES1 = 1 if SES is low, 0 otherwise
Variable 2 = SES2 = 1 if SES is middle, 0 otherwise

You could also reverse the coding so that SES1 represented high SES and SES2 represented middle SES. Our model in this case would be:

logit(π) = α + β1(SES1) + β2(SES2)

In SAS one can use the class statement in either proc logistic or proc genmod to create dummy variables. You can control the order in which nominal variables are dummy coded by using the order=data option. If your dependent variable is coded as 0/1 then you can ensure that you are modeling "success" = 1 by using the descending option. It is possible to do effect coding in SAS as well, but this will only change the parameter values and not the substantive results.

Example

Suppose we obtained the following data on SES and high school program type:

               High School Program Type
SES         Academic      Nonacademic
Low            44              95
Middle        147             152
High          117              45

I fit a logistic regression to these data and obtained the following results.
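Because the fitted model below turns out to be saturated, its estimates can be recovered by hand from the table: each group's fitted logit is just its observed log odds. A minimal pure-Python check (column labels taken from the table above):

```python
import math

# Observed counts (academic, nonacademic) by SES, from the table above
counts = {"Low": (44, 95), "Middle": (147, 152), "High": (117, 45)}

# In a saturated model the fitted logit for each SES group is simply
# the observed log odds of being in an academic program
log_odds = {ses: math.log(a / b) for ses, (a, b) in counts.items()}

# With High as the reference category, the intercept is the High log odds
# and each dummy coefficient is a difference from that reference
intercept = log_odds["High"]               # 0.9555, the SAS intercept
b_low = log_odds["Low"] - intercept        # -1.7252, the ses Low estimate
b_middle = log_odds["Middle"] - intercept  # -0.9890, the ses Middle estimate

# Exponentiating a coefficient recovers the sample odds ratio:
# (44 * 45) / (95 * 117) = exp(b_low)
print(round(math.exp(b_low), 3))  # 0.178
```

This is also why exponentiating a dummy-variable coefficient gives the odds ratio against the reference level: the coefficient is a difference of log odds.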
Class Level Information

Class   Levels   Values
ses     3        Low Middle High

Criteria For Assessing Goodness Of Fit

Criterion             DF      Value     Value/DF
Deviance               0     0.0000        .
Scaled Deviance        0     0.0000        .
Pearson Chi-Square     0     0.0000        .
Scaled Pearson X2      0     0.0000        .
Log Likelihood               -389.6949

Algorithm converged.

Analysis Of Parameter Estimates

                            Standard      Wald 95%           Chi-
Parameter    DF  Estimate      Error  Confidence Limits    Square  Pr > ChiSq
Intercept     1    0.9555     0.1754   0.6117   1.2993      29.67      <.0001
ses Low       1   -1.7252     0.2530  -2.2211  -1.2293      46.49      <.0001
ses Middle    1   -0.9890     0.2101  -1.4008  -0.5771      22.15      <.0001
ses High      0    0.0000     0.0000   0.0000   0.0000        .          .
Scale         0    1.0000     0.0000   1.0000   1.0000

NOTE: The scale parameter was held fixed.

Note that since we had only 2 df and we fit two parameters, we have fit the saturated model and we have no indices of model fit; this model fits the data perfectly. If we had fit the raw data, as opposed to entering the data in tabular form, we would have many df with which to estimate model fit. How do you think we can interpret these parameters? Try calculating the odds ratios from the frequency table and then interpreting the parameter estimates.

Multiple Logistic Regression

Typically we are not interested in fitting a model with only one explanatory variable. Rather, there are many variables that we believe may be related to our dependent variable. As with ordinary regression models for normal data, we can generalize the models we've been covering to incorporate more predictor variables, and these variables can be quantitative, qualitative, or a combination of the two.

Example

Let's start with a simple model with two qualitative explanatory variables. Suppose we have the following data. We want to determine whether or not following politics is a function of one's level of education and country of residence. (If this had been a 2 x 2 x k table then I would have conducted the Breslow-Day test for homogeneous association.)

Follows Politics Regularly, by country and level of education:
               USSR         UK          USA
Level of Educ. Yes   No   Yes   No   Yes   No
Primary         94   84   356  144   227  112
Secondary      318  120   256   76   371   71
College        473   72    22    2   180    8

Fitting a logistic regression (logit) model with no interaction term I obtained:

logit(π) = α + β1(x1) + β2(x2) + β3(z1) + β4(z2)
         = 2.6233 - 0.6673(USSR) - 0.0843(UK) - 1.7684(Primary) - 1.0692(Secondary)

where x1 and x2 are dummy variables for country and z1 and z2 are dummy variables for education. This model has only main effects, in an ANOVA sense, for country and education. For each combination of the explanatory variables our model is:

Country   Education   x1   x2   z1   z2   Model
USSR      Primary      1    0    1    0   α + β1 + β3
USSR      Secondary    1    0    0    1   α + β1 + β4
USSR      College      1    0    0    0   α + β1
UK        Primary      0    1    1    0   α + β2 + β3
UK        Secondary    0    1    0    1   α + β2 + β4
UK        College      0    1    0    0   α + β2
US        Primary      0    0    1    0   α + β3
US        Secondary    0    0    0    1   α + β4
US        College      0    0    0    0   α

So how can we interpret the parameters? Exponentiating each of the parameters gives us an estimate of a common odds ratio between the dependent and independent variables. Of course, odds ratios are only appropriate for two-by-two tables, so for the independent variables we are comparing the level of the variable included in the model to the level of the variable that has a parameter estimate of zero. If we wanted to compare two levels that are both included in the model we could exponentiate the difference between the two parameter estimates of interest.

exp(β1) estimates the conditional odds ratio of being politically active for the USSR versus the US, given educational level. In our case, exp(-0.6673) = 0.513. Therefore, regardless of education level, the odds of being politically active in the USSR were 0.513 times the odds of being politically active in the US.

exp(β2) estimates the conditional odds ratio of being politically active for the UK versus the US, given educational level. In our case, exp(-0.0843) = 0.92. Therefore, regardless of education level, the odds of being politically active in the UK were 0.92 times the odds of being politically active in the US. The confidence interval for β2 includes 0, and therefore there doesn't seem to be much difference between the US and the UK.
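The exponentiation arithmetic is easy to verify, including the comparison of two non-reference levels (USSR versus UK) by exponentiating the difference of their coefficients. A quick sketch using the estimates quoted above:

```python
import math

# Fitted coefficients from the main-effects model above (US, College = reference)
b_ussr, b_uk = -0.6673, -0.0843
b_primary, b_secondary = -1.7684, -1.0692

# Conditional odds ratios versus the reference country, given education
print(round(math.exp(b_ussr), 3))  # 0.513: USSR vs US
print(round(math.exp(b_uk), 3))    # 0.919: UK vs US

# Two non-reference levels: exponentiate the difference in coefficients
print(round(math.exp(b_ussr - b_uk), 3))  # 0.558: USSR vs UK, given education
```

The same trick gives, for example, primary versus secondary education: exp(b_primary - b_secondary).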
exp(β3) estimates the conditional odds ratio of being politically active for those with only a primary education as compared to a college education, given country. In our case, exp(-1.7684) = 0.171. Therefore, regardless of country, the odds of being politically active given one had only a primary education were only 0.171 times the odds of being politically active given one had a college degree.

exp(β4) estimates the conditional odds ratio of being politically active for those with only a secondary education as compared to a college education, given country. In our case, exp(-1.0692) = 0.343. Therefore, regardless of country, the odds of being politically active given one had only a secondary education were 0.343 times the odds of being politically active given one had a college degree.

How can we determine if this is a good model? We can use the likelihood ratio test to compare models. This test can be conducted either by using the formula for the log-likelihood ratio OR by calculating the difference in the deviances. The results will be identical.

To determine whether or not we need country in the model, I ran the model with both country and education and then with education only, and obtained the following:

-2(L0 - L1) = -2(-1546.7396 - (-1527.2602)) = 38.96 with 2 df

To determine whether or not we need education in the model, I ran the model with both country and education and then with country only, and obtained the following:

-2(L0 - L1) = -2(-1607.6772 - (-1527.2602)) = 160.83 with 2 df

The fact that I did not include an interaction term implies that I have only main effects and that the conditional odds ratios do not depend on the level of the third variable. For our example this means that political activity in the countries differs, but those differences are the same regardless of education level. Likewise, political activity differs for different educational levels, but those differences are the same for the different countries.
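Recomputing the likelihood-ratio statistics directly from the reported log-likelihoods is a useful check; each statistic is referred to a chi-square with df equal to the number of deleted parameters (two dummy variables per deleted factor here). A minimal sketch:

```python
def lr_stat(L0, L1):
    """Likelihood-ratio statistic -2(L0 - L1) comparing a reduced model
    (log-likelihood L0) against a fuller model (log-likelihood L1)."""
    return -2 * (L0 - L1)

full = -1527.2602          # country + education
edu_only = -1546.7396      # education only
country_only = -1607.6772  # country only

print(round(lr_stat(edu_only, full), 2))      # 38.96, on 2 df (country dummies)
print(round(lr_stat(country_only, full), 2))  # 160.83, on 2 df (education dummies)
```

Both statistics far exceed the 2-df chi-square critical value of 5.99, so both country and education belong in the model.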
The fact that the UK did not differ from the US but the USSR did differ from the US suggests that this assumption might not be true. I can fit the model with the two-way interaction term, which for our data is the saturated model, and test whether or not an interaction term is needed. Doing so I obtain:

-2(L0 - L1) = -2(-1528.3137 - (-1521.6587)) = 13.31 with 4 df

Selecting the Best Model

Theory should always guide your model selection when you have a large number of variables. You need to think about what you need in order to test your substantive research questions. If you have only a few possible explanatory variables then you can fit all possible models and compare them. If you have many variables (i.e., 4 or more) then you can use the following steps to narrow down the possible effects.

1. Start with the most complex model possible, which includes all variables and interactions.
2. Delete the highest-way interaction and check whether the removal leads to a significant decrease in fit by conducting a likelihood ratio test.
3. If the test is significant, stop. The most complex model is needed.
4. If the test is not significant, delete each of the next-highest-way interaction terms and, for each deletion, conduct a likelihood ratio test of the model, conditioning on the model from step 2.
5. Choose the model that leads to the smallest decrease in fit. If the decrease in fit is not significant then consider this model the best-fitting model so far.
6. Try deleting another highest-way interaction term, using the model from step 5 as the conditioning model.
7. Continue until there are no further terms that can be deleted without leading to a significant reduction in model fit.

If you have too many variables to even use this procedure you can skip the intermediate steps and simply try to determine the level of complexity you need in your model by deleting all interaction terms at once.
For example, if you had 6 possible predictors you would first fit the most complex model, then delete the six-way interaction, then delete all five-way interactions, then delete all four-way interactions, etc. What you should NOT do is let a computer algorithm select the model for you using stepwise regression.

When you include many variables in your model you have the chance of introducing multicollinearity. This occurs when you have included explanatory variables that are highly correlated with each other, so that there is redundancy in your variables. Signs of multicollinearity include:

1. None of the Wald statistics for the variables in a model are significant, but the likelihood ratio test comparing the models with and without those variables is significant. Rejecting the likelihood ratio test indicates that the set of variables excluded from the reduced model is needed.
2. If deleting a variable results in a significant decrease in fit but none of the parameters are significant, you may want to investigate whether any of the variables are correlated, thereby resulting in multicollinearity.

Example

Data regarding whether or not one attended a sporting event in the last year were obtained from the General Social Survey, along with the demographic variables of sex (S), race (R), and income (I, ordinal). We will use the steps previously outlined to determine the best-fitting model. Note that when using proc genmod you can use the deviance to compare models directly (i.e., you do NOT have to multiply by -2 unless you use the log-likelihoods) and when using proc logistic you can use the likelihood ratio test directly.
Model           Deviance G2 (df)  Comparison   ΔG2 (df), p             Conclusion
(1) SRI         1814.56 (1416)    NA           NA
(2) SR, SI, RI  1815.13 (1418)    (2) - (1)    0.57 (2), p = .752      3-way interaction NOT needed
(3a) SR, SI     1821.09 (1420)    (3a) - (2)   5.96 (2), p = .051
(3b) SR, RI     1815.63 (1419)    (3b) - (2)   0.50 (1), p = .478      SI interaction term NOT needed; model now includes all main effects and the sex*race and race*income interactions
(3c) SI, RI     1817.90 (1420)    (3c) - (2)   2.6 (2), p = .270
(4a) I, SR      1818.20 (1421)    (4a) - (3b)  2.57 (1), p = .109      RI interaction term NOT needed; model now includes all main effects and the sex*race interaction
(4b) S, RI      1821.90 (1421)    (4b) - (3b)  6.27 (1), p = .012
(5) R, S, I     1824.86 (1423)    (5) - (4a)   6.66 (2), p = .035      SR interaction term IS needed, therefore we CANNOT eliminate either the sex or race main effect
(6) R, S, SR    1937.87 (1422)    (6) - (4a)   119.67 (1), p < .0001   STOP - final model needs the income, race, and sex main effects and the sex by race interaction

Final Model:

logit(π) = α + β1(x1) + β2(y1) + β3(y2) + β4(z1) + β5(y1z1) + β6(y2z1)
         = -0.4286 + 0.3761(male) - 1.2337(white) - 1.7854(black) + 0.0027(income) + 0.1124(white*income) + 0.1423(black*income)

Interpretation: Similar to simple linear regression, since we have an interaction term in our model it is not appropriate to interpret the main effects of variables involved in an interaction term. To interpret the interaction it is helpful to look at the curves. If there is an interaction between two discrete variables the curves will be shifted horizontally but the shape will stay the same. If there is an interaction between a continuous variable and a discrete one the curves will cross.

[Figure: estimated probability (p-hat, 0.00 to 1.00) of attending a sporting event versus income (0 to 40), with separate curves for Race = White, Race = Black, and Race = Other.]

To interpret our main effect of sex we can simply exponentiate the estimated parameter, since sex is not involved in an interaction term.
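Each p-value in the comparison table comes from referring ΔG2 to a chi-square distribution with df equal to the difference in model df. For the df = 1 and df = 2 cases that appear here, the chi-square tail probability has a simple closed form, so the table can be checked without a stats package (a sketch handling only those two cases):

```python
import math

def chi2_sf(x, df):
    """Chi-square survival function P(X > x) for df = 1 or 2.
    df = 1: P(Z^2 > x) = 2 P(Z > sqrt(x)) = erfc(sqrt(x / 2))
    df = 2: the chi-square is exponential with mean 2, so P = exp(-x / 2)."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("only df = 1 or 2 handled in this sketch")

print(round(chi2_sf(0.57, 2), 3))  # 0.752: 3-way interaction not needed
print(round(chi2_sf(5.96, 2), 3))  # 0.051: dropping RI from (2), borderline
print(round(chi2_sf(2.57, 1), 3))  # 0.109: dropping RI from (3b), not needed
```

For general df one would use a regularized incomplete gamma function (e.g. scipy.stats.chi2.sf), but these two cases cover every test in the table.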
exp(β1) estimates the conditional odds ratio between sex and attending a sporting event in the last year, given race and income. In our case, exp(0.3761) = 1.46. Therefore, regardless of race and income, the odds of attending a sporting event in the last year given one was male were 1.46 times the odds of attending a sporting event in the last year given one was female.
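As a final sketch, we can recover the sex odds ratio and trace the fitted race-by-income curves from the final model (the helper p_hat and the income grid 0 to 40 are illustrative choices mirroring the figure's axis):

```python
import math

# Coefficients of the final fitted model quoted above
b = {"intercept": -0.4286, "male": 0.3761, "white": -1.2337, "black": -1.7854,
     "income": 0.0027, "white_inc": 0.1124, "black_inc": 0.1423}

# Conditional odds ratio for sex (sex is not involved in an interaction)
print(round(math.exp(b["male"]), 2))  # 1.46

def p_hat(male, white, black, income):
    """Fitted probability of attending a sporting event in the last year."""
    logit = (b["intercept"] + b["male"] * male + b["white"] * white
             + b["black"] * black + b["income"] * income
             + b["white_inc"] * white * income + b["black_inc"] * black * income)
    return 1 / (1 + math.exp(-logit))

# The race-by-income interaction makes the curves cross: the white curve
# starts lower but rises much faster with income than the other-race curve
for income in (0, 20, 40):
    print(income,
          round(p_hat(1, 1, 0, income), 2),   # white male
          round(p_hat(1, 0, 0, income), 2))   # other-race male
```

Printing the same grid for females (male = 0) shifts every curve down by the same factor on the odds scale, which is exactly what "no interaction with sex" means.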