Qualitative Explanatory Variables


When using logistic regression to model a binary outcome we can have explanatory variables
that are continuous, nominal (discrete), ordinal (discrete), or a combination of discrete and
continuous variables.

We can incorporate nominal variables into our model by creating dummy variables that
represent the categories of the variable. For example, if I wanted to incorporate SES into a
model using free and reduced lunch status, I might assign a 0 if one did not have free and
reduced lunch and a 1 if one did.

When an explanatory variable only has 2 levels then only a single dummy variable is needed.
When an explanatory variable has more than two levels, say I levels, then I – 1 dummy
variables are needed in the model.
For example, suppose I have an SES variable with three levels: low, middle, and high. Then
I would create two dummy variables such that:
Variable 1 = SES1 = 1 if SES is low, 0 otherwise
Variable 2 = SES2 = 1 if SES is middle, 0 otherwise
You could also reverse the coding such that SES1 represented high SES and SES2 represented
middle SES.
Our model in this case would be:
logit(π) = α + β₁(SES1) + β₂(SES2)

In SAS one can use the class statement in either proc logistic or proc genmod to create the
dummy variables. You can control the order in which nominal variables are dummy coded
by using the order = data option. If your dependent variable is coded 0/1, you can
ensure that you are modeling "success" = 1 by using the descending option. It is
possible to do effect coding in SAS as well, but this will only change the parameter values
and not the substantive results. A minimal sketch is given below.
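For instance (the data set grades and the variables prog and ses are hypothetical names, and prog is assumed to be coded 0/1):

proc logistic data = grades descending;   * descending: model P(prog = 1) as the success;
  class ses / order = data param = ref;   * dummy (reference) coding, levels kept in data order;
  model prog = ses;
run;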
Example
Suppose we obtained the following data:
                High School Program Type
SES          Academic      Nonacademic
Low              44             95
Middle          147            152
High            117             45
I fit a logistic regression model to these data and obtained the following results.
Class Level Information
Class    Levels    Values
ses      3         Low Middle High
Criteria For Assessing Goodness Of Fit
Criterion               DF    Value        Value/DF
Deviance                 0    0.0000       .
Scaled Deviance          0    0.0000       .
Pearson Chi-Square       0    0.0000       .
Scaled Pearson X2        0    0.0000       .
Log Likelihood                -389.6949

Algorithm converged.
Analysis Of Parameter Estimates
Parameter       DF   Estimate   Standard   Wald 95% Confidence   Chi-      Pr > ChiSq
                                Error      Limits                Square
Intercept        1    0.9555    0.1754      0.6117    1.2993     29.67     <.0001
ses   Low        1   -1.7252    0.2530     -2.2211   -1.2293     46.49     <.0001
ses   Middle     1   -0.9890    0.2101     -1.4008   -0.5771     22.15     <.0001
ses   High       0    0.0000    0.0000      0.0000    0.0000       .         .
Scale            0    1.0000    0.0000      1.0000    1.0000

NOTE: The scale parameter was held fixed.
Note that the table provides three binomial observations and we fit three parameters (an
intercept plus two dummy variables), so we have fit the saturated model and have 0 df left
for indices of model fit. This model fits the data perfectly. If we had fit the raw data,
as opposed to entering the data in tabular form, we would have many df with which to assess
model fit.
How do you think we can interpret these parameters? Try calculating the odds ratio for the
frequency table and then interpreting the parameter estimates.
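As a check, working directly from the frequency table with high SES as the reference: the odds of an academic program are 44/95 = 0.463 for low SES and 117/45 = 2.600 for high SES, so the odds ratio is 0.463/2.600 = 0.178, and ln(0.178) = -1.725, which is the estimate for ses Low. Likewise, the intercept, 0.9555, is ln(117/45), the log odds for the high SES group.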
Multiple Logistic Regression

Typically we are not interested in fitting a model with only one explanatory variable. Rather,
there are many variables that we believe may be related to our dependent variable. Similar to
ordinary regression models for normal data, we can generalize the models we've been
covering to incorporate more predictor variables, and these variables can be quantitative,
qualitative, or a combination of the two.
Example
Let's start with a simple model with two qualitative explanatory variables. Suppose we have the
following data. We want to determine whether or not following politics regularly is a function of
one's level of education and country of residence. If this had been a 2 x 2 x k table then I would
have conducted the Breslow-Day test for homogeneous association.
Follows Politics Regularly

Level of          USSR           UK            USA
Educ.           Yes    No     Yes    No     Yes    No
Primary          94    84     356   144     227   112
Secondary       318   120     256    76     371    71
College         473    72      22     2     180     8
Fitting a logistic regression or logit model with no interaction term I obtained:
Logit ( = x1) + x2) + z1) + z2)
= 2.6233 - 0.6673(USSR) - 0.0843 (UK) - 1.7684 (Primary) - 1.0692 (Secondary)
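As a minimal sketch, this model can be fit in SAS by entering the table as counts; the data set and variable names below are illustrative, not from the original analysis:

data politics;
  length country $ 4 educ $ 9;
  input country $ educ $ yes total;   * yes = number who follow politics, total = row total;
  datalines;
USSR Primary    94  178
USSR Secondary 318  438
USSR College   473  545
UK   Primary   356  500
UK   Secondary 256  332
UK   College    22   24
USA  Primary   227  339
USA  Secondary 371  442
USA  College   180  188
;
run;

proc genmod data = politics order = data;
  class country educ;   * with order = data, USA and College are the reference levels;
  model yes/total = country educ / dist = binomial link = logit;
run;

With order = data and genmod's default (GLM) coding, the last levels read in (USA and College) get parameter estimates of zero, matching the equation above.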
This model has only main effects, in an ANOVA context, for country and education. For each
combination of the explanatory variables our model is:
Country    Education    x1    x2    z1    z2    Model for logit(π)
USSR       Primary       1     0     1     0    α + β₁ + β₃
USSR       Secondary     1     0     0     1    α + β₁ + β₄
USSR       College       1     0     0     0    α + β₁
UK         Primary       0     1     1     0    α + β₂ + β₃
UK         Secondary     0     1     0     1    α + β₂ + β₄
UK         College       0     1     0     0    α + β₂
USA        Primary       0     0     1     0    α + β₃
USA        Secondary     0     0     0     1    α + β₄
USA        College       0     0     0     0    α
So how can we interpret the parameters?
Exponentiating each of the parameters gives us an estimate of a common odds ratio between the
dependent and independent variables. Of course, odds ratios are only appropriate for two-by-two
tables, so for each independent variable we are comparing the level included in the
model to the reference level, the one whose parameter estimate is zero. If we wanted to compare
two levels of a variable that are both included in the model we could exponentiate the difference
between the two parameter estimates of interest.
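For example, to compare the USSR with the UK we can exponentiate the difference in their estimates: exp(-0.6673 - (-0.0843)) = exp(-0.5830) = 0.56, so regardless of education level the odds of following politics in the USSR are estimated to be about 0.56 times the odds in the UK.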
Well, exp(the conditional odds ratio between political activity for the USSR versus the US,
given educational level. In our case, exp(. Therefore, regardless of education level
the odds of being politically active in the USSR were 0.513 times the odds of being politically
active in the US.
exp(the conditional odds ratio between political activity for the UK versus the US, given
educational level. In our case, exp( Therefore, regardless of education level the odds
of being politically active in the USSR were 1.02 times the odds of being politically active in the
US. The confidence interval for includes 0 and therefore, there doesn’t seem to be much
difference between the US and the UK.
exp(the conditional odds ratio between political activity for the those with only a primary
education as compared to a college education, given country. In our case, exp(
Therefore, regardless of country, the odds of being politically active given one only had a
primary education were only 0.171 times the odds of being politically active given one had a
college degree.
exp(the conditional odds ratio between political activity for the those with only a secondary
education as compared to a college education, given country. In our case, exp(
Therefore, regardless of country, the odds of being politically active given one only had a
secondary education were only 0.849 times the odds of being politically active given one had a
college degree.
How can we determine if this is a good model? We can use the likelihood ratio test to compare
models. This test can be conducted either by using the formula for the log-likelihood ratio OR
by calculating the difference in the deviances. The results will be identical.
To determine whether or not we need country in the model, I ran the model with both country
and education and the model with education alone, and obtained:

-2(L0 - L1) = -2(-1546.7396 - (-1528.3137)) = 36.852 with 2 df

To determine whether or not we need education in the model, I ran the model with both country
and education and the model with country alone, and obtained:

-2(L0 - L1) = -2(-1607.6772 - (-1528.3137)) = 158.727 with 2 df
The fact that I did not include an interaction term implies that I have only main effects and that
the conditional odds ratios do not depend on the level of the third variable. For our example this
means that political activity in the countries differs but those differences are the same regardless
of education level. Likewise, political activity differs for different educational levels but those
differences are the same for the different countries. The fact that the UK did not differ from the
US but USSR did differ from the US suggests that this assumption might not be true. I can fit
the model with the 2 way interaction term, for our data this is the saturated model, and test
whether or not an interaction term is needed. Doing so I obtain:
-2(L0 - L1) = -2(-1528.3137 - (-1521.6587)) = 13.31 with 4 df (p = .0099), which suggests that
the interaction term does improve the fit.
Selecting the Best Model

Theory should always guide your model selection when you have a large number of
variables. You need to think about what you need to test your substantive research questions.

If you only have a few possible explanatory variables then you can fit all possible models and
compare.

If you have lots of variables (i.e. 4 or more) then you can use the following steps to narrow
down the possible effects.
1. Start with the most complex model possible which includes all variables and
interactions.
2. Delete the highest way interaction and check whether the removal leads to a
significant decrease in fit by conducting a likelihood ratio test (a sketch of this
step follows the list).
3. If the test is significant, stop. The most complex model is needed.
4. If the test is not significant delete each of the next highest way interaction terms and,
for each deletion, conduct a likelihood ratio test of the model, conditioning on the
model from step 2.
5. Choose the model that leads to the least decrease in fit. If the decrease in fit is not
significant then consider this model as the best fitting model.
6. Try deleting another highest way interaction term using the model from step 5 as the
conditioning model.
7. Continue until there are no further terms that can be deleted without leading to a
significant reduction in model fit.
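As a rough sketch of steps 1 and 2 in SAS (the data set mydata, counts y/n, and factors a, b, and c are hypothetical): fit the full factorial model, refit without the highest-way interaction, and compare the two fits with a likelihood ratio test.

proc genmod data = mydata;
  class a b c;
  model y/n = a|b|c / dist = binomial link = logit;     * all main effects and interactions;
run;

proc genmod data = mydata;
  class a b c;
  model y/n = a|b|c @2 / dist = binomial link = logit;  * drops the three-way interaction;
run;

The difference in the two deviances is the likelihood ratio statistic, with df equal to the difference in residual df.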

If you have too many variables to even use this procedure you can skip the intermediate steps
and simply try to determine the level of complexity you need in your model by deleting all
interaction terms at once. For example, if you had 6 possible predictors you would first fit
the most complex model, then delete the six-way interaction, then delete all five-way
interactions, then delete all four-way interactions, etc.

What you should NOT do is to let a computer algorithm select the model for you using
stepwise regression.

When you include many variables in your model you have the chance of introducing
multicollinearity. This occurs when you have included explanatory variables that are highly
correlated with each other so that you have redundancy in your variables.

Signs of multicollinearity include:
1. None of the Wald statistics for the variables in the model are significant, but the likelihood
ratio test comparing the model with and the model without those variables is
significant. Rejecting the null hypothesis in the likelihood ratio test indicates that the set of
variables excluded from the model is needed.
2. If deleting a variable results in a significant decrease in fit but none of the parameters are
significant you may want to investigate whether any of the variables are correlated,
thereby resulting in multicollinearity.
Example
Data regarding whether or not one attended a sporting event in the last year were obtained from
the General Social Survey, along with the demographic variables sex (S), race (R), and
income (I, ordinal). We will use the steps previously outlined to determine the best-fitting
model.
Note that when using proc genmod you can use the deviance to compare models directly (i.e.,
you do NOT have to multiply by -2 unless you use the log-likelihoods), and when using proc
logistic you can use the likelihood ratio test directly.
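As a quick sketch, the first comparison in the table below can be verified in a short data step (probchi is SAS's chi-square distribution function):

data lrt;
  dG2 = 1815.13 - 1814.56;     * difference in deviance, model (2) minus model (1);
  df  = 1418 - 1416;           * difference in residual df;
  p   = 1 - probchi(dG2, df);  * upper-tail p-value; here p = .752;
run;

proc print data = lrt; run;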
Models            Deviance G2 (df)    Models Compared    ΔG2 (df)       p
(1) SRI           1814.56 (1416)      NA                 NA
(2) SR, SI, RI    1815.13 (1418)      (2) - (1)          0.57 (2)       .752
(3a) SR, SI       1821.09 (1420)      (3a) - (2)         5.96 (2)       .051
(3b) SR, RI       1815.63 (1419)      (3b) - (2)         0.50 (1)       .478
(3c) SI, RI       1817.90 (1420)      (3c) - (2)         2.6 (2)        .270
(4a) I, SR        1818.20 (1421)      (4a) - (3b)        2.57 (2)       .277
(4b) S, RI        1821.90 (1421)      (4b) - (3b)        6.27 (2)       .043
(5) R S I         1824.86 (1423)      (5) - (4a)         6.66 (2)       .035
(6) R S SR        1937.87 (1422)      (6) - (4a)         119.67 (1)     <.0001

Conclusions:
(2) - (1): the 3-way interaction is NOT needed.
(3b) - (2): the SI interaction term is NOT needed; the model now includes all main effects and
the sex*race and race*income interactions.
(4a) - (3b): the RI interaction term is NOT needed; the model now includes all main effects and
the sex*race interaction.
(4b) - (3b): the SR interaction term IS needed in the model; therefore we CANNOT eliminate
either the sex or race main effect.
(5), (6): STOP - the final model needs the income, race, and sex main effects and the sex by
race interaction.
Final Model:
logit(π) = α + β₁(x1) + β₂(y1) + β₃(y2) + β₄(z1) + β₅(y1*z1) + β₆(y2*z1)
         = -0.4286 + 0.3761(male) - 1.2337(white) - 1.7854(black) + 0.0027(income)
           + 0.1124(white*income) + 0.1423(black*income)
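A model of this form could be fit with something like the following sketch; the data set gss and the variables attend, sex, race, and income are hypothetical stand-ins:

proc logistic data = gss descending;             * model the probability that attend = 1;
  class sex race / param = ref;                  * dummy (reference) coding for the nominal predictors;
  model attend = sex race income race*income;    * main effects plus the race-by-income interaction;
run;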
Interpretation:
As in linear regression, since we have an interaction term in our model it is not
appropriate to interpret the main effects of variables that are involved in an interaction.
To interpret the interaction it is helpful to look at the curves. If there is an interaction between
two discrete variables the fitted probability curves will be shifted horizontally but the shape
will stay the same. If there is an interaction between a continuous variable and a discrete one
the curves will cross.
[Figure: estimated probability (p-hat) of attending a sporting event plotted against income
(0 to 40), with separate curves for Race = White, Race = Black, and Race = Other.]
To interpret our main effect of sex we can simply exponentiate the estimated parameter, since
there is no interaction term involving sex. exp(β₁) is the conditional odds ratio between
sex and attending a sporting event in the last year, given race and income. In our case,
exp(0.3761) = 1.46. Therefore, regardless of race and income, the odds of attending a sporting
event in the last year given one was male were 1.46 times the odds of attending a sporting event
in the last year given one was female.