Ron Heck, Summer 2012 Seminars
Multilevel Regression Models and Their Applications Seminar
Introducing Generalized Linear Models: Logistic Regression
The generalized linear model (GLM) represents an extension of the linear model for
investigating outcomes that are categorical (e.g., dichotomous, ordinal, multinomial, counts).
The model has three components that are important to consider:

- A random component that specifies the variance in terms of the mean ($\mu_{ij}$), or expected value, of $Y_{ij}$ for individual $i$ in group $j$;
- A link function, often represented as $g(\cdot)$, which converts the expected values ($\mu$) of $Y$ to transformed predicted values $\eta$ [$\eta = f(\mu)$]. The Greek letter $\eta$ (eta) is typically used to denote the transformed linear predictor. The link function therefore provides the relationship between the linear predictor and the mean of the distribution function; and
- A structural component that relates the transformed predictor $\eta_{ij}$ of $Y$ to a set of predictors.
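To make the three components concrete, here is a minimal Python sketch using simulated data and the statsmodels library (the variable names and values are illustrative assumptions, not part of the seminar examples):

```python
# A sketch of the three GLM components with simulated data (illustrative only).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.normal(size=500)                 # one predictor
eta = 0.5 + 1.2 * x                      # structural component: linear predictor
pi = 1 / (1 + np.exp(-eta))              # inverse link maps eta back to the mean of Y
y = rng.binomial(1, pi)                  # random component: binomial variance

# The link function g() ties the mean of Y to the linear predictor eta.
glm = sm.GLM(y, sm.add_constant(x),
             family=sm.families.Binomial(link=sm.families.links.Logit()))
print(glm.fit().summary())
```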
One type of GLM that can be applied in various situations is the logistic regression model. It can
be applied to dichotomous dependent variables (e.g., pass/fail, stay/leave) and can also be
extended to consider a dependent variable with several categories (e.g., manager, clerical,
custodian). The categories may be ordered (as in ordinal regression) or unordered. In this latter
case (called multinomial logistic regression), each category is compared to a reference category
(i.e., the last category is the default in SPSS). Logistic regression can also readily accommodate
interactions among predictors (e.g., gender x ethnicity). It is important to note that extremely
high correlations between predictors may lead to multicollinearity. One fix is to eliminate one of
the variables in multicollinear pairs from the model.
The Model
Ordinary least squares regression cannot be correctly applied to data where the outcome is
categorical (e.g., two categories coded 0 and 1) or ordinal (i.e., with a restricted range). In the
case of a dichotomous dependent variable (Y), for example, linear models with nonzero slopes
will eventually predict values of Y greater than 1 or less than 0. The problem is that when an
outcome can only be a 0 or a 1, it will not be well described by a line since there is a limit on
values that describe someone’s outcomes (0 = dropout, 1 = persist). Predictions will break down
at the boundaries (0 or 1). Let’s suppose we use students’ socioeconomic status (SES) to predict
their likelihood to persist versus drop out. Any linear model with a nonzero slope for the effect of
SES on probability of persisting will eventually predict values of Y greater than 1 or less than 0
(e.g., 1.5 or -0.5), yet these are impossible predictions since by definition the values for Y can
only be 0 or 1.
Moreover, for dichotomous outcomes, the distribution of the errors from predicting whether
someone persists or not will be nonrandom (since the errors can take only two values) and,
therefore, they cannot be normally distributed (i.e., violating a basic assumption of the linear model).
Contrast this with a linear relationship, for example, predicting someone's math test score from
family income, where one family's income might be $25,000 while another's is $250,000. The
regression line will extend along the Y axis and X axis well beyond the
limits of 0 and 1, and the prediction errors will also tend to be random and normally distributed.
Because of these problems, we need a nonlinear model that can transform predicted values of Y
to lie within the boundaries of 0 and 1 (Hamilton, 1992). Logistic regression emphasizes the
probability of a particular outcome occurring, given an individual's pattern of responses to a set
of independent variables. This relationship between independent variables and categories is
nonlinear (i.e., not constant across levels of the independent variable) with respect to the
probabilities of an outcome being either 0 or 1. The logit transformation typically produces a type of S-curve. We might standardize income (X), as in the following graph.
Figure 1. S-curve graph
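As a rough stand-in for Figure 1, the following Python fragment draws the logistic S-curve against a standardized predictor (the intercept and slope are arbitrary values chosen only to show the shape):

```python
# Plot the logistic S-curve: probability of Y = 1 versus a standardized predictor.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-4, 4, 200)      # standardized income (X)
eta = 0.0 + 1.0 * x              # illustrative linear predictor
pi = 1 / (1 + np.exp(-eta))      # logistic transformation keeps pi between 0 and 1

plt.plot(x, pi)
plt.xlabel("Standardized income (X)")
plt.ylabel("P(Y = 1)")
plt.title("Logistic S-curve")
plt.show()
```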
Logit Coefficient
While the logistic regression model is nonlinear for probabilities (due to their variation
between 0 and 1), it is linear with respect to the logit coefficients. The probability of an
event occurring (Y = 1) versus not occurring can then be written as follows:

$$P(Y = 1) = \pi = \frac{e^{\eta}}{1 + e^{\eta}}.$$
By taking the natural logarithm of the odds of the event occurring versus not occurring, we
obtain a logit coefficient (η). The logit coefficient then can be predicted as a linear function of
the following:
$$\eta = \log\!\left(\frac{\pi}{1 - \pi}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p$$
If we take the natural log of the odds π/(1 − π), we can interpret the effect of a predictor
as the change in the log odds (η) associated with a one-unit change in the independent variable.
So if we had a simple model predicting the log odds of persisting using only female, the equation
might look like this:
η = 0.06 + 0.76 (female)
The coefficient for female is 0.76, which suggests that as gender changes from 0 to 1 (i.e., male
coded 0 to female coded 1), the log odds of persisting increase by 0.76 units. The intercept can
be interpreted as the predicted log odds (η) when all the X predictors are equal to zero. In this case, for a
man, the log odds of persisting would then be 0.06 [0.06 + 0.76(0)]. Notice also that there is no
error term for a logistic model. This is because in the binomial probability distribution the
variance is related to the mean proportion, so it cannot be estimated separately.
We can use the inverse of the logit link function to obtain the probability of persisting:

$$\pi = \frac{1}{1 + e^{-\eta}}.$$
In this case, the probability of persisting for a man would be

1/[1 + e^{-0.06}] = 0.51,

where e is the base of the natural logarithm, approximately 2.71828.
Since the estimate for a male’s probability of persisting is just about 0.50 (or half the time), our
knowledge that an individual is male is not much better than flipping a coin in terms of knowing
whether or not a particular male individual will persist. If the estimate were considerably below
0.5, then we would predict that the individual would be unlikely to persist.
For a female, however, the log odds of persisting are much better. The estimated log odds would
be 0.82 [0.06 + 0.76(1)]. The probability of persisting would then be
1/[1 + e^{-0.82}] = 0.69.
So, since the estimate is considerably above 0.5, we would predict that a given female is likely to
persist.
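The arithmetic in this example is easy to verify; here is a short Python check using the coefficients from the equation above (0.06 and 0.76):

```python
# Verify the worked example: eta = 0.06 + 0.76 * female.
import math

def inv_logit(eta):
    """Inverse logit link: pi = 1 / (1 + e^(-eta))."""
    return 1 / (1 + math.exp(-eta))

for female in (0, 1):
    eta = 0.06 + 0.76 * female
    print(f"female={female}: log odds = {eta:.2f}, P(persist) = {inv_logit(eta):.3f}")
# Prints about 0.515 for men and 0.694 for women, the 0.51 and 0.69 above.
```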
Odds Ratio
People generally interpret the odds ratio (i.e., the increase or decrease in the odds of being in one
outcome category when the predictor increases by one unit). Consider the case where we flip a
coin. The ratio of the odds of it coming up heads or tails is the same.
.50/.50 = 1
Therefore, when the events are equally likely to occur we say the ratio of their odds is equal to 1
(or equally likely). The odds ratio ($e^{\beta}$) can be expressed as follows:

$$\frac{\pi}{1 - \pi} = e^{\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_q X_q}$$
This ought to look somewhat similar to the log odds equation. The odds ratio for a particular
predictor variable is defined as $e^{\beta}$, where β is the logit coefficient estimate for the predictor and e
is the base of the natural logarithm. If β is zero, the odds ratio will equal 1 (since any number to the 0 power is
1), which leaves the odds unchanged. If β is positive, the odds ratio will be greater than 1, which
means the odds are increased. If β is negative, the odds ratio will be less than 1, which means
the odds are decreased.
Consider the previous case where the relationship of female (coded male = 0, female = 1) to
persisting to graduation is examined. If the odds ratio [also abbreviated as Exp(β)] is 2.5, it
means that as gender changes from male to female (i.e., a unit change in the independent variable),
the odds of persisting versus dropping out are multiplied by a factor of 2.5. In contrast, consider
the case where the odds ratio is 0.40. If we reversed the coding of gender (females = 0,
males = 1), it would imply that the odds of persisting diminish for males by a factor of 0.40
compared with females. These two odds ratios can be shown to be equivalent by division (1/0.40
= 2.5 and 1/2.5 = 0.40).
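A quick numeric check of this reciprocity, using the 2.5 and 0.40 from the example:

```python
# Odds ratios under reversed coding are reciprocals of one another.
import math

beta = math.log(2.5)     # logit coefficient whose odds ratio is 2.5
print(math.exp(beta))    # 2.5 (female coded 1)
print(math.exp(-beta))   # 0.4 (coding reversed, so male coded 1)
```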
Model Fitting
Logistic models are commonly estimated with maximum likelihood rather than ordinary least
squares (as in multiple regression). Estimates are sought that yield the highest possible values for
the likelihood function. The equations are nonlinear in that the parameters cannot be solved for
directly (as in OLS regression). Instead, an iterative procedure is used, in which successively better
approximations are found that maximize the log likelihood function (Hamilton, 1992).
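To show what "successively better approximations" means, here is a bare-bones Newton-Raphson (iteratively reweighted least squares) sketch for logistic regression; it is a teaching illustration under simplifying assumptions, not the exact algorithm any particular package uses:

```python
# Minimal Newton-Raphson estimation for logistic regression, printing the
# log likelihood at each iteration so the improvement is visible.
import numpy as np

def fit_logit(X, y, n_iter=8):
    beta = np.zeros(X.shape[1])
    for it in range(n_iter):
        eta = X @ beta
        pi = 1 / (1 + np.exp(-eta))
        ll = np.sum(y * eta - np.log1p(np.exp(eta)))  # binomial log likelihood
        print(f"iteration {it}: log likelihood = {ll:.4f}")
        W = pi * (1 - pi)                   # binomial variance weights
        grad = X.T @ (y - pi)               # score (first derivative)
        hess = X.T @ (X * W[:, None])       # information matrix
        beta = beta + np.linalg.solve(hess, grad)  # Newton step
    return beta

# Usage: X should include a column of ones for the intercept, e.g.
# fit_logit(np.column_stack([np.ones(len(x)), x]), y)
```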
Nested Models
Successive logistic regression models can be evaluated against each other. Such models
(called nested models when all of the elements of the smaller model are also in the bigger model)
can be compared against a baseline model using the change in their log likelihoods. We compare
the deviance (or −2 × log likelihood) of each model. The difference between models is distributed
as $\chi^2$ with degrees of freedom equal to the difference in the number of parameters estimated
between the two models. For example, if one is evaluating a
model with three predictors and chooses to add another predictor, the change in model fit can be
assessed. For one degree of freedom, a significant improvement (at p = .05) in model fit would
result from a chi-square change of 3.84. The likelihood ratio test (G²) is defined as the following:

G² = 2[(log likelihood for bigger model) − (log likelihood for smaller model)]
So, for example, if the log likelihood of the smaller model (3 predictors) is −10.095 and that of the bigger
model (4 predictors) is −8.740, the equation looks like this:

G² = 2[(−8.740) − (−10.095)] = 2.71
Because the required chi-square for 1 degree of freedom (df) is 3.84, we would generally
conclude that the model with three predictors is preferred over the other model, since adding the
fourth predictor did not improve the model's fit significantly.
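The same test is easy to run in Python with scipy, using the two log likelihoods from the example:

```python
# Likelihood ratio (G^2) test for the nested-model example in the text.
from scipy.stats import chi2

ll_small, ll_big = -10.095, -8.740
g2 = 2 * (ll_big - ll_small)                 # G^2 = 2.71
p_value = chi2.sf(g2, df=1)                  # df = difference in parameter counts
print(f"G^2 = {g2:.2f}, p = {p_value:.3f}")  # p is about .10, above .05
```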
Common Problems
A number of problems can occur if there are too few cases relative to the number of predictors in
the model. It is likely that the model will produce some very large parameter estimates and large
standard errors. This can be a result of having combinations of variables that produce many
cells with no cases. The model may also fail to converge on a solution. Sometimes
this can be fixed by collapsing the categories (i.e., making fewer categories) or by eliminating
independent variables that are not important.
A Two-Level GLM Model
The single-level model can easily be extended to a mixed model (GLMM). We start here by
introducing the basic specification for a two-level model. For a two-level model, we include
subscript i to represent an individual nested within a level-2 unit designated by subscript j.
The level-1 model for individual i nested in group j is of the general form:
$$\eta_{ij} = x_{ij}' \beta,$$

where $x_{ij}$ is a $(p + 1) \times 1$ vector of predictors for the linear predictor $\eta_{ij}$ of $Y_{ij}$, and $\beta$ is a vector of
corresponding regression coefficients. Notice there is no error term at level 1. An appropriate
link function is then used to link the expected value of $Y_{ij}$ to $\eta_{ij}$.
In this case, we will use the logit link function:

$$\eta_{ij} = \log\!\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_{0j} + \beta_1 X_{1ij} + \beta_2 X_{2ij} + \cdots + \beta_q X_{qij}.$$
At level 2, the level-1 coefficients $\beta_{qj}$ can become outcome variables. Following Raudenbush
et al. (2004), a generic structural model can be denoted as follows:

$$\beta_{qj} = \gamma_{q0} + \gamma_{q1} W_{1j} + \gamma_{q2} W_{2j} + \cdots + \gamma_{qS_q} W_{S_q j} + u_{qj},$$

where $\gamma_{qs}$ ($s = 0, 1, \ldots, S_q$) are the level-2 coefficients, $W_{sj}$ are level-2 predictors, and $u_{qj}$ are
level-2 random effects. You can see that the level-2 model (and successive levels) is specified
the same as a model with continuous outcomes. It is only the level-1 model that is different.
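The seminar fits this model with SPSS GENLIN MIXED. For readers working in Python, a roughly comparable random-intercept logistic model can be sketched with statsmodels' Bayesian mixed GLM; this is an approximation rather than an exact GENLIN MIXED equivalent, and the file name below is hypothetical (the variable names mirror the example that follows):

```python
# Sketch of a two-level (random-intercept) logistic model in Python.
import pandas as pd
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

df = pd.read_csv("students.csv")  # hypothetical file containing these variables
model = BinomialBayesMixedGLM.from_formula(
    "mathprof ~ ses + gpa + acprog",           # level-1 fixed effects
    vc_formulas={"school": "0 + C(schcode)"},  # random intercept for each school
    data=df,
)
result = model.fit_vb()  # variational Bayes estimation
print(result.summary())
```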
We use the GENLIN MIXED program (starting in SPSS Version 19) to build an example two-level model to examine students' proficiency in math. In samples where individuals are
clustered in higher-order social groupings (e.g., a department, a school, or some other type of
organization), simple random sampling does not hold because individuals clustered in groups
will tend to be similar in various ways. For example, they attend schools with particular student
compositions, expectations for student academic achievement, and curricular organization and
instructional strategies. If the clustering of students is ignored, it is likely that bias will be
introduced in estimating model parameters.
As has been noted previously, where there are clustered data, it is likely that there is a
distribution of both intercepts and slopes around their respective average fixed effects. In this
situation, we might wish to investigate the random variability in intercepts and slopes across the
sample of higher level units in the study. Once we determine that variation exists in the
parameters of interest across groups, we can build level-2 models to explain this variation. In
some cases, we may have a specific theoretical model in mind that we wish to test, while in other
cases, we might be trying to explore possible new mechanisms that explain this observed
variation in parameters across groups.
We have a data set with 7,009 high school students in 988 schools. We wish to determine what
might affect their likelihood to be proficient in math. Within schools we have background
variables associated with student SES, grade point average, and whether they were in a primarily
college prep curricular program or a more advanced high school program. Between schools, we
have a student composition variable and a variable describing the academic focus of the school.
“No Predictors” Model
We can begin with a “no predictors” model. At level 1, the unconditional model relating the
transformed predicted values to an intercept parameter is defined as follows:

$$\eta_{ij} = \log\!\left(\frac{\pi_{ij}}{1 - \pi_{ij}}\right) = \beta_{0j}.$$
We note again there is no separate level-1 residual variance term for models with categorical
outcomes. The level-2 model will simply be the following:
$$\beta_{0j} = \gamma_{00} + u_{0j}.$$
Through substitution, the combined single equation is the following:
$$\eta_{ij} = \gamma_{00} + u_{0j},$$
which suggests there are two parameters to be estimated (the intercept and random level-2 effect). Here
we can see the estimated log odds of being proficient is 0.684.
Table 1: Fixed Effects

| Model Term | Coefficient | Std. Error | t | Sig. | Exp(Coefficient) | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| Intercept | 0.684 | 0.034 | 20.328 | .000 | 1.982 | 1.856 | 2.118 |

Probability distribution: Binomial; Link function: Logit. The 95% confidence interval is for Exp(Coefficient).
Notice the “look” of the tables is a bit different in the GENLIN MIXED program (i.e., it requires
using the mouse to open up each part of the output and then converting heat maps to
tables). I have used a “table template” I created to present the relevant output.
Table 2: mathprof2
We can see in the above table that most students are likely to be proficient in math. The
percentage of proficient students at the individual level is about 66.4%.
We can calculate the probability of being proficient from the fixed effects table above as the following:

$$\pi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-0.684}}.$$
The resulting proficiency level averaged across schools is about 0.67 [1/(1 + 0.501) = 0.67]. This
is a little different since it is the average unit-estimated proficiency level, rather than the average
for the population. If we look at the odds ratio (1.98), we can say students are about 2:1 more
likely to be proficient than not proficient. If we made this a proportion, it would be something
like 0.667/0.333 = 2.
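These conversions from the intercept can be checked directly (0.684 is the estimate from Table 1):

```python
# Convert the null-model intercept (log odds) to an odds ratio and a probability.
import math

gamma_00 = 0.684
print(math.exp(gamma_00))             # about 1.98, the Exp(Coefficient) in Table 1
print(1 / (1 + math.exp(-gamma_00)))  # about 0.665, the average school's proficiency
```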
Variance Components
The level 2 variability suggests that the probability varies significantly across schools (Z = 8.144,
p < .01).
Table 3: Variance Components

| Random and Residual Effects | Estimate | Std. Error | Z-test | Significance | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|
| Var(Intercept)^a | 0.386 | 0.047 | 8.144 | .000 | 0.303 | 0.490 |
| Residual^b | 1.00 | | | | | |

^a Covariance structure: Variance components; Subject specification: schcode
^b Covariance structure: Scaled identity; Subject specification: none
Notice also that the variability at level 1 (Residual) is scaled to 1.0. This is because the
variance at level 1 is tied to the population proportion of individuals who are proficient, so it
cannot be estimated separately from the mean. Instead, it is simply scaled to 1.0 to provide a
metric for the log odds scale.
Despite the scaling to 1.0, an intraclass correlation (ICC) can be estimated describing the proportion of
variance that lies between units ($\sigma^2_{Between}$) relative to the total variance
($\sigma^2_{Between} + \sigma^2_{Within}$). The variance of a logistic distribution with scale factor 1.0 is
$\pi^2/3 \approx 3.29$ (Hox, 2002), so we can calculate an ICC ($\rho$) as follows:

$$\rho = \frac{\sigma^2_{Between}}{\sigma^2_{Between} + 3.29}.$$
In this case, it will be 0.386/(0.386 + 3.29), or 0.386/3.676, which is 0.105. This suggests about
10.5% of the variance in math proficiency lies between schools.
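Written out in Python, with 0.386 taken from Table 3:

```python
# Intraclass correlation for a two-level logistic model (Hox, 2002).
import math

var_between = 0.386
var_within = math.pi ** 2 / 3              # logistic level-1 variance, about 3.29
icc = var_between / (var_between + var_within)
print(round(icc, 3))                       # about 0.105: 10.5% between schools
```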
Individual Predictors
We might decide to go ahead and build a multilevel model. We can interpret the intercept as the
log odds of being proficient when all the other variables are 0. In this case, SES and GPA are
standardized (Mean = 0, SD = 1). The intercept is therefore the log odds of being proficient when
the student is not in the more advanced curricular program (coded 0) and is at the mean on SES
and GPA. Such a person is about 2.02 times as likely to be proficient as not proficient (Odds Ratio = 2.02, p < .01).
Table 4: Fixed Effects

| Model Term | Coefficient | Std. Error | t | Sig. | Exp(Coefficient) | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| Intercept | 0.704 | 0.036 | 19.776 | .000 | 2.021 | 1.885 | 2.167 |
| ses | 0.527 | 0.040 | 13.300 | .000 | 1.694 | 1.568 | 1.831 |
| gpa | 0.432 | 0.031 | 14.111 | .000 | 1.541 | 1.451 | 1.636 |
| acprog=1 | 0.137 | 0.074 | 1.859 | .063 | 1.147 | 0.993 | 1.325 |
| acprog=0 | 0^a | | | | | | |

Probability distribution: Binomial; Link function: Logit. The 95% confidence interval is for Exp(Coefficient).
^a This coefficient is set to zero because it is redundant.
Increasing student SES by 1-SD (since SES is standardized with mean = 0 and standard
deviation =1) would result in an increase in predicted log odds of being proficient of 0.527, other
variables being held constant. Alternatively, we can say that the odds of such an individual being
proficient increase by a factor of 1.694 compared with individuals at the mean for SES. We can
see some support for the view that being in the stronger academic program also increases the log
odds of being proficient in math (0.137, p < .07).
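The odds ratios and confidence intervals in Table 4 follow directly from the coefficients and standard errors; a short check in Python:

```python
# Recover Table 4's odds ratios and 95% confidence intervals from the
# coefficients and standard errors.
import math

for term, b, se in [("ses", 0.527, 0.040),
                    ("gpa", 0.432, 0.031),
                    ("acprog=1", 0.137, 0.074)]:
    lo = math.exp(b - 1.96 * se)
    hi = math.exp(b + 1.96 * se)
    print(f"{term}: OR = {math.exp(b):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```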
Adding School Predictors
We can see in the following table that when we add the two school-level predictors, only student
composition is significantly related to likelihood to be proficient (log odds = 0.541, p < .01).
Increasing student SES composition by 1-SD would increase the log odds of being proficient by
0.541 units.
Table 5: Fixed Effects

| Model Term | Coefficient | Std. Error | t | Sig. | Exp(Coefficient) | 95% CI Lower | 95% CI Upper |
|---|---|---|---|---|---|---|---|
| Intercept | 0.711 | 0.047 | 15.123 | .000 | 2.036 | 1.856 | 2.232 |
| ses | 0.360 | 0.046 | 7.900 | .000 | 1.434 | 1.311 | 1.568 |
| gpa | 0.431 | 0.031 | 13.918 | .000 | 1.538 | 1.448 | 1.635 |
| acprog=1 | 0.158 | 0.074 | 2.128 | .033 | 1.171 | 1.013 | 1.355 |
| acprog=0 | 0^a | | | | | | |
| acadfocus | 0.061 | 0.256 | 0.238 | .812 | 1.063 | 0.643 | 1.755 |
| studentcomp | 0.541 | 0.086 | 6.294 | .000 | 1.717 | 1.451 | 2.033 |

Probability distribution: Binomial; Link function: Logit. The 95% confidence interval is for Exp(Coefficient).
^a This coefficient is set to zero because it is redundant.
In terms of odds ratios, increasing composition by 1-SD (since it is also standardized) would
increase the odds of being proficient by a factor of 1.717. We should note that because odds
ratios are multiplicative rather than additive, if we increased student composition by 2-SD, the
resulting odds of being proficient would increase by a factor of 2.95. To obtain this new estimated odds
ratio, we can first add the log odds (which are the exponents) and then exponentiate the new log
odds:

$$e^{0.541 + 0.541} = e^{1.082} = 2.95.$$

We can also obtain this result by multiplying the odds ratios (1.717 × 1.717 = 2.95). The odds
ratios should not be added, however (1.717 + 1.717 = 3.434 would be wrong). So at 2-SD above the grand mean,
the odds of being proficient would be increased by a factor of 2.95, or approximately 3 to 1.
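A final numeric check of the multiplicative rule, using the studentcomp coefficient (0.541) from Table 5:

```python
# Odds ratios combine multiplicatively across units of the predictor.
import math

beta = 0.541
print(math.exp(beta))       # 1.717 for a 1-SD increase
print(math.exp(2 * beta))   # 2.95 for a 2-SD increase
print(math.exp(beta) ** 2)  # same 2.95: multiply the odds ratios, don't add them
```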
We could, of course, extend our analysis to examine a random slope such as the SES-proficiency
slope, or perhaps the impact of being in the more advanced academic program might vary across
schools. We can also extend the basic two-level model to three-level models. Similarly, we could
extend this basic dichotomous cross-sectional model to represent a longitudinal model looking at
the likelihood of being proficient at different points in time (e.g., from 9th through 12th grades).
We cover a number of other types of models with categorical outcomes in our new workbook on
categorical models using IBM SPSS.