A Multilevel Structural Equation Model for Dyadic Data

advertisement
A Multilevel Structural Equation Model for Dyadic Data
Jason T. Newsom
Portland State Unversity
I begin by giving a brief overview of latent growth models and multilevel
regression (i.e., hierarchical linear models). I've assumed some familiarity with
both of these techniques, but I've summarized them to illustrate their parallels in
the case of repeated measures. I then proceed by proposing a (new, I think)
structural equation model approach to hierarchical regression in the case of
dyadic data. All comments and questions are welcome and encouraged
(newsomj@pdx.edu).
Latent Growth Curve Models
Structural equation modeling (SEM) can be used to estimate individual growth
curves by using repeated measures as indicators of two latent variables, an
intercept variable (0) and a slope variable (1), called “latent growth curve
models”. The interpretation of the intercept variable depends on how the
loadings on the slope factor are fixed. For instance, one approach to defining the
slope variable is to fix loadings to values 0, 1, 2, 3,4, . . . t-1. In this case, the
intercept latent variable represents the initial value, because the first loading on
1 is set to 0. One can also “center “ these loadings by setting the middle time
point to 0 (e.g., -3.,-2,-1,0,1,2,3), giving the intercept factor the value of the
average score across all time points [see here for a Figure showing the general
model specification (with estimated loadings on the slope factor, discussed
later)].
Mathematically, the latent growth curve model is represented by the following set
of formulas:
level-1 equation (measurement model):
yit  it0i  it1i   it
(1.1)
level-2 equations (structural model):
0  0   0
(1.2)
1  1   1
(1.3)
yit is the dependent variable. The subscripts i and t indicate a measurement
within an individual, i, for each time point, t. 0 is a latent variable that represents
the level-1 intercept, 1 is a latent variable that represents the relationship
between the time code and the dependent variable (i.e., the growth trajectory), it
are the loadings for each time point on the intercept latent variable (0) and the
slope latent variable (1). (The intercept, , associated with each loading is
assumed to be zero and is not shown above.) For simplicity, no level-2 predictors
are presented in (1.2) or (1.3), but could be included as predictors of the intercept
or slope variables. In the level-2 equations, 0 and 1 are the intercepts or
average value of 0 and 1 respectively, and 0 and 1 are error terms.
More traditionally, the structural model would be represented by grouping each
variable into matrices:
(1.4)
y    
   
(1.5)
In these equations,  is a 2 X t matrix representing the relationship between 2
latent variables, 0 and 1, and t indicators, one for each time point. The column
in  which corresponds to 0 is comprised of all 1s, because each loading for
this variable is set equal to 1 to define it as the intercept.  is the matrix of
measurement errors for each indicator at each time point.  is a 2 X 1 vector
containing latent means for the intercept and slope, representing the average
intercept across individuals and the average slope (i.e., trajectory) across
individuals.  is the error term. The variance of the intercepts and slopes, 0 and
1, are obtained by estimation of the  matrix.
Multilevel Regression Models (Hierarchical Linear Models:
HLM)
Multilevel regression models estimate predictive relationships when the data are
nested or hierarchically structured, as in the case of students nested within
schools. The statistical model used for hierarchically structured data is the same
statistical model used for longitudinal analysis of individual growth curves. With
growth curve models, longitudinal data measurements are considered to be
nested within individuals. In general, a multilevel regression with a single level-1
predictor and no level-2 predictors can be written with two sets of equations:
level-1 equation:
yi  0  1 xi  ri
(1.6)
level-2 equations:
0g   00 g  u0g
(1.7)
1   01g  u1g
(1.8)
g
The first equation is the familiar regression equation, with r representing error or
unexplained variance. The subscripts i and g indicate whether the value is for
each individual or each group. In the second equation, the intercept values for
each group serve as the dependent variable. For simplicity sake, there are no
predictors in equation (1.7) or (1.8). In (1.7), 00 is the intercept (mean of all
group intercepts), and u0 is the error or remaining variance. u0 can also be
interpreted as the variance of the intercept values across groups. Since the
intercepts represent adjusted means for each group (i.e., adjusting or controlling
for the effects of xi), u0 is the variance of the adjusted means for each group. In
the third equation (1.8), the slopes from the level-1 equation, 1g, serve as
dependent variables for each group. 01 is the intercept in this equation, and
represents the average of all slopes, 1g, interpreted as the average effect of xi
on the dependent variable across all groups. u1 is the error term or the variance
of the slopes across groups (i.e., the variability in the relationship between x and
y across the groups). By substituting equations (1.7) and (1.8) into equation
(1.6), the HLM model can be expressed as,

 

yi   00g  u0 g   01g  u1g xi  ri
(1.9)
or, by rearranging the terms,
yi   00g  u0 g   01g xi  u1g xi  ri
(1.10)
If growth curve models are tested, the level-1 x variable is replaced by time
codes, xt (e.g., 0,1,2,3...t-1). The dependent variable at each time point is
regressed on the time code at level-1. Instead of individuals nested within
groups, repeated measures are nested within individuals. Level 2 consists of
individuals rather than groups.
Comparing the SEM and HLM growth models
Both approaches to latent growth models are essentially equivalent. One
important difference between the two approaches is that the SEM approach
accounts for measurement error at each time point (in the  matrix). The
parallels between the SEM and HLM approaches can be seen by comparing their
algebraic formulas.
Level-1
Level-2
SEM
yit  it0i  it1i   it (1.1)
0   0   0 (1.2)
1  1   1 (1.3)
HLM
yi  0  1 xi  ri (1.6)
0g   00 g  u0g (1.7)
1   01g  u1g (1.8)
g
These formulas are parallel, although this may not be fully apparent at first
glance. In SEM, loadings are used in place of level-1 regression coefficients. In
equation (1.1), the level-1 intercept is represented by the product term it0i,
which refers to the loadings for latent intercept variable and the intercept variable
itself. The  matrix is analogous to the X matrix in matrix regression in which the
first column is a vector of 1's used to produce the intercept. By setting the
loadings for the intercept (0i) to 1, the product of the loadings and the intercept
(it0i ) of (1.1) is simply equivalent to the intercept term, 0, of equation (1.6).
The next term in equation (1.1), it1i, representing the slope factor, can also be
considered identical to the slope in equation (1.6) as long as the loadings in the 
matrix are set to values that would be used as predictors in growth curve
analysis, such as 0, 1, 2, . . . t-1. Here, it in (1.1) is equivalent to xt in (1.6).
Because 's are equivalent to x's and the 's are equivalent to 's, it would make
more sense to re-express equation (1.1) as,
yit  0i it  1i it   it
(1.11)
By estimating the means and variances of 0i and 1i , we can obtain estimates of
the average latent intercept and average latent slope and the extent to which
they vary across individuals.
A Multilevel Structural Equation Model for Dyadic Data
Because growth models (repeated measures) and two-level hierarchical
regression models are identical statistical models, as I've shown above, it is
possible to specify a multilevel SEM for certain hierarchical data situations which
uses the same model specifications as those used in the growth model case. I
start by describing a dyadic data situation (e.g., couples, twins, mother-child
dyads), for which this approach is clearly the simplest and possibly the best
suited. I describe the model specification and give an example. I then discuss
the possibilities of applying this technique to other data analytic situations. The
generalization of the approach should be relatively straight forward for small
groups sizes where the sample size is balanced (equal in all groups). Finally, I
will discuss the possibilities of generalizing to situations in which group sizes are
not equal but remain fairly small.
The usual multilevel approach has been to set up between and within covariance
matrices, corresponding to level-1 and level-2 variables, and use a multigroup
strategy to estimate the model (eg., Muthen, 1997; Muthen and Satorra, 1995).
This approach can be implemented in any of the current SEM software packages
by creating between and within covariance matrices, which are then analyzed in
a multigroup structural model. This process has been recently automated in
Mplus. This approach has a few limitations. First, except for Mplus, the
procedure of constructing and reading separate covariance matrices can be
cumbersome and analytically intensive, especially for larger models. Second, the
multigroup approach is limited to separate within and between models. That is,
one cannot analyze relationships between level-1 and level-2 variables.
Multilevel regression, in contrast, allows for prediction of level-1 intercepts by
level-2 predictors and for prediction of level-1 slopes by level-2 variables (called
cross-level interactions"). Third, with small groups, nonconvergence of multilevel
structural equation models can be a problem, especially with lower intraclass
correlations (see Muthen & Satorra, 1995).
Data requirements. At minimum, one merely needs to have a single dependent
measure obtained from each member of the dyad. Multiple indicators for each
individual can also be used (see Specifications 2 and 3 below). For dyadic data,
it would be optimal to have at least three indicators for each member of the
couple. Members of each couple must also be nonexchangable. That is, there
must be a basis for distinguishing members of each couple in an identical
manner in all groups. Examples might include husbands and wives, mother and
child, first born and second born, or caregiver and care recipient. The data set is
set up in a so-called "repeated measures" format, in which each case in the data
matrix contains information about the dyad. For instance, each record contains
information about the husband and the wife, recorded under different variable
names (e.g., y1h, y2h, y3h, y4w, y5w, y6w).
Example. To illustrate, I will use an example from a study I conducted recently
examining interactions between spousal caregivers and care recipients. There
are 118 couples (236 individuals), in which each member of the couple was
interviewed separately. I examine five items from the Veit and Ware (1983)
positive affect subscale of the Mental Health Inventory. Items such as "How
much of the time have you felt the future look hopeful and promising?" on a 6point scale of frequency of occurrence. Thus the analysis is based on 10
variables--5 items for caregivers and 5 items for care recipients.
Model specification 1. I first take the simplest case in which there is only one
measure (i.e., indicator) of positive affect for caregivers and for care recipients
(the measure was computed by averaging the five items for each). This model
specification follows that of the growth curve model described above in the case
in which there is only two time points tested. The first approach is depicted in
Figure 1 below. Two latent variables are defined: a latent intercept, 0, and a
latent slope, 1,. There is only one indicator for each of these latent variables.
The loadings are fixed to specified values and the measurement errors are fixed
to zero. The intercept variable, 0, is defined by fixing loadings on each of the
two indicators to 1. The slope variable, 1, is defined by fixing loadings on the
same two indicators to 0 and 1. The average intercept and average slope
across couples (i.e., 00 and 01 in HLM notation) are obtained by estimating the
mean structures of each (not estimated by default in SEM software packages).
The variances of the slopes and intercepts across couples are indicated by the
variances of 0 and 1 (i.e., the PSI matrix).
Figure 1. Multilevel Model for Dyadic Data: Specification 0
0
0
y1cg
1
y2cr
1
0
(Intercept)
0
1

(Slope)
Note: y1cg represents caregiver’s positive affect, y2 cr represents care recipient’s positive affect.
Loadings for 1 are set to 0 for caregivers and 1 for care recipients to define 0 as the average
score for wives taken across couples.
Results. Parallel analyses were conducted using Mplus (Muthen & Muthen,
1999) and HLM 5 (Raudenbush, Bryk, Cheong, & Congdon, 2000). In HLM, it is
not possible to estimate variances (i.e., random effects) for the slopes and
intercepts simultaneously with dyadic data. One can obtain estimates by running
separate models fixing the random effect of either the intercept of the slope to
zero, but these estimates will differ from a simultaneous estimation of both.
Therefore, they are not presented here. Analyses are presented for dummy
coding of the caregiving variable (0 and 1) and for group-mean centering. When
dummy coding is used, the intercept represents the average score for caregivers
(because they were coded as 0). When group-centering is used, the average
intercept represents the grand mean for all couples (caregivers and care
recipients combined). In multilevel regression, group-mean centering is achieved
by subtracting each individual’s score from the mean of the dyad (this can be
done automatically in the HLM software). To obtain the group-mean centered
solution using the SEM approach, however, one simply sets the loadings of the
slope variable to -.5 and +.5, rather than 0 and 1.
As can be seen in Table 1, the means (and their standard errors) for the intercept
and slope variable are highly similar SEM method and the HLM method. The
average intercept, approximately 4.2, represents the average positive affect for
caregivers. Table 2 presents results when a group-centered approach is used.
Notice that the average intercept obtained with centering differs little from that
obtained with dummy coding. This is because there is very little difference
between caregivers and care recipients on positive affect scores. Thus the
average of caregivers and care recipients combined is similar to the average for
caregivers only. The average slope (-.056 using the SEM method) represents
the relationship between the difference variable (caregivers vs. care recipients)
and the dependent variable, and is identical to a test between the caregiver and
care recipient means on positive affect.
Table 1. Mean and variance estimates when the caregiving variable is
uncentered/dummy coded (0,1).
Intercept
Slope
Means (SE)
HLM
4.221 (.088)
-.031 (.126)
SEM
4.214 (.082)
-.056 (.099)
Variances
HLM
NA
NA
SEM
.791 (.103)
1.141 (.149)
Table 2. Mean and variance estimates when the caregiving variable is groupmean centered (-.5,+.5).
Intercept
Slope
Means (SE)
HLM
4.205 (.072)
-.031 (.103)
SEM
4.186 (.069)
-.056 (.099)
Variances (SE)
HLM
NA
NA
SEM
.562 (.073)
1.141 (.149)
Model specification 2. The above approach has two shortcomings: it assumes
no measurement error and only single indicators are used. Another specification
is possible when there are several indicators. In this specification true latent
variables are used for the intercept, 0, and a latent slope, 1,. Each of the ten
indicators (5 caregiver items, 5 care recipient items) define the latent intercept
and each loading is set equal to 1. The slope variable represents a difference
variable, or dummy variable, distinguishing between the caregiver and care
recipient. In Figure 2, I have coded caregivers as 0 and care recipients as 1. To
accomplish this, the loadings for each of the caregiver items are set to 0 and
each of the loadings for care recipients is set to 1 for the slope factor. The slope
variable, 1, then represents the difference between caregivers and recipients on
positive affect. In this specification, each of the five items are assumed to be
equivalent indicators for caregivers and for care recipients (as is the case when
computes an equally weighted composite score). Because of the 0-1 coding
used for the latent slope variable, the intercept variable represents the latent
mean for caregivers. One could also use an effects code (-1,+1 or -.5,+.5) for the
slope variable, to center the variable within each group. In this case, the latent
intercept would represent the average score for each couple. To account for
intercorrelations between parallel items for caregivers and care recipients, I also
estimate the correlations among measurement errors. The error correlations
represent association between the errors over and above the variance accounted
for by the the intercept and the difference factor. Ideally, one should test these
correlations for significance. If they are not significant, the correlations can be
set to zero. As in growth curve analysis, the individual intercepts in the
measurement equation, , are set to zero
The model shown in Figure 2 below, corresponds to a simple multilevel
regression model, with one level-1 predictor and no level-2 predictors as in
equations (1.6) through (1.8).
Figure 2. Multilevel Model for Dyadic Data: Specification 2
y1cg
y2cg
y3cg
y4cg
y5cg
y6cr
y7cr
y8cr
y9cr y10cr
111
0 1 1 1
11
11
00
1 1
11 0
0
0

(Intercept)
(Slope)
Note: y1cg-y5cg represent responses from caregivers, y6cr-y10cr represent responses from care recipients.
Loadings for 1 are set to 0 for caregivers and 1 for care recipients to define 0 as the average
score for wives taken across couples.
Results. Table 3 and Table 4 present the results previously obtained from HLM
and new results using Mplus and Specification 2. The most notable difference
between results obtained with Specification 1 and Specification 2 is that the
variances (or random effects) are smaller under Specification 2. This is the result
of estimation of measurement error by the use of true latent variables for the
intercept and slope.
Table 3. Mean and variance estimates when the caregiving variable is
uncentered/dummy coded (0,1).
Intercept
Slope
Means (SE)
HLM
4.221 (.088)
-.031 (.126)
SEM
4.253 (.075)
-.043 (.096)
Variances
HLM
NA
NA
SEM
.538 (.086)
.824 (.142)
Similar analyses were conducted using group-mean centering of the caregiving
variable. Using SEM, the equivalent model to the group-mean centered model in
HLM constrains the loadings on 1 to -.5 and +.5. With group-mean centering,
the intercept value is the mean of both groups.
Table 4. Mean and variance estimates when the caregiving variable is groupmean centered (-.5,+.5).
Intercept
Slope
Means (SE)
HLM
4.205 (.072)
-.031 (.103)
SEM
4.226 (.068)
-.107 (.095)
Variances (SE)
HLM
NA
NA
SEM
.483 (.071)
.820 (.139)
Model specification 3. In Model Specification 2 above, loadings on 1 were
assumed to be equal for caregivers and for care recipients. An alternative
specification, using a second-order factor model for the slope variable would not
require this assumption. Using this model specification, the loadings for the first
order factor can be freely estimated, allowing for unequal loadings for each item.
Two second-order factors, 0 and 1 can be used to represent the intercept and
slope. As before, an uncentered, dummy coding (0,1) or group-mean centered
coding (-.5,+.5) can be used to provide different interpretations for the intercept
variable 0 . If dummy coding is used, the intercept represents the mean for the
member of the dyad who is assigned 0 for the slope variable; and, if centered
codes are used, the intercept represents the mean for the full sample. Several
details of the specification are important. First, the intercepts for the first order
latent variables (cg and cr ) should be constrained to zero. Second, it is best to
constrain loadings for the first order factors to be equal for parallel items. Third,
the disturbances and the correlation for the first order factors should be set to
zero. Fourth, the individual intercepts in the first order measurement equation, ,
are set to zero. Figure 3 below illustrates the model using dummy coding.
Figure 3. Multilevel Model for Dyadic Data: Specification 3
y1cg
y2cg
*
1
y3cg
*
y4cg
*
cg
y6cr
y5cg
*
y7cr
1
*
*
cr
0
0
y8cr
y9cr y10cr
*
*
0
1
1
1
0
(Intercept)
1
(Slope)
Note: y1cg-y5cg represent responses from caregivers, y6cr-y10cr represent responses from care recipients.
cg is a latent variable for caregivers, and cr is a latent variable for care recipients. 0 and 1 represent the
intercept and the slope, respectively.
Results. As can be seen in Table 3, this model specification produces
comparable values for the intercept and the slope mean. The two models,
centered and uncentered, produce identical estimates for the slope and its
standard error. The intercept values, however, differ because in the uncentered
case (0,1), the intercept represents the mean of caregivers, whereas, in the
centered case (-.5,+.5), the intercept represents the mean of both caregivers and
care recipients. As before, because the mean slope is not significant, caregivers
and care recipients do not differ on positive affect, and this explains the relatively
small difference between the mean intercept value under centered and
uncentered coding.
Table 3. Results from the second-order factor model approach.
Intercept
Slope
Means (SE)
Uncentered
4.213 (.093)
-.079 (.096)
Centered
4.173 (.083)
-.079 (.096)
Variances (SE)
Uncentered
.610 (.094)
.870 (.142)
Centered
.468 (.069)
.870 (.142)
Discussion
The SEM approaches described above provide several advantages over existing
multilevel structural equation models. First, multilevel models with small groups
(e.g., under 10) often fail to converge, and in some cases do not have sufficient
degrees of freedom to estimate all random effects. The dyadic approaches
presented here should not have similar problems with convergence. Second,
although the models presented above did not include level-2 predictors (e.g.,
household income), these variables would be simple to include. This would
provide a substantial advantage over the current multigroup specification of
multilevel models, because slopes and intercepts could be used as predicted
outcomes (or predictors). Incorporation of level-1 type predictors could be
included by predicting first order factors (either with standard latent variables or
other slopes and intercepts). Third, the model specification proposed does not
require separate covariance matrices for within and between components or
special data configurations in which the data are disaggregated. This makes
implementation relatively easy in most SEM packages that do not have special
features for multilevel models.
Generalization of the Approach
(In this section I present some very rough, preliminary ideas regarding
generalization of the dyadic multilevel model. )
The dyadic model specifications above could be generalized to data involving
larger groups (e.g., 2, 3, 4,or 5 individuals per group), such as family units or
small working groups. The application of these models, however, would be
limited to the nonexchangeable case, in which each group member has a role
consistently present across groups. As the number of individuals per group
increases, however, models would become more complex. Of particular difficulty
is the specification of the difference slope variable. A simple difference variable
becomes more complicated to implement, because there must be s-1 dummy
variables to represent comparisons (where s is the number of individuals per
group). These dummy variables would become unwieldy as the number of group
members increases. However, one could omit latent slopes under these
conditions. If the second order factor approach is used, other level-1 predictors
could be included as predictors of the first order factors. Through equality
constraints on the causal parameters or use of the multilevel models similar to
those specified above, prediction of slopes by level-2 variables (cross-level
interactions) could be implemented. Finally, although the examples presented
above are based on equal group sizes (i.e, balanced n), application of these
models would not necessarily be restricted to balanced data. Through the use of
data imputation methods (such as the EM algorithm) now available in several
SEM software packages, unbalanced data could also be analyzed. Data
imputation with the EM algorithm can be shown to be equivalent to Bayesian
estimates obtained in multilevel regression when group sizes are unequal.
Download