A Multilevel Structural Equation Model for Dyadic Data Jason T. Newsom Portland State Unversity I begin by giving a brief overview of latent growth models and multilevel regression (i.e., hierarchical linear models). I've assumed some familiarity with both of these techniques, but I've summarized them to illustrate their parallels in the case of repeated measures. I then proceed by proposing a (new, I think) structural equation model approach to hierarchical regression in the case of dyadic data. All comments and questions are welcome and encouraged (newsomj@pdx.edu). Latent Growth Curve Models Structural equation modeling (SEM) can be used to estimate individual growth curves by using repeated measures as indicators of two latent variables, an intercept variable (0) and a slope variable (1), called “latent growth curve models”. The interpretation of the intercept variable depends on how the loadings on the slope factor are fixed. For instance, one approach to defining the slope variable is to fix loadings to values 0, 1, 2, 3,4, . . . t-1. In this case, the intercept latent variable represents the initial value, because the first loading on 1 is set to 0. One can also “center “ these loadings by setting the middle time point to 0 (e.g., -3.,-2,-1,0,1,2,3), giving the intercept factor the value of the average score across all time points [see here for a Figure showing the general model specification (with estimated loadings on the slope factor, discussed later)]. Mathematically, the latent growth curve model is represented by the following set of formulas: level-1 equation (measurement model): yit it0i it1i it (1.1) level-2 equations (structural model): 0 0 0 (1.2) 1 1 1 (1.3) yit is the dependent variable. The subscripts i and t indicate a measurement within an individual, i, for each time point, t. 0 is a latent variable that represents the level-1 intercept, 1 is a latent variable that represents the relationship between the time code and the dependent variable (i.e., the growth trajectory), it are the loadings for each time point on the intercept latent variable (0) and the slope latent variable (1). (The intercept, , associated with each loading is assumed to be zero and is not shown above.) For simplicity, no level-2 predictors are presented in (1.2) or (1.3), but could be included as predictors of the intercept or slope variables. In the level-2 equations, 0 and 1 are the intercepts or average value of 0 and 1 respectively, and 0 and 1 are error terms. More traditionally, the structural model would be represented by grouping each variable into matrices: (1.4) y (1.5) In these equations, is a 2 X t matrix representing the relationship between 2 latent variables, 0 and 1, and t indicators, one for each time point. The column in which corresponds to 0 is comprised of all 1s, because each loading for this variable is set equal to 1 to define it as the intercept. is the matrix of measurement errors for each indicator at each time point. is a 2 X 1 vector containing latent means for the intercept and slope, representing the average intercept across individuals and the average slope (i.e., trajectory) across individuals. is the error term. The variance of the intercepts and slopes, 0 and 1, are obtained by estimation of the matrix. Multilevel Regression Models (Hierarchical Linear Models: HLM) Multilevel regression models estimate predictive relationships when the data are nested or hierarchically structured, as in the case of students nested within schools. The statistical model used for hierarchically structured data is the same statistical model used for longitudinal analysis of individual growth curves. With growth curve models, longitudinal data measurements are considered to be nested within individuals. In general, a multilevel regression with a single level-1 predictor and no level-2 predictors can be written with two sets of equations: level-1 equation: yi 0 1 xi ri (1.6) level-2 equations: 0g 00 g u0g (1.7) 1 01g u1g (1.8) g The first equation is the familiar regression equation, with r representing error or unexplained variance. The subscripts i and g indicate whether the value is for each individual or each group. In the second equation, the intercept values for each group serve as the dependent variable. For simplicity sake, there are no predictors in equation (1.7) or (1.8). In (1.7), 00 is the intercept (mean of all group intercepts), and u0 is the error or remaining variance. u0 can also be interpreted as the variance of the intercept values across groups. Since the intercepts represent adjusted means for each group (i.e., adjusting or controlling for the effects of xi), u0 is the variance of the adjusted means for each group. In the third equation (1.8), the slopes from the level-1 equation, 1g, serve as dependent variables for each group. 01 is the intercept in this equation, and represents the average of all slopes, 1g, interpreted as the average effect of xi on the dependent variable across all groups. u1 is the error term or the variance of the slopes across groups (i.e., the variability in the relationship between x and y across the groups). By substituting equations (1.7) and (1.8) into equation (1.6), the HLM model can be expressed as, yi 00g u0 g 01g u1g xi ri (1.9) or, by rearranging the terms, yi 00g u0 g 01g xi u1g xi ri (1.10) If growth curve models are tested, the level-1 x variable is replaced by time codes, xt (e.g., 0,1,2,3...t-1). The dependent variable at each time point is regressed on the time code at level-1. Instead of individuals nested within groups, repeated measures are nested within individuals. Level 2 consists of individuals rather than groups. Comparing the SEM and HLM growth models Both approaches to latent growth models are essentially equivalent. One important difference between the two approaches is that the SEM approach accounts for measurement error at each time point (in the matrix). The parallels between the SEM and HLM approaches can be seen by comparing their algebraic formulas. Level-1 Level-2 SEM yit it0i it1i it (1.1) 0 0 0 (1.2) 1 1 1 (1.3) HLM yi 0 1 xi ri (1.6) 0g 00 g u0g (1.7) 1 01g u1g (1.8) g These formulas are parallel, although this may not be fully apparent at first glance. In SEM, loadings are used in place of level-1 regression coefficients. In equation (1.1), the level-1 intercept is represented by the product term it0i, which refers to the loadings for latent intercept variable and the intercept variable itself. The matrix is analogous to the X matrix in matrix regression in which the first column is a vector of 1's used to produce the intercept. By setting the loadings for the intercept (0i) to 1, the product of the loadings and the intercept (it0i ) of (1.1) is simply equivalent to the intercept term, 0, of equation (1.6). The next term in equation (1.1), it1i, representing the slope factor, can also be considered identical to the slope in equation (1.6) as long as the loadings in the matrix are set to values that would be used as predictors in growth curve analysis, such as 0, 1, 2, . . . t-1. Here, it in (1.1) is equivalent to xt in (1.6). Because 's are equivalent to x's and the 's are equivalent to 's, it would make more sense to re-express equation (1.1) as, yit 0i it 1i it it (1.11) By estimating the means and variances of 0i and 1i , we can obtain estimates of the average latent intercept and average latent slope and the extent to which they vary across individuals. A Multilevel Structural Equation Model for Dyadic Data Because growth models (repeated measures) and two-level hierarchical regression models are identical statistical models, as I've shown above, it is possible to specify a multilevel SEM for certain hierarchical data situations which uses the same model specifications as those used in the growth model case. I start by describing a dyadic data situation (e.g., couples, twins, mother-child dyads), for which this approach is clearly the simplest and possibly the best suited. I describe the model specification and give an example. I then discuss the possibilities of applying this technique to other data analytic situations. The generalization of the approach should be relatively straight forward for small groups sizes where the sample size is balanced (equal in all groups). Finally, I will discuss the possibilities of generalizing to situations in which group sizes are not equal but remain fairly small. The usual multilevel approach has been to set up between and within covariance matrices, corresponding to level-1 and level-2 variables, and use a multigroup strategy to estimate the model (eg., Muthen, 1997; Muthen and Satorra, 1995). This approach can be implemented in any of the current SEM software packages by creating between and within covariance matrices, which are then analyzed in a multigroup structural model. This process has been recently automated in Mplus. This approach has a few limitations. First, except for Mplus, the procedure of constructing and reading separate covariance matrices can be cumbersome and analytically intensive, especially for larger models. Second, the multigroup approach is limited to separate within and between models. That is, one cannot analyze relationships between level-1 and level-2 variables. Multilevel regression, in contrast, allows for prediction of level-1 intercepts by level-2 predictors and for prediction of level-1 slopes by level-2 variables (called cross-level interactions"). Third, with small groups, nonconvergence of multilevel structural equation models can be a problem, especially with lower intraclass correlations (see Muthen & Satorra, 1995). Data requirements. At minimum, one merely needs to have a single dependent measure obtained from each member of the dyad. Multiple indicators for each individual can also be used (see Specifications 2 and 3 below). For dyadic data, it would be optimal to have at least three indicators for each member of the couple. Members of each couple must also be nonexchangable. That is, there must be a basis for distinguishing members of each couple in an identical manner in all groups. Examples might include husbands and wives, mother and child, first born and second born, or caregiver and care recipient. The data set is set up in a so-called "repeated measures" format, in which each case in the data matrix contains information about the dyad. For instance, each record contains information about the husband and the wife, recorded under different variable names (e.g., y1h, y2h, y3h, y4w, y5w, y6w). Example. To illustrate, I will use an example from a study I conducted recently examining interactions between spousal caregivers and care recipients. There are 118 couples (236 individuals), in which each member of the couple was interviewed separately. I examine five items from the Veit and Ware (1983) positive affect subscale of the Mental Health Inventory. Items such as "How much of the time have you felt the future look hopeful and promising?" on a 6point scale of frequency of occurrence. Thus the analysis is based on 10 variables--5 items for caregivers and 5 items for care recipients. Model specification 1. I first take the simplest case in which there is only one measure (i.e., indicator) of positive affect for caregivers and for care recipients (the measure was computed by averaging the five items for each). This model specification follows that of the growth curve model described above in the case in which there is only two time points tested. The first approach is depicted in Figure 1 below. Two latent variables are defined: a latent intercept, 0, and a latent slope, 1,. There is only one indicator for each of these latent variables. The loadings are fixed to specified values and the measurement errors are fixed to zero. The intercept variable, 0, is defined by fixing loadings on each of the two indicators to 1. The slope variable, 1, is defined by fixing loadings on the same two indicators to 0 and 1. The average intercept and average slope across couples (i.e., 00 and 01 in HLM notation) are obtained by estimating the mean structures of each (not estimated by default in SEM software packages). The variances of the slopes and intercepts across couples are indicated by the variances of 0 and 1 (i.e., the PSI matrix). Figure 1. Multilevel Model for Dyadic Data: Specification 0 0 0 y1cg 1 y2cr 1 0 (Intercept) 0 1 (Slope) Note: y1cg represents caregiver’s positive affect, y2 cr represents care recipient’s positive affect. Loadings for 1 are set to 0 for caregivers and 1 for care recipients to define 0 as the average score for wives taken across couples. Results. Parallel analyses were conducted using Mplus (Muthen & Muthen, 1999) and HLM 5 (Raudenbush, Bryk, Cheong, & Congdon, 2000). In HLM, it is not possible to estimate variances (i.e., random effects) for the slopes and intercepts simultaneously with dyadic data. One can obtain estimates by running separate models fixing the random effect of either the intercept of the slope to zero, but these estimates will differ from a simultaneous estimation of both. Therefore, they are not presented here. Analyses are presented for dummy coding of the caregiving variable (0 and 1) and for group-mean centering. When dummy coding is used, the intercept represents the average score for caregivers (because they were coded as 0). When group-centering is used, the average intercept represents the grand mean for all couples (caregivers and care recipients combined). In multilevel regression, group-mean centering is achieved by subtracting each individual’s score from the mean of the dyad (this can be done automatically in the HLM software). To obtain the group-mean centered solution using the SEM approach, however, one simply sets the loadings of the slope variable to -.5 and +.5, rather than 0 and 1. As can be seen in Table 1, the means (and their standard errors) for the intercept and slope variable are highly similar SEM method and the HLM method. The average intercept, approximately 4.2, represents the average positive affect for caregivers. Table 2 presents results when a group-centered approach is used. Notice that the average intercept obtained with centering differs little from that obtained with dummy coding. This is because there is very little difference between caregivers and care recipients on positive affect scores. Thus the average of caregivers and care recipients combined is similar to the average for caregivers only. The average slope (-.056 using the SEM method) represents the relationship between the difference variable (caregivers vs. care recipients) and the dependent variable, and is identical to a test between the caregiver and care recipient means on positive affect. Table 1. Mean and variance estimates when the caregiving variable is uncentered/dummy coded (0,1). Intercept Slope Means (SE) HLM 4.221 (.088) -.031 (.126) SEM 4.214 (.082) -.056 (.099) Variances HLM NA NA SEM .791 (.103) 1.141 (.149) Table 2. Mean and variance estimates when the caregiving variable is groupmean centered (-.5,+.5). Intercept Slope Means (SE) HLM 4.205 (.072) -.031 (.103) SEM 4.186 (.069) -.056 (.099) Variances (SE) HLM NA NA SEM .562 (.073) 1.141 (.149) Model specification 2. The above approach has two shortcomings: it assumes no measurement error and only single indicators are used. Another specification is possible when there are several indicators. In this specification true latent variables are used for the intercept, 0, and a latent slope, 1,. Each of the ten indicators (5 caregiver items, 5 care recipient items) define the latent intercept and each loading is set equal to 1. The slope variable represents a difference variable, or dummy variable, distinguishing between the caregiver and care recipient. In Figure 2, I have coded caregivers as 0 and care recipients as 1. To accomplish this, the loadings for each of the caregiver items are set to 0 and each of the loadings for care recipients is set to 1 for the slope factor. The slope variable, 1, then represents the difference between caregivers and recipients on positive affect. In this specification, each of the five items are assumed to be equivalent indicators for caregivers and for care recipients (as is the case when computes an equally weighted composite score). Because of the 0-1 coding used for the latent slope variable, the intercept variable represents the latent mean for caregivers. One could also use an effects code (-1,+1 or -.5,+.5) for the slope variable, to center the variable within each group. In this case, the latent intercept would represent the average score for each couple. To account for intercorrelations between parallel items for caregivers and care recipients, I also estimate the correlations among measurement errors. The error correlations represent association between the errors over and above the variance accounted for by the the intercept and the difference factor. Ideally, one should test these correlations for significance. If they are not significant, the correlations can be set to zero. As in growth curve analysis, the individual intercepts in the measurement equation, , are set to zero The model shown in Figure 2 below, corresponds to a simple multilevel regression model, with one level-1 predictor and no level-2 predictors as in equations (1.6) through (1.8). Figure 2. Multilevel Model for Dyadic Data: Specification 2 y1cg y2cg y3cg y4cg y5cg y6cr y7cr y8cr y9cr y10cr 111 0 1 1 1 11 11 00 1 1 11 0 0 0 (Intercept) (Slope) Note: y1cg-y5cg represent responses from caregivers, y6cr-y10cr represent responses from care recipients. Loadings for 1 are set to 0 for caregivers and 1 for care recipients to define 0 as the average score for wives taken across couples. Results. Table 3 and Table 4 present the results previously obtained from HLM and new results using Mplus and Specification 2. The most notable difference between results obtained with Specification 1 and Specification 2 is that the variances (or random effects) are smaller under Specification 2. This is the result of estimation of measurement error by the use of true latent variables for the intercept and slope. Table 3. Mean and variance estimates when the caregiving variable is uncentered/dummy coded (0,1). Intercept Slope Means (SE) HLM 4.221 (.088) -.031 (.126) SEM 4.253 (.075) -.043 (.096) Variances HLM NA NA SEM .538 (.086) .824 (.142) Similar analyses were conducted using group-mean centering of the caregiving variable. Using SEM, the equivalent model to the group-mean centered model in HLM constrains the loadings on 1 to -.5 and +.5. With group-mean centering, the intercept value is the mean of both groups. Table 4. Mean and variance estimates when the caregiving variable is groupmean centered (-.5,+.5). Intercept Slope Means (SE) HLM 4.205 (.072) -.031 (.103) SEM 4.226 (.068) -.107 (.095) Variances (SE) HLM NA NA SEM .483 (.071) .820 (.139) Model specification 3. In Model Specification 2 above, loadings on 1 were assumed to be equal for caregivers and for care recipients. An alternative specification, using a second-order factor model for the slope variable would not require this assumption. Using this model specification, the loadings for the first order factor can be freely estimated, allowing for unequal loadings for each item. Two second-order factors, 0 and 1 can be used to represent the intercept and slope. As before, an uncentered, dummy coding (0,1) or group-mean centered coding (-.5,+.5) can be used to provide different interpretations for the intercept variable 0 . If dummy coding is used, the intercept represents the mean for the member of the dyad who is assigned 0 for the slope variable; and, if centered codes are used, the intercept represents the mean for the full sample. Several details of the specification are important. First, the intercepts for the first order latent variables (cg and cr ) should be constrained to zero. Second, it is best to constrain loadings for the first order factors to be equal for parallel items. Third, the disturbances and the correlation for the first order factors should be set to zero. Fourth, the individual intercepts in the first order measurement equation, , are set to zero. Figure 3 below illustrates the model using dummy coding. Figure 3. Multilevel Model for Dyadic Data: Specification 3 y1cg y2cg * 1 y3cg * y4cg * cg y6cr y5cg * y7cr 1 * * cr 0 0 y8cr y9cr y10cr * * 0 1 1 1 0 (Intercept) 1 (Slope) Note: y1cg-y5cg represent responses from caregivers, y6cr-y10cr represent responses from care recipients. cg is a latent variable for caregivers, and cr is a latent variable for care recipients. 0 and 1 represent the intercept and the slope, respectively. Results. As can be seen in Table 3, this model specification produces comparable values for the intercept and the slope mean. The two models, centered and uncentered, produce identical estimates for the slope and its standard error. The intercept values, however, differ because in the uncentered case (0,1), the intercept represents the mean of caregivers, whereas, in the centered case (-.5,+.5), the intercept represents the mean of both caregivers and care recipients. As before, because the mean slope is not significant, caregivers and care recipients do not differ on positive affect, and this explains the relatively small difference between the mean intercept value under centered and uncentered coding. Table 3. Results from the second-order factor model approach. Intercept Slope Means (SE) Uncentered 4.213 (.093) -.079 (.096) Centered 4.173 (.083) -.079 (.096) Variances (SE) Uncentered .610 (.094) .870 (.142) Centered .468 (.069) .870 (.142) Discussion The SEM approaches described above provide several advantages over existing multilevel structural equation models. First, multilevel models with small groups (e.g., under 10) often fail to converge, and in some cases do not have sufficient degrees of freedom to estimate all random effects. The dyadic approaches presented here should not have similar problems with convergence. Second, although the models presented above did not include level-2 predictors (e.g., household income), these variables would be simple to include. This would provide a substantial advantage over the current multigroup specification of multilevel models, because slopes and intercepts could be used as predicted outcomes (or predictors). Incorporation of level-1 type predictors could be included by predicting first order factors (either with standard latent variables or other slopes and intercepts). Third, the model specification proposed does not require separate covariance matrices for within and between components or special data configurations in which the data are disaggregated. This makes implementation relatively easy in most SEM packages that do not have special features for multilevel models. Generalization of the Approach (In this section I present some very rough, preliminary ideas regarding generalization of the dyadic multilevel model. ) The dyadic model specifications above could be generalized to data involving larger groups (e.g., 2, 3, 4,or 5 individuals per group), such as family units or small working groups. The application of these models, however, would be limited to the nonexchangeable case, in which each group member has a role consistently present across groups. As the number of individuals per group increases, however, models would become more complex. Of particular difficulty is the specification of the difference slope variable. A simple difference variable becomes more complicated to implement, because there must be s-1 dummy variables to represent comparisons (where s is the number of individuals per group). These dummy variables would become unwieldy as the number of group members increases. However, one could omit latent slopes under these conditions. If the second order factor approach is used, other level-1 predictors could be included as predictors of the first order factors. Through equality constraints on the causal parameters or use of the multilevel models similar to those specified above, prediction of slopes by level-2 variables (cross-level interactions) could be implemented. Finally, although the examples presented above are based on equal group sizes (i.e, balanced n), application of these models would not necessarily be restricted to balanced data. Through the use of data imputation methods (such as the EM algorithm) now available in several SEM software packages, unbalanced data could also be analyzed. Data imputation with the EM algorithm can be shown to be equivalent to Bayesian estimates obtained in multilevel regression when group sizes are unequal.