Draft- For Comment Only

Hierarchical Linear Models: Strengths and Weaknesses

By Duncan Chaplin
December 9th, 2003

The Urban Institute
2100 M St. NW
Washington, D.C.

This paper is a revised version of a paper prepared for the meetings of the Association for Public Policy Analysis and Management in November of 2003 and is based on a presentation given to the UI Modeling Group at the Urban Institute in March of 2003. Many thanks go to Chris Bollinger, Mary Noonan, and all the members of the UI Modeling Group for helpful discussions of the issues presented here. All omissions, mistakes (glaring and otherwise), and tangential asides are those of the author and should not be attributed to the Urban Institute or any of the people kind enough to share their thoughts with me on this topic. Questions and queries should be addressed to the author at DChaplin@ui.urban.org.

Abstract

Regression models are often run using data with observations that are highly correlated within subgroups. Ignoring these correlations can yield biased standard errors. Consequently, a number of statistical methods have been developed to help adjust estimated standard errors appropriately. Hierarchical Linear Models (HLM) is one such method, particularly common in education research but growing in popularity elsewhere. In this paper I compare HLM to a variety of alternative estimation methods more commonly used by economists that also deal with clustering. In particular, I compare HLM to random and fixed effects (as used in the econometrics literature), random coefficients, Generalized Least Squares, Huber-White corrections, and simulation methods (jackknife/bootstrap). I also discuss general strengths and weaknesses of HLM.
Introduction

HLM is an estimation method[1] developed by education researchers (Bryk and Raudenbush, 1992) that is now growing in popularity in a number of other research areas, including health and psychology. It is used primarily to estimate cross-sectional linear models and has received relatively little attention from economists.[2] In this paper I describe HLM, show how it relates to models more commonly used by economists, and discuss what HLM does and does not do.

The standard way of presenting HLM is to start by discussing the concept of conducting regressions at two levels. For example, when analyzing factors that affect student test scores using data on students from a number of schools (with many students in each school), one could run one regression for each school and then a second set of regressions using the coefficient estimates from the school-level regressions as outcomes. Such a model can be estimated, and Hanushek (1974) describes a method of doing so that is similar, in some ways, to the Fixed Effects Model common in econometrics. This is powerful because the Fixed Effects Model controls for all unobserved factors at the school level. This is not, however, what standard HLM does. Thus, while standard HLM does correct standard errors and produce more efficient estimates than standard Ordinary Least Squares methods, it does not correct the estimated impacts of student-level variables for any bias caused by unobserved school-level variables. Since HLM is usually presented as if it were estimated in two stages, this clarification seems likely to be an important one for researchers who are learning to use HLM for the first time and for economists who may not be familiar with the HLM method.

[1] HLM models can be estimated using a variety of software packages (Singer, 1998). These include Proc Mixed in SAS and the software package HLM, developed by Bryk and Raudenbush (1992), who coined the term HLM.
[2] Interesting work has been done using HLM to estimate “growth models” with panel data (Bryk and Raudenbush, 1992), and some researchers have used HLM-type models for discrete outcomes (Swanson et al., 2002).

HLM has a number of benefits. First, it can help to control for clustering of observations and heteroskedasticity. Second, it can improve the efficiency of estimated impacts, provided that the assumptions of HLM are correct. Third, even if the assumptions are violated, HLM will still produce a best “HLM” fit, similar to the Best Linear Unbiased Estimate property of an OLS model (Goldberger, 1991).[3] Fourth, a variation of the HLM model, with group mean centering, does produce unbiased slope estimates under the same conditions that are normally used to justify a Fixed Effects Model in economics.

There are alternative methods of controlling for clustering and heteroskedasticity, for instance the simulation methods (jackknife and bootstrap) and the Huber-White corrections.[4] These methods have a number of advantages over HLM. First, HLM constrains the variance of the error to be a function of the same factors that affect the mean value of the outcome, while the other methods allow the variance to depend on additional factors. Second, the estimated standard errors from the alternative methods are robust to more forms of heteroskedasticity than are allowed by HLM.[5] Nonetheless, HLM will produce valid standard error estimates under a wide variety of conditions and more efficient coefficient estimates given the HLM assumptions.

To help put HLM in context I begin with a section describing the motivation behind HLM. I then show how HLM is related to random coefficients, random effects, and fixed effects, as used in the econometrics literature.
This is followed by a discussion of what HLM does and does not do and some comments on what econometricians might view as an odd feature of HLM.

[3] Relative to OLS, the HLM estimates give more weight to observations for which the estimates suggest the data are more precise. For example, if the HLM model suggests that the data on women are more precise than the data on men, then the resulting HLM slope estimates will give more weight to the observations for the women.

[4] The simulation methods can be easily implemented in packages like WesVar, Shazam, and SAS, while the Huber-White corrections can be done in Stata.

[5] This is because the White correction allows for any form of heteroskedasticity, while HLM only allows for heteroskedasticity that is captured by including random coefficients in the model.

HLM Motivation

One of the research questions that can be well addressed by HLM is whether Catholic schools help to reduce inequality in outcomes compared to public schools. To answer this question researchers often look at how the estimated effect of student socioeconomic status (SES) on student test scores varies by school type (Catholic vs. public). Evidence of a smaller SES slope for Catholic schools is taken as evidence that the Catholic schools help to reduce inequality. Figure 1 illustrates this point.

One can estimate interaction terms between school type and SES in ordinary least squares (OLS) regression models. However, this would not account for clustering of observations within schools. For this reason, many (if not most) econometricians would use a Random Effects Model in this situation. While this is an improvement over OLS, it only allows the intercept to vary randomly across schools, and not the SES slope. Figure 2 presents the data used to generate Figure 1 by school.
As Figure 2 shows, even though on average the Catholic school SES slope is smaller than that of public schools, there is a great deal of variation in the SES slopes across schools, so much so that one might wonder if the slope differences by school type are truly statistically significant. More importantly, the issue is not just that the intercepts vary randomly across schools; the slope estimates clearly vary a great deal as well. HLM allows for random variation in both the intercepts and slopes.

As noted earlier, HLM is often described as if it uses two sets of regression models, one at the student level and a second at the school level, as shown below.

Level 1 (at student level):

$$Y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij}$$

Level 2 (at school level):

$$\beta_{0j} = \gamma_{00} + \gamma_{01} W_j + u_{0j}$$

$$\beta_{1j} = \gamma_{10} + \gamma_{11} W_j + u_{1j}$$

where, for our example problem, $Y_{ij}$ is the test score of student $i$ in school $j$, $W_j = 1$ if the school is Catholic, $X_{ij}$ is student SES, and $\mathrm{Cov}(X_{ij}, e_{ij}, W_j, u_{0j}, u_{1j}) = 0$, i.e. the regressors are assumed to be uncorrelated with the error components.

By substituting the Level 2 equations for the random coefficients in the Level 1 equation, one can write a combined model:

$$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij} + \varepsilon_{ij}$$

where

$$\varepsilon_{ij} = u_{1j} X_{ij} + u_{0j} + e_{ij}$$

This is the Random Coefficients Model, which enjoyed some attention from economists in the 1970s and 1980s. It is also a special case of the Generalized Least Squares (GLS) models.[6] In addition, if $u_{1j}$ (the school-level component of the error term that is multiplied by $X_{ij}$) is set to 0, then the model reduces to the Random Effects Model more commonly used in the econometrics literature today. Indeed, a large number of papers that use the HLM method find no evidence that $V(u_{1j}) > 0$ and consequently end up estimating Random Effects Models. Thus, while HLM allows for estimation of a much broader set of models than a Random Effects Model, it appears that in many practical situations analysts end up estimating a Random Effects Model when they use the HLM method.
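The variance structure implied by the combined error term, $\varepsilon_{ij} = u_{1j} X_{ij} + u_{0j} + e_{ij}$, can be written out explicitly, since it underlies the later claims about which kinds of heteroskedasticity and clustering HLM can capture. The derivation below is only a sketch under assumptions that are not stated in the text: $e_{ij}$ is taken to have constant variance and to be independent across students and of the school-level errors, and the labels $\sigma^2 = \mathrm{Var}(e_{ij})$, $\tau_{00} = \mathrm{Var}(u_{0j})$, $\tau_{11} = \mathrm{Var}(u_{1j})$ and $\tau_{01} = \mathrm{Cov}(u_{0j}, u_{1j})$ are introduced here for convenience.

$$\mathrm{Var}(\varepsilon_{ij} \mid X_{ij}) = \tau_{00} + 2\tau_{01} X_{ij} + \tau_{11} X_{ij}^2 + \sigma^2$$

$$\mathrm{Cov}(\varepsilon_{ij}, \varepsilon_{i'j} \mid X_{ij}, X_{i'j}) = \tau_{00} + \tau_{01}\,(X_{ij} + X_{i'j}) + \tau_{11} X_{ij} X_{i'j}, \qquad i \neq i'$$

Under these assumptions the error variance changes with $X_{ij}$ only through the random slope, and the errors of two students in the same school are correlated because they share $u_{0j}$ and $u_{1j}$. Setting $u_{1j} = 0$ (so that $\tau_{11} = \tau_{01} = 0$) gives the Random Effects case: a constant variance $\tau_{00} + \sigma^2$ and a within-school correlation of $\tau_{00} / (\tau_{00} + \sigma^2)$, which cannot be negative.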
[6] GLS is more flexible in that it allows the variance of Y to be affected by factors that may not affect the mean value of Y. In practice, however, GLS models may be harder to estimate due to a lack of available software.

What HLM Does Not Do

HLM has a number of important strengths and weaknesses. I start by discussing one of the major problems with HLM, which is caused by how the model is presented rather than by any inherent flaw in the estimation method. In order to describe this weakness, however, it is useful to clarify some of the terms used in the HLM literature, as they overlap in rather unfortunate ways with terms commonly used by economists. In particular, the terms random and fixed effects mean somewhat different things in HLM than they do in economics. In HLM, random effects refer to the error terms in the Level 2 equations, i.e. the error terms for the coefficient estimates. Fixed effects refer to the non-random parts of the coefficient estimates. In contrast, in economics the term random effects is generally used to refer only to the random component of the intercept ($u_{0j}$). In economics, fixed effects also refer to $u_{0j}$, but only in the context of a very different model, one in which $u_{1j} = 0$ and $\mathrm{cov}(u_{0j}, X_{ij})$ is not constrained to be 0. In Fixed Effects Models in economics, the fixed effects are values of $u_{0j}$ that are treated as fixed rather than as random. Estimation is generally accomplished using a dummy variable for each school. In the rest of this paper I use the terms random and fixed effects as they are used by economists rather than in the way they are used in the HLM literature.

Fixed Effects Models are considered a very powerful tool by economists, as they can be used to control for bias caused by a large set of unobserved variables.
For example, if one were interested in estimating the impact of student SES on student achievement, controlling for all school-level variables (observed and unobserved), one could use a Fixed Effects Model with a dummy variable for each school. Using such a model one could correctly claim that the estimated impact of SES was not biased by any school-level variables, including those not observed.

Interestingly, the same result would hold if one were to estimate multilevel models in two stages, as HLM is presented. In the example given above this would mean first estimating the impact of student SES for each school and then estimating second-stage equations to determine how school-level factors affect the intercepts and slopes from the school-by-school regressions. Were one to estimate such a model, one could legitimately claim that the SES slope estimates would not be biased by the omission of any school-level factors. This is not, however, how HLM is estimated. Instead, HLM is estimated in one stage, and the standard HLM model (without group mean centering) uses both the between and within school variation to estimate the SES slopes. The result is that omitted school-level variables can bias the SES slope estimates. Statistically speaking, Fixed Effects Models allow $\mathrm{cov}(u_{0j}, X_{ij})$ to be non-zero, while the standard HLM model assumes that this covariance is 0.

Interestingly, economists sometimes estimate models in the way the HLM presentation implies (i.e. in two stages), although these models have received relatively little use (Hanushek, 1974; Chaplin, 1993). Such models have the advantage of allowing one both to deal with the fact that slope estimates may vary across schools and to control for all unobserved factors at the school level.
Their major weakness, however, is that these models can only be estimated if there are sufficient data at each level, for example enough students within schools to estimate a separate regression for each school. In contrast, HLM (and Random Effects Models) can be estimated even if there are only two observations per school for at least some schools, because they use both the within and between school variation to estimate the coefficients. This last point illustrates another way in which the presentation of HLM has misled researchers, as some have argued that HLM models should only be estimated using schools that have a sufficient number of observations per school.

Of course, the way HLM is presented would be only a theoretical weakness if all researchers understood the model well. However, there is substantial evidence that many prominent researchers are not well aware of this problem. In particular, many appear to believe that HLM models are estimated in two stages (Yasumoto et al., 2001; Nye et al., 2002; Alexander et al., 2001; Wenglinsky, 1998; Brewer and Goldhaber, 2000), and many believe it will drop cases, presumably those with few observations per group (Gamoran et al., 1997; Alexander et al., 2001; Brewer and Goldhaber, 2000). At least one set of researchers writes as if the HLM model controls for unobserved group-level variables (Gamoran et al., 1997).[7] All of these points would hold for a model estimated in two stages, but do not hold for the standard HLM model.

There is a variant of the HLM model that can be used to control for the same types of biases that a Fixed Effects Model deals with. This is not a standard HLM feature, but it does receive prominent attention in the major book introducing the HLM method (Bryk and Raudenbush, 1992). The idea is that one can estimate a useful set of models by subtracting group means from the X’s.
For example, as noted above, combining the Level 1 and Level 2 equations of HLM gives the equation:

$$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij} + \varepsilon_{ij}$$

After within-group centering this becomes:

$$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + \gamma_{10} (X_{ij} - \bar{X}_{\cdot j}) + \gamma_{11} W_j (X_{ij} - \bar{X}_{\cdot j}) + \varepsilon_{ij}$$

where $\bar{X}_{\cdot j}$ is the mean of $X_{ij}$ for group $j$. Bryk and Raudenbush (1992) are careful to explain that group centering changes the underlying model and note that in many cases it may not be clear which model would be preferred. Many economists might recall that one method of estimating a Fixed Effects Model is to subtract group means from all variables in the model (Goldberger, 1991). Group-mean centering in HLM comes close to this, except that the group means of the outcome (Y) are not subtracted. Nevertheless, it turns out that the result is unbiased slope estimates for the within-group variables even in the presence of unobserved group-level variables that are correlated with the within-group variables, the exact issue that is generally highlighted as a strength of the Fixed Effects Model in economics.[8]

HLM also shares two weaknesses with Random Effects Models.[9] First, it does not allow for negative within-group correlations in the error terms. This could be important for outcomes that are socially determined, if people look to their peer group when making determinations about their own level of success or achievement. For example, one might expect to see negative associations between the error terms of self-efficacy ratings of different teachers within the same school if these teachers generally judge their own performance by making comparisons within rather than across schools. In HLM the random component of the intercept causes a positive correlation between observations within the same school. The more general GLS models allow that same correlation to be negative rather than positive.

[7] Relevant quotes from these papers are provided in an appendix.
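The claims made in this section about omitted school-level variables and group-mean centering can be checked with a small simulation. The sketch below is illustrative only: it uses plain OLS rather than the HLM estimator, and the data-generating process (an unobserved school factor `q` that raises both SES and test scores), the sample sizes, and all variable names are assumptions made for the example rather than anything taken from the paper.

```python
import random

random.seed(0)

TRUE_SLOPE = 1.0   # assumed true effect of student SES (X) on test score (Y)
N_SCHOOLS = 200    # hypothetical number of schools
N_STUDENTS = 20    # hypothetical number of students per school

school, xs, ys = [], [], []
for j in range(N_SCHOOLS):
    q = random.gauss(0.0, 1.0)            # unobserved school-level factor
    for _ in range(N_STUDENTS):
        x = q + random.gauss(0.0, 1.0)    # SES is correlated with q
        y = TRUE_SLOPE * x + 2.0 * q + random.gauss(0.0, 1.0)
        school.append(j)
        xs.append(x)
        ys.append(y)


def group_means(groups, values):
    """Return each observation's own group mean, in observation order."""
    totals, counts = {}, {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [totals[g] / counts[g] for g in groups]


def ols_slope(x, y):
    """Slope of a one-regressor OLS fit that includes an intercept."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Deviations from school means.
x_dev = [x - m for x, m in zip(xs, group_means(school, xs))]
y_dev = [y - m for y, m in zip(ys, group_means(school, ys))]

pooled_slope = ols_slope(xs, ys)         # uses between and within variation
within_slope = ols_slope(x_dev, y_dev)   # fixed effects: X and Y both demeaned
centered_slope = ols_slope(x_dev, ys)    # only X is group-mean centered

print(f"pooled   : {pooled_slope:.3f}")
print(f"within   : {within_slope:.3f}")
print(f"centered : {centered_slope:.3f}")
```

Under these assumed parameters the pooled slope should land well above the true value of 1, because it absorbs the effect of `q`, while the within estimate should be close to 1. The centered slope matches the within slope up to rounding error, since the centered regressor sums to zero within each school and is therefore orthogonal to anything that is constant within schools, including the school means of Y.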
The second weakness that HLM shares with Random Effects Models is that it may produce biased estimates for models that need weights. This issue is complicated by the fact that many economists would argue that weights are not needed in multivariate regressions. Rather, they argue, one should be able to fully model behavior using appropriate controls and interactions. If weights change regression results, they would argue, this implies that the model is misspecified, i.e. important variables were omitted. An alternative view is that all regression models should be viewed as parsimonious descriptions of relationships that are almost surely far more complicated than any one regression model could capture. Regression results can provide a useful summary of existing relationships but should not be viewed as providing evidence against the importance of omitted variables or interactions. Under this view, weights help to make sure that the summary is relevant to the population being studied. If one takes the latter stance, then there is an implicit admission that there could be important omitted interactions and that, rather than try to estimate all of these interactions, one will simply provide coefficient estimates that are as representative as possible. The problem with HLM (and random effects) under this set of assumptions is that it reweights the data based on the variance/covariance structure of the error terms in a way that effectively offsets the impacts of the weights themselves (Selden, 1994). Thus, a belief that the weights are important would seem to be incompatible with the use of HLM (or Random Effects Models).

[8] Mundlak (1978) notes that this property of Fixed Effects Models can also be achieved by including all of the group means of the individual variables (i.e. the $X_{ij}$) as controls in a standard OLS model.
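Mundlak's (1978) observation, cited in the footnote on Fixed Effects Models, can also be illustrated numerically. The sketch below is a hedged example rather than a reproduction of any published analysis: the simulated data, the unbalanced school sizes, and the helper names (`group_means`, `ols_two`) are all hypothetical. It fits an OLS regression of Y on X and the school mean of X, and compares the coefficient on X with the within (fixed effects) slope.

```python
import random

random.seed(1)

TRUE_SLOPE = 1.0  # assumed true within-school effect of X on Y

school, xs, ys = [], [], []
for j in range(150):                          # hypothetical number of schools
    q = random.gauss(0.0, 1.0)                # unobserved school-level factor
    for _ in range(random.randint(5, 30)):    # unbalanced school sizes
        x = q + random.gauss(0.0, 1.0)        # X is correlated with q
        y = TRUE_SLOPE * x + 2.0 * q + random.gauss(0.0, 1.0)
        school.append(j)
        xs.append(x)
        ys.append(y)


def group_means(groups, values):
    """Return each observation's own group mean, in observation order."""
    totals, counts = {}, {}
    for g, v in zip(groups, values):
        totals[g] = totals.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [totals[g] / counts[g] for g in groups]


def ols_two(x1, x2, y):
    """OLS of y on an intercept, x1 and x2; returns the two slopes."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12               # 2x2 normal equations
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


x_bar = group_means(school, xs)
y_bar = group_means(school, ys)

# Within (fixed effects) slope from deviations around school means.
x_dev = [x - m for x, m in zip(xs, x_bar)]
y_dev = [y - m for y, m in zip(ys, y_bar)]
within_slope = sum(a * b for a, b in zip(x_dev, y_dev)) / sum(a * a for a in x_dev)

# Mundlak-style regression: Y on X and the school mean of X.
mundlak_slope, mean_coef = ols_two(xs, x_bar, ys)

print(f"within slope       : {within_slope:.3f}")
print(f"Mundlak slope on X : {mundlak_slope:.3f}")
print(f"coef on school mean: {mean_coef:.3f}")
```

In this OLS setting the two slopes agree up to rounding error even with unbalanced groups, because once the school mean of X is held constant, the only variation left in X is its deviation from that mean. The coefficient on the school mean picks up the gap between the between-school and within-school relationships, which in this made-up example comes from the omitted factor `q`.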
HLM also has a feature not shared with Random Coefficients Models that is odd, if not necessarily incorrect. The standard method of starting an HLM analysis involves a test for between-group variance done without controls. This is used to justify the inclusion of group-level variables. While this test is likely to produce correct results in general, it is possible for the test to suggest no between-group variance even when the group-level variables are powerful predictors of the outcomes. This can happen for two reasons. First, the control variables at the group (i.e. school) level can offset each other. Second, the within-group control variables (i.e. student SES) may offset the group-level variables. For both of these reasons a more appropriate test for the inclusion of the group-level variables would be a joint test of their statistical significance.

To illustrate this point consider the standard combined model discussed earlier:

$$Y_{ij} = \gamma_{00} + \gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij} + \varepsilon_{ij}$$

Now, it is possible to have $\gamma_{01} > 0$, $\gamma_{10} > 0$, and $\gamma_{11} > 0$ but also to have $V(\gamma_{01} W_j + \gamma_{10} X_{ij} + \gamma_{11} W_j X_{ij})$ approximately equal to 0 if, for example, $\mathrm{cov}(W_j, X_{ij})$ is negative and sufficiently large in magnitude. For a real-world example, suppose that a certain school district (or state) put sufficient resources into schools serving primarily lower-SES students to effectively offset the test score gap by parent SES. Were they able to accomplish this goal, we would observe relatively small differences in student outcomes across schools, even if both parental SES and spending per student had large and important impacts on student performance. If we were to rely on the HLM test of between-school differences, we might draw the incorrect conclusion that school spending did not matter, because we would never estimate the full model, having found no evidence of between-school differences in the model without controls.

[9] Another weakness that has been suggested is that HLM is not compatible with Instrumental Variables (IV) estimation (Brewer and Goldhaber, 2000; Mason, 1995). However, Spencer and Fielding (2000) show that HLM can be used to estimate IV models.

What HLM Does Do

While HLM may be misinterpreted by some researchers and has some odd features, it does have a number of valuable characteristics that make it worth considering in many circumstances. First, as noted above, it deals with a fairly large set of possible violations of the standard OLS assumptions about the distributions of the error terms, in ways that are somewhat more flexible than the standard Random Effects Models used in econometrics. Second, like Random Effects Models, it produces more efficient estimates than one would obtain using OLS or any other method that relies on the OLS slope estimates but produces correct standard errors (i.e. the Huber-White and simulation methods of correcting standard errors).[10] Third, even if some of the HLM assumptions are violated, HLM can be used to produce a “best” fit based on the HLM weights, much as OLS is often viewed as producing Best Linear Unbiased Estimates (Goldberger, 1991). Fourth, when group mean centering is used, the unbiased slope estimates for within-group variables obtained by using Fixed Effects Models can be obtained using HLM.

[10] This statement assumes that there are random intercepts and/or slopes in the model. In the absence of such variation, OLS could be more efficient as it estimates fewer parameters.

Conclusion

The growing use of HLM suggests that researchers are becoming increasingly aware of and willing to deal with the important issue of clustering of data within groups.
This implies that the conclusions reached in these studies can be taken more seriously than those of many studies in the past that ignored clustering, as the estimated standard errors are less likely to be biased. At the same time, however, the introduction of HLM may have some costs. In particular, it appears that the standard method of presenting HLM (at two levels) may mislead some researchers into believing that they have achieved a very different goal, that of controlling for all unobserved group-level factors. While this ideal can be obtained using other types of models (in particular the Fixed Effects Models common in econometrics), it is not achieved using the standard HLM methods. Clarifying this important limitation of HLM should help to ensure that it is used more correctly in future and enable researchers to make better choices when deciding which method is most appropriate for their research questions.

References

Alexander, Karl L., Doris R. Entwisle, and Linda S. Olson (2001) “Schools, Achievement, and Inequality: A Seasonal Perspective,” Educational Evaluation and Policy Analysis, 23(2):171-191.

Brewer, Dominic J. and Dan D. Goldhaber (2000) “Improving Longitudinal Data on Student Achievement: Some Lessons from Recent Research Using NELS:88,” in Analytic Issues in the Assessment of Student Achievement, by David Grissmer and J. Michael Ross, U.S. Department of Education, National Center for Education Statistics, NCES 2000-050.

Bryk, A. S. and S. W. Raudenbush (1992) Hierarchical Linear Models: Applications and Data Analysis Methods, Sage Publications, Newbury Park, CA.

Chaplin, Duncan (1993) Employment Bust or Education Boom? Black Teenage Males: 1960-1988. Dissertation, University of Wisconsin at Madison.

Gamoran, Adam, Andrew C. Porter, Jon Smithson, and Paula A. White (1997) “Upgrading High School Mathematics Instruction: Improving Learning Opportunities for Low-Achieving, Low-Income Youth,” Educational Evaluation and Policy Analysis, 19(4):325-338.

Goldberger, Arthur S. (1991) A Course in Econometrics, Harvard University Press, Cambridge, MA.

HLM (2000) “HLM Concepts and Background,” http://www.ssicentral.com/hlm/concept.htm. Downloaded October 29th, 2003.

Hanushek, Eric A. (1974) “Efficient Estimators for Regressing Regression Coefficients,” The American Statistician, 28(2), May.

Mason, W.M. (1995) “Hierarchical Linear Models: Problems and Prospects,” Journal of Educational and Behavioral Statistics, 20(2):221-227.

Mundlak, Y. (1978) “On the Pooling of Time Series and Cross Section Data,” Econometrica, 46:69-85.

Nye, Barbara, Larry V. Hedges, and Spyros Konstantopoulos (2002) “Do Low-Achieving Students Benefit More from Small Classes? Evidence from the Tennessee Class Size Experiment,” Educational Evaluation and Policy Analysis, 24(3):201-217.

Selden, Thomas M. (1994) “Weighted Generalized Least Squares Estimation for Complex Survey Data,” Economics Letters, 46:1-6.

Seltzer, Michael, John Novak, Kilchan Choi, and Nelson Lim (2002) “Sensitivity Analysis for Hierarchical Models Employing t Level-1 Assumptions,” Journal of Educational and Behavioral Statistics, 27(2):181-222.

Singer, Judith (1998) “Using SAS PROC MIXED to Fit Multilevel Models, Hierarchical Models, and Individual Growth Models,” Journal of Educational and Behavioral Statistics.

Spencer, Neil H. and Anthony Fielding (2000) “An Instrumental Variable Consistent Estimation Procedure to Overcome the Problem of Endogenous Variables in Multilevel Models,” Multilevel Modelling Newsletter, 12(1):4-7.

Swanson, David B., Brian E. Clauser, Susan M. Case, Ronald J. Nungester, and Carol Featherman (2002) “Analysis of Differential Item Functioning (DIF) Using Hierarchical Logistic Regression Models,” Journal of Educational and Behavioral Statistics, 27(1):53-75.

Wenglinsky, Harold (1998) “Finance Equalization and Within-School Equity: The Relationship between Education Spending and the Social Distribution of Achievement,” Educational Evaluation and Policy Analysis, 20(4):269-283.

White, Halbert (1980) “A Heteroskedasticity-Consistent Covariance Matrix Estimator and a Direct Test for Heteroskedasticity,” Econometrica, 48(4):817-838.

Yasumoto, Uekawa, and Bidwell (2001) “The Collegial Focus and High School Students’ Achievement,” Sociology of Education, July, 74(3):181-209.

Figure 1
Test Scores by SES and School Type: Sector Averages
[Figure not reproduced. Vertical axis: Test Scores; horizontal axis: SES; series: Public, Catholic.]

Figure 2
Test Scores by SES and School Type: Individual Schools
[Figure not reproduced. Vertical axis: Test Scores; horizontal axis: SES; series: Catholic 1, Catholic 2, Catholic 3, Public 1, Public 2, Public 3.]

Appendix

Quotes Suggesting that HLM is Misunderstood

“As in the Level 2 formulation, at Level 3 average 10th-grade achievement and growth rate are estimated separately for each department.” (Yasumoto et al., 2001)

“…Such models permit the analysis and pooling of school-specific regressions…” and “…in each of the school-specific regression coefficients…” (Nye et al., 2002)

“This procedure estimates separate error variances for each level, ensuring that parameters at the class level are not distorted because of similarities among students within classes…” and “At least three cases are needed to estimate the growth curve, but the estimation procedure accommodates cases for which data are available at two of the three time points.” (Gamoran et al., 1997)
“…HLM is used to estimate within-person achievement growth models…Person-specific growth parameters are estimated at the within-person, or Level 1, stage….” and “HLM screens out many cases because of strategic gaps in the testing record…” (Alexander et al., 2001)

“…Separate equations are estimated for the effect of student-level variables on students and of school-level variables on the average of student-level variables.” (Wenglinsky, 1998)

“The basic approach of HLM is to first estimate a within-group model and use the estimated slope coefficients as the dependent variable in a second across-group stage.” and “…HLM utilizes only a sub-sample of all the potential students in the sample…” (Brewer and Goldhaber, 2000)