Understanding Finite Sample Bias from Instrumental Variables

Abstract Title Page
Title:
Understanding Finite Sample Bias from Instrumental Variables Analysis in Randomized
Trials
Authors:
Howard S. Bloom, Ph.D., MDRC
Pei Zhu, Ph.D., MDRC
Fatih Unlu, Ph.D., Abt Associates Inc.
Abstract Body
This paper assesses the promises and potential problems of using instrumental variables
analysis in randomized trials to obtain internally valid (unbiased or consistent) and reasonably
precise (powerful) estimates of causal relationships between characteristics of settings
(mediators) and individual-level outcomes. It unpacks the statistical problems with instrumental
variables analysis in ways that are accessible to a broad range of applied researchers. In doing so,
the paper focuses on one key aspect of the problems (finite sample bias) and develops an
understanding of it based on “first principles”. It then uses this deeper understanding to develop
intuition about how to assess and address the problem in practice.
Why should we care about causal relationships between mediators and individual outcomes?
In the past six years, education research has taken a quantum leap forward based on a
large and growing number of high-quality randomized field trials and regression discontinuity
studies of the effects of educational interventions.1 Most of this new research and existing
methodologies for conducting it focus on the response of student academic outcomes to specific
educational interventions.2 Such information is invaluable and can provide a solid foundation for
accumulating much-needed knowledge. However, this information only indicates how well
specific interventions (which comprise complex bundles of features) work for specific students
in specific settings. Therefore, by itself, the information is not sufficient to ascertain "What works
best for whom, when, and why?" And it is this more comprehensive understanding of educational
interventions that is needed to guide future policy and practice.
In other words, it is necessary to “unpack the black boxes” being tested by randomized
experiments or high-quality quasi-experiments in order to learn how best to improve the
education—and thus life chances—of students in the U.S., especially those who are
economically disadvantaged. This unpacking job comprises learning more about the relative
effectiveness of the active ingredients of educational interventions (their mediators) and learning
more about factors that influence the effectiveness of these interventions (their moderators).3
Now that multi-site randomized experiments and rigorous quasi-experiments have been shown to
1. Spybrook (2007) identified 55 randomized studies of a broad range of interventions, and Gamse et al. (2008) and Jackson et al. (2007) report on regression discontinuity studies of the federal Reading First and Early Reading First programs.
2. An important exception involves a series of randomized tests of interventions for improving students' social and emotional outcomes (Jones, Brown, and Aber, 2008; Haegerich and Metz, under review).
3. Cook (2001) speculates about why, until recently, the education research community strenuously resisted randomized experiments.
be feasible for educational research,4 it is an opportune time to begin to explore these subtler and
more complex questions.
Why use instrumental variables analysis?
Instrumental variables analysis originated in the work of P. G. Wright (1928), who
developed the approach to study the demand for and supply of flax seed.5 Since then, the
approach has been used for a wide range of applications.6 Of particular relevance for the present
paper is the use of instrumental variables analysis in the context of multi-site randomized
experiments or quasi-experiments to study the effects of mediating variables on final outcomes.
Early applications of this approach focused mainly on estimating the effect of receiving an
intervention instead of just being assigned to it.7 Recent extensions of the approach have begun
to use it to explore causal effects of other mediating factors. For example, data from a
randomized trial of subsidies for public housing residents to stimulate movement to lower-poverty neighborhoods were used to study the effects of neighborhood poverty on child
outcomes (Kling, Liebman, and Katz, 2007).
These studies use the cross-site pattern of observed intervention effects on key mediators
(e.g. neighborhood poverty or family income) and observed intervention effects on key final
outcomes (e.g. child development or behavior) to study causal relationships between mediators
and final outcomes. To the extent that the effects of an intervention on a mediator are correlated
across sites or other subgroups with the effects of the intervention on a final outcome, this
provides evidence of a causal relationship between the mediator and the final outcome.
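To make this cross-site logic concrete, the sketch below (our own illustration, not part of the original studies; the 40 sites, sample sizes, and effect values are hypothetical) simulates a multi-site trial with an unmeasured confounder and compares a naive OLS regression of the outcome on the mediator with the slope obtained by regressing site-level effects on the outcome against site-level effects on the mediator.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sites, n = 40, 500          # hypothetical: 40 sites, 500 students per site
beta_ca = 0.5                 # assumed true causal effect of the mediator on the outcome

eff_m, eff_y, all_m, all_y = [], [], [], []
for _ in range(n_sites):
    t = rng.integers(0, 2, n)                        # random assignment within the site
    u = rng.normal(0, 1, n)                          # unmeasured confounder of mediator and outcome
    gamma = rng.normal(1.0, 0.4)                     # site-specific effect of treatment on the mediator
    m = gamma * t + u + rng.normal(0, 1, n)          # mediator
    y = beta_ca * m + 2 * u + rng.normal(0, 1, n)    # outcome
    eff_m.append(m[t == 1].mean() - m[t == 0].mean())    # site-level effect on the mediator
    eff_y.append(y[t == 1].mean() - y[t == 0].mean())    # site-level effect on the outcome
    all_m.append(m)
    all_y.append(y)

m_all, y_all = np.concatenate(all_m), np.concatenate(all_y)
ols = np.cov(m_all, y_all, ddof=1)[0, 1] / np.var(m_all, ddof=1)     # naive OLS, confounded by u
cross = np.cov(eff_m, eff_y, ddof=1)[0, 1] / np.var(eff_m, ddof=1)   # effect-on-effect slope across sites
print(f"true effect 0.50 | naive OLS {ols:.2f} | cross-site slope {cross:.2f}")
# The cross-site slope is much closer to 0.50 than the naive OLS slope, though typically not
# equal to it: estimation error in the site-level effects produces the finite sample bias
# discussed below.
```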
Instrumental variable analysis has its own limitations and necessary assumptions.
Nevertheless, evidence derived from this method is likely to be stronger than that provided by
more traditional methods based on correlation analysis or regression analysis, which are almost
always subject to some combination of "attenuation bias" due to measurement error (e.g.,
Director, 1979), "omitted variables bias" due to unmeasured variables, and
"simultaneity bias" due to reciprocal causality. Hence, there are good reasons to believe that the
newly evolving approach might offer a more promising way to unpack the black boxes
represented by complex educational initiatives.
4. Greenberg, Meyer, Michalopoulos, and Wiseman (2003) argue for using multi-site experiments to study moderators of program effectiveness.
5. Stock and Trebbi (2003) and Angrist and Krueger (2001) discuss the history of instrumental variables analysis. According to Angrist and Krueger (2001, p. 1), "The term 'instrumental variables' originated with Olav Reiersol (1945); Morgan (1990) cites an interview in which Reiersol attributed the term to his teacher, Ragnar Frisch."
6. Instrumental variables analysis has been used to estimate a wide range of causal effects. For example, Levitt (1997) used the approach to study the effects of police on crime, Waldman (2006) used the approach to study the effects of television viewing on autism, and Hoxby (2000) used the approach to study the effects of school competition on student achievement.
7. This issue is often referred to as the problem of "compliance" with treatment in medical research (Angrist, Imbens, and Rubin, 1996) or the problem of "no-shows" and "crossovers" in program evaluation research (Bloom, 1984; Gennetian, Morris, Bos, and Bloom, 2005). See Bloom (1984) and Angrist, Imbens, and Rubin (1996) for early discussions of its application.
What did we study and learn?
Even though instrumental variables analysis is gaining popularity and holds potential for a
broad range of social science applications, it is subject to some important statistical problems
that are not widely understood.
One such problem is "finite sample bias," which, as demonstrated by recent research, can
distort findings even from exceptionally large samples (Bound, Jaeger, and Baker, 1995).
Unfortunately, the existing literature on this problem is highly technical and accessible mainly to
econometricians and statisticians, even though the approach is potentially most valuable to
applied social scientists. It is with this in mind that we have attempted to “unpack” the problem
in ways that promote a broader intuitive understanding of what produces it, how to assess its
magnitude and consequences, and thus how to decide when to use the new approach.8 In doing
so we derived finite sample bias from basic principles and constructed derivations that facilitate a
practical understanding of how to assess and address this problem.
It is important to note that some (but not all) of our results have already been established
in the extant literature. Hence, we focus on the intuition involved in these results and we aim to
make this intuition available to the broadest possible audience of applied researchers.
We start with the simple case of a single mediator and a single instrument. Specifically,
we consider a series of relationships among a treatment indicator, $T$, a mediator, $M$, and an
outcome, $Y$, with treatment status randomly assigned to individual sample members, $i$. The
relationships of interest are: (i) the causal effect ($\gamma$) of treatment on the mediator, (ii)
the causal relationship ($\beta_{ca}$) between the mediator and the outcome, and (iii) the cross-sectional
relationship ($\beta_{cs}$) between the mediator and the outcome. Note that this final parameter reflects
a combination of extraneous factors like attenuation bias, omitted variables bias, and
simultaneity bias. It is well known (but not fully appreciated) that a conventional ordinary least
squares (OLS) regression of the outcome on the mediator yields a biased estimate of the causal
relationship, and this bias, often called "OLS bias," can be characterized as the difference between
$\beta_{cs}$ and $\beta_{ca}$ (for example, see Angrist and Krueger, 2001). Instrumental variables analysis is used
to estimate the causal effect of the mediator on the outcome via two-stage least squares (TSLS)
from the following model:

First stage:   $M_i = \alpha + \gamma T_i + \varepsilon_i$   (1)

Second stage:  $Y_i = \delta + \beta_{TSLS} \hat{M}_i + \nu_i$   (2)
8. The finite sample bias problem manifests itself in two ways: (i) biased point estimates and (ii) incorrect statistical inferences. In this paper, we focus on biased point estimates and have little to say yet about incorrect inferences, which we shall explore in later work. Note that finite sample bias is also known as bias due to "weak instruments" (Bound et al., 1995).
where  i and  i are random error terms and M̂ i is the predicted mediator, constructed using the
OLS parameter estimates ( ˆ and ˆ ) from the first-stage equation ( M̂ i = ˆ  ˆTi ). TSLS is the
TSLS estimator of the effect of the mediator on the outcome.
Notice that M̂ i is a function of ˆ , which is the estimated effect of treatment on the
mediator. ˆ can be represented as a combination of its true value,  , and the first-stage
estimation error,   , which reflects the imperfect randomization in a finite sample (i.e., a
treatment and comparison mismatch). Therefore the variation in M̂ i has two parts: (1) that
induced by the true treatment effect on the mediator,  , which we call “treatment-induced
variation (tiv)”; and (2) that driven by estimation error,   , which we call “error-induced
variation (eiv).” Using this framework, we show that the error-induced variation in the predicted
mediator causes the finite sample bias in the estimate of TSLS . We also demonstrate that this
finite sample bias ( BIAS TSLS ) is proportional to OLS bias ( BIAS OLS ). More specifically, we show
that the ratio of BIAS TSLS to BIAS OLS is approximately equal to the ratio of the expected value of
the eiv ( E[eiv ]  EIV ) to the expected value of the total variation in M̂ i ( EIV  TIV =
E[eiv]+E[tiv]).
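The sketch below (our own illustration, with an assumed first-stage effect of 0.5 and unit error variance) makes this decomposition concrete: it simulates a single-instrument first stage and splits the variation in the predicted mediator into its treatment-induced and error-induced parts.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma, reps = 0.5, 1000     # assumed true effect of treatment on the mediator; simulation replications

for n in (100, 1000, 10000):
    tiv = eiv = 0.0
    for _ in range(reps):
        t = rng.integers(0, 2, n)
        m = gamma * t + rng.normal(0, 1, n)           # first stage: M = alpha + gamma*T + error
        g_hat = m[t == 1].mean() - m[t == 0].mean()   # OLS estimate of gamma
        tc = t - t.mean()
        # predicted-mediator deviations: g_hat*tc = gamma*tc (treatment-induced) + (g_hat - gamma)*tc
        tiv += np.sum((gamma * tc) ** 2)              # treatment-induced variation
        eiv += np.sum(((g_hat - gamma) * tc) ** 2)    # error-induced variation
    print(f"n = {n:>6}   E[eiv] / (E[eiv] + E[tiv]) ≈ {eiv / (eiv + tiv):.3f}")
# This error-induced share (approximately the ratio of the TSLS bias to the OLS bias)
# falls toward zero as the sample grows.
```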
We then examine the use of the population F statistic for the first-stage regression ($F_{pop}$)
to measure the strength of an instrument for a given application in order to assess its finite
sample bias (Bound, Jaeger, and Baker, 1995). We show that $F_{pop}$ is approximately equal to the
ratio of the expected value of the total variation in $\hat{M}_i$ ($EIV + TIV$) to the expected value of the
error-induced variation ($EIV$). We also illustrate that the ratio of $BIAS_{TSLS}$ to $BIAS_{OLS}$ can be
characterized (in approximation) as the inverse of $F_{pop}$. This conclusion is consistent with
findings reported in the econometrics literature (for a review, see Hahn and Hausman, 2003).
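A minimal numerical check of this approximation (our own sketch; the first-stage effect size and error variance are assumed values): compute the approximate population F for a simple randomized design and note that $1/F_{pop}$ approximates the ratio of the TSLS bias to the OLS bias.

```python
import numpy as np

rng = np.random.default_rng(2)
gamma, sigma = 0.5, 1.0      # assumed first-stage effect and error standard deviation

for n in (100, 1000, 10000):
    var_t = 0.25                                      # variance of a 50/50 treatment indicator
    f_pop = 1 + n * var_t * gamma**2 / sigma**2       # approximate population first-stage F
    t = rng.integers(0, 2, n)                         # one simulated trial, for comparison
    m = gamma * t + rng.normal(0, sigma, n)
    g_hat = m[t == 1].mean() - m[t == 0].mean()       # first-stage OLS slope
    resid = m - (m[t == 0].mean() + g_hat * t)
    f_sample = g_hat**2 * np.sum((t - t.mean())**2) / (resid @ resid / (n - 2))
    print(f"n = {n:>6}   F_pop ≈ {f_pop:7.1f}   sample F = {f_sample:7.1f}   1/F_pop ≈ {1/f_pop:.3f}")
# By the approximation above, BIAS_TSLS / BIAS_OLS ≈ 1/F_pop, so a first-stage F near 1 signals
# that the TSLS estimate may be nearly as biased as OLS, while a large F signals that the bias
# is only a small fraction of the OLS bias.
```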
Next, we analyze an extension of the single instrument and single mediator case, namely
the use of multiple instruments with a single mediator: a situation that researchers often face in a
multi-site experimental setting. Specifically, we examine what happens to the expected value of
the TSLS estimator and to the corresponding first-stage F statistic when multiple instruments,
created by interacting treatment status with site (or other strata) indicators, are used. We
first focus on the special case of a constant treatment effect on the mediator across sites. We
demonstrate that, as in the single instrument and single mediator case, $BIAS_{TSLS}^{K}$ (where $K$ stands
for the number of sites from now on) is a proportion of $BIAS_{OLS}^{K}$, and the proportion is
approximately the ratio between the expected values of the error-induced variation and the total
variation in $\hat{M}_i$.9
9. Note that in this special case, the expected value of the treatment-induced variation in $\hat{M}_i$ ($TIV^{K}$) stays constant as the number of instruments increases from 1 to $K$, but the expected value of the error-induced variation in $\hat{M}_i$ ($EIV^{K}$) increases by a factor of $K$; i.e., $TIV^{K} = TIV$ and $EIV^{K} = K \cdot EIV$.
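The sketch below (our own illustration; the constant first-stage effect of 0.5 and the fixed total sample of 2,000 are assumptions) illustrates this point by holding the total sample size fixed and increasing the number of site-by-treatment instruments.

```python
import numpy as np

rng = np.random.default_rng(3)
gamma, n_total, reps = 0.5, 2000, 500   # constant first-stage effect; total sample held at 2,000

for K in (1, 5, 20, 50):
    n = n_total // K                     # students per site
    tiv = eiv = 0.0
    for _ in range(reps):
        for _site in range(K):
            t = rng.integers(0, 2, n)
            m = gamma * t + rng.normal(0, 1, n)
            g_hat = m[t == 1].mean() - m[t == 0].mean()   # site-specific first-stage estimate
            tc = t - t.mean()
            tiv += np.sum((gamma * tc) ** 2)              # treatment-induced variation
            eiv += np.sum(((g_hat - gamma) * tc) ** 2)    # error-induced variation
    print(f"K = {K:>2}   E[tiv] ≈ {tiv / reps:7.1f}   E[eiv] ≈ {eiv / reps:5.1f}   "
          f"E[eiv] / (E[eiv] + E[tiv]) ≈ {eiv / (eiv + tiv):.3f}")
# With a constant first-stage effect, the site-by-treatment instruments leave the treatment-induced
# variation roughly constant while the error-induced variation grows roughly in proportion to K,
# so the approximate bias ratio grows with the number of instruments.
```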
Secondly, we examine the use of multiple instruments for a single mediator given a
varying first-stage impact (i.e., effect of the treatment on the mediator varies by site). In
particular, we consider the case with a constant difference ($\Delta$) among the $K$ site-specific first-stage
impacts. That is, if sites are ordered by their first-stage impacts, with site 1 having the
smallest impact ($\gamma_1$) and site $K$ the largest ($\gamma_K$), the first-stage impact in the $k$th site
($k = 1, 2, \ldots, K$) is characterized by $\gamma_1 + (k-1)\Delta$. We derive the corresponding expressions for the
finite sample bias ($BIAS_{TSLS}^{K}$) and first-stage F statistic ($F_{pop}^{K}$) in terms of $EIV^{K}$ and $TIV^{K}$. We
conclude that the ratio of $BIAS_{TSLS}^{K}$ to $BIAS_{OLS}^{K}$ is again (in approximation) the inverse of $F_{pop}^{K}$.
By contrasting these results with those for the single-instrument case, we then derive the condition
under which multiple instruments reduce (or increase) finite sample bias relative to a single
instrument. Intuitively, this condition implies that, with regard to finite sample bias, the use of
multiple instruments is preferred to the use of a single instrument if the increase (due to using
multiple instruments) in the error-induced variation is more than offset by the increase in the
treatment-induced variation.
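The sketch below (our own illustration; the number of sites, site sizes, average first-stage impact, and the constant-difference pattern are all hypothetical) compares the approximate bias ratio for a single pooled instrument with that for $K$ site-by-treatment instruments as the cross-site spread in first-stage impacts grows.

```python
import numpy as np

rng = np.random.default_rng(4)
K, n, reps = 20, 100, 1000     # hypothetical: 20 sites of 100 students each
gamma_bar = 0.2                # assumed average first-stage impact across sites

def bias_ratios(delta):
    """Approximate EIV/(EIV+TIV) for a single pooled instrument vs. K site-by-treatment
    instruments, when site k's first-stage impact is gamma_bar + (k - (K+1)/2) * delta."""
    gammas = gamma_bar + (np.arange(1, K + 1) - (K + 1) / 2) * delta
    tiv1 = eiv1 = tivK = eivK = 0.0
    for _ in range(reps):
        t = rng.integers(0, 2, (K, n))
        m = gammas[:, None] * t + rng.normal(0, 1, (K, n))
        tc = t - t.mean(axis=1, keepdims=True)
        # single instrument: one pooled estimate of the average first-stage impact
        g_pool = m[t == 1].mean() - m[t == 0].mean()
        tiv1 += np.sum((gamma_bar * tc) ** 2)
        eiv1 += np.sum(((g_pool - gamma_bar) * tc) ** 2)
        # multiple instruments: a separate first-stage estimate in each site
        g_site = np.array([m[k][t[k] == 1].mean() - m[k][t[k] == 0].mean() for k in range(K)])
        tivK += np.sum((gammas[:, None] * tc) ** 2)
        eivK += np.sum((((g_site - gammas)[:, None]) * tc) ** 2)
    return eiv1 / (eiv1 + tiv1), eivK / (eivK + tivK)

for delta in (0.0, 0.1, 0.3):
    single, multiple = bias_ratios(delta)
    print(f"delta = {delta:.1f}   bias ratio, single instrument ≈ {single:.3f}   "
          f"multiple instruments ≈ {multiple:.3f}")
# Multiple instruments win on this criterion only when the cross-site spread in first-stage impacts
# is large enough that the extra treatment-induced variation outweighs the roughly K-fold growth in
# error-induced variation; otherwise the single pooled instrument shows the smaller bias ratio.
```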
For each of the aforementioned scenarios, this paper also considers the finite sample bias
problem in the presence of clustering (e.g. nested structure of students within classrooms,
schools, or districts). We derive the corresponding expressions for $BIAS_{TSLS,cluster}$ in terms of
$EIV_{cluster}$ and $TIV_{cluster}$, as well as in terms of $BIAS_{OLS,cluster}$ and $F_{pop,cluster}$. We observe that these
expressions parallel those obtained in the absence of clustering. We conclude that, other
things being equal, clustering reduces the first-stage F statistic ($F_{pop,cluster} < F_{pop}$) and thereby
increases finite sample bias.
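The sketch below (our own illustration) conveys the same design-effect logic under one assumed design: treatment assigned at the classroom level with a classroom-level random component in the mediator. The class size and intraclass correlation values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
gamma, n_clusters, m_size = 0.5, 50, 20    # hypothetical: 50 classrooms of 20 students
n, reps = n_clusters * m_size, 2000

def approx_f_pop(icc):
    """Approximate population first-stage F when treatment is assigned at the classroom level
    and the mediator has an intraclass correlation of `icc` (total error variance fixed at 1)."""
    g_hats = np.empty(reps)
    for r in range(reps):
        t_c = rng.integers(0, 2, n_clusters)                         # classroom-level assignment
        t = np.repeat(t_c, m_size)
        cluster_fx = np.repeat(rng.normal(0, np.sqrt(icc), n_clusters), m_size)
        mediator = gamma * t + cluster_fx + rng.normal(0, np.sqrt(1 - icc), n)
        g_hats[r] = mediator[t == 1].mean() - mediator[t == 0].mean()
    tiv = gamma**2 * 0.25 * n             # expected treatment-induced variation
    eiv = g_hats.var(ddof=1) * 0.25 * n   # expected error-induced variation
    return 1 + tiv / eiv

for icc in (0.00, 0.05, 0.15):
    print(f"intraclass correlation = {icc:.2f}   approximate population F ≈ {approx_f_pop(icc):6.1f}")
# Holding the total error variance and sample size fixed, a larger intraclass correlation inflates
# the variance of the first-stage estimate (more error-induced variation), so the population F falls
# and the approximate finite sample bias, roughly BIAS_OLS / F_pop, grows.
```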
To sum up, in this paper we study finite sample bias inherent to instrumental variables
estimators in four cases: the simple case of a single mediator with a single instrument, with or
without clustered structure in the data, and the case of a single mediator with multiple
instruments, with or without considering data clustering. For each of these cases, we show that
the source of the bias is the treatment-control mismatch that occurs randomly in finite
experimental samples, which produces error-induced variation in the predicted mediator. We also
demonstrate why and how the F statistic from the first stage of a TSLS estimator is useful for
studying the extent of the bias. In addition, from the bias standpoint, we conclude that shifting
from a single instrument to multiple instruments is preferred only when there is enough variation
in the first-stage impact estimates (across sites) to offset the increase in error-induced variation
with a corresponding increase in treatment-induced variation. We also demonstrate that a clustered
data structure leads to a smaller F value and hence increases the bias.
What lies ahead?
The present paper represents only the first stage of a more comprehensive project to
examine the new approach, its analytical problems and its application potential. Our future work
will attempt to generalize findings to more realistic (and important) applications: analyses of
multiple setting features based on multiple instrumental variables. In addition to complicating
the issues discussed in this paper, this more general situation raises several new analytic issues
that must be considered carefully before proceeding with the approach in future applications.
Appendixes
Appendix A. References
Angrist, J. D., G. Imbens, & D. Rubin (1996). Identification of Causal Effects Using
Instrumental Variables. Journal of the American Statistical Association, 91, 444-55.
Angrist, J. D. & A. B. Krueger (2001). Instrumental Variables and the Search for Identification:
From Supply and Demand to Natural Experiments. Princeton University, Working Paper 455,
Industrial Relations Section, August.
Bloom, H. S. (1984). Accounting for No-Shows in Experimental Evaluation Designs. Evaluation
Review, 8(2), 225-46.
Bloom, H. S. (2005) Learning More from Social Experiments: Evolving Analytic Approaches.
New York: Russell Sage Foundation.
Bloom, H. S., C. J. Hill, & J. Riccio (2005). Modeling Cross-Site Experimental Differences to
Find Out Why Program Effectiveness Varies. In Howard S. Bloom (Ed.), Learning More from
Social Experiments: Evolving Analytic Approaches (pp. 37-74). New York: Russell Sage
Foundation.
Bound, J., D. A. Jaeger, & R. M. Baker (1995). Problems with Instrumental Variables Estimation
when the Correlation between the Instruments and the Endogenous Explanatory Variable is
Weak. Journal of the American Statistical Association, 90(430), 443-50.
Cook, T. D. (2001). Sciencephobia: Why Education Researchers Reject Randomized
Experiments. Education Next, Fall, 63-68.
Gamse, B. C., H. S. Bloom, J. J. Kemple, & R. T. Jacob (2008). Reading First Impact Study:
Interim Report (NCEE 2008-4016). Washington, DC: National Center for Education Evaluation
and Regional Assistance, Institute of Education Sciences, U.S. Department of Education.
Greenberg, D. H., R. Meyer, C. Michalopoulos, & M. Wiseman (2003). Explaining Variation in
the Effects of Welfare-to-Work Programs. Evaluation Review, 27(4), 359-94.
Haegerich, T.M., & E. Metz (under review). The Social and Character Development Research
Program: Development, Goals and Opportunities. Journal of Research in Character Education.
Hahn, J. & J. Hausman (2003). Weak Instruments: Diagnosis and Cures in Empirical
Econometrics. American Economic Review, 93, 118-125.
Hoxby, C. M. (2000). Does Competition among Public Schools Benefit Students and Taxpayers?
American Economic Review, 90(5), 1209-1238.
Jackson, R., A. McCoy, C. Pistorino, A. Wilkinson, J. Burghardt, M. Clark, C. Ross, P.
Schochet, & P. Swank (2007). National Evaluation of Early Reading First: Final Report, U.S.
Department of Education, Institute of Education Sciences, Washington, DC: U.S. Government
Printing Office.
Jones, S. M., J. L. Brown, & J. L. Aber (2008). Classroom Settings as Targets of Intervention
and Research. In M. Shinn and H. Yoshikawa (editors) Changing Schools and Community
Organizations to Foster Positive Youth Development. New York: Oxford University Press.
Levitt, S. (1997). Using Electoral Cycles in Police Hiring to Estimate the Effect of Police on
Crime. The American Economic Review, 87(3), 270-90.
Morgan, M. (1990). The History of Econometric Ideas. Cambridge: Cambridge University Press.
Reiersol, O. (1945). Confluence Analysis by Means of Instrumental Sets of Variables. Arkiv for
Matematik, Astronomi och Fysik, 32a(4), 1-119.
Stock, J. H. & F. Trebbi (2003). Who Invented Instrumental Variable Regression? Journal of
Economic Perspectives, 17(3), 177-94.
Waldman, M., S. Nicholson, & N. Adilov (2006). Does Television Cause Autism? NBER
Working Paper No. 12632.
Wright, P. G. (1928). The Tariff on Animal and Vegetable Oils. New York: Macmillan.