How Much Can We Generalize from Impact Evaluations? Are They Worthwhile? Eva Vivalt∗ Stanford University November 8, 2015 Abstract Impact evaluations aim to predict the future, but they are rooted in particular contexts and to what extent they generalize is an open and important question. I exploit a new data set of impact evaluation results on a wide variety of interventions in development to answer this and other questions. I find that while a good model and the separation of sampling variance helps, results remain much more heterogeneous than in other fields, such as medicine. Given the heterogeneity, an obvious question is: are impact evaluations worthwhile? I model policymakers’ decisions and answer this question for different parameter values. I find that if policymakers were to use the simplest, naive predictor of a program’s effects, they would typically be off by 97%. Finally, I show how researchers can estimate the generalizability of their own study using their own data, even when data from no comparable studies exist. ∗ E-mail: vivalt@stanford.edu. I thank Edward Miguel, Bill Easterly, David Card, Ernesto Dal Bó, Hunt Allcott, Elizabeth Tipton, David McKenzie, Vinci Chow, Willa Friedman, Xing Huang, Michaela Pagel, Steven Pennings, Edson Severnini, seminar participants at the University of California, Berkeley, Columbia University, New York University, the World Bank, Cornell University, Princeton University, the University of Toronto, the London School of Economics, the Australian National University, the University of Ottawa, and the Stockholm School of Economics, among others, and participants at the 2015 ASSA meeting and 2013 Association for Public Policy Analysis and Management Fall Research Conference for helpful comments. I am also grateful for the hard work put in by many at AidGrade over the duration of this project, including but not limited to Jeff Qiu, Bobbie Macdonald, Diana Stanescu, Cesar Augusto Lopez, Mi Shen, Ning Zhang, Jennifer Ambrose, Naomi Crowther, Timothy Catlett, Joohee Kim, Gautam Bastian, Christine Shen, Taha Jalil, Risa Santoso and Catherine Razeto. 1 1 Introduction In the last few years, rigorous impact evaluations have become extensively used in development economics research. Policymakers and donors typically fund impact evaluations precisely to figure out how effective a similar program would be in the future to guide their decisions on what course of action they should take. However, it is not yet clear how much we can extrapolate from past results or under which conditions. Some have argued that impact evaluation results are context-dependent in a way that prevents them from being informative (Deaton, 2011; Pritchett and Sandefur, 2013). Further, there is some evidence that even a similar program, in a similar environment, can yield different results. For example, Bold et al. (2013) carry out an impact evaluation of a program to provide contract teachers in Kenya; this was a scaled-up version of an earlier program studied by Duflo, Dupas and Kremer (2012). The earlier intervention studied by Duflo, Dupas and Kremer was implemented by an NGO, while Bold et al. compared implementation by an NGO and the government’s attempted scale-up. While Duflo, Dupas and Kremer found positive effects, Bold et al. showed significant results only for the NGO-implemented group. The different findings in the same country for purportedly similar programs point to the substantial context-dependence of impact evaluation results. 
Knowing the extent of this context-dependence is crucial in order to understand what we can learn from any impact evaluation, and this paper provides that information for a wide variety of interventions. This paper also answers a second, related question: what is the value of an impact evaluation in terms of allowing policymakers to make better decisions? If the point of impact evaluations is to provide information, their use is naturally constrained by the extent to which one can generalize from their findings. I model the policymaker's decision problem and compare the potential costs of an impact evaluation with the benefits of improved predictive ability leading to better policy decisions.

While the main reasons to examine generalizability are to aid interpretation and improve predictions, doing so would also help to direct research attention to where it is most needed. If generalizability were higher in some areas, fewer papers would be needed to understand how people would behave in a similar situation; conversely, if there were topics or regions where generalizability was low, it would call for further study. With more information, researchers can better calibrate where to direct their attention to generate new insights.

Impact evaluations are still increasing rapidly, both in number and in the resources devoted to them. The World Bank recently received a major grant from the UK aid agency DFID to expand its already large impact evaluation work; the Millennium Challenge Corporation has committed to conduct rigorous impact evaluations for 50% of its activities, with "some form of credible evaluation of impact" for every activity (Millennium Challenge Corporation, 2009); and the U.S. Agency for International Development is also increasingly invested in impact evaluations, coming out with a new policy in 2011 that directs 3% of program funds to evaluation.1

1 While most of these are less rigorous "performance evaluations", country mission leaders are supposed to identify at least one opportunity for impact evaluation for each development objective in their 3-5 year plans (USAID, 2011).

Yet while impact evaluations are still growing in development, a few thousand are already complete. Figure 1 plots the explosion of RCTs that researchers affiliated with J-PAL, a center for development economics research, have completed each year; alongside are the number of development-related impact evaluations released that year according to 3ie, which keeps a directory of titles, abstracts, and other basic information on impact evaluations more broadly, including quasi-experimental designs; finally, the dashed line shows the number of papers that came out in each year that are included in AidGrade's database of impact evaluation results, which will be described shortly.

Figure 1: Growth of Impact Evaluations. This figure plots the number of studies that came out in each year that are contained in each of three databases described in the text: 3ie's title/abstract/keyword database of impact evaluations; J-PAL's database of affiliated randomized controlled trials; and AidGrade's database of impact evaluation results data.

In short, while we do impact evaluations to figure out what will happen in the future, many issues have been raised about how well we can extrapolate from past impact evaluations and, despite the importance of the topic, we have previously been able to do little more than guess or examine the question in narrow settings, as we did not have the data.
Now we have the opportunity to address this speculation, drawing on a large, unique dataset of impact evaluation results. I founded a non-profit organization dedicated to gathering these data. That organization, AidGrade, seeks to systematically understand which programs work best where, a task that also requires knowing the limits of our knowledge. To date, AidGrade has conducted 20 meta-analyses and systematic reviews of different development programs.2 Data gathered through meta-analyses are the ideal data to answer the question of how much we can extrapolate from past results, and since data on these 20 topics were collected in the same way, coding the same outcomes and other variables, we can look across different types of programs to see if there are any more general trends. Currently, the data set contains 647 papers on 210 narrowly-defined intervention-outcome combinations, with the greater database containing 15,021 estimates.

2 Throughout, I will refer to all 20 as meta-analyses, but some did not have enough comparable outcomes for meta-analysis and became systematic reviews.

A further contribution of this paper is the development of benchmarks or rules of thumb that researchers or practitioners can use to gauge the external validity of their own work. I discuss several metrics and show typical values across a range of interventions. Other disciplines have considered generalizability more, so I draw on the literature relating to meta-analysis, which has been most well-developed in medicine, as well as the psychometric literature on generalizability theory (Higgins and Thompson, 2002; Shavelson and Webb, 2006; Briggs and Wilson, 2007). Since we may not care about heterogeneity in treatment effects if it can be modelled, I show how the measures I discuss could also be used in conjunction with explanatory models. I use the concrete examples of conditional cash transfers (CCTs) and deworming programs, which are relatively well-understood and on which many papers have been written, to elucidate the issues. Ultimately, I find that while a good model and the separation of sampling variance help, results remain much more heterogeneous than in other fields, such as medicine. Policy decisions would seem to be improved by careful research designs that pay attention to the issue of external validity since, as it stands, the typical prediction of the effect of a program might be off by 97%.

Though this paper focuses on results for impact evaluations of development programs, this is only one of the first areas within economics to which these kinds of methods can be applied. Many impact evaluations also exist for domestic policies, for instance. In many of the sciences, knowledge is built through a combination of researchers conducting individual studies and other researchers synthesizing the evidence through meta-analysis. This paper begins that natural next step.

2 Theory

2.1 The Policymaker's Decision Problem

In the simplest case, a policymaker might face a choice between a program and an outside option. The program's effect if it were to be implemented in the policymaker's setting, $\theta_i$, is unknown ex ante; the outside option's effect is $\theta^*$. The policymaker's prior is:

$\theta_i \sim N(\mu, \tau^2)$  (1)

where $\mu$ and $\tau^2$ are unknown hyperparameters.
The policymaker has the opportunity to observe a signal about the effect of the program by conducting an impact evaluation. For example, this could be thought of as an evaluation of a pilot before rolling out a program. The signal is itself drawn from a distribution:

$Y_i \mid \theta_i \sim N(\theta_i, \sigma_i^2)$  (2)

where $Y_i$ is the observed effect size of a particular study and $\sigma_i^2$ the sample variance. The impact evaluation has a cost, $c > 0$, and the policymaker needs to decide whether the value of the information provided by the signal is worth it. I assume the policymaker uses Bayesian updating. $\theta_i$ can then neatly be estimated using Bayesian meta-analysis.

As a quick review, the meta-analysis literature suggests two general types of models that can be parameterized in many ways: fixed-effect models and random-effects models. Fixed-effect models assume there is one true effect of a particular program and all differences between studies can be attributed simply to sampling error. In other words:

$Y_i = \theta + \varepsilon_i$  (3)

where $\theta$ is the true effect and $\varepsilon_i$ is the error term. Random-effects models do not make this assumption; the true effect could potentially vary from context to context. Here,

$Y_i = \theta_i + \varepsilon_i$  (4)
$\;\;\;\; = \bar{\theta} + \eta_i + \varepsilon_i$  (5)

where $\bar{\theta}$ is the mean true effect size, $\eta_i$ is a particular study's divergence from that mean true effect size, and $\varepsilon_i$ is the error. Random-effects models are more plausible, and they are necessary if we think there are heterogeneous treatment effects, so I use them in this paper. Random-effects models can also be modified by the addition of explanatory variables, at which point they are called mixed models; I will also use mixed models in this paper. I begin by presenting the random-effects model, followed by the related strategy to estimate a mixed model. The random-effects model is fully described in Gelman et al. (2013), from which the next section heavily draws.

2.1.1 Estimating a Random-Effects Model

To build a hierarchical Bayesian random-effects model, I first assume the data are normally distributed:

$Y_{ij} \mid \theta_i \sim N(\theta_i, \sigma^2)$  (6)

where $j$ indexes the individuals in the study. I do not have individual-level data, but instead can use sufficient statistics:

$Y_i \mid \theta_i \sim N(\theta_i, \sigma_i^2)$  (7)

where $Y_i$ is the sample mean and $\sigma_i^2$ the sample variance. This provides the likelihood for $\theta_i$. I also need a prior for $\theta_i$. As discussed, I assume between-study normality:

$\theta_i \sim N(\mu, \tau^2)$  (8)

where $\mu$ and $\tau$ are unknown hyperparameters. Conditioning on the distribution of the data, given by Equation 7, I get a posterior:

$\theta_i \mid \mu, \tau, Y \sim N(\hat{\theta}_i, V_i)$  (9)

where

$\hat{\theta}_i = \dfrac{\frac{Y_i}{\sigma_i^2} + \frac{\mu}{\tau^2}}{\frac{1}{\sigma_i^2} + \frac{1}{\tau^2}}, \qquad V_i = \dfrac{1}{\frac{1}{\sigma_i^2} + \frac{1}{\tau^2}}$  (10)

I then need to pin down $\mu \mid \tau$ and $\tau$ by constructing their posterior distributions given non-informative priors and updating based on the data. I assume a uniform prior for $\mu \mid \tau$, and as the $Y_i$ are estimates of $\mu$ with variance $(\sigma_i^2 + \tau^2)$, obtain:

$\mu \mid \tau, Y \sim N(\hat{\mu}, V_\mu)$  (11)

where

$\hat{\mu} = \dfrac{\sum_i \frac{Y_i}{\sigma_i^2 + \tau^2}}{\sum_i \frac{1}{\sigma_i^2 + \tau^2}}, \qquad V_\mu = \dfrac{1}{\sum_i \frac{1}{\sigma_i^2 + \tau^2}}$  (12)

For $\tau$, note that $p(\tau \mid Y) = \dfrac{p(\mu, \tau \mid Y)}{p(\mu \mid \tau, Y)}$. The denominator follows from Equation 12; for the numerator, we can observe that $p(\mu, \tau \mid Y)$ is proportional to $p(\mu, \tau)\,p(Y \mid \mu, \tau)$, and we know the marginal distribution of $Y_i \mid \mu, \tau$:

$Y_i \mid \mu, \tau \sim N(\mu, \sigma_i^2 + \tau^2)$  (13)

I use a uniform prior for $\tau$, following Gelman et al. (2013). This yields the posterior for the numerator:

$p(\mu, \tau \mid Y) \propto p(\mu, \tau) \prod_i N(Y_i \mid \mu, \sigma_i^2 + \tau^2)$  (14)

Putting together all the pieces in reverse order, I first simulate $\tau$ from $p(\tau \mid Y)$, then $\mu$ from $p(\mu \mid \tau, Y)$, and finally $\theta_i$ from $p(\theta_i \mid \mu, \tau, Y)$.
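To make these steps concrete, the following is a minimal sketch, in Python, of the posterior simulation described above (following Gelman et al., 2013, ch. 5): a grid approximation to p(τ | Y), followed by draws of μ | τ, Y and θ_i | μ, τ, Y. The effect sizes, sampling variances, grid bounds, and the final comparison to an outside option θ* are illustrative assumptions, not the paper's actual code or data.

```python
import numpy as np

def simulate_posterior(y, s2, n_draws=5000, tau_grid=None, seed=0):
    """Hierarchical Bayesian random-effects meta-analysis (Gelman et al. 2013, ch. 5).

    y  : observed effect sizes Y_i for one intervention-outcome combination
    s2 : sampling variances sigma_i^2
    Returns posterior draws of tau, mu, and theta_i.
    """
    rng = np.random.default_rng(seed)
    y, s2 = np.asarray(y, float), np.asarray(s2, float)
    if tau_grid is None:                       # grid approximation for p(tau | Y)
        tau_grid = np.linspace(1e-4, 2 * y.std() + 1e-3, 2000)

    log_p = np.empty_like(tau_grid)
    for k, tau in enumerate(tau_grid):
        v = s2 + tau**2                        # marginal variance of Y_i given mu, tau
        V_mu = 1.0 / np.sum(1.0 / v)
        mu_hat = V_mu * np.sum(y / v)
        # p(tau | Y) proportional to V_mu^{1/2} * prod_i N(Y_i | mu_hat, s_i^2 + tau^2)
        log_p[k] = 0.5 * np.log(V_mu) - 0.5 * np.sum(np.log(v) + (y - mu_hat) ** 2 / v)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()

    taus = rng.choice(tau_grid, size=n_draws, p=p)          # 1. draw tau ~ p(tau | Y)
    v = s2[None, :] + taus[:, None] ** 2
    V_mu = 1.0 / np.sum(1.0 / v, axis=1)
    mu_hat = V_mu * np.sum(y[None, :] / v, axis=1)
    mus = rng.normal(mu_hat, np.sqrt(V_mu))                 # 2. draw mu ~ p(mu | tau, Y)
    prec = 1.0 / s2[None, :] + 1.0 / taus[:, None] ** 2
    theta_hat = (y[None, :] / s2[None, :] + mus[:, None] / taus[:, None] ** 2) / prec
    thetas = rng.normal(theta_hat, np.sqrt(1.0 / prec))     # 3. draw theta_i ~ p(theta_i | mu, tau, Y)
    return taus, mus, thetas

# Illustrative use with hypothetical effect sizes and sampling variances.
y = [0.10, 0.25, -0.05, 0.30, 0.12]
s2 = [0.01, 0.02, 0.015, 0.03, 0.01]
taus, mus, thetas = simulate_posterior(y, s2)
theta_star = 0.1                               # hypothetical outside option
print("posterior mean of theta_i:", thetas.mean(axis=0).round(3))
print("P(theta_i > theta*):", (thetas > theta_star).mean(axis=0).round(2))
```

The last two lines connect the estimation back to the policymaker's problem: the share of posterior draws above θ* is the estimated probability that the program beats the outside option.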
2.1.2 Estimating a Mixed Model

The strategy here is similar. Appendix E contains a derivation.

2.2 Solving the Policymaker's Problem and Extensions

The random-effects and mixed models both yield a predicted $\hat{\theta}_i$, with or without the use of explanatory variables. We can imagine that with infinite data, $\hat{\theta}_i \to \theta_i$. The model also enables a back-of-the-envelope calculation of how much it might cost a policymaker to make the "wrong" decision, which would let us say for what set of conditions it might be useful to spend money on an additional impact evaluation. Remembering that the outside option's effect is $\theta^*$, if $\hat{\theta}_i > \theta^*$ and $\theta_i < \theta^*$, or vice versa, the policymaker is making a mistake costing the value of $|\theta_i - \theta^*|$. We can either think of a function that assigns a cost to each $\theta_i$ or keep all calculations in terms of $\theta_i$.

This exercise is limited to the costs and benefits a policymaker would face if trying to find the most efficient program with which to achieve a particular policy goal and does not take into consideration other goals a policymaker might have, such as political concerns. While thus limited, we might think that a policymaker who seeks to make the best policy decisions represents the ideal social planner.

Though this model was framed in terms of comparing a program against an outside option for the sake of simplicity, we might think that the outside option could itself be uncertain. It might be more realistic to think of the policymaker as choosing between two uncertain programs. This would be straightforward to accommodate in the model: we would only have to consider the difference in effects between two programs as the $\theta_i$ that is being estimated. Finally, an impact evaluation can also be thought of as providing a public good that multiple policymakers could take advantage of; in this case, an impact evaluation could still be worthwhile as a public good, even in some cases in which it would not be worthwhile as a private good. I will consider this possibility in the discussion of the results.

3 Data

This paper uses a database of impact evaluation results collected by AidGrade, a U.S. non-profit research institute that I founded in 2012. AidGrade focuses on gathering the results of impact evaluations and analyzing the data, including through meta-analysis. Its data on impact evaluation results were collected in the course of its meta-analyses from 2012-2014 (AidGrade, 2015).

AidGrade's meta-analyses follow the standard stages: (1) topic selection; (2) a search for relevant papers; (3) screening of papers; (4) data extraction; and (5) data analysis. In addition, it pays attention to (6) dissemination and (7) updating of results. Here, I will discuss the selection of papers (stages 1-3) and the data extraction protocol (stage 4); more detail is provided in Appendix B.

3.1 Selection of Papers

The interventions selected for meta-analysis were chosen largely on the basis of there being a sufficient number of studies on that topic. Five AidGrade staff members each independently made a preliminary list of interventions for examination; the lists were then combined, and searches were done for each topic to determine if there were likely to be enough impact evaluations for a meta-analysis. The remaining list was voted on by the general public online and partially randomized. Appendix B provides further detail. A comprehensive literature search was done using a mix of the search aggregators SciVerse, Google Scholar, and EBSCO/PubMed.
The online databases of J-PAL, IPA, CEGA and 3ie were also searched for completeness. Finally, the references of any existing systematic reviews or meta-analyses were collected. Any impact evaluation which appeared to be on the intervention in question was included, barring those in developed countries.3 Any paper that tried to consider the counterfactual was considered an impact evaluation. Both published papers and working papers were included. The search and screening criteria were deliberately broad. There is not enough room to include the full text of the search terms and inclusion criteria for all 20 topics in this paper, but these are available in an online appendix as detailed in Appendix A. 3 High-income countries, according to the World Bank’s classification system. 8 3.2 Data Extraction The subset of the data on which I am focusing is based on those papers that passed all screening stages in the meta-analyses. Again, the search and screening criteria were very broad and, after passing the full text screening, the vast majority of papers that were later excluded were excluded merely because they had no outcome variables in common or did not provide adequate data (for example, not providing data that could be used to calculate the standard error of an estimate, or for a variety of other quirky reasons, such as displaying results only graphically). The small overlap of outcome variables is a surprising and notable feature of the data. Ultimately, the data I draw upon for this paper consist of 15,021 results (double-coded and then reconciled by a third researcher) across 647 papers covering the 20 types of development program listed in Table 1.4 For sake of comparison, though the two organizations clearly do different things, at present time of writing this is more impact evaluations than J-PAL has published, concentrated in these 20 topics. Unfortunately, only 318 of these papers both overlapped in outcomes with another paper and were able to be standardized and thus included in the main results which rely on intervention-outcome groups. Outcomes were defined under several rules of varying specificity, as will be discussed shortly. Table 1: List of Development Programs Covered 2012 2013 Conditional cash transfers Contract teachers Deworming Financial literacy training Improved stoves HIV education Insecticide-treated bed nets Irrigation Microfinance Micro health insurance Safe water storage Micronutrient supplementation Scholarships Mobile phone-based reminders School meals Performance pay Unconditional cash transfers Rural electrification Water treatment Women’s empowerment programs 73 variables were coded for each paper. Additional topic-specific variables were coded for some sets of papers, such as the median and mean loan size for microfinance programs. This 4 Three titles here may be misleading. “Mobile phone-based reminders” refers specifically to SMS or voice reminders for health-related outcomes. “Women’s empowerment programs” required an educational component to be included in the intervention and it could not be an unrelated intervention that merely disaggregated outcomes by gender. Finally, micronutrients were initially too loosely defined; this was narrowed down to focus on those providing zinc to children, but the other micronutrient papers are still included in the data, with a tag, as they may still be useful. 9 paper focuses on the variables held in common across the different topics. 
These include which method was used; if randomized, whether it was randomized by cluster; whether it was blinded; where it was (village, province, country - these were later geocoded in a separate process); what kind of institution carried out the implementation; characteristics of the population; and the duration of the intervention from the baseline to the midline or endline results, among others. A full set of variables and the coding manual is available online, as detailed in Appendix A. As this paper pays particular attention to the program implementer, it is worth discussing how this variable was coded in more detail. There were several types of implementers that could be coded: governments, NGOs, private sector firms, and academics. There was also a code for “other” (primarily collaborations) or “unclear”. The vast majority of studies were implemented by academic research teams and NGOs. This paper considers NGOs and academic research teams together because it turned out to be practically difficult to distinguish between them in the studies, especially as the passive voice was frequently used (e.g. “X was done” without noting who did it). There were only a few private sector firms involved, so they are considered with the “other” category in this paper. Studies tend to report results for multiple specifications. AidGrade focused on those results least likely to have been influenced by author choices: those with the fewest controls, apart from fixed effects. Where a study reported results using different methodologies, coders were instructed to collect the findings obtained under the authors’ preferred methodology; where the preferred methodology was unclear, coders were advised to follow the internal preference ordering of prioritizing randomized controlled trials, followed by regression discontinuity designs and differences-in-differences, followed by matching, and to collect multiple sets of results when they were unclear on which to include. Where results were presented separately for multiple subgroups, coders were similarly advised to err on the side of caution and to collect both the aggregate results and results by subgroup except where the author appeared to be only including a subgroup because results were significant within that subgroup. For example, if an author reported results for children aged 8-15 and then also presented results for children aged 12-13, only the aggregate results would be recorded, but if the author presented results for children aged 8-9, 10-11, 12-13, and 14-15, all subgroups would be coded as well as the aggregate result when presented. Authors only rarely reported isolated subgroups, so this was not a major issue in practice. When considering the variation of effect sizes within a group of papers, the definition of the group is clearly critical. Two different rules were initially used to define outcomes: a strict rule, under which only identical outcome variables are considered alike, and a loose rule, under which similar but distinct outcomes are grouped into clusters. 10 The precise coding rules were as follows: 1. We consider outcome A to be the same as outcome B under the “strict rule” if outcomes A and B measure the exact same quality. Different units may be used, pending conversion. The outcomes may cover different timespans (e.g. encompassing both outcomes over “the last month” and “the last week”). They may also cover different populations (e.g. children or adults). Examples: height; attendance rates. 2. 
We consider outcome A to be the same as outcome B under the "loose rule" if they do not meet the strict rule but are clearly related. Example: parasitemia greater than 4000/µl with fever and parasitemia greater than 2500/µl.

Clearly, even under the strict rule, differences between the studies may exist; however, using two different rules allows us to isolate the potential sources of variation, and other variables were coded to capture some of this variation, such as the age of those in the sample. If one were to divide the studies by these characteristics, however, the data would usually be too sparse for analysis. Interventions were also defined separately, and coders were asked to write a short description of the details of each program. Program names were recorded so as to identify those papers on the same program, such as the various evaluations of PROGRESA.

After coding, the data were standardized to make results easier to interpret and so as not to overly weight those outcomes with larger scales. The typical way to compare results across different outcomes is by using the standardized mean difference, defined as:

$SMD = \dfrac{\mu_1 - \mu_2}{\sigma_p}$

where $\mu_1$ is the mean outcome in the treatment group, $\mu_2$ is the mean outcome in the control group, and $\sigma_p$ is the pooled standard deviation. When data are not available to calculate the pooled standard deviation, it can be approximated by the standard deviation of the dependent variable for the entire distribution of observations or by the standard deviation in the control group (Glass, 1976). If that is not available either, due to standard deviations not having been reported in the original papers, one can use the typical standard deviation for the intervention-outcome. I follow this approach to calculate the standardized mean difference, which is then used as the effect size measure for the rest of the paper unless otherwise noted. This paper uses the "strict" outcomes where available, but the "loose" outcomes where that would keep more data. For papers which were follow-ups of the same study, the most recent results were used for each outcome.

Finally, one paper appeared to misreport results, suggesting implausibly low values and standard deviations for hemoglobin. These results were excluded and the paper's corresponding author contacted. Excluding this paper's results, effect sizes range between -1.5 and 1.8 SD, with an interquartile range of 0 to 0.2 SD. So as to mitigate sensitivity to individual results, especially with the small number of papers in some intervention-outcome groups, I restrict attention to those standardized effect sizes less than 2 SD away from 0. I report main results including this one additional observation in the Appendix.
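As a small illustration of the standardization just described, the sketch below computes the standardized mean difference with the fallback order for the standard deviation mentioned above (pooled SD, then the SD over all observations, then the control-group SD, then the typical SD for the intervention-outcome). The function name and example numbers are hypothetical.

```python
def standardized_mean_difference(mean_t, mean_c, sd_pooled=None, sd_full=None,
                                 sd_control=None, sd_typical=None):
    """Standardized mean difference with the fallback order described above:
    pooled SD, then SD of the full sample, then control-group SD, then the
    typical SD for the intervention-outcome (all arguments illustrative)."""
    for sd in (sd_pooled, sd_full, sd_control, sd_typical):
        if sd is not None and sd > 0:
            return (mean_t - mean_c) / sd
    raise ValueError("No usable standard deviation reported")

# Hypothetical example: a study reporting only the control-group SD.
print(standardized_mean_difference(mean_t=72.0, mean_c=70.0, sd_control=8.0))  # 0.25
```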
3.3 Data Description

Figure 2 summarizes the distribution of studies covering the interventions and outcomes considered in this paper that can be standardized. Attention will typically be limited to those intervention-outcome combinations on which we have data for at least three papers. Table 12 in Appendix D lists the interventions and outcomes and describes their results in a bit more detail, providing the distribution of significant and insignificant results. It should be emphasized that the number of negative and significant, insignificant, and positive and significant results per intervention-outcome combination provides only ambiguous evidence of the typical efficacy of a particular type of intervention. Simply tallying the numbers in each category is known as "vote counting" and can yield misleading results if, for example, some studies are underpowered.

Table 2 further summarizes the distribution of papers across interventions and highlights the fact that papers exhibit very little overlap in terms of outcomes studied. This is consistent with the story of researchers each wanting to publish one of the first papers on a topic. Vivalt (2015a) finds that later papers on the same intervention-outcome combination more often remain as working papers.

A note must be made about combining data. When conducting a meta-analysis, the Cochrane Handbook for Systematic Reviews of Interventions recommends collapsing the data to one observation per intervention-outcome-paper, and I do this for generating the within intervention-outcome meta-analyses (Higgins and Green, 2011). Where results had been reported for multiple subgroups (e.g. women and men), I aggregated them as in the Cochrane Handbook's Table 7.7.a. Where results were reported for multiple time periods (e.g. 6 months after the intervention and 12 months after the intervention), I used the most comparable time periods across papers. When combining across multiple outcomes, which has limited use but will come up later in the paper, I used the formulae from Borenstein et al. (2009), Chapter 24.

Figure 2: Within-Intervention-Outcome Number of Papers

Table 2: Descriptive Statistics: Distribution of Narrow Outcomes

Intervention                      Number of outcomes   Mean papers per outcome   Max papers per outcome
Conditional cash transfers                10                    21                        37
Contract teachers                          1                     3                         3
Deworming                                 12                    13                        18
Financial literacy                         1                     5                         5
HIV/AIDS Education                         3                     8                        10
Improved stoves                            4                     2                         2
Insecticide-treated bed nets               1                     9                         9
Irrigation                                 2                     2                         2
Micro health insurance                     1                     2                         2
Microfinance                               5                     4                         5
Micronutrient supplementation             23                    27                        47
Mobile phone-based reminders               2                     4                         5
Performance pay                            1                     3                         3
Rural electrification                      3                     3                         3
Safe water storage                         1                     2                         2
Scholarships                               3                     4                         5
School meals                               3                     3                         3
Unconditional cash transfers               3                     9                        11
Water treatment                            2                     5                         6
Women's empowerment programs               2                     2                         2
Average                                  4.2                   6.5                       9.0

4 Method

4.1 Estimation of the Gains in Accuracy of $\hat{\theta}_i$

The model contains four parameters: $\mu$, $\tau^2$, $\sigma_i^2$, and $\theta_i$. $\sigma_i^2$ is provided by the data; $\tau^2$, $\mu$, and $\theta_i$ are estimated as described in Section 2. These estimates are done within intervention-outcome combinations.

To approximate the effects of a policymaker learning more information as more studies are completed, I use all the different combinations of studies, in turn, when generating the estimates. For example, if three studies, labelled 1, 2, 3, considered the effects of a particular intervention on a particular outcome, I would generate estimates of $\theta_i$ using each set of studies {1}, {2}, {3}, {1,2}, {1,3}, {2,3}, {1,2,3}. Each $\hat{\theta}_i$ is then associated with a number of studies, n, that went into estimating it, denoted $\hat{\theta}_{i,n}$. I approximate $\theta_i$ (the limit of $\hat{\theta}_{i,n}$ as $n \to \infty$) with $\hat{\theta}_{i,N}$, the estimate from using the full data available for that intervention-outcome combination.

Figure 3 shows how the mean $\hat{\theta}_{i,n}$ evolves as more data are added for each intervention-outcome combination. In particular, the Y-axis represents the difference between the mean $\hat{\theta}_{i,n}$ and the mean $\hat{\theta}_{i,N}$, for various n. The convergence of the mean $\hat{\theta}_{i,n}$ to $\hat{\theta}_{i,N}$ is mechanical due to how $\hat{\theta}_{i,n}$ is constructed; however, the figure illustrates that convergence is often attained by n = 10 even when more studies exist. This suggests that using $\hat{\theta}_{i,N}$ to estimate $\theta_i$ is particularly reasonable for $N \geq 10$. I will use this number as a minimum cut-off for N in a robustness check.
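A rough sketch of this subset exercise: for each subset size n, estimate $\hat{\theta}_i$ from that subset and track the gap between the mean $\hat{\theta}_{i,n}$ and the full-data mean $\hat{\theta}_{i,N}$. For brevity, the sketch substitutes a method-of-moments (DerSimonian-Laird) estimate of $\tau^2$ and precision-weighted shrinkage for the full posterior simulation of Section 2; the input data are illustrative.

```python
from itertools import combinations
import numpy as np

def dl_tau2(y, s2):
    """Method-of-moments (DerSimonian-Laird) estimate of tau^2 -- a simplified
    stand-in for the Bayesian posterior simulation used in the paper."""
    if len(y) < 2:
        return 0.0
    w = 1.0 / s2
    mu_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fe) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, float((q - (len(y) - 1)) / c))

def theta_hat(y, s2):
    """Shrunken study-level estimates for the studies in a given subset."""
    tau2 = dl_tau2(y, s2)
    v = s2 + tau2
    mu = np.sum(y / v) / np.sum(1.0 / v)
    if tau2 == 0.0:
        return np.full_like(y, mu)        # complete pooling when no heterogeneity
    prec = 1.0 / s2 + 1.0 / tau2
    return (y / s2 + mu / tau2) / prec

# Illustrative effect sizes and sampling variances for one intervention-outcome.
y = np.array([0.05, 0.20, 0.12, 0.30, -0.02, 0.15])
s2 = np.array([0.010, 0.020, 0.008, 0.030, 0.015, 0.012])
full = theta_hat(y, s2).mean()            # stands in for the mean of theta_hat_{i,N}

for n in range(1, len(y) + 1):
    gaps = [abs(theta_hat(y[list(c)], s2[list(c)]).mean() - full)
            for c in combinations(range(len(y)), n)]
    print(f"n = {n}: mean gap from full-data estimate = {np.mean(gaps):.3f}")
```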
Figure 3: Evolution of Mean $\hat{\theta}_{i,n}$. The Y-axis represents the difference between the mean $\hat{\theta}_{i,n}$ and the mean $\hat{\theta}_{i,N}$.

4.2 Generalizability Measures

It would be helpful to have some rules of thumb that can be used to gauge the expected external validity of a particular study's results or the expected likelihood that an impact evaluation will be worthwhile. I will refer throughout to several metrics that could be used as measures of generalizability. I can then relate these measures to results from the model. Since generalizability can be thought of as the ability to predict results accurately in an out-of-sample group, and the ability to predict results, in turn, hinges both on 1) the variability of the results and 2) the proportion that can be explained, we will want measures that speak to each of these. In particular, I will focus on the variance, var($Y_i$), and the coefficient of variation, CV($Y_i$) = sd($Y_i$)/$\bar{Y}_i$, to capture variability, and on $I^2 = \tau^2/(\tau^2 + \sigma^2)$ as a measure of the proportion of the variation that is systematic.5 I will also separate out the sampling variance and use explanatory variables to reduce the unexplained heterogeneity, resulting in the amount of residual variation in $Y_i$, var$_R$($Y_i$), the coefficient of residual variation, CV$_R$($Y_i$), and the residual $I_R^2$. Appendix C has more information on these measures and motivates their use. It is important to note that each measure captures different things and has advantages and disadvantages, as summarized in Tables 10 and 11 in that section. Multiple measures should be simultaneously used to get a full picture.

5 This paper follows convention and reports the absolute value of the coefficient of variation wherever it appears.

5 Results

In this section, I will first present results without modelling any of the heterogeneity in outcomes. We will see that the values for all the measures of heterogeneity are quite high, and an additional impact evaluation would typically improve estimates of $\theta_i$ by only 0.002 - 0.015 standard deviations. Given an average effect size of 0.12, this is approximately 1.6% - 12.3%.

5.1 Without Modelling Heterogeneity

Table 3 presents results within intervention-outcome combinations. All $Y_i$ were converted to be in terms of standard deviations to put them on a common scale before statistics were calculated, with the aforementioned caveats. $\rho_{A,B}$ represents the average change in the difference between $\hat{\theta}_i$ and $\theta_i$ when moving from using data from A studies to using data from B studies to estimate $\theta_i$. If we restrict attention to those intervention-outcome combinations with $N \geq 10$, for which $\hat{\theta}_{i,N}$ may better approximate $\theta_i$, the average amount an additional impact evaluation would improve estimates of $\theta_i$ decreases to 0.0002 - 0.007, or 0.1 - 5.7%. The marginal improvements are higher for smaller n, as we might expect.

Figure 4: Dispersion of Estimates. Various $\hat{\theta}_{i,n}$ estimates for the $\tau^2$ and $\sigma_i^2$ in the data. Outliers are dropped for easier viewing.

Figure 4 plots the distribution of $\hat{\theta}_i$ generated for each subset of the data within an intervention-outcome combination. Each intervention-outcome combination is associated with a unique $\tau^2$ and $\sigma_i^2$ in this plot: the $\tau^2$ and $\sigma_i^2$ that were estimated using all the data available for that intervention-outcome combination. Each vertical line of dots thus represents a range of different estimates of $\hat{\theta}_i$. If $\hat{\theta}_i > \theta^*$ but $\theta_i < \theta^*$, or vice versa, the policymaker is making a mistake.
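The following is a minimal sketch of the "mistake" just defined: given assumed values of $\mu$, $\tau$, and a study's sampling variance, it simulates how often the shrunken estimate $\hat{\theta}_i$ and the true effect $\theta_i$ fall on opposite sides of the outside option $\theta^*$, and the average cost when they do. The parameter values are illustrative, not estimates from the data.

```python
import numpy as np

def mistake_rate(mu, tau, sigma, theta_star=0.1, n_sims=200_000, seed=0):
    """Probability that the estimate and the true effect fall on opposite sides of
    the outside option theta_star, plus the average cost |theta_i - theta_star|
    conditional on such a mistake. Shrinkage follows Equation 10."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(mu, tau, n_sims)                 # true effects theta_i
    y = rng.normal(theta, sigma)                        # observed study estimates Y_i
    prec = 1 / sigma**2 + 1 / tau**2
    theta_hat = (y / sigma**2 + mu / tau**2) / prec     # posterior-mean estimates
    mistake = (theta_hat > theta_star) != (theta > theta_star)
    return mistake.mean(), np.abs(theta - theta_star)[mistake].mean()

# Illustrative parameter values for one intervention-outcome combination.
print(mistake_rate(mu=0.12, tau=0.15, sigma=0.10, theta_star=0.1))
```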
θ˚ is set at 0.1 in the diagram for illustration. Clearly, with a lot of dispersion around 0.1, the chances of making a mistake are high. The likelihood of making a mistake would be lower for particularly high or low θ˚ , but additional impact evaluations do not help to reduce this likelihood by much. The different heterogeneity measures, var(Yi ), CV(Yi ) and I 2 , yield varied results. As previously discussed, they measure different things. The coefficient of variation depends heavily on the mean; the I 2 , on the precision of the underlying estimates. The interventionoutcome combinations that fall within the bottom third by variance have varpYi q ď 0.015; the top third have varpYi q ě 0.052. Similarly, the threshold delineating the bottom third for the coefficient of variation is 1.14 and, for the top third, 2.36. For I 2 , the lower threshold is 0.93 and the upper threshold approaches 1. 17 Intervention Table 3: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes Outcome ρ1,2 ρ5,6 ρ10,11 var(Yi ) CV(Yi ) 18 Microfinance Rural Electrification Micronutrients Microfinance Microfinance Financial Literacy Microfinance Contract Teachers Performance Pay Micronutrients Conditional Cash Transfers Micronutrients Micronutrients Micronutrients Micronutrients Conditional Cash Transfers Deworming Micronutrients Conditional Cash Transfers Unconditional Cash Transfers Water Treatment SMS Reminders Conditional Cash Transfers School Meals Micronutrients Micronutrients Micronutrients Bed Nets Conditional Cash Transfers Assets Enrollment rate Cough prevalence Total income Savings Savings Profits Test scores Test scores Body mass index Unpaid labor Weight-for-age Weight-for-height Birthweight Height-for-age Test scores Hemoglobin Mid-upper arm circumference Enrollment rate Enrollment rate Diarrhea prevalence Treatment adherence Labor force participation Test scores Height Mortality rate Stunted Malaria Attendance rate 0.00 0.01 0.01 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.01 0.00 0.01 0.01 0.01 0.00 0.05 -0.01 0.03 0.02 0.02 0.03 0.00 0.01 0.00 0.01 0.01 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.000 0.001 0.001 0.001 0.002 0.004 0.005 0.005 0.006 0.007 0.009 0.009 0.010 0.010 0.012 0.013 0.015 0.015 0.015 0.016 0.020 0.022 0.023 0.023 0.023 0.025 0.025 0.029 0.030 5.51 0.13 1.65 0.99 1.77 5.47 5.45 0.40 0.61 0.67 0.92 1.94 2.15 0.98 2.47 1.87 3.38 2.08 0.83 1.09 0.97 1.67 1.63 1.29 4.37 2.88 1.11 0.50 0.52 I2 N 1.00 0.93 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 0.93 0.98 0.80 0.94 1.00 1.00 1.00 0.53 1.00 1.00 1.00 0.93 0.33 0.58 0.98 0.07 0.12 0.97 1.00 4 3 3 5 3 5 5 3 3 5 5 34 26 7 36 5 15 18 37 11 6 5 17 3 32 12 5 9 15 19 Micronutrients HIV/AIDS Education Micronutrients Deworming Micronutrients Scholarships Conditional Cash Transfers Deworming Micronutrients School Meals Micronutrients Deworming Deworming Micronutrients Micronutrients Micronutrients Deworming Micronutrients SMS Reminders Deworming Conditional Cash Transfers Rural Electrification Average Weight Used contraceptives Perinatal deaths Height Test scores Enrollment rate Height-for-age Weight-for-height Stillbirths Enrollment rate Prevalence of anemia Height-for-age Weight-for-age Diarrhea incidence Diarrhea prevalence Fever prevalence Weight Hemoglobin Appointment attendance rate Mid-upper arm circumference Probability unpaid work Study time 0.00 0.02 0.02 0.02 0.00 0.01 0.04 0.00 0.03 0.11 0.02 0.03 -0.01 -0.04 0.03 0.02 0.02 -0.01 
0.01 0.00 0.02 0.09 0.015 0.00 0.01 0.02 -0.01 0.00 0.01 -0.01 0.00 0.00 0.00 -0.01 0.02 0.00 -0.01 0.03 0.00 -0.01 0.00 0.00 -0.03 0.00 0.03 0.02 0.00 0.002 0.003 0.034 0.036 0.038 0.049 0.052 0.053 0.055 0.072 0.075 0.081 0.095 0.098 0.107 0.109 0.111 0.146 0.184 0.215 0.224 0.439 0.609 0.997 2.70 3.12 2.10 2.36 1.69 0.69 22.17 3.13 3.04 1.14 0.79 1.98 2.29 3.30 1.21 3.08 4.76 1.44 2.91 1.77 6.42 1.10 0.73 0.51 0.05 1.00 1.00 1.00 0.03 1.00 0.01 0.01 0.90 1.00 1.00 1.00 0.98 0.76 1.00 1.00 0.99 1.00 0.97 0.03 36 10 6 17 10 5 7 11 4 3 15 14 12 11 6 5 18 46 3 7 5 3 0.083 2.52 0.80 12 How should we interpret these numbers? Higgins and Thompson, who defined I 2 , called 0.25 indicative of “low”, 0.5 “moderate”, and 0.75 “high” levels of heterogeneity (2002; Higgins et al., 2003). Figure 5 plots a histogram of the I 2 results, showing a lot of systematic variation according to this scale. No similar defined benchmarks exist for the variance, but studies in the medical literature tend to exhibit a coefficient of variation of approximately 0.05-0.5 (Tian, 2005; Ng, 2014). By this standard, too, results would appear quite heterogeneous. Figure 5: Density of I 2 values An alternative benchmark that might have more appeal is that of the average withinstudy variation. If the across-study variation approached the within-study variation, we might not be so concerned about generalizability. Table 13 in Appendix D illustrates the gap between the across-study and mean within-study variance, coefficient of variation, and I 2 for those intervention-outcomes for which we have enough data to calculate the within-study measures. Not all studies report multiple results for the intervention-outcome combination in question. A paper might report multiple results for a particular intervention-outcome combination if, for example, it were reporting results for different subgroups, such as for different age groups, genders, or geographic areas. The median within-paper variance for those papers for which this can be generated is 0.027, while it is 0.037 across papers; similarly, the median within-paper coefficient of variation is 0.91, compared to 1.48 across papers. If we were to form the I 2 for each paper separately, the median within-paper value would be 0.64, as opposed to 1 across papers. Figure 6 presents the distributions graphically; to increase the sample size, this figure includes results even when there are only two papers within an intervention-outcome 20 Figure 6: Distribution of within and across-paper heterogeneity measures As there are a few outliers in the right tail, variances above 0.25 and coefficients of variation above 10 are dropped from these figures (dropping 5 observations). combination or two results reported within a paper. Finally, we can try to derive benchmarks more directly, based on the expected prediction error. Again, it is immediately apparent that what counts as large or small error depends on the policy question - the outside option θ˚ . In some cases, it might not matter if an effect size were mis-predicted by 25%. In others, a prediction error of this magnitude could mean the difference between choosing one program over another or whether a program is worthwhile to pursue at all. 
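As a concrete version of this benchmark, the sketch below computes the variance cut-off implied by requiring the prediction error to stay below a given percentage of the mean at least half the time, assuming results are drawn from a normal distribution centered on the mean observed effect size; this is my reading of the construction behind the var25 and var50 thresholds reported in Table 4 below, and the example mean effect size is illustrative.

```python
from scipy.stats import norm

def variance_cutoff(mean_effect, max_rel_error=0.25, coverage=0.5):
    """Largest var(Y_i) such that a N(mean_effect, var) draw lands within
    max_rel_error * |mean_effect| of the mean with probability >= coverage."""
    z = norm.ppf(0.5 + coverage / 2)          # about 0.674 for coverage = 0.5
    sd_max = max_rel_error * abs(mean_effect) / z
    return sd_max ** 2

# Illustrative: a mean effect size of 0.12 SD, roughly the average in the data.
print(variance_cutoff(0.12, 0.25))   # var25-style threshold
print(variance_cutoff(0.12, 0.50))   # var50-style threshold
```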
Still, if we take the mean effect size within an intervention-outcome to be our “best guess” of how a program will perform and, as an illustrative example, want the prediction error to be less than 25% at least 50% of the time, this would imply a certain cut-off threshold for the variance if we assume that results are normally distributed. Note that the assumptions that results are drawn from the same normal distribution and that the mean and variance of this distribution can be approximated by the mean and variance of observed results is a simplification for the purpose of a back-of-the-envelope calculation. Table 4 summarizes the implied bounds for var(Yi ) for the prediction error to be less than 25% and 50%, respectively, at least 50% of the time, alongside the actual variance in results within each intervention-outcome. In only 1 of 51 cases is the true variance in results smaller than the variance implied by the 25% prediction error cut-off threshold, and in 9 other cases it is below the 50% prediction error threshold. In other words, for more than 80% of intervention-outcomes, the implied prediction error is greater than 50% more than 50% of the time. Intervention Microfinance Table 4: Actual Variance vs. Variance for Prediction Error Thresholds Outcome Ȳi varpYi q var25 Assets 0.003 21 0.000 0.000 var50 0.000 Rural Electrification Micronutrients Microfinance Microfinance Financial Literacy Microfinance Contract Teachers Performance Pay Micronutrients Conditional Cash Transfers Micronutrients Micronutrients Micronutrients Micronutrients Conditional Cash Transfers Deworming Micronutrients Conditional Cash Transfers Unconditional Cash Transfers Water Treatment SMS Reminders Conditional Cash Transfers School Meals Micronutrients Micronutrients Micronutrients Bed Nets Conditional Cash Transfers Micronutrients HIV/AIDS Education Micronutrients Deworming Micronutrients Scholarships Conditional Cash Transfers Deworming Micronutrients School Meals Micronutrients Deworming Deworming Micronutrients Micronutrients Enrollment rate Cough prevalence Total income Savings Savings Profits Test scores Test scores Body mass index Unpaid labor Weight-for-age Weight-for-height Birthweight Height-for-age Test scores Hemoglobin Mid-upper arm circumference Enrollment rate Enrollment rate Diarrhea prevalence Treatment adherence Labor force participation Test scores Height Mortality rate Stunted Malaria Attendance rate Weight Used contraceptives Perinatal deaths Height Test scores Enrollment rate Height-for-age Weight-for-height Stillbirths Enrollment rate Prevalence of anemia Height-for-age Weight-for-age Diarrhea incidence Diarrhea prevalence 22 0.176 -0.016 0.029 0.027 -0.012 -0.013 0.182 0.131 0.125 0.103 0.050 0.045 0.102 0.044 0.062 0.036 0.058 0.150 0.115 0.145 0.088 0.092 0.117 0.035 -0.054 0.143 0.342 0.333 0.068 0.061 -0.093 0.094 0.134 0.336 -0.011 0.086 -0.090 0.250 0.389 0.159 0.143 0.100 0.277 0.001 0.001 0.001 0.002 0.004 0.005 0.005 0.006 0.007 0.009 0.009 0.010 0.010 0.012 0.013 0.015 0.015 0.015 0.016 0.020 0.022 0.023 0.023 0.023 0.025 0.025 0.029 0.030 0.034 0.036 0.038 0.049 0.052 0.053 0.055 0.072 0.075 0.081 0.095 0.098 0.107 0.109 0.111 0.005 0.000 0.000 0.000 0.000 0.000 0.005 0.003 0.002 0.002 0.000 0.000 0.002 0.000 0.001 0.000 0.001 0.003 0.002 0.003 0.001 0.001 0.002 0.000 0.000 0.003 0.018 0.017 0.001 0.001 0.001 0.001 0.003 0.017 0.000 0.001 0.001 0.009 0.023 0.004 0.003 0.002 0.012 0.027 0.000 0.001 0.001 0.000 0.000 0.029 0.015 0.014 0.009 0.002 0.002 0.009 0.002 0.003 0.001 
0.003 0.019 0.011 0.018 0.007 0.007 0.012 0.001 0.003 0.018 0.101 0.096 0.004 0.003 0.008 0.008 0.016 0.098 0.000 0.006 0.007 0.054 0.131 0.022 0.018 0.009 0.066 Micronutrients Deworming Micronutrients SMS Reminders Deworming Conditional Cash Transfers Rural Electrification Fever prevalence Weight Hemoglobin Appointment attendance rate Mid-upper arm circumference Probability unpaid work Study time 0.124 0.090 0.322 0.163 0.373 -0.122 0.906 0.146 0.184 0.215 0.224 0.439 0.609 0.997 0.002 0.001 0.016 0.004 0.021 0.002 0.125 0.013 0.007 0.090 0.023 0.121 0.013 0.710 var25 represents the variance that would result in a 25% prediction error for draws from a normal distribution centered at Ȳi . var50 represents the variance that would result in a 50% prediction error. 5.1.1 Predicting External Validity from a Single Paper It would be very helpful if we could estimate the across-paper within-interventionoutcome metrics using the results from individual papers. Many papers report results for different subgroups or over time, and the variation in results for a particular interventionoutcome within a single paper could be a plausible proxy of variation in results for that same intervention-outcome across papers. If this relationship holds, it would help researchers estimate the external validity of their own study, even when no other studies on the intervention have been completed. Table 5 shows the results of regressing the across-paper measures of var(Yi ) and CV(Yi ) on the average within-paper measures for the same intervention-outcome combination. It appears that within-paper variation in results is indeed significantly correlated with across-paper variation in results. Authors could undoubtedly obtain even better estimates using micro data. 5.1.2 Robustness Checks One may be concerned that low-quality papers are either inflating or depressing the degree of generalizability that is observed. There are many ways to measure paper “quality”; I consider two. First, I use the most widely-used quality assessment measure, the Jadad scale (Jadad et al., 1996). The Jadad scale asks whether the study was randomized, doubleblind, and whether there was a description of withdrawals and dropouts. A paper gets one point for having each of these characteristics; in addition, a point is added if the method of randomization was appropriate, subtracted if the method is inappropriate, and similarly added if the blinding method was appropriate and subtracted if inappropriate. This results in a 0-5 point scale. Given that the kinds of interventions being tested are not typically readily suited to blinding, I consider all those papers scoring at least a 3 to be “high quality”. In an alternative specification, I also consider only those results from studies that were 23 Table 5: Regression of Mean Within-Paper Heterogeneity on Across-Paper Heterogeneity (1) Across-paper variance b/se Mean within-paper variance (2) Across-paper CV b/se 0.343** (0.13) Mean within-paper CV -0.000 (0.00) Mean within-paper I 2 Constant Observations R2 (3) Across-paper I 2 b/se 0.101* (0.06) 0.794 (0.67) 0.498*** (0.12) 0.507*** (0.10) 51 0.04 48 0.00 51 0.26 The mean of each within-paper measure is created by calculating the measure within a paper, for each paper reporting two or more results on the same intervention-outcome combination, and then averaging that measure across papers within the intervention-outcome. RCTs. This is for two reasons. First, many would consider RCTs to be higher-quality studies. 
We might also be concerned about how specification searching and publication bias could affect results. In a separate paper (Vivalt, 2015a), I discuss these issues at length and find relatively little evidence of these biases in the data, with RCTs exhibiting even fewer signs of specification searching and publication bias. The results based on only those studies which were RCTs thus provide a good robustness check. Tables 14 and 15 in the Appendix provide robustness checks using the data that meet these two quality criteria. Table 16 also includes the one observation previously dropped for having an effect size more than 2 SD away from 0. The heterogeneity measures are not substantially different using these data sets. 5.2 5.2.1 With Modelling Heterogeneity Modelling Heterogeneity Across Intervention-Outcomes If the heterogeneity in outcomes that has been observed can be systematically modelled, it would improve our ability to make predictions. Do results exhibit any variation that is systematic? To begin, I first present some OLS results, looking across different interventionoutcome combinations, to examine whether effect sizes are associated with any characteristics of the program, study, or sample, pooling data from different intervention-outcomes. 24 As Table 6 indicates, there is some evidence that studies with a smaller number of observations have greater effect sizes than studies based on a larger number of observations. This is what we would expect if specification searching were easier in small datasets; this pattern of results would also be what we would expect if power calculations drove researchers to only proceed with studies with small sample sizes if they believed the program would result in a large effect size or if larger studies are less well-targeted. Interestingly, governmentimplemented programs fare worse even controlling for sample size (the dummy variable category left out is “Other-implemented”, which mainly consists of collaborations and private sector-implemented interventions). Studies in the Middle East / North Africa region may appear to do slightly better than those in Sub-Saharan Africa (the excluded region category), but not much weight should be put on this as very few studies were conducted in the former region. While these regressions have the advantage of allowing me to draw on a larger sample of studies and we might think that any patterns observed across so many interventions and outcomes are fairly robust, we might be able to explain more variation if we restrict attention to a particular intervention-outcome combination. I therefore focus on the case of conditional cash transfers (CCTs) and enrollment rates, as this is the intervention-outcome combination that contains the largest number of papers. 5.2.2 Within an Intervention-Outcome: The Case of CCTs and Enrollment Rates The previous results used the across-intervention-outcome data, which were aggregated to one result per intervention-outcome-paper. However, we might think that more variation could be explained by carefully modelling results within a particular intervention-outcome combination. This section provides an example, using the case of conditional cash transfers and enrollment rates, the intervention-outcome combination covered by the most papers. Suppose we were to try to explain as much variability in outcomes as possible, using sample characteristics. 
The available variables which might plausibly have a relationship to effect size are: the baseline enrollment rates6 ; the sample size; whether the study was done in a rural or urban setting, or both; results for other programs in the same region7 ; and the age and gender of the sample under consideration. 6 In some cases, only endline enrollment rates are reported. This variable is therefore constructed by using baseline rates for both the treatment and control group where they are available, followed by, in turn, the baseline rate for the control group; the baseline rate for the treatment group; the endline rate for the control group; the endline rate for the treatment and control group; and the endline rate for the treatment group 7 Regions include: Latin America, Africa, the Middle East and North Africa, East Asia, and South Asia, following the World Bank’s geographical divisions. 25 Table 6: Regression of Effect Size on Study Characteristics (1) Effect size Number of observations (100,000s) Government-implemented (2) Effect size (3) Effect size -0.011** (0.00) RCT -0.012*** (0.00) -0.009* (0.00) -0.087** (0.04) -0.057 (0.05) 0.038 (0.03) East Asia 0.120*** (0.00) 0.180*** (0.03) 0.091*** (0.02) -0.003 (0.03) 0.012 (0.04) 0.275** (0.11) 0.021 (0.04) 0.105*** (0.02) 556 0.20 656 0.23 656 0.22 556 0.23 Latin America Middle East/North Africa South Asia Observations R2 (5) Effect size -0.107*** (0.04) -0.055 (0.04) Academic/NGO-implemented Constant (4) Effect size 0.177*** (0.03) 556 0.20 Standard errors are clustered by intervention-outcome. Different columns contain different numbers of observations because not all studies reported the number of observations on which their estimate was based. 26 Table 7 shows the results of OLS regressions of the effect size on these variables, in turn. The baseline enrollment rates show the strongest relationship to effect size, as reflected in the R2 and significance levels: it is easier to have large gains where initial rates are low. Some papers pay particular attention to those children that were not enrolled at baseline or that were enrolled at baseline. These are coded as a “0%” or “100%” enrollment rate at baseline but are also represented by two dummy variables (Column 2). Larger studies and studies done in urban areas also tend to find smaller effect sizes than smaller studies or studies done in rural or mixed urban/rural areas. Finally, for each result I calculate the mean result in the same region, excluding results from the program in question. Results do appear slightly correlated across different programs in the same region. 27 Table 7: Regression of Projects’ Effect Sizes on Characteristics (CCTs on Enrollment Rates) Enrollment Rates (1) ES (2) ES -0.224*** (0.05) -0.092 (0.06) -0.002 (0.02) 0.183*** (0.05) Enrolled at Baseline Not Enrolled at Baseline Number of Observations (100,000s) Rural (3) ES (4) ES (5) ES (6) ES (7) ES (8) ES (10) ES -0.127*** (0.02) 0.142*** (0.03) -0.002 (0.00) 0.002 (0.02) -0.039** (0.02) -0.011* (0.01) 0.049** (0.02) Urban 28 -0.068*** (0.02) Girls -0.002 (0.03) Boys -0.019 (0.02) Minimum Sample Age 0.005 (0.01) Mean Regional Result Observations R2 (9) ES 112 0.41 112 0.52 108 0.01 130 0.06 130 0.05 130 0.00 130 0.01 104 0.02 1.000** (0.38) 0.714** (0.28) 130 0.01 92 0.58 Each column reports the results of regressing the effect size (ES) on different explanatory variables. 
Multiple results for different subgroups may be reported for the same paper; the data on which this table is based include multiple results from the same paper for different subgroups that are non-overlapping (e.g. boys and girls, groups with different age ranges, or different geographical areas). Standard errors are clustered by paper. Not every paper reports every explanatory variable, so different columns are based on different numbers of observations.

As baseline enrollment rates have the strongest relationship to effect size, I use this as an explanatory variable in a hierarchical mixed model (the specification of Column 1), to explore how it affects the residual var$_R$($Y_i - \hat{Y}_i$), CV$_R$($Y_i - \hat{Y}_i$) and $I_R^2$. I also use the specification in Column 10 of Table 7 as a robustness check. The results are reported in Table 8 for each of these two mixed models, alongside the values from the random-effects model that does not use any explanatory variables. Not all papers provide information for each explanatory variable, and each row is based on only those studies which could be used to estimate the model. Thus, the values of var($Y_i$), CV($Y_i$) and $I^2$, which do not depend on the model used, may still vary between rows.

Table 8: Impact of Mixed Models on Measures

Model                     var(Yi)   varR(Yi - Ŷi)   CV(Yi)   CVR(Yi - Ŷi)    I²     I²R     N
Random-effects model       0.011        0.011        1.24        1.24        0.97   0.97   122
Mixed model 1              0.011        0.007        1.28        1.04        0.97   0.96   104
Mixed model 2              0.012        0.005        1.25        0.85        0.96   0.93    87

In the random-effects model, since no explanatory variables are used, $\hat{Y}_i$ is only the mean, and var$_R$($Y_i - \hat{Y}_i$), CV$_R$($Y_i - \hat{Y}_i$) and $I_R^2$ do not offer improvements on var($Y_i$), CV($Y_i$) and $I^2$. As more explanatory variables are added, the gap between var($Y_i$) and var$_R$($Y_i - \hat{Y}_i$), between CV($Y_i$) and CV$_R$($Y_i - \hat{Y}_i$), and between $I^2$ and $I_R^2$ grows. In all cases, including explanatory variables can help reduce the unexplained variation, to varying degrees. var$_R$($Y_i - \hat{Y}_i$) and CV$_R$($Y_i - \hat{Y}_i$) are greatly reduced from var($Y_i$) and CV($Y_i$), but $I_R^2$ is not much lower than $I^2$. This is likely due to a feature of $I^2$ ($I_R^2$) previously discussed: that it depends on the precision of estimates. With evaluations of CCT programs tending to have large sample sizes, the value of $I^2$ ($I_R^2$) is higher than it otherwise would be.

5.2.3 Removing Sampling Variance

Finally, it may be worthwhile to point out that sampling variance can artificially inflate the variance of studies' results. By how much? I examine this question using the case of deworming programs. Here, we know that many factors may affect the results a study obtains. For example, whether the study was randomized by cluster or by individual would have ramifications for the overall burden of disease and likelihood of infection in a particular area (Miguel and Kremer, 2004). Apart from this, initial burdens of disease may vary from place to place, and the exact dosage schedule also differed between studies. Despite these and other differences, it is remarkable how much of the variation in treatment effects can be attributed to sampling variance. Table 9 presents the variance of $Y_i$ and the portion that is not attributed to sampling variance, $\tau^2$. It then shows the difference between the two and the proportion of var($Y_i$) that can be attributed to sampling variance - 46%, on average. It then presents similar statistics for the coefficient of variation, generating a measure, CV$_\tau$, which is the coefficient of variation if $\tau$ were used in place of the standard deviation of $Y_i$.
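A minimal sketch of this adjustment: split var($Y_i$) into the part attributable to sampling variance and the remainder $\tau^2$, and form CV$_\tau$ using $\tau$ in place of the standard deviation of $Y_i$. A method-of-moments estimate of $\tau^2$ is used here as a simple stand-in for the paper's Bayesian estimate, and the effect sizes and standard errors are hypothetical.

```python
import numpy as np

def remove_sampling_variance(y, se):
    """Split var(Y_i) into the part attributable to sampling variance and the
    remainder tau^2, and report CV_tau = tau / |mean(Y_i)| alongside CV(Y_i)."""
    y, se = np.asarray(y, float), np.asarray(se, float)
    w = 1.0 / se**2
    mu_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - mu_fe) ** 2)
    c = np.sum(w) - np.sum(w**2) / np.sum(w)
    tau2 = max(0.0, float((q - (len(y) - 1)) / c))       # DerSimonian-Laird estimate
    var_y = y.var(ddof=1)
    return {"var(Yi)": var_y,
            "tau2": tau2,
            "share from sampling variance": max(0.0, 1 - tau2 / var_y),
            "CV(Yi)": abs(y.std(ddof=1) / y.mean()),
            "CV_tau": abs(np.sqrt(tau2) / y.mean())}

# Illustrative: hypothetical deworming-style estimates with large standard errors,
# so that much of the spread in Y_i reflects sampling noise.
y = [0.40, 0.05, 0.22, -0.10, 0.31]
se = [0.15, 0.12, 0.20, 0.18, 0.14]
print(remove_sampling_variance(y, se))
```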
5.2.3 Removing Sampling Variance

Finally, it is worth pointing out that sampling variance can artificially inflate the variance of studies' results. By how much? I examine this question using the case of deworming programs. Here, we know that many factors may affect the results a study obtains. For example, whether the study was randomized by cluster or by individual has ramifications for the overall burden of disease and likelihood of infection in a particular area (Miguel and Kremer, 2004). Apart from this, initial burdens of disease may vary from place to place, and the exact dosage schedule also differed between studies. Despite these and other differences, it is remarkable how much of the variation in treatment effects can be attributed to sampling variance. Table 9 presents the variance of Y_i and the portion that is not attributable to sampling variance, tau^2. It then shows the difference between the two and the proportion of var(Y_i) that can be attributed to sampling variance: 46%, on average. It then presents similar statistics for the coefficient of variation, generating a measure, CV_tau, which is the coefficient of variation if tau were used in place of the standard deviation of Y_i. An average of 29% of CV(Y_i) can be explained simply by sampling variance. Clearly, accounting for sampling variance can reduce the observed heterogeneity measures. Further, it does not even require the potentially time-consuming and expensive collection of data on additional explanatory variables.

Table 9: Deworming Case Study: The Difference Made by Adjusting for Sampling Error

Outcome                        var(Y_i)   tau^2   Difference    Percent difference   CV(Y_i)   CV_tau   Difference   Percent difference
                                                  in variance   in variance                             in CV        in CV
Height                         0.049      0.044   0.005         0.10                 2.36      2.24     0.12         0.05
Height-for-age                 0.098      0.073   0.026         0.26                 1.98      1.70     0.28         0.14
Hemoglobin                     0.015      0.003   0.012         0.83                 3.38      1.40     1.98         0.59
Mid-upper arm circumference    0.439      0.083   0.356         0.81                 1.77      0.77     1.00         0.57
Weight                         0.184      0.082   0.102         0.56                 4.76      3.17     1.59         0.33
Weight-for-age                 0.107      0.072   0.036         0.33                 2.29      1.87     0.42         0.18
Weight-for-height              0.072      0.047   0.026         0.35                 3.13      2.52     0.61         0.20

6 Discussion

Why should we care about heterogeneity in treatment effects? Impact evaluations are used to inform policy decisions, and we need to know how best to extrapolate from them in order to make the most effective policy choices. There has been great controversy over external validity, and this paper brings data from many different types of interventions to bear on the question. We saw that whether we looked at benchmarks from other literatures (CVs typically falling between 0.05 and 0.5; I^2 being considered "high" at 0.75) or whether we constructed additional benchmarks based on the amount of variation that would result in substantial prediction errors, the degree of heterogeneity between different studies' results was quite high.

Another statistic that supports this conclusion is the absolute value of the prediction error, |Y_i - Y^_i|, at time t+1, using all the data up to time t to predict Y_i. If we use the mean Y_i up to time t in each intervention-outcome as the simplest, naive predictor, the average absolute value of the error is 0.18, compared to an average effect size of 0.12. The median absolute value of the error is 0.10, or, in percent terms, the median amount by which the prediction is off is 97%. It would seem difficult to plan policy with the typical error approaching 100%.[8]

[8] Imagine: "This policy will increase A by B. Or maybe it will do nothing at all. It's a toss-up."
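The naive-predictor exercise is straightforward to reproduce on any collection of effect sizes. The sketch below is a minimal illustration, assuming a data set with one row per result and columns for the intervention, outcome, year and standardized effect size; the column names and numbers are hypothetical, not the AidGrade data, and the percent-error convention (relative to the realized effect) is only one possible choice.

```python
import numpy as np
import pandas as pd

def naive_prediction_errors(df):
    """For each intervention-outcome cell, predict each study's effect size
    with the mean of all chronologically earlier studies in the same cell,
    then record the absolute and percent error of that naive forecast.
    Expects columns: 'intervention', 'outcome', 'year', 'effect_size'
    (column names are illustrative only)."""
    rows = []
    for _, cell in df.groupby(['intervention', 'outcome']):
        effects = cell.sort_values('year')['effect_size'].to_numpy()
        for t in range(1, len(effects)):
            prediction = effects[:t].mean()       # naive predictor: mean of past results
            abs_err = abs(effects[t] - prediction)
            pct_err = abs_err / abs(effects[t]) if effects[t] != 0 else np.nan
            rows.append({'abs_error': abs_err, 'pct_error': pct_err})
    return pd.DataFrame(rows)

# Toy illustration (fabricated numbers, for exposition only):
df = pd.DataFrame({
    'intervention': ['CCT'] * 4 + ['Deworming'] * 3,
    'outcome':      ['enrollment'] * 4 + ['weight'] * 3,
    'year':         [2005, 2007, 2009, 2011, 2004, 2006, 2010],
    'effect_size':  [0.25, 0.10, 0.18, 0.05, 0.30, 0.02, 0.12],
})
errors = naive_prediction_errors(df)
print(errors['abs_error'].mean(), errors['pct_error'].median())
```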
Whether this is "enough" evidence on which to base a policy decision of whether or not to adopt an intervention in a new setting depends on the alternative policies that could be adopted and the amount of information that could be gained by an additional impact evaluation.

This paper focused both on the marginal improvements in the predicted expected value of a program and on the range of results a program is likely to have. The benefits of improving estimates of theta_i are obvious, but the distribution of results might also be critically important. First, different programs exhibit different levels of heterogeneity, and policymakers may be more risk-averse or risk-loving. There are several reasons one might be risk-averse. Practically speaking, it is possible that if one does not obtain a good result, one will not get another shot. If a program fails, the people who were targeted may not be interested in taking up a second attempt. If it is a government program, the policymaker could also be voted out of office. Policymakers might also be risk-averse if they would like to maximize beneficiaries' utility and think that the beneficiaries themselves would be risk-averse. On the other hand, if a policymaker believed that beneficiaries were stuck in a poverty trap and needed a large enough push to break out of the trap, they might instead be risk-loving.

Another reason to care about the full range of a study's possible results is more insidious: we might have a bias towards remembering the larger results. Those studies with larger effect sizes are better-cited in the data, and if policymakers are optimistic they may also put more weight on those studies that find the largest effects. This could bias policy towards those interventions that are more heterogeneous rather than more efficacious.[9]

[9] If we are indeed more likely to remember larger effect sizes, this could also help to explain the very large gap between the sample sizes of the studies in the data set and the sample size they would have required in order to have power = 0.8 at the beginning of the study had the authors' priors been equal to the effect sizes actually found (Coville and Vivalt, 2015). Either researchers are conducting studies that they know are underpowered a priori, or else effect sizes are lower than they expect (or some combination of the two).

Finally, the large errors call into question our standard approach to impact evaluation. It might still be worthwhile to conduct impact evaluations with a large prediction error, but several factors should be considered. First, the costs of the evaluation should be weighed against the benefits. Some evaluations are relatively low-cost; others are not. The average World Bank impact evaluation costs $500,000 (IEG, 2012). In our toy example with theta* = 0.1, an additional impact evaluation was pivotal 11% of the time, leading to improvements of about 0.1-5.7% in those cases, depending on how many impact evaluations had previously been done. A policymaker would charitably have to be willing to spend $500,000 for an increase in the effect size of approximately half a percentage point in this scenario. Spillover effects to other policymakers could also be considered in this calculation.

7 Conclusion

How much impact evaluation results generalize to other settings is an important topic, and data from meta-analyses are the ideal data with which to answer this question. With data on 20 different types of interventions, all collected in the same way, we can begin to speak a bit more generally about how results tend to vary across contexts and what that implies for impact evaluation design and policy recommendations.

Smaller studies tended to have larger effect sizes, which we might expect if the smaller studies are better-targeted, are selected for evaluation when there is a higher a priori expectation that they will have a large effect size, or if there is a preference to report larger effect sizes, which smaller studies would obtain more often by chance. Government-implemented programs also had smaller effect sizes than academic/NGO-implemented programs, even after controlling for sample size. This is unfortunate given that we often do smaller impact evaluations with NGOs in the hopes of finding a strong positive effect that can scale through government implementation.

I also compared within-paper heterogeneity in treatment effects to across-paper heterogeneity in treatment effects. Within-paper heterogeneity is present in my data because papers often report multiple results for the same outcomes, such as for different subgroups. Fortunately, I find that even these crude measures of within-paper heterogeneity predict across-paper heterogeneity for the relevant intervention-outcome. This implies that researchers can get a quick estimate of how well their results would apply to other settings simply by using their own data. With access to micro data, authors could do much richer analysis.
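The comparison behind this claim (Table 13 in Appendix D) is simple enough to sketch. The code below computes a mean within-paper variance and coefficient of variation from each paper's non-overlapping subgroup estimates and contrasts them with the across-paper values, using one aggregated result per paper. It is a deliberately crude illustration with hypothetical column names and numbers, not the paper's exact estimator, which also uses I^2 and the aggregation rules described with Table 13.

```python
import numpy as np
import pandas as pd

def within_vs_across_paper(df):
    """For one intervention-outcome cell, compare heterogeneity within papers
    (across their non-overlapping subgroup results) to heterogeneity across
    papers (one aggregated result per paper). Column names are illustrative."""
    per_paper = df.groupby('paper')['effect_size'].agg(['mean', 'var'])
    within_var = per_paper['var'].mean()             # mean within-paper variance
    across = per_paper['mean']                       # one result per paper
    return {'within_var': within_var,
            'across_var': across.var(),
            'within_CV': np.sqrt(within_var) / abs(df['effect_size'].mean()),
            'across_CV': across.std() / abs(across.mean())}

# Toy data: three papers, each reporting several subgroup results (made-up numbers)
df = pd.DataFrame({
    'paper':       ['A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'effect_size': [0.10, 0.16, 0.02, 0.08, 0.05, 0.25, 0.31],
})
print(within_vs_across_paper(df))
```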
Since we may be concerned with the residual amount of heterogeneity after taking the best-fitting model into consideration, I discussed the case of the effect of CCTs on enrollment rates. The generalizability measures improve when explanatory variables are added through a mixed model.

I consider several ways to evaluate the magnitude of the variation in results. Whether results are too heterogeneous ultimately depends on the purpose for which they are being used; some policy decisions might have greater room for error than others. However, it is safe to say, looking at both the coefficient of variation and the I^2, which have commonly accepted benchmarks in other disciplines, that these impact evaluations exhibit more heterogeneity than is typical in other fields, such as medicine, even after accounting for explanatory variables in the case of conditional cash transfers. Further, I find that under mild assumptions, the typical variance of results is such that a particular program would be mis-predicted by more than 50% over 80% of the time.

There are some steps that researchers can take that may improve the generalizability of their own studies. First, just as with heterogeneous selection into treatment (Chassang, Padro i Miquel and Snowberg, 2012), one solution would be to ensure one's impact evaluation varied some of the contextual variables that we might think underlie the heterogeneous treatment effects. Given that many studies are underpowered as it is, that may not be likely; however, large organizations and governments have been supporting more impact evaluations, providing more opportunities to explicitly integrate these analyses. Efforts to coordinate across different studies, asking the same questions or looking at some of the same outcome variables, would also help. The framing of heterogeneous treatment effects could also provide positive motivation for replication projects in different contexts: different findings would not necessarily negate the earlier ones but would add another level of information.

In summary, generalizability is not binary but something that we can measure. This paper showed that past results have significant but limited ability to predict other results on the same topic, and this did not appear to be due to bias. Knowing how much results tend to extrapolate, and when, is critical if we are to know how to interpret an impact evaluation's results or apply its findings. Given that other fields with less heterogeneity also seem to have a more well-developed practice of replication and meta-analysis, it would seem that economics has a lot to gain by expanding in this direction.

References

AidGrade (2013). "AidGrade Process Description", http://www.aidgrade.org/methodology/processmap-and-methodology, March 9, 2013.
AidGrade (2015). "AidGrade Impact Evaluation Data, Version 1.2".
Alesina, Alberto and David Dollar (2000). "Who Gives Foreign Aid to Whom and Why?", Journal of Economic Growth, vol. 5 (1).
Allcott, Hunt (forthcoming). "Site Selection Bias in Program Evaluation", Quarterly Journal of Economics.
Bastardi, Anthony, Eric Luis Uhlmann and Lee Ross (2011). "Wishful Thinking: Belief, Desire, and the Motivated Evaluation of Scientific Evidence", Psychological Science.
Becker, Betsy Jane and Meng-Jia Wu (2007). "The Synthesis of Regression Slopes in Meta-Analysis", Statistical Science, vol. 22 (3).
Bold, Tessa et al. (2013). "Scaling-up What Works: Experimental Evidence on External Validity in Kenyan Education", working paper.
Borenstein, Michael et al. (2009). Introduction to Meta-Analysis. Wiley Publishers.
Boriah, Shyam et al. (2008). "Similarity Measures for Categorical Data: A Comparative Evaluation", in Proceedings of the Eighth SIAM International Conference on Data Mining.
Brodeur, Abel et al. (2012). "Star Wars: The Empirics Strike Back", working paper.
Cartwright, Nancy (2007). Hunting Causes and Using Them: Approaches in Philosophy and Economics. Cambridge: Cambridge University Press.
Cartwright, Nancy (2010). "What Are Randomized Controlled Trials Good For?", Philosophical Studies, vol. 147 (1): 59-70.
Casey, Katherine, Rachel Glennerster, and Edward Miguel (2012). "Reshaping Institutions: Evidence on Aid Impacts Using a Preanalysis Plan." Quarterly Journal of Economics, vol. 127 (4): 1755-1812.
Chassang, Sylvain, Gerard Padro i Miquel, and Erik Snowberg (2012). "Selective Trials: A Principal-Agent Approach to Randomized Controlled Experiments." American Economic Review, vol. 102 (4): 1279-1309.
Cohen, Jacob (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Coville, Aidan and Eva Vivalt (2015). "When Should You Change Your Priors When a New Study Comes Out?", working paper.
Deaton, Angus (2010). "Instruments, Randomization, and Learning about Development." Journal of Economic Literature, vol. 48 (2): 424-55.
Dehejia, Rajeev, Cristian Pop-Eleches and Cyrus Samii (2015). "From Local to Global: External Validity in a Fertility Natural Experiment", working paper.
Duflo, Esther, Pascaline Dupas and Michael Kremer (2012). "School Governance, Teacher Incentives and Pupil-Teacher Ratios: Experimental Evidence from Kenyan Primary Schools", NBER Working Paper.
Evans, David and Anna Popova (2014). "Cost-effectiveness Measurement in Development: Accounting for Local Costs and Noisy Impacts", World Bank Policy Research Working Paper, No. 7027.
Ferguson, Christopher and Michael Brannick (2012). "Publication Bias in Psychological Science: Prevalence, Methods for Identifying and Controlling, and Implications for the Use of Meta-analyses." Psychological Methods, vol. 17 (1): 120-128.
Franco, Annie, Neil Malhotra and Gabor Simonovits (2014). "Publication Bias in the Social Sciences: Unlocking the File Drawer", working paper.
Gerber, Alan and Neil Malhotra (2008a). "Do Statistical Reporting Standards Affect What Is Published? Publication Bias in Two Leading Political Science Journals", Quarterly Journal of Political Science, vol. 3.
Gerber, Alan and Neil Malhotra (2008b). "Publication Bias in Empirical Sociological Research: Do Arbitrary Significance Levels Distort Published Results?", Sociological Methods & Research, vol. 37 (3).
Gelman, Andrew et al. (2013). Bayesian Data Analysis, Third Edition, Chapman and Hall/CRC.
Hedges, Larry and Therese Pigott (2004). "The Power of Statistical Tests for Moderators in Meta-Analysis", Psychological Methods, vol. 9 (4).
Higgins, Julian PT and Sally Green (eds.) (2011). Cochrane Handbook for Systematic Reviews of Interventions, Version 5.1.0 [updated March 2011]. The Cochrane Collaboration. Available from www.cochrane-handbook.org.
Higgins, Julian PT et al. (2003). "Measuring Inconsistency in Meta-analyses", BMJ, vol. 327: 557-560.
Higgins, Julian PT and Simon Thompson (2002). "Quantifying Heterogeneity in a Meta-analysis", Statistics in Medicine, vol. 21: 1539-1558.
Hsiang, Solomon, Marshall Burke and Edward Miguel (2013). "Quantifying the Influence of Climate on Human Conflict", Science, vol. 341.
Independent Evaluation Group (2012). "World Bank Group Impact Evaluations: Relevance and Effectiveness", World Bank Group.
Innovations for Poverty Action (2015). "IPA Launches the Goldilocks Project: Helping Organizations Build Right-Fit M&E Systems", http://www.poverty-action.org/goldilocks.
Jadad, A.R. et al. (1996). "Assessing the Quality of Reports of Randomized Clinical Trials: Is Blinding Necessary?", Controlled Clinical Trials, 17 (1).
Millennium Challenge Corporation (2009). "Key Elements of Evaluation at MCC", presentation, June 9, 2009.
Miguel, Edward and Michael Kremer (2004). "Worms: Identifying Impacts on Education and Health in the Presence of Treatment Externalities", Econometrica, vol. 72 (1).
Ng, CK (2014). "Inference on the Common Coefficient of Variation When Populations Are Lognormal: A Simulation-Based Approach", Journal of Statistics: Advances in Theory and Applications, vol. 11 (2).
Page, Matthew, Joanne McKenzie and Andrew Forbes (2013). "Many Scenarios Exist for Selective Inclusion and Reporting of Results in Randomized Trials and Systematic Reviews", Journal of Clinical Epidemiology, vol. 66 (5).
Pritchett, Lant and Justin Sandefur (2013). "Context Matters for Size: Why External Validity Claims and Development Practice Don't Mix", Center for Global Development Working Paper 336.
Pritchett, Lant, Salimah Sanji and Jeffrey Hammer (2013). "It's All About MeE: Using Structured Experiential Learning ("e") to Crawl the Design Space", Center for Global Development Working Paper 233.
Rodrik, Dani (2009). "The New Development Economics: We Shall Experiment, but How Shall We Learn?", in What Works in Development? Thinking Big and Thinking Small, ed. Jessica Cohen and William Easterly, 24-47. Washington, D.C.: Brookings Institution Press.
Saavedra, Juan and Sandra Garcia (2013). "Educational Impacts and Cost-Effectiveness of Conditional Cash Transfer Programs in Developing Countries: A Meta-Analysis", CESR Working Paper.
Shadish, William, Thomas Cook and Donald Campbell (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston: Houghton Mifflin.
Simmons, Joseph and Uri Simonsohn (2011). "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant", Psychological Science, vol. 22.
Simonsohn, Uri et al. (2014). "P-Curve: A Key to the File Drawer", Journal of Experimental Psychology: General.
Tian, Lili (2005). "Inferences on the Common Coefficient of Variation", Statistics in Medicine, vol. 24: 2213-2220.
Tibshirani, Ryan and Robert Tibshirani (2009). "A Bias Correction for the Minimum Error Rate in Cross-Validation", Annals of Applied Statistics, vol. 3 (2).
Tierney, Michael J. et al. (2011). "More Dollars than Sense: Refining Our Knowledge of Development Finance Using AidData", World Development, vol. 39.
Tipton, Elizabeth (2013). "Improving Generalizations from Experiments Using Propensity Score Subclassification: Assumptions, Properties, and Contexts", Journal of Educational and Behavioral Statistics, vol. 38: 239-266.
RePEc (2013). "RePEc h-index for journals", http://ideas.repec.org/top/top.journals.hindex.html.
Vivalt, Eva (2015a). "The Trajectory of Specification Searching Across Disciplines and Methods", working paper.
Vivalt, Eva (2015b). "How Concerned Should We Be About Selection Bias, Hawthorne Effects and Retrospective Evaluations?", working paper.
Walsh, Michael et al. (2013). "The Statistical Significance of Randomized Controlled Trial Results is Frequently Fragile: A Case for a Fragility Index", Journal of Clinical Epidemiology.
USAID (2011). "Evaluation: Learning from Experience", USAID Evaluation Policy, Washington, DC.

For Online Publication

Appendices

A Guide to Appendices

A.1 Appendices in this Paper

B) Excerpt from AidGrade's Process Description (2013).
C) Discussion of heterogeneity measures.
D) Additional results.
E) Derivation of mixed model estimation strategy.

A.2 Further Online Appendices

Because this paper describes data from twenty different meta-analyses and systematic reviews, I must rely in part on online appendices. The following are available at http://www.evavivalt.com/research:

F) The search terms and inclusion criteria for each topic.
G) The references for each topic.
H) The coding manual.

B Data Collection

B.1 Description of AidGrade's Methodology

The following details of AidGrade's data collection process draw heavily from AidGrade's Process Description (AidGrade, 2013).

Figure 7: Process Description

Stage 1: Topic Identification

AidGrade staff members were each asked to independently make a list of at least thirty international development programs that they considered to be the most interesting. The independent lists were appended into one document, and duplicates were tagged and removed. Each of the remaining topics was discussed and refined to bring them all to a clear and narrow level of focus. Pilot searches were conducted to get a sense of how many impact evaluations there might be on each topic, and all the interventions for which the very basic pilot searches identified at least two impact evaluations were shortlisted. A random subset of the topics was selected, with the most popular topic from a public vote also included.

Stage 2: Search

Each search engine has its own peculiarities. In order to ensure that all relevant papers and few irrelevant papers were included, a set of simple searches was conducted on different potential search engines. First, initial searches were run on AgEcon; British Library for Development Studies (BLDS); EBSCO; Econlit; Econpapers; Google Scholar; IDEAS; JOLISPlus; JSTOR; Oxford Scholarship Online; Proquest; PubMed; ScienceDirect; SciVerse; SpringerLink; Social Science Research Network (SSRN); Wiley Online Library; and the World Bank eLibrary. The list of potential search engines was compiled broadly from those listed in other systematic reviews. The purpose of these initial searches was to obtain information about the scope and usability of the search engines in order to determine which ones would be effective tools for identifying impact evaluations on different topics. External reviews of different search engines were also consulted, such as the Falagas et al. (2008) study covering the advantages and differences between the Google Scholar, Scopus, Web of Science and PubMed search engines. Second, searches were conducted for impact evaluations of two test topics: deworming and toilets.
EBSCO, IDEAS, Google Scholar, JOLISPlus, JSTOR, Proquest, PubMed, ScienceDirect, SciVerse, SpringerLink, Wiley Online Library and the World Bank eLibrary were used for these searches. Nine search strings were tried for deworming and up to 33 strings for toilets, with modifications as needed for each search engine. For each search, the number of results was recorded, along with how many of the first 10-50 results appeared to be impact evaluations of the topic in question. This gave a better sense of which search engines and which kinds of search strings would return both comprehensive and relevant results. A qualitative assessment of the search results was also provided for the Google Scholar and SciVerse searches. Finally, the online databases of J-PAL, IPA, CEGA and 3ie were searched. Since these databases are already narrowly focused on impact evaluations, attention was restricted to simple keyword searches, checking whether the search engines integrated with each database seemed to pull up relevant results for each topic. Ultimately, Google Scholar and the online databases of J-PAL, IPA, CEGA and 3ie, along with EBSCO/PubMed for health-related interventions, were selected for use in the full searches.

After the interventions of interest were identified, search strings were developed and tested using each search source. Each search string included methodology-specific stock keywords that narrowed the search to impact evaluation studies, except for the search strings for the J-PAL, IPA, CEGA and 3ie searches, as these databases already focus exclusively on impact evaluations. Experimentation with keyword combinations in stages 1.4 and 2.1 was helpful in the development of the search strings. The search strings could take slightly different forms for different search engines. Search terms were tailored to the search source, and a full list is included in an appendix. C# was used to write a script to scrape the results from the search engines. The script was programmed to ensure that the Boolean logic of the search string was properly applied within the constraints of each search engine's capabilities. Some sources were specialized and could have useful papers that do not turn up in simple searches. The papers listed on the J-PAL, IPA, CEGA and 3ie websites are a good example of this. For these sites, it made more sense for the papers to be manually searched and added to the relevant spreadsheets. After the automated and manual searches were complete, duplicates were removed by matching on author and title names. During the title screening stage, the consolidated list of citations yielded by the scraped searches was checked for any existing meta-analyses or systematic reviews. Any papers that these papers included were added to the list. With these references added, duplicates were again flagged and removed.

Stage 3: Screening

Generic and topic-specific screening criteria were developed. The generic screening criteria are detailed below, as is an example of a set of topic-specific screening criteria. The screening criteria were very inclusive overall. This is because AidGrade purposely follows a different approach to most meta-analyses in the hopes that the data collected can be re-used by researchers who want to focus on a different subset of papers. Their motivation is that vast resources are typically devoted to a meta-analysis, but if another team of researchers thinks a different set of papers should be used, they will have to scour the literature and recreate the data from scratch. If the two groups disagree, all the public sees are their two sets of findings and their reasoning for selecting different papers. AidGrade instead strives to cover the superset of all impact evaluations one might wish to include, along with a list of their characteristics (e.g. where they were conducted, whether they were randomized by individual or by cluster, etc.), and to let people set their own filters on the papers or select individual papers and view the entire space of possible results.
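As a rough illustration of this filter-and-reanalyze idea, the sketch below applies user-chosen filters to a study-level table and recomputes a precision-weighted average for the studies that remain. The column names and the simple fixed-effect weighting are assumptions made for the example, not a description of AidGrade's actual system.

```python
import pandas as pd

def filtered_pooled_estimate(studies, **filters):
    """Apply user-chosen filters (column=value pairs) to a study-level table
    and return an inverse-variance-weighted average of the remaining effect
    sizes, along with the number of studies kept. Column names are illustrative."""
    subset = studies
    for column, value in filters.items():
        subset = subset[subset[column] == value]
    weights = 1.0 / subset['se']**2              # simple fixed-effect weights
    pooled = (weights * subset['effect_size']).sum() / weights.sum()
    return pooled, len(subset)

# Toy study-level data (made up for demonstration):
studies = pd.DataFrame({
    'effect_size': [0.10, 0.25, 0.05, 0.18],
    'se':          [0.04, 0.06, 0.03, 0.05],
    'rct':         [True, True, False, True],
    'implementer': ['government', 'NGO', 'NGO', 'NGO'],
})
print(filtered_pooled_estimate(studies, rct=True, implementer='NGO'))
```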
Figure 8: Generic Screening Criteria

Category                Inclusion Criteria                              Exclusion Criteria
Methodologies           Impact evaluations that have counterfactuals    Observational studies, strictly qualitative studies
Publication status      Peer-reviewed or working paper                  N/A
Time period of study    Any                                             N/A
Location/Geography      Any                                             N/A
Quality                 Any                                             N/A

Figure 9: Topic-Specific Criteria Example: Formal Banking

Category        Inclusion Criteria                                                  Exclusion Criteria
Intervention    Formal banking services, specifically including:                    Other formal banking services; microfinance
                - Expansion of credit and/or savings
                - Provision of technological innovations
                - Introduction or expansion of financial education, or other
                  program to increase financial literacy or awareness
Outcomes        - Individual and household income                                   N/A
                - Small and micro-business income
                - Household and business assets
                - Household consumption
                - Small and micro-business investment
                - Small, micro-business or agricultural output
                - Measures of poverty
                - Measures of well-being or stress
                - Business ownership
                - Any other outcome covered by multiple papers

Figure 10 illustrates the difference. For this reason, minimal screening was done during the screening stage. Instead, data was collected broadly and re-screening was allowed at the point of doing the analysis. This is highly beneficial for the purposes of this paper, as it allows us to look at the largest possible set of papers and all subsets.

After screening criteria were developed, two volunteers independently screened the titles to determine which papers in the spreadsheet were likely to meet the screening criteria developed in Stage 3.1. Any differences in coding were arbitrated by a third volunteer. All volunteers received training before beginning, based on the AidGrade Training Manual and a test set of entries. Volunteers' training inputs were screened to ensure that only proficient volunteers would be allowed to continue.

Figure 10: AidGrade's Strategy

Of those papers that passed the title screening, two volunteers independently determined whether the papers in the spreadsheet met the screening criteria developed in Stage 3.1, judging by the paper abstracts. Any differences in coding were again arbitrated by a third volunteer. The full text was then found for those papers which passed both the title and abstract checks. Any paper that proved not to be a relevant impact evaluation under the aforementioned criteria was discarded at this stage.

Stage 4: Coding

Two AidGrade members each independently used the data extraction form developed in Stage 4.1 to extract data from the papers that passed the screening in Stage 3. Any disputes were arbitrated by a third AidGrade member.
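A stylized version of this double-entry-with-arbitration check is easy to express in code. The sketch below compares two coders' extracted values field by field and returns the disagreements to be sent to a third arbitrator; the field names and numeric tolerance are purely illustrative, not AidGrade's actual workflow.

```python
def flag_disagreements(entry_a, entry_b, numeric_tolerance=1e-9):
    """Compare two independently coded records (dicts keyed by field name)
    and return the fields on which the coders disagree, to be arbitrated
    by a third person. Field names below are illustrative only."""
    disagreements = {}
    for field in sorted(set(entry_a) | set(entry_b)):
        a, b = entry_a.get(field), entry_b.get(field)
        if isinstance(a, float) and isinstance(b, float):
            agree = abs(a - b) <= numeric_tolerance
        else:
            agree = a == b
        if not agree:
            disagreements[field] = (a, b)
    return disagreements

coder_1 = {'country': 'Kenya', 'randomized_by': 'cluster', 'effect_size': 0.14}
coder_2 = {'country': 'Kenya', 'randomized_by': 'individual', 'effect_size': 0.14}
print(flag_disagreements(coder_1, coder_2))  # {'randomized_by': ('cluster', 'individual')}
```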
These AidGrade members received much more training than those who screened the papers, reflecting the increased difficulty of their work, and also completed a test set of entries before being allowed to proceed. The data extraction form was organized into three sections: (1) general identifying information; (2) paper and study characteristics; and (3) results. Each section contained qualitative and quantitative variables that captured the characteristics and results of the study.

Stage 5: Analysis

Each meta-analysis topic was assigned a researcher who could specialize in determining which of the interventions and results were similar enough to be combined. If in doubt, researchers could consult the original papers. In general, researchers were encouraged to focus on all the outcome variables for which multiple papers had results. When a study had multiple treatment arms sharing the same control, researchers would check whether enough data was provided in the original paper to allow estimates to be combined before the meta-analysis was run. This is a best practice to avoid double-counting the control group; for details, see the Cochrane Handbook for Systematic Reviews of Interventions (2011). If a paper did not provide sufficient data for this, the researcher would decide which treatment arm to focus on. Data were then standardized within each topic to be more comparable before analysis (for example, units were converted). The subsequent steps of the meta-analysis process are irrelevant for the purposes of this paper.

It should be noted that the first set of ten topics followed a slightly different procedure for stages (1) and (2). Only one list of potential topics was created in Stage 1.1, so Stage 1.2 (Consolidation of Lists) was only vacuously followed. There was also no randomization after public voting (Stage 1.7) and no scripted scraping searches (Stage 2.3), as all searches were manually conducted using specific strings. A different search engine was also used: SciVerse Hub, an aggregator that includes SciVerse Scopus, MEDLINE, PubMed Central, ArXiv.org, and many other databases of articles, books and presentations. The search strings for both rounds of meta-analysis, manual and scripted, are detailed in another appendix.

C Heterogeneity Measures

As discussed in the main text, we will want measures that speak both to the overall variability and to the amount that can be explained. The most obvious measure to consider is the variance of studies' results, var(Y_i). A potential drawback to using the variance as a measure of generalizability is that studies that have higher effect sizes, or that are measured in units with larger scales, may have larger variances. This would limit us to making comparisons only between data with the same scale. We could either: 1) restrict attention to those outcomes in the same natural units (e.g. enrollment rates in percentage points); 2) convert results to be in terms of a common unit, such as standard deviations[10]; or 3) scale the measure, such as by the mean result, to create a unitless figure. Scaling the standard deviation of results within an intervention-outcome combination by the mean result within that intervention-outcome creates a measure known as the coefficient of variation, which represents the inverse of the signal-to-noise ratio, and as a unitless figure can be compared across intervention-outcome combinations with different natural units.

[10] This can be problematic if the standard deviations themselves vary, but it is a common approach in the meta-analysis literature in lieu of a better option.
It is not immune to criticism, however, particularly in that it may result in large values as the mean approaches zero.

The measures discussed so far focus on variation. However, if we could explain the variation, it would no longer worsen our ability to make predictions in a new setting, so long as we had all the necessary data from that setting, such as covariates, with which to extrapolate. One portion of the variation that can be immediately explained is the sampling variance, var(Y_i | theta_i), denoted sigma^2. The variation in observed effect sizes is

$$\mathrm{var}(Y_i) = \tau^2 + \sigma^2 \qquad (15)$$

and the proportion of the variation that is not sampling error is

$$I^2 = \frac{\tau^2}{\tau^2 + \sigma^2} \qquad (16)$$

The I^2 is an established metric in the meta-analysis literature that helps determine whether a fixed-effect or random-effects model is more appropriate; the higher the I^2, the less plausible it is that sampling error drives all the variation in results, and the more appropriate a random-effects model is. I^2 is considered "low" at 0.25, "moderate" at 0.5, and "high" at 0.75 (Higgins et al., 2003).[11]

[11] The Cochrane Collaboration uses a slightly different set of norms, saying 0-0.4 "might not be important", 0.3-0.6 "may represent moderate heterogeneity", 0.5-0.9 "may represent substantial heterogeneity", and 0.75-1 "considerable heterogeneity" (Higgins and Green, 2011).

Table 10: Summary of heterogeneity measures

Measure                Measure of     Measure of proportion of       Measure makes use of
                       variation      variation that is systematic   explanatory variables
var(Y_i)               X
var_R(Y_i - Y^_i)      X                                             X
CV(Y_i)                X
CV_R(Y_i - Y^_i)       X                                             X
I^2                                   X
I_R^2                                 X                              X
R^2                                   X                              X

If we wanted to explain more of the variation, we could use a mixed model and, upon estimating it, calculate several additional statistics: the amount of residual variation in Y_i after accounting for X_n, var_R(Y_i); the coefficient of residual variation, CV_R(Y_i); and the residual I_R^2. Further, we can examine the R^2 of the meta-regression. It should be noted that a linear meta-regression is only one way of modelling variation in Y_i. The I^2, for example, is analogous to the reliability coefficient of classical test theory or the generalizability coefficient of generalizability theory (a branch of psychometrics), both of which estimate the proportion of variation that is not error. In that literature, additional heterogeneity is usually modelled using ANOVA rather than meta-regression. Modelling variation in treatment effects also does not have to occur only retrospectively, at the conclusion of studies; we can imagine that a carefully designed study could anticipate and estimate some of the potential sources of variation experimentally.

Table 10 summarizes the different indicators, dividing them into measures of variation and measures of the proportion of variation that is systematic. Each of these metrics has its advantages and disadvantages. Table 11 summarizes the desirable properties of a measure of heterogeneity and which properties are possessed by each of the discussed indicators. Measuring heterogeneity using the variance of Y_i requires the Y_i to have comparable units. Using the coefficient of variation requires the assumption that the mean effect size is an appropriate measure with which to scale sd(Y_i). The variance and coefficient of variation also do not have anything to say about the amount of heterogeneity that can be explained. Adding explanatory variables also has its limitations.
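Before turning to those limitations, a small sketch may help fix ideas. It computes var(Y_i), CV(Y_i), a moment-based tau^2, the resulting I^2, and CV_tau from a set of effect sizes and standard errors. The moment-based estimate of tau^2 (observed variance minus average sampling variance) is a simplification used here for exposition; the paper itself estimates these quantities within a Bayesian hierarchical model, and the numbers below are hypothetical.

```python
import numpy as np

def heterogeneity_measures(effects, std_errors):
    """Compute the heterogeneity measures discussed above from study-level
    effect sizes and standard errors (a moment-based simplification)."""
    y = np.asarray(effects, float)
    se = np.asarray(std_errors, float)
    var_y = y.var(ddof=1)                       # var(Y_i): total observed variation
    cv = y.std(ddof=1) / abs(y.mean())          # CV(Y_i): sd scaled by the mean result
    sigma2 = np.mean(se**2)                     # average sampling variance, sigma^2
    tau2 = max(0.0, var_y - sigma2)             # variation not due to sampling error
    i2 = tau2 / (tau2 + sigma2)                 # I^2 = tau^2 / (tau^2 + sigma^2)
    cv_tau = np.sqrt(tau2) / abs(y.mean())      # CV_tau: CV using tau instead of sd(Y_i)
    return {'var': var_y, 'CV': cv, 'tau2': tau2, 'I2': i2, 'CV_tau': cv_tau}

# Hypothetical effect sizes for one intervention-outcome combination:
print(heterogeneity_measures([0.02, 0.15, 0.08, 0.30, 0.11],
                             [0.05, 0.04, 0.06, 0.07, 0.05]))
```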
In any model, we have no way to guarantee that we are indeed capturing all the relevant factors. While I 2 has the nice property that it disaggregates sampling variance as a source of variation, estimating it depends on the weights applied to each study’s results and thus, in turn, on 1 “considerable heterogeneity” (Higgins and Green, 2011). 49 Table 11: Desirable properties of a measure of heterogeneity Does not depend on the number of studies in a cell varpYi q varR pYi ´ Ypi q CVpYi q CVR pYi ´ Ypi q I2 IR2 R2 X X X X X X X Does not depend on the precision of individual estimates X X X X X Does not depend on the estimates’ units Does not depend on the mean result in the cell X X X X X X X X X X A “cell” here refers to an intervention-outcome combination. The “precision” of an estimate refers to its standard error. the sample sizes of the studies. The R2 has its own well-known caveats, such as that it can be artificially inflated by over-fitting. To get a full picture of the extent to which results might generalize, then, multiple measures should be used. 50 D Additional Results 51 52 Intervention Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers Conditional cash transfers HIV/AIDS Education HIV/AIDS Education HIV/AIDS Education Unconditional cash transfers Unconditional cash transfers Unconditional cash transfers Insecticide-treated bed nets Contract teachers Deworming Deworming Deworming Deworming Deworming Deworming Deworming Deworming Deworming Deworming Deworming Deworming Financial literacy Improved stoves Improved stoves Improved stoves Improved stoves Irrigation Irrigation Table 12: Descriptive Statistics: Standardized Narrowly Defined Outcomes Outcome # Neg sig papers # Insig papers Attendance rate 0 6 Enrollment rate 0 6 Height 0 1 Height-for-age 0 6 Labor force participation 1 12 Probability unpaid work 1 0 Test scores 1 2 Unpaid labor 0 2 Weight-for-age 0 2 Weight-for-height 0 1 Pregnancy rate 0 2 Probability has multiple sex partners 0 1 Used contraceptives 1 6 Enrollment rate 0 3 Test scores 0 1 Weight-for-height 0 2 Malaria 0 3 Test scores 0 1 Attendance rate 0 1 Birthweight 0 2 Diarrhea incidence 0 1 Height 3 10 Height-for-age 1 9 Hemoglobin 0 13 Malformations 0 2 Mid-upper arm circumference 2 0 Test scores 0 0 Weight 3 8 Weight-for-age 1 6 Weight-for-height 2 7 Savings 0 2 Chest pain 0 0 Cough 0 0 Difficulty breathing 0 0 Excessive nasal secretion 0 1 Consumption 0 1 Total income 0 1 # Pos sig papers 9 31 1 1 5 4 2 3 0 1 0 1 3 8 1 0 6 2 1 0 1 4 4 2 0 5 2 7 5 2 3 2 2 2 1 1 1 # Papers 15 37 2 7 18 5 5 5 2 2 2 2 10 11 2 2 9 3 2 2 2 17 14 15 2 7 2 18 12 11 5 2 2 2 2 2 2 53 Microfinance Microfinance Microfinance Microfinance Microfinance Micro health insurance Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation Micronutrient supplementation 
Micronutrient supplementation Micronutrient supplementation Mobile phone-based reminders Mobile phone-based reminders Performance pay Rural electrification Rural electrification Rural electrification Safe water storage Scholarships Scholarships Scholarships Assets Consumption Profits Savings Total income Enrollment rate Birthweight Body mass index Cough prevalence Diarrhea incidence Diarrhea prevalence Fever incidence Fever prevalence Height Height-for-age Hemoglobin Malaria Mid-upper arm circumference Mortality rate Perinatal deaths Prevalence of anemia Stillbirths Stunted Test scores Triceps skinfold measurement Wasted Weight Weight-for-age Weight-for-height Appointment attendance rate Treatment adherence Test scores Enrollment rate Study time Total income Diarrhea incidence Attendance rate Enrollment rate Test scores 0 0 1 0 0 0 0 0 0 1 0 0 1 3 5 7 0 2 0 1 0 0 0 1 1 0 4 1 0 1 1 0 0 0 0 0 0 0 0 3 2 3 3 3 1 4 1 3 5 5 2 2 22 23 11 2 9 12 5 6 4 5 2 0 2 19 23 18 0 3 2 1 1 2 1 1 2 2 1 0 1 0 2 1 3 4 0 5 1 0 2 7 8 29 0 7 0 0 9 0 0 7 1 0 13 10 8 2 1 1 2 2 0 1 1 3 0 4 2 5 3 5 2 7 5 3 11 6 2 5 32 36 47 2 18 12 6 15 4 5 10 2 2 36 34 26 3 5 3 3 3 2 2 2 5 2 School meals School meals School meals Water treatment Water treatment Women’s empowerment programs Women’s empowerment programs Average Enrollment rate Height-for-age Test scores Diarrhea incidence Diarrhea prevalence Savings Total income 0 0 0 0 0 0 0 0.6 3 2 2 1 1 1 0 4.2 0 0 1 1 5 1 2 3.2 3 2 3 2 6 2 2 7.9 54 Intervention 55 Micronutrients Conditional Cash Transfers Conditional Cash Transfers Deworming Micronutrients Micronutrients Micronutrients School Meals Micronutrients Unconditional Cash Transfers SMS Reminders Micronutrients Micronutrients Micronutrients Micronutrients Micronutrients Microfinance Conditional Cash Transfers Conditional Cash Transfers Deworming Micronutrients Bed Nets Scholarships Conditional Cash Transfers HIV/AIDS Education Deworming Deworming Deworming Micronutrients Micronutrients Deworming Conditional Cash Transfers Micronutrients Table 13: Across-Paper vs. 
Mean Within-Paper Heterogeneity Outcome Across-paper Within-paper Across-paper Within-paper var(Yi ) var(Yi ) CV(Yi ) CV(Yi ) Cough prevalence 0.001 0.006 1.017 3.181 Enrollment rate 0.009 0.027 0.790 0.968 Unpaid labor 0.009 0.004 0.918 0.853 Hemoglobin 0.009 0.068 1.639 8.687 Weight-for-height 0.010 0.005 2.252 * Birthweight 0.010 0.011 0.974 0.963 Weight-for-age 0.010 0.124 2.370 0.713 Height-for-age 0.011 0.000 1.086 * Height-for-age 0.012 0.042 2.474 3.751 Enrollment rate 0.014 0.014 1.223 * Treatment adherence 0.022 0.008 1.479 0.672 Height 0.023 0.028 4.001 3.471 Stunted 0.024 0.059 1.085 24.373 Mortality rate 0.026 0.195 2.533 1.561 Weight 0.029 0.027 2.852 0.149 Fever prevalence 0.034 0.011 5.937 0.126 Total income 0.037 0.003 1.770 1.232 Probability unpaid work 0.046 0.386 1.419 0.408 Attendance rate 0.046 0.018 0.591 0.526 Height 0.048 0.112 1.845 0.211 Perinatal deaths 0.049 0.015 2.087 0.234 Malaria 0.052 0.047 0.650 4.093 Enrollment rate 0.053 0.026 1.094 1.561 Height-for-age 0.055 0.002 22.166 1.212 Used contraceptives 0.059 0.120 2.863 6.967 Weight-for-height 0.072 0.164 3.127 * Height-for-age 0.100 0.005 2.043 1.842 Weight-for-age 0.108 0.004 2.317 1.040 Diarrhea incidence 0.135 0.016 2.844 1.741 Diarrhea prevalence 0.137 0.029 1.375 3.385 Weight 0.168 0.121 4.087 1.900 Labor force participation 0.790 0.047 2.931 4.300 Hemoglobin 2.650 0.176 2.982 0.731 Across-paper I2 0.905 1.000 0.927 0.661 0.798 0.930 1.000 0.996 1.000 1.000 0.998 0.987 0.222 0.037 0.742 0.696 0.999 1.000 1.000 1.000 0.402 0.999 1.000 0.036 0.351 1.000 1.000 1.000 0.993 0.949 1.000 0.269 1.000 Within-paper values are based on those papers which report results for different subsets of the data. For closer comparison of the across and within-paper statistics, the across-paper values are based on the same data set, aggregating the within-paper results to one observation per Within-paper I2 1.000 0.714 0.828 0.835 0.603 0.982 0.580 0.849 0.417 0.398 0.644 0.564 0.033 0.008 0.081 0.005 1.000 0.489 0.196 0.740 0.009 0.574 0.559 0.693 0.485 0.975 0.806 0.827 0.861 0.653 0.918 0.605 1.000 intervention-outcome-paper, as discussed. Each paper needs to have reported 3 results for an intervention-outcome combination for it to be included in the calculation, in addition to the requirement of there being 3 papers on the intervention-outcome combination. Due to the slightly different sample, the across-paper statistics diverge slightly from those reported in Table 3. Occasionally, within-paper measures of the mean equal or approach zero, making the coefficient of variation undefined or unreasonable; * denotes those coefficients of variation that were either undefined or greater than 1,000,000. 
56 Intervention Table 14: Heterogeneity Measures for RCTs Outcome ρ1,2 57 Conditional Cash Transfers Micronutrients Microfinance Financial Literacy Contract Teachers Performance Pay Micronutrients Micronutrients Micronutrients Micronutrients Conditional Cash Transfers Conditional Cash Transfers Micronutrients Deworming Micronutrients Water Treatment Conditional Cash Transfers Micronutrients SMS Reminders School Meals Unconditional Cash Transfers Micronutrients Micronutrients Micronutrients Conditional Cash Transfers Bed Nets HIV/AIDS Education Micronutrients Micronutrients Unpaid labor Cough prevalence Savings Savings Test scores Test scores Weight-for-height Body mass index Weight-for-age Birthweight Height-for-age Test scores Height-for-age Hemoglobin Mid-upper arm circumference Diarrhea prevalence Enrollment rate Test scores Treatment adherence Test scores Enrollment rate Height Mortality rate Stunted Attendance rate Malaria Used contraceptives Weight Perinatal deaths 0.00 0.01 0.01 0.01 0.01 0.01 0.00 0.00 0.01 0.01 0.03 0.03 0.02 0.01 0.02 0.01 0.00 0.00 0.01 0.05 0.01 -0.01 0.01 0.02 0.01 0.02 0.01 0.00 0.02 ρ5,6 ρ10,11 0.00 0.00 -0.01 0.00 0.00 0.00 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.01 0.01 0.01 0.02 0.01 0.00 0.01 0.01 0.02 0.00 -0.01 0.00 0.00 var(Yi ) CV(Yi ) I2 N 0.000 0.001 0.003 0.004 0.005 0.006 0.009 0.009 0.009 0.010 0.011 0.012 0.012 0.015 0.018 0.020 0.020 0.021 0.022 0.023 0.023 0.024 0.025 0.025 0.025 0.029 0.031 0.037 0.038 0.16 1.65 1.43 5.47 0.40 0.61 2.40 1.06 2.18 0.98 0.83 1.19 2.43 3.38 2.40 0.97 0.70 1.45 1.67 1.29 1.03 3.76 2.88 1.11 0.41 0.50 1.63 2.76 2.10 0.99 1.00 1.00 0.99 1.00 1.00 0.79 1.00 0.99 0.95 0.04 1.00 1.00 0.99 0.70 1.00 1.00 1.00 0.93 0.55 1.00 0.98 0.05 0.11 0.81 0.99 0.94 0.78 0.04 2 3 2 5 3 3 25 3 32 7 2 4 35 15 15 6 17 8 5 3 6 30 12 5 6 9 8 33 6 58 Deworming Conditional Cash Transfers Deworming Micronutrients Conditional Cash Transfers Deworming Micronutrients Deworming Micronutrients Micronutrients School Meals Micronutrients Deworming Micronutrients SMS Reminders Deworming Average Height Labor force participation Weight-for-height Stillbirths Probability unpaid work Height-for-age Prevalence of anemia Weight-for-age Diarrhea incidence Diarrhea prevalence Enrollment rate Fever prevalence Weight Hemoglobin Appointment attendance rate Mid-upper arm circumference 0.00 0.01 0.02 0.03 0.02 0.01 0.04 0.01 0.04 0.03 0.16 0.02 -0.05 0.00 0.01 0.00 0.02 0.01 0.00 0.03 0.00 0.00 0.01 0.01 0.02 -0.01 0.03 0.00 0.00 0.00 0.00 0.00 0.02 -0.03 0.03 0.00 0.01 0.00 0.049 0.051 0.072 0.075 0.082 0.098 0.100 0.107 0.109 0.111 0.143 0.146 0.184 0.219 0.224 0.439 2.36 2.15 3.13 3.04 1.19 1.98 0.84 2.29 3.30 1.21 1.24 3.08 4.76 1.47 2.91 1.77 1.00 1.00 1.00 0.01 1.00 1.00 0.91 1.00 1.00 0.99 0.40 0.78 1.00 1.00 0.94 1.00 17 6 11 4 2 14 14 12 11 6 2 5 18 45 3 7 0.060 1.90 0.84 11 Table 15: Heterogeneity Measures for Higher-Quality Studies Intervention Outcome ρ1,2 ρ5,6 59 Micronutrients SMS Reminders Microfinance Financial Literacy Contract Teachers Conditional Cash Transfers Conditional Cash Transfers Micronutrients Micronutrients Micronutrients Micronutrients Micronutrients Deworming Micronutrients Water Treatment Micronutrients Micronutrients School Meals Micronutrients Conditional Cash Transfers Micronutrients HIV/AIDS Education Bed Nets Micronutrients Deworming SMS Reminders Micronutrients Deworming Deworming Cough prevalence Treatment adherence Savings Savings Test scores Labor force participation Test scores Body mass index Weight-for-height 
Weight-for-age Birthweight Height-for-age Hemoglobin Mid-upper arm circumference Diarrhea prevalence Test scores Mortality rate Test scores Height Attendance rate Stunted Used contraceptives Malaria Weight Height Appointment attendance rate Perinatal deaths Weight-for-height Height-for-age 0.01 0.01 0.01 0.01 0.01 0.02 0.00 0.00 0.01 0.01 0.01 -0.01 0.00 0.01 0.01 0.00 0.01 0.04 0.01 0.03 0.02 0.01 0.03 -0.01 0.03 0.02 0.02 0.01 0.01 ρ10,11 0.00 0.00 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.00 -0.01 0.01 -0.01 0.02 -0.01 0.00 0.00 -0.01 0.00 0.00 var(Yi ) CV(Yi ) I2 N 0.001 0.002 0.003 0.004 0.005 0.006 0.006 0.009 0.009 0.009 0.009 0.013 0.015 0.019 0.020 0.021 0.022 0.023 0.025 0.025 0.025 0.031 0.038 0.039 0.049 0.053 0.060 0.072 0.098 1.65 0.47 1.43 5.47 0.40 0.24 0.58 1.06 2.30 2.05 0.73 2.32 3.38 2.34 0.97 1.45 7.28 1.29 3.73 0.34 1.11 1.63 0.56 2.75 2.73 0.55 3.66 3.13 1.98 1.00 0.32 1.00 0.99 1.00 0.02 1.00 1.00 0.79 0.99 0.99 1.00 1.00 0.73 1.00 1.00 0.11 0.54 0.97 0.28 0.11 0.94 0.99 0.72 1.00 0.79 0.21 1.00 1.00 3 3 2 5 3 2 3 3 24 30 4 32 15 14 6 8 10 3 29 3 5 8 7 31 16 2 4 11 14 Micronutrients Deworming Micronutrients Micronutrients Micronutrients Micronutrients Micronutrients Deworming Deworming Average Prevalence of anemia Weight-for-age Diarrhea incidence Diarrhea prevalence Fever prevalence Stillbirths Hemoglobin Weight Mid-upper arm circumference 0.03 -0.02 0.00 0.03 0.02 0.05 -0.01 -0.02 0.00 0.00 -0.02 0.04 0.03 0.01 0.00 0.00 0.01 0.00 0.00 -0.02 0.01 0.01 0.00 0.00 0.100 0.107 0.109 0.111 0.146 0.173 0.197 0.198 0.439 0.84 2.29 3.30 1.21 3.08 2.62 1.51 3.85 1.77 0.84 1.00 1.00 0.98 0.81 0.02 1.00 1.00 1.00 14 12 11 6 5 2 40 16 7 0.060 2.05 0.79 11 60 Table 16: Heterogeneity Measures for Effect Sizes Within Intervention-Outcomes, Including Outlier Intervention Outcome ρ1,2 ρ5,6 ρ10,11 61 Microfinance Rural Electrification Micronutrients Microfinance Microfinance Financial Literacy Microfinance Contract Teachers Performance Pay Micronutrients Conditional Cash Transfers Micronutrients Micronutrients Micronutrients Micronutrients Conditional Cash Transfers Deworming Micronutrients Conditional Cash Transfers Unconditional Cash Transfers Water Treatment SMS Reminders Conditional Cash Transfers School Meals Micronutrients Micronutrients Micronutrients Bed Nets Conditional Cash Transfers Assets Enrollment rate Cough prevalence Total income Savings Savings Profits Test scores Test scores Body mass index Unpaid labor Weight-for-age Weight-for-height Birthweight Height-for-age Test scores Hemoglobin Mid-upper arm circumference Enrollment rate Enrollment rate Diarrhea prevalence Treatment adherence Labor force participation Test scores Height Mortality rate Stunted Malaria Attendance rate 0.00 0.01 0.01 0.00 0.00 0.01 0.01 0.01 0.01 0.00 0.01 0.01 0.00 0.01 0.00 0.01 0.03 0.01 0.02 0.01 0.01 0.01 0.04 0.05 0.00 0.02 0.02 0.02 0.01 -0.01 0.00 0.00 0.00 0.00 0.00 -0.01 0.00 0.00 0.01 0.00 0.00 0.00 0.00 0.01 0.00 0.04 0.00 0.01 0.01 0.01 0.00 0.01 0.00 0.00 var(Yi ) CV(Yi ) I2 N 0.000 0.001 0.001 0.001 0.002 0.004 0.005 0.005 0.006 0.007 0.009 0.009 0.010 0.010 0.012 0.013 0.015 0.015 0.015 0.016 0.020 0.022 0.023 0.023 0.023 0.025 0.025 0.029 0.030 5.51 0.13 1.65 0.99 1.77 5.47 5.45 0.40 0.61 0.67 0.92 1.94 2.15 0.98 2.47 1.87 3.38 2.08 0.83 1.09 0.97 1.67 1.63 1.29 4.37 2.88 1.11 0.50 0.52 1.00 0.92 1.00 1.00 1.00 0.99 1.00 1.00 1.00 1.00 0.94 0.98 0.78 0.95 1.00 1.00 0.99 0.53 1.00 1.00 1.00 0.90 0.36 0.52 0.98 0.06 0.12 0.98 0.99 4 3 3 5 3 5 5 3 3 5 5 34 26 
7 36 5 15 18 37 11 6 5 18 3 32 12 5 9 15 62 Micronutrients HIV/AIDS Education Micronutrients Deworming Micronutrients Scholarships Conditional Cash Transfers Deworming Micronutrients School Meals Micronutrients Deworming Deworming Micronutrients Micronutrients Micronutrients Deworming Micronutrients SMS Reminders Deworming Conditional Cash Transfers Rural Electrification Average Weight Used contraceptives Perinatal deaths Height Test scores Enrollment rate Height-for-age Weight-for-height Stillbirths Enrollment rate Prevalence of anemia Height-for-age Weight-for-age Diarrhea incidence Diarrhea prevalence Fever prevalence Weight Hemoglobin Appointment attendance rate Mid-upper arm circumference Probability unpaid work Study time 0.00 0.02 0.02 -0.03 0.00 0.01 0.05 0.01 0.03 0.12 0.01 0.02 -0.04 -0.03 0.03 0.02 -0.04 -0.02 0.01 0.00 0.02 0.08 -0.01 0.00 0.02 -0.02 0.00 0.01 0.01 0.01 -0.01 -0.01 -0.01 0.00 -0.01 0.00 -0.01 0.01 0.03 0.00 -0.01 0.00 0.00 0.03 0.03 0.00 0.01 0.00 0.00 0.034 0.036 0.038 0.049 0.052 0.053 0.055 0.072 0.075 0.081 0.095 0.098 0.107 0.109 0.111 0.146 0.184 0.215 0.224 0.439 0.609 0.997 2.70 3.12 2.10 2.36 1.69 0.69 22.17 3.13 3.04 1.14 0.79 1.98 2.29 3.30 1.21 3.08 4.76 1.44 2.91 1.77 6.42 1.10 0.68 0.46 0.04 1.00 1.00 1.00 0.04 1.00 0.02 0.01 0.78 1.00 1.00 1.00 0.96 0.81 1.00 1.00 0.99 1.00 0.96 0.03 36 10 6 17 10 5 7 11 4 3 15 14 12 11 6 5 18 46 3 7 5 3 0.083 2.52 0.80 12 Table 17: Regression of Studies’ Absolute Percent Difference from the Within-InterventionOutcome Mean on Chronological Order (1) (2) (3) (4) ă500% ă1000% ă1500% ă2000% Chronological order Observations R2 0.053 (0.21) 0.699 (0.47) 0.926 (0.61) 0.971 (0.67) 480 0.13 520 0.13 527 0.12 532 0.12 Each column restricts attention to a set of results a given maximum percentage away from the mean result in an intervention-outcome combination. 63 E Derivation of Mixed Model Estimation Strategy The model we are estimating is of the form: Yi “ Xi β ` ui ` ei (17) This can easily be generalized to include other explanatory variables, and I do this to estimate Model 2. However, for the sake of exposition I focus on the simplest case. In the fully Bayesian random-effects model, we estimated the parameters θ, µ and τ using the fact that P pθ, µ, τ |Y q “ P pθ|µ, τ, Y qP pµ|τ, Y qP pτ |Y q and sampling from τ , calculating µ, and finally obtaining θ. Analogously, we can estimate the hierarchical mixed model by decomposing P pβ, e, τ |Y q: P pβ, ei , τ |Yi q “ P pei |β, τ, Yi qP pβ|τ, Yi qP pτ |Yi q (18) Inspecting each term on the RHS separately, we can see a similar identification strategy: sampling from τ , calculating β, calculating the sufficient statistics for ei , and finally sampling ei . In particular, the three terms can be re-written as follows. For the first term: P pei |β, τ, Yi q9P pβ, ei , τ |Yi q (19) P pβ, ei , τ |Yi q9P pYi |β, ei qP pei |τ qP pβ, τ q ˆ ˆ ˙ ˙ 1 1 2 1 1 2 exp ´ 2 pYi ´ Xi β ´ ei q ? 9a exp ´ 2 ei P pβ, τ q 2σi 2τ 2πτ 2 2πσi2 1 1 logP pei |β, τ, Yi q “ C ´ 2 pe2i ´ 2pYi ´ Xi βqei q ´ 2 e2i 2σi 2τ 2 2 σ `τ Yi ´ X i β “ C ´ i 2 2 e2i ` ei 2σi τ σi2 ˆ ˙2 pYi ´ Xi βqτ 2 σi2 ` τ 2 “C´ ei ´ 2σi2 τ 2 σi2 ` τ 2 where C is a constant in each line that can be different throughout. For the second term: 64 (20) (21) (22) (23) (24) P pβ|τ, Y q9P pβ, τ |Y q n ź a P pβ, τ |Y q9P pβ, τ q (25) ˆ 1 1 pYi ´ Xi βq2 exp ´ 2 2q 2 2 2pσ ` τ 2πpσ ` τ q i i i“1 n 2 ÿ pYi ´ Xi βq logP pβ|τ, Y q “ logP pβ|τ q ´ `C 2pσi2 ` τ 2 q i“1 ˙ (26) (27) Gelman et al. 
(2013) suggest a noninformative prior for $P(\beta|\tau)$. Then:

$$\log P(\beta|\tau, Y) = C - \beta'\left(\sum_{i=1}^{n} \frac{X_i'X_i}{2(\sigma_i^2 + \tau^2)}\right)\beta + \left(\sum_{i=1}^{n} \frac{Y_i X_i}{\sigma_i^2 + \tau^2}\right)\beta \qquad (28)$$

$$= C - \frac{1}{2}\left(\beta - \Omega\lambda'\right)'\Omega^{-1}\left(\beta - \Omega\lambda'\right) \qquad (29)$$

where $\Omega = \left(\sum_{i=1}^{n} \frac{X_i'X_i}{\sigma_i^2 + \tau^2}\right)^{-1}$ and $\lambda = \sum_{i=1}^{n} \frac{Y_i X_i}{\sigma_i^2 + \tau^2}$.

For the third term:

$$P(\tau|Y) = \frac{P(\beta, \tau|Y)}{P(\beta|\tau, Y)} \qquad (30)$$

For this last equation, note that $P(\beta|\tau, Y)$ is solved above, and $P(\beta, \tau|Y)$ is solved above except for the unknown term $P(\beta, \tau) = P(\beta|\tau)P(\tau)$. We already defined a uniform prior for $P(\beta|\tau)$ and can define another uniform prior for $P(\tau)$. Then:

$$\log P(\tau|Y) = C - \frac{1}{2}\sum_{i=1}^{n} \log(\sigma_i^2 + \tau^2) - \sum_{i=1}^{n} \frac{(Y_i - X_i\beta)^2}{2(\sigma_i^2 + \tau^2)} \qquad (31)$$

$$\qquad\qquad + \frac{1}{2}\log|\Omega| + \frac{1}{2}\left(\beta - \Omega\lambda'\right)'\Omega^{-1}\left(\beta - \Omega\lambda'\right) \qquad (32)$$

and we are ready to begin sampling $\tau$.
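To make the estimation strategy concrete, here is a minimal numerical sketch of the sampling scheme the derivation implies: evaluate log P(tau|Y) on a grid, draw tau, then draw beta from N(Omega lambda', Omega) and each e_i from its conditional normal posterior. It is an illustrative simplification under the stated flat priors, with made-up data and a grid-based draw of tau; it is not the paper's production code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: effect sizes Y, sampling variances sigma_i^2, design matrix X
Y = np.array([0.05, 0.20, 0.10, 0.30, 0.15])
sigma2 = np.array([0.02, 0.03, 0.01, 0.05, 0.02])
X = np.column_stack([np.ones(5), np.array([0.9, 0.6, 0.8, 0.3, 0.7])])

def omega_lambda(tau2):
    """Omega and lambda from the derivation, for a given tau^2."""
    w = 1.0 / (sigma2 + tau2)
    Omega = np.linalg.inv(X.T @ (w[:, None] * X))
    lam = (w * Y) @ X
    return Omega, lam

def log_p_tau(tau):
    """log P(tau | Y) up to a constant, i.e. equations (31)-(32) evaluated
    at beta = Omega @ lambda, where the final quadratic term vanishes."""
    Omega, lam = omega_lambda(tau**2)
    resid = Y - X @ (Omega @ lam)
    return (-0.5 * np.sum(np.log(sigma2 + tau**2))
            - np.sum(resid**2 / (2.0 * (sigma2 + tau**2)))
            + 0.5 * np.linalg.slogdet(Omega)[1])

# 1) Sample tau from its marginal posterior, approximated on a grid
tau_grid = np.linspace(1e-4, 1.0, 500)
log_post = np.array([log_p_tau(t) for t in tau_grid])
post = np.exp(log_post - log_post.max())
tau = rng.choice(tau_grid, p=post / post.sum())

# 2) Draw beta | tau, Y  ~  N(Omega @ lambda, Omega)
Omega, lam = omega_lambda(tau**2)
beta = rng.multivariate_normal(Omega @ lam, Omega)

# 3) Draw each e_i | beta, tau, Y_i from its normal posterior (equation (24) above)
resid = Y - X @ beta
e_mean = resid * tau**2 / (sigma2 + tau**2)
e_sd = np.sqrt(sigma2 * tau**2 / (sigma2 + tau**2))
e = rng.normal(e_mean, e_sd)

print(tau, beta, e)
```

In a full analysis these three steps would be repeated many times to build up posterior draws; the grid approximation for tau is simply the easiest way to illustrate the decomposition of P(beta, e, tau | Y) used above.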