Randomized Controlled Trials
Angus Deaton
Research Program in Development Studies
Center for Health and Wellbeing
Princeton University
September 2009

Main messages
  Randomized controlled trials are useful, sometimes uniquely so
  There is no unique “gold standard” claim for RCTs in developing evidence for policy
    There is no gold standard of any kind
    No blanket exemption from scrutiny or skepticism for RCTs that does not apply to other methods
  General program of “finding out what works” by routine RCTs is not well-adapted to deepen scientific understanding of development
    Though it may be useful for other purposes, such as accountability or auditing
  A progressive scientific research program generally requires the investigation of mechanisms, why things work, not what works

Outline of these remarks
  Internal issues
    Things to think about when you are doing an RCT
    Using the results
  RCTs in social science versus RCTs in medicine
    The comparison is often made, with medicine seen as an example that we should follow
    But the situation is a little more complicated
    Useful to think about medicine versus what we do
  External issues
    What are RCTs good for, and not good for?
    The investigation of mechanisms
    Alternatives: project evaluation v. doing economics

Project evaluation
  Think of a project, which might be a school reform, or building a clinic, a new incentive scheme, provision of some service, a new HYV, or a drug
  If a “unit” (say person) gets the “treatment” there is a change in outcome by an amount that varies from person to person
  Emphasize the heterogeneity: very little structure on this problem, and we want to assume as little as possible to get “credibility”
  If we randomize across persons, and compare average outcome in treated group v.
    untreated group, we can estimate the average treatment effect
  This requires essentially no assumptions, compared with econometrics, for example
  Randomization guarantees that two groups come from the same probability law

Testing effectiveness
  If there is a net beneficial outcome, could it have happened by chance?
    Always possible, obviously so with small numbers of people
  We can calculate the probability of a favorable outcome having come about by chance using combinatorics, which is what R. A. Fisher did in the first randomized trials
  So we have a method of finding out whether the project worked on average, which, in its assumptions, contrasts favorably with standard econometric practice
    Selection, direction of causality, simultaneity, and so on

Benefits of randomization versus observational data
  An important example
    ISIS-2 trial of giving people aspirin immediately after an MCI (heart attack)
    17,000 patients over several studies (this size is needed)
    Even bottom of confidence interval would save hundreds of thousands of lives worldwide
    Doctors did not take this treatment seriously prior to the trial
    Within two years of publication in 1988, use in UK went from 10% to 90% (much less in the US, where it is still a problem!)
  Hard to see how these results would have been obtained in any other way
    Effect is small enough to be a problem both for small RCTs, and for observational studies
    Don’t need RCT for tobacco, because effect is so large

What does an RCT not tell us?
  Informative about the mean, not about any other characteristic of the distribution of treatment effects, e.g.
    the median, or the fraction of people who benefit, or lose
    Policymakers are often interested in these
  Does yield the full distribution of outcomes for both treatments and controls
    For some purposes, this might be enough
    If one distribution first-order stochastically dominates the other
  Another aspirin example: low dose regime
    RCTs show a net reduction in mortality
    But it kills some and saves some
    Public health perspective says do it
    Individual doctor or patient perspective is much less clear
    If two groups like this, the RCT average applies to no one!
    MCI aspirin example, the effects in same direction for everyone, or at least broad classes of people

Estimates and standard errors
  The RCT gives us a mean treatment effect
  This is not worth much without a standard error: the oft-heard reply, that the estimate is fine and “only” the standard error is a problem, is nonsense
  This is not the same as the p-value for the null hypothesis that the treatment has no effect, which can be done without additional assumptions
  Standard errors cannot be obtained non-parametrically
    Unless we bootstrap the RCT!
  We need to make the sort of assumptions that advocates of RCTs dislike in other methods
    Not clear how much better-off we are: at least diminishes the benefit
  Example, regressing outcomes on treatment dummy gives the wrong standard error
    Heteroskedasticity correction gives a t-value that does not have the t-distribution (Fisher-Behrens problem)
  Much more attention needs to be given to these issues, in medicine as well as economics

Reducing standard errors
  The effects RCTs are used to estimate tend to be relatively small
    If they were large, we would not need a trial
  Large trials are typically necessary
    Which are expensive
    Especially with “saturation” experiments, where units are schools or villages, and not individuals, because of interactions (typically not an issue with medical trials)
  Reduce variance by using baseline information as covariates
    E.g.
    regression on treatment dummy with covariates
    This leads to bias in the treatment effect: can be important with small numbers
    Generally biases the standard errors
    Opens up to charges of data mining: choose covariates until you get a significant (or insignificant) result, whichever you are looking for
    Again, more work needs to be done here to guide practice
  Pseudo-randomization, e.g. alphabetization, is likely to introduce bias and makes it impossible to calculate standard errors

RCTs in medicine?
  Most doctors will tell you yes, they are the gold standard: same at NIH
  Yet there are serious concerns
    Ethics and IRBs
    Influence of money: selective publication of results, and other evils
    Cost and timeliness
    Public health versus individual perspectives
    80 percent of oncology trials are unfilled: people will not participate
    Populations who participate in trials are strange in some way
    Exclusion of co-morbidities makes extrapolation hazardous
    Undoing of blinding, which we economists don’t even try to do
    Meta-studies sometimes contradicted by later trials
    Argued that good observational studies produce the same results
    Led to extensive funding for CER in ARRA in the US
    Chairman of Stanford Medical School’s Department of Medicine looks forward to a world in which there are no randomized clinical trials
  Happy to talk further about these in the discussion period

Using the results from RCTs
  External validity
    Which I will come to
  Using the results in the original experimental population
    This would seem to be favorable
    Yet mean effect is often not very useful for individuals, even those in the trial
    Treatment for everyone might improve social welfare, but that does not imply that any individual should be treated
    Subgroup analysis of trials is prevalent (and makes sense), but it is subject to data mining
      Astrological signs in ISIS-2
    If all effects have the same sign, that helps, better still if they are all the same
    Once again, we are trying to reduce heterogeneity, by some sort of
      modeling
    Regime aspirin
      If we confine it to those who are at risk of heart attacks on other grounds, lose many of those who will die from it
      Mechanism is thinning of blood, so we can understand how to divide
      Can’t learn this from the trial itself without some theoretical understanding of what is going on

Why might RCT results not apply?
  General equilibrium effects are well-known, but other threats even in the population as a whole
  Population subject to randomization may be different
    E.g. no elderly people included, or India is not the same as the Philippines
    No one with co-morbidities included in the trial, but those who get the treatment often have co-morbidities (trade-off between internal and external validity)
    Not everyone wants to participate in a trial: e.g. if there is risk involved, selects more risk averse
    People running the trial almost always different from those who would administer it more generally (true even in agricultural crop research)
    People in school now will be different from those in school if the educational system is changed
      Would have to randomize from birth, or even before
  Attrition or refusals, though there are various techniques for dealing with this
  Randomization may have failed (and there are many practical difficulties in the field), and there is no way of testing that it did work within the RCT framework because we can only check on observables
  Need to apply exactly the same expert skepticism to RCTs as other studies: no free pass!

The order of rigor is irrelevant
  Comes from the philosopher Nancy Cartwright
  If we are to use evidence in policy, we need to
    Develop the evidence, e.g.
      from an RCT
    Argue that it applies to the population that we want to treat
  The overall quality of the evidence depends on the weaker of these two steps; that the first step is rigorous and convincing does not help if the second one is weak
  The second step is often argued by “like” or “matching” arguments, which are inherently tied to observables, not the unobservables that the RCT can control
  So a matching estimator at the first step might do just as well in backing the policy, in spite of its inferiority to the RCT
  “There is, at present, no basis for the popular belief that extrapolation from social experiments is less problematic than extrapolation from observational data. As we see it, the recent embrace of reduced-form social experimentation to the exclusion of structural evaluation based on observational data is not warranted.” Manski and Garfinkel on training programs.

Alternatives?
  I am not arguing for IV estimation, natural experiments, regression discontinuity designs, fixed effect estimation, or even OLS as better than an RCT for project evaluation
    Indeed, these methods often see themselves as mimicking RCTs, but being inferior to them
    I agree
  What I doubt is whether project evaluation itself can lead to scientific progress on issues in economic development
    At least if we confine ourselves to what works as opposed to why it works
    For the latter we need theory, mechanisms, or whatever
  This does not necessarily involve structural estimation as it is usually construed in econometrics, which has problems of its own
    Just that we need a connection to a mechanism of some sort

More on mechanisms
  RCTs not very good at mechanisms: the results are what they are, and accounts of why are often “fairy stories,” ex post rationalizations with no evidence base
  This makes them difficult to generalize, because without the mechanism, hard to assess external validity
  Also hard to assess welfare, because there are often positive and negative accounts that give the same answer

An example from the World Bank research review
  Excellent project on doctor behavior in India and elsewhere, by Jishnu Das, Jeff Hammer, Lant Pritchett and others
  Attempt to find out why private and public doctors behave as they do, the constraints and incentives they face, and the welfare consequences
  Reviewer argued that this work was worthless, and should be replaced by RCTs to find out what works
  RCTs would be an excellent idea here, but at a later stage, when we have some mechanisms to test, and test why things work, and from that we can learn and possibly generalize

Moving to opportunity
  Another persuasive example comes from the MTO experiment, in which people were randomly assigned in city centers in the US to move to “better” places
  Analyzed by a large team of economists and other social scientists
  Interchange in American Journal of Sociology 2008
    Clampet-Lundquist and Massey argue that the results of MTO don’t make sense and don’t give recognition to what they know
    Katz, Kling and Liebman say “you don’t understand selection or what RCTs do,” with some justification
    Sampson “cleans up” in a beautiful paper that explains what is going on in one of the cities, Chicago, why the MTO experiment gets the results it gets, and why Massey is right in spirit if not in detail

Where do we go from here?
  Economics is not project evaluation!
  We need theories and tests of them
  RCTs will often be an excellent way of testing theory
  But there are other methods too
    Such as
  I think structural econometrics often has a role
    Comprehensive, full-information test of a theory
    Trouble is there is often much additional structure, often incredible
  My ideal is the development of theory to the point where it is possible to develop acid tests in a simple non-parametric way
    Hypothetico-deductive method, Popper and beyond
    Requires close interaction between theorists and empiricists
  Medicine may rely on RCTs, physics never uses them

Examples of positive progress
  Taken from a paper I am writing for the Journal of Economic Perspectives
  Saving and growth
    Life-cycle theory predicts that growth drives saving
    Confirmation by Modigliani in 1970s: not obvious and impressive
    Later refutations in simple non-parametric tests
    Which parts of the theory need to be abandoned, and which can we work on
  Commodity prices
    My work with Guy Laroque
    Here structural econometrics helped elucidate non-parametric predictions that we were not smart enough to see in advance
    These were falsified, but again provided suggestions of where to go
  Nutrition in India
    Work with Jean Drèze
    Per capita calorie consumption is falling in spite of rapid growth and upward sloping Engel curves
    Here we are starting from data, but have a mechanism in mind, that real income growth generates improvements in nutrition: this is unlikely to be abandoned but it needs to be supplemented
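
Illustrative sketch (not part of the original slides)
  The "Testing effectiveness" slide notes that the probability of a favorable outcome arising by chance can be computed by combinatorics, as Fisher did. A minimal Python sketch of that idea follows. It is an assumption-laden illustration: the outcome data are made up, the function names are hypothetical, and it approximates the exact combinatorial calculation by repeatedly reshuffling the treatment labels rather than enumerating every assignment.

```python
import random
from statistics import mean


def difference_in_means(outcomes, treated):
    """Average outcome of treated units minus average outcome of controls."""
    treated_outcomes = [y for y, d in zip(outcomes, treated) if d]
    control_outcomes = [y for y, d in zip(outcomes, treated) if not d]
    return mean(treated_outcomes) - mean(control_outcomes)


def randomization_p_value(outcomes, treated, draws=10_000, seed=0):
    """One-sided Monte Carlo version of a Fisher-style randomization test.

    Under the sharp null that treatment changes no one's outcome, the
    outcomes are fixed and only the labels are random, so we reshuffle
    the labels and count how often the reshuffled difference in means is
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = difference_in_means(outcomes, treated)
    labels = list(treated)  # copy, so the caller's list is not modified
    at_least_as_large = 0
    for _ in range(draws):
        rng.shuffle(labels)
        if difference_in_means(outcomes, labels) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (at_least_as_large + 1) / (draws + 1)


# Hypothetical data: eight units, four treated (1) and four controls (0).
outcomes = [12, 15, 14, 18, 9, 11, 10, 13]
treated = [1, 1, 1, 1, 0, 0, 0, 0]

ate, p_value = randomization_p_value(outcomes, treated)
print(f"estimated average treatment effect: {ate:.2f}")
print(f"one-sided randomization p-value:    {p_value:.3f}")
```

  The p-value here tests only the null of no effect for anyone. As the "Estimates and standard errors" slide stresses, it is not a standard error for the average treatment effect, which requires further assumptions.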