Randomized Controlled Trials
Angus Deaton
Research Program in Development Studies
Center for Health and Wellbeing
Princeton University
September 2009
Main messages
 Randomized controlled trials are useful, sometimes uniquely so
 There is no unique “gold standard” claim for RCTs in developing
evidence for policy
 There is no gold standard of any kind
 RCTs get no blanket exemption from the scrutiny or skepticism that
applies to other methods
 General program of “finding out what works” by routine RCTs is
not well-adapted to deepen scientific understanding of
development
 Though it may be useful for other purposes, such as accountability
or auditing
 A progressive scientific research program generally requires the
investigation of mechanisms, why things work, not what works
Outline of these remarks
 Internal issues
 Things to think about when you are doing an RCT
 Using the results
 RCTs in social science versus RCTs in medicine
 The comparison is often made, with medicine seen as an
example that we should follow
 But the situation is a little more complicated
 Useful to think about medicine versus what we do
 External issues
 What are RCTs good for, and not good for?
 The investigation of mechanisms
 Alternatives: project evaluation v. doing economics
Project evaluation
 Think of a project, which might be a school reform, or building a
clinic, a new incentive scheme, provision of some service, a new
HYV, or a drug
 If a “unit” (say person) gets the “treatment” there is a change in
outcome by an amount that varies from person to person
 Emphasize the heterogeneity: very little structure on this problem,
and we want to assume as little as possible to get “credibility”
 If we randomize across persons, and compare average outcome
in treated group v. untreated group, we can estimate the
average treatment effect
 This requires essentially no assumptions, compared with
econometrics, for example
 Randomization guarantees that two groups come from the same
probability law
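The claim above, that comparing group means after randomization recovers the average treatment effect even when effects differ from person to person, can be checked in a small simulation. This is a minimal sketch and not part of the original talk: the function name `simulate_rct`, the effect distribution, and the sample size are all illustrative assumptions.

```python
import random
import statistics

def simulate_rct(n=10_000, seed=0):
    """Simulate a randomized trial with heterogeneous treatment effects.

    Returns (true average effect, difference-in-means estimate).
    All numbers are made up for illustration.
    """
    rng = random.Random(seed)
    # Each unit has its own baseline outcome and its own treatment effect.
    baseline = [rng.gauss(10.0, 2.0) for _ in range(n)]
    effect = [rng.gauss(1.0, 3.0) for _ in range(n)]  # varies person to person

    # Randomize: shuffle the unit indices and treat the first half.
    order = list(range(n))
    rng.shuffle(order)
    treated = set(order[: n // 2])

    y_treated = [baseline[i] + effect[i] for i in range(n) if i in treated]
    y_control = [baseline[i] for i in range(n) if i not in treated]

    true_ate = statistics.fmean(effect)
    estimate = statistics.fmean(y_treated) - statistics.fmean(y_control)
    return true_ate, estimate

if __name__ == "__main__":
    true_ate, estimate = simulate_rct()
    print(f"true ATE = {true_ate:.3f}, estimate = {estimate:.3f}")
```

No model of how the effect varies across people is used anywhere in the estimate; only the random assignment is.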
Testing effectiveness
 If there is a net beneficial outcome, could it have
happened by chance?
 Always possible, obviously so with small numbers of people
 We can calculate the probability of a favorable outcome
having come about by chance using combinatorics, which is
what R. A. Fisher did in the first randomized trials
 So we have a method of finding out whether the project
worked on average, which, in its assumptions, contrasts
favorably with standard econometric practice
 Selection, direction of causality, simultaneity, and so on
 Benefits of randomization versus observational data
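The combinatorial calculation mentioned above can be sketched for a very small trial by listing every way the treatment labels could have been assigned. This is an illustrative sketch, not the procedure of any particular study: the function name `randomization_p_value` is hypothetical, and the null hypothesis is the sharp one that treatment changes no one's outcome.

```python
from itertools import combinations
from statistics import fmean

def randomization_p_value(treated, control):
    """One-sided exact p-value under the null of no effect for anyone.

    Under that null the outcomes are fixed and only the labels are random,
    so we count the assignments whose difference in means is at least as
    large as the one observed.
    """
    pooled = list(treated) + list(control)
    n, k = len(pooled), len(treated)
    total = sum(pooled)
    observed = fmean(treated) - fmean(control)

    at_least_as_large = 0
    n_assignments = 0
    for idx in combinations(range(n), k):
        t_sum = sum(pooled[i] for i in idx)
        diff = t_sum / k - (total - t_sum) / (n - k)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_large += 1
        n_assignments += 1
    return at_least_as_large / n_assignments

if __name__ == "__main__":
    # Four treated units all succeed, four controls all fail.
    print(randomization_p_value([1, 1, 1, 1], [0, 0, 0, 0]))
```

With four treated and four control units there are 70 possible assignments and only one is as extreme as the observed one, so the p-value is 1/70. Full enumeration is only feasible for small trials; larger ones would sample assignments at random instead.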
An important example
 ISIS-2 trial of giving people aspirin immediately after an
MCI (heart attack)
 17,000 patients over several studies (this size is needed)
 Even bottom of confidence interval would save hundreds of
thousands of lives worldwide
 Doctors did not take this treatment seriously prior to the trial
 Within two years of publication in 1988, use in UK went from
10% to 90% (much less in the US, where it is still a problem!)
 Hard to see how these results would have been obtained
in any other way
 Effect is small enough to be a problem both for small RCTs,
and for observational studies
 Don’t need RCT for tobacco, because effect is so large
What does an RCT not tell us?
 Informative about the mean, but not about any other characteristic of the
distribution of treatment effects, e.g. the median, or the fraction of
people who benefit, or lose
 Policymakers are often interested in these
 Does yield the full distribution of outcomes for both treatments and
controls
 For some purposes, this might be enough
 If one distribution first-order stochastically dominates the other
 Another aspirin example: low dose regime
 RCTs show a net reduction in mortality
 But it kills some and saves some
 Public health perspective says do it
 Individual doctor or patient perspective is much less clear
 If there are two groups like this, the RCT average applies to no one!
 In the MCI aspirin example, the effects are in the same direction for everyone, or at
least for broad classes of people
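The point that a trial pins down the two outcome distributions but not the share of people helped or harmed can be shown with two made-up populations. Both are hypothetical and chosen only for the arithmetic; the names `SCENARIO_A`, `SCENARIO_B` and `summarize` are not from the talk.

```python
from statistics import fmean

# Each unit is a pair (outcome if untreated, outcome if treated); 1 = survives.
# Both populations are hypothetical, chosen only to illustrate the point.
SCENARIO_A = [(1, 1)] * 50 + [(0, 1)] * 10 + [(0, 0)] * 40  # nobody is harmed
SCENARIO_B = [(1, 1)] * 30 + [(0, 1)] * 30 + [(1, 0)] * 20 + [(0, 0)] * 20

def summarize(population):
    """Separate what a trial can estimate from what it cannot."""
    control_rate = fmean([y0 for y0, _ in population])
    treated_rate = fmean([y1 for _, y1 in population])
    return {
        # Identified by an RCT: the two survival rates and their difference.
        "control_rate": control_rate,
        "treated_rate": treated_rate,
        "average_effect": treated_rate - control_rate,
        # Not identified: these need both outcomes for the same person.
        "share_helped": fmean([1 if y1 > y0 else 0 for y0, y1 in population]),
        "share_harmed": fmean([1 if y1 < y0 else 0 for y0, y1 in population]),
    }

if __name__ == "__main__":
    for name, pop in (("A", SCENARIO_A), ("B", SCENARIO_B)):
        print(name, summarize(pop))
```

In both scenarios survival is 50 percent without treatment and 60 percent with it, so a trial would report the same average effect. Yet in the second scenario one person in five is harmed, and nothing in the trial data distinguishes the two cases.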
Estimates and standard errors
 The RCT gives us a mean treatment effect
 This is not worth much without a standard error: the oft-heard reply that the
estimate is fine and “only” the standard error is a problem is nonsense
 This is not the same as the p-value for the null hypothesis that the treatment
has no effect, which can be done without additional assumptions
 Standard errors cannot be obtained non-parametrically
 Unless we bootstrap the RCT!
 We need to make the sort of assumptions that advocates of RCTs don’t like
using other methods
 Not clear how much better-off we are: at least diminishes the benefit
 Example, regressing outcomes on treatment dummy gives the wrong standard
error
 Heteroskedasticity correction gives a t-value that does not have the t-
distribution (Fisher-Behrens problem)
 Much more attention needs to be given to these issues, in medicine as well as
economics
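The standard-error issues above can be made concrete by computing three candidates on the same data. This is a sketch under stated assumptions, not a recommendation from the talk: the function names are hypothetical and the data are simulated. The pooled-variance formula is the one a classical OLS regression of the outcome on a treatment dummy reports, and the bootstrap used here assumes independent draws within each arm.

```python
import math
import random
from statistics import fmean, stdev, variance

def pooled_se(treated, control):
    """Equal-variance SE, as from classical OLS on a treatment dummy."""
    n1, n0 = len(treated), len(control)
    pooled_var = ((n1 - 1) * variance(treated)
                  + (n0 - 1) * variance(control)) / (n1 + n0 - 2)
    return math.sqrt(pooled_var * (1 / n1 + 1 / n0))

def unequal_variance_se(treated, control):
    """SE that lets the two arms have different variances."""
    return math.sqrt(variance(treated) / len(treated)
                     + variance(control) / len(control))

def bootstrap_se(treated, control, reps=2000, seed=0):
    """Resample each arm with replacement and take the spread of the estimates."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        diffs.append(fmean(t) - fmean(c))
    return stdev(diffs)

def demo(seed=1):
    """Simulated arms with unequal sizes and unequal spreads (made-up numbers)."""
    rng = random.Random(seed)
    treated = [rng.gauss(1.0, 5.0) for _ in range(30)]   # small, noisy arm
    control = [rng.gauss(0.0, 1.0) for _ in range(300)]  # large, quiet arm
    return (pooled_se(treated, control),
            unequal_variance_se(treated, control),
            bootstrap_se(treated, control))

if __name__ == "__main__":
    print("pooled, unequal-variance, bootstrap:", demo())
```

When the smaller arm is also the noisier one, the pooled figure is far too small, while the other two roughly agree. Even then, the ratio of the estimate to the unequal-variance standard error does not follow an exact t-distribution, which is the difficulty the slide refers to.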
Reducing standard errors
 The effects RCTs are used to estimate tend to be relatively small
 If they were large, we would not need a trial
 Large trials are typically necessary
 Which are expensive
 Especially with “saturation” experiments, where units are schools or villages,
and not individuals, because of interactions (typically not an issue with medical
trials)
 Reduce variance by using baseline information as covariates
 E.g. regression on treatment dummy with covariates
 This leads to bias in the treatment effect: can be important with small numbers
 Generally biases the standard errors
 Opens up to charges of data mining: choose covariates until you get a significant
(or insignificant) result, whichever you are looking for
 Again, more work needs to be done here to guide practice
 Pseudo-randomization, e.g. alphabetization, is likely to introduce bias and
makes it impossible to calculate standard errors
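The remark above that small effects call for large trials, and more so when whole schools or villages are randomized, can be put into rough numbers. The formulas below are the textbook normal-approximation sample size for comparing two means and the usual design effect for equal-sized clusters; neither comes from the talk, and the function names and input values are illustrative assumptions.

```python
import math
from statistics import NormalDist

def n_per_arm(effect, sd, alpha=0.05, power=0.80):
    """Units needed in each arm to detect `effect` with a two-sided test.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{power})^2 * sd^2 / effect^2.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 * sd ** 2 / effect ** 2)

def design_effect(cluster_size, icc):
    """Variance inflation when whole clusters are randomized.

    `icc` is the intra-cluster correlation of the outcome (an assumed input).
    """
    return 1 + (cluster_size - 1) * icc

if __name__ == "__main__":
    base = n_per_arm(effect=0.1, sd=1.0)
    print("individual randomization, per arm:", base)
    inflated = math.ceil(base * design_effect(cluster_size=50, icc=0.05))
    print("villages of 50 with icc 0.05, per arm:", inflated)
```

Detecting an effect of a tenth of a standard deviation needs about 1,570 people per arm under these assumptions, and randomizing villages of 50 with even a modest intra-cluster correlation multiplies that more than threefold.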
RCTs in medicine?
 Most doctors will tell you yes they are the gold standard: same at NIH
 Yet there are serious concerns
 Ethics and IRBs
 Influence of money: selective publication of results, and other evils
 Cost and timeliness
 Public health versus individual perspectives
 80 percent of oncology trials are unfilled: people will not participate
 Populations who participate in trials are strange in some way
 Exclusion of co-morbidities makes extrapolation hazardous
 Undoing of blinding, which we economists don’t even try to do
 Meta-studies sometimes contradicted by later trials
 Argued that good observational studies produce the same results
 Led to extensive funding for CER in ARRA in the US
 Chairman of Stanford Medical School’s Department of Medicine looks forward
to a world in which there are no randomized clinical trials
 Happy to talk further about these in the discussion period
Using the results from RCTs
 External validity
 Which I will come to
 Using the results in the original experimental population
 This would seem to be favorable
 Yet mean effect is often not very useful for individuals, even those in the trial
 Treatment for everyone might improve social welfare, but that does not imply
that any individual should be treated
 Subgroup analysis of trials is prevalent (and makes sense), but it is subject to
data mining
 Astrological signs in ISIS-2
 If all effects have the same sign, that helps, better still if they are all the same
 Once again, we are trying to reduce heterogeneity, by some sort of modeling
 Regime aspirin
 If we confine it to those who are at risk of heart attacks on other grounds, lose
many of those who will die from it
 Mechanism is thinning of blood, so we can understand how to divide
 Can’t learn this from the trial itself without some theoretical understanding of
what is going on
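The data-mining risk in subgroup analysis noted above can be illustrated by simulating trials in which the treatment does nothing at all and then testing twelve arbitrary subgroups, one per star sign. This is a sketch with made-up settings, not a reanalysis of ISIS-2; the function names, subgroup sizes, and the 1.96 cutoff are assumptions.

```python
import math
import random
from statistics import fmean, variance

def chance_of_false_alarm(n_tests, alpha=0.05):
    """Probability that at least one of n independent null tests rejects."""
    return 1 - (1 - alpha) ** n_tests

def any_subgroup_significant(rng, n_subgroups=12, per_arm=50, z_crit=1.96):
    """One simulated trial with NO true effect, tested subgroup by subgroup."""
    for _ in range(n_subgroups):
        treated = [rng.gauss(0.0, 1.0) for _ in range(per_arm)]
        control = [rng.gauss(0.0, 1.0) for _ in range(per_arm)]
        se = math.sqrt(variance(treated) / per_arm + variance(control) / per_arm)
        if abs(fmean(treated) - fmean(control)) / se > z_crit:
            return True  # some subgroup looks "significant" purely by chance
    return False

def false_alarm_rate(n_trials=1000, seed=0):
    """Share of null trials in which at least one subgroup crosses the cutoff."""
    rng = random.Random(seed)
    hits = sum(any_subgroup_significant(rng) for _ in range(n_trials))
    return hits / n_trials

if __name__ == "__main__":
    print("analytic:", round(chance_of_false_alarm(12), 3))
    print("simulated:", false_alarm_rate())
```

With twelve independent subgroups the chance of at least one spurious finding is a little under one half, which is why a subgroup result needs a mechanism behind it before it is believed.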
Why might RCT results not apply?
 General equilibrium effects are well-known, but other threats even in the population as
a whole
 Population subject to randomization may be different
 E.g. no elderly people included, or India is not the same as the Philippines
 No one with co-morbidities included in the trial, but those who get the treatment often
have co-morbidities (trade-off between internal and external validity)
 Not everyone wants to participate in a trial: e.g. if there is risk involved, selects more risk
averse
 People running the trial almost always different from those who would administer it more
generally (true even in agricultural crop research)
 People in school now will be different from those in school if the educational system is
changed
 Would have to randomize from birth, or even before
 Attrition or refusals, though there are various techniques for dealing with this
 Randomization may have failed—and there are many practical difficulties in the field—
and there is no way of testing that it did work within the RCT framework because we can
only check on observables
 Need to apply exactly the same expert skepticism to RCTs as other studies: no free pass!

The order of rigor is irrelevant
 Comes from the philosopher Nancy Cartwright
 If we are to use evidence in policy, we need to
 Develop the evidence, e.g. from an RCT
 Argue that it applies to the population that we want to treat
 The overall quality of the evidence depends on the weakest of these two steps;
that the first step is rigorous and convincing does not help if the second one is
weak
 The second step is often argued by “like” or “matching” arguments, that are
inherently tied to observables, not the unobservables that the RCT can control
 So a matching estimator at the first step might do just as well in backing the policy, in spite of
its inferiority to the RCT
 “There is, at present, no basis for the popular belief that extrapolation from
social experiments is less problematic than extrapolation from observational
data. As we see it, the recent embrace of reduced-form social experimentation
to the exclusion of structural evaluation based on observational data is not
warranted.” Manski and Garfinkel on training programs.
Alternatives?
 I am not arguing for IV estimation, natural experiments,
regression discontinuity designs, fixed effect estimation, or even
OLS as better than an RCT for project evaluation
 Indeed, these methods often see themselves as mimicking RCTs, but
being inferior to them
 I agree
 What I doubt is whether project evaluation itself can lead to
scientific progress on issues in economic development
 At least if we confine ourselves to what works as opposed to why it
works
 For the latter we need theory, mechanisms, or whatever
 This does not necessarily involve structural estimation as it is
usually construed in econometrics, which has problems of its own
 Just that we need a connection to a mechanism of some sort
More on mechanisms
 RCTs not very good at mechanisms: the results are what they are, and
accounts of why are often “fairy stories,” ex post rationalizations with
no evidence base
 This makes them difficult to generalize, because without the
mechanism, hard to assess external validity
 Also hard to assess welfare, because there are often positive and
negative accounts that give the same answer
 An example from the World Bank research review
 Excellent project on doctor behavior in India and elsewhere, by Jishnu
Das, Jeff Hammer, Lant Pritchett and others
 Attempt to find out why private and public doctors behave as they do,
the constraints and incentives they face, and the welfare consequences
 Reviewer argued that this work was worthless, and should be replaced
by RCTs to find out what works
 RCTs would be an excellent idea here, but at a later stage, when we
have some mechanisms to test, and test why things work, and from
that we can learn and possibly generalize
Moving to opportunity
 Another persuasive example comes from the MTO
experiment, in which people were randomly assigned in
city centers in the US to move to “better” places
 Analyzed by a large team of economists and other social
scientists
 Interchange in American Journal of Sociology 2008
 Clampet-Lundquist and Massey argue that the results of MTO
don’t make sense and don’t give recognition to what they
know
 Katz, Kling and Liebman say “you don’t understand selection
or what RCTs do,” with some justification
 Sampson “cleans up” in a beautiful paper that explains what
is going on in one of the cities, Chicago, why the MTO
experiment gets the results it gets, and why Massey is right in
spirit if not in detail
Where do we go from here?
 Economics is not project evaluation!
 We need theories and tests of them
 RCTs will often be an excellent way of testing theory
 But there are other methods too
 Such as
 I think structural econometrics often has a role
 Comprehensive, full-information test of a theory
 Trouble is there is often much additional structure, often incredible
 My ideal is the development of theory to the point where it is
possible to develop acid tests in a simple non-parametric way
 Hypothetico-deductive method, Popper and beyond
 Requires close interaction between theorists and empiricists
 Medicine may rely on RCTs, physics never uses them
Examples of positive progress
Taken from a paper I am writing for the Journal of Economic Perspectives
 Saving and growth
 Life-cycle theory predicts that growth drives saving
 Confirmation by Modigliani in 1970s: not obvious and impressive
 Later refutations in simple non-parametric tests
 Which parts of the theory need to be abandoned, and which can we work on
 Commodity prices
 My work with Guy Laroque
 Here structural econometrics helped elucidate non-parametric predictions that we were
not smart enough to see in advance
 These were falsified, but again provided suggestions of where to go
 Nutrition in India
 Work with Jean Drèze
 Per capita calorie consumption is falling in spite of rapid growth and upward sloping Engel
curves
 Here we are starting from data, but have a mechanism in mind, that real income growth
generates improvements in nutrition: this is unlikely to be abandoned but it needs to be
supplemented