FIELD EXPERIMENTS
Prof Samek
Outline
• Experiments: What and Why?
• Sample arrangements
• Sample size & statistical power
• Unit of randomization & intracluster correlation
• Randomization designs – Blocking & balance
• Incorporating insights from behavioral economics into
intervention design
Economics
• Economics is a social science that studies the choices
that individuals, firms, governments, societies make,
given:
• scarcity
• incentives
• Interest in causation rather than mere correlation
• Positive, prescriptive, normative economics
• Three core principles in econ
• Optimization (people choose rationally)
• Equilibrium (state where no agent benefits from changing)
• Empiricism (testing with data/empirical analysis)
Experiments: What & Why?
• To measure the effect of intervention “X” on outcome “Y”
you compare avg. “Y” among people who receive “X” to
avg. “Y” among people who don’t receive “X”
• Ex. X=information about a charity’s effectiveness; Y= donations
• Non-experimental methods
• Use naturally occurring variation across people, institutions,
locations, time, etc.
• Randomized controlled experiment
• Experimenter randomly assigns who receives intervention
(“treatment”) and who doesn’t (“control”)
Data: Correlation
[Figure: scatter plot of violent crimes per 1,000 vs. police per 1,000]
Case Study on Financial Education
• We study the ability of associates to make better choices
about how they manage finances. We advertise a new
educational program – Grow Your Wealth! 5,000 qualified
associates are invited to join the program; 2,000 are
interested in learning to save money and decide to
participate.
• All 2,000 enroll and do very well in the program.
• At the end of the study, we compare the group that
participated (2,000) with the group that didn't (3,000)
• Our 2,000 have clearly done better!
• Is the program a success?
What Really Happened?
• Selection bias caused our results to be overstated
• We do not know whether the program or the people caused
the effect
"I joined the program because I want to start saving money…
I would probably have saved money anyway, but it helped me!"
"I didn't join the program because I don't care about saving
money right now. I would rather buy a flat-panel LCD
television."
How to Get Cause & Effect
• This is a Randomized Field Experiment
• 2,000 chose to participate in the program
• These 2,000 associates are located in 5 different cities
• After associates have expressed their interest, select 2 (A
and B) of the 5 cities and offer Grow Your Wealth! as a
pilot
• Compare all those who wanted to participate in cities A
and B with those who wanted to participate in cities C, D
and E
• Now we have removed selection bias – so we know the
program caused the effect
Control is Key
• In a Field Experiment, we control who gets offered the
intervention
• The “Control group” is the group for comparison
• They wanted to participate, but were not able to
• The “Treatment group” is the group of interest
• They wanted to participate, and were able to
• What about non-participants?
• We may want to learn how the program will affect those
who don’t want to participate
• But we cannot force people…must use incentives
Behavioral Interventions
• How to get more participants in the study?
• Now vs. later – bounded willpower: the consequences of
not saving money are far in the future
• Bring "now" incentives to motivate behavior
• Include a commitment device
• Provide social feedback
"I didn't join the program because I don't care about saving
money right now. I would rather have a big-screen TV. I will
save later."
Experiments: What & Why?
Why do we want randomization and a control group?
• Randomization: Avoids "selection" into treatment
→ Only difference between groups is the intervention
• Control: Tells you what would have happened without
intervention (“counterfactual”)
So you want to run an experiment . . .
Basic elements
(1) Treatment design
(2) Sample sizes & randomization design
(3) Implementation & addressing concerns
(4) Outcomes & analysis
• First, we will focus on (2)
• Then we will turn to (1), incorporating a
discussion of (2)-(4)
• (3)-(4) will be covered in more depth this week
SAMPLE ARRANGEMENTS
Sample arrangements
• Two related questions:
• How many people do I need in the experiment?
• I have X number of people, how many different
treatments should I test?
• Answer depends on:
• How large you think the treatment effect(s) are
• Your unit of randomization
• Your randomization design
Minimum detectable effect
• Treatment effect: how much the treatment
moves the outcome
• Measure by comparing mean outcome in treatment
group vs. control group
• We often think of treatment effects in terms of
the outcome's standard deviation (e.g., s.d. for
height (w/in gender) = 3 in.; for IQ = 15 pts.)
• Basic rule: The larger you think the treatment
effect is, the fewer people you need in the
experiment
Minimum detectable effect
• Example: If you assign 64 people to treatment and
64 to control and the true treatment effect is ½ a
standard deviation, then there is an 80% probability
that when you compare the mean outcome in
treatment to control the difference will be statistically
significant at the 5% level (using a two-sided t-test).
• Sample size n = 64 per group
• Minimum detectable effect δ = 0.5 s.d.
• Significance level α = 0.05
• Power level 1-β = 0.80
Minimum detectable effect
• To decide on sample sizes and/or how many treatments
you should test, ALWAYS CONDUCT A POWER TEST to
determine what size treatment effect you can detect
• Decide beforehand what size effect is worth detecting.
• Don’t conduct “underpowered” experiments – i.e., that
don’t have large enough samples to detect effects of
interest.
• If the experiment is underpowered and you don’t find a
significant effect, you don’t know if it’s because the
intervention doesn’t work or it does work but you needed
more people
POWER TESTS
Errors in Hypothesis Testing
• All tests have some probability of making errors
• Type 1 Error – probability of making it: α
• Incorrectly reject H0 that treatment = control
• Conclude means are different when they are not
• Type 2 Error – probability of making it: β
• Incorrectly fail to reject H0 that treatment = control
• Conclude means are not different when they are
α, when H0 is X=Y
• α is the probability of making a Type 1 error - rejecting H0
in error
• The smaller the α, the lower the probability of a Type 1 error
• For instance, α = 0.01 means a 1% probability of a Type 1 error
• α = 0.10 means a 10% probability of a Type 1 error
• Typically we care about α = 0.05
• Suppose a t-test for X=Y yields p = 0.06
• Then we reject H0 at the 10% level, but not at the 5% level.
β
• β is the probability of making a Type 2 error – failure to
reject H0 when it is false
• Suppose that you have sample of 100 subjects in the
experiment
• 50 go into control group
• 50 go into treatment group
• At the end of the intervention, you conduct a t-test for
Control = Treatment, and find that p=0.11
• So – you cannot publish the paper.
Have you done a power test?
• 1-β is the power of your experiment
• Doing a power test gives you a better understanding of
the sample size that you will need
• Typical power selection is 1-β = 0.8 (0.2 probability of Type II
error)
• Another choice is 1-β = 0.9 (0.1 probability of
Type II error)
Simple formula for calculating sample sizes
• Assuming equal variances σ1² = σ2²:
n0* = n1* = n* = 2(tα/2 + tβ)² (σ/δ)²
• Effect size δ:
• Sample size depends on the ratio of effect size to standard
deviation.
• Hence, effect sizes can just as easily be expressed in standard
deviations.
• Necessary sample size:
• Increases with desired significance level and power
• Increases proportionally with variance of outcomes
• Decreases with the square of the minimum detectable effect
size (i.e., proportional to 1/δ²)
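A minimal sketch of this formula in Python (using the normal
approximation, so z-values in place of t-values; the function name is
ours):

from scipy.stats import norm

def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.80):
    # Per-arm n for a two-sided test of two means with equal
    # variances: n = 2 * (z_{alpha/2} + z_beta)^2 * (sigma/delta)^2
    z_alpha = norm.ppf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = norm.ppf(power)           # 0.84 for power = 0.80
    return 2 * (z_alpha + z_beta) ** 2 * (sigma / delta) ** 2

print(sample_size_per_arm(delta=1, sigma=1))    # ~15.7 per cell
print(sample_size_per_arm(delta=0.5, sigma=1))  # ~62.8, rounded up to 64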
Rules of Thumb
• Standard is to use α=0.05 and have power of 0.80
(β=0.20).
• So if we want to detect a one-standard deviation
change using the standard approach, we would need:
• n = 2(1.96 + 0.84)² × (1)² ≈ 15.68 observations in each
cell
• ½ std. dev. change is detectable with 4*15.68 ~ 64
observations per cell
• n=30 seems to be the magic number in many lab
experiment studies: ~ 0.70 std. dev. change.
Power Test in STATA
• In STATA, type in “help sampsi”
• Two-sample comparisons of M1 and M2
• One sample comparison of M to hypothesized value
• Two sample comparison of proportions
• One sample comparison of proportion to hypothesized value
• Typical code:
• sampsi mean_1 mean_2, p(0.8) sd1(sd1) sd2(sd2)
• sampsi prop_1 prop_2, power(0.8)
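For those working outside Stata: a sketch of the same calculations in
Python with statsmodels (note that newer versions of Stata replace
sampsi with the power command):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Per-group n to detect a 0.5 s.d. effect at alpha = 0.05, power = 0.80
n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8,
                         alternative='two-sided')
print(n)  # ~63.8 -> 64 per group, matching the earlier example

# Or solve for the power you get with 50 per group
p = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)
print(p)  # ~0.70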
Determining Effect Size
• Determine M0, M1, S0, S1
• Look at related literature to approximate your
mean/standard deviation
• Run a pilot test and use this data to estimate the
mean/standard deviation
• Can use existing/historical data for the M0
• Can begin experiment and adjust sampling as needed
(carefully)
• Determine Effect Size of Interest
• Talk to the practitioner about desired effect size
• Avoid the under-powered experiment!
Effects of “economic significance”
• Often confused with statistical significance
• Answers the question:
• “So what”?
• “Why do we care”?
• Suggestions:
• Compare size of effect with effects of related programs
• Provide evidence that the dependent variable is predicted to
move across economically important threshold (move
households above poverty threshold; move children above
failing)
• Cost-benefit analysis of the program
• “½ standard deviation change seems pretty good”
• Other ideas?
LEVEL OF RANDOMIZATION
Intra-Cluster Correlation
• The level of randomization may differ from the unit of observation
• Randomization at grocery store level, outcomes observed at
individual consumer level
• Randomization at the school level, outcomes observed at the child
level
• Randomization at the city level, outcomes observed at the
consumer level
• Example:
• 4 cities randomized across treatments, same number of consumers
in each city
• Responses of consumers may be correlated within the city
Intra-Cluster Correlation
• Real Sample Size (RSS) = mk/CE
m = number of subjects in a cluster
k = number of clusters
CE = 1 + ρ(m-1) (the "design effect")
ρ = intracluster correlation coefficient
= s²B/(s²B + s²W)
s²B = variance between clusters
s²W = variance within clusters
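A minimal sketch of these formulas in Python (function names are
ours):

def design_effect(m, rho):
    # CE = 1 + rho * (m - 1)
    return 1 + rho * (m - 1)

def real_sample_size(m, k, rho):
    # RSS = m*k / CE for k clusters of m subjects each
    return m * k / design_effect(m, rho)

print(real_sample_size(m=784, k=4, rho=0.0))  # 3136: no adjustment
print(real_sample_size(m=784, k=4, rho=1.0))  # 4: just the clusters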
Intracluster Correlation
• What does ρ → 0 mean?
• No correlation of responses within a cluster
► No need to adjust optimal sample sizes
• What does ρ → 1 mean?
• All responses within a cluster are identical
► Large adjustment needed: RSS is reduced to
the number of clusters
Example
• Pilot testing finds that ρ = 0.04
• We wish to detect a 1/10 standard deviation change
• What sample size do we need?
• If ρ → 0, using n = 2(tα/2 + tβ)² (σ/δ)² we would need
n = 15.68 × 100 = 1,568 at each level; 3,136 total.
Required Sample Size
• Effective sample size if we run those 3,136 subjects in 4
clusters of 784:
RSS = mk/CE = (784 × 4)/(1 + 0.04(784-1)) ≈ 97!
• What is the required sample size once clustering is
accounted for?
n = 2(tα/2 + tβ)² × 100 × (1 + 783(0.04))
= 15.68 × 3,232
(compare ρ → 0: 15.68 × 100)
≈ 50,678 at each treatment level!
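Reproducing the worked example in Python (a sketch; 15.68 is
2(1.96 + 0.84)² from the rules of thumb above):

rho, m, k = 0.04, 784, 4
ce = 1 + rho * (m - 1)        # cluster effect: 32.32

# Effective sample size of the naive design (3,136 in 4 clusters)
print(m * k / ce)             # ~97

# Per-arm n required for a 1/10 s.d. effect once clustering is
# accounted for: unadjusted n (15.68 * 100) scaled up by CE
print(15.68 * 100 * ce)       # ~50,678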
So why would you ever cluster?
• Logistics
• Spillovers
RANDOMIZATION DESIGN
Randomized Designs
• Fully random design with heterogeneous participants:
• Reduces likelihood that treatment is correlated with
observables
• May have high variance
• Initial groupings may be heterogeneous
• How to reduce variance of unobserved component?
• Include controls in final regression (age, gender)
• Control for observables in advance of
randomization - blocking
Blocked Design – Decreases Variance
• Select the observables of interest to block on
• Block (group) subjects by these observables
• Randomize within each block to the treatments
• This treats subject characteristics as additional
treatments (see the sketch below)
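A minimal sketch of within-block randomization in Python (the subject
pool and blocking covariates are hypothetical):

import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Hypothetical subject pool with two blocking observables
subjects = pd.DataFrame({
    "id": range(120),
    "gender": rng.choice(["F", "M"], size=120),
    "age_group": rng.choice(["18-34", "35-54", "55+"], size=120),
})

subjects["arm"] = ""
for _, idx in subjects.groupby(["gender", "age_group"]).groups.items():
    n = len(idx)
    # Split each block half treatment / half control; any odd
    # leftover subject lands in an arm at random
    labels = np.array(["T"] * (n // 2) + ["C"] * (n - n // 2))
    rng.shuffle(labels)
    subjects.loc[idx, "arm"] = labels

# Arms are (near-)balanced within every block
print(subjects.groupby(["gender", "age_group", "arm"]).size())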
Special Case of Blocking
• Within subjects design also “blocks” fully on all
characteristics of the subject
• Disadvantage: treating a subject multiple times may result
in complicated interactions between treatments and yield a
different parameter estimate
• Crossover design: randomizing the order of treatment for
within-subject experiments
• Still need caution when interpreting
Balanced Design
• After blocking, no need to “balance” on those same
observables
• May wish to balance on other observables
• Balancing involves t-tests comparing each observable
across treatments, before the experiment (and potentially
re-randomizing to achieve better balance)
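A sketch of such balance checks in Python (data frame and column
names are illustrative):

from scipy.stats import ttest_ind

def balance_table(df, arm_col, covariates):
    # t-test each observable across treatment vs. control arms
    treat = df[df[arm_col] == "T"]
    ctrl = df[df[arm_col] == "C"]
    for cov in covariates:
        t, p = ttest_ind(treat[cov], ctrl[cov])
        print(f"{cov}: t = {t:.2f}, p = {p:.3f}")

# e.g., balance_table(subjects, "arm", ["age", "income"])
# Large p-values suggest the arms look similar on that observable;
# small ones may prompt re-randomization before the experiment begins.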
A warning about blocking
• Once you start “messing” with the randomization, you can
inadvertently introduce bias
• For example, suppose you create pairs of people
matched on gender and first language and then assign
one to treatment and one to control.
• Often there are "leftovers" with no match (e.g., the only
Chinese-speaking woman in the sample)
Sample Arrangements - Conclusion
(1) Run power tests when designing your experiment. Don’t run an
underpowered experiment!
(2) The larger you expect the treatment effect to be, the fewer
people you need
(3) Be aware of the effect of clustering on needed sample sizes
(4) Randomization techniques that help balance your sample can
increase power but are trickier to do correctly
(5) Run as few treatments as possible. Resist the temptation to add
in more than is absolutely necessary.
Conclusion
• For a more in-depth discussion, see List, John, Sally Sadoff
and Mathis Wagner (2011). "So you want to run an
experiment, now what?" Experimental Economics, 14(4):
439-57
Implementation: Keep it real
• The actual population in the actual context
• Are they representative?
• Integrate into what is already taking place
• Minimize disruption & “experimenter demand” effects
• Closer to actual policy
• BUT you can also learn a lot from preliminary studies in
more lab-like contexts
Outcomes & Analysis
Ideally pre-defined:
• Main outcome(s)
• Other outcomes of interest – e.g., longer term, potential
mechanisms, spillovers, survey measures, etc.
• Within experiment outcomes – e.g., process, compliance,
attrition, etc.
Example: Flu vaccine
[Diagram: nursing homes randomized to a staff flu vaccine
program vs. no change; outcomes measured along the chain:
staff vaccination rates → resident flu infections → resident
mortality]
Address concerns
• Experiments are logistically difficult, disruptive and costly
• Most experiments “fail” – experiments are brutally honest!
• Experimentation is unethical or unfair
Address concerns
• Design experiments that integrate relatively easily into what is
already occurring
• Look for natural opportunities to experiment (pilots, new program,
gradual rollout, etc.)
• A null result is not a failure. Important to learn what doesn’t
work (vs. cost of running an ineffective program)
• Design experiments and include outcomes that help you learn about
mechanisms
• Design experiments to minimize fairness concerns (e.g., unit
of randomization)
• People are often familiar and comfortable with the idea of pilot
programs, limited space, etc., especially when we're not sure yet if
something works.