FIELD EXPERIMENTS
Prof Samek

Outline
• Experiments: What and why?
• Sample arrangements
• Sample size & statistical power
• Unit of randomization & intracluster correlation
• Randomization designs – blocking & balance
• Incorporating insights from behavioral economics into intervention design

Economics
• Economics is a social science that studies the choices that individuals, firms, governments, and societies make, given:
  • scarcity
  • incentives
• Interest in causation rather than mere correlation
• Positive, prescriptive, and normative economics
• Three core principles in economics:
  • Optimization (people choose rationally)
  • Equilibrium (a state where no agent benefits from changing)
  • Empiricism (testing with data/empirical analysis)

Experiments: What & Why?
• To measure the effect of intervention "X" on outcome "Y", you compare avg. "Y" among people who receive "X" to avg. "Y" among people who don't receive "X"
  • Ex. X = information about a charity's effectiveness; Y = donations
• Non-experimental methods
  • Use naturally occurring variation across people, institutions, locations, time, etc.
• Randomized controlled experiment
  • Experimenter randomly assigns who receives the intervention ("treatment") and who doesn't ("control")

Data: Correlation
[Figure, shown on two slides: scatter plot of violent crimes per 1,000 against police per 1,000 – correlation alone cannot tell us the direction of causation.]

Case Study on Financial Education
• We study the ability of associates to make better choices about how they manage finances.
• We advertise a new educational program – Grow Your Wealth!
• 5,000 qualified associates are invited to join the program; 2,000 are interested in learning to save money and decide to participate.
• All 2,000 are invited to join and do very well in the program.
• At the end of the study, we compare the group that participated (2,000) with the group that didn't (3,000).
• Our 2,000 have clearly done better! Is the program a success?

What Really Happened?
• Selection bias: caused our results to be over-stated
• We do not know whether the program or the people caused the effect
"I joined the program because I want to start saving money… I would probably have saved money anyway, but it helped me!"
"I didn't join the program because I don't care about saving money right now. I would rather buy a flat-panel LCD television."

How to Get Cause & Effect
• This is a randomized field experiment:
  • 2,000 chose to participate in the program
  • These 2,000 associates are located in 5 different cities
  • After associates have expressed their interest, select 2 of the 5 cities (A and B) and offer Grow Your Wealth! as a pilot
  • Compare all those who wanted to participate in cities A and B with those in cities C, D, and E
• Now we have removed selection bias – so we know the program caused the effect
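To make the mechanics concrete, here is a minimal sketch of this kind of randomization in Stata (the package used later in these notes). The setup is hypothetical, not from the actual program: a dataset with one observation per city.

    * hypothetical dataset: one observation per city (A-E)
    set seed 12345                   // fix the seed so the draw is reproducible
    generate double u = runiform()   // independent uniform draw for each city
    sort u                           // order the cities at random
    generate byte treat = (_n <= 2)  // first 2 of the 5 cities get the pilot

The treatment indicator can then be merged back onto the associate-level data. Because assignment depends only on the random draw, whether an interested associate ends up in the pilot no longer depends on anything about the associate.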
Control is Key
• In a field experiment, we control who gets offered the intervention
• The "control group" is the group for comparison
  • They wanted to participate, but were not able to
• The "treatment group" is the group of interest
  • They wanted to participate, and were able to
• What about non-participants?
  • We may want to learn how the program will affect those who don't want to participate
  • But we cannot force people… we must use incentives

Behavioral Interventions
• How do we get more participants in the study?
• Now vs. later – bounded willpower
  • The consequences of not saving money are far in the future
  • Bring "now" incentives to motivate behavior
  • Include a commitment
  • Social feedback
"I didn't join the program because I don't care about saving money right now. I would rather have a big-screen TV. I will save later."

Experiments: What & Why?
Why do we want randomization and a control group?
• Randomization: avoids "selection" into treatment
  • The only difference between groups is the intervention
• Control: tells you what would have happened without the intervention (the "counterfactual")

So you want to run an experiment . . .
Basic elements:
(1) Treatment design
(2) Sample sizes & randomization design
(3) Implementation & addressing concerns
(4) Outcomes & analysis
• First, we will focus on (2)
• Then we will turn to (1), incorporating a discussion of (2)-(4)
• (3)-(4) will be covered in more depth this week

SAMPLE ARRANGEMENTS

Sample arrangements
• Two related questions:
  • How many people do I need in the experiment?
  • I have X number of people; how many different treatments should I test?
• The answer depends on:
  • How large you think the treatment effect(s) are
  • Your unit of randomization
  • Your randomization design

Minimum detectable effect
• Treatment effect: how much the treatment moves the outcome
  • Measured by comparing the mean outcome in the treatment group vs. the control group
• We often think of treatment effects in terms of the outcome's standard deviation (e.g., s.d. of height (within gender) = 3 in.; of IQ = 15 pts.)
• Basic rule: the larger you think the treatment effect is, the fewer people you need in the experiment

Minimum detectable effect
• Example: If you assign 64 people to treatment and 64 to control and the true treatment effect is ½ a standard deviation, then there is an 80% probability that when you compare the mean outcome in treatment to control, the difference will be statistically significant at the 5% level (using a two-sided t-test).
  • Sample size n = 64 per group
  • Minimum detectable effect δ = ½ s.d.
  • Significance level α = 0.05
  • Power 1-β = 0.80
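As a check on this example, Stata's sampsi command (introduced later in these notes) reproduces these numbers; the means 0 and 0.5 with s.d. 1 are just the example's effect expressed in standardized units:

    * sample size per group to detect a 0.5 s.d. effect at α=0.05, power 0.80
    sampsi 0 0.5, power(0.80) alpha(0.05) sd1(1) sd2(1)
    * reports roughly 63 per group (often rounded up to 64 in textbook treatments)

    * conversely, the power achieved with 64 per group
    sampsi 0 0.5, n1(64) n2(64) sd1(1) sd2(1)
    * reports power of roughly 0.81

(In modern Stata, the power command supersedes sampsi; the equivalent call is power twomeans 0 0.5, sd(1) power(0.8).)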
Minimum detectable effect
• To decide on sample sizes and/or how many treatments you should test, ALWAYS CONDUCT A POWER TEST to determine what size treatment effect you can detect
• Decide beforehand what size effect is worth detecting
• Don't conduct "underpowered" experiments – i.e., ones that don't have large enough samples to detect effects of interest
• If the experiment is underpowered and you don't find a significant effect, you don't know whether the intervention doesn't work or it does work but you needed more people

POWER TESTS

Errors in Hypothesis Testing
• All tests have some probability of making errors
• Type 1 error – probability of making it: α
  • Incorrectly reject H0 that treatment = control
  • Conclude means are different when they are not
• Type 2 error – probability of making it: β
  • Incorrectly fail to reject H0 that treatment = control
  • Conclude means are not different when they are

α, when H0 is X=Y
• α is the probability of making a Type 1 error – rejecting H0 in error
• The smaller the α, the lower the probability
  • For instance, α<0.01 means less than a 1% probability of a Type 1 error
  • α<0.10 means less than a 10% probability of a Type 1 error
• Typically we care about α<0.05
• Suppose a t-test for X=Y yields p=0.06
  • Then we reject H0 at the 10% level, but not at the 5% level

β
• β is the probability of making a Type 2 error – failing to reject H0 when it is false
• Suppose that you have a sample of 100 subjects in the experiment
  • 50 go into the control group
  • 50 go into the treatment group
• At the end of the intervention, you conduct a t-test for Control = Treatment, and find that p=0.11
• So – you cannot publish the paper. Have you done a power test?
• 1-β is the power of your experiment
• Doing a power test gives you a better understanding of the sample size that you will need
• Typical power selection is 0.8 (0.2 probability of Type II error)
• Another power selection could be 0.9 (0.1 probability of Type II error)

Simple formula for calculating sample sizes
• Assuming equal variances σ1² = σ2², the required sample size per group is:
  n0* = n1* = n* = 2(tα/2 + tβ)²(σ/δ)²
  where δ is the minimum detectable effect and σ is the standard deviation of the outcome
• Effect size: sample size depends on the ratio of the effect size to the standard deviation
  • Hence, effect sizes can just as easily be expressed in standard deviations
• Necessary sample size:
  • Increases with desired significance level and power
  • Increases proportionally with the variance of outcomes
  • Decreases inversely proportionally with the square of the minimum detectable effect size

Rules of Thumb
• The standard is to use α=0.05 and power of 0.80 (β=0.20)
• So to detect a one-standard-deviation change using the standard approach, we would need:
  • n = 2(1.96 + 0.84)²*(1)² ≈ 15.68 observations in each cell
  • A ½ std. dev. change is detectable with 4*15.68 ≈ 64 observations per cell
• n=30 seems to be the magic number in many lab experiment studies: ~0.70 std. dev. change

Power Test in STATA
• In STATA, type "help sampsi"
  • Two-sample comparison of means M1 and M2
  • One-sample comparison of a mean M to a hypothesized value
  • Two-sample comparison of proportions
  • One-sample comparison of a proportion to a hypothesized value
• Typical code:
  • sampsi mean_1 mean_2, power(0.8) sd1(sd1) sd2(sd2)
  • sampsi prop_1 prop_2, power(0.8)

Determining Effect Size
• Determine M0, M1, S0, S1
  • Look at the related literature to approximate your mean/standard deviation
  • Run a pilot test and use its data to determine the mean/standard deviation
  • Can use existing/historical data for M0
  • Can begin the experiment and adjust sampling as needed (carefully)
• Determine the effect size of interest
  • Talk to the practitioner about the desired effect size
• Avoid the under-powered experiment!

Effects of "economic significance"
• Often confused with statistical significance
• Answers the questions:
  • "So what?"
  • "Why do we care?"
• Suggestions:
  • Compare the size of the effect with the effects of related programs
  • Provide evidence that the dependent variable is predicted to move across an economically important threshold (move households above the poverty threshold; move children above failing)
  • Cost-benefit analysis of the program
  • "A ½ standard deviation change seems pretty good"
  • Other ideas?

LEVEL OF RANDOMIZATION

Intra-Cluster Correlation
• The level of randomization may differ from the unit of observation:
  • Randomization at the grocery store level, outcomes observed at the individual consumer level
  • Randomization at the school level, outcomes observed at the child level
  • Randomization at the city level, outcomes observed at the consumer level
• Example:
  • 4 cities randomized across treatments, same number of consumers in each city
  • Responses of consumers may be correlated within a city

Intra-Cluster Correlation
• Real Sample Size (RSS) = mk/CE
  • m = number of subjects in a cluster
  • k = number of clusters
  • CE = 1 + ρ(m-1), the design effect
  • ρ = intracluster correlation coefficient = s²B/(s²B + s²w)
  • s²B = variance between clusters
  • s²w = variance within clusters
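A minimal sketch of how these quantities might be computed in Stata from pilot data. The variable names outcome and city are hypothetical, and the numbers in the display lines anticipate the worked example that follows (ρ = 0.04, m = 784 subjects per cluster, k = 4 clusters):

    * one-way ANOVA by cluster: loneway reports the intraclass correlation ρ
    loneway outcome city

    * design effect CE = 1 + ρ(m-1) and real sample size RSS = mk/CE
    display "CE  = " 1 + 0.04*(784-1)
    display "RSS = " (784*4)/(1 + 0.04*(784-1))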
Intra-Cluster Correlation
• What does ρ → 0 mean?
  • No correlation of responses within a cluster
  • No need to adjust optimal sample sizes
• What does ρ → 1 mean?
  • All responses within a cluster are identical
  • A large adjustment is needed: RSS is reduced to the number of clusters

Example
• Pilot testing finds that ρ = 0.04
• We wish to detect a 1/10 standard deviation change
• What sample size do we need?
• If ρ → 0, the sample size formula n = 2(tα/2 + tβ)²(σ/δ)² would give:
  n = 15.68*100 = 1,568 at each level; 3,136 total

Required Sample Size
• But with 4 city clusters of m = 784 subjects each, those 3,136 subjects have a real sample size of only:
  RSS = mk/CE = 784*4/(1 + 0.04*(784-1)) ≈ 97!
• So what is the required sample size, adjusting for clustering?
  n = 2(tα/2 + tβ)²(σ/δ)²*CE = 15.68*100*(1 + 783*0.04) = 15.68*3,232 ≈ 50,678 at each incentive level!
  (note that as ρ → 0, this reverts to 15.68*100 ≈ 1,568)

So why would you ever cluster?
• Logistics
• Spillovers

RANDOMIZATION DESIGN

Randomized Designs
• Fully random design with heterogeneous participants:
  • Reduces the likelihood that treatment is correlated with observables
  • May have high variance
  • Initial groupings may be heterogeneous
• How to reduce the variance of the unobserved component?
  • Include controls in the final regression (age, gender)
  • Control for observables in advance of randomization – blocking

Blocked Design – Decreases Variance
• Select block observables of interest
• Block subjects by these observables
• Randomize within each block to the treatments
• This treats characteristics of subjects as additional treatments

Special Case of Blocking
• A within-subjects design also "blocks" fully on all characteristics of the subject
• Disadvantage: treating a subject multiple times may result in complicated interactions with the treatment and yield a different parameter estimate
• Crossover design: randomizing the order of treatments in within-subject experiments
  • Still need caution when interpreting

Balanced Design
• After blocking, there is no need to "balance" on those same observables
• You may wish to balance on other observables
• Balancing involves t-tests comparing each observable across treatments, before the experiment (and potentially re-randomizing to achieve better balance)

A warning about blocking
• Once you start "messing" with the randomization, you can inadvertently introduce bias
• For example, suppose you create pairs of people matched on gender and first language and then assign one to treatment and one to control
• Often there are "leftovers" with no match (e.g., the only Chinese-speaking woman in the sample)
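A minimal sketch of a blocked randomization in Stata, assuming a hypothetical subject-level dataset with gender and language variables:

    set seed 54321                       // reproducible assignment
    egen block = group(gender language)  // one block per combination of observables
    generate double u = runiform()       // random order within each block
    bysort block (u): generate byte treat = mod(_n, 2)  // alternate T/C within block

    * balance check on an observable not used for blocking (e.g., age)
    ttest age, by(treat)

Within each block, subjects are shuffled and alternately assigned, so treatment and control are balanced on gender and language by construction. A block with an odd number of subjects still leaves one unmatched "leftover", which is exactly the warning above.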
Sample Arrangements – Conclusion
(1) Run power tests when designing your experiment. Don't run an underpowered experiment!
(2) The larger you expect the treatment effect to be, the fewer people you need
(3) Be aware of the effect of clustering on needed sample sizes
(4) Randomization techniques that help balance your sample can increase power, but are trickier to do correctly
(5) Run as few treatments as possible. Resist the temptation to add more than is absolutely necessary.

Conclusion
• For a more in-depth discussion, see List, John, Sally Sadoff and Mathis Wagner (2011). "So you want to run an experiment, now what? Some simple rules of thumb for optimal experimental design." Experimental Economics, 14(4): 439-457

Implementation: Keep it real
• The actual population in the actual context
  • Are they representative?
• Integrate into what is already taking place
  • Minimize disruption & "experimenter demand" effects
  • Closer to actual policy
• BUT you can also learn a lot from preliminary studies in more lab-like contexts

Outcomes & Analysis
Ideally pre-defined:
• Main outcome(s)
• Other outcomes of interest – e.g., longer term, potential mechanisms, spillovers, survey measures, etc.
• Within-experiment outcomes – e.g., process, compliance, attrition, etc.

Example: Flu vaccine
[Flow diagram: nursing homes are randomized to a staff flu vaccine program vs. no change; within-experiment outcome: staff vaccination rates; main outcomes: resident flu infections and resident mortality.]

Address concerns
• Experiments are logistically difficult, disruptive, and costly
• Most experiments "fail" – experiments are brutally honest!
• Experimentation is unethical or unfair

Address concerns
• Design experiments that integrate relatively easily into what is already occurring
• Look for natural opportunities to experiment (pilots, new programs, gradual rollouts, etc.)
• A null result is not a failure. It is important to learn what doesn't work (vs. the cost of running an ineffective program)
• Design experiments and include outcomes that help you learn about mechanisms
• Design experiments to minimize fairness concerns (e.g., through the unit of randomization)
• People are often familiar and comfortable with the idea of pilot programs, limited space, etc., especially when we're not sure yet whether something works
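Tying the pieces together, a minimal sketch of the eventual analysis in Stata, assuming a hypothetical individual-level dataset in which outcome is the pre-defined main outcome, treat is the treatment indicator, and randomization was at the city level:

    * difference in means between treatment and control, with standard errors
    * clustered at the unit of randomization (the analysis-stage counterpart
    * of the intracluster-correlation adjustment discussed earlier)
    regress outcome treat, vce(cluster city)

The coefficient on treat estimates the treatment effect; clustering the standard errors on city keeps the inference honest about the effective sample size.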