Sample Sizes for IE Power Calculations

Overview
General question: how large does the sample need to be to credibly detect a given effect size?
What does "credibly" mean here? It means we can be reasonably sure that the difference between the treatment group and the comparison group is due to the program.
Randomization removes bias, but it does not remove noise. To reduce noise, we need a large sample. But how large is large?

Measuring Impact
At the end of an experiment, we compare the outcome of interest in the treatment and comparison groups. We are interested in the difference:
Mean in treatment - Mean in comparison = Effect size
For example: mean malaria prevalence in villages with ITN distribution vs. mean malaria prevalence in villages without ITNs.
To draw conclusions from that effect size, we need it to be estimated with precision, since there is always variability in data. If there are many other unobserved factors affecting outcomes, it is harder to say whether the treatment had an effect.
[Figure: three histograms of the number of villagers exposed to malaria (treatment group in blue; means 50 vs. 60): precise outcomes (low standard deviation), some noise (medium standard deviation), and very noisy (high standard deviation). The more variable the outcome, the more the two distributions overlap.]

Confidence Intervals
We only ever work with a sample of the population. To assess whether a result is valid for the entire population, we need a measure of reliability.
A 95% confidence interval for an effect size tells us that, for 95% of the samples we could have drawn from the same population, the estimated effect would have fallen into this interval.
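The effect size and its confidence interval can be computed directly from the two groups' data. A minimal sketch in Python: the `diff_ci` helper and the prevalence figures are made up for illustration, not taken from the lecture.

```python
import math
import statistics

def diff_ci(treatment, control, z=1.96):
    """Difference in means with an approximate 95% confidence interval.

    The standard error grows with outcome variability and shrinks with
    sample size, as discussed above.
    """
    d = statistics.mean(treatment) - statistics.mean(control)
    se = math.sqrt(statistics.variance(treatment) / len(treatment)
                   + statistics.variance(control) / len(control))
    return d, (d - z * se, d + z * se)

# Illustrative (made-up) malaria prevalence in two groups of villages:
treated = [0.18, 0.22, 0.15, 0.20, 0.17, 0.19, 0.21, 0.16]
control = [0.30, 0.27, 0.33, 0.29, 0.31, 0.28, 0.32, 0.26]
effect, (lo, hi) = diff_ci(treated, control)
# If the interval excludes zero, the effect is significant at roughly the 5% level.
```

Here the estimated effect is -0.11 (11 percentage points lower prevalence), and the whole interval lies below zero.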
The standard error (SE) of the estimate captures both the size of the sample and the variability of the outcome: it is larger with a small sample and with a more variable outcome.

Two Types of Errors
First type of error: concluding that there is an effect when in fact there is none.
The level of a test is the probability of falsely concluding that the program has an effect when in fact it does not. With a level of 5%, you can be 95% confident in the validity of your conclusion that the program had an effect. Conventional levels are a = 1%, 5%, or 10%.
Rule of thumb: if the effect size is more than twice the standard error, you can conclude with more than 95% certainty that the program had an effect.

Second type of error: failing to reject that the program had no effect when in fact it does have an effect.
The power of a test is the probability of finding a significant effect in the RCT when a true effect exists. Only with a significant effect can you cleanly influence policy.
Power calculations are a tool to see how likely we are to find a significant effect for a given sample size.

What You Need for a Power Calculation
- Significance level. This is conventionally set at 5%. At lower levels (less likely to report a false positive), we need a larger sample to detect the effect.
- Power level. A power of 80% says: 80% of the time, if there is a true effect, you will be able to detect it in a given sample. A larger sample means more power.
- The mean and the variability of the outcome in the comparison group, typically taken from previous surveys conducted in similar settings. The larger the variability, the larger the sample needed for a given power.
- The effect size that we want to detect. What is the smallest effect that should prompt a policy response? The smaller the expected effect size, the larger the sample size needed.

How to Determine the Effect Size
What is the smallest effect that would justify adopting the program (in terms of costs and benefits)?
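These four ingredients combine into the standard sample-size formula for a two-sided test of a difference in means: n per arm = 2 * sigma^2 * (z_{1-alpha/2} + z_{power})^2 / delta^2. A sketch in Python, where the values of sigma and delta are hypothetical:

```python
import math
from statistics import NormalDist

def n_per_arm(sigma, delta, alpha=0.05, power=0.80):
    """Sample size per arm to detect a difference in means delta,
    given outcome standard deviation sigma, significance level alpha,
    and desired power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # e.g. 1.96 for alpha = 5%
    z_power = NormalDist().inv_cdf(power)           # e.g. 0.84 for 80% power
    return math.ceil(2 * sigma ** 2 * (z_alpha + z_power) ** 2 / delta ** 2)

# Hypothetical numbers: outcome SD of 0.25, smallest policy-relevant
# effect of 0.05 (a standardized effect of 0.05 / 0.25 = 0.20, i.e. "small"):
n = n_per_arm(sigma=0.25, delta=0.05)
```

Note how the formula encodes the bullets above: halving the detectable effect delta roughly quadruples the required sample.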
Common danger: using an effect size that is too optimistic leads to too small a sample.
How large an effect you can detect with a given sample depends on how variable the outcome is; this sets the minimum effect size we would want to be able to test for. Example: if all children have very similar diarrhea prevalence without the program, even a very small impact will be easy to detect.
The standardized effect size is the effect size divided by the standard deviation of the outcome. Common standardized effect sizes are 0.20 (small), 0.40 (medium), and 0.50 (large).

Design Factors to Take into Account
Availability of a baseline. A baseline survey can reduce the needed sample size since it:
1. Removes some variability in the data, increasing precision.
2. Can be used to stratify and create subgroups.
The level of randomization. Whenever treatment occurs at a group level, power is reduced relative to randomization at the individual level.

Cluster (Group) Randomization
Examples of randomization levels:
- Rural water project, Water Guard: individual
- Rural water project, spring improvement: village
- Community-based monitoring in Uganda: village
- HIV/AIDS education: school

Implications of Group-Level Designs
The outcomes of all the individuals within a unit may be correlated: all villagers are affected by spring improvements at the same time, and all students at a school with trained teachers may have benefited from the information. The sample size needs to be adjusted for this correlation: the more correlation within groups, the more we need to adjust the standard errors.
It is therefore extremely important to randomize an adequate number of groups. Typically the number of individuals within groups matters less than the number of groups: big increases in power usually come only from increasing the number of groups randomized. If you randomize at the district level, with one treated district and one control district, you have only 2 observations!
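The correlation adjustment is commonly summarized by the design effect, DEFF = 1 + (m - 1) * ICC, where m is the cluster size and ICC is the intracluster correlation. A sketch with hypothetical numbers; the 400-subject individually randomized baseline, the cluster size, and the ICC are assumptions for illustration:

```python
import math

def design_effect(m, icc):
    """Inflation factor for cluster randomization:
    DEFF = 1 + (m - 1) * ICC, where m is the cluster size and
    ICC is the intracluster correlation of the outcome."""
    return 1 + (m - 1) * icc

def clustered_n(n_individual, m, icc):
    """Total sample needed when randomizing clusters of size m,
    given the sample an individually randomized design would need."""
    return math.ceil(n_individual * design_effect(m, icc))

# Hypothetical: an individually randomized design needs 400 subjects.
# With clusters (villages) of 21 people and ICC = 0.1, DEFF = 3.0, so:
total = clustered_n(400, m=21, icc=0.1)
```

Because DEFF grows with m, adding more individuals per cluster inflates the required total almost as fast as it adds observations, which is why adding clusters, not cluster members, is what buys power.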
Conclusions
Power calculations involve some guesswork, and sometimes we do not have the right information to conduct them very properly. However, it is important to do them to:
- Avoid launching studies that will have no power at all: a waste of time and money.
- Determine the appropriate resources for the studies you decide to conduct (and not too much).
- If you have a fixed budget, determine whether the project is feasible at all.
Software: http://sitemaker.umich.edu/group-based/optimal_design_software