Impact Evaluation: Logic, Benefits and Limitations

Outline
• Theory of change
• Experimental method – RCTs
• Quasi-experimental methods
• Managing impact evaluations

Think about theory
[Two charts: test scores plotted against pupil–teacher ratio, each consistent with a different theory]
So we need evidence to know which theory is right.

Examples of 'atheoretical' IEs
• School capitation grant studies that don't ask how the money was used
• BCC intervention studies that don't ask if behaviour has changed (indeed, almost any study that does not capture behaviour change)
• Microfinance studies that don't look at use of funds and cash flow
• Studies of capacity development that don't ask if knowledge was acquired and used

Common questions in theory-based impact evaluation (TBIE): design, implementation and evaluation questions

Targeting
• Are the right people being reached? (Type I and II errors)
• Why is targeting failing? (protocols faulty, not being followed, corruption ...)

Training / capacity building
• The right people? Is the training appropriate? Are there mechanisms to ensure skills are utilized?
• Quality of delivery
• Skills and knowledge acquired and used: have skills and knowledge changed? Are they being applied? Do they make a difference to outcomes?

Intervention delivery
• Is it tackling a binding constraint? Is it appropriate and within local institutional capacity?
• Delivered as intended: protocols followed, no leakages, technology functioning and maintained
• What problems have been encountered in implementation? When did the first benefits start being realized?
• How is the intervention perceived by implementing agency staff and beneficiaries?

Behaviour change
• Is the desired behaviour change culturally possible and appropriate; will it benefit the intended beneficiaries?
• Is behaviour change being promoted as intended (right people, right message, right media)?
• Is behaviour change occurring? If not, why not?

Theory of change: micro-insurance
• Design: design of the insurance product; marketing
• Intermediate outcome: adoption of the insurance product
• Final outcomes (insured): consumption smoothing and assets protected (average income may be lower); savings utilized for more productive (possibly riskier) investments; income increase; employment generation; increased utilization of health services (and better quality health services); better health; ambiguous impact on out-of-pocket expenses, but likely reduction in catastrophic expenses
• Final outcomes (uninsured): positive health spillover effects

Assumptions
• Design is appropriate – something people need (Relevance)
• Product is well marketed to the target market
• The concept of insurance is well understood
• Premiums are affordable
• Take-up is sufficient for the product to be sustainable (Sustainability)
• Premiums are paid
• Insurance pays out in a timely manner
• Insurance is accepted by service providers
• Absence of moral hazard in the behavioural response of the insured
• Lack of adverse selection in the measurement of impact on utilization
• Lack of adverse selection in the measurement of impact on health status

Why did the Bangladesh Integrated Nutrition Project (BINP) fail?
[Chart: comparison of impact estimates from different studies]

Summary of theory
• Target group participate in the program (mothers of young children)
• Target group for nutritional counselling is the relevant one
• Exposure to nutritional counselling results in knowledge acquisition and behaviour change
• Behaviour change is sufficient to change child nutrition
• Children are correctly identified to be enrolled in the program
• Food is delivered to those enrolled
• Food is of sufficient quantity and quality
• Supplementary feeding is supplemental, i.e. no leakage or substitution
→ Improved nutritional outcomes
The theory of change
[The causal chain above is shown as a diagram and revisited step by step, with annotations highlighting particular links:]
• Target group participate in the program – participation rates were up to 30% lower for women living with their mother-in-law
• Exposure to nutritional counselling results in knowledge acquisition and behaviour change – knowledge must be both acquired and used
• Supplementary feeding is supplemental, i.e. no leakage or substitution – supplementary feeding must actually be supplementary
Lessons from BINP
• Apparent successes can turn out to be failures
• Outcome monitoring does not tell us impact and can be misleading
• A theory-based impact evaluation shows whether something is working, and why
• The quality of the match matters for a rigorous study
• An independent study got different findings from the project-commissioned study

Problems in implementing rigorous impact evaluation: selecting a comparison group
• Spillover effects: effects on the non-target group
• Contagion (aka contamination): other interventions, or self-contamination from spillovers
• Selection bias: beneficiaries are different
• Ethical and political considerations

The problem of selection bias
Program participants are not chosen at random, but selected through program placement or self-selection. This is a problem if the correlates of selection are also correlated with the outcomes of interest, since those participating would do better (or worse) than others regardless of the intervention.

Selection bias from program placement
• A post-conflict social cohesion program is placed in the communities with the most conflict-affected persons.
• Since these areas have high numbers of conflict-affected persons, social cohesion may be lower than elsewhere to start with.
• So comparing these communities with other communities will find lower social cohesion in project communities – because of how they were chosen, not because the project isn't working.
• The comparison group has to be drawn from similarly deprived areas.

Selection bias from self-selection
• A community fund is available for community-identified projects. An intended outcome is to build social capital for future community development activities.
• But those communities with higher degrees of cohesion and social organization (i.e. social capital) are more likely to be able to make proposals for financing.
• Hence social capital is higher amongst beneficiary communities than non-beneficiaries regardless of the intervention, so a comparison between these two groups will overstate program impact.

Examples of selection bias
• Hospital delivery in Bangladesh (0.115 vs 0.067)
• Secondary education and teenage pregnancy in Zambia
• Male circumcision and HIV/AIDS in Africa (geographical overlay of HIV/AIDS prevalence and circumcision)

Main point
There is 'selection' in who benefits from nearly all interventions, so we need a comparison group which has the same characteristics as those selected for the intervention.
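To make the point concrete, here is a minimal simulation of self-selection (all numbers are hypothetical, not drawn from the examples above): communities with more pre-existing social capital are more likely to join a program that in truth does nothing, yet a naive treated-versus-untreated comparison still 'finds' an impact.

```python
# Illustrative simulation of self-selection bias (hypothetical values).
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

baseline_social_capital = rng.normal(0, 1, n)        # pre-program characteristic
# Self-selection: higher social capital -> more likely to apply and participate
p_participate = 1 / (1 + np.exp(-2 * baseline_social_capital))
treated = rng.random(n) < p_participate

true_effect = 0.0                                     # the program does nothing
outcome = baseline_social_capital + true_effect * treated + rng.normal(0, 1, n)

naive_estimate = outcome[treated].mean() - outcome[~treated].mean()
print(f"Naive treated-vs-untreated difference: {naive_estimate:.2f}")  # well above zero
print(f"True effect: {true_effect:.2f}")
```

The naive difference reflects who selected in, not what the program did; this is exactly the gap that experimental and quasi-experimental designs are meant to close.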
Dealing with selection bias
We need to use experimental (RCT) or quasi-experimental methods to cope with selection bias – this is what has been meant by 'rigorous impact evaluation'.

Experimental method: called by different names
• Random assignment studies
• Randomized field trials
• Social experiments
• Randomized controlled trials (RCTs)
• Randomized controlled experiments

Randomization (RCTs)
• Randomization addresses the problem of selection bias by the random allocation of the treatment.
• Randomization may not be at the same level as the unit of intervention: randomize across schools but measure individual learning outcomes; randomize across sub-districts but measure village-level outcomes.
• The fewer the units over which you randomize, the higher your standard errors. You need to randomize across a 'reasonable number' of units: at least 30 for a simple randomized design (though possible imbalance is considered a problem for n < 200), and possibly as few as 10 for matched-pair randomization, though the literature is not clear on this.

Claim: the experimental method produces estimates of the missing counterfactual by randomization (Angrist and Pischke, 2008) – that is, randomization solves the selection problem.
How? The independence of outcomes (Y) and treatment assignment (T) in a randomized experiment allows one to substitute the observable mean outcome of the untreated, E[Y0|T=0], for the missing mean outcome of the treated had they not been treated, E[Y0|T=1].
Randomized assignment implies that the distributions of both observable and unobservable characteristics in the treatment and control groups are statistically identical. That is, members of the two groups do not differ systematically at the outset of the experiment.

Key advantage of the experimental method
Because members of the treatment and control groups do not differ systematically at the outset of the experiment, any difference that subsequently arises between them after administering the treatment can be attributed to the treatment rather than to other factors.

A typical experimental design involves two randomizations; the 'experimental method' usually refers to the second – assignment to treatment and control.
POPULATION → (randomized sampling) → SAMPLE → (randomized assignment) → TREATMENT and CONTROL
• The first stage ensures that the results in the sample represent the results in the population within a defined level of sampling error (external validity).
• The second stage ensures that the observed effect on the outcome is due to the treatment rather than to other confounding factors (internal validity).

Sample size and power
How large does a sample need to be in order to credibly detect a given effect size at a predetermined power?
• Definition of power (a common threshold is 80% and above): a power of 80% tells us that, in 80% of experiments of this sample size conducted in this population, if H0 is in fact false (i.e., the treatment effect is not zero), we will be able to detect it.
• Determinants of sample size at a given power (see the sketch below):
  – How big an effect are we measuring? The larger the effect size, the smaller the sample size.
  – How noisy is the measure of the outcome? The noisier the measure, the larger the sample size.
  – Do we have a baseline? The presence of a baseline (which is presumably correlated with subsequent outcomes) reduces the required sample size.
  – Are individual responses correlated with each other? The more correlated the individual responses, the larger the sample size.
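A minimal sketch of how these determinants enter a sample-size calculation, using the standard normal-approximation formula for a two-arm comparison of means. The effect size, baseline correlation and intra-cluster correlation values below are illustrative assumptions, not figures from the slides.

```python
# Two-arm sample-size sketch: base formula plus standard textbook adjustments.
from scipy.stats import norm

alpha, power = 0.05, 0.80
mde = 0.20          # minimum detectable effect, in standard-deviation units (assumed)
z = norm.ppf(1 - alpha / 2) + norm.ppf(power)

n_per_arm = 2 * (z / mde) ** 2
print(f"Simple individual randomization: ~{n_per_arm:.0f} per arm")

# A baseline correlated with the endline outcome (ANCOVA) shrinks the required n
rho_baseline = 0.5
print(f"With baseline (r = {rho_baseline}): ~{n_per_arm * (1 - rho_baseline**2):.0f} per arm")

# Correlated responses within clusters inflate it (design effect; m = cluster size)
icc, m = 0.05, 20
print(f"Cluster-randomized (ICC = {icc}, {m} per cluster): "
      f"~{n_per_arm * (1 + (m - 1) * icc):.0f} per arm")
```

Re-running the sketch with a smaller minimum detectable effect or a noisier outcome measure shows quickly why those two determinants dominate the survey budget.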
Impact estimator: experimental method
The impact estimate is the difference in mean outcomes between those with the project and those without the project in a given sample.

Conducting an RCT
• Has to be an ex ante design.
• Has to be politically feasible, with confidence that program managers will maintain the integrity of the design.
• Perform a power calculation to determine the sample size (and therefore the cost).
• Consider A, B and A+B designs (factorial designs), possibly with no 'untreated' control.

Ethics and RCT designs
• Rarely is an intervention universal from day one. If there is incomplete geographical coverage, you can exploit that fact to get a control group. If the intervention is rolled out over time, you can use a roll-out design (pipeline or stepped-wedge design). Or you can use an 'encouragement design'.
• Randomization doesn't stop you targeting: randomize across the eligible population, not the whole population, or use 'raised threshold' randomization.
• And randomization is more transparent.
• What is really unethical is to do things that don't work.

Ethics and RCT designs: summary
• Simple RCT: individual or cluster
• Across the pipeline
• Expand the 'potential' geographic area
• Encouragement design
• Randomize around the threshold (possibly with a raised threshold)

Spillover effects
• Otherwise uncounted benefits (word-of-mouth knowledge transfer) or losses (water diversion).
• Program theory should allow identification of these effects.
• Spillovers to the untreated in treated communities (bed nets; also consider herd effects) are easily dealt with if you collect the data.
• Spillover to other communities has data collection implications (and risks contagion).

Contagion (aka contamination)
• The comparison group gets treated: by your own intervention (spillover, or benign but ill-informed intervention errors), or by other similar interventions (or interventions affecting the same outcomes).
• Make sure you know it's happening: data collection in the comparison group should cover more than just outcomes.

Contagion: just deal with it
• Total contamination: use a different counterfactual or a dose-response model.
• Partial contamination: could drop contaminated observations, especially if using matched pairs.
• Or give up.

The comparison group location trade-off
[Diagram: treatment locations (T) and candidate comparison locations (C1, C2)]

Bottom lines
• You need skills on RCTs, and to encourage these designs where possible.
• Think about how likely spillover and contagion are, and incorporate them in your design if necessary.
• Buying in academic expertise will help... but beware of differing incentives.

Quasi-experimental approaches
(The advantage is that they can be ex post, but they can also be ex ante.)

Where, oh where art thou, baseline?
• Existing datasets
• Previous rates
• Monitoring data, but no comparison
• Recreating baselines: from existing data (e.g. the 3ie working paper on Pakistan post-disaster), or using recall – be realistic

Matching methods: quasi-experimental methods (construct a comparison group)
• Propensity score matching (PSM)
• Regression discontinuity design (RDD)
• 'Intuitive matching'
• Regression-based methods
• Instrumental variables: need to be well motivated

Tips on ex post matching
• Always check and report the quality of the match, and take any mis-match seriously (you can redefine the matching method).
• Baseline data allowing double differencing will add credibility to your results (see the sketch below).
• All the above points about spillovers and contagion apply.
• All methods (including RCTs) can use a regression approach to 'iron out' remaining differences between treatment and comparison groups.
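As a companion to the double-differencing tip above, here is a minimal sketch of the estimator on simulated data (all variable names and values are hypothetical): the interaction of treatment and time recovers the impact even when the treatment group starts from a different level, which a single ex post difference would not.

```python
# Double-difference (difference-in-differences) sketch on simulated panel data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1_000                                            # units observed at baseline and endline
treated = rng.integers(0, 2, n)
unit_level = rng.normal(0, 1, n) + 1.0 * treated     # treated units start from a higher level
true_impact = 1.5

baseline = unit_level + rng.normal(0, 1, n)
endline = unit_level + 0.5 + true_impact * treated + rng.normal(0, 1, n)

df = pd.DataFrame({
    "outcome": np.concatenate([baseline, endline]),
    "treated": np.tile(treated, 2),
    "post": np.repeat([0, 1], n),
})

# The coefficient on treated:post is the double-difference impact estimate.
did = smf.ols("outcome ~ treated * post", data=df).fit()
print(f"Double-difference estimate: {did.params['treated:post']:.2f}")   # ~1.5
naive = endline[treated == 1].mean() - endline[treated == 0].mean()
print(f"Naive endline difference:   {naive:.2f}")                        # ~2.5, biased upward
```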
Propensity score matching
• Ideally you would need a comparator with all the same characteristics – age, education, religion, etc. But matching on a single number calculated as a weighted average of these characteristics gives the same result as matching individually on every characteristic – this is the basis of propensity score matching.
• The weights are given by the 'participation equation', that is, a probit equation of whether a person participates in the project or not.

PSM: what you need
• Can be based on an ex post single difference, though a double difference is better.
• Need a common survey for the treatment group and potential comparison group, or surveys with common sections for the matching variables and outcomes.

Impact of water and sanitation (WS) on child health in rural Philippines
• Comparison groups: all rural households with children younger than 5 years from the 1993, 1998, 2003 and 2008 rounds of the NDHS. Some of these children had diarrhea during the two-week period prior to the interview.
• Treatment vs. control: children in households with piped water vs. children in households without piped water; children in households with their own flush toilet vs. children in households without their own flush toilet.

Impact measure: average treatment effect on the treated
ATT(X) = E(D1 | T=1, p(X)) − E(D0 | T=1, p(X)) = E(D1 | T=1, p(X)) − E(D0 | T=0, p(X))
where D1 = with diarrhea, D0 = without diarrhea, T = treatment indicator (= 1 if with piped water or flush toilet, 0 if without), and p(X) = propensity score defined over a vector of covariates X.

PSM estimates: impact of piped water and flush toilet in rural Philippines
ATT(X), with standard errors in parentheses.

Piped water
  NN5 (0.001):   1993 -0.020c (0.014)   1998  0.012 (0.015)   2003 -0.032b (0.018)   2008 -0.029b (0.017)
  NN5 (0.01):    1993 -0.015 (0.013)    1998  0.008 (0.015)   2003 -0.014 (0.017)    2008 -0.040a (0.015)
  NN5 (0.02):    1993 -0.009 (0.013)    1998  0.012 (0.015)   2003 -0.012 (0.017)    2008 -0.045a (0.015)
  NN5 (0.03):    1993 -0.013 (0.013)    1998  0.013 (0.015)   2003 -0.015 (0.017)    2008 -0.042a (0.015)
  Kernel (0.03): 1993 -0.002 (0.012)    1998  0.014 (0.013)   2003 -0.010 (0.015)    2008 -0.028b (0.013)
  Kernel (0.05): 1993 -0.001 (0.012)    1998  0.014 (0.013)   2003 -0.005 (0.015)    2008 -0.018b (0.013)

Own flush toilet
  NN5 (0.001):   1993 -0.017 (0.016)    1998 -0.010 (0.013)   2003 -0.025c (0.016)   2008 -0.034b (0.018)
  NN5 (0.01):    1993 -0.013 (0.014)    1998 -0.003 (0.012)   2003 -0.026b (0.015)   2008 -0.100a (0.020)
  NN5 (0.02):    1993 -0.012 (0.014)    1998 -0.001 (0.012)   2003 -0.027b (0.015)   2008 -0.090a (0.019)
  NN5 (0.03):    1993 -0.015 (0.014)    1998 -0.005 (0.012)   2003 -0.030b (0.015)   2008 -0.087a (0.019)
  Kernel (0.03): 1993 -0.015 (0.013)    1998  0.002 (0.011)   2003 -0.028b (0.014)   2008 -0.073a (0.018)
  Kernel (0.05): 1993 -0.016 (0.013)    1998  0.002 (0.011)   2003 -0.027b (0.014)   2008 -0.068a (0.018)

Notes: "NN5 (...)" means nearest-5-neighbour matching with the caliper size in parentheses. a statistically significant at p < 0.01; b statistically significant at p < 0.05; c statistically significant at p < 0.10.
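A minimal sketch of the matching logic described above, on a hypothetical data frame whose column names are illustrative. It uses a single nearest neighbour on the propensity score; the study in the table used nearest-5-neighbour and kernel matching with calipers.

```python
# Propensity score matching sketch: probit participation equation, then
# nearest-neighbour matching on the predicted score, then the ATT.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.neighbors import NearestNeighbors

def psm_att(df, outcome, treatment, covariates):
    # 1. Participation equation: probit of treatment status on the matching covariates
    df = df.copy()
    formula = f"{treatment} ~ " + " + ".join(covariates)
    df["pscore"] = smf.probit(formula, data=df).fit(disp=0).predict(df)

    treated = df[df[treatment] == 1]
    control = df[df[treatment] == 0]

    # 2. For each treated unit, find the control with the closest propensity score
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    _, idx = nn.kneighbors(treated[["pscore"]])
    matched_controls = control.iloc[idx.ravel()]

    # 3. ATT = mean outcome of the treated minus mean outcome of matched controls
    return treated[outcome].mean() - matched_controls[outcome].values.mean()

# Tiny synthetic demonstration (hypothetical variables: age, educ, piped_water, diarrhea)
rng = np.random.default_rng(1)
m = 5_000
demo = pd.DataFrame({"age": rng.normal(35, 10, m), "educ": rng.integers(0, 16, m)})
p_treat = 1 / (1 + np.exp(-(0.03 * (demo["age"] - 35) + 0.1 * (demo["educ"] - 8))))
demo["piped_water"] = (rng.random(m) < p_treat).astype(int)
demo["diarrhea"] = (rng.random(m) < 0.15 - 0.005 * demo["educ"]
                    - 0.02 * demo["piped_water"]).astype(int)
print(psm_att(demo, "diarrhea", "piped_water", ["age", "educ"]))   # roughly -0.02
```

Before trusting the number, check covariate balance and common support on the matched sample, as the 'always check the quality of the match' tip above says.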
Regression discontinuity: an example – an agricultural input supply program
• Naive impact estimates: total = income(treatment) − income(comparison) = 9.6; agricultural households only = 7.7.
• But there is a clear link between net income and land holdings, and it turns out that the program targeted households with at least 1.5 ha of land (you can see this in the graph).
• So selection bias is a real issue: the treatment group would have been better off even in the absence of the program, so the single-difference estimate is biased upward.

Regression discontinuity
• Where there is a 'threshold allocation rule' for program participation, we can estimate impact by comparing outcomes for those just above and just below the threshold, as these groups are very similar.
• We can do that by estimating a regression with a dummy for the threshold value (and possibly also a slope dummy) – see graph.
• In our case the impact estimate is 4.5, which is much less than the naive estimates (less than half).

MANAGING IMPACT EVALUATIONS

When to do an impact evaluation
• Different stuff: pilot programs, innovative programs, new activity areas.
• Established stuff: representative programs, important programs.
• Look to fill gaps.

What do IE managers need to know?
• Whether an IE is needed and viable
• Your role as champion
• The importance of ex ante designs with a baseline (building evaluation into design)
• Funding issues
• The importance of a credible design with a strong team (and how to recognize that)
• Help on design
• Ensure management feedback loops

Issues in managing IEs
• Different objective functions of managers and study teams
• Project management buy-in
• Trade-offs: on time, on richness of study design

Overview on data collection
• Baseline, midterm and endline
• Treatment and comparison
• Process data
• Capture contagion and spillovers
• Quantitative and qualitative
• Different levels (e.g. facility data, worker data) – link the data
• Multiple data sources

Data used in the BINP study
• Project evaluation data (three rounds)
• Save the Children evaluation
• Helen Keller Nutritional Surveillance Survey
• DHS (one round)
• Project reports
• Anthropological studies of village life
• Action research (focus groups, CNP survey)

Piggybacking: use of an existing survey
• Add: oversample project areas; additional module(s)
• Lead time is longer, not shorter, but probably higher quality data and less effort in managing data collection

Some timelines
• Ex post: 12–18 months
• Ex ante: lead time for survey design 3–6 months
• Post-survey to first impact estimates: 6–9 months
• Report writing and consultation: 3–6 months

Budget and timeline
• Ex post or ex ante?
• Existing data or new data?
• How many rounds of data collection?
• How large is the sample?
• When is it sensible to estimate impact?

Remember
• Results means impact – but be selective.
• Be issues-driven, not methods-driven: find the best available method for the evaluation questions at hand.
• Randomization often is possible. But do ask: is this sufficiently credible to be worth doing?

References
Gertler, P. et al. (2011). Impact Evaluation in Practice. World Bank, Washington, DC.
White, H. (2009). Theory-Based Impact Evaluation: Principles and Practice. 3ie Working Paper 3. 3ie, New Delhi.
White, H. (2012). Quality Impact Evaluation: An Introductory Workshop. 3ie, New Delhi.

Thank you!