ENHANCING POLICY EFFECTIVENESS: THE ROLE OF IMPACT EVALUATION
UNDERSTANDING IMPACT EVALUATION
Howard White

Lecture 1
Steps in impact evaluation design

What is impact?
• Impact = the outcome with the intervention compared to what it would have been in the absence of the intervention
• At the heart of it is the idea of attribution, and attribution implies a counterfactual

When to do an impact evaluation
• Pilot programs
• Innovative programs
• Representative or important programs

Steps in evaluation design
• Clearly state objectives and outcomes
• Map out the causal chain
• Identify evaluation questions
• Identify best available method for different evaluation questions

The principles
• Map out the causal chain (programme theory): see figure from BINP example
• Understand context: Bangladesh is not Tamil Nadu
• Anticipate heterogeneity: more malnourished children; different implementing agencies
• Rigorous evaluation of impact using an appropriate counterfactual: PSM versus simple comparison
• Rigorous factual analysis: targeting, Key Performance gap
• Use mixed methods: informed by anthropology, focus groups, own field visits

Program design (theory of change)
Target group participates in the program (mothers of young children)
Target group for nutritional counselling is the relevant one
Exposure to nutritional counselling results in knowledge acquisition and behaviour change
Behaviour change is sufficient to change child nutrition
Children are correctly identified to be enrolled in the program
Food is delivered to those enrolled
Supplementary feeding is supplemental, i.e.
no leakage or substitution
Improved nutritional outcomes
Food is of sufficient quantity and quality

[Figure: causal chain diagram with nodes A, B1, B2, C1, C2, D1, D2, D3, E. Node labels: targeting participants (mothers, children); counselling; supplemental feeding; change behavior in child nutrition; no leakage / substitution; sufficient quantity / quality; improved nutrition outcome]

Anemia Project: Causal Chain

Level | Indicator | Measure | Assumptions
Inputs | Children / teachers / vitamins | Expenditure / Number of Training Material Packages for teachers / Number of vitamins |
Activities / Processes | Give student vitamin per day | Number of vitamins passed out per day | Actually give vitamins
Outputs | Kids that have been given improved nutrition | Number of kids that took vitamins for entire treatment | Kids swallow vitamins when given / kids take vitamins on weekends
Intermediate outcomes | Falling anemia rates (healthy kids) | Hb levels | Hb tests are accurate
Final outcomes | Better educational performance | Scores on standardized tests | Scores are measuring educational performance / there was no extra help given to kids in vitamin schools

A causal chain (example 2)

Level | Indicator | Measure | Assumptions
Inputs | Money: teaching materials, stipends, trainers | Expenditure | Supply of materials for curriculum
Activities / Processes | Teacher training | No. of teachers trained | Training of good quality
Outputs | Trained teachers | Teacher skills | Teachers apply skills; teachers retained
Intermediate outcomes | Improved pupil learning outcomes | Test scores | School environment allows use of improved methods; test scores comparable
Final outcomes | Higher productivity; higher earnings | Income or consumption | Productive employment available

Common questions in IE

Topic | Design | Implementation | Evaluation questions
Targeting | The right people? | Type I and II errors | Why is targeting failing? (protocols faulty, not being followed, corruption ..)
Training / capacity building | Right people? Is it appropriate? Mechanisms to ensure skills utilized | Quality of delivery; skills / knowledge acquired and used | Have skills / knowledge changed? Are skills / knowledge being applied? Do they make a difference to outcomes?
Intervention delivery | Tackling a binding constraint? Appropriate? Within local institutional capacity | Delivered as intended: protocols followed, no leakages, technology functioning and maintained | What problems have been encountered in implementation? When did first benefits start being realized? How is the intervention perceived by IA staff and beneficiaries?
Behavior change | Is desired BC culturally possible and appropriate; will it benefit intended beneficiaries? | Is BC being promoted as intended (right people, right message, right media?) | Is behavior change occurring? If not, why not?

Exercise
• Identify the objective(s) for your intervention and main outcome(s) (select measurable indicators)
• Map out the theory of change underlying your project
• What are the related evaluation questions?
Make the point about the difference between an outcome and impact

Lecture 2/3
Counterfactuals and comparison groups
Selection bias and experimental design

The attribution problem: factual and counterfactual
Impact varies over time

What has been the impact of the French revolution?
“It is too early to say” (Zhou Enlai)

• So where does the counterfactual come from?
• Most usual is to use a comparison group of similar people / households / schools / firms…

What do we need to measure impact?
Girls’ secondary enrolment

Group | Before | After
Project (treatment) | (no data) | 92
Comparison | (no data) | (no data)

The majority of evaluations have just this information … which means we can say absolutely nothing about impact

Before versus after single difference comparison
Before versus after = 92 – 40 = 52

Group | Before | After
Project (treatment) | 40 | 92
Comparison | (no data) | (no data)

“scholarships have led to rising schooling of young girls in the project villages”
This ‘before versus after’ approach is outcome monitoring, which has become popular recently. Outcome monitoring has its place, but it is not impact evaluation

Rates of completion of elementary school for male and female students in all of rural China’s poor areas
[Figure: bar chart, share of rural children (0 to 100), girls and boys, 1993 and 2008]

Post-treatment control comparison
Single difference = 92 – 84 = 8

Group | Before | After
Project (treatment) | (no data) | 92
Control | (no data) | 84

But we don’t know if they were similar before… though there are ways of doing this (statistical matching = quasi-experimental approaches)

Double difference = (92 – 40) – (84 – 26) = 52 – 58 = –6

Group | Before | After
Project (treatment) | 40 | 92
Comparison | 26 | 84

Conclusion: Longitudinal (panel) data, with a comparison group, allow for the strongest impact evaluation design (though still need matching).
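The three estimates worked through above can be reproduced in a few lines. This is a minimal sketch using the enrolment figures from the slides (40, 92, 26, 84); the function name and dictionary keys are illustrative and not part of the original material.

```python
def impact_estimates(treat_before, treat_after, comp_before, comp_after):
    """Return the three comparisons discussed in the lecture.

    All names here are hypothetical labels chosen for this sketch.
    """
    # Before versus after: change in the project group only (outcome monitoring).
    before_vs_after = treat_after - treat_before
    # Ex-post single difference: project versus comparison, after the project.
    ex_post_single_difference = treat_after - comp_after
    # Double difference: change in project group minus change in comparison group.
    double_difference = before_vs_after - (comp_after - comp_before)
    return {
        "before_vs_after": before_vs_after,
        "ex_post_single_difference": ex_post_single_difference,
        "double_difference": double_difference,
    }


estimates = impact_estimates(treat_before=40, treat_after=92,
                             comp_before=26, comp_after=84)
for name, value in estimates.items():
    print(f"{name}: {value}")
# before_vs_after: 52
# ex_post_single_difference: 8
# double_difference: -6
```

The sign flips from +52 to –6 once the comparison group's own change (58) is netted out, which is the point the slides make about outcome monitoring versus impact.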
SO WE NEED BASELINE DATA FROM PROJECT AND COMPARISON AREAS

Main points so far
• Analysis of impact implies a counterfactual comparison
• Outcome monitoring is a factual analysis, and so cannot tell us about impact
• The counterfactual is most commonly determined by using a comparison group
If you are going to do impact evaluation you need a credible counterfactual using a control group, VERY PREFERABLY WITH BASELINE DATA

Exercise
• Using hypothetical outcome data for the before/after, comparison/treatment matrix calculate the:
  – Ex-post single difference
  – Before versus after (single difference)
  – Double difference
  – Impact estimates

Lecture 3
Selection bias and experimental designs (randomized control trials, RCTs)

Problems in implementing rigorous impact evaluation: selecting a comparison group
• Contagion: other interventions
• Spillover effects: control affected by intervention
• Selection bias: beneficiaries are different
• Ethical and political considerations

Contagion
• The problem: rural children do not go to preschool (“behind before they start”)
  – Many possible reasons:
    • Poor access to preschools
    • Too expensive (liquidity constraint)
    • Poor quality preschools
  – Baseline survey: only 19% of children attend preschool
• Objective of a Nokia-sponsored preschool-for-the-poor project: increase quality of teaching / eliminate liquidity constraint → increase attendance
• Approach: preschool teacher training … “Headstart” voucher program
  – RCT: 6 counties … 1/3 of towns get teacher training … 1/3 of towns get vouchers (tuition-free preschool) … 1/3 control (no training / no vouchers)
• Finding: no impact on attendance … after a 2-year project
  – More than 85% of children attending preschool in all groups of schools
WHY? Contagion: from a new government rule that gives children that have gone to preschool a higher chance of getting into the higher quality schools … elementary schools allowed / encouraged to open preschools …

Low attendance in preschool
[Figure: bar chart of preschool attendance (0 to 100), rural (6 county study) versus urban]
One reason might be that on average the nearest preschool is more than 20 kilometers away …

Relatively expensive …
• 300 yuan per semester tuition
• 200 yuan per semester lunch / transport
• Total cost per year = 500 yuan x 2 = 1000 yuan [per capita income at poverty line < 1000 yuan]
“This is one of those expenses we can do without …”
“I didn’t go to preschool, and I am OK …”
“I don’t even know what preschool is … where is the nearest one?”

Poor quality
[Figure: Educational Readiness of Children in Urban and Rural China. Density of readiness score (0 to 200). Urban panel: mean value = 90. Rural sample panel (Lushan): mean value = 64. Critical value = 70]
Our results show that ≈ 65% of children in poor rural areas are NOT ready in an educational sense

Spillovers
• Problem: intestinal worms in poor mountainous villages in southern Sichuan (up to 70% infections) → anemia → poor cognitive abilities → poor educational outcomes
• Objective: NIH-funded study … analyze the effect of a deworming campaign on health and educational outcomes
• Approach: RCT: 50 villages chosen as treatment villages (tested for worms) / 50 villages are controls … in treatment villages, if more than 50% infections, deworm all children in village; if less than 50%, test all children and deworm selectively … “paying for performance” incentives: clinicians get a bonus for low prevalence … the project made deworming drugs (albendazole) easily available …
• Results: in first pilot study … no difference … all villages experience sharp fall in infections …
WHY? Spillovers: village clinicians have monthly training meetings in the township hospital training facility … attended by clinicians in treated and control villages … when clinicians in control villages heard about the problem of worms, with easily accessible deworming drugs, the clinicians in the control villages took steps to reduce infections …

Behavior like this …
… leads to outcome like this …
Unless there are simple deworming interventions …

The problem of selection bias
• Program participants are not chosen at random, but selected through
  – Program placement
  – Self selection
• This is a problem if the correlates of selection are also correlated with the outcomes of interest, since those participating would do better (or worse) than others regardless of the intervention

Two general sources of selection bias

1. Selection bias from program placement
• A program of school improvements is targeted at the poorest schools
• Since these schools are in poorer areas it is likely that students have home and parental characteristics that are associated with lower learning outcomes (e.g. illiteracy, no electricity, child labor)
• Hence learning outcomes in project schools will be lower than the average for other schools
• The comparison group has to be drawn from a group of schools in similarly deprived areas

2. Selection bias from self-selection
• A community fund is available for community-identified projects (this was the nature of China’s new poverty reduction program of 2003)
• An intended outcome is to build social capital (as well as income earning opportunities) for future community development activities
• But those communities with higher degrees of cohesion and social organization (i.e.
social capital) are more likely to be able to make proposals for financing (in China, the village leaders that were the heads of single-surname villages were those that applied first and were selected first for poverty alleviation funding)
• Hence social capital would be higher amongst beneficiary communities than non-beneficiaries regardless of the intervention, so a comparison between these two groups will …

Other examples of selection bias
• Infant mortality in Bangladesh:
  – Hospital delivery (0.115 vs 0.067)
  – Led to conclusions that hospital births were more dangerous than having a baby at home … but is this a fair comparison? … which types of parents have their babies in a hospital? … perhaps those with potential complications …
• Secondary education and teenage pregnancy in Zambia
  – But, in Zambia, when a girl gets pregnant, she has to drop out … so of course there is a negative correlation between secondary educational attainment and getting pregnant

Main point
There is ‘selection’ in who benefits from nearly all interventions. So we need to get a control group which has the same characteristics as those selected for the intervention. This takes a lot of thinking … and needs to be considered at the time of the design of the project and IE

Dealing with selection bias
• Need to use experimental or quasi-experimental methods to cope with this; this is what has been meant by rigorous impact evaluation
• Experimental (randomized control trials = RCTs, commonly used in agricultural research and medical trials, but more widely applicable)
• Quasi-experimental
  – Propensity score matching
  – Regression discontinuity

Randomization (RCTs)
• Randomization addresses the problem of selection bias by the random allocation of the treatment
• Randomization may not be at the same level as the unit of intervention
  – Randomize across schools but measure individual learning outcomes
  – Randomize across sub-districts but measure village-level outcomes
• The fewer units over which you randomize, the higher your standard errors
• But you need to randomize across a ‘reasonable number’ of units
  – At least 30 for simple randomized design (though possible imbalance considered a problem for n < 200)
  – Can be as few as 10 for matched pair randomization

Issues in randomization
• Randomize across eligible population, not whole population
• Can randomize across the pipeline
• Is no less ethical than any other method with a control group (perhaps more ethical), and any intervention which is not immediately universal in coverage has an untreated population to act as a potential control group
• No more costly than other survey-based approaches

Conducting an RCT
• Has to be an ex-ante design
• Has to be politically feasible, with confidence that program managers will maintain the integrity of the design
• Perform power calculation to determine sample size (and therefore cost)
• Adopt strict randomization protocol
• Maintain information on how randomization was done, refusals and ‘cross-overs’
• A, B and A+B designs (factorial designs)
• Collect baseline data to:
  – Test quality of the
match
  – Conduct difference in difference analysis

Exercise
L2:
• Using hypothetical outcome data for the before/after, treatment matrix calculate the:
  – Ex-post single difference
  – Before versus after (single difference)
  – Double difference
Are these impact estimates? Why / why not?
L3:
• Does your intervention suffer from selection bias? Why?
• Is randomization an option for your intervention?
• At what level would you randomize?

Lecture 4
Quasi-experimental designs

Quasi-experimental approaches
• Possible methods
  – Propensity score matching
  – Regression discontinuity
  – Instrumental variables
• Advantage: can be done ex post, and when random assignment is not possible
• Disadvantage: cannot be assured of absence of selection bias

Propensity score matching
• Need someone with all the same age, education, religion etc.
• But matching on a single number calculated as a weighted average of these characteristics gives the same result as matching individually on every characteristic; this is the basis of propensity score matching
• The weights are given by the ‘participation equation’, that is, a probit equation of whether a person participates in the project or not

Propensity score matching: what you need
• Can be based on ex post single difference, though double difference is better
• Need common survey for treatment and potential control, or survey with common sections for matching variables and outcomes

Propensity score matching: example of matching: water supply in Nepal

Variable | Before matching | After matching
Rural resident | Treatment: 29%, Comparison: 78% | Treatment: 33%, Comparison: 38%
Richest wealth quintile | Treatment: 46%, Comparison: 2% | Treatment: 39%, Comparison: 36%
H/h higher education | Treatment: 21%, Comparison: 4% | Treatment: 17%, Comparison: 17%
Outcome (diarrhea incidence, children < 2) | Treatment: 18%, Comparison: 23%, OR = 1.28 | Treatment: 15%, Comparison: 23%, OR = 1.53
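The ratios quoted for the diarrhea outcome can be checked by hand. The sketch below assumes the percentages are incidence rates in each group; the function names are illustrative. Computed this way, the figures labelled "OR" (1.28 and 1.53) match the simple ratio of incidences, comparison over treatment, while an odds ratio from the same percentages comes out somewhat higher. Reading the label as an incidence ratio is an inference from the arithmetic, not something the slides state.

```python
def incidence_ratio(p_comparison, p_treatment):
    """Ratio of incidence in the comparison group to incidence in the treatment group."""
    return p_comparison / p_treatment


def odds_ratio(p_comparison, p_treatment):
    """Ratio of the odds of the outcome in the comparison group to the odds in the treatment group."""
    odds_comparison = p_comparison / (1 - p_comparison)
    odds_treatment = p_treatment / (1 - p_treatment)
    return odds_comparison / odds_treatment


# Diarrhea incidence among children under 2, taken from the Nepal example.
samples = {
    "before matching": {"treatment": 0.18, "comparison": 0.23},
    "after matching": {"treatment": 0.15, "comparison": 0.23},
}

for label, groups in samples.items():
    ir = incidence_ratio(groups["comparison"], groups["treatment"])
    or_ = odds_ratio(groups["comparison"], groups["treatment"])
    print(f"{label}: incidence ratio = {ir:.2f}, odds ratio = {or_:.2f}")
# before matching: incidence ratio = 1.28, odds ratio = 1.36
# after matching: incidence ratio = 1.53, odds ratio = 1.69
```

Either way the direction is the same: after matching, the gap between comparison and treatment households widens, so the unmatched comparison understated the estimated benefit.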
Regression discontinuity: an example (agricultural input supply program)

Naïve impact estimates
• Total = income(treatment) – income(control) = 9.6
• Agricultural h/h only = 7.7
• But there is a clear link between net income and land holdings
• And it turns out that the program targeted those households with at least 1.5 ha of land (you can see this in the graph)
• So selection bias is a real issue, as the treatment group would have been better off in the absence of the program, so the single difference estimate is biased upward

Regression discontinuity
• Where there is a ‘threshold allocation rule’ for program participation, we can estimate impact by comparing outcomes for those just above and below the threshold (as these groups are very similar)
• We can do that by estimating a regression with a dummy for the threshold value (and possibly also a slope dummy); see graph
• In our case the impact estimate is 4.5, which is much less than that from the naïve estimates (less than half)

Discussion
• What possible quasi-experimental designs could you use for your intervention?

Lecture 5
Data collection and mixed methods

Overview on data collection
• Baseline, midterm and endline
• Treatment and comparison
• Process data
• Capture contagion and spillovers
• Quantitative and qualitative
• Different levels (e.g. facility data, worker data): link the data
• Multiple data sources

Data used in BINP study
• Project evaluation data (three rounds)
• Save the Children evaluation
• Helen Keller Nutritional Surveillance Survey
• DHS (one round)
• Project reports
• Anthropological studies of village life
• Action research (focus groups, CNP survey)

Some study costs
• IADB vocational training studies: US$20,000 each
• IEG BINP study: US$40,000
• IEG rural electrification study: US$120,000
• IEG Ghana education study: US$500,000
• Average 3ie study: US$300,000 +
• Average 3ie study in Africa with two rounds of surveys: US$500,000 +

Timelines
• Ex post: 12-18 months
• Ex ante:
  – Lead time for survey design: 3-6 months
  – Post-survey to first impact estimates: 6-9 months
  – Report writing and consultation: 3-6 months

Budget and timeline
• Ex post or ex ante
• Existing data or new data
• How many rounds of data collection?
• How large is the sample?
• When is it sensible to estimate impact?

The role of mixed methods
• The mother-in-law effect in Bangladesh
• The angry man in Andhra Pradesh
• The disconnected in connected villages pretty much everywhere
• Social funds in Zambia

Exercise
• Propose for your intervention
  – Timeline for impact evaluation
  – Budget

Final exercise
• Prepare a 5 minute presentation which covers:
  – What the intervention is
  – The main evaluation questions
  – How they will be addressed
  – What it will cost and how long it will take