ENHANCING POLICY EFFECTIVENESS: THE ROLE OF IMPACT EVALUATION
UNDERSTANDING IMPACT EVALUATION
Howard White
Lecture 1
Steps in impact evaluation design
What is impact?
• Impact = the outcome with the intervention
compared to what it would have been in
the absence of the intervention
• At the heart of this is the idea of attribution – and attribution implies a counterfactual
When to do an impact evaluation
• Pilot programs
• Innovative programs
• Representative or important programs
Steps in evaluation design
• Clearly state objectives and
outcomes
• Map out the causal chain
• Identify evaluation questions
• Identify best available method for
different evaluation questions
The principles
• Map out the causal chain (programme theory): see the figure from the BINP (Bangladesh Integrated Nutrition Project) example
• Understand context: Bangladesh is not Tamil Nadu
• Anticipate heterogeneity: more malnourished children;
different implementing agencies
• Rigorous evaluation of impact using an appropriate
counterfactual: PSM versus simple comparison
• Rigorous factual analysis: targeting, the knowledge-practice gap
• Use mixed methods: informed by anthropology, focus
groups, own field visits
Program design (theory of change)

[Figure: BINP causal chain. Labels in the original diagram: A – targeting of participants (mothers); B1/B2 – counselling; C1/C2 – supplementary feeding (children); D1–D3 – behaviour change in child nutrition, no leakage or substitution, sufficient quantity and quality; E – improved nutrition outcomes. The assumptions at each link:]
• The target group (mothers of young children) participates in the program
• The target group for nutritional counselling is the relevant one
• Exposure to nutritional counselling results in knowledge acquisition and behaviour change
• Behaviour change is sufficient to change child nutrition
• Children are correctly identified to be enrolled in the program
• Food is delivered to those enrolled
• Supplementary feeding is supplemental, i.e. no leakage or substitution
• Food is of sufficient quantity and quality
→ Improved nutritional outcomes
Anemia project: causal chain

| Stage | Indicator | Measure | Assumptions |
| --- | --- | --- | --- |
| Inputs | Children / teachers / vitamins | Expenditure / number of training material packages for teachers / number of vitamins | |
| Activities / processes | Give each student a vitamin per day | Number of vitamins passed out per day | Vitamins are actually given |
| Outputs | Kids given improved nutrition | Number of kids that took vitamins for the entire treatment | Kids swallow vitamins when given / kids take vitamins on weekends |
| Intermediate outcomes | Falling anemia rates (healthy kids) | Hb levels | Hb tests are accurate |
| Final outcomes | Better educational performance | Scores on standardized tests | Scores measure educational performance / no extra help was given to kids in vitamin schools |
A causal chain (example 2)

| Stage | Indicator | Measure | Assumptions |
| --- | --- | --- | --- |
| Inputs | Money: teaching materials, stipends, trainers | Expenditure | Supply of materials for curriculum |
| Activities / processes | Teacher training | No. of teachers trained | Training of good quality |
| Outputs | Trained teachers | Teacher skills | Teachers apply skills / teachers retained |
| Intermediate outcomes | Improved pupil learning outcomes | Test scores | School environment allows use of improved methods / test scores comparable |
| Final outcomes | Higher productivity, higher earnings | Income or consumption | Productive employment available |
Common questions in IE

| Area | Design | Implementation | Evaluation questions |
| --- | --- | --- | --- |
| Targeting | The right people? | Type I and II errors | Why is targeting failing? (protocols faulty, not being followed, corruption …) |
| Training / capacity building | Right people? Is it appropriate? Mechanisms to ensure skills are utilized | Quality of delivery; skills / knowledge acquired and used | Have skills / knowledge changed? Are skills / knowledge being applied? Do they make a difference to outcomes? |
| Intervention delivery | Tackling a binding constraint? Appropriate? Within local institutional capacity | Delivered as intended: protocols followed, no leakages, technology functioning and maintained | What problems have been encountered in implementation? When did the first benefits start being realized? How is the intervention perceived by IA staff and beneficiaries? |
| Behavior change | Is the desired behavior change culturally possible and appropriate; will it benefit the intended beneficiaries? | Is behavior change being promoted as intended (right people, right message, right media)? | Is behavior change occurring? If not, why not? |
Exercise
• Identify the objective(s) for your
intervention and main outcome(s) (select
measurable indicators)
• Map out theory of change underlying your
project
• What are the related evaluation questions?
(Make the point about the difference between an outcome and an impact.)
Lecture 2/3
Counterfactuals and comparison
groups
Selection bias and experimental
design
The attribution problem: factual and counterfactual
[Figure: factual versus counterfactual outcome paths over time; the impact (the gap between them) varies over time.]
What has been the impact of
the French revolution?
“It is too early to say”
Zhou Enlai
• So where does the counterfactual come
from?
• Most usual is to use a comparison group
of similar people / households / schools /
firms…
What do we need to measure impact? Girls' secondary enrolment

| | Before | After |
| --- | --- | --- |
| Project (treatment) | | 92 |
| Comparison | | |

The majority of evaluations have just this information … which means we can say absolutely nothing about impact.
Before versus after: single difference comparison

| | Before | After |
| --- | --- | --- |
| Project (treatment) | 40 | 92 |
| Comparison | | |

Before versus after = 92 – 40 = 52

"Scholarships have led to rising schooling of young girls in the project villages"

This 'before versus after' approach is outcome monitoring, which has become popular recently. Outcome monitoring has its place, but it is not impact evaluation.
[Figure: bar chart of completion rates of elementary male and female students in all rural China's poor areas, 1993 versus 2008; vertical axis: share of rural children (0–100%), bars for girls and boys.]
Post-treatment control comparison

| | Before | After |
| --- | --- | --- |
| Project (treatment) | | 92 |
| Control | | 84 |

Single difference = 92 – 84 = 8

But we don't know if the groups were similar before … though there are ways of dealing with this (statistical matching = quasi-experimental approaches)
Double difference comparison

| | Before | After |
| --- | --- | --- |
| Project (treatment) | 40 | 92 |
| Comparison | 26 | 84 |

Double difference = (92 – 40) – (84 – 26) = 52 – 58 = –6

Conclusion: longitudinal (panel) data, with a comparison group, allow for the strongest impact evaluation design (though matching is still needed). SO WE NEED BASELINE DATA FROM PROJECT AND COMPARISON AREAS.
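To make the arithmetic concrete, here is a minimal Python sketch using the slide's hypothetical enrolment figures (all variable names are illustrative):

```python
# Hypothetical girls' enrolment rates from the tables above.
treat_before, treat_after = 40, 92
comp_before, comp_after = 26, 84

# Ex-post single difference: treatment vs comparison, after only.
ex_post = treat_after - comp_after                       # 92 - 84 = 8

# Before-versus-after single difference: treatment group only.
before_after = treat_after - treat_before                # 92 - 40 = 52

# Double difference: change in treatment minus change in comparison.
double_diff = (treat_after - treat_before) - (comp_after - comp_before)
print(ex_post, before_after, double_diff)                # 8 52 -6
```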
Main points so far
• Analysis of impact implies a counterfactual
comparison
• Outcome monitoring is a factual analysis, and so
cannot tell us about impact
• The counterfactual is most commonly
determined by using a comparison group
If you are going to do impact evaluation you need a credible counterfactual, using a control group, VERY PREFERABLY WITH BASELINE DATA
Exercise
• Using hypothetical outcome data for the before/after, comparison/treatment matrix, calculate the following impact estimates:
– Ex-post single difference
– Before versus after (single difference)
– Double difference
Lecture 3
Selection bias and experimental designs
(randomized control trials, RCTs)
Problems in implementing rigorous impact
evaluation: selecting a comparison group
• Contagion: other interventions
• Spill over effects: control affected by intervention
• Selection bias: beneficiaries are different
• Ethical and political considerations
Contagion
• The problem: rural children do not go to preschool → "behind before they start"
– Many possible reasons:
• Poor access to preschools
• Too expensive (liquidity constraint)
• Poor quality preschools
– Baseline survey: only 19% of children attend preschool
• Objective of a Nokia-sponsored preschool-for-the-poor project: increase quality of teaching / eliminate the liquidity constraint → increase attendance
• Approach: preschool teacher training … "Headstart" voucher program
– RCT: 6 counties … 1/3 of towns get teacher training … 1/3 of towns get vouchers (tuition-free preschool) … 1/3 control (no training / no vouchers)
• Finding: no impact on attendance … after the 2-year project
– More than 85% of children attending preschool in all groups of schools
WHY?
Contagion: a new government rule gives children who have gone to preschool a higher chance of getting into the higher-quality elementary schools … elementary schools were allowed / encouraged to open their own preschools …
Low attendance in preschool
[Figure: bar chart of preschool attendance (0–100%), rural (6-county study) versus urban.]
One reason might be that, on average, the nearest preschool is more than 20 kilometers away …

Relatively expensive …
• 300 yuan per semester → tuition
• 200 yuan per semester → lunch / transport
• Total cost per year = 500 yuan × 2 = 1,000 yuan
[per capita income at the poverty line < 1,000 yuan]

"This is one of those expenses we can do without …"
"I didn't go to preschool, and I am OK …"
"I don't even know what preschool is … where is the nearest one?"
Poor quality
[Figure: "Educational Readiness of Children in Urban and Rural China" – distributions of readiness scores (0–200). Urban sample: mean = 90; rural (Lushan) sample: mean = 64; critical value = 70.]
Our results show that ≈ 65% of children in poor rural areas are NOT ready in an educational sense.
Spillovers
• Problem: intestinal worms in poor mountainous villages in southern Sichuan (up to 70% infection rates) → anemia → poor cognitive abilities → poor educational outcomes
• Objective: NIH-funded study … analyze the effect of a deworming campaign on health and educational outcomes
• Approach: RCT: choose 50 villages as treatment villages and test for worms / 50 villages are controls … in treatment villages, if more than 50% are infected, deworm all children in the village; if less than 50%, test all children and deworm selectively … "paying-for-performance" incentives: clinicians get a bonus for low prevalence … the project made deworming drugs (albendazole) easily available …
• Results: in the pilot study … no difference … all villages experience a sharp fall in infections …
WHY? Spillovers: village clinicians have monthly training meetings at the township hospital training facility … attended by clinicians from treated and control villages … when the clinicians in the control villages heard about the problem of worms, with deworming drugs easily accessible, they took steps to reduce infections …
[Photos: behavior like this … leads to outcomes like this … unless there are simple deworming interventions …]
The problem of selection bias
• Program participants are not chosen at random,
but selected through
– Program placement
– Self selection
• This is a problem if the correlates of selection
are also correlated with the outcomes of interest,
since those participating would do better (or
worse) than others regardless of the intervention
Two general sources of selection bias
1. Selection bias from program placement
• A program of school improvements is targeted at the
poorest schools
• Since these schools are in poorer areas, it is likely that students have home and parental characteristics that are associated with lower learning outcomes (e.g. illiteracy, no electricity, child labor)
• Hence learning outcomes in project schools will be
lower than the average for other schools
• The comparison group has to be drawn from a
group of schools in similarly deprived areas
2. Selection bias from self-selection
• A community fund is available for community-identified
projects (this was the nature of China’s new poverty
reduction program of 2003)
• An intended outcome is to build social capital (as well as
income earning opportunities) for future community
development activities
• But those communities with higher degrees of cohesion and
social organization (i.e. social capital) are more likely to be
able to make proposals for financing (in China, the village
leaders that were the head of single surname villages were
those that applied first and were selected first for poverty
alleviation funding)
• Hence social capital would be higher amongst beneficiary communities than non-beneficiaries regardless of the intervention, so a simple comparison between these two groups will overstate the program's impact
Other examples of selection bias
• Infant mortality in Bangladesh:
– Hospital delivery (0.115) vs home delivery (0.067) →
led to conclusions that hospital births were more dangerous than having a baby at home … but is this a fair comparison? … which types of parents have their babies in a hospital? … perhaps those with potential complications …
• Secondary education and teenage pregnancy in Zambia
– But in Zambia, when a girl gets pregnant, she has to drop out … so of course there is a negative correlation between secondary educational attainment and getting pregnant
Main point
There is 'selection' in who benefits from nearly all interventions. So we need a control group which has the same characteristics as those selected for the intervention.
This takes a lot of thinking … and needs to be considered at the time of the design of the project and the IE.
Dealing with selection bias
• Need to use experimental or quasi-experimental
methods to cope with this; this is what is meant by 'rigorous impact evaluation'
• Experimental (randomized control trials = RCTs,
commonly used in agricultural research and medical
trials, but are more widely applicable)
• Quasi-experimental
– Propensity score matching
– Regression discontinuity
Randomization (RCTs)
• Randomization addresses the problem of selection bias by
the random allocation of the treatment
• Randomization may not be at the same level as the unit of
intervention
– Randomize across schools but measure individual learning outcomes
– Randomize across sub-districts but measure village-level outcomes
• The fewer the units over which you randomize, the higher your standard errors
• But you need to randomize across a ‘reasonable number’ of
units
– At least 30 for a simple randomized design (though possible imbalance is considered a problem for n < 200)
– Can be as few as 10 for matched-pair randomization (both variants are sketched below)
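A minimal sketch of how such an assignment might be drawn, using hypothetical school identifiers and a made-up baseline covariate for the matched-pair variant:

```python
import numpy as np

rng = np.random.default_rng(seed=42)
schools = [f"school_{i:02d}" for i in range(30)]      # >= 30 units for a simple design

# Simple cluster randomization: half the schools treated, half control.
labels = rng.permutation([1] * 15 + [0] * 15)         # 1 = treatment, 0 = control
assignment = dict(zip(schools, labels))

# Matched-pair randomization: sort clusters on a baseline covariate,
# pair neighbours, then flip a coin within each pair.
baseline = rng.normal(size=30)                        # e.g. baseline test scores
ordered = [s for _, s in sorted(zip(baseline, schools))]
paired_assignment = {}
for i in range(0, 30, 2):
    a, b = ordered[i], ordered[i + 1]
    coin = int(rng.integers(2))
    paired_assignment[a], paired_assignment[b] = coin, 1 - coin
```

Treatment is assigned at the cluster level even though outcomes may be measured on individuals, which is why standard errors must account for clustering.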
Issues in randomization
• Randomize across eligible population not whole
population
• Can randomize across the pipeline
• Is no less unethical than any other method with
a control group (perhaps more ethical), and any
intervention which is not immediately universal
in coverage has an untreated population to act
as a potential control group
• No more costly than other survey-based
approaches
Conducting an RCT
• Has to be an ex-ante design
• Has to be politically feasible, and confidence that program
managers will maintain integrity of the design
• Perform a power calculation to determine sample size, and therefore cost (see the sketch after this list)
• Adopt strict randomization protocol
• Maintain information on how randomization done, refusals
and ‘cross-overs’
• A, B and A+B designs (factorial designs)
• Collect baseline data to:
– Test quality of the match
– Conduct difference in difference analysis
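A hedged sketch of the power-calculation step, assuming statsmodels is available; the effect size, power, significance level, cluster size and intra-cluster correlation below are illustrative choices, not values from the lectures:

```python
from statsmodels.stats.power import TTestIndPower

# Sample size per arm to detect a 0.25 standard-deviation effect
# with 80% power at the 5% significance level.
n_per_arm = TTestIndPower().solve_power(effect_size=0.25, power=0.8, alpha=0.05)
print(round(n_per_arm))                      # ~252 individuals per arm

# For cluster-randomized designs, inflate by the design effect
# 1 + (m - 1) * icc, for cluster size m and intra-cluster correlation icc.
m, icc = 25, 0.05
print(round(n_per_arm * (1 + (m - 1) * icc)))   # ~555 per arm
```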
Exercise
Lecture 2:
• Using hypothetical outcome data for the before/after, comparison/treatment matrix, calculate the:
– Ex-post single difference
– Before versus after (single difference)
– Double difference
Are these impact estimates? Why / why not?
Lecture 3:
• Does your intervention suffer from selection bias? Why?
• Is randomization an option for your intervention?
• At what level would you randomize?
Lecture 4
Quasi-experimental designs
Quasi-experimental approaches
• Possible methods
– Propensity score matching
– Regression discontinuity
– Instrumental variables
• Advantage: can be done ex post, and when
random assignment not possible
• Disadvantage: cannot be assured of absence of
selection bias
Propensity score matching
• Ideally, we would match each participant with someone of the same age, education, religion etc.
• But matching on a single number, calculated as a weighted average of these characteristics, gives the same result as matching individually on every characteristic – this is the basis of propensity score matching
• The weights are given by the 'participation equation', that is, a probit equation of whether a person participates in the project or not (a minimal sketch follows)
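A minimal sketch of this idea, assuming a pandas DataFrame df with a treatment indicator, an outcome and the matching covariates; all column names are hypothetical, and a logit stands in for the probit mentioned above:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def psm_single_difference(df: pd.DataFrame, covariates: list[str],
                          treat_col: str = "treated",
                          outcome_col: str = "outcome") -> float:
    # Participation equation: probability of being in the program
    # given observed characteristics (the propensity score).
    model = LogisticRegression(max_iter=1000).fit(df[covariates], df[treat_col])
    df = df.assign(pscore=model.predict_proba(df[covariates])[:, 1])

    treated = df[df[treat_col] == 1]
    control = df[df[treat_col] == 0]

    # Match each treated unit to the nearest control on the score.
    nn = NearestNeighbors(n_neighbors=1).fit(control[["pscore"]])
    _, idx = nn.kneighbors(treated[["pscore"]])
    matched = control.iloc[idx.ravel()]

    # Ex-post single difference on the matched sample.
    return treated[outcome_col].mean() - matched[outcome_col].mean()
```

A real application would add refinements such as common-support trimming and balance checks like those in the Nepal table below.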
Propensity score matching:
what you need
• Can be based on ex post single difference,
though double difference is better
• Need common survey for treatment and
potential control, or survey with common
sections for matching variables and
outcomes
Propensity score matching: example of matching – water supply in Nepal

| Variable | Before matching | After matching |
| --- | --- | --- |
| Rural resident | Treatment: 29% / Comparison: 78% | Treatment: 33% / Comparison: 38% |
| Richest wealth quintile | Treatment: 46% / Comparison: 2% | Treatment: 39% / Comparison: 36% |
| H/h higher education | Treatment: 21% / Comparison: 4% | Treatment: 17% / Comparison: 17% |
| Outcome (diarrhea incidence, children < 2) | Treatment: 18% / Comparison: 23% (OR = 1.28) | Treatment: 15% / Comparison: 23% (OR = 1.53) |
Regression discontinuity: an example – agricultural input supply program
[Figure: net income plotted against land holdings for treatment and control households; the program targeted households with at least 1.5 ha.]
Naïve impact estimates
• Total = income(treatment) – income(control) = 9.6
• Agricultural h/h only = 7.7
• But there is a clear link between net income and land holdings
• And it turns out that the program targeted those households with at least 1.5 ha of land (you can see this in the graph)
• So selection bias is a real issue: the treatment group would have been better off even in the absence of the program, so the single difference estimate is biased upward
Regression discontinuity
• Where there is a ‘threshold allocation rule’ for program
participation, then we can estimate impact by comparing
outcomes for those just above and below the threshold
(as these groups are very similar)
• We can do that by estimating a regression with a dummy for the threshold value (and possibly also a slope dummy): see the graph and the sketch below
• In our case the impact estimate is 4.5, which is much
less than that from the naïve estimates (less than half)
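A minimal sketch of that regression, assuming hypothetical land and income arrays and the 1.5 ha allocation rule from this example (statsmodels assumed available):

```python
import numpy as np
import statsmodels.api as sm

def rdd_estimate(land: np.ndarray, income: np.ndarray, cutoff: float = 1.5) -> float:
    above = (land >= cutoff).astype(float)   # treatment dummy at the threshold
    run = land - cutoff                      # centred running variable
    # Intercept, running variable, threshold dummy, and slope dummy.
    X = sm.add_constant(np.column_stack([run, above, run * above]))
    fit = sm.OLS(income, X).fit()
    return fit.params[2]                     # jump at the threshold = impact estimate
```

In practice the sample would also be restricted to observations near the cutoff, since the comparison is only credible for households just above and below the threshold.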
Discussion
• What possible quasi-experimental designs
could you use for your intervention?
Lecture 5
Data collection and mixed methods
Overview on data collection
• Baseline, midterm and endline
• Treatment and comparison
• Process data
• Capture contagion and spillovers
• Quant and qual
• Different levels (e.g. facility data, worker data) – link the data
• Multiple data sources
Data used in BINP study
• Project evaluation data (three rounds)
• Save the Children evaluation
• Helen Keller Nutritional Surveillance
Survey
• DHS (one round)
• Project reports
• Anthropological studies of village life
• Action research (focus groups, CNP
survey)
Some study costs
• IADB vocational training studies:
US$20,000 each
• IEG BINP study US$40,000
• IEG rural electrification study US$120,000
• IEG Ghana education study US$500,000
• Average 3ie study US$300,000 +
• Average 3ie study in Africa with two
rounds of surveys: US$500,000 +
Timelines
• Ex post: 12-18 months
• Ex ante:
– Lead time for survey design: 3-6 months
– Post-survey to first impact estimates: 6-9 months
– Report writing and consultation: 3-6 months
Budget and timeline
• Ex post or ex ante?
• Existing data or new data?
• How many rounds of data collection?
• How large is the sample?
• When is it sensible to estimate impact?
The role of mixed methods
• The mother-in-law effect in Bangladesh
• The angry man in Andhra Pradesh
• The disconnected in connected villages
pretty much everywhere
• Social funds in Zambia
Exercise
• Propose for your intervention
– Timeline for impact evaluation
– Budget
Final exercise
• Prepare a 5 minute presentation which
covers:
– What the intervention is
– The main evaluation questions
– How they will be addressed
– What it will cost and how long it will take