Ex Post Evaluation Methods Slides

EVALUATION NOTES
BERGEN COURSE
SPRING, 2010
Petra Todd
University of Pennsylvania
Department of Economics
The Evaluation Problem

Will study econometric methods for evaluating
effects of active labor market programs



Employment, training and job search assistance
programs
School subsidy programs
Health interventions
Key questions




Do program participants benefit from the program?
Do program benefits exceed costs?
What is the social return to the program?
Would an alternative program yield greater impact
at the same cost?
Goals

Understand the identifying assumptions needed to
justify application of different estimators
 Statistical assumptions
 Behavioral assumptions
 Assumptions with regard to heterogeneity in how
people respond to a program intervention
Potential Outcomes






Y0 – outcome without treatment
Y1 – outcome with treatment
D=1 if receive treatment, else D=0
Observed outcome
Y=D Y1+(1-D) Y0
Treatment Effect
Δ= Y1-Y0
Δ not directly observed, missing data problem
Parameters of Interest

Average impact of treatment on the treated (TT)
E(Y1-Y0|D=1,X)

Average treatment effect (ATE)
E(Y1-Y0|X)

Average effect of treatment on the untreated (UT)
E(Y1-Y0|D=0,X)

ATE=Pr(D=1|X)TT+(1-Pr(D=1|X))UT
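
The relation between TT, UT, and ATE can be checked on simulated data. The sketch below is only an illustration under assumed conditions: the data-generating process (normal outcomes, participation driven partly by the individual gain Δ) and all names are hypothetical, and X is suppressed.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n=10_000, seed=0):
    """Draw (Y0, Y1, D) with selection on the individual gain (hypothetical DGP)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        gain = 1.0 + rng.gauss(0.0, 1.0)                   # individual effect, Y1 - Y0
        d = 1 if gain + rng.gauss(0.0, 1.0) > 1.0 else 0   # noisy selection on the gain
        rows.append((y0, y0 + gain, d))
    return rows

rows = simulate()
p = mean([d for _, _, d in rows])                       # Pr(D=1)
tt = mean([y1 - y0 for y0, y1, d in rows if d == 1])    # E(Y1-Y0|D=1)
ut = mean([y1 - y0 for y0, y1, d in rows if d == 0])    # E(Y1-Y0|D=0)
ate = mean([y1 - y0 for y0, y1, _ in rows])             # E(Y1-Y0)

print(f"TT={tt:.3f}  UT={ut:.3f}  ATE={ate:.3f}")
print(f"Pr(D=1)*TT + (1-Pr(D=1))*UT = {p * tt + (1 - p) * ut:.3f}")
```

Because participation depends on the gain in this simulation, TT exceeds ATE, which exceeds UT. With real data Y1-Y0 is never observed for anyone, so none of these averages can be computed directly.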
Other parameters of interest



Proportion of people benefiting from the program
Pr(Y1>Y0|D=1)=Pr(Δ>0|D=1)
Distribution of treatment effects
F(Δ|D=1,X)
Selected quantile
Inf {Δ:F(Δ|D=1,X)>q}
Model for potential outcomes with and without
treatment

Model:
Y1=Xβ1+U1
Y0=Xβ0+U0
E(U1|X)=E(U0|X)=0

Observed outcome:
Y=Y0+D(Y1-Y0)
Y= Xβ0+D(Xβ1- Xβ0)+U0+D(U1-U0)
Distinction between TT and ATE

TT=E(Δ|D=1,X)=Xβ1- Xβ0+E(U1-U0|D=1,X)

ATE= E(Δ|X)=Xβ1- Xβ0


TT depends on structural parameters as well as means of
unobservables
Parameters are the same if



(A1) U1=U0
(A2) E(U1-U0|D=1,X)=0
Condition (A2) means that D is uninformative on U1-U0, i.e. there is ex post
heterogeneity that is not acted on ex ante
Three Commonly Made Assumptions from least to
most general

Coefficient on D is fixed (given X) and is the same
for everyone (most restrictive)
 U1=U0
 Y=Xβ+Dα(X)+U
 E(Y1-Y0|X,D)=α(X)

Coefficient on D is random given X, but U1-U0 does not
help predict participation in the program
Pr(D=1| U1-U0 ,X)=Pr(D=1|X)
which implies
E(U1-U0 |D=1,X)= E(U1-U0 |X)

Coefficient on D is random given X and U1-U0 helps predict
program participation (least restrictive)
E(U1-U0 |D=1,X)≠E(U1-U0 |X)
How Can Randomization Solve the Evaluation
Problem?


Comparison group selected using a randomization
device to randomly exclude some fraction of program
applicants from the program
Main advantage – increase comparability between
program participants and nonparticipants
Have same distribution of observables and of
unobservables
 Satisfy program eligibility criteria

What problems can arise in social experiments?

Randomization bias – occurs when introducing
randomization changes the way the program
operates
 Greater recruitment needs may lead to change in
acceptance standards
 Individuals may decide not to apply if they know they
will be subject to randomization


Contamination bias – occurs when control group members
seek alternative forms of treatment
Ethical considerations – there may be opposition to the
experiment and some sites may refuse to participate,
which poses a threat to external validity

Dropout – some of the treatment group members may
drop out before completing the program

Sample attrition – may have differential attrition between
the treatment and control groups
At what stage should randomization be applied?






Randomization after acceptance into the program
Randomization of eligibility
Let R=1 if randomized in (treatment group),
R=0 if randomized out (control group)
Let Y1* and Y0* denote outcomes
Let D* denote someone who applies to the program
and is subject to randomization



From treatment group, get E(Y1*|X,D*=1,R=1)
From control group, get E(Y0*|X,D*=1,R=0)
No randomization bias and random assignment imply
E(Y1*|X,D*=1,R=1)=E(Y1|X,D=1)
E(Y0*|X,D*=1,R=0)=E(Y0|X,D=1)

Thus, the experiment gives
TT=E(Y1-Y0|X,D=1)
How does program dropout affect
experiments?


Can define treatment as “intent-to-treat” or “offer of
treatment,” in which case dropout not a problem
If dropout occurs prior to receiving the program (i.e.
dropouts do not get treatment), then could treat it like
randomization on eligibility.
Randomization on eligibility


Let e=1 if eligible, e=0 if not eligible
Let D=1 denote would-be participants if program
were made available
E(Y|X,e=1)=Pr(D=1|X,e=1)E(Y1|X,e=1,D=1)
+ Pr(D=0|X,e=1)E(Y0|X,e=1,D=0)
E(Y|X,e=0)=Pr(D=1|X,e=0)E(Y0|X,e=0,D=1)
+ Pr(D=0|X,e=0)E(Y0|X,e=0,D=0)

Because eligibility is randomized,
Pr(D=1|X,e=1)=Pr(D=1|X,e=0)
Pr(D=0|X,e=1)=Pr(D=0|X,e=0)
E(Y0|X,e,D=1)= E(Y0|X,D=1)
E(Y1|X,e,D=1)= E(Y1|X,D=1)

Thus, difference in previous two equations gives
Pr(D=1|X,e=1){E(Y1|X,e,D=1)-E(Y0|X,D=1)}
TT = [E(Y|X,e=1) - E(Y|X,e=0)] / Pr(D=1|X,e=1)
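
A minimal sketch of the eligibility-randomization estimator on simulated data. The data-generating process is an assumption made for illustration (people with low Y0 are more likely to be would-be participants, and eligibility is assigned by a fair coin); the function names are hypothetical and X is suppressed.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n=20_000, seed=1):
    """Return rows (Y, Y0, Y1, D, e) under randomized eligibility e."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        y0 = rng.gauss(0.0, 1.0)
        y1 = y0 + 2.0 + rng.gauss(0.0, 1.0)
        d = 1 if y0 + rng.gauss(0.0, 1.0) < 0.0 else 0   # would-be participant
        e = 1 if rng.random() < 0.5 else 0               # randomized eligibility
        y = y1 if (e == 1 and d == 1) else y0            # treated only if eligible and willing
        data.append((y, y0, y1, d, e))
    return data

def tt_from_eligibility(data):
    """[E(Y|e=1) - E(Y|e=0)] / Pr(D=1|e=1)."""
    y_e1 = [y for y, _, _, _, e in data if e == 1]
    y_e0 = [y for y, _, _, _, e in data if e == 0]
    take_up = mean([d for _, _, _, d, e in data if e == 1])
    return (mean(y_e1) - mean(y_e0)) / take_up

data = simulate()
true_tt = mean([y1 - y0 for _, y0, y1, d, _ in data if d == 1])
print(f"estimated TT = {tt_from_eligibility(data):.3f}, simulated TT = {true_tt:.3f}")
```

The estimator uses only quantities observable in such an experiment: mean outcomes by eligibility status and the take-up rate among the eligible.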
What about control group contamination?

Not necessarily a problem if willing to define
benchmark state as being excluded from the
program
What about sample attrition?



Attrition is a problem that is common to both
experimental and nonexperimental studies
Attrition occurs when some people are not followed
in the data (maybe due to nonresponse)
If attrition is nonrandom with respect to treatment,
then attrition requires the use of nonexperimental
evaluation methods
Sources of bias in estimating E(Δ|X), E(Δ|X,D=1)
Traditional (Simple) Regression Estimators



Cross-section
Before-after
Difference-in-differences
“Ashenfelter’s Dip”
[Figure: mean Y over time for the D=1 and D=0 groups, with T=0 marked on the time axis]
Before-after estimators
Drawbacks and Advantages of before-after
approach

Drawbacks
Identification breaks down in the presence of time-specific
intercepts
 Can be sensitive to choice of time periods because of
Ashenfelter Dip pattern


Advantage

minimal data requirements - only requires data on
participants.
Cross-section estimators
Difference-in-difference estimators

Advantages
 Allows for time-specific intercepts that are common
across groups
 Consistent under fixed effect error structure – therefore
allows for time-invariant unobservables to affect
participation decisions and program outcomes
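
The three simple estimators can be compared on a simulated two-period panel. This is a sketch under assumed conditions, not a definitive implementation: the fixed-effect error structure, the constant impact ALPHA, the common time shift TIME_SHIFT, and all names are hypothetical choices made for illustration.

```python
import random

def mean(values):
    return sum(values) / len(values)

ALPHA = 1.0        # true program impact (assumed constant)
TIME_SHIFT = 0.5   # time-specific intercept common to both groups

def simulate(n=10_000, seed=2):
    """Return rows (D, Y_before, Y_after) with selection on a fixed effect."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)                        # time-invariant unobservable
        d = 1 if theta + rng.gauss(0.0, 1.0) < 0.0 else 0  # low-theta people participate
        y_before = theta + rng.gauss(0.0, 1.0)
        y_after = theta + TIME_SHIFT + ALPHA * d + rng.gauss(0.0, 1.0)
        panel.append((d, y_before, y_after))
    return panel

panel = simulate()
treated = [(b, a) for d, b, a in panel if d == 1]
controls = [(b, a) for d, b, a in panel if d == 0]

before_after = mean([a - b for b, a in treated])
cross_section = mean([a for _, a in treated]) - mean([a for _, a in controls])
diff_in_diff = before_after - mean([a - b for b, a in controls])

print(f"before-after  = {before_after:.3f}  (picks up the time shift)")
print(f"cross-section = {cross_section:.3f}  (picks up selection on theta)")
print(f"diff-in-diff  = {diff_in_diff:.3f}  (true impact is {ALPHA})")
```

In this setup the before-after estimator is off by the common time effect, the cross-section estimator is off by the difference in fixed effects between groups, and differencing twice removes both.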
Matching Estimators
Assume have access to data on treated and
untreated individuals (D=1 and D=0)
 Assume also have access to a set of X variables
whose distribution is not affected by D
F(X|D,YP)=F(X|YP)
where YP=(Y0,Y1) “potential outcomes”




Matching estimators pair treated individuals with
observably similar untreated individuals
Usually assumed that
(Y0,Y1) ╨ D | X
(M-1)
or
Pr(D=1|X, Y0,Y1) = Pr(D=1|X)
and
0<Pr(D=1|X)<1 (M-2)
To justify this assumption, individuals cannot select into
the program based on anticipated treatment impact

Assumption (M-1) implies
F(Y0|D=1,X)=F(Y0|D=0,X)=F(Y0|X)
F(Y1|D=1,X)=F(Y1|D=0,X)=F(Y1|X)
also
E(Y0|D=1,X)=E(Y0|D=0,X)=E(Y0|X)
E(Y1|D=1,X)=E(Y1|D=0,X)=E(Y1|X)

Under assumptions that justify matching, can estimate
TT, ATE, and UT


Let n1 denote the number of observations in the
treatment group
A typical matching estimator for TT takes the
form:
Δ̂m = (1/n1) Σ_{i∈{Di=1}} [Y1i - Ê(Y0j | Xj=Xi, Dj=0)]
where Ê(Y0j | Xj=Xi, Dj=0)
is an estimator for the matched no-treatment outcome
Recall that (M-1) implies
E(Y0j | Xj=Xi, Dj=0) = E(Y0j | Xj=Xi, Dj=1)
How does matching compare to a randomized
experiment?



Distribution of observables will by construction be the
same in the matched control group as in the treatment
group
However, distribution of unobservables not necessarily
balanced across groups
Experiment has full support (M-2), but with matching
there can be a failure of the common support
condition (when matches cannot be found)
Even though matching methods assume
E(Y1-Y0|D=1,X)=E(Y1-Y0|X)

Could still potentially have
E(Y1-Y0|D=1)≠E(Y1-Y0)
E(Δ|D=1)=∫E(Δ|D=1,X)f(X|D=1)dX
E(Δ)=∫E(Δ|X)f(X)dX
If interest centers on TT, (M-1) can be replaced by
weaker assumption
E(Y0|X,D=1)=E(Y0|X,D=0)=E(Y0|X)
 The weaker assumption allows selection into the
program to depend on Y1 and allows
E(Y1-Y0|X,D)≠E(Y1-Y0|X)

Only require
Pr(D=1|X,Y0,Y1)=Pr(D=1|X,Y1)

Practical problems in Matching

Problems
 How to construct match when X is of high dimension
 How to choose set of X values
 What to do if Pr(D=1|X)=1 for some X (violation of
common support condition (M-2))
Rosenbaum and Rubin (1983) Theorem




Provide a solution to the problem of constructing a
match when X is of high dimension
Show that
(Y0,Y1) ╨ D | X
Implies
(Y0,Y1) ╨ D | Pr(D=1|X)
Reduces the matching problem to a univariate
problem, provided Pr(D=1|X) can be parametrically
estimated
Pr(D=1|X) is known as the propensity score
Proof of RR theorem
Let P(X)=Pr(D=1|X)
 E(D|Y0,P(X))=E(E(D|Y0,X)|Y0,P(X))
= E(P(X)|Y0,P(X))
=P(X)
where the first equality holds because X is finer than P(X), and the second because (M-1) gives
 E(D|Y0,X)=E(D|X)=P(X)

Matching can be implemented in two steps


Step 1: estimate a model for program participation,
estimate the propensity score P(Xi) for each person
Step 2: Select matches based on the estimated
propensity score
Δ̂m = (1/n1) Σ_{i∈{Di=1}} [Y1i(P̂i) - Ê(Y0j | P̂j=P̂i, Dj=0)]
Ways of constructing matched outcomes


Define a neighborhood C(Pi) for each person
i ∈ {Di=1}
Neighbors are persons in {Dj=0} for whom Pj ∈ C(Pi)
Set of persons matched to i is
Ai={j ∈ {Dj=0} such that Pj ∈ C(Pi)}

Nearest Neighbor Matching

C(Pi) = min_{j∈{Dj=0}} || Pi-Pj ||
=> Ai is a singleton set

Caliper matching
Matches only made if || Pi-Pj ||<ε for some prespecified
tolerance (tries to avoid bad matches)
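
A minimal sketch of nearest-neighbor matching on the propensity score with an optional caliper. It assumes the scores have already been estimated and that matching is done with replacement; the function name and data layout are hypothetical.

```python
def nearest_neighbor_tt(treated, controls, caliper=None):
    """Nearest-neighbor matching with replacement on the propensity score.

    treated, controls: lists of (propensity_score, outcome) pairs.
    caliper: optional tolerance; a treated unit whose nearest control is
             not strictly closer than this is left unmatched.
    Returns (estimated TT, number of treated units matched).
    """
    gaps = []
    for p_i, y_i in treated:
        # Closest control in terms of |Pi - Pj| (ties go to the first one listed).
        p_j, y_j = min(controls, key=lambda c: abs(c[0] - p_i))
        if caliper is not None and abs(p_j - p_i) >= caliper:
            continue  # no acceptable match: drop this treated unit
        gaps.append(y_i - y_j)
    if not gaps:
        raise ValueError("no treated unit has a match within the caliper")
    return sum(gaps) / len(gaps), len(gaps)

treated = [(0.30, 5.0), (0.60, 7.0), (0.95, 9.0)]
controls = [(0.28, 4.0), (0.65, 5.5), (0.50, 6.0)]
print(nearest_neighbor_tt(treated, controls))               # every treated unit matched
print(nearest_neighbor_tt(treated, controls, caliper=0.1))  # the unit at 0.95 is dropped
```

With the caliper, the treated unit at P=0.95 has no control within 0.1 and is excluded, so the estimate refers only to the region where matches exist.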
Kernel Matching

Estimate matched outcomes by nonparametric
regression
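
One way to sketch kernel matching is a kernel-weighted average of control outcomes around each treated person's propensity score. The Gaussian kernel, the bandwidth value, and the function names below are assumptions made for illustration; other kernels and bandwidth choices are possible.

```python
import math

def kernel_matched_outcome(p_i, controls, bandwidth=0.1):
    """Kernel-weighted average of control outcomes near propensity score p_i.

    controls: list of (propensity_score, outcome) pairs for the D=0 group.
    """
    weights = [math.exp(-0.5 * ((p_j - p_i) / bandwidth) ** 2) for p_j, _ in controls]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("no control observations near p_i; common support fails")
    return sum(w * y for w, (_, y) in zip(weights, controls)) / total

def kernel_matching_tt(treated, controls, bandwidth=0.1):
    """Average gap between treated outcomes and their kernel-matched outcomes."""
    gaps = [y_i - kernel_matched_outcome(p_i, controls, bandwidth) for p_i, y_i in treated]
    return sum(gaps) / len(gaps)

controls = [(0.4, 1.0), (0.6, 3.0)]
print(kernel_matched_outcome(0.5, controls))   # equal weights on both controls
print(kernel_matching_tt([(0.5, 5.0)], controls))
```

Unlike nearest-neighbor matching, every control contributes to each matched outcome, with weight declining in the distance between propensity scores.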
Local Linear Regression Matching
Difference-in-difference matching

Assume
(Y0t-Y0t’) ╨ D | Pr(D=1|X)
0<Pr(D=1|X)<1

Main advantage
Allows for time invariant unobservable differences between
the treatment group and the control group
 Selection into the program can be based on the
unobservables

Δ̂DD = (1/n1) Σ_{i∈{Di=1}} [(Y1it - Y0it') - Ê(Y0jt - Y0jt' | P̂j=P̂i, Dj=0)]
Should matches be reused?

If don’t reuse, then results will not be invariant to the
order in which observations were matched
Balancing Tests: Checking the Specification of the
Propensity Score Model

By R&R Theorem, for any set of variables Z
Different kinds of “balancing tests”


If conditioning on estimated value of P(Z) there is
additional dependence on Z then this can be
viewed as misspecification in P(Z)
In practice, researchers often group data according
to values of P(Z) (e.g. 5 strata) and compare
means of Z within each group (how to choose strata
not entirely clear)
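
A minimal sketch of the stratification check. It assumes equal-width strata on the estimated propensity score, which is only one possible choice, and the function name and record layout are hypothetical.

```python
from collections import defaultdict

def balance_by_strata(records, n_strata=5):
    """Compare means of a covariate Z across D within propensity-score strata.

    records: iterable of (pscore, z, d) with d in {0, 1}.
    Returns {stratum index: mean z of treated - mean z of controls} for
    every stratum that contains both treated and control observations.
    """
    cells = defaultdict(lambda: {0: [], 1: []})
    for pscore, z, d in records:
        stratum = min(int(pscore * n_strata), n_strata - 1)  # equal-width bins on [0, 1]
        cells[stratum][d].append(z)
    gaps = {}
    for stratum in sorted(cells):
        treated, controls = cells[stratum][1], cells[stratum][0]
        if treated and controls:
            gaps[stratum] = sum(treated) / len(treated) - sum(controls) / len(controls)
    return gaps

records = [(0.10, 1.0, 1), (0.15, 3.0, 0), (0.50, 2.0, 1), (0.55, 2.0, 0), (0.90, 4.0, 1)]
print(balance_by_strata(records))
```

Large within-stratum gaps suggest that Z still differs between groups after conditioning on the score. A formal test would also need standard errors, which this sketch omits.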
Or estimate regression…
If balancing tests fail…



Refine propensity score model and reestimate
Could use a semiparametric approach to estimate
P(Z)
If estimate P(Z) fully nonparametrically, then curse
of dimensionality returns and there is no gain to
using the propensity score methodology.
What variables to put in the propensity score?
Econometric Models of Program Participation






Assume individuals have the option of taking training in
period k
Prior to k, observe Y0j, j=1..k
After k, observe two potential outcomes (Y0t,Y1t)
To participate in training, individuals must apply and be
accepted, so there may be several decision-makers
determining who gets training
D=1 if participates, =0 else
Assume participation decisions are based on maximization of
future earnings
Simple model of participation

D=1 if
E[ Σ_{j=1}^{T-k} Y1,k+j/(1+r)^j - C - Σ_{j=0}^{T-k} Y0,k+j/(1+r)^j | Ik ] ≥ 0





First term is earnings stream if participates in program
C is the direct cost of training
Last term is earnings stream if do not participate
Ik is the information set at time k used to form expectations
Implications of this simple decision rule




Past earnings are irrelevant except for value in
predicting future earnings
Persons with lower foregone earnings or lower costs
are more likely to participate in programs
Older persons and persons with higher discount rates
are less likely to participate
The decision to take training is correlated with future
earnings only through the correlation with expected
future earnings
Special case of above model


Assume constant treatment effect α
D=1 if expected rewards exceed costs
E[ Σ_{j=1}^{T-k} α/(1+r)^j | Ik ] ≥ C + Y0k

If earnings are temporarily low (e.g. unemployed),
people are more likely to enroll in the program
Model is consistent with Ashenfelter’s Dip Pattern
Model of the decision process




Let IN=H(X)-V
H(X) = expected future rewards
V = costs=C+Y0k (assumed unknown)
If V assumed to be independent of X, then could estimate by
logistic or probit model:
Pr(D=1|X)=exp(H(X))/{1+exp(H(X))}
Pr(D=1|X)=Φ((H(X)-μ1)/σv)
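
The two participation probabilities can be computed as below. This is a sketch: the argument h stands for the value of H(X), and mu_v and sigma_v are assumed here to be the mean and standard deviation of V; all names are hypothetical.

```python
import math

def logit_prob(h):
    """Pr(D=1|X) = exp(h) / (1 + exp(h)), written to avoid overflow for large |h|."""
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)
    return e / (1.0 + e)

def probit_prob(h, mu_v=0.0, sigma_v=1.0):
    """Pr(D=1|X) = Phi((h - mu_v) / sigma_v), with Phi the standard normal CDF."""
    z = (h - mu_v) / sigma_v
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

for h in (-1.0, 0.0, 1.0):
    print(f"H(X)={h:+.1f}  logit={logit_prob(h):.3f}  probit={probit_prob(h):.3f}")
```

Both functions are increasing in H(X): higher expected rewards relative to costs raise the probability of participation.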
Evidence on Performance of Matching Estimators
Control function methods

References:



Roy (1951), Willis and Rosen (1979), Heckman and Honore
(1990), Heckman and Sedlacek (1985), Heckman and Robb
(1985, 1986)
Allow selectivity into the program to be based on
unobservables, explicitly model and control for
potential selectivity bias
Conventional to assume unobservables are normally
distributed, but can relax normality
Model for outcomes
Comparison of Control function and matching methods
Normal model

Note that the assumption of a normal model is inconsistent
with the assumption of the matching estimator
Bias Function


Matching assumption implies that B(P(Z))=0.
Difference-in-difference matching assumes that the
bias function differences out over time
Decomposition of Sources of Bias
for B=E(Y0|D=1)-E(Y0|D=0)
Propensity Score Distribution
Pointwise Bias and Comparison with Normal Model
Pointwise bias over time, conditional
on P
Nonexperimental Estimators: Regression
Discontinuity Methods


Rule determining who gets treatment, but
assignment is nonrandom
Probability of getting treatment changes
discontinuously as a function of underlying variables
Previous research




Introduced in Thistlethwaite and Campbell (1960)
Analyzed by Goldberger (1972) in context of
evaluating education interventions
Applications in Berk and Rauma (1983), van der
Klaauw (1996) and Angrist and Lavy (1996)
Many studies rely implicitly on nonlinearities or
discontinuities in the treatment assignment rules (e.g.
Black, 1996, and Angrist and Krueger, 1991)
[Figure: post-test score plotted against pre-test score, with cutoff c marked]
Questions



How does the RD design provide additional sources
of identifying information?
How can treatment effects be recovered with minimal
parametric restrictions?
What is the relationship between RD and IV
estimators?
Sharp design
[Figure: Pr(D=1|z) plotted against z]
Fuzzy design
[Figure: Pr(D=1|z) plotted against z, with cutoff c marked]
IV and Local IV (LIV) Estimators


Suppose binary instrument Z
Identifying assumption is that
When does the IV estimator identify treatment-on-the-treated?
Case 1: Common effect
Case 2: Heterogeneous impacts
Examples

Angrist (1990) uses draft lottery as instrument.
Could be invalid for TT parameter
 Firms take into account lottery numbers in making hiring
decisions
 Workers take actions to avoid draft

Moffitt (1996) uses cross-section variation in welfare
benefits as an instrument for participation in a job
training program
 Could be invalid if anticipated benefit correlated with
welfare benefit
LATE


Imbens and Angrist (1994, Econometrica) show that
even if the assumptions that would justify
application of IV for purpose of estimating TT are
not valid, the IV estimator still identifies the LOCAL
AVERAGE TREATMENT EFFECT (LATE)
LATE is the average treatment effect for the subset
of individuals induced to change their treatment
status by the instrument
 Distinction between “always-takers,” “compliers,” and
“never-takers”
 LATE is the average treatment effect for compliers only
and compliers cannot be identified in the data
 Size of the group of compliers is unknown and the
group may be instrument-dependent
 LATE assumes that the instrument moves everyone’s
propensity to take the treatment in the same direction
(monotone response to the instrument)
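
The LATE interpretation can be illustrated with a simulated binary instrument. The sketch assumes a hypothetical population of always-takers, compliers, and never-takers with different impacts, no defiers, and the standard ratio (Wald) form of the IV estimator for a binary instrument; all names and numbers are illustrative.

```python
import random

def mean(values):
    return sum(values) / len(values)

EFFECT = {"always": 3.0, "complier": 1.0, "never": 0.0}   # hypothetical impacts by type

def simulate(n=30_000, seed=3):
    """Return rows (Z, D, Y, individual effect)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random()
        kind = "always" if u < 0.2 else ("complier" if u < 0.7 else "never")
        z = 1 if rng.random() < 0.5 else 0                   # binary instrument
        d = 1 if kind == "always" or (kind == "complier" and z == 1) else 0
        y = rng.gauss(0.0, 1.0) + EFFECT[kind] * d           # observed outcome
        rows.append((z, d, y, EFFECT[kind]))
    return rows

rows = simulate()
y_gap = mean([y for z, _, y, _ in rows if z == 1]) - mean([y for z, _, y, _ in rows if z == 0])
d_gap = mean([d for z, d, _, _ in rows if z == 1]) - mean([d for z, d, _, _ in rows if z == 0])
wald = y_gap / d_gap                                          # IV estimate
tt = mean([effect for _, d, _, effect in rows if d == 1])     # effect among the treated

print(f"IV (Wald) estimate = {wald:.3f}  (complier effect is {EFFECT['complier']})")
print(f"TT in the simulation = {tt:.3f}")
```

In this simulation the IV estimate is close to the complier effect and well below TT, because the treated group also contains always-takers with a larger impact.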
Examples of LATE interpretation



In Angrist (1990) example, get effect of military service for the
subset induced to enter the military by the draft lottery (excludes
those who always join or who never go, despite the draft)
Angrist and Krueger (1991) study effect of schooling on earnings
using compulsory schooling laws as an instrument
 Gives treatment effect for the subset induced to enter school by
the instrument (tend to be low level schooling types)
Angrist and Evans (1998) – effect of fertility on labor supply using
twins as an instrument
MTE and LIV estimation
(Heckman and Vytlacil, 2005 Econometrica)


Develop a unifying theory for how TT, ATE, and LATE
all relate to one another
Propose a new concept called the “marginal
treatment effect” (MTE) and show how to estimate it
and how to build up the other parameters from it.
Treatment effect model
Parameters of interest
Interpret LATE in terms of MTE
TT, ATE, LATE as function of MTE
Estimation strategy

Uses fact that MTE is a limiting form of LATE
Bounding approaches (Heckman
and Smith, 1995, Manski, 1997)




Recall that from experiments, we can only learn
about the marginal distributions of Y0 and Y1 and
not about the joint distribution
Let Y0 and Y1 denote a discrete outcome, such as
employment status
(Y0,Y1) can take on the values (E,E),(E,N),(N,E) and
(N,N)
Would like to know treatment effect= PNE-PEN
We only see marginals
Frechet-Hoeffding bounds


Upper bound – prob of joint event cannot exceed
probability of the events that compose it
Lower bound – the sum of the individual cell
probabilities must equal one
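
A minimal sketch of the bounds for the employment example. It assumes only the two marginals are known, p0 = Pr(Y0=E) and p1 = Pr(Y1=E), and reads the cell labels as (Y0,Y1), so PNE is the probability of being not employed without treatment and employed with it. The function name and the numbers are hypothetical.

```python
def frechet_bounds(p0, p1):
    """Bounds on the joint cells of (Y0, Y1) given Pr(Y0=E)=p0 and Pr(Y1=E)=p1."""
    ee_low = max(0.0, p0 + p1 - 1.0)   # lower bound on Pr(Y0=E, Y1=E)
    ee_high = min(p0, p1)              # upper bound on Pr(Y0=E, Y1=E)
    return {
        "P_EE": (ee_low, ee_high),
        "P_EN": (p0 - ee_high, p0 - ee_low),   # employed without, not employed with
        "P_NE": (p1 - ee_high, p1 - ee_low),   # not employed without, employed with
    }

bounds = frechet_bounds(p0=0.4, p1=0.6)
for cell, (low, high) in bounds.items():
    print(f"{cell}: [{low:.2f}, {high:.2f}]")
```

The difference PNE-PEN equals p1-p0 at every point in the bounds, so it is identified from the marginals, while the individual cells are only bounded.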