EVALUATION NOTES
BERGEN COURSE, SPRING 2010
Petra Todd
University of Pennsylvania, Department of Economics

The Evaluation Problem
Will study econometric methods for evaluating effects of active labor market programs and related interventions:
- Employment, training and job search assistance programs
- School subsidy programs
- Health interventions

Key questions
- Do program participants benefit from the program?
- Do program benefits exceed costs?
- What is the social return to the program?
- Would an alternative program yield greater impact at the same cost?

Goals
Understand the identifying assumptions needed to justify application of different estimators:
- Statistical assumptions
- Behavioral assumptions
- Assumptions with regard to heterogeneity in how people respond to a program intervention

Potential Outcomes
- Y0: outcome without treatment
- Y1: outcome with treatment
- D=1 if receive treatment, else D=0
- Observed outcome: Y = D Y1 + (1-D) Y0
- Treatment effect: Δ = Y1-Y0
- Δ is not directly observed for any individual, so evaluation is a missing data problem

Parameters of Interest
- Average impact of treatment on the treated (TT): E(Y1-Y0|D=1,X)
- Average treatment effect (ATE): E(Y1-Y0|X)
- Average effect of treatment on the untreated (UT): E(Y1-Y0|D=0,X)
- ATE = Pr(D=1|X) TT + (1-Pr(D=1|X)) UT

Other parameters of interest
- Proportion of people benefiting from the program: Pr(Y1>Y0|D=1) = Pr(Δ>0|D=1)
- Distribution of treatment effects: F(Δ|D=1,X)
- Selected quantile: inf{Δ: F(Δ|D=1,X) > q}

Model for potential outcomes with and without treatment
- Model:
    Y1 = Xβ1 + U1
    Y0 = Xβ0 + U0
    E(U1|X) = E(U0|X) = 0
- Observed outcome:
    Y = Y0 + D(Y1-Y0)
    Y = Xβ0 + D(Xβ1-Xβ0) + U0 + D(U1-U0)

Distinction between TT and ATE
- TT = E(Δ|D=1,X) = Xβ1-Xβ0 + E(U1-U0|D=1,X)
- ATE = E(Δ|X) = Xβ1-Xβ0
- TT depends on structural parameters as well as means of unobservables
- The two parameters are the same if either
    (A1) U1=U0, or
    (A2) E(U1-U0|D=1,X)=0
- Condition (A2) means that D is uninformative on U1-U0, i.e. there is ex post heterogeneity, but it is not acted on ex ante

Three Commonly Made Assumptions (from least to most general)
1. Coefficient on D is fixed (given X) and is the same for everyone (most restrictive)
     U1=U0
     Y = Xβ + Dα(X) + U
     E(Y1-Y0|X,D) = α(X)
2. Coefficient on D is random given X, but U1-U0 does not help predict participation in the program
     Pr(D=1|U1-U0,X) = Pr(D=1|X)
   which implies
     E(U1-U0|D=1,X) = E(U1-U0|X)
3. Coefficient on D is random given X, and U1-U0 helps predict program participation (least restrictive)
     E(U1-U0|D=1,X) ≠ E(U1-U0|X)

How Can Randomization Solve the Evaluation Problem?
- Comparison group is selected using a randomization device that randomly excludes some fraction of program applicants from the program
- Main advantage: increases comparability between program participants and nonparticipants
  - Have same distribution of observables and of unobservables
  - Satisfy program eligibility criteria

What problems can arise in social experiments?
- Randomization bias: occurs when introducing randomization changes the way the program operates
  - Greater recruitment needs may lead to change in acceptance standards
  - Individuals may decide not to apply if they know they will be subject to randomization
- Contamination bias: occurs when control group members seek alternative forms of treatment
- Ethical considerations: there may be opposition to the experiment and some sites may refuse to participate, which poses a threat to external validity
- Dropout: some of the treatment group members may drop out before completing the program
- Sample attrition: may have differential attrition between the treatment and control groups

At what stage should randomization be applied?
- Randomization after acceptance into the program
- Randomization of eligibility

- Let R=1 if randomized in (treatment group), R=0 if randomized out (control group)
- Let Y1* and Y0* denote outcomes in the presence of randomization
- Let D*=1 denote someone who applies to the program and is subject to randomization
- From treatment group, get E(Y1*|X,D*=1,R=1)
- From control group, get E(Y0*|X,D*=1,R=0)
- No randomization bias and random assignment implies
    E(Y1*|X,D*=1,R=1) = E(Y1|X,D=1)
    E(Y0*|X,D*=1,R=0) = E(Y0|X,D=1)
- Thus, the experiment gives TT = E(Y1-Y0|X,D=1)

How does program dropout affect experiments?
- Can define treatment as "intent-to-treat" or "offer of treatment," in which case dropout is not a problem
- If dropout occurs prior to receiving the program (i.e. dropouts do not get treatment), then could treat it like randomization on eligibility

Randomization on eligibility
- Let e=1 if eligible, e=0 if not eligible
- Let D=1 denote would-be participants if program were made available
    E(Y|X,e=1) = Pr(D=1|X,e=1) E(Y1|X,e=1,D=1) + Pr(D=0|X,e=1) E(Y0|X,e=1,D=0)
    E(Y|X,e=0) = Pr(D=1|X,e=0) E(Y0|X,e=0,D=1) + Pr(D=0|X,e=0) E(Y0|X,e=0,D=0)
- Because eligibility is randomized,
    Pr(D=1|X,e=1) = Pr(D=1|X,e=0)
    Pr(D=0|X,e=1) = Pr(D=0|X,e=0)
    E(Y0|X,e,D=1) = E(Y0|X,D=1)
    E(Y0|X,e,D=0) = E(Y0|X,D=0)
    E(Y1|X,e,D=1) = E(Y1|X,D=1)
- Thus, the D=0 terms cancel and the difference between the previous two equations gives
    Pr(D=1|X,e=1) {E(Y1|X,D=1) - E(Y0|X,D=1)}
  so that
    \[ TT = \frac{E(Y \mid X, e=1) - E(Y \mid X, e=0)}{\Pr(D=1 \mid X, e=1)} \]

What about control group contamination?
- Not necessarily a problem if willing to define the benchmark state as being excluded from the program

What about sample attrition?
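Before turning to attrition, the randomization-on-eligibility result above can be checked numerically. The following is a minimal simulation sketch, not part of the original notes: the function name, the take-up rate, the constant effect for would-be participants, and the normal errors are all illustrative assumptions. It builds in selection on Y0 (would-be participants have higher untreated outcomes) and then forms the ratio of the eligible/ineligible mean difference to the take-up rate among the eligible.

```python
import random


def eligibility_tt_estimate(n=200_000, take_up=0.4, effect=2.0, seed=0):
    """Simulate randomized eligibility and return the ratio estimate of TT.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    sum_y = [0.0, 0.0]   # running sums of observed Y, indexed by e
    count = [0, 0]       # sample sizes, indexed by e
    participants = 0     # number with D=1 among the eligible (e=1)
    for _ in range(n):
        d = 1 if rng.random() < take_up else 0   # would-be participant
        e = 1 if rng.random() < 0.5 else 0       # randomized eligibility
        # Would-be participants have higher Y0, so D is not random.
        y0 = 10.0 + 3.0 * d + rng.gauss(0.0, 1.0)
        y1 = y0 + effect
        # Only eligible would-be participants actually receive treatment.
        y = y1 if (e == 1 and d == 1) else y0
        sum_y[e] += y
        count[e] += 1
        participants += d * e
    mean_gap = sum_y[1] / count[1] - sum_y[0] / count[0]
    pr_d_given_eligible = participants / count[1]
    return mean_gap / pr_d_given_eligible


print(round(eligibility_tt_estimate(), 2))
```

Under these assumptions the true TT equals the `effect` argument, so the printed estimate should be close to it even though would-be participants differ from nonparticipants in Y0.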
- Attrition is a problem that is common to both experimental and nonexperimental studies
- Attrition occurs when some people are not followed in the data (maybe due to nonresponse)
- If attrition is nonrandom with respect to treatment, then attrition requires the use of nonexperimental evaluation methods

Sources of bias in estimating E(Δ|X), E(Δ|X,D=1)

Traditional (Simple) Regression Estimators
- Cross-section
- Before-after
- Difference-in-differences

[Figure: "Ashenfelter's Dip". Mean Y over time for the D=1 and D=0 groups, with T=0 marked on the time axis.]

Before-after estimators

Drawbacks and Advantages of before-after approach
- Drawbacks
  - Identification breaks down in the presence of time-specific intercepts
  - Can be sensitive to choice of time periods because of the Ashenfelter's Dip pattern
- Advantage
  - Minimal data requirements: only requires data on participants

Cross-section estimators

Difference-in-difference estimators
- Advantages
  - Allows for time-specific intercepts that are common across groups
  - Consistent under a fixed effect error structure, and therefore allows time-invariant unobservables to affect participation decisions and program outcomes

Matching Estimators
- Assume have access to data on treated and untreated individuals (D=1 and D=0)
- Assume also have access to a set of X variables whose distribution is not affected by D
    f(X|D,YP) = f(X|YP), where YP=(Y0,Y1) are the "potential outcomes"
- Matching estimators pair treated individuals with observably similar untreated individuals
- Usually assumed that
    (Y0,Y1) ╨ D | X, or equivalently Pr(D=1|X,Y0,Y1) = Pr(D=1|X)   (M-1)
  and
    0 < Pr(D=1|X) < 1   (M-2)
- To justify this assumption, individuals cannot select into the program based on anticipated treatment impact
- Assumption (M-1) implies
    F(Y0|D=1,X) = F(Y0|D=0,X) = F(Y0|X)
    F(Y1|D=1,X) = F(Y1|D=0,X) = F(Y1|X)
  and also
    E(Y0|D=1,X) = E(Y0|D=0,X) = E(Y0|X)
    E(Y1|D=1,X) = E(Y1|D=0,X) = E(Y1|X)
- Under assumptions that justify matching, can estimate TT, ATE, and UT
- Let n_1 denote the number of observations in the treatment group
- A typical matching estimator for TT takes the form:
    \[ \hat{\Delta}_m = \frac{1}{n_1} \sum_{i \in \{D_i = 1\}} \left[ Y_{1i} - \hat{E}(Y_{0j} \mid X_j = X_i, D_j = 0) \right] \]
  where \hat{E}(Y_{0j} \mid X_j = X_i, D_j = 0) is an estimator for the matched no-treatment outcome
- Recall that (M-1) implies
    \[ E(Y_{0j} \mid X_j = X_i, D_j = 0) = E(Y_{0j} \mid X_j = X_i, D_j = 1) \]

How does matching compare to a randomized experiment?
- Distribution of observables will by construction be the same in the matched control group as in the treatment group
- However, distribution of unobservables is not necessarily balanced across groups
- Experiment has full support (M-2), but with matching there can be a failure of the common support condition (when matches cannot be found)
- Even though matching methods assume
    E(Y1-Y0|D=1,X) = E(Y1-Y0|X)
  could still potentially have
    E(Y1-Y0|D=1) ≠ E(Y1-Y0)
  because f(X|D=1) can differ from f(X):
    E(Δ|D=1) = ∫E(Δ|D=1,X) f(X|D=1) dX
    E(Δ) = ∫E(Δ|X) f(X) dX
- If interest centers on TT, (M-1) can be replaced by the weaker assumption
    E(Y0|X,D=1) = E(Y0|X,D=0) = E(Y0|X)
- The weaker assumption allows selection into the program to depend on Y1 and allows
    E(Y1-Y0|X,D) ≠ E(Y1-Y0|X)
- Only require
    Pr(D=1|X,Y0,Y1) = Pr(D=1|X,Y1)

Practical problems in Matching
- How to construct a match when X is of high dimension
- How to choose the set of X variables
- What to do if Pr(D=1|X)=1 for some X (violation of common support condition (M-2))

Rosenbaum and Rubin (1983) Theorem
- Provide a solution to the problem of constructing a match when X is of high dimension
- Show that
    (Y0,Y1) ╨ D | X
  implies
    (Y0,Y1) ╨ D | Pr(D=1|X)
- Reduces the matching problem to a univariate problem, provided Pr(D=1|X) can be parametrically estimated
- Pr(D=1|X) is known as the propensity score

Proof of RR theorem
- Let P(X) = Pr(D=1|X)
    E(D|Y0,P(X)) = E(E(D|Y0,X)|Y0,P(X))
                 = E(P(X)|Y0,P(X))
                 = P(X)
- The first equality holds because X is finer than P(X); the second holds because, under (M-1),
    E(D|Y0,X) = E(D|X) = P(X)

Matching can be implemented in two steps
- Step 1: estimate a model for program participation and use it to estimate the propensity score P(Xi) for each person
- Step 2: select matches based on the estimated propensity score
    \[ \hat{\Delta}_m = \frac{1}{n_1} \sum_{i \in \{D_i = 1\}} \left[ Y_{1i}(\hat{P}_i) - \hat{E}(Y_{0j} \mid \hat{P}_j = \hat{P}_i, D_j = 0) \right] \]

Ways of constructing matched outcomes
- Define a neighborhood C(Pi) for each person i ∈ {Di=1}
- Neighbors are persons in {Dj=0} for whom Pj ∈ C(Pi)
- Set of persons matched to i is Ai = {j ∈ {Dj=0} such that Pj ∈ C(Pi)}
- Nearest neighbor matching
    \[ C(P_i) = \min_j \| P_i - P_j \|, \quad j \in \{D_j = 0\} \]
  => Ai is a singleton set
- Caliper matching
  - Matches only made if || Pi-Pj || < ε for some prespecified tolerance ε (tries to avoid bad matches)
- Kernel matching
  - Estimate matched outcomes by nonparametric regression
- Local linear regression matching

Difference-in-difference matching
- Assume
    (Y0t-Y0t') ╨ D | Pr(D=1|X)
    0 < Pr(D=1|X) < 1
- Main advantage
  - Allows for time-invariant unobservable differences between the treatment group and the control group
  - Selection into the program can be based on these time-invariant unobservables
    \[ \hat{\Delta}_{DD} = \frac{1}{n_1} \sum_{i \in \{D_i = 1\}} \left[ (Y_{1it} - Y_{0it'}) - \hat{E}(Y_{0jt} - Y_{0jt'} \mid \hat{P}_j = \hat{P}_i, D_j = 0) \right] \]

Should matches be reused?
- If matches are not reused, then results will not be invariant to the order in which observations were matched

Balancing Tests: Checking the Specification of the Propensity Score Model
- By the R&R Theorem, for any set of variables Z,
    D ╨ Z | Pr(D=1|Z)
- Different kinds of "balancing tests"
  - If, after conditioning on the estimated value of P(Z), there is additional dependence on Z, then this can be viewed as misspecification in P(Z)
  - In practice, researchers often group data according to values of P(Z) (e.g. 5 strata) and compare means of Z within each group (how to choose strata is not entirely clear)
  - Or estimate regression…

If balancing tests fail…
- Refine the propensity score model and reestimate
- Could use a semiparametric approach to estimate P(Z)
- If P(Z) is estimated fully nonparametrically, then the curse of dimensionality returns and there is no gain to using the propensity score methodology

What variables to put in the propensity score?
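Before taking up that question, the second step of the matching procedure can be made concrete. The sketch below is illustrative only and is not taken from the notes: it assumes the propensity scores have already been estimated in Step 1, uses single nearest-neighbor matching with replacement (so matches are reused), and applies an optional caliper ε. The function name and the toy data are hypothetical.

```python
def match_tt(treated, controls, caliper=None):
    """Nearest-neighbor propensity score matching estimate of TT.

    treated, controls: lists of (estimated_pscore, outcome) pairs.
    Matching is with replacement. If a caliper is given, treated units
    whose nearest control is not strictly within the caliper are dropped.
    Returns (estimate, number_of_matched_treated_units).
    """
    diffs = []
    for p_i, y1_i in treated:
        # Nearest neighbor in the D=0 group: A_i is a singleton set.
        p_j, y0_j = min(controls, key=lambda c: abs(c[0] - p_i))
        if caliper is not None and abs(p_j - p_i) >= caliper:
            continue  # no acceptable match: common support fails here
        diffs.append(y1_i - y0_j)
    if not diffs:
        raise ValueError("no treated observation could be matched")
    return sum(diffs) / len(diffs), len(diffs)


# Hypothetical data: (propensity score, outcome)
treated = [(0.30, 12.0), (0.60, 15.0), (0.95, 20.0)]
controls = [(0.28, 10.0), (0.55, 11.0), (0.62, 14.0)]

print(match_tt(treated, controls))               # all treated units matched
print(match_tt(treated, controls, caliper=0.1))  # poorly matched unit dropped
```

In the toy data the treated unit with score 0.95 has no nearby control, so imposing the caliper drops it and changes the estimate. This illustrates that the caliper trades a smaller matched sample for better matches, and that the parameter being estimated shifts to TT over the region of common support.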

Econometric Models of Program Participation
- Assume individuals have the option of taking training in period k
- Prior to k, observe Y0j, j=1..k
- After k, observe two potential outcomes (Y0t,Y1t)
- To participate in training, individuals must apply and be accepted, so there may be several decision-makers determining who gets training
- D=1 if participates, D=0 otherwise
- Assume participation decisions are based on maximization of future earnings

Simple model of participation
- D=1 if
    \[ E\left[ \sum_{j=1}^{T-k} \frac{Y_{1,k+j}}{(1+r)^j} - C - \sum_{j=0}^{T-k} \frac{Y_{0,k+j}}{(1+r)^j} \,\middle|\, I_k \right] > 0 \]
- First term is the earnings stream if participates in the program
- C is the direct cost of training
- Last term is the earnings stream if does not participate
- Ik is the information set at time k used to form expectations

Implications of this simple decision rule
- Past earnings are irrelevant except for their value in predicting future earnings
- Persons with lower foregone earnings or lower costs are more likely to participate in programs
- Older persons and persons with higher discount rates are less likely to participate
- The decision to take training is correlated with future earnings only through the correlation with expected future earnings

Special case of above model
- Assume constant treatment effect α
- D=1 if expected rewards exceed costs:
    \[ E\left[ \sum_{j=1}^{T-k} \frac{\alpha}{(1+r)^j} \,\middle|\, I_k \right] > C + Y_{0k} \]
- If earnings are temporarily low (e.g. unemployed), people are more likely to enroll in the program
- Model is consistent with the Ashenfelter's Dip pattern

Model of the decision process
- Let IN = H(X)-V
  - H(X) = expected future rewards
  - V = costs = C+Y0k (assumed unknown)
- If V is assumed to be independent of X, then could estimate by logistic or probit model:
    Pr(D=1|X) = exp(H(X)) / {1 + exp(H(X))}
    Pr(D=1|X) = Φ((H(X)-μ1)/σv)

Evidence on Performance of Matching Estimators

Control function methods
- References: Roy (1951), Willis and Rosen (1979), Heckman and Honore (1990), Heckman and Sedlacek (1985), Heckman and Robb (1985, 1986)
- Allow selectivity into the program to be based on unobservables; explicitly model and control for potential selectivity bias
- Conventional to assume unobservables are normally distributed, but can relax normality

Model for outcomes

Comparison of Control function and matching methods

Normal model
- Note that the assumption of the normal model is inconsistent with the assumption of the matching estimator

Bias Function
- Matching assumption implies that B(P(Z))=0
- Difference-in-difference matching assumes that the bias function is the same in both periods, so that it differences out over time

Decomposition of Sources of Bias for B=E(Y0|D=1)-E(Y0|D=0)

Propensity Score Distribution

Pointwise Bias and Comparison with Normal Model

Pointwise bias over time, conditional on P

Nonexperimental Estimators: Regression Discontinuity Methods
- A known rule determines who gets treatment, but assignment is nonrandom
- Probability of getting treatment changes discontinuously as a function of underlying variables

Previous research
- Introduced in Thistlethwaite and Campbell (1960)
- Analyzed by Goldberger (1972) in context of evaluating education interventions
- Applications in Berk and Rauma (1983), van der Klaauw (1996) and Angrist and Lavy (1996)
- Many studies rely implicitly on nonlinearities or discontinuities in the treatment assignment rules (e.g.
  Black, 1996, and Angrist and Krueger, 1991)

[Figure: post-test score plotted against pre-test score, with cutoff c marked on the pre-test axis.]

Questions
- How does the RD design provide additional sources of identifying information?
- How can treatment effects be recovered with minimal parametric restrictions?
- What is the relationship between RD and IV estimators?

[Figure: Sharp design. Pr(D=1|z) plotted against z.]

[Figure: Fuzzy design. Pr(D=1|z) plotted against z, with cutoff c.]

IV and Local IV (LIV) Estimators
- Suppose binary instrument Z
- Identifying assumption is that [equation not recovered from the slides]
- When does the IV estimator identify treatment-on-the-treated?
  - Case 1: Common effect
  - Case 2: Heterogeneous impacts

Examples
- Angrist (1990) uses the draft lottery as an instrument. Could be invalid for the TT parameter if
  - Firms take into account lottery numbers in making hiring decisions
  - Workers take actions to avoid the draft
- Moffitt (1996) uses cross-section variation in welfare benefits as an instrument for participation in a job training program
  - Could be invalid if the anticipated benefit from the program is correlated with the welfare benefit

LATE
- Imbens and Angrist (1994, Econometrica) show that even if the assumptions that would justify application of IV for the purpose of estimating TT are not valid, the IV estimator still identifies the LOCAL AVERAGE TREATMENT EFFECT (LATE)
- LATE is the average treatment effect for the subset of individuals induced to change their treatment status by the instrument
- Distinction between "always-takers," "compliers," and "never-takers"
- LATE is the average treatment effect for compliers only, and compliers cannot be identified in the data
- Size of the group of compliers is unknown and the group may be instrument-dependent
- LATE assumes that the instrument shifts everyone's propensity to take the treatment in the same direction (monotone response to the instrument)

Examples of LATE interpretation
- In the Angrist (1990) example, get the effect of military service for the subset induced to enter the military by the draft lottery (excludes those who always join or who never go, despite the draft)
- Angrist and Krueger (1994) study the effect of schooling on earnings using compulsory schooling laws as an instrument
  - Gives the treatment effect for the subset induced to enter school by the instrument (tend to be low level schooling types)
- Angrist and Evans (1998): effect of fertility on labor supply using twins as an instrument

MTE and LIV estimation (Heckman and Vytlacil, 2005, Econometrica)
- Develop a unifying theory for how TT, ATE, and LATE all relate to one another
- Propose a new concept called the "marginal treatment effect" (MTE) and show how to estimate it and how to build up the other parameters from it

Treatment effect model

Parameters of interest

Interpret LATE in terms of MTE

TT, ATE, LATE as function of MTE

Estimation strategy
- Uses the fact that MTE is a limiting form of LATE

Bounding approaches (Heckman and Smith, 1995; Manski, 1997)
- Recall that from experiments, we can only learn about the marginal distributions of Y0 and Y1 and not about the joint distribution
- Let Y0 and Y1 denote a discrete outcome, such as employment status
- (Y0,Y1) can take on the values (E,E), (E,N), (N,E) and (N,N)
- Would like to know treatment effect = PNE-PEN
- We only see marginals

Frechet-Hoeffding bounds
- Upper bound: the probability of a joint event cannot exceed the probability of any of the events that compose it
- Lower bound: the sum of the individual cell probabilities must equal one
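
The two bounding statements above can be turned into a small calculation. The sketch below is an illustration under assumed notation, not part of the original notes: it takes the cell label (a,b) to mean Y0=a and Y1=b, so that P_NE = Pr(Y0=N, Y1=E), and it takes as inputs the two marginals p0 = Pr(Y0=E) and p1 = Pr(Y1=E). The function name is hypothetical.

```python
def frechet_bounds(p0, p1):
    """Frechet-Hoeffding bounds on the joint cells of (Y0, Y1).

    p0 = Pr(Y0 = E) and p1 = Pr(Y1 = E) are the observed marginals.
    Cell labels follow the assumed convention (Y0 status, Y1 status).
    Returns a dict mapping each cell to a (lower, upper) pair.
    """
    if not (0.0 <= p0 <= 1.0 and 0.0 <= p1 <= 1.0):
        raise ValueError("marginal probabilities must lie in [0, 1]")
    # Upper bound: a joint event is no more likely than either component.
    ee_high = min(p0, p1)
    # Lower bound: the four cell probabilities must sum to one.
    ee_low = max(0.0, p0 + p1 - 1.0)
    # The remaining cells follow from the marginals once P_EE is fixed:
    # P_NE = p1 - P_EE, P_EN = p0 - P_EE, P_NN = 1 - p0 - p1 + P_EE.
    return {
        "P_EE": (ee_low, ee_high),
        "P_NE": (p1 - ee_high, p1 - ee_low),
        "P_EN": (p0 - ee_high, p0 - ee_low),
        "P_NN": (1.0 - p0 - p1 + ee_low, 1.0 - p0 - p1 + ee_high),
    }


bounds = frechet_bounds(p0=0.4, p1=0.6)
for cell, (low, high) in bounds.items():
    print(cell, round(low, 3), round(high, 3))
```

With these hypothetical marginals, the net effect P_NE - P_EN equals p1 - p0 = 0.2 for every admissible joint distribution, because P_EE cancels in the difference. The individual cells are only bounded: P_NE, the share moved into employment by the program, can lie anywhere between 0.2 and 0.6.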