Four Parameters of Interest in the Evaluation of Social Programs

James J. Heckman, Justin L. Tobias, Edward Vytlacil

Nuffield College, Oxford, August 2005

1 Introduction

This paper uses a latent variable framework to unite the recent treatment effect literature with the classical selection bias literature. We obtain simple closed-form expressions for four treatment parameters of interest: the Average Treatment Effect (ATE), the effect of Treatment on the Treated (TT), the Local Average Treatment Effect (LATE) (Imbens and Angrist 1994), and the Marginal Treatment Effect (MTE) (Björklund and Moffitt 1987; Heckman 1997; Heckman and Vytlacil 1999, 2000a-b) for the "textbook" Gaussian selection model. We also discuss how one might approach estimation of the distributions associated with these parameters of interest.

2 Treatment Parameters in a Canonical Model

Consider a model of potential outcomes:

$$
\begin{aligned}
Y_1 &= X\beta_1 + U_1, \\
Y_0 &= X\beta_0 + U_0, \\
D^* &= Z\theta + U_D.
\end{aligned}
\tag{2.1}
$$

Each agent is observed in only one state, so that either $Y_1$ or $Y_0$ is observed. The pair $(Y_1, Y_0)$ is never observed for any given person. The gain from treatment is denoted by $\Delta \equiv Y_1 - Y_0$.

$D(Z)$ denotes the observed treatment decision: $D(Z) = 1$ denotes receipt of treatment and $D(Z) = 0$ denotes nonreceipt. $D^*$ is a latent variable which generates $D(Z)$:

$$
D(Z) = 1[D^*(Z) \ge 0] = 1[Z\theta + U_D \ge 0],
\tag{2.2}
$$

where $1[A]$ is the indicator function, which takes the value 1 if the event $A$ is true and 0 otherwise.

The framework covers an extension of the Roy (1951) model, $D^* = Y_1 - Y_0 - C$, where $C$ represents the cost of participating in the treated state.

$D(z) = 1[z\theta \ge -U_D]$ indicates whether or not the individual would have received treatment had her value of $Z$ been externally set to $z$, holding her unobserved $U_D$ constant.

By varying $Z_k$, we can manipulate an individual's probability of receiving treatment without affecting the potential outcomes.

Assume $(U_D, U_1, U_0)$ is independent of $X$ and $Z$.

$Y$ denotes observed earnings.
$$
Y = D Y_1 + (1 - D) Y_0.
\tag{2.3}
$$

This is the switching regression model of Quandt (1972), Rubin's model (Rubin 1978), or the Roy model of income distribution (Roy 1951; Heckman and Honoré 1990).^1

^1 Amemiya (1985) has classified models of this type as generalized tobit models, and refers to the model in (1) as the Type 5 tobit model.

Example: estimating the return to a college education. $Y$ represents log earnings, $Y_1$ denotes the log earnings of college graduates, and $Y_0$ denotes the log earnings of those not selecting into higher education. The latent index maps people into either the "college" (treated) state or the "no-college" (untreated) state. The object of interest is the expected college log wage premium for given characteristics $X$, i.e. $E(Y_1 - Y_0 \mid X)$.^2

^2 Other applications which fit directly into this model include Lee (1978) and Willis and Rosen (1979).

We examine four treatment parameters: the Average Treatment Effect (ATE), the effect of Treatment on the Treated (TT), the Local Average Treatment Effect (LATE), and the Marginal Treatment Effect (MTE).

Average Treatment Effect (ATE): the expected gain from participating in the program for a randomly chosen individual. With $\Delta \equiv Y_1 - Y_0$ the gain from program participation and $n$ the sample size,

$$
\mathrm{ATE}(x) = E(\Delta \mid X = x) = x(\beta_1 - \beta_0),
$$

$$
\mathrm{ATE} = E(\Delta) = \int \mathrm{ATE}(X)\, dF(X) \approx \frac{1}{n} \sum_{i=1}^{n} \mathrm{ATE}(x_i) = \bar{x}(\beta_1 - \beta_0).
$$

Treatment on the Treated (TT):

$$
\begin{aligned}
\mathrm{TT}(x, z, D(z) = 1) &= E(\Delta \mid X = x, Z = z, D(z) = 1) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D \ge -z\theta, X = x, Z = z) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D \ge -z\theta).
\end{aligned}
\tag{2.4}
$$

We can obtain an unconditional estimate by integrating. Letting $n_t$ denote the number of observations with $D_i = 1$, TT can be approximated as follows:

$$
\begin{aligned}
\mathrm{TT} &= E(\Delta \mid D(Z) = 1) \\
&= \int \mathrm{TT}(X, Z, D(Z) = 1)\, dF(X, Z \mid D(Z) = 1) \\
&\approx \frac{1}{n_t} \sum_{i=1}^{n} D_i\, \mathrm{TT}(x_i, z_i, D(z_i) = 1).
\end{aligned}
\tag{2.5}
$$

Local Average Treatment Effect (LATE) of Imbens and Angrist (1994): LATE is defined as the expected outcome gain for those induced to receive treatment through a change in the instrument from $Z_k = z_k$ to $Z_k = z_k'$.
We can write the LATE parameter in terms of a change in the index from $Z\theta = z\theta$ to $Z\theta = z'\theta$, where $z'\theta > z\theta$ and $z$ and $z'$ are identical except for their $k$th coordinate. We could equivalently define the treatment parameters in terms of the propensity score, $P(Z) = \Pr(D = 1 \mid Z) = 1 - F_{U_D}(-Z\theta)$, where $F_V$ denotes the cdf of the random variable $V$.

The LATE parameter:

$$
\begin{aligned}
\mathrm{LATE}(D(z) = 0, D(z') = 1, X = x) &= E(\Delta \mid D(z) = 0, D(z') = 1, X = x) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid -z'\theta \le U_D \le -z\theta, X = x) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid -z'\theta \le U_D \le -z\theta).
\end{aligned}
$$

There are two ways to define the unconditional version of LATE. First, consider

$$
\begin{aligned}
E(\Delta \mid D(z) = 0, D(z') = 1) &= \int \mathrm{LATE}(D(z) = 0, D(z') = 1, X)\, dF(X) \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \mathrm{LATE}(D(z) = 0, D(z') = 1, x_i).
\end{aligned}
\tag{2.6}
$$

The parameter $E(\Delta \mid D(z) = 0, D(z') = 1)$ is the treatment effect for individuals who would not select into treatment if their vector $Z$ was set to $z$ but would select into treatment if $Z$ was set to $z'$.

An alternative definition of the unconditional version of LATE is to let $Z^0(Z)$ equal $Z$ but with the $k$th element replaced by $z_k$, and to let $Z^1(Z)$ equal $Z$ but with the $k$th element replaced by $z_k'$. The second definition of the unconditional version of LATE is then

$$
\begin{aligned}
E(\Delta \mid D(Z^0(Z)) = 0, D(Z^1(Z)) = 1) &= \int \mathrm{LATE}(D(Z^0(Z)) = 0, D(Z^1(Z)) = 1, X)\, dF(X, Z) \\
&\approx \frac{1}{n} \sum_{i=1}^{n} \mathrm{LATE}(D(Z^0(z_i)) = 0, D(Z^1(z_i)) = 1, x_i).
\end{aligned}
\tag{2.7}
$$

Marginal Treatment Effect (MTE) (Björklund and Moffitt 1987; Heckman 1997; Heckman and Smith 1998; Heckman and Vytlacil 1999, 2000a-b):

$$
\begin{aligned}
\mathrm{MTE}(x, u_D) &= E(\Delta \mid X = x, U_D = u_D) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D = u_D, X = x) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D = u_D),
\end{aligned}
\tag{2.8}
$$

where the third equality follows because $(U_D, U_1, U_0)$ is independent of $X$.

Evaluating the MTE parameter at low values of $u_D$ averages the outcome gain for those with unobservables making them least likely to participate, while evaluating it at high values of $u_D$ gives the gain for those individuals with unobservables which make them most likely to participate.
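Since $X$ is independent of $U_D$, the MTE parameter unconditional on observed covariates can be written as

$$
\mathrm{MTE}(u_D) = \int \mathrm{MTE}(X, u_D)\, dF(X) \approx \frac{1}{n} \sum_{i=1}^{n} \mathrm{MTE}(x_i, u_D) = \bar{x}(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D = u_D).
$$

The MTE parameter can also be expressed as the limit form of the LATE parameter:

$$
\begin{aligned}
\lim_{z\theta \to z'\theta} \mathrm{LATE}(x, D(z) = 0, D(z') = 1) &= x(\beta_1 - \beta_0) + \lim_{z\theta \to z'\theta} E(U_1 - U_0 \mid -z'\theta \le U_D \le -z\theta, X = x) \\
&= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D = -z'\theta) \\
&= \mathrm{MTE}(x, -z'\theta).
\end{aligned}
$$

The MTE parameter measures the average gain in outcomes for those individuals who are just indifferent to treatment when the $z\theta$ index is fixed at the value $-u_D$.

3 Simple Expressions for the Different Treatment Parameters in the General Case

Textbook normal model:

$$
\begin{bmatrix} U_D \\ U_1 \\ U_0 \end{bmatrix}
\sim N\left(0,
\begin{bmatrix}
1 & \sigma_{1D} & \sigma_{0D} \\
\sigma_{1D} & \sigma_1^2 & \sigma_{10} \\
\sigma_{0D} & \sigma_{10} & \sigma_0^2
\end{bmatrix}
\right).
$$

Let $\rho_j$ denote the correlation between $U_j$ and $U_D$, so that $\sigma_{jD} = \rho_j \sigma_j$ for $j = 0, 1$.

Treatment on the Treated (TT) is:

$$
\mathrm{TT}(x, z, D(z) = 1) = x(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, \frac{\phi(z\theta)}{\Phi(z\theta)}.
$$

Thus, if $\mathrm{Cov}(U_1 - U_0, U_D) = 0$, or equivalently $\rho_1\sigma_1 = \rho_0\sigma_0$, then $\mathrm{TT} = \mathrm{ATE}$. If $\mathrm{Cov}(U_1 - U_0, U_D) > 0$, then $\mathrm{TT} > \mathrm{ATE}$. As $z\theta \to \infty$, $\mathrm{TT} \to \mathrm{ATE}$.

For LATE, we use a standard result on the truncated bivariate normal (e.g. Cramer 1946 or Johnson, Kotz, and Balakrishnan 1992): if $(y, z) \sim N(\mu_y, \mu_z, \sigma_y, \sigma_z, \rho)$ and $b > a$, then

$$
E(y \mid a \le z \le b) = \mu_y + \rho\sigma_y \left( \frac{\phi(\alpha) - \phi(\beta)}{\Phi(\beta) - \Phi(\alpha)} \right),
\qquad \alpha = (a - \mu_z)/\sigma_z, \quad \beta = (b - \mu_z)/\sigma_z.
$$

Thus,

$$
\begin{aligned}
\mathrm{LATE}(x, D(z) = 0, D(z') = 1) &= E(Y_1 - Y_0 \mid x, -z'\theta \le U_D \le -z\theta) \\
&= x(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, \frac{\phi(z'\theta) - \phi(z\theta)}{\Phi(z'\theta) - \Phi(z\theta)}.
\end{aligned}
\tag{3.1}
$$

The Marginal Treatment Effect:

$$
\begin{aligned}
\mathrm{MTE}(x, u_D) &= x(\beta_1 - \beta_0) + E(U_1 - U_0 \mid U_D = u_D) \\
&= x(\beta_1 - \beta_0) + E(U_1 \mid U_D = u_D) - E(U_0 \mid U_D = u_D) \\
&= x(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, u_D.
\end{aligned}
$$

The same expression follows as the limit form of LATE:^3

$$
\begin{aligned}
\mathrm{MTE}(x, u_D) &= x(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0) \lim_{t \to -u_D} \left[ \frac{\phi(-u_D) - \phi(t)}{\Phi(-u_D) - \Phi(t)} \right] \\
&= x(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0) \lim_{t \to -u_D} \left[ \frac{(\phi(-u_D) - \phi(t)) / (-u_D - t)}{(\Phi(-u_D) - \Phi(t)) / (-u_D - t)} \right] \\
&= x(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, u_D.
\end{aligned}
$$

^3 The last line in this derivation follows from L'Hôpital's rule.

Evaluating MTE when $u_D$ is large corresponds to the case where the average outcome gain is evaluated for those individuals with unobservables making them most likely to participate (and conversely when $u_D$ is small). When $u_D = 0$, $\mathrm{MTE} = \mathrm{ATE}$ as a consequence of the symmetry of the normal distribution.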
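The closed forms for the textbook normal model can be checked numerically. The following is a minimal sketch, not part of the original slides, using only the Python standard library. The names `x_gain` (standing for the scalar $x(\beta_1 - \beta_0)$) and `kappa` (standing for $\rho_1\sigma_1 - \rho_0\sigma_0$), the helper `simulate_tt`, and all numerical values in the usage note are hypothetical choices made for illustration.

```python
import random
from statistics import NormalDist

STD = NormalDist()  # standard normal: STD.pdf is phi, STD.cdf is Phi


def ate(x_gain):
    """ATE(x) = x(beta1 - beta0); x_gain is that scalar (hypothetical name)."""
    return x_gain


def tt(x_gain, kappa, z_theta):
    """TT(x, z, D(z)=1) = x(b1 - b0) + kappa * phi(z theta) / Phi(z theta)."""
    return x_gain + kappa * STD.pdf(z_theta) / STD.cdf(z_theta)


def late(x_gain, kappa, z_theta, z_theta_new):
    """LATE for an index shift from z_theta to z_theta_new > z_theta, eq. (3.1)."""
    num = STD.pdf(z_theta_new) - STD.pdf(z_theta)
    den = STD.cdf(z_theta_new) - STD.cdf(z_theta)
    return x_gain + kappa * num / den


def mte(x_gain, kappa, u_d):
    """MTE(x, u_D) = x(b1 - b0) + kappa * u_D."""
    return x_gain + kappa * u_d


def simulate_tt(x_gain, kappa, z_theta, n=200_000, noise_sd=0.5, seed=0):
    """Monte Carlo mean gain among the treated (assumed data-generating process).

    U_D is standard normal and U1 - U0 = kappa * U_D + independent noise,
    so that Cov(U1 - U0, U_D) = kappa, as in the normal model.
    """
    rng = random.Random(seed)
    total, count = 0.0, 0
    for _ in range(n):
        u_d = rng.gauss(0.0, 1.0)
        if z_theta + u_d >= 0:  # selection rule (2.2): D(z) = 1
            total += x_gain + kappa * u_d + rng.gauss(0.0, noise_sd)
            count += 1
    return total / count
```

With, say, `x_gain = 0.1` and `kappa = 0.5`, one can compare `simulate_tt(0.1, 0.5, 0.0)` with `tt(0.1, 0.5, 0.0)`, confirm that `tt` approaches `ate` as `z_theta` grows, and confirm that `late(x_gain, kappa, -u_d - h, -u_d)` approaches `mte(x_gain, kappa, u_d)` as `h` shrinks, which mirrors the limit argument above.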
Non-Normal Extensions

Following Lee (1982, 1983), the trivariate normal model can be generalized by exploiting the natural flexibility of the selection equation. (In this part, a prime denotes a transpose and $\tilde{z}$ denotes the alternative value of the instrument.) In the latent variable framework, the selection rule assigns people to the treated state ($D_i = 1$) provided $U_{iD} \ge -Z_i'\theta$. This is equivalent to setting $D_i = 1$ when $J(U_{iD}) \ge J(-Z_i'\theta)$ for some strictly increasing function $J$.

Suppose $U_D \sim F$, where $F$ is an absolutely continuous distribution function. For simplicity, assume symmetry of $U_D$ about zero, so that $F(-a) = 1 - F(a)$. Define

$$
\tilde{U}_D \equiv J_\Phi(U_D), \qquad J_\Phi(u) \equiv \Phi^{-1}(F(u)).
$$

Then $\tilde{U}_D$ is a standard normal random variable.

The original model in (1) is equivalent to the transformed model:

$$
\begin{aligned}
Y_1 &= X'\beta_1 + U_1, \\
Y_0 &= X'\beta_0 + U_0, \\
D_i^{**} &= J_\Phi(Z'\theta) + \tilde{U}_D.
\end{aligned}
$$

Now assume $[\tilde{U}_D, U_1, U_0]'$ is trivariate normal. We obtain the following selection-corrected conditional mean functions:

$$
E(Y_1 \mid D(Z) = 1, X = x, Z = z) = x'\beta_1 + \rho_1\sigma_1\, \frac{\phi(J_\Phi(z'\theta))}{F(z'\theta)},
\tag{3.2}
$$

$$
E(Y_0 \mid D(Z) = 0, X = x, Z = z) = x'\beta_0 - \rho_0\sigma_0\, \frac{\phi(J_\Phi(z'\theta))}{1 - F(z'\theta)}.
\tag{3.3}
$$

The treatment parameters are

$$
\mathrm{TT}(x, z, D(z) = 1) = x'(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, \frac{\phi(J_\Phi(z'\theta))}{F(z'\theta)},
$$

$$
\mathrm{LATE}(x, D(z) = 0, D(\tilde{z}) = 1) = x'(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, \frac{\phi(J_\Phi(\tilde{z}'\theta)) - \phi(J_\Phi(z'\theta))}{F(\tilde{z}'\theta) - F(z'\theta)},
$$

$$
\mathrm{MTE}(x, u_D) = x'(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, J_\Phi(u_D).
$$

A less straightforward generalization can be achieved by following Lee (1982, 1983) and allowing the disturbances in (14) to be jointly distributed according to the Student-$t_v$ distribution. $t_v(\mu, \Sigma)$ denotes the multivariate Student-$t_v$ density function with mean $\mu$, scale matrix $\Sigma$ (variance equal to $[v/(v-2)]\Sigma$) and $v$ degrees of freedom.^4 Let $t_v$ denote the standardized univariate Student-$t_v$ density with mean 0 and scale parameter equal to 1, and let $T_v$ denote the associated cdf.

^4 The mean exists when $v > 1$ and the variance exists when $v > 2$.

Letting $U_D \sim F$, we define $J_{T_v}(u) \equiv T_v^{-1}(F(u))$ as before, again noting that $J_{T_v}(-u) = -J_{T_v}(u)$. Assume $[\tilde{U}_D, U_1, U_0]'$ has a trivariate $t_v(0, \Sigma)$ density.
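The selection-corrected conditional mean functions are then

$$
E(Y_1 \mid D(Z) = 1, X = x, Z = z) = x'\beta_1 + \rho_1\sigma_1 \left[ \left( \frac{v + [J_{T_v}(z'\theta)]^2}{v - 1} \right) \left( \frac{t_v(J_{T_v}(z'\theta))}{F(z'\theta)} \right) \right],
$$

$$
E(Y_0 \mid D(Z) = 0, X = x, Z = z) = x'\beta_0 - \rho_0\sigma_0 \left[ \left( \frac{v + [J_{T_v}(z'\theta)]^2}{v - 1} \right) \left( \frac{t_v(J_{T_v}(z'\theta))}{1 - F(z'\theta)} \right) \right].
$$

Define

$$
g(u, v) \equiv \left( \frac{v + [J_{T_v}(u)]^2}{v - 1} \right) t_v(J_{T_v}(u)).
$$

Then

$$
\mathrm{TT}(x, z, D(z) = 1) = x'(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, \frac{g(z'\theta, v)}{F(z'\theta)},
$$

$$
\mathrm{LATE}(x, D(z) = 0, D(\tilde{z}) = 1) = x'(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, \frac{g(\tilde{z}'\theta, v) - g(z'\theta, v)}{F(\tilde{z}'\theta) - F(z'\theta)},
$$

$$
\mathrm{MTE}(x, u_D) = x'(\beta_1 - \beta_0) + (\rho_1\sigma_1 - \rho_0\sigma_0)\, J_{T_v}(u_D).
$$

3.1 Estimation

1. Obtain $\hat\theta$ from a probit model on the decision to take the treatment.

2. Compute the appropriate selection correction terms evaluated at $\hat\theta$, i.e. $\phi(Z_i\hat\theta)/\Phi(Z_i\hat\theta)$ when $D_i = 1$, and $-\phi(Z_i\hat\theta)/(1 - \Phi(Z_i\hat\theta))$ when $D_i = 0$.

3. Run treatment-outcome-specific regressions (for the groups $\{i : D_i = 1\}$ and $\{i : D_i = 0\}$) with the inclusion of the appropriate selection-correction terms obtained from the previous step.

4. Given $\hat\beta_0$, $\hat\beta_1$, $\widehat{\rho_1\sigma_1}$ and $\widehat{\rho_0\sigma_0}$ obtained from step 3, and $\hat\theta$ from step 1, use these parameter estimates to obtain point estimates of the treatment parameters for given $X$, $Z$, and $Z'$. Alternatively, one could integrate over the distribution of the characteristics to obtain unconditional estimates, as suggested in section 2.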
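Steps 2 and 4 above can be sketched in a few lines for the normal (probit) case. This is an illustrative sketch only: it takes the estimates from steps 1 and 3 as given rather than computing them, it assumes the $D_i = 0$ correction term carries a minus sign (so that its regression coefficient estimates $\rho_0\sigma_0$, in line with (3.3)), and the function names `selection_term` and `plug_in_estimates`, the argument names `rs1` and `rs0` (for the estimates of $\rho_1\sigma_1$ and $\rho_0\sigma_0$), and the list-of-lists data layout are all hypothetical.

```python
from statistics import NormalDist, fmean

STD = NormalDist()  # standard normal: STD.pdf is phi, STD.cdf is Phi


def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))


def selection_term(z_theta_hat, d):
    """Step 2: correction term at the estimated index Z_i * theta_hat.

    Returns phi/Phi when d == 1 and -phi/(1 - Phi) when d == 0
    (the sign for d == 0 is an assumption, see the lead-in).
    """
    pdf = STD.pdf(z_theta_hat)
    cdf = STD.cdf(z_theta_hat)
    return pdf / cdf if d == 1 else -pdf / (1.0 - cdf)


def plug_in_estimates(X, Z, D, beta1, beta0, theta, rs1, rs0):
    """Step 4: unconditional ATE and TT by averaging over the sample.

    X, Z  : lists of covariate rows (each row a list of floats)
    D     : list of 0/1 treatment indicators
    beta1, beta0, theta : coefficient estimates from steps 1 and 3
    rs1, rs0 : estimates of rho1*sigma1 and rho0*sigma0 from step 3
    """
    gain_coef = [b1 - b0 for b1, b0 in zip(beta1, beta0)]
    kappa = rs1 - rs0
    # ATE(x_i) = x_i (beta1 - beta0), averaged over all observations
    ate_i = [dot(x, gain_coef) for x in X]
    ate_hat = fmean(ate_i)
    # TT(x_i, z_i) averaged over the treated only, as in (2.5)
    tt_i = [a + kappa * selection_term(dot(z, theta), 1)
            for a, z, d in zip(ate_i, Z, D) if d == 1]
    tt_hat = fmean(tt_i)
    return ate_hat, tt_hat
```

A call such as `plug_in_estimates(X, Z, D, beta1, beta0, theta, rs1, rs0)` returns the pair of sample-average estimates. Conditional estimates for a given $x$ and $z$ follow from the same expressions without the averaging step.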
Table 1: Point Estimates and Standard Errors of Alternate Treatment Parameters
(standard errors in parentheses)

Outcome Errors / Link Function    SSR       ATE           TT            LATE
Normal / Normal                   345.25    .092 (.03)    .039 (.04)    .079 (.03)
t(v=2) / Logit                    346.09    .061 (.02)    .036 (.03)    .053 (.02)
t(v=3) / Logit                    345.79    .073 (.02)    .035 (.03)    .062 (.02)
t(v=4) / Logit                    345.61    .079 (.02)    .035 (.04)    .067 (.03)
t(v=5) / Logit                    345.51    .082 (.03)    .034 (.04)    .069 (.03)
t(v=6) / Logit                    345.44    .084 (.03)    .034 (.04)    .071 (.03)
t(v=8) / Logit                    345.36    .085 (.03)    .034 (.04)    .073 (.03)
t(v=12) / Logit                   345.29    .087 (.03)    .034 (.04)    .073 (.04)
t(v=24) / Logit                   345.23    .088 (.04)    .033 (.04)    .075 (.03)
t(v=2) / t(v=2)                   345.68    .067 (.03)    .028 (.04)    .058 (.03)
t(v=3) / t(v=3)                   345.56    .075 (.03)    .030 (.04)    .063 (.03)
t(v=4) / t(v=4)                   345.48    .079 (.03)    .031 (.04)    .066 (.03)
t(v=5) / t(v=5)                   345.43    .082 (.03)    .032 (.04)    .069 (.03)
t(v=6) / t(v=6)                   345.40    .084 (.03)    .033 (.04)    .070 (.03)
t(v=8) / t(v=8)                   345.36    .086 (.03)    .034 (.04)    .072 (.03)
t(v=12) / t(v=12)                 345.32    .088 (.03)    .036 (.04)    .075 (.03)
t(v=24) / t(v=24)                 345.29    .090 (.03)    .037 (.04)    .077 (.03)

Table 2: Coefficients and Standard Errors for Application of Section 5

Variable                              Coefficient    Standard Error

College State
Constant                              1.85           .225
g (Ability)                           .092           .053
Northeast                             .124           .055
South                                 .059           .057
Experience                            .098           .044
Experience^2                          -.004          .003
Urban                                 .326           .072
Unemp. Rate                           -.002          .002
Selection correction at Z theta-hat   -.165          .081

No-College State
Constant                              1.89           .424
g (Ability)                           .191           .036
Northeast                             .126           .057
South                                 -.046          .053
Experience                            .043           .067
Experience^2                          -.001          .003
Urban                                 .136           .051
Unemp. Rate                           .001           .002
Selection correction at Z theta-hat   .097           .094

Selection Equation
Constant                              -.478          .149
MomCollege                            .541           .112
DadCollege                            .603           .097
Numsibs                               -.069          .024
g (Ability)                           .754           .048
Urban18                               .096           .131

Figure 1: $E(\tilde{U}_D \mid \tilde{U}_D > -J(u))$ for various Specifications of the Outcome Disturbances / Link Function. [Plot omitted. Curves: t(v=2) / Normal; t(v=2) / t(v=2); t(v=20) / Normal; Normal / Normal. Horizontal axis: x. Vertical axis: $E(\tilde{U}_D \mid \tilde{U}_D > -J(x))$.]

Distributions of Treatment on the Treated and Marginal Treatment Effects Using Normal and t2 Models. Generated NORMAL Data. 1,000 Replications with N = 1,500.

Figure 2: Treatment on the Treated with Z = -2: True Value ≈ 2.28. [Plot omitted. Curves: Normal; t(2).]

Figure 3: Marginal Treatment Effect with u_D = 1: True Value ≈ 1.54. [Plot omitted. Curves: Normal; t(4). Horizontal axis: Marginal Treatment Effect.]

Distributions of Treatment on the Treated and Marginal Treatment Effects Using Normal and t2 Models. Generated t4 Data. 1,000 Replications with N = 2,500.

Figure 4: Treatment on the Treated with Z = -2: True Value ≈ 2.64. [Plot omitted. Curves: Normal; t(4). Horizontal axis: Treatment on the Treated.]

Figure 5: Marginal Treatment Effect with u_D = 2: True Value ≈ 2.08. [Plot omitted. Curves: Normal; t(4). Horizontal axis: Marginal Treatment Effect.]

Figure 6: Probability of Correctly Choosing Normal Model Over t2 Model Using MSE Criterion. 1,000 Iterations. [Plot omitted. Curves: ρ1D = .95, ρ0D = .1; ρ1D = .5, ρ0D = .1; ρ1D = .2, ρ0D = .1. Horizontal axis: Number of Observations (0 to 1000). Vertical axis: Probability of Choosing Correctly.]

Figure 7: Plots of Marginal Treatment Effects Across Alternate Models (Unscaled). [Plot omitted. Curves: Normal / Normal; t(24) / t(24); t(2) / Logit; t(2) / t(2). Horizontal axis: U_D. Vertical axis: Marginal Treatment Effect (MTE).]