Research Method Lecture 15-3
Truncated Regression and Heckman Sample Selection Corrections

Truncated regression

Truncated regression differs from censored regression in the following way:

Censored regressions: the dependent variable may be censored, but you can still include the censored observations in the regression.

Truncated regressions: a subset of observations is dropped entirely, so only the truncated data are available for the regression.

Reasons data truncation happens

Example 1 (Truncation by survey design): The "Gary negative income experiment" data, used extensively in the economics literature, sample only those families whose income is less than 1.5 times the 1976 poverty line. Families whose incomes exceed that threshold are dropped from the data by the survey design.

Example 2 (Incidental truncation): In the wage offer regression of married women, only those who are working have wage information. Thus, the regression cannot include women who are not working. In this case it is the people's decision, not the surveyor's decision, that determines the sample selection.

When applying OLS to truncated data causes a bias

Before learning the techniques for dealing with truncated data, it is important to know when applying OLS to truncated data causes a bias.

Suppose that you consider the following regression:

    yi = β0 + β1xi + ui

and suppose that you have a random sample of size N. We also assume that all the OLS assumptions are satisfied. (The most important assumption is E(ui|xi) = 0.)

Now suppose that, instead of using all N observations, you select a subsample of the original sample and run OLS using this subsample (the truncated sample) only. Under what conditions would this OLS be unbiased, and under what conditions would it be biased?

A: Running OLS using only the selected subsample (truncated data) does not cause a bias if:

(A-1) Sample selection is done randomly.

(A-2) Sample selection is determined solely by the value of the x-variable. For example, suppose that x is age. If you select the sample where age is greater than 20 years old, this OLS is unbiased.

B: Running OLS using only the selected subsample (truncated data) causes a bias if:

(B-1) Sample selection is determined by the value of the y-variable. For example, suppose that y is family income, and you select the sample where y is greater than a certain threshold. Then this OLS is biased.

(B-2) Sample selection is correlated with ui. For example, suppose you are running the wage regression wage = β0 + β1(educ) + u, where u contains unobserved ability. If the sample is selected based on that unobserved ability, this OLS is biased. In practice, this situation arises when the selection is based on the survey participant's own decision. For example, in a wage regression, a person's decision whether to work determines whether the person is included in the data. Since that decision is likely to be based on unobserved factors contained in u, the selection is likely to be correlated with u.

A short simulation sketch below illustrates conditions (A-2) and (B-1).
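The following is a minimal simulation sketch in Stata. All numbers (the seed, the sample size, and the true coefficients β0 = 1, β1 = 2) are made up for illustration: selecting on x leaves the OLS slope essentially unchanged, while selecting on y pulls it toward zero.

    * Minimal simulation sketch (made-up numbers): selecting on x does not
    * bias OLS, but selecting on y does.
    clear
    set seed 12345
    set obs 10000
    gen x = 20 + 10*runiform()     // x, e.g., age between 20 and 30
    gen u = rnormal(0, 5)          // error, independent of x
    gen y = 1 + 2*x + u            // true model: beta0 = 1, beta1 = 2

    reg y x                        // full sample: slope close to 2
    reg y x if x >= 25             // (A-2) selection on x: still close to 2
    reg y x if y <= 50             // (B-1) selection on y: biased toward 0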
Understanding why these conditions make OLS on truncated data unbiased or biased

We now know the conditions under which OLS on truncated data is biased or unbiased. Let me explain why these conditions do or do not cause a bias. (There is some repetition in the explanations, but they are more elaborate and contain very important information, so please read them carefully.)

Consider the following regression:

    yi = β0 + β1xi + ui

Suppose that this regression satisfies all the OLS assumptions. Now, let si be a selection indicator: if si = 1, the person is included in the regression; if si = 0, the person is dropped from the data.

Running OLS using the selected subsample means running OLS using only the observations with si = 1. This is equivalent to running the following regression:

    siyi = β0si + β1sixi + siui

In this regression, sixi is the explanatory variable and siui is the error term. The crucial condition under which this OLS is unbiased is the zero conditional mean assumption: E(siui|sixi) = 0. Thus we need to check under what conditions this is satisfied.

To check E(siui|sixi) = 0, it is sufficient to check whether E(siui|xi, si) = 0. (If the latter is zero, the former is also zero.) Further, notice that E(siui|xi, si) = siE(ui|xi, si), since si is in the conditioning set. Thus it is sufficient to check the condition that ensures E(ui|xi, si) = 0. To simplify the notation, I drop the i subscript from now on and check the condition under which E(u|x, s) = 0.

Conditions under which running OLS on the selected subsample (truncated data) is unbiased

(A-1) Sample selection is done randomly. In this case, s is independent of u and x. Then we have E(u|x, s) = E(u|x). But since the original regression satisfies the OLS conditions, we have E(u|x) = 0. Therefore, in this case, the OLS is unbiased.

(A-2) The sample is selected based solely on the value of the x-variable. For example, if x is age and you select a person when age is greater than 20 years old, then s = 1 if x ≥ 20 and s = 0 if x < 20. In this case s is a deterministic function of x, so you can drop s(x) from the conditioning set:

    E(u|x, s) = E(u|x, s(x)) = E(u|x) = 0,

where the last equality holds because the original regression satisfies all the OLS conditions. Therefore, in this case, the OLS is unbiased.

Conditions under which running OLS on the selected subsample (truncated data) is biased

(B-1) Sample selection is based on the value of the y-variable. For example, y is monthly family income, and you select families whose income is smaller than $500, so s = 1 if y < 500. Checking whether E(u|x, s) = 0 is equivalent to checking whether E(u|x, s=1) = 0 and E(u|x, s=0) = 0. So we check this:

    E(u|x, s=1) = E(u|x, y < 500)
                = E(u|x, β0 + β1x + u < 500)
                = E(u|x, u < 500 − β0 − β1x)
                ≠ E(u|x)

Since the event {u < 500 − β0 − β1x} directly depends on u, you cannot drop it from the conditioning set. Thus the expression is not equal to E(u|x), which means it is not equal to zero: E(u|x, s=1) ≠ 0. Similarly, you can show that E(u|x, s=0) ≠ 0. Thus E(u|x, s) ≠ 0, and this OLS is biased.

(B-2) Sample selection is correlated with ui. This happens when it is the people's decision, not the surveyor's decision, that determines the sample selection. This type of truncation is called 'incidental truncation', and the bias that arises from it is called the Sample Selection Bias. The leading example is the wage offer regression of married women:

    wage = β0 + β1educ + ui

When a woman decides not to work, her wage information is not available, so she is dropped from the data. Since it is the woman's own decision, this sample selection is likely to be based on some unobservable factors contained in ui. For example, the woman decides to work if the wage offer is greater than her reservation wage. This reservation wage is likely to be determined by some unobserved factors in u, such as unobserved ability, unobserved family background, etc. Thus the selection criterion is likely to be correlated with u, which in turn means that s is correlated with u.

Mathematically: if s is correlated with u, you cannot drop s from the conditioning set. Thus E(u|x, s) ≠ E(u|x), which means E(u|x, s) ≠ 0, and this OLS is biased. Again, this type of bias is called the Sample Selection Bias. The sketch below illustrates it.
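Here is a minimal sketch of (B-2) in Stata, with made-up numbers: the wage-equation error u and the participation error e share a common unobserved component, so selection is correlated with u and OLS on the selected observations is biased even though selection is not a direct function of y.

    * Sketch of incidental truncation (B-2), with made-up numbers: u and
    * the participation error e share a common component a, so s is
    * correlated with u and OLS on the s==1 subsample is biased.
    clear
    set seed 12345
    set obs 10000
    gen x = rnormal(12, 2)         // e.g., years of education
    gen a = rnormal()              // shared unobserved factor (e.g., ability)
    gen u = a + rnormal()          // wage-equation error
    gen e = a + rnormal()          // participation-equation error
    gen y = 1 + 2*x + u            // true model: beta1 = 2
    gen s = (0.5*x - 6 + e > 0)    // works if the latent index is positive

    reg y x                        // full sample: slope close to 2
    reg y x if s == 1              // selected sample: biased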
A slightly more complicated case

Suppose x is IQ, and a survey participant responds to your survey only if IQ > v. In this case, the sample selection is based on the x-variable and a random error v. If you run OLS using only the truncated data, will it cause a bias?

Answer. Case 1: if v is independent of u, it does not cause a bias. Case 2: if v is correlated with u, this is the same situation as (B-2), so the OLS will be biased.

Estimation methods when data are truncated

When you have (B-1)-type truncation, use the 'truncated regression'. When you have (B-2)-type truncation (incidental truncation), use the Heckman Sample Selection Correction method, also called the Heckit model. I will explain these methods one by one.

The Truncated Regression

When the data truncation is of (B-1) type, you apply the truncated regression model. To repeat, (B-1)-type truncation happens because the surveyor samples people based on the value of the y-variable.

Suppose that the following regression satisfies all the OLS assumptions:

    yi = β0 + β1xi + ui,   ui ~ N(0, σ²)

but you sample only if yi < ci. (This means you drop observations with yi ≥ ci by survey design.) In this case, you know the exact value of ci for each person.

[Figure: Example of (B-1)-type data truncation. Monthly family income plotted against education of the household head; observations with income above $500 are dropped from the data, and the OLS line fitted to the truncated data is flatter than the true regression line.]

As can be seen, running OLS on the truncated data causes a bias. The model that produces unbiased estimates is based on maximum likelihood estimation.

The estimation method is as follows. For each observation, we can write ui = yi − β0 − β1xi. Thus the likelihood contribution is the height of the density function. However, since we select the sample only if yi < ci, we have to use the density function of ui conditional on yi < ci. This conditional density is:

    f(ui | yi < ci) = f(ui | β0 + β1xi + ui < ci)
                    = f(ui | ui < ci − β0 − β1xi)
                    = f(ui) / P(ui < ci − β0 − β1xi)
                    = f(ui) / P(ui/σ < (ci − β0 − β1xi)/σ)
                    = (1/σ)φ(ui/σ) / Φ((ci − β0 − β1xi)/σ),   for ui < ci − β0 − β1xi,

where f(ui) = (1/(σ√(2π))) exp(−ui²/(2σ²)) = (1/σ)φ(ui/σ), and φ and Φ denote the standard normal density and distribution functions.

Thus the likelihood contribution for the ith observation is obtained by plugging ui = yi − β0 − β1xi into the conditional density:

    Li = (1/σ)φ((yi − β0 − β1xi)/σ) / Φ((ci − β0 − β1xi)/σ)

The likelihood function is

    L(β0, β1, σ) = Π_{i=1..n} Li

The values of β0, β1, σ that maximize L are the truncated regression estimators.
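To make the formula concrete, here is a sketch of coding this log likelihood by hand with Stata's ml command. It is only an illustration of the formula above: the truncation point is hard-coded at an assumed value of 800 (to match the exercise below), and in practice you would simply use truncreg, which does this for you.

    * Sketch: the truncated-regression log likelihood coded by hand.
    * ln Li = ln phi((y - xb)/sigma) - ln sigma - ln Phi((800 - xb)/sigma),
    * with the truncation point hard-coded at 800 for illustration.
    program define trunc_ll
        args lnf xb lnsigma
        quietly replace `lnf' = ln(normalden(($ML_y1 - `xb')/exp(`lnsigma'))) ///
            - `lnsigma' - ln(normal((800 - `xb')/exp(`lnsigma')))
    end

    ml model lf trunc_ll (familyinc = huseduc) (lnsigma:)
    ml maximize

Maximizing this should reproduce the truncreg estimates shown in the exercise below, up to numerical tolerance.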
The partial effects

The estimated β1 shows the effect of x on y. Thus, you can interpret the parameters as if they were OLS parameters.

Exercise

We do not have suitable data for a truncated regression, so let us truncate the data ourselves to check how the truncated regression works.

EX1. Use JPSC_familyinc.dta to estimate the following model using all the observations. Family income is in 10,000 yen.

    (family income) = β0 + β1(husband's educ) + u

EX2. Then run the OLS using only the observations with familyinc < 800. How did the parameters change?

EX3. Run the truncated regression model on the data truncated from above at 800 (i.e., the data that drop all observations with familyinc ≥ 800). How did the parameters change? Did the truncated regression recover the parameters of the original regression?

OLS using all the observations:

    . reg familyinc huseduc

          Source |       SS       df       MS              Number of obs =    7695
    -------------+------------------------------           F(  1,  7693) =  924.22
           Model |  38305900.9     1  38305900.9           Prob > F      =  0.0000
        Residual |   318850122  7693  41446.7856           R-squared     =  0.1073
    -------------+------------------------------           Adj R-squared =  0.1071
           Total |   357156023  7694  46420.0705           Root MSE      =  203.58

       familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         huseduc |   32.93413   1.083325    30.40   0.000     30.81052    35.05775
           _cons |    143.895   15.09181     9.53   0.000     114.3109     173.479

OLS after dropping the observations with familyinc ≥ 800. The parameter on huseduc is biased toward zero:

    . reg familyinc huseduc if familyinc<800

          Source |       SS       df       MS              Number of obs =    6274
    -------------+------------------------------           F(  1,  6272) =  602.70
           Model |  11593241.1     1  11593241.1           Prob > F      =  0.0000
        Residual |   120645494  6272  19235.5699           R-squared     =  0.0877
    -------------+------------------------------           Adj R-squared =  0.0875
           Total |   132238735  6273   21080.621           Root MSE      =  138.69

       familyinc |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         huseduc |   20.27929   .8260432    24.55   0.000     18.65996    21.89861
           _cons |   244.5233   11.33218    21.58   0.000     222.3084    266.7383

Truncated regression model with the upper truncation limit set to 800 (observations with familyinc ≥ 800 are automatically dropped from this regression):

    . truncreg familyinc huseduc, ul(800)
    (note: 1421 obs. truncated)

    Fitting full model:
    Iteration 0:   log likelihood = -39676.782
    Iteration 1:   log likelihood = -39618.757
    Iteration 2:   log likelihood = -39618.629
    Iteration 3:   log likelihood = -39618.629

    Truncated regression
    Limit:   lower =       -inf                     Number of obs  =       6274
             upper =        800                     Wald chi2(1)   =     569.90
    Log likelihood = -39618.629                     Prob > chi2    =     0.0000

       familyinc |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         huseduc |   24.50276     1.0264    23.87   0.000     22.49105    26.51446
           _cons |   203.6856   13.75721    14.81   0.000     176.7219    230.6492
    -------------+----------------------------------------------------------------
          /sigma |   153.1291   1.805717    84.80   0.000       149.59    156.6683

The bias seems to be corrected, though not perfectly in this example.

Heckman Sample Selection Bias Correction (Heckit Model)

The most common reason for data truncation is the (B-2) type: incidental truncation. This truncation usually occurs because sample selection is determined by the people's decision, not the surveyor's decision. Consider the wage regression example. If a person has chosen to work, the person has "self-selected into the sample"; if the person has decided not to work, the person has "self-selected out of the sample". The bias caused by this type of truncation is called the Sample Selection Bias, and the correction for it is the Heckman Sample Selection Correction method, also called the Heckit model.

Consider the wage regression model. In Heckit, you have a wage equation and a sample selection equation:

    Wage eq:       yi = xiβ + ui,     ui ~ N(0, σu²)
    Selection eq:  si* = ziδ + ei,    ei ~ N(0, 1)

such that the person works if si* > 0. That is, si = 1 if si* > 0, and si = 0 if si* ≤ 0.

In the above equations, I am using the following vector notation: β = (β0, β1, β2, …, βk)ᵀ and xi = (1, xi1, xi2, …, xik); δ = (δ0, δ1, …, δm)ᵀ and zi = (1, zi1, zi2, …, zim). We assume that xi and zi are exogenous in the sense that E(ui|xi, zi) = 0. Further, assume that xi is a strict subset of zi: all the x-variables are also part of zi, and zi contains at least one variable that is not contained in xi. For example, xi = (1, experi, agei) and zi = (1, experi, agei, kidslt6i).
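Before deriving the correction, here is a minimal simulation sketch of this setup. The coefficients and variable names are made up: u and e are drawn correlated through a common factor, y is observed only when s = 1, and heckman with the twostep option recovers the slope that OLS on the selected sample misses.

    * Sketch of the Heckit setup with made-up numbers. z contains x plus
    * one excluded variable w; u and e are correlated through a common
    * factor a, and e is scaled so that Var(e) = 1.
    clear
    set seed 12345
    set obs 10000
    gen x = rnormal(12, 2)
    gen w = rnormal()                      // variable in z but not in x
    gen a = rnormal()
    gen u = a + rnormal()                  // wage-equation error
    gen e = (a + rnormal())/sqrt(2)        // selection error, Var(e) = 1
    gen s = (0.3*x + w - 4 + e > 0)        // selection equation
    gen y = 1 + 2*x + u if s == 1          // y observed only when s == 1

    reg y x                                // biased: selected sample only
    heckman y x, select(s = x w) twostep   // two-step correction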
The structural error ui and the sample selection si are correlated only if ui and ei are correlated. In other words, the sample selection causes a bias only if ui and ei are correlated. Let us denote the correlation between ui and ei by ρ = corr(ui, ei).

The data requirement of the Heckit model is as follows:

1. yi is available only for the observations who are currently working.
2. However, xi and zi are available both for those who are working and for those who are not working.

Now I will describe the Heckit model. First, the expected value of yi given that the person has participated in the labor force (i.e., si = 1) is:

    E(yi | si = 1, zi) = E(yi | si* > 0, zi)
                       = E(yi | ziδ + ei > 0, zi)
                       = E(yi | ei > −ziδ, zi)
                       = E(xiβ + ui | ei > −ziδ, zi)
                       = xiβ + E(ui | ei > −ziδ, zi)

Using a result on the bivariate normal distribution, the last term can be shown to be

    E(ui | ei > −ziδ, zi) = ρσu φ(ziδ)/Φ(ziδ)

The ratio φ(ziδ)/Φ(ziδ) is the inverse Mills ratio, λ(ziδ). Thus, writing γ = ρσu, we have

    E(yi | si = 1, zi) = xiβ + γλ(ziδ)

Heckman showed that the sample selection bias can be viewed as an omitted variable bias, where the omitted variable is λ(ziδ).

The important thing to note is that λ(ziδ) can easily be estimated. How? Note that the selection equation is simply a probit model of labor force participation. So estimate the selection equation by probit to obtain δ̂, then compute λ(ziδ̂). You can then correct the bias by including λ(ziδ̂) in the wage regression and estimating the model by OLS. Heckman showed that this method corrects for the sample selection bias. This method is the Heckit model, summarized below.

Heckman Two-Step Sample Selection Correction Method (Heckit model)

    Wage eq:       yi = xiβ + ui,     ui ~ N(0, σu²)
    Selection eq:  si* = ziδ + ei,    ei ~ N(0, 1)

such that the person works if si* > 0 and does not work if si* ≤ 0.

Assumption 1: E(ui|xi, zi) = 0.
Assumption 2: xi is a strict subset of zi.

If ui and ei are correlated, OLS estimation of the wage equation (using only the observations who are working) is biased.

First step: estimate the sample selection equation parameters δ̂ by probit, then compute λ(ziδ̂).

Second step: plug λ(ziδ̂) into the wage equation and estimate it by OLS. That is, estimate

    yi = xiβ + γλ(ziδ̂) + error

In this model, the coefficient on λ(ziδ̂) is γ = ρσu. Since σu > 0, testing γ = 0 is equivalent to testing ρ = 0: if γ ≠ 0, sample selection bias is present; if γ = 0, this is evidence that sample selection bias is not present.

Note: when you follow this procedure exactly, you get the correct coefficients, but you do not get the correct standard errors in the second step. For the exact formula of the standard errors, consult Wooldridge (2002). Stata automatically computes the correct standard errors.
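If you do run the two-step by hand, one common workaround for the incorrect second-step standard errors is to bootstrap the entire two-step procedure. The following is only a hedged sketch with placeholder variable names (y, x1, x2, s, z1-z3), not a substitute for the analytic formula:

    * Hedged sketch: bootstrap the whole two-step procedure to obtain
    * standard errors that account for the estimated lambda.
    * All variable names below are placeholders.
    program define heckit2step
        capture drop xd lam
        probit s z1 z2 z3
        predict double xd, xb
        gen double lam = normalden(xd)/normal(xd)
        reg y x1 x2 lam
    end

    bootstrap _b, reps(500): heckit2step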
Exercise

Using Mroz.dta, estimate the wage offer equation with the Heckit model. The explanatory variables for the wage offer equation are educ, exper, and expersq. The explanatory variables for the sample selection equation are educ, exper, expersq, nwifeinc, age, kidslt6, and kidsge6.

Estimating the Heckit model manually (note: this way you will not get the correct standard errors). First, create the selection variable:

    . gen s=0 if wage==.
    (428 missing values generated)

    . replace s=1 if wage~=.
    (428 real changes made)

The first step: estimate the probit selection equation.

    . probit s educ exper expersq nwifeinc age kidslt6 kidsge6

    Iteration 0:   log likelihood =  -514.8732
    Iteration 1:   log likelihood = -405.78215
    Iteration 2:   log likelihood = -401.32924
    Iteration 3:   log likelihood = -401.30219
    Iteration 4:   log likelihood = -401.30219

    Probit regression                               Number of obs   =        753
                                                    LR chi2(7)      =     227.14
                                                    Prob > chi2     =     0.0000
    Log likelihood = -401.30219                     Pseudo R2       =     0.2206

               s |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
           exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
         expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
        nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
             age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
         kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
         kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
           _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901

The second step: create the inverse Mills ratio, then estimate the Heckit wage equation by OLS.

    . predict xdelta, xb
    . gen lambda = normalden(xdelta)/normal(xdelta)
    . reg lwage educ exper expersq lambda

          Source |       SS       df       MS              Number of obs =     428
    -------------+------------------------------           F(  4,   423) =   19.69
           Model |  35.0479487     4  8.76198719           Prob > F      =  0.0000
        Residual |  188.279492   423  .445105182           R-squared     =  0.1569
    -------------+------------------------------           Adj R-squared =  0.1490
           Total |  223.327441   427  .523015084           Root MSE      =  .66716

           lwage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            educ |   .1090655   .0156096     6.99   0.000     .0783835    .1397476
           exper |   .0438873   .0163534     2.68   0.008     .0117434    .0760313
         expersq |  -.0008591   .0004414    -1.95   0.052    -.0017267    8.49e-06
          lambda |   .0322619   .1343877     0.24   0.810    -.2318889    .2964126
           _cons |  -.5781032    .306723    -1.88   0.060    -1.180994     .024788

Note that these standard errors are not correct.

Heckit estimated automatically (Stata computes the correct standard errors):

    . heckman lwage educ exper expersq, select(s=educ exper expersq nwifeinc age kidslt6 kidsge6) twostep

    Heckman selection model -- two-step estimates   Number of obs      =      753
    (regression model with sample selection)        Censored obs       =      325
                                                    Uncensored obs     =      428

                                                    Wald chi2(3)       =    51.53
                                                    Prob > chi2        =   0.0000

                 |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
    lwage        |
            educ |   .1090655    .015523     7.03   0.000     .0786411      .13949
           exper |   .0438873   .0162611     2.70   0.007     .0120163    .0757584
         expersq |  -.0008591   .0004389    -1.96   0.050    -.0017194    1.15e-06
           _cons |  -.5781032   .3050062    -1.90   0.058    -1.175904      .019698
    -------------+----------------------------------------------------------------
    s            |
            educ |   .1309047   .0252542     5.18   0.000     .0814074     .180402
           exper |   .1233476   .0187164     6.59   0.000     .0866641    .1600311
         expersq |  -.0018871      .0006    -3.15   0.002     -.003063   -.0007111
        nwifeinc |  -.0120237   .0048398    -2.48   0.013    -.0215096   -.0025378
             age |  -.0528527   .0084772    -6.23   0.000    -.0694678   -.0362376
         kidslt6 |  -.8683285   .1185223    -7.33   0.000    -1.100628    -.636029
         kidsge6 |    .036005   .0434768     0.83   0.408     -.049208    .1212179
           _cons |   .2700768    .508593     0.53   0.595    -.7267473    1.266901
    -------------+----------------------------------------------------------------
    mills        |
          lambda |   .0322619   .1336246     0.24   0.809    -.2296376    .2941613
    -------------+----------------------------------------------------------------
             rho |    0.04861
           sigma |  .66362875
          lambda |  .03226186   .1336246

Note that H0: ρ = 0 cannot be rejected, so there is little evidence that sample selection bias is present.
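As a hedged pointer beyond what is shown above: running heckman without the twostep option fits the same model by full maximum likelihood, which is more efficient than the two-step estimator when the joint normality assumption holds. This is an alternative, not the method used in the output above.

    * Alternative (not run above): full maximum likelihood instead of the
    * two-step estimator; more efficient if joint normality is correct.
    heckman lwage educ exper expersq, ///
        select(s = educ exper expersq nwifeinc age kidslt6 kidsge6)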