ECON 0019: Quant Econ and Econometrics
Practical Session 3
Guanyi Wang
Department of Economics
University College London
Wang (UCL) ECON 0019: Quantitative Economics and Econometrics 1 / 39

Plan
▶ Two exercises on truncation and sample selection (incidental truncation):
  ▶ C13 from chapter 17 of Wooldridge (C.17.13)
  ▶ C7 from chapter 17 of Wooldridge (C.17.7)

Ex C.17.13
▶ Use the data in HTV.RAW to answer this question.
▶ (i) Using OLS on the full sample, estimate a model for log(wage) using explanatory variables educ, abil, exper, nc, west, south, and urban. Report the estimated return to education and its standard error.
▶ (ii) Now estimate the equation from part (i) using only people with educ < 16. What percentage of the sample is lost? Now what is the estimated return to a year of schooling? How does it compare with part (i)? Based on what was said in class, this should have no consequences on unbiasedness and consistency.
▶ (iii) Now drop all observations with wage ≥ 20, so that everyone remaining in the sample earns less than $20 an hour. Run the regression from part (i) and comment on the coefficient on educ. (Because the normal truncated regression model assumes that y is continuous, it does not matter in theory whether we drop observations with wage ≥ 20 or wage > 20. In practice, including in this application, it can matter slightly because there are some people who earn exactly $20 per hour.) This is more problematic.
▶ (iv) Using the sample in part (iii), apply truncated regression with the upper truncation point being log(20). Does truncated regression appear to recover the return to education in the full population, assuming the estimate from (i) is consistent? Explain.
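One possible way to organise the four parts in Stata is sketched below. It is an illustration under stated assumptions, not the course do-file: it assumes that htv.dta (the Stata version of HTV.RAW) is in the working directory, and lnwage is a variable name chosen for this sketch.

```stata
* Assumption: htv.dta is in the current working directory
use htv, clear
describe

* Dependent variable; lnwage is a name introduced for this sketch
generate lnwage = log(wage)

* (i) OLS on the full sample, then again with robust standard errors
regress lnwage educ abil exper nc west south urban
regress lnwage educ abil exper nc west south urban, vce(robust)

* (ii) Share of the sample lost when keeping only educ < 16, then OLS
count if educ >= 16 & !missing(educ)
display "Percent of sample lost: " 100 * r(N) / _N
regress lnwage educ abil exper nc west south urban if educ < 16

* (iii) OLS keeping only people who earn less than $20 an hour
regress lnwage educ abil exper nc west south urban if wage < 20

* (iv) Truncated regression with upper truncation point log(20)
local c = log(20)
truncreg lnwage educ abil exper nc west south urban if wage < 20, ul(`c')
```

The truncation point is stored in a local macro first so that ul() receives a plain number. The `if wage < 20` condition in part (iv) keeps the sample identical to the one used in part (iii).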

▶ Question uses HTV.dta, which includes information on wages, education, parents' education, and several other variables for 1,230 working men in 1991.
▶ Let's look at the variables.

▶ (i) Using OLS on the full sample, estimate a model for log(wage) using explanatory variables educ, abil, exper, nc, west, south, and urban. Report the estimated return to education and its standard error.

Estimated coefficient: 0.1037178. Standard error: 0.0096894. Using robust standard errors, the estimated standard error goes up to 0.010553.

▶ (ii) Now estimate the equation from part (i) using only people with educ < 16. What percentage of the sample is lost? Now what is the estimated return to a year of schooling? How does it compare with part (i)? Based on what was said in class, this should have no consequences on unbiasedness and consistency.

▶ 1064 obs included, vs 1230 in (i) ⇒ 166 obs = 13.5% of sample lost.
▶ Estimated coefficient = 0.1182. Slight increase from estimate in (i).
▶ Standard error = 0.0126. Larger than in (i).
▶ Estimated less precisely. Why?
▶ By restricting to obs where educ < 16 we are reducing the sample variation in educ and the number of observations.
▶ Remember that:
$$\operatorname{se}(\hat\beta_j) = \frac{\hat\sigma}{\sqrt{SST_j \cdot (1 - R_j^2)}} = \frac{\hat\sigma}{\sqrt{n} \cdot \operatorname{sd}(x_j) \cdot \sqrt{1 - R_j^2}}$$

▶ (iii) Now drop all observations with wage ≥ 20, so that everyone remaining in the sample earns less than $20 an hour. Run the regression from part (i) and comment on the coefficient on educ.

▶ We excluded 164 obs, which is about the same number as we excluded in part (ii).
▶ Estimated coefficient = 0.0579.
  ▶ Estimate in (i): 0.1037
  ▶ Estimate in (ii): 0.1182
▶ Much bigger change than in (ii) despite dropping about the same number of observations.
▶ Why?

▶ Perhaps the truncated regression model covered in the lectures can explain this.
▶ Regression model: $\log(wage_i) = x_i\beta + u_i$, $u \mid x \sim N(0, \sigma^2)$, where x includes educ and other variables.
▶ But we were estimating $s_i \log(wage_i) = s_i x_i\beta + s_i u_i$, where $s_i = 1$ if the observation is included and $s_i = 0$ otherwise (selection rule).
▶ For OLS to be consistent we need E[su] = 0 and E[(sx)(su)] = E[sxu] = 0 (zero covariance between selected error and selected regressor; the second equality uses s² = s), and for OLS to be unbiased we need E[su | sx] = 0.

▶ In part (ii) we kept only obs where educ < 16, that is, we excluded obs where educ ≥ 16.
▶ So s depends only on whether educ < 16.
▶ Our regression model assumes E[u | x] = 0. But if s depends only on educ, which is included in x, then s provides no information beyond that provided in x, and so E[u | sx] = E[u | x] = 0.
▶ And so we have E[su | sx] = sE[u | sx] = 0.
▶ So OLS estimators are unbiased and consistent.

▶ But in part (iii) we excluded obs where wage ≥ 20.
▶ So s = 1 if and only if
  wage < 20 ⇔ log(wage) < log(20) ⇔ xβ + u < log(20) ⇔ u < log(20) − xβ
▶ We therefore have that s and u are correlated, implying E[sxu] ≠ 0, and thus E[su | sx] ≠ 0.
▶ Why? If xβ is high, then u must be low, because their sum must be below log(20) (we are only keeping low wages).
▶ And so OLS estimators will be inconsistent and biased.
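Under the normality assumption of the regression model, the selection bias in part (iii) can be written out explicitly. The following is a sketch using the standard mean of a truncated normal; the symbols $c_i$ and $\lambda(\cdot)$ are notation introduced here for the standardised truncation point and the inverse Mills ratio.

```latex
\begin{align*}
c_i &\equiv \frac{\log(20) - x_i\beta}{\sigma},
\qquad \lambda(c) \equiv \frac{\phi(c)}{\Phi(c)}, \\
E[u_i \mid x_i, s_i = 1]
  &= E[u_i \mid x_i,\; u_i < \sigma c_i]
   = -\sigma\,\lambda(c_i) \neq 0, \\
E[\log(wage_i) \mid x_i, s_i = 1]
  &= x_i\beta - \sigma\,\lambda(c_i), \\
\frac{\partial\, E[\log(wage_i) \mid x_i, s_i = 1]}{\partial (x_i\beta)}
  &= 1 + \lambda'(c_i) \in (0, 1),
\qquad \lambda'(c) = -\lambda(c)\bigl(c + \lambda(c)\bigr).
\end{align*}
```

In the truncated sample the conditional mean of log(wage) therefore rises less than one-for-one with xβ. This is in line with the OLS coefficient on educ falling from 0.1037 in part (i) to 0.0579 in part (iii), although OLS only approximates this non-linear conditional mean.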

▶ (iv) Using the sample in part (iii), apply truncated regression with the upper truncation point being log(20). Does truncated regression appear to recover the return to education in the full population, assuming the estimate from (i) is consistent? Explain.

▶ A truncated regression should not suffer from the same inconsistency problem.
▶ It estimates the coefficients by maximum likelihood, using the density of log(wage) given $x_i$ and given that log(wage) < log(20):
$$\frac{f(\log(wage) \mid x_i, \beta, \sigma)}{F(\log(20) \mid x_i, \beta, \sigma)}$$
  where $f(\cdot \mid x_i, \beta, \sigma)$ is the normal density with mean $x_i\beta$ and variance $\sigma^2$, and $F(\cdot \mid x_i, \beta, \sigma)$ is the corresponding CDF.
▶ Intuition: truncating the outcome at a cutoff c cuts off part of the original distribution, so we need to re-normalize the density by the probability that y < c so that it integrates back to 1.
▶ To perform truncated regression in Stata, use the truncreg command and specify an upper truncation point of log(20).

▶ Estimated coefficient is 0.1060.
▶ Very close to the estimate in (i) using the original sample.
▶ Standard error is 0.0168.
▶ Less precise than in (i) because of the dropped observations and hence the smaller sample.
▶ Extra question: what would be different in this question if we assumed that wages are censored above 20?
  ▶ For those who earn more than 20, we do not know their wage but we do know they earn more.
  ▶ We would need to model it using a censored regression model (which is similar to the Tobit).

Exercise C.17.7
▶ Use the MROZ.RAW data for this exercise.
▶ (i) Using the 428 women who were in the workforce, estimate the return to education by OLS including exper, exper², nwifeinc, age, kidslt6, and kidsge6 as explanatory variables. Report your estimate on educ and its standard error.
▶ (ii) Now, estimate the return to education by Heckit, where all exogenous variables show up in the second-stage regression. In other words, the regression is log(wage) on educ, exper, exper², nwifeinc, age, kidslt6, kidsge6 and λ̂. Compare the estimated return to education and its standard error to that from part (i).
▶ (iii) Using only the 428 observations for working women, regress λ̂ on educ, exper, exper², nwifeinc, age, kidslt6, kidsge6. How big is the R-squared? How does this help explain your findings from part (ii)? (Hint: Think multicollinearity.)

▶ Question uses mroz.dta, which includes information on labour force participation, wages, education, parents' education, and several other variables for 753 women in 1976.

▶ (i) Using the 428 women who were in the workforce, estimate the return to education by OLS including exper, exper², nwifeinc, age, kidslt6, and kidsge6 as explanatory variables. Report your estimate on educ and its standard error.

Estimated coefficient: 0.0999. Standard error: 0.0151.

▶ These estimates are likely to be inconsistent.
▶ Example of incidental truncation (self-selection): we only have wages for those who choose to work.
▶ log(wage) is only observed when a woman is in the labour force (if inlf = 1).
▶ But labour force participation and wage are likely to be determined by many of the same unobservable factors.
▶ This will cause inconsistency (see lecture notes on incidental truncation).

▶ Model:
  log(wage) = xβ + u, with E[u | x] = 0
  inlf = 1[zγ + v ≥ 0]
  where we observe log(wage) only if inlf = 1. Notice that z and x are the same here (the non-linearity of λ precludes perfect collinearity).
▶ Under some further assumptions (see lecture notes), we get
  E[log(wage) | z, inlf = 1] = xβ + ρλ(zγ)
  where λ(·) = ϕ(·)/Φ(·) is the inverse Mills ratio.
▶ And so if we perform an OLS regression of log(wage) on x, using only the observations where inlf = 1 (which is all we can do, as log(wage) is unobserved elsewhere), we have an omitted variable: λ(zγ) (if ρ is not zero).
▶ And so OLS estimators will be inconsistent.

▶ (ii) Now, estimate the return to education by Heckit, where all exogenous variables show up in the second-stage regression. In other words, the regression is log(wage) on educ, exper, exper², nwifeinc, age, kidslt6, kidsge6 and λ̂. Compare the estimated return to education and its standard error to that from part (i).

▶ First stage of the Heckit procedure: estimate the selection equation
  P(inlf = 1 | z) = Φ(zγ)
  by probit, using all observations, which provides the estimate γ̂.
▶ Then, using the formula for the inverse Mills ratio, compute λ̂ᵢ ≡ λ(zᵢγ̂) for each observation.
▶ Second stage: run an OLS regression of log(wage) on the explanatory variables plus λ̂, using the observations where inlf = 1.

In this question, perform a probit regression of inlf on z, where z = [educ, nwifeinc, exper, expersq, age, kidslt6, kidsge6].
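The first-stage probit described above might be run as follows. This is a minimal sketch that assumes mroz.dta is in the working directory and that the variable names match those listed in z.

```stata
* Assumption: mroz.dta is in the current working directory
use mroz, clear

* First stage: probit for labour force participation
probit inlf educ nwifeinc exper expersq age kidslt6 kidsge6
```

The probit uses every woman for whom the variables in z are observed, not only those who work, because inlf is observed for everyone.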

▶ Calculate zγ̂:
    predict zgammahat, xb
▶ Calculate λ̂ ≡ λ(zγ̂):
    gen lambda = normalden(zgammahat) / normal(zgammahat)

In this question, perform an OLS regression of log(wage) on x and λ̂, where x = [educ, nwifeinc, exper, expersq, age, kidslt6, kidsge6] = z.

Alternatively, this can be done in one command, which gives correct SEs that account for the sampling error in estimating γ (see do-file):

▶ Estimated coefficient: 0.1187. Standard error: 0.0341.
▶ The standard error is more than twice as large as that from the OLS regression in (i).
▶ Remark about notation: in
  E[log(wage) | z, inlf = 1] = xβ + ρλ(zγ)
  we have $\rho = \operatorname{corr}(u, v)\,\dfrac{\sigma_u}{\sigma_v}$.
▶ In our case, ρ̂ = 0.2885. This is what Stata calls lambda, not rho.
▶ Stata's rho is the estimate of corr(u, v).
▶ Also, σ_v is normalized to 1 because the first stage is a probit model.
▶ Stata's sigma is the estimate of σ_u, so you get lambda by multiplying rho by sigma.

▶ (iii) Using only the 428 observations for working women, regress λ̂ on educ, exper, exper², nwifeinc, age, kidslt6, kidsge6. How big is the R-squared? How does this help explain your findings from part (ii)? (Hint: Think multicollinearity.)

▶ R² is very large: 0.962.
▶ This suggests a high degree of multicollinearity between λ̂ and the other explanatory variables.
▶ This means there was a high degree of multicollinearity among the explanatory variables in the second-stage regression of Heckit.
  Hence large standard error:
$$\operatorname{se}(\hat\beta_j) = \frac{\hat\sigma}{\sqrt{SST_j \cdot (1 - R_j^2)}} = \frac{\hat\sigma}{\sqrt{n} \cdot \operatorname{sd}(x_j) \cdot \sqrt{1 - R_j^2}}$$
▶ $R_j^2$ rises once we add λ̂, so $\operatorname{se}(\hat\beta_j)$ rises too.

▶ Not surprising: λ̂ is a function (the inverse Mills ratio, which is often well approximated by a linear function) of zγ̂, which is a linear combination of the variables in z, which were exactly those included in x in the second stage regression.
▶ In general, to avoid this we would need to have variables included in z that were excluded from x.
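Parts (ii) and (iii) can also be sketched with Stata's heckman command. This is an illustration, not the course do-file. It assumes that mroz.dta is in the working directory and that lwage holds log(wage). lambdahat is a name chosen here, and the final specification is hypothetical: it only shows the syntax for excluding variables from x, and whether such exclusions are valid is an economic assumption.

```stata
* Assumption: mroz.dta is in the working directory and lwage = log(wage)
use mroz, clear

* Part (ii): two-step Heckit in one command, with x = z.
* mills() stores the estimated inverse Mills ratio in a new variable.
heckman lwage educ exper expersq nwifeinc age kidslt6 kidsge6, ///
    select(inlf = educ exper expersq nwifeinc age kidslt6 kidsge6) ///
    twostep mills(lambdahat)

* Part (iii): how well do the regressors explain lambda-hat among workers?
regress lambdahat educ exper expersq nwifeinc age kidslt6 kidsge6 if inlf == 1
display "R-squared = " e(r2)

* Hypothetical exclusion restrictions: nwifeinc, age, kidslt6 and kidsge6
* enter the selection equation only, not the wage equation
heckman lwage educ exper expersq, ///
    select(inlf = educ exper expersq nwifeinc age kidslt6 kidsge6) ///
    twostep
```

With excluded variables, λ̂ varies for reasons other than the regressors in x, which should reduce the multicollinearity in the second stage.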