ECON 0019: Quant Econ and Econometrics
Practical Session 3
Guanyi Wang
Department of Economics
University College London
Wang (UCL)
ECON 0019: Quantitative Economics and Econometrics
Plan
▶ Two exercises on truncation and sample selection (incidental truncation):
▶ C13 from chapter 17 of Wooldridge (C.17.13)
▶ C7 from chapter 17 of Wooldridge (C.17.7)
Ex C.17.13
▶ Use the data in HTV.RAW to answer this question.
▶ (i) Using OLS on the full sample, estimate a model for log(wage) using explanatory
variables educ, abil, exper, nc, west, south, and urban. Report the estimated return
to education and its standard error.
▶ (ii) Now estimate the equation from part (i) using only people with educ < 16. What
percentage of the sample is lost? Now what is the estimated return to a year of
schooling? How does it compare with part (i)? Based on what was said in class,
this should have no consequences on unbiasedness and consistency.
▶ (iii) Now drop all observations with wage ≥ 20, so that everyone remaining in the
sample earns less than $20 an hour. Run the regression from part (i) and comment
on the coefficient on educ. (Because the normal truncated regression model
assumes that y is continuous, it does not matter in theory whether we drop
observations with wage ≥ 20 or wage > 20. In practice, including in this application,
it can matter slightly because there are some people who earn exactly $20 per hour.)
This is more problematic.
▶ (iv) Using the sample in part (iii), apply truncated regression with the upper truncation
point being log(20). Does truncated regression appear to recover the return to
education in the full population, assuming the estimate from (i) is consistent? Explain.
▶ Question uses HTV.dta, which includes information on wages, education, parents’
education, and several other variables for 1,230 working men in 1991.
▶ Let’s look at the variables.
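A minimal sketch of how the variables could be inspected in Stata. It assumes HTV.dta sits in the current working directory and that the variable names match those listed in the exercise.

```stata
* Load the data (assumes HTV.dta is in the current working directory)
use HTV.dta, clear

* List variable names, storage types, and labels
describe

* Summary statistics for the variables used in this exercise
summarize wage educ abil exper nc west south urban
```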
▶ (i) Using OLS on the full sample, estimate a model for log(wage) using explanatory
variables educ, abil, exper, nc, west, south, and urban. Report the estimated return
to education and its standard error.
Estimated coefficient on educ: 0.1037178. Standard error: 0.0096894. Using robust standard
errors, the estimated standard error goes up to 0.010553.
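The conventional and robust standard errors quoted here could be obtained with something like the sketch below. It assumes the dataset contains a log-wage variable named lwage (as in the standard Wooldridge files); if it does not, generate it from wage first.

```stata
* Assumes HTV.dta is in the working directory and lwage = log(wage) exists
use HTV.dta, clear

* OLS with conventional standard errors
regress lwage educ abil exper nc west south urban

* Same regression with heteroskedasticity-robust standard errors
regress lwage educ abil exper nc west south urban, vce(robust)
```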
▶ (ii) Now estimate the equation from part (i) using only people with educ < 16. What
percentage of the sample is lost? Now what is the estimated return to a year of
schooling? How does it compare with part (i)? Based on what was said in class,
this should have no consequences on unbiasedness and consistency.
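One way part (ii) might be run is sketched below. The variable lwage is again assumed to be the log wage, and educ is assumed to have no missing values (Stata treats missing as larger than any number, which would otherwise inflate the count).

```stata
* Assumes HTV.dta is in the working directory and lwage = log(wage) exists
use HTV.dta, clear

* Number of observations that the restriction educ < 16 removes
count if educ >= 16

* Share of the sample lost, in percent (r(N) is the count just computed)
display 100 * r(N) / _N

* OLS on the restricted sample
regress lwage educ abil exper nc west south urban if educ < 16
```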
▶ 1,064 obs included, vs 1,230 in (i) ⇒ 166 obs = 13.5% of the sample lost.
▶ Estimated coefficient = 0.1182. Slight increase from the estimate in (i).
▶ Standard error = 0.0126. Larger than in (i).
▶ Estimated less precisely. Why?
▶ By restricting to obs where educ < 16, we reduce both the sample variation in educ
and the number of observations.
▶ Remember that:
$$\operatorname{se}(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{SST_j \, (1 - R_j^2)}} = \frac{\hat{\sigma}}{\sqrt{n} \cdot \operatorname{sd}(x_j) \cdot \sqrt{1 - R_j^2}}$$
▶ (iii) Now drop all observations with wage ≥ 20, so that everyone remaining in the
sample earns less than $20 an hour. Run the regression from part (i) and comment
on the coefficient on educ.
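Part (iii) could be run along the following lines. This is a sketch that assumes lwage is the log wage and that wage has no missing values.

```stata
* Assumes HTV.dta is in the working directory and lwage = log(wage) exists
use HTV.dta, clear

* Number of observations dropped by the rule wage >= 20
count if wage >= 20

* OLS on the sample truncated from above at $20 per hour
regress lwage educ abil exper nc west south urban if wage < 20
```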
▶ We excluded 164 obs, which is about the same number as we excluded in part (ii).
▶ Estimated coefficient = 0.0579.
▶ Estimate in (i): 0.1037
▶ Estimate in (ii): 0.1182
▶ Much bigger change than in (ii) despite dropping about the same number of
observations.
▶ Why?
▶ Perhaps the truncated regression model covered in the lectures can explain this.
▶ Regression model: log(wageᵢ) = xᵢβ + uᵢ, u | x ∼ N(0, σ²), where x includes educ
and other variables.
▶ But we were estimating
sᵢ log(wageᵢ) = sᵢxᵢβ + sᵢuᵢ
where sᵢ = 1 if the observation is included, sᵢ = 0 otherwise (selection rule).
▶ For OLS to be consistent we need:
E[su] = 0, E[(sx)(su)] = E[sxu] = 0,
(zero covariance between selected error and selected regressor) and for OLS to be
unbiased:
E[su | sx] = 0.
▶ In part (ii) we kept only obs where educ < 16 (we excluded those with educ ≥ 16).
▶ So s depends on whether educ < 16.
▶ Our regression model assumes E[u | x] = 0. But if s depends only on educ, which is
included in x, then s provides no information beyond that provided in x, and so
E[u | sx] = E[u | x] = 0.
▶ And so we have E[su | sx] = sE[u | sx] = 0.
▶ So OLS estimators are unbiased and consistent.
▶ But in part (iii) we excluded obs where wage ≥ 20.
▶ So s = 1 if and only if
wage < 20 ⇔ log( wage ) < log(20) ⇔ xβ + u < log(20) ⇔
u < log(20) − xβ
▶ We therefore have that s and u are correlated, implying E[sxu] ≠ 0, and thus
E[su | sx] ≠ 0.
▶ Why? If xβ is high, then u must be low, since their sum must be below log(20) (we are
only keeping low wages).
▶ And so OLS estimators will be inconsistent and biased.
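To make the dependence between s and u concrete, here is a sketch of the standard truncated-normal mean under the model's assumption u | x ∼ N(0, σ²). The symbol c = log(20) is introduced here for brevity, and λ(·) = ϕ(·)/Φ(·) is the inverse Mills ratio.

```latex
\begin{aligned}
E[u \mid x, s = 1] &= E[u \mid x,\; u < c - x\beta]
  = -\sigma \, \lambda\!\left(\frac{c - x\beta}{\sigma}\right), \\
E[\log(wage) \mid x, s = 1] &= x\beta
  - \sigma \, \lambda\!\left(\frac{c - x\beta}{\sigma}\right).
\end{aligned}
```

The extra term is negative and varies with x, so a regression of log(wage) on x alone in the truncated sample leaves out a term that is correlated with educ.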
▶ (iv) Using the sample in part (iii), apply truncated regression with the upper truncation
point being log(20). Does truncated regression appear to recover the return to
education in the full population, assuming the estimate from (i) is consistent? Explain.
▶ A truncated regression should not suffer from the same inconsistency problem.
▶ It estimates the coefficients by maximum likelihood, using the density of log(wage)
given xᵢ, conditional on log(wage) < log(20):
$$\frac{f(\log(wage) \mid x_i, \beta, \sigma)}{F(\log(20) \mid x_i, \beta, \sigma)}$$
where f(· | xᵢ, β, σ) is the normal density with mean xᵢβ and variance σ²,
and F(· | xᵢ, β, σ) is the corresponding CDF, evaluated here at log(20).
▶ Intuition: truncating the outcome at a cutoff c removes part of the original
distribution, so the density must be re-normalized by the probability that y < c so that it
integrates back to 1.
▶ To perform truncated regression in Stata, use the truncreg command and specify an upper
truncation point of log(20).
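A sketch of what the truncreg call might look like. The local macro c is just a convenient name chosen here to hold log(20), and lwage is assumed to be the log wage.

```stata
* Assumes HTV.dta is in the working directory and lwage = log(wage) exists
use HTV.dta, clear

* Store the upper truncation point log(20) in a local macro
local c = ln(20)

* Truncated normal regression on the sample with wage < 20
truncreg lwage educ abil exper nc west south urban if wage < 20, ul(`c')
```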
▶ Estimated coefficient is 0.1060.
▶ Very close to the estimate in (i) using the original sample.
▶ Standard error is 0.0168.
▶ Less precise than in (i) because observations were dropped, so the sample is smaller.
▶ Extra question: what would be different in this question if we assumed that wages are
censored above 20?
▶ For those who earn more than 20, we do not know their wage, but we do know they earn
more.
▶ We would need to model it using a censored regression model (which is similar to the Tobit).
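For illustration only, the censored case could be mimicked by top-coding the observed log wage at log(20) and fitting a Tobit with an upper limit. The variable lwage_cens and the macro c are names invented for this sketch, and the top-coding is artificial, since the actual data are not censored.

```stata
* Assumes HTV.dta is in the working directory and lwage = log(wage) exists
use HTV.dta, clear

* Upper censoring point log(20)
local c = ln(20)

* Artificially top-code log wages at log(20) to mimic censoring
generate double lwage_cens = min(lwage, `c') if !missing(lwage)

* Censored normal regression with an upper limit; everyone stays in the sample
tobit lwage_cens educ abil exper nc west south urban, ul(`c')
```

Unlike truncation, the censored observations still contribute their regressors and the fact that their wage is at least 20.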
Exercise C.17.7
▶ Use the MROZ.RAW data for this exercise.
▶ (i) Using the 428 women who were in the workforce, estimate the return to education by
OLS including exper, exper², nwifeinc, age, kidslt6, and kidsge6 as explanatory
variables. Report your estimate on educ and its standard error.
▶ (ii) Now, estimate the return to education by Heckit, where all exogenous variables show
up in the second-stage regression. In other words, the regression is log(wage) on educ,
exper, exper², nwifeinc, age, kidslt6, kidsge6 and λ̂. Compare the estimated return to
education and its standard error to that from part (i).
▶ (iii) Using only the 428 observations for working women, regress λ̂ on educ, exper,
exper², nwifeinc, age, kidslt6, kidsge6. How big is the R-squared? How does this help
explain your findings from part (ii)? (Hint: Think multicollinearity.)
▶ Question uses mroz.dta, which includes information on labour force participation,
wages, education, parents’ education, and several other variables for 753 women in
1976.
▶ (i) Using the 428 women who were in the workforce, estimate the return to education
by OLS including exper, exper², nwifeinc, age, kidslt6, and kidsge6 as explanatory
variables. Report your estimate on educ and its standard error.
Estimated coefficient: 0.0999. Standard error: 0.0151.
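A sketch of the OLS regression behind these numbers. It assumes mroz.dta is in the working directory and that lwage (log wage) and expersq (exper squared) exist under those names.

```stata
* Assumes mroz.dta is in the working directory
use mroz.dta, clear

* OLS using only the women in the labour force
regress lwage educ exper expersq nwifeinc age kidslt6 kidsge6 if inlf == 1
```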
▶ These estimates are likely to be inconsistent.
▶ Example of incidental truncation (self-selection): we only have wages for those who
choose to work.
▶ log(wage) is only observed when the woman is in the labour force (if inlf = 1).
▶ But labour force participation and wage are likely to be determined by many of the same
unobservable factors.
▶ This will cause inconsistency (see lecture notes on incidental truncation).
▶ Model:
log( wage ) = xβ + u with E[u | x] = 0
inlf = 1[zγ + v ≥ 0]
where we observe log (wage) if inlf = 1. Notice z and x are the same (non-linearity of
lambda will preclude perfect collinearity).
▶ Under some further assumptions (see lecture notes), we get
E[log( wage ) | z, inlf = 1] = xβ + ρλ(zγ)
where λ(·) = ϕ(·)/Φ(·) is the inverse Mills ratio.
▶ And so if we perform an OLS regression of log (wage) on x, using only the
observations where inlf = 1 (which is all we can do, as log (wage) is unobserved
elsewhere), we have an omitted variable: λ(zγ) (if ρ is not zero).
▶ And so OLS estimators will be inconsistent.
▶ (ii) Now, estimate the return to education by Heckit, where all exogenous variables
show up in the second-stage regression. In other words, the regression is log(wage)
on educ, exper, exper², nwifeinc, age, kidslt6, kidsge6 and λ̂. Compare the
estimated return to education and its standard error to that from part (i).
▶ The first stage of the Heckit procedure is a probit regression to estimate the
selection equation
P(inlf = 1 | z) = Φ(zγ)
which provides an estimate γ̂ of γ.
▶ Then, using the formula for the inverse Mills ratio, compute λ̂ᵢ ≡ λ(zᵢγ̂) for each
observation.
▶ Second stage: estimate an OLS regression of log(wage) on the explanatory variables
plus λ̂.
In this question, perform a probit regression of inlf on z, where z = [educ, nwifeinc,
exper, expersq, age, kidslt6, kidsge6].
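The first-stage probit might be run as in this sketch, which uses the full sample of 753 women and assumes mroz.dta is in the working directory.

```stata
* Assumes mroz.dta is in the working directory
use mroz.dta, clear

* First stage: probit for labour force participation on z
probit inlf educ nwifeinc exper expersq age kidslt6 kidsge6
```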
▶ Calculate z γ̂
predict zgammahat, xb
▶ Calculate λ̂ ≡ λ(z γ̂)
gen lambda=normalden(zgammahat) / normal(zgammahat)
In this question, perform an OLS regression of log(wage) on x and λ̂, where x = [educ,
nwifeinc, exper, expersq, age, kidslt6, kidsge6] = z.
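Putting the two stages together, a manual Heckit could look like the sketch below. It assumes lwage is the log wage and is missing for women with inlf = 0, so the second stage automatically uses only the working women. The standard errors from this manual second stage ignore the fact that γ was estimated.

```stata
* Assumes mroz.dta is in the working directory
use mroz.dta, clear

* First stage: probit on the full sample (output suppressed)
quietly probit inlf educ nwifeinc exper expersq age kidslt6 kidsge6

* Linear index z*gammahat and the inverse Mills ratio
predict zgammahat, xb
generate lambda = normalden(zgammahat) / normal(zgammahat)

* Second stage: OLS of log(wage) on x and lambda
regress lwage educ nwifeinc exper expersq age kidslt6 kidsge6 lambda
```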
Alternatively, the whole procedure can be run in one command, which gives correct SEs that
account for sampling error in estimating γ (see do-file).
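The do-file is not reproduced here. A plausible form of the one-command version, using Stata's heckman with the twostep option, is sketched below; the exact specification in the do-file may differ.

```stata
* Assumes mroz.dta is in the working directory; run from a do-file,
* since the /// line continuation does not work at the interactive prompt
use mroz.dta, clear

* Heckman two-step estimator with the same variables in both equations
heckman lwage educ exper expersq nwifeinc age kidslt6 kidsge6, ///
    select(inlf = educ exper expersq nwifeinc age kidslt6 kidsge6) twostep
```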
▶ Estimated coefficient: 0.1187. Standard error: 0.0341.
▶ Standard error is more than twice as large as that from the OLS regression in (i).
▶ Remark about notation: E[log(wage) | z, inlf = 1] = xβ + ρλ(zγ), where
ρ = corr(u, v) · σ_u / σ_v.
▶ In our case, ρ̂ = 0.2885. This is what Stata calls lambda, not rho.
▶ Stata’s rho is the estimate of corr(u, v).
▶ Also, σ_v is normalized to 1 because the first stage is a probit model.
▶ So Stata’s sigma is σ_u. Note that you get lambda by multiplying rho by sigma.
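The relationship between Stata's three quantities could be checked directly after estimation, as in this sketch. It relies on heckman leaving e(rho), e(sigma), and e(lambda) in its stored results, which is an assumption worth confirming with ereturn list.

```stata
* Assumes mroz.dta is in the working directory; run from a do-file
use mroz.dta, clear

quietly heckman lwage educ exper expersq nwifeinc age kidslt6 kidsge6, ///
    select(inlf = educ exper expersq nwifeinc age kidslt6 kidsge6) twostep

* Stata's lambda should equal rho times sigma
display "rho         = " e(rho)
display "sigma       = " e(sigma)
display "lambda      = " e(lambda)
display "rho * sigma = " e(rho) * e(sigma)
```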
▶ (iii) Using only the 428 observations for working women, regress λ̂ on educ, exper,
exper², nwifeinc, age, kidslt6, kidsge6. How big is the R-squared? How does this
help explain your findings from part (ii)? (Hint: Think multicollinearity.)
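A sketch of the auxiliary regression for part (iii). It rebuilds λ̂ from the first-stage probit so that the block runs on its own; the variable names zgammahat and lambda follow the commands shown earlier in the slides.

```stata
* Assumes mroz.dta is in the working directory
use mroz.dta, clear

* Rebuild the inverse Mills ratio from the first-stage probit
quietly probit inlf educ nwifeinc exper expersq age kidslt6 kidsge6
predict zgammahat, xb
generate lambda = normalden(zgammahat) / normal(zgammahat)

* Auxiliary regression on the 428 working women; inspect the R-squared
regress lambda educ exper expersq nwifeinc age kidslt6 kidsge6 if inlf == 1
```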
▶ R² is very large: 0.962.
▶ Suggests a high degree of multicollinearity between λ̂ and the other explanatory
variables.
▶ Which means there was a high degree of multicollinearity among the explanatory variables
in the second-stage regression of Heckit. Hence the large standard error:
$$\operatorname{se}(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{SST_j \, (1 - R_j^2)}} = \frac{\hat{\sigma}}{\sqrt{n} \cdot \operatorname{sd}(x_j) \cdot \sqrt{1 - R_j^2}}$$
▶ $R_j^2$ rises once we add λ̂, so $\operatorname{se}(\hat{\beta}_j)$ rises too.
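As a rough illustration of the formula, consider the coefficient on λ̂ itself, for which the auxiliary R² is the 0.962 from part (iii). The multicollinearity factor would then be approximately:

```latex
\frac{1}{\sqrt{1 - R_{\hat{\lambda}}^2}}
  = \frac{1}{\sqrt{1 - 0.962}}
  = \frac{1}{\sqrt{0.038}}
  \approx 5.1
```

The corresponding factor for educ depends on the R² from regressing educ on the other second-stage regressors (including λ̂), which is not reported here.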
▶ Not surprising: λ̂ is a function (the inverse Mills ratio, which is often well
approximated by a linear function) of z γ̂, which is a linear combination of the
variables in z, which were exactly those included in x in the second stage regression.
▶ In general to avoid this we would need to have variables included in z that were
excluded from x.