Problem Set 1: Return to Schooling and IV
For both questions, don’t write more than you have to. Some parts will not have any
written response at all. You should turn in one .do file that answers both questions.
Question 1:
It is often useful to generate your own data in which you know the true
relationship between all the variables. In this question, you will create a simulated
dataset and then apply IV methods to it. Include a constant in all regressions.
1. Create Data
a. Copy and paste the following into the beginning of your Stata .do file.
*The first 4 lines are generic; the 5th line creates an empty dataset with 900 observations.
#d;
clear;
version 12.0;
set more off,perm;
set obs 900;
*These 3 lines generate an ability variable that takes 3
values.;
gen ability=0 in 1/300;
replace ability=1 in 301/600;
replace ability=2 in 601/900;
**This creates a female dummy which will only be used for
the optional part (k).;
gen female=0;
replace female=1 in 1/150;
replace female=1 in 301/400;
replace female=1 in 601/750;
*This line creates an indicator for living “near a
college”. As written, the indicator will alternate 1,0,1,0
such that there is an equal number of “near college”
individuals in each ability group. Also ability and
near_col will not be correlated.;
gen near_col=mod(_n,2);
*These two lines create a college attendance indicator.
As constructed below, college attendance is somewhat
correlated with both ability and being near a college.
Make sure you exactly understand the construction of the
“attend_college” indicator. Browse the data to verify;
gen attend_college=0;
replace attend_college=1 if ability==2 | (ability==1 &
near_col==1);
**This last line generates white noise following a normal
distribution with a standard deviation of 0.01 and a mean
of zero.;
gen white_noise=invnorm(uniform())*0.01;
b. Generate an income variable such that the true return to ability is 1 and the
return to college is 2. Add the white_noise to the constructed income
variable so that income has a small random component.
c. Using the corr command, examine the correlation between college
attendance, ability, living near a college, and income. You should verify
that the correlation between ability and living near a college is zero. Also
note that the correlation between income and college attendance should be
around 0.962.
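Parts (b) and (c) can be sketched as follows. This is a minimal, self-contained sketch, not the required solution: it rebuilds the part (a) variables in compact form and uses the default line delimiter, so any command pasted into a .do file after #d; needs a trailing semicolon. The variable name income, the seed value, and the use of floor() and rnormal() in place of the original replace and invnorm(uniform()) lines are assumptions made for brevity.

```stata
* Compact rebuild of the part (a) variables (default line delimiter).
clear
set obs 900
set seed 12345                      // assumed seed, only for reproducibility
gen ability = floor((_n - 1)/300)   // 0, 1, 2 in blocks of 300
gen near_col = mod(_n, 2)           // alternates 1,0,1,0
gen attend_college = (ability == 2) | (ability == 1 & near_col == 1)
gen white_noise = rnormal(0, 0.01)  // mean 0, standard deviation 0.01

* Part (b): return to ability of 1, return to college of 2.
gen income = 1*ability + 2*attend_college + white_noise

* Part (c): pairwise correlations among the four variables.
corr attend_college ability near_col income
```

Because ability and near_col are built deterministically, their correlation does not depend on the seed; only the entries involving income move slightly with the noise draw.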
2. Analysis
d. Regress income on college attendance and ability and verify that OLS
uncovers the data generating process you created in part b.
e. Now suppose ability is unobservable to the econometrician but remains an
important determinant of income (still has a return of 1). Regress income
on college attendance and write down the magnitude of the omitted
variable bias. What is it about the construction of the variables that leads
to an upward bias?
f. Still assuming that ability is unobservable to the econometrician, use ivreg
to obtain an unbiased estimate of the impact of college attendance on
income. Who are the compliers for the near_col IV?
g. Suppose that there is a systematic survey response bias such that only
individuals with sufficiently high ability appear in the survey. To simulate this
nonresponse, run the same regression from part f on only the
observations with an ability of 1 or 2. Do not drop the low-ability
observations from the data. Why do you get the same estimates as in
part f?
h. Until now, we have assumed homogeneous treatment effects, since all
observations had the same income benefit from going to college. Create a
new variable called income_hetero such that the returns to college for
people with an ability of 0, 1, and 2 are 1, 2, and 10 respectively (see code
below). For simplicity, you will force the returns to ability to be zero for
all individuals. This is accomplished by simply leaving ability out of the
data generating process for income.
gen income_hetero=attend_college*1 + white_noise if ability==0;
replace income_hetero=attend_college*2+ white_noise if ability==1;
replace income_hetero=attend_college*10 + white_noise if ability==2;
i. Run a regression of income_hetero on college attendance using the reg
command. Explain using exact numbers how the coefficient on college
attendance can be thought of as the TOT, ATE or LATE. Your answer
should show that the estimate is a weighted average of 1, 2, and 10.
j. Run a regression of income_hetero on college attendance using the ivreg
command, where near_col is the instrument. Explain using exact numbers how
the coefficient on college attendance can be thought of as the TOT, ATE
or LATE. Your answer should show that the estimate is a weighted
average of 1, 2, and 10.
k. Optional Question: The LATE is the average treatment effect for the
complying population. In general, it is impossible to directly observe who
is a complier because you cannot see people both with and without the
instrument. While it is impossible to detect exactly who the compliers are,
it is possible to describe their relative characteristics. If you are using an
instrument, you should always try to provide a characterization of the
compliers. You can learn about the formula for characterizing the complier
population from Mostly Harmless Econometrics (pp. 166-173). Verify
that the ratio of the female share among compliers to the female share in
the full sample is 0.75. You should verify this both by looking directly at
the code that generates the data, and by using the techniques from MHE.
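With a single binary instrument and no other controls, the IV coefficient is the ratio of the reduced form (the effect of the instrument on the outcome) to the first stage (the effect of the instrument on treatment). The sketch below computes that ratio by hand so it can be compared with the IV output. It is a hedged illustration, not the required answer: it rebuilds the data compactly with the default line delimiter, the seed and the scalar names rf, fs, and wald are assumptions, and it uses ivregress 2sls, the current Stata command that supersedes the older ivreg named in the questions.

```stata
* Compact rebuild of the simulated data (default line delimiter).
clear
set obs 900
set seed 12345                      // assumed seed, only for reproducibility
gen ability = floor((_n - 1)/300)
gen near_col = mod(_n, 2)
gen attend_college = (ability == 2) | (ability == 1 & near_col == 1)
* Return to college of 1, 2, or 10 depending on ability, as in part (h).
gen income_hetero = attend_college*cond(ability == 0, 1, cond(ability == 1, 2, 10)) + rnormal(0, 0.01)

* Reduced form: difference in mean income by instrument status.
quietly regress income_hetero near_col
scalar rf = _b[near_col]

* First stage: difference in college attendance by instrument status.
quietly regress attend_college near_col
scalar fs = _b[near_col]

* Wald ratio, to be compared with the 2SLS coefficient.
scalar wald = rf/fs
ivregress 2sls income_hetero (attend_college = near_col)
display "Wald ratio = " scalar(wald) "   2SLS = " _b[attend_college]
```

Writing the estimate as a ratio makes it easier to see which ability groups contribute to the numerator and the denominator.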
Question 2:
This question uses data from the NLS Young Men cohort. This is the data that David
Card uses in his influential paper “Using Geographic Variation in College Proximity to
Estimate the Return to Schooling”. The Stata dataset posted on Blackboard is a cleaned
up and labeled version of the data posted on Card’s website, so if you want more details
about the data, check his website or his paper.
For this problem set, you should write out a .do program that does everything from
importing the data to running the regressions. Include a constant in all regressions.
1. Data work
a. Specify the data directory using cd "yourdirectory" and then type use
card95, clear to load the data.
b. Play with the data to learn about it. Use the command describe to get a
description of the data. Examine the distribution of key variables using
the command tab for discrete and summ for continuous. Some important
things to check out are the age distribution and the distribution of key
analysis variables. You do not have to include this portion in the .do file.
c. Construct variables for potential experience and its square. Potential
experience is equal to (Age - Educ - 6).
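The potential experience formula in part (c) can be sketched on a toy dataset. This is a minimal illustration under stated assumptions: ed76 is the education variable named later in this problem set, but age76 is an assumed name for the age variable (check the output of describe for the actual name), exper and exper2 are hypothetical names, and the two rows of data are made up.

```stata
* Toy data standing in for card95 (values are invented for illustration).
clear
input age76 ed76
30 12
28 16
end

* Potential experience = Age - Educ - 6, and its square.
gen exper = age76 - ed76 - 6
gen exper2 = exper^2
list
```

In the real data, it is worth running summ on the new variables to check that no one ends up with negative potential experience.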
2. Analysis
a. Using the reg command, run a regression of log wage on education.
b. Using the reg command, add potential experience and its square to the
regression from part (a). How does the coefficient on education change
compared to part (a)? Why does the coefficient change in that direction?
c. Provide a few theoretical reasons why the coefficient on ed76 in part (b)
might still be biased.
d. Given a constant experience and education level, older individuals might
be more mature and therefore command a higher wage. Add a control for
age to the regression from part (b). Note that the Stata output lists one of
the controls as “omitted”. Why did Stata refuse to run the regression that
you specified? For the remaining parts, exclude the age control.
e. Add the “iq” variable to the regression from part (b). Think about how the
“ed76” coefficient changes and why.
f. Add the kww variable. This is a test score from a non-cognitive survey
examining how much young men know about the work force. Notice how
the “ed76” coefficient changes and how the iq coefficient changes.
Briefly explain.
g. What are the necessary conditions for an instrument to be valid? (I did not
explicitly go over this in class, so if you don’t know, you should look it
up.)
h. Use nearc4 to instrument for education in predicting log wages. You
should use the ivreg command and include all controls that you used in
part (b). What are the income returns to education according to this
regression? (They will be large.)
i. Explain why the instrument from part (h) is invalid. Discuss why this
leads to an implausibly large estimate of the returns to education.
j. Add controls for locational dummies to the IV from part (h). The
locational dummies are 9 region dummies and an indicator for smsa:
reg661-reg668 smsa66r. Note how the coefficient on education changes
as a result of controlling for location. How does adding locational
dummies change the assumption necessary for the nearc4 instrument to be
valid? Is nearc4 a plausible instrument in this context?
k. Industry is a major determinant of wages. Why would it be a mistake to
control for industry in trying to determine the return to education?
Optional questions (not hard):
l. Construct an indicator abovehs for whether the individual has more than
12 years of education. Type “tab ed76 abovehs” to make sure you
constructed the variable correctly.
m. Using the summ command, compare mean wages of those with more than
12 years of schooling to those with 12 or fewer years of schooling.
n. Using the reg command, predict wages using the abovehs indicator and a
constant. Discuss how the regression results match up to part (m).
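The mechanics of parts (l) through (n) can be sketched on a toy dataset so that the summ and reg output can be laid side by side. This is only an illustration under stated assumptions: wage76 is an assumed name for the wage variable (check describe for the actual name), and the four rows of data are made up.

```stata
* Toy data standing in for card95 (values are invented for illustration).
clear
input ed76 wage76
10 400
12 500
14 700
16 900
end

* Part (l): indicator for more than 12 years of education.
gen abovehs = ed76 > 12 if !missing(ed76)
tab ed76 abovehs

* Part (m): mean wages in each group.
summ wage76 if abovehs == 1
summ wage76 if abovehs == 0

* Part (n): regression of wages on the indicator and a constant.
regress wage76 abovehs
```

The if !missing(ed76) qualifier is there because Stata treats missing values as larger than any number, so without it observations with missing education would be coded as abovehs = 1.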