Problem Set 1: Return to Schooling and IV

For both questions, don’t write more than you have to. Some parts will not have any written response at all. You should turn in one .do file that answers both questions.

Question 1:
It is often useful to generate your own data in which you know the true relationship between all the variables. In this question, you will create a simulated dataset and then apply IV methods to it. Include a constant in all regressions.

1. Create Data

a. Copy and paste the following into the beginning of your Stata .do file.

    *The first 4 lines are generic, the 5th line creates an empty dataset with 900 observations.
    #d;
    clear;
    version 12.0;
    set more off,perm;
    set obs 900;
    *These 3 lines generate an ability variable that takes 3 values.;
    gen ability=0 in 1/300;
    replace ability=1 in 301/600;
    replace ability=2 in 601/900;
    **This creates a female dummy which will only be used for the optional part (k).;
    gen female=0;
    replace female=1 in 1/150;
    replace female=1 in 301/400;
    replace female=1 in 601/750;
    *This line creates an indicator for living “near a college”. As written, the indicator will alternate 1,0,1,0 such that there is an equal number of “near college” individuals in each ability group. Also ability and near_col will not be correlated.;
    gen near_col=mod(_n,2);
    *These two lines create a college attendance indicator. As constructed below, college attendance is somewhat correlated with both ability and being near a college. Make sure you understand exactly how the “attend_college” indicator is constructed. Browse the data to verify.;
    gen attend_college=0;
    replace attend_college=1 if ability==2 | (ability==1 & near_col==1);
    **This last line generates white noise following a normal distribution with a standard deviation of 0.01 and a mean of zero.;
    gen white_noise=invnorm(uniform())*0.01;

b. Generate an income variable such that the true return to ability is 1 and the return to college is 2.
Add the white_noise to the constructed income variable so that income has a small random component.

c. Using the corr command, examine the correlation between college attendance, ability, living near a college, and income. You should verify that the correlation between ability and living near a college is zero. Also note that the correlation between income and college attendance should be around 0.962.

Analysis:

d. Regress income on college attendance and ability and verify that OLS recovers the data generating process you created in part b.

e. Now suppose ability is unobservable to the econometrician but remains an important determinant of income (still has a return of 1). Regress income on college attendance and write down the magnitude of the omitted variable bias. What is it about the construction of the variables that leads to an upward bias?

f. Still assuming that ability is unobservable to the econometrician, use ivreg to obtain an unbiased estimate of the impact of college attendance on income. Who are the compliers for the near_col IV?

g. Suppose that there is a systematic survey response bias such that only individuals with sufficiently high ability appear in the survey. To simulate this nonresponse, run the same regression from part f on only the observations with an ability of 1 or 2. Do not drop the low-ability observations from the data. Why do you get the same estimates as in part f?

h. Until now, we have assumed homogeneous treatment effects, since all observations had the same income benefit from going to college. Create a new variable called income_hetero such that the returns to college for people with an ability of 0, 1 and 2 are 1, 2 and 10 respectively (see code below). For simplicity, you will force the returns to ability to be zero for all individuals. This is accomplished by simply leaving ability out of the data generating process for income.
    gen income_hetero=attend_college*1 + white_noise if ability==0;
    replace income_hetero=attend_college*2 + white_noise if ability==1;
    replace income_hetero=attend_college*10 + white_noise if ability==2;

i. Run a regression of income_hetero on college attendance using the reg command. Explain, using exact numbers, how the coefficient on college attendance can be thought of as the TOT, ATE or LATE. Your answer should show that the estimate is a weighted average of 1, 2, and 10.

j. Run a regression of income_hetero on college attendance using ivreg, with near_col as the instrument. Explain, using exact numbers, how the coefficient on college attendance can be thought of as the TOT, ATE or LATE. Your answer should show that the estimate is a weighted average of 1, 2, and 10.

k. Optional Question: The LATE is the average treatment effect for the complying population. In general, it is impossible to directly observe who is a complier because you cannot see people both with and without the instrument. While it is impossible to detect exactly who the compliers are, it is possible to describe their relative characteristics. If you are using an instrument, you should always try to provide a characterization of the compliers. You can learn about the formula for characterizing the complier population from Mostly Harmless Econometrics (pp. 166-173). Verify that the ratio of the female share among compliers to the female share in the full sample is 0.75. You should verify this both by looking directly at the code that generates the data and by using the techniques from MHE.

Question 2:
This question set uses data from the NLS young men cohort. This is the data that David Card uses in his influential paper “Using Geographic Variation in College Proximity to Estimate the Return to Schooling”. The Stata dataset posted on Blackboard is a cleaned-up and labeled version of the data posted on Card’s website, so if you want more details about the data, check his website or his paper.
For this problem set, you should write a .do program that does everything from importing the data to running the regressions. Include a constant in all regressions.

1. Data work

a. Specify the data directory using cd "yourdirectory" and then type use card95, clear to load the data.

b. Explore the data to learn about it. Use the command describe to get a description of the data. Examine the distribution of key variables using the command tab for discrete variables and summ for continuous variables. Some important things to check are the age distribution and the distribution of key analysis variables. You do not have to include this portion in the .do file.

c. Construct variables for potential experience and its square. Potential experience is equal to (Age - Educ - 6).

2. Analysis

a. Using the reg command, run a regression of log wage on education.

b. Using the reg command, add potential experience and its square to the regression from part (a). How does the coefficient on education change compared to part (a)? Why does the coefficient change in that direction?

c. Provide a few theoretical reasons why the coefficient on ed76 in part (b) might still be biased.

d. Given a constant experience and education level, older individuals might be more mature and therefore command a higher wage. Add a control for age to the regression from part (b). Note that the Stata output lists one of the controls as “omitted”. Why did Stata refuse to run the regression that you specified? For the remaining parts, exclude the age control.

e. Add the “iq” variable to the regression from part (b). Think about how the “ed76” coefficient changes and why.

f. Add the kww variable. This is a test score from a non-cognitive survey examining how much young men know about the work force. Notice how the “ed76” coefficient changes and how the iq coefficient changes. Briefly explain.

g. What are the necessary conditions for an instrument to be valid? (I did not explicitly go over this in class, so if you don’t know, you should look it up.)

h. Use nearc4 to instrument for education in predicting log wages. You should use the ivreg command and include all controls that you used in part (b). What are the income returns to education according to this regression? (They will be large.)

i. Explain why the instrument from part (h) is invalid. Discuss why this leads to an implausibly large estimate of the returns to education.

j. Add controls for locational dummies to the IV from part (h). The locational dummies are 9 region dummies and an indicator for smsa: reg661-reg668 smsa66r. Note how the coefficient on education changes as a result of controlling for location. How does adding locational dummies change the assumption necessary for the nearc4 instrument to be valid? Is nearc4 a plausible instrument in this context?

k. Industry is a major determinant of wages. Why would it be a mistake to control for industry when trying to determine the return to education?

Optional questions (not hard):

l. Construct an indicator abovehs for whether the individual has more than 12 years of education. Type “tab ed76 abovehs” to make sure you constructed the variable correctly.

m. Using the summ command, compare mean wages of those with more than 12 years of schooling to those with 12 or fewer years of schooling.

n. Using the reg command, predict wages using the abovehs indicator and a constant. Discuss how the regression results match up with part (m).
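
As a rough illustration of the mechanics in optional parts (l) through (n), the steps might be sketched as follows. This is only a sketch under stated assumptions, not a model answer: it assumes card95.dta is in the working directory, and the wage variable name wage76 is an assumption (use describe to find the actual name in the posted dataset). The education variable ed76 is the one named above.

```stata
* Sketch only: wage76 is an ASSUMED variable name, confirm it with describe.
use card95, clear

* (l) Indicator for more than 12 years of education.
*     The if-condition keeps abovehs missing wherever ed76 is missing.
gen abovehs = (ed76 > 12) if !missing(ed76)
tab ed76 abovehs

* (m) Mean wages for each group.
summ wage76 if abovehs == 1
summ wage76 if abovehs == 0

* (n) With only a constant and one dummy, _cons equals the mean wage of the
*     abovehs==0 group, and the abovehs coefficient equals the difference
*     between the two group means from part (m).
reg wage76 abovehs
```

The sketch is written with Stata's default line-end delimiter. If it is appended to a .do file that is still in #d; mode from Question 1, either end every command and comment with a semicolon or switch back first with #delimit cr.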