December 2, 2021
ECON 301: ECONOMETRICS I
TAKE-HOME FINAL ANSWER KEY

Rules:
1. You are required to submit your answers to the “Final Exam Dropbox” in Moodle, under the week of November 29 (which covers today), by 5 pm.
2. The take-home final exam is designed to take 2 hours. You have an additional 2 hours to submit your answers to Moodle, to account for any technical problems you might encounter. Therefore, if you fail to submit your answers by 5 pm, you will receive a zero, without exception.
3. You need to write down your own original answers and submit them as a single file (preferably pdf). Moodle will not allow you to submit more than one file. Submitting a zip file will result in a loss of 50 percent of the points you collect from the exam. You can use software (e.g., Word or LaTeX) to type your answers or scan your handwritten answers into a pdf file. In the latter case, you are responsible for the quality of the scan.
4. You are NOT allowed to resubmit, so make sure you are submitting the right file.
5. Making the same mistakes in a similar order will be judged as plagiarism and will be sanctioned according to the Student’s Code of Misconduct.
6. Answers that are correct but copied and pasted, or not properly paraphrased, from another source will be judged as plagiarism and sanctioned according to the Student’s Code of Misconduct.
7. Especially in open-ended questions, but not limited to them, answers that use steps similar to those of your peers, even if properly paraphrased, will be treated as suspected plagiarism. Your exam will then not count, and I will instead give you an oral exam on December 10.
8. The bottom line behind rules 5-7 is to write your own original answers.
9. There are 10 True/False questions and 4 problems in total, worth 100 points.
10. If you believe a question is vague, sharpen it as you see fit before answering.
11. Explain your answers carefully. You will get no credit for unsupported assertions or guesses.
Write as if you are trying to convince an intelligent person who does not already know the answers. If your answers would not convince such a person, I will assume that you do not really understand the material.

True or False (30 points, 3 points each)
Are the following statements true or false? Justify your answers briefly. For each statement, the correct T/F is worth only 1 point and the justification is worth 2 points.

(i) Under Assumptions MLR.1 - MLR.5, the OLS estimators are unbiased and normally distributed.
False. MLR.1 - MLR.4 are enough for unbiasedness, but MLR.6 is required for normality.

(ii) H1: $\hat{\beta}_j \neq 0$, where $\hat{\beta}_j$ is a regression coefficient associated with an explanatory variable, represents a two-sided alternative hypothesis.
False. Hypotheses are statements about population parameters, not about estimates. H1: $\beta_j \neq 0$, not $\hat{\beta}_j \neq 0$, represents a two-sided alternative hypothesis.

(iii) If the error term is not correlated with any of the independent variables, the OLS estimators are unbiased and consistent.
False. In this case, the OLS estimators will be consistent but not necessarily unbiased. Unbiasedness requires E(u | X) = 0, i.e. the error term should be mean independent of the independent variables, which is stronger than zero correlation.

(iv) Unbiasedness is a minimal requirement for an estimator in regression analysis.
False. We settle for consistency most of the time, since it is not always easy to obtain unbiased estimates. An example is the IV estimator, which is consistent but biased, and we are happy with it.

(v) Beta coefficients are always greater than standardized coefficients.
False. Beta coefficients are standardized coefficients. A standardized coefficient is obtained by standardizing each variable, i.e. subtracting its mean and dividing by its standard deviation, before running the regression. This is exactly what you do to get beta coefficients.

(vi) Robust standard errors are always preferred over regular standard errors.
False. Robust standard errors are justified only asymptotically. In small samples (roughly n < 120), we do not know whether robust or regular standard errors are better.

(vii) Adjusted R2 is always larger than 0.
False. $\bar{R}^2 = 1 - \frac{SSR/(n-k-1)}{SST/(n-1)} < 0$ if $\frac{SSR}{n-k-1} > \frac{SST}{n-1}$.
Equivalently, $(n-1)SSR > (n-k-1)SST$, or $k \cdot SST > (n-1)(SST - SSR)$, or $k \cdot SST > (n-1)SSE$, or $SSE/SST < k/(n-1)$, or $R^2 < k/(n-1)$. This inequality can hold for any k and n. Since typically k << n, this suggests that if $R^2$ is sufficiently close to 0 (so that our regression does not fit the data well at all), $\bar{R}^2$ might be negative.

(viii) If you conduct a White test, the special version of it, and a Breusch-Pagan test of heteroscedasticity and find evidence in favor of the null hypothesis in all these tests (all slope estimates are jointly insignificant in the regressions of the squared residuals), you can conclude that your data do not suffer from heteroscedasticity for sure.
False. Two reasons: (i) failing to reject the null hypothesis does not prove it, since you may be committing a Type II error; (ii) you do not know the true form of heteroscedasticity, and each test assumes a different functional form, which does not have to be the right one. Heteroscedasticity of another form can therefore go undetected.

(ix) If the data are missing at random, then the missing data do not cause any statistical problems.
True. If the data are missing at random, then what you observe is a random subsample of the original random sample. Since the sample is still random, we will not have any issues with estimation.

(x) If the instrumental variable estimator has an upward bias, the ordinary least squares estimator always has a downward bias.
False. The asymptotic bias of the OLS estimator is (σu/σx)Corr(x,u), whereas the asymptotic bias of the IV estimator is (σu/σx)Corr(z,u)/Corr(z,x). Corr(z,u)/Corr(z,x) > 0 does not imply Corr(x,u) < 0, so the two biases need not have opposite signs. In plain words, when you have an endogenous variable, OLS will be biased. In principle, you use IV to get rid of the bias.

Problem 1 (22 points): The dataset LOANAPP from Wooldridge was used in the 1996 paper “The Cultural Affinity Hypothesis and Mortgage Lending Decisions,” Journal of Real Estate Finance and Economics 13, 57-70, by W.C. Hunter and M.B. Walker.
In this data, approve is the dummy dependent variable, equal to 1 if a mortgage loan to an individual was approved; hrat and obrat measure housing expenditures and other obligations as percentages of total income; dep is the number of dependents; sch is a dummy variable denoting high school graduates; white and hispan are dummy variables for white and Hispanic ethnicities, with blacks being the reference category; unem, male, married, and cosign are dummy variables denoting being unemployed, being male, being married, and having a cosigner for the loan. Letting X denote the vector of the regressors of interest, the econometric model of interest is:

$\Pr(\text{approve} = 1 \mid X) = G(\beta_0 + \beta_1 \text{hrat} + \beta_2 \text{obrat} + \beta_3 \text{dep} + \beta_4 \text{sch} + \beta_5 \text{white} + \beta_6 \text{hispan} + \beta_7 \text{unem} + \beta_8 \text{male} + \beta_9 \text{married} + \beta_{10} \text{cosign})$,

where G denotes the standard normal cumulative distribution function. The regression output and some other statistics are presented below:

                                                Number of obs     =      1,971
                                                LR chi2(10)       =     144.56
                                                Prob > chi2       =     0.0000
Log likelihood = -665.69811                     Pseudo R2         =     0.0979

------------------------------------------------------------------------------
     approve |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hrat |   .0128048   .0064934     1.97   0.049     .0000779    .0255317
       obrat |  -.0346974   .0056213    -6.17   0.000     -.045715   -.0236798
         dep |  -.0538628   .0361607    -1.49   0.136    -.1247365     .017011
         sch |   .0558429   .0904552     0.62   0.537     -.121446    .2331319
       white |    .819701   .1052271     7.79   0.000     .6134597    1.025942
      hispan |   .2315558   .1656895     1.40   0.162    -.0931896    .5563012
        unem |  -.0399494   .0165387    -2.42   0.016    -.0723647   -.0075341
        male |  -.0444482   .1050657    -0.42   0.672    -.2503731    .1614767
     married |    .250487   .0895493     2.80   0.005     .0749736    .4260003
      cosign |   .1398014   .2317651     0.60   0.546    -.3144499    .5940527
       _cons |   1.372325   .2266169     6.06   0.000     .9281639    1.816486
------------------------------------------------------------------------------

. margins, dydx(*)

Average marginal effects                        Number of obs     =      1,971
Model VCE    : OIM

Expression   : Pr(approve), predict()
dy/dx w.r.t. : hrat obrat dep sch white hispan unem male married cosign

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        hrat |   .0023665   .0011984     1.97   0.048     .0000176    .0047154
       obrat |  -.0064125    .001034    -6.20   0.000    -.0084391   -.0043859
         dep |  -.0099545   .0066799    -1.49   0.136    -.0230469    .0031379
         sch |   .0103205   .0167166     0.62   0.537    -.0224435    .0430844
       white |    .151491   .0190326     7.96   0.000     .1141878    .1887941
      hispan |   .0427944   .0305766     1.40   0.162    -.0171346    .1027234
        unem |  -.0073832   .0030529    -2.42   0.016    -.0133668   -.0013995
        male |  -.0082146   .0194169    -0.42   0.672     -.046271    .0298418
     married |   .0462931   .0165313     2.80   0.005     .0138924    .0786938
      cosign |    .025837   .0428289     0.60   0.546     -.058106    .1097801
------------------------------------------------------------------------------

. test white hispan

 ( 1)  [approve]white = 0
 ( 2)  [approve]hispan = 0

           chi2(  2) =   69.73
         Prob > chi2 =    0.0000

. tab dep

  number of |
 dependents |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |      1,175       59.16       59.16
          1 |        317       15.96       75.13
          2 |        327       16.47       91.59
          3 |        126        6.34       97.94
          4 |         31        1.56       99.50
          5 |          5        0.25       99.75
          6 |          3        0.15       99.90
          7 |          1        0.05       99.95
          8 |          1        0.05      100.00
------------+-----------------------------------
      Total |      1,986      100.00

a. (2 points) What is the name of the regression we run?
Probit regression.

b. (4 points) Interpret the effect of hrat on the probability of getting approved for a loan.
Holding everything else constant, if housing expenditures increase by 1 percentage point of total income, the probability of getting approved for a loan increases by about 0.24 percentage points on average (average marginal effect of 0.0023665).

c. (4 points) Would you change how the number of dependents is modeled? Why or why not?
There is no reason for an increase in the number of dependents from 1 to 2 to have the same effect as an increase from 2 to 3. Given this, it would be better to model the number of dependents using dummy variables. The tabulation of the number of dependents shows that most of the variation is in the range of 0-3 dependents. We can therefore use 0 as the reference category and define 3 dummy variables: having 1, 2, and 3+ dependents.

d. (4 points) If you assume that the model we ran is correctly specified, would you conclude that blacks are discriminated against?
The average marginal effect for whites is positive and significant: white applicants are about 15 percentage points more likely to be approved than otherwise similar black applicants. The effect for Hispanics is not significant, so Hispanics are not treated differently from blacks, the reference category. This suggests that both blacks and Hispanics are discriminated against. Note that because hispan is individually insignificant, the joint significance test of white and hispan cannot be used to reach this conclusion.

e. (4 points) Test the null hypothesis H0: β9 = 0.4 vs H1: β9 > 0.4 at the 1 percent significance level.
z = (0.25 - 0.4)/0.09 = -1.67 < 0. With a negative test statistic we can never reject the null hypothesis in favor of a right-sided alternative (the 1 percent critical value is 2.33), so we fail to reject the null hypothesis.

f. (4 points) Calculate the 99% confidence interval for β7.
-0.0399 ± 2.58 × 0.0165 = (-0.0825, 0.0027)

Problem 2 (18 points): The data set CAMPUS from Wooldridge was collected by an undergraduate Michigan State University student for his project analyzing campus crime rates across the U.S. In this data, lcrime denotes the logarithm of the number of crimes committed in a year on the university campus; lenroll denotes the logarithm of the total number of students enrolled in the university; lpolice denotes the logarithm of the number of police officers employed by the university; private is a dummy variable indicating whether the university is private. The econometric model is given by:

$\text{lcrime} = \beta_0 + \beta_1 \text{lenroll} + \beta_2 \text{lpolice} + \beta_3 \text{private} + u$.

Assume that all the classical linear regression model assumptions are satisfied. The regression output is provided below. Parts of the output are intentionally withheld.

. reg lcrime lenroll lpolice priv

      Source |       SS           df       MS      Number of obs   =        97
-------------+----------------------------------   F(3, 93)        =     54.04
       Model |                                     Prob > F        =    0.0000
    Residual |                                     R-squared       =
-------------+----------------------------------   Adj R-squared   =    0.6237
       Total |  183.119479                         Root MSE        =    .84721

------------------------------------------------------------------------------
      lcrime |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
     lenroll |   .9462363   .1460992     6.48   0.000     .6561123     1.23636
     lpolice |   .5391599   .1507161     3.58   0.001     .2398675    .8384523
        priv |     .26656   .2835986     0.94   0.350    -.2966105    .8297305
       _cons |  -5.102549   1.160198    -4.40   0.000    -7.406473   -2.798626
------------------------------------------------------------------------------

a. (3 points) Explain why it is crucial to assume that all the classical linear regression model assumptions are satisfied for this question.
Since we have a small sample (n = 97), we cannot rely on robust standard errors or asymptotic normality. If we do not assume MLR.5 and MLR.6, we cannot make any inference from our results. MLR.1 - MLR.4 matter as usual: they are needed to get unbiased estimates.

b. (3 points) State the null and alternative hypotheses to test whether the elasticity of crime with respect to enrollment is unity.
H0: β1=1 vs. H1: β1≠1.

c. (3 points) Conduct the test in part b.
t = (.946 - 1)/.146 = -0.37. Since the t-value is very small in absolute value (the 10 percent two-sided critical value with 93 degrees of freedom is about 1.66), we fail to reject the null hypothesis at any conventional level of significance.

d. (4 points) Interpret the slope coefficient for private.
Holding everything else constant, crime in private universities is on average 30.6 percent higher (exp(.267) - 1 = 0.306). Note that since the estimate of β3 is large, we cannot use the approximation of 26.6 percent (100 times the coefficient); this should be obvious when comparing the exact effect of 30.6 percent to 26.6 percent.

e. (5 points) Regressing lcrime on lenroll yields an R2 of 0.5848.
Do you reject H0: β2=β3=0 at the 10 percent level of significance?
First, we need to find the R2 of the long regression. Residual MS = (Root MSE)^2, so Residual MS = 0.84721^2 = 0.71776. Residual df = 97-3-1 = 93. Then Residual SS = 93 × 0.71776 = 66.75, and R2 = 1 - 66.75/183.12 = 0.6355. Having recovered the R2 of the long regression, we can now conduct the F test: F = [(0.6355 - 0.5848)/(1 - 0.6355)] × (93/2) = 6.47 ~ F(2, 93). The 10 percent critical value for F(2, 90) is 2.36, so we reject the null hypothesis at the 10 percent level of significance.

Problem 3 (15 points): Consider the model $Y = \beta_1 X + U$, where E(U|X) = 0. In particular, we assume that the intercept is equal to 0 (which may or may not be true in the population). The data is a random sample of size n, {Yi, Xi}: i=1,2,…,n.

a. (10 points) Consider the estimator

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2} + \frac{1}{n}.$$

Is $\hat{\beta}_1$ an unbiased estimator of $\beta_1$? Is $\hat{\beta}_1$ a consistent estimator of $\beta_1$?

Unbiasedness:

$$E(\hat{\beta}_1) = E\left(\frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2}\right) + \frac{1}{n} = E\left(\frac{\sum_{i=1}^{n} X_i (\beta_1 X_i + U_i)}{\sum_{i=1}^{n} X_i^2}\right) + \frac{1}{n} = \frac{\beta_1 \sum_{i=1}^{n} X_i^2 + \sum_{i=1}^{n} X_i E(U_i)}{\sum_{i=1}^{n} X_i^2} + \frac{1}{n}$$

$$E(\hat{\beta}_1) = \beta_1 + \frac{\sum_{i=1}^{n} X_i E(U_i)}{\sum_{i=1}^{n} X_i^2} + \frac{1}{n} = \beta_1 + \frac{1}{n}.$$

The last equality follows because E(U) = 0. We assumed the X's are non-random for this derivation. Alternatively, you can show that all of this holds using expectations conditional on X (the mathematically rigorous way). Since $E(\hat{\beta}_1) = \beta_1 + 1/n \neq \beta_1$, $\hat{\beta}_1$ is biased.

Consistency:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i}{\sum_{i=1}^{n} X_i^2} + \frac{1}{n} = \frac{\frac{1}{n}\sum_{i=1}^{n} X_i Y_i}{\frac{1}{n}\sum_{i=1}^{n} X_i^2} + \frac{1}{n}.$$

Then, using the properties of probability limits, we can write

$$\text{plim}\, \hat{\beta}_1 = \frac{\text{plim}\, \frac{1}{n}\sum_{i=1}^{n} X_i Y_i}{\text{plim}\, \frac{1}{n}\sum_{i=1}^{n} X_i^2} + \text{plim}\, \frac{1}{n} = \frac{E(XY)}{E(X^2)} = \frac{E(X(\beta_1 X + U))}{E(X^2)} = \frac{\beta_1 E(X^2) + E(XU)}{E(X^2)}.$$

Since E(U|X) = 0, it follows that E(UX) = 0. Therefore, $\text{plim}\, \hat{\beta}_1 = \beta_1$, and $\hat{\beta}_1$ is a consistent estimator of $\beta_1$.

b. (5 points) Assume that X1≠0 (X1 is the first observation of the regressor in the sample) and consider the estimator

$$\hat{\beta}_1 = \frac{X_1^2 Y_1}{X_1^3}.$$

Is $\hat{\beta}_1$ an unbiased estimator of $\beta_1$?
$$\hat{\beta}_1 = \frac{X_1^2 Y_1}{X_1^3} = \frac{Y_1}{X_1} = \frac{\beta_1 X_1 + U_1}{X_1} = \beta_1 + \frac{U_1}{X_1}.$$

Then,

$$E(\hat{\beta}_1) = \beta_1 + E\left(\frac{U_1}{X_1}\right) = \beta_1 + \frac{E(U_1)}{X_1} = \beta_1.$$

Note that the second to last equality follows only under the assumption that the X's are non-random. Alternatively, we could show unbiasedness using the expectation conditional on X. $\hat{\beta}_1$ is unbiased.

Problem 4 (15 points): A not very good hypothetical Econometrics student estimated a model that tries to explain the standardized outcome on a huge ECON101 final exam (stndfnl) in terms of the percentage of classes attended (atndrte), prior college GPA (priGPA), and ACT score (ACT). The regression output is provided below. Regular standard errors are presented in parentheses.

stndfnl (fitted) = -0.021 atndrte + 0.082 ACT - 0.555 priGPA + 0.011 atndrte*priGPA
                   (0.002)          (0.011)     (0.078)        (0.004)
n = 680, R2 = 0.201

Assume that the data on stndfnl, ACT, and priGPA are obtained from the registry of the university, so they are not subject to measurement error. atndrte, on the other hand, is collected from students through a survey at the end of the semester.

a. (3 points) Being a good Econometrics student, you know that, with a large sample size, the student should have reported robust standard errors. If you employ robust standard errors, which of the following will change: parameter estimates, standard errors, R2?
Only the standard errors will change. The parameter estimates do not depend on which standard errors we use; hence neither do SSR, SSE, and SST; hence neither does R2.

b. (3 points) Being a good Econometrics student, you know that because atndrte is collected from students through a survey at the end of the semester, it should be subject to measurement error. What would be the effect of this measurement error on your estimates?
The answer depends on the measurement error. If the CEV assumption holds, i.e.
the measurement error is uncorrelated with the true value of attendance, the estimated coefficient on atndrte will be biased toward zero (since it is estimated to have a negative sign, the estimate will be larger, i.e. closer to zero, than the true value). However, one can easily question the CEV assumption in this example: it will be easier for students who almost always go to class to remember exactly how many classes they missed, which suggests that the measurement error is correlated with the true value of attendance. In this case, the bias can go either way. In either case, all the other estimates will also be biased, and the bias can go in either direction.

c. (3 points) Being a good Econometrics student, you also realize that even though 800 students took the final exam, the sample size of the regression is 680, implying that 120 students did not report their attendance rates. What problem would that cause for your estimation?
It is likely that responding to the survey depends on how successful a student is. In this case, we will have a self-selected sample, or in other words, endogenous sample selection. As a result, our estimates will be biased and inconsistent.

d. (2 points) Being a good Econometrics student, you know that atndrte should be endogenous. Explain why.
What we do not model in this regression is motivation or laziness. These attributes will surely affect the final exam grade, so they are in u. However, the attendance rate will be correlated with these attributes; as a result, atndrte is an endogenous variable.

e. (4 points) If you have data on the distance from where a student lives to the lecture hall (call it dist), explain why you can utilize this information to tackle endogeneity. Write down the STATA code you would run to get your new estimates.
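dist should be negatively correlated with the attendance rate, since a larger distance increases the non-monetary cost of attending classes, but uncorrelated with the error term, since students typically do not choose where to live strategically (they either live with their parents or are assigned a dorm). This is why we can use dist as an IV for atndrte. In writing down the STATA code, observe that we also have atndrte*priGPA as a regressor. It is endogenous as well and can be instrumented with dist*priGPA. Because * inside a Stata variable list is a wildcard rather than a product, the two interactions have to be generated first (the variable names atnd_priGPA and dist_priGPA are only illustrative). The code is then given by:

gen atnd_priGPA = atndrte*priGPA
gen dist_priGPA = dist*priGPA
ivreg stndfnl ACT priGPA (atndrte atnd_priGPA = dist dist_priGPA)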
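
The logic of parts d and e can be illustrated with a small simulation. The following is a minimal sketch in Python and is not part of the exam or its data: it assumes a simplified model without ACT, priGPA, or the interaction term, a true attendance effect of 0.5, and made-up coefficients linking motivation, distance, and attendance. Unobserved motivation enters both attendance and the error term, so OLS is inconsistent, while distance is correlated with attendance but not with the error, so the simple IV estimator Cov(z, y)/Cov(z, x) recovers the true effect in large samples.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def simulate(n=20000, beta=0.5, seed=301):
    """Return (OLS, IV) slope estimates from simulated data.

    All numbers below are assumptions chosen for illustration only.
    """
    rng = random.Random(seed)
    dist, atnd, score = [], [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # unobserved, so it ends up in the error term
        d = rng.gauss(0, 1)           # distance: independent of motivation by construction
        # Attendance falls with distance and rises with motivation.
        a = -0.5 * d + 0.8 * motivation + rng.gauss(0, 1)
        u = motivation + rng.gauss(0, 1)  # error term contains motivation
        dist.append(d)
        atnd.append(a)
        score.append(beta * a + u)
    ols = cov(atnd, score) / cov(atnd, atnd)  # inconsistent: Cov(atnd, u) = 0.8
    iv = cov(dist, score) / cov(dist, atnd)   # consistent: Cov(dist, u) = 0
    return ols, iv


if __name__ == "__main__":
    ols, iv = simulate()
    # In this design OLS converges to 0.5 + 0.8/1.89 (about 0.92), IV to 0.5.
    print(f"true beta = 0.5, OLS = {ols:.3f}, IV = {iv:.3f}")
```

In this design the OLS estimate is biased upward because motivation raises both attendance and the exam score, whereas the IV estimate stays close to the assumed true value. The sketch deliberately leaves out the interaction term; with atndrte*priGPA in the model, two instruments are needed, as in the answer above.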