Mid-term exam
Practice problems

Most problems are “short answer” problems. You receive points for the answer and the explanation. Full points require both, unless otherwise specified. Explaining your answer is never a bad idea. A little practice on stuff from last semester in here; that stuff is always relevant, so I want you to make sure that knowledge hasn’t left you. Plus, answering these questions should be good practice on how to answer test questions. Most questions rely on a mix of knowledge that you have obtained over the last two semesters. Question 16 (EKC question) is especially about issues we have not discussed this semester, but relies on knowledge you should have. If you want, you can skip this one and concentrate on other questions (but again, you should ultimately not find this to be a question that you cannot answer).

1 (5 points) True or false (no explanation necessary)

In the multiple regression model, the goodness-of-fit measure R-squared always increases (or remains the same) when an additional explanatory variable is added.

True

2 (3 points) True or false (no explanation necessary)

You estimate a regression model with three independent variables and get an $R^2 = 0.86$. True or false: Such a high $R^2$ in a regression model with three independent variables indicates that each of the explanatory variables must be (individually) statistically significant.

False

3 (3 points) True or false (no explanation necessary)

True or false: When using a probit model, the effect of a change in x on the probability that y = 1 depends on the value of x at which you evaluate the change.

True

4 (5 points) True or false (no explanation necessary)

Estimated OLS regression coefficients (the $\hat{\beta}$’s) always have the same sign as the correlation between y and the respective x’s.
True

5 (5 points) (this is a hard one; requires some thinking) Multiple choice: circle all that apply (no explanation necessary)

In an OLS model with a constant, endogeneity is characterized by:
(a) $E(u|x) \neq 0$
(b) $\mathrm{Var}(u|x) \neq 0$
(c) $\mathrm{Var}(u|x) \neq \sigma^2$
(d) $E(u|x) = E(u)$

The answer is (a). Only (a). If you didn’t recall exactly what endogeneity meant, you could have found a major clue in the later problem about omitted variable bias (in which OVB was defined). The (b) and (c) choices are about the variance of the unobservable term, so we can discount them right away. The choice (d) could have been tricky. (d) says that the expected value of u, conditional on x, is equal to the unconditional expectation of u. In any OLS regression model with a constant, $E(u) = 0$. So (d) implies that $E(u|x) = 0$. The conditional expectation of u being zero does not characterize endogeneity. Endogeneity is characterized by a situation in which x and u are correlated, i.e. knowing something about x would change our expectation of the value of u. And our expectation of the value of u is always 0 to begin with. So (a) is the only correct answer.

6 (3 points) Multiple choice (no explanation necessary)

In a linear probability model all of the following are true except:
(a) the estimated coefficients can be interpreted directly as marginal effects
(b) $R^2$ is a good measure of how well the model fits the data
(c) the predicted probability can be negative
(d) the errors are always heteroskedastic

The answer is (b). If you didn’t already know that (b) was false, by process of elimination: (a) is true, since the coefficients of a LPM are equal to marginal effects (whereas in a probit model the marginal effects are a nonlinear function of the coefficients); (c) is true because a linear probability model is nothing but an OLS model, and an OLS model predicts values above 1 and below 0; (d) is true by construction of the LPM: the fit of the model is such that the variance must change as x changes.
In order to gain some intuition as to why (b) is false, think about what $R^2$ is a measure of: the proportion of variation in y that is explained by our model. If we are using a LPM, y is binary (0,1). If y is binary, its values are artificially constrained to be either 0 or 1. Our predictions will almost never be exactly equal to zero or one: $\hat{y}$ is a fitted probability that typically falls strictly between zero and one (and, as noted in (c), can even fall outside that range). Compare this situation to a continuous y. With the LPM, $R^2$ gives a much less useful (more artificial, if you will) gauge of the variation explained by our model.

7 (5 points) Multiple choice (no explanation necessary)

In the probit model all of the following are lies except:
(a) $\beta_0$ cannot be negative since probabilities have to lie between 0 and 1
(b) $\beta_j$ tells you the effect of a unit increase in $x_j$ on the probability that y = 1
(c) $\beta_j$ does not have a simple interpretation (i.e. cannot be interpreted directly)
(d) $\beta_0$ is the probability of observing y when all the x variables are 0

The answer is (c): the marginal effects are a nonlinear function of the coefficients $\beta_j$, but the $\beta_j$ themselves don’t tell us much.

8 (5 points) Short answer: provide an answer and an explanation

You plan to run an instrumental variables regression (a two-stage least squares model). You run a first-stage regression in order to examine the relationship between your candidate instrument and the variable you suspect of being endogenous in the main regression. Which concern would you have if the first-stage regression results were as indicated:

(a) the relationship between the candidate instrument and the endogenous regressor seems weak: the t-stat is low, the $R^2$ is low, and the F-stat is low. Consequently, you suspect that the instrument might not be valid.
(b) the relationship between the candidate instrument and the endogenous regressor seems strong: conditions opposite of those listed in (a). Consequently, you suspect that the instrument might not be valid.
(c) the relationship between the candidate instrument and the endogenous regressor seems weak. Consequently, you suspect that the instrument might lead to imprecise estimates of the effect of interest in the second stage.

The answer is (c). A poor relationship between an instrument and an endogenous variable (i.e. the variable that is being instrumented for) is reason to be concerned about the amount of variation that will be left once you “clean” the endogenous variable. A weak relationship is NOT an indication of validity or lack of validity of the instrument.

9 (6 points) Very short answer (don’t get stuck with overly long explanations on this one) (Ballantine question)

Suppose there are three variables, y, $x_1$, and $x_2$. The variation in y is represented by the circle made up of area A+B+C+D in the figure below; $x_1$ is represented by area B+C+E+F; $x_2$ is represented by area C+D+F+G. Three questions: (a) For the model $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$, what area of the Ballantine figure represents u? (b) In the model $x_1 = \alpha_0 + \alpha_1 x_2 + w$, what area of the Ballantine figure represents w? (c) Finally, in the model $y = \gamma_0 + \gamma_1 w + z$, where w refers to the error term from the model in part (b), what area of the Ballantine figure represents z?

Answer all three questions:
(a) what area of the Ballantine figure represents u? (2 points)
(b) what area of the Ballantine figure represents w? (2 points)
(c) what area of the Ballantine figure represents z? (2 points)

[Figure: Ballantine diagram of three overlapping circles, with regions labeled A through G.]

The area of the Ballantine that represents u is the area A. The area of the Ballantine that represents w is B+E. The area of the Ballantine that represents z is A+C+D.

10 (5 points) Show your work or explain your answer

Suppose you have data on y and x. You specify the following regression model
$$y = \beta_0 + \beta_1 x + u$$
and obtain the following results from OLS:
$$\hat{\beta}_0 = 3, \qquad \hat{\beta}_1 = 4.$$
You then add 10 to the y data. That is, you create $y^* = y + 10$.
What will be the regression coefficients when you estimate the model $y^* = \alpha_0 + \alpha_1 x + w$? That is, what will be $\hat{\alpha}_0$ and $\hat{\alpha}_1$?

$\hat{\alpha}_0 = 13$ and $\hat{\alpha}_1 = 4$. Adding a constant to every value of y shifts the fitted line up by that constant without changing its slope, so only the intercept changes (3 + 10 = 13).

11 (5 points) Short answer

There is a class of models sometimes referred to as “polychotomous dependent variable models” that can be used to analyze data when there are a number of possible discrete outcomes. Match the models listed below to the appropriate example.

Model               Use
Ordered probit      transportation choice
Multinomial logit   number of bankruptcies
Poisson model       bond rating
                    height

Ordered probit matches to bond rating; Multinomial logit matches to transportation choice; Poisson model matches to number of bankruptcies.

12 (10 points) Short answer

Explain the difference between censored data and truncated data. Use the example of tickets to a sporting event (treat this as the y-variable) as an example of one of these two types of data. Use your own example of the other type of data.

“Censoring” of data is when some part of the data is hidden from us: we cannot observe y data beyond certain bounds (above or below some cut-point). We do, however, observe the x data corresponding to these censored values of y. Truncated data, on the other hand, exist when we do not observe anything at all when y is above or below the bound. We do not observe y and we do not observe x.

The demand for tickets sold to a sporting event is an example of censored data. When there is a sellout, we do not observe what demand might have been. All we observe is that there was a sellout, so recorded demand is equal to the maximum capacity of the stadium. We observe other things that might help drive demand, such as the record of the team, the population and income of the home city, the opponent, etc. (Another very similar example here: http://link.springer.com/article/10.1007%2FBF02031947). Another example might be constructed by talking about “top-coding” of survey data (see Wooldridge if you don’t recall what top-coding is).
An example of a truncated sample exists when a whole observation is missing due to the y variable being at or beyond a bound. An example of this might be survey data on poverty when observations are not collected on individuals with income above the poverty line. No data is collected when y is above this bound.

13 (22 points)

You are interested in estimating the impact of a household head’s level of education on the food security status of his or her household. (Roughly stated, the issue of food security concerns whether a person or household can count on having access to a sufficient number of calories.) You have data from a 2008 household survey run in Senegal. The survey is nationally representative and includes over 20,000 households and the following variables:

varname      description
foodSecure   dummy for households whose monthly income lies above the food poverty line (the line at which the household would achieve 2100 calories per person per day)
educHd       years of education of household head
rural        dummy if household is situated in a rural area; zero otherwise
employHd     dummy if household head is currently employed
members      number of individuals living in the household
kids         number of children in the household
nMale        number of men in the household
nFem         number of women in the household

(a) Write down a model that you would run using OLS. What is the marginal effect of educHd on the food security status of the household in this model? (3 points)

A simple linear probability model would be written
$$\text{foodSecure} = \beta_0 + \beta_1 \text{educHd} + \beta_2 \text{rural} + \beta_3 \text{employHd} + \beta_4 \text{members} + \beta_5 \text{kids} + \beta_6 \text{nMale} + u$$

There are several ways that you could have written this model that I would have considered reasonable. Although the question only tells you explicitly that you would be interested in the relationship between foodSecure and educHd, I would be surprised if you didn’t include some of the other variables, as they seem to sensibly belong in the DGP of foodSecure.
With regard to the four variables members, kids, nMale, and nFem, I could see you including each of them on the assumption that they each measure unique things, or I could see you omitting one on the assumption that nMale was the number of ADULT males, nFem was the number of ADULT females, and thus kids + nMale + nFem = members, i.e. there would be perfect collinearity if you included all of the variables. But this isn’t what the question was about. Just saying.

The part about interpretation should be straightforward: the marginal effect of educHd on the food security status of the household is given by $\beta_1$.

(b) Specify an alternative model that you would run using maximum likelihood estimation (you need not write down a specific functional form based on a specific distribution; you may use the notion of a generic function “G” to represent the functional form you have named using words). How would you obtain the marginal effect of educHd on the food security status of the household in your new model? (3 points)

One alternative model that we could run using ML estimation is the logit. The other is a probit. Using the direct analog of the model above I would write the alternative model as
$$\text{foodSecure} = G(\beta_0 + \beta_1 \text{educHd} + \beta_2 \text{rural} + \beta_3 \text{employHd} + \beta_4 \text{members} + \beta_5 \text{kids} + \beta_6 \text{nMale})$$

To get the marginal effect of educHd on the food security status we would need to take a derivative (or have Stata do it for us). No need to actually take the derivative to get points. As long as you realized that the coefficient itself is not the marginal effect (and said so), I’m good with the answer. If you want to get all fancy, use the generic “G” function and show me you know how to apply the chain rule:
$$G'(\text{stuff})\,\beta_1$$

(c) What are the pros and cons of model (a) vs. model (b)? Name all that you can. (3 points)

There are 4 pros/cons that we talked about. If you got 3, I was happy.
The cons of the LPM are most prominent: heteroskedastic errors; predictions above 1 and below 0; poor fit to the data; constant marginal effects that might not make sense in this case. The virtues are simplicity and ease of interpretation of the coefficients.

(d) A colleague suggests that your estimate of the effect of household head’s level of education suffers from omitted variable bias since you do not have a measure of household income. Describe the circumstances under which your colleague would be correct. Give an example and be specific about the nature of the bias that you would expect. (5 points)

Hey colleague, go get your own problems. There are two ways you could go with this problem, either of which is fine with me: (1) you could argue that your colleague has a point, because the omitted variable is correlated with both food security and education; (2) you could argue that our food security variable is simply a noisy (i.e. discrete) version of household income, and thus observing household income would perfectly predict our y-variable.

Option 1: I assume that household income is positively correlated both with food security and with education. If this is so, then we would expect education to get too much credit for improving food security in a regression that omitted household income. That is, we expect a positive bias.

Option 2: I don’t find the claim very compelling. If we could measure household income we might be able to get a finer measure of food security, in the sense that we could measure more directly the ability of individuals to afford food. However, the dependent variable that we DO have is a coarse measure of income. Having a finer measure would not cause us to put this variable on the right hand side of our equation. I think the colleague is a bit of a jerk.

(e) Your colleague suggests using the education level of the household head’s father as an instrumental variable for household head’s level of education. Do you believe that this instrument is valid?
Explain, being as specific as possible. (8 points)

The issue of validity is strictly concerned with whether or not the instrument is correlated with the endogenous regressor (whether or not educHF is correlated with educHd) and simultaneously uncorrelated with the source of the endogeneity. Therefore you can answer this question regardless of how you answered the previous part. In our case, the source of the alleged endogeneity is an omitted variable: inc. The primary argument is whether educHF is uncorrelated with household income. Does this seem likely? In my opinion, the argument is tenuous. I would accept either answer if it was well argued (welcome to the ambiguity that is advanced econometrics!).

An argument that stated that the instrument was invalid might go something like this: the household head’s father’s level of education was a determinant of his children’s opportunities, and thus of their education and future income.

An argument that stated that the instrument was valid might go something like this: the household head’s father’s level of education is a determinant of his children’s education, as an educated household is more likely to foster an environment in which education is valued; however, more education in the parent generation does not lead to income in the child generation, so educHF is likely to be uncorrelated with inc.

14 (5 points) Short answer

We estimate a regression of y on x:
$$y = \beta_0 + \beta_1 x + u,$$
and obtain estimates $\hat{\beta}_0$ and $\hat{\beta}_1$, and an R-squared coefficient of 0.24. If you were to change the units of x by multiplying x by 10, what is the new R-squared? Explain why your answer makes intuitive sense.

If you change the units of (any number of) variables in the equation you do nothing to change the fundamental relationship. The R-squared is exactly the same as it was (0.24). Try it for yourself using any regression you like in order to prove it to yourself.
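One way to try it: the short Python sketch below is an illustration only, not part of the course materials. It uses simulated data (the sample size, noise level, and the helper name ols_r2 are all assumptions made for this example) and NumPy's least-squares routine. It rescales x by 10 as in question 14, and also adds 10 to y as in question 10.

```python
import numpy as np

def ols_r2(x, y):
    """Fit y = b0 + b1*x by least squares; return (b0, b1, R-squared)."""
    X = np.column_stack([np.ones_like(x), x])   # constant plus regressor
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sst = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - (resid @ resid) / sst
    return beta[0], beta[1], r2

# Simulated data (hypothetical; any data set would do).
rng = np.random.default_rng(0)
x = rng.normal(size=500)
y = 3.0 + 4.0 * x + rng.normal(scale=7.0, size=500)

b0, b1, r2 = ols_r2(x, y)                    # original units
b0_s, b1_s, r2_s = ols_r2(10.0 * x, y)       # x multiplied by 10
b0_p, b1_p, r2_p = ols_r2(x, y + 10.0)       # y shifted up by 10

print("R-squared, original x:", r2)
print("R-squared, x times 10:", r2_s)
print("slope, original vs rescaled x:", b1, b1_s)
print("intercept, original vs shifted y:", b0, b0_p)
```

In this sketch the slope on the rescaled x is one tenth of the original slope and the R-squared is unchanged; shifting y moves only the intercept.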
15 (10 points) Short answer

Suppose that you are attempting to build a model that explains aggregate savings behavior in the United States as a function of the level of interest rates. Would you rather construct your data sample during a period of fluctuating interest rates or a period in which interest rates were relatively constant? Why?

More variation! We want more variation! You would like to sample during a period of fluctuating interest rates, as the relationship between movements in interest rates and movements in aggregate savings is most easily observed when . . . you guessed it, interest rates are moving. More variation in an independent variable (a larger $SST_j$) reduces the variability of the coefficient estimate:
$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)}.$$

16 (15 points) Short answer

The Environmental Kuznets Curve (EKC) is a theory relating income and productivity to environmental conditions. The theory (roughly) goes something like this. Environmental conditions degrade as income increases, because the act of producing lots of goods is taxing on the environment. At some point of development, however, the relationship changes. As people become richer, they start to develop preferences for a cleaner environment. (When you are poor, you’d rather have food and shelter and smog than no food, no shelter, and no smog, but when you nail down the food and shelter thing, you prefer less smog to a fancier shelter). The (rough) shape of a theoretical EKC relationship is given in the figure below.

[Figure: theoretical EKC curve; axes labeled Pollution and GDP.]

We gather data on sulfur-dioxide emissions and GDP in 100 countries and estimate an environmental Kuznets curve model
$$SO_2 = \beta_0 + \beta_1 GDP + \beta_2 GDP^2 + u,$$
where $SO_2$ represents sulfur-dioxide emissions, $GDP$ is, well, GDP, and $GDP^2$ is GDP-squared. Write down the hypothesis test (or tests) that you would use to help you determine whether the data is consistent with the EKC theory.
That is, choose the hypothesis test or tests that you feel would help you falsify the EKC theory, or, should you not reject the null, support it.

The EKC would be fulfilled if the slope on $GDP$ ($\beta_1$) were positive, indicating a positive relationship between $GDP$ and $SO_2$, and the slope on $GDP^2$ ($\beta_2$) were negative, indicating that at some point as $GDP$ increases, $SO_2$ decreases. Other answers are possible, but that would be my answer. To write this in terms of formal null and alternative hypotheses, I would express the above in the following way:
$$H_0: \beta_1 = 0,\ \beta_2 = 0 \qquad H_A: H_0 \text{ not true}$$
I would use an F-test to test whether the two coefficients were jointly non-zero, since the EKC requires both coefficients to simultaneously be non-zero. If you wanted to test each coefficient individually, you might do so using a pair of t-tests. The hypotheses (with one-sided alternatives) appropriate for such a pair of nulls are below.
$$H_0: \beta_1 = 0 \qquad H_A: \beta_1 > 0$$
$$H_0: \beta_2 = 0 \qquad H_A: \beta_2 < 0$$

17 (15 points) Short answer

We collect variables y (the dependent variable), $x_1$, and $x_2$. If we run a multiple regression of y on $x_1$ and $x_2$,
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u,$$
we obtain estimated coefficients $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$. The estimated slope coefficients $\hat{\beta}_1$ and $\hat{\beta}_2$ are interpreted as independent effects. That is, $\hat{\beta}_1$ is interpreted to be the effect of $x_1$ on y, independent of the effect of $x_2$ on y. And vice-versa. Another way to say the same thing is that $\hat{\beta}_1$ is the effect of $x_1$ on y, controlling for the effect of $x_2$ on y.

Use what is sometimes referred to as the “partialling-out procedure” to explain what, exactly, is meant by “independent effect” or “control.” That is, demonstrate what $\hat{\beta}_1$ represents in the multiple-regression context by describing a series of bivariate regressions (regressions with one dependent variable and one independent variable).
You should be able to arrive at the results of a bivariate regression in which one of the coefficient estimates is exactly equal to $\hat{\beta}_1$ in the multiple regression above. Explain how this procedure highlights the “independent effect” interpretation of $\hat{\beta}_1$.

[Hint: you don’t need to remember the “partialling-out” procedure to complete this question. The $\hat{\beta}_1$ coefficient in the multiple regression above represents the independent effect of $x_1$ on y, when the effect of $x_2$ on y has already been accounted for. Use this knowledge to reconstruct $\hat{\beta}_1$ with a series of bivariate regressions involving y, $x_1$, and $x_2$.]

This question is a gimme if you remember the procedure we went over in class. I tried to describe the procedure as directly as I could without revealing the exact steps. I didn’t receive any clarifying questions, so I assume everyone knew what I was talking about. Thus, the explanation is what really matters here. You want to make clear that, essentially, you understood the point of the exercise! Here is what I would give for an answer:

$\hat{\beta}_1$ is an estimate of the independent effect of $x_1$ on y. To replicate this estimate using a series of bivariate regressions, we ultimately want to regress y on the part of $x_1$ that is independent of $x_2$. We need to isolate the variation in $x_1$ that influences y, but is not related to variation in $x_2$. $x_1$ can be separated into two parts: parts that are correlated with $x_2$ and parts that are not. To find the part that is correlated with $x_2$, we regress $x_1$ on $x_2$:
$$x_1 = \alpha_0 + \alpha_1 x_2 + v$$
and obtain estimated coefficients $\hat{\alpha}_0$ and $\hat{\alpha}_1$. We use these estimated coefficients to obtain predictions of $x_1$:
$$\hat{x}_1 = \hat{\alpha}_0 + \hat{\alpha}_1 x_2.$$
These predictions represent the variation in $x_1$ that is explained by $x_2$. That is, the prediction $\hat{x}_1$ represents the variation in $x_1$ that is related to variation in $x_2$.
If $x_1$ is a variable that contains all the variation, and $\hat{x}_1$ is a variable that contains only the variation related to variation in $x_2$, then $x_1 - \hat{x}_1$ is a variable that contains only the variation not related to variation in $x_2$. We have a name for this variable. This variable, $x_1 - \hat{x}_1$, is exactly the residual from the regression of $x_1$ on $x_2$. Let’s call it $r_1$. When we then regress y on $r_1$,
$$y = \delta_0 + \delta_1 r_1 + w,$$
we get estimated coefficients $\hat{\delta}_0$ and $\hat{\delta}_1$. The estimated coefficient $\hat{\delta}_1$ represents the effect of $r_1$ on y. $r_1$ is a variable that contains all the variation in $x_1$ that is not explained by $x_2$. Therefore, $\hat{\delta}_1$ is equal to $\hat{\beta}_1$ from the original multivariate regression.

This procedure highlights the fact that $\hat{\beta}_1$ is an estimate of the effect of $x_1$ on y, with the effect of $x_2$ removed entirely. This is what is meant by independent effect.

18 (15 points) Short answer

“Omitted variable bias” is a term that refers to a situation when omitting an important independent variable (an “x-variable”) from a regression equation causes us to mis-estimate the effect of another independent variable that is included in the regression. Explain why omitted variable bias is an example of endogeneity. (Endogeneity is when an independent variable is correlated with the unobservable, or “error,” term in a regression). Use the model below when constructing your explanation.
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$$
Explain how omitting $x_2$ causes $x_1$ to be endogenous (i.e. correlated with the unobservable in a regression of y on $x_1$). (No algebra required! The answer can be given in an intuitive discussion. If you find it easier to work with a real-variable example rather than generic x and y variables, that is fine.)

[Hint: If you don’t know how to get started, work out how omitting $x_2$ would cause omitted variable bias in your estimation of $\beta_1$ (the relationship between y and $x_1$) under different scenarios describing the relationship between y, $x_1$, and $x_2$.]
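As a numerical aside on question 17, the partialling-out procedure can be sketched in a few lines of Python. This is a minimal illustration on simulated data, not the procedure as presented in class: the coefficient values, sample size, and the helper name ols are assumptions made for the example.

```python
import numpy as np

def ols(X, y):
    """Least-squares coefficients of y on the columns of X (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Simulated data (hypothetical values): x1 is correlated with x2.
rng = np.random.default_rng(1)
n = 1000
x2 = rng.normal(size=n)
x1 = 0.6 * x2 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.5 * x2 + rng.normal(size=n)
const = np.ones(n)

# Multiple regression of y on x1 and x2; beta_hat[1] is the slope on x1.
beta_hat = ols(np.column_stack([const, x1, x2]), y)

# Step 1: regress x1 on x2 and keep the residual r1 = x1 - x1_hat.
alpha_hat = ols(np.column_stack([const, x2]), x1)
r1 = x1 - (alpha_hat[0] + alpha_hat[1] * x2)

# Step 2: bivariate regression of y on r1; delta_hat[1] is the slope on r1.
delta_hat = ols(np.column_stack([const, r1]), y)

print("slope on x1 in the multiple regression:", beta_hat[1])
print("slope on r1 in the bivariate regression:", delta_hat[1])
```

The two printed slopes agree up to floating-point error, which is the equality between $\hat{\delta}_1$ and $\hat{\beta}_1$ described in the answer to question 17.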
Our task is to explain how leaving $x_2$ out of a regression causes $x_1$ to be related to the error term. You can get this directly from the statement of the problem, where I gave you the definition of endogeneity. If the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$, but we regress
$$y = \beta_0 + \beta_1 x_1 + w$$
instead, then we are ignoring the impact of $x_2$. Doing this means that when $x_2$ moves, causing y to move, we have no way to separate that movement from the independent effect of $x_1$ on y (see previous question). When $x_2$ moves, we assign all the credit (or “blame,” as the case may be) to $x_1$.

Consider the regression
$$y = \beta_0 + \beta_1 x_1 + w.$$
All of the variation in y that cannot be explained by $x_1$ is contained in the unobservable term w. What is in w? Well, if $x_2$ really matters, then the effect of $x_2$ is in w. In fact, if the true model is $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$, then w is equal to all the stuff we left out of the equation:
$$y = \beta_0 + \beta_1 x_1 + (\beta_2 x_2 + u).$$
That is, $w = \beta_2 x_2 + u$. We get omitted variable bias when $x_1$ and $x_2$ are correlated. If $x_1$ and $x_2$ are correlated, then $x_1$ and w are correlated, since w contains $x_2$. Omitting $x_2$ causes $x_1$ to be correlated with the error term w, which is another way of saying that $x_1$ is endogenous.
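The argument can also be checked with a small simulation. The Python sketch below is illustrative only: the true coefficients, the strength of the correlation between x1 and x2, and the sample size are all assumed values chosen for the example. It builds w = β2·x2 + u, shows that x1 is correlated with it, and compares the slope on x1 with and without x2 in the regression.

```python
import numpy as np

# Simulated data (hypothetical values chosen for illustration).
rng = np.random.default_rng(2)
n = 100_000
beta1, beta2 = 2.0, 3.0

x2 = rng.normal(size=n)
x1 = 0.5 * x2 + rng.normal(size=n)      # x1 and x2 are positively correlated
u = rng.normal(size=n)                  # u is unrelated to x1 and x2
y = 1.0 + beta1 * x1 + beta2 * x2 + u   # true model

# Error term of the regression that omits x2: it contains x2.
w = beta2 * x2 + u
corr_x1_w = np.corrcoef(x1, w)[0, 1]

# Short regression: y on a constant and x1 only.
X_short = np.column_stack([np.ones(n), x1])
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

# Long regression: y on a constant, x1, and x2.
X_long = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]

print("correlation between x1 and w:", corr_x1_w)
print("slope on x1, x2 omitted:", b_short[1])
print("slope on x1, x2 included:", b_long[1])
```

With these assumed values x2 has a positive effect on y and is positively correlated with x1, so the slope from the short regression overstates the true value of 2: x1 receives credit for movements in y that are really due to x2.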