Lecture 14: Regression specification, part I
BUEC 333, Professor David Jacks 1

All the way back in Lecture 11, we set out the six assumptions of the CLRM. We also noted that when these six assumptions are satisfied, the least squares estimator is BLUE. Given the prominence of the OLS estimator in empirical applications, we would like to know whether or not the classical assumptions are satisfied in any given application.
Violating the classical assumptions 2

In the remaining weeks of the term, we will consider in more detail:
1.) what happens to the OLS estimator when the classical assumptions are violated
2.) how to test for these violations
3.) what to do about it if we detect a violation
Violating the classical assumptions 3

Assumption 1 of the CLRM:
1.) the model has the correct functional form:
a.) the regression function is linear in the coefficients, E(Yi|Xi) = β0 + β1Xi
b.) we have included all the correct X's, and
c.) we have applied the correct transformation of Y and the X's
Violating Assumption 1 4

We have already noted that Assumption 2 is a pretty weak assumption, as OLS will automatically correct for this. We also rely on theory and intuition to tell us that the regression is linear in the coefficients; if it is not, we can estimate a non-linear regression model. So we begin by considering the case where our regression model is incorrectly specified and specification errors are a problem.
Violating Assumption 1 5

Every time we estimate a regression model, we make some very important choices (unfortunately, sometimes without even thinking about it):
1.) What independent variables belong in the model?
2.) What functional form should the regression function take (e.g., logarithms, quadratic, …)?
Specification 6

The fact that we have a choice is both good and bad. On the good side, it represents flexibility (often needed in the "real world" of data analysis). On the bad side, it raises the prospect of searching over specifications until we get results that prove our point.
Specification 7

Again, we look to theory and intuition to guide us, but more often than not, justifying our decisions is an exercise in persuasion. The particular model that we decide to estimate (and interpret) is the culmination of these choices: we call it a specification. A regression specification consists of the model's independent variables and the functional form they take.
Specification 8

It is convenient to think of there being a right answer to each of the questions from before. That is, a correct specification does indeed exist. Sometimes, we refer to this as the data generating process (DGP), the true model that "generates" the data we actually observe.
Specification error 9

One way to think about regression analysis then is that we want to learn about the DGP, given the data at hand (i.e., our sample). A regression model that differs from the DGP is an incorrect specification…leading us to say that the regression model is mis-specified. Again, an incorrect specification arises from making the wrong choices about which variables to include or what functional form to use.
Specification error 10

We will begin by talking about the choice of which independent variables to include in the model and the types of errors we can make:
1.) We can exclude (leave out) one or more important independent variables that should be in the model.
2.) We can include irrelevant independent variables that should not be in the model.
Choosing the independent variables 11

3.) We can choose a specification that confirms what we hoped to find without relying on theory or intuition to specify the model.
In what follows, we discuss:
a.) the consequences of each of these kinds of specification error and
b.) how to detect and correct for them.
Choosing the independent variables 12

Suppose the true DGP is:
Yi = β0 + β1X1i + β2X2i + εi
But instead, we estimate the regression model:
Yi = β0 + β1X1i + εi*
Think of this in terms of a regression where Y is earnings, X1 is education, and X2 is "work ethic".
Omitted variables 13

The question now becomes what are the consequences of omitting the variable X2 from our model…does it mess up our estimate of β1? It definitely messes up our interpretation of β1: with X2 in the model, β1 measures the marginal effect of X1 on Y holding X2 constant…
Omitted variables 14

It is pretty easy to see why leaving X2 out of the model biases our estimate of β1: the error term in the mis-specified model is now εi* = β2X2i + εi. So, if X1 and X2 are correlated, then Assumption 3 of the CLRM is violated: X1 is correlated with εi*. This implies that if X2 changes, so do εi*, X1, and Y.
Omitted variable bias 15

That is, our estimate of β1 measures the effect of X1 and (some of) the effect of X2 on Y; consequently, our estimate of β1 is biased. Returning to the earnings example from before, suppose the true β1 > 0 so that more educated workers earn more.
Omitted variable bias 16

Finally, suppose that Cov(X1, X2) > 0 so that workers with a stronger work ethic acquire more education on average. When we leave work ethic out of the model, β1 measures the effect of both education and work ethic on earnings. If we were very lucky, then Cov(X1, X2) = 0 and there would be no bias.
Omitted variable bias 17

We know that if Cov(X1, X2) ≠ 0 and we omit X2 from the model, our estimate of β1 is biased: E[β̂1] ≠ β1. But is the bias positive or negative? That is, can we predict whether E[β̂1] > β1 or E[β̂1] < β1? In fact, we have shown in Lecture 11 that:
Is the bias positive or negative? 18

Bias = E[β̂1] − β1 = β2 × Cov(X1, X2) / Var(X1)
Therefore:
1.) missing regressors with zero coefficients do not cause bias;
2.) uncorrelated ("non-co-varying") missing regressors do not cause bias
Is the bias positive or negative? 19

Consider our previous example where β2 > 0 and Cov(X1, X2) > 0:
Bias = E[β̂1] − β1 = β2 × Cov(X1, X2) / Var(X1) > 0
Hence, we overestimate the effect of education on earnings…that is, we measure the effect of both having more education and a stronger work ethic.
Is the bias positive or negative? 20

The amount of bias introduced by omitting a variable is equal to the impact of the omitted variable on the dependent variable times a function of the correlation between the omitted and the included variables. In particular, Bias = βomitted × f(rincluded, omitted). From the earnings regression with work ethic, βomitted > 0 and rincluded, omitted > 0, so the bias is positive.
How much bias? 21

So, how do we know if we have omitted an important variable? Before we specify, we must think hard about what should be in the model…what theory and intuition tell us are important predictors of Y. Before we estimate, always predict the sign of the regression coefficients…if we generate a "wrong" sign, that can be a symptom of omitted variable bias.
Detecting and correcting omitted variable bias 22

Next, how do we correct for omitted variable bias? The "easy" way out: just add the omitted variable into the model, but we presumably would have done this in the first place if it were possible. We can also include a "proxy" for the omitted variable instead, where the proxy is something highly correlated with the omitted variable.
Detecting and correcting omitted variable bias 23

What about the opposite problem of including an independent variable in our regression that does not belong there?
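Before turning to that question, the bias formula above is easy to check by simulation. The following is a minimal sketch (in Python with numpy; the code, variable names, and parameter values are illustrative assumptions, not part of the course's Stata materials). It generates data from a DGP with β1 = 1 and β2 = 2 where X1 and X2 are positively correlated, then compares the slope from the short regression of Y on X1 alone with the prediction β1 + β2 × Cov(X1, X2) / Var(X1).

import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# DGP: Y = b0 + b1*X1 + b2*X2 + eps, with X1 and X2 positively correlated
b0, b1, b2 = 1.0, 1.0, 2.0
x2 = rng.normal(size=n)                     # think "work ethic"
x1 = 0.5 * x2 + rng.normal(size=n)          # think "education", correlated with x2
y = b0 + b1 * x1 + b2 * x2 + rng.normal(size=n)

# Short (mis-specified) regression of Y on X1 only: OLS slope = Cov(X1, Y) / Var(X1)
b1_short = np.cov(x1, y)[0, 1] / np.var(x1, ddof=1)

# Omitted variable bias formula: E[b1_hat] = b1 + b2 * Cov(X1, X2) / Var(X1)
b1_predicted = b1 + b2 * np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(f"true b1             = {b1:.3f}")
print(f"short-regression b1 = {b1_short:.3f}")      # biased upward
print(f"formula prediction  = {b1_predicted:.3f}")

With a large sample, the short-regression slope settles near 1.8 rather than the true β1 = 1, exactly as the omitted variable bias formula predicts for these parameter values.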
Suppose now the true DGP is:
Yi = β0 + β1X1i + εi
But the model we estimate is:
Yi = β0 + β1X1i + β2X2i + εi*
Including irrelevant variables 24

If X2 really is irrelevant, then β2 = 0 and:
1.) our estimates of β0 and β1 will be unbiased
2.) our estimate of β2 will be unbiased as we expect it to be zero.
In practice, this requires that Cov(X2, Y) = 0 as β2 = Cov(X2, Y)/Var(X2). Oftentimes, it is possible for X2 and Y to co-vary in a given sample even when X2 is truly irrelevant.
Including irrelevant variables 25

Recall the gravity model in Lecture 13 (n = 149, F(2, 146) = 153.60, R² = 0.6778, adjusted R² = 0.6734, Root MSE = 1.9123; Model SS = 1123.39, Residual SS = 533.90):

lntrade      Coef.       Std. Err.       t     P>|t|    [95% Conf. Interval]
lngdpprod    1.469513    .1002652    14.66     0.000     1.271355    1.667672
lndist      -1.713246    .1351385   -12.68     0.000    -1.980326   -1.446165
_cons       18.09403     2.530274     7.15     0.000    13.09334    23.09473

Now add an indicator for whether a country's name starts with "I" (n = 149, F(3, 145) = 127.16, R² = 0.7246, adjusted R² = 0.7189, Root MSE = 1.7742; Model SS = 1200.86, Residual SS = 456.43):

lntrade      Coef.       Std. Err.       t     P>|t|    [95% Conf. Interval]
lngdpprod    1.533656    .0939199    16.33     0.000     1.348027    1.719285
lndist      -1.65529     .125924    -13.15     0.000    -1.904174   -1.406406
i           -1.748512    .3524703    -4.96     0.000    -2.445156   -1.051869
_cons       16.39695     2.372372     6.91     0.000    11.70806    21.08585

Including irrelevant variables 26

Even in the case of a truly irrelevant independent variable with zero co-variance, there is still the cost imposed by losing a degree of freedom. Thus, we should expect the adjusted R² to fall, as
adjusted R² = 1 − [Σi ei² / (n − k − 1)] / [Σi (Yi − Ȳ)² / (n − 1)],
and less precise estimates in general (larger standard errors, smaller t-statistics) for the other coefficients, as
Var(β̂1) = [Σi ei² / (n − k − 1)] / Σi (Xi − X̄)².
Including irrelevant variables 27

We want to avoid either omitting relevant variables or including irrelevant ones, but which is the greater sin? Omitting relevant variables induces omitted variable bias but has no clear effect on variance. Including irrelevant variables is not as big a problem as OLS is still unbiased, but it is not best.
Irrelevant versus omitted variables 28

At the end of the day, it is up to you as the econometrician to decide what independent variables to include in the model. Naturally, there is a temptation to choose the model that fits "best" or that tells you what you (or your boss or your client) want to hear. Resist this temptation and instead let theory and intuition be your guide! This brings us to data mining as a (potentially) bad practice.
Data mining 29

Data mining (in econometrics) consists of estimating lots of "candidate" specifications and choosing the one whose results we like the most. The problem is that we will discard the specifications:
1.) where the coefficients have the "wrong" sign (not wrong theoretically, but "wrong" in the sense that we do not like the result) and/or
2.) where the coefficients we care about are statistically insignificant.
Data mining 30

So we end up with a regression model where the coefficients we care about have big t-statistics and the "right" signs. But how confident can we really be of our results if we threw away lots of "candidate" regressions? Did we really learn anything about the DGP?
Data mining 31
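To see how little such searched-over results tell us, here is a minimal Monte Carlo sketch (in Python with numpy; the design, sample sizes, and names are illustrative assumptions, not course material). The dependent variable is pure noise, so none of the 20 candidate regressors truly matters; yet if we keep only the specification with the largest |t| statistic, that "best" specification looks statistically significant far more often than 5 percent of the time.

import numpy as np

rng = np.random.default_rng(1)
n, n_candidates, n_trials = 50, 20, 1000
crit = 1.96                      # normal-approximation 5% two-sided critical value

def t_stat(y, x):
    """t statistic on the slope from a simple OLS regression of y on x (with intercept)."""
    xd, yd = x - x.mean(), y - y.mean()
    b = (xd @ yd) / (xd @ xd)
    resid = yd - b * xd
    se = np.sqrt((resid @ resid) / (len(y) - 2) / (xd @ xd))
    return b / se

best_looks_significant = 0
for _ in range(n_trials):
    y = rng.normal(size=n)                       # Y is pure noise: no regressor matters
    X = rng.normal(size=(n, n_candidates))       # 20 irrelevant candidate regressors
    t_best = max(abs(t_stat(y, X[:, j])) for j in range(n_candidates))
    best_looks_significant += t_best > crit

# far above the nominal 5% false-positive rate of any single test
print(f"share of trials where the 'best' specification has |t| > {crit}: "
      f"{best_looks_significant / n_trials:.2f}")

This is exactly the worry raised above: a big t-statistic obtained after a specification search is weak evidence about the DGP.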
Data mining can also be used to explore a data set to uncover empirical regularities that can inform economic theory. This is an inductive versus a deductive approach… There is nothing wrong with this if used appropriately: a hypothesis developed using data mining techniques must then be tested on a sample other than the one used to develop it.
Data mining 32

A common test for specification error is Ramsey's Regression Specification Error Test (RESET). It works as follows:
1.) Estimate the regression you care about (the one you want to test); suppose it has k independent variables and call it Model 1.
The RESET test 33

2.) Obtain the fitted values Ŷi from Model 1.
3.) Regress Y on the k independent variables and on M powers of Y-hat; call this Model 2:
Yi = β0 + β1X1i + β2Ŷi² + β3Ŷi³ + … + βMŶiᴹ + εi
4.) Compare the results of the two regressions using an F-test, where
F = [(RSS1 − RSS2) / M] / [RSS2 / (n − k − M − 1)] ~ F(M, n − k − M − 1)
The RESET test 34

5.) If the test statistic is larger than your critical value, then reject the null hypothesis of a correct specification of the regression model.
What exactly is the intuition here? If the model is correctly specified, then all those functions of Y-hat (and thus, of the independent variables) should not help in explaining Y.
The RESET test 35

The test statistic is large if RSS1 >> RSS2…that is, if the part of the variation in Y that we cannot explain is much larger in Model 1 than in Model 2. This means that all those (non-linear) functions of Y-hat (and thus, of the independent variables) help explain a lot more of the variation in Y. Problem: if you reject the null of correct specification, the test does not tell you what the correct specification actually is.
The RESET test 36

One of the econometrician's primary concerns is ensuring that Assumption 1 of the CLRM is satisfied. Violations occur when we:
1.) exclude an important independent variable
2.) include an irrelevant independent variable
Conclusion 37
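To make the mechanics of the RESET procedure concrete, here is a minimal sketch (in Python with numpy and scipy; the simulated data, the choice of three added powers of Y-hat, and all names are illustrative assumptions, not the Stata commands used in the course). Model 1 is deliberately mis-specified (linear in X when the DGP is quadratic), so the test should reject.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 200, 1

# Simulated data: the true relationship is quadratic in X,
# but Model 1 (the one we "care about") is linear in X.
x = rng.uniform(0, 4, size=n)
y = 1 + 2 * x + 1.5 * x**2 + rng.normal(size=n)

def ols_rss(y, X):
    """OLS via least squares; returns fitted values and the residual sum of squares."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ beta
    return fitted, np.sum((y - fitted) ** 2)

# Model 1: Y on a constant and X
X1 = np.column_stack([np.ones(n), x])
y_hat, rss1 = ols_rss(y, X1)

# Model 2: add M = 3 powers of Y-hat (Y-hat^2, Y-hat^3, Y-hat^4);
# M counts the added terms, matching the numerator df of the F test above
M = 3
X2 = np.column_stack([X1] + [y_hat ** p for p in range(2, 2 + M)])
_, rss2 = ols_rss(y, X2)

# RESET F statistic: ((RSS1 - RSS2) / M) / (RSS2 / (n - k - M - 1))
df2 = n - k - M - 1
F = ((rss1 - rss2) / M) / (rss2 / df2)
p_value = stats.f.sf(F, M, df2)

print(f"F = {F:.2f} with ({M}, {df2}) df, p-value = {p_value:.4f}")

Because Model 1 omits the quadratic term, the added powers of Y-hat soak up much of the unexplained variation, RSS2 falls well below RSS1, and the F statistic is large, so we reject the null of correct specification.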