Econ107 Applied Econometrics
Topic 5: Specification: Choosing Independent Variables (Studenmund, Chapter 6)

Specification errors that we will deal with: wrong independent variables; wrong functional form. This lecture deals with wrong independent variables, which may be due to (i) omitted (relevant) variables or (ii) redundant (irrelevant) variables.

Use the following example for both types:

  lnW_i = β0 + β1 S_i + β2 OJT_i + ε_i

where
  W_i   = wage rate of worker i
  S_i   = years of formal education of worker i
  OJT_i = effective years of on-the-job training of worker i

The idea is that we have two forms of human capital: general human capital obtained through formal education and specific human capital obtained through vocational education, apprenticeship programmes, etc. Both may increase wages (i.e., β1 > 0 and β2 > 0), but not at the same rate (i.e., β1 ≠ β2).

I. Omitting a Relevant Variable

One of the most common problems in regression analysis. It could be due to the ignorance of the researcher (i.e., the variable is available but not used). More likely, the data are unavailable (e.g., in the Household Economic Survey). Suppose we estimate the following model instead:

  lnW_i = β0 + β1 S_i + ε_i*

so that the true error in this regression is

  ε_i* = β2 OJT_i + ε_i

Assumption 2 does not hold because E(ε_i*) = β2 OJT_i ≠ 0. More importantly, when OJT and S are correlated, Assumption 3 does not hold because Cov(ε_i*, S_i) ≠ 0. As a result, the Gauss-Markov theorem does not apply. In general, the OLS estimate of the regression coefficient is biased, i.e.,

  E(β̂1*) ≠ β1

and the bias is

  bias(β̂1*) = E(β̂1*) − β1 = β2 b12

where

  b12 = Cov(S_i, OJT_i) / Var(S_i)

Suppose that b12 > 0. Then (with β2 > 0)

  E(β̂1*) > β1

and the estimated coefficient is biased upward. The bias is zero when the coefficient of the omitted variable is zero or when the included and omitted variables are uncorrelated. In addition, the standard errors of these estimated coefficients will be biased.
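The bias formula can be checked with a small simulation. This is a hypothetical sketch: the true coefficients, the dependence of OJT on S, and all distributional choices are made-up numbers, chosen only so that b12 > 0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Made-up 'true' parameters for illustration
beta1, beta2 = 0.08, 0.05

# S and OJT positively correlated by construction, so b12 > 0
S = rng.normal(12, 2, n)
OJT = 0.3 * S + rng.normal(0, 1, n)
lnW = 1.0 + beta1 * S + beta2 * OJT + rng.normal(0, 0.1, n)

# Misspecified model: regress lnW on S only, omitting OJT
X = np.column_stack([np.ones(n), S])
b_hat = np.linalg.lstsq(X, lnW, rcond=None)[0]

# Theoretical bias: beta2 * b12, where b12 = Cov(S, OJT) / Var(S)
b12 = np.cov(S, OJT)[0, 1] / np.var(S, ddof=1)
print(b_hat[1])             # slope from the misspecified model
print(beta1 + beta2 * b12)  # beta1 plus the predicted bias; the two agree
```

With these made-up values, b12 ≈ 0.3, so the estimated slope converges to roughly 0.08 + 0.05 × 0.3 = 0.095 rather than the true 0.08: an upward bias, exactly as the formula predicts.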
In the misspecified model:

  Var(β̂1*) = σ² / Σ s_i²

where s_i = S_i − S̄. But the variance of the 'true' estimator is:

  Var(β̂1) = σ² / [Σ s_i² (1 − r12²)]

where r12 is the correlation coefficient between S and OJT. This means that if r12² > 0, then

  Var(β̂1*) < Var(β̂1)

The variance of the estimated coefficient is also biased: we are placing 'too much' confidence in our coefficient estimates. The result is that the t-test will be misleading (this is true even if r12 = 0, because our estimate of σ² will also be biased).

The remedial measure is easy IF we know which variable has been omitted and the omitted variable is available: include it in the model. If the omitted variable is not available, we might try to find a proxy variable that is closely related to the missing variable (e.g., use information on the average OJT of people in a particular industry and occupation). Or, at least, sign the direction of the bias and estimate its potential magnitude.

The above remedy works in theory. In practice, it is sometimes difficult to know whether a variable has been omitted. To detect the problem of omitting a relevant variable, one common practice is to examine the signs of the estimated coefficients and see whether they meet our expectations or economic theory. If not, it is very likely that relevant variables have been omitted. The next step is to use the direction of the bias to look for the relevant variables.

II. Including an Irrelevant Variable

Suppose the true model does not contain OJT_i. This is consistent with some theoretical models predicting that this type of human capital will not affect wages because employers are more likely to pay for it. Thus, the correct regression model is:

  lnW_i = β0 + β1 S_i + ε_i

but we estimate:

  lnW_i = β0 + β1 S_i + β2 OJT_i + ε_i**

The problems here are less severe than those from omitting a relevant variable. The true error in the estimated regression is

  ε_i** = ε_i − β2 OJT_i

If OJT is irrelevant, β2 is zero and hence Assumption 2 holds. Assumption 3 holds too.
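The two variance formulas above can be compared directly. This is a hypothetical sketch: the values of σ², Σs_i², and r12 are made-up numbers chosen only to illustrate the inequality.

```python
import numpy as np

sigma2 = 0.04   # made-up error variance
sum_s2 = 500.0  # made-up sum of squared deviations of S
r12 = 0.6       # made-up correlation between S and OJT

# Variance reported by the misspecified model (OJT omitted)
var_misspecified = sigma2 / sum_s2
# Variance of the 'true' estimator (OJT included)
var_true = sigma2 / (sum_s2 * (1 - r12**2))

print(var_misspecified)  # 8e-05
print(var_true)          # 0.000125
# Whenever r12**2 > 0 the misspecified formula understates the variance,
# so the reported standard errors are too small and t-ratios too large.
```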
What are the properties of the OLS estimates?

(i) The estimated coefficients are unbiased and consistent:

  E(β̂1*) = β1

(ii) The t-test is valid if the correct standard error is used.

(iii) The only problem is that the estimated coefficients are inefficient. Under the 'false' model:

  Var(β̂1*) = σ² / [Σ s_i² (1 − r12²)]

Under the 'true' model:

  Var(β̂1) = σ² / Σ s_i²

Since Var(β̂1) < Var(β̂1*) when r12² > 0, we are placing 'too little' confidence in our coefficient estimates (i.e., the standard error of the estimated coefficient is larger than it should be). This makes the t-ratio smaller than it should be and makes it more likely that we will fail to reject the null when we should.

This is an easy one to solve in theory: if the variable should not be in the regression, eliminate it from the outset. But in practice this is not so easy. The theory in this example says that both specifications might be right. If an independent variable may be relevant, include it.

III. How to Decide Whether to Include a Variable or Not?

1. Graphic method to detect the problem of omitting a relevant variable

Plot the residuals and look for a 'distinct pattern'. Take the earlier example on the functional form of the regression. We estimate:

  lnW_i = β0 + β1 S_i + ε_i*

but the 'true' model is:

  lnW_i = β0 + β1 S_i + β2 S_i² + u_i

so that

  ε_i* = β2 S_i² + u_i

A plot of the residuals against S_i would produce a 'detectable' pattern (i.e., curved downward).

2. Four criteria

- Economic theory: is there any sound theory?
- Student's t-statistic: is it significant in the correct direction?
- Has R̄² improved?
- Do other coefficients change sign when the variable is included?

Include the variable if the answers are positive. Don't necessarily drop insignificant variables: an insignificant finding can be an important result.
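The graphic method above can be sketched as follows: fit the misspecified linear model when the true relationship is quadratic, and check the residuals for curvature. This is a hypothetical sketch; all parameter values are made up, with β2 < 0 so the residual pattern bends downward.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000

S = rng.uniform(8, 18, n)
# Made-up 'true' quadratic model (diminishing returns to schooling)
lnW = 1.0 + 0.20 * S - 0.005 * S**2 + rng.normal(0, 0.05, n)

# Estimate the misspecified linear model lnW = b0 + b1*S
X = np.column_stack([np.ones(n), S])
b = np.linalg.lstsq(X, lnW, rcond=None)[0]
resid = lnW - X @ b

# OLS residuals are uncorrelated with S by construction...
corr_S = np.corrcoef(S, resid)[0, 1]
# ...but strongly related to the part of S**2 not explained by S,
# which is exactly the curved pattern one would see in a residual plot
S2_perp = S**2 - np.polyval(np.polyfit(S, S**2, 1), S)
corr_S2 = np.corrcoef(S2_perp, resid)[0, 1]
print(corr_S)   # ~ 0
print(corr_S2)  # clearly negative: the plot curves downward
```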
Example:

  Ĉoffee = 9.1 − 7.8 Pbc + 2.4 Pt + 0.0035 Yd
                (15.6)    (1.2)    (0.001)
  t =           −0.5      2.0      3.5
  R̄² = 0.60, n = 25

where
  Coffee = demand for Brazilian coffee in the US
  Pbc    = price of Brazilian coffee
  Pt     = price of tea
  Yd     = disposable income in the US

What happens if you drop Pbc?

  Ĉoffee = 9.3 + 2.6 Pt + 0.0036 Yd
                (1.0)    (0.0009)
  t =           2.6      4.0
  R̄² = 0.61, n = 25

What happens if you add another variable, the price of Colombian coffee, Pcc?

  Ĉoffee = 10.0 + 8.0 Pcc − 5.6 Pbc + 2.6 Pt + 0.0030 Yd
                 (4.0)     (2.0)     (1.3)    (0.001)
  t =            2.0       −2.8      2.0      3.0
  R̄² = 0.65, n = 25

3. Three incorrect techniques for choosing variables

1) Data mining: simultaneously try a whole series of possible regression formulations and then choose the equation that conforms most closely to what the researcher wants the results to look like. Doing econometrics = making sausages.

2) Stepwise regression: a systematic way of selecting variables based on R̄². The computer program is given a "shopping list" of possible independent variables and then builds the equation in steps: at each step it adds the variable that increases R̄² the most. Problem: the independent variables could be correlated.

3) Sequential specification search: add and drop variables sequentially (i.e., estimate an undisclosed number of regressions) but present only the final choice as if it were the only specification estimated. Every time you test a model, there is a chance of a Type I error; if you estimate and test too many models, Type I errors accumulate.

IV. Lagged Independent Variables

Consider the following regressions:

  Y_t = β0 + β1 X1_t + β2 X2_t + ε_t       (1)
  Y_t = β0 + β1 X1_{t−1} + β2 X2_t + ε_t   (2)

where t = 1, …, n; that is, we have a sample of n time-series observations. Note the change of subscript from i to t to emphasise time-series data. In equation (1), the effect of X1 on Y is instantaneous. In equation (2), the effect is felt one period later. As long as X1 is exogenous (not influenced by Y), the lagged structure of the equation poses no problem. Of course, the interpretation of the slope coefficient is different.
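Estimating equation (2) amounts to aligning the lagged regressor with the dependent variable and dropping the observation for which no lag exists. A minimal sketch, using made-up series and made-up true coefficients:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 2000

# Made-up exogenous regressors; X1 affects Y with a one-period lag
X1 = rng.normal(size=T)
X2 = rng.normal(size=T)
eps = rng.normal(scale=0.1, size=T)

Y = np.empty(T)
Y[0] = np.nan  # no value of X1[t-1] exists for t = 0
Y[1:] = 2.0 + 0.5 * X1[:-1] + 0.3 * X2[1:] + eps[1:]

# Align: regress Y_t on X1_{t-1} and X2_t over t = 1, ..., T-1
X = np.column_stack([np.ones(T - 1), X1[:-1], X2[1:]])
b = np.linalg.lstsq(X, Y[1:], rcond=None)[0]
print(b)  # approximately [2.0, 0.5, 0.3]
```

The only mechanical cost of the lag is losing the first observation; the slope on X1[:-1] is then interpreted as the effect of last period's X1 on this period's Y.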
V. Akaike's Information Criterion and Schwarz Criterion

In general, the more variables included in the regression, the smaller the RSS will be. But if a variable contributes only marginally to the reduction of the RSS, it should not be included. The AIC and SC (also known as the BIC) measure the RSS with a penalty for additional parameters. For a regression model with K slope coefficients they are defined as:

  AIC = ln(RSS/n) + 2(K + 1)/n
  SC  = ln(RSS/n) + ln(n)(K + 1)/n

You may select the model that minimises the AIC or SC. These are called model selection criteria. Note that R̄² is also a model selection criterion: you choose the model that maximises R̄². Compared with the AIC or SC, R̄² tends to select a model with irrelevant variables.

VI. Questions for Discussion: Q6.3, Q6.9

VII. Computing Exercise: Q6.5 (Johnson, Ch 6), Q6.15, Johnson Ch 6: AIC
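The AIC and SC formulas in Section V can be computed directly from the RSS of each candidate model. A minimal sketch comparing two nested models on made-up data in which X2 is irrelevant by construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200

X1 = rng.normal(size=n)
X2 = rng.normal(size=n)  # irrelevant: does not enter the true model
Y = 1.0 + 0.8 * X1 + rng.normal(scale=0.5, size=n)

def aic_sc(Y, X):
    """AIC and SC for an OLS fit; X includes the intercept column."""
    b = np.linalg.lstsq(X, Y, rcond=None)[0]
    rss = np.sum((Y - X @ b) ** 2)
    K = X.shape[1] - 1  # number of slope coefficients
    aic = np.log(rss / n) + 2 * (K + 1) / n
    sc = np.log(rss / n) + np.log(n) * (K + 1) / n
    return aic, sc

small = np.column_stack([np.ones(n), X1])
big = np.column_stack([np.ones(n), X1, X2])
print(aic_sc(Y, small))
print(aic_sc(Y, big))
# The model with the smaller criterion value is preferred; adding the
# irrelevant X2 barely lowers the RSS but raises the penalty term.
```

Note that for n > e² the SC penalty ln(n)(K+1)/n exceeds the AIC penalty 2(K+1)/n, which is why the SC tends to choose the more parsimonious model.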