CHAPTER 7
REGRESSION DIAGNOSTIC IV: MODEL SPECIFICATION ERRORS
Damodar Gujarati Econometrics by Example

MODEL SPECIFICATION ERRORS
One of the assumptions of the classical linear regression model (CLRM) is that the model is specified correctly. By correct specification we mean one or more of the following:
1. The model does not exclude any "core" variables.
2. The model does not include superfluous variables.
3. The functional form of the model is suitably chosen.
4. There are no errors of measurement in the regressand and regressors.
5. Outliers in the data, if any, are taken into account.
6. The probability distribution of the error term is well specified.
7. The regressors are nonstochastic.

OMISSION OF RELEVANT VARIABLES
If we omit a relevant variable because we do not have the data, because we have not studied the underlying economic theory carefully, because we have not studied prior research in the area thoroughly, or simply out of carelessness, we are underfitting the model.

CONSEQUENCES
1. If the omitted variables are correlated with the variables included in the model, the coefficients of the estimated model are biased. This bias does not disappear as the sample size gets larger (i.e., the estimated coefficients of the misspecified model are also inconsistent).
2. Even if the incorrectly excluded variables are not correlated with the variables included in the model, the intercept of the estimated model is biased.
3. The disturbance variance is incorrectly estimated.
4. The variances of the estimated coefficients of the misspecified model are biased.
5. In consequence, the usual confidence intervals and hypothesis-testing procedures become suspect, leading to misleading conclusions about the statistical significance of the estimated parameters.
6. Furthermore, forecasts based on the incorrect model, and the forecast confidence intervals based on it, will be unreliable.
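The first consequence listed above can be illustrated with a small simulation. The sketch below is only an illustration under assumed values: the data-generating process, the coefficients (1, 2, 3), the correlation parameter rho, and the function names are all hypothetical and do not come from the text. It fits the full model and a model that omits a regressor correlated with the included one, so that the two slope estimates can be compared.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on X (X must already contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def simulate_omitted_variable(n, rho, seed=0):
    """Fit the full model and a model that omits x3.

    Hypothetical true model: y = 1 + 2*x2 + 3*x3 + u,
    where corr(x2, x3) = rho and both regressors have unit variance.
    """
    rng = np.random.default_rng(seed)
    x2 = rng.normal(size=n)
    # x3 is built to have correlation rho with x2
    x3 = rho * x2 + np.sqrt(1.0 - rho**2) * rng.normal(size=n)
    u = rng.normal(size=n)
    y = 1.0 + 2.0 * x2 + 3.0 * x3 + u

    const = np.ones(n)
    full = ols(np.column_stack([const, x2, x3]), y)   # correctly specified
    short = ols(np.column_stack([const, x2]), y)      # x3 omitted
    return full, short

if __name__ == "__main__":
    for n in (1_000, 100_000):
        full, short = simulate_omitted_variable(n, rho=0.5)
        print(n, "full slope on x2:", round(full[1], 3),
              "| slope on x2 with x3 omitted:", round(short[1], 3))
```

In this assumed setup the slope on x2 in the underfitted model settles near 2 + 3(0.5) = 3.5 rather than 2, and increasing n does not bring it closer to 2, which is what inconsistency means here.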

F TEST TO COMPARE TWO MODELS
If the original model is the "restricted" model, and the model with the added (previously omitted) variable, which could also be a squared term or an interaction term, is the "unrestricted" model, we can compare the two using an F test:
$$F = \frac{(R_{ur}^2 - R_r^2)/m}{(1 - R_{ur}^2)/(n - k)}$$
where $R_{ur}^2$ and $R_r^2$ are the $R^2$ values of the unrestricted and restricted models, m = number of restrictions (or omitted variables), n = number of observations, and k = number of parameters in the unrestricted model.
A rejection of the null suggests that the omitted variables belong in the model.

DETECTION OF OMISSION OF VARIABLES
Ramsey's Regression Specification Error (RESET) Test
Lagrange Multiplier (LM) Test

RAMSEY'S RESET TEST
1. From the (incorrectly) estimated model, we first obtain the estimated, or fitted, values of the dependent variable, $\hat{Y}_i$.
2. Reestimate the original model including $\hat{Y}_i^2$ and $\hat{Y}_i^3$ (and possibly higher powers of the estimated dependent variable) as additional regressors.
3. The initial model is the restricted model and the model in Step 2 is the unrestricted model.
4. Under the null hypothesis that the restricted (i.e., the original) model is correct, we can use the previously mentioned F test.
5. If the F test in Step 4 is statistically significant, we can reject the null hypothesis. That is, the restricted model is not appropriate in the present situation. By the same token, if the F statistic is statistically insignificant, we do not reject the original model.

LAGRANGE MULTIPLIER TEST
1. From the original model, we obtain the estimated residuals, $e_i$.
2. If in fact the original model is the correct model, then the residuals $e_i$ obtained from this model should not be related to the regressors omitted from that model.
3. We now regress $e_i$ on the regressors in the original model and the omitted variables from the original model.
This is the auxiliary regression.
4. If the sample size is large, it can be shown that n (the sample size) times the $R^2$ obtained from the auxiliary regression follows the chi-square distribution with df equal to the number of regressors omitted from the original regression.
5. If the computed chi-square value exceeds the critical chi-square value at the chosen level of significance, or if its p value is sufficiently low, we reject the original (or restricted) regression. That is to say, the original model was misspecified.

INCLUSION OF IRRELEVANT OR UNNECESSARY VARIABLES
Sometimes researchers add variables in the hope that the $R^2$ value of their model will increase, in the mistaken belief that the higher the $R^2$ the better the model. This is called overfitting a model. If the variables are not economically meaningful and relevant, such a strategy is not recommended.

CONSEQUENCES
1. The OLS estimators of the "incorrect" or overfitted model are all unbiased and consistent.
2. The error variance is correctly estimated.
3. The usual confidence interval and hypothesis testing procedures remain valid.
4. However, the estimated coefficients of such a model are generally inefficient (their variances will be larger than those of the true model).

MISSPECIFICATION OF THE FUNCTIONAL FORM OF A REGRESSION MODEL
Sometimes researchers mistakenly do not account for the nonlinear nature of variables in a model. Moreover, some dependent variables (such as wage, which tends to be skewed to the right) are more appropriately entered in natural log form.

COMPARING ON BASIS OF R2
The $R^2$ values of a model with $Y_i$ as the dependent variable and a model with $\ln Y_i$ as the dependent variable are not directly comparable. We can transform the models as follows, as in Chapter 2:
1. Compute the geometric mean (GM) of the dependent variable, call it $Y^*$.
2. Divide $Y_i$ by $Y^*$ to obtain: $\tilde{Y}_i = Y_i / Y^*$
3.
Estimate the equation with $\ln Y_i$ as the dependent variable, using $\tilde{Y}_i$ in lieu of $Y_i$ (i.e., use $\ln \tilde{Y}_i$ as the dependent variable).
4. Estimate the equation with $Y_i$ as the dependent variable, using $\tilde{Y}_i$ as the dependent variable instead of $Y_i$.
5. Compute the following, putting the larger RSS value in the numerator:
$$\frac{n}{2}\ln\left(\frac{RSS_1}{RSS_2}\right) \sim \chi^2_1$$
If this statistic is significant, the model with the lower RSS value is better.

ERRORS OF MEASUREMENT
One of the assumptions of CLRM is that the model used in the analysis is correctly specified. Although not explicitly spelled out, this presumes that the values of the regressand as well as the regressors are accurate. That is, they are not guess estimates, extrapolated, interpolated, or rounded off in any systematic manner, or recorded with errors.

CONSEQUENCES
Consequences of Errors of Measurement in the Regressand:
1. The OLS estimators are still unbiased.
2. The variances and standard errors of OLS estimators are still unbiased.
3. But the estimated variances, and ipso facto the standard errors, are larger than in the absence of such errors.
In short, errors of measurement in the regressand do not pose a very serious threat to OLS estimation.

CONSEQUENCES
Consequences of Errors of Measurement in the Regressor:
1. OLS estimators are biased as well as inconsistent.
2. Errors in a single regressor can lead to biased and inconsistent estimates of the coefficients of the other regressors in the model.
It is not easy to establish the size and direction of bias in the estimated coefficients.
It is often suggested that we use instrumental or proxy variables for variables suspected of having measurement errors.
The proxy variables must satisfy two requirements: they are highly correlated with the variables for which they are a proxy, and they are uncorrelated with the usual equation error as well as the measurement error. But such proxies are not easy to find.
We should thus be very careful in collecting the data and make sure that obvious errors are eliminated.

OUTLIERS, LEVERAGE, AND INFLUENCE DATA
OLS gives equal weight to every observation in the sample. This may create problems if we have observations that may not be "typical" of the rest of the sample. Such observations, or data points, are known as outliers, leverage points, or influence points.
Outliers: In the context of regression analysis, an outlier is an observation with a large residual ($e_i$), large in comparison with the residuals of the rest of the observations.
Leverage: An observation is said to exert (high) leverage if it is disproportionately distant from the bulk of the sample observations. In this case such an observation can pull the regression line towards itself, which may distort the slope of the regression line.
Influential point: If a levered observation in fact pulls the regression line toward itself, it is called an influential point. The removal of such a data point from the sample can dramatically change the slope of the estimated regression line.

PROBABILITY DISTRIBUTION OF THE ERROR TERM
The classical normal linear regression model (CNLRM), an extension of CLRM, assumes that the error term $u_i$ in the regression model is normally distributed. This assumption is critical if the sample size is relatively small, for the commonly used tests of significance, such as t and F, are based on the normality assumption.

JARQUE-BERA (JB) TEST OF NORMALITY
This is a large sample test.
The test statistic is as follows:
$$JB = n\left[\frac{S^2}{6} + \frac{(K - 3)^2}{24}\right]$$
where n is the sample size, S = skewness coefficient, and K = kurtosis coefficient.
For a normally distributed variable S = 0 and K = 3. When this is the case, the JB statistic is zero. Therefore, the closer the value of JB is to zero, the better the normality assumption holds.
Since in practice we do not observe the true error term, we use its proxy, $e_i$.
The null hypothesis is the joint hypothesis that S = 0 and K = 3. Jarque and Bera have shown that, in large samples, the statistic follows the chi-square distribution with 2 df (because we are imposing two restrictions, namely, that skewness is zero and kurtosis is 3).
If the computed JB statistic exceeds the critical chi-square value, we reject the hypothesis that the error term is normally distributed.

RANDOM OR STOCHASTIC REGRESSORS
The CLRM assumes that the regressand is random but the regressors are nonstochastic or fixed; that is, we keep the values of the regressors fixed and draw several random samples of the dependent variable.
Although the assumption of fixed regressors may be valid in several economic situations, it may not be tenable for all economic data. In such cases we assume that both Y (the dependent variable) and the Xs (the regressors) are drawn randomly. This is the case of stochastic or random regressors.

THE SIMULTANEITY PROBLEM
There are many situations where a unidirectional relationship from the Xs to Y cannot be maintained, since some Xs affect Y but Y in turn also affects one or more Xs. In other words, there may be a feedback relationship between the Y and X variables.
Simultaneous equation regression models are models that take into account feedback relationships among variables.
Endogenous variables are variables whose values are determined in the model.
Exogenous variables are variables whose values are not determined in the model.
Sometimes, exogenous variables are called predetermined variables, for their values are determined independently or fixed, such as the tax rates fixed by the government.
The parameters of such models can be estimated using the Method of Indirect Least Squares (ILS) or the Method of Two-Stage Least Squares (2SLS).
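As a rough illustration of the simultaneity problem and of 2SLS, the sketch below simulates a simple two-equation system and compares OLS with a hand-rolled two-stage estimate. Everything specific here is an assumption made for illustration: the consumption function C = b0 + b1*Y + u, the identity Y = C + I with exogenous investment I, the parameter values, and the function names. It returns coefficients only; standard errors from a naive second-stage regression would not be valid and are not computed.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients of y on X (X must already contain a constant column)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def two_stage_least_squares(y, x_endog, z):
    """2SLS with one endogenous regressor and one exogenous instrument z."""
    const = np.ones(len(y))
    Z = np.column_stack([const, z])
    # Stage 1: regress the endogenous regressor on the instrument, keep fitted values
    x_hat = Z @ ols(Z, x_endog)
    # Stage 2: regress y on the fitted values from stage 1
    return ols(np.column_stack([const, x_hat]), y)

# Hypothetical system (all numbers are assumptions, not from the text):
#   consumption = b0 + b1 * income + u     (behavioral equation)
#   income      = consumption + invest     (identity), invest exogenous
rng = np.random.default_rng(42)
n = 50_000
b0, b1 = 5.0, 0.8
invest = rng.normal(10.0, 2.0, size=n)
u = rng.normal(0.0, 1.0, size=n)
income = (b0 + invest + u) / (1.0 - b1)   # reduced form for income
consumption = b0 + b1 * income + u

ols_coefs = ols(np.column_stack([np.ones(n), income]), consumption)
tsls_coefs = two_stage_least_squares(consumption, income, invest)

print("OLS slope: ", round(ols_coefs[1], 4))
print("2SLS slope:", round(tsls_coefs[1], 4))
```

Because income depends on u through the feedback from consumption, the OLS slope in this assumed setup drifts above the true 0.8, while the 2SLS slope, which uses only the variation in income explained by exogenous investment, stays close to it.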