Chapter 7 Further Inference in Multiple Regression 1. 2. 3. 4. 5. 6. 7. 8. 9. Testing the Significance of a Model—The πΉ­Test Cases Where the πΉ­Test and π‘­Test Give Contradictory Results—Collinearity 2.1. πΉ­Test for “Restricted Least Squares” Extension of the Regression Model Testing Some Economic Hypotheses 4.1. Test the Significance of Advertising 4.2. The Optimal Level of Advertising The Use of Non-sample Information Model Specification 6.1. Consequences of Omitted and Irrelevant Variables 6.1.1. The Omitted Variable Problem 6.1.1.1. Proof of the Omitted Variable Bias 6.1.2. The Irrelevant Variable Problem 6.2. The RESET Test for Model Misspecification Identifying and Mitigating Collinearity Confidence and Prediction Intervals A More Practical Way of Finding var(π¦Μ0 ) 1. Testing the Significance of a Model—The π­πππ¬π In Chapter 5 the πΉ-Test was explained as an alternative to the π‘-Test for the significance of a simple regression model. There we noted that the p-value for the πΉ-Test in the ANOVA section of the regression summary output was equal to the p-value for the t-Test for the significance of the slope coefficient π2 . In multiple regression, however, the πΉ­test and π‘-tests have different roles. The π‘-tests test the significance of the coefficient of each variable individually, while the πΉ-test tests their joint explanatory power. The model will have no explanatory power if it turns out that π¦ is unrelated to any of the explanatory variables. In the two-explanatory variable regression model, π¦ = π½1 + π½2 π₯2 + π½3 π₯3 + π’ we perform the following hypothesis test: π»0 : π½2 = 0, π½3 = 0 π»1 : at least one of the π½π is not zero In Chapter 5 it was explained that the ratio of mean square regression (πππ ) over mean square error (πππΈ) has an πΉ-distribution with the numerator and denominator degrees of freedom, respectively, of (π − 1) and (π − π). πΉ(π−1,π−π) = ∑(π¦Μ − π¦Μ )2 ⁄(π − 1) ∑(π¦ − π¦Μ)2 ⁄(π − π) This is the test statistic (ππ) πΉ which is compared to the critical value πΆπ = πΉπΌ,(π−1,π−π) . We would reject the null hypothesis and conclude that estimated relationship between π¦ and the independent variables is significant, if ππ > πΆπ. In the ππ’ππππ model presented in the previous chapter, Chapter 7—Further Inference in Multiple Regression Page 1 of 20 ππ΄πΏπΈπ = π½1 + π½2 ππ πΌπΆπΈ + π½3 π΄π·ππΈπ π + π’ The estimated regression model was, Μ = 118.914 − 7.908ππ πΌπΆπΈ + 1.863π΄π·ππΈπ π ππ΄πΏπΈπ To test for the significance of the overall model, the πΉ statistic was obtained from the ANOVA table, ANOVA df Regression Residual Total ππ = πΉ = 2 72 74 SS 1396.5389 1718.9429 3115.4819 MS 698.26946 23.874207 F 29.247859 Significance F 5.04086E-10 698.2695 = 29.248 23.8742 πΆπ = πΉ0.05,(2,72) = 3.124 Since ππ > πΆπ, we reject the null hypothesis and conclude that the relationship between ππ΄πΏπΈπ, ππ πΌπΆπΈ and π΄π·ππΈπ π is significant. Also note that the probability value shown under “Significance F” is practically 0. 1.1. F-Tests for “Restricted Least Squares” Most economic data are obtained not from controlled (laboratory) experiments but are collected as “historical” data. Therefore, when data are the result of an uncontrolled experiment, many of the economic variables may be correlated or are said to be collinear. The problem is labeled collinearity, or multicollinearity when several variables are involved. The restricted least squares can be useful when the problem of collinearity is present. 
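Before turning to restricted least squares, the overall significance test above can be reproduced outside of Excel. The following is a minimal Python sketch of the same calculation; the workbook name "CH7 DATA.xlsx" and the column names SALES, PRICE, and ADVERT are assumptions about how the data are stored.

```python
# Overall F-test for the burger model, using the ANOVA quantities described above.
import numpy as np
import pandas as pd
from scipy import stats

df = pd.read_excel("CH7 DATA.xlsx", sheet_name="burger")   # sheet/column names assumed
y = df["SALES"].to_numpy()
X = np.column_stack([np.ones(len(y)), df["PRICE"], df["ADVERT"]])

b, *_ = np.linalg.lstsq(X, y, rcond=None)        # least squares coefficients
yhat = X @ b
N, K = X.shape                                   # N = 75, K = 3
SSR = np.sum((yhat - y.mean()) ** 2)             # regression sum of squares
SSE = np.sum((y - yhat) ** 2)                    # error sum of squares

F  = (SSR / (K - 1)) / (SSE / (N - K))           # should reproduce F ≈ 29.248
cv = stats.f.ppf(0.95, K - 1, N - K)             # critical value, ≈ 3.124
p  = stats.f.sf(F, K - 1, N - K)                 # the "Significance F" value
print(F, cv, p)
```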
We might want to test if an explanatory variable or a group of explanatory variables is relevant in a particular model. The πΉ-test for one hypothesis, or a set of hypotheses, is based on a comparison of the sum of squared error (πππΈ) from the original, unrestricted model to the πππΈ from a regression model in which the null hypotheses for the restricted mode is assumed to be true. This is explained below. The unrestricted model is the original regression model. Unrestricted model: π¦ = π½1 + π½2 π₯2 + π½3 π₯3 + π’ The restricted model is that which one or more of the explanatory variables are removed. In a model with two independent variables we can remove the impact of one of the two variables. Restricted model: π¦ = π½1 + π½3 π₯3 + π’ We want to test the hypothesis that changes in, say, π₯2 , have no effect on π¦, against the alternative that it has an effect. π»0 : π½2 = 0 π»1 : π½2 ≠ 0 Chapter 7—Further Inference in Multiple Regression Page 2 of 20 It was explained in Chapter 6 that a model with a larger number of independent variables would have a smaller πππΈ. Therefore, when we remove one of the independent variables by constraining the model, the restricted πππΈ increases. πππΈπ > πππΈπ The ratio of the difference between πππΈπ − πππΈπ over πππΈπ is a test statistic that has an πΉ distribution with a numerator degrees of freedom equal to the number of hypotheses in π»0 and the denominator degrees of freedom of π − π, where π is the number of parameters in the unrestricted model. Here we have only one null hypothesis. Therefore, π = 1. πΉ= (πππΈπ − πππΈπ )⁄π πππΈπ ⁄(π − π) If ππ = πΉ(π,π−π) > πΆπ = πΉπΌ,(π,π−π) , then we would reject the null hypothesis (hypotheses), concluding the π₯2 has a significant effect on π¦. In the example relating sales to price and advertising expenditure, constrain the model by removing the impact of price, π₯2 . The unrestricted πππ , determined previously, is πππ π = 1718.943. The restricted πππΈ is then (see the Excel file CH7 DATA, worksheet tab burger) πππΈπ = 2961.827 Thus, πΉ= (2961.827 − 1718.943)⁄1 = 52.06 1718.943⁄(75 − 3) To find the critical value, using the Excel =F.INV.RT function, we have: πΉ0.05,(1,72) = 3.974. Since the test statistic exceeds the critical value, we reject the null hypothesis π»0 : π½2 = 0 and conclude that variations in price influence monthly sales. Note that when we test the significance of the overall model, in effect, we use the following methodology: Unrestricted model: Restricted model: π¦ = π½1 + π½2 π₯2 + π½3 π₯3 + π’ π¦ = π½1 + π’ The null and alternative hypotheses are: π»0 : π½2 = 0, π½3 = 0 π»1 : at least one of the π½π is non-zero The estimated coefficient for the restricted model is simply π1 = π¦Μ , and the regression equation is: π¦Μ = π1 = π¦Μ . Therefore, πππΈπ = ∑(π¦ − π¦Μ)2 = ∑(π¦ − π¦Μ )2 = πππ Thus, πΉ= (πππΈπ − πππΈπ )⁄π (πππ − πππΈπ )⁄(π − 1) = πππΈπ ⁄(π − π) πππΈπ ⁄(π − π) πΉ= πππ ⁄(π − 1) πππΈ ⁄(π − π) Chapter 7—Further Inference in Multiple Regression Page 3 of 20 Which is the same πΉ statistic for the original unrestricted model. 1.2. Test of the Significance of Advertising Using the extended burger model, ππ΄πΏπΈπ = π½1 + π½2 ππ πΌπΆπΈ + π½3 π΄π·ππΈπ π + π½4 π΄π·ππΈπ π 2 + π’ we can test some interesting economic hypotheses and illustrate the use of π‘- and πΉ-tests. In the extended model above, incorporating π΄π·ππΈπ π 2 in the model, we wish to test whether advertising has an effect on total revenue. This means we want to perform a joint test for β3 and β4. 
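Before carrying out that joint test, here is a short sketch of the single-restriction comparison just described, dropping PRICE from the burger model. The file, sheet, and column names are again assumptions; statsmodels is used so the restricted and unrestricted SSE values come directly from the fitted models.

```python
# Restricted-vs-unrestricted F-test of H0: beta2 = 0 (remove PRICE), as described above.
import pandas as pd
import statsmodels.api as sm
from scipy import stats

df  = pd.read_excel("CH7 DATA.xlsx", sheet_name="burger")   # assumed file/sheet/columns
y   = df["SALES"]
X_u = sm.add_constant(df[["PRICE", "ADVERT"]])   # unrestricted model
X_r = sm.add_constant(df[["ADVERT"]])            # restricted model (PRICE removed)

fit_u = sm.OLS(y, X_u).fit()
fit_r = sm.OLS(y, X_r).fit()

J = 1                                   # one restriction in H0
N, K = X_u.shape                        # K = parameters in the unrestricted model
F  = ((fit_r.ssr - fit_u.ssr) / J) / (fit_u.ssr / (N - K))   # should give F ≈ 52.06
cv = stats.f.ppf(0.95, J, N - K)        # ≈ 3.974, same as Excel's =F.INV.RT(0.05,1,72)
print(F, cv)

# statsmodels can also perform the same test in one call:
print(fit_u.f_test("PRICE = 0"))
```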
Again, let’s make use of unrestricted/ restricted models methodology used above. Unrestricted model: Restricted model: π¦ = π½1 + π½2 π₯2 + π½3 π₯3 + π½4 π₯32 + π’ π¦ = π½1 + π½2 π₯2 + π’ The joint null and alternative hypotheses are: π»0 : π½3 = 0, π½4 = 0 π»1 : π½3 or π½4 , or both are non-zero. From the regression output for the unrestricted model, as seen above, πππΈπ = 1532.084. regression output for the restricted model (see CH7 DATA Excel file): πππΈπ = 1896.391. Unrestricted Model ANOVA df Regression 3 Residual 71 Total 74 Intercept PRICE ADVERT ADVERT² Coefficients 109.7190 -7.6400 12.1512 -2.7680 SS 1583.3974 1532.0845 3115.4819 MS 527.79914 21.578654 Standard 6.7990 Error 1.0459 3.5562 0.9406 t Stat 16.1374 -7.3044 3.4170 -2.9427 Restricted Model ANOVA df Regression 1 Residual 73 Total 74 Intercept PRICE Coefficients 121.9002 -7.8291 From the SS 1219.0910 1896.3908 3115.4819 MS 1219.0910 25.9780 Standard 6.5263 Error 1.1429 t Stat 18.6783 -6.8504 The test statistic F is: πΉ= (ππππ − πππΈπ )⁄π (1896.3908 − 1532.0845)⁄2 = = 8.441 πππΈπ ⁄(π − π) 1532.0845⁄(75 − 4) And the critical value is: F0.05,(2, 71) = 3.126 Clearly we reject the null hypothesis and conclude that advertising has an effect on total revenue. Chapter 7—Further Inference in Multiple Regression Page 4 of 20 1.3. Test the Significance of Advertising—Optimal Level What is the optimum level of advertising? Optimality in economics always implies marginal benefit of an action be equal to its marginal cost. If marginal benefit exceeds the marginal cost, the action should be taken. If marginal benefit is less than the marginal cost, the action should be curtailed. The optimum is, therefore, where the two are equal. The marginal benefit of advertising is the contribution of each additional $1 thousand of advertising expenditure to total revenue. Form the model, π¦ = π½1 + π½2 π₯2 + π½3 π₯3 + π½4 π₯32 + π’ the marginal benefit of advertising is: πy = π½3 + 2π½4 π₯3 ππ₯3 Ignoring the marginal cost of additional sales, marginal cost is each additional $1 thousand spent on advertising. Thus, the optimality requirement is ππ΅ = ππΆ: π½3 + 2π½4 π₯3 = $1 Using the estimated least squares coefficients, we thus have: 12.151 + 2(−2.768)π₯3 = 1 Yielding, π₯3 = $2.014 thousand. Now suppose we wish to use the estimated regression model to test the null hypothesis that the optimum advertising expenditure is $1.9 thousand. That is, π»0 : π½3 + 2π½4 (1.9) = 1 π»1 : π½3 + 2π½4 (1.9) ≠ 1 The test statistic for this test has a t distribution. ππ = |π‘| = ππππππ ππ‘ππ‘ππ π‘ππ − ππ’ππ ππππ’π ππ‘ππππππ πΈππππ The sample statistic for the test is π3 + 3.8π4 . The standard error in the denominator is se(π3 + 3.8π4 ). ππ = π3 + 3.8π4 − 1 se(π3 + 3.8π4 ) How do we find the value for the standard error? Using the properties of variance, we have var(π3 + 3.8π4 ) = var(π3 ) + 3.82 var(π4 ) + 2(3.8)cov(π3 , π4 ) We can obtain var(π3 ) and var(π4 ) by simply squaring the standard errors from the regression output. Unfortunately, the Excel regression output does not provide the covariance value. We can, however, still use Excel to compute cov(π3 , π4 ). If you recall, using matrix algebra we can determine the variance-covariance matrix. 
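The matrix algebra itself is set out next; as a cross-check, here is a Python sketch of the same computation, assuming the burger data are available with columns SALES, PRICE, and ADVERT. It also reproduces the standard error of b3 + 3.8b4 that is needed below.

```python
# Coefficient covariance matrix var(e)*(X'X)^(-1) for the quadratic advertising model.
import numpy as np
import pandas as pd

df = pd.read_excel("CH7 DATA.xlsx", sheet_name="burger")    # assumed file/sheet/columns
y  = df["SALES"].to_numpy()
X  = np.column_stack([np.ones(len(y)),
                      df["PRICE"],
                      df["ADVERT"],
                      df["ADVERT"] ** 2])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
N, K  = X.shape
var_e = resid @ resid / (N - K)                  # MSE, should be ≈ 21.579

cov_b  = var_e * np.linalg.inv(X.T @ X)          # 4x4 covariance matrix of b1..b4
var_lc = cov_b[2, 2] + 3.8**2 * cov_b[3, 3] + 2 * 3.8 * cov_b[2, 3]
print(np.sqrt(var_lc))                           # se(b3 + 3.8*b4), ≈ 0.654
```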
In matrix form,

[ var(b1)       cov(b1,b2)    cov(b1,b3)
  cov(b1,b2)    var(b2)       cov(b2,b3)
  cov(b1,b3)    cov(b2,b3)    var(b3)   ]  =  var(e)·X⁻¹

where X⁻¹ is the inverse of the matrix

X = [ N      Σx2      Σx3
      Σx2    Σx2²     Σx2x3
      Σx3    Σx2x3    Σx3²  ]

for a model with two independent variables. In our current model we have three independent variables, but this poses no difficulty because the X matrix can be expanded to accommodate any number of independent variables. For a three-variable model,

X = [ N      Σx2      Σx3      Σx4
      Σx2    Σx2²     Σx2x3    Σx2x4
      Σx3    Σx2x3    Σx3²     Σx3x4
      Σx4    Σx2x4    Σx3x4    Σx4²  ]

Using Excel we can compute these sums as the elements of the X matrix, find the inverse, and then multiply the inverse matrix by var(e): var(e)X⁻¹. The result is the following covariance matrix (the calculations are shown in the Excel file):

 46.227    -6.426   -11.601     2.939
 -6.426     1.094     0.300    -0.086
-11.601     0.300    12.646    -3.289
  2.939    -0.086    -3.289     0.885

Thus,

var(b3 + 3.8b4) = var(b3) + 3.8²·var(b4) + 2(3.8)·cov(b3, b4)
var(b3 + 3.8b4) = 12.646 + 3.8²(0.885) + 2(3.8)(−3.289) = 0.428
se(b3 + 3.8b4) = √0.428 = 0.6542

TS = (1.633 − 1) / 0.6542 = 0.968

The critical value is CV = t0.025,71 = 1.994. Since TS < CV, we do not reject the null hypothesis: the data are consistent with an optimal advertising expenditure of $1.9 thousand.

An F-test can also be used for the optimal-advertising hypothesis. State the unrestricted and restricted models:

Unrestricted model:   y = β1 + β2x2 + β3x3 + β4x3² + u
Restricted model:     y = β1 + β2x2 + (1 − 3.8β4)x3 + β4x3² + u

Note that we have used the null statement β3 + 2β4(1.9) = 1, that is, β3 = 1 − 3.8β4, in place of the coefficient of x3. To estimate the restricted model, rearrange it as follows:

y = β1 + β2x2 + x3 + β4(x3² − 3.8x3) + u
y − x3 = β1 + β2x2 + β4(x3² − 3.8x3) + u

Running the regression on this restricted model yields a sum of squared errors of SSE_R = 1552.286, from which we compute the F-statistic (see the Excel file):

F = [(SSE_R − SSE_U)/1] / [SSE_U/(N − K)] = [(1552.286 − 1532.084)/1] / [1532.084/71] = 0.936

The critical value is CV = F0.05,(1,71) = 3.976, so we do not reject the null hypothesis. Note that when there is only one null hypothesis, the t- and F-tests are equivalent: they yield the same p-value of 0.3365,¹ and F = t² = (0.967572)² = 0.936195.

2. The Use of Non-sample Information

In many estimation and inference problems we have information external to the sample data. This non-sample information may come from economic theory, principles, or experience. When it is available, we can combine it with the sample information to improve the precision of the estimated parameters. In the economic analysis of demand, the demand for a good depends on the price of the good, the prices of substitutes and complements, and on income. Take the demand for beer.
It depends on the price of beer, ππ΅ , the price of other liquor, ππΏ , the price of all other remaining goods and services, ππ , and income (πΌ): π = π(ππ΅ , ππΏ , ππ , πΌ) Assuming that a log-log function form is appropriate for this demand relationship, ln(π) = π½1 + π½2 ln(ππ΅ ) + π½3 ln(ππΏ ) + π½4 ln(ππ ) + π½5 ln(πΌ) + π’ The relevant non-sample information can be derived by assuming that there is no “money illusion” with respect to simultaneous equal-percentage increase in all prices and income. That is, the demand for beer will not change, if income and all prices, say, double. Impose this assumption on the model by multiplying all variables by the proportion λ. ln(π) = π½1 + π½2 ln(πππ΅ ) + π½3 ln(πππΏ ) + π½4 ln(πππ ) + π½5 ln(ππΌ) Using the properties of logarithm, we can rewrite the above equation as: ln(π) = π½1 + π½2 ln(ππ΅ ) + π½3 ln(ππΏ ) + π½4 ln(ππ ) + π½5 ln(πΌ) + (π½2 + π½3 + π½4 + π½5 ) ln(π) Since multiplying all right-hand-side variables in the original equation does not alter ln(π), then the following must be true, π½2 + π½3 + π½4 + π½5 = 0 This non-sample information thus can be imposed as a constraint on the parameters in the demand model. Solve this restriction for π½4 , π½4 = −π½2 − π½3 − π½5 1 Use the Excel function =πΉπ·πΌππ() to find the tail area under the F-curve for a given F statistic. Chapter 7—Further Inference in Multiple Regression Page 7 of 20 Using the multiple regression model obtain an estimate of the log-log demand function above, ln(π) = π½1 + π½2 ln(ππ΅ ) + π½3 ln(ππΏ ) + π½4 ln(ππ ) + π½5 ln(πΌ) + π’ and substituting for π½4 from the constraint, we have ln(π) = π½1 + π½2 ln(ππ΅ ) + π½3 ln(ππΏ ) + (−π½2 − π½3 − π½5 ) ln(ππ ) + π½5 ln(πΌ) + π’ ln(π) = π½1 + π½2 [ln(ππ΅ ) − ln(ππ )] + π½3 [ln(ππΏ ) − ln(ππ )] + π½5 [ln(πΌ) − ln(ππ )] + π’ The restricted model then can be written as, ππ΅ ππΏ πΌ ln(π) = π½1 + π½2 ln ( ) + π½3 ln ( ) + π½5 ln ( ) + π’ ππ ππ ππ The data is available in πΆπ»7 π·π΄ππ΄ in the tab π΅πΈπΈπ to estimate this model. The summary output is presented below: SUMMARY OUTPUT Regression Statistics Multiple R R Square Adjusted R Square Standard Error Observations 0.8989 0.8079 0.7858 0.0617 30 ANOVA df Regression Residual Total Intercept ln(PB/PR) ln(PL/PR) ln(I/PR) 3 26 29 SS 0.4161 0.0989 0.5150 MS 0.1387 0.0038 F 36.4602 Significance F 0.0000 Coefficients -4.7978 -1.2994 0.1868 0.9458 Standard Error 3.7139 0.1657 0.2844 0.4270 t Stat -1.2918 -7.8400 0.6569 2.2148 P-value 0.2078 0.0000 0.5170 0.0357 Lower 95% -12.4318 -1.6401 -0.3977 0.0680 Upper 95% 2.8362 -0.9587 0.7714 1.8236 Since in the restricted model β4 = −β2 − β3 − β5, then the estimated β4 is π4∗ = −π2∗ − π3∗ − π5∗ = −(−1.2994) − 0.1868 − 0.9458 = 0.1667 The restricted least square estimates are biased [E(ππ ) ≠ π½π ], unless the constraints are exactly true. Also note that the variance of a restricted least square estimator is smaller than the unrestricted one. By combining non-sample information with the sample information, we reduce the variation in the estimation procedure introduced by random sampling. Chapter 7—Further Inference in Multiple Regression Page 8 of 20 3. Model Specification In any econometric investigation, choice of the model is one of the first steps. What are the important considerations when choosing a model? What are the consequences of choosing the wrong model? Are there any ways of assessing model specification or misspecification? 
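As an aside, before taking up those questions, the restricted beer-demand equation of the previous section can be estimated directly by transforming the variables, as in the following sketch. The sheet name "beer" and the column names Q, PB, PL, PR, and I are assumptions about how the data are stored.

```python
# Estimating the restricted log-log beer demand (beta2 + beta3 + beta4 + beta5 = 0)
# by regressing ln(Q) on ln(PB/PR), ln(PL/PR), ln(I/PR), as derived above.
import numpy as np
import pandas as pd
import statsmodels.api as sm

beer = pd.read_excel("CH7 DATA.xlsx", sheet_name="beer")    # assumed file/sheet/columns
y = np.log(beer["Q"])
X = pd.DataFrame({
    "ln_PB_PR": np.log(beer["PB"] / beer["PR"]),
    "ln_PL_PR": np.log(beer["PL"] / beer["PR"]),
    "ln_I_PR":  np.log(beer["I"]  / beer["PR"]),
})
fit = sm.OLS(y, sm.add_constant(X)).fit()
print(fit.params)                       # b2*, b3*, b5*: should be ≈ -1.2994, 0.1868, 0.9458

b2, b3, b5 = fit.params[["ln_PB_PR", "ln_PL_PR", "ln_I_PR"]]
b4 = -b2 - b3 - b5                      # restricted estimate of beta4, ≈ 0.1667
print(b4)
```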
The essential features of the model choice are: ο· ο· ο· Choice of functional form, Choice of explanatory variables, and whether the multiple regression model assumptions hold.2 For choice of functional form and regressors, economic principles and logical reasoning play a prominent and vital role. 3.1. Consequences of Omitted and Irrelevant Variables 3.1.1. The Omitted Variable Problem Suppose in a particular industry the wage rate of employees W, depends on their experience E, and their motivation M. Then the model is specified as: π = π½1 + π½2 πΈ + π½3 π + π’ However, since data on motivation are unavailable, we dispense with M and instead we estimate the model π = π½1 + π½2 πΈ + π’ By estimating the alternative model we are imposing the constraint π½3 = 0 when it is not true, that is, when in fact π½3 ≠ 0. By imposing this constraint, the least squares estimates π1 and π2 will be biased. Only when the omitted variable is uncorrelated with the retained variables will the estimates be unbiased. But perfectly uncorrelated regressors are rare. Because of the possibility of omitted-variable bias (OVB), one must include all important relevant variables. If an estimated equation has coefficients with unexpected signs, or unrealistic magnitudes, a possible cause of these strange results is the omission of an important variable. One method to determine if a variable or a group of variables should be included in a model is to perform “significance tests”. For one variable (one null hypothesis π»0 : π½3 = 0), we use the t-test and for more than one variable (two or more null hypotheses π»0 : π½3 = 0, π½4 = 0) we use the F-test. But we must also remember that we may reject the null because of the quality or paucity of data, even though the variable is in fact relevant to the model. One could, in such cases, be inducing omitted-variable bias in the remaining coefficient estimates. 3.1.1.1. Proof of the Omitted Variable Bias Suppose the true model is π¦ = π½1 + π½2 π₯ + π½3 β + π’, but by omitting the variable β we estimate the model π¦ = π½1 + π½2 π₯ + π’ instead. Denote the least squares estimator of π½2 in the reduced model by π2∗ . 2 The typical violations of the assumptions are: heteroskedasticity, autocorrelation, and random regressors. Chapter 7—Further Inference in Multiple Regression Page 9 of 20 π2∗ = ∑(π₯ − π₯Μ )(π¦ − π¦Μ ) ∑(π₯ − π₯Μ )π¦ = ∑(π₯ − π₯Μ )2 ∑(π₯ − π₯Μ )2 To simplify the proof, let π€= π₯ − π₯Μ ∑(π₯ − π₯Μ )2 Thus, π2∗ = ∑π€π¦ Now substitute for π¦ from the original model, which includes h, π2∗ = ∑π€(π½1 + π½2 π₯ + π½3 β + π’) = π½1 ∑π€ + π½2 ∑π€π₯ + π½3 ∑π€β + ∑π€π’ It is simple to show that ∑π€ = 0 and ∑π€π₯ = 1. Thus, π2∗ = π½2 + π½3 ∑π€β + ∑π€π’ Taking the expectations of both sides, we have E(π2∗ ) = π½2 + π½3 ∑π€β ≠ π½2 Consider the term ∑π€β. ∑π€β = ∑(π₯ − π₯Μ )β ∑(π₯ − π₯Μ )(β − βΜ ) = ∑(π₯ − π₯Μ )2 ∑(π₯ − π₯Μ )2 ∑π€β = ∑(π₯ − π₯Μ )(β − βΜ )⁄(π − 1) cov(π₯, β) = ∑(π₯ − π₯Μ )2 ⁄(π − 1) var(π₯) The numerator is the covariance of π₯ and β, and the denominator the variance of π₯. This allows us to write: E(π2∗ ) = π½2 + π½3 cov(π₯, β) ≠ π½2 var(π₯) The OVB here is shown as the difference between E(π2∗ ) and π½2 : bias(π2∗ ) = E(π2∗ ) − π½2 = π½3 cov(π₯, β) var(π₯) Knowing the sign of π½3 and the sign of cov(π₯, β) tells us the direction of the bias. Also note that if π₯ and β are uncorrelated, their covariance will be zero. Thus the bias disappears and E(π2∗ ) = π½2 . 
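The algebra above is easy to illustrate with a small simulation. The parameter values below are made up purely for illustration; with β3 > 0 and cov(x, h) > 0, the bias β3·cov(x, h)/var(x) is positive, so the slope estimated from the misspecified model overshoots β2 on average.

```python
# A small simulation of omitted-variable bias. All parameter values are invented for
# illustration; the point is that E(b2*) = beta2 + beta3*cov(x,h)/var(x) when h is omitted.
import numpy as np

rng = np.random.default_rng(0)
beta1, beta2, beta3 = 1.0, 3.0, 2.0
n, reps = 200, 2000
slopes = []

for _ in range(reps):
    x = rng.normal(size=n)
    h = 0.6 * x + rng.normal(size=n)        # h correlated with x, so cov(x, h) > 0
    y = beta1 + beta2 * x + beta3 * h + rng.normal(size=n)
    # fit the misspecified model y = b1 + b2*x, omitting h
    b2_star = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    slopes.append(b2_star)

print(np.mean(slopes))      # ≈ beta2 + beta3 * 0.6 = 4.2, not the true beta2 = 3.0
```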
Example The worksheet “πππ’­πππ” in the file “CH7 DATA” contains 428 observations relating annual family income (π¦ = πΉπ΄ππΌππΆ) to the level of education of the income earners. The explanatory variables are husband's years of education (π₯2 = π»πΈπ·π) and wife's years of education (π₯3 = ππΈπ·π). The regression outcome shows family income rises by $3,132 for each additional year of the husband's education and by $4,523 for each additional year of the wife's education. π¦Μ = −5533.629 + 3131.509π₯2 + 4522.641π₯3 Chapter 7—Further Inference in Multiple Regression Page 10 of 20 If, however, we omit π₯3 = ππΈπ·π from the model the regression equation becomes, π¦Μ = 2619.27 + 5155.483π₯2 With the effect of an extra year of the husband's education on family income rising by nearly $2,000 to $5,155, the model overstates the contribution of the husband's educational attainment to the family income. Denote the biased coefficient as π2∗ . Then the omitted variable bias, as explained above, is bias(π2∗ ) = π½3 cov(π₯2 , π₯3 ) var(π₯) We can show that the omitted variable imparts a positive bias to the model. The estimated coefficient π3 = 4522.641 > 0 and, using Excel, cov(π₯2 , π₯3 ) = 4.113 > 0. 3 Now include a third explanatory variable, the number of children under 6 years of age, ππ = π²π³π. The regression equation is as follows. Note that, the coefficient of πΎπΏ6 is negative, implying that the larger the number of children in the family the lower the income (fewer number of hours worked). For each additional child, family income is reduced by $14,311. For comparison, the regression equation for the original model is shown below the new equation. π¦Μ = −7755.3 + 3211.5π₯2 + 4776.9π₯3 − 14310.9π₯4 π¦Μ = −5533.6 + 3131.5π₯2 + 4522.6π₯3 Note that the inclusion of the π₯4 = πΎπΏ6 variable does not alter the coefficients of the original model by much. The reason for this is that the πΎπΏ6 variable is not highly correlated with the education variables. Even though the new variable is relevant (the p-value in the regression output is 0.0044), its omission would not impart an OVB because of the absence of significant correlation with existing explanatory variables (here the education variables). 3.1.2. The Irrelevant Variable Problem The opposite of the omitted variable problem is the irrelevant variable problem. The irrelevant variable does not make the other estimated coefficients biased. But it increases their variance (or standard error), if the 2 irrelevant variable is significantly correlated with existing variables. Recall the role of (1 − πππ ) in the denominator of the formula for variance of ππ . Even though the irrelevant variable is unlikely to influence the dependent variable, it could be correlated with one or both of the previous variables, thus increasing their variance. 3.2. The RESET Test for Model Misspecification The general idea behind the RESET test (Regression Specification Error Test) is that if we can significantly improve the model by artificially including powers of the predictions of the model (that is π¦Μ 2 and π¦Μ 3 ), then we can conclude that the original model is inadequate or misspecified. Let’s state the original model as: π¦ = π½1 + π½2 π₯2 + π½3 π₯3 + π’ which is estimated by 3 Do not confuse cov(π₯2 , π₯3 ) with cov(π2 , π3 )! 
ŷ = b1 + b2x2 + b3x3

We can include the squared prediction of y, ŷ², alone or together with its cubed value, ŷ³, in the original model:

y = β1 + β2x2 + β3x3 + γ1ŷ² + u
y = β1 + β2x2 + β3x3 + γ1ŷ² + γ2ŷ³ + u

(The symbol "γ" is the Greek letter gamma.) In the first model we then test the hypotheses

H0: γ1 = 0
H1: γ1 ≠ 0

and in the second,

H0: γ1 = 0, γ2 = 0
H1: γ1 ≠ 0, or γ2 ≠ 0, or both

If we reject the null hypothesis and conclude that γ1 or γ2 is significantly different from zero, then including the powers of the predictions of y has improved the model, and therefore the original model is inadequate. If we have omitted variables, and these variables are correlated with x2 and x3, then some of their effect may be picked up by the inclusion of the powers of the predictions of y.

Let us use the family income model above to carry out the RESET test. Run the regression first with ŷ² added alone, and then with both ŷ² and ŷ³ included in the original model. In running a RESET test we treat the original model as the restricted model and use the F-test described above. Using the estimated regression equation with KL6 included,

ŷ = −7755.3 + 3211.5x2 + 4776.9x3 − 14310.9x4

and then running the model with ŷ² added,

ŷ* = b1 + b2x2 + b3x3 + b4x4 + g1ŷ²

The ANOVA sections of the computer output for the two models are shown below.

Augmented model: ŷ* = b1 + b2x2 + b3x3 + b4x4 + g1ŷ²

ANOVA          df      SS             MS
Regression      4      1.568E+11      3.92E+10
Residual      423      6.743E+11      1.594E+09
Total         427      8.311E+11

               Coefficients    Standard Error    t Stat
Intercept       87242.9829      40389.3906        2.1600
HEDU            -2381.4657       2419.6918       -0.9842
WEDU            -4235.1089       3832.1395       -1.1052
KL6             10887.3371      11439.2762        0.9518
ŷ²               0.000010        0.0000041        2.4462

Original model: ŷ = b1 + b2x2 + b3x3 + b4x4

ANOVA          df      SS             MS
Regression      3      1.472E+11      4.91E+10
Residual      424      6.838E+11      1.61E+09
Total         427      8.311E+11

               Coefficients    Standard Error    t Stat
Intercept       -7755.330       11162.935        -0.695
HEDU             3211.526         796.703         4.031
WEDU             4776.907        1061.164         4.502
KL6            -14310.921        5003.928        -2.860

The F-statistic for the test H0: γ1 = 0 is:

F = [(SSE_R − SSE_U)/J] / [SSE_U/(N − K)]

SSE_R = 6.838E+11, SSE_U = 6.743E+11, J = 1, N − K = 423

F = 5.984

The probability value is =F.DIST.RT(5.984, 1, 423) = 0.0148. At a 5% level of significance we reject H0 and conclude that the squared predictions do improve the original model. This indicates that the original model (with the kids variable included) is misspecified. Including, in addition, the cubed predictions and running the regression provides the following ANOVA table and F-statistic.

ANOVA          df      SS               MS           F           Significance F
Regression      5      1.57219E+11      3.14E+10     19.69122    1.2E-17
Residual      422      6.73868E+11      1.6E+09
Total         427      8.31087E+11

SSE_R = 6.838E+11, SSE_U = 6.739E+11, J = 2, N − K = 422

F = 3.1226

The p-value is =F.DIST.RT(3.1226, 2, 422) = 0.0451, so we reject H0: γ1 = 0, γ2 = 0 at a 5% level of significance. This indicates that the model could be improved by adding more variables, such as the age of the wage earner, experience, or the geographic location of the household (rural versus urban).

4.
Cases Where the π-Test and π-Tests Give Contradictory Results— Collinearity In some cases, the πΉ-test may indicate that the overall model is significant, but individual π‘-tests lead us to conclude that the individual π½π are not significantly different from zero. When collinearity among the explanatory variable exists, that is, when independent variables themselves are correlated (when their explanatory powers overlap), the standard errors of the regression coefficients would be large, making the π‘ test statistics small, thus leading us to conclude that each π½π individually is not significantly different from zero. The following is a discussion of the collinearity problem in multiple regression models. 4.1. Collinearity To gain a better understanding of the relationship between the dependent variable y and the explanatory variables, it is important to understand the factors affecting the variance and covariance of the coefficients. To that end, consider the actual formulas for the variances of the slope coefficients and their covariance. Chapter 7—Further Inference in Multiple Regression Page 13 of 20 var(π2 ) = var(π) ∑(π₯2 − π₯Μ 2 )2 (1 − π232 ) var(π3 ) = var(π) ∑(π₯3 − π₯Μ 3 )2 (1 − π232 ) covar(π2 , π3 ) = var(π) 4 −∑(π₯2 − π₯Μ 2 )(π₯3 − π₯Μ 3 ) ∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 − π₯Μ 3 )2 (1 − π232 ) ο· The larger the variance of the disturbance term var(π’), as estimated by var(π), the larger the variance (and covariance) of the least squares estimators. This means that the dependent variable data is more widely scattered about the regression plane, indicating a weaker association between π¦ and the respective explanatory variable. ο· Note that the sum of squared deviations for all independent variables appear in the denominator of the three formulas. The bigger these are, the smaller the variances and the covariance. Two factors affect these sum-squares: ο· o The sample size: The larger the sample size π, bigger the sum-squares. o The degree of dispersion of the π₯ππ data about their respective mean π₯Μ π : The more dispersed the π₯ππ data are about their mean, the bigger the sum-squares. In order to estimate the population slope parameters π½π precisely by reducing the variance of ππ , there should be a large amount of variation in the π₯ππ . Finally, the larger the correlation between π₯2 and π₯3 , π23 , the bigger the variance of ππ . The final point here deserves more detailed attention. Note that the correlation coefficient π23 , given by the familiar formula, π23 = ∑(π₯2 − π₯Μ 2 )(π₯3 − π₯Μ 3 ) √∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 − π₯Μ 3 )2 measures the degree of association between π₯2 and π₯3 . Since π23 appears in the denominator of var(π2 ), 2 ), var(π3 ), and covar(π2 , π3 ) in the term (1 − π23 then the bigger the correlation coefficient π23 , the smaller the denominator, hence the bigger the variances and the covariance. When the two independent variables are correlated it is difficult to disentangle their separate effects on the independent variable. 4 The general formula for var(π2 ) is: var(π2 ) = var(π) ∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 ∑(π₯3 − π₯Μ 3 )2 − π₯Μ 3 )2 − [∑(π₯2 − π₯Μ 2 )(π₯3 − π₯Μ 3 )]2 In the denominator we can substitute from the formula for the correlation coefficient r23, measuring the correlation between x1 and x2. 
π23 = ∑(π₯2 − π₯Μ 2 )(π₯3 − π₯Μ 3 ) √∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 − π₯Μ 3 )2 2 [∑(π₯2 − π₯Μ 2 )(π₯3 − π₯Μ 3 )]2 = π23 ∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 − π₯Μ 3 )2 var(π2 ) = var(π) ∑(π₯3 − π₯Μ 3 )2 var(π) 2)= 2) ∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 − π₯Μ 3 )2 (1 − π23 ∑(π₯2 − π₯Μ 2 )2 (1 − π23 Chapter 7—Further Inference in Multiple Regression Page 14 of 20 In simple regression, π2 is the total effect of π₯ on π¦: π2 = ππ¦⁄ππ₯2 . In multiple regression, the coefficient of the first variable, π₯2 , is the estimated net effect of a change in that variable on π¦: π2 = ππ¦⁄ππ₯2 , and that of the second variable π₯3 , similarly, is the estimated net effect of a change in π₯3 on π¦: π3 = ππ¦⁄ππ₯3 . Suppose π₯2 and π₯3 are themselves linearly related, so that a change in one induces a change in the other. Then the total effect of a change in π₯2 on π¦ involves not only π2 , but it must also include the effect of the change in π₯3 induced by the change in π₯2 . To further illustrate the effect of correlation among the independent variables, consider the following two Venn diagrams (π΄) and (π΅). In both (π΄) and (π΅) each circle represents the total variation in the variable. In (π΄) the two independent variable π₯2 and π₯3 are uncorrelated as shown by the two non-overlapping circles. Here π23 = 0. Thus, the variance of the coefficients π2 and π3 are “simplified” into: var(π2 ) = var(π) (π₯ ∑ 2 − π₯Μ 2 )2 var(π3 ) = var(π) (π₯ ∑ 3 − π₯Μ 3 )2 which are the same as the variance of the slope coefficient in the simple linear regression. Also, since π23 = 0, the term ∑(π₯2 − π₯Μ 2 )(π₯3 − π₯Μ 3 ) in the numerator of covar(π2 , π3 ) formula equals zero, thus making covar(π2 , π3 ) = 0. Thus, a multiple regression of π¦ on π₯2 and π₯3 will contain the same information as is contained in two separate simple regressions. (A) xβ and xβ are uncorrelated Total variations in y Total variations in xβ Total variations in xβ (B) xβ and xβ are correlated Total variations in y Total variations in xβ Total variations in xβ In (π΅), the two circles representing the variations in the independent variables, in addition to overlapping with the π¦ circle, overlap with each other. Thus there is variation common to all three variables, shown as the area of intersection of the three circles. A simple regression of π¦ on π₯2 would involve the entire overlap between π¦ and π₯2 , but, as the diagram shows, this overlap includes also some of the variation in π₯3 . The resulting “net effect” overlap of the variation in π₯2 and π¦ is smaller than the overlap depicting the gross relationship between these two variables. The existence of the overlap between the circles representing variations in the independent variables indicates “collinearity”. The bigger the area of overlap π₯2 and π₯3 , the stronger the collinearity. The practical impact of collinearity can be observed by the impact of the correlation coefficient π23 in the denominator of the variance formula for either slope coefficient. Take the variance of π2 : var(π2 ) = var(π) ∑(π₯2 − π₯Μ 2 )2 (1 − π232 ) 2 The stronger the collinearity of π₯2 and π₯3 , the bigger π23 , the bigger the variance of π2 , and the less precise the estimate of the parameter π½2 . The variation in π₯2 about its mean π₯Μ 2 adds most to the precision of estimation when it is not connected to the variation in the other explanatory variable. 
When the variation in π₯2 about its Chapter 7—Further Inference in Multiple Regression Page 15 of 20 mean is related to the variation in the in the other explanatory variable, the precision of estimation is diminished. Example Refer to the data in the tab “cars” in the Excel file, to estimate the effect of the number of cylinders (CYL), engine size (displacement in cubic inches, ENG), and vehicle weight (WGT) on fuel consumption (MPG). First run a simple regression using only the number of cylinders as the only variable. The relevant part regression summary output is shown below: Intercept CYL Coefficients 42.916 -3.558 Standard Error 0.835 0.146 t Stat 51.404 -24.425 P-value 0.000 0.000 Lower 95% 41.274 -3.844 Upper 95% 44.557 -3.272 The r-square value is π 2 = 0.6047. Given the p-value = 0.000, clearly, the number of cylinders have a significant impact on MPG—as expected. Now run the regression using all independent variables mentioned above. SUMMARY OUTPUT Regression Statistics Multiple R 0.8362 R Square 0.6993 Adjusted R Square 0.6970 Standard Error 4.2965 Observations 392 ANOVA df Regression Residual Total Intercept CYL ENG WGT 3 388 391 SS 16656.444 7162.549 23818.993 MS 5552.148 18.460 F 300.764 Significance F 0.000 Coefficients 44.3710 -0.2678 -0.0127 -0.0057 Standard Error 1.4807 0.4131 0.0083 0.0007 t Stat 29.9665 -0.6483 -1.5362 -7.9951 P-value 0.0000 0.5172 0.1253 0.0000 Lower 95% 41.4598 -1.0799 -0.0289 -0.0071 Upper 95% 47.2821 0.5443 0.0035 -0.0043 Note that both π 2 = 0.6993 and F-statistic show that the combined impact of all variables on MPG is significant. However, considered separately, given the t-statistics and p-values for CYL and ENG, we cannot reject the null hypotheses π»0 : π½2 = 0 and π»0 : π½3 = 0, indicating that these variables have no impact on MPG! Also, using the F-test for the null hypotheses π»0 : π½2 = π½3 = 0, the F-statistic is F = 4.298 with a pvalue = 0.0142, leading us to reject the “no-effect” null hypothesis. These contradictions arise from the fact that there is strong collinearity between the variables CYL and ENG. Considering the Venn diagram shown above, there is significant overlap (correlation) between the two independent variables π₯2 = CYL and π₯3 = ENG. Using the Excel function =CORREL, the correlation coefficient for the two variables is π23 = 0.9508. Chapter 7—Further Inference in Multiple Regression Page 16 of 20 5. Identifying and Mitigating Collinearity As explained above, the collinearity problem arises from the association or correlation between the independent variables. Theoretically, if there is perfect collinearity between any two independent variables, the inverse of the X matrix does not exist. Therefore there is no unique solution for the system of normal equations that is used to obtain values of the regression coefficients. In matrix jargon, we say one row is a linear combination of another row. In any regression model, the collinearity problem would indicate that the data do not contain enough "information" about the individual effects of the explanatory variables to precisely estimate the population slope parameters π½π . Even if, in theory, two explanatory variables are perfectly collinear, the sample data may never indicate perfect collinearity. Therefore, there is always a solution for the values of ππ . But these solutions will not be precise estimates of the π½π . The question is, how can we detect the existence of significant collinearity? 
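Before answering that question, the cars-data contradiction described above can be reproduced with a few lines of code. This is a sketch; the sheet name "cars" and the column names MPG, CYL, ENG, and WGT are assumptions about how the data are stored.

```python
# Reproducing the cars example: the joint F-test on CYL and ENG rejects "no effect",
# while the individual t-tests do not, because CYL and ENG are highly collinear.
import pandas as pd
import statsmodels.api as sm

cars = pd.read_excel("CH7 DATA.xlsx", sheet_name="cars")    # assumed file/sheet/columns
y = cars["MPG"]
X = sm.add_constant(cars[["CYL", "ENG", "WGT"]])
fit = sm.OLS(y, X).fit()

print(fit.pvalues[["CYL", "ENG"]])       # individually insignificant: p ≈ 0.52 and 0.13
print(fit.f_test("CYL = 0, ENG = 0"))    # jointly significant: F ≈ 4.30, p ≈ 0.014
print(cars["CYL"].corr(cars["ENG"]))     # r23 ≈ 0.95, the source of the problem
```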
In a model with two explanatory variables, a simple way to detect collinearity is to compute the correlation coefficient using the formula, π23 = ∑(π₯2 − π₯Μ 2 )∑(π₯3 − π₯Μ 3 ) √∑(π₯2 − π₯Μ 2 )2 ∑(π₯3 − π₯Μ 3 )2 In Excel, the function is =CORREL(). For example, in the family income model with two explanatory variables the correlation coefficient between HEDU and WEDU is rββ = 0.5943. In models with more than two explanatory variables, the collinear relationships may involve more than two of the explanatory variables. To detect collinearity, we can estimate the "auxiliary" regression, where the "dependent" or explained variable is one of the explanatory variables. We run the regression using the remaining explanatory variables. In the family income model where KL6 is the additional explanatory variable, now we can use this variable as the dependent variable and run the auxiliary regression. The relevant "regression" equation is then π₯Μ4 = π1 + π2 π₯2 + π3 π₯3 Here, the objective is not to determine the coefficients, rather, the concern is the value of the R2. A large R2 value, say, above 0.80, would indicate a significant correlation between the variables under consideration. The R2 for this auxiliary regression is 0.0179, which clearly indicates the absence of collinearity. 6. Confidence and Prediction Intervals The interval estimate for the mean value of the dependent variable, π¦Μ0 , for given value of the dependent variables is the familiar general format: πΏ, π = π¦Μ0 ± π‘πΌ⁄2,(π−π) se(π¦Μ0 ) In a model with two independent variables, the variance of π¦Μ0 , from which we obtain the standard error figure to build the interval estimate, is: var(π¦Μ0 ) = var(π1 + π2 π₯02 + π3 π₯03 ) 2 2 var(π¦Μ0 ) = var(π1 ) + π₯02 var(π2 ) + π₯03 var(π3 ) + 2π₯02 cov(π1 , π2 ) + 2π₯03 cov(π1 , π3 ) + 2π₯02 π₯03 cov(π2 , π3 ) Chapter 7—Further Inference in Multiple Regression Page 17 of 20 The prediction interval for an individual value of the dependent variable, π¦0 , for given values of the independent variables, takes the following form: πΏ, π = π¦Μ0 ± π‘πΌ⁄2,(π−π) se(π¦0 ) The interval is still built around π¦Μ0 . But the standard error is now different. The difference arises from the fact that the individual value of π¦ deviates from the mean value by the prediction error. π¦0 = π¦Μ0 + π Therefore, var(π¦0 ) = var(π¦Μ0 ) + var(π) Example Use the data in the tab “burger2” to estimate the coefficients of the model. ππ΄πΏπΈπ = π½1 + π½2 ππ πΌπΆπΈ + π½3 π΄π·ππΈπ π + π½4 π΄π·ππΈπ π 2 + π’ π¦ = ππ΄πΏπΈπ π₯2 = ππ πΌπΆπΈ π₯3 = π΄π·ππΈπ π π₯4 = π΄π·ππΈπ π 2 Thus, π¦Μ = π1 + π2 π₯2 + π3 π₯3 + π4 π₯4 π¦Μ = 109.719 − 7.64π₯2 + 12.1512π₯3 − 2.768π₯4 Now, let π₯02 = $6 π₯03 = $1.9 π₯04 = (1.9)2 = 3.61 Then, π¦Μ0 = 109.719 − 7.64(6) + 12.1512(1.9) − 2.768(3.61) π¦Μ0 = 76.974 First, build a confidence interval for the mean value of π¦. We need to find var(π¦Μ0 ). In the Excel file determine the covariance matrix. 
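As an alternative to the spreadsheet calculation, var(ŷ0) can be obtained in one step as the quadratic form x0′·cov(b)·x0, which is algebraically identical to the term-by-term expansion worked out below. The following sketch assumes the sheet name "burger2" and the column names SALES, PRICE, and ADVERT.

```python
# var(yhat0) as x0' cov(b) x0 for PRICE = 6, ADVERT = 1.9, plus the confidence interval.
import numpy as np
import pandas as pd
from scipy import stats

d = pd.read_excel("CH7 DATA.xlsx", sheet_name="burger2")    # assumed file/sheet/columns
y = d["SALES"].to_numpy()
X = np.column_stack([np.ones(len(y)), d["PRICE"], d["ADVERT"], d["ADVERT"] ** 2])

b, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ b
N, K  = X.shape
cov_b = (resid @ resid / (N - K)) * np.linalg.inv(X.T @ X)  # coefficient covariance matrix

x0 = np.array([1.0, 6.0, 1.9, 1.9 ** 2])     # the given values of the regressors
yhat0    = x0 @ b                            # should be ≈ 76.974
se_yhat0 = np.sqrt(x0 @ cov_b @ x0)          # ≈ 0.9177
t_crit   = stats.t.ppf(0.975, N - K)         # ≈ 1.994
print(yhat0 - t_crit * se_yhat0, yhat0 + t_crit * se_yhat0)   # ≈ [75.14, 78.80]
```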
var(π¦Μ0 ) = var(π1 + π2 π₯02 + π3 π₯03 + π4 π₯04 ) 2 2 2 var(π¦Μ0 ) = var(π1 ) + π₯02 var(π2 ) + π₯03 varπ3 π₯3 + π₯04 var(π4 ) var(π¦Μ0 ) = + 2π₯02 cov(π1 , π2 ) + 2π₯03 cov(π1 , π3 ) + 2π₯04 cov(π1 , π4 ) var(π¦Μ0 ) = + 2π₯02 π₯03 cov(π2 , π3 ) + 2π₯02 π₯04 cov(π2 , π4 ) + 2π₯03 π₯04 cov(π3 , π4 ) var(π¦Μ0 ) = 46.22702 + (62 )(1.09399) + (1.92 )(12.64630) + (3.612 )(0.88477) var(π¦Μ0 ) = + 2(6)(−6.42611) + 2(1.9)(−11.60096) + 2(3.61)(2.93903) var(π¦Μ0 ) = + 2(6)(1.9)(0.30041) + 2(6)(3.61)(−0.08562) + 2(1.9)(3.61)(−3.28875) Chapter 7—Further Inference in Multiple Regression Page 18 of 20 var(π¦Μ0 ) = 0.8422 se(π¦Μ0 ) = 0.9177 The 95% confidence interval for π¦Μ0 when π₯02 = 6, π₯03 = 1.9, and π₯04 = 3.61 is then, πΏ, π = π¦Μ0 ± π‘πΌ⁄2,(π−π) se(π¦Μ0 ) πΏ, π = 76.974 ± (1.994)(0.9177) = 76.974 ± 1.830 = [75.144,78.804] Now the prediction interval for the individual value of π¦: var(π¦0 ) = var(π¦Μ0 ) + var(π) var(π¦0 ) = 0.8422 + 21.5787 = 22.42085 se(π¦0 ) = 4.7351 πΏ, π = π¦Μ0 ± π‘πΌ⁄2,(π−π) se(π¦0 ) πΏ, π = 76.974 ± (1.994)(4.7351) = 76.974 ± 9.441 = [67.533,86.415] Μπ ) 7. A More Practical Way of Finding π¬π(π In many models, where the number of independent variables exceeds two, obtaining the standard error of the linear combination of the regression coefficients, as we have done above, becomes very tedious and may lead to miscalculations. There is a simpler way to compute the standard error in question, as shown below. We will use the same example above. The general approach is as follows: ο· Subtract the given value of each π₯π from each value in that column. For example, π₯π2 − π₯02 = π₯π2 − 6 π₯π3 − π₯03 = π₯π3 − 1.9 π₯π4 − π₯04 = π₯π2 − 3.61 ο· Run the regression with the adjusted values of the π₯π The result for our example (see the Excel file tab “burger 3” for the full calculation) is: ANOVA Regression Residual Total df 3 71 74 SS 1583.3974 1532.0845 3115.4819 MS 527.7991 21.5787 F 24.459316 Significance F 5.59996E-11 Intercept PRICE* ADVERT* ADVERT²* Coefficients 76.974 -7.640 12.151 -2.768 Standard Error 0.9177 1.0459 3.5562 0.9406 t Stat 83.8760 -7.3044 3.4170 -2.9427 P-value 9.19E-73 3.24E-10 0.00105 0.00439 Lower 95% 75.1442 -9.7255 5.0604 -4.6435 Upper 95% 78.8039 -5.5545 19.2420 -0.8924 Note that the coefficients of ππ πΌπΆπΈ ∗ = ππ πΌπΆπΈ − 6, π΄π·ππΈπ π ∗ = π΄π·ππΈπ π − 1.9, and π΄π·ππΈπ π 2∗ = π΄π·ππΈπ π 2 − 3.61, and their standard errors are exactly equal to the coefficients and standard errors before Chapter 7—Further Inference in Multiple Regression Page 19 of 20 the adjustments to these variables. However the intercept coefficient and its standard error are different. In fact the intercept value is the predicted ππ΄πΏπΈπ for ππ πΌπΆπΈ = 6, π΄π·ππΈπ π = 1.9, and π΄π·ππΈπ π 2 = 3.61. More importantly, now we have obtained se(π¦Μ0 ) as the standard error of the intercept directly from running this regression. To obtain var(π¦0 ), var(π¦0 ) = var(π¦Μ0 ) + var(π) var(π¦0 ) = (0.9177)2 + 21.5787 = 22.4208 Note that var(π) does not change when the variables are adjusted as above. Chapter 7—Further Inference in Multiple Regression Page 20 of 20
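To close, here is a sketch of the re-centering trick of Section 7 in Python: shifting each regressor by its given value makes the intercept of the new regression equal to ŷ0 and its reported standard error equal to se(ŷ0). The file, sheet, and column names are again assumptions.

```python
# Re-centered regression: the intercept is yhat0 and its standard error is se(yhat0).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from scipy import stats

d  = pd.read_excel("CH7 DATA.xlsx", sheet_name="burger2")   # assumed file/sheet/columns
Xs = pd.DataFrame({
    "PRICE_s":   d["PRICE"] - 6,
    "ADVERT_s":  d["ADVERT"] - 1.9,
    "ADVERT2_s": d["ADVERT"] ** 2 - 3.61,
})
fit = sm.OLS(d["SALES"], sm.add_constant(Xs)).fit()

yhat0   = fit.params["const"]               # should be ≈ 76.974
se_mean = fit.bse["const"]                  # se(yhat0) ≈ 0.9177
var_e   = fit.mse_resid                     # ≈ 21.579; unchanged by the re-centering
se_pred = np.sqrt(se_mean ** 2 + var_e)     # se(y0) ≈ 4.735

t_crit = stats.t.ppf(0.975, fit.df_resid)   # ≈ 1.994
print(yhat0 - t_crit * se_mean, yhat0 + t_crit * se_mean)   # CI for the mean: ≈ [75.14, 78.80]
print(yhat0 - t_crit * se_pred, yhat0 + t_crit * se_pred)   # prediction interval: ≈ [67.53, 86.42]
```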