Chapter 7
Further Inference in Multiple Regression
1. Testing the Significance of a Model—The F-Test
2. Cases Where the F-Test and t-Test Give Contradictory Results—Collinearity
   2.1. F-Test for "Restricted Least Squares"
3. Extension of the Regression Model
4. Testing Some Economic Hypotheses
   4.1. Test the Significance of Advertising
   4.2. The Optimal Level of Advertising
5. The Use of Non-sample Information
6. Model Specification
   6.1. Consequences of Omitted and Irrelevant Variables
        6.1.1. The Omitted Variable Problem
               6.1.1.1. Proof of the Omitted Variable Bias
        6.1.2. The Irrelevant Variable Problem
   6.2. The RESET Test for Model Misspecification
7. Identifying and Mitigating Collinearity
8. Confidence and Prediction Intervals
9. A More Practical Way of Finding var(𝑦̂0)
1. Testing the Significance of a Model—The F-Test
In Chapter 5 the 𝐹-test was explained as an alternative to the t-test for the significance of a simple regression model. There we noted that the p-value for the 𝐹-test in the ANOVA section of the regression summary output was equal to the p-value for the t-test for the significance of the slope coefficient 𝑏2.
In multiple regression, however, the 𝐹-test and t-tests have different roles. The t-tests test the significance of the coefficient of each variable individually, while the 𝐹-test tests their joint explanatory power. The model
will have no explanatory power if it turns out that 𝑦 is unrelated to any of the explanatory variables. In the
two-explanatory variable regression model,
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯3 + 𝑒
we perform the following hypothesis test:
𝐻0 : 𝛽2 = 0, 𝛽3 = 0
𝐻1 : at least one of the 𝛽𝑗 is not zero
In Chapter 5 it was explained that the ratio of mean square regression (𝑀𝑆𝑅) over mean square error (𝑀𝑆𝐸)
has an 𝐹-distribution with the numerator and denominator degrees of freedom, respectively, of (π‘˜ − 1) and
(𝑛 − π‘˜).
𝐹(π‘˜−1,𝑛−π‘˜) =
∑(𝑦̂ − 𝑦̅)2 ⁄(π‘˜ − 1)
∑(𝑦 − 𝑦̂)2 ⁄(𝑛 − π‘˜)
This is the test statistic (𝑇𝑆) 𝐹, which is compared to the critical value 𝐢𝑉 = 𝐹𝛼,(π‘˜−1,𝑛−π‘˜). We reject the null hypothesis and conclude that the estimated relationship between 𝑦 and the independent variables is significant if 𝑇𝑆 > 𝐢𝑉.
In the π‘π‘’π‘Ÿπ‘”π‘’π‘Ÿ model presented in the previous chapter,
𝑆𝐴𝐿𝐸𝑆 = 𝛽1 + 𝛽2 𝑃𝑅𝐼𝐢𝐸 + 𝛽3 𝐴𝐷𝑉𝐸𝑅𝑇 + 𝑒
The estimated regression model was,
𝑆𝐴𝐿𝐸𝑆̂ = 118.914 − 7.908 𝑃𝑅𝐼𝐢𝐸 + 1.863 𝐴𝐷𝑉𝐸𝑅𝑇
To test for the significance of the overall model, the 𝐹 statistic was obtained from the ANOVA table:

ANOVA
              df    SS          MS          F           Significance F
Regression    2     1396.5389   698.26946   29.247859   5.04086E-10
Residual      72    1718.9429   23.874207
Total         74    3115.4819

𝑇𝑆 = 𝐹 = 698.2695 ⁄ 23.8742 = 29.248

𝐢𝑉 = 𝐹0.05,(2,72) = 3.124
Since 𝑇𝑆 > 𝐢𝑉, we reject the null hypothesis and conclude that the relationship between 𝑆𝐴𝐿𝐸𝑆, 𝑃𝑅𝐼𝐢𝐸 and
𝐴𝐷𝑉𝐸𝑅𝑇 is significant. Also note that the probability value shown under “Significance F” is practically 0.
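For readers who want to replicate this calculation outside Excel, the following is a minimal Python sketch (scipy is an assumption of this note, not part of the original example). It recomputes the F statistic, the critical value, and the "Significance F" p-value from the ANOVA figures quoted above.

# A minimal sketch of the overall F-test using the ANOVA figures quoted above;
# scipy's F distribution replaces Excel's =F.INV.RT and the "Significance F" column.
from scipy import stats

SSR, SSE = 1396.5389, 1718.9429     # regression and residual sums of squares
n, k = 75, 3                        # observations and number of parameters

MSR = SSR / (k - 1)                 # mean square regression
MSE = SSE / (n - k)                 # mean square error
F = MSR / MSE                       # test statistic, about 29.248

cv = stats.f.ppf(0.95, k - 1, n - k)     # critical value F(0.05, 2, 72), about 3.124
p_value = stats.f.sf(F, k - 1, n - k)    # right-tail probability, about 5.0e-10
print(F, cv, p_value)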
1.1. F-Tests for "Restricted Least Squares"
Most economic data are obtained not from controlled (laboratory) experiments but are collected as "historical" data. When data are the result of an uncontrolled experiment, many of the economic variables may be correlated, or are said to be collinear. The problem is labeled collinearity, or multicollinearity when several variables are involved. Restricted least squares can be useful when the problem of collinearity is present. We might want to test whether an explanatory variable or a group of explanatory variables is relevant in a particular model.
The 𝐹-test for one hypothesis, or a set of hypotheses, is based on a comparison of the sum of squared errors (𝑆𝑆𝐸) from the original, unrestricted model to the 𝑆𝑆𝐸 from a regression model in which the null hypothesis of the restricted model is assumed to be true. This is explained below. The unrestricted model is the original regression model.
Unrestricted model:
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯3 + 𝑒
The restricted model is one in which one or more of the explanatory variables have been removed. In a model with two independent variables we can remove one of the two variables.
Restricted model:
𝑦 = 𝛽1 + 𝛽3 π‘₯3 + 𝑒
We want to test the hypothesis that changes in, say, π‘₯2 , have no effect on 𝑦, against the alternative that it has
an effect.
𝐻0 : 𝛽2 = 0
𝐻1 : 𝛽2 ≠ 0
It was explained in Chapter 6 that a model with a larger number of independent variables would have a
smaller 𝑆𝑆𝐸. Therefore, when we remove one of the independent variables by constraining the model, the
restricted 𝑆𝑆𝐸 increases.
𝑆𝑆𝐸𝑅 > π‘†π‘†πΈπ‘ˆ
The ratio of the difference 𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ (divided by the number of restrictions 𝑗) to π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜) is a test statistic that has an 𝐹 distribution, with numerator degrees of freedom equal to the number of hypotheses in 𝐻0 and denominator degrees of freedom 𝑛 − π‘˜, where π‘˜ is the number of parameters in the unrestricted model. Here we have only one null hypothesis; therefore, 𝑗 = 1.
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)]
If 𝑇𝑆 = 𝐹(𝑗,𝑛−π‘˜) > 𝐢𝑉 = 𝐹𝛼,(𝑗,𝑛−π‘˜), then we reject the null hypothesis (or hypotheses), concluding that π‘₯2 has a significant effect on 𝑦.
In the example relating sales to price and advertising expenditure, constrain the model by removing the impact of price, π‘₯2. The unrestricted 𝑆𝑆𝐸, determined previously, is π‘†π‘†πΈπ‘ˆ = 1718.943. The restricted 𝑆𝑆𝐸 is
then (see the Excel file CH7 DATA, worksheet tab burger)
𝑆𝑆𝐸𝑅 = 2961.827
Thus,
𝐹 = [(2961.827 − 1718.943)⁄1] / [1718.943⁄(75 − 3)] = 52.06
To find the critical value, using the Excel =F.INV.RT function, we have: 𝐹0.05,(1,72) = 3.974. Since the test
statistic exceeds the critical value, we reject the null hypothesis 𝐻0 : 𝛽2 = 0 and conclude that variations in
price influence monthly sales.
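The same restricted-versus-unrestricted comparison can be sketched in a few lines of Python. The SSE values below are the ones quoted above, and scipy's F distribution stands in for Excel's =F.INV.RT; this is a replication sketch, not part of the original worksheet.

# A sketch of the restricted-least-squares F-test for H0: beta2 = 0.
from scipy import stats

SSE_R, SSE_U = 2961.827, 1718.943   # restricted and unrestricted SSE
n, k, j = 75, 3, 1                  # sample size, parameters in the unrestricted model, restrictions

F = ((SSE_R - SSE_U) / j) / (SSE_U / (n - k))   # about 52.06
cv = stats.f.ppf(0.95, j, n - k)                # about 3.974, as with =F.INV.RT
print(F, F > cv)                                # reject H0 when the statistic exceeds the critical value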
Note that when we test the significance of the overall model, in effect, we use the following methodology:
Unrestricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝑒
Restricted model: 𝑦 = 𝛽1 + 𝑒
The null and alternative hypotheses are:
𝐻0 : 𝛽2 = 0, 𝛽3 = 0
𝐻1 : at least one of the 𝛽𝑗 is non-zero
The estimated coefficient for the restricted model is simply 𝑏1 = 𝑦̅, and the regression equation is: 𝑦̂ = 𝑏1 =
𝑦̅. Therefore,
𝑆𝑆𝐸𝑅 = ∑(𝑦 − 𝑦̂)² = ∑(𝑦 − 𝑦̅)² = 𝑆𝑆𝑇
Thus,
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)] = [(𝑆𝑆𝑇 − π‘†π‘†πΈπ‘ˆ)⁄(π‘˜ − 1)] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)]

𝐹 = [𝑆𝑆𝑅⁄(π‘˜ − 1)] / [𝑆𝑆𝐸⁄(𝑛 − π‘˜)]
which is the same 𝐹 statistic as that used to test the overall significance of the original unrestricted model.
1.2. Test of the Significance of Advertising
Using the extended burger model,
𝑆𝐴𝐿𝐸𝑆 = 𝛽1 + 𝛽2 𝑃𝑅𝐼𝐢𝐸 + 𝛽3 𝐴𝐷𝑉𝐸𝑅𝑇 + 𝛽4 𝐴𝐷𝑉𝐸𝑅𝑇² + 𝑒
we can test some interesting economic hypotheses and illustrate the use of t- and 𝐹-tests. In the extended model above, which incorporates 𝐴𝐷𝑉𝐸𝑅𝑇², we wish to test whether advertising has an effect on total revenue. This means we want to perform a joint test for 𝛽3 and 𝛽4. Again, let's make use of the unrestricted/restricted model methodology used above.
Unrestricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛽4π‘₯3² + 𝑒
Restricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝑒
The joint null and alternative hypotheses are:
𝐻0 : 𝛽3 = 0, 𝛽4 = 0
𝐻1 : 𝛽3 or 𝛽4 , or both are non-zero.
From the regression output for the unrestricted model, as seen above, π‘†π‘†πΈπ‘ˆ = 1532.084. From the regression output for the restricted model (see the CH7 DATA Excel file): 𝑆𝑆𝐸𝑅 = 1896.391.
Unrestricted Model

ANOVA
              df    SS          MS
Regression    3     1583.3974   527.79914
Residual      71    1532.0845   21.578654
Total         74    3115.4819

             Coefficients   Standard Error   t Stat
Intercept    109.7190       6.7990           16.1374
PRICE        -7.6400        1.0459           -7.3044
ADVERT       12.1512        3.5562           3.4170
ADVERT²      -2.7680        0.9406           -2.9427

Restricted Model

ANOVA
              df    SS          MS
Regression    1     1219.0910   1219.0910
Residual      73    1896.3908   25.9780
Total         74    3115.4819

             Coefficients   Standard Error   t Stat
Intercept    121.9002       6.5263           18.6783
PRICE        -7.8291        1.1429           -6.8504
The test statistic F is:
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)] = [(1896.3908 − 1532.0845)⁄2] / [1532.0845⁄(75 − 4)] = 8.441

and the critical value is 𝐹0.05,(2,71) = 3.126.
Clearly we reject the null hypothesis and conclude that advertising has an effect on total revenue.
1.3. Test the Significance of Advertising—Optimal Level
What is the optimum level of advertising? Optimality in economics always requires that the marginal benefit of an action equal its marginal cost. If the marginal benefit exceeds the marginal cost, the action should be expanded. If the marginal benefit is less than the marginal cost, the action should be curtailed. The optimum is, therefore, where the two are equal.
The marginal benefit of advertising is the contribution of each additional $1 thousand of advertising expenditure to total revenue. From the model

𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛽4π‘₯3² + 𝑒

the marginal benefit of advertising is:

πœ•π‘¦⁄πœ•π‘₯3 = 𝛽3 + 2𝛽4π‘₯3
Ignoring the cost of producing the additional sales, the marginal cost of advertising is the extra $1 thousand spent on advertising itself. Thus, the optimality requirement 𝑀𝐡 = 𝑀𝐢 is:
𝛽3 + 2𝛽4 π‘₯3 = $1
Using the estimated least squares coefficients, we thus have:
12.151 + 2(−2.768)π‘₯3 = 1
Yielding, π‘₯3 = $2.014 thousand.
Now suppose we wish to use the estimated regression model to test the null hypothesis that the optimum
advertising expenditure is $1.9 thousand. That is,
𝐻0 : 𝛽3 + 2𝛽4 (1.9) = 1
𝐻1 : 𝛽3 + 2𝛽4 (1.9) ≠ 1
The test statistic for this test has a t distribution.
𝑇𝑆 = |𝑑| = (π‘†π‘Žπ‘šπ‘π‘™π‘’ π‘†π‘‘π‘Žπ‘‘π‘–π‘ π‘‘π‘–π‘ − 𝑁𝑒𝑙𝑙 π‘‰π‘Žπ‘™π‘’π‘’) ⁄ π‘†π‘‘π‘Žπ‘›π‘‘π‘Žπ‘Ÿπ‘‘ πΈπ‘Ÿπ‘Ÿπ‘œπ‘Ÿ
The sample statistic for the test is 𝑏3 + 3.8𝑏4 . The standard error in the denominator is se(𝑏3 + 3.8𝑏4 ).
𝑇𝑆 = (𝑏3 + 3.8𝑏4 − 1) ⁄ se(𝑏3 + 3.8𝑏4)
How do we find the value for the standard error? Using the properties of variance, we have
var(𝑏3 + 3.8𝑏4) = var(𝑏3) + 3.8² var(𝑏4) + 2(3.8) cov(𝑏3, 𝑏4)
We can obtain var(𝑏3 ) and var(𝑏4 ) by simply squaring the standard errors from the regression output.
Unfortunately, the Excel regression output does not provide the covariance value. We can, however, still use
Excel to compute cov(𝑏3 , 𝑏4 ). If you recall, using matrix algebra we can determine the variance-covariance
matrix.
[ var(𝑏1)          covar(𝑏1, 𝑏2)    covar(𝑏1, 𝑏3) ]
[ covar(𝑏1, 𝑏2)    var(𝑏2)          covar(𝑏2, 𝑏3) ]  = var(𝑒) X⁻¹
[ covar(𝑏1, 𝑏3)    covar(𝑏2, 𝑏3)    var(𝑏3)       ]
where X⁻¹ is the inverse of the matrix

X = [ 𝑛       ∑π‘₯2      ∑π‘₯3    ]
    [ ∑π‘₯2    ∑π‘₯2²     ∑π‘₯2π‘₯3 ]
    [ ∑π‘₯3    ∑π‘₯2π‘₯3   ∑π‘₯3²   ]

This X matrix is for a model with two independent variables. In our current model we have three independent variables. The solution for this problem is simple because the X matrix can be expanded to incorporate any number of independent variables. For a 3-variable model we have,

X = [ 𝑛       ∑π‘₯2      ∑π‘₯3      ∑π‘₯4    ]
    [ ∑π‘₯2    ∑π‘₯2²     ∑π‘₯2π‘₯3   ∑π‘₯2π‘₯4 ]
    [ ∑π‘₯3    ∑π‘₯2π‘₯3   ∑π‘₯3²     ∑π‘₯3π‘₯4 ]
    [ ∑π‘₯4    ∑π‘₯2π‘₯4   ∑π‘₯3π‘₯4   ∑π‘₯4²   ]
Using Excel we can compute these quantities as the elements of the X matrix, find the inverse, and then multiply the inverse matrix by var(𝑒): var(𝑒)X⁻¹. The result is the following covariance matrix. (The calculations are shown in the Excel file.)
          𝑏1         𝑏2        𝑏3         𝑏4
𝑏1     46.227     -6.426   -11.601      2.939
𝑏2     -6.426      1.094     0.300     -0.086
𝑏3    -11.601      0.300    12.646     -3.289
𝑏4      2.939     -0.086    -3.289      0.885
Thus,
var(𝑏3 + 3.8𝑏4) = var(𝑏3) + 3.8² var(𝑏4) + 2(3.8) covar(𝑏3, 𝑏4)
var(𝑏3 + 3.8𝑏4) = 12.646 + 3.8²(0.885) + 2(3.8)(−3.289) = 0.428
se(𝑏3 + 3.8𝑏4) = √0.428 = 0.6542
𝑇𝑆 = (1.633 − 1) ⁄ 0.6542 = 0.968
The critical value is 𝐢𝑉 = 𝑑0.025,71 = 1.994. Since 𝑇𝑆 < 𝐢𝑉, we do not reject the null hypothesis; the data are consistent with an optimal advertising expenditure of $1.9 thousand.
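The calculation of the standard error and the test statistic can also be sketched in Python; the variance and covariance entries below are the ones reported in the covariance matrix above, and numpy/scipy are assumptions of this sketch rather than part of the Excel workbook.

# A sketch of the variance of the linear combination b3 + 3.8*b4 and the resulting t statistic.
import numpy as np
from scipy import stats

var_b3, var_b4, cov_b34 = 12.646, 0.885, -3.289   # from the covariance matrix above
b3, b4 = 12.1512, -2.768                          # estimated coefficients

var_comb = var_b3 + 3.8**2 * var_b4 + 2 * 3.8 * cov_b34   # about 0.428
se_comb = np.sqrt(var_comb)                               # about 0.654
t_stat = (b3 + 3.8 * b4 - 1) / se_comb                    # about 0.968
cv = stats.t.ppf(0.975, 71)                               # about 1.994
print(t_stat, cv, abs(t_stat) > cv)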
An 𝑭-test alternative can also be used for the optimum advertising hypothesis test. To do this test state the
unrestricted and restricted models:
Unrestricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛽4π‘₯3² + 𝑒
Restricted model: 𝑦 = 𝛽1 + 𝛽2π‘₯2 + (1 − 3.8𝛽4)π‘₯3 + 𝛽4π‘₯3² + 𝑒
Note that we have used the null statement 𝛽3 +2𝛽4 (1.9) = 1, that is, 𝛽3 = 1 − 3.8𝛽4 , in place of the coefficient
of π‘₯3 . To estimate the restricted model, rearrange the model in the following format:
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + π‘₯3 + 𝛽4 (π‘₯32 − 3.8π‘₯3 ) + 𝑒
𝑦 − π‘₯3 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽4 (π‘₯32 − 3.8π‘₯3 ) + 𝑒
Running the regression on the restricted model yields a sum of squared errors of 𝑆𝑆𝐸𝑅 = 1552.286, from which we compute the F-statistic (see the Excel file):

𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄1] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)] = [(1552.286 − 1532.084)⁄1] / [1532.084⁄71] = 0.936
The critical value is 𝐢𝑉 = 𝐹0.05,(1,71) = 3.976. Note that when there is only one null hypothesis, the t- and F-tests are equivalent tests because they yield the same probability value (approximately 0.34).¹ Also note that 𝐹 = 𝑑² = (0.967572)² = 0.936195.
2. The Use of Non-sample Information
In many estimation and inference problems we have information external to the sample data. This non-sample information may come from economic theory or principles, or from experience. When available, we can combine the non-sample information with the sample information to improve the precision of the estimated parameters.
In economic analysis of demand, demand for a good depends on the price of the good, price of substitutes and
complements, and on income. Take the demand for beer. It depends on the price of beer, 𝑃𝐡 , the price of
other liquor, 𝑃𝐿 , the price of all other remaining goods and services, 𝑃𝑅 , and income (𝐼):
𝑄 = 𝑓(𝑃𝐡 , 𝑃𝐿 , 𝑃𝑅 , 𝐼)
Assuming that a log-log function form is appropriate for this demand relationship,
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + 𝛽4 ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + 𝑒
The relevant non-sample information can be derived by assuming that there is no “money illusion” with
respect to simultaneous equal-percentage increase in all prices and income. That is, the demand for beer will
not change if income and all prices, say, double. Impose this assumption on the model by multiplying all prices and income by the factor πœ†:
ln(𝑄) = 𝛽1 + 𝛽2 ln(πœ†π‘ƒπ΅ ) + 𝛽3 ln(πœ†π‘ƒπΏ ) + 𝛽4 ln(πœ†π‘ƒπ‘… ) + 𝛽5 ln(πœ†πΌ)
Using the properties of logarithm, we can rewrite the above equation as:
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + 𝛽4 ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + (𝛽2 + 𝛽3 + 𝛽4 + 𝛽5 ) ln(πœ†)
Since multiplying all prices and income by πœ† must not alter ln(𝑄), the following must be true:
𝛽2 + 𝛽3 + 𝛽4 + 𝛽5 = 0
This non-sample information thus can be imposed as a constraint on the parameters in the demand model.
Solve this restriction for 𝛽4 ,
𝛽4 = −𝛽2 − 𝛽3 − 𝛽5
¹ Use the Excel function =FDIST() to find the tail area under the F-curve for a given F statistic.
Starting from the multiple regression model for the log-log demand function above,
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + 𝛽4 ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + 𝑒
and substituting for 𝛽4 from the constraint, we have
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡 ) + 𝛽3 ln(𝑃𝐿 ) + (−𝛽2 − 𝛽3 − 𝛽5 ) ln(𝑃𝑅 ) + 𝛽5 ln(𝐼) + 𝑒
ln(𝑄) = 𝛽1 + 𝛽2 [ln(𝑃𝐡 ) − ln(𝑃𝑅 )] + 𝛽3 [ln(𝑃𝐿 ) − ln(𝑃𝑅 )] + 𝛽5 [ln(𝐼) − ln(𝑃𝑅 )] + 𝑒
The restricted model then can be written as,
ln(𝑄) = 𝛽1 + 𝛽2 ln(𝑃𝐡⁄𝑃𝑅) + 𝛽3 ln(𝑃𝐿⁄𝑃𝑅) + 𝛽5 ln(𝐼⁄𝑃𝑅) + 𝑒
The data to estimate this model are available in CH7 DATA in the tab BEER. The summary output is presented below:
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8989
R Square            0.8079
Adjusted R Square   0.7858
Standard Error      0.0617
Observations        30

ANOVA
              df    SS       MS       F         Significance F
Regression    3     0.4161   0.1387   36.4602   0.0000
Residual      26    0.0989   0.0038
Total         29    0.5150

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    -4.7978        3.7139           -1.2918   0.2078    -12.4318    2.8362
ln(PB/PR)    -1.2994        0.1657           -7.8400   0.0000    -1.6401     -0.9587
ln(PL/PR)    0.1868         0.2844           0.6569    0.5170    -0.3977     0.7714
ln(I/PR)     0.9458         0.4270           2.2148    0.0357    0.0680      1.8236
Since in the restricted model β4 = −β2 − β3 − β5, then the estimated β4 is
𝑏4∗ = −𝑏2∗ − 𝑏3∗ − 𝑏5∗ = −(−1.2994) − 0.1868 − 0.9458 = 0.1667
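As a sketch of how this restriction can be imposed in practice outside Excel, the fragment below estimates the restricted model in ratio form with statsmodels and then recovers 𝑏4* from the constraint. The column names Q, PB, PL, PR, and I mirror the notation above and are assumptions, not names taken from the Excel file.

# A sketch of restricted least squares via the ratio form of the beer demand model.
import numpy as np
import pandas as pd
import statsmodels.api as sm

# beer = pd.read_excel("CH7 DATA.xlsx", sheet_name="BEER")   # hypothetical load
def restricted_beer_model(beer: pd.DataFrame):
    y = np.log(beer["Q"])
    X = pd.DataFrame({
        "ln(PB/PR)": np.log(beer["PB"] / beer["PR"]),
        "ln(PL/PR)": np.log(beer["PL"] / beer["PR"]),
        "ln(I/PR)":  np.log(beer["I"] / beer["PR"]),
    })
    res = sm.OLS(y, sm.add_constant(X)).fit()
    # recover the restricted estimate of beta4 from the constraint b4 = -b2 - b3 - b5
    b2, b3, b5 = res.params["ln(PB/PR)"], res.params["ln(PL/PR)"], res.params["ln(I/PR)"]
    return res, -b2 - b3 - b5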
The restricted least squares estimates are biased [E(π‘π‘˜) ≠ π›½π‘˜] unless the constraints are exactly true. Also note that the variance of a restricted least squares estimator is smaller than that of the unrestricted one. By combining non-sample information with the sample information, we reduce the variation in the estimation procedure introduced by random sampling.
3. Model Specification
In any econometric investigation, choice of the model is one of the first steps. What are the important
considerations when choosing a model? What are the consequences of choosing the wrong model? Are there
any ways of assessing model specification or misspecification?
The essential features of the model choice are:
• Choice of functional form,
• Choice of explanatory variables, and
• Whether the multiple regression model assumptions hold.²
For choice of functional form and regressors, economic principles and logical reasoning play a prominent and
vital role.
3.1. Consequences of Omitted and Irrelevant Variables
3.1.1. The Omitted Variable Problem
Suppose in a particular industry the wage rate of employees, W, depends on their experience, E, and their motivation, M. Then the model is specified as:
π‘Š = 𝛽1 + 𝛽2 𝐸 + 𝛽3 𝑀 + 𝑒
However, since data on motivation are unavailable, we dispense with M and instead we estimate the model
π‘Š = 𝛽1 + 𝛽2 𝐸 + 𝑒
By estimating the alternative model we are imposing the constraint 𝛽3 = 0 when it is not true, that is, when in
fact 𝛽3 ≠ 0. By imposing this constraint, the least squares estimates 𝑏1 and 𝑏2 will be biased. Only when the
omitted variable is uncorrelated with the retained variables will the estimates be unbiased. But perfectly
uncorrelated regressors are rare.
Because of the possibility of omitted-variable bias (OVB), one must include all important relevant variables. If
an estimated equation has coefficients with unexpected signs, or unrealistic magnitudes, a possible cause of
these strange results is the omission of an important variable.
One method to determine if a variable or a group of variables should be included in a model is to perform
“significance tests”. For one variable (one null hypothesis 𝐻0 : 𝛽3 = 0), we use the t-test and for more than one
variable (two or more null hypotheses 𝐻0: 𝛽3 = 0, 𝛽4 = 0) we use the F-test. But we must also remember that we may fail to reject the null because of the quality or paucity of the data, even though the variable is in fact relevant to the model. Dropping the variable in such cases would induce omitted-variable bias in the remaining coefficient estimates.
3.1.1.1. Proof of the Omitted Variable Bias
Suppose the true model is 𝑦 = 𝛽1 + 𝛽2 π‘₯ + 𝛽3 β„Ž + 𝑒, but by omitting the variable β„Ž we estimate the model 𝑦 =
𝛽1 + 𝛽2 π‘₯ + 𝑒 instead.
Denote the least squares estimator of 𝛽2 in the reduced model by 𝑏2∗ .
² The typical violations of the assumptions are: heteroskedasticity, autocorrelation, and random regressors.
𝑏2∗ = ∑(π‘₯ − π‘₯Μ…)(𝑦 − 𝑦̅) ⁄ ∑(π‘₯ − π‘₯Μ…)² = ∑(π‘₯ − π‘₯Μ…)𝑦 ⁄ ∑(π‘₯ − π‘₯Μ…)²
To simplify the proof, let
𝑀 = (π‘₯ − π‘₯Μ…) ⁄ ∑(π‘₯ − π‘₯Μ…)²
Thus,
𝑏2∗ = ∑𝑀𝑦
Now substitute for 𝑦 from the original model, which includes h,
𝑏2∗ = ∑𝑀(𝛽1 + 𝛽2 π‘₯ + 𝛽3 β„Ž + 𝑒) = 𝛽1 ∑𝑀 + 𝛽2 ∑𝑀π‘₯ + 𝛽3 ∑π‘€β„Ž + ∑𝑀𝑒
It is simple to show that ∑𝑀 = 0 and ∑𝑀π‘₯ = 1. Thus,
𝑏2∗ = 𝛽2 + 𝛽3 ∑π‘€β„Ž + ∑𝑀𝑒
Taking the expectation of both sides (and noting that E(∑𝑀𝑒) = 0), we have
E(𝑏2∗) = 𝛽2 + 𝛽3 ∑π‘€β„Ž ≠ 𝛽2
Consider the term ∑π‘€β„Ž.
∑π‘€β„Ž =
∑(π‘₯ − π‘₯Μ… )β„Ž ∑(π‘₯ − π‘₯Μ… )(β„Ž − β„ŽΜ…)
=
∑(π‘₯ − π‘₯Μ… )2
∑(π‘₯ − π‘₯Μ… )2
∑π‘€β„Ž =
∑(π‘₯ − π‘₯Μ… )(β„Ž − β„ŽΜ…)⁄(𝑛 − 1) cov(π‘₯, β„Ž)
=
∑(π‘₯ − π‘₯Μ… )2 ⁄(𝑛 − 1)
var(π‘₯)
The numerator is the covariance of π‘₯ and β„Ž, and the denominator the variance of π‘₯. This allows us to write:
E(𝑏2∗) = 𝛽2 + 𝛽3 cov(π‘₯, β„Ž) ⁄ var(π‘₯) ≠ 𝛽2
The OVB here is shown as the difference between E(𝑏2∗ ) and 𝛽2 :
bias(𝑏2∗) = E(𝑏2∗) − 𝛽2 = 𝛽3 cov(π‘₯, β„Ž) ⁄ var(π‘₯)
Knowing the sign of 𝛽3 and the sign of cov(π‘₯, β„Ž) tells us the direction of the bias. Also note that if π‘₯ and β„Ž are
uncorrelated, their covariance will be zero. Thus the bias disappears and E(𝑏2∗ ) = 𝛽2 .
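The result can also be illustrated with a small simulation sketch on hypothetical data (not the chapter's data set): when π‘₯ and β„Ž are positively correlated and 𝛽3 > 0, the short regression's slope on π‘₯ is biased upward by roughly 𝛽3 cov(π‘₯, β„Ž)⁄var(π‘₯).

# A simulation sketch of omitted-variable bias with synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
h = 0.6 * x + rng.normal(size=n)                     # h is correlated with x
y = 1.0 + 2.0 * x + 3.0 * h + rng.normal(size=n)     # true beta2 = 2, beta3 = 3

X_short = np.column_stack([np.ones(n), x])           # short regression omits h
b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]

expected_bias = 3.0 * np.cov(x, h)[0, 1] / np.var(x) # about 3 * 0.6 = 1.8
print(b_short[1], 2.0 + expected_bias)               # both close to 3.8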
Example
The worksheet "edu-inc" in the file "CH7 DATA" contains 428 observations relating annual family income (𝑦 = 𝐹𝐴𝑀𝐼𝑁𝐢) to the level of education of the income earners. The explanatory variables are the husband's years of education (π‘₯2 = π»πΈπ·π‘ˆ) and the wife's years of education (π‘₯3 = π‘ŠπΈπ·π‘ˆ).
The regression outcome shows family income rises by $3,132 for each additional year of the husband's
education and by $4,523 for each additional year of the wife's education.
𝑦̂ = −5533.629 + 3131.509π‘₯2 + 4522.641π‘₯3
If, however, we omit π‘₯3 = π‘ŠπΈπ·π‘ˆ from the model the regression equation becomes,
𝑦̂ = 2619.27 + 5155.483π‘₯2
With the effect of an extra year of the husband's education on family income rising by nearly $2,000 to
$5,155, the model overstates the contribution of the husband's educational attainment to the family income.
Denote the biased coefficient as 𝑏2∗ . Then the omitted variable bias, as explained above, is
bias(𝑏2∗) = 𝛽3 cov(π‘₯2, π‘₯3) ⁄ var(π‘₯2)
We can show that the omitted variable imparts a positive bias to the model: the estimated coefficient 𝑏3 = 4522.641 > 0 and, using Excel, cov(π‘₯2, π‘₯3) = 4.113 > 0.³
Now include a third explanatory variable, the number of children under 6 years of age, π‘₯4 = 𝐾𝐿6. The regression equation is as follows. Note that the coefficient of 𝐾𝐿6 is negative, implying that the larger the number of children in the family, the lower the income (fewer hours worked). For each additional child, family income is reduced by $14,311. For comparison, the regression equation for the original model is shown below the new equation.
𝑦̂ = −7755.3 + 3211.5π‘₯2 + 4776.9π‘₯3 − 14310.9π‘₯4
𝑦̂ = −5533.6 + 3131.5π‘₯2 + 4522.6π‘₯3
Note that the inclusion of the π‘₯4 = 𝐾𝐿6 variable does not alter the coefficients of the original model by much.
The reason for this is that the 𝐾𝐿6 variable is not highly correlated with the education variables. Even though
the new variable is relevant (the p-value in the regression output is 0.0044), its omission would not impart an
OVB because of the absence of significant correlation with existing explanatory variables (here the education
variables).
3.1.2. The Irrelevant Variable Problem
The opposite of the omitted variable problem is the irrelevant variable problem. The irrelevant variable does not make the other estimated coefficients biased, but it increases their variance (or standard error) if the irrelevant variable is significantly correlated with the existing variables. Recall the role of (1 − π‘Ÿπ‘—π‘˜²) in the denominator of the formula for the variance of 𝑏𝑗. Even though the irrelevant variable is unlikely to influence the dependent variable, it could be correlated with one or both of the other explanatory variables, thus increasing their variance.
3.2. The RESET Test for Model Misspecification
The general idea behind the RESET test (Regression Specification Error Test) is that if we can significantly improve the model by artificially including powers of the predictions of the model (that is, 𝑦̂² and 𝑦̂³), then we can conclude that the original model is inadequate or misspecified. Let's state the original model as:
𝑦 = 𝛽1 + 𝛽2 π‘₯2 + 𝛽3 π‘₯3 + 𝑒
which is estimated by
³ Do not confuse cov(π‘₯2, π‘₯3) with cov(𝑏2, 𝑏3)!
𝑦̂ = 𝑏1 + 𝑏2 π‘₯2 + 𝑏3 π‘₯3
We can include the squared prediction of 𝑦, 𝑦̂², alone or along with its cubic value, 𝑦̂³, in the original model:
𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛾1𝑦̂² + 𝑒
𝑦 = 𝛽1 + 𝛽2π‘₯2 + 𝛽3π‘₯3 + 𝛾1𝑦̂² + 𝛾2𝑦̂³ + 𝑒
(The symbol "γ" is the Greek letter Gamma.) In the first model we can then test the hypotheses,
𝐻0 : 𝛾1 = 0
𝐻1 : 𝛾1 ≠ 0
and in the second one,
𝐻0 : 𝛾1 = 0, 𝛾2 = 0
𝐻1 : 𝛾1 ≠ 0, or 𝛾2 ≠ 0, or both ≠ 0
If we reject the null hypothesis and conclude that 𝛾1 or 𝛾2 is significantly different from zero, then the
inclusion of the powers of prediction of 𝑦 in the model has improved the model. Therefore, the original model
is inadequate. If we have omitted variables, and these variables are correlated with π‘₯2 and π‘₯3 , then some of
their effect may be picked up by the inclusion of the powers of prediction of 𝑦.
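The procedure can be carried out in a few lines of Python. The sketch below treats the original model as the restricted model, adds powers of the fitted values, and forms the F statistic exactly as described above; statsmodels and pandas inputs are assumptions of this sketch, not part of the Excel workflow used in the chapter.

# A "by hand" RESET test sketch: compare the original model with one augmented by powers of yhat.
import pandas as pd
import statsmodels.api as sm
from scipy import stats

def reset_test(y: pd.Series, X: pd.DataFrame, powers=(2, 3)):
    X1 = sm.add_constant(X)
    restricted = sm.OLS(y, X1).fit()                 # original model (restricted)
    X2 = X1.copy()
    for p in powers:                                 # add yhat^2 (and yhat^3)
        X2[f"yhat^{p}"] = restricted.fittedvalues ** p
    unrestricted = sm.OLS(y, X2).fit()
    j = len(powers)                                  # number of restrictions
    n, k = X2.shape                                  # k counts all parameters of the unrestricted model
    F = ((restricted.ssr - unrestricted.ssr) / j) / (unrestricted.ssr / (n - k))
    p_value = stats.f.sf(F, j, n - k)
    return F, p_value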
Let us use the family income model above to do the RESET test. To conduct a RESET test, run the regression first by adding 𝑦̂² alone, and then by including both 𝑦̂² and 𝑦̂³ in the original model. In running a RESET test we treat the original model as the restricted model and use the F-test described above. Using the estimated regression equation with 𝐾𝐿6 included,
𝑦̂ = −7755.3 + 3211.5π‘₯2 + 4776.9π‘₯3 − 14310.9π‘₯4
and then running the model with 𝑦̂² added,
𝑦̂* = 𝑏1 + 𝑏2π‘₯2 + 𝑏3π‘₯3 + 𝑏4π‘₯4 + 𝑔1𝑦̂²
The ANOVA sections of the computer output for the two models are shown below:
𝑦̂* = 𝑏1 + 𝑏2π‘₯2 + 𝑏3π‘₯3 + 𝑏4π‘₯4 + 𝑔1𝑦̂²

ANOVA
              df     SS          MS
Regression    4      1.568E+11   3.92E+10
Residual      423    6.743E+11   1.594E+09
Total         427    8.311E+11

             Coefficients   Standard Error   t Stat
Intercept    87242.9829     40389.3906       2.1600
HEDU         -2381.4657     2419.6918        -0.9842
WEDU         -4235.1089     3832.1395        -1.1052
KL6          10887.3371     11439.2762       0.9518
𝑦̂²           0.000010       0.0000041        2.4462

𝑦̂ = 𝑏1 + 𝑏2π‘₯2 + 𝑏3π‘₯3 + 𝑏4π‘₯4

ANOVA
              df     SS          MS
Regression    3      1.472E+11   4.91E+10
Residual      424    6.838E+11   1.61E+09
Total         427    8.311E+11

             Coefficients   Standard Error   t Stat
Intercept    -7755.330      11162.935        -0.695
HEDU         3211.526       796.703          4.031
WEDU         4776.907       1061.164         4.502
KL6          -14310.921     5003.928         -2.860
The F-statistic for the test 𝐻0 : 𝛾1 = 0 is:
𝐹 = [(𝑆𝑆𝐸𝑅 − π‘†π‘†πΈπ‘ˆ)⁄𝑗] / [π‘†π‘†πΈπ‘ˆ⁄(𝑛 − π‘˜)]
𝑆𝑆𝐸𝑅 = 6.838E + 11
π‘†π‘†πΈπ‘ˆ = 6.743E + 11
𝑗=1
𝑛 − π‘˜ = 423
𝐹 = 5.984
The probability value is =F.DIST.RT(5.984, 1, 423) = 0.0148. At a 5% level of significance, we reject 𝐻0 and
conclude that predictions squared does improve the original model. This indicates that the original model
(with the kids variable included) is misspecified.
Including, in addition, the “predictions cubed” and running the regression provides the following ANOVA
table and subsequent F-statistic.
ANOVA
              df     SS            MS         F          Significance F
Regression    5      1.57219E+11   3.14E+10   19.69122   1.2E-17
Residual      422    6.73868E+11   1.6E+09
Total         427    8.31087E+11
𝑆𝑆𝐸𝑅 = 6.838E+11
π‘†π‘†πΈπ‘ˆ = 6.739E+11
𝑗=2
𝑛 − π‘˜ = 422
𝐹 = 3.1226
The p-value is =F.DIST.RT(3.1226, 2, 422) = 0.0451. We reject 𝐻0: 𝛾1 = 0, 𝛾2 = 0 at a 5% level of significance. This would indicate that the model could be improved upon by adding more variables, such as the age of the wage earner, experience, or the geographic location of the household (rural versus urban).
4. Cases Where the F-Test and t-Tests Give Contradictory Results—Collinearity
In some cases, the 𝐹-test may indicate that the overall model is significant, but individual t-tests lead us to conclude that the individual 𝛽𝑗 are not significantly different from zero. When collinearity among the explanatory variables exists, that is, when the independent variables themselves are correlated (when their explanatory powers overlap), the standard errors of the regression coefficients will be large, making the t-test statistics small, thus leading us to conclude that each 𝛽𝑗 individually is not significantly different from zero. The following is a discussion of the collinearity problem in multiple regression models.
4.1. Collinearity
To gain a better understanding of the relationship between the dependent variable y and the explanatory
variables, it is important to understand the factors affecting the variance and covariance of the coefficients.
To that end, consider the actual formulas for the variances of the slope coefficients and their covariance.
var(𝑏2) = var(𝑒) ⁄ [∑(π‘₯2 − π‘₯Μ…2)² (1 − π‘Ÿ23²)]⁴

var(𝑏3) = var(𝑒) ⁄ [∑(π‘₯3 − π‘₯Μ…3)² (1 − π‘Ÿ23²)]

covar(𝑏2, 𝑏3) = var(𝑒) [−∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3)] ⁄ [∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)² (1 − π‘Ÿ23²)]

• The larger the variance of the disturbance term var(𝑒), as estimated by the sample variance of the residuals, the larger the variance (and covariance) of the least squares estimators. This means that the dependent variable data are more widely scattered about the regression plane, indicating a weaker association between 𝑦 and the respective explanatory variable.
• Note that the sum of squared deviations for each independent variable appears in the denominator of the three formulas. The bigger these are, the smaller the variances and the covariance. Two factors affect these sum-squares:
  o The sample size: the larger the sample size 𝑛, the bigger the sum-squares.
  o The degree of dispersion of the π‘₯𝑖𝑗 data about their respective mean π‘₯̅𝑗: the more dispersed the π‘₯𝑖𝑗 data are about their mean, the bigger the sum-squares. In order to estimate the population slope parameters 𝛽𝑗 precisely by reducing the variance of 𝑏𝑗, there should be a large amount of variation in the π‘₯𝑖𝑗.
• Finally, the larger the correlation between π‘₯2 and π‘₯3, π‘Ÿ23, the bigger the variance of 𝑏𝑗.
The final point here deserves more detailed attention. Note that the correlation coefficient π‘Ÿ23 , given by the
familiar formula,
π‘Ÿ23 =
∑(π‘₯2 − π‘₯Μ…2 )(π‘₯3 − π‘₯Μ…3 )
√∑(π‘₯2 − π‘₯Μ…2 )2 ∑(π‘₯3 − π‘₯Μ…3 )2
measures the degree of association between π‘₯2 and π‘₯3. Since π‘Ÿ23 appears in the denominators of var(𝑏2), var(𝑏3), and covar(𝑏2, 𝑏3) through the term (1 − π‘Ÿ23²), the bigger the correlation coefficient π‘Ÿ23, the smaller the denominator, hence the bigger the variances and the covariance. When the two independent variables are correlated it is difficult to disentangle their separate effects on the dependent variable.
⁴ The general formula for var(𝑏2) is:
var(𝑏2) = var(𝑒) ∑(π‘₯3 − π‘₯Μ…3)² ⁄ {∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)² − [∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3)]²}
In the denominator we can substitute from the formula for the correlation coefficient π‘Ÿ23, measuring the correlation between π‘₯2 and π‘₯3:
π‘Ÿ23 = ∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3) ⁄ √[∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)²]
so that [∑(π‘₯2 − π‘₯Μ…2)(π‘₯3 − π‘₯Μ…3)]² = π‘Ÿ23² ∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)². Thus,
var(𝑏2) = var(𝑒) ∑(π‘₯3 − π‘₯Μ…3)² ⁄ [∑(π‘₯2 − π‘₯Μ…2)² ∑(π‘₯3 − π‘₯Μ…3)² (1 − π‘Ÿ23²)] = var(𝑒) ⁄ [∑(π‘₯2 − π‘₯Μ…2)² (1 − π‘Ÿ23²)]
In simple regression, 𝑏2 is the total effect of π‘₯ on 𝑦: 𝑏2 = 𝑑𝑦⁄𝑑π‘₯2 . In multiple regression, the coefficient of the
first variable, π‘₯2 , is the estimated net effect of a change in that variable on 𝑦: 𝑏2 = πœ•π‘¦⁄πœ•π‘₯2 , and that of the
second variable π‘₯3 , similarly, is the estimated net effect of a change in π‘₯3 on 𝑦: 𝑏3 = πœ•π‘¦⁄πœ•π‘₯3 . Suppose π‘₯2 and
π‘₯3 are themselves linearly related, so that a change in one induces a change in the other. Then the total effect
of a change in π‘₯2 on 𝑦 involves not only 𝑏2 , but it must also include the effect of the change in π‘₯3 induced by
the change in π‘₯2 .
To further illustrate the effect of correlation among the independent variables, consider the following two
Venn diagrams (𝐴) and (𝐡). In both (𝐴) and (𝐡) each circle represents the total variation in the variable. In
(𝐴) the two independent variables π‘₯2 and π‘₯3 are uncorrelated, as shown by the two non-overlapping circles. Here π‘Ÿ23 = 0. Thus, the variances of the coefficients 𝑏2 and 𝑏3 are "simplified" into:
var(𝑏2) = var(𝑒) ⁄ ∑(π‘₯2 − π‘₯Μ…2)²        var(𝑏3) = var(𝑒) ⁄ ∑(π‘₯3 − π‘₯Μ…3)²
which are the same as the variance of the slope coefficient in the simple linear regression. Also, since π‘Ÿ23 = 0,
the term ∑(π‘₯2 − π‘₯Μ…2 )(π‘₯3 − π‘₯Μ…3 ) in the numerator of covar(𝑏2 , 𝑏3 ) formula equals zero, thus making
covar(𝑏2 , 𝑏3 ) = 0. Thus, a multiple regression of 𝑦 on π‘₯2 and π‘₯3 will contain the same information as is
contained in two separate simple regressions.
[Venn diagrams (𝐴) and (𝐡): each circle represents the total variation in 𝑦, π‘₯2, or π‘₯3. In (𝐴), π‘₯2 and π‘₯3 are uncorrelated and their circles do not overlap with each other; in (𝐡), π‘₯2 and π‘₯3 are correlated and their circles overlap with each other as well as with 𝑦.]
In (𝐡), the two circles representing the variations in the independent variables, in addition to overlapping
with the 𝑦 circle, overlap with each other. Thus there is variation common to all three variables, shown as the
area of intersection of the three circles. A simple regression of 𝑦 on π‘₯2 would involve the entire overlap
between 𝑦 and π‘₯2 , but, as the diagram shows, this overlap includes also some of the variation in π‘₯3 . The
resulting “net effect” overlap of the variation in π‘₯2 and 𝑦 is smaller than the overlap depicting the gross
relationship between these two variables.
The existence of the overlap between the circles representing variations in the independent variables indicates "collinearity". The bigger the area of overlap between π‘₯2 and π‘₯3, the stronger the collinearity. The practical impact of collinearity can be seen from the role of the correlation coefficient π‘Ÿ23 in the denominator of the variance formula for either slope coefficient. Take the variance of 𝑏2:
var(𝑏2) = var(𝑒) ⁄ [∑(π‘₯2 − π‘₯Μ…2)² (1 − π‘Ÿ23²)]
The stronger the collinearity of π‘₯2 and π‘₯3, the bigger π‘Ÿ23², the bigger the variance of 𝑏2, and the less precise the estimate of the parameter 𝛽2. The variation in π‘₯2 about its mean π‘₯Μ…2 adds most to the precision of estimation when it is not connected to the variation in the other explanatory variable. When the variation in π‘₯2 about its
mean is related to the variation in the other explanatory variable, the precision of estimation is diminished.
Example
Refer to the data in the tab “cars” in the Excel file, to estimate the effect of the number of cylinders (CYL),
engine size (displacement in cubic inches, ENG), and vehicle weight (WGT) on fuel consumption (MPG).
First run a simple regression using the number of cylinders as the only explanatory variable. The relevant part of the regression summary output is shown below:
             Coefficients   Standard Error   t Stat     P-value   Lower 95%   Upper 95%
Intercept    42.916         0.835            51.404     0.000     41.274      44.557
CYL          -3.558         0.146            -24.425    0.000     -3.844      -3.272
The 𝑅² value is 0.6047. Given the p-value of 0.000, the number of cylinders clearly has a significant impact on MPG, as expected.
Now run the regression using all independent variables mentioned above.
SUMMARY OUTPUT

Regression Statistics
Multiple R          0.8362
R Square            0.6993
Adjusted R Square   0.6970
Standard Error      4.2965
Observations        392

ANOVA
              df     SS          MS         F         Significance F
Regression    3      16656.444   5552.148   300.764   0.000
Residual      388    7162.549    18.460
Total         391    23818.993

             Coefficients   Standard Error   t Stat    P-value   Lower 95%   Upper 95%
Intercept    44.3710        1.4807           29.9665   0.0000    41.4598     47.2821
CYL          -0.2678        0.4131           -0.6483   0.5172    -1.0799     0.5443
ENG          -0.0127        0.0083           -1.5362   0.1253    -0.0289     0.0035
WGT          -0.0057        0.0007           -7.9951   0.0000    -0.0071     -0.0043
Note that both 𝑅² = 0.6993 and the F-statistic show that the combined impact of all variables on MPG is significant. However, considered separately, given the t-statistics and p-values for CYL and ENG, we cannot reject the null hypotheses 𝐻0: 𝛽2 = 0 and 𝐻0: 𝛽3 = 0, indicating that these variables individually have no impact on MPG! Also, using the F-test for the joint null hypothesis 𝐻0: 𝛽2 = 𝛽3 = 0, the F-statistic is F = 4.298 with a p-value = 0.0142, leading us to reject the "no-effect" null hypothesis.
These contradictions arise from the fact that there is strong collinearity between the variables CYL and ENG.
Considering the Venn diagram shown above, there is significant overlap (correlation) between the two
independent variables π‘₯2 = CYL and π‘₯3 = ENG. Using the Excel function =CORREL, the correlation coefficient
for the two variables is π‘Ÿ23 = 0.9508.
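A quick way to confirm this outside Excel is to compute the pairwise correlations of the regressors. The sketch below assumes the cars data have been loaded into a pandas dataframe with columns named CYL, ENG, and WGT, mirroring the text; the column names are assumptions of the sketch.

# A sketch of a pairwise collinearity check (the Python counterpart of Excel's =CORREL).
import pandas as pd

# cars = pd.read_excel("CH7 DATA.xlsx", sheet_name="cars")   # hypothetical load
def collinearity_check(cars: pd.DataFrame) -> pd.DataFrame:
    # full correlation matrix of the regressors; r(CYL, ENG) is about 0.95 in this example
    return cars[["CYL", "ENG", "WGT"]].corr()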
5. Identifying and Mitigating Collinearity
As explained above, the collinearity problem arises from the association or correlation between the
independent variables. Theoretically, if there is perfect collinearity between any two independent variables,
the inverse of the X matrix does not exist. Therefore there is no unique solution for the system of normal
equations that is used to obtain values of the regression coefficients. In matrix jargon, we say one row is a
linear combination of another row.
In any regression model, the collinearity problem would indicate that the data do not contain enough
"information" about the individual effects of the explanatory variables to precisely estimate the population
slope parameters 𝛽𝑗 . Even if, in theory, two explanatory variables are perfectly collinear, the sample data may
never indicate perfect collinearity. Therefore, there is always a solution for the values of 𝑏𝑗 . But these
solutions will not be precise estimates of the 𝛽𝑗 . The question is, how can we detect the existence of
significant collinearity?
In a model with two explanatory variables, a simple way to detect collinearity is to compute the correlation
coefficient using the formula,
π‘Ÿ23 =
∑(π‘₯2 − π‘₯Μ…2 )∑(π‘₯3 − π‘₯Μ…3 )
√∑(π‘₯2 − π‘₯Μ…2 )2 ∑(π‘₯3 − π‘₯Μ…3 )2
In Excel, the function is =CORREL(). For example, in the family income model with two explanatory variables
the correlation coefficient between HEDU and WEDU is r₂₃ = 0.5943.
In models with more than two explanatory variables, the collinear relationships may involve more than two
of the explanatory variables. To detect collinearity, we can estimate the "auxiliary" regression, where the
"dependent" or explained variable is one of the explanatory variables. We run the regression using the
remaining explanatory variables. In the family income model where KL6 is the additional explanatory
variable, now we can use this variable as the dependent variable and run the auxiliary regression. The
relevant "regression" equation is then
π‘₯Μ‚4 = π‘Ž1 + π‘Ž2 π‘₯2 + π‘Ž3 π‘₯3
Here, the objective is not to determine the coefficients; rather, the concern is the value of 𝑅². A large 𝑅² value, say above 0.80, would indicate a significant correlation between the variables under consideration. The 𝑅² for this auxiliary regression is 0.0179, which clearly indicates the absence of collinearity.
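The auxiliary-regression idea is easy to automate. The following sketch (statsmodels assumed) returns the auxiliary 𝑅² for any chosen explanatory variable, from which the variance inflation factor 1⁄(1 − 𝑅²) can also be computed; the function and column names are illustrative assumptions.

# A sketch of the auxiliary-regression diagnostic for collinearity.
import pandas as pd
import statsmodels.api as sm

def auxiliary_r2(X: pd.DataFrame, target: str) -> float:
    others = X.drop(columns=[target])                      # regress the target variable on the rest
    res = sm.OLS(X[target], sm.add_constant(others)).fit()
    return res.rsquared                                    # e.g. about 0.018 for KL6 on HEDU and WEDU

# vif = 1 / (1 - auxiliary_r2(X, "KL6"))                   # variance inflation factor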
6. Confidence and Prediction Intervals
The interval estimate for the mean value of the dependent variable, 𝑦̂0, for given values of the independent variables has the familiar general format:
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦̂0 )
In a model with two independent variables, the variance of 𝑦̂0 , from which we obtain the standard error figure
to build the interval estimate, is:
var(𝑦̂0) = var(𝑏1 + 𝑏2π‘₯02 + 𝑏3π‘₯03)
var(𝑦̂0) = var(𝑏1) + π‘₯02² var(𝑏2) + π‘₯03² var(𝑏3) + 2π‘₯02 cov(𝑏1, 𝑏2) + 2π‘₯03 cov(𝑏1, 𝑏3) + 2π‘₯02π‘₯03 cov(𝑏2, 𝑏3)
The prediction interval for an individual value of the dependent variable, 𝑦0 , for given values of the
independent variables, takes the following form:
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦0 )
The interval is still built around 𝑦̂0 . But the standard error is now different. The difference arises from the
fact that the individual value of 𝑦 deviates from the mean value by the prediction error.
𝑦0 = 𝑦̂0 + 𝑒
Therefore,
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
Example
Use the data in the tab “burger2” to estimate the coefficients of the model.
𝑆𝐴𝐿𝐸𝑆 = 𝛽1 + 𝛽2 𝑃𝑅𝐼𝐢𝐸 + 𝛽3 𝐴𝐷𝑉𝐸𝑅𝑇 + 𝛽4 𝐴𝐷𝑉𝐸𝑅𝑇 2 + 𝑒
𝑦 = 𝑆𝐴𝐿𝐸𝑆
π‘₯2 = 𝑃𝑅𝐼𝐢𝐸
π‘₯3 = 𝐴𝐷𝑉𝐸𝑅𝑇
π‘₯4 = 𝐴𝐷𝑉𝐸𝑅𝑇 2
Thus,
𝑦̂ = 𝑏1 + 𝑏2 π‘₯2 + 𝑏3 π‘₯3 + 𝑏4 π‘₯4
𝑦̂ = 109.719 − 7.64π‘₯2 + 12.1512π‘₯3 − 2.768π‘₯4
Now, let
π‘₯02 = $6
π‘₯03 = $1.9
π‘₯04 = (1.9)2 = 3.61
Then,
𝑦̂0 = 109.719 − 7.64(6) + 12.1512(1.9) − 2.768(3.61)
𝑦̂0 = 76.974
First, build a confidence interval for the mean value of 𝑦.
We need to find var(𝑦̂0 ). In the Excel file determine the covariance matrix.
var(𝑦̂0) = var(𝑏1 + 𝑏2π‘₯02 + 𝑏3π‘₯03 + 𝑏4π‘₯04)
var(𝑦̂0) = var(𝑏1) + π‘₯02² var(𝑏2) + π‘₯03² var(𝑏3) + π‘₯04² var(𝑏4)
         + 2π‘₯02 cov(𝑏1, 𝑏2) + 2π‘₯03 cov(𝑏1, 𝑏3) + 2π‘₯04 cov(𝑏1, 𝑏4)
         + 2π‘₯02π‘₯03 cov(𝑏2, 𝑏3) + 2π‘₯02π‘₯04 cov(𝑏2, 𝑏4) + 2π‘₯03π‘₯04 cov(𝑏3, 𝑏4)
var(𝑦̂0) = 46.22702 + (6²)(1.09399) + (1.9²)(12.64630) + (3.61²)(0.88477)
         + 2(6)(−6.42611) + 2(1.9)(−11.60096) + 2(3.61)(2.93903)
         + 2(6)(1.9)(0.30041) + 2(6)(3.61)(−0.08562) + 2(1.9)(3.61)(−3.28875)
var(𝑦̂0 ) = 0.8422
se(𝑦̂0 ) = 0.9177
The 95% confidence interval for 𝑦̂0 when π‘₯02 = 6, π‘₯03 = 1.9, and π‘₯04 = 3.61 is then,
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦̂0 )
𝐿, π‘ˆ = 76.974 ± (1.994)(0.9177) = 76.974 ± 1.830 = [75.144,78.804]
Now the prediction interval for the individual value of 𝑦:
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
var(𝑦0 ) = 0.8422 + 21.5787 = 22.42085
se(𝑦0 ) = 4.7351
𝐿, π‘ˆ = 𝑦̂0 ± 𝑑𝛼⁄2,(𝑛−π‘˜) se(𝑦0 )
𝐿, π‘ˆ = 76.974 ± (1.994)(4.7351) = 76.974 ± 9.441 = [67.533,86.415]
Μ‚πŸŽ )
7. A More Practical Way of Finding se(𝑦̂0)
In many models, where the number of independent variables exceeds two, obtaining the standard error of the
linear combination of the regression coefficients, as we have done above, becomes very tedious and may lead
to miscalculations. There is a simpler way to compute the standard error in question, as shown below.
We will use the same example as above. The general approach is as follows:
• Subtract the given value of each π‘₯𝑗 from each value in that column. For example,
  π‘₯𝑖2 − π‘₯02 = π‘₯𝑖2 − 6
  π‘₯𝑖3 − π‘₯03 = π‘₯𝑖3 − 1.9
  π‘₯𝑖4 − π‘₯04 = π‘₯𝑖4 − 3.61
• Run the regression with the adjusted values of the π‘₯𝑗.
The result for our example (see the Excel file tab “burger 3” for the full calculation) is:
ANOVA
              df    SS          MS         F           Significance F
Regression    3     1583.3974   527.7991   24.459316   5.59996E-11
Residual      71    1532.0845   21.5787
Total         74    3115.4819

             Coefficients   Standard Error   t Stat    P-value    Lower 95%   Upper 95%
Intercept    76.974         0.9177           83.8760   9.19E-73   75.1442     78.8039
PRICE*       -7.640         1.0459           -7.3044   3.24E-10   -9.7255     -5.5545
ADVERT*      12.151         3.5562           3.4170    0.00105    5.0604      19.2420
ADVERT²*     -2.768         0.9406           -2.9427   0.00439    -4.6435     -0.8924
Note that the coefficients of 𝑃𝑅𝐼𝐢𝐸* = 𝑃𝑅𝐼𝐢𝐸 − 6, 𝐴𝐷𝑉𝐸𝑅𝑇* = 𝐴𝐷𝑉𝐸𝑅𝑇 − 1.9, and 𝐴𝐷𝑉𝐸𝑅𝑇²* = 𝐴𝐷𝑉𝐸𝑅𝑇² − 3.61, and their standard errors, are exactly equal to the coefficients and standard errors before
the adjustments to these variables. However the intercept coefficient and its standard error are different. In
fact the intercept value is the predicted 𝑆𝐴𝐿𝐸𝑆 for 𝑃𝑅𝐼𝐢𝐸 = 6, 𝐴𝐷𝑉𝐸𝑅𝑇 = 1.9, and 𝐴𝐷𝑉𝐸𝑅𝑇 2 = 3.61. More
importantly, now we have obtained se(𝑦̂0 ) as the standard error of the intercept directly from running this
regression.
To obtain var(𝑦0 ),
var(𝑦0 ) = var(𝑦̂0 ) + var(𝑒)
var(𝑦0 ) = (0.9177)2 + 21.5787 = 22.4208
Note that var(𝑒) does not change when the variables are adjusted as above.
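A sketch of the re-centering trick in Python is shown below. The dataframe and column names (SALES, PRICE, ADVERT, ADVERT2) are assumptions mirroring the burger2 worksheet; the intercept of the shifted regression delivers 𝑦̂0 and se(𝑦̂0) directly, as described above.

# A sketch of obtaining se(yhat0) as the intercept's standard error after re-centering the regressors.
import pandas as pd
import statsmodels.api as sm

def se_of_prediction(df: pd.DataFrame, x0: dict, y_col: str = "SALES"):
    X = pd.DataFrame({col: df[col] - val for col, val in x0.items()})   # x_ij - x_0j for each regressor
    res = sm.OLS(df[y_col], sm.add_constant(X)).fit()
    return res.params["const"], res.bse["const"]    # yhat0 and se(yhat0)

# y0_hat, se_y0 = se_of_prediction(burger2, {"PRICE": 6, "ADVERT": 1.9, "ADVERT2": 3.61})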