CHAPTER 5
PREDICTION, GOODNESS-OF-FIT, AND MODELING ISSUES

1. Confidence Interval for the Mean Value of y for a Given x, and Prediction Interval for the Individual Value of y for a Given x
2. Coefficient of Determination, R²
3. Correlation Analysis
   3.1. The Relationship Between R² and r
   3.2. Another Point About R²
4. Reporting Regression Results
   4.1. Computer Output
   4.2. Reporting the Summary Results
5. The F Test of Goodness of Fit
6. Modeling Issues
   6.1. The Effects of Scaling the Data
        6.1.1. Changing the scale of x
        6.1.2. Changing the scale of y
        6.1.3. Changing the scale of x and y by the same factor c
7. Choosing a Functional Form
   7.1. Linear-Log (Semi-log) Model
   7.2. Log-Linear Model
        7.2.1. Adjustment to the Predicted Value in Log-Linear Models
        7.2.2. Predicted Value in the Log-Linear Wage-Education Model
        7.2.3. Generalized R² Measure for Log-Linear Models
        7.2.4. Prediction Interval in the Log-Linear Model
   7.3. Log-Log Models
Appendix

1. Confidence Interval for the Mean Value of y for a Given x, and Prediction Interval for the Individual Value of y for a Given x

In Chapter 4, Section 3.1, we learned how to build an interval estimate for the mean value of the dependent variable $y$ for a given or specified value of the independent variable $x$. For that we used the example of weekly food expenditure as a function of weekly income,

$$FOODEXP = \beta_1 + \beta_2\,INCOME + u$$

There we obtained the estimated regression equation $\hat{y} = b_1 + b_2 x$,

$$\hat{y} = 83.416 + 10.21x$$

and built an interval estimate for $\hat{y}$ when $x = 10$ (weekly income = $1,000). Since $\hat{y}$ is a point on the estimated regression line, it is an estimate of the mean value of $y$, food expenditure, in the population for the given weekly income. Therefore, when we build an interval estimate for $\hat{y}$, we are building an interval estimate for the mean or expected value of $y$ for a given $x$: $\mu_{y|x_0} = E(y|x_0)$.

We know, however, that for each value of $x$ in the population there are many different values of $y$ corresponding to that $x$. How do we build an interval estimate for an individual value, rather than the mean value, of $y$?

Let $x_0$ denote the given value of $x$. In the population regression function, $y_0$ is an individual value of $y$ for the given $x$, which deviates from the mean value by the disturbance term $u_0$:

$$y_0 = E(y|x_0) + u_0 = \beta_1 + \beta_2 x_0 + u_0$$

When we obtain the estimated sample regression equation, the observed value of $y$ for each value of $x$ deviates from the predicted value by the prediction error $e$:

$$y_0 = \hat{y}_0 + e_0 = b_1 + b_2 x_0 + e_0$$

The objective now is to build an interval estimate for this $y_0$:

$$L, U = \hat{y}_0 \pm t_{\alpha/2,(n-2)}\,\text{se}(y_0)$$

Here we need a point estimator for $y_0$ and its standard error $\text{se}(y_0)$. Since the expected value of $y_0$ in the population is estimated by $\hat{y}_0$, the prediction interval is built around the estimated $\hat{y}_0$ from the sample; the estimator of $y_0$ is $\hat{y}_0$. To determine $\text{se}(y_0)$, first find $\text{var}(y_0)$ and then take its square root:

$$\text{var}(y_0) = \text{var}(\hat{y}_0 + e)$$

Given the assumption of independence of the $y$ values and the disturbance term, and the assumption of homoskedasticity, we can write

$$\text{var}(y_0) = \text{var}(\hat{y}_0) + \text{var}(e) = \text{var}(b_1 + b_2 x_0) + \text{var}(e) = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2) + \text{var}(e)$$

The interval estimate, or prediction interval, for an individual value of $y$ is then

$$L, U = \hat{y}_0 \pm t_{\alpha/2,(n-2)}\,\text{se}(y_0)$$

For comparison, let us build an interval estimate for the mean value of food expenditure for a weekly income of $2,000 ($x_0 = 20$) and a prediction interval for the individual value of food expenditure for the same weekly income.
The estimated regression equation is

$$\hat{y} = 83.416 + 10.21x$$

For $x_0 = 20$, $\hat{y} = 83.416 + 10.21(20) = 287.61$. The variance of the error term is $\text{var}(e) = 8013.294$. Using the inverse matrix $(X'X)^{-1}$, we can obtain the covariance matrix of the coefficients:

$$\text{var}(e)\,(X'X)^{-1} = \begin{bmatrix} \text{var}(b_1) & \text{cov}(b_1,b_2) \\ \text{cov}(b_1,b_2) & \text{var}(b_2) \end{bmatrix} = 8013.294\begin{bmatrix} 0.2352 & -0.0107 \\ -0.0107 & 0.00055 \end{bmatrix} = \begin{bmatrix} 1884.442 & -85.903 \\ -85.903 & 4.382 \end{bmatrix}$$

Confidence interval for the mean value of $y$:

$$L, U = \hat{y}_0 \pm t_{\alpha/2,df}\,\text{se}(\hat{y}_0)$$
$$\text{var}(\hat{y}_0) = \text{var}(b_1 + b_2 x_0) = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2) = 201.017$$
$$\text{se}(\hat{y}_0) = 14.178 \qquad t_{0.025,38} = 2.024 \qquad MOE = (2.024)(14.178) = 28.70$$
$$L = 287.61 - 28.70 = 258.91 \qquad U = 287.61 + 28.70 = 316.31$$

Prediction interval for the individual value of $y$:

$$L, U = \hat{y}_0 \pm t_{\alpha/2,df}\,\text{se}(y_0)$$
$$\text{var}(y_0) = \text{var}(\hat{y}_0) + \text{var}(e) = 201.017 + 8013.29 = 8214.31$$
$$\text{se}(y_0) = 90.633 \qquad t_{0.025,38} = 2.024 \qquad MOE = (2.024)(90.633) = 183.48$$
$$L = 287.61 - 183.48 = 104.13 \qquad U = 287.61 + 183.48 = 471.09$$

Note the chart below. The bands representing prediction intervals for individual values of $y$ are wider than the bands for the confidence intervals for mean values of $y$. Both bands are also narrowest in the middle: the further the $x$ values are from $\bar{x}$, the bigger the squared deviation $(x - \bar{x})^2$, hence the bigger the variance; and the greater the variance, the less reliable the prediction.

[Figure: estimated regression line for the food expenditure data with confidence-interval bands for the mean of y and wider prediction-interval bands for individual y, for x from 0 to 35]
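To make these calculations concrete, here is a short Python sketch (an illustration only, not part of the original Excel workbook) that reproduces both intervals from the estimates quoted above.

```python
import numpy as np
from scipy import stats

# Quantities taken from the estimated food expenditure model
b1, b2 = 83.416, 10.210           # intercept and slope
var_e = 8013.294                  # variance of the error term, var(e)
var_b1, var_b2 = 1884.442, 4.382  # var(b1), var(b2)
cov_b1b2 = -85.903                # cov(b1, b2)
n, x0 = 40, 20                    # sample size; income = $2,000 (x in $100s)

y_hat = b1 + b2 * x0                                    # point estimate, 287.61
var_yhat = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b1b2  # var(y-hat_0), about 201
var_y0 = var_yhat + var_e                               # var(y_0), about 8214
t = stats.t.ppf(0.975, n - 2)                           # t(0.025, 38) = 2.024

ci = (y_hat - t * np.sqrt(var_yhat), y_hat + t * np.sqrt(var_yhat))
pi = (y_hat - t * np.sqrt(var_y0), y_hat + t * np.sqrt(var_y0))
print(ci)   # approximately (258.9, 316.3)
print(pi)   # approximately (104.1, 471.1)
```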
2. Coefficient of Determination, R²

The closeness of fit of the regression line to the scatter plot is a measure of the closeness of the relationship between $x$ and $y$: the less scattered the observed $y$ values are around the regression line, the closer the relationship between $x$ and $y$. As explained above, $\text{se}(e)$ is one such measure of fit. However, $\text{se}(e)$ has a major drawback: it is an absolute measure and is therefore affected by the absolute size of the data. The larger the values in the data set, the larger the $\text{se}(e)$.

To see this drawback, consider the data in the food expenditure example. Suppose the dependent variable, weekly food expenditure, were also measured in $100s; for example, instead of recording a weekly food expenditure of $155, we record 1.55. As the calculations in the tab "food2" in the Excel file "CH5 DATA" show, the standard error of estimate is reduced from 89.517 to 0.895. This reduction in $\text{se}(e)$ is due solely to the change in the scale of the dependent variable data. Clearly, then, using $\text{se}(e)$ as a measure of closeness of fit suffers from the misleading impact of the absolute size, or scale, of the data used in the model.

An alternative measure of the closeness of fit, one that is not affected by the scale of the data, is the coefficient of determination, denoted $R^2$ (R-square). $R^2$ is a relative measure and is therefore not affected by the scale of the data. It measures the proportion of the total variation in $y$ explained by the regression (that is, by $x$). Basically, $R^2$ compares the deviations of the observed $y$ values around the regression line ($\hat{y}$) against the deviations of the same $y$ values around the mean line ($\bar{y}$). The diagram below shows the comparison of these deviations.

[Figure: scatter plot of the food expenditure data showing the regression line ŷ and the horizontal mean line ȳ, for x from 0 to 35]

Mathematically, $R^2$ is the proportion of the total squared deviation of the $y$ values from $\bar{y}$ that is explained by the regression line. To understand this statement, consider the following diagram.

[Figure: the observation y = 483 at x = 27.14, with the mean line ȳ = 284 and the regression value ŷ = 361, showing the total, explained, and unexplained deviations]

In the diagram, the horizontal line represents the mean of all the observed $y$ values, $\bar{y} = 284$, and the regression line is the regression equation $\hat{y} = 83.416 + 10.21x$. A single observed value $y = 483$ for a given $x = 27.14$ is selected. The vertical distance between this $y$ value and $\bar{y}$ is called the total deviation:

$$\text{Total Deviation} = y - \bar{y} = 483 - 284 = 199$$

The vertical distance between $\hat{y}$ on the regression line and $\bar{y}$ is called the explained deviation:

$$\text{Explained Deviation} = \hat{y} - \bar{y} = 361 - 284 = 77$$

As the diagram indicates, this portion of the total deviation is due to (or explained by) the regression model; that is, it is explained by the independent variable $x$. The vertical distance between $y$ and $\hat{y}$, the residual $e$, is called the unexplained deviation:

$$\text{Unexplained Deviation} = y - \hat{y} = 483 - 361 = 122$$

Thus,

$$\text{Total Deviation} = \text{Explained Deviation} + \text{Unexplained Deviation}$$
$$(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$$
$$199 = 77 + 122$$

Repeating the same process for all values of $y$, squaring the resulting deviations, and summing the squared values, we have the following sums of squared deviations:

1. Sum of squared total deviations, the Sum of Squares Total (SST): $\sum(y - \bar{y})^2$
2. Sum of squared explained deviations, the Sum of Squares Regression (SSR): $\sum(\hat{y} - \bar{y})^2$
3. Sum of squared unexplained deviations, the Sum of Squares Error (SSE): $\sum e^2 = \sum(y - \hat{y})^2$

It can be shown that $\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2$ (see footnote 1). That is,

$$SST = SSR + SSE$$

See the Excel file "CH5 DATA", tab "RSQ", for the calculations of the sums of squares:

$$\sum(y - \bar{y})^2 = 495132.16 \qquad \sum(\hat{y} - \bar{y})^2 = 190626.98 \qquad \sum(y - \hat{y})^2 = 304505.18$$

Note that $495132.16 = 190626.98 + 304505.18$.

As stated at the beginning of this discussion, $R^2$ measures the proportion of the total deviations in $y$ explained by the regression. Thus,

$$R^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{SSR}{SST} = \frac{190626.98}{495132.16} = 0.385$$

Also note:

$$\frac{SSE}{SST} = \frac{304505.18}{495132.16} = 0.615$$

Thus, when $R^2 = 0.385$, 38.5 percent of the variation in $y$, food expenditure, is explained by the regression model, that is, by the independent variable $x$, weekly income. The remaining 61.5 percent of the variation is due to other, unexplained factors. Note that if all the variation in $y$ were explained by income, then $R^2 = 1$. Thus the value of $R^2$ varies from 0 to 1: $0 \le R^2 \le 1$. Also note that the value of $R^2$ is not affected by the scale of the data; you can check this by rerunning the model in Excel with the food expenditure figures in hundreds of dollars.

Footnote 1: Starting from $(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$, square both sides and sum:
$$\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2 + 2\sum e(\hat{y} - \bar{y})$$
We must show that $\sum e(\hat{y} - \bar{y}) = 0$. Since $\sum e = 0$, $\sum e(\hat{y} - \bar{y}) = \sum e\hat{y} - \bar{y}\sum e = \sum e\hat{y}$. Now show that $\sum e\hat{y} = 0$:
$$\sum e\hat{y} = \sum e(b_1 + b_2 x) = \sum e(\bar{y} - b_2\bar{x} + b_2 x) = \bar{y}\sum e + b_2\sum e(x - \bar{x}) = b_2\sum ex - b_2\bar{x}\sum e = b_2\sum ex$$
and $\sum ex = \sum x(y - b_1 - b_2 x) = 0$, which is the normal equation obtained in the development of the least squares coefficients: $\partial\sum e^2/\partial b_2 = -2\sum x(y - b_1 - b_2 x) = 0$. With this, $\sum e(\hat{y} - \bar{y}) = 0$, and hence $\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2$.
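The decomposition is easy to verify numerically. The sketch below is illustrative only: the helper name and the small data set are made up, not taken from the CH5 DATA file.

```python
import numpy as np

def anova_decomposition(x, y):
    """Fit y = b1 + b2*x by least squares and return SST, SSR, SSE, R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    y_hat = b1 + b2 * x
    sst = np.sum((y - y.mean())**2)      # total
    ssr = np.sum((y_hat - y.mean())**2)  # explained (regression)
    sse = np.sum((y - y_hat)**2)         # unexplained (error)
    return sst, ssr, sse, ssr / sst

# A small made-up data set for illustration
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
sst, ssr, sse, r2 = anova_decomposition(x, y)
print(sst, ssr + sse)   # equal up to rounding: SST = SSR + SSE
print(r2)
```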
3. Correlation Analysis

In Chapters 1 and 2 the concept of covariance was explained as a measure of the extent of association between two variables $x$ and $y$; $\sigma_{xy}$ was used as the symbol for the population covariance and $s_{xy}$ for the sample covariance. It was also explained that, to avoid the distorting impact of the scale of the data on the covariance, the correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of $x$ and $y$:

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\ \text{(population correlation coefficient)} \qquad r = \frac{s_{xy}}{s_x s_y}\ \text{(sample correlation coefficient)}$$

In the sample formula,

$$s_{xy} = \frac{\sum(x - \bar{x})(y - \bar{y})}{n - 1} \qquad s_x = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}} \qquad s_y = \sqrt{\frac{\sum(y - \bar{y})^2}{n - 1}}$$

from which we obtain

$$r_{xy} = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2}\sqrt{\sum(y - \bar{y})^2}}$$

Since $r$ is a relative measure, $-1 \le r \le 1$. The closer the coefficient of correlation is to $-1$ or $1$, the stronger the association between the variations in $y$ and the variations in $x$.

3.1. The Relationship Between R² and r

We can show that the coefficient of determination $R^2$, which shows how closely the variations in the dependent variable $y$ are associated with the variations in the explanatory variable $x$, is equal to the correlation coefficient squared:

$$R^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = r_{xy}^2 \qquad \text{(see footnote 2)}$$

Footnote 2: In the numerator of $R^2$, substitute $\hat{y} = b_1 + b_2 x$ and then $b_1 = \bar{y} - b_2\bar{x}$:
$$\sum(\hat{y} - \bar{y})^2 = \sum(b_1 + b_2 x - \bar{y})^2 = \sum(\bar{y} - b_2\bar{x} + b_2 x - \bar{y})^2 = \sum[b_2(x - \bar{x})]^2 = b_2^2\sum(x - \bar{x})^2$$
so that
$$R^2 = \frac{b_2^2\sum(x - \bar{x})^2}{\sum(y - \bar{y})^2}$$
Using $b_2 = \sum(x - \bar{x})(y - \bar{y})/\sum(x - \bar{x})^2$ and substituting for $b_2^2$ in the numerator,
$$R^2 = \frac{[\sum(x - \bar{x})(y - \bar{y})]^2}{[\sum(x - \bar{x})^2]^2}\cdot\frac{\sum(x - \bar{x})^2}{\sum(y - \bar{y})^2} = \frac{[\sum(x - \bar{x})(y - \bar{y})]^2}{\sum(x - \bar{x})^2\sum(y - \bar{y})^2} = r_{xy}^2$$

3.2. Another Point About R²

When it is said that $R^2$ is a measure of "goodness of fit", this refers to the correlation between the observed and predicted values of $y$, which can be expressed as $r_{y\hat{y}}$. Like any other measure of correlation between two variables,

$$r_{y\hat{y}} = \frac{\sum(y - \bar{y})(\hat{y} - \bar{\hat{y}})}{\sqrt{\sum(y - \bar{y})^2\sum(\hat{y} - \bar{\hat{y}})^2}}$$

That $r_{yx} = r_{y\hat{y}}$ is easily explained by the fact that $\hat{y}$ is a linear transformation of the variable $x$: $\hat{y} = b_1 + b_2 x$. The correlation between $y$ and $x$ is therefore the same as the correlation between $y$ and a linear transformation of $x$. It can also be shown that $r_{y\hat{y}}^2 = R^2$ (see footnote 3):

$$r_{y\hat{y}}^2 = \frac{[\sum(y - \bar{y})(\hat{y} - \bar{\hat{y}})]^2}{\sum(y - \bar{y})^2\sum(\hat{y} - \bar{\hat{y}})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = R^2$$

Thus $R^2$ is also a measure of how well the estimated regression fits the data.

Footnote 3: First, we can show that the mean of the predicted values equals the mean of the observed values, $\bar{\hat{y}} = \bar{y}$: from $\hat{y} = b_1 + b_2 x$, $\sum\hat{y} = nb_1 + b_2\sum x$, so $\bar{\hat{y}} = b_1 + b_2\bar{x} = (\bar{y} - b_2\bar{x}) + b_2\bar{x} = \bar{y}$. Then
$$\sum(y - \bar{y})(\hat{y} - \bar{\hat{y}}) = \sum(\hat{y} + e - \bar{y})(\hat{y} - \bar{y}) = \sum(\hat{y} - \bar{y})^2 + \sum e(\hat{y} - \bar{y}) = \sum(\hat{y} - \bar{y})^2$$
since $\sum e(\hat{y} - \bar{y}) = 0$. Thus
$$r_{y\hat{y}}^2 = \frac{[\sum(\hat{y} - \bar{y})^2]^2}{\sum(y - \bar{y})^2\sum(\hat{y} - \bar{y})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{SSR}{SST} = R^2$$
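These identities can also be checked numerically. A minimal sketch with simulated data (the variable names and the simulated sample are mine; np.corrcoef is NumPy's correlation routine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 3, 50)   # simulated linear relationship

b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x

r_xy = np.corrcoef(x, y)[0, 1]         # correlation between x and y
r_yyhat = np.corrcoef(y, y_hat)[0, 1]  # correlation between y and y-hat
R2 = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)

print(r_xy**2, r_yyhat**2, R2)   # all three agree: r_xy^2 = r_yyhat^2 = R^2
```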
4. Reporting Regression Results

4.1. Computer Output

Several statistical software packages are available to produce regression results. We will use Excel's regression output for illustration. In Excel, Regression is found under Tools, Data Analysis. Following the simple instructions in the dialog box presented by Excel, the following output is generated for the food expenditure example:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.6205
R Square             0.3850
Adjusted R Square    0.3688
Standard Error       89.517
Observations         40

ANOVA
             df    SS           MS         F        Significance F
Regression    1    190626.98    190627     23.789   1.95E-05
Residual     38    304505.18    8013.294
Total        39    495132.16

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   83.416         43.410           1.922    0.0622    -4.463      171.295
income      10.210         2.093            4.877    0.0000    5.972       14.447

The following table contains the symbols and formulas used in generating this output:

Regression Statistics
Multiple R           Not relevant to simple regression
R Square             $R^2 = SSR/SST$
Adjusted R Square    Not relevant to simple regression
Standard Error       $\text{se}(e) = \sqrt{\sum e^2/(n-2)} = \sqrt{\sum(y-\hat{y})^2/(n-2)} \equiv \sqrt{SSE/(n-2)} = \sqrt{MSE}$
Observations         $n$

ANOVA*
             df**     SS                                 MS                   F*
Regression   $k-1$    $SSR = \sum(\hat{y}-\bar{y})^2$    $MSR = SSR/(k-1)$    $F = MSR/MSE$
Residual     $n-k$    $SSE = \sum(y-\hat{y})^2$          $MSE = SSE/(n-k)$
Total        $n-1$    $SST = \sum(y-\bar{y})^2$
Significance F       Tail area of the $F$ distribution

               Coefficients   Standard Error†    t Stat                       P-value       Lower 95%††         Upper 95%
Intercept      $b_1$          $\text{se}(b_1)$   $|t| = b_1/\text{se}(b_1)$   $P(t > |t|)$  $L = b_1 - MOE_1$   $U = b_1 + MOE_1$
X Variable 1   $b_2$          $\text{se}(b_2)$   $|t| = b_2/\text{se}(b_2)$   $P(t > |t|)$  $L = b_2 - MOE_2$   $U = b_2 + MOE_2$

Notes:
*  ANOVA and the $F$ distribution are explained below.
** $k$ denotes the number of parameters estimated; here $k = 2$.
†  $\text{se}(b_1) = \text{se}(e)\sqrt{\sum x^2/\left(n\sum(x-\bar{x})^2\right)}$ and $\text{se}(b_2) = \text{se}(e)/\sqrt{\sum(x-\bar{x})^2}$
†† $MOE_1 = t_{\alpha/2,(n-2)}\,\text{se}(b_1)$ and $MOE_2 = t_{\alpha/2,(n-2)}\,\text{se}(b_2)$

4.2. Reporting the Summary Results

In many cases, rather than providing the whole computer output, the regression output is reported in a summary form. The following are two ways in which summary results are reported.

ŷ = 83.416 + 10.210x      R² = 0.385
    (43.41)   (2.093)     (s.e.)

This summary report provides the values of the standard errors of the regression coefficients, $\text{se}(b_1)$ and $\text{se}(b_2)$. This information allows us to obtain the confidence intervals for the parameters of the regression: you just need to compute the $MOE$ using $t_{\alpha/2,(n-k)}$ and the standard error of each coefficient. You can also divide the coefficient value by its standard error to obtain the test statistic for the hypothesis test about the parameter.

Alternatively, the summary result is reported as follows:

ŷ = 83.416 + 10.210x      R² = 0.385
    (1.922)   (4.877)     (t)

Here, you can use the $t$ stat and either compute the probability value (you must use a computer) or compare it to the critical value $t_{\alpha/2,(n-k)}$ to test the null hypothesis $H_0: \beta_2 = 0$.
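The numbers in this output can be reproduced directly from the formulas in the table above. The sketch below is a hypothetical helper built from those formulas, not Excel's own routine:

```python
import numpy as np
from scipy import stats

def simple_ols_summary(x, y, alpha=0.05):
    """Reproduce the coefficient block of the summary output for y = b1 + b2*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, k = len(y), 2
    sxx = np.sum((x - x.mean())**2)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b1 = y.mean() - b2 * x.mean()
    e = y - (b1 + b2 * x)
    mse = np.sum(e**2) / (n - k)                        # var(e) = MSE
    se_b2 = np.sqrt(mse / sxx)                          # se(b2)
    se_b1 = np.sqrt(mse * np.sum(x**2) / (n * sxx))     # se(b1)
    tcrit = stats.t.ppf(1 - alpha / 2, n - k)           # t(alpha/2, n-k)
    rows = {}
    for name, b, se in [("Intercept", b1, se_b1), ("X Variable 1", b2, se_b2)]:
        tstat = b / se
        pval = 2 * stats.t.sf(abs(tstat), n - k)        # two-tailed P-value
        rows[name] = (b, se, tstat, pval, b - tcrit * se, b + tcrit * se)
    return rows   # coefficient, std error, t stat, P-value, lower, upper
```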
5. The F Test of Goodness of Fit

The goodness of fit of the regression is measured by $R^2$:

$$R^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2}$$

The more closely the observed values of $y$ are clustered around the regression line, the better the goodness of fit, or the greater the linear association between the dependent variable $y$ and the explanatory variable $x$. We also saw above that $R^2$ can be computed as the square of the correlation coefficient $r$: $R^2 = r^2$, where

$$r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2}\sqrt{\sum(y - \bar{y})^2}} \qquad r^2 = \frac{[\sum(x - \bar{x})(y - \bar{y})]^2}{\sum(x - \bar{x})^2\sum(y - \bar{y})^2} = R^2$$

Theoretically, two variables $x$ and $y$ are independent if the population correlation coefficient $\rho$ is zero. Within the simple linear regression context, the absence of a linear relationship between $x$ and $y$ would imply that the slope parameter $\beta_2$ is zero, so that none of the total deviation of $y$ from its mean would be accounted for by the regression. As explained, in the sample $R^2$ measures this explained deviation relative to the total. However, even if there is no relationship between $x$ and $y$ in the population, the probability that an $R^2$ computed from a random sample would be exactly zero is practically nil. A sample $R^2$ will therefore always be a number greater than zero.

In simple regression analysis, therefore, to conclude that there is a relationship between $x$ and $y$, we perform the test of hypothesis

$$H_0: \beta_2 = 0 \quad \text{versus} \quad H_1: \beta_2 \ne 0$$

This has already been done using $b_2$ as the test statistic and performing the "$t$ test". We may consider $R^2$, in a way, an alternative test statistic for the same hypothesis: if $R^2$ is significantly different from zero, we reject the null hypothesis. To determine whether $R^2$ is significantly different from zero we need a critical value, for a given significance level $\alpha$, against which to compare the test statistic. The problem here is that there is no statistical critical value directly related to $R^2$: $R^2$ is obtained as a ratio of two sums of squared deviations, $SSR$ over $SST$, and as such it does not follow a probability distribution such as $z$, $t$, or chi-square.

The way around this problem is the indirect approach of measuring the mean $SSR$ relative to the mean $SSE$. This way we are comparing two measures of the variance of $y$: variance due to the regression versus variance due to unexplained factors. Hence the term ANOVA, analysis of variance. If explained deviations outweigh the unexplained deviations, the variance measure in the numerator of the variance ratio will be greater than that in the denominator. Since the variance ratio is a ratio of squared terms, it is always positive; the larger the variation due to regression, the further the quotient rises above 1, indicating a better fit.

To obtain a variance measure from sample data we divide the sum of squared deviations by its degrees of freedom to determine the mean square. The two mean squares in the regression ANOVA are the mean square regression ($MSR$) and the mean square error ($MSE$):

$$MSR = \frac{SSR}{df} = \frac{\sum(\hat{y} - \bar{y})^2}{k - 1}$$

where $k$ is the number of parameters in the regression ($\beta_1$ and $\beta_2$, so $k = 2$), and

$$MSE = \frac{SSE}{df} = \frac{\sum(y - \hat{y})^2}{n - k}$$

The ratio of $MSR$ to $MSE$ is called the $F$ ratio.
$$F = \frac{MSR}{MSE}$$

The $F$ ratio is a test statistic with a specific probability distribution called the $F$ distribution (see footnote 4). The $F$ distribution is the ratio of two independent chi-square random variables, each divided by its own degrees of freedom, with $v_1$ the degrees of freedom of the numerator and $v_2$ that of the denominator:

$$F_{(v_1, v_2)} = \frac{\chi_1^2/v_1}{\chi_2^2/v_2}$$

The $F$ distribution is used in testing the equality of two population variances, as is being done in the case discussed here:

$$F_{(k-1,\,n-k)} = \frac{\sum(\hat{y} - \bar{y})^2/(k - 1)}{\sum(y - \hat{y})^2/(n - k)}$$

The numerator variance measure represents the average squared deviation of the predicted values from the mean of $y$: the mean square deviation explained by the regression. The denominator is the mean square deviation of the observed values from the regression line: the unexplained mean square deviation.

Footnote 4: Named after the English statistician Sir Ronald A. Fisher.

To observe the impact of these mean squares on the value of $F$, consider the following two models and corresponding figures. In (A), test scores on an exam are observed against the explanatory variable hours studied. In (B), the same test scores are observed against randomly generated numbers as the explanatory variable.

(A) Scores against hours studied:

Hours Studied   1.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   7.0
Score           56    52    72    56    92    72    88    96    80    100

R² = 0.6564        b₂ = 8.0140       |t| = 3.9091    prob value = 0.0045
MSR = 1836.8056    MSE = 120.1993    F = 15.2813     Significance F = 0.0045

(B) Scores against random numbers:

Random numbers   15    63    42    51    85    93    32    43    31    65
Score            56    52    72    56    92    72    88    96    80    100

R² = 0.0329        b₂ = 0.1299       |t| = 0.5213    prob value = 0.6163
MSR = 91.9413      MSE = 338.3073    F = 0.2718      Significance F = 0.6163

The regression model in (A) shows the relationship between test scores and hours studied. Model (B) regresses the same test scores against a set of numbers randomly selected from 1 to 100. Note that the regression line in model (B) is practically flat, with a slope of about 0.13. Correspondingly, the $|t|$ statistic for $H_0: \beta_2 = 0$ yields a probability value of 0.6163, leading us to conclude convincingly that the population slope parameter is zero. Now pay attention to $MSR$. The regression line, as shown in panel (B) of the diagram below, is very close to the $\bar{y}$ line, leaving very little room for the deviations $\hat{y} - \bar{y}$ and thus making $MSR = \sum(\hat{y} - \bar{y})^2/(k - 1)$ very small relative to $MSE = \sum(y - \hat{y})^2/(n - k)$. The $F$ statistic is hence a small value: $91.9413/338.3073 = 0.2718$. The probability value (the tail area under the $F$ curve to the right of the $F$ value of 0.2718, computed with =F.DIST.RT(0.2718,1,8) in Excel) is 0.6163, clearly leading us not to reject $H_0: \beta_2 = 0$.

In contrast, the regression line in panel (A) indicates a pronounced slope, making the deviations $\hat{y} - \bar{y}$, and hence $MSR$, large relative to $MSE$. The $F$ statistic is thus a large value: $1836.8056/120.1993 = 15.2813$. The probability value is 0.0045, clearly leading us to reject $H_0: \beta_2 = 0$.

[Figure: panel (A) shows scores against hours studied with a steep regression line well away from the ȳ line; panel (B) shows scores against random numbers with a nearly flat regression line lying close to the ȳ line]
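Both $F$ statistics can be reproduced from the data in the tables above. A minimal sketch (scipy.stats.f.sf returns the right-tail area, i.e., the Significance F):

```python
import numpy as np
from scipy import stats

def f_test(x, y, k=2):
    """Compute the F statistic and its p-value for the simple regression of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    y_hat = b1 + b2 * x
    msr = np.sum((y_hat - y.mean())**2) / (k - 1)   # mean square regression
    mse = np.sum((y - y_hat)**2) / (n - k)          # mean square error
    F = msr / mse
    return F, stats.f.sf(F, k - 1, n - k)           # right-tail area of F(k-1, n-k)

score = [56, 52, 72, 56, 92, 72, 88, 96, 80, 100]
hours = [1.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0]
rand  = [15, 63, 42, 51, 85, 93, 32, 43, 31, 65]

print(f_test(hours, score))   # about (15.28, 0.0045) -- model (A)
print(f_test(rand, score))    # about (0.27, 0.6163)  -- model (B)
```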
After explaining all this about the $F$ test and ANOVA, it may sound quite anticlimactic to say that in simple regression we need not perform the $F$ test at all, because it is redundant. With careful attention you will recognize that the $F$ statistic, with numerator degrees of freedom $k - 1 = 1$, is the same value as the $t$ statistic squared:

$$F = t^2 = (3.9091)^2 = 15.2813$$

and P-value = Significance $F$ = 0.0045. See footnote 5 for the mathematical proof that $F = t^2$. Note, however, that the $F$ test plays a different and important role in statistical inference in multiple regression, to be pointed out in later chapters.

Footnote 5: To show that $F = MSR/MSE = b_2^2/\text{var}(b_2) = t^2$, note first that, with $df = 1$,

$$MSR = SSR = \sum(\hat{y} - \bar{y})^2 = \sum(b_1 + b_2 x - \bar{y})^2 = \sum(\bar{y} - b_2\bar{x} + b_2 x - \bar{y})^2 = \sum(b_2 x - b_2\bar{x})^2 = b_2^2\sum(x - \bar{x})^2$$

Also $MSE = \text{var}(e)$, and since $\text{var}(b_2) = \text{var}(e)/\sum(x - \bar{x})^2$, we have $MSE = \text{var}(e) = \text{var}(b_2)\sum(x - \bar{x})^2$. Thus

$$F = \frac{b_2^2\sum(x - \bar{x})^2}{\text{var}(b_2)\sum(x - \bar{x})^2} = \frac{b_2^2}{\text{var}(b_2)}$$

Since the test statistic for $H_0: \beta_2 = 0$ is $t = b_2/\text{se}(b_2)$, it follows that $F = b_2^2/\text{var}(b_2) = t^2$.

6. Modeling Issues

6.1. The Effects of Scaling the Data

6.1.1. Changing the scale of x

Consider the general form of the estimated simple linear regression equation, $\hat{y} = b_1 + b_2 x$. We want to find out what happens to the regression results if we change the scale of $x$ by multiplying it by a constant $c$. First, determine the impact on the slope coefficient $b_2$; denote the resulting new coefficient by $b_2^*$.

Impact on $b_2$: $b_2^* = b_2/c$

$$b_2 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$$

Multiply $x$ by the constant $c$. Then

$$b_2^* = \frac{\sum(cx)y - n(c\bar{x})\bar{y}}{\sum(cx)^2 - n(c\bar{x})^2} = \frac{c(\sum xy - n\bar{x}\bar{y})}{c^2(\sum x^2 - n\bar{x}^2)} = \frac{b_2}{c}$$

Thus, when $x$ is scaled by a constant $c$, the new slope coefficient is equal to the pre-scaled coefficient divided by $c$.

Impact on $b_1$: $b_1^* = b_1$

$$b_1 = \bar{y} - b_2\bar{x} \qquad b_1^* = \bar{y} - b_2^*(c\bar{x}) = \bar{y} - (b_2/c)(c\bar{x}) = \bar{y} - b_2\bar{x} = b_1$$

Thus, scaling $x$ does not change the intercept.

Impact on the predicted values $\hat{y}$: $\hat{y}^* = \hat{y}$

$$\hat{y}^* = b_1 + b_2^*(cx) = b_1 + (b_2/c)(cx) = b_1 + b_2 x = \hat{y}$$

There is no impact.

Impact on $\text{var}(e)$: $\text{var}(e^*) = \text{var}(e)$

$$\text{var}(e^*) = \frac{\sum(y - \hat{y}^*)^2}{n - 2} = \frac{\sum(y - \hat{y})^2}{n - 2} = \text{var}(e)$$

There is no impact.

Impact on $R^2$: $(R^2)^* = R^2$

$$(R^2)^* = \frac{\sum(\hat{y}^* - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = R^2$$

There is no impact.

Impact on $\text{var}(b_2)$: $\text{var}(b_2^*) = \text{var}(b_2)/c^2$

$$\text{var}(b_2^*) = \text{var}\!\left(\frac{b_2}{c}\right) = \frac{1}{c^2}\text{var}(b_2) \qquad \text{se}(b_2^*) = \frac{1}{c}\text{se}(b_2)$$

Impact on $\text{var}(b_1)$: $\text{var}(b_1^*) = \text{var}(b_1)$, since $b_1^* = b_1$.

Impact on the $t$ statistic: $t^* = t$

$$t^* = \frac{b_2^*}{\text{se}(b_2^*)} = \frac{b_2/c}{\text{se}(b_2)/c} = \frac{b_2}{\text{se}(b_2)} = t$$

There is no change.
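These invariance results are easy to confirm by simulation. The sketch below uses made-up data; c = 100 mimics re-expressing income from $100s to dollars:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(5, 35, 40)                # income in $100s (made-up data)
y = 80 + 10 * x + rng.normal(0, 90, 40)   # food expenditure in $

def fit(x, y):
    """Return (b1, b2, t stat for b2, R^2) for the regression of y on x."""
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    e = y - b1 - b2 * x
    se_b2 = np.sqrt(np.sum(e**2) / (len(y) - 2) / np.sum((x - x.mean())**2))
    r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
    return b1, b2, b2 / se_b2, r2

c = 100                  # rescale x: $100s -> dollars
print(fit(x, y))         # baseline (b1, b2, t, R^2)
print(fit(c * x, y))     # b2 shrinks by the factor c; b1, t, and R^2 unchanged
```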
6.1.2. Changing the scale of y

We want to find out what happens to the regression results if we change the scale of $y$ by multiplying it by a constant $c$.

Impact on $b_2$: $b_2^* = cb_2$

$$b_2^* = \frac{\sum x(cy) - n\bar{x}(c\bar{y})}{\sum x^2 - n\bar{x}^2} = \frac{c(\sum xy - n\bar{x}\bar{y})}{\sum x^2 - n\bar{x}^2} = cb_2$$

Thus, when $y$ is scaled by a constant $c$, the new slope coefficient is equal to the pre-scaled coefficient multiplied by $c$.

Impact on $b_1$: $b_1^* = cb_1$

$$b_1^* = c\bar{y} - b_2^*\bar{x} = c\bar{y} - cb_2\bar{x} = c(\bar{y} - b_2\bar{x}) = cb_1$$

Thus, scaling $y$ changes the intercept by the multiple $c$.

Impact on the predicted values $\hat{y}$: $\hat{y}^* = c\hat{y}$

$$\hat{y}^* = cb_1 + cb_2 x = c(b_1 + b_2 x) = c\hat{y}$$

The predicted values also change by the multiple $c$.

Impact on $\text{var}(e)$: $\text{var}(e^*) = c^2\text{var}(e)$

$$\text{var}(e^*) = \frac{\sum(cy - \hat{y}^*)^2}{n - 2} = \frac{\sum(cy - c\hat{y})^2}{n - 2} = \frac{c^2\sum(y - \hat{y})^2}{n - 2} = c^2\text{var}(e)$$

Impact on $R^2$: $(R^2)^* = R^2$

$$(R^2)^* = \frac{\sum(\hat{y}^* - c\bar{y})^2}{\sum(cy - c\bar{y})^2} = \frac{c^2\sum(\hat{y} - \bar{y})^2}{c^2\sum(y - \bar{y})^2} = R^2$$

There is no impact.

Impact on $\text{var}(b_2)$: $\text{var}(b_2^*) = c^2\text{var}(b_2)$

$$\text{var}(b_2) = \frac{\text{var}(e)}{\sum(x - \bar{x})^2} \qquad \text{var}(b_2^*) = \frac{c^2\text{var}(e)}{\sum(x - \bar{x})^2} = c^2\text{var}(b_2) \qquad \text{se}(b_2^*) = c\,\text{se}(b_2)$$

Impact on $\text{var}(b_1)$: $\text{var}(b_1^*) = c^2\text{var}(b_1)$

$$\text{var}(b_1) = \frac{\sum x^2\,\text{var}(e)}{n\sum(x - \bar{x})^2} \qquad \text{var}(b_1^*) = \frac{\sum x^2\,c^2\text{var}(e)}{n\sum(x - \bar{x})^2} = c^2\text{var}(b_1)$$

Impact on the $t$ statistic: $t^* = t$

$$t^* = \frac{b_2^*}{\text{se}(b_2^*)} = \frac{cb_2}{c\,\text{se}(b_2)} = t$$

6.1.3. Changing the scale of x and y by the same factor c

Now change the scale of both $x$ and $y$ by multiplying both by the constant $c$. First, determine the impact on the slope coefficient $b_2$.

Impact on $b_2$: $b_2^* = b_2$

$$b_2^* = \frac{\sum(cx)(cy) - n(c\bar{x})(c\bar{y})}{\sum(cx)^2 - n(c\bar{x})^2} = \frac{c^2(\sum xy - n\bar{x}\bar{y})}{c^2(\sum x^2 - n\bar{x}^2)} = b_2$$

There is no change in the slope coefficient.

Impact on $b_1$: $b_1^* = cb_1$

$$b_1^* = c\bar{y} - b_2^*(c\bar{x}) = c\bar{y} - b_2(c\bar{x}) = c(\bar{y} - b_2\bar{x}) = cb_1$$

The intercept changes by the multiple $c$.

Impact on the predicted values $\hat{y}$: $\hat{y}^* = c\hat{y}$

$$\hat{y}^* = b_1^* + b_2^*(cx) = cb_1 + b_2(cx) = c(b_1 + b_2 x) = c\hat{y}$$

Impact on $\text{var}(e)$: $\text{var}(e^*) = c^2\text{var}(e)$

$$\text{var}(e^*) = \frac{\sum(cy - c\hat{y})^2}{n - 2} = c^2\text{var}(e)$$

Impact on $R^2$: $(R^2)^* = R^2$

$$(R^2)^* = \frac{\sum(c\hat{y} - c\bar{y})^2}{\sum(cy - c\bar{y})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = R^2$$

There is no impact.

Impact on $\text{var}(b_2)$: $\text{var}(b_2^*) = \text{var}(b_2)$

$$\text{var}(b_2^*) = \frac{c^2\text{var}(e)}{\sum(cx - c\bar{x})^2} = \frac{c^2\text{var}(e)}{c^2\sum(x - \bar{x})^2} = \text{var}(b_2) \qquad \text{se}(b_2^*) = \text{se}(b_2)$$

Impact on $\text{var}(b_1)$: $\text{var}(b_1^*) = c^2\text{var}(b_1)$

$$\text{var}(b_1^*) = \frac{\sum(cx)^2\,c^2\text{var}(e)}{n\,c^2\sum(x - \bar{x})^2} = \frac{c^2\sum x^2\,\text{var}(e)}{n\sum(x - \bar{x})^2} = c^2\text{var}(b_1)$$

Impact on the $t$ statistic: $t^* = t$

$$t^* = \frac{b_2^*}{\text{se}(b_2^*)} = \frac{b_2}{\text{se}(b_2)} = t$$

There is no change.
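A parallel simulation check covers scaling $y$ and scaling both variables (again with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(5, 35, 40)
y = 80 + 10 * x + rng.normal(0, 90, 40)

def fit(x, y):
    """Return (b1, b2, t stat for b2, R^2) for the regression of y on x."""
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    e = y - b1 - b2 * x
    se_b2 = np.sqrt(np.sum(e**2) / (len(y) - 2) / np.sum((x - x.mean())**2))
    r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
    return b1, b2, b2 / se_b2, r2

c = 100
print(fit(x, y))          # baseline (b1, b2, t, R^2)
print(fit(x, c * y))      # b1 and b2 both scale by c; t and R^2 unchanged
print(fit(c * x, c * y))  # b1 scales by c, b2 unchanged; t and R^2 unchanged
```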
7. Choosing a Functional Form

In explaining the simple linear regression model we have assumed that the population parameters $\beta_1$ and $\beta_2$ are linear — that is, they are not expressed as, say, $\beta_2^2$, $1/\beta_2$, or any form other than $\beta_2$ — and also that the impact of changes in the independent variable on $y$ works directly through $x$ rather than through expressions such as $x^2$ or $\ln(x)$. In this section we continue to assume that the regression is linear in the parameters, but relax the assumption of linearity in the variables.

In many economic models the relationship between the dependent and independent variables is not a straight-line relationship; that is, the change in $y$ does not follow the same pattern for all values of $x$. Consider, for example, an economic model explaining the relationship between expenditure on food (or housing) and income. As income rises, we do expect expenditure on food to rise, but not at a constant rate. In fact, we should expect the rate of increase in expenditure on food to decrease as income rises. Therefore the relationship between income and food expenditure is not a straight-line relationship. In Chapter 3 we considered two alternative regression models, the quadratic model and the log-linear model. Here we will consider two more alternative models: the linear-log and log-log models.

7.1. Linear-Log (Semi-log) Model

The independent variable is in logarithms, but the dependent variable is not:

$$y = \beta_1 + \beta_2\ln(x)$$

Slope:

$$\frac{dy}{dx} = \beta_2\frac{1}{x}$$

Elasticity:

$$\varepsilon = \frac{dy}{dx}\cdot\frac{x}{y} = \beta_2\left(\frac{1}{x}\right)\frac{x}{y} = \frac{\beta_2}{y}$$

The following diagram plots the functions $y = 1 + 1.3\ln(x)$ and $y = 1 - 1.3\ln(x)$. For example, at $x = 2$ the function with $\beta_2 > 0$ gives $y = 1 + 1.3\ln(2) = 1.901$, and the one with $\beta_2 < 0$ gives $y = 1 - 1.3\ln(2) = 0.099$.

[Figure: linear-log functions y = 1 + 1.3 ln(x) (increasing, β₂ > 0) and y = 1 − 1.3 ln(x) (decreasing, β₂ < 0), for x from 0 to 6]

The slope of the function $y = 1 + 1.3\ln(x)$ at a given point, say $x_0 = 2$, is

$$\frac{dy}{dx} = \beta_2\frac{1}{x} = 1.3\cdot\frac{1}{2} = 0.65$$

Let us interpret the meaning of slope = 0.65 in a linear-log function. First, the coefficient of $\ln(x)$, 1.3, is not the slope of the function. The value 1.3 implies that for each 1% increase in $x$, the dependent variable rises by approximately 0.013 units:

$$x_0 = 2,\ \Delta x\% = 1\%,\ x_1 = 2.02: \quad y_0 = 1.9011,\ y_1 = 1.9140,\ \Delta y = y_1 - y_0 = 0.0129$$

The slope of 0.65 at $x_0 = 2$ means that for a very small change in $x$ in the immediate vicinity of $x_0 = 2$, $y$ rises by 0.65 units per unit of $x$. The table below shows that as the increment in $x$ is reduced from 1 to 0.001, the difference quotient approaches 0.65, the slope of the function at $x_0 = 2$.

x₀ = 2, y₀ = 1.9011
x₁        y₁        Δy/Δx ≈ dy/dx
3.0       2.4282    0.5271
2.5       2.1912    0.5802
2.1       1.9645    0.6343
2.01      1.9076    0.6484
2.001     1.9017    0.6498

Thus, "slope" in the linear-log model means the change in $y$ in response to a small change in $x$. The coefficient of $\ln(x)$, on the other hand, relates the change in $y$ to a percentage change in $x$.

Example

Use the data in the food expenditure model (see the Excel file "CH5 DATA"). The variables are $y$ = FOOD_EXP (weekly food expenditure in $) and $x$ = INCOME (weekly income in $100s). The output is the result of running the regression $\hat{y} = b_1 + b_2\ln(x)$; to run the regression, first transform the $x$ values to $\ln(x)$. The estimated regression equation is

$$\hat{y} = -97.1864 + 132.1658\ln(x)$$

The coefficient of $\ln(x)$, $b_2 = 132.1658$, implies that weekly food expenditure will increase by approximately $1.32 for each 1% increase in weekly income, regardless of the income level, as the following calculations show:

$$x_0 = 10,\ x_1 = 10.1\ \text{(a 1\% increase)}: \quad y_0 = 207.137,\ y_1 = 208.452,\ \Delta y = 1.3151$$
$$x_0 = 20,\ x_1 = 20.2\ \text{(a 1\% increase)}: \quad y_0 = 298.747,\ y_1 = 300.062,\ \Delta y = 1.3151$$

However, the change in food expenditure for each additional dollar of weekly income differs by income level, as the calculations in the following table show; this means that the slope of the regression equation depends on the value of $x$, the weekly income level. (Recall that income is measured in $100s, so an additional $1 of income corresponds to $\Delta x = 0.01$.)

$$\text{Weekly income \$1,000}\ (x_0 = 10,\ x_1 = 10.01): \quad y_0 = 207.137,\ y_1 = 207.269,\ \Delta y \text{ per \$1} = 0.1321$$
$$\text{Weekly income \$2,000}\ (x_0 = 20,\ x_1 = 20.01): \quad y_0 = 298.747,\ y_1 = 298.813,\ \Delta y \text{ per \$1} = 0.0661$$
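A short sketch evaluating the estimated linear-log equation at different income levels (only the two fitted coefficients come from the text; the function name is mine):

```python
import numpy as np

b1, b2 = -97.1864, 132.1658   # estimated linear-log coefficients

def food_exp(income_100s):
    """Predicted weekly food expenditure ($); income is measured in $100s."""
    return b1 + b2 * np.log(income_100s)

for x0 in (10.0, 20.0):   # weekly income $1,000 and $2,000
    pct = food_exp(1.01 * x0) - food_exp(x0)       # effect of a 1% income increase
    dollar = food_exp(x0 + 0.01) - food_exp(x0)    # effect of a $1 income increase
    print(x0, round(pct, 4), round(dollar, 4))
# A 1% increase adds about $1.32 at every income level, while a $1 increase
# adds about $0.13 at $1,000 of income but only about $0.07 at $2,000.
```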
7.2. Log-Linear Model

The log-linear model was introduced in Chapter 3. There are additional points with respect to this model that we need to pay attention to. The log-linear model in regression takes the following form:

$$\ln(y) = \beta_1 + \beta_2 x$$

To determine the slope, take the exponent of both sides of the equation:

$$y = e^{\beta_1 + \beta_2 x}$$

Then,

Slope:

$$\frac{dy}{dx} = \beta_2 e^{\beta_1 + \beta_2 x} = \beta_2 y$$

Elasticity:

$$\varepsilon = \frac{dy}{dx}\cdot\frac{x}{y} = \beta_2 y\left(\frac{x}{y}\right) = \beta_2 x$$

The coefficient $\beta_2$ in $\ln(y) = \beta_1 + \beta_2 x$ implies that for each additional one-unit increase in $x$, $y$ increases by $100\beta_2$ percent, that is, by the proportion $\beta_2$. Using the slope expression above, we have $dy/y = \beta_2\,dx$.

Example

Consider the model in which the PRICE of a house is related to the house size measured in square feet (SQFT). Let $y$ = PRICE and $x$ = SQFT. The log-linear equation is

$$\widehat{\ln}(y) = b_1 + b_2 x$$

The data and the summary regression output for this example are in the Excel file "CH5 DATA". The estimated regression equation is

$$\widehat{\ln}(y) = 10.8386 + 0.000411x$$

Consider a house size of $x = 2000$ sqft. The impact of an additional square foot is shown in the following calculations:

$$x_0 = 2000,\ x_1 = 2001: \quad \ln(y_0) = 11.66113,\ \ln(y_1) = 11.66155,\ y_0 = 115975.5,\ y_1 = 116023.2,\ \Delta y = 47.7,\ \Delta y/y_0 = 0.000411$$

Note that when $x_0 = 2000$ sqft, each additional square foot increases the price of the house by $\Delta y = \$47.7$; the proportional increase is $\Delta y/y_0 = 0.000411$, or 0.04%. Now consider a house size of $x = 4000$ sqft:

$$x_0 = 4000,\ x_1 = 4001: \quad \ln(y_0) = 12.48367,\ \ln(y_1) = 12.48408,\ y_0 = 263991.4,\ y_1 = 264100.0,\ \Delta y = 108.6,\ \Delta y/y_0 = 0.000411$$

For a larger house, here $x = 4000$ sqft, each additional square foot adds a larger dollar amount, $\Delta y = \$108.6$, to the price of the house. The percentage change in the price, however, is the same.

7.2.1. Adjustment to the Predicted Value in Log-Linear Models

In the calculations in the previous two tables, the predicted value of $y$ for a given value of $x$ was obtained by taking the exponent (antilog) of the predicted log of $y$:

$$x = 2000: \quad \widehat{\ln}(y) = 10.8386 + 0.000411(2000) = 11.66113$$
$$y \equiv \hat{y}_n = \exp(11.66113) = 115975.5$$

Here $\hat{y}_n$ denotes the "natural" predictor. In most cases (for large samples) a "corrected" predicted value is obtained by multiplying the natural predictor by the quantity $e^{\text{var}(e)/2}$. In the regression summary output, $\text{var}(e)$ is shown as $MSE$, the mean square error:

$$\hat{y}_c = \hat{y}_n\,e^{\text{var}(e)/2}$$

In the above example, the regression summary output shows $\text{var}(e) = MSE = 0.10334$. Thus, for $x = 2000$,

$$\hat{y}_c = 115975.5\,e^{0.10334/2} = 122125.4$$

The natural predictor tends to systematically under-predict the value of $y$ in a log-linear model. The corrected predictor offsets this downward bias in large samples.
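A sketch of the natural versus corrected predictor for the house price equation (the coefficients and MSE are the ones quoted above; exact outputs depend on the unrounded coefficients):

```python
import math

b1, b2 = 10.8386, 0.000411   # estimated log-linear coefficients
mse = 0.10334                # var(e) from the regression summary output

def predict_price(sqft, corrected=True):
    """Natural predictor exp(b1 + b2*x); the corrected version multiplies by exp(var(e)/2)."""
    y_n = math.exp(b1 + b2 * sqft)
    return y_n * math.exp(mse / 2) if corrected else y_n

print(predict_price(2000, corrected=False))  # natural predictor, about $116,000
print(predict_price(2000))                   # corrected predictor, about $122,000
```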
Example: A Growth Model

The Excel file "CH5 DATA" tab "wheat" contains data describing the average wheat yield (tons per hectare) for a region in Australia against time ($t$), which runs from 1950 to 1997. The rise in yield over time is attributed to improvements in technology, with $t$ used as a proxy for technology. The objective here is to obtain an estimate of the average rate of growth in yield.

[Figure: wheat yield (tons per hectare) against time, 1950–1997]

Let $y$ stand for YIELD, where $y_0$ is the yield in the base year and $y_t$ is the yield in year $t$. Also, let $g$ stand for the rate of growth. Then

$$y_t = y_0(1 + g)^t$$

Taking the natural log of both sides and using the properties of logarithms, we have

$$\ln(y_t) = \ln(y_0) + t\ln(1 + g)$$

We can write this as a log-linear regression model with $b_1 = \ln(y_0)$ and $b_2 = \ln(1 + g)$:

$$\widehat{\ln}(y_t) = b_1 + b_2 t$$

The estimated regression equation is

$$\widehat{\ln}(y_t) = -0.3434 + 0.01784t$$

From the estimated coefficients we can determine the base-year yield and the growth rate:

Base-year yield: $b_1 = \ln(y_0) = -0.3434$, so $y_0 = \exp(-0.3434) = 0.709$ tons per hectare.
Growth rate: $b_2 = \ln(1 + g) = 0.01784$, so $1 + g = \exp(0.01784) = 1.018$ and $g = 1.018 - 1 = 0.018$.

The estimated average annual growth rate is then approximately 1.8%.

Example: A Wage Equation

The Excel file "CH5 DATA" tab "wage" contains data describing the hourly wage (WAGE) against years of education (EDUC). In this example the objective is to determine the estimated average rate of increase in the wage rate for each additional year of schooling. Using the same methodology as in the previous example, we have

$$WAGE = WAGE_0(1 + g)^{EDUC}$$
$$\ln(WAGE) = \ln(WAGE_0) + \ln(1 + g)\,EDUC$$

We obtain the following estimated regression equation:

$$\widehat{\ln}(WAGE) = 1.6094 + 0.0904\,EDUC$$

$$WAGE_0 = \exp(1.6094) = 5.00 \qquad 1 + g = \exp(0.0904) = 1.095 \qquad g = 0.095$$

Thus, the estimated rate of increase for an additional year of education is approximately 9.5%.

7.2.2. Predicted Value in the Log-Linear Wage-Education Model

What is the predicted value of WAGE for a person with 12 years of education?

$$\widehat{\ln}(WAGE) = 1.6094 + 0.0904(12) = 2.694$$
$$\widehat{WAGE} = \exp(2.694) = 14.7958 = \$14.80$$

According to the text (p. 154), this figure in a log-linear model is the "natural" predictor ($\hat{y}_n$). We need to find the corrected predictor, $\hat{y}_c$:

$$\hat{y}_c = \hat{y}_n\,e^{\text{var}(e)/2} = 14.7958\times\exp(0.2773/2) = 16.996 = \$17.00$$

In large samples the natural predictor $\hat{y}_n$ tends to systematically under-predict the value of the dependent variable. The correction offsets this downward bias.

7.2.3. Generalized R² Measure for Log-Linear Models

When considering $R^2$ in a log-linear model, we need to keep two things in mind: (1) the $R^2$ shown in the regression output involves $\widehat{\ln}(y)$; we need $R^2$ as a measure of the explained variation in $y$, rather than in $\ln(y)$; (2) when we take the antilog of the predicted values, the result is a set of natural predictors, which we need to change into the corrected predictors. Thus, we obtain the general $R^2$ as follows. Recall from above that $R^2$ is equal to the squared coefficient of correlation between $y$ and $\hat{y}$. Here we must find the correlation coefficient between $y$ and $\hat{y}_c$ and then square it:

$$R^2 = r_{y\hat{y}_c}^2 = 0.1859$$

The calculations are shown in the Excel file.

7.2.4. Prediction Interval in the Log-Linear Model

Here we want to build a prediction interval for WAGE when EDUC = 12. This is an interval for the individual value of $y$ for a given $x$. First, we need to determine $\text{se}(y_0)$ when $x_0 = 12$. From the regression equation $\widehat{\ln}(y) = b_1 + b_2 x$ we have the covariance matrix

$$\text{var}(e)\,(X'X)^{-1} = \begin{bmatrix} \text{var}(b_1) & \text{cov}(b_1,b_2) \\ \text{cov}(b_1,b_2) & \text{var}(b_2) \end{bmatrix} = \begin{bmatrix} 0.0075 & -0.00052 \\ -0.00052 & 0.000038 \end{bmatrix}$$

$$\text{var}[\ln(y_0)] = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2) + \text{var}(e)$$
$$\text{var}[\ln(y_0)] = 0.0075 + 12^2(0.000038) - 2(12)(0.00052) + 0.2773 = 0.2777$$
$$\text{se}[\ln(y_0)] = 0.5270$$

$$MOE = t_{0.025,998}\times\text{se}[\ln(y_0)] = 1.962\times 0.5270 = 1.034$$

$$\widehat{\ln}(y|x_0) = 2.694$$
$$L_{\ln(y_0)} = 2.694 - 1.034 = 1.660 \qquad U_{\ln(y_0)} = 2.694 + 1.034 = 3.728$$
$$L_{y_n} = \exp(1.660) = 5.2604 \qquad U_{y_n} = \exp(3.728) = 41.6158$$
$$L_{y_c} = 5.2604\times\exp(0.2773/2) = \$6.04 \qquad U_{y_c} = 41.6158\times\exp(0.2773/2) = \$47.80$$
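The interval calculation can be scripted as follows (all inputs are the values quoted above; the 0.975 quantile of $t$ with 998 degrees of freedom is about 1.962):

```python
import math

b1, b2 = 1.6094, 0.0904
var_b1, var_b2, cov_b1b2 = 0.0075, 0.000038, -0.00052
var_e = 0.2773
x0, t = 12, 1.962

ln_y = b1 + b2 * x0                                              # 2.694
var_ln_y0 = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b1b2 + var_e  # ~0.2777
moe = t * math.sqrt(var_ln_y0)                                   # ~1.034

correction = math.exp(var_e / 2)   # the text's large-sample correction factor
lower = math.exp(ln_y - moe) * correction
upper = math.exp(ln_y + moe) * correction
print(round(lower, 2), round(upper, 2))   # about 6.04 and 47.80
```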
The prediction interval [$6.04, $47.80] is so wide that it is basically useless. This indicates that our model is not an accurate predictor of the range of the dependent variable values for a given $x$. To develop a better predictor we need to add additional variables to the model and approach the situation via a multiple regression model. This will be done in the next chapter.

7.3. Log-Log Models

The log-log model is used in describing demand equations and production functions. Generally,

$$\ln(y) = \beta_1 + \beta_2\ln(x)$$

To determine the slope $dy/dx$, first take the exponent of both sides:

$$y = e^{\beta_1 + \beta_2\ln(x)}$$

The slope is then

$$\frac{dy}{dx} = \beta_2 e^{\beta_1 + \beta_2\ln(x)}\cdot\frac{1}{x} = \beta_2\frac{y}{x}$$

The elasticity is

$$\varepsilon = \frac{dy}{dx}\cdot\frac{x}{y} = \beta_2$$

Thus, in the log-log model, the coefficient $\beta_2$ represents the percentage change in $y$ in response to a 1% change in $x$.

Example: A Log-Log Poultry Demand Equation

The Excel file "CH5 DATA" tab "chicken" contains data describing the per capita consumption of chicken (in pounds) against the real (inflation-adjusted) price. Using the log-log model $\ln(q) = \beta_1 + \beta_2\ln(p)$, the estimated regression equation is

$$\widehat{\ln}(q) = 3.7169 - 1.1214\ln(p)$$

Here the coefficient of $\ln(p)$ is the estimated elasticity of demand, $-1.121$: for a 1% increase in the real price of chicken, the quantity demanded is reduced by 1.121%.

To obtain the predicted value of per capita consumption when $p = \$2.00$:

$$\widehat{\ln}(q) = 3.7169 - 1.1214\ln(2) = 2.94$$
$$\hat{q}_n = \exp(2.94) = 18.91$$
$$\hat{q}_c = \hat{q}_n\,e^{\text{var}(e)/2} = 18.91\times\exp(0.01392/2) = 19.042$$

The generalized $R^2$ is

$$R_G^2 = r_{y\hat{y}_c}^2 = 0.8818$$

See the Excel file for the calculations.

Appendix

The variance formula

$$\text{var}(\hat{y}_0) = \sigma_u^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum(x - \bar{x})^2}\right)$$

is obtained as follows. Start with the estimated regression equation and the predicted value of $y$ for the given $x_0$:

$$\hat{y}_0 = b_1 + b_2 x_0$$

Taking the variance of both sides of the equation, we have

$$\text{var}(\hat{y}_0) = \text{var}(b_1 + b_2 x_0) = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2)$$

On the right-hand side, substituting

$$\text{var}(b_1) = \frac{\sum x^2}{n\sum(x - \bar{x})^2}\sigma_u^2 \qquad \text{var}(b_2) = \frac{\sigma_u^2}{\sum(x - \bar{x})^2} \qquad \text{cov}(b_1, b_2) = \frac{-\bar{x}}{\sum(x - \bar{x})^2}\sigma_u^2$$

we have

$$\text{var}(\hat{y}_0) = \frac{\sum x^2}{n\sum(x - \bar{x})^2}\sigma_u^2 + \frac{x_0^2}{\sum(x - \bar{x})^2}\sigma_u^2 - \frac{2x_0\bar{x}}{\sum(x - \bar{x})^2}\sigma_u^2$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\,\frac{\sum x^2 + nx_0^2 - 2nx_0\bar{x} + n\bar{x}^2 - n\bar{x}^2}{n\sum(x - \bar{x})^2}$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\,\frac{\sum x^2 - n\bar{x}^2 + n(x_0 - \bar{x})^2}{n\sum(x - \bar{x})^2}$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\,\frac{\sum(x - \bar{x})^2 + n(x_0 - \bar{x})^2}{n\sum(x - \bar{x})^2}$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum(x - \bar{x})^2}\right)$$
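As a closing check, the appendix identity can be verified symbolically. A sketch using SymPy, applying the same substitution $\sum x^2 = \sum(x - \bar{x})^2 + n\bar{x}^2$ used in the derivation above:

```python
import sympy as sp

n, x0, xbar, Sxx, sig2 = sp.symbols('n x0 xbar S_xx sigma_u2', positive=True)
sum_x2 = Sxx + n * xbar**2   # identity: sum(x^2) = sum((x - xbar)^2) + n*xbar^2

# Variances and covariance of b1 and b2, as used in the appendix
var_b1 = sum_x2 * sig2 / (n * Sxx)
var_b2 = sig2 / Sxx
cov_b1b2 = -xbar * sig2 / Sxx

var_yhat = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b1b2
target = sig2 * (1 / n + (x0 - xbar)**2 / Sxx)

print(sp.simplify(var_yhat - target))   # prints 0: the two expressions are identical
```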