CHAPTER 5
PREDICTION, GOODNESS-OF-FIT, AND MODELING ISSUES

1. Confidence Interval for the Mean Value of y for a Given x, and Prediction Interval for the Individual Value of y for a Given x
2. Coefficient of Determination, R²
3. Correlation Analysis
   3.1. The Relationship Between R² and r
   3.2. Another Point About R²
4. Reporting Regression Results
   4.1. Computer Output
   4.2. Reporting the Summary Results
5. The F Test of Goodness of Fit
6. Modeling Issues
   6.1. The Effects of Scaling the Data
        6.1.1. Changing the scale of x
        6.1.2. Changing the scale of y
        6.1.3. Changing the scale of x and y by the same factor c
7. Choosing a Functional Form
   7.1. Linear-Log (Semi-log) Model
   7.2. Log-Linear Model
        7.2.1. Adjustment to the Predicted Value in Log-Linear Models
        7.2.2. Predicted Value in the Log-Linear Wage-Education Model
        7.2.3. Generalized R² Measure for Log-Linear Models
        7.2.4. Prediction Interval in the Log-Linear Model
   7.3. Log-Log Models
Appendix

1. Confidence Interval for the Mean Value of y for a Given x, and Prediction Interval for the Individual Value of y for a Given x

In Chapter 4, Section 3.1, we learned how to build an interval estimate for the mean value of the dependent variable $y$ for a given or specified value of the independent variable $x$. For that we used the example of weekly food expenditure as a function of weekly income,

$$FOODEXP = \beta_1 + \beta_2\,INCOME + u$$

There we obtained the estimated regression equation $\hat{y} = b_1 + b_2 x$,

$$\hat{y} = 83.416 + 10.21x$$

and built an interval estimate for $\hat{y}$ when $x = 10$ (weekly income = $1,000). Since $\hat{y}$ is a point on the estimated regression line, it is an estimate of the mean value of $y$, food expenditure, in the population for the given weekly income. Therefore, when we build an interval estimate for $\hat{y}$, we are building an interval estimate for the mean or expected value of $y$ for a given $x$: $\mu_{y|x_0} = E(y|x_0)$.

We know, however, that for each value of $x$ in the population there are many different values of $y$ corresponding to that $x$. How do we build an interval estimate for an individual value, rather than the mean value, of $y$?

Let $x_0$ denote the given value of $x$. In the population regression function, $y_0$ is an individual value of $y$ for the given $x$, which deviates from the mean value by the disturbance term $u_0$:

$$y_0 = E(y|x_0) + u_0 = \beta_1 + \beta_2 x_0 + u_0$$

When we obtain the estimated sample regression equation, the observed value of $y$ for each value of $x$ deviates from the predicted value by the prediction error $e$:

$$y_0 = \hat{y}_0 + e_0 = b_1 + b_2 x_0 + e_0$$

The objective now is to build an interval estimate for this $y_0$:

$$L, U = \hat{y}_0 \pm t_{\alpha/2,(n-2)}\,\text{se}(y_0)$$

Here we need a point estimator for $y_0$ and its standard error $\text{se}(y_0)$. Since the expected value of $y_0$ in the population is estimated by $\hat{y}_0$, the prediction interval is built around the estimated $\hat{y}_0$ from the sample; the estimator of $y_0$ is $\hat{y}_0$. To determine $\text{se}(y_0)$, first find $\text{var}(y_0)$ and then take its square root:

$$\text{var}(y_0) = \text{var}(\hat{y}_0 + e)$$

Given the assumption of independence of the $y$ values and the disturbance term, and the assumption of homoskedasticity, we can write

$$\text{var}(y_0) = \text{var}(\hat{y}_0) + \text{var}(e) = \text{var}(b_1 + b_2 x_0) + \text{var}(e) = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2) + \text{var}(e)$$

The interval estimate, or prediction interval, for an individual value of $y$ is then

$$L, U = \hat{y}_0 \pm t_{\alpha/2,(n-2)}\,\text{se}(y_0)$$

For comparison, let us build an interval estimate for the mean value of food expenditure for a weekly income of $2,000 ($x_0 = 20$) and a prediction interval for the individual value of food expenditure for the same weekly income.
The estimated regression equation is

$$\hat{y} = 83.416 + 10.21x$$

For $x_0 = 20$, $\hat{y} = 83.416 + 10.21(20) = 287.61$. The variance of the error term is $\text{var}(e) = 8013.294$. Using the inverse matrix $(X'X)^{-1}$, we can obtain the covariance matrix of the coefficients:

$$\text{var}(e)\,(X'X)^{-1} = \begin{bmatrix} \text{var}(b_1) & \text{cov}(b_1,b_2) \\ \text{cov}(b_1,b_2) & \text{var}(b_2) \end{bmatrix} = 8013.294\begin{bmatrix} 0.2352 & -0.0107 \\ -0.0107 & 0.00055 \end{bmatrix} = \begin{bmatrix} 1884.442 & -85.903 \\ -85.903 & 4.382 \end{bmatrix}$$

Confidence interval for the mean value of $y$:

$$L, U = \hat{y}_0 \pm t_{\alpha/2,df}\,\text{se}(\hat{y}_0)$$
$$\text{var}(\hat{y}_0) = \text{var}(b_1 + b_2 x_0) = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2) = 201.017$$
$$\text{se}(\hat{y}_0) = 14.178 \qquad t_{0.025,38} = 2.024 \qquad MOE = (2.024)(14.178) = 28.70$$
$$L = 287.61 - 28.70 = 258.91 \qquad U = 287.61 + 28.70 = 316.31$$

Prediction interval for the individual value of $y$:

$$L, U = \hat{y}_0 \pm t_{\alpha/2,df}\,\text{se}(y_0)$$
$$\text{var}(y_0) = \text{var}(\hat{y}_0) + \text{var}(e) = 201.017 + 8013.29 = 8214.31$$
$$\text{se}(y_0) = 90.633 \qquad t_{0.025,38} = 2.024 \qquad MOE = (2.024)(90.633) = 183.48$$
$$L = 287.61 - 183.48 = 104.13 \qquad U = 287.61 + 183.48 = 471.09$$

Note the chart below. The bands representing prediction intervals for individual values of $y$ are wider than the bands for the confidence intervals for mean values of $y$. Both bands are also narrowest in the middle: the further the $x$ values are from $\bar{x}$, the bigger the squared deviation $(x - \bar{x})^2$, hence the bigger the variance; and the greater the variance, the less reliable the prediction.

[Figure: estimated regression line for the food expenditure data with confidence-interval bands for the mean of y and wider prediction-interval bands for individual y, for x from 0 to 35]
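To make these calculations concrete, here is a short Python sketch (an illustration only, not part of the original Excel workbook) that reproduces both intervals from the estimates quoted above.

```python
import numpy as np
from scipy import stats

# Quantities taken from the estimated food expenditure model
b1, b2 = 83.416, 10.210           # intercept and slope
var_e = 8013.294                  # variance of the error term, var(e)
var_b1, var_b2 = 1884.442, 4.382  # var(b1), var(b2)
cov_b1b2 = -85.903                # cov(b1, b2)
n, x0 = 40, 20                    # sample size; income = $2,000 (x in $100s)

y_hat = b1 + b2 * x0                                    # point estimate, 287.61
var_yhat = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b1b2  # var(y-hat_0), about 201
var_y0 = var_yhat + var_e                               # var(y_0), about 8214
t = stats.t.ppf(0.975, n - 2)                           # t(0.025, 38) = 2.024

ci = (y_hat - t * np.sqrt(var_yhat), y_hat + t * np.sqrt(var_yhat))
pi = (y_hat - t * np.sqrt(var_y0), y_hat + t * np.sqrt(var_y0))
print(ci)   # approximately (258.9, 316.3)
print(pi)   # approximately (104.1, 471.1)
```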
2. Coefficient of Determination, R²

The closeness of fit of the regression line to the scatter plot is a measure of the closeness of the relationship between $x$ and $y$: the less scattered the observed $y$ values are around the regression line, the closer the relationship between $x$ and $y$. As explained above, $\text{se}(e)$ is one such measure of fit. However, $\text{se}(e)$ has a major drawback: it is an absolute measure and is therefore affected by the absolute size of the data. The larger the values in the data set, the larger the $\text{se}(e)$.

To see this drawback, consider the data in the food expenditure example. Suppose the dependent variable, weekly food expenditure, were also measured in $100s; for example, instead of recording a weekly food expenditure of $155, we record 1.55. As the calculations in the tab "food2" in the Excel file "CH5 DATA" show, the standard error of estimate is reduced from 89.517 to 0.895. This reduction in $\text{se}(e)$ is due solely to the change in the scale of the dependent variable data. Clearly, then, using $\text{se}(e)$ as a measure of closeness of fit suffers from the misleading impact of the absolute size, or scale, of the data used in the model.

An alternative measure of the closeness of fit, one that is not affected by the scale of the data, is the coefficient of determination, denoted $R^2$ (R-square). $R^2$ is a relative measure and is therefore not affected by the scale of the data. It measures the proportion of the total variation in $y$ explained by the regression (that is, by $x$). Basically, $R^2$ compares the deviations of the observed $y$ values around the regression line ($\hat{y}$) against the deviations of the same $y$ values around the mean line ($\bar{y}$). The diagram below shows the comparison of these deviations.

[Figure: scatter plot of the food expenditure data showing the regression line ŷ and the horizontal mean line ȳ, for x from 0 to 35]

Mathematically, $R^2$ is the proportion of the total squared deviation of the $y$ values from $\bar{y}$ that is explained by the regression line. To understand this statement, consider the following diagram.

[Figure: the observation y = 483 at x = 27.14, with the mean line ȳ = 284 and the regression value ŷ = 361, showing the total, explained, and unexplained deviations]

In the diagram, the horizontal line represents the mean of all the observed $y$ values, $\bar{y} = 284$, and the regression line is the regression equation $\hat{y} = 83.416 + 10.21x$. A single observed value $y = 483$ for a given $x = 27.14$ is selected. The vertical distance between this $y$ value and $\bar{y}$ is called the total deviation:

$$\text{Total Deviation} = y - \bar{y} = 483 - 284 = 199$$

The vertical distance between $\hat{y}$ on the regression line and $\bar{y}$ is called the explained deviation:

$$\text{Explained Deviation} = \hat{y} - \bar{y} = 361 - 284 = 77$$

As the diagram indicates, this portion of the total deviation is due to (or explained by) the regression model; that is, it is explained by the independent variable $x$. The vertical distance between $y$ and $\hat{y}$, the residual $e$, is called the unexplained deviation:

$$\text{Unexplained Deviation} = y - \hat{y} = 483 - 361 = 122$$

Thus,

$$\text{Total Deviation} = \text{Explained Deviation} + \text{Unexplained Deviation}$$
$$(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$$
$$199 = 77 + 122$$

Repeating the same process for all values of $y$, squaring the resulting deviations, and summing the squared values, we have the following sums of squared deviations:

1. Sum of squared total deviations, the Sum of Squares Total (SST): $\sum(y - \bar{y})^2$
2. Sum of squared explained deviations, the Sum of Squares Regression (SSR): $\sum(\hat{y} - \bar{y})^2$
3. Sum of squared unexplained deviations, the Sum of Squares Error (SSE): $\sum e^2 = \sum(y - \hat{y})^2$

It can be shown that $\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2$ (see footnote 1). That is,

$$SST = SSR + SSE$$

See the Excel file "CH5 DATA", tab "RSQ", for the calculations of the sums of squares:

$$\sum(y - \bar{y})^2 = 495132.16 \qquad \sum(\hat{y} - \bar{y})^2 = 190626.98 \qquad \sum(y - \hat{y})^2 = 304505.18$$

Note that $495132.16 = 190626.98 + 304505.18$.

As stated at the beginning of this discussion, $R^2$ measures the proportion of the total deviations in $y$ explained by the regression. Thus,

$$R^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{SSR}{SST} = \frac{190626.98}{495132.16} = 0.385$$

Also note:

$$\frac{SSE}{SST} = \frac{304505.18}{495132.16} = 0.615$$

Thus, when $R^2 = 0.385$, 38.5 percent of the variation in $y$, food expenditure, is explained by the regression model, that is, by the independent variable $x$, weekly income. The remaining 61.5 percent of the variation is due to other, unexplained factors. Note that if all the variation in $y$ were explained by income, then $R^2 = 1$. Thus the value of $R^2$ varies from 0 to 1: $0 \le R^2 \le 1$. Also note that the value of $R^2$ is not affected by the scale of the data; you can check this by rerunning the model in Excel with the food expenditure figures in hundreds of dollars.

Footnote 1: Starting from $(y - \bar{y}) = (\hat{y} - \bar{y}) + (y - \hat{y})$, square both sides and sum:
$$\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2 + 2\sum e(\hat{y} - \bar{y})$$
We must show that $\sum e(\hat{y} - \bar{y}) = 0$. Since $\sum e = 0$, $\sum e(\hat{y} - \bar{y}) = \sum e\hat{y} - \bar{y}\sum e = \sum e\hat{y}$. Now show that $\sum e\hat{y} = 0$:
$$\sum e\hat{y} = \sum e(b_1 + b_2 x) = \sum e(\bar{y} - b_2\bar{x} + b_2 x) = \bar{y}\sum e + b_2\sum e(x - \bar{x}) = b_2\sum ex - b_2\bar{x}\sum e = b_2\sum ex$$
and $\sum ex = \sum x(y - b_1 - b_2 x) = 0$, which is the normal equation obtained in the development of the least squares coefficients: $\partial\sum e^2/\partial b_2 = -2\sum x(y - b_1 - b_2 x) = 0$. With this, $\sum e(\hat{y} - \bar{y}) = 0$, and hence $\sum(y - \bar{y})^2 = \sum(\hat{y} - \bar{y})^2 + \sum(y - \hat{y})^2$.
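The decomposition is easy to verify numerically. The sketch below is illustrative only: the helper name and the small data set are made up, not taken from the CH5 DATA file.

```python
import numpy as np

def anova_decomposition(x, y):
    """Fit y = b1 + b2*x by least squares and return SST, SSR, SSE, R^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    y_hat = b1 + b2 * x
    sst = np.sum((y - y.mean())**2)      # total
    ssr = np.sum((y_hat - y.mean())**2)  # explained (regression)
    sse = np.sum((y - y_hat)**2)         # unexplained (error)
    return sst, ssr, sse, ssr / sst

# A small made-up data set for illustration
x = [1, 2, 3, 4, 5, 6]
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9]
sst, ssr, sse, r2 = anova_decomposition(x, y)
print(sst, ssr + sse)   # equal up to rounding: SST = SSR + SSE
print(r2)
```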
3. Correlation Analysis

In Chapters 1 and 2 the concept of covariance was explained as a measure of the extent of association between two variables $x$ and $y$; $\sigma_{xy}$ was used as the symbol for the population covariance and $s_{xy}$ for the sample covariance. It was also explained that, to avoid the distorting impact of the scale of the data on the covariance, the correlation coefficient is obtained by dividing the covariance by the product of the standard deviations of $x$ and $y$:

$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}\ \text{(population correlation coefficient)} \qquad r = \frac{s_{xy}}{s_x s_y}\ \text{(sample correlation coefficient)}$$

In the sample formula,

$$s_{xy} = \frac{\sum(x - \bar{x})(y - \bar{y})}{n - 1} \qquad s_x = \sqrt{\frac{\sum(x - \bar{x})^2}{n - 1}} \qquad s_y = \sqrt{\frac{\sum(y - \bar{y})^2}{n - 1}}$$

from which we obtain

$$r_{xy} = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2}\sqrt{\sum(y - \bar{y})^2}}$$

Since $r$ is a relative measure, $-1 \le r \le 1$. The closer the coefficient of correlation is to $-1$ or $1$, the stronger the association between the variations in $y$ and the variations in $x$.

3.1. The Relationship Between R² and r

We can show that the coefficient of determination $R^2$, which shows how closely the variations in the dependent variable $y$ are associated with the variations in the explanatory variable $x$, is equal to the correlation coefficient squared:

$$R^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = r_{xy}^2 \qquad \text{(see footnote 2)}$$

Footnote 2: In the numerator of $R^2$, substitute $\hat{y} = b_1 + b_2 x$ and then $b_1 = \bar{y} - b_2\bar{x}$:
$$\sum(\hat{y} - \bar{y})^2 = \sum(b_1 + b_2 x - \bar{y})^2 = \sum(\bar{y} - b_2\bar{x} + b_2 x - \bar{y})^2 = \sum[b_2(x - \bar{x})]^2 = b_2^2\sum(x - \bar{x})^2$$
so that
$$R^2 = \frac{b_2^2\sum(x - \bar{x})^2}{\sum(y - \bar{y})^2}$$
Using $b_2 = \sum(x - \bar{x})(y - \bar{y})/\sum(x - \bar{x})^2$ and substituting for $b_2^2$ in the numerator,
$$R^2 = \frac{[\sum(x - \bar{x})(y - \bar{y})]^2}{[\sum(x - \bar{x})^2]^2}\cdot\frac{\sum(x - \bar{x})^2}{\sum(y - \bar{y})^2} = \frac{[\sum(x - \bar{x})(y - \bar{y})]^2}{\sum(x - \bar{x})^2\sum(y - \bar{y})^2} = r_{xy}^2$$

3.2. Another Point About R²

When it is said that $R^2$ is a measure of "goodness of fit", this refers to the correlation between the observed and predicted values of $y$, which can be expressed as $r_{y\hat{y}}$. Like any other measure of correlation between two variables,

$$r_{y\hat{y}} = \frac{\sum(y - \bar{y})(\hat{y} - \bar{\hat{y}})}{\sqrt{\sum(y - \bar{y})^2\sum(\hat{y} - \bar{\hat{y}})^2}}$$

That $r_{yx} = r_{y\hat{y}}$ is easily explained by the fact that $\hat{y}$ is a linear transformation of the variable $x$: $\hat{y} = b_1 + b_2 x$. The correlation between $y$ and $x$ is therefore the same as the correlation between $y$ and a linear transformation of $x$. It can also be shown that $r_{y\hat{y}}^2 = R^2$ (see footnote 3):

$$r_{y\hat{y}}^2 = \frac{[\sum(y - \bar{y})(\hat{y} - \bar{\hat{y}})]^2}{\sum(y - \bar{y})^2\sum(\hat{y} - \bar{\hat{y}})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = R^2$$

Thus $R^2$ is also a measure of how well the estimated regression fits the data.

Footnote 3: First, we can show that the mean of the predicted values equals the mean of the observed values, $\bar{\hat{y}} = \bar{y}$: from $\hat{y} = b_1 + b_2 x$, $\sum\hat{y} = nb_1 + b_2\sum x$, so $\bar{\hat{y}} = b_1 + b_2\bar{x} = (\bar{y} - b_2\bar{x}) + b_2\bar{x} = \bar{y}$. Then
$$\sum(y - \bar{y})(\hat{y} - \bar{\hat{y}}) = \sum(\hat{y} + e - \bar{y})(\hat{y} - \bar{y}) = \sum(\hat{y} - \bar{y})^2 + \sum e(\hat{y} - \bar{y}) = \sum(\hat{y} - \bar{y})^2$$
since $\sum e(\hat{y} - \bar{y}) = 0$. Thus
$$r_{y\hat{y}}^2 = \frac{[\sum(\hat{y} - \bar{y})^2]^2}{\sum(y - \bar{y})^2\sum(\hat{y} - \bar{y})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{SSR}{SST} = R^2$$
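These identities can also be checked numerically. A minimal sketch with simulated data (the variable names and the simulated sample are mine; np.corrcoef is NumPy's correlation routine):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 1.5 * x + rng.normal(0, 3, 50)   # simulated linear relationship

b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b1 = y.mean() - b2 * x.mean()
y_hat = b1 + b2 * x

r_xy = np.corrcoef(x, y)[0, 1]         # correlation between x and y
r_yyhat = np.corrcoef(y, y_hat)[0, 1]  # correlation between y and y-hat
R2 = np.sum((y_hat - y.mean())**2) / np.sum((y - y.mean())**2)

print(r_xy**2, r_yyhat**2, R2)   # all three agree: r_xy^2 = r_yyhat^2 = R^2
```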
4. Reporting Regression Results

4.1. Computer Output

Several statistical software packages are available to produce regression results. We will use Excel's regression output for illustration. In Excel, Regression is found under Tools, Data Analysis. Following the simple instructions in the dialog box presented by Excel, the following output is generated for the food expenditure example:

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.6205
R Square             0.3850
Adjusted R Square    0.3688
Standard Error       89.517
Observations         40

ANOVA
             df    SS           MS         F        Significance F
Regression    1    190626.98    190627     23.789   1.95E-05
Residual     38    304505.18    8013.294
Total        39    495132.16

            Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept   83.416         43.410           1.922    0.0622    -4.463      171.295
income      10.210         2.093            4.877    0.0000    5.972       14.447

The following table contains the symbols and formulas used in generating this output:

Regression Statistics
Multiple R           Not relevant to simple regression
R Square             $R^2 = SSR/SST$
Adjusted R Square    Not relevant to simple regression
Standard Error       $\text{se}(e) = \sqrt{\sum e^2/(n-2)} = \sqrt{\sum(y-\hat{y})^2/(n-2)} \equiv \sqrt{SSE/(n-2)} = \sqrt{MSE}$
Observations         $n$

ANOVA*
             df**     SS                                 MS                   F*
Regression   $k-1$    $SSR = \sum(\hat{y}-\bar{y})^2$    $MSR = SSR/(k-1)$    $F = MSR/MSE$
Residual     $n-k$    $SSE = \sum(y-\hat{y})^2$          $MSE = SSE/(n-k)$
Total        $n-1$    $SST = \sum(y-\bar{y})^2$
Significance F       Tail area of the $F$ distribution

               Coefficients   Standard Error†    t Stat                       P-value       Lower 95%††         Upper 95%
Intercept      $b_1$          $\text{se}(b_1)$   $|t| = b_1/\text{se}(b_1)$   $P(t > |t|)$  $L = b_1 - MOE_1$   $U = b_1 + MOE_1$
X Variable 1   $b_2$          $\text{se}(b_2)$   $|t| = b_2/\text{se}(b_2)$   $P(t > |t|)$  $L = b_2 - MOE_2$   $U = b_2 + MOE_2$

Notes:
*  ANOVA and the $F$ distribution are explained below.
** $k$ denotes the number of parameters estimated; here $k = 2$.
†  $\text{se}(b_1) = \text{se}(e)\sqrt{\sum x^2/\left(n\sum(x-\bar{x})^2\right)}$ and $\text{se}(b_2) = \text{se}(e)/\sqrt{\sum(x-\bar{x})^2}$
†† $MOE_1 = t_{\alpha/2,(n-2)}\,\text{se}(b_1)$ and $MOE_2 = t_{\alpha/2,(n-2)}\,\text{se}(b_2)$

4.2. Reporting the Summary Results

In many cases, rather than providing the whole computer output, the regression output is reported in a summary form. The following are two ways in which summary results are reported.

ŷ = 83.416 + 10.210x      R² = 0.385
    (43.41)   (2.093)     (s.e.)

This summary report provides the values of the standard errors of the regression coefficients, $\text{se}(b_1)$ and $\text{se}(b_2)$. This information allows us to obtain the confidence intervals for the parameters of the regression: you just need to compute the $MOE$ using $t_{\alpha/2,(n-k)}$ and the standard error of each coefficient. You can also divide the coefficient value by its standard error to obtain the test statistic for the hypothesis test about the parameter.

Alternatively, the summary result is reported as follows:

ŷ = 83.416 + 10.210x      R² = 0.385
    (1.922)   (4.877)     (t)

Here, you can use the $t$ stat and either compute the probability value (you must use a computer) or compare it to the critical value $t_{\alpha/2,(n-k)}$ to test the null hypothesis $H_0: \beta_2 = 0$.
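The numbers in this output can be reproduced directly from the formulas in the table above. The sketch below is a hypothetical helper built from those formulas, not Excel's own routine:

```python
import numpy as np
from scipy import stats

def simple_ols_summary(x, y, alpha=0.05):
    """Reproduce the coefficient block of the summary output for y = b1 + b2*x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n, k = len(y), 2
    sxx = np.sum((x - x.mean())**2)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b1 = y.mean() - b2 * x.mean()
    e = y - (b1 + b2 * x)
    mse = np.sum(e**2) / (n - k)                        # var(e) = MSE
    se_b2 = np.sqrt(mse / sxx)                          # se(b2)
    se_b1 = np.sqrt(mse * np.sum(x**2) / (n * sxx))     # se(b1)
    tcrit = stats.t.ppf(1 - alpha / 2, n - k)           # t(alpha/2, n-k)
    rows = {}
    for name, b, se in [("Intercept", b1, se_b1), ("X Variable 1", b2, se_b2)]:
        tstat = b / se
        pval = 2 * stats.t.sf(abs(tstat), n - k)        # two-tailed P-value
        rows[name] = (b, se, tstat, pval, b - tcrit * se, b + tcrit * se)
    return rows   # coefficient, std error, t stat, P-value, lower, upper
```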
5. The F Test of Goodness of Fit

The goodness of fit of the regression is measured by $R^2$:

$$R^2 = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2}$$

The more closely the observed values of $y$ are clustered around the regression line, the better the goodness of fit, or the greater the linear association between the dependent variable $y$ and the explanatory variable $x$. We also saw above that $R^2$ can be computed as the square of the correlation coefficient $r$: $R^2 = r^2$, where

$$r = \frac{\sum(x - \bar{x})(y - \bar{y})}{\sqrt{\sum(x - \bar{x})^2}\sqrt{\sum(y - \bar{y})^2}} \qquad r^2 = \frac{[\sum(x - \bar{x})(y - \bar{y})]^2}{\sum(x - \bar{x})^2\sum(y - \bar{y})^2} = R^2$$

Theoretically, two variables $x$ and $y$ are independent if the population correlation coefficient $\rho$ is zero. Within the simple linear regression context, the absence of a linear relationship between $x$ and $y$ would imply that the slope parameter $\beta_2$ is zero, so that none of the total deviation of $y$ from its mean would be accounted for by the regression. As explained, in the sample $R^2$ measures this explained deviation relative to the total. However, even if there is no relationship between $x$ and $y$ in the population, the probability that an $R^2$ computed from a random sample would be exactly zero is practically nil. A sample $R^2$ will therefore always be a number greater than zero.

In simple regression analysis, therefore, to conclude that there is a relationship between $x$ and $y$, we perform the test of hypothesis

$$H_0: \beta_2 = 0 \quad \text{versus} \quad H_1: \beta_2 \ne 0$$

This has already been done using $b_2$ as the test statistic and performing the "$t$ test". We may consider $R^2$, in a way, an alternative test statistic for the same hypothesis: if $R^2$ is significantly different from zero, we reject the null hypothesis. To determine whether $R^2$ is significantly different from zero we need a critical value, for a given significance level $\alpha$, against which to compare the test statistic. The problem here is that there is no statistical critical value directly related to $R^2$: $R^2$ is obtained as a ratio of two sums of squared deviations, $SSR$ over $SST$, and as such it does not follow a probability distribution such as $z$, $t$, or chi-square.

The way around this problem is the indirect approach of measuring the mean $SSR$ relative to the mean $SSE$. This way we are comparing two measures of the variance of $y$: variance due to the regression versus variance due to unexplained factors. Hence the term ANOVA, analysis of variance. If explained deviations outweigh the unexplained deviations, the variance measure in the numerator of the variance ratio will be greater than that in the denominator. Since the variance ratio is a ratio of squared terms, it is always positive; the larger the variation due to regression, the further the quotient rises above 1, indicating a better fit.

To obtain a variance measure from sample data we divide the sum of squared deviations by its degrees of freedom to determine the mean square. The two mean squares in the regression ANOVA are the mean square regression ($MSR$) and the mean square error ($MSE$):

$$MSR = \frac{SSR}{df} = \frac{\sum(\hat{y} - \bar{y})^2}{k - 1}$$

where $k$ is the number of parameters in the regression ($\beta_1$ and $\beta_2$, so $k = 2$), and

$$MSE = \frac{SSE}{df} = \frac{\sum(y - \hat{y})^2}{n - k}$$

The ratio of $MSR$ to $MSE$ is called the $F$ ratio.
$$F = \frac{MSR}{MSE}$$

The $F$ ratio is a test statistic with a specific probability distribution called the $F$ distribution (see footnote 4). The $F$ distribution is the ratio of two independent chi-square random variables, each divided by its own degrees of freedom, with $v_1$ the degrees of freedom of the numerator and $v_2$ that of the denominator:

$$F_{(v_1, v_2)} = \frac{\chi_1^2/v_1}{\chi_2^2/v_2}$$

The $F$ distribution is used in testing the equality of two population variances, as is being done in the case discussed here:

$$F_{(k-1,\,n-k)} = \frac{\sum(\hat{y} - \bar{y})^2/(k - 1)}{\sum(y - \hat{y})^2/(n - k)}$$

The numerator variance measure represents the average squared deviation of the predicted values from the mean of $y$: the mean square deviation explained by the regression. The denominator is the mean square deviation of the observed values from the regression line: the unexplained mean square deviation.

Footnote 4: Named after the English statistician Sir Ronald A. Fisher.

To observe the impact of these mean squares on the value of $F$, consider the following two models and corresponding figures. In (A), test scores on an exam are observed against the explanatory variable hours studied. In (B), the same test scores are observed against randomly generated numbers as the explanatory variable.

(A) Scores against hours studied:

Hours Studied   1.0   2.5   3.0   3.5   4.0   4.5   5.0   5.5   6.0   7.0
Score           56    52    72    56    92    72    88    96    80    100

R² = 0.6564        b₂ = 8.0140       |t| = 3.9091    prob value = 0.0045
MSR = 1836.8056    MSE = 120.1993    F = 15.2813     Significance F = 0.0045

(B) Scores against random numbers:

Random numbers   15    63    42    51    85    93    32    43    31    65
Score            56    52    72    56    92    72    88    96    80    100

R² = 0.0329        b₂ = 0.1299       |t| = 0.5213    prob value = 0.6163
MSR = 91.9413      MSE = 338.3073    F = 0.2718      Significance F = 0.6163

The regression model in (A) shows the relationship between test scores and hours studied. Model (B) regresses the same test scores against a set of numbers randomly selected from 1 to 100. Note that the regression line in model (B) is practically flat, with a slope of about 0.13. Correspondingly, the $|t|$ statistic for $H_0: \beta_2 = 0$ yields a probability value of 0.6163, leading us to conclude convincingly that the population slope parameter is zero. Now pay attention to $MSR$. The regression line, as shown in panel (B) of the diagram below, is very close to the $\bar{y}$ line, leaving very little room for the deviations $\hat{y} - \bar{y}$ and thus making $MSR = \sum(\hat{y} - \bar{y})^2/(k - 1)$ very small relative to $MSE = \sum(y - \hat{y})^2/(n - k)$. The $F$ statistic is hence a small value: $91.9413/338.3073 = 0.2718$. The probability value (the tail area under the $F$ curve to the right of the $F$ value of 0.2718, computed with =F.DIST.RT(0.2718,1,8) in Excel) is 0.6163, clearly leading us not to reject $H_0: \beta_2 = 0$.

In contrast, the regression line in panel (A) indicates a pronounced slope, making the deviations $\hat{y} - \bar{y}$, and hence $MSR$, large relative to $MSE$. The $F$ statistic is thus a large value: $1836.8056/120.1993 = 15.2813$. The probability value is 0.0045, clearly leading us to reject $H_0: \beta_2 = 0$.

[Figure: panel (A) shows scores against hours studied with a steep regression line well away from the ȳ line; panel (B) shows scores against random numbers with a nearly flat regression line lying close to the ȳ line]
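Both $F$ statistics can be reproduced from the data in the tables above. A minimal sketch (scipy.stats.f.sf returns the right-tail area, i.e., the Significance F):

```python
import numpy as np
from scipy import stats

def f_test(x, y, k=2):
    """Compute the F statistic and its p-value for the simple regression of y on x."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    n = len(y)
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    y_hat = b1 + b2 * x
    msr = np.sum((y_hat - y.mean())**2) / (k - 1)   # mean square regression
    mse = np.sum((y - y_hat)**2) / (n - k)          # mean square error
    F = msr / mse
    return F, stats.f.sf(F, k - 1, n - k)           # right-tail area of F(k-1, n-k)

score = [56, 52, 72, 56, 92, 72, 88, 96, 80, 100]
hours = [1.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 7.0]
rand  = [15, 63, 42, 51, 85, 93, 32, 43, 31, 65]

print(f_test(hours, score))   # about (15.28, 0.0045) -- model (A)
print(f_test(rand, score))    # about (0.27, 0.6163)  -- model (B)
```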
After explaining all this about the $F$ test and ANOVA, it may sound quite anticlimactic to say that in simple regression we need not perform the $F$ test at all, because it is redundant. With careful attention you will recognize that the $F$ statistic, with numerator degrees of freedom $k - 1 = 1$, is the same value as the $t$ statistic squared:

$$F = t^2 = (3.9091)^2 = 15.2813$$

and P-value = Significance $F$ = 0.0045. See footnote 5 for the mathematical proof that $F = t^2$. Note, however, that the $F$ test plays a different and important role in statistical inference in multiple regression, to be pointed out in later chapters.

Footnote 5: To show that $F = MSR/MSE = b_2^2/\text{var}(b_2) = t^2$, note first that, with $df = 1$,

$$MSR = SSR = \sum(\hat{y} - \bar{y})^2 = \sum(b_1 + b_2 x - \bar{y})^2 = \sum(\bar{y} - b_2\bar{x} + b_2 x - \bar{y})^2 = \sum(b_2 x - b_2\bar{x})^2 = b_2^2\sum(x - \bar{x})^2$$

Also $MSE = \text{var}(e)$, and since $\text{var}(b_2) = \text{var}(e)/\sum(x - \bar{x})^2$, we have $MSE = \text{var}(e) = \text{var}(b_2)\sum(x - \bar{x})^2$. Thus

$$F = \frac{b_2^2\sum(x - \bar{x})^2}{\text{var}(b_2)\sum(x - \bar{x})^2} = \frac{b_2^2}{\text{var}(b_2)}$$

Since the test statistic for $H_0: \beta_2 = 0$ is $t = b_2/\text{se}(b_2)$, it follows that $F = b_2^2/\text{var}(b_2) = t^2$.

6. Modeling Issues

6.1. The Effects of Scaling the Data

6.1.1. Changing the scale of x

Consider the general form of the estimated simple linear regression equation, $\hat{y} = b_1 + b_2 x$. We want to find out what happens to the regression results if we change the scale of $x$ by multiplying it by a constant $c$. First, determine the impact on the slope coefficient $b_2$; denote the resulting new coefficient by $b_2^*$.

Impact on $b_2$: $b_2^* = b_2/c$

$$b_2 = \frac{\sum xy - n\bar{x}\bar{y}}{\sum x^2 - n\bar{x}^2}$$

Multiply $x$ by the constant $c$. Then

$$b_2^* = \frac{\sum(cx)y - n(c\bar{x})\bar{y}}{\sum(cx)^2 - n(c\bar{x})^2} = \frac{c(\sum xy - n\bar{x}\bar{y})}{c^2(\sum x^2 - n\bar{x}^2)} = \frac{b_2}{c}$$

Thus, when $x$ is scaled by a constant $c$, the new slope coefficient is equal to the pre-scaled coefficient divided by $c$.

Impact on $b_1$: $b_1^* = b_1$

$$b_1 = \bar{y} - b_2\bar{x} \qquad b_1^* = \bar{y} - b_2^*(c\bar{x}) = \bar{y} - (b_2/c)(c\bar{x}) = \bar{y} - b_2\bar{x} = b_1$$

Thus, scaling $x$ does not change the intercept.

Impact on the predicted values $\hat{y}$: $\hat{y}^* = \hat{y}$

$$\hat{y}^* = b_1 + b_2^*(cx) = b_1 + (b_2/c)(cx) = b_1 + b_2 x = \hat{y}$$

There is no impact.

Impact on $\text{var}(e)$: $\text{var}(e^*) = \text{var}(e)$

$$\text{var}(e^*) = \frac{\sum(y - \hat{y}^*)^2}{n - 2} = \frac{\sum(y - \hat{y})^2}{n - 2} = \text{var}(e)$$

There is no impact.

Impact on $R^2$: $(R^2)^* = R^2$

$$(R^2)^* = \frac{\sum(\hat{y}^* - \bar{y})^2}{\sum(y - \bar{y})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = R^2$$

There is no impact.

Impact on $\text{var}(b_2)$: $\text{var}(b_2^*) = \text{var}(b_2)/c^2$

$$\text{var}(b_2^*) = \text{var}\!\left(\frac{b_2}{c}\right) = \frac{1}{c^2}\text{var}(b_2) \qquad \text{se}(b_2^*) = \frac{1}{c}\text{se}(b_2)$$

Impact on $\text{var}(b_1)$: $\text{var}(b_1^*) = \text{var}(b_1)$, since $b_1^* = b_1$.

Impact on the $t$ statistic: $t^* = t$

$$t^* = \frac{b_2^*}{\text{se}(b_2^*)} = \frac{b_2/c}{\text{se}(b_2)/c} = \frac{b_2}{\text{se}(b_2)} = t$$

There is no change.
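These invariance results are easy to confirm by simulation. The sketch below uses made-up data; c = 100 mimics re-expressing income from $100s to dollars:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(5, 35, 40)                # income in $100s (made-up data)
y = 80 + 10 * x + rng.normal(0, 90, 40)   # food expenditure in $

def fit(x, y):
    """Return (b1, b2, t stat for b2, R^2) for the regression of y on x."""
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    e = y - b1 - b2 * x
    se_b2 = np.sqrt(np.sum(e**2) / (len(y) - 2) / np.sum((x - x.mean())**2))
    r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
    return b1, b2, b2 / se_b2, r2

c = 100                  # rescale x: $100s -> dollars
print(fit(x, y))         # baseline (b1, b2, t, R^2)
print(fit(c * x, y))     # b2 shrinks by the factor c; b1, t, and R^2 unchanged
```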
6.1.2. Changing the scale of y

We want to find out what happens to the regression results if we change the scale of $y$ by multiplying it by a constant $c$.

Impact on $b_2$: $b_2^* = cb_2$

$$b_2^* = \frac{\sum x(cy) - n\bar{x}(c\bar{y})}{\sum x^2 - n\bar{x}^2} = \frac{c(\sum xy - n\bar{x}\bar{y})}{\sum x^2 - n\bar{x}^2} = cb_2$$

Thus, when $y$ is scaled by a constant $c$, the new slope coefficient is equal to the pre-scaled coefficient multiplied by $c$.

Impact on $b_1$: $b_1^* = cb_1$

$$b_1^* = c\bar{y} - b_2^*\bar{x} = c\bar{y} - cb_2\bar{x} = c(\bar{y} - b_2\bar{x}) = cb_1$$

Thus, scaling $y$ changes the intercept by the multiple $c$.

Impact on the predicted values $\hat{y}$: $\hat{y}^* = c\hat{y}$

$$\hat{y}^* = cb_1 + cb_2 x = c(b_1 + b_2 x) = c\hat{y}$$

The predicted values also change by the multiple $c$.

Impact on $\text{var}(e)$: $\text{var}(e^*) = c^2\text{var}(e)$

$$\text{var}(e^*) = \frac{\sum(cy - \hat{y}^*)^2}{n - 2} = \frac{\sum(cy - c\hat{y})^2}{n - 2} = \frac{c^2\sum(y - \hat{y})^2}{n - 2} = c^2\text{var}(e)$$

Impact on $R^2$: $(R^2)^* = R^2$

$$(R^2)^* = \frac{\sum(\hat{y}^* - c\bar{y})^2}{\sum(cy - c\bar{y})^2} = \frac{c^2\sum(\hat{y} - \bar{y})^2}{c^2\sum(y - \bar{y})^2} = R^2$$

There is no impact.

Impact on $\text{var}(b_2)$: $\text{var}(b_2^*) = c^2\text{var}(b_2)$

$$\text{var}(b_2) = \frac{\text{var}(e)}{\sum(x - \bar{x})^2} \qquad \text{var}(b_2^*) = \frac{c^2\text{var}(e)}{\sum(x - \bar{x})^2} = c^2\text{var}(b_2) \qquad \text{se}(b_2^*) = c\,\text{se}(b_2)$$

Impact on $\text{var}(b_1)$: $\text{var}(b_1^*) = c^2\text{var}(b_1)$

$$\text{var}(b_1) = \frac{\sum x^2\,\text{var}(e)}{n\sum(x - \bar{x})^2} \qquad \text{var}(b_1^*) = \frac{\sum x^2\,c^2\text{var}(e)}{n\sum(x - \bar{x})^2} = c^2\text{var}(b_1)$$

Impact on the $t$ statistic: $t^* = t$

$$t^* = \frac{b_2^*}{\text{se}(b_2^*)} = \frac{cb_2}{c\,\text{se}(b_2)} = t$$

6.1.3. Changing the scale of x and y by the same factor c

Now change the scale of both $x$ and $y$ by multiplying both by the constant $c$. First, determine the impact on the slope coefficient $b_2$.

Impact on $b_2$: $b_2^* = b_2$

$$b_2^* = \frac{\sum(cx)(cy) - n(c\bar{x})(c\bar{y})}{\sum(cx)^2 - n(c\bar{x})^2} = \frac{c^2(\sum xy - n\bar{x}\bar{y})}{c^2(\sum x^2 - n\bar{x}^2)} = b_2$$

There is no change in the slope coefficient.

Impact on $b_1$: $b_1^* = cb_1$

$$b_1^* = c\bar{y} - b_2^*(c\bar{x}) = c\bar{y} - b_2(c\bar{x}) = c(\bar{y} - b_2\bar{x}) = cb_1$$

The intercept changes by the multiple $c$.

Impact on the predicted values $\hat{y}$: $\hat{y}^* = c\hat{y}$

$$\hat{y}^* = b_1^* + b_2^*(cx) = cb_1 + b_2(cx) = c(b_1 + b_2 x) = c\hat{y}$$

Impact on $\text{var}(e)$: $\text{var}(e^*) = c^2\text{var}(e)$

$$\text{var}(e^*) = \frac{\sum(cy - c\hat{y})^2}{n - 2} = c^2\text{var}(e)$$

Impact on $R^2$: $(R^2)^* = R^2$

$$(R^2)^* = \frac{\sum(c\hat{y} - c\bar{y})^2}{\sum(cy - c\bar{y})^2} = \frac{\sum(\hat{y} - \bar{y})^2}{\sum(y - \bar{y})^2} = R^2$$

There is no impact.

Impact on $\text{var}(b_2)$: $\text{var}(b_2^*) = \text{var}(b_2)$

$$\text{var}(b_2^*) = \frac{c^2\text{var}(e)}{\sum(cx - c\bar{x})^2} = \frac{c^2\text{var}(e)}{c^2\sum(x - \bar{x})^2} = \text{var}(b_2) \qquad \text{se}(b_2^*) = \text{se}(b_2)$$

Impact on $\text{var}(b_1)$: $\text{var}(b_1^*) = c^2\text{var}(b_1)$

$$\text{var}(b_1^*) = \frac{\sum(cx)^2\,c^2\text{var}(e)}{n\,c^2\sum(x - \bar{x})^2} = \frac{c^2\sum x^2\,\text{var}(e)}{n\sum(x - \bar{x})^2} = c^2\text{var}(b_1)$$

Impact on the $t$ statistic: $t^* = t$

$$t^* = \frac{b_2^*}{\text{se}(b_2^*)} = \frac{b_2}{\text{se}(b_2)} = t$$

There is no change.
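A parallel simulation check covers scaling $y$ and scaling both variables (again with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(5, 35, 40)
y = 80 + 10 * x + rng.normal(0, 90, 40)

def fit(x, y):
    """Return (b1, b2, t stat for b2, R^2) for the regression of y on x."""
    b2 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b1 = y.mean() - b2 * x.mean()
    e = y - b1 - b2 * x
    se_b2 = np.sqrt(np.sum(e**2) / (len(y) - 2) / np.sum((x - x.mean())**2))
    r2 = 1 - np.sum(e**2) / np.sum((y - y.mean())**2)
    return b1, b2, b2 / se_b2, r2

c = 100
print(fit(x, y))          # baseline (b1, b2, t, R^2)
print(fit(x, c * y))      # b1 and b2 both scale by c; t and R^2 unchanged
print(fit(c * x, c * y))  # b1 scales by c, b2 unchanged; t and R^2 unchanged
```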
7. Choosing a Functional Form

In explaining the simple linear regression model we have assumed that the population parameters $\beta_1$ and $\beta_2$ are linear — that is, they are not expressed as, say, $\beta_2^2$, $1/\beta_2$, or any form other than $\beta_2$ — and also that the impact of changes in the independent variable on $y$ works directly through $x$ rather than through expressions such as $x^2$ or $\ln(x)$. In this section we continue to assume that the regression is linear in the parameters, but relax the assumption of linearity in the variables.

In many economic models the relationship between the dependent and independent variables is not a straight-line relationship; that is, the change in $y$ does not follow the same pattern for all values of $x$. Consider, for example, an economic model explaining the relationship between expenditure on food (or housing) and income. As income rises, we do expect expenditure on food to rise, but not at a constant rate. In fact, we should expect the rate of increase in expenditure on food to decrease as income rises. Therefore the relationship between income and food expenditure is not a straight-line relationship. In Chapter 3 we considered two alternative regression models, the quadratic model and the log-linear model. Here we will consider two more alternative models: the linear-log and log-log models.

7.1. Linear-Log (Semi-log) Model

The independent variable is in logarithms, but the dependent variable is not:

$$y = \beta_1 + \beta_2\ln(x)$$

Slope:

$$\frac{dy}{dx} = \beta_2\frac{1}{x}$$

Elasticity:

$$\varepsilon = \frac{dy}{dx}\cdot\frac{x}{y} = \beta_2\left(\frac{1}{x}\right)\frac{x}{y} = \frac{\beta_2}{y}$$

The following diagram plots the functions $y = 1 + 1.3\ln(x)$ and $y = 1 - 1.3\ln(x)$. For example, at $x = 2$ the function with $\beta_2 > 0$ gives $y = 1 + 1.3\ln(2) = 1.901$, and the one with $\beta_2 < 0$ gives $y = 1 - 1.3\ln(2) = 0.099$.

[Figure: linear-log functions y = 1 + 1.3 ln(x) (increasing, β₂ > 0) and y = 1 − 1.3 ln(x) (decreasing, β₂ < 0), for x from 0 to 6]

The slope of the function $y = 1 + 1.3\ln(x)$ at a given point, say $x_0 = 2$, is

$$\frac{dy}{dx} = \beta_2\frac{1}{x} = 1.3\cdot\frac{1}{2} = 0.65$$

Let us interpret the meaning of slope = 0.65 in a linear-log function. First, the coefficient of $\ln(x)$, 1.3, is not the slope of the function. The value 1.3 implies that for each 1% increase in $x$, the dependent variable rises by approximately 0.013 units:

$$x_0 = 2,\ \Delta x\% = 1\%,\ x_1 = 2.02: \quad y_0 = 1.9011,\ y_1 = 1.9140,\ \Delta y = y_1 - y_0 = 0.0129$$

The slope of 0.65 at $x_0 = 2$ means that for a very small change in $x$ in the immediate vicinity of $x_0 = 2$, $y$ rises by 0.65 units per unit of $x$. The table below shows that as the increment in $x$ is reduced from 1 to 0.001, the difference quotient approaches 0.65, the slope of the function at $x_0 = 2$.

x₀ = 2, y₀ = 1.9011
x₁        y₁        Δy/Δx ≈ dy/dx
3.0       2.4282    0.5271
2.5       2.1912    0.5802
2.1       1.9645    0.6343
2.01      1.9076    0.6484
2.001     1.9017    0.6498

Thus, "slope" in the linear-log model means the change in $y$ in response to a small change in $x$. The coefficient of $\ln(x)$, on the other hand, relates the change in $y$ to a percentage change in $x$.

Example

Use the data in the food expenditure model (see the Excel file "CH5 DATA"). The variables are $y$ = FOOD_EXP (weekly food expenditure in $) and $x$ = INCOME (weekly income in $100s). The output is the result of running the regression $\hat{y} = b_1 + b_2\ln(x)$; to run the regression, first transform the $x$ values to $\ln(x)$. The estimated regression equation is

$$\hat{y} = -97.1864 + 132.1658\ln(x)$$

The coefficient of $\ln(x)$, $b_2 = 132.1658$, implies that weekly food expenditure will increase by approximately $1.32 for each 1% increase in weekly income, regardless of the income level, as the following calculations show:

$$x_0 = 10,\ x_1 = 10.1\ \text{(a 1\% increase)}: \quad y_0 = 207.137,\ y_1 = 208.452,\ \Delta y = 1.3151$$
$$x_0 = 20,\ x_1 = 20.2\ \text{(a 1\% increase)}: \quad y_0 = 298.747,\ y_1 = 300.062,\ \Delta y = 1.3151$$

However, the change in food expenditure for each additional dollar of weekly income differs by income level, as the calculations in the following table show; this means that the slope of the regression equation depends on the value of $x$, the weekly income level. (Recall that income is measured in $100s, so an additional $1 of income corresponds to $\Delta x = 0.01$.)

$$\text{Weekly income \$1,000}\ (x_0 = 10,\ x_1 = 10.01): \quad y_0 = 207.137,\ y_1 = 207.269,\ \Delta y \text{ per \$1} = 0.1321$$
$$\text{Weekly income \$2,000}\ (x_0 = 20,\ x_1 = 20.01): \quad y_0 = 298.747,\ y_1 = 298.813,\ \Delta y \text{ per \$1} = 0.0661$$
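A short sketch evaluating the estimated linear-log equation at different income levels (only the two fitted coefficients come from the text; the function name is mine):

```python
import numpy as np

b1, b2 = -97.1864, 132.1658   # estimated linear-log coefficients

def food_exp(income_100s):
    """Predicted weekly food expenditure ($); income is measured in $100s."""
    return b1 + b2 * np.log(income_100s)

for x0 in (10.0, 20.0):   # weekly income $1,000 and $2,000
    pct = food_exp(1.01 * x0) - food_exp(x0)       # effect of a 1% income increase
    dollar = food_exp(x0 + 0.01) - food_exp(x0)    # effect of a $1 income increase
    print(x0, round(pct, 4), round(dollar, 4))
# A 1% increase adds about $1.32 at every income level, while a $1 increase
# adds about $0.13 at $1,000 of income but only about $0.07 at $2,000.
```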
7.2. Log-Linear Model

The log-linear model was introduced in Chapter 3. There are additional points with respect to this model that we need to pay attention to. The log-linear model in regression takes the following form:

$$\ln(y) = \beta_1 + \beta_2 x$$

To determine the slope, take the exponent of both sides of the equation:

$$y = e^{\beta_1 + \beta_2 x}$$

Then,

Slope:

$$\frac{dy}{dx} = \beta_2 e^{\beta_1 + \beta_2 x} = \beta_2 y$$

Elasticity:

$$\varepsilon = \frac{dy}{dx}\cdot\frac{x}{y} = \beta_2 y\left(\frac{x}{y}\right) = \beta_2 x$$

The coefficient $\beta_2$ in $\ln(y) = \beta_1 + \beta_2 x$ implies that for each additional one-unit increase in $x$, $y$ increases by $100\beta_2$ percent, that is, by the proportion $\beta_2$. Using the slope expression above, we have $dy/y = \beta_2\,dx$.

Example

Consider the model in which the PRICE of a house is related to the house size measured in square feet (SQFT). Let $y$ = PRICE and $x$ = SQFT. The log-linear equation is

$$\widehat{\ln}(y) = b_1 + b_2 x$$

The data and the summary regression output for this example are in the Excel file "CH5 DATA". The estimated regression equation is

$$\widehat{\ln}(y) = 10.8386 + 0.000411x$$

Consider a house size of $x = 2000$ sqft. The impact of an additional square foot is shown in the following calculations:

$$x_0 = 2000,\ x_1 = 2001: \quad \ln(y_0) = 11.66113,\ \ln(y_1) = 11.66155,\ y_0 = 115975.5,\ y_1 = 116023.2,\ \Delta y = 47.7,\ \Delta y/y_0 = 0.000411$$

Note that when $x_0 = 2000$ sqft, each additional square foot increases the price of the house by $\Delta y = \$47.7$; the proportional increase is $\Delta y/y_0 = 0.000411$, or 0.04%. Now consider a house size of $x = 4000$ sqft:

$$x_0 = 4000,\ x_1 = 4001: \quad \ln(y_0) = 12.48367,\ \ln(y_1) = 12.48408,\ y_0 = 263991.4,\ y_1 = 264100.0,\ \Delta y = 108.6,\ \Delta y/y_0 = 0.000411$$

For a larger house, here $x = 4000$ sqft, each additional square foot adds a larger dollar amount, $\Delta y = \$108.6$, to the price of the house. The percentage change in the price, however, is the same.

7.2.1. Adjustment to the Predicted Value in Log-Linear Models

In the calculations in the previous two tables, the predicted value of $y$ for a given value of $x$ was obtained by taking the exponent (antilog) of the predicted log of $y$:

$$x = 2000: \quad \widehat{\ln}(y) = 10.8386 + 0.000411(2000) = 11.66113$$
$$y \equiv \hat{y}_n = \exp(11.66113) = 115975.5$$

Here $\hat{y}_n$ denotes the "natural" predictor. In most cases (for large samples) a "corrected" predicted value is obtained by multiplying the natural predictor by the quantity $e^{\text{var}(e)/2}$. In the regression summary output, $\text{var}(e)$ is shown as $MSE$, the mean square error:

$$\hat{y}_c = \hat{y}_n\,e^{\text{var}(e)/2}$$

In the above example, the regression summary output shows $\text{var}(e) = MSE = 0.10334$. Thus, for $x = 2000$,

$$\hat{y}_c = 115975.5\,e^{0.10334/2} = 122125.4$$

The natural predictor tends to systematically under-predict the value of $y$ in a log-linear model. The corrected predictor offsets this downward bias in large samples.
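A sketch of the natural versus corrected predictor for the house price equation (the coefficients and MSE are the ones quoted above; exact outputs depend on the unrounded coefficients):

```python
import math

b1, b2 = 10.8386, 0.000411   # estimated log-linear coefficients
mse = 0.10334                # var(e) from the regression summary output

def predict_price(sqft, corrected=True):
    """Natural predictor exp(b1 + b2*x); the corrected version multiplies by exp(var(e)/2)."""
    y_n = math.exp(b1 + b2 * sqft)
    return y_n * math.exp(mse / 2) if corrected else y_n

print(predict_price(2000, corrected=False))  # natural predictor, about $116,000
print(predict_price(2000))                   # corrected predictor, about $122,000
```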
Example: A Growth Model

The Excel file "CH5 DATA" tab "wheat" contains data describing the average wheat yield (tons per hectare) for a region in Australia against time ($t$), which runs from 1950 to 1997. The rise in yield over time is attributed to improvements in technology, with $t$ used as a proxy for technology. The objective here is to obtain an estimate of the average rate of growth in yield.

[Figure: wheat yield (tons per hectare) against time, 1950–1997]

Let $y$ stand for YIELD, where $y_0$ is the yield in the base year and $y_t$ is the yield in year $t$. Also, let $g$ stand for the rate of growth. Then

$$y_t = y_0(1 + g)^t$$

Taking the natural log of both sides and using the properties of logarithms, we have

$$\ln(y_t) = \ln(y_0) + t\ln(1 + g)$$

We can write this as a log-linear regression model with $b_1 = \ln(y_0)$ and $b_2 = \ln(1 + g)$:

$$\widehat{\ln}(y_t) = b_1 + b_2 t$$

The estimated regression equation is

$$\widehat{\ln}(y_t) = -0.3434 + 0.01784t$$

From the estimated coefficients we can determine the base-year yield and the growth rate:

Base-year yield: $b_1 = \ln(y_0) = -0.3434$, so $y_0 = \exp(-0.3434) = 0.709$ tons per hectare.
Growth rate: $b_2 = \ln(1 + g) = 0.01784$, so $1 + g = \exp(0.01784) = 1.018$ and $g = 1.018 - 1 = 0.018$.

The estimated average annual growth rate is then approximately 1.8%.

Example: A Wage Equation

The Excel file "CH5 DATA" tab "wage" contains data describing the hourly wage (WAGE) against years of education (EDUC). In this example the objective is to determine the estimated average rate of increase in the wage rate for each additional year of schooling. Using the same methodology as in the previous example, we have

$$WAGE = WAGE_0(1 + g)^{EDUC}$$
$$\ln(WAGE) = \ln(WAGE_0) + \ln(1 + g)\,EDUC$$

We obtain the following estimated regression equation:

$$\widehat{\ln}(WAGE) = 1.6094 + 0.0904\,EDUC$$

$$WAGE_0 = \exp(1.6094) = 5.00 \qquad 1 + g = \exp(0.0904) = 1.095 \qquad g = 0.095$$

Thus, the estimated rate of increase for an additional year of education is approximately 9.5%.

7.2.2. Predicted Value in the Log-Linear Wage-Education Model

What is the predicted value of WAGE for a person with 12 years of education?

$$\widehat{\ln}(WAGE) = 1.6094 + 0.0904(12) = 2.694$$
$$\widehat{WAGE} = \exp(2.694) = 14.7958 = \$14.80$$

According to the text (p. 154), this figure in a log-linear model is the "natural" predictor ($\hat{y}_n$). We need to find the corrected predictor, $\hat{y}_c$:

$$\hat{y}_c = \hat{y}_n\,e^{\text{var}(e)/2} = 14.7958\times\exp(0.2773/2) = 16.996 = \$17.00$$

In large samples the natural predictor $\hat{y}_n$ tends to systematically under-predict the value of the dependent variable. The correction offsets this downward bias.

7.2.3. Generalized R² Measure for Log-Linear Models

When considering $R^2$ in a log-linear model, we need to keep two things in mind: (1) the $R^2$ shown in the regression output involves $\widehat{\ln}(y)$; we need $R^2$ as a measure of the explained variation in $y$, rather than in $\ln(y)$; (2) when we take the antilog of the predicted values, the result is a set of natural predictors, which we need to change into the corrected predictors. Thus, we obtain the general $R^2$ as follows. Recall from above that $R^2$ is equal to the squared coefficient of correlation between $y$ and $\hat{y}$. Here we must find the correlation coefficient between $y$ and $\hat{y}_c$ and then square it:

$$R^2 = r_{y\hat{y}_c}^2 = 0.1859$$

The calculations are shown in the Excel file.

7.2.4. Prediction Interval in the Log-Linear Model

Here we want to build a prediction interval for WAGE when EDUC = 12. This is an interval for the individual value of $y$ for a given $x$. First, we need to determine $\text{se}(y_0)$ when $x_0 = 12$. From the regression equation $\widehat{\ln}(y) = b_1 + b_2 x$ we have the covariance matrix

$$\text{var}(e)\,(X'X)^{-1} = \begin{bmatrix} \text{var}(b_1) & \text{cov}(b_1,b_2) \\ \text{cov}(b_1,b_2) & \text{var}(b_2) \end{bmatrix} = \begin{bmatrix} 0.0075 & -0.00052 \\ -0.00052 & 0.000038 \end{bmatrix}$$

$$\text{var}[\ln(y_0)] = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2) + \text{var}(e)$$
$$\text{var}[\ln(y_0)] = 0.0075 + 12^2(0.000038) - 2(12)(0.00052) + 0.2773 = 0.2777$$
$$\text{se}[\ln(y_0)] = 0.5270$$

$$MOE = t_{0.025,998}\times\text{se}[\ln(y_0)] = 1.962\times 0.5270 = 1.034$$

$$\widehat{\ln}(y|x_0) = 2.694$$
$$L_{\ln(y_0)} = 2.694 - 1.034 = 1.660 \qquad U_{\ln(y_0)} = 2.694 + 1.034 = 3.728$$
$$L_{y_n} = \exp(1.660) = 5.2604 \qquad U_{y_n} = \exp(3.728) = 41.6158$$
$$L_{y_c} = 5.2604\times\exp(0.2773/2) = \$6.04 \qquad U_{y_c} = 41.6158\times\exp(0.2773/2) = \$47.80$$
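The interval calculation can be scripted as follows (all inputs are the values quoted above; the 0.975 quantile of $t$ with 998 degrees of freedom is about 1.962):

```python
import math

b1, b2 = 1.6094, 0.0904
var_b1, var_b2, cov_b1b2 = 0.0075, 0.000038, -0.00052
var_e = 0.2773
x0, t = 12, 1.962

ln_y = b1 + b2 * x0                                              # 2.694
var_ln_y0 = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b1b2 + var_e  # ~0.2777
moe = t * math.sqrt(var_ln_y0)                                   # ~1.034

correction = math.exp(var_e / 2)   # the text's large-sample correction factor
lower = math.exp(ln_y - moe) * correction
upper = math.exp(ln_y + moe) * correction
print(round(lower, 2), round(upper, 2))   # about 6.04 and 47.80
```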
The prediction interval [$6.04, $47.80] is so wide that it is basically useless. This indicates that our model is not an accurate predictor of the range of the dependent variable values for a given $x$. To develop a better predictor we need to add additional variables to the model and approach the situation via a multiple regression model. This will be done in the next chapter.

7.3. Log-Log Models

The log-log model is used in describing demand equations and production functions. Generally,

$$\ln(y) = \beta_1 + \beta_2\ln(x)$$

To determine the slope $dy/dx$, first take the exponent of both sides:

$$y = e^{\beta_1 + \beta_2\ln(x)}$$

The slope is then

$$\frac{dy}{dx} = \beta_2 e^{\beta_1 + \beta_2\ln(x)}\cdot\frac{1}{x} = \beta_2\frac{y}{x}$$

The elasticity is

$$\varepsilon = \frac{dy}{dx}\cdot\frac{x}{y} = \beta_2$$

Thus, in the log-log model, the coefficient $\beta_2$ represents the percentage change in $y$ in response to a 1% change in $x$.

Example: A Log-Log Poultry Demand Equation

The Excel file "CH5 DATA" tab "chicken" contains data describing the per capita consumption of chicken (in pounds) against the real (inflation-adjusted) price. Using the log-log model $\ln(q) = \beta_1 + \beta_2\ln(p)$, the estimated regression equation is

$$\widehat{\ln}(q) = 3.7169 - 1.1214\ln(p)$$

Here the coefficient of $\ln(p)$ is the estimated elasticity of demand, $-1.121$: for a 1% increase in the real price of chicken, the quantity demanded is reduced by 1.121%.

To obtain the predicted value of per capita consumption when $p = \$2.00$:

$$\widehat{\ln}(q) = 3.7169 - 1.1214\ln(2) = 2.94$$
$$\hat{q}_n = \exp(2.94) = 18.91$$
$$\hat{q}_c = \hat{q}_n\,e^{\text{var}(e)/2} = 18.91\times\exp(0.01392/2) = 19.042$$

The generalized $R^2$ is

$$R_G^2 = r_{y\hat{y}_c}^2 = 0.8818$$

See the Excel file for the calculations.

Appendix

The variance formula

$$\text{var}(\hat{y}_0) = \sigma_u^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum(x - \bar{x})^2}\right)$$

is obtained as follows. Start with the estimated regression equation and the predicted value of $y$ for the given $x_0$:

$$\hat{y}_0 = b_1 + b_2 x_0$$

Taking the variance of both sides of the equation, we have

$$\text{var}(\hat{y}_0) = \text{var}(b_1 + b_2 x_0) = \text{var}(b_1) + x_0^2\,\text{var}(b_2) + 2x_0\,\text{cov}(b_1, b_2)$$

On the right-hand side, substituting

$$\text{var}(b_1) = \frac{\sum x^2}{n\sum(x - \bar{x})^2}\sigma_u^2 \qquad \text{var}(b_2) = \frac{\sigma_u^2}{\sum(x - \bar{x})^2} \qquad \text{cov}(b_1, b_2) = \frac{-\bar{x}}{\sum(x - \bar{x})^2}\sigma_u^2$$

we have

$$\text{var}(\hat{y}_0) = \frac{\sum x^2}{n\sum(x - \bar{x})^2}\sigma_u^2 + \frac{x_0^2}{\sum(x - \bar{x})^2}\sigma_u^2 - \frac{2x_0\bar{x}}{\sum(x - \bar{x})^2}\sigma_u^2$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\,\frac{\sum x^2 + nx_0^2 - 2nx_0\bar{x} + n\bar{x}^2 - n\bar{x}^2}{n\sum(x - \bar{x})^2}$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\,\frac{\sum x^2 - n\bar{x}^2 + n(x_0 - \bar{x})^2}{n\sum(x - \bar{x})^2}$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\,\frac{\sum(x - \bar{x})^2 + n(x_0 - \bar{x})^2}{n\sum(x - \bar{x})^2}$$

$$\text{var}(\hat{y}_0) = \sigma_u^2\left(\frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum(x - \bar{x})^2}\right)$$
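As a closing check, the appendix identity can be verified symbolically. A sketch using SymPy, applying the same substitution $\sum x^2 = \sum(x - \bar{x})^2 + n\bar{x}^2$ used in the derivation above:

```python
import sympy as sp

n, x0, xbar, Sxx, sig2 = sp.symbols('n x0 xbar S_xx sigma_u2', positive=True)
sum_x2 = Sxx + n * xbar**2   # identity: sum(x^2) = sum((x - xbar)^2) + n*xbar^2

# Variances and covariance of b1 and b2, as used in the appendix
var_b1 = sum_x2 * sig2 / (n * Sxx)
var_b2 = sig2 / Sxx
cov_b1b2 = -xbar * sig2 / Sxx

var_yhat = var_b1 + x0**2 * var_b2 + 2 * x0 * cov_b1b2
target = sig2 * (1 / n + (x0 - xbar)**2 / Sxx)

print(sp.simplify(var_yhat - target))   # prints 0: the two expressions are identical
```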