CHAPTER 4
INTERVAL ESTIMATION AND HYPOTHESIS TESTING IN REGRESSION

1. Confidence Interval for the Population Parameters of Regression
  1.1. A Review of the General Concept of the Confidence Interval
  1.2. Confidence Interval for the Regression Slope Parameter
2. Hypothesis Tests
  2.1. Review of the General Concept
  2.2. Hypothesis Test for \(\beta_2\)
    2.2.1. Two-Tail Test of Significance
    2.2.2. Two-Tail Test of an Economic Hypothesis
    2.2.3. Right-Tail Test of Significance
    2.2.4. Right-Tail Test of an Economic Hypothesis
    2.2.5. Left-Tail Tests
3. Inferences Involving a Linear Combination of the Regression Parameters
  3.1. Interval Estimate
  3.2. Hypothesis Test

1. Confidence Interval for the Population Parameters of Regression

1.1. A Review of the General Concept of the Confidence Interval

The purpose here is to build a confidence interval for the parameters of the regression model, \(\beta_1\) and \(\beta_2\). A brief review of the methodology for building a confidence interval for the population mean \(\mu\) will help in explaining the confidence interval (CI) for the parameters of the regression.

To build a confidence interval for \(\mu\), the sampling distribution of the sample statistic \(\bar{x}\) must be normal. The mean of this distribution is \(E(\bar{x}) = \mu\), and the standard deviation (standard error) is \(\mathrm{se}(\bar{x}) = \sigma/\sqrt{n}\).

[Figure: sampling distribution of \(\bar{x}\), centered at \(\mu\).]

These properties of the sampling distribution of \(\bar{x}\) allow us to define

\[ z = \frac{\bar{x} - \mu}{\mathrm{se}(\bar{x})} \]

as a standard normal random variable. Solving this equation for \(\bar{x}\), we have

\[ \bar{x} = \mu + z\,\mathrm{se}(\bar{x}) \]

where \(z < 0\) for \(\bar{x}\) values to the left of \(\mu\). This expression tells us that the values of the random variable \(\bar{x}\) are distributed around the population mean, each value deviating from \(\mu\) by a multiple of the standard error \(\mathrm{se}(\bar{x})\). If \(\mathrm{se}(\bar{x})\) is known, then we can find the boundaries of intervals symmetric about the mean (middle intervals), which tell us what percentage of \(\bar{x}\) values fall within these intervals.
These intervals are determined by the specific values of \(z\):

\[ \mu - z\,\mathrm{se}(\bar{x}) \le \bar{x} \le \mu + z\,\mathrm{se}(\bar{x}) \]

For example, the middle interval containing 95% of all \(\bar{x}\) values is determined by \(z_{0.025} = 1.96\):

\[ P\big(\mu - z_{0.025}\,\mathrm{se}(\bar{x}) \le \bar{x} \le \mu + z_{0.025}\,\mathrm{se}(\bar{x})\big) = 0.95 \]

\[ P\big(\mu - 1.96\,\mathrm{se}(\bar{x}) \le \bar{x} \le \mu + 1.96\,\mathrm{se}(\bar{x})\big) = 0.95 \]

The remaining 5% of all \(\bar{x}\) values fall outside the interval. Generally, the proportion of \(\bar{x}\) values that fall outside the interval of interest is denoted by \(\alpha\) and is known as the error probability. The \(z\) score corresponding to a given error probability is denoted by \(z_{\alpha/2}\). The product \(z_{\alpha/2}\,\mathrm{se}(\bar{x})\) is called the margin of sampling error, or simply the margin of error (MOE).

Working with 95% as an example, we have established that 95% of all \(\bar{x}\) values fall within \(\mathit{MOE} = \pm z_{0.025}\,\mathrm{se}(\bar{x})\) of the population mean. Now, using the same MOE, instead of building the interval around \(\mu\), build the interval around a randomly determined sample mean value: \(\bar{x} \pm z_{0.025}\,\mathrm{se}(\bar{x})\). Since the continuous random variable \(\bar{x}\) can take on an infinite number of values, we can theoretically build an infinite number of such intervals using the same MOE. Ninety-five percent of such intervals would capture \(\mu\). This is the theoretical framework for the confidence interval for any population parameter.

In practice, we take only one sample of size \(n\) and build one interval around the mean computed from this random sample. Then we state that we are 95% confident that this interval contains the population parameter. A confidence interval is simply the point estimate of the parameter obtained from a random sample, plus or minus the margin of error.

In practice, since the population standard deviation is not known, we must use the sample standard deviation as an estimate of the population standard deviation.
But when the sample standard deviation is used, the margin of error necessarily becomes wider, because of the added uncertainty of using another estimated value (on top of the estimated \(\bar{x}\)). This is why, in place of \(z\), the MOE formula uses the random variable \(t\):

\[ \mathit{MOE} = t_{\alpha/2,(n-1)}\,\frac{s}{\sqrt{n}} \]

Suppose that from a random sample of \(n = 25\) the sample mean is \(\bar{x} = 48\) and the sample standard deviation is \(s = 20\). The margin of error for a 95% confidence interval for \(\mu\) is:

\[ \mathit{MOE} = t_{0.025,(24)}\,\frac{20}{\sqrt{25}} = 2.064 \times 4 = 8.26 \]

1.2. Confidence Interval for the Regression Slope Parameter

In the previous chapter we saw that, for a given \(x\), the \(y\) values are normally distributed. It was also proved that the sample regression coefficient is a linear function of \(y\):

\[ b_2 = \sum w y = \frac{\sum (x - \bar{x})\,y}{\sum (x - \bar{x})^2} \]

Because the slope coefficient \(b_2\) is a linear function of the normal \(y\), it is also normally distributed. The mean and standard deviation (standard error) of \(b_2\) are, respectively,

\[ E(b_2) = \beta_2 \qquad \text{and} \qquad \mathrm{se}(b_2) = \frac{\sigma_u}{\sqrt{\sum (x - \bar{x})^2}} \]

[Figure: sampling distribution of \(b_2\), centered at \(\beta_2\).]

Since \(b_2\) is normally distributed, the standard normal random variable \(z\) can be defined as

\[ z = \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \]

Following the same methodology as that for the confidence interval for \(\mu\), first solve for \(b_2\):

\[ b_2 = \beta_2 + z\,\mathrm{se}(b_2) \]

Using 95% as the benchmark, 95% of all values of the random variable \(b_2\) fall within the margin of error

\[ \mathit{MOE} = \pm z_{0.025}\,\mathrm{se}(b_2) \]

That is,

\[ P\big(\beta_2 - z_{0.025}\,\mathrm{se}(b_2) \le b_2 \le \beta_2 + z_{0.025}\,\mathrm{se}(b_2)\big) = 0.95 \]

Again, using the margin of error \(z_{0.025}\,\mathrm{se}(b_2)\), we can build an infinite number of intervals around the randomly determined values of \(b_2\). Ninety-five percent of such intervals would contain the population parameter \(\beta_2\):

\[ P\big(b_2 - z_{0.025}\,\mathrm{se}(b_2) \le \beta_2 \le b_2 + z_{0.025}\,\mathrm{se}(b_2)\big) = 0.95 \]

Now note that, in the MOE formula, the formula for the standard error \(\mathrm{se}(b_2)\) contains the unknown population parameter \(\sigma_u\):
\[ \mathrm{se}(b_2) = \frac{\sigma_u}{\sqrt{\sum (x - \bar{x})^2}} \]

This requires us to replace \(\sigma_u\) with its sample estimator \(\mathrm{se}(e)\). The formula for the standard error of \(b_2\) thus changes to:

\[ \mathrm{se}(b_2) = \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

We are now using an estimated value in the MOE formula. Therefore, the margin of error becomes inherently wider due to the added uncertainty of using \(\mathrm{se}(e)\). The \(z\) score in the MOE formula no longer applies and is replaced by \(t\):

\[ \mathit{MOE} = \pm t_{\alpha/2,(df)}\,\mathrm{se}(b_2) \]

To show that the term

\[ \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \]

has a \(t\) distribution, we introduce a new random variable called Chi-square (\(\chi^2\)). Chi-square is formed as the sum of \(k\) independent squared standard normal random variables, \(z_i^2\). Like the \(t\) distribution, each \(\chi^2\) distribution is identified by the parameter degrees of freedom. If the \(\chi^2\) distribution is formed by the sum of \(k\) independent \(z_i^2\), then the degrees of freedom is the value \(k\).

Let \(\nu\) (nu) be a \(\chi^2\) random variable. The probability density function of \(\nu\) is:

\[ f(\nu) = \frac{\nu^{(k/2 - 1)}}{\left(\frac{k}{2} - 1\right)!\; 2^{k/2}\; e^{\nu/2}} \]

where \(k\) denotes the degrees of freedom of the distribution and \(e\) is the base of the natural logarithm (for odd \(k\), the factorial is read as the gamma function \(\Gamma(k/2)\)). The function is defined only for \(\nu \ge 0\). The mean of the distribution is \(E(\nu) = k\) and the variance is \(\sigma_\nu^2 = 2k\). Using Excel you can plot the pdf for a given \(k\).

[Figure: \(\chi^2\) density \(f(\nu)\) for df = 4, df = 8, and df = 12, plotted for \(\nu\) from 0 to 30.]

Now recall that the error term \(u\) in the regression model is assumed to be normally distributed with a mean of 0 and a standard deviation of \(\sigma_u\). Thus,

\[ z_i = \frac{u_i}{\sigma_u} \]

Squaring and summing over all \(i\), we have a \(\chi^2\) random variable \(\nu\):

\[ \nu = \sum z_i^2 = \frac{\sum u_i^2}{\sigma_u^2} \]

Since the \(u_i\) are unknown, they are replaced by the regression residuals \(e\):

\[ \nu = \frac{\sum e_i^2}{\sigma_u^2} \]

Recall that

\[ \mathrm{var}(e) = \frac{\sum e^2}{n - 2} \]

Thus, we have:

\[ \sum e^2 = (n - 2)\,\mathrm{var}(e) \]
Substituting for \(\sum e^2\) in the numerator of \(\nu\), we have

\[ \nu = \frac{(n - 2)\,\mathrm{var}(e)}{\sigma_u^2} \]

The degrees of freedom of the \(\chi^2\) random variable \(\nu\) is \(n - 2\) because only \(n - 2\) residuals are independent (see footnote 1).

[Footnote 1: Whenever we want to find the average of squared deviations using the sample data, we lose one degree of freedom for each estimated parameter in the squared deviation. For example, to find the sample variance of \(y\), we compute the average of the squared deviations of \(y\) from \(\bar{y}\), \(\sum (y - \bar{y})^2\). Since the random variable \(\bar{y}\), the estimator of the parameter \(\mu\), is used in the calculation of the squared deviations, only \(n - 1\) of the squared deviations are independent. We have lost one degree of freedom. Similarly, to compute the average of \(\sum e^2 = \sum (y - \hat{y})^2\), since we obtained \(\hat{y}\) by estimating the two parameters \(\beta_1\) and \(\beta_2\), we lose two degrees of freedom: we have only \(n - 2\) independent squared deviations.]

Dividing both sides of the last equation by \(n - 2\),

\[ \frac{\nu}{n - 2} = \frac{\mathrm{var}(e)}{\sigma_u^2} \]

and taking the square root of both sides, we have:

\[ \sqrt{\frac{\nu}{n - 2}} = \frac{\mathrm{se}(e)}{\sigma_u} \]

Now we introduce the random variable \(t\) (the Student \(t\) distribution) again. Theoretically, the \(t\) distribution is formed as the ratio of \(z\) to \(\sqrt{\nu/(n - 2)}\):

\[ t = \frac{z}{\sqrt{\nu/(n - 2)}} \]

Substituting in the numerator and the denominator (where \(\mathrm{se}(b_2)\) in the numerator is the version based on \(\sigma_u\)), we have

\[ t = \frac{\dfrac{b_2 - \beta_2}{\mathrm{se}(b_2)}}{\dfrac{\mathrm{se}(e)}{\sigma_u}} = \frac{\dfrac{b_2 - \beta_2}{\sigma_u \big/ \sqrt{\sum (x - \bar{x})^2}}}{\dfrac{\mathrm{se}(e)}{\sigma_u}} \]

\[ t = \frac{b_2 - \beta_2}{\mathrm{se}(e) \big/ \sqrt{\sum (x - \bar{x})^2}} \]

The denominator of the \(t\) equation is the standard error of the slope coefficient \(b_2\) estimated from the sample data:

\[ \mathrm{se}(b_2) = \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

Thus,

\[ t = \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \]

Solving this equation for \(b_2\), we have

\[ b_2 = \beta_2 + t\,\mathrm{se}(b_2) \]

in which the term \(t\,\mathrm{se}(b_2)\) is the margin of sampling error (MOE). Again, using 95% as the benchmark probability, we have:

\[ P\big(\beta_2 - t_{0.025,(df)}\,\mathrm{se}(b_2) \le b_2 \le \beta_2 + t_{0.025,(df)}\,\mathrm{se}(b_2)\big) = 0.95 \]

The degrees of freedom here are the same as those of the \(\chi^2\) distribution, \(df = n - 2\).
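The 95% probability statement above can be checked by simulation. The following is a minimal sketch, not part of the original notes: the parameter values (beta1 = 80, beta2 = 10, sigma_u = 90, x drawn between 5 and 35), the seed, and the helper names `slope_t_ratio` and `coverage` are all assumptions chosen for illustration. The critical value 2.024 is \(t_{0.025,(38)}\) for \(n = 40\).

```python
import math
import random

def slope_t_ratio(x, y, beta2):
    """Return (b2 - beta2) / se(b2) for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
    b1 = y_bar - b2 * x_bar
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y))
    se_e = math.sqrt(sse / (n - 2))      # n - 2 degrees of freedom
    se_b2 = se_e / math.sqrt(sxx)
    return (b2 - beta2) / se_b2

def coverage(reps=2000, n=40, beta1=80.0, beta2=10.0,
             sigma_u=90.0, t_crit=2.024, seed=1):
    """Share of samples whose t ratio lies within +/- t_crit.

    All population values here are hypothetical, not taken from the data.
    """
    rng = random.Random(seed)
    x = [rng.uniform(5, 35) for _ in range(n)]   # x held fixed across samples
    hits = 0
    for _ in range(reps):
        y = [beta1 + beta2 * xi + rng.gauss(0, sigma_u) for xi in x]
        if abs(slope_t_ratio(x, y, beta2)) <= t_crit:
            hits += 1
    return hits / reps

cov_rate = coverage()
print(cov_rate)   # should be close to 0.95
```

Because \(|t| \le t_{0.025,(38)}\) is the same event as the interval \(b_2 \pm t_{0.025,(38)}\,\mathrm{se}(b_2)\) containing \(\beta_2\), the printed share is also the share of intervals that capture the true slope.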
Since \(\beta_2\) is unknown, its estimator \(b_2\) is used to construct a confidence interval for the slope parameter. Using the MOE determined above, the lower and upper boundaries of a 95% confidence interval for \(\beta_2\) are:

\[ L, U = b_2 \pm t_{0.025,(df)}\,\mathrm{se}(b_2) \]

\[ L, U = b_2 \pm t_{0.025,(df)}\,\frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

Example: Household food expenditure and weekly income

The data and other calculations are in the Excel file CH4 DATA.xlsx (“food” tab). The data show the weekly food expenditure of 40 households in dollars and weekly income in hundreds of dollars ($100). How does weekly food expenditure respond to changes in weekly income?

\[ \mathit{FOODEXP} = \beta_1 + \beta_2\,\mathit{INCOME} + u \]

Let \(x\) = weekly income and \(y\) = weekly food expenditure. Build a 95% confidence interval for the population slope parameter \(\beta_2\).

\[ L, U = b_2 \pm t_{0.025,(df)}\,\mathrm{se}(b_2) \]

The estimated regression equation is:

\[ \hat{y} = 83.416 + 10.210x \]

The other ingredients to build the interval are:

\[ \mathrm{se}(b_2) = \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} = \frac{89.517}{\sqrt{1828.788}} = 2.093 \]

\[ \mathit{MOE} = t_{0.025,(n-2)}\,\mathrm{se}(b_2) = 2.024 \times 2.093 = 4.238 \]

\[ L, U = b_2 \pm \mathit{MOE} = 10.210 \pm 4.238 = [5.97, 14.45] \]

We estimate, with 95% confidence, that for each additional $100 of weekly income, household food expenditure rises by between $5.97 and $14.45.

2. Hypothesis Tests

2.1. Review of the General Concept

Again, as background, let us review the test of hypothesis for the population mean \(\mu\). Using a two-tailed test, suppose we are testing the null hypothesis that \(\mu = 100\) at a 5% level of significance (allowing 5% as the probability of a Type I error, that is, rejecting the null when it is true).

\[ H_0: \mu = 100 \qquad H_1: \mu \ne 100 \]

To test this hypothesis, suppose we take a random sample of \(n = 25\), which yields \(\bar{x} = 109\) and \(s = 20\). The test statistic for this test of hypothesis is:

\[ t = \frac{\bar{x} - \mu_0}{\mathrm{se}(\bar{x})} \qquad \text{where} \qquad \mathrm{se}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{20}{\sqrt{25}} = 4 \]

\[ t = \frac{109 - 100}{4} = 2.25 \]

We compare the test statistic \(t = 2.250\) to the critical value \(t_{\alpha/2,(n-1)} = t_{0.025,24} = 2.064\).
Since the test statistic exceeds the critical value, we reject the null hypothesis that \(\mu = 100\).

Another method to determine whether to reject the null hypothesis is to compute the p-value of the test. The p-value in a two-tailed test is the sum of the two tail areas under the \(t\) curve corresponding to the \(\pm t\) test statistic: \(2 \times P(t > 2.25) = 2 \times 0.0169 = 0.0338\).

[Figure: \(t\) density with df = 24, showing tail areas of 0.0169 beyond \(-2.250\) and \(2.250\).]

If the p-value is less than \(\alpha\), reject the null hypothesis. Here \(0.0338 < 0.05\), so the null is rejected. The combined two-tail area under the \(t\) curve (0.0338) is obtained using the Excel function =T.DIST.2T(x, deg_freedom), where x = 2.25 and deg_freedom = 24.

2.2. Hypothesis Test for \(\beta_2\)

2.2.1. Two-Tail Test of Significance

Generally, but not always, in a hypothesis test for the slope parameter we test the null hypothesis that the population slope is zero. This hypothesis implies that there is no relationship between \(x\) and \(y\). To support our theory that there is a relationship between \(x\) and \(y\), we must reject the null “beyond a reasonable doubt”. Thus, we start the test of hypothesis for \(\beta_2\) with:

\[ H_0: \beta_2 = 0 \qquad H_1: \beta_2 \ne 0 \]

Our sample statistic is \(b_2\), with a standard error of \(\mathrm{se}(b_2) = \mathrm{se}(e)\big/\sqrt{\sum (x - \bar{x})^2}\). The test statistic is then:

\[ t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{b_2}{\mathrm{se}(b_2)} \]

since by the null hypothesis \(\beta_2 = 0\). Using a level of significance of \(\alpha\), the critical value is \(t_{\alpha/2,(n-2)}\).

In the food expenditure example we obtained \(b_2 = 10.210\). Is this figure significantly different from zero? The test statistic (TS) here is:

\[ \mathit{TS} = t = \frac{10.21}{2.093} = 4.877 \]

Using \(\alpha = 0.05\), the critical value (CV) of \(t\) with \(df = n - 2 = 38\) is

\[ \mathit{CV} = t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]

Since \(\mathit{TS} = 4.877 > \mathit{CV} = 2.024\), we reject \(H_0: \beta_2 = 0\) and conclude that \(b_2 = 10.210\) is significantly different from zero. The statistical relationship between income and food expenditure is significant.
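The two-tail test and its p-value can be sketched in Python as follows. This is an illustrative sketch only, using the standard library: the helper names `t_pdf`, `t_upper_tail`, and `two_tail_test` are hypothetical, and the tail area is approximated by Simpson's rule on the Student \(t\) density rather than by Excel's T.DIST.2T. The critical value 2.024 is taken from the text.

```python
import math

def t_pdf(t, df):
    """Density of the Student t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t].

    steps must be even. The density is symmetric, so P(T > 0) = 0.5.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

def two_tail_test(b2, beta0, se_b2, df, crit):
    """Return the test statistic, two-tail p-value, and reject decision."""
    ts = (b2 - beta0) / se_b2
    p_value = 2 * t_upper_tail(abs(ts), df)
    return ts, p_value, abs(ts) > crit

# Food expenditure example: H0: beta2 = 0, df = 38, CV = 2.024 (from the text)
ts, p_value, reject = two_tail_test(10.21, 0.0, 2.093, 38, 2.024)
print(round(ts, 3), reject)
print(p_value)

# Review example for the mean: t = 2.25 with df = 24
print(2 * t_upper_tail(2.25, 24))
```

The same function handles an economic hypothesis such as \(H_0: \beta_2 = 7.50\) by passing 7.50 as `beta0`.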
p-value

Rather than comparing TS to CV, standard statistical reports provide the probability value of the test, commonly referred to as the p-value, as the decision rule. The p-value is the tail area under the \(t\) curve beyond the test statistic, which is then compared to the chosen \(\alpha\). For a two-tail test, the sum of the two tail areas is compared with \(\alpha\). To find this two-tail area, use the Excel function T.DIST.2T(x, deg_freedom).

\[ p\text{-value} = 2 \times P(t > \mathit{TS}) \]

\[ p\text{-value} = 2 \times P(t > 4.877) = 2 \times 0.0000097 \approx 0.0000195 \]

T.DIST.2T(4.877, 38) = 0.0000195

The logic of the p-value is simple. It is the probability, computed under the null hypothesis, of obtaining a test statistic at least as extreme as the one observed. You can think of it as the smallest probability of Type I error at which the test would still reject the null. If this revealed probability exceeds the benchmark probability \(\alpha\), then rejecting would carry a higher probability of Type I error than we are willing to accept. Therefore, we do not reject the null hypothesis. But if the revealed probability is less than the benchmark probability of Type I error, then we reject the null hypothesis.

2.2.2. Two-Tail Test of an Economic Hypothesis

We want to test the hypothesis, at the 5 percent level of significance, that households spend $7.50 of each additional $100 of weekly income on food. The hypotheses for the test are:

\[ H_0: \beta_2 = 7.50 \qquad H_1: \beta_2 \ne 7.50 \]

The test statistic is

\[ \mathit{TS} = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 7.50}{2.093} = 1.294 \]

and the critical value is

\[ \mathit{CV} = t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]

Since \(\mathit{TS} = 1.294 < \mathit{CV} = 2.024\), we do not reject the null hypothesis that food expenditure per additional $100 of income is equal to $7.50. The sample data are consistent with the null hypothesis.

\[ p\text{-value} = 2 \times P(t > 1.294) = 2 \times 0.10166 = 0.20332 \]

T.DIST.2T(1.294, 38) = 0.20332

2.2.3. Right-Tail Test of Significance

Economic theory suggests that food is a normal good. That is, expenditure on food rises with an increase in income. Thus, we expect that in the regression model \(\beta_2 > 0\).
The sample data provided an estimate of \(\beta_2\), \(b_2 = 10.21\), which is greater than zero. The objective of the test, however, is to show that this estimated quantity is significantly greater than zero. To do so, we must reject the null hypothesis \(H_0: \beta_2 \le 0\). The hypotheses for the test are then written as

\[ H_0: \beta_2 \le 0 \qquad H_1: \beta_2 > 0 \]

Since the strict inequality in the alternative hypothesis, “>”, points to the right, this is a right-tail test. The test statistic is

\[ \mathit{TS} = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 0}{2.093} = 4.88 \]

and the critical value, at \(\alpha = 0.05\), is

\[ \mathit{CV} = t_{\alpha,(df)} = t_{0.05,(38)} = 1.686 \]

Clearly, \(\mathit{TS} = 4.88 > \mathit{CV} = 1.69\). Therefore, we reject the null hypothesis. The data indicate that food is a normal good.

[Figure: right tail of the \(t\) density, with CV = 1.69 and TS = 4.88.]

\[ p\text{-value} = P(t > 4.877) = 0.0000097 \]

T.DIST.RT(4.877, 38) = 0.0000097

2.2.4. Right-Tail Test of an Economic Hypothesis

We want to test the hypothesis that household food expenditure exceeds $5.50 for each additional $100 in weekly income. The purpose of the test is to see whether the construction of a new supermarket in a residential area is economically justified (profitable). If the data confirm the hypothesis, then the supermarket will be constructed. The estimated coefficient is \(b_2 = 10.21\), which is greater than $5.50. The question, however, is whether $10.21 is significantly greater than $5.50. Thus, the null and alternative hypotheses are:

\[ H_0: \beta_2 \le 5.50 \qquad H_1: \beta_2 > 5.50 \]

The test statistic is

\[ \mathit{TS} = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 5.50}{2.093} = 2.25 \]

and the critical value, at \(\alpha = 0.05\), is

\[ \mathit{CV} = t_{\alpha,(df)} = t_{0.05,(38)} = 1.686 \]

Based on \(\alpha = 0.05\), since \(\mathit{TS} = 2.25 > \mathit{CV} = 1.69\), we reject \(H_0: \beta_2 \le 5.50\). But if we choose a smaller \(\alpha\), say \(\alpha = 0.01\), then

\[ \mathit{CV} = t_{0.01,(38)} = 2.429 \]

which gives \(\mathit{TS} = 2.25 < \mathit{CV} = 2.429\), and we would not reject \(H_0\). The diagrams below illustrate both cases.
[Figure: right-tail rejection regions in two panels. Panel \(\alpha = 0.05\): CV = 1.69, TS = 2.25, reject \(H_0\). Panel \(\alpha = 0.01\): CV = 2.43, TS = 2.25, do not reject \(H_0\).]

\[ p\text{-value} = P(t > 2.25) = 0.0152 \]

T.DIST.RT(2.25, 38) = 0.0152

Note that if \(\alpha = 0.05\), then p-value = 0.0152 < \(\alpha\) = 0.05, and we would reject \(H_0\). But if \(\alpha = 0.01\), then p-value = 0.0152 > \(\alpha\) = 0.01, and we would not reject \(H_0\).

2.2.5. Left-Tail Tests

Continuing with the food expenditure model, we want to test the hypothesis that household food expenditure is below $15 for each additional $100 in weekly income. The word “below” indicates that we are testing \(\beta_2 < 15\) against \(\beta_2 \ge 15\). This makes it a left-tail test.

\[ H_0: \beta_2 \ge 15 \qquad H_1: \beta_2 < 15 \]

The test statistic is

\[ \mathit{TS} = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 15}{2.093} = -2.288 \]

and the critical value, at \(\alpha = 0.05\), is

\[ \mathit{CV} = -t_{\alpha,(df)} = -t_{0.05,(38)} = -1.686 \]

You can present the conclusion either as

\[ \mathit{TS} = -2.288 < \mathit{CV} = -1.686 \quad \Rightarrow \quad \text{Reject } H_0 \]

or as

\[ |\mathit{TS}| = 2.288 > |\mathit{CV}| = 1.686 \quad \Rightarrow \quad \text{Reject } H_0 \]

[Figure: left tail of the \(t\) density, with TS = -2.29 and CV = -1.69.]

\[ p\text{-value} = P(t < -2.288) = 0.0139 \]

In Excel you can find the tail area when \(t < 0\) in two ways. Either use the right-tail function with the absolute value,

T.DIST.RT(2.288, 38) = 0.0139

or use T.DIST(x, deg_freedom, cumulative):

T.DIST(-2.288, 38, 1) = 0.0139

3. Inferences Involving a Linear Combination of the Regression Parameters

3.1. Interval Estimate

To explain this topic, let us use the food expenditure model again. Suppose we want to build an interval estimate for the mean value of food expenditure in the population of households for a specific weekly income of, say, $1,000.

We learned earlier that the regression line is the locus of the mean values of \(y\) for each given value of the explanatory variable \(x\), \(\mu_{y|x_i}\). Thus, in the estimated regression equation, \(\hat{y} = b_1 + b_2 x\), the predicted value of \(y\) for a given value of \(x\) is the estimated mean we are looking for.
To build an interval estimate for the mean of \(y\) for a given \(x\), \(\mu_{y|x_0}\), we consider \(\hat{y}\) as the sample statistic, the estimator, for the population parameter \(\mu_{y|x_0}\). Thus the interval estimate is

\[ L, U = \hat{y} \pm t_{\alpha/2,(df)}\,\mathrm{se}(\hat{y}) \]

Substituting for \(\hat{y}\), we have

\[ L, U = (b_1 + b_2 x) \pm t_{\alpha/2,(df)}\,\mathrm{se}(b_1 + b_2 x) \]

This clearly shows that the linear combination of \(b_1\) and \(b_2\), \((b_1 + b_2 x)\), is the estimator of the linear combination of the population parameters, \(\beta_1 + \beta_2 x\). To build the interval estimate, we need to find \(\mathrm{se}(b_1 + b_2 x)\). For this we start with \(\mathrm{var}(b_1 + b_2 x)\). Using the properties of the variance of a linear combination of two random variables, we have

\[ \mathrm{var}(b_1 + b_2 x) = \mathrm{var}(b_1) + x^2\,\mathrm{var}(b_2) + 2x\,\mathrm{cov}(b_1, b_2) \]

In the previous chapter we learned how to determine the covariance matrix for the coefficients of the regression equation.

\[ \begin{bmatrix} \mathrm{var}(b_1) & \mathrm{cov}(b_1, b_2) \\ \mathrm{cov}(b_1, b_2) & \mathrm{var}(b_2) \end{bmatrix} = \mathrm{var}(e)\,X^{-1} \]

\[ \mathrm{var}(e)\,X^{-1} = 8013.294 \times \begin{bmatrix} 0.23516 & -0.01072 \\ -0.01072 & 0.00055 \end{bmatrix} \]

\[ \begin{bmatrix} \mathrm{var}(b_1) & \mathrm{cov}(b_1, b_2) \\ \mathrm{cov}(b_1, b_2) & \mathrm{var}(b_2) \end{bmatrix} = \begin{bmatrix} 1884.442 & -85.903 \\ -85.903 & 4.382 \end{bmatrix} \]

Thus, using the relevant figures from the covariance matrix and \(x = 10\) ($1,000) for the income level, we have

\[ \mathrm{var}(b_1 + b_2 x) = 1884.442 + 10^2 \times 4.382 + 2 \times 10 \times (-85.903) = 604.554 \]

\[ \mathrm{se}(b_1 + b_2 x) = \sqrt{604.554} = 24.588 \]

To determine the margin of error for the interval estimate:

\[ t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]

\[ \mathit{MOE} = t_{\alpha/2,(df)}\,\mathrm{se}(b_1 + b_2 x) = 2.024 \times 24.588 = 49.78 \]

\[ \hat{y}_{x=10} = 83.416 + 10.21 \times 10 = 185.51 \]

\[ L, U = 185.51 \pm 49.78 = [135.74, 235.29] \]

We estimate with 95% confidence that the mean food expenditure of households with a weekly income of $1,000 is between $135.74 and $235.29.

3.2. Hypothesis Test

Test the hypothesis that, for a household with a weekly income of $1,000, the expected expenditure on food is more than $150.

\[ H_0: \beta_1 + 10\beta_2 \le 150 \qquad H_1: \beta_1 + 10\beta_2 > 150 \]

The sample statistic, or the estimator, of \(\beta_1 + 10\beta_2\) is \(b_1 + 10b_2\).
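Thus, the test statistic becomes

\[ \mathit{TS} = \frac{b_1 + 10b_2 - 150}{\mathrm{se}(b_1 + 10b_2)} \]

Given \(b_1 = 83.416\), \(b_2 = 10.21\), and the previously computed \(\mathrm{se}(b_1 + 10b_2) = 24.588\), the test statistic is

\[ \mathit{TS} = \frac{83.416 + 10 \times 10.21 - 150}{24.588} = \frac{35.516}{24.588} = 1.444 \]

The critical value at \(\alpha = 0.05\) is \(t_{0.05,(38)} = 1.686\). Since \(\mathit{TS} = 1.444 < \mathit{CV} = 1.686\), the p-value \(P(t > 1.444)\), roughly 0.08, exceeds 0.05. We thus do not reject the null hypothesis at \(\alpha = 0.05\) (nor at \(\alpha = 0.01\)): the sample does not provide sufficient evidence that the mean food expenditure for a weekly income of $1,000 is greater than $150 a week.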
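The interval estimate and the test statistic for \(\beta_1 + 10\beta_2\) can be reproduced from the rounded figures quoted in Section 3. The sketch below is illustrative only: the helper name `lincomb_var` is hypothetical, the critical values 2.024 and 1.686 are copied from the text rather than computed, and small differences from the quoted results come from rounding in the inputs.

```python
import math

def lincomb_var(x0, var_b1, var_b2, cov_b1b2):
    """var(b1 + b2*x0) = var(b1) + x0^2 * var(b2) + 2 * x0 * cov(b1, b2)."""
    return var_b1 + x0 ** 2 * var_b2 + 2 * x0 * cov_b1b2

# Rounded estimates and covariance matrix entries quoted in the text
b1, b2 = 83.416, 10.21
var_b1, var_b2, cov_b1b2 = 1884.442, 4.382, -85.903
x0 = 10                      # weekly income of $1,000, in units of $100

y_hat = b1 + b2 * x0
se = math.sqrt(lincomb_var(x0, var_b1, var_b2, cov_b1b2))

# 95% interval estimate for the mean of y at x0, using t_{0.025,(38)} = 2.024
moe = 2.024 * se
lower, upper = y_hat - moe, y_hat + moe
print(round(se, 3), round(lower, 2), round(upper, 2))

# Right-tail test of H0: beta1 + 10*beta2 <= 150, using t_{0.05,(38)} = 1.686
ts = (y_hat - 150) / se      # about 1.44 with these rounded inputs
reject_at_05 = ts > 1.686
print(round(ts, 3), reject_at_05)
```

Changing `x0` gives the interval and test for any other income level, since the covariance term is what ties the two coefficient estimates together.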