4-inference in regre..

CHAPTER 4
INTERVAL ESTIMATION AND HYPOTHESIS TESTING IN REGRESSION

1. Confidence Interval for the Population Parameters of Regression
   1.1. A Review of the General Concept of the Confidence Interval
   1.2. Confidence Interval for the Regression Slope Parameter
2. Hypothesis Tests
   2.1. Review of the General Concept
   2.2. Hypothesis Test for \(\beta_2\)
      2.2.1. Two-Tail Test of Significance
      2.2.2. Two-Tail Test of an Economic Hypothesis
      2.2.3. Right-Tail Test of Significance
      2.2.4. Right-Tail Test of an Economic Hypothesis
      2.2.5. Left-Tail Tests
3. Inferences Involving a Linear Combination of the Regression Parameters
   3.1. Interval Estimate
   3.2. Hypothesis Test

1. Confidence Interval for the Population Parameters of Regression

1.1. A Review of the General Concept of the Confidence Interval

The purpose here is to build a confidence interval for the parameters of the regression model, \(\beta_1\) and \(\beta_2\). A brief review of the methodology for building a confidence interval for the population mean \(\mu\) will help in explaining the confidence interval (CI) for the parameters of the regression.

To build a confidence interval for \(\mu\), the sampling distribution of the sample statistic \(\bar{x}\) must be normal. The mean of this distribution is \(E(\bar{x}) = \mu\), and the standard deviation (standard error) is \(\operatorname{se}(\bar{x}) = \sigma/\sqrt{n}\).

[Figure: sampling distribution of \(\bar{x}\), centered at \(\mu\)]

These properties of the sampling distribution of \(\bar{x}\) allow us to define

\[ z = \frac{\bar{x} - \mu}{\operatorname{se}(\bar{x})} \]

as a standard normal random variable. Solving for \(\bar{x}\) in this equation, we have

\[ \bar{x} = \mu + z \cdot \operatorname{se}(\bar{x}) \]

(with \(z < 0\) for \(\bar{x}\) values to the left of \(\mu\)). This expression tells us that the values of the random variable \(\bar{x}\) are distributed around the population mean, each value deviating from \(\mu\) by a multiple of the standard error \(\operatorname{se}(\bar{x})\). If \(\operatorname{se}(\bar{x})\) is known, then we can find the boundaries of intervals symmetric about the mean (middle intervals), which tell us what percentage of \(\bar{x}\) values fall within these intervals. These intervals are determined by the specific values of \(z\):
\[ \mu - z \cdot \operatorname{se}(\bar{x}) \le \bar{x} \le \mu + z \cdot \operatorname{se}(\bar{x}) \]

For example, the middle interval containing 95% of all \(\bar{x}\) values is determined by \(z_{0.025} = 1.96\):

\[ P\big(\mu - z_{0.025}\operatorname{se}(\bar{x}) \le \bar{x} \le \mu + z_{0.025}\operatorname{se}(\bar{x})\big) = 0.95 \]

\[ P\big(\mu - 1.96\operatorname{se}(\bar{x}) \le \bar{x} \le \mu + 1.96\operatorname{se}(\bar{x})\big) = 0.95 \]

The remaining 5% of all \(\bar{x}\) values fall outside the interval. Generally, the proportion of \(\bar{x}\) values that fall outside the interval of interest is denoted by \(\alpha\) and is known as the error probability. The \(z\) score corresponding to a given error probability is denoted by \(z_{\alpha/2}\). The product \(z_{\alpha/2}\operatorname{se}(\bar{x})\) is called the margin of sampling error, or simply the margin of error (MOE).

Working with 95% as an example, we have established that 95% of all \(\bar{x}\) values fall within \(MOE = \pm z_{0.025}\operatorname{se}(\bar{x})\) of the population mean. Now, using the same MOE, instead of building the interval around \(\mu\), build the interval around a randomly determined sample mean value: \(\bar{x} \pm z_{0.025}\operatorname{se}(\bar{x})\). Since the continuous random variable \(\bar{x}\) can take on an infinite number of values, we can theoretically build an infinite number of such intervals using the same MOE. Ninety-five percent of such intervals would capture \(\mu\).

This is the theoretical framework for the confidence interval for any population parameter. In practice, we take only one sample of size \(n\) and build one interval around the mean computed from this random sample. Then we state that we are 95% confident that this interval contains the population parameter. A confidence interval is simply the point estimate of the parameter obtained from a random sample, plus or minus the margin of error.

In practice, since the population standard deviation is not known, we must use the sample standard deviation as an estimate of the population standard deviation. But when the sample standard deviation is used, the added uncertainty from using another estimated value (on top of the estimated \(\bar{x}\)) necessarily makes the margin of error wider.
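The repeated-sampling interpretation described above (95% of the intervals \(\bar{x} \pm z_{0.025}\operatorname{se}(\bar{x})\) capture \(\mu\)) can be checked with a small simulation. The following is only a sketch under stated assumptions: the population values (mean 50, standard deviation 20), the number of replications, and the function name `coverage` are hypothetical choices made for illustration, and \(\sigma\) is treated as known so that \(z\) applies.

```python
import math
import random
from statistics import NormalDist

def coverage(mu=50.0, sigma=20.0, n=25, reps=4000, conf=0.95, seed=1):
    """Share of intervals x_bar +/- z * se(x_bar) that contain mu (sigma known)."""
    rng = random.Random(seed)                     # fixed seed: repeatable runs
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)  # about 1.96 for 95%
    moe = z * sigma / math.sqrt(n)                # same MOE for every sample
    hits = 0
    for _ in range(reps):
        # Draw one random sample of size n and compute its mean.
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        # Does the interval built around this sample mean capture mu?
        if x_bar - moe <= mu <= x_bar + moe:
            hits += 1
    return hits / reps

share = coverage()
print(f"share of intervals containing mu: {share:.3f}")
```

The reported share should be close to 0.95 and will vary slightly with the seed and the number of replications.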
Because the sample standard deviation \(s\) is itself an estimate, the random variable \(t\) replaces \(z\) in the MOE formula:

\[ MOE = t_{\alpha/2,(n-1)} \frac{s}{\sqrt{n}} \]

Suppose that from a random sample of \(n = 25\) the sample mean is \(\bar{x} = 48\) and the sample standard deviation is \(s = 20\). The margin of error for a 95% confidence interval for \(\mu\) is:

\[ MOE = t_{0.025,(24)} \frac{20}{\sqrt{25}} = 2.064 \times 4 = 8.26 \]

1.2. Confidence Interval for the Regression Slope Parameter

In the previous chapter we saw that, for a given \(x\), the \(y\) values are normally distributed. It was also proved that the sample regression coefficient is a linear function of \(y\):

\[ b_2 = \sum w y = \frac{\sum (x - \bar{x})\,y}{\sum (x - \bar{x})^2} \]

The slope coefficient \(b_2\) is therefore also normally distributed, because it is a linear function of the normal \(y\). The mean and standard deviation (standard error) of \(b_2\) are, respectively,

\[ E(b_2) = \beta_2 \qquad \text{and} \qquad \operatorname{se}(b_2) = \frac{\sigma_u}{\sqrt{\sum (x - \bar{x})^2}} \]

[Figure: sampling distribution of \(b_2\), centered at \(\beta_2\)]

Since \(b_2\) is normally distributed, the standard normal random variable \(z\) can be defined as

\[ z = \frac{b_2 - \beta_2}{\operatorname{se}(b_2)} \]

Following the same methodology as that for the confidence interval for \(\mu\), first solve for \(b_2\):

\[ b_2 = \beta_2 + z \cdot \operatorname{se}(b_2) \]

Using 95% as the benchmark, 95% of all values of the random variable \(b_2\) fall within the margin of error \(MOE = \pm z_{0.025} \cdot \operatorname{se}(b_2)\) of \(\beta_2\). That is,

\[ P\big(\beta_2 - z_{0.025} \cdot \operatorname{se}(b_2) \le b_2 \le \beta_2 + z_{0.025} \cdot \operatorname{se}(b_2)\big) = 0.95 \]

Again, using the margin of error \(z_{0.025} \cdot \operatorname{se}(b_2)\), we can build an infinite number of intervals around the randomly determined values of \(b_2\). Ninety-five percent of such intervals would contain the population parameter \(\beta_2\):

\[ P\big(b_2 - z_{0.025} \cdot \operatorname{se}(b_2) \le \beta_2 \le b_2 + z_{0.025} \cdot \operatorname{se}(b_2)\big) = 0.95 \]

Now note that, in the MOE formula, the formula for the standard error \(\operatorname{se}(b_2)\) contains the unknown population parameter \(\sigma_u\):

\[ \operatorname{se}(b_2) = \frac{\sigma_u}{\sqrt{\sum (x - \bar{x})^2}} \]

This requires us to replace \(\sigma_u\) with its sample estimator \(\operatorname{se}(e)\). The formula for the standard error of \(b_2\) thus changes to:

\[ \operatorname{se}(b_2) = \frac{\operatorname{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

We are now using an estimated value in the MOE formula.
Therefore, the margin of error becomes inherently wider due to the added uncertainty of using \(\operatorname{se}(e)\). The \(z\) score in the MOE formula no longer applies, and it is replaced by \(t\):

\[ MOE = \pm t_{\alpha/2,(df)} \cdot \operatorname{se}(b_2) \]

To show that the term

\[ \frac{b_2 - \beta_2}{\operatorname{se}(b_2)} \]

has a \(t\) distribution, we introduce a new random variable called chi-square (\(\chi^2\)). Chi-square is formed as the sum of \(m\) independent squared standard normal random variables, \(z_i^2\). Like the \(t\) distribution, each \(\chi^2\) distribution is identified by the parameter degrees of freedom. If the \(\chi^2\) distribution is formed by the sum of \(m\) independent \(z_i^2\), then the degrees of freedom is the value \(m\).

Let \(\nu\) (nu) be a \(\chi^2\) random variable. The probability density function of \(\nu\) is:

\[ f(\nu) = \frac{\nu^{\left(\frac{m}{2} - 1\right)}}{\left(\frac{m}{2} - 1\right)!\; 2^{m/2}\, e^{\nu/2}} \]

where \(m\) denotes the degrees of freedom of the distribution and \(e\) is the base of the natural logarithm. The function is defined only for \(\nu \ge 0\). The mean of the distribution is \(E(\nu) = m\) and the variance is \(\sigma_\nu^2 = 2m\). Using Excel you can plot the pdf for a given \(m\). The following shows three distributions with the indicated degrees of freedom \(m\).

[Figure: \(\chi^2\) density \(f(\nu)\) for df = 4, df = 8, and df = 12, plotted for \(\nu\) from 0 to 30]

Now recall that the error term \(u\) in the regression model is assumed to be normally distributed with a mean of 0 and a standard deviation of \(\sigma_u\). Thus,

\[ z_i = \frac{u_i}{\sigma_u} \]

Squaring and summing over all \(i\), we have a \(\chi^2\) random variable \(\nu\):

\[ \nu = \sum z_i^2 = \frac{\sum u_i^2}{\sigma_u^2} \]

Since the \(u_i\) are unknown, they are replaced by the regression residuals \(e\):

\[ \nu = \frac{\sum e_i^2}{\sigma_u^2} \]

Recall that

\[ \operatorname{var}(e) = \frac{\sum e^2}{n - 2} \]

Thus, we have \(\sum e^2 = (n - 2)\operatorname{var}(e)\). Substituting for \(\sum e^2\) in the numerator of \(\nu\), we have

\[ \nu = \frac{(n - 2)\operatorname{var}(e)}{\sigma_u^2} \]

The degrees of freedom of the \(\chi^2\) random variable \(\nu\) is \(n - 2\) because only \(n - 2\) residuals are independent (see footnote 1). Dividing both sides of the last equation by \(n - 2\),

\[ \frac{\nu}{n - 2} = \frac{\operatorname{var}(e)}{\sigma_u^2} \]

and taking the square root of both sides, we have:

\[ \sqrt{\frac{\nu}{n - 2}} = \frac{\operatorname{se}(e)}{\sigma_u} \]

Now we introduce the random variable \(t\) (the Student \(t\) distribution) again!
Theoretically, the \(t\) distribution is formed as the ratio of \(z\) over \(\sqrt{\nu/(n - 2)}\):

\[ t = \frac{z}{\sqrt{\nu/(n - 2)}} \]

Substituting in the numerator and the denominator, we have

\[ t = \frac{\dfrac{b_2 - \beta_2}{\sigma_u \big/ \sqrt{\sum (x - \bar{x})^2}}}{\dfrac{\operatorname{se}(e)}{\sigma_u}} = \frac{b_2 - \beta_2}{\operatorname{se}(e) \big/ \sqrt{\sum (x - \bar{x})^2}} \]

The denominator of the \(t\) equation is the standard error of the slope coefficient \(b_2\) estimated from the sample data:

\[ \operatorname{se}(b_2) = \frac{\operatorname{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

Thus,

\[ t = \frac{b_2 - \beta_2}{\operatorname{se}(b_2)} \]

Solving for \(b_2\) in this equation, we have

\[ b_2 = \beta_2 + t \cdot \operatorname{se}(b_2) \]

in which the term \(t \cdot \operatorname{se}(b_2)\) is the margin of sampling error (MOE). Again, using 95% as the benchmark probability, we have:

\[ P\big(\beta_2 - t_{0.025,(df)}\operatorname{se}(b_2) \le b_2 \le \beta_2 + t_{0.025,(df)}\operatorname{se}(b_2)\big) = 0.95 \]

The degrees of freedom here are the same as those of the \(\chi^2\) distribution, \(n - 2\).

Since \(\beta_2\) is unknown, its estimator is used to construct a confidence interval for the slope parameter. Using the MOE determined above, the lower and upper boundaries of a 95% confidence interval for \(\beta_2\) are:

\[ L, U = b_2 \pm t_{0.025,(df)}\operatorname{se}(b_2) \]

\[ L, U = b_2 \pm t_{0.025,(df)} \frac{\operatorname{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

Footnote 1. Whenever we want to find the average of squared deviations using the sample data, we lose one degree of freedom for each estimated parameter in the squared deviation. For example, to find the sample variance of \(y\), we compute the average of the squared deviations of \(y\) from \(\bar{y}\), \(\sum (y - \bar{y})^2\). Since the random variable \(\bar{y}\), the estimator of the parameter \(\mu\), is used in the calculation of the squared deviations, only \(n - 1\) of the squared deviations are independent. We have lost one degree of freedom. Similarly, to compute the average of \(\sum e^2 = \sum (y - \hat{y})^2\), since we obtained \(\hat{y}\) by estimating the two parameters \(\beta_1\) and \(\beta_2\), we lose two degrees of freedom: we have only \(n - 2\) independent squared deviations.

Example: Household food expenditure and weekly income

The data and other calculations are in the Excel file CH4 DATA.xlsx (“food” tab).
The data show the weekly food expenditure of 40 households in dollars and weekly income in hundreds of dollars ($100). How does weekly food expenditure respond to changes in weekly income?

\[ FOODEXP = \beta_1 + \beta_2\, INCOME + u \]

Let \(x\) = weekly income and \(y\) = weekly food expenditure. Build a 95% confidence interval for the population slope parameter \(\beta_2\).

\[ L, U = b_2 \pm t_{0.025,(df)}\operatorname{se}(b_2) \]

The estimated regression equation is:

\[ \hat{y} = 83.416 + 10.210x \]

The other ingredients to build the interval are:

\[ \operatorname{se}(b_2) = \frac{\operatorname{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} = \frac{89.517}{\sqrt{1828.788}} = 2.093 \]

\[ MOE = t_{0.025,(n-2)}\operatorname{se}(b_2) = 2.024 \times 2.093 = 4.238 \]

\[ L, U = b_2 \pm MOE = 10.210 \pm 4.238 = [5.97, 14.45] \]

We estimate, with 95% confidence, that for each additional $100 of weekly income, household food expenditure rises by between $5.97 and $14.45.

2. Hypothesis Tests

2.1. Review of the General Concept

Again, as background, let us review the test of hypothesis for the population mean \(\mu\). Using a two-tailed test, suppose we are testing the null hypothesis that \(\mu = 100\) at a 5% level of significance (allowing 5% as the probability of a Type I error, that is, rejecting the null when it is true).

\[ H_0: \mu = 100 \qquad H_1: \mu \ne 100 \]

To test this hypothesis, suppose we take a random sample of \(n = 25\), which yields \(\bar{x} = 109\) and \(s = 20\). The test statistic for this test of hypothesis is:

\[ t = \frac{\bar{x} - \mu_0}{\operatorname{se}(\bar{x})} \qquad \text{where} \qquad \operatorname{se}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{20}{\sqrt{25}} = 4 \]

\[ t = \frac{109 - 100}{4} = 2.25 \]

We compare the test statistic \(t = 2.250\) to the critical value \(t_{\alpha/2,(n-1)} = t_{0.025,24} = 2.06\). Since the test statistic exceeds the critical value, we reject the null hypothesis that \(\mu = 100\).

Another method to determine whether to reject the null hypothesis is to compute the p-value of the test. The p-value in a two-tailed test is the sum of the two tail areas under the \(t\) curve corresponding to the \(\pm t\) test statistic: \(2 \times P(t > 2.25) = 2 \times 0.0169 = 0.0338\).

[Figure: \(t\) curve with df = 24, showing tail areas of 0.0169 to the left of −2.250 and to the right of 2.250]

If the p-value is less than \(\alpha\), reject the null hypothesis.
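For readers working outside Excel, the review test above can be reproduced with a short Python sketch. It is not the author's method: the tail area of the \(t\) curve is approximated here by numerical integration (Simpson's rule) using only the standard library, and the helper names `t_pdf`, `t_right_tail`, and `t_critical` are hypothetical names introduced for this illustration.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def t_right_tail(t, df, steps=2000):
    """Approximate P(T > t): Simpson's rule on [0, |t|] plus symmetry of the t curve."""
    a = abs(t)
    h = a / steps                       # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    tail = 0.5 - total * h / 3          # area to the right of |t|
    return tail if t >= 0 else 1.0 - tail

def t_critical(tail_area, df):
    """Value c with P(T > c) = tail_area (tail_area < 0.5), found by bisection."""
    lo, hi = 0.0, 50.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_right_tail(mid, df) > tail_area:
            lo = mid                    # tail still too large: move right
        else:
            hi = mid
    return (lo + hi) / 2

# Review example: H0: mu = 100 against H1: mu != 100
n, x_bar, s, mu_0, alpha = 25, 109, 20, 100, 0.05
se = s / math.sqrt(n)                            # standard error of the mean
ts = (x_bar - mu_0) / se                         # test statistic
cv = t_critical(alpha / 2, n - 1)                # two-tail critical value
p_value = 2 * t_right_tail(abs(ts), n - 1)       # sum of the two tail areas
reject = p_value < alpha
print(f"TS = {ts:.3f}, CV = {cv:.3f}, p-value = {p_value:.4f}, reject H0: {reject}")
```

The same helpers can be reused with \(df = n - 2\) for the regression tests later in the chapter; a statistics library, if one is available, would be the more usual choice for these tail areas.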
The combined two-tail area under the \(t\) curve (0.0338) is obtained using the Excel function =T.DIST.2T(x, deg_freedom), where x = 2.25 and deg_freedom = 24.

2.2. Hypothesis Test for \(\beta_2\)

2.2.1. Two-Tail Test of Significance

Generally, but not always, in a hypothesis test for the slope parameter we test the null hypothesis that the population slope is zero. This hypothesis implies that there is no relationship between \(x\) and \(y\). To prove our theory that there is a relationship between \(x\) and \(y\), we must reject the null “beyond a reasonable doubt”. Thus, we start the test of hypothesis for \(\beta_2\) with:

\[ H_0: \beta_2 = 0 \qquad H_1: \beta_2 \ne 0 \]

Our sample statistic is \(b_2\), with a standard error of \(\operatorname{se}(b_2) = \operatorname{se}(e) \big/ \sqrt{\sum (x - \bar{x})^2}\). The test statistic is then:

\[ t = \frac{b_2 - (\beta_2)_0}{\operatorname{se}(b_2)} = \frac{b_2}{\operatorname{se}(b_2)} \]

since by the null hypothesis \(\beta_2 = 0\). Using a level of significance of \(\alpha\), the critical value is \(t_{\alpha/2,(n-2)}\).

In the food expenditure example we obtained \(b_2 = 10.210\). Is this figure significantly different from zero? The test statistic here is:

\[ TS = t = \frac{10.21}{2.093} = 4.877 \]

Using \(\alpha = 0.05\), the critical \(t\) with \(df = n - 2 = 38\) is

\[ CV = t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]

Since \(TS = 4.877 > CV = 2.024\), we reject \(H_0: \beta_2 = 0\) and conclude that \(b_2 = 10.210\) is significantly different from zero. The statistical relationship between income and food expenditure is significant.

p-value

Rather than comparing TS to CV, standard statistical reports provide the probability value, commonly referred to as the p-value, of the test as the decision rule. The p-value is simply the tail area corresponding to the test statistic under the \(t\) curve, which is then compared to the given \(\alpha\) for the test. For a two-tail test, the sum of the two tail areas is compared with \(\alpha\). To find the tail areas for a two-tail test, use the Excel function T.DIST.2T(x, deg_freedom).

\[ p\text{-value} = 2 \times P(t > TS) \]

\[ p\text{-value} = 2 \times P(t > 4.877) = 2 \times 0.0000097 = 0.0000195 \]

T.DIST.2T(4.877, 38) = 0.0000195

The logic of the p-value is simple.
You can think of the p-value as the probability of a Type I error as revealed by the test: the probability, if the null hypothesis were true, of obtaining a test statistic at least as extreme as the one observed. If the revealed probability exceeds the benchmark probability \(\alpha\), then the probability of a Type I error is higher than what we would like it to be. Therefore, we do not reject the null hypothesis. But if the revealed probability is less than the benchmark probability of a Type I error, then we reject the null hypothesis.

2.2.2. Two-Tail Test of an Economic Hypothesis

We want to test the hypothesis, at the 5 percent level of significance, that households spend $7.50 of each additional $100 of weekly income on food. The hypotheses for the test are:

\[ H_0: \beta_2 = 7.50 \qquad H_1: \beta_2 \ne 7.50 \]

The test statistic is

\[ TS = t = \frac{b_2 - (\beta_2)_0}{\operatorname{se}(b_2)} = \frac{10.21 - 7.50}{2.093} = 1.294 \]

and the critical value is

\[ CV = t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]

Since \(TS = 1.294 < CV = 2.024\), do not reject the null hypothesis that the food expenditure per $100 of additional income is equal to $7.50. The sample data are consistent with the null hypothesis.

\[ p\text{-value} = 2 \times P(t > 1.294) = 2 \times 0.10166 = 0.20332 \]

T.DIST.2T(1.294, 38) = 0.20332

2.2.3. Right-Tail Test of Significance

Economic theory suggests that food is a normal good. That is, expenditure on food rises with an increase in income. Thus, we expect that in the regression model \(\beta_2 > 0\). The sample data provided an estimate of \(\beta_2\), \(b_2 = 10.21\), which is greater than zero. The objective of the test is, however, to prove that this estimated quantity is significantly greater than zero. To prove this, therefore, we must reject the null hypothesis \(H_0: \beta_2 \le 0\). The hypotheses for the test are then written as

\[ H_0: \beta_2 \le 0 \qquad H_1: \beta_2 > 0 \]

Since the direction of the strict inequality in the alternative hypothesis, “>”, is to the right, this is a right-tail test.
The test statistic is

\[ TS = t = \frac{b_2 - (\beta_2)_0}{\operatorname{se}(b_2)} = \frac{10.21 - 0}{2.093} = 4.88 \]

and the critical value, at \(\alpha = 0.05\), is

\[ CV = t_{\alpha,(df)} = t_{0.05,(38)} = 1.686 \]

In the diagram below, clearly, \(TS = 4.88 > CV = 1.69\). Therefore, we reject the null hypothesis. The data indicate that food is a normal good.

[Figure: \(t\) curve with CV = 1.69 and TS = 4.88 marked in the right tail]

\[ p\text{-value} = P(t > 4.877) = 0.0000097 \]

T.DIST.RT(4.877, 38) = 0.0000097

2.2.4. Right-Tail Test of an Economic Hypothesis

We want to test the hypothesis that household food expenditure exceeds $5.50 for each additional $100 in weekly income. The purpose of the hypothesis is to see whether the construction of a new supermarket in a residential area is economically justified (profitable). If the data confirm the hypothesis, then the supermarket will be constructed. The estimated coefficient is \(b_2 = 10.21\), which is greater than $5.50. The question, however, is whether $10.21 is significantly greater than $5.50. Thus, the null and alternative hypotheses are:

\[ H_0: \beta_2 \le 5.50 \qquad H_1: \beta_2 > 5.50 \]

The test statistic is

\[ TS = t = \frac{b_2 - (\beta_2)_0}{\operatorname{se}(b_2)} = \frac{10.21 - 5.50}{2.093} = 2.25 \]

and the critical value, at \(\alpha = 0.05\), is

\[ CV = t_{\alpha,(df)} = t_{0.05,(38)} = 1.686 \]

Based on \(\alpha = 0.05\), since \(TS = 2.25 > CV = 1.69\), we reject \(H_0: \beta_2 \le 5.50\). But if we choose a smaller \(\alpha\), say \(\alpha = 0.01\), then

\[ CV = t_{0.01,(38)} = 2.429 \]

which gives \(TS = 2.25 < CV = 2.429\), and we would not reject \(H_0\). Note the diagrams below.

[Figure: two panels. Left, \(\alpha = 0.05\): CV = 1.69 lies to the left of TS = 2.25, reject \(H_0\). Right, \(\alpha = 0.01\): TS = 2.25 lies to the left of CV = 2.43, do not reject \(H_0\).]

\[ p\text{-value} = P(t > 2.25) = 0.0152 \]

T.DIST.RT(2.25, 38) = 0.0152

Note that if \(\alpha = 0.05\), then p-value = 0.0152 < \(\alpha\) = 0.05, and we would reject \(H_0\). But if \(\alpha = 0.01\), then p-value = 0.0152 > \(\alpha\) = 0.01, and we would not reject \(H_0\).

2.2.5. Left-Tail Tests

Continuing with the food expenditure model, we want to test the hypothesis that household food expenditure is below $15 for each additional $100 in weekly income.
The word “below” indicates that we are testing \(\beta_2 < 15\) against \(\beta_2 \ge 15\). This makes it a left-tail test.

\[ H_0: \beta_2 \ge 15 \qquad H_1: \beta_2 < 15 \]

The test statistic is

\[ TS = t = \frac{b_2 - (\beta_2)_0}{\operatorname{se}(b_2)} = \frac{10.21 - 15}{2.093} = -2.288 \]

and the critical value, at \(\alpha = 0.05\), is

\[ CV = -t_{\alpha,(df)} = -t_{0.05,(38)} = -1.686 \]

You can present the conclusion either as

\[ TS = -2.288 < CV = -1.686 \quad \Rightarrow \quad \text{Reject } H_0 \]

or as

\[ |TS| = 2.288 > |CV| = 1.686 \quad \Rightarrow \quad \text{Reject } H_0 \]

[Figure: \(t\) curve with TS = −2.29 and CV = −1.69 marked in the left tail]

\[ p\text{-value} = P(t < -2.288) = 0.0139 \]

In Excel you can find the tail area when \(t < 0\) in two ways. Either use the right-tail function with the absolute value of the test statistic,

T.DIST.RT(2.288, 38) = 0.0139

or use T.DIST(x, deg_freedom, cumulative):

T.DIST(−2.288, 38, 1) = 0.0139

3. Inferences Involving a Linear Combination of the Regression Parameters

3.1. Interval Estimate

To explain this topic, let us use the food expenditure model again. Suppose we want to build an interval estimate for the mean value of food expenditure in the population of households for a specific weekly income of, say, $1,000. We learned earlier that the regression line is the locus of the mean values of \(y\) for each given value of the explanatory variable \(x\), \(\mu_{y|x_i}\). Thus, in the estimated regression equation, \(\hat{y} = b_1 + b_2 x\), the predicted value of \(y\) for a given value of \(x\) is the estimated mean we are looking for.

To build an interval estimate for the mean of \(y\) for a given \(x\), \(\mu_{y|x_0}\), we consider \(\hat{y}\) as the sample statistic, the estimator, for the population parameter \(\mu_{y|x_0}\). Thus the interval estimate is

\[ L, U = \hat{y} \pm t_{\alpha/2,(df)}\operatorname{se}(\hat{y}) \]

Substituting for \(\hat{y}\), we have

\[ L, U = (b_1 + b_2 x) \pm t_{\alpha/2,(df)}\operatorname{se}(b_1 + b_2 x) \]

This clearly shows that the linear combination of \(b_1\) and \(b_2\), \((b_1 + b_2 x)\), is the estimator of the linear combination of the population parameters, \(\beta_1 + \beta_2 x\). To build the interval estimate, we need to find \(\operatorname{se}(b_1 + b_2 x)\). For this we start with \(\operatorname{var}(b_1 + b_2 x)\).
Using the properties of the variance of a linear combination of two random variables, we have

\[ \operatorname{var}(b_1 + b_2 x) = \operatorname{var}(b_1) + x^2 \operatorname{var}(b_2) + 2x\operatorname{cov}(b_1, b_2) \]

In the previous chapter we learned how to determine the covariance matrix for the coefficients of the regression equation.

\[ \begin{bmatrix} \operatorname{var}(b_1) & \operatorname{cov}(b_1, b_2) \\ \operatorname{cov}(b_1, b_2) & \operatorname{var}(b_2) \end{bmatrix} = \operatorname{var}(e)\,X^{-1} \]

\[ \operatorname{var}(e)\,X^{-1} = 8013.294 \times \begin{bmatrix} 0.23516 & -0.01072 \\ -0.01072 & 0.00055 \end{bmatrix} \]

\[ \begin{bmatrix} \operatorname{var}(b_1) & \operatorname{cov}(b_1, b_2) \\ \operatorname{cov}(b_1, b_2) & \operatorname{var}(b_2) \end{bmatrix} = \begin{bmatrix} 1884.442 & -85.903 \\ -85.903 & 4.382 \end{bmatrix} \]

Thus, using the relevant figures from the covariance matrix and \(x = 10\) ($1,000) for the income level, we have

\[ \operatorname{var}(b_1 + b_2 x) = 1884.442 + 10^2 \times 4.382 - 2 \times 10 \times 85.903 = 604.554 \]

\[ \operatorname{se}(b_1 + b_2 x) = \sqrt{604.554} = 24.588 \]

To determine the margin of error to build the interval estimate,

\[ t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]

\[ MOE = t_{\alpha/2,(df)}\operatorname{se}(b_1 + b_2 x) = 2.024 \times 24.588 = 49.78 \]

\[ \hat{y}_{x=10} = 83.416 + 10.21 \times 10 = 185.51 \]

\[ L, U = 185.51 \pm 49.78 = [135.74, 235.29] \]

We estimate with 95% confidence that the mean food expenditure of households with a weekly income of $1,000 is between $135.74 and $235.29.

3.2. Hypothesis Test

Test the hypothesis that for a household with a weekly income of $1,000, the expected expenditure on food is more than $150.

\[ H_0: \beta_1 + 10\beta_2 \le 150 \qquad H_1: \beta_1 + 10\beta_2 > 150 \]

The sample statistic, or the estimator, of \(\beta_1 + 10\beta_2\) is \(b_1 + 10b_2\). Thus, the test statistic becomes

\[ TS = \frac{b_1 + 10b_2 - 150}{\operatorname{se}(b_1 + 10b_2)} \]

Given \(b_1 = 83.416\), \(b_2 = 10.21\), and the previously computed \(\operatorname{se}(b_1 + 10b_2) = 24.588\), the test statistic is

\[ TS = \frac{83.416 + 10 \times 10.21 - 150}{24.588} = \frac{35.516}{24.588} = 1.444 \]

The critical value at \(\alpha = 0.05\) is \(t_{0.05,(38)} = 1.686\), and the p-value for the test is approximately

\[ P(t > 1.444) \approx 0.078 \]

Since \(TS = 1.444 < CV = 1.686\) (equivalently, the p-value exceeds \(\alpha = 0.05\)), we do not reject the null hypothesis at \(\alpha = 0.05\). The sample does not provide sufficient evidence that the mean food expenditure for a weekly income of $1,000 is greater than $150 a week.
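The calculations of Section 3 can be collected in a short Python sketch. It is an illustration only: the coefficient estimates, covariance figures, and critical values are the rounded numbers reported in the text (so the last digits of intermediate results may differ slightly from figures computed with unrounded values), and the function name `linear_combination` is a hypothetical name introduced here.

```python
import math

# Figures reported in the text for the food expenditure model (rounded)
b1, b2 = 83.416, 10.21
var_b1, var_b2, cov_b1_b2 = 1884.442, 4.382, -85.903
t_two_tail = 2.024   # t(0.025, 38), taken from the text
t_right = 1.686      # t(0.05, 38), taken from the text

def linear_combination(x0):
    """Point estimate and standard error of b1 + b2 * x0."""
    estimate = b1 + b2 * x0
    # var(b1 + b2*x0) = var(b1) + x0^2 var(b2) + 2 x0 cov(b1, b2)
    variance = var_b1 + x0 ** 2 * var_b2 + 2 * x0 * cov_b1_b2
    return estimate, math.sqrt(variance)

x0 = 10                                   # weekly income of $1,000, in $100 units
y_hat, se = linear_combination(x0)

# 95% interval estimate for the mean of y at x0
moe = t_two_tail * se
lower, upper = y_hat - moe, y_hat + moe

# Right-tail test of H0: beta1 + 10*beta2 <= 150 at alpha = 0.05
ts = (y_hat - 150) / se
reject = ts > t_right

print(f"y_hat = {y_hat:.2f}, se = {se:.3f}")
print(f"95% interval: [{lower:.2f}, {upper:.2f}]")
print(f"TS = {ts:.3f}, CV = {t_right}, reject H0: {reject}")
```

Changing `x0` or the hypothesized value 150 reuses the same steps for other income levels; note that the standard error must be recomputed for each `x0`, since it depends on the income level chosen.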
