CHAPTER 4
INTERVAL ESTIMATION AND HYPOTHESIS TESTING IN REGRESSION
1. Confidence Interval for the Population Parameters of Regression
1.1. A Review of the General Concept of the Confidence Interval
1.2. Confidence Interval for the Regression Slope Parameter
2. Hypothesis Tests
2.1. Review of the General Concept
2.2. Hypothesis Test for \(\beta_2\)
2.2.1. Two-Tail Test of Significance
2.2.2. Two-Tail Test of an Economic Hypothesis
2.2.3. Right-Tail Test of Significance
2.2.4. Right-Tail Test of an Economic Hypothesis
2.2.5. Left-Tail Tests
3. Inferences Involving a Linear Combination of the Regression Parameters
3.1. Interval Estimate
3.2. Hypothesis Test

1. Confidence Interval for the Population Parameters of Regression
1.1. A Review of the General Concept of the Confidence Interval
The purpose here is to build a confidence interval for the parameters of the regression model, \(\beta_1\) and \(\beta_2\). A
brief review of the methodology for building a confidence interval for the population mean \(\mu\) will help in
explaining the confidence interval (CI) for the parameters of the regression.
To build a confidence interval for \(\mu\), the sampling distribution of the sample statistic \(\bar{x}\) must be normal. The
mean of this distribution is \(E(\bar{x}) = \mu\), and the standard deviation (standard error) is \(\mathrm{se}(\bar{x}) = \sigma/\sqrt{n}\).

[Figure: normal sampling distribution of \(\bar{x}\), centered at \(\mu\)]

These properties of the sampling distribution of \(\bar{x}\) allow us to define
\[ z = \frac{\bar{x} - \mu}{\mathrm{se}(\bar{x})} \]
as a standard normal random variable. Solving for \(\bar{x}\) in this equation, we have
\[ \bar{x} = \mu + z \cdot \mathrm{se}(\bar{x}) \]
(with \(z < 0\) for \(\bar{x}\) values to the left of \(\mu\)). This expression tells us that the values of the random variable \(\bar{x}\) are
distributed around the population mean, each value deviating from \(\mu\) by a multiple of the standard error \(\mathrm{se}(\bar{x})\). If
\(\mathrm{se}(\bar{x})\) is known, then we can find the boundaries of intervals symmetric about the mean (middle intervals)
which tell us what percentage of \(\bar{x}\) values fall within these intervals. These intervals are determined by
specific positive values of \(z\):
\[ \mu - z \cdot \mathrm{se}(\bar{x}) \le \bar{x} \le \mu + z \cdot \mathrm{se}(\bar{x}) \]
For example, the middle interval containing 95% of all \(\bar{x}\) values is determined by \(z_{0.025} = 1.96\):
\[ P\big(\mu - z_{0.025}\,\mathrm{se}(\bar{x}) \le \bar{x} \le \mu + z_{0.025}\,\mathrm{se}(\bar{x})\big) = 0.95 \]
\[ P\big(\mu - 1.96\,\mathrm{se}(\bar{x}) \le \bar{x} \le \mu + 1.96\,\mathrm{se}(\bar{x})\big) = 0.95 \]
The remaining 5% of all \(\bar{x}\) values fall outside the interval. Generally, the proportion of \(\bar{x}\) values that fall
outside the interval of interest is denoted by \(\alpha\) and is known as the error probability. The \(z\) score
corresponding to a given error probability is denoted by \(z_{\alpha/2}\). The product \(z_{\alpha/2}\,\mathrm{se}(\bar{x})\) is called the
margin of sampling error, or simply the margin of error (\(MOE\)).
Working with 95% as an example, we have established that 95% of all \(\bar{x}\) values fall within
\[ MOE = \pm z_{0.025}\,\mathrm{se}(\bar{x}) \]
of the population mean. Now, using the same \(MOE\), instead of building the interval around \(\mu\), build the
interval around a randomly determined sample mean value: \(\bar{x} \pm z_{0.025}\,\mathrm{se}(\bar{x})\). Since the continuous random
variable \(\bar{x}\) can take on an infinite number of values, we can theoretically build an infinite number of such
intervals using the same \(MOE\). Ninety-five percent of such intervals would capture \(\mu\). This is the theoretical
framework for the confidence interval for any population parameter. In practice, we take only one sample of
size \(n\) and build one interval around the mean computed from this random sample. Then we state that we are
95% confident that this interval contains the population parameter.
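
The repeated-sampling argument above can be checked with a small simulation. The sketch below is illustrative only: the population values (mean 100, standard deviation 20), the sample size, the number of replications, and the function name `coverage_rate` are assumptions chosen for the example, not values from the text, and the population standard deviation is treated as known so that \(z\) applies.

```python
import random
import statistics

def coverage_rate(mu=100.0, sigma=20.0, n=25, reps=10_000, seed=1):
    """Fraction of intervals x_bar +/- z * se(x_bar) that contain mu."""
    rng = random.Random(seed)
    z = statistics.NormalDist().inv_cdf(0.975)   # z_{0.025}, about 1.96
    moe = z * sigma / n ** 0.5                   # margin of error with sigma known
    hits = 0
    for _ in range(reps):
        # Draw one sample of size n and compute its mean.
        x_bar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        if x_bar - moe <= mu <= x_bar + moe:
            hits += 1
    return hits / reps

rate = coverage_rate()
print(f"share of intervals capturing mu: {rate:.3f}")
```

The printed share should be close to 0.95, which is the sense in which a single interval is said to carry 95% confidence.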
A confidence interval is simply the point estimate of the parameter obtained from a random sample ± the
margin of error. In practice, since the population standard deviation is not known, we must use the sample
standard deviation as an estimate of the population standard deviation. But when the sample standard
deviation is used, because of the increased uncertainty arising from using another estimated value (on top of the
estimated \(\bar{x}\)), the margin of error necessarily becomes wider. This is why, in place of \(z\) in the \(MOE\) formula, we
use the random variable \(t\):
\[ MOE = t_{\alpha/2,(n-1)} \frac{s}{\sqrt{n}} \]
Suppose that from a random sample of \(n = 25\) the sample mean is \(\bar{x} = 48\) and the sample standard deviation is
\(s = 20\). The margin of error for a 95% confidence interval for \(\mu\) is:
\[ MOE = t_{0.025,(24)} \frac{20}{\sqrt{25}} = 2.064 \times 4 = 8.26 \]

1.2. Confidence Interval for the Regression Slope Parameter
In the previous chapter, the \(y\) values for a given \(x\) were taken to be normally distributed. It was also
shown that the sample regression coefficient \(b_2\) is a linear function of \(y\):
\[ b_2 = \sum w y = \frac{\sum (x - \bar{x})\,y}{\sum (x - \bar{x})^2} \]
Because the slope coefficient \(b_2\) is a linear function of the normal \(y\), it is also normally distributed. The mean
and standard deviation (standard error) of \(b_2\) are, respectively,
\[ E(b_2) = \beta_2 \quad \text{and} \quad \mathrm{se}(b_2) = \frac{\sigma_u}{\sqrt{\sum (x - \bar{x})^2}} \]

[Figure: normal sampling distribution of \(b_2\), centered at \(\beta_2\)]

Since \(b_2\) is normally distributed, the standard normal random variable \(z\) can be defined as
\[ z = \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \]
Following the same methodology as that for the confidence interval for \(\mu\), first solve for \(b_2\):
\[ b_2 = \beta_2 + z \cdot \mathrm{se}(b_2) \]
Using 95% as the benchmark, 95% of all values of the random variable \(b_2\) fall within the margin of error
\[ MOE = \pm z_{0.025} \cdot \mathrm{se}(b_2) \]
of \(\beta_2\). That is,
\[ P\big(\beta_2 - z_{0.025} \cdot \mathrm{se}(b_2) \le b_2 \le \beta_2 + z_{0.025} \cdot \mathrm{se}(b_2)\big) = 0.95 \]
Again, using the margin of error \(z_{0.025} \cdot \mathrm{se}(b_2)\) we can build an infinite number of intervals around the randomly
determined values of \(b_2\). Ninety-five percent of such intervals would contain the population parameter \(\beta_2\):
\[ P\big(b_2 - z_{0.025} \cdot \mathrm{se}(b_2) \le \beta_2 \le b_2 + z_{0.025} \cdot \mathrm{se}(b_2)\big) = 0.95 \]
Now note that in the \(MOE\) formula, the formula for the standard error \(\mathrm{se}(b_2)\) contains the unknown
population parameter \(\sigma_u\):
\[ \mathrm{se}(b_2) = \frac{\sigma_u}{\sqrt{\sum (x - \bar{x})^2}} \]
This requires us to replace \(\sigma_u\) with its sample estimator \(\mathrm{se}(e)\). The formula for the standard
error of \(b_2\) thus changes to:
\[ \mathrm{se}(b_2) = \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]
We are now using an estimated value in the \(MOE\) formula. Therefore, the margin of error becomes inherently
wider due to the added uncertainty of using \(\mathrm{se}(e)\). The \(z\)-score in the \(MOE\) formula no longer applies and is
replaced by \(t\):
\[ MOE = \pm t_{\alpha/2,(df)} \cdot \mathrm{se}(b_2) \]
To show that the term
\[ \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \]
has a \(t\) distribution, we introduce a new random variable called chi-square (\(\chi^2\)). A chi-square variable is formed as the
sum of \(m\) independent squared standard normal random variables, \(z_i^2\). Like the \(t\) distribution, each \(\chi^2\)
distribution is identified by the parameter degrees of freedom. If the \(\chi^2\) variable is formed as the sum of
\(m\) independent \(z_i^2\), then the degrees of freedom is the value \(m\).
Let \(\nu\) (nu) be a \(\chi^2\) random variable. The probability density function of \(\nu\) is:
\[ f(\nu) = \frac{\nu^{\left(\frac{m}{2} - 1\right)}}{\left(\frac{m}{2} - 1\right)!\; 2^{m/2}\, e^{\nu/2}} \]
where \(m\) denotes the degrees of freedom of the distribution and \(e\) is the base of the natural logarithm. When \(m\) is
odd, the factorial \(\left(\frac{m}{2} - 1\right)!\) is understood as the gamma function \(\Gamma(m/2)\). The
function is defined only for \(\nu \ge 0\). The mean of the distribution is \(E(\nu) = m\) and the variance is \(\sigma_\nu^2 = 2m\).
Using Excel you can plot the pdf for a given \(m\). The following shows three distributions with the indicated
degrees of freedom \(m\).

[Figure: chi-square density \(f(\nu)\) for df = 4, df = 8, and df = 12; horizontal axis \(\nu\) from 0 to 30]

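
As an alternative to a spreadsheet, the density can be evaluated directly from the formula. The sketch below is a minimal illustration, not the method used in the text: the function names `chi2_pdf` and `integrate` are hypothetical, the gamma function stands in for the factorial, and a simple midpoint rule on the interval from 0 to 200 is assumed to be accurate enough to check that the area is 1 and the mean is \(m\).

```python
import math

def chi2_pdf(v, m):
    """Chi-square density with m degrees of freedom; gamma(m/2) equals (m/2 - 1)!."""
    if v < 0:
        return 0.0
    return v ** (m / 2 - 1) / (math.gamma(m / 2) * 2 ** (m / 2) * math.exp(v / 2))

def integrate(f, a, b, steps=20_000):
    """Composite midpoint rule, adequate for a smooth density."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

for m in (4, 8, 12):
    area = integrate(lambda v: chi2_pdf(v, m), 0.0, 200.0)
    mean = integrate(lambda v: v * chi2_pdf(v, m), 0.0, 200.0)
    print(f"df = {m:2d}: area = {area:.4f}, mean = {mean:.3f}")
```

Evaluating `chi2_pdf` on a grid of \(\nu\) values for each \(m\) gives the points needed to redraw the three curves.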
Now recall that the error term \(u\) in the regression model is assumed to be normally distributed with a mean of
0 and a standard deviation of \(\sigma_u\). Thus,
\[ z_i = \frac{u_i}{\sigma_u} \]
Squaring and summing over all \(i\), we have a \(\chi^2\) random variable \(\nu\):
\[ \nu = \sum z_i^2 = \frac{\sum u_i^2}{\sigma_u^2} \]
Since the \(u_i\) are unknown, they are replaced by the regression residuals \(e_i\):
\[ \nu = \frac{\sum e_i^2}{\sigma_u^2} \]
Recall that
\[ \operatorname{var}(e) = \frac{\sum e^2}{n - 2} \]
Thus \(\sum e^2 = (n - 2)\operatorname{var}(e)\). Substituting for \(\sum e^2\) in the numerator of \(\nu\), we have
\[ \nu = \frac{(n - 2)\operatorname{var}(e)}{\sigma_u^2} \]
The degrees of freedom of the \(\chi^2\) random variable \(\nu\) is \(n - 2\) because only \(n - 2\) residuals are independent (see footnote 1).
Dividing both sides of the last equation by \(n - 2\),
\[ \frac{\nu}{n - 2} = \frac{\operatorname{var}(e)}{\sigma_u^2} \]
and taking the square root of both sides, we have:
\[ \sqrt{\frac{\nu}{n - 2}} = \frac{\mathrm{se}(e)}{\sigma_u} \]
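
The claim that the residual-based \(\nu\) has \(n - 2\) degrees of freedom can be illustrated by simulation, since a \(\chi^2\) variable has mean equal to its degrees of freedom and variance equal to twice that. The sketch below is only a plausibility check under assumed values: the parameters (intercept 2.0, slope 0.5, error standard deviation 3.0), the ten fixed \(x\) values, and the function name `scaled_sse` are hypothetical and do not come from the text.

```python
import random
import statistics

def scaled_sse(x, beta1, beta2, sigma_u, rng):
    """Simulate y, fit the least-squares line, and return sum(e^2) / sigma_u^2."""
    y = [beta1 + beta2 * xi + rng.gauss(0.0, sigma_u) for xi in x]
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b2 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx   # slope estimate
    b1 = y_bar - b2 * x_bar                                     # intercept estimate
    sse = sum((yi - b1 - b2 * xi) ** 2 for xi, yi in zip(x, y)) # sum of squared residuals
    return sse / sigma_u ** 2

rng = random.Random(7)
x = [float(i) for i in range(1, 11)]            # n = 10 fixed x values
draws = [scaled_sse(x, 2.0, 0.5, 3.0, rng) for _ in range(20_000)]
nu_mean = statistics.fmean(draws)
nu_var = statistics.variance(draws)
print(f"mean of nu: {nu_mean:.2f} (n - 2 = {len(x) - 2})")
print(f"variance of nu: {nu_var:.2f} (2(n - 2) = {2 * (len(x) - 2)})")
```

With \(n = 10\), the simulated mean should be near 8 rather than 10, reflecting the two degrees of freedom used up in estimating the intercept and the slope.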
Now we introduce the random variable \(t\) (the Student \(t\) distribution) again. Theoretically, the \(t\) distribution
is formed as the ratio of \(z\) to \(\sqrt{\nu/(n - 2)}\), with \(z\) and \(\nu\) independent:
\[ t = \frac{z}{\sqrt{\nu/(n - 2)}} \]
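Substituting in the numerator and the denominator, we have
\[ t = \frac{\dfrac{b_2 - \beta_2}{\sigma_u / \sqrt{\sum (x - \bar{x})^2}}}{\dfrac{\mathrm{se}(e)}{\sigma_u}} = \frac{\dfrac{b_2 - \beta_2}{\mathrm{se}(b_2)}}{\dfrac{\mathrm{se}(e)}{\sigma_u}} \]
where \(\mathrm{se}(b_2)\) in the numerator is the standard error based on \(\sigma_u\). The unknown \(\sigma_u\) cancels, leaving
\[ t = \frac{b_2 - \beta_2}{\mathrm{se}(e) / \sqrt{\sum (x - \bar{x})^2}} \]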
The denominator of the \(t\) equation is the standard error of the slope coefficient \(b_2\) estimated from the sample
data:
\[ \mathrm{se}(b_2) = \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]

Footnote 1: Whenever we want to find the average of squared deviations using the sample data, we lose one degree of freedom for
each estimated parameter in the squared deviation. For example, to find the sample variance of \(y\), we compute the average of
the squared deviations of \(y\) from \(\bar{y}\), \(\sum (y - \bar{y})^2\). Since the random variable \(\bar{y}\), the estimator of the parameter \(\mu\), is used in
the calculation of the squared deviations, only \(n - 1\) of the squared deviations are independent. We have lost one degree of
freedom. Similarly, to compute the average of \(\sum e^2 = \sum (y - \hat{y})^2\), since we obtained \(\hat{y}\) by estimating the two parameters
\(\beta_1\) and \(\beta_2\), we lose two degrees of freedom; we have only \(n - 2\) independent squared deviations.

Thus,
\[ t = \frac{b_2 - \beta_2}{\mathrm{se}(b_2)} \]
Solving for \(b_2\) in this equation, we have
\[ b_2 = \beta_2 + t \cdot \mathrm{se}(b_2) \]
in which the term \(t \cdot \mathrm{se}(b_2)\) is the margin of sampling error (\(MOE\)). Again, using 95% as the benchmark
probability, we have:
\[ P\big(\beta_2 - t_{0.025,(df)}\,\mathrm{se}(b_2) \le b_2 \le \beta_2 + t_{0.025,(df)}\,\mathrm{se}(b_2)\big) = 0.95 \]
The degrees of freedom here are the same as those of the \(\chi^2\) distribution, \(n - 2\).
Since \(\beta_2\) is unknown, its estimator is used to construct a confidence interval for the slope parameter. Using
the \(MOE\) determined above, the lower and upper boundaries of a 95% confidence interval for \(\beta_2\) are:
\[ L, U = b_2 \pm t_{0.025,(df)}\,\mathrm{se}(b_2) \]
\[ L, U = b_2 \pm t_{0.025,(df)} \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} \]
Example: Household food expenditure and weekly income
The data and other calculations are in the Excel file CH4 DATA.xlsx (“food” tab). The data show the weekly
food expenditure of 40 households in dollars and weekly income in hundreds of dollars ($100). How does
weekly food expenditure respond to changes in weekly income?
\[ FOODEXP = \beta_1 + \beta_2\, INCOME + u \]
Let \(x\) = weekly income and \(y\) = weekly food expenditure. Build a 95% confidence interval for the
population slope parameter \(\beta_2\).
\[ L, U = b_2 \pm t_{0.025,(df)}\,\mathrm{se}(b_2) \]
The estimated regression equation is:
\[ \hat{y} = 83.416 + 10.210x \]
The other ingredients to build the interval are:
\[ \mathrm{se}(b_2) = \frac{\mathrm{se}(e)}{\sqrt{\sum (x - \bar{x})^2}} = \frac{89.517}{\sqrt{1828.788}} = 2.093 \]
\[ MOE = t_{0.025,(n-2)}\,\mathrm{se}(b_2) = 2.024 \times 2.093 = 4.238 \]
\[ L, U = b_2 \pm MOE = 10.210 \pm 4.238 = [5.97, 14.45] \]
We estimate, with 95% confidence, that for each additional $100 of weekly income, household food expenditure
rises by between $5.97 and $14.45.
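
The interval arithmetic can be reproduced in a few lines. This is a sketch using the rounded summary figures quoted in the example; the critical value 2.024 is taken from the text rather than computed, and the variable names are illustrative. Because the inputs are rounded, the last digit can differ slightly from a spreadsheet that carries full precision.

```python
import math

se_e = 89.517      # standard error of the regression, se(e)
sxx = 1828.788     # sum of squared deviations of x from its mean
b2 = 10.210        # estimated slope
t_crit = 2.024     # t_{0.025, 38}, as quoted in the text

se_b2 = se_e / math.sqrt(sxx)        # standard error of the slope
moe = t_crit * se_b2                 # margin of error
lower, upper = b2 - moe, b2 + moe    # interval boundaries
print(f"se(b2) = {se_b2:.3f}, MOE = {moe:.3f}, interval = [{lower:.2f}, {upper:.2f}]")
```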
2. Hypothesis Tests
2.1. Review of the General Concept
Again, as background, let us review the test of hypothesis for the population mean \(\mu\). Using a two-tailed test
case, suppose we are testing the null hypothesis that \(\mu = 100\) at a 5% level of significance (allowing 5% as the
probability of a Type I error, that is, rejecting the null when it is true).
\[ H_0: \mu = 100 \]
\[ H_1: \mu \ne 100 \]
To test this hypothesis, suppose we take a random sample of \(n = 25\), which yields \(\bar{x} = 109\) and \(s = 20\). The
test statistic for this test of hypothesis is:
\[ t = \frac{\bar{x} - \mu_0}{\mathrm{se}(\bar{x})} \]
where
\[ \mathrm{se}(\bar{x}) = \frac{s}{\sqrt{n}} = \frac{20}{\sqrt{25}} = 4 \]
\[ t = \frac{109 - 100}{4} = 2.25 \]
We compare the test statistic \(t = 2.250\) to the critical value \(t_{\alpha/2,(n-1)} = t_{0.025,24} = 2.06\). Since the test
statistic exceeds the critical value, we reject the null hypothesis that \(\mu = 100\).
Another method to determine whether to reject the null hypothesis is to compute the p-value of the test. The
p-value in a two-tailed test is the sum of the two tail areas under the t-curve corresponding to the \(\pm t\) test
statistic:
\[ 2 \times P(t > 2.25) = 2 \times 0.0169 = 0.0338 \]

[Figure: t distribution with df = 24; tail areas of 0.0169 to the left of -2.250 and to the right of 2.250]

If the p-value is less than \(\alpha\), reject the null hypothesis. Here 0.0338 < 0.05, so the null hypothesis is rejected.
The combined two-tail area under the t-curve (0.0338) is obtained
using the Excel function =T.DIST.2T(x, deg_freedom), where x = 2.25 and deg_freedom = 24.
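
The same tail area can be approximated without a spreadsheet by integrating the \(t\) density numerically. The sketch below is one possible approach, not the method used in the text: the helper names `t_pdf` and `t_upper_tail` are hypothetical, and Simpson's rule with 2,000 subintervals is assumed to be accurate enough for this purpose.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(x, df, steps=2_000):
    """P(T > x) for x >= 0: one half minus the Simpson's-rule area on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

x_bar, mu_0, s, n = 109.0, 100.0, 20.0, 25
t_stat = (x_bar - mu_0) / (s / math.sqrt(n))        # test statistic
p_two_tail = 2 * t_upper_tail(abs(t_stat), n - 1)   # sum of both tail areas
print(f"t = {t_stat:.3f}, two-tail p-value = {p_two_tail:.4f}")
```

The result can be compared with the two-tail area of 0.0338 reported above.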
2.2. Hypothesis Test for \(\beta_2\)
2.2.1. Two-Tail Test of Significance
Generally, but not always, in a hypothesis test for the slope parameter, we test the null hypothesis that the
population slope is zero. This hypothesis implies that there is no relationship between \(x\) and \(y\). To support our
theory that there is a relationship between \(x\) and \(y\), we must reject the null “beyond a reasonable doubt”.
Thus, we start the test of hypothesis for \(\beta_2\) with:
\[ H_0: \beta_2 = 0 \]
\[ H_1: \beta_2 \ne 0 \]
Our sample statistic is \(b_2\) with a standard error of \(\mathrm{se}(b_2) = \mathrm{se}(e)/\sqrt{\sum (x - \bar{x})^2}\). The test statistic is then:
\[ t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{b_2}{\mathrm{se}(b_2)} \]
since by the null hypothesis \(\beta_2 = 0\). Using a level of significance of \(\alpha\), the critical value is \(t_{\alpha/2,(n-2)}\).
In the food expenditure example we obtained \(b_2 = 10.210\). Is this figure significantly different from zero? The
test statistic here is:
\[ TS = t = \frac{10.21}{2.093} = 4.877 \]
Using \(\alpha = 0.05\), the critical \(t\) with \(df = n - 2 = 38\) is
\[ CV = t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]
Since \(TS = 4.877 > CV = 2.024\), we reject \(H_0: \beta_2 = 0\) and conclude that \(b_2 = 10.210\) is significantly different
from zero. The statistical relationship between income and food expenditure is significant.
p-value
Rather than comparing \(TS\) to \(CV\), standard statistical reports provide the probability value, commonly
referred to as the p-value, of the test as the decision rule. The p-value is simply the tail area corresponding to
the test statistic under the t curve, which is then compared to the given \(\alpha\) for the test. For a two-tail test, to
compare with \(\alpha\), the sum of the two tail areas is used. To find the tail areas for a two-tail test, use the Excel
function T.DIST.2T(x, deg_freedom).
\[ p\text{-value} = 2 \times P(t > |TS|) \]
\[ p\text{-value} = 2 \times P(t > 4.877) = 2 \times 0.0000097 = 0.0000195 \]
T.DIST.2T(4.877, 38) = 0.0000195
The logic of the p-value is simple. You can think of the p-value as the probability of a Type I error as revealed by the
test. More precisely, it is the probability, when the null hypothesis is true, of obtaining a test statistic at least as
extreme as the one observed. If the revealed probability exceeds the benchmark probability \(\alpha\), then the probability of a Type I error is
higher than what we would like it to be. Therefore, we do not reject the null hypothesis. But if the revealed
probability is less than the benchmark probability of a Type I error, then we reject the null hypothesis.
2.2.2. Two-Tail Test of an Economic Hypothesis
We want to test the hypothesis, at the 5 percent level of significance, that households spend $7.50 of each
additional $100 of weekly income on food. The hypotheses for the test are:
\[ H_0: \beta_2 = 7.50 \]
\[ H_1: \beta_2 \ne 7.50 \]
The test statistic is
\[ TS = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 7.50}{2.093} = 1.294 \]
and the critical value is
\[ CV = t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]
Since \(TS = 1.294 < CV = 2.024\), we do not reject the null hypothesis that the food expenditure per $100 of
additional income is equal to $7.50. The sample data are consistent with the null hypothesis.
\[ p\text{-value} = 2 \times P(t > 1.294) = 2 \times 0.10166 = 0.20332 \]
T.DIST.2T(1.294, 38) = 0.20332
2.2.3. Right-Tail Test of Significance
Economic theory suggests that food is a normal good; that is, expenditure on food rises with an increase in
income. Thus, we expect that in the regression model \(\beta_2 > 0\). The sample data provided an estimate of \(\beta_2\),
\(b_2 = 10.21\), which is greater than zero. The objective of the test is, however, to show that this estimated
quantity is significantly greater than zero. To show this, we must reject the null hypothesis
\(H_0: \beta_2 \le 0\). The hypotheses for the test are then written as
\[ H_0: \beta_2 \le 0 \]
\[ H_1: \beta_2 > 0 \]
Since the direction of the strict inequality in the alternative hypothesis, “>”, is to the right, this is a right-tail test. The test statistic is
\[ TS = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 0}{2.093} = 4.88 \]
and the critical value, at \(\alpha = 0.05\), is
\[ CV = t_{\alpha,(df)} = t_{0.05,(38)} = 1.686 \]
In the diagram below, clearly, \(TS = 4.88 > CV = 1.69\). Therefore, we reject the null hypothesis. The data
indicate that food is a normal good.

[Figure: right tail of the t distribution showing CV = 1.69 and TS = 4.88]

\[ p\text{-value} = P(t > 4.877) = 0.0000097 \]
T.DIST.RT(4.877, 38) = 0.0000097
2.2.4. Right-Tail Test of an Economic Hypothesis
We want to test the hypothesis that household food expenditure exceeds $5.50 for each additional $100 in
weekly income. The purpose of the test is to see whether the construction of a new supermarket in a
residential area is economically justified (profitable). If the data confirm the hypothesis, then the
supermarket will be constructed. The estimated coefficient is \(b_2 = 10.21\), which is greater than $5.50. The
question, however, is whether $10.21 is significantly greater than $5.50. Thus, the null and alternative hypotheses are:
\[ H_0: \beta_2 \le 5.50 \]
\[ H_1: \beta_2 > 5.50 \]
The test statistic is
\[ TS = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 5.50}{2.093} = 2.25 \]
and the critical value, at \(\alpha = 0.05\), is
\[ CV = t_{\alpha,(df)} = t_{0.05,(38)} = 1.686 \]
Based on \(\alpha = 0.05\), since \(TS = 2.25 > CV = 1.69\), we reject \(H_0: \beta_2 \le 5.50\). But if we choose a smaller \(\alpha\), say
\(\alpha = 0.01\), then
\[ CV = t_{0.01,(38)} = 2.429 \]
which gives \(TS = 2.25 < CV = 2.429\), and we would not reject \(H_0\). Note the diagrams below.

[Figure: two right-tail diagrams. Left panel, \(\alpha = 0.05\): CV = 1.69, TS = 2.25, reject \(H_0\). Right panel, \(\alpha = 0.01\): TS = 2.25, CV = 2.43, do not reject \(H_0\).]

\[ p\text{-value} = P(t > 2.25) = 0.0152 \]
T.DIST.RT(2.25, 38) = 0.0152
Note that if \(\alpha = 0.05\), then p-value = 0.0152 < \(\alpha\) = 0.05, and we would reject \(H_0\). But if \(\alpha = 0.01\), then
p-value = 0.0152 > \(\alpha\) = 0.01, and we would not reject \(H_0\).
2.2.5. Left-Tail Tests
Continuing with the food expenditure model, we want to test the hypothesis that household food expenditure
is below $15 for each additional $100 in weekly income. The word “below” indicates that we are testing \(\beta_2 < 15\)
against \(\beta_2 \ge 15\). This makes it a left-tail test.
\[ H_0: \beta_2 \ge 15 \]
\[ H_1: \beta_2 < 15 \]
The test statistic is
\[ TS = t = \frac{b_2 - (\beta_2)_0}{\mathrm{se}(b_2)} = \frac{10.21 - 15}{2.093} = -2.288 \]
and the critical value, at \(\alpha = 0.05\), is
\[ CV = -t_{\alpha,(df)} = -t_{0.05,(38)} = -1.686 \]
You can present the conclusion either as
\[ TS = -2.288 < CV = -1.686 \quad \text{Reject } H_0 \]
or as
\[ |TS| = 2.288 > |CV| = 1.686 \quad \text{Reject } H_0 \]

[Figure: left tail of the t distribution showing TS = -2.29 and CV = -1.69]

\[ p\text{-value} = P(t < -2.288) = 0.0139 \]
In Excel you can find the tail area when \(t < 0\) in two ways. Either use the right-tail function with the absolute value of \(t\),
T.DIST.RT(2.288, 38) = 0.0139
or use the cumulative function T.DIST(x, deg_freedom, cumulative):
T.DIST(−2.288, 38, 1) = 0.0139
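
The five slope tests of this section differ only in the hypothesized value and the tail used, so they can be collected in one routine. The sketch below is illustrative: the function names `t_pdf`, `t_cdf`, and `slope_test` are hypothetical, the tail areas come from a Simpson's-rule approximation rather than from the Excel functions, and the rounded inputs 10.21 and 2.093 are used, so the last digits may differ slightly from the figures in the text.

```python
import math

def t_pdf(x, df):
    """Density of Student's t with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2_000):
    """P(T <= x) by Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3                      # P(0 < T < |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

def slope_test(b2, se_b2, beta_0, df, tail):
    """Return (test statistic, p-value) for tail in {'two', 'right', 'left'}."""
    ts = (b2 - beta_0) / se_b2
    if tail == "two":
        p = 2 * (1 - t_cdf(abs(ts), df))
    elif tail == "right":
        p = 1 - t_cdf(ts, df)
    elif tail == "left":
        p = t_cdf(ts, df)
    else:
        raise ValueError("tail must be 'two', 'right', or 'left'")
    return ts, p

cases = [(0.0, "two"), (7.5, "two"), (0.0, "right"), (5.5, "right"), (15.0, "left")]
results = {}
for beta_0, tail in cases:
    ts, p = slope_test(10.21, 2.093, beta_0, 38, tail)
    results[(beta_0, tail)] = (ts, p)
    verdict = "reject H0" if p < 0.05 else "do not reject H0"
    print(f"H0 value {beta_0:5.2f}, {tail:>5}-tail: TS = {ts:6.3f}, p = {p:.7f}, {verdict}")
```

Changing the 0.05 threshold to 0.01 in the loop shows how the decision for the $5.50 hypothesis depends on the chosen level of significance.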
3. Inferences Involving a Linear Combination of the Regression Parameters
3.1. Interval Estimate
To explain this topic, let us use the food expenditure model again. Suppose we want to build an interval
estimate for the mean value of food expenditure in the population of households for a specific weekly income
of, say, $1,000. We learned earlier that the regression line is the locus of the mean values of \(y\) for each given
value of the explanatory variable \(x\), \(\mu_{y|x_i}\). Thus, in the estimated regression equation, \(\hat{y} = b_1 + b_2 x\), the
predicted value of \(y\) for a given value of \(x\) is the estimated mean we are looking for.
To build an interval estimate for the mean of \(y\) for a given \(x\), \(\mu_{y|x_0}\), we consider \(\hat{y}\) as the sample statistic, the
estimator, for the population parameter \(\mu_{y|x_0}\). Thus the interval estimate is
\[ L, U = \hat{y} \pm t_{\alpha/2,(df)}\,\mathrm{se}(\hat{y}) \]
Substituting for \(\hat{y}\), we have
\[ L, U = (b_1 + b_2 x) \pm t_{\alpha/2,(df)}\,\mathrm{se}(b_1 + b_2 x) \]
This clearly shows that the linear combination of \(b_1\) and \(b_2\), \((b_1 + b_2 x)\), is the estimator of the linear
combination of the population parameters, \(\beta_1 + \beta_2 x\). To build the interval estimate, we need to find
\(\mathrm{se}(b_1 + b_2 x)\). For this we start with \(\operatorname{var}(b_1 + b_2 x)\). Using the properties of the variance of a linear combination of
two random variables, we have
\[ \operatorname{var}(b_1 + b_2 x) = \operatorname{var}(b_1) + x^2 \operatorname{var}(b_2) + 2x \operatorname{cov}(b_1, b_2) \]
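
The variance formula above is an instance of the general rule for a linear combination of two random variables, with \(x\) treated as a fixed constant. A short sketch of the step, where the constants \(a\) and \(c\) are introduced here only for illustration:

```latex
\begin{aligned}
\operatorname{var}(a\,b_1 + c\,b_2)
  &= a^2 \operatorname{var}(b_1) + c^2 \operatorname{var}(b_2) + 2ac \operatorname{cov}(b_1, b_2) \\
\text{with } a = 1,\ c = x: \quad
\operatorname{var}(b_1 + b_2 x)
  &= \operatorname{var}(b_1) + x^2 \operatorname{var}(b_2) + 2x \operatorname{cov}(b_1, b_2)
\end{aligned}
```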
In the previous chapter we learned how to determine the covariance matrix for the coefficients of the
regression equation.
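\[ \begin{bmatrix} \operatorname{var}(b_1) & \operatorname{cov}(b_1, b_2) \\ \operatorname{cov}(b_1, b_2) & \operatorname{var}(b_2) \end{bmatrix} = \operatorname{var}(e)\,X^{-1} \]
\[ \operatorname{var}(e)\,X^{-1} = 8013.294 \times \begin{bmatrix} 0.23516 & -0.01072 \\ -0.01072 & 0.00055 \end{bmatrix} \]
\[ \begin{bmatrix} \operatorname{var}(b_1) & \operatorname{cov}(b_1, b_2) \\ \operatorname{cov}(b_1, b_2) & \operatorname{var}(b_2) \end{bmatrix} = \begin{bmatrix} 1884.442 & -85.903 \\ -85.903 & 4.382 \end{bmatrix} \]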
Thus, using the relevant figures from the covariance matrix and 10 ($1,000) for the income level \(x\), we have
\[ \operatorname{var}(b_1 + b_2 x) = 1884.442 + 10^2 \times 4.382 - 2 \times 10 \times 85.903 = 604.554 \]
\[ \mathrm{se}(b_1 + b_2 x) = \sqrt{604.554} = 24.588 \]
To determine the margin of error to build the interval estimate,
\[ t_{\alpha/2,(df)} = t_{0.025,(38)} = 2.024 \]
\[ MOE = t_{\alpha/2,(df)}\,\mathrm{se}(b_1 + b_2 x) = 2.024 \times 24.588 = 49.78 \]
\[ \hat{y}_{x=10} = 83.416 + 10.21 \times 10 = 185.51 \]
\[ L, U = 185.51 \pm 49.78 = [135.74, 235.29] \]
We estimate with 95% confidence that the mean food expenditure of households with a weekly income of
$1,000 is between $135.74 and $235.29.
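
The interval for the mean response can be recomputed from the covariance matrix entries. The following is a sketch under stated assumptions: the function name `mean_response_interval` is hypothetical, the critical value 2.024 is taken from the text, and the inputs are the rounded figures shown above, so results may differ in the second decimal place from spreadsheet values carried at full precision.

```python
import math

def mean_response_interval(b1, b2, var_b1, var_b2, cov_b1_b2, x0, t_crit):
    """Interval estimate for beta1 + beta2 * x0 from the coefficient covariance matrix."""
    y_hat = b1 + b2 * x0
    var_y_hat = var_b1 + x0 ** 2 * var_b2 + 2 * x0 * cov_b1_b2
    se_y_hat = math.sqrt(var_y_hat)
    moe = t_crit * se_y_hat
    return y_hat, se_y_hat, y_hat - moe, y_hat + moe

y_hat, se_y_hat, lower, upper = mean_response_interval(
    b1=83.416, b2=10.21,
    var_b1=1884.442, var_b2=4.382, cov_b1_b2=-85.903,
    x0=10,            # income of $1,000, measured in hundreds of dollars
    t_crit=2.024,     # t_{0.025, 38}, as quoted in the text
)
print(f"y_hat = {y_hat:.2f}, se = {se_y_hat:.3f}, interval = [{lower:.2f}, {upper:.2f}]")
```

Calling the function with a different `x0` shows how the standard error, and therefore the width of the interval, changes with the income level.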
3.2. Hypothesis Test
Test the hypothesis that for a household with a weekly income of $1,000, the expected expenditure on food is
more than $150.
\[ H_0: \beta_1 + 10\beta_2 \le 150 \]
\[ H_1: \beta_1 + 10\beta_2 > 150 \]
The sample statistic, or the estimator of \(\beta_1 + 10\beta_2\), is \(b_1 + 10b_2\). Thus, the test statistic becomes
\[ TS = \frac{b_1 + 10b_2 - 150}{\mathrm{se}(b_1 + 10b_2)} \]
Given \(b_1 = 83.416\), \(b_2 = 10.21\), and the previously computed \(\mathrm{se}(b_1 + 10b_2) = 24.588\), the test statistic is
\[ TS = \frac{83.416 + 10 \times 10.21 - 150}{24.588} = \frac{35.516}{24.588} = 1.444 \]
The p-value for the test, with \(df = 38\), is
\[ P(t > 1.444) \approx 0.08 \]
Since the p-value exceeds \(\alpha = 0.05\) (equivalently, \(TS = 1.444 < CV = t_{0.05,(38)} = 1.686\)), we do not reject the null
hypothesis at \(\alpha = 0.05\) or at \(\alpha = 0.01\). The sample does not provide sufficient evidence that the mean food expenditure
for a weekly income of $1,000 is greater than $150 a week.