Hypothesis Tests in Multiple Regression Analysis
Multiple regression model: Y = β0 + β1X1 + β2X2 + … + βp-1Xp-1 + ε, where p − 1 is the number of independent variables in the model (so p counts the regression parameters, including the intercept).
I. Testing for Significance of the Overall Regression Model
Question of interest: Is the regression relation significant? Are one or more of the
independent variables in the model useful in explaining variability in Y and/or predicting
future values of Y?
Null Hypothesis: The initial assumption is that there is no relation, which is expressed as: H0: β1 = β2 = … = βp-1 = 0.
Alternative Hypothesis: At least one of the independent variables IS useful in explaining/predicting Y, expressed as: H1: at least one βi ≠ 0.
Test Statistic: F = [SSR/(p − 1)] / [SSE/(n − p)] = MSR/MSE, which is found on any regression printout.
Sampling Distribution: Under the null hypothesis the statistic follows an F-distribution
with p-1 and n - p degrees of freedom. Reject in the upper tail of this distribution.
Interpreting Results: If we reject H0 we conclude that the relation is significant/does have
explanatory or predictive power. If we fail to reject, we conclude that there isn't any
evidence of explanatory power, which suggests that there is no point in using this model.
Text Reference: Section 6.5
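
As a quick illustration, here is a minimal Python sketch of this test (scipy is assumed; Minitab or any other package reports the same quantities). The numbers are read off the full-model ANOVA table in the example at the end of this handout.

    from scipy.stats import f

    # Quantities read off the ANOVA table of a regression printout
    # (here: the full-model printout in the example below).
    SSR, SSE = 2114879, 1317461   # regression and error sums of squares
    n, p = 50, 7                  # 50 states; 7 parameters (intercept + 6 X's)

    MSR = SSR / (p - 1)           # about 352480
    MSE = SSE / (n - p)           # about 30639
    F = MSR / MSE                 # about 11.50, matching the printout
    p_value = f.sf(F, p - 1, n - p)   # upper-tail area; essentially 0 here

    print(F, p_value)

Since the p-value is far below any conventional significance level, we would reject H0 and conclude the relation is significant.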
II. Testing for the Significance/Contribution of a Single Independent Variable in the Model
Question of interest: Suppose we have a significant multiple regression model. In this
model, does a single independent variable of interest, say Xk, contribute to
explaining/predicting Y? Or, would the model be just as useful without the inclusion of
this variable?
Null Hypothesis: The initial assumption is that the variable does not contribute in this
model, which is expressed as: H0: βk = 0.
Alternative Hypothesis: The alternative is that the variable does contribute and should remain in the model: H1: βk ≠ 0.
Test Statistic: t = (bk − 0)/s{bk}, which is found on any regression printout.
Sampling Distribution: Under the null hypothesis the statistic follows a t-distribution
with n - p degrees of freedom. Reject in the upper or lower tail of this distribution.
Interpreting Results: If we reject H0 we conclude that the independent variable Xk does
have explanatory or predictive power in our model. Note that this conclusion is model-specific, in that it might change if the model included a different set of independent
variables. If we fail to reject, we conclude that there isn't any evidence of explanatory
power. That suggests that there is no point in having Xk in this model and we should
consider dropping it and re-running the regression analysis.
Text Reference: Section 6.6.
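
A sketch of the same computation in Python (again assuming scipy), using the Unemp*Police line of the full-model printout in the example below:

    from scipy.stats import t

    # One row of the coefficient table (Unemp*Police, full model below).
    b_k, s_bk = 8.592, 3.781      # estimate and its standard error
    n, p = 50, 7                  # error degrees of freedom = n - p = 43

    t_stat = (b_k - 0) / s_bk                 # about 2.27
    p_value = 2 * t.sf(abs(t_stat), n - p)    # two-sided; about 0.028

    print(t_stat, p_value)

Both values match the T and P columns of the printout.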
III. A General Test for the Value of βj
Question of interest: Does βj equal a specified value of interest, say βj*? Or, do we have evidence to state that it is not equal to βj*? (A two-sided test situation is assumed. Make the obvious adjustment for a one-sided test.)
Null Hypothesis: H0: βj = βj*
Alternative Hypothesis: H1: βj ≠ βj*
Test Statistic: t = (bj − βj*)/s{bj}, which is NOT found on the regression printout. You will, however, find bj and s{bj} on the printout.
Sampling Distribution: Under the null hypothesis the statistic follows a t-distribution
with n - p degrees of freedom. Reject in the upper or lower tail of this distribution,
making the appropriate adjustment for one-sided tests.
Interpreting Results: If we reject H0 we conclude that we have evidence that βj is not equal to the specified βj* value (we can refute the claim that βj = βj*). Otherwise, we can't refute the claim.
Text Reference: I don’t think it’s in there. This is an obvious extension of the test for a
zero slope presented in Section 6.6.
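
The printout gives bj and s{bj}, so the statistic takes one line to compute. In the Python sketch below, the hypothesized value βj* = 25 is purely illustrative (it is not a claim made anywhere in this handout); the estimate and standard error are the Police line of the reduced-model printout in the example.

    from scipy.stats import t

    b_j, s_bj = 31.291, 5.370     # Police coefficient, reduced model below
    beta_star = 25.0              # hypothetical test value, for illustration only
    n, p = 50, 4                  # reduced model: intercept + 3 X's

    t_stat = (b_j - beta_star) / s_bj         # about 1.17
    p_value = 2 * t.sf(abs(t_stat), n - p)    # two-sided; about 0.25

    print(t_stat, p_value)

With a p-value this large we could not refute the claim that βj = 25 in this illustration.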
IV. Testing for the significance/contribution of a subset of independent variables in
the regression model. (The “general linear model” test.)
Question of interest: In the multiple regression model
Y = β0 + β1X1 + … + βg-1Xg-1 + βgXg + … + βp-1Xp-1 + ε (full model)
does the subset of independent variables Xg, …, Xp-1 contribute to explaining/predicting Y? Or, would we do just as well if these variables were dropped and we reduced the model to
Y = β0 + β1X1 + … + βg-1Xg-1 + ε (reduced model)?
Null Hypothesis: The initial assumption is that the subset does not contribute to the
model's explanatory power, which is expressed as: H0: βg = … = βp-1 = 0.
Alternative Hypothesis: At least one of the independent variables in the subset IS useful in explaining/predicting Y, expressed as: H1: at least one βi ≠ 0, i = g, …, p − 1.
Test Statistic: You need to run two regressions, one for the full model and one for the
reduced model as described above. Then calculate:
F = [(SSR_Full − SSR_Reduced)/(p − g)] / MSE_Full = [Change in SSR / Number of Variables Dropped] / MSE_Full
Sampling Distribution: Under the null hypothesis the statistic follows an F-distribution
with p - g and n – p degrees of freedom. Reject in the upper tail of this distribution.
Interpreting Results: If we reject H0 we conclude that at least one independent variable in
the subset (Xg, …, Xp-1) does have explanatory or predictive power, so we don't reduce
the model by dropping out this subset. If we fail to reject, we conclude we have no
evidence that inclusion of the subset of independent variables in the model contributes to
explanatory power. This suggests that you may as well drop them out and re-run the
regression using the reduced model.
Comment: If p - g = 1, i.e. if the subset consists of a single independent variable, then
this F-test is equivalent to the two-sided t-test presented in Part II. In fact, t² = F. You
might recall a similar result from simple regression analysis.
Text Reference: Section 2.8. I’m extending the concept to multiple regression.
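
A small helper function makes the bookkeeping explicit. This is a Python sketch (scipy assumed); the function name and argument names are my own, not from the text.

    from scipy.stats import f

    def general_linear_test(ssr_full, ssr_reduced, mse_full, p, g, n):
        # Partial F test for dropping X_g, ..., X_{p-1} from the full model:
        # F = [change in SSR / number of variables dropped] / MSE(full)
        F = ((ssr_full - ssr_reduced) / (p - g)) / mse_full
        p_value = f.sf(F, p - g, n - p)   # upper-tail area
        return F, p_value

This function is applied to the example printouts at the end of the handout.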
Example Application
Data Set: Crime Rates for 50 States
Variables:
Y = Auto Theft = (duh) a measure of the auto theft rate
Independent Variables (X’s)
Unemployment = % of civilian labor force unemployed in 1991
Police = Police protection per 10,000 population
In-School = Percent of persons 5-17 years old in 1989 enrolled in public elementary
and secondary schools
Unemp*Police, the product of Unemployment and Police
Unemp*In-School, the product of Unemployment and In-School
Police*In-School, the product of Police and In-School
Regression Analysis: Full Model
The regression equation is
Auto Theft = 1358 - 306 Unemployment + 25 Police - 7.5 In-School
             + 8.59 Unemp*Police + 1.65 Unemp*In-School - 0.55 Police*In-School
Predictor      Coef   StDev      T      P
Constant       1358    4343   0.31  0.756
Unemploy     -306.2   448.8  -0.68  0.499
Police         25.4   109.8   0.23  0.818
In-Schoo      -7.46   45.76  -0.16  0.871
Unemp*Po      8.592   3.781   2.27  0.028
Unemp*In      1.650   4.690   0.35  0.727
Police*I     -0.550   1.103  -0.50  0.621

S = 175.0   R-Sq = 61.6%   R-Sq(adj) = 56.3%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       6  2114879  352480  11.50  0.000
Residual Error  43  1317461   30639
Total           49  3432340

Regression Analysis: Reduced Model (Interaction terms removed from model)
The regression equation is
Auto Theft = 661 + 48.0 Unemployment + 31.3 Police - 14.4 In-School
Predictor      Coef   StDev      T      P
Constant      661.4   610.0   1.08  0.284
Unemploy      48.05   16.96   2.83  0.007
Police       31.291   5.370   5.83  0.000
In-Schoo    -14.401   6.128  -2.35  0.023

S = 181.8   R-Sq = 55.7%   R-Sq(adj) = 52.8%

Analysis of Variance
Source          DF       SS      MS      F      P
Regression       3  1912002  637334  19.28  0.000
Residual Error  46  1520338   33051
Total           49  3432340
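
The two printouts above supply everything needed for the general linear test of Part IV: do the three interaction terms, as a set, contribute to the model? A sketch of the arithmetic (Python with scipy assumed; the values are read directly off the two ANOVA tables):

    from scipy.stats import f

    ssr_full, ssr_reduced = 2114879, 1912002   # SSR from each printout
    mse_full = 30639                           # MSE, full model
    p, g, n = 7, 4, 50                         # p - g = 3 interaction terms dropped

    F = ((ssr_full - ssr_reduced) / (p - g)) / mse_full   # about 2.21
    p_value = f.sf(F, p - g, n - p)                       # about 0.10

    print(F, p_value)

With F ≈ 2.21 on (3, 43) degrees of freedom and a p-value of roughly 0.10, we fail to reject H0 at the usual α = 0.05: there is no strong evidence that the interaction terms, as a set, contribute, which supports using the reduced model.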