Multiple Regression - Selecting the Best Equation

When fitting a multiple linear regression model, a researcher will likely include
independent variables that are not important in predicting the dependent variable Y.
In the analysis he or she will try to eliminate these variables from the final equation. The
objective in searching for the “best equation” is to find the simplest model that
adequately fits the data. This will not necessarily be the model that explains the most
variance in the dependent variable Y (the equation with the highest value of R2), since that
is always the equation with all of the independent variables included. Our
objective will be to find the equation with the fewest variables that still explains a
percentage of the variance in the dependent variable comparable to the percentage
explained with all the variables in the equation.
An Example
The example that we will consider examines how the heat evolved in the curing of
cement is affected by the amounts of various chemicals included in the cement mixture.
The independent and dependent variables are listed below:
X1 = amount of tricalcium aluminate, 3 CaO·Al2O3
X2 = amount of tricalcium silicate, 3 CaO·SiO2
X3 = amount of tetracalcium alumino ferrite, 4 CaO·Al2O3·Fe2O3
X4 = amount of dicalcium silicate, 2 CaO·SiO2
Y = heat evolved in calories per gram of cement.
X1    X2    X3    X4     Y
 7    26     6    60    79
 1    29    15    52    74
11    56     8    20   104
11    31     8    47    88
 7    52     6    33    96
11    55     9    22   109
 3    71    17     6   103
 1    31    22    44    73
 2    54    18    22    93
21    47     4    26   116
 1    40    23    34    84
11    66     9    12   113
10    68     8    12   109
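To make the example concrete, the following sketch (Python with numpy; the choice of software is an assumption, as the handout does not prescribe any) enters the data above and fits the full four-variable equation by least squares.

    import numpy as np

    # Columns are X1, X2, X3, X4 from the table above.
    X = np.array([
        [ 7, 26,  6, 60],
        [ 1, 29, 15, 52],
        [11, 56,  8, 20],
        [11, 31,  8, 47],
        [ 7, 52,  6, 33],
        [11, 55,  9, 22],
        [ 3, 71, 17,  6],
        [ 1, 31, 22, 44],
        [ 2, 54, 18, 22],
        [21, 47,  4, 26],
        [ 1, 40, 23, 34],
        [11, 66,  9, 12],
        [10, 68,  8, 12],
    ], dtype=float)
    y = np.array([79, 74, 104, 88, 96, 109, 103, 73, 93,
                  116, 84, 113, 109], dtype=float)

    A = np.column_stack([np.ones(len(y)), X])      # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares estimates
    resid = y - A @ beta
    r2 = 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
    print("coefficients (b0, b1, ..., b4):", beta.round(3))
    print("R^2 of the full model:", round(r2, 4))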
Techniques for Selecting the "Best" Regression Equation
The best regression equation is not necessarily the equation that explains most of the
variance in Y (the highest R2).
• That equation will be the one with all the variables included.
• The best equation should also be simple and interpretable (i.e. contain a small number
of variables).
• Simple (interpretable) & Reliable - opposing criteria.
• The best equation is a compromise between these two.
I All Possible Regressions
Suppose we have the p independent variables X1, X2, ..., Xp.
- Then there are 2^p subsets of variables.
Example (k = 3): X1, X2, X3

Variables in Equation      Model
- no variables             Y = β0 + ε
- X1                       Y = β0 + β1 X1 + ε
- X2                       Y = β0 + β2 X2 + ε
- X3                       Y = β0 + β3 X3 + ε
- X1, X2                   Y = β0 + β1 X1 + β2 X2 + ε
- X1, X3                   Y = β0 + β1 X1 + β3 X3 + ε
- X2, X3                   Y = β0 + β2 X2 + β3 X3 + ε
- X1, X2, X3               Y = β0 + β1 X1 + β2 X2 + β3 X3 + ε
Use of R2
1. Assume we carry out the 2^p runs, one for each of the subsets.
Divide the runs into the following sets:
Set 0: No variables
Set 1: One independent variable.
...
Set p: p independent variables.
2. Order the runs in each set according to R2.
3. Examine the leaders in each set, looking for consistent patterns
- take into account the correlation between independent variables.
Example (k = 4): X1, X2, X3, X4

        Variables in the leading runs    100 R2 %
Set 1:  X4                               67.5 %
Set 2:  X1, X2                           97.9 %
        X1, X4                           97.2 %
Set 3:  X1, X2, X4                       98.234 %
Set 4:  X1, X2, X3, X4                   98.237 %
Examination of the correlation coefficients reveals a high correlation
between X1 and X3 (r13 = -0.824) and between X2 and X4 (r24 = -0.973).
Best Equation
Y = β0 + β1 X1+ β4 X4+ ε
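As a complement to the table above, the following sketch enumerates every subset of the four variables and reports the leading run in each set by R2. It continues from the arrays X and y defined earlier; the helper name r_squared is illustrative, not part of the handout.

    from itertools import combinations
    import numpy as np

    def r_squared(cols):
        """R^2 for the model with an intercept plus the predictors indexed by cols."""
        A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

    p = X.shape[1]
    for size in range(1, p + 1):
        runs = [(r_squared(cols), cols) for cols in combinations(range(p), size)]
        r2_best, cols_best = max(runs)                 # leader within this set
        names = ", ".join(f"X{j + 1}" for j in cols_best)
        print(f"Set {size}: leading run {{{names}}}, 100 R^2 = {100 * r2_best:.1f}%")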
Use of the Residual Mean Square (RMS) (s2)
When all of the variables having a non-zero effect have been included in the model, the
residual mean square is an estimate of σ2.
If "significant" variables have been left out, the RMS will be biased upward.
No. of Variables p    RMS s2(p)                                Average s2(p)
1                     115.06, 82.39, 176.31, 80.35             113.53
2                     5.79*, 122.71, 7.48**, 86.59, 17.57      47.00
3                     5.35, 5.33, 5.65, 8.20                   6.13
4                     5.98                                     5.98
*  - run X1, X2
** - run X1, X4

s2 ≈ 6.
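The same enumeration of subsets can be used to tabulate the residual mean squares. The sketch below, again continuing from X and y (the helper name rss is illustrative), computes s2(p) = RSS/(n - p - 1) for every run and averages within each subset size.

    from itertools import combinations
    import numpy as np

    def rss(cols):
        """Residual sum of squares for the model with an intercept plus the
        predictors indexed by cols (cols may be empty)."""
        A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        return resid @ resid

    n, k_total = X.shape
    for size in range(1, k_total + 1):
        s2 = [rss(cols) / (n - size - 1) for cols in combinations(range(k_total), size)]
        print(f"p = {size}: s2(p) = {np.round(s2, 2)}, average = {np.mean(s2):.2f}")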
Use of Mallows Ck

Mallows Ck = RSSk / s2complete − [n − 2(k + 1)]

If the equation with k variables is adequate then both s2complete and RSSk/(n − k − 1) will be
estimating σ2. Then Ck ≈ [(n − k − 1)σ2]/σ2 − [n − 2(k + 1)] = [n − k − 1] − [n − 2(k + 1)] = k + 1.
Thus if we plot, for each run, Ck vs k and look for runs with Ck close to k + 1, we will be able to
identify models giving a reasonable fit.
Run                     Ck                            k + 1
no variables            443.2                         1
1, 2, 3, 4              202.5, 142.5, 315.2, 138.7    2
12, 13, 14              2.7, 198.1, 5.5               3
23, 24, 34              62.4, 138.2, 22.4             3
123, 124, 134, 234      3.0, 3.0, 3.5, 7.5            4
1234                    5.0                           5
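The Ck values in the table can be reproduced (approximately, since the Y values above are rounded) with the sketch below, which reuses the rss helper from the previous sketch.

    from itertools import combinations

    n, k_total = X.shape
    s2_complete = rss(range(k_total)) / (n - k_total - 1)   # residual mean square, full model

    for size in range(k_total + 1):
        for cols in combinations(range(k_total), size):
            ck = rss(cols) / s2_complete - (n - 2 * (size + 1))
            label = ",".join(str(j + 1) for j in cols) or "no variables"
            print(f"run {label:12s}  Ck = {ck:7.1f}   k + 1 = {size + 1}")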
II Backward Elimination
In this procedure the complete regression equation is determined containing all the variables - X1,
X2, ..., Xp. Then variables are checked one at a time and the least significant is dropped from the
model at each stage. The procedure is terminated when all of the variables remaining in the
equation provide a significant contribution to the prediction of the dependent variable Y. The
precise algorithm proceeds as follows:
1. Fit a regression equation containing all variables.
2. A partial F-test (F to remove) is computed for each of the independent variables still
in the equation.
• The Partial F statistic (F to remove) = [RSS2 - RSS1]/MSE1, where
• RSS1 = the residual sum of squares with all variables that are presently
in the equation,
• RSS2 = the residual sum of squares with one of the variables removed,
and
• MSE1 = the Mean Square for Error with all variables that are presently
in the equation.
3. The lowest partial F value (F to remove) is compared with Fα for some pre-specified
α. If FLowest ≤ Fα then remove that variable and return to step 2. If FLowest >
Fα then accept the equation as it stands.
Example
(k=4) (same example as before) X1, X2, X3, X4
1. X1, X2, X3, X4 in the equation.
The lowest partial F = 0.018 (X3) is compared with Fα(1,8) = 3.46 for α = 0.10.
Remove X3.
2. X1, X2, X4 in the equation.
The lowest partial F = 1.86 (X4) is compared with Fα(1,9) = 3.36 for α = 0.10.
Remove X4.
3. X1, X2 in the equation.
The partial F values for both X1 and X2 exceed Fα(1,10) = 3.36 for α = 0.10.
Equation is accepted as it stands. Note : F to Remove = partial F.
Y = 52.58 + 1.47 X1 + 0.66 X2
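A sketch of the backward elimination loop is given below. It reuses X, y, and the rss helper from the earlier sketches; the function name and the use of a single cutoff f_alpha (rather than re-reading the F table at every step, as the worked example does) are simplifying assumptions.

    def backward_elimination(f_alpha):
        """Drop the variable with the smallest partial F until all remaining
        partial F values exceed the cutoff f_alpha."""
        n = len(y)
        in_model = list(range(X.shape[1]))             # start with every variable
        while in_model:
            mse1 = rss(in_model) / (n - len(in_model) - 1)
            # partial F (F to remove) for each variable currently in the equation
            partial_f = {j: (rss([i for i in in_model if i != j]) - rss(in_model)) / mse1
                         for j in in_model}
            weakest = min(partial_f, key=partial_f.get)
            if partial_f[weakest] > f_alpha:           # everything left is significant
                break
            in_model.remove(weakest)                   # drop the least significant
        return in_model

    # With a cutoff near F0.10 this should retain X1 and X2, as in the example.
    print([f"X{j + 1}" for j in backward_elimination(3.46)])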
III Forward Selection
In this procedure we start with no variables in the equation. Then variables are checked one at a
time and the most significant is added to the model at each stage. The procedure is terminated
when all of the variables not in the equation have no significant effect on the dependent variable
Y. The precise algorithm proceeds as follows:
1. Starting with no variables in the equation, a partial F-test (F to enter) is computed for
each of the independent variables not currently in the equation.
• The Partial F statistic (F to enter) = [RSS2 - RSS1]/MSE1, where
• RSS1 = the residual sum of squares with all variables that are presently
in the equation and the variable under consideration,
• RSS2 = the residual sum of squares with all variables that are presently
in the equation, and
• MSE1 = the Mean Square for Error with all variables that are presently in
the equation and the variable under consideration.
2. The largest partial F value (F to enter) is compared with Fα for some pre-specified
α. If FLargest > Fα then add that variable and return to step 1. If FLargest ≤ Fα then
accept the equation as it stands.
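A matching sketch for forward selection is shown below, under the same assumptions (X, y, and rss from the earlier sketches, and a single cutoff f_alpha).

    def forward_selection(f_alpha):
        """Add the variable with the largest partial F (F to enter) until no
        remaining variable exceeds the cutoff f_alpha."""
        n = len(y)
        in_model = []
        candidates = set(range(X.shape[1]))
        while candidates:
            f_enter = {}
            for j in candidates:
                trial = in_model + [j]
                mse1 = rss(trial) / (n - len(trial) - 1)
                f_enter[j] = (rss(in_model) - rss(trial)) / mse1
            best = max(f_enter, key=f_enter.get)
            if f_enter[best] <= f_alpha:               # nothing left is significant
                break
            in_model.append(best)                      # add the most significant
            candidates.remove(best)
        return in_model

    print([f"X{j + 1}" for j in forward_selection(3.46)])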
IV Stepwise Regression
In this procedure we start with no variables in the model. Variables are then checked one at a
time, using the partial correlation coefficient (equivalently, F to Enter) as a measure of importance
in predicting the dependent variable Y. At each stage the variable with the highest significant
partial correlation coefficient (F to Enter) is added to the model. Once this has been done, the
partial F statistic (F to Remove) is computed for all variables now in the model, to check whether
any of the variables previously added can now be deleted. This procedure is continued until no
further variables can be added or deleted
from the model. The partial correlation coefficient for a given variable is the correlation
between the given variable and the response when the present independent variables in the
equation are held fixed. It is also the correlation between the given variable and the residuals
computed from fitting an equation with the present independent variables in the equation.
(Partial correlation of Xi with the variables Xi1, Xi2, ... already in the equation)2
= the proportion of the variance in Y explained by Xi that is left unexplained by Xi1, Xi2, etc.
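The residual interpretation above can be checked directly. The sketch below, continuing from X and y, computes rY1.4 (the partial correlation of X1 with Y given X4) as the ordinary correlation between two sets of residuals; its square should be close to the 0.915 quoted in the example below, only approximately since the Y values in the table are rounded.

    import numpy as np

    def residuals(target, cols):
        """Residuals from regressing `target` on an intercept plus the X columns in cols."""
        A = np.column_stack([np.ones(len(target))] + [X[:, j] for j in cols])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return target - A @ beta

    e_y  = residuals(y, [3])           # residuals of Y after fitting X4
    e_x1 = residuals(X[:, 0], [3])     # residuals of X1 after fitting X4
    r_y1_4 = np.corrcoef(e_y, e_x1)[0, 1]
    print("r_Y1.4 =", round(r_y1_4, 3), " squared =", round(r_y1_4 ** 2, 3))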
Example
(k=4) (same example as before) X1, X2, X3, X4
1. With no variables in the equation, the correlation of each independent variable with
the dependent variable Y is computed. The highest significant correlation (r = 0.821) is with variable X4. Thus the decision is made to include X4.
Regress Y with X4 - significant, thus we keep X4.
2. Compute the partial correlation coefficients of Y with all other independent variables,
given X4 in the equation. The highest partial correlation is with the variable X1
([rY1.4]2 = 0.915). Thus the decision is made to include X1.
Regress Y with X1, X4:
R2 = 0.972, F = 176.63.
For X1 the partial F value = 108.22 (F0.10(1,8) = 3.46)
Retain X1.
For X4 the partial F value = 154.295 (F0.10(1,8) = 3.46)
Retain X4.
3. Compute the partial correlation coefficients of Y with all other independent variables,
given X4 and X1 in the equation. The highest partial correlation is with the variable
X2 ([rY2.14]2 = 0.358). Thus the decision is made to include X2.
Regress Y with X1, X2, X4:
R2 = 0.982.
Lowest partial F value = 1.863 for X4 (F0.10(1,9) = 3.36)
Remove X4, leaving X1 and X2.
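Finally, a sketch of the whole stepwise procedure, combining an F-to-enter step with an F-to-remove check, is given below. As before it continues from X, y, and rss; the single cutoffs f_enter and f_remove stand in for the tabulated Fα(1, n - k - 1) values used in the worked example.

    def stepwise(f_enter, f_remove):
        """Forward steps driven by F to enter, each followed by a backward check
        with F to remove (in practice f_remove <= f_enter to avoid cycling)."""
        n = len(y)
        in_model = []
        while True:
            # forward step: best F to enter among variables not yet in the model
            out = [j for j in range(X.shape[1]) if j not in in_model]
            enter = {}
            for j in out:
                trial = in_model + [j]
                mse1 = rss(trial) / (n - len(trial) - 1)
                enter[j] = (rss(in_model) - rss(trial)) / mse1
            if not enter or max(enter.values()) <= f_enter:
                return in_model                        # nothing left to add
            in_model.append(max(enter, key=enter.get))

            # backward check: delete any variable whose F to remove has fallen
            # below the cutoff now that the new variable is in the equation
            while len(in_model) > 1:
                mse1 = rss(in_model) / (n - len(in_model) - 1)
                remove = {j: (rss([i for i in in_model if i != j]) - rss(in_model)) / mse1
                          for j in in_model}
                weakest = min(remove, key=remove.get)
                if remove[weakest] > f_remove:
                    break
                in_model.remove(weakest)

    # With cutoffs near F0.10 this should end with X1 and X2, as in the example.
    print([f"X{j + 1}" for j in stepwise(3.46, 3.46)])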