Department of Mathematics
Faculty of Science and Engineering
City University of Hong Kong
MA 3518: Applied Statistics
Chapter 5: Linear Regression Analysis – Part II
In multiple linear regression analysis, it is vital to achieve a
parsimonious result without eliminating any important explanatory
variables. This chapter focuses on the aspects of selecting
significant explanatory variables to be included in a regression
model. We introduce four standard procedures for variable
selection, namely the best subset selection method, forward
selection, backward elimination and stepwise regression.
Topics included in this chapter are listed as follows:
Section 5.1: Variable Selection
Section 5.2: SAS for Variable Selection
The Philosophy of Parsimony!
Section 5.1: Variable Selection
1. Motivation:
 Consider the following multiple linear regression model
Y = a + b1X1 + b2X2 + b3X3 + … + bpXp + e
We may have included some unnecessary explanatory
variables that are not significant for explaining the variation
of the response variable Y or for predicting Y
 Remove such insignificant variables from the regression
model to reduce the number of regression parameters
 To achieve a parsimonious result
2. Objective:
 Select significant explanatory variables to be included in and
identify insignificant explanatory variables to be excluded
from a regression model so that the final model can provide a
reasonably good prediction for the response variable and is
easy to understand, interpret and apply
3. Question: What happens if a significant explanatory variable is
excluded from a regression model?
 Biased estimates for other regression coefficients and
forecasts will be obtained
4. Question: What happens if an insignificant explanatory variable
is included in a regression model?
 The standard errors of the estimators for the regression
coefficients and the forecasts will increase
5. Four standard procedures: no single method is guaranteed to
pick the ‘best’ regression model
 Best subset selection method
 Forward selection
 Backward elimination
 Stepwise regression
6. Best subset selection method:
 Consider the following multiple linear regression model
with three explanatory variables X1, X2 and X3
Y = a + b1X1 + b2X2 + b3X3 + e
We can fit 7 possible regression models with the following
combinations:
(a) Y = a + b1X1 + e
(b) Y = a + b2X2 + e
(c) Y = a + b3X3 + e
(d) Y = a + b1X1 + b2X2 + e
(e) Y = a + b1X1 + b3X3 + e
(f) Y = a + b2X2 + b3X3 + e
(g) Y = a + b1X1 + b2X2 + b3X3 + e
Note that 2^3 – 1 = 7
Now, suppose we have p explanatory variables. Then, we
can fit 2^p – 1 possible regression models
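To make the count concrete, the 2^p – 1 candidate models can be enumerated directly; a minimal sketch in Python (the variable names X1–X3 mirror the example above):

```python
from itertools import combinations

# Each candidate model uses a non-empty subset of the p explanatory
# variables, so there are 2^p - 1 models in total.
variables = ["X1", "X2", "X3"]  # p = 3, as in the example above

models = [subset
          for r in range(1, len(variables) + 1)
          for subset in combinations(variables, r)]

for m in models:
    print(m)  # the 7 combinations (a)-(g) listed above
print(len(models) == 2 ** len(variables) - 1)  # True
```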
 Rationale: To fit all 2^p – 1 possible regression models and
select the ‘best’ one according to one of the
following criteria
(a) R2 or its adjusted version:
- Criterion based on R2:
Choose the “best” model with the largest R2
- Criterion based on adjusted R2:
Choose the “best” model with the largest adjusted R2
Note that
- R2 never decreases as p increases, so it tends to favour
larger models; the adjusted R2 penalizes extra parameters
and need not increase
- R2 itself is only useful for comparing different models
when p is fixed
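As a quick numerical check of these definitions (a sketch, assuming the usual formulas R2 = 1 – SSE/SST and adjusted R2 = 1 – (1 – R2)(n – 1)/(n – p – 1)), the values for the full four-variable model in the S&P 500 example later in this chapter can be reproduced from its sums of squares:

```python
# Sums of squares for the full model in the SAS example below:
# SSE = 144.03613, SST = 2921.04689, with n = 23 observations
# and p = 4 explanatory variables.
SSE, SST, n, p = 144.03613, 2921.04689, 23, 4

r2 = 1 - SSE / SST                             # coefficient of determination
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)  # penalizes extra parameters

print(round(r2, 4))      # 0.9507, as reported by SAS
print(round(adj_r2, 4))  # 0.9397
```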
(b) Mallow’s Cp statistic: (evaluating it for every subset can
be computationally expensive)
- Definition of Cp statistic:
Cp = (SSE_R / MSE_F) – (n – 2k)
where
SSE_R = the sum of squared errors for the reduced model
MSE_F = the mean squared error for the full model
n = the number of observations
p = the number of explanatory variables in the reduced model
k = the number of unknown parameters = p + 1
- Note that
(1) The full model is the one which contains all
explanatory variables
(2) If the reduced model is true, E(Cp) is approximately
equal to k (= p + 1)
- Criterion one for model selection:
Choose the “best” model with the smallest Cp
- Criterion two for model selection:
Choose the “best” model with Cp closest to k (= p + 1)
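Both criteria can be checked numerically; a minimal sketch using the figures from the S&P 500 example later in this chapter (n = 23, full-model MSE_F = 8.00201):

```python
# Mallow's Cp = SSE_R / MSE_F - (n - 2k), where k = p + 1 is the number
# of parameters in the reduced model under consideration.
def mallows_cp(sse_reduced, mse_full, n, k):
    return sse_reduced / mse_full - (n - 2 * k)

n, mse_full = 23, 8.00201  # from the full model in the SAS example below

# Reduced model Close ~ Open + High + Low: SSE_R = 149.37246, k = 4
print(round(mallows_cp(149.37246, mse_full, n, 4), 4))  # 3.6669

# The full model itself: SSE_R = 144.03613, k = 5, giving Cp = k exactly
print(round(mallows_cp(144.03613, mse_full, n, 5), 4))  # 5.0
```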
7. Remarks:
 Shortcoming of the best subset selection method:
It is not practical to evaluate all possible subset regressions
when the number of explanatory variables p is very large
 Sequential method: Include or drop one variable at a time
(a) Forward selection method
(b) Backward elimination method
(c) Stepwise method
8. Forward selection with significance level α1:
 Start with the simplest intercept-only model (i.e. without any
explanatory variables)
Y = a + e
 Add each remaining explanatory variable to the current
equation in turn, compute the partial F-statistic (or the
t-statistic or the corresponding p-value) for its regression
coefficient, and choose the explanatory variable with the
largest partial F-statistic, say Xk
 Perform the following hypothesis test at significance level
α1:
H0: bk = 0 vs H1: bk ≠ 0
If the null hypothesis H0 is rejected, the explanatory variable
Xk is significant and should be included in the regression
equation
 Determine which one of the remaining variables will have the
largest partial F-statistic if each of them is added to the
regression equation that already contains Xk
 Repeat the procedure in Step 2 – Step 4 until no further
variables are significant
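The loop above can be sketched in Python. This is a minimal illustration, not SAS’s implementation: the tiny dataset (x1 drives y, x2 is irrelevant) is made up, OLS is solved via the normal equations, and a fixed F-to-enter threshold of 4.0 stands in for the α1 p-value cut-off, since the standard library has no F distribution:

```python
# A minimal sketch of forward selection on a made-up two-variable dataset.

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def sse(columns, y):
    """Sum of squared errors of OLS with an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y[t] for t, r in enumerate(X)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[t] - sum(b * X[t][i] for i, b in enumerate(beta))) ** 2
               for t in range(len(y)))

def forward_selection(data, y, f_to_enter=4.0):
    n, selected, remaining = len(y), [], list(data)
    while remaining:
        # Step 2: partial F for adding each remaining variable
        sse_cur = sse([data[v] for v in selected], y)
        best, best_f = None, -1.0
        for v in remaining:
            sse_new = sse([data[u] for u in selected + [v]], y)
            mse_new = sse_new / (n - len(selected) - 2)
            f = (sse_cur - sse_new) / mse_new
            if f > best_f:
                best, best_f = v, f
        # Steps 3-4: enter the best candidate only if it is significant
        if best_f < f_to_enter:
            break
        selected.append(best)
        remaining.remove(best)
    return selected

x1 = [1.0, 2, 3, 4, 5, 6, 7, 8]   # drives y
x2 = [1.0, 0, 1, 0, 1, 0, 1, 0]   # irrelevant
e  = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y  = [3 + 2 * a + b for a, b in zip(x1, e)]
print(forward_selection({"x1": x1, "x2": x2}, y))  # ['x1']
```

With this data x1 enters first (huge partial F) and x2 is never entered, so the procedure stops after one step.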
9. Backward elimination with significance level α2:
 Start with the full model (i.e. include all possible explanatory
variables and the intercept)
Y = a + b1X1 + b2X2 + b3X3 + … + bpXp + e
 Choose the explanatory variable with the smallest partial
F-statistic after fitting the full model, say Xk
 Perform the following hypothesis test at significance level
α2:
H0: bk = 0 vs H1: bk ≠ 0
If the null hypothesis H0 is not rejected, the explanatory
variable Xk is not significant and should be excluded from
the regression equation
 Repeat the procedure until every variable remaining in the
model is significant at level α2
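Backward elimination can be sketched the same way (again a toy illustration: the two-variable dataset is made up, and a fixed F-to-stay threshold of 4.0 replaces the α2 cut-off). Starting from the full model, the least significant variable is dropped until everything remaining is significant:

```python
# Minimal backward elimination on a made-up dataset; OLS via normal
# equations, fixed F-to-stay threshold instead of the alpha_2 cut-off.

def solve(A, b):
    """Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def sse(columns, y):
    """Sum of squared errors of OLS with an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y[t] for t, r in enumerate(X)) for i in range(k)]
    beta = solve(XtX, Xty)
    return sum((y[t] - sum(b * X[t][i] for i, b in enumerate(beta))) ** 2
               for t in range(len(y)))

def backward_elimination(data, y, f_to_stay=4.0):
    n, selected = len(y), list(data)
    while selected:
        sse_cur = sse([data[v] for v in selected], y)
        mse_cur = sse_cur / (n - len(selected) - 1)
        # Step 2: partial F for each variable currently in the model
        worst, worst_f = None, float("inf")
        for v in selected:
            sse_without = sse([data[u] for u in selected if u != v], y)
            f = (sse_without - sse_cur) / mse_cur
            if f < worst_f:
                worst, worst_f = v, f
        # Step 3: if H0 is not rejected for the weakest variable, drop it
        if worst_f >= f_to_stay:
            break
        selected.remove(worst)
    return selected

x1 = [1.0, 2, 3, 4, 5, 6, 7, 8]   # drives y
x2 = [1.0, 0, 1, 0, 1, 0, 1, 0]   # irrelevant
e  = [0.1, -0.1, -0.1, 0.1, 0.1, -0.1, -0.1, 0.1]
y  = [3 + 2 * a + b for a, b in zip(x1, e)]
print(backward_elimination({"x1": x1, "x2": x2}, y))  # ['x1']
```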
10. Stepwise regression with significance levels α1 and α2: (a
mixture of both forward selection and backward elimination)
 Start with the simplest intercept-only model (i.e. without any
explanatory variables)
 Select the most significant variable (i.e. the one with the
largest partial F-statistic), say Xk, by forward selection
method
 Perform the test for the following hypotheses at significance
level α1:
H0: bk = 0 vs H1: bk ≠ 0
If H0 is not rejected, record summary statistics and STOP;
otherwise, add Xk to the regression model and perform
backward elimination
 Select the explanatory variable with the smallest partial
F-statistic, say Xj
 Perform the test for the following hypotheses at significance
level α2:
H0: bj = 0 vs H1: bj ≠ 0
If H0 is not rejected, remove the variable Xj from the
regression model; otherwise, go to Step 2
11. Use SAS procedures to perform variable selection:
 The SELECTION statement
PROC REG DATA = name of dataset <options>;
MODEL response = explanatory variables / SELECTION = <options>;
RUN;
 Options for the SELECTION statement
(a) Perform the best subset selection:
SELECTION = RSQUARE
(b) Perform forward selection:
SELECTION = forward
(c) Perform backward elimination:
SELECTION = backward
(d) Perform stepwise regression
SELECTION = stepwise
(e) Specify the significance level α1 for an explanatory
variable to be included in the model during forward
selection or stepwise regression
sle = α1
By default, the significance level is 50% for forward
selection and 15% for stepwise regression
(f) Specify the significance level α2 for an explanatory
variable to be removed from the model during
backward elimination or stepwise regression
sls = α2
By default, the significance level is 10% for backward
elimination and 15% for stepwise regression
(g) Display Mallow’s Cp statistic (only available with the
option “SELECTION = RSQUARE”)
SELECTION = RSQUARE cp
(h) Display the adjusted R2 (only available with the
option “SELECTION = RSQUARE”)
SELECTION = RSQUARE adjrsq
(i) Display the mean squared error MSE (only available
with the option “SELECTION = RSQUARE”)
SELECTION = RSQUARE mse
12. Example: (Best subset selection method)
Consider the dataset from the last chapter containing the daily
open, high, low and close values and the trading volume of the
S&P 500 index from 2 Sep 2003 to 2 Oct 2003
Data SP500;
Input Date $ Open High Low Close Volume;
CARDS;
2-Oct-03 1017.25 1021.90 1013.38 1020.24 1091209984
1-Oct-03 997.15 1018.22 997.15 1018.22 1329970048
30-Sep-03 1004.72 1004.72 990.34 995.97 1360259968
29-Sep-03 998.12 1006.91 995.31 1006.58 1128700000
26-Sep-03 1003.31 1003.32 996.03 996.85 1237640000
25-Sep-03 1010.24 1015.97 1003.26 1003.27 1276470000
24-Sep-03 1029.09 1029.83 1008.93 1009.38 1378250000
23-Sep-03 1023.26 1030.06 1021.50 1029.03 1124940000
22-Sep-03 1036.30 1036.30 1018.27 1022.82 1082870000
19-Sep-03 1039.64 1039.64 1031.85 1036.30 1328210000
18-Sep-03 1025.80 1040.18 1025.66 1039.58 1257790000
17-Sep-03 1028.91 1031.37 1024.23 1025.97 1135540000
16-Sep-03 1015.07 1029.68 1015.07 1029.32 1161780000
15-Sep-03 1018.68 1019.80 1013.59 1014.81 943448000
12-Sep-03 1014.54 1019.68 1007.70 1018.63 1092610000
11-Sep-03 1011.34 1020.84 1011.34 1016.42 1151640000
10-Sep-03 1021.27 1021.28 1009.73 1010.92 1313300000
9-Sep-03 1030.51 1030.51 1021.13 1023.16 1226980000
8-Sep-03 1021.84 1032.42 1021.84 1031.64 1171310000
5-Sep-03 1027.02 1029.24 1018.20 1021.39 1292100000
4-Sep-03 1025.97 1029.15 1022.17 1027.97 1259030000
3-Sep-03 1023.37 1029.36 1022.39 1026.27 1547380000
2-Sep-03 1009.14 1022.63 1005.65 1021.99 1279880000
;
RUN;
Suppose the full model for the regression with the response
variable “Close” and the explanatory variables “Open”,
“High”, “Low” and “Volume” is given as follows:
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
Perform the best subset selection method using the following
SAS procedures:
PROC REG DATA = SP500;
MODEL Close = Open High Low Volume / SELECTION = RSQUARE cp adjrsq mse;
RUN;
The SAS output is shown as follows:
                      The SAS System    21:58 Wednesday, October 15, 2003

                              The REG Procedure
                                Model: MODEL1
                           Dependent Variable: Close

                           R-Square Selection Method

 Number in           Adjusted
   Model   R-Square  R-Square      C(p)        MSE    Variables in Model
      1      0.7889    0.7788    58.0670    29.36623  High
      1      0.7612    0.7498    68.1654    33.21419  Low
      1      0.3822    0.3528   206.5163    85.93252  Open
      1      0.0041    -.0433   344.5411   138.52661  Volume
 ----------------------------------------------------------------------
      2      0.8660    0.8526    31.8989    19.56446  Open High
      2      0.8444    0.8289    39.7845    22.71949  Open Low
      2      0.8103    0.7914    52.2348    27.70086  High Low
      2      0.7991    0.7790    56.3310    29.33976  High Volume
      2      0.7616    0.7377    70.0302    34.82080  Low Volume
      2      0.3882    0.3270   206.3250    89.35242  Open Volume
 ----------------------------------------------------------------------
      3      0.9489    0.9408     3.6669     7.86171  Open High Low
      3      0.8779    0.8586    29.5784    18.77458  Open High Volume
      3      0.8448    0.8203    41.6383    23.85371  Open Low Volume
      3      0.8153    0.7861    52.4377    28.40193  High Low Volume
 ----------------------------------------------------------------------
      4      0.9507    0.9397     5.0000     8.00201  Open High Low Volume
Interpretations of the SAS output:
(1) Since the largest R2 is 0.9507, the “best” model based on R2 is
given by:
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
(2) Since the largest adjusted R2 is 0.9408, the “best” model
based on adjusted R2 is given by:
Close = a + b1 Open + b2 High + b3 Low + e
(3) Since the smallest Mallow’s Cp statistic is 3.6669, the “best”
model based on the first criterion of Mallow’s Cp statistic is
given by:
Close = a + b1 Open + b2 High + b3 Low + e
(4) Since the Mallow’s Cp statistic closest to k = p + 1 (= 4 + 1) is
5.0000, the “best” model based on the second criterion of
Mallow’s Cp statistic is given by:
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
13. Example: (Forward selection)
Consider again the data set in the last example and use the
following SAS procedures to perform forward selection
PROC REG DATA = SP500;
MODEL Close = Open High Low Volume / SELECTION = Forward;
RUN;
The SAS output is given as follows:
                      The SAS System    11:41 Friday, October 17, 2003

                              The REG Procedure
                                Model: MODEL1
                           Dependent Variable: Close

Forward Selection: Step 1
Variable High Entered: R-Square = 0.7889 and C(p) = 58.0670

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               1      2304.35616  2304.35616      78.47    <.0001
Error              21       616.69073    29.36623
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept   -16.20403   116.91573       0.56409       0.02    0.8911
High          1.01088     0.11412    2304.35616      78.47    <.0001

Bounds on condition number: 1, 1
------------------------------------------------------------------------
Forward Selection: Step 2
Variable Open Entered: R-Square = 0.8660 and C(p) = 31.8989

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               2      2529.75762  1264.87881      64.65    <.0001
Error              20       391.28927    19.56446
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept    -3.87743    95.49862       0.03225       0.00    0.9680
Open         -0.54116     0.15944     225.40146      11.52    0.0029
High          1.53702     0.18084    1413.29365      72.24    <.0001

Bounds on condition number: 3.7694, 15.078
------------------------------------------------------------------------
Forward Selection: Step 3
Variable Low Entered: R-Square = 0.9489 and C(p) = 3.6669

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               3      2771.67443   923.89148     117.52    <.0001
Error              19       149.37246     7.86171
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept     7.75846    60.57343       0.12897       0.02    0.8994
Open         -0.79702     0.11109     404.64474      51.47    <.0001
High          0.96239     0.15451     305.01740      38.80    <.0001
Low           0.82713     0.14911     241.91681      30.77    <.0001

Bounds on condition number: 7.5278, 56.789
------------------------------------------------------------------------
Forward Selection: Step 4
Variable Volume Entered: R-Square = 0.9507 and C(p) = 5.0000

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               4      2777.01075   694.25269      86.76    <.0001
Error              18       144.03613     8.00201
Corrected Total    22      2921.04689

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      10.09909      61.17871       0.21805       0.03    0.8707
Open           -0.79023       0.11239     395.60059      49.44    <.0001
High            0.98725       0.15883     309.18435      38.64    <.0001
Low             0.79761       0.15471     212.68084      26.58    <.0001
Volume      -3.95259E-9   4.840167E-9       5.33633       0.67    0.4248

Bounds on condition number: 7.9624, 82.844
------------------------------------------------------------------------
All variables have been entered into the model.

                      Summary of Forward Selection

     Variable    Number    Partial     Model
Step Entered     Vars In   R-Square   R-Square      C(p)    F Value    Pr > F
  1  High            1       0.7889     0.7889   58.0670      78.47    <.0001
  2  Open            2       0.0772     0.8660   31.8989      11.52    0.0029
  3  Low             3       0.0828     0.9489    3.6669      30.77    <.0001
  4  Volume          4       0.0018     0.9507    5.0000       0.67    0.4248
Interpretations of the SAS output:
(1)
In Step one, the explanatory variable “High” is entered into
the model since its F-statistic is largest (78.47) among other
F-statistics for the other explanatory variables; that is, the
following regression model is considered:
Close = a + b2 High + e
The p-value of the partial F-statistic for the explanatory
variable “High” is less than 0.0001. Hence, we reject the null
hypothesis H0: b2 = 0 and decide to include the variable
“High” in our model at the default significance level 0.5
(2)
In Step two, the explanatory variable “Open” is entered into
the model since its partial F-statistic (11.52) is largest
among other partial F-statistics for the explanatory variables
“Low” and “Volume”. In this step, we consider the
following regression model:
Close = a + b1 Open + b2 High + e
The proportion of variation of the response variable Y that
can be explained by including the explanatory variable
“Open” is 0.8660 – 0.7889 (= 0.077)
The p-value of the partial F-statistic for the explanatory
variable “Open” is 0.0029. Hence, we reject the null
hypothesis H0: b1 = 0 and decide to include the variable
“Open” in our model at significance level 0.5
(3)
In Step three, the explanatory variable “Low” is entered into
the model since its partial F-statistic (30.77) is larger than
the partial F-statistic for the explanatory variable “Volume”.
In this step, we consider the following regression model:
Close = a + b1 Open + b2 High + b3 Low + e
The proportion of variation of the response variable Y that
can be explained by including the explanatory variable
“Low” is 0.9489 – 0.8660 (= 0.083)
The p-value of the partial F-statistic for the explanatory
variable “Low” is less than 0.0001. Hence, we reject the null
hypothesis H0: b3 = 0 and decide to include the variable
“Low” in our model at significance level 0.5
(4)
In Step four, the last explanatory variable “Volume” is
entered into the model. Hence, we consider the full model as
follows:
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
The proportion of variation of the response variable Y that
can be explained by including the explanatory variable
“Volume” is 0.9507 – 0.9489 (= 0.002)
The p-value of the partial F-statistic for the explanatory
variable “Volume” is 0.4248. Hence, we reject the null
hypothesis H0: b4 = 0 and decide to include the variable
“Volume” in our model at significance level 0.5
(5)
The last part of the SAS output contains a table that provides
a summary for forward selection. All the p-values of the
partial F-statistics for the explanatory variables are displayed.
We can compare the p-values with the default significance
level 0.5 and decide whether an explanatory variable should
be included at each step of the forward selection
Based on the results of the SAS output, the “best” model
chosen by forward selection with significance level 0.5 is
given by:
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
The fitted regression model is presented as follows:
Close = 10.09909 – 0.79023 Open + 0.98725 High + 0.79761 Low – 3.95259×10^-9 Volume
14. Example: (Backward elimination)
Consider again the data set in the last example and use the
following SAS procedures to perform backward elimination
PROC REG DATA = SP500;
MODEL Close = Open High Low Volume / SELECTION = Backward sls = 0.05;
RUN;
The SAS output is given by:
                      The SAS System    11:55 Friday, October 17, 2003

                              The REG Procedure
                                Model: MODEL1
                           Dependent Variable: Close

Backward Elimination: Step 0
All Variables Entered: R-Square = 0.9507 and C(p) = 5.0000

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               4      2777.01075   694.25269      86.76    <.0001
Error              18       144.03613     8.00201
Corrected Total    22      2921.04689

              Parameter      Standard
Variable       Estimate         Error    Type II SS    F Value    Pr > F
Intercept      10.09909      61.17871       0.21805       0.03    0.8707
Open           -0.79023       0.11239     395.60059      49.44    <.0001
High            0.98725       0.15883     309.18435      38.64    <.0001
Low             0.79761       0.15471     212.68084      26.58    <.0001
Volume      -3.95259E-9   4.840167E-9       5.33633       0.67    0.4248

Bounds on condition number: 7.9624, 82.844
------------------------------------------------------------------------
Backward Elimination: Step 1
Variable Volume Removed: R-Square = 0.9489 and C(p) = 3.6669

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               3      2771.67443   923.89148     117.52    <.0001
Error              19       149.37246     7.86171
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept     7.75846    60.57343       0.12897       0.02    0.8994
Open         -0.79702     0.11109     404.64474      51.47    <.0001
High          0.96239     0.15451     305.01740      38.80    <.0001
Low           0.82713     0.14911     241.91681      30.77    <.0001

Bounds on condition number: 7.5278, 56.789
------------------------------------------------------------------------
All variables left in the model are significant at the 0.0500 level.

                     Summary of Backward Elimination

     Variable    Number    Partial     Model
Step Removed     Vars In   R-Square   R-Square      C(p)    F Value    Pr > F
  1  Volume          3       0.0018     0.9489    3.6669       0.67    0.4248
Interpretations of the SAS output:
(1)
In Step zero, the following full model is considered:
Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e
The partial F-statistics for the explanatory variables are
displayed. The explanatory variable “Volume” has the
smallest partial F-statistic (0.67). Hence, we select the
variable “Volume” and investigate whether it should be
removed from the full model. Since the p-value for the
variable “Volume” is 0.4248, we do not reject the null
hypothesis H0: b4 = 0 and decide to remove the variable
“Volume” from the full model at the pre-specified
significance level 0.05
(2)
In Step one, we consider the following regression model:
Close = a + b1 Open + b2 High + b3 Low + e
Since all the p-values are less than 0.0001, we conclude that
all of the explanatory variables are significant at the
significance level 5%
The proportion of variation of the response variable Y
explained by the removed variable “Volume” is only
0.9507 – 0.9489 = 0.0018
(3)
The last part of the SAS output provides a summary for the
backward elimination with significance level 5%. From the
output in this part, we notice that the variable “Volume”
should be removed from the full model and the “best” model
is given as follows:
Close = a + b1 Open + b2 High + b3 Low + e
The fitted regression model is given by:
Close = 7.75846 – 0.79702 Open + 0.96239 High + 0.82713 Low
15. Example: (Stepwise regression)
Consider again the data set in the last example and use the
following SAS procedures to perform stepwise regression
PROC REG DATA = SP500;
MODEL Close = Open High Low Volume / SELECTION = stepwise sle = 0.05 sls = 0.1;
RUN;
The SAS output is given by:
                      The SAS System    11:59 Friday, October 17, 2003

                              The REG Procedure
                                Model: MODEL1
                           Dependent Variable: Close

Stepwise Selection: Step 1
Variable High Entered: R-Square = 0.7889 and C(p) = 58.0670

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               1      2304.35616  2304.35616      78.47    <.0001
Error              21       616.69073    29.36623
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept   -16.20403   116.91573       0.56409       0.02    0.8911
High          1.01088     0.11412    2304.35616      78.47    <.0001

Bounds on condition number: 1, 1
------------------------------------------------------------------------
Stepwise Selection: Step 2
Variable Open Entered: R-Square = 0.8660 and C(p) = 31.8989

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               2      2529.75762  1264.87881      64.65    <.0001
Error              20       391.28927    19.56446
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept    -3.87743    95.49862       0.03225       0.00    0.9680
Open         -0.54116     0.15944     225.40146      11.52    0.0029
High          1.53702     0.18084    1413.29365      72.24    <.0001

Bounds on condition number: 3.7694, 15.078
------------------------------------------------------------------------
Stepwise Selection: Step 3
Variable Low Entered: R-Square = 0.9489 and C(p) = 3.6669

                          Analysis of Variance
                                Sum of        Mean
Source             DF         Squares      Square    F Value    Pr > F
Model               3      2771.67443   923.89148     117.52    <.0001
Error              19       149.37246     7.86171
Corrected Total    22      2921.04689

            Parameter    Standard
Variable     Estimate       Error    Type II SS    F Value    Pr > F
Intercept     7.75846    60.57343       0.12897       0.02    0.8994
Open         -0.79702     0.11109     404.64474      51.47    <.0001
High          0.96239     0.15451     305.01740      38.80    <.0001
Low           0.82713     0.14911     241.91681      30.77    <.0001

Bounds on condition number: 7.5278, 56.789
------------------------------------------------------------------------
All variables left in the model are significant at the 0.1000 level.
No other variable met the 0.0500 significance level for entry into the model.

                      Summary of Stepwise Selection

     Variable  Variable  Number    Partial     Model
Step Entered   Removed   Vars In   R-Square   R-Square      C(p)    F Value    Pr > F
  1  High                    1       0.7889     0.7889   58.0670      78.47    <.0001
  2  Open                    2       0.0772     0.8660   31.8989      11.52    0.0029
  3  Low                     3       0.0828     0.9489    3.6669      30.77    <.0001

Interpretations of the SAS output:
(1)
In Step one, the explanatory variable “High” is entered into
the model since its F-statistic is largest (78.47) among other
F-statistics for the other explanatory variables and its p-value
is less than 0.0001 which is less than the 0.05 significance
level for forward selection; that is, the following regression
model is considered:
Close = a + b2 High + e
Since the p-value of the partial F-statistic for the explanatory
variable “High” is less than 0.0001, we reject the null
hypothesis H0: b2 = 0 and decide not to remove the variable
“High” by backward elimination at significance level 0.1
(2)
In Step two, the explanatory variable “Open” is entered into
the model since its partial F-statistic (11.52) is largest
among other partial F-statistics for the explanatory variables
“Low” and “Volume” and its p-value is 0.0029 which is less
than the 0.05 significance level for forward selection. In this
step, we consider the following regression model:
Close = a + b1 Open + b2 High + e
Since the partial F-statistic for the explanatory variable
“Open” is the smallest one, we select “Open” for backward
elimination. Since the p-value of the partial F-statistic for
“Open” is 0.0029, we reject the null hypothesis H0: b1 = 0 and
decide not to remove the variable “Open” by backward
elimination at significance level 0.1
(3)
In Step three, the explanatory variable “Low” is entered into
the model since its partial F-statistic (30.77) is larger than
the partial F-statistic for the explanatory variable “Volume”
and its p-value is less than 0.0001 which is less than the 0.05
significance level for forward selection. In this step, we
consider the following regression model:
Close = a + b1 Open + b2 High + b3 Low + e
Since the partial F-statistic for the explanatory variable
“Low” is the smallest one, we select “Low” for backward
elimination. Since the p-value of the partial F-statistic for
“Low” is less than 0.0001, we reject the null hypothesis H0:
b3 = 0 and decide not to remove the variable “Low” by
backward elimination at significance level 0.1
(4)
After Step three, no variables met the 0.050 significance
level for entry into the model by forward selection and the
stepwise regression stops.
(5)
The last part of the SAS output contains a table that provides
a summary for stepwise regression. From the table, we
notice that the explanatory variables “High”, “Open” and
“Low” are entered into the model and no variables are
removed from the model
Based on the results of the SAS output, the “best” model
chosen by stepwise regression with significance level 0.05
for forward selection and significance level 0.1 for backward
elimination is given by:
Close = a + b1 Open + b2 High + b3 Low + e
The fitted regression model is given by:
Close = 7.75846 – 0.79702 Open + 0.96239 High + 0.82713 Low
~ End of Chapter 5~