Chapter 5: Linear Regression Analysis – Part II
Department of Mathematics
Faculty of Science and Engineering
City University of Hong Kong
MA 3518: Applied Statistics

In multiple linear regression analysis, it is vital to achieve a parsimonious result without eliminating any important explanatory variables. This chapter focuses on selecting the significant explanatory variables to be included in a regression model. We introduce four standard procedures for variable selection, namely the best subset selection method, forward selection, backward elimination and stepwise regression. Topics included in this chapter are listed as follows:

Section 5.1: Variable Selection
Section 5.2: SAS for Variable Selection

The Philosophy of Parsimony!

Section 5.1: Variable Selection

1. Motivation: Consider the following multiple linear regression model

   Y = a + b1X1 + b2X2 + b3X3 + ... + bpXp + e

   It may be possible that we have included some unnecessary explanatory variables that are not significant for explaining the variation of the response variable Y or for predicting Y

   Remove the insignificant variables from the regression model and reduce the number of regression parameters

   To achieve a parsimonious result

2. Objective: Select significant explanatory variables to be included in, and identify insignificant explanatory variables to be excluded from, a regression model so that the final model provides a reasonably good prediction for the response variable and is easy to understand, interpret and apply

3. Question: What happens if a significant explanatory variable is excluded from a regression model?

   Biased estimates for the other regression coefficients and for the forecasts will be obtained

4. Question: What happens if an insignificant explanatory variable is included in a regression model?
   The standard errors of the estimators of the regression coefficients and of the forecasts will increase

5. Four standard procedures: There is no unique method that is guaranteed to pick the "best" regression model

   Best subset selection method
   Forward selection
   Backward elimination
   Stepwise regression

6. Best subset selection method: Consider the following multiple linear regression model with three explanatory variables X1, X2 and X3

   Y = a + b1X1 + b2X2 + b3X3 + e

   We can fit 7 possible regression models with the following combinations:

   (a) Y = a + b1X1 + e
   (b) Y = a + b2X2 + e
   (c) Y = a + b3X3 + e
   (d) Y = a + b1X1 + b2X2 + e
   (e) Y = a + b1X1 + b3X3 + e
   (f) Y = a + b2X2 + b3X3 + e
   (g) Y = a + b1X1 + b2X2 + b3X3 + e

   Note that 2^3 – 1 = 7

   Now, suppose we have p explanatory variables. Then, we can fit 2^p – 1 possible regression models

   Rationale: Fit all 2^p – 1 possible regression models and select the "best" one according to one of the following criteria

   (a) R2 or its adjusted version:
       - Criterion based on R2: Choose the "best" model with the largest R2
       - Criterion based on adjusted R2: Choose the "best" model with the largest adjusted R2
       Note that
       - R2 never decreases as p increases, so it tends to favour the largest model
       - R2 is therefore only useful for comparing different models when p is fixed, whereas the adjusted R2 penalizes additional parameters and can be used to compare models of different sizes

   (b) Mallows' Cp statistic:
       - Definition of the Cp statistic:

         Cp = (SSER / MSEF) – (n – 2k)

         where
         SSER = the sum of squared errors for the reduced model
         MSEF = the mean squared error for the full model
         n = the number of observations
         p = the number of explanatory variables in the reduced model
         k = the number of unknown parameters = p + 1
       - Note that
         (1) The full model is the one which contains all the explanatory variables
         (2) If the reduced model is true, E(Cp) is approximately equal to k (= p + 1)
       - Criterion one for model selection: Choose the "best" model with the smallest Cp
       - Criterion two for model
selection: Choose the "best" model with Cp closest to k (= p + 1)

7. Remarks: Shortcoming of the best subset selection method: It is not very practical to evaluate all possible subset regressions when the number of explanatory variables p is very large

   Sequential methods: Include or drop one variable at a time
   (a) Forward selection method
   (b) Backward elimination method
   (c) Stepwise method

8. Forward selection with significance level α1:

   Start with the simplest intercept model (i.e. without any explanatory variables)

   Y = a + e

   For each of the candidate regression equations, we compute the partial F-statistic (or the t-statistic or their corresponding p-values) for each of the regression coefficients and choose the explanatory variable with the largest partial F-statistic, say Xk

   Perform the following hypothesis test at significance level α1:

   H0: bk = 0 vs H1: bk ≠ 0

   If the null hypothesis H0 is rejected, the explanatory variable Xk is significant and should be included in the regression equation

   Determine which one of the remaining variables would have the largest partial F-statistic if each of them were added to the regression equation that already contains Xk

   Repeat the procedure in Step 2 – Step 4 until no further variables are significant

9. Backward elimination with significance level α2:

   Start with the full model (i.e. include all possible explanatory variables and the intercept)

   Y = a + b1X1 + b2X2 + b3X3 + ... + bpXp + e

   Choose the explanatory variable with the smallest partial F-statistic after fitting the full model, say Xk

   Perform the following hypothesis test at significance level α2:

   H0: bk = 0 vs H1: bk ≠ 0

   If the null hypothesis H0 is accepted, the explanatory variable Xk is not significant and should be excluded from the regression equation

   Repeat the procedure until no further variables can be removed
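The forward-selection loop described in point 8 can be sketched outside SAS. The following pure-Python illustration is a minimal sketch on invented toy data: the variable names X1–X3, the responses and the fixed entry threshold F_ENTER = 4.0 (used here as a stand-in for the α1 p-value test) are all assumptions for the example, not part of the chapter.

```python
# Minimal pure-Python sketch of forward selection via partial F-statistics.
# All data and the F_ENTER threshold are invented for illustration;
# SAS automates this with PROC REG ... / SELECTION = FORWARD SLE = ...;

def solve(A, b):
    """Solve the normal equations A x = b by Gaussian elimination."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def sse(cols, y):
    """SSE of an ordinary least-squares fit (with intercept) on the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in cols] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(X[i][a] * beta[a] for a in range(p))) ** 2 for i in range(n))

# Toy data: Y really depends on X1 and X3 only; X2 is irrelevant.
x1 = [float(i) for i in range(1, 13)]
x2 = [0.0, 0.0, 1.0, 1.0] * 3
x3 = [1.0, -1.0] * 6
e  = [0.1, 0.1, -0.1, -0.1, -0.1, -0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0]
y  = [3.0 * x1[i] + 2.0 * x3[i] + e[i] for i in range(12)]

candidates = {"X1": x1, "X2": x2, "X3": x3}
selected = []
F_ENTER = 4.0                                   # stand-in for the alpha_1 entry test
while True:
    cur = [candidates[v] for v in selected]
    sse_cur = sse(cur, y)
    best, best_F = None, 0.0
    for v in candidates:
        if v in selected:
            continue
        sse_new = sse(cur + [candidates[v]], y)
        df_err = len(y) - (len(cur) + 2)        # n - (intercept + slopes after entry)
        F = (sse_cur - sse_new) / (sse_new / df_err)
        if F > best_F:
            best, best_F = v, F
    if best is None or best_F < F_ENTER:
        break                                   # no remaining variable is significant
    selected.append(best)

print(selected)  # ['X1', 'X3'] -- X2 never passes the entry test
```

Backward elimination is the mirror image of this loop: start with all the candidates in the model and, at each pass, drop the variable with the smallest partial F-statistic whenever it fails the α2 test.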
10. Stepwise regression with significance levels α1 and α2: (A mixture of both forward selection and backward elimination)

    Start with the simplest intercept model (i.e. without any explanatory variables)

    Select the most significant variable (i.e. the one with the largest partial F-statistic), say Xk, by the forward selection method

    Perform the test for the following hypotheses at significance level α1:

    H0: bk = 0 vs H1: bk ≠ 0

    If H0 is accepted, record the summary statistics and STOP; otherwise, add Xk to the regression model and perform backward elimination

    Select the explanatory variable with the smallest partial F-statistic, say Xj

    Perform the test for the following hypotheses at significance level α2:

    H0: bj = 0 vs H1: bj ≠ 0

    If H0 is accepted, remove the variable Xj from the regression model; otherwise, go to Step 2

Section 5.2: SAS for Variable Selection

11. Use SAS procedures to perform variable selection: The SELECTION= option of the MODEL statement

    PROC REG DATA = name of dataset <options>;
    MODEL response = explanatory variables / SELECTION = <options>;
    RUN;

    Options for SELECTION=

    (a) Perform the best subset selection:
        SELECTION = RSQUARE
    (b) Perform forward selection:
        SELECTION = FORWARD
    (c) Perform backward elimination:
        SELECTION = BACKWARD
    (d) Perform stepwise regression:
        SELECTION = STEPWISE
    (e) Specify the significance level α1 for an explanatory variable to be included in the model during forward selection or stepwise regression:
        SLE = α1
        By default, the significance level is 50% for forward selection while it is 15% for stepwise regression
    (f) Specify the significance level α2 for an explanatory variable to be removed from the model during backward elimination or stepwise regression:
        SLS = α2
        By default, the significance level is 10% for backward elimination while it is 15% for stepwise regression
    (g) Display Mallows' Cp statistic (i.e.
can only be used with the option SELECTION = RSQUARE):
        SELECTION = RSQUARE CP
    (h) Display the adjusted R2 (can only be used with the option SELECTION = RSQUARE):
        SELECTION = RSQUARE ADJRSQ
    (i) Display the mean squared error MSE (can only be used with the option SELECTION = RSQUARE):
        SELECTION = RSQUARE MSE

12. Example: (Best subset selection method) Consider the following dataset from the last chapter, containing the daily open, high, low and close values and the trading volume of the S&P 500 global index from 2 Sep 2003 to 2 Oct 2003

    Data SP500;
    Input Date $ Open High Low Close Volume;
    CARDS;
    2-Oct-03 1017.25 1021.90 1013.38 1020.24 1091209984
    1-Oct-03 997.15 1018.22 997.15 1018.22 1329970048
    30-Sep-03 1004.72 1004.72 990.34 995.97 1360259968
    29-Sep-03 998.12 1006.91 995.31 1006.58 1128700000
    26-Sep-03 1003.31 1003.32 996.03 996.85 1237640000
    25-Sep-03 1010.24 1015.97 1003.26 1003.27 1276470000
    24-Sep-03 1029.09 1029.83 1008.93 1009.38 1378250000
    23-Sep-03 1023.26 1030.06 1021.50 1029.03 1124940000
    22-Sep-03 1036.30 1036.30 1018.27 1022.82 1082870000
    19-Sep-03 1039.64 1039.64 1031.85 1036.30 1328210000
    18-Sep-03 1025.80 1040.18 1025.66 1039.58 1257790000
    17-Sep-03 1028.91 1031.37 1024.23 1025.97 1135540000
    16-Sep-03 1015.07 1029.68 1015.07 1029.32 1161780000
    15-Sep-03 1018.68 1019.80 1013.59 1014.81 943448000
    12-Sep-03 1014.54 1019.68 1007.70 1018.63 1092610000
    11-Sep-03 1011.34 1020.84 1011.34 1016.42 1151640000
    10-Sep-03 1021.27 1021.28 1009.73 1010.92 1313300000
    9-Sep-03 1030.51 1030.51 1021.13 1023.16 1226980000
    8-Sep-03 1021.84 1032.42 1021.84 1031.64 1171310000
    5-Sep-03 1027.02 1029.24 1018.20 1021.39 1292100000
    4-Sep-03 1025.97 1029.15 1022.17 1027.97 1259030000
    3-Sep-03 1023.37 1029.36 1022.39 1026.27 1547380000
    2-Sep-03 1009.14 1022.63 1005.65 1021.99 1279880000
    ;
    RUN;

    Suppose the full model for the regression with the response variable "Close" and the explanatory variables "Open", "High", "Low" and
"Volume" is given as follows:

    Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

    Perform the best subset selection method using the following SAS procedures:

    PROC REG DATA = SP500;
    MODEL Close = Open High Low Volume / SELECTION = RSQUARE cp adjrsq mse;
    RUN;

    The SAS output is shown as follows:

    The SAS System    21:58 Wednesday, October 15, 2003

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Close

    R-Square Selection Method

    Number in              Adjusted
    Model      R-Square    R-Square    C(p)        MSE          Variables in Model
    1          0.7889      0.7788      58.0670     29.36623     High
    1          0.7612      0.7498      68.1654     33.21419     Low
    1          0.3822      0.3528      206.5163    85.93252     Open
    1          0.0041      -.0433      344.5411    138.52661    Volume
    --------------------------------------------------------------------------------
    2          0.8660      0.8526      31.8989     19.56446     Open High
    2          0.8444      0.8289      39.7845     22.71949     Open Low
    2          0.8103      0.7914      52.2348     27.70086     High Low
    2          0.7991      0.7790      56.3310     29.33976     High Volume
    2          0.7616      0.7377      70.0302     34.82080     Low Volume
    2          0.3882      0.3270      206.3250    89.35242     Open Volume
    --------------------------------------------------------------------------------
    3          0.9489      0.9408      3.6669      7.86171      Open High Low
    3          0.8779      0.8586      29.5784     18.77458     Open High Volume
    3          0.8448      0.8203      41.6383     23.85371     Open Low Volume
    3          0.8153      0.7861      52.4377     28.40193     High Low Volume
    --------------------------------------------------------------------------------
    4          0.9507      0.9397      5.0000      8.00201      Open High Low Volume

    Interpretations of the SAS output:

    (1) Since the largest R2 is 0.9507, the "best" model based on R2 is given by:

        Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

        (Note that R2 never decreases when a variable is added, so this criterion always picks the full model)

    (2) Since the largest adjusted R2 is 0.9408, the "best" model based on adjusted R2 is given by:

        Close = a + b1 Open + b2 High + b3 Low + e

    (3) Since the smallest Mallows' Cp statistic is 3.6669, the "best" model based on the first criterion of the Cp statistic is given by:

        Close = a + b1 Open + b2 High + b3 Low + e

    (4) Since the Mallows' Cp statistic
closest to p + 1 (= 4 + 1) is 5.0000, the "best" model based on the second criterion of the Cp statistic is given by:

        Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

        (Note that the full model always has Cp exactly equal to k, so in practice this criterion is more informative when applied to the reduced models)

13. Example: (Forward selection) Consider again the data set in the last example and use the following SAS procedures to perform forward selection

    PROC REG DATA = SP500;
    MODEL Close = Open High Low Volume / SELECTION = Forward;
    RUN;

    The SAS output is given as follows:

    The SAS System    11:41 Friday, October 17, 2003

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Close

    Forward Selection: Step 1
    Variable High Entered: R-Square = 0.7889 and C(p) = 58.0670

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              1    2304.35616        2304.35616     78.47      <.0001
    Error             21    616.69073         29.36623
    Corrected Total   22    2921.04689

                 Parameter    Standard
    Variable     Estimate     Error        Type II SS    F Value    Pr > F
    Intercept    -16.20403    116.91573    0.56409       0.02       0.8911
    High         1.01088      0.11412      2304.35616    78.47      <.0001

    Bounds on condition number: 1, 1
    ----------------------------------------------------------------------------

    Forward Selection: Step 2
    Variable Open Entered: R-Square = 0.8660 and C(p) = 31.8989

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              2    2529.75762        1264.87881     64.65      <.0001
    Error             20    391.28927         19.56446
    Corrected Total   22    2921.04689

                 Parameter    Standard
    Variable     Estimate     Error       Type II SS    F Value    Pr > F
    Intercept    -3.87743     95.49862    0.03225       0.00       0.9680
    Open         -0.54116     0.15944     225.40146     11.52      0.0029
    High         1.53702      0.18084     1413.29365    72.24      <.0001

    Bounds on condition number: 3.7694, 15.078
    ----------------------------------------------------------------------------

    Forward Selection: Step 3
    Variable Low
Entered: R-Square = 0.9489 and C(p) = 3.6669

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              3    2771.67443        923.89148      117.52     <.0001
    Error             19    149.37246         7.86171
    Corrected Total   22    2921.04689

                 Parameter    Standard
    Variable     Estimate     Error       Type II SS    F Value    Pr > F
    Intercept    7.75846      60.57343    0.12897       0.02       0.8994
    Open         -0.79702     0.11109     404.64474     51.47      <.0001
    High         0.96239      0.15451     305.01740     38.80      <.0001
    Low          0.82713      0.14911     241.91681     30.77      <.0001

    Bounds on condition number: 7.5278, 56.789
    ----------------------------------------------------------------------------

    Forward Selection: Step 4
    Variable Volume Entered: R-Square = 0.9507 and C(p) = 5.0000

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              4    2777.01075        694.25269      86.76      <.0001
    Error             18    144.03613         8.00201
    Corrected Total   22    2921.04689

                 Parameter     Standard
    Variable     Estimate      Error          Type II SS    F Value    Pr > F
    Intercept    10.09909      61.17871       0.21805       0.03       0.8707
    Open         -0.79023      0.11239        395.60059     49.44      <.0001
    High         0.98725       0.15883        309.18435     38.64      <.0001
    Low          0.79761       0.15471        212.68084     26.58      <.0001
    Volume       -3.95259E-9   4.840167E-9    5.33633       0.67       0.4248

    Bounds on condition number: 7.9624, 82.844
    ----------------------------------------------------------------------------

    All variables have been entered into the model.
    Summary of Forward Selection

            Variable    Number     Partial     Model
    Step    Entered     Vars In    R-Square    R-Square    C(p)       F Value    Pr > F
    1       High        1          0.7889      0.7889      58.0670    78.47      <.0001
    2       Open        2          0.0772      0.8660      31.8989    11.52      0.0029
    3       Low         3          0.0828      0.9489      3.6669     30.77      <.0001
    4       Volume      4          0.0018      0.9507      5.0000     0.67       0.4248

    Interpretations of the SAS output:

    (1) In Step one, the explanatory variable "High" is entered into the model since its F-statistic (78.47) is the largest among the F-statistics of all the explanatory variables; that is, the following regression model is considered:

        Close = a + b2 High + e

        The p-value of the partial F-statistic for the explanatory variable "High" is less than 0.0001. Hence, we reject the null hypothesis H0: b2 = 0 and decide to include the variable "High" in our model at significance level 0.5 (the default for forward selection)

    (2) In Step two, the explanatory variable "Open" is entered into the model since its partial F-statistic (11.52) is larger than those of the remaining explanatory variables "Low" and "Volume". In this step, we consider the following regression model:

        Close = a + b1 Open + b2 High + e

        The additional proportion of variation of the response variable that is explained by including the explanatory variable "Open" is 0.8660 – 0.7889 (≈ 0.077)

        The p-value of the partial F-statistic for the explanatory variable "Open" is 0.0029. Hence, we reject the null hypothesis H0: b1 = 0 and decide to include the variable "Open" in our model at significance level 0.5

    (3) In Step three, the explanatory variable "Low" is entered into the model since its partial F-statistic (30.77) is larger than the partial F-statistic for the explanatory variable "Volume".
In this step, we consider the following regression model:

        Close = a + b1 Open + b2 High + b3 Low + e

        The additional proportion of variation of the response variable that is explained by including the explanatory variable "Low" is 0.9489 – 0.8660 (≈ 0.083)

        The p-value of the partial F-statistic for the explanatory variable "Low" is less than 0.0001. Hence, we reject the null hypothesis H0: b3 = 0 and decide to include the variable "Low" in our model at significance level 0.5

    (4) In Step four, the last explanatory variable "Volume" is entered into the model. Hence, we consider the full model as follows:

        Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

        The additional proportion of variation of the response variable that is explained by including the explanatory variable "Volume" is only 0.9507 – 0.9489 (≈ 0.002)

        The p-value of the partial F-statistic for the explanatory variable "Volume" is 0.4248, which is still below the default entry level of 0.5. Hence, we reject the null hypothesis H0: b4 = 0 and decide to include the variable "Volume" in our model at significance level 0.5

    (5) The last part of the SAS output contains a table that provides a summary of the forward selection. All the p-values of the partial F-statistics for the explanatory variables are displayed. We can compare the p-values with the default significance level 0.5 and decide whether an explanatory variable should be included in each step of the forward selection

    Based on the results of the SAS output, the "best" model chosen by forward selection with significance level 0.5 is given by:

        Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

    The fitted regression model is presented as follows:

        Close = 10.09909 – 0.79023 Open + 0.98725 High + 0.79761 Low – 3.95259×10^-9 Volume
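The selection criteria reported in these outputs can be reproduced by hand. The following Python sketch recomputes the adjusted R2, Mallows' Cp and partial R2 values from numbers copied out of the SAS outputs above (n = 23 observations):

```python
# Reproduce adjusted R-square, Mallows' Cp and the partial R-square values
# reported in the SAS outputs above (all numbers copied from those outputs).

n = 23                    # number of observations in the SP500 dataset

# Model "Open High Low": R-square and SSE from the output; MSE of the full model
r2, sse_reduced = 0.9489, 149.37246
mse_full = 8.00201        # MSE of the 4-variable full model
k = 4                     # intercept + 3 slopes

adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k)
cp = sse_reduced / mse_full - (n - 2 * k)
print(round(adj_r2, 4), round(cp, 4))   # 0.9408 3.6669, as in the output

# The partial R-square column of the forward-selection summary is the
# increase in model R-square when each variable enters:
model_r2 = [0.7889, 0.8660, 0.9489, 0.9507]   # after High, +Open, +Low, +Volume
partial = [model_r2[0]] + [round(b - a, 4) for a, b in zip(model_r2, model_r2[1:])]
print(partial)   # close to the reported 0.7889, 0.0772, 0.0828, 0.0018
                 # (small differences come from rounding of the displayed R-squares)
```

This also explains why the text's hand computation 0.8660 – 0.7889 = 0.0771 differs slightly from the 0.0772 shown in the SAS summary: SAS computes the difference from unrounded R-square values.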
14. Example: (Backward elimination) Consider again the data set in the last example and use the following SAS procedures to perform backward elimination

    PROC REG DATA = SP500;
    MODEL Close = Open High Low Volume / SELECTION = Backward sls = 0.05;
    RUN;

    The SAS output is given by:

    The SAS System    11:55 Friday, October 17, 2003

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Close

    Backward Elimination: Step 0
    All Variables Entered: R-Square = 0.9507 and C(p) = 5.0000

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              4    2777.01075        694.25269      86.76      <.0001
    Error             18    144.03613         8.00201
    Corrected Total   22    2921.04689

                 Parameter     Standard
    Variable     Estimate      Error          Type II SS    F Value    Pr > F
    Intercept    10.09909      61.17871       0.21805       0.03       0.8707
    Open         -0.79023      0.11239        395.60059     49.44      <.0001
    High         0.98725       0.15883        309.18435     38.64      <.0001
    Low          0.79761       0.15471        212.68084     26.58      <.0001
    Volume       -3.95259E-9   4.840167E-9    5.33633       0.67       0.4248

    Bounds on condition number: 7.9624, 82.844
    ----------------------------------------------------------------------------

    Backward Elimination: Step 1
    Variable Volume Removed: R-Square = 0.9489 and C(p) = 3.6669

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              3    2771.67443        923.89148      117.52     <.0001
    Error             19    149.37246         7.86171
    Corrected Total   22    2921.04689

                 Parameter    Standard
    Variable     Estimate     Error       Type II SS    F Value    Pr > F
    Intercept    7.75846      60.57343    0.12897       0.02       0.8994
    Open         -0.79702     0.11109     404.64474     51.47      <.0001
    High         0.96239      0.15451     305.01740     38.80      <.0001
    Low          0.82713      0.14911     241.91681     30.77      <.0001

    Bounds on condition number: 7.5278, 56.789
    ----------------------------------------------------------------------------

    All variables left in the model are significant at the 0.0500 level.
    Summary of Backward Elimination

            Variable    Number     Partial     Model
    Step    Removed     Vars In    R-Square    R-Square    C(p)      F Value    Pr > F
    1       Volume      3          0.0018      0.9489      3.6669    0.67       0.4248

    Interpretations of the SAS output:

    (1) In Step zero, the following full model is considered:

        Close = a + b1 Open + b2 High + b3 Low + b4 Volume + e

        The partial F-statistic for each of the explanatory variables is displayed. The explanatory variable "Volume" has the smallest partial F-statistic (0.67). Hence, we select the variable "Volume" and investigate whether it should be removed from the full model. Since the p-value for the variable "Volume" is 0.4248, we do not reject the null hypothesis H0: b4 = 0 and decide to remove the variable "Volume" from the full model at the pre-specified significance level 0.05

    (2) In Step one, we consider the following regression model:

        Close = a + b1 Open + b2 High + b3 Low + e

        Since all the p-values are less than 0.0001, we conclude that all of the remaining explanatory variables are significant at the 5% significance level

        The reduction in the proportion of variation of the response variable explained by the model after removing "Volume" is only 0.9507 – 0.9489 = 0.0018

    (3) The last part of the SAS output provides a summary of the backward elimination with significance level 5%. From this part of the output, we notice that the variable "Volume" should be removed from the full model and the "best" model is given as follows:

        Close = a + b1 Open + b2 High + b3 Low + e

    The fitted regression model is given by:

        Close = 7.75846 – 0.79702 Open + 0.96239 High + 0.82713 Low
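The removal rule applied in each backward-elimination step (drop the variable with the smallest partial F-statistic, i.e. the largest p-value, whenever that p-value exceeds SLS) can be sketched as follows. The p-values are copied from Step 0 of the output above, with "<.0001" represented by the placeholder 0.0001:

```python
# One backward-elimination step: remove the variable whose partial F is the
# smallest, equivalently whose p-value is the largest, if that p-value
# exceeds the stay significance level SLS. The p-values below are copied
# from Step 0 of the SAS output; "<.0001" is stood in for by 0.0001.

def backward_step(p_values, sls):
    """Return the variable to remove, or None if every p-value is below SLS."""
    var, p = max(p_values.items(), key=lambda kv: kv[1])
    return var if p > sls else None

step0 = {"Open": 0.0001, "High": 0.0001, "Low": 0.0001, "Volume": 0.4248}
print(backward_step(step0, sls=0.05))   # Volume

step1 = {"Open": 0.0001, "High": 0.0001, "Low": 0.0001}
print(backward_step(step1, sls=0.05))   # None -- elimination stops here
```

This reproduces the decisions above: "Volume" is removed in the first step, and no further variable can be removed, so the procedure stops with the three-variable model.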
15. Example: (Stepwise regression) Consider again the data set in the last example and use the following SAS procedures to perform stepwise regression

    PROC REG DATA = SP500;
    MODEL Close = Open High Low Volume / SELECTION = stepwise sle = 0.05 sls = 0.1;
    RUN;

    The SAS output is given by:

    The SAS System    11:59 Friday, October 17, 2003

    The REG Procedure
    Model: MODEL1
    Dependent Variable: Close

    Stepwise Selection: Step 1
    Variable High Entered: R-Square = 0.7889 and C(p) = 58.0670

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              1    2304.35616        2304.35616     78.47      <.0001
    Error             21    616.69073         29.36623
    Corrected Total   22    2921.04689

                 Parameter    Standard
    Variable     Estimate     Error        Type II SS    F Value    Pr > F
    Intercept    -16.20403    116.91573    0.56409       0.02       0.8911
    High         1.01088      0.11412      2304.35616    78.47      <.0001

    Bounds on condition number: 1, 1
    ----------------------------------------------------------------------------

    Stepwise Selection: Step 2
    Variable Open Entered: R-Square = 0.8660 and C(p) = 31.8989

    Analysis of Variance
    Source            DF    Sum of Squares    Mean Square    F Value    Pr > F
    Model              2    2529.75762        1264.87881     64.65      <.0001
    Error             20    391.28927         19.56446
    Corrected Total   22    2921.04689

                 Parameter    Standard
    Variable     Estimate     Error       Type II SS    F Value    Pr > F
    Intercept    -3.87743     95.49862    0.03225       0.00       0.9680
    Open         -0.54116     0.15944     225.40146     11.52      0.0029
    High         1.53702      0.18084     1413.29365    72.24      <.0001

    Bounds on condition number: 3.7694, 15.078
    ----------------------------------------------------------------------------

    Stepwise Selection: Step 3
    Variable Low Entered: R-Square = 0.9489 and C(p) = 3.6669

    Analysis of Variance
    Source            DF    Sum of Squares
    Model              3    2771.67443
    Error             19    149.37246
    Corrected Total   22    2921.04689
    (Mean Square: Model 923.89148, Error 7.86171; F Value 117.52, Pr > F <.0001)

                 Parameter    Standard
    Variable     Estimate     Error       Type II SS    F Value    Pr > F
    Intercept    7.75846      60.57343    0.12897       0.02       0.8994
    Open         -0.79702     0.11109     404.64474     51.47      <.0001
    High         0.96239      0.15451     305.01740     38.80      <.0001
    Low          0.82713      0.14911     241.91681     30.77      <.0001

    Bounds on condition number: 7.5278, 56.789
    ----------------------------------------------------------------------------

    All variables left in the model are significant at the 0.1000 level.
    No other variable met the 0.0500 significance level for entry into the model.

    Summary of Stepwise Selection

            Variable    Variable    Number     Partial     Model
    Step    Entered     Removed     Vars In    R-Square    R-Square    C(p)       F Value    Pr > F
    1       High                    1          0.7889      0.7889      58.0670    78.47      <.0001
    2       Open                    2          0.0772      0.8660      31.8989    11.52      0.0029
    3       Low                     3          0.0828      0.9489      3.6669     30.77      <.0001

    Interpretations of the SAS output:

    (1) In Step one, the explanatory variable "High" is entered into the model since its F-statistic (78.47) is the largest among the F-statistics of all the explanatory variables and its p-value is less than 0.0001, which is below the 0.05 significance level for entry; that is, the following regression model is considered:

        Close = a + b2 High + e

        Since the p-value of the partial F-statistic for the explanatory variable "High" is less than 0.0001, we reject the null hypothesis H0: b2 = 0 and decide not to remove the variable "High" by backward elimination at significance level 0.1

    (2) In Step two, the explanatory variable "Open" is entered into the model since its partial F-statistic (11.52) is larger than those of the remaining explanatory variables "Low" and "Volume" and its p-value is 0.0029, which is below the 0.05 significance level for entry.
In this step, we consider the following regression model:

        Close = a + b1 Open + b2 High + e

        Since the partial F-statistic for the explanatory variable "Open" is the smallest one in the current model, we select "Open" for the backward-elimination check. Since the p-value of the partial F-statistic for "Open" is 0.0029, we reject the null hypothesis H0: b1 = 0 and decide not to remove the variable "Open" by backward elimination at significance level 0.1

    (3) In Step three, the explanatory variable "Low" is entered into the model since its partial F-statistic (30.77) is larger than the partial F-statistic for the explanatory variable "Volume" and its p-value is less than 0.0001, which is below the 0.05 significance level for entry. In this step, we consider the following regression model:

        Close = a + b1 Open + b2 High + b3 Low + e

        Since the partial F-statistic for the explanatory variable "Low" is the smallest one in the current model, we select "Low" for the backward-elimination check. Since the p-value of the partial F-statistic for "Low" is less than 0.0001, we reject the null hypothesis H0: b3 = 0 and decide not to remove the variable "Low" by backward elimination at significance level 0.1

    (4) After Step three, no remaining variable meets the 0.05 significance level for entry into the model by forward selection, and the stepwise regression stops

    (5) The last part of the SAS output contains a table that provides a summary of the stepwise regression. From the table, we notice that the explanatory variables "High", "Open" and "Low" are entered into the model and no variables are removed from the model

    Based on the results of the SAS output, the "best" model chosen by stepwise regression with entry significance level 0.05 and stay significance level 0.1 is given by:

        Close = a + b1 Open + b2 High + b3 Low + e

    The fitted regression model is given by:

        Close = 7.75846 – 0.79702 Open + 0.96239 High + 0.82713 Low

~ End of Chapter 5 ~