Multiple Regression - Selecting the Best Equation

When fitting a multiple linear regression model, a researcher will likely include independent variables that are not important in predicting the dependent variable Y. In the analysis the researcher will try to eliminate these variables from the final equation. The objective in trying to find the "best equation" is to find the simplest model that adequately fits the data. This will not necessarily be the model that explains the most variance in the dependent variable Y (the equation with the highest value of R²); that equation is always the one containing all of the independent variables. Our objective is to find the equation with the fewest variables that still explains a percentage of the variance in the dependent variable comparable to the percentage explained with all the variables in the equation.

An Example

The example we will consider looks at how the heat evolved in the curing of cement is affected by the amounts of various chemicals included in the cement mixture. The independent and dependent variables are listed below:

X1 = amount of tricalcium aluminate, 3 CaO·Al2O3
X2 = amount of tricalcium silicate, 3 CaO·SiO2
X3 = amount of tetracalcium alumino ferrite, 4 CaO·Al2O3·Fe2O3
X4 = amount of dicalcium silicate, 2 CaO·SiO2
Y  = heat evolved, in calories per gram of cement

X1    7    1   11   11    7   11    3    1    2   21    1   11   10
X2   26   29   56   31   52   55   71   31   54   47   40   66   68
X3    6   15    8    8    6    9   17   22   18    4   23    9    8
X4   60   52   20   47   33   22    6   44   22   26   34   12   12
Y    79   74  104   88   96  109  103   73   93  116   84  113  109

Techniques for Selecting the "Best" Regression Equation

The best regression equation is not necessarily the equation that explains most of the variance in Y (the highest R²).
• That equation will always be the one with all the variables included.
• The best equation should also be simple and interpretable (i.e. contain a small number of variables).
• Simple (interpretable) and reliable are opposing criteria.
• The best equation is a compromise between the two.

I All Possible Regressions

Suppose we have the p independent variables X1, X2, ..., Xp. Then there are 2^p subsets of variables, and hence 2^p possible models.

Example (k = 3): X1, X2, X3

Variables in Equation    Model
no variables             Y = β0 + ε
X1                       Y = β0 + β1X1 + ε
X2                       Y = β0 + β2X2 + ε
X3                       Y = β0 + β3X3 + ε
X1, X2                   Y = β0 + β1X1 + β2X2 + ε
X1, X3                   Y = β0 + β1X1 + β3X3 + ε
X2, X3                   Y = β0 + β2X2 + β3X3 + ε
X1, X2, X3               Y = β0 + β1X1 + β2X2 + β3X3 + ε

Use of R²

1. Carry out the 2^p runs, one for each subset of variables, and divide the runs into the following sets:
   Set 0: no variables.
   Set 1: one independent variable.
   ...
   Set p: p independent variables.
2. Order the runs in each set according to R².
3. Examine the leaders in each set, looking for consistent patterns and taking into account the correlations between the independent variables.

Example (k = 4): X1, X2, X3, X4

Set      Variables in leading runs    100 R² %
Set 1:   X4                           67.5 %
Set 2:   X1, X2                       97.9 %
         X1, X4                       97.2 %
Set 3:   X1, X2, X4                   98.234 %
Set 4:   X1, X2, X3, X4               98.237 %

Examination of the correlation coefficients reveals a high correlation between X1 and X3 (r13 = -0.824) and between X2 and X4 (r24 = -0.973).

Best equation: Y = β0 + β1X1 + β4X4 + ε
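To make the all-possible-regressions idea concrete, here is a minimal sketch in Python (assuming NumPy is available) that enumerates all 2^4 = 16 subsets of the four cement predictors and prints R² for each, grouped into the sets described above. The array names and the r_squared helper are illustrative choices for these notes, not part of any particular package.

```python
from itertools import combinations
import numpy as np

# Cement data from the table above: 13 observations on X1-X4 and Y.
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20], [11, 31, 8, 47],
              [7, 52, 6, 33], [11, 55, 9, 22], [3, 71, 17, 6], [1, 31, 22, 44],
              [2, 54, 18, 22], [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
y = np.array([79, 74, 104, 88, 96, 109, 103, 73, 93, 116, 84, 113, 109], dtype=float)

def r_squared(cols):
    """R² for Y regressed on an intercept plus the predictors listed in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    rss = np.sum((y - Z @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1 - rss / tss

# One run for each of the 2^4 = 16 subsets, grouped into Sets 0-4 by size.
for k in range(5):
    for cols in combinations(range(4), k):
        label = ", ".join(f"X{j + 1}" for j in cols) or "no variables"
        print(f"Set {k}: {label:<18} 100 R² = {100 * r_squared(cols):.1f} %")
```

Sorting each set by R² should reproduce the pattern in the table above: X4 leads among single variables, while {X1, X2} and {X1, X4} come close to the full model (the exact percentages may differ slightly because the Y values in the table are rounded).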
Use of the Residual Mean Square (RMS), s²

When all of the variables having a non-zero effect have been included in the model, the residual mean square is an estimate of σ². If "significant" variables have been left out, the RMS will be biased upward.

No. of variables p    RMS s²(p)                                Average s²(p)
1                     115.06, 82.39, 176.31, 80.35             113.53
2                     5.79*, 122.71, 7.48**, 86.59, 17.57      47.00
3                     5.35, 5.33, 5.65, 8.20                   6.13
4                     5.98                                     5.98

* run X1, X2      ** run X1, X4

The residual mean square settles down at s² ≈ 6.

Use of Mallows Ck

Mallows Ck = RSSk / s²(complete) − [n − 2(k + 1)],

where RSSk is the residual sum of squares for a run with k variables and s²(complete) is the residual mean square from the equation containing all the variables. If the equation with k variables is adequate, then both s²(complete) and RSSk/(n − k − 1) are estimating σ². Then

Ck ≈ [(n − k − 1)σ²]/σ² − [n − 2(k + 1)] = [n − k − 1] − [n − 2(k + 1)] = k + 1.

Thus if we plot, for each run, Ck versus k and look for runs with Ck close to k + 1, we will be able to identify models giving a reasonable fit.

Run                         Ck                                    k + 1
no variables                443.2                                 1
1, 2, 3, 4                  202.5, 142.5, 315.2, 138.7            2
12, 13, 14, 23, 24, 34      2.7, 198.1, 5.5, 62.4, 138.2, 22.4    3
123, 124, 134, 234          3.0, 3.0, 3.5, 7.5                    4
1234                        5.0                                   5
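The s²(p) and Ck columns can be generated with the same kind of all-subsets loop as before. The following sketch (again an illustration written for these notes, assuming NumPy; the cement data arrays are repeated so the snippet runs on its own) computes the residual mean square and Mallows Ck for every subset. Because the Y values in the table are rounded, the numbers will be close to, but not exactly, those shown above.

```python
from itertools import combinations
import numpy as np

# Cement data from the table above (same arrays as in the earlier sketch).
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20], [11, 31, 8, 47],
              [7, 52, 6, 33], [11, 55, 9, 22], [3, 71, 17, 6], [1, 31, 22, 44],
              [2, 54, 18, 22], [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
y = np.array([79, 74, 104, 88, 96, 109, 103, 73, 93, 116, 84, 113, 109], dtype=float)
n = len(y)

def rss(cols):
    """Residual sum of squares for Y on an intercept plus the predictors in `cols`."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ beta) ** 2)

# s²(complete): residual mean square from the full four-variable model.
s2_complete = rss(range(4)) / (n - 4 - 1)

for k in range(5):
    for cols in combinations(range(4), k):
        s2_k = rss(cols) / (n - k - 1)                    # residual mean square s²(k)
        ck = rss(cols) / s2_complete - (n - 2 * (k + 1))  # Mallows Ck
        label = ", ".join(f"X{j + 1}" for j in cols) or "no variables"
        print(f"k = {k}: {label:<18} s² = {s2_k:8.2f}  Ck = {ck:7.1f}  (k + 1 = {k + 1})")
```

Subsets whose Ck lands near k + 1, such as {X1, X2} and {X1, X2, X4} in the table above, are the ones flagged as giving a reasonable fit.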
II Backward Elimination

In this procedure the complete regression equation containing all the variables X1, X2, ..., Xp is determined first. Variables are then checked one at a time, and the least significant is dropped from the model at each stage. The procedure terminates when all of the variables remaining in the equation provide a significant contribution to the prediction of the dependent variable Y. The precise algorithm proceeds as follows:

1. Fit a regression equation containing all variables.
2. A partial F-test (F to remove) is computed for each of the independent variables still in the equation:

   Partial F (F to remove) = [RSS2 − RSS1]/MSE1, where
   • RSS1 = the residual sum of squares with all variables presently in the equation,
   • RSS2 = the residual sum of squares with one of the variables removed, and
   • MSE1 = the mean square for error with all variables presently in the equation.

3. The lowest partial F value (F to remove) is compared with Fα for some pre-specified α. If Flowest ≤ Fα, remove that variable and return to step 2. If Flowest > Fα, accept the equation as it stands.

Example (k = 4, same example as before): X1, X2, X3, X4

1. X1, X2, X3, X4 in the equation. The lowest partial F = 0.018 (X3) is compared with Fα(1, 8) = 3.46 for α = 0.10. Remove X3.
2. X1, X2, X4 in the equation. The lowest partial F = 1.86 (X4) is compared with Fα(1, 9) = 3.36 for α = 0.10. Remove X4.
3. X1, X2 in the equation. The partial F values for both X1 and X2 exceed Fα(1, 10) = 3.36 for α = 0.10. The equation is accepted as it stands.

Note: F to remove = partial F.

Y = 52.58 + 1.47 X1 + 0.66 X2

III Forward Selection

In this procedure we start with no variables in the equation. Variables are then checked one at a time, and the most significant is added to the model at each stage. The procedure terminates when none of the variables not in the equation has a significant effect on the dependent variable Y. The precise algorithm proceeds as follows:

1. Starting with no variables in the equation, a partial F-test (F to enter) is computed for each of the independent variables not in the equation:

   Partial F (F to enter) = [RSS2 − RSS1]/MSE1, where
   • RSS1 = the residual sum of squares with the variables presently in the equation and the variable under consideration,
   • RSS2 = the residual sum of squares with the variables presently in the equation, and
   • MSE1 = the mean square for error with the variables presently in the equation and the variable under consideration.

2. The largest partial F value (F to enter) is compared with Fα for some pre-specified α. If Flargest > Fα, add that variable and return to step 1. If Flargest ≤ Fα, accept the equation as it stands.

IV Stepwise Regression

In this procedure we start with no variables in the model. Variables are then checked one at a time, using the partial correlation coefficient (equivalently, F to enter) as a measure of importance in predicting the dependent variable Y. At each stage the variable with the highest significant partial correlation coefficient (F to enter) is added to the model. Once this has been done, the partial F statistic (F to remove) is computed for all variables now in the model, to check whether any of the variables previously added can now be deleted. The procedure continues until no further variables can be added to or deleted from the model.

The partial correlation coefficient for a given variable is the correlation between that variable and the response when the independent variables presently in the equation are held fixed. It is also the correlation between the given variable and the residuals computed from fitting an equation with the independent variables presently in the equation. Equivalently,

(partial correlation of Xi with the variables Xi1, Xi2, ... in the equation)² = the percentage of the variance in Y explained by Xi that is left unexplained by Xi1, Xi2, ... .

Example (k = 4, same example as before): X1, X2, X3, X4

1. With no variables in the equation, the correlation of each independent variable with the dependent variable Y is computed. The highest significant correlation (r = 0.821) is with variable X4, so the decision is made to include X4. Regress Y on X4: the regression is significant, so we keep X4.

2. Compute the partial correlation coefficients of Y with each of the other independent variables, given X4 in the equation. The highest partial correlation is with the variable X1 ([rY1.4]² = 0.915), so the decision is made to include X1. Regress Y on X1 and X4: R² = 0.972, F = 176.63. For X1 the partial F value = 108.22 (F0.10(1, 8) = 3.46), so retain X1. For X4 the partial F value = 154.295 (F0.10(1, 8) = 3.46), so retain X4.

3. Compute the partial correlation coefficients of Y with each of the other independent variables, given X1 and X4 in the equation. The highest partial correlation is with the variable X2 ([rY2.14]² = 0.358), so the decision is made to include X2. Regress Y on X1, X2 and X4: R² = 0.982. The lowest partial F value = 1.863, for X4 (F0.10(1, 9) = 3.36). Remove X4, leaving X1 and X2.
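All three procedures revolve around the same partial F statistic and differ only in whether variables are added, removed, or both. The sketch below is a minimal stepwise implementation written for these notes (assuming NumPy and SciPy are available; the function names are illustrative, not from any textbook code). Dropping the removal step turns it into forward selection, while starting from the full model and only removing variables gives backward elimination.

```python
import numpy as np
from scipy.stats import f as f_dist

# Cement data from the table above (same arrays as in the earlier sketches).
X = np.array([[7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20], [11, 31, 8, 47],
              [7, 52, 6, 33], [11, 55, 9, 22], [3, 71, 17, 6], [1, 31, 22, 44],
              [2, 54, 18, 22], [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
              [10, 68, 8, 12]], dtype=float)
y = np.array([79, 74, 104, 88, 96, 109, 103, 73, 93, 116, 84, 113, 109], dtype=float)
n, alpha = len(y), 0.10

def rss(cols):
    """Residual sum of squares for Y on an intercept plus the predictors in `cols`."""
    Z = np.column_stack([np.ones(n)] + [X[:, j] for j in sorted(cols)])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return np.sum((y - Z @ beta) ** 2)

def partial_f(with_var, without_var):
    """Partial F for the single variable that is in `with_var` but not in `without_var`."""
    df_error = n - len(with_var) - 1               # error df for the larger model
    mse = rss(with_var) / df_error                 # MSE1 in the notes
    return (rss(without_var) - rss(with_var)) / mse, df_error

in_model, changed = set(), True
while changed:
    changed = False
    # F to enter: best candidate among the variables not yet in the model.
    out = [j for j in range(4) if j not in in_model]
    if out:
        j, (f_val, df) = max(((j, partial_f(in_model | {j}, in_model)) for j in out),
                             key=lambda item: item[1][0])
        if f_val > f_dist.ppf(1 - alpha, 1, df):
            in_model.add(j)
            changed = True
            print(f"enter  X{j + 1}  (partial F = {f_val:.2f})")
    # F to remove: weakest variable currently in the model.
    if len(in_model) > 1:
        j, (f_val, df) = min(((j, partial_f(in_model, in_model - {j})) for j in in_model),
                             key=lambda item: item[1][0])
        if f_val <= f_dist.ppf(1 - alpha, 1, df):
            in_model.remove(j)
            changed = True
            print(f"remove X{j + 1}  (partial F = {f_val:.2f})")

print("final model:", sorted(f"X{j + 1}" for j in in_model))
```

With α = 0.10 and the cement data it should follow the same path as the worked example above (X4 enters, then X1, then X2, after which X4 is removed), ending with X1 and X2 in the model.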