Statistics 511          Study Guide 8          Fall 2001

A) Variable Selection

1) On study guide 6 you saw data relating the overall rating of supervisors by their employees to 6 ratings of their performance in various areas. In this problem, we try to find a smaller set of variables that adequately predicts the overall rating. The variables in the analysis are:

RATING     Overall rating of job being done by supervisor
COMPL      Handles employee complaints
PRIV       Does not allow special privileges
NEW        Opportunity to learn new things
RAISE      Raises based on performances
CRITIC     Too critical of poor performances
ADVANCE    Rate of advancing to better jobs

a) Below is the ANOVA table for the regression of overall rating on all the variables. Using backwards stepwise regression, which variable would be removed first from the regression?

DEP VARIABLE: RATING

ANALYSIS OF VARIANCE

SOURCE     DF    SUM OF SQUARES    MEAN SQUARE    F VALUE    PROB>F
MODEL       6        3147.96634      524.66106     10.502    0.0001
ERROR      23        1149.00032    49.95653586
C TOTAL    29        4296.96667

ROOT MSE    7.067994        R-SQUARE    0.7326

PARAMETER ESTIMATES

                    PARAMETER       STANDARD      T FOR H0:
VARIABLE    DF       ESTIMATE          ERROR    PARAMETER=0    PROB > |T|
INTERCEP     1    10.78707639    11.58925724          0.931        0.3616
COMPL        1     0.61318761     0.16098311          3.809        0.0009
PRIV         1    -0.07305014     0.13572469         -0.538        0.5956
NEW          1     0.32033212     0.16852032          1.901        0.0699
RAISE        1     0.08173213     0.22147768          0.369        0.7155
CRITIC       1     0.03838145     0.14699544          0.261        0.7963
ADVANCE      1    -0.21705668     0.17820947         -1.218        0.2356

b) Below are the summary statistics for the regression of overall rating on each independent variable separately. Using forwards stepwise regression, which variable would be added first to the regression?

VARIABLE    MODEL R**2          F    PROB>F
COMPL           0.6813    59.8608    0.0001
PRIV            0.1816     6.2121    0.0189
NEW             0.3890    17.8246    0.0002
RAISE           0.3483    14.9622    0.0006
CRITIC          0.0245     0.7024    0.4091
ADVANCE         0.0241     0.6900    0.4132

c) The model selected by stepwise regression is y = β0 + β1 COMPL + β2 NEW + ε. Is this the best model for predicting overall rating?

d) The plot of R² versus the number of parameters for this data set is below. About how many variables should be in the model?

[Plot: PLOT OF _RSQ_*_P_, SYMBOL USED IS *. Vertical axis: R-SQUARED, 0.680 to 0.735. Horizontal axis: NUMBER OF PARAMETERS, 2 to 7. One point per number of parameters, read approximately from the plot: 2: 0.681, 3: 0.708, 4: 0.725, 5: 0.729, 6: 0.731, 7: 0.733.]

e) Below is the output from all subsets regression. What models appear to be good predictors of overall rating?

N=30    REGRESSION MODELS FOR DEPENDENT VARIABLE: RATING    MODEL: MODEL1

NUMBER IN
MODEL        R-SQUARE      VARIABLES IN MODEL
1            0.02447321    CRITIC
1            0.18157559    PRIV
1            0.34826403    RAISE
1            0.38897445    NEW
1            0.68131416    COMPL
---------------------------------------------------
2            0.68131649    COMPL CRITIC
2            0.68228010    COMPL ADVANCE
2            0.68306390    COMPL PRIV
2            0.68389794    COMPL RAISE
2            0.70801520    COMPL NEW
---------------------------------------------------
3            0.68952873    COMPL RAISE ADVANCE
3            0.70801569    COMPL NEW CRITIC
3            0.70829161    COMPL NEW RAISE
3            0.71500445    COMPL NEW PRIV
3            0.72559500    COMPL NEW ADVANCE
---------------------------------------------------
4            0.71503039    COMPL NEW PRIV CRITIC
4            0.71522371    COMPL NEW PRIV RAISE
4            0.72726509    COMPL NEW ADVANCE CRITIC
4            0.72851428    COMPL NEW ADVANCE RAISE
4            0.72934125    COMPL NEW ADVANCE PRIV

A1)

a) Using BACKWARDS stepwise regression, we would eliminate the least significant variable from the full model.
(However, read the note on NWK page 419.) CRITIC has the smallest t-value in absolute value (0.261), and therefore the largest p-value (0.7963), among all variables in the full model, so it would be eliminated first.

b) Using FORWARDS stepwise regression, we would add the most significant variable to the model. COMPL has the largest F-value (59.8608) and the smallest p-value (0.0001) among the one-variable regressions, so it would be the variable added first.

c) The “best” model depends upon the criteria we choose to evaluate the model. A better prediction of overall rating could be obtained from a 3-variable model (COMPL, NEW, ADVANCE; R² = 0.7256, versus 0.7080 for COMPL and NEW), but the larger number of variables could cause other problems resulting from multicollinearity.

d) We should only add a new variable to the model if the improvement in R² is large enough to matter (generally 0.02 or more). It is debatable whether the 3-variable models are much better than the 2-variable models here: the best 3-variable model raises R² by only about 0.018 over the best 2-variable model. Even a 1-variable model would be reasonable.

e) RATING vs. COMPL alone is a reasonable model (R² = 0.6813). Any of the five 2-variable models listed would also be acceptable, although none is much better than the 1-variable model.
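The selection rules used in answers a), b) and d) can be checked mechanically. The sketch below is only an illustration in Python, not the SAS procedure that produced the output: the numbers are copied from the tables in this guide, and the function names (first_backward_drop, first_forward_add, size_by_r2_gain, partial_f) are hypothetical helpers introduced here. The partial F statistic is the standard test for nested models and is an addition; the guide itself relies on the 0.02 rule of thumb.

```python
# Values copied from the SAS output in this study guide.
N = 30  # number of observations

# t statistics from the full-model PARAMETER ESTIMATES table (intercept omitted).
full_model_t = {
    "COMPL": 3.809, "PRIV": -0.538, "NEW": 1.901,
    "RAISE": 0.369, "CRITIC": 0.261, "ADVANCE": -1.218,
}

# F statistics from the six one-variable regressions.
single_f = {
    "COMPL": 59.8608, "PRIV": 6.2121, "NEW": 17.8246,
    "RAISE": 14.9622, "CRITIC": 0.7024, "ADVANCE": 0.6900,
}

# Largest R-square for each number of variables (all-subsets output).
best_r2 = {1: 0.68131416, 2: 0.70801520, 3: 0.72559500, 4: 0.72934125}


def first_backward_drop(t_stats):
    """Backward step: the variable with the smallest |t| goes first."""
    return min(t_stats, key=lambda name: abs(t_stats[name]))


def first_forward_add(f_stats):
    """Forward step: the variable with the largest one-variable F enters first."""
    return max(f_stats, key=f_stats.get)


def size_by_r2_gain(r2_by_size, min_gain=0.02):
    """Grow the model while the best R-square rises by at least min_gain."""
    sizes = sorted(r2_by_size)
    chosen = sizes[0]
    for prev, cur in zip(sizes, sizes[1:]):
        if r2_by_size[cur] - r2_by_size[prev] < min_gain:
            break
        chosen = cur
    return chosen


def partial_f(r2_small, r2_big, k_small, k_big, n=N):
    """F statistic for the extra variables in the larger of two nested models.

    k_small and k_big count predictors, not including the intercept.
    """
    numerator = (r2_big - r2_small) / (k_big - k_small)
    denominator = (1.0 - r2_big) / (n - k_big - 1)
    return numerator / denominator


print(first_backward_drop(full_model_t))  # CRITIC
print(first_forward_add(single_f))        # COMPL
print(size_by_r2_gain(best_r2))           # 2
# Partial F for adding NEW to the model with COMPL alone (about 2.47).
print(round(partial_f(best_r2[1], best_r2[2], 1, 2), 2))
```

With the 0.02 threshold the rule stops at two variables (COMPL and NEW), which agrees with the model chosen by stepwise regression in part c). The gain from a third variable, about 0.018, falls just short of the threshold, which is why answer d) calls the 3-variable models debatable.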