Statistics 511
Study Guide 8
Fall 2001
A) Variable Selection
1) On study guide 6 you saw data relating the overall rating of supervisors by their
employees to 6 ratings of their performance in various areas. In this problem, we
try to find a smaller set of variables that adequately predicts the overall rating.
The variables in the analysis are:
RATING    Overall rating of job being done by supervisor
COMPL     Handles employee complaints
PRIV      Does not allow special privileges
NEW       Opportunity to learn new things
RAISE     Raises based on performances
CRITIC    Too critical of poor performances
ADVANCE   Rate of advancing to better jobs
a) Below is the ANOVA table for the regression of overall rating on all the
variables. Using backwards stepwise regression, which variable would be
removed first from the regression?
DEP VARIABLE: RATING

ANALYSIS OF VARIANCE

SOURCE     DF   SUM OF SQUARES   MEAN SQUARE   F VALUE   PROB>F
MODEL       6       3147.96634     524.66106    10.502   0.0001
ERROR      23       1149.00032   49.95653586
C TOTAL    29       4296.96667

ROOT MSE   7.067994      R-SQUARE   0.7326

PARAMETER ESTIMATES

VARIABLE   DF   PARAMETER ESTIMATE   STANDARD ERROR   T FOR H0: PARAMETER=0   PROB > |T|
INTERCEP    1          10.78707639      11.58925724                   0.931       0.3616
COMPL       1           0.61318761       0.16098311                   3.809       0.0009
PRIV        1          -0.07305014       0.13572469                  -0.538       0.5956
NEW         1           0.32033212       0.16852032                   1.901       0.0699
RAISE       1           0.08173213       0.22147768                   0.369       0.7155
CRITIC      1           0.03838145       0.14699544                   0.261       0.7963
ADVANCE     1          -0.21705668       0.17820947                  -1.218       0.2356

b) Below are the summary statistics for the regression of overall rating on each
independent variable separately. Using forwards stepwise regression, which
variable would be added first to the regression?
VARIABLE   MODEL R**2         F   PROB>F
COMPL          0.6813   59.8608   0.0001
PRIV           0.1816    6.2121   0.0189
NEW            0.3890   17.8246   0.0002
RAISE          0.3483   14.9622   0.0006
CRITIC         0.0245    0.7024   0.4091
ADVANCE        0.0241    0.6900   0.4132
c) The model selected by stepwise regression is
$y = \beta_0 + \beta_1\,\mathrm{COMPL} + \beta_2\,\mathrm{NEW} + \varepsilon$
Is this the best model for predicting overall rating?
d) The plot of $R^2$ versus the number of parameters for this data set is below. About
how many variables should be in the model?
[PLOT OF _RSQ_*_P_, SYMBOL USED IS *. Vertical axis: RSQUARED, scaled from 0.680 to 0.735. Horizontal axis: NUMBER OF PARAMETERS, 2 to 7. Approximate points read from the plot: (2, 0.681), (3, 0.708), (4, 0.726), (5, 0.729), (6, 0.732), (7, 0.733).]
e) Below is the output from all subsets regression. What models appear to be good
predictors of overall rating?
N=30     REGRESSION MODELS FOR DEPENDENT VARIABLE: RATING     MODEL: MODEL1

NUMBER IN MODEL   R-SQUARE     VARIABLES IN MODEL
       1          0.02447321   CRITIC
       1          0.18157559   PRIV
       1          0.34826403   RAISE
       1          0.38897445   NEW
       1          0.68131416   COMPL
-------------------------------------------------------
       2          0.68131649   COMPL CRITIC
       2          0.68228010   COMPL ADVANCE
       2          0.68306390   COMPL PRIV
       2          0.68389794   COMPL RAISE
       2          0.70801520   COMPL NEW
-------------------------------------------------------
       3          0.68952873   COMPL RAISE ADVANCE
       3          0.70801569   COMPL NEW CRITIC
       3          0.70829161   COMPL NEW RAISE
       3          0.71500445   COMPL NEW PRIV
       3          0.72559500   COMPL NEW ADVANCE
-------------------------------------------------------
       4          0.71503039   COMPL NEW PRIV CRITIC
       4          0.71522371   COMPL NEW PRIV RAISE
       4          0.72726509   COMPL NEW ADVANCE CRITIC
       4          0.72851428   COMPL NEW ADVANCE RAISE
       4          0.72934125   COMPL NEW ADVANCE PRIV
A1)
a) BACKWARDS stepwise regression starts from the full model and eliminates the least
significant variable first. (However, read the note on NWK page 419.)
CRITIC has the smallest absolute t-value (0.261), and therefore the largest p-value (0.7963),
among all variables in the full model, so it would be eliminated first.
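
The first backward step can be checked with a few lines of Python. This is only a sketch: the estimates and standard errors are copied from the PARAMETER ESTIMATES table above, and the function name `first_backward_drop` is a hypothetical helper, not part of any SAS procedure.

```python
# Estimates and standard errors copied from the PARAMETER ESTIMATES table.
# The intercept is left out because it is not a candidate for removal.
estimates = {
    "COMPL": (0.61318761, 0.16098311),
    "PRIV": (-0.07305014, 0.13572469),
    "NEW": (0.32033212, 0.16852032),
    "RAISE": (0.08173213, 0.22147768),
    "CRITIC": (0.03838145, 0.14699544),
    "ADVANCE": (-0.21705668, 0.17820947),
}


def first_backward_drop(table):
    """Return the predictor with the smallest |t| and all t-values.

    t is computed as estimate / standard error for each predictor.
    """
    t_values = {name: est / se for name, (est, se) in table.items()}
    weakest = min(t_values, key=lambda name: abs(t_values[name]))
    return weakest, t_values


dropped, t_values = first_backward_drop(estimates)
print(dropped, round(t_values[dropped], 3))
```

Ranking by smallest |t| gives the same answer as ranking by largest p-value, because every coefficient here is tested on the same 23 error degrees of freedom.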
b) FORWARDS stepwise regression starts by adding the most significant single variable to the
model. COMPL has the largest F-value (59.8608) and the smallest p-value (0.0001) among the
one-variable regressions, so it would be the variable added first.
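
For a one-predictor regression the F statistic is a function of R**2 alone, F = R**2 / (1 - R**2) * (n - 2), so the F column in part b) can be reproduced from the R**2 column. The sketch below assumes n = 30 (from the all subsets output) and uses the R**2 values as printed to four decimals, so the results agree with the table only up to rounding. The name `single_predictor_f` is a hypothetical helper.

```python
N_OBS = 30  # sample size, taken from the all subsets output (N=30)

# MODEL R**2 for each one-variable regression, as printed in part b).
r_squared = {
    "COMPL": 0.6813,
    "PRIV": 0.1816,
    "NEW": 0.3890,
    "RAISE": 0.3483,
    "CRITIC": 0.0245,
    "ADVANCE": 0.0241,
}


def single_predictor_f(r2, n):
    """F statistic for a simple regression with 1 and n - 2 degrees of freedom."""
    return r2 / (1.0 - r2) * (n - 2)


f_stats = {name: single_predictor_f(r2, N_OBS) for name, r2 in r_squared.items()}

# Forward selection enters the variable with the largest F first.
first_added = max(f_stats, key=f_stats.get)
print(first_added, round(f_stats[first_added], 2))
```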
c) The “best” model depends upon the criterion we choose to evaluate the model. A 3-variable
model (COMPL, NEW, ADVANCE) gives a somewhat better prediction of overall rating
($R^2$ = 0.7256, versus 0.7080 for COMPL and NEW), but the higher number of variables could
cause other problems resulting from multicollinearity.
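
One way to judge whether ADVANCE earns its place is a partial F-test comparing the 2-variable and 3-variable models. The sketch below is an illustration under stated assumptions, not the procedure SAS used: the R-SQUARE values come from the all subsets output in part e), n = 30, and the critical value 4.23 is an approximate tabled value of F(0.95; 1, 26) that should be checked against an F table. The function name `partial_f` is hypothetical.

```python
N_OBS = 30

R2_REDUCED = 0.70801520  # COMPL, NEW
R2_FULL = 0.72559500     # COMPL, NEW, ADVANCE


def partial_f(r2_reduced, r2_full, n, p_full, added=1):
    """Partial F for adding `added` variables to the reduced model.

    p_full is the number of parameters in the larger model, including
    the intercept, so the error degrees of freedom are n - p_full.
    """
    numerator = (r2_full - r2_reduced) / added
    denominator = (1.0 - r2_full) / (n - p_full)
    return numerator / denominator


f_advance = partial_f(R2_REDUCED, R2_FULL, N_OBS, p_full=4)

# Assumption: approximate 5% critical value of F with 1 and 26 df.
F_CRIT_05 = 4.23

print(round(f_advance, 2), f_advance > F_CRIT_05)
```

Under these assumptions the partial F is about 1.67, well below the critical value, so the gain from ADVANCE is not significant at the 5% level even though it raises R-SQUARE.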
d) We should add a new variable to the model only if the improvement in $R^2$ is substantial
(as a rule of thumb, 0.02 or more). Here it is debatable whether the 3-variable models are
much better than the 2-variable models. Even a 1-variable model would be reasonable.
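
The 0.02 rule of thumb can be applied mechanically to the best R-SQUARE at each model size. This sketch takes the best values for 1 to 4 variables from the all subsets output in part e); the function name `size_by_gain` is a hypothetical helper, and the cutoff is the guideline stated above rather than a formal test.

```python
# Best R-SQUARE for each number of variables, from the all subsets output.
best_r2 = {
    1: 0.68131416,  # COMPL
    2: 0.70801520,  # COMPL NEW
    3: 0.72559500,  # COMPL NEW ADVANCE
    4: 0.72934125,  # COMPL NEW ADVANCE PRIV
}

THRESHOLD = 0.02  # rule-of-thumb minimum gain in R-SQUARE


def size_by_gain(r2_by_size, threshold):
    """Grow the model one variable at a time while the gain meets the threshold."""
    sizes = sorted(r2_by_size)
    chosen = sizes[0]
    for smaller, larger in zip(sizes, sizes[1:]):
        if r2_by_size[larger] - r2_by_size[smaller] < threshold:
            break
        chosen = larger
    return chosen


gains = {k: round(best_r2[k] - best_r2[k - 1], 4) for k in (2, 3, 4)}
print(gains, size_by_gain(best_r2, THRESHOLD))
```

The gain from 1 to 2 variables is about 0.027 and the gain from 2 to 3 is about 0.018, just under the cutoff, which is why the choice between 2 and 3 variables is debatable.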
e) The regression of RATING on COMPL alone is a reasonable model ($R^2$ = 0.6813).
Any of the five 2-variable models listed would also be acceptable, although none is much
better than the 1-variable model.
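
Because R-SQUARE can never decrease when a variable is added, a penalized criterion is sometimes used to compare models of different sizes. Adjusted R-SQUARE is one such criterion; it is not part of the output shown in this study guide and is offered here only as an illustrative sketch. The best R-SQUARE values come from the all subsets output, the 6-variable value from the full-model ANOVA table, and the function name `adjusted_r2` is hypothetical.

```python
N_OBS = 30

# Best R-SQUARE by number of variables; the best 5-variable model is not
# listed in the output shown, so that size is omitted.
best_r2 = {
    1: 0.68131416,
    2: 0.70801520,
    3: 0.72559500,
    4: 0.72934125,
    6: 0.7326,
}


def adjusted_r2(r2, n, k):
    """Adjusted R-square for a model with k predictors plus an intercept."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


adj = {k: adjusted_r2(r2, N_OBS, k) for k, r2 in best_r2.items()}
best_k = max(adj, key=adj.get)

for k in sorted(adj):
    print(k, round(adj[k], 4))
print("largest adjusted R-square at", best_k, "variables")
```

Under these assumptions the adjusted values range only from about 0.66 to 0.69, which agrees with the point above that the larger models are not much better than COMPL alone.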