Multiple Regression Example - Wharton Statistics Department

Multiple Regression Example
The dean of an MBA program wants to base admissions on who is most likely to succeed
in the program. She regards a student’s MBA program GPA as a measure of their
success. She believes the primary determinants of success are the following:
Undergraduate grade point average (GPA)
Graduate Management Admission Test (GMAT) score
Number of years of work experience
She randomly samples students who completed the MBA and records their MBA
program GPA, as well as the three variables listed above. These are stored in the file
mba.jmp for Chapter 19.
To fit the multiple regression model, we click on Analyze, Fit Model; put MBA GPA in
Y, Response; click on UnderGPA, GMAT and Work and click Add (these three variables
should now appear in the Construct Model Box) and then click Run Model.
Response MBA GPA
Whole Model
[Figure: Actual by Predicted Plot. MBA GPA Actual versus MBA GPA Predicted; P<.0001, RSq=0.46, RMSE=0.7879.]
Summary of Fit
RSquare                       0.463532
RSquare Adj                   0.444597
Root Mean Square Error        0.787938
Mean of Response              8.156517
Observations (or Sum Wgts)    89
Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio
Model       3   45.597249        15.1991       24.4812
Error      85   52.771971        0.6208        Prob > F
C. Total   88   98.369220                      <.0001
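The Summary of Fit values can be recovered from the Analysis of Variance table by simple arithmetic. The following is a minimal Python sketch of that arithmetic (it is not part of the JMP workflow, and the variable names are illustrative choices); the input numbers are copied from the ANOVA table.

```python
# Values copied from the Analysis of Variance table.
ss_model, df_model = 45.597249, 3
ss_error, df_error = 52.771971, 85
ss_total, df_total = 98.369220, 88

ms_model = ss_model / df_model        # Mean Square for Model
ms_error = ss_error / df_error        # Mean Square for Error
f_ratio = ms_model / ms_error         # F Ratio

r_square = ss_model / ss_total        # RSquare: share of variability explained
# RSquare Adj compares the error mean square with the total mean square.
r_square_adj = 1 - ms_error / (ss_total / df_total)
rmse = ms_error ** 0.5                # Root Mean Square Error

print(f"F Ratio     = {f_ratio:.4f}")
print(f"RSquare     = {r_square:.6f}")
print(f"RSquare Adj = {r_square_adj:.6f}")
print(f"RMSE        = {rmse:.6f}")
```

The printed values should agree with the JMP output up to rounding.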
Parameter Estimates
Term        Estimate    Std Error   t Ratio   Prob>|t|
Intercept   0.4660931   1.505631    0.31      0.7576
UnderGPA    0.062827    0.11993     0.52      0.6017
GMAT        0.0112814   0.001383    8.16      <.0001
Work        0.092595    0.030909    3.00      0.0036
Effect Tests
Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
UnderGPA   1       1    0.170380         0.2744    0.6017
GMAT       1       1    41.327192        66.5659   <.0001
Work       1       1    5.571825         8.9746    0.0036
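The t Ratio and Effect Tests columns are tied together by two standard relations: each t ratio is the estimate divided by its standard error, and for an effect with one degree of freedom the F ratio equals the effect sum of squares divided by the error mean square, which is also the square of the t ratio. The sketch below checks this with the numbers from the tables; it is an illustration in Python with made-up variable names, not JMP output.

```python
# Estimates and standard errors copied from the Parameter Estimates table.
estimates = {"Intercept": 0.4660931, "UnderGPA": 0.062827,
             "GMAT": 0.0112814, "Work": 0.092595}
std_errors = {"Intercept": 1.505631, "UnderGPA": 0.11993,
              "GMAT": 0.001383, "Work": 0.030909}

# t Ratio = Estimate / Std Error
t_ratios = {term: estimates[term] / std_errors[term] for term in estimates}

# Effect sums of squares (Effect Tests table) and the error mean square
# (Error row of the Analysis of Variance table: 52.771971 / 85).
effect_ss = {"UnderGPA": 0.170380, "GMAT": 41.327192, "Work": 5.571825}
ms_error = 52.771971 / 85

# Each effect has 1 degree of freedom, so F Ratio = SS / MSE.
effect_f = {term: ss / ms_error for term, ss in effect_ss.items()}

for term in effect_ss:
    t = t_ratios[term]
    print(f"{term}: t = {t:.2f}, t^2 = {t * t:.2f}, F = {effect_f[term]:.4f}")
```

The squared t ratios and the F ratios agree only up to rounding, because the displayed estimates and standard errors are themselves rounded.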
[Figure: Residual by Predicted Plot. MBA GPA Residual versus MBA GPA Predicted.]
We first examine the residual plots to try to determine if all the regression
assumptions are met.
Our assumptions are the following:
We assume $y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \varepsilon$ (in the population), where
1. The regression function is a linear function of the independent variables
$x_1, \ldots, x_k$, i.e., $E(y \mid x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$. Another way of stating this is that
the multiple regression line does not systematically overestimate or
underestimate y for any combination of $x_1, \ldots, x_k$.
2. The error $\varepsilon$ is normally distributed with mean 0.
3. The standard deviation of the error is constant ($\sigma_\varepsilon$) for all values of the x’s.
4. The errors are independent.
It is difficult to check independence (this is not a time series), so we will have to
evaluate this from the study description.
As in simple linear regression, we evaluate the linearity assumption #1 by looking
at the residual plot. This time we plot the residuals versus the predicted Y’s (the
ŷ ’s). We can think of the ŷ ’s as summarizing all the information in the X’s. We
would like this plot to show a random scatter around the zero line (showing no
obvious curves or other trends). To check assumption #3 of constant variance,
we also look at the residual plot versus the ŷ ’s and see whether the variance
seems relatively constant.
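To make the plotted quantities concrete, the following minimal Python sketch computes the predicted value ŷ and the residual y − ŷ for the four complete rows listed in the prediction table at the end of this example, using the coefficients from the Parameter Estimates table. The function name predict and the variable names are illustrative choices, not JMP identifiers. Each (predicted, residual) pair is one point on the Residual by Predicted Plot.

```python
# Coefficients copied from the Parameter Estimates table.
B0, B_UNDERGPA, B_GMAT, B_WORK = 0.4660931, 0.062827, 0.0112814, 0.092595

# (MBA GPA, UnderGPA, GMAT, Work) for the four complete rows shown in the
# prediction table at the end of the example.
rows = [
    (9.32, 9.51, 658, 4),
    (8.17, 10.97, 587, 3),
    (7.93, 9.52, 525, 7),
    (7.72, 8.52, 618, 5),
]


def predict(under_gpa, gmat, work):
    """Fitted MBA GPA from the estimated regression equation."""
    return B0 + B_UNDERGPA * under_gpa + B_GMAT * gmat + B_WORK * work


# One (predicted, residual) pair per student: the coordinates of a point
# on the residual-versus-predicted plot.
points = []
for mba_gpa, under_gpa, gmat, work in rows:
    y_hat = predict(under_gpa, gmat, work)
    points.append((y_hat, mba_gpa - y_hat))

for y_hat, residual in points:
    print(f"predicted = {y_hat:.6f}, residual = {residual:.6f}")
```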
Here the residual plot does not show any gross violations of linearity or constant
variance, but if it did, we would try to isolate the problem to one or more X’s.
To do so, we plot the residuals versus each one of the X’s individually.
Save the residuals into a column by clicking on the red triangle next to
Response MBA GPA, click on Save Columns and click on Residuals. Now plot the
saved residuals column (Residual MBA GPA in the plots that follow) versus each
of the independent variables.
Fit Y by X Group
[Figure: Bivariate Fit of Residual MBA GPA By UnderGPA. Residual MBA GPA versus UnderGPA.]
[Figure: Bivariate Fit of Residual MBA GPA By GMAT. Residual MBA GPA versus GMAT.]
[Figure: Bivariate Fit of Residual MBA GPA By Work. Residual MBA GPA versus Work.]
To evaluate the normality assumption, we look at the histograms of the residuals
(using the column of saved residuals from above):
[Figure: Distributions. Histogram of Residual MBA GPA.]
Methods for detecting outliers and influential points for multiple regression will be
covered in a future lecture.
Once we are satisfied that all regression assumptions hold at least
approximately, we can interpret the regression output.
1. First check the F-test in the ANOVA table. It tests
$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$ versus $H_a$: not all of $\beta_1, \ldots, \beta_k$ are zero (notice this does not
involve $\beta_0$). This test answers the question: Are any of these X’s useful in
predicting Y? The p-value for the test is less than .0001, so at least one of the
X’s is useful in predicting Y.
2. The Rsquare measures the proportion of variability in Y explained by the
regression of Y on these X’s (another way of saying this is the proportion of
variability in Y explained by the regression model). The Rsquare is .4635.
3. The individual t-tests tell you specifically whether Xj is useful in predicting
Y when the other X’s are already included in the model. Undergraduate GPA
does not appear to be useful in predicting MBA GPA if GMAT score and work
experience are already included in the model. GMAT score remains useful for
predicting MBA GPA when undergraduate GPA and work experience are
included in the model. Also, work experience remains useful for predicting MBA
GPA when undergraduate GPA and GMAT score are included in the model.
4. Interpretation of regression coefficients. $\hat{\beta}_{GMAT} = .011$. This means that an
increase in GMAT score of one point is associated with an increase in MBA GPA
of 0.011 on average, assuming that all other variables are held constant.
5. In multiple regression, it is often desirable to find the most parsimonious
model (since such models are easiest to interpret). To do this, we can remove any
variables that are not useful in predicting Y, e.g., variables whose coefficients
are not statistically significantly different from zero. Here, we can remove
undergraduate GPA and just use GMAT score and work experience to predict
MBA GPA. (Model selection is actually more subtle and we will further explore it
in Chapter 20.)
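As a worked illustration of the coefficient interpretation in item 4, consider two applicants A and B with the same undergraduate GPA and work experience whose GMAT scores differ by 10 points (a hypothetical comparison chosen only for illustration). Using the fitted equation from the Parameter Estimates table, the intercept and the other terms cancel in the difference of the predicted values:

```latex
\begin{aligned}
\hat{y}_{B} - \hat{y}_{A}
  &= \hat{\beta}_{GMAT}\,\bigl(x_{GMAT,B} - x_{GMAT,A}\bigr) \\
  &= 0.0112814 \times 10 \approx 0.113
\end{aligned}
```

So a 10-point difference in GMAT score is associated with a difference of about 0.11 in predicted MBA GPA, holding the other variables fixed.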
Predictions
To find predictions, prediction intervals and confidence intervals for the mean
response, we click on the red triangle next to Response MBA GPA, click Save
Columns and click Predicted Values for predictions; click Mean Confidence
Interval for confidence intervals for the mean response; or click Individual
Confidence Interval for prediction intervals. Note that if you want to form a
confidence interval for the mean response or a prediction interval for an X-value
that is not in the data set, you can construct a new row with the new data,
exclude it from the analysis when you fit the regression equation (highlight the
row and click on Rows, then click Exclude), then ask JMP to save the Predicted
Values, Mean Confidence Intervals and Individual Confidence Intervals.
MBA GPA   UnderGPA   GMAT   Work   Residual   Predicted   Lower 95% Mean   Upper 95% Mean   Lower 95% Predicted   Upper 95% Predicted
9.32      9.51       658    4      0.462881   8.857119    8.517708         9.19653          7.254141              10.4601
8.17      10.97      587    3      0.114728   8.055272    7.738101         8.372443         6.456856              9.653688
7.93      9.52       525    7      0.294894   7.635106    7.360891         7.909321         6.044656              9.225556
7.72      8.52       618    5      -0.71626   8.436259    7.992757         8.879761         6.80806               10.06446
.         10         600    4      .          8.233583    8.011233         8.455932         6.65125               9.815915
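The two kinds of interval in the last row (the new, excluded observation) are related. Both are centered at the predicted value, and under the usual regression formulas the squared half-width of the prediction interval equals the squared half-width of the mean confidence interval plus (t × RMSE)², where t is the critical value with 85 error degrees of freedom. The Python sketch below checks this with the saved values; the variable names are illustrative, and the RMSE is taken from the Summary of Fit table.

```python
# Values for the new row (UnderGPA = 10, GMAT = 600, Work = 4), copied from
# the last line of the table, plus the RMSE from the Summary of Fit table.
predicted = 8.233583
mean_lo, mean_hi = 8.011233, 8.455932   # 95% CI for the mean response
pred_lo, pred_hi = 6.65125, 9.815915    # 95% prediction interval
rmse = 0.787938

# Both intervals should be centered at the predicted value.
mean_center = (mean_lo + mean_hi) / 2
pred_center = (pred_lo + pred_hi) / 2

half_mean = (mean_hi - mean_lo) / 2
half_pred = (pred_hi - pred_lo) / 2

# half_pred^2 = half_mean^2 + (t * rmse)^2, so the implied critical value is:
t_implied = (half_pred ** 2 - half_mean ** 2) ** 0.5 / rmse

print(f"centers: {mean_center:.6f}, {pred_center:.6f}")
print(f"half-widths: mean = {half_mean:.6f}, prediction = {half_pred:.6f}")
print(f"implied t critical value = {t_implied:.4f}")
```

The implied critical value comes out close to 1.99, which is in line with a t distribution with 85 degrees of freedom. The prediction interval is much wider than the mean confidence interval because it also accounts for the variability of an individual student around the mean response.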