Multiple Regression Example

The dean of an MBA program wants to base admissions on who is most likely to succeed in the program. She regards a student's MBA program GPA as a measure of their success. She believes the primary determinants of success are the following:

- Undergraduate grade point average (GPA)
- Graduate Management Admission Test (GMAT) score
- Number of years of work experience

She randomly sampled students who completed the MBA program and recorded their MBA program GPA as well as the three variables listed above. These are stored in the file mba.jmp for Chapter 19.

To fit the multiple regression model, we click on Analyze, Fit Model; put MBA GPA in Y, Response; click on UnderGPA, GMAT, and Work and click Add (these three variables should now appear in the Construct Model Effects box); and then click Run Model.

[Figure: Actual by Predicted Plot of MBA GPA Actual versus MBA GPA Predicted; P<.0001, RSq=0.46, RMSE=0.7879]

Summary of Fit
  RSquare                       0.463532
  RSquare Adj                   0.444597
  Root Mean Square Error        0.787938
  Mean of Response              8.156517
  Observations (or Sum Wgts)    89

Analysis of Variance
  Source     DF   Sum of Squares   Mean Square   F Ratio   Prob > F
  Model       3        45.597249       15.1991   24.4812     <.0001
  Error      85        52.771971        0.6208
  C. Total   88        98.369220

Parameter Estimates
  Term        Estimate    Std Error   t Ratio   Prob>|t|
  Intercept   0.4660931   1.505631       0.31     0.7576
  UnderGPA    0.062827    0.11993        0.52     0.6017
  GMAT        0.0112814   0.001383       8.16     <.0001
  Work        0.092595    0.030909       3.00     0.0036

Effect Tests
  Source     Nparm   DF   Sum of Squares   F Ratio   Prob > F
  UnderGPA       1    1         0.170380    0.2744     0.6017
  GMAT           1    1        41.327192   66.5659     <.0001
  Work           1    1         5.571825    8.9746     0.0036

[Figure: Residual by Predicted Plot of MBA GPA Residual versus MBA GPA Predicted]

We first examine the residual plots to try to determine if all the regression assumptions are met. Our assumptions are the following:

We assume y = β0 + β1x1 + … + βkxk + ε (in the population), where

1. The regression function is a linear function of the independent variables x1, …, xk, i.e., E(y | x) = β0 + β1x1 + … + βkxk. Another way of stating this is that the multiple regression line does not systematically overestimate or underestimate y for any combination of x1, …, xk.
2. The error ε is normally distributed with mean 0.
3. The standard deviation of the error is constant (σ) for all values of the x's.
4. The errors are independent. It is difficult to check independence (this is not a time series), so we will have to evaluate this from the study description.

As in simple linear regression, we evaluate the linearity assumption #1 by looking at the residual plot. This time we plot the residuals versus the predicted Y's (the ŷ's). We can think of the ŷ's as summarizing all the information in the X's. We would like this plot to show a random scatter around the zero line (showing no obvious curves or other trends). To check assumption #3 of constant variance, we also look at the plot of residuals versus the ŷ's and see whether the variance seems relatively constant. Here the residual plots do not show any gross violations of linearity or constant variance, but if they did, we should see whether we can isolate the problem to one or more of the X's by plotting the residuals versus each one of the X's individually. To do this, save the residuals into a column by clicking on the red triangle next to Response MBA GPA, click on Save Columns, and click on Residuals. Now plot Residual MBA GPA versus each of the independent variables.
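JMP is menu-driven, but the same fit and residual plots can be reproduced in code if that is more convenient. The sketch below is a minimal example in Python using pandas, statsmodels, and matplotlib; it assumes the data have been exported from mba.jmp to a CSV file (called mba.csv here) with columns named MBA_GPA, UnderGPA, GMAT, and Work. The file name and column names are placeholders, not part of the original JMP file.

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    # Hypothetical export of mba.jmp; the column names are assumed
    mba = pd.read_csv("mba.csv")  # columns: MBA_GPA, UnderGPA, GMAT, Work

    # Fit the multiple regression of MBA GPA on the three predictors
    model = smf.ols("MBA_GPA ~ UnderGPA + GMAT + Work", data=mba).fit()
    print(model.summary())  # parameter estimates, t-tests, R-square, ANOVA F-test

    # Residuals versus predicted values, and residuals versus each predictor
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))
    axes[0, 0].scatter(model.fittedvalues, model.resid)
    axes[0, 0].axhline(0, color="gray")
    axes[0, 0].set_xlabel("Predicted MBA GPA")
    axes[0, 0].set_ylabel("Residual")
    for ax, var in zip(axes.flatten()[1:], ["UnderGPA", "GMAT", "Work"]):
        ax.scatter(mba[var], model.resid)
        ax.axhline(0, color="gray")
        ax.set_xlabel(var)
        ax.set_ylabel("Residual")
    plt.tight_layout()
    plt.show()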
[Figure: Fit Y by X bivariate plots of Residual MBA GPA by UnderGPA, by GMAT, and by Work]

To evaluate the normality assumption, we look at a histogram of the residuals (using the column of saved residuals from above):

[Figure: Distribution (histogram) of Residual MBA GPA]

Methods for detecting outliers and influential points for multiple regression will be covered in a future lecture.

Once we are satisfied that all regression assumptions are approximately satisfied, we can interpret the regression output.

1. First check the F-test in the ANOVA table. This test tests H0: β1 = β2 = … = βk = 0 versus Ha: not all of β1, …, βk are zero (notice this does not involve β0). It answers the question: are any of these X's useful in predicting Y? The p-value for the test is less than .0001, so some of the X's are useful in predicting Y.

2. The RSquare measures the proportion of variability in Y explained by the regression of Y on these X's (another way of saying this is the proportion of variability in Y explained by the regression model). The RSquare here is .4635.

3. The individual t-tests tell you specifically whether Xj is useful in predicting Y when the other X's are already included in the model. Undergraduate GPA does not appear to be useful in predicting MBA GPA if GMAT score and work experience are already included in the model. GMAT score remains useful for predicting MBA GPA when undergraduate GPA and work experience are included in the model. Also, work experience remains useful for predicting MBA GPA when undergraduate GPA and GMAT score are included in the model.

4. Interpretation of regression coefficients: β̂_GMAT = 0.011. This means that an increase in GMAT score of one point is associated with an increase in MBA GPA of 0.011 on average, assuming that all other variables are held constant.

5. In multiple regression, it is often desirable to find the most parsimonious model (since these are the easiest to interpret). To do this, we can remove variables that are not useful in predicting Y, that is, variables whose coefficients are not statistically significantly different from zero. Here, we can remove undergraduate GPA and just use GMAT score and work experience to predict MBA GPA; a code sketch of this reduced model is shown below. (Model selection is actually more subtle, and we will explore it further in Chapter 20.)
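To illustrate point 5, here is a minimal sketch of the reduced model, continuing the hypothetical Python/statsmodels setup from above (the mba.csv file and its column names are still assumptions). It drops UnderGPA and refits with GMAT and Work only.

    import pandas as pd
    import statsmodels.formula.api as smf

    mba = pd.read_csv("mba.csv")  # hypothetical export of mba.jmp

    # Full model and reduced model with UnderGPA removed
    full = smf.ols("MBA_GPA ~ UnderGPA + GMAT + Work", data=mba).fit()
    reduced = smf.ols("MBA_GPA ~ GMAT + Work", data=mba).fit()

    # R-square should drop very little when UnderGPA is removed,
    # consistent with its non-significant t-test in the full model
    print(full.rsquared, reduced.rsquared)
    print(reduced.summary())  # coefficients, t-tests, and overall F-test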
Predictions

To find predictions, prediction intervals, and confidence intervals for the mean response, we click on the red triangle next to Response MBA GPA, click Save Columns, and then click Predicted Values for predictions, Mean Confidence Interval for confidence intervals for the mean response, or Individual Confidence Interval for prediction intervals. Note that if you want to form a confidence interval for the mean response or a prediction interval for an X-value that is not in the data set, you can construct a new row with the new data, exclude it from the analysis when you fit the regression equation (highlight the row, click on Rows, then click Exclude), and then ask JMP to save the Predicted Values, Mean Confidence Intervals, and Individual Confidence Intervals. The first few rows of the resulting data table are shown below (a missing value appears as a period; the last row is an excluded new observation with no MBA GPA recorded).

MBA GPA   UnderGPA   GMAT   Work   Residual    Predicted   Lower 95%   Upper 95%   Lower 95%   Upper 95%
                                   MBA GPA     MBA GPA     Mean        Mean        Predicted   Predicted
9.32      9.51       658    4       0.462881   8.857119    8.517708    9.19653     7.254141    10.4601
8.17      10.97      587    3       0.114728   8.055272    7.738101    8.372443    6.456856    9.653688
7.93      9.52       525    7       0.294894   7.635106    7.360891    7.909321    6.044656    9.225556
7.72      8.52       618    5      -0.71626    8.436259    7.992757    8.879761    6.80806     10.06446
.         10         600    4      .           8.233583    8.011233    8.455932    6.65125     9.815915
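The same predicted value and intervals can also be computed in code. The sketch below, again using the hypothetical Python/statsmodels setup with the assumed mba.csv column names, does this for a new applicant with UnderGPA = 10, GMAT = 600, and Work = 4 (the excluded row in the table above); its output should correspond to JMP's saved Predicted Values, Mean Confidence Interval, and Individual Confidence Interval columns.

    import pandas as pd
    import statsmodels.formula.api as smf

    mba = pd.read_csv("mba.csv")  # hypothetical export of mba.jmp
    model = smf.ols("MBA_GPA ~ UnderGPA + GMAT + Work", data=mba).fit()

    # New applicant: UnderGPA = 10, GMAT = 600, Work = 4 (excluded row above)
    new_x = pd.DataFrame({"UnderGPA": [10.0], "GMAT": [600.0], "Work": [4.0]})
    pred = model.get_prediction(new_x).summary_frame(alpha=0.05)

    print(pred["mean"])                              # predicted MBA GPA
    print(pred[["mean_ci_lower", "mean_ci_upper"]])  # 95% CI for the mean response
    print(pred[["obs_ci_lower", "obs_ci_upper"]])    # 95% prediction interval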