Topic 13: Multiple Linear Regression Example

Outline
• Description of example
• Descriptive summaries
• Investigation of various models
• Conclusions

Study of CS students
• Too many computer science majors at Purdue were dropping out of the program
• Wanted to find predictors of success to be used in the admissions process
• Predictors must be available at time of entry into the program

Data available
• GPA after three semesters
• Overall high school math grade
• Overall high school science grade
• Overall high school English grade
• SAT Math
• SAT Verbal
• Gender (of interest for other reasons)

Data for CS Example
• Y is the student's grade point average (GPA) after 3 semesters
• 3 HS grades and 2 SAT scores are the explanatory variables (p=6)
• Have n=224 students

Descriptive Statistics

data a1;
  infile 'C:\...\csdata.dat';
  input id gpa hsm hss hse satm satv genderm1;
proc means data=a1 maxdec=2;
  var gpa hsm hss hse satm satv;
run;

Output from Proc Means

Variable    N      Mean   Std Dev   Minimum   Maximum
gpa       224      2.64      0.78      0.12      4.00
hsm       224      8.32      1.64      2.00     10.00
hss       224      8.09      1.70      3.00     10.00
hse       224      8.09      1.51      3.00     10.00
satm      224    595.29     86.40    300.00    800.00
satv      224    504.55     92.61    285.00    760.00

Descriptive Statistics

proc univariate data=a1;
  var gpa hsm hss hse satm satv;
  histogram gpa hsm hss hse satm satv /normal;
run;

Correlations

proc corr data=a1;
  var hsm hss hse satm satv;
proc corr data=a1;
  var hsm hss hse satm satv;
  with gpa;
run;

Output from Proc Corr

Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

         gpa      hsm      hss      hse      satm     satv
gpa    1.00000  0.43650  0.32943  0.28900  0.25171  0.11449
                <.0001   <.0001   <.0001   0.0001   0.0873
hsm    0.43650  1.00000  0.57569  0.44689  0.45351  0.22112
       <.0001            <.0001   <.0001   <.0001   0.0009
hss    0.32943  0.57569  1.00000  0.57937  0.24048  0.26170
       <.0001   <.0001            <.0001   0.0003   <.0001
hse    0.28900  0.44689  0.57937  1.00000  0.10828  0.24371
       <.0001   <.0001   <.0001            0.1060   0.0002
satm   0.25171  0.45351  0.24048  0.10828  1.00000  0.46394
       0.0001   <.0001   0.0003   0.1060            <.0001
satv   0.11449  0.22112  0.26170  0.24371  0.46394  1.00000
       0.0873   0.0009   <.0001   0.0002   <.0001
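Each entry in the matrix above is a Pearson coefficient, r = Sxy / sqrt(Sxx·Syy). The case study itself uses PROC CORR; the sketch below is only an illustrative pure-Python version of that same formula, on toy data rather than the csdata set.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: r = Sxy / sqrt(Sxx * Syy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Toy check: an exact linear relationship gives |r| = 1.
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # -1.0
```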
Output from Proc Corr

Pearson Correlation Coefficients, N = 224
Prob > |r| under H0: Rho=0

       hsm      hss      hse      satm     satv
gpa  0.43650  0.32943  0.28900  0.25171  0.11449
     <.0001   <.0001   <.0001   0.0001   0.0873

All but SATV significantly correlated with GPA

Scatter Plot Matrix

proc corr data=a1 plots=matrix;
  var gpa hsm hss hse satm satv;
run;

• Allows visual check of pairwise relationships
• No "strong" linear relationships
• Can see discreteness of high school scores

Use high school grades to predict GPA (Model #1)

proc reg data=a1;
  model gpa=hsm hss hse;
run;

Results Model #1

Root MSE          0.69984    R-Square  0.2046
Dependent Mean    2.63522    Adj R-Sq  0.1937
Coeff Var        26.55711    Meaningful??

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1      0.58988            0.29424        2.00     0.0462
hsm         1      0.16857            0.03549        4.75    <.0001
hss         1      0.03432            0.03756        0.91     0.3619
hse         1      0.04510            0.03870        1.17     0.2451

ANOVA Table #1

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3        27.71233       9.23744      18.86   <.0001
Error            220       107.75046       0.48977
Corrected Total  223       135.46279

Significant F test but not all variable t tests significant

Remove HSS (Model #2)

proc reg data=a1;
  model gpa=hsm hse;
run;

Results Model #2

Root MSE          0.69958    R-Square  0.2016
Dependent Mean    2.63522    Adj R-Sq  0.1943
Coeff Var        26.54718

Slightly better MSE and adjusted R-Sq

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1      0.62423            0.29172        2.14     0.0335
hsm         1      0.18265            0.03196        5.72    <.0001
hse         1      0.06067            0.03473        1.75     0.0820

ANOVA Table #2

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2        27.30349      13.65175      27.89   <.0001
Error            221       108.15930       0.48941
Corrected Total  223       135.46279

Significant F test but not all variable t tests significant
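Behind each of these PROC REG fits, the coefficients solve the normal equations (X'X)b = X'y. A self-contained sketch of that computation (pure Python with Gaussian elimination, on made-up toy data, not the csdata set; illustrative only):

```python
def ols(xrows, y):
    """Fit y = b0 + b1*x1 + ... by solving the normal equations (Z'Z)b = Z'y."""
    Z = [[1.0] + list(r) for r in xrows]           # design matrix with intercept column
    n, k = len(Z), len(Z[0])
    A = [[sum(Z[i][r] * Z[i][c] for i in range(n)) for c in range(k)] for r in range(k)]
    v = [sum(Z[i][r] * y[i] for i in range(n)) for r in range(k)]
    for col in range(k):                           # Gaussian elimination, partial pivoting
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        v[col], v[piv] = v[piv], v[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            v[r] -= f * v[col]
    b = [0.0] * k
    for r in reversed(range(k)):                   # back-substitution
        b[r] = (v[r] - sum(A[r][c] * b[c] for c in range(r + 1, k))) / A[r][r]
    return b

# Toy data generated from y = 1 + 2*x1 + 3*x2 exactly, so the fit recovers (1, 2, 3).
X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
print([round(c, 6) for c in ols(X, [1, 3, 4, 6, 8])])   # [1.0, 2.0, 3.0]
```

With only a handful of predictors the normal-equations approach is fine; statistical software uses more numerically stable decompositions, but the fitted coefficients are the same.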
Rerun with HSM only (Model #3)

proc reg data=a1;
  model gpa=hsm;
run;

Results Model #3

Root MSE          0.70280    R-Square  0.1905
Dependent Mean    2.63522    Adj R-Sq  0.1869
Coeff Var        26.66958

Slightly worse MSE and adjusted R-Sq

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1      0.90768            0.24355        3.73     0.0002
hsm         1      0.20760            0.02872        7.23    <.0001

ANOVA Table #3

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              1        25.80989      25.80989      52.25   <.0001
Error            222       109.65290       0.49393
Corrected Total  223       135.46279

Significant F test and all variable t tests significant

SATs (Model #4)

proc reg data=a1;
  model gpa=satm satv;
run;

Results Model #4

Root MSE          0.75770    R-Square  0.0634
Dependent Mean    2.63522    Adj R-Sq  0.0549
Coeff Var        28.75287

Much worse MSE and adjusted R-Sq

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1      1.28868            0.37604        3.43     0.0007
satm        1      0.00228            0.00066291     3.44     0.0007
satv        1     -0.00002456         0.00061847    -0.04     0.9684

ANOVA Table #4

Analysis of Variance
Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              2         8.58384       4.29192       7.48   0.0007
Error            221       126.87895       0.57411
Corrected Total  223       135.46279

Significant F test but not all variable t tests significant

HS and SATs (Model #5)

proc reg data=a1;
  model gpa=satm satv hsm hss hse;
  *Does general linear test;
  sat: test satm, satv;
  hs: test hsm, hss, hse;
run;

Results Model #5

Root MSE          0.70000    R-Square  0.2115
Dependent Mean    2.63522    Adj R-Sq  0.1934
Coeff Var        26.56311

Parameter Estimates

Variable   DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept   1      0.32672            0.40000        0.82     0.4149
hsm         1      0.14596            0.03926        3.72     0.0003
hss         1      0.03591            0.03780        0.95     0.3432
hse         1      0.05529            0.03957        1.40     0.1637
satm        1      0.00094359         0.00068566     1.38     0.1702
satv        1     -0.00040785         0.00059189    -0.69     0.4915

Test sat

Test sat Results for Dependent Variable gpa

Source         DF   Mean Square   F Value   Pr > F
Numerator       2      0.46566       0.95   0.3882
Denominator   218      0.49000

Cannot reject the reduced model. No significant information is lost: we don't need the SAT variables.
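Each `test` statement computes a general linear (partial) F statistic, F = [(SSE_reduced − SSE_full)/q] / [SSE_full/df_full]. The sketch below reproduces the "Test sat" F from the printed output. Note that SSE for Model #5 is not printed directly, so it is recovered here as 218 × 0.49000 (denominator df × mean square), an approximation from rounded output.

```python
def partial_f(sse_reduced, sse_full, q, df_full):
    """General linear test: extra SS per dropped parameter, over the full-model MSE."""
    return ((sse_reduced - sse_full) / q) / (sse_full / df_full)

sse_full = 218 * 0.49000      # Model #5 error SS, recovered from df * MSE (rounded output)
sse_reduced = 107.75046       # Model #1 error SS (HS grades only, SAT terms dropped)
f_sat = partial_f(sse_reduced, sse_full, 2, 218)
print(round(f_sat, 2))        # close to the printed F Value of 0.95
```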
Test hs

Test hs Results for Dependent Variable gpa

Source         DF   Mean Square   F Value   Pr > F
Numerator       3      6.68660      13.65   <.0001
Denominator   218      0.49000

Reject the reduced model. Significant information would be lost: we can't remove the HS variables from the model.

Best Model?
• Likely the one with just HSM, or the one with HSE and HSM
• We'll discuss comparison methods in Chapters 7 and 8

Key ideas from case study
• First, look at graphical and numerical summaries one variable at a time
• Then look at relationships between pairs of variables with graphical and numerical summaries
• Use plots and correlations to understand relationships

Key ideas from case study
• The relationship between a response variable and an explanatory variable depends on what other explanatory variables are in the model
• A variable can be a significant (P<.05) predictor alone and not significant (P>.05) when other X's are in the model

Key ideas from case study
• Regression coefficients, standard errors, and the results of significance tests depend on what other explanatory variables are in the model

Key ideas from case study
• Significance tests (P-values) do not tell the whole story
• The squared multiple correlation R² (the proportion of variation in the response variable explained by the explanatory variables) can give a different view
• We often express R² as a percent

Key ideas from case study
• You can fully understand the theory in terms of Y = Xβ + e
• However, to use this methodology effectively in practice you need to understand how the data were collected, the nature of the variables, and how they relate to each other

Background Reading
• Cs2.sas contains the SAS commands used in this topic
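The R² values quoted throughout come straight from the ANOVA decomposition: R² = SSM/SST, and adjusted R² = 1 − (SSE/df_E)/(SST/df_T). A quick arithmetic check against Model #1's ANOVA table (a sketch, not SAS):

```python
# ANOVA Table #1: Model SS, Error SS, Corrected Total SS, and error/total df.
ssm, sse, sst = 27.71233, 107.75046, 135.46279
df_e, df_t = 220, 223

r2 = ssm / sst                              # proportion of variation explained
adj_r2 = 1 - (sse / df_e) / (sst / df_t)    # penalizes extra parameters
print(round(r2, 4), round(adj_r2, 4))       # 0.2046 0.1937, matching the printed output
```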