Josh Simpson & Scott Ouzts

Our group used binary logistic regression to determine how well a student’s GPA at the end of the three-semester period predicted whether the student remained a computer science, engineering, or other science-related major. We chose binary logistic regression because our response variable had only two values: success or failure. A success means the student remained a computer science, engineering, or other science-related major; a failure means the student changed to a major outside engineering or science. From the table below you can see that 156 students remained a computer science major or switched to a major in engineering or some other science, and 78 students switched to a major outside science or engineering.

Variable      Value  Count
success flag  1        156  (Event)
              0         78
              Total    234

Next we analyzed our logistic regression table to determine the significance of “GPA” on the response variable “Maj.” Our logistic regression table is as follows:

                                             Odds     95% CI
Predictor  Coef      SE Coef   Z      P      Ratio  Lower  Upper
Constant   -3.12405  0.652317  -4.79  0.000
gpa         1.43270  0.241520   5.93  0.000   4.19   2.61   6.73

From this table we found a model of the data that has the log odds as a linear function of the explanatory variable. The form for this model is log(odds) = β0 + β1x, with β0 equal to the constant coefficient and β1 equal to the GPA coefficient. Our fitted regression model for this data was log(odds) = -3.12 + 1.43x. Our odds ratio was 4.19 with a 95% confidence interval of (2.61, 6.73). This means that a one-unit increase in GPA multiplies the estimated odds that a student remains a computer science, engineering, or other science-related major by about 4.2.

We then examined the hypothesis that the regression coefficient for the explanatory variable was zero.
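
As a check on the arithmetic, the odds ratio, its 95% confidence interval, and the Z statistic for GPA can be recomputed from the coefficient and standard error in the logistic regression table. The Python sketch below is not part of the Minitab output; it assumes a Wald-type interval with the normal critical value 1.96, and the variable names are illustrative.

```python
import math

# Values copied from the gpa row of the logistic regression table.
coef = 1.43270      # estimated slope for GPA (beta_1)
se_coef = 0.241520  # standard error of the slope

# Assumption: Wald-type 95% interval using the normal critical value 1.96.
z_crit = 1.96

odds_ratio = math.exp(coef)                    # odds multiplier per GPA unit
ci_lower = math.exp(coef - z_crit * se_coef)   # lower 95% limit on the odds ratio
ci_upper = math.exp(coef + z_crit * se_coef)   # upper 95% limit on the odds ratio
z_stat = coef / se_coef                        # Wald Z for H0: beta_1 = 0

print(f"odds ratio = {odds_ratio:.2f}")
print(f"95% CI     = ({ci_lower:.2f}, {ci_upper:.2f})")
print(f"Z          = {z_stat:.2f}")
```

Under these assumptions the results agree with the table to two decimal places.
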
If the GPA coefficient is zero, GPA would have no effect on predicting whether the student remained a computer science, engineering, or other science-related major. Our null hypothesis was that β1 = 0 and our alternative hypothesis was that β1 ≠ 0. From the logistic regression table, since the P-value for GPA is reported as 0.000 (that is, less than 0.0005), we can reject the null hypothesis that β1 = 0. This is statistically significant evidence that a student’s GPA predicts whether the student remained a computer science, engineering, or other science-related major. As seen below, the test that all slopes are zero also returns a significant P-value of 0.000, which likewise leads us to reject the null hypothesis that β1 = 0.

Test that all slopes are zero: G = 46.902, DF = 1, P-Value = 0.000

For the Hosmer-Lemeshow goodness-of-fit test the P-value is .885. This is insufficient evidence to conclude that the model does not fit the data adequately.

Goodness-of-Fit Tests
Method           Chi-Square  DF  P
Hosmer-Lemeshow  3.672       8   0.885

The table of observed and expected frequencies shows that the observed and expected frequencies are similar, indicating a good fit for the model, consistent with the Hosmer-Lemeshow statistic.

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                                    Group
Value         1     2     3     4     5     6     7     8     9    10  Total
1    Obs      6     9    12    13    17    19    20    19    21    20    156
     Exp    4.9   9.9  13.7  15.1  16.1  17.1  19.0  20.1  20.1  20.0
0    Obs     17    14    13    11     6     4     4     5     2     2     78
     Exp   18.1  13.1  11.3   8.9   6.9   5.9   5.0   3.9   2.9   2.0
Total        23    23    25    24    23    23    24    24    23    22    234

Our concordant percent was 76.6. This means that in 76.6% of the pairs formed from one student who remained a computer science, engineering, or other science-related major and one student who changed their major to outside science or engineering, the model gave the higher predicted probability to the student who remained. Our Somers’ D and Goodman-Kruskal Gamma measures were both .54, which shows a fairly weak predictive ability.
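
The concordant percent and the summary measures can be recomputed from the pair counts that Minitab reports for this model (9316 concordant, 2794 discordant, and 58 tied pairs). The Python sketch below assumes the usual textbook definitions of Somers' D, Goodman-Kruskal Gamma, and Kendall's Tau-a; the variable names are illustrative.

```python
# Pair counts copied from the measures-of-association output.
concordant = 9316
discordant = 2794
ties = 58
n_obs = 234  # total number of students

# Every pair consists of one success and one failure: 156 * 78 pairs.
total_pairs = concordant + discordant + ties

pct_concordant = 100 * concordant / total_pairs
somers_d = (concordant - discordant) / total_pairs
gamma = (concordant - discordant) / (concordant + discordant)
# Tau-a divides by all pairs of observations, not only success/failure pairs.
tau_a = (concordant - discordant) / (n_obs * (n_obs - 1) / 2)

print(f"concordant = {pct_concordant:.1f}%")
print(f"Somers' D  = {somers_d:.2f}")
print(f"Gamma      = {gamma:.2f}")
print(f"Tau-a      = {tau_a:.2f}")
```

Under these definitions the values match the reported 76.6%, 0.54, 0.54, and 0.24.
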
Somers’ D, Goodman-Kruskal Gamma, and Kendall’s Tau-a typically lie between 0 and 1 for a model with some predictive ability, with higher values meaning more predictive ability.

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

Pairs       Number  Percent    Summary Measures
Concordant    9316     76.6    Somers' D              0.54
Discordant    2794     23.0    Goodman-Kruskal Gamma  0.54
Ties            58      0.5    Kendall's Tau-a        0.24
Total        12168    100.0

Our conclusion, based on the evidence provided above, is that a student’s GPA is a statistically significant variable in predicting whether the student remained a computer science, engineering, or other science-related major after three semesters.

For question two, our group needed to determine whether “SEX” and “Maj” had a statistically significant effect on the variable “HSS.” We performed a 2-way ANOVA, testing the null hypotheses that neither variable has an effect on high school science (HSS) and that there is no interaction effect. The 2-way ANOVA compares the means of populations that can be classified in two ways. We chose the 2-way ANOVA because we have two categorical explanatory variables, “SEX” and “Maj,” and one quantitative response variable, “HSS.” As shown below in our 2-way ANOVA table, the P-values for major, sex, and the interaction effect are 0.000, 0.025, and 0.009.

Source        DF       SS       MS     F      P
maj            2   44.410  22.2051  8.69  0.000
sex            1   12.927  12.9274  5.06  0.025
Interaction    2   24.855  12.4274  4.86  0.009
Error        228  582.923   2.5567
Total        233  665.115

S = 1.599   R-Sq = 12.36%   R-Sq(adj) = 10.44%

For this omnibus test the null hypothesis is that the explanatory variables “Maj” and “SEX” have no effect on the response variable “HSS.” The alternative hypothesis is that there is an effect from the explanatory variables on the response. At the 5% significance level we reject the null hypothesis for both explanatory variables and for the interaction. The lowest P-value among these tests belongs to “Maj,” which shows very strong evidence supporting the alternative hypothesis that there is an effect.
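
The mean squares, F statistics, S, and R-Sq values in the 2-way ANOVA table all follow from the sums of squares and degrees of freedom. The Python sketch below redoes that arithmetic as a check; it is not Minitab output, the dictionary layout and names are illustrative, and the P-values are not recomputed because that would require an F distribution routine.

```python
import math

# Sums of squares and degrees of freedom copied from the 2-way ANOVA table.
table = {
    "maj":         {"ss": 44.410,  "df": 2},
    "sex":         {"ss": 12.927,  "df": 1},
    "interaction": {"ss": 24.855,  "df": 2},
    "error":       {"ss": 582.923, "df": 228},
}
ss_total, df_total = 665.115, 233

# Mean square error: the denominator of every F statistic.
mse = table["error"]["ss"] / table["error"]["df"]

f_stats = {}
for source in ("maj", "sex", "interaction"):
    mean_square = table[source]["ss"] / table[source]["df"]
    f_stats[source] = mean_square / mse
    print(f"{source:12s} MS = {mean_square:8.4f}  F = {f_stats[source]:.2f}")

s = math.sqrt(mse)                           # pooled standard deviation
r_sq = 1 - table["error"]["ss"] / ss_total   # share of variation explained
r_sq_adj = 1 - mse / (ss_total / df_total)   # adjusted for degrees of freedom

print(f"S = {s:.3f}  R-Sq = {100 * r_sq:.2f}%  R-Sq(adj) = {100 * r_sq_adj:.2f}%")
```

The recomputed values agree with the table up to rounding of the reported sums of squares.
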
By looking at the boxplots below of high school science scores, you can tell that for sex 1 (males), those who chose to remain a computer science major have a higher median high school science score than those who switched to another science major or changed to a major outside science or engineering.

[Figure: Boxplot of hss by maj and sex]

On the other hand, for sex 2 (females) the median high school science scores are closer together, with those who switched to another science or engineering major having the highest score. The significance of the main effect for sex is due to the fact that sex 2 has higher high school science scores than sex 1 in all 3 majors, as seen below. The analysis indicates that a complete description of the high school science scores requires consideration of the interaction in addition to the main effects. The two lines in the plot are not parallel, which reflects the interaction effect.

[Figure: Scatterplot of High School Science vs Major, by Sex]

For sex 1 the high school science scores do not change much between those who switched to engineering or another science-related field and those who switched to a non-science or non-engineering major. For sex 2, however, there is a marked decrease in high school science scores between those who switched to another science or engineering major and those who switched to a non-science or non-engineering major. The two sexes also move in opposite directions between the computer science major and another science or engineering major: for sex 1 high school science scores decrease, and for sex 2 they increase.

In conclusion, we found that “SEX” and “Maj” have a statistically significant effect on the variable “HSS.” We also found a significant interaction effect on the response variable.

For question three our group used binary logistic regression to find the best predictors of whether a student remains a computer science major after three semesters. Here again we chose binary logistic regression because our response variable had only two values: success or failure. A success means the student remained a computer science major; a failure means the student changed to a major in engineering or some other science, or changed to a major outside engineering or science.

We began by fitting a binary logistic regression with “GPA” as the only explanatory variable. This gave a P-value of .083 for the test that all slopes are zero, which is not significant at the 5% level, so GPA alone is not enough. We then added other explanatory variables in an attempt to find the combination that gave the most accurate fit. We found that “SEX” and “HSS” did not help the accuracy of the model, but the other variables combined (“SATM”, “SATV”, “HSM”, “HSE”, and “GPA”) provided us with the most accurate prediction of the response variable “Maj.”

We then examined the hypothesis that the regression coefficients for the explanatory variables were zero. If the coefficients are zero, the explanatory variables (HSM), (SATV), (GPA), (HSE), and (SATM) would have no effect on predicting whether the student remained a computer science major. Our null hypothesis was that β1 = β2 = β3 = β4 = β5 = 0 and our alternative hypothesis was that at least one of β1, β2, β3, β4, or β5 is not zero. The logistic regression table below shows that, with the exception of “GPA”, “SATM”, and “HSE”, the other variables’ individual P-values are all significant at the 5% level. Despite this fact, keeping these three variables in the model provided us with the best predictor of students who remain a computer science major.

Logistic Regression Table

                                                  Odds     95% CI
Predictor  Coef        SE Coef    Z      P      Ratio  Lower  Upper
Constant   -3.29890    1.38587    -2.38  0.017
gpa         0.162896   0.223952    0.73  0.467   1.18   0.76   1.83
hsm         0.403254   0.138902    2.90  0.004   1.50   1.14   1.97
satm       -0.0034206  0.0022712  -1.51  0.132   1.00   0.99   1.00
satv        0.0047816  0.0017937   2.67  0.008   1.00   1.00   1.01
hse        -0.213466   0.112048   -1.91  0.057   0.81   0.65   1.01

Test that all slopes are zero: G = 19.429, DF = 5, P-Value = 0.002

Also, note that the P-value for the test that all slopes are zero is .002, which is significant. From this table we found a model of the data that has the log odds as a linear function of the explanatory variables. The form for this model is log(odds) = β0 + β1HSM + β2SATV + β3GPA + β4SATM + β5HSE. Our fitted regression model for this data was log(odds) = -3.299 + .403(HSM) + .005(SATV) + .163(GPA) - .003(SATM) - .213(HSE).

For the Hosmer-Lemeshow goodness-of-fit test the P-value is .639. This is insufficient evidence to conclude that the model does not fit the data adequately. The P-value of .639 for this goodness-of-fit test was the highest P-value we could achieve.

Goodness-of-Fit Tests
Method           Chi-Square  DF  P
Hosmer-Lemeshow  6.070       8   0.639

The table of observed and expected frequencies shows that the observed and expected frequencies are similar, indicating a good fit for the model, consistent with the Hosmer-Lemeshow statistic.

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

                                    Group
Value         1     2     3     4     5     6     7     8     9    10  Total
1    Obs      1     6     4     8     7     8    10    13    10    11     78
     Exp    2.6   4.3   5.8   6.2   7.4   8.1   8.9  10.2  11.9  12.6
0    Obs     22    17    20    15    17    15    13    11    15    11    156
     Exp   20.4  18.7  18.2  16.8  16.6  14.9  14.1  13.8  13.1   9.4
Total        23    23    24    23    24    23    23    24    25    22    234

Our concordant percent was 65.8. This means that in 65.8% of the pairs formed from one student who remained a computer science major and one student who did not, the model gave the higher predicted probability to the student who remained.
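
The Hosmer-Lemeshow chi-square for this model can be approximately recovered from the observed and expected frequencies by summing (Obs - Exp)^2 / Exp over both rows of the table. The Python sketch below does this with the rounded expected counts printed in the table, so it only approximates the reported 6.070 and .639. The helper names are illustrative, and the P-value helper uses a closed form that holds only for an even number of degrees of freedom (here 10 groups - 2 = 8).

```python
import math

# Observed and expected counts copied from the table (groups 1 to 10).
obs_1 = [1, 6, 4, 8, 7, 8, 10, 13, 10, 11]
exp_1 = [2.6, 4.3, 5.8, 6.2, 7.4, 8.1, 8.9, 10.2, 11.9, 12.6]
obs_0 = [22, 17, 20, 15, 17, 15, 13, 11, 15, 11]
exp_0 = [20.4, 18.7, 18.2, 16.8, 16.6, 14.9, 14.1, 13.8, 13.1, 9.4]


def pearson_sum(observed, expected):
    """Sum of (O - E)^2 / E across the groups of one response value."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi2_upper_tail_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid only for even df."""
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))


hl_stat = pearson_sum(obs_1, exp_1) + pearson_sum(obs_0, exp_0)
df = len(obs_1) - 2  # number of groups minus two
p_value = chi2_upper_tail_even_df(hl_stat, df)

print(f"Hosmer-Lemeshow chi-square = {hl_stat:.3f}, DF = {df}, P = {p_value:.3f}")
```

The small gap between the recomputed and reported statistic comes from using expected counts rounded to one decimal place.
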

Based on the evidence above, we found that the explanatory variables “GPA”, “SATM”, “SATV”, “HSM”, and “HSE” were the best predictors of whether a student will remain a computer science major after three semesters.
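
To illustrate how the fitted model for question three turns predictor values into a predicted probability, the Python sketch below applies the coefficients from the logistic regression table to one student. The student's values are hypothetical and chosen only for illustration; they are not taken from the data set, and the function and variable names are our own.

```python
import math

# Coefficients copied from the logistic regression table for question three.
coefs = {
    "constant": -3.29890,
    "gpa": 0.162896,
    "hsm": 0.403254,
    "satm": -0.0034206,
    "satv": 0.0047816,
    "hse": -0.213466,
}


def predicted_probability(student):
    """Fitted probability of remaining a computer science major."""
    log_odds = coefs["constant"] + sum(
        coefs[name] * value for name, value in student.items()
    )
    # Invert log(odds) = log(p / (1 - p)) to recover p.
    return 1 / (1 + math.exp(-log_odds))


# Hypothetical student, made up for illustration (not a row from the data).
example_student = {"gpa": 3.0, "hsm": 8, "satm": 600, "satv": 500, "hse": 8}
p_example = predicted_probability(example_student)

print(f"predicted probability = {p_example:.3f}")
```

Changing one entry of example_student while holding the others fixed shows the direction of each effect: a higher HSM or SATV raises the predicted probability, while a higher HSE or SATM lowers it.
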