Our group used binary logistic regression to determine how well a student's GPA at the
end of the three-semester period predicted whether the student remained a computer
science, engineering, or other science-related major. We chose binary logistic regression
because our response variable takes only two values: success or failure. A success means
the student remained a computer science, engineering, or other science-related major, and
a failure means the student changed to a major outside engineering or science. From the
table below you can see that 156 students remained a computer science major or switched
to a major in engineering or some other science, and 78 students switched to a major
outside science or engineering.
Variable      Value  Count
success flag  1        156  (Event)
              0         78
              Total    234
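The analysis above was run in Minitab, so no code appears in the original report. Purely as a sketch of the same fit in Python, the block below assumes a pandas DataFrame named students with a numeric gpa column and a maj column coded 1 = computer science, 2 = other science or engineering, 3 = outside science or engineering; the file name and column names are assumptions, not part of the original data.

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical file and column names; the original analysis was done in Minitab.
students = pd.read_csv("students.csv")

# Code the response: 1 (success) if the student stayed in a computer science,
# engineering, or other science-related major; 0 (failure) otherwise.
# Assumes maj is coded 1 = CS, 2 = other science/engineering, 3 = outside science.
students["success"] = (students["maj"] != 3).astype(int)
print(students["success"].value_counts())    # should show 156 successes and 78 failures

# Binary logistic regression of the success flag on GPA.
X = sm.add_constant(students["gpa"])
model = sm.Logit(students["success"], X).fit()
print(model.summary())                       # coefficients, standard errors, z, p-values
```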
Next we analyzed our logistic regression table to determine the significance of
"GPA" as a predictor of the response variable "Maj." Our logistic regression table is as follows:
                                               Odds     95% CI
Predictor      Coef   SE Coef      Z      P   Ratio   Lower  Upper
Constant   -3.12405  0.652317  -4.79  0.000
gpa         1.43270  0.241520   5.93  0.000    4.19    2.61   6.73
From this table we found a model of the data that has the log odds as a linear function of
the explanatory variable. The form for this model is log(odds) = β0 + β1x, with β0 equal to
the constant coefficient and β1 equal to the GPA coefficient. Our fitted regression model
for these data was log(odds) = -3.12 + 1.43x. Our odds ratio was 4.19 with a 95%
confidence interval of (2.61, 6.73). This means that a one-unit increase in GPA multiplies
the odds that a student remains a computer science, engineering, or other science-related
major by about 4.2. We then examined the hypothesis that the regression coefficient for
the explanatory variable was zero. If the coefficient were zero, the explanatory variable
(GPA) would have no effect on predicting whether the student remained a computer
science, engineering, or other science-related major. Our null hypothesis was that β1 = 0
and our alternative hypothesis was that β1 ≠ 0. Since the P-value for GPA in the logistic
regression table is 0.000, we reject the null hypothesis that β1 = 0. This is statistically
significant evidence that a student's GPA predicts whether the student remained a
computer science, engineering, or other science-related major. As seen below, the test that
all slopes are zero also returns a significant P-value of 0.000, which likewise rejects the
null hypothesis that β1 = 0.
Test that all slopes are zero: G = 46.902, DF = 1, P-Value = 0.000
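As a quick check on the numbers in the table above, the odds ratio and its 95% confidence interval follow directly from the GPA coefficient and its standard error, and the fitted equation converts any GPA into a predicted probability; the GPA of 3.0 below is only an illustrative value.

```python
import numpy as np

b0, b1 = -3.12405, 1.43270     # fitted constant and GPA coefficient from the table
se_b1 = 0.241520               # standard error of the GPA coefficient

# Odds ratio and 95% CI: exponentiate the coefficient and its Wald interval.
odds_ratio = np.exp(b1)                                # about 4.19
ci_lower = np.exp(b1 - 1.96 * se_b1)                   # about 2.61
ci_upper = np.exp(b1 + 1.96 * se_b1)                   # about 6.73

# Predicted probability of remaining in a science/engineering major at an illustrative GPA.
gpa = 3.0                                              # hypothetical student
log_odds = b0 + b1 * gpa
prob = 1 / (1 + np.exp(-log_odds))                     # about 0.76 for GPA = 3.0

print(odds_ratio, (ci_lower, ci_upper), prob)
```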
The Hosmer-Lemeshow goodness-of-fit test gives a P-value of 0.885, so there is
insufficient evidence to conclude that the model fits the data inadequately.
Goodness-of-Fit Tests

Method            Chi-Square  DF      P
Hosmer-Lemeshow        3.672   8  0.885
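Minitab reports the Hosmer-Lemeshow statistic directly; for readers without Minitab, the sketch below shows one way to compute it, assuming the students DataFrame and fitted model from the earlier sketch. It bins the students into groups of roughly equal size by fitted probability and compares observed and expected counts with a chi-square statistic on (number of groups - 2) degrees of freedom.

```python
import pandas as pd
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, groups=10):
    """Hosmer-Lemeshow goodness-of-fit statistic for a 0/1 response y
    and fitted probabilities p_hat, using bins of predicted probability."""
    d = pd.DataFrame({"y": y, "p": p_hat})
    d["bin"] = pd.qcut(d["p"], q=groups, duplicates="drop")
    grouped = d.groupby("bin", observed=True)
    n = grouped["y"].count()
    obs1, exp1 = grouped["y"].sum(), grouped["p"].sum()   # observed/expected successes
    obs0, exp0 = n - obs1, n - exp1                       # observed/expected failures
    stat = ((obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0).sum()
    return stat, chi2.sf(stat, df=len(n) - 2)

# Example (hypothetical names): hosmer_lemeshow(students["success"], model.predict(X))
```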
By looking at the table of observed and expected frequencies below, we can see that the
observed and expected counts are similar, indicating a good fit for the model, consistent
with the Hosmer-Lemeshow statistic.
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
Group           1     2     3     4     5     6     7     8     9    10  Total
Value 1  Obs    6     9    12    13    17    19    20    19    21    20    156
         Exp  4.9   9.9  13.7  15.1  16.1  17.1  19.0  20.1  20.1  20.0
Value 0  Obs   17    14    13    11     6     4     4     5     2     2     78
         Exp 18.1  13.1  11.3   8.9   6.9   5.9   5.0   3.9   2.9   2.0
Total          23    23    25    24    23    23    24    24    23    22    234
Our concordant percentage was 76.6, meaning that in 76.6% of the pairs consisting of one
student who remained in a computer science, engineering, or other science-related major
and one who did not, the student who remained had the higher predicted probability based
on GPA. Our Somers' D and Goodman-Kruskal gamma measures were both 0.54, which
indicates moderate predictive ability. These measures range from -1 to 1, with values
closer to 1 indicating stronger predictive ability.
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs        Number  Percent     Summary Measures
Concordant     9316     76.6     Somers' D              0.54
Discordant     2794     23.0     Goodman-Kruskal Gamma  0.54
Ties             58      0.5     Kendall's Tau-a        0.24
Total         12168    100.0
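The concordant and discordant pair counts and Somers' D in the table above can also be reproduced by brute force: pair every success with every failure (156 x 78 = 12,168 pairs here) and check which member of each pair received the higher fitted probability. The sketch below assumes the students DataFrame and fitted model from the earlier sketch.

```python
import numpy as np

def concordance(y, p_hat):
    """Concordant, discordant, and tied pairs between 0/1 outcomes and
    fitted probabilities, plus Somers' D = (C - D) / total pairs."""
    y, p_hat = np.asarray(y), np.asarray(p_hat)
    p1 = p_hat[y == 1]                      # fitted probabilities for successes
    p0 = p_hat[y == 0]                      # fitted probabilities for failures
    diff = p1[:, None] - p0[None, :]        # compare every success with every failure
    conc, disc, ties = (diff > 0).sum(), (diff < 0).sum(), (diff == 0).sum()
    total = conc + disc + ties
    return conc, disc, ties, (conc - disc) / total

# Example (hypothetical names): concordance(students["success"], model.predict(X))
# With the counts reported above, Somers' D = (9316 - 2794) / 12168, about 0.54.
```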
Our conclusion from this analysis, based on the evidence above, is that a student's GPA is
a statistically significant predictor of whether the student remained a computer science,
engineering, or other science-related major after three semesters.
For question two, our group needed to determine statistically whether "SEX" and "Maj"
had a significant effect on the variable "HSS." We performed a two-way ANOVA testing
the hypotheses that neither of these two variables, nor their interaction, had an effect on
high school science score (HSS). The two-way ANOVA compares the means of
populations that can be classified in two ways. We chose the two-way ANOVA because
we have two categorical explanatory variables, "SEX" and "Maj," and one quantitative
response variable, "HSS." As shown in the two-way ANOVA table below, the P-values
for major, sex, and the interaction effect are 0.000, 0.025, and 0.009, respectively.
Source         DF       SS       MS     F      P
maj             2   44.410  22.2051  8.69  0.000
sex             1   12.927  12.9274  5.06  0.025
Interaction     2   24.855  12.4274  4.86  0.009
Error         228  582.923   2.5567
Total         233  665.115

S = 1.599   R-Sq = 12.36%   R-Sq(adj) = 10.44%
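The two-way ANOVA above was produced in Minitab; a comparable fit in Python with statsmodels might look like the sketch below, assuming the same students DataFrame with columns hss, maj, and sex. Because the design is unbalanced, the sums of squares can differ slightly from Minitab's depending on the type of sums of squares used.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Two-way ANOVA of high school science score on major and sex, with interaction.
ols_fit = smf.ols("hss ~ C(maj) * C(sex)", data=students).fit()
anova_table = sm.stats.anova_lm(ols_fit, typ=2)   # SS, F statistics, and p-values
print(anova_table)
```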
For this omnibus test the null hypothesis is that the explanatory variables "Maj" and
"SEX" have no effect on the response variable "HSS." The alternative hypothesis is that
there is an effect from the explanatory variables on the response. At the 95% confidence
level we reject the null hypothesis for both explanatory variables and for the interaction.
The lowest P-value among these tests belongs to the variable "Maj," which is very strong
evidence in favor of the alternative hypothesis that there is an effect.
By looking at the boxplots of high school science scores below, you can see that for sex 1
(males), those who remained a computer science major have a higher median high school
science score than those who switched to another science or engineering major or changed
to a major outside science or engineering.
[Figure: boxplots of high school science score (hss) for each combination of major (1, 2, 3) and sex (1, 2)]
On the other hand, for sex 2 (females) the median high school science scores are closer
together, with those who switched to another science or engineering major having the
highest score.
The significance of the main effect for sex reflects the fact that sex 2 has higher high
school science scores than sex 1 in all three majors, as seen below. The analysis indicates
that a complete description of the high school science scores requires consideration of the
interaction in addition to the main effects: the two lines in the plot below are not parallel,
which is the mark of an interaction effect.
[Figure: plot of High School Science versus Major (1, 2, 3), with separate lines for Sex 1 and Sex 2]
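A plot like the one described above can be reproduced with statsmodels' interaction_plot helper, again assuming the students DataFrame with columns hss, maj, and sex; non-parallel lines are the visual counterpart of the significant interaction term.

```python
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot

# Mean high school science score by major, one line per sex.
fig, ax = plt.subplots()
interaction_plot(x=students["maj"], trace=students["sex"],
                 response=students["hss"], ax=ax)
ax.set_xlabel("Major")
ax.set_ylabel("Mean high school science score")
plt.show()
```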
For sex 1, high school science scores change little between those who switched to
engineering or another science-related major and those who switched to a major outside
science or engineering. For sex 2, however, there is a noticeable drop in high school
science scores between those who switched to another science or engineering major and
those who switched to a major outside science or engineering. Notice also that sex 1 and
sex 2 move in opposite directions between the computer science major and another
science or engineering major: for sex 1 the high school science scores decrease, while for
sex 2 they increase.
In conclusion, we found that "SEX" and "Maj" have a statistically significant effect on the
variable "HSS," and that there is a significant interaction effect on the response variable
as well.
For question three, our group used binary logistic regression to find the best predictors of
whether a student remains a computer science major after three semesters. Here again we
chose binary logistic regression because our response variable had only two values:
success or failure. A success means the student remained a computer science major, and a
failure means the student changed to a major in engineering or some other science, or to a
major outside engineering or science. We began with a binary logistic regression whose
only explanatory variable was "GPA." This gave a test-that-all-slopes-are-zero P-value of
0.083, which is not significant on its own. We then added other explanatory variables in
an attempt to find the combination that gave the most accurate fit. We found that "SEX"
and "HSS" did not improve the accuracy of the model, but the remaining variables
combined ("SATM", "SATV", "HSM", "HSE", and "GPA") provided the most accurate
prediction of the response variable "Maj."
We then examined the hypothesis that the regression coefficients for the explanatory
variables were zero. If the coefficients were zero, the explanatory variables (HSM),
(SATV), (GPA), (HSE), and (SATM) would have no effect on predicting whether the
student remained a computer science major. Our null hypothesis was that β1 = β2 = β3 =
β4 = β5 = 0, and our alternative hypothesis was that at least one of β1, β2, β3, β4, β5 is
nonzero.
The logistic regression table below shows that, with the exception of "GPA," "SATM,"
and "HSE," the variables' individual P-values are all significant at the 5% level. Despite
this, including these three variables alongside the others gave us the model that best
predicted which students remain a computer science major.
Logistic Regression Table

                                                  Odds     95% CI
Predictor        Coef    SE Coef      Z      P   Ratio   Lower  Upper
Constant     -3.29890    1.38587  -2.38  0.017
gpa          0.162896   0.223952   0.73  0.467    1.18    0.76   1.83
hsm          0.403254   0.138902   2.90  0.004    1.50    1.14   1.97
satm       -0.0034206  0.0022712  -1.51  0.132    1.00    0.99   1.00
satv        0.0047816  0.0017937   2.67  0.008    1.00    1.00   1.01
hse         -0.213466   0.112048  -1.91  0.057    0.81    0.65   1.01
Test that all slopes are zero: G = 19.429, DF = 5, P-Value = 0.002
Also note that the P-value for the test that all slopes are zero is 0.002, which is highly
significant. From this table we found a model of the data that has the log odds as a linear
function of the explanatory variables. The form for this model is log(odds) = β0 + β1HSM
+ β2SATV + β3GPA + β4SATM + β5HSE. Our fitted regression model for these data was
log(odds) = -3.299 + 0.403(HSM) + 0.005(SATV) + 0.163(GPA) - 0.003(SATM) - 0.213(HSE).
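As with the single-predictor model, the fitted equation can be turned into a predicted probability for a given student. The profile below is purely hypothetical and only illustrates how the five coefficients combine.

```python
import numpy as np

# Coefficients from the logistic regression table above.
b = {"const": -3.29890, "gpa": 0.162896, "hsm": 0.403254,
     "satm": -0.0034206, "satv": 0.0047816, "hse": -0.213466}

# Hypothetical student profile (illustrative values only).
x = {"gpa": 3.0, "hsm": 9, "satm": 600, "satv": 550, "hse": 8}

log_odds = b["const"] + sum(b[k] * v for k, v in x.items())
prob_remain_cs = 1 / (1 + np.exp(-log_odds))   # about 0.42 for this profile
print(round(prob_remain_cs, 2))
```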
The Hosmer-Lemeshow goodness-of-fit test gives a P-value of 0.639, so there is
insufficient evidence to conclude that the model fits the data inadequately. This P-value of
0.639 was the highest we could achieve for this goodness-of-fit test among the models we
tried.
Goodness-of-Fit Tests

Method            Chi-Square  DF      P
Hosmer-Lemeshow        6.070   8  0.639
By looking at the table of observed and expected frequencies below, we can see that the
observed and expected counts are similar, indicating a good fit for the model, consistent
with the Hosmer-Lemeshow statistic.
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
Group           1     2     3     4     5     6     7     8     9    10  Total
Value 1  Obs    1     6     4     8     7     8    10    13    10    11     78
         Exp  2.6   4.3   5.8   6.2   7.4   8.1   8.9  10.2  11.9  12.6
Value 0  Obs   22    17    20    15    17    15    13    11    15    11    156
         Exp 20.4  18.7  18.2  16.8  16.6  14.9  14.1  13.8  13.1   9.4
Total          23    23    24    23    24    23    23    24    25    22    234
Our concordant percentage was 65.8, meaning that in 65.8% of the pairs consisting of one
student who remained a computer science major and one who did not, the model assigned
the higher predicted probability to the student who remained.
Based on the evidence above, we found that the explanatory variables "GPA", "SATM",
"SATV", "HSM", and "HSE" together were the best predictors of whether a student will
remain a computer science major after three semesters.