Statistical Consulting Firm

advertisement
1
Customer’s Question:
“How well does a student’s GPA at the end of the three semester period predict whether
the student remained a CS, engineering, or other science-related major?”
Firm’s Response:
We will be explaining the step-by-step process by which we answered the customer’s
question, and we will also be explaining the meaning behind the statistical output corresponding
to the model.
The given question poses an “either-or” scenario. Either a given student remained within
the warm embrace of science, or he/she decided to major in a field outside engineering or
science. Therefore, we can model this problem with a Binary Logistic Regression, but we must
first re-code the original categorical variable of major. The data given to us had broken the
students into one of three categories: major one indicates a student who remained a CS major at
the end of the three semester period, major two indicates a student who changed to a major in
engineering or some other science, and major three indicates a student who changed to a major
outside engineering or science. The students were classified as a 1, 2, or 3 according to their
respective major. We re-coded the data and classified any student that was a 1 or a 2 as a
“success” (=1). We also re-coded any student that was a 3 as a “failure” (=0). We renamed our
re-coded major column as “remains science major,” in preparation of running the Binary Logistic
Regression in Minitab. We ran the Binary Logistic Regression with “remains science major” as
our response and “gpa” as our model.
The first part of the Minitab output gives us some basic statistics about the number of
successes (=1) and the number of failures (=0):
Response Information
Variable
remains science major
Value
1
0
Total
Count
156
78
234
(Event)
The next part of the Minitab output gives us the p-value for the omnibus test that all the
coefficients are equal to zero (with G statistic), and the p-value for the test that the individual
coefficients are equal to zero. We can see from the output that we reject the omnibus null
hypothesis, as well as the null hypotheses that the individual coefficients are zero. The
regression coefficients are not equal to zero:
Logistic Regression Table
Predictor
Constant
gpa
Coef
-3.12405
1.43270
SE Coef
0.652317
0.241520
Z
-4.79
5.93
P
0.000
0.000
Odds
Ratio
4.19
95% CI
Lower Upper
2.61
6.73
Log-Likelihood = -125.493
Test that all slopes are zero: G = 46.902, DF = 1, P-Value = 0.000
2
We can also determine how well this particular model fits the data by inspecting the
goodness-of-fit tests in Minitab’s output. Overall, the p-value associated with each test does not
allow us to reject the null hypothesis, so we can conclude that there is insufficient evidence to
claim that the model does not fit the data adequately:
Goodness-of-Fit Tests
Method
Pearson
Deviance
Hosmer-Lemeshow
Chi-Square
164.532
184.252
3.672
DF
126
126
8
P
0.012
0.001
0.885
Finally, we can determine whether our model contains any real explanatory power by
observing the concordant pairs, discordant pairs, and ties under the measures of association. If
our model is accurately predicting the probability of success for a given student that is indeed a
success, then the percentage of concordant pairs would be near 100%, and the percentage of
discordant pairs would be near 0%. Our model shows a relatively high percentage of concordant
pairs, and a relatively low percentage of discordant pairs, which allows us to have confidence in
the predictability of our model. In addition, the Summary Measures give us a similar confidence
because larger values (up to one) indicate that the model has a better predictive ability.
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
Value
1
Obs
Exp
0
Obs
Exp
Total
Group
5
6
1
2
3
4
6
4.9
9
9.9
12
13.7
13
15.1
17
16.1
17
18.1
23
14
13.1
23
13
11.3
25
11
8.9
24
6
6.9
23
7
8
9
10
Total
19
17.1
20
19.0
19
20.1
21
20.1
20
20.0
156
4
5.9
23
4
5.0
24
5
3.9
24
2
2.9
23
2
2.0
22
78
234
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs
Concordant
Discordant
Ties
Total
Number
9316
2794
58
12168
Percent
76.6
23.0
0.5
100.0
Summary Measures
Somers' D
Goodman-Kruskal Gamma
Kendall's Tau-a
0.54
0.54
0.24
In conclusion, a student’s GPA clearly holds statistically significant power in predicting
whether a student remained within the “warm embrace of science” (classified as a success =1) or
whether the given student changed to a major outside the field of CS, science, or engineering
(classified as a failure =0).
3
Customer’s Question:
“What are the effects of the variables “SEX” and “Maj” on the variable “HSS”?
Firm’s Response:
The question presents the case involving two categorical explanatory variables (SEX and
Maj) and one quantitative response variable (HSS). This observation can be recognized through
the values that correspond to each variable throughout each individual sample. The variable
“SEX” attains a value of either “1” or “2” and the variable “Maj” possesses one of three values
(“1”, “2”, or “3”), whereas the response variable “HSS” can equal a wide range of values
(continuous distribution). Now that we identified the characteristics of the variables in question,
it is time for us to declare which test would be best to effectively and accurately answer the
customer’s question. We can measure the effects of the explanatory variables on the response
variable through the use of 2-Way ANOVA. In order to do so, we used the Minitab software to
run the test. First, we scrolled to the appropriate test (under the stats menu) and entered the
variables in question into their respective slots (with “SEX” as the row factor and “Maj” as the
column factor). Also, make sure that the confidence level is at the value of 95 .
The first part of the Minitab output gives us the ANOVA table of data:
Source
sex
maj
Interaction
Error
Total
S = 1.599
DF
1
2
2
228
233
SS
12.927
44.410
24.855
582.923
665.115
R-Sq = 12.36%
MS
12.9274
22.2051
12.4274
2.5567
F
5.06
8.69
4.86
P
0.025
0.000
0.009
R-Sq(adj) = 10.44%
From this data, we can infer that “SEX,” “Maj,” and the interaction between the two are
all statistically significant in testing their effects on “HSS” because of their low p-values (all
below the 0.05 significance level). Immediately following this output, a residual plot for “HSS”
pops up so we can visually confirm that the residuals are normal (except for some left
skewedness). We do this to satisfy the assumption that the residuals are independent and have a
normal distribution for a 2-Way ANOVA test to be adequate.
We then created a main effects plot for “HSS” using Minitab. Upon viewing the graph, we were
able to notice that there were indeed some main effects happening between the explanatory
variables and the response variable. For instance, the graph indicates that females had a
somewhat significant higher “HSS” mean than males did. Also, the graph portrayed that
students who changed to a major outside engineering or science had a very significant lower
mean than students who remained a CS major or who changed to a major inside engineering or
some other science. Accordingly, there was really no main effect between those who remained
CS majors or switched to some other major inside the field of science because of the basically
horizontal line connecting the two on the graph.
4
Main Effects Plot for hss
Data Means
sex
maj
8.50
Mean
8.25
8.00
7.75
7.50
1
2
1
2
3
Interaction Plot for hss
Data Means
9.5
sex
1
2
Mean
9.0
8.5
8.0
7.5
1
2
maj
3
After this analysis, we then generated scatterplots to view the potential interactions between the
two explanatory variables on the response variable. We instantly noticed an interaction effect
between gender and students who remained a CS major. This analysis shows that both males
and females who remained a CS major at the end of three semesters were correlated through their
high school science means.
5
Residual Plots for hss
Normal Probability Plot
Versus Fits
99
2
90
0
Residual
Percent
99.9
50
10
-2
-4
1
0.1
-5.0
-2.5
0.0
Residual
2.5
-6
5.0
7.5
8.0
Histogram
2
Residual
Frequency
9.0
Versus Order
30
20
10
0
8.5
Fitted Value
0
-2
-4
-5
-4
-3
-2
-1
Residual
0
1
2
-6
1
20
40
60
80
100 120 140 160 180 200 220
Observation Order
We created a residual plot for “HSS” so we can visually confirm that the residuals are
normal (except for some left skewedness). We do this to satisfy the assumption that the residuals
are independent and have a normal distribution for a 2-Way ANOVA test to be adequate.
6
Customer’s Question:
“What are the best predictors of whether a student remains a CS major after three
semesters?”
Firm’s Response:
The question posed by the customer offers a scenario similar to the customer’s first
question. The given question poses an “either-or” scenario. Either a given student remained a
CS major after three semesters, or he/she decided to major in engineering, some other science, or
a field outside science. Therefore, we can model this problem with a Binary Logistic
Regression, but we must first re-code the original categorical variable of major. The data given
to us had broken the students into one of three categories: major one indicates a student who
remained a CS major at the end of the three semester period, major two indicates a student who
changed to a major in engineering or some other science, and major three indicates a student who
changed to a major outside engineering or science. The students were classified as a 1, 2, or 3
according to their respective major. We re-coded the data and classified any student that was a 1
as a “success” (=1). We also re-coded any student that was a 2 or a 3 as a “failure” (=0). We
renamed our re-coded major column as “remains CS major,” in preparation of running the
Binary Logistic Regression in Minitab. We ran the Binary Logistic Regression with “remains
CS major” as our response and “hsm, satm, sex” as our model. Those three predictors were
chosen because they led to the strongest model, as explained with the help of Minitab’s output.
The first part of the Minitab output gives us some basic statistics about the number of
successes (=1) and the number of failures (=0):
Response Information
Variable
remains CS major
Value
1
0
Total
Count
78
156
234
(Event)
The next part of the Minitab output gives us the p-value for the omnibus test that all the
coefficients are equal to zero (with G statistic), and the p-value for the test that the individual
coefficients are equal to zero. We can see from the output that we reject the omnibus null
hypothesis, but our individual p-values only allow us to reject the null hypothesis for “hsm” that
the coefficient is equal to zero (no explanatory power). Despite our inability to reject the null for
“satm” and “sex” that the coefficients are equal to zero, we are confident that these variables do
indeed offer explanatory power to our model, as seen later in the Minitab output:
Logistic Regression Table
Predictor
Constant
hsm
satm
sex
Coef
-2.77585
0.348232
-0.0010238
-0.215419
SE Coef
1.34368
0.119025
0.0020640
0.299857
Z
-2.07
2.93
-0.50
-0.72
P
0.039
0.003
0.620
0.473
Odds
Ratio
1.42
1.00
0.81
95% CI
Lower Upper
1.12
0.99
0.45
1.79
1.00
1.45
Log-Likelihood = -143.637
Test that all slopes are zero: G = 10.615, DF = 3, P-Value = 0.014
7
We can also determine how well this particular model fits the data by inspecting the
goodness-of-fit tests in Minitab’s output. Because we are using multiple variables to predict
whether a student remains a CS major, we have to account for the grouping structure of the
Goodness-of-Fit Tests, and the only one over which we have control of grouping is the HosmerLemeshow Test. So we had to aggressively bin our data in order to have the grouping for all
combinations of the variables to be equivalent, so we limited the tests to five groups. But once
we aggressively binned our data, the P-values for the Pearson and Deviance Tests dropped
significantly, forcing us to focus on the p-value for the Hosmer-Lemeshow Test. The p-value
associated with the Hosmer-Lemeshow Test does not allow us to reject the null hypothesis, so
we can conclude that there is insufficient evidence to claim that the model does not fit the data
adequately:
Goodness-of-Fit Tests
Method
Pearson
Deviance
Hosmer-Lemeshow
Chi-Square
183.154
221.483
3.233
DF
119
119
3
P
0.000
0.000
0.357
Finally, we can determine whether our model contains any real explanatory power by
observing the concordant pairs, discordant pairs, and ties under the measures of association. If
our model is accurately predicting the probability of success for a given student that is indeed a
success, then the percentage of concordant pairs would be near 100%, and the percentage of
discordant pairs would be near 0%. We began our search for the best predictors by examining
the difference between the concordant pairs and the summation of discordant pairs and ties (as a
conservative estimate). We ranked the best predictors of whether a student remains a CS major
based upon this difference: “satv” was first, “satm” second, and “hsm” was third. Those
rankings required that the individual variables failed to reject the null hypothesis for the HosmerLemeshow Test, so we could conclude that the variables are accurately fitting the data. We
began our process of determining which variables are the best predictors by starting with “satv”
and then trying each other variable to see which model contains the most predictive power. We
discovered that if two more variables are added to the model with “satv,” we would reject that
the model fits the data well. But we also discovered a model with predictors “satv” and “gpa”
that compares to our model, though it contains slightly less predictive power. After many trialand-error binary logistic regressions, we decided upon the three predictors “satm,” “hsm,” and
“sex” for our model. Our model shows a relatively high percentage of concordant pairs, and a
relatively low percentage of discordant pairs, which allows us to have confidence in the
predictability of our model. The Summary Measures are not as close to one as we would hope,
but they are still higher than most other models.
Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)
Value
1
Obs
Exp
1
2
11
8.3
11
13.3
Group
3
15
17.2
4
5
Total
17
18.6
24
20.7
78
8
0
Obs
Exp
Total
35
37.7
46
36
33.7
47
34
31.8
49
29
27.4
46
22
25.3
46
156
234
Measures of Association:
(Between the Response Variable and Predicted Probabilities)
Pairs
Concordant
Discordant
Ties
Total
Number
7609
4461
98
12168
Percent
62.5
36.7
0.8
100.0
Summary Measures
Somers' D
Goodman-Kruskal Gamma
Kendall's Tau-a
0.26
0.26
0.12
In conclusion, we have determined that the predictors “satm,” “hsm,” and “sex” hold the
most predictive power in determining whether a given student remains a CS major. Our search
for predictors focused on three areas in Minitab’s output: the p-value for the omnibus test that
all the coefficients are equal to zero, the p-value associated with the Hosmer-Lemeshow
Goodness-of-Fit Test, and the difference between the concordant pairs and the summation of
discordant pairs and ties (C-(D+T)). Our model rejected the omnibus test (so our model contains
predictive power), failed to reject the Hosmer-Lemeshow Test (so our model fits the data), and
had the biggest difference in concordant pairs. Overall, we are confident that the chosen three
variables offer the greatest predictive power for determining whether a student remains a CS
major.
Download