Berrellez Final Report

Determining influential factors for student performance
Valerie Cousineau
Kevin Doubleday
Ruoyu Huang
1 Executive Summary
Our client, Brian Berrellez, is a business analyst who works in the College of Agriculture and
Life Sciences (CALS). He is interested in identifying potential factors that may affect student
performance by building a predictive model. He measured student performance using GPA and
length of time taken to graduate. We decided to split this research question into two analyses. We
used logistic regression to predict whether a student would graduate on time, and then used a
linear regression model, fitted only to the students who graduated on time, to predict their
GPA. Finally, we used backward selection to identify the important factors that
may impact the length of time taken to graduate and GPA.
2 Detailed Summary
2.1 Background
Brian is interested in identifying the potential impact of academic, personal, social and economic
information on student performance using a dataset that includes information from 5214 students.
The factors associated with personal information are gender, age range and ethnicity.
Factors associated with academic information are the plan the student enrolled in and the year of
enrollment. The social and economic factors are the Rural-Urban Continuum Code (RUCC), the
poverty percentage of the student’s area of residence, and the median income of that area.
The RUCC is a metric for classifying a student’s residence as rural or urban, which ranges from 1 to
7, where 1 is most rural and 7 is most urban. Since several addresses may be
recorded for each student, we decided to use the first address on record, as this is most likely to be
the location where the student grew up. Brian set 6.5 years as the cutoff for whether a
student graduated on time or not.
2.2 Data
Some missing and uninformative values had to be cleaned up before the analysis. We omitted
all students that had any NA values, as well as students recorded with “unspecified” for
ethnicity, “U” for gender, or “0” for median income. Since there were only 4 students in the Native
Hawaiian/Oth Pac Island ethnicity group, we also removed those 4 students from our dataset.
After removing these students we had a dataset of 4317 students, of which 1487 (34%)
finished their program within 6.5 years while the remaining 2830 did not. Students who were
50 years or over were combined into a single age group.
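The exclusion rules above can be sketched as a simple record filter. This is only an illustration under stated assumptions: the field names (`gender`, `ethnicity`, `median_income`) and the sample records are hypothetical, since the report does not give the dataset's actual column names.

```python
# Field names and sample records are hypothetical; the report does not list the real columns.
EXCLUDED_ETHNICITIES = {"unspecified", "Native Hawaiian/Oth Pac Island"}

def is_usable(record):
    """Return True if a student record survives the exclusion rules described above."""
    if any(value is None for value in record.values()):
        return False  # any missing (NA) value
    if record["ethnicity"] in EXCLUDED_ETHNICITIES:
        return False  # unspecified ethnicity, or the group with only 4 students
    if record["gender"] == "U":
        return False  # unknown gender
    if record["median_income"] == 0:
        return False  # a median income of 0 is treated as missing
    return True

students = [
    {"gender": "F", "ethnicity": "Asian", "median_income": 52000},
    {"gender": "U", "ethnicity": "White", "median_income": 48000},
    {"gender": "M", "ethnicity": "unspecified", "median_income": 61000},
    {"gender": "M", "ethnicity": "White", "median_income": 0},
    {"gender": "F", "ethnicity": "Hispanic/Latino", "median_income": None},
]

cleaned = [s for s in students if is_usable(s)]
print(len(cleaned))  # 1
```

Only the first sample record passes every rule; each of the others trips exactly one exclusion.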
2.2 The Logistic Regression Model and Linear Regression Model
Logistic regression was used to analyze all 4317 students in the final dataset. It
predicts the log-odds that a student will graduate within 6.5 years. The log-odds is the
natural log of the ratio of the
probability that an event will occur to the probability that it will not occur. In our
analysis, we are modelling the probability that the student graduates on time over its complement,
the probability that the student does not graduate on time.
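As a minimal sketch of the log-odds definition above, the two helper functions below (names chosen here for illustration) convert a probability to log-odds and back. The 0.34 input is the share of students reported as finishing within 6.5 years, used only as a convenient example value.

```python
import math

def log_odds(probability):
    """Natural log of p / (1 - p), the quantity a logistic regression models."""
    return math.log(probability / (1.0 - probability))

def probability_from_log_odds(z):
    """Inverse of log_odds: map a log-odds value back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

z = log_odds(0.34)                      # negative, because 0.34 is below 0.5
p_back = probability_from_log_odds(z)   # recovers 0.34 up to floating-point error
```

A probability of 0.5 corresponds to log-odds of exactly 0, so negative fitted values mean "less likely than not" to graduate on time.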
Linear regression assumes a linear relationship between the outcome variable and the factors.
It operates similarly to logistic regression, in that we predict an
expected value for our outcome (the student’s GPA) based on the factors we include in the
analysis. In contrast to the logistic regression model, here we are finding the predicted mean value of
GPA, as opposed to the predicted value of the log-odds in the previous model.
To include a categorical factor in either a logistic or a linear regression, we need a reference group
to which all other groups are compared. For example, if we are trying to predict whether
there is an effect of gender on a student’s GPA, we need to use one category as the reference,
and our model will look something like this: GPA = β0 + β1 × Gender. Let’s assume we use
males as the reference category, and our model is calculated to be GPA = 3.2 + 0.2 × Gender.
Since males are our reference, we would interpret our model to mean that for males, GPA =
3.2 + 0.2 × 0 = 3.2, and for females, GPA = 3.2 + 0.2 × 1 = 3.4. The reference category is
always coded as zero.
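The reference coding in this example can be written out directly. The sketch below uses the illustrative values 3.2 and 0.2 from the text; the function and parameter names are made up here.

```python
def predicted_gpa(gender, intercept=3.2, female_effect=0.2):
    """GPA = intercept + female_effect * indicator, with males as the reference (indicator 0)."""
    indicator = 0 if gender == "male" else 1
    return intercept + female_effect * indicator

male_gpa = predicted_gpa("male")      # intercept only
female_gpa = predicted_gpa("female")  # intercept plus the gender coefficient
```

The coefficient 0.2 is therefore the average difference between females and the male reference group, not a GPA in its own right.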
2.3 Backward Selection
Backward selection is the variable selection method we used. This method takes a
statistical significance level as input and uses it as a threshold for deciding which variables to
include in the model. Starting from the full model, the variable that contributes least is removed
at each step until every remaining variable meets the significance threshold. Additionally,
this method either keeps or drops all levels of a categorical variable together, so the result can be
easily interpreted.
The sequence of models produced by removing the least relevant factor at each step
is then compared by AIC value (a measure of the model’s
goodness of fit that penalizes the number of factors included). The smaller the AIC, the better the model fits with the factors
we have chosen to include.
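A minimal sketch of backward elimination driven by AIC is shown below. It is not the procedure actually run for this report: the scoring function `toy_aic` and its per-factor log-likelihood gains are invented for illustration, and a real analysis would refit the regression for each candidate subset. Factors are dropped as whole units, which mirrors keeping or dropping all levels of a categorical variable together.

```python
def backward_select(factors, aic_of):
    """Greedy backward elimination: keep dropping the factor whose removal lowers AIC most."""
    current = list(factors)
    current_aic = aic_of(current)
    while current:
        # Every model that has exactly one fewer factor than the current one.
        candidates = [[f for f in current if f != dropped] for dropped in current]
        best = min(candidates, key=aic_of)
        best_aic = aic_of(best)
        if best_aic >= current_aic:
            break  # no single removal improves AIC, so stop
        current, current_aic = best, best_aic
    return current, current_aic

# Hypothetical log-likelihood gain contributed by each factor (not from the real data).
toy_gain = {"plan": 40.0, "gender": 6.0, "age_range": 9.0, "rucc": 0.3}

def toy_aic(subset):
    """AIC = 2k - 2 * log-likelihood, with a made-up additive log-likelihood."""
    log_likelihood = -500.0 + sum(toy_gain[f] for f in subset)
    return 2 * len(subset) - 2 * log_likelihood

selected, selected_aic = backward_select(["plan", "gender", "age_range", "rucc"], toy_aic)
```

In this toy setup only `rucc` is removed, because its small gain in fit does not offset the AIC penalty for an extra parameter.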
3 Results and Interpretation
3.1 Prediction of Success via logistic regression
Table 1: Estimated Coefficients by logistic regression
Factor                    Estimate
Intercept                 -287.2
PLANAGEMBS                -0.6595
PLANARECBS                -0.5459
PLANASCBS                 -0.5573
PLANEWREBS                -0.1205
PLANMICRBS                -0.2594
PLANNTRSBS                -0.2398
PLANRAMBSRNR              -0.1426
PLANVSCBS                 -0.7221
PLANWFSCBSRNR             -0.9055
PLANWWRRBSNR              -0.4714
PLANWWRRBSRNR             -1.396
Male                      -0.2119
Age 20 to 30              -0.5029
Age 30 to 40              -0.2763
Age 40 to 50              0.08422
Age above 50              -0.4455
The year of enrollment    0.1433
Poverty Percentage        -1.644
Median income             -0.000003656
Table 1 lists the estimated coefficients given by logistic regression. Only the influential
factors selected by backward selection are listed in this table.
Categorical and ordinal factors are interpreted differently from numeric factors. The reference for
the student’s plan is “ABEMBS”, so the correct interpretation is that, given
that all other factors have the same value, the odds of graduating on time vs. not graduating on
time for a student enrolled in “AGEMBS” are 0.5171 times (e^(−0.6595) = 0.5171) the odds for a
student enrolled in “ABEMBS”. The coefficient of “POVERTY_PCT”, a numeric factor, should instead be
interpreted as follows: given that all other terms stay the same, an increase in poverty percent by one unit
changes the odds to 0.1932 times (e^(−1.644) = 0.1932) the odds before the increase.
Since one unit means 100 percent here, if the poverty percent increases by one
percentage point, then the odds will be 0.9837 times (e^(−0.01644) = 0.9837) the odds prior to this increase.
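The odds-ratio arithmetic can be checked with a one-line helper. The sketch below assumes an AGEMBS estimate of about −0.6595 and a poverty estimate of about −1.644, the values that appear in the worked example in this section; the function name is illustrative.

```python
import math

def odds_ratio(coefficient, change=1.0):
    """Multiplicative change in the odds when a factor increases by `change` units."""
    return math.exp(coefficient * change)

agembs_vs_reference = odds_ratio(-0.6595)            # categorical: AGEMBS vs. the reference plan
one_point_of_poverty = odds_ratio(-1.644, change=0.01)  # numeric: one percentage point

print(round(agembs_vs_reference, 4))   # 0.5171
print(round(one_point_of_poverty, 4))  # 0.9837
```

Passing `change=0.01` rescales the poverty coefficient from "per 100 percent" to "per percentage point", which is usually the more meaningful increment.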
This may sound confusing, but fortunately in a predictive model our interest is not really in
interpreting the estimates for different factors, but rather in using the various factors to compute an
estimate of the outcomes we’re interested in: GPA, and whether or not a student will graduate on
time. Here is an example. Suppose we have a male student who enrolled in the AGEMBS plan in 2012.
His age is 25. The poverty percent of his residential area is 30%, and the median income in that
area is 40,000. The predicted odds of graduating on time are then 0.4481:
odds = e^(−287.2 − 0.6595 − 0.2119 − 0.5029 + 2012 × 0.1433 − 0.3 × 1.644 − 40,000 × 0.000003656) = 0.4481.
Since a student
either graduates on time or does not, those two probabilities sum to 1, so we can derive the
predicted probability that this student graduates on time as 0.4481 / (1 + 0.4481) = 0.3094.
The predicted probability that this student does not graduate on time is then 1 − 0.3094 = 0.6906.
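The final step of the worked example, turning odds into a probability, can be sketched as follows. The odds value 0.4481 is taken from the text as given; the function name is illustrative.

```python
def probability_from_odds(odds):
    """Since odds = p / (1 - p), solving for p gives p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

p_on_time = probability_from_odds(0.4481)
p_not_on_time = 1.0 - p_on_time

print(round(p_on_time, 4))      # 0.3094
print(round(p_not_on_time, 4))  # 0.6906
```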
The backward selection procedure indicated that the most influential factors are plan,
gender, age range, first term academic year, poverty percent and median income.
3.2 Prediction of GPA via linear regression
Table 2: Estimated Coefficients by linear regression
Factor                    Estimate
Intercept                 -2.983
Asian                     0.2889
Black/African American    0.0983
Hispanic/Latino           0.1490
Non Resident Alien        0.4175
White                     0.3119
PLANAGEMBS                0.0127
PLANARECBS                0.3131
PLANASCBS                 0.3107
PLANEWREBS                0.1862
PLANMICRBS                0.4517
PLANNTRSBS                0.3064
PLANVSCBS                 0.4219
PLANWFSCBSRNR             0.6186
PLANWWRRBSNR              0.4738
PLANWWRRBSRNR             0.3509
Male                      -0.0816
Age 20 to 30              -0.1283
Age 30 to 40              0.1085
Age 40 to 50              -0.0884
Age above 50              0.0385
The year of enrollment    -0.01611
Median income             0.000000767
Adjusted r-squared        0.1428
Now we focus on predicting the GPA of a student given that he or she graduated within 6.5 years.
Once again, Table 2 lists only the factors which were determined to be influential. The
method of interpreting the estimates is similar to logistic regression, in that we have a reference
category to which all others are compared. However, with linear regression each
estimate is interpreted as an average difference in GPA between groups, whereas with
logistic regression the exponentiated estimates are interpreted as odds ratios. The adjusted r-squared
measures the proportion of variation in the data explained by the model, adjusted for the number of factors included. In this case, the linear
regression model reported above explains 14.28% of the variation in GPA. It is hard to
tell if this number is good enough since there are different standards in different fields.
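For reference, the usual adjustment applied to r-squared can be sketched as below. The report gives only the adjusted value, so the inputs in the example call (an r-squared of 0.5, 101 observations, 10 predictors) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def adjusted_r_squared(r_squared, n_observations, n_predictors):
    """Standard adjustment: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r_squared) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Hypothetical inputs, not values from this analysis.
example = adjusted_r_squared(0.5, n_observations=101, n_predictors=10)
```

The adjusted value is never larger than the unadjusted one, and the gap grows as more predictors are added relative to the sample size.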
3.3 Future Study
Although the models and variable selection method we applied here are sound,
there is room for improvement in terms of variable selection. Backward selection is a good
variable selection method since it is easy to interpret. However, there is no theory that guarantees
that backward selection will select the set of truly influential factors.
Something else that may be of interest for future models is the possibility of
interaction between terms. Two variables interact when the effect of one on the outcome
depends on the value of the other. For example, perhaps there is an interaction between
gender and age, where one gender may have a higher expected GPA at older ages than the other
gender. While this type of analysis was outside the scope of this project, it may be
worth looking into in future research of this nature.
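To make the idea of an interaction concrete, the sketch below adds a gender-by-age product term to a linear predictor. Every coefficient here is made up for illustration; none comes from the fitted models in this report.

```python
# All coefficients are hypothetical and serve only to illustrate an interaction term.
def expected_gpa(female, age, b0=3.0, b_female=0.10, b_age=0.005, b_interaction=0.004):
    """Linear predictor with a gender-by-age interaction (female is coded 0 or 1)."""
    return b0 + b_female * female + b_age * age + b_interaction * female * age

def gender_gap(age):
    """Expected GPA difference between females and males at a given age."""
    return expected_gpa(1, age) - expected_gpa(0, age)

gap_at_20 = gender_gap(20)
gap_at_40 = gender_gap(40)
```

Without the product term the gap would be the same at every age; with it, the gap in this toy example widens as age increases.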