Determining influential factors for student performance
Valerie Cousineau, Kevin Doubleday, Ruoyu Huang

1 Executive Summary

Our client, Brian Berrellez, is a business analyst in the College of Agriculture and Life Sciences (CALS). He is interested in identifying factors that may affect student performance by building a predictive model. He measured student performance using GPA and the length of time taken to graduate. We split this research question into two analyses. We used logistic regression to predict whether a student would graduate on time, and then used a linear regression model, fitted only to the students who graduated on time, to predict their GPA. Finally, we used backward selection to identify which factors may impact the length of time taken to graduate and GPA.

2 Detailed Summary

2.1 Background

Brian is interested in identifying the potential impact of academic, personal, social and economic information on student performance using a dataset that includes information from 5214 students. The personal factors are gender, age range and ethnicity. The academic factors are the plan the student enrolled in and the year of enrollment. The social and economic factors are the Rural-Urban Continuum Code (RUCC), the poverty percentage of the student's residence, and the median income of the student's residence. The RUCC is a metric for classifying a student's residence as rural or urban; it ranges from 1 to 7, where 1 is most rural and 7 is most urban. Since several addresses could be recorded for each student, we decided to use the first address on record, as this is most likely to be the location where the student grew up. Brian set 6.5 years as the cutoff for whether or not a student graduated on time.

2.2 Data

There were some missing values that needed to be cleaned up before doing the analysis.
We omitted all students who had any NA values, as well as students who had “unspecified” for their ethnicity, “U” for gender, or “0” for median income. Since there were only 4 students in the Native Hawaiian/Oth Pac Island ethnicity group, we also removed those 4 students from our dataset. After removing these students we had a dataset of 4321 students, of which 1490 (34%) finished their program within 6.5 years while the remaining 2831 did not. Students who were 50 years or over were combined into a single age group.

2.2 The Logistic Regression Model and Linear Regression model

Logistic regression was used to analyze all of the 4321 students in the final dataset. It models the log odds that a student will graduate within 6.5 years. The log odds is the natural log of the ratio of the probability that an event will occur to the probability that it will not occur. In our analysis, we model the probability that the student graduates on time relative to its complement, the probability that the student does not graduate on time.

Linear regression assumes a linear relationship between the outcome variable and the factors. It operates similarly to logistic regression in that we predict an expected value for our outcome (the student's GPA) based on the factors we include in the analysis. In contrast to the logistic regression model, this model gives the predicted mean value of GPA, as opposed to the predicted value of the log odds in the previous one.

To fit either a logistic or a linear regression with categorical factors, we need a reference group to which all other groups are compared. For example, if we want to estimate the effect of gender on a student's GPA, we use one category as the reference and our model looks something like this: $\mathrm{GPA} = \beta_0 + \beta_1 \cdot \mathrm{Gender}$.
Let's assume we use males as the reference category, and our model is estimated to be $\mathrm{GPA} = 3.2 + 0.2 \cdot \mathrm{Gender}$. Since males are our reference, Gender is 0 for males and 1 for females, so for males $\mathrm{GPA} = 3.2 + 0.2 \cdot 0 = 3.2$, and for females $\mathrm{GPA} = 3.2 + 0.2 \cdot 1 = 3.4$. The reference category is always coded as zero.

2.3 Backward Selection

Backward selection is the variable selection method we used. This method takes a statistical significance level as input and uses it as a threshold for deciding which variables to include in the model. Starting from the model with all factors, the variable that contributes least to the model is deleted at each step until the significance threshold is met. Additionally, this method either keeps or removes all levels of a categorical variable together, so the result can be easily interpreted. The resulting sequence of models is then compared by AIC value (a measure of the model's overall goodness of fit for the data, penalized for the number of factors). The smaller the AIC, the better the model fits with the factors we have chosen to include.

3 Results and Interpretation

3.1 Prediction of Success via logistic regression

Table 1: Estimated coefficients by logistic regression

Factor                    Estimate
Intercept                 -287.29
PLANAGEMBS                -0.6595829
PLANARECBS                -0.5459599
PLANASCBS                 -0.5573728
PLANEWREBS                -0.1205046
PLANMICRBS                -0.2594716
PLANNTRSBS                -0.2398559
PLANRAMBSRNR              -0.142628
PLANVSCBS                 -0.7221375
PLANWFSCBSRNR             -0.90559181
PLANWWRRBSNR              -0.4714870
PLANWWRRBSRNR             -1.396410
Male                      -0.211939
Age 20 to 30              -0.502916
Age 30 to 40              -0.276359
Age 40 to 50              0.08422532
Age above 50              -0.44559
The year of enrollment    0.14337
Poverty Percentage        -1.6442
Median income             -0.00000365667

Table 1 gives the estimated coefficients from the logistic regression. Notice that only the influential factors selected by backward selection are listed in this table.
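The step-by-step deletion described in section 2.3 can be sketched in a few lines. This is only an illustration of one common AIC-driven variant of backward selection, not necessarily the exact procedure used for this analysis. The function `backward_select`, the scoring function `toy_aic`, and the numbers in `GAIN` are all hypothetical; in a real analysis the scoring function would refit the regression on each candidate set of factors and return that model's AIC.

```python
def backward_select(factors, aic):
    """Drop one factor at a time while doing so lowers the AIC.

    `aic` is any function that maps a list of factor names to the AIC
    of the model fitted with exactly those factors.
    """
    current = list(factors)
    best = aic(current)
    while current:
        # Score every model obtained by removing a single factor.
        candidates = [
            (aic([f for f in current if f != drop]), drop)
            for drop in current
        ]
        score, drop = min(candidates)
        if score >= best:
            break  # no single removal improves the fit, so stop
        current.remove(drop)
        best = score
    return current, best


# Hypothetical stand-in for model fitting: each factor lowers the
# deviance by a made-up amount, and AIC adds 2 per factor kept.
GAIN = {"plan": 30.0, "gender": 5.0, "ethnicity": 0.5, "rucc": 0.0}


def toy_aic(subset):
    return 100.0 - sum(GAIN[f] for f in subset) + 2 * len(subset)


selected, score = backward_select(list(GAIN), toy_aic)
print(selected, score)  # ['plan', 'gender'] 69.0
```

A categorical factor such as plan enters this sketch as a single name, which mirrors the point above that all of its levels are kept or removed together.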
Categorical and ordinal factors are interpreted differently from numeric factors. The reference for the student's plan is “ABEMBS”, so the correct interpretation is that, given that all other factors have the same value, the odds of graduating on time vs. not graduating on time for a student enrolled in “AGEMBS” are 0.5171 times ($e^{-0.6596} = 0.5171$) the odds for a student enrolled in “ABEMBS”. The coefficient of “POVERTY_PCT”, however, should be interpreted as follows: given that all other terms stay the same, an increase in poverty percent by one unit multiplies the odds by 0.1932 ($e^{-1.6442} = 0.1932$). Since one unit means 100 percent here, an increase in poverty percent of one percentage point multiplies the odds by 0.9837 ($e^{-0.016442} = 0.9837$). This may sound confusing, but fortunately in a predictive model our interest is not really in interpreting the estimates for different factors, but rather in using the various factors to compute an estimate of the outcomes we're interested in: GPA, and whether or not a student will graduate on time.

Here is an example. Suppose we have a male student who enrolled in the AGEMBS plan in 2012. His age is 25. The poverty percent of his residential area is 30%, and the median income in that area is 40,000. The odds that he graduates on time are then 0.4481:

$e^{-287.29 - 0.6595 - 0.2119 - 0.5029 + 2012 \cdot 0.1433 - 0.3 \cdot 1.644 - 40{,}000 \cdot 0.000003656} = 0.4481$

Since a student either graduates on time or does not, the sum of those two probabilities is 1. The predicted probability that this student graduates on time is therefore $\frac{0.4481}{1 + 0.4481} = 0.3094$, and the predicted probability that he does not graduate on time is just 1 - 0.3094 = 0.6906.

Notice that the backward selection procedure indicated that the most influential factors are plan, gender, age range, first term academic year, poverty percent and median income.
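The conversions used in this section, from a coefficient to an odds multiplier and from odds to a probability, are short enough to check directly. The sketch below takes the PLANAGEMBS and Poverty Percentage estimates from Table 1 and the odds of 0.4481 from the worked example; the function names `odds_ratio` and `probability_from_odds` are our own and purely illustrative.

```python
import math


def odds_ratio(coefficient, change=1.0):
    """Factor by which the odds are multiplied when a factor
    increases by `change` units, all else held fixed."""
    return math.exp(coefficient * change)


def probability_from_odds(odds):
    """Convert odds p / (1 - p) back to the probability p."""
    return odds / (1.0 + odds)


PLAN_AGEMBS = -0.6595829   # Table 1, relative to the reference plan
POVERTY_PCT = -1.6442      # Table 1, one unit = 100 percentage points

print(round(odds_ratio(PLAN_AGEMBS), 4))         # 0.5171
print(round(odds_ratio(POVERTY_PCT, 0.01), 4))   # 0.9837
print(round(probability_from_odds(0.4481), 4))   # 0.3094
```

The last line reproduces the step from odds to predicted probability in the example; the odds themselves come from exponentiating the full linear predictor, which depends on the unrounded coefficients.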
3.2 Prediction of GPA via linear regression

Table 2: Estimated coefficients by linear regression

Factor                            Estimate
Intercept                         -2.98336
Asian                             0.2889766
Black/African American            0.0983.314
Hispanic/Latino                   0.14901508
Native Hawaiian/Oth Pac Island    0.09109
Non Resident Alien                0.4120
White                             0.3037
PLANAGEMBS                        0.01025
PLANARECBS                        0.3145
PLANASCBS                         0.3092
PLANEWREBS                        0.1836
PLANMICRBS                        0.4512
PLANNTRSBS                        0.306448
PLANVSCBS                         0.42190
PLANWFSCBSRNR                     0.618673
PLANWWRRBSNR                      0.473879
PLANWWRRBSRNR                     0.350961
Male                              -0.081639
Age 20 to 30                      -0.128300
Age 30 to 40                      0.1085108
Age 40 to 50                      -0.0884687
Age above 50                      0.03854129
The year of enrollment            -0.01611587
Median income                     0.000000767
Adjusted r-squared                0.142833

Now we focus on predicting the GPA of a student given that he or she graduated within 6.5 years. Once again, Table 2 lists only the factors which were determined to be influential. The method of interpreting the estimates is similar to logistic regression in that we have a reference category to which we compare all others. However, with linear regression each estimate is interpreted as an average difference in GPA between groups, whereas with logistic regression the exponentiated estimates are interpreted as odds ratios. The adjusted r-squared measures the proportion of variation in the outcome that is explained by the model. In this case, the linear regression model reported above explains 14.2833% of the variation in GPA. It is hard to tell if this number is good enough, since standards differ between fields.

3.3 Future Study

Although there is nothing wrong with the models and variable selection method we applied here, there is room for improvement in terms of variable selection. Backward selection is a good variable selection method since it is easy to interpret.
However, there is no theory that guarantees that backward selection will select the set of truly influential factors. Something else that may be of interest for future models is the possibility of interaction between terms. An interaction occurs when the effect of one variable on the outcome depends on the value of another variable. For example, there may be an interaction between gender and age, where one gender has a higher expected GPA than the other at older ages but not at younger ages. While this type of analysis was outside the scope of this project, it may be worth looking into in future research of this nature.
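To make the idea of an interaction concrete, the following sketch adds a product term to a linear predictor for GPA. Every coefficient here is hypothetical and chosen only for illustration; none of them comes from the fitted models in this report, and the two-level `older` indicator is a simplification of the age ranges used above.

```python
def predicted_gpa(male, older, b0=3.2, b_male=-0.1,
                  b_older=0.05, b_interaction=0.15):
    """Linear predictor with a gender-by-age interaction term.

    `male` and `older` are 0/1 indicators. All default coefficients
    are hypothetical values used only to illustrate the structure.
    """
    return (b0 + b_male * male + b_older * older
            + b_interaction * male * older)


def gender_gap(older):
    """Expected GPA difference, male minus female, within an age group."""
    return predicted_gpa(1, older) - predicted_gpa(0, older)


print(round(gender_gap(0), 2))  # -0.1
print(round(gender_gap(1), 2))  # 0.05
```

Without the product term the gap between genders would be the same in both age groups; with it, the gap changes sign between the younger and older groups, which is the kind of pattern an interaction term is meant to capture.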