PREDICTING STUDENT DROPOUT AND ACADEMIC
SUCCESS FOR FINANCIAL PLANNING AT SPACE
UNIVERSITY, USA
MIS 587: Business Applications in Machine Learning
December 11, 2023
Executive Summary
The primary aim of this initiative was to strategically address SPACE University's fiscal
challenges, exacerbated by a concerning 39% undergraduate dropout rate. This attrition
has been impacting the university's financial health, resulting in an estimated annual loss
of USD 13 million.
In a strategic alliance with FRAB Consulting, an extensive study leveraging machine
learning algorithms to forecast student attrition trends was carried out. Our methodology
encompassed a thorough analysis of an extensive dataset, incorporating demographic,
socioeconomic, and academic performance metrics, processed through the sophisticated
AI and machine learning capabilities of the DataRobot platform.
The project's analysis, utilizing a dataset of 3,630 records (excluding currently enrolled
students), unearthed pivotal factors influencing student withdrawal. Advanced machine
learning techniques, including feature engineering and selection, were applied. The most
efficacious model identified was the eXtreme Gradient Boosted Trees Classifier.
We advised a robust integration of this model within the university's student record
management framework. This should include a dynamic, tiered intervention strategy,
primarily focusing on high-risk students for immediate action, while also maintaining
vigilant oversight for medium-risk individuals. This recommendation is formulated with a
dual focus on mitigating financial risks and enhancing overall student retention and
success.
The efficacy of this model is inherently linked to the precision of the current dataset and
the university's operational infrastructure. We advocate for regular model recalibrations
to align with evolving student demographics and academic policy changes.
The deployment of this predictive model is projected to significantly curtail financial losses
related to student attrition, with an estimated reduction potential of 70.43%, thereby
fortifying SPACE University's financial stability and academic standing.
Table of Contents
Executive Summary
Business Problem and Project Objectives
Dataset Analysis
Data Decision
Feature Engineering
Feature Selection
Changes in DataRobot
The Model's Feature Impact
Model Selection
Describe Candidate Models
Comparing Models with Different Feature Lists
Model Selection
Model Description
Critique of Strengths and Weaknesses
Recommendation
Organization Actions for Implementation
Final Recommendation
How the Model Should Be Implemented
Operational Implementation
References
Business Problem and Project Objectives
SPACE University, a prestigious non-profit institution, is facing critical financial
challenges. These challenges stem from a combination of rising operational costs,
diminishing tuition revenue, and decreasing government funding. A major factor
exacerbating this financial strain is the university's high student attrition rate. Currently,
39% of undergraduate students fail to graduate within six years, leading to a staggering
annual loss of approximately USD 13 million. This attrition rate not only impacts the
university’s financial stability but also adversely affects its reputation and future enrollment
prospects.
The university's high dropout rate is a multifaceted problem. Not only does it represent a
significant financial loss, but it also undermines the perceived value and effectiveness of
the education SPACE University provides. In a broader context, this issue aligns with
national trends. According to the National Student Clearinghouse Research Center
(2019), only 63% of first-time, full-time students at four-year private universities graduate
within six years. The Pew Research Center (2020) further highlights that public
confidence in higher education is waning, with only 40% of Americans believing that
colleges adequately prepare students for life post-graduation.
To address these challenges, SPACE University has partnered with FRAB Consulting to
develop a predictive solution aimed at identifying students at risk of early academic
discontinuation. The solution leverages predictive analytics and machine learning to
proactively identify at-risk students, allowing the university to implement targeted
intervention programs to enhance student retention and success.
The National Bureau of Economic Research highlights the substantial financial impact of
reducing attrition rates in private universities. A 10% reduction in these rates can result in
an average annual saving of USD 300 million. For SPACE University, this means that the
machine learning project focused on lowering attrition rates could potentially save millions
of dollars annually. Furthermore, universities with higher graduation rates tend to have
better reputations, which can attract more students, donors, and esteemed faculty
members. Enhancing the university's reputation could play a crucial role in bridging its
financial gap.
Dataset Analysis:
The dataset provided by SPACE University offers a comprehensive profile of students
enrolled in various undergraduate programs, including nursing, journalism, management,
and design. This dataset is a rich source of both demographic and socioeconomic
information, collected at the time of student registration. Additionally, it includes academic
performance data recorded after the first and second semesters.
In terms of structure, the dataset encompasses 4,424 individual records, each described
by 37 distinct attributes. These attributes are detailed in the accompanying feature lists.
The dataset's composition is predominantly numerical: it includes one categorical feature, five continuous variables, and a majority of discrete data types.
A key strength of this dataset lies in its exceptional data quality. It is remarkably clean,
with no missing values, inaccuracies, or inconsistencies observed, which is a significant
advantage for any analytical endeavor. For the implementation of this project, DataRobot,
an advanced AI and machine learning platform, was utilized. This choice is motivated by
DataRobot's capabilities in efficiently handling large datasets and performing complex
analyses, making it an ideal tool for extracting meaningful insights from the SPACE
University student data.
Feature lists
Data Decision:
The target variable for our model, labeled "Outcome," is categorized into three distinct
groups: "dropout," "enrolled," and "graduate," reflecting the students' status after their
standard course period. Considering the specific requirements of this project, we have excluded students who are currently enrolled (the "enrolled" category) from our dataset, reducing it from 4,424 to 3,630 individual records (more details under Feature Engineering). This focuses our analysis on the more definitive outcomes of "dropout" and "graduate," which we have encoded numerically as 0 and 1, respectively.
This binary classification aligns well with the nature of our predictive task, which is
inherently a classification problem.
Feature Engineering
DataRobot is a machine learning platform that automates the end-to-end process of
building, deploying, and managing machine learning models. Feature engineering and
feature selection are crucial steps in the machine learning workflow, and they play a
significant role in improving model performance.
Common feature engineering techniques include:
• Missing value imputation: imputing missing values using the mean, median, or more advanced methods.
• Binning or discretization: grouping continuous variables into bins.
• Feature scaling: scaling numerical features to a standard range.
• Categorical encoding: encoding categorical features into numerical values.
• Feature splitting: splitting features into parts, like date and time.
• Handling outliers: removing or replacing outlier values.
• Variable transformations: helping to normalize skewed data.
This dataset has no missing values. For feature engineering, we used the categorical encoding technique. Our target feature, Outcome, had three levels: Dropout, Graduate, and Enrolled. Since our objective is to predict how many students will drop out and how many will graduate, we dropped the rows with the Enrolled value. The dataset initially had 4,424 observations; after removing the Enrolled rows, 3,630 remained. We then used categorical encoding to encode 'Dropout' as 0 and 'Graduate' as 1 in a new column named 'CodedTarget'.
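The drop-and-encode step above can be sketched in pandas (a minimal sketch; the tiny inline frame stands in for the real 4,424-row dataset):

```python
import pandas as pd

# Stand-in for the real dataset; only the target column matters here.
df = pd.DataFrame({
    "Outcome": ["Dropout", "Graduate", "Enrolled", "Graduate"],
    "Admission grade": [120.0, 145.5, 133.0, 151.0],
})

# Drop currently enrolled students to keep only definitive outcomes.
df = df[df["Outcome"] != "Enrolled"].copy()

# Categorical encoding: Dropout -> 0, Graduate -> 1.
df["CodedTarget"] = df["Outcome"].map({"Dropout": 0, "Graduate": 1})

print(df["CodedTarget"].tolist())  # [0, 1, 1]
```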
Feature Selection:
Feature selection algorithms are categorized into three main types: intrinsic methods,
filter methods, and wrapper methods.
1. Intrinsic Methods: These algorithms inherently incorporate feature selection within
themselves. Examples include rule-and-tree-based algorithms, Multivariate Adaptive
Regression Spline (MARS) models, and regularization models. These methods are fast
and don't require an external feature selection algorithm.
2. Filter Methods: Pre-processing techniques that independently assess each feature's impact on a predictive model, such as information gain, entropy, and the Relief feature selection method. These methods are generic but may reduce model accuracy because they do not consider the nature of the predictive model.
3. Wrapper Methods: Aim to create a subset of features that optimizes the predictive model's performance. There are two types: greedy (e.g., Recursive Feature Elimination) and non-greedy (e.g., Genetic Algorithms, Simulated Annealing). Greedy methods focus on locally optimal results, while non-greedy methods assess all previous feature subsets for overall best performance.
For this dataset, wrapper methods were best suited to finding the optimal subset of features. We used Recursive Feature Elimination (RFE), which removes the least important features until the desired number remains. RFE therefore helped us deal with high-dimensional data, improve model efficiency, and reduce overfitting.
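Outside DataRobot, the same RFE idea can be sketched with scikit-learn (an illustration on synthetic data, not the platform or dataset the project used):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the student data: 20 features, 5 informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Recursively drop the least important feature until 13 remain,
# mirroring the report's top-13 feature list.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=13)
selector.fit(X, y)

print(int(selector.support_.sum()))  # number of features kept
```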
After feature engineering, we uploaded the revised dataset to DataRobot and started modeling, using 'CodedTarget' as the target with target type Binary Classification.
The best model is the eXtreme Gradient Boosted Trees Classifier.
This model surfaces the most important top features. Using the wrapper method, we applied Recursive Feature Elimination (RFE) to remove the least important features, then uploaded the reduced dataset and followed the same steps to obtain the best model. We eliminated all other features, keeping only the top 13 features plus the 'CodedTarget' feature to use as the target in modeling.
The best model for the updated dataset is Light Gradient Boosted Trees Classifier with
Early Stopping.
Comparing the two models: with the top 14 features taken into account, Debtor is no longer the least important feature. Admission grade, age at enrollment, scholarship holder, course, and tuition fees up to date matter alongside the 1st- and 2nd-semester curricular units (approved, enrolled, evaluations, grade).
Changes in DataRobot:
The technique used in DataRobot is Automated Feature Engineering. DataRobot's automated feature engineering can quickly explore a wide range of feature transformations and combinations, potentially identifying patterns and relationships that might be overlooked manually. After uploading both the first dataset and the updated dataset with the top 14 features, we created a new relationship between the two datasets.
The best model predicted with the targets CodedTarget and Curricular units 2nd semester (approved) is the eXtreme Gradient Boosted Trees Classifier, using the top 11 features.
The CodedTarget feature shows 1,767 graduated students and 1,137 dropout students.
The model’s Feature Impact:
A key benefit of Automated Feature Engineering is that it can automatically detect target leakage, flag redundant features such as "Curricular units 1st semester (enrolled)", and surface data quality issues.
Model Selection
Describe Candidate Models:
Of the eight models that DataRobot produced for our experiment, we decided to evaluate the two most accurate: the eXtreme Gradient Boosted Trees Classifier (Model 1) and the Light Gradient Boosted Trees Classifier with Early Stopping (Model 2).
Model 1
Model 2
Above are the blueprints of the most accurate models. Both models are supervised, using labeled data for training. The data in our case is numeric, and the target to predict is whether a student will graduate or drop out. Missing data is handled by imputation; in our case the data was complete. Feature leakage is a major concern, and one such occurrence arose in our project; the offending feature was dropped to address it. These models return visuals such as the confusion matrix, which supports the calculation of metrics like accuracy, precision, recall, and F1 score. The models' performances are compared and used to select the best performers. Because these are the best of only eight models, besides exploring more sophisticated models, we recommend collecting and using more data to improve the models' performance and scalability.
Comparing models with different feature lists:
The feature lists used in comparing these two models were randomly chosen, with the exception of Curricular_units_1st_sem (enrolled), a feature earlier associated with leakage. Below is a model comparison based on RMSE, a metric measuring the typical size of a model's prediction error; the lower the RMSE, the more accurate the model.
• List_1: Using the first feature list, both models have the same validation score of 0.4354.
• List_2: Using the second feature list, the second model has a lower validation score, 0.3695.
The differences and similarities in validation scores when different feature lists are used on competing models show that feature selection has an impact on model performance. It is therefore necessary to have a clear objective of what is to be achieved, since different models can indeed have similar metric scores when given the same data and features.
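RMSE, the comparison metric used above, can be computed directly (a minimal sketch with made-up values, not the project's actual predictions):

```python
import math

def rmse(actual, predicted):
    # Root mean squared error: typical size of the prediction error.
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )

# Hypothetical predicted probabilities against 0/1 outcomes:
print(round(rmse([1, 0, 1, 1, 0], [0.9, 0.2, 0.7, 0.8, 0.1]), 3))  # 0.195
```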
Calculated metrics for the models:
• Model 1. eXtreme Gradient Boosted Trees Classifier
Accuracy = (TN + TP) / (TN + FP + TP + FN) = (366 + 218) / (366 + 76 + 218 + 66) = 0.804
Precision = TP / (TP + FP) = 218 / (218 + 76) = 0.741
Recall = TP / (TP + FN) = 218 / (218 + 66) = 0.768
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.741 × 0.768) / (0.741 + 0.768) = 0.754
• Model 2. Light Gradient Boosted Trees Classifier with Early Stopping
Accuracy = (TN + TP) / (TN + FP + TP + FN) = (387 + 203) / (387 + 55 + 203 + 81) = 0.813
Precision = TP / (TP + FP) = 203 / (203 + 55) = 0.787
Recall = TP / (TP + FN) = 203 / (203 + 81) = 0.715
F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.787 × 0.715) / (0.787 + 0.715) = 0.749
Above are the metric calculations for both models, with dropout encoded as 0 and graduation as 1 per the CodedTarget encoding. The correctly predicted cases in both models (TP and TN) count the students correctly predicted to graduate or drop out, while the incorrectly predicted cases (FP and FN) count the students incorrectly predicted to graduate or drop out.
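The confusion-matrix arithmetic above can be verified with a short script (quadrant counts taken from the report's two confusion matrices):

```python
def metrics(tn, fp, fn, tp):
    # Standard confusion-matrix metrics.
    acc = (tn + tp) / (tn + fp + fn + tp)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return acc, prec, rec, f1

# Model 1: eXtreme Gradient Boosted Trees Classifier
m1 = metrics(tn=366, fp=76, fn=66, tp=218)
# Model 2: Light Gradient Boosted Trees Classifier with Early Stopping
m2 = metrics(tn=387, fp=55, fn=81, tp=203)

print([round(v, 3) for v in m1])  # [0.804, 0.741, 0.768, 0.754]
print([round(v, 3) for v in m2])  # [0.813, 0.787, 0.715, 0.749]
```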
Model Selection:
The second model has a higher accuracy score (81.3%) than the first model (80.4%), an indication that the second model correctly classifies a larger share of instances overall. Since the two models' confusion matrices contain different numbers of instances in each cell, we did not rely fully on accuracy to decide which model to deploy. Precision is measured toward one, and the second model's precision is closer to one than the first model's: 0.787 versus 0.741. The true positive rate, measured by recall, is better in the first model than the second: 0.768 versus 0.715. The harmonic mean of precision and recall, the F1 score, is a better guide for model selection: we chose the model with the highest F1 score, which is Model 1, the eXtreme Gradient Boosted Trees Classifier, with an F1 score of 0.754.
• Model 1: F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.741 × 0.768) / (0.741 + 0.768) = 0.754
• Model 2: F1 Score = 2 × (Precision × Recall) / (Precision + Recall) = 2 × (0.787 × 0.715) / (0.787 + 0.715) = 0.749
Model Description:
The model's blueprint is shown below. Upon uploading the local data to DataRobot, any missing numeric values are filled in through a process called imputation. DataRobot then runs several models and scores them for the analyst to choose from. In our case the data was complete, so imputation was a routine step. Our target feature had two categories, graduate and dropout, which we engineered into a numeric target with dropout denoted by 0 and graduate by 1. The selected model is the best performer and is to be deployed and used for making predictions on new data.
From the feature engineering process, Model 1 has an RMSE score of 0.2564. This is a fair score indicating a minimal chance of error and better accuracy (Allwright, 2022). The model also has a good overall metric score, an F1 score of 0.754.
The model's accuracy is further demonstrated by the lift chart below, where:
• The actual line is steep (a visual indication of accuracy).
• The predicted line closely matches the actual line (a visual indication of accuracy).
• The lines increase fairly consistently across the bins ordered by predicted value (a visual indication of accuracy).
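The lift-chart construction described above (records sorted into bins by predicted value, then average predicted compared against average actual per bin) can be sketched as follows; the data here is synthetic, not the project's:

```python
import numpy as np

rng = np.random.default_rng(0)
pred = rng.uniform(0, 1, 1000)                         # predicted probabilities
actual = (rng.uniform(0, 1, 1000) < pred).astype(int)  # outcomes correlated with pred

# Sort records by predicted value and split into 10 equal bins.
order = np.argsort(pred)
bins = np.array_split(order, 10)
mean_pred = [pred[b].mean() for b in bins]
mean_actual = [actual[b].mean() for b in bins]

# For an accurate model, both series rise together across the bins.
print(round(mean_pred[0], 2), round(mean_pred[-1], 2))
```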
Critique of strengths and weaknesses:
• Strengths
The model is fairly accurate, and its performance can be improved by training with more data. It has an accuracy score of 0.804 and an F1 score of 0.754, figures not far from the maximum score of 1.
The model training process was economical: it did not present difficulties that would demand advanced feature engineering or hyperparameter tuning. The only feature engineering performed was on the target feature.
The model ranked among the best two of eight models, so it is reasonable to evaluate it for deployment and use it to predict the target feature.
• Weaknesses
The model is trained on a relatively small amount of data and may not perform well on new data despite strong performance on training data; this is called overfitting.
Scalability (the ability to extend capabilities and accommodate change) may also be a challenge for this model as more data is gathered.
The model demonstrated only fair performance on most metrics, with an accuracy of 0.804 and an F1 score of 0.754.
Recommendation
We have formulated an effective business recommendation for SPACE University using
DataRobot’s predictive model outcomes, and we considered several factors, including the
probability thresholds for the target, actions for implementation, and financial and societal
implications. The following is a refined approach to these recommendations:
Business Decisions at Probability Thresholds
We propose that SPACE University's administrators adopt distinct measures corresponding to different probability thresholds. For instance, a high probability might trigger immediate and intensive interventions, while a lower probability could initiate monitoring or light support. The chosen threshold is aimed at maximizing savings and is objectively arrived at as suggested by DataRobot.
Our project uses the selected model's confusion matrix to evaluate the intervention costs associated with deploying and using the model. The confusion matrix below has four quadrants.
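The tiered thresholds could be operationalized as a simple mapping from predicted dropout probability to an action tier (the cutoff values here are illustrative assumptions, not DataRobot's savings-maximizing threshold):

```python
def risk_tier(dropout_probability: float) -> str:
    # Illustrative cutoffs; the real threshold should come from
    # DataRobot's savings-maximizing analysis.
    if dropout_probability >= 0.7:
        return "high"    # immediate, intensive intervention
    if dropout_probability >= 0.4:
        return "medium"  # regular follow-up and monitoring
    return "low"         # general awareness programs only

print([risk_tier(p) for p in (0.85, 0.5, 0.1)])  # ['high', 'medium', 'low']
```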
Organization Actions for Implementation:
Besides predicting the students who stand a chance of dropping out, here are some actionable steps the institution can implement to address students already at different levels of attrition risk:
• Immediate interventions for high-risk students:
1. Personal counseling: SPACE University should organize personal counseling sessions with professional counselors, accessible to all students at high risk of dropping out.
2. Academic support: these students' academic records should be examined, their weak areas identified, and adequate academic support provided in those areas.
3. Financial aid adjustments: more financial aid options should be provided for students, particularly those in this category.
• Regular follow-ups and monitoring for medium-risk students: students our model classifies as medium-risk should receive regular follow-up and monitoring.
• General awareness and educational programs for all students to improve retention.
Baseline and Savings Matrices (Payoff) Related to Actions
Using the payoff scenarios from the DataRobot analysis, we can determine the financial
implications of the interventions. The four quadrants shown in the confusion matrix
represent the following and can be interpreted as follows:
• True Negative (TN): the number of students the model correctly predicts to drop out. The university will incur an intervention cost for each such student, and it will not assume that any of them will be retained and pay fees.
• False Negative (FN): the number of students the model incorrectly predicts to drop out when they will in fact graduate. The university will unnecessarily incur an intervention cost for each such student, and it will likewise not assume that any of them will be retained and pay fees.
• False Positive (FP): the number of students the model incorrectly predicts to graduate when they will drop out. This undermines the university's prospective income, since the anticipated fees will not be paid. In this case, the university should not anticipate fees from such students but should engage them in an intervention program aimed at convincing them to stay.
• True Positive (TP): the number of students the model correctly predicts to graduate. The university therefore anticipates fees from each such student.
With this basic understanding, the team made some assumptions to make the payoff matrix more realistic: the average cost of attending school is $38,000, while the average intervention cost is $1,500. The university should therefore aim to maximize fees and minimize intervention costs, by avoiding unnecessary intervention costs on students who are predicted to drop out but will graduate (false negatives), and by not anticipating fees from students who are predicted to graduate but will drop out (false positives). The payoff matrix below shows that the average maximum savings the model predicts for the university is $8,052.34 per student.
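The mechanics of an average per-student payoff can be sketched from the confusion-matrix counts and the stated cost assumptions. The per-quadrant payoff values below are our illustrative reading of the report's logic, not DataRobot's exact payoff settings, so the result will differ from the platform's figure:

```python
def average_payoff(counts, payoffs):
    # Weighted average payoff per student over confusion-matrix quadrants.
    total = sum(counts.values())
    return sum(counts[q] * payoffs[q] for q in counts) / total

counts = {"TN": 366, "FP": 76, "FN": 66, "TP": 218}  # Model 1, from the report

# Illustrative payoffs built from the assumed $38,000 fees and
# $1,500 intervention cost per student.
payoffs = {"TN": -1_500, "FP": -38_000, "FN": -1_500, "TP": 38_000}

print(round(average_payoff(counts, payoffs), 2))
```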
These Savings figures will guide how much the university should invest in student
retention programs.
Final Recommendation:
Based on the goal of predicting which students are at risk of dropping out, deploy the recommended model, given its slightly better balance between precision and recall (F1 score: 75.4%). Allocate the budget based on the payoff scenarios, prioritizing the most cost-effective interventions, which offer a high average annual savings of $9,155,510.58.
How the Model Should Be Implemented:
• We recommend integrating the model into the school's student record management system via an API integration with the DataRobot platform, for seamless data upload and prediction.
• The model should be used as part of a broader strategy to improve student
retention, financial stability, and the university's reputation.
• Use a tiered approach for interventions based on the predicted risk level.
Operational Implementation:
• Continuously monitor and adjust the model and interventions based on real-time data and outcomes.
• Conduct regular reviews, perhaps at the end of every semester, of model performance and intervention effectiveness, led by the Registrar and Student Success Team.
This recommendation is aimed at enhancing student retention and success, achieving
financial stability, and positively impacting SPACE University's reputation and societal
perception of higher education. It's important to regularly review and adjust the strategy
based on new data and insights.
References
8 Feature Engineering Techniques for Machine Learning. (n.d.). ProjectPro. https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423
Allwright, S. (2022, August 24). How to interpret RMSE (simply explained). Stephen Allwright. https://stephenallwright.com/interpret-rmse/
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Journal of the American Statistical Association, 73(363), 680. https://doi.org/10.2307/2286629
Completing College. (2019). National Student Clearinghouse Research Center. https://nscresearchcenter.org/wp-content/uploads/Completions_Report_2019.pdf
Cook, E. E., & Turner, S. (2022, March 1). Progressivity of Pricing at US Public Universities. National Bureau of Economic Research. https://www.nber.org/papers/w29829
Feature Selection Methods in Machine Learning. (n.d.). ProjectPro. Retrieved December 12, 2023, from https://www.projectpro.io/article/feature-selection-methods-in-machine-learning/562
Martins, M. V., Tolledo, D., Machado, J., Baptista, L., & Realinho, V. (2021). Early prediction of student's performance in higher education: A case study. https://doi.org/10.1007/978-3-030-72657-7_16
Pew Research Center. (2020). Pew Research Center. https://www.pewresearch.org/
Student Right to Know. (n.d.). American College of Education. https://ace.edu/about/student-right-to-know/
UCI Machine Learning Repository. (n.d.). https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success