PREDICTING STUDENT DROPOUT AND ACADEMIC SUCCESS FOR FINANCIAL PLANNING AT SPACE UNIVERSITY, USA
MIS 587: Business Applications in Machine Learning
December 11, 2023

Executive Summary
The primary aim of this initiative was to address SPACE University's fiscal challenges, exacerbated by a concerning 39% undergraduate dropout rate. This attrition has been damaging the university's financial health, resulting in an estimated annual loss of USD 13 million. In partnership with FRAB Consulting, an extensive study was carried out, leveraging machine learning algorithms to forecast student attrition trends. Our methodology encompassed a thorough analysis of an extensive dataset incorporating demographic, socioeconomic, and academic performance metrics, processed through the AI and machine learning capabilities of the DataRobot platform. The analysis, using a dataset of 3,630 records (excluding currently enrolled students), identified pivotal factors influencing student withdrawal. Machine learning techniques, including feature engineering and feature selection, were applied, and the most effective model identified was the eXtreme Gradient Boosted Trees Classifier. We advise integrating this model into the university's student record management framework, together with a dynamic, tiered intervention strategy that focuses immediate action on high-risk students while maintaining vigilant oversight of medium-risk individuals. This recommendation is formulated with a dual focus: mitigating financial risk and enhancing overall student retention and success. The efficacy of the model is inherently linked to the accuracy of the current dataset and the university's operational infrastructure, and we advocate regular model recalibration to keep pace with evolving student demographics and academic policy changes.
The deployment of this predictive model is projected to significantly curtail financial losses related to student attrition, with an estimated reduction potential of 70.43%, thereby strengthening SPACE University's financial stability and academic standing.

Table of Contents
Executive Summary
Business Problem and Project Objectives
Dataset Analysis
Data Decision
Feature Engineering
Feature Selection
Changes in DataRobot
The Model's Feature Impact
Model Selection
Describe Candidate Models
Comparing Models with Different Feature Lists
Model Selection
Model Description
Critique of Strengths and Weaknesses
Recommendation
Organization Actions for Implementation
Final Recommendation
How the Model Should Be Implemented
Operational Implementation
References

Business Problem and Project Objectives
SPACE University, a prestigious non-profit institution, is facing critical financial challenges. These challenges stem from a combination of rising operational costs, diminishing tuition revenue, and decreasing government funding. A major factor exacerbating this financial strain is the university's high student attrition rate. Currently, 39% of undergraduate students fail to graduate within six years, leading to a staggering annual loss of approximately USD 13 million. This attrition rate not only impacts the university's financial stability but also adversely affects its reputation and future enrollment prospects.
The university's high dropout rate is a multifaceted problem. Not only does it represent a significant financial loss, but it also undermines the perceived value and effectiveness of the education SPACE University provides. In a broader context, this issue aligns with national trends. According to the National Student Clearinghouse Research Center (2019), only 63% of first-time, full-time students at four-year private universities graduate within six years. The Pew Research Center (2020) further highlights that public confidence in higher education is waning, with only 40% of Americans believing that colleges adequately prepare students for life post-graduation. To address these challenges, SPACE University has partnered with FRAB Consulting to develop a predictive solution aimed at identifying students at risk of early academic discontinuation. The solution leverages predictive analytics and machine learning to proactively identify at-risk students, allowing the university to implement targeted intervention programs to enhance student retention and success. The National Bureau of Economic Research highlights the substantial financial impact of reducing attrition rates in private universities: a 10% reduction in these rates can result in an average annual saving of USD 300 million. For SPACE University, this means that a machine learning project focused on lowering attrition rates could potentially save millions of dollars annually. Furthermore, universities with higher graduation rates tend to have better reputations, which can attract more students, donors, and esteemed faculty members. Enhancing the university's reputation could play a crucial role in bridging its financial gap.

Dataset Analysis
The dataset provided by SPACE University offers a comprehensive profile of students enrolled in various undergraduate programs, including nursing, journalism, management, and design.
This dataset is a rich source of both demographic and socioeconomic information, collected at the time of student registration. Additionally, it includes academic performance data recorded after the first and second semesters. In terms of structure, the dataset encompasses 4,424 individual records, each described by 37 distinct attributes, which are detailed in the accompanying feature lists. The dataset's composition is predominantly numerical: it includes one categorical (qualitative) feature, five continuous variables, and a majority of discrete variables. A key strength of this dataset lies in its data quality. It is remarkably clean, with no missing values, inaccuracies, or inconsistencies observed, which is a significant advantage for any analytical endeavor. For the implementation of this project, DataRobot, an AI and machine learning platform, was utilized. This choice is motivated by DataRobot's capabilities in efficiently handling large datasets and performing complex analyses, making it an ideal tool for extracting meaningful insights from the SPACE University student data.

Data Decision
The target variable for our model, labeled "Outcome," is categorized into three distinct groups: "dropout," "enrolled," and "graduate," reflecting the students' status after their standard course period. Considering the specific requirements of this project, we excluded the records of currently enrolled students (the "enrolled" category), reducing the dataset from 4,424 to 3,630 individual records (more details in Feature Engineering). This focuses our analysis on the definitive outcomes of "dropout" and "graduate," which we encoded numerically as 0 and 1, respectively. This binary classification aligns well with the nature of our predictive task, which is inherently a classification problem.
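The data decision described above can be sketched in pandas. This is a minimal illustration with toy rows rather than the actual SPACE University file, though the column names ("Outcome", "CodedTarget") follow the report:

```python
import pandas as pd

# Toy stand-in for the student dataset; only the "Outcome" column matters here.
df = pd.DataFrame({"Outcome": ["Dropout", "Graduate", "Enrolled", "Graduate"]})

# Drop currently enrolled students to keep the two definitive outcomes.
df = df[df["Outcome"] != "Enrolled"].copy()

# Categorical encoding: Dropout -> 0, Graduate -> 1, stored as "CodedTarget".
df["CodedTarget"] = df["Outcome"].map({"Dropout": 0, "Graduate": 1})

print(df["CodedTarget"].tolist())  # [0, 1, 1]
```

On the real file, the same two operations reduce 4,424 records to 3,630.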
Feature Engineering
DataRobot is a machine learning platform that automates the end-to-end process of building, deploying, and managing machine learning models. Feature engineering and feature selection are crucial steps in the machine learning workflow and play a significant role in improving model performance. Common feature engineering techniques include:
• Missing value imputation: imputing missing values using the mean, median, or more advanced methods
• Binning or discretization: grouping continuous variables into bins
• Feature scaling: scaling numerical features to a standard range
• Categorical encoding: encoding categorical features as numerical values
• Feature splitting: splitting features into parts, such as date and time
• Handling outliers: removing or replacing outlying values
• Variable transformations: normalizing skewed data
This dataset has no missing values. For feature engineering, we used the categorical encoding technique. Our target feature, Outcome, had three levels: Dropout, Graduate, and Enrolled. Since our objective is to predict how many students will drop out and how many will graduate, we dropped the rows with the Enrolled value, reducing the dataset from 4,424 observations to 3,630 rows. We then encoded "Dropout" as 0 and "Graduate" as 1 in a new column named "CodedTarget."

Feature Selection
Feature selection algorithms fall into three main types: intrinsic methods, filter methods, and wrapper methods. 1. Intrinsic methods: algorithms that inherently incorporate feature selection. Examples include rule- and tree-based algorithms, Multivariate Adaptive Regression Spline (MARS) models, and regularization models. These methods are fast and do not require an external feature selection algorithm. 2.
Filter methods: pre-processing techniques that independently assess each feature's impact on a predictive model, such as information gain, entropy, and the Relief feature selection method. These methods are generic but may reduce model accuracy because they do not consider the nature of the predictive model. 3. Wrapper methods: aim to find a subset of features that optimizes the predictive model's performance. There are two types: greedy (e.g., Recursive Feature Elimination) and non-greedy (e.g., genetic algorithms, simulated annealing). Greedy methods focus on locally optimal results, while non-greedy methods assess all previous feature subsets for overall best performance.
For this dataset, wrapper methods produced the best subset of features for optimal model performance. We used Recursive Feature Elimination (RFE), which removes the least important features until the desired number remains. RFE helped us handle high-dimensional data, improve model efficiency, and reduce overfitting. After feature engineering, we loaded the revised dataset into DataRobot and began modeling, using "CodedTarget" as the target with target type Binary Classification. The best model was the eXtreme Gradient Boosted Trees Classifier, from which we can see the most important top features. Using the wrapper method (RFE), we removed the least important features, keeping only the top 13 features plus the "CodedTarget" feature as the target, then uploaded the reduced dataset and followed the same steps to find the best model. The best model for the updated dataset was the Light Gradient Boosted Trees Classifier with Early Stopping. Comparing these two models after taking the top 14 features into account, the model shows that Debtor is not the last of the top features.
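As an open-source analogue of the RFE step (the report itself relied on DataRobot), the procedure can be sketched with scikit-learn. The synthetic data and the logistic-regression estimator below are illustrative assumptions, not the report's actual pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the student data (the real dataset has 3,630 rows).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

# Recursively eliminate the least important features until 13 remain,
# mirroring the top-13 feature list kept in the report.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=13)
selector.fit(X, y)

print(int(selector.support_.sum()))  # number of retained features: 13
```

The `support_` mask identifies which columns survive elimination, which is how the reduced dataset for re-upload would be constructed.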
Admission grade, age at enrollment, scholarship holder, course, and tuition fees up to date matter, alongside the curricular units of the 1st and 2nd semesters (approved, enrolled, evaluations, grade).

Changes in DataRobot
The technique used in DataRobot was automated feature engineering. DataRobot's automated feature engineering can quickly explore a wide range of feature transformations and combinations, potentially identifying patterns and relationships that might be overlooked manually. After uploading both the first dataset and the updated dataset containing the top 14 features, we created a new relationship between the two datasets. The best model predicted with the targets CodedTarget and Curricular units 2nd semester (approved) was the eXtreme Gradient Boosted Trees Classifier, using the top 11 features. The CodedTarget feature contains 1,767 graduated students and 1,137 dropout students.

The Model's Feature Impact
A key benefit of automated feature engineering is that it can automatically detect target leakage and redundant features, such as "Curricular units 1st semester (enrolled)," and flag any data quality issues.

Model Selection
Describe Candidate Models
Of the eight models that DataRobot produced for our experiment, we decided to evaluate the two most accurate: the eXtreme Gradient Boosted Trees Classifier (Model 1) and the Light Gradient Boosted Trees Classifier with Early Stopping (Model 2). Above are the blueprints of these two models. Both models are supervised, using labeled data for training. The data in our case is numeric, and the prediction target is whether a student will graduate or drop out. Missing data is handled by imputation; in our case the data was complete. Feature leakage is a major concern, and there was such an occurrence in our project; the offending feature was dropped to address the challenge.
These models return visuals such as the confusion matrix, which supports the calculation of metrics like accuracy, precision, recall, and F1 score. The models' performances are compared in order to select the best performer. These models are the best of only eight candidates; besides exploring more sophisticated models, we recommend collecting and using more data so the models improve and become more scalable.

Comparing Models with Different Feature Lists
The feature lists used in comparing these two models were randomly chosen, with the exception of Curricular_units_1st_sem (enrolled), a feature earlier associated with leakage. Below is a model comparison based on RMSE, a metric indicating the error that comes with using a model; the lower the RMSE, the more accurate the model.
• List_1: Using the first feature list, both models have the same validation score of 0.4354.
• List_2: Using the second feature list, the second model has a lower validation score of 0.3695.
Given the differences and similarities in validation scores when different feature lists are used on competing models, feature selection clearly has an impact on model performance. It is therefore necessary to have a clear objective of what is to be achieved, since different models can produce similar metric scores when given the same data and features.

Calculate and report metrics for the models.
• Model 1: eXtreme Gradient Boosted Trees Classifier (TN = 366, FP = 76, FN = 66, TP = 218)
Accuracy = (TN + TP) / (TN + FP + TP + FN) = (366 + 218) / (366 + 76 + 218 + 66) = 0.804
Precision = TP / (TP + FP) = 218 / (218 + 76) = 0.741
Recall = TP / (TP + FN) = 218 / (218 + 66) = 0.768
F1 score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.741 * 0.768) / (0.741 + 0.768) = 0.754
• Model 2: Light Gradient Boosted Trees Classifier with Early Stopping (TN = 387, FP = 55, FN = 81, TP = 203)
Accuracy = (TN + TP) / (TN + FP + TP + FN) = (387 + 203) / (387 + 55 + 203 + 81) = 0.813
Precision = TP / (TP + FP) = 203 / (203 + 55) = 0.787
Recall = TP / (TP + FN) = 203 / (203 + 81) = 0.715
F1 score = 2 * (Precision * Recall) / (Precision + Recall) = 2 * (0.787 * 0.715) / (0.787 + 0.715) = 0.749

Above are the metric calculations for both models. Dropout is encoded as 0 and graduate as 1. The true cases (TP and TN) count the students correctly predicted to graduate or drop out, while the false cases (FP and FN) count the students incorrectly predicted to graduate or drop out.

Model Selection
The second model has a higher accuracy score (81.3%) than the first (80.4%), an indication that the second model classifies more instances correctly overall. Since the two models' confusion matrices contain different numbers of instances in each cell, we did not rely on accuracy alone to decide which model to deploy. Precision is bounded above by one, and the second model's precision is closer to one than the first model's (0.787 versus 0.741). The rate of true positives, measured by recall, is better in the first model than in the second (0.768 versus 0.715). Harmonizing precision and recall is a better guide for model selection, done by choosing the model with the highest F1 score. In our case that is Model 1, the eXtreme Gradient Boosted Trees Classifier, with an F1 score of 0.754.
• Model 1: F1 score = 2 * (0.741 * 0.768) / (0.741 + 0.768) = 0.754
• Model 2: F1 score = 2 * (0.787 * 0.715) / (0.787 + 0.715) = 0.749

Model Description
The model's blueprint is shown below.
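The metric arithmetic above can be reproduced directly from the confusion-matrix counts in a few lines of Python (shown here for Model 1; swapping in Model 2's counts works the same way):

```python
# Confusion-matrix counts for Model 1 (eXtreme Gradient Boosted Trees Classifier),
# taken from the report: TN=366, FP=76, FN=66, TP=218.
tn, fp, fn, tp = 366, 76, 66, 218

accuracy = (tn + tp) / (tn + fp + fn + tp)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of P and R

print(round(accuracy, 3), round(precision, 3), round(recall, 3), round(f1, 3))
# 0.804 0.741 0.768 0.754
```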
Upon uploading the locally available data to DataRobot, any missing numeric values are filled in through a process called imputation. DataRobot then runs several models and scores them for the analyst to choose from. In our case the data was complete, so imputation was a routine step. Our target feature had two categories, graduate and dropout; we engineered the target to be numeric, with dropout encoded as 0 and graduate as 1. The selected model is the best performer and is to be deployed for making predictions on new data. From the feature engineering process, Model 1 has an RMSE score of 0.2564. This is a fair score, indicating a minimal chance of error and better accuracy (Allwright, 2022). The model also has a good overall F1 score of 0.754. The model's accuracy is further demonstrated by the lift chart below, where:
• The actual line is steep (a visual indication of accuracy).
• The predicted line closely matches the actual line (a visual indication of accuracy).
• There is a fairly consistent increase in both lines across the bins based on the predicted values (a visual indication of accuracy).

Critique of Strengths and Weaknesses
• Strengths
The model is fairly accurate, and its performance can be improved by training with more data. It has an accuracy of 0.804 and an F1 score of 0.754, figures not far from the maximum score of 1. The model training process was economical: it did not demand advanced feature engineering or hyperparameter tuning, and the only feature engineering was performed on the target feature. The model ranked among the best two of eight models, so it is reasonable to evaluate it for deployment and use it to predict the target feature.
• Weaknesses
The model is trained on a relatively small amount of data and may not perform well on new data despite strong performance on the training data; this is called overfitting. Scalability (the ability to extend capabilities and accommodate change) may also be a challenge as more data is gathered. The model demonstrated only fair performance on most metrics, such as accuracy (0.804) and F1 score (0.754).

Recommendation
We have formulated a business recommendation for SPACE University using DataRobot's predictive model outcomes, considering several factors: the probability thresholds for the target, actions for implementation, and financial and societal implications. The following is a refined approach to these recommendations.

Business Decisions at Probability Thresholds
We propose that SPACE University's administrators adopt distinct measures corresponding to different probability thresholds. For instance, a high dropout probability might trigger immediate and intensive interventions, while a lower probability could initiate monitoring or light support. The chosen threshold is aimed at maximizing savings and is arrived at objectively, as suggested by DataRobot. Our project uses the selected model's confusion matrix to evaluate the intervention costs associated with deploying and using the model. The confusion matrix below has four quadrants.

Organization Actions for Implementation
Besides predicting which students stand a chance of dropping out, here are some actionable steps that the institution can implement to address students who are already at different levels of attrition risk:
• Immediate interventions for high-risk students: 1.
Personal counseling: SPACE University should organize personal counseling sessions with professional counselors, accessible to all students at high risk of dropping out. 2. Academic support: these students' academic records should be reviewed, their weak areas identified, and adequate academic support provided in those areas. 3. Financial aid adjustments: more financial aid options should be offered to students, particularly those in these categories.
• Regular follow-ups and monitoring for medium-risk students. Students whom our model places in the medium-risk category should receive regular follow-up and monitoring.
• General awareness and educational programs for all students to improve retention.

Baseline and Savings Matrices (Payoff) Related to Actions
Using the payoff scenarios from the DataRobot analysis, we can determine the financial implications of the interventions. The four quadrants of the confusion matrix can be interpreted as follows:
• True Negative (TN) – students the model correctly predicts to drop out. The university incurs an intervention cost for each such student, but does not count on retaining them or receiving their fees.
• False Negative (FN) – students the model incorrectly predicts to drop out when they will in fact graduate. The university unnecessarily incurs an intervention cost for each such student.
• False Positive (FP) – students the model incorrectly predicts to graduate when they will drop out. This undermines the university's prospective income, since the anticipated fees will not be paid.
In this case, the university should not anticipate fees from such students but should engage them in an intervention program aimed at convincing them to stay.
• True Positive (TP) – students the model correctly predicts to graduate. The university anticipates fees from each such student.
With this basic understanding, the team made some assumptions to make the payoff matrix more realistic: the average cost of attending school is $38,000, and the average intervention cost is $1,500. The university will therefore aim to maximize fees and minimize intervention cost, by reducing unnecessary intervention spending on students who will graduate but are predicted to drop out (False Negatives), and by reducing fee expectations from students who will drop out but are predicted to graduate (False Positives). The payoff matrix below shows that the model predicts average maximum savings of $8,052.34 per student for the university. These savings figures will guide how much the university should invest in student retention programs.

Final Recommendation
Based on this model's goal of predicting which students are at risk of dropping out, we recommend deploying Model 1, given its slightly better balance between precision and recall (F1 score: 75.4%). Allocate the budget based on the payoff scenarios, prioritizing the most cost-effective interventions, which offer a high average annual savings of $9,155,510.58.

How the Model Should Be Implemented
• We recommend integrating the model into the school's student record management system via an API connection to the DataRobot platform, for seamless data upload and prediction.
• The model should be used as part of a broader strategy to improve student retention, financial stability, and the university's reputation.
• Use a tiered approach for interventions based on the predicted risk level.

Operational Implementation
• Continuously monitor and adjust the model and interventions based on real-time data and outcomes.
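The tiered-intervention and payoff logic described above can be sketched as follows. The probability cutoffs for the tiers (0.7 and 0.4) are illustrative assumptions, not the threshold DataRobot selected; the dollar figures are the report's stated assumptions:

```python
# Report assumptions: average cost of attendance and per-student intervention cost.
AVG_FEES = 38_000
INTERVENTION_COST = 1_500

def risk_tier(p_dropout: float) -> str:
    """Map a predicted dropout probability to an intervention tier.
    The 0.7 / 0.4 cutoffs are hypothetical, for illustration only."""
    if p_dropout >= 0.7:
        return "high"    # immediate, intensive intervention
    if p_dropout >= 0.4:
        return "medium"  # regular follow-up and monitoring
    return "low"         # general awareness programs only

def expected_payoff(p_dropout: float, intervene: bool) -> float:
    """Expected net payoff per student: expected retained fees minus any intervention cost."""
    cost = INTERVENTION_COST if intervene else 0
    return (1 - p_dropout) * AVG_FEES - cost

print(risk_tier(0.82), round(expected_payoff(0.82, intervene=True), 2))
```

A sketch like this makes it easy to re-run the budget allocation whenever the assumed costs or the model's predicted probabilities change.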
• Regular review of model performance and intervention effectiveness should be conducted, perhaps at the end of every semester, by the Registrar and Student Success Team.
This recommendation is aimed at enhancing student retention and success, achieving financial stability, and positively impacting SPACE University's reputation and the societal perception of higher education. It is important to regularly review and adjust the strategy based on new data and insights.

References
8 Feature Engineering Techniques for Machine Learning. (n.d.). ProjectPro. https://www.projectpro.io/article/8-feature-engineering-techniques-for-machine-learning/423
Allwright, S. (2022, August 24). How to interpret RMSE (simply explained). Stephen Allwright. https://stephenallwright.com/interpret-rmse/
Cohen, J. (1988). Statistical power analysis for the behavioral sciences. Journal of the American Statistical Association, 73(363), 680. https://doi.org/10.2307/2286629
Completing College. (2019). National Student Clearinghouse Research Center. https://nscresearchcenter.org/wp-content/uploads/Completions_Report_2019.pdf
Cook, E. E., & Turner, S. (2022, March 1). Progressivity of pricing at US public universities. National Bureau of Economic Research. https://www.nber.org/papers/w29829
Feature Selection Methods in Machine Learning. (n.d.). ProjectPro. Retrieved December 12, 2023, from https://www.projectpro.io/article/feature-selection-methods-in-machine-learning/562
Martins, M. V., Tolledo, D., Machado, J., Baptista, L., & Realinho, V. (2021). Early prediction of students' performance in higher education: A case study. https://doi.org/10.1007/978-3-030-72657-7_16
Pew Research Center. (2020). Pew Research Center. https://www.pewresearch.org/
Student Right to Know. (n.d.). American College of Education. https://ace.edu/about/student-right-to-know/
UCI Machine Learning Repository. (n.d.). Predict students' dropout and academic success. https://archive.ics.uci.edu/dataset/697/predict+students+dropout+and+academic+success