Success in academic studies prediction through Machine Learning

Madie Fabio¹

¹ Norwegian University of Science and Technology, Trondheim 7034, Norway

Abstract. Starting academic studies requires a large investment to succeed: tuition fees, housing, and miscellaneous expenses, among others. The purpose of this study is to create a prediction model from a dataset hosted on the website kaggle.com, which was collected and analyzed in the paper "Predicting Student Dropout and Academic Success" [1]. Higher education institutions collect a large amount of data about their students, representing a tremendous opportunity to develop information, knowledge, and monitoring. School dropout and educational failure in higher education are both barriers to economic growth, employment, competitiveness, and productivity, with significant consequences for students and their families, higher education institutions, and society as a whole. Student achievement is critical at educational institutions since it is frequently used as a criterion for the institution's performance. Early detection of at-risk students, combined with preventive actions, can significantly increase their achievement. Machine learning techniques have recently been widely used for prediction. In this regard, it is necessary to first prepare the dataset that will be used, presented on the Kaggle page affiliated with the project [2]. Different types of models will be compared in order to determine the most efficient one for making an early prediction of student academic success. Some less common models will also be considered for this task and their effectiveness assessed. The idea of the project is to identify the main risk factors for dropout, so that early interventions can prevent it, and to determine the factors that lead to a positive academic outcome.

Keywords: Machine Learning Prediction, Multi-class model, Academic success.
1 Introduction

1.1 Project Idea

The purpose of this project is twofold. The main idea is to provide a program able to estimate the chance of a student dropping out and, for that estimate, identify the main failure factors. On the other hand, it will deliver the principal success components and an associated academic success rate. Several machine learning models exist; the target is multi-class, which limits the methods we can use. We will explore different types of machine learning, namely classification and regression. Because this type of dataset has been treated extensively in existing research, we will only use one regression model, Logistic Regression, to determine its effectiveness for this type of problem. It is not commonly used in multi-class problems, but it is quite interpretable. Then, we will use classification algorithms such as the K-Nearest Neighbors method because, in this dataset, several variables such as demographic information, academic performance, and financial information can be used to predict whether a student is likely to drop out or graduate. K-Nearest Neighbors can group students based on these variables, which can help identify common characteristics or factors that contribute to their retention or dropout rates. We will compare the Decision Tree method and the Random Forest to conclude on the gain in effectiveness relative to the increase in complexity. Finally, we will implement a Support Vector Machine, which can handle the non-linear decision boundaries that may appear in our dataset, is effective with high-dimensional data, and is robust to outliers. Since the data is fully labeled, only supervised learning will be used. The performance of each model will be compared using several metrics: accuracy, precision, recall, and F1-score.
1.2 Literature Review

The topic of predicting a student's academic failure or success has always been of interest to researchers. Indeed, in order to maximize the chances of success, it is interesting to ask which courses to follow, in which geographical area, and so on. All the following has been extracted from "Artificial neural networks in academic performance prediction: Systematic implementation and predictor evaluation" [3]. This literature review about academic success prediction over time defines academic success and presents the methodology used in research. It provides a complete guideline describing data mining techniques and summarizing previous studies. In this report, only information relative to studies based on prediction at the course level will be considered. The dataset currently used is based on student performance in undergraduate degrees; previous research depending on the year level or the degree level is irrelevant. The following table summarizes the existing research:

Table 1: Review of the existing studies

Reference                   | Algorithms Used           | Model Type | Sample Size | Best Accuracy
Almarabeh (2017) [5]        | NB, BN, ID3, J48, NN      | [C]        | 255         | NB – 93%
Mueen et al. (2016) [6]     | NB, NN, C4.5              | [C]        | 60          | NB – 86%
Mohamed & Waguih (2017) [7] | J48, Rep Tree, RT         | [C]        | 8080        | J48 – 85%
Sivasakthi (2017) [8]       | SMO, NB, J48, NN, REPTree | [C]        | 300         | MLP – 93%
Putpuek et al. (2018) [9]   | ID3, C4.5, KNN, NB        | [C]        | −           | NB – 43.18%
Garg (2018) [10]            | C4.5                      | [C]        | 400         | −
Yassein et al. (2017) [11]  | C4.5                      | [C] [CC]   | 150         | −

[C] for classification; [R] for regression; [CC] for clustering; BN Bayes net, DT decision tree, KNN k-nearest neighbors, LR logistic regression, NB naive Bayes, (P)NN (probabilistic) neural network, RB rule-based, RI rule induction, RF random forest, RT random tree, NN neural network, TE tree ensemble; −: information not available [4]

The same evaluation will be considered for each type of algorithm used, in order to provide a solid comparison with the previous studies.
Although the accuracy of the best-performing model is greatly influenced by the dataset and the preprocessing applied to it, this gives an overall impression of which type of model is the most efficient. The Naïve Bayes model distinguishes itself from the others. Classification types are the most commonly used, along with clustering types. Regression models are irrelevant here because the output is binary: the student either passes or drops out. The report will provide an instance of the inefficiency of this type of algorithm.

2 Methodology

The review methodology is based on the approach of the previous research [4]. Since every student's information could influence the prediction, it would not be accurate to remove data without assessing the model's response. First of all, we have to prepare the data, which is raw and cannot be used directly for analysis and modeling: it could contain missing, inconsistent, or duplicate entries that must be removed before any other step. The data will then be split, and models will be trained on one part and tested on the other. Each model's results will be evaluated and important features will be identified. Further tests will then be run with those features only.

2.1 Data Set Presentation

2.1.1 Data Description

As explained in the dataset report [1], the data sources are varied: the dataset is a combination of data from the Academic Management System (AMS), the Support System for the Teaching Activity of the institution (developed internally and called PAE), the General Directorate of Higher Education (DGES) regarding admission through the National Competition for Access to Higher Education (CNAES), and the Contemporary Portugal Database (PORDATA) regarding macroeconomic data. It refers to records of students enrolled between the academic years 2008/2009 and 2018/2019 and includes 17 undergraduate degrees specified in the original report. The dataset used is presented on the website Kaggle [2].
It is used to predict dropout and academic outcomes. It is composed of various data including demographic, social, economic, academic performance, and personal factors. The values come from students enrolled in higher education institutions. Social parameters are, for instance, the mother's/father's qualifications, nationality, and the mother's/father's occupation. All the data is explained in Table 2, which describes each attribute used in the dataset, grouped by class [1]. The possible values for each attribute are detailed in the Appendix of the report from which the dataset was extracted [1]. This dataset includes 4424 records with 34 attributes and no missing values. The analysis of the dataset will be performed with Python 3 using the Pandas library; the Sklearn library for the feature selection, the machine learning models, and their evaluation; and the Matplotlib and Seaborn libraries for the data visualization. The main difficulties that can be faced with a dataset are missing data, meaning that either some data are not provided in the dataset or there are not enough data to construct an accurate model; in our case, we do not have any missing data. Another common issue is an imbalanced dataset, where the number of samples from one class is significantly smaller than from the other classes. Finally, for a regression-type model, a problem of multicollinearity can occur: it appears whenever an independent variable is highly correlated with one or more of the other independent variables in a multiple regression equation. All these problems will be treated during the pre-processing phase. The dataset is, however, fairly well-sized. Our target feature is a categorical value representing whether a student is "Dropout", "Enrolled", or "Graduate". In light of that, the problem is a multi-class classification.
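As a minimal sketch of this first inspection (assuming the Kaggle export is a CSV file and that the target column is named "Target"; both are assumptions about the file layout), the checks for missing values, duplicates, and class balance can be written with Pandas as:

```python
import pandas as pd

def inspect(df: pd.DataFrame) -> dict:
    """Summarize missing cells, duplicate rows, and the target distribution."""
    return {
        "missing": int(df.isnull().sum().sum()),
        "duplicates": int(df.duplicated().sum()),
        "target_counts": df["Target"].value_counts().to_dict(),
    }

# Usage with the Kaggle file (the file name "dataset.csv" is hypothetical):
# df = pd.read_csv("dataset.csv")
# print(inspect(df))
```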
Hence, a transformation will have to be done to get a result from the regression model.

Table 2: Data Description

Data Class                    | Attribute                                      | Data Type
Demographic                   | Marital Status                                 | Numeric/discrete
                              | Nationality                                    | Numeric/discrete
                              | Displaced                                      | Numeric/binary
                              | Gender                                         | Numeric/binary
                              | Age at enrollment                              | Numeric/discrete
                              | International                                  | Numeric/binary
Socioeconomic                 | Mother's qualification                         | Numeric/discrete
                              | Father's qualification                         | Numeric/discrete
                              | Mother's occupation                            | Numeric/discrete
                              | Father's occupation                            | Numeric/discrete
                              | Educational special needs                      | Numeric/binary
                              | Debtor                                         | Numeric/binary
                              | Tuition fees are up to date                    | Numeric/binary
                              | Scholarship holder                             | Numeric/binary
Macroeconomic                 | Unemployment rate                              | Numeric/continuous
                              | Inflation rate                                 | Numeric/continuous
                              | GDP                                            | Numeric/continuous
Academic data at enrollment   | Application mode                               | Numeric/discrete
                              | Application order                              | Numeric/ordinal
                              | Course                                         | Numeric/ordinal
                              | Daytime/evening attendance                     | Numeric/binary
                              | Previous qualification                         | Numeric/discrete
Academic results 1st semester | Curricular units 1st sem (credited)            | Numeric/discrete
                              | Curricular units 1st sem (enrolled)            | Numeric/discrete
                              | Curricular units 1st sem (evaluations)         | Numeric/discrete
                              | Curricular units 1st sem (approved)            | Numeric/discrete
                              | Curricular units 1st sem (grade)               | Numeric/discrete
                              | Curricular units 1st sem (without evaluations) | Numeric/discrete
Academic results 2nd semester | Curricular units 2nd sem (credited)            | Numeric/discrete
                              | Curricular units 2nd sem (enrolled)            | Numeric/discrete
                              | Curricular units 2nd sem (evaluations)         | Numeric/discrete
                              | Curricular units 2nd sem (approved)            | Numeric/discrete
                              | Curricular units 2nd sem (grade)               | Numeric/discrete
                              | Curricular units 2nd sem (without evaluations) | Numeric/discrete
Target                        | Target                                         | Categorical

2.1.2 Data Processing and Data Transformation

First of all, a data selection, also called dimensionality reduction, has to be considered for the sake of reducing the computing power needed to produce results. Furthermore, irrelevant features can yield below-optimal prediction results.
Two different selection methods exist. Vertical selection consists of the removal of redundant or irrelevant features to simplify the understanding of patterns and decrease the duration of the learning phase, but it needs a good understanding of the data to select the features. Horizontal selection consists of the removal of conflicting instances to strengthen the dataset, but it requires a large sample size first and foremost. Data cleaning is a crucial step in any machine learning project, as it helps ensure the data used to train the model is accurate, consistent, and free from anomalies or errors. First of all, we have to clean the dataset and check that there are no missing values that could create issues during the prediction. Data sources can be inconsistent or contain noise, so this step is crucial in order to get usable results. In our dataset, there are no missing or duplicate values. Next, the data type that represents the student's status must be changed: to be interpretable by the models, the information in the 'Target' column must be an integer and not a string as originally presented in the dataset. We also have to remove every case where the student is still enrolled in the school. Every graduated student is labeled with a 1 and every student who dropped out with a 0. With that, the multi-class classification problem becomes a binary classification. In order to improve the quality of the results, the dataset will be scaled: the goal of normalization is to transform features to be on a similar scale, which improves the performance and training stability of the model. There is a strong imbalance in the dataset towards the "Graduate" group, as shown in Figure 1. The majority class, "Graduate", represents 50% of the total records (2209 of 4424), "Dropout" represents 32% (1421 of 4424), and "Enrolled" represents 18% (794 of 4424).
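The transformation just described (dropping "Enrolled" records, mapping the string target to integers, and standardizing the features) can be sketched as follows; the column name "Target" and the label strings are taken from the dataset description above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def encode_target(df: pd.DataFrame) -> pd.DataFrame:
    """Remove 'Enrolled' students and encode the target:
    'Graduate' -> 1, 'Dropout' -> 0."""
    out = df[df["Target"] != "Enrolled"].copy()
    out["Target"] = (out["Target"] == "Graduate").astype(int)
    return out

def scale_features(X):
    """Standardize features to zero mean and unit variance."""
    return StandardScaler().fit_transform(X)
```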
Even if the Enrolled class is removed from the dataset, the majority is still represented by the "Graduate" class with 61% (2209 of 3630) against 39% (1421 of 3630). A large imbalance might cause a prediction driven by the majority group, which is not optimal. This problem will be addressed by applying the Synthetic Minority Oversampling Technique (SMOTE) to the dataset to oversample the minority class.

Figure 1: Distribution of student records

2.2 Machine Learning Algorithms

2.2.1 Models Used

Once the data has been cleaned and preprocessed, we have to split the dataset for training and testing. To do that, a random shuffle is applied to the cleaned dataset and both the training and testing parts are converted back to integers. The values are also separated into 5 folds for cross-validation analysis. For every model, we will use the same notation: ŷ is the model prediction, x is the vector of input features, and y is the target variable.

Logistic Regression (LR) is a straightforward model that works well with small to medium-sized datasets. A linear combination of the input features is computed first:

z = w · x + b,

where w and b are the parameters of the model, also called the weights (or coefficients) and the intercept. This value is passed through the sigmoid function:

g(z) = 1 / (1 + e^(−(w · x + b)))

in order to map the output into [0, 1]. A threshold then sets the output to either 0 or 1:

ŷ = 1 if g(z) ≥ threshold,
ŷ = 0 if g(z) < threshold.

In the Sklearn library, the threshold for a two-class problem is 0.5.

K-Nearest Neighbors (KNN) is capable of handling intricate correlations between features, but it can be computationally expensive for large datasets. For every new data point x′, a distance measure (the Euclidean distance in the Sklearn library) is computed to its K nearest neighbors x_i:

d(x′, x_i) = sqrt( Σ_j (x′_j − x_i,j)² )

The data point x′ is then assigned to the majority class among its K nearest neighbors.
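A sketch of the split plus the Logistic Regression step described above (SMOTE comes from the separate imbalanced-learn package, so its call is shown commented; the 80/20 split ratio is an assumption):

```python
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

def split_and_fit_lr(X, y, seed=0):
    """Shuffle/split the data, then evaluate Logistic Regression with
    5-fold cross-validation on the training part and once on the test part."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    # Oversampling the minority class with SMOTE (imbalanced-learn):
    # from imblearn.over_sampling import SMOTE
    # X_tr, y_tr = SMOTE(random_state=seed).fit_resample(X_tr, y_tr)
    lr = LogisticRegression(max_iter=1000)  # default 0.5 decision threshold
    cv_mean = cross_val_score(lr, X_tr, y_tr, cv=5).mean()
    test_acc = lr.fit(X_tr, y_tr).score(X_te, y_te)
    return cv_mean, test_acc
```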
In the Sklearn library, the default value for k is 5, but further testing tends to show that the optimum here is 1; larger values are also more computationally expensive.

Decision Tree (DT) is capable of handling both continuous and categorical variables, but it is susceptible to overfitting. This model can use two different split criteria, both measures of impurity: the entropy or the Gini index. Because there is little practical difference between these two criteria, the less computationally expensive one, the Gini index, is chosen and developed in this report.

Let the data at node m be represented by Q_m, with n_m samples. For each candidate split θ = (j, t_m), consisting of a feature j and a threshold t_m, partition the data into subsets Q_m^left(θ) and Q_m^right(θ):

Q_m^left(θ) = {(x, y) | x_j ≤ t_m}
Q_m^right(θ) = Q_m \ Q_m^left(θ)

The target is a classification outcome taking the values 0 and 1; for node m, let

p_mk = (1 / n_m) Σ_{y ∈ Q_m} I(y = k).

We can then compute the impurity function based on the Gini index:

H(Q_m) = Σ_k p_mk (1 − p_mk).

Hence, we minimize the impurity by looking for the split θ that minimizes the weighted sum:

θ* = argmin_θ [ (n_m^left / n_m) H(Q_m^left(θ)) + (n_m^right / n_m) H(Q_m^right(θ)) ]

The split achieving this minimum is selected to start new nodes. This continues until the maximum allowable depth is reached (None by default) or all the data have been split.

Random Forest (RF) is good for high-dimensional datasets with complicated relationships. It is a model that fits several decision tree classifiers on subsamples of the training dataset and averages their predictions to prevent overfitting and increase accuracy. Each decision tree classifier is based on the previous model, with the same parameters. After testing, the most effective number of classifiers to prevent overfitting while remaining computationally affordable is 100, with a maximum depth of 10.
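The three neighbor/tree models with the settings discussed above can be instantiated as follows (a minimal sketch; the `random_state` values are only for reproducibility):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def make_classifiers():
    """k=1 for KNN, the Gini criterion for the decision tree, and
    100 trees with a maximum depth of 10 for the random forest."""
    return {
        "KNN": KNeighborsClassifier(n_neighbors=1),
        "DT": DecisionTreeClassifier(criterion="gini", random_state=0),
        "RF": RandomForestClassifier(n_estimators=100, max_depth=10,
                                     random_state=0),
    }
```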
Support Vector Machine (SVM) works well in high-dimensional domains and can handle complex interactions, although it can be expensive to compute and challenging to interpret. It is a supervised machine learning model in which we try to find the hyperplane that best separates the two classes. Because the target is binary, a linear SVM will be considered.

The functional margin of a hyperplane is defined by

γ̂_i = y^(i) (w^T x^(i) + b),

so that, for a confident and correct classification,

w^T x^(i) + b ≫ 0 when y^(i) = 1,
w^T x^(i) + b ≪ 0 when y^(i) = −1.

The goal of the model is to maximize the geometric margin:

γ_i = γ̂_i / ||w|| = y^(i) (w^T x^(i) + b) / ||w||.

Optimizing this margin reduces the problem to the minimization over w and b of

(1/2) ||w||² + C Σ_{i=1}^{n} ξ_i,

where the ξ_i are positive slack variables introduced to relax the margin; C, also called the regularization constant, controls these variables and has to be identified through cross-validation. For this dataset, the optimal C has been determined to be 1·10⁻⁵.

2.2.2 Results Evaluation

This type of classification problem lends itself to an evaluation using a confusion matrix. While evaluating each model, four different cases related to a given prediction occur:
- True Positive (TP): number of students correctly classified as "Graduate"
- False Positive (FP): number of students wrongly classified as "Graduate"
- True Negative (TN): number of students correctly classified as "Dropout"
- False Negative (FN): number of students wrongly classified as "Dropout"

To evaluate the performance of each model, we used accuracy, precision, recall, and F1 score. Information on the measurement tools is provided in Table 3.
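The linear SVM with the cross-validated C = 1e-5, together with the confusion-matrix counts defined above, can be sketched as:

```python
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

def svm_confusion(X_train, y_train, X_test, y_test):
    """Fit a linear SVM with the regularization constant C = 1e-5
    selected by cross-validation for this dataset, then return the
    TN / FP / FN / TP counts of the confusion matrix."""
    clf = SVC(kernel="linear", C=1e-5).fit(X_train, y_train)
    tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
    return {"TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp)}
```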
Table 3: Measurement tools for a classification problem

Performance assessment | Calculation                                     | Interpretation
Accuracy               | (TP + TN) / (TP + FP + TN + FN)                 | Proportion of correct predictions
Precision              | TP / (TP + FP)                                  | Students correctly labeled "Graduate" among all students predicted as "Graduate"
Recall                 | TP / (TP + FN)                                  | Students correctly labeled "Graduate" among all actual "Graduate" students
F1 Score               | 2 × (Precision × Recall) / (Precision + Recall) | The precision of the classifier combined with its robustness

3 Results and Analysis

3.1 Feature Selection and Preprocessing

Collinearity can be an issue in our dataset: the analysis of the heatmap (Figure 2) shows that some pairs of features have high Pearson correlation coefficients. Collinearity is strongest within the same group of features but is also present between groups. "Nationality" and "International", or "Mother's occupation" and "Father's occupation", have high collinearity coefficients, as do "Curricular units 1st sem (approved)" and "Curricular units 2nd sem (approved)": the performance at the end of one semester greatly influences the next one.

Figure 2: Correlation Table

A test has been performed to determine the most important features using permutation feature importance. The 10 most important features are plotted in Figure 3. The analysis of these results shows that five features are considered important by all algorithms: "Curricular units 2nd sem (approved)", "Curricular units 1st sem (approved)", "Curricular units 2nd sem (grade)", "Course", and "Tuition fees up to date".

Figure 3: Plot of top 10 Permutation Feature Importance

3.2 Model Evaluation

We compared, tested, and assessed five classifiers on the dataset. All 34 available attributes were examined with all five classifiers. We employed five-fold cross-validation, which means that the dataset was randomly divided into five equal-sized sections.
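The permutation feature importance behind Figure 3 can be computed with Sklearn's `permutation_importance`; in this sketch, the fitted model and the feature names passed in are placeholders:

```python
import numpy as np
from sklearn.inspection import permutation_importance

def top_features(fitted_model, X_test, y_test, feature_names, k=10):
    """Rank features by mean permutation importance on held-out data."""
    result = permutation_importance(fitted_model, X_test, y_test,
                                    n_repeats=10, random_state=0)
    order = np.argsort(result.importances_mean)[::-1][:k]
    return [(feature_names[i], float(result.importances_mean[i]))
            for i in order]
```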
Table 4 displays the averaged results of ten runs of the experiment, using all of the attributes.

Table 4: Classifier results using all attributes

      | Accuracy | Precision Score | Recall Score | F1 Score
LR    | 91.5%    | 91.5            | 91.6         | 91.6
KNN   | 86.3%    | 86.2            | 86.2         | 86.2
DT    | 90.0%    | 90.3            | 89.8         | 89.8
RF    | 91.0%    | 91.1            | 90.9         | 90.9
SVM   | 90.7%    | 91.3            | 90.7         | 90.7

The Logistic Regression and Random Forest models perform best on all the previously defined performance metrics, and every classifier achieves good accuracy. To be less computationally expensive, irrelevant features are then removed: the classifiers are executed again on a reduced dataset with only the ten most important features for each, using five-fold cross-validation. The results on this reduced dataset can be seen in Table 5.

Table 5: Classifier results using the ten best attributes

      | Accuracy | Precision Score | Recall Score | F1 Score
LR    | 90.8%    | 91.0            | 90.8         | 90.8
KNN   | 83.9%    | 84.9            | 84.9         | 84.9
DT    | 88.0%    | 88.3            | 88.0         | 87.9
RF    | 91.9%    | 92.0            | 92.0         | 91.9
SVM   | 90.4%    | 91.0            | 90.3         | 90.4

The best-performing algorithms do not change with the removal of irrelevant features; however, for every model except the Random Forest, performance slightly decreases. In the case of the Random Forest, some of the removed features must have had high collinearity with others. Although it would have been good to remove the features with the highest collinearity, this was not possible because they are also the most heavily weighted features; their removal would have been even more detrimental. The performance of every algorithm is rather high except for the K-Nearest Neighbors model.

4 Conclusion

The ability to predict a student's performance is very important in educational environments: it makes it possible to predict what a student must aim for in order to graduate. This work is an example of how machine learning can be used to analyze students' academic success. This research aims to assist teachers in identifying early signs of dropout.
It will then be possible for them to give extra attention to these students and help them enhance their performance. Multiple classifiers were used, together with a few data pre-processing techniques such as dimensionality reduction and the Synthetic Minority Oversampling Technique (SMOTE). This allowed the Random Forest model to achieve great performance, outperforming every other algorithm with an average accuracy of 91.9%. The different factors that have to be watched are which course the student follows, whether their tuition fees are up to date, whether their curricular units are approved, and their grades in those units. Finally, for future study, it would be interesting to conduct more trials with larger datasets that include different courses, educational levels, and degrees.

References

1. Realinho, V., Machado, J., Baptista, L., Martins, M. V.: Predicting Student Dropout and Academic Success. https://www.mdpi.com/23065729/7/11/146
2. Affiliated Kaggle page: https://www.kaggle.com/datasets/thedevastator/higher-education-predictorsof-student-retention
3. Rodríguez-Hernandez, C. F., Musso, M., Kyndt, E., Cascallar, E.: Artificial neural networks in academic performance prediction: Systematic implementation and predictor evaluation. https://www-sciencedirect-com.docelec.insa-lyon.fr/science/article/pii/S2666920X21000126#bbib5
4. Alyahyan, E., Düştegör, D.: Predicting academic success in higher education: literature review and best practices. International Journal of Educational Technology in Higher Education (springeropen.com)
5. Almarabeh, H. (2017). Analysis of students' performance by using different data mining classifiers. International Journal of Modern Education and Computer Science, 9(8), 9–15. (mecs-press.org)
6. Mueen, A., Zafar, B., & Manzoor, U. (2016). Modeling and predicting students' academic performance using data mining techniques.
International Journal of Modern Education and Computer Science, 8(11), 36–42. https://www.researchgate.net/publication/311068715_Modeling_and_Predicting_Students'_Academic_Performance_Using_Data_Mining_Techniques/citations
7. Mohamed, M. H., & Waguih, H. M. (2017). Early prediction of student success using a data mining classification technique. International Journal of Science and Research, 6(10), 126–131. https://www.semanticscholar.org/paper/Early-Prediction-of-Student-Success-Using-a-Data-MohamedWaguih/e90dcba96b0c9472750869e4f127a8240e6763e1
8. Sivasakthi, M. (2017). Classification and prediction based data mining algorithms to predict students' introductory programming performance. ICICI, 0–4. https://www.ijsr.net/archive/v6i10/ART20177029.pdf
9. Putpuek, N., Rojanaprasert, N., Atchariyachanvanich, K., & Thamrongthanyawong, T. (2018). Comparative study of prediction models for final GPA score: A case study of Rajabhat Rajanagarindra University. In 2018 IEEE/ACIS 17th International Conference on Computer and Information Science (pp. 92–97). https://www.researchgate.net/publication/327820955_Comparative_Study_of_Prediction_Models_for_Final_GPA_Score_A_Case_Study_of_Rajabhat_Rajanagarindra_University
10. Garg, R. (2018). Predict student performance in different regions of Punjab. International Journal of Advanced Research in Computer Science, 9(1), 236–241. http://ijarcs.info/index.php/Ijarcs/article/view/5234/4486
11. Yassein, N. A., Helali, R. G. M., & Mohomad, S. B. (2017). Predicting student academic performance in KSA using data mining techniques. Journal of Information Technology and Software Engineering, 7(5), 1–5. https://www.longdom.org/open-access-pdfs/predicting-student-academic-performance-in-ksa-using-data-mining-techniques-2165-78661000213.pdf