International Journal of Information Management Data Insights 2 (2022) 100111

Contents lists available at ScienceDirect

International Journal of Information Management Data Insights

journal homepage: www.elsevier.com/locate/jjimei

Analysis of machine learning strategies for prediction of passing undergraduate admission test

Md. Abul Ala Walid a,b,∗, S.M. Masum Ahmed c,d,e,f,1,∗, Mohammad Zeyad c,e,f,g,1, S. M. Saklain Galib h,2, Meherun Nesa a,2

a Department of Computer Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Science and Technology University (BSMRSTU), Gopalganj 8100, Bangladesh
b Department of Computer Science and Engineering, Khulna University of Engineering and Technology (KUET), Khulna 9203, Bangladesh
c Energy and Technology Research Division, Advanced Bioinformatics, Computational Biology and Data Science Laboratory, Bangladesh (ABCD Laboratory, Bangladesh), Chattogram 4226, Bangladesh
d Faculty of Engineering, University of Mons (UMONS), Bd Dolez 31, 7000 Mons, Belgium
e School of Engineering and Physical Sciences (EPS), Heriot-Watt University (HWU), EH14 4AS, Edinburgh, Scotland, United Kingdom
f Department of Energy Engineering, University of the Basque Country (UPV/EHU), Ingeniero Torres Quevedo Plaza, 1, 48013 Bilbao, Biscay, Spain
g School of Science & Technology, International Hellenic University (IHU), 14th km Thessaloniki – N. Moudania, 57001 Thermi, Thessaloniki, Greece
h Department of Biomedical Engineering, Khulna University of Engineering and Technology (KUET), Khulna 9203, Bangladesh

Keywords: Machine Learning; Balanced Dataset; Adaboost; Support Vector Machines (SVM); Precision

Abstract

This article primarily focuses on understanding the reasons behind the failure of undergraduate admission seekers using different machine learning (ML) strategies. An operative dataset has been prepared using the least number of significant attributes to avoid model complexity.
The procedure halted after obtaining 343 observations with ten different attributes. Predictions are obtained using six widely used ML techniques. Stratified K-fold cross-validation is employed to measure the generalization of the proposed models to unseen data, and the Precision, Recall, F-Measure, and AUC Score metrics are determined to assess the efficiency of each model. A comprehensive investigation in this article indicates that the resampling strategy derived from the combination of edited nearest neighbor (ENN) and borderline SVM-based SMOTE, together with the SVM model, achieved the most prominent performance. Additionally, the borderline SVM-based SMOTE with the Adaboost model performs as the second-highest performing model.

1. Introduction

Utilizing ML, enormous amounts of information can be re-evaluated to discover particular patterns that might not be immediately noticeable or recognizable to humans. ML strategies have increasingly been used to assess educational data such as student class performance (Cardona and Cudney, 2019). In the pursuit of the academic well-being of students, the utilization of neoteric technologies such as data mining, data management, and ML has increased. The idea of extracting undisclosed information from a large amount of raw data is called data mining. Consequently, the exploration of knowledge acquisition relates to predictive ML models and subsequent decision-making (O'Bannon and Thomas, Jul. 2015). The state of the art in data mining and ML has become more acceptable for predicting student examination evaluations such as grades, achievement, etc. (Wakelam, Jefferies, Davey, and Sun, Mar. 2020). Generally, conventional data mining aimed at solving problems in an educational context can be described as educational data mining (O'Bannon and Thomas, Jul. 2015), (Predicting Student Performance using Classification and Regression Trees Algorithm, Jan. 2020).
Currently, intelligent computer-based methods such as artificial intelligence (AI) and data mining have been successfully applied to improve people's daily lives (Udo, Bagchi, and Kirs, 2010).

Abbreviations: AI, Artificial Intelligence; ANN, Artificial Neural Network; CNN, Convolutional Neural Network; CC, Correlation Coefficient; DT, Decision Tree; ENN, Edited Nearest Neighbor; GBM, Gradient Boosting Machine; FN, False Negative; FP, False Positive; KNN, K-Nearest Neighbor; LR, Logistic Regression; LSTM, Long Short-Term Memory; ML, Machine Learning; MDI, Mean Decrease Impurity; MSE, Mean Squared Error; RF, Random Forest; RTV-SVM, Reduced Training Vector-Based SVM; AUC, Area Under the ROC Curve; SARIMA, Seasonal Autoregressive Integrated Moving Average; SSC, Secondary School Certificate; SVM, Support Vector Machine; SMOTE, Synthetic Minority Oversampling Technique; PSO, Particle Swarm Optimization; VC, Vapnik-Chervonenkis; TP, True Positive.
∗ Corresponding authors. E-mail addresses: abulalawalid@gmail.com (Md.A.A. Walid), smmasum.ahmed.eee@gmail.com (S.M.M. Ahmed), mohammad.zeyad.eee@gmail.com (M. Zeyad).
1 Joint 2nd Authors (contributed equally to this work). 2 Joint 3rd Authors (contributed equally to this work).
https://doi.org/10.1016/j.jjimei.2022.100111 Received 17 November 2021; Received in revised form 16 August 2022; Accepted 18 August 2022. 2667-0968/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).

A couple of million students participate in the bachelor's entrance examination at government-run universities each year in Bangladesh. Nevertheless, only a few thousand are admitted after this competitive examination. In some cases, it was observed that many candidates struggled hard during this period.
However, they could not get admission to a public university in Bangladesh, resulting in an uncertain future. Numerous factors could be behind their unsuccessful admission to a public university, such as family circumstances, frustration, admission test anxiety, etc. Moreover, Bangladeshi students need admission to a public university because private university education costs are too high for middle-income and low-income families; in contrast, the government primarily covers public university costs. Specifically, with the help of this research, parents from underprivileged, poor, and middle-income communities will find ways to improve their children's chances of being admitted to a public university in Bangladesh. Public university admission tests in Bangladesh are quite competitive. Concisely, success in a particular public university's test can be pursued by continuously monitoring candidates through collected data. In this regard, ML models generated from data collected at a particular public university can be highly effective for predicting educational status and monitoring a candidate's standing. The methodology of this work can be considered a norm for improving students' performance in any exam, such as the Graduate Record Examinations (GRE), the International English Language Testing System (IELTS), the Secondary School Certificate (SSC), etc. Moreover, with the help of this research, a mobile application could be developed in the future: by entering their information, a student could understand their situation regarding their chances of being admitted to a public university. In this study, an outline has been developed, employing the most modern data mining and ML techniques, to advise applicants for the university's undergraduate admissions test by providing 'risk' warnings in advance.
This article develops a precise model with good generalization on both categories, using a dataset with a uniform number of points per category, to identify students at risk of failing the public university admission exam in Bangladesh. The proposed approach can be fruitful by notifying such students about their educational circumstances so they can improve their standing and reduce study gaps and depression. The objectives of this work are given below:
• The main objective of the research is to find the reasons behind candidates' failure in the undergraduate admission test and to unfold the factors that significantly impact their rejection from a public university.
• The proposed method applies a comprehensive investigation based on the outcomes of several useful metrics, where each evaluated pair consists of a resampling approach and a classification model.
• An investigation has been performed on the highest and second-highest performing models, showing that the SVM model is highly robust to the 'Allow' class, while the Adaboost model is moderately robust to both categories.
The rest of this article is organized as follows. The literature is reviewed in Section 2. The research methodology is given in Section 3. Section 4 presents the analysis of the results. Moreover, an empirical discussion is provided in Section 5. Finally, Section 6 summarizes the findings and mentions future work possibilities.

2. Literature review

Data mining and ML techniques have been utilized effectively to improve students' performance by understanding their behavior (Ifinedo, Apr. 2016), (Ramírez-Noriega, Juárez-Ramírez, and Martínez-Ramírez, Feb. 2017) and to analyze big data (Shirdastian, Laroche, and Richard, Oct. 2019), (Chowdhury et al., 2022). ML applications have been most widespread in data management and data mining classification for human performance and behavior analysis (Bruce, 1999, Agarwal, Chauhan, Kar, and Goyal, 2017, Votto, Valecha, Najafirad, and Rao, Nov. 2021, Garg, Sinha, Kar, and Mani, 2022, Mahdikhani, Apr. 2022). Particularly, data management of student information is quite crucial for predicting their performance (Al-Mamary, Nov. 2022, Miguéis, Freitas, Garcia, and Silva, Nov. 2018, Tomasevic, Gvozdenovic, and Vranes, Jan. 2020, Edwards, Apr. 2022, Asif, Merceron, Ali, and Haider, Oct. 2017). M. I. Al-Twijri and A. Y. Noaman proposed a new data mining model for higher education institutions (Al-Twijri and Noaman, 2015). Besides, (Wakelam, Jefferies, Davey, and Sun, Mar. 2020) predicted student performance utilizing ML and data mining strategies; data from a group of 23 students was used in their study to classify students at risk. Furthermore, (Romero et al., 2013) predicted the performance of first-year university students through online discussion. In that article, the researchers proposed data mining methods for improving the prediction of students' final performance using a combinational approach that joins a clustering method to classification methods. The clustering approach exercised in the study was set to produce several clusters similar to the classes of their dataset, which needed to be managed properly. Lately, the field of education has benefited from AI. Different ML algorithms, such as artificial neural networks (ANN), are used to predict academic achievement or academic failure, which helps learners become more conscious about their studies (Rodríguez-Hernández, Musso, Kyndt, and Cascallar, Jan. 2021). Tomasevic, Gvozdenovic and Vranes (Jan. 2020), utilizing a modern supervised ML approach (ANN), performed student exam performance prediction and produced a comparative illustration. Besides, Hoffait and Schyns (Sep. 2017) analyzed the potential difficulties of university students. That work focused on the early detection of possible failures using student data collected during enrollment; three data mining methods were applied, including ANN, random forest (RF), and logistic regression (LR). Furthermore, (Fotouhi, Asadi, and Kattan, 2019) worked on an imbalanced dataset and performed classification using four popular classifiers: decision tree (DT), K-nearest neighbor (KNN), ANN, and RIPPER. To reduce misclassification caused by the imbalanced data distribution, both over-sampling and under-sampling were employed with different algorithms. With the help of the DT, (Hamsa, Indiradevi, & Kizhakkethottam, 2016) suggested a prediction approach for admission scores; accuracy, mean squared error (MSE), and correlation coefficient (CC) were utilized to test and compare their regression models. (Cardona and Cudney, 2019) developed a model for student performance estimation based on SVM. However, their output variable's data distribution showed a mismatch between categories, called an imbalance issue, so it could not be considered a stable and convenient approach given its unsatisfactory performance. The researchers used precision, recall, and accuracy measurements for analysis and showed that SVM with an RBF kernel provides better performance. In (Huang and Fang, Feb. 2013), four types of mathematical models were compared, including SVM, multiple LR, a multilayer perceptron network, and a radial basis function network; eight input variables were engaged to predict the final exam score, and 2907 data samples were obtained from 323 undergraduates over four semesters. In addition, (Chui, Fung, Lytras and Lam, Jun. 2020) proposed a modified SVM-based mechanism, named RTV-SVM, that reduces support vectors and training time by removing redundant training vectors; the researchers utilized a large amount of university student information to assess the proposed model. (Costa et al., Aug. 2017) evaluated various data mining prediction techniques to find students who were likely to fail programming courses. The proposed methodology was applied to two diverse datasets on programming courses from a Brazilian public university; after analyzing the datasets, SVM was found to be the most effective model. A Gradient Boosting Machine (GBM) model was suggested by (Fernandes et al., Jan. 2019) to forecast academic results. By examining demographic parameters, the researchers found that 'neighborhood', 'school', and 'age' are possible indicators of a student's academic success or failure. In addition, various algorithms were investigated, and overall performance was evaluated against various assessment criteria, including accuracy, precision, recall, convergence speed to the optimal solution, F-1 measure, and computational time (Edwards, Apr. 2022), (Batra, Jain, Tikkiwal, and Chakraborty, Apr. 2021, Garg, Kiwelekar, Netak, and Ghodake, Apr. 2021, Koch, Plattfaut, and Kregel, Nov. 2021, Tandon, Revankar, Palivela, and Parihar, Nov. 2021, Ensafi, Amin, Zhang, and Shah, Apr. 2022). After assessing multiple studies, it was discovered that, in order to make precise predictions about students' performance, the data collected from students needed to be managed carefully (Fernandes et al., Jan. 2019), (Grant, Huang, and Pasfield-Neofitou, 2014), (Cortez and Silva, 2022), (Hussain, Dahan, Ba-Alwib, and Ribata, Feb. 2018). In this context, nine input variables as possible prediction factors and one output variable are unraveled in Table 2. The input features are equipped so that they have a sufficient influence on the output variable. The PreExR variable reflects the result of the formulation exam taken directly before the admission test.
Other descriptive input features include residential place, relationship status, family status, time wasted each day (on social networking, playing sports, watching movies, or unnecessary gossip with friends), study duration per day for exam preparation, etc., as shown in Table 2. The dataset's output component has two types, 'Allow' and 'Not-Allow', shown as feature id number 10. Hence, the features with id numbers 1 to 9 are engaged as inputs to the predictive ML models. These features hold values based on information collected from four months before the admission test until the test started. In 2018, around ten thousand students took the university's admission test for the Life-Science faculty (Walid, Masum Ahmed, and Sadique, Nov. 2020); the success rate was close to 2%. Therefore, a simple random sampling without replacement approach was adopted to draw the sample from the population. Simple random sampling is a way of selecting n elements from a population of size N elements such that each combination of n elements has the same chance of being chosen as the others (Mitra and Pathak, Dec. 2007). The estimated sample size (n0) was calculated from the following formula with a 90% confidence level (Hernández-Sayago et al., Mar. 2013).

3. Research methodology

The domain-specific features were constructed from appropriate data collection at the beginning of this research. After that, data were collected from the students of a public university named Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh, and from the students of different national universities, namely Govt.
Bangabandhu College, Gopalganj, Bangladesh, and Government Brajalal College, Khulna, Bangladesh, following two primary data collection techniques, the Interviewing Method and the Email Questionnaire Method, and then transformed into a secondary format, comma-separated values (CSV), for use in computerized applications. Questionnaires were prepared by observing facts from the papers (Fernandes et al., Jan. 2019), (Grant, Huang, and Pasfield-Neofitou, 2014), (Hussain, Dahan, Ba-Alwib, and Ribata, Feb. 2018, Akanda, 2019, Cortez and Silva, 2022). The datasets were imported into Google Colaboratory, and several data preparation activities were applied to better fit the ML models. With the assistance of symmetrical analysis, the balanced or imbalanced condition of the datasets was checked using up-to-date statistical techniques. Four resolved datasets, balanced in both classes, were prepared to accomplish the subsequent steps of this research. The validation subsamples were obtained from the stratified k-fold cross-validation technique, configuring the value of k, for estimating the proficiency of the models on the datasets. Furthermore, a comparative investigation was conducted to evaluate the efficiency of each model generated on each dataset using several expedient model evaluation metrics. The best-performing models and their corresponding statistical techniques were then revealed as reasonable solutions. Again, one or more models can be yielded from the suitable-solution step; however, the proposed research aimed to combine the strengths of the utmost-performing models generated from the reasonable-solution step. The proposed methodology, illustrated in Fig. 1, is developed from the methods of the selected research works (Chawla, Bowyer, Hall, and Kegelmeyer, 2002, Han, Wang, and Mao, 2005, Nguyen, Cooper, and Kamei, 2022).
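The data preparation just described, CSV import, label encoding, and a class-balance check, can be sketched in Python. The tiny DataFrame below is an illustrative stand-in, not the actual survey data; its column names follow Table 2, but the rows are invented:

```python
import pandas as pd

# A tiny stand-in for the collected survey data (columns follow Table 2;
# the rows here are illustrative only, not real responses).
df = pd.DataFrame({
    "PreExR": ["Good", "Low", "Excellent", "Medium"],
    "Study_Duration": ["1-3hr", "0-1hr", "3-5hr", "1-3hr"],
    "Output": ["Allow", "Not-Allow", "Allow", "Not-Allow"],
})

# Label-encode the target as described in Section 3.2: 'Allow' -> 1, 'Not-Allow' -> 0.
df["Output"] = df["Output"].map({"Allow": 1, "Not-Allow": 0})

# One-hot encode the categorical predictors so ML models can consume them.
X = pd.get_dummies(df.drop(columns="Output"))
y = df["Output"]

# Inspect the class distribution to detect imbalance (cf. Fig. 2).
print(y.value_counts().to_dict())
```

In practice the real CSV would be loaded with `pd.read_csv(...)`; the balance check on the full dataset is what reveals the 185/158 split reported below.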
In particular, in a binary classification problem the learning process of most classification paradigms is often biased toward majority-class examples, and very high classification errors can be observed for the minority ones (Lamari et al., 2021). So, to overcome this issue, several under-sampling and over-sampling resampling methods (Kirshners, Parshutin, and Gorskis, Dec. 2016) were specified for the construction of four different datasets, all of which are used to construct the predictive models. Concerning this analysis, the six most popular model-based approaches (Fernandes et al., Jan. 2019) have been applied for the supervised classification task: LR, KNN, SVM, ANN, and the two most frequent ensemble ML methods, RF and Adaboost. Moreover, Table 1 describes the algorithms used in this research work.

n0 = (z² p q) / d²   (1)

n = n0 / (1 + n0/N)   (2)

where n0 = estimated sample size; z = statistical certainty chosen (1.64 for the 10% level of significance); p = estimated prevalence (0.5 if unknown); q = 1 − p; d = precision desired (usually 0.05); n = desired sample size; N = population size.

The sample size was calculated as 266. More precisely, 266 or more measurements/surveys were required to have 90% confidence that the real value was within ±5% of the measured/surveyed value; in that case the margin of error is 5%. But to reduce the margin of error, 343 samples were used in the final dataset, bringing the margin of error down to 4.38%; in this aspect, there was a 90% chance that the real value was within ±4.38% of the measured/surveyed value. The samples were collected using two primary data collection techniques: Interviewing and Email Questionnaire. Accordingly, a realistic dataset was developed for this work. The dataset includes 343 samples, of which 185 belong to the 'Allow' category and the rest to 'Not-Allow'. The 'Allow' group consists of samples from students who passed the admission test.
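The sample-size arithmetic in Eqs. (1) and (2) can be reproduced directly. Note that with z = 1.64 the infinite-population estimate comes to about 269 rather than the reported 266, so the exact z and N values used in the paper are assumptions here; N ≈ 10,000 examinees is inferred from the text:

```python
import math

# Cochran's sample-size formula, Eqs. (1)-(2).
# z = 1.64 (90% confidence), p = q = 0.5 (unknown prevalence), d = 0.05.
z, p, d = 1.64, 0.5, 0.05
q = 1 - p

# Eq. (1): infinite-population estimate.
n0 = (z ** 2) * p * q / d ** 2        # ~269 (the paper reports 266)

# Eq. (2): finite-population correction; N = 10,000 is an assumption.
N = 10_000
n = n0 / (1 + n0 / N)                 # ~262 for N = 10,000

# Margin of error actually achieved with the final 343 samples.
moe = z * math.sqrt(p * q / 343)      # ~4.43%, close to the reported 4.38%
print(round(n0), round(n), f"{moe:.2%}")
```

The small gap between the computed 4.43% and the reported 4.38% suggests the authors may have applied a finite-population correction to the margin of error as well.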
On the other hand, samples of the 'Not-Allow' type are obtained from students who participated in but did not pass the admission test. Fig. 2 indicates the initial data distribution of both categories in the dataset. Samples of the 'Allow' type are gathered from the first-year students of the Life-Science faculty of a public university in Bangladesh (Walid, Masum Ahmed, and Sadique, Nov. 2020). Samples of the other type are obtained from outside the university, covering students who participated in the Life-Science faculty test but could not pass it to acquire the almost fully government-funded scholarship. However, an unequal distribution of data samples among the categories was observed, indicating an imbalanced situation. The presence of an imbalanced situation in datasets hampers the outcomes of most standard classifier learning algorithms, such as KNN, DT, and Back-Propagation Neural Networks (Rout, Mishra, and Mallick, 2018).

3.1. Data collection

Each dataset variable is established by experimenting with the characteristics and significance of the features displayed in previous research.

Fig. 1. Proposed Methodology.

Fig. 2. Initial Data Distribution for Target Value.

Table 1 Description of Algorithms

Logistic Regression (LR): LR is a statistical method equivalent to linear regression, except that LR finds a formula that estimates the outcome of a binary variable from one or more response variables. However, the response variables in LR can be categorical or continuous, as the model does not require continuous data. Furthermore, LR assumes the variables are independent.
One disadvantage of LR is that the system cannot yield probabilities of typicality (DiGangi and Hefner, 2013, Du, Liu, Yu, and Yan, Jun. 2017, Bujang, Sa'At, Tg Abu Bakar Sidik, and Lim, Aug. 2018).

Support Vector Machine (SVM): SVM is an operative data mining strategy, focused on ML, used to predict data. SVM (Shirdastian, Laroche, and Richard, Oct. 2019), (Lashkarashvili and Tsintsadze, Apr. 2022) is one of the most effective techniques, extensively used in several fields of data classification, such as bioinformatics, fault detection, vehicle power management, and so on. Furthermore, with advances in computing science and intelligent technology, intelligent learning recognition skills have been well developed to capture complex nonlinear interactions between meteorological elements in real time and space. In addition, SVM provides other vital advantages in finding solutions to small-sample, nonlinear, and high-dimensional pattern recognition problems. SVM is a classification system based on statistical analysis and the Vapnik-Chervonenkis (VC) dimension hypothesis (Du, Liu, Yu, and Yan, Jun. 2017).

Artificial Neural Network (ANN): ANN is a learning-based algorithm inspired by the human brain's neural network (Shin et al., Sep. 2019). ANN has the potential to model complex and nonlinear communication phenomena between the independent and dependent variables of a network (Khandelwal et al., Apr. 2018), (Zeyad and Hossain, Dec. 2021). In ANN, neurons are interconnected, and each connection has a numerical weight. Besides, each layer comprises some artificial neurons and activation mechanisms, and a training mechanism is also defined. Feed-forward means that the connections between layers are always directed from lower to upper layers (Shin et al., Sep. 2019). ANNs are frequently trained with certain optimization algorithms to accomplish learning and reach optimal results.
An error can be determined by comparing the actual results with those predicted (Khandelwal et al., Apr. 2018).

Random Forest (RF): RF is an ensemble prediction method containing a set of various DTs fitted with bagging and random variable selection. Tree construction in RF remains similar to CART, but the process is completed with the help of recursive partitioning. The precise cut-point location and the choice of the dividing variable in recursive partitioning heavily rely on the distribution of observations throughout the learning sample. RF overcomes CART's problem of uncertainty by estimating with a set of trees rather than a single tree: since CART is an unbiased but unstable estimator, combining high-diversity trees levels out each tree's instability and produces the correct average prediction. Once all trees are formed, RF combines their different predictions to level out the effect of the training data and make RF consistent (Wang et al., Jul. 2018).

AdaBoost algorithm: Various algorithms were developed from the AdaBoost algorithm; many emphasize classification, and the remainder are associated with regression. AdaBoost is an iterative algorithm that joins weak learners sequentially and adapts the overall learning mechanism according to the error returned by the weak learners. The core aspect of AdaBoost is merging the weak learners produced in every iteration to construct a strong learner (Xiao, Dong, and Dong, Mar. 2018).

K-Nearest Neighbor (KNN): KNN is recognized as a non-parametric, instance-based, or lazy method (Anubhove et al., 2022). It has been considered one of the easiest strategies to perceive in ML and even in deep learning.
An unknown data point is classified based on the class of its closest neighbors. For this algorithm, the nearest neighbors are determined by the k-value, which specifies the number of nearest neighbors to be considered and thus defines the class of an unrecognized data point. It is often beneficial to prevent tied votes by selecting k as an odd number. A single number k gives the total number of neighbors used for classification; if k equals 1, the single closest neighbor determines the class of a sample. The fact that the classification of a given data point can rely on more than one nearest neighbor is the justification for the name KNN. This algorithm follows a memory-based strategy since, at runtime, the data points must be held in memory (Amra and Maghari, Oct. 2017).

Table 2 Attributes and their Possible Values

Id | Feature Explanation | Feature Name | Possible Values
1 | Previously participated exam result | PreExR | Low/Medium/Good/Excellent
2 | The educational and economical condition of the family | Family_Situation | (a) Educated (b) Uneducated (c) Unemployed (d) Employed: (a)+(c)/(a)+(d)/(b)+(c)/(b)+(d)
3 | Living area during the exam | Living_Area | Village/Town
4 | Spending time per day on social media or game playing | Misspend_Time | 0-1hr/1-3hr/3-5hr/More than 5hr.
5 | Living under family observance or not | Living_Status | With family/Without family
6 | Wasting time in a relationship, or by being frustrated in a relationship | Relation_Status | Yes/No
7 | Wasting time on political jobs | Political_Engagement | Yes/No
8 | The average duration of study in a single day | Study_Duration | 0-1hr/1-3hr/3-5hr/More than 5hr.
9 | Deforming thought by any addiction | Drug_Addiction | Yes/No
10 | Status of achieving success | Output (Target variable) | Allow (1)/Not-Allow (0)

At this stage, the input dataset is scaled so that the value of each input variable lies within the range (0, 1).
The output variable is simply transformed to a numeric value according to label encoding, where the "Allow" label is converted to 1 and "Not-Allow" is converted to 0.

3.2. Data analysis

The ML approach demands quality data along with preprocessing for better accuracy. Preprocessing steps such as handling missing values, feature encoding, data normalization, and standardization are applied to the data to enhance its quality as well as the algorithms' understanding of it.

Table 3 Distribution of the Classes generated from the Under-Sampling and Over-Sampling Approaches

Dataset | Technique | Allow | Not-Allow
Original Set | NA | 185 | 158
DA | Join T-Link | 158 | 158
DB | B-Smote | 185 | 185
DC | S-Smote | 185 | 185
DD | ENN-Smote | 158 | 158

Table 4 Performance Comparison (AUC) on different Datasets generated from Resampling approaches for Life-science faculty admission

Algorithm | AUC-DA | AUC-DB | AUC-DC | AUC-DD
Logistic Regression | 0.65 | 0.74 | 0.65 | 0.91
SVM | 0.69 | 0.74 | 0.70 | 0.96
ANN | 0.63 | 0.72 | 0.67 | 0.90
Random Forest | 0.64 | 0.64 | 0.69 | 0.96
Adaboost | 0.70 | 0.70 | 0.71 | 0.95
KNN | 0.67 | 0.64 | 0.70 | 0.93

4. Results

In Fig. 2, the data distribution of each group can be seen at a glance, and the imbalanced class distribution problem is observed effortlessly. Since this research focuses on accurately predicting both the majority and minority groups, attempts have been made to resolve the situation. Under-sampling and over-sampling methods have been adopted to produce four resolved datasets, namely dataset 'A' (DA), dataset 'B' (DB), dataset 'C' (DC), and dataset 'D' (DD), from the original set to ensure equal distribution of both groups. The class distribution of each resolved dataset is demonstrated in Table 3.
To produce DA, both the Tomek Link (AT, M, F and M., 2016) and the random down-sampling approaches are followed; the method is defined as Join T-Link, where T-links are first removed from the majority class and then down-sampling is applied to ensure a fair class distribution. Building on the idea behind N. V. Chawla's (Chawla, Bowyer, Hall, and Kegelmeyer, 2002) synthetic minority oversampling technique (SMOTE), this work uses the borderline SMOTE oversampling approach proposed by Hui Han et al. (Han, Wang, and Mao, 2005) to generate dataset DB, and constructs DC by following the SVM-based borderline SMOTE methodology proposed by (Nguyen, Cooper, and Kamei, 2022). In the simple SMOTE oversampling method, synthetic points are interpolated between minority samples and selected nearest points, whereas in borderline SMOTE only the samples located at the borderline of the minority class are over-sampled. Afterward, a hybrid (Fotouhi, Asadi, and Kattan, 2019) oversampling approach generated by integrating ENN and borderline SVM-based SMOTE was considered, and a balanced set DD was introduced. The borderline SVM-based SMOTE oversampling method was chosen as the component of the proposed combinational oversampling approach after observing the performance of the models on datasets DB and DC. Moreover, a relative analysis is performed to assess the performance of the models prepared from the proposed four datasets.

4.1. Implementation details

Six supervised algorithms, Adaboost, LR, RF, KNN, SVM, and ANN, were used to perform the classification tasks of this research. During the implementation of RF, different numbers of DTs were checked.
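The interpolation idea at the heart of SMOTE and its borderline variants can be illustrated with a minimal sketch. This is an illustration of the idea only, not the implementation used in the paper; borderline and SVM-based variants additionally restrict which minority samples are eligible for interpolation:

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like(minority, n_new, k=3):
    """Minimal SMOTE-style oversampling: each synthetic point is placed on
    the segment between a random minority sample and one of its k nearest
    minority neighbors (illustration only, not the library implementation)."""
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        # Distances from x to every minority point (itself included).
        d = np.linalg.norm(minority - x, axis=1)
        neighbors = np.argsort(d)[1:k + 1]        # skip the point itself
        z = minority[rng.choice(neighbors)]
        gap = rng.random()                        # random position on the segment
        synthetic.append(x + gap * (z - x))
    return np.array(synthetic)

# Toy minority class with 4 points; grow it by 3 synthetic samples.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_pts = smote_like(minority, n_new=3)
print(new_pts.shape)
```

Because each synthetic point is a convex combination of two existing minority points, the new samples always lie inside the minority region, which is why SMOTE enlarges the minority class without simply duplicating rows.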
To select the precise number of DTs that yields the highest performance from this majority-vote ensemble, an iterative approach was used that audits and compares proficiency for every number of trees from 5 to 1000. A similar iterative approach was used for KNN to ensure a promising outcome: the optimum value of 'k', the number of nearest neighbors, was chosen from the numbers two to fifty. A grid search algorithm (Syarif, Prugel-Bennett, and Wills, Dec. 2016) was adopted to optimize the SVM parameters; for this purpose, the values of 'C', '𝜆', and 'kernel' were tuned. On the other hand, the ANN was implemented with one input layer, two hidden layers, and one output layer, where each hidden layer uses the ReLU activation function, and the number of units in the hidden layers was adjusted to optimize the network. Again, the adaptive moment estimation technique, called the Adam optimizer, was applied to the gradient-descent-based training of the ANN to decrease the loss function; an essential advantage of the Adam optimizer is that the learning rate does not need to be defined manually. For the ensemble boosting classifier called Adaboost, the parameters were optimized with respect to the number of weak learners and the learning rate.

4.1.1. Performance measure

The area under the ROC curve (AUC), Precision, Recall, and F-Measure were picked as the metrics to estimate each model's efficiency. Essentially, an AUC value far from 1 implies that the model has poor class-discrimination capacity, whereas a value close to 1 implies an exalted class-discrimination capacity. Then again, sensitivity is commonly used to measure the true positive (TP) rate utilizing (3) and centers on decreasing the number of false negatives (FN), while precision focuses on limiting the number of false positives (FP) utilizing (4). The sensitivity value turns out to be low when the FN count increases rapidly.
Utilizing equation (5), the F1-Score assesses the model's suitability by finding the harmonic balance between precision and recall (Alyahyan and Düştegör, 2020).

Sensitivity = Recall = TP / (TP + FN)    (3)

Precision = TP / (TP + FP)    (4)

F1-Score = (2 × Precision × Recall) / (Precision + Recall)    (5)

3.3. Robustness checks (stratified K-fold cross-validation)

Cross-validation is one of the finest approaches to assess a model's generalization ability on new data. This analysis uses a stratified k-fold cross-validation approach, with the number of folds specified as five, to select the best model. In this regard, the dataset is split into five segments, or folds, with each segment containing approximately the same proportions of class labels. In the first stage, the first segment (SEG1) is held out as the testing set, the model is fitted on the remaining data as the training set, and the performance score is stored as SEG1. The same is done in the next stage with segment no. 2, and the SEG2 performance is recorded. This process is repeated five times, collecting five performance scores (SEG1, SEG2, SEG3, SEG4, SEG5), and their average, AVERAGE(SEG1 + SEG2 + SEG3 + SEG4 + SEG5), is defined as the model's final performance. In addition, in each iteration of the cross-validation approach the testing segment is formed so that it excludes the samples generated by the resampling techniques.

4.2. Result analysis

In Table 4 to Table 8, the mean outcomes from the k-fold cross-validation are organized for the four datasets.
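The fold construction and the metrics of Eqs. (3)-(5) can be sketched as follows, using the 185/158 class sizes from Table 3; for simplicity the sketch omits the step that excludes resampled points from the test folds.

```python
import numpy as np

def stratified_folds(y, n_folds=5, seed=0):
    """Split sample indices into folds that preserve the class proportions,
    as in stratified k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in np.unique(y):
        idx = rng.permutation(np.where(y == cls)[0])
        for f, chunk in enumerate(np.array_split(idx, n_folds)):
            folds[f].extend(chunk.tolist())
    return folds

def precision_recall_f1(y_true, y_pred):
    """Eqs. (3)-(5): recall (sensitivity), precision, and their harmonic mean."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    recall = tp / (tp + fn)                               # Eq. (3)
    precision = tp / (tp + fp)                            # Eq. (4)
    f1 = 2 * precision * recall / (precision + recall)    # Eq. (5)
    return precision, recall, f1

# 185 'Allow' (1) and 158 'Not-Allow' (0) samples, as in the original set.
y = np.array([1] * 185 + [0] * 158)
folds = stratified_folds(y)
print([int(y[f].sum()) for f in folds])  # [37, 37, 37, 37, 37]
```

Every fold receives the same 37 positive samples, so each test segment preserves the roughly 54/46 class ratio of the full dataset.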
Resampling methods were evaluated to uncover the most suitable and feasible models by examining the performance of the models derived from the four datasets. The AUC performance of the various models prepared from all the datasets is stored in Table 4, while Precision, Recall, and F-Measure are demonstrated in Table 5 to Table 8. The method concentrates on the efficacy of the datasets DA, DB, DC, and DD, constructed from state-of-the-art resampling techniques, in order to select the best resampling technique; the term "dataset's efficacy" indicates how good the prediction results obtained from that set are. By analyzing the AUC scores in Table 4, it can easily be perceived that dataset DD (AUC_DD) gives much higher values than all the other models generated from the remaining datasets. Therefore, dataset DD, obtained from the combined ENN and borderline SVM-based SMOTE over-sampling technique, is the most feasible in this case. After that, the outstanding model was investigated based on performance: SVM and RF demonstrated outstanding and similar AUC scores for dataset DD.

Table 5
Precision, Recall, F-Measure for all algorithms on dataset DD.

Algorithm             Precision   Recall   F1-Score
Logistic Regression   0.8913      0.8790   0.8851
SVM                   0.9638      0.9871   0.9754
ANN                   0.9374      0.8923   0.9143
Random Forest         0.9729      0.9361   0.9542
Adaboost              0.9173      0.9289   0.9231
KNN                   0.9778      0.9489   0.9630

Table 6
Precision, Recall, F-Measure for all algorithms on dataset DA.

Algorithm             Precision   Recall   F1-Score
Logistic Regression   0.6346      0.6133   0.6238
SVM                   0.6605      0.7084   0.6836
ANN                   0.53        0.65     0.58
Random Forest         0.5956      0.6256   0.6102
Adaboost              0.6659      0.7272   0.6952
KNN                   0.5907      0.6952   0.6387

Fig. 3. ROC curve for each fold and mean ROC from the SVM model.
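A fold-wise AUC value like those plotted in Fig. 3 is obtained from a model's continuous scores; a minimal sketch using the rank-statistic form of the AUC (toy labels and scores assumed):

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC as the probability that a randomly chosen positive sample receives
    a higher score than a randomly chosen negative one (rank-statistic form,
    equivalent to the area under the ROC curve)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Toy scores: 5 of the 6 positive-negative pairs are ranked correctly.
y = np.array([1, 1, 1, 0, 0])
s = np.array([0.9, 0.8, 0.3, 0.4, 0.1])
print(auc_score(y, s))  # 5/6 ≈ 0.833
```

A value of 1.0 would mean every positive outranks every negative, while 0.5 corresponds to random ranking, which is why AUC scores near 1 in Table 4 indicate strong class discrimination.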
After observing Table 5, it can be stated precisely that SVM outperforms RF because of its higher recall and its satisfactory balance of precision and recall. Contrariwise, KNN was keen to alleviate false-positive predictions and emerged with the highest precision, but its balance between recall and precision was not as impeccable as that of SVM. Therefore, the resampling technique combining the ENN and borderline SVM-based SMOTE approaches, which produced the DD dataset, together with the SVM model derived from that dataset, is declared the most successful approach of this study. In Fig. 3, the mean ROC curve is plotted for the best-performing SVM model alongside the ROC curve of every cross-validation fold; examining the AUC score of each fold, a remarkable value of 0.98 is the highest score given by some folds. More precisely, for the SVM model, 'C = 10', '𝜆 = 0.1', and 'kernel = rbf' were the optimized parameters, and 353 estimators were specified as the optimum parameter from the iterative process, as this number showed relatively prominent achievement. Rather than stopping there, the second most effective resampling method was also pursued. In this context, looking at the AUC scores obtained from datasets DA and DB in Table 4, Adaboost gives the highest AUC for dataset DA, whereas LR shows one of the lowest scores for that dataset. DB gives a higher (sometimes equal) AUC score than DA for every model except KNN. The AUC disparity graph in Fig. 4 conveys the same conclusion more vividly, as all the points on the AUC_DB line are positioned on or above the AUC_DA line except the KNN point. The LR and SVM models derived from dataset DB both show an AUC of 0.74, higher than the other models derived from that dataset.
Still, from Table 7, the superior performance of SVM is easily perceived, as it shows higher precision, recall, and F1-score than the LR model.

Table 7
Precision, Recall, F-Measure for all algorithms on dataset DB.

Algorithm             Precision   Recall   F1-Score
Logistic Regression   0.6780      0.6811   0.6796
SVM                   0.6840      0.7243   0.7036
ANN                   0.63        0.70     0.67
Random Forest         0.6465      0.6649   0.6556
Adaboost              0.7015      0.7297   0.7154
KNN                   0.60        0.85     0.7053

Table 8
Precision, Recall, F-Measure for all algorithms on dataset DC.

Algorithm             Precision   Recall   F1-Score
Logistic Regression   0.6689      0.7081   0.6879
SVM                   0.6872      0.7351   0.7104
ANN                   0.64        0.68     0.66
Random Forest         0.6729      0.7513   0.7099
Adaboost              0.7046      0.7405   0.7221
KNN                   0.66        0.7405   0.6979

On the other hand, the Adaboost model from dataset DB gives an AUC score of 0.70 along with the maximum precision and F-measure and an outstanding balance between precision and recall, which is the main reason it is selected as the most helpful model for dataset DB. Moreover, observing the performance of the models prepared from dataset DC in Table 4 and Table 8, Adaboost again gives the highest AUC, marking the peak of its line in Fig. 4, and also shows the maximal precision as well as the highest harmonic mean of precision and recall while keeping an outstanding balance between the two. But the fascinating point is that the precision, recall, and F-measure of the Adaboost model prepared from dataset DC, demonstrated in Table 8, are higher than the corresponding values for dataset DB. This extensive judgment is condensed into Fig. 5, where the point Adaboost_DC, referring to the Adaboost model prepared from dataset DC, occupies the peak position on each line graph. For this reason, it was found that, during this research, the borderline SVM-based SMOTE over-sampling technique is the second most effective resampling technique, and Adaboost is the most effective classifier based on its performance on dataset DC prepared by this approach.

Fig. 4. AUC disparity graph for datasets DA, DB, and DC.

Fig. 5. Graphical analogy based on precision, recall, and F-measure.

Fig. 6 demonstrates the AUC score for each fold and the mean AUC for the Adaboost model. Exploring the AUC score of each fold, it was noticed that fold-1 and fold-3 give the highest score of 0.74. The mean AUC score is marked in the same plot with a deep-red vertical line, and the AUC value of each fold lies close to this line, indicating a consistency that was never observed for the SVM model prepared from the DB dataset.

Fig. 6. AUC scores for each fold and mean AUC from the Adaboost model prepared from dataset DC.

4.2.1. Feature involvement exploration

To determine feature significance, the mean decrease impurity (MDI) feature importance of RF was calculated by counting the times a feature is used to split a node, weighted by the number of samples it splits (Alyahyan and Düştegör, 2020). G. Louppe et al. (Louppe, Wehenkel, Sutera, and Geurts, 2022) also indicated that the MDI value equals zero for an absolutely irrelevant feature. Fig. 7 illustrates the pictorial representation of the emergent attributes from the RF classifier. From this figure it is easily seen that every feature has a non-negligible relative significance value, which suggests that the dataset of this study does not contain an insignificant variable. Moreover, from a discerning exploration of the input features, it can be noticed that the attribute PreExR carries the most elevated significance, while the variables Misspend_Time, Family_Situation, and Study_Duration also have high importance values, suggesting that these variables are strong predictors and the most potent indicators for predicting the Life-Science faculty admission test result. On the contrary, an illustrative representation of feature significance from the Adaboost model is given in Fig. 8. Since a DT was set as the base classifier for the Adaboost model, the significance of each feature was calculated from the feature significance values of each base classifier taking part in the ensemble. From Fig. 8, Living_Status and Family_Situation were highly significant variables for the Adaboost model.

Fig. 7. Feature importance based on the Random Forest classifier.

Fig. 8. Feature importance based on the Adaboost classifier.

Table 9
Distribution of the classes generated from the under-sampling and over-sampling approaches.

Class       Original set   ENN   ENN-SMOTE
Allow       185            158   158
Not-Allow   158            47    158

5. Discussion

According to the investigation, the models generated from dataset DD demonstrate the most prominent outcomes. Dataset DD was constructed by the resampling approach that combines ENN and borderline SVM-based SMOTE. In the edited nearest-neighbor (ENN) under-sampling method, misclassified samples are cut out according to their nearest neighbors (Fotouhi, Asadi, and Kattan, 2019); the approach excludes noisy and borderline examples (Beckmann, Ebecken, and Pires de Lima, 2015), (Berka and Marek, Sep. 2021). ENN is often used to exclude samples from all categories (Fotouhi, Asadi, and Kattan, 2019), which occurred for this study: here, ENN was used to eliminate problematic samples of both categories.
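This editing rule can be sketched as follows, assuming a Wilson-style ENN with a 3-NN majority vote (the exact neighbor count used in the study is not stated); the toy points are hypothetical.

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Edited nearest neighbours: drop every sample whose label disagrees with
    the majority label of its k nearest neighbours."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    nearest = np.argsort(d, axis=1)[:, :k]
    majority = (y[nearest].mean(axis=1) > 0.5).astype(int)
    keep = majority == y
    return X[keep], y[keep]

# Toy data: one mislabelled positive sits inside the negative cluster.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [0.05, 0.05], [5.0, 5.0], [5.1, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
X_clean, y_clean = enn_filter(X, y)
print(len(X_clean))  # 5: only the stray positive is edited out
```

Because the rule is applied to every sample regardless of class, it can remove points from both categories, which is exactly what Table 9 shows for this dataset.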
Intending to get a smoother surface for decision-making, ENN was applied first, and after that, synthetic samples were generated for the minority class by the SMOTE over-sampling method so that both classes became equally balanced. An in-depth view can be attained by exploring Table 9: the points of both classes are reduced when the ENN down-sampling approach is applied to the original set, but the samples of the negative ('Not-Allow') class are decreased substantially. Almost 70.25% of the real observations of the negative class are eliminated, while only 14.59% of the real observations of the positive ('Allow') class are abandoned. Afterward, the SMOTE technique is adopted on the ENN output, resulting in an equally balanced dataset with 158 samples in each class. It is believed that a model prepared from so few real observations of the negative class might be restricted from performing well on real cases of that class, even though high sensitivity is offered; the procedure may narrow the knowledge the model retains about the negative class. From Table 5, it can be noticed that SVM and KNN both display fascinating precision values, with KNN revealing the utmost. High precision means a lower tendency to predict the negative class as positive, limiting the false positive (FP) count. Hence, it can be assumed that observations predicted as positive by the SVM or KNN model are more likely to truly belong to the positive class; therefore, such a model (SVM or KNN) can be the criterion for classifying observations that belong to the positive class. On the contrary, dataset DC, obtained from the borderline SVM-based SMOTE up-sampling method, merges synthetic samples into the negative class only, which ensures there is no loss of real observations in either class.
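The elimination percentages quoted above follow directly from the class counts in Table 9:

```python
# Class counts from Table 9: original set vs. after ENN editing.
allow_before, allow_after = 185, 158
not_allow_before, not_allow_after = 158, 47

allow_removed = (allow_before - allow_after) / allow_before
not_allow_removed = (not_allow_before - not_allow_after) / not_allow_before
print(f"{not_allow_removed:.2%} of 'Not-Allow' and "
      f"{allow_removed:.2%} of 'Allow' observations removed")
# 70.25% of 'Not-Allow' and 14.59% of 'Allow' observations removed
```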
Besides, the Adaboost model prepared from DC shows the highest F-measure and precision together with high sensitivity, which ascertains the model's robustness to both classes; therefore, this model can be the criterion for classifying observations that belong to the negative class. Fig. 9 illustrates a mechanism for utilizing the strengths of both models, taking advantage of each model's robustness to the positive or negative class in a productive manner to acquire maximum benefit.

Fig. 9. The optimal approach for binding the strengths of highly predictive models together for the final prediction.

5.1. Contributions to literature

A slightly imbalanced dataset was managed, which is one of the unique aspects of this study. Several resampling approaches were utilized in this research, which is another point of uniqueness. Moreover, three algorithms (Son and Fujita, 2019), four algorithms (Helal et al., 2018), and five algorithms (Abu Zohair, 2019) were used in similar kinds of studies; from that perspective, the six algorithms used in this work are a contribution to the literature. The dataset used in this research was collected by the researchers of this project, and a comparative analysis of the results generated from different ML models in the Bangladesh perspective is quite new. By doing this analysis, the primary objective of this research is accomplished. The method can be useful in several sectors where data are in tabular form and slightly imbalanced, and through the above methodology, the best model and resampling technique are also returned. Again, the optimal approach of binding the strengths of highly predictive models together can be used to improve students' performance by providing proper suggestions to individuals after analyzing their data.

5.2. Implications for practice

The proposed method can be significant for improving the performance of students in any educational sector, from primary level to university. The weights of the trained model can be utilized for retraining with new data, as with pre-trained models, so that the strength of the previous model can be used effectively for accurate predictions. Moreover, in the future, with the help of this research, a mobile application can be developed to accomplish the main objective of this study and close the research gap in an effective manner. By entering some answers in this application, a student can understand their situation regarding their probability of being admitted to a public university or any other institution where a competitive exam is the only way of admission. Parents, family members, friends, and physicians will also be able to assist pupils in improving their mental health if this study is further developed.

6. Conclusion

The goal of the article is to raise awareness among students and assist parents in taking prompt action to minimize the failure rate by providing the result of the university admission test early using ML predictive models. A realistic dataset was prepared to conduct this research work. Following four resampling processes, the joint T-Link, two isolated SMOTE methods, and a combination of the ENN and SMOTE approaches, four balanced datasets were produced for further calculations, since ML models function well on balanced data. Six separate ML models were prepared for each dataset. A comparative analysis was carried out among the models, assessing their helpfulness using the most valuable metrics (AUC, Precision, Recall, F-Measure) and choosing the ideal resampling technique for the dataset based on this performance. Thus, this research work exhibits, in descending order of effectiveness, pairs where each pair comprises a resampling approach and a classification model.
The most effective pair comprises the SVM model and the resampling strategy aggregating ENN and borderline SVM-based SMOTE, while the second most effective pair comprises the ensemble boosting model Adaboost and the borderline SVM-based SMOTE oversampling method. However, the combinational resampling approach aggregating ENN and borderline SVM-based SMOTE removes many real observations from the negative category through ENN and generates numerous synthetic data points for the same category. Consequently, a model founded on this combinational resampling approach may learn poorly from real observations of the negative category. To bypass this problem, the Adaboost model prepared by borderline SVM-based SMOTE and the SVM model produced by the combinational resampling approach are suggested to be combined. On this account, a remarkable method is uncovered in this study to obtain the best features of the deployed models simultaneously. This study also includes a clear overview of the significant features, which is crucial for building an effective model for future usage. Moreover, on one side, the proposed model is employed to predict candidates' success or failure, and on the other side, statistical analysis delivers advice on the factors to be improved for admission. The exploratory data analysis shows that, among candidates who spent more than 5 hours per day on social media or gaming while reading only 1-3 hours per day, the majority did not succeed in achieving their goal. Again, candidates living in urban areas while preparing for admission tests are more successful than students living in villages. Moreover, suggestions were provided to each candidate who was predicted to fail; the suggestions include the factors that should be improved and how much improvement is required.
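A hypothetical sketch of this advice step follows: each significant factor of a candidate predicted to fail is compared with the average value of that factor among the 'Allow' class. The feature names come from the feature-importance analysis above; the numeric values and the helper itself are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def improvement_suggestions(candidate, positive_class, feature_names):
    """For each factor where the candidate falls below the positive-class
    average, report the gap as the suggested amount of improvement."""
    positive_mean = positive_class.mean(axis=0)
    gaps = positive_mean - candidate
    return {name: round(float(gap), 2)
            for name, gap in zip(feature_names, gaps) if gap > 0}

# Hypothetical values: the candidate trails the 'Allow' average on two factors.
features = ["Study_Duration", "Misspend_Time", "PreExR"]
allowed = np.array([[3.0, 1.0, 4.5],
                    [4.0, 2.0, 5.0]])          # candidates in the positive class
candidate = np.array([2.0, 1.5, 4.0])
print(improvement_suggestions(candidate, allowed, features))
# {'Study_Duration': 1.5, 'PreExR': 0.75}
```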
The mentioned decisions can be taken only by comparing a significant factor’s value with the average value of significant factors of the candidates belonging to the positive class. The authors, therefore, assume that the proposed methodology can be used effectively to predict the outcome of a bachelor’s admission test at a university in a global problem only by enlargement of the dataset. In the overall analysis and comparison, other new educational data mining and ML methods will be included as part of future work. Further extension of this research will help parents, family, friends, and doctors to help students to improve their mental health. Alyahyan, E., & Düştegör, D. (2020). Predicting academic success in higher education: literature review and best practices. International Journal of Educational Technology in Higher Education, 17(1) Springer, Dec. 01. 10.1186/s41239-020-0177-7. Amra, I. A. A., & Maghari, A. Y. A. (Oct. 2017). Students performance prediction using KNN and Naïve Bayesian. In ICIT 2017 - 8th International Conference on Information Technology, Proceedings (pp. 909–913). 10.1109/ICITECH.2017.8079967. Md. Sadik Tasrif Anubhove, S. M. Masum Ahmed, M. Zeyad, Md. Abul Ala Walid, N. Ashrafi, and A. M. Saleque, “Tomato’s disease identification using machine learning techniques with the potential of AR and VR technologies for inclusiveness,” 2022, pp. 93–112. doi: 10.1007/978-981-16-7220-0_7. Asif, R., Merceron, A., Ali, S. A., & Haider, N. G. (Oct. 2017). Analyzing undergraduate students’ performance using educational data mining. Computers and Education, 113, 177–194. 10.1016/j.compedu.2017.05.007. AT, E., M, A., F, A.-M., & M, S. (2016). Classification of imbalance data using Tomek Link (T-Link) Combined with random under-sampling (RUS) as a Data Reduction Method. Global Journal of Technology and Optimization, 01(S1). 10.4172/2229-8711.s1111. Batra, J., Jain, R., Tikkiwal, V. A., & Chakraborty, A. (Apr. 2021). 
A comprehensive study of spam detection in e-mails using bio-inspired optimization techniques. International Journal of Information Management Data Insights, 1(1). 10.1016/j.jjimei.2020.100006. Beckmann, M., Ebecken, N. F. F., & Pires de Lima, B. S. L. (2015). A KNN Undersampling Approach for Data Balancing. Journal of Intelligent Learning Systems and Applications, 07(04), 104–116. 10.4236/jilsa.2015.74010. Berka, P., & Marek, L. (Sep. 2021). Bachelor’s degree student dropouts: Who tend to stay and who tend to leave? Studies in Educational Evaluation, 70. 10.1016/j.stueduc.2021.100999. C. S. Bruce, “Workplace experiences of information literacy,” 1999. Bujang, M. A., Sa’At, N., Tg Abu Bakar Sidik, T. M. I., & Lim, C. J. (Aug. 2018). Sample size guidelines for logistic regression from observational studies with large population: Emphasis on the accuracy between statistics and parameters based on real life clinical data. Malaysian Journal of Medical Sciences, 25(4), 122–130. 10.21315/mjms2018.25.4.12. Cardona, T. A., & Cudney, E. A. (2019). Predicting student retention using support vector machines. Procedia Manufacturing, 39, 1827–1833. 10.1016/j.promfg.2020.01.256. Md. I. H. Chowdhury, N. M. Sakib, S. M. Masum Ahmed, M. Zeyad, Md. A. A. Walid, and G. Kawcher, “Human face detection and recognition protection system based on machine learning algorithms with proposed ar technology,” 2022, pp. 177–192. doi: 10.1007/978-981-16-7220-0_11. Chui, K. T., Fung, D. C. L., Lytras, M. D., & Lam, T. M. (Jun. 2020). Predicting at-risk university students in a virtual learning environment via a machine learning algorithm. Computers in Human Behavior, 107. 10.1016/j.chb.2018.06.032. P. Cortez and A. Silva, 2022 “Using data mining to predict secondary school student performance.” Costa, E. B., Fonseca, B., Santana, M. A., de Araújo, F. F., & Rego, J. (Aug. 2017). 
Evaluating the effectiveness of educational data mining techniques for early prediction of students’ academic failure in introductory programming courses. Computers in Human Behavior, 73, 247–256. 10.1016/j.chb.2017.01.047. DiGangi, E. A., & Hefner, J. T. (2013). Ancestry Estimation. In Research Methods in Human Skeletal Biology (pp. 117–149). Elsevier Inc.. 10.1016/B978-0-12-385189-5.00005-4. Du, J., Liu, Y., Yu, Y., & Yan, W. (Jun. 2017). A prediction of precipitation data based on support vector machine and particle swarm optimization (PSO-SVM) algorithms. Algorithms, 10(2). 10.3390/a10020057. Edwards, J. S. (Apr. 2022). Where knowledge management and information management meet: Research directions. International Journal of Information Management, 63, Article 102458. 10.1016/j.ijinfomgt.2021.102458. Ensafi, Y., Amin, S. H., Zhang, G., & Shah, B. (Apr. 2022). Time-series forecasting of seasonal items sales using machine learning – A comparative analysis. International Journal of Information Management Data Insights, 2(1). 10.1016/j.jjimei.2022.100058. Fernandes, E., Holanda, M., Victorino, M., Borges, V., Carvalho, R., & van Erven, G. (Jan. 2019). Educational data mining: Predictive analysis of academic performance of public school students in the capital of Brazil. Journal of Business Research, 94, 335–343. 10.1016/j.jbusres.2018.02.012. Fotouhi, S., Asadi, S., & Kattan, M. W. (2019). A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 90 Academic Press Inc., Feb. 01. 10.1016/j.jbi.2018.12.003. Garg, R., Kiwelekar, A. W., Netak, L. D., & Ghodake, A. (Apr. 2021). i-Pulse: A NLP based novel approach for employee engagement in logistics organization. International Journal of Information Management Data Insights, 1(1). 10.1016/j.jjimei.2021.100011. Garg, S., Sinha, S., Kar, A. K., & Mani, M. (2022). A review of machine learning applications in human resource management. 
International Journal of Productivity and Performance Management, 71(5), 1590–1610 Emerald Group Holdings Ltd.May 06. 10.1108/IJPPM-08-2020-0427. Grant, S., Huang, H., & Pasfield-Neofitou, S. (2014). The authenticity-anxiety paradox: The quest for authentic second language communication and reduced foreign language anxiety in virtual environments. Procedia Technology, 13, 23–32. 10.1016/j.protcy.2014.02.005. H. Han, W.-Y. Wang, and B.-H. Mao, “LNCS 3644 - Borderline-SMOTE: A New OverSampling Method in Imbalanced Data Sets Learning,” 2005. Helal, S., et al., (Dec. 2018). Predicting academic performance by considering student heterogeneity. Knowledge-Based Systems, 161, 134–146. 10.1016/j.knosys.2018.07.042. Hernández-Sayago, E., Espinar-Escalona, E., Barrera-Mora, J. M., Ruiz-Navarro, M. B., Llamas-Carreras, J. M., & Solano-Reina, E. (Mar. 2013). Lower incisor position in different malocclusions and facial patterns. Medicina Oral, Patologia Oral y Cirugia Bucal, 18(2). 10.4317/medoral.18434. Hoffait, A. S., & Schyns, M. (Sep. 2017). Early detection of university students with potential difficulties. Decision Support Systems, 101, 1–11. 10.1016/j.dss.2017.05.003. Huang, S., & Fang, N. (Feb. 2013). Predicting student academic performance in an en- Financial disclosure This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. Declaration of Competing Interest The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. References Abu Zohair, L. M. (Dec. 2019). Prediction of Student’s performance by modelling small dataset size. International Journal of Educational Technology in Higher Education, 16(1), 27. 10.1186/s41239-019-0160-3. Agarwal, N., Chauhan, S., Kar, A. K., & Goyal, S. (2017). Role of human behaviour attributes in mobile crowd sensing: a systematic literature review. 
Digital Policy, Regulation and Governance, 19(2), 56–73. 10.1108/DPRG-05-2016-0023. Akanda, Md. A. S. (2019). Research Methodology a complete direction for learners (Second Edition). Dhaka: Akanda & Sons Publications. Al-Mamary, Y. H. S. (Nov. 2022). Understanding the use of learning management systems by undergraduate university students using the UTAUT model: Credible evidence from Saudi Arabia. International Journal of Information Management Data Insights, 2(2), Article 100092. 10.1016/j.jjimei.2022.100092. Al-Mamary, Y. H. S. (Nov. 2022). Why do students adopt and use Learning Management Systems?: Insights from Saudi Arabia. International Journal of Information Management Data Insights, 2(2), Article 100088. 10.1016/j.jjimei.2022.100088. Al-Twijri, M. I., & Noaman, A. Y. (2015). A New Data Mining Model Adopted for Higher Institutions. Procedia Computer Science, 65, 836–844. 10.1016/j.procs.2015.09.037. 11 Md.A.A. Walid, S.M.M. Ahmed, M. Zeyad et al. International Journal of Information Management Data Insights 2 (2022) 100111 gineering dynamics course: A comparison of four types of predictive mathematical models. Computers and Education, 61(1), 133–145. 10.1016/j.compedu.2012.08.015. Hussain, S., Dahan, N. A., Ba-Alwib, F. M., & Ribata, N. (Feb. 2018). Educational data mining and analysis of students’ academic performance using WEKA. Indonesian Journal of Electrical Engineering and Computer Science, 9(2), 447–459. 10.11591/ijeecs.v9.i2.pp447-459. Ifinedo, P. (Apr. 2016). Applying uses and gratifications theory and social influence processes to understand students’ pervasive adoption of social networking sites: Perspectives from the Americas. International Journal of Information Management, 36(2), 192– 206. 10.1016/j.ijinfomgt.2015.11.007. Khandelwal, M., et al., (Apr. 2018). Implementing an ANN model optimized by genetic algorithm for estimating cohesion of limestone samples. Engineering with Computers, 34(2), 307–317. 10.1007/s00366-017-0541-y. 
Kirshners, A., Parshutin, S., & Gorskis, H. (Dec. 2016). Entropy-based classifier enhancement to handle imbalanced class problem. Procedia Computer Science, 104, 586–591. 10.1016/j.procs.2017.01.176. Koch, J., Plattfaut, R., & Kregel, I. (Nov. 2021). Looking for Talent in Times of Crisis – The Impact of the Covid-19 Pandemic on Public Sector Job Openings. International Journal of Information Management Data Insights, 1(2). 10.1016/j.jjimei.2021.100014. Lamari, M., et al., (2021). SMOTE–ENN-based data sampling and improved dynamic ensemble selection for imbalanced medical data classification. Advances in Intelligent Systems and Computing, 1188, 37–49. 10.1007/978-981-15-6048-4_4. Lashkarashvili, N., & Tsintsadze, M. (Apr. 2022). Toxicity detection in online Georgian discussions. International Journal of Information Management Data Insights, 2(1). 10.1016/j.jjimei.2022.100062. G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts, 2022 “Understanding variable importances in forests of randomized trees.” Mahdikhani, M. (Apr. 2022). Predicting the popularity of tweets by analyzing public opinion and emotions in different stages of Covid-19 pandemic. International Journal of Information Management Data Insights, 2(1). 10.1016/j.jjimei.2021.100053. Miguéis, V. L., Freitas, A., Garcia, P. J. V., & Silva, A. (Nov. 2018). Early segmentation of students according to their academic performance: A predictive modelling approach. Decision Support Systems, 115, 36–51. 10.1016/j.dss.2018.09.001. Mitra, S. K., & Pathak, P. K. (Dec. 2007). The Nature of Simple Random Sampling. The Annals of Statistics, 12(4). 10.1214/aos/1176346810. Nguyen, H. M., Cooper, E. W., & Kamei, K. (2022). Borderline Over-sampling for Imbalanced Data Classification. O’Bannon, B. W., & Thomas, K. M. (Jul. 2015). Mobile phones in the classroom: Preservice teachers answer the call. Computers and Education, 85, 110–122. 10.1016/j.compedu.2015.02.010. 
Predicting Student Performance using Classification and Regression Trees Algorithm. (Jan. 2020). International Journal of Innovative Technology and Exploring Engineering, 9(3), 3349–3356. 10.35940/ijitee.c8964.019320.
Ramírez-Noriega, A., Juárez-Ramírez, R., & Martínez-Ramírez, Y. (Feb. 2017). Evaluation module based on Bayesian networks to Intelligent Tutoring Systems. International Journal of Information Management, 37(1), 1488–1498. 10.1016/j.ijinfomgt.2016.05.007.
Rodríguez-Hernández, C. F., Musso, M., Kyndt, E., & Cascallar, E. (Jan. 2021). Artificial neural networks in academic performance prediction: Systematic implementation and predictor evaluation. Computers and Education: Artificial Intelligence, 2. 10.1016/j.caeai.2021.100018.
Romero, C., López, M. I., Luna, J. M., & Ventura, S. (2013). Predicting students' final performance from participation in on-line discussion forums. Computers and Education, 68, 458–472. 10.1016/j.compedu.2013.06.009.
Rout, N., Mishra, D., & Mallick, M. K. (2018). Handling imbalanced data: A survey. Advances in Intelligent Systems and Computing, 628, 431–443. 10.1007/978-981-10-5272-9_39.
Shin, Y., Kim, Z., Yu, J., Kim, G., & Hwang, S. (Sep. 2019). Development of NOx reduction system utilizing artificial neural network (ANN) and genetic algorithm (GA). Journal of Cleaner Production, 232, 1418–1429. 10.1016/j.jclepro.2019.05.276.
Shirdastian, H., Laroche, M., & Richard, M. O. (Oct. 2019). Using big data analytics to study brand authenticity sentiments: The case of Starbucks on Twitter. International Journal of Information Management, 48, 291–307. 10.1016/j.ijinfomgt.2017.09.007.
Son, L. H., & Fujita, H. (Jan. 2019). Neural-fuzzy with representative sets for prediction of student performance. Applied Intelligence, 49(1), 172–187. 10.1007/s10489-018-1262-7.
Syarif, I., Prugel-Bennett, A., & Wills, G. (Dec. 2016). SVM parameter optimization using grid search and genetic algorithm to improve classification performance. TELKOMNIKA (Telecommunication Computing Electronics and Control), 14(4), 1502. 10.12928/telkomnika.v14i4.3956.
Tandon, C., Revankar, S., Palivela, H., & Parihar, S. S. (Nov. 2021). How can we predict the impact of the social media messages on the value of cryptocurrency? Insights from big data analytics. International Journal of Information Management Data Insights, 1(2). 10.1016/j.jjimei.2021.100035.
Tomasevic, N., Gvozdenovic, N., & Vranes, S. (Jan. 2020). An overview and comparison of supervised data mining techniques for student exam performance prediction. Computers & Education, 143, Article 103676. 10.1016/j.compedu.2019.103676.
Udo, G. J., Bagchi, K. K., & Kirs, P. J. (2010). An assessment of customers' e-service quality perception, satisfaction and intention. International Journal of Information Management, 30(6), 481–492. 10.1016/j.ijinfomgt.2010.03.005.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic Minority Over-sampling Technique. Journal of Artificial Intelligence Research, 16, 321–357.
Votto, A. M., Valecha, R., Najafirad, P., & Rao, H. R. (Nov. 2021). Artificial Intelligence in Tactical Human Resource Management: A Systematic Literature Review. International Journal of Information Management Data Insights, 1(2). 10.1016/j.jjimei.2021.100047.
Wakelam, E., Jefferies, A., Davey, N., & Sun, Y. (Mar. 2020). The potential for student performance prediction in small cohorts with minimal available attributes. British Journal of Educational Technology, 51(2), 347–370. 10.1111/bjet.12836.
Walid, M. A. A., Masum Ahmed, S. M., & Sadique, S. M. S. (Nov. 2020). A comparative analysis of machine learning models for prediction of passing bachelor admission test in life-science faculty of a public university in Bangladesh. 10.1109/EPEC48502.2020.9320119.
Wang, Z., Wang, Y., Zeng, R., Srinivasan, R. S., & Ahrentzen, S. (Jul. 2018). Random Forest based hourly building energy prediction. Energy and Buildings, 171, 11–25. 10.1016/j.enbuild.2018.04.008.
Xiao, L., Dong, Y., & Dong, Y. (Mar. 2018). An improved combination approach based on Adaboost algorithm for wind speed time series forecasting. Energy Conversion and Management, 160, 273–288. 10.1016/j.enconman.2018.01.038.
Zeyad, M., & Hossain, M. S. (Dec. 2021). A comparative analysis of data mining methods for weather prediction. In 2021 International Conference on Computational Performance Evaluation (ComPE) (pp. 167–172). 10.1109/ComPE53109.2021.9752344.
Hamsa, H., Indiradevi, S., & Kizhakkethottam, J. J. (2016). Student academic performance prediction model using decision tree and fuzzy genetic algorithm. Procedia Technology, 25, 326–332.