International Journal of Information Management Data Insights 2 (2022) 100111
Contents lists available at ScienceDirect
International Journal of Information Management Data Insights
journal homepage: www.elsevier.com/locate/jjimei
Analysis of machine learning strategies for prediction of passing undergraduate admission test
Md. Abul Ala Walid a,b,∗, S.M. Masum Ahmed c,d,e,f,1,∗, Mohammad Zeyad c,e,f,g,1, S. M. Saklain Galib h,2, Meherun Nesa a,2
a Department of Computer Science and Engineering, Bangabandhu Sheikh Mujibur Rahman Science and Technology University (BSMRSTU), Gopalganj 8100, Bangladesh
b Department of Computer Science and Engineering, Khulna University of Engineering and Technology (KUET), Khulna 9203, Bangladesh
c Energy and Technology Research Division, Advanced Bioinformatics, Computational Biology and Data Science Laboratory, Bangladesh (ABCD Laboratory, Bangladesh), Chattogram, 4226, Bangladesh
d Faculty of Engineering, University of Mons (UMONS), Bd Dolez 31, 7000, Mons, Belgium
e School of Engineering and Physical Sciences (EPS), Heriot-Watt University (HWU), EH14 4AS, Edinburgh, Scotland, United Kingdom
f Department of Energy Engineering, University of the Basque Country (UPV/EHU), Ingeniero Torres Quevedo Plaza, 1, 48013, Bilbao, Biscay, Spain
g School of Science & Technology, International Hellenic University (IHU), 14th km Thessaloniki – N. Moudania, 57001, Thermi, Thessaloniki, Greece
h Department of Biomedical Engineering, Khulna University of Engineering and Technology (KUET), Khulna, 9203, Bangladesh
ARTICLE INFO
Keywords:
Machine Learning
Balanced Dataset
Adaboost
Support Vector Machines (SVM)
Precision
ABSTRACT
This article primarily focuses on understanding the reasons behind the failure of undergraduate admission seekers using different machine learning (ML) strategies. A working dataset was prepared using a minimal set of significant attributes to limit the complexity of the model. Data collection stopped after obtaining 343 observations with ten attributes. Predictions are made using six widely used ML techniques. Stratified K-fold cross-validation is used to measure how well the proposed models generalize to unseen data, and the Precision, Recall, F-Measure, and AUC metrics are computed to assess the efficiency of each model. The investigation indicates that the resampling strategy combining edited nearest neighbor (ENN) with borderline SVM-based SMOTE, paired with the SVM model, achieved the best performance. Additionally, borderline SVM-based SMOTE paired with the Adaboost model is the second-best performing combination.
1. Introduction
Using ML, enormous amounts of information can be re-evaluated to discover particular patterns that might not be immediately noticeable or recognizable to humans. ML strategies have increasingly been used to assess educational data such as student class performance (Cardona and Cudney, 2019). In the pursuit of students' academic well-being, the use of modern technologies such as data mining, data management, and ML has increased. Data mining is the extraction of previously undisclosed information from large raw databases. Consequently, the exploration of knowledge acquisition relates to predictive ML models and subsequent decision-making (O’Bannon and Thomas, Jul. 2015). State-of-the-art data mining and ML techniques have become more widely accepted for predicting student examination outcomes such as grades and achievement (Wakelam, Jefferies, Davey, and Sun, Mar. 2020). Generally, conventional data mining aimed at solving problems in an educational context is described as educational data mining (O’Bannon and Thomas, Jul. 2015), (Predicting Student Performance using Classification and Regression Trees Algorithm, Jan. 2020). Currently, intelligent
computer-based methods such as artificial intelligence (AI) and data mining have been successfully applied to improve people’s daily lives (Udo, Bagchi, and Kirs, 2010).
Abbreviations: AI, Artificial Intelligence; ANN, Artificial Neural Network; CNN, Convolutional Neural Network; CC, Correlation Coefficient; DT, Decision Tree; ENN, Edited Nearest Neighbor; GBM, Gradient Boosting Machine; FN, False Negative; FP, False Positive; KNN, K-Nearest Neighbor; LR, Logistic Regression; LSTM, Long Short-Term Memory; ML, Machine Learning; MDI, Mean Decrease Impurity; MSE, Mean Squared Error; RF, Random Forest; RTV-SVM, Reduced Training Vector-Based SVM; AUC, Area Under the ROC Curve; SARIMA, Seasonal Autoregressive Integrated Moving Average; SSC, Secondary School Certificate; SVM, Support Vector Machine; SMOTE, Synthetic Minority Oversampling Technique; PSO, Particle Swarm Optimization; VC, Vapnik-Chervonenkis; TP, True Positive.
∗ Corresponding authors.
E-mail addresses: abulalawalid@gmail.com (Md.A.A. Walid), smmasum.ahmed.eee@gmail.com (S.M.M. Ahmed), mohammad.zeyad.eee@gmail.com (M. Zeyad).
1 Joint 2nd Authors (contributed equally to this work).
2 Joint 3rd Authors (contributed equally to this work).
https://doi.org/10.1016/j.jjimei.2022.100111
Received 17 November 2021; Received in revised form 16 August 2022; Accepted 18 August 2022
2667-0968/© 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Each year, a couple of million students sit the bachelor’s entrance examinations at government-run universities in Bangladesh, yet only a few thousand are admitted after this competitive examination. Many candidates work hard during this period but still fail to gain admission to a public university in Bangladesh, leaving them with an uncertain future. Numerous factors could lie behind an unsuccessful admission attempt, such as family circumstances, frustration, and admission test anxiety. Admission to a public university matters to Bangladeshi students because private university education is too expensive for middle-income and low-income families, whereas public university costs are primarily covered by the government. With the help of this research, parents from underprivileged, poor, and middle-income communities may find ways to improve their children’s chances of admission to a public university in Bangladesh.
Public university admission tests in Bangladesh are highly competitive. Success at a particular public university can be supported by continuously monitoring candidates using collected data. In this regard, ML models built from data collected at a particular public university can be highly effective for predicting and monitoring a candidate's educational status. The methodology of this work can serve as a template for improving students' performance in other exams such as the Graduate Record Examinations (GRE), the International English Language Testing System (IELTS), and the Secondary School Certificate (SSC). Moreover, a mobile application could be developed from this research in the future, so that students could enter their information and understand their chances of being admitted to a public university.
In this study, a framework has been developed that uses modern data mining and ML techniques to advise applicants for the university’s undergraduate admission test by providing ’risk’ warnings in advance. This article describes a model that generalizes well to both categories and is trained on a balanced dataset, i.e., one with an equal number of samples in each category, in order to identify students who are likely to fail the public university admission exam in Bangladesh. The proposed approach can be useful by notifying candidates about their educational circumstances so that they can improve their standing and reduce study gaps and depression.
The objectives of this work are given below:
• The main objective of the research is to find the reasons behind candidates' failure in the undergraduate admission test and to uncover the factors that significantly affect their rejection from a public university.
• The proposed method carries out a comprehensive investigation based on several useful metrics, where each evaluated pair consists of a resampling approach and a classification model.
• The highest and second-highest performing models are investigated: the SVM model is highly robust to the ’Allow’ class, whereas the Adaboost model is moderately robust to both categories.
The rest of this article is organized as follows. The literature review is presented in Section 2. The research methodology is given in Section 3. Section 4 presents the analysis of the results. An empirical discussion is provided in Section 5. Finally, Section 6 summarizes the findings and mentions possibilities for future work.
2. Literature review
Data mining and ML techniques have been used effectively to improve students' performance by understanding their behavior (Ifinedo, Apr. 2016), (Ramírez-Noriega, Juárez-Ramírez, and Martínez-Ramírez, Feb. 2017) and to analyze big data (Shirdastian, Laroche, and Richard, Oct. 2019), (Chowdhury et al., 2022). ML applications have been most widespread in data management and data mining classification for human performance and behavior analysis (Bruce, 1999, Agarwal, Chauhan, Kar, and Goyal, 2017, Votto, Valecha, Najafirad, and Rao, Nov. 2021, Garg, Sinha, Kar, and Mani, 2022, Mahdikhani, Apr. 2022). In particular, managing student information is crucial for predicting student performance (Al-Mamary, Nov. 2022, Al-Mamary, Nov. 2022, Miguéis, Freitas, Garcia, and Silva, Nov. 2018, Tomasevic, Gvozdenovic, and Vranes, Jan. 2020, Edwards, Apr. 2022, Asif, Merceron, Ali, and Haider, Oct. 2017). M. I. Al-Twijri and A. Y. Noaman proposed a new data mining model for higher education institutions (Al-Twijri and Noaman, 2015). Besides, (Wakelam, Jefferies, Davey, and Sun, Mar. 2020) predicted student performance using ML and data mining strategies; data from a group of 23 students were used in their study to classify the students at risk. Furthermore, (Romero et al., 2013) predicted the performance of first-year university students through online discussion. In that article, the researchers proposed data mining methods for improving the prediction of students' final performance using a combined approach that joins a clustering method to classification methods. The clustering approach was set to produce several clusters similar to the classes of their dataset, which need to be managed properly.
Recently, the field of education has benefited from AI. Different ML algorithms, such as artificial neural networks (ANN), are used to predict academic achievement or academic failure, which helps learners become more conscious of their studies (Rodríguez-Hernández, Musso, Kyndt, and Cascallar, Jan. 2021). Tomasevic, Gvozdenovic and Vranes (Jan. 2020) used a modern supervised ML approach (ANN) to predict student exam performance and produced a comparative illustration. Besides, Hoffait and Schyns (Sep. 2017) analyzed the potential difficulties of university students. Their work focused on the early detection of possible failures using student data collected during enrollment. Three data mining methods were applied: ANN, random forest (RF), and logistic regression (LR). Furthermore, (Fotouhi, Asadi, and Kattan, 2019) worked on an imbalanced dataset and performed classification using four popular classifiers: decision tree (DT), K-nearest neighbor (KNN), ANN, and RIPPER. To reduce misclassification caused by the imbalanced data distribution, both over-sampling and under-sampling were employed with different algorithms. Using a DT, (Hamsa, Indiradevi, & Kizhakkethottam, 2016) suggested an approach for predicting admission scores. Accuracy, mean squared error (MSE), and correlation coefficient (CC) were used to test and compare their regression models.
(Cardona and Cudney, 2019) developed an SVM-based model for estimating student performance. However, the data distribution of their output variable shows a mismatch between categories, known as the class imbalance problem. As a result, the approach could not be regarded as stable and convenient, given its unsatisfactory performance. The researchers used precision, recall, and accuracy measurements for analysis and showed that SVM with an RBF kernel provides better performance. In another research article, four types of mathematical models were compared: SVM, multiple LR, a multilayer perceptron network, and a radial basis function network. Eight input variables were used to predict the final exam score, and 2907 data samples were obtained from 323 undergraduates over four semesters (Huang and Fang, Feb. 2013). In addition, (Chui, Fung, Lytras and Lam, Jun. 2020) proposed a modified SVM-based mechanism, named RTV-SVM, that reduces the number of support vectors and the training time by removing redundant training vectors. The researchers used a large amount of information from university students to assess the proposed model. (Costa et al., Aug. 2017) evaluated various data mining prediction techniques to find students who were likely to fail programming courses. The proposed methodology was applied to two different datasets on programming courses from a Brazilian public university. After analyzing the datasets, SVM was found to be the most effective model.
A Gradient Boosting Machine (GBM) model was suggested by (Fernandes et al., Jan. 2019) to forecast academic results. By examining demographic parameters, the researchers found that ’neighborhood’, ’school’, and ’age’ are possible indicators of a student’s academic success or failure. In addition, various algorithms were investigated, and overall performance was evaluated on various assessment criteria, including accuracy, precision, recall, the convergence speed of the optimal solutions, F-1 measure, and computational time (Edwards, Apr. 2022), (Batra, Jain, Tikkiwal, and Chakraborty, Apr. 2021, Garg, Kiwelekar, Netak, and Ghodake, Apr. 2021, Koch, Plattfaut, and Kregel, Nov. 2021, Tandon, Revankar, Palivela, and Parihar, Nov. 2021, Ensafi, Amin, Zhang, and Shah, Apr. 2022). After assessing multiple studies, it was found that, in order to make precise predictions about students' performance, the data collected from students need to be managed carefully.
3. Research methodology
Domain-specific features were first constructed as the basis for data collection. Data were then collected from the students of a public university, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, Bangladesh, and from the students of different national universities, namely Govt. Bangabandhu College, Gopalganj, Bangladesh, and Government Brajalal College, Khulna, Bangladesh. Two primary data collection techniques were followed, the Interviewing Method and the Email Questionnaire Method, and the responses were then transformed into comma-separated values (CSV) format for use in computerized applications. Questionnaires were prepared by observing facts from the papers (Fernandes et al., Jan. 2019), (Grant, Huang, and Pasfield-Neofitou, 2014), (Hussain, Dahan, Ba-Alwib, and Ribata, Feb. 2018, Akanda, 2019, Cortez and Silva, 2022). The datasets were imported into Google Colaboratory, and several data preparation steps were applied to better fit the ML models. The datasets were checked for class balance, and up-to-date statistical techniques were applied to address the imbalance. Four resolved datasets, each balanced across both classes, were prepared for the subsequent steps of this research. The validation subsamples were obtained with stratified k-fold cross-validation, configuring the value of k, to estimate the proficiency of the models on the datasets.
Furthermore, a comparative investigation was conducted to evaluate the efficiency of each model on each dataset using several model evaluation metrics. The best-performing models and their corresponding statistical techniques are reported as reasonable solutions; this step can yield one or more models. The proposed research aimed to combine the strengths of the best-performing models obtained from this step. Fig. 1 illustrates the proposed methodology, which is developed from the methods of the selected research works (v Chawla, Bowyer, Hall, and Kegelmeyer, 2002, Han, Wang, and Mao, 2005, Nguyen, Cooper, and Kamei, 2022).
In a binary classification problem, the learning process of most classification paradigms is often biased toward the majority class examples, so classification errors for the minority class can be high (Lamari et al., 2021). To overcome this issue, several under-sampling and over-sampling methods (Kirshners, Parshutin, and Gorskis, Dec. 2016) were used to construct four different datasets, all of which are used to build predictive models. For this analysis, six popular model-based approaches (Fernandes et al., Jan. 2019) have been applied to the supervised classification task: LR, KNN, SVM, ANN, and two widely used ensemble ML methods, RF and Adaboost. Table 1 describes the algorithms used in this research work.
3.1. Data collection
Each dataset variable was established by examining the characteristics and significance of the features reported in previous research (Fernandes et al., Jan. 2019), (Grant, Huang, and Pasfield-Neofitou, 2014), (Cortez and Silva, 2022), (Hussain, Dahan, Ba-Alwib, and Ribata, Feb. 2018). In this context, nine input variables serve as possible prediction factors and one output variable is used, as listed in Table 2. Input features were chosen so that they have a sufficient influence on the output variable. The PreExR variable records the result of the preparatory exam taken directly before the admission test. The other input features are descriptive: residential place, relationship status, family status, time wasted each day (using social networking, playing sports, watching movies, gossiping with friends), study duration per day for exam preparation, etc., as shown in Table 2. The dataset’s output component has two classes, ’Allow’ and ’Not-Allow’, given as feature id 10. Hence, the features with ids 1 to 9 are used as inputs to the predictive ML models. The values of features 1 to 9 are based on information covering the four months before the admission test.
In 2018, around ten thousand students took the university’s admission test for the Life-Science faculty (Walid, Masum Ahmed, and Sadique, Nov. 2020). The success rate was close to 2%. Simple random sampling without replacement was adopted to draw the sample from the population. Simple random sampling is a way of selecting $n$ elements from a population of $N$ elements such that each combination of $n$ elements has the same chance of being chosen as any other (Mitra and Pathak, Dec. 2007). The estimated sample size ($n_0$) was calculated from the following formulas with a 90% confidence level (Hernández-Sayago et al., Mar. 2013).
$$n_0 = \frac{z^2 p q}{d^2} \qquad (1)$$
$$n = \frac{n_0}{1 + \frac{n_0}{N}} \qquad (2)$$
$n_0$ = estimated sample size; $z$ = statistical certainty chosen (1.64 for 10% level of significance); $p$ = estimated prevalence (0.5 if unknown); $q = 1 - p$; $d$ = precision desired (usually 0.05); $n$ = desired sample size; $N$ = population size.
The required sample size was calculated as 266. More precisely, 266 or more measurements/surveys were required to be 90% confident that the real value lies within ±5% of the measured/surveyed value, i.e., the margin of error was 5%. To reduce the margin of error, 343 samples were used in the final dataset, which lowers the margin of error to 4.38%; that is, there is a 90% chance that the real value lies within ±4.38% of the measured/surveyed value. The samples were collected using two primary data collection techniques: Interviewing and Email Questionnaire.
Accordingly, a realistic dataset was developed for this work. The dataset includes 343 samples, of which 185 belong to the ’Allow’ category and the remaining 158 to the ’Not-Allow’ category. The ’Allow’ group consists of students who passed the admission test. The ’Not-Allow’ samples were obtained from students who participated in but did not pass the admission test. Fig. 2 shows the initial data distribution of both categories. Samples of the ’Allow’ type were gathered from first-year students of the Life-Science faculty of a public university in Bangladesh (Walid, Masum Ahmed, and Sadique, Nov. 2020). Samples of the other type were obtained from outside the university, from students who sat the Life-Science faculty test but could not pass it and therefore could not obtain the almost fully government-funded scholarship. However, an unequal distribution of samples between the categories was observed, indicating class imbalance. The presence of imbalance in datasets leads most standard classifier learning algorithms, such as KNN, DT, and Back-Propagation Neural Networks, to favor the majority class (Rout, Mishra, and Mallick, 2018).
Fig. 1. Proposed Methodology.
Fig. 2. Initial Data Distribution for Target Value.
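The sample-size calculation in Eqs. (1) and (2) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and the population size N = 10,000 is a round figure taken from "around ten thousand", so the output need not match the reported 266 and 4.38% exactly, since those depend on the exact N and rounding used.

```python
import math


def cochran_sample_size(z: float, p: float, d: float) -> float:
    """Eq. (1): n0 = z^2 * p * q / d^2, with q = 1 - p."""
    q = 1.0 - p
    return (z ** 2) * p * q / (d ** 2)


def finite_population_correction(n0: float, population: int) -> float:
    """Eq. (2): n = n0 / (1 + n0 / N)."""
    return n0 / (1.0 + n0 / population)


def margin_of_error(z: float, p: float, n: int) -> float:
    """Solve Eq. (1) for d at a given sample size (no finite-population term)."""
    return z * math.sqrt(p * (1.0 - p) / n)


# z, p and d follow the definitions given with Eqs. (1)-(2).
# N = 10_000 is an assumed round figure, not the exact population size.
n0 = cochran_sample_size(z=1.64, p=0.5, d=0.05)
n = finite_population_correction(n0, population=10_000)
moe = margin_of_error(z=1.64, p=0.5, n=343)

print(f"n0 = {n0:.2f}, corrected n = {n:.2f}")
print(f"margin of error with 343 samples = {moe:.4f}")
```

Increasing the sample beyond the minimum shrinks the margin of error only with the square root of n, which is why moving from the required size to 343 samples changes it by well under one percentage point.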
Table 1
Description of Algorithms
Name of the Algorithm | Description
Logistic Regression (LR) | LR is a statistical method analogous to linear regression, except that LR finds an equation that predicts the outcome of a binary variable from one or more predictor variables. The predictor variables can be categorical or continuous, as the model does not require continuous data. Furthermore, LR assumes that the variables are independent. One disadvantage of LR is that it cannot yield typicality probabilities (DiGangi and Hefner, 2013, Du, Liu, Yu, and Yan, Jun. 2017, Bujang, Sa’At, Tg Abu Bakar Sidik, and Lim, Aug. 2018).
Support Vector Machine (SVM) | SVM is an effective ML-based data mining technique used for prediction. SVM (Shirdastian, Laroche, and Richard, Oct. 2019), (Lashkarashvili and Tsintsadze, Apr. 2022) is one of the most effective techniques and is used extensively in several fields of data classification, such as bioinformatics, fault detection, and vehicle power management. Furthermore, with advances in computing science and intelligent technology, intelligent learning and recognition capabilities have become well developed for capturing complex nonlinear interactions between meteorological elements in real time and space. In addition, SVM offers important advantages in solving pattern recognition problems with limited samples, nonlinearity, and high dimensionality. SVM is a classification method based on statistical learning theory and the Vapnik-Chervonenkis (VC) dimension (Du, Liu, Yu, and Yan, Jun. 2017).
Artificial Neural Network (ANN) | ANN is a learning algorithm inspired by the human brain’s neural network (Shin et al., Sep. 2019). ANN can model complex, nonlinear relationships between the independent and dependent variables of a system (Khandelwal et al., Apr. 2018), (Zeyad and Hossain, Dec. 2021). In an ANN, neurons are interconnected, and each connection has a numerical weight. Each layer comprises a number of artificial neurons and an activation mechanism, and a training mechanism is also defined. Feed-forward means that the connections between layers are always directed from lower to upper layers (Shin et al., Sep. 2019). ANNs are typically trained with optimization algorithms to accomplish learning and obtain optimal results. The error is determined by comparing the actual results with the predicted ones (Khandelwal et al., Apr. 2018).
Random Forest (RF) | RF is an ensemble prediction method consisting of a set of DTs fitted with bagging and random variable selection. Tree construction in RF is similar to CART and is carried out by recursive partitioning. In recursive partitioning, the precise cut-point location and the choice of splitting variable depend heavily on the distribution of observations in the learning sample. RF overcomes CART’s instability by predicting with a set of trees rather than a single tree. Because a CART tree is an unbiased but unstable predictor that gives the correct prediction on average, combining highly diverse trees substantially reduces the instability of individual trees. Once all trees are grown, RF combines their individual predictions to smooth out the effect of the training data and make its predictions consistent (Wang et al., Jul. 2018).
AdaBoost algorithm | Many algorithms have been derived from AdaBoost; most of them address classification, and the rest address regression. AdaBoost is an iterative algorithm that combines weak learners sequentially and adapts the overall learning process according to the errors made by the weak learners. The core idea of AdaBoost is to merge the weak learners produced in each iteration into a strong learner (Xiao, Dong, and Dong, Mar. 2018).
K-Nearest Neighbor (KNN) | KNN is a non-parametric, instance-based, or lazy learning method (Anubhove et al., 2022). It is considered one of the easiest ML techniques to understand. An unknown data point is classified according to the classes of its nearest neighbors. A single number k specifies how many nearest neighbors are considered and thus determines the class assigned to the unknown data point. It is often beneficial to choose an odd k to prevent tied votes. If k equals 1, the class of a sample is determined by its single closest neighbor. The use of more than one nearest neighbor to classify a data point is the reason for the name KNN. The algorithm is memory-based, since the data points must be held in memory at runtime (Amra and Maghari, Oct. 2017).
Table 2
Attributes and their Possible Values
Id | Feature Explanation | Feature Name | Possible Values
1 | Previously participated exam result | PreExR | Low, Medium, Good, Excellent
2 | The educational and economical condition of the family | Family_Situation | (a) Educated (b) Uneducated (c) Unemployed (d) Employed: (a)+(c)/(a)+(d)/(b)+(d)/(b)+(d)
3 | Living area during the exam | Living_Area | Village/Town
4 | Spending time per day on social media or game playing | Misspend_Time | 0-1hr/1-3hr/3-5hr/More than 5hr.
5 | Living under family observance or not | Living_Status | With family/Without family
6 | Wasting time by doing relationship Or, Wasting time by being frustrated in a relationship | Relation_Status | Yes/No
7 | Wasting time on political jobs | Political_Engagement | Yes/No
8 | The average duration of study in a single day | Study_Duration | 0-1hr/1-3hr/3-5hr/More than 5hr.
9 | Deforming thought by any addiction | Drug_Addiction | Yes/No
10 | Status of achieving success | Output (Target variable) | Allow (1)/ Not-Allow (0)
3.2. Data analysis
The ML approach demands quality data and preprocessing for better accuracy. Preprocessing steps such as handling missing values, feature encoding, data normalization, and standardization are applied to the data to improve its quality and make it easier for the algorithms to learn from. At this stage, the input dataset is scaled so that the value of each input variable lies within the range (0, 1). The output variable is transformed to a numeric value by label encoding: the "Allow" label is converted to 1 and the "Not-Allow" label to 0.
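The scaling and label-encoding steps of the data analysis stage can be illustrated with a small standard-library sketch. The ordinal ordering of the `Study_Duration` levels, the helper names, and the four example rows are assumptions made for illustration; the article does not specify how each categorical attribute was encoded.

```python
# Assumed ordinal order for one attribute from Table 2 (illustrative only).
STUDY_LEVELS = ["0-1hr", "1-3hr", "3-5hr", "More than 5hr."]
# Label encoding of the target as described in the text: Allow -> 1, Not-Allow -> 0.
LABELS = {"Allow": 1, "Not-Allow": 0}


def ordinal_encode(value, levels):
    """Map a categorical level to its position in an ordered list of levels."""
    return levels.index(value)


def min_max_scale(column):
    """Rescale numeric values linearly so the smallest is 0 and the largest is 1."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: nothing to scale
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Made-up example rows, not taken from the study's dataset.
rows = [
    {"Study_Duration": "0-1hr", "Output": "Not-Allow"},
    {"Study_Duration": "3-5hr", "Output": "Allow"},
    {"Study_Duration": "More than 5hr.", "Output": "Allow"},
    {"Study_Duration": "1-3hr", "Output": "Not-Allow"},
]

encoded = [ordinal_encode(r["Study_Duration"], STUDY_LEVELS) for r in rows]
scaled = min_max_scale(encoded)
targets = [LABELS[r["Output"]] for r in rows]

print(scaled)
print(targets)
```

In a cross-validated setting, the scaling bounds would normally be computed on the training folds only and then reused on the validation fold, to avoid leaking information from held-out data.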
Table 3
Distribution of the Classes respectively generated from Under-Sampling and Over-Sampling Approaches
Dataset | Technique | Class Distribution: Allow | Class Distribution: Not-Allow
Original Set | NA | 185 | 158
DA | Join T-Link | 158 | 158
DB | B-Smote | 185 | 185
DC | S-Smote | 185 | 185
DD | ENN-Smote | 158 | 158
Table 4
Performance Comparison (AUC) on different Datasets generated from Resampling approaches for Life-science faculty admission
Algorithm | AUC (DA) | AUC (DB) | AUC (DC) | AUC (DD)
Logistic Regression | 0.65 | 0.74 | 0.65 | 0.91
SVM | 0.69 | 0.74 | 0.70 | 0.96
ANN | 0.63 | 0.72 | 0.67 | 0.90
Random Forest | 0.64 | 0.64 | 0.69 | 0.96
Adaboost | 0.70 | 0.70 | 0.71 | 0.95
KNN | 0.67 | 0.64 | 0.70 | 0.93
4. Results
Fig. 2 makes the data distribution of each group clear, and the imbalanced class distribution is readily observed. Since this research focuses on accurately predicting both the majority and minority groups, attempts have been made to resolve the imbalance. Under-sampling and over-sampling methods were adopted to produce four resolved datasets from the original set, namely dataset ’A’ (DA), dataset ’B’ (DB), dataset ’C’ (DC), and dataset ’D’ (DD), each with an equal distribution of both groups. The class distribution of each resolved dataset is shown in Table 3.
To produce DA, both Tomek links (AT, M, F and M., 2016) and random down-sampling are applied. This method is referred to as a joint T-link: T-links are first removed from the majority class, and down-sampling is then used to ensure a fair class distribution. Building on the synthetic minority oversampling technique (SMOTE) of N. V. Chawla et al. (v Chawla, Bowyer, Hall, and Kegelmeyer, 2002), this work uses the borderline SMOTE oversampling approach proposed by Hui Han et al. (Han, Wang, and Mao, 2005) to generate dataset DB, and constructs DC with the SVM-based borderline SMOTE approach proposed by Nguyen et al. (Nguyen, Cooper, and Kamei, 2022). In plain SMOTE, synthetic points are generated between the minority samples and their selected nearest neighbors, whereas in borderline SMOTE only the minority samples located at the class borderline are over-sampled. Afterward, a hybrid approach (Fotouhi, Asadi, and Kattan, 2019) that integrates ENN and borderline SVM-based SMOTE was used to produce the balanced set DD. Borderline SVM-based SMOTE was chosen for this combined approach after observing the performance of the models on datasets DB and DC. Finally, a comparative analysis assesses the performance of the models trained on the four datasets.
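The interpolation step behind plain SMOTE can be sketched as follows. This is a minimal illustration using only the standard library, not the implementation used in the study: the helper names, the toy points, and the choice of k are assumptions. Borderline variants differ mainly in restricting the base points to minority samples near the class boundary, which is not shown here.

```python
import math
import random


def nearest_neighbors(point, others, k):
    """Return the k points in `others` closest to `point` (Euclidean distance)."""
    return sorted(others, key=lambda other: math.dist(point, other))[:k]


def smote_sample(minority, k=2, rng=None):
    """Create one synthetic minority point by interpolating toward a neighbor."""
    rng = rng or random.Random()
    base = rng.choice(minority)
    candidates = [p for p in minority if p is not base]
    neighbor = rng.choice(nearest_neighbors(base, candidates, k))
    gap = rng.random()  # interpolation factor in [0, 1)
    return tuple(b + gap * (n - b) for b, n in zip(base, neighbor))


# Hypothetical two-feature minority samples, for illustration only.
minority_points = [(1.0, 1.0), (1.5, 1.2), (2.0, 0.8), (1.2, 1.9)]
synthetic = [
    smote_sample(minority_points, k=2, rng=random.Random(seed)) for seed in range(5)
]
```

Each synthetic point lies on the segment between a minority sample and one of its nearest minority neighbors, so it always stays inside the region already occupied by the minority class.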
4.1. Implementation details
Six supervised algorithms, Adaboost, LR, RF, KNN, SVM, and ANN, were used for the classification task in this research. For RF, different numbers of DTs were evaluated. To select the number of DTs that achieves the best performance under majority voting, an iterative approach evaluated and compared the performance for every number of trees from 5 to 1000. A similar iterative approach was used for KNN to select the optimum value of ’k’, the number of nearest neighbors, from the range two to fifty. A grid search algorithm (Syarif, Prugel-Bennett, and Wills, Dec. 2016) was adopted to optimize the SVM parameters, and the values of ’C’, ’𝜆’, and ’kernel’ were tuned.
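An exhaustive grid search of the kind described above can be sketched in a few lines. The grid values, the key name `lam` (standing in for the ’𝜆’ of the text), and the placeholder scoring function are assumptions for illustration; in practice the score would be a cross-validated metric of a fitted SVM.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:  # ties keep the first combination found
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    """Placeholder for a cross-validated score; it peaks at C=10, lam=0.1."""
    return -abs(params["C"] - 10) - abs(params["lam"] - 0.1)


# Hypothetical search grid, not the one used in the study.
svm_grid = {"C": [0.1, 1, 10, 100], "lam": [0.01, 0.1, 1], "kernel": ["rbf", "linear"]}
best_params, best_score = grid_search(svm_grid, toy_score)
```

The same loop covers the iterative searches for the number of trees or for ’k’: a grid with a single key, such as a range of candidate values, reduces it to a one-dimensional sweep.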
The ANN, on the other hand, consists of one input layer, two hidden layers, and one output layer. Each hidden layer uses the ReLU activation function. Disposing units in the hidden layers optimizes the ANN. The adaptive moment estimation technique, known as the Adam optimizer, drives the gradient descent optimization of the ANN to decrease the loss function. A practical advantage of the Adam optimizer is that it adapts the step size for each parameter, so the learning rate requires little manual tuning. For the ensemble boosting classifier Adaboost, the number of weak learners and the learning rate were optimized.
4.1.1. Performance measure
The area under the ROC curve (AUC), Precision, Recall, and F-Measure were chosen as the metrics to estimate each model’s efficiency. An AUC value far from 1 implies that the model has a low capacity to separate the classes, whereas a value close to 1 implies a high capacity. Sensitivity (recall) gives the true positive (TP) rate using (3) and centers on decreasing the number of false negatives (FN), whereas precision focuses on limiting the number of false positives (FP) using (4). The sensitivity value becomes low when the FN number increases. Using equation (5), the F1-Score assesses the model’s suitability as the harmonic mean of precision and recall (Alyahyan and Düştegör, 2020).
3.3. Robustness checks (stratified K-fold cross-validation)
Cross-validation is one of the best approaches for assessing a model’s ability to generalize to new data. This analysis uses stratified k-fold cross-validation with five folds to select the best model. The dataset is split into five segments (folds), and each segment contains approximately the same proportions of class labels. In the first stage, the first segment (SEG1) is used as the testing set and the rest of the data as the training set; the model is fitted on that training set, and its performance score on SEG1 is stored. The same is done in the next stage with segment no. 2, and the SEG2 performance is recorded. This process is repeated five times, yielding five performance scores (SEG1, SEG2, SEG3, SEG4, SEG5), and their average, AVERAGE(SEG1, SEG2, SEG3, SEG4, SEG5), is defined as the model’s final performance. In addition, in each iteration of the cross-validation the testing segment is formed so that it excludes the samples generated by the resampling techniques.
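The fold construction described above can be sketched as follows. This is an illustrative standard-library version under stated assumptions: the round-robin assignment, the `is_synthetic` flag, and the toy data are hypothetical, and the study may have formed its folds differently.

```python
from collections import defaultdict


def stratified_folds(labels, is_synthetic, n_folds=5):
    """Build (train_indices, test_indices) pairs with stratified test folds.

    Only real samples are placed in test folds; synthetic samples are
    appended to the training side of every split.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        if not is_synthetic[index]:
            by_class[label].append(index)

    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)  # round-robin per class

    synthetic = [i for i, flag in enumerate(is_synthetic) if flag]
    splits = []
    for k in range(n_folds):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        splits.append((train + synthetic, test))
    return splits


def mean_score(fold_scores):
    """Average the per-fold scores (SEG1 ... SEG5 in the text)."""
    return sum(fold_scores) / len(fold_scores)


# Hypothetical toy data: 10 real "Allow", 5 real and 5 synthetic "Not-Allow".
labels = ["Allow"] * 10 + ["Not-Allow"] * 10
is_synthetic = [False] * 15 + [True] * 5
splits = stratified_folds(labels, is_synthetic, n_folds=5)
```

With this toy input, every test fold holds two real ’Allow’ and one real ’Not-Allow’ sample, which mirrors the 2:1 ratio of the real data.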
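$$\mathrm{Sensitivity} = \mathrm{Recall} = \frac{TP}{TP + FN} \tag{3}$$
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{4}$$
$$F1\ \mathrm{Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{5}$$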
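Equations (3) to (5) translate directly into code. The sketch below uses hypothetical confusion counts for illustration, and then recomputes the F1-Score from the precision and recall reported for Logistic Regression on dataset DD in Table 5 as a consistency check.

```python
def recall(tp, fn):
    """Equation (3): TP / (TP + FN)."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Equation (4): TP / (TP + FP)."""
    return tp / (tp + fp)


def f1_score(p, r):
    """Equation (5): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


# Hypothetical confusion counts, for illustration only.
tp, fp, fn = 45, 5, 15
p = precision(tp, fp)  # 45 / 50
r = recall(tp, fn)     # 45 / 60
f1 = f1_score(p, r)

# Precision and recall of Logistic Regression on dataset DD (Table 5).
f1_lr_dd = round(f1_score(0.8913, 0.8790), 4)
```

The recomputed value rounds to 0.8851, which matches the F1-Score listed for that row.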
4.2. Result analysis
Tables 4 to 8 report the mean outcomes of the K-fold cross-validation for the four datasets. Resampling
Table 5
Precision, Recall, F-Measure for all Algorithms on Dataset DD

Algorithm            Precision  Recall  F1-Score
Logistic Regression  0.8913     0.8790  0.8851
SVM                  0.9638     0.9871  0.9754
ANN                  0.9374     0.8923  0.9143
Random Forest        0.9729     0.9361  0.9542
Adaboost             0.9173     0.9289  0.9231
KNN                  0.9778     0.9489  0.9630
Table 6
Precision, Recall, F-Measure for all Algorithms on Dataset DA

Algorithm            Precision  Recall  F1-Score
Logistic Regression  0.6346     0.6133  0.6238
SVM                  0.6605     0.7084  0.6836
ANN                  0.53       0.65    0.58
Random Forest        0.5956     0.6256  0.6102
Adaboost             0.6659     0.7272  0.6952
KNN                  0.5907     0.6952  0.6387
Fig. 3. ROC Curve for Each Fold and Mean ROC from SVM Model.
methods were evaluated to identify the most suitable and feasible models by examining the performance of the models derived from the four datasets. The AUC performance of the models trained on all the datasets is given in Table 4, whereas Precision, Recall, and F-Measure are given in Tables 5 to 8. The approach compares the efficacy of datasets DA, DB, DC, and DD, constructed with state-of-the-art resampling techniques, in order to select the resampling technique. The term "dataset’s efficacy" indicates how good the prediction results obtained from that set are. The AUC scores in Table 4 show that dataset DD (AUC_DD) gives much higher values than the models generated from the remaining datasets. Therefore, dataset DD, obtained from the ENN and borderline SVM-based SMOTE over-sampling technique, is the most feasible in this case.
The next step was to identify the best-performing model. SVM and RF showed outstanding and similar AUC scores for dataset DD. Table 5 shows that SVM outperforms RF because of its higher recall and its satisfactory weighted average of precision and recall. In contrast, KNN was most effective at limiting false positive predictions and achieved the highest precision, but its balance between recall and precision was not as good as that of SVM. Therefore, the resampling technique that combines the ENN and SMOTE approaches, which produced dataset DD, together with the SVM model trained on that dataset, is declared the successful approach for this study. Fig. 3 plots the mean ROC curve of the best-performing SVM model together with the ROC curves of all cross-validation folds. Across the folds, the highest AUC score is 0.98. For the SVM model, the optimized parameters were ’C = 10’, ’𝜆 = 0.1’, and ’kernel = rbf’. The iterative process selected 353 estimators as the optimum, as this number showed the most prominent performance.
The analysis did not stop at the best method; the second most effective resampling method was also sought. In this context, the AUC scores obtained from datasets DA and DB in Table 4 show that Adaboost gives the highest AUC, whereas LR shows the lowest score for dataset DA. DB gives a higher (sometimes equal) AUC score than DA for every model except KNN. The AUC disparity graph in Fig. 4 conveys the same point, as all the points on the AUC_DB line are positioned on or above the AUC_DA line except the point for KNN.
The LR and SVM models trained on dataset DB show an AUC of 0.74, which is higher than that of the other models derived from dataset DB. Still, from Table 7,

Table 7
Precision, Recall, F-Measure for all Algorithms on Dataset DB

Algorithm            Precision  Recall  F1-Score
Logistic Regression  0.6780     0.6811  0.6796
SVM                  0.6840     0.7243  0.7036
ANN                  0.63       0.70    0.67
Random Forest        0.6465     0.6649  0.6556
Adaboost             0.7015     0.7297  0.7154
KNN                  0.60       0.85    0.7053
Table 8
Precision, Recall, F-Measure for all Algorithms on Dataset DC

Algorithm            Precision  Recall  F1-Score
Logistic Regression  0.6689     0.7081  0.6879
SVM                  0.6872     0.7351  0.7104
ANN                  0.64       0.68    0.66
Random Forest        0.6729     0.7513  0.7099
Adaboost             0.7046     0.7405  0.7221
KNN                  0.66       0.7405  0.6979
the superior performance of SVM is evident, as it shows higher precision, recall, and F1-score than the LR model. On the other hand, the Adaboost model from dataset DB gives an AUC score of 0.70 along with the maximum precision and F-measure and an outstanding balance between precision and recall, which is the main reason it was selected as the most helpful model for dataset DB.
Moreover, the performance of the models trained on dataset DC in Table 4 and Table 8 shows that Adaboost gives the highest AUC, as it is the peak point of the AUC_DC line in Fig. 4, and also shows the maximal precision and F-measure (the harmonic mean of precision and recall) while keeping an outstanding balance between precision and recall. Notably, the precision, recall, and F-measure of the Adaboost model trained on dataset DC, shown in Table 8, are higher than those of the Adaboost model from dataset DB for the same metrics. This comparison is condensed into Fig. 5, where the Adaboost_DC point, which refers to the Adaboost model trained on dataset DC, takes the peak position on each line graph. For this reason, the borderline SVM-based SMOTE over-sampling technique was found to be the second most effective resampling technique in this research, and Adaboost the most effective classifier based on its performance on dataset DC prepared by this approach.
Fig. 4. AUC disparity graph for datasets DA, DB, and DC.
Fig. 5. Graphical Analogy based on precision, recall, and F measure.
Fig. 6 shows the AUC score for each fold and the mean AUC of the Adaboost model. Fold-1 and fold-3 give the highest score of 0.74. The mean AUC score is marked in the same plot with a deep red vertical line. The AUC value of each fold lies close to this line, indicating a consistency across folds that was not observed for the SVM model prepared from the DB dataset.
4.2.1. Feature involvement exploration
To determine the significance of each feature, the mean decrease impurity (MDI) feature importance of RF was calculated by counting the times a feature is used to split a node, weighted by the number of samples it splits (Alyahyan and Düştegör, 2020). G. Louppe et al. (Louppe, Wehenkel, Sutera, and Geurts, 2022) also indicated that the MDI value is equal to zero for a completely irrelevant feature. Fig. 7 shows the feature importances obtained from the RF classifier. All the features have a relative significance value, however small, which suggests that the dataset of this study does not contain an insignificant variable. Moreover, the attribute PreExR has the highest significance, and the variables Misspend_Time, Family_Situation, and Study_Duration also have high importance values, which suggests that these variables are strong predictors and the most promising indicators for predicting the Life-Science faculty admission test result.
The feature significance obtained from the Adaboost model is shown in Fig. 8. Since a DT was set as the base classifier for the Adaboost model, the significance of each feature was calculated from its significance in each base classifier of the ensemble. From Fig. 8, Living_Status and Family_Situation were highly significant variables for the Adaboost model.
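The quantity that MDI accumulates for a feature is the weighted impurity decrease of each split that uses it, summed over the nodes of a tree and averaged over the trees. The sketch below computes that decrease for a single split with Gini impurity; the toy labels and function names are assumptions for illustration, and the study’s RF implementation is not shown in the text.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right, n_total):
    """Weighted impurity decrease credited to the feature used for this split."""
    n = len(parent)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return (n / n_total) * (gini(parent) - children)


# Hypothetical root node: 4 "Allow" and 4 "Not-Allow" samples.
parent = ["Allow"] * 4 + ["Not-Allow"] * 4

# An informative split separates the classes fairly well.
left = ["Allow"] * 3 + ["Not-Allow"] * 1
right = ["Allow"] * 1 + ["Not-Allow"] * 3
decrease = impurity_decrease(parent, left, right, n_total=len(parent))

# A split on an irrelevant feature leaves the class mix unchanged in both children.
useless = impurity_decrease(
    parent,
    ["Allow", "Not-Allow"],
    ["Allow"] * 3 + ["Not-Allow"] * 3,
    n_total=len(parent),
)
```

The informative split yields a decrease of 0.125, while the uninformative one yields zero, which is consistent with an irrelevant feature receiving no importance.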
Fig. 6. AUC scores for Each Fold and Mean AUC from Adaboost Model Prepared from Dataset DC.
Fig. 7. Feature Importance based on Random Forest Classifier.
Fig. 8. Feature Importance based on Adaboost Classifier.
Table 9
Distribution of the Classes generated from Under-Sampling and Over-Sampling approaches

             Class Distribution
Technique    Allow  Not-Allow
NA           185    158
ENN          158    47
ENN-Smote    158    158
5. Discussion
According to the investigation, the models generated from dataset DD achieve the most prominent outcomes. Dataset DD was constructed by the resampling approach that combines ENN and borderline SVM-based SMOTE. In the edited nearest-neighbor (ENN) under-sampling method, samples that are misclassified by their nearest neighbors are removed (Fotouhi, Asadi, and Kattan, 2019). The approach excludes noisy and borderline examples (Beckmann, Ebecken, and Pires de Lima, 2015), (Berka and Marek, Sep. 2021). ENN is often used to exclude samples from all categories (Fotouhi, Asadi, and Kattan, 2019), and in this study it was likewise used to eliminate problematic samples of both categories. ENN was applied to obtain a smoother decision surface, and synthetic samples were then generated for the minority class by the SMOTE over-sampling method so that both classes became equally balanced.
Table 9 allows a more in-depth observation. It shows that the points of both classes are reduced when the ENN down-sampling approach is applied to the original set, but the samples of the negative (or ’Not-Allow’) class decrease substantially more. Almost 70.25% of the real observations from the negative class are eliminated, whereas only 14.59% of the real observations from the positive (or ’Allow’) class are discarded. Afterward, the SMOTE technique is applied to the ENN output, resulting in an equally balanced dataset with 158 samples in each class. A model trained on only a small number of real observations from the negative class may be restricted in how well it performs on real cases of that class, even though it offers high sensitivity; the procedure may narrow what the model learns about the negative class. Table 5 shows that both SVM and KNN achieve high precision, with KNN the highest. High precision means a lower tendency to predict the negative class as positive, that is, a limited number of false positives (FP). Hence, observations predicted as positive by the SVM or KNN model are more likely to truly belong to the positive class. Therefore, the model (SVM or KNN) will be the model of choice for classifying observations that belong to the positive class.
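The percentages quoted above follow from the class counts in Table 9. The short sketch below reproduces the arithmetic; the function and variable names are chosen here for illustration.

```python
def removed_share(before, after):
    """Percentage of samples removed when a class shrinks from `before` to `after`."""
    return 100.0 * (before - after) / before


# Class counts as listed in Table 9 (original set vs. after ENN).
original = {"Allow": 185, "Not-Allow": 158}
after_enn = {"Allow": 158, "Not-Allow": 47}

removed = {
    name: round(removed_share(original[name], after_enn[name]), 2)
    for name in original
}

# Synthetic 'Not-Allow' samples needed to reach 158 samples per class.
synthetic_needed = after_enn["Allow"] - after_enn["Not-Allow"]
```

The result is 14.59% for ’Allow’ and 70.25% for ’Not-Allow’, and 111 synthetic ’Not-Allow’ samples are needed to restore the balance, so only 47 of the 158 negative samples in the balanced set are real observations.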
In contrast, dataset DC, obtained from the borderline SVM-based SMOTE up-sampling method, only adds synthetic samples to the negative class, which ensures that no real observations of either class are lost. Moreover, the Adaboost model trained on DC shows the highest F-measure and precision together with high sensitivity, which indicates the model’s robustness for both classes. Therefore, this model can be the model of choice for classifying observations that belong to the negative class. Fig. 9 illustrates a mechanism that uses the strengths of both models, taking advantage of each model’s robustness to the positive or negative class to obtain the maximum benefit.
Fig. 9. The optimal approach for binding strength of highly redacting models together for final prediction.
5.1. Contributions to literature
A slightly imbalanced dataset was handled, which is one of the distinctive aspects of this study. Several resampling approaches were utilized, which is another. Moreover, three algorithms (Son and Fujita, 2019), four algorithms (Helal et al., 2018), and five algorithms (Abu Zohair, 2019) were used in similar kinds of studies. From that perspective, the six algorithms used in this work are a contribution to the literature. The dataset used in this research was collected by the researchers of this project. A comparative analysis of the results generated by different ML models from the Bangladesh perspective is quite new. With this analysis, the primary objective of this research is accomplished. The method can be useful in several sectors where data are in tabular form and slightly imbalanced. The methodology above also yields the best model and resampling technique. Finally, the optimal approach that binds the strengths of highly redacting models can be used to improve students’ performance by providing proper suggestions to individuals based on an analysis of their data.
5.2. Implications for practice
The proposed method can be significant for improving the performance of students in any educational sector, from primary level to university. The weights of the trained model can be reused for retraining with new data, like pre-trained models, so that the strength of the previous model is used effectively to obtain accurate predictions. Moreover, in the future, a mobile application can be developed with the help of this research to accomplish the main objective of this study and address the research gap effectively. By entering some answers in this application, students can understand their probability of being admitted to a public university or any other institution (where a competitive exam is the only way of admission). Parents, family members, friends, and physicians will be able to assist pupils in improving their mental health if this study is further developed.
6. Conclusion
The goal of the article is to raise awareness among students and to assist parents, according to their means, in taking prompt action, so as to minimize the failure rate by providing the result of the university admission test early using ML predictive models. A realistic dataset was prepared to conduct this research work. Because ML models function well on balanced data, four balanced datasets were produced by four resampling processes: the joint T-link, two separate SMOTE methods, and a combination of the ENN and SMOTE approaches. Six separate ML models were prepared for each dataset. A comparative analysis was carried out among the models, assessing them with the most valuable metrics (AUC, Precision, Recall, F-Measure) and choosing the ideal resampling technique for the dataset based on this performance. Thus, this research work presents pairs in decreasing order of effectiveness according to its analysis, where each pair comprises a resampling approach and a classification model. The most effective pair consists of the SVM model and the resampling strategy combining ENN and borderline SVM-based SMOTE. The second most effective pair consists of the ensemble boosting model Adaboost and the borderline SVM-based SMOTE oversampling method. However, the combined resampling approach removes many real observations from the negative category through ENN and generates numerous synthetic data for the same category. Consequently, a model built on the combined resampling approach may learn poorly from real observations of the negative category. To bypass this problem, combining the Adaboost model prepared by borderline SVM-based SMOTE with the SVM model produced by the combined resampling approach is suggested. In this way, the study provides a method for exploiting the best characteristics of the deployed models simultaneously. The study also includes a clear overview of the significant features, which is crucial for building an effective model for future use.
Moreover, on one side, the proposed model is employed to predict the candidates’ success or failure, and on the other side, statistical analysis delivers advice on the factors to be improved for admission. The exploratory data analysis shows that the majority of the candidates who spend more than 5 hours per day on social media or games while studying only 1-3 hours per day have not succeeded in achieving their goal. Candidates living in urban areas while preparing for admission tests are more successful than students living in villages. Suggestions were provided to each candidate who was predicted to fail. A suggestion includes the factors that should be improved and how much improvement is required. These decisions are made by comparing the value of a significant factor with the average value of that factor among the candidates belonging to the positive class. The authors therefore assume that, with an enlarged dataset, the proposed methodology can be used effectively to predict the outcome of a bachelor’s admission test at universities worldwide. Other new educational data mining and ML methods will be included in the overall analysis and comparison as part of future work. Further extension of this research will help parents, family, friends, and doctors to help students improve their mental health.
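The comparison with the positive-class average that underlies the suggestions can be sketched as follows. The numeric encoding of the features, the example values, and the function name are assumptions for illustration only; the text does not specify how the factors are encoded or how the gap is turned into advice.

```python
def improvement_gaps(candidate, positive_class_rows, features):
    """Difference between the positive-class mean and the candidate, per feature."""
    gaps = {}
    for name in features:
        values = [row[name] for row in positive_class_rows]
        gaps[name] = sum(values) / len(values) - candidate[name]
    return gaps


# Hypothetical encoded values (hours per day); the real encoding is not given.
admitted = [
    {"Study_Duration": 6.0, "Misspend_Time": 1.0},
    {"Study_Duration": 8.0, "Misspend_Time": 2.0},
]
candidate = {"Study_Duration": 2.0, "Misspend_Time": 5.5}
gaps = improvement_gaps(candidate, admitted, ["Study_Duration", "Misspend_Time"])
```

A positive gap means the candidate is below the positive-class average for that factor, and a negative gap means the candidate is above it; whether a gap calls for an increase or a decrease depends on the factor.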
Financial disclosure
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
References
Abu Zohair, L. M. (Dec. 2019). Prediction of Student’s performance by modelling small dataset size. International Journal of Educational Technology in Higher Education, 16(1), 27. 10.1186/s41239-019-0160-3.
Agarwal, N., Chauhan, S., Kar, A. K., & Goyal, S. (2017). Role of human behaviour attributes in mobile crowd sensing: a systematic literature review. Digital Policy, Regulation and Governance, 19(2), 56–73. 10.1108/DPRG-05-2016-0023.
Akanda, Md. A. S. (2019). Research Methodology a complete direction for learners (Second Edition). Dhaka: Akanda & Sons Publications.
Al-Mamary, Y. H. S. (Nov. 2022). Understanding the use of learning management systems by undergraduate university students using the UTAUT model: Credible evidence from Saudi Arabia. International Journal of Information Management Data Insights, 2(2), Article 100092. 10.1016/j.jjimei.2022.100092.
Al-Mamary, Y. H. S. (Nov. 2022). Why do students adopt and use Learning Management Systems?: Insights from Saudi Arabia. International Journal of Information Management Data Insights, 2(2), Article 100088. 10.1016/j.jjimei.2022.100088.
Al-Twijri, M. I., & Noaman, A. Y. (2015). A New Data Mining Model Adopted for Higher Institutions. Procedia Computer Science, 65, 836–844. 10.1016/j.procs.2015.09.037.
Alyahyan, E., & Düştegör, D. (2020). Predicting academic success in higher education: literature review and best practices. International Journal of Educational Technology in Higher Education, 17(1) Springer, Dec. 01. 10.1186/s41239-020-0177-7.
Amra, I. A. A., & Maghari, A. Y. A. (Oct. 2017). Students performance prediction using KNN and Naïve Bayesian. In ICIT 2017 - 8th International Conference on Information Technology, Proceedings (pp. 909–913). 10.1109/ICITECH.2017.8079967.
Md. Sadik Tasrif Anubhove, S. M. Masum Ahmed, M. Zeyad, Md. Abul Ala Walid, N. Ashrafi, and A. M. Saleque, “Tomato’s disease identification using machine learning techniques with the potential of AR and VR technologies for inclusiveness,” 2022, pp. 93–112. doi: 10.1007/978-981-16-7220-0_7.
Asif, R., Merceron, A., Ali, S. A., & Haider, N. G. (Oct. 2017). Analyzing undergraduate students’ performance using educational data mining. Computers and Education, 113, 177–194. 10.1016/j.compedu.2017.05.007.
AT, E., M, A., F, A.-M., & M, S. (2016). Classification of imbalance data using Tomek Link (T-Link) Combined with random under-sampling (RUS) as a Data Reduction Method. Global Journal of Technology and Optimization, 01(S1). 10.4172/2229-8711.s1111.
Batra, J., Jain, R., Tikkiwal, V. A., & Chakraborty, A. (Apr. 2021). A comprehensive study of spam detection in e-mails using bio-inspired optimization techniques. International Journal of Information Management Data Insights, 1(1). 10.1016/j.jjimei.2020.100006.
Beckmann, M., Ebecken, N. F. F., & Pires de Lima, B. S. L. (2015). A KNN Undersampling Approach for Data Balancing. Journal of Intelligent Learning Systems and Applications, 07(04), 104–116. 10.4236/jilsa.2015.74010.
Berka, P., & Marek, L. (Sep. 2021). Bachelor’s degree student dropouts: Who tend to stay and who tend to leave? Studies in Educational Evaluation, 70. 10.1016/j.stueduc.2021.100999.
C. S. Bruce, “Workplace experiences of information literacy,” 1999.
Bujang, M. A., Sa’At, N., Tg Abu Bakar Sidik, T. M. I., & Lim, C. J. (Aug. 2018). Sample size guidelines for logistic regression from observational studies with large population: Emphasis on the accuracy between statistics and parameters based on real life clinical data. Malaysian Journal of Medical Sciences, 25(4), 122–130. 10.21315/mjms2018.25.4.12.
Cardona, T. A., & Cudney, E. A. (2019). Predicting student retention using support vector machines. Procedia Manufacturing, 39, 1827–1833. 10.1016/j.promfg.2020.01.256.
Md. I. H. Chowdhury, N. M. Sakib, S. M. Masum Ahmed, M. Zeyad, Md. A. A. Walid, and G. Kawcher, “Human face detection and recognition protection system based on machine learning algorithms with proposed ar technology,” 2022, pp. 177–192. doi: 10.1007/978-981-16-7220-0_11.
Chui, K. T., Fung, D. C. L., Lytras, M. D., & Lam, T. M. (Jun. 2020). Predicting at-risk university students in a virtual learning environment via a machine learning algorithm. Computers in Human Behavior, 107. 10.1016/j.chb.2018.06.032.
P. Cortez and A. Silva, 2022 “Using data mining to predict secondary school student performance.”
Costa, E. B., Fonseca, B., Santana, M. A., de Araújo, F. F., & Rego, J. (Aug. 2017). Evaluating the effectiveness of educational data mining techniques for early prediction of students’ academic failure in introductory programming courses. Computers in Human Behavior, 73, 247–256. 10.1016/j.chb.2017.01.047.
DiGangi, E. A., & Hefner, J. T. (2013). Ancestry Estimation. In Research Methods in Human Skeletal Biology (pp. 117–149). Elsevier Inc. 10.1016/B978-0-12-385189-5.00005-4.
Du, J., Liu, Y., Yu, Y., & Yan, W. (Jun. 2017). A prediction of precipitation data based on support vector machine and particle swarm optimization (PSO-SVM) algorithms. Algorithms, 10(2). 10.3390/a10020057.
Edwards, J. S. (Apr. 2022). Where knowledge management and information management meet: Research directions. International Journal of Information Management, 63, Article 102458. 10.1016/j.ijinfomgt.2021.102458.
Ensafi, Y., Amin, S. H., Zhang, G., & Shah, B. (Apr. 2022). Time-series forecasting of seasonal items sales using machine learning – A comparative analysis. International Journal of Information Management Data Insights, 2(1). 10.1016/j.jjimei.2022.100058.
Fernandes, E., Holanda, M., Victorino, M., Borges, V., Carvalho, R., & van Erven, G. (Jan. 2019). Educational data mining: Predictive analysis of academic performance of public school students in the capital of Brazil. Journal of Business Research, 94, 335–343. 10.1016/j.jbusres.2018.02.012.
Fotouhi, S., Asadi, S., & Kattan, M. W. (2019). A comprehensive data level analysis for cancer diagnosis on imbalanced data. Journal of Biomedical Informatics, 90 Academic Press Inc., Feb. 01. 10.1016/j.jbi.2018.12.003.
Garg, R., Kiwelekar, A. W., Netak, L. D., & Ghodake, A. (Apr. 2021). i-Pulse: A NLP based novel approach for employee engagement in logistics organization. International Journal of Information Management Data Insights, 1(1). 10.1016/j.jjimei.2021.100011.
Garg, S., Sinha, S., Kar, A. K., & Mani, M. (2022). A review of machine learning applications in human resource management. International Journal of Productivity and Performance Management, 71(5), 1590–1610 Emerald Group Holdings Ltd. May 06. 10.1108/IJPPM-08-2020-0427.
Grant, S., Huang, H., & Pasfield-Neofitou, S. (2014). The authenticity-anxiety paradox: The quest for authentic second language communication and reduced foreign language anxiety in virtual environments. Procedia Technology, 13, 23–32. 10.1016/j.protcy.2014.02.005.
H. Han, W.-Y. Wang, and B.-H. Mao, “LNCS 3644 - Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning,” 2005.
Helal, S., et al., (Dec. 2018). Predicting academic performance by considering student heterogeneity. Knowledge-Based Systems, 161, 134–146. 10.1016/j.knosys.2018.07.042.
Hernández-Sayago, E., Espinar-Escalona, E., Barrera-Mora, J. M., Ruiz-Navarro, M. B., Llamas-Carreras, J. M., & Solano-Reina, E. (Mar. 2013). Lower incisor position in different malocclusions and facial patterns. Medicina Oral, Patologia Oral y Cirugia Bucal, 18(2). 10.4317/medoral.18434.
Hoffait, A. S., & Schyns, M. (Sep. 2017). Early detection of university students with potential difficulties. Decision Support Systems, 101, 1–11. 10.1016/j.dss.2017.05.003.
Huang, S., & Fang, N. (Feb. 2013). Predicting student academic performance in an engineering dynamics course: A comparison of four types of predictive mathematical models. Computers and Education, 61(1), 133–145. 10.1016/j.compedu.2012.08.015.
Hussain, S., Dahan, N. A., Ba-Alwib, F. M., & Ribata, N. (Feb. 2018). Educational
data mining and analysis of students’ academic performance using WEKA. Indonesian Journal of Electrical Engineering and Computer Science, 9(2), 447–459.
10.11591/ijeecs.v9.i2.pp447-459.
Ifinedo, P. (Apr. 2016). Applying uses and gratifications theory and social influence processes to understand students’ pervasive adoption of social networking sites: Perspectives from the Americas. International Journal of Information Management, 36(2), 192–
206. 10.1016/j.ijinfomgt.2015.11.007.
Khandelwal, M., et al., (Apr. 2018). Implementing an ANN model optimized by genetic
algorithm for estimating cohesion of limestone samples. Engineering with Computers,
34(2), 307–317. 10.1007/s00366-017-0541-y.
Kirshners, A., Parshutin, S., & Gorskis, H. (Dec. 2016). Entropy-based classifier enhancement to handle imbalanced class problem. Procedia Computer Science, 104, 586–591.
10.1016/j.procs.2017.01.176.
Koch, J., Plattfaut, R., & Kregel, I. (Nov. 2021). Looking for Talent in Times of Crisis –
The Impact of the Covid-19 Pandemic on Public Sector Job Openings. International
Journal of Information Management Data Insights, 1(2). 10.1016/j.jjimei.2021.100014.
Lamari, M., et al., (2021). SMOTE–ENN-based data sampling and improved dynamic ensemble selection for imbalanced medical data classification. Advances in Intelligent
Systems and Computing, 1188, 37–49. 10.1007/978-981-15-6048-4_4.
Lashkarashvili, N., & Tsintsadze, M. (Apr. 2022). Toxicity detection in online Georgian discussions. International Journal of Information Management Data Insights, 2(1).
10.1016/j.jjimei.2022.100062.
G. Louppe, L. Wehenkel, A. Sutera, and P. Geurts, 2022 “Understanding variable importances in forests of randomized trees.”
Mahdikhani, M. (Apr. 2022). Predicting the popularity of tweets by analyzing public
opinion and emotions in different stages of Covid-19 pandemic. International Journal
of Information Management Data Insights, 2(1). 10.1016/j.jjimei.2021.100053.
Miguéis, V. L., Freitas, A., Garcia, P. J. V., & Silva, A. (Nov. 2018). Early segmentation of
students according to their academic performance: A predictive modelling approach.
Decision Support Systems, 115, 36–51. 10.1016/j.dss.2018.09.001.
Mitra, S. K., & Pathak, P. K. (Dec. 2007). The Nature of Simple Random Sampling. The
Annals of Statistics, 12(4). 10.1214/aos/1176346810.
Nguyen, H. M., Cooper, E. W., & Kamei, K. (2022). Borderline Over-sampling for Imbalanced
Data Classification.
O’Bannon, B. W., & Thomas, K. M. (Jul. 2015). Mobile phones in the classroom: Preservice teachers answer the call. Computers and Education, 85, 110–122.
10.1016/j.compedu.2015.02.010.
Predicting Student Performance using Classification and Regression Trees Algorithm. (Jan.
2020). International Journal of Innovative Technology and Exploring Engineering, 9(3),
3349–3356. 10.35940/ijitee.c8964.019320.
Ramírez-Noriega, A., Juárez-Ramírez, R., & Martínez-Ramírez, Y. (Feb. 2017). Evaluation
module based on Bayesian networks to Intelligent Tutoring Systems. International
Journal of Information Management, 37(1), 1488–1498. 10.1016/j.ijinfomgt.2016.05.007.
Rodríguez-Hernández, C. F., Musso, M., Kyndt, E., & Cascallar, E. (Jan. 2021). Artificial neural networks in academic performance prediction: Systematic implementation and predictor evaluation. Computers and Education: Artificial Intelligence, 2.
10.1016/j.caeai.2021.100018.
Romero, C., López, M. I., Luna, J. M., & Ventura, S. (2013). Predicting students’ final
performance from participation in on-line discussion forums. Computers and Education,
68, 458–472. 10.1016/j.compedu.2013.06.009.
Rout, N., Mishra, D., & Mallick, M. K. (2018). Handling imbalanced data: A survey. Advances
in Intelligent Systems and Computing, 628, 431–443. 10.1007/978-981-10-5272-9_39.
Shin, Y., Kim, Z., Yu, J., Kim, G., & Hwang, S. (Sep. 2019). Development of NOx reduction
system utilizing artificial neural network (ANN) and genetic algorithm (GA). Journal
of Cleaner Production, 232, 1418–1429. 10.1016/j.jclepro.2019.05.276.
Shirdastian, H., Laroche, M., & Richard, M. O. (Oct. 2019). Using big data analytics to
study brand authenticity sentiments: The case of Starbucks on Twitter. International
Journal of Information Management, 48, 291–307. 10.1016/j.ijinfomgt.2017.09.007.
Son, L. H., & Fujita, H. (Jan. 2019). Neural-fuzzy with representative sets
for prediction of student performance. Applied Intelligence, 49(1), 172–187.
10.1007/s10489-018-1262-7.
Syarif, I., Prugel-Bennett, A., & Wills, G. (Dec. 2016). SVM parameter optimization
using grid search and genetic algorithm to improve classification performance.
TELKOMNIKA (Telecommunication Computing Electronics and Control), 14(4), 1502.
10.12928/telkomnika.v14i4.3956.
Tandon, C., Revankar, S., Palivela, H., & Parihar, S. S. (Nov. 2021). How can we predict
the impact of the social media messages on the value of cryptocurrency? Insights from
big data analytics. International Journal of Information Management Data Insights, 1(2).
10.1016/j.jjimei.2021.100035.
Tomasevic, N., Gvozdenovic, N., & Vranes, S. (Jan. 2020). An overview and comparison of supervised data mining techniques for student exam performance prediction.
Computers & Education, 143, Article 103676. 10.1016/j.compedu.2019.103676.
Udo, G. J., Bagchi, K. K., & Kirs, P. J. (2010). An assessment of customers’ e-service quality
perception, satisfaction and intention. International Journal of Information Management,
30(6), 481–492. 10.1016/j.ijinfomgt.2010.03.005.
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic
Minority Over-sampling Technique.
Votto, A. M., Valecha, R., Najafirad, P., & Rao, H. R. (Nov. 2021). Artificial Intelligence in
Tactical Human Resource Management: A Systematic Literature Review. International
Journal of Information Management Data Insights, 1(2). 10.1016/j.jjimei.2021.100047.
Wakelam, E., Jefferies, A., Davey, N., & Sun, Y. (Mar. 2020). The potential for student performance prediction in small cohorts with minimal available attributes. British Journal
of Educational Technology, 51(2), 347–370. 10.1111/bjet.12836.
Walid, M. A. A., Ahmed, S. M. M., & Sadique, S. M. S. (Nov. 2020). A comparative analysis of machine learning models for prediction of passing bachelor admission
test in life-science faculty of a public university in Bangladesh.
10.1109/EPEC48502.2020.9320119.
Wang, Z., Wang, Y., Zeng, R., Srinivasan, R. S., & Ahrentzen, S. (Jul. 2018). Random
Forest based hourly building energy prediction. Energy and Buildings, 171, 11–25.
10.1016/j.enbuild.2018.04.008.
Xiao, L., Dong, Y., & Dong, Y. (Mar. 2018). An improved combination approach based
on Adaboost algorithm for wind speed time series forecasting. Energy Conversion and
Management, 160, 273–288. 10.1016/j.enconman.2018.01.038.
Zeyad, M., & Hossain, M. S. (Dec. 2021). A comparative analysis of data mining methods
for weather prediction. In 2021 International Conference on Computational Performance
Evaluation (ComPE) (pp. 167–172). 10.1109/ComPE53109.2021.9752344.
Hamsa, H., Indiradevi, S., & Kizhakkethottam, J. J. (Jan. 2016). Student academic performance prediction
model using decision tree and fuzzy genetic algorithm. Procedia Technology, 25,
326–332.