Uploaded by Rudy Ariyanto

Student Enrollment Prediction with Multiple Linear Regression

Development of Predictive Modelling Applications
for Historical Student Enrolment Data with a
Multiple Linear Regression Approach
Abstract—This research addresses the increasing complexity
of higher education administration due to the digitization of
services, which has raised challenges related to resource
allocation, stakeholder support, and privacy concerns. This
research aims to develop a predictive application using Multiple
Linear Regression (MLR) to forecast student enrollment
numbers, an important task for optimizing resource
management. Historical enrollment data was collected and
refined to build the MLR model, focusing on variables such as
academic year, number of programs, and number of applicants
per program. The MLR model was evaluated using standard
metrics, including Mean Absolute Percentage Error (MAPE),
which showed high accuracy with an average prediction error
of 0.15% for Informatics Engineering and 0.47% for
Information Systems through 10-Fold Cross Validation. The
developed application has the potential to support strategic
decision-making in higher education by providing accurate and
efficient predictions of new student admissions.
Keywords—Multiple Linear Regression, Student Enrolment
Prediction, Mean Absolute Percentage Error, Higher Education
Administration.
I. INTRODUCTION
The digitization of education services has brought
significant changes in higher education administration,
creating increasing complexity and expanding the
management options available [1]. However, the application
of learning analytics in this sector faces a number of
challenges, including resource limitations, the need for
stakeholder buy-in, and pressing ethical and privacy issues
[2]. Understanding the dynamics at play in higher education
also requires attention to the path dependencies and
political changes that complicate the situation [3].
The critical role of higher education data infrastructures,
often shaped by political objectives, is increasingly apparent
in this process of education sector reform [4]. This complexity
is compounded by changes in the structure of the workforce in
higher education, including an increase in workers in the third
space. In managing this growing complexity, researchers
emphasize the importance of adaptation in leadership,
overcoming existing barriers, and being aware of changes in
the world of work that affect higher education.
Prediction is a modelling approach that produces a forecast
of future conditions from past data through a mathematical
calculation process. Prediction plays an important role in
anticipating the outcome of a future event so that what will
be needed can be prepared in advance [5]. Prediction itself
includes classification and regression. Classification assigns
an entity to a group according to certain standards.
Regression, in turn, makes predictions based on the
relationship between two or more parameters, yielding a value
that describes future conditions based on the influencing
parameters [6].
Regression is a model-building technique used to predict
a value from given input data. It is also a statistical
measure of the strength of the relationship between the
dependent (response) variable and the independent
(predictor) variables. The main method for making predictions
is to build a regression model by finding the relationship
between one or more independent or predictor variables (X)
and the dependent or response variable (Y). Linear regression
models the relationship between a scalar response variable
and one or more explanatory variables [7]. Multiple Linear
Regression (MLR) can be used in prediction or forecasting
based on relevant patterns of relationships in past data. In
the regression method, the predicted variable, such as sales
or demand for a product, is generally stated as the dependent
variable, which is influenced by the independent variables.
There are basically two kinds of relationship analysis in
forecasting: cross-section analysis, or the causal model, and
time series analysis, which will be discussed in this study.
Past research suggests that increased enrollment in higher
education poses significant challenges in maintaining the
quality of education, which requires careful planning and
strong infrastructure [8]. This concern is further exacerbated
by the projected decline in high school graduates and specific
challenges such as high dropout rates among black males,
which require appropriate interventions [9]. This fact confirms
the importance of more in-depth research and thorough
analysis in formulating effective strategies to increase college
participation. Another study, by [10], is titled
Undergraduate International Student Enrollment
Forecasting Model: An Application of Time Series Analysis.
This research builds a SARIMA model to estimate the number
of foreign students enrolling in undergraduate programs at a
university in the Midwest.
Elements considered in this model include enrollment trends,
visa policy changes, and tuition rates. The findings show visa
policy changes as well as increased Chinese enrollment as
variables that have significant influence. Although the
influence of tuition fees is very low, it is still significant. The
use of these insights provides useful direction for policy
formulation, enrollment strategies, and student support
services for undergraduate students from other countries.
Based on the above explanation, the Faculty of
Information Technology at KH. A. Wahab Hasbullah
University, as an example of a higher education institution,
must conduct annual planning that includes estimating the
number of student enrollments to optimize resource allocation
in the field of academic administration. Accurate enrollment
prediction helps management plan class capacity, budget
allocation, and staffing, and allows institutions to respond
proactively to changes in program demand [11].
Therefore, the development of reliable predictive models is
key in supporting strategic decisions taken by the Faculty of
Information Technology, in order to improve the effectiveness
of educational services and future plans.
II. RESEARCH METHODOLOGY
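This research consists of several stages, as shown in the
framework of Figure 1: problem identification, literature
study, data collection, data processing, model development,
validation, implementation, and analysis and interpretation.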
The next stage is model development, where the
processed data is tested using the Multiple Linear Regression
method. Regression analysis of this kind assumes that the
relationship between two or more variables can be expressed
as a linear function [12]. Multiple Linear Regression can be
written as:
$$y = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \varepsilon \tag{1}$$
where $y$ is the dependent variable, $X_1, \dots, X_n$ are the
independent variables, $\beta_0$ is the intercept (the value of
$y$ when all independent variables are 0), $\beta_1, \dots,
\beta_n$ are the MLR estimated coefficients, and $\varepsilon$
is the error term.
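As an illustration of Equation (1), the following is a minimal sketch of how the coefficients could be estimated by ordinary least squares through the normal equations, using only the Python standard library. The function names (`solve_linear_system`, `fit_mlr`, `predict`) and the small synthetic data set are assumptions made for this example; they are not taken from the application described in this paper.

```python
from typing import List, Sequence


def solve_linear_system(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b] so row operations apply to both sides.
    m = [[float(v) for v in row] + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x


def fit_mlr(rows: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Return [beta_0, beta_1, ..., beta_n] minimizing the squared error."""
    # Prepend a constant 1 to every row so beta_0 acts as the intercept.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(beta: Sequence[float], row: Sequence[float]) -> float:
    """Evaluate beta_0 + beta_1 * X_1 + ... + beta_n * X_n for one row."""
    return beta[0] + sum(b * x for b, x in zip(beta[1:], row))


# Synthetic data (an assumption for illustration): y = 10 + 2*X1 + 0.5*X2.
rows = [(1.0, 20.0), (2.0, 18.0), (3.0, 25.0), (4.0, 30.0), (5.0, 22.0)]
y = [10.0 + 2.0 * x1 + 0.5 * x2 for x1, x2 in rows]
beta = fit_mlr(rows, y)
forecast = predict(beta, (6.0, 24.0))
```

Because the synthetic targets are generated without noise, the fitted coefficients should recover the generating values up to floating-point error. With real enrollment data the error term would be non-zero, and a numerically more stable solver would usually be preferable to the normal equations.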
If the model demonstrates optimal performance, development
of the application prototype proceeds; otherwise, the
preprocessing is adjusted and the model is tested again. The
validation stage employs K-Fold Cross Validation, a robust
method to evaluate the model's performance. In K-Fold Cross
Validation, the dataset is split into 'k' subsets, or folds [13].
The model is trained on k−1 folds and tested on the remaining
fold. This process is repeated 'k' times, with each fold serving
as the test set exactly once. The model's performance is
averaged over all 'k' trials to check that it generalizes well
across different subsets of data. This method helps detect
overfitting and provides a more reliable estimate of the
model's effectiveness [14], with MAPE being the primary
metric used for evaluating the error rate.
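The validation procedure described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the folds are contiguous and unshuffled, the model is a one-predictor least-squares line standing in for the full MLR model, and all function names and data values are assumptions.

```python
from typing import Callable, List, Sequence, Tuple


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean Absolute Percentage Error in percent; actual values must be nonzero."""
    total = sum(abs((a - p) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs: Sequence[float], ys: Sequence[float], k: int,
                   fit: Callable, predict: Callable) -> List[float]:
    """Train on k-1 folds, test on the held-out fold, and return MAPE per fold."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        preds = [predict(model, xs[i]) for i in test_idx]
        scores.append(mape([ys[i] for i in test_idx], preds))
    return scores


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Closed-form least squares for y = b0 + b1 * x (stand-in for the MLR model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return mean_y - b1 * mean_x, b1


def predict_line(model: Tuple[float, float], x: float) -> float:
    b0, b1 = model
    return b0 + b1 * x


# Synthetic, exactly linear data (an assumption for illustration).
xs = [float(i) for i in range(1, 11)]
ys = [50.0 + 3.0 * x for x in xs]
fold_mape = cross_validate(xs, ys, k=5, fit=fit_line, predict=predict_line)
mean_mape = sum(fold_mape) / len(fold_mape)
```

Passing `fit` and `predict` as parameters keeps the cross-validation loop independent of the model, so the same loop could wrap a multiple-regression fit. The per-fold values and their mean correspond in form to the rows and average of a fold-by-fold MAPE table.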
Once the model is validated, the focus turns to
implementation: designing, coding, and testing the
application prototype according to user requirements.
Finally, in the analysis and interpretation stage, the
prediction results are analyzed to understand the factors
influencing student enrollment trends and to evaluate
whether the model meets the research needs. The research
findings are further analyzed to understand their practical
implications and to develop recommendations that can
improve processes, policies, or technology development.
Figure 1. Research Methodology
The method begins with problem identification, including
interviews to understand the problem of student enrollment
and observation of hardware, software, and operators. A
literature study is then conducted to review various methods
of predicting student enrollment and multiple linear
regression techniques. In the data collection stage, data
related to student enrollment and planning from the Faculty
of Information Technology is collected and described to
ensure clear integration. The collected data is then
processed in the data processing stage, which includes
selecting relevant data, completing missing data, correcting
incorrect or inconsistent data, and formatting data according
to model needs.
III. RESULT AND DISCUSSION
A. Performance Analysis
The following are the performance analysis results of the
Multiple Linear Regression model, analyzed for its ability to
handle linear relationships between input and output variables.
Table 1. Prediction of the Number of New IT Students
Year    Actual    Predicted
2013    346       346.00
2014    177       177.27
2015    202       201.78
2016    171       171.30
2017    293       293.47
2018    507       506.56
2019    223       222.76
2020    204       203.58
2021    279       279.91
2022    239       238.80
2023    248       247.49
Table 2. Prediction of the Number of New SI Students
Year    Actual    Predicted
2013    209       209.00
2014    76        77.00
2015    161       161.51
2016    177       176.86
2017    157       156.49
2018    191       191.41
2019    143       142.02
2020    92        92.71
2021    102       101.69
2022    127       126.76
2023    61        60.28
Figure 2. Correlogram Heatmap
The Correlogram Heatmap shows that all the variables
used in the model have a very strong correlation with each
other. This can be seen from the dark red color that dominates
the heatmap, which shows a correlation value close to 1. This
very strong correlation indicates that the variables have a
strong linear relationship, which can contribute to the high
accuracy of the model in predicting the number of applicants.
A strong correlation also indicates that a change in one input
variable is likely to be followed by a change in the other input
variables, which may help keep predictions consistent,
although highly correlated inputs also make the individual
coefficients harder to interpret.
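The pairwise values summarized by a correlogram heatmap are Pearson correlation coefficients. The following minimal sketch shows how such a matrix could be computed before plotting; the column names and the numbers in `columns` are made up for illustration and are not the study's data.

```python
from math import sqrt
from typing import Dict, Sequence


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_a * var_b)


def correlation_matrix(columns: Dict[str, Sequence[float]]) -> Dict[str, Dict[str, float]]:
    """Correlate every column with every other column (and with itself)."""
    return {
        row: {col: pearson(columns[row], columns[col]) for col in columns}
        for row in columns
    }


# Hypothetical columns with invented values, used only to exercise the functions.
columns = {
    "year": [2019, 2020, 2021, 2022, 2023],
    "applicants": [310, 290, 350, 330, 340],
    "enrolled": [220, 205, 280, 240, 250],
}
matrix = correlation_matrix(columns)
```

Each cell of `matrix` lies between -1 and 1, and a heatmap simply maps these cells to colors. Values close to 1 between two input variables are the pattern the text describes as a dark red region.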
Table 3. Evaluation Results
Validation K-Fold    MAPE (TI)    MAPE (SI)
1                    0.08%        0.66%
2                    0.11%        0.32%
3                    0.18%        0.07%
4                    0.16%        0.32%
5                    0.09%        0.22%
6                    0.11%        0.68%
7                    0.20%        0.77%
8                    0.33%        0.29%
9                    0.08%        0.18%
10                   0.20%        1.18%
Average              0.15%        0.47%
The results showed that the Multiple Linear Regression
(MLR) model, evaluated using Mean Absolute Percentage
Error (MAPE), performed very well, producing accurate
predictions as indicated by the small MAPE values [15].
Further analysis revealed factors that influenced the prediction
results, including the identification of significant predictor
variables such as academic year, number of study programs,
and number of applicants in each program. The influence of
these variables in predicting the number of student
enrollments became clear, providing greater insight into the
contribution of each variable to the accuracy of the model.
These results show that the MLR model not only provides
predictions that are close to the actual values, but also
clarifies how each variable influences the prediction results.
B. Implementation
The research resulted in a website-based application
prototype that successfully predicts the number of new
students at the Faculty of Information Technology, KH. A.
Wahab Hasbullah University.
Figure 3. Dashboard Page
The Dashboard page is the first page the user sees when
accessing the website, before entering the prediction page. On
this page, the user uploads the registrant data for each year in
the form of a CSV file. If the uploaded data is valid, the system
redirects to the prediction page.
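A minimal sketch of the kind of check that could decide whether an uploaded CSV file is valid is given below. The column names (`year`, `program`, `applicants`), the function `parse_registrant_csv`, and the sample values are hypothetical; the paper does not specify the file layout used by the prototype.

```python
import csv
import io
from typing import Dict, List, Union

# Hypothetical schema: the actual column names of the prototype are not documented.
REQUIRED_COLUMNS = ("year", "program", "applicants")


def parse_registrant_csv(text: str) -> List[Dict[str, Union[int, str]]]:
    """Parse uploaded CSV text and raise ValueError if it is not usable."""
    reader = csv.DictReader(io.StringIO(text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    rows = []
    # Data starts on line 2 because line 1 holds the header.
    for line_no, record in enumerate(reader, start=2):
        try:
            year = int(record["year"])
            applicants = int(record["applicants"])
        except (TypeError, ValueError):
            raise ValueError(f"line {line_no}: year and applicants must be integers") from None
        if applicants < 0:
            raise ValueError(f"line {line_no}: applicants cannot be negative")
        program = (record["program"] or "").strip()
        rows.append({"year": year, "program": program, "applicants": applicants})
    if not rows:
        raise ValueError("file contains no data rows")
    return rows


# Invented sample content, used only to demonstrate the function.
sample_text = "year,program,applicants\n2022,TI,300\n2022,SI,150\n"
sample_rows = parse_registrant_csv(sample_text)
```

In a web application, a successful return from such a function would be the condition for redirecting to the prediction page, while a `ValueError` message could be shown to the user on the Dashboard page.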
Figure 4. Prediction Page
The Prediction page displays a graph comparing the
predicted number of new students with the actual data on
the number of applicants.
IV. CONCLUSION
The results demonstrate that the Multiple Linear
Regression (MLR) model effectively predicts the number of
new students enrolling in the Information Technology (TI)
and Information Systems (SI) programs. Evaluated using
10-Fold Cross-Validation, the model for TI shows a low
Mean Absolute Percentage Error (MAPE) of 0.15%,
indicating high prediction accuracy. For the SI program, the
MAPE is 0.47%, slightly higher but still within acceptable
limits for practical application. The successful
implementation of the model into a prototype web-based
application further validates its utility, enabling accurate
enrollment predictions for the Faculty of Information
Technology. These findings suggest that the MLR model is a
reliable tool for forecasting student enrollment, contributing
valuable insights for institutional planning and resource
allocation.
ACKNOWLEDGMENT
The authors are grateful to …. for providing the funding
needed for this research and its preparation for publication.
The authors assume all responsibility for any errors and
omissions in this research.
REFERENCES
[1] Harper, D. A., Muñoz, F.-F., & Vázquez, F. J. (2021). Innovation
in online higher-education services: Building complex systems.
Economics of Innovation and New Technology, 30(4), 412–431.
https://doi.org/10.1080/10438599.2020.1716508
[2] Tsai, Y., Poquet, O., Gašević, D., Dawson, S., & Pardo, A. (2019).
Complexity leadership in learning analytics: Drivers, challenges
and opportunities. British Journal of Educational Technology,
50(6), 2839–2854. https://doi.org/10.1111/bjet.12846
[3] Kauko, J. (2014). Complexity in higher education politics:
Bifurcations, choices and irreversibility. Studies in Higher
Education, 39(9), 1683–1699.
https://doi.org/10.1080/03075079.2013.801435
[4] Williamson, B. (2018). The hidden architecture of higher education:
Building a big data infrastructure for the ‘smarter university.’
International Journal of Educational Technology in Higher
Education, 15(1), 12. https://doi.org/10.1186/s41239-018-0094-1
[5] Chen, Y., Li, R., & Hagedorn, L. S. (2019). Undergraduate
International Student Enrollment Forecasting Model: An
Application of Time Series Analysis. Journal of International
Students, 9(1), 242–261. https://doi.org/10.32674/jis.v9i1.266
[6] Cho, J., & Lee, J. (2018). Multiple Linear Regression Models for
Predicting Nonpoint-Source Pollutant Discharge from a Highland
Agricultural Region. Water, 10(9), 1156.
https://doi.org/10.3390/w10091156
[7] Hartaka, I. M., Eka Suadnyana, I. B. P., & Somawati, A. V. (2021).
Tantangan Dan Solusi Penerimaan Mahasiswa Baru Prodi Filsafat
Hindu Stahn Mpu Kuturan Singaraja. Jurnal Penjaminan Mutu,
7(2). https://doi.org/10.25078/jpm.v7i2.2778
[8] Grip, R. S., & Grip, M. L. (2020). Using Multiple Methods to
Provide Prediction Bands of K-12 Enrollment Projections.
Population Research and Policy Review, 39(1), 1–22.
https://doi.org/10.1007/s11113-019-09533-2
[9] Scott, J. A., Taylor, K. J., & Palmer, R. T. (2013).
Challenges to Success in Higher Education: An Examination of
Educational Challenges from the Voices of College-Bound Black
Males. The Journal of Negro Education, 82(3), 288.
https://doi.org/10.7709/jnegroeducation.82.3.0288
[10] Johnson, D. M. (2019). Student Demographics: The Coming
Changes and Challenges for Higher Education. In D. M. Johnson,
The Uncertain Future of American Public Higher Education (pp.
141–156). Springer International Publishing.
https://doi.org/10.1007/978-3-030-01794-1_10
[11] Jozaghi, A., Shen, H., Ghazvinian, M., Seo, D.-J., Zhang, Y.,
Welles, E., & Reed, S. (2021). Multi-model streamflow prediction
using conditional bias-penalized multiple linear regression.
Stochastic Environmental Research and Risk Assessment, 35(11),
2355–2373. https://doi.org/10.1007/s00477-021-02048-3
[12] Rossi, E., Pecorini, I., & Iannelli, R. (2022). Multilinear Regression
Model for Biogas Production Prediction from Dry Anaerobic
Digestion of OFMSW. Sustainability, 14(8), 4393.
https://doi.org/10.3390/su14084393
[13] Werth, J., & Sigman, M. S. (2021). Linear Regression Model
Development for Analysis of Asymmetric Copper-Bisoxazoline
Catalysis. ACS Catalysis, 11(7), 3916–3922.
https://doi.org/10.1021/acscatal.1c00531
[14] Li, X. (2022). Sequence Model and Prediction for Sustainable
Enrollments in Chinese Universities. Sustainability, 15(1), 214.
https://doi.org/10.3390/su15010214
[15] Doresdiana, H., Badawi Saluy, A., & Author, C. (2021). Spare Parts
Demand Forecasting During Covid 19 pandemic (Automotive
Company Case Study). 2(2). https://doi.org/10.38035/dijefa.v2i2