Machine Learning Model for Analysis
OPTIMISED PREDICTIVE MODEL

Submitted by
Mukul Saini (22BCS14902)
Nishant Bamal (22BCS15558)
in partial fulfilment for the award of the degree of
Bachelor of Engineering in Computer Science
Chandigarh University
February - June 2023

Contents

Introduction
1. Students and Academics
2. About the Model and the Dataset
3. The Dataset
4. Timeline

Literature Review
1. Timeline of the Reported Problem
2. Existing Solutions
3. Bibliometric Analysis
4. Review Summary
5. References

Design Flow and Process
1. Proposed Methodology
2. Generalised Method
3. Phase-wise Association

Result Analysis and Validation
1. Types of Analysis
2. Analysis Tables
3. Analysis through Graphs

Conclusion and Summary
1. Conclusion and Summary
2. Team Roles

List of Figures
Figure 1.1 Gantt chart of project timeline
Figure 1.2 Generalised Method flow-chart
Figure 1.3 Phase-wise Association flow-chart
Figure 1.4 Graph for analysis
Figure 1.5 Graph for analysis

List of Tables
Table 1.1 Accuracy table
Table 1.2 Error table
Table 1.3 TP rate table
Table 1.4 FP rate table
Table 1.5 ROC rate table
Table 1.6 Resampled table

INTRODUCTION

Students and Academics

Academic life is one of the most significant parts of a student's life. According to the dictionary, an academic is "a teacher or a scholar in an educational institute". Students and academics are closely intertwined, and a student's hard work and intellectual ability are often judged by his or her academic scores. The academic performance of adolescents is also widely believed to be an indicator of their future success. Academics act as a pillar for future professional growth and long-term success. Serious academic study usually begins in the teenage years and can take many years of an individual's life, depending on the degree being pursued. Academics are the foundation for cognitive development and knowledge enhancement.

A student's academic life can also be very challenging. The teenage years and early adulthood are also a time of physical development, so many students take up sports during this period. Managing extra-curricular activities alongside academics can be a difficult task: done well, it increases overall competence; done poorly, it leads to weak performance in both and a resulting lack of confidence.

From the above it is clear that predicting a student's academic percentage is useful. A machine learning model that can predict a given student's future percentage will therefore be valuable for the following reasons:
• Helpful for teachers: It can help teachers identify students who are likely to perform weakly in the future. Knowing this, teachers can focus on these students accordingly, which can effectively improve the academic grades of all students.
• Helpful for students: Teachers will be able to better guide students who are likely to struggle academically, helping those students get the attention and guidance they require.
• Statistically helpful: It will provide a base for further research.
It can also help assess the performance of the teaching techniques used by teachers, as it provides concrete evidence of the success or failure of those techniques.
• Better management: It will result in better management of students. Students in one category can be grouped together and the pace of teaching decided accordingly, improving overall management.
• Growth in extra-curriculars: Knowing a student's likely future percentage indicates whether the student will be able to manage sports and other extra-curricular activities alongside academics, and different students can be encouraged towards sports accordingly.

About the Model and the Dataset

The model will be trained using machine learning on a dataset containing various parameters relevant to the prediction task. Regression analysis will be used to build the model.

Machine Learning
Machine learning is a subset of artificial intelligence (AI) and computer science. It uses algorithms and data to imitate the way humans learn, and its accuracy gradually improves over time. In simple terms, it gives computers the ability to learn without being explicitly programmed, enabling them to learn automatically from past data. Machine learning algorithms build a mathematical model from sample historical data (training data). It is essentially a branch of computer science and statistics that helps in creating predictive models. It is extremely useful to organisations and companies because it can provide insights that lead to good decisions. For example, a company can set the price of a specific item based on past sales records at different prices; this is a very simple example, and far more complex decisions can be made in the same way.

Features of machine learning:
1. It learns from historical data and improves accordingly.
2. It is a data-driven technology.
3. It identifies patterns in a given dataset.
4. It is very similar to data mining in its ability to deal with huge datasets.

Classification of Machine Learning
1. Supervised Learning: In supervised learning, we provide labelled sample data for training the machine learning system, and on this basis it predicts the output. The goal of supervised learning is to map input data to output data. It is like a student learning under the supervision of a qualified teacher. It is further divided into two families of algorithms: classification and regression.
2. Unsupervised Learning: As the name implies, the machine learns without any supervision. The data has not been categorised or classified, and the algorithm must act on it without guidance; the machine tries to find useful structure by itself. It can also be divided into two families of algorithms: clustering and association.
3. Reinforcement Learning: This is a feedback-based learning mechanism concerned with decision-making. It is about maximising reward by taking appropriate actions in an environment, learning by trial and error to achieve the best outcome. Many modern robots work on this principle.

Missing Values in Machine Learning
Handling missing values is a crucial step in the pre-processing stage of machine learning. It is critical to select the best approach based on the dataset, the kind and quantity of missing data, and the particular requirements of the problem at hand.
It is also critical to evaluate how the chosen imputation method will affect the outcomes and to take into account any biases that imputed values may introduce. Several algorithms and methods are frequently used to fill in the missing values in a dataset:
1. Mean/Median/Mode Imputation: This approach replaces missing values with the mean, median or mode of the available data. It is commonly used for numerical features and is a simple and quick method; however, it assumes that the missing values follow a similar distribution to the observed data.
2. Regression Imputation: Regression models can be used to predict missing values from the other features in the dataset. A regression algorithm is trained on the complete data, with the feature containing missing values as the target variable, and the trained model is then used to predict the missing values.
3. k-Nearest Neighbours (KNN) Imputation: In this approach, the values of the k nearest neighbours in the feature space are used to impute missing values. The closest data points with complete information are found using a distance measure, and their values are used to fill in the gaps.
4. Expectation-Maximisation (EM) Imputation: The EM algorithm is an iterative procedure that estimates missing values by maximising the likelihood function. Starting from an initial estimate, conditional expectations are used to update the estimates until convergence. EM is frequently used for data with intricate interdependencies across variables.
5. Multiple Imputation: Multiple imputation uses statistical models to produce several imputed datasets. Each dataset is analysed individually and the results are then merged to account for the uncertainty introduced by the missing values. This method produces more reliable estimates by accounting for imputation variability.

Resampling
In machine learning, resampling is the act of picking and modifying existing data points to produce new training datasets. It is frequently used to address issues such as unbalanced datasets, overfitting and model evaluation. The two primary categories of resampling procedures are:
1. Oversampling: In datasets where one class is under-represented in comparison to the others, oversampling is employed to remedy the class imbalance by creating more instances of the minority class until it is balanced with the majority class. The most widely used oversampling approach is the Synthetic Minority Oversampling Technique (SMOTE), which interpolates between existing minority-class samples to construct synthetic samples.
2. Undersampling: Undersampling addresses the imbalance by lowering the number of instances in the majority class, randomly selecting a subset of the majority-class samples to match the number of samples in the minority class. While undersampling can help balance the dataset, it may cause information loss and may not fully reflect the diversity of the majority class.

Ensembling
Ensembling is a machine learning technique that combines the results of several models in order to increase overall predictive performance. It is based on the premise that, by combining the predictions of different models, the strengths of each model can make up for the weaknesses of the others, producing predictions that are more reliable and accurate.
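As a concrete illustration of this idea, the short sketch below (a minimal sketch only, assuming scikit-learn is available; the synthetic data is a stand-in for a prepared student dataset) combines a logistic regression and a random forest through a voting ensemble, the same pairing used later in this report:

# Minimal voting-ensemble sketch, assuming scikit-learn is installed.
# The synthetic data below is a stand-in for a prepared student dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=20, n_classes=3,
                           n_informative=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

log_reg = LogisticRegression(max_iter=1000)         # one base learner
forest = RandomForestClassifier(n_estimators=100,   # a second, different learner
                                random_state=42)

# Soft voting averages the predicted class probabilities of both models.
ensemble = VotingClassifier(estimators=[("lr", log_reg), ("rf", forest)],
                            voting="soft")
ensemble.fit(X_train, y_train)
print("Ensemble accuracy:", ensemble.score(X_test, y_test))

Hard voting (voting="hard") would instead take a majority vote over the predicted class labels.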
Ensembling has the following advantages:
• Better predictive performance: When models are combined, they can perform more accurately than when used alone, especially when the models have varied strengths and weaknesses.
• Robustness: By using the consensus of many models, ensembling helps minimise the influence of outliers or noisy data.
• Reduced overfitting: By merging predictions from several models trained on different subsets of the data, ensembling can help limit overfitting, especially in complicated models.
• Model interpretability: Feature-importance metrics provided by ensembling techniques such as Random Forests can be used to understand the relative value of different attributes in the prediction.

The Dataset

The dataset contains data for predicting students' academic performance. The aim is to predict the end-semester percentage from different social, economic and academic attributes. The dataset is taken from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Student+Academics+Performance). It was donated on 16 September 2018, and its source is Dr Sadiq Hussain, Dibrugarh University, Dibrugarh, Assam, India (sadiq '@' dibru.ac.in). It has 22 attributes and 300 instances, and it does not contain any missing values. Some of the attributes are as follows:
o Gender: It comprises two genders, male and female.
o Caste: The caste attribute takes the values General, SC, ST and OBC.
o Marital Status: It tells whether the individual student is married or unmarried.
o Father's Occupation: This attribute records the father's occupation. It is a very important attribute, as it strongly influences the economic condition of the family, which in turn affects the education of the student. It has 5 options: Service, Business, Retired, Farmer and Others.
o Mother's Occupation: Again a very important attribute, with the same 5 options: Service, Business, Retired, Farmer and Others.
o Father's Qualification: This is also a very important attribute, as the father's education level affects the child's learning and knowledge. It has 5 options: Illiterate, 10th, 12th, Degree and PG.
o Mother's Qualification: The mother is the child's first teacher, so her educational qualification impacts the child significantly, in many cases even more than the father's. It also has 5 options: Illiterate, 10th, 12th, Degree and PG.
o Medium of Education: Since this dataset was collected in India, Indian languages are included. It has 4 options: English, Hindi, Assamese and Bengali.
o Family Income: Family income supports education, so it is a useful attribute. It has 4 levels: very high, high, medium and low.
o Academic Grades: This is the most important attribute, as past performance is an indicator of future performance. The academic grades cover 4 subjects, each with 5 options: Best, Very Good, Good, Pass and Fail.

Timeline

Figure 1.1: Gantt chart of the project timeline.

CHAPTER 2. LITERATURE REVIEW

2.1 Timeline of the reported problem
A number of studies have been conducted in the past to predict students' performance. It has been shown [1] that data about the activity of students during the semester improves the prediction. Predictive models using different algorithms have been built many times; one of the earliest is (Thai-Nghe et al., 2011) [2], and a large number of studies have been published since.

2.2 Existing solutions
1. PERSONALIZED MULTI-LINEAR REGRESSION MODELS (PLMR) by [3] (Elbadrawy et al., 2016).
2.
REGRESSION AND CLASSIFICATION MODELS by [4] (Meier et al., 2016) and [5] (Zimmermann et al., 2015).
3. FACTORIZATION MACHINES (FM) by [6] (Sweeney et al., 2015).
4. MATRIX FACTORIZATION MODEL by [2] (Thai-Nghe et al., 2011).

2.3 Bibliometric analysis
1. PERSONALIZED MULTI-LINEAR REGRESSION
PLMR could predict the next-semester percentage with lower error rates. PLMR was also useful for predicting grades on assessments within a traditional class or online course by incorporating features captured through students' interaction with LMS and MOOC server logs.
OBSERVED DRAWBACK - Final-grade prediction based on the limited initial data of students and courses is a challenging task because, at the beginning of undergraduate studies, most students are motivated and perform well in the first semester, but as time passes their motivation and performance may decrease.
2. REGRESSION AND CLASSIFICATION MODELS
(Meier et al., 2016) proposed an algorithm to predict the final grade of an individual student once the expected accuracy of the prediction is sufficient. The algorithm can be used in both regression and classification settings to predict students' performance in a course. The study also demonstrated that timely prediction of each student's performance would allow instructors to intervene accordingly. (Zimmermann et al., 2015) considered regression models in combination with variable selection and variable aggregation to predict the performance of graduate students and their aggregates. By analysing the structure of the undergraduate programme, they assessed a set of students' abilities. Their results can be used as a methodological basis for deriving principled guidelines for admissions committees.
3. FACTORIZATION MACHINES (FM)
(Sweeney et al., 2015) developed a system for predicting students' grades using simple baselines and MF-based methods on a dataset from George Mason University (GMU). Their study showed that the Factorization Machines (FM) model achieved the lowest prediction error and can be used for both cold-start and non-cold-start predictions accurately.
4. MATRIX FACTORIZATION MODEL
(Thai-Nghe et al., 2011) [2] created matrix factorization models to predict student performance in algebra. This technique is useful in cases involving sparse data, and also when students' background knowledge and task information are absent.
OBSERVED DRAWBACK - It is not efficient when dealing with small sample sizes.

2.4 Review Summary
After careful analysis of the literature, it is clear that a number of studies have been conducted on this topic. By far the most accurate predictive models are factorization models and regression and classification models, while models such as matrix factorization struggle with small sample sizes.

REFERENCES
1. Koprinska, I., Stretton, J., and Yacef, K. 2015. Students at Risk: Detection and Remediation. The 8th International Conference on Educational Data Mining (EDM 2015), pp. 512-515.
2. Thai-Nghe, N., Drumond, L., Horváth, T., Nanopoulos, A., and Schmidt-Thieme, L. 2011. Matrix and tensor factorization for predicting student performance. In CSEDU (1). Citeseer, 69-78.
3. Thai-Nghe, N., Drumond, L., Horváth, T., Schmidt-Thieme, L., et al. 2011. Multi-relational factorization models for predicting student performance. In Proc. of the KDD Workshop on Knowledge Discovery in Educational Data. Citeseer, 27-40.
4. Meier, Y., Xu, J., Atan, O., and van der Schaar, M. 2016. Predicting grades. IEEE Transactions on Signal Processing 64, 4, 959-972.
5. Zimmermann, J., Brodersen, K. H., Heinimann, H. R., and Buhmann, J. M. 2015. A model-based approach to predicting graduate-level performance using indicators of undergraduate-level performance. JEDM - Journal of Educational Data Mining 7, 3, 151-176.
6. Sweeney, M., Lester, J., and Rangwala, H. 2015. Next-term student grade prediction. In Big Data (Big Data), 2015 IEEE International Conference on. IEEE, 970-975.

Design Flow/Process

Proposed Methodology
In the proposed methodology, two approaches have been used: 1) the Generalised Method and 2) Phase-wise Association.

Fig 1.2: Generalised Method flow-chart.

GENERALISED METHOD
The dataset must first be imported from the UCI repository. Once imported, it becomes apparent that it does not meet the requirements, so pre-processing is needed: prominent features are discovered using three distinct feature-selection techniques, the feature-selection methods are compared, and the significant features are recorded. After the dataset has been pre-processed, ensembling is applied to the model, and the best accuracy is attained in comparison with the original dataset. The model is then trained on both the original dataset and the dataset restricted to the prominent features. Analysis and feedback are gathered; if the model is unable to meet the requirements, improvements are implemented in accordance with the recommendations, and the model is tested and verified again. If the model achieves the intended results, its ROBUSTNESS is improved, as this is one of the most crucial steps in determining whether or not the model can deal with real-world problems and situations. Once its robustness has been increased, the model is ready to be used in the real world.

Fig 1.3: Phase-wise Association flow-chart.

Phase-wise Association
This section is divided into 5 phases.
In the first phase, the problem statement is identified and a dataset relevant to the problem statement is imported. The imported dataset is not according to the requirements, so further processing is required.
Once the dataset has been imported, the second phase starts with pre-processing, in which prominent features are identified using different kinds of algorithms. This is one of the most important phases, as it focuses on meeting the requirements of the dataset: the dataset is balanced and structured (a brief illustrative sketch of this phase is given after the phase descriptions).
In the third phase, the model is trained with traditional machine learning algorithms and the results are collected and analysed. Ensembling (voting) is then applied to the dataset to further increase the accuracy.
In phase four, the test-and-validate step focuses on comparing the performance of the prominent features with the original features and improving the ROBUSTNESS of the model. The best accuracy, 77.2%, is achieved by ensembling through voting.
In the fifth phase, the model is strengthened so that it may be used in the real world, since it is now reliable.
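To make the pre-processing phase above more concrete, the sketch below shows one possible way to compare feature-selection techniques and then balance the classes. It is an illustrative sketch only: the report does not name the three feature-selection techniques it used, so the chi-squared score, mutual information and random-forest importance shown here are assumptions, as are the file name student_performance.csv and the target column esp; the imbalanced-learn library is assumed for SMOTE.

# Illustrative pre-processing sketch (assumptions noted above): compare three
# candidate feature-selection techniques, then balance the classes with SMOTE.
# Requires pandas, scikit-learn and imbalanced-learn.
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

data = pd.read_csv("student_performance.csv")    # hypothetical file name
X_raw = data.drop(columns=["esp"])               # "esp" assumed as the target column
y = LabelEncoder().fit_transform(data["esp"])
X = OrdinalEncoder().fit_transform(X_raw)        # attributes are categorical, so encode as integers

k = 10  # number of prominent features to keep
chi2_idx = SelectKBest(chi2, k=k).fit(X, y).get_support(indices=True)
mi_idx = SelectKBest(mutual_info_classif, k=k).fit(X, y).get_support(indices=True)
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)
rf_idx = forest.feature_importances_.argsort()[::-1][:k]

# Compare which attributes each technique ranks as prominent.
for name, idx in [("chi-squared", chi2_idx),
                  ("mutual information", mi_idx),
                  ("random-forest importance", rf_idx)]:
    print(name, "->", [X_raw.columns[i] for i in sorted(idx)])

# Balance the classes on the selected features (here: the chi-squared choice).
X_balanced, y_balanced = SMOTE(random_state=42).fit_resample(X[:, sorted(chi2_idx)], y)
print("Class counts after SMOTE:", pd.Series(y_balanced).value_counts().to_dict())

The balanced, reduced dataset can then be fed to the traditional learners and to the voting ensemble of phase three.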
The model is deployed to the real world after considering all forms of input and analysis and after making all of the suggested modifications, as it is now prepared to face and solve real-world problems. Accuracy improved from 53.2% to 77.2%, so the outcomes of the model are satisfactory and ready for real-world use.

Result Analysis and Validation

The result analysis and validation stages are essential to the assessment of student achievement. They help in evaluating the data gathered and in establishing the validity and reliability of the evaluation techniques employed. By carrying out thorough analysis and validation, researchers and educators can make sure the evaluation process is accurate and useful. This section covers a few typical methods for result analysis and validation in student performance evaluation.

Statistical Analysis: Statistical analysis is essential to understanding results. Several statistical approaches can be applied to find patterns, trends and relationships in the data. Descriptive statistics such as the mean, standard deviation and frequency distributions can summarise the performance data. Inferential statistics such as t-tests and analysis of variance (ANOVA) can be used to compare student performance across different groups or conditions. Regression analysis can be used to find predictors of student performance, such as demographic variables or instructional interventions.

Validity Analysis: Validity analysis focuses on determining whether the assessment methods measure what they are supposed to measure, i.e. whether the evaluation tools accurately capture the intended learning outcomes or constructs. Several types of validity can be assessed, including content validity, criterion-related validity and construct validity. Examining content validity entails determining whether the assessment items accurately reflect the content domain. Criterion-related validity refers to comparing student performance against outside standards such as standardised examinations or professional judgements. Construct validity measures the degree to which the assessment results match theoretical constructs or concepts.

Cross-validation: This technique is used to confirm that assessment models or findings are accurate and generalisable. The dataset is divided into several subsets; the model is trained on one subset and then validated on the remaining subsets. This procedure helps in analysing the model's reliability and generalisability by gauging how well it performs on fresh or previously unseen data.

It is crucial to remember that rigorous result analysis and validation should be carried out while taking the context and procedural constraints of the assessment process into account. Researchers and educators can increase the validity, reliability and credibility of student performance evaluations by thoroughly analysing and validating the findings, which will result in better decision-making and educational outcomes.
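As a small illustration of the cross-validation procedure described above (a sketch only, assuming scikit-learn; the synthetic data stands in for the prepared, resampled dataset), stratified k-fold cross-validation can be run as follows:

# Sketch of stratified 10-fold cross-validation, assuming scikit-learn.
# The synthetic data is a placeholder for the prepared (resampled) dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=6, random_state=0)

# Stratification keeps the class proportions roughly equal in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X, y, cv=cv, scoring="accuracy")
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))

The spread of the fold scores gives a quick indication of how stable a reported accuracy is likely to be on unseen data.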
Table 1.1: Accuracy Analysis (accuracy, %)

Algorithm              Original   Consistent   Resampled
BayesNet               53.2675    54.9618      68.7023
Logistic               39.0704    38.9313      77.0992
MultilayerPerceptron   45.0481    48.0916      72.5191
SMO                    43.6044    44.2748      71.7557
Lazy.IBk               33.8779    41.9847      71.7557
DecisionStump          45.8333    43.5115      44.2748
RandomForest           48.6196    54.9618      71.7557
RandomTree             41.5653    37.4046      67.9389

Table 1.2: Error Analysis (error, %)

Algorithm              Original   Consistent   Resampled
BayesNet               46.7325    45.0382      31.2977
Logistic               60.9296    61.0687      22.9008
MultilayerPerceptron   54.9519    51.9084      27.4809
SMO                    56.3956    55.7252      28.2443
Lazy.IBk               66.1221    58.0153      28.2443
DecisionStump          54.1667    56.4885      55.7252
RandomForest           51.3804    45.0382      28.2443
RandomTree             58.4347    62.5954      32.0611

Table 1.3: TP Rate Analysis

Algorithm              Original   Consistent   Resampled
BayesNet               0.533      0.55         0.687
Logistic               0.391      0.389        0.771
MultilayerPerceptron   0.45       0.481        0.725
SMO                    0.436      0.443        0.718
Lazy.IBk               0.339      0.42         0.718
DecisionStump          0.458      0.435        0.443
RandomForest           0.486      0.55         0.718
RandomTree             0.416      0.374        0.679

Table 1.4: FP Rate Analysis

Algorithm              Original   Consistent   Resampled
BayesNet               0.234      0.235        0.161
Logistic               0.305      0.332        0.133
MultilayerPerceptron   0.275      0.296        0.166
SMO                    0.282      0.321        0.168
Lazy.IBk               0.331      0.337        0.155
DecisionStump          0.271      0.359        0.35
RandomForest           0.257      0.264        0.167
RandomTree             0.292      0.341        0.187

Table 1.5: ROC Area Analysis

Algorithm              Original   Consistent   Resampled
BayesNet               0.706      0.696        0.816
Logistic               0.581      0.574        0.803
MultilayerPerceptron   0.663      0.662        0.851
SMO                    0.647      0.627        0.793
Lazy.IBk               0.53       0.539        0.796
DecisionStump          0.55       0.553        0.544
RandomForest           0.677      0.655        0.898
RandomTree             0.57       0.539        0.77

Table 1.6: Resampled and Ensembled Data with W-saw and L-saw

Model                              Accuracy (%)   Error (%)   TP      FP      ROC     W-saw   L-saw
BayesNet                           68.7023        31.2977     0.687   0.161   0.816   9.15    3.22
Logistic                           77.0992        22.9008     0.771   0.133   0.803   9.15    3.22
MultilayerPerceptron               72.5191        27.4809     0.725   0.166   0.851   9.15    3.22
SMO                                71.7557        28.2443     0.718   0.168   0.793   9.15    3.22
Lazy.IBk                           71.7557        28.2443     0.718   0.155   0.796   9.15    3.22
DecisionStump                      44.2748        55.7252     0.443   0.35    0.544   9.15    3.22
RandomForest                       71.7557        28.2443     0.718   0.167   0.898   9.15    3.22
RandomTree                         67.9389        32.0611     0.679   0.187   0.77    9.15    3.22
Voting (Logistic + Random Forest)  77.0992        22.9008     0.771   0.133   0.916   9.15    3.22

Graphs for Analysis of Ensembling and Resampled Data
Fig 1.4: Graph of accuracy, error, TP rate, FP rate, ROC and W-saw for the resampled and ensembled models.
Fig 1.5: Graph comparing accuracy, error, TP rate, FP rate and ROC across the models of Table 1.6.

Conclusion and Summary

In conclusion, student performance evaluation is a crucial part of the educational process, since it enables teachers to evaluate their students' academic progress, pinpoint their areas of strength, and create efficient interventions. Through the literature review we have learned the value of student performance evaluation datasets in identifying the variables affecting student performance and informing educational practices. The result analysis and validation processes are extremely important in guaranteeing the accuracy, reliability and validity of student performance evaluation. Statistical analysis aids both data interpretation and the identification of patterns and relationships. Validity analysis guarantees that the assessment methods effectively measure the targeted learning outcomes, whereas reliability analysis provides consistency and stability in the evaluation measures. Cross-validation and expert opinion also aid the validation process.
A more thorough and all-encompassing picture of student performance can be obtained through further research and the development of novel assessment techniques, such as performance-based assessments, project-based assessments and authentic assessments. Research and development in student performance evaluation are continuous: by advancing assessment techniques, incorporating technology, conducting longitudinal studies, adopting multidimensional evaluation approaches and addressing ethical issues, we can improve the efficacy and fairness of student performance evaluation, which will ultimately support students' academic success and growth.

Organization of the Report

Chapter 1, Problem Identification: This chapter introduces the project and describes the problem statement discussed earlier in the report.
Chapter 2, Literature Review: This chapter presents a review of various research papers which help us to understand the problem in a better way. It also describes what has already been done to solve the problem and what can be done further.
Chapter 3, Design Flow/Process: This chapter presents the need for and significance of the proposed work based on the literature review. The proposed objectives and methodology are explained, the relevance of the problem is presented, and a logical and schematic plan to resolve the research problem is laid out.
Chapter 4, Result Analysis and Validation: This chapter explains the various performance parameters used in the implementation. Experimental results are shown, along with what the results mean and why they matter.
Chapter 5, Conclusion and Summary: This chapter concludes the results, explains the best method for performing this research to obtain the best results, and defines the future scope of the study, i.e. the extent to which the research area will be explored in further work.

Team Roles

Member Name     UID           Roles
Mukul Saini     22BCS14902    • Collection and making of the dataset
                              • Clustering and distribution of the dataset
                              • Visualisation of the dataset
Nishant Bamal   22BCS15558    • Collection of the dataset
                              • Visualisation of the dataset
                              • Testing and training of the dataset