Using Machine Learning to predict the completion of a learner’s Undergraduate Science Degree based on their First-year Marks Prince Ngema Supervised by DR Ritesh Adjhooda and DR Ashwini BSC Hons Computer Science - 2019 The University of Witwatersrand Course: COMS4030A- Adaptive Computation and Machine Learning Abstract— Advances in the field of Machine learning due to the increased usage of computers and the availability of data has led to major improvements in domain of ”student performance prediction”. Machine Learning can be used in the development of models that can predict students’ performance. Predicting the students’ performance is a taxing task because a student academic performance is dependant on a number of factors. How a student performs in their first year of study may give a vivid idea of where the student is headed academically. We can therefore, using machine learning techniques to leverage these marks and predict the performance of students. The focus of this study, taking a conceptual framework proposed by Spady [1970] as a rationale is to identify the optimal Machine Learning algorithm for predicting the completion of a learner’s Computer science degree using marks obtained in their first year of study at a South African University. We also focus on ranking the predictive power of the courses taken in first year and we also provide an interactive program that is able to predict the completion of the student degree. The ML algorithms used are J48 tree, K-Star, Naive Bayes, Multilayer perceptron, Support Vector Machines(SVM) and the Logistic Regression. The results of the algorithms were compared using 10-fold cross validation method in terms of prediction accuracy, precision, recall, AUROC curves and computation time. Results show that... This study also revealed that previous knowledge of Mathematics and Physics both at Ordinary level and 100 level are essential determinants of students’ performance in a computer programming course. index Terms– Image classification, image histogram, support vector machines, RBF kernels. 1. INTRODUCTION Many a time, admittance into a University programme is a transformative experience for students, it gives them and their families hope for a lustrous future. Alas, some students who are accepted into the university programme fail to complete their degree due poor academic performances. These students are left lamenting and drowning in debt accumulated during their years of study. In attempts to remedy this situation , many researchers from different walks of life have developed models to predict students’ academic performance. Studies show that the biggest attrition rate occurs at fresh-man level. In South Africa, 29% of students drop out after doing their fresh-man year studies and just 30% of fresh-man students graduate after five years [Scott et al. 2007]. This study is based on Computer Science learners at a South African Higher-Education Research-Intensive Institution between the years 2008 and 2017. These learners are categorized into three groups: ”Completed”, this where a learner completes their degree in minimum time ( 3 years); ”Delayed”, this when the learner completes their degree in more than 3 years; ”Failed”, this where the learner drops out and does not complete their degree. At this institution, they have been 21% of learners categorised as Completed; 24% as Delayed; and 55% as Failed. In light of these alarming figures, higher learning institutions are in need programs that will improve student retention rates and graduation rates. These programs aim to improve the performance of students by looking at students’ previous performance and using this knowledge to predict how a student is likely to perform in their field of study. [Yadav et al. 2012]. Previous work in field of predicting student performance has seen researchers employ different techniques like machine learning, statistical analysis and data mining in attempts of building models to do the prediction. Spady [1970] proposed proposed that the student’s decision to stay and complete their degree or withdraw from the academic institution is influenced by Grades, intellectual development, normative congruence and friendship support. He grouped these factors under 2 systems: Grades and intellectual development under the Academic system; and normative congruence and friendship support under Social system. This study lies squarely within the academic system. Grades obtained during a student’s first-year of study are taken as the predictor variable. Following the conceptual framework of Spady [1970], we define several features associated with first-year marks that will be used to classify the student into three completion profiles: ”Completed”, ”Delayed”, and ”Failed”. During their first year of study, a student must have three majors and one elective module. These four constitute our feature space along with the student’s degree completion outcome and the aggregate of all the marks obtain. Different machine learning classifiers were used to predict the performance of a student. These classifiers are : J48, K ∗ , Support Vector Machines, Logistic Regression; Naive Bayes; and Multilayer perceptron. The goal of this study is therefore , exploring the possibility of applying these machine learning algorithms to predict the completion of a student’s Computer Science degree based on first-year marks and investigating the value of using first-year marks as predictor variables. The research question can be seen as two fold: • Are the above mentioned algorithms capable of predicting students’ performance? • Is using first-year marks as a predictor variables in a student performance prediction task viable? Using 10-fold cross validation method, Accuracy, AUROC values and time taken to execute were used to scale the performance of these models. Information gain was used to rank features according to their predictive power. Based on the experiments it is found that the accuracy level of the classifiers range between 52% and 70%. The J48 achieved the biggest recorded accuracy. The AUROC values vary between 0,7 to 0,8. The J48 algorithm had the highest AUROC value. The J48 algorithm also took the least time to execute. In terms of Information gain, MajorOne is the highest ranked feature, that is, MATHS 1 is the module with the highest deterministic power. An application software which uses the J48 predictive model to predict the completion of a learner’s Computer Science degree has been prepared. The stake holders of this application are: • Computer Science Students: Students who want to know their chances of completing their degrees. • Companies: Companies that offer scholarship or bursaries to students might want to use this application during the selection process. • The institution: The institution The reasons are to identify the student at risk of attrition early enough in order to provide necessary support and intervention for them with the goals of reducing attrition, increasing retention, performance and graduation rate Aljohani [2016]. A diagrammatic representation of the goals of predicting student performance is shown in Fig1 There are three major general contributions of this paper: (a) an interactive program which is able to predict the completion of a student’s Computer Science degree (b) a comparison of various classification models to classify learner instances into these four completion outcomes; and (c) the first trained classifier able to calculate the probability of a learner completing their degree at a South African University focused on the the conceptual framework of Spady [1970]. This document is structured as follows....................... 2. R ELATED WORK Many theoretical models (conceptual frameworks) on student performance have been developed to date. In this study we will adopt a conceptional framework proposed by Spady [1970] as a rationale to predict the completion of a Computer Science degree using first-year marks. To explain Fig. 1: A diagrammatic presentation of the goals of predicting student performance the attrition process, Spady [1970] investigated the calibre of reciprocity between the students and the environment of their academic institutions. He asserted that this interaction is a consequence of the exposure of individual students’ attributes such as dispositions, interests, attitudes and skills to the influences, expectations and demands of the different components of their institutions including courses, faculty members, administrators and peers. His key premise was that the effect of this reciprocity determines the level of students’ integration within the academic and social systems of their institutions and subsequently their tenacity. He further postulated that the student’s decision to stay and complete their degree or withdraw from the academic institution is impacted by two chief factors in each of the two systems as seen in Figure 1. These factors are grades and intellectual development (Academic system) and normative congruence and friendship support (social system) [Spady 1970]. Studies to predict students academic performance have been made over the years. Various techniques used to predict students’ performance like data mining, statistical analysis and machine learning have been employed. In this section, we review some of the work previously done in this field. In this paper student performance measures or indicates whether a student was able to complete their degree. A study to identify at risk-students in Mathematical Sciences using biographical data and enrollment observations to was conducted by researchers at a South African University [Ajoodha and Jadhav 2017]. The basic methodology was to indicate influence of four biographical characteristics (i.e. gender, spoken home language, home province, and race description) on student aggregates, explore the trajectory of student performance over the period 2008 to 2017 respect to biographical characteristics and calculated the posterior probability of Fig. 2 The use of grades as a factor that affects students’ performance has been investigated by many researchers. In all of the studies reviewed in Table??, researchers used grades as failing to complete the minimum requirements given various biographical profiles using Bayesian analysis. The results showcased at-risk biographical profiles with a Bayesian estimate that was greater than 0.7 for failing to complete the requirements for a degree in the Mathematical Sciences. performs better than the other the machine learning algorithms. Comparing the performance of these algorithms on engineered data and raw data, better performance results were obtained on engineered data. Oyelade et al. [2010] used 6 learning algorithms namely K-Nearest Neighbour, Neural networks, naive Bayes Algorithm, SVM and other machine learning algorithms to predict using first-year marks the completion of a student’s degree and other contributing factors. They also compared these learning algorithms and concluded that the Naive Bayes algorithm performs better and its accuracy is satisfactory. [Oyelade et al. 2010]. Cortez and Silva [2008] investigated students’ performance using machine learning algorithms including SVM and NB. They stressed out the point that grades are a key feature in predicting students’ performance. They came to this conclusion when they employed the Decision tree algorithm, students grades were the root node which clearly emphasizes the significance of marks in predicting outcome of a student. They also concluded that the NB algorithm out performs the SVM algorithm. A similar study was conducted by Butcher and Muth [1985], they used American Collage Testing Program(ACT) test scores along with performance in high school and information regarding the students’ programs to predict how students will perform will perform in an Computer Science course and semester one’s collage average of their grade points. They took the statistical analysis approach and used the Statistical system (SAS) to perform all statistical analyses. They concluded that semester one’s marks may indeed be used to predict students’ performance. Daud et al. [2017] investigated students’ performance prediction methods. The prediction methods investigated included the NB , SVM and Bayesian networks. These were applied to student data and the performance of each classifier was measured using the F-measure method. After carrying out experiments, they came to the conclusion that the SVM out performs the other classifiers. This is a clear contradiction with what Cortez and Silva [2008] found. In Cortez and Silva [2008], it was concluded that the NB algorithm performs better than the SVM classifier. Another study was done by Pojon [2017], they investigated students’ performance using ML. They used various ML techniques(Linear regression, Decision tress, naive Bayes’ classifier) and compared their performance. Feature engineering methods were used to better the performance of the prediction model. They found that the NB classifier Bydžovská [2016] having the grade average of a student as one of their data attributes compared ML techniques for predicting student performance. They employed SVM , Random forests, Naive Bayes and decision trees to predict the students’ performance. Using the F-measure method to measure the performance of each algorithm, the SVM was ranked as number one, out performing all the other algorithms. These results are in line with what Daud et al. [2017] found. 3. M ETHODOLOGY In this study we attempt to predict the completion of a student’s Computer Science degree using first-year marks. In simpler terms, we are trying to answer the following question, Will a student in question complete their computer science degree? The are three possible outcomes: “Completed” , the student is expected to complete their degree in minimum time (3 years); ”Failed”, the student is expected not to complete their degree; and “Delayed”, the students successfully completes their degree but not in minimum time. Several machine learning classification models to deduce a learner into one of three categories Completed, Delayed or Failed will be employed. The best performing algorithm will be used to create an application that will be used to predict the outcome of a student. To gauge the performance of the trained classification models, confusion matrices and AUROC curves will be used. This section is structured as follows: A brief description of the data collection procedure which includes information concerning the ethics clearance certificate details is given firstly. Secondly, preprocessing steps taken to prepare the data for this research objective are outlined. Thirdly, a brief descriptions of each machine learning classifier to be used is given . Fourthly, the feature analysis process is presented and finally brief descriptions of the evaluation metrics used to gauge the performance of the classifiers. Fig. 3: Proposed Methodology of Classification Model A. Data collection and Ethics The data utilized in this study was obtained from a South African University. The data consists of biographical, student marks from first year to third , all postgraduate marks and enrollment observations of students from the Faculty of Science doing Mathematical Science Degrees. The study participants are students who studied anytime between the years 2008 to 2017 at a research-concentrated South African university. This study was approved by the Human Research Ethics Committee of the University. The committee also TABLE I: Systematic Literature Review Study Study Purpose Attributes used Models used Findings Pavlou et al. (2007) Trust A buyer’s intentions to accept vulnerability based on her beliefs that transactions with a seller will meet her... Trust, web site asvxngs Pavlou et al. (2007) Trust A buyer’s intentions to accept vulnerability based on her beliefs that transactions with a seller will meet her... Trust, web site asvxngs Variables Description Type Major One Aggregate of final Calculus 1 and Algebra 1 marks Nominal Major Two Aggregate of all Coms 1 marks Nominal Major Three Computation and Applied Mathematics orInformation Nominal Systems Final Marks Marks of any Level 1 course Elective Nominal (i.e. Physics) Aggregate Aggregate of all marks obtained Nominal Progress Outcome Outcome of a student in their final year Categorical Variable encoding values 4 = 70% - 100% 3 = 60% - 69 % 2 = 50 % - 59% 1 = 0% - 49% 4 = 70% -100% 3 = 60% - 69% 2 = 50% - 59% 1 = 0% - 49% 4 = 70% - 100% 3 = 60% - 69% 2 = 50% - 59% 1 = 0% - 49% 4 = 70% - 100% 3 = 60% - 69% 2 = 50% - 59% 1 = 0% - 49% 4 = 70% - 100% 3 = 60% - 69% 2 = 50% - 59% 1 = 0% - 49% Completed - Yes Failed - No Delayed - NRT TABLE II: The Students data set description tackled ethical issues of protecting the identity of the students participating in the research and ensuring the security of data. Data preparation First year Computer Science students undertake three major courses and one elective. The data taken from the AISU , contains the marks obtained by each student during their first year. Using this data, the features that will help us predict the completion of the degree were engineered. These features are MajorOne, MajorTwo, MajorThree, Elective, Aggregate and ProgressOutcome. The first four attributes are the predictors and Progress outcome is the target attribute. The target attribute contains three classes, Completed , Delayed, and Failed. Table IV shows the description of these attributes. Data preprocessing Some entries in the data contained missing values. This is due to a number of reasons. The most common reason being that some students did not enrol for some of the courses. To combat this problem, entries with missing values where removed. According to figure 4, there is a huge gap between the number of students who are labeled as ’Failed” and those those who are labelled ”Completed” or ”delayed. The data set contains a total number of 624 instances. The classes are imbalanced , this may lead to reduced accuracy.To combat this, we undersampled the data using the spreadsubsample filter on WEKA. The final data contains 393 instances with 131 in each class. Attribute ranking The goal of attribute ranking is to determine the predictive power of each variable in our feature set. To achieve this, the Information Gain Ranking algorithm will be used. The IGR algorithm calculates the information gain for each feature Major One 3 2 3 4 3 1 Major two 3 1 3 2 4 1 Major three 4 3 4 3 2 1 Elective 3 2 4 2 2 2 Aggregate 3 2 3 3 4 1 Outcome Completed Failed Completed Delayed Delayed Failed TABLE III: The resulting data set after preprocessing Completed Delayed 21% 24% 55% Failed Fig. 4: The percentage of Delayed, Completed and Failed instances contained in the data set with respect to target feature. Information gain measures how much information a feature gives us about a class. The values of Information gain (entropy) are between 0 and 1, that is, the minimum entropy is 0 and the maximum entropy is 1. Classification Algorithms In this study, the machine learning algorithms employed for the classification process were K-Star, Naive Bayes, SVM, Decision tree, Logistic Regression, and the Multi-layer Perceptron. k-Star:.The K* instance-based classifier uses an entropy-based distance function to classify test instances using the training instance most similar to them. The K* implementation used in this paper closely followed the implementation by Clearly and Trigg (1995). Using an entropy-based distance function allows consistency in the classification of real-valued and symbolic features found in our experiments. Naive Bayes: For prediction problems, this is the most used algorithm [Pojon 2017]. It is beloved for its pragmatic approach to machine learning problems. It uses Bayes’ theorem to classify instances to one or a number of independent classes using probabilistic approach [Koller et al. 2009]. It is the easiest learning algorithm to implement [Pojon 2017]. It attempts to find (provided that all the features are conditionally independent given the class label of each instance) the likelihood of features occurring in each class and take the class with the largest posterior probability as our predicted class. The main assumption is that all features are conditionally independent given the class label of each instance. SVM: Support Vector Machines (SVM) is supervised ML algorithm which was initially developed for binary class classification cases [Madzarov et al. 2009]. However, it be extended to classification cases with more than two classes by breaking them to classification cases with only two classes [Madzarov et al. 2009]. The idea is to find a multi-dimensional hyperplane that will best divide the dataset into two classes. Test instances are then mapped on the same space and predicted based on which side of the hyperplane they fall on. SVMs can be scaled for nonlinear and high-dimensional classification problems by employing the kernel trick and one to many partitioning. J48: The J48 algorithm , a successor of ID3 was developed bY Ross Quinlan and is implemented in WEKA using Java [Bashir and Chachoo 2017]. The decision tree’s greedy top-down approach is adopted by this algorithm. It is used for classification in which new instance is labelled according to the training data.In this paper, we will use this algorithm to classify a student into one of the 3 categories ( completed, not completed, NRT). Logistic Regression: The Linear Logistic Regression model predicts probabilities directly by using the Logit transform. The implementation in this paper follows Sumner et al. [2005]. Multi-layer Perceptron: B. Model Evaluation To gauge the performance of the trained classification models, confusion matrices and Receiver Operating Characteristic curves will be used. From the confusion we will derive performance measure metrics from like Precision, Recall and Accuracy and from the ROC curve we will derive the AUC (area under the ROC curve metric). From the confusion matrix depicted in Fig 5, True Positive (TP) gives the proportion of positive entries that are accurately identified, False Positive (FP) is the proportion of positive entries that are classified as negative, False Negative (FN) is the fraction of negative entries that are classified as positive and True Negative (TN) is the number of negative entries correctly classified. A. Feature Ranking Predicted Values True Positive False Negative False Positive True Negative One important part of the study is feature selection, which is used to rank the predictor variables according to the strength of their relationship with dependent or outcome variable, which is completion outcome in our case. Information gain can be used to rank the predictor variables according to the strength of their relationship with the outcome variable. To rank the predictor variables in terms of their predictive power, the IGR algorithm was used. Table 3 depicts the ranking of the features using information gain. Column 1 indicates the rank of each feature , column 2 indicates the feature names and column 3 indicates the entropy (information gain). The attribute ranking with respect to the class label actual values Fig. 5: Confusion Matrix Featur Accuracy which is the proportion of correct predictions is calculated as follows: TN + TP Accuracy = (1) TP + TN + FP + FN The higher the accuracy, the better the model’s performance. Precision which is the capacity of a model not to classify a positive instance as negative. It is calculated as follows : (2) The higher the precision, the better the model’s performance. Recall which is ability of a classifier to find all positive instances is calculated as follows: TP Recall = (3) TP + FN The higher the Recall, the better the model’s performance. Receiver Operating Characteristic (ROC) is a curve with true positive rate is on the Y axis and false positive rate on the X axis that visualizes the tradeoff between the model’s sensitivity and specificity. Using this curve we can calculate the AUC (Area under the ROC curve) metric. The AUC of a model tells us about the classification model’s discriminatory capabilities, that is, the ability of a model to discriminate between classes. The higher the AUC of a model, the better the performance of the model. C. The Software Application Methodology 4. R ESULTS AND D ISCUSSION In this study, Six classification models were employed with the purpose of predicting the completion of a learner’s undergraduate degree based on their first-year marks. This section presents the results obtained when these classification models were evaluated using Accuracy, Precision, Recall, AUC and Execution time. Rank Aggregate 0.308 1 Elective 0.104 5 MajorOne 0.294 2 MajorTwo 0.189 3 MajorThree 0.186 4 TABLE IV: Feature Ranking using gain ratio criteria shows that students with very good background knowledge in MAT 103, MAT 102, PHY 101, PHY 102 and MAT 101 will perform very well in computer programming courses. These courses are calculation-intensive and they require very sharp brains that can think very fast. Computer programming is one of the courses that involve developing resourceful algorithms and the ability to transform these algorithms to efficient working software. Sometime, these algorithms are mathematically based. The results of this work validate this fact Rank vs Entropy Entropy 0.35 0.3 Entropy TP P recision = TP + FP Entropy 0.25 0.2 0.15 0.1 5 · 10−2 1 2 3 4 5 Feature Rank The attribute ranking with respect to the class label using gain ratio criteria shows that students with very good background knowledge in MAT 103, MAT 102, PHY 101, PHY 102 and MAT 101 will perform very well in computer programming courses. These courses are calculation-intensive and they require very sharp brains that can think very fast. Computer programming is one of the courses that involve developing resourceful algorithms and the ability to transform these algorithms to efficient working software. Sometime, these algorithms are mathematically based. The results of this work validate this fact. Classification Model Evaluation Different classifiers whose results vary depending on efficiency were employed. In this paper the following algorithms were employed: Naive Bayes, SVMs, Decision trees, K* , Multilayer- perceptron and Logistic Regression. To gauge the performance of these classifiers, we used confusion matrices and ROC curves. The confusion matrix of each model is depicted in Fig 6.In addition, table 4 shows the execution time (in seconds) used by each classifier when building its model for the training data. J48 tree used the shortest time (0.02 seconds) for classification while CART and BF Tree used the same time (0.22 seconds) for their classification. Therefore, J48 tree also has the least execution time among the three algorithms under investigation. Comparison was also made based on ROC Curves and AUROC. From Figure 5d, it is indicated that the J48 model is the best performer in terms of the area under the ROC curve (AUROC) It achieved an AUROC value of 0.8645 using 10-fold cross validation. With the exception of the K* and naive Bayes, compared to the other four classification models employed in this paper the J48 took the least time to build. Figure 5a illustrates the ROC curve for the SVM model which achieves an AUROC of 0.754 using 10-fold cross validation. Furthermore, With the exception of the Multi-layer perceptron, compared to other four classifications, the SVM classification model took the longest time to build. Figure 5b illustrates the ROC curve for the naive Bayes model which achieves an AUROC of 0.783 using 10-fold cross validation.With the exception of the K* classification model, from the other five models employed in this paper Naive Bayes took the least time to build. Figure 5c illustrates the ROC curve for the Multi-layer perceptron model which achieves an AUROC of 0.7867 using 10-fold cross validation. The Multi-layer perceptron model took the longest time to build as compared to any model employed in this study. Figure 5e illustrates the ROC curve for the K ∗ model which achieves an AUROC of 0.7852 using 10-fold cross validation. The k ∗ model took the least time to build as compared to any model employed in this study. Figure 5f illustrates the ROC curve for the Logistic Regression model which achieves an AUROC of 0.7764 using 10-fold cross validation. With the exception of the SVM and the Multi-layer perceptron classification models, the Logistic Regression Model took the longest time to build. Algorithm Naive Bayes SVM J48 Logistic Regresion Multi-layer Perceptron K-Star Accuracy (%) 52.013 58.015 70.125 60.560 54.199 57.506 Precision 0.571 0.581 0.690 0.605 0.548 0.548 Recall 0.583 0.580 0.692 0.606 0.542 0.542 Execution time(sec) 0.10 0.40 0.09 0.20 0.56 0.01 TABLE V (a) A confusion matrix describing the performance of the Naive Bayes predictive model. The NB model achieves an accuracy of 52%, a precision value of 0.58 and a recall value of 0.58. Out of the 393 instances, 225 were correctly classified and only 168 were classified incorrectly. Actual Completed Failed Delayed 74 4 40 9 48 102 25 29 62 (b) A confusion matrix describing the performance of the Logistic Regression predictive model. The NB model achieves an accuracy of 52%, a precision value of 0.58 and a recall value of 0.58. Out of the 393 instances, 225 were correctly classified and only 168 were classified incorrectly. (c) A confusion matrix describing the performance of the SVM predictive model. The SVM model achieves an accuracy of 58%, a precision value of 0.60 and a recall value of 0.60. Out of the 393 instances, 228 were correctly classified and only 165 were classified incorrectly Completed Failed Delayed Delayed 47 29 55 Failed 7 96 26 Completed Delayed 77 6 50 Actual Failed Predicted Completed Actual Predicted Completed Failed Delayed 70 8 43 9 94 26 52 29 62 (d) A confusion matrix describing the performance of the K-star predictive model. The K-star model achieves an accuracy of 52%, a precision value of 0.58 and a recall value of 0.58. Out of the 393 instances, 225 were correctly classified and only 168 were classified incorrectly. (e) A confusion matrix describing the performance of the J48 predictive model. The SVM model achieves an accuracy of 70%, a precision value of 0.69 and a recall value of 0.69. Out of the 393 instances, 272 were correctly classified and only 121 were classified incorrectly Delayed Actual Completed Failed Delayed Failed 7 32 104 18 25 76 Completed 92 9 30 Delayed Failed Predicted Completed Actual Predicted Completed Failed Delayed Delayed Failed 13 59 102 80 33 64 Completed 59 1 34 Delayed Failed Completed Failed Delayed Predicted Completed Actual Predicted 69 18 56 8 91 22 54 22 53 (f) A confusion matrix describing the performance of the Multilayer Perceptron predictive model. The Multilayer perceptron model achieves an accuracy of 52%, a precision value of 0.58 and a recall value of 0.58. Out of the 393 instances, 225 were correctly classified and only 168 were classified incorrectly. Fig. 6: A set of confusion matrices describing the performance of several predictive models on a set of test data. Each predictive model’s accuracy and indicated along with the correctly and incorrectly classified instances R EFERENCES Ritesh Ajoodha and Ashwini Jadhav. Identifying at-risk undergraduate students using biographical and enrollment observations for mathematical science degrees at a south african university. Private Communication, 1(1):1–21, nov 2017. Othman Aljohani. A comprehensive review of the major studies and theoretical models of student retention in higher education. Higher education studies, 6(2):1–18, 2016. Uzair Bashir and Manzoor Chachoo. Performance evaluation of j48 and bayes algorithms for intrusion detection system. (a) ROC curve showing the tradeoff between sensitivity and the specificity of the K-star predictive classification model. The AUC of the K-star is 0.69 (b) ROC curve showing the tradeoff between sensitivity and the specificity of the SVM predictive classification model. The AUC of the SVM predictive classification model is 0.72 (c) ROC curve showing the tradeoff between sensitivity and the specificity of the MLP predictive classification model. The AUC of the MLP predictive classification model is 0.72 (d) ROC curve showing the tradeoff between sensitivity and the specificity of the Logistic Regression predictive classification model. The AUC of the Logistic Regression predictive classification model is 0.71 (e) ROC curve showing the tradeoff between sensitivity and the specificity of the Naive Bayes predictive classification model. The AUC of the Naive Bayes predictive classification model is 0.68 (f) ROC curve showing the tradeoff between sensitivity and the specificity of the J48 predictive classification model. The AUC of the J48 predictive classification model is 0.75 Fig. 7: Figures illustrating the ROC curves of the predictive classification models. AUC values of each class and the average AUC value of the predictive classification models are given. Int. J. Netw. Secur. Its Appl, 2017. DF Butcher and WA Muth. Predicting performance in an introductory computer science course. Communications of the ACM, 28(3):263–268, 1985. Hana Bydžovská. A comparative analysis of techniques for predicting student performance. International Educational Data Mining Society, 2016. Paulo Cortez and Alice Maria Gonçalves Silva. Using data mining to predict secondary school student performance. 2008. Ali Daud, Naif Radi Aljohani, Rabeeh Ayaz Abbasi, Miltiadis D Lytras, Farhat Abbas, and Jalal S Alowibdi. Predicting student performance using advanced learning analytics. In Proceedings of the 26th International Conference on World Wide Web Companion, pages 415–421. International World Wide Web Conferences Steering Committee, 2017. Daphne Koller, Nir Friedman, and Francis Bach. Probabilistic graphical models: principles and techniques. 2009. OJ Oyelade, OO Oladipupo, and IC Obagbuwa. Application of k means clustering algorithm for prediction of students academic performance. arXiv preprint arXiv:1002.2425, 2010. Murat Pojon. Using machine learning to predict student performance. Master’s thesis, 2017. Ian Scott, Nanette Yeld, and Jane Hendry. Higher education monitor: A case for improving teaching and learning in South African higher education, volume 6. Council on Higher Education Pretoria, 2007. William G Spady. Dropouts from higher education: An interdisciplinary review and synthesis. Interchange, 1(1): 64–85, 1970. Marc Sumner, Eibe Frank, and Mark Hall. Speeding up logistic model tree induction. In European conference on principles of data mining and knowledge discovery, pages 675–683. Springer, 2005. Surjeet Kumar Yadav, Brijesh Bharadwaj, and Saurabh Pal. Mining education data to predict student’s retention: a comparative study. arXiv preprint arXiv:1203.2987, 2012. 5. ACKNOWLEDGEMENTS This research was funded by EPSRC grant EP/N035437/1.