Mining Educational Data Using K-Means Clustering

Pratiyush Guleria#1, Manu Sood#2
#Department of Computer Science, Himachal Pradesh University, Shimla, Himachal Pradesh, India
1pratiyushguleria@gmail.com
2soodm_67@yahoo.com

ABSTRACT
With a competitive environment prevailing among academic institutions, the challenge is to improve the quality of education through data mining. Student performance is of great concern in higher education. In this paper we use K-means clustering to perform pattern classification and to cluster students according to their class performance, sessional marks and attendance records. The results generated using data mining techniques help faculty members make effective decisions and focus on students who obtain poor class results. We also analyse the performance of the K-means algorithm for different numbers of clusters and the iterations it performs. After data collection, the data is pre-processed: it is cleaned and transformed into a format appropriate for mining.

Keywords: Clustering, Classification, Data Mining, Decision.

1. INTRODUCTION
Data mining plays an important role in educational systems, where education is considered one of the key inputs for social development [1]. The main challenge for institutions is to analyse in depth their performance in terms of student performance, teaching skills and academic activities. Class performance and sessional marks are important factors for analysing and predicting a student's class result. Many data mining techniques, such as K-means clustering, decision trees, neural networks, nearest neighbour and Naive Bayes, are used in educational data mining. Using these methods, many kinds of knowledge can be discovered, such as classification, clustering and association rules, which can help increase the quality of education [2]. Classification is a supervised learning technique in which a training data set is the input to the classifier [3]. Clustering is the unsupervised classification of patterns into clusters [4]. The K-means clustering algorithm uses the Euclidean distance measure, where the distance between two records is computed by squaring the difference between each pair of scores, summing the squares and taking the square root of the sum [5]. Using data clustering, we extract previously unknown and hidden patterns from large data sets [6]. The conventional models used for classification are decision trees, neural networks, statistical techniques and clustering techniques [7]. Fig. 1 shows the process of data mining using WEKA, an open-source tool that provides a collection of machine learning and data mining algorithms for data pre-processing, classification, regression, clustering, association rules and visualization [8]. In this process, the educational dataset is loaded into the WEKA tool and the results are interpreted and evaluated using simple K-means clustering. The educational dataset consists of the attributes listed in Table 1, which are used to predict the class result using classification techniques.

Fig. 1 Data Mining Using WEKA Tool (Pre-process Data -> Data Mining: Extract Patterns/Models -> Simple K-Means Clustering)
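As a small illustration of the Euclidean distance measure mentioned above (our own example, not part of the original study), the following MATLAB snippet computes the distance between two hypothetical students' score vectors:

    % Two hypothetical students: [Class Performance, Sessional, Attendance]
    p = [78 79 89];
    q = [52 50 45];
    % square each difference, sum the squares, take the square root
    d = sqrt(sum((q - p).^2));
    % norm(q - p) returns the same value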
2. DATA MINING PROCESS

2.1 Data preparation
We initially collected a data set of 120 graduate students pursuing B.Tech in Information Technology at one of the educational institutions.

2.2 Data selection
We analysed the class result of these students using five attributes of the student record: Class Performance, Attendance, Assignments, Lab Work and Sessional Performance (aggregate of the first and second sessionals). The model of a student's class result is predicted after analysing performance in these attributes. The attributes and the educational data set are given in Table 1 and Table 2.

Table 1. ATTRIBUTES
Attribute | Description | Coding
ClassID | Roll number of the student | Numeric value
CLP | Class performance of the student | Excellent = above 80, Good = 70-80, Average = 60-70, Poor = below 60
Sessional | Aggregate of the 1st and 2nd sessionals | Good = 75-100, Average = 60-75, Poor = below 60
ASSGN | Assignments | "A" = 8-10, "B" = 6-7, "C" = 1-5
ATTD | Attendance | Good = 90% or above, Average = 75% or above, Poor = below 75%
LW | Lab work | Good = 20-25, Average = 15-20, Poor = below 15
CLR | Class result of the student | First = above 70%, Second = 60-70%, Third = 50-60%, Fail = below 50%

Table 2. EDUCATIONAL DATA SET (first five of the 120 records)
ClassID | CLP | Sessional | ATTD | ASSGN | LW | CLR
1 | 78 | 79 | 89 | 9 | 23 | 79
2 | 79 | 80 | 89 | 10 | 22 | 81
3 | 71 | 70 | 76 | 10 | 23 | 71
4 | 52 | 79 | 55 | 4 | 10 | 80
5 | 52 | 50 | 45 | 3 | 12 | 51
... | ... | ... | ... | ... | ... | ...

2.3 Implementation of the data mining model
In this paper we use the K-means algorithm, a cluster analysis algorithm used as a partitioning method and developed by MacQueen in 1967 [9]. The goal of the algorithm is to minimize the total distance between the objects in each cluster and the corresponding centroid. Cluster analysis, also called segmentation analysis, creates groups, or clusters, of data. Clusters are formed in such a way that objects in the same cluster are very similar and objects in different clusters are very distinct; the measure of similarity depends on the application. K-means clustering is a partitioning method: the function kmeans partitions the data into k clusters and returns the index of the cluster to which it has assigned each observation. K-means clustering operates on the actual observations and creates a single level of clusters, which makes it more suitable than hierarchical clustering for large amounts of data. It treats each observation as an object having a location in space and finds a partition in which objects within each cluster are as close to each other as possible, and as far from objects in other clusters as possible. Table 3 shows the basic differences between partitioning and hierarchical clustering.

Each cluster in the partition is defined by its member objects and by its centroid, or centre. The centroid of each cluster is the point to which the sum of distances from all objects in that cluster is minimized. kmeans computes the cluster centroids differently for each distance measure, so as to minimize the sum with respect to the measure specified. kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters; it moves objects between clusters until the sum cannot be decreased any further. The result is a set of clusters that are as compact and well separated as possible. The details of the minimization can be controlled through several optional input parameters to kmeans, including the initial values of the cluster centroids and the maximum number of iterations [10].

Table 3. PARTITIONING VERSUS HIERARCHICAL CLUSTERING
Partitioning clustering | Hierarchical clustering
1. Produces a single clustering. | 1. Can give different partitionings depending on the level of the hierarchy chosen.
2. Needs the number of clusters to be specified. | 2. Does not need the number of clusters to be specified.
3. Usually more efficient. | 3. Can be slow on large data sets.
4. Cluster membership is determined by calculating the centroid for each group and assigning each object to the group with the closest centroid. | 4. Existing groups are combined or divided, creating a hierarchical structure that reflects the order in which groups are merged or split.
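As a minimal sketch of the partitioning step described above (our own illustration, not the authors' code; it assumes the numeric Class Performance, Sessional and Attendance values of the 120 students have been collected in a 120-by-3 matrix X), the kmeans function can be called as follows:

    % X: 120-by-3 matrix, one row per student: [CLP Sessional ATTD]
    k = 3;                                     % number of clusters to form
    [idx, C] = kmeans(X, k, 'MaxIter', 100);   % idx: cluster index per student, C: k-by-3 centroids
    [~, weakest] = min(C(:, 1));               % cluster whose centroid has the lowest class performance
    weakStudents = find(idx == weakest);       % students who may need special attention

Optional parameters such as 'Start' (initial centroid positions) and 'MaxIter' (maximum number of iterations) correspond to the controls on the minimization mentioned above [10].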
3. CALCULATING K-MEANS
kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k clusters. The first phase uses batch updates: in each iteration all points are reassigned to their nearest cluster centroid at once, after which the cluster centroids are recalculated. This phase usually does not converge to a solution that is a local minimum. The second phase uses online updates: points are reassigned individually, only if doing so reduces the sum of distances, and the cluster centroids are recomputed after each reassignment. Each iteration of the second phase consists of one pass through all the points, and this phase does converge to a local minimum. Table 4 shows the steps for calculating K-means.

Table 4. ALGORITHM
Step 1: K different clusters are selected.
Step 2: Each cluster is associated with a centroid (centre point), typically the mean of the points in the cluster.
Step 3: The Euclidean distance of each object to each centroid is determined: d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
Step 4: Objects are grouped on the basis of minimum distance.
Step 5: Loop: assign each object to its closest centroid and recompute the centroid of each cluster.
Step 6: Stop when the cluster assignments no longer change.

4. RESULTS AND DISCUSSION
We applied the K-means clustering algorithm in MATLAB to the training data of Table 2 and grouped the students according to their class performance, sessional marks and attendance. After applying the pre-processing and data mining models to the dataset, Fig. 2 depicts the clustering of students with K = 2 clusters and Fig. 3 shows the clustering with K = 3. From the results we derive the clusters of students who are poor in class performance and sessionals and are short of attendance.

Table 5. K-MEANS ITERATION TRACE
Number of clusters K = 2
iter  phase  num  sum
1     1      120  214502
2     1      25   171069
3     1      13   151310
4     1      7    147808
5     1      1    147740
6     2      1    147715
6 iterations, total sum of distances = 147715

Number of clusters K = 3
iter  phase  num  sum
1     1      120  129534
2     1      10   120939
3     1      6    118209
4     1      3    117458
5     1      1    117347
6     1      1    117028
7     2      2    116924
7 iterations, total sum of distances = 116924

Number of clusters K = 4
iter  phase  num  sum
1     1      120  124189
2     1      16   110711
3     1      13   102809
4     1      4    101083
5     1      1    100902
6     1      1    100633
7     2      11   97834
8     2      12   90120
9     2      9    88336.5
10    2      2    87602.7
10 iterations, total sum of distances = 87602.7

The results show that the total sum of distances decreases at each iteration as kmeans reassigns points between clusters and recomputes the cluster centroids. kmeans always converges to a local minimum, updating the cluster centroids until that local minimum is found.

Fig. 2 Visualisation of student attributes, K = 2
Fig. 3 Visualisation of student attributes, K = 3
Fig. 4 Visualisation of student attributes, K = 4

Table 5 shows the total number of iterations and the total sum of distances when the number of clusters is K = 2, 3 and 4 respectively.
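The iteration traces in Table 5 are of the kind that MATLAB's kmeans prints when iterative display is enabled. A sketch of the calls that would produce such traces (our own illustration, again assuming the 120-by-3 score matrix X; the exact numbers depend on the randomly chosen initial centroids) is:

    for k = 2:4
        % 'Display','iter' prints the iter/phase/num/sum trace shown in Table 5;
        % 'OnlinePhase','on' enables the second (online) update phase of Section 3
        [idx, C, sumd] = kmeans(X, k, 'Display', 'iter', 'OnlinePhase', 'on');
        fprintf('k = %d: total sum of distances = %g\n', k, sum(sumd));
    end
    disp(C)                                              % centroids of the last run (k = 4), comparable to Table 6
    scatter3(X(:,1), X(:,2), X(:,3), 25, idx, 'filled')  % 3-D view of the clusters, in the style of Fig. 2-4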
Centroid Representation
The centroids obtained from Fig. 4 are shown in Table 6. Centroids 3 and 4 represent the students who are poor in class performance and sessionals and are short of attendance, whereas Centroid 1 represents the students who perform well in sessionals and are not short of attendance. Centroid 2 represents the students who have satisfactory class performance and perform well in sessionals but are short of attendance.

Table 6. CENTROIDS WHEN K = 4
Centroid | X = Class Performance | Y = Sessional | Z = Attendance
1 | 77.12 | 82.69 | 77.73
2 | 72.64 | 79 | 70.54
3 | 69.1 | 68.17 | 65.76
4 | 62.24 | 55.16 | 46.52

5. CONCLUSION
In this paper we have used K-means clustering to cluster students on the basis of their class performance, sessional marks and attendance in class. Centroids are calculated from the educational data set for K clusters. This study not only helps identify the students who are short of attendance and have performed poorly in the sessionals, but also enhances the decision-making approach used to monitor student performance. We also observed that increasing the value of K gives a better grouping of the data: the total sum of distances decreases (Table 5), and with a large dataset K-means can find finer clusters. The results obtained help us to cluster the students who need special attention.

REFERENCES
[1] Sonali Agarwal, G. N. Pandey and M. D. Tiwari, "Data Mining in Education: Data Classification and Decision Tree Approach", International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 2, No. 2, April 2012.
[2] Alaa El-Halees, "Mining Students Data to Analyse Learning Behaviour: A Case Study", available online at: https://uqu.edu.sa/files2/tiny_mce/plugins/filemanager/files/30/papers/f158.pdf
[3] Mohd. Mahmood Ali, Mohd. S. Qaseem, Lakshmi Rajamani and A. Govardhan, "Extracting Useful Rules Through Improved Decision Tree Induction Using Information Entropy", International Journal of Information Sciences and Techniques (IJIST), Vol. 3, No. 1, January 2013.
[4] A. K. Jain, M. N. Murty and P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[5] O. J. Oyelade, O. O. Oladipupo and I. C. Obagbuwa, "Application of k-Means Clustering Algorithm for Prediction of Students' Academic Performance", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 1, 2010.
[6] Md. Hedayetul Islam Shovon and Mahfuza Haque, "An Approach of Improving Student's Academic Performance by Using K-means Clustering Algorithm and Decision Tree", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 3, No. 8, 2012.
[7] Pardeep Kumar, Nitin, Vivek Kumar Sehgal and Durg Singh Chauhan, "Benchmark to Select Data Mining Based Classification Algorithms for Business Intelligence and Decision Support Systems", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 2, No. 5, September 2012.
[8] Cristobal Romero, Sebastian Ventura and Enrique Garcia, "Data Mining in Course Management Systems: Moodle Case Study and Tutorial", Computers & Education, 2007, doi:10.1016/j.compedu.2007.05.016.
[9] Senol Zafer Erdogan and Mehpare Timor, "A Data Mining Application in a Student Database", Journal of Aeronautics and Space Technologies, July 2005, Volume 2, Number 2, pp. 53-57.
[10] MathWorks, k-means clustering documentation, http://www.mathworks.in/help/stats/k-means-clustering-12.html