
Mining Educational Data Using K-Means Clustering
Pratiyush Guleria#1, Manu Sood#2
#Department of Computer Science, Himachal Pradesh University, Shimla, Himachal Pradesh, India
1 pratiyushguleria@gmail.com
2 soodm_67@yahoo.com
ABSTRACT
As a competitive environment prevails among academic institutions, the challenge is to increase the quality of education through data mining. Students' performance is of great concern to higher education. In this paper, using K-means clustering, we have performed pattern classification and clustered students according to their class performance, sessional marks and attendance record. The results generated using data mining techniques help faculty members make effective decisions and focus on students who are getting poor class results. We have also analysed the performance of the K-means algorithm for different numbers of clusters and the iterations performed in each case. After data collection, the data are pre-processed: cleaned and transformed into an appropriate format to be mined.
Keywords
Clustering, Classification, Data Mining, Decision.
1. INTRODUCTION
Data mining plays an important role in educational systems, where education is considered one of the key inputs for social development [1]. The main challenge for institutions is to analyse in depth their performance in terms of student performance, teaching skills and academic activities. Class performance and sessional marks are important factors for analysing and predicting a student's class result. Many data mining techniques, such as K-means clustering, decision trees, neural networks, nearest neighbour and Naive Bayes, are used in educational data mining. Using these methods, many kinds of knowledge can be discovered, such as classification, clustering and association rules, which can be helpful in increasing the quality of education [2]. Classification is a supervised learning technique in which a training data set is the input to the classifier [3]. Clustering is the unsupervised classification of patterns into clusters [4].
The K-means clustering algorithm uses the Euclidean distance measure: the distance between two records is computed by squaring the difference between each pair of scores, summing the squares and taking the square root of the sum [5]. Using data clustering, we extract previously unknown and hidden patterns from large data sets [6]. The conventional models used for classification are decision trees, neural networks, statistical and clustering techniques [7]. Fig. 1 shows the process of data mining using WEKA, an open-source software suite that provides a collection of machine learning and data mining algorithms for data pre-processing, classification, regression, clustering, association rules and visualization [8]. In this process, the educational dataset is loaded into the WEKA tool and the results are interpreted and evaluated using Simple K-Means clustering. The educational dataset consists of the attributes listed in Table 1, which are used to predict the class result using classification techniques.
Fig. 1 Data mining using the WEKA tool (pre-process data, extract patterns/models, Simple K-Means clustering)
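As an illustration of this distance measure, the following minimal MATLAB sketch squares the difference between each pair of scores, sums the squares and takes the square root; the two score vectors are loosely based on records of Table 2 and are used here purely as an example, not as the authors' computation.

% Hypothetical score vectors (Class Performance, Sessional, Attendance)
q = [78 79 89];                  % illustrative record 1
p = [52 50 45];                  % illustrative record 5
d = sqrt(sum((q - p).^2));       % Euclidean distance between the two records
fprintf('Euclidean distance = %.2f\n', d);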
2. DATA MINING PROCESS
2.1 Data preparation
In this paper, we have initially collected a data set of 120 students pursuing a B.Tech in Information Technology at one of the educational institutions.
2.2 Data selection
We have analysed the class result of these students using five attributes of the student record: class performance, attendance, assignment, lab work and sessional performance (aggregate of the 1st and 2nd sessionals). The model of a student's class result is predicted after analysing performance on these attributes. The attributes and the educational data set of students are given in Table 1 and Table 2.
Table 1. ATTRIBUTES
Attribute | Description | Coding
ClassID | Roll No of student | Numeric value
CLP | Class performance of the student | {Excellent = "Above 80", Good = "70-80", Average = "60-70", Poor = "Below 60"}
Sessional | Aggregate of 1st and 2nd sessionals | {Good = "75-100", Average = "60-75", Poor = "Below 60"}
ASSGN | Assignments | {"A" = 8-10, "B" = 6-7, "C" = 1-5}
ATTD | Attendance | {Good = ">= 90%", Average = ">= 75%", Poor = "Below 75%"}
LW | Lab work | {Good = "20-25", Average = "15-20", Poor = "Below 15"}
CLR | Class result of students | {First = "Above 70%", Second = "60-70%", Third = "50-60%", Fail = "Below 50%"}
Cluster analysis, also called segmentation analysis, creates groups,
or clusters, of data.
Clusters are formed in such a way that objects in the same cluster
are very similar and objects in different clusters are very distinct.
Measures of similarity depend on the application.
K-Means Clustering is a partitioning method. The function kmeans partitions data into k clusters, and returns the index of the
cluster to which it has assigned each observation. K-means
clustering operates on actual observations and creates a single
level of clusters. K-means clustering is more suitable than
hierarchical clustering for large amounts of data because it treats
each observation in data as an object having a location in space. It
finds a partition in which objects within each cluster are as close
to each other as possible, and as far from objects in other clusters
as possible. Table 3 shows the basic differences between
Partitioning and Hierarchical Clustering.
Each cluster in the partition is defined by its member objects and by its centroid, or centre. The centroid of each cluster is the point to which the sum of distances from all objects in that cluster is minimized. kmeans computes cluster centroids differently for each distance measure, in order to minimize the sum with respect to the measure that we specify.
kmeans uses an iterative algorithm that minimizes the sum of distances from each object to its cluster centroid, over all clusters. This algorithm moves objects between clusters until the sum cannot be decreased further. The result is a set of clusters that are as compact and well-separated as possible. We can control the details of the minimization using several optional input parameters to kmeans, including ones for the initial values of the cluster centroids and for the maximum number of iterations [10].
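For reference, a minimal sketch of such a call to the MATLAB kmeans function (Statistics Toolbox) is shown below; the stand-in data matrix, the number of clusters and the choice of initial centroids are assumptions for illustration, not the authors' exact settings.

% Stand-in data: 120 students with 5 numeric attributes (illustrative only)
X = 100 * rand(120, 5);
k = 3;                                   % number of clusters (illustrative)
initC = X(1:k, :);                       % hypothetical initial centroid values
[idx, C, sumd] = kmeans(X, k, ...
    'Start', initC, ...                  % initial values of the cluster centroids
    'MaxIter', 100);                     % maximum number of iterations
% idx: cluster index per student, C: final centroids, sumd: within-cluster sums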
Table 2. EDUCATIONAL DATA SET
Sr No | CLP | SESSIONAL | ATTD | ASSGN | LW | CLR
1 | 78 | 79 | 89 | 9 | 23 | 79
2 | 79 | 80 | 89 | 10 | 22 | 81
3 | 71 | 70 | 76 | 10 | 23 | 71
4 | 52 | 79 | 55 | 4 | 10 | 80
5 | 52 | 50 | 45 | 3 | 12 | 51
... | ... | ... | ... | ... | ... | ...
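For illustration only, the first five records of Table 2 (using the column order reconstructed above, which is our assumption) could be arranged as a MATLAB matrix, one row per student; later sketches in this paper refer to the three columns used for clustering as X3.

% Columns: CLP, Sessional, ATTD, ASSGN, LW, CLR (assumed order)
X = [78 79 89  9 23 79;
     79 80 89 10 22 81;
     71 70 76 10 23 71;
     52 79 55  4 10 80;
     52 50 45  3 12 51];
X3 = X(:, 1:3);   % Class Performance, Sessional and Attendance used for clustering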
2.3 Implementation of data mining model
In this paper, we have used the K-means algorithm, a cluster analysis algorithm used as a partitioning method and developed by MacQueen in 1967 [9]. The goal of this algorithm is to minimize the total distance between the objects of each cluster and its corresponding centroid.
Table 3. PARTITIONING vs HIERARCHICAL CLUSTERING
Sr No | Partitioning Clustering | Hierarchical Clustering
1 | Partitioning clustering produces a single partitioning. | Hierarchical clustering can give different partitionings.
2 | Partitioning clustering needs the number of clusters to be specified. | Hierarchical clustering doesn't need the number of clusters to be specified.
3 | Partitioning clustering is usually more efficient. | Hierarchical clustering can be slow.
4 | Cluster membership is determined by calculating the centroid for each group and assigning each object to the group with the closest centroid. | Hierarchical algorithms combine or divide existing groups, creating a hierarchical structure that reflects the order in which groups are merged or divided.
3. CALCULATING K-MEANS
kmeans uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all k clusters. The first phase uses batch updates: each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of the cluster centroids. This phase usually does not converge to a solution that is a local minimum. The second phase uses online updates: points are individually reassigned if doing so reduces the sum of distances, and the cluster centroids are recomputed after each reassignment. Each iteration during the second phase consists of one pass through all the points. The second phase converges to a local minimum. Table 4 shows the steps for calculating K-means.
Table 4. ALGORITHM
Step 1: K different clusters are selected.
Step 2: Each cluster is associated with a centroid (centre point); the centroid is typically the mean of the points in the cluster.
Step 3: The Euclidean distance of each object to the centroid is determined:
$d(q, p) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}$
Step 4: Objects are grouped based on minimum distance.
Step 5: Loop until all the objects are assigned to the closest centroid and recompute the centroid of each cluster.
Step 6: Stop if no further object clusters can be formed.
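The following sketch, our own illustration rather than the authors' code, implements the steps of Table 4 directly in MATLAB; X is any numeric matrix of student attributes (one row per student) and k is the chosen number of clusters. The element-wise subtraction relies on implicit expansion (R2016b or later).

function [labels, centroids] = simpleKMeans(X, k)
% Steps of Table 4: select k clusters, assign each object to the nearest
% centroid by Euclidean distance, recompute centroids, stop on convergence.
centroids = X(randperm(size(X, 1), k), :);        % Steps 1-2: k initial centroids
for iter = 1:100                                  % safety cap on the loop
    D = zeros(size(X, 1), k);
    for j = 1:k                                   % Step 3: Euclidean distances
        D(:, j) = sqrt(sum((X - centroids(j, :)).^2, 2));
    end
    [~, labels] = min(D, [], 2);                  % Step 4: group by minimum distance
    newCentroids = centroids;
    for j = 1:k                                   % Step 5: recompute each centroid
        if any(labels == j)
            newCentroids(j, :) = mean(X(labels == j, :), 1);
        end
    end
    if isequal(newCentroids, centroids)           % Step 6: stop when nothing changes
        break;
    end
    centroids = newCentroids;
end
end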
4. RESULTS AND DISCUSSIONS
In this paper, we have applied the K-means clustering algorithm in MATLAB on the training data given in Table 2 and grouped the students according to their class performance, sessional marks and attendance. After applying the pre-processing and data mining models on the dataset, Fig. 2 depicts the clustering of students with the number of clusters K = 2 and Fig. 3 shows the clustering of students with K = 3. From the results we derive clusters of students who are poor in class performance and sessionals and are short of attendance.
Table 5. TOTAL ITERATIONS AND SUM OF DISTANCES FOR K = 2, 3 AND 4

No. of clusters = 2
iter | phase | num | sum
1 | 1 | 120 | 214502
2 | 1 | 25 | 171069
3 | 1 | 13 | 151310
4 | 1 | 7 | 147808
5 | 1 | 1 | 147740
6 | 2 | 1 | 147715
6 iterations, total sum of distances = 147715

No. of clusters = 3
iter | phase | num | sum
1 | 1 | 120 | 129534
2 | 1 | 10 | 120939
3 | 1 | 6 | 118209
4 | 1 | 3 | 117458
5 | 1 | 1 | 117347
6 | 1 | 1 | 117028
7 | 2 | 2 | 116924
7 iterations, total sum of distances = 116924

No. of clusters = 4
iter | phase | num | sum
1 | 1 | 120 | 124189
2 | 1 | 16 | 110711
3 | 1 | 13 | 102809
4 | 1 | 4 | 101083
5 | 1 | 1 | 100902
6 | 1 | 1 | 100633
7 | 2 | 11 | 97834
8 | 2 | 12 | 90120
9 | 2 | 9 | 88336.5
10 | 2 | 2 | 87602.7
10 iterations, total sum of distances = 87602.7
The results show that the total sum of distances decreases at each iteration as kmeans reassigns points between clusters and recomputes the cluster centroids. K-means always converges to a local minimum, updating the cluster centroids until that local minimum is found.
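An iteration trace of this kind can be reproduced with the built-in kmeans function; the sketch below is illustrative only and assumes the X3 matrix of class performance, sessional and attendance values defined earlier, plus the Statistics Toolbox.

% Print one line per iteration (iter, phase, num, sum), as in Table 5
[idx, C, sumd] = kmeans(X3, 4, ...
    'Display', 'iter', ...          % show the batch and online update phases
    'OnlinePhase', 'on');           % enable the second (online) phase
fprintf('total sum of distances = %.1f\n', sum(sumd));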
Fig. 2 Visualisation of student attributes, K = 2
Fig. 3 Visualisation of student attributes, K = 3
Table 5 shows the total iterations and the sum of distances when the number of clusters taken is K = 2, 3 and 4 respectively.
Fig. 4 Centroid representation
The centroids obtained from Fig. 4 are shown in Table 6. Centroids 3 and 4 represent those students who are poor in class performance and sessionals and are short of attendance, whereas Centroid 1 represents students who are performing well in sessionals and are not short of attendance. Centroid 2 represents those students who have satisfactory class performance and perform well in sessionals but are short of attendance.
Table 6. CENTROIDS WHEN K = 4
Centroid | X = Class Performance | Y = Sessional | Z = Attendance
Centroid 1 | 77.12 | 82.69 | 77.73
Centroid 2 | 72.64 | 79 | 70.54
Centroid 3 | 69.1 | 68.17 | 65.76
Centroid 4 | 62.24 | 55.16 | 46.52
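The centroid matrix itself is the second output of kmeans; a minimal sketch that prints a table like Table 6, again assuming the X3 matrix of class performance, sessional and attendance values, is shown below.

[idx, C] = kmeans(X3, 4);    % C is a 4-by-3 matrix of cluster centroids
disp(array2table(C, 'VariableNames', {'ClassPerformance', 'Sessional', 'Attendance'}));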
5. CONCLUSION
In this paper, using K-means clustering, we have clustered the students based on their class performance, sessional marks and attendance in class. Centroids are calculated from the educational data set for K clusters. This study not only helps in identifying those students who are short of attendance and have shown poor performance in sessionals, but also enhances the decision-making approach used to monitor the performance of students. We also conclude that, on increasing the value of K for a large dataset, K-means can find a better grouping of the data. The results obtained help us to cluster those students who need special attention.
REFERENCES
[1] Sonali Agarwal, G. N. Pandey, and M. D. Tiwari, "Data Mining in Education: Data Classification and Decision Tree Approach", International Journal of e-Education, e-Business, e-Management and e-Learning, Vol. 2, No. 2, April 2012.
[2] Alaa El-Halees, "Mining Students Data to Analyse Learning Behaviour: A Case Study", available online at: https://uqu.edu.sa/files2/tiny_mce/plugins/filemanager/files/30/papers/f158.pdf
[3] Mohd. Mahmood Ali, Mohd. S. Qaseem, Lakshmi Rajamani, A. Govardhan, "Extracting Useful Rules Through Improved Decision Tree Induction Using Information Entropy", International Journal of Information Sciences and Techniques (IJIST), Vol. 3, No. 1, January 2013.
[4] A. K. Jain, M. N. Murty, P. J. Flynn, "Data Clustering: A Review", ACM Computing Surveys, Vol. 31, No. 3, September 1999.
[5] O. J. Oyelade, O. O. Oladipupo, I. C. Obagbuwa, "Application of k-Means Clustering Algorithm for Prediction of Students' Academic Performance", (IJCSIS) International Journal of Computer Science and Information Security, Vol. 7, No. 1, 2010.
[6] Md. Hedayetul Islam Shovon, Mahfuza Haque, "An Approach of Improving Student's Academic Performance by Using K-means Clustering Algorithm and Decision Tree", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 3, No. 8, 2012.
[7] Pardeep Kumar, Nitin, Vivek Kumar Sehgal and Durg Singh Chauhan, "Benchmark to Select Data Mining Based Classification Algorithms for Business Intelligence and Decision Support Systems", International Journal of Data Mining & Knowledge Management Process (IJDKP), Vol. 2, No. 5, September 2012.
[8] Cristobal Romero, Sebastian Ventura, Enrique Garcia, "Data Mining in Course Management Systems: Moodle Case Study and Tutorial", Computers & Education, 2007, doi:10.1016/j.compedu.2007.05.016.
[9] Senol Zafer Erdogan, Mehpare Timor, "A Data Mining Application in a Student Database", Journal of Aeronautics and Space Technologies, July 2005, Volume 2, Number 2, pp. 53-57.
[10] http://www.mathworks.in/help/stats/k-means-clustering-12.html