Application Research of Modified K-means Clustering Algorithm
Guo-li Liu, You-qian Tan, Li-mei Yu, Jia Liu, Jin-qiao Gao
Department of Computer Science and Software, Hebei University of Technology, Tianjin, China
(lgl6699@163.com)
Abstract - This paper presents an efficient algorithm, called the K-harmonic means clustering algorithm with simulated annealing, that reduces the dependence on initial values and avoids convergence to a local minimum. In the proposed algorithm, K-harmonic means addresses the sensitivity of the clustering result to the initial values, and simulated annealing lets the clustering jump out of local optima during the iterations. The clustering result is verified by experiments on the IRIS dataset. School XunTong is application software that facilitates communication between parents and teachers. This paper applies the new algorithm to a School XunTong dataset and examines the relationship between students’ achievement and the communication between parents and teachers. Finally, the classification result is used to guide the learning direction of students in universities and to support their development.
Keywords - K-harmonic means, simulated annealing, local minimum, School XunTong
I. INTRODUCTION
Among the commonly used clustering algorithms, the K-means algorithm is a typical one and is widely used due to its simplicity and effectiveness. However, it suffers from dependence on the initial values and from convergence of the clustering result to a local optimum. There are several ways to improve the algorithm: first, run K-means many times and choose the best result as the final clustering; second, develop new algorithms. To reduce the sensitivity of k-means to the initial values and its local convergence, this paper puts forward a new algorithm based on the combination of the K-harmonic means algorithm and the simulated annealing (SA) algorithm. “School XunTong” is application software that provides a service to students' parents and holds information about students. In this paper, the new algorithm is used to cluster the “School XunTong” data set and to find potential relationships in the clustering result.
II. FUNDAMENTAL OF ALGORITHM
A. K-means algorithm
The K-means algorithm (KM) is a common partition-based clustering algorithm and one of the oldest classical algorithms [1]. In cluster analysis we assume that we have been given a finite set of points $X$ in the d-dimensional space $R^d$, that is, $X = \{x_i \mid x_i \in R^d,\ i = 1, 2, \ldots, n\}$. The K-means algorithm partitions the data set $X$ into a given number $k$ of disjoint subsets $C_1, C_2, \ldots, C_k$. An optimal clustering is a partition that minimizes the intra-cluster distance and maximizes the inter-cluster distance.
In practice, the most popular similarity measure is the Euclidean distance due to its computational simplicity. The Euclidean distance is defined as
$$d(i,j) = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{id} - x_{jd})^2} \tag{1}$$
where $i = (x_{i1}, x_{i2}, \ldots, x_{id}) \in R^d$ and $j = (x_{j1}, x_{j2}, \ldots, x_{jd}) \in R^d$.
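The cluster of each point is relabeled at each iteration. The cluster centers are updated as
$$c_i = \frac{1}{n_i} \sum_{x \in C_i} x \tag{2}$$
where $x = (x_{i1}, x_{i2}, \ldots, x_{id})$ and $n_i$ is the number of points in cluster $C_i$.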
The main idea behind the K-means algorithm is the minimization of an objective function, usually taken as a function of the deviations of all patterns from their respective cluster centers. The sum of squared Euclidean distances has been adopted as the objective function in most studies. It is as follows:
$$E = \sum_{j=1}^{k} \sum_{i=1}^{n} d_{ij}(x_i, c_j) \tag{3}$$
The K-means algorithm is simple and efficient and scales well to large data. However, it has limitations: the clustering is extremely sensitive to the initial values, and it tends to converge to a local minimum. The K-harmonic means algorithm addresses the sensitivity of the clustering result to the initial values.
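The K-means iteration described above (assignment by Euclidean distance, centroid update, sum of squared distances as objective) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the function name `kmeans`, the random choice of initial centers and the parameters `n_iter` and `seed` are assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Plain K-means sketch: returns centers, labels and the sum of squared distances."""
    rng = np.random.default_rng(seed)
    # Assumption: initial centers are k distinct data points drawn at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Euclidean distance from every point to every center, as in Eq. (1).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Centroid update, as in Eq. (2); an empty cluster keeps its old center.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment and objective: squared distance of each point to its own center.
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    sse = float(np.sum(d[np.arange(len(X)), labels] ** 2))
    return centers, labels, sse
```

Running the sketch with different `seed` values on a harder data set is a simple way to observe the sensitivity to initial values discussed above.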
B. K-harmonic means algorithm
K-harmonic means (KHM) is a center-based algorithm that has been developed to solve the clustering problem [2-5]. Instead of the minimum distance used in the K-means algorithm, it uses the harmonic average of the distances from each data point to the cluster centers. The harmonic average is defined as
$$\mathrm{HA}(x) = \frac{k}{\sum_{c \in C} \frac{1}{d^{p}(x, c)}} \tag{4}$$
where $x \in X$ is a point of the finite set $X$ in the d-dimensional space $R^d$, $c \in C$ denotes a cluster center, $d^{p}(x, c)$ denotes the distance between the two points (raised to the power $p$), and $k$ denotes the number of clusters.
The cluster centers are updated iteratively as
$$c_k = \frac{\sum_{i=1}^{n} \dfrac{1}{d_{ik}^{\,p+2} \left[ \sum_{j=1}^{k} \dfrac{1}{d_{i,j}^{\,p}} \right]^{2}}\, x_i}{\sum_{i=1}^{n} \dfrac{1}{d_{ik}^{\,p+2} \left[ \sum_{j=1}^{k} \dfrac{1}{d_{i,j}^{\,p}} \right]^{2}}} \tag{5}$$
where $d_{ik}$ denotes the distance between $x_i$ and $c_k$.
This center update repeatedly reduces the objective function, which is
$$\mathrm{KHM}(X, C) = \sum_{i=1}^{n} \frac{k}{\sum_{c \in C} \frac{1}{d^{p}(x_i, c)}} \tag{6}$$
The objective function of the KHM algorithm also introduces the conditional probability of the cluster centers given the data points and dynamic weights of the data points in each iteration [3]. The KHM algorithm mitigates the weakness that the K-means algorithm is sensitive to the initial values. However, convergence to a local minimum remains a problem. Heuristic algorithms are known to have good optimization properties [11]; in this paper, we use the simulated annealing algorithm to address the local minimum problem of the K-means algorithm.
C. Simulated Annealing
Simulated annealing (SA), presented by Metropolis, Rosenbluth and others in 1953 [6, 7, 14], is an iterative method for finding approximate solutions to intractable combinatorial optimization problems.
The simulated annealing methodology resembles the cooling process of molten metals through annealing. The cooling phenomenon is simulated by controlling a parameter, namely the temperature $T$, introduced with the concept of the Boltzmann probability distribution. Metropolis suggested a way to implement the Boltzmann probability distribution in simulated thermodynamic systems that can also be used in the function minimization context. At any instant, consider the current point ($x_1$) and the corresponding function value at that point. Based on the Metropolis algorithm, the probability of accepting the next point ($x_2$) depends on the difference in the function values ($\Delta E$) at these two points. There is some finite probability of selecting the point $x_2$ even though it is worse than the point $x_1$, and this probability depends on the relative magnitude of the $\Delta E$ and $T$ values. The optimal solution is obtained by simulating slow cooling, that is, by sampling repeatedly [7]. The initial temperature, the cooling rate, the number of iterations performed at a particular temperature, and the stopping condition are the most important parameters governing the success of the simulated annealing procedure.
Simulated annealing addresses the problem that k-means tends to converge to a local minimum. To overcome both the dependence on the initial state and the convergence to local optima of k-means, this paper proposes a new algorithm called the K-harmonic means clustering algorithm with simulated annealing.
III. K-HARMONIC MEANS WITH SIMULATED
ANNEALING ALGORITHM
A. Algorithm theory
The K-harmonic means clustering algorithm with simulated annealing (SAKHM) combines K-harmonic means and simulated annealing; its parameters need to be set according to the features of the combined algorithm. The main idea of SAKHM is to take the clustering result produced by K-harmonic means as the initial solution of the simulated annealing algorithm. During the simulated annealing iterations, a new solution is generated by randomly perturbing the current one: the cluster assignment of one or several samples is changed at random, which yields a new partition. In this way the algorithm may jump out of a local minimum and exploit its global search ability, so that the final clustering result is less affected by the initial values.
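The K-harmonic means stage that supplies this initial solution can be sketched from the objective function and the center update of Section II.B. The code below is an illustrative sketch under stated assumptions, not the authors' code: the names `khm_objective`, `khm_update` and `khm`, the default power `p`, the fixed iteration count and the small constant `eps` (which guards against division by zero when a point coincides with a center) are all hypothetical choices.

```python
import numpy as np

def _distances(X, C, eps):
    # d[i, j] = Euclidean distance between point i and center j, clamped away from zero.
    return np.maximum(np.linalg.norm(X[:, None, :] - C[None, :, :], axis=2), eps)

def khm_objective(X, C, p=3.5, eps=1e-12):
    """KHM objective: sum over points of k / sum_j (1 / d_ij^p)."""
    d = _distances(X, C, eps)
    k = C.shape[0]
    return float(np.sum(k / np.sum(d ** (-p), axis=1)))

def khm_update(X, C, p=3.5, eps=1e-12):
    """One KHM center update: each center is a weighted mean of all points."""
    d = _distances(X, C, eps)
    s = np.sum(d ** (-p), axis=1, keepdims=True)   # sum_j 1 / d_ij^p for each point i
    w = 1.0 / (d ** (p + 2) * s ** 2)              # weight of point i for center k
    return (w.T @ X) / np.sum(w, axis=0)[:, None]

def khm(X, k, p=3.5, n_iter=50, seed=0):
    """Run a fixed number of KHM updates, then assign each point to its nearest center."""
    rng = np.random.default_rng(seed)
    C = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        C = khm_update(X, C, p)
    labels = _distances(X, C, 0.0).argmin(axis=1)
    return C, labels
```

The final nearest-center assignment turns the soft KHM result into a crisp partition, which is the form a perturbation-based search needs as its starting point.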
The steps of the SAKHM algorithm are:
1) Initialize the initial temperature $t_0$, the final temperature $t_m$, the number of inner-loop iterations MaxInnerLoop, and the cooling rate DR.
2) Apply the K-harmonic means algorithm: each point in the set is assigned to its closest center according to the minimum distance. Compute the centroid of each cluster to obtain the new clusters and the objective function $J(1)$. This clustering result serves as the initial solution $w$.
3) Set the inner-loop variable InnerLoop to 0 and initialize the outer-loop counter $i$.
4) Perform an iteration to generate a new set of clusters, update the cluster centers $w(i)$, and compute the objective function $J(i+1)$ of the new clustering. If $J(i+1) \le J(i)$, the new cluster centers are accepted. If $J(i+1) > J(i)$, compute the acceptance probability from the relative magnitude of $\Delta J$ and $T$,
$$P = \exp\left(-\frac{J(i+1) - J(i)}{s\, t(i)}\right), \quad \text{namely} \quad p = \exp\left(-\frac{\Delta J}{sT}\right),$$
where $t(i)$ stands for the current temperature and $s$ is a constant. Let $r$ be a random number with $r \in [0,1]$; if $p > r$, the new cluster centers are accepted; otherwise the previous cluster centers are kept for the next iteration.
5) If InnerLoop < MaxInnerLoop, increase InnerLoop by 1 and $i$ by 1 and, provided $i$ < MaxLoop, go to step 4); otherwise go to step 6).
6) If $t(i) < t_m$, stop the program; otherwise update the temperature with $t(i+1) = DR \cdot t(i)$ and go to step 3).
B. K-Harmonic means clustering algorithm with
simulated annealing based on DK-$t_0$
The K-harmonic means clustering algorithm with simulated annealing focuses on applying the simulated annealing methodology to the K-means algorithm and on setting the key parameters of the new algorithm. The following four aspects describe how the parameters are set. These strategies may significantly affect the performance of the proposed algorithm. Aspects 3) and 4) are the key contributions of this paper.
1) The choice of the objective function
The sum of squared Euclidean distances is adopted as the objective function of the algorithm.
2) Temperature update
The algorithm uses the cooling rate proposed by Kirkpatrick and others to control the temperature decrease. Let DR be the cooling rate, a constant close to 1. The temperature update is defined as T(k+1) = DR*T(k), where k is the number of temperature updates. The cooling speed is controlled by the parameter DR. This paper sets DR = 0.98.
3) Generating the initial temperature
In simulated annealing research, the principle for selecting the initial temperature is that, at the beginning of annealing, the temperature has to be high enough for the system to move to any state. However, if the temperature is too high, almost every perturbed result is accepted as the new result for a while, which degrades the algorithm’s effectiveness. Therefore, the initial temperature is determined through repeated experiments so that a proper proportion of new values is accepted.
Although scholars have proposed many methods for setting the initial temperature, there is no unified and more effective method. The initial temperature $T_0$ in the simulated annealing algorithm corresponds to the initial value of the control parameter in the SAKHM algorithm. On the basis of existing research, this paper puts forward a method to select the initial value $t_0$ of the control parameter. The details are as follows.
According to equilibrium theory, the initial value of the control parameter $t_0$ should be large enough. If the initial acceptance probability is assumed to be $v_0$, then according to the Metropolis acceptance criterion,
$$\exp\left(-\frac{\Delta J}{a\, t_0}\right) \approx 1.$$
For this to hold, $t_0$ should be large enough; but if $t_0$ is too large, the number of iterations and the computing time increase. A well-chosen $t_0$ allows the algorithm to reach the minimum, i.e. the global optimal solution. Kirkpatrick and others proposed a method for selecting the initial temperature, called the experience method. First, a large value is selected as $t_0$ and several trial transformations are performed; if the acceptance rate $v$ is less than the scheduled initial acceptance rate $v_0$ (usually taken as 0.8), the value of $t_0$ is doubled until $v \ge v_0$. In this paper, we select $t_0$ by combining the experience method with the objective function of K-harmonic means clustering. The method is as follows. First, take the value of the K-harmonic means objective function as $t_0$. Then, as in the method above, perform several trial transformations. If the acceptance rate $v < v_0$ ($v_0 = 0.8$), double $t_0$ until $v \ge v_0$; the resulting $t_0$ is the required value. If the acceptance rate $v > v_0$ ($v_0 = 0.8$), halve $t_0$ until $v \le v_0$; the resulting $t_0$ is the required value. This yields the smallest value that meets the conditions. Because the selection of $t_0$ combines the experience method with the objective function of K-harmonic means clustering, it is called the DK-$t_0$ selection method.
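The doubling and halving rule of the DK-$t_0$ selection can be illustrated with a short sketch. It is written under several assumptions that are not part of the paper: the acceptance rate is computed as the expected Metropolis acceptance probability over a list of trial objective changes (`deltas`) instead of by random trial moves, the function names `accept_rate` and `dk_t0` are hypothetical, and `max_steps` is a safety cap added for the example.

```python
import math

def accept_rate(deltas, t, s=1.0):
    """Expected Metropolis acceptance rate at temperature t for trial changes `deltas`."""
    # Improving moves (delta <= 0) are always accepted; worse moves with exp(-delta / (s*t)).
    probs = [1.0 if dj <= 0 else math.exp(-dj / (s * t)) for dj in deltas]
    return sum(probs) / len(probs)

def dk_t0(j_khm, deltas, v0=0.8, s=1.0, max_steps=60):
    """Start from the KHM objective value and double or halve t0 against the target rate v0."""
    if j_khm <= 0:
        raise ValueError("the starting objective value must be positive")
    t0 = float(j_khm)
    v = accept_rate(deltas, t0, s)
    steps = 0
    if v < v0:
        # Too few moves accepted: double t0 until the rate reaches v0.
        while v < v0 and steps < max_steps:
            t0 *= 2.0
            v = accept_rate(deltas, t0, s)
            steps += 1
    else:
        # Too many moves accepted: halve t0 until the rate drops to v0 or below.
        while v > v0 and steps < max_steps:
            t0 /= 2.0
            v = accept_rate(deltas, t0, s)
            steps += 1
    return t0
```

For example, with ten trial moves that each worsen the objective by 1.0, `dk_t0(1.0, [1.0] * 10)` doubles the temperature three times, and `dk_t0(100.0, [1.0] * 10)` halves it five times.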
4) Generating a new solution
To give the algorithm a balanced start, K-harmonic means divides the data set into several clusters, and this clustering result is used as the initial solution. Because K-harmonic means is computationally expensive, the cluster centers and the objective function are updated by the K-means criterion in the subsequent simulated annealing iterations; this reduces the running time while still producing a good clustering result.
In the K-harmonic means clustering algorithm with simulated annealing, new solutions are generated by perturbing the current solution, so that the algorithm naturally moves one or more of the centers into other areas. However, the initial solution, i.e. the clustering result of K-harmonic means, is not a crisp partition into clusters, whereas in the perturbation process each point must belong to exactly one cluster. To sum up, the data are first clustered by the K-harmonic means algorithm, each point is then assigned to exactly one cluster by the minimum-distance principle, and the corresponding objective function is calculated. The new algorithm starts from this clustering result and generates new solutions by perturbing the current solution.
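The annealing stage (perturb the assignment of a sample, evaluate the K-means objective, apply the Metropolis acceptance rule, then cool by the rate DR) can be sketched as below. This is a simplified illustration, not the authors' implementation: the function names `sse` and `sa_refine`, the default parameter values, the single-point perturbation and the bookkeeping of the best solution seen so far are assumptions made for the example.

```python
import numpy as np

def sse(X, labels, k):
    """K-means objective: sum of squared distances of points to their cluster centroid."""
    total = 0.0
    for j in range(k):
        members = X[labels == j]
        if len(members):
            total += float(np.sum((members - members.mean(axis=0)) ** 2))
    return total

def sa_refine(X, labels, k, t0=1.0, tm=1e-3, dr=0.98, max_inner_loop=6, s=1.0, seed=0):
    """Refine a crisp partition by simulated annealing with geometric cooling t <- dr * t."""
    rng = np.random.default_rng(seed)
    current = labels.copy()
    j_cur = sse(X, current, k)
    best, j_best = current.copy(), j_cur   # assumption: remember the best partition seen
    t = t0
    while t > tm:
        for _ in range(max_inner_loop):
            # Perturbation: move one randomly chosen sample to a randomly chosen cluster.
            cand = current.copy()
            cand[rng.integers(len(X))] = rng.integers(k)
            j_new = sse(X, cand, k)
            dj = j_new - j_cur
            # Metropolis rule: accept improvements, or worse moves when p > r.
            if dj <= 0 or np.exp(-dj / (s * t)) > rng.random():
                current, j_cur = cand, j_new
                if j_cur < j_best:
                    best, j_best = current.copy(), j_cur
        t *= dr
    return best, j_best
```

In a full pipeline the `labels` argument would come from the K-harmonic means stage and `t0` from the DK-$t_0$ selection; here both are left to the caller.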
IV. VALIDATION BASED ON K-HARMONIC
ALGORITHM
In order to evaluate the performance of the K-harmonic means clustering algorithm with simulated annealing, the new algorithm is applied to the IRIS data set. The first data set is the Iris data set, which has N = 150 points, each with four attributes: sepal length, sepal width, petal length, and petal width. In this article, the distance between the cluster centers and the actual centers is used to evaluate the algorithm.
Because the ranges of the attributes differ considerably, the data set must be normalized before clustering. Each algorithm was run 20 times independently; the main results are gathered in TABLE I, including the minimum, average, and maximum of the objective function, the error with respect to the actual values, and the average CPU time. The last two columns are averages over the 20 runs.
TABLE I
Clustering results of KM and SAKHM algorithm
Algorithm   Minimum   Average   Maximum   Error   CPU time
KM          140.94    175.56    207.06    18.89   0.05
SA-KHM      76.32     79.82     84.98     2.63    16.94
From TABLE I, the running time of the K-means algorithm is the shortest, but its clustering results differ markedly for different initial values. The maximum of its objective function differs greatly from the minimum (207.06 versus 140.94), which shows that it is highly sensitive to the initial values. In contrast, the running time of K-harmonic means clustering with simulated annealing is clearly longer, but its objective function varies much less across runs (76.32 to 84.98), and the difference between the cluster centers and the actual centers is small. The figure shows the clustering result of the eighth run; the horizontal axis shows the sepal length and the vertical axis shows the sepal width.
Fig.2. Clustering results of SAKHM algorithm
Fig.2 shows the result of the SAKHM cluster analysis. Compared with the result of the KM algorithm in Fig.1, the point sets in Fig.2 show some intersections, whereas those in Fig.1 do not. This indicates the global search ability of the algorithm and that, building on K-harmonic means, it stays away from local minima.
A second experiment uses statistics on the first-choice undergraduate applications in science and engineering in 2009. The data set includes 195 records, each with 5 attributes: total score, Chinese score, math score, foreign language score, and school level. If a school is a 211 college the level is second, if it is a 985 college the level is third, and if it is one of the 34 key universities the level is fourth. The Chinese, math, and foreign language scores are the lowest admitted scores.
In the clustering process, $t_0$ is set using the experience method and the improved method, respectively. To compare them, both settings are run with the new algorithm under the same initial assumptions. Each factor combination is tested 7 times on the test problems.
[Fig.3: line plot of the objective function value (vertical axis from about 480 to 580) for the old parameter and the improved parameter]
Fig.3. Contrast of the two methods’ objective functions
Fig.1. Clustering results of KM algorithm
Fig.3 compares the objective function values of the two approaches. The solid line and the dotted line represent the unimproved-parameter and improved-parameter approaches, respectively. The DK-$t_0$ method finds a smaller clustering objective function value. With $t_0$ selected by DK-$t_0$, the objective function changes little, i.e. its value is more stable, and the clustering is better.
V. APPLYING IMPROVED ALGORITHM TO
SCHOOL XUNTONG
In recent years, more and more parents have become concerned with their children's situation at school. Certain companies, in cooperation with the Mobile Company, have developed a digital campus system. "School XunTong" is application software that lets parents and teachers exchange students' information conveniently. It mainly targets primary and middle school students. Teachers can send and receive text messages freely and enjoy Internet service, parents can obtain students' information by subscribing to the service, and the company profits from the business as well. Because the target customers of “School XunTong" are the primary and middle school students of a whole province or city, it holds a large amount of complex data, which is difficult to manage and, unanalyzed, of little value to the company. With the increasing number of users, the data is growing rapidly, and people want to find useful information in the “School XunTong” database. From this perspective, analyzing the “School XunTong” database with a cluster analysis algorithm is meaningful.
The data to be analyzed come from the school database of "School XunTong", which includes the information of more than four hundred students from September 2009 to December 2009. The information includes students' accounts, parents' accounts, mobile numbers, message sending times, students' achievements, parents' mobile numbers, message themes, and so on.
The data should be preprocessed before clustering; preprocessing includes converting data, handling missing (default) values, handling abnormal values, handling isolated points, etc.
1) The data source has different data types, including numeric, text, and time types. They should be converted to unified data types. For example, the grade attribute is converted to a numeric value; senior school is coded as 10.
2) The sample data contain some missing values. A missing value is filled with a suitable value, usually the most frequent one.
3) Some data may be recorded incorrectly, and some deviate markedly from the mean, so they become outliers. For example, some scores are more than 100 or less than 0, and some scores are 5 or 10. These scores are far from the average value. Such data should be discarded so that the clustering result is more effective. TABLE Ⅱ describes the variable Chinese Score.
TABLE Ⅱ
Statistics of the variable Chinese Score
Variable        Minimum   Average   Maximum   Error
Chinese Score   10        58.67     108.00    20.60
To keep the data points within a certain range, a standard is defined that restricts the data points to [58.6667-2.0*20.6009, 100]. A data point outside this range is called an isolated point and must be deleted. In Fig.4, three scores are above 100 and two points are below 58.6667-2.0*20.6009. These points, with the values 104, 108, and 10, are removed together with all data items corresponding to them. The remaining points lie in [58.6667-2.0*20.6009, 100].
Fig. 4 Diagram of YuwenScore in the dataset
4) Attributes that are not related to the property values under analysis, such as creation time, must be removed; attributes such as test scores and the number of teachers’ messages are retained.
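Steps 2) and 3) of the preprocessing can be illustrated with a small sketch: missing values are filled with the most frequent value, and scores outside [average - 2.0*error, 100] are dropped. The function names `fill_missing_with_mode` and `keep_in_range`, the use of `None` to mark a missing value, and the sample scores in the usage lines are hypothetical; the average and error values are those reported in TABLE Ⅱ.

```python
from collections import Counter

def fill_missing_with_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    present = [v for v in values if v is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def keep_in_range(scores, mean, std, upper=100.0, width=2.0):
    """Keep only scores inside [mean - width * std, upper]."""
    lower = mean - width * std
    return [s for s in scores if lower <= s <= upper]

# Hypothetical sample scores; 10, 104 and 108 fall outside the allowed range.
sample = [10, 58, 104, 108, 75, 20]
kept = keep_in_range(sample, mean=58.6667, std=20.6009)
filled = fill_missing_with_mode([3, None, 3, 5])
```

In the paper, removing a score also removes the whole data item it belongs to, so a real implementation would filter records rather than a bare list of scores.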
Because attributes in the data set differ markedly in scale, attributes with small values would otherwise be dominated by attributes with large values and lose their role. The data set must therefore be normalized before clustering. That is,
$$x' = \frac{x - \mathrm{average}}{\max(x) - \min(x)}.$$
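The normalization formula can be applied column by column, as in the following sketch. The function name `mean_normalize` is hypothetical, and the guard for constant columns (where max and min coincide) is an assumption added to avoid division by zero.

```python
import numpy as np

def mean_normalize(X):
    """Subtract each column's average and divide by the column's range (max - min)."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0   # assumption: leave constant columns at zero instead of dividing by 0
    return (X - X.mean(axis=0)) / span
```

After this step every attribute has zero mean and a range of at most 1, so no single attribute dominates the Euclidean distance.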
The new algorithm is applied to the data set of "School XunTong" with the parameters k = 3, $t_0$ = 0.00001, $t_m$ = 48, MaxInnerLoop = 6, and MaxLoop = 1000. The new algorithm splits the 428-pattern data set into three clusters containing 258, 108, and 62 points, respectively. The clustering result is shown in TABLE Ⅲ.
The cluster centers, transformed back to the original data form, are as follows:
TABLE Ⅲ
Cluster center of SAKHM algorithm
Cluster   Chinese Score   Math Score   English Score   AvgScore   Life Msg   Study Msg
1         65.10           71.54        71.80           69.31      8.82       34.39
2         36.61           49.04        46.42           43.77      9.21       35.83
3         76.14           47.53        38.79           54.00      8.10       34.78
According to the results in TABLE Ⅲ, this section analyzes the clustering result of the School XunTong data set obtained with the K-harmonic means clustering algorithm.
Cluster 1: this cluster includes 258 data items; 60% of all students belong to it. According to the clustering result, the scores of the students in this cluster are better and more stable, with no imbalance between subjects. As for the communication between parents and teachers, compared with the other clusters, teachers send fewer messages about study and life to these students. In short, teachers send few text messages.
Cluster 2: this cluster includes 108 data items, 25% of all students. According to the clustering result, these students show no obvious imbalance between subjects, but their average scores are lower. As for the communication between parents and teachers, compared with the other clusters, teachers send more messages about study and life. Messages from teachers are most frequent in this cluster.
Cluster 3: this cluster includes 62 data items, a small part of the total. According to the clustering result, Chinese is very good, math is medium, English is poor, and the average score is low. Life messages from teachers are far fewer than study messages.
According to the above analysis, teachers actively send far more messages than parents send in reply, and the messages are mainly about study. From the clustering result, the enterprise can target the parents of students with poor academic achievement. Life information about good students is a key point for the enterprise in persuading parents to subscribe to the text message service. The enterprise can also analyze the characteristics of customer groups and set up new business.
The school can also make effective use of the analysis: good students can pursue comprehensive development; students who are unbalanced in one or more subjects should work on their weak courses to improve their scores quickly; teachers should pay more attention to students whose academic achievement is poor and try to help them in each subject, so that these students can make progress in a short time.
Ⅵ. CONCLUSION
Motivated by the shortcomings of k-means, namely its sensitivity to the starting point and its tendency to converge to a local minimum, this work presents an efficient algorithm called the K-harmonic means clustering algorithm with simulated annealing. Applied to the IRIS data set, the proposed algorithm obtains better results than k-means in our experiments. Efforts are underway to apply the proposed algorithm to "School XunTong" in order to find potential relationships between students’ achievement and the communication between parents and teachers and to guide students' study.
REFERENCES
[1] M.-C. Chiang et al., A time-efficient pattern reduction algorithm for k-means clustering, Information Sciences, 181 (2011), pp. 716–731.
[2] Z. Heng, Y. Wan-hai, A Fuzzy Clustering Algorithm Based on K-harmonic Means (in Chinese), Journal of Circuits and Systems, vol. 9, no. 5, pp. 114-117.
[3] Z. Güngör, A. Ünler, K-harmonic means data clustering with simulated annealing heuristic, Applied Mathematics and Computation, 2007, 199-209.
[4] Z. Güngör, A. Ünler, K-harmonic means data clustering with simulated annealing heuristic, Applied Mathematical Modelling, 32 (2008) 1115–1125.
[5] B. Zhang, M. Hsu et al., K-harmonic means - a data clustering algorithm, HP Technical Report HPL-2000-137, Hewlett-Packard Labs, 2000.
[6] L. Wei-min, Z. Ai-yun, L. Sun-Jian, Z. Fanggen, S. Jangsheng, Application Research of Simulated Annealing K-means clustering algorithm (in Chinese), Microcomputer Information, vol. 7, no. 3, pp. 182-184, 2008.
[7] S. Kirkpatrick et al., Optimization by Simulated Annealing, Science, vol. 220, no. 4598, pp. 671-680, 1983.
[8] J. Z. C. Lai, H. Tsung-Jen, L. Yi-Ching, A fast k-means clustering algorithm using cluster center displacement, Pattern Recognition, 42 (2009), 2551-2556.
[9] T. Jinlan, Z. Lin, Z. Suqin, L. Lu, Improvement and Parallelism of k-Means Clustering Algorithm, Tsinghua Science and Technology, vol. 10, no. 3, pp. 276-281, 2005.
[10] J. Pena, J. Lozano, P. Larranaga, An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recognition Letters, 1999, 20: 1027-1040.
[11] A. Likas, N. Vlassis, J. Verbeek, The global k-means clustering algorithm, IAS Technical Report IAS-UVA-01-02, Intelligent Autonomous Systems, 2001.
[12] G. Hamerly, C. Elkan, Alternatives to the k-means algorithm that find better clusterings, in 11th International Conference on Information and Knowledge Management (CIKM 2002), 2002, 600–607.
[13] R. Ng, J. Han, Efficient and effective clustering method for spatial data mining, in Proc. Int. Conf. Very Large Data Bases, San Francisco, CA: Morgan Kaufmann Publishers, 1994, 144-155.
[14] F. J. McErlean, D. A. Bell, S. I. McClean, The Use of Simulated Annealing for Clustering Data in Databases, Information Systems, 1990, 15(2).
[15] M. Chen, An overview from a database perspective, IEEE Trans. on KDE, 1996, 8(6): 866-883.