Application Research of Modified K-means Clustering Algorithm

Guo-li Liu, You-qian Tan, Li-mei Yu, Jia Liu, Jin-qiao Gao
Department of Computer Science and Software, Hebei University of Technology, Tianjin, China (lgl6699@163.com)

Abstract - This paper presents an efficient algorithm, the K-harmonic means clustering algorithm with simulated annealing, which reduces the dependence on initial values and avoids convergence to local minima. In the proposed algorithm, K-harmonic means addresses the sensitivity of the clustering result to the initial values, while simulated annealing lets the clustering jump out of local optima during the iterations. The clustering result is verified by experiments on the IRIS data set. School XunTong is application software that supports communication between parents and teachers. This paper applies the new algorithm to the School XunTong data set and finds relationships between students' achievement and the communication between parents and teachers. Finally, the classification result is used to guide students' learning.

Keywords - K-harmonic means, simulated annealing, local minimum, School XunTong

I. INTRODUCTION

Among the commonly used clustering algorithms, K-means is a typical one, widely used for its simplicity and effectiveness. However, it depends on the initial values and its clustering result converges only locally. There are two main ways to improve it: first, run K-means many times and choose the best run as the final clustering result; second, design new algorithms. To address the initial-value and local-convergence problems of K-means, this paper puts forward a new algorithm that combines the K-harmonic means algorithm with the simulated annealing (SA) algorithm.
“School XunTong” is application software that provides a service to students' parents and contains information about students. In this paper, the new algorithm is used to cluster the School XunTong data set and to find potential relationships in the clustering result.

II. FUNDAMENTALS OF THE ALGORITHM

A. K-means algorithm

The K-means algorithm (KM) is a common partition-based clustering algorithm and the oldest classical one [1]. In cluster analysis we assume a finite set of points X in the d-dimensional space R^d, that is, X = {x_i | x_i ∈ R^d, i = 1, 2, ..., n}. K-means partitions the data set X into a given number k of disjoint subsets C_1, C_2, ..., C_k. An optimal clustering is a partition that minimizes the intra-cluster distance and maximizes the inter-cluster distance. In practice, the most popular similarity measure is the Euclidean distance, due to its computational simplicity:

    d(i, j) = sqrt((x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + ... + (x_{id} - x_{jd})^2)    (1)

where x_i = (x_{i1}, x_{i2}, ..., x_{id}) ∈ R^d and x_j = (x_{j1}, x_{j2}, ..., x_{jd}) ∈ R^d. The clusters are re-marked at each iteration, and each cluster center is updated as the mean of the points assigned to it:

    c_i = (1 / n_i) Σ_{x ∈ C_i} x    (2)

where n_i is the number of points in C_i and x = (x_{i1}, x_{i2}, ..., x_{id}). The main idea behind K-means is the minimization of an objective function, usually taken as a function of the deviations of all patterns from their respective cluster centers. The sum of squared Euclidean distances has been adopted as the objective function in most studies:

    E = Σ_{j=1}^{k} Σ_{i=1}^{n} d_{ij}(x_i, c_j)    (3)

where d_{ij} is the squared Euclidean distance from x_i to its cluster center c_j. K-means is simple and efficient and scales well to large data sets; however, its clustering result is extremely sensitive to the initial values and it often converges to a local minimum. The K-harmonic means algorithm solves the sensitivity of the clustering result to the initial values.

B.
K-harmonic means algorithm

K-harmonic means (KHM) is a center-based algorithm developed to solve the clustering problem [2-5]. It uses the harmonic average of the distances from each data point to the cluster centers instead of the minimum distance used in K-means. The harmonic average is defined as

    HA(x) = k / Σ_{c ∈ C} (1 / d^p(x, c))    (4)

where x ∈ X denotes a point of the finite set X in the d-dimensional space R^d, c ∈ C denotes a cluster center, d^p(x, c) denotes the p-th power of the distance between the two points, and k denotes the number of clusters. The cluster centers are iterated as

    c_j = [ Σ_{i=1}^{n} (d_{ij}^{-p-2} / (Σ_{l=1}^{k} d_{il}^{-p})^2) x_i ] / [ Σ_{i=1}^{n} (d_{ij}^{-p-2} / (Σ_{l=1}^{k} d_{il}^{-p})^2) ]    (5)

where d_{ij} denotes the distance between x_i and c_j. The iteration of the cluster centers continually decreases the objective function

    KHM(X, C) = Σ_{i=1}^{n} k / Σ_{c ∈ C} (1 / d^p(x_i, c))    (6)

The KHM objective also introduces a conditional probability of each cluster center given each data point, and a dynamic weight for each data point in every iteration [3]. KHM thus removes the weakness that K-means is sensitive to the initial values; however, it can still converge to a local minimum. Heuristic algorithms are known to have very good optimization features [11]; in this paper, we use the simulated annealing algorithm to solve the local-minimum problem of K-means.

C. Simulated Annealing

Simulated annealing (SA), presented by Metropolis, Rosenbluth and others in 1953 [6, 7, 14], is an iterative method for finding approximate solutions to intractable combinatorial optimization problems. The SA methodology resembles the cooling process of molten metals through annealing. The cooling phenomenon is simulated by controlling a parameter, the temperature T, introduced through the concept of the Boltzmann probability distribution. Metropolis suggested a way to implement the Boltzmann distribution in simulated thermodynamic systems that can also be used in the function-minimization context. At any instant the search holds a current point x_1 and its function value. Based on the Metropolis algorithm, the probability of moving to the next point x_2 depends on the difference ΔE between the function values at the two points: there is a finite probability of selecting x_2 even when it is worse than x_1, and that probability depends on the relative magnitudes of ΔE and T. The optimal solution is obtained by simulating slow cooling, that is, by sampling repeatedly [7]. The initial temperature, the cooling rate, the number of iterations performed at each temperature, and the stopping condition are the most important parameters governing the success of the SA procedure. Simulated annealing solves the problem that K-means always converges to a local minimum.

III. K-HARMONIC MEANS WITH SIMULATED ANNEALING ALGORITHM

To overcome the dependence on the initial state and the convergence to local optima of K-means, this paper proposes a new algorithm called the K-harmonic means clustering algorithm with simulated annealing.

A. Algorithm theory

The K-harmonic means clustering algorithm with simulated annealing (SAKHM) combines K-harmonic means and simulated annealing; its parameters must be set according to the features of the combined algorithm. The main idea of SAKHM is to take the clustering result produced by K-harmonic means as the initial value of the simulated annealing algorithm. In the simulated iterative process, a new solution is generated by randomly disturbing the current one: the category of one or several sample points is changed at random, producing a new clustering partition, so that the algorithm can jump out of local minima, exploit its global optimization ability, and finally obtain a globally optimal clustering result that is not affected by the initial values.
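The K-harmonic means objective (6) and center update (5) that produce this initial clustering can be sketched as a single iteration. This is an illustrative NumPy sketch, not the authors' code; the default value of p and the small epsilon guard against zero distances are assumptions:

```python
import numpy as np

def khm_step(X, centers, p=3.5):
    """One K-harmonic-means iteration: return the updated centers
    per update rule (5) and the KHM objective (6) at the incoming centers."""
    eps = 1e-12  # guard against division by zero when a point sits on a center
    # d[i, j] = Euclidean distance from point i to center j
    d = np.sqrt(((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)) + eps
    k = centers.shape[0]
    # objective (6): sum over points of the harmonic average of distances
    obj = (k / (1.0 / d ** p).sum(axis=1)).sum()
    # per-point, per-center weights from update rule (5)
    num = 1.0 / d ** (p + 2)                                # d_{ij}^{-p-2}
    den = (1.0 / d ** p).sum(axis=1, keepdims=True) ** 2    # (sum_l d_{il}^{-p})^2
    w = num / den
    new_centers = (w[:, :, None] * X[:, None, :]).sum(axis=0) / w.sum(axis=0)[:, None]
    return new_centers, obj
```

Iterating `khm_step` from a rough initial guess drives the centers toward the cluster cores while the objective decreases, which is exactly the behavior SAKHM relies on for its initial solution.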
The steps of the SAKHM algorithm are:

1) Initialize the initial temperature t_0, the final temperature t_m, the number of inner-loop iterations MaxInnerLoop, and the cooling rate DR.

2) Apply the K-harmonic algorithm: each point is assigned to its closest center by the minimum-distance rule, the centroid of each cluster is computed to obtain new clusters, and the objective function J(1) is evaluated. This clustering result is the initial solution w.

3) Set the inner-loop counter InnerLoop to 0 and initialize the outer-loop counter i.

4) Perform one iteration to generate an improved set of clusters: update the cluster centers w(i) and compute the objective function J(i+1) of the new partition. If J(i+1) ≤ J(i), the new cluster centers are accepted. If J(i+1) > J(i), compute the acceptance probability from the relative magnitudes of ΔJ and the temperature, p = exp(-(J(i+1) - J(i)) / (s·t(i))), where t(i) is the current temperature and s is a constant. Let r be a random number in [0, 1]; if p > r, the new cluster centers are accepted, otherwise the previous cluster centers continue to iterate.

5) If InnerLoop < MaxInnerLoop and i < MaxLoop, increase InnerLoop by 1 and i by 1 and go to step 4); otherwise go to step 6).

6) If t(i) > t_m, update the temperature by t(i+1) = DR · t(i) and go to step 3); otherwise stop the program.

B. K-harmonic means clustering algorithm with simulated annealing based on DK-t_0

The K-harmonic means clustering algorithm with simulated annealing applies the simulated annealing methodology to the K-means framework, so its key parameters must be set carefully; the following four aspects describe how. Such strategies can significantly affect the performance of the new algorithm; aspects 3) and 4) are the key contributions of the paper.

1) Choice of the objective function. The sum of squared Euclidean distances is adopted as the objective function.

2) Temperature update. The algorithm uses the cooling-rate schedule presented by Kirkpatrick and others to decrease the temperature. Let DR be the cooling rate, with DR close to 1; the temperature is updated as T(k+1) = DR · T(k), where k is the update count. The cooling speed is controlled by DR; this paper sets DR = 0.98.

3) Generating the initial temperature. In simulated annealing research, the selection principle for the initial temperature is that at the beginning of the annealing the temperature must be high enough to allow moving to any state; but if it is too high, almost every new solution is accepted for a long while, which hurts the algorithm's effect and wastes computation. Although scholars have proposed many setting methods, there is no unified and clearly more effective method for the initial temperature. In the simulated annealing algorithm, setting the initial temperature T_0 corresponds to setting the initial value t_0 of the control parameter in SAKHM; on the basis of existing research, this paper puts forward a method to select t_0. The concrete content is as follows. According to equilibrium theory, t_0 should be selected large enough: if the initial acceptance rate is assumed to be v_0, the Metropolis acceptance criterion requires exp(-ΔJ / (a·t_0)) ≈ 1, so t_0 must be big; but if t_0 is too large, the number of iterations and the computing time increase. The best choice of t_0 is the minimum value that still reaches a scheduled initial acceptance rate v_0 (usually 0.8). Kirkpatrick and others proposed a method to select the initial temperature, called the experience method: select a large value as t_0, transform several times, and if the acceptance rate v < v_0, double t_0 until v ≥ v_0. In this paper we combine the experience method with the objective function of K-harmonic means clustering to select t_0. The method is as follows: first, take the value of the K-harmonic means clustering objective function as t_0; then transform several times as above. If the acceptance rate v < v_0 (v_0 = 0.8), double t_0 until v ≥ v_0, at which point t_0 is the requested value; if v > v_0, halve t_0 until v ≈ v_0, at which point t_0 is the requested value. This yields the minimum value that meets the condition. Because the selection of t_0 combines the experience method with the objective function of K-harmonic means clustering, it is called the DK-t_0 selection method.

4) How to generate new solutions. To balance the algorithm at its start, K-harmonic means divides the data set into several clusters and this clustering result serves as the initial solution. Because the computation of K-harmonic means is very heavy, in the subsequent simulated annealing iterations the cluster centers and the objective function are updated by the K-means criterion to reduce the running time, and a good clustering result is still obtained. In SAKHM, new solutions are generated by disturbing the current one; that is, the algorithm naturally moves one or more points into other clusters. But the initial solution, the K-harmonic means clustering result, does not partition the points crisply into clusters; in the disturbance process each point must be crisply assigned to exactly one cluster.
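The loop of steps 1)-6), with the single-point disturbance of aspect 4) and the K-means SSE criterion for the inner iterations, can be sketched as follows. This is a hedged illustration, not the authors' code: the helper names, the default parameter values, and the constant s = 1 are assumptions.

```python
import math
import random

import numpy as np

def sse(X, labels, k):
    """Sum-of-squared-distances objective for a labelling (K-means criterion)."""
    total = 0.0
    for j in range(k):
        pts = X[labels == j]
        if len(pts):
            total += ((pts - pts.mean(axis=0)) ** 2).sum()
    return total

def sakhm(X, labels, k, t0=50.0, tm=1e-5, dr=0.98, inner=6, s=1.0, seed=0):
    """Simulated annealing on top of an initial clustering (e.g. from KHM):
    disturb one point's cluster, accept worse moves with prob exp(-dJ/(s*t))."""
    rng = random.Random(seed)
    cur = labels.copy()
    cur_j = sse(X, cur, k)
    best, best_j = cur.copy(), cur_j
    t = t0
    while t > tm:                                # step 6): stop at t <= tm
        for _ in range(inner):                   # step 5): MaxInnerLoop
            cand = cur.copy()
            i = rng.randrange(len(X))
            cand[i] = rng.randrange(k)           # random disturbance (aspect 4)
            cand_j = sse(X, cand, k)
            dj = cand_j - cur_j
            # step 4): Metropolis acceptance criterion
            if dj <= 0 or rng.random() < math.exp(-dj / (s * t)):
                cur, cur_j = cand, cand_j
                if cur_j < best_j:
                    best, best_j = cur.copy(), cur_j
        t *= dr                                  # step 6): t(i+1) = DR * t(i)
    return best, best_j
```

At high temperatures nearly all disturbances are accepted, which lets the search escape local minima; as t decays the loop degenerates into greedy improvement, settling into the best partition visited.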
To sum up, the data are split into clusters by the K-harmonic algorithm, each point is assigned to exactly one cluster by the minimum-distance principle, and the corresponding objective function is calculated. The new algorithm starts from this clustering result, and new solutions are generated by disturbing the current one.

IV. VALIDATION BASED ON THE K-HARMONIC ALGORITHM

To evaluate the performance of the K-harmonic means clustering algorithm with simulated annealing, the new algorithm is applied to the IRIS data set, which has N = 150 points with four attributes each: calyx length, calyx width, petal length, and petal width. In this article, the distance between the computed cluster centers and the actual centers is used as the evaluation of the algorithm. Because the ranges of the data attributes are very different, the data set must be normalized before clustering. After 20 independent runs, the main results are gathered in TABLE I, including the minimum, average, and maximum of the objective function, the difference from the actual value, and the average CPU time; the last two columns are averages over the 20 runs.

TABLE I
Clustering results of the KM and SAKHM algorithms

algorithm | Minimum | Average | Maximum | Error | CPU time
KM        | 140.94  | 175.56  | 207.06  | 18.89 |  0.05
SA-KHM    |  76.32  |  79.82  |  84.98  |  2.63 | 16.94

From TABLE I, the running time of the K-means algorithm is the smallest, but its clustering results differ markedly across different initial values: the maximum of the objective function differs greatly from the minimum, showing that K-means is highly sensitive to the initial values. On the contrary, the running time of K-harmonic means clustering with simulated annealing increases obviously, but the variation of the objective function across runs is much smaller, and the difference between the computed cluster centers and the actual centers is small.
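A table like TABLE I can be produced by re-running a clustering routine with different seeds and summarizing the objective values. The following is an illustrative NumPy sketch, not the authors' implementation; the random initialization, the convergence test, and the toy data in the test are assumptions:

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic K-means: assign each point to its nearest center, then move
    each center to the mean of its points; returns centers, labels, SSE."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # squared Euclidean distance from every point to every center
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = d2[np.arange(len(X)), labels].sum()  # objective (3)
    return centers, labels, sse

def summarize(X, k, runs=20):
    """Min / average / max of the objective over seeded re-runs,
    mirroring the first three columns of TABLE I."""
    vals = [kmeans(X, k, seed=s)[2] for s in range(runs)]
    return min(vals), sum(vals) / runs, max(vals)
```

A wide min-max spread in the summary indicates sensitivity to the initial centers, which is exactly what TABLE I reports for plain K-means.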
The figures show the clustering result of the eighth run; the horizontal axis is the calyx length and the vertical axis is the calyx width.

Fig. 1. Clustering results of the KM algorithm

Fig. 2. Clustering results of the SAKHM algorithm

Fig. 2 shows the result of the SAKHM cluster analysis. Compared with the KM result in Fig. 1, its point sets have some intersections, which Fig. 1 lacks; this shows the global-search ability, away from the local minimum, gained on top of K-harmonic means.

The second experiment uses the statistical chart of the first bachelor aspirations in science and engineering in 2009. The data set includes 195 records, each with 5 properties: total score, Chinese score, math score, foreign-language score, and school level. If a school is among the 211 colleges, the level is second; if it is among the 985 colleges, the level is third; if it is one of the 34 key universities, the level is fourth. The Chinese, math, and foreign-language scores are the lowest scores admitted in the first delivery. In the clustering process, t_0 is set both by the experience method and by the improved method, and each factor combination is tested 7 times on the test problems.

Fig. 3. Contrast of the two methods' objective functions (vertical axis: objective value, roughly 480 to 580; full line: old parameter, dotted line: improved parameter)

Fig. 3 compares the objective-function performance of the two approaches; the full line and the dotted line represent the unimproved-parameter and improved-parameter approaches, respectively. The DK-t_0 method finds a smaller clustering objective and shows less variation of the objective; that is, the objective value is more stable and the clustering effect is better.

V. APPLYING THE IMPROVED ALGORITHM TO SCHOOL XUNTONG

In recent years, more and more parents have been concerned with their children's situation in school.
Certain companies have cooperated with the mobile operator to develop a digital campus system. "School XunTong" is application software that conveniently exchanges students' information between parents and teachers; it mainly targets primary and middle school students. Teachers can send and receive text messages freely and enjoy Internet service, parents can acquire students' information by subscribing to the service, and the company profits from the business as well. Because the target customers of "School XunTong" are the primary and middle school students of a whole province or city, it accumulates a large amount of complex data that is difficult to manage and, by itself, of little value to the company. With the increasing number of users, the data grow rapidly, and people want to find useful information in the "School XunTong" database. From this perspective, analyzing the database of "School XunTong" with a cluster-analysis algorithm becomes very meaningful.

The data to be analyzed come from the school database of "School XunTong" and include more than four hundred students' information from September 2009 to December 2009: students' accounts, parents' accounts, mobile numbers, times of sending messages, students' achievements, parents' mobile numbers, message themes, and so on. The data must be preprocessed before clustering; the preprocessing includes transferring data types, handling default values, handling abnormal values, handling isolated points, etc.

1) The data source has different data types, including numeric, text, time, etc. They should be converted to unified data types; for example, the grade is converted to numeric, with senior school equal to 10.

2) The sample data contain some default values. A default value is filled with an appropriate value, usually the most frequently occurring one.
3) Some data may be recorded in error, and some deviate obviously from the mean and become outliers. For example, some scores are more than 100 or less than 0, and some scores are 5 or 10, far away from the average value. These data should be discarded so that the clustering result is more effective. TABLE II describes the variable Chinese Score.

TABLE II
Statistics of the variable Chinese Score

variable      | Minimum | Average | Maximum | Error
Chinese Score | 10      | 58.67   | 108.00  | 20.60

To keep the data points in a reasonable range, the standard adopted is that a point must lie in [58.6667 - 2.0 × 20.6009, 100]. A point beyond this range is called an isolated point and must be deleted. In Fig. 4 there are three scores above 100 and two points below 58.6667 - 2.0 × 20.6009; the points 104, 108, and 10 are removed, together with all data items corresponding to them, so that the remaining points lie in [58.6667 - 2.0 × 20.6009, 100].

Fig. 4. Diagram of YuwenScore in the data set

4) Some fields that are unrelated to the property values, such as the creation time, must be removed; the test scores and the counts of teachers' messages are kept.

Because the attributes of the data set have obviously different ranges, and so that small values are not covered by large ones and do not lose their key role, the data set must be normalized before clustering:

    x' = (x - average(x)) / (max(x) - min(x))

The new algorithm is then applied to the "School XunTong" data set. With parameters k = 3, t_0 = 48, t_m = 0.00001, MaxInnerLoop = 6, and MaxLoop = 1000, the new algorithm splits the 428-pattern data set into three clusters of 258, 108, and 62 points respectively. The clustering result is shown in TABLE III.
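The outlier rule and the min-max normalization described above can be sketched as follows. This is an illustrative Python sketch; the helper names are assumptions, and the mean and deviation are recomputed from the sample rather than taken from TABLE II:

```python
import numpy as np

def remove_outliers(scores, n_std=2.0, upper=100.0):
    """Keep points inside [mean - n_std*std, upper], analogous to the
    [58.67 - 2.0*20.60, 100] rule applied to the Chinese scores."""
    scores = np.asarray(scores, dtype=float)
    lo = scores.mean() - n_std * scores.std()
    return scores[(scores >= lo) & (scores <= upper)]

def normalize(X):
    """Column-wise min-max scaling around the mean:
    x' = (x - average(x)) / (max(x) - min(x))."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / (X.max(axis=0) - X.min(axis=0))
```

After this step every attribute lies in a comparable range, so no attribute dominates the distance computation in the clustering.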
The cluster centers are transferred back to the original data form as follows:

TABLE III
Cluster centers of the SAKHM algorithm

cluster | Chinese Score | Math Score | English Score | AvgScore | Life Msg | Study Msg
1       | 65.10         | 71.54      | 71.80         | 69.31    | 8.82     | 34.39
2       | 36.61         | 49.04      | 46.42         | 43.77    | 9.21     | 35.83
3       | 76.14         | 47.53      | 38.79         | 54.00    | 8.10     | 34.78

Based on the results in TABLE III, this article analyzes the clustering of the School XunTong data set produced by the K-harmonic means cluster algorithm.

First cluster: 258 data items, about 60% of all students. According to the clustering result, the scores of these students are better and more stable, with no lopsided subjects. Regarding the communication between parents and teachers, compared with the other clusters the teachers send fewer messages about these students' learning and life; in a word, teachers send few text messages.

Second cluster: 108 data items, about 25% of all students. These students also show no obviously lopsided subjects, but their average scores are lower. Compared with the other clusters, the teachers send more messages about these students' learning and life; in this cluster, messages from teachers are the most numerous.

Third cluster: 62 data items, a small part of the total. According to the clustering result, Chinese is very good, math is medium, English is poor, and the average scores are low. Life messages from teachers are far fewer than learning messages.

According to the above analysis, the teachers actively send far more messages than the parents reply, and the messages are mainly about learning. From the clustering result, the enterprise can target the parents of students with poor academic achievement.
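Transferring the normalized cluster centers back to the original data form, as in TABLE III, is just the inverse of the min-max scaling used in preprocessing. A minimal sketch, assuming the same column-wise statistics are still available (the helper name is hypothetical):

```python
import numpy as np

def denormalize(C_norm, X_orig):
    """Map normalized cluster centers back to the original scale,
    inverting x' = (x - average(x)) / (max(x) - min(x)) column-wise."""
    X_orig = np.asarray(X_orig, dtype=float)
    col_range = X_orig.max(axis=0) - X_orig.min(axis=0)
    return np.asarray(C_norm, dtype=float) * col_range + X_orig.mean(axis=0)
```

This makes the centers directly readable as scores and message counts, which is how the per-cluster interpretation above is obtained.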
Life information about good students is a key point for the enterprise to persuade parents to open the text-message service. The enterprise can also analyze the characteristics of customer groups and set up new business. The school can use the analysis effectively: good students can develop comprehensively; students who are unbalanced across one or more subjects should work on their weak courses and can rapidly improve their scores; teachers should pay more attention to students with poor academic achievement and try to help them in each subject, so that these students can make great progress in a short time.

VI. CONCLUSION

Motivated by the shortcomings of K-means, namely its sensitivity to the starting point and its convergence to local minima, this work presents an efficient algorithm called the K-harmonic means clustering algorithm with simulated annealing. Applying the new algorithm to the IRIS data set, our experimental results indicate that the proposed algorithm obtains better results than K-means. Efforts are under way to apply the proposed algorithm to "School XunTong" in order to find potential relationships between students' achievement and the communication between parents and teachers and to guide students' study.

REFERENCES

[1] M.-C. Chiang et al., A time-efficient pattern reduction algorithm for k-means clustering, Information Sciences, 181 (2011), pp. 716-731.
[2] Z. Heng, Y. Wan-hai, A Fuzzy Clustering Algorithm Based on K-harmonic Means (in Chinese), Journal of Circuits and Systems, vol. 9, no. 5, pp. 114-117.
[3] Z. Güngör, A. Ünler, K-harmonic means data clustering with simulated annealing heuristic, Applied Mathematics and Computation, 2007, pp. 199-209.
[4] Z. Güngör, A. Ünler, K-harmonic means data clustering with simulated annealing heuristic, Applied Mathematical Modelling, 32 (2008), pp. 1115-1125.
[5] B. Zhang, M. Hsu et al., K-harmonic means - a data clustering algorithm, HP Technical Report HPL-2000-137, Hewlett-Packard Labs, 2000.
[6] L. Wei-min, Z. Ai-yun, L. Sun-Jian, Z. Fang-gen, S. Jang-sheng, Application Research of Simulated Annealing K-means clustering algorithm (in Chinese), Microcomputer Information, vol. 7, no. 3, pp. 182-184, 2008.
[7] S. Kirkpatrick et al., Optimization by Simulated Annealing, Science, vol. 220, no. 4598, pp. 671-680, 1983.
[8] J. Z. C. Lai, H. Tsung-Jen, L. Yi-Ching, A fast k-means clustering algorithm using cluster center displacement, Pattern Recognition, 42 (2009), pp. 2551-2556.
[9] T. Jinlan, Z. Lin, Z. Suqin, L. Lu, Improvement and Parallelism of k-Means Clustering Algorithm, Tsinghua Science and Technology, vol. 10, no. 3, pp. 276-281, 2005.
[10] J. Pena, J. Lozano, P. Larranaga, An empirical comparison of four initialization methods for the k-means algorithm, Pattern Recognition Letters, 20 (1999), pp. 1027-1040.
[11] A. Likas, N. Vlassis, J. Verbeek, The global k-means clustering algorithm, IAS Technical Report IAS-UVA-01-02, Intelligent Autonomous Systems, 2001.
[12] G. Hamerly, C. Elkan, Alternatives to the k-means algorithm that find better clusterings, in 11th International Conference on Information and Knowledge Management (CIKM 2002), 2002, pp. 600-607.
[13] R. Ng, J. Han, Efficient and effective clustering method for spatial data mining, in Proc. Int. Conf. Very Large Data Bases, San Francisco, CA: Morgan Kaufmann, 1994, pp. 144-155.
[14] F. J. McErlean, D. A. Bell, S. I. McClean, The Use of Simulated Annealing for Clustering Data in Databases, Information Systems, 1990, 15(2).
[15] M. Chen, An overview from a database perspective, IEEE Trans. on Knowledge and Data Engineering, 1996, 8(6), pp. 866-883.