Performance Functions and Clustering Algorithms

Wesam Barbakh and Colin Fyfe

Abstract. We investigate the effect of different performance functions for measuring the performance of clustering algorithms and derive different algorithms depending on which performance function is used. In particular, we show that two algorithms may be derived which do not exhibit the dependence on initial conditions (and hence the tendency to get stuck in local optima) that the standard K-Means algorithm exhibits.

INTRODUCTION

The K-Means algorithm is one of the most frequently used investigatory algorithms in data analysis. The algorithm attempts to locate K prototypes or means throughout a data set in such a way that the K prototypes in some way best represent the data. It is one of the first algorithms a data analyst will use to investigate a new data set because it is algorithmically simple, relatively robust and gives 'good enough' answers over a wide variety of data sets: it will often not be the single best algorithm on any individual data set, but it is close to optimal over a wide range of data sets. However, the algorithm is known to suffer from the defect that the means or prototypes found depend on the initial values given to them at the start of the simulation. There are a number of heuristics in the literature which attempt to address this issue but, at heart, the fault lies in the performance function on which K-Means is based.

Recently, there have been several investigations of alternative performance functions for clustering algorithms. One of the most effective updates of K-Means has been K-Harmonic Means, which minimises

J_{HM} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{\|x_i - m_k\|^2}}

for data samples \{x_1, \dots, x_N\} and prototypes \{m_1, \dots, m_K\}. This performance function can be shown to be minimised when

m_k = \frac{\sum_{i=1}^{N} \frac{x_i}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}}

where d_{ik} = \|x_i - m_k\|.

In this paper, we investigate two alternative performance functions and show the effect the different functions have on the effectiveness of the resulting algorithms. We are specifically interested in developing algorithms which are effective in a worst case scenario: when the prototypes are all initialised at the same position, very far from the data points. If an algorithm can cope with this scenario, it should be able to cope with a more benevolent initialisation.

PERFORMANCE CLUSTERING FUNCTIONS

The performance function for K-Means may be written as

J_K = \sum_{i=1}^{N} \min_{k \in \{1, \dots, M\}} \|x_i - m_k\|^2    (1)

which we wish to minimise by moving the prototypes to the appropriate positions. Note that (1) detects only the centres closest to the data points and then distributes them to give the minimum performance which determines the clustering. Any prototype which is still far from the data is not utilised and does not enter any calculation giving the minimum performance, which may result in dead prototypes: prototypes which are never appropriate for any cluster. Thus initialising the centres appropriately can have a big effect on K-Means.
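Before illustrating this dependence on initialisation with a toy example, note that the two performance functions met so far, J_K in (1) and the K-Harmonic Means criterion from the introduction, translate directly into a few lines of NumPy. The sketch below is our own illustration, not the authors' code: the function names, the array layout (rows are data points) and the small eps guard against division by zero are all our choices. It also applies one batch K-Harmonic Means update as given above.

```python
import numpy as np

def kmeans_performance(X, M):
    """J_K: sum over data points of the squared distance to the closest prototype."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)   # d[i, k] = ||x_i - m_k||
    return np.sum(np.min(d, axis=1) ** 2)

def kharmonic_performance(X, M, eps=1e-12):
    """J_HM: sum over data points of K / sum_k 1 / ||x_i - m_k||^2."""
    K = M.shape[0]
    d2 = np.sum((X[:, None, :] - M[None, :, :]) ** 2, axis=2) + eps
    return np.sum(K / np.sum(1.0 / d2, axis=1))

def kharmonic_update(X, M, eps=1e-12):
    """One batch K-Harmonic Means update:
    m_k = sum_i w_ik x_i / sum_i w_ik with w_ik = 1 / (d_ik^4 (sum_l 1/d_il^2)^2)."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps
    s = np.sum(1.0 / d ** 2, axis=1, keepdims=True)   # sum_l 1 / d_il^2, one value per data point
    w = 1.0 / (d ** 4 * s ** 2)                       # w[i, k]
    return (w.T @ X) / np.sum(w, axis=0)[:, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))      # toy data
    M = rng.normal(size=(3, 2))        # three prototypes
    print(kmeans_performance(X, M), kharmonic_performance(X, M))
    M = kharmonic_update(X, M)         # one K-Harmonic Means step
```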
We can illustrate this effect with the following toy example: assume we have 3 data points (X_1, X_2 and X_3) and 3 prototypes (m_1, m_2 and m_3). Let d_{i,k} = \|X_i - m_k\|^2, so that the distances between them are

        m_1      m_2      m_3
X_1    d_{1,1}  d_{1,2}  d_{1,3}
X_2    d_{2,1}  d_{2,2}  d_{2,3}
X_3    d_{3,1}  d_{3,2}  d_{3,3}

We consider the situation in which m_1 is closest to X_1, m_1 is closest to X_2 and m_2 is closest to X_3, and so m_3 is not closest to any data point. Then

J_K = d_{1,1} + d_{2,1} + d_{3,2}

which we minimise by changing the positions of the prototypes m_1, m_2 and m_3. Then

\frac{\partial J_K}{\partial m_1} = \frac{\partial (d_{1,1} + d_{2,1})}{\partial m_1} \neq 0, \quad \frac{\partial J_K}{\partial m_2} = \frac{\partial d_{3,2}}{\partial m_2} \neq 0, \quad \frac{\partial J_K}{\partial m_3} = 0.

So it is possible to find new locations for m_1 and m_2 which minimise the performance function and so determine the clustering, but it is not possible to find a new location for prototype m_3 as it is far from the data and is not the minimum for any data point.

We might consider the following performance function:

J_1 = \sum_{i=1}^{N} \sum_{k=1}^{M} \|X_i - m_k\|^2    (2)

which provides a relationship between all the data points and all the prototypes, but it does not provide a clustering at minimum performance since

\frac{\partial J_1}{\partial m_k} = 0 \;\Rightarrow\; m_k = \frac{1}{N} \sum_{i=1}^{N} X_i \quad \text{for every } k.

Minimising this performance function groups all the prototypes at the centre of the data set regardless of the initial positions of the prototypes, which is useless for the identification of clusters.

A combined performance function

We wish to form a performance function with the following properties:
- Minimum performance gives a good clustering.
- It creates a relationship between all data points and all prototypes.

(2) provides an attempt to reduce the sensitivity to the initialisation of the centres by creating a relationship between all data points and all centres, while (1) provides an attempt to cluster the data points at minimum performance. Therefore it may seem that what we want is to combine features of (1) and (2) in a performance function such as

J_A = \sum_{i=1}^{N} \left( \sum_{L=1}^{M} \|X_i - m_L\| \right) \min_{k} \|X_i - m_k\|^2    (3)

We derive the clustering algorithm associated with this performance function by calculating the partial derivatives of (3) with respect to the prototypes. Consider the presentation of a specific data point X_a and the prototype m_k closest to X_a, so that \min_j \|X_a - m_j\|^2 = \|X_a - m_k\|^2. Then

\frac{\partial Perf(i=a)}{\partial m_k} = -\frac{X_a - m_k}{\|X_a - m_k\|} \|X_a - m_k\|^2 - 2 (X_a - m_k) \left( \|X_a - m_1\| + \dots + \|X_a - m_k\| + \dots + \|X_a - m_M\| \right) = -(X_a - m_k) A_{ak}    (4)

where

A_{ak} = \|X_a - m_k\| + 2 \left( \|X_a - m_1\| + \dots + \|X_a - m_M\| \right).

Now consider a second data point X_b for which m_k is not the closest prototype, i.e. the min() function gives the distance with respect to a prototype other than m_k: \min_j \|X_b - m_j\|^2 = \|X_b - m_r\|^2 with r \neq k. Then

\frac{\partial Perf(i=b)}{\partial m_k} = -\frac{X_b - m_k}{\|X_b - m_k\|} \|X_b - m_r\|^2 = -(X_b - m_k) B_{bk}    (5)

where

B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|}.

For the algorithm, the partial derivative of Perf with respect to m_k over all the data points is therefore composed of terms of the form (4), of the form (5), or both. Consider the specific situation in which m_k is closest to X_2 but is not the closest prototype to X_1 or X_3. Then we have

\frac{\partial Perf}{\partial m_k} = -(X_1 - m_k) B_{1k} - (X_2 - m_k) A_{2k} - (X_3 - m_k) B_{3k}.

Setting this to 0 and solving for m_k gives

m_k = \frac{X_1 B_{1k} + X_2 A_{2k} + X_3 B_{3k}}{B_{1k} + A_{2k} + B_{3k}}.

Consider the previous example with 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3), with the distances such that m_1 is closest to X_1, m_1 is closest to X_2, and m_2 is closest to X_3. Writing min_i for the (unsquared) distance from X_i to its closest centre and d_{i,k} = \|X_i - m_k\| for the other distances, we have

        m_1      m_2      m_3
X_1    min_1   d_{1,2}  d_{1,3}
X_2    min_2   d_{2,2}  d_{2,3}
X_3    d_{3,1}  min_3   d_{3,3}

Then after training

m_1 = \frac{X_1 A_{11} + X_2 A_{21} + X_3 B_{31}}{A_{11} + A_{21} + B_{31}}

where

A_{11} = min_1 + 2(min_1 + d_{1,2} + d_{1,3}), \quad A_{21} = min_2 + 2(min_2 + d_{2,2} + d_{2,3}), \quad B_{31} = \frac{min_3^2}{d_{3,1}},

m_2 = \frac{X_1 B_{12} + X_2 B_{22} + X_3 A_{32}}{B_{12} + B_{22} + A_{32}}

where

B_{12} = \frac{min_1^2}{d_{1,2}}, \quad B_{22} = \frac{min_2^2}{d_{2,2}}, \quad A_{32} = min_3 + 2(d_{3,1} + min_3 + d_{3,3}),

and

m_3 = \frac{X_1 B_{13} + X_2 B_{23} + X_3 B_{33}}{B_{13} + B_{23} + B_{33}}

where

B_{13} = \frac{min_1^2}{d_{1,3}}, \quad B_{23} = \frac{min_2^2}{d_{2,3}}, \quad B_{33} = \frac{min_3^2}{d_{3,3}}.

This algorithm will cluster the data, with the prototypes which are closest to the data points being positioned in such a way that the clusters can be identified.
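The update for m_k extends in the obvious way from three data points to N of them: each data point contributes weight A_{ik} if m_k is its closest prototype and B_{ik} otherwise, and the new m_k is the corresponding weighted mean of the data. The following NumPy sketch of one batch update is our own rendering of this rule; the names and the eps guard against division by zero are not from the paper.

```python
import numpy as np

def algorithm1_update(X, M, eps=1e-12):
    """One batch update that sets dPerf/dm_k = 0 for J_A in equation (3).

    For data point i and prototype k:
      A_ik = ||x_i - m_k|| + 2 * sum_l ||x_i - m_l||   if m_k is the closest prototype to x_i
      B_ik = min_l ||x_i - m_l||^2 / ||x_i - m_k||     otherwise
    and m_k <- sum_i w_ik x_i / sum_i w_ik, where w_ik is A_ik or B_ik as appropriate.
    """
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps   # d[i, k] = ||x_i - m_k||
    closest = np.argmin(d, axis=1)                                    # closest prototype per point
    rows = np.arange(len(X))
    dmin = d[rows, closest]                                           # min_l ||x_i - m_l||
    total = np.sum(d, axis=1)                                         # sum_l ||x_i - m_l||

    W = dmin[:, None] ** 2 / d                                        # B weights by default
    W[rows, closest] = dmin + 2.0 * total                             # A weights for the closest prototype
    return (W.T @ X) / np.sum(W, axis=0)[:, None]
```

A prototype which is never the closest to any data point receives only B weights, which is why, as the example shows, such prototypes drift to a weighted centre of the data set.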
However, there are some potential prototypes (such as m_3 in the example) which are not sufficiently responsive to the data and so never move to identify a cluster. In fact, as illustrated in the example, these prototypes move to the centre of the data set (actually a weighted centre, as shown in the example). This may be an advantage in some cases, in that we can easily identify redundancy in the prototypes, but it does waste computational resources unnecessarily.

A second algorithm

To solve this, we need to move these unused prototypes towards the data so that they may become the closest prototypes to at least one data sample and thus take advantage of the whole performance function. We do this by changing

B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|}

in (5) to

B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|^2}    (6)

which allows the centres to move continuously until they are in a position to be closest to some data points. This change allows the algorithm to work very well even in the case that all centres are initialised in the same location, very far from the data points. Note that with the original B_{bk} all centres will go to the same location, even though they are updated using two different types of equation.

Example: assume we have 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3) initialised at the same location.

Note: we assume every data point has only one minimum distance to the centres; in other words, it is treated as closest to one centre only, and the other centres are treated as distant centres even though they have the same minimum value. This step is optional, but it is very important if we want the algorithm to work well in the case that all centres are initialised in the same location. Without this assumption, in this example the centres m_1, m_2 and m_3 would all use the same update equation and hence go to the same location!

Let the distances be

        m_1  m_2  m_3
X_1     a    a    a
X_2     b    b    b
X_3     c    c    c

and treat m_1 as the closest centre to every data point.

For algorithm 1, we have

m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{a X_1 + b X_2 + c X_3}{a + b + c}

and similarly

m_2 = \frac{X_1 (a) + X_2 (b) + X_3 (c)}{a + b + c}, \quad m_3 = \frac{X_1 (a) + X_2 (b) + X_3 (c)}{a + b + c},

so that all three centres move to the same location.

For algorithm 2, we have

m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c}

while

m_2 = \frac{X_1 \frac{a^2}{a^2} + X_2 \frac{b^2}{b^2} + X_3 \frac{c^2}{c^2}}{\frac{a^2}{a^2} + \frac{b^2}{b^2} + \frac{c^2}{c^2}} = \frac{X_1 + X_2 + X_3}{3}

and similarly m_3 = \frac{X_1 + X_2 + X_3}{3}.

Note: if we have 3 different data points, we will not in general have a = b = c, so m_1 does not coincide with m_2 and m_3.

Notice that the centre m_1, which is detected by the minimum function, goes to a new location while the other centres m_2 and m_3 are grouped together in another location. This change to B_{bk} makes separation between the centres possible even if all of them start in the same location.

For algorithm 1, if the new location for m_2 and m_3 is still very far from the data and neither of them is detected as a minimum, the clustering algorithm stops without taking these centres into account when clustering the data. For algorithm 2, if the new locations for m_2 and m_3 are still very far from the data points and neither of them is detected as a minimum, the clustering algorithm continues to move these undetected centres towards new locations. Thus the algorithm provides a clustering of the data which is insensitive to the initialisation of the centres.
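Since the only difference between the two algorithms is the extra power in the denominator of B_{bk}, the second algorithm needs just a one-line change to the previous sketch. Again this is a hedged illustration with our own naming, not the authors' implementation.

```python
import numpy as np

def algorithm2_update(X, M, eps=1e-12):
    """One batch update for the second algorithm: identical to algorithm 1 except that
    B_ik = min_l ||x_i - m_l||^2 / ||x_i - m_k||^2, the squared denominator of equation (6)."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps   # d[i, k] = ||x_i - m_k||
    closest = np.argmin(d, axis=1)                                    # closest prototype per point
    rows = np.arange(len(X))
    W = d[rows, closest][:, None] ** 2 / d ** 2                       # B weights: the only change vs algorithm 1
    W[rows, closest] = d[rows, closest] + 2.0 * np.sum(d, axis=1)     # A weights for the closest prototype
    return (W.T @ X) / np.sum(W, axis=0)[:, None]
```

When several prototypes are exactly equidistant from a data point, np.argmin breaks the tie by taking the lowest index, which plays the same role as the 'closest to one centre only' assumption used in the example above.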
Simulations

We illustrate these algorithms with a few simulations on artificial two-dimensional data sets, since the results are easily visualised. Consider first the data set in Figure 1: the prototypes have all been initialised within one of the four clusters.

Figure 1: The data set is shown as 4 clusters of red '+'s; the prototypes are initialised to lie within one cluster and are shown as blue '*'s.

Figure 2 shows the final positions of the prototypes when K-Means is used: two clusters are not identified.

Figure 2: Prototypes' positions when using K-Means.

Figure 3 shows the final positions of the prototypes when K-Harmonic Means and algorithm 2 are used. K-Harmonic Means takes 5 iterations to find this position while algorithm 2 takes only three iterations. In both cases, all four clusters were reliably and stably identified. Even with a good initialisation (for example with all the prototypes in the centre of the data, around the point (2,2)), K-Means is not guaranteed to find all the clusters; K-Harmonic Means takes 5 iterations to move the prototypes to appropriate positions while algorithm 2 takes only one iteration to stably find appropriate positions for the prototypes.

Figure 3: Top: K-Harmonic Means after 5 iterations. Bottom: algorithm 2 after 3 iterations.

Consider now the situation in which the prototypes are initialised very far from the data; this is an unlikely situation in general, but it may be that all the prototypes are in fact initialised very far from a particular cluster, and the question arises as to whether that cluster would be found by any algorithm. We show the initial positions of the data and a set of four prototypes in Figure 4; the four prototypes are in slightly different positions. Figure 5 shows the final positions of the prototypes for K-Harmonic Means and algorithm 2. Again K-Harmonic Means took rather longer (24 iterations) to appropriately position the prototypes than algorithm 2 (5 iterations). K-Means moved all four prototypes to a central location (approximately the point (2,2)) and did not subsequently find the 4 clusters.

Figure 4: The prototypes are positioned very far from the four clusters.

Figure 5: Top: K-Harmonic Means after 24 iterations. Bottom: algorithm 2 after 5 iterations.

Figure 6: Results when the prototypes are initialised very far from the data and all in the same position. Top: K-Harmonic Means. Bottom: algorithm 2.

Of course, the above situation was somewhat unrealistic since the number of prototypes was exactly equal to the number of clusters, but we have similar results with e.g. 20 prototypes and the same data set. We now go to the other extreme and have the same number of prototypes as data points. We show in Figure 7 a simulation in which we had 40 data points from the same four clusters and 40 prototypes which were initialised to a single location far from the data. The bottom diagram shows that each prototype is eventually located at a data point. This took 123 iterations of the algorithm, which is rather a lot; however, neither K-Means nor K-Harmonic Means performed well: both located every prototype in the centre of the data, i.e. at approximately the point (2,2). Even with a good initialisation (random locations throughout the data set), K-Means was unable to perform as well, typically leaving 28 prototypes redundant in that they moved only to the centre of the data.

Figure 7: Top: Initial prototypes and data. Bottom: after 123 iterations every prototype is situated on a data point.
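The experimental setup above is easy to approximate in outline. The sketch below is our own reconstruction, not the authors' code: the cluster centres, spread, number of points and stopping tolerance are all assumptions chosen only to mimic a four-cluster data set centred near the point (2,2), and it runs the algorithm-2 update from a single distant initialisation.

```python
import numpy as np

def algorithm2_update(X, M, eps=1e-12):
    """Batch update of the second algorithm (same rule as the earlier sketch)."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps
    closest = np.argmin(d, axis=1)
    rows = np.arange(len(X))
    W = d[rows, closest][:, None] ** 2 / d ** 2                  # B weights
    W[rows, closest] = d[rows, closest] + 2.0 * d.sum(axis=1)    # A weights for the closest prototype
    return (W.T @ X) / W.sum(axis=0)[:, None]

# Four Gaussian clusters of 2-D points whose overall centre is roughly (2, 2): an assumed layout.
rng = np.random.default_rng(0)
centres = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]])
X = np.vstack([c + 0.15 * rng.normal(size=(10, 2)) for c in centres])   # 40 data points

# Worst-case initialisation: all prototypes at the same location, far from the data.
M = np.tile([[50.0, 50.0]], (4, 1))

for it in range(500):
    M_new = algorithm2_update(X, M)
    if np.max(np.linalg.norm(M_new - M, axis=1)) < 1e-6:   # stop once the prototypes settle
        M = M_new
        break
    M = M_new

print("iterations used:", it + 1)
print("final prototype positions:\n", M)
```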
CONCLUSION

We have developed a new algorithm for data clustering and have shown that this algorithm is clearly superior to K-Means, the standard work-horse for clustering. We have also compared our algorithm to K-Harmonic Means, which is state-of-the-art among clustering algorithms, and have shown that it is comparable under typical conditions and superior under extreme conditions. Future work will investigate the convergence of these algorithms on real data sets.

REFERENCES

B. Zhang, M. Hsu and U. Dayal, K-Harmonic Means - A Data Clustering Algorithm, Technical Report, Hewlett-Packard Laboratories, Palo Alto, October 1999.