Performance Functions and Clustering Algorithms
Wesam Barbakh and Colin Fyfe
Abstract.
We investigate the effect of different performance functions for measuring the performance of clustering algorithms and derive different algorithms depending on which performance function is used. In particular, we show that two algorithms may be derived which do not exhibit the dependence on initial conditions (and hence the tendency to get stuck in local optima) that the standard K-Means algorithm exhibits.
INTRODUCTION
The K-Means algorithm is one of the most frequently
used investigatory algorithms in data analysis. The
algorithm attempts to locate K prototypes or means throughout a data set in such a way that the K prototypes in some way best represent the data.
The algorithm is one of the first which a data analyst
will use to investigate a new data set because it is
algorithmically simple, relatively robust and gives
'good enough' answers over a wide variety of data sets: it will often not be the single best algorithm on any individual data set, but it will be close to the optimal over a wide range of data sets.
However the algorithm is known to suffer from the
defect that the means or prototypes found depend on
the initial values given to them at the start of the
simulation. There are a number of heuristics in the
literature which attempt to address this issue but, at
heart, the fault lies in the performance function on
which K-Means is based. Recently, there have been
several investigations of alternative performance
functions for clustering algorithms. One of the most
effective updates of K-Means has been K-Harmonic
Means which minimises
$$J_{KHM} = \sum_{i=1}^{N} \frac{K}{\sum_{k=1}^{K} \frac{1}{\|x_i - m_k\|^2}}$$

for data samples {x_1, ..., x_N} and prototypes {m_1, ..., m_K}. This performance function can be shown to be minimised when

$$m_k = \frac{\sum_{i=1}^{N} \frac{x_i}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}}{\sum_{i=1}^{N} \frac{1}{d_{ik}^4 \left( \sum_{l=1}^{K} \frac{1}{d_{il}^2} \right)^2}}$$

where d_{ik} = \|x_i - m_k\|.
In this paper, we investigate two alternative
performance functions and show the effect the
different functions have on the effectiveness of the
resulting algorithms. We are specifically interested in
developing algorithms which are effective in a worst
case scenario: when the prototypes are initialised at
the same position which is very far from the data
points. If an algorithm can cope with this scenario, it
should be able to cope with a more benevolent
initialisation.
PERFORMANCE FUNCTIONS FOR CLUSTERING
The performance function for K-Means may be
written as
$$J_K = \sum_{i=1}^{N} \min_{k \in \{1,\dots,M\}} \|X_i - m_k\|^2 \qquad (1)$$

which we wish to minimise by moving the prototypes to the appropriate positions. Note that (1) detects only the centres closest to the data points and then distributes them to give the minimum performance, which determines the clustering.
Any prototype which is still far from the data is not utilised and does not enter any calculation that gives the minimum performance, which may result in dead prototypes: prototypes which are never appropriate for any cluster. Thus initializing the centres appropriately can have a big effect on K-Means.
We can illustrate this effect with the following toy example: assume we have 3 data points (X_1, X_2 and X_3) and 3 prototypes (m_1, m_2 and m_3), and let d_{i,k} = \|X_i - m_k\|^2. We consider the situation in which

m_1 is closest to X_1
m_1 is closest to X_2
m_2 is closest to X_3

and so m_3 is not closest to any data point:

        m_1        m_2        m_3
X_1     d_{1,1}    d_{1,2}    d_{1,3}
X_2     d_{2,1}    d_{2,2}    d_{2,3}
X_3     d_{3,1}    d_{3,2}    d_{3,3}

Then Perf = J_K = d_{1,1} + d_{2,1} + d_{3,2}, which we minimise by changing the positions of the prototypes m_1, m_2 and m_3. Then

$$\frac{\partial Perf}{\partial m_1} = \frac{\partial d_{1,1}}{\partial m_1} + \frac{\partial d_{2,1}}{\partial m_1}, \qquad \frac{\partial Perf}{\partial m_2} = \frac{\partial d_{3,2}}{\partial m_2}, \qquad \frac{\partial Perf}{\partial m_3} = 0 + 0 + 0 = 0$$

i.e. setting the first two derivatives to zero moves m_1 to the mean of X_1 and X_2 and moves m_2 to X_3, while m_3 receives no update at all. So it is possible to find new locations for m_1 and m_2 which minimise the performance function, and which therefore determine the clustering, but it is not possible to find a new location for prototype m_3 as it is far from the data and is not used as a minimum for any data point.
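The following minimal K-Means sketch in Python illustrates the point; the data, the initialisation and the helper name kmeans are our own illustrative choices, not taken from the paper. A prototype that is never the closest to any point receives no update and stays where it was initialised.

```python
import numpy as np

def kmeans(X, M, iters=50):
    """Standard batch K-Means: X is (N, dim) data, M is (K, dim) initial prototypes."""
    M = M.copy()
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2)  # (N, K) distances
        assign = np.argmin(d, axis=1)                               # closest prototype per point
        for k in range(len(M)):
            if np.any(assign == k):
                M[k] = X[assign == k].mean(axis=0)
            # else: m_k is a dead prototype -- it is never updated
    return M

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])      # X_1, X_2, X_3
M0 = np.array([[0.5, 0.1], [9.0, 9.0], [100.0, 100.0]])   # m_3 starts far from the data
print(kmeans(X, M0))  # m_3 remains at (100, 100): a dead prototype
```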
We might consider the following performance function:

$$J_A = \sum_{i=1}^{N} \sum_{k=1}^{K} \|X_i - m_k\|^2 \qquad (2)$$

which provides a relationship between all the data points and all the prototypes, but it does not provide a clustering at minimum performance since

$$\frac{\partial J_A}{\partial m_k} = -2 \sum_{i=1}^{N} (X_i - m_k) = 0 \quad \Rightarrow \quad m_k = \frac{1}{N} \sum_{i=1}^{N} X_i$$

Minimizing this performance function groups all the prototypes at the centre of the data set regardless of their initial positions, which is useless for the identification of clusters.
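This centroid result is easy to check numerically; in the sketch below the data, the learning rate and the iteration count are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))          # any data set
M = rng.normal(size=(4, 2)) + 50.0     # 4 prototypes, initialised far from the data

# Gradient descent on J_A = sum_i sum_k ||x_i - m_k||^2
for _ in range(200):
    grad = -2 * (X.sum(axis=0) - len(X) * M)   # dJ_A/dm_k = -2 sum_i (x_i - m_k)
    M -= 1e-3 * grad

print(np.allclose(M, X.mean(axis=0), atol=1e-3))  # True: every m_k moves to the data centroid
```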
A combined performance function

We wish to form a performance function with the following properties:

- Minimum performance gives a good clustering.
- It creates a relationship between all data points and all prototypes.

(2) provides an attempt to reduce the sensitivity to the initialization of the centres by making a relationship between all data points and all centres, while (1) provides an attempt to cluster the data points at minimum performance. Therefore it may seem that what we want is to combine features of (1) and (2) to make a performance function such as:

$$J_1 = \sum_{i=1}^{N} \left( \sum_{L=1}^{M} \|X_i - m_L\| \right) \min_{j} \|X_i - m_j\|^2 \qquad (3)$$

We derive the clustering algorithm associated with this performance function by calculating the partial derivatives of (3) with respect to the prototypes. Consider the presentation of a specific data point X_a and the prototype m_k closest to X_a, so that

$$\min_{j} \|X_a - m_j\|^2 = \|X_a - m_k\|^2$$

Then

$$\frac{\partial Perf(i=a)}{\partial m_k} = -\frac{(X_a - m_k)}{\|X_a - m_k\|} \|X_a - m_k\|^2 - 2\left( \|X_a - m_1\| + \dots + \|X_a - m_k\| + \dots + \|X_a - m_M\| \right)(X_a - m_k)$$

$$\frac{\partial Perf(i=a)}{\partial m_k} = -(X_a - m_k) A_{ak} \qquad (4)$$

where

$$A_{ak} = \|X_a - m_k\| + 2\left( \|X_a - m_1\| + \dots + \|X_a - m_M\| \right)$$
Now consider a second data point X_b for which m_k is not the closest prototype, i.e. the min() function gives the distance with respect to a prototype other than m_k:

$$\min_{j} \|X_b - m_j\|^2 = \|X_b - m_r\|^2, \qquad r \neq k$$

Then

$$\frac{\partial Perf(i=b)}{\partial m_k} = -\frac{(X_b - m_k)}{\|X_b - m_k\|} \|X_b - m_r\|^2 = -(X_b - m_k) B_{bk} \qquad (5)$$

where

$$B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|}$$

For the algorithm, the partial derivative of Perf with respect to m_k over all the data points is therefore based on (4), or on (5), or on both of them. Consider the specific situation in which m_k is closest to X_2 but not closest to X_1 or X_3. Then we have

$$\frac{\partial Perf}{\partial m_k} = -(X_1 - m_k) B_{1k} - (X_2 - m_k) A_{2k} - (X_3 - m_k) B_{3k}$$

Setting this to 0 and solving for m_k gives

$$m_k = \frac{X_1 B_{1k} + X_2 A_{2k} + X_3 B_{3k}}{B_{1k} + A_{2k} + B_{3k}}$$
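A sketch of the resulting update rule (the first algorithm) in Python follows; the function name alg1_update and the synchronous, batch form of the iteration are our own assumptions about how the rule would be applied, not something the paper prescribes.

```python
import numpy as np

def alg1_update(X, M, eps=1e-12):
    """One batch update of algorithm 1: X is (N, dim) data, M is (K, dim) prototypes."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps  # d[i, k] = ||X_i - m_k||
    closest = np.argmin(d, axis=1)                                   # index of closest prototype per point
    d_min = d[np.arange(len(X)), closest]                            # distance to that prototype
    # Coefficient of X_i in the update of m_k:
    #   A_ik = d_ik + 2 * sum_l d_il    if m_k is the closest prototype to X_i   (eq. 4)
    #   B_ik = d_min_i^2 / d_ik         otherwise                                (eq. 5)
    A = d + 2 * d.sum(axis=1, keepdims=True)
    B = (d_min[:, None] ** 2) / d
    W = np.where(closest[:, None] == np.arange(M.shape[0])[None, :], A, B)
    return (W.T @ X) / W.sum(axis=0)[:, None]   # m_k = sum_i W_ik X_i / sum_i W_ik
```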
Consider the previous example with 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3) and the distances between them such that m_1 is closest to X_1, m_1 is closest to X_2, and m_2 is closest to X_3. Writing d_{i,k} here for the distance \|X_i - m_k\| and min_i for the distance from X_i to its closest centre, we will write

        m_1        m_2        m_3
X_1     min_1      d_{1,2}    d_{1,3}
X_2     min_2      d_{2,2}    d_{2,3}
X_3     d_{3,1}    min_3      d_{3,3}

Then after training

$$m_1 = \frac{X_1 A_{11} + X_2 A_{21} + X_3 B_{31}}{A_{11} + A_{21} + B_{31}}$$

where A_{11} = min_1 + 2(min_1 + d_{1,2} + d_{1,3}), A_{21} = min_2 + 2(min_2 + d_{2,2} + d_{2,3}) and B_{31} = min_3^2 / d_{3,1},

$$m_2 = \frac{X_1 B_{12} + X_2 B_{22} + X_3 A_{32}}{B_{12} + B_{22} + A_{32}}$$

where B_{12} = min_1^2 / d_{1,2}, B_{22} = min_2^2 / d_{2,2} and A_{32} = min_3 + 2(d_{3,1} + min_3 + d_{3,3}), and

$$m_3 = \frac{X_1 B_{13} + X_2 B_{23} + X_3 B_{33}}{B_{13} + B_{23} + B_{33}}$$

where B_{13} = min_1^2 / d_{1,3}, B_{23} = min_2^2 / d_{2,3} and B_{33} = min_3^2 / d_{3,3}.

This algorithm will cluster the data, with the prototypes which are closest to the data points being positioned in such a way that the clusters can be identified. However, there are some prototypes (such as m_3 in the example) which are not sufficiently responsive to the data and so never move to identify a cluster. In fact, as illustrated in the example, these prototypes move to the centre of the data set (actually a weighted centre, as shown in the example). This may be an advantage in some cases, in that we can easily identify redundancy in the prototypes, but it wastes computational resources unnecessarily.
A second algorithm

To solve this, we need to move these unused prototypes towards the data so that they may become the closest prototypes to at least one data sample and thus take advantage of the whole performance function. We do this by changing

$$B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|}$$

in (5) to

$$B_{bk} = \frac{\|X_b - m_r\|^2}{\|X_b - m_k\|^2} \qquad (6)$$

which allows the centres to move continuously until they are in a position to be closest to some data points. This change allows the algorithm to work very well in the case that all the centres are initialized in the same location, very far from the data points.
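In code, this second algorithm is the same batch update as the earlier alg1_update sketch with only the B coefficient changed to (6); the function name alg2_update and the shape conventions are again our own assumptions.

```python
import numpy as np

def alg2_update(X, M, eps=1e-12):
    """One batch update of algorithm 2: only the B coefficient differs from algorithm 1."""
    d = np.linalg.norm(X[:, None, :] - M[None, :, :], axis=2) + eps
    closest = np.argmin(d, axis=1)
    d_min = d[np.arange(len(X)), closest]
    A = d + 2 * d.sum(axis=1, keepdims=True)   # unchanged, eq. (4)
    B = (d_min[:, None] ** 2) / d**2           # eq. (6): squared denominator
    W = np.where(closest[:, None] == np.arange(M.shape[0])[None, :], A, B)
    return (W.T @ X) / W.sum(axis=0)[:, None]
```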
Note: with algorithm 1, all the centres will go to the same location even though they are calculated using the two different types of equation, as the following example shows.

Example: assume we have 3 data points (X_1, X_2 and X_3) and 3 centres (m_1, m_2 and m_3) initialized at the same location.

Note: we assume every data point has only one minimum distance to the centres, in other words it is closest to one centre only; we treat the other centres as distant centres even if they have the same minimum value. This step is optional, but it is very important if we want the algorithm to work well in the case that all the centres are initialized in the same location. Without this assumption, in this example we would find that the centres (m_1, m_2 and m_3) all use the same equation and hence go to the same location. Let
        m_1    m_2    m_3
X_1     a      a      a
X_2     b      b      b
X_3     c      c      c
For algorithm 1, we have

$$m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}$$

and similarly,

$$m_2 = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}, \qquad m_3 = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}$$

(Note: if we have 3 different data points, it is not possible to have a = b = c.)

For algorithm 2, we have

$$m_1 = \frac{X_1 (7a) + X_2 (7b) + X_3 (7c)}{7a + 7b + 7c} = \frac{X_1 a + X_2 b + X_3 c}{a + b + c}$$

while

$$m_2 = \frac{X_1 (a^2/a^2) + X_2 (b^2/b^2) + X_3 (c^2/c^2)}{a^2/a^2 + b^2/b^2 + c^2/c^2} = \frac{X_1 + X_2 + X_3}{3}$$

and similarly,

$$m_3 = \frac{X_1 + X_2 + X_3}{3}$$
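This example can be checked numerically with the hypothetical alg1_update and alg2_update sketches given earlier; the coordinates below are arbitrary choices of ours.

```python
import numpy as np

X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 3.0]])   # X_1, X_2, X_3
M = np.tile(np.array([[50.0, 50.0]]), (3, 1))        # all three centres at the same far location

# np.argmin breaks the tie by always picking m_1, which implements the
# "only one minimum per data point" assumption above.
print(alg1_update(X, M))   # three (nearly) identical rows: all centres stay together
print(alg2_update(X, M))   # first row differs; rows 2 and 3 equal X.mean(axis=0)
```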
Notice that the centre m_1, which is detected by the minimum function, goes to a new location while the other centres m_2 and m_3 are grouped together at another location. This change to B_{bk} makes separation between the centres possible even if all of them start in the same location.
For algorithm 1, if the new location for m_2 and m_3 is still very far from the data and neither of them is detected as a minimum, the clustering algorithm stops without taking these centres into account when clustering the data. For algorithm 2, if the new locations for m_2 and m_3 are still very far from the data points and neither of them is detected as a minimum, the algorithm continually moves these undetected centres toward new locations. Thus the algorithm provides a clustering of the data which is insensitive to the initialization of the centres.
Simulations
We illustrate these algorithms with a few simulations on artificial two dimensional data sets, since the results are easily visualised. Consider first the data set in Figure 1: the prototypes have all been initialised within one of the four clusters.
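A sketch of how such an experiment could be reproduced is given below; the cluster centres, spread and point counts are our guesses from the figures described, not the paper's actual data, and alg2_update is the hypothetical sketch given earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
centres = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]])   # 4 clusters around (2, 2)
X = np.vstack([c + 0.2 * rng.normal(size=(10, 2)) for c in centres])    # 40 data points

M = centres[0] + 0.1 * rng.normal(size=(4, 2))   # all prototypes start inside one cluster
for _ in range(20):
    M = alg2_update(X, M)                        # or kmeans / khm_update for comparison
print(M)                                         # one prototype per cluster is the desired outcome
```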
Figure 1 Data set is shown as 4 clusters of red '+'s,
prototypes are initialised to lie within one cluster and
shown as blue '*'s.
Figure 2 shows the final positions of the prototypes when K-Means is used: two clusters are not identified.

Figure 2 Prototypes' positions when using K-Means.

Even with a good initialisation (for example with all the prototypes in the centre of the data, around the point (2,2)), K-Means is not guaranteed to find all the clusters. K-Harmonic Means takes 5 iterations to move the prototypes to appropriate positions, while algorithm 2 takes only one iteration to stably find appropriate positions for the prototypes.

Figure 3 shows the final positions of the prototypes when K-Harmonic Means and algorithm 2 are used. K-Harmonic Means takes 5 iterations to find this position while algorithm 2 takes only three iterations. In both cases, all four clusters were reliably and stably identified.

Figure 3 Top: K-Harmonic Means after 5 iterations. Bottom: algorithm 2 after 3 iterations.
Consider now the situation in which the prototypes are
initialised very far from the data; this is an unlikely
situation to happen in general but it may be that all the
prototypes are in fact initialised very far from a
particular cluster. The question arises as to whether
this cluster would be found by any algorithm.
We show the initial positions of the data and a set of
four prototypes in Figure 4. The four prototypes are in
slightly different positions. Figure 5 shows the final
positions of the prototypes for K-Harmonic Means and
algorithm 2. Again K-Harmonic Means took rather
longer (24 iterations) to appropriately position the
prototypes than algorithm 2 (5 iterations). K-Means
moved all four prototypes to a central location
(approximately the point (2,2)) and did not
subsequently find the 4 clusters.
Figure 4 The prototypes are positioned very far from the four clusters.

Figure 5 Top: K-Harmonic Means after 24 iterations. Bottom: algorithm 2 after 5 iterations.

Figure 6 Results when prototypes are initialised very far from the data and all in the same position. Top: K-Harmonic Means. Bottom: algorithm 2.

Figure 7 Top: Initial prototypes and data. Bottom: after 123 iterations all prototypes are situated on a data point.
Of course, the above situation was somewhat
unrealistic since the number of prototypes was exactly
equal to the number of clusters but we have similar
results with e.g. 20 prototypes and the same data set.
We now go to the other extreme and have the same
number of prototypes as data points. We show in
Figure 7 a simulation in which we had 40 data points
from the same four clusters and 40 prototypes which
were initialised to a single location far from the data.
The bottom diagram shows that each prototype is eventually located on a data point. This took 123 iterations of the algorithm, which is rather a lot; however neither K-Means nor K-Harmonic Means performed well: both located every prototype in the centre of the data, i.e. at approximately the point (2,2). Even with a good initialisation (random locations throughout the data set), K-Means was unable to perform as well, typically leaving 28 prototypes redundant in that they had moved only to the centre of the data.
CONCLUSION
We have developed a new algorithm for data
clustering and have shown that this algorithm is
clearly superior to K-Means, the standard work-horse
for clustering. We have also compared our algorithm to K-Harmonic Means, a state-of-the-art clustering algorithm, and have shown that under typical conditions it is comparable, while under extreme conditions it is superior. Future work will
investigate convergence of these algorithms on real
data sets.
REFERENCES
B. Zhang, M. Hsu, U. Dayal, K-Harmonic Means – a data clustering algorithm, Technical Report, HP Palo Alto Laboratories, Oct 1999.