
k-Means Clustering on Data with Incomplete Replications
Andy Lithio
May 9, 2013
1 Abstract
A flexible extension of the Hartigan-Wong k-means clustering algorithm for analysis of data where
observations have possibly unequal numbers of incomplete replications is proposed. Cluster initialization follows the suggestions of Hartigan and Wong. It is shown that the extension is equivalent
to the original Hartigan-Wong algorithm in the case of complete data with no replications. Computational and programming details are discussed, and the new algorithm is then applied to simulated
data with varying levels of censoring and complexity.
2 Introduction
The Hartigan-Wong k-means clustering algorithm is a routine for “[dividing] M points in N dimensions into K clusters so that the within-cluster sum of squares is minimized”. The algorithm
is not guaranteed to find a partition that achieves a global minimum for within-cluster sum of
squares, but obtains a local minimum where no movement of a point to a different cluster will
reduce the within-cluster sum of squares. Given appropriate initial values for the cluster means,
the Hartigan-Wong algorithm has been shown to perform well and be computationally efficient.
In this paper, we extend the Hartigan-Wong algorithm to the case of a set of observations, each consisting of one or more measurements, where each measurement is not
necessarily complete. We derive expressions for the reduction in within-cluster sum of squares
realized from transferring an observation out of a cluster and for the increase in within-cluster
sum of squares when transferring an observation into a cluster. These expressions are essential for
determining whether an observation should be moved to a different cluster. We then apply the
adapted algorithm to simulated datasets to investigate its efficiency and accuracy.
3 Methods
Consider a set of observations X1 , ..., XN , where each Xi consists of ni measurements of dimension
n. We assume that each Xi has at least one recorded value in each dimension, but may contain
missing values. Define nij to be the number of recorded values for observation Xi in dimension
j. We define the input values $M = \sum_{i=1}^{N} n_i$, $N$ to be the number of observations, and $K$ to be
the number of clusters. Input data must be formatted as follows. Each line must contain one
measurement, with the recorded values listed in the first n columns, using nan to denote missing
values. All measurements from each Xi must be on consecutive lines. The (n + 1)th column should
contain an indexing variable from 1 to N identifying which Xi the observation corresponds to. The
data is read in and stored as the dat matrix.
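For illustration (the values are made up), a data file with n = 2 dimensions following this layout might begin as follows; observation 1 has two measurements, the second missing dimension 2, and observation 2 has a single complete measurement:

  1.23   4.56   1
  0.98   nan    1
  2.10   3.77   2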
Additionally, a K × (2n + 3) array of doubles named clust, an N × (n + 3) array of integers named obs, and an N × 2n array of doubles named WOSS are created. Each row of clust corresponds to a cluster: the first n columns contain the current value of the cluster center (one column per dimension), the next n columns contain the number of recorded values in each dimension currently assigned to that cluster, the (2n + 1)th column counts the number of Xi currently in the cluster, the (2n + 2)th column is used to keep track of live clusters, and the final column contains the cluster's within-cluster sum of squares.
Each row of the array of integers obs corresponds to an observation Xi. The first column contains the number of measurements for that Xi, the second lists the cluster Xi is currently assigned to, the third contains the next closest cluster to Xi, and the final n columns list the number of recorded values in each dimension for Xi. Each row of the array WOSS also corresponds to an Xi: its first n columns contain the within-observation sum of squares of that observation in each dimension, which will be used in the derivations of Section 3.1, and its second n columns contain the sum of the recorded values in each dimension of the observation, which is useful for updating cluster means.
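As a rough sketch of this bookkeeping (not the included program's actual code; the flat layout and variable names are assumptions of the example), the working arrays could be allocated as follows:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* Example sizes only: K clusters, n dimensions, N observations. */
    int K = 5, n = 2, N = 1000;

    /* clust: K x (2n+3) doubles -- n center coordinates, n per-dimension
       counts of recorded values, the number of X_i in the cluster, a
       live-set marker, and the within-cluster sum of squares. */
    double *clust = calloc((size_t)K * (2 * n + 3), sizeof *clust);

    /* obs: N x (n+3) integers -- number of measurements, current cluster,
       next closest cluster, and n per-dimension counts of recorded values. */
    int *obs = calloc((size_t)N * (n + 3), sizeof *obs);

    /* WOSS: N x 2n doubles -- within-observation sums of squares and
       per-dimension sums of recorded values. */
    double *woss = calloc((size_t)N * 2 * n, sizeof *woss);

    if (!clust || !obs || !woss) { perror("calloc"); return 1; }

    /* e.g. the live-set marker of cluster L sits at clust[L*(2*n+3) + 2*n + 1] */
    printf("allocated %d clusters, %d observations\n", K, N);

    free(clust); free(obs); free(woss);
    return 0;
}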
The primary adjustment required to the Hartigan-Wong algorithm is to find the appropriate
quantities used to calculate the reduction in within-cluster sum of squares that would be realized
by removing an observation from its current cluster and the increase in within-cluster sum of
squares that would be realized when transferring an observation to another cluster. For data with
no replications and no missing values, Hartigan and Wong state these to be $\frac{n_L\, D(I,L)^2}{n_L - 1}$ and $\frac{n_L\, D(I,L)^2}{n_L + 1}$, respectively, where $L$ is the cluster in question, $I$ is the observation, $n_L$ is the number of observations assigned to cluster $L$, and $D(I,L)^2$ is the sum of squares between observation $I$ and the center of cluster $L$.
The derivations for the reduction and increase in within-cluster sum of squares follow, transferring observation $X_i$ to or from cluster $L$, where $L' = L - X_i$ in the case of the reduction and $L' = L \cup X_i$ in the case of the increase in within-cluster sum of squares. Denote $n_{Lj}$ as the number of recorded values in dimension $j$ assigned to cluster $L$, and recall $n_{ij}$ is the number of recorded values for observation $X_i$ in dimension $j$. We will also use $Y_{ij}$ to refer to the $j$th dimension of a single measurement, so that $X_i = \{Y_{ij} \mid Y_{ij} \in X_i\}$. For notational simplicity, denote $\bar{X}_{ij} = \frac{\sum_{i:Y_{ij}\in X_i} Y_{ij}}{n_{ij}}$, the center of $X_i$ in the $j$th dimension. Finally, we will denote $\mu_{Lj} = \frac{\sum_{i=1}^{M} I(Y_{ij}\in L)\, Y_{ij}}{n_{Lj}}$ as the $j$th dimension of the center of cluster $L$, and $\mu_{L'j}$ as the $j$th dimension of the center of cluster $L'$.
3.1 Increase in Within-Cluster Sum of Squares
The increase in within-cluster sum of squares realized from transferring observation $X_1$ to cluster $L$ is $\Delta^+_L = \sum_{j=1}^{n} \Delta^+_{Lj}$, where

\begin{align*}
\Delta^+_{Lj} &= \sum_{i=1}^{M}\left[(Y_{ij} - \mu_{L'j})^2\, I(Y_i \in L') - (Y_{ij} - \mu_{Lj})^2\, I(Y_i \in L)\right] \\
&= \sum_{i=1}^{M}\left[(Y_{ij}^2 - 2Y_{ij}\mu_{L'j} + \mu_{L'j}^2)\, I(Y_i \in L') - (Y_{ij}^2 - 2Y_{ij}\mu_{Lj} + \mu_{Lj}^2)\, I(Y_i \in L)\right] \\
&= \sum_{i:Y_{ij}\in X_1} Y_{ij}^2 + \sum_{i=1}^{M}\left[(-2Y_{ij}\mu_{L'j} + \mu_{L'j}^2)\, I(Y_i \in L') - (-2Y_{ij}\mu_{Lj} + \mu_{Lj}^2)\, I(Y_i \in L)\right] \\
&= \sum_{i:Y_{ij}\in X_1} Y_{ij}^2 - 2(n_{Lj}+n_{ij})\mu_{L'j}^2 + (n_{Lj}+n_{ij})\mu_{L'j}^2 + 2 n_{Lj}\mu_{Lj}^2 - n_{Lj}\mu_{Lj}^2 \\
&= \sum_{i:Y_{ij}\in X_1} Y_{ij}^2 + n_{Lj}\mu_{Lj}^2 - (n_{Lj}+n_{ij})\mu_{L'j}^2 \\
&= \sum_{i:Y_{ij}\in X_1} Y_{ij}^2 + n_{Lj}\mu_{Lj}^2 - \frac{\left(\sum_{i=1}^{M} Y_{ij}\, I(Y_i \in L')\right)^2}{n_{Lj}+n_{ij}} \\
&= \sum_{i:Y_{ij}\in X_1} Y_{ij}^2 + \frac{n_{Lj}^2\mu_{Lj}^2 + n_{Lj}n_{ij}\mu_{Lj}^2}{n_{Lj}+n_{ij}} - \frac{\left(n_{Lj}\mu_{Lj} + \sum_{i:Y_{ij}\in X_1} Y_{ij}\right)^2}{n_{Lj}+n_{ij}} \\
&= \sum_{i:Y_{ij}\in X_1} Y_{ij}^2 + \frac{n_{Lj}n_{ij}\mu_{Lj}^2 - 2 n_{Lj}\mu_{Lj}\sum_{i:Y_{ij}\in X_1} Y_{ij} - \left(\sum_{i:Y_{ij}\in X_1} Y_{ij}\right)^2}{n_{Lj}+n_{ij}} \\
&= \frac{n_{Lj}}{n_{Lj}+n_{ij}}\left(\sum_{i:Y_{ij}\in X_1} Y_{ij}^2 - 2\mu_{Lj}\sum_{i:Y_{ij}\in X_1} Y_{ij} + n_{ij}\mu_{Lj}^2\right) + \frac{n_{ij}}{n_{Lj}+n_{ij}}\left(\sum_{i:Y_{ij}\in X_1} Y_{ij}^2 - \frac{\left(\sum_{i:Y_{ij}\in X_1} Y_{ij}\right)^2}{n_{ij}}\right) \\
&= \frac{n_{Lj}}{n_{Lj}+n_{ij}}\sum_{i:Y_{ij}\in X_1} (Y_{ij} - \mu_{Lj})^2 + \frac{n_{ij}}{n_{Lj}+n_{ij}}\sum_{i:Y_{ij}\in X_1} (Y_{ij} - \bar{X}_{ij})^2
\end{align*}
The numerator of the term on the right is constant for each combination of observation and dimension, so these quantities are calculated and stored in the first $n$ columns of the WOSS matrix after the data is read in, then referenced when needed. The term on the right can be thought of as a weighted corrected sum of squares for the measurements within $X_i$. The term on the left is reminiscent of the quantity used by Hartigan and Wong, with the exception of incrementing by $n_{ij}$ instead of 1, and having to calculate it for each dimension instead of having a constant $\frac{n_L}{n_L+1}$ multiplier for all dimensions. The term on the left represents the distance between $X_i$ and the current cluster center, and is likely to carry much more weight than the other term. In particular, for complete data with no replications, $n_{ij} = 1$ and $n_{Lj} = n_L$ in every dimension, so the right-hand term vanishes (since $Y_{ij} = \bar{X}_{ij}$) and $\Delta^+_L$ reduces to the Hartigan-Wong increase $n_L\, D(I,L)^2/(n_L + 1)$; the extension therefore coincides with the original algorithm in that case. The derivation for the reduction in within-cluster sum of squares when removing $X_i$ from cluster $L$ is similar, and results in the quantity $\Delta^- = \sum_{j=1}^{n} \Delta^-_j$, where
\[
\Delta^-_j = \frac{n_{Lj}}{n_{Lj}-n_{ij}}\sum_{i:Y_{ij}\in X_1} (Y_{ij} - \mu_{Lj})^2 \;-\; \frac{n_{ij}}{n_{Lj}-n_{ij}}\sum_{i:Y_{ij}\in X_1} (Y_{ij} - \bar{X}_{ij})^2.
\]
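To make these two quantities concrete, the following self-contained C sketch evaluates ∆+ and ∆− for a single observation/cluster pair from per-dimension summaries. The function names and the toy values in main are illustrative only and are not the included program's code; the final check shows that, for a single complete measurement, ∆+ agrees with the Hartigan-Wong quantity n_L D(I, L)²/(n_L + 1).

#include <stdio.h>

/* Delta^+_L: increase in within-cluster SS from adding an observation to
   cluster L, summed over the n dimensions.  Per dimension j:
     ss_cent[j] = sum over recorded Y_ij of (Y_ij - mu_Lj)^2
     woss[j]    = sum over recorded Y_ij of (Y_ij - Xbar_ij)^2
     nij[j]     = recorded values of the observation in dimension j
     nLj[j]     = recorded values currently assigned to cluster L in dim j */
static double delta_plus(int n, const double *ss_cent, const double *woss,
                         const int *nij, const int *nLj)
{
    double d = 0.0;
    for (int j = 0; j < n; j++) {
        if (nij[j] == 0) continue;              /* nothing recorded in dim j */
        double denom = (double)(nLj[j] + nij[j]);
        d += (nLj[j] / denom) * ss_cent[j] + (nij[j] / denom) * woss[j];
    }
    return d;
}

/* Delta^-: reduction in within-cluster SS from removing the observation
   from the cluster it currently belongs to (same inputs). */
static double delta_minus(int n, const double *ss_cent, const double *woss,
                          const int *nij, const int *nLj)
{
    double d = 0.0;
    for (int j = 0; j < n; j++) {
        if (nij[j] == 0) continue;
        double denom = (double)(nLj[j] - nij[j]);
        d += (nLj[j] / denom) * ss_cent[j] - (nij[j] / denom) * woss[j];
    }
    return d;
}

int main(void)
{
    /* Toy observation (made-up values): two measurements in n = 2 dimensions,
       the second measurement missing dimension 2, so nij = {2, 1}. */
    double y1[2] = {1.2, 2.5}, y2_0 = 0.8;      /* y2 has no value in dim 2 */
    int    nij[2] = {2, 1};

    /* Hypothetical per-dimension summaries of a cluster L: read them as the
       target cluster for Delta+ and as the current cluster for Delta-. */
    int    nLj[2] = {5, 4};
    double mu[2]  = {1.0, 2.0};

    double xbar0 = (y1[0] + y2_0) / 2.0;        /* observation mean, dim 1 */
    double ss_cent[2] = {
        (y1[0]-mu[0])*(y1[0]-mu[0]) + (y2_0-mu[0])*(y2_0-mu[0]),
        (y1[1]-mu[1])*(y1[1]-mu[1])
    };
    double woss[2] = {
        (y1[0]-xbar0)*(y1[0]-xbar0) + (y2_0-xbar0)*(y2_0-xbar0),
        0.0                                     /* single recorded value */
    };

    printf("Delta+ = %.4f  Delta- = %.4f\n",
           delta_plus(2, ss_cent, woss, nij, nLj),
           delta_minus(2, ss_cent, woss, nij, nLj));

    /* Complete, unreplicated special case: one measurement, no missing
       values, nLj = nL in every dimension.  Delta+ then equals the
       Hartigan-Wong increase nL * D(I,L)^2 / (nL + 1). */
    int    one[2] = {1, 1}, nL[2] = {5, 5};
    double zero[2] = {0.0, 0.0};
    double ssc[2] = {(y1[0]-mu[0])*(y1[0]-mu[0]), (y1[1]-mu[1])*(y1[1]-mu[1])};
    double d2 = ssc[0] + ssc[1];
    printf("single measurement: Delta+ = %.4f, HW formula = %.4f\n",
           delta_plus(2, ssc, zero, one, nL), 5.0 * d2 / 6.0);
    return 0;
}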
With ∆− and ∆+ defined, we now simply follow the Hartigan-Wong algorithm using observations X1, ..., XN, with ∆− and ∆+ as the reduction and increase in within-cluster sum of squares, respectively. To initialize the cluster centers, we adapt the suggestion of Hartigan and Wong. We first calculate the overall mean of the data in each dimension, then put the observations in ascending order of the distance from their own means to the overall mean. Then, for cluster L ∈ {1, 2, ..., K}, we assign the mean of the (1 + (L − 1)N/K)th ordered observation (rounding to the nearest integer) to be the initial center; with N = 1000 and K = 5, for example, the seeds are the means of the observations ranked 1, 201, 401, 601, and 801. This ensures that no cluster will be empty after the initial assignment of observations to clusters. However, the reader should note that this initialization method may not be efficient or lead to accurate assignments, and further testing is required to determine a recommended initialization. A small sketch of this seeding is given below, and the remainder of the algorithm is then outlined step by step, with comments on how the included C program accomplishes each step.
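The following is a minimal, self-contained C sketch of the seeding just described (hypothetical names and toy data, not the included program's code): observation means are ranked by squared distance to the overall mean, and the mean ranked 1 + (L − 1)N/K becomes the initial center of cluster L.

#include <stdio.h>
#include <stdlib.h>

typedef struct { int idx; double dist2; } rank_t;

static int by_dist(const void *a, const void *b)
{
    double d = ((const rank_t *)a)->dist2 - ((const rank_t *)b)->dist2;
    return (d > 0) - (d < 0);
}

/* obs_mean: N x n per-observation means (row-major), grand: overall mean,
   centers: K x n output.  Seed L (1-based) is the mean of the observation
   ranked 1 + (L-1)N/K, rounded to the nearest integer. */
static void seed_centers(int N, int n, int K, const double *obs_mean,
                         const double *grand, double *centers)
{
    rank_t *r = malloc((size_t)N * sizeof *r);
    for (int i = 0; i < N; i++) {
        double d2 = 0.0;
        for (int j = 0; j < n; j++) {
            double diff = obs_mean[i * n + j] - grand[j];
            d2 += diff * diff;
        }
        r[i].idx = i;
        r[i].dist2 = d2;
    }
    qsort(r, (size_t)N, sizeof *r, by_dist);
    for (int L = 0; L < K; L++) {
        int pos = (int)((double)L * N / K + 0.5);   /* 0-based rank */
        if (pos > N - 1) pos = N - 1;
        for (int j = 0; j < n; j++)
            centers[L * n + j] = obs_mean[r[pos].idx * n + j];
    }
    free(r);
}

int main(void)
{
    /* Six observation means in n = 1 dimension, K = 3 clusters (toy data). */
    double means[6] = {0.1, 2.0, -1.5, 0.4, 3.2, -0.2}, grand = 0.667;
    double centers[3];
    seed_centers(6, 1, 3, means, &grand, centers);
    for (int L = 0; L < 3; L++)
        printf("initial center %d: %.2f\n", L + 1, centers[L]);
    return 0;
}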
Step 1: Initial Assignment For each observation, find the closest and second closest cluster centers,
as defined by the sum of squared errors, and record the index of each cluster in the second
and third columns of obs. Each observation is assigned to its closest cluster center.
Step 2: Update Cluster Centers Update the cluster centers to be the averages of all measurements
of observations assigned to them. At this point we also compute and store the number
of observations assigned to each cluster, as well as the number of recorded values in each dimension assigned to each cluster, and the within-cluster sum of squares. We track the
within-cluster sum of squares because it is convenient and useful for debugging, but adds
a very small computational burden. If desired, the within-cluster sum of squares can be
calculated only once upon exiting the algorithm.
Step 3: Live Set Initialization Put all clusters in the live set. In the included program, the (2n+2)th
column of clust is used to track which clusters are in the live set, where 0 indicates the cluster
is not in the live set, and any non-zero value indicates the cluster is in the live set. In this
step all entries in that column are given the value 2N + 1.
Step 4: OPTRA Consider each observation I (I = 1, ..., N). If cluster L was updated in the last quick-transfer (QTRAN) stage, it will stay in the live set at least through the end of this stage. Otherwise, remove a cluster from the live set if it has not been updated in the last N optimal-transfer (OPTRA) steps. Let I be in cluster L1. If L1 is in the live set, do Step 4a; otherwise do Step 4b.
Step 4a: Compute ∆+ for every cluster L ≠ L1, and let L2 be the cluster with the smallest ∆+. If this minimum is greater than or equal to ∆−, no transfer is necessary, but we record L2 as the next closest cluster to I. Otherwise, transfer I to L2 and record L1 as the next closest cluster. Update the centers, within-cluster sum of squares, and counts of observations of clusters L1 and L2. L1 and L2 are now in the live set, and we keep track of when the clusters were last involved in a transfer by setting their live set columns to N + I. (A small sketch of this decision rule follows the step list.)
Step 4b: Do Step 4a, but compute the minimum of ∆+ only over the clusters in the live set.
Step 5: Exit Check Stop if the live set is empty. This will be the case if no transfers were made in
Step 4. Otherwise, proceed to Step 6.
Step 6: QTRAN Step Consider each observation I (I = 1, ..., N). Let L1 be the cluster I is assigned to, and L2 the cluster recorded as the next closest. We need not check I if neither L1 nor L2 has changed in the last N steps. If ∆+ for L2 is greater than or equal to ∆−, no change is necessary. Otherwise, switch L1 and L2, update the centers, and record the clusters' involvement in a transfer by updating their live set columns to 2N + 1.
Step 7: Transfer Switch If no transfer has taken place in the last N steps (the count variable is
greater than or equal to N ), return to Step 4. Otherwise, return to Step 6.
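As referenced in Step 4a, the following toy C sketch (made-up numbers and hypothetical variable names) illustrates the transfer decision: find the cluster L2 that minimizes ∆+, and transfer observation I only if that increase is smaller than the reduction ∆− gained by leaving L1.

#include <stdio.h>

int main(void)
{
    enum { K = 5 };
    int    L1 = 2;                                /* current cluster of I        */
    double dminus = 3.1;                          /* reduction if I leaves L1    */
    double dplus[K] = {4.0, 2.4, 0.0, 5.7, 2.9};  /* increase if I joins each L  */

    /* Find L2, the cluster (other than L1) with the smallest increase. */
    int L2 = -1;
    for (int L = 0; L < K; L++) {
        if (L == L1) continue;
        if (L2 < 0 || dplus[L] < dplus[L2]) L2 = L;
    }

    if (dplus[L2] >= dminus) {
        /* No transfer: record L2 as the next closest cluster for I. */
        printf("keep I in cluster %d; next closest is %d\n", L1, L2);
    } else {
        /* Transfer I to L2, record L1 as next closest, update both clusters. */
        printf("move I from cluster %d to cluster %d\n", L1, L2);
    }
    return 0;
}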
4 Application
Using R scripts provided by Dr. Maitra, we ran simulations to assess the performance of this adapted algorithm on data with varying levels of censoring and clustering complexity. Each data set had n = 2 dimensions, K = 5 clusters, N = 1000 observations, and the same vectors of ni, so that M = 4547. The first measurement of each observation was left complete, and the remaining measurements were censored at rates varying from 0 to 0.8. At each censoring level, 100 data sets were generated, and the Adjusted Rand Index (ARI) between the assignments made by the adapted Hartigan-Wong algorithm and the true clusters was calculated and recorded. This process was repeated for different clustering complexities, as determined by MaxOmega, ranging from 0.001 to 0.75. In the table below, we report the mean ARI over the 100 simulations at each combination of censoring rate and MaxOmega under the REP method. For comparison, we also ran the original Hartigan-Wong routine on just the first measurement of each observation in each data set, using both the initialization described above (the ORD method) and R's default initialization of choosing 5 points at random (the RAND method). Keep in mind that the censoring rate does not affect the ORD and RAND methods; their results are included only for comparison with the REP method.
Table 1: Mean ARI

                  MaxOmega = 0.001          MaxOmega = 0.25           MaxOmega = 0.75
Censoring Rate    REP     ORD     RAND      REP     ORD     RAND      REP     ORD     RAND
0                 0.755   0.778   0.770     0.705   0.624   0.653     0.226   0.190   0.184
0.1               0.776   0.786   0.770     0.670   0.622   0.623     0.230   0.194   0.193
0.2               0.784   0.753   0.762     0.684   0.609   0.626     0.225   0.200   0.199
0.4               0.785   0.817   0.786     0.677   0.629   0.598     0.207   0.181   0.178
0.6               0.788   0.760   0.793     0.670   0.622   0.611     0.294   0.174   0.166
0.8               0.817   0.824   0.779     0.671   0.632   0.618     0.212   0.200   0.200
The apparent lack of a trend (or even the increasing trend) in mean ARI under REP as the
censoring rate increases is unexpected. It may indicate that there is a bug in the code or simulation
methods, but also may simply mean that not much is gained from the replicates (remember, the
first measurement of each observation in this simulation was complete). To compare, note that
under MaxOmega = 0.001, there is nearly no difference between the REP method and the others. However, as MaxOmega increases, the mean ARI under REP is uniformly higher than under
the other methods. Comparing the ORD and RAND methods, we also see no indication that one
initialization is superior to the other, but a more detailed comparison would be required to fully
investigate this matter.
5 Discussion
We have extended the Hartigan-Wong k-means clustering algorithm to operate on data with incomplete replicates by deriving new terms for the increase and decrease in within-cluster sum of squares
realized by moving an observation into or out of a cluster. We limited testing of the algorithm to
varying levels of censoring and clustering complexity, but it may also be interesting to investigate
performance for different numbers of clusters, replicates, or dimensions. A more in-depth comparison of different initialization methods might also be of interest, especially between choosing the means of k observations selected at random to serve as the initial centers and the ordering method outlined above.
There is also room for further work on the C implementation of the adapted algorithm. There
is potential to reduce the number of computations in the included code by storing ∆− for each
observation, and only updating when necessary (I attempted to include this, but ran out of time).
Finally, the included program assumes that each observation has at least one recorded value in each
dimension. Theoretically, this should not be necessary, but a program allowing such data is still
being debugged and tested.