Document 11052197

advertisement
HD28
.M414
Oe\weV
ALFRED
P.
WORKING PAPER
SLOAN SCHOOL OF MANAGEMENT
ASYMPTOTIC PROPERTIES OF UNIVARIATE
POPULATION K-MEANS CLUSTERS
M. Anthony Ifong
Sloan School of >fanagement
Massachusetts Institute of Technology
Cambridge, MA 02139
Working Paper #1339-82
MASSACHUSETTS
INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139
ASYMPTOTIC PROPERTIES OF UNIVARIATE
POPULATION K-MEANS CLUSTERS
M. Anthony ^fo'^g
Sloan School of >fanagement
Massachusetts Institute of TechnologyCambridge, MA 02139
Working Paper #1339-82
Key Words and Phrases:
population k-means clusters; within-cluster
sums of squares; cluster lengths.
ABSTRACT
Let f be a density function defined on the closed interval [a, b].
The k-means partition of this interval is defined to be the k-partition
with the smallest within cluster sum of squares. The properties of this
k-raeans partition when k becomes large will be obtained in this paper.
The results suggest that the k-means clustering procedure can be used to
construct a variable-cell histogram estimate of f using a sample of observations taken from f.
^^'^
1
1 1983
1
INTRODUCTION
.
Let the univariate observations
from a distribution
X, ,X~,
.
.
with density function
F
.
,
be sampled
X,,
In cluster
F.
analysis, the k-means clustering method (see Hartigan (1975),
Chapter
4)
tions into
is often used to
k
partition the sample of
clusters with means
U,
...,
,
1
U,
k
.
observa-
N
The resultant
clusters satisfy the property that no movement of an observation
from one cluster to another reduces the sample within cluster
sum of squares
N
N
.
„
.
,
1=1
l£l<k
•^~
'
1
'
'
'
J
-^
For these sample k-means clusters, if
I.
interval containing all points in
closer to
other cluster means, then
the sampled space.
population
F
(I-i
>
R
•••>
is used to denote the
U.
than to any
defines a k-partition of
I^,^
The corresponding k-means partition in the
is defined by the k-population means
m.
,
...,
which are selected in such a way that the within cluster (or
interval) sum of squares
WSS = //"^^
M
X - m.
is minimized.
-1-
0745^6
1^
I
dF
m,
,
The k-means method has been widely used in clustering
applications (see Blashfield and Aldenderfer, 1978), and the
efficient computational algorithm given in Hartigan and Wong
(1979)
has been included in the multivariate programs BMDPKM of
the BMDP statistical package.
The properties of sample k-means
clusters have also been studied by several investigators.
Fisher (1958), and Fisher and Van Ness
(1971),
it
In
is shown that
k-means clusters are convex, i.e., if an observation is a
weighted average of observations in a cluster, the observation
is also in the cluster.
N
->-
°°)
And the asymptotic convergence (as
of the sample k-means clusters to the population k-means
cluster for fixed number of clusters
k
has been studied by
MacQueen (1967), Hartigan (1978), and Pollard (1981),
in
which
conditions that ensure the almost sure convergence of the set
of means of the k-means clusters can be found.
However, little
work have been done in examining the properties of population
k-means clusters, especially when
k
becomes large.
In
Dalenius (1951), it is shown that the cut-point between neighboring population clusters is the average of the means in the
clusters, and in Cox (1957), the cut-points for the k-means
clusters in the standard normal distribution are given for
k
=
1
6
''
In this paper, the asymptotic properties
(as
k
becomes
large) of the population k-means clusters in one dimension are
obtained.
It
is shown in Section 2
-2-
that the optimal population
partition is such that the within cluster sums of squares of the
k
cluster intervals are asymptotically equal, and that the sizes
of the cluster intervals are inversely proportional to the one-
third power of the underlying density at the midpoints of the
The implications of these results are discussed in
intervals.
Section
3.
ASYIIPTOTIC PROPERTIES OF POPULATION K-MEANS CLUSTERS
2.
Let
be a density function defined on the interval
f(x)
and denote the
[a,b],
derivative of
ith
Let the k-partition of
[a,b]
a < y,
i
<
<
y^
.
.
.
<
y
*
f
at
x
by
f
(x).
specified by the k-1 cutpoints
be the k-partition with the smallest
t>
within cluster sum of squares
k
WSS =
^
WSS. =
Z
i=l
where
a = y
,
^
b =
y
= / ^
y,
E
^i
^i-1
i=l
,
"
^i-1
2
f(x)
dx,
^
and
•
m.
(x - m.)
/
X f(x)
dx
/
y/ ^
f(x)dx.
^i-l
In this section, we will describe
-3-
the properties of this k-means
partition of a finite interval [a,b] as the number of cluster
intervals (or cells) becomes large.
Theorem
[a,b].
;
Let
denote a density function on the interval
f(x)
And let
a = y^j^
<
y^^
<
...
<
y(k_i)k
<
^kk
the cutpoints specifying the k-means partition of [a,b].
is positive and has four bounded derivatives in
have uniformly in
If
then we
i i ^ k.
- /^ [f(x)]^^^dx
f./^^
ik
(2.1)
^ Pik
[fWl'^'^i^k''^' - ^a
(2-2)
a
[/^
[f(x)
]'-''
^dx]^/12
(2.3)
k
«here
e.^ = y.^ - y^.,^)^^
f
p
.,
ik
= f
=
(1/2 y.,
^ik
/^^
+ 1/2 v,.
f(x)
...
(i-l)k
)
dx
^(i-l)k
and
f
k e.,
ik
k^SS., ^
as
1
[a,b],
^^
= ^
2
^ik
^ik
^^
x f(x)dx/p,, ]^ f(x)dx.
[x - / ^
WSS., = f
^^
^^
^(i-l)k
^(i-l)k
-4-
(The theorem states that,
of squares of the
for large
the within cluster sums
k,
intervals are nearly equal;
k
it
that the length of the interval containing a point
density
Proof
(I)
f(x)
x
of
-1/3
proportional to
is
follows
f(x)
.)
The proof is in four parts.
The k-partition of [a,b] consisting of
has a within cluster sum of squares of order
bution from the
~
k
;
the contri-
^3-2
interval to the optimal within cluster sum
ith
of squares is of order
which implies that
equal intervals
k
e.,
sup e.,
3
.
Therefore,
=
(k
T.
e
.,
=
(k
),
^^^
To avoid complexity of
-2/3
).
i
notation, the
(II)
indexing partition will be dropped.
k's
In this part of the proof,
it
\^n.ll
be shown that lengths of
neighboring clusters are of the same order of magnitude.
m.
be the mean of the
ith
interval.
Let
Then
1
yy
= / ^
f(x)dx.
X f(x)dx// ^
1
y.
y.1-1
1-1
•
m.
1
T
Consider any two neighboring intervals
e.
and
e.
,.
By the
optimality of the partition, as is shown in Dalenius (1951),
-5-
y
.
- m.
2
J
Thus
e
.
= m.
- y
^
,
J+1
y
>
J
- m
.
J
J
J
m.,^ - V.
J
J+1
= / ^^'- X f(x)
V
e
J
e
X
+ y.) dx//
f (x
^"'^
f(x -H/.)
dx
M
1
>
dx - y.
f(x)
.
^"""^
= /
dx//'^^-'
y
T
2
•
1
TT"
M
S-,T
1+1
•
u
where
»
'
-"
u
„
M
sub
_
=
>/
M,
=
1
^,
3.nf
,
aixib
,
,
and
f(x)
ef
\
r (x)
,
M
Sinalarly,
e..^
1
^
j+l
tt
.
2
1
tt-
M
.
e..
1
We will now establish the asymptotic relationship between
(III)
the lengths of neighboring intervals.
ith
interval by
C. = y
_i
C.
+ — e..
(i =
1,
.
.
.
,
k)
Denote the center of the
It
.
follows that
Using the Taylor series expansion, we have, for
•
any
x
in the
^1
+ i (x-Cp2
^x ),
(q
where
.
ith
f(2)(c
q
^x
interval,
f(x)
101
.
^i(x-Cj3
is between
x
.
and
=
f(C.)
+(x-C.)
•
+^
iz4i
f^^\c')
C.
1
f^''"\c.)
(x-C,)^
.
Since the first four
derivatives are bounded on [a,b], it follows from the above
-6-
f(^')
series expansion that we have simultaneously for all
p^ = /
^
+^
f(x)dx = e.[f(C.)
l<i<k.
f^^^C.)e.2 + 0(e.S],
(2.4)
i-1
and
^i
(I)
^
1
xf(x) dx = e. [C.f(C.) +
f
i-1
"^
(2)
1
f^^^ (C.)e.^ + 2^
^i
^^
^
e.^ + 0(e.^)]
(2.5)
(Note that the universal bound contained in the
term depends
only on the various bounds of the derivatives of
independent of
.)
f
and is
i.)
Therefore,
f\
=
m.
X
f
f^'^^V
1
(x)dx/p. = C. +
•
Y2
f(c.)^
^
2
+
0(e.^)
(2.6)
Since the partition is optimal, we have simultaneously for all
l<i<k,
(C
.
1
+
—
2
e
.
)
1
- m.
= m.
1
,
-
1+1
,
(C
-
,
.
1+1
,
tt
2
e
_
1+1,
) ,
which when
combined with (2.6) gives
^i -
6
•
f(C.)
^
" °^^i
^
=
^+1 ^
°(4+i)-
-7-
6
f(C.^J
i+1'
^+1
^
Since it has been shown in part
that
[II]
"^
and
e.
1
the same order of magnitude, we have for all
^+1
"
^
=
^+1
f(C.
6
- 6
•
e.,^
1+1
are of
l<i<k.,
-TiTT
^
^ °^^i^-
It follows that
.(1),„
=
^+1
^
^1 - 6
f(l)rr
.
^^
f(C.)
^
1
^
o2
^^°(^)^'
f(C.,J
1+1
1
After some Taylor series manipulation, we have
e.^^/e. = [f(C.^^)/f(C.)] ^/^
•
[1
+ 0(e^)].
(2.7)
Moreover, since it can be shown from (2.4), (2.5), and (2.6) that
"•
WSS. = /
1
y
and from (2.4),
.
(x-m,)^ f(x)dx =
3-
,
1-1
p.
= f(C.)
e.
[1
^ f(C.)eJ[l
12
1
1
+ 0(ej)],
1
+ 0(e.)], we obtain from 2.7
that
WSS. ^^ /WSS. =
and
1
+ 0(ej)
p.^^/p. = [f(C.^^)/f(C.)]-''Ml + 0(ej)].
-8-
(2.8)
(2.9)
Finally, we will now establish the relationship between
[IV]
and
e.
for any
It follows from (2.7)
l^i<j^k..
that for any
l^i^j^k,
pair of values of
e./e^ = [f(C.)/f(Cj)] ^^^{[1+O(ep][l+O(e^^^)]---[1+O(e^)]}.
But it has been shown in part [I] that
e./e^ =
Hence,
=
for all
[f (C
.
)
/f (C
]"^^^
.
e.
.
[140(k"^/^)
)
=
0(k
).
]^
[f(C.)/f(C.)]"^^^ [l+0(k"^/^^]
l-^i<ji^k,
which implies that we have uniformly in
l<i<j<k,
(e./e^)
Since
•
[f(C^)/f(Cj)]^^^
.,11
Z
e.
fCC.)-*"^-^
1=1
->
1
as k
^ /^ f(x)^'^^ dx,
a
»
->
(2.1)
/
\
follows.
Similarly, from (2.8) and (2.9), we have uniformly in l^i<j^k.
WSS^/WSS. ^ 1, and (p^/p
k
-*
°°,
.
) •
[f (C^) /f (C .)
]~^''^
^
1
which in turn gives (2.3) and (2.2) respectively.
the theorem is proved.
-9-
as
And
e.
3.
DISCUSSION
In this paper, our effort is directed towards obtaining the
properties of univariate population k-means clusters when
becomes large.
The properties given in Section
the lengths of the population k-means intervals
adaptive to the underlying density function:
2
k
indicate that
(or cells)
are
the intervals are
large when the density is low, while the intervals are small
where the density is high.
This result suggests that the k-means
clustering procedure can be used to construct a variable-cell
histogram estimate of an underlying density using
a
sample of
observations taken from that density (see Wong, 1980).
Such a
density estimation method is of interest because it makes use of
the computationally efficient k-means clustering procedure
(Hartigan and Wong, 1979) which is also applicable to multi-
variate data.
-10-
REFERENCES
Blashfield, R.K.
,
and Aldenderfer, M.S.
(1978), "The Literature
on Cluster Analysis," Multivariate Behavioral Research
,
13,
271-295.
Cox, D.R.
(1957), "Notes on grouping," Journal of the American
Statistical Association
,
52, 543-547.
(1951), "The problem of optimum stratification",
Dalenius, T.
Skandinavisk Aktuarietidskrif
Fisher, W.D.
(1958),
34,
,
133-148.
"On grouping for maximum homogeneity,"
Journal of the American Statistical Association
,
789-
53,
798.
Fisher, L., and Van Ness, J.N.
procedures," Biometrika
,
(1971), "Admissible clustering
91-104.
58,
Hartigan, J. A. (1975), Clustering Algorithms , New York:
John
Wiley and Sons.
(1978), "Asymptotic distributions for clustering
criteria," Annals of Statistics
,
and Wong, M.A.
(1979).
,
6,
117-131.
"Algorithm AS136:
means clustering algorithm," Applied Statistics
,
A K28,
100-
108.
MacQueen, J.B.
(1967), "Some methods for classification and
analysis of multivariate observations," Proceedings of the
Fifth Berkeley Symposium on Probability and Statistics
281-297.
-11-
,
Pollard, D.
(1981), "Strong consistency of k-means clustering",
Annals of Statistics
Wong, M.A.
,
9,
135-140.
(1980), "Asymptotic properties of k-means clustering
algorithm as a density estimation procedures," Sloan School
of Management Working Paper #2000-80, M.I.T., Cambridge, MA.
L
-12-
''^Ite Due
HD28.M414 no,1339- 82
Wong, M. Antho/Asymptotic properties
QQ1369""
176
D»BKS
D»
745f76
3
TDflO
o
DOS DM7 MMM
Download