WORKING PAPER
ALFRED P. SLOAN SCHOOL OF MANAGEMENT

ASYMPTOTIC PROPERTIES OF K-MEANS CLUSTERING ALGORITHM AS A DENSITY ESTIMATION PROCEDURE

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139

by M. Anthony Wong

ABSTRACT

A random sample of size $N$ is divided into $k$ clusters that minimize the within cluster sum of squares locally. The asymptotic properties of this k-means method (as $k$ and $N$ approach $\infty$), as a procedure for generating variable cell histograms, are presented. In one dimension, it is established that the k-means clusters are such that within cluster sums of squares are asymptotically equal, and that the locally optimal solution approaches the population globally optimal solution under certain regularity conditions. A histogram density estimate is proposed, and is shown to be uniformly consistent in probability.

KEY WORDS: K-means clustering algorithm; Asymptotic properties; Variable cell histograms; Within cluster sum of squares; Uniform consistency in probability.

1. INTRODUCTION

Let $X_1, X_2, \ldots, X_N$ be observations from some density $f$ of a probability distribution $F$. To estimate the univariate density $f$ using the random sample, the traditional method is the histogram. The asymptotic properties of the fixed cell histogram are given in the recent text by Tapia and Thompson (1978). Van Ryzin (1973) first proposed a variable cell histogram which is adaptive to the underlying density. His procedure is related to the nearest neighbour density estimates developed by Loftsgaarden and Quesenberry (1965). In this paper, it is proposed that the k-means clustering technique can be regarded as a practicable and convenient way of obtaining variable cell histograms in one or more dimensions.
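To make the proposal concrete, here is a minimal Python sketch of a one-dimensional variable cell histogram built from k-means clusters. It rests on several assumptions that are not taken from the paper: the partition is found by a simple Lloyd-style iteration (each observation is reassigned to the nearest cluster mean until nothing moves) rather than by the algorithm of Hartigan and Wong; the cell edges are placed at the midpoints between neighbouring cluster means; and the density in each cell is estimated by the plain ratio count / (N x cell width). The function names `kmeans_cells` and `histogram_density` are hypothetical.

```python
from bisect import bisect_right


def kmeans_cells(xs, k, a, b, max_iter=10_000):
    """Illustrative 1-D k-means (Lloyd-style iteration, an assumption here).

    Returns (edges, counts): the k+1 cell edges on [a, b] and the number
    of observations falling in each cell.
    """
    xs = sorted(xs)
    n = len(xs)
    # Start from k contiguous groups of (nearly) equal size.
    bounds = [round(j * n / k) for j in range(k + 1)]
    cuts = []
    for _ in range(max_iter):
        sizes = [bounds[j + 1] - bounds[j] for j in range(k)]
        if min(sizes) == 0:
            raise ValueError("empty cluster; this sketch does not repair it")
        means = [sum(xs[bounds[j]:bounds[j + 1]]) / sizes[j] for j in range(k)]
        # Cutpoint between neighbouring clusters: midpoint of their means.
        cuts = [(means[j] + means[j + 1]) / 2 for j in range(k - 1)]
        new_bounds = [0] + [bisect_right(xs, c) for c in cuts] + [n]
        if new_bounds == bounds:  # no observation changes group
            break
        bounds = new_bounds
    counts = [bounds[j + 1] - bounds[j] for j in range(k)]
    edges = [a] + cuts + [b]
    return edges, counts


def histogram_density(edges, counts):
    """Plain histogram estimate: count / (N * cell width) for each cell."""
    n = sum(counts)
    return [c / (n * (hi - lo)) for c, lo, hi in zip(counts, edges, edges[1:])]


# Deterministic "sample": the quantiles of the density f(x) = 2x on [0, 1].
N = 1000
sample = [((i + 0.5) / N) ** 0.5 for i in range(N)]
cell_edges, cell_counts = kmeans_cells(sample, 5, 0.0, 1.0)
for lo, hi, d in zip(cell_edges, cell_edges[1:],
                     histogram_density(cell_edges, cell_counts)):
    print(f"[{lo:.3f}, {hi:.3f})  estimated density {d:.3f}")
```

On this example the cells come out wider near 0, where the density is low, and narrower near 1, where it is high, which is the adaptive behaviour the paper is after. A practical implementation would also need to handle empty clusters and to choose the starting partition with more care.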
Suppose that the observations $X_1, X_2, \ldots, X_N$ are partitioned into $k$ groups such that no movement of an observation from one group to another will reduce the within group sum of squares. This technique for division of a sample into $k$ groups to minimize the within group sum of squares locally is known in the clustering literature as k-means. In one dimension, the partition will be specified by $k-1$ cutpoints; the observations lying between common cutpoints are in the same group. See Hartigan (1975) for a detailed description of the k-means technique, and see Hartigan and Wong (1979a) for an efficient computational algorithm. The asymptotic properties of k-means as a clustering technique (as $N$ approaches $\infty$ with $k$ fixed) have been studied by MacQueen (1967), Hartigan (1978), and Pollard (1979). Here, however, the large sample properties of k-means (as $k$ and $N$ approach $\infty$) as a density estimation technique are presented.

The asymptotic properties (as $k \to \infty$) of the population k-means clusters are given in Theorem 1. It is established that the optimal population partition is such that the within cluster sums of squares are asymptotically equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. Theorem 2 and Theorem 3 give the asymptotic properties (as $k$ and $N$ approach $\infty$) of the locally optimal k-means clusters for samples from the uniform $[0,1]$ density. For samples from a general population $F$, the asymptotic results are given in Theorem 4 and Theorem 5. It is shown that the locally optimal solution approaches the population globally optimal solution under certain regularity conditions. In Theorem 6 and Corollary 7, two proposed estimates are shown to be uniformly consistent in probability. Two empirical examples are given in Section 6 to illustrate the performance of one of the estimates.

2. SOME DEFINITIONS
Let $\{X_N\}$ be a sequence of random variables, and let $\{a_N\}$ be a sequence of real constants and $\{b_N\}$ be a sequence of positive constants.

1. The notation $a_N = O(b_N)$ means $\limsup_{N\to\infty} |a_N|/b_N < \infty$. (If $a_N = O(b_N)$ and $b_N = O(a_N)$, we say $b_N$ is of order $a_N$.)

2. $X_N = O_p(b_N)$ means that for each $\varepsilon > 0$ there exists a real constant $c(\varepsilon)$ and an integer $N(\varepsilon)$ such that $\Pr\{|X_N|/b_N > c(\varepsilon)\} < \varepsilon$ for all $N > N(\varepsilon)$.

3. $a_N = o(b_N)$ means $\lim_{N\to\infty} |a_N|/b_N = 0$.

4. $X_N = o_p(b_N)$ means that for each $\varepsilon > 0$, $\lim \Pr\{|X_N|/b_N > \varepsilon\} = 0$ as $N \to \infty$.

5. For a real sequence $\{a_{iN}\}$, where $1 \le i \le k_N$, and a positive real sequence $\{b_N\}$, we say $a_{iN} = O(b_N)$ if $\limsup_{N\to\infty} \sup_{1 \le i \le k_N} (|a_{iN}|/b_N) < \infty$; if the double sequence is considered as the single sequence $a_{11}, \ldots, a_{k_1 1}, a_{12}, \ldots, a_{k_2 2}, \ldots$, definitions 1 and 5 coincide.

The symbol $f^{(i)}(x)$ will be used to denote the $i$th derivative of $f$ at $x$.

3. ASYMPTOTIC PROPERTIES OF OPTIMAL POPULATION K-MEANS CLUSTERS

Let $f(x)$ be a density function defined on the interval $[a,b]$. Suppose that $[a,b]$ is to be partitioned into $k$ clusters (or intervals) so that the within cluster sum of squares of this optimal k-partition is the minimum over all possible k-partitions.

Theorem 1: If $f(x) > 0$ for all $x \in [a,b]$, and $f(x)$ is continuous together with its first four derivatives on $[a,b]$, then we have, uniformly in $1 \le i \le k$, as $k \to \infty$,

$$k\, e_{ik}\, f_{ik}^{1/3} \to \int_a^b [f(x)]^{1/3}\,dx, \tag{3.1}$$

$$k\, p_{ik}\, f_{ik}^{-2/3} \to \int_a^b [f(x)]^{1/3}\,dx, \tag{3.2}$$

$$k^3\, WSS_{ik} \to \Big[\int_a^b [f(x)]^{1/3}\,dx\Big]^3 \Big/ 12, \tag{3.3}$$

where
$e_{ik}$ = length of the $i$th interval in the optimal k-partition,
$f_{ik}$ = density at mid-point of the $i$th interval,
$p_{ik}$ = area under $f$ inside the $i$th interval,
$WSS_{ik}$ = within-cluster sum of squares of the $i$th interval.

(The theorem states that, for large $k$, the within cluster sums of squares are nearly equal; it follows that the length of the interval containing a point $x$ of density $f(x)$ is proportional to $f(x)^{-1/3}$.)

Proof: The proof is in four parts.

[I] The k-partition
of $[a,b]$ consisting of $k$ equal intervals has a within cluster sum of squares of order $k^{-2}$; the contribution from the $i$th interval to the optimal within cluster sum of squares is of order $e_{ik}^3$. Therefore, $e_{ik} = O(k^{-2/3})$. To avoid complexity of notation, the $k$'s indexing partitions will be dropped.

[II] Suppose that $a = y_0 < y_1 < \cdots < y_k = b$ are the cutpoints of the optimal k-partition. Denote the center of the $i$th interval by $c_i$ $(i = 1, \ldots, k)$. That is,

$$c_i = y_{i-1} + \tfrac{1}{2} e_i \quad (i = 1, \ldots, k).$$

Let $m_i$ be the mean of the $i$th interval. It follows that

$$m_i = \int_{y_{i-1}}^{y_i} x f(x)\,dx \Big/ \int_{y_{i-1}}^{y_i} f(x)\,dx .$$

Consider any two neighbouring intervals. By the optimality of the partition, $y_j - m_j = m_{j+1} - y_j$ $(1 \le j \le k-1)$. Thus,

$$e_j > y_j - m_j = m_{j+1} - y_j = \int_{y_j}^{y_{j+1}} x f(x)\,dx \Big/ \int_{y_j}^{y_{j+1}} f(x)\,dx \; - \; y_j = \int_0^{e_{j+1}} x f(x + y_j)\,dx \Big/ \int_0^{e_{j+1}} f(x + y_j)\,dx \;\ge\; \frac{M_*}{2M^*}\, e_{j+1},$$

where $M_* = \inf_{a \le x \le b} f(x)$ and $M^* = \sup_{a \le x \le b} f(x)$. Similarly, $e_{j+1} \ge \dfrac{M_*}{2M^*}\, e_j$.

[III] Let us now establish the asymptotic relationship between the lengths of neighbouring intervals. Using the Taylor series expansion, we have, for any $x$ in the $i$th interval,

$$f(x) = f(c_i) + (x - c_i) f^{(1)}(c_i) + \tfrac{1}{2}(x - c_i)^2 f^{(2)}(c_i) + \tfrac{1}{6}(x - c_i)^3 f^{(3)}(c_i) + \tfrac{1}{24}(x - c_i)^4 f^{(4)}(\xi_x),$$

where $\xi_x$ is between $x$ and $c_i$. Since the first four derivatives of $f$ are bounded on $[a,b]$, it follows from the above series expansion that we have simultaneously for all $1 \le i \le k$,

$$p_i = \int_{y_{i-1}}^{y_i} f(x)\,dx = e_i \big[ f(c_i) + \tfrac{1}{24} f^{(2)}(c_i) e_i^2 + O(e_i^4) \big] \tag{3.4}$$

and

$$\int_{y_{i-1}}^{y_i} x f(x)\,dx = e_i \big[ c_i f(c_i) + \tfrac{1}{12} f^{(1)}(c_i) e_i^2 + \tfrac{1}{24} c_i f^{(2)}(c_i) e_i^2 + O(e_i^4) \big]. \tag{3.5}$$

(Note that the universal bound contained in the $O(e_i^4)$ term, which is independent of $i$ from definition 5, depends on the various bounds of the derivatives of $f$.) Therefore,

$$m_i = c_i + \frac{1}{12}\, \frac{f^{(1)}(c_i)}{f(c_i)}\, e_i^2 + O(e_i^4). \tag{3.6}$$

Since the partition is optimal, we have simultaneously for all $1 \le i < k$, $y_i - m_i = m_{i+1} - y_i$, which when combined with (3.6) gives

$$e_{i+1} = e_i - \frac{1}{6} \Big( \frac{f^{(1)}(c_i)}{f(c_i)}\, e_i^2 + \frac{f^{(1)}(c_{i+1})}{f(c_{i+1})}\, e_{i+1}^2 \Big) + O(e_i^4).$$

Thus from [II], we have for all $1 \le i < k$, $e_{i+1} = e_i [1 + O(e_i)]$. It follows that

$$e_{i+1} = e_i \Big[ 1 - \frac{1}{6}\, \frac{f^{(1)}(c_i)}{f(c_i)}\, (e_i + e_{i+1}) + O(e_i^2) \Big],$$

since $f(c_{i+1}) = f(c_i) + O(e_i)$ and $f^{(1)}(c_{i+1}) = f^{(1)}(c_i) + O(e_i)$,

$$\phantom{e_{i+1}} = e_i \big[ f(c_{i+1}) / f(c_i) \big]^{-1/3} \big[ 1 + O(e_i^2) \big],$$

since (i) $f(y_i) = f(c_i) + \tfrac{1}{2} f^{(1)}(c_i) e_i + O(e_i^2)$, and (ii) $f(y_i) = f(c_{i+1}) - \tfrac{1}{2} f^{(1)}(c_{i+1}) e_{i+1} + O(e_{i+1}^2)$. Equivalently, we have

$$e_{i+1}/e_i = \big[ f(c_{i+1}) / f(c_i) \big]^{-1/3} \big[ 1 + O(e_i^2) \big]. \tag{3.7}$$

Now, from (3.4), $p_i = f(c_i)\, e_i\, [1 + O(e_i^2)]$, and it can be shown from (3.4), (3.5), and (3.6) that $WSS_i = \tfrac{1}{12} f(c_i)\, e_i^3\, [1 + O(e_i^2)]$. Hence, from (3.7), we obtain

$$p_{i+1}/p_i = \big[ f(c_{i+1}) / f(c_i) \big]^{2/3} \big[ 1 + O(e_i^2) \big] \tag{3.8}$$

and

$$WSS_{i+1}/WSS_i = 1 + O(e_i^2). \tag{3.9}$$

[IV] Let us now establish the relationship between $e_i$ and $e_j$ for any $1 \le i < j \le k$. It follows from (3.7) that for any pair of values of $i$ and $j$,

$$e_j/e_i = \big[ f(c_j)/f(c_i) \big]^{-1/3} \prod_{l=i}^{j-1} \big[ 1 + O(e_l^2) \big].$$

But from [I], $\sup_i e_i = O(k^{-2/3})$, which implies that

$$e_j/e_i = \big[ f(c_j)/f(c_i) \big]^{-1/3} \big[ 1 + O(k^{-4/3}) \big]^{k}.$$

Thus $(e_j/e_i)\, [f(c_j)/f(c_i)]^{1/3} \to 1$ as $k \to \infty$, uniformly in $1 \le i < j \le k$. Since $\sum_{i=1}^{k} e_i f(c_i)^{1/3} \to \int_a^b f(x)^{1/3}\,dx$ as $k \to \infty$, (3.1) follows. Similarly, from (3.8) and (3.9), we have uniformly in $1 \le i < j \le k$,

$$(p_j/p_i)\, \big[ f(c_j)/f(c_i) \big]^{-2/3} \to 1 \quad \text{and} \quad WSS_j/WSS_i \to 1 \quad \text{as } k \to \infty,$$

which in turn give (3.2) and (3.3). And the theorem is proved.

Next, we will examine the asymptotic properties (as $k$ and $N$ approach $\infty$) of the locally optimal sample k-means clusters.

4. ASYMPTOTIC PROPERTIES OF LOCALLY OPTIMAL K-MEANS CLUSTERS

4.1. The Uniform Case

Let $x_1, x_2, \ldots, x_N$ be a random sample from the uniform distribution on $[0,1]$. Suppose that the $N$ observations are grouped
into $k_N$ clusters so that the within cluster sum of squares of this locally optimal $k_N$-partition cannot be decreased by moving any single point from its present cluster to any other cluster. There may be many local optima; Theorem 2 shows that they all converge to the globally optimal partition of the population. (Denote the $j$th order statistic by $x_{(j)}$. If $x_{(l_j+1)}, x_{(l_j+2)}, \ldots, x_{(l_j+n_j)}$ are the $n_j$ observations in the $j$th cluster, then the length and the midpoint of the $j$th cluster are defined to be $[x_{(l_j+n_j+1)} - x_{(l_j)}]$ and $\tfrac{1}{2}[x_{(l_j+n_j+1)} + x_{(l_j)}]$ respectively, where $x_{(0)} = 0$ and $x_{(N+1)} = 1$.)

To determine the ratio of the lengths of two adjacent clusters, we need to use the means of the observations in the clusters to locate accurately the midpoints of the clusters. A theorem of large deviations due to Feller can be used to prove that the cluster means are suitably close to the midpoints.

Lemma 1 (Feller's theorem of large deviations; see Feller (1971, pp. 549-553) for proof.): Let $X_1, X_2, \ldots, X_N$ be a random sample from a common distribution $F$ such that $E(X_i) = 0$ and $E(X_i^2) = \sigma^2$. Let $G_N(\cdot)$ stand for the distribution of the normalized sum $\sum_{i=1}^{N} X_i / (\sigma \sqrt{N})$, and let $S(y) = (2\pi)^{-1/2}\, y^{-1} \exp(-\tfrac{1}{2} y^2)$. Then provided that
(i) the characteristic function of $F$ is analytic in a neighbourhood of the origin,
(ii) $y_N$ varies as $N \to \infty$ in such a way that $y_N \to \infty$ and $y_N N^{-1/6} \to 0$,
we have $[1 - G_N(y_N)] / S(y_N) \to 1$ as $N \to \infty$.

Lemma 2: Let $z_1, z_2, \ldots$ be i.i.d. random variables uniformly distributed on $[a,b]$. Let $\mu = (a+b)/2$ and $h = (b-a)$. Put $\alpha_N = \log N / N$. Then there exist constants $C$, $D$, and $N_0$ such that if $N \ge N_0$ and $n > N\alpha_N^{1/3}/16$,

$$\Pr\{ h^{-1} n^{-1/2}\, |z_1 + z_2 + \cdots + z_n - n\mu| > C (\log N)^{1/2} \} < D N^{-2} (\log N)^{-1/2}.$$

Proof: Now $E[z_i - \mu] = 0$ and $E[(z_i - \mu)^2] = h^2/12$. Therefore by Lemma 1, since $(4 \log N)^{1/2} \to \infty$ and $(4 \log N)^{1/2}\, n^{-1/6} \to 0$ as $n \to \infty$, we have

$$\Pr\Big\{ \sqrt{12}\, h^{-1} n^{-1/2} \sum_{i=1}^{n} (z_i - \mu) > (4 \log N)^{1/2} \Big\} \Big/ \big[ (2\pi)^{-1/2} (4 \log N)^{-1/2} N^{-2} \big] \to 1 \quad \text{as } n \to \infty.$$

And the lemma follows.

(It will be shown later that, when $N$ is large enough, all of the $k_N$ clusters contain at least $N\alpha_N^{1/3}/16$ observations with probability tending to one.)

In the application of Lemma 2, the number $n$ of observations in an interval of length $h$ will be approximately $Nh$; this is made more precise in Lemma 3, which is a direct consequence of Donsker's theorem for empirical processes (see Billingsley (1968), p. 141). Together with Lemma 2, Lemma 3 gives a uniform estimate of the deviations between cluster means and midpoints for those clusters that are long enough. The main difficulty to be overcome in the proof of Theorem 2 is showing that all the cluster intervals are long enough.

Lemma 3: Let $x_1, x_2, \ldots, x_N$ be a random sample from the uniform density on $[0,1]$. Denote the length of an interval $I$ by $s_I$. Then

$$\sup_I |n_I - N s_I| = O_p(N^{1/2}),$$

where $n_I$ is the number of observations in $I$ and the sup is taken over all open subintervals of $[0,1]$.

Lemma 4: Let $x_1, x_2, \ldots, x_N$ be a random sample from the uniform distribution on $[0,1]$. Then there exists a constant $C_0$ such that

$$\Pr\{ \sup_I s_I^{-1/2}\, |\bar{x}_I - \mu_I| < C_0 \alpha_N^{1/2} \} = 1 - o(1),$$

where $\bar{x}_I$ = mean of observations in $I$, $s_I$ = length of $I$, $\mu_I$ = midpoint of $I$, and the sup is taken over all open intervals $I$ (whose boundary points are order statistics) containing at least $N\alpha_N^{1/3}/16$ observations.

Proof: For any $N > N_0$, consider an interval $I$ of the form $(x_{(m)}, x_{(m+n_I+1)})$, where $n_I > N\alpha_N^{1/3}/16$. Using Lemma 2 (first conditioning on the two order statistics and then integrating out), we obtain

$$\Pr\{ s_I^{-1} n_I^{1/2}\, |\bar{x}_I - \mu_I| > C (\log N)^{1/2} \} < D N^{-2} (\log N)^{-1/2}.$$

Now, from Lemma 3, $|n_I - N s_I| = O_p(N^{1/2})$; therefore, $n_I \ge N s_I / 2$ with probability tending to one. Since the number of possible intervals $I$ is bounded by $N^2$, we have

$$\Pr\{ \sup_I s_I^{-1/2}\, |\bar{x}_I - \mu_I| < C_0 \alpha_N^{1/2} \} > 1 - D (\log N)^{-1/2},$$

where $C_0 = \sqrt{2}\, C$ and the sup is taken over all intervals of the form $(x_{(m)}, x_{(m+n_I+1)})$ with $n_I > N\alpha_N^{1/3}/16$. The lemma follows.

Theorem 2: Let $x_1, x_2, \ldots, x_N$ be a random sample from the uniform distribution on $[0,1]$. Let $e_j^{(N)}$ $(j = 1, 2, \ldots, k_N)$ be the length of the $j$th cluster of a locally optimal $k_N$-partition. Let $\alpha_N = \log N / N$. Then, provided that $k_N$ increases with $N$ in such a way that $k_N \alpha_N^{1/3} \to 0$ as $N \to \infty$, we have

$$\max_{1 \le j \le k_N} |k_N e_j^{(N)} - 1| = o_p(1).$$

Proof: Consider a locally optimal $k_N$-partition with $k_N = o(\alpha_N^{-1/3})$. The proof is in three parts. In part I, it is shown that if a cluster is of length $\ge 1/(2k_N)$, then both it and its neighbouring clusters contain at least $N\alpha_N^{1/3}/16$ observations. In part II, using the result of Lemma 4, the relationship between the lengths of neighbouring clusters is established; a bound of their ratio is given by $1 + k_N^{-1} o_p(1)$. Since the length of the largest cluster is $\ge 1/k_N$, applying parts I and II repeatedly gives the result of this theorem. To avoid wordiness, statements are to be read as if they included the qualification: "with probability tending to one as $N$ approaches infinity".

[I] Suppose that the $j$th cluster is of length $e_j \ge 1/(2k_N)$. By Lemma 3, it contains at least $N/(2k_N) - O_p(N^{1/2})$ observations. Thus, the number of observations in the $j$th cluster exceeds $N/(4k_N)$. But $k_N = o(\alpha_N^{-1/3})$, therefore this number $> N\alpha_N^{1/3}/4$. Using Lemma 4, we have

$$|\bar{x}_j - \mu_j| < C_0\, \alpha_N^{1/2} e_j^{1/2}, \tag{4.1}$$

where $\bar{x}_j$ = mean of observations in the $j$th cluster, and $\mu_j$ = midpoint of the $j$th cluster.

Consider the $(j-1)$st cluster, a cluster adjacent to the $j$th cluster. Let $q$ be the largest observation in the $(j-1)$st cluster and $z$ be the smallest observation in the $j$th cluster. Then by local optimality, the midpoint between $\bar{x}_{j-1}$ and $\bar{x}_j$ must lie between $q$ and $z$. And

$$q - \bar{x}_{j-1} = \bar{x}_j - q - O_p(\alpha_N) = (\bar{x}_j - \mu_j) + \tfrac{1}{2} e_j - O_p(\alpha_N),$$

since the largest gap between successive order statistics is $O_p(\alpha_N)$. From (4.1), we obtain

$$e_{j-1} \ge q - \bar{x}_{j-1} \ge \tfrac{1}{2} e_j - C_0\, \alpha_N^{1/2} e_j^{1/2} - O_p(\alpha_N) > \tfrac{1}{4} e_j \ge 1/(8k_N).$$

Therefore, by Lemma 3, the $(j-1)$st interval contains at least $N/(8k_N) - O_p(N^{1/2}) > N\alpha_N^{1/3}/16$ observations eventually.

[II] Now, applying Lemma 4 to the $(j-1)$st cluster, we have

$$|\bar{x}_{j-1} - \mu_{j-1}| < C_0\, \alpha_N^{1/2} e_{j-1}^{1/2}. \tag{4.2}$$

Since $q - \bar{x}_{j-1} = \bar{x}_j - q - O_p(\alpha_N)$, we have

$$\mu_{j-1} + \tfrac{1}{2} e_{j-1} - \bar{x}_{j-1} = \bar{x}_j - \big( \mu_j - \tfrac{1}{2} e_j + O_p(\alpha_N) \big). \tag{4.3}$$

Combining (4.1), (4.2), and (4.3), we obtain

$$\tfrac{1}{2} |e_{j-1} - e_j| \le C_0\, \alpha_N^{1/2} \big( e_{j-1}^{1/2} + e_j^{1/2} \big) + O_p(\alpha_N). \tag{4.4}$$

Hence $e_{j-1} \le 2 e_j$. Therefore, since $e_j \ge 1/(2k_N) > \alpha_N^{1/3}/2$, (4.4) can be written as

$$|e_{j-1}/e_j - 1| \le 2 C_0\, \alpha_N^{1/2} e_j^{-1/2} + 2\sqrt{2}\, C_0\, \alpha_N^{1/2} e_j^{-1/2} + 2 e_j^{-1} O_p(\alpha_N) = k_N^{-1}\, o_p(1), \tag{4.5}$$

since $k_N \alpha_N^{1/3} = o(1)$; and this bound does not depend on the intervals involved.

[III] Let $e_{\max}$ and $e_{\min}$ be the length of the largest and the smallest cluster respectively. Then $e_{\max} \ge 1/k_N > 1/(2k_N)$. Thus, from (4.5), by carrying out at most $k_N$ comparisons of adjacent clusters, we obtain

$$e_{\max}/e_{\min} \le \big[ 1 + k_N^{-1} o_p(1) \big]^{k_N} = 1 + o_p(1). \tag{4.6}$$

But for each $j = 1, \ldots, k_N$, $e_{\max} \ge e_j \ge e_{\min}$. Therefore from (4.6), we have for all $1 \le j \le k_N$,

$$e_j\, (1 + o_p(1)) \ge e_{\max} \ge e_j .$$

Summing over $1 \le j \le k_N$, we obtain

$$\sum_{j} e_j\, (1 + o_p(1)) \ge k_N e_{\max} \ge \sum_{j} e_j . \tag{4.7}$$

Now, since the $e_j$'s cover the interval $[0,1]$ with at most $k_N$ overlaps of length $O_p(\alpha_N)$,

$$\sum_{j} e_j = 1 + k_N O_p(\alpha_N) = 1 + o_p(1).$$

Substituting in (4.7), we have $e_{\max} = k_N^{-1} (1 + o_p(1))$. Similarly, $e_{\min} = k_N^{-1} (1 + o_p(1))$, and the theorem is proved.

Next, we will show that the within cluster sums of squares of the clusters are asymptotically equal. First, Feller's theorem on large deviations is used to obtain a uniform estimate of the within cluster sum of squares, which is a function of the length of the cluster interval. Then using Theorem 2, the result (Theorem 3) follows.
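The heuristic behind this two-step argument can be sketched as follows. The sketch drops every error term (controlling them uniformly over all clusters is the actual work of the lemmas) and is only meant to show where the normalisation comes from; the symbols $n$ and $s$ for the size and length of a single cluster are used here purely for illustration.

```latex
\begin{align*}
% n observations, roughly uniform on a cluster interval of length s,
% each with variance s^2/12:
\mathrm{WSS} = \sum_{i=1}^{n} (x_i - \bar{x})^2
  &\approx n \cdot \frac{s^2}{12} \\
% an interval of length s holds about N s of the N uniform observations:
  &\approx \frac{N s^3}{12} \\
% Theorem 2: every cluster length is close to 1/k_N:
  &\approx \frac{N}{12\, k_N^{3}} ,
\end{align*}
% so that 12 N^{-1} k_N^3 WSS is close to 1 for every cluster.
```

Read this way, the within cluster sum of squares depends on the cluster only through its length, so asymptotically equal lengths lead to asymptotically equal sums of squares.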
(Let $x_1, x_2, \ldots, x_n$ be a set of observations. The within cluster sum of squares of this set of observations is defined to be $\sum_i (x_i - \bar{x})^2$, where $\bar{x}$ is the mean of the observations.)

Lemma 5: Let $z_1, z_2, \ldots$ be i.i.d. random variables uniformly distributed on $[a,b]$. Let $\mu = (a+b)/2$ and $h = (b-a)$. Put $\alpha_N = \log N / N$. Then there exist constants $C'$, $D'$ and $N_0'$ such that if $N \ge N_0'$ and $n > N\alpha_N^{1/3}/16$,

$$\Pr\Big\{ h^{-2} n^{-1/2}\, \Big| \sum_{i=1}^{n} (z_i - \bar{z})^2 - \tfrac{1}{12}\, n h^2 \Big| > C' (\log N)^{1/2} \Big\} < D' N^{-2} (\log N)^{-1/2}.$$

Proof: Now, $E[(z_i - \mu)^2 - \tfrac{1}{12} h^2] = 0$ and $\operatorname{Var}[(z_i - \mu)^2 - \tfrac{1}{12} h^2] = \tfrac{1}{180} h^4$. Thus by Lemma 1, since $(4 \log N)^{1/2}\, n^{-1/6} \to 0$ as $n \to \infty$, we have

$$\Pr\Big\{ \sqrt{180}\, h^{-2} n^{-1/2} \sum_{i=1}^{n} \big[ (z_i - \mu)^2 - \tfrac{1}{12} h^2 \big] > (4 \log N)^{1/2} \Big\} \Big/ \big[ (2\pi)^{-1/2} (4 \log N)^{-1/2} N^{-2} \big] \to 1 \quad \text{as } n \to \infty.$$

But $\sum (z_i - \mu)^2 = \sum (z_i - \bar{z})^2 + n (\bar{z} - \mu)^2$; therefore, as $n \to \infty$,

$$\Pr\Big\{ \sqrt{180}\, h^{-2} n^{-1/2} \sum_{i=1}^{n} \big[ (z_i - \bar{z})^2 - \tfrac{1}{12} h^2 \big] > (4 \log N)^{1/2} - \sqrt{180}\, h^{-2} n^{1/2}\, |\bar{z} - \mu|^2 \Big\} \Big/ \big[ (2\pi)^{-1/2} (4 \log N)^{-1/2} N^{-2} \big] \to 1.$$

By Lemma 2, the lemma follows.

Lemma 6: Let $x_1, x_2, \ldots, x_N$ be a random sample from the uniform distribution on $[0,1]$. Then there exists a constant $C_0'$ such that

$$\Pr\{ \sup_I s_I^{-5/2}\, | WSS_I - \tfrac{1}{12} N s_I^3 | < C_0' N \alpha_N^{1/2} \} = 1 - o(1),$$

where $WSS_I$ = within cluster sum of squares of the observations in $I$, $s_I$ = length of $I$, and the sup is taken over all open intervals $I$ (whose boundary points are order statistics) containing at least $N\alpha_N^{1/3}/16$ observations.

Proof: For any $N \ge N_0'$, consider an interval $I$ of the form $(x_{(m)}, x_{(m+n_I+1)})$, where $n_I > N\alpha_N^{1/3}/16$. Using Lemma 5 (first conditioning on the two order statistics and then integrating out), we obtain

$$\Pr\{ s_I^{-2} n_I^{-1/2}\, | WSS_I - \tfrac{1}{12} n_I s_I^2 | > C' (\log N)^{1/2} \} < D' N^{-2} (\log N)^{-1/2}.$$

Now, from Lemma 3, $|n_I - N s_I| = O_p(N^{1/2})$; therefore, $n_I \le 2 N s_I$ with probability tending to one, and

$$\Pr\{ s_I^{-5/2}\, | WSS_I - \tfrac{1}{12} N s_I^3 | > 2\sqrt{2}\, C' N^{1/2} (\log N)^{1/2} \} < D' N^{-2} (\log N)^{-1/2}.$$

Since the number of possible intervals $I$ is bounded by $N^2$, we have

$$\Pr\{ \sup_I s_I^{-5/2}\, | WSS_I - \tfrac{1}{12} N s_I^3 | < C_0' N \alpha_N^{1/2} \} > 1 - D' (\log N)^{-1/2},$$

where $C_0' = 2\sqrt{2}\, C'$ and the sup is taken over all intervals of the form $(x_{(m)}, x_{(m+n_I+1)})$ with $n_I > N\alpha_N^{1/3}/16$. The lemma follows.

Theorem 3: Let $x_1, x_2, \ldots, x_N$ be a random sample from the uniform distribution on $[0,1]$. Let $WSS_j^{(N)}$ $(j = 1, 2, \ldots, k_N)$ be the within cluster sum of squares of the $j$th cluster of a locally optimal $k_N$-partition. Let $\alpha_N = \log N / N$. Then provided that $k_N$ increases with $N$ in such a way that $k_N \alpha_N^{1/3} \to 0$ as $N \to \infty$, we have

$$\max_{1 \le j \le k_N} | 12 N^{-1} k_N^3\, WSS_j^{(N)} - 1 | = o_p(1).$$

Proof: Consider a locally optimal $k_N$-partition with $k_N = o(\alpha_N^{-1/3})$. It is shown in Theorem 2 that for all $N$ large enough, with probability tending to one,

1. the number of observations in each cluster $> N\alpha_N^{1/3}/16$, $j = 1, 2, \ldots, k_N$;
2. $e_j^{(N)} = k_N^{-1} (1 + o_p(1))$ uniformly in $1 \le j \le k_N$.

From (1), we can apply Lemma 6 to obtain

3. $| WSS_j^{(N)} - \tfrac{1}{12} N (e_j^{(N)})^3 | \le C_0' N \alpha_N^{1/2} (e_j^{(N)})^{5/2}$ for all $1 \le j \le k_N$.

Combining (2) and (3), we have uniformly in $1 \le j \le k_N$,

$$| WSS_j^{(N)} - \tfrac{1}{12} N k_N^{-3} (1 + o_p(1)) | \le C_0' N \alpha_N^{1/2} k_N^{-5/2} (1 + o_p(1)).$$

Therefore,

$$| 12 N^{-1} k_N^3\, WSS_j^{(N)} - 1 | \le o_p(1) + 12 C_0'\, \alpha_N^{1/2} k_N^{1/2} (1 + o_p(1)) = o_p(1)$$

uniformly in $1 \le j \le k_N$. And the theorem is proved.

(Remark: Since the global optimum is necessarily locally optimal, the results of Theorem 2 and Theorem 3 also apply to the globally optimal $k_N$-partition.)

4.2. The General Case

For samples from a general distribution $F$, the results analogous to Theorem 2 and Theorem 3 are given respectively in Theorem 4 and Theorem 5. The proofs proceed in the same way as before.

Lemma 7: Let $z_1, z_2, \ldots$ be i.i.d. random variables from some distribution with finite variance $\sigma^2$, and let $E(z_i) = \mu$. Put $\alpha_N = \log N / N$. Then there exist constants $C$, $D$, and $N_0$ such that if $N \ge N_0$ and $n > N\alpha_N^{1/3}/16$,

$$\Pr\{ \sigma^{-1} n^{-1/2}\, |z_1 + z_2 + \cdots + z_n - n\mu| > C (\log N)^{1/2} \} < D N^{-2} (\log N)^{-1/2}.$$

(The proof is similar to that of Lemma 2.)

Lemma 8: Let $x_1, x_2, \ldots, x_N$ be a random sample from a distribution $F'$ on $[a,b]$. Denote $\int_I dF'$ by $F'(I)$. Then

$$\sup_I |n_I - N F'(I)| = O_p(N^{1/2}),$$

where $n_I$ is the number of observations in $I$, and the sup is taken over all open subintervals $I$ of $[a,b]$.

(Like Lemma 3, this lemma is a direct consequence of Donsker's theorem for empirical processes.)

Let $F$ be a distribution on $[0,1]$ with the following properties:

1. the density $f$ and its first two derivatives are continuous on $[0,1]$;
2. $f(x) > 0$ for all $x \in [0,1]$.

Lemma 9: Let $x_1, x_2, \ldots, x_N$ be a random sample from $F$. Denote the inf of the density $f$ by $g$, and put $F(I) = \int_I dF$. Then there exists a constant $C_0$ such that

$$\Pr\{ \sup_I s_I^{-1/2}\, |\bar{x}_I - \mu_I| < C_0 \alpha_N^{1/2} \} = 1 - o(1),$$

where $\bar{x}_I$ = mean of observations in $I$, $\mu_I = \int_I x\,dF / F(I)$ = conditional mean of $F$ on $I$, $s_I$ = length of $I$, and the sup is taken over all open intervals $I$ (whose boundary points are order statistics) containing at least $N\alpha_N^{1/3} g/16$ observations.

Proof: For any $N > N_0$, consider an interval $I$ of the form $(x_{(m)}, x_{(m+n_I+1)})$, where $n_I > N\alpha_N^{1/3} g/16$. Using Lemma 7 (first conditioning on the two order statistics and then integrating out), we obtain

$$\Pr\{ \sigma_I^{-1} n_I^{1/2}\, |\bar{x}_I - \mu_I| > C (\log N)^{1/2} \} < D N^{-2} (\log N)^{-1/2}, \tag{4.8}$$

where $\sigma_I^2 = \int_I (x - \mu_I)^2\,dF / F(I)$ = conditional variance of $F$ on $I$. Now, by the Taylor series expansion of $f$,

$$f(x) = f(m_I) + (x - m_I) f^{(1)}(m_I) + \tfrac{1}{2} (x - m_I)^2 f^{(2)}(\xi_x),$$

where $m_I$ = midpoint of $I$ and $\xi_x$ is between $x$ and $m_I$, for $x$ in $I$. Therefore

$$F(I) = f(m_I)\, s_I\, [1 + O(s_I^2)] \quad \text{and} \quad \sigma_I^2 = \tfrac{1}{12} s_I^2\, [1 + O(s_I^2)].$$

(Note that the universal constant in the $O(s_I^2)$ term depends on the bound of the second derivative of $f$.) And hence (4.8) can be written as

$$\Pr\{ \sqrt{12}\, s_I^{-1} [1 + O(s_I^2)]\, n_I^{1/2}\, |\bar{x}_I - \mu_I| > C (\log N)^{1/2} \} \le D N^{-2} (\log N)^{-1/2}.$$
Since the number of possible intervals $I$ is bounded by $N^2$, we have

$$\Pr\{ \sup_I \sqrt{12}\, s_I^{-1} [1 + O(s_I^2)]\, n_I^{1/2}\, |\bar{x}_I - \mu_I| \le C (\log N)^{1/2} \} \ge 1 - D (\log N)^{-1/2}.$$

Now, from Lemma 8, we have uniformly in $I$, $n_I \ge \tfrac{1}{2} N F(I) \ge \tfrac{1}{2} N g s_I$ with probability tending to one. Therefore,

$$\Pr\{ \sup_I s_I^{-1/2}\, |\bar{x}_I - \mu_I| < C_0 \alpha_N^{1/2} \} = 1 - o(1),$$

where $C_0$ is a constant depending only on $C$ and $g$, and the sup is taken over all intervals of the form $(x_{(m)}, x_{(m+n_I+1)})$ with $n_I > N\alpha_N^{1/3} g/16$.

Theorem 4: Let $x_1, x_2, \ldots, x_N$ be a random sample from $F$. Let $e_j^{(N)}$ $(j = 1, 2, \ldots, k_N)$ be the length of the $j$th cluster of a locally optimal $k_N$-partition. Then, provided that $k_N = o(\alpha_N^{-1/3})$, we have

$$\max_{1 \le j \le k_N} \Big| k_N e_j^{(N)} f_j^{1/3} - \int_0^1 f(x)^{1/3}\,dx \Big| = o_p(1),$$

where $f_j$ is the density at the midpoint of the $j$th cluster.

Proof: Consider a locally optimal $k_N$-partition with $k_N = o(\alpha_N^{-1/3})$. Denote the open interval (whose boundary points are order statistics) containing the $j$th cluster by $I_j$, and let its midpoint be $m_j$. Then, as before, we have

$$\mu_j = \int_{I_j} x\,dF / F(I_j) = m_j + \frac{1}{12}\, \frac{f^{(1)}(m_j)}{f(m_j)}\, e_j^2 + O(e_j^3). \tag{4.9}$$

Again, to avoid wordiness, the statements are to be read as if they included the qualification: "with probability tending to one as $N$ approaches infinity".

[I] Suppose that $2 (h/g)^{1/3} / k_N \ge e_j \ge 1/(2k_N)$, where $h = \sup f(x)$. Then, $F(I_j) \ge g/(2k_N)$. By Lemma 8, the $j$th cluster contains at least $N g/(2k_N) - O_p(N^{1/2})$ observations. Thus, the number of observations in the $j$th cluster exceeds $N g/(4k_N)$ eventually. Since $k_N = o(\alpha_N^{-1/3})$, this number $> N\alpha_N^{1/3} g/4$. Applying Lemma 9, we have

$$|\bar{x}_j - \mu_j| \le C_0\, \alpha_N^{1/2} e_j^{1/2}, \tag{4.10}$$

where $\bar{x}_j$ = mean of observations in the $j$th cluster.

Consider the $(j-1)$st cluster, a cluster adjacent to the $j$th cluster. Using the argument given in part I of the proof of Theorem 2, it can be shown that the $(j-1)$st cluster contains at least

$$N\alpha_N^{1/3} g/16 \ \text{observations}, \tag{4.11}$$

and

$$\big( m_{j-1} + \tfrac{1}{2} e_{j-1} - O_p(\alpha_N) \big) - \bar{x}_{j-1} = \bar{x}_j - \big( m_j - \tfrac{1}{2} e_j + O_p(\alpha_N) \big). \tag{4.12}$$

[II] From (4.9), (4.12) can be written as

$$(\mu_{j-1} - \bar{x}_{j-1}) + \tfrac{1}{2} e_{j-1} - \frac{1}{12}\, \frac{f^{(1)}(m_{j-1})}{f(m_{j-1})}\, e_{j-1}^2 = (\bar{x}_j - \mu_j) + \tfrac{1}{2} e_j + \frac{1}{12}\, \frac{f^{(1)}(m_j)}{f(m_j)}\, e_j^2 + O(e_j^3) + 2 O_p(\alpha_N). \tag{4.13}$$

Let $f^*$ be the density at the midpoint between $\bar{x}_{j-1}$ and $\bar{x}_j$. Then

$$e_j \Big[ 1 + \frac{1}{6}\, \frac{f^{(1)}(m_j)}{f(m_j)}\, e_j + O(e_j^2) \Big] = e_j \big[ f^*/f(m_j) \big]^{-1/3} \big[ 1 + O(e_j^2) \big],$$

since $f^* = f(m_j) - \tfrac{1}{2} f^{(1)}(m_j) e_j + O(e_j^2) + O_p(\alpha_N)$, by the expansion of $f$ about $m_j$. Similarly, by the expansion of $f$ about $m_{j-1}$,

$$e_{j-1} \Big[ 1 - \frac{1}{6}\, \frac{f^{(1)}(m_{j-1})}{f(m_{j-1})}\, e_{j-1} + O(e_{j-1}^2) \Big] = e_{j-1} \big[ f^*/f(m_{j-1}) \big]^{-1/3} \big[ 1 + O(e_{j-1}^2) \big].$$

Thus (4.13) becomes

$$(\mu_{j-1} - \bar{x}_{j-1}) + \tfrac{1}{2} e_{j-1} \big[ f^*/f(m_{j-1}) \big]^{-1/3} \big[ 1 + O(e_{j-1}^2) \big] = (\bar{x}_j - \mu_j) + \tfrac{1}{2} e_j \big[ f^*/f(m_j) \big]^{-1/3} \big[ 1 + O(e_j^2) \big] + 2 O_p(\alpha_N). \tag{4.14}$$

But from (4.11), we can apply Lemma 9 to the $(j-1)$st cluster to give

$$|\bar{x}_{j-1} - \mu_{j-1}| \le C_0\, \alpha_N^{1/2} e_{j-1}^{1/2}. \tag{4.15}$$

Therefore, combining (4.10), (4.14) and (4.15), we can first show that the ratio $e_{j-1}/e_j$ is bounded, and then that, for constants $K_1$ and $K_2$ depending only on $C_0$, $g$ and $h$,

$$\Big| \frac{e_{j-1}}{e_j} \Big[ \frac{f(m_{j-1})}{f(m_j)} \Big]^{1/3} - 1 \Big| \le K_1\, \alpha_N^{1/2} e_j^{-1/2} + 2 e_j^{-1} O_p(\alpha_N) + O(e_j^2) \le k_N^{-1} \big[ K_2\, k_N^{3/2} \alpha_N^{1/2} + k_N^2 O_p(\alpha_N) + O(k_N^{-1}) \big] = k_N^{-1}\, o_p(1); \tag{4.16}$$

and this bound does not depend on the intervals involved.

[III] From the first inequality in (4.16), we can show by contradiction that at least one of the cluster intervals satisfies $2 (h/g)^{1/3}/k_N \ge e_j \ge 1/(2k_N)$. Then using the bound in (4.16) and carrying out at most $k_N$ comparisons of adjacent clusters, we obtain

$$\frac{e_i}{e_j} \Big[ \frac{f(m_i)}{f(m_j)} \Big]^{1/3} = \big[ 1 + k_N^{-1} o_p(1) \big]^{k_N} = 1 + o_p(1)$$

uniformly in $1 \le i, j \le k_N$. Since $\sum_{j=1}^{k_N} e_j f(m_j)^{1/3} \to \int_0^1 f(x)^{1/3}\,dx$ as $N \to \infty$, the theorem follows.

Next, we will assume that $F$ is four times differentiable and show that the within cluster sums of squares are asymptotically equal.

Lemma 10: Let $z_1, z_2, \ldots$ be i.i.d. random variables from some distribution with finite fourth moment, and let $E(z_i) = \mu$, $\operatorname{var}(z_i) = \sigma^2$ and $\operatorname{var}[(z_i - \mu)^2] = \gamma$. Put $\alpha_N = \log N / N$. Then there exist constants $C'$, $D'$ and $N_0'$ such that if $N \ge N_0'$ and $n > N\alpha_N^{1/3}/16$,

$$\Pr\Big\{ \gamma^{-1/2} n^{-1/2}\, \Big| \sum_{i=1}^{n} (z_i - \bar{z})^2 - n\sigma^2 \Big| > C' (\log N)^{1/2} \Big\} \le D' N^{-2} (\log N)^{-1/2}.$$

(The proof is similar to that of Lemma 5.)

Lemma 11: Let $x_1, x_2, \ldots, x_N$ be a random sample from $F$. Then there exists a constant $C_0'$ such that

$$\Pr\{ \sup_I s_I^{-5/2}\, | WSS_I - \tfrac{1}{12} N f(m_I) s_I^3 [1 + O(s_I^2)] | < C_0' N \alpha_N^{1/2} \} = 1 - o(1),$$

where $s_I$ = length of $I$, $m_I$ = midpoint of $I$, $WSS_I$ = within cluster sum of squares of the observations in $I$, and the sup is taken over all open intervals $I$ (whose boundary points are order statistics) containing at least $N\alpha_N^{1/3} g/16$ observations.

Proof: For any $N \ge N_0'$, consider an interval $I$ of the form $(x_{(m)}, x_{(m+n_I+1)})$, where $n_I > N\alpha_N^{1/3} g/16$. Using Lemma 10 (first conditioning on the two order statistics and then integrating out), we obtain

$$\Pr\{ \gamma_I^{-1/2} n_I^{-1/2}\, | WSS_I - n_I \sigma_I^2 | > C' (\log N)^{1/2} \} \le D' N^{-2} (\log N)^{-1/2}, \tag{4.17}$$

where $\mu_I = \int_I x\,dF / F(I)$, $\sigma_I^2 = \int_I (x - \mu_I)^2\,dF / F(I)$, and $\gamma_I = \int_I (x - \mu_I)^4\,dF / F(I) - \sigma_I^4$. Now by the Taylor series expansion of $f$,

$$f(x) = f(m_I) + (x - m_I) f^{(1)}(m_I) + \tfrac{1}{2} (x - m_I)^2 f^{(2)}(m_I) + \tfrac{1}{6} (x - m_I)^3 f^{(3)}(m_I) + \tfrac{1}{24} (x - m_I)^4 f^{(4)}(\xi_x),$$

where $\xi_x$ is between $x$ and $m_I$. Therefore,

$$F(I) = f(m_I)\, s_I\, [1 + O(s_I^2)], \quad \sigma_I^2 = \tfrac{1}{12} s_I^2\, [1 + O(s_I^2)], \quad \text{and} \quad \gamma_I = \tfrac{1}{180} s_I^4\, [1 + O(s_I^2)]. \tag{4.18}$$

(Note that the universal constant in the $O(s_I^2)$ term depends on the bounds of the derivatives of $f$.) And hence (4.17) can be written as

$$\Pr\{ \sqrt{180}\, s_I^{-2} [1 + O(s_I^2)]\, n_I^{-1/2}\, | WSS_I - \tfrac{1}{12} n_I s_I^2 [1 + O(s_I^2)] | > C' (\log N)^{1/2} \} \le D' N^{-2} (\log N)^{-1/2}.$$

Since the number of possible intervals $I$ is bounded by $N^2$, we have

$$\Pr\{ \sup_I \sqrt{180}\, s_I^{-2} [1 + O(s_I^2)]\, n_I^{-1/2}\, | WSS_I - \tfrac{1}{12} n_I s_I^2 [1 + O(s_I^2)] | \le C' (\log N)^{1/2} \} = 1 - o(1).$$

Now, from Lemma 8 and (4.18),

$$n_I = N F(I) + O_p(N^{1/2}) = N f(m_I)\, s_I\, [1 + O(s_I^2)] + O_p(N^{1/2}).$$

Therefore, $n_I \le 2 N h s_I$ with probability tending to one ($h = \sup f$), and

$$\Pr\{ \sup_I s_I^{-5/2}\, | WSS_I - \tfrac{1}{12} N f(m_I) s_I^3 [1 + O(s_I^2)] | < C_0' N \alpha_N^{1/2} \} = 1 - o(1),$$

where $C_0'$ is a constant depending only on $C'$ and $h$, and the sup is taken over all intervals of the form $(x_{(m)}, x_{(m+n_I+1)})$ with $n_I > N\alpha_N^{1/3} g/16$.

Theorem 5: Let $x_1, x_2, \ldots, x_N$ be a random sample from $F$.
Let $WSS_j^{(N)}$ $(j = 1, 2, \ldots, k_N)$ be the within cluster sum of squares of the $j$th cluster of a locally optimal $k_N$-partition. Then, provided that $k_N = o(\alpha_N^{-1/3})$, we have

$$\max_{1 \le j \le k_N} \Big| 12 N^{-1} k_N^3\, WSS_j^{(N)} - \Big( \int_0^1 f(x)^{1/3}\,dx \Big)^3 \Big| = o_p(1).$$

Proof: Consider a locally optimal $k_N$-partition with $k_N = o(\alpha_N^{-1/3})$. It is shown in Theorem 4 that, for all $N$ large enough, with probability tending to one,

1. the number of observations in each cluster $> N\alpha_N^{1/3} g/16$, and
2. $e_j^{(N)} = k_N^{-1} f(m_j)^{-1/3}\, G\, [1 + o_p(1)]$, where $G = \int_0^1 f(x)^{1/3}\,dx$.

From (1), we can apply Lemma 11 to obtain

$$e_j^{-5/2}\, | WSS_j^{(N)} - \tfrac{1}{12} N f(m_j) e_j^3 [1 + O(e_j^2)] | \le C_0' N \alpha_N^{1/2}$$

uniformly in $1 \le j \le k_N$. From (2), we have uniformly in $1 \le j \le k_N$,

$$| WSS_j^{(N)} - \tfrac{1}{12} N k_N^{-3} G^3 [1 + o_p(1)] | \le 2 C_0' N \alpha_N^{1/2} k_N^{-5/2} g^{-5/6} G^{5/2}.$$

Therefore,

$$| 12 N^{-1} k_N^3\, WSS_j^{(N)} - G^3 | \le o_p(1) + C^* k_N^{1/2} \alpha_N^{1/2} \quad (\text{where } C^* = 24\, C_0'\, g^{-5/6} G^{5/2})$$
$$= o_p(1).$$

(Remark: As before, since the global optimum is necessarily locally optimal, the results of Theorem 4 and Theorem 5 also apply to the globally optimal $k_N$-partition. Moreover, the generalization to densities with finite support $[a,b]$ is immediate.)

5. WEAK UNIFORM CONSISTENCY OF THE HISTOGRAM ESTIMATES

In this section, we will investigate the asymptotic properties of the k-means procedure as a density estimation technique.

Let $x_1, x_2, \ldots, x_N$ be a random sample from some population $F$ on $[a,b]$. Suppose that the density $f$ is four times differentiable and is strictly positive on $[a,b]$. Consider a locally optimal $k_N$-partition. Let $a = y_0^{(N)} < y_1^{(N)} < \cdots < y_{k_N}^{(N)} = b$ be the cutpoints of the partition; the cutpoint between two clusters is defined to be the midpoint between the cluster means. Denote $\int_a^b f(x)^{1/3}\,dx$ by $G$, and let $f_j$ be the density at the midpoint of the $j$th cluster. Then from Theorem 4, Theorem 5, and Lemma 8, respectively, we have uniformly in $1 \le j \le k_N$,

$$e_j = G k_N^{-1} f_j^{-1/3} (1 + o_p(1)), \tag{5.1}$$

$$WSS_j = \tfrac{1}{12} G^3 N k_N^{-3} (1 + o_p(1)), \tag{5.2}$$

$$n_j = N f_j e_j (1 + O(e_j^2)) + O_p(N^{1/2}). \tag{5.3}$$

Therefore, substituting (5.1) in (5.3) gives

$$n_j = G N k_N^{-1} f_j^{2/3} (1 + o_p(1)). \tag{5.4}$$

Define the density estimate (Estimate I) at a point $x$ by

$$f_N(x) = n_j^{3/2} \big/ \big[ N (12\, WSS_j)^{1/2} \big], \quad y_{j-1} \le x < y_j, \quad 1 \le j \le k_N .$$

Then, from (5.2) and (5.4),

$$f_N(x) = \frac{G^{3/2} N^{3/2} k_N^{-3/2} f_j\, (1 + o_p(1))}{G^{3/2} N^{3/2} k_N^{-3/2} (1 + o_p(1))} = f_j\, (1 + o_p(1))$$

uniformly in $1 \le j \le k_N$. Since $f$ is uniformly continuous, we have shown the following.

Theorem 6:

$$\sup_{a \le x \le b} |f_N(x) - f(x)| = o_p(1).$$

Moreover, from (5.4), $n_{j-1} + n_j = 2 G N k_N^{-1} f_j^{2/3} (1 + o_p(1))$, and we can obtain from (5.1),

$$\bar{x}_j - \bar{x}_{j-1} = \tfrac{1}{2} (e_{j-1} + e_j)(1 + o_p(1)) = e_j (1 + o_p(1)). \tag{5.5}$$

Define the pooled density estimate (Estimate II) at a point $x$ by

$$f_N^*(x) = \begin{cases} (n_{j-1} + n_j)^{3/2} \big/ \big[ N (12\, WSS_j^*)^{1/2} \big], & \bar{x}_{j-1} \le x < \bar{x}_j \quad (2 \le j \le k_N), \\ (n_1 + n_2)^{3/2} \big/ \big[ N (12\, WSS_2^*)^{1/2} \big], & a \le x < \bar{x}_1, \\ (n_{k_N - 1} + n_{k_N})^{3/2} \big/ \big[ N (12\, WSS_{k_N}^*)^{1/2} \big], & \bar{x}_{k_N} \le x \le b, \end{cases}$$

where $WSS_j^*$ is the within cluster sum of squares of the $(j-1)$st and $j$th clusters pooled together. Then, from (5.1), (5.2), (5.4) and (5.5), we have

$$WSS_j^* = WSS_{j-1} + WSS_j + \frac{n_{j-1} n_j}{n_{j-1} + n_j} (\bar{x}_j - \bar{x}_{j-1})^2 \quad \text{(by definition)}$$
$$= \tfrac{1}{12} \big[ 2 G^3 N k_N^{-3} (1 + o_p(1)) + 6 G^3 N k_N^{-3} (1 + o_p(1)) \big] = \tfrac{1}{12} \big[ 8 G^3 N k_N^{-3} (1 + o_p(1)) \big].$$

And hence

$$f_N^*(x) = f_j\, (1 + o_p(1))$$

uniformly in $1 \le j \le k_N$. From the uniform continuity of $f$, we have

Corollary 7:

$$\sup_{a \le x \le b} |f_N^*(x) - f(x)| = o_p(1).$$

6. CONCLUDING REMARKS

In constructing a histogram to estimate an unknown density function which vanishes outside the finite interval $[a,b]$, the results in Section 4 indicate that the k-means procedure would partition $[a,b]$ in such a way that the sizes of the intervals are adaptive to the underlying density; the intervals are large where the density is low while the intervals are small where the density is high. Thus k-means can be regarded as a useful tool for generating variable cell histograms; indeed, the two estimates given in Section 5 are shown to be uniformly consistent in probability. However, other
However, other large sample properties of the estimates, such as the mean squared error and the rate of convergence, have not been considered.

A major difficulty of the usual histogram is that when multivariate histograms are constructed by partitioning the sampled space into cells of equal size, there are too many cells with very few observations. One desirable feature of the k-means procedure is that it provides a practicable and convenient way of obtaining a k-partition of the multivariate data or, equivalently, of the multidimensional sampled space. Consequently, histogram estimates of the density over these k cells or regions can easily be obtained. Unfortunately, the proofs of the theorems for the univariate case cannot be easily generalized to the multivariate case. Much work has yet to be done to investigate the asymptotic properties of k-means partitions of samples from distributions in two or more dimensions.

Finally, some results of an empirical study of the density estimates proposed in Section 5 are reported in Hartigan and Wong (1979b). In general, the numerical results obtained in the study provide an empirical validation of the asymptotic properties derived here; for details, see Wong (1979). Two examples showing the performance of Estimate II are given in Figure A and Figure B; it should be pointed out that this pooled estimate consistently outperforms Estimate I in the empirical study.

[Figure A and Figure B: examples of the performance of Estimate II. Figures not recoverable from the source.]

REFERENCES

Billingsley, P. (1968), Convergence of Probability Measures, New York: John Wiley & Sons.

Feller, W. (1971), An Introduction to Probability Theory and Its Applications, Vol. II, New York: John Wiley & Sons, 549-553.

Hartigan, J.A. (1975), Clustering Algorithms, New York: John Wiley & Sons.

Hartigan, J.A. (1978), "Asymptotic distributions for clustering criteria", Annals of Statistics, 6, 117-131.

Hartigan, J.A., and Wong, M.A. (1979a), "Algorithm AS136: A K-means Clustering Algorithm", Applied Statistics, 28, 100-108.

Hartigan, J.A., and Wong, M.A. (1979b), "Hybrid Clustering", Proceedings of the 12th Interface Symposium on Computer Science and Statistics, ed. Jane Gentleman, University of Waterloo, 137-143.

Loftsgaarden, D.O., and Quensenberry, C.P. (1965), "A nonparametric estimate of a multivariate density function", Annals of Mathematical Statistics, 36, 1049-1051.

MacQueen, J.B. (1967), "Some methods for classification and analysis of multivariate observations", Proceedings of the Fifth Berkeley Symposium, 281-297.

Pollard, D. (1979), "Strong Consistency of k-means clustering", unpublished manuscript, Department of Statistics, Yale University.

Tapia, R.A., and Thompson, J.R. (1978), Nonparametric Probability Density Estimation, Baltimore: The Johns Hopkins University Press.

Van Ryzin, J. (1973), "A Histogram Method of Density Estimation", Communications in Statistics, 2, 493-506.

Wong, M.A. (1979), "Hybrid Clustering", unpublished Ph.D. thesis, Department of Statistics, Yale University.
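
As a rough numerical companion to Estimate I of Section 5, the following Python sketch computes $f_N(x) = n_j^{3/2} / [N (12\, WSS_j)^{1/2}]$ on a one-dimensional sample. It rests on assumptions that are not in the paper: the partition is found by plain Lloyd iterations, used here as a stand-in for the Hartigan and Wong (1979a) algorithm, so its fixed point need not be locally optimal in the sense used above; the quantile-based starting means are an arbitrary choice; and the function names `lloyd_1d` and `estimate_one` are hypothetical.

```python
import random
from bisect import bisect_right


def lloyd_1d(xs, k, iters=200):
    """Plain Lloyd iterations on 1-D data (a stand-in, not Hartigan-Wong)."""
    xs = sorted(xs)
    n = len(xs)
    # Assumed starting rule: k quantile-spaced observations as initial means.
    means = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        # Cutpoints are midpoints between neighbouring cluster means.
        cuts = [(means[j] + means[j + 1]) / 2 for j in range(k - 1)]
        clusters = [[] for _ in range(k)]
        j = 0
        for x in xs:  # xs is sorted, so the cluster index never decreases
            while j < k - 1 and x >= cuts[j]:
                j += 1
            clusters[j].append(x)
        if any(len(c) == 0 for c in clusters):
            raise ValueError("empty cluster; try a smaller k")
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:  # assignments are stable
            break
        means = new_means
    return clusters, cuts


def estimate_one(clusters, cuts, n_total):
    """Return x -> n_j^(3/2) / (N * (12 * WSS_j)^(1/2)) for the cell containing x."""
    heights = []
    for c in clusters:
        mean = sum(c) / len(c)
        wss = sum((x - mean) ** 2 for x in c)
        if wss == 0:
            raise ValueError("cluster with zero within sum of squares")
        heights.append(len(c) ** 1.5 / (n_total * (12 * wss) ** 0.5))

    def f_hat(x):
        # bisect_right counts the cutpoints <= x, which is the cell index.
        return heights[bisect_right(cuts, x)]

    return f_hat


if __name__ == "__main__":
    random.seed(0)
    sample = [random.random() for _ in range(5000)]  # uniform on [0, 1]
    clusters, cuts = lloyd_1d(sample, k=10)
    f_hat = estimate_one(clusters, cuts, len(sample))
    print([round(f_hat(x / 10 + 0.05), 2) for x in range(10)])
```

For a uniform sample the true density is 1, so the printed heights should lie near 1. Estimate II could be sketched the same way by pooling adjacent clusters before computing the counts and sums of squares.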