BASEMENT $ HD28 .M414 OeNNeV ALFRED P. WORKING PAPER SLOAN SCHOOL OF MANAGEMENT ASYMPTOTIC PROPERTIES OF UNIVARIATE SAMPLE K-MEANS CLUSTERS Anthony Wong Massachusetts Institute of Technology M. Working Paper #1341-82 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETTS 02139 ASYMPTOTIC PROPERTIES OF UNIVARIATE SAMPLE K-MEANS CLUSTERS M. Anthony Wong Massachusetts Institute of Technology Working Paper #1341-82 Key words and phrases: k-means clusters, wi thin-clusters sum of squares, cluster lenghths, non-standard asymptotics. AMS 1980 subject classification: Primary 62H30. ABSTRACT A random sample of size N is divided into k clusters that mini- mize the within clusters sum of squares locally. Some large sample properties of this k-means clustering method (as k N) are obtained. In one dimension, approaches °° with it is established that the sample k-means clusters are such that the wi thin-cluster sums of squares are asymptotically equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. The difficulty involved in generalizing the results to the multivariate case is mentioned. 0745^7^ INTRODUCTION 1. Let the univariate observations a distribution x-,, with density function F tions are partitioned into X2, f. . . , x^, be sampled from Suppose that these observa- groups with means k . z,, ..., z. no movement of an observation from one group to another will such that reduce the within groups sum of squares WSS., = ^ min Z x. i=l l<j<k " This method for division of in groups sum of squares k-means. as - z . • I J ^ a ' sample into k groups to minimize the with- locally is known in the clustering literature In one dimension, the partition will be specified by k-1 cutpoints; the observations lying between common outpoints are in the same group. See Hartigan (1975) for a detailed description of the k-means method, and see Hartigan and Wong (1979) for an efficient computational algorithm. This method has been widely used in various clustering applications (see Blashfield and Aldenderfer, 1978). properties (when N ^ for fixed <=°) Its asymptotic have been studied by MacQueen (1967), k Hartigan (1978), and Pollard (1979). Here, the sampling properties of k-means clusters when °° k approaches with N are presented. The properties of the univariate population k-means clusters when k -> CO are given in Wong (1982a). It is shown that for large k, the optimal population partition is such that the wi thin-cluster sums of squares are equal, and that the sizes of the cluster intervals are inversely proportional to the one-third power of the underlying density at the midpoints of the intervals. In -1- this paper, non-standard asymptotics k ^ °° k-means clusters for samples from a general popula- are used to obtain the asymptotic properties (when the locally optimal tion on F [0,1]; in particular, with N) of it is shown that the locally optimal partition approaches the population optimal partition under certain re- gularity conditions. are: (1) The special difficulties in showing this result the number of clusters approaches k the length of each cluster interval °° with (such that H approaches zero while the number of observations in each cluster approaches infinity), and (2) the locations and sizes of the cluster intervals are determined by an optimization pro- cedure (so all for all results concerning the clusters need to hold uniformly clusters) Theorem 1 . and Theorem 2, respectively, gives the asymptotic expres- sion for the lengths and the within-cluster sums of squares of the locally optimal k-means clusters. in Section 2, and Theorem 2 is The result of Theorem proved in Section 3. 1 is obtained Some concluding remarks are given in Section 4, in which the difficulties involved in generalizing these univariate results to many dimensions are also mentioned. -2- < ASYMPTOTIC LENGTHS OF THE LOCALLY OPTIMAL SAMPLE K-MEANS CLUSTERS 2. Let Xp, X-,, . . . be X|M , random sample from a density a positive and has four bounded derivatives in [0,1]. derivative of = b f(x). , p. at f Suppose that the and let Denote the = B q^J^P^ i is th and f(x) observations are grouped into N z,<Z2 <...< clusters with means k^i f^^^x), by x which f so that the within clusters z^ um of squares of this locally optimal k^,-parti tion cannot be decreased by moving any single observation from its present cluster to any other cluster. Denote the pth order statistic by X/ and let > be the n- i number of observations in the Then cluster. jth x t=0 x are the observations in the jth cluster, where . n 2 ( -^ , (Jp . . .' n,+l) ^ And the n|-,=0. ) t=0 ^ length and the midpoint e. [x J X/ . ( n^n) ^z of the jth cluster are defined to be m. J . , ^J-^ - ] n and 1/2 [x, ly, where x^^^ = t=0 and x^j^^^^ = . , ^ 1. Theorem There may be many locally optimal k-maans partitions. states that for any such partition, if max Ik^ e. f(m.)^/^ N J J V-J^k^ To show the result of Theorem consequence of a - 1, AU respective- ] (^l^ n^) t-0 n +1) i ) t=0 t=o + x .• ^ k^, = f{x)^^^ dx we need o([N/logN] | = 1/3 ), 1 then (1). o p a few lemmas. Lemma 1 is a theorem of large deviations given in Feller (1971). direct It in proving Lemma 2 which states useful is that if the are large n.'s enough, the cluster means are suitably close to the cluster midpoints. This result is then used in the proof of Theorem 1 to determine the relationship between the lengths of neighboring clusters. Lemma 1 : Let density Put ..., be a random sample from X|u J^^ fix) , and b = ^''^ , contained in an open subinterval E[z.-Uj] let and = E[(z.-Uj)^] oj Proof (or Zp of [0,1] by C, observations n z^j • • • and D, and not depending on N -1 as n 1/2 ^ ^0 n~ ^1 + . 1 n^N(log N/N)^/^b/16, ..+^n - as uJ n > -> we have °°, (41ogN)^/2}/(2ll)"^/^(41ogN)"^/2N"2-l f has bounded derivatives and Ot n 1 a distribution not depending on follows. contains at least k^, = o N(logN/N) and Uj (In the proof of Theorem 1, large enough and if of this 549), p. °=. Now since has Uj : since (41ogN) ; and z„> , From the theorem of large deviations given in Feller (1971, D with F a^. = N^N^ such that if I) Denote the f(x). I Then there exist constants and distribution a which is positive and has four bounded derivatives in [0,1]. f = B Xp, X-,, a. it will (or I), '-^ " I ^^t the lemma be shown that when N is ([N/logN] 1/3 ' 1/3 ' observations, and hence the result lemma can be applied.) b/16 ), I ' n ' then each of the k^, clusters the application of Lemma 1, In tions in an open subinterval mately NF(I), where = F(I) will be the number of observa- n I of [0,1]. /r dF. Therefore, More precisely, using Donsker's theorem for empirical processes (see Billingsley 1968, ^^P] "j where ny is - p. 141), we have (N^/2) = NF(I) is approxi- n (2.1) I the number of observations in over all open subintevals of used in the proof of Lemma I. and the sup is taken I, Both Lemma and Equation (2.1) will be 1 to give a uniform estimate of the deviations 2 between cluster means and midpoints for those clusters that are large enough. Lemma 2 : Let Xo, . . . , be x^, random sample from a whose density F = ^"P. 0<Xil f(x), b = n"""^, f(x), and Oi:Xi:l = F(I) Let dF. /. z. be the mean 1 I = xdF/F(I) be the of the observations in an open interval I, conditional mean of be the size of the interval Then there exists P^{ ^'f ^i~'^^^\~^i taken over all a - F on constant ^i\ ^ and I, ^0 Sj lit (logN/N)^''^} = statistics) containing at least /t I 1 - o(l), where the sup (whose boundary ooints are order N(logN/N) 1/3 ' b/16 observations. Proof: For any J N^N , I. such that C open intervals f Put positive and has four bounded derivatives in [0,1]. is B x-|, consider an interval -5- I of the form is . (X(p), x^p^^ where ^^p, Using Lemma n^ 2 (log N/N)^/\/16. N (first conditioning on the two order statistics and 1 then integrating out), we obtain p/aj^ Zj rij^/^l - Uj C(log N)^/^} > DN"^ (log N)"^/^ < I where 2 a. = fr (x - Ur) 2 dF/F(I) conditional variance of = Now, by the Taylor series expansion of f, 2(2) 1 + T"(x-mj) f (fir) of I and is between q^^ = F(I) ^^^ ^^'^ (q^^ x f(mj) Sj and ^ ^I ^ ^T2 aj = y^ S^ [1 P^ ^ { the midpoint is 2 • 'l 4 * 0(^1^' ^"^ term depends on the bound of the second f) It follows that (2.2) can be written as: vl2 Sj^ [1 ' + 0(s^)] (Note that the constant in the deri vati ve of f^ Therefore, m.. r f(mj • m. I. [l+0(s^)] 1 = on + (x - mr) fim-r) where ^' ^'^ ^ = f(x) F (2.2) + 0(s^)] nj^/2 I 5j - ujl :. CdogN)^/^} DN"^ (logN)"^/^ Since the number of possible intervals I is bounded by N 2 , we have 2 P, ^JPsjl ( [1.0 n^l/2 (s2)] -uj l-zj < JLc(logN)l/2j I /1 ^ D(1ogN)'^/^ - 1 = 1 - o(l). Now from (2.1), we have uniformly in I, Ht >-2NF(I) > j with Nb Sr probability tending to one. Therefore, {Sup 3jl/2 where = C„ form — 2 b -1/2 ^^1 ^ ^^ (logN/N)^/^} ^ 1 as N and the sup is taken over all intervals of the C, X(p^^ (x^p), _ |-^ p^ with ^^j) The result of Lemma n^ ^ N ( logN/N) ^^^ b/16. shows that if the k-means clusters are large 2 enough, the cluster means are suitably close to the cluster midpoints. When combined with some well-known properties of locally optimal k-means partitions, this result is useful in establishing the relationship between The main difficulty to be overcome the lengths of neighboring clusters. in the proof of Theorem 1 showing that all the clusters are large is enough. Theorem 1 Let : X-, Xp, , . whose density f Put f(x) B - Q^JJP^ . is be the length of the the sample. . , be X[u a random sample from distribution a F positive and has four bounded derivatives in [0,1]. and jth b = Q^^^^f(x), cluster of Then, provided that k^, -7- e^ ( j locally optimal a = let o ([N/logN] 1/3 ), = 1, 2, . . k^^^-parti we have . , k^) tion of max where m. I is n1/3 r, 1 .1 the midpoint of the n1/3 r, ,1 , ,s cluster. jth J Proof : Consider a locally optimal (whose boundary points Denote the open interval containing the kj^-parti tion with cluster by jth k^, ([N/logN] = 1/3 ). order statistics) ?i.rQ. Then, as before, we have I.. ^^^^(^•) = u. / . X dF / F(I .) ^ The proof is in three parts. of length bounded by ^ . part In -^^ 1 . nij neighboring clusters contain at least In part II, l/(2k^), N(logN/N) • ( e. I and is given by -rr-^) f(m.) ^ k^, 1 + k./ (1). p' N -^ II clusters satisfies 1/3 observations. b/16 a bound of the ratio Since at least one of qualification: 1/3 /k|^ ^ e- ^ applying parts l/(2k|^), By (2.1), the statements are to be read as if they included the "with probability tending to one as Suppose that vations. 2(B/b) repeatedly gives the result of this theorem. To avoid wordiness, [I] then both it and its J vj the (2.3) using the result of Lemma 2, the relationship between the lengths of neighboring clusters is established; —^ .] . 0(e]) it is shown that if a cluster is I, and 2{B/b)'^^^/k^ 4 2 . 2(B/b) jth ^'''^/k^ ^ e^ ^ l/(2k^). cluster contains at least N Then Nb/(2k|^) Therefore, the number of observations in the exceeds Nb/4k^, exceeds N(logN/N)^/^ b/4. eventually. Since -8- k^, = ([N/logN] approaches infinity' F(Ij) - jth 1/3 ), ^ b/(2k^). (N^^^) obser- cluster this number Applying Lemma - \~z. where the Ujl < Cq (logN/N)^/2 e}^^, (2.4) the mean of the observations in the jth cluster. is z- to the jth cluster, we have 2 cluster, (j-l)st a cluster adjacent to the jth cluster. (j-l)st cluster is largest observation in the x ._ smallest observation in the jth cluster is x . optimality, the midpoint local . and . X (^^ "t^ ^ t=0 . . n. + 4=0 i^_^ = z. , and z- 5. n + 1) '' .M = i. - X "t must lii t=0 ^ .Op ([logN/ND, ^_^ ^ since the largest gap between successive order statistics is It Then by 1) , (^'^ ej.i^M- between M \^' , {^z t=0 The and the , (t=0 Consider ([logN/N] follows from (2.3) that ^J-1 ^'j " "j^ ' And from (2.4) , ^ hj ^ Op([logN/N]) we obtain -j-1 ' i^j ^ |ej - Co [logN/N]^/2 ^ l/(8k^), since ^_l/2 e^. .> Therefore, it follows from (2.1) that the at least [II] Nb/(8k^) 0p(Nl/2^ Now, applying Lemma ^J-1 Since - M - I._^ - "j-ll = i. - ' 2 , N to the ^0 ^ g^ ([logN/Ni: l/(2kj^). (j-l)st [logN/N]^/^ b/16 (j-l)st contains observations eventually. cluster, we (1ogN/N)l/2^._^l/2. m, we have interval h ave (2.5] + [nij_i -^j.i + Op([logN/N])] z._^ - = ~z. - [m. - ^^ + Op( [logN/N] ) ] (2.6) From (2.3), (2.6) can be written as f^^^("ii 1 t^j-1 - 12 f^j - Let T2 • -fT^ 0(ej^)] = ej • ^j [fVf(mj)]"^/^ f* = f(mj) sion of about f - \ f^^^m^) - - 0(^? - i^j^ - \ 2 • - - 1 ^^J-l^ ' 2 ^J-1^ ° - ^J-1 ^J - 20p([logN/N]) ^ m. and + Op (logN/N)], e^ + 0(ej) + Op ( [logN/N] (2.7) m,-_-i- + O(ej^) [1 = Then by the expan- ) m.. Similarly, by the expansion of e,-itl 4 'U • be the density at the midpoint between f* since ? i) f(ni.;j • ^^(^YT^ f 0(ej.,)l about = ni-j_i' we have e..,[fVf(.,..,)]-^/^ ^ 0(e^.,) 1 . Op(logN/N)]. Therefore, (2.7) becomes - ("j.l \ Zj-i) + ^j \ ej.i [f*/f(mj_i)]"^^^[l [^'Vf(m.)]"^''^[l + 0(ej)] + 10- 2 + 0(ej_^)l = (i, 'J Op (logN/N). - uj + J (2.8) ' Combining (2.4), (2.5), and (2.8), we can first show the ratio is e. ,/e. bounded, and then - ,'j- [f(mj_^)/f(mj)]^/^ • - ^ li [fVf(nij)]^/^(logN/N)^/2ej-^/2 4 Cq ^J + 2ej"^ Op(logN/N) + < k~/ [4 /2 (e^) (2.9) CQ(B/b)^/\^2/2(logN/N)^/2 + 4k^ Op(logN/N) + 0(kj;;^)] ^n' °p(^) and this bound does not depend on the intervals involved. From the first inequality in (2.9), we can show by contradiction [Ill] that at least one of the :j e. i k|^ cluster intervals satisfies 2(B/b) 1/3 ' comparisons of adjacent clusters, we obtain (e^./e.) • uniformly in [f(m.)/f(mj)]^/^ < 1 = + [1 i, j < f (m ^^^ ^ /I f(x) ^ k'^Opd)] - 1 +Op(l) k|^. k^ Since z I /k^ Then using the bound in (2.9) and carrying on at most l/(2k^,). •J k^ e . J .) J ^^"^ ^ fol lows. • 11- dx as N ^ -, the theorem ASYMPTOTIC WITHIN-CLUSTER SUMS OF SQUARES OF THE LOCALLY OPTIMAL SAMPLE K-MEANS CLUSTERS 3. As in Section 2, let x •_ <...< x -, (t=0 "t ' observations in the jth cluster of of a sample from and let f, '^ locally optimal k-means partition a be the jth cluster mean. z- - (x Y, - , . Z-) 2 In this • WSS., First, defined to be is k^, clusters are asymptotically direct consequence of Feller's theorem on large devia- a tions (Lemma 3) The within- section, we will show that the wi thin-cluster sums of squares of the equal. n- (t=o "t) cluster sum of squares of the jth cluster, "i denote the j used in the proof of Lemma 4 to obtain is estimate of the within-cluster sum of squares, which is a uniform a function of the length of the cluster interval and the density at the midpoint of the interval. The result of Theorem 2 then follows from Theorem 1 and Lemma 4. Lemma 3: Xp Let density Put B ..., be Xju Q^,JJP^ and f(x) b = contained in an open interval E[z. - Uj] = 0, E[(z. there exist constants if N^ N* random sample from a a distribution F with which is positive and has four bounded derivatives in [0,1]. f = Xp, and P^ (Yr^^^n"^/^ - q^"^^ f(x). by I Uj)^]= Qj, C*, D*, z, , and Denote the z^, .,., E[(z. and N*, - z observations n and let , Uj)^] = not depending on -, ^ I < -. Then such that n^N(logN/N) ^^^b/16, 1 y (z. - ~z)^ - na? i > C*(logN)^/^} (Since the proof is similar to that of Lemma 12- 1, it will ^ D* N"^(logN)"^/^ not be given here, Lemma 4 : X,, x„, Let Xj^ be a random sample from mean of the observations in an open interval and be the size of the interval Sr constant P^ whose density C* (SUP WSSr I, cluster sum of squares of the observations in I, F positive and has four bounded derivatives in [0,1], is f ..., Then I. be the wi thinbe the midpoint of m^ I, Put there exists a such that Sj-5/>SSj - ^N f(mj)sj[l + 0(s^)]l ^ C* N(logN/N)^/^} 1-0(1), = where the sup taken over all open intervals is are order statistics) (whose boundary points I N(logN/N) containing at least 1/3 b/16 observations. The proof of this lemma is similar to that of Lemma 2; therefore, only the differences between the two proofs are outlined here. In this proof, the Taylor series expansion is carried out to the fourth order And after some series manipulations, we obtain terms. and al = /j(x - Uj)^dF/F(I) = ^ s^ Yj = /i(x - Uj)^dF/F(I) = y^ sj [1 [1 + 13- I (3.2) 0(s^)]. Applying (3.1) and (3.2) to the result of Lemma because the number of possible intervals (3.1) + 0(Sj)], is 3 will give this lemma again bounded by 2 N . Theorem 2 : Let X2J X-,, whose density Put B . f . is be a random sample from x^i , b= f(x). J^J^^ Let be the wi thin-cluster sum of squares of the optimal a distribution F positive and has four bounded derivatives in [0,1]. and Q^JJP^ f(x) = . VISS. = (j cluster of jth The provide that k|^-partition of the sample. 1, k,^ = 2, ..., k^) locally a o([N/logN] 1/3 we have max 12N"^k„^ WSS, - [A f(x)^/2dx)2 I oJl) Proof: Consider a locally optimal in Theorem It is shown tending to one, G = 1 /„ f(x) e."^/^ 1/3 WSS. and (ii) From dx. - that for all 1 tion with k^ = o([N/logN] 1/3 large enough, with probability N (i), ^Nf(m.) where k"^ f(m.)"^^^ G[l+o (1)], = e- we can apply Lemma 4 to obtain e^ [l+0(e^)] | ^ C* N(logN/N) ^/^, uniformly I in IWSSj - from It follows l^j^k^,. ^ Nk-2 G^ that, we have uniformly in (ii) [l+0p(l)]| ^ 2 where C** = - 2C* b G^ < I '° G Op(l) + C** k^^''^ (logN/N)^/2 ' . l<j<k^,. C* N(logN/N)^/2^j^-5/2^-5/6g5/2^ Therefore, 12N"^k^ WSSj ). the number of observations in each cluster exceeds (i) N(logN/N)^'^\/16, kj^,-parti = And the theorem is proved. •14- 0p(l), ), Since the globally optimal k-means partition of the sample is necessarily locally optimal, the results of Theorem apply to the globally optimal k^j-partition. 1 and Theorem 2 also Moreover, the generaliza- tion of these results to densities with finite support [a,b] is immediate. -15- 4. the properties of univariate sample k-means clusters In this paper, when k DISCUSSION approaches infinity with the sample size are presented. N This study is of interest in its own right because we want to examine the sampling properties of this widely-used clustering method. result in Section tion a also indicates that the k-means method would parti- 2 sample from The a distribution with density f on [a,b] way that the sizes of the cluster intervals are adaptive to in such a the f; intervals are large when the density is low while the intervals are small where the density is high. Therefore, k-means can potentially be used as a method for constructing variable-cell gram estimate of f histograms. Indeed, a histo- based on the k-means method is proposed in Wong (1982b); and it is shown to be uniformly consistent in probability. The multivariate case requires further investigation as the generalization of univariate results to many dimensions is not straightforward. An important first step is to determine the configuration of the optimal population k-means partition in several dimensions. indication that the best partition in R is There is some into regular hexagons (Wong 1982c); but it is not clear that the best partition is into regular polytopes for higher dimensions. 16- REFERENCES Billingsley, Convergence of Probability Measures (1968). P. . New York: John Wiley and Sons. Blashfield, R.K., and Aldenderfer, M.S. (1978). cluster analysis. Feller, W. tions (1971). . An Multivariate Behavioral Research (1975). , 13, 271-95. Introduction to Probability Theory and Its ApplicaJohn Wiley and Sons, 549-553. Volume II, New York: Hartigan, J. A. The literature on Clustering Algorithms New York: . John Wiley and Sons. Asymptotic distributions for clustering criteria. (1978). Annals of Statistics , , and Wong, M.A. 117-131. (1967). , 28, 100-108. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Probability and Statistics Pollard, D. (1981). Statistics Wong, M.A. 9, , (1982a). clusters. A k-means clustering (1979). Algorithm AS136: Applied Statistics algorithm. MacQueen, J.B. 6, , 281-97. Strong consistent of k-means clustering. Annals of 135-140. Asymptotic properties of univariate population k-means Unpublished manuscript, Sloan School of Management, M.I.T. (1982b). Using the k-means clustering method as estimation procedure. a density Unpublished manuscript, Sloan School of Manage- ment, M.I.T. (1982c). Asymptotic properties of bivariate k-means clusters. Communications in Statistics, Vol. All, No. 10 (to appear). Sloan School of Management Massachusetts Institute of Technology, E53-335 Cambridge, MA 02139 U.S.A. BKs^-^srl Date Due Lib-26-67 HD28IV1414 no 1341- 82 Wonq, M. Antho/Asymptotic properties 745ng p.*Mi o flflffiF-- TOaO QD2 D47 Mb'l