HD28 .M414 Dewey f»^. .JUL JtfJ 1981 ALFRED P. WORKING PAPER SLOAN SCHOOL OF MANAGEMENT ASYMPTOTIC PROPERTIES OF BIVARIATE K-MEANS CLUSTERS M. Anthony Wong W.P. #1216-81 May 1981 MASSACHUSETTS INSTITUTE OF TECHNOLOGY 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETTS 02139 ASYMPTOTIC PROPERTIES OF BIVARIATE K-MEANS CLUSTERS M. Anthony Wong W.P. #1216-81 May 1981 M.I.T. LIBRARIES JUL 15 1981 RECEIVED , I ASYMPTOTIC PROPERTIES OF BIVARIATE K-MEANS CLUSTERS M. Anthony Wong Sloan School of Management Massachusetts Institute of Technology Cambridge, MA 02139 Hey Words and Phrases: k-means clusters; within cluster sum of squares; graph theory; regular hexagons. ABSTRACT A bounded reeion in R is partitioned into is minimized. k 2 with a uniform density function defined over it sub-regions such that the within cluster sum of squares An asymptotic (k-^<») lower bound for the within cluster sum of This lower bound is squares of this optimal k-means partition is obtained. useful in suggesting that the graph-configuration of the optimal k-partition would consist of regular hexagons of equal size when k is large enough. An empirical study illustrating these asymptotic properties of bivariate k-means clusters is also presented. 1 7*r \y;:'X.Ji - INTRODUCTION 1. Let the observations density function x be sampled from a distribution ...,x^ In cluster analysis, f. Hartigan (1975), Chapter observations into , 4) is with the k-means clustering method (see often used to partition the sample of N clusters with means k. F x_,...,x, 1 k The resultant clus- . ters satisfy the property that no movement of an observation from one cluster to another reduces the sample within cluster sum of squares WSS^ = .1. ."^^;. N x=l l<j<k II X. - x.||'/(N-k) 1 J For these k-means clusters, a k-partition of the sampled space can be defined by associating each cluster mean points in R closer to with the convex polyhedron x. than to any other cluster mean. x. ing optimal k-means partition in the population tion cluster means y , X . . . ,y , of all C. The correspond- is defined by the popula- F which are selected in such a way that the IS. within cluster sum of squares WSS = / l^jlkl |x - M.| For fixed number of clusters dF I the asymptotic convergence (as N k, -» of °°) the sample k-means clusters to the population k-means clusters has been studied by MacQueen (1967), Hartigan (1978), and Pollard (1981), sult can be found in Pollard (1981) , The most recent re- in which conditions are found that ensure the almost sure convergence of the set of means of the k-means clusters. ever, How- the asymptotic properties of k-means clustering in the case where the number of cluster k increases with the sample size N did not receive much attention. In Hartigan and Wong (1979a) (as k It is ->• <=°) and Wong (1980), some asymptotic properties of the population k-means clusters in one dimension are obtained. shown that the within sum of squares of the -2- k clusters are asymptotically equal, and that the length of the jth cluster interval (l<j<k) is inversely proportional to interval. o(N/log N) f(c.) 1/3 , where is the midpoint of the jth cluster c. It is also shown that if k(N) 1/3 , -> «> N as -> «> with k(N) = then the sample k-means clusters have asymptotic properties similar to that of the population clusters. Using these results, it can also be shown that a uniformly consistent histogram estimate of is constant over each k-means cluster interval, £• f, which can be constructed from the imple using the k-means method. Unfortunately, these univariate results cannot be easily generalized to Only empirical evidence exists to support the con- the multivariate case. jecture that similar asymptotic results hold in several dimensions for k-means clusters, and that a uniformly consistent histogram estimate of a multivariate density f can be constructed by the k-means method. uniform consistency result is of special practical interest as The latter it would just- ify the usage of the computationally efficient k-means method (Hartigan and Wong, 1979b) for estimating multivariate density functions from large samples. In this paper, some asymptotic properties of the population k-means clus- ters for uniform distributions in R 2 are given, In Section 2, an asymp- totic lower bound for the WSS of the optimal k-means partition is obtained. Since this lower bound is attained when all k clusters of the partition are regular hexagons of equal area, this result suggests that the graph configuration of the optimal k-means partition would consist of regular hexagons when large enough. k is An empirical study is performed to illustrate these asymptotic properties of bivariate k-means clusters, and the results are given in Section 3. -3- The results given in this paper fall short in generalizing the asymp- totic properties of the univariate population k-means clusters to the bi- variate case. However, they are the first results obtained in the investi- gation of the properties of bivariate k-means clusters. -4- AN ASYMPTOTIC LOWER BOUND FOR THE WITHIN CLUSTER 2. SUM OF SQUARES OF K-MEANS CLUSTERS In this paper, some asymptotic properties (k * of the population °°) k-means clusters for the uniform density in two dimensions are given. result is Theorem The main which gives an asymptotic lower bound for the within cluster 2, sum of squares of the optimal k-means partition. Theorem 2 Let be a region of area jlj, : the loundary of A A with a connected interior in 1/A has finite length and let R . Suppose that be the constant density over d. WSS Let be the minimum within cluster sum of squares over all k-partitions. Then (Remark: k Since the asymptotic lower bound given in Theorem 2 is attained when all clusters of the partition are regular hexagons of equal area, this result sug- gests that the graph configuration of the optimal k-means partition would consist of regular hexagons when In outline, n is large enough.) the proof of Theorem edges and area (see Lemma 1). k A 2 requires first showing that the polygon with which has the minimum within polygon sum of squares is regular For any polygon divided into k clusters, a lower bound for the limiting value (lim inf) of the within cluster sum of squares may then be found by assuming the clusters are regular hexagons (Theorem found by covering the polygon with regular hexagons. bounds approaches 1 as k ^ =°, Theorem Hence, to show the result in Theorem -5- 2 2, 1) . An upper bound may also be Since the ratio of the two follows. we need the following lemmas: LEMMA If f 1 is the constant density over an n-sided polygon WPSS^ then a lower bound for ^ fA^^^^^y^ + \ tan ^1 is given by However, to prove Lemma A of area sua of squares ot the within polygon , 'Ji . we need Lemma 1.1 and Lemma 1.2. 1, Lemma 1.1 For a triangle vertex V A with fixed area , the minimum value of , -^ ^[- / A , and fixed angle r^d(Area) (where + v tan =- 9 3 o is isosceles with equal edges adjacent to V . tance from V) is tan Z o 2o /, ^ l/zo , 3 r 9 at the is the dis- achieved when » A Proof . Consider the triangle SVT = with angle has an area of Let M A and 6 o . be the nidpoint between the vertices S VTS T loss of generality, let Let put if j{TM[l= ST t . Fig. Without (see Fig. 1). the origin. and e be at be the unit vector along the base Also, let is rotated by M 1 d9 V be represented by about M -6- to S'T' ^ . ST (x-axis) and It follows that (see Fig. 1) the , ^ , increment in (Area), / r^d t ll£ ^o = S^ o 4 = J ^ + ^£ll the second rncrnent about ^ " II z - 5^111^ 5vdxde 4(z-e) x2dxde .3 , U-£)t And hence the miniTnum occurs when negative. the triangle is given by , t ^^^^^ this increment is positive or negative as Thus, V z-e z-e = is positive or ; that is, v;hen isosceles. it, Now, if w^ move towards the isosceles position by such a rotation, the area increases. Thus, for a given triangle 2nd moment about the 2nd moment about -^A [ 2 o we can decrease the , by first rotating to the isosceles position, and then V sliding the base back towards equal to VTS V »„ . tan 1/2 V until the triangle has area A for an isosceles triangle of area + „ e o — 3 tab -^ 2 o ] A . Since is the lemma folloys. , ' Lemma 1,2 Let VT be two lines in VS and Suppose that ]R2 Q A O . Then / r^d (Area), * ' 'Q 11 . VT and VS Let the area of . Q the second moment about V, is minimized when is an isoceles triangle. Proof . [I] Fix an integer i and real number set of plane figures such that every Q u C , or VL. II VL = IJVL ; l| and L. = u . l. and let ^(i,u) rilaterals (v;hose interiors are disjoint), all IjVL < o is a union of quadrilaterals (whose inreriors are dis- Q joint) all of whose vertices lie on be T\'S = a angle j^t'Tx of lie respectively on ,^(i,u) is a union of quad- whose vertices lie on VT and VS Using the labeling system shov/n in Fig. -7- be the 2, , with it is clear that every Q € ^2i=-^8 ^(i,u) is uniquely determined by the set of vertices (x, ..., •"' , "r x, .) where , 4i' x^.x^ the x 's satisfy . 5 x^ < (i) •• _ < x^^ < u X - , and T . Aa Example of an can be identified ^(i,u) Thus, u - x^^ V _ X^^^j _ {,?X} I with a compact subset of [0,u] 4i Q ^(4,u) ^ (x.'s are the . distances from V .) from which it inherits a natural topology (pointwise convergence of Under this topology, the two mappings the X 's). f 3 f : ^(i.u) *'i^ where f ^(i,u) . > are continuous on and Q f ^(i,u) = {Q € = /^-^d (Area), (0) A^ 5 area of triangle Therefore, if ^(i,u) min It follows that Qe^^(i.u) f such that " (Q) f (Q) = A is nonempty and compact. } is attained by sods Next, we will show that, for any triangle. i and u , Q^ € were not an isosceles triangle. Q Using ^ (i,u) i = 2 Lensina 1.1 eliminate other cases, it is sufficient to consider the case when = triangle VBA (1) ||VB[! > IIVAil , (2) !|VFI| > IIVHll . u triangle AFH (see Fig. and -8- 3) . is an isosceles It is enough to show that the result holds when Suppose that ^o and ->TR the set ^1^2 [II] (Q) = area of :^(i,u) ^ where . to Consider the triangle AFH . BF , Choose a point close to F through on C Let fhe line . parallel to C cut the triangle AFH the segment . C*G* late line ||C*G*|| and CGHA Trans- along this with CGHA trapezium = along form a t.' ||CG|| C*G* VS By this construction (see Fig. . have the same area, but C*G*HA / the trapeziums 3), r^ d (Area) < CGHA r^d (Area). / C*G*HA Produce C*FG* HG to intersect VT at D . has a larger area than triangle = Since I|CG[[ CDC Part of the area of . 1|C*G*1| , can therefore be redistributed to complete the quadrilateral a decrease in 2nd moment. VBA to produce triangle points of triangle triangle [Ill] ABB* ; C*FG* , VB*A . If are further fro:a V than all the points of 2nd moment is thus decreased. nin f is attained by ^ (Q) the isosceles triangle with equal edges adjacent to that it is the same / r^d with is small enough, all the ||CFj| From the result of [I] and [II], value of CDHA C*FG* The; remaining area can be added to triangle Q-:.^^(i,u) Q triangle Q (Area) o for each over U ^(i.u) ^(i,u) . Thus Qo V Notice gives the minimum . Now, we can proceed to obtain the result given in Lemma 1. -9- . Proof of Lemma 1. Consider a given n-sided polygon tx/ [I] A of area By joining the centroid . and occh of the vertices of je/ n the Let Fig. A). jy in the t^ of defined by tJv cones radiating from n and let subtended at C n Let 29. (see C be the subset of j^. ith cone. we , , obLain an n-partition of C be the area A, be the angle (<n). by the ith cone. Then n = A Z A. and ^ 1 each where (1) r From Lemma . is the distance from UTS^ and Lemma 1.1 n - f I ^ {^. C . Summing over ^'^ (Area) > |f I 1 Next, we will find the minimum of i A.2(^ -e7 1 E , n the constraints: (i) . ^ 1 and (ii) ^ Z 9, = n ^ 1 where Thus, c and c J undei: 1 ^ are constants independent of i squaring (2) and then dividing by (3), we have -10- IT ; 6- ^ < y ^ox the mininium Eust satisfy: 1.1 tan^g. taa 9,] 1 Now, using Lagrange multipliers, 2r A/(- J A.^f t- + ^ -an 9.] 1 X 'tan 9. 3 sec^9, (3) fojT n - A Z A. 1 - i - n we have ve have , -^ 1 1 1.2 , 1 rn] II ^ < i < n 1 ^ E 6, 1 . (1 + -r- tan-2e./[-(l + tan-29.) +-^(1 + can^e.)] t-in2e )2 and hence (| tan-^eJ/Cl + (A) I i tan2e. + Can'*6.) = + J- 1 = r . Since the left side of (4) is a strictly increasing function of tan^e. is a constant. But < n/2 b. for all i so we .oust have , < i < n Also, A. = A/n [Ill] Using the result of [II], we have fron for all ^ WPSS. > > 1 ^2 1 2n 1 " f Z A.2[ X ^ f Il/n for all ^ < i < n 1 . . 11 \ tane^ + i- 3 (1) n/n that tane.] 1' A2[__L-7_ + i tan "^tan = 6. tan^g. 3 and the equality holds when the given n-polygon H] n , * tJS/ is regular, which gives the leimna. In the application of -this lemma, (Remark: f is usually the constant density over a region containing the n-sided polygon f < 1/A Thus, tjf . for most applications.) Next, cluster in Theorem 1, we will obtain a lower bound for the within sum of squares of the optimal k-means partition of a polygon in R 2 . However, consider only "3-edge" it is important to first establish that it is sufficient to the property k-partitions (k-partition whose corresponding graph configurations have that all the interior vertices are associated with exactly Lemma 2 -11- 3 edges.) Hence, we need . Lemma Let 2 t^ be a region with connected interior in density over ^j^ is f and x^xat every k-partition and for every R Suppose that the constant . ^^ is partitioned into > 0, there exists a "3-edge" partition whose e k regions. Then for within cluster sum of squares differs from that of the given partition by at most e Proof . WSS Since the within cluster sum of squares, , can be expressed in the form: k WSS = Z WSS^ ' ^ where /j^ ^^^^^^ '^i'^ = distance to ith cluster centroid and r » »JSy. i WSS it is clear that 1 is the ith cluster, is a continuous function of the vertices of the k k-partition, for every fixed . The lemma follows. Theorem Let ^ 1 : be a polygon with a connected interior in Suppose that J^. Let j;/ WSS has area A and that f F^- . is the constant density over be the minimum within cluster sum of squares over all possible k-partition of J^ . Then ^wss/(i^. k>«' k 5g) 54 -12- > 1 . • Proof . Let be the area of the ith cluster and A. 1 be the number of ed^es ° E, i of the ith cluster. k We will first obtain an expression for [I] T. 1 E. ^ . Consider the configuration of the optinal k-partiticr. Using Leiima and by choosing an arbitrarily sniall 2 of jfy £ . , it is enough to examine a "3-edge" k-partition. Let n be the number of vertices of the polygon tJ^ using a continuity argument, it can be Then n = n consider partitions with where , sho'-^m that it is enough to is the number of vertices n in the configuration of the given partition associated with exactly two edges. Let be the number of vertices associated with 3-edges in n Using some results in graph tbeery, we have the configuration. (1) 2E = 3n + 2n where is the total number of edges. E Moreover, Euler's formula gives E + 1 = F + V , where F is the number of faces (clusters) and V is the number of vertices. Therefore, from (1), we have i(3n^ + 2n) + 1 = k + (n^ + n) , which gives (2) n^ = 2(k - 1) . Hence from (1) and (2), n J E, = 2E - number of edges around the perimeter ^ 1 (^3) = 2E - (n^ + n) = 6(k-l) where n^ + n - n^ , is the number of "3-edge" vertices on the boundary of -13- ^ (Relationship (3) holds for all partitions in which the vertices of the polygon have tjil tv."o edges meeting then, and all remaining vertices have three.) [II) Let Next, we will find a lower bound for WSS . be the \dlthin polygon sum of squares of the ith cluster. USS. k Then WSS = By Lernma Z WSS. 1 ^ . we have. 1, Therefore, 1X1 k where g(E.) = ^ liiOTE" X "^ i Now the minimura of Z k 1 I ^=1' ^°^ F^^ ^> ••" ^^ ' X with all the E.'s being real numbers is A.2g(E,) X 1 1 ^^"^ 1 J not greater than its minimum with all the E.'s being integers, when both are subjected to the constraints: k k (i) E A. = A 1 and I 1 Consider the minimum value of numbers. (ii) 1 Z E. = 6(k-l) + n - n„ B 1 . with all the E.'s being real A^g(E.) Using Lagrange multipliers, this minimum must satisfy (iii) A^g(E^) = ci where c and (iii) and (iv) and (iv) \h^^^ i\) are constants independent of c i = c^ —^1 = constant for _14- i=l, ..-, k , It follows from . that = g(Ei) , . But it can be shown that E the first derivative of , is a monotone decreasing function of E. i -1/g . 1- k Therefore, the minimum of Z must A.^g(E.) ha-.'e ^ 1 ^ + 6 - k (v) E = E E./k = 6 - (n„ n)/k and A. (vi) for all = A/k 5 i < k 1 Thus. ^ WSS = Z VSS^ - (4) Now, for k > 4 g(E.) f A^ 8(6 _ (n^ + * a lower bound of , It follows that Since 1 2k 6 - (n„ + 6 - n 6 - n)/k) . is 3. B n)/k < 6(1 + (n - 9) /6k) is a monotone decreasing function of E. , g(6 - (n^ + 6 - n)/k) > g(6(l + (n - 9)/&k)) . Therefore, from (4), we have (5) WSS S Now for fixed n, Therefore, since ^ 6(1 g f a2 . + (n- g(6(l + (n - 9)/6k)) 9) /6k) -> as 6 is continuous, lim inf WSS/(-^ k-*- ^ • -i^) ^^ and the theorem follows. -15- ^ » . k -> » , . . at Corollary Let t-^ be a region with a connected interior in of finite length. A Let ^ constant density ov&r of squares over all be the area of . Let ^ and let be the 1/A be the ninimum within cluster sum V^S partitions of j^. k whose boundary is IR^ Then Proof . For each tJ!^ n be an n-sidcj polygon of area let jj/ , from the inside. A /A Then =^ + C 1 n where , n C A ^ n by WSS n Thus, by putting Then, since jn/ c n . = 1/A f , ^ l,'SS ' n < WSS . we have i-wss/(l^.^)>iSwss /(^^) k-x" k 54 n k->«» n Letting -> » , and using Theorem lim „Q(;//fA2 Theorem k n k->«» 1 , 54 n we obtain 5/3. , 2 Under the hypothesis stated in the Corollary k->«° Proof k 54 . it is sufficient to shoi; Using the Corollary, lira WSS/(^ . -^ < 1 . Fig. 5 Now given the region construct a region jy iJ) , we can always of area B consisting -16- as -* n Denote the minimum within cluster sum of squares over all of approximating k n -> « . partitions of k (1) connected regular hexagons £8 ^jd (2) is Let V.'SS^ ^/A = 1 . be the minimuni within cluster ' 54 ^, > — Ak k-partition of such that Fig. 5) and , partitions of Then (.sen 1/A and let WSS/jT) of squares over all k- be the constant density over U) since the , su.-a k . regular hexagons form a '^ ^ . Now, from (1), 5A WSS > WSS , and hence Ale Thus, (3) V.'SS/(-5g . ^) S b2/a2 , and the theorem follows from (2) and (3) The result of Theorem 2 only gives a lower bound for the overall within cluster sum of squares of the k-means partition. It falls short in showing that the within sum of squares of the k clusters are asymptotically equal (a conjecture due to Professor John A. Hartigan) . Much work has yet to be done to prove the conjecture for two or more dimensional distributions, -17- 3. EMPIRICAL ILLUSTRATIONS In order to illustrate the asymptotic properties of bivariate k-means clusters obtained in Section 2, an empirical study is performed using bivariate samples generated according to the uniform distribution on the unit square. It is necessary to estimate the WSS for various values of k within cluster sum of squares from generated samples because the WSS for the optimal k-mean partition of the unit square cannot be obtained analytically frc large values of k. Here, the results of three sets of experiments us- ing different sample sizes are reported. In Experiment One, four different samples of size ated from the uniform distribution on the unit square. N = 1500 are gener- Using k = 40, 50, 60, and 70 for the different samples, unbiased estimates WSS the different cluster sizes are obtained. of WSS for The values of WSS, N for the vari- are given alongside the corresponding asymptotic lower 5/3 1 bounds for WSS (that is, 54 k ) in Table 1, and the corresponding ous values of k pairs are found to be in close agreement with one another. different samples of size the values of k k are generated in Experiment Two, and used for the three samples are in Experiment Three, the values of N = 2500 Similarly, three k = 50 , 60, and 70; while N = 4000, and the three generated samples are of size used are k = 60, 80, and 100. The resulting for these six experimental trials are also given in Table 1. values of WSS., N for the various values of k WSS values Again, the are found to be in close agreement with the corresponding lower bounds for WSS. Hence, these empiri- cal results tend to indicate that the asymptotic lower bound obtained in Theorem 2 is the WSS for the optimal k-means partition when large -18- k becomes Table Sample Size (N) 1 Asymptotic distributions for clustering criteria. Hartigan, J. A. (1978) Annals of Statistics , 6^, 117-131. . Proceedings of Hartigan, J. A. and Wong, M.A. (1979a). Hybrid Clustering. the 12th Interface Symposium on Computer Science and Statistics ed Jane Gentleman, University of Waterloo Press, 137-143. , Hartigan, J. A. and Wong, M.A. (1979b). Algorithm AS 136: ing algorithm. Applied Statistics 28 100-108. , A K-means cluster- , Some methods for classification and analysis of multiMacQueen, J. (1967). variate observations. Proceedings: Fifth Berkeley Symposium on Math. Statist. Prob 1, 281-297. . Strong consistency of k-means clustering. Pollard, D. (1981). Statistics 9., 135-140. Annals of , Wong, M.A. (1980). Asymptotic Properties of k-means clustering algorithm as Sloan School of Management Working a density estimation procedure. Cambridge. M.I.T., Paper #2000-80, -20- Figure 1. Graph Configuration of sample k-means partition (k = 50) obtained for 1500 observations from the uniform distribution. 1.0 0.8:r 0.6 0.4 0.2 0.0 0.0 0.2 7267'^G3I Date Due Lib-26-67 HD28.M414 no.l216- 81 Wong, M. Antho/Asymptotic 742711 DxBKS TOfiO properties o 00.13.3.43.0.. 0D2 DOS flflT