Clustering Validity
Adriano Joaquim de O Cruz, ©2006 NCE/UFRJ, adriano@nce.ufrj.br

Clustering Validity
- The number of clusters is not always known in advance.
- In many problems the number of classes is known, but it is not necessarily the best configuration for clustering.
- It is therefore necessary to study methods that indicate and/or validate the number of clusters.

Example 1
- Consider the problem of digit recognition.
- It is known that there are 10 classes (the 10 digits).
- The number of clusters, however, may be greater than 10.
- This is a consequence of different handwriting styles for the same digit.

Example 2
- Consider the segmentation of a thermal image of a room.
- It is known that there are 2 temperature classes: body temperature and room temperature.
- This is a problem in which the number of classes is well defined.

The Validation Problem
- First, the data is partitioned into different numbers of clusters.
- It is also important to try different initial conditions for the same number of clusters.
- Validity measures are applied to these partitions to estimate their quality.
- The quality must be estimated both when the number of clusters changes and, for the same number of clusters, when the initial conditions differ.

L-Clusters: Initial Definitions
- d(ei, ek) is the dissimilarity between elements ei and ek.
- The Euclidean distance is one example of a dissimilarity measure.

L-Cluster Definition
- C is an L-cluster if, for each object ei belonging to C,
  \max_{e_k \in C} d(e_i, e_k) < \min_{e_h \notin C} d(e_i, e_h)
- That is, the maximum distance between ei and any element ek of C is smaller than the minimum distance between ei and any element eh of another cluster.

(Figure: example of an L-cluster C.)

L*-Cluster Definition
- C is an L*-cluster if, for each object ei belonging to C,
  \max_{e_k \in C} d(e_i, e_k) < \min_{e_l \in C,\, e_h \notin C} d(e_l, e_h)

(Figure: example of an L*-cluster C.)

Silhouettes: Introduction
- P. J. Rousseeuw, "Silhouettes: a graphical aid to the interpretation and validation of cluster analysis", Journal of Computational and Applied Mathematics, 1987.
- Each cluster is represented by one silhouette, showing which objects lie well within the cluster.
- The user can compare the quality of the clusters.

Method - I
- Consider a cluster A. For each element ei in A, compute the average dissimilarity a(ei) = d(ei, A) to all other objects of A.
- Therefore, A cannot be a singleton.
- The Euclidean distance is one example of a dissimilarity.

Method - II
- Consider every cluster Ck different from A and compute d(ei, Ck), the average dissimilarity of ei to all elements of Ck.
- Select b(ei) = \min_k d(e_i, C_k). The cluster B that attains this minimum is the second-best choice for ei.

Method - III
- The silhouette s(ei) is defined as
  s(ei) = 1 - a(ei)/b(ei)   if a(ei) < b(ei)
  s(ei) = 0                 if a(ei) = b(ei)
  s(ei) = b(ei)/a(ei) - 1   if a(ei) > b(ei)
- Equivalently, s(ei) = [b(ei) - a(ei)] / max(a(ei), b(ei)), and -1 <= s(ei) <= +1.
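To make the computation concrete, here is a minimal MATLAB sketch (the function name silhouette_values and its argument layout are my own; it assumes an n-by-p data matrix X, a vector of crisp labels idx, Euclidean dissimilarity, and at least two clusters). MATLAB's built-in silhouette function, used in the examples later in these slides, computes these values directly.

QTY = 0; % placeholder so the sketch below can live in its own file silhouette_values.m
function s = silhouette_values(X, idx)
    % Minimal sketch: silhouette values from data X (n-by-p) and crisp labels idx.
    % Euclidean dissimilarity; singleton clusters are left with s = 0.
    n = size(X, 1);
    D = squareform(pdist(X));            % n-by-n pairwise dissimilarity matrix
    labels = unique(idx); labels = labels(:)';
    s = zeros(n, 1);
    for i = 1:n
        own = (idx(:) == idx(i));
        own(i) = false;                  % exclude ei itself
        if ~any(own), continue; end      % singleton cluster A: s(ei) stays 0
        a = mean(D(i, own));             % a(ei): average dissimilarity within A
        b = inf;
        for k = labels
            if k == idx(i), continue; end
            b = min(b, mean(D(i, idx(:) == k)));  % b(ei): best neighbouring cluster B
        end
        s(i) = (b - a) / max(a, b);      % equivalent to the three-case definition
    end
end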
Understanding s(ei)
- s(ei) near 1: the within dissimilarity a(ei) << b(ei); ei is well classified.
- s(ei) near 0: a(ei) is close to b(ei); ei may belong to either cluster.
- s(ei) near -1: the within dissimilarity a(ei) >> b(ei); ei is misclassified and should belong to B.

Silhouette
- The silhouette of cluster A is the plot of all its s(ei) values ranked in decreasing order.
- The average of the s(ei) values of all elements in the cluster is called the average silhouette.

Example of use I
QTY = 100;
X = [randn(QTY,2) + 0.5*ones(QTY,2);
     randn(QTY,2) - 0.5*ones(QTY,2)];
opts = statset('Display','final');
[cidx, ctrs] = kmeans(X, 2, 'Distance','cityblock', ...
                      'Replicates',5, 'Options',opts);
figure;
plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
     X(cidx==2,1), X(cidx==2,2), 'b.', ...
     ctrs(:,1), ctrs(:,2), 'kx');
figure;
[s, h] = silhouette(X, cidx, 'sqEuclidean');

(Figures: Ex Silhouette 1 and Ex Silhouette 2, the scatter and silhouette plots produced by the code above.)

Example of use II
QTY = 100;
X = [randn(QTY,2) + 2*ones(QTY,2);
     randn(QTY,2) - 2*ones(QTY,2)];
opts = statset('Display','final');
[cidx, ctrs] = kmeans(X, 2, 'Distance','cityblock', ...
                      'Replicates',5, 'Options',opts);
figure;
plot(X(cidx==1,1), X(cidx==1,2), 'r.', ...
     X(cidx==2,1), X(cidx==2,2), 'b.', ...
     ctrs(:,1), ctrs(:,2), 'kx');
figure;
[s, h] = silhouette(X, cidx, 'sqEuclidean');

(Figures: Ex silhouette 3 and Ex silhouette 4, the scatter and silhouette plots for the better separated clusters.)

Cluster Validity: Partition Coefficient

Partition Coefficient
- The partition coefficient is defined as
  F = \frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^{2}, \qquad \frac{1}{c} \le F \le 1

Partition Coefficient: comments
- F decreases as the number of clusters increases.
- F is not well suited to finding the best number of clusters.
- F is best suited to selecting the best partition among those with the same number of clusters.

Partition Coefficient
- When F = 1/c the partition is entirely fuzzy, since every element belongs to all clusters with the same degree of membership.
- When F = 1 the partition is crisp and the membership values are either 1 or 0.
- This measure can only be applied to fuzzy partitions.

Partition Coefficient: Example 1
- The partition matrix over the elements w1, w2, w3, w4 is
  U = \begin{pmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{pmatrix}
  F = \frac{1^2 + 1^2 + 1^2 + 1^2}{4} = 1

Partition Coefficient: Example 2
- The partition matrix over w1, w2, w3, w4 is
  U = \begin{pmatrix} 0.5 & 0.5 & 0.5 & 0.5 \\ 0.5 & 0.5 & 0.5 & 0.5 \end{pmatrix}
  F = \frac{8 \times 0.5^2}{4} = 0.5 = \frac{1}{2} = \frac{1}{c}

Partition Coefficient: Example 3
- The partition matrix over x1, ..., x6 is
  U_1 = \begin{pmatrix} 0.5 & 1 & 0 & 0.9 & 0.3 & 0.2 \\ 0.5 & 0 & 1 & 0.1 & 0.7 & 0.8 \end{pmatrix}
  F = \frac{0.5^2 + 1^2 + 0^2 + 0.9^2 + 0.3^2 + 0.2^2 + 0.5^2 + 0^2 + 1^2 + 0.1^2 + 0.7^2 + 0.8^2}{6} = 0.763

Cluster Validity: Partition Entropy

Partition Entropy
- The partition entropy is defined as
  H = -\frac{1}{n} \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij} \log(\mu_{ij}), \qquad 0 \le H \le \log c
- When H = 0 the partition is crisp; when H = log(c) the fuzziness is maximum.
- The two measures are related by 0 \le 1 - F \le H.

Partition Entropy: comments
- H increases as the number of clusters increases.
- H is better suited to selecting the best partition among several runs of an algorithm.
- H is strictly a fuzzy measure.
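Both indices are easy to compute from a membership matrix. A minimal MATLAB sketch (the function name partition_indices is my own; it assumes U is c-by-n with each column summing to 1, as produced for instance by FCM):

function [F, H] = partition_indices(U)
    % Minimal sketch: partition coefficient F and partition entropy H from a
    % fuzzy membership matrix U (c-by-n, each column summing to 1).
    n = size(U, 2);
    F = sum(U(:).^2) / n;              % 1/c <= F <= 1; F = 1 for a crisp partition
    Upos = U(U > 0);                   % treat 0*log(0) as 0
    H = -sum(Upos .* log(Upos)) / n;   % 0 <= H <= log(c); H = 0 for a crisp partition
end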
Cluster Validity: Compactness and Separation

Compactness and Separation
- The CS index is defined as
  CS = \frac{J_m}{n \, (d_{min})^2}
- J_m is the objective function minimized by the FCM algorithm.
- n is the number of elements.
- d_{min} is the minimum Euclidean distance between the centres of two clusters.

Compactness and Separation
- The minimum distance is defined as
  d_{min} = \min_{i \ne j} \| c_i - c_j \|
- The complete formula is
  CS = \frac{\sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij}^{m} \| v_i - x_j \|^2}{n \, \min_{i \ne j} \| v_i - v_j \|^2}

Compactness and Separation
- This is a very complete validation measure.
- It validates the number of clusters and checks the separation among the clusters.
- In our experiments it works well even when the degree of overlap is high.

Cluster Validity: Fisher Linear Discriminant

Fisher Linear Discriminant
- Fisher's Linear Discriminant (FLD) is an important technique in pattern recognition, used here to evaluate the compactness and separation of the partitions produced by crisp clustering techniques.

Fisher Linear Discriminant
- It is easier to handle classification problems in which the samples have few features.
- It is therefore important to reduce the dimensionality of the problem.
- When FLD is applied to a crisply partitioned space it produces an operator W that maps the original space R^p into a new space R^k, where k < p.

(Figure: projection of samples from 2 classes onto a line by Fisher's Linear Discriminant.)

FLD
- FLD measures the compactness and separation of all categories when crisp partitions are created.
- FLD uses two matrices:
  S_B: the between-classes scatter matrix
  S_W: the within-classes scatter matrix

FLD: the S_B matrix
- S_B measures the quality of the separation between classes:
  S_B = \sum_{i=1}^{c} n_i (m_i - m)(m_i - m)^T
  m = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad m_i = \frac{1}{n_i} \sum_{x_j \in c_i} x_j
- m is the mean of all samples and m_i is the mean of the samples belonging to cluster i.
- n is the number of samples and n_i is the number of samples belonging to cluster i.

FLD: the S_W matrix
- S_W measures the compactness of all classes; it is the sum of the internal scatter of each class:
  S_W = \sum_{i=1}^{c} \sum_{x_j \in c_i} (x_j - m_i)(x_j - m_i)^T

Total Scattering
- The total scatter is the sum of the internal scatter and the scatter between the classes:
  S_T = S_W + S_B
- In an optimal partition the separation between classes (S_B) must be maximum and the scatter within the classes (S_W) must be minimum.

The J criterion
- Fisher defined the criterion J, which must be maximized:
  J = \frac{S_B}{S_W}
- A simplified way to evaluate J is
  J = \frac{\mathrm{trace}(S_B)}{\mathrm{trace}(S_W)}

J: comments
- J takes values in the interval [0, +∞).
- J is strictly a crisp measure.
- J loses precision as the overlap between the samples increases.
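A compact MATLAB sketch of the trace form of J (the function name fisher_J is my own; X is an n-by-p sample matrix with one sample per row and idx holds the crisp cluster labels):

function J = fisher_J(X, idx)
    % Minimal sketch: Fisher's criterion J = trace(SB)/trace(SW) for a crisp partition.
    [~, p] = size(X);
    m = mean(X, 1);                              % overall mean of all samples
    labels = unique(idx); labels = labels(:)';
    SB = zeros(p); SW = zeros(p);
    for k = labels
        Xk = X(idx == k, :);                     % samples of cluster k
        nk = size(Xk, 1);
        mk = mean(Xk, 1);                        % cluster mean
        SB = SB + nk * (mk - m)' * (mk - m);     % between-class scatter
        Dk = Xk - repmat(mk, nk, 1);
        SW = SW + Dk' * Dk;                      % within-class scatter
    end
    J = trace(SB) / trace(SW);                   % larger J: compact, well separated classes
end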
EFLD
- EFLD measures the compactness and separation of all categories when fuzzy partitions are created.
- EFLD uses two matrices:
  S_Be: the between-classes scatter matrix
  S_We: the within-classes scatter matrix

EFLD: the S_Be matrix
- S_Be measures the quality of the separation between classes:
  S_{Be} = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij} (m_{ei} - m)(m_{ei} - m)^T
  m = \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad m_{ei} = \frac{\sum_{j=1}^{n} \mu_{ij} x_j}{\sum_{j=1}^{n} \mu_{ij}}

EFLD: the S_We matrix
- S_We measures the compactness of all classes; it is the sum of the internal scatter of each class:
  S_{We} = \sum_{i=1}^{c} \sum_{j=1}^{n} \mu_{ij} (x_j - m_{ei})(x_j - m_{ei})^T

Total Scattering
- The total scatter is the sum of the internal scatter and the scatter between the classes:
  S_{Te} = S_{We} + S_{Be}
- In an optimal partition the separation between classes (S_Be) must be maximum and the scatter within the classes (S_We) must be minimum.

The Je criterion
- Je is the criterion that must be maximized:
  J_e = \frac{S_{Be}}{S_{We}}
- A simplified way to evaluate Je is
  J_e = \frac{\mathrm{trace}(S_{Be})}{\mathrm{trace}(S_{We})}

Simplifying the Je criterion
- It can be proved that the total scatter is constant and equal to
  s_T = \mathrm{trace}(S_{Te}) = \sum_{j=1}^{n} \| x_j - m \|^2
- Therefore Je can be evaluated from the between-class scatter alone:
  J_e = \frac{s_{Be}}{s_{We}} = \frac{s_{Be}}{s_T - s_{Be}}, \qquad s_{Be} = \mathrm{trace}(S_{Be})

Je: comments
- Je takes values in the interval [0, +∞).
- Je is strictly a crisp measure.
- Je loses precision as the overlap between the samples increases.

Applying EFLD
- EFLD values for two sample sets as the number of clusters varies:

  Number of clusters    2        3        4        5        6
  Sample set X1         4.6815   4.9136   0.2943   0.2559   0.3157
  Sample set X2         0.3271   0.8589   0.8757   0.9608   1.0674

EFLD: comments
- EFLD increases as the number of clusters rises.
- EFLD increases when classes have a high degree of overlap.
- EFLD can reach its maximum for a wrong number of clusters.

Cluster Validity: Inter Class Contrast (ICC)

ICC
- ICC evaluates both crisp and fuzzy clustering algorithms.
- It measures partition compactness and partition separation.
- ICC must be maximized.

ICC
- The index is defined as
  ICC = \frac{s_{Be} \, D_{min}}{c \, n}
- s_{Be} estimates the quality of the placement of the centres.
- The factor 1/n is a scale factor that compensates for the influence of the number of points on s_{Be}.

ICC - 2
- D_{min} is the minimum Euclidean distance between all pairs of centres.
- It neutralizes the tendency of s_{Be} to grow, preventing the maximum from being reached for a number of clusters greater than the ideal value.
- When 2 or more clusters represent the same class, D_{min} decreases abruptly.
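A minimal MATLAB sketch of ICC under this reading (the function name icc_index is my own; it assumes s_Be is the trace of the fuzzy between-class scatter S_Be defined above, computed with the FCM centres V in place of the fuzzy class means m_ei, and that the index is ICC = s_Be * D_min / (c * n)):

function icc = icc_index(X, V, U)
    % Minimal ICC sketch. Assumptions (mine): X is n-by-p data, V is c-by-p
    % cluster centres (e.g. from FCM), U is the c-by-n membership matrix,
    % and sBe is the trace of the fuzzy between-class scatter using V.
    [c, n] = size(U);
    m = mean(X, 1);                                    % grand mean of the data
    sBe = 0;
    for i = 1:c
        sBe = sBe + sum(U(i, :)) * sum((V(i, :) - m).^2);
    end
    Dmin = inf;                                        % minimum distance between centres
    for i = 1:c-1
        for j = i+1:c
            Dmin = min(Dmin, norm(V(i, :) - V(j, :)));
        end
    end
    icc = sBe * Dmin / (c * n);                        % larger is better
end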
ICC Fuzzy Application
- Five classes with 500 points each.
- No class overlap.
- Set X1: centres (1,2), (6,2), (1,6), (6,6), (3.5,9), standard deviation 0.3.
- FCM is applied with m = 2 for c = 2, ..., 10.

ICC Fuzzy Application: Results
(M: measure to be maximized, m: measure to be minimized; undef.: undefined)

  Measure    M/m    c=2      c=3      c=4      c=5
  ICC        M      7.596    41.99    51.92    96.70
  ICCTra     M      7.596    41.99    51.92    96.70
  ICCDet     M      undef.   154685   259791   673637
  EFLD       M      0.185    0.986    1.877    13.65
  EFLDTra    M      0.185    0.986    1.877    13.65
  EFLDDet    M      undef.   0.955    3.960    182.70
  CS         m      0.350    0.096    0.070    0.011
  F          M      0.705    0.713    0.795    0.943
  MinHT      M      0.647    0.572    2.124    1.994
  MeanHT     M      0.519    0.496    1.327    1.887
  MinRF      0      0.100    0.316    0        0

ICC Fuzzy Application: Time
  Measure    c=2      c=3      c=4      c=5
  ICC        0.0061   0.0069   0.0082   0.00914
  ICCTra     0.0078   0.0060   0.0088   0.0110
  ICCDet     0.0110   0.0088   0.0110   0.0132
  EFLD       0.0053   0.0071   0.0063   0.0080
  EFLDTra    0.7678   1.0870   1.4780   1.8982
  EFLDDet    0.7800   1.1392   1.5510   2.0160
  CS         0.0226   0.0261   0.0382   0.0476
  NFI        0.0061   0.0056   0.0058   0.00603
  F          0.0044   0.0045   0.0049   0.00491
  FPI        0.0061   0.0045   0.0049   0.00532

Application with Overlapping
- Five classes with 500 points each.
- High cluster overlap.
- Set X1: centres (1,2), (6,2), (1,6), (6,6), (3.5,9), standard deviation 0.3.
- FCM is applied with m = 2 for c = 2, ..., 10.

Application with Overlapping: Results
  Measure    M/m    c=2      c=3      c=4      c=5      c=10
  ICC        M      5.065    4.938    6.191    7.829    5.69
  ICCTra     M      5.065    4.938    6.191    7.829    5.69
  ICCDet     M      undef.   715.19   3572     7048     6024
  EFLD       M      0.450    0.585    0.839    1.095    1.344
  EFLDTra    M      0.450    0.585    0.839    1.095    1.344
  EFLDDet    M      undef.   0.049    0.315    0.743    1.200
  CS         m      0.164    0.225    0.191    0.122    0.223
  F          M      0.754    0.621    0.591    0.586    0.439
  MeanHT     M      0.632    0.485    0.550    0.597    0.429
  MinRF      0      0.170    0.294    0.194    0.210    0.402
  MPE        m      0.568    0.601    0.561    0.525    0.565

Application with Overlapping: Time
  Measure    c=2      c=3      c=4      c=5
  ICC        0.0060   0.0064   0.0077   0.00881
  ICCTra     0.0066   0.0060   0.0098   0.0110
  ICCDet     0.0110   0.0078   0.0110   0.0120
  EFLD       0.0063   0.0088   0.0096   0.0110
  EFLDTra    0.7930   2.1038   1.7598   2.2584
  EFLDDet    0.9720   1.2580   1.6090   1.8450
  CS         0.0220   0.0283   0.0362   0.05903
  F          0.0112   0.0121   0.0061   0.0164
  MPE        0.0167   0.0271   0.0319   0.03972

ICC: conclusions
- Fast and efficient.
- Works with both fuzzy and crisp partitions.
- Efficient even with highly overlapping clusters.
- High rate of correct results.
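For reference, a minimal MATLAB sketch of an experiment along the lines described above (five Gaussian classes scored with ICC for c = 2 to 10); it assumes the Fuzzy Logic Toolbox function fcm and the icc_index sketch given earlier, and it only illustrates the setup, not the exact code used to produce the tables:

% Sketch: five Gaussian classes, 500 points each, centres as in the slides,
% standard deviation 0.3, scored with ICC for c = 2..10.
centres = [1 2; 6 2; 1 6; 6 6; 3.5 9];
X = [];
for k = 1:size(centres, 1)
    X = [X; 0.3 * randn(500, 2) + repmat(centres(k, :), 500, 1)];
end
for c = 2:10
    [V, U] = fcm(X, c);                     % U is c-by-n, columns sum to 1
    fprintf('c = %2d   ICC = %.3f\n', c, icc_index(X, V, U));
end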