Intro. ANN & Fuzzy Systems
Lecture 21 Clustering (2)
(C) 2001-2003 by Yu Hen Hu

Outline
• Similarity (Distance) Measures
• Distortion Criteria
• Scattering Criterion
• Hierarchical Clustering and other clustering methods

Distance Measure
• What does "similar" mean?
– Norm: $d(x, y) = \|x - y\|_m = \left( \sum_{i=1}^{N} |x_i - y_i|^m \right)^{1/m}$
– Mahalanobis distance: $d(x, y) = (x - y)^T S_{xy}^{-1} (x - y)$
– Angle: $d(x, y) = x^T y / (|x| \cdot |y|)$
• Binary and symbolic features (x, y contain 0, 1 only):
– Tanimoto coefficient: $d(x, y) = \dfrac{x^T y}{x^T x + y^T y - x^T y}$

Clustering Criteria
• Is the current clustering assignment good enough? The most popular criterion is the mean-square error distortion measure

$$D = \sum_{i=1}^{c} \sum_{k=1}^{N} I(x_k, i) \, \|x_k - W(i)\|^2 = \sum_{i=1}^{c} \frac{1}{N_i} \sum_{x, y \in C(i)} \|x - y\|^2, \qquad N_i = \sum_{k=1}^{N} I(x_k, i)$$

Here $I(x_k, i) = 1$ if $x_k$ is assigned to cluster $i$ and 0 otherwise, and $W(i)$ is the center of cluster $i$. The second equality holds when $W(i)$ is the cluster mean $m_i$ and the inner sum counts each unordered pair $\{x, y\}$ once.
• Other distortion measures can also be used:

$$D = \sum_{i=1}^{c} \frac{1}{N_i} \sum_{x, y \in C(i)} d(x, y)$$

$$D = \sum_{i=1}^{c} \frac{1}{N_i} \min_{x, y \in C(i)} d(x, y)$$

Scatter Matrices
• Scatter matrices are defined in the context of analysis of variance in statistics.
• They are used in linear discriminant analysis.
• However, they can also be used to gauge the fitness of a particular clustering assignment.
• Mean vector for the i-th cluster: $m_i = \dfrac{1}{N_i} \sum_{k=1}^{N} I(x_k, i) \, x_k$
• Total mean vector: $m = \dfrac{1}{N} \sum_{i=1}^{c} N_i m_i = \dfrac{1}{N} \sum_{k=1}^{N} x_k$
• Scatter matrix for the i-th cluster: $S_i = \sum_{k=1}^{N} I(x_k, i) \, (x_k - m_i)(x_k - m_i)^T$
• Within-cluster scatter matrix: $S_W = \sum_{i=1}^{c} S_i$
• Between-cluster scatter matrix: $S_B = \sum_{i=1}^{c} N_i (m_i - m)(m_i - m)^T$

Scattering Criteria
• Total scatter matrix: $S_T = \sum_{k=1}^{N} (x_k - m)(x_k - m)^T = S_W + S_B$
• Note that the total scatter matrix is independent of the assignment $I(x_k, i)$. But …
• $S_W$ and $S_B$ both depend on $I(x_k, i)$!
• Desired clustering property
– $S_W$ small
– $S_B$ large
• How to gauge whether $S_W$ is small or $S_B$ is large? There are several ways.
• Tr. $S_W$ (trace of $S_W$): Let $S_W = \sum_{m=1}^{M} \lambda_m v_m v_m^T$ be the eigenvalue decomposition of $S_W$, then

$$\text{Tr. } S_W = \sum_{m=1}^{M} \lambda_m = \sum_{i=1}^{c} \text{Tr. } S_i = \sum_{i=1}^{c} \sum_{k=1}^{N} I(x_k, i) \, \|x_k - m_i\|^2 = D$$

Cluster Separating Measure (CSM)
• Similar to scattering criteria.
• $csm = (m_i - m_j) / (\sigma_i + \sigma_j)$
• The larger its value, the more separable the two clusters.
• Assume the underlying data distribution is Gaussian.
[Figure: three panels titled "std = 0.3, csm = 1.6667", "std = 0.5, csm = 1", and "std = 0.8, csm = 0.625"; horizontal axis from -2 to 2, vertical axis from 0 to 1.5.]

Hierarchical Clustering
• Merge method: Initially, each $x_k$ is a cluster. During each iteration, the nearest pair of distinct clusters is merged, until the number of clusters is reduced to 1.
• How to measure the distance between two clusters:
– $d_{min}(C(i), C(j)) = \min d(x, y)$; $x \in C(i)$, $y \in C(j)$ (leads to a minimum spanning tree)
– $d_{max}(C(i), C(j)) = \max d(x, y)$; $x \in C(i)$, $y \in C(j)$
– $d_{avg}(C(i), C(j)) = \dfrac{1}{N_i N_j} \sum_{x \in C(i)} \sum_{y \in C(j)} d(x, y)$
– $d_{mean}(C(i), C(j)) = \|m_i - m_j\|$

Hierarchical Clustering (II)
Split method:
• Initially, there is only one cluster. Iteratively, a cluster is split into two or more clusters, until the total number of clusters reaches a predefined goal.
• The scattering criterion can be used to decide how to split a given cluster into two or more clusters.
• Another way is to perform an m-way clustering, using, say, the k-means algorithm to split a cluster into m smaller clusters.
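
The scatter-matrix relations from the Scatter Matrices and Scattering Criteria slides can be checked numerically. The sketch below is only an illustration in plain Python: the six 2-D points, the `labels` list (where `labels[k] = i` stands for I(xk, i) = 1), and all helper names are assumptions made up for this example, not part of the lecture. It builds SW, SB and ST for one hard assignment, and computes the distortion D with the cluster means as centers.

```python
# Sketch: scatter matrices for one hard clustering assignment (pure Python).
# Data, labels and helper names are hypothetical illustration values.

def mean(points):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def outer(u):
    """Outer product u u^T as a nested list."""
    return [[a * b for b in u] for a in u]

def add(A, B):
    """Element-wise sum of two matrices."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(A, s):
    """Multiply every entry of matrix A by the scalar s."""
    return [[s * a for a in row] for row in A]

def zeros(dim):
    return [[0.0] * dim for _ in range(dim)]

def scatter(points, center):
    """Sum of (x - center)(x - center)^T over the given points."""
    S = zeros(len(center))
    for p in points:
        diff = [a - b for a, b in zip(p, center)]
        S = add(S, outer(diff))
    return S

def trace(A):
    return sum(A[i][i] for i in range(len(A)))

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]
labels = [0, 0, 0, 1, 1, 1]  # labels[k] = i means I(x_k, i) = 1

dim = len(X[0])
m = mean(X)                                  # total mean vector
S_W, S_B, D = zeros(dim), zeros(dim), 0.0
for i in sorted(set(labels)):
    cluster = [x for x, lab in zip(X, labels) if lab == i]
    m_i = mean(cluster)                      # mean vector of cluster i
    S_W = add(S_W, scatter(cluster, m_i))    # S_W accumulates S_i
    diff = [a - b for a, b in zip(m_i, m)]
    S_B = add(S_B, scale(outer(diff), len(cluster)))
    # Distortion D with W(i) taken to be the cluster mean m_i.
    D += sum(sum((a - b) ** 2 for a, b in zip(x, m_i)) for x in cluster)

S_T = scatter(X, m)                          # total scatter matrix

print("Tr S_W =", trace(S_W), " D =", D)
print("S_W + S_B =", add(S_W, S_B))
print("S_T       =", S_T)
```

Changing `labels` changes `S_W` and `S_B` but leaves `S_T` untouched, which is the point of the "independent of the assignment" remark.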
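
The merge method from the Hierarchical Clustering slide can be sketched in a few lines. This is a minimal, unoptimized illustration under stated assumptions: d(x, y) is taken to be the Euclidean distance, the function name `merge_clustering` and its `n_clusters` stopping parameter are hypothetical (the slide merges all the way down to 1 cluster, which is the default here), and the sample points are made up.

```python
import math

# Four cluster-to-cluster distances, with d(x, y) assumed Euclidean.
def d_min(A, B):
    return min(math.dist(x, y) for x in A for y in B)

def d_max(A, B):
    return max(math.dist(x, y) for x in A for y in B)

def d_avg(A, B):
    return sum(math.dist(x, y) for x in A for y in B) / (len(A) * len(B))

def centroid(A):
    """Mean vector of the points in cluster A."""
    return [sum(p[k] for p in A) / len(A) for k in range(len(A[0]))]

def d_mean(A, B):
    return math.dist(centroid(A), centroid(B))

def merge_clustering(points, cluster_dist=d_min, n_clusters=1):
    """Merge the nearest pair of clusters until n_clusters remain.

    Returns the final clusters and the distance at which each merge happened.
    """
    clusters = [[p] for p in points]         # initially, each point is a cluster
    history = []
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dist = cluster_dist(clusters[i], clusters[j])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        dist, i, j = best
        history.append(dist)
        clusters[i] = clusters[i] + clusters[j]   # merge cluster j into cluster i
        del clusters[j]
    return clusters, history

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
clusters, history = merge_clustering(points, cluster_dist=d_min, n_clusters=2)
print(clusters)
print(history)
```

Passing `d_max`, `d_avg` or `d_mean` as `cluster_dist` swaps in the other cluster distances without touching the merge loop. The exhaustive pair search is kept for readability; it recomputes every cluster distance on each iteration.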