Pattern Classification

All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart, and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.

Chapter 10: Unsupervised Learning & Clustering
1. Introduction
2. Mixture Densities and Identifiability
3. Maximum-Likelihood Estimates
4. Application to Normal Mixtures
6. Data Description and Clustering
7. Criterion Functions for Clustering
9. Hierarchical Clustering
13. Component Analysis

1. Introduction
• Previously, all of our training samples were labeled: those procedures were "supervised"
• We now investigate "unsupervised" procedures that use unlabeled samples. There are at least five reasons for this:
  1. Collecting and labeling a large set of samples can be costly
  2. We can train with large amounts of (less expensive) unlabeled data, and only then use supervision to label the groupings found; this is appropriate for large "data mining" applications
  3. Unsupervised methods are also appropriate in applications where the characteristics of the patterns change slowly with time
  4. We can use unsupervised methods to identify features that will then be useful for categorization
  5. We gain some insight into the nature (or structure) of the data

2. Mixture Densities & Identifiability
• We begin with the assumption that the functional forms of the underlying probability densities are known, and that the only thing to be learned is the value of an unknown parameter vector
• We make the following assumptions:
  1. The samples come from a known number c of classes
  2. The prior probabilities P(ω_j), j = 1, ..., c, are known
  3. The forms of the class-conditional densities p(x | ω_j, θ_j), j = 1, ..., c, are known
  4. The values of the c parameter vectors θ_1, θ_2, ..., θ_c are unknown
  5. The category labels are unknown
• The resulting density is

  p(x | θ) = Σ_{j=1}^{c} p(x | ω_j, θ_j) P(ω_j),   where θ = (θ_1, θ_2, ..., θ_c)^t

  with component densities p(x | ω_j, θ_j) and mixing parameters P(ω_j)
• This density function is called a mixture density
• Our goal is to use samples drawn from this mixture density to estimate the unknown parameter vector θ
• Once θ is known, we can decompose the mixture into its components and use a maximum a posteriori (MAP) classifier on the derived densities

• Definition: a density p(x | θ) is said to be identifiable if θ ≠ θ' implies that there exists an x such that p(x | θ) ≠ p(x | θ'); i.e., a density is identifiable only if we can recover unique parameters
• Most mixtures of commonly encountered real-world density functions are identifiable, especially continuous densities
• The textbook gives some unidentifiable examples
• Although identifiability can be a problem, we will always assume that the densities we deal with are identifiable

3. Maximum-Likelihood Estimates
• Suppose we have a set D = {x_1, ..., x_n} of n unlabeled samples drawn independently from the mixture density

  p(x | θ) = Σ_{j=1}^{c} p(x | ω_j, θ_j) P(ω_j)

  where θ is fixed but unknown
• To estimate θ, take the gradient of the log-likelihood with respect to θ_i and set it to zero (a small numerical sketch of the mixture density and its log-likelihood follows below)

4. Applications to Normal Mixtures
p(x | ω_i, θ_i) ~ N(μ_i, Σ_i)

  Case | μ_i | Σ_i | P(ω_i) |  c
   1   |  ?  |  ×  |   ×    |  ×
   2   |  ?  |  ?  |   ?    |  ×
   3   |  ?  |  ?  |   ?    |  ?

  × = known, ? = unknown
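To make the mixture density and log-likelihood above concrete, here is a minimal numerical sketch (not from the textbook). It evaluates p(x | θ) and l(θ) for a one-dimensional two-component normal mixture; the particular means, unit variances, mixing weights, and random seed are illustrative assumptions chosen to resemble the example of Section 4.1 below.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def mixture_pdf(x, mus, sigmas, priors):
    """Mixture density: p(x | theta) = sum_j p(x | w_j, theta_j) P(w_j)."""
    return sum(p * normal_pdf(x, m, s) for m, s, p in zip(mus, sigmas, priors))

def log_likelihood(samples, mus, sigmas, priors):
    """Log-likelihood of i.i.d. samples: l(theta) = sum_k ln p(x_k | theta)."""
    return float(np.sum(np.log(mixture_pdf(samples, mus, sigmas, priors))))

# Illustrative two-component mixture with priors P(w1) = 1/3, P(w2) = 2/3
# and unit variances (the same form as the example in Section 4.1).
rng = np.random.default_rng(0)
labels = rng.random(25) < 1/3                      # True -> component 1
samples = np.where(labels, rng.normal(-2.0, 1.0, 25), rng.normal(2.0, 1.0, 25))

print(log_likelihood(samples, mus=[-2.0, 2.0], sigmas=[1.0, 1.0], priors=[1/3, 2/3]))
```

Evaluating the log-likelihood over a grid of candidate means is one simple way to reproduce the kind of likelihood surface discussed in the next section.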
Case 1 is the simplest case.

4.1 Case 1: Unknown Mean Vectors
• Even this "simplest" case is not easy; the textbook derives an iterative gradient-ascent (hill-climbing) procedure to maximize the log-likelihood function
• Example: consider the simple two-component one-dimensional normal mixture

  p(x | μ_1, μ_2) = (1 / (3√(2π))) exp[ -(1/2)(x - μ_1)^2 ] + (2 / (3√(2π))) exp[ -(1/2)(x - μ_2)^2 ]

  Set μ_1 = -2, μ_2 = 2 and draw 25 samples from this mixture. The log-likelihood function is

  l(μ_1, μ_2) = Σ_{k=1}^{n} ln p(x_k | μ_1, μ_2)

[Figure: 25 one-dimensional samples from the Gaussian mixture: 8 from class one and 17 from class two]

• The maximum of the log-likelihood occurs at μ̂_1 = -2.130 and μ̂_2 = 1.668, which are not far from the true values μ_1 = -2 and μ_2 = +2
• There is another peak at μ̂_1 = 2.085 and μ̂_2 = -1.257 which has almost the same height, as seen in the following figure
• This mixture of normal densities is identifiable; when a mixture density is not identifiable, the maximum-likelihood solution is not unique

[Figure: the log-likelihood l(μ_1, μ_2), showing the two peaks of nearly equal height]

4.2 Case 2: All Parameters Unknown
• If no constraints are placed on the covariance matrices, the maximum-likelihood principle yields useless singular solutions
• Constraints such as assuming diagonal or equal covariance matrices can help

4.3 k-Means Clustering
• A popular approximate method for estimating the c mean vectors μ_1, μ_2, ..., μ_c
• Replace the squared Mahalanobis distance by the squared Euclidean distance
• Find the mean μ̂_m nearest to x_k and approximate the posterior as

  P̂(ω_i | x_k, θ̂) ≈ 1 if i = m, 0 otherwise

• Use the iterative scheme to find μ̂_1, μ̂_2, ..., μ̂_c
• The number of iterations required is usually much less than the number of samples
• If n is the known number of patterns and c the desired number of clusters, the k-means algorithm is (a minimal Python sketch is given below, after the introduction to Section 6):

  begin initialize n, c, μ_1, μ_2, ..., μ_c (randomly selected)
     do  classify the n samples according to the nearest μ_i
         recompute the μ_i
     until no change in the μ_i
     return μ_1, μ_2, ..., μ_c
  end

[Figure: two-class k-means example; compare with the maximum-likelihood result in the previous figure]
[Figure: three-class k-means example; convergence in three iterations]

6. Data Description and Clustering
• The structure of multidimensional patterns is important for clustering
• If we know that the data come from a specific distribution, the data can be represented by a compact set of parameters (sufficient statistics)
  • e.g., the mean and covariance matrix for a normal distribution
• If the samples are assumed to come from a specific distribution but actually do not, these statistics give a misleading representation of the data (next figure)

[Figure: data for which the sample mean and covariance matrix give a misleading description]

• A mixture of normal distributions can approximate a large variety of situations (i.e., essentially any density function)
• In these cases, one can use parametric methods to estimate the parameters of the mixture density
• However, if little prior knowledge can be assumed, the assumption of a parametric form is meaningless: we are then imposing structure on the data rather than finding structure in it
• In these cases, one can use nonparametric methods to estimate the unknown mixture density
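As promised above, here is a minimal sketch of the k-means procedure of Section 4.3, assuming squared Euclidean distance, random initialization from the data, and made-up two-dimensional data. It is a sketch of the algorithm as stated, not the textbook's implementation.

```python
import numpy as np

def k_means(samples, c, rng=None, max_iter=100):
    """k-means: classify samples to the nearest mean, recompute the means,
    and repeat until the means no longer change."""
    rng = np.random.default_rng(rng)
    # initialize the c means with randomly selected samples
    means = samples[rng.choice(len(samples), size=c, replace=False)]
    for _ in range(max_iter):
        # classify each sample according to the nearest mean (squared Euclidean distance)
        d2 = ((samples[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # recompute each mean; keep the old mean if a cluster lost all of its samples
        new_means = np.array([samples[nearest == i].mean(axis=0) if np.any(nearest == i)
                              else means[i] for i in range(c)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return means, nearest

# Illustrative two-cluster data in two dimensions
rng = np.random.default_rng(1)
samples = np.vstack([rng.normal(-2, 1, (25, 2)), rng.normal(2, 1, (25, 2))])
means, labels = k_means(samples, c=2, rng=2)
print(means)
```

On well-separated data such as this, the loop typically converges in a handful of iterations, consistent with the remark above that the number of iterations is usually far smaller than the number of samples.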
• If the goal is to find subclasses, one can use a clustering procedure to identify groups of data points having strong internal similarities

6.1 Similarity Measures
• The question is how to evaluate whether the samples in one cluster are more similar to one another than to samples in other clusters
• Two issues:
  • How do we measure the similarity between samples?
  • How do we evaluate a partitioning of a set into clusters?
• The most obvious measure of similarity between two samples is the distance between them, i.e., we define a metric
• Once such a measure is defined, one would expect the distance between samples of the same cluster to be significantly less than the distance between samples in different clusters

• Euclidean distance is one possible metric: a possible criterion is to assign samples to the same cluster if their distance is less than a threshold d_0
• Clusters defined by Euclidean distance are invariant to translations and rotations of the feature space, but not to general transformations that distort the distance relationships

[Figure: scaling the axes]
[Figure: scaling for unit variance may be undesirable]

• To achieve invariance, one can normalize the data, e.g., so that each feature has zero mean and unit variance, or use principal components for invariance to rotation
• A broad class of metrics is the Minkowski metric

  d(x, x') = ( Σ_{k=1}^{d} |x_k - x'_k|^q )^{1/q}

  where q ≥ 1 is a selectable parameter:
    q = 1: Manhattan (city-block) metric
    q = 2: Euclidean metric
• One can also use a nonmetric similarity function s(x, x') to compare two vectors

7. Criterion Functions for Clustering
• The second issue: how do we evaluate a partitioning of a set into clusters?
• Clustering can be posed as the optimization of a criterion function, such as
  • the sum-of-squared-error criterion and its variants
  • scatter criteria

7.1 The Sum-of-Squared-Error Criterion

  J_e = Σ_{i=1}^{c} Σ_{x ∈ D_i} ||x - m_i||^2,   where   m_i = (1/n_i) Σ_{x ∈ D_i} x

• This criterion describes each cluster by its mean vector m_i, in the sense that it minimizes the sum of the squared lengths of the errors x - m_i
• The optimal partition is defined as the one that minimizes J_e; it is also called the minimum-variance partition
• It works well when the clusters form compact, well-separated clouds, and less well when there are great differences in the numbers of samples in different clusters
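Here is a small sketch evaluating the sum-of-squared-error criterion J_e of Section 7.1 for a given partition; the data and the two candidate partitions are made up for illustration.

```python
import numpy as np

def sum_of_squared_error(samples, labels):
    """J_e = sum_i sum_{x in D_i} ||x - m_i||^2, with m_i the mean of cluster D_i."""
    je = 0.0
    for i in np.unique(labels):
        cluster = samples[labels == i]
        m_i = cluster.mean(axis=0)
        je += ((cluster - m_i) ** 2).sum()
    return je

# Two made-up clusters: a partition that respects the compact, well-separated
# clouds should give a much smaller J_e than an arbitrary partition of the same points.
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
good = np.repeat([0, 1], 20)          # grouping that matches the two clouds
bad = np.tile([0, 1], 20)             # alternating, arbitrary grouping
print(sum_of_squared_error(samples, good), sum_of_squared_error(samples, bad))
```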
7.2 Related Minimum Variance Criteria

7.3 Scatter Criteria
• These use the scatter matrices of multiple discriminant analysis, i.e., the within-cluster scatter matrix S_W and the between-cluster scatter matrix S_B:

  S_T = S_B + S_W   (S_T = total scatter matrix)

  (Note: S_T depends only on the set of samples, not on how they are partitioned)
• Various criteria can be used to minimize the within-cluster scatter or maximize the between-cluster scatter

• The trace criterion (minimize tr[S_W]) is the simplest scalar measure of the scatter matrix; it is proportional to the sum of the variances in the coordinate directions, and in practice it is just the sum-of-squared-error criterion:

  tr[S_W] = Σ_{i=1}^{c} tr[S_i] = Σ_{i=1}^{c} Σ_{x ∈ D_i} ||x - m_i||^2 = J_e

• Since tr[S_T] = tr[S_W] + tr[S_B] and tr[S_T] is independent of the partitioning, no new results are obtained by maximizing tr[S_B], where

  tr[S_B] = Σ_{i=1}^{c} n_i ||m_i - m||^2

• Rather, seeking to minimize the within-cluster criterion J_e = tr[S_W] is equivalent to maximizing the between-cluster criterion tr[S_B]

9. Hierarchical Clustering
• Often clusters are not disjoint but have subclusters, which in turn have sub-subclusters, and so on
• Consider a sequence of partitions of the n samples into c clusters:
  • the first is a partition into n clusters, each containing exactly one sample
  • the second is a partition into n - 1 clusters, the third into n - 2, and so on, until the n-th, in which a single cluster contains all of the samples
  • at level k in the sequence, c = n - k + 1

• Given any two samples x and x', they will be grouped together at some level, and once grouped at level k they remain grouped at all higher levels
• The hierarchical clustering tree is called a dendrogram
• The similarity values at which clusters merge may help to determine whether the groupings are natural or forced, but if they are evenly distributed, no such information can be gained
• Another representation is based on Venn diagrams

• Hierarchical clustering can be divided into agglomerative and divisive procedures:
  • Agglomerative (bottom-up, clumping): start with n singleton clusters and form the sequence by merging clusters
  • Divisive (top-down, splitting): start with all of the samples in one cluster and form the sequence by successively splitting clusters

Agglomerative hierarchical clustering
• The procedure terminates when the specified number of clusters has been obtained, and returns the clusters as sets of points rather than as a mean or representative vector for each cluster; terminating at c = 1 yields a dendrogram
• At any level, the distance between the nearest clusters can provide the dissimilarity value for that level
• To find the nearest clusters, one can use

  d_min(D_i, D_j)  = min_{x ∈ D_i, x' ∈ D_j} ||x - x'||        (nearest neighbor)
  d_max(D_i, D_j)  = max_{x ∈ D_i, x' ∈ D_j} ||x - x'||        (farthest neighbor)
  d_avg(D_i, D_j)  = (1 / (n_i n_j)) Σ_{x ∈ D_i} Σ_{x' ∈ D_j} ||x - x'||
  d_mean(D_i, D_j) = ||m_i - m_j||

  which behave quite similarly when the clusters are hyperspherical and well separated
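The four cluster-to-cluster distances above translate directly into code. The sketch below (with two made-up clusters) is illustrative; any of these functions could serve as the merging criterion in the agglomerative procedure.

```python
import numpy as np

def pairwise_distances(Di, Dj):
    """All Euclidean distances ||x - x'|| for x in D_i, x' in D_j."""
    return np.linalg.norm(Di[:, None, :] - Dj[None, :, :], axis=2)

def d_min(Di, Dj):   return pairwise_distances(Di, Dj).min()    # nearest neighbor
def d_max(Di, Dj):   return pairwise_distances(Di, Dj).max()    # farthest neighbor
def d_avg(Di, Dj):   return pairwise_distances(Di, Dj).mean()   # average of all pairs
def d_mean(Di, Dj):  return np.linalg.norm(Di.mean(axis=0) - Dj.mean(axis=0))

# Two illustrative clusters
rng = np.random.default_rng(0)
Di = rng.normal(0.0, 1.0, (10, 2))
Dj = rng.normal(4.0, 1.0, (12, 2))
for d in (d_min, d_max, d_avg, d_mean):
    print(d.__name__, round(d(Di, Dj), 3))
```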
The nearest-neighbor algorithm
• When d_min is used, the algorithm is called the nearest-neighbor algorithm (a sketch of the agglomerative procedure with d_min appears at the end of this section)
• If it is terminated when the distance between the nearest clusters exceeds a given threshold, it is called the single-linkage algorithm
• If the data points are considered nodes of a graph, with edges forming a path between the nodes in the same subset D_i, then merging D_i and D_j corresponds to adding an edge between the nearest pair of nodes in D_i and D_j
• The resulting graph never contains closed loops: it is a tree, and if all subsets end up linked, it is a spanning tree
• In fact, using d_min together with agglomerative clustering generates a minimal spanning tree
• Chaining effect: a defect of this distance measure (right-hand side of the figure on the next slide)

[Figure: illustration of the chaining effect]

The farthest-neighbor algorithm
• When d_max is used, the algorithm is called the farthest-neighbor algorithm
• If it is terminated when the distance between the nearest clusters exceeds a given threshold, it is called the complete-linkage algorithm
• This method discourages the growth of elongated clusters
• In graph-theoretic terms, every cluster constitutes a complete subgraph, and the distance between two clusters is determined by the most distant pair of nodes in the two clusters
• When two clusters are merged, the graph is changed by adding edges between every pair of nodes in the two clusters
• All the procedures involving minima or maxima are sensitive to outliers; the use of d_mean or d_avg is a natural compromise

The problem of the number of clusters
• Typically, the number of clusters is known
• When it is not, there are several ways to proceed:
  • when clustering is done by extremizing a criterion function, a common approach is to repeat the clustering with c = 1, c = 2, c = 3, etc., and examine how the criterion changes
  • another approach is to set a threshold for the creation of a new cluster
• These approaches are similar to the model-selection procedures typically used to determine the topology and number of states (e.g., clusters, parameters) of a model for a given application
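As referenced above, here is a minimal, self-contained sketch of the agglomerative procedure using d_min (the nearest-neighbor / single-linkage idea), terminating at a desired number of clusters. The data are made up, and the recorded merge distances are the kind of quantity one might inspect when the number of clusters is unknown: a sudden jump suggests that genuinely separate clusters are being forced together.

```python
import numpy as np

def single_linkage(samples, c_final=1):
    """Agglomerative clustering with d_min: start from n singleton clusters and
    repeatedly merge the two nearest clusters until c_final clusters remain.
    Returns the clusters (as index lists) and the distance at each merge."""
    clusters = [[i] for i in range(len(samples))]
    merge_distances = []
    while len(clusters) > c_final:
        best = (np.inf, None, None)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # d_min between the two clusters: smallest pairwise distance
                d = min(np.linalg.norm(samples[i] - samples[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merge_distances.append(d)
        clusters[a] = clusters[a] + clusters[b]   # merge the nearest pair of clusters
        del clusters[b]
    return clusters, merge_distances

# Illustrative data: two compact clouds
rng = np.random.default_rng(0)
samples = np.vstack([rng.normal(0, 0.3, (10, 2)), rng.normal(3, 0.3, (10, 2))])
clusters, dists = single_linkage(samples, c_final=2)
print([sorted(cl) for cl in clusters])
```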
13. Component Analysis (also Chapter 3)
• Combine features in order to reduce the dimension of the feature space
• Linear combinations are simple to compute and analytically tractable
• Project high-dimensional data onto a lower-dimensional space
• Two classical approaches to finding an "optimal" linear transformation:
  • PCA (Principal Component Analysis): the projection that best represents the data in a least-squares sense
  • MDA (Multiple Discriminant Analysis): the projection that best separates the data in a least-squares sense (a generalization of Fisher's linear discriminant for two classes)

PCA (Principal Component Analysis)
• The projection that best represents the data in a least-squares sense
• The scatter matrix of the cloud of samples is, up to a factor of n, the maximum-likelihood estimate of the covariance matrix
• Unlike a covariance matrix, however, the scatter matrix includes samples from all classes
• The least-squares projection solution (maximum scatter) is simply the subspace defined by the d' < d eigenvectors of the scatter (equivalently, covariance) matrix that correspond to its d' largest eigenvalues
• Because the scatter matrix is real and symmetric, its eigenvectors are orthogonal (a minimal PCA sketch is given at the end of these notes)

Fisher Linear Discriminant
• While PCA seeks directions that are efficient for representation, discriminant analysis seeks directions that are efficient for discrimination

Multiple Discriminant Analysis
• A generalization of Fisher's linear discriminant for more than two classes
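To close, here is a minimal PCA sketch along the lines described above, assuming made-up two-dimensional data and d' = 1: form the scatter matrix of the centered samples, take its eigenvectors with the largest eigenvalues, and project onto the subspace they span.

```python
import numpy as np

def pca_project(samples, d_prime):
    """Project samples onto the d' eigenvectors of the scatter matrix
    with the largest eigenvalues (the least-squares-best subspace)."""
    mean = samples.mean(axis=0)
    centered = samples - mean
    scatter = centered.T @ centered                  # scatter matrix (n times the ML covariance estimate)
    eigvals, eigvecs = np.linalg.eigh(scatter)       # symmetric matrix -> orthogonal eigenvectors
    order = np.argsort(eigvals)[::-1][:d_prime]      # indices of the d' largest eigenvalues
    W = eigvecs[:, order]                            # d x d' projection matrix
    return centered @ W, W

# Illustrative 2-D data stretched along one axis; d' = 1 keeps the
# direction of maximum scatter.
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 2)) @ np.array([[3.0, 0.0], [0.0, 0.5]])
projected, W = pca_project(samples, d_prime=1)
print(W)               # principal direction
print(projected[:3])   # first few one-dimensional projections
```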