9.913 Pattern Recognition for Vision
Class VII, Part I – Techniques for Clustering
Yuri Ivanov
Fall 2004

TOC
• Similarity metric
• K-means and IsoData algorithms
• EM algorithm
• Some hierarchical clustering schemes

Clustering
• Clustering is a process of partitioning the data into groups based on similarity
• Clusters are groups of measurements that are similar
• In Classification, groups of similar data form classes
  – Labels are given
  – Similarity is deduced from labels
• In Clustering, groups of similar data form clusters
  – Similarity measure is given
  – Labels are deduced from similarity

[Figure: classification (labels given) vs. clustering (labels deduced), each shown before and after scaling the axes $x \to \alpha x$, $y \to \beta y$.]

Questions
• What is "similar"?
• What is a "good" partitioning?

Distances
Most obvious: distance between samples
• Compute distances between the samples
• Compare distances to a threshold
We need a metric to define distances and thresholds.

Metric and Invariance
We can choose it from a family:

$d(x, x') = \left( \sum_{k=1}^{d} |x_k - x'_k|^q \right)^{1/q}$ – Minkowski metric

$q = 1$ → Manhattan / city block / taxicab distance
$q = 2$ → Euclidean distance

$d(x, x')$ is invariant to rotation and translation only for $q = 2$.

Minkowski Metrics
[Figure: points at distance 1 from the origin (unit "circles") for the $L_1$, $L_2$, $L_3$, $L_4$, $L_{10}$ and $L_\infty$ metrics.]

Metric and Invariance
Other choices for an invariant metric:
• We can use a data-driven metric:
  $d(x, x') = (x - x')^T \Sigma^{-1} (x - x')$ – Mahalanobis distance
• We can normalize (whiten) the data, $x' = (\Lambda^{-1/2} \Phi^T)\, x$, and then use the Euclidean metric.

Metric
Euclidean metric:
• Good for isotropic spaces
• Bad for linear transformations (except rotation and translation)
Mahalanobis metric:
• Good if there is enough data
Whitening:
• Good if the spread is due to random processes
• Bad if it is due to subclasses

Similarity
We need a symmetric function that is large for "similar" $x$. E.g.:

$s(x, x') = \dfrac{x^T x'}{\|x\|\,\|x'\|}$ – "angular" similarity

Vocabulary: {two, three, little, star, monkeys, jumping, twinkle, bed}
a) Three little monkeys jumping on the bed → (0, 1, 1, 0, 1, 1, 0, 1)
b) Two little monkeys jumping on the bed → (1, 0, 1, 0, 1, 1, 0, 1)
c) Twinkle twinkle little star → (0, 0, 1, 1, 0, 0, 2, 0)

Similarity matrix:
       a      b      c
a    1.0    0.8    0.18
b    0.8    1.0    0.18
c    0.18   0.18   1.0

Similarity
It doesn't have to be a metric. E.g.:

            Has fur   Has 4 legs   Can type
Monkey         1          0           1
Platypus       1          1           0

$s(x, x') = \dfrac{x^T x'}{d}$ (d = number of features):
    .67   .33
    .33   .67

$s(x, x') = \dfrac{x^T x'}{x^T x + x'^T x' - x^T x'}$ – Tanimoto coefficient:
    1     .33
    .33   1
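To make these definitions concrete, here is a minimal NumPy sketch (not part of the original slides; function names are illustrative) of the Minkowski distance, the "angular" similarity, and the Tanimoto coefficient. The bag-of-words vectors are the ones from the nursery-rhyme example, and the printed matrix reproduces the similarity matrix above.

```python
import numpy as np

def minkowski(x, xp, q=2):
    """Minkowski distance: d(x, x') = (sum_k |x_k - x'_k|^q)^(1/q)."""
    return np.sum(np.abs(x - xp) ** q) ** (1.0 / q)

def angular_similarity(x, xp):
    """'Angular' similarity: x^T x' / (||x|| ||x'||)."""
    return x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))

def tanimoto(x, xp):
    """Tanimoto coefficient: x^T x' / (x^T x + x'^T x' - x^T x')."""
    return x @ xp / (x @ x + xp @ xp - x @ xp)

# Bag-of-words vectors over the vocabulary
# {two, three, little, star, monkeys, jumping, twinkle, bed}
a = np.array([0, 1, 1, 0, 1, 1, 0, 1])  # "Three little monkeys jumping on the bed"
b = np.array([1, 0, 1, 0, 1, 1, 0, 1])  # "Two little monkeys jumping on the bed"
c = np.array([0, 0, 1, 1, 0, 0, 2, 0])  # "Twinkle twinkle little star"

S = np.array([[angular_similarity(u, v) for v in (a, b, c)] for u in (a, b, c)])
print(np.round(S, 2))  # matches the slide: 1.0 on the diagonal, 0.8 for (a,b), 0.18 elsewhere
```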
Partitioning Evaluation
$J$ – objective function, s.t. clustering is assumed optimal when $J$ is minimized or maximized.

$J = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \| x_n^{(k)} - m_k \|^2$ – sum-of-squared-error criterion (min)

Using the definition of the mean:

$J = \frac{1}{2} \sum_{k=1}^{K} N_k \left[ \frac{1}{N_k^2} \sum_{n=1}^{N_k} \sum_{m=1}^{N_k} \| x_n^{(k)} - x_m^{(k)} \|^2 \right]$

The bracketed term is the average within-cluster dissimilarity – you can replace it with your favorite dissimilarity measure.

Partitioning Evaluation
Other possibilities, built from the within- and between-cluster scatter matrices (recall LDA), with $S_W = \sum_{k=1}^{K} S_k$:

$J = |S_W|$ – scatter determinant criterion (min)
$J = \mathrm{tr}\left( S_W^{-1} S_B \right) = \sum_{i=1}^{d} \lambda_i$ – scatter ratio criterion (max)

Careful with the ranks!

Which to choose?
• No methodological answer
• SSE criterion (minimum variance)
  – simple
  – good for well-separated clusters in dense groups
  – affected by outliers, scale variant
• Scatter criteria
  – invariant to general linear transformations
  – poor on small amounts of data relative to the dimensionality
• You should choose the metric and the objective that are invariant to the transformations natural to your problem.

Clustering – Notation
x – input data
K – number of clusters (assumed known)
N_k – number of points in cluster k
N – total number of data points
t_k – prototype (template) vector of the k-th cluster
J – objective function, s.t. clustering is assumed optimal when J is extremized

General Procedure
Clustering is usually an iterative procedure:
• Choose an initial configuration
• Adjust the configuration s.t. J is optimized
• Check for convergence
J is often only partially minimized.

Clustering – A Good Start
Let's choose the following model:
• Known number of clusters
• Each cluster is represented by a single prototype
• Similarity is defined in the nearest-neighbor sense

Sum-squared-error objective:

$J = \sum_{k=1}^{K} \sum_{n=1}^{N_k} \| x_n^{(k)} - t_k \|^2$ – total in-cluster distance for all clusters

$\frac{dJ}{dt_k} = \sum_{c=1}^{K} \sum_{n=1}^{N_c} \frac{d}{dt_k} \| x_n^{(c)} - t_c \|^2 = -2 \sum_{n=1}^{N_k} \left( x_n^{(k)} - t_k \right) = 0 \;\Rightarrow\; t_k = \frac{1}{N_k} \sum_{n=1}^{N_k} x_n^{(k)}$

K-Means Algorithm
Using the iterative procedure:
1. Choose K random positions for the prototypes
2. Classify all samples by the nearest $t_k$
3. Compute new prototype positions
4. If not converged (no cluster assignments changed from the previous iteration), go to step 2

This is the K-means (a.k.a. Lloyd's, a.k.a. LBG) algorithm. What to do with empty clusters? Some heuristics are involved (see below).

K-Means Algorithm Example
[Figure: K-means iterations on 2-D data with K = 10.]

Cluster Heuristics
Sometimes clusters end up empty. We can:
• Remove them
• Randomly reinitialize them
• Split the largest ones
Sometimes we have too many clusters. We can:
• Remove the smallest ones
• Relocate the smallest ones
• Merge the smallest ones together if they are neighbors
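A minimal NumPy sketch (illustrative, not from the original slides) of the four K-means steps above; the empty-cluster heuristic used here is simply re-seeding the prototype from a random data point, one of the options listed in the heuristics slide.

```python
import numpy as np

def kmeans(x, k, seed=0):
    """K-means (a.k.a. Lloyd's / LBG). x: (N, d) data array, k: number of clusters."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    t = x[rng.choice(n, size=k, replace=False)].astype(float)   # 1. random prototypes
    labels = np.full(n, -1)
    while True:
        d = np.linalg.norm(x[:, None, :] - t[None, :, :], axis=2)  # sample-to-prototype distances
        new_labels = d.argmin(axis=1)                # 2. classify by the nearest t_k
        if np.array_equal(new_labels, labels):       # 4. converged: no assignment changed
            return t, labels
        labels = new_labels
        for j in range(k):                           # 3. recompute prototype positions
            members = x[labels == j]
            # empty-cluster heuristic: re-seed from a random data point
            t[j] = members.mean(axis=0) if len(members) else x[rng.integers(n)]
```

Usage would look like `t, labels = kmeans(np.random.randn(500, 2), k=10)`, mirroring the K = 10 example above.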
IsoData Algorithm
In K-means we assume that we know the number of clusters; IsoData tries to estimate it – the ultimate K-means hack.
IsoData iterates between 3 stages:
• Center estimation
• Cluster splitting
• Cluster merging
The user specifies:
T – min number of samples in a cluster
N_D – desired number of clusters
D_m – max distance for merging
σ_S² – maximum cluster variance
N_max – max number of merges

IsoData, Stage I – Cluster assignment:
1. Assign a label to each data point such that $w_n = \arg\min_j \| x_n - t_j \|$
2. Discard clusters with $N_k < T$, reduce $N_c$
3. Update the means of the remaining clusters: $t_j = \frac{1}{N_j} \sum_{i=1}^{N_j} x_i^{(j)}$
This is basically a step of the K-means algorithm.

IsoData, Stage II – Cluster splitting:
1. If this is the last iteration, set $D_m = 0$ and go to Stage III
2. If $N_c \le N_D / 2$, go to splitting (step 4)
3. If the iteration is even, or if $N_c \ge 2 N_D$, go to Stage III
4. Compute:
   $d_k = \frac{1}{N_k} \sum_{i=1}^{N_k} \| x_i^{(k)} - t_k \|$ – avg. distance from the center
   $\sigma_k^2 = \max_j \frac{1}{N_k} \sum_{i=1}^{N_k} \left( x_{i,j}^{(k)} - t_{k,j} \right)^2$ – max variance along a single dimension
   $\bar{d} = \frac{1}{N} \sum_{k=1}^{N_c} N_k d_k$ – overall avg. distance from the centers
5. For clusters with $\sigma_k^2 > \sigma_S^2$: if ($d_k > \bar{d}$ AND $N_k > 2(T+1)$) OR $N_c < N_D / 2$, split the cluster by creating a new mean $t'_{k,j} = t_{k,j} + 0.5\,\sigma_k^2$ and moving the old one to $t_{k,j} = t_{k,j} - 0.5\,\sigma_k^2$ (here $j$ is the dimension of maximum variance from step 4)

IsoData, Stage III – Cluster merging (if no split has been made):
1. Compute the matrix of distances between cluster centers: $D_{i,j} = \| t_i - t_j \|$
2. Make the list of pairs where $D_{i,j} < D_m$
3. Sort them in ascending order
4. Merge up to $N_{max}$ unique pairs, starting from the top, by removing $t_j$ and replacing $t_i$ with
   $t_i = \frac{1}{N_i + N_j} \left( N_i t_i + N_j t_j \right)$

IsoData Example
[Figure: IsoData run with $N_D = 10$, $T = 10$, $\sigma_S^2 = 3$, $D_m = 2$, $N_{max} = 3$.]

Mixture Density Model
A mixture model is a linear combination of parametric densities:

$p(x) = \sum_{j=1}^{M} p(x \mid j)\, P(j)$

M – number of components; $p(x \mid j)$ – component density; $P(j)$ – component weight, with $P(j) \ge 0 \;\forall j$ and $\sum_{j=1}^{M} P(j) = 1$.

Recall kernel density estimation: here the kernels are parametric densities, subject to estimation.

Example
[Figure: a mixture density $p(x) = \sum_j p(x \mid j) P(j)$ together with its individual components.]

Mixture Density
Using the ML principle, the objective function is the log-likelihood:

$l(\theta) \equiv \ln \prod_{n=1}^{N} p(x^n) = \sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} p(x^n \mid j)\, P(j) \right)$

Differentiate w.r.t. the parameters:

$\frac{\partial}{\partial \theta_j} l(\theta) = \sum_{n=1}^{N} \frac{\partial}{\partial \theta_j} \ln \left( \sum_{k=1}^{M} p(x^n \mid k)\, P(k) \right) = \sum_{n=1}^{N} \frac{1}{\sum_{k=1}^{M} p(x^n \mid k)\, P(k)} \, \frac{\partial}{\partial \theta_j} \left[ p(x^n \mid j)\, P(j) \right]$

The $1 / \sum_k p(x^n \mid k) P(k)$ factor appears because of the log.

Mixture Density
For distributions $p(x \mid j)$ in the exponential family:

$\frac{\partial}{\partial \theta} \left[ A(\theta)\, e^{B(\theta, x)} \right] = A(\theta)\, e^{B(\theta, x)} \frac{\partial B(\theta, x)}{\partial \theta} + \frac{\partial A(\theta)}{\partial \theta}\, e^{B(\theta, x)}$

Plugging this in, the gradient takes the form

$\frac{\partial l(\theta)}{\partial \theta_j} = \sum_{n=1}^{N} P(j \mid x^n) \cdot (\text{Stuff} + \text{More Stuff})$

For a Gaussian:

$\frac{\partial l(\theta)}{\partial \mu_j} = \sum_{n=1}^{N} P(j \mid x^n) \left[ \hat{\Sigma}_j^{-1} (x^n - \hat{\mu}_j) \right]$

$\frac{\partial l(\theta)}{\partial \hat{\Sigma}_j} = \sum_{n=1}^{N} P(j \mid x^n) \left[ \hat{\Sigma}_j^{-1} - \hat{\Sigma}_j^{-1} (x^n - \hat{\mu}_j)(x^n - \hat{\mu}_j)^T \hat{\Sigma}_j^{-1} \right]$

Mixture Density
At the extremum of the objective:

$P(j) = \frac{1}{N} \sum_{n=1}^{N} P(j \mid x^n)$

$\hat{\mu}_j = \frac{\sum_{n=1}^{N} P(j \mid x^n)\, x^n}{\sum_{n=1}^{N} P(j \mid x^n)}$

$\hat{\Sigma}_j = \frac{\sum_{n=1}^{N} P(j \mid x^n)\, (x^n - \hat{\mu}_j)(x^n - \hat{\mu}_j)^T}{\sum_{n=1}^{N} P(j \mid x^n)}$

BUT:

$P(j \mid x^n) = \frac{p(x^n \mid j)\, P(j)}{\sum_{k=1}^{M} p(x^n \mid k)\, P(k)}$ – the parameters are tied.

Solution – the EM algorithm.
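The posterior $P(j \mid x^n)$ – the "responsibility" of component j for sample n – is exactly the quantity that ties the parameters together. A minimal sketch of how it is computed for a Gaussian mixture (illustrative only; assumes NumPy and SciPy are available, and the function name is made up):

```python
import numpy as np
from scipy.stats import multivariate_normal

def responsibilities(x, means, covs, priors):
    """P(j | x^n) and the log-likelihood l(theta) for a Gaussian mixture.

    x: (N, d) data; means[j]: (d,); covs[j]: (d, d); priors: (M,) weights summing to one.
    """
    # joint[n, j] = p(x^n | j) P(j)
    joint = np.column_stack([
        priors[j] * multivariate_normal.pdf(x, mean=means[j], cov=covs[j])
        for j in range(len(priors))
    ])
    resp = joint / joint.sum(axis=1, keepdims=True)   # P(j | x^n); each row sums to 1
    loglik = np.log(joint.sum(axis=1)).sum()          # sum_n ln sum_j p(x^n | j) P(j)
    return resp, loglik
```

Note that `resp` depends on the very parameters (means, covariances, priors) that the extremum equations above try to solve for, which is why a fixed-point/EM iteration is needed rather than a closed-form solution.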
EM Algorithm
Suppose we pick an initial configuration (just like in K-means). Recall the objective (with a change of sign):

$E \equiv -l(\theta) = -\ln \prod_{n=1}^{N} p(x^n) = -\sum_{n=1}^{N} \ln \left\{ p(x^n) \right\}$

After a single step of optimization:

$E^{new} - E^{old} = -\sum_{n=1}^{N} \ln \frac{p^{new}(x^n)}{p^{old}(x^n)} = -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)} \right)$

EM Algorithm
Multiply and divide inside the sum by $P^{old}(j \mid x^n)$, which sums to 1 over $j$:

$E^{new} - E^{old} = -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right)$

Each term now has the form $\ln \left( \sum_{j=1}^{M} \lambda_j y_j \right)$ with $\lambda_j = P^{old}(j \mid x^n)$.

Digression – Convexity
Definition: a function $f$ is convex on $[a, b]$ iff for any $x_1, x_2$ in $[a, b]$ and any $\lambda$ in $[0, 1]$:

$f\left( \lambda x_1 + (1 - \lambda) x_2 \right) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$

[Figure: a convex curve lying below the chord joining $(x_1, f(x_1))$ and $(x_2, f(x_2))$.]

Digression – Jensen's Inequality
If $f$ is a convex function:

$f\left( \sum_{j=1}^{M} \lambda_j x_j \right) \le \sum_{j=1}^{M} \lambda_j f(x_j), \qquad \forall \lambda:\; 0 \le \lambda_j \le 1,\; \sum_j \lambda_j = 1$

Equivalently: $f(E[x]) \le E[f(x)]$, or $f\left( \frac{1}{M} \sum_{j=1}^{M} x_j \right) \le \frac{1}{M} \sum_{j=1}^{M} f(x_j)$.

Flip the inequality if $f$ is concave.

Digression – Jensen's Inequality
Proof by induction:
a) The inequality is trivially true for any 2 points (definition of convexity).
b) Assume it is true for any $k-1$ points, and let $\lambda_i^* \equiv \lambda_i / (1 - \lambda_k)$:

$\sum_{i=1}^{k} \lambda_i f(x_i) = \lambda_k f(x_k) + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i^* f(x_i) \ge \lambda_k f(x_k) + (1 - \lambda_k)\, f\left( \sum_{i=1}^{k-1} \lambda_i^* x_i \right) \ge f\left( \lambda_k x_k + (1 - \lambda_k) \sum_{i=1}^{k-1} \lambda_i^* x_i \right) = f\left( \sum_{i=1}^{k} \lambda_i x_i \right)$

End of digression.

Back to EM
Change in the error:

$E^{new} - E^{old} = -\sum_{n=1}^{N} \ln \left( \sum_{j=1}^{M} P^{old}(j \mid x^n)\, \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)} \right)$

By Jensen's inequality ($\ln$ is concave, so $\ln \sum_j \lambda_j y_j \ge \sum_j \lambda_j \ln y_j$, and the minus sign flips the direction):

$E^{new} - E^{old} \le -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \ln \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)}$

Call the right-hand side "Q".

EM as Upper Bound Minimization
Then $E^{new} \le E^{old} + Q$ – an upper bound on $E^{new}(\theta^{new})$.
Some observations:
• Q is convex
• Q is a function of the new parameters $\theta^{new}$
• So is $E^{new}$
• If $\theta^{new} = \theta^{old}$ then $E^{new} = E^{old} + Q$
A step downhill in Q leads downhill in $E^{new}$!!!
[Figure: $E(\theta^{new})$ and the bound $E^{old} + Q(\theta^{new})$, which touch at $\theta^{new} = \theta^{old}$.]

EM Iteration
Given the initial $\theta$, minimize Q; then recompute the bound $E^{old} + Q$ around the new parameters and repeat.
[Figure: two successive bounds, each touching $E$ at the current parameter value and being minimized in turn.]
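As a numerical sanity check of this bound argument, the sketch below (illustrative, assuming NumPy; 1-D Gaussian components, and using the standard mixture M-step updates that the following slides derive) runs a single EM update on toy data and prints $E = -l(\theta)$ before and after. Because the step minimizes Q, the new value should never be larger than the old one.

```python
import numpy as np

def gauss(x, mu, var):
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)

def neg_log_lik(x, mu, var, prior):
    # E = -sum_n ln sum_j P(j) p(x^n | j)
    p = sum(prior[j] * gauss(x, mu[j], var[j]) for j in range(len(prior)))
    return -np.log(p).sum()

def em_step(x, mu, var, prior):
    # E-step: old responsibilities P_old(j | x^n)
    joint = np.stack([prior[j] * gauss(x, mu[j], var[j]) for j in range(len(prior))])
    resp = joint / joint.sum(axis=0, keepdims=True)
    # M-step: minimize Q over the new parameters (responsibility-weighted estimates)
    nk = resp.sum(axis=1)
    mu_new = (resp * x).sum(axis=1) / nk
    var_new = (resp * (x - mu_new[:, None]) ** 2).sum(axis=1) / nk
    prior_new = nk / len(x)
    return mu_new, var_new, prior_new

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])
mu, var, prior = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
e_old = neg_log_lik(x, mu, var, prior)
mu, var, prior = em_step(x, mu, var, prior)
e_new = neg_log_lik(x, mu, var, prior)
print(e_old, e_new)   # e_new <= e_old: the step downhill in Q went downhill in E
```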
EM (cont.)

$Q = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \ln \frac{P^{new}(j)\, p^{new}(x^n \mid j)}{p^{old}(x^n)\, P^{old}(j \mid x^n)}$

The terms that do not depend on the new parameters can be dropped:

$\tilde{Q} = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \ln \left\{ P^{new}(j)\, p^{new}(x^n \mid j) \right\}$

For a Gaussian mixture:

$\tilde{Q} = -\sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \left[ \ln P^{new}(j) + \ln G_j(x^n) \right]$

As before – differentiate, set to 0, solve for the parameters.

EM (cont.)
Straightforward for the means and covariances:

$\hat{\mu}_j = \frac{\sum_{n=1}^{N} P^{old}(j \mid x^n)\, x^n}{\sum_{n=1}^{N} P^{old}(j \mid x^n)}$ – a convex sum, weighted w.r.t. the previous estimate

$\hat{\Sigma}_j = \frac{\sum_{n=1}^{N} P^{old}(j \mid x^n)\, (x^n - \hat{\mu}_j)(x^n - \hat{\mu}_j)^T}{\sum_{n=1}^{N} P^{old}(j \mid x^n)}$ – a convex sum, weighted w.r.t. the previous estimate

EM (cont.)
We need to enforce the sum-to-one constraint for P(j):

$J_P = \tilde{Q} + \lambda \left( \sum_{j=1}^{M} P^{new}(j) - 1 \right)$

$\frac{\partial J_P}{\partial P^{new}(j)} = -\sum_{n=1}^{N} \frac{P^{old}(j \mid x^n)}{P^{new}(j)} + \lambda = 0 \;\Rightarrow\; \lambda\, P^{new}(j) = \sum_{n=1}^{N} P^{old}(j \mid x^n)$

$\lambda \sum_{j=1}^{M} P^{new}(j) = \sum_{n=1}^{N} \sum_{j=1}^{M} P^{old}(j \mid x^n) \;\Rightarrow\; \lambda = N \;\Rightarrow\; P^{new}(j) = \frac{1}{N} \sum_{n=1}^{N} P^{old}(j \mid x^n)$

EM Example
[Figure: EM fit of a mixture with $N_c = 3$ components.]

EM Illustration
[Figure: three component means $m_1$, $m_2$, $m_3$ and the posteriors $P(j=1 \mid x)$, $P(j=2 \mid x)$, $P(j=3 \mid x)$ for a data point; a second panel contrasts a labeled and an unlabeled point.]
P(j|x) tells how much the data point affects each cluster, unlike in K-means.
You can also manipulate P(j|x), e.g. to handle partially labeled data (labeled vs. unlabeled points).

EM vs K-Means
Furthermore, P(j|x) can be replaced with:

$\tilde{P}(j \mid x) = \frac{P(j \mid x)\, e^{\gamma P(j \mid x)}}{\sum_k P(k \mid x)\, e^{\gamma P(k \mid x)}}$

If $\gamma = 0$, then $\tilde{P}(j \mid x) = P(j \mid x)$.
Now let's relax $\gamma$:

$\lim_{\gamma \to \infty} \tilde{P}(j \mid x) = \delta\left( P(j \mid x),\; \max_k P(k \mid x) \right)$

i.e., all the weight goes to the most probable cluster – this is K-Means!!!

Hierarchical Clustering
There are 2 ways to do it:
• Agglomerative (bottom-up)
• Divisive (top-down)
[Figure: a dendrogram over samples $x_1 \ldots x_6$; cutting it at dissimilarity thresholds $T = 2$ and $T = 3$ yields different numbers of clusters.]
Different thresholds induce different cluster configurations. The stopping criterion is either a number of clusters or a distance threshold.

Hierarchical Agglomerative Clustering
General structure:

Initialize: $K$, $\hat{K} \leftarrow N$, $D_n \leftarrow \{x_n\}$, $n = 1..N$
do
  $\hat{K} \leftarrow \hat{K} - 1$
  $i, j = \arg\min_{l,m} d(D_l, D_m)$
  merge($D_i$, $D_j$)
until $\hat{K} == K$

We need to specify $d$, e.g.:

$d_{mean}(D_i, D_j) = \| m_i - m_j \|$
$d_{min}(D_i, D_j) = \min_{x_1 \in D_i,\, x_2 \in D_j} \| x_1 - x_2 \|$
$d_{max}(D_i, D_j) = \max_{x_1 \in D_i,\, x_2 \in D_j} \| x_1 - x_2 \|$

Each induces a different algorithm.

Single Linkage Algorithm
Choosing $d = d_{min}$ results in the Nearest Neighbor Algorithm (a.k.a. the single linkage algorithm, a.k.a. the minimum algorithm).
Each cluster is a minimal spanning tree of the data in the cluster. Identifies clusters that are well separated.
[Figure: single-linkage result on 2-D data with 2 clusters.]

Complete Linkage Algorithm
Choosing $d = d_{max}$ results in the Farthest Neighbor Algorithm (a.k.a. the complete linkage algorithm, a.k.a. the maximum algorithm).
Each cluster is a complete subgraph of the data. Identifies clusters that are well localized.
[Figure: complete-linkage result on 2-D data with 2 clusters.]

Summary
• General concerns about the choice of similarity metric
• K-means algorithm – simple but relies on Euclidean distances
• IsoData – an old-school step towards model selection
• EM – the "statistician's K-means" – simple, general and convenient
• Some hierarchical clustering schemes
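To make the agglomerative procedure above concrete, here is a minimal brute-force single-linkage ($d = d_{min}$) sketch, assuming NumPy; all names are illustrative, and this is not meant as an efficient implementation.

```python
import numpy as np

def single_linkage(x, k):
    """Agglomerative clustering with d = d_min (nearest-neighbor / single linkage).

    x: (N, d) data, k: desired number of clusters. Returns a list of index lists.
    """
    clusters = [[i] for i in range(len(x))]                        # start with N singletons
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)   # pairwise sample distances
    while len(clusters) > k:
        # find the pair of clusters with the smallest minimum inter-point distance
        best, best_pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d_min = dist[np.ix_(clusters[a], clusters[b])].min()
                if d_min < best:
                    best, best_pair = d_min, (a, b)
        a, b = best_pair
        clusters[a] = clusters[a] + clusters[b]                    # merge the closest pair
        del clusters[b]
    return clusters

# toy usage: two well-separated blobs should come out as the two clusters
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(5, 0.3, (20, 2))])
print([sorted(c) for c in single_linkage(pts, k=2)])
```

Swapping `.min()` for `.max()` in the inner loop turns this into the complete-linkage (farthest neighbor) variant.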