Clustering with Bregman Divergences
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon, Joydeep Ghosh
Presented by Rohit Gupta
CSci 8980: Machine Learning

Outline
• Bregman Divergences – Basics and Examples
• Bregman Information
• Bregman Hard Clustering
• The Exponential Family and its Connection to Bregman Divergences
• Bregman Soft Clustering
• Experiments and Results
• Conclusions

Bregman Hard and Soft Clustering
• Most existing parametric clustering methods partition the data into a pre-specified number of partitions, with a cluster representative corresponding to each partition/cluster
– Hard clustering: a disjoint partitioning of the data such that each data point belongs to exactly one partition
– Soft clustering: each data point has a certain probability of belonging to each partition
• Hard clustering can be seen as soft clustering in which the probabilities are either 0 or 1

Distortion or Loss Functions
• Squared Euclidean distance is the most commonly used loss function
– Extensive literature
– Easy to use – leads to simple calculations
– Not appropriate for some domains
– Difficult to compute for sparse data (missing dimensions)
– Example: the iterative k-means algorithm
• Question: How should a distortion/loss function be chosen for a given problem?

Bregman Divergences
• Ref: Definition 1 in the paper. Let φ : S → ℝ be a strictly convex, differentiable function on a convex set S. The Bregman divergence d_φ is defined as
  d_φ(x, y) = φ(x) − φ(y) − ⟨x − y, ∇φ(y)⟩
• Examples: squared Euclidean distance, relative entropy (KL divergence), Itakura–Saito distance

Few Take-Home Points on Bregman Divergences
1. d_φ(x, y) ≠ d_φ(y, x) in general (not symmetric, and therefore the triangle inequality does not hold)
2. d_φ(x, y) ≥ 0, with d_φ(x, y) = 0 if and only if x = y
3. Three-point property:
   d_φ(x, y) = d_φ(x, z) + d_φ(z, y) − ⟨x − z, ∇φ(y) − ∇φ(z)⟩
4.
Strictly convex in the first argument, but not necessarily in the second

Bregman Information
• The Bregman information of a random variable X is given by
  I_φ(X) = min_{s ∈ S} E[d_φ(X, s)]
• The optimal vector that achieves this minimum is called the Bregman representative of X
• For squared loss, the minimum loss is the variance, E[‖X − μ‖²]
• The best predictor of the random variable is its mean

Bregman Information
• Bregman information is the minimum loss, attained at arg min_{s} E[d_φ(X, s)]
• Points to note – the representative defined above:
– always exists and is uniquely determined
– does not depend on the choice of Bregman divergence
– is the expectation of the random variable: μ = E[X] is the minimizer

Bregman Hard Clustering
• The problem is posed as a quantization problem that minimizes the loss in Bregman information
• Very similar to squared-distance-based iterative k-means, except that the distortion function is a general Bregman divergence
• The expected Bregman divergence of the data points from their Bregman representatives is minimized
• Procedure: initialize the representatives; assign points to them; re-estimate the representatives

Bregman Hard Clustering
• Algorithm:
  Initialize {μ_h}, h = 1, …, k
  While not converged:
    Step 1: Assign each data point x to the nearest cluster X_h, where
            h = arg min_{h′} d_φ(x, μ_{h′})
    Step 2: Re-estimate the representatives as cluster means,
            μ_h = (1/n_h) Σ_{x ∈ X_h} x

Take-Home Points
• Exhaustiveness: the Bregman hard clustering algorithm works for all Bregman divergences – and, in fact, only for Bregman divergences
– The arithmetic mean is the best predictor only for Bregman divergences
– It is possible to design clustering algorithms based on distortion functions that are not Bregman divergences, but then the cluster representative would not be the arithmetic mean (the expectation)
• Linear separators: the clusters obtained are separated by hyperplanes

Take-Home Points
• Scalability: each iteration of the Bregman hard clustering algorithm is linear in the number of data points and the number of
desired clusters
• Applicability to mixed data types: allows choosing different Bregman divergences that are meaningful and appropriate for different subsets of the features
• Also guarantees that the objective function decreases monotonically until convergence

Exponential Families and Bregman Divergences
• [Forster & Warmuth] remarked that the log-likelihood of the density of an exponential family distribution can be written as
  log p_{(ψ,θ)}(x) = −d_φ(x, μ(θ)) + log b_φ(x)
  Here b_φ is a uniquely determined function, μ is the expectation parameter, and θ is the natural parameter
• Points to note:
– ψ is the cumulant function, and it determines the exponential family
– θ fixes the distribution within the family

Bregman Soft Clustering
• The problem is posed as parameter estimation for mixture models based on exponential family distributions
• The EM algorithm is used to design the Bregman soft clustering algorithm
• Maximizing the log-likelihood of the data in the EM algorithm is equivalent to minimizing the Bregman divergence in the Bregman soft clustering algorithm (refer to the previous slide)
• There is a corresponding Bregman divergence for a given exponential family

Bregman Soft Clustering
• Algorithm:
  Initialize {π_h, μ_h}, h = 1, …, k
  While not converged:
    Step 1 (Expectation): compute the posterior probabilities for all x and h,
           p(h | x) ∝ π_h exp(−d_φ(x, μ_h)) b_φ(x)
    Step 2 (Maximization): recompute the parameters for all h so that the Bregman divergence is minimized,
           π_h = (1/n) Σ_x p(h | x)
           μ_h = Σ_x p(h | x) x / Σ_x p(h | x)

Experiments and Results
• Question: How does the quality of clustering depend on the appropriateness of the Bregman divergence?
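The E and M steps of the Bregman soft clustering algorithm from the previous slide can be sketched in code. This is a minimal illustration, not the paper's implementation: it uses the squared Euclidean divergence with φ(x) = ‖x‖²/2, for which the mixture components reduce to fixed unit-variance Gaussians, and all function and variable names are my own. Note that the base measure b_φ(x) cancels when the posterior is normalized, so it never needs to be computed.

```python
import numpy as np

def sq_euclidean(X, mu):
    # d_phi(x, mu) for phi(x) = ||x||^2 / 2 (matches unit-variance Gaussians)
    return 0.5 * np.sum((X - mu) ** 2, axis=-1)

def bregman_soft_cluster(X, k, div=sq_euclidean, n_iter=100, seed=0):
    """EM written purely in terms of the Bregman divergence."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    pi = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: p(h|x) proportional to pi_h * exp(-d_phi(x, mu_h));
        # b_phi(x) is the same in numerator and denominator, so it cancels
        logw = np.log(pi) - np.stack([div(X, m) for m in mu], axis=1)
        logw -= logw.max(axis=1, keepdims=True)   # numerical stability
        post = np.exp(logw)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: pi_h = average posterior, mu_h = posterior-weighted mean of x
        pi = post.mean(axis=0)
        mu = (post.T @ X) / post.sum(axis=0)[:, None]
    return pi, mu, post

# Illustration on two well-separated 1-D clusters (sizes and means are arbitrary)
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.2, (50, 1)), rng.normal(5.0, 0.2, (50, 1))])
pi, mu, post = bregman_soft_cluster(X, k=2)
```

Swapping `div` for another Bregman divergence changes the assumed exponential family, while the M-step stays a weighted arithmetic mean in every case.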
• Experiments performed on synthetic data showed that cluster quality is better when the matching Bregman divergence is used than when a non-matching one is used
• Experiment 1:
– Three 1-dimensional datasets of 100 samples each were generated from mixture models of Gaussian, Poisson, and binomial distributions, respectively
– The datasets were clustered using three versions of Bregman hard clustering, corresponding to the different Bregman divergences

Experiments and Results
• Mutual information was used to compare the results
• Table 3 in the paper shows large values along the diagonal, which demonstrates the importance of using the appropriate Bregman divergence
• Experiment 2: the same as Experiment 1, except with multidimensional data. Table 4 in the paper shows the results, which again support the same observation

Conclusions
• Hard and soft clustering algorithms are presented that minimize loss functions based on Bregman divergences
• It was shown that there is a one-to-one mapping between regular exponential families and regular Bregman divergences – this helped formulate the soft clustering algorithm
• A connection between Bregman divergences and Shannon's rate-distortion theory is also established
• Experiments on synthetic data showed the importance of choosing the Bregman divergence that matches the corresponding family of exponential distributions
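The Bregman hard clustering algorithm and the matching-divergence idea behind Experiment 1 can be sketched together: cluster Poisson-distributed counts using relative entropy (generalized KL), the Bregman divergence matching the Poisson family. This is a minimal sketch, not the paper's code: the quantile initialization, the component means (10 and 80), the sample sizes, and all names are my own choices.

```python
import numpy as np

def bregman_hard_cluster(X, k, div, n_iter=50):
    """Bregman hard clustering: assign each point to the representative with
    the smallest divergence, then re-estimate each representative as the
    arithmetic mean of its cluster -- the optimal representative for any
    Bregman divergence.  Quantile initialization is an arbitrary choice."""
    mu = np.quantile(X, (np.arange(k) + 0.5) / k, axis=0)
    for _ in range(n_iter):
        # Step 1: assign x to h = argmin_h d_phi(x, mu_h)
        labels = np.stack([div(X, m) for m in mu], axis=1).argmin(axis=1)
        # Step 2: mu_h = mean of cluster X_h (kept unchanged if a cluster is empty)
        mu = np.stack([X[labels == h].mean(axis=0) if np.any(labels == h) else mu[h]
                       for h in range(k)])
    return labels, mu

def sq_div(X, m):
    # squared Euclidean distance -- the divergence matching the Gaussian family
    return np.sum((X - m) ** 2, axis=-1)

def kl_div(X, m):
    # generalized KL / relative entropy -- the divergence matching the Poisson
    # family; the inner where() guards log() against zero counts
    t = np.where(X > 0, X * np.log(np.where(X > 0, X, 1.0) / m), 0.0)
    return np.sum(t - X + m, axis=-1)

# Miniature version of Experiment 1: a two-component Poisson mixture,
# clustered with its matching divergence
rng = np.random.default_rng(0)
X = np.concatenate([rng.poisson(10, 60),
                    rng.poisson(80, 60)]).reshape(-1, 1).astype(float)
labels, mu = bregman_hard_cluster(X, k=2, div=kl_div)
```

Passing `div=sq_div` instead recovers classical k-means on the same data; the paper's Tables 3 and 4 quantify (via mutual information) how much the matching choice helps.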