
Clustering with Bregman Divergences
Arindam Banerjee, Srujana Merugu, Inderjit S. Dhillon,
Joydeep Ghosh
Presented by
Rohit Gupta
CSci 8980: Machine Learning
Outline
• Bregman Divergences – Basics and Examples
• Bregman Information
• Bregman Hard Clustering
• The Exponential Family and connection to Bregman
Divergence
• Bregman Soft Clustering
• Experiments and Results
• Conclusions
Bregman Hard and Soft Clustering
• Most existing parametric clustering methods partition the data into a pre-specified number of partitions, with a cluster representative corresponding to every partition/cluster
– Hard clustering: a disjoint partitioning of the data such that each data point belongs to exactly one partition
– Soft clustering: each data point has a certain probability of belonging to each of the partitions
– Hard clustering can be seen as soft clustering in which the probabilities are either 0 or 1
Distortion or Loss Functions
• Squared Euclidean distance is the most commonly used loss function
– Extensive literature
– Easy to use: leads to simple calculations
– Not appropriate for some domains
– Difficult to compute for sparse data (missing dimensions)
– Example: the iterative K-means algorithm
• Question: how should a distortion/loss function be chosen for a given problem?
Bregman Divergences
• Ref: Definition 1 in the paper:
Let \phi : S \to \mathbb{R} be a strictly convex, differentiable function on a convex set S \subseteq \mathbb{R}^d. The Bregman divergence d_\phi is defined as:
d_\phi(x, y) = \phi(x) - \phi(y) - \langle x - y, \nabla\phi(y) \rangle
• Examples:
– Squared Euclidean distance
– Relative entropy (KL divergence)
– Itakura-Saito distance
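To make the definition concrete, here is a small Python sketch (mine, not from the slides) that builds d_\phi from a convex function \phi and its gradient, and instantiates it for two of the examples above:

```python
import numpy as np

def bregman_divergence(phi, grad_phi):
    """Return d_phi(x, y) = phi(x) - phi(y) - <x - y, grad_phi(y)>."""
    def d(x, y):
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        return float(phi(x) - phi(y) - np.dot(x - y, grad_phi(y)))
    return d

# Squared Euclidean distance: phi(x) = ||x||^2  =>  d(x, y) = ||x - y||^2
sq_euclidean = bregman_divergence(lambda x: np.dot(x, x), lambda y: 2 * y)

# Generalized KL / relative entropy: phi(x) = sum_i x_i log x_i;
# for probability vectors this reduces to the usual KL divergence.
kl_divergence = bregman_divergence(lambda x: np.sum(x * np.log(x)),
                                   lambda y: np.log(y) + 1)

x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(sq_euclidean(x, y))   # 0.18
print(kl_divergence(x, y))  # sum_i x_i log(x_i / y_i) ~= 0.193
```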
A Few Take-Home Points on Bregman Divergences
1. d_\phi(x, y) \neq d_\phi(y, x) in general
(Not symmetric, and therefore the triangle inequality does not hold)
2. d_\phi(x, y) \geq 0, with d_\phi(x, y) > 0 if x \neq y and d_\phi(x, y) = 0 if x = y
3. Three-point property:
d_\phi(x, y) = d_\phi(x, z) + d_\phi(z, y) + \langle x - z, \nabla\phi(z) - \nabla\phi(y) \rangle
4. Strictly convex in the first argument, but not necessarily so in the second argument
Bregman Information
• The Bregman Information of a random variable X is given by
I_\phi(X) = \min_{s \in S} E[d_\phi(X, s)]
• The optimal vector that achieves this minimal value is called the Bregman representative of X
• For squared loss, the minimum loss is the variance, E[\|X - \mu\|^2]
• The best predictor of the random variable is the mean
Bregman Information
• Bregman Information is the minimum loss, which corresponds to
\mu = \arg\min_{s} E[d_\phi(X, s)]
• Points to note:
– the representative defined above always exists
– it is uniquely determined
– it does not depend on the choice of Bregman divergence
– the expectation of the random variable X is the minimizer
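A short justification of the last point, written out here for completeness (it follows in one line from the definition of d_\phi): for any s, with \mu = E[X],
E[d_\phi(X, s)] = E[\phi(X)] - \phi(s) - \langle E[X] - s, \nabla\phi(s) \rangle
= E[d_\phi(X, \mu)] + \big( \phi(\mu) - \phi(s) - \langle \mu - s, \nabla\phi(s) \rangle \big)
= E[d_\phi(X, \mu)] + d_\phi(\mu, s) \;\geq\; E[d_\phi(X, \mu)],
with equality if and only if s = \mu, so the expectation is always the Bregman representative, whatever \phi is.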
Bregman Hard Clustering
• The problem is posed as a quantization problem that involves minimizing the loss in Bregman information
• Very similar to squared-distance-based iterative K-means, except that the distortion function can be any Bregman divergence
• The expected Bregman divergence of the data points from their Bregman representatives is minimized
• Procedure:
– Initialize the representatives
– Assign points to them
– Re-estimate the representatives
Bregman Hard Clustering
• Algorithm:
Initialize \{\mu_h\}_{h=1}^{k}
Repeat until convergence:
Step 1: Assign each data point x to the nearest cluster X_h, where
h = \arg\min_{h'} d_\phi(x, \mu_{h'})
Step 2: Re-estimate the representatives as the cluster means,
\mu_h = \frac{1}{|X_h|} \sum_{x \in X_h} x
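A minimal Python sketch of this loop, assuming a divergence function d_phi(x, mu) such as the sq_euclidean or kl_divergence defined earlier (the function name and defaults are my own, not from the paper):

```python
import numpy as np

def bregman_hard_clustering(X, k, d_phi, max_iter=100, seed=0):
    """K-means-style alternation where the distortion is an arbitrary
    Bregman divergence d_phi(x, mu); X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=k, replace=False)]  # initial representatives
    for _ in range(max_iter):
        # Assignment step: each point goes to the representative with the
        # smallest Bregman divergence.
        div = np.array([[d_phi(x, m) for m in mu] for x in X])
        labels = div.argmin(axis=1)
        # Re-estimation step: the new representative of each cluster is its
        # arithmetic mean, which is optimal for every Bregman divergence.
        new_mu = np.array([X[labels == h].mean(axis=0) if np.any(labels == h)
                           else mu[h] for h in range(k)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu

# Example usage (sq_euclidean as defined in the earlier sketch):
# labels, mu = bregman_hard_clustering(X, k=3, d_phi=sq_euclidean)
```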
Take home points
• Exhaustiveness: the Bregman hard clustering algorithm works for all Bregman divergences, and in fact only for Bregman divergences
– The arithmetic mean is the optimal representative only for Bregman divergences
– It is possible to design clustering algorithms based on distortion functions that are not Bregman divergences, but in that case the cluster representative would not be the arithmetic mean/expectation
• Linear separators: the clusters obtained are separated by hyperplanes (see the identity below)
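The identity behind this point (a one-line check from the definition, not spelled out on the slide): for two representatives \mu_1, \mu_2,
d_\phi(x, \mu_1) - d_\phi(x, \mu_2) = \langle x, \nabla\phi(\mu_2) - \nabla\phi(\mu_1) \rangle + \big( \phi(\mu_2) - \phi(\mu_1) + \langle \mu_1, \nabla\phi(\mu_1) \rangle - \langle \mu_2, \nabla\phi(\mu_2) \rangle \big),
which is affine in x because the \phi(x) terms cancel; the decision boundary \{ x : d_\phi(x, \mu_1) = d_\phi(x, \mu_2) \} is therefore a hyperplane.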
Take home points
• Scalability: each iteration of the Bregman hard clustering algorithm is linear in the number of data points and in the number of desired clusters
• Applicability to mixed data types: allows choosing different Bregman divergences that are meaningful and appropriate for different subsets of the features
• Also guarantees that the objective function decreases monotonically until convergence
Exponential families and Bregman Divergences
• [Forster & Warmuth] remarked that the log-likelihood of the
density of an exponential family distribution can be written as
follows:
log( p( , ) ( x))  d ( x,  ( ))  log(b ( x))
Here b is any uniquely determined function,
 is the expectation parameter and  is some other natural parameter
• Points to note:
 is cumulant function and it determines the exponential family
 fixes the distribution in the family
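A standard concrete instance (not written out on the slide): for a unit-variance spherical Gaussian with mean \mu, take \phi(x) = \tfrac{1}{2}\|x\|^2, so d_\phi(x, \mu) = \tfrac{1}{2}\|x - \mu\|^2, and
\log p(x; \mu) = -\tfrac{1}{2}\|x - \mu\|^2 - \tfrac{d}{2}\log(2\pi) = -d_\phi(x, \mu) + \log b_\phi(x), \qquad b_\phi(x) = (2\pi)^{-d/2},
so maximizing the Gaussian likelihood is exactly minimizing squared Euclidean distortion.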
Bregman Soft Clustering
• The problem is posed as a parameter estimation problem for mixture models based on exponential family distributions
• The EM algorithm is used to design the Bregman soft clustering algorithm
• Maximizing the log-likelihood of the data in the EM algorithm is equivalent to minimizing the Bregman divergence in the Bregman soft clustering algorithm (refer to the previous slide)
• Every regular exponential family has a corresponding Bregman divergence
Bregman Soft Clustering
• Algorithm:
Initialize \{\mu_h, \pi_h\}_{h=1}^{k}
Repeat until convergence:
Step 1: Expectation step
Compute the posterior probability for all x, h:
p(h \mid x) \propto \pi_h \exp(-d_\phi(x, \mu_h))\, b_\phi(x)
Step 2: Maximization step
Re-compute the parameters for all h such that the expected Bregman divergence is minimized:
\pi_h = \frac{1}{n} \sum_{x} p(h \mid x)
\mu_h = \frac{\sum_{x} p(h \mid x)\, x}{\sum_{x} p(h \mid x)}
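A minimal Python sketch of this EM loop, under the same assumptions as the hard clustering sketch above (the function name and defaults are my own):

```python
import numpy as np

def bregman_soft_clustering(X, k, d_phi, max_iter=100, seed=0):
    """EM-style Bregman soft clustering: mixture weights pi_h and expectation
    parameters mu_h are re-estimated from soft assignments p(h | x)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    mu = X[rng.choice(n, size=k, replace=False)]  # initial representatives
    pi = np.full(k, 1.0 / k)                      # initial mixture weights
    for _ in range(max_iter):
        # E-step: p(h | x) is proportional to pi_h * exp(-d_phi(x, mu_h));
        # the b_phi(x) factor cancels in the normalization over h.
        div = np.array([[d_phi(x, m) for m in mu] for x in X])
        post = pi * np.exp(-div)
        post /= post.sum(axis=1, keepdims=True)
        # M-step: mixture weights and representatives from the posteriors.
        pi = post.mean(axis=0)
        mu = (post.T @ X) / post.sum(axis=0)[:, None]
    return post, pi, mu
```

With d_phi the squared Euclidean distance, this reduces to EM for a mixture of unit-variance spherical Gaussians in which only the means and mixture weights are re-estimated.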
Experiments and Results
• Question: how does the quality of clustering depend on the appropriateness of the Bregman divergence?
• Experiments performed on synthetic data showed that cluster quality is better when the matching Bregman divergence is used rather than a non-matching one
• Experiment 1:
– Three 1-dimensional datasets of 100 samples each were generated from mixture models of Gaussian, Poisson, and Binomial distributions, respectively
– The datasets were clustered using three versions of Bregman hard clustering corresponding to different Bregman divergences
Experiments and Results
– Mutual information is used to compare the results (a toy re-creation of this setup is sketched below)
– Table 3 in the paper shows large numbers along the diagonal, which shows the importance of using the appropriate Bregman divergence
• Experiment 2:
– Same as Experiment 1, except with multi-dimensional data
– Table 4 in the paper shows the results, which again support the same observation as above
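A hypothetical re-creation of the spirit of Experiment 1 in Python, reusing bregman_hard_clustering from the earlier sketch; this is my own toy setup, not the authors' exact protocol, tables, or numbers:

```python
import numpy as np
from sklearn.metrics import normalized_mutual_info_score

# 1-D data drawn from a 3-component Poisson mixture.
rng = np.random.default_rng(0)
means = np.array([5.0, 15.0, 40.0])
labels_true = rng.integers(0, 3, size=100)
X = rng.poisson(means[labels_true]).astype(float).reshape(-1, 1)
X = np.maximum(X, 1e-6)  # guard against log(0) in the I-divergence below

# Matching divergence for Poisson data: generalized I-divergence
# d(x, y) = x log(x / y) - x + y; non-matching baseline: squared Euclidean.
i_div = lambda x, y: float(np.sum(x * np.log(x / y) - x + y))
sq_euclidean = lambda x, y: float(np.sum((x - y) ** 2))

# Compare predicted clusters against the generating labels via mutual information.
for name, d in [("I-divergence", i_div), ("squared Euclidean", sq_euclidean)]:
    labels_pred, _ = bregman_hard_clustering(X, k=3, d_phi=d)
    print(name, normalized_mutual_info_score(labels_true, labels_pred))
```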
Conclusions
• Hard and soft clustering algorithms are presented that minimize a loss function based on Bregman divergences
• It was shown that there is a one-to-one mapping between regular exponential families and regular Bregman divergences; this helped in formulating the soft clustering algorithm
• A connection between Bregman divergences and Shannon's rate-distortion theory is also established
• Experiments on synthetic data showed the importance of choosing the right Bregman divergence for the corresponding exponential family distribution