Clustering
Lecture 5: Mixture Model
Jing Gao
SUNY Buffalo
Outline
• Basics
– Motivation, definition, evaluation
• Methods
– Partitional
– Hierarchical
– Density-based
– Mixture model
– Spectral methods
• Advanced topics
– Clustering ensemble
– Clustering in MapReduce
– Semi-supervised clustering, subspace clustering, co-clustering, etc.
Using Probabilistic Models for Clustering
• Hard vs. soft clustering
– Hard clustering: Every point belongs to exactly one cluster
– Soft clustering: Every point belongs to several clusters, each with a certain degree of membership
• Probabilistic clustering
– Each cluster is mathematically represented by a
parametric distribution
– The entire data set is modeled by a mixture of these
distributions
Gaussian Distribution
[Figure: Gaussian density f(x) vs. x; changing μ shifts the distribution left or right, changing σ increases or decreases the spread]
Probability density function f(x) is a function of x given μ and σ:

$$N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right)$$
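As a quick check of the density formula above, here is a minimal NumPy sketch (not from the slides) that evaluates N(x | μ, σ) and compares it against scipy.stats.norm.pdf; the grid of x values is an arbitrary choice.

```python
import numpy as np
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    # N(x | mu, sigma) = 1 / (sqrt(2*pi) * sigma) * exp(-0.5 * ((x - mu) / sigma)**2)
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-4.0, 4.0, 9)
print(gaussian_pdf(x, mu=0.0, sigma=1.0))
print(norm.pdf(x, loc=0.0, scale=1.0))  # should agree with the line above
```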
Likelihood
Which Gaussian distribution is more likely to generate the data?

[Figure: data points on the x-axis with candidate Gaussian densities f(x)]
Define likelihood as a function of μ and σ given x1, x2, …, xn:

$$L(\mu, \sigma) = \prod_{i=1}^{n} N(x_i \mid \mu, \sigma)$$
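A small sketch of this idea (the toy data and candidate parameters are assumptions, not from the slides): evaluate the log of the likelihood above for two candidate Gaussians; the candidate with the larger value is the one more likely to have generated the data.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=100)  # toy sample

# log L(mu, sigma) = sum_i log N(x_i | mu, sigma); logs avoid numerical underflow
loglik_a = norm.logpdf(data, loc=2.0, scale=1.0).sum()
loglik_b = norm.logpdf(data, loc=0.0, scale=3.0).sum()
print(loglik_a, loglik_b)  # candidate A wins on this sample
```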
Gaussian Distribution
• Multivariate Gaussian with mean μ and covariance Σ:

$$N(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}}\exp\!\left(-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right)$$
• Log likelihood

$$L(\mu, \Sigma) = \sum_{i=1}^{n} \ln N(x_i \mid \mu, \Sigma) = \sum_{i=1}^{n}\left(-\frac{1}{2}(x_i-\mu)^{T}\Sigma^{-1}(x_i-\mu) - \frac{1}{2}\ln|\Sigma|\right) + \text{const}$$
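A minimal sketch of evaluating this log likelihood with SciPy (the 2-D toy parameters and data are assumptions for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
mu = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])
X = rng.multivariate_normal(mu, sigma, size=200)  # toy (n, d) data

# L(mu, Sigma) = sum_i ln N(x_i | mu, Sigma)
loglik = multivariate_normal(mean=mu, cov=sigma).logpdf(X).sum()
print(loglik)
```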
Maximum Likelihood Estimate
• MLE
– Find model parameters μ and Σ that maximize the log likelihood L(μ, Σ)
• MLE for Gaussian: the sample mean and sample covariance

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\Sigma} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat{\mu})(x_i-\hat{\mu})^{T}$$
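A short sketch of these closed-form estimates in NumPy (the toy data matrix X is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], [[2.0, 0.6], [0.6, 1.0]], size=500)

n = X.shape[0]
mu_hat = X.mean(axis=0)            # (1/n) * sum_i x_i
diff = X - mu_hat
sigma_hat = diff.T @ diff / n      # (1/n) * sum_i (x_i - mu_hat)(x_i - mu_hat)^T
print(mu_hat)
print(sigma_hat)                   # both should be close to the true parameters
```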
Gaussian Mixture
• Linear combination of Gaussians:

$$p(x) = \sum_{k=1}^{K} \pi_k\, N(x \mid \mu_k, \Sigma_k)$$

where the mixing weights satisfy $0 \le \pi_k \le 1$ and $\sum_{k=1}^{K}\pi_k = 1$
– Parameters to be estimated: π_k, μ_k, Σ_k for k = 1, …, K
Gaussian Mixture
• To generate a data point:
– first pick one of the K components with probability π_k
– then draw a sample x from that component's distribution N(x | μ_k, Σ_k)
• Each data point is generated by one of the K components; a latent variable z indicating that component is associated with each point
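A hedged sketch of this generative process (the parameter values and variable names are toy assumptions): pick a component index z_i with probability π_k, then draw x_i from that component's Gaussian.

```python
import numpy as np

rng = np.random.default_rng(3)
pis = np.array([0.5, 0.3, 0.2])                          # mixing weights pi_k
mus = np.array([[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]])    # component means
sigmas = np.array([np.eye(2), 0.5 * np.eye(2), 2.0 * np.eye(2)])

n = 1000
z = rng.choice(len(pis), size=n, p=pis)                  # latent component of each point
X = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in z])
print(np.bincount(z) / n)                                # empirical proportions, roughly pi
```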
Gaussian Mixture
• Maximize the log likelihood of the observed data
• Without knowing the values of the latent variables, we have to maximize the incomplete log likelihood:

$$\ln p(X \mid \pi, \mu, \Sigma) = \sum_{i=1}^{n}\ln\!\left(\sum_{k=1}^{K}\pi_k\, N(x_i \mid \mu_k, \Sigma_k)\right)$$
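A minimal sketch of computing this incomplete log likelihood, using logsumexp for numerical stability (the function name and toy inputs are assumptions):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def incomplete_loglik(X, pis, mus, sigmas):
    # log_prob[i, k] = ln pi_k + ln N(x_i | mu_k, Sigma_k)
    log_prob = np.stack([np.log(pis[k]) + multivariate_normal(mus[k], sigmas[k]).logpdf(X)
                         for k in range(len(pis))], axis=1)
    # sum_i ln( sum_k pi_k N(x_i | mu_k, Sigma_k) )
    return logsumexp(log_prob, axis=1).sum()

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 2))
print(incomplete_loglik(X, np.array([0.6, 0.4]),
                        np.zeros((2, 2)), np.array([np.eye(2), np.eye(2)])))
```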
Expectation-Maximization (EM) Algorithm
• E-step: for given parameter values we can compute the expected values of the latent variables (the responsibilities of the data points):

$$\gamma(z_{ik}) = \frac{\pi_k\, N(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K}\pi_j\, N(x_i \mid \mu_j, \Sigma_j)}$$

– Note that γ(z_ik) lies in [0, 1] instead of being a hard 0/1 assignment, but we still have Σ_k γ(z_ik) = 1
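A sketch of this E-step in NumPy/SciPy, computing the responsibilities in log space for stability (the function and variable names are my own, not from the slides):

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, sigmas):
    # log_prob[i, k] = ln pi_k + ln N(x_i | mu_k, Sigma_k)
    log_prob = np.stack([np.log(pis[k]) + multivariate_normal(mus[k], sigmas[k]).logpdf(X)
                         for k in range(len(pis))], axis=1)
    # gamma[i, k] = pi_k N(x_i|mu_k,Sigma_k) / sum_j pi_j N(x_i|mu_j,Sigma_j)
    gamma = np.exp(log_prob - logsumexp(log_prob, axis=1, keepdims=True))
    return gamma  # each row lies in [0, 1] and sums to 1 (soft assignment)
```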
Expectation-Maximization (EM) Algorithm
• M-step: maximize the expected complete log likelihood with respect to the parameters
• Parameter update, with N_k = Σ_i γ(z_ik):

$$\mu_k = \frac{1}{N_k}\sum_{i=1}^{n}\gamma(z_{ik})\, x_i, \qquad \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{n}\gamma(z_{ik})(x_i-\mu_k)(x_i-\mu_k)^{T}, \qquad \pi_k = \frac{N_k}{n}$$
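A matching sketch of the M-step updates, taking the responsibilities gamma from the E-step sketch above (names are my own):

```python
import numpy as np

def m_step(X, gamma):
    n, d = X.shape
    Nk = gamma.sum(axis=0)                       # effective number of points per cluster
    pis = Nk / n                                 # pi_k = N_k / n
    mus = (gamma.T @ X) / Nk[:, None]            # mu_k = (1/N_k) sum_i gamma_ik x_i
    sigmas = np.empty((len(Nk), d, d))
    for k in range(len(Nk)):
        diff = X - mus[k]
        # Sigma_k = (1/N_k) sum_i gamma_ik (x_i - mu_k)(x_i - mu_k)^T
        sigmas[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k]
    return pis, mus, sigmas
```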
EM Algorithm
• Iterate the E-step and M-step until the log likelihood of the data no longer increases
– Converges to a local optimum
– May need to restart the algorithm with different initial guesses of the parameters (as in K-means)
• Relation to K-means
– Consider a GMM in which all components share a common covariance σ²I
– As σ² → 0, the two methods coincide
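In practice the whole loop is available off the shelf; a hedged sketch using scikit-learn's GaussianMixture (which runs EM internally), with n_init random restarts in the spirit of the restart advice above. The toy data is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(5.0, 2.0, size=(200, 2))])   # toy data from two blobs

gmm = GaussianMixture(n_components=2, covariance_type="full",
                      n_init=5, max_iter=200, random_state=0).fit(X)
print(gmm.weights_)               # estimated mixing weights pi_k
print(gmm.predict_proba(X[:3]))   # soft assignments (responsibilities)
print(gmm.predict(X[:3]))         # hard labels via the most probable component
```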
K-means vs GMM

• K-means
– Objective function: minimize the sum of squared Euclidean distances
– Can be optimized by an EM-style algorithm
– E-step: assign points to clusters
– M-step: optimize the cluster centers
– Performs hard assignment during the E-step
– Assumes spherical clusters with equal probability for each cluster

• GMM
– Objective function: maximize the log-likelihood
– Optimized by the EM algorithm
– E-step: compute the posterior probability of membership
– M-step: optimize the parameters
– Performs soft assignment during the E-step
– Can be used for non-spherical clusters
– Can generate clusters with different probabilities
Mixture Model
• Strengths
– Give probabilistic cluster assignments
– Have probabilistic interpretation
– Can handle clusters with varying sizes, variances, etc.
• Weaknesses
– Initialization matters
– Need to choose appropriate component distributions
– Overfitting issues
Take-away Message
• Probabilistic clustering
• Maximum likelihood estimation
• Gaussian mixture model for clustering
• EM algorithm that alternately assigns points to clusters and estimates model parameters
• Strengths and weaknesses