Unsupervised Learning
Gaussian Mixture Models
Expectation-Maximization (EM)
Gaussian Mixture Models
Like K-Means, GMM clusters have
centers.
In addition, they have probability
distributions that indicate the
probability that a point belongs to the
cluster.
[Figure: clusters plotted on axes X1 and X2, with elliptical level sets around each cluster center]
These ellipses show “level sets”: lines
with equal probability of belonging to
the cluster.
Notice that green points still have
SOME probability of belonging to the
blue cluster, but it’s much lower than
the blue points.
This is a more complex model than K-Means: distance from the center can
matter more in one direction than
another.
GMMs and EM
Gaussian Mixture Models (GMMs) are a model,
similar to a Naïve Bayes model but with important
differences.
Expectation-Maximization (EM) is a parameter-estimation algorithm for training GMMs using
unlabeled data.
To explain these further, we first need to review
Gaussian (normal) distributions.
The Normal (aka Gaussian)
Distribution
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
μ: mean, σ²: variance
[Figure: bell curve centered at μ with width σ]
Quiz: MLE for Gaussians
Based on your statistics knowledge,
1) What is the MLE for μ from a bunch of
example X points?
2) What is the MLE for σ from a bunch of
example X points?
Answer: MLE for Gaussians
Based on your statistics knowledge,
1) What is the MLE for μ from a bunch of example X points?
$$\mu = \frac{1}{M}\sum_{i} X_i$$
(average of the X values)
2) What is the MLE for σ from a bunch of example X points?
$$\sigma^2 = \frac{1}{M}\sum_{i} (X_i - \mu)^2$$
(average squared deviation from the mean; σ is the square root of this)
Note: this is a so-called “biased” estimator for σ²; there is also an
“unbiased” estimator which basically just uses (M-1) instead of M.
We’ll stick to the “biased” one here, but either one is fine.
Quiz: Deriving the ML estimators
How would you derive the MLE equations for
Gaussian distributions?
Answer: Deriving the ML estimators
How would you derive the MLE equations for Gaussian
distributions?
Same plan of attack as for MLE estimates of Bayes Nets:
1. Write down the Likelihood function P(D | M)
2. Make the assumption that each data point Xi is
independently distributed, so P(D|M) = ∏𝑃 𝑋𝑖 𝑀
3. Take the log
4. Take the partial derivative with respect to μ, set this equal
to zero, and solve for μ.
5. Take the partial derivative with respect to σ, set this equal
to zero, and solve for σ.
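Carried out for a single Gaussian, the steps above can be sketched as follows. The symbol ℓ for the log-likelihood is introduced here for convenience and does not appear in the slides.

```latex
\begin{aligned}
\ell(\mu,\sigma) &= \log \prod_{i=1}^{M} f(X_i)
  = -\frac{M}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{M}(X_i-\mu)^2 \\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{M}(X_i-\mu) = 0
  \;\Rightarrow\; \mu = \frac{1}{M}\sum_{i=1}^{M} X_i \\
\frac{\partial \ell}{\partial \sigma} &= -\frac{M}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{M}(X_i-\mu)^2 = 0
  \;\Rightarrow\; \sigma^2 = \frac{1}{M}\sum_{i=1}^{M}(X_i-\mu)^2
\end{aligned}
```

Both results match the estimators quoted in the earlier answer, including the division by M rather than M-1.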
Quiz: Estimating a Gaussian
[Figure: the dataset plotted on a number line]
The figure shows a dataset with the following X values:
0, 3, 4, 5, 6, 7, 10
Find the maximum likelihood Gaussian distribution.
$$\mu = \frac{1}{M}\sum_{i} X_i \qquad \sigma^2 = \frac{1}{M}\sum_{i} (X_i - \mu)^2$$
Answer: Estimating a Gaussian
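[Figure: the dataset plotted on a number line]
The figure shows a dataset with the following X values:
0, 3, 4, 5, 6, 7, 10
$$\mu = \frac{1}{M}\sum_{i} X_i = \frac{1}{7}(0 + 3 + 4 + 5 + 6 + 7 + 10) = 5$$
$$\sigma^2 = \frac{1}{M}\sum_{i} (X_i - \mu)^2 = \frac{1}{7}\left[(0-5)^2 + (3-5)^2 + (4-5)^2 + (5-5)^2 + (6-5)^2 + (7-5)^2 + (10-5)^2\right]$$
$$= \frac{1}{7}\left(5^2 + 2^2 + 1^2 + 0^2 + 1^2 + 2^2 + 5^2\right) = \frac{1}{7}(25 + 4 + 1 + 0 + 1 + 4 + 25) = \frac{60}{7}$$
$$f(x) = \frac{1}{\sqrt{2\pi \cdot \frac{60}{7}}} \exp\left(-\frac{(x-5)^2}{\frac{120}{7}}\right)$$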
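The arithmetic in this answer can be checked with a few lines of Python. This is a minimal sketch; the function names `gaussian_mle` and `gaussian_pdf` are illustrative and not from the slides.

```python
import math

def gaussian_mle(xs):
    """Return the maximum-likelihood mean and (biased) variance of xs."""
    m = len(xs)
    mu = sum(xs) / m
    var = sum((x - mu) ** 2 for x in xs) / m  # divides by M, not M - 1
    return mu, var

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

mu, var = gaussian_mle([0, 3, 4, 5, 6, 7, 10])
peak = gaussian_pdf(mu, mu, var)  # the density is highest at the mean
```

Here `mu` comes out as 5 and `var` as 60/7, matching the hand calculation.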
Clustering by fitting K Gaussians
Suppose our dataset looks like the one above.
It doesn’t really look Gaussian anymore; it looks like it has 3
clusters.
Fitting a single Gaussian to this data will still give you an
estimate.
But that Gaussian will have a low Likelihood value: it will give
very low probability to the leftmost and rightmost clusters.
Clustering by fitting K Gaussians
What we’d like to do instead is to fit K Gaussians.
A model for data that involves multiple Gaussian
distributions is called a Gaussian Mixture Model
(GMM).
Clustering by fitting K Gaussians
[Figure: the data with three fitted Gaussians, means labeled μred, μblue, and μgreen]
Another way of drawing these is with “level sets”:
curves that show points with equal probability for each
Gaussian.
Wider curves have lower probability than narrower curves.
Notice that each point is contained within every Gaussian, but
is most tightly bound to the closest Gaussian.
Expectation-Maximization (EM)
EM is “K-Means for GMMs”.
It is a parameter estimation algorithm for GMMs that will determine a
(locally-optimal) setting for all of the GMM parameters, using a bunch
of unlabeled X points.
Input:
1. Data points X1, …, XM
2. A number K
Output: μ1, σ1², …, μK, σK² such that the GMM with those means and
variances has a locally maximum likelihood for the training
data set.
Visualization of EM
1. Initialize the mean and standard deviation of
each Gaussian randomly.
2. Repeat until convergence:
– Expectation: For each point X and each Gaussian k,
find f(X | Gaussian k)
– Maximization: Estimate new μk and σk parameters
for each Gaussian. (Technically, you also need to
estimate a third parameter, called πk. More later.)
Gaussian Mixture Model
K Gaussian distributions with parameters μ1, σ1² through μK, σK².
It also involves K additional parameters, called prior probabilities, π1
through πK. These describe the relative importance of each of the K
Gaussian distributions in the full model.
The likelihood equation for this model looks like this:
$$f(X_1, \dots, X_M \mid GMM) = \prod_{i} f(X_i \mid GMM) \qquad \text{(i.i.d. assumption)}$$
$$f(X_i \mid GMM) = \sum_{k=1}^{K} \underbrace{\pi_k}_{\text{Prior}} \; \underbrace{\frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(X_i - \mu_k)^2}{2\sigma_k^2}\right)}_{\text{Gaussian}}$$
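The likelihood above translates directly into code. The following is a minimal 1-D sketch; the function names are illustrative, and the log of the likelihood is used so that the product over points becomes a sum.

```python
import math

def gaussian_pdf(x, mu, var):
    """Density of a normal distribution with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_density(x, pis, mus, variances):
    """f(x | GMM): prior-weighted sum of the K Gaussian densities."""
    return sum(p * gaussian_pdf(x, m, v) for p, m, v in zip(pis, mus, variances))

def gmm_log_likelihood(xs, pis, mus, variances):
    """log f(X_1..X_M | GMM) under the i.i.d. assumption (a sum of logs)."""
    return sum(math.log(gmm_density(x, pis, mus, variances)) for x in xs)

xs = [0, 3, 4, 5, 6, 7, 10]
# A one-component GMM is just a single Gaussian.
single = gmm_log_likelihood(xs, [1.0], [5.0], [60 / 7])
# Two identical components with priors 0.5 each describe the same density.
split = gmm_log_likelihood(xs, [0.5, 0.5], [5.0, 5.0], [60 / 7, 60 / 7])
```

The two calls give the same value, which illustrates that the priors only redistribute weight among the component Gaussians.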
GMMs as Bayes Nets
[Figure: Bayes Net with a parent node “Cluster” (values 1, 2, …, K) and a child node “X” (a real number)]
GMMs are simple Bayes Nets.
Two differences from previous BNs we’ve seen:
1. We’re used to binary variables in BNs.
Here, the “Cluster” variable has K possible
values (1, 2, …, K) instead of just two
(+cluster and –cluster). We used to store
P(+a) and P(-a) for the parent variable; now
we store π1 through πK.
2. The “X” variable has infinitely many values
(any real number) instead of just (+x and –x).
We used to store P(+x | +a) and P(+x | -a).
Now we store μ1, σ1² through μK, σK², and we say
$$f(X \mid \text{Cluster is } j) = \frac{1}{\sqrt{2\pi\sigma_j^2}} \exp\left(-\frac{(X - \mu_j)^2}{2\sigma_j^2}\right)$$
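One way to see the Bayes Net reading is to generate data from it: sample the parent “Cluster” from the priors, then sample X from that cluster's Gaussian. The sketch below assumes 1-D data and uses 0-based cluster indices; `sample_gmm` and its `seed` parameter are illustrative names, not from the slides.

```python
import random

def sample_gmm(n, pis, mus, variances, seed=0):
    """Draw n (cluster, x) pairs: first the parent Cluster, then X given Cluster."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        # Cluster ~ Categorical(pi_1, ..., pi_K)
        j = rng.choices(range(len(pis)), weights=pis)[0]
        # X | Cluster = j ~ Normal(mu_j, sigma_j^2); gauss() takes a standard deviation
        x = rng.gauss(mus[j], variances[j] ** 0.5)
        samples.append((j, x))
    return samples

samples = sample_gmm(1000, pis=[0.2, 0.8], mus=[-5.0, 5.0], variances=[1.0, 1.0])
```

In unsupervised training only the x values would be kept; the cluster labels are exactly the hidden information that EM has to reason about.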
Formal Description of the Algorithm
1. Init: For each k in {1, …, K}, create a random πk, μk, σ²k
2. Repeat until all πk, μk, σ²k remain the same from one iteration to
the next:
Expectation (aka Assignment in K-Means):
For each Xi, for each k, let
$$C[X_i, k] \leftarrow \frac{\pi_k \frac{1}{\sqrt{2\pi\sigma_k^2}} \exp\left(-\frac{(X_i - \mu_k)^2}{2\sigma_k^2}\right)}{\sum_{k'} \pi_{k'} \frac{1}{\sqrt{2\pi\sigma_{k'}^2}} \exp\left(-\frac{(X_i - \mu_{k'})^2}{2\sigma_{k'}^2}\right)}$$
Maximization (aka Update in K-Means):
For each k,
$$\pi_k = \frac{1}{M} \sum_{i=1}^{M} C[X_i, k]$$
$$\mu_k = \frac{\sum_{i=1}^{M} C[X_i, k] \cdot X_i}{\sum_{i=1}^{M} C[X_i, k]}$$
$$\sigma_k^2 = \frac{\sum_{i=1}^{M} C[X_i, k] \cdot (X_i - \mu_k)^2}{\sum_{i=1}^{M} C[X_i, k]}$$
3. Return (for all values of k) πk, μk, σ²k
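The algorithm above can be sketched in plain Python for 1-D data. This is an illustration under stated assumptions, not a reference implementation: the starting parameters are passed in by the caller (fixed here for reproducibility, where the slides initialize randomly), convergence is tested with a small tolerance `tol` instead of exact equality, and a variance floor `min_var` is added as a safeguard that the slides do not mention.

```python
import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(xs, pis, mus, variances):
    """Expectation: C[i][k] is the posterior probability of Gaussian k for xs[i]."""
    C = []
    for x in xs:
        joint = [p * gaussian_pdf(x, m, v) for p, m, v in zip(pis, mus, variances)]
        total = sum(joint)
        C.append([j / total for j in joint])
    return C

def m_step(xs, C, min_var=1e-6):
    """Maximization: re-estimate pi_k, mu_k, sigma_k^2 from the soft assignments."""
    M, K = len(xs), len(C[0])
    pis, mus, variances = [], [], []
    for k in range(K):
        weight = sum(C[i][k] for i in range(M))
        mu = sum(C[i][k] * xs[i] for i in range(M)) / weight
        var = sum(C[i][k] * (xs[i] - mu) ** 2 for i in range(M)) / weight
        pis.append(weight / M)
        mus.append(mu)
        variances.append(max(var, min_var))  # floor is an added safeguard, not in the slides
    return pis, mus, variances

def em(xs, pis, mus, variances, max_iters=200, tol=1e-9):
    """Alternate E and M steps until no parameter moves by more than tol."""
    for _ in range(max_iters):
        new_pis, new_mus, new_vars = m_step(xs, e_step(xs, pis, mus, variances))
        old = pis + mus + variances
        new = new_pis + new_mus + new_vars
        pis, mus, variances = new_pis, new_mus, new_vars
        if max(abs(a - b) for a, b in zip(old, new)) < tol:
            break
    return pis, mus, variances

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]  # two well-separated groups (made-up data)
pis, mus, variances = em(xs, [0.5, 0.5], [3.0, 8.0], [4.0, 4.0])
```

On this toy data the means settle near the centers of the two groups and each prior ends up close to 0.5. With points far from every Gaussian the densities can underflow to zero, so a practical version would likely work in log space.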
Evaluation metric for GMMs and EM
LOSS Function (or Objective function) for EM:
EM (locally) maximizes “Marginal” Likelihood:
$$EM(X_1, \dots, X_M) = \underset{\mu_1, \sigma_1^2, \pi_1, \dots, \mu_K, \sigma_K^2, \pi_K}{\arg\max} \; f(X_1, \dots, X_M \mid \mu_1, \sigma_1^2, \pi_1, \dots, \mu_K, \sigma_K^2, \pi_K)$$
Notice that this is the Likelihood function for just the X
variable in our Bayes Net, rather than the Likelihood for
(X and Cluster), which is why it is called “marginal
likelihood” rather than just “likelihood”.
Analysis of EM Performance
EM never decreases the Likelihood, so it converges to a local optimum
(strictly speaking, a stationary point) of the Likelihood function.
Theorem: After one iteration of EM, the Likelihood
of the new GMM >= the Likelihood of the previous
GMM.
(Dempster, A.P.; Laird, N.M.; Rubin, D.B. 1977.
"Maximum Likelihood from Incomplete Data via the
EM Algorithm". Journal of the Royal Statistical
Society. Series B (Methodological) 39 (1): 1–38.
JSTOR 2984875.)
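The theorem can be observed empirically by recording the log-likelihood after every EM iteration. The sketch below is a compact 1-D illustration on made-up data with a deliberately poor starting point; the helper names and the `(pi, mu, var)` triple layout are assumptions of this example.

```python
import math

def pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def log_likelihood(xs, params):
    """Marginal log-likelihood of xs; params is a list of (pi, mu, var) triples."""
    return sum(math.log(sum(p * pdf(x, m, v) for p, m, v in params)) for x in xs)

def em_iteration(xs, params):
    """One E step followed by one M step; returns the updated (pi, mu, var) triples."""
    resp = []
    for x in xs:
        joint = [p * pdf(x, m, v) for p, m, v in params]
        total = sum(joint)
        resp.append([j / total for j in joint])
    new_params = []
    for k in range(len(params)):
        w = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, xs)) / w
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, xs)) / w
        new_params.append((w / len(xs), mu, var))
    return new_params

xs = [0.0, 0.5, 1.5, 2.0, 6.0, 7.0, 7.5, 9.0]
params = [(0.5, 1.0, 9.0), (0.5, 3.0, 9.0)]  # deliberately poor starting point
history = [log_likelihood(xs, params)]
for _ in range(30):
    params = em_iteration(xs, params)
    history.append(log_likelihood(xs, params))
```

Each entry of `history` should be at least as large as the one before it, up to floating-point rounding. This shows the monotonic improvement only; it says nothing about whether the point reached is the global maximum.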
EM Generality
Even though we have presented EM for
GMMs, the same basic algorithm can be used
for learning with arbitrary Bayes Nets when
some of the training data has missing values.
This has made EM one of the most popular
unsupervised learning techniques in machine
learning.
EM Quiz
[Figure: points a, b, and c with Gaussians g1, g2, and g3]
Which Gaussian(s) have a nonzero value for f(a)?
How about f(c)?
Answer: EM Quiz
Which Gaussian(s) have a nonzero f(a)?
All Gaussians (g1, g2, and g3) have a nonzero value for f(a).
How about f(c)?
Ditto. All Gaussians have a nonzero value for f(c).
Quiz: EM vs. K-Means
[Figure: points a and c, cluster centers g1 and g2, and candidate positions Option 1 through Option 4]
At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2?
At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?
Answer: EM vs. K-Means
At the end of K-Means, where will cluster center g1 end up – Option 1 or Option 2?
Option 1: K-Means puts the “mean” at the center of all points in the cluster, and point a
will be the only point in g1’s cluster.
At the end of EM, where will cluster center g1 end up – Option 1 or Option 2?
Option 2: EM puts the “mean” at the center of all points in the dataset, where each
point is weighted by how likely it is according to the Gaussian. Point a and Point b will
both have some likelihood, but Point a’s likelihood will be much higher. So the “mean”
for g1 will be very close to Point a, but not all the way at Point a.
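The difference between the two updates can be made concrete with a tiny 1-D example. The coordinates, variances, and equal priors below are hypothetical values chosen for illustration, not taken from the figure.

```python
import math

def pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

points = {"a": 0.0, "far": 4.0}      # hypothetical coordinates
g1, g2 = (0.0, 1.0), (4.0, 1.0)      # (mean, variance); equal priors assumed

# K-Means: hard assignment, so g1 averages only the points closest to it.
g1_members = [x for x in points.values() if abs(x - g1[0]) <= abs(x - g2[0])]
kmeans_mean = sum(g1_members) / len(g1_members)

# EM: soft assignment, so every point contributes in proportion to its responsibility.
resp = {name: pdf(x, *g1) / (pdf(x, *g1) + pdf(x, *g2)) for name, x in points.items()}
em_mean = sum(resp[n] * points[n] for n in points) / sum(resp.values())
```

With these numbers `kmeans_mean` sits exactly on point a, while `em_mean` is pulled a tiny distance toward the far point, which is the behavior the answer describes.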
How many clusters?
We’ve been assuming a fixed K.
Here’s a technique to determine this automatically, from data.
New objective function:
$$\text{Minimize: } -\sum_{i=1}^{M} \log f(X_i \mid GMM_K) + \text{Cost} \cdot K$$
Algorithm:
1. Initialize K somehow.
Repeat until convergence:
2. Run EM.
3. Remove unnecessary clusters (low π value)
4. Create new random clusters (more or fewer than before,
depending on a heuristic estimate of whether there were too many or too few
before).
This is slow. But one nice property is that it can overcome some difficulties with local
maxima.
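The penalized objective is straightforward to evaluate once a GMM has been fitted. The sketch below only scores two hand-specified candidate models on made-up 1-D data; it does not implement the add-and-remove-clusters loop, and the name `penalized_objective` and the `(pi, mu, var)` layout are assumptions of this example.

```python
import math

def pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def penalized_objective(xs, params, cost):
    """Negative log-likelihood plus cost * K; params is a list of (pi, mu, var) triples."""
    log_lik = sum(math.log(sum(p * pdf(x, m, v) for p, m, v in params)) for x in xs)
    return -log_lik + cost * len(params)

xs = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
one_cluster = [(1.0, 6.0, 154 / 6)]                      # single Gaussian over all points
two_clusters = [(0.5, 1.0, 2 / 3), (0.5, 11.0, 2 / 3)]   # one Gaussian per group

cheap = {k: penalized_objective(xs, p, cost=1.0)
         for k, p in (("K=1", one_cluster), ("K=2", two_clusters))}
pricey = {k: penalized_objective(xs, p, cost=10.0)
          for k, p in (("K=1", one_cluster), ("K=2", two_clusters))}
```

With a small Cost the two-cluster model scores lower (better); with a large Cost the penalty outweighs the likelihood gain and the single Gaussian wins. The choice of Cost therefore controls how readily extra clusters are accepted.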
Quiz
Is EM for GMMs
Classification or Regression?
Generative or Discriminative?
Parametric or Nonparametric?
Answer
Is EM for GMMs
Classification or Regression?
Two possible answers:
- classification: output is a discrete value (cluster label) for each point
- Regression: output is a real value (probability) for each possible cluster label for
each point
Generative or Discriminative?
- normally, it’s used with a fixed set of input and output variables. However, GMMs
are Bayes Nets that store a full joint distribution. Once it’s trained, a GMM can
actually make predictions for any subset of the variables given any other subset.
Technically, this is generative.
Parametric or Nonparametric?
- parametric: the number of parameters is 3K, which does not change with the
number of training data points.
Quiz
Is EM for GMMs
Supervised or Unsupervised?
Online or batch?
Closed-form or iterative?
Answer
Is EM for GMMs
Supervised or Unsupervised?
- Unsupervised
Online or batch?
- batch: if you add a new data point, you need to revisit
all the training data to recompute the locally-optimal
model
Closed-form or iterative?
-iterative: training requires many passes through the data