EE462 MLCV
Lecture 3-4
Clustering (1hr)
Gaussian Mixture and EM (1hr)
Tae-Kyun Kim
Vector Clustering
Data points (green), 2D vectors, are grouped into two homogeneous clusters (blue and red).
Clustering is achieved by an iterative algorithm (left to right).
The cluster centers are marked x.
Pixel Clustering (Image Quantisation)
Image pixels are represented by 3D vectors of R,G,B values.
The vectors are grouped into K = 10, 3, or 2 clusters, and represented by the mean values of their respective clusters.
[Figure: each pixel is a colour vector 𝐱 ∈ R^3 with components R, G, B.]
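As an illustration, a minimal quantisation sketch, assuming an (H, W, 3) RGB array `img` and using scikit-learn's KMeans rather than the course toolbox:

```python
import numpy as np
from sklearn.cluster import KMeans

def quantise(img, K):
    """Replace every pixel by the mean RGB value of its cluster."""
    H, W, _ = img.shape
    X = img.reshape(-1, 3).astype(float)        # pixels as 3D vectors (R, G, B)
    km = KMeans(n_clusters=K, n_init=10).fit(X)  # group the vectors into K clusters
    quantised = km.cluster_centers_[km.labels_]  # each pixel -> its cluster mean
    return quantised.reshape(H, W, 3).astype(np.uint8)
```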
Patch Clustering
Lecture 9-10 (BoW)
Image patches are harvested around interest points from a large number of images. They are represented by finite-dimensional vectors and clustered to form a visual dictionary.
[Figure: 20×20 image patches, represented as vectors of dimension D (SIFT descriptors or raw pixels, e.g. D = 400), are clustered into K codewords.]
Image Clustering
Whole images are represented as finite-dimensional vectors. Homogeneous vectors are grouped together in Euclidean space.
Lecture 9-10 (BoW)
K-means vs GMM
Two standard methods are K-means and the Gaussian Mixture Model (GMM).
K-means assigns each data point to its nearest cluster, while a GMM represents the data by multiple Gaussian densities.
Hard clustering: a data point is assigned to a single cluster.
Soft clustering: a data point is explained probabilistically by a mix of multiple Gaussians.
Matrix and Vector Derivatives
Matrix and vector derivatives are obtained by first taking element-wise derivatives and then collecting them back into matrices and vectors.
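A small worked instance of this recipe, for the quadratic forms that appear later in the K-means and GMM derivations:

```latex
% Element-wise derivative of x^T A x, x in R^D, A in R^{D x D}:
\frac{\partial}{\partial x_i} \sum_{j}\sum_{k} x_j A_{jk} x_k
  = \sum_{k} A_{ik} x_k + \sum_{j} x_j A_{ji}
% Collecting the D partial derivatives back into a vector gives
\quad\Longrightarrow\quad
\frac{\partial}{\partial \mathbf{x}} \mathbf{x}^{\top} A \mathbf{x} = (A + A^{\top})\,\mathbf{x},
\qquad
\frac{\partial}{\partial \mathbf{x}} \|\mathbf{x} - \boldsymbol{\mu}\|^{2} = 2(\mathbf{x} - \boldsymbol{\mu}).
```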
K-means Clustering
Given a data set {x_1, …, x_N} of N observations in a D-dimensional space, our goal is to partition the data set into K clusters or groups.
The vectors μ_k, where k = 1, …, K, represent the k-th cluster, e.g. the centres of the clusters.
Binary indicator variables r_nk ∈ {0, 1}, where k = 1, …, K, are defined for each data point x_n.
1-of-K coding scheme: if x_n is assigned to cluster k, then r_nk = 1 and r_nj = 0 for j ≠ k.
The objective function, a distortion measure, is
J = Σ_n Σ_{k=1}^{K} r_nk ||x_n − μ_k||².
Our goal is to find the values of {r_nk} and {μ_k} that minimise J.
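For concreteness, a one-line NumPy check of this distortion; it assumes the hard assignments are stored as integer labels k(n) rather than the binary indicators r_nk:

```python
import numpy as np

def distortion(X, mu, labels):
    """J = sum_n ||x_n - mu_{k(n)}||^2 for data X (N, D), centres mu (K, D), labels (N,)."""
    return float(((X - mu[labels]) ** 2).sum())
```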
• Iterative solution:
First we choose some initial values for μ_k, then we repeat the two steps below until convergence.
Step 1: We minimise J with respect to r_nk, keeping μ_k fixed. J is a linear function of r_nk, so there is a closed-form solution:
r_nk = 1 if k = argmin_j ||x_n − μ_j||², and r_nk = 0 otherwise.
Step 2: We minimise J with respect to μ_k, keeping r_nk fixed. J is a quadratic function of μ_k; setting its derivative with respect to μ_k to zero gives
μ_k = Σ_n r_nk x_n / Σ_n r_nk, i.e. the mean of the points currently assigned to cluster k.
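A minimal NumPy sketch of these two alternating steps (the random initialisation and convergence test are illustrative choices, not part of the slides):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: X is (N, D); returns centres (K, D) and assignments (N,)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]            # initial cluster centres
    for _ in range(n_iters):
        # Step 1: assign each point to its nearest centre (minimises J w.r.t. r_nk)
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K) squared distances
        r = d2.argmin(axis=1)
        # Step 2: recompute each centre as the mean of its assigned points (minimises J w.r.t. mu_k)
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):                               # converged
            break
        mu = new_mu
    return mu, r
```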
[Figure: K-means iterations on 2D data with K = 2, showing the assignments r_nk and the cluster centres μ_1 and μ_2 being updated.]
Because each step can only decrease J, the algorithm is guaranteed to converge.
It may, however, converge to a local minimum: the result depends on the initial values of μ_k.
Generalisation of K-means
• K-means can be generalised by using a more generic dissimilarity measure V(x_n, μ_k). The objective function to minimise is J = Σ_n Σ_k r_nk V(x_n, μ_k), with
V(x_n, μ_k) = (x_n − μ_k)^T Σ_k^{-1} (x_n − μ_k), where Σ_k denotes the covariance matrix of cluster k.
• Cluster shapes induced by different Σ_k (in 2D, Σ_k = [σ_x², σ_xy; σ_yx, σ_y²]):
Σ_k = I gives circles of the same size.
Generalisation of K-means
Σ_k an isotropic matrix: circles of different sizes.
Σ_k a diagonal matrix: axis-aligned ellipses.
Σ_k a full matrix: rotated ellipses.
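A small sketch of how the dissimilarity V changes with the form of Σ_k; the numerical values are illustrative only:

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """V(x, mu) = (x - mu)^T Sigma^{-1} (x - mu)."""
    d = x - mu
    return float(d @ np.linalg.solve(Sigma, d))

x, mu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
sigmas = {
    "isotropic (circles)":              2.0 * np.eye(2),                  # s^2 * I
    "diagonal (axis-aligned ellipses)": np.diag([2.0, 0.5]),
    "full (rotated ellipses)":          np.array([[2.0, 0.8], [0.8, 0.5]]),
}
for name, S in sigmas.items():
    print(name, mahalanobis_sq(x, mu, S))
```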
Statistical Pattern Recognition Toolbox for Matlab
http://cmp.felk.cvut.cz/cmp/software/stprtool/
…\stprtool\probab\cmeans.m
Mixture of Gaussians
Denote by z the 1-of-K representation: z_k ∈ {0, 1} and Σ_k z_k = 1.
We define the joint distribution p(x, z) by a marginal
distribution p(z) and a conditional distribution p(x|z).
[Graphical model: z is the hidden variable; x is the observable variable (the data).]
Lecture 11-12 (Prob. Graphical models)
The marginal distribution over z is written in terms of the mixing coefficients π_k:
p(z_k = 1) = π_k, where 0 ≤ π_k ≤ 1 and Σ_{k=1}^{K} π_k = 1.
Because z uses the 1-of-K representation, the marginal distribution is in the form of
p(z) = Π_{k=1}^{K} π_k^{z_k}.
Similarly, the conditional distribution of x given z is p(x | z_k = 1) = N(x | μ_k, Σ_k), i.e. p(x | z) = Π_{k=1}^{K} N(x | μ_k, Σ_k)^{z_k}.
The marginal distribution of x is
p(x) = Σ_z p(z) p(x | z) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k),
which is a linear superposition of Gaussians.
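A minimal NumPy sketch that evaluates this superposition; the helper `gaussian_pdf` is an illustrative implementation of N(x | μ, Σ), not a toolbox function:

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    """Multivariate normal density N(x | mu, Sigma)."""
    D = len(mu)
    d = x - mu
    quad = d @ np.linalg.solve(Sigma, d)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def gmm_pdf(x, pis, mus, Sigmas):
    """p(x) = sum_k pi_k N(x | mu_k, Sigma_k): a linear superposition of Gaussians."""
    return sum(pi * gaussian_pdf(x, mu, S) for pi, mu, S in zip(pis, mus, Sigmas))
```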
The conditional probability p(z_k = 1 | x), denoted by γ(z_k), is obtained by Bayes' theorem:
γ(z_k) = p(z_k = 1 | x) = π_k N(x | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x | μ_j, Σ_j).
We view π_k as the prior probability of z_k = 1, and γ(z_k) as the corresponding posterior probability.
γ(z_k) is the responsibility that component k takes for explaining the observation x.
Maximum Likelihood Estimation
Given a data set X = {x_1, …, x_N}, the log of the likelihood function is
ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) },
s.t. 0 ≤ π_k ≤ 1 and Σ_{k=1}^{K} π_k = 1.
Setting the derivative of ln p(X | π, μ, Σ) with respect to μ_k to zero, we obtain
μ_k = (1 / N_k) Σ_{n=1}^{N} γ(z_nk) x_n, where N_k = Σ_{n=1}^{N} γ(z_nk).
Setting the derivative of ln p(X | π, μ, Σ) with respect to Σ_k to zero, we obtain
Σ_k = (1 / N_k) Σ_{n=1}^{N} γ(z_nk) (x_n − μ_k)(x_n − μ_k)^T.
Finally, we maximise ln p(X | π, μ, Σ) with respect to the mixing coefficients π_k. Since the π_k must sum to one, we use a Lagrange multiplier:
for an objective function f(x) and a constraint g(x) = 0, max f(x) s.t. g(x) = 0 becomes max f(x) + λ g(x).
Refer to Optimisation course or http://en.wikipedia.org/wiki/Lagrange_multiplier
Maximising ln p(X | π, μ, Σ) + λ (Σ_{k=1}^{K} π_k − 1) with respect to π_k gives
0 = Σ_{n=1}^{N} N(x_n | μ_k, Σ_k) / (Σ_j π_j N(x_n | μ_j, Σ_j)) + λ.
Multiplying both sides by π_k and summing over k, we find λ = −N and
π_k = N_k / N.
EM (Expectation Maximisation) for Gaussian Mixtures
1. Initialise the means μ_k, covariances Σ_k and mixing coefficients π_k.
2. E step: Evaluate the responsibilities using the current parameter values:
γ(z_nk) = π_k N(x_n | μ_k, Σ_k) / Σ_{j=1}^{K} π_j N(x_n | μ_j, Σ_j).
3. M step: Re-estimate the parameters using the current responsibilities:
μ_k^new = (1 / N_k) Σ_n γ(z_nk) x_n,
Σ_k^new = (1 / N_k) Σ_n γ(z_nk) (x_n − μ_k^new)(x_n − μ_k^new)^T,
π_k^new = N_k / N, where N_k = Σ_n γ(z_nk).
4. Evaluate the log likelihood
ln p(X | π, μ, Σ) = Σ_{n=1}^{N} ln { Σ_{k=1}^{K} π_k N(x_n | μ_k, Σ_k) }
and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.
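A compact NumPy sketch of steps 1-4; it follows the update equations above, but the initialisation, the small diagonal regulariser and the convergence tolerance are illustrative choices:

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """N(x_n | mu, Sigma) evaluated for every row of X, returned as shape (N,)."""
    D = X.shape[1]
    diff = X - mu
    quad = np.einsum('nd,dk,nk->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iters=200, tol=1e-6, seed=0):
    N, D = X.shape
    rng = np.random.default_rng(seed)
    # 1. Initialise the means, covariances and mixing coefficients
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    old_ll = -np.inf
    for _ in range(n_iters):
        # 2. E step: responsibilities gamma(z_nk) under the current parameters
        dens = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        gamma = dens / dens.sum(axis=1, keepdims=True)            # (N, K)
        # 3. M step: re-estimate the parameters with the current responsibilities
        Nk = gamma.sum(axis=0)
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # 4. Evaluate the log likelihood under the updated parameters and check convergence
        dens_new = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
        ll = np.log(dens_new.sum(axis=1)).sum()
        if abs(ll - old_ll) < tol:
            break
        old_ll = ll
    return pi, mu, Sigma, gamma
```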
Statistical Pattern Recognition Toolbox for Matlab
http://cmp.felk.cvut.cz/cmp/software/stprtool/
…\stprtool\visual\pgmm.m
…\stprtool\demos\demo_emgmm.m
Information Theory
Lecture 7 (Random forest)
The amount of information can be viewed as the degree of surprise at observing the value of x.
If we have two events x and y that are unrelated, the information from observing both should be h(x, y) = h(x) + h(y).
Since p(x, y) = p(x) p(y) for such events, h(x) must be given by the logarithm of p(x):
h(x) = −log₂ p(x),
where the minus sign ensures that the information is positive or zero.
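A quick worked example of this definition, using the base-2 logarithm so that information is measured in bits:

```latex
h(x) = -\log_2 p(x), \qquad
p(x) = \tfrac{1}{2} \;\Rightarrow\; h(x) = 1 \text{ bit}, \qquad
p(x) = \tfrac{1}{8} \;\Rightarrow\; h(x) = 3 \text{ bits}.
```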
The average amount of information (called the entropy) is given by
H[x] = −Σ_x p(x) log₂ p(x).
The differential entropy for a multivariate continuous variable x is
H[x] = −∫ p(x) ln p(x) dx.
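A small NumPy check of the discrete entropy formula; the example distributions are illustrative:

```python
import numpy as np

def entropy(p):
    """H[x] = -sum_x p(x) log2 p(x), in bits; zero-probability terms contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log2(p[nz])).sum())

print(entropy([0.25, 0.25, 0.25, 0.25]))   # uniform over 4 outcomes -> 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # non-uniform -> 1.75 bits
```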