EE462 MLCV
Lecture 3-4: Clustering (1 hr), Gaussian Mixture and EM (1 hr)
Tae-Kyun Kim

Vector Clustering
Data points (green), 2D vectors, are grouped into two homogeneous clusters (blue and red). Clustering is achieved by an iterative algorithm (left to right); the cluster centres are marked with x.

Pixel Clustering (Image Quantisation)
Image pixels are represented by 3D vectors x ∈ R^3 of R, G, B values. The vectors are grouped into K = 10, 3, 2 clusters, and each pixel is represented by the mean value of its cluster.

Patch Clustering (Lecture 9-10, BoW)
Image patches are harvested around interest points from a large number of images. Each patch is represented by a finite-dimensional vector of dimension D, e.g. 20×20 raw pixels (D = 400) or a SIFT descriptor, and the vectors are clustered into K codewords to form a visual dictionary.

Image Clustering (Lecture 9-10, BoW)
Whole images are represented as finite-dimensional vectors. Homogeneous vectors are grouped together in Euclidean space.

K-means vs GMM
Two standard methods are K-means and the Gaussian Mixture Model (GMM). K-means assigns each data point to the nearest cluster, while a GMM represents the data by multiple Gaussian densities.
Hard clustering: a data point is assigned to a single cluster.
Soft clustering: a data point is explained probabilistically by a mixture of multiple Gaussians.

Matrix and Vector Derivatives
Matrix and vector derivatives are obtained by first taking element-wise derivatives and then re-forming them into matrices and vectors. A result used below: for a symmetric matrix Σ, ∂/∂μ (x − μ)^T Σ^-1 (x − μ) = −2 Σ^-1 (x − μ).

K-means Clustering
Given a data set {x1, …, xN} of N observations in a D-dimensional space, our goal is to partition the data set into K clusters or groups. The vectors μk, k = 1, …, K, represent the k-th cluster, e.g. the centres of the clusters. Binary indicator variables rnk ∈ {0, 1}, k = 1, …, K, are defined for each data point xn. In this 1-of-K coding scheme, if xn is assigned to cluster k then rnk = 1 and rnj = 0 for j ≠ k.

The objective function that measures distortion is
J = Σn Σk rnk ||xn − μk||²,
and we seek the values {rnk} and {μk} that minimise J.

Iterative solution: first we choose some initial values for μk, then the following two steps are repeated until convergence.
Step 1: minimise J with respect to rnk, keeping μk fixed. J is a linear function of rnk, so there is a closed-form solution: rnk = 1 if k = argmin_j ||xn − μj||², and rnk = 0 otherwise.
Step 2: minimise J with respect to μk, keeping rnk fixed. J is a quadratic function of μk; setting its derivative with respect to μk to zero, 2 Σn rnk (xn − μk) = 0, gives
μk = Σn rnk xn / Σn rnk.

(Figure: K-means iterations on a toy data set with K = 2, alternately updating the assignments rnk and the means μ1, μ2.)

Because each step decreases J or leaves it unchanged, convergence is guaranteed. However, the algorithm may converge to a local minimum of J, so its result depends on the initial values of μk.

Generalisation of K-means
K-means is generalised by using a more generic dissimilarity measure V(xn, μk). The objective function to minimise becomes J = Σn Σk rnk V(xn, μk), with for example
V(xn, μk) = (xn − μk)^T Σk^-1 (xn − μk),
where Σk denotes the covariance matrix of cluster k, e.g. in 2D Σk = [σx² σxy; σyx σy²].
Cluster shapes obtained with different Σk:
Σk = I: circles of the same size.
Σk an isotropic matrix: circles of different sizes.
Σk a diagonal matrix: axis-aligned ellipses.
Σk a full matrix: rotated ellipses.

Statistical Pattern Recognition Toolbox for Matlab
http://cmp.felk.cvut.cz/cmp/software/stprtool/
…\stprtool\probab\cmeans.m

Mixture of Gaussians (Lecture 11-12, Prob. Graphical models)
Denote by z a 1-of-K representation: zk ∈ {0, 1} and Σk zk = 1. We define the joint distribution p(x, z) by a marginal distribution p(z) and a conditional distribution p(x|z). Here z is the hidden variable and x is the observable variable (the data).

The marginal distribution over z is written in terms of the mixing coefficients πk, where 0 ≤ πk ≤ 1 and Σk πk = 1. The marginal distribution takes the form
p(z) = Πk πk^zk.
Similarly, the conditional distribution is
p(x|z) = Πk N(x | μk, Σk)^zk.

The marginal distribution of x is
p(x) = Σz p(z) p(x|z) = Σk πk N(x | μk, Σk),
which is a linear superposition of Gaussians.

The conditional probability p(zk = 1 | x), denoted γ(zk), is obtained by Bayes' theorem:
γ(zk) = πk N(x | μk, Σk) / Σj πj N(x | μj, Σj).
We view πk as the prior probability of zk = 1, and γ(zk) as the posterior probability. γ(zk) is the responsibility that component k takes for explaining the observation x.

Maximum Likelihood Estimation
Given a data set X = {x1, …, xN}, the log of the likelihood function is
ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) },
subject to Σk πk = 1 and 0 ≤ πk ≤ 1.

Setting the derivative of ln p(X | π, μ, Σ) with respect to μk to zero, we obtain
μk = (1/Nk) Σn γ(znk) xn, where Nk = Σn γ(znk).

Setting the derivative of ln p(X | π, μ, Σ) with respect to Σk to zero, we obtain
Σk = (1/Nk) Σn γ(znk) (xn − μk)(xn − μk)^T.

Finally, we maximise ln p(X | π, μ, Σ) with respect to the mixing coefficients πk under the constraint Σk πk = 1, using a Lagrange multiplier: to maximise an objective function f(x) subject to a constraint g(x) = 0, we maximise f(x) + λ g(x). (Refer to the Optimisation course or http://en.wikipedia.org/wiki/Lagrange_multiplier.)

Maximising ln p(X | π, μ, Σ) + λ (Σk πk − 1) with respect to πk gives
Σn N(xn | μk, Σk) / Σj πj N(xn | μj, Σj) + λ = 0.
Multiplying by πk and summing over k, we find λ = −N and hence
πk = Nk / N.

EM (Expectation Maximisation) for Gaussian Mixtures
1. Initialise the means μk, covariances Σk and mixing coefficients πk.
2. E step: evaluate the responsibilities using the current parameter values,
γ(znk) = πk N(xn | μk, Σk) / Σj πj N(xn | μj, Σj).
3. M step: re-estimate the parameters using the current responsibilities,
μk = (1/Nk) Σn γ(znk) xn,
Σk = (1/Nk) Σn γ(znk) (xn − μk)(xn − μk)^T,
πk = Nk / N, where Nk = Σn γ(znk).
4. Evaluate the log likelihood
ln p(X | π, μ, Σ) = Σn ln { Σk πk N(xn | μk, Σk) }
and check for convergence of either the parameters or the log likelihood. If the convergence criterion is not satisfied, return to step 2.

Statistical Pattern Recognition Toolbox for Matlab
http://cmp.felk.cvut.cz/cmp/software/stprtool/
…\stprtool\visual\pgmm.m
…\stprtool\demos\demo_emgmm.m
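The MATLAB toolbox above provides reference implementations. As an illustration only, the following is a minimal NumPy sketch of the two algorithms in these slides: the K-means iteration (Steps 1-2) and EM for a Gaussian mixture (E and M steps with a log-likelihood convergence check). The function names kmeans and em_gmm, the K-means initialisation of the mixture means, and the small 1e-6·I regularisation added to each covariance are illustrative choices, not part of the lecture notes or the toolbox.

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Hard clustering: alternate the two K-means steps to minimise
    J = sum_n sum_k r_nk ||x_n - mu_k||^2."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    mu = X[rng.choice(N, K, replace=False)]        # initialise means with K random points
    for _ in range(n_iters):
        # Step 1: fix mu, minimise over r_nk -> assign each point to its nearest mean
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # N x K squared distances
        r = d2.argmin(axis=1)                                      # index form of r_nk
        # Step 2: fix r_nk, minimise over mu_k -> mean of the points assigned to cluster k
        new_mu = np.array([X[r == k].mean(axis=0) if np.any(r == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, r

def em_gmm(X, K, n_iters=100, tol=1e-6, seed=0):
    """Soft clustering: EM for p(x) = sum_k pi_k N(x | mu_k, Sigma_k)."""
    N, D = X.shape
    mu, _ = kmeans(X, K, seed=seed)                # initialise the means with K-means
    pi = np.full(K, 1.0 / K)
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    log_lik = -np.inf
    for _ in range(n_iters):
        # E step: responsibilities gamma(z_nk), computed in log space for stability
        log_p = np.empty((N, K))
        for k in range(K):
            diff = X - mu[k]
            _, logdet = np.linalg.slogdet(Sigma[k])
            maha = np.einsum('nd,dc,nc->n', diff, np.linalg.inv(Sigma[k]), diff)
            log_p[:, k] = np.log(pi[k]) - 0.5 * (D * np.log(2 * np.pi) + logdet + maha)
        log_norm = np.logaddexp.reduce(log_p, axis=1)   # log sum_j pi_j N(x_n | mu_j, Sigma_j)
        gamma = np.exp(log_p - log_norm[:, None])       # N x K matrix of gamma(z_nk)
        # M step: re-estimate mu_k, Sigma_k, pi_k with the current responsibilities
        Nk = gamma.sum(axis=0)                          # effective counts N_k
        mu = (gamma.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (gamma[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
        pi = Nk / N
        # Step 4: check convergence of the log likelihood ln p(X | pi, mu, Sigma)
        new_log_lik = log_norm.sum()
        if new_log_lik - log_lik < tol:
            break
        log_lik = new_log_lik
    return pi, mu, Sigma, gamma
```

For example, pi, mu, Sigma, gamma = em_gmm(X, K=3) returns gamma[n, k] as the responsibility γ(znk), while kmeans(X, K) gives hard cluster labels. The E step is evaluated in log space (via logaddexp) so the responsibilities do not underflow when a data point is far from all Gaussians.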
Information Theory (Lecture 7, Random forest)
The amount of information can be viewed as the degree of surprise on learning the value of x. If two events x and y are unrelated, their information should add: h(x, y) = h(x) + h(y). Since p(x, y) = p(x) p(y) for independent events, h(x) must be a logarithm of p(x),
h(x) = −log2 p(x),
where the minus sign ensures that the information is positive or zero.

The average amount of information (called the entropy) is given by
H[x] = −Σx p(x) log2 p(x).
The differential entropy for a multivariate continuous variable x is
H[x] = −∫ p(x) ln p(x) dx.
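To make these two formulas concrete, here is a small NumPy check, again an illustrative sketch only: the discrete entropy of a few simple distributions in bits, and the differential entropy of a 1-D Gaussian estimated on a grid and compared with the standard closed form H = 0.5 ln(2πeσ²), a well-known result that is not stated in the slides. The helper name entropy_discrete and the choice σ = 2 are ours.

```python
import numpy as np

def entropy_discrete(p):
    """H[x] = -sum_x p(x) log2 p(x), with 0 log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

print(entropy_discrete([0.5, 0.5]))     # 1.0 bit: a fair coin is maximally surprising
print(entropy_discrete([1.0, 0.0]))     # 0.0 bits: a certain outcome carries no information
print(entropy_discrete([0.25] * 4))     # 2.0 bits: uniform over four outcomes

# Differential entropy H[x] = -integral p(x) ln p(x) dx for a 1-D Gaussian,
# estimated on a fine grid and compared with 0.5 * ln(2 * pi * e * sigma^2).
sigma = 2.0
x = np.linspace(-10 * sigma, 10 * sigma, 200001)
p = np.exp(-0.5 * (x / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)
h_numeric = -(p * np.log(p)).sum() * (x[1] - x[0])   # Riemann-sum approximation
h_closed = 0.5 * np.log(2 * np.pi * np.e * sigma ** 2)
print(h_numeric, h_closed)              # the two values agree closely
```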