The EM algorithm, and Fisher vector image representation
Jakob Verbeek
December 17, 2010
Course website: http://lear.inrialpes.fr/~verbeek/MLCR.10.11.php

Plan for the course
• Session 1, October 1 2010
  – Cordelia Schmid: Introduction
  – Jakob Verbeek: Introduction to Machine Learning
• Session 2, December 3 2010
  – Jakob Verbeek: Clustering with k-means, mixture of Gaussians
  – Cordelia Schmid: Local invariant features
  – Student presentation 1: Scale and affine invariant interest point detectors, Mikolajczyk, Schmid, IJCV 2004.
• Session 3, December 10 2010
  – Cordelia Schmid: Instance-level recognition: efficient search
  – Student presentation 2: Scalable Recognition with a Vocabulary Tree, Nister and Stewenius, CVPR 2006.

Plan for the course
• Session 4, December 17 2010
  – Jakob Verbeek: The EM algorithm, and Fisher vector image representation
  – Cordelia Schmid: Bag-of-features models for category-level classification
  – Student presentation 3: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
• Session 5, January 7 2011
  – Jakob Verbeek: Classification 1: generative and non-parametric methods
  – Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
  – Cordelia Schmid: Category level localization: Sliding window and shape model
  – Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
• Session 6, January 14 2011
  – Jakob Verbeek: Classification 2: discriminative models
  – Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
  – Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.

Clustering with k-means vs. MoG
• Hard assignment in k-means is not robust near the border of quantization cells
• Soft assignment in MoG accounts for ambiguity in the assignment
• Both algorithms are sensitive to initialization
  – Run from several initializations
  – Keep the best result
• The number of clusters needs to be set
• Both algorithms can be generalized to other types of distances or densities
Images from [Gemert et al, IEEE TPAMI, 2010]

Clustering with Gaussian mixture density
• A mixture density is a weighted sum of Gaussians
  – Mixing weight: importance of each cluster
    p(x) = \sum_{k=1}^K \pi_k N(x; m_k, C_k)
    N(x; m, C) = (2\pi)^{-d/2} |C|^{-1/2} \exp\left(-\tfrac{1}{2}(x - m)^T C^{-1} (x - m)\right)
• The density has to integrate to unity, so we require
    \pi_k \geq 0, \qquad \sum_{k=1}^K \pi_k = 1

Clustering with Gaussian mixture density
• Given: data set of N points x_n, n = 1, ..., N
• Find the mixture of Gaussians (MoG) that best explains the data
  – Parameters: mixing weights, means, covariance matrices
  – Assume the data points are drawn independently
  – Maximize the log-likelihood of the data set X w.r.t. the parameters \theta = \{\pi_k, m_k, C_k\}_{k=1}^K:
    L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
• As with k-means, the objective function has local optima
  – Can use the Expectation-Maximization (EM) algorithm
  – Similar to the iterative k-means algorithm
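A minimal NumPy sketch (illustrative names, not the course code) of the mixture density and the log-likelihood objective defined above:

```python
# Gaussian mixture density p(x) = sum_k pi_k N(x; m_k, C_k) and the data
# log-likelihood L(theta). Illustrative sketch, not the course implementation.
import numpy as np

def gauss_density(X, m, C):
    """N(x; m, C) for each row of X (N, d), mean m (d,), covariance C (d, d)."""
    d = X.shape[1]
    diff = X - m                                     # (N, d)
    Cinv = np.linalg.inv(C)
    # Quadratic form (x - m)^T C^{-1} (x - m) for every data point.
    quad = np.einsum('nd,de,ne->n', diff, Cinv, diff)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

def mog_log_likelihood(X, pis, means, covs):
    """L(theta) = sum_n log sum_k pi_k N(x_n; m_k, C_k)."""
    # p[n, k] = pi_k N(x_n; m_k, C_k)
    p = np.stack([pi * gauss_density(X, m, C)
                  for pi, m, C in zip(pis, means, covs)], axis=1)
    return np.sum(np.log(p.sum(axis=1)))
```

This is the quantity that EM maximizes in the following slides; the direct maximization has no closed-form solution because of the log of a sum.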
Maximum likelihood estimation of MoG
• Use the EM algorithm
  – Initialize the MoG parameters
  – E-step: softly assign the data points to the mixture components
  – M-step: update the parameters
  – Repeat the EM steps, terminate if converged
    • Convergence of parameters or assignments
• E-step: compute the posterior on z given x:
    q_{nk} = p(z_n = k \mid x_n) = \frac{\pi_k N(x_n; m_k, C_k)}{p(x_n)}
• M-step: update the parameters using the posteriors
    \pi_k = \frac{1}{N} \sum_{n=1}^N q_{nk}
    m_k = \frac{1}{N \pi_k} \sum_{n=1}^N q_{nk} x_n
    C_k = \frac{1}{N \pi_k} \sum_{n=1}^N q_{nk} (x_n - m_k)(x_n - m_k)^T

Maximum likelihood estimation of MoG
• Example of several EM iterations
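Before turning to the bound-optimization view below, here is a minimal sketch of one EM iteration, following the E-step and M-step formulas above; names and the SciPy dependency are assumptions for illustration, not the course code.

```python
# One EM iteration for a mixture of Gaussians with full covariances.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, means, covs):
    N, d = X.shape
    K = len(pis)

    # E-step: q_nk = pi_k N(x_n; m_k, C_k) / p(x_n)
    q = np.stack([pis[k] * multivariate_normal.pdf(X, means[k], covs[k])
                  for k in range(K)], axis=1)        # (N, K)
    q /= q.sum(axis=1, keepdims=True)

    # M-step: note that N * pi_k = sum_n q_nk, so the updates can be written
    # with the per-cluster soft counts Nk.
    Nk = q.sum(axis=0)                               # sum_n q_nk, shape (K,)
    new_pis = Nk / N                                 # pi_k = (1/N) sum_n q_nk
    new_means = (q.T @ X) / Nk[:, None]              # m_k = sum_n q_nk x_n / Nk
    new_covs = []
    for k in range(K):
        diff = X - new_means[k]
        new_covs.append((q[:, k, None] * diff).T @ diff / Nk[k])
    return new_pis, new_means, new_covs, q
```

Iterating em_step until the change in log-likelihood (or in the assignments q) falls below a threshold implements the "repeat until converged" loop of the slide.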
Bound optimization view of EM
• The EM algorithm is an iterative bound-optimization algorithm
  – Goal: maximize the data log-likelihood, which cannot be done in closed form
    L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\, p(x_n \mid k)
  – Solution: maximize a simple-to-optimize bound on the log-likelihood
  – Iterations: compute the bound, maximize it, repeat
• The bound uses two information-theoretic quantities
  – Entropy
  – Kullback-Leibler divergence

Entropy of a distribution
• Entropy captures the uncertainty in a distribution
  – Maximum for the uniform distribution
  – Minimum, zero, for a delta peak on a single value
    H(q) = -\sum_{k=1}^K q(z = k) \log q(z = k)
• Connection to information coding (noiseless coding theorem, Shannon 1948)
  – Frequent messages get short codes; the optimal code length is (at least) -log p bits
  – Entropy: expected code length
    • Suppose a uniform distribution over 8 outcomes: 3-bit code words
    • Suppose the distribution 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64: entropy 2 bits!
    • Code words: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
    • Code words are “self-delimiting”:
      – a code word either stops after its first 0, or has length 6 and starts with four 1s.
[Figure: example of a low-entropy and a high-entropy distribution]

Kullback-Leibler divergence
• Asymmetric dissimilarity between distributions
  – Minimum, zero, if the distributions are equal
  – Maximum, infinity, if p has a zero where q is non-zero
    D(q \,\|\, p) = \sum_{k=1}^K q(z_k) \log \frac{q(z_k)}{p(z_k)} \geq 0
• Interpretation in coding theory
  – Sub-optimality when messages are distributed according to q, but coded with code-word lengths derived from p
  – Difference of expected code lengths
    D(q \,\|\, p) = -\sum_{k=1}^K q(z_k) \log p(z_k) - H(q)
  – Suppose the distribution q: 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64
  – Coding with a uniform 3-bit code, p = uniform
  – Expected code length using p: 3 bits
  – Optimal expected code length, entropy H(q) = 2 bits
  – KL divergence D(q || p) = 1 bit

EM bound on log-likelihood
• Define the Gaussian mixture p(x) as the marginal distribution of p(x, z)
    p(z_n = k) = \pi_k, \qquad p(x_n \mid z_n = k) = N(x_n; m_k, C_k)
    p(x_n) = \sum_{k=1}^K p(z_n = k)\, p(x_n \mid z_n = k) = \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
• Posterior distribution on the latent cluster assignment
    p(z_n \mid x_n) = \frac{p(z_n)\, p(x_n \mid z_n)}{p(x_n)}
• Let q_n(z_n) be an arbitrary distribution over the cluster assignment
• Bound the log-likelihood by subtracting the KL divergence D(q_n(z_n) || p(z_n | x_n)):
    \log p(x_n) \geq \log p(x_n) - D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big)

Maximizing the EM bound on log-likelihood
• E-step: fix the model parameters, update the distributions q_n
    L(\theta, \{q_n\}) = \sum_{n=1}^N \Big[ \log p(x_n) - D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big) \Big]
  – The KL divergence is zero if the distributions are equal
  – Thus set q_n(z_n) = p(z_n | x_n)
• M-step: fix the q_n, update the model parameters
    L(\theta, \{q_n\}) = \sum_{n=1}^N \Big[ \log p(x_n) - \sum_k q_{nk} \big( \log q_{nk} - \log p(z_n = k \mid x_n) \big) \Big]
                       = \sum_{n=1}^N \Big[ H(q_n) + \sum_k q_{nk} \log p(z_n = k, x_n) \Big]
                       = \sum_{n=1}^N \Big[ H(q_n) + \sum_k q_{nk} \big( \log \pi_k + \log N(x_n; m_k, C_k) \big) \Big]
• The terms for each Gaussian are decoupled from the rest!

Maximizing the EM bound on log-likelihood
• Derive the optimal values for the mixing weights
  – Maximize \sum_{n=1}^N \sum_{k=1}^K q_{nk} \log \pi_k
  – Take into account that the weights sum to one: define \pi_1 = 1 - \sum_{k=2}^K \pi_k
  – Take the derivative for a mixing weight \pi_k, k > 1, and set it to zero:
    \frac{\partial}{\partial \pi_k} \sum_{n=1}^N \sum_{k'=1}^K q_{nk'} \log \pi_{k'} = \frac{1}{\pi_k} \sum_{n=1}^N q_{nk} - \frac{1}{\pi_1} \sum_{n=1}^N q_{n1} = 0
  – This holds for every k, so \pi_k \propto \sum_{n=1}^N q_{nk}, and normalization gives
    \pi_k = \frac{1}{N} \sum_{n=1}^N q_{nk}

Maximizing the EM bound on log-likelihood
• Derive the optimal values for the MoG parameters
  – Maximize \sum_{n=1}^N q_{nk} \log N(x_n; m_k, C_k), with
    \log N(x; m, C) = -\tfrac{d}{2} \log(2\pi) - \tfrac{1}{2} \log |C| - \tfrac{1}{2} (x - m)^T C^{-1} (x - m)
  – Gradient w.r.t. the mean:
    \frac{\partial}{\partial m} \log N(x; m, C) = C^{-1}(x - m)
    \Rightarrow\quad m_k = \frac{\sum_{n=1}^N q_{nk}\, x_n}{\sum_{n=1}^N q_{nk}}
  – Gradient w.r.t. the inverse covariance:
    \frac{\partial}{\partial C^{-1}} \log N(x; m, C) = \tfrac{1}{2} C - \tfrac{1}{2} (x - m)(x - m)^T
    \Rightarrow\quad C_k = \frac{\sum_{n=1}^N q_{nk}\, (x_n - m_k)(x_n - m_k)^T}{\sum_{n=1}^N q_{nk}}

EM bound on log-likelihood
• L is a bound on the data log-likelihood for any distributions q_n
    L(\theta, \{q_n\}) = \sum_{n=1}^N \Big[ \log p(x_n) - D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big) \Big]
• Iterative coordinate ascent on the bound
  – E-step: optimize the q_n, which makes the bound tight
  – M-step: optimize the parameters

Clustering for image representation
• For each image that we want to classify / analyze:
  1. Detect local image regions
     – For example, affine invariant interest points
  2. Describe the appearance of each region
     – For example, using the SIFT descriptor
  3. Quantize the local image descriptors
     – using k-means or a mixture of Gaussians
     – (Softly) assign each region to the clusters
     – Count how many regions were assigned to each cluster
• Results in a histogram of (soft) counts
  – How many image regions were assigned to each cluster
  – Input to the image classification method
• Off-line: learn the k-means quantization or the mixture of Gaussians from the data of many images

Clustering for image representation
• Detect local image regions
  – For example, affine invariant interest points
• Describe the appearance of each region
  – For example, using the SIFT descriptor
• Quantize the local image descriptors
  – using k-means or a mixture of Gaussians
  – Cluster centers / Gaussians are learned off-line
  – (Softly) assign each region to the clusters
  – Count how many regions were assigned to each cluster
• Results in a histogram of (soft) counts
  – How many image regions were assigned to each cluster
• Input to the image classification method
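A minimal sketch of the quantization step above, turning the local descriptors of one image into a histogram of hard counts (k-means) or soft counts (MoG posteriors); variable names are illustrative, not from the course code.

```python
# Bag-of-words image representation from local descriptors X (N, D):
# hard counts with k-means centers, or soft counts with MoG posteriors q (N, K).
import numpy as np

def hard_count_histogram(X, centers):
    """Assign each descriptor to its closest k-means center and count."""
    # Squared distances between all descriptors and all centers: (N, K)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)
    return np.bincount(assignment, minlength=len(centers))

def soft_count_histogram(q):
    """Sum the MoG posteriors q_nk over all descriptors of the image."""
    return q.sum(axis=0)
```

Either histogram is then fed to the image classification method; the quantizer itself is learned off-line from the descriptors of many images.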
Fisher vector representation: motivation
• Feature vector quantization is computationally expensive in practice
• Run-time is linear in
  – N: nr. of feature vectors, ~10^3 per image
  – D: nr. of dimensions, ~10^2 (SIFT)
  – K: nr. of clusters, ~10^3 for recognition
• So in total on the order of 10^8 multiplications per image to obtain a histogram of size 1000
• Can we do this more efficiently?!
  – Yes, store more than just the number of data points assigned to each cluster centre / Gaussian
• Reading material: “Fisher Kernels on Visual Vocabularies for Image Categorization”, F. Perronnin and C. Dance, CVPR'07, Xerox Research Centre Europe, Grenoble
[Figure: example histogram with counts 20, 10, 5, 3, 8]

Fisher vector image representation
• MoG / k-means stores the nr of points per cell
  – Need many clusters to represent the distribution of descriptors in an image
  – But this increases the computational cost
• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters needed for the same accuracy
  – Per cluster, also store the mean and variance of the data in the cell
[Figure: histogram with counts 20, 10, 5, 3, 8, extended with per-cell means and variances]

Image representation using Fisher kernels
• General idea of the Fisher vector representation
  – Fit a probabilistic model to the data
  – Use the derivative of the data log-likelihood as the data representation, e.g. for classification
    See [Jaakkola & Haussler, “Exploiting generative models in discriminative classifiers”, in Advances in Neural Information Processing Systems 11, 1999.]
• Here, we use a mixture of Gaussians to cluster the region descriptors
    L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
• Concatenate the derivatives to obtain the data representation
    \frac{\partial L(\theta)}{\partial \pi_k} \propto \sum_{n=1}^N q_{nk}
    \frac{\partial L(\theta)}{\partial m_k} = \sum_{n=1}^N q_{nk}\, C_k^{-1} (x_n - m_k)
    \frac{\partial L(\theta)}{\partial C_k^{-1}} = \sum_{n=1}^N q_{nk} \Big( \tfrac{1}{2} C_k - \tfrac{1}{2} (x_n - m_k)(x_n - m_k)^T \Big)

Image representation using Fisher kernels
• Extended representation of the image descriptors using a MoG
  – Soft assignment of the descriptor to each cluster
  – Displacement of the descriptor from the cluster center
  – Squares of the displacement from the cluster center
  – From 1 number per descriptor per cluster to 1 + D + D^2 (D = data dimension)
• A simplified version is obtained when
  – Using this representation for a linear classifier
  – Using diagonal covariance matrices, with the variance in each dimension given by a vector v_k
  – For a single image region descriptor:
    q_{nk}, \qquad q_{nk}(x_n - m_k), \qquad q_{nk}(x_n - m_k)^2
  – Summed over all descriptors this gives us
    • 1: soft count of the regions assigned to the cluster
    • D: weighted average of the assigned descriptors
    • D: weighted variance of the descriptors in all dimensions

Fisher vector image representation
• MoG / k-means stores the nr of points per cell
  – Need many clusters to represent the distribution of descriptors in an image
• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters needed for the same accuracy
  – Representation is (2D+1) times larger, at the same computational cost
  – The terms are already calculated when computing the soft assignments
  – Computational cost is O(NKD): we need the differences between all clusters and data points anyway
    q_{nk}, \qquad q_{nk}(x_n - m_k), \qquad q_{nk}(x_n - m_k)^2
[Figure: histogram with counts 5, 20, 8, 3, 10, extended with per-cell means and variances]
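A minimal NumPy sketch of the simplified per-image Fisher statistics described above (soft counts plus 1st and 2nd order moments per Gaussian, diagonal covariances). It omits the normalizations used by Perronnin and Dance (CVPR'07), and all names are illustrative assumptions rather than the course's code.

```python
# Simplified Fisher vector statistics for one image: per Gaussian k, store
# sum_n q_nk, sum_n q_nk (x_n - m_k), and sum_n q_nk (x_n - m_k)^2.
import numpy as np

def fisher_statistics(X, pis, means, variances):
    """X: (N, D) descriptors; pis: (K,); means, variances: (K, D) diagonal MoG."""
    # Soft assignments q_nk from the diagonal-covariance Gaussian log-densities.
    diff = X[:, None, :] - means[None, :, :]                   # (N, K, D)
    log_dens = -0.5 * (np.log(2 * np.pi * variances)[None]
                       + diff ** 2 / variances[None]).sum(axis=2)
    log_q = np.log(pis)[None] + log_dens                       # (N, K), unnormalized
    log_q -= log_q.max(axis=1, keepdims=True)                  # numerical stability
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)

    # Per-cluster statistics: soft counts, 1st and 2nd order moments.
    counts = q.sum(axis=0)                                      # (K,)
    first = (q[:, :, None] * diff).sum(axis=0)                  # (K, D)
    second = (q[:, :, None] * diff ** 2).sum(axis=0)            # (K, D)
    return np.concatenate([counts, first.ravel(), second.ravel()])
```

The resulting vector has length K(2D+1), matching the "(2D+1) times larger" remark above, and the soft assignments computed here are exactly the ones a plain bag-of-words histogram would already need.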
Images from the PASCAL VOC categorization task
• Yearly “competition” since 2005 for image classification (also object localization, segmentation, and body-part localization)

Fisher Vector: results
• BOV-supervised learns a separate mixture model for each image class, which makes some of the visual words class-specific
• MAP: assign the image to the class for which the corresponding MoG assigns maximum likelihood to the region descriptors
• Other results: based on a linear classifier of the image descriptions
• Similar performance, using 16x fewer Gaussians
• An unsupervised/universal representation works well

How to set the nr of clusters?
• The optimization criterion of k-means and MoG is always improved by adding more clusters
  – K-means: the minimum distance to the closest cluster center cannot increase when adding a cluster center
  – MoG: we can always add the new Gaussian with zero mixing weight; the (k+1)-component models contain the k-component models
• So the optimization criterion cannot be used to select the nr of clusters
• Model selection by adding a penalty term that increases with the nr of clusters
  – Minimum description length (MDL) principle
  – Bayesian information criterion (BIC)
  – Akaike information criterion (AIC)
• Cross-validation if used for another task, e.g. image categorization
  – check the performance of the final system on a validation set of labeled images
• For more details see “Pattern Recognition & Machine Learning”, by C. Bishop, 2006. In particular chapter 9, and section 3.4

How to set the nr of clusters?
• Bayesian model that treats the parameters as missing values
  – Prior distribution over the parameters
  – Likelihood of the data given by averaging over the parameter values
    p(X) = \int \sum_Z p(X, Z \mid \theta)\, p(\theta)\, d\theta = \int \sum_Z p(X \mid Z, \theta)\, p(Z \mid \theta)\, p(\theta)\, d\theta
• Variational Bayesian inference for various nr of clusters
  – Approximate the data log-likelihood using the EM bound
    \ln p(X) \geq \ln p(X) - D\big( q(Z, \theta) \,\|\, p(Z, \theta \mid X) \big)
  – E-step: the distribution q is generally too complex to represent exactly
  – Use a factorizing distribution q, which is not exact, so the KL divergence > 0
    q(Z, \theta) = q(Z)\, q(\theta)
• For models with
  – Many parameters: fits many data sets
  – Few parameters: won't fit the data well
  – The “right” nr. of parameters: a good fit
[Figure: model evidence over data sets, for models of different complexity]
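As a small illustration of the penalty-based model selection mentioned above, the sketch below picks the number of mixture components by BIC. It relies on scikit-learn's GaussianMixture as a convenience; this is an assumption for illustration, not the course's tooling, and in practice cross-validation on the final categorization task is the criterion recommended on the slide.

```python
# Choose the number of MoG components K by the Bayesian information criterion.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_values=range(1, 11), seed=0):
    """Fit a diagonal-covariance GMM for each K and return the K with lowest BIC."""
    best_k, best_bic = None, np.inf
    for k in k_values:
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              random_state=seed).fit(X)
        bic = gmm.bic(X)   # -2 log-likelihood plus a penalty growing with K
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```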