The EM algorithm, and Fisher vector image representation
Jakob Verbeek
December 17, 2010
Course website: http://lear.inrialpes.fr/~verbeek/MLCR.10.11.php

Plan for the course
• Session 1, October 1 2010
  – Cordelia Schmid: Introduction
  – Jakob Verbeek: Introduction to Machine Learning
• Session 2, December 3 2010
  – Jakob Verbeek: Clustering with k-means, mixture of Gaussians
  – Cordelia Schmid: Local invariant features
  – Student presentation 1: Scale and affine invariant interest point detectors, Mikolajczyk, Schmid, IJCV 2004.
• Session 3, December 10 2010
  – Cordelia Schmid: Instance-level recognition: efficient search
  – Student presentation 2: Scalable Recognition with a Vocabulary Tree, Nister and Stewenius, CVPR 2006.

Plan for the course
• Session 4, December 17 2010
  – Jakob Verbeek: The EM algorithm, and Fisher vector image representation
  – Cordelia Schmid: Bag-of-features models for category-level classification
  – Student presentation 3: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Lazebnik, Schmid and Ponce, CVPR 2006.
• Session 5, January 7 2011
  – Jakob Verbeek: Classification 1: generative and non-parametric methods
  – Student presentation 4: Large-Scale Image Retrieval with Compressed Fisher Vectors, Perronnin, Liu, Sanchez and Poirier, CVPR 2010.
  – Cordelia Schmid: Category level localization: Sliding window and shape model
  – Student presentation 5: Object Detection with Discriminatively Trained Part Based Models, Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2010.
• Session 6, January 14 2011
  – Jakob Verbeek: Classification 2: discriminative models
  – Student presentation 6: TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation, Guillaumin, Mensink, Verbeek and Schmid, ICCV 2009.
  – Student presentation 7: IM2GPS: estimating geographic information from a single image, Hays and Efros, CVPR 2008.

Clustering with k-means vs. MoG
• Hard assignment in k-means is not robust near the border of quantization cells
• Soft assignment in MoG accounts for ambiguity in the assignment
• Both algorithms are sensitive to initialization
  – Run from several initializations
  – Keep the best result
• The number of clusters needs to be set
• Both algorithms can be generalized to other types of distances or densities
Images from [Gemert et al, IEEE TPAMI, 2010]

Clustering with Gaussian mixture density
• A mixture density is a weighted sum of Gaussians
  – Mixing weight: importance of each cluster
    p(x) = \sum_{k=1}^K \pi_k N(x; m_k, C_k)
    N(x; m, C) = (2\pi)^{-d/2} |C|^{-1/2} \exp\left(-\tfrac{1}{2}(x - m)^T C^{-1} (x - m)\right)
• The density has to integrate to unity, so we require
    \pi_k \geq 0, \qquad \sum_{k=1}^K \pi_k = 1

Clustering with Gaussian mixture density
• Given: data set of N points x_n, n = 1, ..., N
• Find the mixture of Gaussians (MoG) that best explains the data
  – Parameters: mixing weights, means, covariance matrices
  – Assume the data points are drawn independently
  – Maximize the log-likelihood of the data set X w.r.t. the parameters \theta = \{\pi_k, m_k, C_k\}_{k=1}^K:
    L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
• As with k-means, the objective function has local optima
  – Can use the Expectation-Maximization (EM) algorithm
  – Similar to the iterative k-means algorithm
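A minimal NumPy sketch (illustrative names, not the course code) of the mixture density and the log-likelihood objective defined above:

```python
# Gaussian mixture density p(x) = sum_k pi_k N(x; m_k, C_k) and the data
# log-likelihood L(theta). Illustrative sketch, not the course implementation.
import numpy as np

def gauss_density(X, m, C):
    """N(x; m, C) for each row of X (N, d), mean m (d,), covariance C (d, d)."""
    d = X.shape[1]
    diff = X - m                                     # (N, d)
    Cinv = np.linalg.inv(C)
    # Quadratic form (x - m)^T C^{-1} (x - m) for every data point.
    quad = np.einsum('nd,de,ne->n', diff, Cinv, diff)
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(C) ** (-0.5)
    return norm * np.exp(-0.5 * quad)

def mog_log_likelihood(X, pis, means, covs):
    """L(theta) = sum_n log sum_k pi_k N(x_n; m_k, C_k)."""
    # p[n, k] = pi_k N(x_n; m_k, C_k)
    p = np.stack([pi * gauss_density(X, m, C)
                  for pi, m, C in zip(pis, means, covs)], axis=1)
    return np.sum(np.log(p.sum(axis=1)))
```

This is the quantity that EM maximizes in the following slides; the direct maximization has no closed-form solution because of the log of a sum.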
Maximum likelihood estimation of MoG
• Use the EM algorithm
  – Initialize the MoG parameters
  – E-step: softly assign the data points to the mixture components
  – M-step: update the parameters
  – Repeat the EM steps, terminate if converged
    • Convergence of parameters or assignments
• E-step: compute the posterior on z given x:
    q_{nk} = p(z_n = k \mid x_n) = \frac{\pi_k N(x_n; m_k, C_k)}{p(x_n)}
• M-step: update the parameters using the posteriors
    \pi_k = \frac{1}{N} \sum_{n=1}^N q_{nk}
    m_k = \frac{1}{N \pi_k} \sum_{n=1}^N q_{nk} x_n
    C_k = \frac{1}{N \pi_k} \sum_{n=1}^N q_{nk} (x_n - m_k)(x_n - m_k)^T

Maximum likelihood estimation of MoG
• Example of several EM iterations
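Before turning to the bound-optimization view below, here is a minimal sketch of one EM iteration, following the E-step and M-step formulas above; names and the SciPy dependency are assumptions for illustration, not the course code.

```python
# One EM iteration for a mixture of Gaussians with full covariances.
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, means, covs):
    N, d = X.shape
    K = len(pis)

    # E-step: q_nk = pi_k N(x_n; m_k, C_k) / p(x_n)
    q = np.stack([pis[k] * multivariate_normal.pdf(X, means[k], covs[k])
                  for k in range(K)], axis=1)        # (N, K)
    q /= q.sum(axis=1, keepdims=True)

    # M-step: note that N * pi_k = sum_n q_nk, so the updates can be written
    # with the per-cluster soft counts Nk.
    Nk = q.sum(axis=0)                               # sum_n q_nk, shape (K,)
    new_pis = Nk / N                                 # pi_k = (1/N) sum_n q_nk
    new_means = (q.T @ X) / Nk[:, None]              # m_k = sum_n q_nk x_n / Nk
    new_covs = []
    for k in range(K):
        diff = X - new_means[k]
        new_covs.append((q[:, k, None] * diff).T @ diff / Nk[k])
    return new_pis, new_means, new_covs, q
```

Iterating em_step until the change in log-likelihood (or in the assignments q) falls below a threshold implements the "repeat until converged" loop of the slide.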
Bound optimization view of EM
• The EM algorithm is an iterative bound-optimization algorithm
  – Goal: maximize the data log-likelihood, which cannot be done in closed form
    L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k\, p(x_n \mid k)
  – Solution: maximize a simple-to-optimize bound on the log-likelihood
  – Iterations: compute the bound, maximize it, repeat
• The bound uses two information-theoretic quantities
  – Entropy
  – Kullback-Leibler divergence

Entropy of a distribution
• Entropy captures the uncertainty in a distribution
  – Maximum for the uniform distribution
  – Minimum, zero, for a delta peak on a single value
    H(q) = -\sum_{k=1}^K q(z = k) \log q(z = k)
• Connection to information coding (noiseless coding theorem, Shannon 1948)
  – Frequent messages get short codes; the optimal code length is (at least) -log p bits
  – Entropy: expected code length
    • Suppose a uniform distribution over 8 outcomes: 3-bit code words
    • Suppose the distribution 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64: entropy 2 bits!
    • Code words: 0, 10, 110, 1110, 111100, 111101, 111110, 111111
    • Code words are “self-delimiting”:
      – a code word either stops after its first 0, or has length 6 and starts with four 1s.
[Figure: example of a low-entropy and a high-entropy distribution]

Kullback-Leibler divergence
• Asymmetric dissimilarity between distributions
  – Minimum, zero, if the distributions are equal
  – Maximum, infinity, if p has a zero where q is non-zero
    D(q \,\|\, p) = \sum_{k=1}^K q(z_k) \log \frac{q(z_k)}{p(z_k)} \geq 0
• Interpretation in coding theory
  – Sub-optimality when messages are distributed according to q, but coded with code-word lengths derived from p
  – Difference of expected code lengths
    D(q \,\|\, p) = -\sum_{k=1}^K q(z_k) \log p(z_k) - H(q)
  – Suppose the distribution q: 1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64
  – Coding with a uniform 3-bit code, p = uniform
  – Expected code length using p: 3 bits
  – Optimal expected code length, entropy H(q) = 2 bits
  – KL divergence D(q || p) = 1 bit

EM bound on log-likelihood
• Define the Gaussian mixture p(x) as the marginal distribution of p(x, z)
    p(z_n = k) = \pi_k, \qquad p(x_n \mid z_n = k) = N(x_n; m_k, C_k)
    p(x_n) = \sum_{k=1}^K p(z_n = k)\, p(x_n \mid z_n = k) = \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
• Posterior distribution on the latent cluster assignment
    p(z_n \mid x_n) = \frac{p(z_n)\, p(x_n \mid z_n)}{p(x_n)}
• Let q_n(z_n) be an arbitrary distribution over the cluster assignment
• Bound the log-likelihood by subtracting the KL divergence D(q_n(z_n) || p(z_n | x_n)):
    \log p(x_n) \geq \log p(x_n) - D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big)

Maximizing the EM bound on log-likelihood
• E-step: fix the model parameters, update the distributions q_n
    L(\theta, \{q_n\}) = \sum_{n=1}^N \Big[ \log p(x_n) - D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big) \Big]
  – The KL divergence is zero if the distributions are equal
  – Thus set q_n(z_n) = p(z_n | x_n)
• M-step: fix the q_n, update the model parameters
    L(\theta, \{q_n\}) = \sum_{n=1}^N \Big[ \log p(x_n) - \sum_k q_{nk} \big( \log q_{nk} - \log p(z_n = k \mid x_n) \big) \Big]
                       = \sum_{n=1}^N \Big[ H(q_n) + \sum_k q_{nk} \log p(z_n = k, x_n) \Big]
                       = \sum_{n=1}^N \Big[ H(q_n) + \sum_k q_{nk} \big( \log \pi_k + \log N(x_n; m_k, C_k) \big) \Big]
• The terms for each Gaussian are decoupled from the rest!

Maximizing the EM bound on log-likelihood
• Derive the optimal values for the mixing weights
  – Maximize \sum_{n=1}^N \sum_{k=1}^K q_{nk} \log \pi_k
  – Take into account that the weights sum to one: define \pi_1 = 1 - \sum_{k=2}^K \pi_k
  – Take the derivative for a mixing weight \pi_k, k > 1, and set it to zero:
    \frac{\partial}{\partial \pi_k} \sum_{n=1}^N \sum_{k'=1}^K q_{nk'} \log \pi_{k'} = \frac{1}{\pi_k} \sum_{n=1}^N q_{nk} - \frac{1}{\pi_1} \sum_{n=1}^N q_{n1} = 0
  – This holds for every k, so \pi_k \propto \sum_{n=1}^N q_{nk}, and normalization gives
    \pi_k = \frac{1}{N} \sum_{n=1}^N q_{nk}

Maximizing the EM bound on log-likelihood
• Derive the optimal values for the MoG parameters
  – Maximize \sum_{n=1}^N q_{nk} \log N(x_n; m_k, C_k), with
    \log N(x; m, C) = -\tfrac{d}{2} \log(2\pi) - \tfrac{1}{2} \log |C| - \tfrac{1}{2} (x - m)^T C^{-1} (x - m)
  – Gradient w.r.t. the mean:
    \frac{\partial}{\partial m} \log N(x; m, C) = C^{-1}(x - m)
    \Rightarrow\quad m_k = \frac{\sum_{n=1}^N q_{nk}\, x_n}{\sum_{n=1}^N q_{nk}}
  – Gradient w.r.t. the inverse covariance:
    \frac{\partial}{\partial C^{-1}} \log N(x; m, C) = \tfrac{1}{2} C - \tfrac{1}{2} (x - m)(x - m)^T
    \Rightarrow\quad C_k = \frac{\sum_{n=1}^N q_{nk}\, (x_n - m_k)(x_n - m_k)^T}{\sum_{n=1}^N q_{nk}}

EM bound on log-likelihood
• L is a bound on the data log-likelihood for any distributions q_n
    L(\theta, \{q_n\}) = \sum_{n=1}^N \Big[ \log p(x_n) - D\big(q_n(z_n) \,\|\, p(z_n \mid x_n)\big) \Big]
• Iterative coordinate ascent on the bound
  – E-step: optimize the q_n, which makes the bound tight
  – M-step: optimize the parameters

Clustering for image representation
• For each image that we want to classify / analyze:
  1. Detect local image regions
     – For example, affine invariant interest points
  2. Describe the appearance of each region
     – For example, using the SIFT descriptor
  3. Quantize the local image descriptors
     – using k-means or a mixture of Gaussians
     – (Softly) assign each region to the clusters
     – Count how many regions were assigned to each cluster
• Results in a histogram of (soft) counts
  – How many image regions were assigned to each cluster
  – Input to the image classification method
• Off-line: learn the k-means quantization or the mixture of Gaussians from the data of many images

Clustering for image representation
• Detect local image regions
  – For example, affine invariant interest points
• Describe the appearance of each region
  – For example, using the SIFT descriptor
• Quantize the local image descriptors
  – using k-means or a mixture of Gaussians
  – Cluster centers / Gaussians are learned off-line
  – (Softly) assign each region to the clusters
  – Count how many regions were assigned to each cluster
• Results in a histogram of (soft) counts
  – How many image regions were assigned to each cluster
• Input to the image classification method
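A minimal sketch of the quantization step above, turning the local descriptors of one image into a histogram of hard counts (k-means) or soft counts (MoG posteriors); variable names are illustrative, not from the course code.

```python
# Bag-of-words image representation from local descriptors X (N, D):
# hard counts with k-means centers, or soft counts with MoG posteriors q (N, K).
import numpy as np

def hard_count_histogram(X, centers):
    """Assign each descriptor to its closest k-means center and count."""
    # Squared distances between all descriptors and all centers: (N, K)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assignment = d2.argmin(axis=1)
    return np.bincount(assignment, minlength=len(centers))

def soft_count_histogram(q):
    """Sum the MoG posteriors q_nk over all descriptors of the image."""
    return q.sum(axis=0)
```

Either histogram is then fed to the image classification method; the quantizer itself is learned off-line from the descriptors of many images.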
Fisher vector representation: motivation
• Feature vector quantization is computationally expensive in practice
• Run-time is linear in
  – N: nr. of feature vectors, ~10^3 per image
  – D: nr. of dimensions, ~10^2 (SIFT)
  – K: nr. of clusters, ~10^3 for recognition
• So in total on the order of 10^8 multiplications per image to obtain a histogram of size 1000
• Can we do this more efficiently?!
  – Yes, store more than just the number of data points assigned to each cluster centre / Gaussian
• Reading material: “Fisher Kernels on Visual Vocabularies for Image Categorization”, F. Perronnin and C. Dance, CVPR'07, Xerox Research Centre Europe, Grenoble
[Figure: example histogram with counts 20, 10, 5, 3, 8]

Fisher vector image representation
• MoG / k-means stores the nr of points per cell
  – Need many clusters to represent the distribution of descriptors in an image
  – But this increases the computational cost
• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters needed for the same accuracy
  – Per cluster, also store the mean and variance of the data in the cell
[Figure: histogram with counts 20, 10, 5, 3, 8, extended with per-cell means and variances]

Image representation using Fisher kernels
• General idea of the Fisher vector representation
  – Fit a probabilistic model to the data
  – Use the derivative of the data log-likelihood as the data representation, e.g. for classification
    See [Jaakkola & Haussler, “Exploiting generative models in discriminative classifiers”, in Advances in Neural Information Processing Systems 11, 1999.]
• Here, we use a mixture of Gaussians to cluster the region descriptors
    L(\theta) = \sum_{n=1}^N \log p(x_n) = \sum_{n=1}^N \log \sum_{k=1}^K \pi_k N(x_n; m_k, C_k)
• Concatenate the derivatives to obtain the data representation
    \frac{\partial L(\theta)}{\partial \pi_k} \propto \sum_{n=1}^N q_{nk}
    \frac{\partial L(\theta)}{\partial m_k} = \sum_{n=1}^N q_{nk}\, C_k^{-1} (x_n - m_k)
    \frac{\partial L(\theta)}{\partial C_k^{-1}} = \sum_{n=1}^N q_{nk} \Big( \tfrac{1}{2} C_k - \tfrac{1}{2} (x_n - m_k)(x_n - m_k)^T \Big)

Image representation using Fisher kernels
• Extended representation of the image descriptors using a MoG
  – Soft assignment of the descriptor to each cluster
  – Displacement of the descriptor from the cluster center
  – Squares of the displacement from the cluster center
  – From 1 number per descriptor per cluster to 1 + D + D^2 (D = data dimension)
• A simplified version is obtained when
  – Using this representation for a linear classifier
  – Using diagonal covariance matrices, with the variance in each dimension given by a vector v_k
  – For a single image region descriptor:
    q_{nk}, \qquad q_{nk}(x_n - m_k), \qquad q_{nk}(x_n - m_k)^2
  – Summed over all descriptors this gives us
    • 1: soft count of the regions assigned to the cluster
    • D: weighted average of the assigned descriptors
    • D: weighted variance of the descriptors in all dimensions

Fisher vector image representation
• MoG / k-means stores the nr of points per cell
  – Need many clusters to represent the distribution of descriptors in an image
• The Fisher vector adds 1st & 2nd order moments
  – More precise description of the regions assigned to each cluster
  – Fewer clusters needed for the same accuracy
  – Representation is (2D+1) times larger, at the same computational cost
  – The terms are already calculated when computing the soft assignments
  – Computational cost is O(NKD): we need the differences between all clusters and data points anyway
    q_{nk}, \qquad q_{nk}(x_n - m_k), \qquad q_{nk}(x_n - m_k)^2
[Figure: histogram with counts 5, 20, 8, 3, 10, extended with per-cell means and variances]
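A minimal NumPy sketch of the simplified per-image Fisher statistics described above (soft counts plus 1st and 2nd order moments per Gaussian, diagonal covariances). It omits the normalizations used by Perronnin and Dance (CVPR'07), and all names are illustrative assumptions rather than the course's code.

```python
# Simplified Fisher vector statistics for one image: per Gaussian k, store
# sum_n q_nk, sum_n q_nk (x_n - m_k), and sum_n q_nk (x_n - m_k)^2.
import numpy as np

def fisher_statistics(X, pis, means, variances):
    """X: (N, D) descriptors; pis: (K,); means, variances: (K, D) diagonal MoG."""
    # Soft assignments q_nk from the diagonal-covariance Gaussian log-densities.
    diff = X[:, None, :] - means[None, :, :]                   # (N, K, D)
    log_dens = -0.5 * (np.log(2 * np.pi * variances)[None]
                       + diff ** 2 / variances[None]).sum(axis=2)
    log_q = np.log(pis)[None] + log_dens                       # (N, K), unnormalized
    log_q -= log_q.max(axis=1, keepdims=True)                  # numerical stability
    q = np.exp(log_q)
    q /= q.sum(axis=1, keepdims=True)

    # Per-cluster statistics: soft counts, 1st and 2nd order moments.
    counts = q.sum(axis=0)                                      # (K,)
    first = (q[:, :, None] * diff).sum(axis=0)                  # (K, D)
    second = (q[:, :, None] * diff ** 2).sum(axis=0)            # (K, D)
    return np.concatenate([counts, first.ravel(), second.ravel()])
```

The resulting vector has length K(2D+1), matching the "(2D+1) times larger" remark above, and the soft assignments computed here are exactly the ones a plain bag-of-words histogram would already need.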
Images from the PASCAL VOC categorization task
• Yearly “competition” since 2005 for image classification (also object localization, segmentation, and body-part localization)

Fisher Vector: results
• BOV-supervised learns a separate mixture model for each image class, which makes some of the visual words class-specific
• MAP: assign the image to the class for which the corresponding MoG assigns maximum likelihood to the region descriptors
• Other results: based on a linear classifier of the image descriptions
• Similar performance, using 16x fewer Gaussians
• An unsupervised/universal representation works well

How to set the nr of clusters?
• The optimization criterion of k-means and MoG is always improved by adding more clusters
  – K-means: the minimum distance to the closest cluster center cannot increase when adding a cluster center
  – MoG: we can always add the new Gaussian with zero mixing weight; the (k+1)-component models contain the k-component models
• So the optimization criterion cannot be used to select the nr of clusters
• Model selection by adding a penalty term that increases with the nr of clusters
  – Minimum description length (MDL) principle
  – Bayesian information criterion (BIC)
  – Akaike information criterion (AIC)
• Cross-validation if used for another task, e.g. image categorization
  – check the performance of the final system on a validation set of labeled images
• For more details see “Pattern Recognition & Machine Learning”, by C. Bishop, 2006. In particular chapter 9, and section 3.4

How to set the nr of clusters?
• Bayesian model that treats the parameters as missing values
  – Prior distribution over the parameters
  – Likelihood of the data given by averaging over the parameter values
    p(X) = \int \sum_Z p(X, Z \mid \theta)\, p(\theta)\, d\theta = \int \sum_Z p(X \mid Z, \theta)\, p(Z \mid \theta)\, p(\theta)\, d\theta
• Variational Bayesian inference for various nr of clusters
  – Approximate the data log-likelihood using the EM bound
    \ln p(X) \geq \ln p(X) - D\big( q(Z, \theta) \,\|\, p(Z, \theta \mid X) \big)
  – E-step: the distribution q is generally too complex to represent exactly
  – Use a factorizing distribution q, which is not exact, so the KL divergence > 0
    q(Z, \theta) = q(Z)\, q(\theta)
• For models with
  – Many parameters: fits many data sets
  – Few parameters: won't fit the data well
  – The “right” nr. of parameters: a good fit
[Figure: model evidence over data sets, for models of different complexity]
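As a small illustration of the penalty-based model selection mentioned above, the sketch below picks the number of mixture components by BIC. It relies on scikit-learn's GaussianMixture as a convenience; this is an assumption for illustration, not the course's tooling, and in practice cross-validation on the final categorization task is the criterion recommended on the slide.

```python
# Choose the number of MoG components K by the Bayesian information criterion.
import numpy as np
from sklearn.mixture import GaussianMixture

def select_k_by_bic(X, k_values=range(1, 11), seed=0):
    """Fit a diagonal-covariance GMM for each K and return the K with lowest BIC."""
    best_k, best_bic = None, np.inf
    for k in k_values:
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              random_state=seed).fit(X)
        bic = gmm.bic(X)   # -2 log-likelihood plus a penalty growing with K
        if bic < best_bic:
            best_k, best_bic = k, bic
    return best_k
```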