EE462 MLCV Lecture 13-14
Face Recognition: Subspace/Manifold Learning
Tae-Kyun Kim

Face Recognition Applications
• Applications include
  – Automatic face tagging on commercial weblogs
  – Face image retrieval in MPEG-7 (our solution is part of the MPEG-7 standard)
  – Automatic passport control
  – Feature-length film character summarisation
• A key issue is the efficient representation of face images.

Face Recognition vs Object Categorisation
[Figure: face image data sets vs object categorisation data sets, illustrating intraclass and interclass variation for two classes in each case.]
In both, we seek representations/features that minimise intraclass variation and maximise interclass variation. Face image variations are more subtle than those of generic object categories. Subspace/manifold techniques (cf. Bag of Words for generic object categorisation) are the dominant approaches for face image analysis.

Principal Component Analysis (PCA)
• Maximum variance formulation
• Minimum-error formulation
• Probabilistic PCA

Maximum Variance Formulation of PCA
• PCA (also known as the Karhunen-Loeve transform) is a technique for dimensionality reduction, lossy data compression, feature extraction, and data visualisation.
• PCA can be defined as the orthogonal projection of the data onto a lower-dimensional linear space such that the variance of the projected data is maximised.

• Given a data set $\{x_n\}$, $n = 1, \dots, N$ with $x_n \in \mathbb{R}^D$, our goal is to project the data onto a space of dimension M << D while maximising the projected data variance. For simplicity, take M = 1. The direction of this space is defined by a vector $u_1 \in \mathbb{R}^D$ s.t. $u_1^T u_1 = 1$. Each data point $x_n$ is then projected onto a scalar value $u_1^T x_n$.

The mean of the projected data is $u_1^T \bar{x}$, where
  $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$.
The variance of the projected data is given by
  $\frac{1}{N} \sum_{n=1}^{N} (u_1^T x_n - u_1^T \bar{x})^2 = u_1^T S u_1$,
where $S$ is the data covariance matrix defined as
  $S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$.

We maximise the projected variance $u_1^T S u_1$ with respect to $u_1$ under the normalisation condition $u_1^T u_1 = 1$. The Lagrange multiplier formulation is
  $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$.
Setting the derivative with respect to $u_1$ to zero, we obtain
  $S u_1 = \lambda_1 u_1$,
i.e. $u_1$ is an eigenvector of $S$. Left-multiplying by $u_1^T$, the variance is obtained as
  $u_1^T S u_1 = \lambda_1$.

The variance is a maximum when $u_1$ is the eigenvector with the largest eigenvalue $\lambda_1$. This eigenvector is called the principal component. For the general case of an M-dimensional subspace, the solution is given by the M eigenvectors $u_1, u_2, \dots, u_M$ of the data covariance matrix $S$ corresponding to the M largest eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_M$. These eigenvectors are orthonormal, $u_i^T u_j = \delta_{ij}$.
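The eigendecomposition above maps directly onto a few lines of linear algebra. The following is a minimal NumPy sketch (not part of the lecture's Matlab demo), using randomly generated toy data in place of real measurements:

```python
import numpy as np

# Illustrative sketch of PCA by eigendecomposition (toy data, not lecture demo code).
# X holds N data points of dimension D, one per row.
rng = np.random.default_rng(0)
N, D, M = 500, 10, 2                                   # points, data dim, subspace dim
X = rng.normal(size=(N, D)) @ rng.normal(size=(D, D))  # correlated toy data

x_bar = X.mean(axis=0)                    # sample mean
Xc = X - x_bar                            # centred data
S = (Xc.T @ Xc) / N                       # data covariance matrix S

eigvals, eigvecs = np.linalg.eigh(S)      # eigh returns ascending eigenvalues for symmetric S
order = np.argsort(eigvals)[::-1]         # sort into descending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

U = eigvecs[:, :M]                        # principal components u_1..u_M (columns)
Z = Xc @ U                                # projections z_n = U^T (x_n - x_bar)

# The variance of the data projected onto u_1 equals the largest eigenvalue lambda_1.
assert np.isclose(Z[:, 0].var(), eigvals[0], rtol=1e-6)
```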
Minimum-Error Formulation of PCA
• An alternative (equivalent) formulation of PCA is to minimise the projection error. We consider a complete orthonormal set of D-dimensional basis vectors $\{u_i\}$, $i = 1, \dots, D$, s.t.
  $u_i^T u_j = \delta_{ij}$, where $\delta_{ij} = 1$ if $i = j$ and $0$ otherwise.
• Each data point is represented by a linear combination of the basis vectors
  $x_n = \sum_{i=1}^{D} \alpha_{ni} u_i$.

• The coefficients are $\alpha_{ni} = x_n^T u_i$, and without loss of generality we have
  $x_n = \sum_{i=1}^{D} (x_n^T u_i) u_i$.
Our goal is to approximate each data point using M << D dimensions. Using an M-dimensional linear subspace, we write each data point as
  $\tilde{x}_n = \sum_{i=1}^{M} z_{ni} u_i + \sum_{i=M+1}^{D} b_i u_i$,
where the $b_i$ are constants shared by all data points.

• We minimise the distortion measure
  $J = \frac{1}{N} \sum_{n=1}^{N} \| x_n - \tilde{x}_n \|^2$
with respect to $u_i$, $z_{ni}$ and $b_i$. Setting the derivative with respect to $z_{nj}$ to zero and using the orthonormality conditions, we have
  $z_{nj} = x_n^T u_j$, where $j = 1, \dots, M$.
Setting the derivative of J w.r.t. $b_j$ to zero gives
  $b_j = \bar{x}^T u_j$, where $j = M+1, \dots, D$.

If we substitute for $z_{ni}$ and $b_i$, we have
  $x_n - \tilde{x}_n = \sum_{i=M+1}^{D} \{ (x_n - \bar{x})^T u_i \} u_i$.
We see that the displacement vector $x_n - \tilde{x}_n$ lies in the space orthogonal to the principal subspace, as it is a linear combination of the $u_i$ with $i = M+1, \dots, D$. We further get
  $J = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=M+1}^{D} (x_n^T u_i - \bar{x}^T u_i)^2 = \sum_{i=M+1}^{D} u_i^T S u_i$.

• Consider a two-dimensional data space (D = 2) and a one-dimensional principal subspace (M = 1). Then we choose $u_2$ to minimise
  $J = u_2^T S u_2$, subject to the normalisation $u_2^T u_2 = 1$ (enforced with a Lagrange multiplier $\lambda_2$).
Setting the derivative w.r.t. $u_2$ to zero yields $S u_2 = \lambda_2 u_2$. We therefore obtain the minimum value of J by choosing $u_2$ as the eigenvector corresponding to the smaller eigenvalue, i.e. we choose the principal subspace to be spanned by the eigenvector with the larger eigenvalue.

• The general solution is to choose as the principal subspace the eigenvectors of the covariance matrix with the M largest eigenvalues,
  $S u_i = \lambda_i u_i$, where $i = 1, \dots, M$.
The distortion measure becomes
  $J = \sum_{i=M+1}^{D} \lambda_i$.

Applications of PCA to Face Recognition

(Recap) Geometrical Interpretation of PCA
• Principal components are the vectors in the directions of maximum variance of the projected data.
• For the given 2D data points, $u_1$ and $u_2$ are found as the principal components. [Figure: 2D data cloud with axes $x_1$, $x_2$ and principal directions $u_1$, $u_2$.]
• For dimensionality reduction, each 2D data point is transformed to a single variable $z_1$, representing the projection of the data point onto the eigenvector $u_1$. The data points projected onto $u_1$ have the maximum variance.
• PCA infers the inherent structure of high-dimensional data.
• The intrinsic dimensionality of the data is much smaller.

Eigenfaces
• Collect a set of face images.
• Normalise for scale, orientation and location (using eye locations), and vectorise them: each w x h image becomes a vector of dimension D = wh, giving
  $X \in \mathbb{R}^{D \times N}$, N: number of images.
• Construct the covariance matrix and obtain its eigenvectors:
  $S = \frac{1}{N} X' X'^T$, where $X' = [\dots, x_i - \bar{x}, \dots]$,
  $S U = U \Lambda$, $U \in \mathbb{R}^{D \times M}$, M: number of eigenvectors.

Eigenfaces
• Project the data onto the subspace:
  $Z = U^T X$, $Z \in \mathbb{R}^{M \times N}$, M << D.
• The reconstruction is obtained as
  $\tilde{x} = \sum_{i=1}^{M} z_i u_i = U z$, $\tilde{X} = U Z$.
• Use the distance to the subspace, $\| x - \tilde{x} \|$, for face recognition.

Eigenfaces Method 1
• Given face images of different classes (i.e. identities) c, compute a principal (eigen) subspace per class.
• A query (test) image x is projected onto each eigen-subspace and its reconstruction error is measured.
• The class with the minimum error is assigned:
  assign $\arg\min_c \| x - \tilde{x}_c \|$, where $\tilde{x}_c$ is the reconstruction of x by the c-th class subspace.

Eigenfaces Method 2
• Given face images of different classes (i.e. identities) c, compute a single principal (eigen) subspace over all the data.
• A query (test) image x is projected onto the eigen-subspace and its projection z is compared with the projections of the class means.
• The class with the minimum distance is assigned:
  assign $\arg\min_c \| z - z_c \|$, where $z_c$ is the projection of the c-th class data mean.

Matlab Demos: Face Recognition by PCA
• Face images
• Eigenvectors and eigenvalue plot
• Face image reconstruction
• Projection coefficients (visualisation of high-dimensional data)
• Face recognition
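The Matlab demos themselves are not reproduced here, but a rough NumPy sketch of the eigenfaces pipeline with Method 2 recognition might look as follows; the data matrix, labels, image size and the helper names (`train_eigenfaces`, `recognise`) are illustrative placeholders, not the demo code:

```python
import numpy as np

def train_eigenfaces(X, M):
    """X: D x N matrix of vectorised face images (one image per column).
    Returns the mean face and M orthonormal eigenfaces (columns of U)."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar
    # For D >> N it is cheaper to eigendecompose the N x N matrix Xc^T Xc
    # and map its eigenvectors back to the D-dimensional image space.
    C_small = (Xc.T @ Xc) / X.shape[1]
    vals, vecs = np.linalg.eigh(C_small)
    order = np.argsort(vals)[::-1][:M]
    U = Xc @ vecs[:, order]                 # eigenvectors of Xc Xc^T (image space)
    U /= np.linalg.norm(U, axis=0)          # normalise columns to unit length
    return x_bar, U

def recognise(x, x_bar, U, class_mean_proj):
    """Method 2: project a query image and pick the nearest class-mean projection."""
    z = U.T @ (x - x_bar.ravel())
    dists = {c: np.linalg.norm(z - zc) for c, zc in class_mean_proj.items()}
    return min(dists, key=dists.get)

# Hypothetical usage with placeholder data (32 x 32 "images", 3 identities).
rng = np.random.default_rng(1)
D, N = 32 * 32, 60
X_train = rng.random((D, N))
labels = np.repeat([0, 1, 2], N // 3)
x_bar, U = train_eigenfaces(X_train, M=20)
class_mean_proj = {c: U.T @ (X_train[:, labels == c].mean(axis=1) - x_bar.ravel())
                   for c in np.unique(labels)}
print(recognise(X_train[:, 0], x_bar, U, class_mean_proj))
```

The N x N trick in `train_eigenfaces` keeps the eigendecomposition tractable when D (pixels) is far larger than N (images), which is the usual regime for face data.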
Probabilistic PCA (PPCA)
• In PCA, a subspace is spanned by an orthonormal basis (the eigenvectors computed from the covariance matrix).
• PPCA interprets each observation with a generative model.
• It estimates the probability of generating each observation with a Gaussian distribution.
  PCA: uniform prior on the subspace. PPCA: Gaussian distribution on the subspace.

Continuous Latent Variable Model
• PPCA has a continuous latent variable.
• GMM (the mixture of Gaussians) is a model with a discrete latent variable (Lecture 3-4).
• PPCA assumes that the original data points lie close to a manifold of much lower dimensionality.
• In practice, the data points will not be confined precisely to a smooth low-dimensional manifold. We interpret the departures of the data points from the manifold as noise.

Continuous Latent Variable Model
• Consider an example of digit images that undergo a random displacement and rotation.
• The images have 100 x 100 pixel values, but the degrees of freedom of variability across images number only three: vertical translation, horizontal translation and rotation.
• The data points live on a manifold whose intrinsic dimensionality is three.
• The translation and rotation parameters are continuous latent (hidden) variables; we only observe the image vectors.

Probabilistic PCA
• PPCA is an example of the linear-Gaussian framework, in which all marginal and conditional distributions are Gaussian (Lecture 15-16).
• We define a Gaussian prior distribution over the latent variable z as
  $p(z) = \mathcal{N}(z \mid 0, I)$.
The observed D-dimensional variable x is defined as
  $x = W z + \mu + \epsilon$,
where z is an M-dimensional Gaussian latent variable, W is a D x M matrix, and $\epsilon$ is a D-dimensional zero-mean Gaussian noise variable with covariance $\sigma^2 I$.

• The conditional distribution takes the Gaussian form
  $p(x \mid z) = \mathcal{N}(x \mid W z + \mu, \sigma^2 I)$.
This is a generative process mapping from the latent space to the data space, in contrast to the conventional view of PCA.
• The marginal distribution is written in the form
  $p(x) = \int p(x \mid z)\, p(z)\, dz$.
From the linear-Gaussian model, the marginal distribution is again Gaussian,
  $p(x) = \mathcal{N}(x \mid \mu, C)$, where $C = W W^T + \sigma^2 I$.

The above can be seen from
  $\mathbb{E}[x] = \mathbb{E}[W z + \mu + \epsilon] = \mu$,
  $\mathrm{cov}[x] = \mathbb{E}[(W z + \epsilon)(W z + \epsilon)^T] = W \mathbb{E}[z z^T] W^T + \mathbb{E}[\epsilon \epsilon^T] = W W^T + \sigma^2 I$,
using the fact that z and $\epsilon$ are independent zero-mean Gaussian variables.

Maximum Likelihood Estimation for PPCA
• We need to determine the parameters $\mu$, W and $\sigma^2$ that maximise the log-likelihood.
• Given a data set X = {x_n} of observed data points, PPCA can be expressed as a directed graphical model.

The log-likelihood is
  $\ln p(X \mid \mu, W, \sigma^2) = \sum_{n=1}^{N} \ln p(x_n \mid \mu, W, \sigma^2) = -\frac{ND}{2}\ln(2\pi) - \frac{N}{2}\ln|C| - \frac{1}{2}\sum_{n=1}^{N} (x_n - \mu)^T C^{-1} (x_n - \mu)$.
Maximising it gives (for the detailed optimisation, see Tipping and Bishop, PPCA, 1999)
  $W_{ML} = U_M (L_M - \sigma^2 I)^{1/2} R$,
where $U_M$ is the D x M eigenvector matrix of S, $L_M$ is the M x M diagonal eigenvalue matrix, and R is an orthogonal rotation matrix s.t. $R R^T = I$.

Redundancy remains up to rotations R of the latent-space coordinates. Consider the matrix
  $\tilde{W} = W_{ML} R$,
where R is an orthogonal rotation matrix s.t. $R R^T = I$. We see that
  $\tilde{W} \tilde{W}^T = W_{ML} R R^T W_{ML}^T = W_{ML} W_{ML}^T$.
Hence the model, through $C = W W^T + \sigma^2 I$, is independent of R.

• Conventional PCA is generally formulated as a projection of points from the D-dimensional data space onto an M-dimensional linear subspace.
• PPCA is most naturally expressed as a mapping from the latent space to the data space.
• We can reverse this mapping using Bayes' theorem to get the posterior distribution p(z|x) as
  $p(z \mid x) = \mathcal{N}(z \mid M^{-1} W^T (x - \mu),\, \sigma^2 M^{-1})$,
where the M x M matrix M is defined by
  $M = W^T W + \sigma^2 I$.
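A minimal NumPy sketch of the closed-form maximum-likelihood solution above, assuming R = I and using the standard estimate of $\sigma^2$ from the discarded eigenvalues (Tipping and Bishop, 1999); the function names and the synthetic data are illustrative only:

```python
import numpy as np

def ppca_ml(X, M):
    """Closed-form ML estimates for PPCA. X: N x D data matrix, M: latent dim. R = I."""
    N, D = X.shape
    mu = X.mean(axis=0)
    S = (X - mu).T @ (X - mu) / N                   # data covariance
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    sigma2 = vals[M:].mean()                        # average of the D-M discarded eigenvalues
    U_M, L_M = vecs[:, :M], np.diag(vals[:M])
    W = U_M @ np.sqrt(L_M - sigma2 * np.eye(M))     # W_ML = U_M (L_M - sigma^2 I)^(1/2) R
    return mu, W, sigma2

def ppca_posterior_mean(x, mu, W, sigma2):
    """Posterior mean of the latent variable, E[z|x] = M^{-1} W^T (x - mu)."""
    M_mat = W.T @ W + sigma2 * np.eye(W.shape[1])
    return np.linalg.solve(M_mat, W.T @ (x - mu))

# Toy usage on synthetic data generated from the PPCA model itself.
rng = np.random.default_rng(2)
N, D, M = 1000, 5, 2
W_true = rng.normal(size=(D, M))
X = rng.normal(size=(N, M)) @ W_true.T + 0.1 * rng.normal(size=(N, D))
mu, W, sigma2 = ppca_ml(X, M)
z_hat = ppca_posterior_mean(X[0], mu, W, sigma2)
print(sigma2)   # should be close to the true noise variance 0.01
```

Because R only rotates the latent coordinates, fixing R = I loses nothing for the data covariance $C = W W^T + \sigma^2 I$, as shown on the slide above.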
Limitations of PCA

Unsupervised learning
PCA finds the directions of maximum variance of the data (unsupervised), while LDA (Linear Discriminant Analysis) finds the direction that optimally separates data of different classes (supervised). [Figure: PCA vs LDA projection directions.]

Linear model
PCA is a linear projection method. When the data lie on a nonlinear manifold, PCA is extended to Kernel PCA by the kernel trick, using a nonlinear feature mapping $\phi(x)$ (Lecture 9-10). [Figure: PCA vs Kernel PCA; a linear manifold (subspace) vs a nonlinear manifold.]

Gaussian assumption
PCA models the data as a Gaussian distribution (second-order statistics), whereas ICA (Independent Component Analysis) captures higher-order statistics. [Figure: PCA vs ICA; principal components PC1, PC2 vs independent components IC1, IC2.]

Holistic bases
PCA bases are holistic (cf. part-based) and less intuitive. ICA or NMF (Non-negative Matrix Factorisation) yields bases that capture local facial components.

Daniel D. Lee and H. Sebastian Seung (1999). "Learning the parts of objects by non-negative matrix factorization". Nature 401 (6755): 788-791.
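As a small illustration of the kernel extension mentioned under "Linear model" above (not part of the lecture material), a minimal RBF-kernel PCA sketch in NumPy; the kernel width `gamma` is an arbitrary choice for the toy two-circles data:

```python
import numpy as np

def kernel_pca(X, M, gamma=1.0):
    """Minimal RBF-kernel PCA: project the N rows of X onto M nonlinear components."""
    N = X.shape[0]
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-gamma * sq_dists)                          # RBF (Gaussian) kernel matrix
    one_n = np.ones((N, N)) / N
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n     # centre in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:M]
    vals, vecs = vals[order], vecs[:, order]
    # Projections of the training points onto the leading feature-space components.
    return vecs * np.sqrt(np.maximum(vals, 0))

# Toy usage: two concentric circles, which no single linear projection can separate.
rng = np.random.default_rng(3)
theta = rng.uniform(0, 2 * np.pi, size=200)
radii = np.repeat([1.0, 3.0], 100)
X = np.c_[radii * np.cos(theta), radii * np.sin(theta)] + 0.05 * rng.normal(size=(200, 2))
Z = kernel_pca(X, M=2, gamma=0.5)
```

For a suitable `gamma` the leading kernel components separate the two radii, which linear PCA applied directly to X cannot do along any single direction.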