INF 5300: Lecture 2
Principal components and linear discriminants
Asbjørn Berge, Department of Informatics, University of Oslo
1 March 2005

Outline
1 Linear algebra and dimension reduction
2 PCA
3 LDA

Linear algebra and dimension reduction

Vector spaces
- A set of vectors u_1, u_2, ..., u_n is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination
  $$x = a_1 u_1 + a_2 u_2 + \dots + a_n u_n$$
- The coefficients a_1, a_2, ..., a_n are called the components of the vector x with respect to the basis u_i.
- In order to form a basis, it is necessary and sufficient that the vectors u_i be linearly independent.
- A basis u_i is said to be orthogonal if u_i^T u_j ≠ 0 for i = j and u_i^T u_j = 0 for i ≠ j.
- A basis u_i is said to be orthonormal if u_i^T u_j = 1 for i = j and u_i^T u_j = 0 for i ≠ j.

Linear transformation
- A linear transformation is a mapping from a vector space X^N onto a vector space Y^M, and is represented by a matrix.
- Given a vector x ∈ X^N, the corresponding vector y in Y^M is
  $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{pmatrix} =
    \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1N} \\
                    a_{21} & a_{22} & \cdots & a_{2N} \\
                    \vdots & \vdots & \ddots & \vdots \\
                    a_{M1} & a_{M2} & \cdots & a_{MN} \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_N \end{pmatrix}$$

Eigenvectors and eigenvalues
- Given a matrix A of size N×N, v is an eigenvector of A if there exists a scalar λ (the eigenvalue) such that Av = λv.
- Av = λv ⇒ (A − λI)v = 0 ⇒ |A − λI| = 0, which expands to the characteristic equation
  $$\lambda^N + a_1 \lambda^{N-1} + \dots + a_{N-1}\lambda + a_0 = 0$$
- The zeros of the characteristic equation are the eigenvalues of A.
- A is non-singular ⇔ all eigenvalues are non-zero.
- If A is real and symmetric, all eigenvalues are real and the eigenvectors can be chosen orthogonal.

Interpretation of eigenvectors and eigenvalues
- The eigenvectors of a covariance matrix Σ correspond to the principal axes of the equiprobability ellipses.
- The linear transformation defined by the eigenvectors of Σ yields vectors that are uncorrelated, regardless of the form of the distribution.
- If the distribution happens to be Gaussian, the transformed vectors are also statistically independent.

Dimensionality reduction
- Feature extraction can be stated as: given a feature space x_i ∈ R^n, find an optimal mapping y = f(x): R^n → R^m with m < n.
- For classification, an optimal mapping is one for which the transformed feature vector y yields the same classification rate as x.
- The optimal mapping may be a non-linear function, but non-linear transforms are difficult to generate and optimize.
- Feature extraction is therefore usually limited to linear transforms y = A^T x:
  $$\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} =
    \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\
                    a_{21} & a_{22} & \cdots & a_{2n} \\
                    \vdots & \vdots & \ddots & \vdots \\
                    a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
    \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}$$
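As a concrete illustration of the last two slides, the following NumPy sketch (not from the original lecture; the seed, sample size and variable names are my own) applies a linear transform y = A^T x with A taken as the eigenvectors of the sample covariance matrix, and checks that the transformed components are uncorrelated. The Gaussian parameters are the ones used in the PCA example later in the lecture.

```python
# Minimal sketch: the eigenvectors of the covariance matrix define a linear
# transform y = A^T x whose output components are uncorrelated.
import numpy as np

rng = np.random.default_rng(0)

# 3-d Gaussian data with the parameters from the PCA example in this lecture.
mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])
X = rng.multivariate_normal(mu, Sigma, size=5000)    # rows are samples x^T

# Eigendecomposition of the sample covariance matrix.
Sigma_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)          # columns of eigvecs are the eigenvectors

# Linear transform y = A^T x with A = eigvecs, applied to the centred data.
Y = (X - X.mean(axis=0)) @ eigvecs

# The covariance of y is diagonal (up to numerical precision):
# the transformed components are uncorrelated.
print(np.round(np.cov(Y, rowvar=False), 2))
```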
Signal representation vs classification
- The search for the feature extraction mapping y = f(x) is guided by an objective function we want to maximize.
- In general, there are two categories of objectives in feature extraction:
  - Signal representation: accurately approximate the samples in a lower-dimensional space.
  - Classification: keep or enhance class-discriminatory information in a lower-dimensional space.
- Principal components analysis (PCA): signal representation, unsupervised.
- Linear discriminant analysis (LDA): classification, supervised.

PCA - Principal components analysis
- Reduce the dimension while preserving the signal variance ("randomness").
- Represent x as a linear combination of orthonormal basis vectors [ϕ_1 ⊥ ϕ_2 ⊥ ... ⊥ ϕ_n]:
  $$x = \sum_{i=1}^{n} y_i \varphi_i$$
- Approximate x with only m < n basis vectors. This can be done by replacing the components [y_{m+1}, ..., y_n]^T with some pre-selected constants b_i:
  $$\hat{x}(m) = \sum_{i=1}^{m} y_i \varphi_i + \sum_{i=m+1}^{n} b_i \varphi_i$$
- The approximation error is then
  $$\Delta x(m) = x - \hat{x}(m)
    = \sum_{i=1}^{n} y_i \varphi_i - \left( \sum_{i=1}^{m} y_i \varphi_i + \sum_{i=m+1}^{n} b_i \varphi_i \right)
    = \sum_{i=m+1}^{n} (y_i - b_i)\varphi_i$$
- To measure the representation error we use the mean squared error, which simplifies because the basis is orthonormal:
  $$\mathrm{MSE}(m) = E[|\Delta x(m)|^2]
    = E\left[ \sum_{i=m+1}^{n} \sum_{j=m+1}^{n} (y_i - b_i)(y_j - b_j)\,\varphi_i^T \varphi_j \right]
    = \sum_{i=m+1}^{n} E[(y_i - b_i)^2]$$
- The optimal values of b_i are found by setting the partial derivatives of MSE(m) to zero:
  $$\frac{\partial}{\partial b_i} E[(y_i - b_i)^2] = -2(E[y_i] - b_i) = 0 \;\Rightarrow\; b_i = E[y_i]$$
- We replace the discarded dimensions by their expected value, which also matches intuition.
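To make the role of the constants b_i concrete, here is a small sketch (assumed code, not from the slides; the random data and the helper mse are illustrative) that truncates an expansion in an arbitrary orthonormal basis and compares the MSE obtained with b_i = 0 against b_i = E[y_i]; the latter is lower, as the derivation above predicts.

```python
# Sketch: replacing the discarded coefficients by their mean b_i = E[y_i]
# gives a lower MSE than replacing them by, say, zero.
import numpy as np

rng = np.random.default_rng(1)
n, m, N = 5, 2, 10000

# Non-centred data, so that E[y_i] != 0 for the discarded coefficients.
X = rng.normal(loc=3.0, scale=2.0, size=(N, n))

# Any orthonormal basis phi_1, ..., phi_n (here from a QR factorisation).
Phi, _ = np.linalg.qr(rng.normal(size=(n, n)))      # columns are the phi_i
Y = X @ Phi                                         # coefficients y_i = phi_i^T x

def mse(b_tail):
    """Mean squared error when coefficients m+1..n are replaced by constants."""
    Y_hat = Y.copy()
    Y_hat[:, m:] = b_tail                           # broadcast the constants
    X_hat = Y_hat @ Phi.T                           # reconstruct x_hat(m)
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

print("b_i = 0      :", mse(np.zeros(n - m)))
print("b_i = E[y_i] :", mse(Y[:, m:].mean(axis=0)))  # smaller, as derived above
```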
- With b_i = E[y_i], the MSE can now be written in terms of the covariance matrix of x. Since y_i = ϕ_i^T x,
  $$\mathrm{MSE}(m) = \sum_{i=m+1}^{n} E[(y_i - E[y_i])^2]
    = \sum_{i=m+1}^{n} E[(\varphi_i^T x - \varphi_i^T E[x])^2]
    = \sum_{i=m+1}^{n} \varphi_i^T E[(x - E[x])(x - E[x])^T]\,\varphi_i
    = \sum_{i=m+1}^{n} \varphi_i^T \Sigma_x \varphi_i$$
  where Σ_x is the covariance matrix of x.
- The orthonormality constraint on the ϕ_i can be incorporated into the optimization using Lagrange multipliers λ_i:
  $$\mathrm{MSE}(m) = \sum_{i=m+1}^{n} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=m+1}^{n} \lambda_i (1 - \varphi_i^T \varphi_i)$$
- Thus we can find the optimal ϕ_i by partial differentiation:
  $$\frac{\partial}{\partial \varphi_i} \mathrm{MSE}(m) = 2(\Sigma_x \varphi_i - \lambda_i \varphi_i) = 0
    \;\Rightarrow\; \Sigma_x \varphi_i = \lambda_i \varphi_i$$
- The optimal ϕ_i and λ_i are therefore eigenvectors and eigenvalues of the covariance matrix Σ_x.
- Note that this also implies
  $$\mathrm{MSE}(m) = \sum_{i=m+1}^{n} \varphi_i^T \Sigma_x \varphi_i
    = \sum_{i=m+1}^{n} \varphi_i^T \lambda_i \varphi_i
    = \sum_{i=m+1}^{n} \lambda_i$$
- Thus, in order to minimize MSE(m), the discarded λ_i must be the smallest eigenvalues.

PCA dimension reduction
- The optimal approximation of a random vector x ∈ R^n by a linear combination of m < n independent vectors is obtained by projecting x onto the eigenvectors ϕ_i corresponding to the largest eigenvalues λ_i of the covariance matrix Σ_x. (Optimal here means minimum sum of squared approximation errors.)
- PCA uses the eigenvectors of the covariance matrix Σ_x, and is thus able to find the independent axes of the data only under a unimodal Gaussian assumption. For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes.
- The main limitation of PCA is that it is unsupervised: it does not consider class separability. It is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance.
- There is no guarantee that these directions are good features for discrimination.

PCA example
- A 3-d Gaussian with parameters
  $$\mu = [0 \; 5 \; 2]^T, \qquad
    \Sigma = \begin{pmatrix} 25 & -1 & 7 \\ -1 & 4 & -4 \\ 7 & -4 & 10 \end{pmatrix}$$
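A numerical check of this example (assumed code, not from the lecture): project samples from the 3-d Gaussian above onto the m eigenvectors with the largest eigenvalues and verify that the empirical MSE(m) matches the sum of the discarded, smallest eigenvalues.

```python
# Sketch: PCA on the 3-d Gaussian example; MSE(m) equals the sum of the
# discarded (smallest) eigenvalues of the covariance matrix.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])
X = rng.multivariate_normal(mu, Sigma, size=100000)

Sigma_hat = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)        # eigenvalues in ascending order

m = 2                                               # keep the m largest components
keep = eigvecs[:, -m:]                              # phi_i with the largest lambda_i
Xc = X - X.mean(axis=0)
X_hat = X.mean(axis=0) + (Xc @ keep) @ keep.T       # discarded y_i replaced by E[y_i]

mse_empirical = np.mean(np.sum((X - X_hat) ** 2, axis=1))
mse_theory = eigvals[:-m].sum()                     # sum of the smallest eigenvalues
print(mse_empirical, mse_theory)                    # approximately equal
```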
Linear Discriminant Analysis (LDA)
- Goal: reduce the dimension while preserving class-discriminatory information.
- Strategy (2 classes): we have a set of samples x_1, x_2, ..., x_n, of which n_1 belong to class ω_1 and the remaining n_2 to class ω_2.
- Obtain a scalar value by projecting x onto a line: y = w^T x.
- Of all possible lines, select the one that maximizes the separability of the classes.
- To find a good projection vector, we need to define a measure of separation between the projections, J(w).
- The mean vectors of each class in the spaces spanned by x and y are
  $$\mu_i = \frac{1}{n_i}\sum_{x \in \omega_i} x
    \qquad \text{and} \qquad
    \tilde{\mu}_i = \frac{1}{n_i}\sum_{y \in \omega_i} y = \frac{1}{n_i}\sum_{x \in \omega_i} w^T x = w^T \mu_i$$
- A naive choice of objective would be the projected mean difference, J(w) = |μ̃_1 − μ̃_2|.
- Fisher's solution: maximize a function that represents the difference between the means, scaled by a measure of the within-class scatter.
- Define the class-wise scatter (the equivalent of a variance)
  $$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$$
  so that s̃_1^2 + s̃_2^2 is the within-class scatter of the projected samples.
- Fisher's criterion is then
  $$J(w) = \frac{|\tilde{\mu}_1 - \tilde{\mu}_2|^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
- We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible.
- To optimize w, we need J(w) as an explicit function of w. Define the class-wise scatter matrices
  $$S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T, \qquad S_W = S_1 + S_2$$
  where S_W is the within-class scatter matrix.
- The scatter of the projection y can then be written
  $$\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2
    = \sum_{x \in \omega_i} (w^T x - w^T \mu_i)^2
    = \sum_{x \in \omega_i} w^T (x - \mu_i)(x - \mu_i)^T w = w^T S_i w$$
  so that s̃_1^2 + s̃_2^2 = w^T S_W w.
- The projected mean difference can be expressed by the original means:
  $$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = (w^T \mu_1 - w^T \mu_2)^2
    = w^T \underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B}\, w = w^T S_B w$$
  where S_B is called the between-class scatter matrix.
- The Fisher criterion in terms of S_W and S_B is
  $$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
- To find the optimal w, differentiate J(w) and equate to zero:
  $$\frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0
    \;\Rightarrow\; [w^T S_W w]\,\frac{d[w^T S_B w]}{dw} - [w^T S_B w]\,\frac{d[w^T S_W w]}{dw} = 0
    \;\Rightarrow\; [w^T S_W w]\,2S_B w - [w^T S_B w]\,2S_W w = 0$$
- Dividing by w^T S_W w gives
  $$S_B w - J(w) S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w = J(w)\,w$$
- The eigenvalue problem S_W^{-1} S_B w = J(w) w has the solution
  $$w^* = S_W^{-1}(\mu_1 - \mu_2)$$
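A minimal two-class sketch (with made-up Gaussian classes; not the lecture's code) that computes Fisher's solution w* = S_W^{-1}(μ_1 − μ_2) and evaluates the criterion J(w) on the projected samples:

```python
# Sketch: two-class Fisher discriminant w* = Sw^{-1} (mu1 - mu2), then y = w^T x.
import numpy as np

rng = np.random.default_rng(3)

# Two illustrative Gaussian classes (parameters are made up).
X1 = rng.multivariate_normal([0, 0], [[3, 1], [1, 2]], size=200)
X2 = rng.multivariate_normal([4, 3], [[3, 1], [1, 2]], size=200)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)

# Within-class scatter S_W = S_1 + S_2 with S_i = sum (x - mu_i)(x - mu_i)^T.
S1 = (X1 - mu1).T @ (X1 - mu1)
S2 = (X2 - mu2).T @ (X2 - mu2)
Sw = S1 + S2

# Fisher's solution and the projected (scalar) samples.
w = np.linalg.solve(Sw, mu1 - mu2)
y1, y2 = X1 @ w, X2 @ w

# Fisher criterion of the projections: squared mean difference over within-class scatter.
J = (y1.mean() - y2.mean()) ** 2 / (((y1 - y1.mean()) ** 2).sum() + ((y2 - y2.mean()) ** 2).sum())
print(w, J)
```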
Multiclass LDA
- LDA generalizes nicely to C classes. Instead of one projection y, we seek C − 1 projections [y_1, y_2, ..., y_{C−1}] obtained from C − 1 projection vectors W = [w_1, w_2, ..., w_{C−1}], i.e. y = W^T x.
- The generalization of the within-class scatter is
  $$S_W = \sum_{i=1}^{C} S_i, \qquad
    S_i = \sum_{x \in \omega_i} (x - \mu_i)(x - \mu_i)^T, \qquad
    \mu_i = \frac{1}{n_i}\sum_{x \in \omega_i} x$$
- The generalization of the between-class scatter is
  $$S_B = \sum_{i=1}^{C} n_i (\mu_i - \mu)(\mu_i - \mu)^T, \qquad
    \mu = \frac{1}{n}\sum_{\forall x} x$$
- Similarly to the two-class case, the mean vectors and scatter matrices of the projected samples can be expressed as
  $$\tilde{\mu}_i = \frac{1}{n_i}\sum_{y \in \omega_i} y, \qquad
    \tilde{S}_W = \sum_{i=1}^{C}\sum_{y \in \omega_i} (y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T, \qquad
    \tilde{S}_B = \sum_{i=1}^{C} n_i (\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
  and thus S̃_W = W^T S_W W and S̃_B = W^T S_B W.
- We want a scalar objective function, and use the ratio of matrix determinants
  $$J(W) = \frac{|\tilde{S}_B|}{|\tilde{S}_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}$$
- Another scalar variant of the criterion is the ratio of the traces of these matrices, i.e. J(W) = tr(S̃_B)/tr(S̃_W).
- The matrix W* that maximizes this ratio can be shown to be composed of the eigenvectors corresponding to the largest eigenvalues of the generalized eigenvalue problem (S_B − λ_i S_W) w_i* = 0.
- S_B is the sum of C matrices of rank one or less, and the mean vectors are constrained by μ = (1/n) ∑_{i=1}^{C} n_i μ_i, so S_B has rank C − 1 or less. Only C − 1 of the eigenvalues will be non-zero, so we can find at most C − 1 projection vectors.

Solving multiclass LDA
- Estimate S_W and form the K × p matrix M of class centroids (K = C classes, p features).
- Transform the centroid matrix into a space where S_W is the identity, i.e. whiten it: from the eigendecomposition S_W = V D V^T, the whitening transform is S_W^{-1/2} = V D^{-1/2} V^T.
- Transform M: M* = M S_W^{-1/2}.
- Compute S_B*, the scatter matrix of M*.
- Eigenanalyze S_B*, i.e. find the principal components of the whitened centroids. Since there are only K centroids, they can span at most a (K − 1)-dimensional subspace, which means that only K − 1 principal components are non-zero.
- The eigenvectors of S_B* define the optimal projections W* in the whitened space.
- Transform the set of solutions back to the original space by undoing the whitening transform; thus the maximizing W is W = S_W^{-1/2} W*. (A code sketch of this procedure is given below, after the limitations.)

Limitations of LDA
- LDA produces at most C − 1 feature projections.
- LDA is parametric, since it assumes unimodal Gaussian likelihoods.
- LDA will fail when the discriminatory information is not in the mean but in the variance of the data.
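Returning to the multiclass procedure above, here is a minimal NumPy sketch of it (assumed code; the function name multiclass_lda and its interface are hypothetical, and S_W is assumed non-singular):

```python
# Sketch of multiclass LDA: whiten with Sw^(-1/2), take principal components of
# the whitened class centroids, and map the projections back to the original space.
import numpy as np

def multiclass_lda(X, labels, n_components=None):
    """Return a p x (C - 1) projection matrix W for data X (N x p) and class labels."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    K, p = len(classes), X.shape[1]
    if n_components is None:
        n_components = K - 1                          # at most C - 1 useful projections

    # Within-class scatter S_W and the K x p matrix M of class centroids.
    M = np.vstack([X[labels == c].mean(axis=0) for c in classes])
    S_W = sum((X[labels == c] - M[i]).T @ (X[labels == c] - M[i])
              for i, c in enumerate(classes))

    # Whitening transform S_W^(-1/2) = V D^(-1/2) V^T (assumes S_W is non-singular).
    D, V = np.linalg.eigh(S_W)
    S_W_inv_sqrt = V @ np.diag(1.0 / np.sqrt(D)) @ V.T

    # Principal components of the whitened centroids.
    M_star = M @ S_W_inv_sqrt
    M_star_c = M_star - M_star.mean(axis=0)
    S_B_star = M_star_c.T @ M_star_c
    evals, evecs = np.linalg.eigh(S_B_star)
    W_star = evecs[:, ::-1][:, :n_components]         # eigenvectors, largest eigenvalues first

    # Map the projections back to the original space.
    return S_W_inv_sqrt @ W_star

# Usage (hypothetical data): Y = X @ multiclass_lda(X, labels) gives the
# at most C - 1 discriminant coordinates y = W^T x.
```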