INF 5300 Feature transformation
Anne Solberg (anne@ifi.uio.no)

Today:
• Linear algebra applied to dimensionality reduction
• Principal component analysis
• Fisher's linear discriminant

Vector spaces

Linear transformations

Eigenvalues and eigenvectors

Interpretation of eigenvectors and eigenvalues

Linear feature transforms

Signal representation vs. classification
• The search for the feature extraction mapping y = f(x) is guided by an objective function we want to maximize.
• In general we have two categories of objectives in feature extraction:
  – Signal representation: accurately approximate the samples in a lower-dimensional space.
  – Classification: keep (or enhance) the class-discriminatory information in a lower-dimensional space.

Signal representation vs. classification
• Principal component analysis (PCA): signal representation, unsupervised.
• Linear discriminant analysis (LDA): classification, supervised.

Correlation matrix vs. covariance matrix
• $\Sigma_x$ is the covariance matrix of x: $\Sigma_x = E[(x - \mu)(x - \mu)^T]$.
• $R_x$ is the correlation matrix of x: $R_x = E[x x^T]$.
• $R_x = \Sigma_x$ if $\mu_x = 0$.

Principal component or Karhunen-Loève transform
• Let x be a feature vector.
• Features are often correlated, which might lead to redundancies.
• We now derive a transform which yields uncorrelated features.
• We seek a linear transform $y = A^T x$ such that the components y(i) are uncorrelated.
• The y(i)'s are uncorrelated if E[y(i)y(j)] = 0 for i ≠ j.
• If we can express the information in x using uncorrelated features, we might need fewer coefficients.

Principal component transform
• The correlation of y is described by the correlation matrix $R_y = E[y y^T] = E[A^T x x^T A] = A^T R_x A$, where $R_x$ is the correlation matrix of x. $R_x$ is symmetric, thus all its eigenvectors are orthogonal.
• We seek uncorrelated components of y, thus $R_y$ should be diagonal. From linear algebra:
• $R_y$ will be diagonal if A is formed by the orthonormal eigenvectors $a_i$, i = 0, ..., N−1, of $R_x$: $R_y = A^T R_x A = \Lambda$, where $\Lambda$ is diagonal with the eigenvalues $\lambda_i$ of $R_x$ on the diagonal.
• We find A by solving $A^T R_x A = \Lambda$ (e.g. using the singular value decomposition, SVD).
• A is formed by computing the eigenvectors of $R_x$; each eigenvector is a column of A.

Mean square error approximation
• x can be expressed as a combination of all N basis vectors:
  $x = \sum_{i=0}^{N-1} y(i)\, a_i$, where $y(i) = a_i^T x$.
• An approximation to x is found by using only m of the basis vectors, i.e. a projection onto the m-dimensional subspace spanned by m eigenvectors:
  $\hat{x} = \sum_{i=0}^{m-1} y(i)\, a_i$.
• The PC transform is based on minimizing the mean square error associated with this approximation.
• The mean square error of this approximation is
  $E\big[\|x - \hat{x}\|^2\big] = E\Big[\big\|\sum_{i=m}^{N-1} y(i)\, a_i\big\|^2\Big] = E\Big[\sum_i \sum_j \big(y(i)\, a_i^T\big)\big(y(j)\, a_j\big)\Big] = \sum_{i=m}^{N-1} E\big[y^2(i)\big] = \sum_{i=m}^{N-1} a_i^T E\big[x x^T\big] a_i$.
• Furthermore, we find that
  $E\big[\|x - \hat{x}\|^2\big] = \sum_{i=m}^{N-1} a_i^T \lambda_i a_i = \sum_{i=m}^{N-1} \lambda_i$.
• The mean square error is thus
  $E\big[\|x - \hat{x}\|^2\big] = \sum_{i=0}^{N-1} \lambda_i - \sum_{i=0}^{m-1} \lambda_i = \sum_{i=m}^{N-1} \lambda_i$.
• The error is minimized if we select the eigenvectors corresponding to the m largest eigenvalues of the correlation matrix $R_x$.
• The transformed vector y is called the principal components of x. The transform is called the principal component transform or Karhunen-Loève transform.
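A minimal NumPy sketch of the transform derived above, assuming the samples are stored as rows of a data matrix; the function and variable names are illustrative, not part of the lecture.

```python
import numpy as np

def pc_transform(X, m):
    """Project the rows of X (n_samples x N) onto the m eigenvectors of the
    correlation matrix R_x = E[x x^T] with the largest eigenvalues."""
    R = (X.T @ X) / X.shape[0]            # sample estimate of R_x
    eigvals, eigvecs = np.linalg.eigh(R)  # R is symmetric: real, orthogonal eigenvectors
    order = np.argsort(eigvals)[::-1]     # largest eigenvalues first
    A = eigvecs[:, order[:m]]             # the m leading eigenvectors as columns of A
    Y = X @ A                             # y = A^T x for every sample (stored as rows)
    return Y, A

# Reconstruction from the m components: x_hat = Y @ A.T. The mean square
# error of this approximation equals the sum of the discarded eigenvalues.
```

Replacing R with the sample covariance of centered data gives the covariance-matrix variant discussed on the next slides.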
Principal components of the covariance matrix
• Alternatively, we can find the principal components of the covariance matrix $\Sigma_x$.
• If we have software for computing the principal components of $R_x$, we can compute principal components from $\Sigma_x$ by first setting $z = x - \mu_x$ and computing PC(z).
• The principal component transform is not scale invariant, because the eigenvectors are not invariant. Often, the data are therefore normalized to zero mean and unit variance before the PC transform is applied.

Principal components and total variance
• Assume that E[x] = 0.
• Let y = PC(x).
• From $R_y$ we know that the variance of component $y_j$ is $\lambda_j$.
• The eigenvalues $\lambda_j$ of the correlation matrix $R_x$ are thus equal to the variances of the transformed features.
• By selecting the m eigenvectors with the largest eigenvalues, we select the m dimensions with the largest variance.
• The first principal component lies along the direction of the input space which has the largest variance.

Geometrical interpretation of principal components
• The eigenvector corresponding to the largest eigenvalue is the direction in n-dimensional space with the highest variance.
• The next principal component is orthogonal to the first, and lies along the direction with the second largest variance.
• Note that the direction with the highest variance is NOT related to the separability between classes.

PCA example
• A 3D Gaussian with given parameters (figure).

Principal component images
• For an image with n bands, we can compute the principal component transform of the entire image X.
• Y = PC(X) will then be a new image with n bands, but most of the variance is in the bands with the lowest index (corresponding to the largest eigenvalues).

PCA applied to change detection
• Assume that we have a time series of t images taken at different times.
• We want to visualize the pixels which change at some point in the time series.
• Perform PCA for each pixel on the multitemporal feature vector (PCA in time); see the sketch after this slide group.
• Visualize the PCA components: areas with the biggest changes will be visible in PCA component 1.
• Note that normalization should be applied if the mean values of the different images differ (e.g. due to different solar angles).

PC and compression
• The PC transform is the optimal transform with respect to preserving the energy of the original image.
• For compression purposes, the PC transform is theoretically optimal with respect to maximizing the entropy (from information theory). Entropy is related to randomness and thus to variance.
• The basis vectors are the eigenvectors and vary from image to image. For transmission, both the transform coefficients and the eigenvectors must be transmitted.
• The PC transform can be reasonably well approximated by the cosine or sine transform. These use constant basis vectors and are better suited for transmission, since only the coefficients must be transmitted (or stored).
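As a rough illustration of the "PCA in time" idea above, here is a sketch that treats each pixel's values over time as a feature vector; the array layout, the per-image normalization, and the function name are assumptions made for the example, not part of the lecture.

```python
import numpy as np

def temporal_pca(images):
    """images: array of shape (t, rows, cols) holding co-registered images
    from t acquisition times. Returns t principal component 'images',
    ordered by decreasing eigenvalue."""
    t, rows, cols = images.shape
    X = images.reshape(t, -1).T.astype(float)    # one row per pixel, t features
    X = (X - X.mean(axis=0)) / X.std(axis=0)     # normalize each image (zero mean, unit variance)
    eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
    order = np.argsort(eigvals)[::-1]            # largest variance first
    Y = X @ eigvecs[:, order]                    # principal components per pixel
    return Y.T.reshape(t, rows, cols)            # PC "bands" reshaped back to images
```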
Fisher's Linear Discriminant - intro
• Goal: reduce the dimension while preserving the class-discriminatory information.
• Strategy (2 classes):
  – We have a set of samples x = {x_1, x_2, ..., x_n}, where n_1 belong to class ω_1 and the rest, n_2, to class ω_2.
  – Obtain a scalar value by projecting x onto a line: $y = w^T x$.
  – Select the w that maximizes the separability of the classes.

Fisher's Linear Discriminant
• To find a good projection vector, we need to define a measure of separation between the projections, J(w).
• The mean vectors of each class in the spaces spanned by x and y are
  $\mu_i = \frac{1}{n_i} \sum_{x \in \omega_i} x$ and $\tilde{\mu}_i = \frac{1}{n_i} \sum_{y \in \omega_i} y = w^T \mu_i$.
• A naive choice would be the projected mean difference, $J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2|$.

Fisher's linear discriminant
• Fisher's solution: maximize a function that represents the difference between the means, scaled by a measure of the within-class scatter.
• Define the class-wise scatter (similar to a variance): $\tilde{s}_i^2 = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2$.
• $\tilde{s}_1^2 + \tilde{s}_2^2$ is the within-class scatter.
• Fisher's criterion is then
  $J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$.
• We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible.

Scatter matrices
• Within-class scatter matrix (variance within each class):
  $S_w = \sum_{i=1}^{M} P(\omega_i)\, S_i$, where $S_i = E[(x - \mu_i)(x - \mu_i)^T]$.
• Between-class scatter matrix (distance between the classes):
  $S_b = \sum_{i=1}^{M} P(\omega_i)(\mu_i - \mu_0)(\mu_i - \mu_0)^T$, where $\mu_0 = \sum_{i=1}^{M} P(\omega_i)\, \mu_i$ is the global mean.
• Mixture or total scatter matrix (variance of the features with respect to the global mean):
  $S_m = E[(x - \mu_0)(x - \mu_0)^T]$.

• The total scatter is $S_m = S_w + S_b$.
• Consider the criterion function
  $J_1 = \frac{\mathrm{trace}\{S_m\}}{\mathrm{trace}\{S_w\}}$,
  the sum of the diagonal elements of $S_m$ (variance around the global mean) divided by the sum of the diagonal elements of $S_w$ (variance around the class means).
• $J_1$ will be large when the variance around the global mean is large compared to the within-class variance.

Alternative scatter criteria
• $J_2 = \frac{|S_m|}{|S_w|} = |S_w^{-1} S_m|$
• $J_3 = \mathrm{trace}\{S_w^{-1} S_b\}$ (error in the textbook: it should be $S_b$, not $S_m$).
• $J_2$ and $J_3$ are invariant to linear transformations.
• In the two-class case, $|S_w|$ is proportional to $\sigma_1^2 + \sigma_2^2$ and $|S_m|$ is proportional to $(\mu_1 - \mu_2)^2$, which yields Fisher's discriminant ratio:
  $\mathrm{FDR} = \frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2}$.
• The generalized FDR for multiple classes is (summing over the classes i and j):
  $\mathrm{FDR}_1 = \sum_{i}^{M} \sum_{j \neq i}^{M} \frac{(\mu_i - \mu_j)^2}{\sigma_i^2 + \sigma_j^2}$.

Fisher's linear discriminant
• Fisher's linear discriminant is a transform that uses the information in the training data set to find a linear combination that best separates the classes.
• It is based on the criterion $J_3 = \mathrm{trace}\{S_w^{-1} S_b\}$, with
  $S_b = \sum_{i=1}^{M} P(\omega_i)(\mu_i - \mu_0)(\mu_i - \mu_0)^T$ (between-class scatter) and $S_w = \sum_{i=1}^{M} P(\omega_i)\, S_i$ (within-class scatter).
• From the feature vector x, let $S_{xw}$ and $S_{xb}$ be the within-class and between-class scatter matrices.
• The scatter matrices of $y = A^T x$ are $S_{yw} = A^T S_{xw} A$ and $S_{yb} = A^T S_{xb} A$.

• In the subspace y, $J_3$ becomes
  $J_3 = \mathrm{trace}\big\{(A^T S_{xw} A)^{-1} (A^T S_{xb} A)\big\}$.
• Problem: find A such that $J_3$ is maximized.
• Solution: set $\frac{\partial J_3(A)}{\partial A} = 0$:
  $\frac{\partial J_3(A)}{\partial A} = -2\, S_{xw} A (A^T S_{xw} A)^{-1} (A^T S_{xb} A)(A^T S_{xw} A)^{-1} + 2\, S_{xb} A (A^T S_{xw} A)^{-1} = 0$,
  which gives $S_{xw}^{-1} S_{xb}\, A = A\, (S_{yw}^{-1} S_{yb})$.

• The scatter matrices $S_{yw}$ and $S_{yb}$ are symmetric and can thus be diagonalized by a linear transform B:
  $B^T S_{yw} B = I$ and $B^T S_{yb} B = D$,
  where B is an l×l matrix, I the identity matrix, and D an l×l diagonal matrix.
• I and D are the scatter matrices of the transformed vector $\hat{y} = B^T y = B^T A^T x$.
• $J_3$ is invariant under linear transformations:
  $J_3(\hat{y}) = \mathrm{trace}\big\{(B^T S_{yw} B)^{-1} (B^T S_{yb} B)\big\} = \mathrm{trace}\big\{B^{-1} S_{yw}^{-1} S_{yb} B\big\} = \mathrm{trace}\big\{S_{yw}^{-1} S_{yb} B B^{-1}\big\} = J_3(y)$.
• Furthermore, $(S_{xw}^{-1} S_{xb})\, C = C D$, where $C = A B$.
• Because D is diagonal, this is an eigenvalue problem: D must have the eigenvalues of $S_{xw}^{-1} S_{xb}$ on its diagonal, and C must have the corresponding eigenvectors of $S_{xw}^{-1} S_{xb}$ as its columns.
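A minimal NumPy sketch of the eigenvalue problem just derived: estimate $S_{xw}$ and $S_{xb}$ from labelled training samples and take the leading eigenvectors of $S_{xw}^{-1} S_{xb}$ as the columns of C. The function name and the use of sample class frequencies as priors are assumptions made for this example.

```python
import numpy as np

def fisher_projection(X, labels, l):
    """Return the m x l matrix C whose columns are the unit-norm eigenvectors
    of S_w^{-1} S_b with the l largest eigenvalues (l <= M-1)."""
    classes, counts = np.unique(labels, return_counts=True)
    priors = counts / len(labels)                       # P(omega_i) from class frequencies
    means = np.array([X[labels == c].mean(axis=0) for c in classes])
    mu0 = priors @ means                                # global mean
    m = X.shape[1]
    Sw, Sb = np.zeros((m, m)), np.zeros((m, m))
    for c, P, mu_c in zip(classes, priors, means):
        Xc = X[labels == c]
        Sw += P * np.cov(Xc, rowvar=False, bias=True)   # within-class scatter
        d = (mu_c - mu0)[:, None]
        Sb += P * (d @ d.T)                             # between-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]              # largest eigenvalues first
    C = eigvecs[:, order[:l]].real
    return C / np.linalg.norm(C, axis=0)                # unit-norm columns

# Projected features: y_hat = C^T x, i.e. Y = X @ C for samples stored as rows.
```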
• Note that
  $S_{xb} = \sum_{i=1}^{M} P(\omega_i)(\mu_i - \mu_0)(\mu_i - \mu_0)^T$
  is a sum of M (M = number of classes) matrices of rank 1. Only M−1 of these terms are independent, so the rank of $S_{xb}$ is M−1 or less (no more than M−1 eigenvalues are nonzero).
• This means that $S_{xw}^{-1} S_{xb}$ also has rank M−1 or less.
• Fisher's discriminant transform can therefore give us an l-dimensional projection, where l ≤ M−1.
  – Note: with 30 features (m = 30) and 5 classes (M = 5), this gives a projection of dimension 4 or less.

Computing Fisher's linear discriminant
• For l = M−1:
  – Form C such that its columns are the M−1 unit-norm eigenvectors of $S_{xw}^{-1} S_{xb}$.
  – Set $\hat{y} = C^T x$.
  – This gives us the maximum $J_3$ value.
  – This means that we can reduce the dimension from m to M−1 without loss of class-separability power (if $J_3$ is a correct measure of class separability).
  – Alternative view: with a Bayesian model we compute the probabilities P(ω_i|x) for each class (i = 1, ..., M). Once M−1 probabilities are found, the remaining P(ω_M|x) is given, because the P(ω_i|x)'s sum to one.

Computation, case 2: l < M−1
• Form C by selecting the eigenvectors corresponding to the l largest eigenvalues of $S_{xw}^{-1} S_{xb}$.
• We now have a loss of discriminating power, since $J_{3,\hat{y}} < J_{3,x}$.

Comments on Fisher's discriminant rule
• In general, projecting the original feature vector onto a lower-dimensional space is associated with some loss of information.
• Although the projection is optimal with respect to $J_3$, $J_3$ might not be a good criterion to optimize for a given data set. (Note that $J_3$ is a kind of sum of a product of between-class and within-class scatter, where the sum is over all classes.)

Limitations of LDA
• LDA produces at most C−1 feature projections.
• LDA is parametric, since it assumes unimodal Gaussian likelihoods.
• LDA will fail when the discriminatory information is not in the mean but in the variance of the data.

PC vs. Fisher's linear discriminant transform
• The principal component transform has no information about the classes in the data.
• The PC projection might therefore not be helpful for improving class separability.
• From an input vector x of dimension m, the PC transform gives us a projection y of dimension 1, ..., m (depending on how many eigenvalues we include).
• A projection with Fisher's linear discriminant gives us y of dimension 1, ..., K−1, where K is the number of classes.
• Fisher's linear discriminant finds the projection that maximizes the ratio of between-class to within-class scatter (see the small comparison below).
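A small synthetic comparison of the two projections, reusing the pc_transform and fisher_projection sketches above; the data set is invented purely to illustrate that the direction of maximum variance need not be the discriminative one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two elongated Gaussian classes: the largest variance is along axis 0,
# while the class separation is along axis 1.
cov = np.diag([9.0, 0.5])
X = np.vstack([rng.multivariate_normal([0.0, 0.0], cov, size=500),
               rng.multivariate_normal([0.0, 2.0], cov, size=500)])
labels = np.repeat([0, 1], 500)

_, A = pc_transform(X - X.mean(axis=0), m=1)   # first principal component
C = fisher_projection(X, labels, l=1)          # Fisher direction (K - 1 = 1)

print("PCA direction:   ", A[:, 0])            # roughly (1, 0) up to sign: maximum variance
print("Fisher direction:", C[:, 0])            # roughly (0, 1) up to sign: class separation
```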
Matlab implementations
• Feature selection – PRTools:
  – Evaluate features: feateval
  – Selection strategies: featselm
  – (Distance measures: distmaha, distm, proxm)
• Linear projections – PRTools:
  – Principal component analysis: pca, klm
  – LDA: fisherm

PCA - Principal components analysis
• Represent x in an orthonormal basis {φ_i}, $x = \sum_{i=1}^{n} y_i \varphi_i$, and approximate it using only m < n of the basis vectors, replacing the discarded components by constants $b_i$:
  $\hat{x}(m) = \sum_{i=1}^{m} y_i \varphi_i + \sum_{i=m+1}^{n} b_i \varphi_i$.
• The approximation error is then
  $\Delta x(m) = x - \hat{x}(m) = \sum_{i=m+1}^{n} (y_i - b_i)\, \varphi_i$.
• To measure the representation error we use the mean squared error,
  $\mathrm{MSE}(m) = E\big[\|\Delta x(m)\|^2\big] = \sum_{i=m+1}^{n} E\big[(y_i - b_i)^2\big]$.

• Optimal values of $b_i$ can be found by taking the partial derivatives of the approximation error:
  $\frac{\partial}{\partial b_i} E\big[(y_i - b_i)^2\big] = -2\,\big(E[y_i] - b_i\big) = 0 \;\Rightarrow\; b_i = E[y_i]$.
• We replace the discarded dimensions by their expected value. (This also feels intuitively correct.)

• The MSE can now be written as
  $\mathrm{MSE}(m) = \sum_{i=m+1}^{n} E\big[(y_i - E[y_i])^2\big] = \sum_{i=m+1}^{n} \varphi_i^T \Sigma_x \varphi_i$,
  where $\Sigma_x$ is the covariance matrix of x.

• The orthonormality constraint on φ can be incorporated in our optimization using Lagrange multipliers λ_i:
  $\mathrm{MSE}(m) = \sum_{i=m+1}^{n} \varphi_i^T \Sigma_x \varphi_i + \sum_{i=m+1}^{n} \lambda_i\,\big(1 - \varphi_i^T \varphi_i\big)$.
• Thus, we can find the optimal $\varphi_i$ by partial differentiation:
  $\frac{\partial}{\partial \varphi_i} \mathrm{MSE}(m) = 2\,\big(\Sigma_x \varphi_i - \lambda_i \varphi_i\big) = 0 \;\Rightarrow\; \Sigma_x \varphi_i = \lambda_i \varphi_i$.
• The optimal $\varphi_i$ and $\lambda_i$ are thus eigenvectors and eigenvalues of the covariance matrix $\Sigma_x$.

• Note that this also implies
  $\mathrm{MSE}(m) = \sum_{i=m+1}^{n} \varphi_i^T \lambda_i \varphi_i = \sum_{i=m+1}^{n} \lambda_i$.
• Thus, in order to minimize MSE(m), the discarded $\lambda_i$ will have to be the smallest eigenvalues!

• The optimal approximation of a random vector $x \in \mathbb{R}^n$ by a linear combination of m < n independent vectors is obtained by projecting the random vector x onto the eigenvectors {φ_i} corresponding to the largest eigenvalues {λ_i} of the covariance matrix $\Sigma_x$. (A small numerical check of this result is sketched at the end of these notes.)

• PCA uses the eigenvectors of the covariance matrix $\Sigma_x$, and is thus able to find the independent axes of the data under the unimodal Gaussian assumption.
  – For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes.
• The main limitation of PCA is that it is unsupervised, and thus does not consider class separability.
  – It is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance.
  – There is no guarantee that these directions are good features for discrimination.

Literature on pattern recognition
• Updated review of statistical pattern recognition:
  – A. Jain, R. Duin and J. Mao, "Statistical Pattern Recognition: A Review", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, January 2000, pp. 4-37.
• Classical PR books:
  – R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd ed., Wiley, 2001.
  – B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
  – S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.
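Finally, the numerical check referred to above: a short sketch (under the same NumPy assumptions as the earlier examples) verifying that the reconstruction MSE equals the sum of the discarded eigenvalues of $\Sigma_x$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2                                   # total and retained dimensions
X = rng.normal(size=(20000, n)) @ rng.normal(size=(n, n))  # correlated samples
X -= X.mean(axis=0)                           # zero mean, so b_i = E[y_i] = 0

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
phi = eigvecs[:, order[:m]]                   # keep the m leading eigenvectors

X_hat = (X @ phi) @ phi.T                     # project onto the subspace and reconstruct
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))

print(mse, eigvals[order[m:]].sum())          # the two numbers agree closely
```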