The Rayleigh Quotient
Nuno Vasconcelos
ECE Department, UCSD

Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

The role of the mean
note that the mean of the entire data is a function of the coordinate system
• if X has mean µ, then X − µ has mean 0
• we can always make the data have zero mean by centering: if

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}, \qquad X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T$$

• then X_c has zero mean
• so we can assume that X is zero mean without loss of generality

Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours

$$y^T \Sigma^{-1} y = K$$

are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ
by detecting small eigenvalues we can eliminate dimensions that have little variance; this is PCA

PCA by SVD
computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are λ_i = π_i²/n

Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{n} \lambda_i^2}$$

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do

Fisher's linear discriminant
find the line z = w^T x that best separates the two classes
[figure: a bad projection mixes the two classes; a good projection separates them]

$$w^* = \arg\max_w \frac{\left( E[Z \mid Y=1] - E[Z \mid Y=0] \right)^2}{\mathrm{var}[Z \mid Y=1] + \mathrm{var}[Z \mid Y=0]}$$

Linear discriminant analysis
this can be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

with the between-class scatter S_B = (µ_1 − µ_0)(µ_1 − µ_0)^T and the within-class scatter S_W = Σ_1 + Σ_0
the optimal solution is

$$w^* = S_W^{-1}(\mu_1 - \mu_0) = (\Sigma_1 + \Sigma_0)^{-1}(\mu_1 - \mu_0)$$

the BDR after projection on z is equivalent to the BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier

The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

appears in many problems in engineering and pattern recognition
we have already seen that this is equivalent to

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

and can be solved using Lagrange multipliers

The Rayleigh quotient
define the Lagrangian

$$L = w^T S_B w - \lambda \left( w^T S_W w - K \right)$$

and maximize with respect to w:

$$\nabla_w L = 2(S_B - \lambda S_W) w = 0$$

to obtain the solution

$$S_B w = \lambda S_W w$$

this is a generalized eigenvalue problem that you can solve using any eigenvalue routine (a numerical sketch follows)
which eigenvalue?
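Before answering, here is a minimal numerical sketch (not part of the original slides) of how such a generalized eigenvalue problem can be solved in practice. It assumes numpy and scipy, and uses a small synthetic two-class dataset, chosen purely for illustration, to build S_B and S_W as in the LDA slide above.

```python
# Sketch: solve S_B w = lambda S_W w for a synthetic two-class problem.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(200, 2))  # class 0
X1 = rng.normal(loc=[3.0, 1.0], scale=[1.0, 3.0], size=(200, 2))  # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
d = (mu1 - mu0)[:, None]
S_B = d @ d.T                      # between-class scatter (rank 1)
S_W = np.cov(X0.T) + np.cov(X1.T)  # within-class scatter

# eigh(S_B, S_W) solves the generalized problem S_B w = lambda S_W w;
# it requires S_W to be positive definite, which holds here.
evals, evecs = eigh(S_B, S_W)      # eigenvalues in ascending order
print(evals)                       # which one do we want? see the argument that follows
```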
The Rayleigh quotient
recall that we want

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

and that the optimal w satisfies S_B w = λ S_W w; hence

$$(w^*)^T S_B w^* = \lambda (w^*)^T S_W w^* = \lambda K$$

which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of

$$S_B w = \lambda S_W w$$

of largest eigenvalue

The Rayleigh quotient
case 1: S_W invertible
• the problem simplifies to a standard eigenvalue problem

$$S_W^{-1} S_B w = \lambda w$$

• w is the eigenvector associated with the largest eigenvalue of S_W^{-1} S_B
case 2: S_W not invertible
• this case is more problematic
• in fact, the cost can be unbounded
• consider w = w_r + w_n, with w_r in the row space of S_W and w_n in the null space; then

$$w^T S_W w = (w_r + w_n)^T S_W (w_r + w_n) = (w_r + w_n)^T S_W w_r = w_r^T S_W (w_r + w_n) = w_r^T S_W w_r$$

The Rayleigh quotient
and

$$w^T S_B w = (w_r + w_n)^T S_B (w_r + w_n) = w_r^T S_B w_r + 2 w_r^T S_B w_n + \underbrace{w_n^T S_B w_n}_{\geq 0}$$

hence, if there is a pair (w_r, w_n) such that w_r^T S_B w_n > 0 (or w_n^T S_B w_n > 0),
• we can make the cost arbitrarily large
• by simply scaling up the null space component w_n
this can also be seen geometrically

The Rayleigh quotient
recall that the contours

$$w^T S_W w = K, \qquad S_W = \Phi \Lambda \Phi^T$$

are the ellipses whose
• principal components φ_i are the eigenvectors of S_W
• principal lengths are 1/λ_i
when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

The Rayleigh quotient
[figure: cost ellipses w^T S_B w and the constraint ellipse w^T S_W w = K in the (w_1, w_2) plane, with principal lengths 1/λ_1 and 1/λ_2 and the optimum w* marked]
the optimal solution is where the outer red ellipse (the cost) touches the blue ellipse (the constraint)
• in this example, as λ_1 goes to 0, ||w*|| and the cost go to infinity

The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K, \quad w^T w = L$$

• this restricts the set of possible solutions to the intersection points (surfaces, in the high dimensional case)

The Rayleigh quotient
the Lagrangian is now

$$L = w^T S_B w - \lambda \left( w^T S_W w - K \right) - \beta \left( w^T w - L \right)$$

and the solution satisfies

$$\nabla_w L = 2(S_B - \lambda S_W - \beta I) w = 0$$

or

$$\left( S_B - \lambda \left[ S_W + \gamma I \right] \right) w = 0, \qquad \gamma = \beta / \lambda$$

but this is exactly the solution of the original problem with S_W + γI instead of S_W:

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T \left[ S_W + \gamma I \right] w = K$$

The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient

$$J(w) = \frac{w^T S_B w}{w^T \left[ S_W + \gamma I \right] w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

what does this accomplish?
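Before the formal answer, a small numerical sketch (not from the slides, assuming numpy) of both the problem and the fix: with a singular S_W the quotient can be driven to infinity by scaling up the null-space component, while the regularized quotient stays bounded. The matrices below are arbitrary illustrative choices.

```python
# Sketch: unbounded Rayleigh quotient for singular S_W, bounded after adding gamma*I.
import numpy as np

S_B = np.array([[2.0, 1.0],
                [1.0, 2.0]])
S_W = np.array([[1.0, 0.0],
                [0.0, 0.0]])        # singular: null space spanned by [0, 1]

w_r = np.array([1.0, 0.0])          # component in the row space of S_W
w_n = np.array([0.0, 1.0])          # component in the null space of S_W
gamma = 0.1

def J(w, Sw):
    """Rayleigh quotient w^T S_B w / w^T Sw w."""
    return (w @ S_B @ w) / (w @ Sw @ w)

for t in [1, 10, 100, 1000]:        # scale up the null-space component
    w = w_r + t * w_n
    print(t, J(w, S_W), J(w, S_W + gamma * np.eye(2)))
# J(w, S_W) grows without bound (roughly like t^2 here),
# while the regularized quotient converges to a finite value.
```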
• note that

$$S_W = \Phi \Lambda \Phi^T \;\Rightarrow\; S_W + \gamma I = \Phi \Lambda \Phi^T + \gamma \Phi I \Phi^T = \Phi \left[ \Lambda + \gamma I \right] \Phi^T$$

• this makes all eigenvalues positive
• the matrix is therefore invertible

The Rayleigh quotient
in summary, for

$$\max_w \frac{w^T S_B w}{w^T S_W w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

1) S_W invertible
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• the maximum of the constrained cost is λK, where λ is the largest eigenvalue
2) S_W not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]^{-1} S_B
• the maximum of the constrained cost is λK, where λ is the largest eigenvalue

Regularized discriminant analysis
back to LDA:
• when the within-class scatter matrix is non-invertible, instead of

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T, \quad S_W = \Sigma_1 + \Sigma_0$$

• we use the regularized within-class scatter

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T, \quad S_W = \Sigma_1 + \Sigma_0 + \gamma I$$

Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that

$$S_W = \Sigma_1 + \Sigma_0 + \gamma I = \left( \Sigma_1 + \gamma_1 I \right) + \left( \Sigma_0 + \gamma_0 I \right), \qquad \gamma_1 + \gamma_0 = \gamma$$

this can also be seen as regularizing each covariance matrix individually
the regularization parameters γ_i are determined by cross-validation
• more on this later
• basically, it means that we try several possibilities and keep the best

Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data matrix

$$X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T$$

which has one point per row:

$$X_c^T = \begin{bmatrix} x_1^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{bmatrix}$$

• note that the projection of all points on a principal component φ is z = X_c^T φ:

$$\begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} (x_1 - \mu)^T \phi \\ \vdots \\ (x_n - \mu)^T \phi \end{bmatrix}$$

Principal component analysis
and, since

$$\frac{1}{n} \sum_i z_i = \frac{1}{n} \sum_i (x_i - \mu)^T \phi = \left( \frac{1}{n} \sum_i x_i - \mu \right)^T \phi = 0$$

the sample variance of z is given by its (squared) norm

$$\mathrm{var}(z) = \frac{1}{n} \sum_i z_i^2 = \frac{1}{n} \|z\|^2$$

recall that PCA looks for the component of largest variance

$$\max_\phi \|z\|^2 = \max_\phi \left\| X_c^T \phi \right\|^2 = \max_\phi \left( X_c^T \phi \right)^T \left( X_c^T \phi \right) = \max_\phi \phi^T X_c X_c^T \phi$$

Principal component analysis
recall that the sample covariance is

$$\Sigma = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} \sum_i x_i^c \left( x_i^c \right)^T$$

where x_i^c is the i-th column of X_c; this can be written as

$$\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix} = \frac{1}{n} X_c X_c^T$$

Principal component analysis
hence the PCA problem is

$$\max_\phi \; \phi^T X_c X_c^T \phi = \max_\phi \; \phi^T \Sigma \phi$$

(up to the factor 1/n, which does not change the maximizer)
as in LDA, this can be made arbitrarily large by simply scaling φ; to normalize, we constrain φ to have unit norm

$$\max_\phi \; \phi^T \Sigma \phi \quad \text{subject to} \quad \|\phi\| = 1$$

which is equivalent to

$$\max_\phi \frac{\phi^T \Sigma \phi}{\phi^T \phi}$$

this shows that PCA = maximization of a Rayleigh quotient

Principal component analysis
in this case

$$\max_w \frac{w^T S_B w}{w^T S_W w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

with S_B = Σ and S_W = I
S_W is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• this is just the largest eigenvector of the covariance Σ
• the maximum value is λ, where λ is the largest eigenvalue
(a numerical sketch of this view of PCA follows)
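As an illustration (not part of the slides; a minimal sketch assuming numpy), the code below checks that the maximizer of this Rayleigh quotient, i.e. the top eigenvector of the sample covariance, coincides with the first principal component obtained from the SVD recipe given earlier (λ_i = π_i²/n). The Gaussian used to generate the data is an arbitrary example.

```python
# Sketch: PCA as a Rayleigh quotient with S_B = Sigma, S_W = I,
# cross-checked against PCA by SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, 2.0, 0.0],
                            cov=[[4.0, 1.5, 0.0],
                                 [1.5, 1.0, 0.0],
                                 [0.0, 0.0, 0.1]],
                            size=500).T          # one example per column, as in the slides

n = X.shape[1]
Xc = X - X.mean(axis=1, keepdims=True)           # centered data matrix
Sigma = (Xc @ Xc.T) / n                          # sample covariance

# maximizer of phi^T Sigma phi / phi^T phi = eigenvector of largest eigenvalue of Sigma
evals, evecs = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
phi_eig = evecs[:, -1]

# PCA by SVD of Xc^T: the principal components are the right singular vectors
U, s, Vt = np.linalg.svd(Xc.T, full_matrices=False)
phi_svd = Vt[0]                                  # first principal component

print(np.allclose(abs(phi_eig @ phi_svd), 1.0))  # same direction, up to sign
print(evals[-1], s[0]**2 / n)                    # lambda_1 = pi_1^2 / n
```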
The Rayleigh quotient dual
let's assume, for a moment, that the solution is of the form

$$w = X_c \alpha$$

• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to

$$\max_\alpha \frac{\alpha^T X_c^T S_B X_c \alpha}{\alpha^T X_c^T S_W X_c \alpha}$$

this does not change its form, so the solution is
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c
• the maximum of the constrained cost is λK, where λ is the largest eigenvalue

The Rayleigh quotient dual
for PCA
• S_W = I and S_B = Σ = (1/n) X_c X_c^T
• the solution satisfies

$$S_B w = \lambda S_W w \;\Leftrightarrow\; \frac{1}{n} X_c X_c^T w = \lambda w \;\Leftrightarrow\; w = X_c \underbrace{\frac{1}{n\lambda} X_c^T w}_{\alpha}$$

• and, therefore, we have
• w* is an eigenvector of X_c X_c^T (the 1/n factor does not change the eigenvectors)
• α* is an eigenvector of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c = (X_c^T X_c)^{-1} X_c^T X_c X_c^T X_c = X_c^T X_c
• i.e. we have two alternative manners in which to compute PCA

Principal component analysis
primal:
• assemble the matrix Σ = X_c X_c^T
• compute its eigenvectors φ_i
• these are the principal components
dual:
• assemble the matrix K = X_c^T X_c
• compute its eigenvectors α_i
• the principal components are φ_i = X_c α_i
in both cases we have an eigenvalue problem
• primal: on the sum of outer products

$$\Sigma = \sum_i x_i^c \left( x_i^c \right)^T$$

• dual: on the matrix of inner products

$$K_{ij} = \left( x_i^c \right)^T x_j^c$$

The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of the datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them later
there are many examples
• kernel PCA, kernel LDA, manifold learning, etc.
(a sketch of the primal/dual equivalence for PCA follows)
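To make the primal/dual equivalence concrete, here is a small sketch (not from the slides, assuming numpy): the top eigenvector of the outer-product matrix X_c X_c^T and the vector X_c α built from the top eigenvector α of the inner-product matrix K = X_c^T X_c span the same direction, and the two matrices share their nonzero eigenvalues. The data dimensions are arbitrary.

```python
# Sketch: primal PCA (eigenvectors of Xc Xc^T) vs dual PCA (eigenvectors of Xc^T Xc).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 20))                      # 5-dim data, one example per column
Xc = X - X.mean(axis=1, keepdims=True)            # centered data matrix

# primal: eigenvectors of the outer-product matrix
ew_p, ev_p = np.linalg.eigh(Xc @ Xc.T)            # ascending eigenvalues
phi_primal = ev_p[:, -1]                          # top principal component

# dual: eigenvectors of the inner-product (Gram) matrix K_ij = (x_i^c)^T x_j^c
ew_d, ev_d = np.linalg.eigh(Xc.T @ Xc)
alpha = ev_d[:, -1]
phi_dual = Xc @ alpha                             # map back: phi = Xc alpha
phi_dual /= np.linalg.norm(phi_dual)              # normalize to unit length

print(np.allclose(abs(phi_primal @ phi_dual), 1.0))   # same direction, up to sign
print(ew_p[-1], ew_d[-1])                              # same largest (nonzero) eigenvalue
```

The dual route only ever touches inner products of the datapoints, which is exactly the property that the kernelized versions mentioned above exploit.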