The Rayleigh Quotient
Nuno Vasconcelos
ECE Department, UCSD

Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data, the equiprobability contours are going to be highly skewed ellipsoids

The role of the mean
note that the mean of the entire data is a function of the coordinate system
• if X has mean µ, then X − µ has mean 0
• we can always make the data have zero mean by centering: if

$$X = \begin{bmatrix} | & & | \\ x_1 & \cdots & x_n \\ | & & | \end{bmatrix}, \qquad X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T$$

• then X_c has zero mean
• so we can assume that X is zero mean without loss of generality

Principal component analysis
if y is Gaussian with covariance Σ, the equiprobability contours

$$y^T \Sigma^{-1} y = K$$

are the ellipses whose
• principal components φ_i are the eigenvectors of Σ
• principal lengths λ_i are the eigenvalues of Σ
by detecting small eigenvalues we can eliminate dimensions that have little variance; this is PCA

PCA by SVD
computation of PCA by SVD, given X with one example per column:
• 1) create the centered data matrix

$$X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T$$

• 2) compute its SVD

$$X_c^T = M \Pi N^T$$

• 3) the principal components are the columns of N, and the eigenvalues are λ_i = π_i²/n

Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• this can be done by plotting the ratio r_k as a function of k

$$r_k = \frac{\sum_{i=1}^{k} \lambda_i^2}{\sum_{i=1}^{n} \lambda_i^2}$$

• e.g. we need 3 eigenvectors to cover 70% of the variability of this dataset

Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• it certainly improves the density estimation, since the space has smaller dimension
• but it could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the worst possible thing we could do

Fisher's linear discriminant
find the line z = w^T x that best separates the two classes
[figure: a bad projection mixes the two classes; a good projection separates them]

$$w^* = \arg\max_w \frac{\left( E[Z \mid Y=1] - E[Z \mid Y=0] \right)^2}{\mathrm{var}[Z \mid Y=1] + \mathrm{var}[Z \mid Y=0]}$$

Linear discriminant analysis
this can be written as

$$J(w) = \frac{w^T S_B w}{w^T S_W w}$$

with the between-class scatter S_B = (µ_1 − µ_0)(µ_1 − µ_0)^T and the within-class scatter S_W = Σ_1 + Σ_0
the optimal solution is

$$w^* = S_W^{-1}(\mu_1 - \mu_0) = (\Sigma_1 + \Sigma_0)^{-1}(\mu_1 - \mu_0)$$

the BDR after projection on z is equivalent to the BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier

The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

appears in many problems in engineering and pattern recognition
we have already seen that this is equivalent to

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

and can be solved using Lagrange multipliers

The Rayleigh quotient
define the Lagrangian

$$L = w^T S_B w - \lambda \left( w^T S_W w - K \right)$$

and maximize with respect to w:

$$\nabla_w L = 2(S_B - \lambda S_W) w = 0$$

to obtain the solution

$$S_B w = \lambda S_W w$$

this is a generalized eigenvalue problem that you can solve using any eigenvalue routine (a numerical sketch follows)
which eigenvalue?
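Before answering, here is a minimal numerical sketch (not part of the original slides) of how such a generalized eigenvalue problem can be solved in practice. It assumes numpy and scipy, and uses a small synthetic two-class dataset, chosen purely for illustration, to build S_B and S_W as in the LDA slide above.

```python
# Sketch: solve S_B w = lambda S_W w for a synthetic two-class problem.
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 3.0], size=(200, 2))  # class 0
X1 = rng.normal(loc=[3.0, 1.0], scale=[1.0, 3.0], size=(200, 2))  # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
d = (mu1 - mu0)[:, None]
S_B = d @ d.T                      # between-class scatter (rank 1)
S_W = np.cov(X0.T) + np.cov(X1.T)  # within-class scatter

# eigh(S_B, S_W) solves the generalized problem S_B w = lambda S_W w;
# it requires S_W to be positive definite, which holds here.
evals, evecs = eigh(S_B, S_W)      # eigenvalues in ascending order
print(evals)                       # which one do we want? see the argument that follows
```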
The Rayleigh quotient
recall that we want

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

and that the optimal w satisfies S_B w = λ S_W w; hence

$$(w^*)^T S_B w^* = \lambda (w^*)^T S_W w^* = \lambda K$$

which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of

$$S_B w = \lambda S_W w$$

of largest eigenvalue

The Rayleigh quotient
case 1: S_W invertible
• the problem simplifies to a standard eigenvalue problem

$$S_W^{-1} S_B w = \lambda w$$

• w is the eigenvector associated with the largest eigenvalue of S_W^{-1} S_B
case 2: S_W not invertible
• this case is more problematic
• in fact, the cost can be unbounded
• consider w = w_r + w_n, with w_r in the row space of S_W and w_n in the null space; then

$$w^T S_W w = (w_r + w_n)^T S_W (w_r + w_n) = (w_r + w_n)^T S_W w_r = w_r^T S_W (w_r + w_n) = w_r^T S_W w_r$$

The Rayleigh quotient
and

$$w^T S_B w = (w_r + w_n)^T S_B (w_r + w_n) = w_r^T S_B w_r + 2 w_r^T S_B w_n + \underbrace{w_n^T S_B w_n}_{\geq 0}$$

hence, if there is a pair (w_r, w_n) such that w_r^T S_B w_n > 0 (or w_n^T S_B w_n > 0),
• we can make the cost arbitrarily large
• by simply scaling up the null space component w_n
this can also be seen geometrically

The Rayleigh quotient
recall that the contours

$$w^T S_W w = K, \qquad S_W = \Phi \Lambda \Phi^T$$

are the ellipses whose
• principal components φ_i are the eigenvectors of S_W
• principal lengths are 1/λ_i
when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K$$

The Rayleigh quotient
[figure: cost ellipses w^T S_B w and the constraint ellipse w^T S_W w = K in the (w_1, w_2) plane, with principal lengths 1/λ_1 and 1/λ_2 and the optimum w* marked]
the optimal solution is where the outer red ellipse (the cost) touches the blue ellipse (the constraint)
• in this example, as λ_1 goes to 0, ||w*|| and the cost go to infinity

The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T S_W w = K, \quad w^T w = L$$

• this restricts the set of possible solutions to the intersection points (surfaces, in the high dimensional case)

The Rayleigh quotient
the Lagrangian is now

$$L = w^T S_B w - \lambda \left( w^T S_W w - K \right) - \beta \left( w^T w - L \right)$$

and the solution satisfies

$$\nabla_w L = 2(S_B - \lambda S_W - \beta I) w = 0$$

or

$$\left( S_B - \lambda \left[ S_W + \gamma I \right] \right) w = 0, \qquad \gamma = \beta / \lambda$$

but this is exactly the solution of the original problem with S_W + γI instead of S_W:

$$\max_w \; w^T S_B w \quad \text{subject to} \quad w^T \left[ S_W + \gamma I \right] w = K$$

The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient

$$J(w) = \frac{w^T S_B w}{w^T \left[ S_W + \gamma I \right] w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

what does this accomplish?
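Before the formal answer, a small numerical sketch (not from the slides, assuming numpy) of both the problem and the fix: with a singular S_W the quotient can be driven to infinity by scaling up the null-space component, while the regularized quotient stays bounded. The matrices below are arbitrary illustrative choices.

```python
# Sketch: unbounded Rayleigh quotient for singular S_W, bounded after adding gamma*I.
import numpy as np

S_B = np.array([[2.0, 1.0],
                [1.0, 2.0]])
S_W = np.array([[1.0, 0.0],
                [0.0, 0.0]])        # singular: null space spanned by [0, 1]

w_r = np.array([1.0, 0.0])          # component in the row space of S_W
w_n = np.array([0.0, 1.0])          # component in the null space of S_W
gamma = 0.1

def J(w, Sw):
    """Rayleigh quotient w^T S_B w / w^T Sw w."""
    return (w @ S_B @ w) / (w @ Sw @ w)

for t in [1, 10, 100, 1000]:        # scale up the null-space component
    w = w_r + t * w_n
    print(t, J(w, S_W), J(w, S_W + gamma * np.eye(2)))
# J(w, S_W) grows without bound (roughly like t^2 here),
# while the regularized quotient converges to a finite value.
```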
• note that

$$S_W = \Phi \Lambda \Phi^T \;\Rightarrow\; S_W + \gamma I = \Phi \Lambda \Phi^T + \gamma \Phi I \Phi^T = \Phi \left[ \Lambda + \gamma I \right] \Phi^T$$

• this makes all eigenvalues positive
• the matrix is therefore invertible

The Rayleigh quotient
in summary, for

$$\max_w \frac{w^T S_B w}{w^T S_W w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

1) S_W invertible
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• the maximum of the constrained cost is λK, where λ is the largest eigenvalue
2) S_W not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]^{-1} S_B
• the maximum of the constrained cost is λK, where λ is the largest eigenvalue

Regularized discriminant analysis
back to LDA:
• when the within-class scatter matrix is non-invertible, instead of

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T, \quad S_W = \Sigma_1 + \Sigma_0$$

• we use the regularized within-class scatter

$$J(w) = \frac{w^T S_B w}{w^T S_W w}, \qquad S_B = (\mu_1 - \mu_0)(\mu_1 - \mu_0)^T, \quad S_W = \Sigma_1 + \Sigma_0 + \gamma I$$

Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that

$$S_W = \Sigma_1 + \Sigma_0 + \gamma I = \left( \Sigma_1 + \gamma_1 I \right) + \left( \Sigma_0 + \gamma_0 I \right), \qquad \gamma_1 + \gamma_0 = \gamma$$

this can also be seen as regularizing each covariance matrix individually
the regularization parameters γ_i are determined by cross-validation
• more on this later
• basically, it means that we try several possibilities and keep the best

Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data matrix

$$X_c^T = \left( I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \right) X^T$$

which has one point per row:

$$X_c^T = \begin{bmatrix} x_1^T - \mu^T \\ \vdots \\ x_n^T - \mu^T \end{bmatrix}$$

• note that the projection of all points on a principal component φ is z = X_c^T φ:

$$\begin{bmatrix} z_1 \\ \vdots \\ z_n \end{bmatrix} = \begin{bmatrix} (x_1 - \mu)^T \phi \\ \vdots \\ (x_n - \mu)^T \phi \end{bmatrix}$$

Principal component analysis
and, since

$$\frac{1}{n} \sum_i z_i = \frac{1}{n} \sum_i (x_i - \mu)^T \phi = \left( \frac{1}{n} \sum_i x_i - \mu \right)^T \phi = 0$$

the sample variance of z is given by its (squared) norm

$$\mathrm{var}(z) = \frac{1}{n} \sum_i z_i^2 = \frac{1}{n} \|z\|^2$$

recall that PCA looks for the component of largest variance

$$\max_\phi \|z\|^2 = \max_\phi \left\| X_c^T \phi \right\|^2 = \max_\phi \left( X_c^T \phi \right)^T \left( X_c^T \phi \right) = \max_\phi \phi^T X_c X_c^T \phi$$

Principal component analysis
recall that the sample covariance is

$$\Sigma = \frac{1}{n} \sum_i (x_i - \mu)(x_i - \mu)^T = \frac{1}{n} \sum_i x_i^c \left( x_i^c \right)^T$$

where x_i^c is the i-th column of X_c; this can be written as

$$\Sigma = \frac{1}{n} \begin{bmatrix} | & & | \\ x_1^c & \cdots & x_n^c \\ | & & | \end{bmatrix} \begin{bmatrix} - & (x_1^c)^T & - \\ & \vdots & \\ - & (x_n^c)^T & - \end{bmatrix} = \frac{1}{n} X_c X_c^T$$

Principal component analysis
hence the PCA problem is

$$\max_\phi \; \phi^T X_c X_c^T \phi = \max_\phi \; \phi^T \Sigma \phi$$

(up to the factor 1/n, which does not change the maximizer)
as in LDA, this can be made arbitrarily large by simply scaling φ; to normalize, we constrain φ to have unit norm

$$\max_\phi \; \phi^T \Sigma \phi \quad \text{subject to} \quad \|\phi\| = 1$$

which is equivalent to

$$\max_\phi \frac{\phi^T \Sigma \phi}{\phi^T \phi}$$

this shows that PCA = maximization of a Rayleigh quotient

Principal component analysis
in this case

$$\max_w \frac{w^T S_B w}{w^T S_W w}, \qquad S_B, S_W \ \text{symmetric positive semidefinite}$$

with S_B = Σ and S_W = I
S_W is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• this is just the largest eigenvector of the covariance Σ
• the maximum value is λ, where λ is the largest eigenvalue
(a numerical sketch of this view of PCA follows)
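As an illustration (not part of the slides; a minimal sketch assuming numpy), the code below checks that the maximizer of this Rayleigh quotient, i.e. the top eigenvector of the sample covariance, coincides with the first principal component obtained from the SVD recipe given earlier (λ_i = π_i²/n). The Gaussian used to generate the data is an arbitrary example.

```python
# Sketch: PCA as a Rayleigh quotient with S_B = Sigma, S_W = I,
# cross-checked against PCA by SVD of the centered data matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, 2.0, 0.0],
                            cov=[[4.0, 1.5, 0.0],
                                 [1.5, 1.0, 0.0],
                                 [0.0, 0.0, 0.1]],
                            size=500).T          # one example per column, as in the slides

n = X.shape[1]
Xc = X - X.mean(axis=1, keepdims=True)           # centered data matrix
Sigma = (Xc @ Xc.T) / n                          # sample covariance

# maximizer of phi^T Sigma phi / phi^T phi = eigenvector of largest eigenvalue of Sigma
evals, evecs = np.linalg.eigh(Sigma)             # eigenvalues in ascending order
phi_eig = evecs[:, -1]

# PCA by SVD of Xc^T: the principal components are the right singular vectors
U, s, Vt = np.linalg.svd(Xc.T, full_matrices=False)
phi_svd = Vt[0]                                  # first principal component

print(np.allclose(abs(phi_eig @ phi_svd), 1.0))  # same direction, up to sign
print(evals[-1], s[0]**2 / n)                    # lambda_1 = pi_1^2 / n
```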
The Rayleigh quotient dual
let's assume, for a moment, that the solution is of the form

$$w = X_c \alpha$$

• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to

$$\max_\alpha \frac{\alpha^T X_c^T S_B X_c \alpha}{\alpha^T X_c^T S_W X_c \alpha}$$

this does not change its form, so the solution is
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c
• the maximum of the constrained cost is λK, where λ is the largest eigenvalue

The Rayleigh quotient dual
for PCA
• S_W = I and S_B = Σ = (1/n) X_c X_c^T
• the solution satisfies

$$S_B w = \lambda S_W w \;\Leftrightarrow\; \frac{1}{n} X_c X_c^T w = \lambda w \;\Leftrightarrow\; w = X_c \underbrace{\frac{1}{n\lambda} X_c^T w}_{\alpha}$$

• and, therefore, we have
• w* is an eigenvector of X_c X_c^T (the 1/n factor does not change the eigenvectors)
• α* is an eigenvector of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c = (X_c^T X_c)^{-1} X_c^T X_c X_c^T X_c = X_c^T X_c
• i.e. we have two alternative manners in which to compute PCA

Principal component analysis
primal:
• assemble the matrix Σ = X_c X_c^T
• compute its eigenvectors φ_i
• these are the principal components
dual:
• assemble the matrix K = X_c^T X_c
• compute its eigenvectors α_i
• the principal components are φ_i = X_c α_i
in both cases we have an eigenvalue problem
• primal: on the sum of outer products

$$\Sigma = \sum_i x_i^c \left( x_i^c \right)^T$$

• dual: on the matrix of inner products

$$K_{ij} = \left( x_i^c \right)^T x_j^c$$

The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of the datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them later
there are many examples
• kernel PCA, kernel LDA, manifold learning, etc.
(a sketch of the primal/dual equivalence for PCA follows)
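To make the primal/dual equivalence concrete, here is a small sketch (not from the slides, assuming numpy): the top eigenvector of the outer-product matrix X_c X_c^T and the vector X_c α built from the top eigenvector α of the inner-product matrix K = X_c^T X_c span the same direction, and the two matrices share their nonzero eigenvalues. The data dimensions are arbitrary.

```python
# Sketch: primal PCA (eigenvectors of Xc Xc^T) vs dual PCA (eigenvectors of Xc^T Xc).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 20))                      # 5-dim data, one example per column
Xc = X - X.mean(axis=1, keepdims=True)            # centered data matrix

# primal: eigenvectors of the outer-product matrix
ew_p, ev_p = np.linalg.eigh(Xc @ Xc.T)            # ascending eigenvalues
phi_primal = ev_p[:, -1]                          # top principal component

# dual: eigenvectors of the inner-product (Gram) matrix K_ij = (x_i^c)^T x_j^c
ew_d, ev_d = np.linalg.eigh(Xc.T @ Xc)
alpha = ev_d[:, -1]
phi_dual = Xc @ alpha                             # map back: phi = Xc alpha
phi_dual /= np.linalg.norm(phi_dual)              # normalize to unit length

print(np.allclose(abs(phi_primal @ phi_dual), 1.0))   # same direction, up to sign
print(ew_p[-1], ew_d[-1])                              # same largest (nonzero) eigenvalue
```

The dual route only ever touches inner products of the datapoints, which is exactly the property that the kernelized versions mentioned above exploit.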