The Rayleigh Quotient

Nuno Vasconcelos
ECE Department, UCSD
Principal component analysis
basic idea:
• if the data lives in a subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D
• this means that if we fit a Gaussian to the data the equiprobability
contours are going to be highly skewed ellipsoids
The role of the mean
note that the mean of the entire data is a function of the coordinate system
• if X has mean µ then X – µ has mean 0
we can always make the data have zero mean by
centering
• if
  X = [ x_1 … x_n ]   (one example per column)
and
  X_c^T = (I − (1/n) 1 1^T) X^T
• then Xc has zero mean
can assume that X is zero mean without loss of generality
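a minimal numpy sketch of this centering step (the data and all variable names are illustrative, not from the slides):

```python
import numpy as np

# centering sketch: n illustrative examples in d dimensions, one per column
n, d = 5, 3
X = np.random.randn(d, n)

# Xc^T = (I - (1/n) 1 1^T) X^T, i.e. subtract the sample mean from every column
centering = np.eye(n) - np.ones((n, n)) / n
Xc = (centering @ X.T).T              # same as X - X.mean(axis=1, keepdims=True)

print(np.allclose(Xc.mean(axis=1), 0))   # True: Xc has zero mean
```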
Principal component analysis
If y is Gaussian with covariance Σ, the equiprobability contours
  y^T Σ^{-1} y = K
[figure: equiprobability ellipse in the (y1, y2) plane, with principal components φ1, φ2 and principal lengths λ1, λ2]
are the ellipses whose
• principal components φi are the
eigenvectors of Σ
• principal lengths λi are the eigenvalues of Σ
by detecting small eigenvalues we can eliminate
dimensions that have little variance
this is PCA
PCA by SVD
computation of PCA by SVD
given X with one example per column
• 1) create the centered data-matrix
  X_c^T = (I − (1/n) 1 1^T) X^T
• 2) compute its SVD
  X_c^T = M Π N^T
• 3) principal components are columns of N, eigenvalues are
  λ_i = π_i² / n
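a short numpy sketch of these three steps (illustrative data; note that np.linalg.svd returns N^T rather than N):

```python
import numpy as np

n, d = 100, 4
X = np.random.randn(d, n)                       # one example per column

# 1) centered data matrix; Xc^T has one point per row
XcT = X.T - X.T.mean(axis=0)

# 2) SVD: Xc^T = M Pi N^T
M, Pi, NT = np.linalg.svd(XcT, full_matrices=False)

# 3) principal components are the columns of N, eigenvalues are pi_i^2 / n
principal_components = NT.T
eigenvalues = Pi**2 / n

# sanity check against the eigendecomposition of the sample covariance
evals, _ = np.linalg.eigh(XcT.T @ XcT / n)
print(np.allclose(np.sort(evals)[::-1], eigenvalues))   # True
```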
Principal component analysis
a natural measure is to pick the eigenvectors that explain p% of the data variability
• can be done by plotting the ratio rk as a function of k
  r_k = ( ∑_{i=1}^{k} λ_i² ) / ( ∑_{i=1}^{n} λ_i² )
• e.g. we need 3 eigenvectors to cover 70% of the variability of this
dataset
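a small sketch of this selection rule, assuming the eigenvalues are already sorted in decreasing order (the numbers are illustrative):

```python
import numpy as np

lam = np.array([4.0, 2.5, 1.2, 0.4, 0.1])      # illustrative eigenvalues
r = np.cumsum(lam**2) / np.sum(lam**2)          # r_k for k = 1, ..., n
p = 0.70
k = int(np.argmax(r >= p)) + 1                  # smallest k with r_k >= p
print(r, k)
```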
Limitations of PCA
PCA is not optimal for classification
• note that there is no mention of the class label in the definition of
PCA
• keeping the dimensions of largest energy (variance) is a good idea, but not always enough
• certainly improves the density estimation, since space has
smaller dimension
• but could be unwise from a classification point of view
• the discriminant dimensions could be thrown out
it is not hard to construct examples where PCA is the
worst possible thing we could do
Fisher's linear discriminant
find the line z = w^T x that best separates the two classes
[figure: bad projection vs. good projection of the two classes onto a line]
  w* = arg max_w  (E[Z|Y=1] − E[Z|Y=0])² / (var[Z|Y=1] + var[Z|Y=0])
Linear discriminant analysis
this can be written as
  J(w) = (w^T S_B w) / (w^T S_W w)
with
  S_B = (µ1 − µ0)(µ1 − µ0)^T   (between class scatter)
  S_W = Σ1 + Σ0                (within class scatter)
the optimal solution is
  w* = S_W^{-1} (µ1 − µ0) = (Σ1 + Σ0)^{-1} (µ1 − µ0)
BDR after projection on z is equivalent to BDR on x if
• the two classes are Gaussian and have equal covariance
otherwise, LDA leads to a sub-optimal classifier
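a minimal numpy sketch of this solution, computing w* = (Σ1 + Σ0)^{-1}(µ1 − µ0) on illustrative two-class data (one example per row here, for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], [1.0, 2.0], size=(200, 2))   # class 0
X1 = rng.normal([3, 1], [1.0, 2.0], size=(200, 2))   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False)
S1 = np.cov(X1, rowvar=False)

w = np.linalg.solve(S1 + S0, mu1 - mu0)       # LDA projection direction
z0, z1 = X0 @ w, X1 @ w                       # 1D projections of the two classes
print(w, z0.mean(), z1.mean())
```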
The Rayleigh quotient
it turns out that the maximization of the Rayleigh quotient
  J(w) = (w^T S_B w) / (w^T S_W w),   S_B, S_W symmetric positive semidefinite
appears in many problems in engineering and pattern
recognition
we have already seen that this is equivalent to
  max_w  w^T S_B w   subject to   w^T S_W w = K
and can be solved using Lagrange multipliers
The Rayleigh quotient
define the Lagrangian
  L = w^T S_B w − λ (w^T S_W w − K)
maximize with respect to w
  ∇_w L = 2 (S_B − λ S_W) w = 0
to obtain the solution
  S_B w = λ S_W w
this is a generalized eigenvalue problem that you can
solve using any eigenvalue routine
which eigenvalue?
The Rayleigh quotient
recall that we want
  max_w  w^T S_B w   subject to   w^T S_W w = K
and the optimal w satisfies
  S_B w = λ S_W w
hence
  (w*)^T S_B w* = λ (w*)^T S_W w* = λK
which is maximum for the largest eigenvalue
in summary, we need the generalized eigenvector of S_B w = λ S_W w with the largest eigenvalue
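a minimal sketch of this recipe using scipy's symmetric generalized eigenvalue routine; it assumes S_W is positive definite, and the matrices are illustrative:

```python
import numpy as np
from scipy.linalg import eigh

S_B = np.array([[4.0, 2.0], [2.0, 3.0]])     # illustrative symmetric PSD matrices
S_W = np.array([[2.0, 0.5], [0.5, 1.0]])

# solve S_B w = lambda S_W w; eigh returns eigenvalues in ascending order
lams, W = eigh(S_B, S_W)
w_star = W[:, -1]                            # eigenvector of the largest eigenvalue
lam_max = lams[-1]
print(lam_max, w_star)
```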
The Rayleigh quotient
case 1: Sw invertible
• simplifies to a standard eigenvalue problem
  S_W^{-1} S_B w = λ w
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
case 2: Sw not invertible
• this case is more problematic
• in fact the cost can be unbounded
• consider w = wr + wn, wr in the row space of Sw and wn in the null
space
  w^T S_W w = (w_r + w_n)^T S_W (w_r + w_n) = (w_r + w_n)^T S_W w_r = w_r^T S_W w_r
The Rayleigh quotient
and
  w^T S_B w = (w_r + w_n)^T S_B (w_r + w_n)
            = w_r^T S_B w_r + 2 w_r^T S_B w_n + w_n^T S_B w_n,   where w_n^T S_B w_n ≥ 0
hence, if there is a pair (w_r, w_n) such that w_r^T S_B w_n > 0,
• we can make the cost arbitrarily large
• by simply scaling up the null space component wn
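a small numerical illustration of this blow-up, using an illustrative singular S_W:

```python
import numpy as np

S_W = np.array([[1.0, 0.0], [0.0, 0.0]])     # rank 1: null space spanned by [0, 1]
S_B = np.array([[1.0, 1.0], [1.0, 2.0]])     # symmetric positive definite

w_r = np.array([1.0, 0.0])                   # row-space component
w_n = np.array([0.0, 1.0])                   # null-space component

for c in [1, 10, 100, 1000]:                 # scale up the null-space component
    w = w_r + c * w_n
    print(c, w @ S_B @ w, w @ S_W @ w)       # cost grows without bound, constraint stays at 1
```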
this can also be seen geometrically
The Rayleigh quotient
recall that
  w^T S_W w = K,   with S_W = Φ Λ Φ^T
are the ellipses whose
• principal components φi are the eigenvectors of S_W
[figure: constraint ellipse in the (w1, w2) plane, with axes φ1, φ2 and principal lengths 1/λ1, 1/λ2]
• principal lengths are 1/λi
when the eigenvalues go to zero, the ellipses blow up
consider the picture of the optimization problem
  max_w  w^T S_B w   subject to   w^T S_W w = K
The Rayleigh quotient
  max_w  w^T S_B w   subject to   w^T S_W w = K
[figure: cost ellipses w^T S_B w and constraint ellipse w^T S_W w = K in the (w1, w2) plane, touching at the optimum w*]
the optimal solution is where the outer red ellipse (cost)
touches the blue ellipse (constraint)
• in this example, as λ1 goes to 0, ||w*|| and the cost go to infinity
The Rayleigh quotient
how do we avoid this problem?
• we introduce another constraint
  max_w  w^T S_B w   subject to   w^T S_W w = K,   w^T w = L
[figure: constraint ellipse w^T S_W w = K intersected with the circle w^T w = L in the (w1, w2) plane; the intersection points are marked]
• restricts the set of possible solutions to these points (surfaces in
high dimensional case)
The Rayleigh quotient
the Lagrangian is now
  L = w^T S_B w − λ (w^T S_W w − K) − β (w^T w − L)
and the solution satisfies
  ∇_w L = 2 (S_B − λ S_W − β I) w = 0
or
  (S_B − λ [S_W + γI]) w = 0,   γ = β / λ
but this is exactly the solution of the original problem with
Sw+γI instead of Sw
  max_w  w^T S_B w   subject to   w^T [S_W + γI] w = K
The Rayleigh quotient
adding the constraint is equivalent to maximizing the regularized Rayleigh quotient
  J(w) = (w^T S_B w) / (w^T [S_W + γI] w),   S_B, S_W symmetric positive semidefinite
what does this accomplish?
• note that
  S_W = Φ Λ Φ^T  ⇒  S_W + γI = Φ Λ Φ^T + γ Φ I Φ^T = Φ [Λ + γI] Φ^T
• this makes all eigenvalues positive
• the matrix is now invertible
The Rayleigh quotient
in summary
  max_w  (w^T S_B w) / (w^T S_W w),   S_B, S_W symmetric positive semidefinite
1) Sw invertible
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• the max value is λK, where λ is the largest eigenvalue
2) Sw not invertible
• regularize: S_W → S_W + γI
• w* is the eigenvector of largest eigenvalue of [S_W + γI]^{-1} S_B
• the max value is λK, where λ is the largest eigenvalue
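a minimal sketch covering both cases of this summary; the function name and the value of γ are illustrative, and scipy is assumed to be available:

```python
import numpy as np
from scipy.linalg import eigh

def rayleigh_max(S_B, S_W, gamma=0.0):
    """Direction maximizing w^T S_B w / w^T (S_W + gamma I) w.

    Case 1 (S_W invertible): use gamma = 0.
    Case 2 (S_W singular): pass a small gamma > 0 to regularize.
    """
    S_W_reg = S_W + gamma * np.eye(S_W.shape[0])
    lams, W = eigh(S_B, S_W_reg)             # generalized symmetric eigenproblem
    return lams[-1], W[:, -1]                # largest eigenvalue and its eigenvector

# illustrative use: a singular within class scatter needs regularization
S_W = np.array([[1.0, 0.0], [0.0, 0.0]])
S_B = np.array([[1.0, 1.0], [1.0, 2.0]])
lam, w_star = rayleigh_max(S_B, S_W, gamma=1e-3)
print(lam, w_star)
```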
Regularized discriminant analysis
back to LDA:
• when the within class scatter matrix is non-invertible, instead of
  J(w) = (w^T S_B w) / (w^T S_W w)
with
  S_B = (µ1 − µ0)(µ1 − µ0)^T   (between class scatter)
  S_W = Σ1 + Σ0                (within class scatter)
• we use
  J(w) = (w^T S_B w) / (w^T S_W w)
with
  S_B = (µ1 − µ0)(µ1 − µ0)^T   (between class scatter)
  S_W = Σ1 + Σ0 + γI           (regularized within class scatter)
Regularized discriminant analysis
this is called regularized discriminant analysis (RDA)
noting that
  S_W = Σ1 + Σ0 + γI = (Σ1 + γ1 I) + (Σ0 + γ0 I),   γ1 + γ0 = γ
this can also be seen as regularizing each covariance
matrix individually
the regularization parameters γi are determined by cross-validation
• more on this later
• basically means that we try several possibilities and keep the
best
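a hedged sketch of this procedure; the hold-out split, the γ grid, and the nearest-class-mean rule used to score each candidate are illustrative choices, not a prescribed protocol:

```python
import numpy as np

rng = np.random.default_rng(1)
# illustrative two-class data with one nearly degenerate dimension
X0 = rng.multivariate_normal([0, 0, 0], np.diag([1.0, 1.0, 1e-8]), size=100)
X1 = rng.multivariate_normal([1, 1, 0], np.diag([1.0, 1.0, 1e-8]), size=100)

def rda_direction(X0, X1, gamma):
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False) + gamma * np.eye(X0.shape[1])
    return np.linalg.solve(S_W, mu1 - mu0)

def val_accuracy(w, X0_tr, X1_tr, X0_val, X1_val):
    thr = 0.5 * (X0_tr.mean(axis=0) @ w + X1_tr.mean(axis=0) @ w)   # nearest class mean on z
    return 0.5 * (np.mean(X0_val @ w < thr) + np.mean(X1_val @ w > thr))

X0_tr, X0_val = X0[:70], X0[70:]
X1_tr, X1_val = X1[:70], X1[70:]
gammas = [1e-6, 1e-3, 1e-1, 1.0]
best = max(gammas, key=lambda g: val_accuracy(rda_direction(X0_tr, X1_tr, g),
                                              X0_tr, X1_tr, X0_val, X1_val))
print("selected gamma:", best)
```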
Principal component analysis
back to PCA: given X with one example per column
• 1) create the centered data-matrix
  X_c^T = (I − (1/n) 1 1^T) X^T
this has one point per row:
  X_c^T = [ x_1^T − µ^T ; … ; x_n^T − µ^T ]
• note that the projection of all points on principal component φ is
  z = X_c^T φ,   i.e.   [ z_1 ; … ; z_n ] = [ (x_1 − µ)^T φ ; … ; (x_n − µ)^T φ ]
Principal component analysis
and, since
  (1/n) ∑_i z_i = (1/n) ∑_i (x_i − µ)^T φ = ( (1/n) ∑_i x_i − µ )^T φ = 0
the sample variance of Z is given by its norm
  var(z) = (1/n) ∑_i z_i² = (1/n) ‖z‖²
recall that PCA looks for the largest variance component
  max_φ ‖z‖² = max_φ ‖X_c^T φ‖² = max_φ (X_c^T φ)^T (X_c^T φ) = max_φ φ^T X_c X_c^T φ
Principal component analysis
recall that the sample covariance is
  Σ = (1/n) ∑_i (x_i − µ)(x_i − µ)^T = (1/n) ∑_i x_i^c (x_i^c)^T
where xic is the ith column of Xc
this can be written as
  Σ = (1/n) [ x_1^c … x_n^c ] [ (x_1^c)^T ; … ; (x_n^c)^T ] = (1/n) X_c X_c^T
Principal component analysis
hence the PCA problem is
  max_φ φ^T X_c X_c^T φ = max_φ φ^T Σ φ   (up to the irrelevant factor 1/n)
as in LDA, this can be made arbitrarily large by simply scaling φ
to normalize we constrain φ to have unit norm
  max_φ φ^T Σ φ   subject to   ‖φ‖ = 1
which is equivalent to
  max_φ  (φ^T Σ φ) / (φ^T φ)
shows that PCA = maximization of a Rayleigh quotient
Principal component analysis
in this case
  max_w  (w^T S_B w) / (w^T S_W w),   S_B, S_W symmetric positive semidefinite
with SB = Σ and Sw = I
Sw is clearly invertible
• no regularization problems
• w* is the eigenvector of largest eigenvalue of S_W^{-1} S_B
• this is just the largest eigenvector of the covariance Σ
• the max value is λ, where λ is the largest eigenvalue
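a small numerical check of this special case on illustrative data: the top eigenvector of Σ attains the maximum of the quotient:

```python
import numpy as np

rng = np.random.default_rng(2)
Xc = rng.standard_normal((3, 200)) * np.array([[3.0], [1.0], [0.2]])
Xc = Xc - Xc.mean(axis=1, keepdims=True)      # centered, one example per column
Sigma = Xc @ Xc.T / Xc.shape[1]               # sample covariance

evals, evecs = np.linalg.eigh(Sigma)          # ascending eigenvalues
phi = evecs[:, -1]                            # largest eigenvector of Sigma

quotient = lambda v: (v @ Sigma @ v) / (v @ v)
print(quotient(phi), evals[-1])               # the quotient equals the largest eigenvalue
```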
The Rayleigh quotient dual
let’s assume, for a moment, that the solution is of the
form
  w = X_c α
• i.e. a linear combination of the centered datapoints
hence the problem is equivalent to
  max_α  (α^T X_c^T S_B X_c α) / (α^T X_c^T S_W X_c α)
this does not change its form; the solution is
• α* is the eigenvector of largest eigenvalue of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c
• the max value is λK, where λ is the largest eigenvalue
The Rayleigh quotient dual
for PCA
• S_W = I and S_B = Σ = (1/n) X_c X_c^T
• the solution satisfies
  S_B w = λ S_W w  ⇔  (1/n) X_c X_c^T w = λ w  ⇔  w = X_c [ (1/(nλ)) X_c^T w ] = X_c α
• and, therefore, we have
  • w* is an eigenvector of S_B, i.e. of X_c X_c^T
  • α* is an eigenvector of (X_c^T S_W X_c)^{-1} X_c^T S_B X_c = (1/n) (X_c^T X_c)^{-1} X_c^T X_c X_c^T X_c = (1/n) X_c^T X_c, i.e. of X_c^T X_c
• i.e. we have two alternative manners in which to compute PCA
Principal component analysis
primal
• assemble matrix Σ = X_c X_c^T
• compute eigenvectors φi
• these are the principal components

dual
• assemble matrix K = X_c^T X_c
• compute eigenvectors αi
• the principal components are φi = X_c αi
in both cases we have an eigenvalue problem
• primal: on the sum of outer products
  Σ = ∑_i x_i^c (x_i^c)^T
• dual: on the matrix of inner products
  K_ij = (x_i^c)^T x_j^c
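a short sketch contrasting the two routes on illustrative data; the recovered first principal components agree up to sign and scale:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 50, 20                                 # more dimensions than examples
X = rng.standard_normal((d, n))
Xc = X - X.mean(axis=1, keepdims=True)        # one centered example per column

# primal: eigenvectors of the d x d outer-product matrix Xc Xc^T
_, phi = np.linalg.eigh(Xc @ Xc.T)
phi1 = phi[:, -1]

# dual: eigenvectors of the n x n inner-product (Gram) matrix Xc^T Xc,
# then map back through phi_i = Xc alpha_i
_, alpha = np.linalg.eigh(Xc.T @ Xc)
phi1_dual = Xc @ alpha[:, -1]
phi1_dual /= np.linalg.norm(phi1_dual)

print(np.allclose(abs(phi1 @ phi1_dual), 1.0))   # True: same direction
```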
The Rayleigh quotient
this is a property that holds for many Rayleigh quotient problems
• the primal solution is a linear combination of datapoints
• the dual solution only depends on dot-products of the datapoints
whenever both of these hold
• the problem can be kernelized
• this has various interesting properties
• we will talk about them
many examples
• kernel PCA, kernel LDA, manifold learning, etc.