INF 5300: Lecture 4
Principal components and linear discriminants
Asbjørn Berge
Department of Informatics
University of Oslo
19 February 2004
1. Linear algebra and dimension reduction
2. PCA
3. LDA
Vector spaces
A set of vectors u_1, u_2, ..., u_n is said to form a basis for a vector space if any arbitrary vector x can be represented by a linear combination x = a_1 u_1 + a_2 u_2 + ... + a_n u_n.
The coefficients a_1, a_2, ..., a_n are called the components of the vector x with respect to the basis {u_i}.
In order to form a basis, it is necessary and sufficient that the vectors u_i be linearly independent.
A basis {u_i} is said to be orthogonal if u_i^T u_j ≠ 0 for i = j and u_i^T u_j = 0 for i ≠ j.
A basis {u_i} is said to be orthonormal if u_i^T u_j = 1 for i = j and u_i^T u_j = 0 for i ≠ j.
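As a concrete illustration (not from the original slides), the following small numpy sketch builds an orthonormal basis for R^3 and checks the orthonormality condition on u_i^T u_j; the matrix and vector values are arbitrary.

import numpy as np

# An arbitrary 3x3 matrix; QR factorisation gives an orthonormal basis
# of R^3 as the columns of U.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
U, _ = np.linalg.qr(A)

# Orthonormality: u_i^T u_j = 1 for i = j and 0 for i != j,
# i.e. U^T U is the identity matrix.
print(np.allclose(U.T @ U, np.eye(3)))   # True

# Components of an arbitrary vector x with respect to the basis:
x = np.array([1.0, 2.0, 3.0])
a = U.T @ x                      # a_i = u_i^T x
print(np.allclose(U @ a, x))     # x is recovered from its components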
Linear transformation
A linear transformation is a mapping from a vector space X^N onto a vector space Y^M, and is represented by a matrix.
Given a vector x ∈ X^N, the corresponding vector y ∈ Y^M is

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
Eigenvectors and eigenvalues
Given a square matrix A (N×N), a non-zero vector v is an eigenvector of A if there exists a scalar λ (the corresponding eigenvalue) such that Av = λv.
Av = λv ⇒ (A − λI)v = 0 ⇒ |A − λI| = 0 ⇒ λ^N + a_1 λ^{N−1} + ... + a_{N−1} λ + a_0 = 0   (the characteristic equation)

The zeroes of the characteristic equation are the eigenvalues of A.
A is non-singular ⇔ all eigenvalues are non-zero.
If A is real and symmetric, all eigenvalues are real and the eigenvectors can be chosen orthogonal.
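A minimal numpy sketch of these properties (the matrix below is an arbitrary illustrative choice, not from the lecture); np.linalg.eigh is specialised for real symmetric matrices and returns real eigenvalues with orthonormal eigenvectors.

import numpy as np

# A small real symmetric matrix.
A = np.array([[4.0, 1.0],
              [1.0, 3.0]])

# eigh returns real eigenvalues (ascending) and orthonormal eigenvectors (columns).
eigvals, eigvecs = np.linalg.eigh(A)

for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))       # A v = lambda v holds for each pair

# Eigenvectors of a real symmetric matrix are orthogonal:
print(np.allclose(eigvecs.T @ eigvecs, np.eye(2)))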
Interpretation of eigenvectors and eigenvalues
The eigenvectors of the covariance matrix Σ correspond to
the principal axes of equiprobability ellipses!
The linear transformation defined by the eigenvectors of Σ
leads to vectors that are uncorrelated regardless of the form of
the distribution
If the distribution happens to be Gaussian, then the
transformed vectors will be statistically independent
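A small sketch of the decorrelation property, using made-up Gaussian parameters: projecting centred data onto the eigenvectors of its sample covariance matrix yields components with an approximately diagonal covariance.

import numpy as np

rng = np.random.default_rng(1)

# Correlated 2D Gaussian samples (parameters are illustrative only).
Sigma = np.array([[3.0, 1.5],
                  [1.5, 2.0]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=Sigma, size=5000)

# Eigenvectors of the sample covariance define the principal axes.
S = np.cov(X, rowvar=False)
eigvals, Phi = np.linalg.eigh(S)

# Project the centred data onto the eigenvectors.
Y = (X - X.mean(axis=0)) @ Phi

# The transformed components are (numerically) uncorrelated:
print(np.round(np.cov(Y, rowvar=False), 3))   # approximately diagonal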
Dimensionality reduction
Feature extraction can be stated as: given a feature space x_i ∈ R^n, find an optimal mapping y = f(x): R^n → R^m with m < n.
An optimal mapping for classification is one where the transformed feature vector y yields the same classification rate as x.
The optimal mapping may be a non-linear function, but non-linear transforms are difficult to generate and optimize.
Feature extraction is therefore usually limited to linear transforms y = A^T x:

\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_m \end{pmatrix} =
\begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix}
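As a toy illustration of such a linear transform (the matrix A below is arbitrary rather than a principled choice; PCA and LDA later in the lecture are two ways of choosing it):

import numpy as np

# Hypothetical linear feature extraction: map x in R^4 to y in R^2 with y = A^T x.
A = np.array([[ 1.0,  0.0],
              [ 0.5, -0.5],
              [ 0.0,  1.0],
              [-0.5,  0.5]])       # shape (n, m) = (4, 2)

x = np.array([2.0, -1.0, 0.5, 3.0])
y = A.T @ x                        # shape (m,) = (2,)
print(y)

# The same transform applied to a whole data matrix X (one sample per row):
X = np.vstack([x, 2 * x, -x])
Y = X @ A                          # each row is y_i = A^T x_i
print(Y.shape)                     # (3, 2)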
Signal representation vs classification
The search for the feature extraction mapping y = f (x) is
guided by an objective function we want to maximize.
In general we have two categories of objectives in feature
extraction:
Signal representation: Accurately approximate the samples in a
lower-dimensional space.
Classification: Keep or enhance class-discriminatory
information in a lower-dimensional space.
Principal components analysis (PCA)
- signal representation, unsupervised
Linear discriminant analysis (LDA)
- classification, supervised
PCA - Principal components analysis
Reduce dimension while preserving signal variance ("randomness").
Represent x as a linear combination of orthonormal basis vectors [ϕ_1 ⊥ ϕ_2 ⊥ ... ⊥ ϕ_n]:

x = \sum_{i=1}^{n} y_i ϕ_i

Approximate x with only m < n basis vectors. This can be done by replacing the components [y_{m+1}, ..., y_n]^T with some pre-selected constants b_i:

x̂(m) = \sum_{i=1}^{m} y_i ϕ_i + \sum_{i=m+1}^{n} b_i ϕ_i
The approximation error is then

Δx(m) = x − x̂(m)
      = \sum_{i=1}^{n} y_i ϕ_i − \left( \sum_{i=1}^{m} y_i ϕ_i + \sum_{i=m+1}^{n} b_i ϕ_i \right)
      = \sum_{i=m+1}^{n} (y_i − b_i) ϕ_i

To measure the representation error we use the mean squared error:

MSE(m) = E[ |Δx(m)|² ]
       = E[ \sum_{i=m+1}^{n} \sum_{j=m+1}^{n} (y_i − b_i)(y_j − b_j) ϕ_i^T ϕ_j ]
       = \sum_{i=m+1}^{n} E[ (y_i − b_i)² ]
The optimal values of b_i can be found by taking the partial derivatives of the approximation error MSE(m):

∂/∂b_i E[ (y_i − b_i)² ] = −2 (E[y_i] − b_i) = 0 ⇒ b_i = E[y_i]

We replace the discarded dimensions by their expected values. (This also feels intuitively correct.)
The MSE can now be written as

MSE(m) = \sum_{i=m+1}^{n} E[ (y_i − E[y_i])² ]
       = \sum_{i=m+1}^{n} E[ (ϕ_i^T x − E[ϕ_i^T x])² ]
       = \sum_{i=m+1}^{n} ϕ_i^T E[ (x − E[x])(x − E[x])^T ] ϕ_i
       = \sum_{i=m+1}^{n} ϕ_i^T Σ_x ϕ_i

where Σ_x is the covariance matrix of x.
The orthonormality constraint on the ϕ_i can be incorporated in the optimization using Lagrange multipliers λ_i:

MSE(m) = \sum_{i=m+1}^{n} ϕ_i^T Σ_x ϕ_i + \sum_{i=m+1}^{n} λ_i (1 − ϕ_i^T ϕ_i)

Thus, we can find the optimal ϕ_i by setting the partial derivative to zero:

∂/∂ϕ_i MSE(m) = ∂/∂ϕ_i \left[ \sum_{i=m+1}^{n} ϕ_i^T Σ_x ϕ_i + \sum_{i=m+1}^{n} λ_i (1 − ϕ_i^T ϕ_i) \right] = 2(Σ_x ϕ_i − λ_i ϕ_i) = 0 ⇒ Σ_x ϕ_i = λ_i ϕ_i

The optimal ϕ_i and λ_i are therefore eigenvectors and eigenvalues of the covariance matrix Σ_x.
Note that this also implies

MSE(m) = \sum_{i=m+1}^{n} ϕ_i^T Σ_x ϕ_i = \sum_{i=m+1}^{n} ϕ_i^T λ_i ϕ_i = \sum_{i=m+1}^{n} λ_i

Thus, in order to minimize MSE(m), the discarded λ_i will have to be the smallest eigenvalues!
PCA dimension reduction
The optimal (in the sense of minimum sum of squared approximation errors) approximation of a random vector x ∈ R^n by a linear combination of m < n independent vectors is obtained by projecting the random vector x onto the eigenvectors ϕ_i corresponding to the m largest eigenvalues λ_i of the covariance matrix Σ_x.
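A minimal numpy sketch of this result, assuming the data are stored one sample per row; the Gaussian parameters and the helper name pca_project are illustrative only, not from the lecture. It also checks the earlier identity that the mean squared reconstruction error is (approximately, for sample estimates) the sum of the discarded eigenvalues.

import numpy as np

def pca_project(X, m):
    """Project row-wise data X onto the m eigenvectors of its sample
    covariance matrix with the largest eigenvalues (sketch of the result above)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(Sigma)      # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # sort descending
    Phi = eigvecs[:, order[:m]]                   # top-m eigenvectors
    Y = (X - mu) @ Phi                            # principal components y_i
    X_hat = Y @ Phi.T + mu                        # reconstruction x_hat(m)
    return Y, X_hat, eigvals[order]

# Illustrative data: 1000 samples from an arbitrary 3D Gaussian.
rng = np.random.default_rng(2)
C = np.array([[6.0, 2.0, 1.0],
              [2.0, 3.0, 0.5],
              [1.0, 0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(3), C, size=1000)

Y, X_hat, eigvals = pca_project(X, m=2)
mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))
print(mse, eigvals[2:].sum())    # MSE(m) is close to the sum of discarded eigenvalues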
PCA uses the eigenvectors of the covariance matrix Σx , and
thus is able to find the independent axes of the data under the
unimodal Gaussian assumption.
For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes.
The main limitation of PCA is that it is unsupervised: it does not consider class separability.
PCA is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance, and there is no guarantee that these directions are good features for discrimination.
PCA example
A 3D Gaussian with parameters

µ = [0 5 2]^T,   Σ = \begin{pmatrix} 25 & −1 & 7 \\ −1 & 4 & −4 \\ 7 & −4 & 10 \end{pmatrix}
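A small sketch that reproduces this example numerically (the sample size and random seed are arbitrary): draw samples from N(µ, Σ) with the stated parameters and inspect the eigen-decomposition of the sample covariance.

import numpy as np

# Parameters from the example above.
mu = np.array([0.0, 5.0, 2.0])
Sigma = np.array([[25.0, -1.0,  7.0],
                  [-1.0,  4.0, -4.0],
                  [ 7.0, -4.0, 10.0]])

rng = np.random.default_rng(3)
X = rng.multivariate_normal(mu, Sigma, size=10000)

# Sample covariance and its eigen-decomposition.
S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]

print(np.round(eigvals[order], 2))        # estimated principal-axis variances
print(np.round(eigvecs[:, order], 2))     # estimated principal directions

# Fraction of total variance captured by the first principal component:
print(eigvals[order][0] / eigvals.sum())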
Linear Discriminant Analysis (LDA)
Goal:
Reduce dimension while preserving class discriminatory
information.
Strategy (2 classes):
We have a set of samples {x_1, x_2, ..., x_n}, where n_1 samples belong to class ω_1 and the remaining n_2 to class ω_2. Obtain a scalar value by projecting x onto a line: y = w^T x.
Select the line (i.e. the direction w) that maximizes the separability of the classes.
To find a good projection vector, we need to define a measure of separation between the projections, J(w).
The mean vectors of each class in the original space and in the projected space are

µ_i = \frac{1}{n_i} \sum_{x ∈ ω_i} x   and   µ̃_i = \frac{1}{n_i} \sum_{y ∈ ω_i} y = \frac{1}{n_i} \sum_{x ∈ ω_i} w^T x = w^T µ_i

A naive choice of objective would be the projected mean difference, J(w) = |µ̃_1 − µ̃_2|.
Fisher's solution: maximize a function that represents the difference between the means, scaled by a measure of the within-class scatter.
Define the classwise scatter (the equivalent of variance)

s̃_i² = \sum_{y ∈ ω_i} (y − µ̃_i)²

s̃_1² + s̃_2² is the within-class scatter.
Fisher's criterion is then

J(w) = \frac{|µ̃_1 − µ̃_2|²}{s̃_1² + s̃_2²}

We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible.
To optimize w we need J(w) to be an explicit function of w.
Redefine the scatter in matrix form:

S_i = \sum_{x ∈ ω_i} (x − µ_i)(x − µ_i)^T,   S_1 + S_2 = S_W

where S_W is the within-class scatter matrix.
Remember the scatter of the projection y:

s̃_i² = \sum_{y ∈ ω_i} (y − µ̃_i)² = \sum_{x ∈ ω_i} (w^T x − w^T µ_i)² = \sum_{x ∈ ω_i} w^T (x − µ_i)(x − µ_i)^T w = w^T S_i w,   and thus   s̃_1² + s̃_2² = w^T S_W w

The projected mean difference can be expressed in terms of the original means:

(µ̃_1 − µ̃_2)² = (w^T µ_1 − w^T µ_2)² = w^T (µ_1 − µ_2)(µ_1 − µ_2)^T w = w^T S_B w

S_B is called the between-class scatter matrix.
The Fisher criterion in terms of S_W and S_B is

J(w) = \frac{w^T S_B w}{w^T S_W w}
To find the optimal w, differentiate J(w) and equate to zero:

\frac{d}{dw} [J(w)] = \frac{d}{dw} \left[ \frac{w^T S_B w}{w^T S_W w} \right] = 0
⇒ [w^T S_W w] \frac{d[w^T S_B w]}{dw} − [w^T S_B w] \frac{d[w^T S_W w]}{dw} = 0
⇒ [w^T S_W w] 2 S_B w − [w^T S_B w] 2 S_W w = 0

Dividing by w^T S_W w gives

S_B w − J(w) S_W w = 0 ⇒ S_W^{−1} S_B w = J(w) w

Since S_B w is always in the direction of (µ_1 − µ_2), this eigenvalue problem has the solution w* = S_W^{−1} (µ_1 − µ_2).
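A minimal numpy sketch of the two-class solution above; the class data are synthetic and the helper name fisher_lda_2class is ours, not from the lecture.

import numpy as np

def fisher_lda_2class(X1, X2):
    """Fisher's discriminant direction w* = S_W^{-1}(mu_1 - mu_2) for two
    classes given as row-wise sample matrices (sketch of the result above)."""
    mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter matrices S_i = sum_x (x - mu_i)(x - mu_i)^T
    S1 = (X1 - mu1).T @ (X1 - mu1)
    S2 = (X2 - mu2).T @ (X2 - mu2)
    Sw = S1 + S2
    w = np.linalg.solve(Sw, mu1 - mu2)    # solves S_W w = (mu_1 - mu_2)
    return w / np.linalg.norm(w)

# Illustrative data: two Gaussian classes with different means.
rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.5], [0.5, 1.0]], size=200)
X2 = rng.multivariate_normal([3.0, 2.0], [[2.0, 0.5], [0.5, 1.0]], size=200)

w = fisher_lda_2class(X1, X2)
y1, y2 = X1 @ w, X2 @ w                          # projections y = w^T x
s1 = np.sum((y1 - y1.mean()) ** 2)               # projected within-class scatter
s2 = np.sum((y2 - y2.mean()) ** 2)
J = (y1.mean() - y2.mean()) ** 2 / (s1 + s2)     # Fisher criterion J(w)
print(J)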
LDA generalizes easily to C classes.
Instead of one projection y, we seek C − 1 projections [y_1, y_2, ..., y_{C−1}] obtained from C − 1 projection vectors W = [w_1, w_2, ..., w_{C−1}]: y = W^T x.
The generalization of the within-class scatter is

S_W = \sum_{i=1}^{C} S_i,   S_i = \sum_{x ∈ ω_i} (x − µ_i)(x − µ_i)^T,   µ_i = \frac{1}{n_i} \sum_{x ∈ ω_i} x

The generalization of the between-class scatter is

S_B = \sum_{i=1}^{C} n_i (µ_i − µ)(µ_i − µ)^T,   µ = \frac{1}{n} \sum_{∀x} x
Similar to the 2-class case, the mean vectors and scatter matrices for the projected samples can be expressed as

µ̃_i = \frac{1}{n_i} \sum_{y ∈ ω_i} y,   S̃_W = \sum_{i=1}^{C} \sum_{y ∈ ω_i} (y − µ̃_i)(y − µ̃_i)^T,   S̃_B = \sum_{i=1}^{C} n_i (µ̃_i − µ̃)(µ̃_i − µ̃)^T

and thus

S̃_W = W^T S_W W,   S̃_B = W^T S_B W

We want a scalar objective function, and use the ratio of matrix determinants:

J(W) = \frac{|S̃_B|}{|S̃_W|} = \frac{|W^T S_B W|}{|W^T S_W W|}

Another variant of the scalar criterion is the ratio of the traces of the projected scatter matrices, i.e. J(W) = tr(S̃_B)/tr(S̃_W).
The matrix W* that maximizes this ratio can be shown to be composed of the eigenvectors corresponding to the largest eigenvalues of the eigenvalue problem (S_B − λ_i S_W) w_i* = 0.
S_B is the sum of C matrices of rank one or less, and the class means are tied together through the overall mean µ = \frac{1}{n} \sum_{i=1}^{C} n_i µ_i, so S_B will have rank C − 1 or less.
Only C − 1 of the eigenvalues will be non-zero, so we can only find C − 1 projection vectors.
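A sketch of the multi-class construction, assuming row-wise data and integer class labels; the function name lda_projections and the test data are illustrative only.

import numpy as np

def lda_projections(X, labels):
    """Multi-class LDA: return the C-1 projection vectors W solving
    S_W^{-1} S_B w = lambda w (sketch of the construction above)."""
    classes = np.unique(labels)
    n, d = X.shape
    mu = X.mean(axis=0)
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in classes:
        Xc = X[labels == c]
        mu_c = Xc.mean(axis=0)
        Sw += (Xc - mu_c).T @ (Xc - mu_c)                    # within-class scatter
        Sb += len(Xc) * np.outer(mu_c - mu, mu_c - mu)       # between-class scatter
    # Eigenvectors of S_W^{-1} S_B; only C-1 eigenvalues are non-zero.
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:len(classes) - 1]]

# Illustrative data: three 4D Gaussian classes with different means.
rng = np.random.default_rng(5)
means = [np.zeros(4), np.array([3.0, 0, 0, 0]), np.array([0, 3.0, 0, 0])]
X = np.vstack([rng.normal(m, 1.0, size=(100, 4)) for m in means])
labels = np.repeat([0, 1, 2], 100)

W = lda_projections(X, labels)       # shape (4, 2): C - 1 = 2 projections
Y = X @ W                            # projected samples y = W^T x
print(W.shape, Y.shape)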
Limitations of LDA
LDA produces at most C − 1 feature projections
LDA is parametric, since it assumes unimodal Gaussian likelihoods.
LDA will fail when the discriminatory information is not in the
mean but in the variance of the data.