LECTURE 10: Linear Discriminant Analysis

- Linear Discriminant Analysis, two classes
- Linear Discriminant Analysis, C classes
- LDA vs. PCA example
- Limitations of LDA
- Variants of LDA
- Other dimensionality reduction methods
Linear Discriminant Analysis, two-classes (1)
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible
  - Assume we have a set of D-dimensional samples $\{x^{(1)}, x^{(2)}, \ldots, x^{(N)}\}$, $N_1$ of which belong to class $\omega_1$ and $N_2$ to class $\omega_2$. We seek to obtain a scalar $y$ by projecting the samples $x$ onto a line
    $$y = w^T x$$
  - Of all the possible lines we would like to select the one that maximizes the separability of the scalars
- This is illustrated for the two-dimensional case in the figures below
  [Figure: samples in the $(x_1, x_2)$ plane projected onto two candidate lines]
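As a concrete illustration of the projection $y = w^T x$, the following NumPy sketch projects two classes of 2-D samples onto a candidate direction. The data and the direction `w` are assumptions made up for illustration; they are not taken from the lecture.

```python
import numpy as np

# Illustrative sketch (assumed data, not from the lecture): project D = 2 dimensional
# samples onto a line, producing one scalar y = w^T x per sample.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))  # hypothetical class omega_1
X2 = rng.normal(loc=[4.0, 3.0], scale=1.0, size=(50, 2))  # hypothetical class omega_2

w = np.array([1.0, 1.0])
w /= np.linalg.norm(w)        # direction of the projection line

y1 = X1 @ w                   # scalars y = w^T x for class 1
y2 = X2 @ w                   # scalars y = w^T x for class 2
print(y1.shape, y2.shape)     # (50,) (50,): one scalar per sample
```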
Linear Discriminant Analysis, two-classes (2)
- In order to find a good projection vector, we need to define a measure of separation between the projections
- The mean vector of each class in the x and y feature spaces is
  $$\mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x \qquad \text{and} \qquad \tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y = \frac{1}{N_i}\sum_{x\in\omega_i} w^T x = w^T\mu_i$$
- We could then choose the distance between the projected means as our objective function
  $$J(w) = \tilde{\mu}_1 - \tilde{\mu}_2 = w^T(\mu_1 - \mu_2)$$
- However, the distance between the projected means is not a very good measure since it does not take into account the standard deviation within the classes
  [Figure: two candidate axes in the $(x_1, x_2)$ plane with class means $\mu_1$, $\mu_2$; one axis has a larger distance between the projected means, the other yields better class separability]
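A short sketch of the quantities just defined, continuing with the same assumed data as in the previous snippet: it computes the class means, their projections, and the naive objective based only on the distance between projected means.

```python
import numpy as np

# Illustrative sketch: class means mu_i, projected means mu_i_tilde = w^T mu_i,
# and the naive objective |mu_1_tilde - mu_2_tilde| discussed above.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))  # hypothetical class omega_1
X2 = rng.normal(loc=[4.0, 3.0], scale=1.0, size=(50, 2))  # hypothetical class omega_2
w = np.array([1.0, 1.0]) / np.sqrt(2.0)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)   # class means in x-space
mu1_t, mu2_t = w @ mu1, w @ mu2               # projected means w^T mu_i

# mu_i_tilde is also the mean of the projected samples, as in the derivation above
assert np.isclose(mu1_t, (X1 @ w).mean())

print(abs(mu1_t - mu2_t))                     # distance between projected means
```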
Linear Discriminant Analysis, two-classes (3)
- The solution proposed by Fisher is to maximize a function that represents the difference between the means, normalized by a measure of the within-class scatter
  - For each class we define the scatter, an equivalent of the variance, as
    $$\tilde{s}_i^2 = \sum_{y\in\omega_i}(y - \tilde{\mu}_i)^2$$
    where the quantity $\tilde{s}_1^2 + \tilde{s}_2^2$ is called the within-class scatter of the projected examples
  - The Fisher linear discriminant is defined as the linear function $w^T x$ that maximizes the criterion function
    $$J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^2 + \tilde{s}_2^2}$$
  - Therefore, we will be looking for a projection where examples from the same class are projected very close to each other and, at the same time, the projected means are as far apart as possible
  [Figure: the two classes in the $(x_1, x_2)$ plane with means $\mu_1$, $\mu_2$ and the Fisher projection direction]
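The criterion $J(w)$ is straightforward to evaluate for any candidate direction. The sketch below (again with assumed, illustrative data rather than lecture material) compares two directions.

```python
import numpy as np

def fisher_criterion(X1, X2, w):
    """Fisher criterion J(w) = (mu1_t - mu2_t)^2 / (s1_t^2 + s2_t^2) for one direction w."""
    y1, y2 = X1 @ w, X2 @ w
    mu1_t, mu2_t = y1.mean(), y2.mean()
    s1_t2 = np.sum((y1 - mu1_t) ** 2)   # scatter of the projected class-1 samples
    s2_t2 = np.sum((y2 - mu2_t) ** 2)   # scatter of the projected class-2 samples
    return (mu1_t - mu2_t) ** 2 / (s1_t2 + s2_t2)

# Illustrative data (assumed, not from the lecture)
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[4.0, 3.0], scale=1.0, size=(50, 2))

print(fisher_criterion(X1, X2, np.array([1.0, 0.0])))   # project onto x1 only
print(fisher_criterion(X1, X2, np.array([1.0, 1.0])))   # a direction closer to mu1 - mu2
```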
Linear Discriminant Analysis, two-classes (4)
- In order to find the optimum projection w*, we need to express J(w) as an explicit function of w
- We define a measure of the scatter in the multivariate feature space x, the scatter matrices
  $$S_i = \sum_{x\in\omega_i}(x - \mu_i)(x - \mu_i)^T \qquad\qquad S_1 + S_2 = S_W$$
  where $S_W$ is called the within-class scatter matrix
- The scatter of the projection y can then be expressed as a function of the scatter matrices in feature space x
  $$\tilde{s}_i^2 = \sum_{y\in\omega_i}(y - \tilde{\mu}_i)^2 = \sum_{x\in\omega_i}\left(w^T x - w^T\mu_i\right)^2 = \sum_{x\in\omega_i} w^T(x - \mu_i)(x - \mu_i)^T w = w^T S_i w$$
  $$\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$$
- Similarly, the difference between the projected means can be expressed in terms of the means in the original feature space
  $$(\tilde{\mu}_1 - \tilde{\mu}_2)^2 = \left(w^T\mu_1 - w^T\mu_2\right)^2 = w^T\underbrace{(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T}_{S_B}\, w = w^T S_B w$$
  The matrix $S_B$ is called the between-class scatter. Note that, since $S_B$ is the outer product of two vectors, its rank is at most one
- We can finally express the Fisher criterion in terms of $S_W$ and $S_B$ as
  $$J(w) = \frac{w^T S_B w}{w^T S_W w}$$
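The same criterion written with the scatter matrices can be evaluated as below. This is a sketch with the same assumed data as before; it also checks the identity $\tilde{s}_1^2 + \tilde{s}_2^2 = w^T S_W w$ numerically.

```python
import numpy as np

# Illustrative sketch: J(w) = (w^T S_B w) / (w^T S_W w) from the scatter matrices.
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[4.0, 3.0], scale=1.0, size=(50, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - mu1).T @ (X1 - mu1)                 # class-1 scatter matrix
S2 = (X2 - mu2).T @ (X2 - mu2)                 # class-2 scatter matrix
S_W = S1 + S2                                  # within-class scatter matrix
S_B = np.outer(mu1 - mu2, mu1 - mu2)           # between-class scatter (rank 1)

w = np.array([1.0, 1.0])
# Check that s1_tilde^2 + s2_tilde^2 equals w^T S_W w
scatter_proj = sum(np.sum((X @ w - (X @ w).mean()) ** 2) for X in (X1, X2))
assert np.isclose(scatter_proj, w @ S_W @ w)

print((w @ S_B @ w) / (w @ S_W @ w))           # J(w) in matrix form
```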
Linear Discriminant Analysis, two-classes (5)
- To find the maximum of J(w) we differentiate and equate to zero
  $$\frac{d}{dw}\big[J(w)\big] = \frac{d}{dw}\left[\frac{w^T S_B w}{w^T S_W w}\right] = 0 \;\Rightarrow\; \big[w^T S_W w\big]\frac{d\big[w^T S_B w\big]}{dw} - \big[w^T S_B w\big]\frac{d\big[w^T S_W w\big]}{dw} = 0 \;\Rightarrow\; \big[w^T S_W w\big]\, 2 S_B w - \big[w^T S_B w\big]\, 2 S_W w = 0$$
- Dividing by $w^T S_W w$
  $$\left[\frac{w^T S_W w}{w^T S_W w}\right] S_B w - \left[\frac{w^T S_B w}{w^T S_W w}\right] S_W w = 0 \;\Rightarrow\; S_B w - J\, S_W w = 0 \;\Rightarrow\; S_W^{-1} S_B w - J\, w = 0$$
- Solving the generalized eigenvalue problem ($S_W^{-1} S_B\, w = J\, w$) yields
  $$w^* = \arg\max_w \left[\frac{w^T S_B w}{w^T S_W w}\right] = S_W^{-1}(\mu_1 - \mu_2)$$
- This is known as Fisher's Linear Discriminant (1936), although it is not a discriminant but rather a specific choice of direction for the projection of the data down to one dimension
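A minimal sketch of the closed-form solution $w^* = S_W^{-1}(\mu_1 - \mu_2)$, with the same assumed data; `np.linalg.solve` is used rather than forming the inverse explicitly.

```python
import numpy as np

# Illustrative sketch: Fisher's discriminant direction w* = S_W^{-1}(mu_1 - mu_2).
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(50, 2))
X2 = rng.normal(loc=[4.0, 3.0], scale=1.0, size=(50, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

w_star = np.linalg.solve(S_W, mu1 - mu2)   # solve S_W w = (mu_1 - mu_2)
w_star /= np.linalg.norm(w_star)           # only the direction matters, not the scale
print(w_star)
```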
LDA example
- Compute the Linear Discriminant projection for the following two-dimensional dataset
  - $X_1 = (x_1, x_2) = \{(4,1), (2,4), (2,3), (3,6), (4,4)\}$
  - $X_2 = (x_1, x_2) = \{(9,10), (6,8), (9,5), (8,7), (10,8)\}$
- SOLUTION (by hand)
  - The class statistics are
    $$S_1 = \begin{bmatrix} 0.80 & -0.40 \\ -0.40 & 2.64 \end{bmatrix}; \qquad S_2 = \begin{bmatrix} 1.84 & -0.04 \\ -0.04 & 2.64 \end{bmatrix}$$
    $$\mu_1 = [3.00 \;\; 3.60]; \qquad \mu_2 = [8.40 \;\; 7.60]$$
  - The within- and between-class scatter matrices are
    $$S_W = \begin{bmatrix} 2.64 & -0.44 \\ -0.44 & 5.28 \end{bmatrix}; \qquad S_B = \begin{bmatrix} 29.16 & 21.60 \\ 21.60 & 16.00 \end{bmatrix}$$
  [Figure: scatter plot of the two classes in the $(x_1, x_2)$ plane with the resulting projection direction $w_{LDA}$]
- The LDA projection is then obtained as the solution of the generalized eigenvalue problem
  $$S_W^{-1} S_B v = \lambda v \;\Rightarrow\; \left|S_W^{-1} S_B - \lambda I\right| = 0 \;\Rightarrow\; \begin{vmatrix} 11.89-\lambda & 8.81 \\ 5.08 & 3.76-\lambda \end{vmatrix} = 0 \;\Rightarrow\; \lambda = 15.65$$
  $$\begin{bmatrix} 11.89 & 8.81 \\ 5.08 & 3.76 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = 15.65\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} \;\Rightarrow\; \begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0.91 \\ 0.39 \end{bmatrix}$$
- Or directly by
  $$w^* = S_W^{-1}(\mu_1 - \mu_2) = [-0.91 \;\; -0.39]^T$$
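A sketch that redoes the hand computation above in NumPy, using the dataset given on this slide; the printed values should match the slide's numbers up to rounding and an overall sign/scale of the eigenvector.

```python
import numpy as np

# Reproduce the worked example: class statistics, scatter matrices, and the LDA direction.
X1 = np.array([[4, 1], [2, 4], [2, 3], [3, 6], [4, 4]], dtype=float)
X2 = np.array([[9, 10], [6, 8], [9, 5], [8, 7], [10, 8]], dtype=float)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)        # [3.0, 3.6] and [8.4, 7.6]
S1 = (X1 - mu1).T @ (X1 - mu1) / len(X1)           # covariance-style scatter (1/N_i), as on the slide
S2 = (X2 - mu2).T @ (X2 - mu2) / len(X2)
S_W = S1 + S2                                      # [[2.64, -0.44], [-0.44, 5.28]]
S_B = np.outer(mu1 - mu2, mu1 - mu2)               # [[29.16, 21.60], [21.60, 16.00]]

# Generalized eigenvalue problem S_W^{-1} S_B v = lambda v
eigvals, eigvecs = np.linalg.eig(np.linalg.inv(S_W) @ S_B)
v = eigvecs[:, np.argmax(eigvals)]
print(eigvals.max())                               # ~15.65
print(v / np.linalg.norm(v))                       # ~[0.92, 0.39], up to sign

# Or directly: w* = S_W^{-1}(mu_1 - mu_2), the same direction up to scale and sign
w_star = np.linalg.solve(S_W, mu1 - mu2)
print(w_star / np.linalg.norm(w_star))             # ~[-0.92, -0.39]
```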
Linear Discriminant Analysis, C-classes (1)
- Fisher's LDA generalizes very gracefully for C-class problems
  - Instead of one projection y, we will now seek (C-1) projections $[y_1, y_2, \ldots, y_{C-1}]$ by means of (C-1) projection vectors $w_i$, which can be arranged by columns into a projection matrix $W = [w_1 \,|\, w_2 \,|\, \cdots \,|\, w_{C-1}]$:
    $$y_i = w_i^T x \;\Rightarrow\; y = W^T x$$
- Derivation
  - The generalization of the within-class scatter is
    $$S_W = \sum_{i=1}^{C} S_i \qquad \text{where} \quad S_i = \sum_{x\in\omega_i}(x - \mu_i)(x - \mu_i)^T \quad \text{and} \quad \mu_i = \frac{1}{N_i}\sum_{x\in\omega_i} x$$
  - The generalization for the between-class scatter is
    $$S_B = \sum_{i=1}^{C} N_i\,(\mu_i - \mu)(\mu_i - \mu)^T \qquad \text{where} \quad \mu = \frac{1}{N}\sum_{\forall x} x = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i$$
  - where $S_T = S_B + S_W$ is called the total scatter matrix
  [Figure: three classes in the $(x_1, x_2)$ plane, showing the per-class within-class scatters $S_{W1}, S_{W2}, S_{W3}$ and the between-class scatters $S_{B1}, S_{B2}, S_{B3}$ about the overall mean $\mu$]
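A sketch of the C-class scatter matrices just defined, for a list of per-class sample arrays. The data (three Gaussian classes in 2-D) is assumed for illustration, and the last line checks the identity $S_T = S_B + S_W$.

```python
import numpy as np

# Illustrative sketch: within-class, between-class, and total scatter for C = 3 classes.
rng = np.random.default_rng(0)
classes = [rng.normal(loc=m, scale=1.0, size=(40, 2))
           for m in ([0.0, 0.0], [4.0, 1.0], [2.0, 5.0])]

X = np.vstack(classes)
mu = X.mean(axis=0)                                     # overall mean (1/N) sum_x x

S_W = np.zeros((2, 2))
S_B = np.zeros((2, 2))
for Xi in classes:
    mu_i = Xi.mean(axis=0)
    S_W += (Xi - mu_i).T @ (Xi - mu_i)                  # S_W = sum_i S_i
    S_B += len(Xi) * np.outer(mu_i - mu, mu_i - mu)     # S_B = sum_i N_i (mu_i - mu)(mu_i - mu)^T

S_T = (X - mu).T @ (X - mu)                             # total scatter
print(np.allclose(S_T, S_W + S_B))                      # True: S_T = S_B + S_W
```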
Linear Discriminant Analysis, C-classes (2)
- Similarly, we define the mean vectors and scatter matrices for the projected samples as
  $$\tilde{\mu}_i = \frac{1}{N_i}\sum_{y\in\omega_i} y \qquad\qquad \tilde{S}_W = \sum_{i=1}^{C}\sum_{y\in\omega_i}(y - \tilde{\mu}_i)(y - \tilde{\mu}_i)^T$$
  $$\tilde{\mu} = \frac{1}{N}\sum_{\forall y} y \qquad\qquad \tilde{S}_B = \sum_{i=1}^{C} N_i\,(\tilde{\mu}_i - \tilde{\mu})(\tilde{\mu}_i - \tilde{\mu})^T$$
- From our derivation for the two-class problem, we can write
  $$\tilde{S}_W = W^T S_W W \qquad\qquad \tilde{S}_B = W^T S_B W$$
- Recall that we are looking for a projection that maximizes the ratio of between-class to within-class scatter. Since the projection is no longer a scalar (it has C-1 dimensions), we then use the determinants of the scatter matrices to obtain a scalar objective function:
  $$J(W) = \frac{\left|\tilde{S}_B\right|}{\left|\tilde{S}_W\right|} = \frac{\left|W^T S_B W\right|}{\left|W^T S_W W\right|}$$
- And we will seek the projection matrix W* that maximizes this ratio
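A sketch of the determinant-based criterion for a candidate projection matrix W with C-1 columns. The scatter matrices here are illustrative placeholders (any symmetric positive-definite $S_W$ and low-rank symmetric $S_B$); in practice they would be computed from data as in the previous sketch.

```python
import numpy as np

def lda_criterion(W, S_B, S_W):
    """J(W) = |W^T S_B W| / |W^T S_W W| for a projection matrix W with C-1 columns."""
    return np.linalg.det(W.T @ S_B @ W) / np.linalg.det(W.T @ S_W @ W)

# Illustrative numbers: C = 3 classes in a 3-D feature space, so W has 2 columns.
rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
S_W = A @ A.T + 3.0 * np.eye(3)      # a symmetric positive-definite within-class scatter
M = rng.normal(size=(3, 2))
S_B = M @ M.T                        # symmetric between-class scatter of rank <= C-1 = 2

W = rng.normal(size=(3, 2))          # a candidate projection matrix
print(lda_criterion(W, S_B, S_W))
```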
Linear Discriminant Analysis, C-classes (3)
- It can be shown that the optimal projection matrix W* is the one whose columns are the eigenvectors corresponding to the largest eigenvalues of the following generalized eigenvalue problem
  $$W^* = \left[w_1^* \,\middle|\, w_2^* \,\middle|\, \cdots \,\middle|\, w_{C-1}^*\right] = \arg\max_W \left[\frac{\left|W^T S_B W\right|}{\left|W^T S_W W\right|}\right] \;\Rightarrow\; (S_B - \lambda_i S_W)\,w_i^* = 0$$
- NOTES
  - $S_B$ is the sum of C matrices of rank one or less, and the mean vectors are constrained by
    $$\mu = \frac{1}{N}\sum_{i=1}^{C} N_i\,\mu_i$$
    - Therefore, $S_B$ will be of rank (C-1) or less
    - This means that only (C-1) of the eigenvalues $\lambda_i$ will be non-zero
  - The projections with maximum class separability information are the eigenvectors corresponding to the largest eigenvalues of $S_W^{-1} S_B$
  - LDA can be derived as the Maximum Likelihood method for the case of normal class-conditional densities with equal covariance matrices
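To make the final step concrete, here is a sketch that solves the generalized eigenvalue problem $S_B w = \lambda S_W w$ with SciPy and keeps the C-1 leading eigenvectors as the columns of W*. The three-class 3-D dataset is assumed for illustration.

```python
import numpy as np
from scipy.linalg import eigh

# Illustrative sketch: C-class LDA via the generalized eigenproblem S_B w = lambda S_W w.
rng = np.random.default_rng(0)
classes = [rng.normal(loc=m, scale=1.0, size=(40, 3))
           for m in ([0.0, 0.0, 0.0], [4.0, 1.0, 0.0], [2.0, 5.0, 1.0])]
C = len(classes)

X = np.vstack(classes)
mu = X.mean(axis=0)
S_W = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)
S_B = sum(len(Xi) * np.outer(Xi.mean(axis=0) - mu, Xi.mean(axis=0) - mu) for Xi in classes)

# scipy.linalg.eigh solves the symmetric-definite problem S_B w = lambda S_W w directly
eigvals, eigvecs = eigh(S_B, S_W)
order = np.argsort(eigvals)[::-1]
W_star = eigvecs[:, order[:C - 1]]       # keep the C-1 eigenvectors with largest eigenvalues
print(eigvals[order])                    # only C-1 = 2 eigenvalues are clearly non-zero
Y = X @ W_star                           # projected samples y = W^T x, now (C-1)-dimensional
```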