Ch. 10: Linear Discriminant Analysis (LDA)
Based on slides from Duncan Fyfe Gillies; Carlos Thomaz authored the original version of these slides.
Modified by Longin Jan Latecki, Temple University (latecki@temple.edu)
The story so far:
Let us start with a data set which we can write as a
matrix:
$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,N} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,N} \\ \vdots & \vdots & & \vdots \\ x_{n,1} & x_{n,2} & \cdots & x_{n,N} \end{pmatrix}$$
Each column is one data point and each row is a variable,
but take care: sometimes the transpose convention is used.
The mean adjusted data matrix
We form the mean adjusted data matrix by subtracting
the mean of each variable
$$U = \begin{pmatrix} x_{1,1} - m_1 & x_{1,2} - m_1 & \cdots & x_{1,N} - m_1 \\ x_{2,1} - m_2 & x_{2,2} - m_2 & \cdots & x_{2,N} - m_2 \\ \vdots & \vdots & & \vdots \\ x_{n,1} - m_n & x_{n,2} - m_n & \cdots & x_{n,N} - m_n \end{pmatrix}$$
where $m_i$ is the mean of the data items in row $i$.
Covariance Matrix
The covariance matrix can be formed from the
product:
$$S = \frac{1}{N-1}\, U U^T$$
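As a concrete illustration, here is a minimal NumPy sketch (the toy data, the random seed, and the variable names X, U, S, m are my own, chosen to mirror the slides) that forms the mean-adjusted data matrix and the covariance matrix from a data matrix whose columns are data points:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 3, 10                       # n variables (rows), N data points (columns)
X = rng.normal(size=(n, N))        # data matrix, one column per data point

m = X.mean(axis=1, keepdims=True)  # per-row (per-variable) means m_1, ..., m_n
U = X - m                          # mean-adjusted data matrix
S = U @ U.T / (N - 1)              # covariance matrix S = (1/(N-1)) U U^T

# np.cov uses the same convention (rows = variables, columns = observations)
assert np.allclose(S, np.cov(X))
```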
Alternative notation for Covariance
Covariance is also expressed in the following way:
SS=
1
( N - 1)
N

j =1
( x j - x )( x j- x ) T
1xn row
nx1 column
Sum over the
data points

nxn matrix
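A short sketch (again with my own toy data and names) checking that the sum-of-outer-products form gives the same covariance matrix as the matrix-product form:

```python
import numpy as np

rng = np.random.default_rng(1)
n, N = 3, 10
X = rng.normal(size=(n, N))
xbar = X.mean(axis=1, keepdims=True)     # n x 1 mean vector

# Sum over the data points of (x_j - xbar)(x_j - xbar)^T, each an n x n matrix
S_sum = np.zeros((n, n))
for j in range(N):
    d = X[:, j] - xbar.ravel()           # x_j - xbar, an n-vector
    S_sum += np.outer(d, d)
S_sum /= (N - 1)

# Same result as the matrix-product form (1/(N-1)) U U^T
U = X - xbar
assert np.allclose(S_sum, U @ U.T / (N - 1))
```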
Projection
A projection is a
transformation of data points
from one axis system to
another.
1
2
X2
(xi-m1
xi-m
m
X1
xi
Finding the mean adjusted
data matrix is equivalent to
moving the origin to the
center of the data. Projection
is then carried out by a dot
product
Projection Matrix
A full projection is defined by a matrix in which each
column is a vector defining the direction of one of the
new axes.
F = [ 1, 2, 3, . . . m ]
Each basis vector has the dimension of the original
data space. The projection of data point xi is:
$$y_i = (x_i - m)^T \Phi$$
Projecting every point
The projection of the data in mean adjusted form can
be written:
$$U^T \Phi = (\Phi^T U)^T$$
Projection of the covariance matrix is
$$\Phi^T S \Phi = \Phi^T \left(\tfrac{1}{N-1}\, U U^T\right) \Phi = \tfrac{1}{N-1}\, (\Phi^T U)(U^T \Phi)$$
which is the covariance matrix of the projected points
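The following sketch (my own toy data; Phi here is just an arbitrary basis obtained from a QR factorisation, not a particular PCA or LDA projection) verifies that the covariance of the projected points equals $\Phi^T S \Phi$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, N, m_dims = 4, 50, 2
X = rng.normal(size=(n, N))
mean = X.mean(axis=1, keepdims=True)
U = X - mean                                          # mean-adjusted data
S = U @ U.T / (N - 1)                                 # covariance of the original data

Phi = np.linalg.qr(rng.normal(size=(n, m_dims)))[0]   # some projection matrix, n x m

Y = U.T @ Phi                                         # projected points, one row per point
S_proj = Y.T @ Y / (N - 1)                            # covariance of the projected points

assert np.allclose(S_proj, Phi.T @ S @ Phi)           # equals Phi^T S Phi
```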
Orthogonal and Orthonormal
For most practical cases we expect the projection to
be orthogonal
(all the new axes are at right angles to each other)
and orthonormal
(all the basis vectors defining the axes are unit
length) thus:
$$\Phi^T \Phi = I$$
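In practice an orthonormal basis can be obtained, for example, from a QR factorisation; the small sketch below (my own example) simply checks that $\Phi^T \Phi = I$ holds for such a basis:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(5, 3))          # 3 arbitrary directions in a 5-dimensional space
Phi, _ = np.linalg.qr(A)             # orthonormalise the columns

assert np.allclose(Phi.T @ Phi, np.eye(3))   # Phi^T Phi = I (orthonormal basis)
```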
PCA
The PCA projection is the one that diagonalises the
covariance matrix; that is, it transforms the points so
that the projected variables are uncorrelated with each other.
$$\Phi^T S \Phi = \Lambda$$
We will look at another projection today called the
LDA.
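A small sketch of the PCA case (toy data and names are mine): the eigenvectors of $S$ give a projection $\Phi$ for which $\Phi^T S \Phi$ is diagonal:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(3, 100))
U = X - X.mean(axis=1, keepdims=True)
S = U @ U.T / (X.shape[1] - 1)

eigvals, Phi = np.linalg.eigh(S)       # eigendecomposition of the symmetric covariance
Lambda = Phi.T @ S @ Phi               # projected covariance

# Off-diagonal terms vanish: the projected variables are uncorrelated
assert np.allclose(Lambda, np.diag(eigvals))
```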
Introduction
Ronald A. Fisher, 1890-1962
“The elaborate mechanism built
on the theory of infinitely large
samples is not accurate enough
for simple laboratory data. Only
by systematically tackling small
sample problems on their merits
does it seem possible to apply
accurate tests to practical data.”
1936
Introduction
What is LDA?
Linear Discriminant Analysis, or simply LDA, is a
well-known feature extraction technique that has been
used successfully in many statistical pattern
recognition problems.
LDA is often called Fisher Discriminant Analysis
(FDA).
Motivation
The primary purpose of LDA is to separate samples
of distinct groups by transforming them to a space
which maximises their between-class separability
while minimising their within-class variability.
It assumes implicitly that the true covariance matrices
of each class are equal because the same within-class
scatter matrix is used for all the classes considered.
Geometric Idea
[Figure: data plotted in the $(x_1, x_2)$ plane; PCA gives the axes $(\varphi_1, \varphi_2)$ of maximum variance, while LDA gives the direction $u$ that best separates the classes.]
Method
Let the between-class scatter matrix Sb be defined as
$$S_b = \sum_{i=1}^{g} N_i\, (\bar{x}_i - \bar{x})(\bar{x}_i - \bar{x})^T$$
and the within-class scatter matrix Sw be defined as
$$S_w = \sum_{i=1}^{g} (N_i - 1)\, S_i = \sum_{i=1}^{g} \sum_{j=1}^{N_i} (x_{i,j} - \bar{x}_i)(x_{i,j} - \bar{x}_i)^T$$
where $x_{i,j}$ is the $n$-dimensional data point $j$ from class $\pi_i$,
$N_i$ is the number of training examples from class $\pi_i$,
and $g$ is the total number of classes or groups.
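As a sketch (the toy classes, class sizes and variable names below are my own), the two scatter matrices can be accumulated directly from per-class arrays of row vectors:

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy data: g = 3 classes in n = 2 dimensions, each class an (N_i x n) array
classes = [rng.normal(loc=c, size=(N_i, 2))
           for c, N_i in [((0, 0), 20), ((3, 1), 15), ((1, 4), 25)]]

xbar = np.vstack(classes).mean(axis=0)          # grand mean over all N points

Sb = np.zeros((2, 2))
Sw = np.zeros((2, 2))
for Xi in classes:
    Ni = len(Xi)
    xbar_i = Xi.mean(axis=0)                    # class mean
    d = xbar_i - xbar
    Sb += Ni * np.outer(d, d)                   # between-class scatter
    Sw += (Xi - xbar_i).T @ (Xi - xbar_i)       # within-class scatter = sum of (N_i - 1) S_i
```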
Nota Bene
A scatter matrix is un-normalised: using the data matrix formulation,
$S = U U^T$ is a scatter matrix, but
$S_{\mathrm{cov}} = \frac{1}{N-1}\, U U^T$
is the corresponding covariance matrix.
Method
The $S_w$ matrix is essentially a scaled version of the pooled
estimate of the covariance matrix:
$$S_p = \frac{1}{N-g} \sum_{i=1}^{g} (N_i - 1)\, S_i = \frac{(N_1 - 1)S_1 + (N_2 - 1)S_2 + \cdots + (N_g - 1)S_g}{N-g}$$
$$S_w = (N-g)\, S_p$$
Since each $S_i$ has rank at most $N_i - 1$, the rank of $S_w$ can be at most $N - g$.
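Continuing in the same spirit (my own toy classes again), a quick check of the relation $S_w = (N-g)\, S_p$:

```python
import numpy as np

rng = np.random.default_rng(6)
classes = [rng.normal(loc=c, size=(N_i, 2))
           for c, N_i in [((0, 0), 20), ((3, 1), 15), ((1, 4), 25)]]
g = len(classes)
N = sum(len(Xi) for Xi in classes)

# Pooled covariance S_p (np.cov already uses the (N_i - 1) divisor per class)
Sp = sum((len(Xi) - 1) * np.cov(Xi, rowvar=False) for Xi in classes) / (N - g)

# Within-class scatter S_w, accumulated directly
Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)

assert np.allclose(Sw, (N - g) * Sp)   # S_w = (N - g) S_p
```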
Method (cont.)
The sample mean, sample covariance, and grand
mean vector are given respectively by:
$$\bar{x}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} x_{i,j}$$
$$S_i = \frac{1}{N_i - 1} \sum_{j=1}^{N_i} (x_{i,j} - \bar{x}_i)(x_{i,j} - \bar{x}_i)^T$$
$$\bar{x} = \frac{1}{N} \sum_{i=1}^{g} N_i\, \bar{x}_i = \frac{1}{N} \sum_{i=1}^{g} \sum_{j=1}^{N_i} x_{i,j}$$
where $N = N_1 + N_2 + \cdots + N_g$.
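A tiny sketch (same kind of toy data, my own construction) confirming that the two expressions for the grand mean agree:

```python
import numpy as np

rng = np.random.default_rng(7)
classes = [rng.normal(loc=c, size=(N_i, 2))
           for c, N_i in [((0, 0), 20), ((3, 1), 15), ((1, 4), 25)]]
N = sum(len(Xi) for Xi in classes)

# Grand mean as a weighted average of the class means ...
xbar_weighted = sum(len(Xi) * Xi.mean(axis=0) for Xi in classes) / N
# ... and as the plain mean over all N data points
xbar_all = np.vstack(classes).mean(axis=0)

assert np.allclose(xbar_weighted, xbar_all)
```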
Method (cont.)
The main objective of LDA is to find a projection
matrix $P_{lda}$ that maximises the ratio of the determinant of the
projected between-class scatter to the determinant of the projected
within-class scatter (Fisher's criterion), that is
$$P_{lda} = \arg\max_{P} \frac{\left| P^T S_b P \right|}{\left| P^T S_w P \right|}$$
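As a sketch, Fisher's criterion can be evaluated for any candidate projection matrix $P$; the helper name fisher_ratio and the toy scatter matrices below are my own:

```python
import numpy as np

def fisher_ratio(P, Sb, Sw):
    """Fisher's criterion |P^T Sb P| / |P^T Sw P| for a projection matrix P (n x m)."""
    return np.linalg.det(P.T @ Sb @ P) / np.linalg.det(P.T @ Sw @ P)

# Example: with most of the between-class scatter along the first axis,
# a direction along that axis scores much higher than one orthogonal to it.
Sb = np.array([[4.0, 0.0], [0.0, 0.1]])
Sw = np.eye(2)
print(fisher_ratio(np.array([[1.0], [0.0]]), Sb, Sw))   # ~4.0
print(fisher_ratio(np.array([[0.0], [1.0]]), Sb, Sw))   # ~0.1
```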
Intuition (cont)
So Fisher's criterion tries to find the projection that
maximises the variance of the class means while
minimising the variance of the individual classes.
[Figure: a poor projection axis for separating the classes (bad projection) versus the best (LDA) projection axis for separating the classes (good projection).]
Intuition
The determinant of the covariance matrix tells us
how much variance a class has.
Consider the covariance matrix in the PCA
(diagonal) projection: the determinant is just the
product of the diagonal elements, which are the
individual variable variances.
The determinant has the same value under any
orthonormal projection.
Method (cont.)
It has been shown that Plda is in fact the solution of
the following eigensystem problem:
$$S_b P - S_w P \Lambda = 0$$
Multiplying both sides on the left by the inverse of $S_w$:
$$S_w^{-1} S_b P - S_w^{-1} S_w P \Lambda = 0$$
$$S_w^{-1} S_b P - P \Lambda = 0$$
$$(S_w^{-1} S_b)\, P = P \Lambda$$
Standard LDA
If $S_w$ is a non-singular matrix then Fisher's
criterion is maximised when the projection matrix $P_{lda}$
is composed of the eigenvectors of
$$S_w^{-1} S_b$$
with at most $(g-1)$ corresponding nonzero eigenvalues
(since $S_b$ is built from only $g$ class means, its rank is at most $g-1$).
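A minimal sketch of the standard LDA computation (toy data and names are mine), taking the leading eigenvectors of $S_w^{-1} S_b$ as the projection matrix:

```python
import numpy as np

rng = np.random.default_rng(8)
# Toy data: g = 3 classes in n = 4 dimensions
classes = [rng.normal(loc=c, size=(30, 4))
           for c in [(0, 0, 0, 0), (4, 0, 1, 0), (0, 4, 0, 1)]]

xbar = np.vstack(classes).mean(axis=0)
Sb = sum(len(Xi) * np.outer(Xi.mean(axis=0) - xbar, Xi.mean(axis=0) - xbar)
         for Xi in classes)
Sw = sum((Xi - Xi.mean(axis=0)).T @ (Xi - Xi.mean(axis=0)) for Xi in classes)

# Eigenvectors of Sw^{-1} Sb; at most g - 1 = 2 eigenvalues are (numerically) nonzero
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
order = np.argsort(eigvals.real)[::-1]
P_lda = eigvecs[:, order[:2]].real           # projection matrix, n x (g - 1)

Y = [(Xi - xbar) @ P_lda for Xi in classes]  # class samples in the LDA space
```

Using np.linalg.solve(Sw, Sb) avoids explicitly forming the inverse of $S_w$, which is numerically preferable when $S_w$ is close to singular.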
Classification Using LDA
The LDA is an axis projection.
Once the projection is found all the data points can be
transformed to the new axis system along with the
class means and covariances.
Allocation of a new point to a class can be done using
a distance measure such as the Mahalanobis distance.
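A sketch of the allocation step (the helper classify_lda and its arguments are hypothetical names of mine; it assumes the projection matrix and the per-class means and covariances in the projected space have already been computed, e.g. as in the previous sketch):

```python
import numpy as np

def classify_lda(x_new, P_lda, grand_mean, class_means, class_covs):
    """Assign x_new to the class with the smallest Mahalanobis distance in LDA space.

    class_means and class_covs are the per-class means and covariances of the
    already-projected training data."""
    y = (x_new - grand_mean) @ P_lda                 # project the new point
    dists = []
    for mu, cov in zip(class_means, class_covs):
        d = y - mu
        dists.append(d @ np.linalg.solve(cov, d))    # squared Mahalanobis distance
    return int(np.argmin(dists))
```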
LDA versus PCA
LDA seeks directions that are efficient for
discriminating data whereas PCA seeks directions
that are efficient for representing data.
The directions that are discarded by PCA might be
exactly the directions that are necessary for
distinguishing between groups.