INF 5300
Feature transformation
Anne Solberg (anne@ifi.uio.no)
Today:
• Linear algebra applied to dimensionality reduction
• Principal component analysis
• Fisher’s linear discriminant
Vector spaces
Linear transformation
Eigenvalues and eigenvectors
Interpretation of eigenvectors and eigenvalues
Linear feature transforms
Signal representation vs classification
• The search for the feature extraction mapping
y = f(x) is guided by an objective function we want to
maximize.
• In general we have two categories of objectives in feature
extraction:
– Signal representation: Accurately approximate the samples in a
lower-dimensional space.
– Classification: Keep (or enhance) class-discriminatory information in
a lower-dimensional space.
Signal representation vs classification
• Principal components analysis (PCA) – signal representation, unsupervised
• Linear discriminant analysis (LDA) – classification, supervised
Correlation matrix vs. covariance matrix
• Σ_x is the covariance matrix of x:
  Σ_x = E[(x − µ_x)(x − µ_x)^T]
• R_x is the correlation matrix of x:
  R_x = E[x x^T]
• R_x = Σ_x if µ_x = 0.
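As a small illustration (not part of the original slides), both matrices can be estimated from a data matrix with one sample per row; NumPy is assumed and all variable names are my own:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(loc=2.0, scale=1.5, size=(1000, 3))   # 1000 samples of a 3-dimensional x

    mu = X.mean(axis=0)
    Sigma_x = (X - mu).T @ (X - mu) / X.shape[0]         # covariance matrix estimate
    R_x = X.T @ X / X.shape[0]                           # correlation matrix estimate, E[x x^T]

    # R_x = Sigma_x + mu mu^T, so the two matrices coincide only when the mean is zero
    print(np.allclose(R_x, Sigma_x + np.outer(mu, mu)))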
Principal component or
Karhunen-Loeve transform
• Let x be a feature vector.
• Features are often correlated, which might lead to
redundancies.
• We now derive a transform which yields uncorrelated
features.
• We seek a linear transform y = A^T x where the components y(i) are uncorrelated.
• The y(i)'s are uncorrelated if E[y(i)y(j)] = 0 for i ≠ j.
• If we can express the information in x using uncorrelated
features, we might need fewer coefficients.
Principal component transform
• The correlation of y is described by the correlation matrix
  R_y = E[y y^T] = E[A^T x x^T A] = A^T R_x A
  R_x is the correlation matrix of x.
  R_x is symmetric, thus all eigenvectors are orthogonal.
• We seek uncorrelated components of y, thus R_y should be diagonal.
From linear algebra:
• R_y will be diagonal if A is formed by the orthogonal eigenvectors a_i, i = 0,...,N−1, of R_x: R_y = A^T R_x A = Λ, where Λ is diagonal with the eigenvalues λ_i of R_x on the diagonal.
• We find A by solving the equation A^T R_x A = Λ (using Singular Value Decomposition (SVD)).
• A is formed by computing the eigenvectors of R_x. Each eigenvector will be a column of A.
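A minimal NumPy sketch of this recipe (my own names; the slides give no code): form R_x from zero-mean samples, take its eigenvectors as the columns of A, and check that R_y = A^T R_x A is diagonal.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.multivariate_normal(mean=[0, 0, 0],
                                cov=[[4, 2, 0], [2, 3, 1], [0, 1, 2]],
                                size=2000)                 # zero-mean samples, so R_x = Sigma_x

    R_x = X.T @ X / X.shape[0]                             # correlation matrix of x
    eigvals, A = np.linalg.eigh(R_x)                       # R_x is symmetric: orthogonal eigenvectors
    order = np.argsort(eigvals)[::-1]                      # largest eigenvalue first
    eigvals, A = eigvals[order], A[:, order]

    Y = X @ A                                              # y = A^T x for every sample (row)
    R_y = Y.T @ Y / Y.shape[0]
    print(np.allclose(R_y, np.diag(eigvals)))              # R_y = Lambda, i.e. uncorrelated components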
Mean square error approximation
• x can be expressed as a combination of all N basis vectors:
  x = Σ_{i=0}^{N−1} y(i) a_i, where y(i) = a_i^T x
• An approximation to x is found by using only m of the basis vectors:
  x̂ = Σ_{i=0}^{m−1} y(i) a_i
  (a projection onto the m-dimensional subspace spanned by m eigenvectors)
• The PC-transform is based on minimizing the mean square error associated with this approximation.
• The mean square error associated with this approximation is
  E[||x − x̂||²] = E[ ||Σ_{i=m}^{N−1} y(i) a_i||² ] = E[ Σ_i Σ_j (y(i) a_i^T)(y(j) a_j) ] = Σ_{i=m}^{N−1} E[y²(i)] = Σ_{i=m}^{N−1} a_i^T E[x x^T] a_i
• Furthermore, we can find that
  E[||x − x̂||²] = Σ_{i=m}^{N−1} a_i^T λ_i a_i = Σ_{i=m}^{N−1} λ_i
• The mean square error is thus
  E[||x − x̂||²] = Σ_{i=0}^{N−1} λ_i − Σ_{i=0}^{m−1} λ_i = Σ_{i=m}^{N−1} λ_i
• The error is minimized if we select the eigenvectors corresponding to the m largest eigenvalues of the correlation matrix R_x.
• The transformed vector y is called the principal components of x. The transform is called the principal component transform or Karhunen-Loeve transform.
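This last statement can be checked numerically. The sketch below (NumPy, my own names) keeps the m eigenvectors with the largest eigenvalues and confirms that the sample mean square error equals the sum of the discarded eigenvalues:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.multivariate_normal([0, 0, 0, 0], np.diag([5.0, 3.0, 1.0, 0.5]), size=5000)

    R_x = X.T @ X / X.shape[0]
    eigvals, A = np.linalg.eigh(R_x)
    order = np.argsort(eigvals)[::-1]
    eigvals, A = eigvals[order], A[:, order]

    m = 2                                                  # keep the m components with largest variance
    Y = X @ A[:, :m]                                       # y(i) = a_i^T x, i = 0,...,m-1
    X_hat = Y @ A[:, :m].T                                 # approximation from the m eigenvectors

    mse = np.mean(np.sum((X - X_hat) ** 2, axis=1))        # sample estimate of E[||x - x_hat||^2]
    print(mse, eigvals[m:].sum())                          # the two numbers agree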
Principal component of the
covariance matrix
• Alternatively, we can find the principal components of
the covariance matrix Σx.
• If we have software for computing principal components of R_x, we can compute principal components from Σ_x by first setting z = x − µ_x and computing PC(z).
• The principal component transform is not scale invariant, because the eigenvectors change when the features are rescaled. Often the data are therefore normalized to zero mean and unit variance prior to applying the PC-transform.
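A short sketch of this normalization step (my own helper, assuming NumPy): standardize each feature to zero mean and unit variance before the eigendecomposition.

    import numpy as np

    def pca_standardized(X):
        """Principal components of data normalized to zero mean and unit variance (a sketch)."""
        Z = (X - X.mean(axis=0)) / X.std(axis=0)       # z = standardized x
        Sigma_z = Z.T @ Z / Z.shape[0]                 # covariance of z (= its correlation matrix)
        eigvals, A = np.linalg.eigh(Sigma_z)
        order = np.argsort(eigvals)[::-1]
        return Z @ A[:, order], eigvals[order]         # transformed data and component variances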
Principal components
and total variance
• Assume that E[x]=0.
• Let y=PC(x).
• From Ry we know that the variance of component yj
is λj.
• The eigenvalues λ_j of the correlation matrix R_x are thus equal to the variances of the transformed features.
• By selecting the m eigenvectors with the largest
eigenvalues, we select the m dimensions with the
largest variance.
• The first principal component will be along the
direction of the input space which has largest
variance.
Geometrical interpretation of
principal components
• The eigenvector
corresponding to the
largest eigenvalue is the
direction in n-dimensional
space with highest
variance.
• The next principal
component is orthogonal
to the first, and along the
direction with the second
largest variance.
Note that the direction with the highest variance is
NOT related to separability between classes.
PCA example
• 3-D Gaussian example (the distribution parameters and the plot are given as a figure).
Principal component images
• For an image with n bands, we can compute the
principal component transform of the entire image X.
• Y=PC(X) will then be a new image with n bands, but
most of the variance is in the bands with the lowest
index (corresponding to the largest eigenvalues).
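One way to implement this, sketched with NumPy and an assumed H × W × n band ordering (not taken from the slides):

    import numpy as np

    def pc_image(image):
        """Principal component transform of an n-band image, assumed shape (H, W, n)."""
        H, W, n = image.shape
        X = image.reshape(-1, n).astype(float)         # one row per pixel, one column per band
        X = X - X.mean(axis=0)                         # remove the band means
        Sigma = X.T @ X / X.shape[0]
        eigvals, A = np.linalg.eigh(Sigma)
        order = np.argsort(eigvals)[::-1]
        Y = X @ A[:, order]                            # band 0 now carries the largest variance
        return Y.reshape(H, W, n), eigvals[order]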
PCA applied to change detection
• Assume that we have a time series of t images taken
at different times.
• We want to visualize the pixels which change at
some point in the time series.
• Perform PCA for each pixel on the multitemporal
feature vector (PCA in time).
• Visualize the PCA components: areas with the biggest changes will be visible in PCA component 1.
• Note that normalization should be applied if the mean
values for the different images differ (e.g. due to
different solar angle etc.).
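A minimal sketch of this "PCA in time" (my own names, assuming a stack of t co-registered images); the per-image mean subtraction corresponds to the normalization mentioned above:

    import numpy as np

    def temporal_pca(stack):
        """PCA over time for t co-registered images, assumed shape (t, H, W)."""
        t, H, W = stack.shape
        X = stack.reshape(t, -1).T.astype(float)       # one row per pixel, one column per date
        X = X - X.mean(axis=0)                         # remove per-image means (normalization)
        eigvals, A = np.linalg.eigh(X.T @ X / X.shape[0])
        order = np.argsort(eigvals)[::-1]
        return (X @ A[:, order]).T.reshape(t, H, W)    # component images, largest variance first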
PC and compression
• The PC-transform is the optimal transform with respect to preserving the energy of the original image.
• For compression purposes, the PC-transform is theoretically optimal with respect to maximizing the entropy (from information theory). Entropy is related to randomness and thus to variance.
• The basis vectors are the eigenvectors and vary from image to image. For transmission, both the transform coefficients and the eigenvectors must be transmitted.
• The PC-transform can be reasonably well approximated by the cosine transform or the sine transform. These use constant basis vectors and are better suited for transmission, since only the coefficients must be transmitted (or stored).
Fisher’s Linear Discriminant - intro
• Goal:
  – Reduce dimension while preserving class-discriminatory information.
• Strategy (2 classes):
  – We have a set of samples x = {x_1, x_2, …, x_n}, where n_1 belong to class ω_1 and the rest, n_2, to class ω_2. Obtain a scalar value by projecting x onto a line: y = w^T x.
  – Select the w that maximizes the separability of the classes.
Fisher’s Linear Discriminant
• To find a good projection vector, we need to define a measure of separation between the projections, J(w).
• The mean vectors of each class in the spaces spanned by x and y are given below.
• A naive choice of criterion would be the projected mean difference.
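The formulas referred to in the two bullets above are not reproduced in this text; the standard forms (a reconstruction, with n_i samples in class ω_i) are:

    \mu_i = \frac{1}{n_i}\sum_{x \in \omega_i} x, \qquad
    \tilde{\mu}_i = \frac{1}{n_i}\sum_{y \in \omega_i} y = w^T \mu_i, \quad i = 1,2,
    \qquad
    J(w) = |\tilde{\mu}_1 - \tilde{\mu}_2| = |w^T(\mu_1 - \mu_2)|.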
Fisher’s linear discriminant
• Fisher's solution: maximize a function that represents the difference between the projected means, scaled by a measure of the within-class scatter.
• Define the classwise scatter (similar to a variance) of the projected samples; the sum of the classwise scatters is the within-class scatter.
• Fisher's criterion is then the squared difference of the projected means divided by the within-class scatter (see the sketch below).
• We look for a projection where examples from the same class are close to each other, while at the same time the projected mean values are as far apart as possible.
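A sketch of the two-class quantities and Fisher's criterion (standard forms, reconstructed since the slide equations are not in the text), together with the well-known closed-form solution for the optimal direction:

    \tilde{s}_i^{\,2} = \sum_{y \in \omega_i} (y - \tilde{\mu}_i)^2, \qquad
    J(w) = \frac{(\tilde{\mu}_1 - \tilde{\mu}_2)^2}{\tilde{s}_1^{\,2} + \tilde{s}_2^{\,2}},
    \qquad
    w^{\ast} \propto S_w^{-1}(\mu_1 - \mu_2).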
Scatter matrices
• Within-class scatter matrix (variance within each class):
  S_w = Σ_{i=1}^{M} P(ω_i) S_i,  where  S_i = E[(x − µ_i)(x − µ_i)^T]
• Between-class scatter matrix (distance between the classes):
  S_b = Σ_{i=1}^{M} P(ω_i)(µ_i − µ_0)(µ_i − µ_0)^T,  where  µ_0 = Σ_{i=1}^{M} P(ω_i) µ_i
• Mixture or total scatter matrix (variance of the features with respect to the global mean):
  S_m = E[(x − µ_0)(x − µ_0)^T]
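A NumPy sketch of these three matrices (my own helper; the class priors P(ω_i) are estimated by class frequencies), which also lets one check the identity S_m = S_w + S_b stated on the next slide:

    import numpy as np

    def scatter_matrices(X, labels):
        """Within-class (Sw), between-class (Sb) and mixture (Sm) scatter matrices."""
        classes, counts = np.unique(labels, return_counts=True)
        P = counts / len(labels)                           # P(w_i) estimated by class frequency
        mu0 = X.mean(axis=0)                               # global mean = sum_i P(w_i) mu_i
        d = X.shape[1]
        Sw, Sb = np.zeros((d, d)), np.zeros((d, d))
        for c, p in zip(classes, P):
            Xc = X[labels == c]
            mu_c = Xc.mean(axis=0)
            Sw += p * (Xc - mu_c).T @ (Xc - mu_c) / len(Xc)
            Sb += p * np.outer(mu_c - mu0, mu_c - mu0)
        Sm = (X - mu0).T @ (X - mu0) / len(X)              # equals Sw + Sb (up to rounding)
        return Sw, Sb, Sm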
• The total scatter is S_m = S_w + S_b.
• Consider the criterion function
  J_1 = trace{S_m} / trace{S_w}
  (the trace is the sum of the diagonal elements: the sum of the variances around the global mean divided by the sum of the variances around the class means).
• J_1 will be large when the variance around the global mean is large compared to the within-class variance.
Alternative scatter criteria
  J_2 = |S_m| / |S_w| = |S_w^{-1} S_m|
  J_3 = trace{S_w^{-1} S_b}
  (Error in the textbook: it should be S_b and not S_m.)
• J_2 and J_3 are invariant to linear transformations.
• In the two-class case, |S_w| is proportional to σ_1² + σ_2² and |S_m| is proportional to (µ_1 − µ_2)², which yields Fisher's discriminant ratio:
  FDR = (µ_1 − µ_2)² / (σ_1² + σ_2²)
• The generalized FDR for multiple classes (summed over pairs of classes i and j) is
  FDR_1 = Σ_i Σ_{j≠i} (µ_i − µ_j)² / (σ_i² + σ_j²)
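For one scalar feature and two classes, the FDR is a one-liner (a sketch, assuming NumPy arrays x1 and x2 holding the feature values for each class):

    import numpy as np

    def fdr(x1, x2):
        """Fisher's discriminant ratio for a single feature and two classes."""
        return (x1.mean() - x2.mean()) ** 2 / (x1.var() + x2.var())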
Fisher’s linear discriminant
• Fisher's linear discriminant is a transform that uses the information in the training data set to find a linear combination that best separates the classes.
• It is based on the criterion J_3 = trace{S_w^{-1} S_b}, with
  S_b = Σ_{i=1}^{M} P(ω_i)(µ_i − µ_0)(µ_i − µ_0)^T   (between-class scatter)
  S_w = Σ_{i=1}^{M} P(ω_i) S_i   (within-class scatter)
• From the feature vector x, let S_xw and S_xb be the within-class and between-class scatter matrices.
• The scatter matrices of y = A^T x are:
  S_yw = A^T S_xw A   and   S_yb = A^T S_xb A
• In the subspace y, J_3 becomes
  J_3 = trace{ (A^T S_xw A)^{-1} (A^T S_xb A) }
• Problem: find A such that J_3 is maximized.
• Solution: set ∂J_3(A)/∂A = 0:
  ∂J_3(A)/∂A = −2 S_xw A (A^T S_xw A)^{-1} (A^T S_xb A)(A^T S_xw A)^{-1} + 2 S_xb A (A^T S_xw A)^{-1} = 0
  ⇔  S_xw^{-1} S_xb A = A (S_yw^{-1} S_yb)
• The scatter matrices S_yw and S_yb are symmetric, and can thus be diagonalized by a linear transform B:
  B^T S_yw B = I   and   B^T S_yb B = D
• B is an l×l matrix, I the identity matrix, and D an l×l diagonal matrix. I and D are the scatter matrices of the transformed vector ŷ = B^T y = B^T A^T x.
• J_3 is invariant under linear transformations, and:
  J_3(ŷ) = trace{ (B^T S_yw B)^{-1} (B^T S_yb B) } = trace{ B^{-1} S_yw^{-1} S_yb B } = trace{ S_yw^{-1} S_yb B B^{-1} } = J_3(y)
• Furthermore,
  (S_xw^{-1} S_xb) C = C D,  where C = A B
• Because D is diagonal, this is an eigenvalue problem: D must have the eigenvalues of S_xw^{-1} S_xb on its diagonal, and C must have the corresponding eigenvectors of S_xw^{-1} S_xb as its columns.
• Note that
  S_xb = Σ_{i=1}^{M} P(ω_i)(µ_i − µ_0)(µ_i − µ_0)^T
  is a sum of M (M = number of classes) matrices of rank 1. Only M−1 of these terms are independent, meaning that the rank of S_xb is M−1 or less (no more than M−1 eigenvalues are nonzero).
• This also means that S_xw^{-1} S_xb has rank M−1 or less.
• Fisher's discriminant transform can therefore give us an l-dimensional projection, where l ≤ M−1.
  – Note: with 30 features (m = 30) and 5 classes (M = 5) this gives us a projection of dimension 4 or less.
Computing Fisher's linear discriminant
• For l = M−1:
  – Form C such that its columns are the M−1 unit-norm eigenvectors of S_xw^{-1} S_xb.
  – Set ŷ = C^T x.
  – This gives us the maximum J_3 value.
  – This means that we can reduce the dimension from m to M−1 without loss in class-separability power (if J_3 is a correct measure of class separability).
  – Alternative view: with a Bayesian model we compute the probabilities P(ω_i|x) for each class (i = 1,...,M). Once M−1 probabilities are found, the remaining P(ω_M|x) is given because the P(ω_i|x)'s sum to one.
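A sketch of this recipe in NumPy (my own names, reusing the scatter_matrices helper sketched earlier): form S_xw^{-1} S_xb, keep the l ≤ M−1 leading eigenvectors as the columns of C, and project ŷ = C^T x.

    import numpy as np

    def fisher_transform(X, labels, l=None):
        """Project X onto the l <= M-1 leading eigenvectors of Sw^-1 Sb (Fisher's discriminant transform)."""
        Sw, Sb, _ = scatter_matrices(X, labels)            # helper sketched with the scatter matrices
        eigvals, C = np.linalg.eig(np.linalg.solve(Sw, Sb))
        order = np.argsort(eigvals.real)[::-1]             # Sw^-1 Sb is not symmetric; sort by real part
        M = len(np.unique(labels))
        l = M - 1 if l is None else l                      # at most M-1 nonzero eigenvalues
        C = np.real(C[:, order[:l]])
        C = C / np.linalg.norm(C, axis=0)                  # unit-norm columns
        return X @ C                                       # y_hat = C^T x for every sample (row)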
Computation: Case 2: l < M−1
• Form C by selecting the eigenvectors corresponding to the l largest eigenvalues of S_xw^{-1} S_xb.
• We now have a loss of discriminating power, since J_3,ŷ < J_3,x.
Comments on Fisher's discriminant rule
• In general, projection of the original feature vector to
a lower dimensional space is associated with some
loss of information.
• Although the projection is optimal with respect to J_3, J_3 might not be a good criterion to optimize for a given data set. (Note that J_3 is essentially a sum, over all classes, of between-class scatter weighted by the inverse within-class scatter.)
Limitations of LDA
• LDA produces at most C−1 feature projections (C = number of classes).
• LDA is parametric, since it assumes unimodal Gaussian likelihoods.
• LDA will fail when the discriminatory information is not in the mean but in the variance of the data.
PC vs. Fisher's linear discriminant transform
• The principal component transform uses no information about the classes in the data.
• The PC-projection might therefore not improve class separability.
• From an input vector x with dimension m, the PC-transform gives us a projection y with dimension 1,...,m (depending on how many eigenvalues we include).
• A projection with Fisher's linear discriminant gives us y with dimension 1,...,K−1, where K is the number of classes.
• Fisher's linear discriminant finds the projection that maximizes the ratio of between-class to within-class scatter.
Matlab implementations
• Feature selection
– PRTools
• Evaluate features: feateval
• Selection strategies: featselm
• (Distance measures: distmaha, distm, proxm)
• Linear projections
– PRTools
• Principal component analysis: pca, klm
• LDA: fisherm
PCA - Principal components analysis
PCA - Principal components analysis
• The approximation error is then the difference between x and its approximation.
• To measure the representation error we use the mean squared error.
PCA - Principal components analysis
• Optimal values of bi can be found by taking the
partial derivatives of the approximation error:
• We replace the discarded dimensions by their
expected value. (This also feels intuitively correct)
PCA - Principal components analysis
The MSE can now be written as
Σx is the covariance matrix of x
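The equations referred to on the last few slides are not included in this text; under the usual setup (x̂(m) keeps the first m coefficients y_i = φ_i^T x and replaces the discarded ones by constants b_i), the standard forms are:

    \hat{x}(m) = \sum_{i=1}^{m} y_i \varphi_i + \sum_{i=m+1}^{n} b_i \varphi_i,
    \qquad
    \mathrm{MSE}(m) = E\big[\|x - \hat{x}(m)\|^2\big] = \sum_{i=m+1}^{n} E\big[(y_i - b_i)^2\big],

    \frac{\partial}{\partial b_i} E\big[(y_i - b_i)^2\big] = 0
    \;\Rightarrow\;
    b_i = E[y_i] = \varphi_i^T E[x],
    \qquad
    \mathrm{MSE}(m) = \sum_{i=m+1}^{n} \varphi_i^T \Sigma_x \varphi_i .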
PCA - Principal components analysis
• The orthonormality constraint on the φ_i can be incorporated into the optimization using Lagrange multipliers λ_i.
• Thus, we can find the optimal φ_i by partial differentiation.
• The optimal φ_i and λ_i are eigenvectors and eigenvalues of the covariance matrix Σ_x.
PCA - Principal components analysis
• Note that this also implies φ_i^T Σ_x φ_i = λ_i φ_i^T φ_i = λ_i.
• Thus, in order to minimize the MSE, the discarded λ_i will have to correspond to the smallest eigenvalues!
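Continuing the reconstruction above (a sketch of the standard derivation, since the slide equations are not in the text): incorporating the orthonormality constraint with Lagrange multipliers and setting the derivative with respect to φ_i to zero gives the eigenvalue problem, and inserting the solution back into the MSE shows why the discarded λ_i should be the smallest ones.

    \frac{\partial}{\partial \varphi_i}
    \Big[ \varphi_i^T \Sigma_x \varphi_i - \lambda_i (\varphi_i^T \varphi_i - 1) \Big] = 0
    \;\Rightarrow\;
    \Sigma_x \varphi_i = \lambda_i \varphi_i,
    \qquad
    \mathrm{MSE}(m) = \sum_{i=m+1}^{n} \lambda_i .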
PCA - Principal components analysis
The optimal approximation of a random vector x ∈ R^n by a linear combination of m < n independent vectors is obtained by projecting the random vector x onto the eigenvectors {φ_i} corresponding to the largest eigenvalues {λ_i} of the covariance matrix Σ_x.
PCA - Principal components analysis
• PCA uses the eigenvectors of the covariance matrix Σ_x, and is thus able to find the independent axes of the data under the unimodal Gaussian assumption.
  – For non-Gaussian or multi-modal Gaussian data, PCA simply de-correlates the axes.
• The main limitation of PCA is that it is unsupervised, so it does not consider class separability.
  – It is simply a coordinate rotation that aligns the transformed axes with the directions of maximum variance.
  – There is no guarantee that these directions are good features for discrimination.
Literature on pattern recognition
• Updated review of statistical pattern recognition:
  – A. Jain, R. Duin and J. Mao: "Statistical pattern recognition: a review", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 4-37, January 2000.
• Classical PR books:
  – R. Duda, P. Hart and D. Stork, Pattern Classification, 2nd ed., Wiley, 2001.
  – B. Ripley, Pattern Recognition and Neural Networks, Cambridge University Press, 1996.
  – S. Theodoridis and K. Koutroumbas, Pattern Recognition, Academic Press, 1999.