Investigations of the Fisher Linear Discriminant
J. R. Fairley
Mississippi State University Electrical Engineering Department
Mississippi State, MS 39762
Abstract
The discriminant functions presented classify a sample into one of the two categories in the training data set, given samples of known features for the two classes. The Fisher Linear Discriminant is used to demonstrate that this technique attains good separation in the training data, which leads to lower empirical classification errors than those of a general linear discriminant. Each category is described by three features, and the use of the Fisher Linear Discriminant to classify a category is exercised for the two categories.
1. Introduction
The general form of the linear discriminant is used to classify the data as either category ω2 or category ω3 of the given training data set, which is assumed to follow a normal distribution N(μ, Σ). The samples of the training data set, the mean vectors (μ), and the covariance matrices (Σ) are passed into the two linear discriminant functions to classify a sample as belonging to region 2 or region 3.
The general linear discriminant function is in some cases not adequate for classification. If the results from this technique show minimal-to-no separation between the categories based on the given features, then new techniques have to be applied. These techniques use a linear transformation to map the original training data into a new subspace that provides separation between the two categories, so that the likelihood of a correct classification is increased. This technique is commonly referred to as the Fisher Linear Discriminant (FLD). It seeks a weighting vector, w, that yields the maximum ratio of between-class scatter to within-class scatter. Once a weighting vector, w, is computed, it is used to transform and project the original training data set into a new subspace where good separation can be seen between the two categories. This separation then allows us to choose a threshold, x*, that provides lower classification error than the general linear discriminant function.
Once a weighting vector, w, has been computed that maximizes J(·), the training data is fitted to a univariate Gaussian distribution from which a new decision boundary threshold, x*, is determined. An empirical error rate is then computed for the classifications produced by the FLD to show the error associated with classifying a category under a univariate normal density.
2. Brief Discussion of Theory
In general, the Fisher Linear Discriminant is computed first by finding the d-dimensional sample mean, m_i, which is given by

m_i = \frac{1}{n_i} \sum_{x \in D_i} x ,

where d is the number of features that describe a category and n_i is the number of samples in class D_i. One then defines the scatter matrices S_i and S_w, where S_i has the form

S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t ,

and

S_w = S_1 + S_2 ,

where S_w is known as the within-class scatter. Next, the between-class scatter matrix S_B is computed, which is of the form

S_B = (m_1 - m_2)(m_1 - m_2)^t .
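As a compact illustration of these quantities, the following MATLAB fragment computes the class means, the individual scatter matrices, the within-class scatter, and the between-class scatter in vectorized form. It is a minimal sketch, assuming the class data are stored as the 3-by-10 matrices X_W2 and X_W3 constructed in the code of Section 5 (one sample per column); the remaining variable names are illustrative.

% Assumes X_W2 and X_W3 (3 x 10, one sample per column) are in the workspace.
m_W2 = mean(X_W2, 2);                          % 3-dimensional sample mean of w2
m_W3 = mean(X_W3, 2);                          % 3-dimensional sample mean of w3
A2 = X_W2 - repmat(m_W2, 1, size(X_W2, 2));    % deviations of w2 samples from m_W2
A3 = X_W3 - repmat(m_W3, 1, size(X_W3, 2));    % deviations of w3 samples from m_W3
S1 = A2 * A2';                                 % scatter matrix for w2
S2 = A3 * A3';                                 % scatter matrix for w3
Sw = S1 + S2;                                  % within-class scatter matrix
Sb = (m_W2 - m_W3) * (m_W2 - m_W3)';           % between-class scatter matrix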
A weighting vector, w, is then computed that optimizes J(·) and is of the form

w = S_w^{-1} (m_1 - m_2) ,

where J(·), well known in mathematical physics as the generalized Rayleigh quotient, is of the form

J(w) = \frac{w^t S_B w}{w^t S_w w} .
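Continuing the sketch above (still assuming Sw, Sb, m_W2, and m_W3 from the fragment following the scatter definitions), the weighting vector and the Rayleigh quotient can be evaluated directly:

w_opt = Sw \ (m_W2 - m_W3);                            % w = Sw^{-1}(m1 - m2); backslash avoids forming the inverse explicitly
Jw = (w_opt' * Sb * w_opt) / (w_opt' * Sw * w_opt);    % generalized Rayleigh quotient J(w)

Using the backslash operator rather than inv(Sw) is the numerically preferable choice, although for a 3-by-3 within-class scatter matrix the difference is negligible.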
The weighting vector, w, has now been obtained for the FLD yielding the maximum ratio
of between-class scatter to within-class scatter. The weighting vector, w, is often referred
to as the canonical variate. At this point the only thing left is to compute the optimal
threshold, which is the point along the one-dimensional subspace that separates the
projected points. The general form of the equation for projected points in the new
subspace is
y = w^t x .
At this point, one can take special note of the fact that the original two-category problem has been reduced from a d-dimensional problem to a much more manageable one-dimensional problem, with the overall computational complexity of finding the optimal w for the FLD being bounded by the calculation of the within-class scatter matrix and its inverse, which is an O(d^2 n) calculation.
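In code, this dimensionality reduction is a single matrix-vector product. The fragment below is a sketch, assuming w_opt from the previous fragment and the 3-by-10 data matrices X_W2 and X_W3 from Section 5; y1 and y2 are illustrative names for the projected samples of ω2 and ω3.

y1 = w_opt' * X_W2;    % 1 x 10 row vector: projections of the w2 samples onto w
y2 = w_opt' * X_W3;    % 1 x 10 row vector: projections of the w3 samples onto w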
The new projected one-dimensional data is fitted to a univariate Gaussian distribution with density p(x|ωi), which is given by

p(x|\omega_i) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \right] .
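A minimal sketch of this fitting step (assuming the projected samples y1 and y2 from the previous fragment; normpdf requires the Statistics Toolbox) estimates the univariate parameters and plots the two fitted densities so their overlap can be inspected:

mu1 = mean(y1);  sigma1 = std(y1);                % parameters of the fitted Gaussian for w2
mu2 = mean(y2);  sigma2 = std(y2);                % parameters of the fitted Gaussian for w3
xx = linspace(min([y1 y2]), max([y1 y2]), 200);   % evaluation grid spanning both classes
p1 = normpdf(xx, mu1, sigma1);                    % fitted p(x|w2) on the grid
p2 = normpdf(xx, mu2, sigma2);                    % fitted p(x|w3) on the grid
plot(xx, p1, 'b', xx, p2, 'r');                   % visual check of the overlap between the two densities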
With a new decision boundary formed from this projected data set, a classification is made, with the result being one of the two categories; the classification now has an error rate, P(error), based on the possibility of misclassifying a given sample x. That is, the discriminant function has determined the training data to be of one category when in actuality it is of the other category. The Bayes decision rule guarantees the lowest average error rate; however, this result does not tell us what the probability of error actually is. The calculation of this error for the Gaussian case can be quite difficult due to the discontinuous nature of the decision regions in the integral
P(error) = \int_{R_2} p(x|\omega_1) P(\omega_1)\,dx + \int_{R_1} p(x|\omega_2) P(\omega_2)\,dx

and

P(correct) = \sum_{i=1}^{c} \int_{R_i} p(x|\omega_i) P(\omega_i)\,dx ,

where

P(error) = 1 - P(correct).
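Although the integral is awkward to evaluate analytically, for the univariate Gaussian case with a single threshold it reduces to two tail probabilities. The fragment below is a sketch, assuming equal priors P(ωi) = 1/2, the fitted parameters mu1, sigma1 (for ω2) and mu2, sigma2 (for ω3) from the previous fragment, and the threshold x* = 0.0149 used in Section 3; it uses normcdf from the Statistics Toolbox.

x_star = 0.0149;    % decision threshold (Section 3)
Pw = 0.5;           % equal priors assumed for the two categories
% Mass of the w2 density falling below x* (decided w3) plus mass of the
% w3 density falling above x* (decided w2).
P_error = Pw * normcdf(x_star, mu1, sigma1) + Pw * (1 - normcdf(x_star, mu2, sigma2));
P_correct = 1 - P_error;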
3. Analysis of Results
The data for the two categories that was used for investigations of the FLD are given
below in table 1.
Sample   Features of ω2              Features of ω3
         x1       x2       x3        x1       x2       x3
1        -0.4     0.58     0.089     0.83     1.6      -0.014
2        -0.31    0.27     -0.04     1.1      1.6      0.48
3        0.38     0.055    -0.035    -0.44    -0.41    0.32
4        -0.15    0.53     0.011     0.047    -0.45    1.4
5        -0.35    0.47     0.034     0.28     0.35     3.1
6        0.17     0.69     0.1       -0.39    -0.48    0.11
7        -0.011   0.55     -0.18     0.34     -0.079   0.14
8        -0.27    0.61     0.12      -0.3     -0.22    2.2
9        -0.065   0.49     0.0012    1.1      1.2      -0.46
10       -0.12    0.054    -0.063    0.18     -0.11    -0.49

Table 1. Data used in Evaluating the FLD

The optimal w found for the given training data is shown below in Table 2.

w = [-0.3832, 0.2137, -0.0767]^t

Table 2. Optimal Weighting Vector

Figure 1 below shows a plot of the projected points in the direction of w.

Figure 1. Projected Points in the Direction of w
The results from the decisions made on a decision boundary of x* = 0.0149, the point at which the two projected class PDFs intersect, are given below in Table 3. The decision rules for this new threshold are as follows:

if x > x*, choose ω2; otherwise, choose ω3.
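As a sketch of how the decisions and the empirical error rate are obtained (reusing the projected samples y1 for ω2 and y2 for ω3 from the fragments in Section 2; variable names are illustrative):

x_star = 0.0149;                      % decision threshold
errors_w2 = sum(y1 <= x_star);        % w2 samples misclassified as w3
errors_w3 = sum(y2 > x_star);         % w3 samples misclassified as w2
P_error_empirical = (errors_w2 + errors_w3) / (length(y1) + length(y2));

With the projected values listed in Table 3, this count comes to (1 + 3)/20, the 20% empirical error reported in Table 4.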
Feature y1 of ω2 in Optimal Subspace    Decision    Feature y2 of ω3 in Optimal Subspace    Decision
-0.1312                                 ω3          -0.2704                                 ω3
 0.0624                                 ω2          -0.2216                                 ω3
 0.0747                                 ω2          -0.1579                                 ω3
 0.1296                                 ω2          -0.1298                                 ω3
 0.1356                                 ω2          -0.1164                                 ω3
 0.1699                                 ω2          -0.1009                                 ω3
 0.1796                                 ω2          -0.0549                                 ω3
 0.2247                                 ω2           0.0250                                 ω2
 0.2320                                 ω2           0.0384                                 ω2
 0.2704                                 ω2           0.0564                                 ω2

Table 3. Decisions made by New Decision Boundary formed by the Data in the Optimal Subspace
The error, P(error), computed empirically for the optimal weighting vector w is given below in Table 4.

Subspace    Empirical P(error) (%)
Optimal     20

Table 4. Error computed Empirically for the Optimal Subspace
Using the nonoptimal direction w = (1.0, 2.0, -1.5)t, the distributions fitted to the projected data were p(x|ω2) ~ N(0.74162, 0.1778) and p(x|ω3) ~ N(-0.143, 10.1), respectively.
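A sketch of how these fitted parameters can be obtained (assuming the 3-by-10 data matrices X_W2 and X_W3 from Section 5; variable names are illustrative):

w_no = [1.0; 2.0; -1.5];               % nonoptimal projection direction
z1 = w_no' * X_W2;                     % projections of the w2 samples
z2 = w_no' * X_W3;                     % projections of the w3 samples
fprintf('w2: N(%.5f, %.4f)\n', mean(z1), var(z1));   % fitted mean and variance for w2
fprintf('w3: N(%.4f, %.4f)\n', mean(z2), var(z2));   % fitted mean and variance for w3

The projected means come out to approximately 0.742 and -0.143, consistent with the distributions quoted above (whether the quoted variances use the n or n-1 normalization is not stated, so var() is only an approximation here).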
Solving p(x|ω2) = p(x|ω3), given the form of the univariate Gaussian distribution stated previously in this paper, yields the decision boundary for the nonoptimal direction w = (1.0, 2.0, -1.5)t; the resulting decisions are given below in Table 5, and a numerical sketch of this intersection calculation follows the table. The decision rules for this new threshold are as follows:

if -0.105 ≤ x ≤ 1.62, choose ω2; otherwise, choose ω3.

Feature y1 of ω2 in Nonoptimal Subspace    Decision    Feature y2 of ω3 in Nonoptimal Subspace    Decision
 0.0825                                    ω2          -4.0400                                    ω3
 0.2900                                    ω2          -3.6700                                    ω3
 0.5390                                    ω2          -2.9530                                    ω3
 0.5425                                    ω2          -1.7400                                    ω3
 0.6265                                    ω2          -1.5150                                    ω3
 0.7700                                    ω2          -0.0280                                    ω2
 0.8935                                    ω2           0.6950                                    ω2
 0.9132                                    ω2           3.5800                                    ω3
 1.3590                                    ω2           4.0510                                    ω3
 1.4000                                    ω2           4.1900                                    ω3

Table 5. Decisions made by New Decision Boundary formed by the Data in the Nonoptimal Subspace
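Equating the two fitted univariate Gaussian densities and taking logarithms gives a quadratic in x whose two roots are the thresholds used above; ω2 is chosen between the roots, where its density is the larger of the two. A sketch of this calculation (assuming the fitted parameters quoted earlier, interpreted as means and variances; variable names are illustrative):

mu_w2 = 0.74162;  var_w2 = 0.1778;     % fitted N(mean, variance) for w2 in the nonoptimal subspace
mu_w3 = -0.143;   var_w3 = 10.1;       % fitted N(mean, variance) for w3 in the nonoptimal subspace
% p(x|w2) = p(x|w3)  =>  a*x^2 + b*x + c = 0 after taking logs and collecting terms
a = 1/var_w2 - 1/var_w3;
b = -2*(mu_w2/var_w2 - mu_w3/var_w3);
c = mu_w2^2/var_w2 - mu_w3^2/var_w3 + log(var_w2/var_w3);
x_roots = roots([a b c]);              % two decision thresholds, approximately -0.105 and 1.62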
The error, P(error), computed empirically for the nonoptimal weighting vector w is given below in Table 6.

Subspace       Empirical P(error) (%)
Nonoptimal     10

Table 6. Error computed Empirically for the Nonoptimal Subspace
4. Conclusion
In conclusion, the error, P(error), decreases from 20% for an optimal w to 10% for a
nonoptimal w. These results seem to be of a precarious nature since a w was found to
maximize the ratio J(·) for the first case, and the second case used a nonoptimal w that
did not maximize this ratio. Nevertheless, there was a slight improvement using the
nonoptimal weighting vector versus the optimal weighting vector for classifying the
category correctly for the given training data set.
The use of the FLD to find an optimum threshold for the new projected data set did prove to be a beneficial technique for obtaining separation between the two categories. These two categories had minimal separation to start with, making the general linear discriminant function a poor choice for classifying a category correctly due to the overlapping nature of the PDFs for the two categories. The transformation of the original training data produced by the FLD clearly separated the two categories, making for a more robust decision boundary with which to classify a category accurately.
5. MATLAB Computer Code for the FLD
MATLAB code for the development of the dichotomizer used in this investigation of linear
discriminant functions is given below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Josh Fairley
% Course: ECE 8443 - Pattern Recognition
% Assignment: Computer Exercise # 2 - Fisher Linear Discriminant
% Date: March 21, 2006
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
clear all;
close all;
clc;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Assigning Values to the variables for 1 - 3 dimensions
d1 = 1; d2 = 2; d3 = 3;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% X1w2 - X1 feature of W2 X1w3 - X1 feature of W3
% X2w2 - X2 feature of W2 X2w3 - X2 feature of W3
% X3w2 - X3 feature of W2 X3w3 - X3 feature of W3
%
% m_X1w2 - mean of X1 feature of W2   m_X1w3 - mean of X1 feature of W3
% m_X2w2 - mean of X2 feature of W2   m_X2w3 - mean of X2 feature of W3
% m_X3w2 - mean of X3 feature of W2   m_X3w3 - mean of X3 feature of W3
%
% X_W2 - Matrix Construction for three-dimensional case of W2 (Is the X
% vector for the discriminant functions)
%
% X_W3 - Matrix Construction for three-dimensional case of W3 (Is the X
% vector for the discriminant functions)
X1w2 = [-0.4 -0.31 0.38 -0.15 -0.35 0.17 -0.011 -0.27 -0.065 -0.12];
X2w2 = [0.58 0.27 0.055 0.53 0.47 0.69 0.55 0.61 0.49 0.054];
X3w2 = [0.089 -0.04 -0.035 0.011 0.034 0.1 -0.18 0.12 0.0012 -0.063];
X1w3 = [0.83 1.1 -0.44 0.047 0.28 -0.39 0.34 -0.3 1.1 0.18];
X2w3 = [1.6 1.6 -0.41 -0.45 0.35 -0.48 -0.079 -0.22 1.2 -0.11];
X3w3 = [-0.014 0.48 0.32 1.4 3.1 0.11 0.14 2.2 -0.46 -0.49];
m_X1w2 = mean(X1w2);
m_X2w2 = mean(X2w2);
m_X3w2 = mean(X3w2);
m_X1w3 = mean(X1w3);
m_X2w3 = mean(X2w3);
m_X3w3 = mean(X3w3);
X_W2 = [X1w2; X2w2; X3w2];
X_W3 = [X1w3; X2w3; X3w3];
m_W2 = [m_X1w2 m_X2w2 m_X3w2];
m_W3 = [m_X1w3 m_X2w3 m_X3w3];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of Scatter Matrices Si and Sw
% S1 - Scatter Matrix for W2
% S2 - Scatter Matrix for W3
% Sw - Within Class Scatter Matix
for j = 1:3
    for i = 1:length(X1w2)
        if j == 1
            a(j,i) = X1w2(i) - m_W2(j);
        elseif j == 2
            a(j,i) = X2w2(i) - m_W2(j);
        elseif j == 3
            a(j,i) = X3w2(i) - m_W2(j);
        end
    end
end
S1 = a*a';
for j = 1:3
    for i = 1:length(X1w3)
        if j == 1
            b(j,i) = X1w3(i) - m_W3(j);
        elseif j == 2
            b(j,i) = X2w3(i) - m_W3(j);
        elseif j == 3
            b(j,i) = X3w3(i) - m_W3(j);
        end
    end
end
S2 = b*b';
Sw = S1 + S2;
% End of computation for within class scatter matrix Sw
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of Between Class Matrix Sb
m_W2 = m_W2';
m_W3 = m_W3';
Sb = (m_W2 - m_W3)*(m_W2 - m_W3)';
% End of computation for between class matrix Sb
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of Weighting Vector w
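% NOTE: as given, the nonoptimal direction is the active assignment below;
% to reproduce the optimal-subspace results, uncomment the inv(Sw) line and
% comment out the hard-coded vector.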
% w = inv(Sw)*(m_W2 - m_W3)
w = [1.0 2.0 -1.5]'
% End of computation for weighting vector w
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Fitting Data to a univariate Gaussian y = w'x and computing the mean and
% Fisher Discriminant for the new projected one-dimensional distribution.
X_W2 = X_W2';
X_W3 = X_W3';
for i = 1:2
    for j = 1:length(X1w2)
        if i == 1
            y1(j) = w' * X_W2(j,:)';
        elseif i == 2
            y2(j) = w' * X_W3(j,:)';
        end
    end
end
y1;
y2;
m1_hat = w'*m_W2;
m2_hat = w'*m_W3;
s1_hat = w'*S1*w;
s2_hat = w'*S2*w;
Jw = (w'*Sb*w)/(w'*Sw*w);
% End of fitting Data to a univariate Gaussian y = w'x and computing the mean and
% Fisher Discriminant for the new projected one-dimensional distribution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computing the projection of the data for each category onto w and plotting
% the direction of w and the positions of the projected points.
for i = 1:2
    for j = 1:length(X1w2)
        if i == 1
            % Orthogonal projection of the sample onto w: ((x*w)/(w'*w))*w'
            Projection_W2_onto_w(j,:) = ((X_W2(j,:)*w)/(w'*w))*w';
        elseif i == 2
            Projection_W3_onto_w(j,:) = ((X_W3(j,:)*w)/(w'*w))*w';
        end
    end
end
Direction_of_w = cat(1,Projection_W2_onto_w,Projection_W3_onto_w);
X_Direction_of_w = Direction_of_w(:,1);
Y_Direction_of_w = Direction_of_w(:,2);
Z_Direction_of_w = Direction_of_w(:,3);
n = length(X_Direction_of_w);
figure;
plot3(X_Direction_of_w, Y_Direction_of_w, Z_Direction_of_w, 'Color', 'm');
hold on;
grid on;
xlabel('X1');
ylabel('X2');
zlabel('X3');
title('Projected Points onto w showing the direction of distance d(Wi,w)');
for i = 1:2
    for j = 1:length(X_W2)
        if i == 1
            stem3(Projection_W2_onto_w(j,1), Projection_W2_onto_w(j,2), (X_W2(j,3) - Projection_W2_onto_w(j,3)), '--', 'fill');
            hold on;
        elseif i == 2
            stem3(Projection_W3_onto_w(j,1), Projection_W3_onto_w(j,2), (X_W3(j,3) - Projection_W3_onto_w(j,3)), 'r--', 'fill');
            hold on;
        end
    end
end
legend('w', 'Distance of Projection for W2', 'Distance of Projection for W3');
figure;
plot3(X_Direction_of_w, Y_Direction_of_w, Z_Direction_of_w, 'Color', 'c');
hold on;
scatter3(Projection_W2_onto_w(:,1), Projection_W2_onto_w(:,2), Projection_W2_onto_w(:,3), 'bo', 'fill');
hold on;
scatter3(Projection_W3_onto_w(:,1), Projection_W3_onto_w(:,2), Projection_W3_onto_w(:,3), 'ro', 'fill');
grid on;
xlabel('X1');
ylabel('X2');
zlabel('X3');
title('Projected Points onto w');
legend('w', 'Projected points for W2', 'Projected points for W3');
% End of computing the projection of the data for each category onto w and
% plotting the direction of w and the positions of the projected points.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Plot of univariate Gaussian data
figure;
stem(y1, 'bo', 'fill');
hold on;
stem(y2, 'ro', 'fill');
hold on;
line([1; 10], [0.057; 0.057], 'Color', 'c');
hold on;
line([1; 10], [0.0149; 0.0149], 'Color', 'm');
xlabel('Sample Number');
ylabel('One-Dimensional Univariate Gaussian Data');
grid on;
legend('Y1', 'Y2', 'Optimized X* = 0.057', 'X* = 0.0149');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computes PDF for 1-Dimensional Case
y1 = sort(y1);
y2 = sort(y2);
PDF1_W2 = 0.5*normpdf(y1,m1_hat,std(y1));
PDF1_W3 = 0.5*normpdf(y2,m2_hat,std(y2));
figure;
plot(y1, PDF1_W2, 'b');
hold on;
plot(y2, PDF1_W3, 'r');
hold on;
line([0.057; 0.057], [0; 2], 'Color', 'c');
hold on;
line([0.0149; 0.0149], [0; 2], 'Color', 'm');
xlabel('X');
ylabel('PDF')
title('PDF for Regions W2 and W3 (One-Dimensional Case)');
legend('W2', 'W3', 'Optimized X* = 0.057', 'X* = 0.0149');
grid on;
% End of computation for PDF
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Decision Results based on Boundary x*
x_star = 0.0149;
optimized_x_star = 0.057;
fprintf('\n');
disp('Decision based on X* for Univariate Gaussian Data of W2');
fprintf('\n');
% for i = 1:length(y1)
%     if y1(i) > x_star
%         disp('Decide W2')
%     else
%         disp('Decide W3')
%     end
% end
for i = 1:length(y1)
    if (y1(i) >= -0.105) && (y1(i) <= 1.62)
        disp('Decide W2')
    else
        disp('Decide W3')
    end
end
fprintf('\n');
disp('Decision based on X* for Univariate Gaussian Data of W3');
fprintf('\n');
% for i = 1:length(y2)
%     if y2(i) < x_star
%         disp('Decide W3')
%     else
%         disp('Decide W2')
%     end
% end
for i = 1:length(y2)
    if (y2(i) >= -0.105) && (y2(i) <= 1.62)
        disp('Decide W2')
    else
        disp('Decide W3')
    end
end
% fprintf('\n');
% disp('Decision based on Optimized X* for Univariate Gaussian Data of W2');
% fprintf('\n');
%
% for i = 1:length(y1)
%     if y1(i) > optimized_x_star
%         disp('Decide W2')
%     else
%         disp('Decide W3')
%     end
% end
%
% fprintf('\n');
% disp('Decision based on Optimized X* for Univariate Gaussian Data of W3');
% fprintf('\n');
%
% for i = 1:length(y2)
%     if y2(i) < optimized_x_star
%         disp('Decide W3')
%     else
%         disp('Decide W2')
%     end
% end