[Grader note: Again, please follow the format indicated in the syllabus (Abstract, Introduction, Theory, Results, Conclusion). Grade: 22/25]
Reducing Dimensionality with Fisher’s Linear Discriminant Function
Bruce M. Sabol
ECE 8443
20 March 2006
Problem Statement
This exercise is intended to demonstrate reduction of dimensionality in such a way as to
optimize class discrimination for the 2-class case. Two classes of 3-dimensional (3-feature)
data are transformed into a single dimension using Fisher's Linear Discriminant
(FLD) transformation. Conceptually, FLD accomplishes this by picking a line through 3-space
onto which all data are projected. The criterion for picking this line is that
separability between the two classes is maximized. FLD seeks the line vector W that
maximizes the criterion function J(W) below:
J(W) = \frac{W^t S_B W}{W^t S_W W}
where:
W   = optimum line vector through n-space
S_B = scatter matrix between classes
S_W = scatter matrix within classes
The scatter matrix within classes (S_W) is the sum of the scatter matrices of the individual classes:

S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t
where:
S_i = scatter matrix for class i
m_i = d-dimensional sample mean for class i
x   = d-dimensional sample points in the training set D_i for class i
Note that S_i is simply the covariance matrix for class i multiplied by the sample size minus one.
In practice, the first equation reduces to a much simpler closed form for the optimal W:

W = S_W^{-1}(m_1 - m_2)
Each point in the original 3-space is mapped onto the line defined by the W vector. From
there it is treated as 1-dimensional data for the 2-class case. A discrimination rule is
then computed, based on an assumed Gaussian distribution for each class: the line is
divided into segments associated with each class. Classification error is computed
empirically from the training set and theoretically from the estimated Gaussian
distribution parameters of each of the two transformed classes.
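As a minimal sketch of this computation (hypothetical helper code, condensing the Appendix program; Xc2 and Xc3 are assumed to be 10-by-3 matrices holding the class 2 and class 3 samples of Table 1, one sample per row):

% Sketch of the FLD computation: within-class scatter, optimal W, and projection.
m2 = mean(Xc2)';   m3 = mean(Xc3)';        % d-dimensional sample means
S2 = (size(Xc2,1)-1)*cov(Xc2);             % scatter matrix for class 2
S3 = (size(Xc3,1)-1)*cov(Xc3);             % scatter matrix for class 3
Sw = S2 + S3;                              % within-class scatter matrix
W  = Sw \ (m2 - m3);                       % W = Sw^-1 * (m2 - m3)
y2 = Xc2 * W;                              % projected 1-D class 2 data
y3 = Xc3 * W;                              % projected 1-D class 3 data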
Data
Data used in this exercise are shown below (Table 1; classes 2 and 3 from p. 155 of Duda et
al., 2001). Both classes are assumed to be equally likely (i.e., P(w2) = P(w3) = 0.5).
Table 1. Data used in exercise (Duda et al., p. 155)

                  Class 2                        Class 3
Sample       X1       X2       X3          X1       X2       X3
   1       -0.40     0.58     0.089       0.83     1.60    -0.014
   2       -0.31     0.27    -0.04        1.10     1.60     0.48
   3        0.38     0.055   -0.035      -0.44    -0.41     0.32
   4       -0.15     0.53     0.011       0.047   -0.45     1.40
   5       -0.35     0.47     0.034       0.28     0.35     3.10
   6        0.17     0.69     0.10       -0.39    -0.48     0.11
   7       -0.011    0.55    -0.18        0.34    -0.079    0.14
   8       -0.27     0.61     0.12       -0.30    -0.22     2.20
   9       -0.065    0.49     0.0012      1.10     1.20    -0.46
  10       -0.12     0.054   -0.063       0.18    -0.11    -0.49
Procedures
A single program (see Appendix) was developed in the MATLAB interpreted language to
compute the FLD W vector and to map the 3-dimensional points onto the line defined by that
vector. Statistical summaries of the transformed Y points were then computed for the
optimum W vector and for an arbitrary non-optimal W vector.
Manual calculation of the decision boundary was performed based on the measured Gaussian
parameters of the transformed 1-dimensional class data. This was done by setting the
class-conditional probabilities p(x|wi) for the two classes equal to each other and
solving for the boundary point(s) m using the Gaussian density:
p(m) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{1}{2}\left(\frac{m-\mu}{\sigma}\right)^2}
where:
σ_i = sample standard deviation of class i
μ_i = sample mean of class i
m   = boundary point(s) between the two transformed classes
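As a sketch of this calculation (hypothetical helper code, not part of the graded Appendix; it assumes the transformed-class statistics reported in the Results section), the equality of the two densities can be reduced to a quadratic in m and solved directly:

% Sketch: solve for the boundary point(s) m where the two class-conditional
% Gaussian densities are equal (equal priors assumed). Values below are the
% transformed class 2 and class 3 statistics reported in the Results section.
mu1 = 0.1348;   sig1 = sqrt(0.0132);   % transformed class 2
mu2 = -0.0932;  sig2 = sqrt(0.0121);   % transformed class 3
% Equate the two densities, take logs, and collect terms in m:  a*m^2 + b*m + c = 0
a = 1/(2*sig2^2) - 1/(2*sig1^2);
b = mu1/sig1^2 - mu2/sig2^2;
c = mu2^2/(2*sig2^2) - mu1^2/(2*sig1^2) - log(sig1/sig2);
m = roots([a b c])   % nearly equal variances give one nearby root (approx 0.021)
                     % and one root far outside the data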
Measurement of empirical classification error for the training set was performed.
Additionally, theoretical error was estimated based on measured Gaussian parameters of
the transformed classes.
Finally, the original 3-dimensional data were transformed again into one dimension, this
time using an arbitrary non-optimal W vector. Classification errors were computed, as
described above, for the resulting 1-dimensional data.
Results
Solving for the vector W yielded [-0.3832, 0.2137, -0.0767]. Mapping points from the two
classes onto this line yielded class means and variances ([μ, σ²]) of [0.1348, 0.0132] and
[-0.0932, 0.0121] for classes 2 and 3, respectively. The variances of the two classes are
practically identical; therefore, with equal prior probabilities, the dividing point for
class discrimination is roughly halfway between the class means (0.0208). Using this rule
for class discrimination, 10% of class 2 and 30% of class 3 were incorrectly classified.
The probability of misclassification for the overlapping Gaussian distributions was
computed from the equal Z statistic between the classes (z = 1.013 = [class mean − class
boundary] / standard deviation), which corresponds to an error of approximately 16% for
each class.
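A sketch of this error estimate (hypothetical helper code, assuming the "equal Z" statistic is the distance, in standard deviations, from each class mean to the point equidistant from both means; base MATLAB's erfc stands in for a normal CDF):

% Sketch: theoretical misclassification probability for the optimal projection.
% z is the number of standard deviations from each class mean to the point that
% lies equally many standard deviations from both means (equal priors assumed).
mu_c2 = 0.1348;   sd_c2 = sqrt(0.0132);   % transformed class 2 (from above)
mu_c3 = -0.0932;  sd_c3 = sqrt(0.0121);   % transformed class 3
z     = (mu_c2 - mu_c3)/(sd_c2 + sd_c3)   % approx 1.013
Perr  = 0.5*erfc(z/sqrt(2))               % = 1 - normcdf(z), approx 0.156 (15.6%)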
[Grader note: Need to provide a plot showing how the projection to a line is done and the
position of the projected points. (-3)]
Using the arbitrary and non-optimal W vector [1.0, 2.0, -1.5], the resulting distributions
of the transformed classes were quite different. Class means and variances ([μ, σ²]) for
the two classes were [0.7416, 0.1778] and [-0.1430, 10.100] for classes 2 and 3,
respectively. Solving for the class boundary points using the Gaussian equation above
yielded the following class discrimination rule:
if -0.105 < y < 1.62 then pick class 2
else pick class 3
Based upon this rule, there were no errors in classifying training set points in class 2, and
20% error in classifying points in class 3. For this discrimination rule the computed
probability of error based on Gaussian distributions is 4.2% for class 2 and 20.8% for
class 3.
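A sketch of how these error figures can be reproduced (hypothetical helper code; a standard normal CDF is built from erf to avoid a toolbox dependency):

% Sketch: theoretical error for the rule "pick class 2 if -0.105 < y < 1.62".
% Class 2 error = class 2 probability mass outside the interval;
% class 3 error = class 3 probability mass inside the interval.
Phi   = @(x) 0.5*(1 + erf(x/sqrt(2)));    % standard normal CDF
lo = -0.105;   hi = 1.62;                 % boundary points from the rule above
mu_c2 = 0.7416;   sd_c2 = sqrt(0.1778);   % transformed class 2 (arbitrary W)
mu_c3 = -0.1430;  sd_c3 = sqrt(10.100);   % transformed class 3 (arbitrary W)
Perr2 = Phi((lo - mu_c2)/sd_c2) + 1 - Phi((hi - mu_c2)/sd_c2)   % approx 0.04
Perr3 = Phi((hi - mu_c3)/sd_c3) - Phi((lo - mu_c3)/sd_c3)       % approx 0.21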
Observations
1) The optimum transformation resulted in compact class distributions with nearly equal
variances and only a modest level of overlap. The predicted error was 15.6% for each class,
with observed error rates of 10% and 30% for classes 2 and 3, respectively.
2) The non-optimum transform resulted in rather ugly distributions for the two classes:
they overlapped near their respective means, and their variances were far from equal. In
spite of this, both the predicted and the measured error rates were on the order of, or
better than, those for the optimum transformation.
3) These results are rather curious. I do not know how the "non-optimum" transform vector
was selected; it may not have been random, and may have been very specifically picked to
show this curious outcome. If I had more time, I would systematically select truly random
vectors and examine the class separability and error rates for these random transforms,
to see just how good our "optimum" transform really is.
Appendix.
% computer HW #2 - problem 9, p.158
% sanitize
clear all;
close all;
clc;
% supply source data
% class w1
X1w1 = [ 0.42  -0.20   1.30   0.39  -1.60  -0.029 -0.23   0.27  -1.90   0.87 ]';
X2w1 = [-0.087 -3.30  -0.32   0.71  -5.30   0.89   1.90  -0.30   0.76  -1.00 ]';
X3w1 = [ 0.58  -3.40   1.70   0.23  -0.15  -4.70   2.20  -0.87  -2.10  -2.60 ]';
% class w2
X1w2 = [-0.40  -0.31   0.38  -0.15  -0.35   0.17  -0.011 -0.27  -0.065 -0.12 ]';
X2w2 = [ 0.58   0.27   0.055  0.53   0.47   0.69   0.55   0.61   0.49   0.054]';
X3w2 = [ 0.089 -0.04  -0.035  0.011  0.034  0.10  -0.18   0.12   0.0012 -0.063]';
% class w3
X1w3 = [ 0.83   1.10  -0.44   0.047  0.28  -0.39   0.34  -0.30   1.10   0.18 ]';
X2w3 = [ 1.60   1.60  -0.41  -0.45   0.35  -0.48  -0.079 -0.22   1.20  -0.11 ]';
X3w3 = [-0.014  0.48   0.32   1.40   3.10   0.11   0.14   2.20  -0.46  -0.49 ]';
% compute class 1 stats
X_training1 = [X1w1 X2w1 X3w1]; %<===========PICK FEATURES
% set dimension d = # features used
[samples,d] = size(X_training1);
MU1    = mean(X_training1);      % mean vector
Covar1 = cov(X_training1);       % covariance matrix
%Corr1 = corrcoef(X_training1)   % examine correlation matrix
S1     = (samples-1)*Covar1;     % scatter matrix for class 1
% compute class 2 stats
X_training2 = [X1w2 X2w2 X3w2]; %<===========PICK FEATURES
MU2    = mean(X_training2);      % mean vector
Covar2 = cov(X_training2);       % covariance matrix
S2     = (samples-1)*Covar2;     % scatter matrix for class 2
% compute class 3 stats
X_training3 = [X1w3 X2w3 X3w3]; %<===========PICK FEATURES
MU3    = mean(X_training3);      % mean vector
Covar3 = cov(X_training3);       % covariance matrix
S3     = (samples-1)*Covar3;     % scatter matrix for class 3
% pick 2 classes and compute associated statistics - PICK 2 AND 3
% generate scatter matrix "within"
Sw     = S2 + S3;                % <========================PICK 2 CLASSES
First  = X_training2;            % <===================PICK FIRST CLASS
Second = X_training3;            % <===================PICK SECOND CLASS
W      = inv(Sw)*(MU2 - MU3)'    % compute Fisher Linear Discriminant transform vector
% perform Fisher Linear Discriminant transform
for i=1:2 % 2-class case only
    for j=1:samples
        % transform 3-D X into 1-D Y
        if (i==1) Y_first(j)  = W'*(First(j,:))';
        else      Y_second(j) = W'*(Second(j,:))';
        end % if
    end % j-loop
end % i-loop
% compute stats on new 1-D Y vectors
mu_Y1  = mean(Y_first)
var_Y1 = var(Y_first)
mu_Y2  = mean(Y_second)
var_Y2 = var(Y_second)
% arbitrary non-optimal transform
W_arb = [1.0 2.0 -1.5]'
% perform transform with the arbitrary vector
for i=1:2 % 2-class case only
    for j=1:samples
        % transform 3-D X into 1-D Y
        if (i==1) Y1_arb(j) = W_arb'*(First(j,:))';
        else      Y2_arb(j) = W_arb'*(Second(j,:))';
        end % if
    end % j-loop
end % i-loop
% compute stats on arbitrary transformed vector
mu_Y1 = mean(Y1_arb)
var_Y1 = var(Y1_arb)
mu_Y2 = mean(Y2_arb)
var_Y2 = var(Y2_arb)