[Instructor comment: Again, please follow the format indicated in the syllabus: Abstract, Introduction, Theory, Results, Conclusion. 22/25]

Reducing Dimensionality with Fisher's Linear Discriminant Function
Bruce M. Sabol
ECE 8443
20 March 2006

Problem Statement

This exercise is intended to demonstrate reduction of dimensionality in such a way as to optimize class discrimination for the 2-class case. Two classes of 3-dimensional (3-feature) data are transformed into a single dimension using Fisher's Linear Discriminant (FLD) transformation. Conceptually, FLD accomplishes this by picking a line through 3-space onto which all data are projected. The criterion for picking this line is that the separability between the two classes is maximized. FLD seeks the line vector W that maximizes the function J(W) below:

    J(W) = \frac{W^t S_B W}{W^t S_W W}

where:
    W   = optimum line vector through n-space
    S_B = scatter matrix between classes
    S_W = scatter matrix within classes

The scatter matrix within classes (S_W) is the sum of the scatter matrices of the individual classes, S_W = S_1 + S_2, where

    S_i = \sum_{x \in D_i} (x - m_i)(x - m_i)^t

where:
    S_i = scatter matrix for class i
    m_i = d-dimensional sample mean for class i
    x   = d-dimensional sample points of class i (D_i) in the training set

Note that S_i is simply the covariance matrix for class i times the class sample size minus one. In practice the first equation reduces to a much simpler form to solve for W:

    W = S_W^{-1}(m_1 - m_2)

Each point in the original 3-space is mapped onto the line defined by the W vector. From there it is treated as 1-dimensional data for the 2-class case. A discrimination rule is computed, based on an assumed Gaussian distribution for each class, and the line is divided into segments associated with each class. Classification error is computed empirically from the training set and theoretically from the estimated Gaussian distribution parameters of each of the two transformed classes.

Data

Data used in this exercise are shown below (Table 1; classes 2 and 3 from p. 155 of Duda et al. 2001). Both classes are assumed to be equally likely, i.e. P(w2) = P(w3) = 0.5.

Table 1. Data used in exercise (Duda et al. 2001, p. 155)

                     Class 2                          Class 3
    Sample      X1       X2       X3            X1       X2       X3
       1      -0.40     0.58     0.089         0.83     1.60    -0.014
       2      -0.31     0.27    -0.04          1.10     1.60     0.48
       3       0.38     0.055   -0.035        -0.44    -0.41     0.32
       4      -0.15     0.53     0.011         0.047   -0.45     1.40
       5      -0.35     0.47     0.034         0.28     0.35     3.10
       6       0.17     0.69     0.10         -0.39    -0.48     0.11
       7      -0.011    0.55    -0.18          0.34    -0.079    0.14
       8      -0.27     0.61     0.12         -0.30    -0.22     2.20
       9      -0.065    0.49     0.0012        1.10     1.20    -0.46
      10      -0.12     0.054   -0.063         0.18    -0.11    -0.49

Procedures

A single MATLAB program (see Appendix) was developed to compute the FLD W vector and to map the 3-dimensional points onto the line defined by that vector. Statistical summaries of the transformed Y points were then computed for the optimum W vector and for an arbitrary non-optimal W vector. The decision boundary was calculated manually from the measured Gaussian parameters of the transformed 1-dimensional class data. This was done by setting the class-conditional probabilities P(x|wi) of the two classes equal to each other and solving for the boundary point(s) m using the Gaussian density:

    p_i(m) = \frac{1}{\sqrt{2\pi}\,\sigma_i} \exp\left[-\frac{1}{2}\left(\frac{m - \mu_i}{\sigma_i}\right)^2\right]

where:
    σ_i = sample standard deviation of class i
    μ_i = sample mean of class i
    m   = boundary point(s) between the two classes

The empirical classification error for the training set was then measured. Additionally, the theoretical error was estimated from the measured Gaussian parameters of the transformed classes.
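A minimal MATLAB sketch of this boundary and error calculation is shown below; it is illustrative only, and the names mu1, s1, mu2, s2 are assumed to hold the sample means and standard deviations of the two transformed classes (for example mu_Y1 and sqrt(var_Y1) from the program in the Appendix).

% Sketch (assumption): equate the two fitted 1-D Gaussians and take logs,
% which gives a quadratic in the boundary point m.
a  = 1/s1^2 - 1/s2^2;
b  = -2*(mu1/s1^2 - mu2/s2^2);
c  = mu1^2/s1^2 - mu2^2/s2^2 + 2*log(s1/s2);
m  = roots([a b c]);                          % one root if s1 = s2, else two
mb = m(m > min(mu1,mu2) & m < max(mu1,mu2));  % root lying between the class means

% Theoretical per-class error for that single boundary (adequate when the
% variances are nearly equal, as in the optimum-W case).  The standard
% normal CDF is written with erfc so no toolbox is required.
Phi  = @(z) 0.5*erfc(-z/sqrt(2));             % standard normal CDF
err1 = Phi(-abs(mu1 - mb)/s1)                 % tail of class 1 beyond the boundary
err2 = Phi(-abs(mu2 - mb)/s2)                 % tail of class 2 beyond the boundary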
Finally, the original 3-dimensional data were also transformed into one dimension using an arbitrary, non-optimal W vector. Classification errors were computed, as described above, for the 1-dimensional data resulting from this vector.

Results

Solving for the vector W yielded [-0.3832, 0.2137, -0.0767]. Mapping points from the two classes onto this line yielded class means and variances ([μ, σ²]) of [0.1348, 0.0132] and [-0.0932, 0.0121] for classes 2 and 3, respectively. The variances of the two classes are practically identical; therefore, with equal prior probabilities, the dividing point for class discrimination is roughly halfway between the class means (0.0208). Using this rule for class discrimination, 10% of class 2 and 30% of class 3 were incorrectly classified. The probability of misclassification for the overlapping Gaussian distributions was computed from the (equal) z statistic of each class, z = |class mean - class boundary| / standard deviation = 1.013, which corresponds to an error of approximately 16% for each class.

[Instructor comment: Need to provide a plot showing how the projection onto the line is done and the positions of the projected points. (-3)]

Using the arbitrary, non-optimum W vector [1.0, 2.0, -1.5], the resulting distributions of the transformed classes were quite different. Class means and variances ([μ, σ²]) were [0.7416, 0.1778] and [-0.1430, 10.100] for classes 2 and 3, respectively. Solving for the class boundary points using the Gaussian equation above yielded the following discrimination rule: if -0.105 < y < 1.62, choose class 2; otherwise choose class 3. Based on this rule there were no errors in classifying the training-set points of class 2 and a 20% error in classifying the points of class 3. For this rule the computed probability of error based on the Gaussian distributions is 4.2% for class 2 and 20.8% for class 3.

Observations

1) The optimum transformation resulted in compact class distributions with nearly equal variances and only a modest level of overlap. The predicted error was 15.6% for each class, with observed error rates of 10% and 30% for classes 2 and 3, respectively.

2) The non-optimum transform resulted in rather ugly distributions for the two classes: they overlapped near their respective means, and their variances were far from equal. In spite of this, both the predicted and the measured error rates were on the order of, or better than, those for the optimum transformation.

3) These results are rather curious. I do not know how the "non-optimum" transform vector was selected; it may not have been random, and may have been picked specifically to produce this curious outcome. Given more time, I would systematically select truly random vectors and examine the class separability and error rates of these random transforms, to see just how good our "optimum" transform really is (a minimal sketch of such an experiment is given below).
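A minimal sketch of that experiment, assuming the variables X_training2, X_training3, and W produced by the program in the Appendix and using the simple midpoint-between-means rule on the projected data, might look like this:

% Sketch (assumption): compare the FLD vector W against random unit vectors
% in 3-space, scoring each by its empirical training-set error under the
% midpoint-between-means decision rule.
nTrials  = 1000;
err_rand = zeros(nTrials,1);
for k = 1:nTrials
    Wr = randn(3,1); Wr = Wr/norm(Wr);          % random direction in 3-space
    y2 = X_training2*Wr;  y3 = X_training3*Wr;  % project both classes
    m  = (mean(y2) + mean(y3))/2;               % midpoint decision boundary
    s  = sign(mean(y2) - mean(y3));             % which side belongs to class 2
    err_rand(k) = (sum(s*(y2 - m) < 0) + sum(s*(y3 - m) > 0)) / (numel(y2) + numel(y3));
end
% same error measure for the FLD direction
y2 = X_training2*W;  y3 = X_training3*W;
m  = (mean(y2) + mean(y3))/2;
s  = sign(mean(y2) - mean(y3));
err_fld = (sum(s*(y2 - m) < 0) + sum(s*(y3 - m) > 0)) / (numel(y2) + numel(y3));
fprintf('FLD error %.0f%%, random vectors: min %.0f%%, median %.0f%%\n', ...
        100*err_fld, 100*min(err_rand), 100*median(err_rand));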
Appendix. MATLAB program

% computer HW #2 - problem 9, p.158
% sanitize
clear all; close all; clc;

% supply source data
% class w1
X1w1 = [ 0.42  -0.20   1.30   0.39  -1.60  -0.029 -0.23   0.27  -1.90   0.87]';
X2w1 = [-0.087 -3.30  -0.32   0.71  -5.30   0.89   1.90  -0.30   0.76  -1.00]';
X3w1 = [ 0.58  -3.40   1.70   0.23  -0.15  -4.70   2.20  -0.87  -2.10  -2.60]';
% class w2
X1w2 = [-0.40  -0.31   0.38  -0.15  -0.35   0.17  -0.011 -0.27  -0.065 -0.12]';
X2w2 = [ 0.58   0.27   0.055  0.53   0.47   0.69   0.55   0.61   0.49   0.054]';
X3w2 = [ 0.089 -0.04  -0.035  0.011  0.034  0.10  -0.18   0.12   0.0012 -0.063]';
% class w3
X1w3 = [ 0.83   1.10  -0.44   0.047  0.28  -0.39   0.34  -0.30   1.10   0.18]';
X2w3 = [ 1.60   1.60  -0.41  -0.45   0.35  -0.48  -0.079 -0.22   1.20  -0.11]';
X3w3 = [-0.014  0.48   0.32   1.40   3.10   0.11   0.14   2.20  -0.46  -0.49]';

% compute class 1 stats
X_training1 = [X1w1 X2w1 X3w1];   %<=========== PICK FEATURES
[samples,d] = size(X_training1);  % set dimension d = # features used
MU1    = mean(X_training1);       % mean vector
Covar1 = cov(X_training1);        % covariance matrix
%Corr1 = corrcoef(X_training1)    % examine correlation matrix
S1     = (samples-1)*Covar1;      % scatter matrix for class 1

% compute class 2 stats
X_training2 = [X1w2 X2w2 X3w2];   %<=========== PICK FEATURES
MU2    = mean(X_training2);       % mean vector
Covar2 = cov(X_training2);        % covariance matrix
S2     = (samples-1)*Covar2;      % scatter matrix for class 2

% compute class 3 stats
X_training3 = [X1w3 X2w3 X3w3];   %<=========== PICK FEATURES
MU3    = mean(X_training3);       % mean vector
Covar3 = cov(X_training3);        % covariance matrix
S3     = (samples-1)*Covar3;      % scatter matrix for class 3

% pick 2 classes and compute associated statistics - PICK 2 AND 3
% generate scatter matrix "within"
Sw     = S2 + S3;                 %<=========== PICK 2 CLASSES
First  = X_training2;             %<=========== PICK FIRST CLASS
Second = X_training3;             %<=========== PICK SECOND CLASS

% compute Fisher Linear Discriminant transform vector
W = inv(Sw)*(MU2 - MU3)'

% perform Fisher Linear Discriminant transform
for i = 1:2                       % 2-class case only
    for j = 1:samples             % transform 3-D X into 1-D Y
        if (i==1)
            Y_first(j)  = W'*(First(j,:))';
        else
            Y_second(j) = W'*(Second(j,:))';
        end % if
    end % j-loop
end % i-loop

% compute stats on new 1-D Y vectors
mu_Y1  = mean(Y_first)
var_Y1 = var(Y_first)
mu_Y2  = mean(Y_second)
var_Y2 = var(Y_second)

% arbitrary non-optimal transform
W_arb = [1.0 2.0 -1.5]'

% perform the same transform with the arbitrary vector
for i = 1:2                       % 2-class case only
    for j = 1:samples             % transform 3-D X into 1-D Y
        if (i==1)
            Y1_arb(j) = W_arb'*(First(j,:))';
        else
            Y2_arb(j) = W_arb'*(Second(j,:))';
        end % if
    end % j-loop
end % i-loop

% compute stats on the arbitrarily transformed vectors
mu_Y1  = mean(Y1_arb)
var_Y1 = var(Y1_arb)
mu_Y2  = mean(Y2_arb)
var_Y2 = var(Y2_arb)
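A minimal sketch of the projection plot noted in the Results, assuming the variables First, Second, W, Y_first, and Y_second defined by the program above:

% Sketch (assumption): visualize the projection - original 3-D points, the
% line defined by W, and the 1-D projected values placed back on that line
% for display.
figure; hold on; grid on;
plot3(First(:,1),  First(:,2),  First(:,3),  'bo');  % class 2 samples in 3-space
plot3(Second(:,1), Second(:,2), Second(:,3), 'rs');  % class 3 samples in 3-space
t = linspace(min([Y_first Y_second]), max([Y_first Y_second]), 2);
L = (t'*W')/(W'*W);                                  % endpoints of the projection line
plot3(L(:,1), L(:,2), L(:,3), 'k-', 'LineWidth', 2);
P2 = (Y_first'*W')/(W'*W);                           % class 2 points projected onto the line
P3 = (Y_second'*W')/(W'*W);                          % class 3 points projected onto the line
plot3(P2(:,1), P2(:,2), P2(:,3), 'b.', 'MarkerSize', 15);
plot3(P3(:,1), P3(:,2), P3(:,3), 'r.', 'MarkerSize', 15);
xlabel('x_1'); ylabel('x_2'); zlabel('x_3');
legend('class 2', 'class 3', 'projection line', 'class 2 projected', 'class 3 projected');
title('Projection of the 3-D samples onto the FLD direction W');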