Investigations of the Fisher Linear Discriminant

J. R. Fairley
Mississippi State University
Electrical Engineering Department
Mississippi State, MS 39762

Abstract

The discriminant functions presented classify a sample as one of two categories in a training data set, given samples with known feature values for each class. The Fisher Linear Discriminant is used to demonstrate how this technique attains good separation in the training data, which leads to lower empirical errors in classifying a category correctly than those of a general linear discriminant. Each category is described by three features, and the Fisher Linear Discriminant is exercised to classify samples into one of the two categories.

1. Introduction

The general form of the linear discriminant is used to classify the data as either category 2 or category 3 of the given training data set, which is assumed to follow a normal distribution N(µ, Σ). The samples of the training data set, the mean vectors (µ), and the covariance matrices (Σ) are passed into the two linear discriminant functions to classify the samples as belonging to region 2 or region 3.

The general linear discriminant function is in some cases not adequate for classification. If the results from this technique show minimal-to-no separation between the categories based on the given features, then new techniques must be applied. These techniques use a linear transformation to project the original training data into a new subspace that provides separation between the two categories, so that the likelihood of a correct classification is increased. This technique is commonly referred to as the Fisher Linear Discriminant (FLD). It seeks a weighting vector, w, that yields the maximum ratio of between-class scatter to within-class scatter. Once a weighting vector w is computed, it is used to transform and project the original training data into a new subspace where good separation can be seen between the two categories. This separation then allows a threshold, x*, to be chosen that provides lower classification error than the general linear discriminant function. Once a weighting vector w that maximizes J(·) has been computed, the projected training data are fitted to a univariate Gaussian distribution from which the new decision boundary threshold x* is determined. An empirical error rate is then formed for the classifications produced by the FLD to show the error associated with classifying a category under a univariate normal density.

2. Brief Discussion of Theory

In general, the Fisher Linear Discriminant is computed by first finding the d-dimensional sample mean, mi, given by

    mi = (1/ni) Σ_{x ∈ Di} x ,

where d is the number of features that describe a category. One then defines the scatter matrices Si and Sw, where Si has the form

    Si = Σ_{x ∈ Di} (x − mi)(x − mi)t ,

and

    Sw = S1 + S2 ,

where Sw is known as the within-class scatter matrix. Next, the between-class scatter matrix SB is computed, which has the form

    SB = (m1 − m2)(m1 − m2)t .

A weighting vector, w, that optimizes J(·) is then computed and has the form

    w = Sw⁻¹ (m1 − m2) ,

where J(·), well known in mathematical physics as the generalized Rayleigh quotient, has the form

    J(w) = (wt SB w) / (wt Sw w) .

The weighting vector, w, has now been obtained for the FLD, yielding the maximum ratio of between-class scatter to within-class scatter.
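For reference, the quantities above can be computed in a few lines of MATLAB. The sketch below is only an illustration of the formulas; the placeholder data and the variable names X1 and X2 are assumptions made here, while the full listing in Section 5 builds the same matrices with explicit loops over the data of Table 1.

% Minimal FLD sketch for two classes whose samples are the columns of X1 and X2
% (d features by n samples per class).  Placeholder data is used for illustration.
X1 = rand(3,10);                      % hypothetical samples for the first class
X2 = rand(3,10) + 1;                  % hypothetical samples for the second class
m1 = mean(X1, 2);                     % d-dimensional sample mean of class 1
m2 = mean(X2, 2);                     % d-dimensional sample mean of class 2
S1 = (X1 - m1*ones(1,size(X1,2)))*(X1 - m1*ones(1,size(X1,2)))';   % class scatter
S2 = (X2 - m2*ones(1,size(X2,2)))*(X2 - m2*ones(1,size(X2,2)))';
Sw = S1 + S2;                         % within-class scatter
SB = (m1 - m2)*(m1 - m2)';            % between-class scatter
w  = Sw \ (m1 - m2);                  % direction maximizing J(w)
Jw = (w'*SB*w)/(w'*Sw*w);             % generalized Rayleigh quotient at w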
The weighting vector, w, is often referred to as the canonical variate. At this point, the only thing left is to compute the optimal threshold, which is the point along the one-dimensional subspace that separates the projected points. The general form of the equation for the projected points in the new subspace is

    y = wt x .

Note that the original two-category problem has been reduced from a d-dimensional problem into a much more manageable one-dimensional problem, with the overall computational complexity of finding the optimal w for the FLD bounded by the calculation of the within-class scatter matrix and its inverse, which is an O(d²n) calculation. The new projected one-dimensional data is fitted to a univariate Gaussian distribution with density p(x|ωi) given by

    p(x|ωi) = (1 / (√(2π) σ)) exp( −(1/2)((x − µ)/σ)² ) .

With a new decision boundary formed from this projected data set, a classification is made with the result being one of the two categories, which now carries an error rate, P(error), based on the possibility of misclassifying a given set of x data. That is, the discriminant function has determined the training data to be of one category when in actuality it is of the other category. The Bayes decision rule guarantees the lowest average error rate; however, these results do not tell us what the probability of error actually is. The calculation of this error for a Gaussian case can be quite difficult due to the discontinuous nature of the decision regions in the integrals

    P(error) = ∫_R2 p(x|ω1) P(ω1) dx + ∫_R1 p(x|ω2) P(ω2) dx

and

    P(correct) = Σ_{i=1..c} ∫_Ri p(x|ωi) P(ωi) dx ,

where P(error) = 1 − P(correct).

3. Analysis of Results

The data for the two categories used in this investigation of the FLD are given below in Table 1.

Sample      Features of ω2                    Features of ω3
            x1       x2       x3              x1       x2       x3
   1       -0.4      0.58     0.089           0.83     1.6     -0.014
   2       -0.31     0.27    -0.04            1.1      1.6      0.48
   3        0.38     0.055   -0.035          -0.44    -0.41     0.32
   4       -0.15     0.53     0.011           0.047   -0.45     1.4
   5       -0.35     0.47     0.034           0.28     0.35     3.1
   6        0.17     0.69     0.1            -0.39    -0.48     0.11
   7       -0.011    0.55    -0.18            0.34    -0.079    0.14
   8       -0.27     0.61     0.12           -0.3     -0.22     2.2
   9       -0.065    0.49     0.0012          1.1      1.2     -0.46
  10       -0.12     0.054   -0.063           0.18    -0.11    -0.49

Table 1. Data used in evaluating the FLD.

The optimal w found for the given training data is shown below in Table 2.

    w = (-0.3832, 0.2137, -0.0767)t

Table 2. Optimal weighting vector.

Figure 1 below shows a plot of the projected points in the direction of w.

Figure 1. Projected points in the direction of w.

The results from the decisions made with a decision boundary of x* = 0.0149 are given below in Table 3. This is the point at which the two fitted PDFs intersect. The decision rule for this new threshold is as follows: if x > x*, choose ω2; otherwise choose ω3.

Feature y1 of ω2 in       Decision      Feature y2 of ω3 in       Decision
optimal subspace                        optimal subspace
   -0.1312                  ω3             -0.2704                  ω3
    0.0624                  ω2             -0.2216                  ω3
    0.0747                  ω2             -0.1579                  ω3
    0.1296                  ω2             -0.1298                  ω3
    0.1356                  ω2             -0.1164                  ω3
    0.1699                  ω2             -0.1009                  ω3
    0.1796                  ω2             -0.0549                  ω3
    0.2247                  ω2              0.0250                  ω2
    0.2320                  ω2              0.0384                  ω2
    0.2704                  ω2              0.0564                  ω2

Table 3. Decisions made by the new decision boundary formed from the data in the optimal subspace.

The error, P(error), computed empirically for the optimal weighting vector w is given below in Table 4.

Optimal subspace empirical P(error) (%):  20

Table 4. Error computed empirically for the optimal subspace.

Using the nonoptimal direction w = (1.0, 2.0, -1.5)t, the new distributions for the projected data were p(x|ω2) ~ N(0.74162, 0.1778) and p(x|ω3) ~ N(-0.143, 10.1), respectively.
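Setting these two densities equal and taking logarithms yields a quadratic in x whose two roots are the decision thresholds used in the next set of results. A minimal MATLAB sketch of that calculation is shown below; it assumes the second parameter of each N(·, ·) above is a variance and that equal priors are used (so they cancel), and it is illustrative only, not part of the listing in Section 5.

mu1 = 0.74162;  var1 = 0.1778;     % fitted statistics of the projected w2 data
mu2 = -0.143;   var2 = 10.1;       % fitted statistics of the projected w3 data
% p(x|w2) = p(x|w3) with unequal variances reduces to a*x^2 + b*x + c = 0
a = 1/(2*var2) - 1/(2*var1);
b = mu1/var1 - mu2/var2;
c = mu2^2/(2*var2) - mu1^2/(2*var1) + 0.5*log(var2/var1);
x_star = roots([a b c])            % approximately 1.62 and -0.105

The two roots bracket the region in which ω2 is the more likely category, which is exactly the interval used for the decisions reported below in Table 5.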
Solving for p(x|ω2) = p(x|ω3), given the form of the univariate Gaussian distribution stated previously in this paper, the results from the decisions made on the resulting decision boundary for the nonoptimal direction w = (1.0, 2.0, -1.5)t are given below in Table 5. The decision rule for this new threshold is as follows: if -0.105 ≤ x ≤ 1.62, choose ω2; otherwise choose ω3.

Feature y1 of ω2 in        Decision      Feature y2 of ω3 in        Decision
nonoptimal subspace                      nonoptimal subspace
    0.0825                   ω2             -4.0400                   ω3
    0.2900                   ω2             -3.6700                   ω3
    0.5390                   ω2             -2.9530                   ω3
    0.5425                   ω2             -1.7400                   ω3
    0.6265                   ω2             -1.5150                   ω3
    0.7700                   ω2             -0.0280                   ω2
    0.8935                   ω2              0.6950                   ω2
    0.9132                   ω2              3.5800                   ω3
    1.3590                   ω2              4.0510                   ω3
    1.4000                   ω2              4.1900                   ω3

Table 5. Decisions made by the new decision boundary formed from the data in the nonoptimal subspace.

The error, P(error), computed empirically for the nonoptimal weighting vector w is given below in Table 6.

Nonoptimal subspace empirical P(error) (%):  10

Table 6. Error computed empirically for the nonoptimal subspace.

4. Conclusion

In conclusion, the empirical error, P(error), decreased from 20% for the optimal w to 10% for the nonoptimal w. These results seem precarious, since a w was found that maximizes the ratio J(·) in the first case, while the second case used a nonoptimal w that does not maximize this ratio. Nevertheless, there was a slight improvement using the nonoptimal weighting vector over the optimal weighting vector in classifying the categories correctly for the given training data set. The use of the FLD to find an optimal threshold for the projected data set did prove to be a very beneficial technique for obtaining separation between the two categories. These two categories had minimal separation to start with, making the general linear discriminant function a poor choice for classifying a category correctly due to the overlapping nature of the PDFs for the two categories. The transformation of the original training data produced by the FLD clearly separated the two categories, making for a more robust decision boundary that classifies a category accurately.

5. MATLAB Computer Code for the FLD

MATLAB code for the development of the dichotomizer used in this investigation of linear discriminant functions is given below.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author:     Josh Fairley
% Course:     ECE 8443 - Pattern Recognition
% Assignment: Computer Exercise # 2 - Fisher Linear Discriminant
% Date:       March 21, 2006
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

clear all; close all; clc;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Assigning values to the variables for 1 - 3 dimensions
d1 = 1;
d2 = 2;
d3 = 3;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% X1w2 - X1 feature of W2     X1w3 - X1 feature of W3
% X2w2 - X2 feature of W2     X2w3 - X2 feature of W3
% X3w2 - X3 feature of W2     X3w3 - X3 feature of W3
%
% m_X1w2 - mean of X1 feature of W2   m_X1w3 - mean of X1 feature of W3
% m_X2w2 - mean of X2 feature of W2   m_X2w3 - mean of X2 feature of W3
% m_X3w2 - mean of X3 feature of W2   m_X3w3 - mean of X3 feature of W3
%
% X_W2 - Matrix construction for the three-dimensional case of W2 (is the X
%        vector for the discriminant functions)
% X_W3 - Matrix construction for the three-dimensional case of W3 (is the X
%        vector for the discriminant functions)

X1w2 = [-0.4 -0.31 0.38 -0.15 -0.35 0.17 -0.011 -0.27 -0.065 -0.12];
X2w2 = [0.58 0.27 0.055 0.53 0.47 0.69 0.55 0.61 0.49 0.054];
X3w2 = [0.089 -0.04 -0.035 0.011 0.034 0.1 -0.18 0.12 0.0012 -0.063];

X1w3 = [0.83 1.1 -0.44 0.047 0.28 -0.39 0.34 -0.3 1.1 0.18];
X2w3 = [1.6 1.6 -0.41 -0.45 0.35 -0.48 -0.079 -0.22 1.2 -0.11];
X3w3 = [-0.014 0.48 0.32 1.4 3.1 0.11 0.14 2.2 -0.46 -0.49];

m_X1w2 = mean(X1w2); m_X2w2 = mean(X2w2); m_X3w2 = mean(X3w2);
m_X1w3 = mean(X1w3); m_X2w3 = mean(X2w3); m_X3w3 = mean(X3w3);

X_W2 = [X1w2; X2w2; X3w2];
X_W3 = [X1w3; X2w3; X3w3];

m_W2 = [m_X1w2 m_X2w2 m_X3w2];
m_W3 = [m_X1w3 m_X2w3 m_X3w3];
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of scatter matrices Si and Sw
% S1 - Scatter matrix for W2
% S2 - Scatter matrix for W3
% Sw - Within-class scatter matrix
for j = 1:3
    for i = 1:length(X1w2)
        if j == 1
            a(j,i) = X1w2(i) - m_W2(j);
        elseif j == 2
            a(j,i) = X2w2(i) - m_W2(j);
        elseif j == 3
            a(j,i) = X3w2(i) - m_W2(j);
        end
    end
end
S1 = a*a';

for j = 1:3
    for i = 1:length(X1w3)
        if j == 1
            b(j,i) = X1w3(i) - m_W3(j);
        elseif j == 2
            b(j,i) = X2w3(i) - m_W3(j);
        elseif j == 3
            b(j,i) = X3w3(i) - m_W3(j);
        end
    end
end
S2 = b*b';

Sw = S1 + S2;
% End of computation for within-class scatter matrix Sw
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of between-class matrix Sb
m_W2 = m_W2';
m_W3 = m_W3';
Sb = (m_W2 - m_W3)*(m_W2 - m_W3)';
% End of computation for between-class matrix Sb
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computation of weighting vector w
% Optimal direction (commented out in this listing):
% w = inv(Sw)*(m_W2 - m_W3)
% Nonoptimal direction w = (1.0, 2.0, -1.5)^t used in the decision loops below:
w = [1.0 2.0 -1.5]'
% End of computation for weighting vector w
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Fitting data to a univariate Gaussian y = w'x and computing the mean and
% Fisher discriminant for the new projected one-dimensional distribution.
X_W2 = X_W2';
X_W3 = X_W3';

for i = 1:2
    for j = 1:length(X1w2)
        if i == 1
            y1(j) = w' * X_W2(j,:)';
        elseif i == 2
            y2(j) = w' * X_W3(j,:)';
        end
    end
end

y1;
y2;

m1_hat = w'*m_W2;
m2_hat = w'*m_W3;
s1_hat = w'*S1*w;
s2_hat = w'*S2*w;

Jw = (w'*Sb*w)/(w'*Sw*w);
% End of fitting data to a univariate Gaussian y = w'x and computing the mean and
% Fisher discriminant for the new projected one-dimensional distribution.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computing the projection of the data for each category onto w and plotting
% the direction of w and the positions of the projected points.
for i = 1:2
    for j = 1:length(X1w2)
        if i == 1
            % Orthogonal projection of the sample onto w: ((x*w)/(w'*w))*w'
            Projection_W2_onto_w(j,:) = ((X_W2(j,:)*w)/(w'*w))*w';
        elseif i == 2
            Projection_W3_onto_w(j,:) = ((X_W3(j,:)*w)/(w'*w))*w';
        end
    end
end

Direction_of_w = cat(1, Projection_W2_onto_w, Projection_W3_onto_w);
X_Direction_of_w = Direction_of_w(:,1);
Y_Direction_of_w = Direction_of_w(:,2);
Z_Direction_of_w = Direction_of_w(:,3);
n = length(X_Direction_of_w);

figure;
plot3(X_Direction_of_w, Y_Direction_of_w, Z_Direction_of_w, 'Color', 'm');
hold on; grid on;
xlabel('X1'); ylabel('X2'); zlabel('X3');
title('Projected Points onto w showing the direction of distance d(Wi,w)');

for i = 1:2
    for j = 1:length(X_W2)
        if i == 1
            stem3(Projection_W2_onto_w(j,1), Projection_W2_onto_w(j,2), ...
                  (X_W2(j,3) - Projection_W2_onto_w(j,3)), '--', 'fill');
            hold on;
        elseif i == 2
            stem3(Projection_W3_onto_w(j,1), Projection_W3_onto_w(j,2), ...
                  (X_W3(j,3) - Projection_W3_onto_w(j,3)), 'r--', 'fill');
            hold on;
        end
    end
end
legend('w', 'Distance of Projection for W2', 'Distance of Projection for W3');

figure;
plot3(X_Direction_of_w, Y_Direction_of_w, Z_Direction_of_w, 'Color', 'c');
hold on;
scatter3(Projection_W2_onto_w(:,1), Projection_W2_onto_w(:,2), Projection_W2_onto_w(:,3), 'bo', 'fill');
hold on;
scatter3(Projection_W3_onto_w(:,1), Projection_W3_onto_w(:,2), Projection_W3_onto_w(:,3), 'ro', 'fill');
grid on;
xlabel('X1'); ylabel('X2'); zlabel('X3');
title('Projected Points onto w');
legend('w', 'Projected points for W2', 'Projected points for W3');
% End of computing the projection of the data for each category onto w and
% plotting the direction of w and the positions of the projected points.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Plot of univariate Gaussian data
figure;
stem(y1, 'bo', 'fill');
hold on;
stem(y2, 'ro', 'fill');
hold on;
line([1; 10], [0.057; 0.057], 'Color', 'c');
hold on;
line([1; 10], [0.0149; 0.0149], 'Color', 'm');
xlabel('Sample Number');
ylabel('One-Dimensional Univariate Gaussian Data');
grid on;
legend('Y1', 'Y2', 'Optimized X* = 0.057', 'X* = 0.0149');
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Computes PDF for 1-dimensional case
y1 = sort(y1);
y2 = sort(y2);

PDF1_W2 = 0.5*normpdf(y1, m1_hat, std(y1));
PDF1_W3 = 0.5*normpdf(y2, m2_hat, std(y2));

figure;
plot(y1, PDF1_W2, 'b');
hold on;
plot(y2, PDF1_W3, 'r');
hold on;
line([0.057; 0.057], [0; 2], 'Color', 'c');
hold on;
line([0.0149; 0.0149], [0; 2], 'Color', 'm');
xlabel('X');
ylabel('PDF')
title('PDF for Regions W2 and W3 (One-Dimensional Case)');
legend('W2', 'W3', 'Optimized X* = 0.057', 'X* = 0.0149');
grid on;
% End of computation for PDF
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Decision results based on boundary x*
x_star = 0.0149;
optimized_x_star = 0.057;

fprintf('\n');
disp('Decision based on X* for Univariate Gaussian Data of W2');
fprintf('\n');

% for i = 1:length(y1)
%     if y1(i) > x_star
%         disp('Decide W2')
%     else
%         disp('Decide W3')
%     end
% end

for i = 1:length(y1)
    if (y1(i) >= -0.105) & (y1(i) <= 1.62)
        disp('Decide W2')
    else
        disp('Decide W3')
    end
end

fprintf('\n');
disp('Decision based on X* for Univariate Gaussian Data of W3');
fprintf('\n');

% for i = 1:length(y2)
%     if y2(i) < x_star
%         disp('Decide W3')
%     else
%         disp('Decide W2')
%     end
% end

for i = 1:length(y2)
    if (y2(i) >= -0.105) & (y2(i) <= 1.62)
        disp('Decide W2')
    else
        disp('Decide W3')
    end
end

% fprintf('\n');
% disp('Decision based on Optimized X* for Univariate Gaussian Data of W2');
% fprintf('\n');
%
% for i = 1:length(y1)
%     if y1(i) > optimized_x_star
%         disp('Decide W2')
%     else
%         disp('Decide W3')
%     end
% end
%
% fprintf('\n');
% disp('Decision based on Optimized X* for Univariate Gaussian Data of W3');
% fprintf('\n');
%
% for i = 1:length(y2)
%     if y2(i) < optimized_x_star
%         disp('Decide W3')
%     else
%         disp('Decide W2')
%     end
% end
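The decision loops above print each classification but do not tally the empirical error rates reported in Tables 4 and 6. The short sketch below counts the misclassifications for the nonoptimal interval; it assumes it is run after the listing so that y1 and y2 are already in the workspace, and the variable names used here are illustrative additions, not part of the original code.

% A W2 sample is an error when its projection falls outside [-0.105, 1.62];
% a W3 sample is an error when its projection falls inside that interval.
errors_W2 = sum(~((y1 >= -0.105) & (y1 <= 1.62)));
errors_W3 = sum((y2 >= -0.105) & (y2 <= 1.62));
P_error_empirical = 100*(errors_W2 + errors_W3)/(length(y1) + length(y2))

For the projections listed in Table 5 this tally gives 2 errors out of 20 samples, i.e. the 10% reported in Table 6.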