VERSATILE PATTERN RECOGNITION SYSTEM BASED ON FISHER CRITERION

Maciej Smiatacz, Witold Malina
Faculty of Electronics, Telecommunications and Informatics
Gdańsk University of Technology
ul. G. Narutowicza 11/12, 80-952 Gdańsk, Poland

ABSTRACT

In this work we present a complete pattern recognition system that can be used to classify arbitrary digital images (bitmaps). The feature extraction algorithm that we implemented is universal, so different applications (e.g. character or face recognition) do not require any modifications of the system architecture. The system is based entirely on the Fisher linear classifier.

INTRODUCTION

Discriminant analysis [1] is well known and widely used in many fields of pattern recognition. In general, it helps us determine which variables discriminate between two or more naturally occurring groups, so it can be treated as a feature selection technique. In practice, however, this approach is represented mostly by the Fisher criterion, which can be applied directly to pattern classification. The idea of the Fisher classifier is to find a vector d such that the patterns belonging to opposite classes are optimally separated after being projected onto d. The basic form of the Fisher criterion applies to the two-class case and can be expressed by the following formula [2]:

    F(d) = \frac{d^T B d}{d^T \Sigma d}    (1)

where B is the between-class scatter matrix, B = ΔΔ^T, Δ = µ1 − µ2, µi is the mean vector of class ci (i = 1, 2), Σ = P(c1)Σ1 + P(c2)Σ2, Σi is the covariance matrix of class ci, and P(ci) is the a priori probability of class ci.

The optimal Fisher discriminant vector d_opt maximises the value of F. To find it we calculate the first derivative of F and solve the equation F'(d) = 0, which leads to the straightforward formula

    d_{opt} = \Sigma^{-1} \Delta    (2)

To perform the classification we project the conditional densities p(y|ci) of each class onto d_opt and find the point α where p(d_opt^T y | c1) = p(d_opt^T y | c2). The decision rule for an unknown pattern y then becomes very simple:

    y \in c_1 \Leftrightarrow d_{opt}^T y \le \alpha, \qquad y \in c_2 \Leftrightarrow d_{opt}^T y > \alpha    (3)

(the direction of the inequalities is fixed by the sign convention adopted for Δ).

As we can see, the Fisher criterion is easy to apply and provides linear solutions. This is why it has been employed in many pattern recognition systems (e.g. [3, 4]) and improved versions of it have been proposed [5, 6]. Nevertheless, its obvious drawback is the need to invert the covariance matrix Σ. Consequently, if every pattern consists of N elements (N feature values), we have to collect at least N + 1 linearly independent training samples to make Σ non-singular.

In practical applications the value of N can be quite large, especially if we want to use the intensity of each pixel of a digital image as a separate feature. In this case the arrays of pixel values (the bitmaps) must first be converted to vectors by concatenating the columns of each array, so that an original N×N image becomes a vector with N² elements. Treating each pixel intensity as a feature value is an attractive idea, because it allows us to classify the bitmaps directly and eliminates the need for a specialised feature extraction unit; the recognition system becomes versatile and its application depends only on the contents of the training set. Unfortunately, if the original image resolution is 128×128 pixels, each vector has 16384 elements and the covariance matrix has dimensions 16384×16384, i.e. 268435456 elements. If these elements are stored as double precision (8-byte) values, the matrix alone occupies 2 gigabytes of memory. It is still difficult to operate efficiently on such large matrices, and in practice it is impossible to invert them.
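To make the above concrete, the classical two-class procedure of formulas (1)-(3) can be summarised in a few lines of code. The sketch below is purely illustrative and is not the authors' implementation: it assumes the Eigen C++ library for the linear algebra, stores the training samples as matrix columns, and places the threshold α at the midpoint of the projected class means, which coincides with the crossing point of the projected densities only under an equal-variance Gaussian assumption.

    #include <Eigen/Dense>

    using Eigen::MatrixXd;
    using Eigen::VectorXd;

    // Covariance of the samples stored as columns of X, around the mean mu.
    MatrixXd covariance(const MatrixXd& X, const VectorXd& mu) {
        MatrixXd Xc = X.colwise() - mu;
        return (Xc * Xc.transpose()) / double(X.cols());
    }

    // Classical two-class Fisher classifier, formulas (1)-(3).
    struct FisherClassifier {
        VectorXd d;     // discriminant direction d_opt = Sigma^{-1} * Delta  (2)
        double alpha;   // decision threshold on the projected axis

        void train(const MatrixXd& X1, const MatrixXd& X2) {
            double n1 = double(X1.cols()), n2 = double(X2.cols());
            double P1 = n1 / (n1 + n2), P2 = n2 / (n1 + n2);  // a priori probabilities
            VectorXd mu1 = X1.rowwise().mean();
            VectorXd mu2 = X2.rowwise().mean();
            MatrixXd Sigma = P1 * covariance(X1, mu1) + P2 * covariance(X2, mu2);
            // d_opt = Sigma^{-1} * (mu1 - mu2); an LDLT solve avoids forming the inverse
            d = Sigma.ldlt().solve(mu1 - mu2);
            // Simplified threshold: midpoint of the projected means (see the lead-in)
            alpha = 0.5 * (d.dot(mu1) + d.dot(mu2));
        }

        // Decision rule (3).  With Delta = mu1 - mu2, class c1 projects onto the
        // larger side of alpha, so the comparison is oriented accordingly.
        int classify(const VectorXd& y) const {
            return d.dot(y) >= alpha ? 1 : 2;
        }
    };

The solve step is exactly where the singularity problem discussed above appears: when fewer than N + 1 linearly independent samples are available, Σ cannot be reliably inverted or factorised.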
In order to overcome the problems mentioned above, we proposed a two-stage Fisher classifier [7]. It uses matrices as the input data structure and does not require column concatenation; as a result, the covariance matrix has the same dimensions as the original image. In our previous publication [7] we presented preliminary results, but our experiments were limited to two-class problems only. In the following sections we describe a complete pattern recognition system based on the two-stage Fisher classifier.

SYSTEM ARCHITECTURE

One of the simplest ways to build a multi-class system from a two-class algorithm is to construct a sequential classifier that separates only one class from the others at a time. The class chosen at the current stage is not taken into consideration in the following steps, and the training process ends when the last two classes are separated. In this way a multi-class problem is reduced to a set of two-class decisions. The concept is simple, but in addition to the classification method we have to define a measure describing the quality of discrimination, i.e. the separability of the classes. In the case of the Fisher criterion the quality of classification is expressed by means of the so-called Fisher distance

    D_F(1, 2) = \frac{(m_1 - m_2)^2}{\sigma_1^2 + \sigma_2^2}    (4)

where m_i = d^T µ_i and σ_i² = d^T Σ_i d. Thanks to this formula the implementation of the sequential Fisher classifier is fairly easy. The basic block diagram of our training procedure is shown in fig. 1.

Fig. 1. Block diagram of the training procedure in the sequential pattern recognition system

The core of the system is the module that selects the most distinctive class, i.e. the class that can be separated from the others with the lowest possible error. Let us assume that we have already separated n classes, so that L − n classes are left (L is the total number of classes). To separate the next class we carry out the following steps (a code sketch is given after the description of the two-stage classifier below):

1. choose a class index i from the set L_n containing the indices of the classes that have not been separated yet,
2. construct a new class c_x that includes the patterns of all the classes indicated by L_n except c_i,
3. build the two-stage Fisher classifier discriminating between c_i and c_x,
4. calculate the Fisher distance D_F(i, x) (4) for c_i and c_x,
5. repeat steps 1 to 4 for all the class indices contained in L_n,
6. select the class c_i for which D_F(i, x) is the highest.

We would like to point out that the classifier mentioned in step 3 is a two-stage classifier. At the first stage it creates the mean matrices Ā^(i) and Ā^(x) of both classes and then calculates the optimal vector d_M that ensures the maximal distance between Ā^(i) and Ā^(x) after projection onto it. The projection of a training image A_k^(i) onto d_M can thus be treated as a kind of "discriminant feature" extraction:

    y_k^{(i)} = (A_k^{(i)})^T d_M    (5)

where y_k^(i) is the "discriminant feature" vector describing the k-th image of the i-th class. Having calculated the y_k^(i) feature vectors, we can use the standard form of the Fisher criterion (1) to create the classifier and find the discriminant vector d.
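The following sketch, which continues the previous listing (it reuses the covariance() helper and the FisherClassifier), illustrates the training-side machinery: the Fisher distance (4), the "discriminant feature" extraction (5) and the selection loop of steps 1-6. All names here are illustrative. In particular, the construction of the first-stage vector d_M follows one natural reading of the description above: the unit vector maximising the distance between the projected mean matrices, ||(Ā^(i) − Ā^(x))^T d||, is the dominant eigenvector of DD^T with D = Ā^(i) − Ā^(x); the exact construction used in [7] may differ.

    #include <vector>

    // Fisher distance (4) between two classes projected onto d.
    double fisherDistance(const VectorXd& d,
                          const VectorXd& mu1, const MatrixXd& S1,
                          const VectorXd& mu2, const MatrixXd& S2) {
        double m1 = d.dot(mu1), m2 = d.dot(mu2);
        return (m1 - m2) * (m1 - m2) / (d.dot(S1 * d) + d.dot(S2 * d));
    }

    // First stage: direction maximising the distance between the projected
    // mean matrices (see the lead-in for the assumption made here).
    VectorXd firstStageVector(const MatrixXd& meanA, const MatrixXd& meanB) {
        MatrixXd D = meanA - meanB;
        Eigen::SelfAdjointEigenSolver<MatrixXd> es(D * D.transpose());
        return es.eigenvectors().rightCols<1>();  // eigenvalues sorted ascending
    }

    // "Discriminant feature" extraction (5) for a single image matrix.
    VectorXd discriminantFeatures(const MatrixXd& A_k, const VectorXd& dM) {
        return A_k.transpose() * dM;
    }

    // Steps 3-4: build the two-stage classifier for c_i vs c_x and report
    // the Fisher distance (4) achieved by the second-stage direction d.
    double twoStageSeparability(const std::vector<MatrixXd>& ci,
                                const std::vector<MatrixXd>& cx,
                                VectorXd& dM, FisherClassifier& fc) {
        MatrixXd meanA = MatrixXd::Zero(ci[0].rows(), ci[0].cols());
        for (const MatrixXd& A : ci) meanA += A;
        meanA /= double(ci.size());
        MatrixXd meanB = MatrixXd::Zero(cx[0].rows(), cx[0].cols());
        for (const MatrixXd& A : cx) meanB += A;
        meanB /= double(cx.size());
        dM = firstStageVector(meanA, meanB);

        // Second stage: the standard Fisher criterion (1) on the vectors (5).
        MatrixXd Y1(ci[0].cols(), ci.size()), Y2(cx[0].cols(), cx.size());
        for (size_t k = 0; k < ci.size(); ++k) Y1.col(k) = discriminantFeatures(ci[k], dM);
        for (size_t k = 0; k < cx.size(); ++k) Y2.col(k) = discriminantFeatures(cx[k], dM);
        fc.train(Y1, Y2);

        VectorXd mu1 = Y1.rowwise().mean(), mu2 = Y2.rowwise().mean();
        return fisherDistance(fc.d, mu1, covariance(Y1, mu1),
                                    mu2, covariance(Y2, mu2));
    }

    // Steps 1-6: pick the most distinctive of the not-yet-separated classes L_n.
    int selectMostDistinctive(const std::vector<int>& Ln,
                              const std::vector<std::vector<MatrixXd>>& classes) {
        int best = -1;
        double bestDF = -1.0;
        for (int i : Ln) {                                   // steps 1 and 5
            std::vector<MatrixXd> cx;                        // step 2: pool the rest
            for (int j : Ln)
                if (j != i) cx.insert(cx.end(), classes[j].begin(), classes[j].end());
            VectorXd dM;
            FisherClassifier fc;
            double DF = twoStageSeparability(classes[i], cx, dM, fc);  // steps 3-4
            if (DF > bestDF) { bestDF = DF; best = i; }      // step 6
        }
        return best;
    }

Note that the second-stage feature vectors have only as many elements as the image has columns (e.g. 32 for the 32×32 bitmaps used later), so their covariance matrix can be estimated and inverted from a realistic number of training samples.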
The final result of the training process is a decision tree that serves as the basis of the classification algorithm. It is a binary tree with at least one leaf attached to each node (fig. 2). The classification is performed by checking the decision rules (3) stored in the subsequent nodes until a leaf is reached. Each step involves two projections: the first one produces the "discriminant feature" vector y (5), and the second one is necessary to calculate d_opt^T y (3). The decision tree always contains L − 1 nodes, so in the worst case we have to make 2(L − 1) projections. Because the dimensions of the matrices and vectors involved in these projections are relatively small, in a typical case (where the number of classes is reasonable) the classification takes very little time. (A traversal sketch is given at the end of this section.)

Fig. 2. An example of the decision tree produced by the sequential classifier

EXPERIMENTAL RESULTS

The system described above was implemented as a C++ Windows application. Our experiments were carried out on a standard PC equipped with an Intel Pentium IV 1.5 GHz processor; the code was not optimised for maximum performance. We tested our sequential classifier on the well-known NIST database containing binary images of handwritten digits. These images are normalised and their resolution is 32×32 pixels (fig. 3).

Fig. 3. Some of the training examples used in the experiments

Our training set consisted of 10 classes (the digits from 0 to 9), each represented by 40 images. The testing set was created from 400 images not included in the training set. Figure 4 illustrates the classifiers obtained for each node of the decision tree created by the training procedure. The classes were separated in the following sequence: 6/-, 0/-, 4/-, 1/-, 7/-, 9/-, 3/-, 5/-, 2/8 (fig. 2).

In order to compare our two-stage approach with a standard method, we ran the same test using the sequential classifier based on the typical Fisher algorithm (1). In this case, however, we had to replace the covariance matrix Σ with the identity matrix I; note that with Σ = I formula (2) reduces to d_opt = Δ = µ1 − µ2. The replacement was necessary because the 32×32 images were flattened into feature vectors of 1024 elements, so the covariance matrix built from them was singular. If we decided to use formula (2) directly, we would have to collect at least 1024 training images for every class, which is practically impossible.

We carried out two experiments on the data described above; in the second one the original testing set was used as the training set and vice versa. The results are summarised in Table I. It is noticeable that our two-stage classifier outperforms the traditional Fisher criterion in terms of speed, while the recognition rates are similar and satisfactory for both methods. Two examples of wrong decisions made by the two-stage algorithm are presented in fig. 5.

Fig. 4. The classifiers obtained for the subsequent nodes of the decision tree (two-stage method, 1st experiment); panels a)-i) correspond to the classes "6", "0", "4", "1", "7", "9", "3", "5" and "2"

Fig. 5. Two examples of mistakes made by the two-stage classifier

Obviously, much more accurate results could be achieved if a better version of the multi-class algorithm were used; for example, a highly effective sequential classifier was proposed in [8]. The aim of this work, however, was only to prove that the two-stage Fisher classifier can be successfully applied to realistic multi-class recognition problems.
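As promised above, here is a minimal traversal sketch for the decision tree of fig. 2, again built on the illustrative structures from the previous listings. Each node stores one two-stage classifier; in agreement with the analysis above, classification performs two projections per visited node, hence at most 2(L − 1) in total. The node layout is our assumption, not the authors' data structure.

    // Hypothetical node of the decision tree (fig. 2).  Every node separates
    // one class from the rest; the final node (e.g. the 2/8 split) carries
    // two leaves.
    struct TreeNode {
        VectorXd dM;                     // first-stage projection vector
        FisherClassifier fc;             // second-stage direction d and threshold alpha
        int separatedClass;              // leaf reached when the node decides "c_i"
        int lastClass = -1;              // second leaf of the final node
        const TreeNode* rest = nullptr;  // subtree for the still-mixed classes
    };

    // Classification by tree traversal: at each node, projection (5)
    // followed by the decision rule (3).
    int classify(const TreeNode* node, const MatrixXd& image) {
        for (;;) {
            VectorXd y = image.transpose() * node->dM;  // first projection, (5)
            if (node->fc.classify(y) == 1)              // second projection, rule (3)
                return node->separatedClass;
            if (!node->rest)                            // final node: both branches are leaves
                return node->lastClass;
            node = node->rest;
        }
    }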
It is evident that the system presented here could be further optimised to produce better results.

Table I. Experimental results

                     Two-stage classifier          Standard Fisher criterion (1) with identity matrix
                     Training  Testing  Average    Training  Testing  Average
                     set       set                 set       set
1st experiment       87%       73%      80%        84%       77%      80.5%
2nd experiment       88%       77%      82.5%      84%       81%      82.5%
Training time        392 s                         1041 s
Testing time         5 s                           50 s

CONCLUSION

The results listed in Table I indicate that the training of the new classifier is about three times faster than the training based on the standard Fisher criterion, and that the recognition time is reduced from 50 s to 5 s. After projection onto d_opt the patterns are closely packed and their distribution is almost Gaussian (fig. 4), so the use of the Fisher criterion is fully justified. Our experiments showed that the two-stage Fisher classifier can serve as an efficient basis for a versatile pattern recognition system. We are aware of the drawbacks of the approach presented in this paper, so in our future work we will concentrate on developing a more effective implementation of the proposed classifier.

LITERATURE

[1] G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, John Wiley & Sons, Inc., 1992.
[2] J. Sammon, An Optimal Discriminant Plane, IEEE Transactions on Computers, vol. C-19, pp. 826-829, 1970.
[3] Ch. Liu, H. Wechsler, A Shape- and Texture-Based Enhanced Fisher Classifier for Face Recognition, IEEE Transactions on Image Processing, vol. 10, no. 4, 2001.
[4] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. Fisherfaces: Recognition Using Class Specific Linear Projection, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 19, no. 7, 1997.
[5] W. Malina, On an Extended Fisher Criterion for Feature Selection, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 3, no. 5, pp. 611-614, 1981.
[6] T. Okada, S. Tomita, An Extended Fisher Criterion for Feature Extraction - Malina's Method and Its Problems, Electronics and Communications in Japan, vol. 67-A, no. 6, pp. 10-17, 1984.
[7] M. Smiatacz, W. Malina, Modifying the Input Data Structure for Fisher Classifier, 2nd Conference on Computer Recognition Systems (KOSYR'2001), pp. 363-367, 2001.
[8] A. Kołakowska, W. Malina, Application of Fisher Sequential Classifier to Digit Recognition, Proceedings of the Sixth International Conference on Pattern Recognition and Information Processing (PRIP'2001), pp. 213-217, 2001.