Face Identification by a Cascade of Rejection Classifiers

Quan Yuan, Ashwin Thangali, Stan Sclaroff
Computer Science, Boston University, Boston, MA 02215
{yq,tvashwin,sclaroff}@cs.bu.edu

Abstract

Nearest neighbor search is commonly employed in face recognition, but it does not scale well to large dataset sizes. This paper proposes a strategy for combining rejection classifiers into a cascade for face identification. A rejection classifier for a pair of classes is defined to reject at least one of the classes with high confidence. These rejection classifiers are able to share discriminants in feature space while retaining high confidence in the rejection decision. In the face identification problem, a pair of known individual faces may be very dissimilar, and it is very unlikely that both of them are close to an unknown face in the feature space; hence, only one of them needs to be considered. Using a cascade structure of rejection classifiers, the scope of the nearest neighbor search can be reduced significantly. Experiments on Face Recognition Grand Challenge (FRGC) version 1 data demonstrate that the proposed method achieves a significant speedup and an accuracy comparable with the brute force Nearest Neighbor method. In addition, a graph cut based clustering technique is employed to demonstrate that the pairwise separability of these rejection classifiers is capable of semantic grouping.

1. Introduction

Face recognition is an important application for surveillance and access control.
In [12], three primary face recognition tasks are listed:

• Verification: Am I who I claim to be?
• Identification: Who am I?
• Watch list: Are you looking for me?

The verification task is a binary classification problem, i.e., the answer is yes or no. The identification task is a multiclass classification problem: only one out of many possible hypotheses can be correct. The watch list task can be thought of as an identification problem combined with a verification problem. The first step, determining whether an individual resembles someone on the watch list, is a multiclass problem; the second step, verifying that the match is correct, is a binary problem.

Previous works [2, 11, 18] propose similarity measures to solve the recognition problem. For the verification task, the test input is compared with one or many image samples of the claimed ID, and the similarity score is then compared with an acceptance threshold to decide yes or no. Intuitively, a good similarity measure should give similar face pairs high scores and dissimilar face pairs low scores. However, since the similarity measure is global over all pairs of faces, it needs to be robust to changes in extrinsic parameters like pose and illumination, and in intrinsic parameters like expression, hair, glasses and other appearance changes. Obtaining such a similarity measure is a daunting task. Recently, Viola and Jones proposed a classification technique to distinguish similar from dissimilar face pairs [10].

For the identification task, nearest neighbor search using a similarity measure can be quite expensive for large datasets, since each image in the dataset is compared to the test image to determine the best match. The method proposed in this paper solves the identification task using rejection classifiers. A rejection classifier is a classifier capable of rejecting at least one class with high confidence. One-vs-all Fisher linear discriminants are employed to obtain such rejection classifiers, and these classifiers are combined in a cascade. Each rejection classifier can often reject multiple classes, which makes the proposed cascade structure very efficient. A Nearest Neighbor search is performed on the remaining set of persons to output the top k ranked IDs most similar to a given input face. In our experiments on the FRGC version 1 data, a speedup of up to a factor of three and accuracy comparable to the brute force Nearest Neighbor search are demonstrated.

2. Related Work

Our method originates from [13], which proposed a directed acyclic graph (DAG) of pairwise classifiers for multiclass classification. In their approach, each pairwise classifier was an SVM. In an empirical study [9], the DAGSVM is shown to have very good accuracy when compared with more complex methods. In [5], DAGSVM is used for face recognition.

Figure 1: A DAG for four classes [13]: each node is a pairwise classifier between two classes. A decision rejecting one of the two class labels is made at each node.

At each node of the DAG, a pairwise comparison is made and one of the classes is rejected. If the number of classes is M, then the DAG needs M − 1 comparisons and hence is more efficient than methods like Pairwise Voting [3, 6], which need $\binom{M}{2}$ comparisons. Furthermore, the graph structure need not be stored, since the route that an instance takes can be determined during classification.

It is undesirable to train pairwise SVMs as in [5] for a large number of classes, since the number of SVMs to be trained grows quadratically with the number of classes. For instance, in [5], 9316 SVMs were needed in an experiment with 137 classes. In contrast, we use one-vs-all Fisher linear discriminants [1] to obtain projections in the feature space. The number of projections learnt is linear in the number of classes. Each of the projections can often reject multiple classes, and hence the proposed cascade structure is more efficient than the DAG, which needs exactly M − 1 levels.

3. Cascade of Rejection Classifiers

3.1. Rejection Classifiers

In our framework, each classifier rejects at least one class with high confidence. We call such classifiers rejection classifiers. Given a projection $w$, the threshold for a pair of classes A and B is selected to achieve a high margin between them on the training set. This ensures that a decision to reject a class at a node in the modified DAG is correct with high probability, and hence the rejected class need not be considered further.

The margin between a pair of classes is defined by the separation between the classes in the projected space. Let $x_0$, $x_1$ be the sets of training points for the two classes and $p_0$, $p_1$ be the projections of these points on $w$, with $p_0$ lying to the left of $p_1$. The margin is given by

$$\text{margin} = \frac{2\,(\min(p_1) - \max(p_0))}{\sigma_0 + \sigma_1} \qquad (1)$$

where $\sigma_0$, $\sigma_1$ are the standard deviations of $p_0$ and $p_1$. If the margin is greater than a predefined threshold, the pairwise classifier is retained. The threshold $t$ that separates the two classes is placed in the middle of the gap,

$$t = \frac{\min(p_1) + \max(p_0)}{2}. \qquad (2)$$

The rejection classifiers obtained in this way can be regarded as a subset of general pairwise classifiers.
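To make Eqns. (1) and (2) concrete, the following minimal sketch (our illustration, not code from the paper; the function and variable names are ours) checks whether a pair of classes is separable with sufficient margin along a projection $w$ and, if so, returns the margin and the separating threshold:

```python
import numpy as np

def pair_margin_and_threshold(X0, X1, w, min_margin=3.0):
    """Margin (Eqn. 1) and separating threshold (Eqn. 2) for two classes
    along projection w. Returns None when the margin is too small, i.e.
    no rejection classifier is retained for this pair."""
    p0, p1 = X0 @ w, X1 @ w            # project both training sets on w
    if p0.max() > p1.min():            # ensure p0 lies to the left of p1
        p0, p1 = p1, p0
    gap = p1.min() - p0.max()          # separation between the two classes
    margin = 2.0 * gap / (p0.std() + p1.std())
    if margin <= min_margin:
        return None
    t = (p1.min() + p0.max()) / 2.0    # threshold in the middle of the gap
    return margin, t
```

The default `min_margin=3.0` mirrors the value reported as a reasonable tradeoff in Section 5.1.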
3.2. Sharing Discriminants Among Classifiers

To reduce the number of classifiers that need to be trained, we observe that many pairwise classifiers can share the same linear discriminant. For instance, if person A can be separated from persons B and C with sufficient margin by a Fisher discriminant, it is possible that the rejection classifiers (A,B), (A,C) and (D,B) can all share the same discriminant, as shown in Figure 2.

Figure 2: Synthetic example of sharing discriminants among rejection classifiers. The classifier A vs B can share the same projection as A vs C and D vs B. For points projected to P1, B and C can be rejected. For points projected to P2, A and B can be rejected. For points projected to P3, A and D can be rejected.

Ideally, an optimal projection is one that separates the greatest number of class pairs. However, it is difficult to search for such an optimal projection in the space of all possible projections. We instead use the Fisher linear discriminant obtained by solving the One-vs-All classification problem, which maximizes the separation between a particular class (the positive class) and the other classes (the negative classes). Although the negative classes tend to overlap on this projection, they have good separation from the positive class.

Given the One-vs-All discriminant, each negative class is checked for whether a sufficient margin (Eqn. 1) is achieved with the positive class. If so, the rejection classifier for that negative class is given by a threshold along the discriminant (Eqn. 2). The threshold to reject the positive class is defined as the negative class threshold farthest from the positive class, so as to maximize the confidence in this rejection. For negative classes without sufficient margin, no rejection classifiers are obtained; however, since there are M projections to be checked, it is possible that these classes are well separated in other projections. Another advantage of using the One-vs-All discriminant to find a projection shared among many classifiers is the reduced chance of over-fitting and of ill-conditioned matrices (during the Fisher linear discriminant computation) that may result from the small number of training samples per class.
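The construction of a single shared-discriminant stage might look like the sketch below (our illustration under simplifying assumptions: `pair_margin_and_threshold` is the helper defined above, and only negative classes projecting entirely to the left of the positive class are kept; the paper itself does not fix an orientation convention):

```python
def fisher_direction(Xpos, Xneg, reg=1e-6):
    """Two-class Fisher discriminant: class i vs the pooled other classes.
    A small ridge term guards against ill-conditioned scatter matrices."""
    Sw = np.cov(Xpos.T) + np.cov(Xneg.T) + reg * np.eye(Xpos.shape[1])
    w = np.linalg.solve(Sw, Xpos.mean(axis=0) - Xneg.mean(axis=0))
    return w / np.linalg.norm(w)

def build_stage(i, X_by_class, min_margin=3.0):
    """One cascade stage for positive class i: a shared discriminant plus
    rejection thresholds for the negative classes that clear the margin."""
    Xpos = X_by_class[i]
    Xneg = np.vstack([X for j, X in X_by_class.items() if j != i])
    w = fisher_direction(Xpos, Xneg)
    if (Xpos @ w).mean() < (Xneg @ w).mean():
        w = -w                          # orient w so class i projects to the right
    ppos = Xpos @ w
    neg_thresh = {}                     # class j is rejected when x_p > t_j
    for j, Xj in X_by_class.items():
        if j == i or (Xj @ w).max() >= ppos.min():
            continue                    # overlapping or right-side class: skip
        res = pair_margin_and_threshold(Xj, Xpos, w, min_margin)
        if res is not None:
            neg_thresh[j] = res[1]
    # class i itself is rejected left of the farthest (smallest) threshold
    pos_thresh = min(neg_thresh.values()) if neg_thresh else None
    return w, neg_thresh, pos_thresh
```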
4. Face Identification via a Cascade of Rejection Classifiers

In the identification problem, each person in the set of known IDs constitutes a class. Rejection classifiers are used in a cascade to reject as many classes as possible with high confidence. We are then left with a much smaller set of classes, on which we can employ an expensive technique like Nearest Neighbor classification to obtain the best class. This is more efficient than brute force Nearest Neighbor search over the whole training set, since many classes have already been rejected by the cascade.

For the nearest neighbor step, any similarity measure suitable for faces can be used. Zhao et al. [19] give a survey of many successfully used similarity measures. The focus of this paper is to demonstrate the significant speedup achieved by the cascade of rejection classifiers. We demonstrate this using Euclidean distance in the eigenspace as in [17], and show a significant speedup over nearest neighbor classification without much loss in accuracy. Researchers in the past [7, 14] have proposed indexing techniques to speed up nearest neighbor search; such techniques do not scale well to high dimensions, with their utility typically limited to fewer than 20 dimensions. We use a 150 dimensional feature space, and hence such techniques are not applicable.

4.1. Training and Ordering Rejection Classifiers

As described in Section 3.2, for each class $i$, a classifier $C_0^i$ that rejects $i$ and a set of classifiers $\{C_j^i\}$ that reject a subset of the remaining classes are trained. All these classifiers share the same linear discriminant $w_i$ but use different thresholds. Each such set of rejection classifiers $\{C_0^i, C_j^i\}$ forms a stage in the cascade. The total number of cascade stages trained is at most M (the number of classes), since some projections are not able to separate any class pair with sufficient margin. The stages are sorted in decreasing order of the number of classes they can reject (given by $|\{C_0^i, C_j^i\}|$). This ordering greedily rejects as many classes as possible. We refer to this sequence of classifiers as a Cascade of Rejection Classifiers (CRC); a sketch of the construction follows below.

The frequency of the number of rejection classifiers in a stage of the cascade is shown in Figure 3. This distribution shows that, in practice, a significant number of classes are well separated from the other classes. This provides further justification that the cascade can exploit the separable nature of the classes to significantly narrow down the set of candidate classes.

Figure 3: Frequency of the number of classes rejected with high margin at a stage of the CRC during training on the FRGC dataset. The x-axis represents the number of classes separated with sufficient margin in a stage of the cascade; the y-axis represents the number of cascade stages obtained during training with the corresponding number of rejection classifiers. On average, 223 of the total 263 cascade stages are able to reject at least one class with a large margin.
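Assembling and ordering the cascade can then be done with a sketch like the following (again our illustration; `build_stage` is the hypothetical helper above). Sorting by the number of classes a stage can reject, counting $C_0^i$ as one more, gives the greedy ordering described in the text:

```python
def train_crc(X_by_class, min_margin=3.0):
    """Train one stage per class and sort stages so that those rejecting
    the most classes come first (the greedy ordering of Section 4.1)."""
    stages = []
    for i in X_by_class:
        w, neg_thresh, pos_thresh = build_stage(i, X_by_class, min_margin)
        if neg_thresh:                  # keep only stages that reject something
            stages.append((i, w, neg_thresh, pos_thresh))
    stages.sort(key=lambda s: len(s[2]) + 1, reverse=True)   # +1 for C_0^i
    return stages
```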
4.2. Classification of Test Inputs

There are two steps to classify a test input. The CRC is first applied to reduce the scope of possible classes to a small set; the Nearest Neighbor method is then used to determine the best class. These steps are illustrated in Figure 4 and described below.

CRC classification: Given a test input x, classifiers from the cascade are considered in the sequence described above. At each level of the cascade, the stage corresponding to an as yet unrejected class i is chosen. The test input x is projected on $w_i$, $x_p = w_i^T x$, and the rejection classifiers $\{C_0^i, C_j^i\}$ are run. Each rejection classifier is just a comparison with a threshold and hence is very fast. Typically, we expect some classes to be rejected at each iteration, so the number of levels needed is smaller than M, the number of classes (IDs). In the worst case, none of the classes is rejected and M levels may be needed. In our experiments with the FRGC version 1 data, we found that the average number of levels in the cascade was less than $2M/3$. The set of possible classes is obtained at the end of this step.

NN classification: Given the set of possible classes, Nearest Neighbor search is performed on the training samples from these classes to determine the best class.

Figure 4: Classification of a test input using a Cascade of Rejection Classifiers (CRC). A projection whose positive class c has not yet been rejected is selected; its stage may reject class c and/or some of the other classes. This repeats while unused projections remain, after which Nearest Neighbor search is run on the remaining classes.
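At test time, the cascade traversal followed by the restricted Nearest Neighbor search might be sketched as follows (our illustration, continuing the hypothetical helpers above; `stages` is the output of `train_crc`, and `train_X`, `train_y` hold the training samples and their IDs):

```python
def crc_classify(x, stages, all_classes, train_X, train_y):
    """Run the cascade to prune classes, then NN search on the survivors."""
    alive = set(all_classes)
    for i, w, neg_thresh, pos_thresh in stages:
        if i not in alive:
            continue                    # skip stages of already rejected classes
        xp = float(w @ x)               # a single projection per visited stage
        for j, t in neg_thresh.items():
            if j in alive and xp > t:   # x falls on class i's side: reject j
                alive.discard(j)
        if pos_thresh is not None and xp < pos_thresh:
            alive.discard(i)            # x falls far on the negative side: reject i
    mask = np.isin(train_y, list(alive))
    d = np.linalg.norm(train_X[mask] - x, axis=1)
    return train_y[mask][np.argsort(d)]  # ranked samples of surviving classes
```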
5. Experiments

The performance of the proposed technique is evaluated on the FRGC version 1 dataset of still images. This dataset contains 263 subjects with approximately 21 samples per person. The face detector [15] (http://vasc.ri.cmu.edu/NNFaceDetector/) was able to detect 5472 faces from a total of 5658 images. To accurately align the faces, a PCA subspace of 50 dimensions, learned from 300 correctly aligned faces, is used: the reconstruction error is computed to find the best orientation, scale and translation for each detected face. The aligned faces are resized to 100 × 100.

Figure 5: Top 5 eigenfaces for alignment.

Figure 6: Examples of alignment. The first row shows the face images from the face detector; the second row shows the aligned faces.

For face identification, we determined empirically that the 150 dimensions with the largest eigenvalues in the PCA subspace give the best accuracy for both the Fisher discriminant and nearest neighbor methods. The PCA space is learned from a subset of 1600 randomly selected images, which allows the computation of eigenfaces in reasonable time. The focus of our work is to evaluate the speedup over nearest neighbor classification, and we hence chose a simple set of features obtained by projections on eigenfaces. More sophisticated features [19] have been shown to achieve significantly higher performance, and it is part of our future work to incorporate them.

5.1. Experimental Setup

We compare the performance of the proposed technique with brute force Nearest Neighbor search for different sizes of the test set. These experiments demonstrate that the proposed method gives a significant speedup and achieves accuracy close to the Nearest Neighbor method.

Nearest Neighbor: Euclidean distance is used to rank all faces in the training set, as in [17]. The ranked samples are mapped to their corresponding classes to obtain a ranked list of face IDs (see the sketch following this subsection).

CRC: The choice of the minimum acceptable margin (Eqn. 1) strongly influences the tradeoff between accuracy and speedup. Typically, increasing the margin leads to fewer errors in the rejection classifiers, but also to a larger set of possible classes in the Nearest Neighbor classification step. We tried different margins and found that, on the FRGC version 1 data set, a minimum margin of 3 gives a reasonable tradeoff. For CRC, the ranked output is obtained by nearest neighbor search on the set of classes remaining after running the cascade.
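Both methods return a ranking over training samples that must be mapped to a ranking over distinct IDs. A minimal sketch of the baseline ranking (our illustration, reusing the NumPy import above):

```python
def rank_ids(x, train_X, train_y):
    """Rank training samples by Euclidean distance to x in the eigenspace
    and keep the first occurrence of each ID to get a ranked list of IDs."""
    d = np.linalg.norm(train_X - x, axis=1)
    ranked = train_y[np.argsort(d)]
    _, first = np.unique(ranked, return_index=True)  # first hit per ID
    return ranked[np.sort(first)]                    # IDs in rank order
```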
5.2. Identification Performance

The performance of the techniques and the speedup achieved by the CRC are shown in Figure 7. The CRC reduces the scope of the nearest neighbor search by up to 85% without significant loss in accuracy. A summary of statistics for the CRC is given in Table 1.

Test set size as fraction of dataset          | 10%   | 21%   | 32%   | 43%   | 65%
CRC speedup                                   | 1.58  | 1.58  | 1.60  | 2.38  | 3.35
Fraction of training samples for NN search    | 39%   | 34%   | 29%   | 25%   | 14%
# of classes left for NN search               | 72    | 63    | 55    | 45    | 25
# of projections in CRC                       | 148   | 157   | 158   | 169   | 169
CRC training time                             | 831s  | 726s  | 654s  | 519s  | 230s
CRC time per test                             | 0.04s | 0.03s | 0.03s | 0.03s | 0.02s

Table 1: Performance statistics of the proposed method (CRC) for different sizes of the test set, averaged over 5 trials. The second row gives the speedup achieved by the proposed algorithm over brute force NN. The third row gives the fraction of the training set samples over which NN search is performed. The fourth row gives the number of classes remaining after the rejection step. The fifth row gives the average number of linear projections performed in the CRC and corresponds to the average number of levels in the cascade; in the worst case the CRC may need to perform 263 projections. The sixth and seventh rows give the training and test times.

The errors in the CRC method may arise from two sources:

• errors induced during the rejection of classes by linear discriminants at each stage of the cascade, and
• errors in the Nearest Neighbor search on the subset of remaining (confused) classes.

These errors are shown in Table 2. The fraction of errors due to the cascade increases with larger test set sizes, since the confidence in rejection for the classifiers in the cascade decreases for smaller training sets.

Test set size as fraction of dataset | 10%        | 21%        | 32%        | 43%        | 65%
CRC Cascade error                    | 30 (4.5%)  | 82 (6.3%)  | 125 (6.8%) | 225 (9.1%) | 694 (19.0%)
CRC NN error                         | 98 (14.9%) | 208 (16.0%)| 322 (17.5%)| 487 (19.6%)| 839 (23.0%)
CRC Total error (rank 1)             | 19.4%      | 22.3%      | 24.3%      | 28.7%      | 42.0%
Brute force NN error (rank 1)        | 19.6%      | 22.3%      | 24.9%      | 29.1%      | 41.9%

Table 2: Errors at rank 1 induced by the Cascade and NN steps in the CRC method. The brute force NN search error is also shown for comparison.

Figure 7: Comparison of face identification accuracy for the CRC and Nearest Neighbor methods with different test set sizes on FRGC version 1 data (panels for test set sizes of 10%, 21%, 32% and 43% of the dataset). The y-axis represents the percentage of test examples with the correct class ranked within the corresponding value on the x-axis. "Speedup" gives the speedup in execution time of the CRC over Nearest Neighbor search on the test set; "#NN" gives the fraction of nearest neighbors evaluated by the CRC compared to brute force NN search. The whole dataset contains 5472 images of 263 subjects with 21 samples per subject. The results are averaged over 5 trials.

6. Discovering Semantic Groups of Faces

Humans are able to distinguish male from female, young from old, or Asian from European even when the identity of the person is unknown. This suggests that there may be basic "classifiers" in the "feature space" of human perception that carry semantic meaning. Since we have defined pairwise classifiers between subjects, we are interested in whether any semantic information is captured by these classifiers.

We consider each person as a node in a complete graph G. The edge weight for a pair of nodes is determined by their separation in the projected space and is given by

$$w_{ij} = \frac{\sigma_i + \sigma_j}{|m_i - m_j|}$$

where $\sigma_i$, $\sigma_j$ are the standard deviations and $m_i$, $m_j$ are the means of the two classes in the projected space. For each pair of classes there are two projections, defined by their corresponding One-vs-All linear discriminants; the edge weight is given by the minimum of the two values, since this corresponds to the best separation. Given G, we use the Normalized Graph Cuts [16] algorithm to separate the faces into clusters of roughly equal size.

The clusters obtained are shown in Figure 8. When the subjects are clustered into 4 groups, one group of female faces clearly stands out. Most of the Asian subjects appear in another group, although this group is not as distinct as the female group. It appears that the discriminative information between female and male faces is easily captured by linear discriminants in the PCA feature space. Clustering classes may also be useful for finding better projections between groups of classes; it is interesting future work to see how this can improve the separability between classes.

Figure 8: Clustering faces of people with Normalized Graph Cuts using pairwise separability: (a)–(d) show Clusters 1–4. Cluster 1 captures mainly female subjects, Cluster 2 mainly Asian subjects, and Cluster 3 mainly white subjects.
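As an illustration of the clustering step (not the authors' code), the affinity matrix and the grouping could be computed as below; scikit-learn's `SpectralClustering` is used as a stand-in for the normalized cuts algorithm of [16], and `directions` is assumed to map each class to its One-vs-All Fisher direction:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def semantic_clusters(X_by_class, directions, n_clusters=4):
    """Pairwise affinity w_ij = (sigma_i + sigma_j)/|m_i - m_j|, minimized
    over the two One-vs-All projections, followed by spectral clustering."""
    ids = sorted(X_by_class)
    W = np.zeros((len(ids), len(ids)))
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            vals = []
            for w in (directions[ids[a]], directions[ids[b]]):
                pa, pb = X_by_class[ids[a]] @ w, X_by_class[ids[b]] @ w
                vals.append((pa.std() + pb.std())
                            / (abs(pa.mean() - pb.mean()) + 1e-12))
            W[a, b] = W[b, a] = min(vals)   # smaller weight = better separated
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity='precomputed').fit_predict(W)
    return dict(zip(ids, labels))
```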
7. Conclusion and Future Work

In this paper, an efficient technique for combining rejection classifiers is proposed to speed up the NN search for face identification. The margin maintained by each rejection classifier ensures accuracy during the rejection process. The cascade is able to narrow the search and hence achieves a significant speedup over NN search. The separability obtained with these classifiers is also employed to find semantic groups via spectral clustering.

In the experiments, performance was compared with brute-force NN retrieval. It should be noted that indexing structures like R-trees and k-d trees [7, 8, 14], which can be used to make nearest neighbor retrieval faster, are inappropriate in high-dimensional feature spaces like those encountered in PCA-based face recognition. However, more recent techniques like locality sensitive hashing (LSH) [4] may allow faster NN retrieval, and an experimental comparison of our approach with LSH-based NN retrieval will be conducted in future work. We also plan to investigate more sophisticated similarity measures and improved feature extraction methods.

References

[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 19(7):711–720, 1997.
[2] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 23(6):681–685, 2001.
[3] J. H. Friedman. Another approach to polychotomous classification. Technical report, Dept. of Statistics, Stanford University, 1996.
[4] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. International Conf. on Very Large Data Bases, pages 518–529, 1999.
[5] G. Guo, S. Z. Li, and K. Chan. Face recognition by support vector machines. In Proc. IEEE International Conf. on Face and Gesture Recognition (FGR), pages 196–201, 2000.
[6] G. Guo, H. Zhang, and S. Z. Li. Pairwise face recognition. In Proc. IEEE International Conf. on Computer Vision (ICCV), volume 2, pages 282–287, 2001.
[7] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM International Conf. on Management of Data (SIGMOD), pages 47–57, 1984.
[8] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Trans. on Database Systems, 28(4):517–580, 2003.
[9] C. Hsu and C. Lin. A comparison of methods for multi-class support vector machines. IEEE Trans. Neural Networks, 13:415–425, 2002.
[10] M. Jones and P. Viola. Face recognition using boosted local features. Technical Report TR2003-025, Mitsubishi Electric Research Laboratories, 2003.
[11] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. Overview of the face recognition grand challenge. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.
[12] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and J. M. Bone. Face recognition vendor test 2002: Overview and summary. Technical report, http://www.frvt.org, 2003.
[13] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Proc. Advances in Neural Information Processing Systems, pages 547–553, 1999.
[14] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. ACM International Conf. on Management of Data (SIGMOD), pages 71–79, 1995.
[15] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 20(1):23–38, 1998.
[16] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 22(8):888–905, 2000.
[17] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.
[18] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 19(7):775–779, 1997.
[19] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.