Face Identification by a Cascade of Rejection Classifiers
Quan Yuan, Ashwin Thangali, Stan Sclaroff
Computer Science, Boston University,
Boston, MA 02215,
{yq,tvashwin,sclaroff}@cs.bu.edu
Abstract

Nearest neighbor search is commonly employed in face recognition, but it does not scale well to large dataset sizes. This paper proposes a strategy for combining rejection classifiers into a cascade for face identification. A rejection classifier for a pair of classes is defined to reject at least one of the classes with high confidence. These rejection classifiers are able to share discriminants in feature space while maintaining high confidence in the rejection decision. In the face identification problem, a pair of known individual faces may be very dissimilar, and it is very unlikely that both are close to an unknown face in the feature space; hence, only one of them needs to be considered. Using a cascade structure of rejection classifiers, the scope of the nearest neighbor search can be reduced significantly. Experiments on Face Recognition Grand Challenge (FRGC) version 1 data demonstrate that the proposed method achieves a significant speedup with accuracy comparable to the brute force Nearest Neighbor method. In addition, a graph cut based clustering technique is employed to demonstrate that the pairwise separability of these rejection classifiers is capable of semantic grouping.

1. Introduction

Face recognition is an important application for surveillance and access control. In [12], three primary face recognition tasks are listed:

• Verification: Am I who I claim to be?
• Identification: Who am I?
• Watch list: Are you looking for me?

The verification task is a binary classification problem, i.e., the answer is yes or no. The identification task is a multiclass classification problem: only one out of many possible hypotheses can be correct. The watch list task can be thought of as an identification problem combined with a verification problem. The first step, determining whether an individual resembles someone on the watch list, is a multiclass problem; the second step, verifying that the match is correct, is a binary problem.

Previous works [2, 11, 18] propose similarity measures to solve the recognition problem. For the verification task, the test input is compared with one or more image samples of the claimed ID, and the resulting similarity score is compared with an acceptance threshold to decide yes or no. Intuitively, a good similarity measure should give similar face pairs high scores and dissimilar face pairs low scores. However, since the similarity measure is global for all pairs of faces, it needs to be robust to changes in extrinsic parameters like pose and illumination, and in intrinsic parameters like expression, hair, glasses and other appearance changes. Obtaining such a similarity measure is a daunting task. Recently, Jones and Viola proposed a classification technique to distinguish similar from dissimilar face pairs [10]. For the identification task, nearest neighbor search using a similarity measure can be quite expensive for large datasets, since each image in the dataset is compared to the test image to determine the best match.

The method proposed in this paper solves the identification task using rejection classifiers. A rejection classifier is a classifier capable of rejecting at least one class with high confidence. One-vs-all Fisher linear discriminants are employed to obtain such rejection classifiers, and these classifiers are combined in a cascade. Each rejection classifier can often reject multiple classes, and hence the proposed cascade structure is very efficient. A Nearest Neighbor search is performed on the remaining set of persons to output the top k ranked IDs most similar to a given input face.

In our experiments on the FRGC version 1 data, we demonstrate speedups of up to a factor of three with accuracy comparable to the brute force Nearest Neighbor search.
2. Related Work
Our method builds on [13], where a directed acyclic graph (DAG) of pairwise classifiers is proposed for multiclass classification. In their approach, each pairwise classifier is an SVM. In an empirical study [9], the DAGSVM is shown to have very good accuracy when compared with more complex methods. In [5], DAGSVM is used for face recognition.
Figure 1: A DAG for four classes [13]. Each node is a pairwise classifier between two classes ("1 vs 2", ..., "3 vs 4"), and a decision to reject one of the two class labels is made at each node.

At each node of the DAG, a pairwise comparison is made and one of the classes is rejected. If the number of classes is M, then the DAG needs M − 1 comparisons and hence is more efficient than methods like Pairwise Voting [3, 6], which need M(M − 1)/2 comparisons. Furthermore, the graph structure need not be stored, since the route that an instance takes can be determined during classification.

It is undesirable to train pairwise SVMs as in [5] for a large number of classes, since the number of SVMs to be trained grows quadratically with the number of classes. For instance, in [5] 9316 SVMs were needed in an experiment with 137 classes. In contrast, we use one-vs-all Fisher linear discriminants [1] to obtain projections in the feature space. The number of projections learned is linear in the number of classes. Each of the projections can often reject multiple classes, and hence the proposed cascade structure is more efficient than the DAG, which needs exactly M − 1 levels.
3. Cascade of Rejection Classifiers

3.1. Rejection Classifiers

In our framework each classifier rejects at least one class with high confidence. We call such classifiers rejection classifiers.

Given a projection w, the threshold for a pair of classes A and B is selected to achieve a high margin between them on the training set. This ensures that a decision to reject a class at a node in the modified DAG is correct with high probability, and hence the rejected class need not be considered further. The margin between a pair of classes is defined by the separation between the classes in the projected space. Let x_0, x_1 be the sets of training points for the two classes and p_0, p_1 be the projections of these points on w, with p_0 lying to the left of p_1. The margin is given by

    margin = 2(min(p_1) − max(p_0)) / (σ_0 + σ_1)        (1)

where σ_0, σ_1 are the standard deviations of p_0 and p_1. If the margin is greater than a predefined threshold, the pairwise classifier is retained. The threshold t to separate the classes is given by

    t = (min(p_1) + max(p_0)) / 2.        (2)

The rejection classifiers obtained in this way can be regarded as a subset of general pairwise classifiers.
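To make Eqns. 1 and 2 concrete, the following is a minimal Python sketch of the margin and threshold computation for one pair of classes; the function and array names are ours, and the midpoint form of the threshold is our reading of Eqn. 2.

```python
import numpy as np

def margin_and_threshold(p0, p1):
    """Margin (Eqn. 1) and rejection threshold (Eqn. 2) for two classes
    projected onto a shared discriminant w.  p0 and p1 are 1-D arrays of
    projected training points, with p0 assumed to lie to the left of p1."""
    gap = np.min(p1) - np.max(p0)                   # separation of the projected classes
    margin = 2.0 * gap / (np.std(p0) + np.std(p1))  # Eqn. 1
    threshold = (np.min(p1) + np.max(p0)) / 2.0     # Eqn. 2: midpoint of the gap
    return margin, threshold
```

A rejection classifier then reduces to a single comparison of the projected test point against this threshold, which is why each node of the cascade is so cheap at test time.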
3.2. Sharing Discriminants Among Classifiers

To reduce the number of classifiers that need to be trained, we observe that many pairwise classifiers can share the same linear discriminant. For instance, if person A can be separated from persons B and C with sufficient margin by a Fisher discriminant, it is possible that the rejection classifiers (A,B), (A,C) and (D,B) can share the same discriminant, as shown in Figure 2.

Figure 2: Synthetic example of sharing discriminants among rejection classifiers. The classifier A vs B can share the same projection as A vs C and D vs B. For points projected to P1, B and C can be rejected. For points projected to P2, A and B can be rejected. For points projected to P3, A and D can be rejected.

Ideally, an optimal projection is one that separates the greatest number of class pairs. However, it is difficult to search for such an optimal projection in the space of all possible projections. We instead use the Fisher linear discriminant obtained by solving the One-vs-All classification problem, which maximizes the separation between a particular class (the positive class) and the other classes (the negative classes). Although the negative classes tend to overlap on this projection, they are well separated from the positive class.

Given the One-vs-All discriminant, each negative class is checked to see whether a certain margin (Eqn. 1) is achieved with the positive class. If so, the rejection classifier for the negative class is given by a threshold along the discriminant (Eqn. 2). The threshold to reject the positive class is defined as the negative class threshold farthest from the positive class, so as to maximize the confidence in this rejection. For negative classes without sufficient margin, no rejection classifiers are obtained; however, since there are M projections to be checked, it is possible that those classes are well separated in other projections.

Another advantage of using the One-vs-All discriminant to find a projection shared among many classifiers is the reduced chance of over-fitting and of ill-conditioned matrices (during Fisher linear discriminant computation) that may result from the small number of training samples per class.
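As an illustration of this construction, here is a hedged Python sketch: a one-vs-all Fisher discriminant followed by per-negative-class rejection thresholds. The ridge term `reg`, the helper names, and the minimum-margin parameter are our assumptions, not the authors' code; for simplicity the sketch keeps only negative classes projecting below the positive class, while a fuller implementation would handle either side.

```python
import numpy as np

def one_vs_all_fisher(X, y, pos_class, reg=1e-6):
    """Fisher discriminant separating pos_class from all other classes.
    X: (n_samples, n_features); y: integer class labels.  `reg` is a small
    ridge term (our addition) guarding against ill-conditioned scatter."""
    Xp, Xn = X[y == pos_class], X[y != pos_class]
    Sw = np.cov(Xp, rowvar=False) * (len(Xp) - 1) \
       + np.cov(Xn, rowvar=False) * (len(Xn) - 1)   # within-class scatter
    Sw += reg * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, Xp.mean(axis=0) - Xn.mean(axis=0))
    return w / np.linalg.norm(w)

def rejection_thresholds(X, y, pos_class, w, min_margin=3.0):
    """Per-class rejection thresholds along w (Eqns. 1 and 2); a negative
    class is kept only if its margin against the positive class is large
    enough.  Assumes several samples per class."""
    pp = X[y == pos_class] @ w
    thresholds = {}
    for c in np.unique(y):
        if c == pos_class:
            continue
        pn = X[y == c] @ w
        if pn.mean() >= pp.mean():
            continue                                # simplification: skip classes above
        gap = pp.min() - pn.max()
        if gap > 0 and 2.0 * gap / (pn.std() + pp.std()) > min_margin:
            thresholds[c] = (pn.max() + pp.min()) / 2.0   # Eqn. 2
    return thresholds
```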
4. Face Identification via a Cascade of Rejection Classifiers

In the identification problem, each person in the set of known IDs constitutes a class. Rejection classifiers are used in a cascade to reject as many classes as possible with high confidence. We are then left with a much smaller set of classes and can employ an expensive technique like Nearest Neighbor classification to obtain the best class. This is more efficient than brute force Nearest Neighbor search over the whole training set, since many classes have already been rejected by the cascade.

For the nearest neighbor method, any similarity measure suitable for faces can be used. Zhao et al. [19] give a survey of many successfully used similarity measures. The focus of this paper is to demonstrate the significant speedup achieved by the cascade of rejection classifiers. We demonstrate this using Euclidean distance in the eigenspace as in [17] and show significant speedup over nearest neighbor classification without much loss in accuracy.

Researchers have proposed indexing techniques to speed up nearest neighbor search [7, 14]. Such techniques do not scale well to high dimensions, with utility typically limited to fewer than 20 dimensions. We use a 150-dimensional feature space, and hence such techniques are not applicable.

4.1. Training and Ordering Rejection Classifiers

As described in Section 3.2, for each class i we train a classifier C_0^i that rejects i and a set of classifiers {C_j^i} that reject a subset of the remaining classes. All these classifiers share the same linear discriminant w_i but use different thresholds. Each such set of rejection classifiers {C_0^i, C_j^i} forms a stage in the cascade. The total number of cascade stages trained is at most M (the number of classes), since some of the projections are not able to separate any class pair with sufficient margin. The stages are sorted in decreasing order of the number of classes they can reject (given by |{C_0^i, C_j^i}|). This ordering greedily rejects as many classes as possible. We refer to this sequence of classifiers as a Cascade of Rejection Classifiers (CRC).

The distribution of the number of rejection classifiers per stage of the cascade is shown in Figure 3. This distribution shows that, in practice, a significant number of classes are well separated from the other classes. This further justifies that the cascade can exploit the separable nature of the classes to significantly narrow down the set of candidate classes.

Figure 3: Frequency of the number of classes rejected with high margin at a stage of the CRC during training on the FRGC dataset. The x-axis represents the number of classes separated with sufficient margin in a stage of the cascade. The y-axis represents the frequency of cascade stages obtained during training with the corresponding number of rejection classifiers. On average, 223 of the total 263 cascade stages are able to reject at least one class with a large margin.
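The following sketch assembles and orders the cascade stages as we read Section 4.1, reusing the hypothetical `one_vs_all_fisher` and `rejection_thresholds` helpers from the Section 3.2 sketch; the `Stage` record and all names are illustrative.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Stage:
    pos_class: int
    w: np.ndarray        # shared one-vs-all Fisher discriminant w_i
    thresholds: dict     # negative class -> rejection threshold along w
    t_pos: float         # threshold below which the positive class itself is rejected

def build_cascade(X, y, min_margin=3.0):
    stages = []
    for c in np.unique(y):
        w = one_vs_all_fisher(X, y, c)
        thr = rejection_thresholds(X, y, c, w, min_margin)
        if not thr:
            continue     # this projection separates no class pair with enough margin
        # the positive class is rejected beyond the farthest negative threshold;
        # under the earlier orientation convention that is the smallest one
        stages.append(Stage(c, w, thr, min(thr.values())))
    # greedy ordering: stages that can reject the most classes come first
    stages.sort(key=lambda s: len(s.thresholds) + 1, reverse=True)
    return stages
```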
4.2. Classification of Test Inputs

There are two steps to classify a test input. The CRC is first applied to reduce the scope of possible classes to a small set. The Nearest Neighbor method is then used to determine the best class. These steps are illustrated in Figure 4 and described below.

CRC classification. Given a test input x, classifiers from the cascade are considered in the sequence described above. At each level of the cascade, the stage corresponding to an unrejected class i is chosen. The test input x is projected on w_i, x_p = w_i^T x, and the rejection classifiers {C_0^i, C_j^i} are run. Each rejection classifier is just a comparison with a threshold and hence is very fast.

Typically, we expect that some classes get rejected at each iteration, so the number of levels needed is smaller than M, the number of classes (IDs). In the worst case, none of the classes is rejected and M levels may be needed. In our experiments with the FRGC version 1 data, we found that the average number of levels in the cascade was less than 2M/3. The set of possible classes is obtained at the end of this step.

NN classification. Given the set of possible classes, Nearest Neighbor search is performed on training samples from these classes to determine the best class.

Figure 4: Classification of a test input using a Cascade of Rejection Classifiers (CRC). Projections are selected one at a time, skipping any whose positive class is already rejected; each projection may reject its positive class and/or some other classes. When no unused projections remain, a Nearest Neighbor search is run on the remaining classes.
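A sketch of this two-step classification, under the same illustrative conventions as the training sketches (positive classes oriented above their stored thresholds):

```python
import numpy as np

def crc_classify(x, stages, X_train, y_train, k=5):
    """Cascade pass followed by Nearest Neighbor search on the survivors."""
    alive = set(np.unique(y_train))
    for s in stages:
        if s.pos_class not in alive:
            continue                       # each level picks a stage for an unrejected class
        xp = float(s.w @ x)                # x_p = w_i^T x, one projection per level
        for c, t in s.thresholds.items():
            if c in alive and xp > t:      # x lies on the positive-class side of t
                alive.discard(c)
        if xp < s.t_pos:                   # x is far from the positive class: reject it
            alive.discard(s.pos_class)
    mask = np.isin(y_train, list(alive))   # NN search restricted to surviving classes
    dists = np.linalg.norm(X_train[mask] - x, axis=1)
    ranked = y_train[mask][np.argsort(dists)]
    return ranked[:k]                      # top-k ranked training samples
```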
5. Experiments

The performance of the proposed technique is evaluated on the FRGC version 1 dataset of still images. This dataset contains 263 subjects with approximately 21 samples per person. The face detector [15] (http://vasc.ri.cmu.edu/NNFaceDetector/) was able to detect 5472 faces from a total of 5658 images. To accurately align the faces, a PCA subspace of 50 dimensions learned from 300 correctly aligned faces is used: the reconstruction error in this subspace is computed to find the best orientation, scale and translation for a detected face. The aligned faces are resized to 100 × 100.

Figure 5: Top 5 eigenfaces for alignment.

Figure 6: Examples of alignment. The first row shows the face images from the face detector. The second row shows the aligned faces.

For face identification, we determined empirically that the 150 dimensions with the largest eigenvalues in the PCA subspace give the best accuracy for both the Fisher discriminant and nearest neighbor methods. The PCA space is learned from a subset of 1600 randomly selected images, which allows the computation of eigenfaces in reasonable time.

The focus of our work is to evaluate the speedup over nearest neighbor classification, and we hence chose a simple set of features obtained by projections on eigenfaces. More sophisticated features [19] have been shown to achieve significantly higher performance, and incorporating them is part of our future work.
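As a sketch of the alignment step, here is our reconstruction of the description: a small search over candidate transformations, scored by PCA reconstruction error. The `warp` helper (which rotates, scales, translates and flattens a candidate crop) and the grid ranges are hypothetical, not from the paper.

```python
import numpy as np
from itertools import product

def align_face(img, mean_face, basis, warp):
    """Pick the orientation/scale/translation whose warped crop is best
    reconstructed by the 50-D alignment PCA subspace.  `basis` is
    (n_pixels, 50) with orthonormal columns; `warp` is a hypothetical helper."""
    grid = product(range(-10, 11, 5),   # rotation in degrees (illustrative ranges)
                   (0.9, 1.0, 1.1),     # scale
                   range(-4, 5, 2),     # horizontal shift
                   range(-4, 5, 2))     # vertical shift
    best, best_err = None, np.inf
    for angle, scale, dx, dy in grid:
        v = warp(img, angle, scale, dx, dy) - mean_face
        err = np.linalg.norm(v - basis @ (basis.T @ v))  # reconstruction error
        if err < best_err:
            best, best_err = (angle, scale, dx, dy), err
    return best
```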
5.1. Experimental Setup

We compare the performance of the proposed technique with brute force Nearest Neighbor search for different sizes of the test set. These experiments demonstrate that the proposed method gives significant speedup and achieves accuracy close to the Nearest Neighbor method.

Nearest Neighbor. Euclidean distance is used to rank all faces in the training set, as in [17]. The ranked samples are mapped to their corresponding classes to obtain a ranked list of face IDs.

CRC. The choice of the minimum acceptable margin (Eqn. 1) strongly influences the tradeoff between accuracy and speedup. Typically, increasing the margin leads to fewer errors in the rejection classifiers, but also to a larger set of possible classes in the Nearest Neighbor classification step. We tried different margins and found that on the FRGC version 1 data set a minimum margin of 3 gives a reasonable tradeoff. For CRC, the ranked output is obtained by nearest neighbor search on the set of classes remaining after running the cascade.
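A minimal sketch of the brute-force baseline, ranking training samples by Euclidean distance in the eigenspace and mapping them to a ranked list of unique IDs (function and variable names are ours):

```python
import numpy as np

def rank_ids(x, X_train, y_train):
    order = np.argsort(np.linalg.norm(X_train - x, axis=1))  # nearest samples first
    ranked, seen = [], set()
    for i in order:                       # keep the first occurrence of each ID
        label = int(y_train[i])
        if label not in seen:
            seen.add(label)
            ranked.append(label)
    return ranked                         # ranked list of face IDs
```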
5.2. Identification Performance

The performance of the techniques and the speedup achieved by the CRC are shown in Figure 7. The CRC reduces the scope of the nearest neighbor search by up to 85% without significant loss in accuracy. A summary of statistics for the CRC is given in Table 1.

Figure 7: Comparison of face identification accuracy for the CRC and Nearest Neighbor methods with different test set sizes (10%, 21%, 32% and 43% of the dataset) on FRGC version 1 data. The y-axis represents the percentage of test examples with the correct class ranked within the corresponding value on the x-axis. Speedup gives the speedup in execution time of the CRC over Nearest Neighbor search on the test set. #NN gives the fraction of nearest neighbors evaluated by the CRC compared to brute force NN search. The size of the whole dataset is 5472 images, with 263 subjects and approximately 21 samples per subject. Results are averaged over 5 trials.

Test set size (fraction of dataset)           10%     21%     32%     43%     65%
CRC speedup                                   1.58    1.58    1.60    2.38    3.35
Fraction of training samples for NN search    39%     34%     29%     25%     14%
# of classes left for NN search               72      63      55      45      25
# of projections in CRC                       148     157     158     169     169
CRC training time                             831s    726s    654s    519s    230s
CRC time per test                             0.04s   0.03s   0.03s   0.03s   0.02s

Table 1: Performance statistics of the proposed method (CRC) for different test set sizes, averaged over 5 trials. The second row gives the speedup achieved by the proposed algorithm over brute force NN. The third row gives the fraction of training set samples over which NN search is performed. The fourth row gives the number of classes remaining after the rejection step. The fifth row gives the average number of linear projections performed in the CRC, which corresponds to the average number of levels in the cascade; in the worst case the CRC may need to perform 263 projections. The sixth and seventh rows give the training and test times.

The errors in the CRC method may arise from two sources:

• errors induced during rejection of classes by linear discriminants at each stage of the cascade, and
• errors in the Nearest Neighbor search on the subset of confused classes.

These errors are shown in Table 2. The fraction of errors due to the cascade increases with larger test set sizes, since the confidence in rejection for the classifiers in the cascade decreases for smaller training sets.

Test set size (fraction of dataset)   10%          21%           32%           43%           65%
CRC Cascade error                     30 (4.5%)    82 (6.3%)     125 (6.8%)    225 (9.1%)    694 (19.0%)
CRC NN error                          98 (14.9%)   208 (16.0%)   322 (17.5%)   487 (19.6%)   839 (23.0%)
CRC Total error (rank 1)              19.4%        22.3%         24.3%         28.7%         42.0%
Brute force NN error (rank 1)         19.6%        22.3%         24.9%         29.1%         41.9%

Table 2: Errors at rank 1 induced by the Cascade and NN steps of the CRC method. The brute force NN search error is shown for comparison.
6. Discovering Semantic Groups of Faces

Humans are able to distinguish male from female, young from old, and Asian from European even when the identity of the person is unknown. This suggests that there may be basic "classifiers" in the "feature space" of human perception that carry semantic meaning. Since we have defined pairwise classifiers between subjects, we are interested in whether any semantic information is captured by these classifiers.

We consider each person as a node in a complete graph G. The edge weight for each pair of nodes is determined by their separation in the projected space and is given by

    w_ij = (σ_i + σ_j) / |m_i − m_j|

where σ_i, σ_j are the standard deviations and m_i, m_j are the means of the two classes in the projected space. For each pair of classes there are two projections, defined by their corresponding One-vs-All linear discriminants; the edge weight is taken as the minimum over the two projections, since this reflects the best separation of the pair.

Given G, we use the Normalized Graph Cuts [16] algorithm to separate the faces into clusters of roughly equal size. The clusters obtained are shown in Figure 8. When the faces are clustered into 4 groups, one group of female faces clearly stands out. Most of the Asian subjects appear in another group, although this grouping is not as distinct as the female group. It appears that the discriminant information between females and males is easily captured in the PCA feature space with a linear discriminant.

Figure 8: Clustering faces with Normalized Graph Cuts using pairwise separability. Cluster 1 captures mainly female subjects, cluster 2 captures mainly Asian subjects, and cluster 3 captures mainly white subjects.

Clustering classes can also be useful for finding better projections between groups of classes; investigating how this can improve the separability between classes is interesting future work.
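The following sketch builds the separability graph and clusters it. We substitute scikit-learn's SpectralClustering with a precomputed affinity for the authors' Normalized Graph Cuts implementation [16] (both minimize a normalized-cut criterion), and the `proj_stats` container is hypothetical.

```python
import numpy as np
from sklearn.cluster import SpectralClustering  # stand-in for Normalized Cuts [16]

def separability_graph(proj_stats, n_classes):
    """proj_stats[(i, j)] holds, for each of the pair's two One-vs-All
    projections, the projected means and standard deviations (m_i, s_i, m_j, s_j).
    Edge weight w_ij = (s_i + s_j) / |m_i - m_j|, minimized over the two
    projections, so well-separated pairs get small affinity."""
    W = np.zeros((n_classes, n_classes))
    for (i, j), stats in proj_stats.items():
        w = min((si + sj) / abs(mi - mj) for mi, si, mj, sj in stats)
        W[i, j] = W[j, i] = w
    return W

# e.g. labels = SpectralClustering(n_clusters=4, affinity='precomputed').fit_predict(W)
```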
7. Conclusion and Future Work

In this paper, an efficient technique for combining rejection classifiers is proposed to speed up NN search for face identification. A margin maintained in each rejection classifier ensures accuracy during the rejection process. The cascade is able to narrow the search and hence achieves a significant speedup over NN search. The separability obtained with these classifiers is also employed to discover semantic groupings via spectral clustering.

In the experiments, performance was compared with brute-force NN retrieval. It should be noted that indexing structures like R-trees and K-d trees [7, 8, 14], which can make nearest neighbor retrieval faster, are inappropriate in high-dimensional feature spaces such as those encountered in PCA-based face recognition. However, more recent techniques like locality sensitive hashing (LSH) [4] may allow faster NN retrieval, and an experimental comparison of our approach with LSH-based NN retrieval will be conducted in future work. We also plan to investigate more sophisticated similarity measures and improved feature extraction methods.
References
[1] P. Belhumeur, J. Hespanha, and D. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 19(7):711–720, 1997.

[2] T. Cootes, G. Edwards, and C. Taylor. Active appearance models. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 23(6):681–685, 2001.

[3] J. H. Friedman. Another approach to polychotomous classification. Technical report, Dept. of Statistics, Stanford University, 1996.

[4] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In Proc. International Conf. on Very Large Data Bases, pages 518–529, 1999.

[5] G. Guo, S. Z. Li, and K. Chan. Face recognition by support vector machines. In Proc. IEEE International Conf. on Face and Gesture Recognition (FGR), pages 196–201, 2000.

[6] G. Guo, H. Zhang, and S. Z. Li. Pairwise face recognition. In Proc. IEEE International Conf. on Computer Vision (ICCV), volume 2, pages 282–287, 2001.

[7] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proc. ACM International Conf. on Management of Data (SIGMOD), pages 47–57, 1984.

[8] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Trans. on Database Systems, 28(4):517–580, 2003.

[9] C. Hsu and C. Lin. A comparison of methods for multi-class support vector machines. IEEE Trans. on Neural Networks, 13:415–425, 2002.

[10] M. Jones and P. Viola. Face recognition using boosted local features. Technical Report TR2003-025, Mitsubishi Electric Research Laboratories, 2003.

[11] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, J. Chang, K. Hoffman, J. Marques, J. Min, and W. Worek. Overview of the face recognition grand challenge. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2005.

[12] P. J. Phillips, P. Grother, R. J. Micheals, D. M. Blackburn, E. Tabassi, and J. M. Bone. Face recognition vendor test 2002: Overview and summary. Technical report, http://www.frvt.org, 2003.

[13] J. C. Platt, N. Cristianini, and J. Shawe-Taylor. Large margin DAGs for multiclass classification. In Proc. Advances in Neural Information Processing Systems, pages 547–553, 1999.

[14] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In Proc. ACM International Conf. on Management of Data (SIGMOD), pages 71–79, 1995.

[15] H. A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 20(1):23–38, 1998.

[16] J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 22(8):888–905, 2000.

[17] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71–86, 1991.

[18] L. Wiskott, J.-M. Fellous, N. Krüger, and C. von der Malsburg. Face recognition by elastic bunch graph matching. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 19(7):775–779, 1997.

[19] W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, 2003.