>> Zhengyou Zhang: Okay. So let's get started. It's my pleasure to
introduce Gang Hua. Gang is not a stranger to us. He did an internship
with me quite a few years ago, then joined Live Labs. After Live
Labs was dissolved, he moved on to a number of places, including Nokia
Research, IBM Research, and now he's an Associate Professor at Stevens
Institute of Technology. And today he will talk about the Elastic Part
Model for Face Recognition. Let's listen to him.
>> Gang Hua: Thanks, Daniel. It's a great pleasure to revisit MSR, one
of the best computer science research institutes in the world.
So today my talk is about an elastic part based representation for real-world
face recognition. You may wonder, after 30 years of research, why
we should still care about face recognition research. I think the short
answer is that it is still not solved yet. So before I get into the
technical part I want to briefly introduce my school.
So Stevens is in a beautiful campus on the New Jersey side of the Hudson
River. So we oversee the skyline of New York City. Basically the
opposite side is Manhattan downtown. So it's a beautiful campus, and
most New Yorkers, when they come to -- I mean, perhaps in their
lifetime they never come to our side and they never had the chance to
see the skyline of Manhattan.
So next time you happen to be nearby, send me an e-mail and I can
arrange a visit to our campus. Okay. So I categorize the current research
in my group into three themes. The first is human-centered
visual computing. I tend to do my research centered around humans: we
understand humans from images and videos and also do interactive types of
recognition, tagging, and things like that.
The second theme of research in my group is big visual data.
Part of that effort originated from my experience here, from designing
compact local descriptors all the way to modeling contextual
information, including social networking context, for recognition,
essentially.
The third theme of research I initiated at the school, after I joined
Stevens, is on egocentric vision based cyber-physical and collaborative systems.
The reason I call it a collaborative system is that we want to build an integrated
human and robotic system. For example, one of our goals is to
use an egocentric camera. This is a very cheap camera; the one on
the right side is a very cheap spy camera I bought from China. So
there's a pinhole camera in between, in the camera glasses, and we want to
enable users to use that camera to control a wheelchair. You may
imagine a quadriplegic user who has lost hand functionality and cannot
really control the wheelchair. We want to build a collaborative system where
most of the time the wheelchair robot is moving autonomously, but whenever
it's unsure what next step to make it can ask the human for control;
then the quadriplegic user can use the camera to control it.
It would naturally record the input from the sensors and the output
from the user's control. We hope to build a learning algorithm which
can evolve the decision engine of that wheelchair system so in the
future it can handle similar situations.
So that's about the overall description of the research in my group.
So back to face recognition. Why do we really care about pose
variation? The simple reason is that pose actually mingles all the
visual variations together and makes things really complicated. We all
know from the seminal paper of [indiscernible] and [indiscernible] that if
the pose isn't changing then everything becomes linear, and it's easy
to handle. So as you can see from this example, once we have a lot of
pose variation, things become really complicated. So we want to
better handle pose with a better visual representation of the face.
So there are two types of approaches to handle pose variation. The first
theme of research focuses on facial alignment and facial landmarks.
There's a series of work from Microsoft Research Asia; they've done really
good face landmark algorithms for photos. Just from my private
conversations with Jian, he told me that the current face alignment
algorithms at MSR actually failed on YouTube videos. There's a
benchmark, the YouTube Faces dataset, released by Lior Wolf. I think
Jian's algorithm didn't really fly on that video database. So we're
going to revisit that to see why our representation is better.
The other theme of research tries to build robust matching algorithms
to better handle pose variations. My personal research is really
focused on this domain because I believe no face alignment algorithm
is ever going to be perfect. So you want to build a robust matching
algorithm to handle whatever residual pose variation you could still
have.
Our approach is to take a part-based representation and build a part-based
model. Previous work on part-based representations for face
images is mostly handcrafted. For example, the parts could be defined
around facial landmarks and so on and so forth. What we want to
do, though, is to learn the parts in an unsupervised fashion from the
training data and build a generative model. Once we have this
part-based model we can identify each specific part in a specific
input image, and then we can use those specific parts in that image as
the representation.
Okay. We're going to talk about the benefits of having such a
representation. There's a bunch of recent work on more general
object recognition and detection that builds an intermediate-level
representation, but it hasn't been applied to faces, and for some of
those approaches we'll see why they could be used in the face domain.
So given this, here's the outline of the rest of my talk. First
I'm going to introduce this probabilistic elastic part model. It's a
very simple algorithm, but hopefully I can convince you that it is
very effective. Okay. Then I'm going to talk about two
applications. The first is face verification. One good side of our
representation is that it provides a unified representation for both
image- and video-based face verification. That means with this
representation we can enable a single face image to be matched
directly with a video clip, without resorting to frame-by-frame
cross matching, pairwise matching.
We will also talk about how we can use this representation for
enhancing any offline face detector. It's an unsupervised detector
adaptation algorithm; hopefully I'll briefly describe some of the work
we did before on how to adapt a detector to a video. Here we just try to
do unsupervised detector adaptation to a photo collection to make
the detector better. This part of the work we published in CVPR 2013,
and this part of the work is going to be presented at ICCV 2013. And
some of the experiments I present here will be from our recent
submission, because we did more experiments and the results
improved. So first, what is the probabilistic elastic part model? We
have three goals for this part-based representation. Of course we
want it to be pose invariant. Second, we want to unify the
representation for both image and video faces. The reason we want it
to be like that is because we want to avoid this frame-to-frame matching,
and also our ultimate goal is face identification. So when you're
trying to build a gallery database, we want the database to really scale
with the number of persons instead of the number of images in the gallery
database.
And ultimately we also want this representation to be additive, meaning
that if a new face for a person is added in, we can incrementally
update it without resorting to all the previous images.
So I will show you how we achieve this. Again, the general
philosophy, as I mentioned, is that we want to build a generative model;
for each specific image we want to identify the specific parts of
the face in that image, and we use those as our representation. So
just to give you a sense, from the low level, of how we are able to
learn such a part-based model: we start with a very simple feature
extraction process where we build an image pyramid for each input face
image and densely extract overlapping patches. Okay.
Then for each patch we extract a descriptor, either SIFT or LBP,
from it. But we also do something in addition to that: we augment the
descriptor with the x and y location of that patch, so we acquire a
spatial-appearance descriptor. Up to this point we have a set of
features, each with an appearance component and a location component.
So the face is represented as a bag of features. But don't be confused
by the words, because we're not going to build a dictionary to build a
bag-of-words representation. We're going to do something different here.
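A minimal sketch (not from the talk) of this dense feature extraction step, assuming OpenCV's SIFT; the patch size, stride, and pyramid settings are illustrative assumptions rather than the talk's actual values:

```python
# Sketch: image pyramid, densely sampled overlapping patches, one local
# descriptor per patch, augmented with the patch's (x, y) location.
import numpy as np
import cv2  # OpenCV, assumed available

def spatial_appearance_descriptors(gray, levels=3, scale=0.8, patch=24, stride=8):
    sift = cv2.SIFT_create()
    feats = []
    img = gray
    for _ in range(levels):
        h, w = img.shape
        for y in range(0, h - patch, stride):
            for x in range(0, w - patch, stride):
                # Describe the patch with a single SIFT keypoint at its center.
                kp = [cv2.KeyPoint(x + patch / 2.0, y + patch / 2.0, float(patch))]
                _, desc = sift.compute(img, kp)
                if desc is None:
                    continue
                # Normalized (x, y) location appended to the appearance part.
                loc = np.array([(x + patch / 2.0) / w, (y + patch / 2.0) / h])
                feats.append(np.concatenate([desc[0], loc]))
        img = cv2.resize(img, None, fx=scale, fy=scale)
    return np.array(feats)  # each row: [128-D SIFT | x | y]
```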
Then how do we build this elastic part model? We gather a set
of training images and fit a Gaussian mixture model. In the speech
community this is called a universal background model. But I will
highlight that we changed the model a little bit. It is more
specific here, because we confine each Gaussian component to be a
spherical Gaussian. The reason, which I will explain further, is that we want
a better way to control the balance between the appearance part
and the location part. Because if you think about it, the appearance
vector is about 128 dimensions while the location is only two
dimensions, so we need a better way to control the balance between them.
If we use a diagonal covariance matrix we won't be able to achieve that,
as I'll show you in an example. So, of course,
this is a very simple maximum likelihood estimation problem with missing
data.
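A minimal sketch of fitting such a spherical-Gaussian mixture (the UBM) with scikit-learn; the 1,024 components echo the number mentioned later in the talk, while the location-scaling factor is purely an illustrative assumption:

```python
# Sketch: pool spatial-appearance descriptors from all training faces and fit
# a Gaussian mixture whose components are constrained to be spherical.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_pep_model(all_descriptors, n_components=1024, loc_weight=30.0):
    X = np.vstack(all_descriptors).astype(np.float64)
    # Scale the two location dimensions so they can compete with the 128-D
    # appearance part under the shared per-component variance.
    X[:, -2:] *= loc_weight
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='spherical', max_iter=100)
    gmm.fit(X)  # maximum likelihood via EM; the component assignments are
                # the "missing data" mentioned above
    return gmm
```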
So we gather a set of training data, go through that feature
extraction process to obtain these spatial-appearance descriptors,
and then fit the Gaussian mixture model. As you can
see, this Gaussian mixture model naturally comes with a set of parts.
Here the visualization is really the set of image patches associated
with each Gaussian component. Up to this point the method is
simple. So how are we going to build the representation for
each face image from this model? Okay. Pay attention to this, because
it is simple but it is really what differentiates us from the previous work.
So here, if I have a new input image, and suppose I've already learned
this model, this Gaussian mixture model, the input is first
represented as a bag of spatial-appearance descriptors. In order to
generate the final representation for the face, instead of building a
bag-of-words histogram, what we do here is an inverse
assignment.
So for each Gaussian component, we look for the one descriptor
extracted from this input face image which induces the highest
probability on that Gaussian component. So for each Gaussian component
we identify such a feature. Then we concatenate the set of features
we identified, one per Gaussian component, to form a single
descriptor vector for the face. As you can see, this is for Gaussian
component three; this is the probability map of that Gaussian
component calculated over each specific descriptor extracted from
that face. As you can see it really corresponds to the chin of
the face, and here the corner of the eye, although the faces have a lot of
pose variation. So you could view this as more like an alignment
process there. Okay. By doing the maximum likelihood identification of
the features, we actually have an implicit alignment process built in.
Okay.
>>: Do you have a location descriptor? Do you actually put it --
>> Gang Hua: Up to this point we discard the location. We only have
the appearance part in this final vector, but it's indexed by each
Gaussian component, as you can see. The Gaussian component plays a role
as a bridge to build a correspondence between two faces.
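A sketch of this inverse assignment, building on the helper functions above; the per-component likelihood is written out explicitly for the spherical case, and the variable names are mine rather than the talk's:

```python
# Sketch: for every Gaussian component, pick the single descriptor of this
# face with the highest likelihood under that component, then concatenate
# the appearance parts of the selected descriptors.
import numpy as np

def component_log_prob(gmm, X):
    # log N(x; mu_k, sigma_k^2 I) for every descriptor x and every spherical
    # component k; shape (n_patches, n_components).
    d = X.shape[1]
    sq = (np.sum(X ** 2, axis=1)[:, None]
          - 2.0 * X @ gmm.means_.T
          + np.sum(gmm.means_ ** 2, axis=1)[None, :])
    return -0.5 * (sq / gmm.covariances_ + d * np.log(2 * np.pi * gmm.covariances_))

def pep_descriptor(gmm, descriptors, loc_weight=30.0, appearance_dim=128):
    X = descriptors.copy()
    X[:, -2:] *= loc_weight                 # same location scaling as training
    best = np.argmax(component_log_prob(gmm, X), axis=0)
    # One appearance vector per component, concatenated into a single vector.
    return np.concatenate([descriptors[i, :appearance_dim] for i in best])
```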
So let's get into a little bit more detail here, because as I said, we
confine each Gaussian component to be a spherical Gaussian. The
reason for that is really this. Here is just to show that if we use a
regular Gaussian, for example with a diagonal covariance matrix, the
spatial span of each Gaussian component is large. As you can see it
spans large spatial regions in the face, which is not the behavior we want.
The main reason for this is that the appearance part contributes more to
the probability, because with a diagonal covariance matrix each
dimension is independent. So if you try to make the spatial
component contribute more by simply scaling those dimensions, it's not
going to help, because you're just going to enlarge
the variance of those dimensions when you are doing the estimation,
without affecting the other components. If you confine it to be a
spherical Gaussian then you mingle all the variances together. So if
you scale the location dimensions, you essentially let the
location have more influence. So the Gaussian components become
more localized, and they're doing local matching, essentially. Any
questions here? Okay. Another question: why do we need this location
component at all? The main reason is this: if we did the maximum
likelihood identification without the spatial constraint, as you can
see, you could easily match an eye with the mouth part here. The main
reason is because we use the SIFT descriptor here, and SIFT is designed
to be shift invariant. You may match an eye with a mouth
in the descriptor space, whereas if we put the spatial
constraint in, we're really matching the eye with the eye. So
that's why we need the spatial component.
Let's see why this representation is pose invariant; we try to
visualize the process. Okay. What we essentially do is, for the input
face image and for each Gaussian component, identify the local patch
with the highest probability. Then we put that patch back
at the location of that Gaussian component, so we synthesize this face;
that's essentially the process.
As you can see, this face is nearly frontal here, although it
becomes blurred because we average overlapping regions
without doing any fancy filtering. Okay. So
this is if we just use this input image. What we can do further is
horizontally flip the face image and then do the maximum likelihood
identification with the joint set of spatial-appearance descriptors.
Then, as you can see, the representation becomes even more frontal. Okay.
I can let you see the difference there.
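A sketch of that mirrored pooling, reusing the earlier helper functions; this is an assumed reading of the procedure in which each component picks its best patch from the joint set of the face and its horizontal flip:

```python
# Sketch: extract descriptors from the face and its mirror image, and let the
# maximum likelihood selection run over the joint set.
import numpy as np
import cv2

def pep_descriptor_with_flip(gmm, gray, loc_weight=30.0):
    joint = np.vstack([spatial_appearance_descriptors(gray),
                       spatial_appearance_descriptors(cv2.flip(gray, 1))])
    return pep_descriptor(gmm, joint, loc_weight)
```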
So then what we do is adopt this flipped version, the joint version of
both. Here is just something more: because of this inverse assignment
based on the Gaussian components, we don't really care how many frames
we have. For one frame, for example this is George Bush's face, we come
out with one representation; then with ten frames, as you can see, it is
more frontal in a sense, so it is more pose invariant. With 20 frames or
50 frames, they don't really make a huge difference. We also have the
version where we do the flipping for each of the frames; as you can see
they have some differences, but not too much if you integrate multiple
frames together.
Okay. So then why this is a unified representation for a single image
and for video face images is quite obvious. Once we have done the
feature extraction and then use this PEP model, where PEP stands for
probabilistic elastic part, the selected features will always have the
same dimension, because the dimension of the representation is the
number of Gaussian components times the dimension of the feature
descriptor. So that's why, once we have this representation, we can
match a single face image with a video clip directly without resorting
to frame-to-frame matching. So why is it incremental? Suppose you
have a gallery database and you have a set of faces from the same person.
From this model we can generate a single representation for this person
by doing this maximum likelihood identification. So it is incremental.
It can be incrementally updated mainly because of the incremental nature
of this maximum operation: once you have this representation,
given a new face we just need to compare whether the maximum likelihood
identified feature has higher probability than the feature I already
have in this representation or not. If it does, then I just replace it in
the current representation. So it can be incrementally updated if we have
more faces added. That's a lot of the benefit of this. Although in this
talk we're not going to present any face identification results, if we
want to do face identification with this representation we can make the
database essentially scale with the number of persons instead of
the number of images in the database.
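A sketch of that incremental update, again built on the helpers above; keeping each component's selected part together with its log-likelihood lets new frames or images be folded in with a per-component max, which is an assumed but straightforward reading of the talk:

```python
# Sketch: per-component max update of a person's PEP representation.
import numpy as np

def update_pep(parts, scores, gmm, new_descriptors, loc_weight=30.0,
               appearance_dim=128):
    # parts: (K, appearance_dim) current per-component appearance features
    # scores: (K,) their log-likelihoods under the corresponding components
    X = new_descriptors.copy()
    X[:, -2:] *= loc_weight
    log_prob = component_log_prob(gmm, X)             # (n_patches, K)
    best = np.argmax(log_prob, axis=0)                # best new patch per component
    best_lp = log_prob[best, np.arange(log_prob.shape[1])]
    better = best_lp > scores                         # keep only improvements
    parts[better] = new_descriptors[best[better], :appearance_dim]
    scores[better] = best_lp[better]
    return parts, scores
```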
>>: So when you do this, is the UBM for one person or for everybody?
>> Gang Hua: For everybody.
>>: For the population.
>> Gang Hua: Yes. Each face has different pose variations.
>>: But here, when you're talking about representing one person --
>> Gang Hua: So we assume that we already have that UBM model.
>>: So from the UBM you then do personalization to a single-person model.
>> Gang Hua: Yes. I'm going to introduce a little bit of an adaptation
process when we are trying to do specific matching for a pair of face
images. We have a Bayesian adaptation process that's more like
person-specific adaptation. Okay, I'm going to talk about that in the
face verification experiments. So I hope you get a sense of
what this simple PEP model is. I'm going to move forward to present
two applications, one on face verification and the other on face
detection.
>>: [indiscernible].
>> Gang Hua: Sure.
>>: The face, I assume you would need some kind of alignment?
>> Gang Hua: Yes, you could either do alignment or go without any
alignment.
>>: Without alignment I assume it may not be very good.
>> Gang Hua: We did some experiments. We're still better than -- when
this paper was published, we were still better than the best in this
evaluation, on this benchmark data.
>>: Alignment.
>> Gang Hua: We tried both. This algorithm would benefit from some
sort of alignment algorithm, but because this representation is
designed to be robust to pose variations, we have an implicit alignment
process in building it.
>>: I thought with the training you would -- it's good to have some
alignment. But testing you're doing [indiscernible].
>> Gang Hua: That's a good one. You need them to be consistent.
>>: I see.
>> Gang Hua: If you have alignment in your training phase, then you have
to have the same alignment in your testing phase. That's kind of
what we're doing. But it would be interesting to try. We always have
alignment on the testing stage. So for face
verification we're going to talk about results on two very popular
benchmarks. Okay. First we're going to talk about uncontrolled
face verification. We're going to introduce a very simple method
first to use this PEP representation for face verification, then
I'll talk about how we can do a little bit better.
Suppose we want to enhance the matching of a specific face pair; then
I'm going to show our top performance on both benchmark datasets.
Okay. So how do we do the matching when we have a pair of faces and
want to verify whether they are the same person or not? We take a
very simple approach here. Once we have the PEP representation for
each face, we just take the difference of these
two representations.
Okay. We take the absolute difference vector. Then, given a set of
training pairs labeled as positive or negative, where
positive means they're from the same person and negative means they're
not the same person, we can train an SVM on this difference
vector to decide whether they are the same person or not. So
given a new testing pair, we can use this classifier to make the
decision. That's a very simple method.
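A sketch of that simple pair classifier, assuming scikit-learn's linear SVM; the hyperparameters are placeholders, not the talk's settings:

```python
# Sketch: absolute difference of two PEP descriptors fed to a linear SVM
# trained on labeled same/not-same pairs.
import numpy as np
from sklearn.svm import LinearSVC

def train_pair_classifier(pep_a, pep_b, same_labels):
    # pep_a, pep_b: (n_pairs, K * 128) PEP descriptors; same_labels in {0, 1}
    clf = LinearSVC(C=1.0)
    clf.fit(np.abs(pep_a - pep_b), same_labels)
    return clf

def verify(clf, pep_x, pep_y):
    # Positive score: same person; negative: different persons.
    return clf.decision_function(np.abs(pep_x - pep_y)[None, :])[0]
```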
We haven't resorted to any more complicated metric learning algorithm yet,
but I think if we used a better metric learning algorithm, it
would indeed enhance the results.
To make the representation even better, we have proposed a
Bayesian adaptation scheme. Suppose this is what we call the PEP
model, which is universally trained offline, and suppose we have a pair
of face images we want to verify. What we can do is make
this model adapt to this pair of face images, to fit them
better, but without deviating from the universal model too much. It is a
Bayesian adaptation process, and we obtain a new adapted Gaussian mixture
model; then from that we build the PEP representation.
If we look into the MAP estimation, it's a really simple scheme where we just
put a conjugate prior on the adapted parameters, with the prior coming from
the universal background model, and again this can be done in a
Bayesian framework in an iterative fashion. Really simple method.
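As an illustration of the general idea, here is a sketch using the standard relevance-factor MAP update of the component means, as used for UBM adaptation in the speaker verification literature; the talk's exact update and hyperparameters may well differ:

```python
# Sketch: adapt the UBM means toward the descriptors of the pair being
# matched, with the prior keeping them close to the universal model.
import numpy as np
from copy import deepcopy

def map_adapt_means(ubm, pair_descriptors, loc_weight=30.0, relevance=16.0,
                    n_iters=3):
    X = np.vstack(pair_descriptors).copy()
    X[:, -2:] *= loc_weight
    adapted = deepcopy(ubm)                  # keep the universal model intact
    for _ in range(n_iters):
        resp = adapted.predict_proba(X)      # posterior responsibilities
        n_k = resp.sum(axis=0) + 1e-10       # soft counts per component
        xbar = (resp.T @ X) / n_k[:, None]   # component-wise data means
        alpha = (n_k / (n_k + relevance))[:, None]
        # Convex combination of the data means and the UBM prior means.
        adapted.means_ = alpha * xbar + (1.0 - alpha) * ubm.means_
    return adapted
```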
To see why we need this adaptation process, suppose this is the pair of
faces we want to verify. If we look into the set of patches for which we
build a correspondence with the PEP model, you can see that some of
those patches are still misaligned a little bit, as highlighted by the
blue rectangles there.
So this is before adaptation. After we do the adaptation, as you can
see, we really make the alignment better if you compare them. Okay.
So that's why: this Bayesian adaptation scheme can help us build
better correspondences there. Okay. Any questions on this part?
So we're not trying to do a person-specific adaptation like in the
speaker verification domain; here we're really adapting the PEP model
to the specific pair of images we're trying to match. Okay. We can
also do multi-feature fusion. We have a score fusion process: if
we use different types of features to build the PEP model and to do the
elastic matching, then we can concatenate the decision scores together
and train a linear SVM to combine them. That's a very
routine process; I just want to mention it because our results benefit
from fusing multiple features. So we're going to present results on
two widely used benchmark datasets, LFW and the YouTube Faces dataset.
Currently we rank number two on the LFW database and number one on the
YouTube video dataset.
So when we're talking about results on LFW, I should be careful, because
Jian's team has really pushed the recognition accuracy to be super high. But
they leveraged an enormous amount of external data to train their facial
alignment algorithm and their face verification algorithm.
We're working on the most restricted protocol, without any outside
training data. The reason is that my philosophy is that you should
really come up with a representation which can generalize across
different datasets. That's why we're confining ourselves just to LFW,
to see how it can generalize. So when we're comparing results,
we're mainly comparing results in this category; we're not
comparing with the other algorithms which leverage an enormous amount
of external training data. Here are some details of how we did the
feature extraction and things like that. Okay. We specifically use
1,024 Gaussian components in our UBM model.
So here are our results. This was the best result when we published
our paper in CVPR 2013, and recently there's one paper published from
Oxford, I think Andrew Zisserman's group; they used a Gaussian mixture
model with the Fisher vector to do the matching, and they're currently
number one. This PEP result is if we use only the LBP feature; if
we use the SIFT feature, we get more or less similar results to the LBP
feature. If we do the fusion, we can clearly improve our results,
and if we do the adaptation plus the fusion, we can do even better.
Under the ROC curve, we're not as good as them; in terms of numbers
we're about 1.5 points below their accuracy. But still, we're currently
ranking number two on this benchmark. Their algorithm needs to
conduct metric learning on different datasets, which is the part I don't
like, but I respect their performance a lot.
On the YouTube video faces, fewer results have been published on this
dataset, mainly because of computational issues and all kinds of other
issues. When we published our results, only [indiscernible] had published
their results; this is the baseline here.
Recently more algorithms have published results; the best one is this one,
published at a biometrics conference. So our results, if we only use one
type of feature, are about here. Although we are worse in the
high false positive rate area, we're nearly as good as them with just a
single feature in the low false positive rate area, and that's the area we
really care about. If we use a different set of features, we are actually
already slightly better than that algorithm.
If we fuse them together, we get better still, and if we do the
adaptation, we do even better. Okay. So currently it's a mix:
we're still lower than those algorithms in the high false positive
rate area, but we do significantly better in the low false positive
rate area.
>>: Which group, biometric thing?
>> Gang Hua: This one?
>>: Yeah.
>> Gang Hua: That's a good question -- I cannot remember. It's a group not
so active in computer vision; they're mainly active in the biometrics
community. I can send you some information.
>>: That's okay. It's in your paper.
>> Gang Hua: So currently, in terms of recognition accuracy, if we pick
a single operating point, we're about 1.5 points better than the best
algorithm published at the biometrics conference. The reason this number
is lower than on the LFW dataset is that the resolution of the faces in
this video face benchmark is not as good as that of the LFW dataset. So
we're currently number one there.
And I also want to highlight that once we build the PEP
representation, we don't need to do cross-frame matching as most of
these algorithms do; they need to do this frame-by-frame matching and
then identify the best match.
Our algorithm is obviously not perfect, and looking into the error
results can tell us which direction we should go.
So here are some errors made by our verification algorithm. As you can
see, we're matching a white face with an Asian face, which shows
that we don't really have a comprehensive understanding of the face.
So that's the direction we're trying to go: we are trying to do
segmentation and semantic labeling to really understand the face,
to avoid making such embarrassing errors. That's the direction we're
going, trying to drive face verification accuracy to the next level.
>>: The gray level?
>> Gang Hua: Currently we're only using gray level images. Color could
play an important role, but so far I haven't seen any work that really
leverages the color information.
>>: When you -- it's always interesting to look at these errors because
it tells us we still have a long way to go. But when you report
numbers like in the previous page there, 80 percent, on 80 percent of
the queries you've got the right person out of how many potential
answers?
>> Gang Hua: Oh, yeah, this is not an identification task, it's a
verification task. So basically in this benchmark dataset the input is
face pairs, and you just make a decision whether this pair is from the
same person or not. It's a classification --
>>: Input distribution is 50 percent matched?
>> Gang Hua: 50 percent.
>>: So the random guessing baseline is 50 percent.
>> Gang Hua: Yes.
>>: You're at 80. So, what's that, in about three out of
five guesses you're doing better than random.
>> Gang Hua: Yes.
>>: Something like that.
>> Gang Hua: Something like that; that's currently the state of the art.
I think this benchmark was designed this way to balance the
training, really. I mean, as you can see, the distribution of matched and
non-matched faces should not be 50/50 in the real world, so it's a
skewed distribution. That's something where perhaps in the future the
benchmark evaluation could need to be redesigned somehow, I think.
>>: It's a way of measuring progress, but suppose someone who isn't a
researcher in this field asked: if I give you a celebrity photo from this,
what's the chance that your algorithm will name the correct one? And
let's say there are 200 celebrities in here. What's the chance? Is it
like five percent of the time it will guess the right person, or what is
it?
>> Gang Hua: That's an interesting question. I think the number
I heard from Google Picasa is that what they can achieve is that, in the
top ten results, they can make sure that in 90 percent of the cases the
correct person is there.
>>: Top ten in, what, your family album?
>> Gang Hua: From all the celebrity face images they have in Google image
search.
>>: With the celebrities, they're saying -- I see. So you take a
celebrity photo, you ask Google who this is, and it gives you a list of 20
names, and usually if you look in the top ten it will say --
>> Gang Hua: 90 percent of the time the correct name is in the top ten
faces.
>>: It will say Cameron Diaz and Hank -- it will name a whole bunch of
people who are of opposite genders and kind of nonsensical, but somewhere
in those ten the right person may be there.
>> Gang Hua: That's what I heard from them. But it could be -- they
never publish their results.
>>: For something like finding, or just verifying, what the labels are on
the images, that could be useful. But if you basically said something
like: here are my personal Picasa photos, show me the ones with my
daughter, this daughter. The question is whether 70 percent of the images
will be correct or only 20 percent.
>> Gang Hua: I believe on a personal photo album we can do much
better, because the number of people is much smaller.
>>: There's much lower perplexity, but sometimes people look more
similar; you can't take advantage of facial differences and things like
that.
>> Gang Hua: That's absolutely true. There's really no good benchmark
for this; perhaps some effort is needed in that space to really
evaluate progress. I don't really see a lot of
benchmarks there. Simon and I, when I was still at Microsoft,
explored that a little bit in this family album
scenario. But I think we need a more serious benchmark in that space.
It's kind of difficult.
Next I'm going to talk about face detection. How many of you here
still think that face detection is a solved problem? Paul is no
longer here, so it's safe to ask the question. So raise your hand if you
think it's solved.
>>: I thought frontal faces were pretty good.
>> Gang Hua: Solved? Look at some of the mistakes you would make with a
state-of-the-art face detector. You're missing a detection here. That's
an embarrassing false alarm; why is this detected as a face? That's very
suspicious. We also have some of the missing detections there.
>>: State of the art?
>> Gang Hua: This is from the [indiscernible] face detector. It's not his
best one. But we tested one of the recent ones from Adobe; they have an
exemplar-based face detection algorithm, and they're also making these
kinds of errors.
>>: That's kind of the baseline that you just download free software.
>> Gang Hua: Yes.
>>: Which is the --
>> Gang Hua: We want to do better than that without a lot of effort.
So I'm going to tell you a really simple approach to do this. What we
did is very simple; again, the philosophy is very simple, but I think it
could inspire lots of extensions, so that's something I'm trying to
explore. Suppose I have a photo collection here. First I set the decision
threshold of the offline trained detector to be really low to
ensure recall, so I get a set of face candidates here.
Of course it's going to have false positives. Then I choose perhaps
the top 10 percent as positives and the bottom 10 percent as
negatives, and treat them as positive and negative examples. Then
I build the PEP representation: suppose I have an offline trained PEP
model, I build the PEP representation on each candidate I collected
here, and I simply train an SVM classifier on top of the PEP
representation. Then I rerank all these candidates and cut off at the
threshold to see if we can do better.
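A sketch of this unsupervised rerank loop, assuming scikit-learn; the 10 percent fractions follow the talk, while everything else is an illustrative assumption:

```python
# Sketch: treat the detector's most and least confident candidates as
# pseudo-labels, train an SVM on their PEP descriptors, and rerank everything.
import numpy as np
from sklearn.svm import LinearSVC

def adapt_and_rerank(candidate_peps, detector_scores, frac=0.10):
    order = np.argsort(detector_scores)
    n = max(1, int(frac * len(order)))
    neg_idx, pos_idx = order[:n], order[-n:]     # bottom / top 10 percent
    X = np.vstack([candidate_peps[pos_idx], candidate_peps[neg_idx]])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    clf = LinearSVC(C=1.0).fit(X, y)
    # New scores for every candidate; threshold these to accept detections.
    return clf.decision_function(candidate_peps)
```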
>>: How many components do you use in that?
>> Gang Hua: 1,024. So it's a really high-dimensional vector, indeed. But
that's what we did; it's as simple as that. After describing this I don't
need to go through all the slides, because that's essentially what we did,
to see if we can make it better. So we tested our algorithm on three
photo albums. This album and this one were collected by Simon, me and
Ashish several years ago; we still have some detection data on them. And
the other one is the FDDB database, which was released by Erik
Learned-Miller from UMass, and the funny thing is I had a lot of
discussion with him recently.
What he did was collect some Web photos, run the Viola Jones detector, and
pick up the ones where it failed; so he was specifically screwing
up the Viola Jones face detector, and this set of photos is really
challenging. Currently the best algorithm there is a face
detector called XZJY published by Adobe. They built an exemplar
database with 10,000 faces, and just by doing this exemplar matching for
face detection they yield the best performance so far on this FDDB database.
So we're going to show that by using our adaptation scheme, a very simple
one, we can improve both the Viola Jones face detector and the XZJY
detector. I'm going to play a video, but here are some qualitative
results first. This is on the G album: the XZJY detector is still making
those embarrassing false positives, and we can get rid of those. The
results we show here are mainly about eliminating false positives, but we
also have cases where we can actually get false negatives back. In
terms of performance curves, first I want to show that this
PEP representation is really also good at differentiating face from
non-face. Here we did the same adaptation process with just
concatenated SIFT descriptors, meaning we're not using
the Gaussian mixture model to select those part features; instead we
just densely extract the SIFT descriptors. So here is the
performance of the Viola Jones detector. If we simply use the SIFT
representation we already do a lot better, but if we use the PEP
representation and do the adaptation, we can do much better,
okay. The reason is obvious: with the
PEP representation, we're reducing the within-class variation of
the real face images. The background may stay the same
because it's random anyway, but we're making the face class
tight so that we can do better. And here is some more comparison on
the album. Here is the
performance of the Viola Jones detector and this is the XZJY detector;
you can see this detector is much better than Viola Jones and nearly
perfect.
This is our exemplar-based detector adaptation algorithm that
was published in CVPR 2011. It improved the Viola Jones detector but did
not make the XZJY detector better. If we use our PEP
representation, we manage to improve both the Viola Jones
detector and even make some improvement over the XZJY detector on this
dataset; we make it much better. So on this album, again, these are the
results from the CVPR 2011 work; that exemplar-based approach works
mainly for video. That's where we made a lot of progress. If we
use this PEP representation and do this adaptation, as you can
see, even starting from the Viola Jones detection results -- let me
see -- yeah, the blue curve is already better than the
XZJY detector, which shows that this one cycle of
iterative reranking is really helpful. So this is on the FDDB
benchmark; the discrete score means that if your predicted detection has
more than 50 percent overlap with the [indiscernible], then you claim it
to be a correct detection. Okay. So this red curve is the XZJY detector
results, and here are some baselines published before; they use
different local features, and this one I believe used the SURF
feature. And this one was published by Erik Learned-Miller's group;
they're also doing this type of adaptation but using a
Gaussian process classifier in the middle. We do the adaptation fold by
fold, because they have a tenfold cross validation process.
Even if we only do the adaptation fold by fold, we already enhance the
results significantly. I'm not so sure -- I
didn't put the all-data adaptation results there; I should add them back.
I think we can do better than this. So this is the continuous score,
where what matters is the overlap between the predicted rectangles and the
ground truth, and I should change these slides a little bit. This is the
XZJY detector; after we adapt it with the PEP representation, we make it
significantly better. So here I would like to play the video, perhaps
just to give you more of a sense that this representation is also good at
bringing some of the false negatives back. Let me see. Okay. As I
said, we're going to present this piece of work at ICCV soon. Yeah,
this is the Viola Jones detector; performance is evaluated at around a
90 percent recall rate. Here is a false positive; we got rid of it.
We have a false negative here, a missing detection, and we get it back
after the adaptation. Here is the embarrassing one, and we wondered why
this kind of patch gets detected; if you do histogram equalization, the
face patch looks like a face. And this album was actually published by
Shao before. We get rid of this, and we can get this one back. Okay. I
will simply stop the video, because it just goes on to show that it is
better indeed, and you can see it in the playback. So just some
discussion on this type of adaptation scheme. I think it could inspire a
lot of interesting things, because you could think of this as a kind of
process where we are trying to adapt a recognition algorithm to the
statistics of the dataset you're dealing with; this is just a lightweight
way of doing that. But I would say there could be a lot of interest in
using that. I think there was some earlier work here on detector
adaptation that tried to modify the detector itself with [indiscernible]
extensions, but here we're doing it a little bit differently, trying to
leverage the examples in an unsupervised fashion. So just in conclusion,
we have a pose invariant face representation induced from the PEP model,
and we showed leading performance on some face recognition and face
detection benchmarks. Our future work involves how we
could use this type of representation for other visual recognition
problems like [indiscernible] recognition, and we're doing some work
to see if we can make a better object detector based on
this PEP representation idea. So before I end my talk, I want to
test your face recognition capability. Who is this guy, without any
hints? Let me give you a little bit of a hint. Yes?
>>: [indiscernible].
>> Gang Hua: [indiscernible].
>>: I don't know.
Just decades.
>> Gang Hua: [indiscernible] so I worked for IBM before. So
[indiscernible] so I used his face all across my talks, so I want to
thank him indeed. I'll stop here in case you have any questions.
>>: [indiscernible].
>> Gang Hua: No. I do a lot. IBM still has Deep Blue as a demo in
their demo room, but it is never played again, I guess. It's kind of
interesting. Now they're focused on Watson, I think, the DeepQA
project. Very interesting.
>> Zhengyou Zhang: Thank you.
[applause]