>> Zhengyou Zhang: Okay, so let's get started. It's my pleasure to introduce Professor Ying-Li
Tian. She's from the City University of New York, and she is also visiting us for two months. And
before joining City University of New York she was a researcher at IBM, leading the effort on the Smart Surveillance system from the initial research project through commercializing the product, so it's really a big effort. Before that, she worked at CMU for quite a few years on facial expression recognition, which is probably one of the main parts of the talk. And before that she
was a professor at the Institute of Automation in Chinese Academy of Sciences in China. Please.
>> Ying-Li Tian: Okay, thanks, Zhengyou. I'll talk about automatic facial expression analysis. Part of this work was done when I was at CMU, about 10 years ago, and I continued working on it until 2004. Then the IBM product, the PeopleVision surveillance system, got more and more priority, so my focus shifted there. Now that I have come back to academia, I have more time to pick up this topic again, so I will also say a little about future work in the last slides.
Before I go into the details of the work, I want to first say a little about applications of facial expression analysis. Many people work on this, but we still do not see real applications of it. Some future applications, I think, are emotion and paralinguistic communication, clinical psychology, psychiatry and neurology, and also pain assessment. People can also do lie detection.
When I was at CMU, I think in the year 2000, the CIA organized a competition between CMU and the Salk Institute, who both did expression analysis, to see which system could really handle lie detection. Expression can also be used for person identification, since some people's expressions are very unique; I have one colleague whose neutral face is like this. That kind of feature can be used for person identification. We can also do image retrieval and video indexing, to say, "okay, can you find another picture of this person with a smile," something like that. And you can put this information into HCI and intelligent environments.
And when we talk about driver awareness, that uses only a subset of facial expression: when you get drowsy, when you are falling asleep, mainly the upper part of the face is used for driver awareness. Feel free to interrupt me if you have any questions. The general steps of automatic expression analysis I classify into three. The first one: we want to find the face. Where is the face? Only when we find the face can we do the facial expression analysis.
After we find the face, we do feature extraction and then representation. Then we do the facial expression recognition. For the face, in most lab-controlled environments people use face detection, but in real life you cannot always get the face directly, so you may also need head pose estimation, to find the orientation of the head and then focus on the frontal face.
For feature extraction, people mainly use two kinds of features. One kind is called geometric features: people extract where the eyes, lips, nose and so on are. By appearance features I mean that, given the detected face, you have the whole face and you compute features from its appearance.
Maybe you use edges, maybe you use Gabor filters, maybe some other features computed from that face region. For facial expression analysis, people work along two paths. One path covers the basic expressions; I'll give you some examples in the next slides. The other path is the action units. The six basic expressions mainly look at the whole face and say, okay, this expression belongs to anger or happiness, something like that. Action units are more subtle and correspond to specific parts of your face.
Okay, here are the six basic expressions. We call them basic because it doesn't matter which culture you come from, it doesn't matter whether you are male or female, these six classes of expressions appear to be universal: disgust, fear, joy, surprise, sadness, anger.
I will also show you some action units. The facial action units were coded by psychologists. They cover different parts of the face: for the upper face we have 12, for the lower face we have 18, then there are the eye and head positions and some others. In our work, people still mainly focus on the upper-face and lower-face action units. And in real life there are more than 7,000 different combinations of those action units, so our goal is not only to recognize single action units; we also need to deal with the combinations.
Here are some examples. AU1, for example: if you lift the inner part of the brow, that's AU1. If you lift the outer part of the brow, that's AU2. Paul Ekman can do that -- he's the famous psychologist who developed the action unit coding -- but most people cannot lift only the inner part and then only the outer part; we mainly move them together. If you move your whole brow up, that's AU1 plus AU2. I give some examples here, and you can see some in red. Those in red are the ones we did not recognize in our system, either because we do not have enough data in the database or because they are too confusing. For example, AU42, AU43 and AU44: AU43 is a closed eye. When you close your eye, that's AU43. AU42 is when you droop your upper lid about halfway.
But when you squint, like this, that's AU44. Those kinds of actions are too confusing, so I didn't put them in the system. Okay, why is facial expression difficult? I'll give you some examples here. You see this is AU13 and AU14: one is the cheek puffer, one is the dimpler, but in the image they look very similar, yet they are coded as two different AUs. And also, you can see this one is AU12 plus 25: AU12 means you pull the lip corner up, 25 means you open your mouth like that.
Put them together and, most of the time, that's the smile; if you do this, it's 12 plus 25. But for AU20 plus 25, AU20 means you just stretch your lips without lifting the lip corners. When people do it, it looks very similar. Another example is AU10 and AU9. Both show disgust, but for AU9, you can see here, we have furrows here. If you have furrows here, it is coded as AU9; if not, it is coded as AU10.
>>: Should I think of an action unit as being like a parameter or a state?
>> Ying-Li Tian: Let's say you have the neutral face, like this, and then you make this kind of facial movement. When you move like this, each movement is given one code, coded as an action unit. Different kinds of movement are given different numbers.
>>: But is this -- is an action unit like --
>> Zhengyou Zhang: It's a localized portion of the face.
>>: Okay, so not the state after the motion. It's the degree along which you move?
>> Ying-Li Tian: No -- if this is my neutral and I do this, that's one kind of motion. Then they have this kind of -- for example, AU12 is like this: I move my lip corner up.
>>: Okay, can I think of it as the action unit, one action unit is to be able to move the corners of
my lips up?
>> Ying-Li Tian: Yes.
>>: Okay, not the state of having my lips moved into that position?
>> Ying-Li Tian: No.
>>: Really. I mean, for me, I think the action unit is the state. When you raise your eyebrow,
that's the state of being up.
>>: As opposed to the muscle or whatever that moves that eyebrow up.
>>: It's not the motion, it's the fact that it's up, so that's the action unit present on your face. Am
I?
>> Ying-Li Tian: The movement is caused by facial muscles. I think you can also say that it's a state. Compared to the neutral, if I move this, that's AU1 plus 2, so the code describes this kind of state and movement. How much movement there is, we call the intensity; they have another parameter called action unit intensity. If that intensity is big -- for example, if I open my -- I will show you some examples, like this here. If we go back here, you can see we have the -- where is it? AU25. This is 25, this is 26, this is 27.
This is the previous version: they coded these as three different action units, but the main difference is how wide you open your mouth. For this one the lips are only slightly parted, that's AU25; if you open wider, with bigger intensity, they give a different action unit, and if you open very wide, they give the largest one.
They have since modified this for some action units: now they give this as one action unit, AU25, but with different intensities. They use the intensities A, B, C, D, E to say how much you moved. If you only move a little bit, that's intensity A; if you move a lot, it may go to E.
So, because different people have different amounts of movement, that part is more difficult.
>>: It sounds kind of like a state, except when you're combining them it seems hard to combine
states. Like, if they're parameters in some facial expression parameter space, I can imagine, oh,
I got some of this parameter stretched out this way and some of this one out this way, so it's easy
to understand what it means to combine.
>> Ying-Li Tian: So that's why it's very important how you extract the features and also how you represent them. If you represent the features well, you can capture those kinds of movements; if not, it's more difficult. Also, for some action units, the lips especially, I think that based only on 2D images it's hard to detect those kinds of things. We need 3D to decide whether you are doing this or not; from the frontal view, from a 2D image, they look very similar, but if you have 3D, then you can see the difference.
Okay, now that we have introduced the general background, I'll move to the work we did. We did several pieces of work here. The first one is facial expression analysis in a collaborative environment -- that means we can collect the data in a lab, so the people are mainly cooperative. We mainly have a single face at very high resolution, and the expressions are pretty exaggerated.
I'll show you some videos, and you can see why I say the expressions are exaggerated. After that, I will talk about neutral face detection: when we do this, we assume that a neutral face is available, but in many cases we don't have that neutral face, so how can we find the neutral face automatically? Then we move to some spontaneous expression analysis. After that, I did some performance evaluation for expression analysis, and then we'll talk about the future work.
Okay, now we move to facial expression in a collaborative environment. This work was done when I was at CMU with Professor Kanade and Professor Cohn. Professor Cohn is in the psychology department, and we are from the robotics and computer vision side, so we could band together to find the real applications.
For the facial expression analysis, we have a video. We assume the first frame is a neutral face: you start from neutral, then you perform the expression. From this video we extract the features and represent them. Since we have very high resolution images, we have all those kinds of detailed features. When we do the recognition, we separate the upper face and the lower face; both use a neural-network-based classifier, and then we get the result.
For the feature extraction, we mainly extract the eyebrows and the eyes, with very detailed iris, upper lid and lower lid, and we also put some features on the cheek. The cheek features don't contribute that much, especially for a baby face, because there is no texture there; they really don't provide much information. But the furrows provide a lot of information, and so do the lips.
I will not go into the details of the algorithm, but I can show you some furrow detection. These two are very useful for distinguishing disgust and smile, those kinds of expressions, mainly based on edge information. We compute the angle between the edges when we detect them; the angles are bigger mainly when you smile like that. And also for the lips, we combine the color, shape and motion information to try to track three states of the lips: open, closed and tightly closed.
Here are the color distributions. You can see that when the lips are in different states, the distributions look different. One difficulty with using color information is for people with darker skin, for African subjects: their lip color is very similar to their skin color, so sometimes we have some difficulties with them.
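A minimal sketch of the color part of this lip-state idea: compare the hue histogram of the lip region against reference histograms for each state. The hue-only comparison, the correlation metric and the per-state reference histograms are assumptions of this illustration; the actual system also combined shape and motion cues.

```python
import cv2
import numpy as np

def hue_histogram(bgr_patch, bins=32):
    """Normalized hue histogram of a lip-region patch."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])
    return cv2.normalize(hist, hist).flatten()

def classify_lip_state(lip_patch, reference_hists):
    """Pick the lip state (open / closed / tightly closed) whose
    reference histogram correlates best with the observed patch.

    reference_hists: dict mapping state name -> histogram built from
    labeled example patches (an assumption of this sketch).
    """
    query = hue_histogram(lip_patch).astype(np.float32)
    scores = {state: cv2.compareHist(query, ref.astype(np.float32),
                                     cv2.HISTCMP_CORREL)
              for state, ref in reference_hists.items()}
    return max(scores, key=scores.get)
```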
For the eye tracking and detection, we also try to detect the open eyes and the closed eyes. We define an open eye as one where the iris is visible; if the iris is not visible, it's a closed eye. We use the edge information from the iris to try to find it. And here I want to show you how we detect the features and how we represent them; I think that will answer part of the first question.
Mainly, we observe that across all expressions, the inner corners of your eyes are pretty stable. It doesn't matter what kind of expression you are doing, the inner corners don't move that much. So we use the inner corners as the reference points. We take the line between them and normalize all the extracted features against this baseline, to say, okay, how much did you raise your brow, what are the parameters, and also, for the different features, how big and how wide they are.
You can see we have the blue region here and the green region there. In those regions we mainly try to capture the furrows: this region is for distinguishing AU6 and AU7 -- when you try to smile, if you have some furrows there, it's AU6; if not, it isn't. This one, as in the image I showed before, is for distinguishing the disgust actions, AU9 and AU10, depending on whether there are furrows or not. And the same for the lower face: the outer lip features are also normalized against this baseline.
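A minimal sketch of this normalization step: treat the line between the two inner eye corners as the reference, and express every other feature point in coordinates that are invariant to translation, rotation and scale of that baseline. The exact parameterization is an assumption of this illustration, not the one in the original system.

```python
import numpy as np

def normalize_to_eye_baseline(points, left_inner, right_inner):
    """Express feature points relative to the inner-eye-corner baseline.

    points: (N, 2) array of (x, y) image coordinates;
    left_inner, right_inner: (x, y) of the two inner eye corners.
    Returns points translated to the baseline midpoint, rotated so the
    baseline is horizontal, and scaled by the inter-corner distance.
    """
    left_inner = np.asarray(left_inner, dtype=float)
    right_inner = np.asarray(right_inner, dtype=float)
    midpoint = (left_inner + right_inner) / 2.0
    baseline = right_inner - left_inner
    scale = np.linalg.norm(baseline)              # inter-corner distance
    angle = np.arctan2(baseline[1], baseline[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])             # rotation by -angle
    pts = np.asarray(points, dtype=float) - midpoint
    return (pts @ rot.T) / scale
```

With this, for example, the height of a brow point above the baseline becomes a number that is comparable across faces and image sizes.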
Okay, I'll show you some examples of the tracking method. You can see the smile is exaggerated -- nobody really smiles like this in real life -- but you can see the feature extraction: we can track the lips and also do the eye detection, the iris, like that. Here you can see, for the action units, we get the different AU combination from these features, and also which basic expression this belongs to.
We have a list saying which combinations belong to which basic expressions. All those action units were coded by certified coders; you have to get the certification to code these action units. Here is a different expression, anger. You can see the difference in the lips. We track that, even as the eyes close and open, and these kinds of furrows also contribute to the different actions. Here's another one, surprise.
For surprise, mainly the mouth is wide open and the eyes are wide open, and you can see the eyebrows move a lot.
>>: [inaudible].
>> Ying-Li Tian: We tried that, but most of the subjects in the database, the CMU-Pittsburgh database, were psychology students; not that many people had furrows on the forehead. Here is another subject. You can see two different expressions -- disgust at the beginning, then a smile.
Here is another subject. You can see their lip color is very similar to their skin color. Most of these we can handle, but for some of them the lip contour sometimes drifts away. Here are some examples with head motion: the head moves, and this is what it looks like, also with a different field of view.
You can see the features are tracked pretty well. We also tried several others -- this one was also recorded in the lab. I removed the chin feature in this case because it doesn't track that well, but you can see the iris tracking through all the motion. We can get very accurate results for all those kinds of detailed features.
And here are two videos with spontaneous expressions. The camera is not of that high quality, and there is a lot of head motion. Here is another one, an interview: we asked people to watch some movie clips, then we interviewed them. They didn't know we had a camera installed somewhere, so these expressions, you can see, are spontaneous.
The face resolution is much smaller, and some detailed features we cannot really catch. You can see this one -- I'll play it again. For the eyes, we didn't really catch those features, but the lip contours did pretty well.
For this one, the face region is about 60 by 80; for the previous one it was about 200 by 200 or something like that. Here are the recognition results. For the lower face we have these action units; here is how many times each occurs in the database, and here are the recognition results.
This one is pretty low, AU26. As I said before, AU25, 26 and 27 differ in how wide your mouth is opened, and most of the confusion is between 25 and 26. If we combined them into one action unit with different intensities, this result would be much better. And here are the recognition results for the upper face. Most of them did pretty well, the eyebrows especially.
AU5 and AU6 are a little bit lower; AU6 means your lower lid moves up -- with a smile, your lower lid moves up. These are very subtle and difficult to detect, but overall the recognition rate is pretty good.
And here I show you AU25 and AU26 in the image -- that's the confusion. Here are AU5 and AU6. Those are easily confused action units, where we didn't do as well as on the other actions.
>>: These are all human labeled and they're based on single images?
>> Ying-Li Tian: No, no, based on the video sequence. In the database we collected, the sequences start from neutral; then you gradually perform the expression up to the peak and then come back down, so they have the whole sequence. If you only depend on a single image, for some expressions near the beginning, just after neutral, the movement is only slight and they cannot code it.
So they compare: they play the video frame by frame and compare to the neutral to see whether there is movement or not. If they see movement, then they code the action unit.
>>: The whole sequence is labeled as one class, one action unit, total sequence?
>> Ying-Li Tian: Yes, from very low intensity to the peak and then back down; they have the whole sequence. They also try to do some communication analysis, for example distinguishing the social smile from the real smile. If you are really happy, your smile stays on your face, but for a social smile -- if I say, "Zhengyou Zhang, hi," then once he passes by, my smile comes back down quickly.
So they also look at that kind of communicative behavior: how long it takes to reach the peak, how quickly it comes back, all those kinds of things. For the coding, at the peak they can say from the single image, okay, it's AU5, but for images close to neutral they need to really play between the frames to see whether things are changing or not.
>>: And the results, the mistakes that were made -- are those because the features that you were measuring, like lip corners and all that, didn't have enough information, or because the classifier that you used needs to be improved?
>> Ying-Li Tian: I think the features -- from that high resolution, you can see we captured them pretty well. I think the main reason is the confusion between the action units. All those action units were designed by psychologists for human observers, not really for a computer to use.
>>: If a human looked at the features that were generated, they could tell the difference between
5 and 6?
>> Ying-Li Tian: Only a certified coder can do that. For me, for some of them, I think, okay, they are very similar, they are the same. If you are not trained for this, a normal person cannot distinguish those kinds of things.
>>: Do we have any data to show the consistency between the different coders, human coders?
>> Ying-Li Tian: They have some data like that. They use the kappa statistic to show it. The agreement between different coders on some of these is about 85 percent; they still disagree on 15 percent. You say, okay, that's AU6; they say, no, I don't think that's AU6, I think it's AU5, something like that. So the agreement is about 85 percent, with 15 percent confusion.
Okay, so the previous work had some assumptions. For the initialization of the features, we initialize them manually on the first frame, the neutral frame. We also assume the neutral face is available. The initialization has since been solved by active appearance model fitting algorithms: after I left CMU, Simon Baker -- I think he is at Microsoft Research now -- worked on that with several other people, to do automatic fitting of the facial features. They also extended the active appearance model to handle some head motion, to try to move toward spontaneous expressions.
Since we had the assumption of a neutral face, we asked: without the neutral face, what can we do? Given an image, I want to say whether it is a neutral face or not. That's the work I continued after I joined IBM: neutral face detection.
The purpose is, given some images, we want to find the neutral face. For the one in the yellow box, we say, okay, this is the neutral face; the others are not. This neutral face detection is image based, not video based, since we only have randomly collected images.
What we do is: we have the image input, we do the face detection, and if we find the face, we extract the features. For those features I use a different algorithm than before, since we assume that in a real application, in most cases, we cannot really get that high resolution image. Then, based on the location features and the shape features, we again use a neural network classifier to do the neutral face detection.
For this one, we only decide whether it is neutral or not. For the features, since most faces do not have that high resolution, we reduce them to the corners of the brows, the eyes and the lip corners, without all those detailed contours. From the results later you will see why we reduced it to this; we will have some low-resolution images to show you. The same as before, we use the eye inner corners as the baseline to normalize these corner features: how much the lip corners are raised, how far they are from this baseline, and also the brows.
Okay, and that's the local feature. Then we have the appearance feature: we take the image, compute the edges, and then the direction of each edge. At that time we didn't call this HOG -- HOG is very popular now for detection and recognition, all those kinds of things. At that time we called it zonal histogram features, but later I figured out it is essentially identical to the HOG feature, which people now use a lot.
We build the full directional histogram; I just show these three combinations of the features to indicate in which direction the edges lie. Then we put this feature together with the shape features and the geometric features into the neural network, and it decides neutral face versus non-neutral face.
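A rough sketch of the zonal edge-orientation histogram just described (essentially a HOG-style feature). The grid size, bin count and normalization are assumptions for illustration; the talk does not give the original parameters.

```python
import cv2
import numpy as np

def zonal_orientation_histogram(gray_face, grid=(4, 4), bins=8):
    """Concatenate per-zone histograms of gradient orientation,
    weighted by gradient magnitude (a HOG-like descriptor)."""
    gx = cv2.Sobel(gray_face, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray_face, cv2.CV_32F, 0, 1, ksize=3)
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)  # ang in [0, 360)
    h, w = gray_face.shape
    zh, zw = h // grid[0], w // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            m = mag[r * zh:(r + 1) * zh, c * zw:(c + 1) * zw]
            a = ang[r * zh:(r + 1) * zh, c * zw:(c + 1) * zw]
            hist, _ = np.histogram(a, bins=bins, range=(0, 360), weights=m)
            feats.append(hist / (hist.sum() + 1e-6))         # per-zone normalization
    return np.concatenate(feats)
```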
Here are some results. We tested on two databases. For training, we used the CMU-Pittsburgh database; we separated it into two parts, trained on one part and tested on the other, and we get pretty high detection rates. We also tested on an independent database, the Yale database. This one [inaudible], but it still keeps a pretty high detection rate.
>>: So you used the same training data.
>> Ying-Li Tian: We trained on one part of one database [inaudible], tested on the other part, and then tested on a totally different database. Here are some examples of wrong classification results. The upper part here are neutral faces classified as non-neutral. According to the coders, all of these are neutral faces, but our system says they are non-neutral. This one, I think maybe that is her neutral face, but it looks angry. This one has a little bit of a smile. This one, I think, is because of the moustache; in the training data we do not have this moustache.
These several images are non-neutral faces that we classified as neutral. For the non-neutral ones, you can see the lips are a little bit apart, so they are coded as AU25 -- that's non-neutral, but the intensity is very small -- and this one looks like a slight smile. So all the confusions are at very low-intensity expressions, and I think if we say all of those are neutral, it's still acceptable. Yes.
>>: I'm assuming that there are some psychologists who do the classification of which AU each face belongs to. What's their hit rate if you just gave them a collection of these kinds of small-difference faces? Would 95 percent say that's a neutral and 5 percent say that's a happy face? I'm thinking that you're cutting it too fine by trying to force a yes or no.
>> Ying-Li Tian: I don't know the percentage for that. I think that a certified coder will definitely say, okay, since the lips are apart, there is definitely an expression there. But psychology people without the coding certification would say, okay, that looks like neutral. I don't know the percentage for that part.
Okay, so we talked about expression in the lab, and also about the neutral face detection. Now we move to spontaneous expressions: in real life, how can we do spontaneous expression analysis?
For this kind of work, we mainly focus on two things: how to handle the head motion, and how to deal with low image quality. We did this because we have the PETS data from 2003. They recorded the data in a smart meeting room, and they have ground truth for the expressions, whether there is a smile or not and so on. You can see that, compared to the collaborative environment, the intensity of those expressions is very low.
The face region is about 70 by 90 pixels. How can we do that? Here is the overall approach. We have the video input; instead of doing face detection directly, since the people move around a lot, we do background subtraction. Then we do the head detection. Once we find the head, we decide: is it frontal or not?
If it is frontal, the front of the face, we do the feature extraction, then the neutral face detection, and classify it as neutral or smile or something else. During the head detection, if we cannot find the head, we stop; if we find the head, we continue to estimate the head pose.
Here I show you the background subtraction. Background subtraction mainly assumes the camera is static, and here is the background. When a moving object appears, we try to extract it; the moving object is called the foreground. Once we find the moving person, we use an omega shape -- a shape like the letter omega, with the head here and the shoulders here -- to find where the head is.
All of this has one assumption: we assume the people are upright. If you bend down or something, this algorithm will not work. And here it shows the algorithm we use for the head detection: once we get the silhouette, we fit a circle to the top part of the omega shape, and over several images we get the head.
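A minimal sketch of this step under simple assumptions: static-camera background subtraction by frame differencing, then taking the top of the largest foreground blob as a rough head candidate. The real system fits an omega (head-and-shoulders) shape; the fixed head fraction here is only a placeholder, and the contour call assumes the OpenCV 4.x API.

```python
import cv2
import numpy as np

def head_candidate(background_gray, frame_gray, diff_thresh=30, head_frac=0.25):
    """Rough head localization from a static-camera foreground mask.

    Returns (x, y, w, h) of a head candidate, or None if nothing moves.
    """
    diff = cv2.absdiff(background_gray, frame_gray)            # foreground = change
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    person = max(contours, key=cv2.contourArea)                # largest moving blob
    x, y, w, h = cv2.boundingRect(person)
    return (x, y, w, int(h * head_frac))                       # top part of the silhouette
```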
Okay, once we find the head, we ask: what is the head pose? We only keep the frontal or nearly frontal views. Frontal or nearly frontal means the head is turned by maybe less than 25 degrees; if it turns more than that, it counts as a side view. Within that range we can still see the facial features clearly.
For the facial expression, we still focus on this frontal case. For some of the other views I think we could do something, but since the psychologists mostly focus on the frontal face, we lack that kind of research, so we still focus on this. Only when the head pose is frontal or nearly frontal do we continue with the feature extraction.
Here I show you one example of our head pose detection. You can see the original image here and the background; we did the head detection, and you can see it changes along with the head pose. And this guy, Andrew Senior -- he's at Google now -- caused us a lot of trouble, because on his head the skin color goes all around the front and the back, and in our training data we do not have people like that.
Okay, that's the head pose. For the feature extraction, the previous feature tracking algorithm cannot work here, so how can we extract the features? We just use very simple threshold-based features. You can see that with an appropriate threshold we mainly get the eyebrows and the eyes, which are darker than the other parts.
If the threshold is too high, you get this instead; but this is the brow, this is the eye. So we do automatic threshold selection and get the two pairs, both the eyebrows and the eyes. Once we have the eyebrows and the eyes, we assume the lips are well below them, and we do an automatic search to find the lip corners.
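A minimal sketch of the thresholding idea: binarize the face so that the dark regions (brows, eyes) pop out, then keep the largest dark blobs. Otsu's method stands in for the automatic threshold selection, since the talk does not specify the actual rule.

```python
import cv2

def dark_facial_blobs(gray_face, max_blobs=4):
    """Return bounding boxes (x, y, w, h) of the largest dark blobs,
    which at low resolution are mainly the eyebrows and the eyes."""
    blur = cv2.GaussianBlur(gray_face, (3, 3), 0)
    _, dark = cv2.threshold(blur, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(dark)
    # stats rows are [x, y, w, h, area]; row 0 is the background component
    blobs = sorted(stats[1:], key=lambda s: s[4], reverse=True)[:max_blobs]
    return [tuple(int(v) for v in b[:4]) for b in blobs]
```

The lip corners would then be searched for in the face region below the detected eye blobs, as described above.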
Yes?
>>: Overall, you are using the geometric features. I wonder, why don't you use some kind of
textural feature, especially when [inaudible]?
>> Ying-Li Tian: We have some textural features; we call them appearance features. I will show you later. That's where we use the Gabor wavelets to try to capture all those kinds of changes without extracting the local features. And we also compare the features to see which kind works better.
And here are some extracted features in that smart meeting room. We detect the head and the frontal face, then use these kinds of features. We use this kind of shape feature over the whole face, as before, and also the distance features, with the neural network, to classify neutral face, smile, surprise, angry and others, plus an output for non-frontal.
And here are some results. Before we talk about the results: you can see they already have ground truth. They say, okay, smile -- and since they use the video, when you turn away they assume you continue to smile, so it's smile, smile, smile. I made some modifications, because those cases are confusing for our system: this is a profile, this is a side view, and these frames were labeled smile; I changed them to neutral. So for some of the ground truth I made modifications, and I test the results against this modified ground truth.
You see here, for this one, we detected surprise rather than smile, I think mainly because the eyebrows are lifted a lot. And here are the tables of recognition results for non-frontal, neutral and the rest; they have a lot of smiles but not that many other expressions. Okay, once we tried that, we had more and more questions we wanted to answer, so we did a performance evaluation to ask which kinds of facial features play a more important role for facial expression, and down to which facial resolution we can still do the expression analysis.
And, also, if we want to do recognition of action units versus basic expressions, which one is easier, which can we do better? So we mainly compare across facial resolutions, across the algorithms for face acquisition -- meaning the head pose detection and the face detection -- and across the feature extraction methods, the geometric features and the appearance features. And we also compare the recognition rates for action units and for the basic expressions.
Here is a table on the CMU database. From the original images we just down-sampled to different levels of resolution, to ask at each level: can we do face detection or head pose detection? Can we do face recognition? Can we extract the facial features? Can we really do the expression analysis?
Down to here, I think we can do everything. But at these lower levels, can we really do all of this? I'm not sure. Feature extraction and expression analysis I think we cannot do at that level, so we also need to check at which level we can really still do the facial expression analysis.
And for the face acquisition, I compared three algorithms. For the face detection we used the neural network detector developed by Henry Rowley -- he was at CMU, then worked at Microsoft, and is now at Google. Another one is the integral-image-based face detector, the famous Viola-Jones face detector. And we also tried our head pose detection, to see whether we can get comparable results.
And for the features, the geometric features, I compared two versions. One is the extensive facial features: that means for high resolution we have all the contours -- at the beginning I showed you the video with all those details. Then we have the basic facial features, just the six corner points: the lip corners, the eyes and the brows. And the appearance features represent the skin texture and the lips, all those kinds of changes; we apply the Gabor wavelets over the whole face to get the appearance features.
These are the extensive geometric features, and these are the basic geometric features. Here are some examples of the Gabor wavelets in different directions: this one captures edges in the horizontal direction, these are the other directions, and this is the vertical direction. We have the different orientations, we can combine them together and apply them on the whole face, so it doesn't matter whether you can [inaudible] or not. Once we detect the face, we just apply them to that face region to get that.
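A minimal Gabor filter bank sketch for this appearance feature. Eight orientations roughly match what is mentioned later in the talk; the kernel size, wavelengths and the mean-magnitude pooling are assumptions of this illustration.

```python
import cv2
import numpy as np

def gabor_bank(ksize=15, orientations=8, wavelengths=(4, 6, 8, 12, 16, 24)):
    """Bank of Gabor kernels: several scales x eight orientations."""
    kernels = []
    for lam in wavelengths:
        for i in range(orientations):
            theta = np.pi * i / orientations
            k = cv2.getGaborKernel((ksize, ksize), sigma=0.5 * lam,
                                   theta=theta, lambd=lam, gamma=0.5, psi=0)
            kernels.append(k / np.abs(k).sum())    # normalize kernel energy
    return kernels

def gabor_features(gray_face, kernels):
    """Mean magnitude response of each kernel over the whole face region."""
    face = cv2.resize(gray_face, (64, 64)).astype(np.float32)
    return np.array([np.abs(cv2.filter2D(face, cv2.CV_32F, k)).mean()
                     for k in kernels])
```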
>>: [inaudible].
>> Ying-Li Tian: That one I didn't compare here, since if the resolution is too low you can't really get those kinds of edges. But for this one, it doesn't matter whether you get the edges or not; you just apply it over the whole face. I think it is somehow like the histogram features. And here is the example for the expression part, the basic expressions and also the action units.
Here you can see we combined 25, 26 and 27 into one. And, the same as before, we have the video; after we extract the features, we still use the neural network, and then we get either each action unit or each of the six basic expressions.
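A minimal sketch of this last stage: concatenate geometric and appearance feature vectors and train a small neural network. I use scikit-learn's MLPClassifier as a stand-in for the original neural network, and the feature dimensions and labels below are placeholders, not the real data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: geometric features, appearance features, expression labels.
rng = np.random.default_rng(0)
X_geom = rng.normal(size=(200, 12))       # e.g. normalized corner positions
X_app = rng.normal(size=(200, 48))        # e.g. Gabor responses
y = rng.integers(0, 6, size=200)          # six basic expressions

X = np.hstack([X_geom, X_app])            # "combined" = concatenated feature vector
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                  random_state=0))
clf.fit(X, y)
print(clf.predict(X[:5]))                 # per-frame expression predictions
```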
We still use the CMU-Pittsburgh database to do the training and the testing. We just down-sample the high-resolution images to different levels of low resolution. Here are some results.
For the face acquisition, you can see that down to here both detectors -- Henry Rowley's neural-network-based one and the Viola-Jones detector -- can detect all the faces, because the situation is very simple: only one face in the image, without any cluttered background.
But at this level, both of them fail. I think the minimum size both of them work at is about 24 by 24, so if the resolution is lower than that, they cannot detect the face. But the head pose detection, surprisingly, continues to give pretty good results down to this low resolution.
Here are the feature extraction results. For the extensive features we can do pretty well down to here; at this level of resolution we cannot track well -- that means we cannot track all the details, all the contours, at this level.
For the basic geometric features, the six corner points, we can manage down to this level, but not at that one. The appearance features we can always apply: as long as we can find the head or the face region, it doesn't matter how small it is; we just apply them on the face region.
>>: [inaudible]. Where do the [inaudible].
>> Ying-Li Tian: No, we tried that. Yes, we have samples; it looks very weird, and it still cannot do it. You can see, when we display these, we upsample them all to the same size. The face detector cannot detect it.
These are the action unit recognition results. We compare the results like this: using only the geometric features, either the extensive ones or only the six corners; using only the appearance features, the Gabor wavelets applied to the whole face; and then the combinations, the appearance features plus the extensive geometric features, or the appearance features plus the six-point geometric features. You can see that for the extensive features, down to this level it drops a little but I think stays at about the same level; down to this one it works pretty well.
For the second geometric feature set, the six points without the detailed contours, the result is much worse -- although I don't think 1 or 2 percent is really meaningful -- and at this level it drops a lot. For the appearance features, we get similar results to the extensive geometric features: it drops, but we can still do it; even at this level we still get a 58 percent detection rate.
When we combine them together, we see a small increase in accuracy, but not that much.
>>: How do you combine them?
>> Ying-Li Tian: Combining them means that when we train the classifier, we use both kinds of features; for this one, we only use the one kind of feature.
>>: This is the Gabor wavelets. It's much better.
>>: What's the dimensionality of that, the appearance feature?
>> Ying-Li Tian: This one, I forgot the detailed numbers.
>>: How many wavelets?
>> Ying-Li Tian: We have all the directions, all eight directions, maybe about 22.5 degrees apart. It should be eight directions. I forgot how many scales we used, maybe six -- six by eight -- on the whole face.
>>: This is whole face?
>> Ying-Li Tian: On the whole face. On the whole face.
>>: This is 48?
>> Ying-Li Tian: Forty-eight, something like that.
>>: You have [inaudible].
>> Ying-Li Tian: No, no -- here, here. You can see here: for this one, the appearance features, we based them on those feature points. But for the lower-resolution faces, we don't know where those feature points are, so we apply them on the whole face.
>>: [inaudible] slide. So that was -- you had this --
>> Ying-Li Tian: We apply it on the whole face.
>>: For what resolutions?
>> Ying-Li Tian: For both of the resolutions, yes.
>>: In my earlier work on Gabor wavelets, it was much better than geometry.
>> Ying-Li Tian: I think for that one, since most of the features are tracked pretty well, we get comparable results to the appearance features. But you can see that with only this kind of feature, if we cannot extract very good location features, it's much worse than the appearance features.
>>: Forty-eight?
>> Ying-Li Tian: No, no, for each pixel we have the 48.
>>: So are each of these 18 [inaudible]? So how many like -- what's the --
>> Ying-Li Tian: I need to recall that part -- how many pixels we used. At least we have those regions, because we know the eyebrows, eyes and lip corners. But for this one, where we do not have those features, I'm not sure whether we applied it per pixel or per region; I must have divided the face into some grid like that. But how many regions, and how I divided it -- I don't have that number in my head right now.
>>: So for the high-resolution faces are we sure that [inaudible].
>> Ying-Li Tian: No, for high resolution, this one, we just use those kind of points, because we
know all those kind of positions. We have these geometric features.
>>: In this position, you --
>> Ying-Li Tian: We have the --
>>: It could be the table [phonetic].
>> Ying-Li Tian: Yes, yes, but for this one, since we cannot get the geometric features, I think I just gridded the face: for each small region, a sub-block of the grid, we get the appearance features.
Here are the conclusions of the performance evaluation. The head detection and head pose estimation can find the face at very low resolution, much better than the face detectors. There is no difference in the recognition for expression analysis as long as the head region, meaning the face region, is larger than this size.
Also, the extensive geometric features and the appearance features achieve the same level of recognition rate, but when the geometric features cannot be extracted well, the appearance features do better. This means that in the future we do not really need to spend that much time extracting those kinds of geometric features; we can develop some good appearance features to work with.
And for both action units and the six basic expressions, the recognition seems not that different for us. Okay, the future work. For now I'm trying to focus on two things. One is to continue the spontaneous expression analysis, but to combine the expression and the talking -- combine video and audio. That's why I say I will also do some audio work: right now, when people do facial expression analysis, they assume all the facial movement comes from facial expression.
But people are also working on the lip-reading, the speech reading, all that kind of information, and they assume all the facial movement comes from talking -- and we know that's not true. In real life, when you talk, you also have smiles, expressions. How can we improve, mainly improve the lip-reading in the presence of expressions? We have an NIH proposal to do lip-reading to help deaf people with that. So I think we will work on something like that: when both expression and talking appear, what should we do?
Another direction I want to work on is content- and context-based expression analysis. What does this mean? Context-based just means that from the video, if we know the previous frames and some other things around them, then we can estimate some expressions even when you turn your head away from the camera.
For the content: sometimes only when we know the content and the context can we really use the expression analysis for emotion analysis. For the whole talk, I only mentioned emotion at the beginning, in the applications, and I never came back to it. How can we use these results for emotion analysis? We must understand the content and the context. Here is an image: which kind of expression does he have, from your point of view?
Sad? He's crying, wiping his tears. Here's the real story: he won the championship, and he's so excited that he's crying. I think all of this can change, and the context information will matter a lot for real applications. Here are some more expressions from championship winners. You can see all the different contexts, but all of them show that exciting moment. How can we use all that information to be able to say, okay, yes, you are really happy, or you are excited? That's all.
>>: Thank you.
>> Ying-Li Tian: Yes.
>>: One problem [inaudible] with expression analysis is really the data set. There are not many data sets available, and it's very hard to capture a data set. You mentioned one data set.
>> Ying-Li Tian: One data set we captured was at the Salk Institute. They hid the cameras, so the interviewees did not know there was a camera there, but because of the IRB, the confidentiality issues, they cannot really release the data set.
So I think the data set is a difficult problem, and also the comparison between different pieces of work is difficult. So I will mainly still focus on the controlled environment, doing the lip-reading work but with some expression added.
Spontaneous expression, compared to expression in the collaborative environment, is hard mainly because the intensity is very small, and also because of all the head motions. How can we deal with that? If we can at least do the segmentation -- if we have a long video, we can say, okay, in this part of the video this guy is happy, in this part this guy is sad, something like that -- I think that's good enough for now. Not that many people are really working on spontaneous expression.
>>: [inaudible].
>> Ying-Li Tian: It's hard, it's hard.
>>: I talked with some people. They said they all want to work on this, but they cannot get data.
>> Ying-Li Tian: They cannot get data, and also it's very hard to solve. So, for me, once I get the funding support, we will really focus on some applications.
>>: [Inaudible].
>> Ying-Li Tian: For example, one I mentioned is the NIH project for deaf people. Mainly we want to provide enhanced lip-reading, to give them captions. For that one, we will capture the data in a restaurant or a meeting, a gathering room, with multiple people talking.
>>: [inaudible].
>> Ying-Li Tian: They have a lot of expressions, and for that one I think we will focus on the lip-reading with the different expressions. We will hire certified coders to code the expressions for us. We will not cover all the expressions extensively: if there is a smile, we code a smile. The main focus is on the lip-reading, but if they have three different expressions, we will deal with those three expressions. Yes.
>>: It seems there's a lot of similarity between facial recognition and expression recognition. I
was wondering if you are able to leverage some of the developments in facial recognition?
>> Ying-Li Tian: For the recognition and this together?
>>: It sort of seems to me -- for example, [inaudible] maybe the neutral face is more dominant
and is the most significant [inaudible] something like that. So one of the expressions [inaudible] a
little bit other places around there. Just think widely [inaudible] any leverage or bridge between
these two types of?
>> Ying-Li Tian: I'm thinking the feature extraction part can be used for face recognition, but the classification and the further processing are different between this and face recognition, because for expression analysis we try to find what is common. Across different people, we try to find the commonality: all these kinds of movements belong to the smile, the joy. But for face recognition you are mainly trying to find the differences, to say which part differentiates you and me, so I know you are you and I am me.
So I think for the classification part, the discriminative part, we need to think about how to use those kinds of features, but the feature extraction can be used for both applications. Do you know what I mean?
>>: Let's say you have a whole face, so each pixel connects to some kind of neural network. For one, you can do the recognition. The other, based on the movements, you might be able to read the expression. I'm just --
>> Ying-Li Tian: Isn't that --
>>: [inaudible] For facial expression you should try to extract the commonality among people. But for facial recognition you need to just distinguish different people.
>> Ying-Li Tian: Find the differences between the people, yes. For expression classification, we try to find the commonality: all the different people who have the same kind of expression belong to that class. And, as I mentioned a little earlier, to use this for person identification means that, compared to this kind of general classification, you look for some very unique expressions.
You may say, okay, that unique expression belongs to you, not to him, and once we find those unique expressions, we may know it's you; we can use that kind of unique information for the identification.
>>: Leave the expressions, these are things [inaudible] differentiated. It's like you're trying to do
with faces.
>> Ying-Li Tian: I think at CMU, Yanxi Liu worked on some symmetry information, trying to use symmetry to identify people, since different people have different asymmetries, even in the face. That's why when you look in the mirror, you sometimes notice the asymmetry more than when you just look at another person.
>>: [inaudible] images that we saw in your slide deck, were all of those taken from analyses that
were occurring in real time or was this from recorded video?
>> Ying-Li Tian: The processing was all on recorded video from that database, but the process itself runs in real time.
>>: And have you ever used that information to drive 3D face models for facial performance?
>> Ying-Li Tian: No. I didn't try the 3D-based model. Yes.
>>: [inaudible] the geometry features and the texture appearance features. However, do you think it would be useful to apply some model of the state [inaudible], of how the expression changes over time?
>> Ying-Li Tian: That would help a lot. That's one direction; people are trying to work on the temporal context information. For now, for all the results I presented today, even though we use the video to do the tracking, the recognition is still frame based; we didn't use any temporal information. That's why, for the spontaneous case in the smart meeting, they code it as a smile even when the person [inaudible] moves away, because they use the video information to say, okay, that guy is still smiling.
But since our classification is image based, we cannot recognize that as a smile.
>>: [inaudible] over time. You can [inaudible].
>> Ying-Li Tian: With that, at least we can improve the result. You mean: for a single frame we make a wrong recognition, but since it cannot be smile, then sad, then smile from frame to frame, we can absolutely improve the overall recognition rate.
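A minimal sketch of that idea: smooth the per-frame labels with a sliding-window majority vote, so an isolated sad frame inside a run of smile frames is overruled by its neighbors. This is only an illustration of the kind of temporal post-processing being discussed, not something from the system presented here.

```python
from collections import Counter

def smooth_labels(frame_labels, window=15):
    """Sliding-window majority vote over per-frame expression labels."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_labels)):
        neighborhood = frame_labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed

labels = ["smile"] * 5 + ["sad"] + ["smile"] * 5
print(smooth_labels(labels, window=5))    # the lone 'sad' frame is smoothed away
```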
>> Zhengyou Zhang: Any more questions?
Let's thank Ying-Li again.