>> Zhengyou Zhang: Okay, so let's get started. It's my pleasure to introduce Professor Ying-Li
Tian. She's from the City University of New York, and she is also visiting us for two months. And
before joining City University of New York she was a researcher at IBM, leading the effort on the Smart Surveillance system from the initial research project through commercializing the product, so it's really a big effort. Before that, she worked at CMU for quite a few years on facial expression recognition, which is probably one of the main parts of the talk. And before that she
was a professor at the Institute of Automation in Chinese Academy of Sciences in China. Please.
>> Ying-Li Tian: Okay, thanks, Zhengyou. I'll talk about automatic facial expression analysis. Part of this work was done when I was at CMU, about 10 years ago, and I continued working on it until 2004. Then the IBM product, the PeopleVision surveillance system, got more and more priority, so my focus shifted there. Now that I have come back to academia, I have more time to pick up this topic again, so I will also say a little about future work in the last slides.
Before I go into the details of the work, I want to first say a little about applications of facial expression analysis. Many people work on this, but we still do not see real applications of it. Some future applications, I think, are emotion and paralinguistic communication, clinical psychology, psychiatry and neurology, and also pain assessment. People can also do lie detection.
When I was at CMU, I think in the year 2000, the CIA organized a competition between CMU and the Salk Institute, who both did expression analysis, to see which system could really handle lie detection. Expression can also be used for person identification, since some people's expressions are very unique; I have one colleague whose neutral face is like this. That kind of feature can be used for person identification. We can also do image retrieval and video indexing, to say, "okay, can you find another picture of this person with a smile," something like that. And you can put this information into HCI and intelligent environments.
And when we talk about driver awareness, that uses only a subset of facial expression: when you get drowsy, when you are falling asleep, mainly the upper part of the face is used for driver awareness. Feel free to interrupt me if you have any questions. The general steps of automatic expression analysis I classify into three. The first one: we want to find the face. Where is the face? Only when we find the face can we do the facial expression analysis.
After we find the face, we do feature extraction and then representation. Then we do the facial expression recognition. For the face, in most lab-controlled environments people use face detection, but in real life you cannot always get the face directly, so you may also need head pose estimation, to find the orientation of the head and then focus on the frontal face.
For feature extraction, people mainly use two kinds of features. One kind is called geometric features: people extract where the eyes, lips, nose and so on are. By appearance features I mean that, given the detected face, you have the whole face and you compute features from its appearance.
Maybe you use edges, maybe you use Gabor filters, maybe some other features computed from that face region. For facial expression analysis, people work along two paths. One path covers the basic expressions; I'll give you some examples in the next slides. The other path is the action units. The six basic expressions mainly look at the whole face and say, okay, this expression belongs to anger or happiness, something like that. Action units are more subtle and correspond to specific parts of your face.
Okay, here are the six basic expressions. We call them basic because it doesn't matter which culture you come from, it doesn't matter whether you are male or female, these six classes of expressions appear to be universal: disgust, fear, joy, surprise, sadness, anger.
I will also show you some action units. The facial action units were coded by psychologists. They cover different parts of the face: for the upper face we have 12, for the lower face we have 18, then there are the eye and head positions and some others. In our work, people still mainly focus on the upper-face and lower-face action units. And in real life there are more than 7,000 different combinations of those action units, so our goal is not only to recognize single action units; we also need to deal with the combinations.
Here are some examples. AU1, for example: if you lift the inner part of the brow, that's AU1. If you lift the outer part of the brow, that's AU2. Paul Ekman can do that -- he's the famous psychologist who developed the action unit coding -- but most people cannot lift only the inner part and then only the outer part; we mainly move them together. If you move your whole brow up, that's AU1 plus AU2. I give some examples here, and you can see some in red. Those in red are the ones we did not recognize in our system, either because we do not have enough data in the database or because they are too confusing. For example, AU42, AU43 and AU44: AU43 is a closed eye. When you close your eye, that's AU43. AU42 is when you droop your upper lid about halfway.
But when you squint, like this, that's AU44. Those kinds of actions are too confusing, so I didn't put them in the system. Okay, why is facial expression difficult? I'll give you some examples here. You see this is AU13 and AU14: one is the cheek puffer, one is the dimpler, but in the image they look very similar, yet they are coded as two different AUs. And also, you can see this one is AU12 plus 25: AU12 means you pull the lip corner up, 25 means you open your mouth like that.
Put them together and, most of the time, that's the smile; if you do this, it's 12 plus 25. But for AU20 plus 25, AU20 means you just stretch your lips without lifting the lip corners. When people do it, it looks very similar. Another example is AU10 and AU9. Both show disgust, but for AU9, you can see here, we have furrows here. If you have furrows here, it is coded as AU9; if not, it is coded as AU10.
>>: Should I think of an action unit as being like a parameter or a state?
>> Ying-Li Tian: Let's say you have the neutral face, like this, and then you make this kind of facial movement. When you move like this, each movement is given one code, coded as an action unit. Different kinds of movement are given different numbers.
>>: But is this -- is an action unit like --
>> Zhengyou Zhang: It's a localized portion of the face.
>>: Okay, so not the state after the motion. It's the degree along which you move?
>> Ying-Li Tian: No -- if this is my neutral and I do this, that's one kind of motion. Then they have this kind of -- for example, AU12 is like this: I move my lip corner up.
>>: Okay, can I think of it as the action unit, one action unit is to be able to move the corners of
my lips up?
>> Ying-Li Tian: Yes.
>>: Okay, not the state of having my lips moved into that position?
>> Ying-Li Tian: No.
>>: Really. I mean, for me, I think the action unit is the state. When you raise your eyebrow,
that's the state of being up.
>>: As opposed to the muscle or whatever that moves that eyebrow up.
>>: It's not the motion, it's the fact that it's up, so that's the action unit present on your face. Am
I?
>> Ying-Li Tian: The movement is caused by facial muscles. I think you can also say that it's a state. Compared to the neutral, if I move this, that's AU1 plus 2, so the code describes this kind of state and movement. How much movement there is, we call the intensity; they have another parameter called action unit intensity. If that intensity is big -- for example, if I open my -- I will show you some examples, like this here. If we go back here, you can see we have the -- where is it? AU25. This is 25, this is 26, this is 27.
This is the previous version: they coded these as three different action units, but the main difference is how wide you open your mouth. For this one the lips are only slightly parted, that's AU25; if you open wider, with bigger intensity, they give a different action unit, and if you open very wide, they give the largest one.
They have since modified this for some action units: now they give this as one action unit, AU25, but with different intensities. They use the intensities A, B, C, D, E to say how much you moved. If you only move a little bit, that's intensity A; if you move a lot, it may go to E.
So, because different people have different amounts of movement, that part is more difficult.
>>: It sounds kind of like a state, except when you're combining them it seems hard to combine
states. Like, if they're parameters in some facial expression parameter space, I can imagine, oh,
I got some of this parameter stretched out this way and some of this one out this way, so it's easy
to understand what it means to combine.
>> Ying-Li Tian: So that's why it's very important how you extract the features and also how you represent them. If you represent the features well, you can capture those kinds of movements; if not, it's more difficult. Also, for some action units, the lips especially, I think that based only on 2D images it's hard to detect those kinds of things. We need 3D to decide whether you are doing this or not; from the frontal view, from a 2D image, they look very similar, but if you have 3D, then you can see the difference.
Okay, now that we have introduced the general background, I'll move to the work we did. We did several pieces of work here. The first one is facial expression analysis in a collaborative environment -- that means we can collect the data in a lab, so the people are mainly cooperative. We mainly have a single face at very high resolution, and the expressions are pretty exaggerated.
I'll show you some videos, and you can see why I say the expressions are exaggerated. After that, I will talk about neutral face detection: when we do this, we assume that a neutral face is available, but in many cases we don't have that neutral face, so how can we find the neutral face automatically? Then we move to some spontaneous expression analysis. After that, I did some performance evaluation for expression analysis, and then we'll talk about the future work.
Okay, now we move to facial expression in a collaborative environment. This work was done when I was at CMU with Professor Kanade and Professor Cohn. Professor Cohn is in the psychology department, and we are from the robotics and computer vision side, so we could band together to find the real applications.
For the facial expression analysis, we have a video. We assume the first frame is a neutral face: you start from neutral, then you perform the expression. From this video we extract the features and represent them. Since we have very high resolution images, we have all those kinds of detailed features. When we do the recognition, we separate the upper face and the lower face; both use a neural-network-based classifier, and then we get the result.
For the feature extraction, we mainly extract the eyebrows and the eyes, with very detailed iris, upper lid and lower lid, and we also put some features on the cheek. The cheek features don't contribute that much, especially for a baby face, because there is no texture there; they really don't provide much information. But the furrows provide a lot of information, and so do the lips.
I will not go into the details of the algorithm, but I can show you some furrow detection. These two are very useful for distinguishing disgust and smile, those kinds of expressions, mainly based on edge information. We compute the angle between the edges when we detect them; the angles are bigger mainly when you smile like that. And also for the lips, we combine the color, shape and motion information to try to track three states of the lips: open, closed and tightly closed.
Here are the color distributions. You can see that when the lips are in different states, the distributions look different. One difficulty with using color information is for people with darker skin, for African subjects: their lip color is very similar to their skin color, so sometimes we have some difficulties with them.
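A minimal sketch of the color part of this lip-state idea: compare the hue histogram of the lip region against reference histograms for each state. The hue-only comparison, the correlation metric and the per-state reference histograms are assumptions of this illustration; the actual system also combined shape and motion cues.

```python
import cv2
import numpy as np

def hue_histogram(bgr_patch, bins=32):
    """Normalized hue histogram of a lip-region patch."""
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180])
    return cv2.normalize(hist, hist).flatten()

def classify_lip_state(lip_patch, reference_hists):
    """Pick the lip state (open / closed / tightly closed) whose
    reference histogram correlates best with the observed patch.

    reference_hists: dict mapping state name -> histogram built from
    labeled example patches (an assumption of this sketch).
    """
    query = hue_histogram(lip_patch).astype(np.float32)
    scores = {state: cv2.compareHist(query, ref.astype(np.float32),
                                     cv2.HISTCMP_CORREL)
              for state, ref in reference_hists.items()}
    return max(scores, key=scores.get)
```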
For the eye tracking and detection, we also try to detect the open eyes and the closed eyes. We define an open eye as one where the iris is visible; if the iris is not visible, it's a closed eye. We use the edge information from the iris to try to find it. And here I want to show you how we detect the features and how we represent them; I think that will answer part of the first question.
Mainly, we observe that across all expressions, the inner corners of your eyes are pretty stable. It doesn't matter what kind of expression you are doing, the inner corners don't move that much. So we use the inner corners as the reference points. We take the line between them and normalize all the extracted features against this baseline, to say, okay, how much did you raise your brow, what are the parameters, and also, for the different features, how big and how wide they are.
You can see we have the blue region here and the green region there. In those regions we mainly try to capture the furrows: this region is for distinguishing AU6 and AU7 -- when you try to smile, if you have some furrows there, it's AU6; if not, it isn't. This one, as in the image I showed before, is for distinguishing the disgust actions, AU9 and AU10, depending on whether there are furrows or not. And the same for the lower face: the outer lip features are also normalized against this baseline.
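A minimal sketch of this normalization step: treat the line between the two inner eye corners as the reference, and express every other feature point in coordinates that are invariant to translation, rotation and scale of that baseline. The exact parameterization is an assumption of this illustration, not the one in the original system.

```python
import numpy as np

def normalize_to_eye_baseline(points, left_inner, right_inner):
    """Express feature points relative to the inner-eye-corner baseline.

    points: (N, 2) array of (x, y) image coordinates;
    left_inner, right_inner: (x, y) of the two inner eye corners.
    Returns points translated to the baseline midpoint, rotated so the
    baseline is horizontal, and scaled by the inter-corner distance.
    """
    left_inner = np.asarray(left_inner, dtype=float)
    right_inner = np.asarray(right_inner, dtype=float)
    midpoint = (left_inner + right_inner) / 2.0
    baseline = right_inner - left_inner
    scale = np.linalg.norm(baseline)              # inter-corner distance
    angle = np.arctan2(baseline[1], baseline[0])
    c, s = np.cos(-angle), np.sin(-angle)
    rot = np.array([[c, -s], [s, c]])             # rotation by -angle
    pts = np.asarray(points, dtype=float) - midpoint
    return (pts @ rot.T) / scale
```

With this, for example, the height of a brow point above the baseline becomes a number that is comparable across faces and image sizes.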
Okay, I'll show you some examples of the tracking method. You can see the smile is exaggerated -- nobody really smiles like this in real life -- but you can see the feature extraction: we can track the lips and also do the eye detection, the iris, like that. Here you can see, for the action units, we get the different AU combination from these features, and also which basic expression this belongs to.
We have a list saying which combinations belong to which basic expressions. All those action units were coded by certified coders; you have to get the certification to code these action units. Here is a different expression, anger. You can see the difference in the lips. We track that, even as the eyes close and open, and these kinds of furrows also contribute to the different actions. Here's another one, surprise.
For surprise, mainly the mouth is wide open and the eyes are wide open, and you can see the eyebrows move a lot.
>>: [inaudible].
>> Ying-Li Tian: We tried that, but most of the subjects in the database, the CMU-Pittsburgh database, were psychology students; not that many people had furrows on the forehead. Here is another subject. You can see two different expressions -- disgust at the beginning, then a smile.
Here is another subject. You can see their lip color is very similar to their skin color. Most of these we can handle, but for some of them the lip contour sometimes drifts away. Here are some examples with head motion: the head moves, and this is what it looks like, also with a different field of view.
You can see the features are tracked pretty well. We also tried several others -- this one was also recorded in the lab. I removed the chin feature in this case because it doesn't track that well, but you can see the iris tracking through all the motion. We can get very accurate results for all those kinds of detailed features.
And here are two videos with spontaneous expressions. The camera is not of that high quality, and there is a lot of head motion. Here is another one, an interview: we asked people to watch some movie clips, then we interviewed them. They didn't know we had a camera installed somewhere, so these expressions, you can see, are spontaneous.
The face resolution is much smaller, and some detailed features we cannot really catch. You can see this one -- I'll play it again. For the eyes, we didn't really catch those features, but the lip contours did pretty well.
For this one, the face region is about 60 by 80; for the previous one it was about 200 by 200 or something like that. Here are the recognition results. For the lower face we have these action units; here is how many times each occurs in the database, and here are the recognition results.
This one is pretty low, AU26. As I said before, AU25, 26 and 27 differ in how wide your mouth is opened, and most of the confusion is between 25 and 26. If we combined them into one action unit with different intensities, this result would be much better. And here are the recognition results for the upper face. Most of them did pretty well, the eyebrows especially.
AU5 and AU6 are a little bit lower; AU6 means your lower lid moves up -- with a smile, your lower lid moves up. These are very subtle and difficult to detect, but overall the recognition rate is pretty good.
And here I show you AU25 and AU26 in the image -- that's the confusion. Here are AU5 and AU6. Those are easily confused action units, where we didn't do as well as on the other actions.
>>: These are all human labeled and they're based on single images?
>> Ying-Li Tian: No, no, based on the video sequence. In the database we collected, the sequences start from neutral; then you gradually perform the expression up to the peak and then come back down, so they have the whole sequence. If you only depend on a single image, for some expressions near the beginning, just after neutral, the movement is only slight and they cannot code it.
So they compare: they play the video frame by frame and compare to the neutral to see whether there is movement or not. If they see movement, then they code the action unit.
>>: The whole sequence is labeled as one class, one action unit, total sequence?
>> Ying-Li Tian: Yes, from very low intensity to the peak and then back down; they have the whole sequence. They also try to do some communication analysis, for example distinguishing the social smile from the real smile. If you are really happy, your smile stays on your face, but for a social smile -- if I say, "Zhengyou Zhang, hi," then once he passes by, my smile comes back down quickly.
So they also look at that kind of communicative behavior: how long it takes to reach the peak, how quickly it comes back, all those kinds of things. For the coding, at the peak they can say from the single image, okay, it's AU5, but for images close to neutral they need to really play between the frames to see whether things are changing or not.
>>: And the results, the mistakes that were made -- are those because the features that you were measuring, like lip corners and all that, didn't have enough information, or because the classifier that you used needs to be improved?
>> Ying-Li Tian: I think the features -- from that high resolution, you can see we captured them pretty well. I think the main reason is the confusion between the action units. All those action units were designed by psychologists for human observers, not really for a computer to use.
>>: If a human looked at the features that were generated, they could tell the difference between
5 and 6?
>> Ying-Li Tian: Only a certified coder can do that. For me, for some of them, I think, okay, they are very similar, they are the same. If you are not trained for this, a normal person cannot distinguish those kinds of things.
>>: Do we have any data to show the consistency between the different coders, human coders?
>> Ying-Li Tian: They have some data like that. They use the kappa statistic to show it. The agreement between different coders on some of these is about 85 percent; they still disagree on 15 percent. You say, okay, that's AU6; they say, no, I don't think that's AU6, I think it's AU5, something like that. So the agreement is about 85 percent, with 15 percent confusion.
Okay, so the previous work had some assumptions. For the initialization of the features, we initialize them manually on the first frame, the neutral frame. We also assume the neutral face is available. The initialization has since been solved by active appearance model fitting algorithms: after I left CMU, Simon Baker -- I think he is at Microsoft Research now -- worked on that with several other people, to do automatic fitting of the facial features. They also extended the active appearance model to handle some head motion, to try to move toward spontaneous expressions.
Since we had the assumption of a neutral face, we asked: without the neutral face, what can we do? Given an image, I want to say whether it is a neutral face or not. That's the work I continued after I joined IBM: neutral face detection.
The purpose is, given some images, we want to find the neutral face. For the one in the yellow box, we say, okay, this is the neutral face; the others are not. This neutral face detection is image based, not video based, since we only have randomly collected images.
What we do is: we have the image input, we do the face detection, and if we find the face, we extract the features. For those features I use a different algorithm than before, since we assume that in a real application, in most cases, we cannot really get that high resolution image. Then, based on the location features and the shape features, we again use a neural network classifier to do the neutral face detection.
For this one, we only decide whether it is neutral or not. For the features, since most faces do not have that high resolution, we reduce them to the corners of the brows, the eyes and the lip corners, without all those detailed contours. From the results later you will see why we reduced it to this; we will have some low-resolution images to show you. The same as before, we use the eye inner corners as the baseline to normalize these corner features: how much the lip corners are raised, how far they are from this baseline, and also the brows.
Okay, and that's the local feature. Then we have the appearance feature: we take the image, compute the edges, and then the direction of each edge. At that time we didn't call this HOG -- HOG is very popular now for detection and recognition, all those kinds of things. At that time we called it zonal histogram features, but later I figured out it is essentially identical to the HOG feature, which people now use a lot.
We build the full directional histogram; I just show these three combinations of the features to indicate in which direction the edges lie. Then we put this feature together with the shape features and the geometric features into the neural network, and it decides neutral face versus non-neutral face.
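A rough sketch of the zonal edge-orientation histogram just described (essentially a HOG-style feature). The grid size, bin count and normalization are assumptions for illustration; the talk does not give the original parameters.

```python
import cv2
import numpy as np

def zonal_orientation_histogram(gray_face, grid=(4, 4), bins=8):
    """Concatenate per-zone histograms of gradient orientation,
    weighted by gradient magnitude (a HOG-like descriptor)."""
    gx = cv2.Sobel(gray_face, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray_face, cv2.CV_32F, 0, 1, ksize=3)
    mag, ang = cv2.cartToPolar(gx, gy, angleInDegrees=True)  # ang in [0, 360)
    h, w = gray_face.shape
    zh, zw = h // grid[0], w // grid[1]
    feats = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            m = mag[r * zh:(r + 1) * zh, c * zw:(c + 1) * zw]
            a = ang[r * zh:(r + 1) * zh, c * zw:(c + 1) * zw]
            hist, _ = np.histogram(a, bins=bins, range=(0, 360), weights=m)
            feats.append(hist / (hist.sum() + 1e-6))         # per-zone normalization
    return np.concatenate(feats)
```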
Here are some results. We tested on two databases. For training, we used the CMU-Pittsburgh database; we separated it into two parts, trained on one part and tested on the other, and we get pretty high detection rates. We also tested on an independent database, the Yale database. This one [inaudible], but it still keeps a pretty high detection rate.
>>: So you used the same training data.
>> Ying-Li Tian: We trained on one part of one database [inaudible], tested on the other part, and then tested on a totally different database. Here are some examples of wrong classification results. The upper part here are neutral faces classified as non-neutral. According to the coders, all of these are neutral faces, but our system says they are non-neutral. This one, I think maybe that is her neutral face, but it looks angry. This one has a little bit of a smile. This one, I think, is because of the moustache; in the training data we do not have this moustache.
These several images are non-neutral faces that we classified as neutral. For the non-neutral ones, you can see the lips are a little bit apart, so they are coded as AU25 -- that's non-neutral, but the intensity is very small -- and this one looks like a slight smile. So all the confusions are at very low-intensity expressions, and I think if we say all of those are neutral, it's still acceptable. Yes.
>>: I'm assuming that there are some psychologists who do the classification of which AU each face belongs to. What's their hit rate if you just gave them a collection of these kinds of small-difference faces? Would 95 percent say that's a neutral and 5 percent say that's a happy face? I'm thinking that you're cutting it too fine by trying to force a yes or no.
>> Ying-Li Tian: I don't know the percentage for that. I think that a certified coder will definitely say, okay, since the lips are apart, there is definitely an expression there. But psychology people without the coding certification would say, okay, that looks like neutral. I don't know the percentage for that part.
Okay, so we talked about expression in the lab, and also about the neutral face detection. Now we move to spontaneous expressions: in real life, how can we do spontaneous expression analysis?
For this kind of work, we mainly focus on two things: how to handle the head motion, and how to deal with low image quality. We did this because we have the PETS data from 2003. They recorded the data in a smart meeting room, and they have ground truth for the expressions, whether there is a smile or not and so on. You can see that, compared to the collaborative environment, the intensity of those expressions is very low.
The face region is about 70 by 90 pixels. How can we do that? Here is the overall approach. We have the video input; instead of doing face detection directly, since the people move around a lot, we do background subtraction. Then we do the head detection. Once we find the head, we decide: is it frontal or not?
If it is frontal, the front of the face, we do the feature extraction, then the neutral face detection, and classify it as neutral or smile or something else. During the head detection, if we cannot find the head, we stop; if we find the head, we continue to estimate the head pose.
Here I show you the background subtraction. Background subtraction mainly assumes the camera is static, and here is the background. When a moving object appears, we try to extract it; the moving object is called the foreground. Once we find the moving person, we use an omega shape -- a shape like the letter omega, with the head here and the shoulders here -- to find where the head is.
All of this has one assumption: we assume the people are upright. If you bend down or something, this algorithm will not work. And here it shows the algorithm we use for the head detection: once we get the silhouette, we fit a circle to the top part of the omega shape, and over several images we get the head.
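A minimal sketch of this step under simple assumptions: static-camera background subtraction by frame differencing, then taking the top of the largest foreground blob as a rough head candidate. The real system fits an omega (head-and-shoulders) shape; the fixed head fraction here is only a placeholder, and the contour call assumes the OpenCV 4.x API.

```python
import cv2
import numpy as np

def head_candidate(background_gray, frame_gray, diff_thresh=30, head_frac=0.25):
    """Rough head localization from a static-camera foreground mask.

    Returns (x, y, w, h) of a head candidate, or None if nothing moves.
    """
    diff = cv2.absdiff(background_gray, frame_gray)            # foreground = change
    _, mask = cv2.threshold(diff, diff_thresh, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    person = max(contours, key=cv2.contourArea)                # largest moving blob
    x, y, w, h = cv2.boundingRect(person)
    return (x, y, w, int(h * head_frac))                       # top part of the silhouette
```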
Okay, once we find the head, we ask: what is the head pose? We only keep the frontal or nearly frontal views. Frontal or nearly frontal means the head is turned by maybe less than 25 degrees; if it turns more than that, it counts as a side view. Within that range we can still see the facial features clearly.
For the facial expression, we still focus on this frontal case. For some of the other views I think we could do something, but since the psychologists mostly focus on the frontal face, we lack that kind of research, so we still focus on this. Only when the head pose is frontal or nearly frontal do we continue with the feature extraction.
Here I show you one example of our head pose detection. You can see the original image here and the background; we did the head detection, and you can see it changes along with the head pose. And this guy, Andrew Senior -- he's at Google now -- caused us a lot of trouble, because on his head the skin color goes all around the front and the back, and in our training data we do not have people like that.
Okay, that's the head pose. For the feature extraction, the previous feature tracking algorithm cannot work here, so how can we extract the features? We just use very simple threshold-based features. You can see that with an appropriate threshold we mainly get the eyebrows and the eyes, which are darker than the other parts.
If the threshold is too high, you get this instead; but this is the brow, this is the eye. So we do automatic threshold selection and get the two pairs, both the eyebrows and the eyes. Once we have the eyebrows and the eyes, we assume the lips are well below them, and we do an automatic search to find the lip corners.
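A minimal sketch of the thresholding idea: binarize the face so that the dark regions (brows, eyes) pop out, then keep the largest dark blobs. Otsu's method stands in for the automatic threshold selection, since the talk does not specify the actual rule.

```python
import cv2

def dark_facial_blobs(gray_face, max_blobs=4):
    """Return bounding boxes (x, y, w, h) of the largest dark blobs,
    which at low resolution are mainly the eyebrows and the eyes."""
    blur = cv2.GaussianBlur(gray_face, (3, 3), 0)
    _, dark = cv2.threshold(blur, 0, 255,
                            cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(dark)
    # stats rows are [x, y, w, h, area]; row 0 is the background component
    blobs = sorted(stats[1:], key=lambda s: s[4], reverse=True)[:max_blobs]
    return [tuple(int(v) for v in b[:4]) for b in blobs]
```

The lip corners would then be searched for in the face region below the detected eye blobs, as described above.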
Yes?
>>: Overall, you are using the geometric features. I wonder, why don't you use some kind of
textural feature, especially when [inaudible]?
>> Ying-Li Tian: We have some textural features; we call them appearance features. I will show you later. That's where we use the Gabor wavelets to try to capture all those kinds of changes without extracting the local features. And we also compare the features to see which kind works better.
And here are some extracted features in that smart meeting room. We detect the head and the frontal face, then use these kinds of features. We use this kind of shape feature over the whole face, as before, and also the distance features, with the neural network, to classify neutral face, smile, surprise, angry and others, plus an output for non-frontal.
And here are some results. Before we talk about the results: you can see they already have ground truth. They say, okay, smile -- and since they use the video, when you turn away they assume you continue to smile, so it's smile, smile, smile. I made some modifications, because those cases are confusing for our system: this is a profile, this is a side view, and these frames were labeled smile; I changed them to neutral. So for some of the ground truth I made modifications, and I test the results against this modified ground truth.
You see here, for this one, we detected surprise rather than smile, I think mainly because the eyebrows are lifted a lot. And here are the tables of recognition results for non-frontal, neutral and the rest; they have a lot of smiles but not that many other expressions. Okay, once we tried that, we had more and more questions we wanted to answer, so we did a performance evaluation to ask which kinds of facial features play a more important role for facial expression, and down to which facial resolution we can still do the expression analysis.
And, also, if we want to do recognition of action units versus basic expressions, which one is easier, which can we do better? So we mainly compare across facial resolutions, across the algorithms for face acquisition -- meaning the head pose detection and the face detection -- and across the feature extraction methods, the geometric features and the appearance features. And we also compare the recognition rates for action units and for the basic expressions.
Here is a table on the CMU database. From the original images we just down-sampled to different levels of resolution, to ask at each level: can we do face detection or head pose detection? Can we do face recognition? Can we extract the facial features? Can we really do the expression analysis?
Down to here, I think we can do everything. But at these lower levels, can we really do all of this? I'm not sure. Feature extraction and expression analysis I think we cannot do at that level, so we also need to check at which level we can really still do the facial expression analysis.
And for the face acquisition, I compared three algorithms. For the face detection we used the neural network detector developed by Henry Rowley -- he was at CMU, then worked at Microsoft, and is now at Google. Another one is the integral-image-based face detector, the famous Viola-Jones face detector. And we also tried our head pose detection, to see whether we can get comparable results.
And for the features, the geometric features, I compared two versions. One is the extensive facial features: that means for high resolution we have all the contours -- at the beginning I showed you the video with all those details. Then we have the basic facial features, just the six corner points: the lip corners, the eyes and the brows. And the appearance features represent the skin texture and the lips, all those kinds of changes; we apply the Gabor wavelets over the whole face to get the appearance features.
These are the extensive geometric features, and these are the basic geometric features. Here are some examples of the Gabor wavelets in different directions: this one captures edges in the horizontal direction, these are the other directions, and this is the vertical direction. We have the different orientations, we can combine them together and apply them on the whole face, so it doesn't matter whether you can [inaudible] or not. Once we detect the face, we just apply them to that face region to get that.
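A minimal Gabor filter bank sketch for this appearance feature. Eight orientations roughly match what is mentioned later in the talk; the kernel size, wavelengths and the mean-magnitude pooling are assumptions of this illustration.

```python
import cv2
import numpy as np

def gabor_bank(ksize=15, orientations=8, wavelengths=(4, 6, 8, 12, 16, 24)):
    """Bank of Gabor kernels: several scales x eight orientations."""
    kernels = []
    for lam in wavelengths:
        for i in range(orientations):
            theta = np.pi * i / orientations
            k = cv2.getGaborKernel((ksize, ksize), sigma=0.5 * lam,
                                   theta=theta, lambd=lam, gamma=0.5, psi=0)
            kernels.append(k / np.abs(k).sum())    # normalize kernel energy
    return kernels

def gabor_features(gray_face, kernels):
    """Mean magnitude response of each kernel over the whole face region."""
    face = cv2.resize(gray_face, (64, 64)).astype(np.float32)
    return np.array([np.abs(cv2.filter2D(face, cv2.CV_32F, k)).mean()
                     for k in kernels])
```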
>>: [inaudible].
>> Ying-Li Tian: That one I didn't compare here, since if the resolution is too low you can't really get those kinds of edges. But for this one, it doesn't matter whether you get the edges or not; you just apply it over the whole face. I think it is somehow like the histogram features. And here is the example for the expression part, the basic expressions and also the action units.
Here you can see we combined 25, 26 and 27 into one. And, the same as before, we have the video; after we extract the features, we still use the neural network, and then we get either each action unit or each of the six basic expressions.
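A minimal sketch of this last stage: concatenate geometric and appearance feature vectors and train a small neural network. I use scikit-learn's MLPClassifier as a stand-in for the original neural network, and the feature dimensions and labels below are placeholders, not the real data.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder data: geometric features, appearance features, expression labels.
rng = np.random.default_rng(0)
X_geom = rng.normal(size=(200, 12))       # e.g. normalized corner positions
X_app = rng.normal(size=(200, 48))        # e.g. Gabor responses
y = rng.integers(0, 6, size=200)          # six basic expressions

X = np.hstack([X_geom, X_app])            # "combined" = concatenated feature vector
clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                  random_state=0))
clf.fit(X, y)
print(clf.predict(X[:5]))                 # per-frame expression predictions
```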
We still use the CMU-Pittsburgh database to do the training and the testing. We just down-sample the high-resolution images to different levels of low resolution. Here are some results.
For the face acquisition, you can see that down to here both detectors -- Henry Rowley's neural-network-based one and the Viola-Jones detector -- can detect all the faces, because the situation is very simple: only one face in the image, without any cluttered background.
But at this level, both of them fail. I think the minimum size both of them work at is about 24 by 24, so if the resolution is lower than that, they cannot detect the face. But the head pose detection, surprisingly, continues to give pretty good results down to this low resolution.
Here are the feature extraction results. For the extensive features we can do pretty well down to here; at this level of resolution we cannot track well -- that means we cannot track all the details, all the contours, at this level.
For the basic geometric features, the six corner points, we can manage down to this level, but not at that one. The appearance features we can always apply: as long as we can find the head or the face region, it doesn't matter how small it is; we just apply them on the face region.
>>: [inaudible]. Where do the [inaudible].
>> Ying-Li Tian: No, we tried that. Yes, we have samples; it looks very weird, and it still cannot do it. You can see, when we display these, we upsample them all to the same size. The face detector cannot detect it.
These are the action unit recognition results. We compare the results like this: using only the geometric features, either the extensive ones or only the six corners; using only the appearance features, the Gabor wavelets applied to the whole face; and then the combinations, the appearance features plus the extensive geometric features, or the appearance features plus the six-point geometric features. You can see that for the extensive features, down to this level it drops a little but I think stays at about the same level; down to this one it works pretty well.
For the second geometric feature set, the six points without the detailed contours, the result is much worse -- although I don't think 1 or 2 percent is really meaningful -- and at this level it drops a lot. For the appearance features, we get similar results to the extensive geometric features: it drops, but we can still do it; even at this level we still get a 58 percent detection rate.
When we combine them together, we see a small increase in accuracy, but not that much.
>>: How do you combine them?
>> Ying-Li Tian: Combining them means that when we train the classifier, we use both kinds of features; for this one, we only use the one kind of feature.
>>: This is the Gabor wavelets. It's much better.
>>: What's the dimensionality of that, the appearance feature?
>> Ying-Li Tian: This one, I forgot the detailed numbers.
>>: How many wavelets?
>> Ying-Li Tian: We have all the directions, all eight directions, maybe about 22.5 degrees apart. It should be eight directions. I forgot how many scales we used, maybe six -- six by eight -- on the whole face.
>>: This is whole face?
>> Ying-Li Tian: On the whole face. On the whole face.
>>: This is 48?
>> Ying-Li Tian: Forty-eight, something like that.
>>: You have [inaudible].
>> Ying-Li Tian: No, no -- here, here. You can see here: for this one, the appearance features, we based them on those feature points. But for the lower-resolution faces, we don't know where those feature points are, so we apply them on the whole face.
>>: [inaudible] slide. So that was -- you had this --
>> Ying-Li Tian: We apply it on the whole face.
>>: For what resolutions?
>> Ying-Li Tian: For both of the resolutions, yes.
>>: In my earlier work on Gabor wavelets, it was much better than geometry.
>> Ying-Li Tian: I think for that one, since most of the features are tracked pretty well, we get comparable results to the appearance features. But you can see that with only this kind of feature, if we cannot extract very good location features, it's much worse than the appearance features.
>>: Forty-eight?
>> Ying-Li Tian: No, no, for each pixel we have the 48.
>>: So are each of these 18 [inaudible]? So how many like -- what's the --
>> Ying-Li Tian: I need to recall that part -- how many pixels we used. At least we have those regions, because we know the eyebrows, eyes and lip corners. But for this one, where we do not have those features, I'm not sure whether we applied it per pixel or per region; I must have divided the face into some grid like that. But how many regions, and how I divided it -- I don't have that number in my head right now.
>>: So for the high-resolution faces are we sure that [inaudible].
>> Ying-Li Tian: No, for high resolution, this one, we just use those kind of points, because we
know all those kind of positions. We have these geometric features.
>>: In this position, you --
>> Ying-Li Tian: We have the --
>>: It could be the table [phonetic].
>> Ying-Li Tian: Yes, yes, but for this one, since we cannot get the geometric features, I think I just gridded the face: for each small region, a sub-block of the grid, we get the appearance features.
Here are the conclusions of the performance evaluation. The head detection and head pose estimation can find the face at very low resolution, much better than the face detectors. There is no difference in the recognition for expression analysis as long as the head region, meaning the face region, is larger than this size.
Also, the extensive geometric features and the appearance features achieve the same level of recognition rate, but when the geometric features cannot be extracted well, the appearance features do better. This means that in the future we do not really need to spend that much time extracting those kinds of geometric features; we can develop some good appearance features to work with.
And for both action units and the six basic expressions, the recognition seems not that different for us. Okay, the future work. For now I'm trying to focus on two things. One is to continue the spontaneous expression analysis, but to combine the expression and the talking -- combine video and audio. That's why I say I will also do some audio work: right now, when people do facial expression analysis, they assume all the facial movement comes from facial expression.
But people are also working on the lip-reading, the speech reading, all that kind of information, and they assume all the facial movement comes from talking -- and we know that's not true. In real life, when you talk, you also have smiles, expressions. How can we improve, mainly improve the lip-reading in the presence of expressions? We have an NIH proposal to do lip-reading to help deaf people with that. So I think we will work on something like that: when both expression and talking appear, what should we do?
Another direction I want to work on is content- and context-based expression analysis. What does this mean? Context-based just means that from the video, if we know the previous frames and some other things around them, then we can estimate some expressions even when you turn your head away from the camera.
For the content: sometimes only when we know the content and the context can we really use the expression analysis for emotion analysis. For the whole talk, I only mentioned emotion at the beginning, in the applications, and I never came back to it. How can we use these results for emotion analysis? We must understand the content and the context. Here is an image: which kind of expression does he have, from your point of view?
Sad? He's crying, wiping his tears. Here's the real story: he won the championship, and he's so excited that he's crying. I think all of this can change, and the context information will matter a lot for real applications. Here are some more expressions from championship winners. You can see all the different contexts, but all of them show that exciting moment. How can we use all that information to be able to say, okay, yes, you are really happy, or you are excited? That's all.
>>: Thank you.
>> Ying-Li Tian: Yes.
>>: One problem [inaudible] with expression analysis is really the data set. There are not many data sets available, and it's very hard to capture a data set. You mentioned one data set.
>> Ying-Li Tian: One data set we captured was at the Salk Institute. They hid the cameras, so the interviewees did not know there was a camera there, but because of the IRB, the confidentiality issues, they cannot really release the data set.
So I think the data set is a difficult problem, and also the comparison between different pieces of work is difficult. So I will mainly still focus on the controlled environment, doing the lip-reading work but with some expression added.
Spontaneous expression, compared to expression in the collaborative environment, is hard mainly because the intensity is very small, and also because of all the head motions. How can we deal with that? If we can at least do the segmentation -- if we have a long video, we can say, okay, in this part of the video this guy is happy, in this part this guy is sad, something like that -- I think that's good enough for now. Not that many people are really working on spontaneous expression.
>>: [inaudible].
>> Ying-Li Tian: It's hard, it's hard.
>>: I talked with some people. They said they all want to work on this, but they cannot get data.
>> Ying-Li Tian: They cannot get data, and also it's very hard to solve. So, for me, once I get the funding support, we will really focus on some applications.
>>: [Inaudible].
>> Ying-Li Tian: For example, one I mentioned is the NIH project for deaf people. Mainly we want to provide enhanced lip-reading, to give them captions. For that one, we will capture the data in a restaurant or a meeting, a gathering room, with multiple people talking.
>>: [inaudible].
>> Ying-Li Tian: They have a lot of expressions, and for that one I think we will focus on the lip-reading with the different expressions. We will hire certified coders to code the expressions for us. We will not cover all the expressions extensively: if there is a smile, we code a smile. The main focus is on the lip-reading, but if they have three different expressions, we will deal with those three expressions. Yes.
>>: It seems there's a lot of similarity between facial recognition and expression recognition. I
was wondering if you are able to leverage some of the developments in facial recognition?
>> Ying-Li Tian: For the recognition and this together?
>>: It sort of seems to me -- for example, [inaudible] maybe the neutral face is more dominant
and is the most significant [inaudible] something like that. So one of the expressions [inaudible] a
little bit other places around there. Just think widely [inaudible] any leverage or bridge between
these two types of?
>> Ying-Li Tian: I'm thinking the feature extraction part can be used for face recognition, but the classification and the further processing are different between this and face recognition, because for expression analysis we try to find what is common. Across different people, we try to find the commonality: all these kinds of movements belong to the smile, the joy. But for face recognition you are mainly trying to find the differences, to say which part differentiates you and me, so I know you are you and I am me.
So I think for the classification part, the discriminative part, we need to think about how to use those kinds of features, but the feature extraction can be used for both applications. Do you know what I mean?
>>: Let's say you have a whole face, so each pixel connects to some kind of neural network. For one, you can do the recognition. The other, based on the movements, you might be able to read the expression. I'm just --
>> Ying-Li Tian: Isn't that --
>>: [inaudible] For facial expression you should try to extract the commonality among people. But for facial recognition you need to just distinguish different people.
>> Ying-Li Tian: Find the differences between the people, yes. For expression classification, we try to find the commonality: all the different people who have the same kind of expression belong to that class. And, as I mentioned a little earlier, to use this for person identification means that, compared to this kind of general classification, you look for some very unique expressions.
You may say, okay, that unique expression belongs to you, not to him, and once we find those unique expressions, we may know it's you; we can use that kind of unique information for the identification.
>>: Leave the expressions, these are things [inaudible] differentiated. It's like you're trying to do
with faces.
>> Ying-Li Tian: I think at CMU, Yanxi Liu worked on some symmetry information, trying to use symmetry to identify people, since different people have different asymmetries, even in the face. That's why when you look in the mirror, you sometimes notice the asymmetry more than when you just look at another person.
>>: [inaudible] images that we saw in your slide deck, were all of those taken from analyses that
were occurring in real time or was this from recorded video?
>> Ying-Li Tian: The processing was all on recorded video from that database, but the process itself runs in real time.
>>: And have you ever used that information to drive 3D face models for facial performance?
>> Ying-Li Tian: No. I didn't try the 3D-based model. Yes.
>>: [inaudible] the geometry features and the texture appearance features. However, do you think it would be useful to apply some model of the state [inaudible], of how the expression changes over time?
>> Ying-Li Tian: That would help a lot. That's one direction; people are trying to work on the temporal context information. For now, for all the results I presented today, even though we use the video to do the tracking, the recognition is still frame based; we didn't use any temporal information. That's why, for the spontaneous case in the smart meeting, they code it as a smile even when the person [inaudible] moves away, because they use the video information to say, okay, that guy is still smiling.
But since our classification is image based, we cannot recognize that as a smile.
>>: [inaudible] over time. You can [inaudible].
>> Ying-Li Tian: With that, at least we can improve the result. You mean: for a single frame we make a wrong recognition, but since it cannot be smile, then sad, then smile from frame to frame, we can absolutely improve the overall recognition rate.
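A minimal sketch of that idea: smooth the per-frame labels with a sliding-window majority vote, so an isolated sad frame inside a run of smile frames is overruled by its neighbors. This is only an illustration of the kind of temporal post-processing being discussed, not something from the system presented here.

```python
from collections import Counter

def smooth_labels(frame_labels, window=15):
    """Sliding-window majority vote over per-frame expression labels."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_labels)):
        neighborhood = frame_labels[max(0, i - half): i + half + 1]
        smoothed.append(Counter(neighborhood).most_common(1)[0][0])
    return smoothed

labels = ["smile"] * 5 + ["sad"] + ["smile"] * 5
print(smooth_labels(labels, window=5))    # the lone 'sad' frame is smoothed away
```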
>> Zhengyou Zhang: Any more questions?
Let's thank Ying-Li again.