>> Zhengyou Zhang: Okay. So let's get started. It's my pleasure to
introduce Gang Hua. Gang is not a stranger to us. He did an internship
with me quite a few years ago, then joined Live Labs. After Live
Labs was dissolved, he moved on to a number of places, including Nokia
Research, IBM Research, and now he's an Associate Professor at Stevens
Institute of Technology. And today he will talk about the Elastic Part
Model for Face Recognition. Let's listen to him.
>> Gang Hua: Thanks, Daniel. It's a great pleasure to revisit MSR, one
of the best computer science research institutes in the world.
So today my talk is about an elastic part based representation for real-world
face recognition. You may wonder, after 30 years of research, why
we should still care about face recognition research. I think the short
answer is that it is still not solved yet. So before I get into the
technical part I want to briefly introduce my school.
So Stevens is in a beautiful campus on the New Jersey side of the Hudson
River. So we oversee the skyline of New York City. Basically the
opposite side is Manhattan downtown. So it's a beautiful campus, and
most New Yorkers, when they come to -- I mean, perhaps in their
lifetime they never come to our side and they never had the chance to
see the skyline of Manhattan.
So next time you happen to be nearby, send me an e-mail and I can
arrange a visit to our campus. Okay. So I categorize the current research
in my group into three themes. The first is human-centered
visual computing. I tend to do my research centered around humans: we
understand humans from images and videos and also do interactive types of
recognition, tagging, and things like that.
The second theme of research in my group is big visual data.
Part of that effort originated from my experience here, from designing
compact local descriptors all the way to modeling contextual
information, including social networking context, for recognition,
essentially.
The third theme of research I initiated at the school, after I joined
Stevens, is on egocentric vision based cyber-physical and collaborative systems.
The reason I call it a collaborative system is that we want to build an integrated
human and robotic system. For example, one of our goals is to
use an egocentric camera. This is a very cheap camera; the one on
the right side is a very cheap spy camera I bought from China. So
there's a pinhole camera in between, in the camera glasses, and we want to
enable users to use that camera to control a wheelchair. You may
imagine a quadriplegic user who has lost hand functionality and cannot
really control the wheelchair. We want to build a collaborative system where
most of the time the wheelchair robot is moving autonomously, but whenever
it's unsure what next step to make it can ask the human for control;
then the quadriplegic user can use the camera to control it.
It would naturally record the input from the sensors and the output
from the user's control. We hope to build a learning algorithm which
can evolve the decision engine of that wheelchair system so in the
future it can handle similar situations.
So that's about the overall description of the research in my group.
So back to face recognition. Why do we really care about pose
variation? The simple reason is that pose actually mingles all the
visual variations together and makes things really complicated. We all
know from the seminal paper of [indiscernible] and [indiscernible] that if
the pose isn't changing then everything becomes linear, and it's easy
to handle. So as you can see from this example, once we have a lot of
pose variation, things become really complicated. So we want to
better handle pose with a better visual representation of the face.
So there are two types of approaches to handle pose variation. The first
theme of research focuses on facial alignment and facial landmarks.
There's a series of work from Microsoft Research Asia; they've done really
good face landmark algorithms for photos. Just from my private
conversations with Jian, he told me that the current face alignment
algorithms at MSR actually failed on YouTube videos. There's a
benchmark, the YouTube Faces dataset, released by Lior Wolf. I think
Jian's algorithm didn't really fly on that video database. So we're
going to revisit that to see why our representation is better.
The other theme of research tries to build robust matching algorithms
to better handle pose variations. My personal research is really
focused on this domain because I believe no face alignment algorithm
is ever going to be perfect. So you want to build a robust matching
algorithm to handle whatever residual pose variation you could still
have.
Our approach is to take a part-based representation and build a part-based
model. Previous work on part-based representations for face
images is mostly handcrafted. For example, the parts could be defined
around facial landmarks and so on and so forth. What we want to
do, though, is to learn the parts in an unsupervised fashion from the
training data and build a generative model. Once we have this
part-based model we can identify each specific part in a specific
input image, and then we can use those specific parts in that image as
the representation.
Okay. We're going to talk about the benefits of having such a
representation. There's a bunch of recent work on more general
object recognition and detection that builds an intermediate-level
representation, but it hasn't been applied to faces, and for some of
those approaches we'll see why they could be used in the face domain.
So given this, here's the outline of the rest of my talk. First
I'm going to introduce this probabilistic elastic part model. It's a
very simple algorithm, but hopefully I can convince you that it is
very effective. Okay. Then I'm going to talk about two
applications. The first is face verification. One good side of our
representation is that it provides a unified representation for both
image- and video-based face verification. That means with this
representation we can enable a single face image to be matched
directly with a video clip, without resorting to frame-by-frame
cross matching, pairwise matching.
We will also talk about how we can use this representation for
enhancing any offline face detector. It's an unsupervised detector
adaptation algorithm; hopefully I'll briefly describe some of the work
we did before on how to adapt a detector to a video. Here we just try to
do unsupervised detector adaptation to a photo collection to make
the detector better. This part of the work we published in CVPR 2013,
and this part of the work is going to be presented at ICCV 2013. And
some of the experiments I present here will be from our recent
submission, because we did more experiments and the results
improved. So first, what is the probabilistic elastic part model? We
have three goals for this part-based representation. Of course we
want it to be pose invariant. Second, we want to unify the
representation for both image and video faces. The reason we want it
to be like that is because we want to avoid this frame-to-frame matching,
and also our ultimate goal is face identification. So when you're
trying to build a gallery database, we want the database to really scale
with the number of persons instead of the number of images in the gallery
database.
And ultimately we also want this representation to be additive, meaning
that if a new face for a person is added in, we can incrementally
update it without resorting to all the previous images.
So I will show you how we achieve this. Again, the general
philosophy, as I mentioned, is that we want to build a generative model;
for each specific image we want to identify the specific parts of
the face in that image, and we use those as our representation. So
just to give you a sense, from the low level, of how we are able to
learn such a part-based model: we start with a very simple feature
extraction process where we build an image pyramid for each input face
image and densely extract overlapping patches. Okay.
Then for each patch we extract a descriptor, either SIFT or LBP,
from it. But we also do something in addition to that: we augment the
descriptor with the x and y location of that patch, so we acquire a
spatial-appearance descriptor. Up to this point we have a set of
features, each with an appearance component and a location component.
So the face is represented as a bag of features. But don't be confused
by the words, because we're not going to build a dictionary to build a
bag-of-words representation. We're going to do something different here.
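A minimal sketch (not from the talk) of this dense feature extraction step, assuming OpenCV's SIFT; the patch size, stride, and pyramid settings are illustrative assumptions rather than the talk's actual values:

```python
# Sketch: image pyramid, densely sampled overlapping patches, one local
# descriptor per patch, augmented with the patch's (x, y) location.
import numpy as np
import cv2  # OpenCV, assumed available

def spatial_appearance_descriptors(gray, levels=3, scale=0.8, patch=24, stride=8):
    sift = cv2.SIFT_create()
    feats = []
    img = gray
    for _ in range(levels):
        h, w = img.shape
        for y in range(0, h - patch, stride):
            for x in range(0, w - patch, stride):
                # Describe the patch with a single SIFT keypoint at its center.
                kp = [cv2.KeyPoint(x + patch / 2.0, y + patch / 2.0, float(patch))]
                _, desc = sift.compute(img, kp)
                if desc is None:
                    continue
                # Normalized (x, y) location appended to the appearance part.
                loc = np.array([(x + patch / 2.0) / w, (y + patch / 2.0) / h])
                feats.append(np.concatenate([desc[0], loc]))
        img = cv2.resize(img, None, fx=scale, fy=scale)
    return np.array(feats)  # each row: [128-D SIFT | x | y]
```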
Then how do we build this elastic part model? We gather a set
of training images and fit a Gaussian mixture model. In the speech
community this is called a universal background model. But I will
highlight that we changed the model a little bit. It is more
specific here, because we confine each Gaussian component to be a
spherical Gaussian. The reason, which I will explain further, is that we want
a better way to control the balance between the appearance part
and the location part. Because if you think about it, the appearance
vector is about 128 dimensions while the location is only two
dimensions, so we need a better way to control the balance between them.
If we use a diagonal covariance matrix we won't be able to achieve that,
as I'll show you in an example. So, of course,
this is a very simple maximum likelihood estimation problem with missing
data.
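A minimal sketch of fitting such a spherical-Gaussian mixture (the UBM) with scikit-learn; the 1,024 components echo the number mentioned later in the talk, while the location-scaling factor is purely an illustrative assumption:

```python
# Sketch: pool spatial-appearance descriptors from all training faces and fit
# a Gaussian mixture whose components are constrained to be spherical.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_pep_model(all_descriptors, n_components=1024, loc_weight=30.0):
    X = np.vstack(all_descriptors).astype(np.float64)
    # Scale the two location dimensions so they can compete with the 128-D
    # appearance part under the shared per-component variance.
    X[:, -2:] *= loc_weight
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type='spherical', max_iter=100)
    gmm.fit(X)  # maximum likelihood via EM; the component assignments are
                # the "missing data" mentioned above
    return gmm
```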
So we gather a set of training data, go through that feature
extraction process to obtain these spatial-appearance descriptors,
and then fit the Gaussian mixture model. As you can
see, this Gaussian mixture model naturally comes with a set of parts.
Here the visualization is really the set of image patches associated
with each Gaussian component. Up to this point the method is
simple. So how are we going to build the representation for
each face image from this model? Okay. Pay attention to this, because
it is simple but it is really what differentiates us from the previous work.
So here, if I have a new input image, and suppose I've already learned
this model, this Gaussian mixture model, the input is first
represented as a bag of spatial-appearance descriptors. In order to
generate the final representation for the face, instead of building a
bag-of-words histogram, what we do here is an inverse
assignment.
So for each Gaussian component, we look for the one descriptor
extracted from this input face image which induces the highest
probability on that Gaussian component. So for each Gaussian component
we identify such a feature. Then we concatenate the set of features
we identified, one per Gaussian component, to form a single
descriptor vector for the face. As you can see, this is for Gaussian
component three; this is the probability map of that Gaussian
component calculated over each specific descriptor extracted from
that face. As you can see it really corresponds to the chin of
the face, and here the corner of the eye, although the faces have a lot of
pose variation. So you could view this as more like an alignment
process there. Okay. By doing the maximum likelihood identification of
the features, we actually have an implicit alignment process built in.
Okay.
>>: Do you have a location descriptor? Do you actually put it --
>> Gang Hua: Up to this point we discard the location. We only have
the appearance part in this final vector, but it's indexed by each
Gaussian component, as you can see. The Gaussian component plays a role
as a bridge to build a correspondence between two faces.
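A sketch of this inverse assignment, building on the helper functions above; the per-component likelihood is written out explicitly for the spherical case, and the variable names are mine rather than the talk's:

```python
# Sketch: for every Gaussian component, pick the single descriptor of this
# face with the highest likelihood under that component, then concatenate
# the appearance parts of the selected descriptors.
import numpy as np

def component_log_prob(gmm, X):
    # log N(x; mu_k, sigma_k^2 I) for every descriptor x and every spherical
    # component k; shape (n_patches, n_components).
    d = X.shape[1]
    sq = (np.sum(X ** 2, axis=1)[:, None]
          - 2.0 * X @ gmm.means_.T
          + np.sum(gmm.means_ ** 2, axis=1)[None, :])
    return -0.5 * (sq / gmm.covariances_ + d * np.log(2 * np.pi * gmm.covariances_))

def pep_descriptor(gmm, descriptors, loc_weight=30.0, appearance_dim=128):
    X = descriptors.copy()
    X[:, -2:] *= loc_weight                 # same location scaling as training
    best = np.argmax(component_log_prob(gmm, X), axis=0)
    # One appearance vector per component, concatenated into a single vector.
    return np.concatenate([descriptors[i, :appearance_dim] for i in best])
```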
So let's get into a little bit more detail here, because as I said, we
confine each Gaussian component to be a spherical Gaussian. The
reason for that is really this. Here is just to show that if we use a
regular Gaussian, for example with a diagonal covariance matrix, the
spatial span of each Gaussian component is large. As you can see it
spans large spatial regions in the face, which is not the behavior we want.
The main reason for this is that the appearance part contributes more to
the probability, because with a diagonal covariance matrix each
dimension is independent. So if you try to make the spatial
component contribute more by simply scaling those dimensions, it's not
going to help, because you're just going to enlarge
the variance of those dimensions when you are doing the estimation,
without affecting the other components. If you confine it to be a
spherical Gaussian then you mingle all the variances together. So if
you scale the location dimensions, you essentially let the
location have more influence. So the Gaussian components become
more localized, and they're doing local matching, essentially. Any
questions here? Okay. Another question: why do we need this location
component at all? The main reason is this: if we did the maximum
likelihood identification without the spatial constraint, as you can
see, you could easily match an eye with the mouth part here. The main
reason is because we use the SIFT descriptor here, and SIFT is designed
to be shift invariant. You may match an eye with a mouth
in the descriptor space, whereas if we put the spatial
constraint in, we're really matching the eye with the eye. So
that's why we need the spatial component.
Let's see why this representation is pose invariant; we try to
visualize the process. Okay. What we essentially do is, for the input
face image and for each Gaussian component, identify the local patch
with the highest probability. Then we put that patch back
at the location of that Gaussian component, so we synthesize this face;
that's essentially the process.
As you can see, this face is nearly frontal here, although it
becomes blurred because we average overlapping regions
without doing any fancy filtering. Okay. So
this is if we just use this input image. What we can do further is
horizontally flip the face image and then do the maximum likelihood
identification with the joint set of spatial-appearance descriptors.
Then, as you can see, the representation becomes even more frontal. Okay.
I can let you see the difference there.
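A sketch of that mirrored pooling, reusing the earlier helper functions; this is an assumed reading of the procedure in which each component picks its best patch from the joint set of the face and its horizontal flip:

```python
# Sketch: extract descriptors from the face and its mirror image, and let the
# maximum likelihood selection run over the joint set.
import numpy as np
import cv2

def pep_descriptor_with_flip(gmm, gray, loc_weight=30.0):
    joint = np.vstack([spatial_appearance_descriptors(gray),
                       spatial_appearance_descriptors(cv2.flip(gray, 1))])
    return pep_descriptor(gmm, joint, loc_weight)
```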
So then what we do is adopt this flipped version, the joint version of
both. Here is just something more: because of this inverse assignment
based on the Gaussian components, we don't really care how many frames
we have. For one frame, for example this is George Bush's face, we come
out with one representation; then with ten frames, as you can see, it is
more frontal in a sense, so it is more pose invariant. With 20 frames or
50 frames, they don't really make a huge difference. We also have the
version where we do the flipping for each of the frames; as you can see
they have some differences, but not too much if you integrate multiple
frames together.
Okay. So then why this is a unified representation for a single image
and for video face images is quite obvious. Once we have done the
feature extraction and then use this PEP model, where PEP stands for
probabilistic elastic part, the selected features will always have the
same dimension, because the dimension of the representation is the
number of Gaussian components times the dimension of the feature
descriptor. So that's why, once we have this representation, we can
match a single face image with a video clip directly without resorting
to frame-to-frame matching. So why is it incremental? Suppose you
have a gallery database and you have a set of faces from the same person.
From this model we can generate a single representation for this person
by doing this maximum likelihood identification. So it is incremental.
It can be incrementally updated mainly because of the incremental nature
of this maximum operation: once you have this representation,
given a new face we just need to compare whether the maximum likelihood
identified feature has higher probability than the feature I already
have in this representation or not. If it does, then I just replace it in
the current representation. So it can be incrementally updated if we have
more faces added. That's a lot of the benefit of this. Although in this
talk we're not going to present any face identification results, if we
want to do face identification with this representation we can make the
database essentially scale with the number of persons instead of
the number of images in the database.
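A sketch of that incremental update, again built on the helpers above; keeping each component's selected part together with its log-likelihood lets new frames or images be folded in with a per-component max, which is an assumed but straightforward reading of the talk:

```python
# Sketch: per-component max update of a person's PEP representation.
import numpy as np

def update_pep(parts, scores, gmm, new_descriptors, loc_weight=30.0,
               appearance_dim=128):
    # parts: (K, appearance_dim) current per-component appearance features
    # scores: (K,) their log-likelihoods under the corresponding components
    X = new_descriptors.copy()
    X[:, -2:] *= loc_weight
    log_prob = component_log_prob(gmm, X)             # (n_patches, K)
    best = np.argmax(log_prob, axis=0)                # best new patch per component
    best_lp = log_prob[best, np.arange(log_prob.shape[1])]
    better = best_lp > scores                         # keep only improvements
    parts[better] = new_descriptors[best[better], :appearance_dim]
    scores[better] = best_lp[better]
    return parts, scores
```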
>>: So when you do this, is the UBM for one person or for everybody?
>> Gang Hua: For everybody.
>>: For the population.
>> Gang Hua: Yes. Each face has different pose variations.
>>: But here, when you're talking about representing one person --
>> Gang Hua: So we assume that we already have that UBM model.
>>: So from the UBM you then do personalization to a single-person model.
>> Gang Hua: Yes. I'm going to introduce a little bit of an adaptation
process when we are trying to do specific matching for a pair of face
images. We have a Bayesian adaptation process that's more like
person-specific adaptation. Okay, I'm going to talk about that in the
face verification experiments. So I hope you get a sense of
what this simple PEP model is. I'm going to move forward to present
two applications, one on face verification and the other on face
detection.
>>: [indiscernible].
>> Gang Hua: Sure.
>>: The face, I assume you would need some kind of alignment?
>> Gang Hua: Yes, you could either do alignment or go without any
alignment.
>>: Without alignment I assume it may not be very good.
>> Gang Hua: We did some experiments. We're still better than -- when
this paper was published, we were still better than the best in this
evaluation, on this benchmark data.
>>: Alignment.
>> Gang Hua: We tried both. This algorithm would benefit from some
sort of alignment algorithm, but because this representation is
designed to be robust to pose variations, we have an implicit alignment
process in building it.
>>: I thought with the training you would -- it's good to have some
alignment. But testing you're doing [indiscernible].
>> Gang Hua: That's a good one. You need them to be consistent.
>>: I see.
>> Gang Hua: If you have alignment in your training phase, then you have
to have the same alignment in your testing phase. That's kind of
what we're doing. But it would be interesting to try. We always have
alignment on the testing stage. So for face
verification we're going to talk about results on two very popular
benchmarks. Okay. First we're going to talk about uncontrolled
face verification. We're going to introduce a very simple method
first to use this PEP representation for face verification, then
I'll talk about how we can do a little bit better.
Suppose we want to enhance the matching of a specific face pair; then
I'm going to show our top performance on both benchmark datasets.
Okay. So how do we do the matching when we have a pair of faces and
want to verify whether they are the same person or not? We take a
very simple approach here. Once we have the PEP representation for
each face, we just take the difference of these
two representations.
Okay. We take the absolute difference vector. Then, given a set of
training pairs labeled as positive or negative, where
positive means they're from the same person and negative means they're
not the same person, we can train an SVM on this difference
vector to decide whether they are the same person or not. So
given a new testing pair, we can use this classifier to make the
decision. That's a very simple method.
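A sketch of that simple pair classifier, assuming scikit-learn's linear SVM; the hyperparameters are placeholders, not the talk's settings:

```python
# Sketch: absolute difference of two PEP descriptors fed to a linear SVM
# trained on labeled same/not-same pairs.
import numpy as np
from sklearn.svm import LinearSVC

def train_pair_classifier(pep_a, pep_b, same_labels):
    # pep_a, pep_b: (n_pairs, K * 128) PEP descriptors; same_labels in {0, 1}
    clf = LinearSVC(C=1.0)
    clf.fit(np.abs(pep_a - pep_b), same_labels)
    return clf

def verify(clf, pep_x, pep_y):
    # Positive score: same person; negative: different persons.
    return clf.decision_function(np.abs(pep_x - pep_y)[None, :])[0]
```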
We haven't resorted to any more complicated metric learning algorithm yet,
but I think if we used a better metric learning algorithm, it
would indeed enhance the results.
To make the representation even better, we have proposed a
Bayesian adaptation scheme. Suppose this is what we call the PEP
model, which is universally trained offline, and suppose we have a pair
of face images we want to verify. What we can do is make
this model adapt to this pair of face images, to fit them
better, but without deviating from the universal model too much. It is a
Bayesian adaptation process, and we obtain a new adapted Gaussian mixture
model; then from that we build the PEP representation.
If we look into the MAP estimation, it's a really simple scheme where we just
put a conjugate prior on the adapted parameters, with the prior coming from
the universal background model, and again this can be done in a
Bayesian framework in an iterative fashion. Really simple method.
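As an illustration of the general idea, here is a sketch using the standard relevance-factor MAP update of the component means, as used for UBM adaptation in the speaker verification literature; the talk's exact update and hyperparameters may well differ:

```python
# Sketch: adapt the UBM means toward the descriptors of the pair being
# matched, with the prior keeping them close to the universal model.
import numpy as np
from copy import deepcopy

def map_adapt_means(ubm, pair_descriptors, loc_weight=30.0, relevance=16.0,
                    n_iters=3):
    X = np.vstack(pair_descriptors).copy()
    X[:, -2:] *= loc_weight
    adapted = deepcopy(ubm)                  # keep the universal model intact
    for _ in range(n_iters):
        resp = adapted.predict_proba(X)      # posterior responsibilities
        n_k = resp.sum(axis=0) + 1e-10       # soft counts per component
        xbar = (resp.T @ X) / n_k[:, None]   # component-wise data means
        alpha = (n_k / (n_k + relevance))[:, None]
        # Convex combination of the data means and the UBM prior means.
        adapted.means_ = alpha * xbar + (1.0 - alpha) * ubm.means_
    return adapted
```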
To see why we need this adaptation process, suppose this is the pair of
faces we want to verify. If we look into the set of patches for which we
build a correspondence with the PEP model, you can see that some of
those patches are still misaligned a little bit, as highlighted by the
blue rectangles there.
So this is before adaptation. After we do the adaptation, as you can
see, we really make the alignment better if you compare them. Okay.
So that's why: this Bayesian adaptation scheme can help us build
better correspondences there. Okay. Any questions on this part?
So we're not trying to do a person-specific adaptation like in the
speaker verification domain; here we're really adapting the PEP model
to the specific pair of images we're trying to match. Okay. We can
also do multi-feature fusion. We have a score fusion process: if
we use different types of features to build the PEP model and to do the
elastic matching, then we can concatenate the decision scores together
and train a linear SVM to combine them. That's a very
routine process; I just want to mention it because our results benefit
from fusing multiple features. So we're going to present results on
two widely used benchmark datasets, LFW and the YouTube Faces dataset.
Currently we rank number two on the LFW database and number one on the
YouTube video dataset.
So when we're talking about results on LFW, I should be careful, because
Jian's team has really pushed the recognition accuracy to be super high. But
they leveraged an enormous amount of external data to train their facial
alignment algorithm and their face verification algorithm.
We're working on the most restricted protocol, without any outside
training data. The reason is that my philosophy is that you should
really come up with a representation which can generalize across
different datasets. That's why we're confining ourselves just to LFW,
to see how it can generalize. So when we're comparing results,
we're mainly comparing results in this category; we're not
comparing with the other algorithms which leverage an enormous amount
of external training data. Here are some details of how we did the
feature extraction and things like that. Okay. We specifically use
1,024 Gaussian components in our UBM model.
So here are our results. This was the best result when we published
our paper in CVPR 2013, and recently there's one paper published from
Oxford, I think Andrew Zisserman's group; they used a Gaussian mixture
model with the Fisher vector to do the matching, and they're currently
number one. This PEP result is if we use only the LBP feature; if
we use the SIFT feature, we get more or less similar results to the LBP
feature. If we do the fusion, we can clearly improve our results,
and if we do the adaptation plus the fusion, we can do even better.
Under the ROC curve, we're not as good as them; in terms of numbers
we're about 1.5 points below their accuracy. But still, we're currently
ranking number two on this benchmark. Their algorithm needs to
conduct metric learning on different datasets, which is the part I don't
like, but I respect their performance a lot.
On the YouTube video faces, fewer results have been published on this
dataset, mainly because of computational issues and all kinds of other
issues. When we published our results, only [indiscernible] had published
their results; this is the baseline here.
Recently more algorithms have published results; the best one is this one,
published at a biometrics conference. So our results, if we only use one
type of feature, are about here. Although we are worse in the
high false positive rate area, we're nearly as good as them with just a
single feature in the low false positive rate area, and that's the area we
really care about. If we use a different set of features, we are actually
already slightly better than that algorithm.
If we fuse them together, we get better still, and if we do the
adaptation, we do even better. Okay. So currently it's a mix:
we're still lower than those algorithms in the high false positive
rate area, but we do significantly better in the low false positive
rate area.
>>: Which group, biometric thing?
>> Gang Hua: This one?
>>: Yeah.
>> Gang Hua: That's a good question -- I cannot remember. It's a group not
so active in computer vision; they're mainly active in the biometrics
community. I can send you some information.
>>: That's okay. It's in your paper.
>> Gang Hua: So currently, in terms of recognition accuracy, if we pick
a single operating point, we're about 1.5 points better than the best
algorithm published at the biometrics conference. The reason this number
is lower than on the LFW dataset is that the resolution of the faces in
this video face benchmark is not as good as that of the LFW dataset. So
we're currently number one there.
And I also want to highlight that once we build the PEP
representation, we don't need to do cross-frame matching as most of
these algorithms do; they need to do this frame-by-frame matching and
then identify the best match.
Our algorithm is obviously not perfect, and looking into the error
results can tell us which direction we should go.
So here are some errors made by our verification algorithm. As you can
see, we're matching a white face with an Asian face, which shows
that we don't really have a comprehensive understanding of the face.
So that's the direction we're trying to go: we are trying to do
segmentation and semantic labeling to really understand the face,
to avoid making such embarrassing errors. That's the direction we're
going, trying to drive face verification accuracy to the next level.
>>: The gray level?
>> Gang Hua: Currently we're only using gray level images. Color could
play an important role, but so far I haven't seen any work that really
leverages the color information.
>>: When you -- it's always interesting to look at these errors because
it tells us we still have a long way to go. But when you report
numbers like in the previous page there, 80 percent, on 80 percent of
the queries you've got the right person out of how many potential
answers?
>> Gang Hua: Oh, yeah, this is not an identification task, it's a
verification task. So basically in this benchmark dataset the input is
face pairs, and you just make a decision whether this pair is from the
same person or not. It's a classification --
>>: Input distribution is 50 percent matched?
>> Gang Hua: 50 percent.
>>: So the random guessing baseline is 50 percent.
>> Gang Hua: Yes.
>>: You're at 80. So, what's that, in about three out of
five guesses you're doing better than random.
>> Gang Hua: Yes.
>>: Something like that.
>> Gang Hua: Something like that; that's currently the state of the art.
I think this benchmark was designed this way to balance the
training, really. I mean, as you can see, the distribution of matched and
non-matched faces should not be 50/50 in the real world, so it's a
skewed distribution. That's something where perhaps in the future the
benchmark evaluation could need to be redesigned somehow, I think.
>>: It's a way of measuring progress, but suppose someone who isn't a
researcher in this field asked: if I give you a celebrity photo from this,
what's the chance that your algorithm will name the correct one? And
let's say there are 200 celebrities in here. What's the chance? Is it
like five percent of the time it will guess the right person, or what is
it?
>> Gang Hua: That's an interesting question. I think the number
I heard from Google Picasa is that what they can achieve is that, in the
top ten results, they can make sure that in 90 percent of the cases the
correct person is there.
>>: Top ten in, what, your family album?
>> Gang Hua: From all the celebrity face images they have in Google image
search.
>>: With the celebrities, they're saying -- I see. So you take a
celebrity photo, you ask Google who this is, and it gives you a list of 20
names, and usually if you look in the top ten it will say --
>> Gang Hua: 90 percent of the time the correct name is in the top ten
faces.
>>: It will say Cameron Diaz and Hank -- it will name a whole bunch of
people who are of opposite genders and kind of nonsensical, but somewhere
in those ten the right person may be there.
>> Gang Hua: That's what I heard from them. But it could be -- they
never publish their results.
>>: For something like finding, or just verifying, what the labels are on
the images, that could be useful. But if you basically said something
like: here are my personal Picasa photos, show me the ones with my
daughter, this daughter. The question is whether 70 percent of the images
will be correct or only 20 percent.
>> Gang Hua: I believe on a personal photo album we can do much
better, because the number of people is much smaller.
>>: There's much lower perplexity, but sometimes people look more
similar; you can't take advantage of facial differences and things like
that.
>> Gang Hua: That's absolutely true. There's really no good benchmark
for this; perhaps some effort is needed in that space to really
evaluate progress. I don't really see a lot of
benchmarks there. Simon and I, when I was still at Microsoft,
explored that a little bit in this family album
scenario. But I think we need a more serious benchmark in that space.
It's kind of difficult.
Next I'm going to talk about face detection. How many of you here
still think that face detection is a solved problem? Paul is no
longer here, so it's safe to ask the question. So raise your hand if you
think it's solved.
>>: I thought frontal faces were pretty good.
>> Gang Hua: Solved? Look at some of the mistakes you would make with a
state-of-the-art face detector. You're missing a detection here. That's
an embarrassing false alarm; why is this detected as a face? That's very
suspicious. We also have some of the missing detections there.
>>: State of the art?
>> Gang Hua: This is from the [indiscernible] face detector. It's not his
best one. But we tested one of the recent ones from Adobe; they have an
exemplar-based face detection algorithm, and they're also making these
kinds of errors.
>>: That's kind of the baseline that you just download free software.
>> Gang Hua: Yes.
>>: Which is the --
>> Gang Hua: We want to do better than that without a lot of effort.
So I'm going to tell you a really simple approach to do this. What we
did is very simple; again, the philosophy is very simple, but I think it
could inspire lots of extensions, so that's something I'm trying to
explore. Suppose I have a photo collection here. First I set the decision
threshold of the offline trained detector to be really low to
ensure recall, so I get a set of face candidates here.
Of course it's going to have false positives. Then I choose perhaps
the top 10 percent as positives and the bottom 10 percent as
negatives, and treat them as positive and negative examples. Then
I build the PEP representation: suppose I have an offline trained PEP
model, I build the PEP representation on each candidate I collected
here, and I simply train an SVM classifier on top of the PEP
representation. Then I rerank all these candidates and cut off at the
threshold to see if we can do better.
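A sketch of this unsupervised rerank loop, assuming scikit-learn; the 10 percent fractions follow the talk, while everything else is an illustrative assumption:

```python
# Sketch: treat the detector's most and least confident candidates as
# pseudo-labels, train an SVM on their PEP descriptors, and rerank everything.
import numpy as np
from sklearn.svm import LinearSVC

def adapt_and_rerank(candidate_peps, detector_scores, frac=0.10):
    order = np.argsort(detector_scores)
    n = max(1, int(frac * len(order)))
    neg_idx, pos_idx = order[:n], order[-n:]     # bottom / top 10 percent
    X = np.vstack([candidate_peps[pos_idx], candidate_peps[neg_idx]])
    y = np.concatenate([np.ones(n), np.zeros(n)])
    clf = LinearSVC(C=1.0).fit(X, y)
    # New scores for every candidate; threshold these to accept detections.
    return clf.decision_function(candidate_peps)
```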
>>: How many components do you use in that?
>> Gang Hua: 1,024. So it's a really high-dimensional vector, indeed. But
that's what we did; it's as simple as that. After describing this I don't
need to go through all the slides, because that's essentially what we did,
to see if we can make it better. So we tested our algorithm on three
photo albums. This album and this one were collected by Simon, me and
Ashish several years ago; we still have some detection data on them. And
the other one is the FDDB database, which was released by Erik
Learned-Miller from UMass, and the funny thing is I had a lot of
discussion with him recently.
What he did was collect some Web photos, run the Viola Jones detector, and
pick up the ones where it failed; so he was specifically screwing
up the Viola Jones face detector, and this set of photos is really
challenging. Currently the best algorithm there is a face
detector called XZJY published by Adobe. They built an exemplar
database with 10,000 faces, and just by doing this exemplar matching for
face detection they yield the best performance so far on this FDDB database.
So we're going to show that by using our adaptation scheme, a very simple
one, we can improve both the Viola Jones face detector and the XZJY
detector. I'm going to play a video, but here are some qualitative
results first. This is on the G album: the XZJY detector is still making
those embarrassing false positives, and we can get rid of those. The
results we show here are mainly about eliminating false positives, but we
also have cases where we can actually get false negatives back. In
terms of performance curves, first I want to show that this
PEP representation is really also good at differentiating face from
non-face. Here we did the same adaptation process with just
concatenated SIFT descriptors, meaning we're not using
the Gaussian mixture model to select those part features; instead we
just densely extract the SIFT descriptors. So here is the
performance of the Viola Jones detector. If we simply use the SIFT
representation we already do a lot better, but if we use the PEP
representation and do the adaptation, we can do much better,
okay. The reason is obvious: with the
PEP representation, we're reducing the within-class variation of
the real face images. The background may stay the same
because it's random anyway, but we're making the face class
tight so that we can do better. And here is some more comparison on
the album. Here is the
performance of the Viola Jones detector and this is the XZJY detector;
you can see this detector is much better than Viola Jones and nearly
perfect.
This is our exemplar-based detector adaptation algorithm that
was published in CVPR 2011. It improved the Viola Jones detector but did
not make the XZJY detector better. If we use our PEP
representation, we manage to improve both the Viola Jones
detector and even make some improvement over the XZJY detector on this
dataset; we make it much better. So on this album, again, these are the
results from the CVPR 2011 work; that exemplar-based approach works
mainly for video. That's where we made a lot of progress. If we
use this PEP representation and do this adaptation, as you can
see, even starting from the Viola Jones detection results -- let me
see -- yeah, the blue curve is already better than the
XZJY detector, which shows that this one cycle of
iterative reranking is really helpful. So this is on the FDDB
benchmark; the discrete score means that if your predicted detection has
more than 50 percent overlap with the [indiscernible], then you claim it
to be a correct detection. Okay. So this red curve is the XZJY detector
results, and here are some baselines published before; they use
different local features, and this one I believe used the SURF
feature. And this one was published by Erik Learned-Miller's group;
they're also doing this type of adaptation but using a
Gaussian process classifier in the middle. We do the adaptation fold by
fold, because they have a tenfold cross validation process.
Even if we only do the adaptation fold by fold, we already enhance the
results significantly. I'm not so sure -- I
didn't put the all-data adaptation results there; I should add them back.
I think we can do better than this. So this is the continuous score,
where what matters is the overlap between the predicted rectangles and the
ground truth, and I should change these slides a little bit. This is the
XZJY detector; after we adapt it with the PEP representation, we make it
significantly better. So here I would like to play the video, perhaps
just to give you more of a sense that this representation is also good at
bringing some of the false negatives back. Let me see. Okay. As I
said, we're going to present this piece of work at ICCV soon. Yeah,
this is the Viola Jones detector; performance is evaluated at around a
90 percent recall rate. Here is a false positive; we got rid of it.
We have a false negative here, a missing detection, and we get it back
after the adaptation. Here is the embarrassing one, and we wondered why
this kind of patch gets detected; if you do histogram equalization, the
face patch looks like a face. And this album was actually published by
Shao before. We get rid of this, and we can get this one back. Okay. I
will simply stop the video, because it just goes on to show that it is
better indeed, and you can see it in the playback. So just some
discussion on this type of adaptation scheme. I think it could inspire a
lot of interesting things, because you could think of this as a kind of
process where we are trying to adapt a recognition algorithm to the
statistics of the dataset you're dealing with; this is just a lightweight
way of doing that. But I would say there could be a lot of interest in
using that. I think there was some earlier work here on detector
adaptation that tried to modify the detector itself with [indiscernible]
extensions, but here we're doing it a little bit differently, trying to
leverage the examples in an unsupervised fashion. So just in conclusion,
we have a pose invariant face representation induced from the PEP model,
and we showed leading performance on some face recognition and face
detection benchmarks. Our future work involves how we
could use this type of representation for other visual recognition
problems like [indiscernible] recognition, and we're doing some work
to see if we can make a better object detector based on
this PEP representation idea. So before I end my talk, I want to
test your face recognition capability. Who is this guy, without any
hints? Let me give you a little bit of a hint. Yes?
>>: [indiscernible].
>> Gang Hua: [indiscernible].
>>: I don't know.
Just decades.
>> Gang Hua: [indiscernible] so I worked for IBM before. So
[indiscernible] so I used his face all across my talks, so I want to
thank him indeed. I'll stop here in case you have any questions.
>>: [indiscernible].
>> Gang Hua: No. I do a lot. IBM still has Deep Blue as a demo in
their demo room, but it is never played again, I guess. It's kind of
interesting. Now they're focused on Watson, I think, the DeepQA
project. Very interesting.
>> Zhengyou Zhang: Thank you.
[applause]