>> Chris Burges: So we're delighted to have Sina join us today. He came here to hike, but he agreed to give a talk as well. He's finishing his fourth year at Princeton. His Ph.D. advisor is Rob Schapire, formerly Rob Calderbank; the advisors switched. And we're delighted to have him here.

>> Sina Jafarpour: Thank you very much, Chris. Thanks, everyone, for coming; I'm really happy to see you again. Today I'm going to talk about my internship project this year, which was at AT&T. It was basically about using sparsity in two different problems, and we'll see what we can do with it. So it's about sparsity in question answering and classification. This is joint work with Srinivas Bangalore, Howard Karloff, Taniya Mishra and Carlos Scheidegger, and they really helped me a lot. Before starting I should say these are ongoing projects, so please feel free to stop me, ask questions, and make comments at any time during the talk.

We all know in machine learning that in many applications, actually in almost all applications, the data we have is high dimensional: it lies in a high-dimensional vector space, but it is sparse. Think of the bag-of-words model in IR, or sparse decompositions in imaging. In many cases this sparsity is hidden, so we would like to learn the sparse structure and then use it as a feature space in order to do better learning and, say, classification. This helps us avoid overfitting and generalize better. That is what this talk is about: extracting and exploiting hidden sparsity, and recovering it, in unsupervised and semi-supervised settings. We'll have real datasets in the two projects, face classification and question answering, and we'll see what we can do and what the barriers are at the current stage of the work.

So let's start with the face classification project. In face classification, we have a bunch of movies. Let's say we are Netflix users, so we can get a lot of movies, and let's also assume we have access to the Internet, so we have the IMDB cast list for each movie. We don't have any pictures or anything beyond the cast list; we just know who is acting in each movie. The goal of the project is: suppose we have this for several movies. Now I give you a new movie, "Sleepless in Seattle"; try to classify the faces of the famous actors in it. Let's see what we can do.

The first thing is that we need some training data, ideally labeled training data. So what we do is basically a way to get a noisy but usable training dataset. We have a bunch of movies, as you can see here. We run a fast standard face detection algorithm like Viola-Jones, and then a standard face alignment method, so we make sure all the faces are aligned. As a result we have a bunch of faces detected from different movies. Then we look at two movies that have just one actor in common, say "Up in the Air" and [inaudible]: they share George Clooney. We then look at the pairs of faces across these pairs of movies that have the highest similarity, something like this. For instance, these two faces are very similar to each other.
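A minimal sketch of this cross-movie matching heuristic, assuming the faces have already been detected, aligned, and flattened to vectors; the function name, the similarity measure and the number of pairs kept are illustrative assumptions, not the actual pipeline from the talk.

```python
import numpy as np

def harvest_labels(faces_a, faces_b, shared_actor, top_k=20):
    """faces_a, faces_b: (n_a, d) and (n_b, d) arrays of aligned, flattened
    face vectors from two movies whose cast lists share exactly one actor."""
    # Normalize rows so inner products behave like cosine similarity.
    a = faces_a / np.linalg.norm(faces_a, axis=1, keepdims=True)
    b = faces_b / np.linalg.norm(faces_b, axis=1, keepdims=True)
    sim = a @ b.T                                    # pairwise similarities
    flat = np.argsort(sim, axis=None)[::-1][:top_k]  # indices of the closest pairs
    pairs = np.column_stack(np.unravel_index(flat, sim.shape))
    # With high probability the closest cross-movie pairs are the shared actor.
    return [(int(i), int(j), shared_actor) for i, j in pairs]
```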
By similarity here I mean, say, inner-product or Euclidean-distance similarity. We then conclude that with high probability these two faces are the same actor, and because the two movies have only one actor in common, we say both should be George Clooney. So this is a noisy way to collect labeled training examples.

After that, we need to classify. The next step is face classification, and we take a lazy-training paradigm. First, assume each face is converted to a vector. So these are all vectors: faces of George Clooney, Peter Sellers, Ingrid Bergman and Bruce Lee. For instance, this part of the matrix contains a lot of the faces of Bruce Lee that we captured; each one is a column. Then somebody gives us a new face and asks us to classify it based on the faces we have here. What we can do is exploit a known fact from face classification and face analysis: if this is a face of Bruce Lee, it can be represented as a sparse linear combination of the faces we have here, plus some noise, of course. So we would like to exploit this sparsity and classify the new face based on it. So let's be a little --

>>: A linear combination of the pixels?

>> Sina Jafarpour: A linear combination of the pixels, exactly.

>>: So the driver's license people have another representation where they look at the distance between the eyes and the other features. [inaudible] more than --

>> Sina Jafarpour: Right.

>>: So why not use that as a feature set?

>> Sina Jafarpour: We could do that, too; I'll touch on it. The point is that there the feature space is very restrictive, a very controlled, exact set of features. Here it's different: assuming linear sparsity is a more robust assumption in this case. But I'll come back to that briefly, too. Yep.

So let's formalize the problem a little. We have a face image, say square root of M by square root of M, which we convert to an M-dimensional vector in R^M. Then we have a matrix: it has (number of actors) times (number of faces per actor) columns, and each column has M rows, because each face was converted to a vector. Our goal is to find a linear combination of these columns that approximates the vector F.

This brings us to the classical problem of sparse approximation and sparse recovery, which is also the problem behind compressed sensing, a lot of graphical modeling, and other related work. In sparse recovery we have a vector X*, which is sparse: it has relatively few non-zero entries. We also have a matrix A, which for now is a general matrix; I'll talk later about the properties we exploit. We also have noise. We're given a vector F and told that F = A X* + E, the matrix we have times the sparse vector plus noise, and our goal is to recover or approximate X* from the measurement vector F. For the beginning, of course, let's forget the noise.
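To make the setup concrete, here is a minimal sketch of how the face dictionary and the measurement model might be assembled; the input format and names are illustrative assumptions, not the talk's actual code.

```python
import numpy as np

def build_dictionary(faces_by_actor):
    """faces_by_actor: dict mapping an actor name to a list of aligned,
    equally sized grayscale face images (a hypothetical input format)."""
    columns, labels = [], []
    for actor, faces in faces_by_actor.items():
        for img in faces:
            v = np.asarray(img, dtype=float).reshape(-1)   # sqrt(M) x sqrt(M) image -> vector in R^M
            columns.append(v / np.linalg.norm(v))          # unit-norm columns
            labels.append(actor)
    A = np.column_stack(columns)    # M rows, (actors x faces per actor) columns
    return A, np.array(labels)

# Model assumed in the talk: a new face F is approximately A @ x_star + e,
# where x_star is sparse and concentrated on the true actor's columns.
```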
What we want, then, is to solve this optimization problem. The L0 norm counts the number of nonzero entries, so this says we want the sparsest vector X such that AX = F, if we forget E for the moment. Unfortunately this problem is NP-hard. So over the last 50 years or even more, people have thought about ways to approximate or relax it. The basis pursuit algorithm, formally introduced by Chen and Donoho but known even before that, says: now that we cannot solve the L0 minimization, let's look at L1 minimization. The L1 norm is convex, so this is a convex optimization program, and it is the closest convex norm to L0 that we can have. That is the basis pursuit algorithm.

Of course, because we have noise, you have to account for it too. The LASSO, introduced by Tibshirani and very widely used, says we can do L1 minimization but with a relaxed constraint: we no longer require the linear equality exactly, but we want the residual to be small, smaller than the norm of the noise or a constant times it.

For this problem, though, it seems another linear program might be a better choice. Here the assumption is that the noise vector is also sparse; we'll see why when we go back to the face classification setting. If the noise is also sparse, we can integrate the two: the optimization becomes minimizing the L1 norm of X, where X is now X and E concatenated together; correspondingly we concatenate the identity matrix with A, and we require the product to equal the vector F that we have. This is the version of basis pursuit denoising recently proposed by Wright and Ma for problems where we can assume sparsity of the noise too. There are theoretical results about it, but I'm not going to cover them.

>>: I is the identity?

>> Sina Jafarpour: I is the identity, exactly. So if you look, we concatenate the noise here too, so it's actually --

>>: [inaudible].

>> Sina Jafarpour: Yeah.

>>: In the [inaudible] application, or the face application, what evidence supports the assumption that the noise would be sparse?

>> Sina Jafarpour: The first thing I should say is that we tried both, and this one worked better. The second thing is local regularity: pixels that are close to each other, the eyes or the nose and so on, we expect to have similar values, so most of the differences do not matter much, and what remains maps to a sparse noise vector. Still, your comment stands; we should check in each case which one works better. Empirically, the assumption that the noise is sparse works better here. I should also refer you to the Wright and Ma paper; they have sections explaining this locality in detail.

>>: The sparse part is the images in the database that are close enough to the test image?

>> Sina Jafarpour: To the test image, yes. Exactly.

So this is also the justification I wanted to give for why we are using L1 minimization. You might ask: why not L2 minimization? We can even get an explicit solution for L2 minimization.
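Before the geometric picture, here is a minimal sketch of the concatenated L1 program just described, rewritten as a standard linear program and solved with SciPy; the reformulation is the textbook one, and in practice a dedicated sparse solver would be used rather than a generic LP.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit_denoising(A, F):
    """Solve  min ||x||_1 + ||e||_1  subject to  A x + e = F
    by concatenating B = [A, I] and minimizing the L1 norm of z = [x; e]."""
    M, N = A.shape
    B = np.hstack([A, np.eye(M)])            # [A | I], shape M x (N + M)
    n = N + M
    # LP variables: z (length n) and t (length n) with -t <= z <= t; minimize sum(t).
    c = np.concatenate([np.zeros(n), np.ones(n)])
    A_eq = np.hstack([B, np.zeros((M, n))])  # only z appears in B z = F
    A_ub = np.block([[ np.eye(n), -np.eye(n)],    #  z - t <= 0
                     [-np.eye(n), -np.eye(n)]])   # -z - t <= 0
    b_ub = np.zeros(2 * n)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=F,
                  bounds=[(None, None)] * n + [(0, None)] * n)
    z = res.x[:n]
    return z[:N], z[N:]                      # recovered sparse x and noise estimate e
```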
So let's see, geometrically, what is happening. This is the line AX = F, and this is the unit L1 ball, the diamond: the points whose L1 norm equals 1. We grow this diamond until it hits the line, and we do the same with the L2 ball. So let's see what happens. This is the solution of the L2 minimization, and this is the solution of the L1 minimization. The L1 ball hits the line close to an axis, so the solution is relatively sparse; I'm only using this picture for illustration, to give you the idea. The L2 intersection point is this one, and it is not as sparse: it has both an X value and a Y value. So this gives some intuition for why we should use L1 minimization.

So then this is the face classification algorithm. What it does is: we first normalize the columns, just to make sure we're not changing the energy too much. Then we use the basis pursuit denoising algorithm to find a sparse set of columns representing F. Then we need to look at two things. One is: if you give me a face, how do I know it is a linear combination of these columns at all; how do I know it's Bruce Lee, for instance, rather than none of them? For this we look at how concentrated the recovered vector is. R is the number of actors we have, actor 1 to actor R. If this measure is close to 1, the vector is concentrated on just one of the actors; if it's close to 0, the vector is very dense, with elements spread over all of them. So if the value we get is less than some threshold set a priori, we reject and say this is none of the actors. If it's higher than that, you can either look at this value S for the different actors, or simply look at the residuals. That's what we did: we looked at the residual, F minus A restricted to one actor's columns times the corresponding coefficients, for the first actor, the second, the third, and picked the actor with the minimum residual value.

So this is the procedure we have. I should also mention, because I'm going to show some comparisons: the previous state of the art, and what people usually use for face classification, is SVM, and SVM with distance learning, which is the work of Weinberger et al. It is again based on the similarity of faces, the idea we mentioned that a new face of Bruce Lee will be very similar to the training faces we have. Linear kernels are usually fine, but we can also try to learn the distance matrix, to learn a transform under which we get a better representation of the vectors.

>>: L is global?

>> Sina Jafarpour: L is global, yes. Weinberger proposes learning this metric with a semidefinite program, tied to the support vector machine optimization. Even though they provide a specific algorithm for solving it, it is usually slow. So the drawback compared to plain SVM is that the distance learning takes time, but it's global, so we only have to do it once. And the good thing, as we'll see, is that it does somewhat better.

>>: Can I ask a question two slides back?

>> Sina Jafarpour: Yes, sure. Uh-huh.
>>: So for the thing you output, you're not re-solving for the coefficient vector restricted to each A_i; you're using the vector you determined once.

>> Sina Jafarpour: Uh-huh.

>>: So it seems like if you have a lot of correlations between these -- like if there are two actors in front of a green background, it's kind of random luck whether it picks up the green from one or the other.

>> Sina Jafarpour: So we removed the background and everything like that; what we have is just the face. And we also normalized the contrast and everything related to that, yep.

>>: Black and white, or --?

>> Sina Jafarpour: In this experiment it was black and white.

>>: Did you rescale?

>> Sina Jafarpour: We rescaled, and did rotation alignment to make sure. All of this introduces some noise, but it also helps in the later stages.

>>: Have you thought about using other bases? Pixels seem --

>> Sina Jafarpour: I tried wavelets. The wavelet results were not so good, and I still don't know exactly why, but compared to the pixel domain, the wavelet domain was not as good.

>>: There's also work by [inaudible] on other image classification tasks where they use patches, trying to build the image out of patches of images from the domain -- for example --

>> Sina Jafarpour: Right, that's a good point, too. The thing is that we ultimately want this in an online setting: we want to do these classifications online, and if we use patches, can we still do that efficiently online? If we can, it's worth trying, and I'll be happy to talk with you about it afterwards.

>>: The state of the [inaudible]?

>> Sina Jafarpour: So these were just very initial experiments. We used four actors, with 80 faces for each, and we used cross-validation, looking at the cross-validation error per actor. This is the L1 minimization we use, basis pursuit denoising; this is it with distance learning, where we mapped the data from the pixel domain to the learned distance domain; this is SVM; and this is distance-learning SVM. As you can see, the results in these experiments are relatively promising, but one reason for that, I should say right now, is noise. The dataset we had was noisy, and that was a problem for SVM; when we tried other state-of-the-art classification datasets the results were much closer, but because of the noise, this algorithm turned out to be more robust.

In the next experiment we tried a larger set of actors; here you can see the results. Something I have to mention: this actor, for instance, had a very large classification error, and when we looked at the data, the training set for that actor was very noisy. I'll come back to the issue of noise in obtaining the training datasets later. But the message of these two slides is that this algorithm, even though much slower than SVM, was more consistent and more robust.

The movie we focused on as a special case was "Sleepless in Seattle". This is a frame of the movie; we captured about 7,000 of them. These are the confidences of the face classification algorithm, this is a sorted version of them, and this is just a density plot of the confidences.
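For reference, here is a sketch of how such a confidence and the residual-based decision can be computed from the recovered coefficients. The concentration index below is the standard one from the sparse-representation classification literature; the exact confidence plotted in the talk may differ, and the threshold is only an example.

```python
import numpy as np

def classify_face(A, labels, F, x, threshold=0.6):
    """A: M x K dictionary whose columns are labeled by `labels`;
    x: coefficients recovered for the new face F (e.g. by the LP sketch above)."""
    actors = np.unique(labels)
    R = len(actors)
    total = np.abs(x).sum() + 1e-12
    mass = np.array([np.abs(x[labels == a]).sum() for a in actors])
    concentration = (R * mass.max() / total - 1) / (R - 1)   # 1 = concentrated, 0 = spread out
    if concentration < threshold:
        return None, concentration            # reject: none of the known actors
    # Otherwise pick the actor whose columns explain F with the smallest residual.
    residuals = [np.linalg.norm(F - A[:, labels == a] @ x[labels == a]) for a in actors]
    return actors[int(np.argmin(residuals))], concentration
```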
For instance, you can see that if you take the value 0.6, we get a point here.

>>: Identifying one actor, or --

>> Sina Jafarpour: Just saying whether the face in this frame is one of those actors or someone else. I'll talk about the classification accuracy on the next slide. As we mentioned, we then fixed one confidence value and looked at the classification accuracy at that threshold. We checked it manually, so even these numbers may have some error from my own eyes, because we didn't have labels. I just wanted to show you some examples, some random faces, so you can get a look at the results: for instance, this is [inaudible], this is [inaudible], and this one is classified too, although I think this one --

The next thing, going back to that earlier question, is making this algorithm faster. We have some suggestions for that. One is running SVM first: if the SVM has high confidence, we say it's doing a good job and accept its answer; otherwise we run our algorithm. This makes the overall runtime much faster. The second is that we don't have to solve the optimization exactly; we can run it for some number of iterations rather than to very tight accuracy. And the third is that we can exploit temporal coherence: if frames 1 and 3 are [inaudible], then with high probability frame 2 is as well, because video has this temporal coherence, and we can use that in the classification. Doing all this, the classification accuracy drops a little, but not much. Our current version of the classifier incorporates these three ideas, and we'd be happy to hear any other suggestions. So if there are no questions, I'll go to the second part of the talk.

>>: Why did you choose [inaudible]?

>> Sina Jafarpour: For this experiment, because it was much more distinctive. I also did experiments with Ocean's Eleven and The Mexican; so far I've used up to six different actors, because Ocean's Eleven has George Clooney and several others. Even there we had some consistency, though of course a little less than this. This was just the example I wanted to show here, since it's set in Seattle.

Now for question answering. We all know the problem: we are interested in automatically answering questions, we have datasets, and we would like to do it as well as we can using machine learning and information retrieval. I'm happy Chris told me there is going to be much more research on question answering here; there is also a recent application at AT&T, Qme, introduced by Bangalore and Mishra. It uses a large corpus of question-answer pairs provided to them; the questions are answered by human experts, who also categorize them. Any new question that comes in is split into one of two categories. A static question is one whose answer doesn't change much over time; for those they try to retrieve the answer from the corpus. Dynamic questions change very rapidly; for those they go out to Web information. To answer a static question, what they do is use the Bleu score and TF-IDF: they find the best match among the question-answer pairs and output the corresponding answer. This works very well if the question is in the corpus.
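A minimal sketch of that Bleu-based lookup for static questions, using NLTK; the tokenization, smoothing and acceptance threshold are illustrative assumptions, not the deployed Qme system.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def answer_static_question(query, qa_pairs, min_score=0.5):
    """qa_pairs: list of (question_tokens, answer_text) from the corpus.
    Returns the stored answer of the closest question by Bleu score,
    or None if nothing scores above min_score."""
    smooth = SmoothingFunction().method1     # avoid zero scores on short questions
    query_tokens = query.lower().split()
    best_score, best_answer = 0.0, None
    for q_tokens, answer in qa_pairs:
        score = sentence_bleu([q_tokens], query_tokens, smoothing_function=smooth)
        if score > best_score:
            best_score, best_answer = score, answer
    return best_answer if best_score >= min_score else None
```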
So if there is a very close question in the corpus, this method can find it, because the Bleu score is robust. We get high precision but lower recall, especially when we test on random samples from the question-answer set: we observe that recall is low. So our goal in this part of the talk is to find the relevant questions and hopefully get a higher recall. I'll postpone ranking them, and getting a larger F-measure, to some other time; here I'll just talk about recall.

To increase the recall, our approach was to expand the questions: we try to find a set of relevant words using the question and the corpus of question-answer pairs that we have. For instance, take the question "Who is the president of China?". Of course this is a very simple question, but we might expand it based on the similarities we learn: for instance, "president" with something like "leader" and "China" with "republic". To do this, we tried four different expansion methods. Two of them are generative, LDA and linked LDA; the other two are discriminative, SVD and linked SVD. All of them use topics, but two are generative and two are discriminative, so it's interesting to look at what they do on the dataset.

>>: So you have a bunch of questions with their answers?

>> Sina Jafarpour: Yeah.

>>: And you map your questions to something like this.

>> Sina Jafarpour: Yes.

>>: To their questions.

>> Sina Jafarpour: Yes.

>>: How many questions?

>> Sina Jafarpour: The ones I used were a subset; I used 5 million, but there were many more.

So this is the road map we have. This, as Chris mentioned, is the question-answer dataset. We generate a co-occurrence matrix; I'll go through the details shortly, but say the rows are the vocabulary words and the columns are question-answer pairs. That's the simplest setup we can have. Every row of this matrix is a very sparse vector, and we map it to a low-dimensional topic space: the rows are still words, but the columns are now topics. Then we treat the words as vectors in this low-dimensional space and use vector similarity between them to find the words closest to a given word, say "football" in this case.

Before starting we need to do some preprocessing, and we did the following. First, using NLTK, we removed the stop words, and we also did some stemming, nothing very elaborate, just to collapse some of the words. The third step was spell checking. For spell checking we didn't have an open-source spell checker that we could run in batch mode, so the approach we used was: for every word, if WordNet accepts it, we are fine. If WordNet rejects it, which happens for instance because WordNet is not up to date and simply doesn't have the word, we also look at the whole corpus; if the whole corpus contains the word several times, we accept it anyway, and otherwise we reject it. With this we could reduce the size of the vocabulary significantly.
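A sketch of this preprocessing, assuming NLTK; the stemmer choice and the minimum corpus count used in the WordNet fallback are illustrative assumptions.

```python
from collections import Counter
from nltk.corpus import stopwords, wordnet   # requires nltk.download('stopwords'), nltk.download('wordnet')
from nltk.stem import PorterStemmer

STOP = set(stopwords.words('english'))
STEM = PorterStemmer()

def build_vocabulary(corpus_tokens, min_count=5):
    """corpus_tokens: iterable of token lists from the question-answer corpus.
    Keep a word if WordNet knows it, or if it occurs often enough in the corpus
    (the fallback for names and new words WordNet has not caught up with)."""
    counts = Counter(w for doc in corpus_tokens for w in doc)
    return {w for w, n in counts.items()
            if w not in STOP and (wordnet.synsets(w) or n >= min_count)}

def preprocess(tokens, vocab):
    # Lowercase, drop stop words and rejected words, then stem.
    words = [t.lower() for t in tokens]
    return [STEM.stem(w) for w in words if w not in STOP and w in vocab]
```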
After doing this preprocessing, we need to map the data to a lower-dimensional space. Just to remind you about topics: people talk about very few topics, so our goal is to learn this word-topic matrix and then use similarity in that space, either cosine similarity or [inaudible]; in some cases one was the better measure, in other cases the other.

>>: How do you choose the number of topics?

>> Sina Jafarpour: I chose it ad hoc. That was one of the very few parts of the project where I just used trial and error.

The first method we used was LDA; I'll just give a very short overview. LDA is a generative approach. We assume each document is a bag of words, so every word of the document is generated in an i.i.d. process. Each topic is a, hopefully sparse, distribution over the dictionary, and each document is a sparse distribution over the topics. For instance, here is a document and, say, the topic is sports: it is a sparse distribution over the dictionary, so words like football, basketball and rugby have a high probability of being selected and the other words have less. Or look at this document, where one topic is, say, biology, with words like "life" and so on; the document itself is a sparse distribution over all topics, and here it is choosing the topics biology, genetics and computation.

The last piece is the generation of a word. We have the topics, and we have the documents as sparse distributions over topics. The process of generating a word is the following: for each word we sample a topic from the document's topic distribution, and then we sample a vocabulary word from the corresponding topic, and that's the word. Of course, the words are the only things we observe, so we use posterior inference to learn the topic distributions and document distributions that best match our model. That's a very quick overview of LDA; we use the posterior probabilities to build the word-topic matrix.

There is another thing we want to exploit: the structure of the question-answer set. In what I just described, question words and answer words were treated the same way; there was no distinction between them. But what we would really like is to find the words most similar to the question words. For this we changed the model and looked at linked LDA. Linked LDA is a model used for things like NIPS articles and their authors, or blog posts and their comments, so we use it here. It says, as before, that each document is a sparse distribution over topics. But when we sample, say, topic two, topic two is now a pair: a sparse distribution over question words plus another sparse distribution over answer words. So question and answer words come from two different vocabulary sets, and the corresponding topics have different --

>>: Why not just ignore the answers, model only the questions, and map your incoming query to the questions?
>> Sina Jafarpour: I'll get to that. The point is again recall: that's the issue we have. When we do that, the chance that we don't get any answer at all increases somewhat. But in the final slides I have some analysis comparing the two. That's a good question, actually.

The other two methods are based on SVD. SVD is a discriminative approach, and it takes the same co-occurrence matrix we have. We said people talk about very few topics, so hopefully we can approximate this matrix by a very low-rank matrix, one that decomposes as the product of two matrices where the inner dimension is the number of topics: a vocabulary-topic matrix times a topic-document matrix. That is just saying we want the best rank-(number of topics) approximation of the matrix, and we know we can compute it with the singular value decomposition. This is the generalization of the sparsity we had in the question-answer pairs from vectors to matrices: the L0 norm of a vector becomes the rank of a matrix.

Again, we would like to distinguish question words from answer words, so we can also build a question-answer co-occurrence matrix. In this matrix the rows are question vocabulary words, the columns are answer words, and each entry counts how many times the two words co-occur in the corpus. We can take a low-rank approximation of that as well and use it as our similarity matrix: on one side the question vocabulary and the topics, on the other the answer vocabulary.

Now I'll just show you some examples of what we recovered and then point to some quantitative experiments. These are some of the topics retrieved by each method, just as an illustration. They are not very informative, and I should say most of the other topics don't match anything we can interpret very well. But if we look at the most similar words, for instance for the word here, the observation is that LSA, and QLSA which is the linked LSA, provide words that could actually replace the word or are closely related. LDA, and I'll show more results on the next slides, gives words that merely co-occur sometimes with the word "wind" we have here. So, as we can see, LSA and QLSA may be more appropriate for our experiments, but we should look more closely at the results.

These are the results for other words I tried. For many of them the methods were not so different, for instance for [inaudible], as we'll see. But for most of them, for the question answering case, SVD or LSA was more appropriate. These are just illustrations. From my point of view, LDA itself had the worst performance and SVD, or LSA, had the best. I'm sorry, this one, too; these are the results. In many cases the results were very similar. And so the --

>>: So LDA didn't do as well as the SVD?
>> Sina Jafarpour: We were surprised, I should say, because at the very beginning of the project the main goal was to use LDA, and we did, and we even talked to David Blei to make sure we were doing everything correctly. But in the end, yes, the simple SVD.

>>: I don't know, presumably the LDA community has compared against SVD for their own problems and --

>> Sina Jafarpour: The point is that in those communities most of the comparisons are qualitative, by eye; they don't have very quantitative measures to compare with. Whenever David gives a talk, people usually ask about this, and for that he has introduced things like supervised LDA. But for a task like this, as far as I know from talking with him, there still isn't a good way of measuring consistency and accuracy.

>>: Doesn't LDA determine how many clusters you should use automatically; is that right? Or do you still have to --

>> Sina Jafarpour: That's a good question. There is a nonparametric LDA that does that, but what we used here was the parametric version, with the number of topics fixed. The nonparametric version puts another distribution on the number of topics and tries to learn it as well, but it doesn't change the results much compared to choosing the number yourself. The number of topics I chose here was about 1,200; with much fewer the results were not good at all, and with many more there was not much difference.

>>: One advantage of LDA, you could say, is that you wouldn't have to figure out the 1,200 number.

>> Sina Jafarpour: That's right.

>>: But as long as you have some validation in it, maybe --

>> Sina Jafarpour: Yeah, that's right. You're absolutely right.

The last thing, coming back to Chris's question, is that we wanted a quantitative evaluation. Previously Taniya and Srinivas distributed questions and asked people to manually label the relevant ones, but the problem was that those questions did not come from the same question space as this corpus: they were very simple questions that just came to people's minds, whereas the questions here have much more diversity. So this is the protocol we came up with, and I'll be very happy to hear any suggestions about it. We separated the dataset into a training set and a test set: training questions with their training answers, and test questions, each with a test answer that we only use for evaluation. For a test answer, we look at the set of training answers similar to it using the Bleu score, but not with a similarity as sharp as the one Taniya and Srinivas use for exact retrieval; we use a lower threshold. That gives us some similar answers, and we then take the similar questions corresponding to them, so the relevant set is defined indirectly, through the answers. Then we do the expansion and use [inaudible] TF-IDF to do a retrieval, and we look at the overlap between the retrieved set and the set of similar questions. These are the results we got, the relative improvements, and again we got better results for SVD.
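Since the plain SVD came out best in these numbers, here is a minimal sketch of that expansion step: a truncated SVD of the word-by-document co-occurrence matrix followed by cosine similarity in topic space. It uses SciPy's sparse SVD; the rank matches the roughly 1,200 topics mentioned earlier, and the remaining names are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.linalg import svds

def topic_space(cooccurrence, k=1200):
    """cooccurrence: scipy.sparse matrix, vocabulary words x question-answer pairs.
    Returns one k-dimensional topic-space vector per word from a rank-k truncated SVD
    (k must be smaller than both matrix dimensions)."""
    U, s, _ = svds(cooccurrence.asfptype(), k=k)
    return U * s                                   # scale word factors by singular values

def expand_word(word, word_to_row, word_vectors, top_n=5):
    """Return the top_n most cosine-similar vocabulary words, e.g. to expand
    'president' with words like 'leader'."""
    rows = word_vectors / (np.linalg.norm(word_vectors, axis=1, keepdims=True) + 1e-12)
    sims = rows @ rows[word_to_row[word]]
    best = np.argsort(sims)[::-1][1:top_n + 1]     # skip the word itself
    row_to_word = {i: w for w, i in word_to_row.items()}
    return [row_to_word[int(i)] for i in best]
```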
The other thing I have to emphasize here is WordNet. With WordNet we actually had a large negative improvement. This says that the similarities WordNet provides are not an appropriate similarity measure for our case, for question answering; the spaces are different. With the other methods we had improvements, and again, these are relative improvements.

>>: Can you go back?

>> Sina Jafarpour: Sure.

>>: What are you evaluating?

>> Sina Jafarpour: I split the data into a training set and a test set, and for each test question I look at its test answer.

>>: I want to know what -- couldn't you evaluate the question answering by just seeing how many answers you get right?

>> Sina Jafarpour: That's the whole issue: how do you define "right" here?

>>: So I guess [inaudible] in the question? A small test set.

>> Sina Jafarpour: That was what had been done before. The point is that this set is relatively large; distributing a question-answer set this large to human judges is something we will definitely do, but during the internship we didn't really have the ability to do it. So this is exactly an automated version of that. And the same thing, basically: as Taniya commented and as Chris also mentioned, we looked at just the questions and did the same evaluation there. Here the similar questions are the ones the Bleu score, with the lower threshold, says are relevant. Then we do the expansion and the TF-IDF retrieval, get a set of retrieved questions, and look at the overlap with the set of similar questions. Here the results are much more stable than with the answer-based set, which says that the Bleu score with the lower threshold is consistent with these question-expansion methods. Again we got a negative improvement with WordNet. We also tried expanding using just the questions, without the answers, and got relatively the same results. But I should say this was just the experiment we could do at the time, given the time limit; to evaluate these methods properly we need some human-labeled evaluation.

>>: Kind of like Mechanical Turk?

>> Sina Jafarpour: We actually thought about that, too. It came up toward the end of my internship, so we haven't done it yet, but I guess that's how we'll finally do it. We've talked about it. That's it, actually. Very good.

So, just to wrap up, and I'm sorry to go over time: we looked at two projects, question answering and face classification, and the ways we can use sparsity to do classification there. Several challenges remain, as many of you pointed out during the talk. In face classification, one of the main issues is denoising the training set. For the training set we looked at the similarity between faces from different movies that shared only one actor in their cast list.
The question is: we know the classifier is doing relatively well, though not perfectly. Once the classifier classifies something, can we replace bad training examples with the ones we get from the classifier on newer movies, to make the training set cleaner? That is something we need to try and see how it works, but we should be careful about overfitting here. And of course, if we want this as a final result, we need more experiments with larger actor sets; the largest set I've used so far has six actors, and we need to go beyond that. For question answering, better experiments are required, and that is something we will definitely do. The other interesting direction is to do ranking after this retrieval step, at least, and see how much we can increase the F-measure, which is the final goal of the task. I'll be very happy to hear any other suggestions or comments. Thank you very much. [applause]

>>: Will you be around for a little bit longer?

>> Sina Jafarpour: Yes.

>>: Stick around for a bit, if people want to chat. Any more questions? Let's thank Sina again.

>> Sina Jafarpour: Thank you very much, everyone. [applause]