>> Chris Burges: So we're delighted to have Sina join us today. He came here to hike, but he
agreed to give a talk as well. He's finishing his fourth year at Princeton. His Ph.D. advisor is
Rob Schapire, formerly Rob Calderbank; the advisors switched. And we're delighted to have him
here.
>> Sina Jafarpour: Thank you very much, Chris. Thanks everyone for coming. And I'm really
happy to see you again.
Today I'm going to talk about my internship this year, which was at AT&T. It was basically about
using sparsity in two different projects.
Let's see what we can do with them.
So it's about sparsity in question answering and classification. This is joint work with Srinivas
Bangalore, Howard Karloff, Taniya Mishra and Carlos Scheidegger, and they really helped me a
lot in this project. Before starting I should say these are ongoing projects. So
please feel free to stop me and ask questions at any time. Also make comments at any time
during the talk.
So we all know in machine learning that in many applications, actually in almost all applications,
the data that we have is high dimensional. It lies in a high dimensional vector space but it's
sparse. We know the bag-of-words model in IR or sparse decompositions in imaging.
In many cases this sparsity is well hidden. So we would like to learn this hidden sparse structure
and then use it as a feature space in order to be able to do better learning and, say,
classification.
This helps us remove overfitting problems and also generalize better. So this is something I'm
also going to talk about in this talk.
I'll talk about extracting and exploiting the hidden sparsity, the recovery of it, in
unsupervised and semi-supervised learning cases.
So we will have real datasets in the two projects, face classification and question answering,
and we'll see what we can do and what the barriers are at the current stage of the work.
So let's start with the face classification project. In face classification, we have a bunch of
movies. So let's say that we are Netflix users. We can get a lot of movies and we have
them.
And also let's assume that we have access to the Internet, so we have the IMDB cast lists. Let's
say for "Sleepless in Seattle" we have the cast list. We don't have any picture or anything related
to the cast list; we just know who is acting in the movie.
The goal of the project is then: suppose we have that for several movies. Now I give you a new
movie, "Sleepless in Seattle", and you try to classify the faces of the actors, the famous actors,
in this movie. And let's see what we can do.
So the first thing is we need some training data. Ideally we need some labeled training data. So
what we do is basically a way to get some sort of noisy but okay training dataset.
We have a bunch of movies here that you can see. We implement a fast standard face detection
algorithm, like [inaudible] Johnson, and then we do some standard face alignment method to
make sure that all the faces are aligned.
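Just to make this step concrete, here is a minimal sketch of the detection and alignment stage,
assuming OpenCV's Haar-cascade detector (a stand-in for the fast detector mentioned above, likely
a Viola-Jones-style detector); the crop size and the histogram-equalization step are illustrative
assumptions, not the exact pipeline used in the talk.

```python
import cv2
import numpy as np

# Haar-cascade frontal-face detector shipped with OpenCV (an assumed stand-in
# for the fast detector referenced in the talk).
CASCADE = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def detect_faces(frame_bgr, size=(48, 48)):
    """Return flattened, grayscale, rescaled face crops found in one video frame."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = CASCADE.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    crops = []
    for (x, y, w, h) in boxes:
        face = cv2.resize(gray[y:y + h, x:x + w], size)   # crude "alignment": rescale
        face = cv2.equalizeHist(face)                     # remove contrast differences
        crops.append(face.astype(np.float64).ravel())     # stack pixels into a vector
    return crops
```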
And as a result we have a bunch of faces detected from different movies. Then we look at two
movies that have just one actor in common. Let's say Up in the Air and [inaudible]. They have
George Clooney, and then we look at the pairs of faces in all these pairs of movies that have the
most similarity.
Something like this. For instance, these two faces we see that they are very similar to each
other, let's say with inner product similarity or Euclidean distance. Then we
conclude that with high probability these should be the same actor. Because
these movies have only one actor in common, we say that these should be George Clooney.
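As a rough illustration of this weak-labeling heuristic, here is a small sketch; the similarity
measure, the number of pairs kept, and all names are illustrative assumptions.

```python
import numpy as np

def weak_labels(faces_a, faces_b, shared_actor, top_k=20):
    """faces_a, faces_b: (n_faces, n_pixels) arrays of face vectors from two movies
    whose cast lists share exactly one actor, `shared_actor`."""
    A = faces_a / np.linalg.norm(faces_a, axis=1, keepdims=True)
    B = faces_b / np.linalg.norm(faces_b, axis=1, keepdims=True)
    sims = A @ B.T                                    # cosine / inner-product similarity
    flat = np.argsort(sims, axis=None)[::-1][:top_k]  # most similar cross-movie pairs
    rows, cols = np.unravel_index(flat, sims.shape)
    # Both faces of each highly similar pair get the shared actor as a (noisy) label.
    labeled = [(faces_a[i], shared_actor) for i in rows]
    labeled += [(faces_b[j], shared_actor) for j in cols]
    return labeled
```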
So this is a noisy way by which we can collect labeled training examples. Now, after that we
need to classify. The next step is basically face classification. We look at the lazy
training paradigm.
So first let's assume that each face is converted to a vector. So these are all
vectors. These are all faces of George Clooney, Peter Sellers, Ingrid Bergman and Bruce Lee.
And, for instance, this part of the matrix has a lot of faces of Bruce Lee that we have
captured.
Each vector is there. And then somebody gives us a new face and wants us to classify it based
on the faces that are here. What we can do is exploit a known fact in face classification and
face analysis, which tells us that every face of Bruce Lee can be represented as a sparse
linear combination of the faces we have here, plus some noise, of course.
So we would like to, in a way, exploit this sparsity and classify this face based on that.
So let's be a little --
>>: In the combination of the pixels?
>> Sina Jafarpour: Linear combination of the pixels, exactly.
>>: So the driver's license people have another representation where they look at the distance
between the eye and the other features. [inaudible] more than --
>> Sina Jafarpour: Right.
>>: So why not use that as a feature set?
>> Sina Jafarpour: We can actually do that, too. I'll talk about that in some sense. The point is
that here the feature spaces are much more different. In that case we have a very restrictive,
exact set of features.
Here it's different. So assuming linear sparsity is basically a more robust assumption in this
case.
But I'll actually very shortly talk about that, too. Yep. And so let's formalize the problem a
little bit. We have a face image, let's say square root of M by square root of M pixels, and we
convert it to a vector in R^M, so it's M-dimensional.
Then we have a matrix. The matrix has (number of actors) times (number of faces per actor)
columns, and each column has M rows, because we converted each face to a vector.
Then our goal is to find a sparse linear combination of these columns that approximates
this vector F.
So this brings us actually to the classical problem of sparse approximation and sparse recovery.
In sparse recovery, which is actually the problem of compressed sensing and also of a lot of
graphical modeling and other related work, we have a vector X star, which is sparse. So it has
only relatively few nonzero entries. We also have a matrix A, which for now let's assume is a
general matrix. I'll talk about the properties of it that we'll actually exploit.
We also have noise. We're given a vector F and somebody tells us it's A, the matrix that we have,
times the sparse vector X star, plus noise. And our goal is to recover or approximate X star from
the measurement vector F.
So, of course, to begin let's forget the noise. What we want is to solve this optimization
problem. The L0 norm counts the number of nonzero entries, so it says we would like to have
the sparsest vector X such that AX equals F, if we forget E.
But unfortunately this problem is NP hard. So during the last, actually I should say even 50 years
or even more, people started to think about ways to approximate or relax the problem.
The basis pursuit algorithm, which was formally introduced by Chen and Donoho but was known even
before that, says: now that we cannot solve the L0 minimization, let's look at L1
minimization. L1 is a convex norm, so this is a convex optimization program, but it is also
the closest convex norm to the L0 norm that we can have.
This is the proposed basis pursuit algorithm. And of course because we have noise we have to
consider noise, too. So the Lasso, which was introduced by Tibshirani and is used very
widely, says that we can use the L1 minimization, but with these constraints. Of course we
don't have exactly this linear equality, but we want the residual to be small, smaller than the
norm of the noise or a constant times that.
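As a concrete example of this relaxation, here is a minimal sketch using scikit-learn's Lasso,
which solves the penalized (Lagrangian) form of the same problem; the regularization weight alpha
stands in for the noise-dependent constant and is an assumed, data-dependent choice.

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_code(A, f, alpha=0.01):
    """Approximately solve: min ||x||_1 subject to ||Ax - f||_2 being small,
    via the penalized form min (1/2n) * ||Ax - f||_2^2 + alpha * ||x||_1."""
    model = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
    model.fit(A, f)
    return model.coef_      # the (hopefully sparse) coefficient vector x
```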
But for this particular problem it seems that another linear program might be a better solution.
Here the assumption is that the noise vector is also sparse. We'll see why in the face
classification setting. But if we assume that the noise vector is also sparse, we can integrate
them together. So the optimization becomes: minimize the L1 norm of X, where X now becomes X and
E together. We concatenate them together.
And therefore we also concatenate the identity matrix with A, and we would like that product to
be F, the vector F that we have. So this is the basis pursuit denoising variant recently proposed
by Wright & Ma for these kinds of problems where we can have a sparsity assumption for the noise
too. And there are some theoretical results about that, but I'm not going to talk about them.
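A minimal sketch of this robust variant, reusing the sparse_code helper from the earlier sketch:
stack [A | I], solve one L1 problem over the concatenated unknown [x; e], and split the result.
The scaling of the identity block is an assumption.

```python
import numpy as np

def robust_sparse_code(A, f, alpha=0.01):
    """Model the error as sparse too: f ~ A x + e, with both x and e sparse."""
    m, n = A.shape
    B = np.hstack([A, np.eye(m)])          # extended dictionary [A | I]
    z = sparse_code(B, f, alpha=alpha)     # z = [x ; e], via the earlier Lasso sketch
    return z[:n], z[n:]                    # sparse code x and sparse error e
```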
>>: I is.
>> Sina Jafarpour: I is, exactly. So if you look, we concatenate noise here, too, so it's actually --
>>: [inaudible].
>> Sina Jafarpour: Yeah.
>>: In the [inaudible] application or face application? What evidence supports the fact that noise
would be sparse?
>> Sina Jafarpour: So basically the first thing that I should say is that we tried both of them.
This one worked better. And the second thing is basically the local regularity that faces have:
the pixels that are close to each other, we expect them to be close to each other in value -- the
eyes or the noses and these things.
So for most pixels the differences do not matter so much, and that basically maps to sparsity of
the noise. But still, we should see empirically which one works better.
Mostly, empirically, the assumption that the noise is sparse works better in this case. And I
should also refer you to the paper of Wright & Ma. They have some sections explaining the
locality argument that I mentioned in detail.
>>: The sparse is the images in the database that are close enough to attest?
>> Sina Jafarpour: To attest, yes. Exactly. So this is then the justification that I also wanted to
have about why we are actually using L1 minimization. Then we say L1 minimization, because
you might actually say why not L2 minimization. We can actually have an explicit solution for L2
minimization.
So let's see what's happening. This is the line AX equals F that we have, and this is the unit
L1 diamond, the set of points with L1 norm of X equal to 1. We grow this diamond until it hits
the line, and the same with the L2 ball. So let's see what happens: this is the solution of the
L2 minimization, and this is the solution of the L1 minimization.
Here the diamond hits the line close to an axis, so the solution is relatively sparse. I'm just
using this for illustration, but it gives you the idea. The L2 intersection point is this one,
and it is not as sparse: it has both an X value and a Y value.
So this gives some sort of intuition for why we should use L1 minimization.
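The same intuition can be checked numerically. Below is a tiny, self-contained demo (the sizes
and the sparse ground truth are arbitrary choices) comparing the minimum-L2-norm solution of an
underdetermined system, which comes out dense, with the minimum-L1-norm solution, solved here as
a small linear program, which comes out sparse.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n = 15, 40                                  # underdetermined: 15 equations, 40 unknowns
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[[3, 17, 29]] = [1.0, -2.0, 0.5]         # sparse ground truth
f = A @ x_true

x_l2 = np.linalg.pinv(A) @ f                   # minimum-L2-norm solution (dense)

# min sum(u)  s.t.  A x = f  and  -u <= x <= u   (so u_i >= |x_i|)
c = np.concatenate([np.zeros(n), np.ones(n)])
A_eq = np.hstack([A, np.zeros((m, n))])
A_ub = np.block([[np.eye(n), -np.eye(n)], [-np.eye(n), -np.eye(n)]])
b_ub = np.zeros(2 * n)
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=f,
              bounds=[(None, None)] * n + [(0, None)] * n)
x_l1 = res.x[:n]

print("nonzeros: L2 solution =", int(np.sum(np.abs(x_l2) > 1e-6)),
      ", L1 solution =", int(np.sum(np.abs(x_l1) > 1e-6)))
```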
So then this is the face classification algorithm, if I want to say what it does.
We first normalize the columns. That's just to make sure that we're not changing the energy too
much.
Then we use the basis pursuit denoising algorithm to find the sparse set of columns representing
F.
Then we need to look at two things. One of them is: if you give me a face, how should I know
that it's actually a linear combination of these guys? For instance, if you give me a face, how
do I know that it's Bruce Lee? For this we look at how concentrated the recovered vector is.
R is the number of actors we have. If this value is close to 1, it says that the vector is
concentrated on just one of these actors -- you have actor 1 to actor R. If it's close to 0,
then the vector is very dense; it has elements on all of them.
So we say that if the value that we get is less than some threshold that we set a priori, we
reject: we say this is none of the actors.
But if it's higher than that, then you can either just look at this value S for the different
actor sets, or you can simply look at the residual. This is what we did. We looked at the
residual, F minus A restricted to one actor times the corresponding coefficients, for the first
actor, the second one, the third one, and picked the one that has the minimum residual value.
So this is the procedure that we have.
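Putting the pieces together, here is a sketch of that decision rule. The concentration score
below follows the sparsity concentration index of Wright et al.'s sparse-representation
classifier, which appears to be the quantity described above; the rejection threshold tau is an
assumed parameter, and robust_sparse_code is the earlier sketch.

```python
import numpy as np

def classify_face(A, actor_of_column, f, tau=0.3, alpha=0.01):
    """A: (m, n) dictionary of training faces; actor_of_column: length-n array of
    actor ids, one per column; f: the new face vector. Returns an actor id or None."""
    A = A / np.linalg.norm(A, axis=0, keepdims=True)      # normalize the columns
    x, e = robust_sparse_code(A, f, alpha=alpha)          # sparse code + sparse error
    actors = np.unique(actor_of_column)
    R = len(actors)
    mass = np.array([np.abs(x[actor_of_column == a]).sum() for a in actors])
    # Concentration score: ~1 if the coefficients sit on one actor, ~0 if spread out.
    sci = (R * mass.max() / (np.abs(x).sum() + 1e-12) - 1) / (R - 1)
    if sci < tau:
        return None                                       # reject: none of the actors
    residuals = [np.linalg.norm(f - e - A[:, actor_of_column == a] @ x[actor_of_column == a])
                 for a in actors]
    return actors[int(np.argmin(residuals))]
```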
And I should mention, because I'm going to show you some comparison stats, the previous state of
the art, basically the thing that people usually use for face classification: SVM, and SVM with
some distance learning, which is the work of Weinberger et al. It is again based on the
similarity of the faces, the thing that we mentioned: a new face, if it's a face of Bruce Lee,
will be very similar to the training set faces that we have.
So we use linear kernels, which is usually fine, but we can also try to learn the distance
metric. We can try to learn a transform such that we will have a better representation of the
vectors.
>>: L is global?
>> Sina Jafarpour: L is global, yes. And Weinberger proposes an SDP program for learning this,
relating it to the optimization of support vector machines. But even though they provide a
specific algorithm for solving it, it is usually slow.
So the drawback is that the distance learning takes time, but it's global so we can do it once.
And the good thing, as we'll see, is that it does somewhat better.
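For reference, a minimal sketch of the plain SVM baseline (linear kernel on the raw pixel
vectors, via scikit-learn); the distance-learning variant of Weinberger et al. is not reproduced
here, and the feature-scaling step is an assumption.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

def train_svm_baseline(X_train, y_train):
    """X_train: (n_faces, n_pixels) flattened face vectors; y_train: actor ids."""
    clf = make_pipeline(StandardScaler(), LinearSVC(C=1.0, max_iter=10000))
    clf.fit(X_train, y_train)
    return clf
```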
>>: Can I ask a question two slides back?
>> Sina Jafarpour: Yes, sure. Uh-huh.
>>: So the thing you output, you're not redetermining the mass vector for just the AI, you're using
the vector you had determined once.
>> Sina Jafarpour: Uh-huh.
>>: So seems like if you have sort of like a lot of correlations between these -- like if there's two
actors in front of a green background, kind of random luck whether it picks up the green from
there or another.
>> Sina Jafarpour: So we removed the background and everything like that. So what we have is
basically just the face. Just the face. And also we removed the contrast and everything related
to that, yep.
>>: Black and white or typical --?
>> Sina Jafarpour: In this experiment it was black and white.
>>: Can you rescale?
>> Sina Jafarpour: We rescaled, and did the rotation alignment to make sure. All of these
actually introduce noise but also help in the next parts.
>>: Thought about using other bases? Pixels seems --
>> Sina Jafarpour: I tried wavelets. The results with wavelets were not so good, and I still
don't know exactly why. But comparing to the pixel domain, the wavelet domain was not so good.
>>: Also, work by [inaudible], they're doing for [inaudible] for other image classification, they
use kind of patches, try to build the image out of patches of images from the domain, sounds
kind of, for example --
>> Sina Jafarpour: Right. That's a good point, too. The thing is that we want this ultimately in
an online setting. We want to do these classifications in an online case. If we have patches,
can we do them also efficiently online? I think if we can, then that's worth trying, and I'll be
happy to talk to you about that at some point afterwards.
>>: The state of the audience [inaudible].
>> Sina Jafarpour: So these were just very initial experiments we did. We used four actors and
then 80 faces for each of them. Then we used cross-validation, and we looked at the
cross-validation errors for the actors.
This is the L1 minimization we use, the basis pursuit denoising. That's distance learning, so we
mapped the data from the pixel domain to the learned distance domain, too. This is SVM and this
is distance-learning SVM. As we can see, the results in these experiments are relatively
promising, but one reason for that, I should say right now, was that we had noise.
The dataset we had was noisy, and that was a problem that SVM was facing. When we tried other
state-of-the-art classification datasets the results were much closer, but because of the noise
here, this algorithm turned out to be more robust.
Then we tried in the next experiment a larger set of actors. Here we can see the results.
Something that I have to mention is, for instance, this one, this actor had very large classification
error.
We looked at the data, and the training dataset for that actor was very noisy. So this is
something I'll talk about later, the issue of noise in obtaining the training datasets.
But the message of these two slides was that this algorithm, even though much slower than SVM,
was more consistent and more robust.
The thing we actually focused on was "Sleepless in Seattle"; I'm using this movie as a special
case. From this movie, this is a frame of the movie, let's say we captured about 7,000. These
are the confidences of the face classification algorithm.
And this is a sorted version of that, so this is just a distribution of the confidences. For
instance, you can see that if you take the value .6, then we'll have a value here.
>>: Identified one actor or --
>> Sina Jafarpour: Just saying whether this frame is either one of those actors or another one.
So I'm talking about the classification accuracy in the next slide.
As we mentioned, we then took one confidence value and looked at the classification accuracy for
this particular one. We looked at it manually, so even in these numbers there might be some
error because of my eye, because we didn't have labels. But I just wanted to show you some
examples, some random faces, so you can look at the results. For instance, this is [inaudible]
and this is [inaudible], and this one is misclassified, although I think this one.
The next thing, going back to that question, is making this algorithm faster. So we have some
suggestions for making it faster. One is running SVM first. If the support vector machine has
high confidence, then we're okay, we say SVM is doing a good job; otherwise we run our algorithm.
This makes the overall running time much faster. The other thing is we don't have to solve the
optimization exactly.
That's another suggestion: we can just solve it up to some number of iterations, not up to a very
tight accuracy value. And the other thing is we can exploit temporal coherence. If frame 1 and
frame 3 [inaudible], then with high probability the frame in between is also the same, because in
video we have this temporal coherence. So we can use that in the classification.
Doing that, the classification accuracy drops a little bit, but not so much. So in our current
version of the classifier we incorporate these three ideas. And we'll be happy to have any other
suggestions for them.
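A rough sketch of how these three speed-ups could fit together, reusing the earlier sketches: the
SVM acts as a fast first pass, the sparse solver (whose iteration count is already capped via
max_iter in the Lasso sketch) is the fallback, and a simple temporal-smoothing pass fixes
isolated disagreements. The confidence threshold, the multi-class SVM assumption, and the
smoothing rule are all illustrative.

```python
import numpy as np

def classify_frames(frames, svm, A, actor_of_column, svm_conf=1.0):
    """frames: iterable of face vectors from consecutive video frames."""
    labels = []
    for f in frames:
        margins = svm.decision_function([f])[0]       # assumes a multi-class SVM
        if np.max(margins) > svm_conf:                # SVM is confident: accept its label
            labels.append(svm.classes_[int(np.argmax(margins))])
        else:                                         # otherwise fall back to the sparse coder
            labels.append(classify_face(A, actor_of_column, f))
    # Temporal coherence: if frames t-1 and t+1 agree, make frame t agree as well.
    for t in range(1, len(labels) - 1):
        if labels[t - 1] == labels[t + 1] and labels[t] != labels[t - 1]:
            labels[t] = labels[t - 1]
    return labels
```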
So if there's no question, I'll go to the second part of the talk.
>>: Why did you choose [inaudible].
>> Sina Jafarpour: Here in this experiment, yeah, because it was much more distinctive. I did
experiments with Ocean's Eleven also, and The Mexican, and those -- currently I've used up to six
different actors, because in Ocean's Eleven it was George Clooney and other people. And even
then we had some consistency, but of course a little bit less than this. This was just one
example I wanted to show here since it's set in Seattle.
Now to the question answering. In question answering we all know what the task is: we're
interested in automatically answering questions, we have datasets, and we would like to do that
using machine learning and information retrieval as best we can.
This is actually a problem -- well, I'm happy Chris told me there is going to be much more
research in question answering; we have a [inaudible] here, and there is also a recent
application at AT&T which is QME. QME was introduced by Bangalore and Mishra. It uses a large
corpus of question answer pairs which is provided to them. In the corpus, questions are answered
by human experts, and they also categorize them. They split any new question that comes in into
categories. A static question is one that doesn't change too much over time; they try to
retrieve the answer from the corpus. Dynamic questions change very rapidly; for those they
outsource the information, so they use Web information to answer them.
So now, to answer the static questions, basically what they do is they use Bleu score and n-gram
TF-IDF. So they find the best match among the question answer pairs and output the corresponding
answer.
This works very well if the question is in the corpus.
So if we have a very close question in the corpus, with this method we can find it, because Bleu
score is robust. We have high precision but lower recall, especially when we try to use random
samples from this question answer set.
We observe that there is some low recall.
So our goal in this part of the talk is basically to find the relevant questions and hope that we
get a higher recall. I'll postpone ranking them and getting a larger F measure to some other
time; I'll just talk about recall. To increase the recall, our approach was to expand the
questions. So we try to find a set of relevant words from the question and from the corpus of
question answer pairs that we have.
For instance, take the question "who is the president of China". Of course this is a very simple
question. We might expand it based on similarities that we'll get: for instance, for "president"
something like "leader", and "China" with "republic", something like that.
So to do this, we tried four different expansion methods.
Two of them are basically generative methods, LDA and linked LDA. The two others are
discriminative, SVD and linked SVD. Their approaches all use topics, but two of them are
generative and two are discriminative. So it's good to look at what they did on the
dataset.
>>: So you have a bunch of questions with their answers?
>> Sina Jafarpour: Yeah.
>>: Map your questions to something like this.
>> Sina Jafarpour: Yes.
>>: To their questions.
>> Sina Jafarpour: Yes.
>>: How many questions?
>> Sina Jafarpour: So the ones that I used were a subset -- I used 5 million. But there were
actually many more. So then this is the road map that we have. As Chris mentioned, this is the
question answer dataset that we have. We generated a co-occurrence matrix; I'll go through the
details of that.
But let's say the rows are the words of the vocabulary and the columns are question answer pairs.
This is just for the simplest approach that we can have.
And then we map them -- so every row of this matrix is a very sparse vector, and we map them to
some low dimensional topic space. So here again we have words, but here these are topics.
Then we treat them as vectors in this low dimensional space, and we look at the vector similarity
between them to find the closest words to a word, let's say "football", which is here in this
case.
And so, to start, we need to do some preprocessing. For preprocessing we did the following.
First, using NLTK, we removed the stop words, and we also did some [inaudible] here, some light
normalization, just to remove some of them.
The third step was spell checking. For spell checking we didn't have an open source spell
checker that we could use in batch mode.
So the approach that we used was: for every word that we have -- I'm sorry, this is WordNet -- if
WordNet accepts it, then we are fine. If WordNet rejects it, and this happens, for instance,
when WordNet is not updated so it doesn't have that word, we also look at the whole corpus. If
the whole corpus has this word several times, we again accept it.
Otherwise we reject it. And with this we could reduce the size of the vocabulary set
significantly. So after doing this preprocessing we now need to map the data to some lower
dimensional space.
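A minimal sketch of that preprocessing, assuming NLTK: stop-word removal, light stemming (my
reading of the normalization step mentioned above), and the vocabulary filter that keeps a word
if WordNet knows it or if it occurs often enough in the corpus; min_count is an assumed
parameter.

```python
from collections import Counter
from nltk.corpus import stopwords, wordnet
from nltk.stem import PorterStemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()

def build_vocabulary(tokenized_docs, min_count=5):
    """tokenized_docs: list of token lists (one per question-answer pair)."""
    counts = Counter(w for doc in tokenized_docs for w in doc)
    vocab = set()
    for word, c in counts.items():
        if word in STOP:
            continue
        # Keep the word if WordNet accepts it, or if it is frequent in the corpus.
        if wordnet.synsets(word) or c >= min_count:
            vocab.add(STEM.stem(word))
    return vocab
```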
So again, just reminding you about topics. We know that people talk about very few topics, so
our goal is to learn this word-topic matrix and then use the similarity, either cosine similarity
or [inaudible]. For some cases this was the better measure, and in other cases that one was
better.
>>: How do you choose the number of topics?
>> Sina Jafarpour: I chose them ad hoc. That was one of the very few parts of the project where
I just used trial and error.
The first method that we used was LDA. I'm just going to give a very short overview of it. LDA
is a generative approach. We assume that the document is a bag of words, and as a result every
word of the document is generated in an i.i.d. process.
Each topic is hopefully a sparse distribution over the dictionary that we have, and each document
is a sparse distribution over the topics. So, for instance, here this is the document that we
have. The topic is sports, let's say. It's a sparse distribution over the dictionary, so words
like football, basketball, rugby have high probability of being selected, and the other words
have less. And then if you look at this document, for instance, let's say the topic is biology,
which has something like live and distinct -- this document is a sparse distribution over all
topics, and it's choosing the topics biology, genetics and computation.
And then the last thing is the generation of the words. So we have these topics generated, and
we have these documents as sparse distributions over the topics. The process of generating a
word is the following: for each word we sample a topic from the document's distribution, and
then we sample the vocabulary word from the corresponding topic.
And that's the word that you have. Of course we only see the words; the words are the only
things we observe. So what we do is use posterior inference to learn the best possible topic
distributions and document distributions that match our model.
So this is just a very quick overview of LDA; we use the posterior probabilities to generate the
word-topic matrix.
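As a concrete sketch of this step, here is how one could fit LDA with gensim and read off a
word-topic matrix; the topic count of 1200 echoes the number quoted later in the talk, and the
other settings are defaults or assumptions, not the exact configuration used.

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

def fit_lda(tokenized_docs, num_topics=1200):
    """tokenized_docs: list of token lists, one per question-answer pair."""
    dictionary = Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary,
                   num_topics=num_topics, passes=5)
    # Rows of word_topic are vocabulary words, columns are topics.
    word_topic = lda.get_topics().T
    return dictionary, word_topic
```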
There's another thing that we have to exploit in some sense: the structure of the question answer
set. In the previous approach, the question words and the answer words were both treated in the
same way, so we didn't have any distinction between them. But here what we would like is to find
the question words that are most similar to the questions that we have.
So for this we changed that and looked at linked LDA. Linked LDA is basically a model that is
used to model, let's say, NIPS articles and their corresponding authors, or blogs and their
corresponding comments. So we use that here.
It says, again as before, that each document is a sparse distribution over topics. So let's say
we sample topic two. Topic two is now a pair: a sparse distribution over the question words plus
another sparse distribution over the answer words.
So the question and answer words are coming from two different vocabulary sets and the
corresponding topics have different --
>>: Why not just ignore the answers and model the questions, and map your incoming query to
the questions?
>> Sina Jafarpour: I'll get to that. The point was again the recall; that's the issue that we
have with recall. When we do that, the chance that we don't get any answer for them increases in
some sense.
But in the final results I have some analysis comparing these two. That's a good question,
actually.
The other two are based on the SVD. SVD is basically a discriminative approach. It takes the
co-occurrence matrix that we have. We said we are talking about very few topics, so hopefully we
must be able to approximate it by a very low rank matrix, a matrix that can be decomposed as the
product of these two matrices, where this dimension is the number of topics. So this is the
vocabulary-topic matrix and this is the topic-document matrix.
This just says that we want the best rank-(number of topics) approximation of this matrix, and we
know we can solve that using the singular value decomposition.
This is the generalization of the sparsity we had for the question answer vectors to matrices:
from L0 sparsity of vectors to its matrix analogue, which is rank.
Of course, again, we would like to make a distinction between question words and answer words, so
we can also generate a question-answer co-occurrence matrix. In this matrix, these rows are the
question vocabulary words and these columns are answer words, and the numbers say how many times
this word and that word co-occur in the corpus we have. Again, we can use a low rank
approximation of that.
So we use this matrix basically as our similarity matrix. So this is the vocabulary, this is the
topics; and here this is the question vocabulary and, similarly, topics.
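A minimal sketch of this LSA-style route, assuming SciPy: build a sparse co-occurrence matrix
(word-by-document, or question-word-by-answer-word), take a truncated SVD, and look up the most
similar words by cosine similarity in the low-rank topic space. The rank k and all names are
illustrative.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def topic_embeddings(cooc, k=300):
    """cooc: sparse co-occurrence matrix, rows = vocabulary words, cols = documents
    (or answer-vocabulary words). Returns one k-dimensional vector per row word."""
    U, s, Vt = svds(csr_matrix(cooc, dtype=float), k=k)
    return U * s                          # each row: a word's topic-space embedding

def most_similar(word_vecs, vocab, word, topn=5):
    """vocab: list of words in the same order as the rows of word_vecs."""
    idx = vocab.index(word)
    V = word_vecs / (np.linalg.norm(word_vecs, axis=1, keepdims=True) + 1e-12)
    sims = V @ V[idx]                     # cosine similarities to the query word
    order = np.argsort(sims)[::-1]
    return [vocab[i] for i in order if i != idx][:topn]
```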
So I'm just showing you some examples of the things that we recovered, and then pointing to some
quantitative experiments on them. These are some of the topics that were retrieved by each of
them, just for illustration. They are not very informative, and most of the other topics, I
should say, do not match anything that we can interpret very well.
But if we go to something like the most similar words, for instance for the word we have here, we
see that LSA, and QLSA, which is the linked LSA here, provide some words that are much more
related, words that we could actually substitute for this word.
This is LDA, and I'll show more results in the next slides also: it gives words that sometimes
co-occur with the word "wind" that we have here. So just as we can see, LSA and QLSA might be
the more appropriate choices for our experiments.
But we should look more at the results here. These are again the results for other words that I
tried.
For many of them they were not so different, for instance for [inaudible], as we'll see for some
of them. But for most of them, for the case of question answering, SVD or LSA was more
appropriate. So these are just some illustrations that we have here.
From my point of view, LDA itself had the worst performance, and SVD, or LSA, had the best. But
those were just -- I'm sorry, this one, too. These are the results, and in many cases they're
very similar. The results were very similar. And so the --
>>: The LDA didn't do as well as the SVD?
>> Sina Jafarpour: We were surprised, I should say, because at the very beginning of the project
the main goal was to use LDA, and we did this and actually talked to David Blei to make sure we
were doing everything correctly. But finally, yeah, the simple SVD.
>>: I don't know, presumably the LDA community has compared to SVD for their own stuff and --
>> Sina Jafarpour: The point is that in those communities, most of the comparisons are just by
eye. They don't have very quantitative measures to compare them. This is something people
usually ask when David gives a talk. For that he's invented things like supervised LDA to
address it. But for a task like this they still don't have, as far as I know and as far as I
talked with him, a good way of measuring the consistency and the accuracy of this.
>>: Doesn't LDA determine how many clusters you should use automatically; is that right? You
still have to --
>> Sina Jafarpour: No, so this is a good question. For that case there is nonparametric LDA, and
the thing that we used here was not nonparametric. It was parametric, again with the same
values, the same thing.
If you do that, it tries to also learn the number of topics by putting another distribution on
it. But it doesn't change the results so much if you choose the number of topics yourself. The
number of topics that I chose here was about 1200 in these experiments; with fewer than that the
results were not good at all, but with more than that there was not so much difference.
Much fewer and much more, I mean.
>>: One advantage of LDA you could say is that you wouldn't have to figure out the 1200 number.
>> Sina Jafarpour: That's right.
>>: But as long as you have some validation in it, maybe --
>> Sina Jafarpour: Yeah, that's right. You're absolutely right. And the last thing -- so
basically coming back to Chris's question, we wanted to have a quantitative evaluation.
Previously Taniya and Srinivas distributed questions and asked people to manually label them and
give the relevant ones. But the problem was that the question space there was not the same as
the question space here.
Those were very simple questions that people usually ask, that came to their mind, whereas here
the questions have much more diversity. So this is the evaluation that we came up with, and I'll
be very, very happy if you have any suggestions about it.
So we separated the dataset into a training set and a test set. These are training questions and
the training answers corresponding to them. Let's assume this is a test question, and
corresponding to it there's a test answer that we only use for evaluation.
For this test answer, we look at the set of training answers that are similar to it using the
Bleu score -- not a very sharp similarity like the one Taniya Mishra and Srinivas were using for
exact retrieval; they use a looser threshold.
We find some similar answers here, and then we look at the similar questions corresponding to
them. So it's from 2D, let's say. And then we do the expansion, and we use [inaudible] TF-IDF
to do a retrieval here.
Then we look at the overlap of the retrieved set with the set of similar questions that are here.
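A small sketch of this automated recall evaluation, assuming NLTK's BLEU implementation: the
"relevant" questions for a test question are the training questions whose stored answers are
BLEU-similar (above a loose threshold) to the held-out test answer, and recall is the overlap
between that set and whatever expansion-plus-TF-IDF retrieval returns. The threshold, the
smoothing choice, and the tokenized inputs are assumptions.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def relevant_question_ids(test_answer, train_answers, bleu_thresh=0.3):
    """test_answer: token list; train_answers: list of token lists.
    Returns indices of training QA pairs whose answer is Bleu-similar to test_answer."""
    return {i for i, ans in enumerate(train_answers)
            if sentence_bleu([ans], test_answer, smoothing_function=smooth) >= bleu_thresh}

def recall(relevant_ids, retrieved_ids):
    """Fraction of the relevant question set recovered by the retrieval step."""
    return len(relevant_ids & set(retrieved_ids)) / max(len(relevant_ids), 1)
```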
So these were the results that we got. These are the relative improvements. Again, we got
better results for SVD. And the other thing I have to emphasize here is WordNet -- I'm sorry
again, this is WordNet.
With WordNet we even had a fairly large negative improvement here. So this says that the
similarities that WordNet provides are not a very appropriate similarity measure for our case,
for question answering; their spaces are different. So this is something that we found. And
with these other methods we had improvements. These are relative improvements.
>>: Can you go back?
>> Sina Jafarpour: Sure.
>>: What are you evaluating?
>> Sina Jafarpour: So I split this set into, I'm sorry, a training set and a test set. Then for
each test question I look at its test answer.
>>: I want to know what -- you can measure the question answer and evaluate it by just seeing
how many answers you get right.
>> Sina Jafarpour: So that's the whole thing: how do you define "right", basically, here?
>>: So I guess [inaudible] in the question? Small test set.
>> Sina Jafarpour: That was the thing that was done. The point is that this set is relatively
large. If we wanted to distribute this relatively large set of question answer pairs to people
-- currently we haven't. This is something we'll definitely do. But during the internship we
didn't have the ability to do that. So this is an automated version of that, exactly.
And the same thing basically -- so, Taniya Mishra, when we had the comments, we had to look at
just the questions and see what we could do there, as Chris also mentioned.
So we did the same thing here. Here the similar questions are the questions that the Bleu score,
with the looser threshold, says are relevant.
And then we do expansion and we do the TF-IDF similarity also. We find a set of retrieved
questions and then we look at the overlap with the set of similar questions.
And again, here the results are much more stable than for the previous question set, which says
that the Bleu score with the looser threshold is consistent with these question expansion
methods.
And again we got a negative improvement with WordNet. We also tried expanding with just the
questions, without the answers, and we got relatively the same set of results.
But again, I should say this was just an experiment that we did at that time because of the time
limit. In order to evaluate these methods more, we need to have some human evaluation.
>>: Kind of like Mechanical Turk.
>> Sina Jafarpour: We actually thought about that, too. That came up toward the end of my
internship, so we haven't done that yet. But I guess that's something we'll finally do in that
way. So we'll talk about that. That's it, actually. Very good.
So just to wrap up, I'm sorry to go over time. We looked at basically two projects, one of them
question answering, one of them face classification, and the way we can use sparsity to do
classification there.
Several challenges remain, as all of you mentioned during the talk. In face classification, one
of the main issues is denoising the training set. For the training set we looked at the
similarity between faces from different movies that shared only one actor in their cast list.
The question is: we know that the classifier is doing relatively well, not perfectly. When the
classifier classifies something, can we replace bad training examples with the ones that we get
from these classifications on newer movies, to make the training set cleaner? This is something
we need to do and we need to see how it works, but we should be aware of overfitting here. And
of course, if we want to have a final result, we need to do more experiments with larger actor
sets.
The largest set I've used currently is six actors, but of course we need to go beyond that. For
the question answering, better experiments are required; this is something that we will
definitely do. And the other thing that is interesting is to try to do the ranking after this
information retrieval step, at least, and see how much we can increase the F measure, which is
the final goal of the task.
And I'll be very happy if you have any other suggestion, any comments about that and thank you
very much. Thank you. [applause].
>>: Will you be around for a little bit longer?
>> Sina Jafarpour: Yes.
>>: Stick around for a bit, if people want to chat. Any more questions? Let's thank Sina again.
>> Sina Jafarpour: Thank you very much everyone. [applause]