>> Misha Bilenko: Okay. Thanks for coming. We're welcoming Abhishek Kumar today
from University of Maryland, College Park, where he's working with Hal Daume' on a
number of interesting topics. And today he'll tell us about the work he did at IBM T.J.
Watson with Decaus and Uwani [phonetic] on matrix factorization, which we all like and
use.
>> Abhishek Kumar: Thanks, Misha. So I'll talk about algorithms for separable
nonnegative matrix factorization. And here is the outline of the talk.
So I'll start with introducing the problem of NMF and the separability assumption. And
then I'll talk about algorithms for near-separable NMF. In the next part I'll talk about
extensions of these algorithms to other loss functions like [inaudible] loss or [inaudible].
And in the end I'll talk about some other work that I have done.
So let me start with introducing the problem of NMF. So the problem is that
we are given a nonnegative matrix X here. And the goal is to factor it into two matrices,
W and H, so that the product W times H is close to X in some distance measure. And,
again, W and H are also nonnegative. So all the entries of these matrices are greater
than or equal to zero.
The small R here is the inner dimension of the factorization. It's usually much less
than both M and N.
The nonnegative rank is the smallest inner dimension for which we have an exact
factorization. So there is no error in the factorization here. So the smallest R for which
we have no error in the factorization, that's called nonnegative rank.
And an important point about the nonnegative rank is that it's always greater than or equal
to the usual linear rank and always less than or equal to both M and N.
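To fix notation, the problem as described here can be written as the following optimization together with the definition of the nonnegative rank (a sketch of the formulation, with the Frobenius norm standing in for "some distance measure"):

```latex
\min_{W \in \mathbb{R}^{m \times r}_{\ge 0},\; H \in \mathbb{R}^{r \times n}_{\ge 0}} \; \| X - W H \|_F^2 ,
\qquad
\operatorname{rank}_+(X) \;=\; \min \{\, r : X = W H,\; W \in \mathbb{R}^{m \times r}_{\ge 0},\; H \in \mathbb{R}^{r \times n}_{\ge 0} \,\},
```

so that, as stated above, rank(X) <= rank_+(X) <= min(M, N).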
Now another point about NMF is that it can be non-unique. So even if you ignored the
permutation and scalings of columns of W and rows of H, NMF can still be non-unique.
So there can be multiple nonnegative matrix factorizations for a given matrix.
Now let's understand the motivation behind NMF with this small example here. So I've
shown the images from a database called Swimmer. And it had -- it has images of a
creature that is swimming. And it has four legs, and each leg can be in four different
orientations. So there are a total of 4 to the 4, that is 256, images. And each image is of
size 32 by 32.
Now, let's say we vectorize all these images and put them along the rows of matrix X.
So X is now 256 times 1024. And we do a low-rank factorization on X. So we factor it as
W times H.
Now, in this model every row of X can be seen as a linear combination of the rows of H.
And the combination coefficients are given by the corresponding row of W.
The rows of H are called basis if they are linearly independent. And if they are not
linearly independent we call them topics or dictionary.
Now, suppose we have nonnegative W here. So what does that mean? That means
that we are combining the rows of H [inaudible], so we are adding the topics to generate
every data point. So it gives a very clear, nice interpretation to the model.
On the other hand, if you don't have nonnegativity on W and H, in that case we can get
a low-rank factorization using SVD as well, which is optimal in terms of minimizing the
Frobenius norm of the error between X and W times H.
So let's see what basis we get using SVD here. So this is the basis we get using SVD.
And, again, each basis image is a row of matrix H which I have reshaped to this [inaudible]
image.
So here you can see that topics are not very interpretable. Now, as soon as we put
nonnegativity constraints on W and H, we get interpretability in topics. So you can see
that every image in the database here can be generated by adding all these topics. And
we also get sparsity in the representation because -- yeah?
>>: [inaudible] your goal was to use the matrix --
>> Abhishek Kumar: Right.
>>: [inaudible].
>> Abhishek Kumar: Yeah. [inaudible] is this. So minimize the Frobenius norm
between X and WH.
So we also get sparsity in the representation because -- because of nonnegativity. So
in this example, you can see that we can generate any image in the database by using
at most five topics from this set of 17 topics. So we get a sparse W. And which is again
useful in some applications.
Now, NMF has various applications. And one of them is topic modeling. In topic
modeling we are given a corpus of documents. And the goal is to learn prominent
topics from this corpus.
Now, let's say we represent our corpus as a matrix X here, where each row is a
document. So X is a document-by-word matrix. And we factor it as W times H. So W is
document-by-topic, and H is topic-by-word.
So here every document is generated by iteratively combining the topics, the rows of
matrix H. So here I've shown some topics that are recovered using NMF. So again
each topic is a row of H. And you can see the first topic is corresponding to the Olympic
games.
Second topic corresponds to a space mission, and third corresponds to stock market, I
guess, and so on.
So NMF does give us reasonable topics.
>>: I have a question.
>> Abhishek Kumar: Yes.
>>: So you're probably going to answer this later, so feel free to just say I'll do this later.
So given H, which are your topics and, like, some new document that wasn't in the
document word matrix, it would be nice maybe to say, hey, what topics is this new
document in?
>> Abhishek Kumar: Right. Right.
>>: That means it's sort of like computing a little bit of W for this new one.
>> Abhishek Kumar: Exactly. Yes.
>>: But it's not part of the overall process because the topics are fixed at this point, let's
say.
>> Abhishek Kumar: Yeah. Yeah.
>>: So what is -- you're going to talk about that computation? Because that
computation is not necessarily simple in my mind.
>> Abhishek Kumar: That's true. So if you know that the new document belongs to
these topics, then it's --
>>: [inaudible] as much as any of the other documents. I mean --
>> Abhishek Kumar: Yeah. I mean, if there is a novel topic there, then that's a
problem. Otherwise, it's --
>>: All right. Let's assume it's from the same distribution.
>> Abhishek Kumar: Yeah. If it is from the same distribution, then you can basically
solve a problem which minimizes the distance between X and W times H where you
know H and you know the vector X and you solve for W.
>>: Then you also need nonnegative W.
>> Abhishek Kumar: So you solve a nonnegative least squares problem with unknown W.
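To make that concrete, here is a minimal sketch (assuming NumPy/SciPy; `H` is the fixed topic-by-word matrix and `x_new` is a hypothetical term vector for the unseen document) of the nonnegative least squares step just described:

```python
import numpy as np
from scipy.optimize import nnls

def infer_topic_weights(H, x_new):
    """Given a fixed r x n topic-by-word matrix H and a length-n term vector
    x_new for an unseen document, solve min_w ||H^T w - x_new||_2 with w >= 0."""
    w, residual_norm = nnls(H.T, x_new)
    return w, residual_norm
```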
>>: And if that fact -- you know in a weird way it's kind of -- but you can spend all the
time in the world finding some topics. But then maybe I'm going to have 20 billion
documents. So I don't want to spend all that much time per document figuring out what
topics this document is.
>> Abhishek Kumar: Yeah. I mean, in my experience, it's definitely faster than latent
Dirichlet allocation -- if you have NMF, that's faster than LDA. But again, you have to
parallelize and do all sorts of optimizations to make it work.
So other applications of NMF are in the hyperspectral unmixing, blind source separation
and also in a few unsupervised feature extraction problems in computer vision.
So it was shown in 2009 that -- yes?
>>: I'm sorry. I'm missing some intuition. So why -- why does NMF give you that more
sort of topic-oriented, interpretable thing than SVD?
>> Abhishek Kumar: Here?
>>: Because just not being able to go negative forces it somehow? Like what's
the intuition there?
>> Abhishek Kumar: The intuition here is that when you have nonnegativity you're only
allowed to add the topics.
>>: Right.
>> Abhishek Kumar: So every topic in some sense is forced to pick certain parts of the
images from the database.
>>: Okay. Whereas in SVD they can trade each other off?
>> Abhishek Kumar: Yes, and SVD also has negativity, so you can basically trade it off.
And now because you are forced to pick certain parts of the images, so you pick those
parts which are appearing in most of the images. So that's why you get the topics here.
>>: So for a given -- for a given number -- for a given R, will the -- the SVD will give you
a better fit to your X, but the NMF will give you this more interpretable thing?
>> Abhishek Kumar: Yes. Yes. That's true.
>>: So is it -- so what is -- so there might be multiple ways to do that, right? You could
do an L1 regularization, you could do these kinds of positivity constraints. So what
is the drive there to use this strategy as opposed to, let's say, L1 or projection onto some --
>> Abhishek Kumar: With L1 you will not get nonnegativity. You can get like small -- you
can get several zeros but you will also have negative elements. So I guess this
type of interpretability you will not get, even if you use L1 regularization on W and H.
>>: And does it [inaudible] to pay a price compared to doing, for example, SVD with
one L1 regularizer?
>> Abhishek Kumar: SVD with L1?
>>: Yes. Let's say do this [inaudible] L1 regularizer because I'm not satisfied with those
topic?
>> Abhishek Kumar: Right. I think SVD with L1 will be nonconvex. So, again, SVD is also
nonconvex. But you can get a global solution because it's an eigenvalue problem.
But if you put an L1 penalty in the objective then, I guess, you will reach some local
minimum. So the first problem is it will not be a global optimum. And the second is I don't
think you will get interpretability like you get with NMF. Yeah. Yeah.
>>: So then you have to -- okay. Thank you.
>> Abhishek Kumar: Yeah. So in 2009, it was shown that the problem of exact NMF is
NP-hard. So it also means that the approximate NMF
problem is also NP-hard.
Now, I'll mention a few problems that are solvable in polynomial time. So the first
problem that can be solved in polynomial time is if you treat the inner dimension of the
factorization as a constant, in that case the complexity is polynomial in N and M. But still
it's exponential in R, so it's not very practical.
>>: I'm a little confused as to why it wouldn't be a -- isn't R an input to this algorithm or
something? Like --
>> Abhishek Kumar: Yes. R is definitely an input.
>>: I see.
>> Abhishek Kumar: But the complexity here is exponential in R. So it's exponential
in the size of your data.
>>: Right.
>> Abhishek Kumar: So it's NP-hard. But if you treat R as a constant, then it's
polynomial in N and M. So the complexity is like N times M to the power R squared, two to
the power R. So it's actually doubly exponential in R. Yeah.
>>: That's a practical result.
>> Abhishek Kumar: So the second instance when the problem is solvable in polynomial
time is when the rank of your matrix X is 1 or you are seeking a factorization of inner
dimension 1.
So in that case if you do SVD on the data you directly get an NMF, because the top singular
vectors of a nonnegative matrix are nonnegative.
The third case is when the rank of your data is 2. So in that case, it can be shown by
a geometric algorithm that NMF can be solved in polynomial time.
Now because NMF is NP-hard, the most common approaches to solve this problem are
based on local search. So these approaches start by randomly initializing matrices W
and H. And then they fix one block of variables and optimize for others. And this is
repeated for a number of cycles until the approach converges to some stationary point.
But the problem is that these approaches are not guaranteed to converge to a global
optimum. So that's the danger with the local search methods. They can converge to
different [inaudible] depending on the initialization.
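For reference, one widely used local-search scheme of this kind is the Lee-Seung multiplicative update for the Frobenius objective; a minimal NumPy sketch (random initialization and a fixed iteration count are assumptions, not the talk's exact setup) is below.

```python
import numpy as np

def nmf_multiplicative(X, r, n_iter=200, eps=1e-9, seed=0):
    """One common local-search scheme: Lee-Seung multiplicative updates for
    min ||X - W H||_F^2 with W, H >= 0."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W = rng.random((m, r))
    H = rng.random((r, n))
    for _ in range(n_iter):
        # Alternate: update H with W fixed, then W with H fixed.
        # The multiplicative form keeps all entries nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

As the talk notes, different random initializations can land in different stationary points.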
So the separability assumption was proposed to make the NMF problem tractable. So the
assumption basically says that you have an identity matrix that is hidden somewhere in
your right factor matrix H. So in other words, some columns of H are coming from the
identity matrix.
So if you look at the matrix inside the bracket here, the first R columns are identity. And
the rest of the columns are given by H prime. And the columns of identity are hidden by
this permutation matrix P. And we don't know where they're hidden.
So in other words, the assumption is [inaudible] saying that we have some columns
of matrix X that make up matrix W. So the columns that are given by the index set A,
those are the columns that constitute W exactly.
Now, those columns of X that appear in W, we call them anchor columns. And here I've
denoted them by the index set A here. Yes?
>>: Maybe it's possible, maybe it isn't. Sometimes people use this notation for
conjoining matrices. Wouldn't it just make sense for me to draw a picture like, okay, this
is -- see this little block of matrix, that's the identity, see that block over there, that's not
the identity. These are zeros. Can you do that picture for us? Or are you about to do
that picture?
>> Abhishek Kumar: Yeah. That picture is, unfortunately, not there.
>>: There's a white board. And a pen.
>> Abhishek Kumar: Okay. So the picture looks like this. So you have this big matrix
X.
>>: Sure.
>> Abhishek Kumar: Is equal to -- and then here you have some columns that are
coming from identities. So maybe this column is from identity. And somewhere else
you have some other column. But you don't know where these columns are appearing.
So --
>>: Okay.
>> Abhishek Kumar: Yeah. And the rest of the columns can be anything. They can be
any nonnegative values. So when you multiply W with a column of the identity, that
column of W is transferred very cleanly to --
>>: [inaudible] selecting that row.
>> Abhishek Kumar: Yeah. So wherever you had this one appearing, so the second
column of W will be transferred to that position.
>>: You're saying a word can only belong to one topic; is that correct?
>> Abhishek Kumar: You are saying that in every topic there is at least one word that is
not appearing in other topics.
>>: In other topics.
>> Abhishek Kumar: Yeah. Yeah. So there is --
>>: [inaudible] uniquely bind to a single topic?
>> Abhishek Kumar: Yes. Yes.
>>: I see that --
[brief talking over].
>> Abhishek Kumar: Yeah. They are called anchor words. And it's kind of reasonable
to assume in topic modeling. Because if you take like two topics like sports or politics,
you can assume that there is one word in sport that is not appearing in politics and so
on.
>>: Right.
>>: But it's stronger here. So you're saying that H finally is [inaudible] --
>> Abhishek Kumar: H prime is not [inaudible].
>>: H prime is not constrained, right?
>> Abhishek Kumar: H prime can be anything. So it can be anything nonnegative.
>>: Okay.
>>: This button is very helpful by the way. So for your next talk you should have --
>> Abhishek Kumar: That's a good [inaudible].
>>: At least it was helpful to me.
>> Abhishek Kumar: Okay. So the columns where we have identity they are called
anchor columns. And you can see the rest of the columns are just combinations of
these anchor columns. Because H is nonnegative. So all the columns of X they are
[inaudible] by nonnegative combinations of these anchor columns.
Now, the problem of NMF is reduced to finding the extreme rays of the data cone here.
And the reason is that these extreme rays are given by the
anchor columns of X. And once you have the anchor columns you can get directly
matrix W and then you solve for H.
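To make this concrete, here is a small synthetic sketch (the dimensions and the Dirichlet choice are illustrative assumptions, not from the talk): under exact separability, the anchor columns of X are exactly W, and H is then recovered column by column with nonnegative least squares.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
m, r, n = 50, 5, 40
W = rng.random((m, r))                                 # nonnegative left factor
H_prime = rng.dirichlet(np.ones(r), size=n - r).T      # r x (n - r), nonnegative
H = np.hstack([np.eye(r), H_prime])                    # separable right factor [I | H']
X = W @ H                                              # exact separable data (no permutation, for simplicity)

A = list(range(r))                                     # anchor columns: here simply the first r
W_rec = X[:, A]                                        # the anchor columns of X are exactly W
H_rec = np.column_stack([nnls(W_rec, X[:, j])[0] for j in range(n)])
print(np.allclose(W_rec @ H_rec, X, atol=1e-6))        # True: exact factorization recovered
```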
So here is a geometric picture of the problem. So the points here are the columns of X,
the matrix X. And the red points are the anchor columns. And black points are rest of
the columns.
So in this picture, you can see that all the points, all the columns of X, they are inside
the conical hull of the red points.
>>: So can I say one more thing?
>> Abhishek Kumar: Sure.
>>: So the anchors -- there's one condition on the anchors which I'm just going to say is
really, really hard. There's some words that are only in one topic. Right? That's one
way to think about it.
>> Abhishek Kumar: Yeah. Yeah.
>>: But actually you're relying on another property of this, which is, in fact, you get the
whole identity matrix. So for every term, that term must occur in only one topic? There
must be some topic for which it only occurs -- doesn't occur -- or no, no, I'm wrong.
You're saying it's in the linear combination?
>> Abhishek Kumar: Yes, assumption is that in every topic you have at least one word
that is not present in others.
>>: The next question is are there other words or is it the combination of the anchor
words, the words that are shared? They can be represented in the space of the anchor
words too. Is
that --
>> Abhishek Kumar: Uh-huh. Yeah, that's also one way to say it.
>>: In a weird way, these words are sort of like -- like if you just knew these magic
words, then all the other words are just points in the space that is --
>>: That's right.
>>: -- determined by the basis of the single words?
>>: Yeah.
>>: Okay.
>> Abhishek Kumar: Now, suppose we cut this cone by a positive hyperplane. So the
hyperplane is given by P transpose X equal to C, and P and C are both positive. And
then we scale all these points so they lie on this hyperplane.
So the problem now reduces to finding the vertices of the convex hull of the scaled
points. Because the points that were anchor columns earlier, they are now vertices of
this convex hull. So we can also solve this NMF problem when we scale the data and
then find the vertices of the convex hull of the scaled points.
Now the question is, is it a reasonable assumption or not? And in topic modeling, it is a
reasonable assumption as we discussed, that there is at least one word in every topic
that is not present in others.
In hyperspectral unmixing, also, it's a reasonable assumption that as it has been pointed
out by some recent papers.
And in general image processing, the answer might vary from application to application.
In some of them it might be reasonable. And in the end I'll show one application where
it does give good results.
Now, so far we saw a [inaudible] separable problem. In the noisy problem we -- yes?
>>: [inaudible] it could be exponential [inaudible].
>> Abhishek Kumar: So for a real data set, yes, there can be many vertices. But you
want to pick the vertices which will give you the least factorization error.
So because you are seeking a rank-R factorization, you want to pick only a small
number, R, of vertices. So you want to pick those vertices that will give you the least
factorization error.
>>: [inaudible] you factor in the [inaudible] finding a vertex or convex hull [inaudible] for
one thing if you have a similar problem.
>> Abhishek Kumar: Yeah. I mean, so if you have N points at most you can have N
vertices. I mean, depending on the dimension of the space, that will vary. But it tells
that if you have to select R -- so in every iteration, you can select one vertex of the
convex hull. And if you have to pick all the vertices, that problem can be solved in
polynomial time.
But, again, you have to pick only -- you have to be selective in picking the vertices.
>>: So, again, I thought unless it's intuition because it feels like we sort of added some
very tiny and got a huge -- too much gain for it.
So, I mean, all we need is that there's at least -- for each topic there's one document --
at least one document that has one word that was the only time that that word
appeared. Like this seems like a really trivial thing to add to your set. And all of a
sudden it becomes not NP -- is that -- is that right? That's all you need?
>>: It's almost like a [inaudible] it's almost like --
>>: But you only need the word to occur once?
>>: No.
>> Abhishek Kumar: No. There is no notion of document in the matrix H. So H is topic
by words. So you actually have the topics that you are recovering, those topics have
this property that --
>>: Right. But I mean in your X, in your X, that word only needs to appear in one -- one
of the rows, right? I'm sorry, not that X. That topic.
>>: But it [inaudible] to be in the solution.
>>: Huh?
>>: I mean, [inaudible]. The solution has to be like that.
>> Abhishek Kumar: So, I mean, so the anchor words can appear in multiple
documents example.
>>: Right. But at a minimum it only has to appear once?
>> Abhishek Kumar: Yes. Yes. Of course.
>>: [inaudible] it won't be in this solution.
>>: It needs to occur at every document that has that topic.
>>: Multiple words for a topic, right?
>> Abhishek Kumar: For a topic there can be more than one anchor word, that's true.
>>: There can be. But we assume that there is at least -- that's where that identity
matrix thing comes from. The identity matrix is -- you know, it has 1, 1, in every column
and 1, 1 for every row.
>> Abhishek Kumar: But like --
>>: [inaudible] extra stuff.
>>: But isn't it true that the other words that will go across multiple topics, they have to
occur with at least one of the anchor words for each topic? For them to be discovered?
So that the words that are not exclusive, the words that are not anchor --
>> Abhishek Kumar: Right.
>>: For them to correctly get identified as having some weight to the topic, they have to
co occur with at least one of -- some of the basis words.
>> Abhishek Kumar: Of the anchor words.
>>: Yeah, of the anchor words.
>>: Is that true?
>> Abhishek Kumar: Uh-huh.
>>: So if I had only one document on -- so given all the documents on a topic, if only
one of them had my anchor word, that won't work? It needs to -- unless that document
happened to have every other word of that topic also.
>> Abhishek Kumar: Let me see. If you see -- if you take a particular column of this
matrix H, right, that column generates the corresponding words in the matrix X, right?
And that column is responsible for combining the columns of W, right? So in your
document, it is possible that the anchor word is not present. That's possible. But -- so
you have -- you have these columns of W that are combined, right, and -- yes, I think -- I
think you are right. So if the topic is present in a document, that anchor word has to be
-- has to be there. Yeah.
>>: I think you can sort of look at the diagram or M and N. I mean, one of the problems
of using M and N is that it sort of sounds like it's just numbers. But it's not, right? I
mean, M is documents, N is words.
>> Abhishek Kumar: M is the number of documents.
>>: And the N is words. And so then you can now label W and H. So W has
documents vertically and topics horizontally.
>>: Right.
>>: And the H has words ->>: Words by topic.
>>: By topic.
>>: Right.
>>: And so now you look at it, and basically any document that has topic 5 or something
like that ->>: Has to have that word.
>>: Has to have that word. That's the definition of it.
>> Abhishek Kumar: Yeah.
>>: That's what that inner product means.
>> Abhishek Kumar: Yes, every row H is a topic and -- yeah.
>>: It's like if you only knew what that word is, then you could just read off the answer.
That's what makes it sort of extreme within -- within the approximation of --
>>: I think what's hard is that you give this simple condition on the solution. It's hard to
figure how the space of document matrices which, like, have this property, right, that
what -- I get some data. Does it have the anchor property. Maybe you come up with a
simple test.
>> Abhishek Kumar: Yeah. So there is no test. But it has to come from the domain
knowledge that you have. So for topic modeling, I mean, it's a safe assumption to
have that if your topics are discriminative then you can say that -- you can assume
that there'll be at least one word that discriminates each one from the other topics.
But if you have very close-by topics, like in sports if you have like baseball in West
Coast and baseball in East Coast, then maybe they're not very discriminative, so in that
case maybe all the words are common between these two topics.
>>: And [inaudible] finding the best solution, assuming that, or is that if I give it X matrix
which might [inaudible] verify that could fail.
>> Abhishek Kumar: Yes. So it's a global solution under that structure.
>>: Okay. But you -- if I use the matrix which might have like let's say cook some
solution --
>> Abhishek Kumar: Some noise probably?
>>: Yeah.
>> Abhishek Kumar: Yeah.
>>: And there you can -- you can still tell me something about --
>> Abhishek Kumar: So if you have -- yeah. I mean, if you -- [inaudible] you always
had some noise. I mean, this assumption is not exactly satisfied. You will always
have some noise.
And in that case, also, it turns out that you get -- I mean, these results with the solution.
>>: Okay. And we will talk about that.
>> Abhishek Kumar: Yeah. Later I'll talk about it.
So, again, in the noisy problem we have this separable structure. And then we add
some noise to it. And, again, the goal here is to recover the anchor columns.
So next I'll talk about the algorithms for near-separable NMF. So let me start with
reviewing some recent activity that's been happening in this area. So Esser and
co-authors, they came up with this approach where they minimize the Frobenius norm of
X minus X times H, with a sparse penalty on H. So they take the
infinity norm of each row of H, and then the L1 norm of these infinity norms.
So this norm penalty induces many exactly zero rows. And the rows of H that are
nonzero, that ends up selecting the corresponding columns of X.
And they become our anchor columns. So in some sense it's a sparse regression
problem with a row-sparsity on H.
And now if we have a pure separable problem, in that case it can be shown that this kind
of optimization problem can recover all the anchor columns. So we recover all the
anchor columns of X by solving this if there are no repeated columns in X.
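As I read the description above, the Esser et al. objective can be sketched as the following row-sparse regression (lambda is a regularization weight; this is a sketch of the formulation, not a verbatim quote of their paper):

```latex
\min_{H \ge 0} \;\; \tfrac{1}{2}\, \| X - X H \|_F^2 \;+\; \lambda \sum_{i=1}^{n} \| H_{i,:} \|_\infty ,
```

where the nonzero rows of H index the selected (anchor) columns of X.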
Now, there are some drawbacks of this method. So one is that if there are repeated
columns or similar columns, then this method will fail. So it will not recover all the
anchor columns.
Another drawback is that it's not very scalable. So we have to reduce the
dimensionality of data and subsample the points to make it scalable. Yeah?
>>: So [inaudible] row sparsity from the H, the shape you draw there, if you look at the
identity and the columns [inaudible] column sparsity? I was thinking maybe column
[inaudible] makes more sense for H.
>> Abhishek Kumar: So the rows -- no, you want row sparsity on H because row
sparsity will give you some nonzero rows in H, right, and those nonzero rows will end up
selecting the corresponding column from X. So you can see that here we are
multiplying X with H, right?
So the rows of H that are nonzero, they will select the column of H -- column of X,
corresponding column of X. So let's say -- so let's say there are some rows -- so H is a
four [inaudible] now. And there are some rows that are nonzero, right? And here is X.
So these rows will select the corresponding columns. And they are now your anchor
columns.
So, again, H has a subset of rows that are nonzero. And these give you your topics, your
anchor columns.
>>: [inaudible].
>> Abhishek Kumar: So, yeah, I mean, you assume that there is some discrimination in
the topics. Depending on -- and there is at least one word that discriminates them.
So recently Arora and co-authors, they also came up with an approach where they view
this problem as minimizing the Frobenius norm again. But they scale all these points and
they solve the convex hull problem. So they [inaudible] the vertices of this convex hull
and that gives them the matrix W. But this approach is not very scalable.
More recently Bittorf and co-authors, they proposed a linear programming based method
where they minimize this norm, the norm of X minus X times H, under some
constraints on H. And this, again, is guaranteed to recover all the anchor columns if the
problem is pure separable.
And the optimization is done using parallel SGD, so the approach is very scalable. And the
noise performance is also shown to be better than the Arora paper.
Gillis and Vavasis, they also came up with an approach where they view this problem
as recovering the vertices of the convex hull. And they use the property that a strongly
convex function over a polytope is maximized at a vertex. So they evaluate this function on
all the data points and they pull out the vertices. And they incrementally select one vertex
in each iteration. And the approach is very fast. I think
one drawback of this approach is that it assumes the set of anchors is linearly independent.
So it assumes that the W matrix here is full rank. If it is not full rank, then it will not
recover all the anchor columns.
So when we started working on this problem, we did some preliminary experiments and
we made some observations. So one observation was that [inaudible] data is very
noisy. So if we take this TDT data set, a classic topic modeling data set, and it has 30
topics, so if I take an NMF with an inner dimension 30 factorization, in that case the ratio of
noise to data is about 90%. So 90% of your data is not explained by this
separable structure. So it's very important to choose the right noise model.
Further, when we are in high dimensions, it can happen that all your columns are
anchor columns. So in that case, it's -- the performance depends on which anchor
columns we choose. Because there are multiple options. And we are only allowed to
select only R number of columns. So we have to be selective in choosing the right
anchor columns.
>>: [inaudible] about this TDT data set, like where did the data come from, how many
documents are there?
>> Abhishek Kumar: Yeah. This is topic detection and tracking data. So it was, I think,
released in early 2000 or so. And it has 30 topics. And I don't know what the topics,
exactly, but the document -- number of documents are around 19,000.
>>: [inaudible].
>> Abhishek Kumar: Yeah.
>>: And the topics overlap, or are they quite distinct?
>> Abhishek Kumar: Topics are fairly distinct. Fairly distinct.
>>: How many words? How many unique words?
>> Abhishek Kumar: I think vocabulary size is 20, 25,000.
>>: A -- so what is A?
>> Abhishek Kumar: A is the -- so A is the index set of the anchor columns. So the
size of -- so the number of anchor columns, actually, is 30. So there are 30 topics.
>>: Small R.
>> Abhishek Kumar: Small R, yes, in factorization.
>>: This is by assumption, right, because the TDT set may not have that property,
right?
>> Abhishek Kumar: Yes. I mean, it's assumption that we have in topic modeling, so
we are going to test on it on this data.
>>: So if you were to run [inaudible] you would get a different ratio for every R?
>> Abhishek Kumar: That's true. If you increase R, then your -- then more and more
data will be explained by this structure. Yeah.
But, I mean, in that data set it's given that there are 30 topics. So they say that 30
topics are there.
So there's one more thing about the previous methods. They work by normalizing the
columns of X. So they go after the convex hull problem. And this is problematic in text
data because in text we are more used to using the TF-IDF representation of
documents, which L1-normalizes the rows of matrix X. And if on top of it we L1-normalize
the columns, then that breaks part of the TF-IDF structure. And in our experience,
that adversely affects the performance.
So these observations motivated us to look for something that directly solves the
conic hull problem. And there is no need to normalize the columns of X.
So again the approach is a conical hull method to recover the extreme rays of the data
cone. And it is inspired by a characterization of extreme rays and extreme points from the
Clarkson and Dula paper. I think this should be 1999.
So the approach recovers the extreme rays incrementally, one extreme ray in each
iteration. And here I've shown a current cone after three iterations. So we have added
three extreme rays to the cone. And let's say we want to expand this cone by a fourth
extreme ray. So to expand this cone, we project all the external points to the cone and
find normals to the faces. So you can see the green points are external to the current
cone, the blue points are internal to the current cone, and red points are the extreme
rays.
So we project all the green points to the current cone and that give us the normals to
the faces of the current cone.
Then we take a residual. So we pick a face and then rotate it outside the cone until it
hits the last point on that side. And that last point is our extreme ray. So in this case we
add this point as our anchor column. And this is a new extreme ray. And we expand
the cone.
So here is the algorithmic sketch of the method. So we start with empty anchor set and
we have matrix R that can be initialized to X. And the first step is selection step, where
we select the anchor column. So this is done by evaluating this criteria here. So we
evaluate R transpose times XI divided by Q norm of XI. And XI is the Ith column of X.
And then we evaluate this vector using the selection operator. And the selection
operator takes a vector and gives out a scaler. So we do this for all the columns of X.
And the column which gives us the maximum value, that is our anchor column. And we
add it to our set of anchors.
So the next step is after we have selected this anchor column and we have expanded
the cones, in the next step we wanted to project all the external points to the current
cone. And this is done by solving this nonnegative least squares problem. So we minimize the
[inaudible] norm of X minus the set of anchors times H with a nonnegativity constraint on H.
And then we compute the residuals. And every column of the residual is normal to some
face of the cone.
And now that we have the residual matrix, we go back again to the selection step, pull out a
new anchor column, expand the cone, and so on. So this is repeated until we have
selected the desired number of anchor columns.
Now, this is really a family of algorithms. Because depending on the selection operator
here and the Q norm, we can have different variants of the algorithm. So one variant is
when the selection operator is -- it takes a vector, and the output is a Jth element of the
vector. And suppose Q is equal to 1. So in the denominator we have one norm of XI.
So this variant can be shown to solve the separable problem exactly. So it recovers all
the extreme rays of this data cone.
And, of course, the performance under noise will depend on which element of V is
chosen by the selection operator. But if it's a pure separable problem, then you can choose
any element of V, and that will solve the problem.
Second variant is that selection operator looks like this. So it takes a vector V and takes
all the nonnegative elements and computes their L2 norm. And Q is equal to 2 here.
So we take 2 norm of XI.
Now, this variant is very similar to an orthogonal matching pursuit style of algorithm in
signal processing. It is like a nonnegative variant of OMP. And intuitively it picks a
column that minimizes the [inaudible] norm of the current residual matrix. But this variant
does not solve the separable problem exactly. So it can leave out one or two anchors.
But this variant actually performs very well under noise; the noise performance is
pretty good.
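A minimal sketch of this family, instantiated with the greedy (second) selection operator and a nonnegative least squares projection step (the helper names, column-by-column NNLS, and stopping rule are my own assumptions for illustration; the actual implementation is parallelized and exploits sparsity):

```python
import numpy as np
from scipy.optimize import nnls

def xray_greedy(X, r):
    """Greedy XRAY-style anchor selection: expand a conical hull one extreme ray at a time."""
    m, n = X.shape
    anchors = []
    R = X.copy()                              # residual matrix, initialized to X
    H = None
    for _ in range(r):
        # Selection step (greedy variant): score column i by the norm of the
        # nonnegative part of R^T x_i, normalized by the 2-norm of x_i.
        proj = np.maximum(R.T @ X, 0.0)       # column i is max(0, R^T x_i)
        scores = (proj ** 2).sum(axis=0) / np.maximum((X ** 2).sum(axis=0), 1e-12)
        scores[anchors] = -np.inf             # do not reselect earlier anchors
        anchors.append(int(np.argmax(scores)))
        # Projection step: nonnegative least squares of every column of X onto
        # the cone of the selected anchor columns, then recompute residuals.
        W = X[:, anchors]
        H = np.column_stack([nnls(W, X[:, j])[0] for j in range(n)])
        R = X - W @ H
    return anchors, H

# Usage sketch: anchors, H = xray_greedy(X, r); W = X[:, anchors]
```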
>>: Sorry. [inaudible] what is your measure of success? So do you have -- you have
this objective which has -- so that's the norm of N, your --
>> Abhishek Kumar: Yeah. We are minimizing the norm of N, uh-huh. So we minimize the
Frobenius norm of X minus W times H. So that's why we have this Frobenius norm
projection here.
>>: Do you have like some [inaudible]? I mean, because you mentioned first if the
problem is solvable I will find it. So we'll have [inaudible] equals zero. [inaudible] how
do I know is it good or --
>> Abhishek Kumar: Yeah. So the noisy problem actually we have shown empirically
that it's better than the previous methods. But on the formal guarantees we don't have it
yet. So we are working on that.
So if it's a noisy problem, then you want to show that if your noise is bounded, then your
output of the algorithm is also bounded.
So one advantage of the method is that the model selection is very easy. Because we
are incrementally adding these topics. So in every iteration we add one topic. So all
the previous solutions, they are contained in the current solution. So we can stop
whenever we have some external criteria that is met by the algorithm.
And we don't need to rerun it every time for different values of R. And we have a
scalable implementation which exploits the row sparsity of matrix H. And both selection
and projection step here, they are very easily parallelizable.
And as we'll see later, it compares favorably with previous methods on the same
problem.
So let's go back again to this Swimmer data. So I've shown you here some topics that are
recovered by local search methods. So on the left side you can see it's a bad run for
local search because it converged to a bad local optimum. And one of the topics here is
very heavily corrupted. On the right I've shown a good run where it converged to a
reasonably good local optimum. So this is the danger with local search methods: depending
on the initialization they can converge to a bad local optimum.
>>: Can we go back?
>> Abhishek Kumar: Yeah.
>>: So gosh. Is it really that much worse? Do I care?
>> Abhishek Kumar: Here?
>>: Yes. Like I would assume that that -- you know, whatever. Let's call that bad one
B. You know, I assume it's never used. Because if you actually use that topic to
generate any data, you get a piece of something that doesn't look like any example,
right?
>> Abhishek Kumar: Yeah. But, again, it has missed one topic. Because that topic
which should have been here, that is not --
>>: But who tells you how many topics there are?
>> Abhishek Kumar: No, but if you try to reconstruct your data set with these topics.
>>: Yes.
>> Abhishek Kumar: You will get some [inaudible] it's not an exact --
>>: If I threw in a few more topics, what would have happened?
>> Abhishek Kumar: If you threw in more topics, then probably these topics also will be
corrupted, because you want to -- because if you have more topics, then some of the
structure that is present in the current topics will move to the next topics.
>>: So it seems incredibly sensitive to something that you couldn't possibly know in
practice?
>> Abhishek Kumar: Yes. So in this -- I mean, with local search methods you have to know
beforehand the number of topics that you want to recover. So there is like -- so that's why you
have to run it for different numbers of topics.
>>: It seems to me like it's way worse than A doesn't look that much worse than B. If
you told me that if I didn't know the right number of topics, they'd all be screwed up,
that's C. And I'm like, of course, I don't know the number of topics.
>> Abhishek Kumar: Yes.
>>: [inaudible].
>> Abhishek Kumar: I mean, so that's why we will learn these methods for different
values of small R. And then basically choose using --
>>: But, you know, real problems don't have a well-defined number of documents.
>> Abhishek Kumar: Yeah. That's what I mean. So, I mean, any topic modeling
method --
>>: Even the anchor assumption assumes that real data does have that somehow if
you looked carefully enough the data would tell you that there are this many problems.
Because of the anchor assumption.
>> Abhishek Kumar: Yeah. So with anchor assumption the one advantage is that it
incrementally pulls out the anchors.
So I mean, even if you don't know the topics beforehand, you can start running the
algorithm and stop whenever you feel that the next topic that you are getting is not --
does not make sense or something.
>>: I'm sorry. You can go [inaudible] just curious.
>> Abhishek Kumar: So here I show topics recovered by separable methods. These are
recovered by the Bittorf method. And the Bittorf method has two step-size parameters, a
primal step size and a dual step size. And depending on how you set these parameters,
the convergence might vary.
So you can see there are some shadows in the topics. And this topic is not -- is not
very clean. In Gillis and Vavasis, so, again, the assumption here is that the W matrix
should be full rank. So in this data set, it turns out that this assumption does not hold.
And the reason it stops after recovering 14 topics whereas there are 17 topics in the
data.
The proposed method it recovers all the topics, and the topics also look reasonably
clean. And there is a clearcut structure in the topics.
Now, let's add some noise to this data and see what topics we get using different
methods. So on the top left I've shown topics by local search. So here there is some
noise that is collected in the topics. But the rest of them are very clean.
In Bittorf there are some topics that look reasonably good. And in others there is no
clearcut structure. In Gillis and Vavasis it turns out that almost all the topics are kind of
blurry. And there is no structure in the topics. In the proposed method, there are some
topics that are not very clean, like this one and this one. But the rest of them are kind of
reasonable. And there is no noise that is collected in the topics.
So here is an experiment that tests this method on recovery of anchors. So the generative
model is that we generate X as W times H plus some noise. And W is uniformly
generated between zero and 1. H has a separable structure.
So the first 20 columns of H are identity. And the rest of the columns are sampled from
a Dirichlet distribution. So they lie on the simplex.
>>: [inaudible] synthetic as well, right, it's a truly --
>> Abhishek Kumar: [inaudible] is synthetic data, right, but it satisfies the separability
assumption.
>>: [inaudible].
>> Abhishek Kumar: Yes.
>>: [inaudible].
>> Abhishek Kumar: Look at the right factor matrix. You will have the columns of
identity there. So basically this data satisfies the separability assumption.
>>: [inaudible].
>> Abhishek Kumar: Not really. So, yeah, in this one we have a [inaudible] then we
add some noise. And the noise is a Gaussian noise with zero mean and delta standard
deviation. So here is the result. On the X axis we have the noise level, which is the
standard deviation of this Gaussian distribution, delta. And on the Y axis I have the
fraction of anchors that are correctly recovered by these methods.
So the black curve here is the proposed XRAY method. The red line is Hottopix,
which is the method of Bittorf. And the blue line here is by
Gillis and Vavasis.
So in this type of noise it turns out that XRAY is better at recovering the anchor columns
exactly under noise. And, of course, as we keep on increasing the noise, the
performance drops.
Now, in this experiment we want to evaluate the selected features of selected words on
a prediction task. So again every row of X is a data point, is a document. And we are
selecting columns, that means we are selecting words from this corpus.
And then we train a multiclass SVM using these words. And we want to evaluate the
prediction accuracy using these anchor words.
So we use two data sets, TDT and Reuters. TDT has 30 topics and Reuters has 10
topics.
On the X axis I have the number of selected words, anchored columns. On the Y axis I
have the SVM accuracy. So you can see that as we keep increasing the number of
words, the accuracy increases uniformly.
But there is a huge gap between the proposed XRAY method and the Bittorf and Gillis.
So the magenta line is the XRAY greedy variant, and the black line is the provably correct
variant of XRAY. So both these variants are better than the previous methods.
>>: Is one of the -- is the local search, just the greedy local search on this chart? I don't --
>> Abhishek Kumar: Local search is not there. But it lies somewhere -- so it's better
than the separable methods. It lies somewhere here. Somewhere here. And the
reason is that because it's not restricted to selecting only the columns because it allows
mixing of these columns. So you get better prediction accuracy.
>>: Okay.
>> Abhishek Kumar: Yes.
>>: [inaudible] I'm lost. I don't understand this [inaudible]. So you have a classification
experiment and so you use the lower [inaudible].
>> Abhishek Kumar: So we use matrix W to train -- to train the [inaudible].
>>: Okay. The number of selected features are --
>> Abhishek Kumar: Is the number of columns in W, yes.
>>: [inaudible] configure R or is it that you solve a problem for given R and [inaudible].
>> Abhishek Kumar: So on the X axis I have the number of selected features. The
number of columns in W. And as you keep increasing the columns, your accuracy
keeps going up. Because you add more and more features to a classifier.
And you want to see how good are these features on a prediction task. So you want to
evaluate the quality of these selected features.
>>: What is the performance you get with the [inaudible].
>> Abhishek Kumar: With what?
>>: [inaudible].
>> Abhishek Kumar: That's the dotted black line. If you use all the words --
>>: Some other dimensionality reduction technique like SVD [inaudible].
>> Abhishek Kumar: I evaluated non-convex NMF methods and --
>>: Right.
>> Abhishek Kumar: They are little bit better than separable methods in terms of
prediction performance. SVD I did not try I think.
>>: That would be a natural thing. So this question about just use, you know,
[inaudible] I guess the option is [inaudible] then you have [inaudible] topics and you've
got [inaudible] so why not just choose [inaudible].
>> Abhishek Kumar: Yes. I think -- yeah, you are right in that sense, because pulling
out these anchor columns takes time. And I think linear SVM will be faster if you use
just all the words.
But this experiment was just to evaluate the quality of words that are selected by these
different methods and test the prediction performance.
>>: [inaudible] own the data for the factorization and then only a training set for the
trainer --
>> Abhishek Kumar: So we split the data. So the number of documents we split into 20 and
80. 20% is for training. And the 80% is for testing.
>>: So you use the same training set for both the SVM and learning the H factor, or do
you learn the H factor on the whole data set?
>> Abhishek Kumar: It's on the whole data set. It's totally unsupervised. So it's on the
whole data.
>>: [inaudible] so basically most papers which deal with those dimension to reduction,
they show that at one point the training set is small enough, the representation you get
by doing the decomposition is better. Did you look at this? Like so I -- so I can -- so
you say I have 20% training. Let's say I go to, I don't know, 4% training. Is it better than
the original features or not?
Most papers try to sell it in this way. I don't know if it's the right way to sell it but --
>>: That makes sense.
>> Abhishek Kumar: Yeah, that makes sense.
>>: [inaudible].
>> Abhishek Kumar: That makes sense.
>>: My unsupervised algorithm of -- I have to make up for the lack of training.
>> Abhishek Kumar: For the lack of training, yeah, I think that makes sense. But I think
in this experiment, we are trying -- I think I increased from 20 -- started from 20% then
increased it for the -- and I observed that it's not better than if you use all the words.
But maybe if you have like 10% or 5% data then I guess these methods will outperform.
So this experiment evaluates the selected features on clustering task. So, I mean, I
won't go into more detail here, but you can see the magenta curve and the black curve
of the proposed methods. They are better than Bittorf and Gillis on both these data
sets.
Here are some large-scale experiments that we did. So we evaluate on three data
sets, RCV1, PPL2 and IBM Twitter data. And the statistics of the data sets are here.
And we compare with Hottopix, which is the method by Bittorf and [inaudible]. So this is
also a very scalable method. And it's optimized using SGD.
And you can see that on PPL2, we are about three
times faster than Hottopix. And on Twitter data, which is very, very sparse, we are
around 60 times faster than Hottopix. So we have a scalable method that can be practical.
>>: How do you get to no more error?
>> Abhishek Kumar: So we didn't look at the factorization error here. So it's just the
running time for recovering hundred topics.
>>: [inaudible].
>> Abhishek Kumar: So the problem is that if you compare the error, it's not [inaudible]
to compare them because in Hottopix you optimize a different norm than the others. And
in X --
>>: I mean, let's say you put yourself in a [inaudible] condition and measure the
Frobenius norm of N, right, and then look at data and hence --
>> Abhishek Kumar: Yeah. I mean, I'm sure that if you look at the norm of -- Frobenius
norm of the noise, then XRAY will be better, because in Hottopix you optimize some different
norm. So, yeah.
So next I'll move on to some extensions of these algorithms to other loss functions like
[inaudible].
Let me start with introducing the problem of low-rank approximation. So in low-rank
approximation we are given a matrix X. And we want to approximate it with the
low-rank matrix L. And so we -- in our objective we have a rank penalty here, or we have a
rank constraint. And this problem is non-convex.
But we can get a global solution using SVD in polynomial time.
Now, we can imagine the robust counterpart of this problem where we assume that the
noise matrix is sparse. So we replace the Frobenius norm by the L1 norm. And we retain
the rank constraint as it is.
So this problem is also non-convex. And it is also believed to be NP-hard. In -- so both
these problems that I discussed we can imagine a nonnegative rank version of these
problems. So in nonnegative low-rank approximation we are given a nonnegative
matrix X. And the goal is to approximate it with a nonnegative matrix L with this
constraint, that nonnegative rank of L should be small. And this problem is exactly the
NMF problem which is NP-hard.
Now in a robust version of this problem, we assume that the noise is sparse. So we
replace the Frobenius norm by this L1 norm and we have the nonnegative rank constraint. And this
problem is also NP-hard. Now, both these approximations, both low-rank approximation
and nonnegative rank approximation, they can be used to do foreground-background
separation. So background is something which is slowly varying across data points.
And foreground is something which is varying a little bit faster, but it has low energy. So
it can be used to do foreground-background separation in video as well as in text.
But because -- because the previous -- previous versions of the problems are NP-hard,
the question is how to make them tractable.
So here is one popular approach to make robust low-rank approximation tractable. So
they replace the rank function by this nuclear norm penalty, and then this problem
becomes convex. And this has been studied in the literature under the name of Robust PCA.
And in the robust low nonnegative rank approximation, we propose to use the separability
assumption to make it tractable. Because this problem, which we saw
before, is NP-hard, we propose to use separability to make it tractable. So the
problem is that we want to minimize the L1 norm of X minus W times H with a separable
structure on H.
And the question that we ask is how does it compare with convexified Robust PCA on
common applications?
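Written out, the two formulations being compared can be sketched as follows (the Robust PCA program is the standard nuclear-norm relaxation; lambda is a trade-off weight, and X_A denotes the anchor columns):

```latex
\text{Robust PCA:} \quad \min_{L,\,S} \; \|L\|_* + \lambda \|S\|_1
\quad \text{s.t.} \quad X = L + S,
\qquad
\text{Robust separable NMF:} \quad \min_{H \ge 0} \; \| X - X_A H \|_1 ,
```

where in the second problem the anchor set A (and hence W = X_A) also has to be found, which is what the algorithm described next does.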
Now, before going into this, I'll talk about an algorithm that we have. So, again, this is a
conical hull algorithm that recovers extreme rays of the conical hull. This time we do it with
the L1 residual. So we project all the external points to the current cone. And the
projections are in terms of the L1 norm. And once we have these projections, we pull out a
new extreme ray using these L1 residuals and expand our cone.
So here is a sketch of the algorithm. We start with an empty anchor set and the matrix D is
initialized to X. The first step is a selection step where we evaluate this criterion here,
and that gives us the next anchor column. So DI is the Ith column of D. XJ is the Jth
column of X. And P is any positive vector. So whichever column of X gives us the
maximum value of this criterion, that is selected as the anchor column.
And once we have expanded the current cone, we take our external points and then
project them onto the current cone. And this is done by solving a nonnegative least absolute
deviation problem. So we minimize the L1 norm of X minus the anchors times H with a
nonnegativity constraint on H.
And the matrix D is just the matrix of subgradients of this loss function. So if the residual
entry is zero, we have DIJ equal to zero. If the residual is nonzero, then we have the sign of
the residual as DIJ. And once we have the matrix D, we go back to the selection step,
pull out a new anchor and expand the cone. And this is done until we have the desired
number of anchors.
Now, it can be shown that this provably solves the separable problem. But again in
the noisy version, it will depend on which I we end up choosing here.
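For the projection step just described, here is a minimal sketch of the nonnegative least absolute deviations subproblem for a single column x, posed as a linear program (the variable names and the use of scipy's linprog are my own assumptions, not the authors' implementation):

```python
import numpy as np
from scipy.optimize import linprog

def nnlad_column(A, x):
    """Solve min_h ||x - A h||_1 subject to h >= 0 by introducing slacks t >= |x - A h|."""
    m, r = A.shape
    c = np.concatenate([np.zeros(r), np.ones(m)])          # minimize the sum of the slacks t
    # Elementwise constraints:  A h - x <= t   and   x - A h <= t
    A_ub = np.block([[A, -np.eye(m)], [-A, -np.eye(m)]])
    b_ub = np.concatenate([x, -x])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    h = res.x[:r]
    residual = x - A @ h
    # The sign of the residual gives the corresponding column of the subgradient matrix D above.
    return h, np.sign(residual)
```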
Now, this method can also be extended to handle more general loss functions like
Bregman divergence. So I want you to focus on the last bullet here. So we want to
minimize the divergence between X and W times H. And we have a separability
constraint on H.
And for this problem we have a selection criterion that recovers the anchor columns using
Bregman divergences. But I won't go into more detail here because I don't think we have
enough time.
So then again we test the method on recovery of anchors. So the generative model is X
equals W times H plus noise. And W is again uniform between zero and 1. H is
separable. And the noise matrix is now Laplace. So it's sampled from a Laplace distribution
with zero mean and delta standard deviation.
And once we have the noise, we make all the negative entries equal to zero. So there
are approximately 50% entries in the noise that are zero.
And the recovery performance is here. So on the X axis I have the standard deviation
of the noise, which is delta. On the Y axis I have the fraction of anchors that are correctly
recovered. So the black curve is the proposed robust XRAY method, and it's better than
the previous methods. And this experiment is just a sanity check to make sure
that the method is performing as expected. So it performs well under the sparse noise
setting.
Now, let's go back to the foreground-background separation problem in video. So,
again, here our data matrix is such that we have each row a video frame. And we want
to decompose it into two matrices, a low, nonnegative-rank matrix and a sparse matrix.
So low and nonnegative rank matrix will model the background. And sparse matrix will
model the foreground. And we want to see how the separable NMF methods perform
under this setting.
So here I've shown some images on this data. So you can see that the separable
method with the inner dimension of 2 it recovers the background and foreground
reasonably well.
>>: [inaudible] for using a nonnegative decomposition for image data. Is that you
cannot like lower the intensity when you add like those components which like for text I
guess [inaudible] than for densities.
>> Abhishek Kumar: So you're asking in terms of separable methods or in general --
>>: [inaudible] why would I use like a half pixel intensity and I want to [inaudible]
decomposition I'm only allowed to add them [inaudible] it seems that you might want to
subtract things.
>> Abhishek Kumar: Yeah. So, I mean, the most popular methods for this problem are,
as you said, based on robust PCA, where you allow negative entries.
Our motivation was to check whether nonnegative matrix factorization does perform well
for this problem or not.
And, again, I think the intuition is that say your background is a low nonnegative rank
matrix and I guess you want to represent every pixel that is presented in the video
frame as a nonnegative combination of certain elements.
So, again, like that data this is also nonnegative -- nonnegative data. So, I mean, the
motivation --
>>: [inaudible] because of one of the topics. And then maybe you want to [inaudible]
let's say the topic is something hiding a lamp or something, right? It cannot do that,
right, and for text data it's kind of intuitive that you don't see -- you don't have a -- you
don't put a topic delete words. For images it's -- it seems counteractive to me at least.
>> Abhishek Kumar: I mean, you can also look at it in terms of -- in a geometric sense.
So if you take all the pixels, right, and you represent it in the inner space, so all the data
is contained in the nonnegative [inaudible] and all you are doing is to bound this data in
a cone, basically, and -- I mean, that can always be done. And now you're asking
whether there exists a cone with very small number of extreme rays that can model the
background.
>>: [inaudible] I mean, if you have a corpus where there are several light sources and
the light sources are turned on or off, then it would make sense to have specific factors --
corresponding to the lighting being on or off in certain ways, which would just illuminate
a certain fraction of pixels, right, that --
>> Abhishek Kumar: But, again, if some pixels are off, for example, you can put a low
weight on the off light --
>>: [inaudible] positive factor and they would be on or off, they would [inaudible]
>> Abhishek Kumar: Yeah.
>>: But not be negative.
>> Abhishek Kumar: No. I mean, they can be any real number bigger than zero. Yeah.
>>: [inaudible].
>>: [inaudible] it seems kind of like -- like if you give me like scans and stuff, I [inaudible]
if you give me like natural images I can see both like a few sources [inaudible] and I can
see that are subtracting it, in fact, at the same topic as [inaudible]. Let's say you have
something moving in front of a light source, right?
>> Abhishek Kumar: Right. Right.
>>: So it will basically -- let's say this thing is left or right, right? So this would -- so you
would need like four topics there instead of two because you --
>> Abhishek Kumar: Yeah.
>>: [inaudible] subtract anything.
>> Abhishek Kumar: Yeah. I mean, that's always there. So you -- you may end up
increasing your number of topics if you have this nonnegativity constraint.
But, again, on this data, this result was actually obtained with a factorization of
inner dimension 2. So it's a very low-rank factorization.
>>: Did you compare with SVD [inaudible].
>> Abhishek Kumar: Yeah. I have it in the next slide. So this is a [inaudible]
comparison between robust PCA and separable NMF. So, again, in robust PCA we don't
have any nonnegativity constraint on the low-rank part.
So we use two data sets here, Bootstrap and Airport Hall. And for these two videos we
have the correct ground truth. So we know which pixels are foreground. So we have
them labeled. And here I've shown the ROC curves for these two data sets. So on the
X axis I have the false positives. On the Y axis I have the true positives.
So in these two data sets you can see that robust XRAY is as good as robust PCA --
it's performing just as well. And the advantage that we get here is speed.
Because in robust PCA, in each iteration you have to solve an SVD
problem. So this in some sense limits the scalability.
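For reference, the ROC numbers in a comparison like this can be computed in a few lines once per-pixel foreground scores and ground-truth masks are available; this is only a generic sketch with names of my own, not the evaluation code used for these plots:

```python
# Sketch of the per-pixel ROC evaluation for foreground detection.
# `foreground` is a residual matrix from some decomposition (e.g. the sketch above);
# `gt_mask` is the labeled ground-truth foreground mask of the same shape (0/1).
import numpy as np
from sklearn.metrics import roc_curve, auc

def foreground_roc(foreground: np.ndarray, gt_mask: np.ndarray):
    scores = np.abs(foreground).ravel()       # larger residual -> more likely foreground
    labels = gt_mask.ravel().astype(int)
    fpr, tpr, _ = roc_curve(labels, scores)   # false positive rate vs. true positive rate
    return fpr, tpr, auc(fpr, tpr)
```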
>>: [inaudible] works well like the first thing that I would try is [inaudible] and subtract
the main components, and treat the residual as the foreground because [inaudible].
>> Abhishek Kumar: Yes, yes.
>>: And that would mean [inaudible] by the SVD problem.
>> Abhishek Kumar: Right. So robust PCA is just a convex relaxation of the SVD problem.
Because, I mean, with SVD you have to know what rank you want. And here you have --
>>: Actually so what I'm saying is that you're saying two things. There's a
low-rank representation which explains the background.
>>: And there's this sparse foreground, right? Let's say we move the [inaudible].
>> Abhishek Kumar: And you assume that the --
>>: [inaudible] and then plot the ROC curve. So the problem becomes very [inaudible]
and maybe you get most of the performance right. We don't know that.
>> Abhishek Kumar: So, I mean, in SVD the [inaudible] is that you have Gaussian
noise.
>>: No, I agree that [inaudible] using the only assumption I could use to solve the
problem.
>> Abhishek Kumar: Yeah. Yeah. But, again, I mean, I think in
this sort of setting, where we have foreground-background separation, SVD is not going
to perform well. So you need to model the sparsity of the foreground.
Now, this is my last application here. So far we saw that we had data points along the
rows of this matrix, and we were selecting columns. So this is like selecting some
features from the matrix. And now let's say we transpose this matrix X. So in this case
we'll select some samples from the data which are representative, or exemplars, of the
whole data set.
So this setting can be used in video summarization or in text corpus summarization,
in terms of a few frames or a few documents.
And we compare -- in this setting we compare our method with this method, which was
proposed in 2012. And they do it by solving this problem. So they minimize the
Frobenius norm of X minus X times C, plus a row-sparsity penalty on C here.
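My reading of that objective, with the data points as columns of X and with the caveat that the exact row norm used for the sparsity penalty may differ in their paper, is:

```latex
\min_{C \in \mathbb{R}^{n \times n}} \;
  \tfrac{1}{2}\,\lVert X - X C \rVert_F^2
  \;+\; \lambda \sum_{i=1}^{n} \lVert C_{i,:} \rVert_2
```

The nonzero rows of C then pick out the exemplar samples.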
And these are the results. So we use Reuters and BBC, which are text data sets. And
the setting here is that we select some samples, and then we train an SVM with those
selected samples. And the hope is that if the selected samples are diverse enough,
we can get good prediction accuracy with a small number of samples.
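A bare-bones version of that protocol might look like the sketch below, where `ranked_ids` (the order in which some selection method picked the samples) and the other names are placeholders rather than the exact experimental code:

```python
# Sketch of the evaluation: train an SVM only on the selected exemplars
# and record test accuracy as the number of selected samples grows.
import numpy as np
from sklearn.svm import LinearSVC

def accuracy_vs_num_samples(X_train, y_train, X_test, y_test, ranked_ids, sizes):
    """ranked_ids: sample indices in the order a selection method picked them."""
    results = []
    for k in sizes:
        ids = ranked_ids[:k]
        # assumes the first k exemplars cover at least two classes
        clf = LinearSVC().fit(X_train[ids], y_train[ids])
        results.append((k, clf.score(X_test, y_test)))
    return results
```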
So on the X axis I have the number of selected samples and on the Y axis the prediction
accuracy. So as we keep increasing the number of samples, we get a jump in the
prediction accuracy. The red line and the black line are the XRAY methods, and the
magenta line is the method by Elhamifar. On Reuters, robust XRAY is better
than both of these methods. And on BBC, it turns out that the previous, non-robust version
of XRAY is better than both the robust version and Elhamifar's method.
So in the end I'll just summarize what we discussed. We saw the separability
assumption, which makes the NMF problem tractable. And it turns out to be a
reasonable assumption in topic modeling and hyperspectral unmixing.
We also saw a scalable family of algorithms for near-separable NMF.
The algorithms are based on recovering the extreme rays of the conical hull. They
directly attack the conical hull problem and do not require rescaling the
columns of X. And the solution is built up incrementally, so we select one extreme ray in
each iteration, which makes it very easy to do cross-validation.
We also outperform the previous methods in terms of both accuracy and speed. And the
algorithms can be extended to other loss functions like L1 loss and Bregman
divergence. And in the future we want to do a formal noise analysis of the method. And
we also want to work on the streaming version of the algorithms where either rows or
columns of X are coming in a streaming fashion.
In the end I'll acknowledge my collaborators here. So Vikas Sindhwani and Prabhanjan
Kambadur from IBM, T.J. Watson.
And I'll just very briefly mention some other work that I've done. So I've worked on
multitask learning, modeling the grouping structure in multitask learning.
I've also worked on multiple output regression where we modeled the conditional mean
and conditional inverse covariance structure on the parameters.
And I've also worked a little bit on transfer learning.
In this line of work, I have worked on multiple kernel learning. So we are given a set of
kernels. We learn to combine those to learn a final good kernel.
Here the setting is that we are given multiple similarity graphs, and we want to learn
spectral embedding, combining these similarity graphs. So both these approaches,
they are based on either co-training or co-regularization.
And this problem setting is that we have a database in which we have data points from
one modality, and the query comes from a different modality. And the goal is, given a
query, to recover the corresponding element from the database.
So one example is like image retrieval from text query. And vice versa.
And I guess that's it from me. And thank you very much for your patience. Thank you.
[applause].
>> Misha Bilenko: Any questions? No. Okay. Thank you.
[applause].
>> Abhishek Kumar: Thank you.