>> Dengyong Zhou: Okay. I think we should start. So it's my pleasure to
welcome Sham Kakade this morning from MSR New England lab. So Sham probably
doesn't really need an introduction and people here probably know him already.
Sham has done a lot of work in machine learning theory and other areas,
and today, he's going to teach us how to use tensor decomposition for learning
hidden variable models. Sham?
>> Sham Kakade: It's fun to be here. I'll be around for the week, so if any of
you want to chat, we can continue discussions. How do we learn models with
hidden structure? This is one of the questions we're facing in a lot of
practical applications, from simple settings to really complicated settings in,
say, machine translation.
So let's just start with two basic examples of mixture models. The first is the
mixture of Gaussians, which I think we're all familiar with. So here we're going
to see data, like a bunch of point clouds and we'd like to figure out the means
of these point clouds.
And another standard example are these topic models. So there we can think of
having a collection of documents and each document is about one or more topics,
and we think of the document as being like a collection or a bag of words. And
you can think of these two as canonical mixture models.
And how do we learn these? Learning is obviously easy if someone gave us the
labels -- if someone told us which documents are about which topics, it's easy. But if
we don't have these hidden variables, how do we learn?
So let's start by just looking at what's used and what's known. What's often
used in practice is EM -- it's a very natural algorithm -- or k-means. And what
do we do? We guess the cluster assignments or the parameters, assign the
points to various clusters, and iterate.
And then there's various sampling-based approaches and MCMC approaches, and a
lot of the practical algorithms essentially do inference during learning. So as
they're learning, they try to figure out which points are assigned to which
cluster. And at some level, this seems intuitive, but this might also be part
of the difficulty in learning, because we're trying to solve an inference
problem along the way.
And inference is hard in some of these models. Like in the LDA model, where you
can have multiple topics per document, just solving the inference problem is hard, okay.
But that's practice. What about theory? What are the upper and lower limits we
know about for learning? And there's actually been some really nice work recently
by Adam Kalai, Ankur Moitra and Greg Valiant. The first thing these guys
showed is, let's just make the problem simple: what do we know about a mixture of
two Gaussians? This is about as simple as it gets. Turns out that wasn't even
known. What they showed is how to learn a mixture of two Gaussians in poly time
when the Gaussians could overlap. Because once things overlap, that's when
things start becoming difficult.
Actually, even when they don't overlap but are close, it becomes difficult in high
dimensions. But in just two dimensions, they gave a poly time algorithm, which
was efficient. So at least that case, we know we can solve. But that's K equals 2;
you can easily be exponential in K, where K is the number of Gaussians. So
there's kind of a search-based procedure on a line, okay?
So subsequent to that work, there's some really nice follow-up work by Ankur
and Greg which actually, in a sense, was a negative result, where they showed
it seems like you actually need a number of samples exponential in the number of
Gaussians to learn a mixture of K Gaussians. And this is an information theoretic lower
bound, and this seems bad. And the point is if you have K Gaussians which
overlap, but not like right on top of each other, but by a reasonable amount,
they actually showed you could potentially need many, many samples to learn
this thing from an information theoretic point of view, which basically says
computationally [indiscernible].
So this looks bad right now, and it's really a nice construction by Ankur and
Greg. And to some degree, this talk is going to contradict that, because I'm
going to argue that for kind of a natural case of mixture of Gaussians and
these topic models, we can come up with a closed form and efficient estimation
procedure.
And it's kind of interesting for a number of reasons, which people are getting
a little surprised by, because this is a non-convex problem, yet the solution
we're coming up with is closed form. It's a pretty simple
approach because it's based on linear algebra techniques. We aren't solving
inference in the learning process, and this is handy when we start looking at
these topic models with multiple topics in them, because sometimes inference is
hard.
But somehow, we can do things greedily and still figure out where the topics
are with this kind of closed form estimation procedure. We'll come back to the
question of how we avoid this lower bound, because I'm saying we can do
things in closed form and efficiently, but I just gave you this lower bound. And
the approach extends to a number of other settings like this LDA model, which is a very
natural model where you can have multiple topics per document.
And we can get a closed form estimation procedure for that, closed form
estimation procedure for hidden Markov models, and I'll discuss some
generalizations of these ideas to structure learning, like in these models used
in linguistics, and Bayesian networks. But most of the talk is going to focus
on these two simple models and understanding how we learn them, and basically
the proof is very simple. It's geometric, and I think we can really understand
how to learn these things.
So let's just go slowly.
>>:
[indiscernible].
>> Sham Kakade: No, global optimum, efficient, in closed form. So we'll see
what that means precisely, but there's no local optimization here. These are
global optima for estimating the parameters. But again, we'll see what we mean
here, because we have sample data, and then the question is about the
statistical rate and so on.
But nonetheless, I'm going to stand by the claim that it's a closed form --
>>: Initialization?
>> Sham Kakade: No initialization. We're not using EM. That's the point. We're
not doing inference. It's more like a greedy approach. But let's see how we do
that. So let's just start with some definitions, but definitely ask questions
during the talk, but I think these two examples are simple enough that we
should be able to understand the proofs and everything.
Okay. I want to do these in parallel, because they share a lot of
similarities. And, you know, this is the one slide of notation. I think this
should all be reasonably clear, but definitely ask questions.
So for the topic model, we're going to consider the single topic case. But
let's go to the case of mixture of Gaussians and the single topic model. In
the mixture of Gaussians, we think of having K centers, mu one to mu K. These
are points in a vector space. And in the topic model, let's think about having K
topics. I'm going to use the same notation. So in the topic case, each of
these mu I's is a distribution over words, okay? So we can get rid of that.
Now we're going to think about how we generate a point. In the mixture of
Gaussians case, we first sample some cluster with probability W I -- those are my
parameters. In the topic case, we're going to decide on a topic with
probability WI. So that's why they're kind of analogous.
In the mixture of Gaussian case, what do we observe? We observe the mean
corrupted with spherical noise, because that's the case I'm going to consider.
I'm going to consider the case with spherical noise. And this really is the
underlying probabilistic model for k-means, right. So in k-means, what do you do?
You assign things to the closest center, so we can think of this as the
probabilistic model for k-means.
So we just see our point, corrupt it with noise, and we're going to see many
such points.
So in the case of topic models, we're going to see a document, and document's
going to consist of M words which are sampled independently from this topic.
So it's an exchangeable model, so our document just consists of M words and
they're all sampled IID from mu I.
So at this level, we kind of see some distinctions between these two models.
In the mixture of Gaussian case, we're adding noise to one of the means. In
the topic model case, we get many words drawn independently from the same
probability distribution, the same hidden topic.
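For concreteness, here is a minimal generative sketch of the two models just described (an illustration only; the names k, d, sigma, w and mus are mine, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    k, d, sigma = 3, 10, 0.5                    # number of clusters/topics, dimension, noise level
    w = np.ones(k) / k                          # mixing weights w_i
    mus = rng.dirichlet(np.ones(d), size=k)     # rows are the mu_i (also valid topic distributions)

    def sample_gaussian_point():
        # Mixture of spherical Gaussians: pick a cluster, observe its mean plus noise.
        i = rng.choice(k, p=w)
        return mus[i] + sigma * rng.standard_normal(d)

    def sample_document(m=3):
        # Single topic model: pick one topic, then draw m words IID from mu_i (bag of words).
        i = rng.choice(k, p=w)
        return rng.choice(d, size=m, p=mus[i])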
>>:
Multinomial model?
>> Sham Kakade: It's a multinomial model, yeah. It's a multinomial model here
and the draws are independent. They're exchangeable, and this is why it's kind of a bag of
words model. The order of the words doesn't matter. We just see this
collection of samples.
>>: But this is not the common -- you've put "single" in parentheses in there to
indicate this is not the topic model that's commonly used in practice, where
you have a mixture of topics associated?
>> Sham Kakade: We're going to come back to the case of multiple topics, the LDA
model. I mean, this is one of the standard models used, but we often like
richer ones, like LDA. But we'll be able to handle that as well.
But in terms of understanding what's going on, I think it's helpful to consider
these two in parallel, because both really have one hidden mixture component,
and the learning question is: we're just going to see a bunch of samples, and
what we'd like to recover are the means or topics, the mixing weights -- these
are just real numbers -- and sigma in the case of the mixture of Gaussians.
I like this because we kind of see how these models are similar and how they're
different. And the main difference is that in the topic model, we get multiple samples
from the same hidden state. And here we kind of get a different notion of
noise. So we good? Okay.
>>:
The topic part, you see [indiscernible].
>> Sham Kakade: No, so we see multiple documents. Sorry. So in the mixture
of Gaussians, one X is a point. And in the topic model, you see multiple
documents. So think of the analog of one X here as being the M words of a document, and we see
multiple documents.
>>:
So each point corresponds to one --
>> Sham Kakade: Yes, each point here corresponds to one document, and a
document is this collection of words, and that's analogous to one of the Xs.
>>:
And the dimension [indiscernible].
>> Sham Kakade: Let's come back to the dimension question. Right now, we're just
thinking of these as discrete, but we'll formalize that later -- I have to think
about it carefully. We good with that? Okay. Good.
So how do we learn these things? And learning is: you have these sample points;
how do we figure things out? And there's a lot of work that's been done on
this. I'm not going to go through the details of all of this work, but there's
really a ton of work from the theory and ML communities trying to figure out how
to learn mixtures of Gaussians, going back to a really nice paper by Sanjoy
Dasgupta over a decade ago. And a lot of the work was really on the case
where the Gaussians are very well separated -- they don't overlap at all -- and
then, you know, what kind of approaches can we use to figure this case out.
But they relied on a ridiculous amount of separation.
And then there's some work where they can actually overlap or where they're
much closer together and most of these are basically more like information
theoretic results, and they're all typically exponential in K, where you're
searching all over the place.
>>:
So how do all these methods differ from original [indiscernible].
>> Sham Kakade: Okay, right. So all the ones I've listed up here are mostly
theory results where you're trying to find the global optima. Or maybe not the
global. Something provable. You care about computation, which is why they're
all either exponential in the number of topics, or the clusters are very well
separated and you can -- and even there it's tricky to know what to do.
>>:
Basic procedures remain the same?
>> Sham Kakade: The procedures for the ones with a lot of separation are more
based on distance-based clustering, where you kind of -- some of them may
introduce the right notion of using linear algebra approaches, but all of these
are quite different because no one has a very good understanding of the EM type
approaches and how to -- but these are a lot of the theory. And some of them
give interesting algorithmic insights.
For topic models, there's also been a lot of work. So Christos and Santosh and
other people had one of the early papers on how to view topic models as a
matrix factorization approach. There's been some really nice work by Joseph
Chang which kind of heavily influenced our work, more from [indiscernible]. And
even recently, there's been some nice work by Sanjeev Arora on the topic case,
looking at this as an NMF problem and how do we do [indiscernible]
factorization. But again, they can only prove it under separation conditions.
Basically, there's a lot of work.
You can ask me later on about details of it.
And to some degree, we're going to take a pretty different approach from a lot
of these -- a lot of that work. So let's just back up and forget about
sampling issues and just start thinking about identifiability. Let's go back
to this old idea, from before maximum likelihood, due to Pearson. So Pearson was the old
statistician. He's like, how do we estimate models?
His idea was something called the method of moments, which is: we see averages in the
data. How do we figure out the parameters which give rise to these averages?
And going back to this multiple topic case, we started addressing this via
identifiability questions. So what's the question here? Let's look at our
moments.
For the mixture of Gaussian case, the first moment is the mean. The second
moment is the expected value of XX transpose. The third moment is this tensor
and they just keep getting bigger. For topics, we can think of the moments in
a very natural sense. Think of the first moment as corresponding to having
documents with one word in them. We think of the second moment as documents
with two -- suppose we had an infinite collection of documents with two words
in them. What do we know? We can figure out the joint distribution of two
words in a document.
We think of the third moment as the joint distribution of three words in the
document. And now forget about, you know, sampling issues. We can just ask
the identifiability question, which is how many words, how long do documents
need to be before the parameters are well specified.
Because if every document contained one word, and I had an infinite
collection of documents, I would know this exactly. But it's not identifiable,
obviously. You can't figure out the topics if every document has one word in
it.
So we can ask this even more fundamental question, which is suppose we had
exact moments, when are the models well specified and what order of the moment
suffices to nail it down. And this is an interesting question when you have
multiple topics per document, because now the question is how long do documents
need to be. Say every document had, say, five topics in it out of
100 possible topics -- how long do documents need to be before the model is
identifiable? It's an information-theoretic question, independent of sampling
issues.
And for the most part, let's just proceed for now, given that we have exact
moments. And we want to address the identifiability question and then we want
to see, can we invert these moments efficiently. Can we take these moments and
figure out the parameters.
As you know, we can think of all of these as matrices or tensors. Okay,
because this is as -- this is like the bigram matrix and this would be like a
trigram tensor. So we good with that? So this is an even more basic question.
And this is what I mean by closed form solutions. I'm going to show you can
easily figure out the answer to this question.
Okay. Good?
>>: So this [indiscernible].
>> Sham Kakade: We'll see. No. So now let's just look at why I'm comparing
these two, and let's use some vector notation, because it keeps us on the same
footing. So for the mixture of Gaussians, we've got K clusters. We're in D
dimensions in the mixture of Gaussians case, or we could have D words. Typically,
we think of D being bigger than K. We have more words than topics or more
dimensions than clusters.
In the mixture of Gaussians case, what's the expected value of X given that it's from
cluster I? It's just the mean -- that's the definition, right. For the vector
notation, it's helpful to think of words as these one-hot encodings, where this
is the first word and this one has the second word on. Why is this handy? Because if
we use this encoding, we can think of these mus as probability vectors. These
are D-length vectors which sum to one, and then we can think of the probability
of any given word, given that we're from topic I, as just being given by mu I. So this
is just the expected value of a word given the topic, and this is just mu I.
It's why vector notation is handy because it kind of keeps things on the same
footing. The noise models are obviously different. The noise model here
is spherical. The noise model here is multinomial and it depends on the
particular word we're getting. But that's just notation. Are we good with
that?
So now let's start looking at these moments and trying to figure these out.
What does the first moment look like? Okay. In the mixture of Gaussians model,
it's just the weighted average of the mu I's. And in the topic model, it's just the average of
the topic probability vectors. And obviously, the model is not identifiable
from this because, you know, this is always specious arguments but parameter
counting suggests it's not enough.
And there are definitely problems with parameter counting arguments, but
nonetheless, we know, you know, the mean isn't enough. Okay? We good with
that? Forward ho, all right?
So let's look at the second moment. So let's look at the mixture of Gaussians
model. I'm going to use this kind of outer product notation, instead of XX
transpose. So what is the second moment in mixture of Gaussians model? It's
the expected value of XX transpose. Each X is a mean plus noise. So what you
get for the mixture of Gaussians is basically a contribution of how the means
vary, right, because you've got this picture where, you know, you've got these
means which lie in some subspace. Actually, I'm going to draw the Xs as the
means, and we've got kind of points lying around the means.
And the contribution to the second moment is kind of how the means are configured
with respect to each other, plus sigma squared times the identity matrix, due to
the way that -- due to the variance in the points.
Now, something really nice happens for topic models when you have cross
correlations [indiscernible]. So if you look
at the joint distribution between two different words, these words are sampled
independently, conditioned on the topic. So the noise is independent.
So the joint distribution of two words, if you view it as a matrix, is just the
weighted sum over the means, because, you know, if you condition on the
topic, the noise is independent, so it's just the expected value -- this is just
the expected value of X1, X2. Conditioned on the topic they're independent, and the
expected value of X given the topic is just mu I.
So the joint distribution -- the bigram matrix -- is just the weighted average of
mean mean transpose. Yes?
>>: [inaudible] In the topic model you have M words in each document. Is the
dictionary size M?
>> Sham Kakade: The dictionary is size D, and you get M samples from the
documents. So literally, the document is just --
>>: But then each word appears once or multiple times?
word --
Because if each
>> Sham Kakade: No, no, take the first two words. So in this one, suppose the
document just has two words in it. I'm looking at the joint --
>>: X1 and X2 are not different words in the vocabulary?
>> Sham Kakade: No, no, X1 is the first word. X2 is the second word.
>>: I see.
>> Sham Kakade: Okay. So we're looking at the joint distribution of the first
word and the second word and those two are independently drawn, given the
topic. So the average value of X1 X2 transpose is just mean times mean transpose. You
know, think of the mixture of Gaussians case: if we could get two different
samples from the same Gaussian, then these noises would be independent and that
noise term would go away.
>>: Another question about the term identifiable. You said that
from the means, on the previous slide, you can't identify --
>> Sham Kakade: So what I mean by identifiable is that there are two different
models which could give rise to the same means. So if you only knew the means,
could you figure out the parameters?
>>: But in your kind of [indiscernible] model, two clusters differ only by the
mean. So if I know the mean, I can identify it.
>> Sham Kakade: No, no. So if I just -- in the mixture of Gaussians model, if
you just knew the global average of the means, all you see is, in this case,
the global average, which would lie somewhere here. If you just see this point, you
can't figure out those Xs, the means, exactly. So you just know E of X, and
now we're going to look at E of X, X transpose, and what we get is this. And in
the Gaussian case it's this; in the topic model case, it's this.
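To make "this" concrete: following the definitions above, the two second moments being compared are

    E[x \otimes x] = \sum_i w_i \, \mu_i \otimes \mu_i + \sigma^2 I        (mixture of spherical Gaussians)
    E[x_1 \otimes x_2] = \sum_i w_i \, \mu_i \otimes \mu_i                 (single topic model, two words)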
And now we see the first distinction, because this thing is a low-rank
matrix. What's the rank of the bigram matrix? It's K, the number of topics.
This one is full rank. It's dimension D.
This is the first distinction. And this is handy. And exchangeability is
really a wonderful property. Being able to get many samples that are
independent conditioned on the same state, because we get this lower ranked
matrix. It's still not identifiable, because intuitively all we learn -- you can
think of this geometrically -- is an ellipse where the means lie. It's kind of
what a second moment matrix tells us: we basically just
figure out an ellipse, which is like the covariance matrix.
And we don't really know anything past that. It's just kind of a rotational
problem. We're getting close to the right number of parameters, but you
can actually show it's not identifiable. And it's not non-identifiable just in a
worst-case sense: basically every single model is non-identifiable here. It's not
just a few points that get confused.
>>: Sorry to jump ahead, but should I expect like one case where it's less
than P it will be okay in the multi-topic case?
>> Sham Kakade: No, actually, we'll be getting there in a second.
>>: Okay.
>> Sham Kakade: Parameter counting suggests, okay, well, how big is this? This
is size D squared. And how big is the third moment? D-cubed. Well, D-cubed has a lot of
parameters in it. But this kind of argument is definitely fishy, because there are some
models which are just fundamentally not identifiable.
>>: I should be clear, like where K squared is less than D, the bigram matrix
would still be low rank.
>> Sham Kakade: This is always low rank. Sorry. This is only low rank when D
is bigger than K.
>>: Right.
>> Sham Kakade: So it's not K squared versus D, it's D versus K.
>>: Right. In the single topic case. But in the multi-topic case --
>> Sham Kakade: This is going to be surprising. So this one will break your
intuition. And this is why the identifiability question was really nice to us,
because when we had multiple topics, like how long do documents need to be
before you can identify it. And forget about computation, because we didn't
expect an efficient algorithm. I just wanted to know the exact answer to this
question. It had to have an exact answer, but we couldn't for the life of us --
because some of the difficulty with identifiability is that sometimes
identifiability arguments are nonconstructive, which makes them very hard. There are
really beautiful theorems from the '70s by Kruskal on how to do this, but they're
very difficult because, you know, they're existence proofs. So they kind of --
they're interesting questions. But we'll definitely get to that.
>>: This is more of a higher-level question. I mean, [indiscernible] generally
[indiscernible].
>> Sham Kakade: Let's get back to that. So there's kind of a long debate
among statisticians between these two, and Fisher basically argued against the moment
methods -- so Fisher basically argued maximum likelihood was more efficient --
>>: [indiscernible].
>> Sham Kakade: Yeah, basically uses the samples better. But the point is we
kind of knew moment methods were easier to use, even in simpler settings.
Like if you look at Duda and Hart, the first edition, I think it says, you
know, for some of these problems, you want to start with a moment method and
then use Newton. And then kind of the modern treatments of statistics, by more
theory-oriented statisticians, said do the method of moments, because maximum
likelihood isn't consistent, and then do a step of Newton.
And then they argued that was as efficient as maximum likelihood. Because
maximum likelihood does stupid things with infinities. So the best thing to do
for getting the constants is they're basically [indiscernible] with each other.
So do the method of moments. Then do a step of Newton because it's a local
optimization and then that's as good as maximum likelihood.
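In modern practice that recipe often looks something like the following sketch, using a few steps of EM as the local refinement in place of Newton (scikit-learn's GaussianMixture here; moment_means is a hypothetical moment-based estimate):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def refine_with_em(X, moment_means, n_steps=1):
        # Moment-based initialization followed by a few EM steps as the local refinement.
        k = moment_means.shape[0]
        gm = GaussianMixture(n_components=k, covariance_type='spherical',
                             means_init=moment_means, max_iter=n_steps)
        return gm.fit(X)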
>>: [indiscernible].
>> Sham Kakade: That's what we're trying to do.
>>: In terms of moments.
>> Sham Kakade: We'll get to that in a second. Forget about samples right
now. Let's figure out how to invert the moments. And for now let's drop the
mixture of Gaussians case. We'll come back to it because it's not low rank.
Let's just proceed with the topic model case and then we'll come back to the
mixture of Gaussians, okay?
So again, now we should kind of -- we see what's going on. So if we have this
trigram tensor, what's it going to look like? Well, the noise is independent.
We're just going to get mu, mu, mu. So what do we get -- now suppose
documents have three words, what do we get?
I'm going to define M2 to be the bigram matrix; M3 is going to be the trigram
probabilities. And it just looks like the sum of W I times mu, mu, mu. So it's
this D-cubed sized beast.
So the same argument, the noise is independent. So now how do we figure out
the mus from this? Let's just work in a slightly better coordinate system.
Basically, this is the problem right now. How do we solve this structure? We
see a matrix that looks like this, we don't know the mus and we see this
D-cubed sized object that we know is guaranteed to look like this,
but we don't know what these mus are. How do we figure it out?
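As a concrete sketch, the plug-in versions of these objects can be formed directly from a corpus (hypothetical helper; docs is a list of word-id lists, d is the vocabulary size, and for simplicity only the first three words of each document are used):

    import numpy as np

    def plug_in_moments(docs, d):
        # Empirical bigram matrix M2 and trigram tensor M3 from single-topic documents.
        M2, M3, n = np.zeros((d, d)), np.zeros((d, d, d)), 0
        for doc in docs:
            if len(doc) < 3:
                continue
            x1, x2, x3 = doc[0], doc[1], doc[2]   # cross moments of three distinct word slots
            M2[x1, x2] += 1.0
            M3[x1, x2, x3] += 1.0
            n += 1
        return M2 / n, M3 / n

In practice one would average over all ordered triples of distinct positions in each document, but the structure is the same.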
Geometrically, let's just think in a nice coordinate system. Whenever I see a
matrix, I like to think about isotropic coordinates. So let's think about
transforming the data so that this bigram matrix is the identity, and it's low
rank so that means when you transform it, we're going to be working in K
dimensions now.
So just take the bigram matrix, make it look like a sphere. That means we're
going to project things to K dimensions and we have K by K matrices. And when
we do that, we're going to do this same linear transformation of the tensor,
which means the equivalent way to look at this problem is we know the second
moment is isotropic, it's a sphere. We have a third moment that looks like
this where now we're in K dimensions, and what this transformation to a sphere
does is it makes these means orthogonal.
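A minimal sketch of that whitening step, assuming M2 and M3 as above and the number of topics k (the helper name is mine):

    import numpy as np

    def whiten(M2, M3, k):
        # Find W with W.T @ M2 @ W = I_k, then apply W to all three modes of M3.
        vals, vecs = np.linalg.eigh(M2)
        vals, vecs = vals[-k:], vecs[:, -k:]               # top-k eigenpairs (M2 has rank k)
        W = vecs / np.sqrt(vals)                           # d x k whitening matrix
        T = np.einsum('abc,ai,bj,ck->ijk', M3, W, W, W)    # whitened k x k x k tensor
        return W, T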
So basically, you know, in this problem, we're now working in a coordinate system
where these Xs are orthogonal, and this is it. Someone says, here's a K-cubed
object, which is pretty small. It's guaranteed to have this decomposition. You
know they're orthogonal. What are the means?
>>: It means you decompose things in a three-dimensional array?
>> Sham Kakade: That's what this is.
>>: Three-dimensional [indiscernible] -- it's a two-dimensional array.
>> Sham Kakade: No, this is a three-dimensional decomposition because each of
these are D-cubed. So we know such a decomposition exists for this tensor.
It's three-dimensional -- how do we find it? Okay. So that's --
>>: You project out to the [indiscernible]?
>> Sham Kakade: No, I'm not going to do that. There's many different ways to
do it. That's the question. We really just reduced this question to how do we
do this decomposition, okay? So let's just back up a second and look at things
as linear algebra operators.
So let's go back to matrices. Remember this notation where we have a matrix
M2. We can hit it with a vector on the left and right, which we can write as A
transpose M2 B. For notation, I'm going to write that as hitting M2 with A and
B this way, and just as with matrix multiplication, it means we just sum out the
matrix over those coordinates.
And for tensors, we can also think of these things as trilinear operators.
We can take a tensor and you can hit it with three vectors, and just in the
same way you do matrix multiplication, you just take this tensor, which has
three coordinates, and sum them out against these three vectors. So it's
kind of linear in each of these arguments. It's, you know,
multi-linear.
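In code, this multilinear notation is just a sum over indices -- a tiny sketch:

    import numpy as np

    def hit_three(T, a, b, c):
        # Hit a 3-way tensor with three vectors: sum_{ijk} T[i,j,k] a[i] b[j] c[k] (a scalar).
        return np.einsum('ijk,i,j,k->', T, a, b, c)

    def hit_twice(T, v):
        # Hit it with a vector in two slots, leaving one open: you get back a vector.
        return np.einsum('ijk,j,k->i', T, v, v)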
What's an eigenvector of a matrix, looking at it in this form? Well, if you
hit it with a vector, you get back the same vector, scaled. There's a
generalization of eigenvectors for tensors: you just take the kind of
cube now, rather than a square, hit it twice with a vector, and you're going to get
back a vector, because you're summing over three slots. If you hit it twice with a
vector, you get back a vector, and we're going to call it an eigenvector
if that direction is proportional to the direction we hit it with. Okay. So
this is actually pretty well studied in various areas of mathematics and other
areas. The problem is tensors are horrible in a lot of ways in the general
case, because they don't inherit a lot of the nice structure of
matrices, but we can still define them, and there's a kind of active area of
study as to what these things look like.
That's the definition. Turns out for our case, they're very well behaved.
What's our case? Our case is we have this problem: we know this cube looks
like a sum of orthogonal vectors. Any guesses? Well, what are the
eigenvectors of this whitened tensor? So let's just hit this M3 with two vectors
V. What does it end up looking like? Well, this beast
was the sum of W I times mu, mu, mu. We hit it with V on the mus twice. It looks like
the sum of W I times (V dot mu I) squared times mu I. And if we want to find an eigenvector, this had
better be equal to lambda V.
Okay. So let's suppose V is mu one. They're all orthogonal. So I put mu
one twice in that expression. What do you get on the right-hand side? The
mus are orthogonal, so what do you get?
>> Sham Kakade: Yes, you get W1 times mu one, or [indiscernible] for the
transformation. So the only -- and this is it. Basically, the
tensor eigenvectors are exactly the topics; they're projected and they're
scaled, so it's easy to unproject, because we know the whitening matrix. We
just snap it back to D dimensions, and we can figure out the scale because we
make it sum to one -- they're probability vectors.
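Spelling that calculation out (a sketch, writing $\tilde{\mu}_i$ for the whitened topic directions, which are orthonormal, and $\lambda_i$ for the resulting weights, which are simple functions of the $w_i$):

    \tilde{M}_3(v, v, \cdot) = \sum_i \lambda_i \, (v^\top \tilde{\mu}_i)^2 \, \tilde{\mu}_i

so plugging in $v = \tilde{\mu}_1$ kills every term except the first, leaving $\lambda_1 \tilde{\mu}_1$ -- an eigenvector with eigenvalue $\lambda_1$.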
Okay. And what was the assumption we needed to get this to work out? Well,
what did I do? Well, I made things white. When does that work? I need the
topics to be linearly independent. That's the only assumption. Because
obviously, if one topic is in the convex hull of the others, this might be
problematic, because then it's -- because there's even identifiability
questions. But as long as things are well-conditioned, which basically always
occurs in general position, this is a minor assumption.
So as long as the topics are linearly independent, which is minor, all of the
tensor eigenvectors are the topics. It's a very clean statement, and the
reason tensors are nice to work with is because for this particular case, we
have this nice tensor structure. In general, tensors are a mess. This is
about the nicest possible case.
>>:
So do you stop at that moment?
>> Sham Kakade: There are moment [indiscernible] to identify it. And this is
the decomposition you need. And there's kind of no multiplicity issues either.
It turns out for matrix eigenvectors, if you had the same eigenvalue, linear
combinations would also be eigenvectors, whereas -- there's no issue here, because even if all of these Ws
were the same, it doesn't cause problems, because of the way tensors work,
because of this cube. You can't get linear combinations being eigenvectors.
>>:
So is this limited still to the Gaussian case, or.
>> Sham Kakade: We dropped the Gaussian case for now. This is for the topic
model case with one topic per document.
>>: But the cardinality --
>> Sham Kakade: The cardinality can be arbitrary, yes. So now the statement
is you just need documents with three words -- M3. Any number of topics, as long as they're
linearly independent, and this is kind of the closed form solution which we can
think of as an eigenvector problem.
>>:
Do the lambdas suggest how many topics to pick?
>> Sham Kakade: No, the number of topics is going to come from the rank of the bigram
matrix. Basically, you just look at the rank of that matrix and that's the
number of topics.
>>:
[indiscernible] MOG.
>> Sham Kakade: We're going to get back to that. So now the question is what
happens in mixture of Gaussians, what happens in LDA and kind of richer models.
This is a clean way -- [indiscernible] the algorithm question now too, because can we
solve this thing? We know how to do this for matrices. General tensors are hard.
But it's basically known, you know, we did analysis.
Basically, analogs are the power iteration.
You just hit it twice, repeat,
because you want to find a fixed point. This is exactly how matrices work and
then deflate. This thing converges insanely fast. It's even faster than the
power iteration for matrices, which is log one over epsilon. This is log log
one over epsilon.
Basically, this powering means you converge extremely quickly. So it's a very
fast algorithm. A different way to view this, which I find kind of interesting
from a geometric perspective, is that we're basically maximizing skewness. You can
view that eigenvector condition as saying -- it's almost like a subgradient
condition which says maximize M3 hitting it with V three times, which is like
kind of maximizing the variance in the third moment. Okay. And it's kind of
like finding the spiky directions.
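A minimal sketch of that power iteration with deflation, on the whitened tensor T from before (illustrative only -- random restarts and the careful error analysis from the paper are omitted):

    import numpy as np

    def tensor_power_method(T, k, n_iter=50, seed=0):
        # Recover k approximately orthonormal tensor eigenvectors by "hit twice, renormalize, repeat".
        rng = np.random.default_rng(seed)
        vecs, vals = [], []
        for _ in range(k):
            v = rng.standard_normal(T.shape[0])
            v /= np.linalg.norm(v)
            for _ in range(n_iter):
                v = np.einsum('ijk,j,k->i', T, v, v)        # hit it twice, get back a vector
                v /= np.linalg.norm(v)                      # renormalize and repeat
            lam = np.einsum('ijk,i,j,k->', T, v, v, v)      # eigenvalue at the fixed point
            vecs.append(v)
            vals.append(lam)
            T = T - lam * np.einsum('i,j,k->ijk', v, v, v)  # deflate: rip off the spiky direction
        return np.array(vecs), np.array(vals)

One reasonable way to unproject a recovered v back to a topic estimate is np.linalg.pinv(W.T) @ v, followed by normalizing the result to sum to one, as described earlier.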
And that's kind of another way to view it, and this is why it's greedy, because
it says all local optimizers of this third moment are the solutions we want.
Which is why you can kind of rip them off one at a time, because kind of every
spiky direction is the one you want.
And so, you know, there's a lot of different algorithms which are efficient
because they're greedy. These are very similar decompositions to those
studied in ICA, because somehow the same tensor structure arises in that
setting. So we understand it, and we also understand some of the
[indiscernible] questions, because if we don't have exact moments you just use
plug-in moments but it's just a perturbation argument, and the stability kind
of just depends on how overlapping the clusters are, but that's real.
But now let's -- so we're good with that. This is a minor -- I mean, the real
thing is understanding the closed form structure and the rest is perturbation.
But now let's go to the -- if we're done with that, we should look at the
mixture of Gaussians case and the multiple topic case. So we good with this?
This is pretty clean. I think we hopefully understood the proof and
everything. So now what about the mixture of Gaussians case.
So let's go back. There's this pesky sigma squared I. And if we're looking at
third moments, we're going to again get problems. But what's sigma squared?
Can we just figure it out? It turns out, suppose D were bigger than K,
strictly bigger than K. The previous algorithm worked even if D equaled K in
the topic case. But for now, just suppose D is like K plus 1 or bigger; this
is always the case. Like a picture like this.
Sigma squared is just the variance off of the subspace. Because basically,
in the subspace, if I look at the variance in a direction, I get a contribution
from how the means vary plus from the way the points spread out. If you look at the
variance in a direction orthogonal to the subspace, the only
contribution to the variance in that direction is sigma squared.
So what that means mathematically is that the minimal eigenvalue of
this matrix is sigma squared: because these things lie in a K-dimensional space,
just look at any direction orthogonal to that, which will happen if you look at
the minimum eigenvalue, or the K-plus-first eigenvalue -- that's sigma
squared. So we know sigma squared. It's just the minimal eigenvalue of
that matrix, which we know.
Okay. And we need the dimension to be bigger by one for that argument. So it's
estimable, and we can just subtract it out. So we can basically figure this
beast out, and the way we do that is we look at the second moment matrix,
we figure out the noise, and get rid of the noise.
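A small sketch of that step (E_xx is a plug-in estimate of E[x x^T]; it needs D strictly bigger than K, as just discussed):

    import numpy as np

    def estimate_sigma2(E_xx, k):
        # Spherical noise variance: the (k+1)-th largest eigenvalue of E[x x^T].
        eigvals = np.linalg.eigvalsh(E_xx)[::-1]   # eigenvalues, descending
        return eigvals[k]                          # everything past the top k equals sigma^2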
>>: So why is [indiscernible] the smallest? What if W I is really tiny?
>> Sham Kakade: See, the point is you're going to use an orthogonal direction,
because this is a PSD matrix. So look at -- hit this thing with any direction
V, right. You're going to get V on this first term, but if we can guarantee V to
be orthogonal to this, that term would be zero. So --
>>: Okay, I see.
>> Sham Kakade: All I'm saying is geometrically, you can pick out the variance
off the subspace just by minimizing the variance. So yeah.
>>:
So there will be some dimension, okay.
>> Sham Kakade: Yeah, and if it's greater -- and it turns out there's a cheap
trick so that even if D equaled K -- which was the case we wanted to handle -- you
can still do it, because there's a very natural point that actually -- let me go
back here. Basically, if you just look at the covariance matrix when you
subtract out the means, sigma squared is the minimum eigenvalue of the
covariance matrix.
So even if D equaled K, you could still figure out sigma squared. It's just
the minimal -- so forget about that case if that's not clear. But sigma
squared is known.
So we can get rid of that. What about for the third moment. If we look at the
third moment, it's not just mu mu mu. You're going to get extra junk. Let's
look at what that is. Let's just go to one dimension. Suppose we're in one
dimension and we have a random variable which is a mean plus Gaussian noise,
where the noise is mean zero with variance sigma squared. What is the expected value of X cubed?
We get one term which is like mu cubed, right? What's the other term?
>>: [inaudible].
>> Sham Kakade: The expected value of the noise squared is sigma squared,
times mu, and then there's a count of three. So the expected value is this. So
basically, we want mu, mu, mu, and we get some extra junk: we get three sigma
squared mu. We know mu, because that's the first moment -- that's E of X. And
we know sigma squared. Granted, this is one dimension -- we've just got to write
this in kind of a tensor way. So basically, the way you think about it is,
think of three sigma squared mu as equal to sigma
squared times one times one times mu, plus one times mu times one, plus mu times
one times one.
It's a stupid way of writing it, but the only reason I say that is because --
so for M2, what do we do? We subtract out sigma squared times the identity, which we
estimated, and then we just do the tensor version of three sigma squared mu to get rid of
it, which is basically -- and these e sub i's are the basis vectors. So it's
really just mean-one-one, plus one-mean-one, plus one-one-mean, times sigma squared, subtracted off.
And these two exactly have the structure we want. This means for the
probabilistic model underlying k-means, we basically can construct this form.
The eigenvectors are the projected means. Up to scale. How do we unscale?
Turns out you can get the scale from the eigenvalues themselves in this case.
But this is the main point. The main point is algebraically, the structure we
have, if we just rip out the noise, is this. It's pretty neat.
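Putting the two corrections together, a hedged sketch of the corrected moments for the spherical case, mirroring the mean-one-one construction just described (m1, E_xx, E_xxx are plug-in first, second and third moments):

    import numpy as np

    def corrected_gmm_moments(m1, E_xx, E_xxx, sigma2):
        # Strip the spherical-noise contributions so only the mu_i outer-product structure remains.
        d = m1.shape[0]
        M2 = E_xx - sigma2 * np.eye(d)
        C = np.zeros((d, d, d))
        for j in range(d):
            e = np.zeros(d)
            e[j] = 1.0
            C += (np.einsum('i,j,k->ijk', m1, e, e)     # mean-one-one
                  + np.einsum('i,j,k->ijk', e, m1, e)   # one-mean-one
                  + np.einsum('i,j,k->ijk', e, e, m1))  # one-one-mean
        M3 = E_xxx - sigma2 * C
        return M2, M3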
And this is why we see the difference from the topic model: because here we don't
have exchangeability, the noise correlates, so we've got to futz around with
the noise, because we can't directly get these low-rank matrices. It doesn't really
matter. We just get this, and then, you know, we get a closed form solution
with exact moments.
Now, let's go back to Greg and Ankur's result. They gave an exponential lower
bound, but the way they did it they put K Gaussians on a line in a
configuration of -- a very particular configuration. And what that effectively
does is, you know, the assumption we needed to solve this problem is they had
to be in general position. And K Gaussians on a line are not in general
position. And so what's kind of nice, so we actually know there's some gap in
between that if they're not in general position, it could be bad. And if they
are in general position, it turns out we only have a polynomial dependence on
the separation condition.
By that I mean in general position, they could still be close together, but we
could look at the minimal singular value of the matrix of the means, and there's
only a mild dependence on that. Whereas somehow, once they're on a line, you could
be exponentially bad in the separation; in general position, the dependence
is only polynomial.
And you can kind of see why, because if they're on a line, the third moment is
still a number. So it's not identifiable from the third moment if it's a
number. You have to go to very high moments. And estimating very high moments
is unstable.
Now, that's only one algorithm, and they prove it information theoretically, so
it's for any algorithms. But the intuition is nice. So that's how we got
around the lower bound.
>>:
If you took general Gaussians, you would not have this kind of --
>> Sham Kakade: We don't know how to solve the elliptical case. There, I
suspect -- I don't even know how to prove hardness results. There are cases
which I think statistically are fine to estimate. But computationally, we have
no idea. And I don't think there's any language for how to even understand
hardness in these average case type scenarios. So anyway, for this very
natural case, it's -- you basically have seen the proof. This is it.
It's pretty simple, and the very last thing I want to do is look at this case
of multiple topics. So --
>>: Okay, but can you talk about like what the finite sample effects are?
>> Sham Kakade: Oh, yeah. So it's basically like the analog of a matrix
perturbation argument, that if -- I'm only going to say [indiscernible] in the
paper for now. But basically for SVDs, there's like [indiscernible] theorem
and these kind of theorems of how accurate is an SVD if our matrices are
perturbed.
This is just the analog, how accurate are these eigenvectors if the tensors are
perturbed and that's what I was referring to here. What it depends on
basically are how collinear the means are, but in a nice way. And that's kind
of real, because as the topics start becoming collinear, you're going to expect
to need more samples, and that's because of the stability of an SVD: even in
the matrix case, the stability of an SVD depends on the minimal eigenvalue.
But it behaves kind of nicely, in the way that matrix perturbation theory is
nice.
And the only way it starts becoming bad is if it becomes actually
[indiscernible] we don't know how to solve it. But basically, it's all like
[indiscernible] and nice dependencies, but I'm not going to explicitly give
those theorems here, but they will appear on the paper. And they're kind of
dependencies one would expect. I think even statistically, they're real.
They're kind of information theoretic.
The point is they're mild.
Not like these exponential worst case things.
>>: Do you actually like, for a finite sample, for instance, you would use,
like, the actual minimum eigenvalue for sigma squared, or do you like correct
it?
>> Sham Kakade: So what I would use, I would first project it in K dimensions
and use the minimum in the K space. There's actually a trick I would use. I
would actually try to use a slightly different model for the mixture of
Gaussians. We can talk about that later. I would look at it more like a topic
model case, which I'll get to. Because this is more like a spherical case, and
I don't know how to handle -- there's another case of mixture of Gaussians I
can solve. But maybe let's come back to that in the discussion section. Let's
go to the multiple topic case, unless there are --
>>: We have yet to find the estimated values?
>> Sham Kakade: No, no, it's the eigenvector.
>>: So I can -- oh --
>> Sham Kakade: The eigenvector of the tensor. There are many ways to solve
that problem. In a sense, this is the moment. I've set it up for you, and go
to town. The point is that we know there are algorithms that solve this. They
come from -- you can actually do this with two SVDs as well. So the earlier
work, we were using kind of bad algorithms to do it, because you can project
this down to matrices, and there's this technique called simultaneous
diagonalization. That's another way to do it.
But geometrically, once we understood this, it's like this is the structure.
We know how to solve it. You can think of it as a generalized eigenvector
problem, which is a really clean --
>>: So the eigenvector becomes the estimate of mu?
>> Sham Kakade: The eigenvectors are telling us the means, yes, up to scale.
But we can find the scale and then we can find the Ws.
>>:
How about the estimate of sigma squared?
>> Sham Kakade: We got sigma squared because that's how we got these formulas.
It's the covariance.
So let's go to topics. Topic model case, if there's multiple topics per
document, so now you can have a document like 30 percent cute animals, 70
percent YouTube or something like that. How do we figure this out? Well,
what's the identifiability issue? If every document is about five topics, do we
need documents of length three times five, or how long do they need to be?
And it would be bad if the moment had to increase because in a sense,
estimating higher moments becomes exponentially more difficult in the order of
the moment. But parameter counting suggests third moment is enough still.
Even if you have a mixture of multiple topics, you've got D-cubed parameters
in the third moment.
And for a long time, we thought the LDA problem had to be exponential, and this
is borne out because some theoreticians basically gave very strong assumptions
to solve it, you know, people who did not think you could do this in closed
form.
But it turns out LDA, all you need is three words per document, and it's the
same idea of just maximizing some third moment. So for LDA, basically, you know,
you have to specify a distribution over distributions, because now every
document is about a few different topics. So it means, you know, for every
document you need to specify this distribution over distributions, and these
LDA distributions make these kind of nice pictures. We have level sets here.
So that's what the prior is here. So this pi is, rather than this document
being about topic one and this document is going to be about 30 percent cute
animals, 70 percent YouTube, that's specified there, and that's this kind of
distribution over the triangle, which has a particular form.
In a sense, it's like the nicest possible form of distribution you could put down
for a triangle. And it captures sparsity, because you could have pictures
where these level sets kind of bow out.
And the point is that, you know, you just got to write down what these
expectations look like, and, you know, [indiscernible] it's not so bad. Some
gamma functions. But the point is you look at the structure of these things.
And it's the same trick as for the mixture of Gaussians. Now if you look at
correlations, you get kind of extra terms in a similar kind of way, and you
just subtract them out -- the stuff you don't want.
And this kind of has nice limiting behavior, where the coefficients of the
subtraction basically depend on, like, a sparsity level of the model, which is
often set in practice. So the only extra parameter we need now is the
sum of these alphas, which is often set in practice, and it's like an average
level. It kind of determines the sparsity level.
So if you know that, you can kind of determine what to subtract and it kind of
blends between two regimes. So when alpha goes to zero, you kind of go back to
the single topic case, where these go away. And there's more stuff here,
because it looks more messy, you know. You get all of the symmetrization and
stuff like that. But it's still easy to write.
And as alpha becomes very large, what these moments actually end up looking
like is central moments. When alpha is large, this thing ends up looking like
E of X1 minus the mean, times X2 minus the mean. And same thing
here. And that sort of should be the case, because the distribution on this
triangle starts looking like a product distribution.
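Schematically (a sketch from memory -- the exact constants are functions of $\alpha_0$, the sum of the Dirichlet parameters, and are given in the paper), the corrected LDA moments look like

    M_2 = E[x_1 \otimes x_2] - c_1(\alpha_0)\, M_1 \otimes M_1
    M_3 = E[x_1 \otimes x_2 \otimes x_3] - c_2(\alpha_0)\,(\text{the three pairings of } M_1 \text{ with a bigram}) + c_3(\alpha_0)\, M_1 \otimes M_1 \otimes M_1

and both again reduce to weighted sums of $\mu_i \otimes \mu_i$ and $\mu_i \otimes \mu_i \otimes \mu_i$, so the same whitening and tensor eigenvector machinery applies. As $\alpha_0 \to 0$ the corrections vanish (the single topic case); as $\alpha_0 \to \infty$ they become the central moments just mentioned.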
So these two regimes are kind of nice, and I like this maximizing kind of
perspective rather than the eigenvector one, because basically, you
know, you want to find these pointy directions.
So this skewing is almost like a geometric problem of how do I find the corners
of a convex polytope from its moments. All these different perspectives
are interesting, and that's basically what's going on here. I'm just looking at
this [indiscernible] distribution, skewing things a little based on the way the
measure is put on the simplex, so that, you know, the maximizers point to the
corners. So that's it.
So the LDA has this closed form solution. You can kind of tweak these moments
with lower order things which you know. They have the same structure. The
eigenvectors are the topics.
>>:
It's a closed form solution for a different objective function, right?
>> Sham Kakade: If the moments are exact, it's just a closed form solution,
that's right. If the moments are not exact, it's a plug-in moment
solution, but as I said before, at least in some cases, in classical stats, we
know if you did the moment estimator and then did a step of Newton, that is
about as good as the MLE. And for these problems, it might actually be better --
there are often these infinity problems with MLEs. So in practice, it's kind of
nice to use moment estimators and then local search on top of that.
>>:
[indiscernible].
>> Sham Kakade: I guess inference is hard so you'd have to do sampling or
variational after that.
Okay. So that's LDA. And again, you raised a point and this is nice, because
inference is a headache in these models, and it's hard. And that's an important
point: for, you know, the single topic model and mixture of Gaussians,
it's easy to do inference. For these problems, it's hard, because even just writing
down the posterior is not closed form, yet we can still solve it in this greedy
way. So somehow, we've gotten rid of inference in these moment approaches.
>>: So you can do the same thing as [indiscernible].
>> Sham Kakade: It's the same form of the moment. Just figure it out any way
you want.
>>: Eigenvector, but you have to convert that eigenvector into the parameters?
>> Sham Kakade: The eigenvectors are the distributions, so you just unproject
them and normalize. Okay?
>>:
Is that when you have a prior also in the word distributions?
>> Sham Kakade: No prior on the words. There's a prior on the topic
distribution. Because the word distributions, we think of them as parameters.
Those are the topics. There's no prior there.
>>:
So that [indiscernible] optimum for the moment.
>> Sham Kakade: In a sense. Think about it more from this identifiability
viewpoint than as a global optimum: if you know these moments, you've uniquely
specified the parameters and you have a closed form way to get them, and then
it's kind of a [indiscernible] argument to figure things out.
>>:
Turns it into [indiscernible].
>> Sham Kakade: You'd need to do a second-order step. That's going to be
close to maximum likelihood, because you're basically chasing the maximum
likelihood cost function at that point. But a step of Newton is pretty darn
good. So it will get you this kind of --
>>: That will be local, though?
>> Sham Kakade: Not local. That will be actually -- it's not going to be --
it's going to be close to the maximum likelihood solution. It's going to be close
to the global maximum likelihood solution.
>>:
Is it going to be --
>> Sham Kakade: Because the parameters are close. You'd have to do some, like,
[indiscernible] argument.
>>:
[indiscernible].
>> Sham Kakade: Yes, so this is this wacky -- so now we're getting close to the
discussion section, so I'm just going to answer this. So this is where it's
funny with these average case hardness results, because we know maximum
likelihood is NP hard in general. Yet what this is saying, I'm going to argue
that this will give us something close to the maximum likelihood solution,
because we're getting the true parameters. And I can give you a poly sample
size. So even without doing the step of Newton, I'm getting close to the true
parameters.
And the maximum likelihood solution, the global one, is also close to the true
parameters. So what's the contradiction here? The contradiction is this is
like an average case result. It's saying if the points come from the
distribution, then we can find it. And there are cases where we have average
case results.
>>: But you're completely ignoring the disparity between the dataset and what
the model assumptions are.
>> Sham Kakade: No, no. So the hardness results are showing that there
exist -- if we could solve this problem for all point configurations, then P
equals NP. This is not saying that. This is saying with high probability, for
a configurations that tend to look like the model's correct, those are the ones
we can solve. Which is why, from a complexity point of view, it's very hard to
start understanding average case hardness when the model is correct, because I
think there's cases for these models where statistically, like in the mixture
of Gaussians case, I'm sure there's cases where when it's not spherical,
statistically, it's fine. We know those cases. Yet solving them, we don't
know how to do.
How are you going to prove a lower bound? Only the parameters or something --
it's kind of interesting which cases we know how to give lower bounds for.
Anyway, I need to get back to the discussion. One more slide. I want to say
the idea generalizes to other models. So for hidden Markov models, with length-three
chains you can estimate them with the same idea. You can cook up tensors that
look like that.
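For the HMM case, the triple that plays the role of the three words is three consecutive observations: conditioned on the middle hidden state, the observations before and after are independent, so (after a symmetrization step described in the papers) the same machinery applies. A minimal sketch of forming the empirical triple moment from one long discrete observation sequence (names are mine):

    import numpy as np

    def hmm_triple_moment(obs, d):
        # Empirical co-occurrence tensor of consecutive observation triples (x_t, x_{t+1}, x_{t+2}).
        M3 = np.zeros((d, d, d))
        for t in range(len(obs) - 2):
            M3[obs[t], obs[t + 1], obs[t + 2]] += 1.0
        return M3 / max(len(obs) - 2, 1)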
Some other recent work on harder questions for like structure learning.
So there are these models in linguistics called probabilistic context-free grammars. You
see a sentence like "the man saw the dog with the telescope" -- is the man seeing
a dog holding a telescope, or is he holding the telescope? How you figure it
out involves parsing problems.
There's a wonderful set of questions here for how you learn these models.
Turns out we made some recent progress in showing that even if you had
infinite statistics, the general phrasing of these problems is not
identifiable if you only see sentences, which is pretty frustrating -- this is some
work with Percy. Under some restricted assumptions we can make progress on
these models. But again, there's an interesting gap between the
restrictions we make and the non-identifiability of these models, which we'd
like to make progress on.
Also, some recent work on learning the structure of Bayesian networks with
these models. Suppose you see some DAG, you observe these nodes here. You'd
like to figure out the structure. You don't even know the number of nodes or
the edges, how do you even know this network is identifiable? We can look at
moments and try to figure out what's going on.
And here, we actually need some new techniques, even to just figure out, you
know, identifiability. Because, for example, suppose this network -- these are
the observed nodes -- went up this way, getting more and more nodes as you go up. You wouldn't
hope this thing could be identifiable.
So even just characterizing identifiability turned out to involve some graph
expansion properties, combined with some moment ideas.
>>: So just a question about the hidden Markov models, I was thinking this
could be interesting for [indiscernible]. I was talking to Jason this past
summer and he pointed out, let's say I only looked at adjacent trigram
statistics. How do I get identifiability on chains like A-B-B-B*-A or
C-B-B-B*-C? Say I have those two possible sequences coming out of my model,
where the A is at the beginning and the end, but there's a sequence of Bs
that's too long in the middle.
>> Sham Kakade: So these are models where the assumptions could be wrong. Or
you have to use -- I mean, there are kind of these cryptographic hardness
results where you can make long chains and hide combination locks in them, and
to some degree the hope is that these should be divorced from practice. If I
see some garbage fall out and then, bang, something happens really far out in
the future, cryptographically this is hard.
>>: But these happen all the time in linguistic phenomena, right? Like I have
agreement phenomena that can be arbitrarily long distance. Maybe I should try
to use tree structured models.
>> Sham Kakade: That's right. I would argue that hopefully for linguistics,
the regimes we're in are not the cryptographic hardness regimes, but the
delicate part is how you phrase the model to avoid that.
>>: I'm not convinced that trigram statistics suffice to capture these
specific inter-dependencies either.
>> Sham Kakade: So in practice, these seem to work reasonably well for
initialization. So Jeff Gordon has been doing a lot of work. Even in
linguistics, Michael Collins has been playing around, not with these
eigenvector methods, but with the earlier operator representations, and he's
been getting some reasonable results.
>>: There are certainly things that you can capture, but I just -- okay.
>> Sham Kakade: But I would argue these are modeling questions rather than
fitting questions, which is why these models are very interesting. But now
how do you fit them? These are wonderful questions across the board here.
>>: For length-three chains, what does it mean?
>> Sham Kakade: The chains can be arbitrarily long, but you only need to look
at the statistics of three things in a row to figure things out, because it's
the same kind of idea. You don't need to look at very long chains; the
correlations among three in a row already have enough parameters, but you can
actually -->>: [indiscernible].
>> Sham Kakade: Yeah, from the third moment you can kind of construct the
tensors and figure out --
>>: [indiscernible].
>> Sham Kakade: If you have any higher moment, you can also solve it, because
it gives you the information about the third.
>>: You talked about the third moment; what about the fourth moment?
>> Sham Kakade: The fourth basically has the same structure. You can use
higher moments as well.
>>: It doesn't have to be length three, as you said?
>> Sham Kakade: The thing is, if you have longer chains, you can estimate the
three-wise statistics better. So even in LDA, if you have longer documents, I
would just use the longer documents to get better estimates of my trigram
statistics. I would rarely go beyond third or fourth moments. Third is good
if you have kind of asymmetric things; if things are more [indiscernible], you
want fourth.
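To make that concrete, here is a small sketch (the documents and vocabulary size are toy examples, and the exchangeable bag-of-words assumption is mine): every triple of distinct word positions in a document is one more sample of the same three-wise statistics, so a longer document simply contributes more triples to the estimate.

    import itertools
    import numpy as np

    def trigram_tensor(docs, vocab_size):
        # docs: list of documents, each a list of word ids
        M3 = np.zeros((vocab_size,) * 3)
        count = 0
        for doc in docs:
            # every ordered triple of distinct positions is one sample of (x_1, x_2, x_3)
            for a, b, c in itertools.permutations(range(len(doc)), 3):
                M3[doc[a], doc[b], doc[c]] += 1.0
                count += 1
        return M3 / max(count, 1)

    docs = [[0, 2, 2, 1], [1, 3, 0, 2, 2, 1]]   # toy documents over a 4-word vocabulary
    M3_hat = trigram_tensor(docs, vocab_size=4)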
So in a lot of settings, understanding the tensor structure does provide us
some interesting solutions for these problems. And kind of surprisingly,
they're very simple solutions and people in many different areas are now
actually studying various algebraic properties of these tensors.
The paper for this stuff is forthcoming. To some degree, it's all in previous
papers; we just didn't really understand the tensor structure. We were
solving it with basically simultaneous diagonalization, where we took the
third moment, projected it down to a second moment, and then futzed around
with it that way. Those weren't really the best algorithms. Now we're
realizing this is the structure we have, and there are actually algorithms
from ICA. They've considered power methods; even Tom Yang has some paper with
[indiscernible], his advisor, from 15 years ago on tensor decompositions for
how to do this.
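For concreteness, here is one common form of that kind of power iteration, written for a symmetric tensor that is approximately a sum of orthogonal rank-one terms; whitening, perturbation analysis, and all the robustness details are skipped, and the function names are just illustrative.

    import numpy as np

    def power_iteration(T, n_restarts=10, n_iters=100, seed=0):
        # find one (eigenvalue, eigenvector) pair of a symmetric 3-way tensor T
        rng = np.random.default_rng(seed)
        d = T.shape[0]
        best = None
        for _ in range(n_restarts):
            theta = rng.standard_normal(d)
            theta /= np.linalg.norm(theta)
            for _ in range(n_iters):
                # contract T along two modes: theta <- T(I, theta, theta), then normalize
                theta = np.einsum('ijk,j,k->i', T, theta, theta)
                theta /= np.linalg.norm(theta)
            lam = np.einsum('ijk,i,j,k->', T, theta, theta, theta)
            if best is None or lam > best[0]:
                best = (lam, theta)
        return best

    def decompose(T, k):
        # greedily extract k rank-one components by deflation
        T = T.copy()
        pairs = []
        for _ in range(k):
            lam, v = power_iteration(T)
            pairs.append((lam, v))
            T -= lam * np.einsum('i,j,k->ijk', v, v, v)   # subtract the recovered component
        return pairs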
This is joint work with a number of colleagues over the years. Two particular
people are Daniel and Anima. Daniel has been working on [indiscernible] since
the beginning of HMM stuff. He was an intern with me at TTI. He was a
post-doc at Microsoft. Anima is faculty at Irvine and she's visiting me now.
Both those two are fantastic. They're terrific to work with and just
[indiscernible] colleagues across the board. So thanks a lot.
>>: [indiscernible] many layers.
>> Sham Kakade: Yeah.
>>: So does that deal with this explaining away [indiscernible]?
>> Sham Kakade: Right, so in a sense this is what we're trying to avoid when
we learn these things, because somehow coupling inference with learning makes
it difficult. When we think about learning Bayesian networks, I sort of don't
want to think about these explaining-away things, in the same way that I don't
want to think about which multiple topics each document is about during the
learning process.
I want at least one way to -- it's not the only way; other methods might be
good, but it's a different way of thinking about it: what can we recover from
just the average correlations? But for learning the Bayesian networks, that
paper actually just appeared on arXiv maybe yesterday or today or something.
But there we actually need more techniques, because somehow the problem with
Bayesian networks is, you know, this LDA model is almost like a one-level kind
of model: these are the topics which appear, and these are the words which
appear, and we have an explicit model of the correlations between those topics
in the LDA model.
The problem with the Bayesian networks is we have no idea what the correlation
model is at this level, because we don't know what's above it. So there we're
actually using some ideas based on a sparsity constraint, which is kind of
what the expander condition is doing. There's a paper from COLT this year on
how you take a matrix and decompose it into basically some kind of weight
matrix, but sparse, times some other matrix.
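Loosely in the spirit of that kind of decomposition (this is not the algorithm from either paper, just a toy l1-based heuristic with made-up dimensions): if Y = A S with S sparse and A square and invertible, the sparsest vectors in the row space of Y are, one hopes, the rows of S, and one can hunt for them with a small linear program.

    import numpy as np
    from scipy.optimize import linprog

    rng = np.random.default_rng(0)
    n, p = 4, 60
    A = rng.standard_normal((n, n))                               # dense mixing matrix
    S = rng.standard_normal((n, p)) * (rng.random((n, p)) < 0.2)  # sparse factor
    Y = A @ S                                                     # observed matrix

    def sparse_row(Y, j):
        # min ||w^T Y||_1  subject to  w^T y_j = 1, posed as an LP in (w, t)
        n, p = Y.shape
        c = np.concatenate([np.zeros(n), np.ones(p)])             # minimize sum of slacks t
        A_ub = np.block([[Y.T, -np.eye(p)], [-Y.T, -np.eye(p)]])  # encodes |Y^T w| <= t
        b_ub = np.zeros(2 * p)
        A_eq = np.concatenate([Y[:, j], np.zeros(p)]).reshape(1, -1)
        b_eq = np.array([1.0])
        bounds = [(None, None)] * n + [(0, None)] * p             # w free, t >= 0
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[:n] @ Y                                      # hopefully a sparse row of S, up to scale

    row = sparse_row(Y, j=0)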
There's a nice paper by Dan Spielman, and we're really utilizing those
techniques along with some of these moment ideas. But even -- I mean, the
right way I used to think about it is: what do these things need to even be
identifiable? If you can make a formal argument for that, in many cases I
think that does reveal something about the structure of the problem.
But if it's not even identifiable, that gives some intuition as to what the
hard cases are. And in these kinds of hidden variable models, identifiability
is fundamental, because we don't even know the structure of these things, and
it's clear you can sometimes put in other nodes to give rise to the same
structure.
So it's just a different set of techniques, though. Other questions? So I'll
be around for the week, and it would be great to chat with people, because we
are playing around with these as algorithms. People understand these
algorithms in other settings, and they're very natural, particularly with
features. It's just finding some pointy directions in your data.
It's like the LDA problem. The problem is, if you try to do this in your data
and it's noisy, all the words lie really far away from the corners. But
somehow the averaging is what allows you to say I'm finding these pointy
directions. You can't just find the pointy directions in the raw data,
because the words don't lie anywhere near where they should be.
And even for documents of length three, you know, if I say this document is
about sports, history, literature, and dogs, and there are only three words in
it, you'd say, you're crazy, there are only three words. So somehow you really
do need to do these averaging techniques. And they're an interesting class of
algorithms for initialization. So we have been toying around with it. It's
just another bag of tools we have.
If you're interested in playing around with it, it would be fun to chat, and
just more broadly about this set of techniques, because I don't think they
solve anything by themselves, but thinking about any problem differently
always helps us, and this is a new set of tools that I think should be
complementary to other approaches.