>>: Yeah, so it's my pleasure to welcome Yuval Rabani to Microsoft Research.
Yuval is a professor of computer science at the Hebrew University of
Jerusalem. He's done a lot of work on algorithms, approximation algorithms,
[inaudible] algorithms, and many other things.
And so, yeah, he's been here all week. He'll be here all of next week. It's
always great to have him here. And he'll tell us about learning, so how to
learn mixtures of arbitrary distributions over large discrete domains.
>> Yuval Rabani: Thank you. So, yes, the title is almost as long as the
talk. I'll be talking about joint work with Leonard Schulman and Swamy, most
of which was done quite a few years ago when both Swamy and I were visiting
Caltech.
Okay. So I want to start with some motivation for this problem. And this -- both the motivation and the problem itself arose in two different research
communities, so one is our wonderful theory community, and more or less in
parallel people in the machine learning community suggested similar problems.
So the problem is to understand the structure of a corpus of documents. And
for the purpose of this talk, we can think of documents just as being bags of
words. So we're not really interested in higher level grammar, just which
words appear in the document, perhaps also the frequency of these words
matters.
And of course if you really want to deal with documents, you have to
normalize the words and remove all sorts of irrelevant stuff, such as
prepositions and uninteresting -- other uninteresting words.
So I'm not going to go into how people actually deal with documents. Mostly
because I have no idea what they do. But we will think of documents as being
these bags of words.
And we also have some model of how these documents are generated. So it's
a very simple model. Of course most people don't generate documents this
way, but that's how we think of them being generated. So in order to
generate a D word document, we have some distribution P over words. And we
sample from this distribution independently D times. So that's the way we
generate documents.
Now, we think of these documents as being generated in the context of various
topics. And we will assume that we have a relatively small number of
different topics, we'll denote the number of topics by K.
So in fact if we look at the entire corpus of documents, each document in the
pure documents model that I'm presenting now, is generated from one of these
K topics. Each topic is defined by its own distribution over words.
So basically documents from different topics differ in the distribution that
is generating them. We have to assume something about the separation between
these distributions, and I'll get to that.
So we have these K distributions, P1 through PK. And in order to generate a
new document, what we do is, first of all, we choose a topic for that
document. And topic I is chosen with probability WI. Now that we've chosen
a topic I, we generate the document itself by choosing words independently
from the distribution PI. So that's the entire model. And you may argue
that perhaps the documents that you're familiar with are generated or not
generated this way. Papers, for example.
>>:
[inaudible].
>> Yuval Rabani: I'm sorry?
>>: [inaudible] the size of the document is [inaudible].
>> Yuval Rabani: Yeah. So we will think of these being fixed, but in
general you could think of different documents as having different lengths as
well. And maybe that's also chosen at random. But we will actually consider
fixed D. I'll get to that.
Now, here's another situation where a very similar phenomenon or a very
similar setting could serve as a simple model of what's happening. We
have various customers, so now instead of documents we have customers. And
we're collecting purchase history of these customers. And we can think of
each customer as -- this may be a bit more realistic -- so you go shopping and you
have your distribution P over the possible things you want to buy, and you
just sample from this distribution some -- a number -- say D independent
samples, and that's your -- that's what you buy, for example, in the
supermarket on that day.
And, again, in the pure model here, customers -- so in general when you think
of these things, you assume that people are very simple and their behavior
can be summarized by simple statistics. So here it's a very simple
assumption. There are K types of people. And when someone arrives, say, in
the grocery store, that person is -- has type I with probability WI. And
then whatever this person is going to purchase is determined by the
probability PI, which is chosen according to the distribution W.
Okay. That gives us essentially the same data as documents because we don't
care if our items are words or groceries. But more generally this model is a
very simple model for various data mining applications. Or it's a model of
how the data is generated for various data mining applications, including
things like document features that we just talked about, customer taste. You
can think of this as sort of outdated when Web pages used to have hyperlinks.
Now they don't, right, because you can get to any Web page using a search
engine so you don't need to link them really. The links are -- I don't know,
they generate Java code that does something, but they're not really links.
So back when Web pages had real links, you could think of these hyperlinks as
being generated, so through the same process.
And you can think of various observational studies. So suppose you're
studying, I don't know, patients or whatever. You encounter a specimen and
all you can do since the person is not really cooperating with you, maybe
it's a bird in nature, they're not providing you -- they're not filling out a
detailed questionnaire. You can observe this thing for a little while and
gather just a few attributes of the specimen that you're observing. And
let's say those attributes are generated from the distribution that defines
the species of this thing. So various observational studies could be modeled
this way.
And the general properties that -- under which we want to think of this model
are the following. So every specimen has a large number of possible
features. So there's a huge number of words, for example, in almost any
language that I'm aware of. And each specimen -- so if you look at a
specific document, it usually doesn't contain the entire dictionary. It
contains only a subset. So the documents are relatively short. Think of -- the specific thing to think about is, for example, Twitter tweets. Those are
very short documents. So those are the things we want to focus on.
So there's only a very small sample of the distribution that generates this
document. And that's of course a bad thing because we don't get any good
statistic on the distribution from any individual document. But, on the
other hand, we want to assume that the population overall behaves very nicely.
It falls, for example, into these K different categories that are well
specified. So this is the kind of setting that we want to be interested in.
And now you can think of various -- right. So this is a model that generates
data, and now you can think of doing all sorts of things with the data, like,
I don't know, ignore it, for example, would be my first inclination. But
then you want to generate a paper as well, so you need to do something rather
than ignoring the documents.
So the goal that we will talk about here is we actually want to infer from
the documents the topic distributions. So we want to learn the topics. We
don't want to classify the -- I mean, that would be perhaps a different
objective, is to classify the documents into topics. So we don't want to
do -- to classify the documents; we just want to learn these distributions P1
through PK.
So this is what I will call learning the mixture model. Because this is
really a mixture model. We have K different distributions and our documents
are generated from this mixture of distributions.
So I'll define now the problem more precisely. We have a known dictionary
and without loss of generality, this dictionary will be the numbers 1 through
N that represent the N different words in the language.
We have as input M samples of D tuples from -- so in fact it's with
repetition, right, so we have M samples of these words.
>>:
So relating to [inaudible] --
>> Yuval Rabani: Yeah.
>>: -- D is much smaller than N?
>> Yuval Rabani: D is much smaller than N, yes. Yes. In fact, N will be
very large. So think exactly of this analogy. Think of the dictionary of,
say, English and tweets and Twitter. I don't know how many words they have,
but not many words. I guess in German they have fewer words because the
words are longer. So they have bigger problems in learning things.
>>:
[inaudible] are you allowing repetition in the sample?
>> Yuval Rabani: Yes, of course. I mean, yes. The same word could appear
many times. Could appear D times in the document. We're sampling
independently, so in general you could have -- now, how is the sample
generated? So each sample, each one of these M samples is generated by
picking a number J between 1 and K with probability WJ.
So here it's important to understand these Js are independent and identically
distributed, but they're hidden from the observer of the data. So we don't
know which J was chosen.
And then we draw D items independently and identically from the distribution
PJ. And that's how a single sample here is generated. So the next sample
would be independent from this.
Our goal is to learn the model. The model means the mixture constituents,
the P1 through PK, and their mixture weights, W1 through WK.
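To make the sampling process concrete, here is a minimal sketch of the generative model in Python; the NumPy-based layout, with the constituents stored as the columns of an N-by-K array, is just an illustration and not anything from the talk or the paper.

```python
import numpy as np

def sample_corpus(P, w, M, D, seed=None):
    """Draw M documents of D words each from the pure topic model.

    P : (N, K) array whose columns are the constituent distributions P1..PK.
    w : (K,) mixing weights summing to 1.
    Returns an (M, D) integer array of word indices in {0, ..., N-1}.
    """
    rng = np.random.default_rng(seed)
    N, K = P.shape
    docs = np.empty((M, D), dtype=int)
    for m in range(M):
        j = rng.choice(K, p=w)                       # hidden topic, chosen with probability w[j]
        docs[m] = rng.choice(N, size=D, p=P[:, j])   # D i.i.d. words from P[:, j], with repetition
    return docs
```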
>>:
PKs have large support size, right?
>> Yuval Rabani: Yes. Each P -- each PJ has support size N. And we're seeing a very small sample of it.
>>: You care about sample complexity M or --
>> Yuval Rabani: We care about everything. So I'll soon mention what
exactly we care about. But we care about all these parameters.
>>:
Do we know K?
>> Yuval Rabani: We can assume that we know K, yes. So we will learn these things under the assumption that K is known.
>>: [inaudible].
>> Yuval Rabani: I'm sorry?
>>: [inaudible] for different --
>> Yuval Rabani: Yeah. You could do that, yeah, so then you would learn different things, yeah. But we will assume that we know K. Since the whole model is unrealistic and our algorithms have perhaps prohibitive constants, then why not assume that you know K. That's the least of your problems here.
So I'll also denote our -- obviously this can't always be achieved, right,
because there could be some tiny probability that the sample that I get is
completely unrepresentative of these distributions and weights. It's
something way off the charts. So I will have to allow a failure probability
which we will denote by delta, and from now on we're going to assume that
it's just some small constant. Even though most of the stuff that I'm going
to tell you you can control this constant and drive it to be as small as you
like. But I'm just going to ignore it. So there's some small constant
probability of failure. I don't know, 0.01. Yes.
>>:
[inaudible] even failure?
>> Yuval Rabani: The failure is that you fail to generate a model that is correct within -- I'll specify what I mean by a correct model. Obviously you're not going to get exactly these Ps and exactly these Ws, but --
>>: But does it ever know that it is incorrect? Does it say I'm sorry --
>> Yuval Rabani: No. How would it know? So suppose your data, for
example -- let's say for some reason each of these probabilities has some
other word that has tiny, tiny probability, but for some reason all of your
documents focus -- somehow chose only this word, so it appears D times. I
mean, once you choose one of the Ps, this tiny probability word just appears
there D times. This could happen with an exponentially small probability. What would you do then? I mean, then basically your
distributions will necessarily look like they have all the weight on this one
word that has tiny probability in the real model. You can't avoid that
because that could be your data.
>>:
[inaudible].
>>:
Is the algorithm deterministic?
>> Yuval Rabani: The algorithm itself is not deterministic.
>>: Okay. So this failure probability is --
>> Yuval Rabani: The failure probability is over the data and the toin
tosses of the -- sorry, the coin tosses, right, not the toin tosses. I
usually say the right letters but not necessarily in the right order.
Okay. So is the model well understood now? Good. Okay. So I want to
mention a little bit stuff about learning mixtures. So learning mixtures of
course is a problem that goes way back. I'm not going to detail the entire
history of learning mixtures, but I think a good starting point would be
Dasgupta's paper, because that was considered at the time to be a
breakthrough result.
So Dasgupta and many papers following his result focused on learning mixtures
of Gaussians. That is sort of the most standard, let's say, model of mixture models.
So this is just a mixture of K Gaussians in RN. You have this distribution.
It's a mixture. You get -- you sample points from this distribution, and
then you want to infer the K Gaussians that generated this distribution. And
until Dasgupta's paper was published, I think no one knew how to do this, at
least with some formal proof that it actually works.
And following his paper, which gave a result for spherical Gaussians with a rather large separation between their centers, there was a
long sequence of papers that culminated a few years ago with these two
results, Moitra and Valiant and Belkin and Sinha, that resolved the problem
completely. So they learned an arbitrary mixture of Gaussians within
statistical accuracy as you like.
K here, by the way, is assumed to be a constant, and we will assume that as
well. So these algorithms don't run in time which is very good as a function of K.
>>:
[inaudible] separated by [inaudible] then we can do separation --
>> Yuval Rabani: Yeah.
>>: [inaudible].
>> Yuval Rabani: Yeah. Yeah. But here for the general case you have a really lousy running time, which is K to the K or something like that. But the dependence on the dimension is very nice. Okay.
Then there are some other models that people studied. I'm not going to go
into this in great detail. People know how to learn product distributions,
for example, over the hypercube. And so there are various results about
learning product distributions over these domains.
Then hidden Markov chains, heavy-tailed distribution. So there are all sorts
of distributions, k-modal distributions. There are all sorts of -- these are
mixtures of these types of distributions. People know how to learn these under certain assumptions -- I'm not going to go into the details of what they do there or what the
assumptions are, but I just want you to be impressed that there are a lot of
mixture models that people try to learn. So this is another one of them.
But I'm doing this in order to point out a significant difference between all
this other work and what we're doing here. So the main issue that distinguishes
our problem from all these other mixture model problems is the issue of
single view versus multiview samples.
So in the case of Gaussians as a representative example, mixtures of
Gaussians can be learned from single-view samples. So each sample contains
one point from the Gaussian that was chosen to generate it. And then the
next point is generated perhaps using a different Gaussian. So the Gaussian
is chosen again and so forth. So -- and you can learn Gaussians from
single-view samples.
On the other hand, the mixtures of these discrete distributions cannot be
learned from single samples. Because from a single sample the only thing
that you learn is the -- or that you can possibly learn is the expected
probability of every word.
So think of this as this represents the simplex. It should be the N simplex,
but here N equals 3. And let's look at -- so this represents some
distribution over words, this point. Since our dictionary has only size 3
because that's the only thing I can draw on a two-dimensional slide until
somebody here invents higher dimensional slides, so this is just a
distribution over three words. You're not taking up the challenge?
>>:
[inaudible].
>> Yuval Rabani: High-dimensional slides. Sounds very useful. So, for
example, I'm going to -- I'm also going to use three documents -- or, sorry,
K is going to be 3 because K equals 2 is boring and K equals 4 is too much.
But there's no connection between the 3 here -- between the N equals 3 and
the K equals 3.
So, for example, this expectation can be generated from these three topic
distributions. But, on the other hand, the same expectation can be generated
from these three topic distributions. And there's no way for us to
distinguish between these two cases using single-view samples.
And as you see, these things are pretty far -- these models are pretty far
apart. So I won't be able to reconstruct the model correctly from
single-view samples. You need to use multiview samples.
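As a tiny numeric illustration of the point (my own example, with N = K = 3): the two models below have exactly the same mean, so single-word samples from them are identically distributed, even though their constituents are far apart.

```python
import numpy as np

w = np.array([1/3, 1/3, 1/3])          # equal mixing weights

A = np.eye(3)                          # model A: topics at the three corners of the simplex
B = np.full((3, 3), 1/3)               # model B: all three topics equal to the centroid

# Same mean mu = sum_j w_j p_j, hence the same single-view (one word) distribution.
print(A @ w)                           # [0.333... 0.333... 0.333...]
print(B @ w)                           # [0.333... 0.333... 0.333...]
```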
There are also some other differences that might be interesting. I'm
mentioning them because you might think of them as issues that are worthy of
thinking about and maybe in a more general setting.
So one other important difference is the issue of how much information does a
single sample point give you relative to the model size.
So the model size is not something very clear here, right? We're talking
about real numbers. But you can -- if you have -- if you represent these
real numbers with limited accuracies, say accuracy 1 over polynomial in N
or -- actually, you need 1 over let's say -- yeah, numbers that are 1 over
polynomial. And then if you're learning Gaussians, the entire model can be determined because there are K Gaussians. For each Gaussian you need the
center and the axes of the Gaussian, right, of the ellipsoid, just the
matrix. And that's about N square log N bits altogether for each Gaussian
with this accuracy of 1 over polynomial.
And then each sample point is, again, if you have similar accuracy, of order N log N bits, so it's not the same size, but it's within the same
ballpark. And this is typical of most learning problems. If you even think
of classical PAC learning problems, what you get from an individual sample is
about the same ballpark information as the model that describes the
phenomenon that you're observing.
But in the learning topic models, we need to specify K distributions within
reasonable accuracy, so that's something like order KN log N bits. But in
each sample point we get only D log N bits. We get D words. Each word is
specified by log N bits because it's a word between one and N. And if D is
constant, as we would like it to be, this is a much smaller size information.
So this is in general an interesting question, what can we learn from very
sparse information in our samples.
>>: So [inaudible] so log N [inaudible] so log N [inaudible] just in general comes from the polynomial, like [inaudible] --
>> Yuval Rabani: Yeah, exactly. Exactly. Yes. You can plug in [inaudible] desired accuracy instead of the log N. I just thought it would be easier to understand it this way.
Okay. I'll skip the last point. And -- but I do want to mention this
without getting into the details. So this entire area of learning mixture
models uses a rather well-defined toolkit. So one of the tools that is used
is just spectral decomposition. We look at eigenvectors corresponding to
certain eigenvalues. That's one of the tools that's widely used.
Another tool that is widely used is -- so this is some form of dimension
reduction, really, the spectral decomposition. You use -- you do principal
component analysis, so you take the higher singular values and you look at
that subspace. That's often useful.
Another form of dimension reduction that is very useful is just doing random
projections. Dasgupta's original paper actually uses random projections. So
it reduces the dimension to K. Because we only have K centers, we can reduce
the dimension. And then in the small dimension you can enumerate over things
efficiently. That's sort of what he does.
And the other thing that is used is what is called the method of moments. This is in fact what gives the final answer on the Gaussians mostly, this method of moments. And what this method of moments, or sort of in its very general form, is it tells us that if we know the first few moments of the distribution we can reconstruct all of it.
And this happens to be true for mixtures of Gaussians, for example. If we
know -- I don't remember how many moments exactly we need, but we don't need
a lot of them. Once we know those moments, in sufficiently many directions,
then we can reconstruct the mixture.
So all we have to do is to figure out what these moments are given the
sample. And, in fact, if we know -- so most of the error comes from the fact
that we don't know these moments precisely.
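The estimation half of a method-of-moments argument is simple; here is a minimal one-dimensional sketch, purely as an illustration of the idea rather than anything from the paper.

```python
import numpy as np

def empirical_moments(samples, r):
    """Estimate the first r raw moments E[X], E[X^2], ..., E[X^r] from a sample.

    A method-of-moments algorithm feeds these estimates into a solver for the
    mixture parameters; most of the final error comes from the gap between
    these empirical moments and the true ones.
    """
    x = np.asarray(samples, dtype=float)
    return np.array([np.mean(x ** i) for i in range(1, r + 1)])
```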
Okay.
Let's --
>>: [inaudible] so you said [inaudible] but why does the separation
[inaudible]?
>> Yuval Rabani: Because that was what came up with -- once you project, you
don't want -- if the Gaussians collapse to -- if their centers collapse to be
very close and they're just one on top of each other, then you can't do
anything. And the distance that he could figure out from the methods that he
was using after projection was this thing. And that was improved later. So
that was just the first crude result. But, again, by now I think the main
contribution of that paper is that it generated all the following work.
Because I don't think his methods are useful anymore.
Okay. So back to topic models. What is the problem -- what are we trying to
optimize here? So we have a learning problem. Of course let's sample -- you
know, let's wait for a hundred years, sample as many documents as we can
gather those hundred years, and hundred years from now maybe we will all be
dead and no one would care about this problem. So that's a wonderful
algorithm. It succeeds with probability one.
So but we want to minimize certain parameters. So obviously we want to
minimize the number of samples. I don't know, we want to classify Twitter
traffic into K topics assuming that people only talk about K topics, which
makes sense. Then you would want to take as few tweets as possible in order
to generate it.
The other thing is the -- we want to minimize the number of views that are required. Here again, we want to be able to classify -- we want to be able
to determine topic distribution from tweets rather than books. It would be
much easier to determine topic distributions from encyclopedias, for example.
Because they have a lot of words. So they've sampled the distributions a
lot.
And of course we want to minimize the running time of our algorithm to make
it as efficient as possible in terms of the parameters of the problem which
are M the number of samples, D the number of views per sample, N the dictionary size, and K the number of constituents in the mixture.
So you can think of M and N as being very large for our purposes. Of course
there is the other problem where they're small and the other things are
large. But we are considering the scenario where M and N are large and D and
K are small. So D and K will be considered to be constants. And M and N are
the things that we want to be asymptotically good with respect to.
Now, there are some trivial bounds that we can immediately discuss. So, for
example, if the total number of words that we see in our entire sample, so
that's M times D, right, because there are M samples, each sample has D
words, if this is little O of N, then there are definitely models that we
can't learn. Because we haven't even seen words that might support a large
portion of the distribution. So we can't really bound the error very well if
we see too few words.
So in this case, if M times D is little O of N, then there's no -- we have to
have at least linear size samples in the dictionary size.
>>:
So what is the error [inaudible] to be then?
>> Yuval Rabani: I haven't yet talked about it. I'll talk about it soon.
But under any reasonable error that you can think of. So if we have a sample
size of little O of N --
>>: [inaudible].
>> Yuval Rabani: Yeah. So think of all the words as having roughly the same
probability, say, within some constant factors and you want to learn those
things, then with little O of N words sampled, you haven't seen most of the
distribution yet. You've seen very little of the distribution. The
entire -- all the words that you collected are supporting little O of 1
weight of the probability.
So you definitely are not going to get something good under any reasonable
measure of accuracy that you can come up with, unless your accuracy is I want
to output anything, and that's good. It's a reasonable, I guess, thing that would simplify our work a lot.
Now, the other extreme is the following. Suppose our -- the number of views
is very large. Let's say it's N log N times some large constant. Then just
by coupon collector, or, you know, similar arguments, we've seen -- so a
single document gives us a very good approximation to the probability PJ
that generated this particular document. And then all we have to make sure
is that we have enough documents to cover all the topics. And then we can
learn -- so this would make the problem trivial.
So of course --
>>: But you still need to figure out the WJs.
>> Yuval Rabani: I'm sorry?
>>: So you still have to figure out the WJs.
>> Yuval Rabani: Yeah. So assuming the WJs aren't too -- that would only
say how many documents you need to see until you've covered all the topics.
>>:
[inaudible].
>> Yuval Rabani: Yeah. So that would be -- the smallest WJ would determine -- it would determine M, but it wouldn't affect anything else. M would have to be something like 1 over the smallest WJ or maybe something slightly larger than that to be sure that with high probability you encountered all topics.
So this would make the problem trivial.
Okay. You asked about accuracy. So I'm -- by the way, I'm mostly presenting
the problem again, so I don't know if I'll have much time to go into the
proof of anything. So the output of our learning algorithm are the -- are
some distributions. This thing is meant to be tilde, these things, except
that the Maxoff [phonetic] software doesn't have tildes. Maybe PowerPoint is
better --
>>: Of course.
>> Yuval Rabani: -- in that respect. Yeah. So I found something that you guys can brag about.
So we have P1 tilde through PK tilde, and maybe also the weights. In fact,
some of the algorithms don't output the weights, they only output the
probabilities, and others also output the weights. And in this respect you
can think of -- or people have considered two types of error. L2 error,
which is just -- so the output has L2 error epsilon. If there is some
permutation over the Ps such that if you compare a PJ prime to the matching
PJ, then their L2 distance is bounded by epsilon. And then other papers
consider the L1 error. A similar thing, just using L1 or a total variation
distance between the distributions.
And you should notice of course that this is a much stronger requirement because L1 error epsilon implies L2 error epsilon, but L2 error epsilon only implies L1 error epsilon times root N. So in fact this is the kind of error that we would like to get, the total variation distance is small.
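Spelled out, this is just the standard comparison of norms on R^N applied to the difference between a recovered constituent and its matched true one (nothing specific to these papers):

```latex
\|x\|_2 \;\le\; \|x\|_1 \;\le\; \sqrt{N}\,\|x\|_2
\qquad \text{for } x = \tilde{p}_{\pi(j)} - p_j \in \mathbb{R}^N .
```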
>>: There is a permutation because if you don't know what the WJs are, there is basically no way to distinguish between [inaudible].
>> Yuval Rabani: Even if you do, what happens if the WJs are identical? All of them are 1 over K.
>>: Right.
>> Yuval Rabani: Then you don't know the permutation. But there is some permutation for which this would be true.
Okay. I want to explain what's known. And this is what we need for that.
Yeah. That doesn't look too promising. It's a lot of notation. But I need
that in order to explain the bounds that are -- so it's not that difficult.
Let's go over it very slowly.
First of all, we'll denote by P the constituents matrix. So that's the
matrix whose columns are the P1 through PK. So this is a matrix that has N
rows and K columns.
And mu will be the mean, so that's just the sum of WJPJ. That denotes the
expected probability of all the words. And M, the matrix M, is the matrix
of -- it's the pairwise distribution matrix. So this would be the
distribution of documents that contain two words. If I would generate from
this mixture a sequence -- an infinite sequence of documents, each one
contains two words, then MIJ would be the probability that the pair of words
IJ appeared in a document. This is the distribution. And then we'll denote
by V. V will be sort of a variance, so it will be M minus mu mu transposed.
Mu mu transposed would be the pairwise distribution if all the documents were
generated from this mean distribution mu.
So this denotes this high-dimensional variance. It's a variance matrix, V.
It's the difference between the pairwise distribution and our mixture and the
pairwise distribution had we replaced the mixture by its mean.
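In code, these three objects fit together as follows (a sketch assuming the NumPy conventions used above; nothing here is specific to the paper):

```python
import numpy as np

def mixture_statistics(P, w):
    """Mean mu, pairwise matrix M, and variance matrix V = M - mu mu^T.

    P : (N, K) constituents matrix with columns P1..PK; w : (K,) mixing weights.
    M[i, j] is the probability that an ordered two-word document is the pair (i, j).
    """
    mu = P @ w                        # expected word distribution, sum_j w_j P_j
    M = (P * w) @ P.T                 # sum_j w_j P_j P_j^T
    V = M - np.outer(mu, mu)          # equals sum_j w_j (P_j - mu)(P_j - mu)^T, rank at most K - 1
    return mu, M, V
```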
Now we have this linear algebra notation for matrices, so sigma_i would be the ith largest singular value, and lambda_i would be the ith largest eigenvalue -- so this assumes the matrix is symmetric; that's the lambda notation.
And the condition number is the ratio between the largest singular value of a
matrix and the smallest nonzero singular value of a matrix.
And finally we'll define a spreading parameter. So the spreading parameter
is intended to capture the fact that we need some -- that these PIs need to
be distinct in some sense. If they're all the same, then obviously I'm
not going to generate K different distributions. And if they're very close
together, distinguishing between them requires more effort. And this is
captured by this zeta spreading parameter, which is the minimum between two
specific parameters. One of them essentially captures -- so it's written in
this way, but it essentially captures -- it's the minimum total variation
distance between two constituent distributions, really. It's written here in
terms of L2 for various reasons, but that's what it tries to capture.
>>: [inaudible] if they're really, really close? I mean, you can -- can't you just [inaudible]?
>> Yuval Rabani: Ideally we would like to do that, but we don't know how to.
I mean, none of the results really knows how to do it. They all rely in one
way or the other in knowing something about the separation.
So some of them, for example, would work for any separation except that you
need to know what separation it is and you have to work harder in order to
achieve this.
But you're right. Ideally you would just want to generate a model that is
statistically close to the correct model. And if there are two constituents
that are very close together and you don't have a good enough sample, you should
be able to replace them by just one constituent that would -- so
unfortunately we don't know how to do that.
And then zeta 2 is just -- it's really the width of this collection of points
in every direction where they have a width different than zero. So they're
spread somehow in space in the simplex, in the N minus 1 simplex. And
they -- of course they're K points, so they lie on a K flat. So zeta 2 is the minimum, over all directions in this K flat, of the width of the set of points in that direction.
Okay. So now we can say what is known. First result that I want to mention
is by Anandkumar, et al. It has certain assumptions or it has one big
assumption, which is that P, right, that's the constituents matrix, is full
dimensional. So this means, for example, that our constituents don't lie on
the line, for example, in the simplex or on any flat that is -- has dimension
smaller than K minus 1. Then -- sorry. Yeah, K minus 1. K minus 1 or K? K
minus 1. It has affine dimension K.
In this case -- so this turns out to be a very strong assumption, because in
this case all you need is documents that have three words. Once you have
documents that have three words -- of course, if they're longer, you can just
split them. Then you can learn the model. Except that this algorithm
actually uses a pretty large sample size. So it learns the model with L2
error, not L1 error. This is the L2 error that it achieves. And this is the
sample size that it achieves.
So you should think of these singular values and eigenvalues as being
something which is about 1 over N, just to get -- in the worst case. So of
course there could be constituents P for which these are very small -- for
which these are very large. And then that's good for us. But in the worst
case, you could have a well-separated instance where all the entries are,
say, between 1 over N and 2 over N, or between, say, half over N and 2 over
N. And then these numbers will be very small. They will be 1 over N.
So this gives us a sample size which is polynomial in N and in epsilon
squared, but, on the other hand, it's also polynomial in K. So the behavior
with respect to K here is very good.
>>:
So what is C?
>> Yuval Rabani: C is some constant. I didn't want to specify it precisely,
partly because I don't remember what it is.
So they gave two algorithms that do this. Then this is our result. So this
is our result in comparison. We make no assumption, so we don't make the
full rank assumption. But unfortunately once you don't make the full rank
assumption, you must assume that you have larger documents. So we assume
that the documents have 2K minus 1 words. And in fact this is necessary if
you don't -- if you assume a general mixture.
The example is very simple. You can -- so this is an illustration of the
example. Think of a line passing through the simplex, and think of your
mixture constituents as being points on this line.
It turns out that you can find two different configurations of points on the
line that have exactly the same first 2K minus 2 moments. And if you sample
only 2K minus two words in each document, you can't get any information
beyond the first 2K minus 2 moments.
So in fact these two different examples, you just shift these things a bit.
Not a bit. Overall they're shifted by a lot. And you get two examples that
have exactly the same sample distribution over documents that have size 2K
minus 2. So you need at least 2K minus 1 words in order to get any
difference between these two far-apart models, and we can actually learn once
you hit that size document.
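A toy version of this obstruction for K = 2, checked numerically (my own example; the construction in the talk is for general K and lives on a line inside the simplex): two different two-point mixtures that agree on the first 2K - 2 = 2 moments but are separated by the (2K - 1)st.

```python
import numpy as np

def raw_moments(xs, ws, r):
    """First r raw moments of a discrete distribution on points xs with weights ws."""
    return np.array([np.sum(ws * xs ** i) for i in range(1, r + 1)])

xs_a, ws_a = np.array([-1.0, 1.0]), np.array([0.5, 0.5])
xs_b, ws_b = np.array([-2.0, 0.5]), np.array([0.2, 0.8])

print(raw_moments(xs_a, ws_a, 3))   # [ 0.   1.   0. ]
print(raw_moments(xs_b, ws_b, 3))   # [ 0.   1.  -1.5]  -- the third moment finally tells them apart
```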
We get an L1 error as opposed to the L2 error there. And this is the sample
size. So you see that, first of all, in terms of the dependence on N, the
sample size is close to the best that we can do, because we know we need a linear sample, and this is N poly log N, in terms of N alone. In terms of K, on the
other hand, we have this additive factor which is exponential in some
horrible things, K square log K, essentially.
Now, what's interesting here also is that suppose K is very small compared to
N. Then this would be the dominant part. But this part -- this size sample
we only need documents of size 2. And beyond that we need only this many
documents of size 2K minus 1.
So you could learn from a lot of very small documents and a few very large
ones, except that the few very large ones are something exponential or worse
than K.
So you see the comparison to the previous work. The dependence on N here is
much better. This would be the two algorithms if we want L1 error instead of
L2 error, if we translate their L2 error into L1 error. The first algorithm is about N to the 8th, the second one is N cubed.
With a sample size of N cubed, they're looking at three-word documents. You
get pretty accurate statistics on at least the more common words among all
the triples. So you get pretty accurate statistics on the distribution of
documents.
So the entire problem that they have is to reconstruct the model from exact
statistics, basically. Because they get very good statistics. On the other
hand, in our case we get very poor statistics on the document distribution.
We only have N poly log N documents. So the statistics on documents, even of
size 2, is not very accurate. And still we're able to learn. But, on the
other hand, they have -- their dependence on K is polynomial and here the
dependence is really bad.
>>:
These two algorithms [inaudible].
>> Yuval Rabani: Um-hmm.
>>: And it's showing you the assumption that --
>> Yuval Rabani: Yeah.
>>: -- [inaudible].
>> Yuval Rabani:
I'm sorry?
>>: You still need the [inaudible] they still need this assumption
[inaudible].
>> Yuval Rabani: Yeah. They need this assumption. Yeah, so there is this other thing. Okay. It's 4:28, so when do I stop?
>>: Five minutes.
>> Yuval Rabani: Five minutes. Okay. So I'll skip any notion of algorithm.
This is what the algorithm does, but let's skip this. We won't have -- I
want to talk a little bit about mixed topic models. So we talked about pure
topic models. Pure topic models are those where each document is generated
from a single topic. In the mixed topic model, each document is a mixture of
topics. So it's not just one topic. It's not talking about one thing, it's
talking about many different things. Or some different things.
And in order to -- so here's the model. Again, in order to generate a D word
document, we simply draw D independent samples from some distribution P.
This is exactly the same as the pure documents model, except that now the
distribution P will be chosen in a more complicated way.
So we still have K topics, K distributions, P1 through PK. But we also have
a probability measure theta on the convex hull of P1 through PK. So the
convex hull is just some convex set inside the simplex.
And in order to generate a document, we choose a distribution P in the convex
hull of these pure topic distributions according to the probability measure
theta. And then we sample from this distribution P. So this is the model.
And now one example of theta that people talk about a lot is this latent
Dirichlet allocation. Other than saying the words, I don't know much about
it. So it's some specific class of distributions that has nice properties.
That's as much as I can say about it. Maybe some other people can say more.
Okay. So what's known about the mixed topic model? One thing that's known
is this old result of ours that was never published -- well, I think it's on
the arXiv, but that's it -- that essentially completely solves the problem
except that we can only do it when the number of topics is two. So if there
are two topics and there's an arbitrary distribution on convex combinations
of these two topics, then we can reconstruct the entire model including the
distribution on the convex hull of these topics that generated the documents.
And I have to explain what guarantees we actually get. So we basically have
two topics, so that defines a segment in the simplex. And we have some
arbitrary distribution on the segment. And what we're reconstructing is this
distribution on the segment and the segment itself of course.
So the segment could be slightly inaccurate and the distribution on the
segment could be slightly inaccurate. And what we can guarantee is that with
high probability the transportation cost between the model that we generate
and the true model is very small. But here the error depends on the document
size. So the bigger the documents, the better error we get.
And, in fact -- let me go back. In fact, this turns out to be necessary. So
without further assumptions -- if, for example, you know that your
distribution on this convex hull is -- comes from a latent Dirichlet model,
then perhaps you can do better. In fact, it's known that you can do it
better. But if it doesn't, if it's an arbitrary distribution, then you can't get an error better than this. So there are always two distributions that are far apart, that are this far apart in transportation norm, that generate exactly the same D word distribution. So you won't be able to distinguish -- this is an issue of -- really of -- even if you have perfect statistics on
your documents, you would not be able to distinguish between the two models.
>>:
The transportation cost is [inaudible].
>> Yuval Rabani: Okay. So here is the transportation cost. It's how much
mass I need to -- it's the product of the mass that I need to transfer and
the distance that it needs to travel. That's basically what it is. I have
some probability in the simplex over an interval and I have the true
probability over maybe a slightly different interval, and in order to
translate one distribution to the other, I need to transfer mass along
certain distances, and the minimum cost of doing this is the transportation
cost. So that's what we can show. This is just an illustration of
transportation cost between two distributions. It's moving this mass here,
how can you do it in the most efficient way where your cost of moving a
quantum of mass is the distance of --
>>: [inaudible].
>> Yuval Rabani: Yeah.
>>: [inaudible].
>> Yuval Rabani: Exactly.
>>: [inaudible].
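For distributions supported on a line (or a segment, as in the two-topic result above), the transportation cost has a simple form via CDFs; a small sketch, purely as an illustration and not the paper's algorithm:

```python
import numpy as np

def transport_cost_1d(xs, p, q):
    """Transportation (earth mover's) cost between distributions p and q on sorted points xs.

    In one dimension this is the integral of |CDF_p - CDF_q|: every quantum of
    mass pays the distance it has to travel.
    """
    xs, p, q = (np.asarray(a, dtype=float) for a in (xs, p, q))
    cdf_gap = np.cumsum(p - q)[:-1]            # CDF difference between consecutive support points
    return float(np.sum(np.abs(cdf_gap) * np.diff(xs)))

print(transport_cost_1d([0.0, 1.0], [1.0, 0.0], [0.5, 0.5]))   # 0.5: half the mass moves distance 1
```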
>> Yuval Rabani: Yes. Then finally I guess I'm nearing the end. So I
should mention this result of Arora, et al. They learned various things, so
they make pretty strong assumptions on the model. They assume that the
distributions are rho separable. And rho separable means that each
distribution has one unique entry that doesn't -- that appears with
probability 0 in the other distributions and in this distribution appears
with high probability, at least rho. Rho, think of it as a constant. So
every topic has one word that identifies it singularly. And they can learn
this thing with L infinity error epsilon, which means L1 error epsilon times
N. And sometimes -- so they can learn this for essentially arbitrary thetas.
So this is a mixed document. They learn the Ps. In some cases they can also
learn the theta itself. One of the cases that they can learn is this latent
Dirichlet allocation. But in general they don't reconstruct the theta. It's
only in very special cases.
And this is their sample size. So you should think of this as this will also
be something polynomial in N. But the dependence on K is also only
polynomial.
Okay. Finally there's this result that shows that specifically for the
Dirichlet allocation model, the latent Dirichlet allocation model, if P is full rank, the thing is full rank, then it's sufficient to have aperture D and you
get some -- so in this specific case you get basically a better result than
this.
And finally I guess some open problems. That's the highlight of the talk, I
guess, since I didn't give any algorithms. So I think the most important
question here, and there might be some indication that the answer to this
question is negative, it can't be done, is to do the best of both worlds. So
to get a learning algorithm that is both, say, nearly linear in N, maybe N
poly log N, and also polynomial in K, the sample size.
We have one or the other. We have nearly linear in N and we have polynomial
in K, but not in both of them together.
The other obvious question is can we -- in the mixed documents model, can we
recover an arbitrary theta without any assumptions. So Arora, et al., know
how to recover theta, and also the other paper knows how to recover theta in the case of latent Dirichlet allocation, but not in the case of a general
distribution on the convex hull of P1 through PK.
I believe this can be done. And related to this is the following question.
So in fact our result specifically uses a method of moments. All of these
results in one way or the other use a method of moments. They show that the
distribution can be inferred from small moments of the distribution.
But we have to use one-dimensional moments. So in some sense we're
projecting -- we're not really projecting because we don't know the data, but
we are generating the sample distribution of the projections of the Ps onto lines in order to reconstruct the model.
And this uses a method of moments -- a one-dimensional method of moments. So
it would be nice to do this directly in the K dimensional flat, which is, by
the way, fairly easy to reconstruct because that comes from spectral
decomposition. To get this K dimensional flat, you can look at the pairwise
correlation matrix M from which you derive this variance matrix V, and just
doing -- taking the -- doing singular value decomposition and taking the K
highest components would essentially give you the flat. So can you come up
with a method of moments that works in this K dimensional flat directly
rather than having to pick lines there and project the points.
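A sketch of that spectral step, assuming we already hold an estimate of the variance matrix V and the mean mu from the pairwise statistics (again just an illustration in the NumPy conventions used above):

```python
import numpy as np

def topic_flat(V, mu, K):
    """Estimate the affine flat containing the constituents P1..PK.

    In exact arithmetic V = M - mu mu^T has rank at most K - 1, and its top
    K - 1 singular directions span the directions in which the constituents
    spread around the mean; with noisy estimates one may keep a few more.
    """
    U, s, _ = np.linalg.svd(V)
    basis = U[:, :K - 1]              # orthonormal directions of the flat
    return mu, basis                  # the flat is { mu + basis @ t : t in R^(K-1) }
```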
This might be useful for Gaussians as well, because there is this problem
there too, that people use one-dimensional methods of moments and they have
to project eventually the stuff onto lines.
So that's it.
[applause].
>>:
Any questions?
>>:
[inaudible] for just the simple [inaudible]?
>> Yuval Rabani: I don't know of any lower bounds. No, I'm not aware of -- lower bounds -- so what do you want to lower bound? Certain lower bounds we do have, for example, on the document size. So we know that in general, say, 2K minus 1 words are necessary in order to learn the pure documents model in general without any further assumptions. That's a lower bound. But a lower bound on the sample size --
>>: [inaudible].
>> Yuval Rabani: Yeah. So this is sort of that question, right, can we get
this bound. We don't know. I think --
>>: [inaudible] maybe the question is like [inaudible] lower bounds that [inaudible]?
>> Yuval Rabani: I told you all the ones that I was aware of when I wrote the slides. Are they the same as the ones I'm aware of right now? I'm not sure. So [inaudible] someone told me that they think this is probably not achievable, but not that they have a proof that it's not achievable. So I don't know. Maybe I wouldn't take that as -- the same problem, by the way, you could ask about Gaussians. So the Gaussian learning [inaudible] I think maybe people do know. It also depends exponentially on K. Or even worse, I think it's K -- K to the K. So that would be an even more natural question to ask, is that a lower bound [inaudible] necessary in general.
>>:
[inaudible] goes back to the separation [inaudible].
>> Yuval Rabani: Yeah. Yeah. Without a strong assumption [inaudible].
Anyway, I'm not aware of lower bounds here. Okay.
[applause]