>>: Yeah, so it's my pleasure to welcome Yuval Rabani to Microsoft Research. Yuval is a professor of computer science at the Hebrew University of Jerusalem. He's done seminal work on algorithms, approximation algorithms, [inaudible] algorithms and many other things. And so, yeah, he's been here all week. He'll be here all of next week. It's always great to have him here. And he'll tell us about learning, so how to learn mixtures of arbitrary distributions over large discrete domains.

>> Yuval Rabani: Thank you. So, yes, the title is almost as long as the talk. I'll be talking about joint work with Leonard Schulman and Swamy, most of which was done quite a few years ago when both Swamy and I were visiting Caltech. Okay. So I want to start with some motivation for this problem. And this -- both the motivation and the problem itself arose in two different research communities, so one is our wonderful theory community, and more or less in parallel people in the machine learning community suggested similar problems. So the problem is to understand the structure of a corpus of documents. And for the purpose of this talk, we can think of documents just as being bags of words. So we're not really interested in higher level grammar, just which words appear in the document; perhaps also the frequency of these words matters. And of course if you really want to deal with documents, you have to normalize the words and remove all sorts of irrelevant stuff, such as prepositions and uninteresting -- other uninteresting words. So I'm not going to go into how people actually deal with documents, mostly because I have no idea what they do. But we will think of documents as being these bags of words. And we also have some model of how these documents are generated. So it's a very simple model. Of course most people don't generate documents this way, but that's how we think of them being generated. So in order to generate a D-word document, we have some distribution P over words, and we sample from this distribution independently D times. So that's the way we generate documents. Now, we think of these documents as being generated in the context of various topics. And we will assume that we have a relatively small number of different topics; we'll denote the number of topics by K. So in fact if we look at the entire corpus of documents, each document, in the pure documents model that I'm presenting now, is generated from one of these K topics. Each topic is defined by its own distribution over words. So basically documents from different topics differ in the distribution that is generating them. We have to assume something about the separation between these distributions, and I'll get to that. So we have these K distributions, P1 through PK. And in order to generate a new document, what we do is, first of all, we choose a topic for that document. And topic I is chosen with probability WI. Now that we've chosen a topic I, we generate the document itself by choosing words independently from the distribution PI. So that's the entire model. And you may argue that perhaps the documents that you're familiar with are generated or not generated this way. Papers, for example.

>>: [inaudible].

>> Yuval Rabani: I'm sorry?

>>: [inaudible] the size of the document is [inaudible].

>> Yuval Rabani: Yeah. So we will think of these as being fixed, but in general you could think of different documents as having different lengths as well. And maybe that's also chosen at random. But we will actually consider fixed D. I'll get to that.
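To make the pure documents model just described concrete, here is a minimal Python sketch of how a single D-word document is produced: a topic I is drawn with probability WI, and then D words are drawn independently from PI. This is only an illustration; the names are hypothetical and it is not code from the talk or the paper.

    import numpy as np

    def generate_document(weights, topics, doc_length, rng):
        """One D-word document from the pure topic model.

        weights: length-K vector (W1..WK), nonnegative, summing to 1.
        topics:  K x N array whose rows are the topic distributions P1..PK
                 over the N-word dictionary.
        """
        i = rng.choice(len(weights), p=weights)          # topic I, chosen with probability WI
        return rng.choice(topics.shape[1],               # D words drawn i.i.d. from PI;
                          size=doc_length, p=topics[i])  # a bag of word indices, repeats allowed

    rng = np.random.default_rng(0)
    toy_topics = np.array([[0.7, 0.2, 0.1],              # toy example: K = 2 topics, N = 3 words
                           [0.1, 0.2, 0.7]])
    print(generate_document(np.array([0.5, 0.5]), toy_topics, doc_length=4, rng=rng))

Note that the topic index I is used only inside the sampler; as emphasized below, the learner never gets to see it.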
Now, here's another situation where a very similar phenomenon or a very similar setting could serve as a simple model of what's happening. We have various customers, so now instead of documents we have customers. And we're collecting purchase histories of these customers. And we can think of each customer as -- this may be a bit more realistic -- so you go shopping and you have your distribution P over the possible things you want to buy, and you just sample from this distribution some -- a number -- say D independent samples, and that's your -- that's what you buy, for example, in the supermarket on that day. And, again, in the pure model here, customers -- so in general when you think of these things, you assume that people are very simple and their behavior can be summarized by simple statistics. So here it's a very simple assumption. There are K types of people. And when someone arrives, say, in the grocery store, that person is -- has type I with probability WI. And then whatever this person is going to purchase is determined by the probability PI, which is chosen according to the distribution W. Okay. That gives us essentially the same data as documents, because we don't care if our items are words or groceries.

But more generally this model is a very simple model for various data mining applications. Or it's a model of how the data is generated for various data mining applications, including things like document features that we just talked about, customer taste. You can think of this as sort of outdated when Web pages used to have hyperlinks. Now they don't, right, because you can get to any Web page using a search engine so you don't need to link them really. The links are -- I don't know, they generate Java code that does something, but they're not really links. So back when Web pages had real links, you could think of these hyperlinks as being generated through the same process. And you can think of various observational studies. So suppose you're studying, I don't know, patients or whatever. You encounter a specimen and all you can do, since the person is not really cooperating with you -- maybe it's a bird in nature -- they're not providing you -- they're not filling out a detailed questionnaire. You can observe this thing for a little while and gather just a few attributes of the specimen that you're observing. And let's say those attributes are generated from the distribution that defines the species of this thing. So various observational studies could be modeled this way.

And the general properties under which we want to think of this model are the following. So every specimen has a large number of possible features. So there's a huge number of words, for example, in almost any language that I'm aware of. And each specimen -- so if you look at a specific document, it usually doesn't contain the entire dictionary. It contains only a subset. So the documents are relatively short. Think of -- the specific thing to think about is, for example, Twitter tweets. Those are very short documents. So those are the things we want to focus on. So there's only a very small sample of the distribution that generates this document. And that's of course a bad thing, because we don't get any good statistic on the distribution from any individual document. But, on the other hand, we want to assume that the population overall behaves very nicely. It falls, for example, into these K different categories that are well specified. So this is the kind of setting that we want to be interested in.
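Whether the items are words or groceries, each sample is just a bag of items over a very large universe, with only D of the N possible items present. A tiny illustrative sketch of that representation (hypothetical names): store each sample as a sparse count vector.

    from collections import Counter

    def bag_of_items(sample, universe_size):
        """Sparse count representation of one D-item sample (document or shopping basket)."""
        counts = Counter(sample)                              # item index -> number of occurrences
        assert all(0 <= item < universe_size for item in counts)
        return counts                                         # at most D of the N entries are nonzero

    # e.g. a 5-word "tweet" over a dictionary of 100,000 words:
    print(bag_of_items([17, 42, 42, 99873, 17], universe_size=100_000))
    # Counter({17: 2, 42: 2, 99873: 1})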
And now you can think of various -- right. So this is a model that generates data, and now you can think of doing all sorts of things with the data, like, I don't know, ignore it, for example, would be my first inclination. But then you want to generate a paper as well, so you need to do something rather than ignoring the documents. So the goal that we will talk about here is we actually want to infer from the documents the topic distributions. So we want to learn the topics. We don't want to classify the -- I mean, that would be perhaps a different objective, to classify the documents into topics. So we don't want to do -- to classify the documents; we just want to learn these distributions P1 through PK. So this is what I will call learning the mixture model. Because this is really a mixture model. We have K different distributions and our documents are generated from this mixture of distributions. So I'll define now the problem more precisely. We have a known dictionary, and without loss of generality this dictionary will be the numbers 1 through N that represent the N different words in the language. We have as input M samples of D-tuples from -- so in fact it's with repetition, right, so we have M samples of these words.

>>: So relating to [inaudible] --

>> Yuval Rabani: Yeah.

>>: -- D is much smaller than N?

>> Yuval Rabani: D is much smaller than N, yes. Yes. In fact, N will be very large. So think exactly of this analogy. Think of the dictionary of, say, English and tweets and Twitter. I don't know how many words they have, but not many words. I guess in German they have fewer words because the words are longer. So they have bigger problems in learning things.

>>: [inaudible] are you allowing repetition in the sample?

>> Yuval Rabani: Yes, of course. I mean, yes. The same word could appear many times. Could appear D times in the document. We're sampling independently, so in general you could have -- now, how is the sample generated? So each sample, each one of these M samples, is generated by picking a number J between 1 and K with probability WJ. So here it's important to understand these Js are independent and identically distributed, but they're hidden from the observer of the data. So we don't know which J was chosen. And then we draw D items independently and identically from the distribution PJ. And that's how a single sample here is generated. So the next sample would be independent from this. Our goal is to learn the model. The model means the mixture constituents, the P1 through PK, and their mixture weights, W1 through WK.

>>: The PKs have large support size, right?

>> Yuval Rabani: Yes. Each P -- each PJ has support size N. And we're seeing a very small sample of it.

>>: You care about sample complexity M or --

>> Yuval Rabani: We care about everything. So I'll soon mention what exactly we care about. But we care about all these parameters.

>>: Do we know K?

>> Yuval Rabani: We can assume that we know K, yes. So we will learn these things under the assumption that K is known.

>>: [inaudible].

>> Yuval Rabani: I'm sorry?

>>: [inaudible] for different --

>> Yuval Rabani: Yeah. You could do that, yeah, so then you would learn different things, yeah. But we will assume that we know K. Since the whole model is unrealistic and our algorithms have perhaps prohibitive constants, then why not assume that you know K. That's the least of your problems here.
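Putting the formal definition together, the learner's input can be pictured as follows: M independent samples, each produced by drawing a hidden label J with probability WJ and then D words i.i.d. from PJ, with only the word tuples handed to the algorithm. A minimal sketch (hypothetical names; not code from the paper):

    import numpy as np

    def generate_input(weights, topics, num_samples, doc_length, seed=0):
        """M samples of D-tuples from the mixture; the hidden labels J are discarded."""
        rng = np.random.default_rng(seed)
        k, n = topics.shape
        labels = rng.choice(k, size=num_samples, p=weights)   # i.i.d. Js, never shown to the learner
        return np.array([rng.choice(n, size=doc_length, p=topics[j])
                         for j in labels])                    # shape (M, D); repeats within a row allowed

The goal is to recover the rows of `topics` (the constituents P1 through PK) and `weights` (W1 through WK) from this array alone, up to the accuracy and failure probability discussed next.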
So I'll also denote our -- obviously this can't always be achieved, right, because there could be some tiny probability that the sample that I get is completely unrepresentative of these distributions and weights. It's something way off the charts. So I will have to allow a failure probability, which we will denote by delta, and from now on we're going to assume that it's just some small constant. Even though for most of the stuff that I'm going to tell you, you can control this constant and drive it to be as small as you like. But I'm just going to ignore it. So there's some small constant probability of failure. I don't know, 0.01. Yes.

>>: [inaudible] even failure?

>> Yuval Rabani: The failure is that you fail to generate a model that is correct within -- I'll specify what I mean by a correct model. Obviously you're not going to get exactly these Ps and exactly these Ws, but --

>>: But does it ever know that it is incorrect? Does it say I'm sorry --

>> Yuval Rabani: No. How would it know? So suppose your data, for example -- let's say for some reason each of these probabilities has some other word that has tiny, tiny probability, but for some reason all of your documents focus -- somehow chose only this word, so it appears D times. I mean, once you choose one of the Ps, this tiny-probability word just appears there D times. This could happen with tiny probability -- with exponentially small probability. What would you do then? I mean, then basically your distributions will necessarily look like they have all the weight on this one word that has tiny probability in the real model. You can't avoid that, because that could be your data.

>>: [inaudible].

>>: Is the algorithm deterministic?

>> Yuval Rabani: The algorithm itself is not deterministic.

>>: Okay. So this failure probability is --

>> Yuval Rabani: The failure probability is over the data and the toin tosses of the -- sorry, the coin tosses, right, not the toin tosses. I usually say the right letters but not necessarily in the right order. Okay. So is the model well understood now? Good. Okay. So I want to mention a little bit of stuff about learning mixtures. So learning mixtures of course is a problem that goes way back. I'm not going to detail the entire history of learning mixtures, but I think a good starting point would be Dasgupta's paper, because that was considered at the time to be a breakthrough result. So Dasgupta and many papers following his result focused on learning mixtures of Gaussians. That is sort of the -- the most -- the most standard, let's say, model of mixture -- of mixture models. So this is just a mixture of K Gaussians in RN. You have this distribution. It's a mixture. You get -- you sample points from this distribution, and then you want to infer the K Gaussians that generated this distribution. And until Dasgupta's paper was published, I think no one knew how to do this, at least with some formal proof that it actually works. And following his paper, which gave some result -- it gave a result for spherical Gaussians with a rather large separation between their centers -- there was a long sequence of papers that culminated a few years ago with these two results, Moitra and Valiant and Belkin and Sinha, that resolved the problem completely. So they learn an arbitrary mixture of Gaussians within statistical accuracy as small as you like. K here, by the way, is assumed to be a constant, and we will assume that as well. So these algorithms don't run in time which is very good in K, as a parameter.
>>: [inaudible] separated by [inaudible] then we can do separation --

>> Yuval Rabani: Yeah.

>>: [inaudible].

>> Yuval Rabani: Yeah. Yeah. But here the general case has a really lousy running time, which is K to the K or something like that. But the dependence on the dimension is very nice. Okay. Then there are some other models that people studied. I'm not going to go into this in great detail. People know how to learn product distributions, for example, over the hypercube. And so there are various results about learning product distributions over these domains. Then hidden Markov chains, heavy-tailed distributions, k-modal distributions. So there are all sorts of distributions. There are all sorts of -- these are mixtures of these types of distributions. People know how to learn them under certain -- I'm not going to go into the details of what they do there or what the assumptions are, but I just want you to be impressed that there are a lot of mixture models that people try to learn. So this is another one of them. But I'm doing this in order to point out a significant difference between all this other work and what we're doing here. So the main issue that differentiates our problem from all these other mixture model problems is the issue of single-view versus multiview samples. So in the case of Gaussians, as a representative example, mixtures of Gaussians can be learned from single-view samples. So each sample contains one point from the Gaussian that was chosen to generate it. And then the next point is generated perhaps using a different Gaussian. So the Gaussian is chosen again and so forth. So -- and you can learn Gaussians from single-view samples. On the other hand, the mixtures of these discrete distributions cannot be learned from single-view samples. Because from a single sample the only thing that you learn -- or that you can possibly learn -- is the expected probability of every word. So think of this as -- this represents the simplex. It should be the N simplex, but here N equals 3. And let's look at -- so this represents some distribution over words, this point. Since our dictionary has only size 3, because that's the only thing I can draw on a two-dimensional slide until somebody here invents higher dimensional slides, this is just a distribution over three words. You're not taking up the challenge?

>>: [inaudible].

>> Yuval Rabani: High-dimensional slides. Sounds very useful. So, for example, I'm going to -- I'm also going to use three documents -- or, sorry, K is going to be 3, because K equals 2 is boring and K equals 4 is too much. But there's no connection between the 3 here -- between the N equals 3 and the K equals 3. So, for example, this expectation can be generated from these three topic distributions. But, on the other hand, the same expectation can be generated from these three topic distributions. And there's no way for us to distinguish between these two cases using single-view samples. And as you see, these things are pretty far -- these models are pretty far apart. So I won't be able to reconstruct the model correctly from single-view samples. You need to use multiview samples. There are also some other differences that might be interesting. I'm mentioning them because you might think of them as issues that are worthy of thinking about, maybe in a more general setting. So one other important difference is the issue of how much information a single sample point gives you relative to the model size. So the model size is not something very clear here, right?
We're talking about real numbers. But you can -- if you have -- if you represent these real numbers with limited accuracy, say accuracy 1 over polynomial in N or -- actually, you need 1 over let's say -- yeah, numbers that are 1 over polynomial. And then if you're learning Gaussians, the entire model can be determined -- because there are K Gaussians, and for each Gaussian you need the center and the axes of the Gaussian, right, of the ellipsoid, just the matrix. And that's about N squared log N bits altogether for each Gaussian with this accuracy of 1 over polynomial. And then -- and each sample point is, again, if you have similar accuracy, of order N log N bits, so it's not the same size, but it's within the same ballpark. And this is typical of most learning problems. If you even think of classical PAC learning problems, what you get from an individual sample is about the same ballpark of information as the model that describes the phenomenon that you're observing. But in learning topic models, we need to specify K distributions within reasonable accuracy, so that's something like order KN log N bits. But in each sample point we get only D log N bits. We get D words. Each word is specified by log N bits because it's a word between 1 and N. And if D is constant, as we would like it to be, this is a much smaller amount of information. So this is in general an interesting question, what can we learn from very sparse information in our samples.

>>: So [inaudible] so log N [inaudible] so log N [inaudible] just in general comes from the polynomial, like [inaudible] --

>> Yuval Rabani: Yeah, exactly. Exactly. Yes. You can plug in [inaudible] desired accuracy instead of the log N. I just thought it would be easier to understand it this way. Okay. I'll skip the last point. And -- but I do want to mention this without getting into the details. So this entire area of learning mixture models uses a rather well-defined toolkit. So one of the tools that is used is just spectral decomposition. We look at eigenvectors corresponding to certain eigenvalues. That's one of the tools that's widely used. Another tool that is widely used is -- so this is some form of dimension reduction, really, the spectral decomposition. You use -- you do principal component analysis, so you take the higher singular values and you look at that subspace. That's often useful. Another form of dimension reduction that is very useful is just doing random projections. Dasgupta's original paper actually uses random projections. So it reduces the dimension to K. Because we only have K centers, we can reduce the dimension. And then in the small dimension you can enumerate over things efficiently. That's sort of what he does. And the other thing that is used is what is called the method of moments. This is in fact what gives the final answer on the Gaussians mostly, this method of moments. And what this method of moments, sort of in its very general form, tells us is that if we know the first few moments of the distribution we can reconstruct all of it. And this happens to be true for mixtures of Gaussians, for example. If we know -- I don't remember how many moments exactly we need, but we don't need a lot of them. Once we know those moments, in sufficiently many directions, then we can reconstruct the mixture. So all we have to do is to figure out what these moments are given the sample. And, in fact, if we know -- so most of the error comes from the fact that we don't know these moments precisely. Okay.
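As a toy illustration of the method of moments in one dimension (the flavor that comes back later in the talk), here is a sketch that recovers a mixture of K point masses on the real line from its first 2K - 1 nontrivial moments, assuming the moments are known exactly. This is a standard Prony-type computation shown only to convey the idea; it is not the algorithm from the paper, which has to work with noisy estimates of the moments.

    import numpy as np

    def moments(weights, points, count):
        """Exact moments m_0, ..., m_{count-1} of the mixture sum_j w_j * delta(x_j)."""
        return np.array([np.dot(weights, points ** r) for r in range(count)])

    def recover(m, k):
        """Recover k point masses on the line from the moments m_0, ..., m_{2k-1}."""
        # The support points are the roots of x^k + c_{k-1} x^{k-1} + ... + c_0,
        # where the c's solve a Hankel system built from the moments.
        H = np.array([[m[i + j] for j in range(k)] for i in range(k)])
        c = np.linalg.solve(H, -m[k:2 * k])
        points = np.roots(np.concatenate(([1.0], c[::-1])))
        # The weights then solve a Vandermonde system against the low moments.
        V = np.vander(points, k, increasing=True).T      # V[i, j] = points[j] ** i
        weights = np.linalg.solve(V, m[:k])
        return np.real(points), np.real(weights)

    m = moments(np.array([0.5, 0.5]), np.array([0.2, 0.8]), 4)   # K = 2 needs 2K - 1 = 3 moments (plus m_0)
    print(recover(m, 2))    # recovers the points {0.2, 0.8} and the weights {0.5, 0.5}

As discussed later in the talk, the actual result works with one-dimensional moments of projections onto lines, and the first 2K - 1 moments are exactly what documents of 2K - 1 words can give you; the difficulty is that those moments are only known approximately.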
Let's --

>>: [inaudible] so you said [inaudible] but why does the separation [inaudible]?

>> Yuval Rabani: Because that was what he came up with -- once you project, you don't want -- if the Gaussians collapse to -- if their centers collapse to be very close and they're just one on top of each other, then you can't do anything. And the distance that he could figure out from the methods that he was using after projection was this thing. And that was improved later. So that was just the first crude result. But, again, by now I think the main contribution of that paper is that it generated all the following work. Because I don't think his methods are useful anymore. Okay. So back to topic models. What is the problem -- what are we trying to optimize here? So we have a learning problem. Of course let's sample -- you know, let's wait for a hundred years, sample as many documents as we can gather in those hundred years, and a hundred years from now maybe we will all be dead and no one will care about this problem. So that's a wonderful algorithm. It succeeds with probability one. So -- but we want to minimize certain parameters. So obviously we want to minimize the number of samples. I don't know, we want to classify Twitter traffic into K topics, assuming that people only talk about K topics, which makes sense. Then you would want to take as few tweets as possible in order to do it. The other thing is that we want to minimize the number of views that are required. Here again, we want to be able to classify -- we want to be able to determine topic distributions from tweets rather than books. It would be much easier to determine topic distributions from encyclopedias, for example. Because they have a lot of words. So they've sampled the distributions a lot. And of course we want to minimize the running time of our algorithm, to make it as efficient as possible in terms of the parameters of the problem, which are M the number of samples, D the number of views per sample, N the dictionary size, and K the number of constituents in the mixture. So you can think of M and N as being very large for our purposes. Of course there is the other problem where they're small and the other things are large. But we are considering the scenario where M and N are large and D and K are small. So D and K will be considered to be constants. And M and N are the things that we want to be asymptotically good with respect to. Now, there are some trivial bounds that we can immediately discuss. So, for example, if the total number of words that we see in our entire sample, so that's M times D, right, because there are M samples, each sample has D words -- if this is little O of N, then there are definitely models that we can't learn. Because we haven't even seen words that might support a large portion of the distribution. So we can't really bound the error very well if we see too few words. So in this case, if M times D is little O of N, then there's no -- we have to have at least linear size samples in the dictionary size.

>>: So what is the error [inaudible] to be then?

>> Yuval Rabani: I haven't yet talked about it. I'll talk about it soon. But under any reasonable error that you can think of. So if we have a sample size of little O of N --

>>: [inaudible].

>> Yuval Rabani: Yeah. So think of all the words as having roughly the same probability, say, within some constant factors, and you want to learn those things; then with little O of N words sampled, you haven't seen most of the distribution yet.
You've seen very little of the distribution. The entire -- all the words that you collected are supporting little O of 1 weight of the probability. So you definitely are not going to get something good under any reasonable measure of accuracy that you can come up with, unless your accuracy is "I want to output anything, and that's good." It's a reasonable thing, I guess, that would simplify our work a lot. Now, the other extreme is the following. Suppose our -- the number of views is very large. Let's say it's N log N times some large constant. Then just by coupon collector, or, you know, similar arguments, we've seen -- so a single document gives us a very good approximation to the probability PJ that generated this particular document. And then all we have to make sure is that we have enough documents to cover all the topics. And then we can learn -- so this would make the problem trivial. So of course --

>>: But you still need to figure out the WJs.

>> Yuval Rabani: I'm sorry?

>>: So you still have to figure out the WJs.

>> Yuval Rabani: Yeah. So assuming the WJs aren't too -- that would only say how many documents you need to see until you've covered all the topics.

>>: [inaudible].

>> Yuval Rabani: Yeah. So that would be -- the smallest WJ would determine -- it would determine M, but it wouldn't affect anything else. M would have to be something like 1 over the smallest WJ, or maybe something slightly larger than that, to be sure that with high probability you encountered all topics. So this would make the problem trivial. Okay. You asked about accuracy. So I'm -- by the way, I'm mostly presenting the problem again, so I don't know if I'll have much time to go into the proof of anything. So the output of our learning algorithm are the -- are some distributions. This thing is meant to be a tilde, these things, except that the Maxoff [phonetic] software doesn't have tildes. Maybe PowerPoint is better --

>>: Of course.

>> Yuval Rabani: -- in that respect. Yeah. So I found something that you guys can brag about. So we have P1 tilde through PK tilde, and maybe also the weights. In fact, some of the algorithms don't output the weights, they only output the probabilities, and others also output the weights. And in this respect you can think of -- or people have considered two types of error. L2 error, which is just -- so the output has L2 error epsilon if there is some permutation over the Ps such that if you compare a PJ tilde to the matching PJ, then their L2 distance is bounded by epsilon. And then other papers consider the L1 error. A similar thing, just using L1 or a total variation distance between the distributions. And you should notice of course that this is a much stronger requirement, because L1 error epsilon implies L2 error epsilon, but L2 error epsilon only implies L1 error epsilon times root N. So in fact this is the kind of error that we would like to get; the total variation distance is small.

>>: There is a permutation because if you don't know what the WJs are, there's basically no way to distinguish between [inaudible].

>> Yuval Rabani: Even if you do, what happens if the WJs are identical? All of them are 1 over K.

>>: Right.

>> Yuval Rabani: Then you don't know the permutation. But there is some permutation for which this would be true. Okay. I want to explain what's known. And this is what we need for that. Yeah. That doesn't look too promising. It's a lot of notation. But I need that in order to explain the bounds that are -- so it's not that difficult. Let's go over it very slowly.
First of all, we'll denote by P the constituents matrix. So that's the matrix whose columns are the P1 through PK. So this is a matrix that has N rows and K columns. And mu will be the mean, so that's just the sum of WJ PJ. That denotes the expected probability of all the words. And M, the matrix M, is the matrix of -- it's the pairwise distribution matrix. So this would be the distribution of documents that contain two words. If I would generate from this mixture a sequence -- an infinite sequence of documents, each one containing two words, then MIJ would be the probability that the pair of words IJ appeared in a document. This is the distribution. And then we'll denote by V -- V will be sort of a variance, so it will be M minus mu mu transposed. Mu mu transposed would be the pairwise distribution if all the documents were generated from this mean distribution mu. So this denotes this high-dimensional variance. It's a variance matrix, V. It's the difference between the pairwise distribution of our mixture and the pairwise distribution had we replaced the mixture by its mean. Now we have this linear algebra notation for matrices, so sigma I would be the ith largest singular value, on the left, and lambda I would be the ith largest eigenvalue. So this assumes the matrix is symmetric, for the lambda notation. And the condition number is the ratio between the largest singular value of a matrix and the smallest nonzero singular value of a matrix. And finally we'll define a spreading parameter. So the spreading parameter is intended to capture the fact that we need some -- that these PIs need to be distinct in some sense. If they're all the same, then obviously I'm not going to generate K different distributions. And if they're very close together, distinguishing between them requires more effort. And this is captured by this zeta spreading parameter, which is the minimum between two specific parameters. One of them essentially captures -- so it's written in this way, but it essentially captures -- it's the minimum total variation distance between two constituent distributions, really. It's written here in terms of L2 for various reasons, but that's what it tries to capture.

>>: [inaudible] if they're really, really close? I mean, you can -- can't you just [inaudible]?

>> Yuval Rabani: Ideally we would like to do that, but we don't know how to. I mean, none of the results really knows how to do it. They all rely in one way or the other on knowing something about the separation. So some of them, for example, would work for any separation, except that you need to know what the separation is and you have to work harder in order to achieve this. But you're right. Ideally you would just want to generate a model that is statistically close to the correct model. And if there are two constituents that are very close together and you don't have a good enough sample, you should be able to replace them by just one constituent that would -- so unfortunately we don't know how to do that. And then zeta 2 is just -- it's really the width of this collection of points in every direction where they have a width different from zero. So they're spread somehow in space, in the simplex, in the N minus 1 simplex. And they -- of course they're K points, so they lie on a K flat. So in every direction in this K flat, zeta 2 is the minimum width of the set of points in every direction of the K flat. Okay. So now we can say what is known. The first result that I want to mention is by Anandkumar, et al.
It has certain assumptions -- or it has one big assumption, which is that P, right, that's the constituents matrix, is full dimensional. So this means, for example, that our constituents don't lie on a line, for example, in the simplex, or on any flat that is -- has dimension smaller than K minus 1. Then -- sorry. Yeah, K minus 1. K minus 1 or K? K minus 1. It has affine dimension K minus 1. In this case -- so this turns out to be a very strong assumption, because in this case all you need is documents that have three words. Once you have documents that have three words -- of course, if they're longer, you can just split them -- then you can learn the model. Except that this algorithm actually uses a pretty large sample size. So it learns the model with L2 error, not L1 error. This is the L2 error that it achieves. And this is the sample size that it achieves. So you should think of these singular values and eigenvalues as being something which is about 1 over N, just to get -- in the worst case. So of course there could be constituents P for which these are very small -- for which these are very large. And then that's good for us. But in the worst case, you could have a well-separated instance where all the entries are, say, between 1 over N and 2 over N, or between, say, half over N and 2 over N. And then these numbers will be very small. They will be 1 over N. So this gives us a sample size which is polynomial in N and in 1 over epsilon squared, but, on the other hand, it's also polynomial in K. So the behavior with respect to K here is very good.

>>: So what is C?

>> Yuval Rabani: C is some constant. I didn't want to specify it precisely, partly because I don't remember what it is. So they gave two algorithms that do this. Then this is our result. So this is our result in comparison. We make no assumption, so we don't make the full rank assumption. But unfortunately once you don't make the full rank assumption, you must assume that you have larger documents. So we assume that the documents have 2K minus 1 words. And in fact this is necessary if you don't -- if you assume a general mixture. The example is very simple. You can -- so this is an illustration of the example. Think of a line passing through the simplex, and think of your mixture constituents as being points on this line. It turns out that you can find two different configurations of points on the line that have exactly the same first 2K minus 2 moments. And if you sample only 2K minus 2 words in each document, you can't get any information beyond the first 2K minus 2 moments. So in fact these two different examples, you just shift these things a bit. Not a bit. Overall they're shifted by a lot. And you get two examples that have exactly the same sample distribution over documents that have size 2K minus 2. So you need at least 2K minus 1 words in order to get any difference between these two far-apart models, and we can actually learn once you hit that document size. We get an L1 error as opposed to the L2 error there. And this is the sample size. So you see that, first of all, in terms of the dependence on N, the sample size is close to the best that we can do, because we know we need a linear-size sample, and this is N poly log N, in terms of N alone. In terms of K, on the other hand, we have this additive factor which is exponential in some horrible things, K squared log K, essentially. Now, what's interesting here also is that suppose K is very small compared to N. Then this would be the dominant part.
But for this part -- this size sample we only need documents of size 2. And beyond that we need only this many documents of size 2K minus 1. So you could learn from a lot of very small documents and a few very large ones, except that the number of the few very large ones is something exponential or worse in K. So you see the comparison to the previous work. The dependence on N here is much better. This would be the two algorithms if we want L1 error instead of L2 error, if we translate their L2 error into L1 error. The first algorithm is about N to the 8th, the second one is N cubed. With a sample size of N cubed, they're looking at three-word documents. You get pretty accurate statistics on at least the more common words among all the triples. So you get pretty accurate statistics on the distribution of documents. So the entire problem that they have is to reconstruct the model from exact statistics, basically. Because they get very good statistics. On the other hand, in our case we get very poor statistics on the document distribution. We only have N poly log N documents. So the statistics on documents, even of size 2, is not very accurate. And still we're able to learn. But, on the other hand, they have -- their dependence on K is polynomial and here the dependence is really bad.

>>: These two algorithms [inaudible].

>> Yuval Rabani: Um-hmm.

>>: And it's showing you the assumption that --

>> Yuval Rabani: Yeah.

>>: -- [inaudible].

>> Yuval Rabani: I'm sorry?

>>: You still need the [inaudible] they still need this assumption [inaudible].

>> Yuval Rabani: Yeah. They need this assumption. Yeah, so there is this other thing. Okay. It's 4:28, so when do I stop?

>>: Five minutes.

>> Yuval Rabani: Five minutes. Okay. So I'll skip any notion of algorithm. This is what the algorithm does, but let's skip this. We won't have -- I want to talk a little bit about mixed topic models. So we talked about pure topic models. Pure topic models are those where each document is generated from a single topic. In the mixed topic model, each document is a mixture of topics. So it's not just one topic. It's not talking about one thing, it's talking about many different things. Or some different things. And in order to -- so here's the model. Again, in order to generate a D-word document, we simply draw D independent samples from some distribution P. This is exactly the same as the pure documents model, except that now the distribution P will be chosen in a more complicated way. So we still have K topics, K distributions, P1 through PK. But we also have a probability measure theta on the convex hull of P1 through PK. So the convex hull is just some convex set inside the simplex. And in order to generate a document, we choose a distribution P in the convex hull of these pure topic distributions according to the probability measure theta. And then we sample from this distribution P. So this is the model. And now one example of theta that people talk about a lot is this latent Dirichlet allocation. Other than saying the words, I don't know much about it. So it's some specific class of distributions that has nice properties. That's as much as I can say about it. Maybe some other people can say more. Okay. So what's known about the mixed topic model? One thing that's known is this old result of ours that was never published -- well, I think it's on the arXiv, but that's it -- that essentially completely solves the problem except that we can only do it when the number of topics is two.
So if there are two topics and there's an arbitrary distribution on convex combinations of these two topics, then we can reconstruct the entire model, including the distribution on the convex hull of these topics that generated the documents. And I have to explain what guarantees we actually get. So we basically have two topics, so that defines a segment in the simplex. And we have some arbitrary distribution on the segment. And what we're reconstructing is this distribution on the segment, and the segment itself of course. So the segment could be slightly inaccurate and the distribution on the segment could be slightly inaccurate. And what we can guarantee is that with high probability the transportation cost between the model that we generate and the true model is very small. But here the error depends on the document size. So the bigger the documents, the better error we get. And, in fact -- let me go back. In fact, this turns out to be necessary. So without further assumptions -- if, for example, you know that your distribution on this convex hull comes from a latent Dirichlet model, then perhaps you can do better. In fact, it's known that you can do better. But if it doesn't, if it's an arbitrary distribution, then you can't get an error better than this. So there are always two distributions that are far apart, that are this far apart in transportation norm, that generate exactly the same D-word distribution. So you won't be able to distinguish -- this is an issue of -- really of -- even if you have perfect statistics on your documents, you would not be able to distinguish between the two models.

>>: The transportation cost is [inaudible].

>> Yuval Rabani: Okay. So here is the transportation cost. It's how much mass I need to -- it's the product of the mass that I need to transfer and the distance that it needs to travel. That's basically what it is. I have some probability in the simplex over an interval and I have the true probability over maybe a slightly different interval, and in order to translate one distribution to the other, I need to transfer mass along certain distances, and the minimum cost of doing this is the transportation cost. So that's what we can show. This is just an illustration of transportation cost between two distributions. It's moving this mass here -- how can you do it in the most efficient way, where your cost of moving a quantum of mass is the distance of --

>>: [inaudible].

>> Yuval Rabani: Yeah.

>>: [inaudible].

>> Yuval Rabani: Exactly.

>>: [inaudible].

>> Yuval Rabani: Yes. Then finally I guess I'm nearing the end. So I should mention this result of Arora, et al. They learned various things, so they make pretty strong assumptions on the model. They assume that the distributions are rho separable. And rho separable means that each distribution has one unique entry that doesn't -- that appears with probability 0 in the other distributions and in this distribution appears with high probability, at least rho. Rho, think of it as a constant. So every topic has one word that identifies it singularly. And they can learn this thing with L infinity error epsilon, which means L1 error epsilon times N. And sometimes -- so they can learn this for essentially arbitrary thetas. So this is a mixed document. They learn the Ps. In some cases they can also learn the theta itself. One of the cases where they can learn it is this latent Dirichlet allocation. But in general they don't reconstruct the theta. It's only in very special cases. And this is their sample size.
So you should think of this as -- this will also be something polynomial in N. But the dependence on K is also only polynomial. Okay. Finally there's this result that shows that specifically for the Dirichlet allocation model, the latent Dirichlet allocation model, if P is full rank, the thing is full rank, then it's sufficient to have aperture D and you get some -- so in this specific case you get basically a better result than this. And finally I guess some open problems. That's the highlight of the talk, I guess, since I didn't give any algorithms. So I think the most important question here, and there might be some indication that the answer to this question is negative, it can't be done, is to do the best of both worlds. So to get a learning algorithm whose sample size is both, say, nearly linear in N -- maybe N poly log N -- and also polynomial in K. We have one or the other. We have nearly linear in N and we have polynomial in K, but not both of them together. The other obvious question is, can we -- in the mixed documents model, can we recover an arbitrary theta without any assumptions? So Arora, et al., know how to recover theta, and also the other paper knows how to recover theta, in the case of latent Dirichlet allocation, but not in the case of a general distribution on the convex hull of P1 through PK. I believe this can be done. And related to this is the following question. So in fact our result specifically uses a method of moments. All of these results in one way or the other use a method of moments. They show that the distribution can be inferred from small moments of the distribution. But we have to use one-dimensional moments. So in some sense we're projecting -- we're not really projecting because we don't know the data, but we are generating the sample distribution of the projections of the Ps onto lines in order to reconstruct the model. And this uses a method of moments -- a one-dimensional method of moments. So it would be nice to do this directly in the K-dimensional flat, which is, by the way, fairly easy to reconstruct because that comes from spectral decomposition. To get this K-dimensional flat, you can look at the pairwise correlation matrix M, from which you derive this variance matrix V, and just doing -- taking the -- doing singular value decomposition and taking the K highest components would essentially give you the flat. So can you come up with a method of moments that works in this K-dimensional flat directly, rather than having to pick lines there and project the points? This might be useful for Gaussians as well, because there is this problem there too, that people use one-dimensional methods of moments and they have to project eventually the stuff onto lines. So that's it. [applause].

>>: Any questions?

>>: [inaudible] for just the simple [inaudible]?

>> Yuval Rabani: I don't know of any lower bounds. No, I'm not aware of -- lower bounds -- so what do you want to lower bound? Certain lower bounds we do have, for example, on the document size. So we know that in general, say, 2K minus 1 words are necessary in order to learn the pure documents model in general without any further assumptions. That's a lower bound. But a lower bound on the sample size --

>>: [inaudible].

>> Yuval Rabani: Yeah. So this is sort of that question, right, can we get this bound. We don't know. I think --

>>: [inaudible] maybe the question is like [inaudible] lower bounds that [inaudible]?

>> Yuval Rabani: I told you all the ones that I was aware of when I wrote the slides.
Are they the same as the ones I'm aware of right now? I'm not sure. So [inaudible] someone told me that they think this is probably not achievable, but not that they have a proof that it's not achievable. So I don't know. Maybe I wouldn't take that as -- the same problem, by the way, you could ask about Gaussians. So for Gaussian learning [inaudible] I think maybe people do know. It also depends exponentially on K. Or even worse, I think it's K -- K to the K. So that would be an even more natural question to ask, is that a lower bound [inaudible] necessary in general.

>>: [inaudible] goes back to the separation [inaudible].

>> Yuval Rabani: Yeah. Yeah. Without a strong assumption [inaudible]. Anyway, I'm not aware of lower bounds here. Okay.

[applause]