>> Dengyong Zhou: Okay. I think we should start. So it's my pleasure to welcome Sham Kakade this morning from MSR New England lab. So Sham probably doesn't really need an introduction and people here probably know him already. Sham has done a lot of work in machine learning theory and other areas, and today he's going to teach us how to use tensor decomposition for learning hidden variable models. Sham? >> Sham Kakade: It's fun to be here. I'll be around for the week, so if any of you want to chat, we can continue discussions. How do we learn models with hidden structure? This is one of the questions we're facing in a lot of practical applications, from simple settings to really complicated settings in, say, machine translation. So let's just start with two basic examples of mixture models. One is the mixture of Gaussians, which I think we're familiar with. So here we're going to see data, like a bunch of point clouds, and we'd like to figure out the means of these point clouds. And another standard example is these topic models. So there we can think of having a collection of documents, and each document is about one or more topics, and we think of the document as being like a collection or a bag of words. And you can think of these two as canonical mixture models. And how do we learn them? Learning is obviously easy if someone gave us the labels; if someone told us which documents are about which topics, it's easy. But if we don't have these variables, how do we learn? So let's start by just looking at what's used and what's known. So oftentimes what's used in practice is EM, which is a very natural algorithm, or k-means. And what do we do? We guess the cluster assignments or the parameters, assign the points to various clusters, and iterate. And then there are various sampling-based approaches and MCMC approaches, and a lot of the practical algorithms essentially do inference during learning. So as they're learning, they try to figure out which points are assigned to which cluster. And at some level, this seems intuitive, but this might also be part of the difficulty in learning, because we're trying to solve an inference problem. And inference is hard in some of these models. Like in the LDA model, where you have multiple topics per document, just solving the inference problem is hard, okay. But that's practice. What about theory? What are the upper and lower limits we know about learning? And there's actually been some really nice work recently by Adam Kalai, Ankur Moitra and Greg Valiant. So the first thing these guys showed is, let's just make the problem simple. What do we know about a mixture of two Gaussians? This is about as simple as it gets. Turns out that wasn't even known. What they showed is how to learn a mixture of two Gaussians in poly time when the Gaussians could overlap. Because once things overlap, that's when things start becoming difficult. Actually, even when they don't overlap but are close, it becomes difficult in high dimensions. But for just two Gaussians, they gave a poly time algorithm, which was efficient. So at least that case, we know we can solve. But that's K equals 2, and you can easily be exponential in K, where K is the number of Gaussians, because there's kind of a search-based procedure on a line, okay? So subsequent to that work, there's some really nice follow-up work by Ankur and Greg which actually, in a sense, was a negative result, where they showed that it seems like you actually need a number of samples exponential in the number of Gaussians to learn a mixture of K Gaussians.
And this is an information theoretic lower bound, and this seems bad. And the point is if you have K Gaussians which overlap, but not like right on top of each other, but by a reasonable amount, they actually showed you could potentially need many, many samples to learn this thing from an information theoretic point of view, which basically says computationally [indiscernible]. So this looks bad right now, and it's really a nice construction by Anker and Greg. And to some degree, this talk is going to contradict that, because I'm going to argue that for kind of a natural case of mixture of Gaussians and these topic models, we can come up with a closed form and efficient estimation procedure. And it's kind of interesting for a number of reasons, which people are getting a little surprised by, because this is a non-convex problem and the solution we're coming up with is non-convex, yet it's closed form. It's a pretty simple approach because it's based on linear algebra techniques. We aren't solving inference in the learning process and this is handy when we start looking at 3 these topic models with multiple topics in them, because sometimes inference is hard. But somehow, we can do things greedily and still figure out where the topics are with this kind of closed form estimation procedure. We'll come up to this question of how do we avoid this lower bound, because I'm saying we can do things in closed form, and isolate. But I just gave you this lower bound and it extends to a number of other settings like this LDA model, which is a very natural notion when you can have multiple topics per document. And we can get a closed form estimation procedure for that, closed form estimation procedure for hidden Markov models, and I'll discuss some generalizations of these ideas to structure learning, like in these models used in linguistics and Bayesian networks. But most of the topic is going to focus on these two simple models and understanding how we learn them and basically the proof is very simple. It's geometric, and I think we can really understand how to learn these things. So let's just go slowly. >>: [indiscernible]. >> Sham Kakade: No, global optimum, efficient, in closed form. So we'll see what that means precisely, but there's no local optimization here. This is global optima for estimating the parameters. But again, well see what we mean here because we have sample data and this question is about the efficient statistical rate and so on. But nonetheless, I'm going to stand by the claim that it's a closed form ->>: Initialization? >> Sham Kakade: No initialize. We're not using EM. That's the point. We're not doing inference. It's more like greedy approach. But let's see how we do that. So let's just start with some definitions, but definitely ask questions during the talk, but I think these two examples are simple enough that we should be able to understand the proofs and everything. Okay. I want to do these in parallel, because they share a lot of similarities. And, you know, this is the one slide of notation. I think this 4 should all be reasonably clear, but definitely ask questions. So for the topic model, we're going to consider the single topic case. But let's go to the case of mixture of Gaussians and the single topic model. In the mixture of Gaussians, we think of having K centers, mu one to mu K. These are points in a vector space. And a topic model, let's think about having K topics. I'm going to go use the same notation. 
So in the topic case, each of these mu Is is a distribution over words, okay. Now we're going to think about how we generate a point. In the mixture of Gaussians case, we first sample some cluster with probability WI, so those are my parameters. In the topic case, we're going to decide on a topic with probability WI. So that's why they're kind of analogous. In the mixture of Gaussians case, what do we observe? We observe the mean corrupted with spherical noise, because that's the case I'm going to consider. I'm going to consider the case with spherical noise. And this really is the underlying probabilistic model for k-means, right. So k-means, what do you do? You assign things to the closest point, so we can think of this as the probabilistic model for k-means. So we just see our point, corrupted with noise, and we're going to see many such points. In the case of topic models, we're going to see a document, and the document's going to consist of M words which are sampled independently from this topic. So it's an exchangeable model: our document just consists of M words and they're all sampled IID from mu I. So at this level, we kind of see some distinctions between these two models. In the mixture of Gaussians case, we're adding noise to one of the means. In the topic model case, we get many words drawn independently from the same probability distribution, the same hidden topic. >>: Multinomial model? >> Sham Kakade: It's a multinomial model, yeah. It's a multinomial model here, and the words are drawn independently. They're exchangeable, and this is why it's kind of a bag of words model. The order of the words doesn't matter. We just see this collection of samples. >>: But this is not the common -- you've put single in parentheses in there to indicate this is not the topic model that's commonly used in practice, where you have a mixture of topics associated? >> Sham Kakade: We're going to come back to the case of multiple topics, the LDA model. I mean, this is one of the standard models used, but we often like richer ones, like LDA. But we'll be able to handle that as well. But in terms of understanding what's going on, I think it's helpful to consider these two in parallel, because these really have one hidden mixture component, and the learning question is: we're just going to see a bunch of samples, and what we'd like to recover are the means or topics, the mixing weights, which are just real numbers, and sigma for the case of the mixture of Gaussians. I like this because we kind of see how these models are similar and how they're different. And the main difference is that in the topic model, we get multiple samples from the same hidden state. And here we kind of get a different notion of noise. So we good? Okay. >>: The topic part, you see [indiscernible]. >> Sham Kakade: No, so we see multiple documents. Sorry. So in the mixture of Gaussians, one X is a point. And in the topic model, you see multiple documents. So think of the analog of X here as being the M words here, and we see multiple documents. >>: So each point corresponds to one -- >> Sham Kakade: Yes, each point here corresponds to one document, and a document is this collection of words, and that's analogous to one of the Xs. >>: And the dimension [indiscernible]. >> Sham Kakade: Let's come back to the dimensions and the encodings. Right now, we're just thinking of these as discrete, but we'll formalize that later. I'd have to think about how to state it correctly. We good with that? Okay. Good. So how do we learn these things?
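[Editor's note: to make the two generative processes just described concrete, here is a minimal simulation sketch in the talk's notation (K topics/clusters, D words/dimensions, weights w, means/topics mu, spherical noise sigma, M words per document). The specific numbers, the random seed, and the Dirichlet draws used to pick parameters are illustrative assumptions, not part of the talk.]

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, M = 3, 10, 5                        # K topics/clusters, D words/dimensions, M words per doc
w = rng.dirichlet(np.ones(K))             # mixing weights w_1..w_K
sigma = 0.5                               # spherical noise level (Gaussian case)
mu_gauss = rng.normal(size=(K, D))        # cluster means in R^D
mu_topic = rng.dirichlet(np.ones(D), K)   # topic distributions over D words (rows sum to 1)

def sample_gaussian_point():
    """Mixture of spherical Gaussians: pick cluster i w.p. w_i, observe mu_i plus noise."""
    i = rng.choice(K, p=w)
    return mu_gauss[i] + sigma * rng.normal(size=D)

def sample_document():
    """Single-topic model: pick topic i w.p. w_i, then draw M words i.i.d. from mu_i."""
    i = rng.choice(K, p=w)
    return rng.choice(D, size=M, p=mu_topic[i])   # word ids; exchangeable, order irrelevant

x = sample_gaussian_point()    # one observation = one point in R^D
doc = sample_document()        # one observation = one bag of M word ids
```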
And learning is you have the separate points, how do we figure things out. And there's a lot of work that's been done on 6 this. I'm not going to go through the details of all of this work, but there's really a ton of work from the theory and ML community trying to figure out how to run mixture of Gaussians back from a really nice paper from Sean [indiscernible] over a decade ago. And a lot of the work was really on a case where the Gaussians are very well separated. They don't overlap at all, and then, you know, what kind of approaches can we use to figure this case out. But they have relied on a ridiculous amount of separation. And then there's some work where they can actually overlap or where they're much closer together and most of these are basically more like information theoretic results, and they're all typically exponential in K, where you're searching all over the place. >>: So how do all these methods differ from original [indiscernible]. >> Sham Kakade: Okay, right. So all the ones I've listed up here are mostly theory results where you're trying to find the global optima. Or maybe not the global. Something provable. You care about computation, which is why they're all either exponential in the number of topics, or the clusters are very well separated and you can -- and even there it's tricky for what to do. >>: Basic procedures remain the same? >> Sham Kakade: The procedure for the ones with a lot of separation are more based on distance-based clustering, where you kind of -- some of them may introduce the right notion of using linear algebra approaches, but all of these are quite different because no one has a very good understanding of the EM type approaches and how to -- but these are a lot of the theory. And some of them give interesting algorithm, insights. For topic models, there's also been a lot of work. So Christos and Santos and other people had one of the early papers on how to view topic models as a matrix factorization approach. There's been some really nice work by Joseph Chang which kind of heavily influenced our work more from [indiscernible]. And even recently, there's been some nice work by Sanjay Arora on the topic case looking at this as an MNF problem and how do we do [indiscernible] factorization. But again, they can prove it with only separation conditions. Basically, there's a lot of work. You can ask me later on about details of it. 7 And to some degree, we're going to take a pretty different approach from a lot of these -- a lot of that work. So let's just back up and forget about sampling issues and just start thinking about identifiability. Let's go back to this old idea before maximum likelihood by Pearson. So Pearson was the old statistician. He's like how do we estimate models. His idea was somebody called the method of moments, which is we see averages in data. How do we figure out the parameters which give rise to these averages. And going back to this multiple topic case, we started addressing this by identifiability questions. So what's the question here? Let's look at our moments. For the mixture of Gaussian case, the first moment is the mean. The second moment is the expected value of XX transpose. The third moment is this tensor and they just keep getting bigger. For topics, we can think of the moments in a very natural sense. Think of the first moment as corresponding to having documents with one word in them. We think of the second moment as documents with two -- suppose we had an infinite collection of documents with two words in them. 
What do we know? We can figure out the joint distribution of two words in a document. We think of the third moment as the joint distribution of three words in the document. And now forget about, you know, sampling issues. We can just ask the identifiability question, which is: how many words, how long do documents need to be, before the parameters are well specified? Because if every document contained one word and I had an infinite collection of documents, I would know this first moment exactly. But it's not identifiable, obviously. You can't figure out the topics if every document has one word in it. So we can ask this even more fundamental question, which is: suppose we had exact moments, when are the models well specified, and what order of moment suffices to nail it down? And this is an interesting question when you have multiple topics per document, because now the question is how long do documents need to be. Say every document had, say, five topics in it out of 100 possible topics; how long do documents need to be before the model is identifiable? It's an information-theoretic question. It's independent of sampling issues. And for the most part, let's just proceed for now given that we have exact moments. And we want to address the identifiability question, and then we want to see: can we invert these moments efficiently? Can we take these moments and figure out the parameters? And we can think of these all as matrices or tensors, because this is like the bigram matrix and this would be like a trigram tensor. So we good with that? So this is an even more basic question. And this is what I mean by closed form solutions. I'm going to show you can easily figure out the answer to this question. Okay. >>: Good? So this [indiscernible]. >> Sham Kakade: We'll see. No. So now let's just look at why I'm comparing these two, and let's use some vector notation because it keeps us on the same footing. So for the mixture of Gaussians, we've got K clusters. We're in D dimensions in the mixture of Gaussians case, or we could have D words. Typically, we think of D being bigger than K. We have more words than topics or more dimensions than clusters. In the mixture of Gaussians case, what's the expected value of X given that we're from cluster I? It's just the mean, by definition, right. For the vector notation, it's helpful to think of words as these one-hot encodings, where this one has the first word on and this one has the second word on. Why is this handy? Because if we use this encoding, we can think of these mus as probability vectors. These are D-length vectors which sum to one, and then we can think of the probability of any given word, given that we're from topic I, as just being mu I. So this is just the expected value of a word given the topic, and this is just mu I. That's why vector notation is handy, because it kind of keeps things on the same footing. The noise model is obviously different. The noise model here is spherical. The noise model here is multinomial, and it depends on the particular word we're getting. But that's just notation. Are we good with that? So now let's start looking at these moments and trying to figure these out. What does the first moment look like? Okay. In the mixture of Gaussians model, it's just the average of the mu Is. And in the topic model it's just the average of the topic probability vectors. And obviously, the model is not identifiable from this because, you know, these are always specious arguments, but parameter counting suggests it's not enough.
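[Editor's note: a minimal sketch of what the plug-in moments being discussed look like in code, under the one-hot word encoding the talk uses. Using only the first three words of each document is one simple estimator assumed here for clarity; the function names are mine, not from the talk.]

```python
import numpy as np

def empirical_topic_moments(docs, D):
    """Plug-in estimates of the word marginal (M1), bigram matrix (M2), trigram tensor (M3).

    docs: iterable of word-id sequences, each of length >= 3.
    The first three words of a document are i.i.d. given its hidden topic."""
    M1, M2, M3 = np.zeros(D), np.zeros((D, D)), np.zeros((D, D, D))
    n = 0
    for doc in docs:
        a, b, c = doc[0], doc[1], doc[2]   # three exchangeable words
        M1[a] += 1
        M2[a, b] += 1
        M3[a, b, c] += 1
        n += 1
    return M1 / n, M2 / n, M3 / n

def empirical_gaussian_moments(X):
    """Raw moments for the mixture-of-Gaussians case: E[x], E[x x^T], and the third-order tensor."""
    M1 = X.mean(axis=0)
    M2 = np.einsum('ni,nj->ij', X, X) / len(X)
    M3 = np.einsum('ni,nj,nk->ijk', X, X, X) / len(X)
    return M1, M2, M3
```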
And there are definitely problems with parameter counting arguments, but nonetheless, we know, you know, the mean isn't enough. Okay? We good with that? Forward ho, all right? So let's look at the second moment. So let's look at the mixture of Gaussians model. I'm going to use this kind of outer product notation instead of XX transpose. So what is the second moment in the mixture of Gaussians model? It's the expected value of XX transpose. Each X is a mean plus noise. So what you get for the mixture of Gaussians is basically a contribution from how the means vary, right, because you've got this picture where, you know, you've got these means which lie in some subspace. Actually, I'm going to draw the Xs as the means, and we've got kind of points lying around the means. And the contribution to the second moment is kind of how the means are configured with respect to each other, plus the sigma squared times identity matrix due to the variance in the points. Now, something really nice happens for topic models when you have cross correlations [indiscernible]. So if you look at the joint distribution between two different words, these words are sampled independently, conditioned on the topic. So the noise is independent. So the joint distribution of two words, if you view it as a matrix, is just the weighted sum of the means, because, you know, if you condition on the topic, the noise is independent, so it's just the expected value of X1 X2 transpose. Conditioned on the topic they're independent, and the expected value of X given the topic is just mu I. So the joint distribution, the bigram matrix, is just the weighted average of mean mean transpose. Yes? >>: [inaudible] in the topic model have M words in each document. Is the dictionary size M? >> Sham Kakade: The dictionary is size D, and you get M samples from the document. So literally, the document is just -- >>: But then each word appears once or multiple times? Because if each word -- >> Sham Kakade: No, no, take the first two words. So in this one, suppose the document just has two words in it. I'm looking at the joint -- >>: X1 and X2 are not different words in the vocabulary? >> Sham Kakade: No, no. X1 is the first word. X2 is the second word. >>: I see. >> Sham Kakade: Okay. So we're looking at the joint distribution of the first word and the second word, and those two are independently drawn, given the topic. So the average value of X1 X2 transpose is just mean, mean transpose. You know, think of the mixture of Gaussians case. If we could get two different samples from the same Gaussian, then the noises would be independent and that sigma squared term would go away. >>: Another question, using the term being identifiable, and you said that from the means, the previous slides, you can't identify -- >> Sham Kakade: So what I mean by identifiable is there are two different models which could give rise to the same means. So if you only knew the means, could you figure out the parameters? >>: But in your kind of [indiscernible] model, two clusters differ only by the mean. So if I know the mean, I can identify it. >> Sham Kakade: No, no. So in the mixture of Gaussians model, if you just have the global average of the means, all you see is, in this case, the global average, which would lie somewhere here. If you just see this point, you can't figure out those Xs, the means, exactly. So you just know E of X, and now we're going to look at E of X X transpose, and what we get is this.
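[Editor's note: writing out the second-moment structure being pointed to on the slides, as reconstructed from the description above; mu_i, w_i, and sigma are as defined earlier.]

```latex
% Mixture of spherical Gaussians: the noise contributes a sigma^2 I term.
\mathbb{E}[x x^\top] \;=\; \sum_{i=1}^{K} w_i\, \mu_i \mu_i^\top + \sigma^2 I

% Single-topic model: two words are independent given the topic, so the
% bigram matrix has no noise term and has rank at most K.
\mathbb{E}[x_1 x_2^\top] \;=\; \sum_{i=1}^{K} w_i\, \mu_i \mu_i^\top
```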
And in the topic model case, rather than the Gaussian case, it's this. And now we see the first distinction, because this thing is a lower rank matrix. What's the rank of the bigram matrix? It's K, the number of topics. This other one is full rank; it's dimension D. This is the first distinction. And this is handy. And exchangeability is really a wonderful property -- being able to get many samples that are independent conditioned on the same state -- because we get this lower rank matrix. It's still not identifiable, because basically all we learn, intuitively -- you can think of this geometrically -- is an ellipse where the means lie. It's kind of what a second [indiscernible] matrix tells us. We basically just figure out an ellipse, which is like the covariance matrix. And we don't really know anything past that. It's just kind of a rotation problem. So we're getting close to the right number of parameters, but you can actually show it's not identifiable. And it's not a worst-case kind of non-identifiability: basically every single model is non-identifiable, not just a few points that get confused. >>: Sorry to jump ahead, but should I expect, like, in the case where it's less than P, it will be okay in the multi-topic case? >> Sham Kakade: No, actually, we'll be getting there in a second. >>: Okay. >> Sham Kakade: Parameter counting suggests, okay, well how big is this? This is size D squared. And how big is M3? It's D-cubed. Well, D-cubed has a lot of parameters in it. But this is definitely fishy, because there are some models which are just fundamentally not identifiable. >>: I should be clear, like where K squared is less than D, the bigram matrix would still be low rank. >> Sham Kakade: This is always low rank. Sorry, this is only low rank when D is bigger than K. >>: Right. >> Sham Kakade: So it's not K squared versus D, it's D versus K. >>: Right. In the single topic case. But in the multi-topic case? >> Sham Kakade: This is going to be surprising. So this one will break your intuition. And this is why the identifiability question was really nice to us, because when we had multiple topics, like, how long do documents need to be before you can identify it? And forget about computation, because we didn't expect an efficient algorithm. I just wanted to know the exact answer to this question. It had to have an exact answer, but we couldn't for the life of us -- because some of the difficulty with identifiability is that sometimes it's very hard to make identifiability arguments that are nonconstructive. So there are really beautiful theorems from the 70s by Kruskal on how to do this, but they're very difficult because, you know, they're existence proofs. So they're interesting questions. But we'll definitely get to that. >>: This is more of a higher level question. [indiscernible]. I mean [indiscernible] generally. >> Sham Kakade: Let's get back to that. So there's kind of a long debate among statisticians between these two, and Fisher basically argued the moment methods -- so Fisher basically argued maximum likelihood was more efficient -- >>: [indiscernible]. >> Sham Kakade: Yeah, it basically uses the samples better. But the point is we kind of knew moment methods were easier to use, even in simpler settings. Like if you look at Duda and Hart, the first edition, I think it says, you know, for some of these problems, you want to start with a moment method and then use Newton.
And then kind of the modern versions of statistics, by more modern theory statisticians, said do the method of moments, because maximum likelihood isn't consistent, and then do a step of Newton. And then they argued that was as efficient as maximum likelihood. Because maximum likelihood does stupid things with infinities. So the best thing to do for getting the constants is -- they're basically [indiscernible] with each other. So do the method of moments, then do a step of Newton, because it's a local optimization, and then that's as good as maximum likelihood. >>: [indiscernible]. >> Sham Kakade: That's what we're trying to do. >>: In terms of moments. >> Sham Kakade: We'll get to that in a second. Forget about samples right now. Let's figure out how to solve the moment equations. And for now let's drop the mixture of Gaussians case. We'll come back to it, because it's not low rank. Let's just proceed with the topic model case and then we'll come back to the mixture of Gaussians, okay? So again, now we kind of see what's going on. So if we have this trigram tensor, what's it going to look like? Well, the noise is independent. We're just going to get mu, mu, mu. So now suppose documents have three words. What do we get? I'm going to define M2 to be the bigram matrix. M3 is going to be the trigram probabilities. And it just looks like the sum of WI times mu, mu, mu. So it's this D-cubed sized beast. It's the same argument: the noise is independent. So now how do we figure out the mus from this? Let's just work in a slightly better coordinate system. Basically, this is the problem right now. How do we solve for this structure? We see a matrix that looks like this, we don't know the mus, and we see this D-cubed sized object that we know is guaranteed to look like this, but we don't know what these mus are. How do we figure it out? Geometrically, let's just think in a nice coordinate system. Whenever I see a matrix, I like to think about isotropic coordinates. So let's think about transforming the data so that this bigram matrix is the identity. And it's low rank, so that means when you transform it, we're going to be working in K dimensions now. So just take the bigram matrix, make it look like a sphere. That means we're going to project things to K dimensions, and we have K by K matrices. And when we do that, we're going to do this same linear transformation to the tensor, which means the equivalent way to look at this problem is: we know the second moment is isotropic, it's a sphere, and we have a third moment that looks like this, where now we're in K dimensions. And what this transformation to a sphere does is it makes these means orthogonal. So basically, you know, in this problem, we're now working in a coordinate system where these Xs are orthogonal, and this is it. Someone says here's a K-cubed object, which is pretty small. It's guaranteed to have this decomposition. You know they're orthogonal. What are the means? >>: It means you decompose things in a three-dimensional array. >> Sham Kakade: That's what this is. >>: Three-dimensional [indiscernible] -- it's a two-dimensional array? >> Sham Kakade: No, this is a three-dimensional decomposition, because each of these is D-cubed. So we know such a decomposition exists for this tensor. It's three-dimensional. How do we find it, okay? So that's -- >>: You project out to the [indiscernible]. >> Sham Kakade: No, I'm not going to do that. There's many different ways to do it. That's the question. We really just reduced this question to how do we do this decomposition, okay? So let's just back up a second and look at things as linear algebra operators. So let's go back to matrices. Remember this notation where we have a matrix M2. We can hit it with a vector on the left and right, which we can write as A transpose M2 B. For notation, I'm going to write that as hitting M2 with A and B this way, and just as in matrix multiplication, it means we just sum out the matrix over those coordinates. And for tensors, we can also think of these things as trilinear operators. We can take a tensor and hit it with three vectors, and just in the same way you do matrix multiplication, you take this tensor, which has three coordinates, and sum them out against these three vectors. So it's kind of linear in each of these arguments. It's, you know, multi-linear. What's an eigenvector of a matrix, looking at it in this form? Well, if you hit it with a vector, you get back the same vector, scaled. There's a generalization of eigenvectors for tensors, where you just hit the cube now, rather than the square: hit it twice with a vector and you're going to get back a vector, because you're summing over three coordinates. If you hit it twice with a vector, you get back a vector, and we're going to call it an eigenvector if that direction is proportional to the direction we hit it with. Okay. So this is actually pretty well studied in various areas of mathematics and other areas. The problem is tensors are horrible in a lot of ways in the [indiscernible] case, because they don't inherit a lot of the nice structure of matrices. But we can still define these things, and there's a kind of active area of study as to what they look like. That's the definition. Turns out for our case, they're very well behaved. What's our case? Our case is we have this problem. We know this cube looks like a sum of orthogonal vectors. Any guesses? Well, what are the eigenvectors of this whitened tensor? So let's just hit this M3 with two vectors V. What does it end up looking like? Well, this beast was the sum of WI times mu, mu, mu. We hit it with V on the mus twice, so it looks like the sum of WI times V dot mu I, squared, times mu I. And if we want to find an eigenvector, this had better be equal to lambda V. Okay. So let's suppose V is mu one. They're all orthogonal. So I put mu one in twice in that expression. What do you get on the right-hand side? The mus are orthogonal, so what do you get? Yes, you get W1 times mu one, or [indiscernible] for the transformation. And this is it. Basically, the tensor eigenvectors are exactly the topics. They're projected and they're scaled, so it's easy to unproject, because we know the whitening matrix -- we just snap them back into D dimensions -- and we can figure out the scale because we make them sum to one; they're probability vectors. Okay. And what was the assumption we needed to get this to work out? Well, what did I do? I made things white. When does that work? I need the topics to be linearly independent. That's the only assumption. Because obviously, if one topic is in the convex hull of the others, that might be problematic, because then there are even identifiability questions. But as long as things are well-conditioned, which basically always occurs in general position, this is a minor assumption. So as long as the topics are linearly independent, which is minor, all of these tensor eigenvectors are the topics. It's a very clean statement, and the reason tensors are nice to work with is because for this particular case, we have this nice tensor structure. In general, tensors are a mess. This is about the nicest possible case. >>: So do you stop at that moment? >> Sham Kakade: These moments are enough [indiscernible] to identify it. And this is the decomposition you need. And there are kind of no multiplicity issues either. With matrix eigenvectors, if you had the same eigenvalue you could get linear combinations, but there are no issues here, because even if all of these Ws were the same, it doesn't cause problems, because of the way tensors work, because of this cube. You can't get linear combinations being eigenvectors. >>: So is this limited still to the Gaussian case, or -- >> Sham Kakade: We dropped the Gaussian case for now. This is for the topic model case with one topic per document. >>: But the cardinality -- >> Sham Kakade: The cardinality can be arbitrary, yes. So now the statement is you just need documents with three words, for M3. Any number of topics, as long as they're linearly independent, and this is kind of the closed form solution, which we can think of as an eigenvector problem. >>: Do the lambdas suggest how many topics to pick? >> Sham Kakade: No, the number of topics is going to come from the rank of the bigram matrix. Basically, you just look at the rank of that matrix, and that's the number of topics. >>: [indiscernible] MOG. >> Sham Kakade: We're going to get back to that. So now the question is what happens in the mixture of Gaussians, what happens in LDA and kind of richer models. This is a clean way -- [indiscernible] the algorithmic question now too, because can we solve this thing? We know how to do this for matrices. General tensors are hard. But it's basically known -- you know, we did the analysis. Basically, the analog is the power iteration. You just hit it twice, repeat, because you want to find a fixed point -- this is exactly how matrices work -- and then deflate. This thing converges insanely fast. It's even faster than the power iteration for matrices, which is log one over epsilon. This is log log one over epsilon. Basically, this powering means you converge extremely quickly. So it's a very fast algorithm. A different way to view this, which I find kind of interesting from a geometric perspective, is that we're basically maximizing skewness. You can view that eigenvector condition as saying -- it's almost like a subgradient condition which says maximize M3 hit with V three times, which is like kind of maximizing the variance in the third moment. Okay. And it's kind of like finding the spiky directions. That's another way to view it, and this is why it's greedy, because it says all local optimizers of this third moment are the solutions we want. Which is why you can kind of rip them off one at a time, because every spiky direction is one you want. And so, you know, there are a lot of different algorithms which are efficient because they're greedy. These are very similar to decompositions studied in ICA, because somehow the same tensor structure arises in that setting. So we understand it, and we also understand some of the [indiscernible] questions, because if we don't have exact moments, you just use plug-in moments, and it's just a perturbation argument, and the stability kind of just depends on how overlapping the clusters are. But that's real. But now let's -- so we're good with that.
This is a minor point -- I mean, the real thing is understanding the closed form structure, and the rest is perturbation. But now, if we're done with that, we should look at the mixture of Gaussians case and the multiple topic case. So we good with this? This is pretty clean. I think we hopefully understood the proof and everything. So now what about the mixture of Gaussians case? So let's go back. There's this pesky sigma squared I. And if we're looking at third moments, we're going to again get problems. But what's sigma squared? Can we just figure it out? Suppose D was bigger than K, strictly bigger than K. So the previous algorithm worked if D equaled K in the topic case, but for now just suppose D is like K plus 1 or bigger; this is almost always the case. So picture something like this. Sigma squared is just the variance off of the subspace. Because basically, in the subspace, if I look at the variance in a direction, I get a contribution from how the means vary and from how the points spread out. If you look at the variance in a direction orthogonal to the subspace, the only contribution to the variance in that direction is sigma squared. So what that means mathematically is that the minimal eigenvalue of this matrix is sigma squared, because these things lie in a K-dimensional space; just look at any direction orthogonal to that, which is what happens if you look at the minimum eigenvalue, or the K plus [indiscernible] eigenvalue -- that's sigma squared. So we know sigma squared. It's just the minimal eigenvalue of that matrix, which we know. Okay. And we need the dimension to be bigger by one for that argument. So it's estimable. So we can just subtract it out. So we can basically figure this beast out, and the way we look at that is: we look at the second moment matrix, we figure out the noise, and we get rid of the noise. >>: So why is [indiscernible] the smallest? What if WI is really tiny? >> Sham Kakade: See, the point is you're going to use an orthogonal direction, because this is a PSD matrix. So look at -- hit this thing with any direction V, right. You're going to get V on this guy, but if we guarantee V to be orthogonal to this, that term is zero. So -- >>: Okay, I see. >> Sham Kakade: All I'm saying is geometrically, you can pick out the variance off the subspace just by minimizing the variance. So yeah. >>: So there will be some dimension, okay. >> Sham Kakade: Yeah, and if it's greater -- and it turns out there's a cheap trick that even if D equaled K, you can still do it, because there's a very natural point here -- let me go back. Basically, if you just look at the covariance matrix, where you subtract out the mean, sigma squared is the minimum eigenvalue of the covariance matrix. So even if D equaled K, you could still figure out sigma squared; it's just the minimal eigenvalue. So forget about that case if that's not clear. But sigma squared is known. So we can get rid of that. What about the third moment? If we look at the third moment, it's not just mu, mu, mu. You're going to get extra junk. Let's look at what that is. Let's just go to one dimension. Suppose we're in one dimension and we have a random variable which is a mean plus Gaussian noise, where the noise is mean zero with variance sigma squared. What is the expected value of X cubed? We get one term which is like mu cubed, right? >>: What's the other term? [inaudible]. >> Sham Kakade: The expected value of [indiscernible] squared is sigma squared times mu, and then there's a factor of three from counting. The expected value is this. So basically, we want mu, mu, mu, and we get some extra junk. We get three sigma squared mu. We know mu, because that's the first moment -- that's E of X. And we know sigma squared. Granted, this is one dimension -- we've just got to write this in kind of a tensor way. So basically, think of three sigma squared mu as equal to sigma squared times: one times one times mu, plus one times mu times one, plus mu times one times one. It's a stupid way of writing it, but the only reason I say that is because -- so M2, what do we do? We subtract out sigma squared times the identity, which we estimated, and then we just do the tensor version of subtracting the three sigma squared term to get rid of it, which is basically -- and these EIs are the basis vectors -- so it's really just mean-one-one, plus one-mean-one, plus one-one-mean, times sigma squared, subtracted off. And these two exactly have the structure we want. This means for the probabilistic model underlying k-means, we basically can construct this form. The eigenvectors are the projected means, up to scale. How do we unscale? Turns out you can get the scale from the eigenvalues themselves in this case. But this is the main point. The main point is, algebraically, the structure we have, if we just rip out the noise, is this. It's pretty neat. And this is why we see the difference from the topic model: because we don't have exchangeability, the noise correlates, so we've got to futz around with the noise because we can't get these low rank matrices. It doesn't really matter. We just get this, and then, you know, we get a closed form solution with exact moments. Now, let's go back to Greg and Ankur's result. They gave an exponential lower bound, but the way they did it, they put K Gaussians on a line in a very particular configuration. And what that effectively does is -- you know, the assumption we needed to solve this problem is they had to be in general position. And K Gaussians on a line are not in general position. And so what's kind of nice is we actually know there's some gap in between: if they're not in general position, it could be bad, and if they are in general position, it turns out we only have a polynomial dependence on the separation condition. By that I mean in general position they still could be close together, but we can look at the minimal singular value of the matrix of the means, and there's only a polynomial dependence there. Where somehow, once they get onto a line, you could be exponentially bad in the separation, whereas in general position it would be polynomial. And you can kind of see why, because if they're on a line, the third moment is just a number. So it's not identifiable from the third moment if it's a number. You have to go to very high moments. And estimating very high moments is unstable. Now, that's only the intuition for one algorithm -- they prove it information theoretically, so it holds for any algorithm. But the intuition is nice. So that's how we got around the lower bound. >>: If you took general Gaussians, you would not have this kind of -- >> Sham Kakade: We don't know how to solve the elliptic case. There, I suspect -- I don't even know how to prove hardness results. There are cases which I think statistically are fine to estimate. But computationally, we have no idea.
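[Editor's note: a recap, in formulas, of the spherical-Gaussian corrections just described, before the discussion continues. Here m denotes the first moment E[x], e_1, ..., e_D the coordinate basis vectors, and sigma^2 is read off the covariance as discussed; this is a reconstruction of the slides from the talk's description.]

```latex
% sigma^2 from the second moment (the centered means span at most a (K-1)-dimensional subspace):
\sigma^2 = \lambda_{\min}\big(\mathbb{E}[xx^\top] - \mathbb{E}[x]\,\mathbb{E}[x]^\top\big)

% Corrected second and third moments, with m := \mathbb{E}[x]:
M_2 := \mathbb{E}[x x^\top] - \sigma^2 I = \sum_{i=1}^{K} w_i\, \mu_i \mu_i^\top

M_3 := \mathbb{E}[x \otimes x \otimes x]
       - \sigma^2 \sum_{j=1}^{D}\big( m \otimes e_j \otimes e_j
         + e_j \otimes m \otimes e_j + e_j \otimes e_j \otimes m \big)
     = \sum_{i=1}^{K} w_i\, \mu_i \otimes \mu_i \otimes \mu_i
```

These corrected moments have the same orthogonally decomposable structure as in the single-topic model, so the same whitening and tensor power iteration can be applied to them.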
And I don't think there's any language for how to even understand hardness in these average case type scenarios. So anyway, for this very natural case, it's -- you basically have seen the proof. This is it. It's pretty simple, and the very last thing I want to do is look at this case of multiple topics. So ->>: Okay, but can you talk about like what the finite sample effects are? 21 >> Sham Kakade: Oh, yeah. So it's basically like the analog of a matrix perturbation argument, that if -- I'm only going to say [indiscernible] in the paper for now. But basically for SVDs, there's like [indiscernible] theorem and these kind of theorems of how accurate is an SVD if our matrices are perturbed. This is just the analog, how accurate are these eigenvectors if the tensors are perturbed and that's what I was referring to here. What it depends on basically are how co-linear the means are, but in a nice way. And that's kind of real, because as the topics start becoming co-linear, you're going to expect to need more samples and that's because of the stability of an SVD, even for the matrix case, the stability of an SVD depends on the minimal eigenvalue. But it behaves kind of nicely in the way that our matrix perturbation theory is nice. And the only way it starts becoming bad is if it becomes actually [indiscernible] we don't know how to solve it. But basically, it's all like [indiscernible] and nice dependencies, but I'm not going to explicitly give those theorems here, but they will appear on the paper. And they're kind of dependencies one would expect. I think even statistically, they're real. They're kind of information theoretic. The point is they're mild. Not like these exponential worst case things. >>: Do you actually like, for a finite sample, for instance, you would use, like, the actual minimum eigenvalue for sigma squared, or do you like correct it? >> Sham Kakade: So what I would use, I would first project it in K dimensions and use the minimum in the K space. There's actually a trick I would use. I would actually try to use a slightly different model for the mixture of Gaussians. We can talk about that later. I would look at it more like a topic model case, which I'll get to. Because this is more like a spherical case, and I don't know how to handle -- there's another case of mixture of Gaussians I can solve. But maybe let's come back to that in the discussion section. Let's go to the multiple topic case, unless they're ->>: We have yet to find the estimating value. 22 >> Sham Kakade: >>: No, no, it's the eigenvector. So I can -- Oh -- >> Sham Kakade: The eigenvector is the tensor. There are many ways to solve that problem. In a sense, this is the moment. I've set it up for you, and go to town. The point is that we know there are algorithms that solve this. They come from -- you can actually do this with two SVDs as well. So the earlier work, we were using kind of bad algorithms to do it, because you can project this down to matrices and kind of there's this term simultaneous diagonalization. That's another way to do it. But geometrically, once we understood this, it's like this is the structure. We know how to solve it. You can think of it as a generalized eigenvector problem, which is a really clean ->>: So eigenvector becomes the mu of the estimate? >> Sham Kakade: The eigenvectors are telling us the means, yes, up to scale. But we can find the scale and then we can find the Ws. >>: How about estimate of sigma square? 
>> Sham Kakade: We got sigma square because that's how we got these formulas. It's the covariance. So let's go to topics. Topic model case, if there's multiple topics per document, so now you can have a document like 30 percent cute animals, 70 percent YouTube or something like that. How do we figure this out? Well, what's the identifiability issue? If every document is about five words, do we need three times five or how long do they need to be? And it would be bad if the moment had to increase because in a sense, estimating higher moments becomes exponentially more difficult in the order of the moment. But parameter counting suggests third moment is enough still. Even if you have a mixture of multiple moments, you've got D-cubed parameters in third moment. And for a long time, we thought the LDA problem had to be exponential, and this is borne out because some theoreticians basically gave very strong assumptions 23 to solve it, you know, people who did not think you could do this in closed form. But it turns out LDA, all you need is three words per document, and it's the same idea of just maximizing some third moment. So LDA basically, you know, you have to specify a distribution over distributions, because now every document is about a few different topics so it means, you know, every document, you know, you center to specify this distribution over distributions and these LDA distributions are these kind of nice pictures. We have level sets here. So that's what the prior is here. So this pi is, rather than this document being about topic one and this document is going to be about 30 percent cute animals, 70 percent YouTube, that's specified there, and that's this kind of distribution over the triangle, which has a particular form. In a sense, it's like the nicest possible form distribution you could put down for a triangle. And it captures sparsity, because you could have pictures where these level sets kind of bow out. And the point is that, you know, you just got to write down what these expectations look like, and, you know, [indiscernible] it's not so bad. Some gamma functions. But the point is you look at the structure of these things. And the same trick for the mixture of Gaussians. Now if you look at correlations, you got kind of in extra terms in a similar kind of way, and you just subtract them out, the stuff you don't want. And this kind of has nice limiting behavior. Where the kind of coefficients of subtraction basically depend on like a sparsity level of the model, which is often said in practice. So the only parameter we need now is that kind of the sum of these alphas, which is often said in practice, and it's like an average level. It kind of determines the sparsity level. So if you know that, you can kind of determine what to subtract and it kind of blends between two regimes. So when alpha goes to zero, you kind of go back to the single topic case, where these go away. And there's more stuff here, because it looks more messy, you know. You get all of the symmetrization and stuff like that. But it's still easy to write. And as alpha becomes very large, what these moments actually end up looking is central moments. When alpha is large, this thing ends up looking like, it 24 looks like E of X1 minus the mean, times X2 minus the mean. And same thing here. And that sort of should be the case, because this triangle start looking like a product distribution. 
So these two regimes are kind of nice, and I like this kind of maximizing kind of perspective rather than the potential eigenvector, because basically, you know, you want to find these pointy directions. So this skewing is almost like a geometric problem of how do I find the corners of a convex polytope from its moments. It's all these different perspectives are interesting and that's basically what's going on here. I'm just looking at this [indiscernible] distribution, skewing things a little based on the way the measure is put on the simplex so, you know, the maximizers point to the corners. So that's it. So the LDA has this closed form solution. You can kind of tweak these moments with lower order things which you know. They have the same structure. The eigenvectors are the topics. >>: It's a closed form solution for a different objective function, right? >> Sham Kakade: If the moments are exact, it's just a closed form solution, that's right. So if it's not a closed form solution, it's an inverse moment solution, but as I said before, at least in some cases, in classical stats, we know if you did the moment estimator and then did a step of Newton, that is about as good as MLEs. And for these problems, it might actually be better. This is often like infinity problems with MLEs. So in practice, it's kind of nice to use moment estimators and then local search on top of that. >>: [indiscernible]. >> Sham Kakade: I guess inference is hard so you'd have to do sampling or variational after that. Okay. So that's LDA. And again, you raised a point and this is nice, because inference is a headache in these models, and it's hard. And that was very important for this, you know, the single topic models, mixture of Gaussians, it's easy to do inference. These problems, it's hard because even just writing down the posterior is not closed form, yet we can still solve it in this greedy way. So somehow, we've gotten rid of inference in these moment approaches. 25 >>: So you can do the same thing as [indiscernible]. >> Sham Kakade: you want. >>: It's the same form of the moment. Just figure it out any way Eigenvector, but you have to convert that eigenvector into the parameters? >> Sham Kakade: The eigenvectors are the distribution so you just unproject them and normalize it. Okay? >>: Is that when you have a prior also in the word distributions? >> Sham Kakade: No prior in the word. There's a prior in the topic distribution. Because the word distributions, we think of them as parameters. Those are the topics. There's no prior there. >>: So that [indiscernible] optimum for the moment. >> Sham Kakade: In a sense. Think about more of this identifiability viewpoint more than global optimum, if you know these moments, you uniquely specified the parameters and you have a closed form way to get it, and then it's kind of a [indiscernible] argument to figure things out. >>: Turns it into [indiscernible]. >> Sham Kakade: You need to do second in second order. That's going to be close to maximum likelihood, because you're basically chasing the maximum likelihood cost function at that point. But a step of Newton is pretty darn good. So it will get you this kind of ->>: That will be local for. >> Sham Kakade: Not local. That will be actually -- it's not going to be -it's going be close to the maximum likelihood solution. It's going to be close to the global maximum likelihood solution. >>: Is it going to be -- >> Sham Kakade: Because the parameters are close. You'd have to do some like 26 [indiscernible] argument. 
>>: [indiscernible]. >> Sham Kakade: Yes, so this is this whacky -- so now we're getting close to discussion section so I'm just going to answer this. So this is where it's funny with these average case hardness results, because we know maximum likelihood is NP hard in general. Yet what this is saying, I'm going to argue that this will give us something close to the maximum likelihood solution, because we're getting the true parameters. And I can give you a poly sample size. So even without doing the step of Newton, I'm getting close to the true parameters. And the maximum likelihood solution, the global one, is also close to the true parameters. So what's the contradiction here? The contradiction is this is like an average case result. It's saying if the points come from the distribution, then we can find it. And there are cases where we have average case results. >>: But you're completing ignoring the disparity between the dataset and what the model assumptions are. >> Sham Kakade: No, no. So the hardness results are showing that there exist -- if we could solve this problem for all point configurations, then P equals NP. This is not saying that. This is saying with high probability, for a configurations that tend to look like the model's correct, those are the ones we can solve. Which is why, from a complexity point of view, it's very hard to start understanding average case hardness when the model is correct, because I think there's cases for these models where statistically, like in the mixture of Gaussians case, I'm sure there's cases where when it's not spherical, statistically, it's fine. We know those cases. Yet solving them, we don't know how to do. How are you going to prove a lower bound? Only the perimeter or something, it's kind of interesting cases we know how to give low bounds for. Anyway, I need to get back to the discussion. One more slide. I want to say the idea generalizes to other models, so hidden Markov models, length three chains, you can estimate it with the same idea. You can cook up tensors that look like that. 27 Some other recent work on harder questions for like structure learning. So these models in linguistics called probabilistic context-free grammars. You see the sentence like the man saw the dog with the telescope, is the man seeing a dog holding a telescope, or is he holding the telescope and how you figure it out involves parsing problems. There's a wonderful set of questions here for how you learn these models. Turns out we made some recent progress in showing that even if you had inference statistics, the general phrasing of these problems are not identifiable if you only see sentences, which is pretty frustrating in some work with Percy. Under some restricted assumptions we can make progress on these models. But again, there's an interesting in between for the restrictions we make and the non-identifiability of these models, which we'd like to make progress on. Also, some recent work on learning the structure of Bayesian networks with these models. Suppose you see some DAG, you observe these nodes here. You'd like to figure out the structure. You don't even know the number of nodes or the edges, how do you even know this network is identifiable? We can look at moments and try to figure out what's going on. And here, we actually need some new techniques, even to just figure out, you know, identifiability. Because, for example, suppose this network, these are observed went up this way. Get more and more nodes as you go up. 
You wouldn't hope this thing could be identifiable. So even just characterizing identifiability turned out to involve some graphic expansion properties and then combined with some moment ideas. >>: So just a question about the hidden Markov models, I was thinking this could be interesting for [indiscernible]. I was talking to Jason this past summer and he pointed out, let's say I only looked at adjacent trigram statistics. How do I get identifiability on chains like ABBB star A, or CBBB star C? Say I have those two possible sequences coming out from the my model. Where the A is at the beginning and the end, but there's a sequence of Bs that's too long in of the middle. >> Sham Kakade: So these are models where the assumptions could be wrong. Or you have to use -- I mean, there's kind of this cryptographic hardness results 28 that you can make long chains and kind of hide combination locks in them, and to some degree, the hope is these should be divorced in practice. If I see some garbage fall out and, bang, something happens really far out in the future, cryptographically, this is hard. >>: But these happen all the time in linguistic phenomena, right? Like I have agreement phenomena that can be arbitrarily long distance. Maybe I should try to use tree structured models. >> Sham Kakade: That's right. I would argue that hopefully for linguistics, the regimes we're in are not the cryptographic hardness regimes, but the delicate is how do you kind of phrase the model to avoid that? >>: I'm not convinced that trigram statistic suffice to capture specific these inter-dependencies either. >> Sham Kakade: So in practice, this he seem to work reasonably well for initialization. So Jeff Gordon has been doing a lot of work. Even in linguistics, Michael Collins has been actually playing around not with these eigenvector methods, but earlier operator representations. He's been getting some reasonable results. >>: There are certainly things that you can capture, but I just -- okay. >> Sham Kakade: But I would argue these are modeling questions rather than fitting questions. Which is why these models are very interesting but now how do you fit them? These are wonderful questions across the board here. >>: For three length chains, what it does it mean? >> Sham Kakade: The chains can be arbitrarily long, but you only need to look at the statistics of three things in a row to figure things out. Because it's the same kind of idea. You don't need to look at very long chains, it's just the correlations in three only have enough parameters, but you can actually ->>: [indiscernible]. >> Sham Kakade: Yeah, a third moment, you can kind of construct the tensors and figure out -- 29 >>: [indiscernible]. >> Sham Kakade: If you have any higher moment, you can also solve it, because it gives you the information about the third. >>: You talk about two third moment, what about fourth moment? >> Sham Kakade: Fourth basically have the same structures. moments as well. >>: You can use higher It doesn't have length of three, as you said? >> Sham Kakade: The thing is if you have longer chains, you can estimate the three better. So even in LDA, if you have longer documents, I would just use the longer documents to get better estimates of my trigram statistics. I would rarely go beyond third or fourth. Third is good if you have kind of asymmetric things. If things are more [indiscernible], you want fourth. So in a lot of settings, understanding the tensor structure does provide us some interesting solutions for these problems. 
And kind of surprisingly, they're very simple solutions, and people in many different areas are now actually studying various algebraic properties of these tensors. The paper for this stuff is forthcoming. To some degree, it's all in previous papers; we just didn't really understand the tensor structure. We were solving it with basically simultaneous diagonalization, where we took the third moment, projected it to a second moment, and then we were futzing around with it that way. Those weren't really the best algorithms. Now we're realizing this is the structure we have, and there are actually algorithms from ICA. They've considered power methods; even Tom Yang has some paper with [indiscernible], his advisor, from 15 years ago on tensor decompositions for how to do this. This is joint work with a number of colleagues over the years. Two particular people are Daniel and Anima. Daniel has been working on [indiscernible] since the beginning of the HMM stuff. He was an intern with me at TTI. He was a post-doc at Microsoft. Anima is faculty at Irvine, and she's visiting me now. Both of them are fantastic. They're terrific to work with and just [indiscernible] colleagues across the board. So thanks a lot.
>>: [indiscernible] many layers.
>> Sham Kakade: Yeah.
>>: So does that deal with this explaining away [indiscernible]?
>> Sham Kakade: Right, so in a sense, this is what we're trying to avoid when you learn these things, because somehow coupling inference with learning makes it difficult. When we think about learning Bayesian networks, I sort of don't want to think about these explaining-away things, in the same way I don't want to think about which documents are about which multiple topics in the learning process. I want at least one way to -- it's not the only way; other methods might be good, but it's a different way of thinking about it: what can we recover from just the average correlations? But for learning the Bayesian networks, actually that paper just appeared on arXiv maybe yesterday or today or something. But there, we actually need more techniques, because somehow the problem with Bayesian networks is, you know, this LDA model is almost like a one-level kind of model. These are which topics appear, and these are the words which appear, and we have an explicit model of the correlations between these topics in the LDA model. The problem with the Bayesian networks is we have no idea what the correlation model is at this level, because we don't know what's above it. So we're actually using there some ideas of looking at a sparsity constraint, which is kind of what the expander condition is doing. There's a paper from COLT this year on how do you take a matrix and decompose it into basically some kind of weight matrix, but it's sparse, times some other matrix. There's a nice paper by Dan Spielman, and we're really utilizing those techniques along with some of these moment ideas. But even -- I mean, the way I used to think about it is: what do these things need to even be identifiable? If you could make a formal argument for that, in many cases I think that does reveal something about the structure of the problem. And if it's not even identifiable, that's giving some intuition as to what the hard cases are. And for these kinds of hidden variable models, identifiability is fundamental, because we don't even know the structure of these things, and it's clear you can sometimes put in other nodes to give rise to the same structure.
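For the power-method idea mentioned above, here is a small self-contained sketch, a toy example rather than the talk's exact algorithm: build a symmetric tensor with an orthogonal decomposition T = sum_i w_i v_i (x) v_i (x) v_i (orthonormal v_i and positive weights are assumed for illustration) and recover one component by repeated tensor power iterations.

# Tensor power iteration on an orthogonally decomposable symmetric tensor.
# The components V and weights w are synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(1)
d, k = 5, 3
V, _ = np.linalg.qr(rng.standard_normal((d, k)))   # orthonormal columns (assumed)
w = np.array([3.0, 2.0, 1.0])                      # positive weights (assumed)

# T[a, b, c] = sum_i w_i V[a, i] V[b, i] V[c, i]
T = np.einsum('i,ai,bi,ci->abc', w, V, V, V)

def tensor_apply(T, theta):
    # T(I, theta, theta): contract the last two modes with theta
    return np.einsum('abc,b,c->a', T, theta, theta)

theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
for _ in range(50):                       # power iterations
    theta = tensor_apply(T, theta)
    theta /= np.linalg.norm(theta)

# theta should align (up to sign) with one of the columns of V
print(np.abs(V.T @ theta).round(3))       # one entry ~1, the others ~0

In the estimation setting, T would be the (whitened) empirical third-moment tensor rather than an exactly decomposable one, and the remaining components would be found by deflating or re-running from fresh random starts; those details are omitted here.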
So it's just different techniques, though. Other questions? So I'll be around for the week, and it would be great to chat with people, because we are playing around with these algorithms. People understand these algorithms in other settings, and they're very natural, particularly with features. It's just: find some pointy directions in your data. It's like the LDA problem. The problem is, if you try to do this on your raw data and it's noisy, all the words lie really far away from the corners. But somehow, averaging is what allows you to say I'm finding these pointy directions. So you can't just find the pointy directions in the raw data, because words don't lie anywhere near, you know, where they should be. And even for documents of length three, you know, if I say this document is about sports, history, literature, and dogs and there are only three words in it, you'd be like, you're crazy, this is only three words. So somehow, you really do need to do these averaging techniques. And they are an interesting class of algorithms for initialization. So we have been toying around with them. It's just another bag of tools we have. If you're interested in playing around with it, it would be fun to chat about this set of techniques more broadly, because I don't think they solve everything, but thinking about any problem differently always helps us, and this is a new set of tools that I think should be complementary to other approaches.
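As a small illustration of the averaging point, here is a toy sketch with made-up numbers (not from the talk): a single short document tells you almost nothing about its topic, but word co-occurrence statistics averaged over many documents concentrate around the population moment sum_i w_i mu_i mu_i^T, which is what the moment-based methods actually work with.

# Averaging word co-occurrences across many short documents from a
# single-topic model; the topic-word distributions mu and weights w are
# hypothetical, chosen only for illustration.
import numpy as np

rng = np.random.default_rng(2)
D, K = 20, 3                                  # vocab size, number of topics
mu = rng.dirichlet(np.ones(D) * 0.2, size=K)  # hypothetical topic-word distributions
w = np.array([0.5, 0.3, 0.2])                 # hypothetical topic proportions

pairs = np.zeros((D, D))
n_docs = 50000
for _ in range(n_docs):
    topic = rng.choice(K, p=w)
    x1, x2 = rng.choice(D, size=2, p=mu[topic])   # two words from one short document
    pairs[x1, x2] += 1
pairs /= n_docs                                   # empirical pair probabilities

population = sum(w[i] * np.outer(mu[i], mu[i]) for i in range(K))
print(np.abs(pairs - population).max())       # small: the averaged moment is close

No individual document's empirical word vector is anywhere near a topic vector, but the averaged second moment is, and the "pointy directions" are then extracted from moments like this rather than from the raw documents.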