>> Li Deng: Okay. Welcome to this lecture. We thank Moontae for flying all the way from Cornell to come here for this week. He will talk to us about topic modeling, and this is the paper he will present at the upcoming NIPS, so we get a preview of his work. He's an extremely productive researcher and intern and we appreciate that he spent one more week with us. Thank you.
>> Moontae Lee: Thanks for the introduction. Today I will present "Robust Spectral Inference for Joint Stochastic Matrix Factorization and Topic Modeling." This is joint work with David Bindel and David Mimno, professors at Cornell University. Before going into the outline, I will briefly talk about the difference between several ways of learning models. In machine learning, in order to learn the model parameters, we have roughly two different methods. One is likelihood-based training: we first choose a proper likelihood estimator -- there are a lot of different estimators, like pseudo-likelihood, maximum likelihood, or maximum a posteriori -- then we form a likelihood function in terms of the model parameters and find the best parameters, usually via optimization. But there is another class of methods called the method of moments: first relate population moments to the model parameters, then estimate the population moments from samples. It usually involves solving several equations. So if I compare these two approaches to learning parameters: solving likelihood usually uses optimization, which requires multiple iterations and makes the algorithm slow, whereas with the method of moments we are solving closed-form equations and the speed is relatively fast. In terms of estimation quality, unless the likelihood function is designed to be strictly convex, likelihood is not always optimal, whereas the method of moments is statistically consistent. What I mean by statistically consistent is that if we get more and more samples, it always converges to the right estimator. But whenever there is a mismatch between our model and the real data, a likelihood-based method has intrinsic power to manage model mismatch, whereas it's unclear how to do that in the method of moments. So one of the biggest focuses of this talk is how to handle model mismatch in a specific problem related to topic modeling, and the method I'm interested in in this paper uses both matrix algebra and probabilistic inference together in the same framework. Now I will briefly explain latent Dirichlet allocation. We have two Dirichlet hyperparameters for the multinomial distributions, and for each document and position we first sample a topic from the topic distribution of that document. Based on that topic, we sample the word. This is how the topic model explains document generation.
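To make the generative story concrete, here is a minimal sketch of the process just described, written in Python with NumPy. All sizes and the Dirichlet hyperparameters are made-up illustrative values, not settings from the talk or the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, D, N = 1000, 5, 100, 80   # vocabulary size, topics, documents, words per document
alpha, beta = 0.1, 0.01         # illustrative Dirichlet hyperparameters

# Column-stochastic word-topic matrix B (V x K): every column sums to one.
B = rng.dirichlet(np.full(V, beta), size=K).T

docs = []
for _ in range(D):
    theta = rng.dirichlet(np.full(K, alpha))     # topic distribution of this document
    z = rng.choice(K, size=N, p=theta)           # sample a topic for each position
    w = [rng.choice(V, p=B[:, k]) for k in z]    # sample a word given that topic
    docs.append(w)
```

Stacking the per-document theta vectors as columns gives the second column-stochastic factor in the matrix view described next.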
If I propose the corresponding view in terms of matrix factorization, it looks like this. We have a word distribution for each topic, and a topic distribution for each document, the nth document. Because these are distributions, this matrix is column stochastic: the entries of every column sum to one. So let's say we currently have only one document on the right-hand side, and in this first document the first topic is the dominant topic. Then what we expect to observe at the word level is repetition of the words associated with that topic -- the first word and the fourth and fifth words. And if the second document contains the second and fourth topics a lot, then the expected document looks like the sum of the words associated with those topics. But in reality the model never coincides with the real data exactly, so we observe this noise here. Now I'm explaining the generative view for joint stochastic matrix factorization, in comparison to the previous LDA model. Now we have a pair of topics and a pair of words here. The pair of topics is sampled from the topic-topic distribution, and based on those topics we sample the pair of words; the words are of course related to the word-topic distribution here. The corresponding matrix factorization view looks like this. Now we have a topic-topic matrix whose entries all sum to one -- this is a joint stochastic matrix -- and we have the word-topic matrix here and its transpose here. What we observe is this word-word co-occurrence matrix, here for the nth document, and in reality what we observe contains some degree of noise, like this. So the goal in this paper is to decompose what we observe in this fashion. Okay. I will explain why we are using second-order statistics rather than first-order statistics. The benefit, which is proven in Arora et al., 2012: unfortunately the first-order statistics -- simple word occurrences in each document -- are far from the ideal statistics, but fortunately the second-order word-word co-occurrence matrix converges well to the ideal word-word co-occurrence; this is proved mathematically. So the goal is to decompose the noisy observation of the co-occurrence matrix into this B A B transpose form, and what is called inference here is to recover the word-topic matrix B. In order to do that, we are going to use something called the separability assumption. What I mean by that: in typical non-negative matrix factorization the goal is to minimize the difference between C and B A B transpose, usually in Frobenius norm. But merely doing that will not give us a great result; it usually produces unrecognizable topics, because there is no identifiability guarantee. In some sense the separability assumption, which I'll explain on the next slide, guarantees identifiability for this problem. So again, just minimizing the difference between C and B A B transpose is not the goal of this paper.
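Continuing that toy example, here is one way the observed second-order statistic C could be formed from the documents. This is a sketch of a common unbiased estimator over pairs of distinct word positions within each document; the exact estimator used in the paper may differ in its details.

```python
def cooccurrence(docs, V):
    """Word-word co-occurrence C: joint stochastic, i.e. all entries sum to one.
    For each document, count ordered pairs of distinct positions and normalize,
    then average over documents."""
    C = np.zeros((V, V))
    for w in docs:
        h = np.bincount(w, minlength=V).astype(float)  # word counts in this document
        n = h.sum()
        C += (np.outer(h, h) - np.diag(h)) / (n * (n - 1))
    return C / len(docs)

C = cooccurrence(docs, V)
assert abs(C.sum() - 1.0) < 1e-8   # joint stochastic: the entries sum to one
```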
So what is the separability assumption? Each topic k has a specific anchor word s_k which is exclusive to that topic. What I mean by exclusive is described here: given that topic, we have a positive probability of observing that anchor word, whereas given any other topic there is zero probability of observing that specific word. This implies that not every document about topic k must contain that anchor word, but every document which contains that anchor word tells us at least something about that topic. The anchor word defined on the previous slide corresponds to this red block here. The red block is entirely dedicated to the first topic, because there is zero probability of seeing this word, the fifth word, given any other topic. So we have three anchor words in this picture.
>>: So this assumption is valid?
>> Moontae Lee: This assumption is valid?
>>: Is this assumption valid or not?
>> Moontae Lee: In real data it's of course not always valid. But without this assumption, as I said, the decomposition doesn't have identifiability. In all this unsupervised learning we would like to identify different mixtures -- topic mixtures, for example, in this problem -- and those mixtures are not sufficiently separated from each other without this assumption.
>>: But this will group some different topics into one, right?
>>: No, this actually separates different topics more distinctively.
>>: But if the assumption doesn't hold, then it's possible that you see the same anchor word with several different topics, and you may group those different topics together as well, right?
>> Moontae Lee: It's different, because the topic is not observed; the topic is a hidden variable. With this assumption, what we would like to do in the inference is to create topics which satisfy this assumption as much as we can. And because these anchor words are exclusive to one topic, it actually tries to learn topics that are as separated as possible. The topic is not something we can observe. So if I reorder the rows and push all these red blocks to the top, then the B matrix contains a diagonal matrix here, which is called D in this notation. Basically the decomposition can be rewritten in this block diagonal form, and there is already one interesting correspondence: if I look at this block after the renumbering, then D A D transpose corresponds to a certain submatrix of the co-occurrence matrix.
>>: So is this an extension of the original topic model? Or is it the same topic model in a different form?
>> Moontae Lee: It's not exactly the same topic model, because we no longer have the prior for the word-topic matrix part. But it actually subsumes, to some degree, all those different LDA models. You will see.
>>: I see. Okay. Okay.
>>: I want to know if I missed something. Can't there be two anchor words in the same column?
>> Moontae Lee: Two anchor words in the same column, in here?
>>: Yeah.
>> Moontae Lee: No, there's no way to get two anchor words in the same column, because the anchor word assumption means that in certain rows there is only one activated cell.
>>: Right.
>> Moontae Lee: And the number of those rows is also a hyperparameter.
>>: But I mean the condition says that you have to be the only non-white thing in your row. It doesn't say anything about your column. So I don't see why you can't have two red ones; the last row could be identical to the next-to-last row.
>> Moontae Lee: So, again, B is not something we can observe. What we can observe is only C. Then, as user input -- for example, in this picture -- the user says: I'd like to learn a topic model with three topics. Then we are going to decompose this matrix assuming there are three such blocks in A.
>>: Okay. So it's just another requirement, the anchor word?
>> Moontae Lee: It's not another requirement; that's how the inference goes in terms of directionality. You will see the details.
>>: I don't expect that to help me, but I will see.
>>: Does that mean that even if you have two anchor words for a topic you can ignore one -- is one anchor word enough?
>> Moontae Lee: No. Because if you would like to separate the dataset into 10 different mixtures, you need 10 anchor words.
>>: Basically that's all I'll ask for. Because you're assuming you have a diagonal [indiscernible] at the top.
>> Moontae Lee: Yep.
>>: Which means for each column you only have one as well.
>> Moontae Lee: Yep.
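As a small numerical aside on that correspondence (an illustrative toy check, not code from the paper): if B satisfies separability and the anchor rows are reordered to the top, the anchor block of C = B A B^T is exactly D A D^T.

```python
import numpy as np

rng_toy = np.random.default_rng(1)
k_toy, v_toy = 3, 8

# Separable word-topic matrix: the first k rows are the anchor rows (a diagonal
# block), the remaining rows are arbitrary nonnegative; columns are normalized.
B_toy = np.vstack([np.diag(rng_toy.uniform(0.1, 0.3, size=k_toy)),
                   rng_toy.uniform(size=(v_toy - k_toy, k_toy))])
B_toy /= B_toy.sum(axis=0, keepdims=True)
D_toy = np.diag(np.diag(B_toy[:k_toy, :k_toy]))        # anchor-word probabilities

A_toy = rng_toy.dirichlet(np.ones(k_toy * k_toy)).reshape(k_toy, k_toy)
A_toy = (A_toy + A_toy.T) / 2                          # symmetric topic-topic matrix

C_toy = B_toy @ A_toy @ B_toy.T
print(np.allclose(C_toy[:k_toy, :k_toy], D_toy @ A_toy @ D_toy))   # True
```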
>>: As for the parameters, is it possible that for some topics you have more than one anchor word?
>> Moontae Lee: That is not allowed in this model.
>>: So this is not allowed.
>> Moontae Lee: Yep. Every topic must have exactly one anchor word. In 2014 there was a theoretical extension where every topic has multiple anchor words; I haven't seen any real implementation of that paper, but the extension exists. So if I have the anchor words, this is one of the main inference steps. C-bar is the row-normalized version of the word-word co-occurrence matrix, so it is now a conditional probability: given that we observe one word, what is the probability of observing another word? And because of this equation, in the end these relations hold: each row of the normalized co-occurrence matrix is a convex combination of the rows corresponding to anchor words, the coefficients sum to one, and they correspond to the word-topic matrix we saw in the previous slide. So assuming we somehow know the anchor words, the rest of the inference is just figuring out these coefficients. Solving the previous equation is just non-negative least squares with a simplex constraint, and there are many different methods -- you can use exponentiated gradient, among others -- and it is easily parallelizable over words, which is one of the biggest benefits. Then the rest of the inference becomes a very simple Bayes rule: this will be one entry of the B matrix, the (i, k) entry, and once we know the convex coefficients from these methods, all these entries can be rewritten via Bayes rule. So finding anchor words really matters. At the beginning, Arora et al. 2012 tried to solve a lot of linear programs: basically pick one row of the co-occurrence matrix and check whether it can be reconstructed from all the other rows, which is a pretty exhaustive method, and empirically it doesn't work at all. Then people used QR with row pivoting, a very famous method in matrix algebra: pick one extreme point of the row-normalized co-occurrence matrix as the initial anchor, project every other row onto the orthogonal complement of that vector, choose the farthest point, and repeat until we find K anchors. Of course these K anchors will never recover the rest of the rows perfectly; what we are doing is just approximately finding the best K rows, and it's a greedy method, so it's not even the best, but it's something we can do in a manageable fashion. The benefit of this anchor word algorithm, as you could imagine, is that the entire inference process is deterministic: there is no random initialization and no funky behavior at all. Once we construct the noisy co-occurrence matrix from the real data, we no longer touch the documents; those are the only statistics we need for the inference. And we produce an anchor word exclusively dedicated to each topic, which might bring some interpretability.
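Continuing the earlier toy example, here is a compact sketch of the two pieces just described: the greedy, projection-based anchor search and the simplex-constrained recovery of the convex coefficients by exponentiated gradient, followed by the Bayes rule step. The step size and iteration count are illustrative, and published implementations add stabilization details not shown here.

```python
def row_normalize(C):
    """Row-normalize C into conditional probabilities p(word j | word i),
    guarding against words that never occur."""
    s = C.sum(axis=1, keepdims=True)
    return C / np.where(s > 0, s, 1.0)

def find_anchors(C_bar, K):
    """Greedy anchor selection: repeatedly take the row farthest from the span
    of the anchors chosen so far (project all rows onto the orthogonal
    complement, keep the largest remaining norm)."""
    Q = C_bar.astype(float).copy()
    anchors = []
    for _ in range(K):
        norms = np.linalg.norm(Q, axis=1)
        i = int(np.argmax(norms))
        anchors.append(i)
        q = Q[i] / norms[i]
        Q -= np.outer(Q @ q, q)              # remove the component along the new anchor
    return anchors

def recover_coefficients(C_bar, anchors, iters=500, eta=20.0):
    """Express each row of C_bar as a convex combination of the anchor rows via
    exponentiated gradient on a simplex-constrained least-squares objective."""
    S = C_bar[anchors]                                   # K x V anchor rows
    Pi = np.full((C_bar.shape[0], len(anchors)), 1.0 / len(anchors))
    SST, CS = S @ S.T, C_bar @ S.T                       # precompute fixed products
    for _ in range(iters):
        grad = Pi @ SST - CS                             # grad of 0.5 * ||C_bar - Pi S||^2
        Pi *= np.exp(-eta * grad)
        Pi /= Pi.sum(axis=1, keepdims=True)              # stay on the simplex
    return Pi                                            # Pi[i, k] ~ p(topic k | word i)

C_bar = row_normalize(C)
anchors = find_anchors(C_bar, K)
Pi = recover_coefficients(C_bar, anchors)

# Bayes rule: p(word i | topic k) from p(topic k | word i) and p(word i).
p_w = C.sum(axis=1)
B_hat = Pi * p_w[:, None]
B_hat /= B_hat.sum(axis=0, keepdims=True)
```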
So what's the problem then? This always happens in machine learning: the real data never follows the model. In reality, co-occurrences with rare words make sparse rows. If a very rare word appears only once or twice across the documents, the co-occurrences with that word are extremely rare, which makes a sparse row. But in terms of matrix algebra, those rows look like strange, eccentric points of the co-occurrence space, so the QR-with-row-pivoting algorithm prefers to select exactly those rows. As a result, anchor words end up being selected as too-rare words, and the co-occurrences with those anchor words are noisy statistics. Even worse, the co-occurrence between those anchor words is usually diagonally dominant, and since one anchor word corresponds to one topic, if it becomes diagonally dominant we cannot capture any interaction between topics. This is a serious problem. So all the previous work usually uses manually crafted document frequency cutoffs, which means they just set some threshold: in order for a word to be an anchor word candidate, it must appear in at least five different documents, or ten, sometimes 100. Then they measure the held-out likelihood -- and in order to measure the held-out likelihood we need to finish the inference process entirely -- and if it doesn't look right, they change the threshold and measure the likelihood again. It's an extremely painful process. Even with the document frequency cutoff, if the number of topics is very small, they learn garbage topics and cannot capture interaction between topics at all, because of this problem; and even if K is high, the topic quality is poor. This wasn't discovered earlier because the original two papers used only synthetic datasets, which follow the model pretty well; on real data the topic quality is very poor, and compared to probabilistic inference like Gibbs sampling, the overall inference quality is far inferior. I'll show sample topics from the original algorithm, which is called Greedy, on the New York Times corpus, which is popular in this area, and as you can see none of the anchor words are, yeah, understandable. So we did a toy experiment in a previous [indiscernible] paper where we simply compressed this co-occurrence matrix using either PCA or t-SNE (t-distributed stochastic neighbor embedding), and we realized it gives much better anchor words and even increases the held-out likelihood. But we still could not explain well why this works, and explaining that is one of the purposes of this new paper. I will show a visual example of those anchor words. This is a 2-D or 3-D projection of a small Yelp corpus. It illustrates the word co-occurrence space, and the anchor words correspond to the vertices of the convex hull, because, as you might remember, after getting the anchor words the goal is to learn coefficients that express all the other words as convex combinations of the anchor words. Interestingly, these anchor words correspond to certain topics of the Yelp reviews. We can also do a 3-D projection, though it looks messy; there's "homeless" and "enchilada" here. So what we've done in the new paper coming at NIPS is to study extensively the mathematical structure of the co-occurrence matrix C. I will skip the probabilistic and statistical structure, which you can see in the paper; what I'd like to articulate in this presentation is the geometric structure of C. C must be low rank and at the same time doubly nonnegative; a doubly nonnegative matrix is one that is entry-wise nonnegative and positive semidefinite.
And by definition C must be joint stochastic, which means the sum of all entries equals one, and it must be low rank because documents are believed to be generated from a small number of topics. So C needs to satisfy these four structural conditions at the same time, on top of the probabilistic and statistical structure -- obviously C must satisfy a lot of different conditions simultaneously. And this is a very rough proof of why C must be positive semidefinite; I will skip it. So C needs to be all of these things, and in order to make C satisfy all those conditions we perform alternating projection: project C onto the set of low-rank positive semidefinite matrices first, then onto the cone of normalized matrices, then onto the cone of nonnegative matrices, and repeat this process again and again. The first projection is easily achievable by a truncated eigenvalue decomposition: rather than doing the full eigenvalue decomposition, which is clearly painful if the vocabulary size is large, we just get the K largest eigenvalues, for example by the power method, and reconstruct C from them. The orthogonal projection onto the cone of normalized matrices is given by this; briefly, the intuition is that the sum must equal one, so compute the difference between the current sum and the ideal sum, take the entry-wise average, and penalize or reward each entry by that amount. And the projection onto the nonnegative matrix cone is pretty easy. So the algorithm alternates these three projections again and again until convergence. However, this alternating projection algorithm no longer guarantees convergence to the global optimum: whereas the first two cones -- normalized matrices and nonnegative matrices -- are convex cones, the set of low-rank positive semidefinite matrices is no longer convex, while plain positive semidefinite matrices do form a convex cone. However, AP instead enjoys local linear convergence. There's a mathematical proof in the paper; roughly, the intuition is that the set of rank-K matrices still forms a not-bad shape, a smooth manifold, and the intersection of a convex cone with a smooth manifold is still a smooth manifold in almost every case. So as long as our estimate is not too far from the convergence point, it is guaranteed to converge; this is one of the theorems in Adrian Lewis's 2009 paper. So the rectified anchor word algorithm that we propose in the paper consists of five procedures: construct the noisy but unbiased estimator of the word-word co-occurrence; rectify it by alternating projection so that C satisfies all the structure of the ideal C; find the anchor words in the rectified co-occurrence; recover the word-topic matrix by probabilistic inference with Bayes' rule; and recover the topic-topic matrix A from the block diagonal decomposition you saw in the earlier slides. So it consists of pretty simple procedures, all of them deterministic, which is of course a clear benefit of this method.
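A compact sketch of that alternating-projection (rectification) loop, continuing the toy example. For simplicity this uses a full eigendecomposition rather than the truncated, power-method version suggested for large vocabularies, and the exact projection formulas in the paper may differ slightly from these.

```python
def proj_psd_lowrank(C, K):
    """Project onto rank-K positive semidefinite matrices: keep the K largest
    eigenvalues of the symmetrized matrix, clipped at zero."""
    vals, vecs = np.linalg.eigh((C + C.T) / 2)
    idx = np.argsort(vals)[-K:]
    vals_k = np.clip(vals[idx], 0.0, None)
    return (vecs[:, idx] * vals_k) @ vecs[:, idx].T

def proj_normalized(C):
    """Project onto matrices whose entries sum to one: spread the deficit evenly."""
    n = C.shape[0]
    return C + (1.0 - C.sum()) / (n * n)

def proj_nonnegative(C):
    """Project onto entry-wise nonnegative matrices."""
    return np.clip(C, 0.0, None)

def rectify(C, K, iters=30):
    """Alternate the three projections, as sketched in the talk."""
    for _ in range(iters):
        C = proj_nonnegative(proj_normalized(proj_psd_lowrank(C, K)))
    return C

C_rect = rectify(C, K)
```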
So I will show the results: how the rectified co-occurrence looks. This is a two-dimensional visualization of the co-occurrence space from the original algorithm; by the way, the dataset is the NIPS dataset, and the goal is to find five anchors for this dataset. Each dot corresponds to a word, and this is the word-word co-occurrence space. If I run the original greedy algorithm with row pivoting, it chooses five extreme vertices like this, whereas if I rectify the space like this and then choose five anchors on the rectified space, it gives these five points. If I map these five points back into the original space, it looks like this. As you can see, the coverage is much higher, and the anchors can explain the rest of the words as convex combinations much better. This is a typical topic result. The dataset is NIPS and again there are five topics. If I run the code of the original Arora et al. algorithm, this is the resulting set of topics: everything looks like neuron, layer, hidden, and basically every topic agrees with the unigram distribution of the corpus. In general, in the world of topic modeling, if the topic inference does not work well, the topics simply mimic the unigram distribution; that's the typical failure behavior. If I run Gibbs sampling, it gives pretty good topics: one is about neurons and cells, one about control theory and reinforcement learning, one about speech recognition and neural networks, one about Gaussian approximation. And if we run our algorithm, the result is pretty similar to the probabilistic LDA, even with this small number of topics. If you increase the number of topics drastically, say to 200, then the original algorithm still has a chance to cover all these vertices eventually and its topic quality improves, but if the number of topics is not that large, there's a large chance it fails because it selects too-eccentric rare words, which are not statistically stable. And this illustrates the topic-topic interaction, which is the second part of the inference. This one comes from the original Arora et al. method. What they did -- which I haven't described -- is that after finding the word-topic matrix B, they multiply the pseudo-inverse of B on the left-hand side of the co-occurrence matrix and, of course, the transpose of the pseudo-inverse on the right-hand side, and it gives this result. This is clearly wrong, because some entries are negative -- there's no way a probability can be negative -- and some entries are even beyond one. The sum is close to one, but only because of an algebraic property. And this is the other method we use in our paper: multiply the inverse of that diagonal submatrix on the left-hand side and the right-hand side. Of course, that is such a simple method that there's no way the original authors didn't try it. But if you use that diagonal-rescaling method to recover the topic-topic interaction without rectification, it looks like this: entirely diagonally dominant, for the reasons I explained before. Again, anchor words tend to be selected as very rare words, the co-occurrences between anchors are extremely rare, and statistically they are not good statistics at all; that yields these diagonal matrices, which cannot capture the topic interaction at all. Whereas this is the result from the rectified anchor algorithm: it captures the topic interactions pretty reasonably, and if you match each topic against the topics we learned before, these values turn out to be reasonable.
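Continuing the toy pipeline, here is a sketch of that second recovery step: re-find the anchors and B on the rectified matrix, then undo the diagonal scaling of the anchor-anchor block, which equals D A D^T under the block decomposition shown earlier. The final renormalization is a practical cleanup of my own, not necessarily the paper's exact procedure.

```python
# Re-run anchor finding and B recovery on the rectified matrix.
C_bar_rect = row_normalize(C_rect)
anchors_r = find_anchors(C_bar_rect, K)
Pi_r = recover_coefficients(C_bar_rect, anchors_r)
p_w_r = C_rect.sum(axis=1)
B_r = Pi_r * p_w_r[:, None]
B_r /= B_r.sum(axis=0, keepdims=True)

# Anchor-anchor block: C_SS = D A D, with D the diagonal of anchor-word probabilities.
d = np.array([B_r[a, k] for k, a in enumerate(anchors_r)])
C_SS = C_rect[np.ix_(anchors_r, anchors_r)]
A_hat = C_SS / np.outer(d, d)            # undo the diagonal scaling
A_hat /= A_hat.sum()                     # renormalize so the entries sum to one
```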
And this plot shows the overall quality in just one big image. We are not only testing our method on document collections, which are NIPS and New York Times; we also apply this model to movie and song data. In the movie data, each movie corresponds to a word and each user's collection of watched movies corresponds to a document. In the song data, each song corresponds to a word and the playlist that each [indiscernible] plays frequently is the document. We currently have six different measures. Recovery indicates how well the rows corresponding to anchor words reconstruct the rest of the rows. Approximation error is the Frobenius-norm difference between the original co-occurrence matrix and the decomposed, factorized matrices; it is the traditional metric for how successful a matrix decomposition is. Dominancy measures how diagonally dominant the topic-topic matrix is. Specificity is how distinct each topic is from the unigram distribution, dissimilarity is how well topics are separated from each other, and coherence is how well the words of each topic co-occur in the documents. While this graph may look a little messy, the things to focus on are AP, Gibbs, and the original baseline method. If you look at the original baseline method, for example in approximation, the error is pretty high, but for AP -- the alternating projection method -- or for Gibbs sampling the errors are pretty low, and this behavior agrees across different datasets. Also, for the recovery error, as you already saw in the word-word co-occurrence figure, after the alternating projection the recovery becomes far better. On the other measures, AP and Gibbs follow pretty similar trajectories; they are comparable to each other, which means we finally achieve results comparable to probabilistic inference, while the original baseline method is far from both.
>>: So what's the right criterion to judge --
>> Moontae Lee: Which topic is better.
>>: Yeah, which is better.
>> Moontae Lee: That's one of the questions that always pops up in topic modeling. This is unsupervised clustering; there's no clear way to judge which is better. So people introduce all these different metrics, like the ones I suggested, and also do human validation at the same time.
>>: So [indiscernible].
>> Moontae Lee: So basically you look at this; it's not that hard. Some researchers design other metrics based on this kind of result, like how frequently each word occurs across different topics, but those are usually subsumed by the metrics I showed in the plots.
>>: In other tasks, I saw that people used perplexity.
>> Moontae Lee: Yes, perplexity is a held-out likelihood measure.
>>: Did you use it?
>> Moontae Lee: We actually did, though we didn't include it in this paper. Because the recovery error drastically goes down, the held-out likelihood increases a lot.
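Two of those measures are easy to state in code; the versions below are rough paraphrases of the descriptions given in the talk, and the exact definitions in the paper may differ.

```python
# Approximation error: Frobenius-norm gap between the observed co-occurrence
# and its factorized reconstruction.
approx_error = np.linalg.norm(C - B_r @ A_hat @ B_r.T, ord='fro')

# Diagonal dominancy of the topic-topic matrix: a value near 1.0 means almost
# no topic-topic interaction is captured.
dominancy = np.trace(A_hat) / A_hat.sum()
print(approx_error, dominancy)
```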
So, to conclude: we studied the various mathematical properties of the co-occurrence statistics, and this might be exciting because, as all of you know, word embeddings are all based on word co-occurrence. While word embeddings don't come from these topic modeling assumptions, if we assume there are clusters of words in natural language, there might be exciting mathematical structure that is desirable for a given embedding; of course that will differ depending on which task we're tackling. We also developed a principled way to rectify the noisy co-occurrence, rather than exhaustively exploring document frequency cutoffs again and again. Based on this method we can learn quality topics even if K is very small; you've already seen the K equal to 5 example. Another example, which is in our paper, is the movie data with K equal to 15. If you run the original anchor word algorithm, I think Pulp Fiction appears across all 15 topics as a top movie, and there is a second movie -- I forget its name -- and those two movies appear as top words across every topic. What we learn instead has exciting clusters, like a Lord of the Rings cluster, a Star Wars cluster, and a Walt Disney cluster, and if you run the Gibbs sampling method it gives pretty comparable results. And we quickly learn the topic interactions in a stable and efficient manner, and, as I said, we achieve comparable results. This talk is based on these two papers, one published last year and one coming soon this year. While I haven't prepared slides for it, we are doing several exciting extensions in multiple directions. As some of you might already realize, topic modeling contains two different inferences. One is the word-topic inference, which is of course the main inference -- how each topic is represented as a distribution over words -- and that part is included in this algorithm. But the secondary inference, the proportion of topics for each document, is entirely missing in this algorithm. That is a currently ongoing experiment, and interestingly the original authors at Princeton, like Sanja [indiscernible] and [indiscernible], and the authors at Cornell are all collaborating on it. Another exciting extension is author-topic inference. Rather than viewing everything as a hidden variable, note that a document sometimes carries a footprint of its authors. For example, all these papers have a list of authors, and an additional assumption, which adds one more layer to the generative story, is that each author is interested in certain topics. So an author has a topic distribution, and based on that the observed words are decided; some authors are more interested in certain topics -- for me, I'm more interested in probabilistic methods, and based on that the words I use frequently, for example MAP estimation or Bayesian, will be decided. That is another ongoing extension. And there is an entirely different direction around privacy. Basically, the co-occurrence matrix C is large: if the vocabulary size is just 10,000, it's a 10,000 by 10,000 matrix, and storing it in memory is painful; a natural language vocabulary is usually more like 100,000. So how to store it and do the rectification step without explicitly storing all those entries is an exciting question, and even more, how to do that efficiently without violating privacy. Those are all exciting extensions and future work. Yep. So again, wrapping up the presentation.
This is a new inference method combining the probabilistic approach and the spectral approach. After forming the co-occurrence matrix, which is a second-order moment matrix, we find the anchor words on that matrix, and from those anchor words the whole inference process is deterministic and transparent. And here is the exciting part, which I think you can take home for future work. The method of moments is pretty sensitive to the sample: if the sample is not large enough, the estimation is pretty bad, and there is no intrinsic ability to handle model mismatch. However, when those issues are addressed, the method-of-moments result is efficient to compute, and we can plug it into the original probabilistic inference. For example, Gibbs sampling needs a burn-in process, and no one knows how many iterations we need to run at the beginning to get a good result; but if we plug this result in as the initial value for Gibbs sampling, it shows amazing results which never appeared even after hundreds of thousands of iterations, which is very exciting. Usually the likelihood functions in likelihood-based methods are never convex, they have a lot of modes, and we cannot find a good initialization point because the parameter space is high-dimensional. This is a really good way to provide a good initialization, so combining the two methods seems highly promising. As for other people's work, there are several variations. Rather than using the second-moment matrix, some people use a third-order tensor, which is word-word-word co-occurrence, and that tensor decomposition work is another way to do topic inference, although it needs to assume the topics are independent of each other; that's another direction, pursued by Anandkumar and colleagues. So this is the end of the talk.
[applause]
>> Li Deng: Questions? Okay.
>> Moontae Lee: Pretty short. I finished in 30 minutes. Thanks.