
>> Paul Bennett: All right. It's my pleasure to welcome David Blei here. He is a professor of statistics and computer science at Columbia University. He recently moved there from Princeton.

Many of you are probably familiar with his work on probabilistic topic models, Bayesian nonparametric methods, and a variety of inference methods. David has done a great variety of work in applications to text, multimedia, images, a variety of different things, as well as his theoretical contributions. He has won a number of awards that are too numerous to list here in the intro, I think, so I am going to go ahead and just hand things over to

David here and we are going to hear from him about some of his methods in integrating user behavior.

>> David Blei: Thanks, Paul, for the introduction and thanks for inviting me here. My name is

David Blei and I am going to talk about Probabilistic Topic Models and User Behavior.

So as you know here at Microsoft, we collect large archives of electronic texts, and that comes with a huge challenge, which is that we need to somehow organize, visualize, summarize, search through them, form predictions about them and understand them quickly, right. We collect massive collections of text data now every day, and doing this requires that we have new algorithmic tools, new machine learning tools, to be able to process and understand large bodies of text. What probabilistic topic models do is take big, unorganized collections of text and organize them programmatically.

Loosely speaking, the process is to take your big collection and use an algorithm to discover the underlying thematic structure in that collection; once you have discovered what the topics are, annotate the documents according to those topics, as though you had people going through and labeling the documents by hand; and then finally use those annotations to do whatever it is you want to do: visualize, organize, summarize, whatever it might be.

So here is an example of a topic model, a picture of a summary of a big collection of text that a topic model provides. Here each node in this graph is a collection of words that are associated under a single theme. These are themes that were discovered by the algorithm. You can see here mantle, crust, meteorites, ratios: something about geography, geology. Topic models know the difference between geography and geology, but I don't; that's why we do this. So this is called a topic; this was discovered by the algorithm, and in this picture this was a topic model fit to a big collection of scientific articles. A connection between two topics means that they are likely to occur together in a document. So a document about the sun is more likely to also be about, say, atmospheric science than about genetics.

Okay, so in came a big collection of documents and out pops this picture that summarizes what's in it. We can also look at how topics change over time. So here is another type of model where now, instead of a single collection of words associated under a theme, each topic is a changing set of words that are associated under the theme. And so here you can see a topic that the model discovers; this is called a dynamic topic model, and the topic describes neuroscience. So here, I should say, this is fit to Science magazine from 1880 till 2000. It used to say the present, but it's just

2000 and here we have the neuroscience topic where you see the word neuron increasing in its activity over time, the word nerve decreasing, the word oxygen peaking and decreasing.

Here is another topic from that same fit with the word laser; you know, lasers weren't invented until about here, and so it makes sense that laser increases in its activity after they are invented.

So the word force kind of decreases, the word relativity peaks and then decreases. So this gives us another type of picture of what's going on in the collection, again automatically derived by the algorithm. This is another picture from that same dynamic topic model, but looking at the topic in a different way. Instead of looking at a single word and how it changes over time, we can look at the most salient words from a topic and how they changed over the decades.

And in this topic it started out in 1880 with words like electric, machine, power, steam, iron, battery, wire. It slowly changes through the decades, in the 1940s to air, tube, apparatus, laboratory, pressure, and then up to 2000 with devices, silicon and technology. So this topic, given again Science articles from 1880 to the present, captured that there is something about the technology needed to do science; it's a theme in this collection and it's a theme that's changed over the decades. And there's no supervision needed, no metadata about the articles being about technology, nothing about which words are about technology. This kind of complicated pattern comes out of just analyzing the text.
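
For reference, the dynamic topic model behind these pictures is usually written with each topic's parameters drifting as a Gaussian random walk across time slices; this is a sketch of the standard formulation from the dynamic topic modeling literature, not a slide from the talk:

\beta_{t,k} \mid \beta_{t-1,k} \sim \mathcal{N}(\beta_{t-1,k}, \sigma^2 I), \qquad w \sim \mathrm{Mult}\big(\mathrm{softmax}(\beta_{t,k})\big)

The random walk is what lets a topic's top words drift from electric and steam toward silicon over the decades.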

>>: [inaudible]. Are these the top words you are talking about?

>> David Blei: Yeah, good question. So I will define the topic in a little bit, very soon. These are just kind of the words that represent the topic. Yeah, the top words of the topic, but indeed the topic more formally, you will see this later, is a distribution over the whole vocabulary, and these are the top words. And please interrupt me; I noticed when I got the schedule from Paul there is like an hour and a half for this talk and it's only an hour-long talk, so please interrupt me with questions. But, I have another talk that I can show you afterwards I guess if we have time.

[laughter]

>> David Blei: But, then we have to go for two and a half hours.

Topic models can be integrated into other models of data. So this is a model I worked on a long time ago, a model of images and text, where we use a topic model to analyze the text, a different model to analyze the images, and connect the two representations that the models capture. And so what you can do with this model now is take un-annotated images, like a picture of a bird, and predict what words go with that image. So here is the picture of a bird and the model predicts that words like birds, nest, tree, branch and leaves go with this picture. Actually, I just met

Ryan from Scotland and this really is a picture of Scotland. So you can see that it’s a very good model, although every picture in the training set of farms was a Scottish farm. So if it sees any kind of green pastures it’s going to say it’s Scotland.

[laughter]

>> David Blei: Related algorithms can be used to analyze big network data as well. So this isn’t exactly a topic model, but it’s a related type of hidden variable model called a mixed membership model. And here we have a large social network and we want to uncover the communities that exist in that social network. But, of course, when we have a large network we

don't expect that each person belongs to just one community, which classical community finding algorithms assume, but rather you might be on a social network and you have some friends from your neighborhood, some friends from work, and some friends from your high school. You want to capture that there are these overlapping communities that live in the social network, and here is a picture of a big network where I have shown the overlapping communities that the model has uncovered. Okay, so these are all the kinds of things that we can do with topic modeling and with the sort of algorithms that are associated with topic modeling.

Okay, so what I want to focus on today is our recent work on integrating user behavior into topic modeling. So the insight we had is that people read documents, right. We don't just have these big electronic collections of documents that we want to organize, but we often have information about who's reading those documents. This could be a click stream, who's clicking on documents, or it could be more formal than that, actual people reporting that they have read an article. And there are two types of things we can do with data about who's reading a big collection of documents. One: these might be people for whom we want to form predictions. So for example here's the New York City subway and you can see a lot of people reading documents on the subway as they always do, and we might want to take the history of this fellow's reading and predict for him something that he would want to read in the future, the classical recommendation problem.

Also, and I think this is also interesting and less looked at, people's reading behavior is an additional signal about what the documents mean, where the documents sit in this bigger collection. In other words, in those previous pictures I was using the words of the articles to show you summaries of what's going on in the collection, but who's reading those articles can also help inform those pictures, help us understand what's going on in the collection. So here's an example: Charles Darwin's library. You know, forming recommendations for Charles Darwin is not a useful thing to do right now, but knowing what books are in his library tells us something about those books. And if we had every scientist's library catalogued, and we knew which books were in Charles Darwin's library, or Einstein's library, or other people's libraries, that tells us something about who might want to read those books, or about where those books live in the big landscape of scientific literature.

Okay, so in this talk I want to start by giving an introduction to topic modeling and more formally define some of the things I showed you pictures of in the beginning. Then I will talk about recommendation and exploration with collaborative topic models, this is the work about integrating user behavior into topic models, and then finally talk a little bit about the bigger picture and how all of this is really just a case study in using probability models to solve problems in data, just for like five minutes. I know that looks vacuous so I will focus on the first two.

Okay, so let me give a brief introduction to topic modeling, what these ideas are, because the user behavior work builds on this. So I want to start by talking about the simplest topic model, called Latent

Dirichlet Allocation. So the intuition behind Latent Dirichlet Allocation is that documents exhibit multiple topics. So here’s a document, this is from Science magazine called “Seeking

Life’s Bare Genetic Necessities” and it’s an article about determining the number of genes that an organism needs to survive in the evolutionary sense. Okay, so there is an organism, it’s like

lots of years ago, millions of years ago, you know, is that organism going to make it? You know the turtle, is the turtle going to get there?

This article is about using data analysis, genetics and evolutionary biology methods to determine this number, and what I have done by hand here is highlighted different words with different colors. So words like computational, numbers, predictions, words about data analysis, I highlighted in blue; words like genome, sequenced, genes, words about genetics, I highlighted in yellow; and words like organisms, life, survive, words about evolutionary biology, I highlighted in pink. So what you can imagine is if I took the time and highlighted every word in this article, throwing away stop words like and, and but, and of, and then you squinted at the article, you would see it's kind of pink, blue and yellow. It blends words from data analysis, words from evolutionary biology and words from genetics.

And if you looked at every article in Science and took the time to highlight every word in every article with its corresponding theme, you would see that the articles of Science represent this heterogeneous mix of different topics. That there's neuroscience, there's data analysis, there's genetics, there's astronomy and so on. So the intuition behind LDA is that documents exhibit multiple topics. What LDA does, and now we can define the topic, is it embeds that intuition into a formal, probabilistic generative model of the document collection.

Okay, so here’s how that works: on the outside, oops he told me to stay on this side of the room, on the outside of the document collection live a collection of topics. Each topic is now defined as a distribution over a fixed vocabulary.

So there is some vocabulary, say 10,000 words and each topic is a distribution over those words.

And what I have done here, these are fake, but I have listed them in order of probability. So here's data, number, computer, with some probability; brain, neuron, nerve and so on, okay.

That lives outside the collection, and then for each document we assume that the document is generated as follows: first choose a distribution over topics. Okay, so for this document I chose the pink, yellow and blue topics with different proportion. Then for each word in the document choose a colored coin from this distribution. So here I chose the blue coin; look up the topic that blue coin is associated with and choose a word, analysis in this case, from that topic. Repeat this for every word in the document. Here I chose the yellow coin, the word genetic; the pink coin, the word life; and so on.

So you repeat this for every word in the document, you get a big bag of words and that’s your document. Turn the page of Science, choose a new distribution over topics. This one might be about neuroscience and data analysis and choose its words. Okay, so this is, you will notice, a bag of words model. The order of the words isn’t modeled here, it doesn’t matter. But, this is a perfectly good generative process for drawing a collection of documents from a set of topics.
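
As a minimal sketch of that generative process in R (the language of the demo later in the talk), with made-up sizes and Dirichlet parameters; the Dirichlet draw is built from rgamma since base R has no rdirichlet:

# Sketch of the LDA generative process with toy sizes.
rdirichlet <- function(alpha) { g <- rgamma(length(alpha), shape = alpha); g / sum(g) }

K <- 3; V <- 20                                   # topics, vocabulary size
beta <- t(replicate(K, rdirichlet(rep(0.1, V))))  # K x V: each topic is a distribution over words

generate_document <- function(n_words) {
  theta <- rdirichlet(rep(0.5, K))                            # this document's distribution over topics
  z <- sample.int(K, n_words, replace = TRUE, prob = theta)   # a colored coin per word
  sapply(z, function(k) sample.int(V, 1, prob = beta[k, ]))   # a word from that topic
}

doc <- generate_document(50)   # a bag of word ids; word order never mattered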

The problem, of course, is that we don’t get to observe any of that stuff. Okay, so the topic proportions, the topic assignments, and importantly the topics themselves are all unknown. We just have this big bag of documents, this big collection of documents. And so the main computational problem, the main algorithmic problem, the machine learning and statistical problem is to infer all of this hidden structure from the observed data.

And the bag of words assumption, you might think, "Oh, that's not realistic," and it's true, it's not. But if your goal is to understand what the topics are that describe the collection, you can imagine that if I jumbled all these words, I didn't show you what order they really came in, and you looked at it, you would be able to say, "Well, I don't know what this is about because obviously the words are jumbled, but it somehow blends data analysis, genetics and evolutionary biology."

Okay, so that’s LDA, the simplest topic model.

As a graphical model, I will just, I am sure most of you know what a graphical model is, but I will remind you. A graphical model is a representation of a joint distribution of hidden and observed variables. Each node is a random variable, a shaded node is observed, un-shaded node is hidden and an edge between two nodes indicates that there is dependence between this random variable and this random variable and these plates denote replication. Okay, so here are the K topics that describe the collection, for each document, that’s the D plate, I first choose a distribution over those topics, that’s Theta and then for each word in the document I choose the colored coin Z and I choose the word from the corresponding topic. Okay, so this picture describes the joint distribution of those variables that were at play in that generative process.
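
Written out, the joint distribution that this graphical model encodes is the standard LDA joint, with N_d the number of words in document d:

p(\beta_{1:K}, \theta_{1:D}, z, w) = \prod_{k=1}^{K} p(\beta_k) \prod_{d=1}^{D} p(\theta_d) \prod_{n=1}^{N_d} p(z_{d,n} \mid \theta_d) \, p(w_{d,n} \mid \beta_{1:K}, z_{d,n})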

Okay, graphical models, and you know a lot of foundational work on graphical models was done right here, encode the independence assumptions between the random variables, and they connect to algorithms for computing with these kinds of joint distributions. All right. So you will notice that the only thing observed here is W-D-N, the Nth word in the Dth document. Using graphical models algorithms we can figure out how to compute the conditional distribution of all of this hidden structure, which is of course what I told you was the main problem with topic modeling.

Yeah?

>>: In this picture you have the topics; so is the number of topics given in advance in LDA?

>> David Blei: Yeah, so the number of topics here is K and that’s fixed. So you decide that in advance, or you can set that using cross validation or you can do fancy things like Bayesian nonparametrics.

>>: [inaudible] a point, an extreme point: each topic can contain only one word, like one word with probability 1, and the rest of the words would be like probability 0, or like the top word would be, you know, that?

>> David Blei: Well, let's look into that; that's a great question and a good point. It depends on the objective function. There is a conditional distribution here and there are priors at play.

There is a prior over what the topics look like, there is a prior over what these topic proportions look like, and there is also the likelihood. Let's think in a couple of slides about what that likelihood means in terms of what kinds of topics it prefers, but yes, that's right. So, kind of, if the number of topics were equal to the number of words, one way to describe the documents is to have one word in each topic, but that's not necessarily going to be the best way.

Okay, so just to go into a little more detail about the graphical model and hopefully you can see its connection to that cartoon picture I showed. So this is a joint distribution over our big ensemble of random variables. That joint defines a posterior, right, the probability of the topic proportions for all the documents, the topic assignments for all the words of all the documents and the topics themselves given all of the documents, right. We are given this big collection of documents, we want to infer this hidden structure and what we do, for example to make those pictures that I showed you in the beginning of the talk, is we infer these hidden variables from the data and then we use posterior expectations of those hidden variables to do whatever it is we want to do. To do information retrieval, to compute similarity between documents under the context of their topics, to explore, make pictures about the document collections and other tasks.
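
Concretely, that posterior is the joint divided by the marginal probability of the observed documents,

p(\theta_{1:D}, z, \beta_{1:K} \mid w) = \frac{p(\theta_{1:D}, z, \beta_{1:K}, w)}{p(w)},

and the denominator p(w), which sums and integrates over every possible topic structure, is intractable to compute; that is why the inference methods mentioned below are all approximations.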

Okay, so those pictures that I showed you in the beginning, these are all posterior expectations of hidden structure visualized so that we can learn something about the corpus that we couldn’t otherwise see if we just had this big bag of documents.

All right. There are many methods for doing posterior inference, for solving that inference problem. I won't talk about any of them here, although I have one slide at the end about stochastic inference, which is a way to scale up this kind of computation to massive data sets; it has a one-slide explanation, so it was worth putting in, but it's after the talk, so ask me if you want to see it. But, basically there are lots and lots of ways to do this. All of these methods are methods for approximating that conditional distribution that I described, the conditional distribution of the topic structure given the observations. Some of the most exciting recent methods involve factorization and sort of this new spectral approach to inference, which has now been applied to topic modeling.

Okay, I should mention one more thing. In all the results I am going to show you here we are using mean field variational methods and stochastic variational inference. Okay, so let's see this at work: we took the OCR collection from Science magazine, 10 years of Science. I am sure here at Microsoft, even now for me, this is a very small corpus: 17,000 documents, 11 million observations, a vocabulary of 20,000, and we fit a 100-topic LDA model using variational inference. Okay, so I took this big collection and I asked for 100 topics and a decomposition of the corpus. Here is that original article, here is the real data, and these are the inferred topic proportions for this article. Remember, I got 100 topics and now I am asking which of those 100 topics this article is exhibiting.

And if you look at the most frequent topics, remember only a handful of topics have been activated and each one is associated with a distribution over terms, the most frequent words from the most frequent topics correspond to things that we recognize as genetics, evolutionary biology, diseases, survival and data analysis. Okay, so you can see that again, without any metadata or extra information, we are able to get a kind of intuitive representation of what the document is about through this latent structure.

Okay, this gets to your question: why does this work? One answer to why it works is, well, I made a probability model, it captured my intuitions, I just looked at the posterior and it did the thing I was hoping it would do. It's not a very satisfying answer, although it's one that I gave for probably like 5-8 years. One way to think about how it works is to ask yourself: why does the posterior look the way it does? And here is one way to think about that: if you look at the

posterior, which is proportional to the joint distribution, you can see mathematically that LDA tries to trade off two goals. One goal is that in each document it wants to allocate its words to few topics; it pays a bigger price if it allocates a document's words to many topics than if it allocates them to just a few. The reason is that those topic proportions normalize.

The second goal is that in each topic it wants to assign high probability to just a few terms.

Okay, so it also pays a price if a topic is spread out over many, many terms, again because those topics normalize. And these goals are at odds, if you take a document like this one and you say,

"Okay, this document is just about one topic," you put all of its words in the same topic, well, that means that topic has to have high probability for all those words for the model to see any benefit, and that of course is at odds with goal number two, to have few words in each topic.

Conversely if you say each word in this document is assigned to its own topic then that’s good for those topics because they each only have one word in them, but for that document it’s going to have to use many, many topics to cover all of the words in the document.

So this is, if you kind of unpack the posterior, why LDA works. What it tries to do is trade off these two goals, and that helps us find these groups of tightly co-occurring words.
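
In symbols, the two competing rewards for a single document d are, paraphrasing the discussion rather than quoting a slide:

\sum_{n=1}^{N_d} \log \theta_{d, z_{d,n}} \; + \; \sum_{n=1}^{N_d} \log \beta_{z_{d,n}, w_{d,n}}

Because \theta_d sums to one across topics and each \beta_k sums to one across the vocabulary, spreading a document over many topics shrinks the first sum and spreading a topic over many words shrinks the second; the posterior balances the two.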

Yeah?

>>: [inaudible] this intuition you could say per word in the document you pay a price; giving the example of [indiscernible], if you have as many opposite words then you would pay a fixed price for every word depending on which word it is in the vocabulary, right? But, if you have topic models, then you pay enough [indiscernible] for using that topic, and then you pay only the cost for [indiscernible] words given the topics. If topics are very deterministic then you only pay as many times as you have topics versus as many times as you have words in the document. Is that kind of what you are saying, that they are at odds?

>> David Blei: I am not sure. I admit that I am not sure. It sounds, it sounds like it does. I mean, you were talking about paying a price; I was saying that too, but really maybe it's better to think of it like rewards. So let me just be a little less mysterious: there is an objective function at play here, which is something like the log posterior. And that log posterior is like the log joint. And in the log joint there is a term for every topic assignment, the log probability of that topic assignment, and a term for every word, the log probability of the word given the topic that it was assigned to, and those are the two important pieces. If a document is only in one topic, then that first reward you get, the log probability of the topic assignment, is as high as it can go, because it's going to have probability one. And when you add topics, you end up taking away from how much benefit you get for that document, and that log there is important; but of course it's the opposite on the words.

>>: [inaudible].

>> David Blei: Exactly and then it pays its price per word and that’s what --.

>>: [inaudible].

>> David Blei: Uh huh, that’s right, so yeah, I mean if you want to think more about this I encourage you to write down the log probability of all of the hidden and observed variables, stare at it and then just talk to yourself.

>>: [inaudible].

>> David Blei: Yes, so that's another, so I think that's a misconception. The misconception is that, oh sure, yeah, it's because of this very small Dirichlet parameter that we are getting these sparse topics. I think that the small Dirichlet parameters encourage the inference algorithms to go in the right direction towards this kind of solution, but it's in fact the normalization, the fact that it has to sum to 1, that encourages sparsity.

>>: [inaudible].

>> David Blei: Because you see so much data.

>>: [inaudible].

>> David Blei: That's right, that's right. I mean it helps in the sense that a Dirichlet parameter that's very small puts high probability mass on sparse solutions, but when you are overwhelmed with data, especially on the topic side where there are millions or billions of observations underneath the Dirichlet, it's not like the prior matters much.

>>: [inaudible]. If you increase the Dirichlet parameters you can make sure that more topics are being used.

>> David Blei: Well, again though, the normalization encourages sparsity; the prior doesn't matter at that point. Okay, interesting.

Okay, so that’s all I wanted to say about LDA. It discovers themes through posterior inference.

This really built on the work by Deerwester, including people like [indiscernible], on latent semantic analysis. Okay, that started the whole enterprise of factorizing document-by-term matrices, and LDA is essentially a probabilistic factorization of a document-term matrix.
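
Concretely, marginalizing out the topic assignment gives the probability of term w in document d as

p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k} \, \beta_{k,w},

so the matrix of document-term probabilities factorizes into topic proportions times topics, which is the sense in which LDA is a probabilistic matrix factorization.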

Probabilistic LSI, Hofmann's work from the late 90s, was really the predecessor to LDA, which took this work and made it a bit more Bayesian, but also, more importantly, let it be used as a module in more complicated models. I will show you that in the next slide.

In statistics this idea is called a mixed-membership model. Okay, specifically, forgetting about documents, you can think about this plate as representing data sets, and here we have a data set that's drawn from a mixture model. Here are the mixture proportions, for those of you familiar with this language, here is the mixture assignment, here is the data point and these are the mixture components. So what

LDA is, is what's called a mixed-membership model; it's like we are modeling a bunch of data sets,

each with a mixture model, where the mixture components are shared across data sets, but the mixture proportions change from data set to data set.

And this idea is used in a lot of different settings. I showed you this picture from network analysis; it's used in doing survey analysis, it's used to analyze music, it's kind of its own little industry in statistics. And as an example of that, this exact model was independently invented around the same time we were working on it for population genetics. Now my group is working on these kinds of problems as well, where basically if you have 1,000 or a million people you can sequence their genomes, and each of our genomes, I don't know much about this, but I can try to explain it, each of our genomes reflects the ancestral populations that were roaming the earth lots of years ago, reproducing to form us eventually over time.

And, you know, I might be kind of part Australian, part African, part European and you might be part European, part African and so on, and you can detect that we are all what are called "admixtures", which is like this Theta, the heterogeneous topic proportions. We are all admixtures of these different ancestral populations, and then you can uncover what those ancestral populations' genetic signature was. Okay, so take a bunch of people now, sequence their genomes, figure out what populations roamed the earth millions of years ago to form those people, loosely speaking.

That's what they do in population genetics. It sounds goofy, but it's really important for doing things like correcting for ancestry when you are doing studies of, say, the link between a disease and a gene. It's also important for exploratory analysis of big genetic data. So all the same kinds of things that topic models are used for.

Okay, LDA is a simple building block, like I just mentioned, that enables many applications. So here are some examples of more complicated models from the research literature that used LDA as a module. And one of the advantages of graphical models is that an algorithm you derive for LDA, for that little 3-node graphical model, can be used as kind of a subroutine in whatever algorithm you need for this complicated beast. And I think this is one of the reasons that LDA has become a popular tool to use in machine learning research. Another reason is of course that we need these kinds of unsupervised learning methods; organizing and finding patterns in data has become important everywhere. And algorithmic improvements with

LDA as kind of a test case let us fit these types of models now to massive data sets.

I am going to show you a model with LDA as a module in the second part of the talk. More broadly, just to finish up about LDA, I think of topic modeling as a case study really in doing text analysis with probability models, where we build a model, we have a bunch of data, and our model reflects whatever assumptions we want to make about the data. We then do inference on the hidden variables given the observations, and then from those inferences we do whatever it is we want to do. And this is a nice way to separate out the different activities in probabilistic modeling. And so what topic modeling research looks like is somebody developing a new model, somebody developing new inference algorithms like that long list that I showed you, or developing new applications, visualizations, tools, etc.

Frankly I don't think there is enough of this kind of research where we treat it as given that the model is going to be useful and focus on understanding what to do with these inferences. How to

effectively build interfaces to them, how to use them to do search, to do predictions, these are important parts of topic modeling research.

>>: Sorry, I suppose that prediction here could include classification.

>> David Blei: Uh huh, yeah, any kind of prediction.

Okay, so it’s easy to use LDA in R. There are lots of open source implementations of LDA. So just as an example if you have a bunch of documents, one per line, you represent them sparsely, you run five lines of R, and you can quickly get out the topics that created those articles. And then you use those inferences to do whatever it is you are trying to do. So, any other questions about topic modeling? That was the first part of the talk and I want to build on that in the second part of the talk.
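
As an illustration, a minimal sketch along those lines using the tm and topicmodels packages; the package choice and file name are my assumptions, not a prescription from the talk:

library(tm)
library(topicmodels)

docs <- readLines("documents.txt")          # one document per line
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)),
                          control = list(stopwords = TRUE))  # sparse word counts
fit <- LDA(dtm, k = 100, method = "VEM")    # variational inference, 100 topics
terms(fit, 10)                              # top 10 words per topic

The fitted object also exposes the per-document topic proportions via posterior(fit)$topics, which is what you would hand to whatever downstream task you are working on.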

>>: Is there any better way than the one you used earlier to describe the different topics?

>> David Blei: Right, so --.

>>: I mean, the way you used the words to describe it don’t seem to be the same as the

[indiscernible].

>> David Blei: Yeah, that’s a good question. So that goes along with this: how do we name the topics? How do we best visualize the topics? This is clearly not the answer to that question.

>>: [inaudible].

>> David Blei: No, that is kind of a user interface question and here, you know, I thought it was a big achievement when I made the words bigger according to the probability, but yeah, there are research fields that would be better equipped to do something more interesting. Frankly, you know, topic modeling has begun to be used a lot in what is called the “digital humanities”.

People like historians and English scholars are using these techniques to get a handle on big electronic collections of newspapers or whatever it is they are analyzing, and they have done beautiful things to visualize them, right, because they are like humanities scholars, so they are good at that stuff. So there is a fellow, Matt Jockers at the University of Nebraska, and his topics look much better than anybody else's. He has really nice ways of summarizing these distributions visually. It's an important problem.

Yeah?

>>: In your opinion, what’s the best way to evaluate these topic models, like especially when you are talking about the number of topics and how you can pick different ones? What’s the best way to see that this representation is the best one for this corpus?

>> David Blei: Yeah, so that’s a great question. So that’s another issue, these are two core issues: visualizing and evaluating. And they are core because they are kind of fuzzy, right.

There is no quick answer to that. So in a typical machine learning paper about topic modeling

we use something like held-out log likelihood, where you take a portion of your data, you don't see it and then you predict it. There are even a couple of ways of doing that. I have one way that

I prefer, but there is like kind of a little unimportant debate around it. So that's a way to ask the question: which model is better? The idea being that a model that assigns higher probability to future data from the same unknown process that generated the data you care about is a better model.

One that assigns higher probability to future data is a better model. And that idea really stems from the 70s; even for exploratory tasks, statisticians like

Seymour Geisser promoted predictive likelihood as a way of measuring model fitness.

If you are doing something with the topic model, like information retrieval or collaborative filtering or classification, then probably however you want to evaluate that downstream process is how you are going to want to evaluate the model on the way to that solution. Although that might be too expensive, and so something like predictive likelihood could be a good proxy. Yeah, and then there might also be kind of visual ways to assess the model. And that's a tension that's coming up in this digital humanities work, where the scholar will say, "Well, I ran the algorithm 50 times, there are 50 local optima that it found, which one should I pick?" And you say, "Well, pick the one with the highest predictive likelihood." And then she says, "Yeah, but that one doesn't look the best; the one that looks the best is this one." And then I say, "Well, you can't pick that one; it's got a lower predictive likelihood." And then they will pick it.

[laughter]

>> David Blei: So, this is an issue and it is a tension, and there are a lot of interesting open problems there. Another interesting open problem is that the humanist, and I am saying humanist, but don't feel like that doesn't include you, it's anybody who wants to use these to do any kind of exploratory data analysis. You know, you might have three fits of topic models, and some topics are good in one and others are good in others, and you want to somehow combine them. There is no easy way to do that. That's an interesting open issue in this world.

>>: [inaudible]. Do you know of any tasks in classification that this model can do better on?

>> David Blei: Well, I don't work much on classification so I am not sure. Actually, I want to mention one other thing, which is a way of evaluating topics visually, using something called posterior predictive checks. That's something I worked on with David [inaudible], where you can --. So here is another issue, and this is making it sound like there are more issues than good things about this work. Another issue is that you might fit a topic model, there are 100 topics, and some of those are there because there are real patterns in the data, neurons, brain. Okay, there is a real pattern in the data that it's capturing.

Others might be there because you have a very complicated corpus written by many people, they are smart, and they did science research, and it's not that they generated their data from a multinomial distribution. So, some of those topics are there because they are accounting for that model misfit. If you know about Gaussian mixtures, this is like the way that a Gaussian mixture, with enough mixture components, can capture any distribution: a topic model with enough topics can capture the kinds of idiosyncrasies of the distribution of language that a multinomial can't capture, and you might not care about those. Posterior predictive checks can help you identify which topics are interpretable and which ones aren't, which would be important to, say, a humanist or anyone doing exploration.

>>: [inaudible], are there any specific kinds of corpuses or documents where topic modeling would not work well?

>> David Blei: Well, you will see that I look almost exclusively at things like news and scientific articles. I think it’s probably hard with things like Twitter. So I know that I have seen some

Twitter topic --. I haven't read a lot of Twitter topic modeling papers, but there are a lot of them, and in some of them that I know, and I think this works, you take a Twitter user, like Paul's on

Twitter, you take all of his Tweets and that’s a document, that’s the Paul document. But, that’s not satisfying, right, because there should be a way to think about the Tweets individually. I am sure there is a lot of work to do there.

Yeah?

>>: [inaudible].

>> David Blei: Yeah.

>>: Has anybody tried to simply not model all the words in the document, but allow a lot of

[indiscernible] to be unexplained?

>> David Blei: I have seen that, I can't recall the author or title, but yeah, some kind of robust opt-out background topic. I have heard that that is useful.

>>: [inaudible].

>> David Blei: Yeah, I have heard that’s a useful thing to do.

>>: Michael Jordan had a paper a couple of years ago on a hierarchical topic model where there was a base distribution that common words could just come from, and it seemed to help a lot.

>> David Blei: Okay, good, right, Michael Jordan did it.

[laughter]

>> David Blei: It’s a good answer to all questions, has anyone tried --?

Um, yeah, yeah, so there was something else I was going to mention around that question, ah I can’t remember, but yes, Twitter I think is difficult, but interesting.

Okay, so like I said, I want to talk about how people read documents. We have been very interested in this lately, and have been working on what are

called collaborative topic models, which connect the content of the articles to people's patterns in consuming those articles. And as I mentioned, this helps people find documents that they are interested in. It learns about how people are organizing the documents and it learns about the people who are reading the documents. So I thought you would be interested in this here at

Microsoft.

Okay, so here's an example: scientists share their research libraries. Okay, we all here probably have some kind of research library on our hard drive, right. Chris Meek has his BibTeX file that's probably a lot of years old and has thousands of articles in it. So if we all put our BibTeX libraries online we would get a matrix like this, where in the rows we have articles, in the columns we have people, and we have a black dot if the person in this column has "The Mathematics of Statistical Machine Translation" in his or her library.

Okay, so what can we do with this? Well, like I mentioned, we can form recommendations. We can form recommendations of old articles and new articles. Okay, so in the recommendation literature this is called the "cold start problem" and the problem is this: let's say we are going to do this, but we are not going to use the text of the articles. Well, if we want to recommend the EM article to someone it's not that hard. We look at the other articles that you have read, we figure out what you are interested in, and we look at what other people who are interested in the same things liked. If they liked the EM article and you hadn't read it yet then we recommend you the

EM article. Of course, the issue with that is that it requires that the EM article is not a brand new article. It's not, luckily, but for new articles we would have an issue, right. When the topic models for recommendation paper comes out, nobody has read it yet, so who do we recommend it to? So you can imagine that the text is going to play a role.

With collaborative topic models we can describe users in terms of their preferences. So what does that mean? Well, classical matrix factorization methods for doing recommendation basically will describe someone in terms of dimensions 61, 32 and 95. Whereas with collaborative topic models these dimensions, preference dimensions, are going to be attached to topics, like we just saw, and so they are going to have some kind of meaning. You are going to be able to say, “Well this person is interested in machine learning, and computer vision and web media.”

Finally, I think this is very interesting, we can identify impactful, interdisciplinary articles. So this is what I meant more crisply when I said that we can understand how the document collection is implicitly organized. We can take the EM article for example and show you why the EM article has had an impact outside of what the EM article is talking about.

Okay, so let me give you the intuition. Does this talk really go to noon?

>>: We have the room until then.

>> David Blei: Ah, okay.

>>: It starts getting sparse so you can kind of go into that time.

>> David Blei: Got ya, good answer, okay. I am not going to treat it like an hour and a half talk then.

Okay, so let's talk about the EM article. Have any of you read it? Yeah, it's a good article, it's from 1977. So let's imagine we have the EM article and that there are two types of people: computer vision researchers and statisticians, okay. You look at the EM article, it's 1977, it just came out; if you have read it you know that it's about one thing, statistics, this is an old statistics article. Here is our representation of the EM article when it just comes out: it's about statistics.

Now, let's suppose again, like I mentioned, that there are two kinds of topics, and there are also two kinds of scientists: statisticians and vision researchers. Here are statisticians: they are only interested in statistics. Here are vision researchers: they are only interested in vision. When the

EM paper comes out we are going to recommend it to the statisticians. We are going to take the dot product of these two vectors and we are going to say, “You guys should read the EM paper.”

Now, let's say I've got everyone's BibTeX file, all right. So here are the users by papers, it's now however many years later, 30 years later, and when I look at users by papers I can detect, through the kind of model I am going to describe, that computer vision researchers are interested in the EM paper. Okay, that's this red spike on vision. And now consider again these two scientists: I will recommend the paper to both the statisticians and the vision researchers.

So, what I want to point out is that without the text of the EM paper, right, if we were just in a classical recommendation system, we wouldn't be able to initially recommend it to anyone until somebody bothered to read it. And without user data we can't recommend it to vision researchers at all: if we only had the text and we didn't have any kind of interaction data between people and papers, we wouldn't ever be able to detect that the EM algorithm is important in an area like computer vision.

Yeah?

>>: That’s not entirely true, because you could see similarities in text between vision papers and

the EM paper and base the recommendation on that.

>> David Blei: You said you read the EM paper.

>>: I did.

>> David Blei: Have you ever read a computer vision paper that cites the EM paper?

>>: Yes.

>> David Blei: There is no similarity.

>>: [inaudible].

>> David Blei: That’s right.

>>: Still, they do talk about bounds, they talk about, and they do mention EM.

>> David Blei: They mention EM it’s true.

>>: [inaudible].

>> David Blei: But again, imagine, look at these authors, Dempster, Laird and Rubin, right.

Those guys are writing in 1977 from their world of statistics, and as you know, modern treatments of EM, we learned it from Bishop or wherever, and yeah, so I don't think you would get it.

>>: Yeah, but [indiscernible] for example was more linked to speech and then speech and

HMMs, [indiscernible], that would be sort of the link, and then the HMMs were used in vision. So they would kind of show up a little bit. I agree that this is a faster way, but I am also sure that the text alone wouldn't match [inaudible].

>> David Blei: Fair enough, it’s an argument to have while drinking, but I can see what you are saying, that maybe it shows up, yeah. This is a dry statistics paper, but anyway.

So, here is the model that solves this problem and that captures these intuitions. Okay, so you saw the graphical model before and now I am showing you an example. Here is LDA as a piece of this model. So how am I going to describe this? Okay, so I have my topics, I have my document described in terms of the topics. Okay, I have the EM paper and now it's just about statistics. Now, the D plate represents documents, and I have this new plate here, U; these are users. X-U is the preference vector for user U. All right. This is a K vector, K is over topics, and these are the topics that you are interested in.

Okay, so here is somebody interested in computer vision. VUD overlaps the U plate and the D plate; this is that binary random variable. Does user U have document D in her library? And this is zero or one and the idea here is that VUD comes from a distribution that depends on both the topics that the document is about and on this variable Zeta, which we called the correction.

Okay, Zeta is another K vector and it represents who else is interested in this article. Factor out what the article is about: who else is interested in it?

Okay, so when I see lots of information about people in computer vision reading the EM paper

Theta has to describe the words. It's never going to spike at computer vision because none of the words have to do with computer vision, barring this fun debate. Zeta-D then has to say, "Okay, I need to explain all of these computer vision researchers reading the EM paper. They are clearly not interested in statistics, because they haven't read any of those other statistics papers. So I am going to put a spike in computer vision." That's the way it works; again, it's sort of like earlier, it's reasoning about the posterior.

All right. So I want to go into a little bit more detail about this, but if you understood the last two slides you understood the big idea for why this model works. Basically what we do is we blend factorization-based and content-based recommendation. So if you are familiar with matrix

factorization, this is like a Bayesian version of non-negative matrix factorization. We use

Gammas and Poissons everywhere. That's a really good idea that I can talk about some other time. But, basically I have my document representation, Theta-D-K, which represents what the document is about. I have my correction representation and I have the preference vector for each user, X-U-K. And then whether or not user U likes document D has to do with the dot product between the user's preferences and the sum of what the document is about and this correction.

Okay, so when you see a new document you don’t have any idea what this correction is and so it’s going to be zero and we are only going to use the text. But, as we see information we are going to populate this correction vector and it plays a role in forming recommendations.
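
To make that concrete, here is a sketch of the likelihood in the transcript's notation; the per-component Gamma priors are my assumption, based only on the "Gammas and Poissons everywhere" remark:

x_{u,k} \sim \mathrm{Gamma}(a, b), \qquad \zeta_{d,k} \sim \mathrm{Gamma}(c, e), \qquad v_{ud} \sim \mathrm{Poisson}\!\left( x_u^\top (\theta_d + \zeta_d) \right)

For a brand-new document \zeta_d is near zero and the prediction leans entirely on the text through \theta_d; as clicks accumulate, \zeta_d absorbs the usage that the words cannot explain.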

Okay, yeah?

>>: Okay, so here I see the correction there. [indiscernible].

>> David Blei: Yes that's right, yeah that's right. So you could take the correction vector away, but then there is a tension, right, because then --. So it's important that that's a free variable.

When you take the correction vector away there’s this tension that when I see computer vision researchers clicking on the EM paper I am going to want to make the EM’s representation have to do with computer vision, but then here’s the text of the EM paper shouting at me that I can’t do that because then it screws up my estimates of what’s in the document. Exactly, you’ve got it.

Okay, so let's look at some data with this model. We have two data sets: one is from Mendeley, this is a way that people like us can share our BibTeX files, and we have 80,000 users. They have 261 thousand documents in their libraries. We have a 10,000-term vocabulary, 25 million observed words, and this is sparse, right, there are only 5.1 million entries in this big matrix. I didn't mention it, but when you make Gammas and Poissons everywhere it makes inference very scalable. So we can handle these kinds of large data sets without even using fancy inference techniques.

We have another data set which is quite exciting. It's a decade of clicks on the arXiv. Hey, here is Paul Ginsparg many years ago inventing the arXiv somewhere, it was in like Santa Fe or something. And we have a decade of people clicking on the arXiv, where there are 825 thousand documents, 120 thousand people; again our vocabulary is about 14 thousand, and this is a little less sparse, 43.6 million entries. It's less sparse because these are clicks. Just pause and appreciate this data. Physics runs on the arXiv. So since the arXiv was invented, physicists every week look at the arXiv, click on arXiv papers, read arXiv papers. Physics journals, from my understanding, have become basically a necessity for tenure committees, but physicists learn about each other's work, do work and publish work all on the arXiv. So these ten years, part of it anyway, represent physics happening over a decade.

Okay, so let's look at the EM paper. This is in the Mendeley data set; the EM paper wasn't on the arXiv. Here is the abstract of the EM paper. You can see that it's about statistics. Here is the topic representation of the EM paper. So these are the topics that the words activate when you look at just the text, and you can see that only a handful of topics have been activated and the

main ones are things like algorithms and probability models. So now I took this Mendeley data,

I fit it, the EM paper of course has been around a long time, lots of people have it in their libraries and let’s add the correction vector to this topic vector to see what we are going to use when we form predictions.

And what you can see is that there is a huge difference, where first of all, in terms of algorithms, the EM paper is one of the most important algorithms papers in this data set, but what's interesting I think is what comes out of the weeds. So here is something about network analysis, where the EM paper is important for doing community detection, which we talked about earlier, and here is computer vision. Okay, so the EM paper has nothing to do with computer vision, but when you look at the correction vector you can identify that computer vision researchers are reading the EM paper.

Okay, here is another example.

>>: What’s the Y axis?

>> David Blei: So, it’s basically these Gamma parameters. You can think of them as expected counts loosely. Yeah.

Here is another example: have any of you read --?

Uh, yeah?

>>: So this is actually [inaudible]?

>> David Blei: The red one is, yeah and red plus black is what we use when we form this prediction, right. This is black plus red.

>>: [inaudible].

>> David Blei: Yeah, there is, look, so here are algorithms and here it’s increased. I am going to, you know what, I will answer that in a second.

>>: [inaudible].

>> David Blei: Yeah, yeah, so it goes black to here and we have to put a line in there.

>>: Yeah, okay.

>> David Blei: Here's another example: this is a book about convex optimization, you might have guessed that, and here again are its topic proportions. You can see how important this book is in the data set by just looking at the Y axis. And it's not about much, but it's about algorithms and a little bit about signal processing. Again, if you look at who's reading it you can see that in algorithms this is very important. Part of the reason is that this book is free

online, so everyone has it in their library, but again also interesting is what comes out of the weeds.

So here, there is nothing to do with finance in the text, but a topic with words like cost, trade, economic, market, financial, return pops up, because those readers care about convex optimization, but also because the examples in that book are often about portfolio optimization, and probably everyone interested in that reads this book, although the description of the book doesn't mention it. And then sensor networks and distributed computing pop up, where of course convex optimization is also very important.

Okay, so my point in showing you these pictures is to give you intuition about the model, but also to show that it's capturing aspects of documents, not just people, aspects that are hard to otherwise get at. Okay, if you are interested, at the end we can talk about these plots, for those of you who are interested in the details of it working better than alternatives. Since I want to end after an hour I want to get to the next pictures.

Yeah?

>>: Um, in terms of the relative size of the red and the black.

>> David Blei: Yeah?

>>: Is that how much it will factor into the recommendation?

>> David Blei: Yes, that's right. So the red, red plus black, is the factor in the recommendation again. So these questions relate to what I want to show you.

So for these plots, imagine being convinced that this works better than everything else. Okay, like I mentioned, the readers also tell us about the articles. We just saw two examples of that, and I want to show you how you can use basically a fitted recommendation system to tell you something about, like I mentioned, the landscape of the scientific literature. Okay, so here is

Darwin's library, here is Einstein reading, here is somebody reading the arXiv; you know, with all of this information can we say something about this paper, and this book, and all of those books?

Okay, so here's how it works: one topic that we find is about network analysis. Okay, so here are the words: network, connected, module, nodes, links, topology, connectivity. This is about statistical analysis of networks. We can do what we do with topic modeling and just ask: what are the articles in the collection that are about networks? Okay, so that's just asking for a black bar on networks, here's everything else, here's networks; let's filter our corpus by those articles that are just about networks and you get this list. Here's "Assortative mixing in networks", "Mixing patterns in networks", "Catastrophic cascade of failures in interdependent networks". These are the articles that have the highest topic proportion for networks.

But, now we can ask: what is about networks and read by users interested in networks? Okay, so what that says is, “I am going to filter on those, on that black bar. I only want articles that are

about statistical network analysis.” But, now I am going to add the red bar, add networks to the black bar and ask, “What are the papers now among that subset that are most interesting?” So these are, take all the papers about networks, take all the people that are interested in networks, which of those papers are they most interested in? Does it make sense?

"Emergence of scaling in random networks", "Statistical mechanics of complex networks", "Complex networks: structure and dynamics"; these are these very high profile science articles about network analysis. I think they are by Barabasi, he maybe wrote this one, and of course whether or not their text is more about networks than other network papers, who cares? The point is everybody has these in their library, everybody who is interested in networks has these in their library; maybe it's because they all have to cite them, but still.

Okay, also interesting is to say, "All right, let's ask this question: I am still going to filter on articles about networks. I only care about articles about networks, but now I am going to ask: I don't want to know which network enthusiasts are reading these articles. I want to know which enthusiasts about other stuff are reading these articles." In other words, which articles about networks have had some kind of interdisciplinary effect, well, "effect" is a terrible word to use these days, but have had interdisciplinary patterns of use in other areas?

And so here you see articles like "Mapping the structural core of human cerebral cortex". So this is an article about networks applied to neuroscience, and of course it's interesting to people in neuroscience. "Network thinking in ecology and evolution": same kind of thing, an interdisciplinary article about networks, interesting to people who are interested in ecology. And then a kind of pop-science book about networks, "Linked: The New Science of Networks", also by that guy, Barabasi. This is a popular book, and so many people have it in their library, even people who aren't interested in networks.

Finally, we can ask the question: let's take articles that aren't about networks, but ask who's reading them among the network enthusiasts. And here you see that network enthusiasts are reading about power-law distributions, the statistical physics of social dynamics, and heavy-tailed distributions in human dynamics. These are articles that aren't necessarily about statistical network analysis, but they are what these people are reading. This is where the EM paper could show up for, say, the computer vision community.
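To make these queries concrete, here is a minimal sketch, assuming a fitted collaborative topic model with per-document topic proportions (the black bars) and reader-driven offsets (the red bars); the array names, sizes, and thresholds are illustrative, not the talk's actual notation:

```python
import numpy as np

# Hypothetical fitted quantities for D documents and K topics:
#   theta[d, k]   - topic proportions from the text (the "black bar")
#   epsilon[d, k] - reader-driven offsets (the "red bar")
rng = np.random.default_rng(0)
D, K, k_net = 1000, 50, 7            # k_net: the "networks" topic index
theta = rng.dirichlet(np.ones(K) * 0.1, size=D)
epsilon = rng.gamma(0.3, 1.0, size=(D, K))

in_topic = theta[:, k_net] > 0.3     # articles whose text is about networks

def top(scores, mask, n=3):
    """Indices of the n highest-scoring documents inside mask."""
    idx = np.flatnonzero(mask)
    return idx[np.argsort(-scores[idx])][:n]

# 1. About networks, ranked by text alone (black bar only).
print(top(theta[:, k_net], in_topic))
# 2. About networks, ranked by text plus the interest of network
#    readers (black bar plus red bar).
print(top(theta[:, k_net] + epsilon[:, k_net], in_topic))
# 3. About networks, but clicked by readers interested in OTHER topics:
#    offsets outside the networks topic flag interdisciplinary reach.
print(top(epsilon.sum(axis=1) - epsilon[:, k_net], in_topic))
# 4. NOT about networks, but clicked by network enthusiasts.
print(top(epsilon[:, k_net], ~in_topic))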

Okay, and we can do this with all the different topics. So here is the statistical modeling topic: among people interested in probability models, these are the papers they were reading, and people who are not interested in probability models are reading these papers about probability models.

And you can see here is the EM paper, here is the famous tutorial about HMMs, which made HMMs accessible to anybody, not just statisticians, and here is another. Hey, this might be from Microsoft Research, I don't know if it is, but "influence diagrams" is a phrase I associate with Microsoft Research. Anyway, here is another kind of introductory paper that is read by people in other fields. And again, here are people interested in statistical modeling. What are they reading about when they are not reading about statistical modeling? Well, they are reading about the bootstrap, because who wouldn't, and they are reading about multivariate statistics and some kind of R plug-in.

Okay, and we can do this with algorithms; here, on the click data, we did this with information theory; and we can use it with every topic and get a sense of how these articles are organized by reader and by topic. Okay, so in summary, collaborative topic models connect text to usage. They blend content-based and user-based recommendations, and they give us a new window into how people consume articles and what those articles mean in terms of how people use them. Okay, so I will skip the last part, where I talk about probabilistic modeling, and just summarize by saying that we talked about these two things, and I can take more questions now. Thank you.

[clapping]

>>: [inaudible]. Does this experiment show how different the information you get from these [indiscernible] patterns is from citation patterns?

>> David Blei: No, citation patterns are hard to get for these large-scale data sets. So I am working on getting them, although we have another goal for citation patterns, which is, you know, I had these pictures of Darwin's library and Einstein reading. You can't get Darwin's BibTeX file, because he wasn't good at BibTeX, but what we can do is get every article that Darwin ever cited and every article that Einstein ever cited. And that is like a subset of their libraries.

So what we want to do is build a collection like that, which we are working on with some people at the University of Chicago, and then do this kind of recommendation system modeling through history. But another purpose --.

>>: Yeah, is it a parsing of the files that’s --?

>> David Blei: Yeah, that's right; you know, it's a whole research field, reference extraction and all that.

>>: So I was hoping there was a database of input.

>> David Blei: I mean, not with the, you know, there are small databases, like the ACL has one and there are other small ones lying around, but for the million arXiv articles, no.

>>: [inaudible].

>> David Blei: That's right, people are clicking on papers but never citing them. I mean, right --.

[laughter]

>> David Blei: It would debunk the kind of academic conspiracy theories that many of us suffer from.

>>: Well, it's the way that the community is organized; you actually don't have time to write citations. So people don't do that deliberately, they just don't have space. But with your technique you get the tried and true influence.

>> David Blei: Yeah, yeah. So we had some other work, I worked on this with Sean Gerrish, that uses words to find influence. The idea being that if it's 2014, sorry, if it's 1900, and you write a paper and I write a paper and Einstein writes a paper, then in 2014 we should be able to figure out that Einstein's paper had influence and our papers didn't, just by looking at word use. The words that people are using in 2014 reflect what Einstein was writing in 1900 more than what you and I were writing then. And there we did use citation to validate the model, to say, "Look, this correlates with citation."

>>: [inaudible].

>> David Blei: That’s right, this is more micro-level.

>>: [inaudible], but because whoever wrote the paper actually didn’t follow through, they are forgotten, but the idea is there.

>> David Blei: Yeah, that’s a good idea, yeah.

>>: So tell us something about this hierarchical LDA model.

>> David Blei: Yeah, so some of Mike Jordan's work is on that too, on hierarchical topics, where the top topic is something like stop words; underneath that there are big fields, so if we are doing computer science articles there could be systems, theory, AI; and then underneath those are sub-topics. So Mike and I worked on that a while ago and it was okay, but recently a fellow by the name of John Paisley really improved on it and made it scalable, and also made the model richer in an important way, so that there can be multiple paths through the same tree for a single document. And these hierarchical topic models capture good topical structure.

>>: So is this the relationship you showed earlier when you showed the relationship among topics?

>> David Blei: No, that was a correlated topic model.

>>: So the [indiscernible].

>> David Blei: Yeah, so I would look up John Paisley's paper. So yeah, he used stochastic variational inference and we applied it to very big data sets.

Yeah?

>>: Do you have a website that would let me enter what I am interested in and give me recommendations for what I should be reading?

>> David Blei: No, but we are working on that through the arXiv. It's going to be called MyArXiv and it's something we are building for exactly that purpose.

Yeah?

Oh, there is a question behind you, hold on, patience.

>>: So I was just wondering: in the examples we looked at, it seems like the user preference component dominates the topic component. Is that just an artifact of the examples we were looking at, or is there something in the way it's defined that causes it to dominate?

>> David Blei: Um, yeah, so that has to do with the Y axis, which I didn't explain: the Y axis represents word counts on the text side and it represents clicks on the user side.

>>: Oh, okay.

>> David Blei: So the deal is that it needs Theta plus Zeta to represent the clicks, and there are so many more clicks than there are words in the abstract that it looks inflated.

>>: Oh.

>> David Blei: But, no it’s not that users are much more important than words or anything like that.

Yeah?

>>: So what [indiscernible].

>> David Blei: That’s right.

>>: [indiscernible].

>> David Blei: Right, so well, there is this tradeoff, basically, between model complexity and computational convenience. When you ignore word order we can compute with a lot of data, well and fast, and also, in reverse, we don't need as much data to get good inferences out. But if it's important to have that kind of fine-grained inference, there are models out there, like "Beyond Bag-of-Words" by Hanna Wallach, where she modeled the word sequence.

From a practical standpoint, I think it's good enough to find significant n-grams in your data, like, you know, "New Jersey", and then just model each one as a single word and model a bag of these significant n-grams. That works very well in terms of also giving us nice topics to visualize, and it gets you what you would get out of a real sequential model. That's not to say there aren't applications where you would need a real dynamic model. Dynamic models at the level of documents I think are important when you have documents spanning hundreds of years, or maybe news, which changes quickly.
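A minimal sketch of that preprocessing step, assuming tokenized documents; the PMI-style score, thresholds, and function names are illustrative choices, not a prescribed method:

```python
from collections import Counter
from math import log

def merge_significant_bigrams(docs, min_count=5, pmi_threshold=1.0):
    """Join high-PMI bigrams (e.g. 'new jersey' -> 'new_jersey') so a
    bag-of-words topic model can treat them as single tokens."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for doc in docs:
        unigrams.update(doc)
        bigrams.update(zip(doc, doc[1:]))
        total += len(doc)
    keep = {
        (a, b)
        for (a, b), n in bigrams.items()
        if n >= min_count
        # approximate pointwise mutual information of the pair
        and log(n * total / (unigrams[a] * unigrams[b])) > pmi_threshold
    }
    merged = []
    for doc in docs:
        out, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in keep:
                out.append(doc[i] + "_" + doc[i + 1])
                i += 2
            else:
                out.append(doc[i])
                i += 1
        merged.append(out)
    return merged

# Toy usage: "new jersey" recurs, so it is fused into one token.
docs = [["new", "jersey", "tax"], ["new", "jersey", "rail"],
        ["new", "york", "tax"]]
print(merge_significant_bigrams(docs, min_count=2, pmi_threshold=0.0))
```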

Yeah, you have a question?

>>: How about the word sequence within the same topic?

>> David Blei: Yeah, that's what Beyond Bag-of-Words does. In the applications that I have looked at that has been less important, but that's not to say such applications aren't out there.

>>: How complicated is it in terms of inference [indiscernible]?

>> David Blei: Complex. So basically you end up paying for N, where N is the number of words in the document, and you usually end up paying something like N squared or 2N. But it's really more that the model can capture so much that it's easy to overfit in those settings. It's the old problem of language modeling, where if you go to a higher-order n-gram you have to smooth a lot more.

>>: I was just wondering if you could talk a little more about that; you said the choice of the Gamma distributions was a really good decision.

>> David Blei: Oh yeah.

>>: So could you talk a little bit about that?

>> David Blei: Sure. Okay, so let me go back to this picture. Forget about the text for a moment; if you just have this model where you cover up Theta, this is like probabilistic matrix factorization, right? And the way that's usually done is that you model your observed data as a Gaussian whose mean is the dot product between the user representation and the item representation. Are you familiar with that perspective?

>>: [inaudible].

>> David Blei: There are all kinds of issues with that. One is that you typically have a giant sparse matrix and you're modeling each cell as a Gaussian. So a zero and a one are each modeled as a Gaussian, say, or a zero and a four if you have ratings data, and it's very expensive. We can conquer that with stochastic optimization, but that's one issue. I think there is a more fundamental issue, which is that the zeros and the ones aren't really the same. A zero means that maybe you saw it and didn't like it, or maybe you just didn't have time to even consider the option. Whereas a one means that you decided to spend some of your precious attention on this item. This model, omitting Theta, just this model where we have Zeta and x, and then y_ud coming from Zeta transpose x, that is like a Bayesian version of non-negative matrix factorization, where Zeta_dk comes from a Gamma.

So these are going to be positive, those are going to be positive, and the Poisson of course takes a positive value as its parameter, and this dot product is going to be positive. And what we have found is that this matrix factorization alone works much better than Gaussian matrix factorization, and it has some nice properties. One nice property is that when you unpack the likelihood of that model it only depends on the non-zero elements of the matrix. Okay, so that's a computational advantage: if you have a very sparse matrix you don't have to look at those zeros. They can be factored out of the likelihood very easily. That's a computational advantage, but it still might not be a better model.
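A minimal sketch of that sparsity property, assuming the Gamma/Poisson factorization just described; the variable names and toy data are illustrative. The Poisson log-likelihood has a y*log(rate) term and a -log(y!) term that vanish at zeros, and a -rate term that sums over all cells but factorizes, so the zeros never need to be visited:

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.special import gammaln

def poisson_mf_loglik(Y, theta, beta):
    """Log-likelihood of a Poisson factorization model, touching only
    the non-zero cells of the sparse click matrix Y.
      Y:     scipy.sparse matrix, users x items
      theta: users x K non-negative latent preferences
      beta:  items x K non-negative latent attributes
    """
    Y = Y.tocoo()
    rates = np.sum(theta[Y.row] * beta[Y.col], axis=1)  # rates at non-zeros
    ll = np.sum(Y.data * np.log(rates) - gammaln(Y.data + 1))
    # The -rate term sums over ALL cells, but it factorizes:
    # sum_{u,d} theta_u . beta_d = (sum_u theta_u) . (sum_d beta_d)
    ll -= theta.sum(axis=0) @ beta.sum(axis=0)
    return ll

# Toy usage: Gamma draws keep every latent value positive.
rng = np.random.default_rng(1)
U, D, K = 200, 300, 10
theta = rng.gamma(0.3, 1.0, size=(U, K))
beta = rng.gamma(0.3, 1.0, size=(D, K))
Y = sparse_random(U, D, density=0.01, random_state=1,
                  data_rvs=lambda n: rng.integers(1, 5, n))
print(poisson_mf_loglik(Y, theta, beta))
```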

But it does better in terms of things like recall and precision at predicting which items the user is going to click on. And one way to interpret that is that this model is equivalent to the following model: each user decides, with a Poisson random variable, how many items they are going to click on, and then, conditional on that number, chooses those items from a multinomial distribution whose probabilities are proportional to these values.

And I wouldn't do inference that way; inference is best done using this form. But what that tells you is why the zeros count less. A zero in this model either means "I already spent my Poisson attention budget on other stuff" or it means "I looked and didn't want it", and I think that's why this model does better.
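The equivalence appealed to here is the standard Poisson/multinomial relationship: independent Poisson counts, conditioned on their total, are multinomial with probabilities proportional to the rates. A small illustrative check, with made-up rates:

```python
import numpy as np

rng = np.random.default_rng(2)
rates = np.array([0.5, 1.5, 3.0])   # per-item Poisson rates for one user
n = 50_000

# View 1: independent Poisson draws per item.
a = rng.poisson(rates, size=(n, 3))

# View 2: draw a total "attention budget", then split it multinomially
# with probabilities proportional to the rates.
totals = rng.poisson(rates.sum(), size=n)
b = np.array([rng.multinomial(t, rates / rates.sum()) for t in totals])

print(a.mean(axis=0), b.mean(axis=0))  # both approach the rates
```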

And so in these plots that I skipped, here, I guess this is precision, but this green line is Gaussian matrix factorization with content, and all of the other lines are based on this Poisson/Gamma representation: some of them looking only at ratings, that is this blue line; some looking at ratings and text; and this one being the model I just showed you. So if you are interested, we have a paper on the arXiv, Gopalan et al. 2014, where we looked at the Netflix challenge and a bunch of different recommendation data sets and compared, basically, Gaussian matrix factorization and this Poisson matrix factorization, and Poisson outperforms Gaussian in all cases.

Yeah?

>>: [indiscernible] if you have different types of data where it does matter, you do want to treat them the same; like you do want to say, for the sake of argument, that you shouldn't have read the EM paper, then [indiscernible]? So that's not going to happen in this data set, in this type of work, but for some other type of [indiscernible], for example, [indiscernible] by unraveling the image into a vector and then treating every [indiscernible] as a feature with [indiscernible]. Then sometimes it might work and sometimes it might not, because you do want to say [indiscernible] a little more, a little less, but not much less and not much more.

>> David Blei: Yeah, I agree, I mean of course if the data is meaningful.

>>: But on the other hand, this [indiscernible] representation that [indiscernible] data counts has the huge [indiscernible] advantages we are talking about. So did you come up with some kind of solution that would keep those advantages and model the data and allow this sort of almost Gaussian way of modeling cost?

>> David Blei: We don't have any problem like the ones you described. So think about the Netflix challenge. This is people watching movies; I certainly don't get to watch all the movies I want to watch. It's not as though I marched through the Netflix catalog and decided what to watch. And I think most other recommendation data sets have this scarcity, these limited-attention properties. But it would be interesting; I don't know what the right model is in that case, but it would be interesting to think about: some kind of Poisson that has the symmetric loss you get with a Gaussian, which is what's crucial there.

>>: [inaudible].

>> David Blei: Hm, that’s an interesting idea, yeah.

>>: I am just curious --.

>> David Blei: Oh.

>>: [inaudible].

>> David Blei: Um, it would do something similar. So we were working on this Poisson factorization; I was pretty excited because I was working on something that's not LDA, and then it turns out that if you condition on a couple of things and make some decisions, you get something very close to LDA out of Poisson factorization. But what you described is sort of close to a Poisson factorization type model. It doesn't involve topics and users together, though.

>>: Right, but [indiscernible].

>> David Blei: Yeah, so I didn't put it in this plot, but one of the things we compare to is just running LDA in the way you described: running LDA where users are documents and items are words, not the other way around, which is what you just said. And that does okay, but not as well as these other methods. These other methods capture things like different people having different rates of consumption of the data, and that gets captured in those Gamma variables.

Yeah?

>>: I was just curious how much time factors into your collaborative topic modeling. Like, when the user has read an article, does that factor in? Do you just cut off and do it based on [indiscernible], or is there some way in which you could factor that in?

>> David Blei: That's a good question, and we are working on that problem right now. Right now, what I showed you was just big and exchangeable; it's as though you sat down and looked at all those papers at once. But we are thinking about how this evolves as a time series, especially with the arXiv data, because over 10 years your interests change. And if we can capture that with some of these long-running users I think that could be interesting. There are also interesting technical problems in doing that; for instance, we are trying to model users as Poisson processes, with waiting times in between their clicks, and then model the preference and consumption patterns on top of the Poisson process. It makes for a fancy model.
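A toy sketch of the ingredients of such a model, not the actual system being built: clicks arrive with exponential waiting times (a homogeneous Poisson process) and each click picks an item according to latent preference weights. All names, rates, and distributions here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulate_user_clicks(rate, prefs, horizon):
    """Homogeneous Poisson process of clicks with intensity `rate`;
    at each click the item is drawn from preference weights `prefs`."""
    t, events = 0.0, []
    probs = prefs / prefs.sum()
    while True:
        t += rng.exponential(1.0 / rate)   # waiting time between clicks
        if t > horizon:
            return events
        events.append((round(t, 2), rng.choice(len(prefs), p=probs)))

prefs = rng.gamma(0.3, 1.0, size=20)       # per-item preference weights
print(simulate_user_clicks(rate=2.0, prefs=prefs, horizon=5.0))
```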

>>: Is it also easy to update as new data becomes available, or do you have to run it on the whole data set, get something, and then just live with that?

>> David Blei: I don't know; that's part of what this Poisson process perspective on recommendation is meant to solve. Then you start asking about conditional expectations given all of your previous history plus one more click, and so, you know, at first you could just rerun inference, to see what you would get if you did the perfect thing, but then thinking about approximations would be useful. There was a paper on streaming variational inference that could be relevant there. Yeah, Mike Jordan, Tamara Broderick, and others wrote that paper.

>>: [inaudible]. So if you do one topic model for your corpus, then you would be like, "Okay, now it's just interchanging based on the measure that I tried to do"? I mean, I don't know, some measure that it could find, and then it would just make more [inaudible]?

>> David Blei: I think that's an interesting question to ask of probability models in general. So there is this kind of disconnect in Bayesian, or any old, statistics: classically, you have got your data, you analyze it with a model, you report on the results, and then you are done and you get paid, right? And the kinds of data that we have now are data like click streams and query streams and other kinds of streams, and it actually doesn't make sense to contemplate a model conditioned on a stream; it just doesn't compute. What does it mean to condition on a stream of data that never ends?

And one way to think about that might be to say: okay, there is this big process that is changing my data, and within that big process I can model the data as exchangeable within a window, and I want to detect when the window needs to change. But this requires new ways of thinking about solving problems with probability models on streaming data. I think that's an interesting area, but I don't know of much work on it. This streaming variational inference is perhaps one way to start thinking about it, but even that doesn't change the whole paradigm.

>>: Is this data publically available?

>> David Blei: The arXiv data?

>>: Yes.

>> David Blei: Neither of these data sets is public, no, and Mendeley has totally stopped answering my e-mail; I don't know why. I didn't do anything wrong, but they got bought by Elsevier and suddenly things got different.

>>: On the exchangeability you mentioned, have you given thought to different interpretations of how availability affects preferences? So one thing with the movie problem, right, is you can view me as having a certain capacity for movies, but also there are only certain movies available at a certain time. And so my selection has something to do with what's available as well, just as much as my own preferences.

>> David Blei: Yes, yeah, so we have thought of it in the sense that we have identified it as an issue that we are not capturing. It's kind of like, what do you call it, a selection set, or --?

>>: Well in search it’s more of a problem, like nobody is going to click on something that’s not displayed.

>> David Blei: That’s right.

>>: [inaudible].

>> David Blei: And in the arXiv data that's an issue, in the sense that some arXiv papers are in the weekly e-mail and others aren't. And the ones that are in the weekly e-mail are clicked on a lot more, and, this is another issue here, you don't want to bias your method toward recommending papers that are like the papers that were e-mailed weekly back before there was a recommendation system. You want to do better than that. But yeah, building in a selection set is something we have identified as an issue, and there is something called a zero-inflated Poisson that could help with that, where you have a separate process that models whether or not an item is available, separate from whether you are going to consume it. But, you know, we haven't worked out the details of how to capture that. For a lot of the data we have, we don't have that availability information; of course, with the data you have, you do.
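For reference, a minimal generative sketch of a zero-inflated Poisson of the kind mentioned; the availability probability, rate, and function name are illustrative, not a worked-out model:

```python
import numpy as np

rng = np.random.default_rng(4)

def zero_inflated_poisson(p_available, rate, size):
    """Draw counts that are 0 when the item was never shown
    (with probability 1 - p_available) and Poisson(rate) otherwise."""
    shown = rng.random(size) < p_available   # separate availability process
    return shown * rng.poisson(rate, size)

clicks = zero_inflated_poisson(p_available=0.3, rate=2.0, size=10)
print(clicks)  # zeros mix "unavailable" with "available but unclicked"
```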

>>: I have always thought about hiring an intern to work on writing citations automatically. Could you do something like that; I mean, have you thought about it? It should be kind of possible: you get this large citation list and even [indiscernible] text, even if it's [indiscernible].

>> David Blei: Well --.

>>: [inaudible].

>> David Blei: I don’t find it useful because we all read every paper we cite, so it would take too long to read those papers if they came from a --.

>>: Well, but the idea is to know them, you actually know them.

>> David Blei: Ah, papers from your library.

>>: They don't even have to be from your library; you probably know them anyway.

>> David Blei: Right.

>>: [indiscernible].

>> David Blei: No, I am just kidding; no, yeah, yeah, yeah, no, that's a good idea. You could use this kind of thing to do that. I mean, you could also --. Yeah, it would be very interesting, because typically if you think about that problem you would want to take citation data and then build a citation predictor, but if you take user behavior data and build a citation predictor you might get a better citation set than the old ways we got citations, kind of similar to the arXiv e-mails. That's cool.

>>: [inaudible]. This, yeah, where you'd use [indiscernible] citations of both yourself and your co-authors, and you can add co-authors and start adding citations with completions based on that.

>>: Yeah, that's cool, but I meant more that you write your abstract and introduction.

>>: [indiscernible].

>> David Blei: And then you type in the program committee and then you get [inaudible].

[laughter]

>> David Blei: Very cynical.

>> Paul Bennett: All right, thanks a lot.

>> David Blei: Thanks.

[clapping]
