>> Chris Burges: So it's a pleasure to welcome Jerry Zhu. He's an assistant professor at Wisconsin, and is going to be talking about adding domain knowledge to latent topic models.

> Jerry Zhu: Thanks, Chris. So let's see, so first let me thank my collaborators: David Andrzejewski, my former graduate student, who did most of the work and actually made most of the slides here. He's at Livermore Lab now. And then Mark Craven and Ben Liblit and Ben Recht all helped with the project, and the funding agencies. Let's see, how many people are familiar with latent Dirichlet allocation here? Just wanted to get a sense. But let's start from somewhere else. Let's start from Times Square, New Year's Eve. What happens there on New Year's Eve? This is --

>>: People partying.

> Jerry Zhu: People party, and this big ball is going to drop. And what you probably do not know is that for the past few years, the organization which runs the party also set up a website to let people type in their New Year's wishes. It's a one-line thing: type in your New Year's wish, and they will print it, cut it as confetti, and drop it when the ball drops. That's very nice. Fortunately we got ahold of the dataset. So we have people's wishes. Can you take a guess what is the most frequent wish?

>>: Peace on earth.

> Jerry Zhu: Right on. So the most frequent one is peace on earth. Now, here is a random -- [laughter] -- random sample of wishes. Okay. Let's see. So: the university, my friends -- this particular dataset is from 2008 -- find a cure for cancer, to lose weight and get a boyfriend, so on and so forth. So since we will be talking about latent topic modeling, we're going to treat each wish as if it's a mini document, and we're going to look at the topics it belongs to. So you can get a sense that the first one is about peace. This one is about getting an education. That's about war. This one's a little bit interesting because it has a mix of two things. And that's what topic models are good for: they can model documents which contain multiple topics. Now, as we are all familiar with, if you do a -- can we lower the lighting a little bit?

>>: Yes.

> Jerry Zhu: So what you're seeing here is a simple frequency analysis of the whole corpus. Thank you. That's very -- that's fine. Can people see that? It's okay. So as you can imagine, the word "wish" is very frequent. Peace, love, happy, so on and so forth. But it has many things mixed in it, right? So the goal of latent topic modeling is to say, well, let's throw the data at a model, which we will explain later on, and we will get some interesting stuff out of it. I'm going to show you some topics that we recovered using latent Dirichlet allocation. And by a topic, I mean a multinomial, a unigram distribution over your whole vocabulary. What I'm showing you here is a distribution, the unigram probability of a word given that particular topic, and the size, of course, corresponds to the probability. You can get some sense that the first one is talking about troops. So troops coming home, God bless our troops, and so on. In this one you see Paul, Ron Paul; for some reason his name showed up a lot in 2008's New Year's wishes. This one is about love, and so on and so forth. All right. So latent Dirichlet allocation, or in general topic modeling, has interesting applications, mostly, I would say, in the form of exploratory data analysis -- just to see what's in the data. The way to think about it is that it's very similar to clustering.
But instead of clustering a big document as a whole, you're looking inside it to see what composition that document has. Okay. So there have been a lot of different ways that people use that. Okay. Before we review that model, let's have a really quick statistical review. So a Dirichlet distribution is a distribution over a D-dimensional simplex, that is, over probability distributions. So you can think of it as a dice factory: you can draw a die from it, and it will be a D-sided die. In this example we have a three-dimensional Dirichlet with parameters 20 by 5. The draws from it are going to be probability distributions over three items. Now, one way to see that is to draw the triangle, the simplex here, where A, B, C are the three different items. And the blue dots are samples from this Dirichlet. So if a dot is close to A, that means the probability of A is close to one, and so on and so forth. So from a Dirichlet you sample a die, which is a multinomial. In this case we show this particular multinomial, (.6, .15, .25), which corresponds to the red dot there. From a multinomial you can sample a document. So that's the bag of words model. And in this case, let's say the document length is 6; you happen to get this. Okay. One thing that's good about the Dirichlet is that computationally it's very convenient, since the multinomial is conjugate to the Dirichlet, so people like it.

So with that, let's quickly review the model. Many of you are familiar with it; I just want to remind you of the symbols. It gets a little bit messy. This is a generative model of a whole bunch of things given two hyperparameters, alpha and beta. This is how it works if you want to create a corpus; that's what the generative model is about. So you start by saying, I want big T number of topics. So right here, imagine you have, like, 100 topics you want to use. Each topic will be a multinomial distribution -- or think of it as a die over the vocabulary. The way you generate each one of those dice is by sampling from a Dirichlet with hyperparameter beta. Now this beta has the dimensionality of your vocabulary, which could be, let's say, one million -- a pretty big thing. By doing that, you now have a die with one million faces, and each face has a word on it. You have one die, and you sample 99 more dice from the same Dirichlet. So that's first how you prepare the topics. Then with those dice you're going to generate documents, and it goes as follows. For each document that you're going to generate, first you sample a parameter called theta from a different Dirichlet distribution, with parameter alpha. And this Dirichlet distribution has dimensionality 100. Okay? It's going to be a mixture of those dice. For this document you are trying to decide how often you're going to use each topic. So that's what that die is for. Once you have that theta die, the 100-faced die, you go through each word position in your document and do the following. For the first word position you're going to sample its topic index -- or, in other words, which topic die you're going to use for this word position. This is sampled from that multinomial theta, that 100-faced die. So let's say it's topic 33. Then you are going to actually generate the word at position one by sampling from the word die, that multi-faced die -- the 33rd of those dice. You throw that one, get a word. Then you repeat. In this way you generate the whole document and the whole corpus. So that's the generative model.
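To make the generative story concrete, here is a minimal sketch of it in Python with numpy. The sizes, hyperparameter values, and variable names are illustrative assumptions (and the vocabulary is much smaller than the one million faces mentioned above, just to keep the sketch small); this is not code from the talk.

import numpy as np

rng = np.random.default_rng(0)
T, V, n_docs, doc_len = 100, 10_000, 5, 200   # topics, vocabulary size, corpus size
alpha, beta = 0.1, 0.1                        # the two hyperparameters

# Prepare the topics: T word "dice" over the vocabulary, drawn from Dirichlet(beta).
phi = rng.dirichlet(np.full(V, beta), size=T)      # shape (T, V)

docs = []
for _ in range(n_docs):
    # Per-document topic mixture: a T-faced die drawn from Dirichlet(alpha).
    theta = rng.dirichlet(np.full(T, alpha))       # shape (T,)
    words = []
    for _ in range(doc_len):
        z = rng.choice(T, p=theta)                 # pick which topic die to use
        w = rng.choice(V, p=phi[z])                # roll that die to get a word
        words.append(w)
    docs.append(words)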
And the way we use this generative model, sorry, is by conditioning on -- for simplicity, assume you know alpha and beta, the hyperparameters -- conditioning on the corpus W, and you want to infer either z, which is, for each word position, which topic it came from, that hidden quantity; or, perhaps more interestingly, you want to infer phi, which are the 100 topic dice, and theta. I would say phi is the more interesting one. I put a little reminder there on top, so you can remember what those symbols are as we go along. All right. Any questions? No. So please feel free to interrupt me at any moment.

All right. So what is this talk about? Now, LDA by itself is an unsupervised learning method. You give it a corpus W -- you actually have to give it something more, you have to give it alpha and beta and maybe the number of topics, big T -- but you don't give it labels. So it's unsupervised. And it's going to recover those parameters phi and theta for you. In this sense it's very similar to clustering. Now, because it's unsupervised, sometimes the result can be unsatisfactory. In particular, a domain expert who is interested in a particular scientific problem -- think of, say, a biologist looking at PubMed abstracts -- might have some particular idea in terms of what kind of topics he is looking for. He has some prior domain knowledge with which he wants to constrain the latent topic modeling process. And this is very similar to clustering, where in many cases a domain expert has some knowledge of which items need to be grouped together, for example. So there is a need to incorporate domain knowledge into the latent topic modeling process. Now, there have been a lot of variants of LDA which allow people to incorporate different kinds of knowledge. But I would say most of them have a problem in that they need intensive machine learning hacking to get the model to work. And that means if a biologist is interested in incorporating a particular kind of domain knowledge, he needs to first find a machine learner; he cannot do it himself. So the talk that I'm going to present is about how to do knowledge plus LDA in a user-friendly way, so that anybody -- any scientist, not just machine learning researchers -- can incorporate their knowledge. And we're going to talk about three models: number one, topic-in-set; number two, Dirichlet Forest; and number three, Fold.all.

So the first one. I'm going to start with a slightly unconventional example, not this one about documents, but about software debugging. Some of the work was done at Berkeley, and here at Microsoft -- Alex Zen. Here's the idea. You have a piece of software, I don't know, like IE. There are bugs. You want to see where they are. One way to do it is to insert a lot of probes into the code. You can think of them as specialized counters, and so on and so forth. So you instrument the code; the yellow things are probes. So we can imagine that at different locations you insert different counters, and whenever the program gets there, the counter gets incremented by one, and so on and so forth. And at the end of the program, no matter whether the program crashed or not, you are able to recover those counts. And let's also assume that you just want the counts -- the aggregate of how many times each line is executed; you're not interested in, or you cannot keep, the actual execution sequence of things. So it's very much a bag of words notion. So this is a setup where the predicates act as words.
You have a bag of predicates, and a software run would be a document. Now, for this problem we actually have some extra knowledge, which is whether each run crashed or not. So you have a label, and you have a bag of words. A straightforward idea, if you want to do latent Dirichlet allocation on this, would be to collect all the failed runs, the crashed runs, run LDA on them, and say, could you please recover 100 topics; I'm going to look at those topics. The hope is that maybe some of those topics would be indicative of bug behavior, like particular lines where something is abnormal in some way. So, yeah, that's a very reasonable goal. But what happens in reality if you try that is that you don't get nice bug topics. Instead, you get topics that are in some sense much stronger. Those are normal-usage topics -- different functions, maybe you opened up a Web page, or you print a page, and so on, corresponding to those kinds of things -- and those are much more frequent, and therefore they dominate the topics you're going to get. To illustrate that, I'm going to show you a synthetic example. So the way to see this is that I'm creating a figure which is a 5-by-5 square, and each pixel corresponds to a predicate. So I have 25 different predicates. And I'm actually going to generate software runs. The way I generate them is by using a bunch of topics, which are multinomial distributions that will generate pixels here and here, but not there. So that's that topic. I have, I don't know, eight topics there, and three bug topics here. So I can sample documents, both successful runs and crashed runs, from that. And this is what you will see. So, remember, a document is a mix of all topics. It's very hard for us to see what's actually going on here. And you don't see those things.

>>: The LDA is completely independent, like you're not paying attention at all to the other set when you're doing --

> Jerry Zhu: So far I'm just generating documents. So I'm using the LDA model to generate documents.

>>: But you're doing the decomposition independently for label one versus label zero; you've just taken the group with label one, not paying any attention to the --

> Jerry Zhu: Yes, yes. So in fact I think this will answer your question. Now I'm going to run LDA on the crashed runs. It's more than this, but I'm going to run it on that. And here are the topics that I recover. Notice I actually used the correct number of topics. But I made the dataset so that these topics are more frequent, and therefore they somehow mask the bug topics. You don't get those nice things. All right. So what should we do? So here's a fairly simple idea. Since we know which runs crashed and which did not, we can actually jointly model both kinds of documents. And here's intuitively what I want to do. Let's say I want to use big T topics overall, but I can sort of set aside a smaller number of topics, small t, and say that those are usage topics and the remaining ones are bug topics. Very importantly, if a run succeeds, I'm going to say that all positions in successful runs can only use usage topics; they cannot access those bug topics. While for crashed runs I will allow them to use any topics they want. So this is an additional constraint on top of standard LDA. The hope is that by requiring this, you force the first small t topics to model general software behavior, and then the remaining ones, hopefully, will capture the bug behavior. Yes?
>>: Is there any constraint or cost in terms of the difference in distributions between the different, the sets there?

> Jerry Zhu: No, we didn't do anything at all. It's just as simple as this.

>>: They could overlap.

> Jerry Zhu: They could overlap, definitely. We didn't do anything more constrained than this [inaudible]; that's it -- if you do this, then only these can get those topics. Now, this is the new result. And so say, okay, let's have the first eight be the usage ones; for the buggy runs, you actually get something quite sensible. So this is a very simple example to get started with. And here are some actual run results.

>>: To emphasize one more.

> Jerry Zhu: Yes.

>>: Question, why not have the two sets of topics disjoint? Why not say one to two, plus two --

> Jerry Zhu: Right. Because both kinds of runs would contain normal usage. In both kinds of runs somebody would open the browser or do things like that, so you have the common element. You just want to group the common element in the first small t and let the remaining ones explain bugs. Yes?

>>: The LDA documents --

> Jerry Zhu: Yes.

>>: -- would this be the same as modeling the background distribution?

> Jerry Zhu: Yes, it certainly sounds a lot like that. In fact, I imagine one can explain it that way. So the first small t is the background model and the remaining ones are special. Okay. I'm going to skip this. This is real software -- people whose job is actually to insert bugs into programs -- and we were able to distinguish that. You can easily extend this idea to the more general topic-in-set constraint, where you say, here's our additional knowledge: for each word position i in the corpus, we can set up this set C_i, which can be arbitrary. It's a subset of all topics, and you constrain it such that the topic index for that position must be in that set. So this can actually encode quite a few different kinds of domain knowledge. For example, C_i could be a singleton; that would be like saying this word must be topic three. Okay. And this is very easy to do. So one inference method for LDA is so-called collapsed Gibbs sampling, where you are given W, alpha and beta; you have to infer Z, and you also need to integrate out phi and theta -- that's why it's called collapsed. And you do this in a Gibbs sampler fashion: you go through each word position and resample its Z, one at a time. Now, the black part is standard LDA collapsed Gibbs sampling, but the topic-in-set knowledge you can easily encode by inserting an indicator function there. So it's an indicator of whether the value of this z_i, for the ith word, is in the specified set: you let this term be 1 if it is, otherwise it's 0. It essentially forbids the Gibbs sampler from taking values outside the set. So you can do that fairly efficiently. And you can also relax this hard constraint to something slightly softer, so the topic is not strictly in the set but you pay a penalty. So this is a warm-up of how you can start to add interesting knowledge to topic models. Now, let's move on to the second model. It's called Dirichlet Forest. I'm going to actually present the results first, because I think they're kind of interesting. This is how people can use a model like this to do interactive topic modeling. So recall we have this New Year's wish corpus. If you run standard LDA on that corpus with 15 topics, T equals 15, you get something like this. And what we're showing here is, each row is the most frequent words in that topic. So let's take a look at those.
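A minimal sketch, in Python, of the constrained collapsed Gibbs update just described for topic-in-set knowledge. The count tables, the allowed-set structure, and the function name are illustrative bookkeeping assumptions, not code from the talk.

import numpy as np

def resample_topic(i, d, w, n_dk, n_kw, n_k, allowed, alpha, beta, V, rng):
    # n_dk[d]: topic counts in document d; n_kw[:, w]: counts of word w per topic;
    # n_k: total words per topic -- all with word position i already excluded.
    T = n_k.shape[0]
    # Standard collapsed-Gibbs proportions for LDA.
    p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V * beta)
    # Topic-in-set knowledge: the indicator zeroes out topics outside C_i.
    mask = np.zeros(T)
    mask[list(allowed[i])] = 1.0
    p = p * mask
    return rng.choice(T, p=p / p.sum())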
One thing you might immediately notice, if you work in natural language, is that we get a whole bunch of stop words mixed in there. And this is usually undesirable, because the topics are not as pure or as informative as they could be. Now, there are a lot of ways to handle stop words; the simplest might just be to preprocess your corpus so that you remove all stop words. But here we're going to do something else. Imagine a user looking at this result and saying, hey, we have too many stop words in here. Let's get a standard stop word list and say, please exclude all those stop words from the normal topics; let's sort of merge them into a single topic, isolate them. So this is an operation that the user can do: an isolate operation. In this case we actually used a 50-word stop word list. And we let the system rerun, but allowed it one more topic, so now we have 16 topics in all. What I'm showing you here is the result of this rerun after we do the isolate operation, and these new topics -- two topics -- are now responsible for the stop words. So 2008 is a stop word here because that's the year of the wish corpus. Yes?

>>: Why are there two topics?

> Jerry Zhu: Yes, that's because all we did is -- you will see how we do the isolate operation. We didn't really say, please move everything into this one topic. We just simply said, please make sure those stop words are not in the same topic as other words. It turns out LDA wants to use two topics to explain those stop words. Now, if you look at the result, it's much better now. But you might notice this particular topic, which mixes go, school, cancer in with well, free, college. What happens is it's a mixed topic of two things. One is you want to get into college. The second is you are wishing somebody to be cancer free. So these are two different things. So we will be able to say, let's split -- let's split these eight words into two groups. So imagine those are like seed words for those two topics. You're going to say, let's split those particular eight words, and hopefully they will create new topics and drag other relevant words into their respective topics. And this is what happened. So they're marked here; we allowed one more topic to accommodate that. And you see the green words are the ones we used in the split, but it also moved other relevant words in there. The reason you see mom/husband/son here is because most people are wishing something like, we hope mom to be cancer-free. So that's considered the same topic. Okay. So again, now you see the first one and the tenth one are kind of both about love. And you might want to say, let's merge these two. So we can again do a merge operation by specifying a few keywords. And if you do that and rerun the whole thing, now we have a single topic there which is much more pure -- except lose weight; somehow that's tied to love. And you get the idea. So you can use this in an interactive way to shape the topics that you want.

>>: When you do a merge operation, do you say the topics merge, or two sets of words?

> Jerry Zhu: Two sets of words.

>>: Is that different from isolate?

> Jerry Zhu: So you are asking for two sets of words which were originally -- I mean, it's not quite different. I will show you how it's encoded differently. It certainly has the same flavor, but it's encoded slightly differently. Okay. And you can do this by changing the standard LDA model slightly.
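To make these interactive operations concrete, here is a small sketch in Python, with hypothetical helper names, of how isolate, split, and merge could each be expanded into the pairwise word constraints (must-links and cannot-links) that the encoding described next is built on. This is an illustration, not the talk's actual implementation.

from itertools import combinations, product

def split(groups):
    # must-links within each seed group, cannot-links across groups
    must = [pair for g in groups for pair in combinations(g, 2)]
    cannot = [pair for g1, g2 in combinations(groups, 2) for pair in product(g1, g2)]
    return must, cannot

def merge(words):
    # must-links among all the given words, no cannot-links
    return list(combinations(words, 2)), []

def isolate(words, frequent_words):
    # cannot-links from each isolated word to the current high-frequency topic words
    return [], list(product(words, frequent_words))

# e.g. the split demonstrated on the wish corpus:
must_links, cannot_links = split([["college", "school"], ["cancer", "free"]])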
So recall this is where we produce, where we sample, the topic dice -- the one-million-sided dice. Instead of using a Dirichlet as the dice factory, I'll use something called a Dirichlet Forest distribution, which will give you something that cannot be achieved by a Dirichlet. So what is this good at? It's particularly good at encoding pairwise relations in the form of, for example, must-links. This is a term borrowed from clustering, but what it means here is that you want two word types, U and V, to be in the same topic. But what does 'in the same topic' mean? Because a topic is a multinomial distribution over the whole vocabulary, any two words are going to be in there with nonzero probability, right? So by 'must be in the same topic' we mean the following: we don't like the case where one word has high probability and the other has low probability in a particular topic. They should either both have high probability or both have low probability. That's it. So that's what we want to enforce with a must-link. And similarly you can enforce a cannot-link: you do not want these two words to be prominent in the same topic, but again they will both appear in it. So what we're really saying is you want one to be large and the other to be small -- or both small, I think. But you cannot have them both high.

>>: Maybe I'm jumping ahead here, but it seems like these kinds of things are difficult to put at the feature level but easier to put at the example level, right? There are often cases of words that you forget, I imagine, in any domain, where making these statements on features is very hard and precludes what you want to do with data analysis.

> Jerry Zhu: I agree with you to some extent; especially once you've exhausted the first few easy features, things get harder, and our model here does not address that. So here everything is strictly on the features. All right. Now, with these two you can create higher level operations. For example, split: you can say, I have groups of words and I want to split them. The way I do that is by creating must-links within groups and cannot-links across them. Merge, very simple: just put must-links among the words. And isolate: you put cannot-links from the words you want to isolate to all the high-frequency words in the current topics.

>>: So first you then need to encode, basically, every pairwise set?

> Jerry Zhu: Okay. All right. So here's why you cannot do this with a Dirichlet. Let's imagine you have a vocabulary of school, college, lottery -- just a three-word vocabulary, for simplicity. What would a must-link between school and college look like? It means you want to generate a multinomial such that school and college either both have probability .5 and .5, or .2 and .2, or .1 and .1, or 0 and 0 -- like they're somehow tied in this fashion. If you look at the simplex, this is what the density looks like. You want this. Okay? This, however, cannot be achieved by a Dirichlet. If you try to encode it with a Dirichlet, you will get something somewhat like this. And the reason why you cannot do it is because in the Dirichlet distribution, the variance of each dimension is tied to its mean. So you do not have the extra degree of freedom to encode the variance. And the way to get around it is to go to what's called a Dirichlet tree distribution -- and Tom [inaudible] did something on this, too. In the Dirichlet tree distribution you have your vocabulary here, but you put a tree on top of it.
It's a tree where you have positive edge weights, and the way it works is the following. So let's see -- the weights can be arbitrary, but in this example I made them special. So you have eta; whenever you see eta, that's a big number, say 50. So we have 50/50 here, and this one is two, and that one is one. So let's imagine we have a tree like that. The way you generate a multinomial distribution from a Dirichlet tree like this is the following. You start from the root, and you don't look at the subtrees; you just look at the children's edge weights and treat them as a Dirichlet distribution. So you're going to draw a multinomial over two elements from this Dirichlet. So let's say we do that and we happen to get this; that's the multinomial distribution we get. Then you go down to this internal node, and at this place you're going to use these as the Dirichlet parameters and draw another multinomial. So let's say that's that. And the intuition is: this is the probability mass that you got here, and you're going to split it according to that. So you basically multiply the numbers along each path, and that gives you the final probability. And because of this construction, what you will see is the following. Here you pretty much get a fairly uniform distribution in terms of the weighted split; but once you go down here, because these are large, the property of the Dirichlet is that you will evenly split between them, right? So you get the desired behavior. This is just to show you that things are a little bit messy, but it's actually quite nice: you still get what you want in closed form. And, in particular, there's this important quantity delta for each node s, which is the in-degree of the node -- the weight of the edge coming into it -- minus the out-degree, the sum of the weights of the edges going out of it. If that difference is zero for every node in the tree, you actually recover a standard Dirichlet distribution. When it's not zero, you get more interesting behavior. Furthermore, this Dirichlet tree distribution is nice: it's again conjugate to the multinomial. So it's convenient to work with, as long as you're careful about the bookkeeping. So this is how we're going to encode a must-link. Basically, when you have a must-link, you're going to say these two words are must-linked, and you're going to create a subtree with very strong but equal weights there. And then on top of that you're going to create a weak edge, with weight equal to the number of children -- the number of leaves here versus the number of leaves there -- so that it evenly splits the probability mass. This will give you samples like that. So that's must-link. Now, recall cannot-link, where you do not want two words to both have high probability. That one is, again, impossible to encode with a Dirichlet.

>>: One question: the eta and the beta in the tree you had there, do you determine those empirically and then use them fixed?

> Jerry Zhu: Yes, that's what we did. You can think of them as the knobs that the person can tune. We didn't learn them; we set them.

>>: Must link distribution.

> Jerry Zhu: Yes.

>>: [inaudible].

> Jerry Zhu: Yeah.

>>: [inaudible].

> Jerry Zhu: Yeah.

>>: Makes sense.

> Jerry Zhu: So cannot-link -- for example, a cannot-link between school and cancer. What you really want is this kind of behavior for a three-word vocabulary: you want either the multinomial probability to concentrate on school, or to somehow arbitrarily split the mass between the other two words, but not on school. Anything like that is okay. As I said, this cannot be easily represented by a Dirichlet.
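A minimal sketch, in Python, of drawing a word multinomial from a Dirichlet tree as just described, using the must-link example over school, college, and lottery. The tree encoding and function name here are illustrative assumptions, not the talk's implementation.

import numpy as np

rng = np.random.default_rng(0)

def sample_from_dirichlet_tree(node):
    # A node is either a word (a leaf, given as a string) or a pair
    # (edge_weights, children) for an internal node.
    if isinstance(node, str):
        return {node: 1.0}
    weights, children = node
    share = rng.dirichlet(weights)          # a multinomial over this node's children
    probs = {}
    for s, child in zip(share, children):
        for word, p in sample_from_dirichlet_tree(child).items():
            probs[word] = s * p             # multiply the draws along each path
    return probs

eta = 50.0
# must-link(school, college): a subtree with strong, equal weights (eta, eta),
# attached by an edge of weight 2 (its number of leaves); lottery gets weight 1.
tree = ([2.0, 1.0], [([eta, eta], ["school", "college"]), "lottery"])
print(sample_from_dirichlet_tree(tree))     # school and college end up roughly equal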
And it cannot even be represented by a Dirichlet tree; a single tree is not sufficient. What you need is a mixture of trees -- therefore the name Dirichlet Forest. Okay. I'm going to show you with one example how things might look. It's a little bit involved, but here we go.

>>: Dirichlet Forest [inaudible].

> Jerry Zhu: It's a mixture of Dirichlet trees, that's it.

>>: Then you can get any events you like, not showing up.

> Jerry Zhu: Let's see --

>>: [inaudible].

> Jerry Zhu: I think it should be, yeah. I think it should be. Yeah. All right. So we have a vocabulary of A to G. And here are the links that we want to satisfy: a must-link between A and B, and a bunch of cannot-links. Now, the thing is -- well, okay, so here's how we do it. Write down all the nodes; A/B are must-linked, so I'm going to treat them as a single node there, glued together by the must-link. I'm going to have all the red edges with a cross representing cannot-links. So that comes from my knowledge. This is the graph we start with. So we have multiple steps. Step one, let's identify connected components. The reason we want to do that is because each connected component, like this box, can be modeled independently of the other connected components. So we will then focus on one of these guys. Then inside each box we're going to flip the edges. So instead of cannot-links, you turn on the opposite, the complement edges; those are the 'can' links. That means these words can appear with high probability together. You identify the maximal cliques of those within each component. So A, B, C is a big clique, and D is another clique there. The interpretation, again, is that each maximal clique, like A, B, C, represents a maximal collection of words that can co-occur with high probability. You cannot add anything else from the same connected component, otherwise you're going to violate one of the cannot-links. So that's that. Now, with that, we are going to create, or sample, a Dirichlet tree. And the way we sample this tree is by constructing its subtrees in the following manner. We're going to, intuitively, say that for this box, I'm going to either select this clique or that clique, allowing the selected one to share high probability and denying probability mass to everything else. So the way I'm going to do that is to choose between -- for this particular one, I have two maximal cliques, so I'm going to have two candidate subtrees. And in these trees, if you don't see a number on an edge, imagine its weight is one, and eta is something big. The reason we have this design is because, recall, when we sample from this Dirichlet tree, when we are here, if you see eta, which is big, and 1 there, you're going to very likely send almost all of the probability mass down the eta branch, and very little mass to the other guy, D. So if you choose the left subtree, you're saying, essentially, that you want all the probability mass to go to A, B, C, but not to D; and the second one, vice versa. But you have to make that choice at this moment. So you sample the tree like that. Let's say we flip a coin which chooses the first tree. Once we are there, we need to further distribute the probability among A, B, and C, and here we're just going to give each of these guys a one -- so uniform, not uniform, but a Dirichlet with all 1s, so it can generate any distribution over the simplex. And finally -- oh, yeah, so at this moment we're safe, in that we will satisfy all cannot-links within the first box.
Because it didn't give D any probability. And finally, because there's a must-link between A and B, we're going to again do this trick and let them evenly split the probability mass that reaches that A/B node. So this is how you do that branch. You can do the same thing for the second one. Here are the choices: you either give all the probability mass to E or to F, and so on. So let's say we select the second one; the effect is that you give probability mass to F and not E for this tree. So you do this, and you will have sampled a Dirichlet tree. And then you use this Dirichlet tree to generate a multinomial that will satisfy your must-links and cannot-links. Many of you will have noticed that the procedure is combinatorial. So in theory things can get really bad and you can get an exponential number of candidate trees, tons of them; but in practice we see very small numbers, so there's a gap between theory and practice.

>>: What kind of data?

> Jerry Zhu: Smallish data, like the data I showed before -- the wishes, and the split and the merge and so on. So all those operations.

>>: But are you individually then going through the components until [inaudible] is close enough, or --

> Jerry Zhu: No, no. So you do this: you do Gibbs sampling, but now you sample over the Z assignments, the topic assignments, as well as the tree index. So you're sort of massaging the trees for each topic; you're selecting which subtree to use along the way, and you can adjust that. So that becomes part of the variables you are sampling over in the Gibbs sampler.

>>: I see, so the actual forest that you have is changing?

> Jerry Zhu: Yes, exactly. And that's part of the sampling. And just to show you, this is how we sample the forest -- again, a lot of bookkeeping, but not terrible. Okay. All right. So in the last ten minutes, let me quickly go through a third model, which I promise you is not as hairy as this one. Okay. All right. The motivation is, well, the methods I've talked about are okay; they can encode certain knowledge, but they're not general enough. I want to have a really general way to specify domain knowledge -- or not me, but domain experts. And the solution is, let's go to first-order logic. A domain expert can usually specify his knowledge in first-order logic form. So the advantage is that it's easy for an expert to just write down a knowledge base in logic and say, okay, here are the things I want to satisfy in terms of the topics. It's very general; in fact, it can encode several existing variants that have been proposed in the literature. Critically, we want efficient inference. That's the difficult part: whenever you involve logic, inference can get very hard, but we will see an efficient way to do it. So for those of you who are familiar with Markov logic networks, the relation between our Fold.all and that is the following. You can view it as a hybrid Markov logic network, which is a Markov random field constructed from logic, plus -- not plus, but multiplied by -- a likelihood term which involves continuous random variables, and those would be our phi and theta. So it's an instance of that, but we have our own specialized optimization scheme, so that it's efficient. Okay. So here's how you think of the problem, how you encode domain knowledge in logic. So let's think in terms of logic. You have predicates, which are Boolean functions. For example, you can define a Boolean function big Z. It takes two arguments, i and t, where i is a word position and t is a topic index.
So we can say something like Z(i,t) is true if word position i takes the value t -- or, in English, the ith word comes from topic t. So that's the key predicate, and we do not know it; we will need to infer it. Besides that, you can define -- and this is the power of this approach -- arbitrary observed predicates. Some of them are like the one before, where you have W(i,v): the ith word takes value v, that is, the ith word is this particular word v. But then you can put things like document boundaries in there. That's D(i,j): the ith word position comes from document j. Or HasLabel(j,l): document j has label l. Or even sentence-level information: you can say the ith word comes from the kth sentence, and so on and so forth. So you can put in any observed things that you have in this form, and we will use them. Now, the key is that you're going to use these to write rules in logic, the psis. You can write a big number, L, of rules; each one is associated with a positive weight lambda. That's the standard Markov logic network setting. For example, I have two rules here. The first one is weak, weight one. And this rule says: for all positions i, if this is true -- meaning if the ith word is embryo -- that implies its topic is three. So basically this says, whenever you see the word embryo, put it in topic three. But it's a somewhat weak rule, and it can be overridden. Think of these not as hard constraints but as preferences. The second one has weight 100, so that's a very strong rule; you cannot violate it. And it has a whole bunch of universal quantifications: for all positions i and j, for all topics t, if the ith word is movie and the jth word is film, let's not put these two words in the same topic, so they cannot both be in topic t. This is exactly the cannot-link, but expressed in logic form. So we put a cannot-link between the words movie and film: these two words need to have different topics; they need to go to different topics. Yes?

>>: The positive weights that you have there -- for it to be 100, sorry -- is that given by the domain expert, or is that learned?

> Jerry Zhu: Yes, right now it's given by the domain expert. We didn't look into the learning issue there.

>>: What range of numbers do you allow, and do they end up being kind of on the extremes? Do you see any --

> Jerry Zhu: So this is a fairly empirical question, and the answer is we tried a whole bunch of numbers. And you clearly see that once you turn up the numbers, that rule gets enforced. But setting them appropriately is an art.

>>: So that's the thing. I mean, I've worked myself on systems where rules have weights associated with them. And I had weights one to ten. And the majority of the rules had either nine or one, and there was some squishiness in the middle, a 3 or a 7. But you don't get an even distribution of values.

> Jerry Zhu: Here we don't even estimate the weights; we don't have a distribution, it's all user-provided. What we can do is assign meaning to the lambdas if we normalize the terms appropriately, so they are more interpretable. You will see how they come into play really soon. Okay. So if you're familiar with logic -- this is first-order logic, you have the universal quantifiers, 'for all'. And one simple way of doing inference here is to do something called propositionalization, which turns first-order logic into Boolean logic. That simply means that for all combinations of the free variables in your rule, you just plug in the particular values.
So you do this exponential explosion and generate what's called a grounding -- small g -- from a particular rule. This big G is the set of groundings, and small g is a particular grounding. So, for example, this is a particular grounding where the word positions we look at are 1, 2, 3 and 4, 5, 6 for film, and the topic in question is nine, and so on. So you have to do, in this example, N squared times big T groundings, which is a huge number, but in theory you can do that. So you turn everything into Boolean logic. So this set big G -- just keep in mind that it could be a huge set. Once you do that -- yeah, that's combinatorial -- let's define this indicator function: 1 with a subscript for the particular grounding. It takes in the hidden topic assignment that you currently have -- let's say you have a candidate assignment -- and then you have the natural 1-0 definition. So here's how it is used. This is standard LDA, except that I draw it as an undirected graphical model, a factor graph. And these are the terms that you will recognize: that's the topic term, the document term, and so on. All we do here is add in this term, which corresponds exactly to the Markov logic network. And that corresponds to this factor node, which is a mega node: it takes in arbitrary observed variables. And it has this form. So you have the exponential here; you're summing over all the rules you have, and for each rule you're going to sum over its exponential number of groundings. And then the weight is here, and that's the Boolean function here, applied to the Z in question. So this objective function is really nothing but a straightforward combination of LDA and a Markov logic network. How much time do I -- can I go over just a little bit?

>>: Yes, sure.

> Jerry Zhu: So I'm going to be done very soon. So we are interested, again, in inference. So this is the task where you're given the corpus and the rules, and you want to infer Z. Now, what you can do is take the logarithm of the previous probability; that's what you get here. The black terms come from LDA, the red terms come from logic. Again, you have the summation; this is the part that will kill us. But that's the objective function. And keep in mind that this objective function is nonconvex. Okay. And you can optimize it in the simple way: you do alternating optimization, where you fix Z first and then maximize this. If you fix Z, this term disappears and you're doing standard LDA stuff, which is very simple; we immediately get this. But the other part is, you have to iterate: fix those phi and theta, and now optimize Z. Keep in mind Z is a discrete random variable, so you get yourself an integer program here. And it's huge. So this is the part where people typically get stuck; this is where it takes a lot of time to do inference. Okay. But we're going to do this procedure. We're going to iterate, and I'm going to explain how we do this part in a somewhat efficient fashion. Okay. So here's how we optimize Z -- we're talking about this step now. Number one, we're going to relax those indicator functions into a smooth function. It's continuous, so we can take derivatives. The way we do that is the following. Let's take a look at that particular grounding g, which says Z(i,1) -- the ith word takes topic one -- or not Z(j,2). And let's say we only have three topics, big T is three. So here's how you do the relaxation.
Turn that indicator function into a continuous function. First, you take the complement, the logical complement, so you get: not Z(i,1) and Z(j,2). That's fine. Then you remove any negations. So for this term, not Z(i,1): recall that Z(i,1) means the ith word takes topic one. The negation of that -- since we only have three different topics -- must mean that it takes topic two or topic three. So that's how you get rid of the negation. Then we're going to turn 'or' into plus and 'and' into multiplication. But these variables are 0-1 integer variables, so so far things are equivalent. Then you do 1 minus this; that's the negation negated back. Still a 0-1 variable, but then you trivially allow it to be in the interval 0-1. Now you have a continuous function -- it is still nonconvex, though. And, yeah, if you're curious, the indicator function becomes something that looks like this. It involves a whole bunch of variables that are between 0 and 1. Okay. So remember, we're trying to optimize the big Q function; now we're optimizing over Z. So let's take the terms that are relevant to Z and ignore all constant terms. You get something like this. This part comes from the logic; this part comes from LDA, the part that involves Z. You are doing an optimization problem, and this one is nice, because Z now is between 0 and 1 for each element, and all you have to do is constrain it to be nonnegative and sum to 1. So that's nice. But we still have that exponential number of groundings here, so this term is huge. And you don't want to take a normal gradient step on that; it's just way too expensive. So the trick here is, let's do stochastic optimization: treat this as the objective we're trying to optimize, but the actual thing you do is randomly pick one term out of it. Call that term F. It could come from here or it could come from here. It's going to be a term that involves some Z's, but a very small number of them. So you just take one term out at random and treat that as if it were the objective, because if you take the expectation, it gives you back the full thing. So you're going to take one random term out, do a small gradient step, then take another term out and do it again. So you repeat. You do the stochastic optimization in this fashion -- take a term out -- and you use this kind of exponentiated gradient update, which maintains the constraint. So you just do that. And it turns out it works quite well for our problems; we were able to run fairly large logic knowledge bases in a reasonable amount of time. And, by the way, you can stop this anytime you want; as long as you have time, you can keep working on it. Recall that this is the particular rule we said is a cannot-link. This is Pang and Lee's movie review sentiment analysis corpus, a bunch of movie reviews submitted by people. And we run LDA on that, but we add this one rule, which says, let's put a cannot-link between the word movie and the word film. My student David did that, because he suspected that people are using these two words differently. And, yes, as you can see, these are the two resulting -- there are more topics, but we're taking out the two topics where film and movie are the most frequent. What do you notice?

>>: [inaudible].

> Jerry Zhu: Yeah, for some reason film is the more elegant word. You associate good words, great, with film, but here you have bad movie and -- yeah. So it's kind of strange that you see this. Yes?

>>: I have a question about the semantics of this.
You have a -- this example is kind of telling, in that what you ended up encoding here, even though you're using the fuller logical form, is just the cannot-link constraint that you had earlier. I wonder, if you told the domain experts, hey, sorry, you don't get full logic, you just get [inaudible], how much does that really restrict them in terms of the --

> Jerry Zhu: That's a great question. I think an equivalent way to understand this is to ask, what is the power of just [inaudible] links? Right? You can create some high level operations like split, merge and so on. Maybe it's rich enough for normal usage.

>>: It's pretty rich.

> Jerry Zhu: It's pretty rich. But on the other hand, the logic has the nice property that you can claim you can pretty much encode anything, as long as you can write it in logic form. So you have that sort of peace of mind there: if you want, you can always write complicated [inaudible].

>>: That's true.

> Jerry Zhu: I wonder if real domain experts [inaudible]. Sounds like a great project.

>>: The one example that immediately comes to mind is contextual constraints, right; in this context, that seems like the thing we would want to apply, it disambiguates. Gave a nice example earlier with one of his project examples; that seems like maybe you want first-order logic, but maybe [inaudible].

>>: Yeah.

> Jerry Zhu: All right. So to evaluate -- we want to evaluate two things. One is whether the logic that we specified as extra knowledge can generalize to test data -- like, the topics you get, can they generalize. Second, we want to see how good the inference is, like how scalable it is. So here's what we did. We have a bunch of datasets; I don't have time to go through the details, but they're in the paper. We're going to do cross-validation, but whenever we do training, we'll run Fold.all plus all the logic rules that the domain expert gives us. From that we will be able to get the topic dice, the topics, the phis. Then on the test fold, what I'm going to do is use those topics to directly estimate the word topic indices -- importantly, without using any logic. I'm just going to use the phis there to get the Z's. But then, once we get the test set Z's, I can go back and essentially see how much of it fits the logic that the expert gave us in the beginning. Actually, we did something slightly different: we looked at the test objective Q, which balances both the desire to satisfy the logic and the desire to satisfy the LDA model. So in general, in this table, that's the test set Q; you want this number to be big, the bigger the better. And you can see this is pretty good on the test set. If you simply run LDA without logic, you get pretty poor results, except for this one. Alchemy is where you only enforce the logic, without the LDA part, of course, and it's not as good. But, again, another thing to notice is that these dashes are where things did not complete within a reasonable amount of time. So this shows that our method with stochastic gradient is fast enough to work on reasonably large logic knowledge bases, especially this one. Yes?

>>: So in the logic part, you actually have two steps. You're basically doing approximations -- I mean, the first one is the relaxation. The second one is actually [inaudible].

> Jerry Zhu: Yes.
>>: So I'm just wondering where they're exactly -- the exact inference, where it gives you [inaudible] -- I mean, probably not on the logic -- but one of the [inaudible] put in the next slide, which probably -- yeah, so that --

> Jerry Zhu: So let's say we have a column here which says, let's do exact inference if we can -- that's a smaller constraint base -- how well would we do? Indeed, for the ones that we can complete, the number is slightly bigger than this one, but not too much bigger. Slightly bigger. It's in the paper; it's not presented here. But the problem with that is that you will have a lot of dashes here, because of this. So here's the summary. I think it's very interesting to inject domain knowledge into an unsupervised learning method like LDA, because I think the masses really want that. We presented three models; they have increasingly general power. And there are two keys here. One is you really want this to be easy to use, for domain experts to create their knowledge, or to inject their knowledge into the model. Second, you really want to make sure it runs fast, so it's scalable. That's it. Thank you. [applause]