
>> Matt Richardson: Okay. So it's my pleasure to introduce Parag Singla. He's
here to talk about Markov logic. He actually interned for me a few years ago, in
the summer of 2006 -- '5, I don't remember. '6. He worked on some really
interesting work dealing with messenger data and search, doing a bunch of
data mining and everything. It would have been simpler if Cosmos had existed.
He's come back to talk about what he did in his thesis work under Pedro
Domingos at the University of Washington. Thanks, Parag.
>> Parag Singla: Thanks, Matt. So here's the title of my talk: Markov logic --
theory, algorithms and applications. Most of it is joint work with Pedro Domingos,
my advisor at the university.
There is some work that I did in collaboration with researchers in Rochester; I'll
mention the collaborators when that part comes up. So here is a brief outline
of my talk. I'm going to give some motivation for the work that I've done, give
some necessary background on Markov networks and a little bit on first order
logic, and then I'll describe Markov logic, which I will build on.
Then I'll describe inference algorithms that were developed for this language,
explain a couple of applications and then conclude with future work.
So coming to motivation: it has been seen in many areas of science that there is
this idea of applications and infrastructure, with a middle interface layer that
separates applications from infrastructure.
And it has been observed that whenever we have this interface layer separating
applications and infrastructure, progress really happens fast. What do I mean
by that?
Let's take networking. On the application side you've got the web, e-mail and all
the applications you can think of; there's YouTube and other stuff.
Infrastructure essentially consists of protocols, routers, all those things. The
interface layer is the Internet. How does the Internet help? Applications can be
developed independent of what's going on in the infrastructure layer. You can
develop applications just knowing about the Internet and optimize those; and,
similarly, the infrastructure layer can work independently. People can
optimize the routers and the protocols, and as long as those two interact well
with the Internet, everything is fine.
Note that instead of making N squared connections between the applications
and the infrastructure, we have only order of N connections. So this really helps
speed up progress. Similarly, in databases we have applications like
enterprise resource planning, online transaction processing systems, CRMs, and
the infrastructure is query optimization, transaction management.
As you can see, both sides can develop independently. Applications can go on,
infrastructure can be optimized, and the interface layer, as we know, is the
relational model. Once we have the schema in mind, everything works well.
And, again, the progress can go in both directions independently. So what is the
interface layer for AI? Applications, as we know, there are [inaudible], NLP,
planning, multi-agent systems and many, many more. Infrastructure is essentially
representation: how you do the learning, inference and the other pieces.
What is the interface layer that we're looking for in the case of AI? So for quite
some time people thought that first order logic could be the language of choice,
because it has the power to represent objects and the relations among them. It
can handle very complex world scenarios.
But the problem with first order logic, as we'll see, is that it's [inaudible]. It doesn't
have the power to represent uncertainty, which is inherent in the world.
We would certainly like to have that. So that brings us to graphical models --
something like Bayesian networks or Markov networks -- which have the
capability to handle probability explicitly. But then the problem with them is that
they do not really handle objects and relations.
So you lack the capability of having that complex structure in the language. How
do you get that? So that suggests: what if you combine these two
approaches, the statistical and the logical? Statistical, as I said, being the
graphical model kind of approach, and the logical being something like first order
logic. If you can combine those, maybe you could have something which could
be a potential interface layer, really speeding up the progress.
So there has been this whole area of statistical relational learning which has
come up. This is sort of the background: many languages
have been proposed, and today I'm going to talk about Markov logic, and
in some sense try to argue that it really gives you the representational
power to handle both uncertainty and complex structure, and also has
the engine where you can do fast inference and learning.
So Markov logic, as I mentioned, is not the first of these approaches. The history
goes back to 1986, with Nilsson's probabilistic logic, and many more models
since. And these can be classified based on what representation they use for
uncertainty and what kind of logical model they use.
I'm not really going to talk about those, but will focus primarily on Markov logic,
which is the one highlighted here; it was introduced by Richardson and Domingos
back in 2006.
So briefly, in one slide: for Markov logic, the syntax is essentially weighted first
order formulas, very simple. If you know first order logic, you just write first order
formulas. That's the syntax.
The semantics, as I'll explain a little later, can be seen as constructing templates
for the underlying Markov networks. In terms of inference, there has been a lot of
work that I have done and my colleagues have done -- [inaudible], MCMC,
belief propagation, lifted belief propagation -- and I'm going to talk about some of
this. For learning, you could use something like L-BFGS, or the second order
methods which have been developed.
Applications, there are many, and I'm going to talk about two of them. On this
slide, the red items highlight the algorithms that I've worked on.
There are a couple of others I've done work on, but I'll not be talking about those.
The red ones are essentially those that I'm going to talk about, in addition to, of
course, explaining the semantics and syntax of Markov logic.
So this gives the basic motivation for the work that I've done. Now a brief
background on Markov networks and first order logic. Markov networks, as some
of you may be familiar, are essentially undirected graphical models. So these are
nodes here, with edges between them, and the nodes represent, in this case you
could say, some kind of binary predicate: for example, whether someone smokes
or doesn't smoke, whether someone has cancer or doesn't have cancer, asthma,
cough. An edge essentially represents that there's a direct influence of one node
on another.
For example, if you smoke, then you're likely to have cancer. If you have cancer,
then you're likely to have a cough. Similarly for the others.
And the way we define the distribution over this network, or over these nodes, is
by having potential functions which are defined over the cliques in the graph.
So in this case you can see there are several cliques: there's a clique between
asthma and cancer, between asthma and cough, and also a clique between
cancer and smoking.
An example of a potential function is that, for all possible states of
smoking and cancer, you have this real valued function. Actually, it needs to be
positive. So you have all these values. And the probability distribution is
defined simply as a product of all these potential functions defined over the
cliques in the graph, divided by the normalization constant. Intuitively, these
numbers say which state of the world is more likely. For example, in this case
you can see that only the combination true/false has a low value, which means
that that state of the world is less likely as compared to the other states.
Equivalently, this model can be represented as a log linear model. So it's an
exponential of linear feature functions. Again, the feature functions are defined
over cliques in the graph, just as before. But note that this can be much more
compact than the potential representation.
For example, in this case you may say that this feature is on, or one, when
smoking implies cancer -- this small formula defined over these two node
variables -- is true; otherwise it's zero.
Instead of having four possible states, you can represent it much more compactly
in this log linear fashion. And in this case W can be any real valued weight.
I'll be using this form throughout the talk for defining the probability
distribution. And Z is the normalization constant as before.
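For reference, the two equivalent forms being described -- the product-of-potentials
form and the log linear form -- can be written as follows (a reconstruction from the
verbal description, not the exact slide):

    P(X = x) = \frac{1}{Z} \prod_{c} \phi_c(x_c)
             = \frac{1}{Z} \exp\Big(\sum_i w_i f_i(x)\Big),
    \qquad Z = \sum_{x'} \prod_{c} \phi_c(x'_c)

Here \phi_c is the potential over clique c, f_i is a binary feature (e.g., 1 when
"Smoking implies Cancer" holds) and w_i is its weight, with \phi_c = e^{w_i f_i}
relating the two forms.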
So that is Markov networks. Coming from the logical side of things, first order
logic essentially consists of constants, variables, functions and
predicates, which represent your underlying world. Constants could be something
like the people in your domain: Anna, Bob. Variables are x, y, z, which range over
the constants in your domain. You could have functions like MotherOf(x), so that
for an x, whoever that person's mother is can be represented by this function.
And Friends(x, y) is a predicate, which could be
true if x and y are friends with each other, false otherwise.
Grounding is an important construct in first order logic. It corresponds to
replacing the variables in a predicate, or any other construct, by the
corresponding constants. For example, for Friends(x, y) you could have
Friends(Anna, Bob), and similarly for the other constants. A formula essentially
combines predicates. For example, Smokes(x) implies Cancer(x), which is false
when Smokes(x) is true and Cancer(x) is false, and true otherwise.
A knowledge base is essentially a set of formulas, and it's a standard theorem
of first order logic that it can be equivalently converted into clausal form. A
knowledge base, along with an interpretation, which assigns truth values to all
the ground predicates -- that's the semantics of first order logic.
I'll be using this. Now, given this background, let's try to understand what
Markov logic does, given Markov networks and first order logic. The problem with
first order logic that I alluded to earlier is that a logical knowledge base is
essentially a set of hard constraints. You really need the world to satisfy all the
formulas. So even if one formula is false in your domain, the whole thing
crumbles down. It's brittle. For example, if you have "for all x, Smokes(x) implies
Cancer(x)", and one person smokes and doesn't have cancer, the whole thing
crumbles down.
What if we could make them soft? That is, when the world violates a formula, it
becomes less probable but not impossible. So that is the idea in Markov logic.
And how do we do that? Essentially you attach a real valued weight to each
formula, and the weight tells you how important that constraint is.
The higher the weight, the stronger that constraint is, and the more likely you are
to satisfy it. In particular, the probability of a world is now proportional to the
exponential of the weights of the formulas it satisfies. So more formally, Markov
logic is defined by a set of pairs (F, W), where F is a formula in first order
logic and W is a real number. And together with a finite set of constants it
defines a Markov network where there is a ground node for each grounding of a
predicate and there is a feature for each ground formula.
And I'll explain this with the help of an example, and W is the corresponding
weight of the feature. So here is an example. Let's take these formulas -- a very
simple domain, and I'll be using it throughout the talk. The first formula says
that for all people, Smokes(x) implies Cancer(x): if they smoke, they're likely to
have cancer. The second formula says that if x and y are friends and x
smokes, then y also smokes -- that is, friends have similar smoking habits.
These are very useful rules for modeling the real world, because most people
who smoke are more likely to have cancer compared to a non-smoker.
Similarly, it has been observed in the social sciences that friends tend to have
similar smoking habits.
These are very good rules of thumb, but they may not always be true.
So let's convert them to Markov logic. We give them weights, which could be
learned from training data; these are some arbitrary weights in this case. What is
the ground Markov network? First, let's say we have two constants, Anna and
Bob. We create the ground nodes, substituting Anna and Bob into the
corresponding predicates -- Smokes, Cancer and Friends: Smokes(Anna),
Smokes(Bob), Cancer(Anna), Cancer(Bob), and similarly the groundings of the
Friends predicate. And note that we need both Friends(Anna, Bob) and
Friends(Bob, Anna), because they need not mean the same thing. These are
ground nodes corresponding to ground predicates.
Then I connect those nodes which appear together in any ground formula or
ground clause. So Smokes(Anna) is connected to Cancer(Anna), and Smokes(Bob)
is connected to Cancer(Bob). Similarly, corresponding to the second
formula, I also create all the cliques. So this is my ground Markov network. Now
you're in the domain of Markov networks and you can do inference and learning
on this. That's the basic semantics. And of course I'm going to talk about how to
make it efficient given this representation.
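To make the grounding concrete, here is a minimal Python sketch of it
(illustrative names and weights following this example; this is not the Alchemy
implementation):

    from itertools import product

    # A minimal sketch of grounding an MLN over a finite set of constants.
    constants = ["Anna", "Bob"]

    # Ground atoms: one node per grounding of each predicate.
    atoms = [("Smokes", x) for x in constants] + \
            [("Cancer", x) for x in constants] + \
            [("Friends", x, y) for x, y in product(constants, constants)]

    # Weighted ground formulas: one feature per grounding of each first
    # order formula, evaluated on a world w (dict: atom -> bool).
    ground_formulas = []
    for x in constants:  # 1.5  Smokes(x) => Cancer(x)
        ground_formulas.append(
            (1.5, lambda w, x=x: (not w[("Smokes", x)]) or w[("Cancer", x)]))
    for x, y in product(constants, constants):
        # 1.1  Friends(x,y) ^ Smokes(x) => Smokes(y)
        ground_formulas.append(
            (1.1, lambda w, x=x, y=y:
                not (w[("Friends", x, y)] and w[("Smokes", x)])
                or w[("Smokes", y)]))

    print(len(atoms), "ground atoms,", len(ground_formulas), "ground formulas")

With two constants this gives 8 ground atoms and 6 ground formulas; the edges
of the ground network connect exactly the atoms that co-occur in some ground
formula.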
So any questions at this point before I go further? As I mentioned, an MLN can
be seen as a template for constructing ground Markov networks, and the
probability of a particular state of the world is given by 1/Z times the exponential
of the sum of w_k f_k(x), Z being a normalization constant. This is basically the
form of Markov networks: the summation is over all the ground formulas in the
theory, f_k is the feature and w_k is the weight of the formula from which that
feature came.
Equivalently, you can write it in a second form, because we have binary features:
whenever a ground formula is satisfied, the feature is on, otherwise it's off. So it
can equivalently be written as a summation over the first order MLN formulas of
w_i times n_i(x), where n_i(x) is the number of groundings of formula i which are
satisfied -- because for as many groundings of that first order formula as are
satisfied, that many times the feature will be on.
The second equation is what I'll be using throughout this talk; the
distribution defined by Markov logic is defined by this equation.
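Written out, the two equivalent forms just described are (reconstructed in
standard notation):

    P(X = x) = \frac{1}{Z} \exp\Big(\sum_k w_k f_k(x)\Big)
             = \frac{1}{Z} \exp\Big(\sum_i w_i n_i(x)\Big)

where the first sum ranges over ground formulas (binary features f_k), the
second over first order formulas, and n_i(x) is the number of true groundings of
formula i in world x.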
Now, briefly, what is the relation of Markov logic to various statistical models? I
started by saying that it combines the power of various standard probabilistic
models -- statistical models -- with first order logic. So what's the connection?
We can show that all these things on the left side -- Markov networks, Markov
random fields, hidden Markov models, conditional random fields and many
others -- can be represented as special cases of Markov logic. In particular, you
make all the predicates zero-arity, because you really don't need the variables,
and you can represent all these models.
What is the connection to first order logic? It can be shown that in the limit of
infinite weights, when all your weights tend to infinity, the distribution is
essentially the one represented by first order logic, which is very nice: in the
limit it goes to first order logic. But not only that: in the case when
your knowledge base is satisfiable but the weights are not infinite, the satisfying
assignments are essentially the modes of the distribution. This again makes
sense: the states which are most likely are the satisfying assignments
of your underlying theory, which is very intuitive.
And in particular, note that the difference between Markov logic and first order
logic is that Markov logic allows contradictions between formulas and still gives
you reasonable probabilities over the underlying worlds.
One thing I'll mention without going into detail: I did not really talk about how you
represent infinite domains, which is essentially one of the key things you can do
with first order logic.
We have a paper where we extend Markov logic semantics to infinite
domains. It turns out that it's not that straightforward, but you can do it, and we
borrow a lot of ideas from the physics literature. In particular we use the theory
of Gibbs measures. And you can show that as long as each node in your
underlying network has a finite number of neighbors, you can have a valid
distribution which can be represented as an infinite collection of finite
distributions.
I'm just not going to talk in detail about this, but look at the paper if
needed. So now, having described the representation language, I'm going
to describe how you do inference, in particular efficient inference, in this kind
of model.
So inference essentially corresponds to the problem of finding the probability of
query atoms given some evidence: the probability of Y given X, where Y is the
query atoms and X is the evidence, something that you know at the time of
inference. Substituting back into the formula of Markov logic, you get a
probability of Y given X which is 1 over Z_x -- the normalization constant now
depends on X because X is fixed -- times the exponential of the sum of
w_i n_i(x, y), with i varying over all the first order logic formulas in the
theory.
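In standard notation, the formula being described is (a reconstruction):

    P(Y = y \mid X = x) = \frac{1}{Z_x} \exp\Big(\sum_i w_i\, n_i(x, y)\Big),
    \qquad Z_x = \sum_{y'} \exp\Big(\sum_i w_i\, n_i(x, y')\Big)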
The problem with that is that you have to compute this normalization constant,
which takes exponential time. So you cannot really do it exactly; you resort to
approximate methods. And there are two different families you can use. One is
MCMC, that is, Markov chain Monte Carlo: you construct a Markov chain and
sample from the distribution. The other, which has become more popular in the
last few years, is belief propagation.
Here the idea is that you form a bipartite graph of nodes and factors -- variables
and features -- and then you pass messages from nodes to features and vice
versa, and you repeat until convergence. I'm going to focus on the second one
and show how you could use it to do inference in Markov logic. There are
additional approaches which, again, I'm not going to talk about in this talk.
Okay. So belief propagation. The idea is that you pass messages
from nodes to features and back, and what the messages represent are
essentially the current approximations to the node marginals. You initialize
each message to one and carry this back and forth. So I'm going to show
this with an example. Here's the example. On the left side you have all the
nodes and on the right side you have the features. You can see that, right?
So an example of a node, or a ground predicate, is Smokes(Anna). This is again
the example I've been talking about, the friends and smokers domain. A feature
could be: Smokes(Anna) and Friends(Anna, Bob) implies Smokes(Bob). The nodes
on the left side represent the groundings of the predicates, and those on the
right side the groundings of the features. And there is an edge between a node
on the left and a node on the right if the predicate appears in that feature.
For example, Smokes(Anna) would be connected to the feature Smokes(Anna) and
Friends(Anna, Bob) implies Smokes(Bob). And then you pass messages
along the edges as you go along. What are these messages?
So this equation gives the message passed from nodes to features. There's a
lot of notation here, but the message is very simple. What it is saying is
that a node takes all the messages that it received from the features in the
previous time step, except for the feature being considered -- in this case f --
multiplies them together, and sends the product to that feature.
Intuitively, what it's saying is: what is the current belief that the node has
about the probabilities of being in its various states? Similarly,
the message from features to nodes is slightly more complex, but the form is
similar. If you look inside the summation, the inside is similar: you multiply
together all the messages that this feature node received from the nodes in the
previous time step, except for the node being considered. Then you multiply
by the potential of this feature -- the exponential of the weight times the
feature value -- sum out everything except for the node being considered,
and pass it back.
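In standard factor graph notation, the two messages being described are (a
reconstruction, with x a variable node, f a feature node, and nb(.) the neighbors):

    \mu_{x \to f}(x) = \prod_{h \in nb(x) \setminus \{f\}} \mu_{h \to x}(x)

    \mu_{f \to x}(x) = \sum_{\sim \{x\}} e^{\,w_f f(\mathbf{x})}
                       \prod_{y \in nb(f) \setminus \{x\}} \mu_{y \to f}(y)

where \sum_{\sim\{x\}} sums over all variables in the feature except x, and
e^{w_f f(\mathbf{x})} is the Markov logic potential of the ground formula.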
So this is standard message passing in belief propagation, and it has been
shown that in tree-structured graphs it converges and gives you the exact
result. In loopy graphs it's not guaranteed to converge, but for many problems
it is a very good inference algorithm in practice that gives you good results
very, very fast.
So this is nice. We could use this by converting the Markov logic into the ground
Markov network, constructing this graph and passing messages.
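Here is a compact Python sketch of the ground BP just described, assuming
binary variables and the exp(w * f) potentials of Markov logic; the factor
representation is illustrative, not the Alchemy code:

    import itertools, math

    # Loopy BP on a factor graph. factors: list of (weight, variable tuple,
    # boolean feature function taking a dict of variable -> 0/1).
    def loopy_bp(variables, factors, iters=50):
        nb = {v: [i for i, (_, vs, _) in enumerate(factors) if v in vs]
              for v in variables}
        m_vf = {(v, i): [1.0, 1.0] for v in variables for i in nb[v]}
        m_fv = {(i, v): [1.0, 1.0] for v in variables for i in nb[v]}
        for _ in range(iters):
            for v in variables:                    # node-to-feature messages
                for i in nb[v]:
                    msg = [1.0, 1.0]
                    for j in nb[v]:
                        if j != i:
                            msg = [msg[s] * m_fv[(j, v)][s] for s in (0, 1)]
                    z = sum(msg)
                    m_vf[(v, i)] = [m / z for m in msg]
            for i, (w, vs, f) in enumerate(factors):   # feature-to-node
                for v in vs:
                    msg = [0.0, 0.0]
                    for assign in itertools.product((0, 1), repeat=len(vs)):
                        a = dict(zip(vs, assign))
                        p = math.exp(w * f(a))         # MLN potential
                        for u in vs:
                            if u != v:
                                p *= m_vf[(u, i)][a[u]]
                        msg[a[v]] += p
                    z = sum(msg)
                    m_fv[(i, v)] = [m / z for m in msg]
        marginals = {}   # marginal: normalized product of incoming messages
        for v in variables:
            b = [1.0, 1.0]
            for i in nb[v]:
                b = [b[s] * m_fv[(i, v)][s] for s in (0, 1)]
            z = sum(b)
            marginals[v] = [x / z for x in b]
        return marginals

For instance, loopy_bp(["S", "C"], [(1.5, ("S", "C"),
lambda a: (not a["S"]) or a["C"])]) returns approximate marginals under a
single Smokes-implies-Cancer soft rule.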
>>: Do we have any bounds for stopping?
>> Parag Singla: Right. That's a good question. I think there are bounds, but
many times what people do is just run it for a certain number of iterations and
stop.
So this is nice. You could use it, but then there is a problem. The problem is that
in the kind of domains that we're going to work on, and I'll show some examples,
there could easily be billions of features. Consider simple theories --
Smokes(x) implies Cancer(x); Friends(x, y) and Smokes(x) implies Smokes(y).
Even if you have, say, a thousand people in your domain, already you have a
thousand times a thousand groundings of the friends formula, and this grows
exponentially with the number of variables in your formulas. That means too
many messages: the network size is too big and you have to pass too many
messages. This could really take a lot of memory and be really too slow.
So what's the solution? The idea that we propose is that, instead of passing
that many messages, one for each ground node, what if you could cluster
together those nodes which pass the same message in the ground version? If
you could somehow identify the nodes which would have passed the same
messages during BP, then you could pass only one message for the whole
cluster, and that could really reduce the number of messages which get passed
and make your algorithm much faster. It also reduces the network size.
I'll try to demonstrate this on the example that I've been looking at. So this is the
original ground BP: you have nodes and features and you're passing messages.
And let's say somehow you identify -- the boxes here demonstrate that -- all
the nodes which would have passed exactly the same message in each iteration
of the belief propagation algorithm. Let's assume for now that we have somehow
identified those; I'm going to tell you in more detail how we do that. Once you
know that that is the case, then, instead of having all these edges between the
nodes on the left and the right, you could have essentially one edge between
each pair of boxes. So as you can see, the number of messages can be reduced
by a big amount: in this case you have only three messages going back and
forth between the two sides, and it will give you exactly the same result.
And the form of the messages is exactly the same, except for two constants,
which are a function of the number of edges which went through those boxes.
Note that we replaced many edges by one edge, so we have to take into account
how many nodes are clustered in each box, and these alpha and beta constants
essentially depend on that. Other than that, it's exactly the same algorithm;
inference proceeds in the same way. It gives the same result, will be much,
much faster and can save a lot of memory. Now I'm going to describe in more
detail how to actually find these boxes and what these messages are.
So the basic idea in lifted belief propagation -- you can see it's two steps. First is
network construction: that is, construct these boxes, which we call supernodes
and superfeatures. The names should be intuitive. A supernode, formally, is a set
of ground atoms that all send and receive the same messages throughout the
ground version of the belief propagation algorithm. A superfeature is defined
similarly: all ground clauses or formulas which send and receive the same
messages throughout BP. Then you construct this network and run modified
belief propagation, with those alpha and beta constants, on this network; and, as
I said, it gives the same results as ground BP, and the memory and time savings
can be huge.
So how do we construct this network? It's a simple, iterative process. We start
with an initial guess for the supernodes. Given your domain theory and given
some constants and evidence, what is the initial guess for the supernodes? The
basic guess is that you cluster all true predicates in one box, all false predicates
in one box and all unknown predicates in one box. That is the first guess, and
you refine them as you go along.
Given these predicate supernodes, you can join them together to get the next
level of superfeatures. Given that, you can project them back onto the
supernodes to get the refined supernodes; that is, you project the superfeatures
down to the ground predicates, and all those predicates which appear in the
same number of superfeatures will now be clustered together. This is repeated
until convergence, and the algorithm is actually guaranteed to converge to the
optimal network. That is the algorithm, and I'll demonstrate it with the help of an
example, working with just one formula; you can extend it to as many formulas
as you want, but for this presentation, just one formula.
Let's say we have Smokes(x) and Friends(x, y) implies Smokes(y), the same
example. Let's say we have some evidence: Anna smokes, that is,
Smokes(Anna) is true. We know that Bob and Charles are friends and Charles
and Bob are friends. So this is our evidence.
Let's say we have N people in the domain, N being greater than three. So
intuitively it's very clear that the boxes the algorithm should give in this case are
three boxes: Smokes of Anna; Smokes of Bob and Charles; and Smokes of all
other people. Because Anna is different from all the others, since she smokes,
the probability for her should be different from the others'. Similarly, Bob and
Charles are different because we have some extra piece of information about
them. And all other people should come in one box. So this should be my
clustering of the supernodes. That's the idea.
So let's see how the algorithm discovers that. We now have supernodes on the
left and superfeatures on the right, and I'm going to show how we define them.
For the initial set of supernodes, we simply create the supernodes for the true,
false and unknown cases. There are no false predicates -- no false evidence --
in this case. So we have Smokes(Anna), which we know is true, in green;
Smokes(x) for all people other than Anna, which is unknown; then
Friends(Bob, Charles) and Friends(Charles, Bob), which are true; and
Friends(x, y) for all other pairs of people. So these are my initial four
supernodes. As I said, you don't really need the box for the false case because
there is no false evidence. Given these initial supernodes, let's try to construct
the superfeatures for Smokes(x) and Friends(x, y) implies Smokes(y).
So take Smokes(Anna) -- the color coding tells you where the nodes are
coming from. Smokes(Anna) comes from the first supernode; then join it with
Friends(Anna, x) and Smokes(x). This is my first superfeature, coming from the
boxes on the left side, formed simply by doing the join like this.
And, similarly, you can construct the other superfeatures; the color coding again
shows where those supernodes came from. And the third one and the fourth
one. You can show that these are the only combinations possible in this case.
So you get these four superfeatures corresponding to the four supernodes on
the left, which are simply joined together. Now, having constructed the
superfeatures, let's see how I construct the new supernodes. Again, I use a
different color coding for each superfeature, and what you want to do is project
them onto each ground predicate -- that is, Smokes(Anna), Smokes(Bob) and so
on -- and there is this vector of four counts, one count per superfeature. We're
going to populate this with the number of ground features that project onto each
ground predicate -- populate it with projection counts.
So note that the first superfeature projects onto Smokes(Anna) N minus 1 times,
because x can take N minus 1 possible values -- in this case x cannot be Anna.
So I get N minus 1. You can show that the second one projects zero times,
because Smokes(x) cannot project onto Smokes(Anna), as x cannot be Anna
there. Similarly you get zero for the next one: Smokes(Bob) cannot project onto
Smokes(Anna). So you get this vector of counts. Similarly, we do it for all the
other ground predicates, and we cluster together those predicates which got the
same counts. So in this case I've already clustered Bob and Charles, but you can
verify that they would have got the same counts. Since they got the same counts,
they're indistinguishable at this step of the construction, and you combine them
together.
And all the other nodes will now combine into the final box, because they have
the same counts for all the superfeatures. Now you have the new supernodes;
join them together to get the new superfeatures, and so on. In this case this is
essentially the final step: if you do the join one more time you'll get the final
superfeatures. And, more importantly, you can see that you have discovered the
intuitive clustering of nodes: Smokes of Anna; Smokes of Bob and Charles; and
Smokes of all other people. That is what we were looking for. So that is
basically the lifted network construction algorithm.
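Here is a rough Python sketch of this construction loop -- essentially iterated
refinement of clusters by projection counts, hardcoded to the one-formula
example above; the representation is illustrative and far less optimized than the
real algorithm:

    from itertools import product

    # Lifted network construction sketch for:
    #   Smokes(x) ^ Friends(x,y) => Smokes(y)
    def lifted_clusters(people, evidence):
        atoms = [("Smokes", x) for x in people] + \
                [("Friends", x, y) for x, y in product(people, people)]
        # Initial supernodes: cluster atoms by evidence value (True/unknown).
        sig = {a: (evidence.get(a),) for a in atoms}
        while True:
            # Superfeatures: groundings grouped by their atoms' signatures;
            # project back by counting, per atom and argument position.
            counts = {a: {} for a in atoms}
            for x, y in product(people, people):
                ground = (("Smokes", x), ("Friends", x, y), ("Smokes", y))
                f_sig = tuple(sig[a] for a in ground)
                for pos, a in enumerate(ground):
                    key = (pos, f_sig)
                    counts[a][key] = counts[a].get(key, 0) + 1
            # Refine: new signature = old signature + projection counts.
            new_sig = {a: (sig[a], tuple(sorted(counts[a].items(), key=repr)))
                       for a in atoms}
            if len(set(new_sig.values())) == len(set(sig.values())):
                break   # no supernode split any further: converged
            sig = new_sig
        clusters = {}
        for a in atoms:
            clusters.setdefault(sig[a], []).append(a)
        return list(clusters.values())

    people = ["Anna", "Bob", "Charles", "Dave", "Eve"]
    evidence = {("Smokes", "Anna"): True,
                ("Friends", "Bob", "Charles"): True,
                ("Friends", "Charles", "Bob"): True}
    for cluster in lifted_clusters(people, evidence):
        print(len(cluster), cluster[:2], "...")

Under this evidence, the sketch should recover, among the clusters it prints, the
three Smokes supernodes just described: Smokes(Anna) alone, Smokes of Bob
and Charles together, and Smokes of everyone else together.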
And we have a theorem. This appeared this year at AAAI, in this paper, and we
show that there always exists a unique minimal lifted network, that the lifted
network construction algorithm finds it, and that running BP on this network
gives you the same results as running BP on the ground network.
Now, experimental results. In the paper we actually have results on only three
domains, but I'm working on another paper, so I have results on many more
domains. There are six domains that I'm going to present results on.
The first one is entity resolution; I think many of us might be familiar with this,
and I'm going to talk a little bit more about it. Given a database of records, you
want to identify which of the references refer to the same underlying entity --
that's the problem of entity resolution. Link prediction: we have this dataset of
professors and students from the University of Washington, with information like
who is a professor and who is a student, and we want to find out who is advised
by whom.
Then there's a dataset of protein interactions from the biological domain, and
you want to find out which proteins interact with each other. Hyperlink analysis:
this is basically finding out which pages are linked to each other, given
information about their topics. Image denoising: this is an image domain. You
have a binary image, there is some text in the foreground and the background,
there is some random noise in the image, and you want to separate out the
noise.
And then finally, the friends and smokers domain that I showed. So here are the
results. I'm comparing the ground version and the lifted version, and there are
three timings: construction time, BP time and total. Construction time is the time
taken to construct the network, BP is how much time it takes to run the belief
propagation algorithm, and the total time is essentially the sum of these two.
So as you can see, the construction time in almost all cases -- except for maybe
a couple -- is more in the lifted case. And that makes sense, because you have
to construct all the supernodes and superfeatures and then combine them, so it
takes some time, whereas in the ground version you can just ground out the
network directly.
But note that the BP time is now much, much less in all the cases, because the
lifted network is much, much smaller than the ground network, so it can be
much, much faster.
>>: How big are the datasets?
>> Parag Singla: How big are these datasets? So [inaudible], for example, is a
thousand records, so a thousand by a thousand pairs. UW-CSE is also, I think --
Matt can correct me -- a few thousand ground atoms.
The others are also of similar size. Image is much bigger, because you have a
400 by 400 pixel image.
>>: [inaudible] does it vary very much in terms of [inaudible]?
>> Parag Singla: You mean like how many variables are in each predicate?
Mostly we have two. There are a few in UW-CSE which have three, like
publication and TA, which involve professor, student and course. But most have
two or one, just like the smokes example.
>>: What's the size of [inaudible] for each one?
>> Parag Singla: The number of rules? Up to about 200. In some cases it's
only two or three rules. For example, in the image domain we just have two
rules: one says that a pixel is likely to have the same value as its observed
value; the second rule says you're likely to have the same value as your
neighbors. UW-CSE has about 94 rules. So it varies across that range.
Questions?
>>: So the friends and smokers domain, how did you generate the network for
that?
>> Parag Singla: So I did not explain that. Essentially what I did was, I decided
the number of people I wanted to have, and then randomly chose whether they
smoke or don't smoke, and then also used a random distribution for friendships.
I forget the exact details, but the idea is that you have clusters of people, and for
each cluster you decide whether you want to have a friends relationship between
them or not. That was the idea.
>>: I have a more general question. I don't know if this is a good time or not. It
seems like the technique works best when the domain breaks into little pieces?
>> Parag Singla: Exactly.
>>: So to what extent do they need to break up? Would the ground Markov
network have to have completely separate, distinct clusters, or can it do better
than that?
>> Parag Singla: I think it does much better than that. The idea is, basically,
looking at this example, it doesn't say that Anna and Bob and Charles and all
those are independent; they'll certainly be connected. It's only identifying that
all these nodes can be treated as one cluster -- that is, they would behave
similarly with respect to the rest of the network.
So certainly all the interconnections are there. I'm not saying your graph is
disconnected. That's a very important point, actually: the graph is completely
connected in the original case. What we're saying is that all these nodes
would have behaved in exactly the same fashion, so why don't you pass one
message instead of passing N messages? That's the key.
>>: In the friends and smokers one that you generated, does it end up being
that each supernode is a different number of smoking friends, effectively?
>> Parag Singla: I did not actually look at what the clusters were, but it's
basically a similar idea: all of those people who had the same smoking evidence
and were connected to, say, the same number of friends would end up in the
same supernode. You might think it would split further because evidence
propagates, but it turns out it doesn't really split much further -- as in the
example, actually.
>>: One follow-up question. So if I have this right, the network size is essentially
polynomial with respect to the arity: if you have a thousand proteins and some
predicate with arity three, that's a thousand to the third groundings. So it is
going to grow that way? Message passing is what we've optimized down here.
>> Parag Singla: Right. So we are optimizing both, because we're also trying to
reduce the ground features. The ground features would be a thousand cubed,
but the lifted features, the superfeatures, will be much fewer.
>>: But you will still need to calculate every single ground feature -- that's the
polynomial time?
>> Parag Singla: Yes, that's a very good question. Actually, I kind of skipped
that detail, but it's a very good question. What we do is, we do not really
construct all the ground features either. Let's say you start with Smokes(x)
implies Cancer(x), that example. Initially we do construct all the ground
predicates -- so up until that point it is true. But then we cluster them, and when
we do the join, we do the join only on those clusters.
So you do not really need to construct all the ground features, because you now
have a compact representation for the ground predicates, and you can do the
join using that compact representation. You can do even more optimizations:
suppose, you know, most of the atoms are false; you do not even need to look at
those explicitly. Anything which is not true or unknown, you do not really need to
represent. Does that answer your question?
>>: Yes. For the smokes example, if you had a million people, you'd have a
million squared initial graph you would have to address before you could
actually optimize it? For friends, you'd have a friends atom for every single pair?
>> Parag Singla: I guess what I'm saying is, let's say you had a lot of people,
and you knew that most of the Friends(x, y) atoms are unknown. You don't really
need to construct those explicitly, because what you can do is construct the
compact representation for the true and false cases, and everything unknown
gets a default value, so you do not have to represent it -- it's just like the
closed-world assumption in databases, the same idea.
Right. So these are the results. You can see that overall, lifted BP is much,
much more powerful than the ground version. In some cases it's phenomenal --
for example, the hyperlink domain. Because what we're doing in that simple
example is not using the word information; we're just using the fact that certain
topics imply a certain probability of being linked to each other. So if there are 10
topics, there are only about 10 squared possibilities. The superfeatures are
really, really big, as in they have a lot of ground nodes clustered in them, and it
runs really, really fast; you don't really need to construct the whole network.
And -- so BP does not always converge, but the results are pretty good in all the
cases, and this is after a thousand BP iterations. As I said, the results are
exactly the same in the ground and the lifted case.
>>: Did you compare the accuracy of the BP versus, sort of, MCMC?
>> Parag Singla: Right. I do not have those results here, but they're quite
comparable. And this is the number of features, and physical memory; I think
that addresses some of the questions that were asked.
As you can see, the number of features is certainly much less in the case of
lifted BP, and that directly translates to physical memory savings. I think it is a
valid question, because many times you do have to actually construct the
ground network. But, as I said, in many cases, because of the compact
representation from the very beginning, you can in fact save a lot of memory,
and I think in almost all the cases we do save a lot of memory. I should point out
that for UW-CSE and image, running the network construction until the end
actually used more memory.
So I stopped after three or four iterations of lifted network construction, which
gives exactly the same results and runs much faster. For all other domains I ran
the lifted network construction until the end. For those two domains, the results
here are after stopping the construction at three or four iterations, which gives
the same results.
>>: Can you construct cases where your proposal is no better, or even worse, in
either time or space than the baseline?
>> Parag Singla: Yes, that could happen; in the worst case that could happen.
If you have to ground out the whole network -- although I'm not sure how
common that would be in general -- then you'll spend extra time constructing the
supernodes and superfeatures only to realize that you have to ground out the
whole network anyway. So it will be a little slower.
>>: Could be a little slower. Space-wise it would be no worse?
>> Parag Singla: Space-wise, I think it depends on your representation. If you
have a very poor representation for the supernodes and superfeatures, then yes,
basically it would be about the same, and in any given iteration it could take
more space because you have to represent the supernodes and superfeatures.
So, finally, this is the last result on this. This goes back to some of the questions
that were asked, but I also experimented with how the lifted network
construction varies as I increase the number of objects in my domain. Note that
in friends and smokers, as I vary the number of people, the lifted network size
remains about the same, because essentially the number of clusters is the
same. The ground version keeps growing, so this is on a log scale: the number
of features is on a log scale against the number of objects.
You can see the green curve grows with the number of objects, but the red
curve stays constant, because the final number of features that you have is
essentially the same. So this is, again, a nice property: your domain is getting
bigger but the final number of features stays essentially the same.
>>: Is it linear, does it have the same [inaudible]?
>> Parag Singla: The growth is almost linear, actually.
>>: But given your model of pairing --
>> Parag Singla: Yes, exactly. Because I'm using the same model, which
means, more or less, there are only a fixed number of clusters the nodes will fall
into.
So I think in the interests of time, I'll probably skip the learning part. Basically
there's some work I did on how to learn the parameters. I guess I'm done with
the lifted BP part, which is the crux, or one of the main ideas. I'll skip the
learning part -- basically how you learn the parameters in Markov logic; I wrote a
paper on that, and you can talk to me after the talk. So now I'll describe a couple
of applications that I used Markov logic for. The first one is --
>>: Can I interrupt? One question about the learning: was there anything you
did independent of the BP inference stuff, or are they tied to each other?
>> Parag Singla: That's a good question. It happened, at least chronologically,
that I worked on the learning before the lifted BP, so it was independent in that
sense. But now that we know how lifted BP works -- since learning uses
inference as a subroutine -- we could use lifted BP within the learning.
>>: The learning can also work without it -- they're independent.
>> Parag Singla: Right, you can plug in any black box inference for the learning,
which could be lifted BP. So, applications: I'll describe two applications.
The first one is entity resolution. I briefly talked about this. Data integration is
the first step in the data mining process.
When you merge data from a lot of different sources, you typically end up with
duplicates. You want to resolve those duplicates before you can do any effective
data mining. For example, if your papers come from different sources, the
authors may be spelled differently, titles may be missing, venues may be
abbreviated. So you want to resolve them before you can do any effective data
mining. Entity resolution is the problem of identifying which records or fields
refer to the same underlying entity.
This is a very well known problem in the literature. The original model was
proposed by Fellegi and Sunter back in the 1960s, and it's a simple model. The
idea is that you make each pairwise decision independently. You take each
record pair, compute the similarity between their attributes, and if the similarity
is more than a threshold, you declare a match; otherwise it's a non-match.
There have been many improvements on the original model, but most of them
still make this pairwise independence assumption: that each pair can be
resolved independently of the others.
But over time people have realized that it helps to take dependencies into
account: resolving one pair can help resolve another pair. So this is what we
incorporate. The problem is that, even though these approaches take
dependencies into account, they have all been developed as stand-alone
systems, and they address different aspects of the problem. There's no unified
solution to all of this, which is what I'll try to present using Markov logic. Markov
logic provides this nice paradigm. The idea is simple: you use weighted first
order formulas to give the domain theory, and for each first order rule you have
a weight which tells you how important that rule is.
We used some hand-coded rules, but you could always learn them using
structure learning in Markov logic. And this combines many different approaches
very seamlessly. In particular, there's an approach which introduced transitivity
between different pairs; you can write transitivity as a single rule in Markov logic.
We have one framework which combines them, and any new approaches can
also be combined into this framework.
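For example, entity resolution rules of the kind being described might look like
this in Markov logic (illustrative predicates and weights, not the exact theory
used in the experiments):

    2.1 : \mathrm{SimilarTitle}(c_1, c_2) \Rightarrow \mathrm{SameCitation}(c_1, c_2)
    1.3 : \mathrm{SameCitation}(c_1, c_2) \wedge \mathrm{SameCitation}(c_2, c_3)
          \Rightarrow \mathrm{SameCitation}(c_1, c_3)
    0.8 : \mathrm{SameCitation}(c_1, c_2) \wedge \mathrm{Venue}(c_1, v_1)
          \wedge \mathrm{Venue}(c_2, v_2) \Rightarrow \mathrm{SameVenue}(v_1, v_2)

The second rule is the transitivity just mentioned, written as one weighted
formula; the third is a collective rule that lets a matched citation pair propagate
evidence to its venue fields, which is exactly the effect in the example that
follows.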
So here is the example which demonstrates the power of building a collective
model rather than resolving pairs independently. This is a citation example. You
have authors, titles and venues, and you can see that intuitively you can figure
out that the first two citations refer to the same paper and the last two citations
refer to the same paper.
The first pair are reasonably similar to each other: the authors are similar, the
title is similar, and the venue is not similar, but maybe you are able to say the
venue is just abbreviated because the authors and titles match.
So, just from the correspondence between the titles, authors and venues, you
may be able to say that the first pair match. But now note that the second pair is
more problematic, because one of the authors is missing, the title is worded
very differently, and the venue is also abbreviated. So the threshold may not be
high enough to declare it a match.
But once you identify that the first pair is a match, you will know that "AAAI" and
"Twenty-First National Conference on Artificial Intelligence" are the same
conference: because you know the citations match, the venues must be the
same.
Once you have this information, you can really use it to deduplicate the second
pair. Since the same pair of venue strings also appears in the second case, you
can use this shared information to now say that they actually match, too.
So this is the basic idea of doing this collectively. We built a model based on
Markov logic, writing first order rules as I described; I'll skip the details and not
go into them here. As I said, you can really write a domain theory using Markov
logic which combines all these collective approaches. We did some experiments
on the [inaudible] datasets, and we showed that collective features help improve
performance. But we also showed that many of these previous approaches can
be seamlessly combined using Markov logic -- in many cases writing just one
formula to capture a previous approach for which a whole stand-alone system
had been developed.
Again, I'll be happy to talk about it in more detail. So, finally, the second
application. This is actually work that I did at [inaudible] research with Henry
Kautz and some of his colleagues. It is about predicting social relationships in
consumer photo collections.
So here -- this being a camera company -- the idea was: can you use something
like Markov logic to build smart cameras? Let's say you have a bunch of pictures
from a user and you want to identify the various social relationships in the
pictures. In this case, in the left picture, this person may be interested in saying:
go look at all my pictures and find out what kind of kids my children are hanging
out with -- are they in bad company, are they in good company, what sort of kids
are they hanging out with?
Now, no doubt it's a very difficult problem. You don't know who these kids are,
or who the other kids are -- how do you do that? But if you show these two
pictures to a human, he or she may have a very good guess that the kids in the
left picture are probably his own children and the third kid is not his child,
because we know that typically children tend to be photographed with their
parents, and friends appear together. So the kid who is only in the left picture is
probably a friend of these two kids, and these two kids probably belong to this
person.
Again, we are not sure, but these are good rules of thumb. Similarly, you could
have other rules of thumb saying that friends appear together; that parents are
older than their children, which is a hard rule; that relatives appear together;
that grandparents like to appear with their children; and so on. You can write
various rules that are not always true, but which give very good information
about the underlying domain.
So we constructed -- we handcrafted -- an MLN which had about five hard rules,
for example that parents are older than their children, and 14 soft rules. As I
mentioned, the weights of the soft rules were learned using some training data:
we had about 13 volunteers who labeled their photographs with the
relationships in them, and there were about 48 total images. What we wanted to
do was predict seven different relationships: parent, child, spouse, relative,
friend, child's friend and acquaintance.
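To give a sense of the rule language, the hard and soft rules being described
might look like this (illustrative formulas, not the actual handcrafted MLN):

    \infty : \mathrm{HasRelation}(x, y, \mathrm{Parent}) \Rightarrow \mathrm{Older}(x, y)
    w_1 : \mathrm{OccurTogether}(x, y) \wedge \mathrm{Kid}(x) \wedge \mathrm{Adult}(y)
          \Rightarrow \mathrm{HasRelation}(y, x, \mathrm{Parent})
    w_2 : \mathrm{OccurTogether}(x, y) \wedge \mathrm{Kid}(x) \wedge \mathrm{Kid}(y)
          \Rightarrow \mathrm{HasRelation}(x, y, \mathrm{ChildFriend})

The first is a hard rule (infinite weight); the soft weights w_1, w_2 would be
learned from the labeled photographs.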
I should mention that this was an initial venture into the problem, so we
assumed that face detection and identification had already been done. That is
again a big assumption, but you can imagine constructing a bigger model where
you fold that in as part of the model. For the results I'm going to show, we
assume we know who the faces are, and we just want to identify the
relationships among them, given those faces.
And we compared five models. Since nobody, at least in the published literature,
had done this before, we compared against very basic models. The first one was
random: we randomly predict each relationship with uniform probability. The
second predicts based on prior information in the data; for example, if most of
the persons appearing in photographs are kids, you are more likely to predict
the child relationship. Then hard constraints: you use only the hard rules, for
example that parents are older than their children.
Then hard and prior combined together, and finally the full MLN model with all
the soft and hard constraints and the weights learned for the soft constraints.
Here are the results. This is basically the recall curve comparing all the models.
As you can see, all of them are pretty low, which means it is a hard task.
But still, you can see the MLN curve dominates all the other curves. And most
interestingly, the random curve is at about .14, which is the baseline: there are
seven relationships, and one over seven is about 0.14. And then each additional
model gives you more and more information: the prior is better, hard is better,
and the MLN combines all of them and adds some extra power to give you the
best model. As I said, these were initial experiments; you could certainly try to
improve the model. We did not learn the structure of those rules, so we could
use training data to learn the rules themselves, and that could improve the
results even further.
So that was the second application. And just to mention a few others: these are
some of the other applications people at various places have been working on --
information extraction, and many of those papers have been published recently;
link prediction, collective classification and many others. I'll be happy to talk
about some of these after the talk; I mentioned just the couple that I have done.
And finally, all of this has been developed in the Alchemy software, developed at
the University of Washington by many authors besides me. It gives you the
whole first order logic semantics of Markov logic, has the inference algorithms I
just described, and also has structure learning to learn those rules. This is the
website.
And, finally, conclusions and future work. In conclusion, I tried to present the
case that unifying the statistical and logical approaches is an important step
which could really speed up progress in AI by providing the interface layer.
Markov logic could be one potential choice: it's a combination of logic and
probabilistic models, and very simple and powerful. Various algorithms were
presented for efficient learning and inference; I should definitely mention that
there are many more learning and inference algorithms which I did not really get
time to talk about. And there are many applications.
And, finally, coming to future work: there are many directions, but I'm interested
most in generalizing the framework for lifted inference.
How could you extend the lifted BP framework to other algorithms like MCMC?
I've been working on how to give a general framework for lifted variable
elimination, which is exact; but there's also extending it to MCMC, and then
connecting it with resolution. So we have resolution in first order logic -- what is
the connection between lifted BP and resolution? How could you use this in
potentially infinite domains? And the third direction is identifying substructures
in the network for efficient inference.
So here the idea is that it could be possible to break up your network into
various parts where inference in some of the parts is really simple. For example,
you could have a linear chain in most of the network, which you could solve
exactly using something like Viterbi, but some part may be complex. You could
combine those two, basically doing inference separately on those parts and
then combining the results. So that could be one potential approach for doing
fast inference. In general, I'm interested, a little bit longer term, in developing a
comprehensive theory of lifted inference -- and, of course, learning, because
inference is a substep of learning.
And the intuition that I have is connected with human perception.
For example, let's say a human being is taken to a new place, and he or she is
asked to open their eyes for a second and then close them, and then asked:
what did you see? They could say, oh, I saw some very big building and there's
a parking lot on the left, but they may not be able to tell you more. Now, if you
ask them to open their eyes for five seconds and then ask what they saw, they
could say, oh, I saw seven buildings and there are two parking lots and some
cars here and a couple of roads. If you give them more time, they could tell you
what those buildings were, how high they were, how many floors they had, and
so on.
So can we do something similar for probabilistic inference? Given the time you
have, you could start with a very crude approximation for the nodes -- like the
clusters I showed, starting with a crude approximation -- run some basic
inference, refine the clusters as you go along, and then, depending on the time
you have, you could actually give the exact results given sufficient time.
And of course this has a lot of applications in [inaudible] recognition, biological
data, and so on. So that's pretty much it. Thanks. Any questions, I'll be happy to
answer.
>>: Is there any interaction with junction trees when you're doing this
inference?
>> Parag Singla: That's a good question -- actually an interesting connection. I
mentioned lifted variable elimination: I'm working on a paper which gives a very
generic framework for this idea of splitting the nodes, and how it ties in with
junction trees and lifted BP. The idea is similar. In junction trees also you can
sort of -- I think it's easier to think in terms of variable elimination first.
So the idea is similar to bucket elimination -- there is a paper on bucket
elimination which gives a framework for variable elimination. The idea is that
you have these supernodes. You start with the same kind of supernodes, you try
to eliminate them, and at each elimination step you see whether you really need
to refine them or not. So you start with very crude supernodes, and at every step
of elimination you see if you really need to refine them, or whether all of them
can be eliminated in one step. And this idea is similar to lifted BP in that sense.
And it turns out that it really helps. You may have come to the talk Rodrigo gave
on lifted variable elimination -- I think some of you were probably here -- so that
ties in with some of the work on lifted variable elimination as well.
Any other questions?
>>: I have a question. When people are authoring these rules, you could easily
add a rule that's simply going to make the inference take 10 years where before
it was going to take a minute. I think there's a similar situation in databases,
where you add a certain query and it's going to take forever. Is anyone looking
at trying to estimate how long a computation is going to take, or at giving some
kind of feedback that this rule is the reason it's taking so long -- maybe if you
break it up in this way it will take a lot less time? Do you know if anyone is
looking at that?
>> Parag Singla: No, I don't think so. I think the knowledge is more on the
engineering side: people try it out, they have this intuition about a rule, and then
they just sort of throw it out.
But I think some of these things could be automated. For example, you certainly
know that when you have more variables it really blows up. So those things --
>>: If you could know, for this rule, that it caused this many groundings or this
many messages to be passed, you could probably see which one was the
problem -- that would be kind of interesting.
>> Parag Singla: Yes.
>>: Like you're saying, that's kind of an engineering thing, though.
>> Parag Singla: Yeah. We could certainly think of developing a theory, but I
don't think anybody has, at least not to my knowledge. Yeah.
>>: I think it's been tried in [inaudible] research. I think it's unsolved.
>>: That would make sense if they were doing it for that.
>>: But I'm not quite sure. I think it's unsolved -- unsolvable in the general case.
>> Parag Singla: I think that's an interesting question, because sometimes
people think that doing this kind of inference is harder. But I think once you
come to the domain of approximate inference, many of these things you do not
really need to do exactly, because you're able to trade a little bit of accuracy for
good time and memory efficiency. So I don't know -- some of these things may
be applicable, for example the things that people have tried before, but I'm not
aware of any.
>>: I think it comes down to estimation, too. You don't need to do it exactly.
>> Parag Singla: Yeah.
>>: Because you're lifting -- I don't know, I get a feeling that to do exact lifting
seems to be equivalent to saying that two algorithms are equivalent, which we
know is an unsolvable issue in the general case. So you're taking a stab at it,
making estimations and getting close enough that you actually get the results.
>> Parag Singla: In a sense -- lifting, as I described it, will give you exactly the
same result, provably the same result, as ground BP. I'm not comparing two
different algorithms; it's the same algorithm.
>>: But BP is an estimation.
>> Parag Singla: Yeah.
>>: So if you take -- like you said, if you take your Markov logic and give it all
infinite weights, then you're equivalent to -- I'm sorry -- first order logic, and
proving that two terms are equivalent in first order logic is impossible.
>> Parag Singla: Right. So I think --
>>: But you're going to get close enough to where you're interested.
>> Parag Singla: I think the place where this work comes in is saying that you
can use resolution, which is faster than propositionalizing your first order logic.
What resolution can solve is a different question. I think lifted BP versus ground
BP is more comparable to resolution versus propositionalization than to a claim
that inference itself is easy.
What it is saying is that in resolution you can eliminate a lot of constraints --
potentially an infinite number of constraints -- in one step, instead of really
grounding them out and doing the same thing. But it doesn't say anything about
the inherent hardness of inference in first order logic. Even using resolution,
some things may be hard, and that is true in this case too.
>>: Which is why you have the case you mentioned in passing, where you could
end up in a situation where you have to ground out the whole network.
>> Parag Singla: Exactly.
Thanks.
[applause]