
>> Matt Richardson: Okay. So it's my pleasure to introduce Parag Singla. He's
here to talk about Markov logic. He actually interned for me a few years ago, in
the summer of 2006 -- '5, I don't remember. '6. He worked on some really
interesting work dealing with messenger data and search, doing a bunch of
data mining and everything. It would have been simpler if Cosmos had existed.
He's come back to talk about what he did in his thesis work under Pedro
Domingos at the University of Washington. Thanks, Parag.
>> Parag Singla: Thanks, Matt. So here's the title of my talk: Markov logic --
theory, algorithms and applications. Most of it is joint work with Pedro Domingos,
my advisor at the university.
There is some work that I did in collaboration with researchers in Rochester; I'll
mention the collaborators when that part comes up. So here is a brief outline
of my talk. I'm going to give some motivation for the work that I've done, give
some necessary background on Markov networks and a little bit on first order
logic, and then I'll describe Markov logic, which I will build on.
Then I'll describe inference algorithms that were developed for this language,
explain a couple of applications and then conclude with future work.
So coming to motivation: it has been seen in many areas of science that there is
this idea of applications and infrastructure, with a middle interface layer that
separates applications from infrastructure.
And it has been observed that whenever we have this interface layer separating
applications and infrastructure, progress really happens fast. What do I mean
by that?
Let's take networking. On the application side you've got the web, e-mail and all
the applications you can think of; there's YouTube and other stuff.
Infrastructure essentially consists of protocols, routers, all those things. The
interface layer is the Internet. How does the Internet help? Applications can be
developed independent of what's going on in the infrastructure layer. You can
develop applications just knowing about the Internet and optimize those; and,
similarly, the infrastructure layer can work independently. People can
optimize the routers and the protocols, and as long as those two interact well
with the Internet, everything is fine.
Note that instead of making N squared connections between the applications
and the infrastructure, we have only order of N connections. So this really helps
speed up progress. Similarly, in databases we have applications like
enterprise resource planning, online transaction processing systems, CRMs, and
the infrastructure is query optimization, transaction management.
As you can see, both sides can develop independently. Applications can go on,
infrastructure can be optimized, and the interface layer, as we know, is the
relational model. Once we have the schema in mind, everything works well.
And, again, the progress can go in both directions independently. So what is the
interface layer for AI? Applications, as we know, there are [inaudible], NLP,
planning, multi-agent systems and many, many more. Infrastructure is essentially
representation: how you do the learning, inference and the other pieces.
What is the interface layer that we're looking for in the case of AI? So for quite
some time people thought that first order logic could be the language of choice,
because it has the power to represent objects and the relations among them. It
can handle very complex world scenarios.
But the problem with first order logic, as we'll see, is that it's [inaudible]. It doesn't
have the power to represent uncertainty, which is inherent in the world.
We would certainly like to have that. So that brings us to graphical models --
something like Bayesian networks or Markov networks -- which have the
capability to handle probability explicitly. But then the problem with them is that
they do not really handle objects and relations.
So you lack the capability of having that complex structure in the language. How
do you get that? So that suggests: what if you combine these two
approaches, the statistical and the logical? Statistical, as I said, being the
graphical model kind of approach, and the logical being something like first order
logic. If you can combine those, maybe you could have something which could
be a potential interface layer, really speeding up the progress.
So there has been this whole area of statistical relational learning which has
come up. This is sort of the background: many languages
have been proposed, and today I'm going to talk about Markov logic, and
in some sense try to argue that it really gives you the representational
power to handle both uncertainty and complex structure, and also has
the engine where you can do fast inference and learning.
So Markov logic, as I mentioned, is not the first of these approaches. The history
goes back to 1986, with Nilsson's probabilistic logic, and many more models
since. And these can be classified based on what representation they use for
uncertainty and what kind of logical model they use.
I'm not really going to talk about those, but will focus primarily on Markov logic,
which is the one highlighted here; it was introduced by Richardson and Domingos
back in 2006.
So briefly, in one slide: for Markov logic, the syntax is essentially weighted first
order formulas, very simple. If you know first order logic, you just write first order
formulas. That's the syntax.
The semantics, as I'll explain a little later, can be seen as constructing templates
for the underlying Markov networks. In terms of inference, there has been a lot of
work that I have done and my colleagues have done -- [inaudible], MCMC,
belief propagation, lifted belief propagation -- and I'm going to talk about some of
this. For learning, you could use something like L-BFGS, or the second order
methods which have been developed.
Applications, there are many, and I'm going to talk about two of them. On this
slide, the red items highlight the algorithms that I've worked on.
There are a couple of others I've done work on, but I'll not be talking about those.
The red ones are essentially those that I'm going to talk about, in addition to, of
course, explaining the semantics and syntax of Markov logic.
So this gives the basic motivation for the work that I've done. Now a brief
background on Markov networks and first order logic. Markov networks, as some
of you may be familiar, are essentially undirected graphical models. So these are
nodes here, with edges between them, and the nodes represent, in this case you
could say, some kind of binary predicate: for example, whether someone smokes
or doesn't smoke, whether someone has cancer or doesn't have cancer, asthma,
cough. An edge essentially represents that there's a direct influence of one node
on another.
For example, if you smoke, then you're likely to have cancer. If you have cancer,
then you're likely to have a cough. Similarly for the others.
And the way we define the distribution over this network, or over these nodes, is
by having potential functions which are defined over the cliques in the graph.
So in this case you can see there are several cliques: there's a clique between
asthma and cancer, between asthma and cough, and also a clique between
cancer and smoking.
An example of a potential function is that, for all possible states of
smoking and cancer, you have this real valued function. Actually, it needs to be
positive. So you have all these values. And the probability distribution is
defined simply as a product of all these potential functions defined over the
cliques in the graph, divided by the normalization constant. Intuitively, these
numbers say which state of the world is more likely. For example, in this case
you can see that only the combination true/false has a low value, which means
that that state of the world is less likely as compared to the other states.
Equivalently, this model can be represented as a log linear model. So it's an
exponential of linear feature functions. Again, the feature functions are defined
over cliques in the graph, just as before. But note that this can be much more
compact than the potential representation.
For example, in this case you may say that this feature is on, or one, when
smoking implies cancer -- this small formula defined over these two node
variables -- is true; otherwise it's zero.
Instead of having four possible states, you can represent it much more compactly
in this log linear fashion. And in this case W can be any real valued weight.
I'll be using this form throughout the talk for defining the probability
distribution. And Z is the normalization constant as before.
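For reference, the two equivalent forms being described -- the product-of-potentials
form and the log linear form -- can be written as follows (a reconstruction from the
verbal description, not the exact slide):

    P(X = x) = \frac{1}{Z} \prod_{c} \phi_c(x_c)
             = \frac{1}{Z} \exp\Big(\sum_i w_i f_i(x)\Big),
    \qquad Z = \sum_{x'} \prod_{c} \phi_c(x'_c)

Here \phi_c is the potential over clique c, f_i is a binary feature (e.g., 1 when
"Smoking implies Cancer" holds) and w_i is its weight, with \phi_c = e^{w_i f_i}
relating the two forms.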
So that is Markov networks. Coming from the logical side of things, first order
logic essentially consists of constants, variables, functions and
predicates, which represent your underlying world. Constants could be something
like the people in your domain: Anna, Bob. Variables are x, y, z, which range over
the constants in your domain. You could have functions like MotherOf(x), so that
for an x, whoever that person's mother is can be represented by this function.
And Friends(x, y) is a predicate, which could be
true if x and y are friends with each other, false otherwise.
Grounding is an important construct in first order logic. It corresponds to
replacing the variables in a predicate, or any other construct, by the
corresponding constants. For example, for Friends(x, y) you could have
Friends(Anna, Bob), and similarly for the other constants. A formula essentially
combines predicates. For example, Smokes(x) implies Cancer(x), which is false
when Smokes(x) is true and Cancer(x) is false, and true otherwise.
A knowledge base is essentially a set of formulas, and it's a standard theorem
of first order logic that it can be equivalently converted into clausal form. A
knowledge base, along with an interpretation, which assigns truth values to all
the ground predicates -- that's the semantics of first order logic.
I'll be using this. Now, given this background, let's try to understand what
Markov logic does, given Markov networks and first order logic. The problem with
first order logic that I alluded to earlier is that a logical knowledge base is
essentially a set of hard constraints. You really need the world to satisfy all the
formulas. So even if one formula is false in your domain, the whole thing
crumbles down. It's brittle. For example, if you have "for all x, Smokes(x) implies
Cancer(x)", and one person smokes and doesn't have cancer, the whole thing
crumbles down.
What if we could make them soft? That is, when the world violates a formula, it
becomes less probable but not impossible. So that is the idea in Markov logic.
And how do we do that? Essentially you attach a real valued weight to each
formula, and the weight tells you how important that constraint is.
The higher the weight, the stronger that constraint is, and the more likely you are
to satisfy it. In particular, the probability of a world is now proportional to the
exponential of the weights of the formulas it satisfies. So more formally, Markov
logic is defined by a set of pairs (F, W), where F is a formula in first order
logic and W is a real number. And together with a finite set of constants it
defines a Markov network where there is a ground node for each grounding of a
predicate and there is a feature for each ground formula.
And I'll explain this with the help of an example, and W is the corresponding
weight of the feature. So here is an example. Let's take these formulas -- a very
simple domain, and I'll be using it throughout the talk. The first formula says
that for all people, Smokes(x) implies Cancer(x): if they smoke, they're likely to
have cancer. The second formula says that if x and y are friends and x
smokes, then y also smokes -- that is, friends have similar smoking habits.
These are very useful rules for modeling the real world, because most people
who smoke are more likely to have cancer compared to a non-smoker.
Similarly, it has been observed in the social sciences that friends tend to have
similar smoking habits.
These are very good rules of thumb, but they may not always be true.
So let's convert them to Markov logic. We give them weights, which could be
learned from training data; these are some arbitrary weights in this case. What is
the ground Markov network? First, let's say we have two constants, Anna and
Bob. We create the ground nodes, substituting Anna and Bob into the
corresponding predicates -- Smokes, Cancer and Friends: Smokes(Anna),
Smokes(Bob), Cancer(Anna), Cancer(Bob), and similarly the groundings of the
Friends predicate. And note that we need both Friends(Anna, Bob) and
Friends(Bob, Anna), because they need not mean the same thing. These are
ground nodes corresponding to ground predicates.
Then I connect those nodes which appear together in any ground formula or
ground clause. So Smokes(Anna) is connected to Cancer(Anna), and Smokes(Bob)
is connected to Cancer(Bob). Similarly, corresponding to the second
formula, I also create all the cliques. So this is my ground Markov network. Now
you're in the domain of Markov networks and you can do inference and learning
on this. That's the basic semantics. And of course I'm going to talk about how to
make it efficient given this representation.
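To make the grounding concrete, here is a minimal Python sketch of it
(illustrative names and weights following this example; this is not the Alchemy
implementation):

    from itertools import product

    # A minimal sketch of grounding an MLN over a finite set of constants.
    constants = ["Anna", "Bob"]

    # Ground atoms: one node per grounding of each predicate.
    atoms = [("Smokes", x) for x in constants] + \
            [("Cancer", x) for x in constants] + \
            [("Friends", x, y) for x, y in product(constants, constants)]

    # Weighted ground formulas: one feature per grounding of each first
    # order formula, evaluated on a world w (dict: atom -> bool).
    ground_formulas = []
    for x in constants:  # 1.5  Smokes(x) => Cancer(x)
        ground_formulas.append(
            (1.5, lambda w, x=x: (not w[("Smokes", x)]) or w[("Cancer", x)]))
    for x, y in product(constants, constants):
        # 1.1  Friends(x,y) ^ Smokes(x) => Smokes(y)
        ground_formulas.append(
            (1.1, lambda w, x=x, y=y:
                not (w[("Friends", x, y)] and w[("Smokes", x)])
                or w[("Smokes", y)]))

    print(len(atoms), "ground atoms,", len(ground_formulas), "ground formulas")

With two constants this gives 8 ground atoms and 6 ground formulas; the edges
of the ground network connect exactly the atoms that co-occur in some ground
formula.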
So any questions at this point before I go further? As I mentioned, an MLN can
be seen as a template for constructing ground Markov networks, and the
probability of a particular state of the world is given by 1/Z times the exponential
of the sum of w_k f_k(x), Z being a normalization constant. This is basically the
form of Markov networks: the summation is over all the ground formulas in the
theory, f_k is the feature and w_k is the weight of the formula from which that
feature came.
Equivalently, you can write it in a second form, because we have binary features:
whenever a ground formula is satisfied, the feature is on, otherwise it's off. So it
can equivalently be written as a summation over the first order MLN formulas of
w_i times n_i(x), where n_i(x) is the number of groundings of formula i which are
satisfied -- because for as many groundings of that first order formula as are
satisfied, that many times the feature will be on.
The second equation is what I'll be using throughout this talk; the
distribution defined by Markov logic is defined by this equation.
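Written out, the two equivalent forms just described are (reconstructed in
standard notation):

    P(X = x) = \frac{1}{Z} \exp\Big(\sum_k w_k f_k(x)\Big)
             = \frac{1}{Z} \exp\Big(\sum_i w_i n_i(x)\Big)

where the first sum ranges over ground formulas (binary features f_k), the
second over first order formulas, and n_i(x) is the number of true groundings of
formula i in world x.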
Now, briefly, what is the relation of Markov logic to various statistical models? I
started by saying that it combines the power of various standard probabilistic
models -- statistical models -- with first order logic. So what's the connection?
We can show that all these things on the left side -- Markov networks, Markov
random fields, hidden Markov models, conditional random fields and many
others -- can be represented as special cases of Markov logic. In particular, you
make all the predicates zero-arity, because you really don't need the variables,
and you can represent all these models.
What is the connection to first order logic? It can be shown that in the limit of
infinite weights, when all your weights tend to infinity, the distribution is
essentially the one represented by first order logic, which is very nice: in the
limit it goes to first order logic. But not only that: in the case when
your knowledge base is satisfiable but the weights are not infinite, the satisfying
assignments are essentially the modes of the distribution. This again makes
sense: the states which are most likely are the satisfying assignments
of your underlying theory, which is very intuitive.
And in particular, note that the difference between Markov logic and first order
logic is that Markov logic allows contradictions between formulas and still gives
you reasonable probabilities over the underlying worlds.
One thing I'll mention without going into detail: I did not really talk about how you
represent infinite domains, which is essentially one of the key things you can do
with first order logic.
We have a paper where we extend Markov logic semantics to infinite
domains. It turns out that it's not that straightforward, but you can do it, and we
borrow a lot of ideas from the physics literature. In particular we use the theory
of Gibbs measures. And you can show that as long as each node in your
underlying network has a finite number of neighbors, you can have a valid
distribution which can be represented as an infinite collection of finite
distributions.
I'm just not going to talk in detail about this, but look at the paper if
needed. So now, having described the representation language, I'm going
to describe how you do inference, in particular efficient inference, in this kind
of model.
So inference essentially corresponds to the problem of finding the probability of
query atoms given some evidence: the probability of Y given X, where Y is the
query atoms and X is the evidence, something that you know at the time of
inference. Substituting back into the formula of Markov logic, you get a
probability of Y given X which is 1 over Z_x -- the normalization constant now
depends on X because X is fixed -- times the exponential of the sum of
w_i n_i(x, y), with i varying over all the first order logic formulas in the
theory.
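In standard notation, the formula being described is (a reconstruction):

    P(Y = y \mid X = x) = \frac{1}{Z_x} \exp\Big(\sum_i w_i\, n_i(x, y)\Big),
    \qquad Z_x = \sum_{y'} \exp\Big(\sum_i w_i\, n_i(x, y')\Big)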
The problem with that is that you have to compute this normalization constant,
which takes exponential time. So you cannot really do it exactly; you resort to
approximate methods. And there are two different families you can use. One is
MCMC, that is, Markov chain Monte Carlo: you construct a Markov chain and
sample from the distribution. The other, which has become more popular in the
last few years, is belief propagation.
Here the idea is that you form a bipartite graph of nodes and factors -- variables
and features -- and then you pass messages from nodes to features and vice
versa, and you repeat until convergence. I'm going to focus on the second one
and show how you could use it to do inference in Markov logic. There are
additional approaches which, again, I'm not going to talk about in this talk.
Okay. So belief propagation. The idea is that you pass messages
from nodes to features and back, and what the messages represent are
essentially the current approximations to the node marginals. You initialize
each message to one and carry this back and forth. So I'm going to show
this with an example. Here's the example. On the left side you have all the
nodes and on the right side you have the features. You can see that, right?
So an example of a node, or a ground predicate, is Smokes(Anna). This is again
the example I've been talking about, the friends and smokers domain. A feature
could be: Smokes(Anna) and Friends(Anna, Bob) implies Smokes(Bob). The nodes
on the left side represent the groundings of the predicates, and those on the
right side the groundings of the features. And there is an edge between a node
on the left and a node on the right if the predicate appears in that feature.
For example, Smokes(Anna) would be connected to the feature Smokes(Anna) and
Friends(Anna, Bob) implies Smokes(Bob). And then you pass messages
along the edges as you go along. What are these messages?
So this equation gives the message passed from nodes to features. There's a
lot of notation here, but the message is very simple. What it is saying is
that a node takes all the messages that it received from the features in the
previous time step, except for the feature being considered -- in this case f --
multiplies them together, and sends the product to that feature.
Intuitively, what it's saying is: what is the current belief that the node has
about the probabilities of being in its various states? Similarly,
the message from features to nodes is slightly more complex, but the form is
similar. If you look inside the summation, the inside is similar: you multiply
together all the messages that this feature node received from the nodes in the
previous time step, except for the node being considered. Then you multiply
by the potential of this feature -- the exponential of the weight times the
feature value -- sum out everything except for the node being considered,
and pass it back.
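In standard factor graph notation, the two messages being described are (a
reconstruction, with x a variable node, f a feature node, and nb(.) the neighbors):

    \mu_{x \to f}(x) = \prod_{h \in nb(x) \setminus \{f\}} \mu_{h \to x}(x)

    \mu_{f \to x}(x) = \sum_{\sim \{x\}} e^{\,w_f f(\mathbf{x})}
                       \prod_{y \in nb(f) \setminus \{x\}} \mu_{y \to f}(y)

where \sum_{\sim\{x\}} sums over all variables in the feature except x, and
e^{w_f f(\mathbf{x})} is the Markov logic potential of the ground formula.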
So this is standard message passing in belief propagation, and it has been
shown that in tree-structured graphs it converges and gives you the exact
result. In loopy graphs it's not guaranteed to converge, but for many problems
it is a very good inference algorithm in practice that gives you good results
very, very fast.
So this is nice. We could use this by converting the Markov logic into the ground
Markov network, constructing this graph and passing messages.
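Here is a compact Python sketch of the ground BP just described, assuming
binary variables and the exp(w * f) potentials of Markov logic; the factor
representation is illustrative, not the Alchemy code:

    import itertools, math

    # Loopy BP on a factor graph. factors: list of (weight, variable tuple,
    # boolean feature function taking a dict of variable -> 0/1).
    def loopy_bp(variables, factors, iters=50):
        nb = {v: [i for i, (_, vs, _) in enumerate(factors) if v in vs]
              for v in variables}
        m_vf = {(v, i): [1.0, 1.0] for v in variables for i in nb[v]}
        m_fv = {(i, v): [1.0, 1.0] for v in variables for i in nb[v]}
        for _ in range(iters):
            for v in variables:                    # node-to-feature messages
                for i in nb[v]:
                    msg = [1.0, 1.0]
                    for j in nb[v]:
                        if j != i:
                            msg = [msg[s] * m_fv[(j, v)][s] for s in (0, 1)]
                    z = sum(msg)
                    m_vf[(v, i)] = [m / z for m in msg]
            for i, (w, vs, f) in enumerate(factors):   # feature-to-node
                for v in vs:
                    msg = [0.0, 0.0]
                    for assign in itertools.product((0, 1), repeat=len(vs)):
                        a = dict(zip(vs, assign))
                        p = math.exp(w * f(a))         # MLN potential
                        for u in vs:
                            if u != v:
                                p *= m_vf[(u, i)][a[u]]
                        msg[a[v]] += p
                    z = sum(msg)
                    m_fv[(i, v)] = [m / z for m in msg]
        marginals = {}   # marginal: normalized product of incoming messages
        for v in variables:
            b = [1.0, 1.0]
            for i in nb[v]:
                b = [b[s] * m_fv[(i, v)][s] for s in (0, 1)]
            z = sum(b)
            marginals[v] = [x / z for x in b]
        return marginals

For instance, loopy_bp(["S", "C"], [(1.5, ("S", "C"),
lambda a: (not a["S"]) or a["C"])]) returns approximate marginals under a
single Smokes-implies-Cancer soft rule.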
>>: Do we have any bounds for stopping?
>> Parag Singla: Right. That's a good question. I think there are bounds, but
many times what people do is just run it for a certain number of iterations and
stop.
So this is nice. You could use it, but then there is a problem. The problem is that
in the kind of domains that we're going to work on, and I'll show some examples,
there could easily be billions of features. Consider simple theories --
Smokes(x) implies Cancer(x); Friends(x, y) and Smokes(x) implies Smokes(y).
Even if you have, say, a thousand people in your domain, already you have a
thousand times a thousand groundings of the friends formula, and this grows
exponentially with the number of variables in your formulas. That means too
many messages: the network size is too big and you have to pass too many
messages. This could really take a lot of memory and be really too slow.
So what's the solution? The idea that we propose is that, instead of passing
that many messages, one for each ground node, what if you could cluster
together those nodes which pass the same message in the ground version? If
you could somehow identify the nodes which would have passed the same
messages during BP, then you could pass only one message for the whole
cluster, and that could really reduce the number of messages which get passed
and make your algorithm much faster. It also reduces the network size.
I'll try to demonstrate this on the example that I've been looking at. So this is the
original ground BP: you have nodes and features and you're passing messages.
And let's say somehow you identify -- the boxes here demonstrate that -- all
the nodes which would have passed exactly the same message in each iteration
of the belief propagation algorithm. Let's assume for now that we have somehow
identified those; I'm going to tell you in more detail how we do that. Once you
know that that is the case, then, instead of having all these edges between the
nodes on the left and the right, you could have essentially one edge between
each pair of boxes. So as you can see, the number of messages can be reduced
by a big amount: in this case you have only three messages going back and
forth between the two sides, and it will give you exactly the same result.
And the form of the messages is exactly the same, except for two constants,
which are a function of the number of edges which went through those boxes.
Note that we replaced many edges by one edge, so we have to take into account
how many nodes are clustered in each box, and these alpha and beta constants
essentially depend on that. Other than that, it's exactly the same algorithm;
inference proceeds in the same way. It gives the same result, will be much,
much faster and can save a lot of memory. Now I'm going to describe in more
detail how to actually find these boxes and what these messages are.
So the basic idea in lifted belief propagation -- you can see it's two steps. First is
network construction: that is, construct these boxes, which we call supernodes
and superfeatures. The names should be intuitive. A supernode, formally, is a set
of ground atoms that all send and receive the same messages throughout the
ground version of the belief propagation algorithm. A superfeature is defined
similarly: all ground clauses or formulas which send and receive the same
messages throughout BP. Then you construct this network and run modified
belief propagation, with those alpha and beta constants, on this network; and, as
I said, it gives the same results as ground BP, and the memory and time savings
can be huge.
So how do we construct this network? It's a simple, iterative process. We start
with an initial guess for the supernodes. Given your domain theory and given
some constants and evidence, what is the initial guess for the supernodes? The
basic guess is that you cluster all true predicates in one box, all false predicates
in one box and all unknown predicates in one box. That is the first guess, and
you refine them as you go along.
Given these predicate supernodes, you can join them together to get the next
level of superfeatures. Given that, you can project them back onto the
supernodes to get the refined supernodes; that is, you project the superfeatures
down to the ground predicates, and all those predicates which appear in the
same number of superfeatures will now be clustered together. This is repeated
until convergence, and the algorithm is actually guaranteed to converge to the
optimal network. That is the algorithm, and I'll demonstrate it with the help of an
example, working with just one formula; you can extend it to as many formulas
as you want, but for this presentation, just one formula.
Let's say we have Smokes(x) and Friends(x, y) implies Smokes(y), the same
example. Let's say we have some evidence: Anna smokes, that is,
Smokes(Anna) is true. We know that Bob and Charles are friends and Charles
and Bob are friends. So this is our evidence.
Let's say we have N people in the domain, N being greater than three. So
intuitively it's very clear that the boxes the algorithm should give in this case are
three boxes: Smokes of Anna; Smokes of Bob and Charles; and Smokes of all
other people. Because Anna is different from all the others, since she smokes,
the probability for her should be different from the others'. Similarly, Bob and
Charles are different because we have some extra piece of information about
them. And all other people should come in one box. So this should be my
clustering of the supernodes. That's the idea.
So let's see how the algorithm discovers that. We now have supernodes on the
left and superfeatures on the right, and I'm going to show how we define them.
For the initial set of supernodes, we simply create the supernodes for the true,
false and unknown cases. There are no false predicates -- no false evidence --
in this case. So we have Smokes(Anna), which we know is true, in green;
Smokes(x) for all people other than Anna, which is unknown; then
Friends(Bob, Charles) and Friends(Charles, Bob), which are true; and
Friends(x, y) for all other pairs of people. So these are my initial four
supernodes. As I said, you don't really need the box for the false case because
there is no false evidence. Given these initial supernodes, let's try to construct
the superfeatures for Smokes(x) and Friends(x, y) implies Smokes(y).
So take Smokes(Anna) -- the color coding tells you where the nodes are
coming from. Smokes(Anna) comes from the first supernode; then join it with
Friends(Anna, x) and Smokes(x). This is my first superfeature, coming from the
boxes on the left side, formed simply by doing the join like this.
And, similarly, you can construct the other superfeatures; the color coding again
shows where those supernodes came from. And the third one and the fourth
one. You can show that these are the only combinations possible in this case.
So you get these four superfeatures corresponding to the four supernodes on
the left, which are simply joined together. Now, having constructed the
superfeatures, let's see how I construct the new supernodes. Again, I use a
different color coding for each superfeature, and what you want to do is project
them onto each ground predicate -- that is, Smokes(Anna), Smokes(Bob) and so
on -- and there is this vector of four counts, one count per superfeature. We're
going to populate this with the number of ground features that project onto each
ground predicate -- populate it with projection counts.
So note that the first superfeature projects onto Smokes(Anna) N minus 1 times,
because x can take N minus 1 possible values -- in this case x cannot be Anna.
So I get N minus 1. You can show that the second one projects zero times,
because Smokes(x) cannot project onto Smokes(Anna), as x cannot be Anna
there. Similarly you get zero for the next one: Smokes(Bob) cannot project onto
Smokes(Anna). So you get this vector of counts. Similarly, we do it for all the
other ground predicates, and we cluster together those predicates which got the
same counts. So in this case I've already clustered Bob and Charles, but you can
verify that they would have got the same counts. Since they got the same counts,
they're indistinguishable at this step of the construction, and you combine them
together.
And all the other nodes will now combine into the final box, because they have
the same counts for all the superfeatures. Now you have the new supernodes;
join them together to get the new superfeatures, and so on. In this case this is
essentially the final step: if you do the join one more time you'll get the final
superfeatures. And, more importantly, you can see that you have discovered the
intuitive clustering of nodes: Smokes of Anna; Smokes of Bob and Charles; and
Smokes of all other people. That is what we were looking for. So that is
basically the lifted network construction algorithm.
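Here is a rough Python sketch of this construction loop -- essentially iterated
refinement of clusters by projection counts, hardcoded to the one-formula
example above; the representation is illustrative and far less optimized than the
real algorithm:

    from itertools import product

    # Lifted network construction sketch for:
    #   Smokes(x) ^ Friends(x,y) => Smokes(y)
    def lifted_clusters(people, evidence):
        atoms = [("Smokes", x) for x in people] + \
                [("Friends", x, y) for x, y in product(people, people)]
        # Initial supernodes: cluster atoms by evidence value (True/unknown).
        sig = {a: (evidence.get(a),) for a in atoms}
        while True:
            # Superfeatures: groundings grouped by their atoms' signatures;
            # project back by counting, per atom and argument position.
            counts = {a: {} for a in atoms}
            for x, y in product(people, people):
                ground = (("Smokes", x), ("Friends", x, y), ("Smokes", y))
                f_sig = tuple(sig[a] for a in ground)
                for pos, a in enumerate(ground):
                    key = (pos, f_sig)
                    counts[a][key] = counts[a].get(key, 0) + 1
            # Refine: new signature = old signature + projection counts.
            new_sig = {a: (sig[a], tuple(sorted(counts[a].items(), key=repr)))
                       for a in atoms}
            if len(set(new_sig.values())) == len(set(sig.values())):
                break   # no supernode split any further: converged
            sig = new_sig
        clusters = {}
        for a in atoms:
            clusters.setdefault(sig[a], []).append(a)
        return list(clusters.values())

    people = ["Anna", "Bob", "Charles", "Dave", "Eve"]
    evidence = {("Smokes", "Anna"): True,
                ("Friends", "Bob", "Charles"): True,
                ("Friends", "Charles", "Bob"): True}
    for cluster in lifted_clusters(people, evidence):
        print(len(cluster), cluster[:2], "...")

Under this evidence, the sketch should recover, among the clusters it prints, the
three Smokes supernodes just described: Smokes(Anna) alone, Smokes of Bob
and Charles together, and Smokes of everyone else together.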
And we have a theorem. This appeared this year at AAAI, in this paper, and we
show that there always exists a unique minimal lifted network, that the lifted
network construction algorithm finds it, and that running BP on this network
gives you the same results as running BP on the ground network.
Now, experimental results. In the paper we actually have results on only three
domains, but I'm working on another paper, so I have results on many more
domains. There are six domains that I'm going to present results on.
The first one is entity resolution; I think many of us might be familiar with this,
and I'm going to talk a little bit more about it. Given a database of records, you
want to identify which of the references refer to the same underlying entity --
that's the problem of entity resolution. Link prediction: we have this dataset of
professors and students from the University of Washington, with information like
who is a professor and who is a student, and we want to find out who is advised
by whom.
Then there's a dataset of protein interactions from the biological domain, and
you want to find out which proteins interact with each other. Hyperlink analysis:
this is basically finding out which pages are linked to each other, given
information about their topics. Image denoising: this is an image domain. You
have a binary image, there is some text in the foreground and the background,
there is some random noise in the image, and you want to separate out the
noise.
And then finally, the friends and smokers domain that I showed. So here are the
results. I'm comparing the ground version and the lifted version, and there are
three timings: construction time, BP time and total. Construction time is the time
taken to construct the network, BP is how much time it takes to run the belief
propagation algorithm, and the total time is essentially the sum of these two.
So as you can see, the construction time in almost all cases -- except for maybe
a couple -- is more in the lifted case. And that makes sense, because you have
to construct all the supernodes and superfeatures and then combine them, so it
takes some time, whereas in the ground version you can just ground out the
network directly.
But note that the BP time is now much, much less in all the cases, because the
lifted network is much, much smaller than the ground network, so it can be
much, much faster.
>>: How big are the datasets?
>> Parag Singla: How big are these datasets? So [inaudible], for example, is a
thousand records, so a thousand by a thousand pairs. UW-CSE is also, I think --
Matt can correct me -- a few thousand ground atoms.
The others are also of similar size. Image is much bigger, because you have a
400 by 400 pixel image.
>>: [inaudible] does it vary very much in terms of [inaudible]?
>> Parag Singla: You mean like how many variables are in each predicate?
Mostly we have two. There are a few in UW-CSE which have three, like
publication and TA, which involve professor, student and course. But most have
two or one, just like the smokes example.
>>: What's the size of [inaudible] for each one?
>> Parag Singla: The number of rules? Up to about 200. In some cases it's
only two or three rules. For example, in the image domain we just have two
rules: one says that a pixel is likely to have the same value as its observed
value; the second rule says you're likely to have the same value as your
neighbors. UW-CSE has about 94 rules. So it varies across that range.
Questions?
>>: So the friends and smokers domain, how did you generate the network for
that?
>> Parag Singla: So I did not explain that. Essentially what I did was, I decided
the number of people I wanted to have, and then randomly chose whether they
smoke or don't smoke, and then also used a random distribution for friendships.
I forget the exact details, but the idea is that you have clusters of people, and for
each cluster you decide whether you want to have a friends relationship between
them or not. That was the idea.
>>: I have a more general question. I don't know if this is a good time or not. It
seems like the technique works best when the domain breaks into little pieces?
>> Parag Singla: Exactly.
>>: So to what extent do they need to break up? Would the ground Markov
network have to have completely separate, distinct clusters, or can it do better
than that?
>> Parag Singla: I think it does much better than that. The idea is, basically,
looking at this example, it doesn't say that Anna and Bob and Charles and all
those are independent; they'll certainly be connected. It's only identifying that
all these nodes can be treated as one cluster -- that is, they would behave
similarly with respect to the rest of the network.
So certainly all the interconnections are there. I'm not saying your graph is
disconnected. That's a very important point, actually: the graph is completely
connected in the original case. What we're saying is that all these nodes
would have behaved in exactly the same fashion, so why don't you pass one
message instead of passing N messages? That's the key.
>>: In the friends and smokers one that you generated, does it end up being
that each supernode is a different number of smoking friends, effectively?
>> Parag Singla: I did not actually look at what the clusters were, but it's
basically a similar idea: all of those people who had the same smoking evidence
and were connected to, say, the same number of friends would end up in the
same supernode. You might think it would split further because evidence
propagates, but it turns out it doesn't really split much further -- as in the
example, actually.
>>: One follow-up question. So if I have this right, the network size is essentially
polynomial with respect to the arity: if you have a thousand proteins and some
predicate with arity three, that's a thousand to the third groundings. So it is
going to grow that way? Message passing is what we've optimized down here.
>> Parag Singla: Right. So we are optimizing both, because we're also trying to
reduce the ground features. The ground features would be a thousand cubed,
but the lifted features, the superfeatures, will be much fewer.
>>: But you will still need to calculate every single ground feature -- that's the
polynomial time?
>> Parag Singla: Yes, that's a very good question. Actually, I kind of skipped
that detail, but it's a very good question. What we do is, we do not really
construct all the ground features either. Let's say you start with Smokes(x)
implies Cancer(x), that example. Initially we do construct all the ground
predicates -- so up until that point it is true. But then we cluster them, and when
we do the join, we do the join only on those clusters.
So you do not really need to construct all the ground features, because you now
have a compact representation for the ground predicates, and you can do the
join using that compact representation. You can do even more optimizations:
suppose, you know, most of the atoms are false; you do not even need to look at
those explicitly. Anything which is not true or unknown, you do not really need to
represent. Does that answer your question?
>>: Yes. For the smokes example, if you had a million people, you'd have a
million squared initial graph you would have to address before you could
actually optimize it? For friends, you'd have a friends atom for every single pair?
>> Parag Singla: I guess what I'm saying is, let's say you had a lot of people,
and you knew that most of the Friends(x, y) atoms are unknown. You don't really
need to construct those explicitly, because what you can do is construct the
compact representation for the true and false cases, and everything unknown
gets a default value, so you do not have to represent it -- it's just like the
closed-world assumption in databases, the same idea.
Right. So these are the results. You can see that overall, lifted BP is much,
much more powerful than the ground version. In some cases it's phenomenal --
for example, the hyperlink domain. Because what we're doing in that simple
example is not using the word information; we're just using the fact that certain
topics imply a certain probability of being linked to each other. So if there are 10
topics, there are only about 10 squared possibilities. The superfeatures are
really, really big, as in they have a lot of ground nodes clustered in them, and it
runs really, really fast; you don't really need to construct the whole network.
And -- so BP does not always converge, but the results are pretty good in all the
cases, and this is after a thousand BP iterations. As I said, the results are
exactly the same in the ground and the lifted case.
>>: Did you compare the accuracy of the BP versus, sort of, MCMC?
>> Parag Singla: Right. I do not have those results here, but they're quite
comparable. And this is the number of features, and physical memory; I think
that addresses some of the questions that were asked.
As you can see, the number of features is certainly much less in the case of
lifted BP, and that directly translates to physical memory savings. I think it is a
valid question, because many times you do have to actually construct the
ground network. But, as I said, in many cases, because of the compact
representation from the very beginning, you can in fact save a lot of memory,
and I think in almost all the cases we do save a lot of memory. I should point out
that for UW-CSE and image, running the network construction until the end
actually used more memory.
So I stopped after three or four iterations of lifted network construction, which
gives exactly the same results and runs much faster. For all other domains I ran
the lifted network construction until the end. For those two domains, the results
here are after stopping the construction at three or four iterations, which gives
the same results.
>>: Can you construct cases where your proposal is no better, or even worse, in
either time or space than the baseline?
>> Parag Singla: Yes, that could happen; in the worst case that could happen.
If you have to ground out the whole network -- although I'm not sure how
common that would be in general -- then you'll spend extra time constructing the
supernodes and superfeatures only to realize that you have to ground out the
whole network anyway. So it will be a little slower.
>>: Could be a little slower. Space-wise it would be no worse?
>> Parag Singla: Space-wise, I think it depends on your representation. If you
have a very poor representation for the supernodes and superfeatures, then yes,
basically it would be about the same, and in any given iteration it could take
more space because you have to represent the supernodes and superfeatures.
So, finally, this is the last result on this. This goes back to some of the questions
that were asked, but I also experimented with how the lifted network
construction varies as I increase the number of objects in my domain. Note that
in friends and smokers, as I vary the number of people, the lifted network size
remains about the same, because essentially the number of clusters is the
same. The ground version keeps growing, so this is on a log scale: the number
of features is on a log scale against the number of objects.
You can see the green curve grows with the number of objects, but the red
curve stays constant, because the final number of features that you have is
essentially the same. So this is, again, a nice property: your domain is getting
bigger but the final number of features stays essentially the same.
>>: Is it linear, does it have the same [inaudible]?
>> Parag Singla: The growth is almost linear, actually.
>>: But given your model of pairing --
>> Parag Singla: Yes, exactly. Because I'm using the same model, which
means, more or less, there are only a fixed number of clusters the nodes will fall
into.
So I think in the interests of time, I'll probably skip the learning part. Basically
there's some work I did on how to learn the parameters. I guess I'm done with
the lifted BP part, which is the crux, or one of the main ideas. I'll skip the
learning part -- basically how you learn the parameters in Markov logic; I wrote a
paper on that, and you can talk to me after the talk. So now I'll describe a couple
of applications that I used Markov logic for. The first one is --
>>: Can I interrupt? One question about the learning: was there anything you
did independent of the BP inference stuff, or are they tied to each other?
>> Parag Singla: That's a good question. It happened, at least chronologically,
that I worked on the learning before the lifted BP, so it was independent in that
sense. But now that we know how lifted BP works -- since learning uses
inference as a subroutine -- we could use lifted BP within the learning.
>>: The learning can also work without it -- they're independent.
>> Parag Singla: Right, you can plug in any black box inference for the learning,
which could be lifted BP. So, applications: I'll describe two applications.
The first one is entity resolution. I briefly talked about this. Data integration is
the first step in the data mining process.
When you merge data from a lot of different sources, you typically end up with
duplicates. You want to resolve those duplicates before you can do any effective
data mining. For example, if your papers come from different sources, the
authors may be spelled differently, titles may be missing, venues may be
abbreviated. So you want to resolve them before you can do any effective data
mining. Entity resolution is the problem of identifying which records or fields
refer to the same underlying entity.
This is a very well known problem in the literature. The original model was
proposed by Fellegi and Sunter back in the 1960s, and it's a simple model. The
idea is that you make each pairwise decision independently. You take each
record pair, compute the similarity between their attributes, and if the similarity
is more than a threshold, you declare a match; otherwise it's a non-match.
There have been many improvements on the original model, but most of them
still make this pairwise independence assumption: that each pair can be
resolved independently of the others.
But over time people have realized that it helps to take dependencies into
account: resolving one pair can help resolve another pair. So this is what we
incorporate. The problem is that, even though these approaches take
dependencies into account, they have all been developed as stand-alone
systems, and they address different aspects of the problem. There's no unified
solution to all of this, which is what I'll try to present using Markov logic. Markov
logic provides this nice paradigm. The idea is simple: you use weighted first
order formulas to give the domain theory, and for each first order rule you have
a weight which tells you how important that rule is.
We used some hand-coded rules, but you could always learn them using
structure learning in Markov logic. And this combines many different approaches
very seamlessly. In particular, there's an approach which introduced transitivity
between different pairs; you can write transitivity as a single rule in Markov logic.
We have one framework which combines them, and any new approaches can
also be combined into this framework.
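For example, entity resolution rules of the kind being described might look like
this in Markov logic (illustrative predicates and weights, not the exact theory
used in the experiments):

    2.1 : \mathrm{SimilarTitle}(c_1, c_2) \Rightarrow \mathrm{SameCitation}(c_1, c_2)
    1.3 : \mathrm{SameCitation}(c_1, c_2) \wedge \mathrm{SameCitation}(c_2, c_3)
          \Rightarrow \mathrm{SameCitation}(c_1, c_3)
    0.8 : \mathrm{SameCitation}(c_1, c_2) \wedge \mathrm{Venue}(c_1, v_1)
          \wedge \mathrm{Venue}(c_2, v_2) \Rightarrow \mathrm{SameVenue}(v_1, v_2)

The second rule is the transitivity just mentioned, written as one weighted
formula; the third is a collective rule that lets a matched citation pair propagate
evidence to its venue fields, which is exactly the effect in the example that
follows.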
So here is the example which demonstrates the power of building a collective
model rather than resolving pairs independently. This is a citation example. You
have authors, titles and venues, and you can see that intuitively you can figure
out that the first two citations refer to the same paper and the last two citations
refer to the same paper.
The first pair are reasonably similar to each other: the authors are similar, the
title is similar, and the venue is not similar, but maybe you are able to say the
venue is just abbreviated because the authors and titles match.
So, just from the correspondence between the titles, authors and venues, you
may be able to say that the first pair match. But now note that the second pair is
more problematic, because one of the authors is missing, the title is worded
very differently, and the venue is also abbreviated. So the threshold may not be
high enough to declare it a match.
But once you identify that the first pair is a match, you will know that "AAAI" and
"Twenty-First National Conference on Artificial Intelligence" are the same
conference: because you know the citations match, the venues must be the
same.
Once you have this information, you can really use it to deduplicate the second
pair. Since the same pair of venue strings also appears in the second case, you
can use this shared information to now say that they actually match, too.
So this is the basic idea of doing this collectively. We built a model based on
Markov logic, writing first order rules as I described; I'll skip the details and not
go into them here. As I said, you can really write a domain theory using Markov
logic which combines all these collective approaches. We did some experiments
on the [inaudible] datasets, and we showed that collective features help improve
performance. But we also showed that many of these previous approaches can
be seamlessly combined using Markov logic -- in many cases writing just one
formula to capture a previous approach for which a whole stand-alone system
had been developed.
Again, I'll be happy to talk about it in more detail. So, finally, the second
application. This is actually work that I did at [inaudible] research with Henry
Kautz and some of his colleagues. It is about predicting social relationships in
consumer photo collections.
So here -- this being a camera company -- the idea was: can you use something
like Markov logic to build smart cameras? Let's say you have a bunch of pictures
from a user and you want to identify the various social relationships in the
pictures. In this case, in the left picture, this person may be interested in saying:
go look at all my pictures and find out what kind of kids my children are hanging
out with -- are they in bad company, are they in good company, what sort of kids
are they hanging out with?
Now, no doubt it's a very difficult problem. You don't know who these kids are,
or who the other kids are -- how do you do that? But if you show these two
pictures to a human, he or she may have a very good guess that the kids in the
left picture are probably his own children and the third kid is not his child,
because we know that typically children tend to be photographed with their
parents, and friends appear together. So the kid who is only in the left picture is
probably a friend of these two kids, and these two kids probably belong to this
person.
Again, we are not sure, but these are good rules of thumb. Similarly, you could
have other rules of thumb saying that friends appear together; that parents are
older than their children, which is a hard rule; that relatives appear together;
that grandparents like to appear with their children; and so on. You can write
various rules that are not always true, but which give very good information
about the underlying domain.
So we constructed -- we handcrafted -- an MLN which had about five hard rules,
for example that parents are older than their children, and 14 soft rules. As I
mentioned, the weights of the soft rules were learned using some training data:
we had about 13 volunteers who labeled their photographs with the
relationships in them, and there were about 48 total images. What we wanted to
do was predict seven different relationships: parent, child, spouse, relative,
friend, child's friend and acquaintance.
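To give a sense of the rule language, the hard and soft rules being described
might look like this (illustrative formulas, not the actual handcrafted MLN):

    \infty : \mathrm{HasRelation}(x, y, \mathrm{Parent}) \Rightarrow \mathrm{Older}(x, y)
    w_1 : \mathrm{OccurTogether}(x, y) \wedge \mathrm{Kid}(x) \wedge \mathrm{Adult}(y)
          \Rightarrow \mathrm{HasRelation}(y, x, \mathrm{Parent})
    w_2 : \mathrm{OccurTogether}(x, y) \wedge \mathrm{Kid}(x) \wedge \mathrm{Kid}(y)
          \Rightarrow \mathrm{HasRelation}(x, y, \mathrm{ChildFriend})

The first is a hard rule (infinite weight); the soft weights w_1, w_2 would be
learned from the labeled photographs.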
I should mention that this was an initial venture into the problem, so we
assumed that face detection and identification had already been done. That is
again a big assumption, but you can imagine constructing a bigger model where
you fold that in as part of the model. For the results I'm going to show, we
assume we know who the faces are, and we just want to identify the
relationships among them, given those faces.
And we compared five models. Since nobody, at least in the published literature,
had done this before, we compared against very basic models. The first one was
random: we randomly predict each relationship with uniform probability. The
second predicts based on prior information in the data; for example, if most of
the persons appearing in photographs are kids, you are more likely to predict
the child relationship. Then hard constraints: you use only the hard rules, for
example that parents are older than their children.
Then hard and prior combined together, and finally the full MLN model with all
the soft and hard constraints and the weights learned for the soft constraints.
Here are the results. This is basically the recall curve comparing all the models.
As you can see, all of them are pretty low, which means it is a hard task.
But still, you can see the MLN curve dominates all the other curves. And most
interestingly, the random curve is at about .14, which is the baseline: there are
seven relationships, and one over seven is about 0.14. And then each additional
model gives you more and more information: the prior is better, hard is better,
and the MLN combines all of them and adds some extra power to give you the
best model. As I said, these were initial experiments; you could certainly try to
improve the model. We did not learn the structure of those rules, so we could
use training data to learn the rules themselves, and that could improve the
results even further.
So that was the second application. And just to mention a few others: these are
some of the other applications people at various places have been working on --
information extraction, and many of those papers have been published recently;
link prediction, collective classification and many others. I'll be happy to talk
about some of these after the talk; I mentioned just the couple that I have done.
And finally, all of this has been developed in the Alchemy software, developed at
the University of Washington by many authors besides me. It gives you the
whole first order logic semantics of Markov logic, has the inference algorithms I
just described, and also has structure learning to learn those rules. This is the
website.
And, finally, conclusions and future work. In conclusion, I tried to present the
case that unifying the statistical and logical approaches is an important step
which could really speed up progress in AI by providing the interface layer.
Markov logic could be one potential choice: it's a combination of logic and
probabilistic models, and very simple and powerful. Various algorithms were
presented for efficient learning and inference; I should definitely mention that
there are many more learning and inference algorithms which I did not really get
time to talk about. And there are many applications.
And, finally, coming to future work: there are many directions, but I'm interested
most in generalizing the framework for lifted inference.
How could you extend the lifted BP framework to other algorithms like MCMC?
I've been working on how to give a general framework for lifted variable
elimination, which is exact; but there's also extending it to MCMC, and then
connecting it with resolution. So we have resolution in first order logic -- what is
the connection between lifted BP and resolution? How could you use this in
potentially infinite domains? And the third direction is identifying substructures
in the network for efficient inference.
So here the idea is that it could be possible to break up your network into
various parts where inference in some of the parts is really simple. For example,
you could have a linear chain in most of the network, which you could solve
exactly using something like Viterbi, but some part may be complex. You could
combine those two, basically doing inference separately on those parts and
then combining the results. So that could be one potential approach for doing
fast inference. In general, I'm interested, a little bit longer term, in developing a
comprehensive theory of lifted inference -- and, of course, learning, because
inference is a substep of learning.
And the intuition that I have is connected with human perception.
For example, let's say a human being is taken to a new place, and he or she is
asked to open their eyes for a second and then close them, and then asked:
what did you see? They could say, oh, I saw some very big building and there's
a parking lot on the left, but they may not be able to tell you more. Now, if you
ask them to open their eyes for five seconds and then ask what they saw, they
could say, oh, I saw seven buildings and there are two parking lots and some
cars here and a couple of roads. If you give them more time, they could tell you
what those buildings were, how high they were, how many floors they had, and
so on.
So can we do something similar for probabilistic inference? Given the time you
have, you could start with a very crude approximation for the nodes -- like the
clusters I showed, starting with a crude approximation -- run some basic
inference, refine the clusters as you go along, and then, depending on the time
you have, you could actually give the exact results given sufficient time.
And of course this has a lot of applications in [inaudible] recognition, biological
data, and so on. So that's pretty much it. Thanks. Any questions, I'll be happy to
answer.
>>: Is there any interaction with junction trees when you're doing this
inference?
>> Parag Singla: That's a good question -- actually an interesting connection. I
mentioned lifted variable elimination: I'm working on a paper which gives a very
generic framework for this idea of splitting the nodes, and how it ties in with
junction trees and lifted BP. The idea is similar. In junction trees also you can
sort of -- I think it's easier to think in terms of variable elimination first.
So the idea is similar to bucket elimination -- there is a paper on bucket
elimination which gives a framework for variable elimination. The idea is that
you have these supernodes. You start with the same kind of supernodes, you try
to eliminate them, and at each elimination step you see whether you really need
to refine them or not. So you start with very crude supernodes, and at every step
of elimination you see if you really need to refine them, or whether all of them
can be eliminated in one step. And this idea is similar to lifted BP in that sense.
And it turns out that it really helps. You may have come to the talk Rodrigo gave
on lifted variable elimination -- I think some of you were probably here -- so that
ties in with some of the work on lifted variable elimination as well.
Any other questions?
>>: I have a question. When people are authoring these rules, you could easily
add a rule that's simply going to make the inference take 10 years where before
it was going to take a minute. I think there's a similar situation in databases,
where you add a certain query and it's going to take forever. Is anyone looking
at trying to estimate how long a computation is going to take, or at giving some
kind of feedback that this rule is the reason it's taking so long -- maybe if you
break it up in this way it will take a lot less time? Do you know if anyone is
looking at that?
>> Parag Singla: No, I don't think so. I think the knowledge is more on the
engineering side: people try it out, they have this intuition about a rule, and then
they just sort of throw it out.
But I think some of these things could be automated. For example, you certainly
know that when you have more variables it really blows up. So those things --
>>: If you could know, for this rule, that it caused this many groundings or this
many messages to be passed, you could probably see which one was the
problem -- that would be kind of interesting.
>> Parag Singla: Yes.
>>: Like you're saying, that's kind of an engineering thing, though.
>> Parag Singla: Yeah. We could certainly think of developing a theory, but I
don't think anybody has, at least not to my knowledge. Yeah.
>>: I think it's been tried in [inaudible] research. I think it's unsolved.
>>: That would make sense if they were doing it for that.
>>: But I'm not quite sure. I think it's unsolved -- unsolvable in the general case.
>> Parag Singla: I think that's an interesting question, because sometimes
people think that doing this kind of inference is harder. But I think once you
come to the domain of approximate inference, many of these things you do not
really need to do exactly, because you're able to trade a little bit of accuracy for
good time and memory efficiency. So I don't know -- some of these things may
be applicable, for example the things that people have tried before, but I'm not
aware of any.
>>: I think it comes down to estimation, too. You don't need to do it exactly.
>> Parag Singla: Yeah.
>>: Because you're lifting -- I don't know, I get a feeling that to do exact lifting
seems to be equivalent to saying that two algorithms are equivalent, which we
know is an unsolvable issue in the general case. So you're taking a stab at it,
making estimations and getting close enough that you actually get the results.
>> Parag Singla: In a sense -- lifting, as I described it, will give you exactly the
same result, provably the same result, as ground BP. I'm not comparing two
different algorithms; it's the same algorithm.
>>: But BP is an estimation.
>> Parag Singla: Yeah.
>>: So if you take -- like you said, if you take your Markov logic and give it all
infinite weights, then you're equivalent to -- I'm sorry -- first order logic, and
proving that two terms are equivalent in first order logic is impossible.
>> Parag Singla: Right. So I think --
>>: But you're going to get close enough to where you're interested.
>> Parag Singla: I think the place where this work comes in is saying that you
can use resolution, which is faster than propositionalizing your first order logic.
What resolution can solve is a different question. I think lifted BP versus ground
BP is more comparable to resolution versus propositionalization than to a claim
that inference itself is easy.
What it is saying is that in resolution you can eliminate a lot of constraints --
potentially an infinite number of constraints -- in one step, instead of really
grounding them out and doing the same thing. But it doesn't say anything about
the inherent hardness of inference in first order logic. Even using resolution,
some things may be hard, and that is true in this case too.
>>: Which is why you have the case you mentioned in passing, where you could
end up in a situation where you have to ground out the whole network.
>> Parag Singla: Exactly.
Thanks.
[applause]