>> Chris Burges: So it's a pleasure to welcome Jerry Zhu. He's an assistant
professor at Wisconsin, and is going to be talking about adding domain
knowledge to latent topic models.
> Jerry Zhu: Thanks, Chris. So let's see, so first let me thank my
collaborators, David Andrzejewski, my former graduate student who did most of
the work and made, actually made most of the slides here.
He's at Livermore Lab now. And then Mark Craven and Ben Liblit and Ben Recht
all helped with the project and funding agencies.
Let's see, how many people are familiar with latent Dirichlet allocation here?
Just wanted to get a sense.
But let's start from somewhere else. Let's start from Times Square.
New Year's Eve. What happens there on New Year's Eve?
This is --
>>: People partying.
> Jerry Zhu: People party and this big ball is going to drop and what you
probably do not know is for the past few years the organization which runs the
party also set up a website to let people type in their New Year's wishes.
It's a one line thing: type in your New Year's wish and they will print it, cut it as confetti, and drop it when the ball drops. That's very nice. Fortunately enough we got ahold of the dataset. So we have people's wishes. Can you take
a guess what is the most frequent wish?
>>: Peace on earth.
> Jerry Zhu: Right on. So the most frequent one is peace on earth. Now, here
is a random -- [laughter] -- random sample of wishes. Okay. Let's see. So
the university, my friends, this particular dataset is from 2008.
Find a cure for cancer. To lose weight and get a boyfriend so on and so forth.
So since we will be talking about latent topic modeling, we're going to treat each wish as if it's a mini document, and we're going to look at the topics it belongs to. So you can get a sense like the first one is about peace. This
one is about getting education. That's about war. This one's a little bit
interesting because it has a mix of two things. And that's what topic models
are good for. It can model documents which contain multiple topics.
Now, as we are all familiar with, if you do a -- can we lower the lighting a
little bit?
>>: Yes.
> Jerry Zhu: So what you're seeing here is a simple frequency analysis of the
whole corpus. Thank you. That's very -- that's fine. Can people see that?
It's okay.
So as you can imagine, like the word "wish" is very frequent. Peace, love,
happy so on and so forth. But it has many things mixed in it, right?
So the goal of latent topic modeling is to say, well, let's throw the data at a
model which we will explain later on and we will get some interesting stuff out
of it.
I'm going to show you some topics that we recovered from using latent Dirichlet
allocation.
And by a topic, I mean a multinomial, a unigram distribution over your whole
vocabulary.
What I'm showing you here is a distribution or a unigram of probability of a
word given that particular topic and the size, of course, corresponds to the
probability. You can get some sense that the first one is talking about
troops.
So troops coming home.
God bless our troops and so on.
This one you see Paul, Ron Paul, for some reason his name showed up a lot in
2008's New Year's wish. This one is about love and so on and so forth.
All right. So latent Dirichlet allocation, or in general topic modeling,
has interesting applications and mostly I would say in the form of exploratory
data analysis, just to see what's in the data.
The way to think about it is that it's very similar to clustering. But instead of clustering a big document as a whole, you're looking inside it to see what composition that document has.
Okay. So there have been a lot of different ways that people use this. Okay. Before we review the model, let's have a really quick statistical review.
So a Dirichlet distribution is a distribution over the D dimensional simplex, which is the set of probability distributions. So you can think of it as a dice factory: you can draw a die from it, and it will be a D-sided die. In this example we have a three dimensional Dirichlet with parameters 20 by 5. The draws from that are going to be probability distributions over three items.
Now, one way to see that is to draw the triangle, the simplex here, where A, B, C are the three items. And the blue dots are samples from this Dirichlet. So if a dot is close to A, that means the probability of A is close to one, and so on and so forth. So from a Dirichlet you sample a die, which is a multinomial. In this case we show this particular multinomial, (.6, .15, .25), which corresponds to the red dot there. From a multinomial you can sample a document.
So that's the bag of words model. And in this case let's say the document length is 6; you happen to get this. Okay. One thing that's good about the Dirichlet is that computationally it's very convenient, since the Dirichlet is conjugate to the multinomial, so people like it. So with that, let's quickly review the LDA model.
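Just to make the die-factory picture concrete, here is a tiny Python sketch of it (the numbers are made up for illustration and are not the parameters from the slide):

    import numpy as np

    rng = np.random.default_rng(0)

    # A 3-dimensional Dirichlet "die factory": each draw is itself a
    # probability distribution (a die) over the three items A, B, C.
    alpha = np.array([2.0, 5.0, 5.0])   # hypothetical concentration parameters
    die = rng.dirichlet(alpha)          # one multinomial, a point on the simplex

    # From that die, sample a bag-of-words "document" of length 6.
    counts = rng.multinomial(6, die)    # counts of A, B, C in the document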
Many of you are familiar with that; I just want to remind you of the symbols. It gets a little bit messy. This is a generative model of a whole bunch of things given two hyperparameters, alpha and beta.
This is how it works when you want to create a corpus; this is what the generative model is about. So you start by saying: I want big T number of topics. So right here, imagine you have, like, 100 topics you want to use. Each topic will be a multinomial distribution, or think of it as a die over the vocabulary. The way you generate each one of those dice is by sampling from a Dirichlet with hyperparameter beta. Now, this beta has the dimensionality of your vocabulary, which could be, let's say, one million, a pretty big thing. By doing that, you now have a die with one million faces, and each face has a word on it.
You have one die, and you sample 99 more dice from that Dirichlet. So that's first how you prepare the topics. Then with those dice you're going to generate documents, and it goes as follows. For each document that you're going to generate, first you sample a parameter called theta from a different Dirichlet distribution, with parameter alpha. And this Dirichlet distribution has dimensionality 100. Okay? It's going to be a mixture over those topic dice. For this document you are trying to decide how often you are going to use each topic. So that's what that die is for. Once you have that theta die, the 100-face die, you go through each word position in your document and do the following. For the first word position you sample its topic index, or in other words, which topic die am I going to use for this word position. This is sampled from that multinomial theta, the 100-face die. So let's say it's topic 33. Then you actually generate the word at position one by sampling from the word die, the million-face die, the 33rd of those topic dice.
You throw that die, get a word, and then you repeat. In this way you generate the whole document and the whole corpus. So that's the generative model. And the way we use this generative model is by conditioning: for simplicity, assume you know alpha and beta, the hyperparameters; then, conditioned on the corpus W, you want to infer either z, which is, for each word position, which topic it came from, that hidden quantity, or, perhaps more interestingly, you want to infer phi, which are the 100 topic dice, and theta. I would say phi is the more interesting one.
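As a rough sketch, the whole generative story just described can be simulated like this (toy sizes and hypothetical symmetric hyperparameter values, only meant to mirror the dice metaphor, not the actual corpus):

    import numpy as np

    rng = np.random.default_rng(0)
    V, T, n_docs, doc_len = 1000, 100, 5, 50   # vocabulary size, topics, toy corpus
    beta, alpha = 0.01, 0.1                    # hypothetical hyperparameter values

    # Step 1: sample T topic dice phi, each a V-faced die, from Dirichlet(beta).
    phi = rng.dirichlet(np.full(V, beta), size=T)      # shape (T, V)

    docs = []
    for _ in range(n_docs):
        # Step 2: per document, sample its topic mixture theta from Dirichlet(alpha).
        theta = rng.dirichlet(np.full(T, alpha))       # a T-faced die over topics
        words = []
        for _ in range(doc_len):
            z = rng.choice(T, p=theta)   # pick which topic die to use at this position
            w = rng.choice(V, p=phi[z])  # roll that topic's word die to get the word
            words.append(w)
        docs.append(words)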
I put a little reminder there on top.
So if you want to remember what those
symbols are, okay, as we go along. All right. Any questions? No. So please
feel free to interrupt me at any moment. All right. So what is this talk about? Now, LDA by itself is an unsupervised learning method. You give it a corpus W. You actually have to give it something more: you have to give it alpha and beta, and maybe the number of topics, big T, but you do not give it labels. So it's unsupervised. And it's going to recover those parameters phi and theta for you.
In this sense it's very similar to clustering. Now, because it's unsupervised,
sometimes the result can be unsatisfactory. In particular, a domain expert who is interested in a particular scientific problem, think of, like, a biologist looking at PubMed abstracts, might have some particular idea in terms of what kind of topics he is looking for. He has some prior domain knowledge with which he wants to constrain the latent topic modeling process.
And this is very similar to clustering, where in many cases a domain expert has some knowledge of which items need to be grouped together, for example.
So there is a need to incorporate domain knowledge into the latent topic
modeling process.
Now, there have been a lot of variants of LDA which allow people to incorporate different kinds of knowledge. But I would say most of them have a problem in that they need intensive machine learning hacking to get the model to work.
And that means if a biologist is interested in incorporating a particular kind
of domain knowledge, he needs to first find a machine learner.
He cannot do that himself. So the talk that I'm going to present is about how
to do knowledge plus LDA in a user-friendly way so that anybody, any scientist,
not machine learning researchers, can incorporate their knowledge.
And we're going to talk about three models: number one, Topic-in-Set; number two, Dirichlet Forest; and number three, Fold.all. So, the first one. I'm going to start with a slightly unconventional example, not this one about documents but about software debugging. And some of the work was done at Berkeley and here at Microsoft by Alice Zheng.
Here's the idea. You have a software like, I don't know, like IE. You want to see -- there are bugs. You want to see where they are, and so on and so forth. And one way to do it is to insert a lot of probes into the code. So you instrument the code. And the way you can think of that, so the yellow things are probes. They're specialized counters.
So we can imagine like at different locations you insert different counters,
and whenever the program gets there, the counter gets added by one and so on
and so forth. And at the end of the program, no matter whether the program
crashed or not, you are able to recover those counts.
And let's also assume that you just want the counts, the aggregate of how many
times this line is executed, you're not interested in or you cannot keep the
actual executing sequence of things. So it's very much a bag of words notion.
So this is a setup where the predicates act as words. You have a bag of predicates, and a software run would be a document. Now, for this problem, we actually have some extra knowledge, which is whether each run crashed or not. So you have a label, and you have a bag of words. A straightforward idea, if you want to do latent Dirichlet allocation on this, is to collect all the failed runs, the crashed runs, run LDA on them, and say: could you please recover 100 topics; I'm going to look at those topics. The hope is that maybe some of those topics would be indicative of bug behavior, like particular lines where something is abnormal in some way.
So, yeah, that's a very reasonable goal. But what happens in reality if you
try that is you don't get nice bug topics. Instead, you would get topics that
are in some sense much stronger. Those are normal usage topics, like different
functions, maybe you opened up a Web page, or you print a page and so on
corresponding to those kind of things, and those are much more frequent, and
therefore they dominate the topics you're going to get.
To illustrate that, I'm going to show you a synthetic example. So the way to
see this is I'm creating a figure which is a 5-by-5 square and each pixel
corresponds to a predicate. So I have 25 different predicates. And I'm
actually going to generate software runs.
So the way I generate it is by using a bunch of topics, which are multinomial
distributions that will generate pixels here, here, but not there. So that's
that topic. I have, I don't know, eight topics there, and three bug topics
here.
So I can sample documents, both successful documents, runs, and crashed
documents from that.
And this is what you will see. So, remember, a document is a mix of all
topics. It's very hard for us to see what's actually going on here. And you
don't see those things.
>>: The LDA completely independently, like you're not paying attention at all to the other set when you're doing --
> Jerry Zhu: So far I'm generating it. So I'm using the LDA model to generate documents.
>>: But you're doing the decomposition independently for label one versus label zero; you've just taken the group label one, not paying any attention to the --
> Jerry Zhu: Yes, yes. So in fact I think this will answer your question.
Now I'm going to run LDA on the crashed runs. There are more than this, but I'm going to run it on those. And here are the topics that I recover. Notice I actually used the correct number of topics. But I made the dataset so that the usage topics are more frequent, and therefore they somehow mask the bug topics. You don't get those nice things.
All right. So what should we do? So here's a fairly simple idea. Since we
know which runs crashed and which did not, we can actually jointly model both
kind of documents. And here's intuitively what I want to do. Let's say I want
to use big T topics overall, but I can sort of set aside a smaller number of
topics, small T, and say that those are usage topics and the remaining ones are
bug topics.
Very importantly, if a run succeeds, I'm going to say that for all word positions in successful runs, it can only use usage topics, and it cannot access those bug topics.
While for crashed runs I will allow them to use any topics that they want.
So this is an additional constraint on top of standard LDA.
The hope is by requiring this is the case, you force the first small T topics
to model general software behavior, and then the remaining ones, hopefully,
will capture the remaining bug behavior. Yes?
>>: Is there any constraint or cost in terms of the difference in distributions
between the different, the sets there?
> Jerry Zhu: No, we didn't do anything at all. It's just as simple as this.
>>: They could overlap.
> Jerry Zhu: They could overlap, definitely. We didn't do anything [inaudible] that's constrained as this. Now, this is the new result. And so if you do this and say, okay, let's have the first eight be the usage ones and only these can get the buggy runs, you actually get something quite sensible. So this is a very simple example to get started with. And here are some actual run results.
>>: To emphasize one more.
> Jerry Zhu: Yes.
>>: Question, why not have the two sets of topics disjoint? Why not say one to two plus two --
> Jerry Zhu: Right. Because both runs would contain normal usage. So for
like both runs somebody would open the browser or do things like that, so you
have the common element. You just want to group the common element in the
first small t and let remaining ones explain bugs.
Yes?
>>: The LDA documents --
> Jerry Zhu: Yes.
>>: -- would this be the same as modeling the background distribution?
> Jerry Zhu: Yes, it certainly sounds a lot like that. In fact, I imagine one can explain it that way. So the first small t is the background model and the remaining ones are special.
Okay. I'm going to skip this. This is real software; there are people whose actual job is to insert bugs into programs, and we were able to distinguish those. You can easily extend this idea to the more general Topic-in-Set constraint, where you say: here's our additional knowledge, for each word position i in the corpus we can set up a set C_i, which can be arbitrary. It's a subset of all topics, and you constrain it such that the topic index for that position must be in that set.
So this can actually encode quite a few different kinds of domain knowledge. For example, C_i could be a singleton; that would be like saying this word must be topic three.
Okay. And this is very easy to do. So one inference method for LDA is so-called collapsed Gibbs sampling, where you are given W, alpha and beta. You have to infer Z, and you also need to integrate out phi and theta; that's why it's called collapsed. And you do this in Gibbs sampler fashion: you go through each word position and resample its Z, one at a time. Now, the black part is standard LDA collapsed Gibbs sampling, but the Topic-in-Set knowledge you can easily encode by inserting an indicator function there. So it's an indicator of whether the value of this z_i, the topic of the ith word, is in the specified set: you let this term be 1 if it is, otherwise it's 0. It essentially forbids the Gibbs sampler from taking values outside the set. So you can do that fairly efficiently. And you can also relax this hard constraint to something that's slightly softer, so it's not strictly in the set but you pay a penalty there.
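As a rough sketch of how that indicator slots into one collapsed Gibbs update (simplified bookkeeping and hypothetical variable names, not the actual implementation; for the debugging example, C[i] would be the usage topics for positions in successful runs and all topics for crashed runs):

    import numpy as np

    def sample_topic(i, d, w, z, C, n_tw, n_dt, n_t, alpha, beta, rng):
        """Resample the topic of word position i (word type w, document d),
        restricted to the allowed topic set C[i] of a Topic-in-Set constraint."""
        T, V = n_tw.shape
        old = z[i]
        # remove the current assignment from the count tables
        n_tw[old, w] -= 1; n_dt[d, old] -= 1; n_t[old] -= 1

        # standard collapsed-Gibbs weights, times the indicator 1[t in C[i]]
        p = (n_tw[:, w] + beta) / (n_t + V * beta) * (n_dt[d] + alpha)
        mask = np.zeros(T)
        mask[list(C[i])] = 1.0          # forbid topics outside the allowed set
        p *= mask
        p /= p.sum()

        new = rng.choice(T, p=p)
        n_tw[new, w] += 1; n_dt[d, new] += 1; n_t[new] += 1
        z[i] = new
        return new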
So this is a warm-up of how you could start to add interesting knowledge to topic models. Now, let's move on to the second model. It's called Dirichlet
Forest.
I'm going to actually present the result first because I think it's kind of
interesting. This is how people can use a model like this to do interactive
topic modeling.
So recall we have this New Year's wish corpus. If you run standard LDA on that corpus with 15 topics, T equals 15, you get something like this. And what we're showing here is that each row shows the most frequent words in that topic.
So let's take a look at those. One thing you might immediately notice if you
work in natural language is that we get a whole bunch of stop words mixed in
there. And this is usually undesirable, because the topics are not as pure as
or as informative as they could be.
Now, there has been a lot of ways to handle stop words, simplest one might just
be you preprocess your corpus so that you remove all stop words. But here
we're going to do something else. Imagine a user looking at this result and
say that, hey, we have too many stop words in here. Let's get a standard stop
word list and say that please exclude all those stop words from normal topics.
Let's sort of merge them into a single topic, isolate them. So this is an operation that a user can do. The user is going to do an isolate operation; in this case we actually use a 50-word stop word list. And we let the system rerun, but allow it one more topic, so now we have 16 topics in all. What I'm showing you here is the result of this rerun after we do the isolate operation, and these new topics, two topics, are now responsible for stop words.
So 2008 is a stop word because that's the year of the wish corpus.
Yes?
>>: Why are there two topics?
> Jerry Zhu: Yes, that's because all we did is -- you will see how we do the
isolate operation. And we didn't really say that please move everything into
this one topic. But we just simply say please make sure those stop words are
not in the same topic as other words.
It turns out LDA wants to use two topics to explain those stop words. Now, if you look at the result, it's much better. But you might notice this particular topic, which mixed go, school, cancer in with well, free, college. What happens is that it's a mixed topic of two things. One is you want to get into college. The second is you are wishing somebody to be cancer free. So these are two different things.
So we will be able to say, let's split -- let's split these eight words into two groups. So imagine those are like seed words for those two topics. You're going to say, let's split those particular eight words, and hopefully they will create new topics and drag other relevant words into their respective topics. And this is what happened.
So they're marked here. We allowed one more topic to accommodate that. And you see the green words are the ones we used in the split, but it also moved other relevant words in there. The reason you see mom/husband/son here is because
most people are wishing like we hope mom to be cancer-free. So that's
considered the same topic.
Okay. So again, now you see the first one and the tenth one are kind of all about love. And you might want to say let's merge these two. So we can again
do a merge operation by specifying a few keywords. And if you do that and
rerun the whole thing, now we have a single topic there which is much more
pure, except lose weight.
Somehow that's tied to love. And you get the idea. So you can use this in a
way that's interactive to shape the topics that you want.
>>: When you do a merge operation do you say the topics merge or two sets of
words?
> Jerry Zhu: Two sets of words.
>>: Is that different from isolate?
> Jerry Zhu: So you are asking for two sets of words which were originally -- I mean, it's not quite different. I will show you how it's encoded differently. It certainly has the same flavor, but it's encoded slightly differently. Okay.
And you can do this by changing the standard LDA model slightly. So recall this is where we sample the topic dice, the one-million-sided dice. Instead of using a Dirichlet dice factory, I'll use something called a Dirichlet Forest distribution, which will give you something that cannot be achieved by a Dirichlet.
So what is this good at? It's particularly good at encoding pairwise relations in the form of, for example, must-links. This is a term borrowed from clustering, but what it means here is that you want two word types, u and v, to be in the same topic.
But what does being in the same topic mean? Because a topic is a multinomial distribution over the whole vocabulary, any two words are going to be in there with nonzero probability, right? So by must be in the same topic we mean the following: we don't like the case where one word has high probability and the other has low probability in a particular topic. They should either both have high probability or both have low probability. That's it. So that's what we want to enforce with a must-link. And similarly you can enforce a cannot-link: you do not want these two words to be prominent in the same topic, but again they will both be present. So what we're really saying is you want one to be large and the other to be small, or both small. But you cannot have them both high.
>>: Maybe I'm jumping ahead here but it seems like these kinds of things are
difficult to put at the feature level but easier to put at the example level,
right? There are often cases of words that you forget I imagine in any domain
where the features sort of making these statements on features is very hard and
preclude what you want to do with data analysis.
> Jerry Zhu: I agree with you to some extent. Especially once you've exhausted the first few easy features, things get harder, and our model here does not address that. So here everything is strictly on the features.
All right. Now with these two you can create higher level operations. For example, split: you can say I have a group of words and I want to split them. The way I do that is by creating must-links within groups and cannot-links across them. Merge is very simple: just put must-links among the words. And for isolate, you put cannot-links from the words you want to isolate to all the high-frequency words in the current topics.
>>: So first you then need to encode for each, basically for every pairwise
set?
> Jerry Zhu: Okay. All right. So here's why you cannot do this with a Dirichlet. Let's imagine you have a vocabulary of school, college, lottery, just a three-word vocabulary for simplicity. What would a must-link between school and college look like? It means you want to generate a multinomial such that school and college either both have probability .5, or both .2, or both .1, or both 0; they're somehow tied together in this fashion.
If you look at the simplex, this is what the density looks like. You want
this. Okay? This, however, cannot be achieved by Dirichlet. If you try to
encode it with Dirichlet, you will get things somewhat like this. And the
reason why you cannot do that is because in the Dirichlet distribution, the
variance of each dimension is tied to its mean.
So you do not have the extra degrees of freedom to encode the variance. And the way to get around it is to go to what's called a Dirichlet tree distribution, and Tom [inaudible] did something on this, too. In the Dirichlet tree
distribution you have your vocabulary here but you put a tree on top of it.
It's a tree with positive edge weights, and the way it works is the following. So let's see, the weights can be arbitrary, but in this example I made them special. So you have eta; whenever you see eta, that's a big number, say 50. So we have 50/50 here, and this one is two, and that one is one. So let's imagine we have a tree like that.
The way you generate a multinomial distribution from a Dirichlet tree like this
is the following: You start from the root and you don't look at the sub trees.
You just look at its children's weight and treat that as a Dirichlet
distribution.
So you're going to draw a multinomial of two elements from this Dirichlet. So let's say we do that and we happen to get this. That's the multinomial distribution we get.
Then you go down to this internal node, and at this place you're going to use these as the Dirichlet parameters and you draw another multinomial. So let's say that's that. And the intuition is, this is the probability mass that you get here, and you're going to split it according to that. So you basically multiply the numbers along each path. That gives you the final probability.
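Here is a small sketch of that sampling rule, with the tree stored as hypothetical nested tuples (the leaf names A, B, C and the weights 50/50 over 2/1 are just the toy example from the slide):

    import numpy as np

    rng = np.random.default_rng(0)
    ETA = 50.0   # "eta", a big weight

    # A node is ("word", name) or ("node", [(edge_weight, child), ...]).
    tree = ("node", [(2.0, ("node", [(ETA, ("word", "A")),
                                     (ETA, ("word", "B"))])),
                     (1.0, ("word", "C"))])

    def sample_dirichlet_tree(node, rng):
        """Draw one multinomial from a Dirichlet tree: roll a Dirichlet at each
        internal node, then multiply the branch probabilities along each path."""
        kind, payload = node
        if kind == "word":
            return {payload: 1.0}
        weights = np.array([w for w, _ in payload])
        branch = rng.dirichlet(weights)                 # Dirichlet over the children
        out = {}
        for p, (_, child) in zip(branch, payload):
            for word, q in sample_dirichlet_tree(child, rng).items():
                out[word] = out.get(word, 0.0) + p * q  # multiply down the path
        return out

    sample = sample_dirichlet_tree(tree, rng)   # A and B end up with nearly equal mass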
And because of this construction, what we will see is the following. Here you pretty much get a fairly uniform distribution in terms of the weighted distribution. But once you go down here, because these weights are large, the property of the Dirichlet is that you will evenly split them, right?
So you get the desired behavior. This is just to show you that things are a little bit messy, but it's actually quite nice; you still get what you want in closed form. In particular, there's this important quantity delta for each internal node s: the weight of the edge coming into that node minus the total weight of the edges leaving it. If that difference is zero for every node in the tree, you actually recover a standard Dirichlet distribution. When it's not zero, you get more interesting behavior. Furthermore, this Dirichlet tree distribution is nice: it's again conjugate to the multinomial, so it's convenient to work with as long as you're careful about the bookkeeping.
So this is how we're going to encode a must-link. Basically, when two words are must-linked, you're going to create a subtree over them with very strong but equal weights. And then on top of that you're going to create a weaker edge whose weight is scaled by the number of leaves underneath, so that it evenly splits the probability mass.
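Continuing the hypothetical tree representation (and the sample_dirichlet_tree sketch) from above, the must-link encoding could be built roughly like this: a subtree with strong equal weights over the linked words, attached by an ordinary edge whose weight is scaled by the number of leaves underneath it.

    def must_link_tree(linked_words, other_words, eta=50.0, beta=1.0):
        """Dirichlet tree prior encoding a must-link: the linked words sit under
        one subtree with large equal weights, so any draw splits the mass that
        reaches the subtree nearly evenly among them."""
        subtree = ("node", [(eta, ("word", w)) for w in linked_words])
        children = [(beta * len(linked_words), subtree)]      # weight ~ number of leaves
        children += [(beta, ("word", w)) for w in other_words]
        return ("node", children)

    # e.g. must-link "school" and "college" in a three-word vocabulary:
    prior = must_link_tree(["school", "college"], ["lottery"])
    draw = sample_dirichlet_tree(prior, rng)   # school and college get almost equal mass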
A construction like this will give you samples like those shown. So that's the must-link. Now, recall the cannot-link, where you do not want two words to both have high probability. That one is, again, impossible to encode with a Dirichlet.
>>: One question, so the eta and the beta in the tree you had there, do you determine those empirically and then use them fixed?
> Jerry Zhu: Yes, that's what we did. You can think of them as the knobs that the person can tune. We didn't learn them. We set them.
>>: Must link distribution.
> Jerry Zhu:
Yes.
>>: [inaudible].
> Jerry Zhu:
Yeah.
>>: [inaudible].
> Jerry Zhu:
Yeah.
>>: Makes sense.
> Jerry Zhu: So a cannot-link, for example, cannot-link school and cancer. What you really want is this kind of behavior for a three-word vocabulary: you want either the multinomial probability to concentrate on school, or it can somehow arbitrarily split the mass between the other two words but not on school. So anything along there is okay.
As I said, this cannot be easily represented by a Dirichlet. And it cannot
even be represented by a Dirichlet tree, a single tree is not sufficient.
What you will need is a mixture of trees. Therefore, the name Dirichlet
Forest. Okay. I'm going to show you with one example how things might look
like. It's a little bit involved. But here we go.
>>: Dirichlet Forest [inaudible].
> Jerry Zhu: It's a mixture of Dirichlet trees, that's it.
>>: Then you can get any events you like, not showing up.
> Jerry Zhu: Let's see --
>>: [inaudible].
> Jerry Zhu: I think it should be, yeah. I think it should be. Yeah. All
right. So we have vocabulary of A to G. And here are the links that we want
to satisfy, a must link between A and B. And a bunch of cannot links.
Now, the thing is -- well, okay. So here's how we do it. Write down all the nodes, and A/B, those are must-linked, so I'm going to treat them as a single node there; it's glued together by the must-link. I'm going to have all the red edges with a cross representing cannot-links. So that comes from my knowledge.
This is the graph we start with. So we have multiple steps. Step one, let's
identify connected components. The reason we want to do that is because each
connected component like this box can be modeled independently of other
connected components.
So we will then focus on one of these guys. Then inside each box we're going to flip the edges. So instead of cannot-links, you take the opposite, the complement edges. Those are the can-links. That means these words can appear with high probability together.
You identify the maximal cliques within each component. So A, B, C is a big clique and D is another clique there. The interpretation, again, is that each maximal clique, like A, B, C, represents a maximal collection of words that can co-occur with high probability. You cannot add anything else from the same connected component, otherwise you're going to violate one of the cannot-links.
So that's that. Now, with that, we are going to create, or sample, a Dirichlet tree. And the way we sample this tree is by constructing its subtrees in the following manner. We're going to intuitively say that for this box, I'm going to either select this clique or that clique, allowing the selected one to share high probability and denying probability mass to anything else.
So the way I'm going to do that is to choose between them: for this particular one, I have two maximal cliques, so I'm going to have two candidate subtrees. And in these trees, if you don't see a number on an edge, imagine its edge weight is one, and eta is something big.
The reason we have this design is because, recall, when we sample from that Dirichlet tree, when we are here, if you see eta, which is big, and 1 there, you're going to very likely send almost all the probability mass down the eta branch, and very little mass to the other guy, D.
So if you choose the left subtree, you're saying, essentially, that you want all the probability mass to go to A, B, C, but not to D; and the second one, vice versa. But you have to make the choice at this moment.
So you sample the tree like that. Let's say we flip a coin which chooses the first tree. Once we are there, we need to further distribute the probability among A, B, and C, and here we're just going to give each of them weight one -- so, a Dirichlet with all 1s, which gives the uniform distribution over the simplex.
And finally -- oh, yeah, so at this moment we're safe, in that we will satisfy all cannot-links within the first box, because it didn't give D any probability. And finally, because there's a must-link between A and B, we're going to again do this trick and let them evenly split the probability mass that reaches that A/B node. So this is how you do that branch.
You can do the same thing for the second one. Here the choices are to give all probability mass either to E or to F, and so on. So let's say we select the second one; the effect is you'll give probability mass to F and not E for this tree.
So you do this, and you will have sampled a Dirichlet tree. And then you use this Dirichlet tree to generate a multinomial that will satisfy your must-links and cannot-links.
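A sketch of that combinatorial preprocessing, using networkx for the graph steps (variable names are hypothetical, and must-linked words are assumed to have already been glued into single nodes):

    import networkx as nx

    def candidate_cliques(words, cannot_links):
        """For each connected component of the cannot-link graph, flip the edges
        inside it and list the maximal cliques of the resulting can-link graph.
        Sampling a Dirichlet tree then means choosing one clique per component
        to receive the probability mass."""
        g = nx.Graph()
        g.add_nodes_from(words)
        g.add_edges_from(cannot_links)
        choices = []
        for comp in nx.connected_components(g):
            can_link = nx.complement(g.subgraph(comp))   # complement within the box
            choices.append(list(nx.find_cliques(can_link)))
        return choices

    # e.g. vocabulary A..G with A and B already glued into "AB" by a must-link:
    cliques = candidate_cliques(["AB", "C", "D", "E", "F", "G"],
                                [("AB", "D"), ("C", "D"), ("E", "F")])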
Many of you will have noticed that the procedure is combinatorial. So in theory things can get really bad and you get tons of candidate trees, an exponential number of them; but in practice we see very small numbers, so there's a gap between theory and practice.
>>: What kind of data?
> Jerry Zhu: Smallish data, and like the data I showed before, the wish and
the split and the merge and so on. So all those operations.
>>: But are you individually then going through the component until [inaudible]
is close enough or.
> Jerry Zhu: No, no. So you do this. You do Gibbs sampling, but now you sample over the Z assignments, the topic assignments, as well as the tree indices. So you're sort of massaging the trees for each topic. You're selecting which subtrees to use along the way, and you can adjust that. So that becomes part of the variables you are sampling in the Gibbs sampler.
>>: I see, so the actual forest that you have is changing?
> Jerry Zhu: Yes, exactly. And that's part of the sampling. And just to show you, this is how we sample the forest; again, a lot of bookkeeping, but not terrible.
Okay. All right. So in the last ten minutes, let me quickly go through a
third model which I promise you is not as hairy as this one.
Okay. All right. The motivation is, well, the methods I've talked about are okay. They can encode certain knowledge, but they're not general enough. I want to have a really general way to specify domain knowledge -- or not me, but domain experts.
And the solution is, let's go to first order logic. A domain expert can usually specify his knowledge in first order logic form. So the advantage is it's easy for an expert to just write down a knowledge base in logic and say, okay, here are the things I want to satisfy in terms of the topic.
It's very general. In fact, it can encode several existing variants that have been proposed in the literature. Critically, we want efficient inference.
That's the difficult part. Whenever you involve logic, inference could get
very hard, but we will see an efficient way to do that.
So for those of you who are familiar with Markov logic networks, the relation between our Fold.all and that is the following. You can view Fold.all as a hybrid Markov logic network, which is a Markov random field constructed from logic, plus -- not plus, but multiplied by -- a likelihood term which involves continuous random variables. And those would be our phi and theta. So it's an instance of that, but we have our own specialized optimization scheme so that it's efficient. Okay. So here's how you think of the problem, here's how you encode domain knowledge in logic.
So let's think in terms of logic. You have predicates, which are Boolean functions. For example, you can define a Boolean function big Z that takes two arguments, i and t, where i is a word position and t is a topic index. So we can say something like: Z(i,t) is true if word position i takes the topic value t, or in English, the ith word comes from topic t. So that's the key predicate, and we do not know it; we will need to infer it. Besides that, you can define -- and this is the power of this approach -- arbitrary observed predicates. Some of the predicates are things like before, where you have W(i,v): the ith word takes value v, that is, the ith word is the word type v. But then you can put things like document boundaries in there; that's D(i,j), the ith word position comes from document j. Or HasLabel(j,l), so document j has label l; or even sentence-level information, saying the ith word comes from the kth sentence, and so on and so forth. So you can put in any observed things that you have in this form, and we will use them.
Now, the key is you're going to use these to write rules in logic. You can write big L, a big number of rules, and each one is associated with a positive weight lambda. That's the standard Markov logic network setting. For example, I have two rules here. The first one is weak, weight one. And this rule says: for all positions i, if this is true, meaning if the ith word is embryo, that implies its topic is three. So basically this says whenever you see the word embryo, put it in topic three. But it's a somewhat weak rule and it can be overridden. Think of these not as hard constraints but as preferences.
The second one has weight 100, so that's a very strong rule; you cannot violate it. And it has a whole bunch of universal quantifications: for all positions i, for all positions j, for all topics t, if the ith word is movie and the jth word is film, let's not put these two words in the same topic, so they cannot both be in topic t.
This is exactly the cannot-link, but expressed in logic form. So we put a cannot-link between the words movie and film: these two words need to have different topics. They need to go to different topics.
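As a rough sketch of what those two weighted rules look like when written down (a hypothetical mini-representation, not the actual Fold.all input format): each rule is a positive weight lambda plus a Boolean check on one grounding.

    # Rule 1 (weight 1, a soft preference):
    #   forall i:  W(i, "embryo")  =>  Z(i, 3)
    def rule1(words, z, i):
        return words[i] != "embryo" or z[i] == 3

    # Rule 2 (weight 100, nearly hard): movie and film never share a topic.
    #   forall i, j, t:  W(i, "movie") and W(j, "film")  =>  not (Z(i, t) and Z(j, t))
    def rule2(words, z, i, j, t):
        return not (words[i] == "movie" and words[j] == "film"
                    and z[i] == t and z[j] == t)

    rules = [(1.0, rule1), (100.0, rule2)]   # (lambda weight, grounding check)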
Yes?
>>: The positive weights that you have there, for it to be 100, sorry, is that given by the domain expert or is that learned?
> Jerry Zhu: Yes, right now it's given by the domain expert. We didn't look into the learning issue with that.
>>: What range of numbers do you allow, and do they end up being kind of on the extremes? Do you see any --
> Jerry Zhu: So this is a fairly empirical question, and the answer is we tried
a whole bunch of numbers. And you clearly see once you tune up the numbers,
that rule gets enforced. But setting them appropriately is an art.
>>: So that's the thing. I mean, I myself worked on rule systems which have weights associated with them. And I had weights one to ten. And the majority of the rules had either nine or one. And there was some squishiness in the middle, a 3 or a 7. But you don't get an even distribution of values.
> Jerry Zhu: Here we don't even estimate the weights. We don't have a distribution; it's all user provided.
What we can do is assign meaning to the lambdas if we normalize the terms appropriately, so they are more interpretable. You will see how they come into play really soon.
Okay. So if you're familiar with logic, you know this is first order logic: you have universal quantifiers, for all. And one simple way of doing inference here is to do something called propositionalization, which turns first order logic into Boolean logic. That simply means that for every combination of the free variables in your rule, you assign each variable a particular value. So you do this exponential explosion and generate what's called a grounding, small g, from a particular rule. This big G is the set of groundings and small g is a particular grounding. So for example, this is a particular grounding where the word positions we look at are 1, 2, 3, word positions 4, 5, 6 for film, and the topic in question is nine, and so on.
So in this example, you have to do N squared times big T groundings, which is a huge number. But in principle you can do that. So you turn everything into Boolean logic.
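A sketch of that propositionalization step for the movie/film rule (a toy enumeration with hypothetical names; this set of groundings is exactly what grows like N squared times T and motivates the stochastic treatment later):

    def ground_cannot_link(words, T):
        """Enumerate the groundings g = (i, j, t) whose premise can fire, i.e.
        position i holds "movie" and position j holds "film"; each grounding
        contributes one Boolean term 1_g(z) to the objective."""
        movie_pos = [i for i, w in enumerate(words) if w == "movie"]
        film_pos = [j for j, w in enumerate(words) if w == "film"]
        return [(i, j, t) for i in movie_pos for j in film_pos for t in range(T)]

    def indicator(g, z):
        i, j, t = g
        return 0 if (z[i] == t and z[j] == t) else 1   # 1 iff the grounding is satisfied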
So this set big G, just keep in mind, could be a huge set. Once you do that, yeah, it's combinatorial. Let's define this indicator function: 1 with a subscript for a particular grounding. It takes in the hidden topic assignment that you currently have; let's say you have a candidate assignment. Then you have the natural 1-0 definition. So here's how it is used. This is standard LDA, except that I draw it as an undirected graphical model, a factor graph. And these are the terms that you will recognize: that's the topic term, the document term, and so on.
All we do here is add in this term, which corresponds exactly to the Markov logic network part. And that corresponds to this factor node, which is a mega node; it takes in arbitrary observed variables. And it has this form. So you have the exponential here. You're summing over all the rules you have, and for each rule you're going to sum over its exponential number of groundings. And then the weight is here, and that's the Boolean function here, and it's applied to the Z in question.
So this objective function is really nothing but a straightforward combination
of LDA and Markov logic network. How much time do I -- can I go over just a
little bit?
>>: Yes, sure.
> Jerry Zhu: So I'm going to be done very soon. So we are interested in,
again, inference. So this is the task where you're given the corpus and you
want to -- and the rules, and you want to infer Z.
Now, what you can do is take the logarithm of the previous probability; that's what you get here. The black term comes from LDA; the red terms come from logic. Again, you have the summation. This is the part that will kill us.
But that's the objective function. And keep in mind that this objective function is nonconvex. Okay. And you can optimize it in a simple way: you do alternating optimization, where you fix Z first and then maximize over phi and theta. If you fix Z, this term disappears and you're doing standard LDA stuff, which is very simple; we immediately get this.
But the other part is that you have to iterate: fix those phi and theta, and now you optimize Z. Keep in mind Z is a discrete random variable, so you get yourself an integer program here, and it's huge. So this is the part where people typically get stuck; this is where it takes a lot of time to do inference.
Okay. But we're going to do this procedure. We're going to iterate, and I'm
going to explain how we do this part in a somewhat efficient fashion.
Okay. So here's how we optimize Z. So we're talking about this step now.
Number one, we're going to relax those indicator functions into a smooth
function. It's continuous. So we can take derivatives.
The way we do that is the following. Let's take a look at that particular grounding g, which says Z(i,1), the ith word is topic one, or not Z(j,2). And let's say we only have three topics, big T is three. So here's how you do the relaxation, turning that indicator function into a continuous function. First, you take the complement, the logical complement, so you get not Z(i,1) and Z(j,2). That's fine. Then you remove any negations. So for this term, not Z(i,1): recall that Z(i,1) means the ith word takes topic one. Not of that, since we only have three different topics, means it must take topic two or topic three. So that's how you get rid of the negation.
Then we're going to turn 'or' into plus and 'and' into multiplication. But these variables are 0-1 integer variables, so so far things are equivalent. Then you do 1 minus this; this is the negation negated back. It's still a 0-1 variable, but then you trivially allow it to be in the interval 0-1. Now you have a continuous function. This is still nonconvex, though. And, yeah, if you're curious, the indicator function 1_g becomes something that looks like this; it involves a whole bunch of variables that are between 0 and 1.
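A worked sketch of that relaxation for the single grounding psi_g = Z(i,1) or not Z(j,2), with three topics (here z is the relaxed assignment matrix, rows on the simplex, and topic 1 is stored in column 0; the names are hypothetical):

    import numpy as np

    def relaxed_indicator(z, i, j):
        """Continuous surrogate for 1_g(z) where psi_g = Z(i,1) OR NOT Z(j,2).
        Complement:        NOT Z(i,1) AND Z(j,2)
        Remove negation:   NOT Z(i,1)  ==  Z(i,2) OR Z(i,3)   (only 3 topics)
        OR -> +, AND -> *, then negate back with 1 - (...)."""
        return 1.0 - (z[i, 1] + z[i, 2]) * z[j, 1]

    z = np.array([[0.7, 0.2, 0.1],      # word position i
                  [0.1, 0.8, 0.1]])     # word position j
    value = relaxed_indicator(z, 0, 1)  # 1 - (0.2 + 0.1) * 0.8 = 0.76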
Okay. So remember, we're trying to optimize the big Q function. Now we're optimizing over Z. So let's take the terms that are relevant to Z and ignore all constant terms. You get something like this. So this comes from the logic; this part comes from LDA, the part that involves Z. You're doing an optimization problem. This one is a nice one, because Z now is between 0 and 1 for each element, and all you have to do is constrain it to be nonnegative and sum to 1. So that's nice.
But we still have that exponential number of groundings here. So this term is
huge. And you don't want to take the normal like gradient step and do that.
It's just way too expensive.
So the trick here is let's do stochastic optimization and treat this objective
function as the objective we're trying to optimize, but the actual thing you do
is you randomly pick one term out of this. So call that term F. It could come
from here or it could come from here. It's going to be a term that involves
some Z. But it's a very small number of those. So you just take random, one
term out randomly. Treat that as if it's the random objective, because if you
take the expectation it gives you back that. So you're going to take one
random term out, do a small gradient step, and then take another term out and
do it. So you repeat.
So you do the stochastic optimization in this fashion, take a term out.
And you do this kind of exponentiated gradient which maintains this constraint.
So you just do that. And it turns out that it works quite well for our
problems. We were able to run fairly large logic knowledge bases in a reasonable amount of time.
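A sketch of one such exponentiated-gradient step (hypothetical interface: grad_rows maps the few word positions touched by the randomly sampled term to the gradient of that term with respect to their rows of the relaxed z):

    import numpy as np

    def exponentiated_gradient_step(z, grad_rows, step_size=0.1):
        """Multiplicative update on only the touched rows of z; renormalizing
        keeps each row nonnegative and summing to one, i.e. on the simplex."""
        for i, g in grad_rows.items():
            z[i] = z[i] * np.exp(step_size * g)   # ascent on the sampled term
            z[i] = z[i] / z[i].sum()
        return z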
I mean, by the way, you can stop this anytime you want. So as long as you have time you can keep working on it.
Recall that this is the particular rule we said is a cannot-link. This is Pang and Lee's movie review sentiment analysis corpus.
So it's a bunch of movie reviews submitted by people. And we run LDA on that, but we add this one rule, which says let's put a cannot-link between the word movie and the word film. My student David did that, because he suspected that people are using these two words differently.
And, yes, as you can see, these are the two resulting -- there are more topics,
but we're taking out the two topics where film and movie are the most frequent.
What do you notice?
>>: [inaudible].
> Jerry Zhu: Yeah, for some reason film is a more elegant word. People associate good words, like great, with film, but here you have bad movie and -- yeah. So it's kind of strange that you see this. Yes?
>>: I have a question about the semantics of this. This example is kind of telling, because what you ended up encoding here, even though you're using the fuller logical form, is just the cannot-link constraint that you had earlier. I wonder, if you told the domain experts, hey, sorry, you don't get full logic, you just get [inaudible], how much does that really restrict them in terms of the --
> Jerry Zhu: That's a great question. I think an equivalent way to understand this is to ask, what is the power of just [inaudible] links? Right? You can create some high-level operations like split, merge and so on. Maybe it's rich enough for normal usage.
>>: It's pretty rich.
> Jerry Zhu: It's pretty rich. But on the other hand, the logic has the nice property that you can claim you can pretty much encode anything, as long as you can write it in logic form. So you have that sort of peace of mind there; like, if you want, you can always write complicated [inaudible].
>>: That's true.
> Jerry Zhu: Wonder if real domain experts [inaudible]. Sounds like a great project.
>>: The one example that immediately comes to mind is contextual constraints,
right; in this context that seems like the thing, what we would want to apply,
disambiguates. Gave a nice example earlier with one of his project examples,
that seems like maybe want first order logic but maybe [inaudible].
>>: Yeah.
> Jerry Zhu: All right. So we want to evaluate two things. One is whether the logic that we specified as extra knowledge generalizes to test data, like whether the topics you get generalize. Second, we want to see how good the inference is, like how scalable it is. So here's what we did. We have a bunch of datasets. I don't have time to go through the details, but they're in the paper.
We're going to do cross-validation, but whenever we do training we'll run Fold.all plus all the logic rules that the domain expert gives us. From that we will be able to get the topic dice, the topics, the phis. Then on the test fold what I'm going to do is use those topics to directly estimate the word topic indices, importantly without using any logic. I'm just going to use the phis to get the Zs. But then once we get the test-set Zs, I can go back and essentially see how much of it fits the logic that the expert gave us in the beginning.
But we did something slightly different. We looked at the test objective Q, which balances both the desire to satisfy the logic as well as the desire to satisfy the LDA model. So in general, in this table, that's the test-set Q. You want this number to be big; the bigger the better.
And so you can see this is pretty good on the test set. If you simply run LDA without logic, you get pretty poor results, except for this one. Alchemy is where you only enforce the logic but without the LDA part, of course, and so it's not as good. But, again, another thing to notice is these dashes are where things do not complete within a reasonable amount of time. So this shows that our method with stochastic gradient is fast enough to work on reasonably large logic knowledge bases, especially this one.
Yes?
>>: So in the logic part, you actually have two steps. You have basically
doing the approximations -- I mean the first one is relaxation. The second one
is actually [inaudible].
> Jerry Zhu:
Yes.
>>: So just wondering where they're exactly -- the exact inference where it gives you [inaudible], I mean, probably not on the logic -- but one of the [inaudible] put in the next slide which probably -- yeah, so that --
> Jerry Zhu: So let's say we have a column here which says let's do exact
inference, if we can do that, that's a smaller constraint base, how well would
we do. Indeed, for the ones that we can complete, the number has to be
slightly bigger than this one, but not too much bigger. Slightly bigger. It's
in the paper presented here.
But the problem with that is you will have a lot of dashes here because of this. So here's the summary. I think it's very interesting to inject domain knowledge into an unsupervised learning method like LDA, because I think the masses really want that. We presented three models; they have increasingly general power. And there are two keys here. One is you really want this to be easy to use, for domain experts to inject their knowledge into the model. Second, you really want to make sure it runs fast, so it's scalable. That's it. Thank you.
[applause]