
>> Pedro Domingos: We'll start with the motivation. My experience working in
machine learning beyond doing basic classification and things like that over
the last ten years or so is that at the end of the day the hardest part of
learning is doing inference.
Inference shows up as a subroutine, almost all the time, when you're trying to
do powerful kinds of learning. It appears when you want to learn generative
graphical models like Markov networks; when you want to learn discriminative
graphical models, whether they're directed or undirected.
It shows up, for example, in EM, when you're trying to learn with incomplete
data or latent variables. It shows up in Bayesian learning, because in
Bayesian learning, learning is inference.
It shows up very much in deep learning, and it shows up most prominently for me
in statistical relational learning. When you're trying to model data that is
not IID, you actually have the complexity of probabilistic inference combined
with the complexity of logical inference.
So this is a big problem, particularly when what we want to do goes beyond the
kind of very simple models that people used to build in the past, to building
large joint models for domains like natural language and vision, where you want
to model the whole pipeline, not just one stage, and do joint inference across
all of them. Social networks and activity recognition are others;
computational biology is another example.
We want to model whole metabolic pathways, whole systems, whole cells, not
just individual genes or proteins. So inference is a problem when you're
trying to do this on large models; basically at some point it just becomes
impossible, because inference in, for example, graphical models is a
#P-complete problem.
It's as hard as counting the number of solutions of a SAT formula. So it's
actually harder than NP-complete. And not only that, but when we do learning
with these things, typically we don't just have to solve the inference problem
once.
We have to solve it successively, for example, at each step of gradient
descent. So it becomes very, very tough. Now, of course, when inference is
intractable, what you do is approximate inference. And there are many good
approximate methods, like MCMC, belief propagation, variational approximations
and whatnot. However -- those things have a lot of problems. You need to
really know how to work with them.
But one that's particularly relevant to us is that approximate inference tends
to interact badly with parameter optimization. Parameter optimization often
uses things like line searches and second-order methods like conjugate
gradient and quasi-Newton, and they can go haywire, and often do go haywire,
when the function that is being optimized is not known exactly, which is
exactly what happens when you're doing learning.
The biggest problem, however, is the following: Suppose you had actually
learned a very, very accurate model using whatever, using unlimited
computational resources.
At performance time, even if the model is accurate, because it's intractable,
it effectively becomes inaccurate. Because now what you do is you do
approximate inference, and now you're limited by the errors of the inference,
which are often by far the dominant thing.
You have bazillions of modes in your space, and you're only going to sample one
or two or your mean field is going to converge to one of them, for example.
So this is not a very happy state of affairs. We would like -- we need to
overcome this problem if we really want to do the kind of powerful machine
learning that the world needs.
So what can we do? Well, one very attractive solution, at least in principle,
is just to learn tractable models. If we only learn tractable models, then the
inference problem is easy. The problem, however, is that the kinds of
tractable models that we've had in the past were insufficiently expressive for
most of the applications that we want to do.
There are some useful restricted classes, but for the most part this is not
really a solution. The take-home message I would like you to remember from
this talk, however, is that the class of tractable models is actually much
more expressive than you might think. There's a whole series of tractable
models that people have discovered in the last several years, and progress
continues. And I would say that at this point we can say fairly confidently
that these models are expressive enough for a lot of challenging, real world
applications.
So what I would like to do here, as time allows, is go through some of these,
starting from the simplest to the most advanced: thin junction trees, which I
will mention to start with, and then large mixture models, arithmetic
circuits, feature trees, sum-product networks and tractable Markov logic,
which I may not have time to get to, but that's okay.
So the idea in thin junction trees is the following. First of all, as a
reminder, a junction tree is the basic structure you use to do inference in a
graphical model. If you have a Markov network, you obtain it by triangulating
the network, which means making sure that every cycle of length greater than
three has a chord.
And then the tree width of the network is the size of the largest clique. And
generally people think of inference as being exponential in the tree width.
And this is the problem. You have to remember that even if the model doesn't
look like it's going to have high tree width, once you do the triangulation,
it often will. If your model has low tree width, you're in good shape. So one
thing you can imagine doing is just learning low tree width models, and then
you're good.
The problem with learning low tree width models is that typically you can only
learn very low tree width models, because both the learning and the inference
are exponential in the tree width. So people like Carlos, for example, have
worked on this. But you wind up only learning models of tree width maybe two
or three, which is really not enough. For example, if you're doing image
processing on an image of a thousand by a thousand, the tree width is already
a thousand.
So this is a very far cry from what we need. However, there are many
interesting ideas in this literature that we can pick up and use in more
powerful things, which we will do.
Here's something that you can do that's very simple, and you should remember
this before we go on to more sophisticated things, which is just to learn a
very large mixture model.
The beauty of a mixture model is that inference is linear in the size of the
model. It's very fast. You just have to go through the mixture components.
And if we just try to do accurate probability estimation, then we don't have
to restrict the number of components. We only have to make sure that we don't
overfit.
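To make the "inference is linear" claim concrete, here is a minimal sketch,
not the system from the talk: a mixture of product-of-Bernoulli components,
where marginalizing unobserved variables costs nothing beyond one pass over
the components. All names and numbers are illustrative assumptions.

```python
import numpy as np

def mixture_marginal(weights, bernoulli_params, evidence):
    """P(evidence) for a mixture of product-of-Bernoulli components.

    weights:          (K,) mixing proportions, summing to 1
    bernoulli_params: (K, D) array of P(X_d = 1 | component k)
    evidence:         dict {variable index: 0 or 1}; unobserved
                      variables sum out to 1 for free
    """
    total = 0.0
    for w, p in zip(weights, bernoulli_params):
        comp = 1.0
        for d, val in evidence.items():
            comp *= p[d] if val == 1 else 1.0 - p[d]
        total += w * comp  # one term per component: linear in model size
    return total

# Example: 2 components over 3 binary variables, query P(X0 = 1).
weights = np.array([0.7, 0.3])
params = np.array([[0.9, 0.2, 0.5],
                   [0.1, 0.8, 0.5]])
print(mixture_marginal(weights, params, {0: 1}))  # 0.7*0.9 + 0.3*0.1 = 0.66
```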
So we tried doing this. And we actually found, to our pleasant surprise, when
you compare this with Bayesian network structure learning using the state of
the art, the WinMine system that was developed here by people like Max
Chickering and David Heckerman and some people in this room, that, first of
all, the likelihood you obtain is comparable, which is surprising, because
it's a more restricted class. But we get better query accuracy than when you
use WinMine and then Gibbs sampling on top of it, which is what a lot of
people typically do. And, most importantly, the inference is way faster. It
takes a fraction of a second. It's reliable; you don't have to twiddle
parameters and try different things. There's only one way to do it.
The problem with this, of course, is the curse of dimensionality. As with all
clustering-like, similarity-based methods, if you're in low dimension, say 10,
this is probably fine. But if you have hundreds or thousands, tens of
thousands, or millions of variables, this doesn't work. What's missing is
factorization. This is what graphical models do for you: they factorize the
distribution into smaller pieces. And those pieces are learnable.
What can we do? This is where the next key idea comes in. And this is
something that people like Rina Dechter and Adnan Darwiche and others have
worked on over the years. It's this very important notion that, in fact,
inference is not in general exponential in tree width. The tree width is just
an upper bound. It's a worst case.
You could have a model with a very large tree width that's still tractable,
because it might have things like context-specific independence, where two
variables are independent given one value of a variable but not given other
values.
And determinism. Determinism means there are parts of the space that have zero
probability, which is very bad for things like MCMC and BP and variational
methods, but is actually really good for us. It means there's less of the
space that you need to sum over, so we should be able to exploit that.
So if the cost of inference is not exponential in the tree width, what is it?
Well, if you think about it, inference is just doing sums and products. That's
all that's going on. It's a bunch of sums and products.
So the cost of inference is really just the total number of sums and products
that you do. So if you imagine organizing your computation, as we'll see in an
illustration shortly, as a DAG where the nodes are sums and products and the
leaves are the inputs, the variables and the parameters, then the cost of
inference is just the size of that circuit.
It's the number of edges in that circuit. So let's look at this in these
terms. First of all, what is an arithmetic circuit? An arithmetic circuit is a
way to represent the joint distribution of a set of variables, as follows. So
here's a table over X1 and X2, with the probability of each of its states.
What I'll do is I'll introduce indicators for the variables.
So the indicator for X1 is 1 when X1 is true, the indicator for not-X1 is 1
when it's false, and so forth. And I have products. This is a DAG. And then I
have sums. And at the top I have a sum, and the sum nodes have weighted
children, right? So, for example, here .4 is the weight of this child, which
is a product of X1 and X2. What happens if you're in this state and you set
the indicators in the appropriate way, meaning, for example, the indicator for
X1 to 1 and the indicator for not-X1 to 0, and likewise for X2? This picks out
only this product, and the result is .4. Okay? And the same for the other
states.
So basically this -- all the information in here is also represented in here.
But more importantly, if you now want to marginalize -- let's say I want to
compute the marginal probability that X1 is equal to 1 -- then what I want to
do is sum out X2, meaning I need to sum this and this.
I can do that by setting the indicators accordingly: both indicators for X2
get set to 1, because I want to include both kinds of terms. So think of these
indicators not as whether the variable is true or false, but as whether you
want to include the corresponding terms in your computation.
So if I'm summing out a variable, I want to include all of its terms, so I set
both of these guys to one. Since I want X1 to be 1, I set its indicator to one
and the other to 0. With those values, these two products are different from
0, and I get this sum, .4 plus .2, which is the answer.
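As a concrete illustration of the indicator trick just described, here is a
hedged sketch of that two-variable circuit in code. The weights .4 and .2 are
the ones from the talk; the other two weights are assumed values chosen so the
table normalizes.

```python
def ac_value(ind_x1, ind_not_x1, ind_x2, ind_not_x2):
    # One weighted product of indicators per state of (X1, X2).
    return (0.4 * ind_x1 * ind_x2 +         # P(X1=1, X2=1), from the talk
            0.2 * ind_x1 * ind_not_x2 +     # P(X1=1, X2=0), from the talk
            0.3 * ind_not_x1 * ind_x2 +     # P(X1=0, X2=1), assumed
            0.1 * ind_not_x1 * ind_not_x2)  # P(X1=0, X2=0), assumed

# Evidence X1=1, X2=1: the indicators pick out a single state.
print(ac_value(1, 0, 1, 0))   # 0.4

# Marginal P(X1=1): set both X2 indicators to 1 to sum X2 out.
print(ac_value(1, 0, 1, 1))   # 0.4 + 0.2 = 0.6

# Partition function: include every term by setting all indicators to 1.
print(ac_value(1, 1, 1, 1))   # 1.0
```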
So far so good. But in general the size of the circuit gets to be exponential
in the number of variables so we haven't gained anything. Where we gain
something is that if we have determinism, some of this won't be there.
If we have context-specific independence, repeated parts, because this is a
graph, need appear only once. So in general we might have a situation where
the tree width is large, but the size of the circuit that computes all the
marginals is still polynomial in the number of variables. And what people like
Adnan Darwiche and others did is compile graphical models into arithmetic
circuits, the same way you can compile them into a junction tree. You can
think of an arithmetic circuit as a sparse junction tree, and do inference on
that.
But what we would like to do is learn from scratch arithmetic circuits that we
know are of bounded size. And one way to do that is to use a standard
graphical model learner, like WinMine, but instead of the score being the log
likelihood minus the number of parameters, we let the score be the log
likelihood minus the circuit size.
So instead of regularizing using the model size, we regularize using the cost
of inference. As far as overfitting goes, this does the same job, does equally
well, but it has a very important consequence: in the first case I could have
a model where inference is completely hellish; in this case I'm guaranteed
that the inference is always manageable.
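A minimal sketch of the scoring idea, with hypothetical log-likelihood and
circuit-size quantities supplied by the structure learner; this is not the
actual WinMine-based implementation, just the shape of the objective.

```python
def score(log_likelihood, circuit_edge_count, alpha=1.0):
    """Structure score: fit minus inference cost.

    log_likelihood:     training log-likelihood of a candidate model
    circuit_edge_count: edges in its compiled arithmetic circuit
    alpha:              fit-vs-inference-cost trade-off (assumed knob)
    """
    return log_likelihood - alpha * circuit_edge_count

# During greedy structure search, one would keep the best-scoring candidate:
#   best = max(candidates, key=lambda m: score(ll(m), edges(m)))
```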
So we tried this. It works quite well. Again, the likelihood is comparable to
Bayesian networks, and this was on challenging problems like predicting
actions in e-commerce click logs and visits to websites and whatnot. So high
dimensional problems.
And the inference, again, compared to doing something like MCMC, is amazing.
It takes a fraction of a second, and there's nothing to tune. When you look at
the circuits that we've learned, they do have large tree width.
But this is just a starting point. Let me actually skip over this part and go
straight to the next one. Now we can ask ourselves the question: why are we
just using a standard Bayesian network learning algorithm with a different
objective function? What we'd like to do is learn the most general class of
tractable models that we can, right? So what is that general class?
And this is the question that we addressed in a paper that we published at UAI
last summer; it actually won the best paper award. We developed a
representation that is a generalization of arithmetic circuits called
sum-product networks. And the answer to the question of what are the most
general conditions under which a DAG of sums and products correctly represents
all the marginals of a distribution is two words: completeness and
consistency.
So let me tell you what those are, and then how we learn sum-product networks.
A sum-product network is just a DAG of sums and products, with the variables
and parameters at the leaves. As long as these parameters are non-negative, a
sum product network represents a normalized distribution if we just say the
probability of a state is proportional to the value of the sum product network
for that state, and then normalize by dividing by the partition function,
which is the sum of this over all states.
But that's where you have a problem, because the partition function takes
exponential time to compute. In general, to compute the probability of some
evidence E, I have to compute the value of the network for every state that's
compatible with the evidence and add those all up.
And this in general will take exponential time, and that's the whole problem.
What we would like to be able to do, instead of this exponential sum, is just
do one linear time evaluation of the sum product network.
Then the exponential computation has been replaced by a linear one. The
million-dollar question is: when will these two be the same? When can I do
this one-pass evaluation of the network and get the same result that I would
get from the exponential sum?
So we defined this as validity. We say an SPN is valid if the linear time
computation is equal to explicitly computing the probability, for every
possible evidence that I might have. So an SPN being valid basically means I
can compute all my marginals in linear time, including the partition function.
The partition function is when you want to include every term, so you set all
the indicators to one. So now the question is: when is an SPN valid?
And the answer is that an SPN is valid if it's complete and consistent.
Completeness and consistency are actually two very simple conditions that can
be very easily checked and, more importantly, very easily imposed. I can say
I'm going to have a network architecture that is complete and consistent, and
then as soon as I have that, I know that my inference is always going to be
tractable and correct, and I can just learn whatever parameters I want.
So what are they? Completeness is a condition on the sum nodes. It says that
under a sum node, the children, the child subtrees, must be over the same
variables. A sum node is really a little mixture model over a subspace. What
this condition says is that I have to have the same set of variables in every
component of the sum. So, for example, here's an incomplete SPN. It's
incomplete because it has X1 on this side and X2 on that side, and the two are
different. It's easy to see that if a network is incomplete, then my
evaluation underestimates the real partition function or the real marginal.
Consistency is a condition on the product nodes. It says that I can't have a
variable in the subtree on one side and the negation of that variable in the
subtree on the other side.
The two sides have to be consistent. If I have X1 on one side, I can't have
not-X1 on the other. If I do, then the linear evaluation is going to
overestimate the marginals.
If I'm both complete and consistent, then these two things will be equal and I
get the right answer.
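Here is a hedged sketch of what an SPN evaluator with these checks might look
like. Completeness is checked as defined above; for products, the sketch
checks decomposability (disjoint child scopes), a common stronger condition
that implies consistency. The classes and the tiny example network are
illustrative assumptions, not the paper's implementation.

```python
class Leaf:
    def __init__(self, var, positive):
        self.var, self.positive = var, positive
    def scope(self):
        return {self.var}
    def value(self, indicators):  # indicators[var] = (lam_x, lam_not_x)
        lam_pos, lam_neg = indicators[self.var]
        return lam_pos if self.positive else lam_neg

class Sum:
    def __init__(self, weighted_children):  # list of (weight, child)
        self.children = weighted_children
    def scope(self):
        return set.union(*(c.scope() for _, c in self.children))
    def complete(self):
        # All children must cover the same set of variables.
        scopes = [c.scope() for _, c in self.children]
        return all(s == scopes[0] for s in scopes)
    def value(self, indicators):
        return sum(w * c.value(indicators) for w, c in self.children)

class Product:
    def __init__(self, children):
        self.children = children
    def scope(self):
        return set.union(*(c.scope() for c in self.children))
    def decomposable(self):
        # Disjoint child scopes; a variable can't appear, let alone be
        # contradicted, on two sides. This implies consistency.
        seen = set()
        for c in self.children:
            if seen & c.scope():
                return False
            seen |= c.scope()
        return True
    def value(self, indicators):
        v = 1.0
        for c in self.children:
            v *= c.value(indicators)
        return v

# A tiny complete, decomposable SPN over two variables (weights assumed):
spn = Sum([(0.6, Product([Leaf(1, True),  Leaf(2, True)])),
           (0.4, Product([Leaf(1, False), Leaf(2, False)]))])
print(spn.complete())                      # True
print(spn.value({1: (1, 1), 2: (1, 1)}))   # partition function = 1.0
print(spn.value({1: (1, 0), 2: (1, 1)}))   # P(X1 = 1) = 0.6
```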
So how do we learn SPNs? It's a lot like backpropagation. In fact, SPNs are
well suited to backpropagation, because it's a DAG of sums and products, so
propagating derivatives of the likelihood from the output back to the input is
very straightforward.
However, there's a problem, which is the problem that people in deep learning
run into all the time: if you try to do this by gradient descent, there's this
gradient diffusion problem, where the signal disappears as you go down the
layers. It becomes sparser and sparser until you can no longer learn.
However, we found a very nice way to get around that problem, which is the
following. We do online learning: we present the examples to the network one
at a time and update the parameters. And we do hard EM. So it's online EM, but
hard EM. Meaning that at each sum node, which you can think of as a mixture
model, instead of fractionally assigning the example to the different
components of the mixture model, I pick the most likely child and send the
whole example, or the whole chunk of the example, down that child.
As a result, I don't have a gradient diffusion problem anymore. There's always
a unit size increment going down the network, no matter how deep it is. As a
result, we can learn networks of this type with, as far as we know, arbitrary
numbers of layers.
Even in deep learning, people have these very ad hoc ways to try to learn them
that don't go beyond a few layers. We routinely learn networks with tens of
layers. How does this work? Each sum node maintains a count for each child.
For each example we see, we find the most probable assignment of that example
to one of the children using the current weights.
We increment the count for the chosen child, as in: you've won this example.
Then we renormalize, and those are the new weights. And we just repeat this
until convergence. So it's actually a very straightforward learning algorithm.
But it's very powerful.
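A hedged sketch of the per-node hard EM update just described, reusing the
illustrative Sum class from the earlier sketch. In a full pass one would route
each example recursively through the winning children; that recursion is
elided here.

```python
def hard_em_step(sum_node, indicators):
    """One online hard-EM update at a single sum node (illustrative)."""
    if not hasattr(sum_node, "counts"):
        # Smoothed initialization: one pseudo-count per child.
        sum_node.counts = [1.0] * len(sum_node.children)

    # Hard E-step: the child with the highest weighted value wins the example.
    values = [w * c.value(indicators) for w, c in sum_node.children]
    winner = max(range(len(values)), key=values.__getitem__)

    # Online M-step: bump the winner's count, renormalize counts into weights.
    sum_node.counts[winner] += 1
    total = sum(sum_node.counts)
    sum_node.children = [(cnt / total, child)
                         for cnt, (_, child) in zip(sum_node.counts,
                                                    sum_node.children)]
    return winner  # the subtree to recurse into on a full pass
```

Because the winner gets a unit-size increment regardless of depth, the update
signal does not shrink layer by layer the way gradients do.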
In particular, you can think of this as a new kind of deep architecture, one
that is actually very well founded and doesn't suffer from the big problem in
deep learning, which is intractable inference and the difficulties in learning
that it causes in turn. And we've applied SPNs to a number of things, in
particular to some very challenging problems in vision, where we did a lot
better than anything that went before.
In particular, we did this for image completion, where we were able to
complete images that no one had really been able to complete before. We also
have a paper coming up in NIPS where we do discriminative training using
similar principles and beat the state of the art on a number of image
classification benchmarks.
So before I finish, let me just mention one more thing, which is: what do
these sums and products mean? And can we extend these things, which so far
have been at the propositional level, to the relational level, where we have
non-IID data? This is what we've done in tractable Markov logic. The basic
idea of tractable Markov logic is that the sums really represent a splitting
of a class into subclasses.
And the products really represent features. So there's a meaning: a product
decomposes an object into subparts, and a sum decomposes parts into
subclasses. And again, I can't go into the details of this here, but what this
means is that -- this is a powerful language; you can talk about objects,
classes, relations, hierarchies and so on -- as long as you use this language,
we can guarantee that the inference will always be tractable. For the details,
we have a paper that just came out at AAAI. And we have some software coming
out using this. If this is not enough, what we suggest is doing variational
inference, using these model classes as the approximating class. They're much
more powerful than what people typically use, so you get correspondingly
better approximations. And I will stop there. Thank you.
[applause]
>>: Any questions? In the back.
>>: Yeah, the way you describe the hard EM, the back propagation, sounded
deterministic. Did you try a stochastic routing rule, where you would go down
each subtree according to its relative probability?
>> Pedro Domingos: This is an excellent idea, right? Instead of taking the
MAP choice just sample in proportion to the probability. We've actually tried
that recently. In fact, Rob over there just did this a couple of weeks ago.
In the domains that we've tried this, it doesn't help that much. It doesn't
hurt either. But we actually suspect that there will be many problems for
which this is actually the best solution.
When we just pick the most likely child, there's this tendency for a
winner-take-all process to happen. So, yeah, that's actually an excellent
idea.
>>: Did you try relaxing validity for the SPNs, so that maybe not all
marginals are tractable -- the queries you care about have a direct path and
are easy, the rest are harder, and, okay, we can live with that?
>> Pedro Domingos: Absolutely. What I talked about here was the generative
training. But in the work coming out in NIPS we have the discriminative
training, where we get exactly this payoff. We don't care about everything
being tractable; we only care about these variables being tractable. What
this means is that we have a broader class of models. So you're right, yeah.
>>: Haven't read this one.
>> Pedro Domingos: It hasn't appeared yet.
>>: He's the chair.
>> Pedro Domingos: Well --
>>: There were like 1500 --
>> Pedro Domingos: Maybe you weren't the chair handling this paper.
[laughter]
>>: You want to point out your grad student doing the work?
>> Pedro Domingos: Rob -- where is he? Also, the stuff on tractable Markov
logic was Austin, who is also here. And the software system that I mentioned
was Chloe, who is over there. So we're here in force.
>>: Okay. Great. Any other questions? Pedro?
>>: The circuit you were talking about, the Markov [inaudible] -- can it have
negation?
>> Pedro Domingos: One more time?
>>: Your circuit, the Markov circuit -- can you have negation in the middle
somewhere?
>> Pedro Domingos: Negation -- but negation is a logical operation. So
actually, let me make a higher level remark here. Everything that I said here
about sums and products applies equally well to any semiring. It could be
sum-product, it could be max-sum, it could be and/or. So you can learn
tractable logical models in the same way. Not that we've done that here, but
the same ideas apply. And then you could have negations at any level, but
there's actually no gain in expressivity from having that. There's this form
called negation normal form, where you push all the negations to the bottom,
and it's equally expressive.
So you could, but there's no necessary advantage to doing that.
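To illustrate the semiring remark, here is a small hedged sketch in which the
same circuit structure is evaluated with sum, for marginals, and with max, for
most-probable-state style (max-product) queries, just by swapping the
"addition" operator. The tuple encoding and the example network are
illustrative assumptions.

```python
def eval_circuit(node, indicators, add):
    """Evaluate a sum/product DAG under a chosen 'addition' operator.

    node: ('sum', [(w, child), ...]), ('prod', [child, ...]), or
          ('leaf', var, positive); indicators[var] = (lam_x, lam_not_x)
    add:  sum for marginals, max for max-product queries
    """
    kind = node[0]
    if kind == 'leaf':
        _, var, positive = node
        lam_pos, lam_neg = indicators[var]
        return lam_pos if positive else lam_neg
    if kind == 'sum':
        return add(w * eval_circuit(c, indicators, add) for w, c in node[1])
    prod = 1.0  # product node
    for c in node[1]:
        prod *= eval_circuit(c, indicators, add)
    return prod

spn = ('sum', [(0.6, ('prod', [('leaf', 1, True),  ('leaf', 2, True)])),
               (0.4, ('prod', [('leaf', 1, False), ('leaf', 2, False)]))])
ones = {1: (1, 1), 2: (1, 1)}
print(eval_circuit(spn, ones, sum))  # partition function: 1.0
print(eval_circuit(spn, ones, max))  # value of the most probable state: 0.6
```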
>>: Okay. Well, let's thank the speaker.
[applause]