>> Li Deng: Okay. So it's a great pleasure to have Professor Geoff Hinton to
give two talks today. I'm not going to go through all of his achievements and
honors except to say that everybody acknowledges him as one of the major
pioneers in neural networks and machine learning. And we are very fortunate to
have him stay with us until next Friday.
He's sitting on the third floor in the speech group there, so we can knock on his
door and get a conversation going with him. We get inspiration from all the
things we talk about. He has been here for a few days and we have learned a
great deal from him already.
So without further ado, I will give the floor to Geoff, and he'll use the time to talk
about two very important topics today. Okay.
>> Geoff Hinton: Thank you. In the first half today I'm going to talk about stuff
we've done over the last few years, so some of you may have heard this stuff
already.
In the second half, I'm going to talk about applying it to larger problems, in
particular to three-dimensional object recognition.
So the motivation for this work is that a cubic centimeter of cortex has about a
trillion adjustable parameters in its synapses. My belief is you're never going to
be able to compete with the human visual system until we can learn systems
that have about a trillion parameters. That's just so we have an idea of the scale
of the thing we have to do, and currently we can't do that.
The first half is going to be how to learn multi-layer generative models that have
a large number of parameters and can be trained without labeled data. And also
once you've learned those, how you can use them to do better image
classification.
The second half is going to be applying these ideas to recognizing three-
dimensional objects in two different datasets. And in the second half we're going
to be learning about a hundred million parameters. So we're still a factor of 10 to
the 4 away from that. But you know, we're a university research lab, so you can
gain a factor of a hundred just by going to a big organization.
So the starting point is backpropagation from the mid '80s, where you give an
input vector, which might be an image, and you want to know what the object is
in the image, and you go through a feed-forward neural network with adjustable
weights on the connections. You look to see if you got the right answer. If you
didn't, you take the discrepancy and you backpropagate it through the net, and
that tells you how to adjust these weights.
And that was initially very exciting. But it didn't work out too well. It worked out
okay, but it didn't give as good behavior as we hoped. And here's some of the
reasons why. It was very hard to get labeled datasets. You'd like a billion
labeled images. That's kind of hard to get, particularly if you want fine labeling
of where the edges are and so on. The learning time doesn't scale well in
deep networks. And although it gets reasonable local optima in small networks,
in big networks we can now show we get stuck in not very good local optima, and
you can get much better local optima. There's no hope of getting the global
optimum, but at least you can get good local ones.
So the way we're going to try and overcome the limitations of backpropagation,
particularly its limitation of requiring labeled data, is we're going to try and keep
the efficiency of using a gradient update method, where you compute the
gradient, you change the parameters slightly, you compute the gradient again,
and you keep going like that. It's a very flexible way to do learning.
But we're not going to try and learn the probability of the label given the image,
we're going to try to learn the probability of an image. That is, we're going to try
and learn a generative model that spits out images. It's going to be a stochastic
generative model, and if you run it, hopefully what you'll see is lots of images that
look like the training data. Once you've learned this, you want to be able to show
it an image and say how might you have produced that and look at the underlying
variables that might have produced this image and use those for classification.
The question is what kind of generative model should we learn.
And there's an obvious candidate, which is something like a belief net. These
were introduced in the '80s by Pearl and Heckerman. Heckerman did the first
really impressive demonstration of them doing medical diagnosis.
I'm going to use a particular restricted form of a belief net, where you get to
observe the variables at the bottom and all the others are hidden. To begin with,
they're going to be just binary variables.
The inference problem is: if I show you data and you know all the parameters, all
the weights on the connections, can you infer the states of these binary variables
that caused that data? Or can you at least give me a sample of a quite
plausible way that might have caused that data?
The learning problem is if I just show you lots of data here, can you not only infer
this but actually learn all these weights on the connections? And the kind of units
I'm going to use are stochastic binary units, which take some total input that is
their bias plus input coming from other units. And then they give an output that's
a one or a zero, and the probability of the output is determined by the logistic
sigmoid, which is the equation here. It's a standard kind of binary unit.
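As a rough sketch of the unit he's describing (the function names and the tiny example weights are illustrative, not from the talk):

```python
import numpy as np

def sigmoid(x):
    # logistic sigmoid: turns the total input into a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sample_unit(bias, states, weights, rng):
    # total input = the unit's bias plus input coming from other units
    total = bias + float(states @ weights)
    p_on = sigmoid(total)               # probability of outputting a one
    return (1.0 if rng.random() < p_on else 0.0), p_on

rng = np.random.default_rng(0)
state, p = sample_unit(0.0, np.array([1.0, 1.0]), np.array([2.0, -1.0]), rng)
# total input is 1.0, so p = sigmoid(1.0), and state is 1 with that probability
```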
Now, if you want to learn a belief net with multiple hidden layers like this, there's
one thing that's easy to do, which is after you've learned you can generate from
your model, that's nice and easy to do, and so you can see what the model
believes: you just choose these from their initial biases; then, given these, you
choose these from their biases plus what the top-down information is saying; and
you choose them all stochastically, and then you get to see an example. You do
it again, you see another example.
What's difficult in these nets is inferring the states of these latent variables when
you see data, even if you know all the parameters. And you have to solve this
inference problem, at least approximately, in order to do learning. So that's the
difficulty: inferring what is going on here given the data, even if you know the
parameters.
So if we go to a deep net, because we're going to want to learn -- if you're
inspired by the human brain you believe you want to learn at least sort of five
layers of variables here, if you're doing object recognition. Let's consider the
problem of learning the parameters that connect the first layer of hidden variables
to the data. So we want to learn these weights here.
Well, the problem is if I give you a data vector and you already have an estimate
for these weights, can you infer what might have been going on in this layer?
Now, if you ask what's a plausible pattern of the binary variables in this layer to
produce that data, you have to satisfy two things. It has to be a
pattern that's quite likely to have produced the data using these weights. It also
has to be a pattern that's quite likely to have been produced by the network above.
So I'll call that the prior.
This stuff upstairs needs to be quite likely to produce the pattern, and the pattern
has to be quite likely to produce the data. In fact it's the product of the probability
of the pattern coming from the prior and the probability of the data coming from the
pattern that determines your posterior distribution here.
And so just to learn these weights, to improve our initial estimate, we need to get
a sample from the posterior here. And that involves all of these weights up here,
and it involves integrating over all these hidden variables to figure out what the
probability of a pattern under the prior is. And it just looks like a hopeless problem.
All these weights interact. And it's going to be very hard.
So one method is to say, well, let's use a really bad approximation here. That's
called variational learning. But after trying that for a bit, my conclusion was that
[inaudible] just give up. So let's give up on learning deep belief nets. It's just too
difficult if you want to learn them with billions of parameters. And let's try a very
different kind of learning. And then something magical will happen.
We're going to learn in a different kind of network. It's not a directed network like
a belief net, it's an undirected network like a Markov random field, where there are
symmetric interactions. So it's a generalization of a [inaudible] network where,
instead of just things you observe that have direct connections, we have latent
variables, but all the interactions are symmetric here. So in graphical models
terms this is an undirected graphical model. And its connectivity graph is
bipartite. That is these guys don't connect to each other and these guys don't
connect to each other. At least to begin with.
Now, this model has one very attractive feature compared with a belief net, which
is that if I tell you the states of the observed variables here, if I give you an image,
these are all conditionally independent. So to get an unbiased sample from the
posterior here is trivial. You just look at the input this guy's getting from the
pixels, put that through the sigmoid, sample from the probability distribution that
gives you, and do that independently for all these guys, and you've solved the
problem of inference.
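Because the hidden units are conditionally independent given an image, that one-shot inference can be sketched like this (a toy example with made-up sizes; `sample_hidden` is a hypothetical helper name):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, rng):
    # Each hidden unit just looks at its input from the pixels, puts it
    # through the sigmoid, and samples independently -- an unbiased
    # sample from the posterior, with no iterative inference needed.
    p_h = sigmoid(v @ W)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

rng = np.random.default_rng(0)
v = np.array([1.0, 0.0, 1.0])             # a tiny binary "image"
W = rng.normal(0.0, 0.5, size=(3, 4))     # symmetric visible-hidden weights
h, p_h = sample_hidden(v, W, rng)
```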
So those are going to be great for doing perception if we can learn them,
because inference is going to be trivial. Now, we'd like to learn with more hidden
layers but let's not worry about that for now, let's just try to do it with one.
So the inference is easy, but it looks like the learning is going to be difficult.
Now, before I go into the details of the learning, I want to tell you what the big
surprise for me was. If you can solve the learning problem and we can do
approximate learning quite efficiently, then what you can do is this. You can take
your data, you can learn weights. These will be symmetric weights, to begin with,
that connect it to your first layer of hidden units. Call those feature detectors.
And so you learn what feature detectors to use for each data vector and you learn
these weights. Then you take the patterns of activity you get in your feature
detectors when you're driving them with data, and you treat those as data, and you
do it again.
The way I discovered this was I was using MATLAB and I thought, why don't I
just take my hidden probabilities and call them data and try it again? And so I
just said, data equals hid probs and off I went. And it sort of worked. It did
something sensible. And it took quite a long time to understand what kind of
model I was learning. Because you'd expect that if you learned a model with
symmetric interactions here and then you learned another model with symmetric
interactions here, the overall model you'd get would be a great big model with
symmetric interactions.
It turns out for reasons I'm not going to try and explain in detail, but you can read
the papers, the overall model you get is not that. When you learn a little
Restricted Boltzmann Machine here and another one here and another one here,
the overall model is it's a Restricted Boltzmann Machine at the top with
symmetric interactions, so that's an undirected graphical model. And then
everything below that is a directed graphical model. So this is very weird.
What we've managed to do is we've learned this directed graphical model
[inaudible] the fact that it's got this top-level associative memory here that's
undirected, and we learned it from the bottom up, one layer at a time. And it's
very surprising you can do that. But we accidentally solved the problem of how
you learn a deep directed model efficiently. So the only sort of fly in the ointment is:
so how do you learn one of these little guys? Because the learning problem
looks to be quite tricky. And we have an efficient approximation that works pretty
well. And I'll go into the details of that approximation after showing you a demo
of what this can do.
So for the demo, I'm going to take handwritten digits from the
[inaudible] database. I'm going to learn 500 feature detectors. I'm then going to
learn 500 feature detectors that represent the correlations among those features.
I'm then going to concatenate the states of these 500 feature detectors with the
right answer. I'm going to have 10 label neurons to say what it's classed as.
And I'm going to learn a joint density model here.
So it knows the answer when learning the last layer. But when it did all this
earlier learning, it didn't know what the labels were. Just for the top level here it
knows the labels. And in learning the joint density model it's not trying to be
good at predicting this from that; it's trying to be good at having this top-level
Boltzmann Machine generate the correct kind of pairs here.
And once it's done that learning, it turns out that if you use it to do
recognition, it's slightly better than a support vector machine, which is quite
surprising because the support vector machine is optimized for discrimination
and is a pretty good method. And this is partly because handwritten digit
recognition is a small problem. I mean, I think a lot of AI people aren't realistic.
You want a problem where you could plausibly solve the problem with only a few
million parameters if you can only learn a few million parameters. And at this
stage, that's all we could do. And so handwritten digit recognition is good for that
because it's clearly an interesting problem, but a few million parameters might be
enough. Okay. So I'm going to show you a demo of that thing after it's learned.
So I actually programmed all of this except for the interface. What we'll do first is
we'll show it recognizing something. So we first learned these 500 feature
detectors, and they're binary stochastic feature detectors. Then we learned these
500 with this as data. Then we learned these 2,000 with this and this as data.
And because it's stochastic, even though the image is the same, each time we go
up we'll get a different pattern of activity, but notice it's always very confident
that it's a four, right?
If I run it faster, you'll see that a bunch of these higher-level guys are stable. So
some of them aren't changing at all, and that's why it can be confident it's a four.
And there's a bunch that aren't quite certain. If I give it another digit, hopefully
it'll recognize that it's a five.
And even though all these are stochastic, note that it's almost certain it's a five.
>>: How many cycles do you have to go through to [inaudible].
>> Geoff Hinton: Here you just have to go up once before it gives you an
opinion. There's another bit of the demo where you have to run it for longer. So if I
show it something that's a bit ambiguous, it's a stochastic system, so it will
sort of oscillate between saying four and eight. And occasionally it'll say
things like six or something. And if you count up how often it says each thing,
you can see what probability it associates with the different classes. Or you can
do a computation which computes a quantity called the free energy, which will
allow you to compute how often it will say these two things in one computation,
without sampling. And that's the sort of better way. Yeah?
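The free-energy trick he mentions can be sketched as follows: for an RBM with binary hidden units, the hiddens can be summed out analytically, giving F(v) = -v·b - Σⱼ log(1 + eˣʲ), where xⱼ is the total input to hidden unit j. Clamping each candidate label in turn and comparing free energies gives the class probabilities without sampling (function names and shapes here are mine, not from the talk):

```python
import numpy as np

def free_energy(v, W, vbias, hbias):
    # F(v) = -v.vbias - sum_j log(1 + exp(x_j)),  x_j = v.W[:, j] + hbias[j]
    x = v @ W + hbias
    return -float(v @ vbias) - float(np.sum(np.log1p(np.exp(x))))

def label_probs(image, labels, W, vbias, hbias):
    # clamp each label vector in turn, then softmax the negative free energies
    fes = np.array([free_energy(np.concatenate([image, lab]), W, vbias, hbias)
                    for lab in labels])
    e = np.exp(-(fes - fes.min()))      # subtract the min for stability
    return e / e.sum()
```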
>>: [inaudible] data always binary at each level --
>> Geoff Hinton: Okay. There's a fudge that goes on. The fudge is, you can see
this isn't really binary. So we're using probabilities here. And you could train it,
and it works perfectly fine, if you sample ones and zeros from those probabilities
and then train it like that. Or you could train it just using the probabilities. So
this is a little mean-field approximation. And that's a bit quicker because there's
less noise around.
It's very important when you train it to make the hidden units binary when I'm
learning this module, but I can afford to make these be probabilities. And then
when I learn this module, I can use the probabilities of turning these guys on that
I extracted from the data as the data, but these had better be binary. And so on.
>>: [inaudible] your algorithm, does it actually refer to the binary values in
training, or was everything continuous in the actual math?
>> Geoff Hinton: The math is all binary values. You do the math with binary
values.
>>: [inaudible].
>> Geoff Hinton: No. The code has binary values in it. The code calls the
random number generator.
>>: Okay.
>> Geoff Hinton: But it calls it for setting the states of these guys. And that's so
you're sort of honest and you're not reconstructing the data from real numbers
here. You're reconstructing from bits.
Okay. That's the system, after it's learned, doing recognition. And it's very good
at that. It's not as good as a convolutional net, which is told about the relations
between pixels. But if you compare it with machine learning algorithms that
aren't told anything about space, then it's one of the best.
What's more interesting in this system is if you run it backwards.
>>: Could you decode convolutional [inaudible] by [inaudible].
>> Geoff Hinton: Easily, yeah. You can do convolutional versions of this and
[inaudible] it's time for other people who have done convolutional versions.
What we're going to do now is run the generative model. So we're going to fix
the state of the label. Let's fix it to a two. And now what we're going to do is just
go up and down in this top-level Boltzmann Machine here. So that's all that's
really happening. We don't look down here. We go up and down here with this
fixed, and it's sort of sampling from its model here. And if you run backwards
and forwards for a long time, it will start showing you samples of what it believes
here. It will show you the samples here.
But those don't mean anything to you. But below this top-level undirected model
we have this directed belief network. And so although this doesn't mean
anything, I can convert it into an image so you can see what it's thinking. So
this is what's going on in its brain, and this is what's going on in its mind. And it
takes a while for this network to settle down. After about 500 Gibbs samples
it's settled down. It's in the two ravine now, and that's what it's thinking. That's
what it's imagining. That's its mind.
And what's nice here is you are seeing an energy ravine in a 510-dimensional
space, which is normally quite hard to see. But you're seeing it wandering around
a ravine here. So there's an energy function for this top-level thing. Before it's
learned, when all the weights are zero, that energy function is a sort of flat
surface in a 500-dimensional space. After it's learned, that energy function has
10 ravines in it, and they have names. And if I turn that two unit on, the
2,000 weights coming out of here to the top-level units will lower the energy of
the two ravine and raise the energy of the other ravines. If you stumble around
here for long enough, you'll fall into the two ravine. And then you'll stay in the
two ravine and wander around, and you'll see what it thinks twos look like,
including very bad ones, but that's good, because it can recognize those.
Let's just change that label unit. We're going to change it to an eight. So now
these 2,000 connections will be lowering the energy of the eight ravine, and it
will stumble into that eventually. It's probably just luck. It's not really there yet,
as you can see. Now it's in the eight ravine. And it will generate all sorts of
different eights that it believes in, including ones with open tops. Thank you. This
isn't really a demo, it's a canned demo.
>>: [inaudible].
>> Geoff Hinton: Sorry?
>>: Can you switch on two labels?
>> Geoff Hinton: You can't switch on two labels, because these have the
constraint that exactly one of them can be on. If you put two of them on at a
half, that you could do: you could do .5 of this and .5 of this. Then it would try
and produce blends here, I assume. I never actually tried that. I should do that.
People have suggested it before, and it would be a nice thing to do.
>>: So this is the net without doing fine tuning with backpropagation?
>> Geoff Hinton: Right. So far -- now, there was a little bit of generative fine
tuning I'm not going to talk about, because it just confuses things.
>>: [inaudible] the generation may not be as good as you show here.
>> Geoff Hinton: Without the generative fine tuning, the generation wouldn't be
as good. But it would still be pretty good. But this is all trained as a generative
model here and a joint density model here. No discriminative tuning yet.
Now what I want to show you is what the learning algorithm is. So a long time
ago, Terry Sejnowski and I came up with something called a Boltzmann Machine,
which was a sort of general undirected model with binary units and arbitrary
connectivity. And we got a nice simple learning algorithm for it that was
hopelessly inefficient. And so 17 years later I figured out how to make it run a
million times faster. That's because computers got 10,000 times faster, and I
made the algorithm go a hundred times faster. The trick was to wait 17 years.
So what we're going to do is restrict the architecture so we don't allow any
connections here or any connections here. That makes inference very easy in
this kind of Restricted Boltzmann Machine. And as I said before, we can easily
get a sample here. But the question is how do we do the learning? And to
understand how the learning works, you need to understand a bit about how this
model -- this kind of undirected model -- models data. So underlying it there's an
energy function. I'm going to leave out all the biases so the math stays simple.
This is a vector of binary activities for the visible units. This is a vector of binary
activities for the hidden units. And if you tell me the states of the visible and
hidden units, which are ones or zeros, I can tell you the energy of that
configuration. And it has a low energy if it has visible and hidden units that are
both on with big positive weights between them. That's a happy state for the
network.
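With the biases left out, that energy function is just E(v, h) = -Σᵢⱼ vᵢ wᵢⱼ hⱼ, which in code is one line (my notation, matching the talk's simplification):

```python
import numpy as np

def energy(v, h, W):
    # E(v, h) = - sum_ij v_i * w_ij * h_j   (biases left out for simplicity)
    return -float(v @ W @ h)

# A visible unit and a hidden unit that are both on, with a big positive
# weight between them, give a low (happy) energy:
W = np.array([[2.0]])
both_on = energy(np.array([1.0]), np.array([1.0]), W)   # -2.0
hid_off = energy(np.array([1.0]), np.array([0.0]), W)   #  0.0
```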
And the trick is, what we'd like is when you put in a visible pattern that
corresponds to data, it says yes, I'm happy with that, I have a low energy state
that goes with that, some configuration of the hidden units makes me very
happy. When you put in rubbish, it says I can't find any configuration that makes
me happy with that, that's an improbable thing.
This energy function has a nice simple derivative with respect to the weights,
which is just this [inaudible] statistic: are these two guys on together, is this a
one and is this a one? So it's very easy to change the weights to modify the
energy. But unfortunately that's not sufficient to do learning. That's only half the
story. Because the way this model defines the probability of a joint vector of
visible and hidden units is: if you tell me the full state of the system, tell me the
states of the visibles, tell me the states of the hiddens, then I can figure out the
energy that that full configuration has, and then the probability is proportional to
e to the minus the energy. Low energy is highly probable. But of course it
has to be normalized by all the other alternative states of the network.
And this is called the partition function. And this makes learning opaque. If you
want to know the probability of a visible vector, you just sum over all
possible hidden configurations like this. So that's the probability assigned to a
visible vector.
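For a tiny RBM you can compute that normalized probability by brute force, which makes the role of the partition function Z concrete (exponential-cost enumeration, so this is purely for illustration; the weights are made up):

```python
import numpy as np
from itertools import product

def energy(v, h, W):
    return -float(np.asarray(v) @ W @ np.asarray(h))

def prob_visible(v, W):
    # p(v) = sum_h e^{-E(v,h)} / Z, where Z sums e^{-E} over ALL
    # visible-hidden configurations -- the partition function.
    n_vis, n_hid = W.shape
    def unnorm(vv):
        return sum(np.exp(-energy(vv, h, W))
                   for h in product([0, 1], repeat=n_hid))
    Z = sum(unnorm(vv) for vv in product([0, 1], repeat=n_vis))
    return unnorm(v) / Z

W = np.array([[1.0, -0.5], [0.5, 2.0]])
# probabilities over all visible vectors sum to one
total = sum(prob_visible(v, W) for v in product([0, 1], repeat=2))
```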
So now if I want to do learning, and I show you some data, suppose V is some
data, I say: make V more probable. What are you going to do? Well, what you
want to do is you want to lower the energy of all the full configurations that
contain V, and you want to raise the energy of all other configurations. And then
that will make this more probable. It turns out there's a very simple way to do
that. It's just computationally expensive. You start with data. So this is a data
vector. Using the data, you activate your hidden units. So each hidden unit gets
some input from the data and stochastically decides whether to be a one or a
zero. And you can do all these in parallel, because they're all conditionally
independent given the data, because of the connectivity of the net.
Then given this vector here, you reactivate the visible units. And that's exactly
the same computation the other way around, using the same weights but the
other way around. And then you do it again. This is called alternating Gibbs
sampling or block Gibbs sampling. All these guys can be updated in parallel, then
all these guys in parallel, then all these guys in parallel.
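One step of that alternating (block) Gibbs sampling can be sketched like this: all hidden units update in parallel, then all visible units in parallel, using the same weights the other way around (the names and sizes are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_step(v, W, rng):
    # all hidden units in parallel (conditionally independent given v)...
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
    # ...then all visible units in parallel, with the transposed weights
    v = (rng.random(W.shape[0]) < sigmoid(h @ W.T)).astype(float)
    return v, h

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.5, size=(6, 4))
v = rng.integers(0, 2, size=6).astype(float)
for _ in range(100):   # run long enough to forget the start: "fantasies"
    v, h = gibbs_step(v, W, rng)
```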
If you go on long enough, you'll have forgotten where you started. That's called
equilibrium. And you'll be able to see fantasies from the model. These are
exactly what I was showing you in the demo, except that to show you the
fantasies from the top-level RBM, I had to go through a few more layers to turn
them into an image. But I'm showing you those fantasies.
And the learning algorithm is now very simple, if you're willing to run this chain
long enough to get fantasies. You simply measure how often a pixel and a
feature detector are on together with the data, and how often a pixel and the
feature detector are on together in the fantasies that the model produces. And
it's the difference of those statistics that is the derivative of the log probability of
the data with respect to a weight on a connection. And that's a bit surprising,
because the derivative with respect to this weight depends on all the other
weights in the network, and they didn't seem to show up here. But all the other
weights in the network determine this quantity here. And so they show up in this
quantity here. But it's not like backpropagation, where you explicitly have to sort
of go through those weights. They just show up in these statistics.
So Terry Sejnowski and I got very excited about this rule because it's a local rule.
It's local to a synapse, just the two neurons it connects. But it will do sort of
sensible learning for one of these models. But it takes a long time, because you
have to settle down to the equilibrium distribution, so I got bored. And what
happens if you just do this? You don't go all the way to equilibrium. Well, it turns
out the learning works just fine. Not quite as well as if you go to equilibrium. It's
not maximum likelihood learning anymore. But it works pretty well. And it's
certainly good enough for many applications. So now you've got an efficient
learning algorithm.
You take your data, you activate your feature detectors, you reconstruct the data
from the feature detectors, you activate the feature detectors again, you take the
difference of these statistics with the data and these statistics with the
reconstruction, and that's your learning signal. Multiply that by some small
learning rate and away you go.
And so we take many batches of data, measure these statistics over a small
batch, then update the weights, and then do it again and again and again.
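This shortened procedure is what became known as contrastive divergence (CD-1). A minimal sketch of one weight update on a single data vector, under my own naming (real code would use mini-batches, biases, and the binary-versus-probability choices discussed earlier):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, lr, rng):
    # positive phase: pixel-feature co-occurrence statistics driven by data
    p_h0 = sigmoid(v0 @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)  # hiddens kept binary
    # reconstruct the data from the features, then re-activate the features
    p_v1 = sigmoid(h0 @ W.T)
    p_h1 = sigmoid(p_v1 @ W)
    # learning signal: difference of the two sets of pairwise statistics,
    # times a small learning rate
    return W + lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))

rng = np.random.default_rng(0)
W = np.zeros((4, 3))
v0 = np.array([1.0, 1.0, 0.0, 0.0])
W = cd1_update(v0, W, lr=0.1, rng=rng)
```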
And that's how that model was learned. And now as I already mentioned, if you
can learn one layer of features like this, you can then treat those features as data
and learn another layer of features. All of this without knowing any labels yet.
And so you can learn lots of layers.
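The greedy recipe, treating each layer's hidden activities as data for the next layer, is then just a loop. In this sketch `train_rbm` stands in for the learning rule just described (here stubbed with random weights so the sketch runs on its own), and the layer sizes match the demo:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng):
    # stand-in for the real learning rule: returns a weight matrix of the
    # right shape so the stacking loop is runnable
    return rng.normal(0.0, 0.1, size=(data.shape[1], n_hidden))

def greedy_stack(data, layer_sizes, rng):
    # learn one layer, then treat its hidden probabilities as data and
    # do it again -- effectively "data = hid_probs" and off you go
    weights = []
    x = data
    for n_hidden in layer_sizes:
        W = train_rbm(x, n_hidden, rng)
        weights.append(W)
        x = sigmoid(x @ W)   # hidden probabilities become the new "data"
    return weights, x

rng = np.random.default_rng(0)
images = rng.integers(0, 2, size=(10, 784)).astype(float)
weights, top = greedy_stack(images, [500, 500, 2000], rng)
```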
We can prove that if the layers are the right size and they're initialized correctly,
then every time you add another layer -- well, what we'd like to prove is that
every time you add another layer you get a better model of the data. And that's
true when you add the second hidden layer: you do get a better model of the
data. But as you add later layers, all we can prove is that there's a bound that
improves each time. So it's conceivable that the model of the data will get
slightly worse, but there's a variational bound which improves each time. So
there's something that's improving. It's always encouraging when you're doing
learning to know that something sensible is improving.
And in fact, the probability of the data in all the cases we've seen actually gets
better. But all we can prove is the variational bound.
>>: So you said that the first layer is guaranteed to [inaudible] or is it just.
>> Geoff Hinton: There's the following guarantee: when you add the second
hidden layer and then start changing its weights, the log probability of the data
will improve.
>>: That's guaranteed?
>> Geoff Hinton: That's guaranteed, when you start changing its weights. Of
course you could change its weights and the log probability could improve and
then it could actually go down again. But it will always stay above what the log
probability was when you first added that second hidden layer. But we can't
prove that for later layers. That's it for the math. I'm not going to go through the
math.
The math says you're doing something reasonably sensible. So then you throw
away all the math, you violate all the conditions of the math and you get on with it
and you see what you can do. Now, once you've trained this model, these
multiple layers, you can fine tune it. And for fine tuning it, backpropagation is
very good. So the easiest way to do that, it's not what I showed you previously,
the easiest way is you just train lots of layers, then you add 10 labels at the top
with initially sort of random connections to the last layer, and then you just use
backpropagation on that net. And that works much better than using
backpropagation starting from random weights.
>>: I thought that you do backpropagation using the same configuration early
on. I think you --
>> Geoff Hinton: There's two ways to do it. There's two ways to do it, and I don't
want to sort of confuse people any more than I have already. Yeah?
>>: Are you locking the weights on everything except for the last layer?
>> Geoff Hinton: Okay. The answer is no, we're training all the weights now with
backpropagation, but the lower weights don't change much. So if this is a weight
vector in a lower layer, when you do the fine tuning it will sort of go like that.
Now, if you look at the feature detector, it doesn't really change at all. But if
you go like that with a lot of feature detectors, you can move a decision boundary
quite a lot. So what's happening is the unsupervised learning discovered all the
feature detectors, and the labels don't have to be used to create feature
detectors; they're just used to very slightly change them to get the decision
boundaries in the right place. And so you don't need many labels.
So the optimization view of this is that the greedy learning designs all the feature
detectors and tells us what part of the weight space we should be in. When we
first turn on the backpropagation, we'll have a big gradient there, and we'll go a
small distance with this big gradient and then we'll trickle off, and we won't leave
that region of the space.
So where we are in this whole space is determined by the unsupervised learning,
and the backprop is just fine tuning. And we get much better optima like
that. [Inaudible] and his students have shown that if you compare starting from
small weights and just using backprop with this, this gets you to a part of the
space that you never get to if you start with small weights and use backprop.
You just get very different solutions and these solutions are much better.
The other thing is the learning generalizes better. So we design all the feature
detectors so as to model what's going on in the image, and also so as to get the
right labels, and so we don't actually need many labels. The labels are just
slightly tweaking things. So you won't overfit these [inaudible], and in particular
you can use this kind of learning when you have a huge dataset, most of which is
unlabeled. Your huge dataset designs all your feature detectors, and then your
few labeled examples can be used to fine tune things a little bit. And that's sort
of the future, I think.
So I just want to justify why this whole approach makes sense before I get on to
the 3D examples. You can tell I don't have a very good justification because I
have to appeal to concepts that come from Rumsfeld. So, stuff. Stuff is what
happens, right? And if you believe images and labels are created like this, that,
you know, stuff in the world creates an image and then the image creates the
label, then machine learning in the standard old-fashioned way of trying to
associate labels with images is the right thing to do.
>>: You mean discriminative learning?
>> Geoff Hinton: Discriminative learning, yes. So that would be the case, for
example, if the label was the parity. If this was a binary [inaudible], the label is
whether it's even or odd parity. Given the image, the label doesn't depend on the
stuff; everything you need to know about the stuff is in the image.
But that's not what you really believe, at least not for most data. What you really
believe is that the stuff out there in the real world gave rise to an image, and
the stuff out there in the real world also caused someone to give a name to the image.
But they didn't get this label from the pixels, they got it from the stuff. You know,
there was a cat out there. It's because there was a cat out there that the name's cat.
It's not because pixel 55 is orange or anything.
Now, in particular, there's a very high bandwidth path there and a very low
bandwidth path there. And in that situation it makes a lot of sense to use
unsupervised learning to get from here to there. And we know that it's possible,
because we do it, right? Little kids do it. They don't learn object categories by
their mothers telling them the name of every object. Their mother points out the
window and says cow, and in the distance there's a field with clouds and a river
and a small brown dog, and they know what she's talking about. But the label
information is terrible.
It's only because they already have the concept of cow, they say oh, that thing
must be called a cow.
>>: What's your [inaudible] more technical definition of bandwidth?
>> Geoff Hinton: Okay. I'm not going to give you a more technical definition but
I'll explain it a little bit.
If I show you a picture of the cow in the field, you can answer questions like, you
know, is the cow standing up or lying down, is it a big cow or a small cow, which
way is it facing, is it moving, is it brown or is it black and white? All these
questions. So you have lots of information about the stuff from a picture. From
the label cow, if I just say cow and then say what color is it, well, you're sort of
out of luck. There's not much information here. In fact, the most information that
could be here is minus the log of the probability of the word cow, which is like 13
bits or something.
>>: So there's -- I was at a workshop recently on recognition and one of the
newer trends in the last two or three years is some people are starting to look at
learning attributes of images as opposed to learning --
>> Geoff Hinton: Absolutely. And they've got more information there. Yes. I
agree. I completely agree with that. But let's come back to that in the question
time.
But anyway, this is the justification for what we're doing: there's enough
information in an image to figure out what's going on in the world. And once
you've figured out what's going on in the world, then you're in much better shape
for assigning labels to things. And it's silly to try and learn all of this stuff by
backpropagating from here. You should learn all of this stuff by trying to
understand what's going on in the image and then maybe just slightly fine tune it
to get the right decision boundaries.
>>: So then the difference between the [inaudible].
>> Geoff Hinton: The idea is the features are meant to be a model of stuff in the
end. We have this modeling material. It's like making a car out of clay, right?
It's not that you really believe the hood of the car is actually made of clay;
you just have clay, which is modeling stuff, and you can make anything out of it.
We have these features, and you can make sort of anything out of them.
>>: That should be independent --
>> Geoff Hinton: That should be the stuff, yes. Okay. So one thing I did
recently, which is quite encouraging: as I said, what if our labels aren't very
good? So let's corrupt the labels. So I went through the training set and with
probability .5 I made each label wrong. If it said it was a two, I made it one of
the other nine labels with probability .5, and I just did that once, so it was really
corrupted. It's not that you corrupt it differently each time so you can average it
away. You just do it once.
And now it turns out that if you have this kind of architecture, you have this
pathway from the data that's saying what these hidden true labels should be, and
you can also infer from the noisy labels what they should be. You start off with
a confusion matrix that says it's roughly the identity but with a bit of off-diagonal
noise. And after a while, what will happen is if you show it a very nice two and
you say it's a four, it will say rubbish, it's a two: I've got overwhelming evidence
from here that it's a two, I just don't believe this. And then it will adapt its
confusion matrix to say sometimes when it's a two the guy says four. And it turns
out it can learn the right confusion matrix here and it can get very good
performance. So with 50 percent of the labels wrong, it can get down to two
percent error on both the training and test data. There's not much difference
between training and test.
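The mechanism just described can be sketched numerically. This is a minimal stand-in, not the actual architecture: near-one-hot posteriors over the true class simulate the evidence coming up from the generative model, half the labels are corrupted once, and a 10 by 10 confusion matrix is re-estimated from the model's posteriors and the observed labels.

```python
import numpy as np

rng = np.random.default_rng(0)
K, n = 10, 5000

# Hypothetical stand-in for the model's evidence: near-one-hot posteriors
# over the true (hidden) class for each training case.
true = rng.integers(0, K, n)
post = np.full((n, K), 0.02 / (K - 1))
post[np.arange(n), true] = 0.98

# Corrupt each label once, with probability .5, to one of the other nine.
noisy = true.copy()
flip = rng.random(n) < 0.5
noisy[flip] = (true[flip] + rng.integers(1, K, flip.sum())) % K

# Start the confusion matrix C[t, o] = p(observed o | true t) near the
# identity with a bit of off-diagonal noise, then re-estimate it from the
# model's posteriors and the observed noisy labels.
C = np.eye(K) * 0.9 + 0.1 / K
for _ in range(10):
    resp = post * C[:, noisy].T             # p(t | x_i) * p(o_i | t)
    resp /= resp.sum(axis=1, keepdims=True)
    counts = np.zeros((K, K))
    for o in range(K):
        counts[:, o] = resp[noisy == o].sum(axis=0)
    C = counts / counts.sum(axis=1, keepdims=True)

print(np.diag(C).round(2))  # diagonal hovers around the true .5 correct rate
```

The learned diagonal recovers the actual corruption rate: the model has figured out how often "the guy says four" when it's really a two.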
In other words, it's corrected all the wrong labels. It just knows you're talking
nonsense. It's like a very good student who you tell them the truth and they
believe you. You tell them something false, and they don't believe you. And
that's how they can get to be smarter than their advisors. Yeah?
>>: [inaudible] slightly different about the problem, which is that -- which is fine,
and I think it's good in the case that you're looking at -- which is that in the
unsupervised data there are distinct categories --
>> Geoff Hinton: Absolutely. They're really there, yes.
>>: Because I was thinking like on an alternative case where you might have
documents, for instance, right, where many different labelings might be correct.
In that case, I think this wouldn't really help you very much, because the
[inaudible] there wouldn't be sensible --
>> Geoff Hinton: This works -- well, when there really are natural categories and
the labels are just giving it sort of noisy information, right, what to call these
categories. Yes?
>>: [inaudible].
>> Geoff Hinton: Right. If you make the labels 80 percent wrong, it still gets only
five percent error. So it really can deal with very -- then if you ask, well, how much
is one of these cases worth? Roughly speaking, you compute the mutual
information between the label and the true class, and that tells you how much
your case is worth. So here it's .07 bits and here it's .33 bits. So these are worth
sort of 50 times less than those, but they're still worth quite a lot.
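As a rough illustration of that calculation, here is my own sketch of one plausible corruption scheme: a uniform prior over 10 classes, with a wrong label drawn uniformly from the other nine. The exact figures quoted in the talk depend on the exact scheme that was used.

```python
import numpy as np

def label_mi(n_classes, p_wrong):
    """Mutual information (bits) between the true class and a noisy label,
    assuming a uniform class prior and wrong labels drawn uniformly from
    the other classes."""
    K = n_classes
    cond = np.full(K, p_wrong / (K - 1))  # p(label | class), one fixed class
    cond[0] = 1.0 - p_wrong
    nz = cond[cond > 0]
    return np.log2(K) + np.sum(nz * np.log2(nz))  # H(label) - H(label|class)

print(label_mi(10, 0.0))   # perfect labels: log2(10), about 3.32 bits
print(label_mi(10, 0.8))   # 80 percent wrong: about 0.06 bits per case
```

The point survives the details: even heavily corrupted labels carry a small but nonzero number of bits per training case.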
>>: [inaudible] 50 percent able but [inaudible] the other 50 percent correctly
[inaudible].
>> Geoff Hinton: Then you'd be much better off, because then you've got much
more mutual information. If I tell you the label's right in this case and
it's wrong in that case, I'm telling you a whole lot: just throw away the cases
where it's wrong.
But if I don't tell you which are the right cases, there's much less mutual
information. Notice 20 percent of these are right, but there's only a 50th as much
mutual information as with perfect labels. And it seems to me the mutual
information is showing how much you can get out of a training case,
which is nice.
>>: [inaudible] matrix there is really just matrix that [inaudible].
>> Geoff Hinton: Yes. It's a 10 by 10 matrix, and you learn how often when it's
really a two the guy says it's a four.
>>: [inaudible] the label [inaudible].
>> Geoff Hinton: Obviously it could do a random permutation. And so that I can
interpret it, I initialized it with a [inaudible] matrix so you would have the obvious
correspondence. That's not necessary.
>>: What results did you find when the labels were completely random.
>> Geoff Hinton: I didn't do it when the labels were completely random. But
what should happen is it says the noisy labels aren't telling me anything.
Anything could happen. But it still uses those hidden labels for natural
categories. Now, in this data it's not clear that four and nine are naturally
different categories; there are very, very similar fours and nines in many cases.
For categories like one and zero it will assign labels cleanly -- it will use one of the
labels for one and another label for zero, for the very obvious categories.
Okay. Now I'm going to talk about new stuff. So [inaudible], who made the
[inaudible] database, also made a database for doing 3D object recognition that's
carefully controlled. There are five classes. For each class there are five training
examples, and these training examples are photographed with lots of different
lighting conditions and viewpoints -- many, many viewpoints and many lighting
conditions. But you can't just remember the training data, because the test items
are different animals. So one of the test animals is a stegosaurus. So you
have to generalize from knowing that these are animals to knowing that a
stegosaurus is an animal and not a plane. Okay.
One of the classes is a bit problematic: it's humans. They were purchased as toys
in the US, and every single one of them is holding a weapon. So the concepts of
human and holding a weapon are the same.
>>: [inaudible].
>> Geoff Hinton: What? Sorry?
>>: [inaudible]. [laughter].
>> Geoff Hinton: Okay. So the first problem we have with this is that it's high
dimensional. It's two 96 by 96 images, which for machine learning is quite a lot of
dimensions. What we did is we take all the pixels around the edge
and make them much bigger, so we've got fewer of them. In a sense we're giving it
some knowledge that the stuff around the edges isn't so important. So we're giving it
a bit of knowledge there. That way we get it down to 8,000 dimensional data.
We make that data zero-mean, unit-variance. And then we have to face the
problem that in images of digits, ink is really binary stuff, and so you can get
away with binary variables. In real images you can't get away with that.
You can't really represent a real pixel by the probability of a binary variable. Real
pixels have the property that, given my neighbors, I've got a very sharp
distribution: I'm almost certainly the average of my neighbors. You can't do that
with a binary unit. So how are we going to model these real-valued things?
So what we do is we adapt the Restricted Boltzmann Machine to use a different
kind of visible unit. It's got a different energy function which is this. But I'm going
to sort of tell you what that energy function is.
We say each visible unit is going to have an independent Gaussian noise
model. So if you take the negative log probability of a Gaussian, you get a
parabola. So in terms of energy, which is negative log probability plus a
constant, the unit would like to be sitting around its mean, which is its bias,
and it costs energy to go away from that.
Then if you look at the top down input that it gets from the hidden units when
you're going to do reconstruction or when you are running the model by itself to
produce fantasies, the top-down input from the hidden units looks like this, and
we can factor that into the state of the visible unit. The energy contribution is like
this. The set of visible unit times the summary rule hidden units of the state of
the hidden unit times the weight on the connection. So if we differentiate with
respect to the state of the visible unit, we'll get this thing, which gives us this
slope here.
So the top-down input has the effect of saying it's better off to be here than there,
and you win linearly as you go in this direction.
So now if you take a parabolic containment and a linear thing from the top-down
input, the minimum will be where the gradients are equal and opposite,
about here. So the hidden units have the effect of moving this parabola over. So
that's the model. And using visible units like that, it's a bit of a bad model of
images because we're assuming the pixels are uncorrelated given the hidden
units. But we'll come to that later.
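In symbols (my sketch, with unit variance): the energy of one visible unit v with bias b is the parabola (v - b)^2/2 minus the linear top-down term v times the summed hidden input, so the minimum sits at v = b + h.w and the active hidden units shift the parabola sideways.

```python
import numpy as np

def visible_energy(v, b, h, w):
    """Energy of one Gaussian visible unit (unit variance, my notation):
    a parabolic containment around the bias plus a linear top-down slope."""
    return 0.5 * (v - b) ** 2 - v * np.dot(h, w)

b = 0.0
h = np.array([1.0, 0.0, 1.0])    # binary hidden states
w = np.array([0.7, -0.3, 0.5])   # weights into this visible unit

# Setting dE/dv = 0 gives v* = b + h.w: the top-down input moves the
# parabola's minimum over, which is where the reconstruction wants to go.
v_star = b + np.dot(h, w)
vs = np.linspace(-3, 3, 601)
assert np.isclose(vs[np.argmin(visible_energy(vs, b, h, w))], v_star, atol=0.011)
print(v_star)  # 1.2
```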
What Vinod did was train this up. This is the stereo pair, these sort of fuzzy edges.
He trained up a Restricted Boltzmann Machine with 4,000 binary hidden units
and 9,000 visible Gaussian units, and he just trained this up and then we never
changed those weights. He didn't try to change those at all. He just trained them
up. So that's pure unsupervised.
And then the architecture that worked best here was to say, instead of trying to go
from here to labels, what we'll do is learn five different models, five different
density models of these 4,000 dimensional vectors, but we're going to learn them
in a special way. Each of these models is going to be trained on data from its
own class. So you have to know the labels to train this top level, and so only
this is pure unsupervised. It's going to be trained as a generative model to
produce data that looks like its own class. But in addition, we're going to do
discriminative training of all five classes. So there's a quantity called the free
energy, which is a measure of how much a model likes the data. And what you do
in the discriminative training is try to make the correct model have lower
free energy than the other models, at the same time as you're trying to make
whatever model is appropriate for this data be a better model of that data. And it
turns out in the end you use five times the discriminative gradient plus the
generative gradient.
The generative gradient is much bigger because there are 4,000 things to be
explained here, and it's a one-of-five choice that the discriminative thing has to
explain. And so you use a validation set to figure out that five times the
discriminative gradient plus the generative gradient is a good thing to use. And
you train it up.
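The ingredients can be sketched like this. This is a hedged sketch with made-up sizes, not the real 4,000-unit models: each class gets its own binary RBM, whose free energy says how much that model likes a data vector, and the discriminative part is a softmax over the negative free energies of the five models.

```python
import numpy as np

def free_energy(v, a, W, c):
    """Free energy of a binary RBM with visible biases a, weights W and
    hidden biases c: F(v) = -a.v - sum_j softplus(c_j + v.W[:, j]).
    Lower free energy means the model likes v more."""
    return -v @ a - np.logaddexp(0.0, c + v @ W).sum()

rng = np.random.default_rng(1)
D, H, K = 20, 8, 5               # toy sizes, not the talk's 4,000 by 4,000
models = [(rng.normal(0, 0.1, D), rng.normal(0, 0.1, (D, H)), np.zeros(H))
          for _ in range(K)]

v = (rng.random(D) < 0.5).astype(float)
F = np.array([free_energy(v, *m) for m in models])

# Discriminative part: p(class | v) is a softmax over negative free
# energies. The gradient of its log is what gets weighted by 5 relative
# to the generative gradient in the talk.
p_class = np.exp(-(F - F.min()))
p_class /= p_class.sum()
print(p_class.round(3))
```

At test time you would simply pick the model with the lowest free energy; the cost of this scheme is linear in the number of classes, which comes up again below.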
Each of these models now has 4,000 times 4,000, 16 million weights. This has
9,000 times 4,000, 36 million weights. Overall we've got 116 million weights. And it's
trained on only 24,000 labeled cases. That's the training set. So you'd have
thought it would overfit like crazy, but it doesn't.
Now, if you ask how many pixels there are here, there's 200 million pixels. So at
least you haven't got more parameters than pixels, which is good news. And
when you're explaining images, each image gives you much more constraint than
the label. And so you need far fewer training cases. Even if they're unlabeled.
And so the fact is we could train 100 million parameters on only 24,000 images
without seriously overfitting.
>>: [inaudible] did you use this Gaussian unit or did you [inaudible].
>> Geoff Hinton: The Gaussian unit.
>>: Gaussian unit. So this one could be better.
>> Geoff Hinton: This would be better, yes.
>>: [inaudible].
>> Geoff Hinton: Yeah. I think.
>>: [inaudible].
>> Geoff Hinton: I should say something about this. The amount of
computation you need to do here for the discriminative learning is linear in the
number of classes. So if you had 183 classes, it would be quite a bit of work.
For five classes it's easy. For natural object recognition, where there are maybe
50,000 classes, you don't want to do that.
But here are the results. If you take support vector machines -- this is NORB
without a cluttered background; with a cluttered background it's not very
[inaudible] can't do it -- they get 11.6 percent. So take that as the sort of machine
learning standard. If you take convolutional neural networks, which are told about
the structure of pixels, they get six percent. They're the sort of record holders so
far.
If you take this, our method, which isn't told about the structure of the pixels, it
gets almost as good as the convolutional neural networks. If you give it extra
unlabeled training data by translating the images but not telling it the labels --
this is just to show what extra unlabeled data will do for you -- it does considerably
better than convolutional neural nets. And a convolutional neural net done the
standard way wouldn't be able to use this extra unlabeled data. So this is just an
indication that unlabeled data is going to improve this quite a bit.
>>: [inaudible] labeled data and [inaudible].
>> Geoff Hinton: Indeed. We happen to know the labels for this because we
translated it ourselves. And that will help both methods, right. A convolutional
method that -- yes.
>>: [inaudible].
>> Geoff Hinton: Yes.
>>: [inaudible].
>> Geoff Hinton: It deals with that.
>>: So somehow that kind of [inaudible] has to be done [inaudible].
>> Geoff Hinton: So the hidden features are coping with that.
>>: Okay. So is that [inaudible] lower or is it done by the --
>> Geoff Hinton: We don't really know. In this network, there's just this hidden
layer and then the more class-specific things. So this hidden layer has to have
binary units that are coping with that lighting variation or viewpoint variation.
>>: Is there any way to examine whether [inaudible].
>> Geoff Hinton: You can look at the receptive fields of these guys. I'm not going
to show you those now; I want to get on to much more recent stuff. Well, that
was recent stuff too. But the stuff I'm excited about at present, because we did it
last week, is making the Restricted Boltzmann Machine model a lot more
powerful. So we've got this idea that you can learn these Boltzmann Machines
and you can stack them up. But we know that in vision you're going to need to do
multiplies. One signature of multiplies is heavy-tailed distributions: you can get
heavy-tailed distributions by multiplying things together. Take Gaussian
things and multiply them together, you get heavy tails.
And we know vision is full of these heavy tails. But anyway, we want things
with multipliers because we know we're going to need them. And one thing for
which you need a multiplier is this. Up until now, when you run the generative
model to generate data, you have active features in one layer and they're giving
biases to the features below. But given the features in one
layer, the features in the layer below were conditionally independent.
Wouldn't it be nice if a feature in one layer could specify an interaction between
features in the other layer? It could specify a covariance. So this feature can
say these two should be highly correlated. For example, suppose I had a feature
that was a vertical edge. If I ask you what a vertical edge is, you might start off by
saying, well, it means it's light here and dark there -- or maybe dark here and light
there (I think that's the same, actually) -- or maybe it's a stereo edge, or a motion
edge, or a disparity edge, or a texture edge. Those all seem to be
completely different definitions of an edge. What do they all have in common?
They all have the following in common.
If you believe there's a vertical occluding edge here, you shouldn't interpolate this
way. However you're going to do the interpolation, what a vertical edge means is
don't do it this way, do it this way, but don't do it that way. A vertical edge is an
instruction to turn off some direction of interpolation. So you can think of it as:
your default is that in an [inaudible] image things are very smooth and local
interpolation will work very well. You have very tight covariances.
If I give you this pixel, it's almost exactly the average of its neighbors. But
occasionally you want to turn that off. That's what a vertical edge is. But you
don't want to turn it off everywhere, you want to turn it off in this direction. Okay.
So how are we going to do that? What we're going to do is we're going to say
let's just take two pixels. Here are two pixels. We're going to have a linear filter
that looks at these pixels. It's going to be learned. And another linear filter that
looks at these pixels. We're going to square the output of each linear filter, and
then we're going to use that squared output in an energy function by putting a
weight on it.
So suppose this filter learns this unit vector and this filter learns this unit vector.
If I put a big weight on the squared output of this filter -- sorry, a big weight on
this filter; this one needs to be big -- that corresponds to saying it costs a lot to go
in this direction. So if I start off at zero, which is cheap, and then say it costs to
go in this direction, that corresponds to a Gaussian which is sharply
curved that way.
If I put a small weight on the output of this filter -- making this one small -- that's
a Gaussian that's gently curved this way. So each of these linear filters you can
think of as causing a parabolic trough in energy. When I add up all these
parabolic troughs I'll get an elliptical bowl, and so I can synthesize the precision
matrix of a Gaussian -- the corresponding energy function -- by squaring the
outputs of linear filters and putting weights on them.
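In matrix terms (my sketch of the construction): each filter f_k contributes a parabolic trough w_k (f_k . v)^2 / 2, and the sum is a quadratic form whose precision matrix is Lambda = F' diag(w) F. With more filters than dimensions, Lambda is a full, positive definite precision.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K = 4, 12                 # pixels, and rather more linear filters

F = rng.normal(size=(K, D))  # rows are the (here random) linear filters
F /= np.linalg.norm(F, axis=1, keepdims=True)
w = rng.random(K) + 0.1      # positive weights on the squared outputs

v = rng.normal(size=D)

# Energy as a sum of parabolic troughs...
E_troughs = 0.5 * np.sum(w * (F @ v) ** 2)

# ...equals a quadratic form with the synthesized precision matrix.
Lam = F.T @ (w[:, None] * F)
E_quad = 0.5 * v @ Lam @ v
assert np.isclose(E_troughs, E_quad)

# With K > D generic filters, the precision is positive definite: a proper
# full-covariance Gaussian built out of one-dimensional pieces.
assert np.all(np.linalg.eigvalsh(Lam) > 0)
```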
But now I can do something much nicer. I can say I'm going to have these linear
filters, but instead of just always adding them up, I'm going to put an extra hidden
variable in here that says whether I want to use this or not.
And that would allow me to modulate the covariance matrix or rather the inverse
of a covariance matrix is a precision matrix. So I can have this precision matrix.
I'm going to build it up out of all these components that are parabolic troughs
which I'm going to decide which ones I use. In speech recognition now I believe
they use 500,000 diagonal covariance matrices. I'd like to replace that by one
covariance matrix that you synthesize on the fly -- it's full covariance, but you
build it out of these little one-dimensional bases so it's appropriate for the current
data.
So here's how we're going to do that.
>>: On the other hand I mean there are so much data available we have
[inaudible] so it's just [inaudible].
>> Geoff Hinton: Well, then you could certainly afford to train this. So I'm going
to learn these parameters, to learn the direction of this linear filter. Then, on this
squared output, I insist this weight be negative -- it's very important that this
weight be negative -- and I have a big positive bias for this
hidden unit. So this hidden unit is going to spend most of its life being on and
contributing some strong term to the precision matrix.
Once you start violating that by going off in the direction it doesn't like, it's going
to say, whoa, you lose. You were originally winning by plus b -- that's this
[inaudible] here in negative energy -- and you're going to lose parabolically. That's
the Gaussian. At first this [inaudible] will be firmly on, so you get this bit of the curve.
But once you start losing a lot, you say, oh, maybe this constraint doesn't apply
anymore, maybe we don't want smoothness here. And that's this turning off of
the hidden unit here.
Once this guy's turned off, all of this is just irrelevant; it doesn't apply. So that's
got a very heavy-tailed flavor. If you want a t distribution, you can approximate it
very well by taking a product of several of these. So the
idea is we need more hidden units than there are dimensions -- quite a few more --
so that we really do get a full covariance matrix that's not unconstrained in some
directions. And then a few of them are going to be switched off to represent
violated constraints. Yeah?
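Summing out one of those gating units gives a penalty with exactly that shape. This is my sketch of the construction: a gate with positive bias b and negative weight -w on the squared filter output s = (f.v)^2 contributes -softplus(b - w*s) to the free energy, which grows like w*s for small violations and saturates at a fixed cost of about softplus(b) once the constraint switches off.

```python
import numpy as np

def gated_penalty(s, b=4.0, w=1.0):
    """Free-energy penalty (relative to s = 0) from one binary gating unit
    with bias b and weight -w on the squared filter output s = (f.v)^2."""
    return np.logaddexp(0.0, b) - np.logaddexp(0.0, b - w * s)

s = np.array([0.0, 1.0, 4.0, 100.0])
print(gated_penalty(s))
# Small violations are charged roughly linearly in s (so quadratically in
# f.v, like a Gaussian); a huge violation pays at most about softplus(b),
# the fixed price of turning the constraint off -- the heavy tail.
```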
>>: So since you're not -- you're not using a convolution, you're not encoding
spatial information, so literally it's the correlation matrix between arbitrary
pixels --
>> Geoff Hinton: That's what it will learn, yes.
>>: So technically what you're learning is not necessarily like edges in the sense --
>> Geoff Hinton: Well, it will turn out it will learn edges, just because that's
where the structure is.
>>: But wouldn't it have other like random stuff, just like if two --
>> Geoff Hinton: Absolutely. But if you average over a bunch of images, two
distant pixels just won't be correlated. If you look at
correlations over images, they fall off rapidly. So it will learn what's going on in
the data, which is these local things.
Also, we're actually going to learn on patches that are smaller than the whole
image here. We are going to go a bit convolutional here.
>>: [inaudible].
>> Geoff Hinton: Right. This is ideal for that. Because what it will say
is, normally you expect a [inaudible] coefficient to be the average of its two
neighbors in time. But just occasionally that's not at all true, and it's very not true,
so turn off that constraint altogether. Don't keep paying a penalty as they
get very different; say the constraint just didn't apply. You pay a fixed penalty and
then you can have a burst.
>>: This depends on the label. If we know that it's [inaudible].
>> Geoff Hinton: Right. But when you're going the other way, you want to detect
that thing, but it will help you predict the label. So you want to detect the
smoothness violation. Okay. So now the pixel intends is no longer independent
given the states of the hidden units. So we can't do a reconstruction by just
separately activating each pixel. We have to sort a system reconstruction. Life's
got a lot more difficult. The only place it's gotten more difficult is in the
construction. The hidden units are still independent given the pixels which is
nice. And we're actually going to have some hidden units to represent these
violated smoothness constraints and other hidden units to represent means.
They're all completely independent given the data. So inference is really easy.
In the end we want it for inference, and that's still simple.
But for doing the learning, we need to reconstruct. And when we reconstruct,
we've got these correlations back that we have to deal with. And the correlations
are different for every training example. So it's not like there's one inverse
covariance matrix that you invert once to get the covariance matrix
and then sample from. You'd have to do that separately for each
training example, which would be too expensive.
So we're going to use another method called Hybrid Monte Carlo. In Hybrid
Monte Carlo, you can integrate out all those hidden units -- these
switching-off hidden units -- and compute something called the free energy. And
then you can get the gradient of the free energy with respect to the activity of
one pixel, given the states of all the other pixels. And so now you can start at your
data point, look at that gradient, and follow that gradient, but
with the appropriate level of noise. And the way you do that is you start with
some initial random momentum for the data point and then you simulate a particle
travelling over that free energy surface with that initial random momentum. In
other words, the gradient of the free energy is used to accelerate the particle, and
how far it moves depends on its velocity. And there's a numerical trick called
leapfrog steps which makes the approximation good to second order. And you
do all that. I'm not going to go into all that.
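A bare-bones version of the leapfrog trajectory looks like this. This is a sketch only: real Hybrid Monte Carlo also carries an accept/reject step, and the free energy here is just a toy quadratic bowl standing in for the RBM's free energy.

```python
import numpy as np

def hmc_trajectory(x, grad_F, n_steps=20, eps=0.05, seed=3):
    """Follow the free-energy gradient with momentum: sample a random
    initial momentum, then take leapfrog steps (good to second order)."""
    rng = np.random.default_rng(seed)
    p = rng.normal(size=x.shape)       # random initial momentum
    p = p - 0.5 * eps * grad_F(x)      # half step for momentum
    for _ in range(n_steps - 1):
        x = x + eps * p                # full step for position
        p = p - eps * grad_F(x)        # full step for momentum
    x = x + eps * p
    p = p - 0.5 * eps * grad_F(x)      # closing half step
    return x

# Toy free energy F(x) = 0.5 * ||x||^2, so grad_F(x) = x. Starting at the
# "data" (a high-energy point), 20 steps move toward what the model
# likes -- the reconstruction that gets unlearned on.
x0 = np.full(5, 3.0)
x1 = hmc_trajectory(x0, lambda x: x)
print(0.5 * x0 @ x0, 0.5 * x1 @ x1)
```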
But essentially what this means is, given the hidden units, we're going to start at
the data and run this Hybrid Monte Carlo for 20 steps to get a reconstruction;
that's going to move it away from the data in the direction the model likes. And
then the learning's going to say, don't do that, stay with the data, and don't go
after things you like more. So unlearn on wherever you got to and learn on the
data.
And so when you have one set of units for modeling these covariances -- really
the inverse covariances -- and another set for modeling means, we call that the mean
and covariance RBM. And it's a generalization of this. You can throw away the
Gaussian containment function because that's going to be done by the
covariance bit. And so the mean units look just like the units we were using before,
but without that Gaussian containment. And the covariance units are giving you
the precision matrix. So the covariance units are assuming everything is
zero mean, but they're modeling the covariance; the mean units are saying, I
have no idea what the covariance is, but I'm modeling the mean. And so you can
think of the mean units as putting a slope like this in the energy function, and
the covariance units put in this parabolic bowl, and now you find the minimum
of this bowl on this sloping thing, and that's where you want to go. That's the
mean of what the reconstruction should be. And the Hybrid
Monte Carlo gets you some distance in that direction.
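That minimum can be written down directly. This is my sketch, with a random positive definite precision standing in for the one the covariance units actually synthesize: if the covariance units contribute precision Lambda and the mean units contribute a slope m, the energy is 0.5 v'Lam v - m'v and the reconstruction mean solves Lam v = m.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 6

# Stand-in precision from the covariance units (in the model it is
# synthesized from the squared filter outputs that are switched on).
A = rng.normal(size=(D, D))
Lam = A @ A.T + np.eye(D)

m = rng.normal(size=D)  # slope contributed by the active mean units

# Energy E(v) = 0.5 v'Lam v - m'v is a parabolic bowl on a slope; its
# minimum -- where the reconstruction wants to go -- solves Lam v = m.
v_star = np.linalg.solve(Lam, m)
assert np.allclose(Lam @ v_star, m)
print(v_star.round(3))
```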
So this is the dataset we're going to use. It's called the CIFAR-10 dataset,
because they paid for us to get the labels. It's based on the MIT tiny images
dataset, which is 80 million 32 by 32 images they got from the web by searching
with particular search terms. If you search for the term cat, about 10 percent of the
images you get are a nice image of a cat, and most of them don't have a cat in
them at all. So they're very unreliable labels, and we're always interested in
learning from unreliable labels.
But to begin with, we wanted a reliable label set. So we got people to go through
all the images that were found with the term cat, and they had to answer the
following question: Is there a single main object in the image? And if there is, is
there a reasonable chance that if you're asked to name it you would say cat?
Okay. So these guys all satisfy that. I have [inaudible] the cats, yeah.
They're very low resolution images, which is not good. But again, you have to be
realistic. I think we need a trillion parameters; we can only learn about 100
million at present, so we'd better simplify the task somehow. And we're going to
simplify it here by using low resolution. We'd love to do higher
resolution, but not just yet.
But notice -- look at bird here. You have a chicken, a close-up frontal view of
an ostrich's head, various other birds that I don't know the names of, and an ostrich's
head from some distance away that looks maybe a little bit like a prairie dog. I
don't know what that is. So I can identify nine of those ten birds, I think.
>>: [inaudible] label?
>> Geoff Hinton: These are manually labeled, yes. Someone went through and
said all of these are reasonable examples of a bird -- I might reasonably have said
bird if I was asked what that is. Deer is a particularly bad category. But these
are real objects. And this is the real kind of variation you'd like an object
recognition system to deal with, if you had already solved the problem of
focussing on one region where there was an object.
This, I think everybody will agree, is a sort of tough database to do recognition on.
>>: [inaudible].
>> Geoff Hinton: Okay. We've got one set where there are 10 classes and 5,000
training examples of each. We've got another set where there are 100 classes and
500 examples of each. And they don't overlap, so you can use one set as
negative examples for the other set -- guaranteed negative examples.
>>: They have [inaudible].
>> Geoff Hinton: We didn't do that. What we did do was the student in charge of
it all went through afterwards and just checked that all of them were okay. So
you won't find any really glaring errors in the labeling. Well, not more than a few.
So the way we applied our learning to that is for this learning these in this
covariance matrices, these adaptive once, we're trying to do it now in the whole
32 by 32 image. But to [inaudible] we did it on eight by eight patches because
we only started doing this two weeks ago.
And so we trained on these eight by eight patches. We learned 81 hidden units
for modeling the means, and 144 hidden units -- the guys that turn on and off --
for modeling the covariances. I've only got five more minutes. And those 144
units that are turning on and off are actually using 900 of these squared linear
filters. So each of them connects to several linear filters, not just one like I
showed before.
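The unit counts he just gave can be wired together in a small sketch. This is a hypothetical, untrained illustration of the plumbing only, with random weights standing in for learned ones; all names here are made up, and the energy-based training is omitted entirely:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

D = 8 * 8 * 3   # an 8x8 color patch, flattened
F = 900         # linear filters feeding the covariance units
K_cov = 144     # hidden units that gate the covariances
K_mean = 81     # hidden units that model the means

C = rng.normal(0, 0.1, (D, F))               # linear filters (learned in practice)
P = np.abs(rng.normal(0, 0.1, (F, K_cov)))   # pooling: several filters per hidden unit
W = rng.normal(0, 0.1, (D, K_mean))          # mean-unit weights
b_cov = np.zeros(K_cov)
b_mean = np.zeros(K_mean)

x = rng.normal(0, 1, D)                      # one (whitened) image patch

# Covariance units look at squared filter outputs: a big squared response
# drives the unit toward "off", i.e. the smoothness assumption is dropped there.
h_cov = sigmoid(b_cov - (C.T @ x) ** 2 @ P)

# Mean units are an ordinary binary hidden layer on the raw pixels.
h_mean = sigmoid(b_mean + W.T @ x)

features = np.concatenate([h_cov, h_mean])
print(features.shape)   # 144 + 81 = 225 features per patch
```

The point of the sketch is just the asymmetry: mean units see the pixels linearly, while each covariance unit pools the squared outputs of several filters through `P`.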
And you end up with 11,000 hidden units, most of which are modeling the
covariances and some of which are modeling the means. This is what the filters
that learn the means do. They learn very blurry things. But if you pick
sort of 15 or 20 of these to be on, you can synthesize roughly the
colors of any regions you want. So think of them as like watercolors, just
sort of covering regions. But they're not really telling you much about
where the edges are. But that's fine, because the other guys are going to tell
you where the edges are, and then when you reconstruct, you're going to have to
color in the region without any sharp discontinuities where the edges aren't,
because you've got high covariance there.
So what do the other guys get? They get a completely different kind of information.
They get sharp filters, and they break into two completely distinct classes.
There's guys who learn to be exactly black and white, even though they're looking at the
RGB signal -- [inaudible] to make it more human-like they see RGB. Only early on in
the learning are they colored, and they then learn to be exactly black and white.
They really, really are very, very well balanced.
And then there's other guys who learn color-opponent filters, and they learn to
exactly ignore intensity. So it just splits into those two sets.
>>: [inaudible].
>> Geoff Hinton: These are PNG images. I was worried about that too, yes.
I'm fairly sure it's not that, because PNG is lossless. I mean, the PNGs
may be of things that were JPEGs, but I don't think it's JPEG [inaudible].
>>: [inaudible].
>> Geoff Hinton: I can give you a reason why this might happen -- we still
have to do the test -- which is: if you look at edges, some are internal to an object, and
they have an intensity contrast but no color contrast because it's the
same material. And then others are occluding edges, and they have both intensity and
color contrast. So a lot of the edges you see have no color contrast. And that's a
very good way to tell that something's not an occluding edge.
Almost always if there's a color contrast, it's an occluding edge, or it's an edge
between two different materials in your T-shirt or something. Two different kinds
of stuff.
>>: So this is [inaudible] receptive fields.
>> Geoff Hinton: This is the receptive fields of those linear filters whose squared
output goes to the units deciding whether that thing applies or not. And if it
gets a big output, it says, "I don't apply -- there's no smoothness there."
>>: So the way I see all these edges there, is it -- does it show that [inaudible].
>> Geoff Hinton: There will be edge detectors.
>>: [inaudible].
>> Geoff Hinton: Yes. Now, you notice they're clustered. And that's because
we formed a topographic map by a little trick. I'll show you the 1D trick for a
topographic map. You lay out all your linear filters in a row, you lay out all your
hidden units in a row, and then you have local connectivity between your linear
filters and your hidden units. So those two linear filters both connect to this hidden
unit and to this hidden unit. So they have something in common: they tend to
go to the same hidden units.
If you go to the same hidden units, it pays to use similar filters. Because if one of
you goes off, you pay a penalty. If the other goes off, you don't pay any more
penalty because the hidden unit's already turned off. So if you're going to pay
these [inaudible] heavy penalties, you want all the guys that go on together to all
go to the same hidden unit. And that causes it to form -- in this case it would be a
1D map. If you do this in 2D, you get a 2D topographic map.
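The 1D trick he describes can be made concrete. A toy sketch with made-up sizes, showing only the local-connectivity pattern between filters and hidden units, not any learning:

```python
import numpy as np

n_filters = 12   # linear filters laid out in a row
n_hidden = 6     # hidden units laid out in a row
width = 4        # each hidden unit pools a local window of filters

# Connect each hidden unit to a local, overlapping window of filters.
# Overlapping windows are what make neighboring filters share hidden units.
P = np.zeros((n_filters, n_hidden))
for j in range(n_hidden):
    start = j * (n_filters // n_hidden)
    for i in range(start, start + width):
        P[i % n_filters, j] = 1.0

# Neighboring hidden units share filters, so it pays for adjacent filters
# to become similar -- that shared-penalty pressure is what forms the 1D map.
shared = [float((P[:, j] * P[:, j + 1]).sum()) for j in range(n_hidden - 1)]
print(shared)   # → [2.0, 2.0, 2.0, 2.0, 2.0]
```

Each consecutive pair of hidden units shares two filters here; doing the same construction over a 2D grid of filters and hidden units gives the 2D topographic map.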
So now we're going to ask how well it does on that CIFAR-10 dataset
compared with just modeling the means, for example. So if you just model the
means, even using many hidden units to do that, you get just under 60 percent
correct. If you only model the covariances, you do better. If you model the
covariances and the means, you do quite a lot better. Sorry -- if you model the
covariances but you use lots of these linear filters, you do quite a lot better. One
linear filter per hidden unit does better than this but not as good as that. If
you now use both of these -- the means will learn to be a very different thing if
you're also modeling covariances -- you do even better.
And if you then take these 11,000 hidden units -- these are all binary now, right --
and you just do greedy learning, you take their probability values as your data, you
greedily learn a set of 8,000 units, and now you attach your labels to those. In all of
these we just do a logistic regression on the final layer. You'll do even better. So
that says this is a good input for this greedy layer-by-layer
learning. And we're up above 70 percent.
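The greedy pipeline he describes -- treat each layer's unit probabilities as the data for the next layer, then let the labels touch only a logistic regression on the top layer -- looks roughly like this. A sketch on synthetic toy data with random, untrained layer weights (real use would train each layer as an RBM with contrastive divergence; the sizes here are made up, standing in for the 11,000 → 8,000 in the talk):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy stand-in data: 200 "images" of 64 dims, two classes offset by their mean.
n, d = 200, 64
y = (rng.random(n) < 0.5).astype(float)
X = rng.normal(0, 1, (n, d)) + y[:, None] * 1.5

# Greedy layer-by-layer: each layer's unit *probabilities* become the data
# for the next layer.  Random weights here just show the plumbing.
sizes = [d, 32, 16]
H = X
for d_in, d_out in zip(sizes[:-1], sizes[1:]):
    W = rng.normal(0, 0.1, (d_in, d_out))
    H = sigmoid(H @ W)          # probabilities, treated as the next layer's data

# Labels touch only the final layer: plain logistic regression by gradient descent.
w = np.zeros(H.shape[1])
b = 0.0
for _ in range(500):
    p = sigmoid(H @ w + b)
    g = p - y
    w -= 0.1 * (H.T @ g) / n
    b -= 0.1 * g.mean()

acc = ((sigmoid(H @ w + b) > 0.5) == (y > 0.5)).mean()
print(round(acc, 2))
```

Even with untrained intermediate layers the separable toy classes stay separable, which is all the sketch is meant to show; the claim in the talk is that CD-trained layers give a much better top-layer representation than the raw input.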
>>: [inaudible].
>> Geoff Hinton: This is sort of a new thing. So really it's only good for
comparing these, but [inaudible] convolutional neural networks, for example --
done by one of Yann LeCun's best students when I first [inaudible] lab --
don't work very well at all on this.
There's some variations we could do, obviously. Don't use a [inaudible] on the
top; instead fit 10 different models. That should do better. Learn a single
mean-and-covariance RBM on the whole 32 by 32 image. That should do better.
And there's also some other variations I'm not going to go into. So I'm done.
[applause].
>>: [inaudible] talking about the pixels [inaudible] people have all these delightful
high-level features they use. [inaudible]. It seems like you're spending a lot of
time reproducing raw pixels [inaudible]. You're sort of crippling yourself, right?
You're not using sort of the most advanced feature technology.
>> Geoff Hinton: There's really two motivations. One is I would like to
understand more about how the brain might be doing that low-level processing --
what the objective functions for learning are in doing that low-level processing.
And I think this idea of learning adaptive precision matrices [inaudible] the
idea. Because the architecture that comes out of that is exactly the
simple-cell/complex-cell architecture. But you reinterpret that architecture as how a
generative model would get itself an adaptive precision matrix.
So I think that's already something of interest for people understanding what the
simple-cell/complex-cell architecture is good for.
But I sort of agree with you that we should try this higher up. The other thing is I would
like to show that, if you learn several layers like this, you get SIFT features. SIFT
features were originally kind of motivated by David Lowe thinking about how the brain
might do it -- I mean, that was some of his background motivation -- and then some
very good engineering to get sensible features.
I want to [inaudible]. So most of the motivation for not starting from SIFT features
is because I want to understand where they come from in the brain. But if you
just want to do vision, well, you could start from the SIFT features, but my hope
would be that this would learn better things than SIFT features. SIFT-like things. So
these squared [inaudible] filters are oriented energy, which is one of the inputs to
a SIFT feature. But instead of putting them together in the way David
Lowe thought best, I'd like to learn all that, and presumably it will do better. I'm
quite sympathetic to you: if I just want to discriminate objects, I'd start by doing
things like [inaudible] processing and stuff like that. Maybe a bunch of
Gabor filters. Although notice that we learn on a 32 by 32 image patch, or image
in this case; learning the filters is much better than just putting Gabors down
everywhere, because it will learn to use a limited number of filters to cover the
space. As soon as a Gabor filter has a scale and an orientation and an elongation and
stuff like that, you don't know how to tile the space with a reasonable number of
filters. And this will do a very good job of tiling the space with only 10,000 filters.
So that's another advantage over putting them in by hand.
>>: One of the reasons presumably we have multiple frequency bands for
scale space is that we can never assume that the object is already, you know,
at a known scale and a known location, right?
>> Geoff Hinton: Right.
>>: For something like the tiny images database, you pretty much have the
scale invariance taken out, right? Because you were [inaudible] an object to be
sort of [inaudible] fill the middle of the image.
>> Geoff Hinton: Plane, plane.
>>: Okay.
>> Geoff Hinton: I mean, no, maybe you just guessed this one wrong.
>>: I didn't know that was a plane.
>> Geoff Hinton: In a 10-way choice, is that thing in the sky a plane or a bird? I mean, it's
remarkable it can get 70 percent. And a lot of that is of course context -- for
example, in the confusion matrix, it confuses animals with animals (except for birds
and planes, which it confuses because of the sky), and then it confuses
trucks and ships and cars. So the context is doing quite a lot of the work. But
there is quite a lot of variation of size. I mean, look, that and that -- they're the
same thing at different scales. So we have some protection against that.
But I agree, that's presumably why the visual system has these different scales,
because it [inaudible] the scale it wants to look at. Yeah?
>>: So I was wondering, like, if you look at all the filters you have, like the
[inaudible], how well do they cover orientations? Like what is the randomly
[inaudible].
>> Geoff Hinton: Pretty well. Of course, in real data vertical is special.
Horizontal is not so special, because horizontal in the world doesn't come out as
horizontal in the image; vertical in the world comes out as vertical in the image.
So let's go back and look.
And also the pixelation of the image is a rectangular pixelation. That causes a
difference. Sorry, I'm going the wrong way. I want to find those filters again.
Those guys.
So here's vertical ones, here's horizontal ones that way, and diagonal ones that
way. And on an eight by eight, that's about all you can do. We've done these
on 16 by 16 patches, and then you've got lots of nice
orientations. On eight by eight the diagonal ones are bound to be a bit sort of pixelly.
>>: So if you take off [inaudible] earlier version that --
>> Geoff Hinton: That's the same thing without the covariance.
>>: This Gaussian again [inaudible].
>> Geoff Hinton: The Gaussian is linear with Gaussian noise. So this is with
only the covariance. And this is when you combine the covariance and the mean
stuff, and you see you get a 10 percent -- there's a 10 percent difference.
>>: So presumably for speech, [inaudible] we'd actually scrap all this extra stuff?
>> Geoff Hinton: Yes. You don't want to use that. You want to use the
spectrogram, I think.
>>: So I'm thinking similar to a couple of questions asked earlier. If you were to
concatenate the [inaudible] features with the image, [inaudible].
>> Geoff Hinton: Yes. But I would concatenate them at some fairly high
level. I would extract some features and then add in the SIFT features as well:
here's the ones I learned, here's the ones David Lowe designed. Probably
between them they can do better than either alone.
>>: So it's just like if you undo this percentages from [inaudible] and they started
with special whatever they somehow [inaudible].
>> Geoff Hinton: Right. You could certainly use more [inaudible] and things
you've extracted from the spectrogram. So from the spectrogram you can get
little harmonic stacks and things that are lost by the [inaudible]. And both
together should be better than either alone.
>>: [inaudible].
>> Li Deng: Thank you very much.
>> Geoff Hinton: Okay.
[applause]