>> Li Deng: Okay. It's a great pleasure to have Professor Geoff Hinton give
the second talk today. He will review some of the basics that he talked about
this morning in case some of you were not here. I'm not going to give the whole
lengthy introduction except to say that he's universally recognized to be a pioneer
in machine learning and neural networks. So without further ado, I will have him
give the second lecture, on information retrieval using multi-layer neural networks.
>> Geoff Hinton: Thank you. The first half of the talk is going
to be about using short codes found by neural networks to do document retrieval,
and this is published work with Russ Salakhutdinov; the second half will
be about applying the same ideas to image retrieval, which is much more interesting,
because for document retrieval you have good ways of doing it and for image
retrieval you don't, as far as I can tell.
I'm going to spend the first five minutes just very quickly going over the basic
learning algorithm which I talked about this morning. If you want to know more
details, look at this morning's talk.
For the document retrieval, we have to figure out how to model bags of words
with the kind of learning module I'm going to use, then I'll show how to learn
short binary codes, and how to use those for extremely fast retrieval. Basically
retrieval in no time: there's no search involved at all in the retrieval, so we
can do approximate matching at the speed of hashing. And then I'll apply
that to image retrieval. That is very much work in progress; we
just got preliminary results a couple of weeks ago.
So five minutes on the basic learning ideas. We make a neural network out of
stochastic binary neurons, which get some input from the layer below and give an
output which is a probabilistic function of the input and is a one or a zero. We're
going to hook those neurons up into a two-layer network with some observable
neurons and some hidden neurons that are going to learn features. So we put
the data here, and this layer is going to learn features of the data. It's going to be
symmetrically connected. And we're going to try to learn all these weights just by
looking at the data; we have no label information. The learning algorithm is
going to be: you take some data, you put it in here, and using the weights you
currently have you decide for each feature detector whether it should be on or
off. They'll make probabilistic decisions, but if they get lots of positive input they'll
probably turn on, and with lots of negative input they'll probably turn off.
Then from the binary states of these feature detectors, you try and reconstruct the
data using the same weighted connections. And then from the reconstructed
data, you activate the features again. And you're going to try and train the model so
that it's happy with the data and unhappy with the reconstructed data. I want it to
believe in reality and not to believe in the kind of things it would prefer to
fantasize about. So the way it works is: here's some data -- where would you like
to go today? And wherever you would like to go, you make that as difficult as
possible. So you unlearn on this and you learn on that. So the learning
algorithm looks like this. You take the pairwise statistics of an element of the
data, like a pixel or a word in a bag of words, and a feature detector --
how often they are on together -- measure that correlation, and then with the
reconstructed data measure the same correlation. And the difference in those
two correlations provides your learning signal.
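Here is what that learning rule looks like in code -- a minimal numpy sketch of one contrastive divergence step (CD-1) for one of these binary modules, assuming logistic units; the function and variable names are illustrative, not from the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v_data, W, b_vis, b_hid, lr=0.01, rng=np.random):
    """One CD-1 step for a binary RBM. v_data: (batch, n_vis), W: (n_vis, n_hid)."""
    # Drive the feature detectors from the data; sample binary on/off states.
    p_hid = sigmoid(v_data @ W + b_hid)
    h = (rng.random(p_hid.shape) < p_hid).astype(float)
    # Reconstruct the data from the binary features, then re-infer the features.
    v_recon = sigmoid(h @ W.T + b_vis)
    p_hid_recon = sigmoid(v_recon @ W + b_hid)
    # Learn on the data correlations, unlearn on the reconstruction correlations.
    n = v_data.shape[0]
    W += lr * (v_data.T @ p_hid - v_recon.T @ p_hid_recon) / n
    b_vis += lr * (v_data - v_recon).mean(axis=0)
    b_hid += lr * (p_hid - p_hid_recon).mean(axis=0)
    return W, b_vis, b_hid
```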
And now once you've trained one layer, you can train many layers by
taking the feature detectors of one layer and making their activations be the data
for training the next layer. And you can prove that something good happens
when you do that. If you want to know more, look at this morning's talk.
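The greedy layer-by-layer scheme is just a loop around that update; a sketch, reusing the hypothetical cd1_update and sigmoid above:

```python
def train_stack(data, layer_sizes, epochs=10, rng=np.random):
    """Greedy layerwise training: each module's hidden activations
    become the 'data' for the next module up."""
    weights, x = [], data
    for n_hid in layer_sizes:
        W = 0.01 * rng.standard_normal((x.shape[1], n_hid))
        b_vis, b_hid = np.zeros(x.shape[1]), np.zeros(n_hid)
        for _ in range(epochs):
            W, b_vis, b_hid = cd1_update(x, W, b_vis, b_hid)
        weights.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)  # feature activations feed the next layer
    return weights
```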
Okay. So now we're going to apply it in a way different from how we applied it
this morning. Basically, all interesting data lies on some lower dimensional
manifold in a high dimensional space. If I give you million-dimensional data, and it
was really in a million dimensions, life would be completely hopeless, because
every new data point would be outside the convex hull of the previous data points
and you couldn't do anything.
But if I give you say a thousand by a thousand image, actually not all images are
equally likely, and the images you actually see lie on some low dimensional
manifolds -- maybe a number of such manifolds. In this morning's talk what we
tried to do is model those manifolds by going to a high dimensional space and an
energy function that has ravines in it for each of the manifolds. That's what I call
implicit dimensionality reduction, because we go to this high dimensional space to
capture the manifolds. And that's good for capturing multiple manifolds when you
don't know their dimensionality.
There's something that's far more standard, which is what I call explicit
dimensionality reduction. You say, okay, I'm going to try and represent this data
by N numbers, which you can think of as coordinates on the manifold. How do I
find the best N numbers? Principal components is a prime example of an
explicit dimensionality reduction method.
So first of all, to sort of link to this morning's talk, I'm going to apply this to images of
handwritten digits. And we're going to have all sorts of different digits here of
different categories. But now, instead of going to a high dimensional space, we're
going to go down to 30 real numbers. And what we'd like to do is have this
neural net that extracts lots of features from the image, then compresses a little
bit, so it finds correlations among those features and has features of those, finds
correlations among those features, and eventually goes down to a small number of
linear features here. And we can learn this model unsupervised. That is, we show
it data and we just learn these features the way I sketched.
Then we take the feature vectors that we got here from data and we learn these
features, all unsupervised. And then we learn these features. And we learn
these features.
Now, each time I learn one of these modules, because of the way I learned it, the
features are quite good at reconstructing the data. So these features will be
quite good at reconstructing that. And these features will be quite good at
reconstructing that. And these features will be good at reconstructing that. And
these will be okay at reconstructing that. And so now we take the weight
matrices that we learned here, and we put them all in backwards here. So from
these features, we try and reconstruct this. And that's what this W4 transpose is
doing. And from these we try and reconstruct those and so on back to the data.
So after I've done my initial training, my unsupervised learning of these modules
one at a time, I unroll the whole system, put the transposed weights up here, and
then I'm going to train the whole system with standard neural network
backpropagation. That is, I'm going to say I would like the output to look exactly
like the input. So my desired output is whatever the input was. I take the
difference between the actual output and the desired output and I backpropagate
it through all these layers.
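A sketch of that unrolled forward pass, reusing the sigmoid and the (W, b) stack from the sketches above, and simplifying by making every layer logistic and dropping the visible biases (in the actual model the code layer is linear); since the desired output is the input itself, the returned error is what gets backpropagated:

```python
def unrolled_forward(x, stack):
    """Encode with W1..W4, then decode with the transposes W4^T..W1^T."""
    a = x
    for W, b in stack:                 # encoder half
        a = sigmoid(a @ W + b)
    for W, _ in reversed(stack):       # decoder half: transposed weights
        a = sigmoid(a @ W.T)
    return a, a - x                    # reconstruction and error signal
```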
Now, if you start off with small random weights here, as we used to do in neural
nets, nothing happens, because the derivative you get here is small times small
times small times small times small times small times small. That's small to the
seventh, which is no derivative at all. If you use big weights here, you've
already decided what it ought to be doing, randomly. That doesn't seem like a
good idea. You don't want to decide the features randomly; that was the point
of using small weights. But if you use small weights, you won't learn.
What this initialization does by learning this module so these are good at
reconstructing those and these are good at reconstructing those, is it gets
features to get you going and then you can fine tune these features with
backpropagation. And when we fine tune, these weights will become different
from those weights. And these ones will become different from those ones.
Okay.
So here's an example: this is one random image of each class. That's the real
data. And for that autoencoder I just showed you, after it's trained -- the training
takes a while, on the order of a day on one core or an hour on a GPU
board -- these are the reconstructions of the data from those 30 numbers. So we
turn this image into 30 numbers and we reconstruct it from the 30 numbers. And it's
pretty good. In fact, I would argue the reconstructions are actually better than the
data. If you look at this one, there's a little gap in it; it's gone away now. It's a bit
dangerous doing reconstructions better than the data, because you will be
unlearning on your reconstructions and learning on the data.
But anyway, if you do standard PCA, you can see it's a much worse way of trying
to get that information into 30 numbers. So finally -- I first thought about doing
this in 1983. No, sorry. Yeah, in 1983 with Boltzmann Machines and then in
1985 with backpropagation. We could never get it to work. The fact it didn't work
didn't stop some people publishing papers about it. But eventually we got it to
work in deep nets. And it's clearly much better than PCA.
>>: So if you don't do backpropagation, how much worse is it going to get?
[inaudible].
>> Geoff Hinton: Probably it will be probably worse than PCA if I don't have the
backpropagation. But I don't know the answer. I should know, but I don't.
So now we're going to apply it to documents. Initially we're going to use it
just for visualizing documents. Suppose I give you a big document set and I
say: lay these documents out in two dimensions so I can see the sort of structure
of the data set -- just standard visualization. We want a map of document space,
to see how it's populated. Well, I could obviously use an autoencoder for that.
What I could do is train a model that takes the counts for, say, the 2,000
commonest non-stop words, and from that count vector of 2,000 counts -- if I can
figure out how to model it with my Boltzmann Machines -- I could learn some
binary features, and then some more binary features, and then I could learn this little
map which goes down to two real numbers. Obviously you're not going to
reconstruct 250 activities very well from two numbers, but you can do a lot better
than random. Like, if these are sort of political documents, you'd expect high
probabilities for political kinds of words. And from these two real numbers
we're going to now try and reconstruct the word counts. We're not going to do
very well, but we can fine-tune it all with backpropagation. And then we
can look to see how well we did. As a comparison we can use a two
dimensional version of latent semantic analysis, where what you do is
take the word counts, take the log of one plus the word count, and then do PCA
on that.
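That baseline is easy to state in code; a sketch, assuming a documents-by-words count matrix:

```python
def lsa_2d(counts):
    """PCA 'done in a sensible way': log(1 + count), center, then
    project each document onto the top two principal components."""
    x = np.log1p(counts)               # no big negative numbers going in
    x = x - x.mean(axis=0)
    U, S, Vt = np.linalg.svd(x, full_matrices=False)
    return x @ Vt[:2].T                # 2-D coordinates, one row per document
```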
So we took a big document set of 800,000 documents -- the Reuters document set
that's publicly available, where they give you the word counts -- and we used
two dimensional latent semantic analysis, that is, basically PCA done in a
sensible way. And it's all done without knowing the class of the documents. Then,
once you've laid them out in 2D -- there's a point per document --
you color the points by the classes, because all these documents have been hand
labeled by what class they are. And you can see with PCA there's green ones down
here and the more reddish ones here. There's some structure, but it's not very
good. And it's the fact that we're using log of one plus the count that causes this funny
structure here.
In other words, you never have any big negative numbers going in.
Now we apply our method, and that's what we get. And we would argue that this
is better. We've labeled the main classes here -- it comes with a hierarchy of
classes. These are sort of business documents. But if I told you that was the
accounts and earnings statement for Enron -- do you want to invest in this company?
My advice would be probably not. So it's finding structure in the documents.
>>: [inaudible] backpropagation?
>> Geoff Hinton: It's after the backpropagation, yes.
>>: Have you compared [inaudible] with PCA?
>> Geoff Hinton: Latent Dirichlet allocation?
>>: [inaudible].
>> Geoff Hinton: No. This will do much better. In other work we compared this
kind of stuff with linear discriminant analysis. These multiple non-linear layers
are a big win for this kind of stuff. I can't remember now whether the paper
compares with that.
In the middle you're not so sure. Those are probably short documents; with a
document with a lot of words, you get much more confident about where it is. So
you get to see structure in the data.
>>: [inaudible] sort of star shaped? I mean, with most clustering techniques the
examples are, you know, isolated clusters. So why do you think everything's
pooling towards (0,0) here?
>> Geoff Hinton: It's a very good question. And when we use non-parametric
dimensionality reduction techniques on this, we get clusters. Basically it's
because this vector is being used to determine the probabilities of the output
words, and when you don't know, you want them all to be in the same place. So
all the uncertain ones you want to be in the same place. As you get more
confident, you can afford to have less entropy in the output distribution.
But for short documents you're bound to have a high entropy
output distribution. So all very short documents should give very similar
distributions, because there aren't enough words to know --
>>: [inaudible].
>> Geoff Hinton: Partly effect of that, yes.
>>: [inaudible] the document [inaudible] or are you taking the raw term count?
>> Geoff Hinton: This is the raw term count.
>>: [inaudible].
>> Geoff Hinton: No, because we're doing the probability of a word but multiplying
that by the number of words. So yes.
>>: [inaudible].
>>: [inaudible] four dimension to three dimension. Is it obvious that this
[inaudible] than PCA?
>> Geoff Hinton: It's obvious to me. I'd be amazed if you got those two pictures
in 2D and then the 3D PCA was as good as this. But yes?
>>: But that [inaudible] might just be the number of words.
>> Geoff Hinton: Yes, I think it is. Okay. Now, I didn't tell you how we modelled
a bag of words using a Boltzmann Machine, and it turns out that a little bit of
care has to go into it. In particular, when we did this work we didn't use the last trick
I'll describe, and that trick helps a lot. So we could do it better now, I think.
So we start off with this binary Boltzmann Machine where you treat the input
variables as binary variables. And now we're going to replace each binary
variable by a 2,000-way alternative. A binary variable is a two-way alternative;
there's an easy generalization to a 2,000-way discrete alternative.
So the idea is we're going to build ourselves a Boltzmann Machine that, for each
instance of a word in the document, has one input unit. So if it's
a 100-word document there are going to be 100 input units. And each of these units
is going to be a 2,000-way softmax -- a 2,000-way discrete variable. And in the
data, one of those 2,000 alternatives is going to be on for each of these input
variables.
Then each hidden unit -- because it's a bag of words, it doesn't care about the
order -- is going to have the same weights to all of the different input
units. You can probably already see that we can do the computation faster than
the architecture I'm describing suggests. We're going to use as many of these
visible units as there are non-stop words in the document, and so for each
document we'll get a different size network. So it's not really a Boltzmann
Machine, it's a family of Boltzmann Machines: for a big document you have a
bigger Boltzmann Machine, but they all have tied weights. Okay? The number of
hidden units is always going to be the same. And each hidden unit has got its
weights to the visible units; as you have more visible units, you just keep reusing
the same weights. So you're making a bigger Boltzmann Machine but no more
parameters. And so you've got this family of different size machines. But there's
one crucial thing you have to do to make this into a good density model, at least,
which is this: the hidden unit is getting input from the visible units, and that input
is fighting a bias. As you have more visible units there's more input, and so you
want more bias for it to fight. And so you have to scale the biases with the
number of input units.
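For the hidden units, this family of tied-weight machines collapses to something very simple: a document is summarized by its word-count vector, and the hidden biases get multiplied by the document length. A sketch, assuming that reduction and reusing the sigmoid above:

```python
def hidden_probs_bag_of_words(counts, W, b_hid):
    """Hidden-unit probabilities for the family of tied-weight machines.
    Because all the softmax input units share weights, the input to a
    hidden unit is just counts @ W, and the hidden bias is scaled by
    the document length N so it can fight the extra input."""
    N = counts.sum(axis=1, keepdims=True)   # number of words per document
    return sigmoid(counts @ W + N * b_hid)
```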
>>: [inaudible] scale of the input [inaudible].
>> Geoff Hinton: Yes. Either way. But this is the way we currently think about it.
And this makes a big difference. In the upcoming NIPS we compare
the density model you get for bags of words -- this is what [inaudible] -- with things
like latent Dirichlet allocation, and this gives a much better density model for bags
of words. So you hold out some documents and you ask, you know. So here's
just a picture of the model. You decide how many hidden units you're going to
use -- we're going to use thousands. And then this hidden unit will have
some weighted connections to this softmax and the same weighted connections
to this softmax and that softmax, and then this hidden unit will have different
weighted connections than those, but the same for each softmax. Okay, I [inaudible]
that a bit.
And so we're going to start out by using an architecture like this, where we go
down to 10 Gaussian units here, 10 real valued units. So it's as if we've done latent
semantic analysis -- I hope I'm using the terms right -- with a 10-component vector.
And now we're going to try and do document retrieval with that. And we're going
to compare with document retrieval using the vectors you get with latent
semantic analysis.
So if we extract 10 numbers and retrieve documents using those 10 numbers, we do
better than latent semantic analysis retrieving with 50 numbers, and much better than
latent semantic analysis retrieving with 10 numbers. In other words, each of our
numbers is worth five of their numbers. And the amount of work you have to do is
just linear in the number of numbers, if you're doing a linear search through the
things you want to retrieve from. So it's a big win to have a lot more information
in here. People doing LSA will sometimes tell you LSA is optimal, and what they
mean is optimal within the space of linear methods, okay? Linear methods are
completely hopeless, but within that space it's optimal. I should say LSA was a
really ground breaking invention when it was done in the '90s. We couldn't do
these more complicated things then. And it really amazed people how much
information you could get just from statistics. People were surprised that it would
work. But now we can do much better with non-linear methods.
>>: So what [inaudible].
>> Geoff Hinton: This is saying, as a function of how many documents you retrieve,
what fraction of them are in the same class as the query document. We don't have
a good way of saying whether a retrieved document was a relevant one. But if we
use a bad way that's the same for all methods, that's the best we can do, and our
bad way is to ask: is it in the same class? So you put in a document that was about
accounts and earnings, and you try and retrieve: you get the 10-D code for this
document, you compare it with all the other 10-D codes, and you see which fits
best. Is the one that fits best about accounts and earnings? Well, 43 percent of
the time it is.
>>: Well, that's precision. So this was [inaudible] some recall.
>> Geoff Hinton: Yes. Yes. And this is sort of recall here. We didn't label
the axes with the normal terminology because this was early in our careers as
information retrieval people, and we were still learning.
>>: Question. So you get those 10 numbers from the autoencoder. Do you
need the weights?
>> Geoff Hinton: Yes.
>>: That's a bunch more numbers. If you look at all the parameters that are
involved in the encoding in both cases, how does the total number of parameters
compare?
>> Geoff Hinton: We have more. Not -- yeah, I need to think about this. If you're
doing PCA from 2,000 to 10, you've got 20,000 numbers, right? Here we've got a
million numbers right there. In other words, this is only going to work on big
document sets, it won't work on small document sets. We need enough data to
train a million numbers here. But what's small is the code you use at runtime.
>>: [inaudible].
>> Geoff Hinton: Okay. So the point is that at runtime, when you're actually doing
the retrieval, it's linear in the length of the code if you do it in a dumb way, just
matching codes. So you want a short code.
>>: But in order to find the [inaudible] of the code don't you need the actual
weights?
>> Geoff Hinton: Oh, if you're asking about the storage in the machine, it needs
these weights sort of once for all documents. It doesn't need these weights per
document. Also, if you've got a big database, what you do is overnight you
run this and you store these 10 numbers with each document, so at retrieval
time you hope you don't have to run this at all.
>>: That would be analogous to the LSA, where you've got the basis vectors and
you have to store those once for each document. I mean once period for all
documents.
>> Geoff Hinton: Right. Yes. These are like the basis vectors.
>>: Right.
>> Geoff Hinton: But the speed at retrieval time is going to depend on how many
numbers you have here. Okay. So what we're going to do now is --
>>: Is it really a question of speed, or is it choosing the right bottleneck so that
things get separated? I mean, is it -- in other words, what's the --
>> Geoff Hinton: Well, it's the case with all these methods that if I use more units
here I'll do better, up to quite a big number. It's not that we have to go down to a
small number to do well at all. If I use a hundred here I'll do much better.
>>: So it really is a computational efficiency.
>> Geoff Hinton: It's the efficiency at retrieval time.
>>: At retrieval time you don't necessarily scan, right? You use some sort of a
k-d tree or other --
>> Geoff Hinton: Absolutely.
>>: [inaudible].
>> Geoff Hinton: Absolutely. So this really isn't of much interest yet, because
you're going to use much faster techniques than serial matching.
>>: So that's what the next point is going to address?
>> Geoff Hinton: Okay. This was just to compare with LSA. And, sure, we get
more information per component of the code than LSA does. So now what we're
going to do is, instead of making these linear units, we'll make them binary units
when we do the learning. So we learn this module, we learn this module, we
learn this module with binary units, we learn this module, we learn this, we unroll
it all and then we start doing backpropagation. And here, as we're doing the
backpropagation, we inject noise. The point of that is to force these guys to
be either very firmly on or very firmly off. That's the only way they can resist this
noise. If they're in the middle range, then the noise, which has quite a big
standard deviation, will just swamp them, and they won't be able to transmit
any useful information.
And the way we add the noise is: for each training case we make up a noise
vector ahead of time, and there are so many training cases that it can't learn all
that information -- there are on the order of a million training cases. So now we've
got what I call deterministic noise. For any one case, it's always the same noise.
So we can use methods like conjugate gradient for doing the backpropagation. If
you don't want to use conjugate gradient, then you can make this be true random
noise.
So you train this thing up: you first train it without noise and then you add the
noise. When you add the noise here, these become far more binary. So they
started off being learned as binary units; then when you use backprop they're
logistic units with continuous values; and you force those values to the extremes.
So now when you threshold these -- you don't threshold at .5, you threshold a
little bit lower than that -- you can still reconstruct quite well from the thresholded
values.
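A sketch of that deterministic noise, assuming it is realized by seeding a generator per training case so the same case always gets the same noise vector; the noise magnitude here is an illustrative guess at "quite a big standard deviation":

```python
def noisy_code_units(code_inputs, case_ids, noise_std=4.0):
    """Fixed per-case Gaussian noise on the inputs to the code units
    during fine-tuning, so conjugate gradient still applies. Units must
    go firmly on or off to transmit information through the noise."""
    noise = np.stack([np.random.default_rng(i).normal(0.0, noise_std,
                                                      code_inputs.shape[1])
                      for i in case_ids])
    return sigmoid(code_inputs + noise)
```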
So now what we've done is we've converted a bag of words into a binary vector.
Now we can do what we call semantic hashing. So here's a document. You
compute its binary vector, and that binary vector is viewed as an address in a
memory space. And because of the way these binary vectors were computed,
it's not like a normal hash function. With a normal hash function you try and
spread things out randomly over the memory, just so you get things evenly
populated in the memory. Here, because of the way the learning's done, we
expect that documents with similar words will be near in this space. And it may
not seem like a 30 dimensional space is very high dimensional, but there's a lot
of room in 30 dimensions. So I have an analogy for this: supermarket search. If
I go into a supermarket and I want to find things that are like a can of sardines,
what I do is I go to the teller and say, where do you keep the sardines? I go to
where the sardines are kept, and I just look around. And there's the tuna and the
[inaudible]. Unfortunately the anchovies are way over here with the pizza
toppings. But that's because supermarkets are really 2D things. I mean,
they're this 2-dimensional strip in 3D, and you can't have everything that
you'd like near each other.
But in a 30 dimensional supermarket life would be wonderful. You could have
the organic things here, and the cheap things there. And you could have the
slightly beyond their date and very cheap here and the healthy ones here. And
you could have the kosher ones there and the unkosher ones there. And in 30
dimensions you could, you know, arrange it all very nicely so similar things were
near one another in all these different dimensions of similarity.
So the idea is that maybe with 30 dimensions you can organize documents so that
similar documents really are near one another. And if this autoencoder will do
that for us, then we have a very fast way of finding documents similar to a query
document. You just go to the address of the query document, and you just look
around. And so you enumerate the local neighborhood in terms of the Hamming
distance from where you are.
So you might think you would have to form a list. But actually, in a sense, you
already have a list. You have this address, and as you flip bits in that address,
you get the addresses of the next things. So why actually form a list? You've
already got your list: you've got an address and you've got a way of getting to the
next guy. That's a list. And so in a sense, in this memory space -- you have a
billion addresses here -- I really have a billion things, each of which is a list a
billion long, all in the same memory space. So it's sort of instant retrieval of a
short list.
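A sketch of that lookup, assuming a 30-bit integer code and a hash table from addresses to document ids; flipping bits enumerates the neighborhood in Hamming distance:

```python
from itertools import combinations

def semantic_hash_lookup(code, table, n_bits=30, max_flips=2):
    """Treat the code as a memory address and 'look around'. `table`
    maps an integer address to the list of document ids stored there."""
    hits = list(table.get(code, []))
    for r in range(1, max_flips + 1):
        for bits in combinations(range(n_bits), r):
            probe = code
            for b in bits:
                probe ^= 1 << b            # flip one bit of the address
            hits.extend(table.get(probe, []))
    return hits
```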
Now, that short list may not be very good, because you're only using 30 bits. So
our proposal is: use those 30 bits, and then once you've got a short list of say
10,000 or something, you do a serial search using a longer bit code, say 256
bits. Which is still much better than using LSA or something. And 256 bits are
very easy to store: you just have to store four words with each document.
>>: [inaudible] say something about the [inaudible] bits as opposed to real valued
numbers? Because you could do the same trick but just, you know, have real
values.
>> Geoff Hinton: We have done some comparison on that. And I can give one
comparison: one of our bits is worth slightly more than one of the LSA real
numbers. So if you compare it with LSA, one of these deep bits is worth about
the same as one of their real numbers, but it's much cheaper to store and match.
Roughly speaking, that's the [inaudible].
So here's another view of semantic hashing. All fast methods of retrieving things
roughly involve intersecting lists -- many fast methods of retrieving things involve
intersecting lists, all the ones I know about. And what we're doing here is
intersecting lists, because when you take one of the bits of the code, that's a list
of half the things in the space. Okay? It's all the documents that have that bit of
their address turned on. So you can see it's a list of half the space.
And the next bit in your code is another list, an orthogonal list of another half of
the space. And now you want to intersect those 30 lists. Well, that's what a
memory fetch does. So we're saying a memory fetch really is a list intersection,
if you can map your lists onto the bits in the address. And so that's what we
do. We just use machine learning to turn the information retrieval problem into
the kind of list intersection that all computers like to do, namely a memory fetch.
So the question is, how good is it? We've only implemented it with 20-bit codes,
and that's quite a lot different from 30-bit codes. And this is on that same
document set that I showed you, about a million documents, for which a 20-bit
code is about right. I'm not going to talk about what you do if you get collisions --
two documents at the same address. If you're not a computer scientist, you don't
think about that, and if you are a computer scientist, you know what to do about it
anyway.
So what we do is we take our short list and feed it to TF-IDF, which is a fairly
gold standard method. And you can compare it with other methods. Semantic
hashing is the fastest method we know; I can't see how you could get a faster
method of getting a short list.
And if you feed the output of that short list to TF-IDF, then what you get is
performance that's actually slightly better than TF-IDF alone. If you do TF-IDF on
everything, which you could do relatively efficiently using indices and things, it's
better to just do it on a shortlist. And that means that some of the things TF-IDF
would decide are good, our shortlist says are bad: they're not on our shortlist,
and they shouldn't be there, because we're doing better by filtering with this.
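The two-stage pipeline, sketched with a textbook TF-IDF score; the names and data layout here are assumptions for illustration, not the authors' implementation:

```python
import math
from collections import Counter

def tfidf_rerank(query_terms, shortlist, n_docs, doc_freq):
    """Stage two: rank only the shortlist with plain TF-IDF. `shortlist`
    is a list of term lists; doc_freq[t] counts how many of the n_docs
    documents in the whole collection contain term t."""
    def score(doc_terms):
        tf = Counter(doc_terms)
        return sum(tf[t] * math.log(n_docs / (1 + doc_freq.get(t, 0)))
                   for t in query_terms)
    return sorted(shortlist, key=score, reverse=True)
```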
Okay. Now, all of that was for document retrieval. That's all published stuff. And
it's not so interesting for document retrieval because there's other good ways of
doing it. But for image retrieval I think it's more interesting.
>>: [inaudible] you're comparing -- when you're comparing with LSH, you're saying
the speed of computing the 20 bits is faster?
>> Geoff Hinton: No. What I'm saying is, I'm really thinking of it like this: you can
have these 20 bits stored with each document, and then it's just a memory lookup.
LSH has bigger codes than that; you can't do it with just a memory lookup.
Okay. So for image retrieval: until fairly recently it was either done with things
like color histograms, or done more effectively, I think, using the captions. But
obviously you'd like to do it by recognizing what's in the image -- you'd like to
do object recognition and do it like that. Because a pixel isn't like a word in a
document: a pixel doesn't tell you much about what's there, whereas a word tells
you lots about a document.
The things that are like words in a document are objects in the image. But
recognizing those is kind of tough. So maybe we can do something else. Maybe
we can extract things that aren't as good as objects, but nevertheless contain
quite a lot of content about what's in the image in a short binary code, and then
we can use these ideas about short binary codes either for quick serial matches
or for semantic hashing.
So we propose to use a two-stage method where you use a short binary code, a
very short one, to get a short list, and then for that short list you use a longer
binary code. Now, there's no point in doing that unless your longer binary code is
going to work reasonably well. So all we've done so far is checked this longer
binary code to see that it works reasonably well for image retrieval. And those
are the results I'm going to be showing you.
That's just four words of memory for each image. And you can match quite fast.
And the question is how good are these codes. What do you get when you try
matching with these 256 bits?
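Matching 256-bit codes stored as four 64-bit words is just XORs and popcounts; a sketch:

```python
def hamming_256(a, b):
    """Hamming distance between two 256-bit codes, each a tuple of
    four 64-bit words: four XORs plus four popcounts."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))

def nearest(query, db, k=15):
    """Serial scan of the database; returns indices of the k closest codes."""
    return sorted(range(len(db)), key=lambda i: hamming_256(query, db[i]))[:k]
```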
Now, we thought about this a few years ago when we were doing document
retrieval. And then Yair Weiss and Antonio Torralba and Rob Fergus came
along and figured out a different method of getting binary codes, which is published
in NIPS, which is much faster than our machine learning method. And they claimed
that it worked better, and it's simpler, and it has no free parameters. When they
claimed that it worked better, what happened was that they didn't know how to
train Restricted Boltzmann Machines on real valued data -- on real
valued pixels. And so they gave up on using the pixels, and they used
features as the input to the system. So they took the image and they took it
down to 384 numbers before trying to compress it to this binary code. And it
turns out that's much too much compression much too soon. If you train on the
raw pixels and you do it properly, it's actually completely reversed: their method
is basically hopeless, and our method is much, much better. They're friends of
mine.
So Alex Krizhevsky, a long-suffering graduate student of mine, spent a long time
trying to figure out how to train these autoencoders properly with Gaussian
visible units. This morning I talked about something better than that, where you
model covariances. But for this work we're just using the ordinary Gaussian
units, that is, linear units where you have independent Gaussian noise on each
unit.
To do the training you need a very small learning rate; you mustn't add noise
when you make the reconstructions; and you need a cluster machine or a GPU
board. And you need to use a lot of units in the early layers, because you're
going from real valued things to binary things, so he goes to a big set of
binary things. And then we're going to compare with the spectral hashing
method that got published in NIPS, and with Euclidean matching, which is going
to be very expensive but will at least show us literally very similar images.
I don't know if you can see this. I first wanted to show you the kind of filters you
get if you apply this Gaussian RBM to these small color images -- we're going to
do it on 32 by 32 color images. And they're very different from the means that I
showed you this morning; they're much more like the variances. And that's if
you just learn these by themselves. If you learn the covariances, these become
completely different. Again, they have that flavor. Most of them are pure black
and white. And this is 32 by 32, so they're highly localized in space, they're
high frequency, and they're pure black and white. And a few of them don't care at
all about intensity -- they're completely insensitive to intensity -- and they're color
contrasts.
This isn't a topographic map, this is just a random subset. In this case he learned
10,000 [inaudible]. So at least it shows you it's learning something
that's generally regarded to be a sensible way to start on an image. The
architecture he actually used for the autoencoder was: the three color channels,
not preprocessed at all; then 8,000 units here, then 4,000, then 2,000, then
1,000, then 500, then 256. And there's absolutely no theoretical justification for
this.
There's a justification for using powers of two, because he's using a GPU
board. But that's about it. What we do know is that if you use a small layer early
on, it doesn't work nearly as well, and using lots of layers seems to work better.
And it's very robust: if he changed from 2,048 here to, say, 700, it would probably
work about the same. Maybe slightly worse.
So although this is arbitrary, the results aren't sensitive to the arbitrary choices.
That's the good news. And it takes a few days on the GPU to train this whole
thing. This thing has, well, four times eight -- 32 million weights there. Overall
it's got about 67 million parameters. It's trained on color images; he has 80
million of those, but he actually only bothers to train on two million. That's plenty.
Well, I should say in this work he's training on two million, and he's doing
retrieval on another two million that it wasn't trained on. He could probably do
better by training on all 80 million, but it would take longer.
So what we have here: this is a query image, distance zero. These are the 15
best retrieved images, in this order. And it's showing you the distance -- this is
the distance in Hamming space for the 256-bit codes. So you have to go quite a
way until you get to another one.
But if you look at the top row, it's all men with shirts and ties. It seems nice. This
is spectral hashing, which doesn't have any men with shirts and ties, and has
some things that you wouldn't have thought of as particularly similar to that. We
couldn't believe these results, and so I kept talking to Yair and saying spectral
hashing isn't working, and he said, yeah, I know. So we think we haven't got a
bug. We've run this code on things that they've run their code on and it gets the
same results. So we're fairly sure these are the results, even though they look
terrible.
This is Euclidean distance, which sometimes is slow, and you'll notice Euclidean
distance is much more similar to us, but we seem to be doing something better
than Euclidean distance. I would argue that these are more like that. Notice the
fine details -- look at this guy's face. He doesn't have a face. But he does have a
shirt and tie. Okay.
So the order is: we're the best, this is the second best, and spectral hashing is the
worst for that one. Here's Michael Jackson. This is Euclidean distance matched
to Michael Jackson. We at least get people nearly all the time. Spectral hashing,
well. There's something else that's very strange too. I accused Alex of making a
mistake with the 64 dimensional spectral hashing. And the reason was, look at
these distances: it's got something at a distance of 23 from this thing, and our
nearest thing is at a distance of 61 and is much more similar. Now, how could
that be? How could it have something that's really close and very dissimilar?
Well, because their bits are no good. Their bits are very far from orthogonal to
each other. So they can hash something so that it has many bits the same and
just isn't at all similar.
So its furthest thing is only 26 bits away. It's just not laying the things out; it's not
using the space uniformly like information theory says you should. Okay. So
here again I would argue we do better than Euclidean distance and much
better than spectral hashing.
>>: Why do you expect the long distance hashing to be sensible? It seems like
you didn't train for it and you didn't -- this seems like a [inaudible] accident, right?
>> Geoff Hinton: No, because we know from the autoencoder that similar things
will get similar codes, just because of the way it's trained. I also have evidence,
right?
>>: There's nothing semantic about the algorithm, right? It's not like you're asking
that nearby things are meaningful.
>> Geoff Hinton: Oh, but you're asking to find a few features that describe an
image such that from those few features I can reconstruct the image. So think
about it for documents going down to two features. Suppose you had sports
documents and politics documents. You'd really like one of those features to be
sports versus politics, because if it's sports you can raise the probability of all the
sports words and lower the probability of all the politics words, and vice versa. So
you'd really like to go for that abstract feature of sports versus politics.
>>: But maybe -- maybe there's something odd, like this [inaudible] is black and
that one is purple, that also just -- that also distinguishes and makes it [inaudible].
>> Geoff Hinton: But that one helps you predict other pixels as well. So for
example, here's a well known semantic feature that you can actually extract from
an image quite easily: indoor versus outdoor. It's quite easy to do that. And the
point about indoor is you expect to see sort of straight edges and particular kinds
of illumination, and outdoor you expect to see far more rough things. So that one
feature will tell you a lot about how to reconstruct the image. What we're hoping
is that if you could get 30 features, all of the same quality as indoor versus
outdoor but all orthogonal to it, you'd really be in business for image retrieval.
And we're trying to move in that direction.
Let me carry on. So here's a flower, a dandelion, I think. About half the other
things we get are dandelions. And in this case about half the things we get are
also in the 15 closest in Euclidean distance. So this is some evidence that we
haven't moved very far away from the sort of raw input metric. That's bad news
for the abstraction idea. But what's bad news for that is good news for the
following idea: wouldn't it be nice if you could do Euclidean matches just by
using 256 bits instead of using about 3,000 real numbers? And apparently we
can. I mean, we can get half the things that come out close in Euclidean
distance by matching these 256-bit things and looking at Hamming distance. So
it's good in one way and bad in another way.
Spectral hashing, as used, only got one dandelion, but --
>>: So have you compared this short code with, like, [inaudible].
>> Geoff Hinton: No, not yet. So here's an outdoor scene, and here's what we
retrieve. This actually is visually very similar, but you can see it's not actually an
[inaudible], it's a sort of drum or swimming pool or something. But notice that
Euclidean distance also retrieved [inaudible] -- yeah, there it is. That's the
same thing retrieved by Euclidean distance. So again, about half of what
Euclidean distance retrieves is the same as what we retrieve. But we're much
better than spectral hashing. I mean, I don't think that's particularly like that.
>>: [inaudible] impressive that it's doing this, but in some ways it's only to be
expected, because the autoencoder is trying to really reproduce the image, and --
you know, the definition of reproduction is Euclidean distance, right?
>> Geoff Hinton: Right. But we know that it generalizes to some extent. So if
you train an autoencoder like this and you then generate from it, you can
generate things shifted slightly. It will do small shifts and things.
>>: But has anyone considered training the autoencoder where, instead of giving
the image itself as the thing to match, you give another member of the class?
You know, if you have --
>> Geoff Hinton: We've thought about things like that, and people have done things
like that, but I'm not going to talk about that. That's sort of work in progress,
yes.
So here's what I think is the best example. So that's a group of people. That's
the high level description; the low level description is a white blob. So
if you look at things like this, it really matched on the white blob. But if you look at
these retrievals, about half of them are groups of people. If you look at Euclidean
distance, it's not doing nearly as well at getting groups of people here.
There's a lot of fairly high frequency variation in this image, right? And that high
frequency variation, I think, is turning on features that say "high frequency
variation around about there", and therefore other images with similar high
frequency variation around about there are being matched. Whereas if you do
Euclidean distance and you want to match high frequency variation, the best you
can do is find something that's very uniform, that's the average. If you start
introducing any variation into this thing, it will probably disagree and you'll get
extra variance.
So unless you can have perfectly correlated variation, you're better off with no
variation at all. And you see, Euclidean distance likes things like this -- that's the
second best match -- which is the sort of average of all of this image everywhere.
But we're definitely doing much better at getting groups of people. And
spectral hashing isn't; it gets some.
>>: [inaudible] pixel by pixel [inaudible].
>> Geoff Hinton: Pixel by pixel, yes.
>>: So everything that's [inaudible] normalized is [inaudible].
>> Geoff Hinton: Yup. Now, we should also try cosine of the angle just to
normalize the intensity. But remember, there's two million images being used.
>>: But [inaudible].
>> Geoff Hinton: Well, we should also try that. Like I said, this was only done a
couple of weeks ago. As you know, I wasn't planning to talk about this, but you
wanted me to, so I did. There's a couple of obvious things we still have to do,
which is to implement the semantic hashing stage and check that using it as a
front end for this slightly longer code doesn't mess things up. We know that it
doesn't with documents, but we don't know that it doesn't with images.
If you lose some recall on images, it may not matter so much, since people don't
worry about what they don't see. So you can get away with bad recall. It's
precision you can't get away with being bad.
There's an obvious extension to this: in the first half of the talk we got short
codes for documents and in the second half we got short codes for images, so
why not use the same short code for both? You get three wins from that. You
might expect that if you use the words as well, it will help you get more abstract
features, because the link to the words goes via more abstract things, which is
what you're more interested in. So it will pull you in that direction. We know
that compared with LDA models, for example -- latent Dirichlet allocation models --
we have a better model of the densities of bags of words. So we think these codes
got by RBMs, even with one layer, are a better way of modeling bags of words. So
we're going to get a win there. And multiple layers are much better at
modeling bags of words.
We do well on images. And the interaction should help us a lot. So that's a very
obvious big win. It also goes some way to answering John's objections, because
from the image you can actually start producing words, so you understand some
of the semantics -- you know, it should say "people".
There's a less obvious win. Semantic hashing is incredibly fast if you've already
got the codes for all the documents and you're just looking up which ones are
similar to this one. You can't really go far beyond 32 bits -- maybe you could go to
36 bits or something, but you can't go to like 100 bits.
But it's so fast that you could do it several times. If you go to a memory
address, you have this sort of Hamming ball, and it's easy to enumerate the
Hamming ball in ascending address order. So you can enumerate it ordered in
the address space.
So now if I take another query and get its Hamming ball, I can intersect those
two Hamming balls efficiently, because they're ordered lists. And it's so fast. So
why don't I now say, okay, I'm going to get myself 20 lists like this and intersect
them? Intersecting these lists is linear in the list length. So if I take a Hamming
ball that contains 10,000 things, it's sort of 20 times 10,000 operations to
intersect these lists. And so I can afford to access with several queries.
So I can apply a transformation to the query image. I can, for example, do a small
translation. Well, a very small translation you don't have to do -- you can already
cope with a very small translation. But what about a somewhat bigger
translation? You'd have to worry about edge effects and things, and so to get
rid of those we're using 28 by 28 images, so we can translate a few pixels and still
be inside the image. And that's work in progress: to see if you can use the speed
of semantic hashing to allow you to match transformed things by simply trying a
number of transformations. Because you can afford to do that.
So the summary of what I've said is that we have this learning algorithm that can
learn layers of features efficiently. We can use it to learn big representations for
doing object recognition, or we can use it to learn small representations for doing
retrieval. In particular, we can learn binary representations, which are very
cheap to store. And we can use this semantic hashing, which amounts to using
the speed of hash coding to do approximate matching. And then, if this works
nicely with images, we can start trying to deal with obvious transformations by
taking the query and transforming it and doing multiple matches and then
intersecting them.
So we have a way of converting the speed of semantic hashing into better quality
of retrieval. Okay. That's it.
[applause].
>>: So do you [inaudible] -- you showed features that are either text or pixels.
>> Geoff Hinton: Right.
>>: Have you thought of features that are a combination? For example, when
you do a search on the web, you type some text and you find text in the web
page where the images are, and maybe that -- how do you learn more?
>> Geoff Hinton: There's -- I mean, sort of one reason I'm talking about this here
is there's a huge number of directions to go in from here, right? Once you've
got a way of getting codes, there's also a question about what you apply it to.
We've just done some of the most obvious things. Mainly I was concerned to
compare it with spectral hashing, because it was so depressing that that dumb
technique worked better than our technique, and I wanted to show really it didn't.
So the main point was to actually compare with spectral hashing. But there's
also other things you could do. And this morning people raised the idea that, well,
you know, if you're dealing with images, you really don't want to start with pixels,
because image processing people know what to do with pixels -- you want to start
with something higher up. And I sort of agree with that. It was just to compare
with spectral hashing, and also to show that we can do the whole thing with this
technique, that we started with pixels. But I agree, there's all sorts of other inputs
you could use.
>>: So one of the messages here, the main message to me, is: if you use
compression, compression will force you to find relevant features that you can
use to do things. And I --
>> Geoff Hinton: If you've got a good way of finding features, yes.
>>: And the RBM is a compression mechanism that seems to find some good
features.
>> Geoff Hinton: Yes.
>>: If you view it as a compression mechanism, have you compared it purely as
a compression mechanism with other things, you know, like [inaudible] or other
compression techniques?
>> Geoff Hinton: We haven't, no. I wouldn't expect it to be really good as a
compression mechanism. I wouldn't want to --
>>: So then it's not just the fact of compression per se, there's something
special --
>> Geoff Hinton: There's something special, which is taking similar things to
similar codes. That's very important, and compression doesn't necessarily do
that. So I think the fact that it's a sort of smooth function -- because the weights
don't get that big, it's a fairly smooth function from inputs to codes -- I think
that's what's important. And that helps answer John's question, too: why do you
expect similar things to have similar codes? Yeah?
>>: I think there might be two things going on simultaneously here, right? If you
just say, get me similar things, and you're given the metric, Euclidean distance,
right -- well, in essence you're stuck, because if there's no skew and the
distribution is uniform you can't do it, right, if you're just going to pick 30 values
to code those. But it's the fact that the data, you know, lives on a manifold and
has this very different density and so on. So the thing is encoding, trying to
minimize the difference of reconstruction, given the prior distribution -- the
implicit prior distribution that arises when you get a certain set of learning
examples, right?
>> Geoff Hinton: Exactly. This will only work for data that lies on a low
dimensional manifold in a high dimensional space, and what's more, where there
are obvious correlations in the raw data that you can pick up on that really are
relevant to that manifold.
I could put some obvious correlations into the data that were nothing to do with
the manifold, by sort of watermarking or something. And this maybe would pick
up on those, and it wouldn't be any use at all. But real vision isn't like that. In
real vision it's screaming out at you that these correlations are caused by
what's really causing the image, and if you're sensitive to them, you'll be able to
find out what caused the image. And that's the kind of domain in which it works.
>>: That might be -- when you're doing compression, ultimately your goal is to
introduce the least visual distortion, right? And here it's not obvious that's
what the autoencoder is doing. It's been asked to minimize the Euclidean
distance, but it's also being asked to do it at a frequency that is whatever is in
your training set. If you only have 10,000 or some number of images, that's --
>> Geoff Hinton: Right. But as you go up through these layers of features, if for
example you take an image and you go up and you come back down again, you
can sort of shift it a little, and things like that. I should show some
examples of that. But it's not determined to keep the pixels in
the same place. It has a little bit of translational --
>>: Right, it's not that the [inaudible] can shift it, it just so happens that the
distribution of the data space means that small shifts don't kill you in terms of
Euclidean distance.
>> Geoff Hinton: Exactly.
>>: Right.
>> Geoff Hinton: Because locally if you take a patch of image and you shift it
slightly there's probably other things in your database that are very like that
shifted patch of image. Exactly. Yes.
>>: So the desirable property of -- sort of similarity [inaudible] -- is it reflected in the
energy function definition in the Boltzmann Machine, or --
>> Geoff Hinton: No.
>>: [inaudible].
>> Geoff Hinton: It basically comes out of the weights not being that big, so
you've got a smooth function. I mean, that's one important thing. With very
big weights you can get a very non-smooth function, so very similar things can
go to very different places. But it also relies on the data having a lot of
structure: it has to lie on this low dimensional manifold, and then this will find that
manifold. If the data isn't like that, this kind of method won't work. But
almost all data is like that -- almost all highly structured data.
>>: So [inaudible].
>> Geoff Hinton: We typically do regularize them a little bit, but it's not really that
important.
>>: [inaudible] and then you know, you [inaudible] the function as encoding that
[inaudible].
>> Geoff Hinton: Exactly. But typically with a lot of data you don't need to
regularize them, they just stay small anyway.
>>: [inaudible] in doing this image retrieval is to create visual words, discrete
words, and then think of the statistics of those words [inaudible].
>> Geoff Hinton: In a sense our bits are like those words. Our individual bits
are like those words. And then in semantic hashing you can think: if a bit is like a
word, then the inverted index is just all the addresses that have that bit turned on.
>>: So in that case, once you interpret the bits as [inaudible] document retrieval --
>> Geoff Hinton: You don't need to. If you've got it in 30 bits, you could use
something much better, because you can intersect all those lists in one machine
instruction. And I bet they can't -- even John Platt can't intersect all those lists in
one machine instruction.
>>: But Intel can.
>> Geoff Hinton: But they have to build a big machine.
>>: Oh, no, but -- yeah.
>> Geoff Hinton: Okay.
>> Li Deng: Thank you very much.
[applause]