>> Li Deng: Okay. It's a great pleasure to have Professor Geoff Hinton give the second talk today. And he will review some of the basics that he talked about this morning in case some of you were not here. I'm not going to give all the lengthy introduction except to say that he's universally recognized to be a pioneer in machine learning and neural networks. So without further ado, I will have him give the second lecture, on information retrieval using multi-layer neural networks. >> Geoff Hinton: Thank you. What I'm going to do is: the first half of the talk is going to be about using short codes found by neural networks to do document retrieval, and this is published work with Russ Salakhutdinov; and then the second half will be applying the same ideas to image retrieval, which is much more interesting, because for document retrieval you have good ways of doing it, and for image retrieval you don't, as far as I can tell. I'm going to spend the first five minutes just very quickly going over the basic learning algorithm which I talked about this morning. If you want to know more details, look at this morning's talk. For the document retrieval, we have to figure out how to model bags of words with the kind of learning module I'm going to use, and then I'll show how to learn short binary codes, and how to use those for extremely fast retrieval. Basically you do the retrieval in no time: there's no search involved at all in the retrieval. So we can do sort of approximate matching at the speed of hashing. And then I'll apply that to image retrieval. And this is very much work in progress; we just got preliminary results a couple of weeks ago. So, five minutes on the basic learning ideas. We make a neural network out of stochastic binary neurons which get some input from the layer below and give an output which is a probabilistic function of the input and is a one or a zero. We're going to hook those neurons up into a two-layer network with some observable neurons and some hidden neurons that are going to learn features. So we put the data here, and this is going to learn features of the data. It's going to be symmetrically connected. And we're going to try to learn all these weights just by looking at the data. We have no label information. And the learning algorithm is going to be: you take some data, you put it in here. Using the weights you currently have, you decide for each feature detector whether it should be on or off. They make probabilistic decisions, but if they get lots of positive input they'll probably turn on, and with lots of negative input they'll probably turn off. Then from the binary states of these feature detectors, you try and reconstruct the data using the same weighted connections. And then from the reconstructed data, you activate the features. And you're going to try and train the model so that it's happy with the data and unhappy with the reconstructed data. I want it to believe in reality and not to believe in the kind of things it would prefer to fantasize about. So the way it works is: here's some data, where would you like to go today? And wherever it would like to go, you make it as difficult as possible. So you unlearn on this and you learn on that. So the learning algorithm looks like this. You take the pairwise statistics of an element of the data, like a pixel or a word in a bag of words, being on and a feature detector being on: how often they are on together. You measure that correlation, and then with the reconstructed data you measure the same correlation.
And the difference in those two correlations provides your learning signal. And now once you've trained one layer, you can train a deep network, many layers, by taking the feature detectors of one layer and making their activations be the data for training the next layer. And you can prove that something good happens when you do that. And if you want to know more, look at this morning's talk.
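A minimal sketch in Python of the learning rule just described, one step of contrastive divergence with binary visible and hidden units; the array shapes, function names and learning rate are illustrative assumptions, not details from the talk.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, W, b_vis, b_hid, lr=0.01):
    # Up-pass: each feature detector makes a stochastic binary decision.
    p_hid = sigmoid(v_data @ W + b_hid)
    h_data = (rng.random(p_hid.shape) < p_hid).astype(float)
    # Down-pass: reconstruct the data through the same (symmetric) weights.
    p_vis = sigmoid(h_data @ W.T + b_vis)
    # Activate the features again, this time from the reconstruction.
    p_hid_recon = sigmoid(p_vis @ W + b_hid)
    # Learning signal: correlations measured on the data
    # minus correlations measured on the reconstruction.
    n = v_data.shape[0]
    W += lr * (v_data.T @ p_hid - p_vis.T @ p_hid_recon) / n
    b_vis += lr * (v_data - p_vis).mean(axis=0)
    b_hid += lr * (p_hid - p_hid_recon).mean(axis=0)
    return W, b_vis, b_hid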
Okay. So now we're going to apply it in a way different from how we applied it this morning. Basically all interesting data lies on some lower dimensional manifold in a high dimensional space. If I give you million-dimensional data, if it was really in a million dimensions, life would be completely hopeless, because every new data point would be outside the convex hull of the previous data points and you couldn't do anything. But if I give you, say, a thousand by a thousand image, actually not all images are equally likely, and the images you actually see lie on some low dimensional manifold, or maybe a number of such manifolds. In this morning's talk what we tried to do is model those manifolds by going to a high dimensional space with an energy function that has ravines in it for each of the manifolds. That's what I call implicit dimensionality reduction, because we go to this high dimensional space to capture the manifolds. And that's good for capturing multiple manifolds when you don't know their dimensionality. There's something that's far more standard, which is what I call explicit dimensionality reduction. You say, okay, I'm going to try and represent this data by N numbers which you can think of as coordinates on the manifold. How do I find the best N numbers? So principal components is a prime example of an explicit dimensionality reduction method. So first of all, to sort of link to this morning's talk, I'm going to apply this to images of handwritten digits. And we're going to have all sorts of different digits here of different categories. But now instead of going to a high dimensional space we're going to go down to 30 real numbers. And what we'd like to do is have this neural net that extracts lots of features from the image, then compresses a little bit, so it finds correlations among those features and has features of those, finds correlations among those features, and eventually goes down to a small number of linear features here. And we can learn this model unsupervised. That is, we show it data and we just learn these features the way I sketched. Then we take the feature vectors that we got here from data and we learn these features, all unsupervised. And then we learn these features. And we learn these features. Now, each time I learn one of these modules, because of the way I learned it, the features are quite good at reconstructing the data. So these features will be quite good at reconstructing that. And these features will be quite good at reconstructing that. And these features will be good at reconstructing that. And these will be okay at reconstructing that. And so now we take the weight matrices that we learned here, and we put them all in backwards here. So from these features, we try and reconstruct this. And that's what this W4 transpose is doing. And from these we try and reconstruct those, and so on back to the data. So after I've done my initial training, my unsupervised learning of these modules one at a time, I unroll the whole system, put the transposed weights up here, and then I'm going to train the whole system as a standard neural network with backpropagation. That is, I'm going to say I would like the output to look exactly like the input. So my desired output is whatever the input was. I take the difference between the actual output and the desired output and I backpropagate it through all these layers. Now, if you start off with small random weights here, as we used to do in neural nets, nothing happens, because the derivative you get here is small times small times small times small times small times small times small. That's small to the seventh, which is no derivative at all. If you use big weights here, you've already decided what it ought to be doing, randomly. That doesn't seem like a good idea. You don't want to decide the features randomly. That was the point of using small weights. But if you use small weights you won't learn. What this initialization does, by learning each module so these are good at reconstructing those and these are good at reconstructing those, is it gives you features to get you going, and then you can fine tune those features with backpropagation. And when we fine tune, these weights will become different from those weights. And these ones will become different from those ones. Okay. So here's an example: this is one random image of each class. That's the real data. And for that autoencoder I just showed you, after it's trained -- the training takes a while, like on the order of a day on one core or an hour on a GPU board -- these are the reconstructions of the data from those 30 numbers. So we turn this image into 30 numbers and we reconstruct it from 30 numbers. And it's pretty good. In fact, I would argue the reconstruction's actually better than the data. If we look at this one, there's a little gap there, and it's gone away now. It's a bit dangerous trying to do reconstructions better than the data, because you will be unlearning on your reconstructions while learning on the data. But anyway, if you do standard PCA, you can see it's a much worse way of trying to get that information into 30 numbers. So finally -- I first thought about doing this in 1983. No, sorry. Yeah, in 1983 with Boltzmann Machines and then in 1985 with backpropagation. We could never get it to work. The fact it didn't work didn't stop some people publishing papers about it. But eventually we got it to work in deep nets. And it's clearly much better than PCA. >>: So if you don't do backpropagation, how much worse is it going to get? [inaudible]. >> Geoff Hinton: It will probably be worse than PCA if I don't have the backpropagation. But I don't know the answer. I should know, but I don't.
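A minimal sketch in Python of the pretrain-then-unroll procedure described above. The layer sizes, going down to the 30-number linear code, are illustrative assumptions; the greedy layer-by-layer pretraining and the backpropagation fine-tuning loop are omitted, and after fine-tuning the decoder weights would no longer be exact transposes of the encoder weights.

import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Encoder weight matrices W1..W4; in the real procedure each one is learned
# greedily as a module on the activities of the layer below.
sizes = [784, 1000, 500, 250, 30]
encoder = [0.1 * rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]

# Unroll: the decoder starts as the transposed weights (W4^T .. W1^T);
# backpropagation fine-tuning then lets the two halves become different.
decoder = [W.T.copy() for W in reversed(encoder)]

def autoencode(x):
    # Encode down to the small linear code, then reconstruct the input.
    for W in encoder[:-1]:
        x = sigmoid(x @ W)
    code = x @ encoder[-1]              # linear code units
    y = code
    for W in decoder[:-1]:
        y = sigmoid(y @ W)
    return sigmoid(y @ decoder[-1])     # desired output is the original input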
So now we're going to apply it to documents. And initially we're going to use it just for visualizing documents. So suppose I give you a big document set and I say lay these documents out in two dimensions so I can see the sort of structure of the data set, just a standard visualization. We want a map of document space, to see how it's populated. Well, I could obviously use an autoencoder for that. What I could do is train a model that takes the counts for, say, the 2,000 commonest non-stop words, and from that count vector of 2,000 counts, if I could figure out how to model it with my Boltzmann Machines, I could learn some binary features, and then some binary features, and then I could learn this little map which goes down to two real numbers. Obviously you're not going to reconstruct 250 activities very well from two numbers, but you can do a lot better than random. Like if these are sort of political documents, you'd expect high probabilities for political kinds of words. And from these two real numbers we're going to now try and reconstruct the word counts. We're not going to do very well. But now we can fine tune with backpropagation. And then we can look to see how well we did. And as a comparison we can use a two dimensional version of latent semantic analysis, where what you would do is take the word counts, take the log of one plus the word count, and then do PCA on that. So we took a big document set of 800,000 documents, the Reuters document set that's publicly available and where they give you the word counts, and we used two dimensional latent semantic analysis, that is, basically PCA done in a sensible way. And it's all done without knowing the class of the documents. And then once you've laid them out in 2D, you color -- there's a point per document, and you color the points by the classes, because all these documents have been hand labeled by what class they are. And you can see, with PCA, there's green ones down here and the more reddish ones here. There's some structure, but it's not very good. And it's the fact we're using log of one plus the count that causes this funny structure here; in other words, you never have any big negative numbers going in. Now we apply our method, and that's what we get. And we would argue that this is better. We've labeled the main classes here. It comes with a hierarchy of classes. These are sort of business documents. But if I told you that was the accounts and earnings statement for Enron, do you want to invest in this company? My advice would be probably not. So it's found structure in the documents. >>: [inaudible] backpropagation? >> Geoff Hinton: It's after the backpropagation, yes. >>: Have you compared [inaudible] with PCA? >> Geoff Hinton: Latent Dirichlet allocation? >>: [inaudible]. >> Geoff Hinton: No. This will do much better. In other work we compared this kind of stuff with linear discriminant analysis. These multiple non-linear layers are a big win for this kind of stuff. I can refer you to the paper comparing with that. In the middle you're not so sure. Those are probably short documents; with a document with a lot of words, you get much more confident about where it is. So you get to see structure in the data. >>: [inaudible] sort of star shaped? I mean, in most clustering techniques the examples are, you know, isolated clusters. So why do you think everything's pooling towards 0,0 here? >> Geoff Hinton: It's a very good question. And when we use non-parametric dimensionality reduction techniques on this, we get clusters. Basically it's because this vector is being used to determine the probabilities of the output words, and when you don't know, you want them all to be in the same place. So all the uncertain ones you want to be in the same place. As you get more confident, you can afford to have less entropy in the output distribution. But for short documents you're bound to have a high entropy output distribution. So all very short documents should give a very similar distribution, because there aren't enough words to know -- >>: [inaudible]. >> Geoff Hinton: Partly an effect of that, yes. >>: [inaudible] the document [inaudible] or are you taking the raw term count? >> Geoff Hinton: This is the raw term count. >>: [inaudible]. >> Geoff Hinton: No, because we're doing the probability of a word but multiplying that by the number of words. So yes. >>: [inaudible].
>>: [inaudible] four dimensions to three dimensions. Is it obvious that this [inaudible] than PCA? >> Geoff Hinton: It's obvious to me. I'd be amazed if you got those two pictures in 2D and then the 3D PCA was as good as this. But yes? >>: But that [inaudible] might just be the number of words. >> Geoff Hinton: Yes, I think it is. Okay. Now, I didn't tell you how we modelled a bag of words using a Boltzmann Machine, and it turns out that a little bit of care has to go into it. And in particular when we did that, we didn't use the last trick, and the last trick helps a lot. So we could do it better now, I think. So we start off with this binary Boltzmann Machine where you've seen the input variables as binary variables. And now we're going to replace each binary variable by a 2,000-way alternative. A binary variable is a two-way alternative. There's an easy generalization to a 2,000-way discrete alternative. So the idea is we're going to build ourselves a Boltzmann Machine that for each word -- for each instance of a word in the document -- has one input unit. So if it's a 100-word document there are going to be 100 input units. And each of these units is going to be a 2,000-way softmax, a 2,000-way discrete variable. So we use 2,000-way variables. And in the data, you know, one of those is going to be on for each of these input variables. Then each hidden unit -- because it's a bag of words, it doesn't care about the order -- a hidden unit is going to have the same weights to all of the different input units. You can probably already see we can do the computation faster than the architecture I'm describing suggests. We're going to use as many of these visible units as there are non-stop words in the document. And so for each document we'll get a different size network. So it's not really a Boltzmann Machine, it's a family of Boltzmann Machines. For a big document you have a bigger Boltzmann Machine. But they all have tied weights. Okay? So the number of hidden units is always going to be the same. And each hidden unit has got its weights to the visible units. As I have more visible units, you just keep reusing the same weights. So you're making a bigger Boltzmann Machine but no more parameters. And so you've got this family of different size machines. But there's one crucial thing you have to do to make this into a good density model, at least, which is that the hidden unit is getting input from the visible units, and that input is fighting a bias. As you have more visible units there's more input, and so you want more bias for it to fight. And so you have to scale the biases with the number of input units. >>: [inaudible] scale of the input [inaudible]. >> Geoff Hinton: Yes. Either way. But this is the way we currently think about it. And this makes a big difference. And in an upcoming NIPS paper we compare the density model you get for bags of words -- this is what [inaudible] -- with things like latent Dirichlet allocation, and this gives a much better density model for bags of words. So you hold out some documents and you ask, you know. So here's just a picture of the model. You decide how many hidden units you're going to use -- we're going to use thousands -- and then this hidden unit will have some weighted connections to this softmax and the same weighted connections to this softmax and that softmax, and then this hidden unit will have different weighted connections from those, but again the same for each softmax. Okay. I [inaudible] that a bit.
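A minimal sketch in Python of the tied-weight family of machines just described, assuming a fixed vocabulary, binary hidden units, and a document given as a vector of word counts. Because every word token shares the same softmax weights, the hidden input reduces to a product with the count vector, and the hidden bias is scaled by the number of tokens N, as described above. Names and shapes are illustrative.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def hidden_probs(word_counts, W, b_hid):
    # word_counts: length-V count vector over the (e.g. 2,000-word) vocabulary
    # W: (V, H) shared weights; b_hid: (H,) hidden biases
    N = word_counts.sum()                        # number of word tokens in the document
    return sigmoid(word_counts @ W + N * b_hid)  # biases scaled with document length

def word_probs(h, W, b_vis):
    # One shared softmax over the vocabulary, reused for every word position.
    logits = h @ W.T + b_vis
    e = np.exp(logits - logits.max())
    return e / e.sum()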
And so we're going to start out by using an architecture like this, where we go down to 10 Gaussian units here, 10 real valued units. So it's as if we've done latent semantic analysis, hoping I'm using the terms right, with a 10-component vector. And now we're going to try and do document retrieval with that. And we're going to compare with document retrieval using the vectors you get with latent semantic analysis. So if we extract 10 numbers and retrieve documents using those 10 numbers, we do better than latent semantic analysis retrieving with 50 numbers and much better than latent semantic analysis retrieving with 10 numbers. In other words, each of our numbers is worth five times as much. And the amount of work you have to do is just linear in the number of numbers, if you're just doing a linear search through the things you want to retrieve from. So it's a big win to have a lot more information in here. So one of these numbers is worth five of their numbers. So people doing LSA will sometimes tell you LSA is optimal, and what they mean is optimal within the space of linear methods, okay? Linear methods are completely hopeless, but within that space it's optimal. I should say LSA was a really ground breaking invention when it was done in the '90s. We couldn't do these more complicated things then. And it really amazed people how much information you could get just from the statistics. People were surprised that it would work. But now we can do much better with non-linear methods. >>: So what [inaudible]. >> Geoff Hinton: This is saying, as a function of how many you retrieve, what fraction of them are in the same class as the query document. We don't have a good way of saying whether this was a relevant one. But if we use a bad way, and it's the same for all methods, that's the best we can do, and our bad way is to say: is it in the same class? So you put in a document that was accounts and earnings, and you try and retrieve: you get the 10D code for this document, you compare it with all the other 10D codes, you see which fits best. Is the one that fits best about accounts and earnings? Well, 43 percent of the time it is. >>: Well, that's precision. So this was [inaudible] some recall. >> Geoff Hinton: Yes. Yes. And this is sort of recall here. We didn't label the axes with the normal terminology because this was early in our careers as information retrieval people, and we were still learning. >>: Question. So you get those 10 numbers from the autoencoder. Do you need the weights? >> Geoff Hinton: Yes. >>: That's a bunch more numbers. If you look at all the parameters that are involved in the encoding in both cases, how does that total number of parameters compare? >> Geoff Hinton: We have more. Not -- yeah, I need to think about this. If you're doing PCA from 2,000 to 10, you've got 20,000 numbers, right? Here we've got a million numbers right there. In other words, this is only going to work on big document sets; it won't work on small document sets. We need to be able to train a million numbers here. What's small is the code you use at runtime. >>: [inaudible]. >> Geoff Hinton: Okay. So the point is, at runtime, when you're actually doing the retrieval, it's linear in the length of the code if you do it in a dumb way, just matching codes. So you want a short code. >>: But in order to find the [inaudible] of the code don't you need the actual weights? >> Geoff Hinton: Oh, if you're asking about the storage in the machine, it needs these weights sort of once for all documents. It doesn't need these weights per document.
Also, if you've got a big database, what you do is overnight you sort of run this and you store these 10 numbers with each document, so at retrieval time you hope you don't have to run that at all. >>: That would be analogous to the LSA, where you've got the basis vectors and you have to store those once for each document. I mean once, period, for all documents. >> Geoff Hinton: Right. Yes. These are like the basis vectors. >>: Right. >> Geoff Hinton: But the speed at retrieval time is going to depend on how many numbers you have here. Okay. So what we're going to do now is -- >>: Is it really a question of speed, or is it choosing the right bottleneck so that things get separated? I mean, is it -- in other words, what's the -- >> Geoff Hinton: Well, it's the case with all these methods that if I use more units here I'll do better, up to quite a big number. It's not that we have to go down to a small number to do well at all. If I make it a hundred here I'll do much better. >>: So it really is a computational efficiency. >> Geoff Hinton: It's the efficiency at retrieval time. >>: At retrieval time you don't necessarily scan, right, you use some sort of a k-d tree or other -- >> Geoff Hinton: Absolutely. >>: [inaudible]. >> Geoff Hinton: Absolutely. So this really isn't of much interest yet, because you're going to use much faster techniques than serial matching. >>: So that's what the next point is going to address. >> Geoff Hinton: Okay. It's just to compare with LSA. And, sure, we get more information per component of the code than LSA does. So now what we're going to do is, instead of making these linear units, we'll make them binary units when we do the first -- when we do the learning. So we learn this module, we learn this module, we learn this module with binary units, we learn this module, we learn this, we unroll it all, and then we start doing backpropagation. And here, as we're doing the backpropagation, we inject noise. And the point of that is to force these guys to be either very firmly on or very firmly off. That's the only way they can resist this noise. If they're in the middle range, then the noise, which has quite a big standard deviation, will just swamp them, and they won't be able to transmit any useful information. And the way we add the noise is, for each training case we make up a noise vector ahead of time, but there are so many training cases, on the order of a million, that you can't store all that information. So we've got what I call deterministic noise: for any one case, it's always the same noise. So we can use methods like conjugate gradient for doing the backpropagation. If you don't want to use conjugate gradient, then you can make this be true random noise. But you train this thing up, and what you discover is -- you first train it without noise and then you add the noise -- when you add the noise here, these become far more binary. So they started off being learned as binary units, then when you use backprop they're logistic units with continuous values, and you force those values to the extremes, so that now when you threshold these, you don't threshold at .5, you threshold a little bit lower than that. But when you threshold them, you can still reconstruct quite well from the thresholded values. So now what we've done is we've converted a bag of words into a binary vector.
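A minimal sketch in Python of the noise trick just described: during fine-tuning, extra noise is added to the inputs of the code units so they are driven firmly on or off, and the noise is regenerated deterministically per training case rather than stored, so conjugate gradient can still be used. The noise scale and the threshold value are illustrative assumptions.

import numpy as np

def deterministic_noise(case_index, n_code_units, sigma=4.0):
    # "Deterministic noise": the same vector is regenerated for a given training
    # case (seeded by its index), so it never has to be stored.
    rng = np.random.default_rng(case_index)
    return sigma * rng.standard_normal(n_code_units)

def code_units(pre_activations, case_index, training=True):
    # Logistic code units; the added noise forces them towards 0 or 1.
    if training:
        pre_activations = pre_activations + deterministic_noise(
            case_index, pre_activations.shape[-1])
    return 1.0 / (1.0 + np.exp(-pre_activations))

def binarize(code_values, threshold=0.4):
    # Threshold a little below 0.5, as described in the talk.
    return (code_values > threshold).astype(np.uint8)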
Now we can do what we call semantic hashing. So here's a document. You compute its binary vector, and you go into a memory space, and that binary vector is viewed as an address in that memory space. And because of the way these binary vectors were computed, it's not like a normal hash function. With a normal hash function you try and sort of spread things out randomly over the memory, just so you get things sort of evenly populated in the memory. Here, because of the way the learning's done, we expect that documents with similar words will be near in this space. And it may not seem like a 30 dimensional space is very high dimensional, but there's a lot of space in 30 dimensions. So I have an analogy for it: supermarket search. If I go into a supermarket and I want to find things that are like a can of sardines, what I do is I go to the teller and say, where do you keep the sardines? I go to where the sardines are kept, and I just look around. And then there's the tuna and the [inaudible]. Unfortunately the anchovies are way over here with the pizza toppings. But that's because supermarkets are really just 2D things. I mean, they're this 2-dimensional strip in 3D, and you can't have everything that you'd like near each other. But in a 30 dimensional supermarket life would be wonderful. You could have the organic things here, and the cheap things there. And you could have the slightly-beyond-their-date and very cheap here and the healthy ones here. And you could have the kosher ones there and the unkosher ones there. And in 30 dimensions you could, you know, arrange it all very nicely so similar things were near one another in all these different dimensions of similarity. So the idea is, maybe with 30 dimensions you can organize documents so that similar documents really are near one another. And if this autoencoder will do that for us, then we have a very fast way of finding documents similar to a query document. You just go to the address of the query document, and you look around. And so you enumerate the local neighborhood in terms of the Hamming distance from where you are. So you might think you would form a list. But actually, in a sense, you already have a list. You have this address, and as you flip bits in that address, you get the addresses of the next things. So why actually form a list? You've already got your list. You've got an address and you've got a way of getting to the next guy. So that's a list. And so in a sense, in this memory space with a billion addresses here, I really have a billion things, each of which is a list a billion long, all in the same memory space. So it's sort of instant retrieval of a short list. Now, that short list may not be very good because you're only using 30 bits. So our proposal is: use those 30 bits, and then once you've got a short list of, say, 10,000 or something, you can do a serial search using a longer bit code, say 256 bits. Which is still much better than using LSA or something. And 256 bits are very easy to store. You just have to store four words with each document. >>: [inaudible] say something about the [inaudible] bits as opposed to real valued numbers? Because you could do the same trick but just, you know, have real values. >> Geoff Hinton: We have done some comparison on that. And I can say one comparison: one of our bits is worth slightly more than one of the LSA real numbers. So if you compare it with LSA, one of these deep bits is worth about the same as one of their real numbers, but it's much cheaper to store and match. Roughly speaking, that's the [inaudible].
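A minimal sketch in Python of semantic hashing as just described: a document's 30-bit code is used directly as a memory address, and nearby documents are found by enumerating addresses within a small Hamming distance of it rather than by searching. The search radius and data structures are illustrative assumptions.

from collections import defaultdict
from itertools import combinations

N_BITS = 30
memory = defaultdict(list)                # address (int) -> list of document ids

def store(doc_id, code):
    memory[code].append(doc_id)

def retrieve(query_code, radius=2):
    # Look at the query's own address, then every address within `radius` bit flips.
    hits = list(memory[query_code])
    for r in range(1, radius + 1):
        for positions in combinations(range(N_BITS), r):
            address = query_code
            for p in positions:
                address ^= (1 << p)       # flip one bit of the address
            hits.extend(memory[address])
    return hits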
So here's another view of semantic hashing. All fast methods of retrieving things involve intersecting lists, roughly. Many fast methods of retrieving things involve intersecting lists -- all the ones I know about. And what we're doing here is intersecting lists, because when you extract one of the bits of the code, that's a list of half the things in the space. Okay? It's all the documents that have that bit of their address turned on. So you can see it's a list of half the space. And the next bit in your code is another list, an orthogonal list of another half of the space. And now you want to intersect those 30 lists. Well, that's sort of a memory fetch. So we're saying a memory fetch really is a list intersection, if you can map your lists onto the bits of the address. And so that's what we do. We just use machine learning to turn the information retrieval problem into the kind of list intersection that all computers like to do, namely a memory fetch. So the question is, how good is it? We've only implemented it with 20-bit codes, and that's quite a lot different from 30-bit codes. And this is on that same document set that I showed you, about a million documents; a 20-bit code is about right. I'm not going to talk about what you do if you get collisions, two documents at the same address. If you're not a computer scientist, you don't sort of think about that, and if you are a computer scientist, you know what to do about it anyway. So what we do is we take our short list, and you can now feed the short list to TF-IDF, which is a sort of gold standard method, a fairly gold standard method. And you can compare it with other methods. So semantic hashing is the fastest method we know. I can't see how you could get a faster method of getting a short list. And if you feed the output of that short list to TF-IDF, then what you get is performance that's actually slightly better than TF-IDF alone. If you do TF-IDF on everything, which you could do relatively efficiently using indices and things, it's better if you just do it on a shortlist. And that means that some of the things TF-IDF would decide are good, our shortlist says are bad -- they're not on our shortlist -- and they shouldn't be there, because we're doing better by filtering with this.
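A minimal sketch in Python of the two-stage idea just described: the shortlist from semantic hashing is re-scored with TF-IDF and only those documents are ranked. The particular TF-IDF weighting and cosine scoring here are standard choices assumed for illustration, not taken from the talk.

import numpy as np

def tfidf_rerank(query_counts, shortlist_ids, doc_counts, doc_freq, n_docs, top_k=10):
    # doc_freq: number of documents containing each word; doc_counts: per-document count vectors.
    idf = np.log(n_docs / (1.0 + doc_freq))      # down-weight words that occur everywhere
    q = query_counts * idf
    q = q / (np.linalg.norm(q) + 1e-12)
    scored = []
    for doc_id in shortlist_ids:                 # only the semantic-hashing shortlist is scored
        d = doc_counts[doc_id] * idf
        d = d / (np.linalg.norm(d) + 1e-12)
        scored.append((float(q @ d), doc_id))
    scored.sort(reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]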
Okay. Now, all of that was for document retrieval. That's all published stuff. And it's not so interesting for document retrieval, because there are other good ways of doing it. But for image retrieval I think it's more interesting. >>: [inaudible] when you're comparing with LSH, you're saying the speed of computing the 20 bits is faster? >> Geoff Hinton: No. What I'm saying is, I'm really thinking of it like this: you can have these 20 bits stored with each document, and then it's just a memory lookup. LSH has bigger codes than that. You can't do it with just a memory lookup. Okay. So for image retrieval, until fairly recently it was either done with things like color histograms or done more effectively, I think, using the captions. But obviously you'd like to do it by recognizing what's in the image. You'd like to do object recognition and recognize what's in the image and do it like that. Because a pixel isn't like a word in a document; a pixel doesn't tell you much about what's there. A word tells you lots about a document. The things that are like words in a document are objects in the image. But recognizing those is kind of tough. So maybe we can do something else. Maybe we can extract things that aren't as good as objects but nevertheless contain quite a lot of content about what's in the image, in a short binary code, and then we can use these ideas about short binary codes either for quick serial matches or for semantic hashing. So we propose to use a two-stage method where you use a short binary code, a very short one, to get a short list, and then for that short list you use a longer binary code. Now, there's no point in doing that unless your longer binary code is going to work reasonably well. So all we've done so far is check this longer binary code to see that it works reasonably well for image retrieval. And those are the results I'm going to be showing you. That's just four words of memory for each image. And you can match quite fast. And the question is, how good are these codes? What do you get when you try matching with these 256 bits? Now, we thought about this a few years ago when we were doing document retrieval. And then Yair Weiss and Antonio Torralba and Rob Fergus came along and figured out a different method of getting binary codes, which is published in NIPS, which is much faster than our machine learning method. And they claimed that it worked better, and it's simpler, and it has no free parameters. When they claimed that it worked better, what happened was that they didn't know how to train Restricted Boltzmann Machines on real valued data -- on real valued pixels. And so they gave up on using the pixels, and they used just features as the input to the system. So they took the image, and they took it down to 384 numbers before trying to compress it to this binary code. And it turns out that's much too much compression much too soon. If you train on the raw pixels and you do it properly, it's actually completely reversed. Their method is basically hopeless, and our method is much, much better. They're friends of mine. So Alex Krizhevsky, a long-suffering graduate student of mine, spent a long time trying to figure out how to train these autoencoders properly with Gaussian visible units. This morning I talked about something better than that, where you model covariances. But for this work we're just using the ordinary Gaussian units, that is, linear units where you have independent Gaussian noise on each unit. To do the training you need a very small learning rate. You mustn't add noise when you make the reconstructions, and you need a cluster of machines or a GPU board. And you need to use a lot of units in the early layers, because you're going from real valued things to binary things. And so he goes to a big set of binary things. And then we're going to compare with the spectral hashing method that got published in NIPS, and with Euclidean matching, which is going to be very expensive but will at least show us literally very similar images.
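A minimal sketch in Python of the serial matching step for the longer codes: each image's 256-bit code is stored as four 64-bit words, as mentioned above, and similarity is just the Hamming distance, the popcount of the XOR. Function and array names are illustrative assumptions.

import numpy as np

def hamming_distance(code_a, code_b):
    # code_a, code_b: uint64 arrays of length 4 (256 bits per image).
    x = np.bitwise_xor(code_a, code_b)
    return sum(bin(int(w)).count("1") for w in x)

def nearest_images(query_code, all_codes, k=15):
    # all_codes: (n_images, 4) uint64 array; a brute-force serial scan.
    dists = np.array([hamming_distance(query_code, c) for c in all_codes])
    return np.argsort(dists)[:k]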
I don't know if you can see this. I first wanted to show you the kind of filters you get if you apply this Gaussian RBM to these small color images. We're going to do it on 32 by 32 color images. And they're very different from the means that I showed you this morning. They're much more like the variances. And that's if you just learn these by themselves. If you learn the covariances, these become completely different. Again, they have that flavor. Most of them are pure black and white. And this is now 32 by 32. So they're highly localized in space, they're high frequency and they're pure black and white. And a few of them don't care at all about intensity -- they're completely insensitive to intensity -- and they're color contrasts. This isn't a topographic map; this is just a random subset. In this case you learn 10,000 [inaudible]. So at least it shows you you're learning something that's generally regarded to be a sensible way to start on an image. So the architecture actually used here for the autoencoder was: the three color channels, not preprocessed at all. He used 8,000 units here, then 4,000, then 2,000, then 1,000, then 500, then 256. And there's absolutely no theoretical justification for this. There's a justification for using powers of two, because he's using a GPU board. But that's about it. But what we do know is, if you use a small layer it doesn't work nearly as well, and using lots of layers seems to work better. And also it's very robust. If he'd gone from 2,048 to 700 here to 256, it would probably work about the same. Maybe slightly worse. So although this is arbitrary, the results aren't sensitive to this arbitrary stuff. That's the good news. And it takes a few days on the GPU to train this whole thing. This thing has, well, four times eight, 32 million weights there. Overall it's got about 67 million parameters. It's trained on color images, and he has 80 million of those, but he actually only bothers to train on two million. That's plenty. Well, I should say, in this work he's training on two million, and he's doing retrieval on another two million that it wasn't trained on. He could probably do better by training on all 80 million, but it would take longer. So what we have here is: this is a query image, distance zero. These are the 15 best retrieved images, in this order. And it's showing you the distance. And this is the distance in Hamming space for the 256-bit codes. So you have to go quite a way until you get to another one. But if you look at the top row, it's all men with shirts and ties. It seems nice. This is spectral hashing, which doesn't have any men with shirts and ties and has some things that you wouldn't have thought of as particularly similar to that. We couldn't believe these results, and so I kept talking to Yair and saying spectral hashing isn't working, and he said, yeah, I know. So we think we haven't got a bug. We've run our code on things that they've run their code on and it gets the same results. So we're fairly sure these are the results, even though they look terrible. This is Euclidean distance, which sometimes is slow, and you'll notice Euclidean distance is much more similar to us, but we seem to be doing something better than Euclidean distance. I would argue that these are more like that. Notice it loses fine details -- look at this guy's face. He doesn't have a face. But he does have a shirt and tie. Okay. So the order is: we're the best, this is the second best, and spectral hashing is the worst for that one. Here's Michael Jackson. This is Euclidean distance matched to Michael Jackson. We at least get people nearly all the time. Spectral hashing, well. There's something else that's very strange too. I accused Alex of making a mistake with the 64-dimensional spectral hashing. And the reason was, look at these distances. It's got something at a distance of 23 from this thing. And our nearest thing is at a distance of 61 and is much more similar. Now, how could that be? How could it have something that's really close and very dissimilar? Well, because their bits are no good. Their bits are very far from orthogonal to each other.
So they hash something that has many bits the same and just isn't at all similar. So its furthest thing is only 26 bits away. It's just not laying the things out; it's not using the space uniformly like information theory says you should. Okay. So here again I would argue we do better than Euclidean distance and much better than spectral hashing. >>: Why do you expect the Hamming distance matching to be sensible? It seems like -- it seems like you didn't train for it, and you didn't -- this seems like a [inaudible] accident, right? >> Geoff Hinton: No, because we know from the autoencoder that similar things will get similar codes, just because of the way it's trained. I also have evidence, right? >>: There's nothing semantic about the algorithm, right? It's not like you're asking that nearby things are meaningful. >> Geoff Hinton: Oh, but you're asking it to find a few features that describe an image such that from those few features I can reconstruct the image. So think about it for documents, going down to two features. Suppose you had sports documents and politics documents. You'd really like one of those features to be sports versus politics, because if it's sports you can raise the probability of all the sports words and lower the probability of all the politics words, and vice versa. So you'd really like to go for that abstract feature of sports versus politics. >>: But maybe -- maybe there's something odd, like this [inaudible] is black and that one is purple, that also just -- that also distinguishes and makes it [inaudible]. >> Geoff Hinton: But that one helped you predict other pixels as well. So for example, here's a well known semantic feature that you can actually extract from an image quite easily: indoor versus outdoor. It's quite easy to do that. And the point about indoor is you expect to see sort of straight edges and particular kinds of illumination, and outdoor you expect to see far more rough things. So that one feature will tell you a lot about how to reconstruct the image. What we're hoping is that if you could get 30 features, all of the same quality as indoor versus outdoor but all orthogonal to that, you'd really be in business for image retrieval. And we're trying to move in that direction. Let me carry on. So here's a flower, a dandelion, I think. About half the other things we get are dandelions. And in this case about half the other things we get are also in the 15 closest in Euclidean distance. So this is some evidence we haven't moved very far away from the sort of raw input metric. So that's bad news for the abstract idea. But what's bad news for that is good news for the following idea: wouldn't it be nice if you could do Euclidean matches just by using 256 bits instead of using about 3,000 real numbers? And apparently we can. I mean, we can get half the things that come out close in Euclidean distance by matching these 256-bit things and looking at Hamming distance. So it's good in one way and bad in another way. Spectral hashing as used only got one dandelion but -- >>: So have you compared your short code with like [inaudible]? >> Geoff Hinton: No, not yet. So here's an outdoor scene, and here's what we retrieve. This actually is visually very similar, but you can see it's not actually an [inaudible], it's a sort of drum or swimming pool or something. But notice Euclidean distance also retrieved [inaudible]. Yeah, there it is. That's the same thing you retrieve by Euclidean distance.
So again, about half of these Euclidean distance matches are the same as the things we retrieve. But we're much better than spectral hashing. I mean, I don't think that's particularly like that. >>: [inaudible] impressive that it's doing this, but in some ways it's not unexpected, because the autoencoder is trying to really reproduce the image, it will -- you know, the definition of reproduction is Euclidean distance, right? >> Geoff Hinton: Right. But we know that it generalizes to some extent. So if you encode an image like this and you then generate from it, you can generate things shifted slightly. It will do small shifts and things. >>: But has anyone considered training the autoencoder where, instead of giving the image itself as the thing to match, you give another member of the class? You know, if you have -- >> Geoff Hinton: We've thought about things like that and people have done things like that, but I'm not going to talk about that. That's sort of work in progress, yes. So here's what I think is the best example. So that's a group of people. That's the sort of high level description, and the low level description is a white blob. So if you look at things like this, it really matched on the white blob. But if you look at these retrievals, about half of them are groups of people. If you look at Euclidean distance, it's not doing nearly as well at getting groups of people here. There's a lot of fairly high frequency variation in this image, right? And that high frequency variation, I think, is driving features that are saying there's high frequency variation, and therefore other images with similar high frequency variation around about there are being matched; whereas if you do Euclidean distance and you want to match high frequency variation, the best you can do is find something that's very uniform, that's the average. If you started introducing any variation into this thing, it would probably disagree and you'd get extra variance. So unless you can have perfectly correlated variation, you're better off with no variation at all. And you see Euclidean distance likes things like this -- that's the second best match -- which is the sort of average of all this image everywhere. But we're definitely doing much better at getting groups of people. And spectral hashing isn't. It gets some. >>: [inaudible] pixel by pixel [inaudible]. >> Geoff Hinton: Pixel by pixel, yes. >>: So everything that's [inaudible] normalized is [inaudible]. >> Geoff Hinton: Yup. Now, we should also try cosine of the angle, just to normalize the intensity. But remember, there's two million images being used. >>: But [inaudible]. >> Geoff Hinton: Well, we should also try that. Like I said, this was only done a couple of weeks ago. As you know, I wasn't planning to talk about this, but you wanted me to, so I did. There's a couple of obvious things we still have to do, which is to implement the semantic hashing stage and check that using it as a front end for this slightly longer code doesn't mess things up. We know that it doesn't with documents, but we don't know that it doesn't with images. If you lose some recall on images, it may not matter so much, since people don't worry about what they don't see. So you can get away with bad recall. It's precision you can't get away with being bad at. There's an obvious extension to this, which is: in the first half of the talk we got short codes for documents, and in the second half we got short codes for images, so why not use the same short code for both? You get three wins from that.
You might expect that if you use the words as well, it will help you get more abstract features, because the link to the words goes via more abstract things, which is what you're more interested in. So it will pull you in that direction. We know that compared with LDA models, for example -- latent Dirichlet allocation models -- we have a better model of the density of bags of words. So we think these codes got by RBMs, even with one layer, are a better way of modeling bags of words. So we're going to get a win there. And multiple layers are much better at modeling bags of words. We do well on images. And the interaction should help us a lot. So that's a very obvious big win. Then we go some way to answering John's objections, because from the image you can actually start producing words, so you understand some semantics -- you know, it should say 'people'. There's a less obvious win. Semantic hashing is incredibly fast if you've already got the codes for all the documents and you're just looking up which ones are similar to this one. You can't really go far beyond 32 bits. Maybe you could go to 36 bits or something, but you can't go to like 100 bits. But it's so fast that you could do it several times. And if you go to a memory address, you have this sort of Hamming ball, and it's easy to enumerate the Hamming ball in ascending address order. So you can enumerate it ordered in the address space. So now if I take another query and get its Hamming ball, I can intersect those two Hamming balls efficiently, because they're ordered lists. And it's so fast. So why don't I now say, okay, I'm going to get myself 20 lists like this and intersect them? And it's going to be [inaudible] to intersect these lists, but it's linear in the list length. So if I take a Hamming ball that contains 10,000 things, it's sort of 20 times 10,000 operations to intersect these lists. And so I can afford to do this with several queries. So I can apply a transformation to the query image. I can, for example, do a small translation. Well, a small translation you don't have to do -- you can already cope with a very small translation. But what about a somewhat bigger translation? You'd have to worry about edge effects and things, and so to get rid of those we're using 28 by 28 images, so we can translate a few pixels and still be inside the image. And that's work in progress, to see if you can use the speed of semantic hashing to allow you to match transformed things by simply trying a number of transformations. Because you can afford to do that. So the summary of what I've said is that we have this learning algorithm that can learn layers of features efficiently. We can use it to learn big representations for doing object recognition, or we can use it to learn small representations for doing retrieval. And in particular we can learn binary representations, which are very cheap to store. And we can use this semantic hashing, which amounts to using the speed of hash coding to do approximate matching. And then if this works nicely with images, we can start trying to deal with obvious transformations by taking the query and transforming it and doing multiple matches and then intersecting them. So we have a way of converting the speed of semantic hashing into better quality of retrieval.
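A minimal sketch in Python of the intersection idea just described: each (possibly transformed) query yields a Hamming ball of retrieved items that can be enumerated in ascending order, and the resulting sorted lists are intersected in time linear in their lengths. Names are illustrative.

def intersect_sorted(a, b):
    # Merge-style intersection of two ascending lists.
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_many(sorted_lists):
    # Intersect, say, 20 Hamming-ball lists, one per transformed query.
    result = sorted_lists[0]
    for lst in sorted_lists[1:]:
        result = intersect_sorted(result, lst)
    return result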
Okay. That's it. [applause]. >>: So do you [inaudible] -- you showed that the features are either text or pixels. >> Geoff Hinton: Right. >>: Have you thought of features that are a combination? For example, like when you do a search on the web, you type some text and you find text in the web page where the images are, and maybe that -- how do you learn more? >> Geoff Hinton: There's -- I mean, sort of one reason I'm talking about this here is there's a huge number of directions to go in from here, right? Once you've got a way of getting codes, there's also a question about what you apply it to. We've just done some of the most obvious things. Mainly I was concerned to compare it with spectral hashing, because it was so depressing that that dumb technique worked better than our technique, and I wanted to show it really didn't. So the main point was to actually compare with spectral hashing. But there's also things you could do. And this morning people raised the idea that, well, you know, if you're dealing with images, you really don't want to start with pixels, because image processing people know what to do with pixels; you want to start with something higher up. And I sort of agree with that. It was just to compare with spectral hashing, and also to sort of know that we can do the whole thing with this technique, that we started with pixels. But I agree, there's all sorts of other inputs you could use. >>: So one of the messages here, the main message to me, is if you use compression, compression will force you to find relevant features that you use to do things. And I -- >> Geoff Hinton: If you've got a good way of finding features, yes. >>: And the RBM is a compression mechanism that seems to find some good features. >> Geoff Hinton: Yes. >>: If you view it as a compression mechanism, have you compared it purely as a compression mechanism with other things, you know, like [inaudible] or other compression techniques? >> Geoff Hinton: We haven't, no. I don't -- I wouldn't expect it to be really good as a compression mechanism. I wouldn't want to -- >>: So then it's not just the fact of compression per se, there's something special -- >> Geoff Hinton: There's something special, which is taking similar things to similar codes. That's very important. And compression doesn't necessarily do that. So I think the fact that it's a sort of smooth function -- because the weights don't get that big, so it's a fairly smooth function from inputs to codes -- I think that's what's important, and that helps answer John's question, too. Why do you expect similar things to have similar codes? Yeah? >>: I think there might be two things going on simultaneously here, right? If you just say get me similar things, right, and if you gave the metric, the Euclidean distance metric, right, well in essence you're stuck, because if there's no skew and the distribution is uniform you can't do it, right, if you're just going to pick 30 values [inaudible] code those; but it's the fact that the data, you know, lives on a manifold and has this very different density and so on. So the thing is, the encoding tries to minimize the reconstruction error given the prior distribution, the implicit prior distribution that arises when you get a certain set of learning examples, right. >> Geoff Hinton: Exactly. This will only work for data that lies on a low dimensional manifold in a high dimensional space, and what's more, where there are obvious correlations in the raw data that you can pick up on that really are relevant to that manifold. I could put in some obvious correlations in the data that had nothing to do with the manifold, by sort of watermarking or something. And this maybe would pick up on those. And it wouldn't be any use at all.
But real vision isn't like that. In real vision it's screaming out at you that these correlations are caused by what's really causing the image, and if you're sensitive to them, you'll be able to find out what caused the image. And that's the kind of domain in which it works. >>: That might be -- when you're doing compression, ultimately your goal is to introduce the least visual distortion, right? And here it's not obvious that's what the autoencoder is doing. It's being asked to minimize the Euclidean distance, but it's also being asked to do it at a frequency that is whatever is in your training set. If you only have 10,000 or some number of images that's -- >> Geoff Hinton: Right. But as you go up through these layers of features, if for example you take an image and you go up and you come back down again, you can sort of shift it a bit and things like that. And I should show some examples of that. But it's not sort of determined to keep the pixels in the same place. It has a little bit of translational -- >>: Right, it's not the [inaudible] that can shift it, it just so happens that the distribution of the data space means that small shifts don't kill you in terms of Euclidean distance. >> Geoff Hinton: Exactly. >>: Right. >> Geoff Hinton: Because locally, if you take a patch of image and you shift it slightly, there are probably other things in your database that are very like that shifted patch of image. Exactly. Yes. >>: So the desirable property of -- sort of similarity [inaudible] -- is it reflected in the energy function definition in the Boltzmann Machine, or -- >> Geoff Hinton: No. >>: [inaudible]. >> Geoff Hinton: It basically comes out of the weights not being that big. So you've got a smooth function. I mean, that's one important thing. By using very big weights you can get a very non-smooth function, so very similar things can go to very different places. But also, as he says, the data has to have a lot of structure; it has to lie on this low dimensional manifold, and so this will find that manifold. And if the data isn't like that, this kind of method won't work. But almost all data is like that. Almost all highly structured data. >>: So [inaudible]. >> Geoff Hinton: We typically do regularize them a little bit, but it's not really that important. >>: [inaudible] and then, you know, you [inaudible] the function as encoding that [inaudible]. >> Geoff Hinton: Exactly. But typically with a lot of data you don't need to regularize them; they just stay small anyway. >>: [inaudible] in doing this image retrieval is to create visual words, discrete words, and then think of the statistics of these words [inaudible]. >> Geoff Hinton: In a sense our bits are like those words. Our individual bits are like those words. And then in semantic hashing you think: if a bit is like a word, then the inverted index is just all the addresses that have that bit turned on. >>: So in that case, once you interpret the bits as [inaudible] document retrieval -- >> Geoff Hinton: You don't need to. If you've got it in 30 bits, you could use something much better, because you can intersect all those lists in one machine instruction. And I bet they can't -- even John Platt can't intersect all those lists in one machine instruction. >>: But Intel can. >> Geoff Hinton: But they have to build a big machine. >>: Oh, no, but -- yeah. >> Geoff Hinton: Okay. >> Li Deng: Thank you very much. [applause]