>> Li Deng: Okay. So it's a great pleasure to have Professor Geoff Hinton give two talks today. I'm not going to go through all of his achievements and honors except to say that everybody acknowledges him as one of the major pioneers in neural networks and machine learning. And we are very fortunate to have him stay with us until next Friday. He's sitting on the third floor in the speech group, so while he's there we can knock on his door and have a conversation with him and get inspiration from all the things we talk about. He has been here for a few days and we have learned a great deal from him already. So without further ado, I will give the floor to Geoff and he can take all the time to talk about two very important topics today. Okay. >> Geoff Hinton: Thank you. In the first half today I'm going to talk about stuff we've done over the last few years, so some of you may have heard this already. In the second half, I'm going to talk about applying it to larger problems, in particular to three-dimensional object recognition. So the motivation for this work is that a cubic centimeter of cortex has about a trillion adjustable parameters in its synapses. My belief is we're never going to be able to compete with the human visual system until we can learn systems that have about a trillion parameters. That's just so we have an idea of the scale of the thing we have to do, and currently we can't do that. The first half is going to be about how to learn multi-layer generative models that have a large number of parameters and can be trained without labeled data, and also, once you've learned those, how you can use them to do better image classification. The second half is going to be about applying these ideas to recognizing three-dimensional objects in two different datasets. And in the second half we're going to be learning about a hundred million parameters. So we're still a factor of 10 to the 4 away from that. But you know, we're a university research lab, so you could get a factor of a hundred just by going to a big organization. So the starting point is backpropagation from the mid '80s, where you give an input vector, which might be an image, and you want to know what the object is in the image, and you go through a feed-forward neural network with adjustable weights on the connections. You look to see if you got the right answer. If you didn't, you take the discrepancy and you backpropagate it through the net and that tells you how to adjust these weights. And that was initially very exciting. But it didn't work out too well. It worked out okay, but it didn't give as good behavior as we hoped. And here's some of the reasons why. It was very hard to get labeled datasets. You'd like a billion labeled images. That's kind of hard to get, particularly if you want fine labeling of where the edges are and things. The learning time doesn't scale well in deep networks. And although it gets reasonable local optima in small networks, in big networks we can now show we get stuck in not very good local optima, and you can get much better local optima. There's no hope of getting the global optimum, but at least you can get good local ones. So the way we're going to try and overcome the limitations of backpropagation, particularly its limitation of requiring labeled data, is we're going to try and keep the efficiency of using a gradient update method, where you compute the gradient, you change the parameters slightly, you compute the gradient again, and you keep going like that.
It's a very flexible way to do learning. But we're not going to try and learn the probability of the label given the image; we're going to try to learn the probability of an image. That is, we're going to try and learn a generative model that spits out images. It's going to be a stochastic generative model, and if you run it, hopefully what you'll see is lots of images that look like the training data. Once you've learned this, you want to be able to show it an image and say, how might you have produced that, and look at the underlying variables that might have produced this image and use those for classification. The question is what kind of generative model should we learn. And there's an obvious candidate, which is something like a belief net. These were introduced in the '80s by Pearl and Heckerman. Heckerman did the first really impressive demonstration of them doing medical diagnosis. I'm going to use a particular restricted form of a belief net, where you get to observe the variables here at the bottom and all the other variables are hidden. To begin with, they're going to be just binary variables. The inference problem is: if I show you data and you know all the parameters -- all the weights on the connections -- can you infer the states of these binary variables that caused that data? Or can you at least give me a sample that's quite a plausible way it might have been caused? The learning problem is: if I just show you lots of data here, can you not only infer this but actually learn all these weights on the connections? And the kind of units I'm going to use are stochastic binary units, which take some input, which is their bias plus input coming from other units, and then they give an output that's a one or a zero, and the probability of the output is determined by the logistic sigmoid, like this equation. It's a standard kind of binary unit. Now, if you want to learn a belief net with multiple hidden layers like this, there's one thing that's easy to do, which is, after you've learned, you can generate from your model. That's nice and easy to do, and so you can see what the model believes: you just choose these from their initial biases, then given these you choose these from their biases plus what the top-down information is saying, and you choose them all stochastically, and then you get to see an example. You do it again, you see another example. What's difficult in these nets is inferring the states of these latent variables when you see data, even if you know all the parameters. And you have to solve this inference problem, at least approximately, in order to do learning. So that's the difficulty: inferring what is going on here given the data, even if you know the parameters. So if we go to a deep net -- because if you're inspired by the human brain you believe you want to learn at least sort of five layers of variables here if you're doing object recognition -- let's consider the problem of learning the parameters that connect the first layer of hidden variables to the data. So we want to learn these weights here. Well, the problem is, if I give you a data vector and you already have an estimate for these weights, can you infer what might have been going on in this layer? Now, if you ask what's a plausible pattern of the binary variables in this layer to have produced that data, you have to sort of satisfy two things. It has to be a pattern that's quite likely to have produced the data using these weights.
It also has to be a pattern that's quite likely to have been produced by the network above. So I'll call that the prior. This stuff upstairs needs to be quite likely to produce the pattern, and the pattern has to be quite likely to produce the data. In fact it's the product of the probability of the pattern coming from the prior and the probability of the data coming from the pattern that determines your posterior distribution here. And so just to learn these weights, to improve our initial estimate, we need to get a sample from the posterior here. And that involves all of these weights up here, and it involves integrating out all these hidden variables to figure out what the probability of patterns under the prior is. And it just looks like a hopeless problem. All these weights interact. And it's going to be very hard. So one method is to say, well, let's use a really bad approximation here. That's called variational. But after trying that for a while my conclusion was that [inaudible] just give up. So let's give up on learning deep belief nets. It's just too difficult if you want to learn them with billions of parameters. And let's try a very different kind of learning. And then something magical will happen. We're going to learn in a different kind of network. It's not a directed network like a belief net, it's an undirected network like a Markov random field, where there are symmetric interactions. So it's a generalization of [inaudible] network where instead of just things you observe that have direct connections, we have latent variables, but all the interactions are symmetric here. So in graphical models terms this is an undirected graphical model. And its connectivity graph is bipartite. That is, these guys don't connect to each other and these guys don't connect to each other. At least to begin with. Now, this model has one very attractive feature compared with a belief net, which is: if I tell you the states of the observed variables here, if I give you an image, these are all conditionally independent. So getting an unbiased sample from the posterior here is trivial. You just look at the input this guy's getting from the pixels and put that through the sigmoid, sample from the probability distribution that gives you, and do that independently for all these guys, and you've solved the problem of inference. So those are going to be great for doing perception if we can learn them, because inference is going to be trivial. Now, we'd like to learn with more hidden layers, but let's not worry about that for now, let's just try to do it with one. So the inference is easy, but it looks like the learning is going to be difficult. Now, before I go into the details of the learning, I want to tell you what the big surprise for me was. If you can solve the learning problem -- and we can do approximate learning quite efficiently -- then what you can do is this. You can take your data, you can learn weights. These will be symmetric weights to begin with that connect it to your first layer of hidden units. Call those feature detectors. And so you learn what feature detectors to use for each data vector and you learn these weights. Then you take the patterns of activity you get in your feature detectors when you're driving them with data, and you treat those as data, and you do it again. The way I discovered this was I was using MATLAB and I thought, why don't I just take my hidden probabilities and call them data and try it again? And so I just said, data equals hid probs, and off I went. And it sort of worked. It did something sensible.
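A minimal sketch of those two points, the trivial inference and the "data equals hid probs" trick, assuming NumPy and a stand-in train_rbm routine that trains a single layer (the function and variable names here are illustrative, not from the talk):

```python
import numpy as np

def hidden_probs(v, W, b_hid):
    """Inference in a Restricted Boltzmann Machine is trivial: given the data,
    the feature detectors are conditionally independent, so one matrix multiply
    and a sigmoid give the exact posterior probabilities."""
    return 1.0 / (1.0 + np.exp(-(b_hid + v @ W)))

def greedy_stack(data, layer_sizes, train_rbm):
    """The "data = hid probs" trick: train one layer, push the data through it,
    treat the hidden probabilities as data for the next layer, and repeat.
    train_rbm stands in for whatever trains a single layer (e.g. CD-1)."""
    layers, layer_input = [], data
    for n_hidden in layer_sizes:                       # e.g. [500, 500, 2000]
        W, b_vis, b_hid = train_rbm(layer_input, n_hidden)
        layer_input = hidden_probs(layer_input, W, b_hid)   # hid probs become data
        layers.append((W, b_vis, b_hid))
    return layers
```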
And it took quite a long time to understand what kind of model I was learning. Because you'd expect that if you learned a model with symmetric interactions here and then you learned another model with symmetric interactions here, the overall model you'd get would be a great big model with symmetric interactions. It turns out, for reasons I'm not going to try and explain in detail, but you can read the papers, the overall model you get is not that. When you learn a little Restricted Boltzmann Machine here and another one here and another one here, the overall model you get is a Restricted Boltzmann Machine at the top with symmetric interactions, so that's an undirected graphical model, and then everything below that is a directed graphical model. So this is very weird. What we've managed to do is we've learned this directed graphical model, [inaudible] the fact that it's got this top-level associative memory here that's undirected, and we learned it from the bottom up one layer at a time. And it's very surprising you can do that. But we accidentally solved the problem of how you learn a deep directed model efficiently. So the only sort of fly in the ointment is: how do you learn one of these little guys? Because the learning problem looks to be quite tricky. And we have an efficient approximation that works pretty well. And I'll go into the details of that approximation after showing you a demo of what this can do. So for the demo, I'm going to take handwritten digits from the [inaudible] database. I'm going to learn 500 feature detectors. I'm then going to learn 500 feature detectors that represent the correlations among those features. I'm then going to concatenate these 500 feature detectors, their states, with the right answer. I'm going to have 10 label neurons to say what did you class it as. And I'm going to learn a joint density model here. So it knows the answer for learning the last layer. But when it did all this learning, it didn't know what the labels were. Just for the top level here it knows the labels. And when it learns the joint density model it's not trying to learn to be good at predicting this from that, it's trying to learn to be good at having this top-level Boltzmann Machine generate the correct kind of pairs here. And once it's done that learning, it turns out that if you use it for trying to do recognition, it's slightly better than a support vector machine, which is quite surprising because the support vector machine is optimized for discrimination and is a pretty good method. And this is partly because handwritten digit recognition is a small problem. I mean, I think a lot of AI people aren't realistic. You want a problem where you could plausibly solve the problem with only a few million parameters if you can only learn a few million parameters. And at this stage, that's all we could do. And so handwritten digit recognition is good for that, because it's clearly an interesting problem, but a few million parameters might be enough. Okay. So I'm going to show you a demo of that thing after it's learned. So I actually programmed all of this except for the interface. What we'll do first is we'll show it recognizing something. So we first learned these 500 feature detectors, and they're binary stochastic feature detectors. Then we learned these 500 with this as data. Then we learned these 2,000 with this and this as data.
And because it's stochastic, even though the image is the same, each time we go up we'll get a different pattern of which ones are active, but notice it's always very confident that it's a four, right. If I run it faster, you'll see that a bunch of these higher-level guys are stable. So some of them aren't changing at all, and that's why it can be confident it's a four. And there's a bunch that aren't quite certain. If I give it another digit, hopefully it'll recognize that it's a five. And even though all these are stochastic, note that it's almost certain it's a five. >>: How many cycles do you have to go through to [inaudible]. >> Geoff Hinton: Here you just have to go up once before it gives you an opinion. There's another bit of the demo where you have to run it for longer. So if I show it something that's a bit ambiguous -- it's a stochastic system, so it will sort of oscillate between saying four and eight. And occasionally it'll say things like six or something. And if you count up how often it did that, you could see what probability it associates with the different classes. Or you can do a computation which computes a quantity called the free energy, which will allow you to compute how often it will say these two things in one computation without sampling. And that's the sort of better way. Yeah? >>: [inaudible] data always binary at each level -- >> Geoff Hinton: Okay. There's a fudge that goes on. The fudge is, you can see this isn't really binary. So we're using probabilities here. And you could train it -- and it works perfectly fine -- if you sample ones and zeros from those probabilities and then train it like that. Or you could train it just using the probabilities. So this is a little mean-field approximation. And that's a bit quicker because there's less noise around. It's very important when you train it to make the hidden units, when I'm learning this module, be binary, but I can afford to make these be probabilities. And then when I learn this module, I can use the probabilities of turning these guys on that I extracted from the data as the data, but these had better be binary. And so on. >>: [inaudible] your algorithm, did it actually refer to the binary values in training, or was everything continuous in the actual math? >> Geoff Hinton: The math is all binary values. You do the math with binary values. >>: [inaudible]. >> Geoff Hinton: No. The code has binary values in it. The code calls the random number generator. >>: Okay. >> Geoff Hinton: But it calls it for setting the states of these guys. And that's so you're sort of honest and you're not reconstructing the data from real numbers here. You're reconstructing from bits. Okay. That's the system after it's learned, doing recognition. And it's very good at that. It's not as good as a convolutional net, which is told about the relations between pixels. But if you compare it with machine learning algorithms that aren't told anything about space, then it's one of the best. What's more interesting in this system is if you run it backwards. >>: Could you decode convolutional [inaudible] by [inaudible]. >> Geoff Hinton: Easily, yeah. You can do convolutional versions of this, and [inaudible] there are other people who have done convolutional versions. What we're going to do now is run the generative model. So we're going to fix the state of the label. Let's fix it to a two. And now what we're going to do is just go up and down in this top-level Boltzmann Machine here. So that's all that's really happening. You don't look down here.
We go up and down here with this fixed, and it's sort of sampling from its model here. And if you run backwards and forwards for a long time, it will start showing you samples of what it believes here. It will show you the samples here. But those don't mean anything to you. But below this top-level undirected model we have this directed belief network. And so although this doesn't mean anything, I can convert that into an image so you can see what it's thinking. So this is what's going on in its brain, and this is what's going on in its mind. And it takes a while for this network to settle down. And after about 500 Gibbs samples it's settled down. It's in the two ravine now, and that's what it's thinking. That's what it's imagining. That's its mind. And what's nice here is you are seeing an energy ravine in a 510-dimensional space, which is normally quite hard to see. But you're seeing it wandering around a ravine here. So after it's learned, there's an energy function for this top-level thing. When all the weights are zero, that energy function is a sort of flat surface in a 500-dimensional space. After it's learned, that energy function has 10 long ravines in it, and they have names. And if I turn that two unit on, the 2,000 weights coming out of here to the top-level units will lower the energy of the two ravine and raise the energy of the other ravines. If you stumble around here for long enough, you'll fall into the two ravine. And then you'll stay in the two ravine and wander around and you'll see what it thinks twos look like, including very bad ones, but that's good because it can recognize those. Let's just change that label unit. We're going to change it to an eight. So now these 2,000 connections will be lowering the energy of the eight ravine, and it will stumble into that eventually. Probably just luck. It's not really there yet, as you can see. Now it's in the eight ravine. And it will generate all sorts of different eights that it believes in, including ones with open tops. Thank you. This isn't really a demo, it's a canned demo. >>: [inaudible]. >> Geoff Hinton: Sorry? >>: Can you switch on two labels? >> Geoff Hinton: You can't switch on two labels, because these have the constraint that only one of them can be on. If you put two of them on at a half, that you could do: you could do .5 of this and .5 of this. Then it would try and do blends here, I assume. I never actually tried that. I should do that. People have suggested it before, and it would be a nice thing to do. >>: So this is the net without doing fine tuning with backpropagation? >> Geoff Hinton: Right. So far -- now, there was a little bit of generative fine tuning I'm not going to talk about, because it just confuses things. >>: [inaudible] the generation may not be as good as you show here. >> Geoff Hinton: Without the generative fine tuning, the generation wouldn't be as good. But it would still be pretty good. But this is all trained as a generative model here and a joint density model here. No discriminative fine tuning yet. Now what I want to show you is what the learning algorithm is. So a long time ago, Terry Sejnowski and I came up with something called a Boltzmann Machine, which was a sort of general undirected model with binary units with arbitrary connectivity. And we got a nice simple learning algorithm for that that was hopelessly inefficient. And so 17 years later I figured out how to make it run a million times faster.
And that's because computers got 10,000 times faster, and I made it go a hundred times faster. The trick was to wait 17 years. So what we're going to do is restrict the architecture so we don't allow any connections here or any connections here. That makes inference very easy in this kind of Restricted Boltzmann Machine. And as I said before, we can easily get a sample here. But the question is, how do we do the learning? And to understand how the learning works, you need to understand a bit about how this kind of undirected model models data. So underlying it there's an energy function. I'm going to leave out all the biases so the math stays simple. This is a vector of binary activities for the visible units. This is a vector of binary activities for the hidden units. And if you tell me the states of the visible and hidden units, which are ones or zeros, I can tell you the energy of that configuration. And it has a low energy if visible and hidden units that are on together have big positive weights between them. That's a happy state for the network. And the trick is, what we'd like is: when you put in a visible pattern that corresponds to data, it says yes, I'm happy with that, I have a low energy state that goes with that, there's some configuration of the hidden units that makes me very happy. When you put in rubbish, it says I can't find any configuration that makes me happy with that, that's an improbable thing. This energy function has a nice simple derivative with respect to the weights, which is just this [inaudible] statistic: are these two guys on together, is this a one and is this a one? So it's very easy to change the weights to modify the energy. But unfortunately that's not sufficient to do learning. That's only half the story. Because the way this model defines the probability of a joint vector of visible and hidden units is: if you tell me the full state of the system, tell me the states of the visibles, tell me the states of the hiddens, then I can figure out the energy that that full configuration has, and then the probability is proportional to e to the minus the energy. Low energy is highly probable. But of course it has to be normalized by all the other alternative states of the network. And this is called the partition function. And this makes learning difficult. If you want to know the probability of a visible vector, you just sum over all possible hidden configurations like this. So that's the probability assigned to a visible vector. So now if I want to do learning, and I show you some data, suppose V is some data, I say make V more probable. What are you going to do? Well, what you want to do is you want to lower the energy of all the full configurations with V in, and you want to raise the energy of all other configurations. And then that will make this more probable. It turns out there's a very simple way to do that. It's just computationally expensive. You start with data. So this is a data vector. Using the data, you activate your hidden units. So each hidden unit gets some input from the data and stochastically decides whether to be a one or a zero. And you can do all of these in parallel, because they're all conditionally independent given the data, because of the connectivity of the net. Then, given this vector here, you reactivate the visible units. And that's exactly the same computation the other way around, using the same weights but the other way around. And then you do it again.
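A sketch of the two pieces just described, in the same NumPy style as before: the energy of a joint configuration with the biases left out, and one up-then-down pass that uses the same weights in both directions (variable names are illustrative):

```python
import numpy as np

def energy(v, h, W):
    """E(v, h) = -sum_ij v_i * W_ij * h_j. The probability p(v, h) is
    proportional to exp(-E), and the partition function (the sum of exp(-E)
    over every possible configuration) is what makes exact learning expensive."""
    return -v @ W @ h

def up_down_step(v, W, rng=np.random.default_rng()):
    """One block update: sample all the hidden units in parallel given the
    visibles, then sample all the visibles in parallel given those hiddens,
    using the same weights the other way around."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    h = (rng.random(W.shape[1]) < sigmoid(v @ W)).astype(float)
    v_next = (rng.random(W.shape[0]) < sigmoid(W @ h)).astype(float)
    return h, v_next
```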
This is called alternating Gibbs sampling or block Gibbs sampling. All these guys can be updated in parallel, then all these guys in parallel, then all these guys in parallel. If you go on long enough you'll have forgotten where you started. That's called equilibrium. And you'll be able to see fantasies from the model. These are exactly what I was showing you in the demo, except to show you the fantasies from the top-level RBM, I had to go through a few more layers to turn them into an image. But I'm showing you these fantasies. And the learning algorithm is now very simple if you're willing to run this chain long enough to get fantasies. You simply measure how often a pixel and a feature detector are on together with data, and how often a pixel and the feature detector are on together in the fantasies that the model produces. And it's the difference of those statistics that is the derivative of the log probability of the data with respect to a weight on a connection. And that's a bit surprising, because this derivative with respect to this weight depends on all the other weights in the network, and they didn't seem to show up here. But all the other weights in the network determine this quantity here. And so they show up in this quantity here. But it's not like backpropagation, where you explicitly have to sort of go through those weights. They just show up in these statistics. So Terry Sejnowski and I got very excited about this rule because it's a local rule. It's local to a synapse, just the two units it connects. But it will do sort of sensible learning for one of these models. But it takes a long time, because you have to settle down to the equilibrium distribution, so I got bored. And what happens if you just do this? You don't go all the way to equilibrium. Well, it turns out the learning works just fine. Not quite as well as if you go to equilibrium. It's not maximum likelihood learning anymore. But it works pretty well. And it's certainly good enough for many applications. So now you've got an efficient learning algorithm. You take your data, you activate your feature detectors, you reconstruct the data from the feature detectors, you activate the feature detectors again, you take the difference of these statistics with the data and these statistics with the reconstruction, and that's your learning signal. Multiply that by some small learning rate and away you go. And so we take many batches of data, measure these statistics over a small batch and then update the weights, and then do it again and again and again. And that's how that model was learned. And now, as I already mentioned, if you can learn one layer of features like this, you can then treat those features as data and learn another layer of features. All of this without knowing any labels yet. And so you can learn lots of layers. We can prove that if the layers are the right size and they're initialized correctly, then every time you add another layer -- what we'd like to prove is every time you add another layer you get a better model of the data. And that's true when you add the second hidden layer: you do get a better model of the data. But as you add later layers, all we can prove is that there's a bound that improves each time. So it's conceivable that the model of the data will get slightly worse, but there's a variational bound which improves each time. So there's something that's improving. It's always encouraging when you're doing learning to know that there's something sensible that's improving.
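A minimal sketch of that shortened learning procedure, one step of contrastive divergence (CD-1), using the same conventions as the earlier sketches; biases are included here for completeness, and the reconstruction uses probabilities rather than samples, as discussed in the demo:

```python
import numpy as np

def cd1_update(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng()):
    """One CD-1 step: correlations measured on the data minus correlations
    measured on the one-step reconstruction, instead of running the Gibbs
    chain all the way to the equilibrium distribution."""
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    p_h0 = sigmoid(b_hid + v0 @ W)                       # hidden probs from data
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)   # binary hidden states
    p_v1 = sigmoid(b_vis + W @ h0)                       # reconstruction
    p_h1 = sigmoid(b_hid + p_v1 @ W)                     # hidden probs from recon
    W += lr * (np.outer(v0, p_h0) - np.outer(p_v1, p_h1))
    b_vis += lr * (v0 - p_v1)
    b_hid += lr * (p_h0 - p_h1)
    return W, b_vis, b_hid
```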
And in fact, the -- the probability of the data in all the cases we've seen actually gets better. But all we can prove is the variational bound. >>: So you said that the first layer is guaranteed to [inaudible] or is it just. >> Geoff Hinton: There's the following guarantee: when you add the second hidden layer and then start changing its weights, the log probability of the data will improve. >>: That's guaranteed? >> Geoff Hinton: That's guaranteed. When you start changing its weights. Of course you could change its weights and the log probability could improve and then it could actually go down again. But it will always stay above what the log probability was when you first added that second hidden layer. But we can't prove that for later layers. That's it for the math. I'm not going to go through the math. The math says you're doing something reasonably sensible. So then you throw away all the math, you violate all the conditions of the math, and you get on with it and you see what you can do. Now, once you've trained this model, these multiple layers, you can fine tune it. And for fine tuning it, backpropagation is very good. So the easiest way to do that -- it's not what I showed you previously -- the easiest way is you just train lots of layers, then you add 10 labels at the top with initially sort of random connections to the last layer, and then you just use backpropagation on that net. And that works much better than using backpropagation starting from random weights. >>: I thought that you do backpropagation using the same configuration early on. I think you -- >> Geoff Hinton: There's two ways to do it. There's two ways to do it, and I don't want to sort of confuse people any more than I have already. Yeah? >>: Are you locking the weights on everything except for the last layer? >> Geoff Hinton: Okay. The answer is no, we're training all the weights now with backpropagation, but the lower weights don't change much. So if this is a weight vector in the lower layer, when you do the fine tuning, it will sort of go like that. Now, so if you look at the feature detector, it doesn't really change at all. But if you go like that with a lot of feature detectors, you can move a decision boundary quite a lot. So what's happening is the unsupervised learning discovered all the feature detectors, and the labels don't have to be used to create feature detectors; they're just used to very slightly change them to get the decision boundaries in the right place. And so you don't need many labels. So the optimization view of this is: the greedy learning designs all the feature detectors and tells us what part of the weight space we should be in. When we first turn on the backpropagation, we'll have a big gradient there, and we'll go a small distance with this big gradient and then we'll trickle off, and we won't leave that region of the space. So where we are in this whole space is determined by the unsupervised learning, and the backprop is just fine tuning. And we get much better optima like that. [Inaudible] and his students have shown that if you compare starting from small weights and just using backprop with this, this gets you to a part of the space that you never get to if you start with small weights and use backprop. You just get very different solutions, and these solutions are much better. The other thing is the learning generalizes better.
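A self-contained sketch of one such fine-tuning step in NumPy, assuming the pretrained layers are handed over as lists of weight matrices Ws and biases bs (sigmoid layers) with a randomly initialized softmax label layer W_out, b_out on top; the names are illustrative, and as described above, all the weights get trained even though the pretrained lower ones barely move:

```python
import numpy as np

def fine_tune_step(x, y_onehot, Ws, bs, W_out, b_out, lr=0.001):
    """One backpropagation step through a pretrained sigmoid stack plus a
    softmax label layer: forward pass, softmax cross-entropy error at the
    labels, then gradients pushed back through every layer."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    # Forward pass, remembering the activities of every layer.
    activities = [x]
    for W, b in zip(Ws, bs):
        activities.append(sigmoid(b + activities[-1] @ W))
    logits = b_out + activities[-1] @ W_out
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Backward pass: error at the logits, then through the label layer.
    d_logits = probs - y_onehot
    d_hidden = (W_out @ d_logits) * activities[-1] * (1.0 - activities[-1])
    W_out -= lr * np.outer(activities[-1], d_logits)
    b_out -= lr * d_logits
    # Then back through the pretrained layers, top to bottom.
    for i in reversed(range(len(Ws))):
        grad_W = np.outer(activities[i], d_hidden)
        grad_b = d_hidden
        if i > 0:   # propagate the error signal before touching these weights
            d_hidden = (Ws[i] @ d_hidden) * activities[i] * (1.0 - activities[i])
        Ws[i] -= lr * grad_W
        bs[i] -= lr * grad_b
    return Ws, bs, W_out, b_out
```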
So we design all the feature detectors so as to model what's going on in the image, and not so as to get the right labels, and so we don't actually need many labels. The labels are just slightly tweaking things. So you won't overfit in these [inaudible], and in particular you can use this kind of learning when you have a huge dataset, most of which is unlabeled. Your huge dataset designs all your feature detectors, and then your few labeled examples can be used to fine tune things a little bit. And that's sort of the future, I think. So I just want to justify why this whole approach makes sense before I get on to the 3D examples. You can tell I don't have a very good justification, because I have to appeal to concepts that come from Rumsfeld. So the stuff. Stuff is what happens, right? And if you believe image and labels are created like this -- that, you know, stuff in the world creates an image and then the image creates the label -- then machine learning in the standard old-fashioned way of trying to associate labels with images is the right thing to do. >>: You mean discriminative learning? >> Geoff Hinton: Discriminative learning, yes. So that would be the case, for example, if the label was the parity. If this was a binary [inaudible] and the label is whether it has even or odd parity, then given the image, the label doesn't depend on this stuff. Everything you need to know about the stuff is in the image. But that's not what you really believe, at least not for most data. What you really believe is that the stuff out there in the real world gave rise to an image, and the stuff out there in the real world also caused someone to give a name to the image. But they didn't get this label from the pixels, they got it from the stuff. You know, it was a cat out there. I mean, because it was a cat out there, the name's cat. It's not because pixel 55 is orange or anything. Now, in particular, there's a very high bandwidth path there and a very low bandwidth path there. And in that situation it makes a lot of sense to use unsupervised learning to get from here to there. And we know that it's possible, because we do it, right? Little kids do it. They don't learn object categories by their mothers telling them the name of every object. Their mother points out the window and says cow, and in the distance there's a field with clouds and a river and a small brown dog, and they know what she's talking about. But the label information is terrible. It's only because they already have the concept of cow that they say, oh, that thing must be called a cow. >>: What's your [inaudible] for technical definition of bandwidth. >> Geoff Hinton: Okay. I'm not going to give you a more technical definition, but I'll explain it a little bit. If I show you a picture of the cow in the field, you can answer questions like, you know, is the cow standing up or lying down, is it a big cow or a small cow, which way is it facing, is it moving, is it brown or is it black and white? All these questions. So you have lots of information about the stuff from a picture. From the label cow, if I just say cow and then say what color is it, well, you're sort of out of luck. There's not much information here. In fact, the most information that could be here is minus the log of the probability of the word cow. Which is like 13 bits or something.
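(As a rough check on that number, which is illustrative rather than from the talk: 13 bits corresponds to the word having a probability of about 2 to the minus 13, roughly one in 8,000, since minus log base 2 of 1/8000 is about 13. That is the ceiling on what the label by itself can carry.)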
>>: So there's -- I was at a workshop recently on recognition, and one of the newer trends in the last two or three years is some people are starting to look at learning attributes of images as opposed to learning -- >> Geoff Hinton: Absolutely. And they get more information there. Yes. I agree. I completely agree with that. But let's come back to that in the question time. But anyway, this is the justification of what we're doing: that there's enough information in an image to figure out what's going on in the world. And once you've figured out what's going on in the world, then you're in much better shape for assigning labels to things. And it's silly to try and learn all of this stuff by backpropagating from here. You should learn all of this stuff by trying to understand what's going on in the image and then maybe just slightly fine tune it to get the right decision boundaries. >>: So then the difference between the [inaudible]. >> Geoff Hinton: The idea is features are meant to be a model of stuff in the end. We have this modeling material. It's like making a car out of clay, right? It's not like you really believe the hood of the car is actually made of clay; you just have clay, which is modeling stuff, and you can make anything out of it. We have these features and you can make sort of anything out of them. >>: That should be independent -- >> Geoff Hinton: That should be the stuff, yes. Okay. So the one thing I did recently, which is quite encouraging -- as I said, what if our labels aren't very good? So let's corrupt the labels. So I went through the training set and with probability .5 I made each label wrong. If it said it was a two, I made it one of the other nine labels with probability .5, and I just did that once, so it was really corrupted. It's not like you corrupt it differently each time so you can average it away. You just do it once. And now it turns out that if you have this kind of architecture, you have this pathway from the data that's saying what these hidden labels should be. You can infer from the noisy labels what they should be. It starts off with a confusion matrix that's roughly the identity but with a bit of off-diagonal noise. And after a while, what will happen is, if you show it a very nice two and you say it's a four, it will say rubbish, it's a two. I've got overwhelming evidence from here that it's a two. I just don't believe this. And then it will adapt its confusion matrix to say sometimes when it's a two the guy says four. And it turns out it can learn the right confusion matrix here and it can get very good performance. So with 50 percent of the labels wrong, it can get down to two percent error on both the training and test data. There's not much difference between training and test. In other words, it's corrected all the wrong labels. It just knows you're talking nonsense. It's like a very good student: you tell them the truth and they believe you; you tell them something false, and they don't believe you. And that's how they can get to be smarter than their advisors. Yeah? >>: [inaudible] slightly different about the problem, which is that -- which is fine, and I think it's good in the case that you're looking at, which is that in the unsupervised data there are distinct categories -- >> Geoff Hinton: Absolutely. They're really there, yes. >>: Because I was thinking of an alternative case where you might have documents, for instance, right, where many different labelings might be correct.
In that case, I think that wouldn't really help you very much, because the [inaudible] there wouldn't be sensible -- >> Geoff Hinton: This works -- well, when there really are natural categories and the labels are just giving it sort of noisy information, right, about what to call those categories. Yes? >>: [inaudible]. >> Geoff Hinton: Right. If you make the labels 80 percent wrong, it still gets only five percent error. So it really can cope with very -- then if you ask, well, how much is one of these cases worth, roughly speaking you compute the mutual information between the label and the true class, and that tells you how much your case is worth. So here it's .07 bits and here it's .33 bits. So these are worth sort of 50 times less than those, but they're still worth quite a lot. >>: [inaudible] 50 percent able but [inaudible] the other 50 percent correctly [inaudible]. >> Geoff Hinton: Then you'd be much better off. Because then you've got much more mutual information. But it's -- if I tell you the label's right in this case and it's wrong in that case, I'm telling you a whole lot. Just throw away the cases where it's wrong. But if I don't tell you which are the right cases, there's much less mutual information. Notice 20 percent of these are right. There's only a 50th as much mutual information as with these perfect labels. And it seems to me the mutual information is showing me how much you can get out of a training case, which is nice. >>: [inaudible] matrix there is really just a matrix that [inaudible]. >> Geoff Hinton: Yes. It's a 10 by 10 matrix, and you learn how often, when it's really a two, the guy says it's a four. >>: [inaudible] the label [inaudible]. >> Geoff Hinton: Obviously it could do a random permutation. And so that I can interpret it, I initialized it with a [inaudible] matrix so you would have the obvious correspondence. That's not necessary. >>: What results did you find when the labels were completely random. >> Geoff Hinton: I didn't do it when the labels were completely random. But what should happen is it says the noisy labels aren't telling me anything. Anything could happen. But it still uses those hidden labels for natural categories. Now, in this data it's not clear that four and nine are naturally different categories. There are very, very similar fours and nines in many cases. For categories like one and zero, it will use one of the labels for one, another label for zero, for the very obvious categories. Okay. Now I'm going to talk about new stuff. So [inaudible], who made the [inaudible] database, also made a database for doing 3D object recognition that's carefully controlled. There's five classes. For each class there's five training examples, and these training examples are photographed with lots of different lighting conditions and viewpoints. Many, many viewpoints and many lighting conditions. But you can't just remember the training data, because the test items are different animals. So one of the test animals is a stegosaurus. So you have to generalize from knowing that these are animals to knowing that a stegosaurus is an animal that's not a plane. Okay. One of the classes is a bit problematical. It's humans. They were purchased as toys in the US, and every single one of them is holding a weapon. So the concepts of human and holding a weapon are the same. >>: [inaudible]. >> Geoff Hinton: What? Sorry? >>: [inaudible]. [laughter]. >> Geoff Hinton: Okay. So the first problem we have with this is that it's high dimensional.
It's two 96 by 96 images, which for machine learning is quite a lot of dimensions. What we did is we take all the pixels around the edge and we make them much bigger, so we've got fewer of them. In a sense we're giving it some knowledge that the stuff around the edges isn't so important. So we're giving it a bit of knowledge there. That way we get it down to 8,000-dimensional data. We make that data zero-mean, unit-variance. And then we have to face the problem that in images of digits, ink is really binary stuff, and so you can get away with binary variables. In real images you can't get away with that. You can't really represent a real pixel by the probability of a binary variable. Real pixels have the property that, given my neighbors, I've got a very sharp distribution: I'm almost certainly the average of my neighbors. You can't do that with a binary unit. So how are we going to model these real-valued things? What we do is we adapt the Restricted Boltzmann Machine to use a different kind of visible unit. It's got a different energy function, which is this. But I'm going to sort of tell you what that energy function is. We say each visible unit is going to have an independent Gaussian noise model. So if you take the negative log probability of a Gaussian you get a parabola. So in terms of energy, which is negative log probability plus a constant, the unit would like to be sitting around its mean, which is its bias, and it costs to go away from that. Then if you look at the top-down input that it gets from the hidden units when you're going to do reconstruction, or when you are running the model by itself to produce fantasies, the top-down input from the hidden units looks like this, and we can factor in the state of the visible unit. The energy contribution is like this: the state of the visible unit times the sum over all hidden units of the state of the hidden unit times the weight on the connection. So if we differentiate with respect to the state of the visible unit, we'll get this thing, which gives us this slope here. So the top-down input has the effect of saying it's better off to be here than there, and you win linearly as you go in this direction. So now if you take a parabolic containment and a linear thing from the top-down input, then the minimum will be where the gradients are equal and opposite, about here. So the hidden units have the effect of moving this parabola over. So that's the model. And using visible units like that, it's a bit of a bad model of images, because we're assuming the pixels are uncorrelated given the hidden units. But we'll come to that later. What Vinod did was train this up. This is the stereo pair, these sort of fuzzy edges. He trained up a Restricted Boltzmann Machine with 4,000 binary hidden units and 9,000 visible Gaussian units, and he just trained this up and then we never changed those weights. He didn't try to change those at all. He just trained them up. So that's pure unsupervised. And then the architecture that worked best here was to say, instead of trying to go from here to labels, what we'll do is we'll learn five different models, five different density models of these 4,000-dimensional vectors, but we're going to learn them in a special way. Each of these models is going to be trained on data from its own class. So you have to know the labels for this top level, and so only this is pure unsupervised. It's going to be trained as a generative model to produce data that looks like its own class.
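A sketch of that energy function with Gaussian visible units and binary hidden units, assuming unit-variance data as in the talk; the second helper makes the "moving the parabola over" point explicit (names are illustrative):

```python
import numpy as np

def gaussian_visible_energy(v, h, W, b_vis, b_hid):
    """Energy with real-valued visibles: a parabolic containment around each
    visible unit's bias (the Gaussian noise model) plus the linear top-down
    term contributed by the active hidden units."""
    parabolic = 0.5 * np.sum((v - b_vis) ** 2)   # unit-variance Gaussian term
    top_down = -(v @ W @ h)                      # slope added by the hidden units
    return parabolic + top_down - b_hid @ h

def visible_means(h, W, b_vis):
    """Where the shifted parabola bottoms out: the bias plus the top-down input.
    This is the conditional mean of each pixel given the hidden states."""
    return b_vis + W @ h
```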
But in addition, we're going to do discriminative training of all five classes. So there's a quantity called the free energy, which is a measure of how much a model likes the data. And what you do is, you do the discriminative training to try and make the correct model have lower free energy than the other models, at the same time as you're trying to make whatever model is appropriate for this data be a better model of that data. And it turns out in the end you use five times the discriminative gradient plus the generative gradient. The generative gradient is much bigger because there's 4,000 things to be explained here, and it's a one-of-five choice that the discriminative thing has to explain. And so you use a validation set to figure out that five times the discriminative gradient plus the generative gradient is a good thing to use. And you train it up. Each of these models has 4,000 times 4,000, 16 million weights. This has 9,000 -- 36 million weights. Overall we've got 116 million weights. And it's trained on only 24,000 labeled cases. That's the training set. So you'd have thought it would overfit like crazy, but it doesn't. Now, if you ask how many pixels there are here, there's 200 million pixels. So at least you haven't got more parameters than pixels, which is good news. And when you're explaining images, each image gives you much more constraint than the label. And so you need far fewer training cases, even if they're unlabeled. And so the fact is we could train 100 million parameters on only 24,000 images without seriously overfitting. >>: [inaudible] did you use this Gaussian unit or did you [inaudible]. >> Geoff Hinton: The Gaussian unit. >>: Gaussian unit. So this one could be better. >> Geoff Hinton: This would be better, yes. >>: [inaudible]. >> Geoff Hinton: Yeah. I think. >>: [inaudible]. >> Geoff Hinton: I should say something about this. The amount of computation you need to do here to do the discriminative learning is linear in the number of classes. So if you had 183 classes, it would be quite a bit of work. For five classes it's easy. For natural object recognition, where there are maybe 50 classes, it's -- you don't want to do that. But here's the results. If you take support vector machines -- so this is the NORB without a cluttered background; with a cluttered background it's not very [inaudible] can't do it -- they get 11.6 percent. So take that as the sort of machine learning standard. If you take convolutional neural networks, which are told about the structure of pixels, they get six percent. They're sort of the record holders so far. If you take this, our method, which isn't told about the structure of the pixels, it gets almost as good as the convolutional neural networks. If you give it extra unlabeled training data by translating the images but not telling it the labels -- this is just to show what extra unlabeled data will do for you -- it does considerably better than convolutional neural nets. A convolutional neural net done the standard way wouldn't be able to use this extra unlabeled data. So this is just an indication that unlabeled data is going to improve this quite a bit. >>: [inaudible] labeled data and [inaudible]. >> Geoff Hinton: Indeed. We happen to know the labels for this because we adjusted it. And that will help both methods, right. A convolutional method that I -- yes. >>: [inaudible]. >> Geoff Hinton: Yes. >>: [inaudible]. >> Geoff Hinton: It deals with that. >>: So somehow that kind of [inaudible] has to be done [inaudible].
>> Geoff Hinton: So the hidden features are coping with that. >>: Okay. So that [inaudible] lower or is it done by the -- >> Geoff Hinton: We don't really know. In this network, there's just this hidden layer and then the more specific things. So this hidden layer has to have binary units that are coping with that lighting variation or viewpoint variation. >>: Is there any way to examine whether [inaudible]. >> Geoff Hinton: You can look at the receptive fields of these guys. I'm not going to show you those now. I want to get on to much more recent stuff. Well, that was recent stuff too. But the stuff I'm excited about at present, because we did it last week, is making the Restricted Boltzmann Machine model a lot more powerful. So we've got this idea that you can learn this Boltzmann Machine, you can stack them up. But we know that in vision you're going to need to do multiplies. One signature of multiplies is heavy-tailed distributions: you can get heavy-tailed distributions by multiplying things together. Take Gaussian things and multiply them together, you get heavy tails. And we know vision is full of these heavy tails. But anyway. We want things with multipliers because we know we're going to need them. And one thing for which you need a multiplier is this. Up until now, when you run the generative model to generate data, you have active features in one layer and they're giving biases to the features below. But given the features in one layer, the features in the layer below were conditionally independent. Wouldn't it be nice if a feature in one layer could specify an interaction between features in the layer below? It could specify a covariance. So this feature can say these two should be highly correlated. For example, suppose I had a feature that was a vertical edge. If I ask you what is a vertical edge, you might start off by saying, well, it means it's light here and dark there -- well, maybe dark here and light there, I think that was the same actually -- or maybe it's a stereo edge or maybe it's a motion edge or maybe it's a disparity edge, a texture edge. Those all seem to be completely different definitions of an edge. What do they all have in common? They all have the following in common: if you believe there's a vertical occluding edge here, you shouldn't interpolate this way. However you're going to do the interpolation, what a vertical edge means is don't do it this way; do it this way, but don't do it that way. A vertical edge is an instruction to turn off some direction of interpolation. So you can think of it as: your default is that in an [inaudible] image, things are very smooth and local interpolation will work very well. You have very tight covariances. If I give you this pixel, it's almost exactly the average of its neighbors. But occasionally you want to turn that off. That's what a vertical edge is. But you don't want to turn it off everywhere, you want to turn it off in this direction. Okay. So how are we going to do that? What we're going to do is we're going to say, let's just take two pixels. Here's two pixels. We're going to have a linear filter that looks at this pixel. It's going to be learned. And a linear filter that looks at this pixel. We're going to square the output of that linear filter. And then we're going to use that squared output in an energy function by putting a weight on it. So suppose this filter learns this unit vector and this filter learns this unit vector.
If I put a big weight on the squared output of this filter -- sorry, a big weight on this filter, this needs to be big -- that corresponds to saying it costs a lot to go in this direction. So if I start off at zero, which is cheap, and then say it costs to go in this direction, that corresponds to a Gaussian which is sort of sharply curved that way. If I put a small weight on the output of this filter, making this one small, that's a Gaussian that's gently curved this way. So each of these linear filters you can think of as causing a parabolic trough in the energy. When I add up all these parabolic troughs I'll get an elliptical bowl, and so I can synthesize the precision matrix of a Gaussian, the corresponding energy function, by squaring the outputs of linear filters and putting weights on them. But now I can do something much nicer. I can say I'm going to have these linear filters, but instead of just always adding them up, I'm going to put an extra hidden variable in here that says whether I want to use this one or not. And that will allow me to modulate the covariance matrix -- or rather, the inverse of a covariance matrix, which is a precision matrix. So I can have this precision matrix. I'm going to build it up out of all these components that are parabolic troughs, and I'm going to decide which ones I use. In speech recognition now I believe they use 500,000 diagonal covariance matrices. I'd like to replace that by one covariance matrix, but you synthesize it on the fly. It's full covariance, but you build it out of these little one-dimensional pieces so it's appropriate for the current data. So here's how we're going to do that. >>: On the other hand, I mean, there is so much data available we have [inaudible] so it's just [inaudible]. >> Geoff Hinton: Well, then you could certainly afford to train this. So I'm going to learn these parameters, learn the direction of this linear filter. I'm then going to put -- because this is squared, I insist this weight be negative. It's very important that this weight be negative. I have a big positive bias for this hidden unit. So this hidden unit is going to spend most of its life being on and contributing some strong term to the precision matrix. Once you start violating that by going off in the direction it doesn't like, it's going to say whoa, you lose. You were originally winning by plus B -- that's this [inaudible] here in negative energy -- and you're going to lose parabolically. That's the Gaussian. As long as this [inaudible] is firmly on, you get this bit of the curve. But once you start losing a lot, you say, oh, maybe this constraint doesn't apply anymore, maybe we don't want smoothness here. And that's this turning off of the hidden unit here. Once this guy's turned off, all of this is just irrelevant. It doesn't apply. So that's got a very heavy-tailed flavor. If you want a t distribution, you can build it up by taking a product of several of these. You can approximate it very well. So the idea is we need more hidden units than there are dimensions, quite a few more, so that we really do get a full covariance matrix that's not unconstrained in some directions. And then a few of them are going to be switched off to represent violated constraints. Yeah? >>: So since you're not -- you're not using a convolution, you're not encoding spatial information, so literally it's the correlation matrix of arbitrary pixels -- >> Geoff Hinton: That's what it will learn, yes.
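A sketch of how such a precision matrix can be synthesized on the fly from gated, squared linear filters. This is a simplification, assuming one gating hidden unit per filter (in the full model described later, each gating unit connects to several filters), and the names are illustrative:

```python
import numpy as np

def synthesized_precision(C, P, h_cov):
    """Build an image-specific precision (inverse covariance) matrix out of
    parabolic troughs: each linear filter C[f] contributes an outer product
    c_f c_f^T, weighted by P[f] (constrained to be negative) and gated by the
    binary covariance hidden unit h_cov[f] (1 = constraint active, 0 = off)."""
    # C: (num_filters, num_pixels), P: (num_filters,) with P < 0,
    # h_cov: (num_filters,) binary gates
    weights = -(P * h_cov)          # flip the sign so contributions are >= 0
    return (C.T * weights) @ C      # sum_f weights[f] * outer(C[f], C[f])
```

Turning a gate off simply deletes that trough from the sum, which is the "this smoothness constraint doesn't apply here" behavior described above.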
>>: So technically what you're learning is not necessarily like edges in the sense -- >> Geoff Hinton: Well, it will turn out it will learn edges, just because that's where the structure is. >>: But wouldn't it have other like random stuff, just like if two -- >> Geoff Hinton: Absolutely. But if you average over a bunch of images, two distant pixels just won't be correlated. There's a very strong -- if you look at correlations over images, they fall off rapidly. So it will learn what's going on in the data, which is these local things. Also, we're actually going to learn on patches that are smaller than the whole image here. We are going to go a bit convolutional here. >>: [inaudible]. >> Geoff Hinton: Right. So this is ideal for that. Because what it will say is, normally you expect a [inaudible] coefficient to be the average of its two neighbors in time. But just occasionally that's not at all true, and it's very not true, so turn off that constraint altogether. Don't say we'll pay a penalty and have them be very different. Say it just didn't apply. Pay a fixed penalty and then we can have a burst. >>: This depends on the label. If we know that it's [inaudible]. >> Geoff Hinton: Right. But when you're going the other way, you want to detect that thing, and it will help you predict the label. So you want to detect the smoothness violation. Okay. So now the pixel intensities are no longer independent given the states of the hidden units. So we can't do a reconstruction by just separately activating each pixel. We have to do a sort of joint reconstruction. Life's got a lot more difficult. The only place it's gotten more difficult is in the reconstruction. The hidden units are still independent given the pixels, which is nice. And we're actually going to have some hidden units to represent these violated smoothness constraints and other hidden units to represent means. They're all completely independent given the data. So inference is really easy. In the end we want it for inference, and that's still simple. But for doing the learning, we need to reconstruct. And when we reconstruct, we've got these correlations back that we have to deal with. And the correlations are different for every training example. So it's not like there's one inverse covariance matrix that you invert to get the covariance matrix and then you can sample from it. You'd have to do that separately for each training example, which would be too expensive. So we're going to use another method called Hybrid Monte Carlo. In Hybrid Monte Carlo, you can integrate out all those hidden units, these switching-off hidden units, and compute something called the free energy. And then you can get the gradient of the free energy with respect to the activity of one pixel, given the states of all the other pixels. And so now you can start at your data point. You can look at that gradient, and you can follow that gradient, but with the appropriate level of noise. And the way you do that is you start with some initial random momentum of the data and then you simulate a particle travelling over that free energy surface with that initial random momentum. In other words, the gradient of the free energy is used to accelerate the particle, and how far it moves depends on its velocity. And there's a numerical trick called leapfrog steps which makes the approximation good to second order. And you do all that. I'm not going to go into all that.
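For reference, a rough sketch of those leapfrog steps, with grad_free_energy standing in for whatever computes the gradient of the model's free energy with respect to the pixels; the usual accept/reject test is omitted, and the names are illustrative:

```python
import numpy as np

def hmc_reconstruction(v_data, grad_free_energy, n_steps=20, step_size=0.01,
                       rng=np.random.default_rng()):
    """Hybrid Monte Carlo sketch: start at the data, give it a random momentum,
    and simulate a particle rolling over the free-energy surface using leapfrog
    steps (accurate to second order)."""
    v = v_data.copy()
    p = rng.standard_normal(v.shape)              # random initial momentum
    p -= 0.5 * step_size * grad_free_energy(v)    # first half-step for momentum
    for step in range(n_steps):
        v += step_size * p                        # full step for the position
        if step < n_steps - 1:
            p -= step_size * grad_free_energy(v)  # full step for the momentum
    p -= 0.5 * step_size * grad_free_energy(v)    # final half-step (for symmetry)
    return v
```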
But essentially what this means is: given the hidden units, we're going to start at the data and run this Hybrid Monte Carlo for 20 steps to get a reconstruction, and that's going to move it away from the data in the direction the model likes. And then the learning is going to say: don't do that, stay with the data, don't go after things you like more. So you unlearn on wherever you got to and learn on the data. And when you have one set of units for modeling the covariances and another set for modeling the means, we call that the mean and covariance RBM. It's a generalization of this. You can throw away the Gaussian containment term because that's going to be handled by the covariance part. So the mean units look just like the units we were using before, but without that Gaussian containment. And the covariance units are giving you the precision matrix. So the covariance units are assuming everything is zero mean, but they're modeling the covariance, and the mean units are saying: I have no idea what the covariance is, but I'm modeling the mean. And so you can think of the mean units as putting a slope like this in the energy function, the covariance units put in this parabolic bowl, and now you find the minimum of this bowl plus this sloping thing, and that's where you want to go. That's the mean of what the reconstruction should be, and the Hybrid Monte Carlo gets you some distance in that direction.
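A small sketch of the "slope plus bowl" picture just described, under the simplifying assumption that the covariance units contribute a precision matrix C diag(w * h_cov) Cᵀ and the mean units contribute a linear term W_mean @ h_mean; the exact energy function and parameter names of the model may differ from this.

```python
import numpy as np

def reconstruction_target(C, w, h_cov, W_mean, h_mean, eps=1e-3):
    """Minimum of the sloping term (mean units) plus the parabolic bowl
    (covariance units):  v* = Lambda^{-1} (W_mean @ h_mean).

    A small eps * I keeps Lambda invertible in this toy sketch.
    """
    Lambda = C @ np.diag(w * h_cov) @ C.T + eps * np.eye(C.shape[0])
    slope = W_mean @ h_mean
    return np.linalg.solve(Lambda, slope)
```

The Hybrid Monte Carlo reconstruction only moves part of the way toward this target, with the appropriate amount of noise.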
So this is the dataset we're going to use. It's called the CIFAR-10 dataset, because CIFAR paid for us to get the labels. It's based on the MIT tiny images dataset, which is 80 million 32 by 32 images they got from the web by searching with particular search terms. If you search for the term cat, about 10 percent of the images you get are a nice image of a cat, and most of them don't have a cat in them at all. So they're very unreliable labels, and we're ultimately interested in learning from unreliable labels. But to begin with, we wanted a reliable label set. So we got people to go through all the images that were found with the term cat, and they had to answer the following question: Is there a single main object in the image? And if there is, is there a reasonable chance that if you were asked to name it you would say cat? Okay. So these guys all satisfy that. I have [inaudible] the cats, yeah. They're very low resolution images, which is not good. But again, you have to be realistic. I think we need a trillion parameters. We can only learn about 100 million at present, so we had better simplify the task somehow, and we're going to simplify it here by using low resolution. We'd love to do higher resolution, but not just yet. But notice -- look at bird here. You have a chicken, a close-up frontal view of an ostrich's head, various other birds that I don't know the names of, and an ostrich's head from some distance away that looks maybe a little bit like a prairie dog. I don't know what that is. So I can identify nine of those ten birds, I think. >>: [inaudible] label? >> Geoff Hinton: These are manually labeled, yes. Someone went through and said all of these are reasonable examples of a bird: I might reasonably have said bird if I was asked what that is. Deer is a particularly bad category. But these are real objects, and this is the real kind of variation you'd like an object recognition system to deal with if you had already solved the problem of focussing on one region where there was an object. This I think everybody will agree is a sort of tough database to do recognition on. >>: [inaudible]. >> Geoff Hinton: Okay. We've got one set where there are 10 classes and 5,000 training examples of each. We've got another set where there are 100 classes and 500 examples of each. And they don't overlap, so you can use one set as negative examples for the other set, and they're guaranteed negative examples for the other set. >>: They have [inaudible]. >> Geoff Hinton: We didn't do that. What we did do was the student in charge of it all went through afterwards and just checked that all of them were okay. So you won't find any really glaring errors in the labeling -- well, not more than a few. So here's the way we applied our learning to that. For learning these adaptive covariance matrices, we're trying to do it now on the whole 32 by 32 image, but to [inaudible] we did it on eight by eight patches, because we only started doing this two weeks ago. So we trained on these eight by eight patches. We learned 81 hidden units for modeling the means and 144 hidden units -- the guys that turn on and off -- for modeling the covariances -- I've only got five more minutes -- and those 144 units that turn on and off are actually using 900 of these squared linear filters. So each of them is connected to several linear filters, not just one like I showed before. And you end up with 11,000 hidden units, most of which are modeling the covariances and some of which are modeling the means. This is what the filters that learn the means do: they learn very blurry things. But if you pick sort of 15 or 20 of these to be on, you can synthesize roughly the colors of any regions you want. So think of them as like watercolors, where all you're doing is sort of covering regions. They're not really telling you much about where the edges are. But that's fine, because the other guys are going to tell you where the edges are, and then when you reconstruct, you're going to have to color in the region without any sharp discontinuities where the edges aren't, because you've got high covariance there. So what do the other guys get? They get a completely different kind of information. They get sharp filters, and they break into two completely distinct classes. There are guys who learn to be exactly black and white even though they're looking at the RGB signal -- [inaudible] to make it more human-like they see RGB -- early on in the learning they'll be colored, and then they learn to be exactly black and white. They really, really are very, very well balanced. And then there are other guys who learn color-opponent filters, and they learn to exactly ignore intensity. So it just splits into those two sets. >>: [inaudible]. >> Geoff Hinton: These are PNG images. I was worried about that too, yes. I'm fairly sure it's not that, because PNG is lossless. I mean, the PNG was applied to things that were JPEG, but I don't think it's JPEG [inaudible]. >>: [inaudible]. >> Geoff Hinton: I can give you a reason why this might happen. We still have to do the test, which is: if you look at edges, some are internal to an object, and they tend to have an intensity contrast but no color contrast because it's the same material. And then others are occluding edges, and they have both intensity and color contrast. So a lot of the edges you see have no color contrast, and that's a very good way to tell that something's not an occluding edge. Almost always if there's a color contrast, it's an occluding edge, or it's an edge between two different materials, like in your T-shirt or something -- two different kinds of stuff.
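For reference, here is my rough bookkeeping of the patch-model sizes quoted a little earlier; treat the exact shapes as illustrative rather than official.

```python
# Rough bookkeeping of the patch-model sizes quoted above (my reading of
# the talk; the exact shapes are illustrative, not official).
patch_pixels = 8 * 8 * 3          # color 8x8 patch -> 192 visible dimensions
n_mean_units = 81                 # hidden units modeling the means
n_cov_units = 144                 # gating units modeling the covariances
n_filters = 900                   # squared linear filters pooled by the gates

C_shape = (patch_pixels, n_filters)        # the linear filters
P_shape = (n_filters, n_cov_units)         # which filters each gating unit pools
W_mean_shape = (patch_pixels, n_mean_units)
# Replicating this patch model over positions of the 32x32 image is, as I
# read it, what gives roughly the 11,000 hidden units quoted above.
```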
>>: So this is [inaudible] receptive fields. >> Geoff Hinton: This is the receptive fields of those linear filters whose squared output goes to the units deciding whether that constraint applies or not. And if it gets a big output, it says: you don't apply, there's no smoothness here. >>: So the way I see all these edges there, is it -- does it show that [inaudible]. >> Geoff Hinton: There will be edge detectors. >>: [inaudible]. >> Geoff Hinton: Yes. Now, you notice they're clustered, and that's because we formed a topographic map by a little trick. I'll show you the 1D trick for a topographic map. You lay out all your linear filters in a row, you lay out all your hidden units in a row, and then you have local connectivity between your linear filters and your hidden units. So those two linear filters both connect to this hidden unit and to this hidden unit, so they have something in common: they tend to go to the same hidden units. And if you go to the same hidden units, it pays to use similar filters. Because if one of you fires, you pay a penalty, but if the other one fires too, you don't pay any more penalty, because the hidden unit's already turned off. So if you're going to pay these [inaudible] heavy penalties, you want the guys that fire together to all go to the same hidden unit. And that causes it to form, in this case, a 1D map. If you do this in 2D, you get a 2D topographic map. So now we're going to ask how well it does on that CIFAR-10 dataset compared with just modeling the means, for example. If you just model the means, even using many hidden units to do that, you get just under 60 percent correct. If you only model the covariances, you do better. If you model the covariances but use lots of these linear filters, you do quite a lot better; one linear filter per hidden unit does better than this but not as well as that. If you now use both of these -- the means will learn to be a very different thing if you're also modeling covariances -- you do even better. And if you then take these 11,000 hidden units and you just do greedy learning -- these are all binary now, right, you take their probability values as your data, you greedily learn a set of 8,000 units, and then you attach your labels to those; in all of these we just do a logistic regression on the final layer -- you do even better. So that says this is a good input for this greedy layer-by-layer learning. And we're up above 70 percent. >>: [inaudible]. >> Geoff Hinton: This is sort of a new thing, so really it's only good for comparing these, but it's [inaudible] -- convolutional neural networks, for example, done by one of Yann LeCun's best students when I first [inaudible] lab, don't work very well at all on this. There are some variations we could do, obviously. Don't use a [inaudible] on the top; fit 10 different models -- that should do better. Learn a single mean and covariance RBM on the whole 32 by 32 image -- that should do better. And there are also some other variations I'm not going to go into. So I'm done. [applause].
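Here is a minimal sketch of the 1D topographic-map trick described above: a pooling matrix in which each gating unit connects only to a local window of linear filters, so neighbouring filters share gates and are pushed to learn similar receptive fields. The function name, the evenly spaced centres, and the fixed window width are assumptions for illustration, not the model's actual connectivity scheme.

```python
import numpy as np

def local_pooling_matrix(n_filters, n_hidden, width=3):
    """Lay filters and gating units out in rows and connect each gating
    unit to a small local window of filters.  Doing the same thing on a
    2D grid would give a 2D topographic map.
    """
    P = np.zeros((n_filters, n_hidden))
    centres = np.linspace(0, n_filters - 1, n_hidden)
    for j, c in enumerate(centres):
        lo = max(0, int(round(c)) - width)
        hi = min(n_filters, int(round(c)) + width + 1)
        P[lo:hi, j] = 1.0          # this gate pools these neighbouring filters
    return P
```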
>>: [inaudible] talking about the pixels [inaudible] people have all these delightful high level features they use. [inaudible]. It seems like you're spending a lot of time reproducing raw pixels [inaudible]. You're sort of crippling yourself, right? You're not using sort of the most advanced feature technology. >> Geoff Hinton: There are really two motivations. One is I would like to understand more about how the brain might be doing that low level processing, what the objective functions for learning are in doing that low level processing. And I think this idea of learning adaptive precision matrices [inaudible] the idea. Because the architecture that comes out of that is exactly the simple-cell/complex-cell architecture, but you reinterpret that architecture as how a generative model would get itself an adaptive precision matrix. So I think that's already something of interest for people trying to understand what the simple-cell/complex-cell architecture is good for. But I sort of agree with you that we should try this higher up. The other thing is I would like to show that if you learn several layers like this, you get SIFT features. SIFT features were originally kind of motivated by David Lowe thinking about how the brain might do it -- I mean, that was some of his background motivation -- and then some very good engineering to get a sensible feature. I want to [inaudible]. So most of the motivation for not starting from SIFT features is that I want to understand where they come from in the brain. But if you just want to do vision, well, you could start from SIFT features; my hope would be that this would learn better things than SIFT features -- SIFT-like things. So these squared [inaudible] filters are oriented energy, which is one of the inputs to a SIFT feature. But instead of putting them together in the way David Lowe thought best, I'd like to learn all that, and presumably it will do better. I'm quite sympathetic to you. If I just wanted to discriminate objects, I'd start by doing certain things like [inaudible] processing and stuff like that, maybe a bunch of Gabor filters. Although notice that we learn on a 32 by 32 image patch, or image in this case, and learning the filters is much better than just putting Gabors down everywhere, because it will learn to use a limited number of filters to cover the space. As soon as a Gabor filter has a scale and an orientation and an elongation, stuff like that, you don't know how to tile the space with a reasonable number of filters. And this will do a very good job of tiling the space with only 10,000 filters. So that's another advantage over putting them in by hand. >>: One of the reasons presumably we have multiple frequency bands for scale space is that we can never assume that the object is already, you know, at a known scale and a known location, right? >> Geoff Hinton: Right. >>: For something like the tiny images database, you pretty much have the scale invariance taken out, right? Because you were [inaudible] an object to be sort of [inaudible] fill the middle of the image. >> Geoff Hinton: Plane, plane. >>: Okay. >> Geoff Hinton: I mean, no, maybe you just guessed this one wrong. >>: I didn't know that was a plane. >> Geoff Hinton: In a 10-way choice -- sky means a plane or a bird -- I mean, it's remarkable it can get 70 percent. And a lot of that is of course some -- for example, in the confusion matrix, it confuses animals with animals, and -- except for birds and planes, which it confuses because of the sky -- it confuses trucks and ships and cars. So the context is doing quite a lot of the work. But there is quite a lot of variation in size. I mean, look, that and that, they're the same thing at different scales.
So we have some protection against that. But I agree, that's presumably why the visual system has these different scales, because it [inaudible] what scale it wants to look at. Yeah? >>: So I was wondering, if you look at all the filters you have, like the [inaudible], how well do they work across orientations? Like what is the randomly [inaudible]. >> Geoff Hinton: Pretty well. Of course, in real data vertical is special. Horizontal is not so special, because horizontal in the world doesn't come out as horizontal in the image, whereas vertical in the world comes out as vertical in the image. So let's go back and look. And also the pixelation of the image is a rectangular pixelation; that causes a difference. Sorry, I'm going the wrong way. I want to find those filters again. Those guys. So here are vertical ones, here are horizontal ones that way, and diagonal ones that way. And on an eight by eight patch, that's about all you can do. We've also done these on 16 by 16 patches, and then you get lots of nice orientations. On eight by eight the diagonal ones are bound to be a bit pixelly. >>: So if you take off [inaudible] earlier version that -- >> Geoff Hinton: That's the same thing without the covariance. >>: This Gaussian again [inaudible]. >> Geoff Hinton: The Gaussian is linear with Gaussian noise. So this is with only the covariance, and this is when you combine the covariance and the mean stuff, and you see there's a 10 percent difference. >>: So presumably for speech, if we ever do that, we'd actually scrap all this extra stuff? >> Geoff Hinton: Yes. You don't want to use that. You want to use the spectrogram, I think. >>: So I'm thinking, similar to a couple of questions asked earlier: if you were to concatenate the [inaudible] features with the image, [inaudible]. >> Geoff Hinton: Yes. But I would concatenate them at some fairly high level. I would extract some features and then add in the SIFT features as well: here are the ones I learned, here are the ones David Lowe designed. Probably between them they can do better than either alone. >>: So it's just like if you undo these percentages from [inaudible] and they started with special whatever they somehow [inaudible]. >> Geoff Hinton: Right. You could certainly use more [inaudible] and things you've extracted from the spectrogram. From the spectrogram you can get little harmonic stacks and things that are lost by the [inaudible]. And both together should be better than either alone. >>: [inaudible]. >> Li Deng: Thank you very much. >> Geoff Hinton: Okay. [applause]