>>: So today it's our great pleasure to have Geoffrey Hinton come back to our group again. Last year he spent about one week with us, and we got a tremendous amount of progress on some of the research that people working on speech have been pursuing afterwards. So today you are going to hear some even more exciting ideas from him. I'm not going to give an introduction except to let you know that some of the things he's doing are very exciting, and I'm going to give the floor to Geoff. >> Geoffrey Hinton: Thank you. First of all, I'd like to thank the Speech Group for giving me some money and also Eric Horvitz for giving me some money. That was all very nice. I'm always open to more. [laughter] I'm going to spend about a quarter of an hour explaining what deep learning is, and many of you here know this stuff already, so you should wake up after about a quarter of an hour. Then I'll spend about a quarter of an hour describing how it's applied to speech, and about a quarter of an hour describing how to apply it to generating images that look like natural images, which nobody seems able to do. And in the very last part of the talk I'll describe the amazing model produced by two students, James Martens and Ilya Sutskever, that takes a string of 500 million characters from Wikipedia and learns how to predict the next character. And it does an amazing job of generating text after that. It does a job that's much too good and I don't understand it -- I kept thinking there must be a bug, it must be cheating, but we'll come to that at the end. So back in the 1980s people invented the back propagation algorithm. You have multiple layers of neurons, and you can learn the weights on the connections, and you can learn weights like this that determine what kind of a feature detector that is and what kind of a feature detector this is. We thought we could now solve everything because we have multiple layers of nonlinear feature detectors. It was quite good for quite a few things, like reading checks, for example. But it had some serious limitations, and the main limitations were these: It required lots and lots of labelled training data, which back then was very hard to get. The learning time didn't scale well for deep networks, and deep networks were the most exciting ones, because they could learn lots of nonlinearities. And it could get stuck in poor local optima, which might be quite good, but it turns out you can find much better ones if you learn in a slightly different way. So we're going to try and overcome the limitations of back propagation by learning a generative model instead of a discriminative model. That is, instead of learning to produce a label given an image, we'll start by learning to model images. We'll try to build a multi-layer generative model that will have lots of features in it for generating images. Those features which learn to be good at generating images we will then use for discrimination. Many of them will be irrelevant, but the ones that are relevant should be very good. It turns out that's a much better way to do discrimination. In a sense, we'll try to learn to do computer graphics before computer vision. And it turns out that gives you the best models. Now we're going to make a bunch of apparently bizarre decisions. The historical justification for them is that I gave up on this problem and went and worked on something else, then discovered the other thing I was working on was a solution to this problem.
So we're going to learn a generative model, and the model has one layer of hidden variables, which doesn't seem very good if you want to learn deep networks. What's more, we're going to make it an undirected generative model; that is, it's going to have hidden variables, which are features, and visible variables, which will be pixels, but the relation between the two is symmetric. So pixels cause features but also features cause pixels, and there's nothing really to distinguish the features and the pixels in the model. And we're going to make everything binary. They'll all be Bernoulli variables, which seems like a bad idea if you want to deal with the real world. We'll make these three decisions and gradually unpick them. So the units we're going to use are these stochastic binary units. Each unit gets a total input that is its bias plus the states of the other units times the weights, and it outputs 1 or 0 probabilistically. And we're going to arrange them in a bipartite graph where you have visible units, where you put your observations, and hidden units, where you're going to learn features, and you have undirected connections. This is governed by an energy function, and it has the nice property that if you give me a visible vector I can immediately write down the full probability distribution over the hidden units. They're all conditionally independent, and I can compute the probability of each of them, and since they're conditionally independent that gives me the full distribution. I can sample from the distribution easily. So it's very good for doing inference. And initially you'd think it's going to be hard for learning, but it turns out there's a quick way to do learning. So in that restricted Boltzmann machine, there are two conditional distributions that are relevant. There's the probability of hidden unit j turning on given a visible vector, and that's just the logistic function of the bias of j plus the sum of the inputs it gets from the visible units that are on times the weights of the connections. And then there's something which is just the same the other way around, which is the probability of a visible unit turning on given the hidden vector. That's what you would use to generate from a hidden vector. It's just the same the other way around. Now, using those two conditional distributions, you can do the following: You can put a visible vector in here, and now you can compute the probabilities of all these hidden units and then pick their states according to those probabilities so you get a binary vector. Then you can do the same here. Now you're reconstructing that visible vector from the features. And you just keep going until you have forgotten where you started. That's called the stationary distribution. And you measure the pairwise statistics of a pixel being on and a feature detector being on here. Then you measure the same pairwise statistics there. And that difference in pairwise statistics is exactly the derivative of the log probability of this input vector here, with respect to the weight on this connection. That is, you can take this model and you can run it backwards and forwards like this until it's generating fantasies. You can look at this distribution; that's the model's distribution. And if I show it some vectors here, what I want is for the model to have the same distribution as the data, so that if I generate from the model I get the data distribution. And this is the derivative of the probability that the model will generate this visible vector here. And it's a nice simple form.
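A minimal sketch of the two conditional distributions and one up-down step just described, with made-up sizes; this is an illustration, not the code used in the talk:

```python
# One block-Gibbs step in a binary RBM: infer features from data, then
# reconstruct data from the features, using the two symmetric conditionals.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v, W, b_hid):
    """p(h_j = 1 | v) = sigmoid(b_j + sum_i v_i w_ij); sample binary states."""
    p_h = sigmoid(b_hid + v @ W)
    return p_h, (rng.random(p_h.shape) < p_h).astype(float)

def sample_visible(h, W, b_vis):
    """p(v_i = 1 | h) = sigmoid(a_i + sum_j h_j w_ij); same form, other direction."""
    p_v = sigmoid(b_vis + h @ W.T)
    return p_v, (rng.random(p_v.shape) < p_v).astype(float)

# toy sizes: 6 visible "pixels", 4 hidden "features"
W = 0.1 * rng.standard_normal((6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)
v0 = rng.integers(0, 2, size=(1, 6)).astype(float)

p_h0, h0 = sample_hidden(v0, W, b_hid)    # up: infer features from the data
p_v1, v1 = sample_visible(h0, W, b_vis)   # down: reconstruct data from the features
```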
It depends only on things to do with that connection, even though this derivative actually depends on all the other weights in the network. That dependence shows up in these correlations here. Now, for that learning algorithm you have to decide how long to keep going. And after a very long time, after 17 years in fact, I discovered you don't actually need to do that. You just do this: you go up, and you come down and you go up again, and you take this difference. Now, this isn't following the derivative of the log probability of the data. It's not maximum likelihood learning. What it's doing is taking the data, representing it with the features, and then reconstructing from the features something the features would like to believe in. So this is the data polluted by the beliefs of the model, but only slightly polluted. And you're measuring statistics with the data and statistics with data that's been polluted with the model's beliefs, and you'd like them to be the same, because then your model isn't polluting the data. So intuitively you take this difference and do learning, and, hey, it works. It's following the gradient of another objective function, almost. So it's not doing maximum likelihood; it's doing something more like the gradient of this other function, called contrastive divergence, but it's not even doing that exactly. The main thing is it works and it's quick. So we can train a restricted Boltzmann machine: we can give it binary data and train it to learn lots of features that represent that data. Once we finish training, if you run the model back and forth, it will produce stuff that looks like the data. The reason it's interesting is because we can now stack these things up. With these undirected models it's very easy to stack them up. What you do is you first train one of these restricted Boltzmann machines and it learns a layer of features, a model of the data. You then take the activations of those features and you say that's data now. We've learned the first layer; let's learn another layer that models the correlations of those features. And we keep going like that. And we can guarantee that unless you've already got a perfect model, each time you add another layer you improve your model of the data in the following sense: there's a variational bound on how well you're modeling the data, and if I add another layer in the right way, that variational bound is guaranteed to improve. So there's something that's getting better when we add more layers, and that's our kind of security blanket. There really is, underlying in the math, something that's improving as we add more layers. We now violate all the assumptions you need to prove that. This is sort of normal practice, right? And then we just get on and we learn lots of layers, with this sort of security that behind all this is something sensible that we're doing. And now we're just getting on with it. So if, for example, we had some data here and we learned a layer of binary feature detectors, and we took the activities of those when they're being driven by data and learned another layer, and we took the activities of those and learned another layer, we then have these three restricted Boltzmann machines and we want to put them all together into one model. And what's surprising is that the model we get is not a great big Boltzmann machine. That's what everybody thinks, and it's not. The top two levels are the last restricted Boltzmann machine. So that's a little Boltzmann machine, a bipartite undirected model.
But when we put them together, what we're really doing is this: the lowest level model has a prior here, and we're substituting this higher level model for the prior, and we keep doing that. And we end up -- the only thing we keep from the lower level model is the P of V given H. From this model, the only thing we keep is the P of this H given the H above. We end up with a model that has a restricted Boltzmann machine at the top, like the first one, and underneath that it's really a belief net. It's a directed model. If you want to generate from this model you go backwards and forwards up here forever, and then you go clunk, clunk once, and then you can see what the model believes. Okay. So that's why it's called a deep belief net. It has this undirected bit at the top, but all the layers below are belief net kind of stuff. So we accidentally discovered how to learn deep ones of these. You work on a different problem, learning a restricted Boltzmann machine, and then after I realized you could stack them and do nice things, eventually a very smart student called Yee-Whye Teh realized that's really what you're doing: you're producing a belief net this way. Now, after you've done all that, you can then throw away the belief net interpretation and say: really I produced all this stuff. When I view it as a belief net the connections go this way, but it's also a belief net in which I can do very fast inference by using the connections backwards. What I'm going to do now is just treat it as a feed forward neural network. Good old-fashioned feed forward neural network. I just happen to have pretrained all these feature detectors without looking at the answer. That's good, because it means I don't need the answer for all these detectors. If I've got lots of unlabeled data, this is great. So having learned all those layers of feature detectors, we then stick our label units on top, and then we just use standard back propagation to fine tune it. This works quite a lot better than standard back propagation without all this pretraining. One reason it works better is that it goes faster. Once you turn the back propagation loose, you get sensible gradients and you're just doing a local search. You're sort of fine tuning it. It's not really changing the feature detectors much; it's just changing them a little bit to get the category boundaries in the right place. It also generalizes much better, and that's because most of the information in the feature detectors, in all those weights, comes from trying to model the input. The input is typically a big rich thing with lots of information in it. You have about 10 to the 14 weights you need to set, and you only live about two times 10 to the 9 seconds, so you need to be setting about 10 to the 5 weights per second. And your mother would have to give you at least 10 to the 5 bits per second of instruction to do that by discriminative training. And she doesn't. The only place you're going to get 10 to the 5 bits per second, which is what you need, is sensory input. That's the only place where there's enough information to determine all these parameters. So there's not enough information in the DNA. >>: They're not one double -- >> Geoffrey Hinton: Right. But that's why I said there's not enough information in the DNA, okay? That's only three billion bases. Negligible compared with what you need. Because most of the information in the weights comes from trying to model the input, it generalizes much better. You're not limited by not having very many labels.
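A minimal sketch of the contrastive divergence update and the greedy stacking just described; the sizes, learning rate, and epoch count are arbitrary choices for illustration:

```python
# CD-1 training of one binary RBM, then greedy layer-by-layer stacking:
# treat each layer's feature activations as the "data" for the next RBM.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
bern = lambda p: (rng.random(p.shape) < p).astype(float)

def train_rbm_cd1(data, n_hidden, epochs=10, lr=0.05):
    n_visible = data.shape[1]
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    n = data.shape[0]
    for _ in range(epochs):
        p_h0 = sigmoid(b_hid + data @ W)          # up: data -> features
        h0 = bern(p_h0)
        p_v1 = sigmoid(b_vis + h0 @ W.T)          # down: reconstruction
        p_h1 = sigmoid(b_hid + p_v1 @ W)          # up again
        # difference of pairwise statistics: <v h>_data - <v h>_reconstruction
        W += lr * (data.T @ p_h0 - p_v1.T @ p_h1) / n
        b_vis += lr * (data - p_v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

def stack_rbms(data, layer_sizes):
    """Greedy pretraining: each layer's activations become the next layer's data."""
    weights, x = [], data
    for n_hidden in layer_sizes:
        W, b_vis, b_hid = train_rbm_cd1(x, n_hidden)
        weights.append((W, b_hid))
        x = sigmoid(b_hid + x @ W)   # drive the features with data; call that "data" now
    return weights

# toy binary data: 200 cases, 20 "pixels", three hidden layers
data = bern(np.full((200, 20), 0.3))
pretrained = stack_rbms(data, [50, 50, 50])
# a softmax label layer would then be added on top and the whole stack
# fine-tuned with ordinary backpropagation, which is not shown here.
```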
Now, one thing wrong with it was that we started off developing it for binary data, and most of the things you want to apply it to, like speech and vision, have real values. So we're going to modify the model just slightly. We change the underlying energy function, and the result of changing the energy function is that the rule for setting the states of the hidden units is basically just the same, except that the activity of a visible unit needs to be measured in units of its standard deviation. We'll have linear Gaussian visible units, just like in factor analysis. They'll have their own noise levels; that's the standard deviation of the noise of a unit, the measurement noise in factor analysis. And you have to measure the V in units of standard deviation, because that's how the log probability works out. Then when you come down again it's quite different, because these are linear units. So the way you determine the state of a visible unit -- that shouldn't say P of V equals 1, that should say P of V -- the way you determine the real-valued state of a visible unit is that it's a Gaussian. The unit learns its own mean, but then the top-down input coming from the hidden units causes an offset to that mean, and the magnitude of the offset again depends on the noise in that visible unit. If it's very tight, you need a lot of top-down input to move it over. If it's very sloppy, with high variance, you don't need much top-down input to move it over. And this is the noise level in the unit. So those two conditional distributions go like that, and now we can learn with real values, where this is a real value here. So we applied it to speech recognition. Two of my grad students, George Dahl and Abdel-rahman Mohamed, applied it to recognizing speech on the TIMIT database. Normally when you do speech recognition you have hidden Markov models that model the sequence of phonemes, and you have to relate them to some representation of the acoustic input, which is typically the Mel cepstral coefficients, which a long history of development in speech recognition said were good things to use, plus first and second derivatives. Then each node in the hidden Markov model has a Gaussian mixture model for trying to fit these guys, and we're going to replace that by something that goes in the other direction. It's going to be a feed forward neural net. This was done in the late '80s or early '90s by people like Nelson Morgan and Tony Robinson and Yoshua Bengio: replace it by something that takes the Mel cepstral coefficients and tries to predict the probabilities of these states of the hidden Markov model. But we used a very deep net. All the people doing that used fairly shallow nets, and it works better if you use a deep net. In fact, George discovered the net that works best is: here's your input representation of 13 Mel cepstral coefficients plus the first temporal derivative and second temporal derivative, and he then puts that into 2,000 binary hidden units. That's a lot. Then another 2,000. So there's four million parameters right there -- between every pair of layers there's four million parameters. And he pretrains all this stuff. The thing that worked best was seven layers pretrained like that. Then he puts 183 labels on top, because there's 61 phonemes and each one has three hidden Markov model states, and you're trying to predict which hidden Markov model state. But that's not pretrained. Notice that's far fewer parameters. Then you fine tune it with backprop.
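A minimal sketch of the two conditional distributions in the real-valued case just described, assuming the usual Gaussian-Bernoulli energy function; sizes are illustrative:

```python
# Gaussian-Bernoulli RBM conditionals: on the way up, visible values are
# measured in units of their standard deviation; on the way down, each
# visible unit is a Gaussian whose mean is its bias plus a top-down offset
# scaled by sigma.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def hidden_probs(v, W, b_hid, sigma):
    # p(h_j = 1 | v) = sigmoid(b_j + sum_i (v_i / sigma_i) * w_ij)
    return sigmoid(b_hid + (v / sigma) @ W)

def sample_visible(h, W, b_vis, sigma):
    # p(v_i | h) = Normal(a_i + sigma_i * sum_j h_j w_ij,  sigma_i^2)
    mean = b_vis + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)

# toy example: 5 real-valued visible units, 3 binary hidden units
sigma = np.ones(5)                     # the fixed variance-of-one choice discussed below
W = 0.1 * rng.standard_normal((5, 3))
b_vis, b_hid = np.zeros(5), np.zeros(3)
v = rng.standard_normal((1, 5))
p_h = hidden_probs(v, W, b_hid, sigma)
h = (rng.random(p_h.shape) < p_h).astype(float)
v_recon = sample_visible(h, W, b_vis, sigma)
```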
At the point at which he did that, the best result on the standard TIMIT database was 24.4 percent phone error rate, and he got 22.2 percent phone error rate. That's if you count the silences, which everybody does because they're easy. Actually, when we didn't count the silences we looked a little better than this, but then we realized everybody counts the silences, so we did too. So that was nice. >>: [inaudible]. >> Geoffrey Hinton: No, Robinson's number would be about 26.3 or something like that. That was the best recurrent net. This one is actually averaging a whole bunch of different models. Now, at about the same time as we did this, IBM started taking its large vocabulary speech stuff and applying it to TIMIT. If they do it speaker independent they get a similar result to this. It's not better. But in a minute we're going to make this better. Well, the first thing is speech people say: ha, but things that work on TIMIT don't work on big vocabulary. Does this work on big vocabulary? And George came here, working with the Microsoft people on Bing voice search, and he tried it on big vocabulary. If you train on 24 hours of speech, they get 62 percent correct using a Gaussian mixture model, and the deep net gets 70 percent correct. Now, this isn't entirely fair, because the Gaussian mixture model you can train on much more data. If you train it on a thousand hours of data, which is what they actually do, you get performance that's very comparable with this. There's all sorts of little things you can change; it's sort of about this level, maybe slightly worse. Which suggests that if we could train this thing on a thousand hours, we'd wipe them out. But at present we can't, and we're thinking of ways of trying to train it on a thousand hours. There's one embarrassing thing about that nice model, which is that in the underlying math, you have these visible Gaussian units that have their own measurement noise level -- sort of the residual noise. That should be the residual prediction error when you try and predict what they're up to. We took the data and normalized it so it had a variance of one. But to learn this model we also used a residual noise with a variance of one, which is crazy. We're saying you can't predict the data at all. Now, it works. But when we tried actually learning the variances, it didn't work at all. I spent about a year trying to understand what the hell was happening, why we couldn't learn it, and throwing more and more graduate students onto the funeral pyre, you know what I mean? [laughter] And you really can't learn it. Even if you're hyper-Bayesian about it, it's still tough to learn. >>: MSR Cambridge, they have a recent -- they came over to -- >> Geoffrey Hinton: Yeah, they have a complicated way of trying to do it, right? >>: Successful? >> Geoffrey Hinton: It was somewhat successful, but they didn't really understand what the problem was. We have now solved the problem; we can learn it much better. So I'll show you what the problem is. Remember the two conditional distributions? We are dividing by the standard deviation of the residual noise here, and we're multiplying by it there. So the bottom-up effects you divide by it, and the top-down effects you multiply by it. So if you draw a picture of what's happening, the effect of making this standard deviation small is that it effectively makes the bottom-up weights very big and it effectively makes the top-down weights very small.
Suppose we make it a tenth, which is about the right level. These weights are ten times too big and these are ten times too small. And the result of that is that these hidden units all get saturated very early and have no flexibility left to learn, and these guys never get enough top-down oomph to actually drive them to their means, so the whole thing is kind of underconstrained all the time. That's sort of an embarrassing problem, and there's a simple way to fix it. What we're going to do is have an infinite number of hidden units. This is actually good for getting NIPS papers accepted, because you have to have "infinite" in the title. So we make infinitely many copies of each binary hidden unit. They all have the same weights and they all have the same bias, so there's no more parameters. But they have offsets in their biases, so their thresholds get progressively higher like this. And so if you provide this thing with some total input X, the ones way up here won't actually turn on, so you don't really need infinitely many. But you don't know in advance how many you need, because it depends on how big X is. And now if you make X 10 times bigger, 10 times as many of these will turn on, so now you're in good shape. Because now if you make this standard deviation a tenth, that makes the bottom-up input 10 times bigger, so 10 times as many of these turn on, and that makes the top-down effect 10 times as big, which cancels out that tenth there. So we're back in business. Of course, we do need this infinite number of binary units. >>: Can you fix the integration to begin with, to reconcile the difference between -- >> Geoffrey Hinton: It amounts to doing this. >>: The same thing? >> Geoffrey Hinton: So it turns out that if you take an infinite number like that and you compute this infinite sum here, that thing is almost exactly the same as this thing, the log of one plus e to the x. That is, in the limit, as the steps in your thresholds get small and you scale everything down, they become the same. But even with steps of one they're so nearly the same that if you plot the two you think you've forgotten to plot one of them, because you only see one line. So it looks like this. So basically we can model those guys very well with something like this, and we can model this with sort of a linear threshold thing with some noise in it. And it's a pretty good model of this if you make the noise level depend on the logistic of the input. So when the input is big, the variance of the noise is one. When it's big and negative, there's no variance in the noise. And right about here is where you get this curved thing, by adding noise, basically. So in Matlab, that's what we actually do. You say the output of one of these units is the max of 0 and X plus some Gaussian noise whose variance is the logistic of X. So that's what you plug into your program, and now you've got something efficient again. We call these rectified linear units. And that works really nicely. So then we can turn it loose on speech. In fact, now that we know how to learn these models with rectified linear hidden units instead of binary hidden units, and real-valued Gaussian visible units, we can learn our own representation of speech. I mean, people do this Fourier analysis. There's a lot to be said for Fourier analysis, except that speech isn't really cyclic like that. It makes all sorts of assumptions that aren't quite true.
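A minimal sketch of the rectified linear unit trick just described: it checks numerically that the sum of shifted logistics is nearly log(1 + e^x), and shows the noisy max(0, x) version you would actually plug into a program. All sizes here are illustrative.

```python
# An "infinite" stack of tied binary units with shifted biases is well
# approximated by max(0, x + noise), where the noise variance is sigmoid(x).
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def stepped_sigmoid_sum(x, n_terms=100):
    """Sum of logistics with thresholds offset by 0.5, 1.5, 2.5, ...
    which is very nearly log(1 + exp(x))."""
    offsets = np.arange(n_terms) + 0.5
    return sigmoid(x[..., None] - offsets).sum(axis=-1)

def noisy_relu(x):
    """The practical version used in training: max(0, x + Gaussian noise
    whose variance is sigmoid(x))."""
    return np.maximum(0.0, x + np.sqrt(sigmoid(x)) * rng.standard_normal(np.shape(x)))

x = np.linspace(-5, 5, 11)
print(np.round(stepped_sigmoid_sum(x), 3))   # these two curves overlap almost exactly
print(np.round(np.log1p(np.exp(x)), 3))
print(np.round(noisy_relu(x), 3))
```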
And people developed these Mel cepstral coefficients, but Mel cepstral coefficients were developed when people had small computers and the main problem was to get rid of most of the data so that it would fit in your computer, and also to blur it in a way that blurred together things that should be classified together. But we've got much more powerful models now. We don't want to blur it; we want to see what's going on. So we're going to do something that sounds crazy. We're going to take the raw speech wave, 100 samples from the raw speech wave, and put it into 120 of these rectified linear hidden units. And we're just going to learn how to model a tiny little piece of speech wave. This is going to be 6.25 milliseconds at 16 kilohertz. And we train it just the way we train these standard Boltzmann machines, and we're going to get some rectified linear features here. The question is: what are they going to look like? You get a lot that look like what you'd expect, which is little sort of wavy things. You might expect to find different frequencies like this. But you also get guys like this. This guy is kind of cute. He's got one low frequency here and superimposed on it he's got another high frequency. And that's right there in the front end of the system. And so if you take his Fourier transform, it looks like this. It's got lots of energy here, lots of energy here. That's actually detecting a vowel. Those are the first two formants. If you take real speech data, a lot of the time people are saying vowels, so the data has a lot of these things in it where you have two formants, and this thing goes into detecting them directly. >>: So if you were to use the conventional unit you wouldn't be able to get this. >> Geoffrey Hinton: You can't get this. You have to learn the variances to get this. You have to learn tight variances. You have to basically learn a proper model to find these things. >>: Strictly -- >> Geoffrey Hinton: The trick I showed you last year was a kludge from before I really figured out what was going on. And next year I'll probably say this was a kludge. Okay. So now we can apply it to TIMIT, and we do slightly better than Mel cepstral. >>: A couple of things. I assume you don't actually use the Fourier analysis shown on the slide; you're just using the features directly. >> Geoffrey Hinton: Yes, the Fourier transform there was just for analysis of what these features represent in conventional terms. >>: One of the points of the cepstrum is to remove the pitch information, so now you're frequency dependent with what you just learned. >> Geoffrey Hinton: Absolutely. >>: Are you going to do the equivalent trick, make it homomorphic, to remove that? >> Geoffrey Hinton: No, our idea is you don't want to remove information, you want to model it. And if you've got lots and lots of layers of features and a big computer, you can afford to model all sorts of stuff that might turn out to be useless, because you're building a generative model. >>: But as long as you go across multiple speaker types, multiple frequencies, then you'll learn the features that are necessary. >> Geoffrey Hinton: Exactly. So now what we do is we advance this window by just one sample and do the whole analysis again. So we are not going to lose much this way. We have a hugely overcomplete representation, 20 times overcomplete. And then what we do is we look at these outputs.
We look at the output of a hidden unit on all these different little windows that are advanced by one sample, and pool its output over 160 frames. So you're really asking: did this feature get active anywhere in there? But we make these pools overlap. And now we take that stuff and build more hidden layers on top. And, hey presto, we do a little bit better than Mel cepstral -- about one percent better. So a standard speech recognition system starts with Mel cepstral coefficients and then uses a Gaussian mixture model to relate these to the underlying hidden Markov model, and you can predict what I'm going to try to do next. That's going to be the last part of the talk. We don't know how to get rid of these yet. But we will. >>: SIFT [inaudible]. >> Geoffrey Hinton: Yes. >>: Variable. >> Geoffrey Hinton: Not quite, because SIFT features come with accurate poses and things. Okay, now I'm going to switch to natural images, and we're going to try to apply the same technique to modeling image patches. To first order, images consist of smooth stuff with sudden discontinuities. That's what people find so hard to model. If you just model the Fourier spectrum of images and then generate from it, you get clouds. They don't look like real images. So here's some images. This image and this image are very similar if you use a color histogram, and if you do a template match of these images they're pretty similar -- in most places they match very well. But this image is really more similar to that one. If you sort of had to remember which is similar, these are the ones that are really similar. So it's not the mean intensities that matter. It's which intensities are the same as which other intensities that matters. That's the covariance structure of the image. Now, you can of course fit Gaussian mixture models which learn certain covariance structures in their parameters. But the thing about images is that every new image has its own covariance structure. So we want something that's dynamic, that will model the covariance structure of that particular image, and that means we want some basis functions for covariance structures. And we want to represent this image as: you've activated all these covariance basis functions and between them they produce an image with this covariance structure. We also want to model the means. So we're going to have a model that looks like this. Here's your pixels. Here's your standard bit -- this would be a standard sort of Gaussian-binary RBM. And you might want to replace this with these rectified linear units, but actually we normalize the image patches, so you don't really need those to begin with. And then this is the bit that's going to model the covariances. What we do is we take some linear filters, things that are going to look at the image, and we're going to square their outputs and send them here. And then these units are going to be units that are always on unless they're suppressed. And we're going to make sure these weights are all negative. So the interpretation of this is: this unit here is saying my bit of the image is nice and smooth, so you win by B. You get a bonus of B because your bit of the image is nice and smooth. This filter here is going to learn to detect things like an edge. And when there's an edge there, it's going to activate this guy, which provides input here, and now that's going to suppress this guy, and so you don't get this bonus.
But as this gets more and more active, once this guy's been turned off, it doesn't cost any more. Once you've decided it's not smooth after all, you don't get a bigger and bigger penalty as the edge gets stronger and stronger, and that's what you want. That gives you the heavy-tailed property. I went over that very quickly, but that's the basic idea. And now we're going to train this thing. We use a training algorithm that's basically this contrastive divergence algorithm. There's all sorts of variations to make it work for this kind of circuit, and they're all in a paper in the upcoming NIPS. I don't have time to go into them here, but I want to show you what happens when you generate from this. If you learn on image patches, here's what image patches look like, and here's samples from the model. And they look quite like image patches. They have this property of being locally smooth with sudden outbreaks of structure. So this looks like the same kind of stuff as that, whereas clouds produced by a Gaussian model don't look at all similar. Marc'Aurelio Ranzato, who did this work, wanted to apply it to big images. What you have to do then is model a patch and replicate it across the image so you can model big images. But if you've learned a filter and you replicate it across the image with a small stride, that filter will overlap with itself. That's what happens in a convolutional net. That's a disaster. Because where it overlaps with itself, in order to have high representational capacity, it wants to be orthogonal to itself. It wants the two outputs to be different. So the filters try to learn to be orthogonal to themselves in all possible overlaps. And the thing that is orthogonal to itself in all overlaps is high frequency noise. So they learn high frequency noise. There's a method called Fields of Experts that took our earlier work and put it in a convolutional net, and they learned high frequency noise. Surprisingly, that works quite well. But you can do much better if you don't allow a filter to overlap with itself. So what we're going to do is tile the image with a filter. You learn a filter, you move it over by the width of the filter and then do another one. So those are replicas. But then you have lots of offset tilings of the image. >>: Why don't you have the same problem in the speech case? It's just a one-dimensional image. The overlaps? >> Geoffrey Hinton: We probably did, and we can probably do better. That was very early work. So we're going to have a red tiling of the image. There's the red tile. We tile the image with these red tiles, and for each red tile we learn something like 64 filters. And then we're going to have another tiling, the blue tiling, which is diagonally offset so you minimize boundary artifacts, and we learn 64 filters for that, and so on. Here's some of the covariance filters learned by the red tiling. You can see they're picking up on edges and things, and occasionally on high frequency noise. But mostly they're edges at different frequencies; that's what they're really like. And so we learn lots of those. We also learn mean filters, again using tiling. And now that you're learning the covariance filters, you know what should be the same as what. So in order to model the intensities in the image you don't need to actually worry about where the edges are. You know how you do a watercolor by having sort of a blob of color and then spreading it out until it hits the edges? Well, that's what this model does.
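A minimal sketch of how one covariance hidden unit in this kind of model gets its input, with made-up sizes; the constraint that the weights from the squared filter outputs are non-positive is what produces the smoothness bonus and the suppression by edges:

```python
# Covariance hidden units: linear filter outputs on a patch are squared and
# sent through non-positive weights, so a unit stays on (collecting its
# "smoothness bonus" b) until an active filter such as an edge suppresses it.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_pixels, n_filters, n_cov_hidden = 64, 32, 16        # an 8x8 patch; illustrative sizes
C = 0.1 * rng.standard_normal((n_pixels, n_filters))   # linear filters on the patch
P = -np.abs(0.1 * rng.standard_normal((n_filters, n_cov_hidden)))  # constrained <= 0
b_cov = np.ones(n_cov_hidden)                           # the bonus for smoothness

patch = rng.standard_normal(n_pixels)                   # one normalized image patch
filter_out = patch @ C                                  # linear filter responses
p_h_cov = sigmoid(b_cov + (filter_out ** 2) @ P)        # squared outputs, negative weights

# A smooth patch leaves p_h_cov near sigmoid(b); a strong edge drives the
# relevant units toward 0, and making the edge even stronger costs little
# more, which is the heavy-tailed behaviour described above.
print(np.round(p_h_cov, 3))
```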
The model of the mean has these rather fuzzy filters that give you sort of a bright region here. But when you combine that with the covariance filters, if there's an edge here, it will spread that brightness this way but it won't spread it that way. So it's this sort of watercolor model of an image: you have a color wash and then these edges that it gets slaved to. And that just comes out in the math. So when you generate from it -- here's a big image generated by using just the second order statistics, the power spectrum, and it looks like that. One of the best models of images is the Gaussian scale mixture. Notice it's a mixture model, and it's deciding for each little bit of image what's going on there and what frequency it is, what spatial scale. But if you generate from it, it produces something like that. It's a good model of images but hopeless as a generative model. If you take the Fields of Experts model -- that's this model here, which is like the model I described but only using the covariance units, and also it's overlapping the filters, not tiling the image -- that produces something that's sort of better than this but doesn't have areas with edges. It's quite a nice texture model. If you take a pairwise MRF, it produces something like that. If you take the model that Marc'Aurelio Ranzato developed, it produces something like this. So at least you're getting regions with edges. This looks a bit like a Henry Moore sculpture. Here's another sample from the model. Okay. So these are samples that are a lot more like images than what people could produce before. I should hedge that: there are many, many ways to put images together by stitching together little pieces of other images, but from a parametric generative model it's very hard to get things that look as much like images as that. Here's some more samples. Okay. So now what we do is we stick a hidden layer on top. If you stick a hidden layer on top, and then take the restricted Boltzmann machine on top and go backwards and forwards on that, and once it's settled down you generate the image using the lower layers, you get something that looks like this. I should have got more samples of this. That looks like a really bad photo that someone took, right? It's out of focus, and you can't tell what it's a photo of because it's such a bad photo. But at least you could mistake that for a really bad photo. You couldn't have mistaken any of the others for that. If you use more hidden layers -- it's not clear to me whether this is better -- you end up getting stuff like that. So this is with three extra binary layers: you go backwards and forwards on the top two layers and generate. Actually, here the top two layers don't have that many units. What you can do also is put a real image in there, infer up to the top layer, and then regenerate from the top layer of binary units. And it looks just like the real image but a bit blurry. So that top layer can represent all that structure. It's a sad fact that we haven't managed to show it's good for denoising images. That would be the killer thing to show. But it is good for generating images. Okay. I might have one more of those. There's another sample from the model with three extra layers. >>: For photo enhancement -- when you're talking about the noise specifically -- with this approach would you have to train the model -- >> Geoffrey Hinton: You train the model.
>>: And then -- >> Geoffrey Hinton: You then take the image and you follow the gradient of the log probability of that image. If you've got a probabilistic model of an image, you can compute the log probability up to an unknown additive offset, and you can follow the gradient. >>: Following the image. >> Geoffrey Hinton: You would train on all images. Of course, if you want to denoise a certain class of images you'll train on that class and you'll do much better. Okay. One final thing we can do is take that model that Marc'Aurelio Ranzato developed for images, and George just took his code and applied it to speech. So this was actually frames of filter bank outputs. He didn't put in the deltas and delta-deltas, because those are for capturing the temporal correlations and we can model correlations now, so we don't need them. And he tried this mean-covariance RBM as the front end and just multiple layers after that, and trained it the same way as before. Now it does quite a lot better. And this is the record for speaker-independent phone error rate on TIMIT. We don't know whether that will generalize to bigger databases. George found it hard to get it to work on a bigger database. We now have a version that's much easier to get to work, and we're working on showing that it will work on large vocabulary. >>: For the images, have you done any classification statistics yet? >> Geoffrey Hinton: Yes, Marc'Aurelio has done classification stuff. The place where it really wins is if you take a face and you obscure a bit of the face and you tell it that that bit's been obscured. So a face with some object in front, and you know there's an object in front, so you don't use those pixels; you know those are unknown pixels. With the generative model, you fill that bit in and then recognize it. That does much better than any other technique. That's sort of semi-cheating. But it does much better, for example, than if you fill it in by linear interpolation and then recognize it. And now I want to get to something completely different. The other thing that was disappointing back in the '80s about back propagation was that there was a particular version of back propagation that was clearly the exciting version, and it wasn't this multi-layer stuff. If you take a recurrent net like this, and suppose it takes one time step for this guy to affect that guy via this weight, I can explode it in time like this. So this weight here, W2, is happening there; it's also happening here; it's also happening here. So by tying these weights together I can make a feed forward net like this be equivalent to a recurrent net. If you tell me that this guy should have been in a particular state, I can back propagate the error and I can train all these weights, keeping these the same as each other. In particular, I might take some units and make them the input units -- I'll provide input here -- and take some units and make them the output units, and I'll have desired outputs here. And I'll have some hidden units, hopefully more than that. And so now I can have something that takes a stream of input and spits out a stream of output, with a little time delay of two steps to get there, and that thing ought to be able to learn wonderful things. It ought to be able to learn little programs, lots of little programs operating in parallel that between them manage to compute the right answer.
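A minimal sketch of backpropagation through the unrolled net just described, with tied recurrent weights and illustrative sizes; the last line of the backward pass is the repeated product that makes the derivatives die or explode:

```python
# Unroll a recurrent net in time into a feed-forward net with tied weights,
# then backpropagate through the unrolled steps (plain BPTT, not the
# Hessian-free training discussed next). Squared error is used for brevity.
import numpy as np

rng = np.random.default_rng(0)

n_in, n_hid, n_out, T = 3, 8, 2, 5
W_in  = 0.1 * rng.standard_normal((n_in,  n_hid))
W_rec = 0.1 * rng.standard_normal((n_hid, n_hid))   # the weight that is tied across steps
W_out = 0.1 * rng.standard_normal((n_hid, n_out))

xs = rng.standard_normal((T, n_in))
ys = rng.standard_normal((T, n_out))

# forward: the same W_rec is applied at every time step
hs, outs = [np.zeros(n_hid)], []
for t in range(T):
    h = np.tanh(xs[t] @ W_in + hs[-1] @ W_rec)
    hs.append(h)
    outs.append(h @ W_out)

# backward: accumulate gradients for the shared weights over all time steps
dW_in, dW_rec, dW_out = np.zeros_like(W_in), np.zeros_like(W_rec), np.zeros_like(W_out)
dh_next = np.zeros(n_hid)
for t in reversed(range(T)):
    dout = outs[t] - ys[t]                      # squared-error gradient at this step
    dW_out += np.outer(hs[t + 1], dout)
    dh = dout @ W_out.T + dh_next
    dpre = dh * (1.0 - hs[t + 1] ** 2)          # tanh derivative
    dW_in  += np.outer(xs[t], dpre)
    dW_rec += np.outer(hs[t],  dpre)            # tying: every step adds to the same matrix
    dh_next = dpre @ W_rec.T                    # the repeated product that dies or explodes
```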
I was really excited about this when I realized back prop would do it. I thought we've really got it now, because we can get this big net to behave like all these little programs. And the problem was you couldn't train it, particularly if you try to train it over 100 time steps. The error derivatives either die or explode. That's because a net like this tends to have attractors, and because it has attractors, between two attractors there's a bifurcation. If you start at that bifurcation point, you get an infinite derivative, and anywhere else you get a near-zero derivative. So the derivatives tend to be infinite or zero, which isn't good news. So people gave up on these, essentially. Tony Robinson used these for speech, and later on there was some excitement, but then people basically gave up on these. >>: Couldn't reproduce the result. It worked once. >> Geoffrey Hinton: Okay. I bet you we could reproduce it now. >>: Not using RNNs. >> Geoffrey Hinton: Yes, using RNNs, because we can train them much, much better. So two of my graduate students, one in particular, James Martens -- well, maybe the problem with these is you're just not using a good enough optimization technique. Over the years optimization people say you should use this technique, you should use that technique, but they're not willing to put in six months to make their technique work on neural nets. James Martens took one of the best techniques and put in six months making it work on neural networks. Boy, does it work. It can follow these derivatives over 100 time steps now. Basically you want to use curvature information to decide which direction to go. >>: But what if you just use standard pretraining for a fixed length? >> Geoffrey Hinton: It's very hard to pretrain these guys, because the weights are shared, remember. >>: I see. >> Geoffrey Hinton: We're working on trying to do that, but it's not trivial to figure out how to pretrain them. >>: Is that just the lap -- >> Geoffrey Hinton: No, it's because the weights there are shared. You can't separate -- you can't modularize anything. >>: Have you compared your results to Risto's at the University of Texas? >> Geoffrey Hinton: We've compared them to Schmidhuber's. We'll let them speak for themselves. Instead of using quasi-Newton, which is approximating the curvature matrix and doing a line search to find how far to go, you use a different approximation to the curvature matrix: you use the Gauss-Newton approximation, which is guaranteed to be positive definite, and you put a huge amount of work into getting a good approximation. It's not the full matrix, because that's hopelessly large. But you're very interested in, say, the 100th eigenvalue, which is thousands of times smaller than the leading eigenvalues, and you want to get it right, because these directions have very, very low gradients but even lower curvatures, and it's the ratio of the gradient to the curvature that says how much headway you can make. And so in among all this there's a conjugate gradient search that's used for approximately minimizing the local quadratic approximation. And what's more, the literature says you use five steps of the conjugate gradient search; James discovered that actually 250 steps is what you need. There's all sorts of damping that goes on, to do with trust regions and all that. He put a lot of work in. There's an ICML paper doing impressive things on toy problems. He's made it work better since then.
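A minimal sketch of the core linear solve in that style of optimizer: conjugate gradient applied to a damped curvature matrix that is only ever touched through matrix-vector products. The damping value, step count, and toy matrix are illustrative, not the settings from the paper:

```python
# Approximately solve (G + lambda*I) d = -g with conjugate gradient, where
# apply_G(v) computes a curvature matrix-vector product without forming G.
import numpy as np

def conjugate_gradient(apply_G, g, lam=1.0, n_steps=250):
    d = np.zeros_like(g)
    r = -g - (apply_G(d) + lam * d)      # residual of the damped linear system
    p = r.copy()
    for _ in range(n_steps):
        Gp = apply_G(p) + lam * p
        alpha = (r @ r) / (p @ Gp)
        d += alpha * p
        r_new = r - alpha * Gp
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p
        r = r_new
        if np.sqrt(r @ r) < 1e-8:
            break
    return d

# toy quadratic: a positive-definite 2x2 matrix stands in for the curvature
A = np.array([[3.0, 1.0], [1.0, 2.0]])
apply_G = lambda v: A @ v
g = np.array([1.0, -2.0])
step = conjugate_gradient(apply_G, g, lam=0.1)
print(step)
```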
Another student, Ilya Sutskever, has applied it to predicting the next character in a character stream. I'm not going to talk any more about the method; you can read the ICML paper. There's something very nice about working with characters, which is that that's how the Web comes. It's just characters. If you try to work with words, you have to segment them. If you're doing Finnish, you've got a nightmare because of all these morphemes. Even in English there's problems. There's all these prefixes and suffixes, and you don't know whether to strip them off or leave them on. Cities like New York -- you don't know whether to make it one word or two words, because sometimes it's two words. And there are subtle effects you don't know about but that are there in English, which you'd like to pick up on, like words beginning with SN typically meaning something to do with the upper lip or nose. So there's things like snot and sneer and snide and snarl and snap. There's lots and lots of them like that, far too many to be chance. Now, there's things like snuggle which doesn't mean that. But snuggling leads to nuzzling, and there you go. People always come up with one exception: they say snow. Snow's got nothing to do with the upper lip. But if you ask yourself why snow is such a good name for cocaine, it's the perfect name, because it's white and it has to do with the upper lip. I don't think that's a coincidence. >>: Is this your discovery -- >> Geoffrey Hinton: The point is, that's the irregularity of English, and anybody learning the language will pick it up, but the linguist doesn't even know it -- most linguists. I think I got this from George Lakoff, so some linguists know it. So we'll do a net that works like this. It's not your standard recurrent net, but it's quite similar. We're going to have 2,000 logistic units, and here everything's going to be deterministic: 2,000 logistic units that have real-valued states between 1 and 0, and the real values are important. If you mess with them too much it doesn't work. And we're going to have another 2,000 units, which is the state at the next time step. So this is one time step. And we're going to make a character not provide input to these units, which would be the standard thing to do. We're going to make a character determine the transition matrix. So we're going to say: what a character does is take some state and determine the transition matrix that gives you a new state. And one way it could determine the transition matrix is that each of the 86 characters we use could have a lookup table which says I correspond to this transition matrix. But then you'd have 86 times four million parameters. And that's probably a good thing to do, but it's too many parameters. So what we're going to do is take that 2,000 by 2,000 by 86 tensor -- that's 86 full matrices -- and we're going to factorize it. Just like PCA factorizes 2-D matrices, we're going to factorize the 3-D tensor, and factorize it in the following way. We'll have a bunch of factors, these triangular things -- actually 2,000 of them, which is confusing, but there you go. 1,999. Let's suppose we had that many. And each factor is going to work like this. If you think of how you'd program it, you take these states and you multiply them by the weights on these connections, so this is a linear filter. And now the character tells you what weight to put on that linear filter.
So you take this linear filter, multiply its output by this weight, and send it out along these connections, where it gets multiplied by these outgoing weights. And that's the contribution that this factor makes. And you have lots of factors. So this character says: let's use a little bit of this factor. It's like cooking -- it really is cooking -- use a little bit of this factor, a bit more of that factor, and each character uses different amounts of different factors. So similar characters like vowels might use similar amounts of the same factors. All of the digits use very similar amounts of the different factors, because they all have very similar distributions. Another way of looking at it is: take the outer product of this vector and this vector and you get a rank one matrix. And then what a character is doing is building up a whole transition matrix by adding together weighted rank one matrices. And you're going to learn all this, of course. Once you've made the transition, you then try to predict the next character. >>: There's a logistic function in here as well? >> Geoffrey Hinton: Then there's a logistic function here, right. So from here to here it's linear, and then we put everything through a logistic. Thank you. Then after you put it through the logistic, you predict the next character and you get a distribution over the 86 characters. And then to train it, you look at the log probability of the correct character and you back propagate to try to increase that log probability. You back propagate through here and down through here, also through here and down to the next character, and through here, down to the next character, and you back propagate for 100 time steps. >>: The question is, why this factorization -- >> Geoffrey Hinton: It doesn't work as well without it. >>: Even if there were an almost infinite number of characters around -- >> Geoffrey Hinton: Well, for the same number of parameters it doesn't work nearly as well. That's what we know. It works a lot better when you do it like this, because characters really are transitions from one state to the next state. Okay. So the question is: what happens if you train it on 5 million strings from Wikipedia, where each string is 100 characters long -- so you can get it onto the GPU efficiently, which is where that length comes from -- and you leave it running on your top-end Nvidia GPU for a month and you come back and look at what it's learned? We were hoping that it would learn some words, right? Because there ought to be words there, and it ought to learn some of them. We were hoping it would learn common words. What we didn't expect is that it would learn all the words and would almost never produce anything that wasn't a word, including very rare words. We were hoping it would learn a little bit of syntax, but it learns lots of syntax. We were hoping it could balance parentheses that came near one another. It can balance parentheses that are like 40 characters apart, even though there are balanced quotes inside. And that's because it's not a hidden Markov model that's doing the remembering. It's got 2,000 real numbers as its state, as opposed to a one-of-N choice. So even if it were 2,000 binary numbers, you'd need a hidden Markov model with 2 to the 2,000 states to have as much representational capacity. And this has real numbers, so it's got a lot of capacity in that state. And so really you want to see what it produces. So here's some text produced by just running the model.
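A minimal sketch of the factored transition just described, with shrunken sizes; each character supplies per-factor gains, so its transition matrix is a sum of weighted rank-one matrices, and generation just samples repeatedly from the softmax:

```python
# Multiplicative (factored) recurrent transition: the character gates a set
# of linear filters on the previous hidden state; the gated factor outputs
# are sent back up to form the new state, then a softmax predicts the next
# character. Sizes are shrunk from the 2,000 units / 86 characters in the talk.
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

n_hid, n_factors, n_chars = 20, 20, 86
W_fh = 0.1 * rng.standard_normal((n_hid, n_factors))     # hidden -> factor (linear filters)
W_cf = 0.1 * rng.standard_normal((n_chars, n_factors))   # character -> per-factor gains
W_hf = 0.1 * rng.standard_normal((n_factors, n_hid))     # factor -> next hidden
W_ho = 0.1 * rng.standard_normal((n_hid, n_chars))       # hidden -> predicted character
b_h, b_o = np.zeros(n_hid), np.zeros(n_chars)

def step(h_prev, char_index):
    gains = W_cf[char_index]            # the character picks how much of each factor to use
    f = (h_prev @ W_fh) * gains         # filter the old state, then gate it
    h = sigmoid(f @ W_hf + b_h)         # new real-valued state between 0 and 1
    logits = h @ W_ho + b_o
    p = np.exp(logits - logits.max())
    return h, p / p.sum()               # distribution over the characters

# generate by repeatedly sampling the next character from the softmax
h, c = np.zeros(n_hid), 0
for _ in range(10):
    h, p_next = step(h, c)
    c = rng.choice(n_chars, p=p_next)
```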
You give it an initial ten characters and then you say: predict the next character, and I'm going to sample according to those probabilities. So occasionally I'll be picking something that it thinks is pretty rare, and I just generate. If you always pick the most likely one, after a while it starts saying "the United States of the United States of the United States of the United States," and that's boring. >>: You know why? The number one PageRank value, the top eigenvalue, is the United States [inaudible]. >> Geoffrey Hinton: There you go. Thank you. I thought it was just because it was the most important thing. But... so here's some -- >>: This one used the James Martens method -- >> Geoffrey Hinton: This one is using James Martens' method. >>: That's why it requires -- >> Geoffrey Hinton: It generates this, right. And this is predicting a character at a time. And because it's trained on Wikipedia, you can search Wikipedia for these things. You can see if Wikipedia has "well-paid types of box printer" in it, and no, it doesn't. >>: How many of these strings would you actually have to go back and brute-force look for, to see how many showed up in the corpus? >> Geoffrey Hinton: I did it informally, not properly. I checked informally whether it must be getting this from Wikipedia. It's not there. It's creating these phrases. Most of them are created. And surprisingly, for example, "the mansion house was completed in" might be there. But I bet it wasn't completed in 1882. I bet it was some other year. It's very good at substituting a year. Just occasionally it does something embarrassing, like saying 1882.3. But it doesn't do that very often; normally it doesn't. And look at these phrases. Look: "It is the blurring of a pairing on any well-paid type of box print" -- and it has longer range semantics. So I'll bet the blurring and the pairing and the printer have something to do with each other. It knows you're talking about printing and stuff. Okay. Let me show you one more passage. Now, this was selected from a longer passage; that was one of the nice bits. Here's another nice bit. So again, if you search for "Opus Paul at Rome" I bet you don't find it, but we know Opus and Paul and Rome have a lot to do with each other. Wikipedia probably doesn't have "Arab women's icons." You might have "Arab women's icons," but you probably don't have "now Arab women's icons" and stuff like that. Look here. There's an opening quote, there's a closing quote, and they're like 40 characters apart. Normally it starts a paragraph better than that -- it starts with "The" or -- so it's producing good stuff. >>: So does it really have this capacity for learning long distance relations? >> Geoffrey Hinton: Look, quote and quote. >>: That's why I was surprised. >> Geoffrey Hinton: Everybody in the literature -- there are all these papers that say you cannot learn long distance dependencies with recurrent nets. >>: [inaudible]. >> Geoffrey Hinton: And the issue was: is it because of the optimizer you're using, or is it some eternal truth? And it turns out, retrospectively, it's because of the optimizer you're using, and because you didn't have enough compute power to use the optimizer you need to use. Until these Nvidia GPUs came out we couldn't have done this experiment. Back in the '90s, when people said you can't learn these long range things, if we had started this computation then it would only be one percent done by now. Things are getting faster exponentially. >>: How long -- how deep was your net again?
How many characters was -- >> Geoffrey Hinton: It only sees one character. The rest is hidden state. >>: Infinitely long. >> Geoffrey Hinton: Infinite impulse. As a matter of fact, it never has a history of more than 100, but it could. It's infinite impulse -- >>: In model checking, for example, you can look at finite traces of execution in the program, and by analyzing that, you can come up with invariants that will apply to longer segments of -- >> Geoffrey Hinton: So with English you can do things like -- I mean, it would be fun to give this to an undergraduate linguistics class and say: figure out what it knows. You can type any string and it will tell you how probable it is. It will tell you how probable it is given the context. You have to give a context, and you have to say, given that context, how probable is this string. For example, you can type "a lion is a vegetable" and "a cabbage is an animal," and you type "a lion is an animal" and "a cabbage is a vegetable," and the second pair is about three times as probable. You say, well, that's just associative knowledge. Vegetable was close to cabbage, so it doesn't really understand. But if you say to people, what do cows drink, they say milk. They don't really understand either. They have a lot of associative processing going on, and this has huge amounts of associative stuff. >>: Can you test this with longer and longer distance correlations, to see if it really learned that arbitrarily long quotations have to be balanced? >> Geoffrey Hinton: Yep. >>: You also have, like, the quotes and the parens up there that are not matched. >> Geoffrey Hinton: Yes. There are cases where they're not matched. It doesn't do it perfectly. >>: Have you done a supervised learning layer on top of this to teach it syntax, teach it semantics, with labeled data? >> Geoffrey Hinton: Okay. So the problem is that the labeled data corpuses are too small for this guy. But we did try the part of speech task from the Penn Treebank, and what you do is you train this net forwards, but now you're not training it to predict the next character; you're training it so that when it gets to the end of the word it should predict the part of speech. That's unfair because it can't see the future, so you train one in one direction and one in the other direction. When the backwards one gets to the beginning of the word and the forwards one gets to the end of the word, those two states are used to predict the part of speech. The best systems get about 2.8 percent error and this thing gets about 3.5 percent error. So it's good, but unfortunately it didn't beat the best systems -- we're not there yet. But it's pretty good. It can be trained to understand parts of speech pretty well. There's parts of speech I can't do, and they're full of things like gerunds. I don't know what a gerund is. You can do little experiments on it. So here's some experiments. You make up a nonsense word. You type "Sheila thrunge" and ask what the most likely next character is. It says S. You type "People thrunge." It says the most likely next character is a space. You'd really need to do it for lots of pairs, to get statistics, but I believe it understands that Sheila is singular and people is plural. Then I tried to fool it. I gave it "Sheila, Thrunge" with a capital T. And the first time I gave it that, it said "Thrungelini del Rey." That's when I really started believing in this system, because it's obvious Thrungelini del Rey is an exotic filmmaker with an Italian mother and a Spanish father, currently in Switzerland. It knows a lot about proper names, a huge amount.
If you give it your name, it will know which order it goes in. If you give it eight words, and you say try all 40,000 orders and tell me which is most likely, the most likely order will be the sensible one. So it can do that. Most people doing natural language with machine learning are busy converting it into bags of words. We've got the inverse operation that takes a bag of words and converts it back into text, but only for eight words. >>: Is that in Wikipedia somehow? >> Geoffrey Hinton: Thrungelini Del Rey? No, I checked. There's no Thrungelini Del Rey. There will be now. [laughter] I sometimes call the program Thrungelini Del Rey, because I like the name so much. I tried typing "the meaning of life is", hoping to get 42. But 42 wouldn't be that interesting, because that's probably in Wikipedia. And you usually get garbage. You sometimes get quite interesting garbage. On the sixth try I got this. I thought that was amazing, because it knows all these weird words. And so I don't know how to say what it knows. It knows a huge number of words. It rarely produces a nonword. When it does produce a nonword, sometimes you know it's a nonword, like "interdistinguished" -- it produced that, and that's a nonword. But "ephemerable" -- you know, would you like to make your worst employees more ephemerable? [laughter] And it produces "parled", P-a-r-l-e-d, which I'm fairly sure isn't an English word, but it really could be. >>: Parlayed. >> Geoffrey Hinton: Parlayed is. And you know what it means, almost. It's got lots of syntactic knowledge. But it's not in clean rules of any kind, and it's all mixed up with the semantic knowledge. So it has no problem with things like "budge". You can have things that won't budge. But budge goes with a "not". You don't have things that like to budge; you have things that won't budge. And that's no problem for this kind of system. It only produced Wittgenstein once in the text I've seen it produce. That was soon after it produced Plato. It knows those guys have something to do with each other. I'm finished, I think. Yes. [applause] >>: So if you were to use the old-style training for a recurrent network, Tony Robinson style, it probably would produce -- >> Geoffrey Hinton: It won't work very well. If you want to see how badly it works, look at Jeff Elman's paper. >>: So you mentioned that having the space of real [inaudible] might hurt your reputation [inaudible] -- is it worth going back to the image and speech problems and using stacked RBMs with linear units or [inaudible] rather than binary? >> Geoffrey Hinton: No. The reason is this. This was a deterministic model. Those models are stochastic generative models when you're doing the training. And if you force units to be binary, it's a strong regularizer; if you allow them to have real values, it doesn't model things nearly as well, because they use those real values, right? If I force a feature to be binary, it had better be a sensible feature, so that when I turn it on it does something sensible. If I allow it to be real-valued, then you can have a whole bunch of them with different real values all conspiring to produce something -- this guy cancels that part of that guy, and that guy cancels a little bit of that guy -- and it overfits terribly. So I've tried using real values in these other nets and it doesn't work nearly as well on the whole. Although these rectified linear units work. So maybe real values and some noise are a good idea. But you need to put noise in to regularize it.
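A toy sketch of the bag-of-words-to-text inversion mentioned above: score every ordering of the words under the language model and keep the best one (8! = 40,320 orderings for eight words). The log_prob scorer here is a crude placeholder; in the real setup the score would be the character-level model's log-probability, as in the earlier sketch, and with this placeholder the chosen ordering is essentially arbitrary.

    import itertools

    def log_prob(text):
        # Crude placeholder scorer that rewards a few common English letter pairs;
        # the real score would come from the trained character-level model.
        common = ["th", "he", " a", "ed ", "d t", "e d"]
        return sum(text.count(pair) for pair in common)

    def unscramble(words):
        # Brute-force every ordering of the bag of words and return the one the
        # scorer finds most probable.
        best, best_score = None, float("-inf")
        for perm in itertools.permutations(words):
            candidate = " ".join(perm) + "."
            score = log_prob(candidate)
            if score > best_score:
                best, best_score = candidate, score
        return best

    print(unscramble(["the", "dog", "chased", "a", "red", "ball", "quickly", "big"]))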
>>: So is that -- what about adaptation for this model, like giving it Bill Gates' speeches or lectures and getting an adapted model -- >> Geoffrey Hinton: Okay. So with this kind of model you can very easily do the following. You take all its parameters and you freeze them, so it knows what it knows. You then have a whole new set of parameters, and they have the property that they learn fast and they have a huge amount of weight decay: they hate being away from 0. So they'll learn a little overlay on the model. So you can learn the Bill Gates overlay on top of Wikipedia. Now it will produce stuff that has all its background knowledge in there, but the overlay makes it much more likely to produce words like Microsoft and profit and stuff like that. And charity. Better be careful here. So you could do that for Shakespeare, for example. Is there enough Shakespeare to train this? But you could train it on the whole of the New York Times or something like that -- that's basically English -- and then on top of that you can train it on Shakespeare. And then it should start producing stuff like Shakespeare, if you make that sort of Shakespeare overlay. And I think that's the way to do it with smaller corpuses. In fact, I have a name for the paper when that works. The name for the paper is going to be "It only takes one monkey". >>: Actually, next, I would do what Sandia did for machine translation, which is: take the Bible and train on it in English. Train on it in another language of your choice, as a separate model. And now, because they're perfectly labeled against each other -- which sentence is which sentence -- you put them against each other to do automatic machine translation. >> Geoffrey Hinton: So basically they're learning the mapping between the hidden states? Right. >>: Now you can put English in one end and get Italian out the other end, or vice versa. >> Geoffrey Hinton: Can I use Das Kapital? I think that's been translated into lots of different languages. >>: The Bible is a big corpus, is all I'm saying. >>: I didn't quite get the whole thing. To train these two things, how do you get -- >>: Then you use supervised learning between them. >> Geoffrey Hinton: The hidden state -- when you get to the end of a sentence, the hidden state contains lots of information about what was in that sentence, plus information from the previous ones. So if you train on English and you train on French, and you have one-to-one sentences like Hansard, those hidden states had better have something semantic to do with each other. Now you try and learn the mapping separately. >>: You're saying -- >> Geoffrey Hinton: Some mechanism that maps one to the other, maybe a neural net. >>: Sandia used eigenvectors for this. They were able to do basically 50 languages, where they could throw something in and get something out, provided the sentence was close to something in the Bible. They were not able to get down to this level, the letter level, and predict the next character, and whether you can apply -- >>: The internal model of what's going on could be completely arbitrary. You've got a 2,000-dimensional real-valued space, so you can afford to put in lots and lots of little attractors. >> Geoffrey Hinton: Exactly. >>: But that makes it seem very unlikely that you'd be able to map the almost random distribution of English attractors to the random distribution of French attractors with any kind of low-dimensional mapping.
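A minimal sketch, under stated assumptions, of the frozen-base-plus-overlay adaptation described above: the base weights never change, while the overlay weights get a higher learning rate and strong weight decay pulling them toward zero. The single linear layer, the hand-supplied gradient, and the learning-rate and decay values are all illustrative, not taken from the actual model.

    import numpy as np

    rng = np.random.default_rng(0)

    # Frozen base weights, standing in for the model trained on the big corpus
    # (e.g. Wikipedia); random numbers here just to make the sketch run.
    W_base = rng.normal(size=(20, 10))

    # Overlay weights for the small adaptation corpus (the "Bill Gates overlay"):
    # they start at zero, learn fast, and are pulled back toward zero by decay.
    W_overlay = np.zeros_like(W_base)

    def forward(x):
        # The effective weights are the frozen base plus the small overlay.
        return x @ (W_base + W_overlay).T

    def adapt_step(x, grad_out, lr=0.1, weight_decay=0.3):
        # One update of the overlay only; W_base is never touched.
        global W_overlay
        grad_W = grad_out.T @ x                 # gradient from adaptation data
        grad_W += weight_decay * W_overlay      # strong pull toward zero
        W_overlay -= lr * grad_W

    # Toy usage: one adaptation step on a random batch with a made-up gradient.
    x = rng.normal(size=(4, 10))
    adapt_step(x, grad_out=rng.normal(size=(4, 20)))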
>> Geoffrey Hinton: The hope is that some bits of that 2,000-dimensional vector will be remembering that it's inside an open quote and an open paren -- just a few bits. Other bits will be remembering that we're in the past tense. Other bits will be remembering it's plural. Other bits -- this is in the ideal world -- other bits will be remembering that we are talking about sort of pre-priced stuff and -- >>: Why [inaudible]. >> Geoffrey Hinton: The semantics should match each other. >>: The semantics should match, that's the point. If the semantics is big enough it should override the syntax. >> Geoffrey Hinton: And it will learn which bits of that vector are the semantics and which bits are the syntax, and ignore the syntax. >>: You're using subsets of dimensions, which you call bits, for different functions. It's not clear to me what I need -- >> Geoffrey Hinton: Well -- >>: I don't know if it would work at all, right? >> Geoffrey Hinton: It's a very good suggestion, yes. Because we -- this is an amazing model. And one thing you can do is put it up against other character predictors, like Markov models. And these Markov character predictors, particularly the ones that use mixture models, are very good. This is about as good as the best one, but it makes very different errors. So one thing we're going to do is average this with the best Markov model, and I think we will get a better model, because those can't balance parens, for example. And this kind of thing is the way to really convince people -- >>: This is on one GPU, sitting for a month. The stuff I'm talking about from the labs -- they ran it on Roadrunner with all banks going for a month, one of the biggest supercomputers in the world. So that you could even get anywhere close, have a result that you're trying to match -- there's the possibility to set up an experiment that would show how you do this on commodity hardware, basically. That would be interesting. >> Geoffrey Hinton: Okay. [applause]
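Finally, a small sketch of the averaging idea raised in the exchange just above: interpolate the next-character distributions of the recurrent model and an n-gram (Markov) character model. Both predictors are hypothetical placeholders here; the point is only the mixing step, and the mixing weight would be tuned on held-out text rather than fixed at 0.5.

    # Placeholder predictors: each returns a dict of next-character probabilities
    # given the context. In practice these would be the recurrent net and the
    # best Markov (n-gram mixture) character model.
    def rnn_probs(context):
        vocab = "abcdefghijklmnopqrstuvwxyz ."
        return {c: 1.0 / len(vocab) for c in vocab}

    def markov_probs(context):
        vocab = "abcdefghijklmnopqrstuvwxyz ."
        return {c: 1.0 / len(vocab) for c in vocab}

    def mixed_probs(context, weight=0.5):
        # Linear interpolation of the two distributions; since both inputs sum
        # to one, so does the mixture.
        p1, p2 = rnn_probs(context), markov_probs(context)
        return {c: weight * p1.get(c, 0.0) + (1 - weight) * p2.get(c, 0.0)
                for c in set(p1) | set(p2)}

    print(sorted(mixed_probs("the united state").items())[:5])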