>>: So today it's our great pleasure to have Geoffrey Hinton come back to our group
again. Last year he spent about a week with us, and we made a tremendous amount of
progress on some of the research that people working on speech have been pursuing
since.
So today you are going to hear some even more exciting ideas that he will talk about.
So I'm not going to give an introduction except to let you know that
some of the things he's doing are very exciting, and I'm going to give the floor to Geoff.
>> Geoffrey Hinton: Thank you. First of all, I'd like to thank the Speech Group for giving
me some money and also Eric Horvitz for giving me some money. That was all very
nice.
I'm always open to more.
[laughter]
I'm going to spend about a quarter of an hour explaining what deep learning is, and
many of you here know this stuff already. So you should wake up after about a quarter
of an hour. Then I'll spend about a quarter of an hour describing how it's applied to
speech, and about a quarter of an hour describing how to apply it to generating images that
look like natural images, which nobody seems able to do. And the very last part of the
talk I'll describe the amazing model produced by two students James Martens and Ilya
Sutskever that takes a string of 500 million characters from Wikipedia and learns how to
predict the next character.
And does an amazing job of generating text after that. It does a job that's much too
good and I don't understand -- I kept thinking there must be a bug. You must be
cheating but we'll come to that at the end.
So back in the 1980s people invented the back propagation algorithm. And you have
multiple layers of neurons, and you can learn the weights on the connections and you
can learn weights like this that determine what kind of a feature detector that is and what
kind of a feature detector this is. We thought we could now solve everything because we
have multiple layers of nonlinear feature detectors.
It was quite good for quite a few things like reading checks, for example. But it had
some serious limitations. And the main limitations were this: It required lots and lots of
labelled training data. Back then it was very hard to get. The learning time didn't scale
well for deep networks. And deep networks were the most exciting ones, because they
could learn lots of nonlinearities. And it could get stuck in poor local optima, which might
be quite good, but it turns out you can find much better ones if you learn in a slightly
different way.
So we're going to try and overcome the limitations of back propagation by learning a
generative model instead of a discriminative model. That is, instead of learning to
produce a label given an image, we'll start by learning to model images. We'll try to build
a multi-layer generative model. That will have lots of features in it for generating images.
Those features which learn to be good at generating images we will then use for
discrimination. Many of them will be irrelevant. But the ones that are relevant should be
very good. It turns out that's a much better way to do discrimination. In a sense, we'll try
to learn to do computer graphics before computer vision. And it turns out that's what gives
you the best model.
Now we're going to make a bunch of apparently bizarre decisions. The historical
justification for them is that I gave up on this problem and went and worked on something
else, and then discovered that the other thing I was working on was a solution to this problem.
So we're going to learn a generative model that has just one layer of hidden variables,
which doesn't seem very good if you want to learn deep networks.
What's more, we're going to make it an undirected generative model; that is, it's going to
have hidden variables which are features and visible variables which will be pixels. But
the relation between the two is symmetric. So pixels cause features but also features
cause pixels. And there's nothing really to distinguish the features and the pixels in the
model.
And we're going to make everything binary. They'll all be Bernoulli variables, which
seems like a bad idea if you want to deal with the real world. We'll make these three
decisions and gradually unpick them.
So the units we're going to use are these stochastic binary units. They get a total input: their bias
plus the states of the other units times the weights on the connections. And they output
1 or 0 probabilistically.
And we're going to arrange them in a bipartite graph where you have visible units, where
you put your observations, and hidden units, where you're going to learn features.
And you have undirected connections. This is governed by an energy function. It has
the nice property that if you give me a visible vector I can immediately write down the full
probability distribution for these hidden units. They're all conditionally independent, and I
can compute the probability of each of them.
And since they're conditionally independent, that gives me the full distribution. I can sample from the
distribution easily. So it's very good for doing inference. And initially you think it's going
to be hard for learning. But it turns out there's a quick way to do learning.
So in that restricted Boltzmann machine, there are two conditional distributions that are
relevant. There's the probability of hidden unit J turning on given a visible vector. And
that's just the logistic function of the bias of J and the sum of the inputs it gets from the
visible units that are on times the weights of the connections.
And then there's something which is just the same the other way around which is the
probability of a visible unit turning on, given the hidden vector. That's what you would
use to generate from a hidden vector. It's just the same the other way around.
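To make those two conditional distributions concrete, here is a minimal numpy sketch of them, assuming binary visible and hidden units; the variable names and shapes (W, b_v, b_h) are mine, not notation from the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# W: (num_visible, num_hidden) symmetric weights; b_v, b_h: biases.
def sample_hidden(v, W, b_h, rng):
    # p(h_j = 1 | v) = logistic(b_j + sum_i v_i w_ij)
    p_h = sigmoid(b_h + v @ W)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible(h, W, b_v, rng):
    # p(v_i = 1 | h) = logistic(b_i + sum_j h_j w_ij) -- same weights, other direction.
    p_v = sigmoid(b_v + h @ W.T)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v
```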
Now, using those two conditional distributions, you can do the following: You can put a
visible vector in here. And now you can compute the probabilities of all these hidden
units and then pick their states according to those probabilities so you get a binary
vector. Then you can do the same here. Now you're reconstructing that visible vector
from the features. And you just keep going until you have forgotten where you started.
That's called the stationary distribution.
And you measure the pairwise statistics of a pixel being on a feature detector being on
here. Then you measure the same pairwise statistics there. And that difference in
pairwise statistics is exactly the derivative of the log probability of this input vector here,
with respect to the weight on this connection. That is, you can take this model and you
can run it backwards and forwards like this until it's generating fantasies.
You can look at this distribution. That's the model's distribution. And if I show it some
vectors here, what I want is for the model to have the same distribution as the data. So if
I generate from the model I get the data distribution. And this is the derivative of the
probability that the model will generate this visible vector here.
And it's a nice simple form. It depends only on things to do with that connection. Even
though this derivative actually depends on all the other weights in the network. But that
dependence shows up in these correlations here.
Now, with that learning algorithm you have to decide how long to keep going for. And after a very
long time, after 17 years in fact I discovered you don't actually need to do that. You just
do this.
You go up. And you come down and you go up again. And you take this difference.
Now, this isn't following the derivative of the log probability of the data. It's not maximum
likelihood learning. What it's doing is taking the data and representing it with the
features. It's then reconstructing from the features something the features would like to
believe in. So this is the data polluted by the beliefs of the model. But only slightly
polluted. You measure statistics with the data, and statistics with this data that's been polluted
by the model's beliefs, and you'd like them to be the same, because then your model
isn't polluting the data.
So intuitively you take this difference and do learning, and, hey, it works.
It's following the gradient of another objective function, almost. So it's not doing maximum
likelihood. It's doing something more like the gradient of this other function, but it's not even
doing that exactly. But the main thing is it works and it's quick.
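A sketch of that quick learning rule, one up-down-up pass and the difference of the pairwise statistics, reusing the sampling helpers from the sketch above; the learning rate and the use of probabilities rather than binary states in the statistics are conventional choices, not something stated in the talk.

```python
def cd1_update(v_data, W, b_v, b_h, rng, lr=0.01):
    # Up: drive the features with the data.
    h_data, p_h_data = sample_hidden(v_data, W, b_h, rng)
    # Down: reconstruct the data from those features.
    _, p_v_recon = sample_visible(h_data, W, b_v, rng)
    # Up again: features driven by the reconstruction.
    _, p_h_recon = sample_hidden(p_v_recon, W, b_h, rng)

    # Difference of pairwise statistics: <v h> with the data minus <v h> with the reconstruction.
    n = v_data.shape[0]
    pos = v_data.T @ p_h_data
    neg = p_v_recon.T @ p_h_recon
    W += lr * (pos - neg) / n
    b_v += lr * (v_data - p_v_recon).mean(axis=0)
    b_h += lr * (p_h_data - p_h_recon).mean(axis=0)
```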
So we can train a restricted Boltzmann machine, and we can give it binary data, and we
can train it to learn lots of features that represent that data.
Once we finish training, if you run the model back and forth, it will produce stuff that
looks like the data. The reason it's interesting is because we can now stack these things
up. With these undirected models it's very easy to stack them up. What you do is you
first train one of these restricted Boltzmann machines and it learns a layer of features,
the model of the data. You then take the activations of those features and you say that's
data now. We've learned the first layer. Let's learn another layer that models the
correlations of those features.
And we keep going like that. And we can guarantee that unless you've already got a
perfect model, each time you add another layer you improve your model of the data in
the following sense: There's a variational bound on how well you're modeling the data.
And if I add another layer in the right way, that variational bound is guaranteed to
improve.
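The stacking procedure itself is simple; here is a sketch, with train_rbm standing in for a loop over the contrastive divergence updates above. The point is just that each layer's feature activities become the data for the next restricted Boltzmann machine.

```python
def train_stack(data, hidden_sizes, rng):
    """Greedy layer-by-layer training: learn one RBM, then treat its feature
    activities as the data for the next RBM."""
    rbms, layer_input = [], data
    for n_hidden in hidden_sizes:
        rbm = train_rbm(layer_input, n_hidden, rng)   # hypothetical helper running cd1_update
        rbms.append(rbm)
        # Drive the features with the data and use their activities as the next 'data'.
        _, layer_input = sample_hidden(layer_input, rbm.W, rbm.b_h, rng)
    return rbms
```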
So there's something that's getting better when we add more layers, and that's our
kind of security blanket. There really is, underlying the math, something that's improving as
we add more layers. We then violate all the assumptions you need to prove that, which
is sort of normal practice, right? And then we just get on with it and learn lots of layers,
with the security that behind all this there's something sensible we're doing.
And now we're just getting on with it.
So if, for example, we had some data here and we learned a layer of binary feature
detectors and we took the activities of those when they're being driven by data and
learned another layer and we took the activity of those and learned another layer, we
then have these three restricted Boltzmann machines and we want to put them all
together into one model.
And what's surprising is the model that we get is not a great big Boltzmann machine.
That's what everybody thinks, and it's not. The top two levels are the last restricted
Boltzmann machine. So that's a little Boltzmann machine, a bipartite undirected model.
But when we put them together, what we're really doing is this: the lowest level model has a
sort of prior here, and we're substituting the higher level model for that prior, and we keep
doing that.
And we end up -- the only thing we keep from the lowest level model is the P of V given
H. From this model, the only thing we keep is the P of this layer of hidden units given the layer
above. We end up with a model with a restricted Boltzmann machine at the top, like the first,
and underneath that it's really a belief net. It's a directed model.
If you want to generate from this model you go backwards and forwards here forever
and then you go clunk, clunk once and then you can see what the model believes.
Okay. So that's why it's called the deep belief net. And it has this undirected bit at the
top. But all the layers below are belief net kind of stuff.
So we accidentally discovered how to learn deep ones of these. We were working on the different
problem of learning a restricted Boltzmann machine. And then after I realized you could
stack them and do nice things, eventually a very smart student called Yee-Whye Teh realized that's
really what you're doing: you're producing a belief net this way.
Now, after you've done all that, you can then throw away the belief net you produced
and say really I produced all this stuff. When I view it as a belief net the connections go
this way. But it's also a belief net in which I can do very fast inference by using the
connections backwards.
What I'm going to do now is just treat it as a feed forward neural network. Good
old-fashioned feed forward neural network. I just happen to have pretrained all these
feature detectors without looking at the answer. That's good because that means I don't need
the answer to learn all these detectors.
If I've got lots of unlabeled data, this is great. So having learned all that, all these layers
of feature detectors, we then stick our label units on top. And then we just use standard
back propagation to fine tune it. This works quite a lot better than standard back
propagation if you don't do all this prelearning.
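A sketch of that fine-tuning stage: copy the pretrained weights into an ordinary feed-forward net, add a label layer that is not pretrained, and run standard backprop. I use PyTorch here purely for the automatic differentiation; the rbms objects are assumed to carry the W and b_h arrays from the sketches above.

```python
import torch
import torch.nn as nn

def build_finetune_net(rbms, num_labels):
    """Copy pretrained RBM weights into a feed-forward net; add label units on top."""
    layers = []
    for rbm in rbms:
        lin = nn.Linear(rbm.W.shape[0], rbm.W.shape[1])
        with torch.no_grad():
            lin.weight.copy_(torch.as_tensor(rbm.W.T, dtype=torch.float32))
            lin.bias.copy_(torch.as_tensor(rbm.b_h, dtype=torch.float32))
        layers += [lin, nn.Sigmoid()]
    layers.append(nn.Linear(rbms[-1].W.shape[1], num_labels))  # label layer, not pretrained
    return nn.Sequential(*layers)

def finetune(net, loader, epochs=5, lr=0.1):
    """Standard backprop fine-tuning; the pretrained features only get nudged."""
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(net(x), y).backward()
            opt.step()
```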
So one reason it works better is it goes faster. Once you turn the back propagation
loose, you get sensible gradients. And you're just doing a local search. You're sort of
fine tuning it. It's not really changing the feature detectors much. It's just changing them a
little bit to get the category boundaries in the right place.
It also generalizes much better. And that's because most of the information in the
feature detectors in all those weights comes from trying to model the input. The input is
typically a big rich thing with lots of information in it.
You have about 10 to the 14 weights you need to set. And you only live about two times
10 to the nine seconds. You need to be setting about 10 to the five weights per second.
And your mother would have to give you at least 10 to the 5 bits per second of
instruction to do that by discriminative training. And she doesn't.
The only place you're going to get 10 to the 5 bits per second, which is what you need,
is sensory input. That's the only place where there's enough information to determine all
these parameters. So there's not enough information in the DNA.
>>: [inaudible]
>> Geoffrey Hinton: Right. But that's why I said there's not enough information in the
DNA. Okay? That's only three billion bases. Negligible compared to what you
need.
Because most of the information in the weights comes from trying to model the input, it
generalizes much better. You're not limited by not having very many labels.
Now, one thing wrong with it was we started off developing it for binary data. And most
of the things you want to apply it to, like speech and vision, have real values. So we're
going to modify the model just slightly. We change the underlying energy function, and
the result of changing the energy function is that the rule for setting the states for the
hidden units is basically just the same, except that the activity of a visible unit needs
to be measured in units of its standard deviation. We'll have linear [inaudible] units just
like in factor analysis. They'll have their own noise levels, that's the standard deviation
of the noise level of a unit.
That's the measurement noise in factor analysis. And you have to measure the V in
units of standard deviation. Because that's the sort of log probability stuff.
Then when you come down again it's quite different, because these are linear units. So
the way you determine the state of a visible unit, that shouldn't say P of V equals 1. That
should say P of V. The way you determine the real value state of a visible unit is it's a
Gaussian. The unit learns its own mean. But then the top-down input coming from the
hidden units causes an offset to that mean. And the magnitude of the offset again
depends on the noise in that visible unit. If it's very tight, you need a lot of top-down
input to move it over. If it's got a very sloppy high variance, you don't need much
top-down input to move it over.
Then this is the noise level in the unit. So those two conditional distributions go like that.
And now we can learn with real values, where this is a real value here.
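A sketch of the two conditional distributions for this real-valued version, assuming the usual Gaussian-Bernoulli energy function: visible activities are divided by their standard deviations on the way up, and on the way down each visible unit is Gaussian with a top-down offset scaled by its own sigma. Names are mine; sigmoid and the rng are as in the earlier sketches.

```python
def sample_hidden_gauss(v, W, b_h, sigma, rng):
    # The visible activity is measured in units of its standard deviation.
    p_h = sigmoid(b_h + (v / sigma) @ W)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_visible_gauss(h, W, b_v, sigma, rng):
    # Each visible unit is Gaussian: its own mean plus a top-down offset
    # whose size is scaled by that unit's noise level sigma.
    mean = b_v + sigma * (h @ W.T)
    return mean + sigma * rng.standard_normal(mean.shape)
```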
So we applied it to speech recognition. Two of my grad students, George Dahl and
Abdel-rahman Mohamed, applied it to recognizing speech on the TIMIT database.
And normally when you do speech recognition you have hidden Markov models that
model the sequence of phonemes. And you have to relate them to some representation
of the acoustic input, which is typically the Mel Cepstral coefficients, which a long
history of development in speech recognition said were good things to use.
Plus first and second derivatives. And then what you do is this: each node in the hidden
Markov model has a Gaussian mixture model for trying to fit these guys, and we're going
to replace that by something that goes in the other direction. It's going to be a feed
forward neural net.
This was done in the late '80s or early '90s by people like Nelson Morgan and Tony
Robinson and Yoshua Bengio: you replace it by something that takes the Mel
Cepstral coefficients and tries to predict the probabilities of these states of the hidden
Markov model.
But we used a very deep net. So all the people doing that used fairly shallow nets. It
works better if you use a deep net.
In fact, George discovered the net that works best is here's your input representation of
13 Mel Cepstral coefficients plus the first temporal derivative and second temporal
derivative, and he then puts that into 2,000 binary hidden units. That's a lot. Then
another 2,000. So there's four million parameters right there.
Between every pair of layers there's four million parameters. And he pretrains all this
stuff. The thing that worked best was seven layers pretrained like that. Then he puts
183 labels on top, because there are 61 phonemes and each one has three hidden Markov
model states, and you're trying to predict which hidden Markov model state. But that's
not pretrained. Notice that's far fewer parameters. Then you fine tune it with backprop. At
the point at which he did that, the best result on the standard TIMIT database was
24.4 percent phone error rate.
And he got 22.2 percent phone error rate. That's if you count the silences, which everybody
does because they're easy. At first we didn't count the silences and we looked a little
better than this; then we realized everybody counts the silences, so we did too. So that was
nice.
>>: [inaudible].
>> Geoffrey Hinton: No, Robinson's number would be about 26.3 or something like that.
That was the best recurrent net. This is actually averaging a whole bunch of different
models. Now, at about the same time as we did this, IBM started taking its large
vocabulary speech stuff and applying it to TIMIT. If they do it speaker independent they
get a similar result as this. It's not better. But in a minute we're going to make this
better.
So there's something -- well, the first thing is speech people say, ha, but things that work
on TIMIT don't work on big vocabulary. Does this work on big vocabulary? So George
came here, worked with the Microsoft people on Bing voice search, and tried it on big
vocabulary.
And if you train on 24 hours of speech, they get 62 percent correct using a Gaussian
mixture model. And the deep net gets 70 percent correct. Now, this isn't entirely fair,
because the Gaussian mixture model you can train on much more data. If you train it on
a thousand hours of data which is what they actually do you get performance that's very
comparable with this. There's all sorts of little things you can change. It's sort of about
this level, maybe slightly worse.
Which suggests that if we could train this thing on a thousand hours, we'd wipe them out.
But at present we can't and we're thinking of ways trying to train it on a thousand hours.
There's one embarrassing thing about that nice model, which is that in the underlying
math, you have these visible Gaussian units that have their own measurement noise
level. Sort of the residual noise. That should be the residual prediction error when you
try and predict what they're up to.
And we took the data and normalized it so it had a variance of one. But to learn this model
we also used a residual noise with a variance of one, which is crazy. We're
saying you can't predict the data at all. Now, it works. But when we tried actually
learning this it didn't work at all. I spent about a year trying to understand what the hell
was happening, why we couldn't learn it.
And throwing more and more graduate students onto the funeral pyre, you know what I
mean?
[laughter]
And you really can't learn it. Even if you're hyper-Bayesian about it, it's still tough to learn.
>>: MSR Cambridge, they have a recent -- they came over to discover --
>> Geoffrey Hinton: Yeah, they have a complicated way of trying to do it, right?
>>: Successful?
>> Geoffrey Hinton: It was somewhat successful but they didn't really understand what
the problem was. We have now solved the problem we can learn it much better. So I'll
show you what the problem is.
Remember when we got the two conditional distributions? We are dividing by the
standard deviation of the residual noise here, and we're multiplying by it there.
So the top-down -- so the bottom-up effects you divide by it. And the top-down effects
here you multiply that. So if you draw a picture of what's happening, the effect of
changing this standard deviation is this: if you make it small, it effectively makes the
bottom up weight very big and it effectively makes the top-down weight very small.
Suppose we make it a tenth, which is about the right level.
These weights are ten times too big and these are ten times too small. And the result of
that is these hidden units all get saturated very early and have no flexibility to learn left.
And these guys never get enough top-down oomph to actually drive them to their means,
so they're kind of underdriven all the time. That's sort of an embarrassing problem and
there's a simple way to fix it.
What we're going to do is we're going to have an infinite number of hidden units. This is
actually good for getting NIPS papers accepted because you have to have infinite in the
title.
So we make infinitely many copies of each binary hidden unit. They all have the same
weights and they all have the same bias. So there's no more parameters.
But they have offsets in their biases. So their thresholds get progressively higher like
this. And so if you provide this thing with some total input X, the ones way up here won't
actually turn on. So you don't really need infinitely many. But you don't know in
advance how many you need because it depends on how big X is.
And now if you make X 10 times bigger, 10 times as many of these will turn on, so now
you're in good shape. Because now if you make this input 10 times bigger you get --
sorry. If you make this be a tenth, that makes the bottom-up input 10 times bigger, 10
times as many of these turn on, and that makes the top-down effect 10 times as big, which
cancels out that tenth there. So we're back in business.
Of course, we do need this infinite number of these binary units.
>>: Can you fix the integer function to begin with, to reconcile the difference between --
>> Geoffrey Hinton: It amounts to doing this.
>>: The same thing?
>> Geoffrey Hinton: So it turns out that if you take an infinite number like that and you
compute this infinite sum here, that thing is exactly the same as this thing. That is, in the
limit, as the steps in your thresholds get small and you scale everything
down, they become the same. But with steps of one, they're so accurately the same that if
you plot the two you think you've forgotten to plot one of them, because you only see one
line. So it looks like this. So basically we can model those guys very well with
something like this, and we can model this with sort of a linear threshold thing with some
noise in it.
And it's a pretty good model of this if you make the noise level depend on the logistic of
this. So when this is big, the variance of the noise is one. When it's big and negative,
there's no variance in the noise. And right about here is where you're getting -- you get
this curve thing by adding noise, basically.
So in Matlab, that's what we actually do. You say the output of one of these units is the
max of 0 and X plus some Gaussian noise whose variance is the logistic of X. So that's
what you plug into your program and now you've got something efficient again. We call
these rectified linear units. And that works really nicely.
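Here is a small numeric check of the claim that the stack of offset binary units adds up to essentially log(1 + e^x), plus the rectified-linear-with-noise unit exactly as just described; the 200-copy truncation is only for the check, and names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Binary copies with biases offset by -0.5, -1.5, -2.5, ... have an expected
# total activity that sits right on top of log(1 + e^x):
x = np.linspace(-5.0, 5.0, 11)
stacked = sum(sigmoid(x - i + 0.5) for i in range(1, 200))   # 200 copies is plenty in this range
print(np.max(np.abs(stacked - np.log1p(np.exp(x)))))          # tiny: the two curves nearly coincide

def noisy_relu(x, rng):
    """Rectified linear unit as described: max(0, x + Gaussian noise whose
    variance is the logistic of x)."""
    noise = rng.standard_normal(np.shape(x)) * np.sqrt(sigmoid(x))
    return np.maximum(0.0, x + noise)
```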
So then we can turn it loose on speech and in fact now that we know how to learn these
models with binary hidden units, these rectified linear hidden units and real valued
Gaussian visible units, we can learn our own representation of speech. I mean, people do this
Fourier analysis. So there's a lot to be said for Fourier analysis. Except that speech
isn't really cyclic like that. It makes all sorts of assumptions that aren't quite true. And
people developed these Mel Cepstral coefficients, but Mel Cepstral coefficients were
developed when people had small computers and the main problem was to get rid of
most of the data so that it would fit in your computer.
And also to blur it in a way that blurred together things that should be classified together.
But we've got much more powerful models now. We don't want to blur it. We want to
see what's going on.
So we're going to actually do something that sounds crazy. We're going to take the raw
speech wave, 100 samples from the raw speech wave. This is 120 of these rectified
linear hidden units. And we're just going to learn how to model a tiny little piece of
speech wave. This is going to be 6.25 milliseconds at 16 kilohertz. And we train it just
the way we train these standard Boltzmann machines, and we're going to get some
rectified linear features here.
The question is: What are they going to look like? So you get a lot that look like what
you expect, which is little sort of wavy things. You might expect to find different
frequencies like this. But you also get guys like this. This guy is kind of cute. He's got
one low frequency here and superimposed on it he's got another high frequency.
And that's sort of right there in the front end of the system. And so if you take his Fourier
transform, it looks like this. It's got lots of energy here, lots of energy here. That's
actually detecting a vowel. Those are the first two formants. If you take real speech
data, a lot of the time people are saying vowels. So the data has a lot of these things in
it where you have two formants, and what's more, this thing
goes and detects them directly.
>>: So if you were to use the conventional, the Neuvy [phonetic] unit you wouldn't be
able to get this.
>> Geoffrey Hinton: You can't get this. You have to learn the variances to get this. You
have to learn tight variances. You have to basically learn a proper model to find these
things.
>>: Strictly --
>> Geoffrey Hinton: The trick I showed you last year was a kludge, before I really figured
out what was going on. And next year I'll probably say this was a kludge.
Okay. So now we can apply it to TIMIT. And we do slightly better than Mel Cepstral.
>>: Couple things. I assume you don't actually use the Fourier analysis shown on the slide;
you're just using the features directly.
>> Geoffrey Hinton: Yes, the Fourier transform there was just for analysis of what these
features represented in conventional terms.
>>: One of the points of the cepstrum is to remove the pitch information, so now you're
frequency dependent with what you've just learned.
>> Geoffrey Hinton: Absolutely.
>>: Are you going to do the equivalent trick make it homomorphic to remove that?
>> Geoffrey Hinton: No, our idea is you don't want to remove information you want to
model it. And if you've got lots and lots of layers of features and a big computer, you can
afford to model all sorts of stuff that might turn out to be useless because you're building
a generative model.
>>: But as long as you go across multiple speaker types, multiple frequencies then you'll
learn the features that are necessary.
>> Geoffrey Hinton: Exactly. So now what we do is we actually now advance this
window by just one sample. And do the whole analysis again. Okay. So we are not
going to lose much this way. We have a hugely overcomplete representation. 20 times
overcomplete. And then what we do is we look at these outputs. Look at the output of a
hidden unit on all these different little windows that are advanced by one sample, and
pool its output over 160 frames.
So you're really saying did this feature get active anywhere in there. But we will make
these pools overlap. And now we take that stuff and build more hidden layers on top.
And, hey, presto, we do a little bit better than Mel Cepstral. About one percent better.
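A sketch of that overcomplete front end: apply the learned filters to windows advanced by one sample, then pool each feature over 160-frame stretches. The talk doesn't say which pooling operation or pool overlap was used, so the max and the 80-frame stride here are assumptions, and the deterministic rectification stands in for the noisy units.

```python
def filter_responses(wave, W, b_h, window=100):
    """Apply the learned filters to every 100-sample window, advancing the
    window by one sample each time (hugely overcomplete)."""
    n = len(wave) - window + 1
    frames = np.stack([wave[t:t + window] for t in range(n)])   # (n, 100)
    return np.maximum(0.0, b_h + frames @ W)                    # rectified linear feature outputs

def pool_over_time(features, width=160, stride=80):
    """Pool each feature over 160-frame stretches: roughly 'did this feature
    get active anywhere in here?'.  Overlapping pools come from the stride."""
    out = []
    for start in range(0, len(features) - width + 1, stride):
        out.append(features[start:start + width].max(axis=0))
    return np.stack(out)
```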
So a standard speech recognition system starts with Mel Cepstral coefficients. It then
uses a Gaussian mixture model to relate these to the underlying hidden Markov model,
and you can predict what I'm going to try to do next.
That's going to be the last part of the talk. We don't know how to get rid of these yet.
But we will.
>>: SIFT [inaudible].
>> Geoffrey Hinton: Yes.
>>: Variable.
>> Geoffrey Hinton: Not quite. Because SIFT features come with accurate poses and
things. Okay, now I'm going to switch to natural images and we're going to try to apply
the same technique to modeling image patches.
And, to first order, images consist of smooth stuff with sudden discontinuities.
That's what people find so hard to model. If you just model the Fourier spectrum of
images and then generate from it, you get clouds. They don't look like real images.
So here's some images. And this image and this image are very similar if you use a
color histogram. And if you do a template match of these images they're pretty similar.
In most places they match very well. But this image is really more similar to that one.
And if you sort of had to remember which is similar, these are the ones that are really
similar. So it's not the mean intensities that matter. It's which intensities are the same
as which other intensities that's what matters. That's the co-variance structure of the
image. Now, you can of course fit Gaussian mixture models, which learn certain
co-variance structures in their parameters.
But the thing about images is every new image has its own co-variance structure. So we
want something that's dynamic that would model the co-variance structure of that
particular image and that means we want some basis functions for co-variance
structures. And we want to represent this image as you've activated all these
co-variance basis functions and they between them produce an image with this
co-variance structure.
We also want to model the means. And so we're going to have a model that looks like
this. Here's your pixels. This part would be a standard sort of
Gaussian binary RBM. And you might want to replace this with these rectified linear
units. But actually we normalize the image patches, so you don't really need
those to begin with.
And then this is the bit that's going to model the co-variances. What we do is we take
some linear filters, things that are going to look at the image, and we're going to now
square the outputs, and we're going to send them here.
And then these units are going to be units that are always on unless they're suppressed.
And we're going to make sure these weights are all negative. So the interpretation of
this is this unit here is saying my bit of the image is nice and smooth so you win by B.
You get a bonus of B because your bit of the image is nice and smooth.
This filter here is going to learn to detect things like an edge. And when there's an edge
there, it's going to activate this guy. It's going to provide input here. And now that's
going to suppress this guy. And so you don't get this bonus.
But as this gets more and more active, once this guy's been turned off, it doesn't cost
any more. Once you've decided it's not smooth after all, you don't get a bigger and
bigger penalty as the edge gets stronger and stronger, and that's what you want. That
gives you the heavy tail property.
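A sketch of just the gating computation described above: the covariance units are on by default and get switched off by squared filter outputs coming in through weights constrained to be negative. The full model and its training are in the NIPS paper; names and shapes here are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def covariance_unit_probs(v, C, P, b_cov):
    """Covariance ('smoothness') units: on by default through their positive
    bias b_cov, and suppressed when squared filter responses feed in through
    the weights P, which are kept non-positive during learning.
    v: image patch, C: (pixels, num_filters) linear filters,
    P: (num_filters, num_cov_units) non-positive weights."""
    filter_out = v @ C                       # linear filter responses on the patch
    return sigmoid(b_cov + (filter_out ** 2) @ P)
```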
I went over that very quickly, but that's the sort of basic idea. And now we're going to
train this thing. And we use a training algorithm that's basically this contrastive
divergence algorithm. There's all sorts of variations to make it work for this kind of
circuit. And they're all in a paper in the upcoming NIPS. And I don't have time to
go into them here but I want to show you what happens when you generate from this.
If you learn on image patches, here's what image patches look like. And here's samples
from the model. And they look quite -- they look quite like image patches. They have
this property of being locally smooth with sudden outbreaks of structure.
So you know this looks like the same kind of stuff as that. Whereas clouds produced by a
Gaussian model don't look at all similar. Marc'Aurelio Ranzato, who did this work, wanted to
apply it to big images.
What you have to do then is you are going to model a patch and replicate it across
images so you can model big images. But if you've learned a filter and you replicate it
across the image with a small stride, that filter will overlap with itself. That's what
happens in a convolutional net. That's a disaster. Because where it overlaps
with itself, in order to have high representational capacity, it wants to be orthogonal to
itself. It wants the two outputs to be different.
So the filters try to learn to be orthogonal to themselves in all possible overlaps. The only
thing that does that is high frequency noise. So they learn high frequency noise.
There's a method called Fields of Experts that took our earlier work and put it in a
convolutional net. And they learned high frequency noise. Surprisingly that works quite
well. But you can do much better if you don't allow a filter to overlap with itself. So what
we're going to do is we're going to tile the image with a filter. You learn a filter. You
move it over by the width of the filter and then do another one. So those are replicas.
But then you have lots of offset tilings of the image.
>>: Why don't you have the same problem in the speech case, just a one dimensional
image? The overlaps?
>> Geoffrey Hinton: We probably did and we can probably do better. This is very early
work. So we're going to have a red tiling of the image. There's the red tile. We tile the
image with these red tiles. And for each red tile we learn something like 64 filters.
And then we're going to have another tiling, the blue tiling, which is diagonally offset. So
you minimize boundary artifacts and learn 64 filters for that and so on.
Here's some of the co-variance filters learned by the red tiling. And you can see they're
picking up on edges and things, and occasionally some high frequency noise. But edges
at different frequencies are what they really like. And so we learn lots of those.
We also learn mean filters. And again using tiling. And now that you're learning the
co-variance filters, you know what should be the same as what? So in order to model
the intensities in the image you don't need to actually worry about where the edges are.
You know how you do watercolor by having sort of a blob of color and then spreading it
out until it hits the edges. Well, that's what this model does. The model of the means
has these rather fuzzy filters that say it's sort of bright around here. But when you combine
that with the co-variance filters, if there's an edge here, it will spread that brightness here
but it won't spread it that way.
So it's this sort of watercolor model of an image that you have a color wash and then
these edges that it gets slaved to. And that just comes out in the math.
So when you generate from it, here's a big image generated by using just the second
order, the power spectrum. And it looks like that. One of the best models of images is
the Gaussian scale mixture. Notice it's a mixture model.
And it's deciding for each little bit of the image what's going on there, what frequency it is,
what spatial scale. But if you generate from it, it produces something like that.
It's a good model of images but hopeless as a generative model. If you take the Fields of
Experts model, that's this model here, and that's like the model I described but only
using the co-variance units, and also it's overlapping the filters. It's not tiling the image.
And that produces something that's sort of better than this but doesn't have sort of areas
with edges. It's quite a nice texture model. If you take a pairwise MRF it produces
something like that. If you take the model that Marco Redow [phonetic] developed it
produces something like this. So at least you're getting sort of regions with edges. This
looks a bit like a Henry Moore sculpture.
Here's another sample from the model. Okay. So these are samples that are a lot more
like images than what people could produce before. I mean, I should hedge that: there are
many, many ways to put images together by
stitching together little pieces of other images. But from a parametric generative model
it's very hard to get things that look as much like images as that.
Here's some more sample -- okay. So now what we do is we stick a hidden layer on top.
If you stick a hidden layer on top, and then take this restricted Boltzmann machine on
top and go backwards and forwards on that and once it's settled down you generate the
image using the lower layers, you get something that looks like this.
I should have got more samples of this. That looks like a really bad photo that someone
took, right? It's out of focus. And you can't tell what it's a photo of because it's such a
bad photo. But at least you could mistake that for a really bad photo. You couldn't have
mistaken any of the others for that.
If you use more hidden layers, it's not clear to me whether this is better. But you end up
getting stuff like that. So this is -- with three extra binary layers you go backwards and
forwards on the top two layers and generate.
Actually here the top two layers don't have that many units. What you can do also is you
can put a real image in there. Infer up to the top layer, and then regenerate from the top
layer of binary units. And it looks just like the real image but a bit blurry. So that top
layer can represent all that structure.
It's a sad fact we haven't managed to show that it's good for denoising images. That
would be the killer thing to show. But it is good for generating images. Okay. I might
have one more of those. There's another sample from the model with three extra layers.
>>: For photo enhancement, when you were talking about denoising -- specifically for that
problem, with this approach you would first train the model?
>> Geoffrey Hinton: You train the model.
>>: And then --
>> Geoffrey Hinton: You then take the image and you say follow the gradient of the log
probability of this image. If you've got a probabilistic model of an image, you can compute the
log probability up to an unknown factor, an unknown additive offset. You can follow the gradient.
>>: Following the image.
>> Geoffrey Hinton: You would train on all images. Of course, if you want to denoise a
certain class of images you'll train on that class and you'll do much better.
Okay. One final thing we can do is take that model that Marc'Aurelio Ranzato
developed for images; George just took his code and applied it to speech.
And so this was actually frames of filter bank outputs. He didn't put in the deltas and
delta deltas because those are for capturing the temporal correlations and we can model
correlations now so we don't need them.
And he tried this mean co-variance RBM as the front end and just multiple layers after that,
and trained it the same way as before. Now it does sort of quite a lot better. And this is
the record for speaker independent phone error rate on TIMIT. We don't know whether
that will generalize to bigger databases. George found it hard to get it to work on a
bigger database. We now have a version that's much easier to get to work, and we're
working on showing that, it will work on large vocabulary.
>>: Image, have you done any classification statistics yet?
>> Geoffrey Hinton: Yes, Marc'Aurelio Ranzato has done classification stuff. The place where it
really wins is if you take a face and you obscure a bit of the face and you tell it that that's
been obscured. So a face with some object in front and you know there's an object in
front. So you don't use those pixels. You know those are unknown pixels. What you do
with the generative model, you fill that bit in and then recognize that. That does much
better than any other technique. That's sort of semi cheating. But it does much better
for example than if you fill it in by linear interpolation and then recognize it.
And now we want to get to something completely different. The other thing that was
disappointing back in the '80s about back propagation was that there was a particular version of
back propagation that was clearly the exciting version. And it wasn't this multi-layer
stuff.
If you take a recurrent net like this, suppose it takes one time step for this guy to affect
that guy by this weight, I can explode it in time like this. So this effect here, W-2, is
happening there. It's also happening here. It's also happening here. So by tying these
weights together I can make a feed forward net like this. Be equivalent to a recurrent
net.
If you tell me that this guy should have been in a particular state, I can back propagate
the error and I can train all these weights. Keeping these the same as each other.
In particular I might say I'll take some units and make them the input units. I'll provide
input here. I'll take some units and make them the output units and I'll have desired
output here. And I'll have some hidden units. Hopefully more than that.
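For concreteness, here is a sketch of the unrolled forward pass: the same weight matrices are applied at every time step, which is what makes the exploded feed-forward net equivalent to the recurrent one, and backpropagating the error through this unrolled graph while keeping the copies tied is the training just described. Matrix names are mine.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unrolled_forward(inputs, W_in, W_rec, W_out, h0):
    """Forward pass through the net exploded in time: the tied weights
    W_in, W_rec, W_out are reused at every step."""
    h, outputs = h0, []
    for x in inputs:                          # one time step per input vector
        h = sigmoid(W_rec @ h + W_in @ x)     # hidden state update, weights shared across steps
        outputs.append(W_out @ h)             # output appears with a short delay
    return outputs
```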
And so now I can have something that takes a stream of input and spits out a stream of
output with a little time delay with two steps to get there and that thing ought to be able
to learn wonderful things. It ought to be able to learn little programs, lots of little
programs operating in parallel that between them manage to compute the right answer. I was
really excited about this when I realized back prop would do this. I thought we'd really
got it now, because in this big net we could get it to behave like all these little
programs. And the problem was you couldn't train it. Particularly if you try to train it
over 100 time steps.
The error derivatives either die or explode. That's because a net like this tends to have
attractors. Because it has attractors, between two attractors there's a bifurcation. If you
start at that bifurcation point, you get an infinite derivative. And anywhere else you're
going to get a near 0 derivative. So the derivatives tend to be either infinite or zero, which isn't
good news.
So people gave up on these essentially. Tony Robinson used these for speech but later
on there was some excitement and then people basically gave up on these.
>>: Couldn't reproduce the result. It worked once.
>> Geoffrey Hinton: Okay. I bet you we could reproduce it now.
>>: Not using RNNs.
>> Geoffrey Hinton: Yes, using RNNs, because we can train them much, much
better. So two of my grad students -- one in particular, James Martens -- thought maybe the
problem with these is you're just not using a good enough optimization technique. Over
the years optimization people say you should use this technique. You should use that
technique but they're not willing to put in six months to make their technique work on
neural nets.
And James Martens took one of the best techniques and put in six months making it work on
neural networks. Boy, does it work. It can follow these derivatives over 100 time steps
now. Basically you want to use curvature information to decide which direction to go.
>>: But what if you just use standard pretraining for a fixed length?
>> Geoffrey Hinton: It's very hard to pretrain these guys because the weights are
shared, remember.
>>: I see.
>> Geoffrey Hinton: We're working on trying to do that. But it's not trivial to figure out
how to pretrain them.
>>: Is that just the lap --
>> Geoffrey Hinton: No, it's because the weights there are shared. You can't separate --
you can't modularize anything.
>>: Have you compared your results to Risto's at the University of Texas?
>> Geoffrey Hinton: We compared them to Schmidhuber's. We'll let them
speak for themselves. Instead of using quasi-Newton, which is approximating the
curvature matrix and doing a line search to find how far to go, you use a different
approximation to the curvature matrix: you use the Gauss-Newton approximation, which is
guaranteed to be positive definite, and you put a huge amount of work into getting a good
approximation.
It's not the full matrix because that's hopelessly large. But you're very interested in like
the 100th eigenvalue, which is thousands of times smaller than the leading eigenvalues,
and you want to get it right, because these directions have very, very low gradients
but even lower curvatures. And it's the ratio between the gradient and the curvature that says
how much you can win.
And so in among this there's a conjugate gradient line search that's used just for getting
a good approximation to the curvature matrix. And what's more the literature says you
use five steps of the conjugate gradient search. James discovered actually 250
steps is what you need.
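For concreteness, here is the kind of conjugate gradient inner loop that sits inside a Hessian-free step, run for many iterations rather than a handful; this is only a bare sketch, without the damping schedules, preconditioning, and backtracking that James Martens' actual method uses. Gv is assumed to return the (damped) Gauss-Newton curvature matrix times a vector.

```python
import numpy as np

def cg_direction(Gv, grad, max_iters=250, tol=1e-10):
    """Approximately solve G d = -grad by conjugate gradient, where Gv(d)
    returns the curvature matrix times d.  Many iterations, as the talk
    emphasizes, not the five the older literature suggests."""
    d = np.zeros_like(grad)
    r = -grad - Gv(d)          # residual; Gv(0) is 0 for a linear operator
    p = r.copy()
    rs = r @ r
    for _ in range(max_iters):
        Gp = Gv(p)
        alpha = rs / (p @ Gp)
        d += alpha * p
        r -= alpha * Gp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d                   # the update direction for the parameters
```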
There's all sorts of damping that goes on, to do with trust regions and all that. He put in a lot
of work. There's an ICML paper doing impressive things on toy problems. He's made it
work better since then. Another student, Ilya Sutskever, has applied it to predicting the
next character in a character stream.
I'm not going to talk any more about the method. You can read the ICML paper. There's
something very nice about working with characters, which is that's how the Web comes.
It's just characters. If you try and work with words, you have to segment them. If you're
doing Finnish, you've got a nightmare because there are all these morphemes strung
together.
Even in English there's problems. There's all these prefixes and suffixes: you don't
know whether to strip them off or leave them on. With cities like New York, you don't know
whether to make it one word or two words, because sometimes it's two words. And
there's subtle effects you don't know about but that are there in English, which you'd like to pick up on,
like words beginning with SN typically meaning something to do with the upper lip or nose. So
there's things like snot and sneer and snide and snarl and snap. There's lots and lots of
them like that. Far too many to be chance.
Now, there's things like snuggle, which doesn't mean that. But snuggling leads to
snogging, and there you go.
People always come up with one exception: they say snow.
Snow's got nothing to do with the upper lip. If you ask yourself why is snow such a good
name for cocaine, it's the perfect name because it's white and it has to do with the upper
lip. I don't think that's a coincidence.
>>: This upper lip discovery --
>> Geoffrey Hinton: The point is, that's the irregularity of English. And a learner of the
language will learn it and the linguist doesn't even know it, most linguists. I think I got this
from George Lakoff, so some linguists know it.
So we'll do a net that works like this. It's not your standard recurrent net but it's quite
similar. We're going to have 2,000 logistic units. And here everything's going to be
deterministic. 2,000 logistic units that have real value states between 1 and 0 and the
real values are important. If you mess with them too much it doesn't work.
And we're going to have another 2,000 units, which is the state of this, of the next time
step. So this is one time step. And we're going to make a character not provide input to
these units, which would be the standard thing to do. Instead we're going to make the
character determine the transition matrix.
So we're going to say what a character does is take some state and determine the
transition matrix that gives it a new state. And one way it could determine the transition
matrix is each of the 86 characters we use could have a lookup table which says I
correspond to this transition matrix. But then you'd have 86 times four million
parameters. And that's probably a good thing to do, but it's too many parameters. So
what we're going to do is take that 2,000 by 2,000 by 86 tensor -- that's 86 full
matrices -- and we're going to factorize it. Just like PCA factorizes 2-D matrices, we're
going to factorize the 3-D tensor in the following way. We'll have a bunch of
factors, these triangular things, actually 2,000 of them, which is confusing, but there you
go. 1,999. Let's suppose we had that many.
And each factor is going to work like this. If you think of it in how you program it, you
take these states. You multiply them by the weights on these connections so this is a
linear filter. And now the character tells you what weight to put on that linear filter. So
you take this linear filter, multiply it by this weight and send it out along these
connections where it gets multiplied by these outgoing weights. So another way -- and
that's the contribution that this factor makes. And you have lots of factors. So this
character -- this character says let's use a little bit of this factor. It's like cooking. Use a
little bit -- it really is cooking -- use a little bit of this factor, a bit more of that factor, and
each character uses different amounts of the different factors. So similar characters like
vowels might use similar amounts of the same factors.
All of the digits use very similar amounts of different factors, because they all have very
similar distributions. Another way of looking at it is take the outer product of this vector
and this vector and you'll get a rank one matrix. And then what a character is doing is
building up a whole transition matrix by adding together weighted rank one matrices.
And you're going to learn all this, of course. Once you've made the transition, you then
try and predict the next character.
>>: Logistic function in here as well.
>> Geoffrey Hinton: Then there's a logistic function here, right. So from here to there it's linear,
and then we put everything through a logistic. Thank you. Then after you put it
through the logistic, you predict the next character and you get a distribution over the 86
characters.
And then to train it, you look at the log prob of the correct character and you back
propagate to try and increase that log probability. Then you back propagate through
here and down through here, also through here and down to the next character and
through here, down to the next character and back propagate for 100 time steps.
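A sketch of one step of the factored net as described: the character picks a per-factor weight, each factor multiplies its linear filter of the state by that weight and sends the result through its outgoing weights, then everything goes through a logistic, and a softmax predicts the next character. The transition matrix is implicitly a sum of character-weighted rank-one matrices. Matrix names and shapes are mine; training by backpropagating the log probability for 100 steps is not shown.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def factored_step(h, char_onehot, W_fh, W_fc, W_hf, b_h):
    """One time step: the character gates the factors, so the effective
    transition matrix is the sum over factors of (W_hf column) outer (W_fh row),
    each weighted by that character's coefficient."""
    f = (W_fh @ h) * (W_fc @ char_onehot)   # per factor: linear filter of the state x character weight
    return sigmoid(W_hf @ f + b_h)          # new state of the 2,000 logistic units

def next_char_distribution(h, W_out, b_out):
    """Softmax over the 86 characters for the next step."""
    logits = W_out @ h + b_out
    e = np.exp(logits - logits.max())
    return e / e.sum()
```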
>>: The question is, if you don't -- why this factorization?
>> Geoffrey Hinton: It doesn't work as well.
>>: Even if there were almost an infinite number of characters around --
>> Geoffrey Hinton: Well, for the same number of parameters it doesn't work nearly as
well. That's what we know. It works a lot better when you do it like this. Because
characters really are transitions from one state to the next state.
Okay. So the question is: If you train it on 5 million strings from Wikipedia where each
string is 100 characters long, so you can get it onto the GPU efficiently -- that's why you
lay it out like that -- and you leave it running on your top-end Nvidia GPU for a month and you
come back and look at what it's learned, we were hoping that it would learn some words,
right?
Because there ought to be words there. And it ought to learn some of them. We're
hoping that it would learn common words. What we didn't expect is that it would learn all
the words and it would almost never produce anything that wasn't a word.
Including very rare words. We were hoping it would learn a little bit of syntax, but it
learns lots of syntax. We were hoping it could balance parentheses that came near one
another. It can balance parentheses that are like 40 characters apart even though
there's balanced quotes inside. And it's because it's not a hidden Markov model that's
doing the remembering. It's got 2,000 real numbers as its state, as opposed to a one-of-N
choice. So even if it was 2,000 binary numbers, you'd need a hidden Markov model with
2 to the 2,000 states to have as much representation capacity. And this has real
numbers.
So it's got a lot of capacity in those numbers. And so really you want to see what it
produces. So here's some text produced by just running the model. You give it an initial
ten characters and then you say predict the next character. I'm going to sample
according to those probabilities. So occasionally I'll be picking something that it thinks
is pretty rare, and I just keep generating. If you always pick the commonest one, after a while it
starts saying the United States of the United States of the United States of the United
States of the United States. And that's boring.
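A sketch of that generation procedure: feed in a few seed characters and then repeatedly sample the next character from the predicted distribution. Characters are assumed to be integer indices, and step_fn / predict_fn stand in for the factored net above; greedy=True picks the most probable character every time, which is what collapses into the repetitive text just mentioned.

```python
import numpy as np

def generate(h0, seed, num_chars, step_fn, predict_fn, rng, greedy=False):
    """Run the seed through the net, then generate num_chars characters by
    sampling from the predicted distribution at each step."""
    h, out = h0, list(seed)
    for c in seed:
        h = step_fn(h, c)
    for _ in range(num_chars):
        p = predict_fn(h)                    # distribution over the 86 characters
        c = int(np.argmax(p)) if greedy else int(rng.choice(len(p), p=p))
        out.append(c)
        h = step_fn(h, c)
    return out
```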
>>: You know why? The number one page-ranked value, the top eigenvalue, is United States
[inaudible].
>> Geoffrey Hinton: There you go. Thank you. I thought it was just because it was the
most important thing to ask. But... so here's some --
>>: This one used the James Martens method --
>> Geoffrey Hinton: This one is using James Martens' method.
>>: That's why it requires --
>> Geoffrey Hinton: It generates this, right. And this is predicting a character at a time.
So there's lots -- and because it's trained on Wikipedia, you can search Wikipedia for these
things. You can see if Wikipedia has "well-paid types of box printer" in it. And no, it doesn't.
>>: How many of these strings would you actually have to go back and brute force look
for the strings and see how many showed up in groupings?
>> Geoffrey Hinton: I did it informally, not properly. I checked informally whether it must be
getting this from Wikipedia. It's not there. It's creating these phrases. Most of them are
created. And surprisingly, for example, "the mansion house was completed in" might be
there. But I bet it wasn't completed in 1882; I bet it was some other year. It's very good at
substituting years. Just occasionally it does something embarrassing, like saying 1882.3.
But it doesn't do that very often. Normally it doesn't.
And look at these phrases. Look. It is the blurring of a pairing on any well-paid type of
box print and it has longer range semantics. So I'll bet the blurring and pairing printer
have something to do with each other. It knows you're talking about printing and stuff.
Okay. Let me show you one more passage. Now, this was selected from a longer
passage. That was one of the nice bits. Here's another nice bit. So again if you search
for Opus Paul at Rome I bet you don't find it, but we know Opus and Paul and Rome
have a lot to do with each other. I bet you probably don't have Arab women's icons.
You might have Arab women's icons but you probably don't have now Arab women's
icons and stuff like that.
Look here. There's an opening quote. There's a closing quote. And they're like 40
characters apart. Normally it starts paragraphs better than that. It starts with times or --
so it's producing good stuff.
>>: So we currently -- does it really have this capacity of learning long distance
relations?
>> Geoffrey Hinton: Look, quote and quote.
>>: That's why I was surprised.
>> Geoffrey Hinton: Everybody in the literature says -- there's all these papers that say --
that you cannot learn long distance dependencies with recurrent nets.
>>: [inaudible].
>> Geoffrey Hinton: And the issue was, is it because of the optimizer you're using or is it
some eternal truth? And it turns out, retrospectively, it's because of the optimizer you were
using and because you didn't have enough compute power to use the optimizer you
need to use. Until these Nvidia GPUs came out we couldn't have done this experiment.
Back in the '90s, when people said you can't learn these long range things, if we had
started the computation then it would only be one percent done by now.
Things are getting faster exponentially.
>>: How long -- how deep was your net again? How many characters was --
>> Geoffrey Hinton: It only sees one character. The rest is hidden state.
>>: Infinitely long.
>> Geoffrey Hinton: Infinite impulse. As a matter of fact, it never has a history of more
than 100. But it could. It's infinite impulse --
>>: In model checking, for example, you can look at sort of finite traces of execution in
the program. And by analyzing that, you can come up with invariants that will apply to
longer segments of --
>> Geoffrey Hinton: So with English, you can do things like -- I mean, it would be fun to
give this to an undergraduate linguistics class and say figure out what it knows. You can
type any string and it will tell you how probable it is. It will tell you how probable it is
given the context. You have to give it a context, and you have to say, given that context,
how probable is this string.
For example, you can type "a lion is a vegetable and a cabbage is an animal." And you
type "a lion is an animal and a cabbage is a vegetable." And the second one is about three
times as probable. You say, well, that's just associative knowledge. Vegetable was
close to cabbage, so it doesn't really understand.
But if you say to people what do cows drink, they say milk, they don't really understand
either. They have a lot of associative processing going on, and this has huge amounts
of associative stuff.
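As a rough sketch of that kind of probe -- not the actual system, and assuming a hypothetical model interface with init_state(), consume(state, ch) and predict(state) -- scoring a string under the character model is just the chain rule over characters:

import math

def log_prob(model, context, string):
    # log P(string | context) by the chain rule over characters
    state = model.init_state()
    for ch in context:
        state = model.consume(state, ch)          # run the context through the net
    total = 0.0
    for ch in string:
        probs = model.predict(state)              # distribution over the next character
        total += math.log(probs[ch])
        state = model.consume(state, ch)
    return total

# e.g. compare log_prob(m, "", "a lion is an animal and a cabbage is a vegetable")
#      with    log_prob(m, "", "a lion is a vegetable and a cabbage is an animal")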
>>: Can you test this with longer and longer distance correlations to see if it really
learned that arbitrarily long quotations have to be balanced?
>> Geoffrey Hinton: Yep.
>>: You also have like the quotes and the parens up there are not matched.
>> Geoffrey Hinton: Yes. There's cases where they're not matched. It doesn't do it
perfectly.
>>: Have you done a supervised learning layer on top of this to teach it syntax, teach it
semantics, labeled data?
>> Geoffrey Hinton: Okay. So the problem is that the labeled-data corpora are too small for this guy. But we did do the part-of-speech task from Penn. What you do is you train this net forwards, but now you're not training it to predict the next character; you're training it so that when it gets to the end of a word it should predict the part of speech. That's unfair, because it can't see the future. So you train one net in one direction and one in the other direction. When the backwards one gets to the beginning of the word and the forwards one gets to the end of the word, those two states are used to predict the part of speech. The best systems get about 2.8 percent error and this thing gets about 3.5 percent error. So it's good, but unfortunately it didn't beat the best systems; we're not there yet.
But it's pretty good. It can be trained to understand parts of speech pretty well. There are parts of speech I can't do, and they're full of things like gerunds. I don't know what a gerund is.
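A minimal sketch of that word-boundary scheme, assuming two already-trained character models (fwd and bwd) with the same hypothetical consume interface, plus some trained classifier over pairs of hidden states -- all illustrative names, not the actual code:

def tag_words(sentence, fwd, bwd, classifier):
    # one forward and one backward character model; at each word boundary the two
    # hidden states are combined to predict the part of speech
    words = sentence.split()
    fwd_states, state = [], fwd.init_state()
    for word in words:
        for ch in word + " ":
            state = fwd.consume(state, ch)
        fwd_states.append(state)      # forward state at the end of the word
    bwd_states, state = [], bwd.init_state()
    for word in reversed(words):
        for ch in (" " + word)[::-1]:
            state = bwd.consume(state, ch)
        bwd_states.append(state)      # backward state at the beginning of the word
    bwd_states.reverse()
    return [classifier(f, b) for f, b in zip(fwd_states, bwd_states)]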
You can do little experiments on it. So here are some experiments. You make up a nonsense word. And then you type "Sheila thrunge" and ask what the most likely next character is. It says "s". You type "People thrunge"; it says the most likely next character is a space. You really need to do it for lots of pairs, to get a lot of statistics, but I believe it understands that Sheila is singular and people is plural.
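A sketch of that probe, with the same hypothetical consume/predict interface as before:

def most_likely_next_char(model, context):
    state = model.init_state()
    for ch in context:
        state = model.consume(state, ch)
    probs = model.predict(state)                  # distribution over the next character
    return max(probs, key=probs.get)

# most_likely_next_char(m, "Sheila thrunge")  ->  "s"  (singular verb ending)
# most_likely_next_char(m, "People thrunge")  ->  " "  (plural: no ending)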
Then I tried to fool it. I gave it "Sheila, Thrunge" with a capital T. And the first time I gave it that, it said "Thrungelini Del Rey". That's when I really started believing in this system, because it's obvious Thrungelini Del Rey is an exotic filmmaker with an Italian mother and a Spanish father, currently in Switzerland.
It knows a lot about proper names, a huge amount. If you give it your name, it will know which order it goes in. If you give it eight words, and you say try all 40,000 orderings (8! is about 40,000) and tell me which is most likely, the most likely order will be the sensible one. So it can do that -- most people doing natural language with machine learning are busy converting text into bags of words. We've got the inverse operation that takes a bag of words and converts it back into text, but only for eight words.
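A sketch of that inverse operation, assuming some scoring function for candidate strings (for instance the hypothetical log_prob above); for eight words there are 8! = 40,320 orderings, few enough to enumerate:

from itertools import permutations

def best_order(words, score):
    # score() maps a candidate string to a log-probability, e.g. the log_prob sketch above
    candidates = (" ".join(p) for p in permutations(words))
    return max(candidates, key=score)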
>>: Is that in Wikipedia somehow?
>> Geoffrey Hinton: Thrungelini Del Rey, no, I checked. There's no Thrungelini Del
Rey. There will be now.
[laughter]
I sometimes call the program Thrungelini Del Rey, because I like the name so much. I tried typing "the meaning of life is", hoping to get 42. But 42 wouldn't be that interesting, because that's probably in Wikipedia.
And you usually get garbage. You sometimes get quite interesting garbage. On the
sixth try I got this. I thought that was amazing. Because it knows all these weird words.
And so I don't know how to say what it knows. It knows a huge number of words. It rarely produces a nonword. When it does produce a nonword, sometimes you know it's a nonword, like "interdistinguished" -- it produced that, and that's a nonword. But "ephemerable" -- you know, ephemerable: would you like to make your worst employees more ephemerable?
[laughter]
And it produces "parled", P-a-r-l-e-d, which I'm fairly sure isn't an English word, but it really could be.
>>: Parlayed.
>> Geoffrey Hinton: Parlayed is. And you know almost what it means. It's got lots of syntactic knowledge, but it's not in clean rules of any kind, and it's all mixed up with the semantic knowledge. So it has no problem with things like "budge". You can have things that won't budge. But "budge" goes with a "not": you don't have things that like to budge; you have things that won't budge. And that's no problem for this kind of system.
It only produced "Wittgenstein" once in the text I have seen it produce. That was soon after it produced "Plato". It knows those guys have something to do with each other.
I'm finished, I think. Yes.
[applause]
>>: So if you were to use the old-style training for recurrent networks, Tony Robinson style, it probably would produce --
>> Geoffrey Hinton: It won't work very well. If you want to see how badly it works, look at Jeff Elman's paper.
>>: So you mentioned that having the space of real [inaudible] might hurt your representation [inaudible]. Is it worth going back to the image and speech problems and using stacked RBMs with linear units or [inaudible] rather than binary?
>> Geoffrey Hinton: No. The reason is this. This was a deterministic model. Those models are stochastic generative models when you're doing the training. And if you force units to be binary it's a strong regularizer; if you allow them to have real values, it doesn't model things nearly as well, because they use those real values, right?
If they're forced to be binary -- if I force a feature to be binary, it had better be a sensible feature, so that when I turn it on it does something sensible. If I allow it to be real-valued, then you can have a whole bunch of them with different real values all conspiring to produce something, where this guy cancels that part of that guy and that guy cancels a little bit of that guy, and it overfits terribly. So I've tried using real values in these other nets and it doesn't work nearly as well on the whole. Although these rectified linear units work. So maybe real values and some noise are a good idea. But you need to put noise in to regularize it.
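A small sketch of the "real values plus noise" idea for a single hidden unit, in the spirit of the rectified linear units mentioned here; the exact noise model below is an illustrative assumption, not a quote of the actual system:

import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_binary(x):
    # standard stochastic binary unit: on with probability sigmoid(x)
    return 1.0 if random.random() < sigmoid(x) else 0.0

def sample_noisy_relu(x):
    # real-valued unit: add Gaussian noise to the total input, then rectify at zero;
    # the injected noise plays the regularizing role that binarization plays above
    return max(0.0, x + random.gauss(0.0, math.sqrt(sigmoid(x))))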
>>: So is that -- and thinking about adaptation for this model, like giving it Bill Gates's speech or lectures to get an effective model --
>> Geoffrey Hinton: Okay. So with this kind of model you can very easily do the following.
You take all its parameters and you freeze them. So it knows what it knows. You then have a whole new set of parameters, and they have the property that they learn fast and they have a huge amount of weight decay. They hate being away from 0. So they'll learn a little overlay on the model.
So you can learn that Bill Gates overlay on top of Wikipedia. So now it will produce stuff
that has got all its background knowledge in there but it's going to overlay it to be much
more likely to produce words like Microsoft and profit and stuff like that.
And charity. Better be careful here. So you could do that for Shakespeare, for example. Is there enough Shakespeare to train this? Well, you could train it on the whole of the New York Times or something like that -- that's basically English -- and then on top of that you can train it on Shakespeare. And then it should start producing stuff like Shakespeare, if you make the sort of Shakespeare overlay. And I think that's the way to do it with smaller corpora. In fact, I have a name for the paper when that works; the name for the paper is going to be "It only takes one monkey".
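A minimal sketch of that overlay scheme -- frozen base weights plus a second, fast-learning set of weights that is strongly decayed toward zero; the class, learning rate and decay constant are illustrative assumptions, not the actual training code:

import numpy as np

class OverlayWeights:
    def __init__(self, base, lr=0.1, decay=0.01):
        self.base = base                     # frozen, Wikipedia-trained weights
        self.overlay = np.zeros_like(base)   # fast weights that hate being away from 0
        self.lr, self.decay = lr, decay

    def effective(self):
        return self.base + self.overlay      # what the network actually uses

    def update(self, grad):
        # gradient step on the overlay only, with strong weight decay toward zero
        self.overlay -= self.lr * (grad + self.decay * self.overlay)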
>>: Actually, next, I would do what Sandia did for machine translation, which is take the Bible and train it in English, then train a separate model in another language of your choice. And now, because they're perfectly aligned against each other about which sentence is which sentence, you can put them against each other to do automatic machine translation.
>> Geoffrey Hinton: So basically they're learning the mapping between the hidden states? Right.
>>: Now you can put English in one end and get Italian out the other end or vice versa.
>> Geoffrey Hinton: Can I use Das Kapital? I think that's been translated into lots of different languages.
>>: The bible is a big corpus, is all I'm saying.
>>: I didn't quite get the whole thing. To train these two things, how do you get --
>>: Then you use supervised learning between them.
>> Geoffrey Hinton: The hidden state -- when you get to the end of a sentence, the hidden state contains lots of information about what was in that sentence, plus information carried over from previous sentences. So if you train on English and you train on French, and you have one-to-one aligned sentences like Hansard, those hidden states had better have something semantic to do with each other. Now you try and learn the mapping separately.
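A minimal sketch of learning that mapping, assuming you have already collected the sentence-final hidden states of the two models for aligned sentence pairs; the choice of a plain least-squares linear map is an illustrative assumption:

import numpy as np

def fit_state_mapping(english_states, french_states):
    # english_states, french_states: (num_sentences, hidden_dim) arrays of
    # sentence-final hidden states for aligned sentence pairs
    W, *_ = np.linalg.lstsq(english_states, french_states, rcond=None)
    return W

def map_state(W, english_state):
    return english_state @ W                 # approximate French hidden state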
>>: You're saying --
>> Geoffrey Hinton: Some mechanism that maps one thing to the other -- maybe a neural net.
>>: Sandia used eigenvectors for this. They were able to do basically 50 languages, where they could throw something in and get something out, provided the sentence was close to something in the Bible. They were not able to get down to this level, in terms of the letter level and predicting the next character, and whether you can apply --
>>: The internal model of what's going on could be completely obvious. You've got the 2,000-dimensional real-valued space, so you can afford to put in lots and lots of little factors.
>> Geoffrey Hinton: Exactly.
>>: But it makes it seem very unlikely that you'd be able to map the almost-random distribution of English attractors to the almost-random distribution of French attractors with any kind of low-dimensional mapping.
>> Geoffrey Hinton: The hope is that some bits of that 2,000-dimensional vector will be remembering that there's an open quote and an open paren -- just a few bits. Other bits will be remembering that we're in the past tense. Other bits will be remembering that it's plural. Other bits -- this is in the ideal world -- other bits will be remembering that we are talking about sort of pre-priced stuff and --
>>: Why [inaudible]?
>> Geoffrey Hinton: The semantic should match each other.
>>: The semantics should match, that's the point. If the semantic part is big enough it should override the syntax.
>> Geoffrey Hinton: And it will learn which bits of that vector are the semantics and
which bits are syntax and ignore the syntax.
>>: Using subsets of dimensions, which you call bits, for different functions. It's not clear to me what I need --
>> Geoffrey Hinton: Well --
>>: I don't know if it would work at all, right?
>> Geoffrey Hinton: It's a very good suggestion, yes. Because we -- this is an amazing model. And one thing you can do is put it up against other character predictors, like Markov models. And these Markov character predictors, particularly the ones that use mixture models, are very good. This is about as good as the best one, but it makes very different errors. So one thing we're going to do is average this with the best Markov model, and I think we will get a better model.
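A minimal sketch of that averaging, as an interpolation of the two next-character distributions; the mixing weight is an illustrative assumption:

def mixed_next_char_probs(rnn_probs, markov_probs, alpha=0.5):
    # both arguments are dicts mapping characters to probabilities
    chars = set(rnn_probs) | set(markov_probs)
    return {c: alpha * rnn_probs.get(c, 0.0) + (1 - alpha) * markov_probs.get(c, 0.0)
            for c in chars}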
Because those can't balance parens, for example. But this kind of thing is the way to really convince people --
>>: This is on one GPU, sitting for a month. The stuff I'm talking about from the labs, they ran it on Roadrunner, with all banks going, for a month -- one of the biggest supercomputers in the world. So if you could even get anywhere close to the result that you're trying to match, there's the possibility of setting up an experiment that would show how you do this on commodity hardware, basically. That would be interesting.
>> Geoffrey Hinton: Okay.
[applause]