>> Misha Bilenko: All right. So we're delighted to host Dumitru Erhan here today
who is coming to us from University of Montreal. And he will be talking to us
about the latest and greatest from the exciting world of deep architectures. He's
been to a number of places, including our very own MSRC in the past, but mainly
he's spent time first in Montreal and then before that he did internships at MSRC
and Google and Max Planck and Helsinki University of Technology. And here's
Dumitru.
>> Dumitru Erhan: All right. Let me just hold on until I get the timing right.
All right. This was not purely for the animation. So I'm Dumitru Erhan. I'm from
the LISA Lab that is headed by Yoshua Bengio at the University of Montreal, and
this is getting to be work -- well, I'm going to be presenting work on deep
architectures and trying to understand them and trying to understand the effect of
unsupervised pre-training. And I've done this work jointly with a couple of
people, Aaron Courville, Yoshua Bengio, Pierre-Antoine Manzagol, Pascal
Vincent, and Samy Bengio.
So here's a brief outline of what I'll be talking about. I'll be starting with an intro to
deep learning and why we want to do that. I'll be going with a kind of a brief
motivation of -- for my work, for the work that I've been doing over the last couple
of years; namely it's work that I've -- you know, that can be ruled by these two
questions, why does unsupervised pre-training work for these deep architectures
and how does it work?
And if we have some time, and I think we have some time, maybe I'll go over
some more speculative work on analyzing deep architectures. And I'll show you
more pretty pictures since you seem to like them. And I'll end with some
discussion and concluding remarks.
So deep learning motivations. I think one of the first motivations for the deep
learning comes from this -- the fact that we take a lot of hints from the way the
brain works. We know it's kind of an intelligent machine. And we've -- you know,
machine learning researchers have -- over the years have done a certain number
of copy pasting from there. So, you know, why not take a hint from the fact that
there are sort of many layers of nonlinear processing units in the brain. And so
we could call the brain a deep architecture, in a sort of vague, hand-waving way.
A more -- perhaps a more convincing argument is the fact that we as humans
tend to organize our ideas, the way we think about the world in a kind of
hierarchical fashion. So we tend to think through composition of simpler ideas. So this
can be seen in the way we learn things. We start with learning simple concepts.
We kind of do a bit of scaffolding. We learn more complicated concepts. The
way we think about abstractions in the real world: we tend to do some sort of
decomposition into simpler abstractions, less abstract things.
And the way we solve problems, we do problem decomposition. We solve
problems by, you know, a kind of an engineering approach to solving problems
would be to solve simpler problems first, to solve the complicated problem by
building on the simpler ones. On the simpler solutions.
So perhaps a bit more of a machine learning argument that my advisor Yoshua
Bengio seems to make quite often is that in -- you know, in certain restricted
cases, you know, if you make certain assumptions, perhaps unrealistic
assumptions about the classes of functions that you're trying to learn or trying to
represent, there are certain classes of functions which can be compactly
represented with K layers of nonlinear processing units, which is what we're
going to call a deep architecture, by the way, and which cannot be represented
efficiently -- they need an exponential number of these processing units -- when
you restrict yourself to K minus 1 of these layers.
>>: [inaudible].
>> Dumitru Erhan: That's an existence proof, yes. Yeah. Basically.
>>: [inaudible].
>> Dumitru Erhan: That restricts the function class -- yeah. If you restrict yourself to kind
of linear threshold kind of units or binary units. It depends. It depends on the
class of function.
So this kind of result -- I don't think an existence proof of this kind of result
exists if you have, you know, sigmoidal units or something like that and you don't
restrict yourself in the number of these units; I don't think it's actually true then.
But I think for realistic kinds of functions that you're going to be trying to learn
or represent -- my intuition is this is kind of true as well.
>>: Does it depend on what kind of learning unit you have?
>> Dumitru Erhan: Yeah.
>>: [inaudible].
>> Dumitru Erhan: Yeah. It's a linear spatial function. So it's not -- nothing too
fancy.
The other kind of more machine learning type of arguments for doing deep
learning is that local features are local and the representations that you're trying
to learn of your data don't really seem to scale well to problems with many
variations. And by that I mean problems where perhaps you're trying to learn a
complicated decision boundary, you know, where -- where if you have a
complicated decision boundary and the kind of features that you're trying to learn
are features that operate only in the vicinity of your training data, then you can,
you know, perhaps need a lot of data and, you know, these types of problems
are what I would consider interesting problems in life where if you have a simple
decision boundary, then it's perhaps not an interesting machine learning problem.
And conversely, a distributed representation as opposed to kind of a local
representation, they seem to be necessary to achieve this kind of generalization
beyond the local features that you're going to be getting if you're just looking at
the training data in this vicinity. Yes?
>>: I'm sorry, you mean the local in future space or the local in [inaudible] like
over the image?
>> Dumitru Erhan: I don't really know what local in the [inaudible] but local in the --
>>: [inaudible].
>> Dumitru Erhan: Local in the input space.
>>: Okay.
>> Dumitru Erhan: So there's a bunch of [inaudible] that Yann LeCun and
Yoshua Bengio have made about this. The interesting thing is that the nice thing
about deep architectures that have these deep representations that are distributed
is that in the end what they allow you to do is they allow you to share statistical
strength by kind of making it possible to reuse the features that you learn at that
level K minus 1 for learning the features at the level K. So it's -- it -- you know,
it's -- you could see this as kind of multitask learning if you want. If you were to
do multitask learning that kind of sense that Rich Caruana has done, and you
were to learn a task, you know, one, it makes sense to reuse the features that
you learned for the task one to learn the task two if that task is somehow related.
And the same kind of argument could be made for learning features. You know,
you consider learning a feature as a task, so if you learn a high level feature it's --
>>: Can you elaborate on the non-local generalization? I mean local to what? I
understand it's local in the feature space, but local to what?
>> Dumitru Erhan: Local -- close to the training data.
>>: [inaudible] far away from the [inaudible].
>> Dumitru Erhan: Yes.
>>: Okay. Far away from the training data, the assumption is the distribution
[inaudible] because the training data is sampled i.i.d. from the distribution, so why
do you care about generalizing where there's low mass?
>> Dumitru Erhan: Well, because -- well, your -- you want to -- so you want to
generalize far from the training data because, you know, the training data cannot
possibly in general capture all the possible variations of what you're trying
to learn. So you're going to try to impose a complicated prior on the kind of
distributions that could arise, and you want this prior -- you know, the kind of
function -- the kind of class of functions that you're going to learn to not simply be
restricted on the neighborhood regions that are close to your training data.
Because it's -- it's not going to be possible -- you're not going to generalize to
anything interesting if you don't have data. So I think it's more of a, I don't know,
maybe a philosophical kind of question of whether you should be aiming to
generalize far away from the data or not. And I think sort of the argument we're
trying to make is that you should be able to. I don't -- I don't think there will be an
agreement. I understand your point. But I think in sort of the limited
kind of data scenario, we make some sort of assumptions that you do want to
generalize away from the data. You would want some sort of priors on the -- on
the way in which you're going to be generalizing away from the data. That is not
simply I'll shrink my parameters to be zero because I don't have any -- any -- any
data in there. So...
Hopefully I convinced you that deep, distributed representations are desirable
and have some sort of more expressive power than if you just don't use deep,
distributed representations.
So before 2006, fully-connected deep networks were not especially popular. And
so it's not exactly a mystery to me why, but I wasn't there in the '90s, I didn't do
machine learning. I was in high school. [laughter]. Sorry. It's like a dip in my
chances. [laughter].
But I think the -- there is a certain number of hypotheses. And you can feel free
to add some of these. I mean I'll add them to my list. But I think the -- one of the
main ones is that the problem -- in theory the problem becomes more
non-convex and harder to solve. If you want to train a deep, you know, multi-layer
neural network, there are simply many more local minima as you add more layers.
And since you're using these sort of stochastic gradient methods in a multi-layer
neural network, you have this problem of kind of vanishing gradients, you know;
your credit assignment problem becomes even more
complicated.
So there is also the kind of Universal Approximation Theorem which states that
for, you know, a certain class of kind of -- not all the possible functions in the
world but under a set of smoothness assumptions -- you can, you know, you can
approximate -- you can represent basically any of these functions with a
one hidden layer neural network, so basically why bother, you know? And then
there's people who invented SVMs and convex methods and they were all nice and they
performed better than one layer neural networks, so I guess people just kind of
threw these things away. That's my 20-year history of machine learning -- but
for neural networks.
I think the only kind of deep neural network variety that stayed out there and that
kind of evolved was the convolutional neural networks of Yann LeCun. They had
one -- you know, they were not fully connected. They had this constrained
topology and connectivity and that made -- seemed to have made all the
difference. It was easier to actually train them -- it was possible to actually train
these networks even though they had like a pretty large number of layers.
So what happened in 2006 is that Geoff Hinton and collaborators, they used
unsupervised pre-training, so they just used unsupervised learning to initialize
these networks, these fully connected networks. The kind of algorithm that they
used was Restricted Boltzmann Machines. And they seemed to work.
So the question is going to be why this actually happened. Before we get to that
question, just a brief overview of what Restricted Boltzmann Machines are. It's a
simple graphical model with visible units X and H, hidden units. It's a bipartite
graphical model so -- in which, you know, there's efficient inference and
sampling. You know, inference is simply a matrix multiplication basically followed
by a sigmoid. So that's nice for a graphical model.
Since basically the conditionals of H given X are factorial, it's easy to sample and
easy to do inference, you trade off the fact that it's -- learning it is very hard in
principle since you need to kind of compute a normalization constant here which
you need to -- it's going to be very hard if you're just going to do it for anything
that's kind of a non-trivial model.
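For reference, a standard way of writing down the binary-binary RBM being described here -- the notation is assumed, not taken from the slides:

    E(x,h) = -b^\top x - c^\top h - h^\top W x
    P(x,h) = \frac{e^{-E(x,h)}}{Z}, \qquad Z = \sum_{x,h} e^{-E(x,h)}
    P(h_j = 1 \mid x) = \sigma\left(c_j + (Wx)_j\right), \qquad
    P(x_i = 1 \mid h) = \sigma\left(b_i + (W^\top h)_i\right)

Z is the intractable normalization constant mentioned above, and the factorial conditionals are what make inference just a matrix multiplication followed by a sigmoid.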
So what people like Geoff Hinton and others have done is that they use contrastive
divergence and many other variations from the MCMC literature -- there's a whole
industry these days of taking methods from that literature
and applying them to RBMs.
These are basically essentially either approximations of the gradient of the likelihood
or some other ways of doing smart sampling. It's fast and it's simple. Contrastive
divergence at its simplest involves three matrix
multiplications and sampling twice. So it's easy. It's easy to do many,
many, many iterations of this contrastive divergence. RBMs have been
extended to real-valued units. This kind of model here pre-supposes that the Xs
are binary valued and the Hs are binary valued. You can extend this to kind of
arbitrary or real-valued visible units and hidden units. There's been extensions to
semi-supervised learning. You could have the label here. It's sort of efficient if
you want to.
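To make the "three matrix multiplications and sampling twice" concrete, here is a minimal sketch of a CD-1 update for a binary-binary RBM. The numpy setup and names are my own; this is a sketch of the standard procedure, not the code used in the work being presented.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd1_update(W, b, c, x, lr=0.1, rng=np.random):
        # One CD-1 update for a binary-binary RBM on a minibatch x of shape
        # (n_examples, n_visible); W has shape (n_visible, n_hidden).
        h_prob = sigmoid(x @ W + c)                                            # matrix multiplication 1
        h_sample = (rng.random_sample(h_prob.shape) < h_prob).astype(x.dtype)  # sample 1
        v_prob = sigmoid(h_sample @ W.T + b)                                   # matrix multiplication 2
        v_sample = (rng.random_sample(v_prob.shape) < v_prob).astype(x.dtype)  # sample 2
        h_prob_neg = sigmoid(v_sample @ W + c)                                 # matrix multiplication 3
        # Approximate likelihood gradient: positive minus negative phase statistics.
        n = x.shape[0]
        W += lr * (x.T @ h_prob - v_sample.T @ h_prob_neg) / n
        b += lr * (x - v_sample).mean(axis=0)
        c += lr * (h_prob - h_prob_neg).mean(axis=0)
        return W, b, c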
Another way of -- so deep belief network is simply a way of turning these
Restricted Boltzmann Machines into algorithms for iteratively learning the weights
of your deep neural network in an unsupervised fashion. So you're going to take
this Restricted Boltzmann Machine which you're going to be learning the features
-- you know, you're going to be training this RBM on your training data. You're
going to figure out some set of features that is interesting, that somehow perhaps
captures the distribution of your training examples.
You're going to fix the weights in here of this RBM. You're going to do a sort of
forward propagation. You're going to find basically P of H given X for each of
your training examples. You're going to learn another Restricted Boltzmann
Machine on top of it that is basically going to model the features that you just
learned. You're going to do this until you're finished, which is in many cases
three layers. So a lot of people say it's not very deep, but I'm not going to go into
that.
You're -- once you're done, you basically have a way to generate -- well, to
generate as well, but to figure out what is the -- this deep distributed
representation at level 3 given the input for each of your inputs, you know, not
necessarily from the training data, and you can use these -- the weights that you
just learned the parameters that you just learned to initialize the network, the
supervised network, and you can just do backprop afterwards.
So -- or you can do like Geoff Hinton, you can do some sort of a fancy RBM here
that has a label. There's many ways to do this. But it usually involves kind of a
two stage process where you do unsupervised learning using this kind of weird
graphical model, stacking it greedily layerwise and transforming this into a deep
belief network.
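Schematically, the greedy layer-wise recipe just described looks like the sketch below. The train_one_rbm helper is hypothetical, standing in for whichever single-layer unsupervised learner is used (an RBM here, a denoising auto-encoder in the variant discussed next).

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def greedy_pretrain(x_train, layer_sizes, train_one_rbm):
        # train_one_rbm(data, n_hidden) is assumed to return (W, c) for one layer.
        params, reps = [], x_train
        for n_hidden in layer_sizes:
            W, c = train_one_rbm(reps, n_hidden)   # fit one layer on the current representation
            params.append((W, c))
            reps = sigmoid(reps @ W + c)           # "forward propagate": P(h = 1 | input)
        return params   # used to initialize the supervised network before backprop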
Another way of doing this is this so called stacked denoising auto-encoders. This
is what we use mostly in our lab. I personally find them quite an interesting
model. So they're inspired a bit by the classical auto encoders from neural
networks about 20, 25 years ago. And a classical auto-encoder is simply -- it's
actually pretty similar kind of, you know, just mechanistically speaking it's very
similar to a Restricted Boltzmann Machine. It's where you -- you simply try to
predict with a one layer neural network -- it takes as input some X and tries
to predict X as well. So it will take X, compute some hidden representation which
is simply, you know, sigmoid(WX + c). It will reconstruct either using the
same matrix -- the transpose or some other matrix, doesn't really matter, and it
will try to minimize the loss between the reconstruction and the original input X.
So that's a classical auto-encoder.
It's non-linear so it's not quite the same thing as PCA. It will do something
different qualitatively. The trick here, the denoising auto-encoder is instead of
feeding to the network the original input X, you're going to try to feed it a
corrupted version of this input. So you're going to try to in some sort of
stochastic way corrupt the X and instead -- and you're going to do the same
thing, you're going to reconstruct. But instead of, you know -- and the kind of
loss that you're going to be doing is you're going to try to minimize the
reconstruction error between the reconstruction of your X tilde and the original X. So you're going to
try to make your network learn how to predict parts of your input from the others.
So you're going to basically make your model more robust. And in theory this is
going to give you a better model of your data -- I mean, if parts of your input are
predictable from the others. If they're not, then your task is very easy in a sense.
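A minimal sketch of one training step of a denoising auto-encoder with tied weights and masking corruption, as one plausible instantiation of what was just described. The corruption level, tied weights and cross-entropy loss are choices made here for the sketch, not a claim about the exact setup in the talk.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_step(W, b, c, x, corruption=0.25, lr=0.1, rng=np.random):
        # One SGD step on a minibatch x of shape (n_examples, n_inputs).
        mask = (rng.random_sample(x.shape) > corruption).astype(x.dtype)
        x_tilde = x * mask                      # stochastically corrupted input
        h = sigmoid(x_tilde @ W + c)            # hidden representation of the corrupted input
        x_hat = sigmoid(h @ W.T + b)            # reconstruction (tied weights: W transpose)
        # Cross-entropy between the reconstruction and the *clean* input x.
        d_out = x_hat - x                       # gradient w.r.t. the reconstruction pre-activation
        d_h = (d_out @ W) * h * (1.0 - h)
        n = x.shape[0]
        W -= lr * (x_tilde.T @ d_h + d_out.T @ h) / n
        b -= lr * d_out.mean(axis=0)
        c -= lr * d_h.mean(axis=0)
        return W, b, c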
So interestingly this kind of model handles a variety of learning problems which
are, you know, sort of an afterthought: oh, wow, this also handles
missing values. If your inputs somehow have missing values -- or you can,
you know, stochastically make missing values. And then your model is going to
be able to predict the missing values. You can have like occlusion. If you're
trying to reconstruct images. Or kind of multi-modality: instead of having just
an image, maybe you're having an image and a text and you corrupt the text part
and you try to reproduce it.
Anyways, so you -- depending on the kind of domain that you're going to be
using you're going to use either some sort of reconstruction error, a mean
squared error or cross entropy -- doesn't really matter, I mean, for this sort of analysis of
this algorithm. And the stacking, you know, how to make this deep, is the same
thing. You're computing for each of your inputs the representation you use the
representation to do this -- a second layer of denoising auto-encoder, you do this
until you're satisfied. You add a supervised soft max layer, do backprop and
obtain as good or better results than if you do a deep belief network. So I find
this model conceptually more -- it's simpler to look at. People who do graphical
models find Restricted Boltzmann Machines simpler to look at. It's really --
>>: The model isn't generative though, is it?
>> Dumitru Erhan: You can make it generative sort of, but it's not -- yeah. Yeah,
you can't really sample from it. But if your goal in life is classification, then
having a generative model is perhaps not necessarily useful. I mean --
>>: And when you [inaudible] I should go back to Pascal's paper, but actually
when you -- you don't add noise to the H, you sort of propagate [inaudible].
>> Dumitru Erhan: When you compute -- you mean when you --
>>: [inaudible].
>> Dumitru Erhan: You forget the noise. So the noise is only during training.
>>: Right. But now you're going to train the next layer.
>> Dumitru Erhan: Oh, yeah. You add the noise here.
>>: So you have a noise model at every layer?
>> Dumitru Erhan: Yes.
>>: Okay.
>> Dumitru Erhan: But then like of course, you know, there is no notion of
occlusion or anything like that in the hidden layer unless you somehow made
your model like that. So it's really -- it's less interpretable as -- a -- but still kind of
the main gist of the argument remains: you're making your model
more robust, you're trying to predict parts of your input, be they actual
inputs or representations of those inputs from the other parts.
So using these and other things -- there's plenty of other ways to do this kind of
layer-wise learning of networks -- people have obtained a lot of good results
on -- on a bunch of datasets: MNIST, natural image patches
learning and generation. Object classification with more or less realistic kinds of
objects. NLP tasks -- I don't know if Ronan was here, and Jason Weston, or if you
have seen them at NIPS; they've been quite forceful about their results on sort
of named entity recognition.
Speech/music classification with convolutional deep belief networks is work by
Honglak Lee and people from Andrew Ng's lab.
MoCap, learning and generation, motion capture if you've seen Geoff Hinton's
chicken dinosaur walks generation at NIPS. It's also kind of fun.
So people have applied this kind of a variety of settings. And it's not quite clear
whether you know at least two years ago when I started doing this kind of work,
whether this was well understood, or why this whole business about initializing
the network actually works.
So the recipe, as you've probably seen -- I've tried to actually describe is that
you're doing layer-wise unsupervised learning or layer-wise learning of
unsupervised nonlinear features followed by supervised learning. So that's the
basic recipe. That's how you do deep belief networks most of the time.
And this has been applied to Restricted Boltzmann Machine, more general
Boltzmann Machines, auto-encoders, denoising auto-encoders, sparse
auto-encoders, sparse denoising auto-encoders. There's even been a paper by
Lawrence Saul and one of his students about using kind of deep kernels or
stacking kernel PCAs which doesn't sound very efficient but it works, actually
works. I was surprised. And the question is, you know, why does this work?
And some of the -- some of the stuff that I'll -- I was doing over the past couple of
years is trying to attempt to untangle the various effects that contribute to good
performance. So I'm going to try to verify some of these hypotheses in a kind of
large scale setting -- many hyperparameters, with kind of bigger datasets. I mean,
large scale is all relative. Maybe it's large scale -- large scale by our standards,
maybe medium scale by your standards.
So and it's -- the goal is to try to kind of demystify deep architectures. There's
been somewhat of a -- I wouldn't say backlash but you can see sometimes in
reviewers' comments that they understand how -- you know, they understand that
they work, but they don't really understand why they work. So we try to demystify a
bit what was going on, try to present some coherent arguments for why they work
and infer maybe something useful for the future research in this field.
So the plan, it's a scientific kind of plan, is to propose some explanatory
hypotheses for why things work, observe the effects of pre-training in various kinds
of settings, and infer the role of pre-training -- since it seems to be the crucial
ingredient in this whole recipe -- from the level of agreement of the results that we
obtain with these hypotheses.
And I'm going to present the hypotheses straight away. They're very simple.
There's nothing too complicated about them. The first one is the regularization
hypothesis which basically states that, well, the unsupervised component
constrains the network to model -- to have good features of P of X. So it's -- it
constrains the model, the parameters to model P of X well either with RBMs or
with denoising auto-encoders or what have you. And it says that the
representations that you get by modeling P of X -- you're going to
basically share the parameters -- are going to be good for P of Y given X in a
generalization kind of way.
The optimization hypothesis states that, well, unsupervised initialization, what it
does is it initializes you near a better local minimum of P of Y given X. So it's a better-
local-minimum-of-the-training-criterion kind of statement. And what it -- what
pre-training helps you do, it allows you to reach a lower local minimum, not
achievable by random initialization -- by the standard kind of way of doing deep networks.
And an interesting thing that we're going to discover is that these hypotheses,
though they sound like they may be at odds with each other, they're not
necessarily incompatible in certain scenarios. Yes?
>>: Question about your use of the term local minimum.
>> Dumitru Ehran: Yes.
>>: We often don't train neural networks until they're actually stuck in a local
minimum. We often stop them earlier, doing do something else.
>> Dumitru Ehran: Yes.
>>: By local minimum, is it okay if I just replace that with basin of attraction?
>> Dumitru Erhan: We're going to use --
>>: Is that going to work?
>> Dumitru Erhan: No. We're actually going to use local minimum -- in here what
we mean is a better local minimum of the training criterion. We're going to use
the basin of attraction -- it's kind of a weird notion of what is a basin of attraction
of gradient descent. I guess it's -- we'll get to that. But literally here what we
mean is unsupervised initialization gets us better training error. So it's a
better local minimum from this -- this point of view.
>>: [inaudible].
>> Dumitru Erhan: Yeah. Because we -- it's very hard to actually verify that
we're at a local minimum in the neural network.
>>: Well, it's very specific, right -- the optimization hypothesis assumes that deep
networks trained without pre-initialization are underfitting, right, and therefore --
>> Dumitru Erhan: Yes. Yes.
>>: So he's saying that they will be -- it's a hypothesis.
>> Dumitru Erhan: So those are -- the two rather different statements of what's going
on and we'll jump straight into that. Once I can get my slide working.
>>: I have a question.
>> Dumitru Ehran: Yes?
>>: All right. So during the supervised training all of the weights are
manipulatable, right?
>> Dumitru Ehran: Yes, everything is manipulatable, yes.
>>: So how does it actually constrain [inaudible] I mean it doesn't sound like
there's any constraint [inaudible].
>> Dumitru Ehran: Yes. We're going to see that this is a very, very tough
constraint. Because you're doing -- one has to remember that you're not doing
convex optimization here anymore. So, you know, if you were to optimize in kind
of a -- you know, a convex -- if you're doing kind of convex optimization, it doesn't
really matter where you start from. I mean, one way or the other, if your kind of
optimization problem is well set up you're going to reach the minimum.
>>: The constraint is sort of like moving towards the minimum that's near P of X
-- I mean, it's like picking --
>> Dumitru Erhan: It's picking -- it's picking the basin of attraction. So I'm sort of --
>>: You begin by just looking at P of X --
>> Dumitru Erhan: Yes.
>>: [inaudible] optimize [inaudible]. Now, I'm not exactly sure how you define
your model. If you said Y for exactly [inaudible] it seems that P of Y given X would be
exactly the [inaudible] your minimal -- vice versa. But you're not using that
information in your initialization so your pre-learning step should be equally good
for P of Y [inaudible] or P of minus Y?
>> Dumitru Ehran: I mean, yeah. I'm not quite sure what -- are you saying that
basically since we're doing P of X and we don't know anything about Y --
>>: You're claiming that you get a better starting point by looking only at X and
not Y.
>> Dumitru Ehran: Yes.
>>: Which would mean that you should -- that you're claiming that the good
starting point for any given set of [inaudible] so you're saying there's a region in
the space we have very, very high peaks and very, very low. And then there's
other places that you -- so you want that very high coefficient --
>> Dumitru Erhan: I think -- I think I agree -- I think yes. But -- yeah. But --
yeah. So I guess I'm not -- yeah, I didn't think of what you just said but yeah, I
think the statement also says that.
All questions done.
>>: One more question then.
>> Dumitru Ehran: Okay.
>>: [inaudible] two kind of pieces of [inaudible] that actually does the [inaudible]
one is [inaudible] pre-training.
>> Dumitru Ehran: Yeah.
>>: And you just pull this entire weight into [inaudible]. And the other one is to
do [inaudible] right within the [inaudible] --
>> Dumitru Erhan: You mean at the top layer. Yeah.
>>: So do you support both of those?
>> Dumitru Ehran: No, we don't. I mean we've empirically sort of found that they
do work the same way -- I mean, they give the same kind of results. So we stuck
with the simpler, from my point of view simpler kind of --
>>: [inaudible].
>> Dumitru Ehran: Yeah. I mean, it's basically just taking the same network and
adding a softmax layer so --
>>: So you think all the conclusions you get from here is [inaudible].
>> Dumitru Ehran: Yeah.
>>: [inaudible] applied to the [inaudible].
>> Dumitru Ehran: DBNs, yes. Because the kind of conclusions -- the same
kind of framework, we applied it for both DBNs and stacked denoising
auto-encoders, and with stacked denoising auto-encoders there's no equivalent like
-- it's only backprop that you can do at the end.
So the setup is simple. It's MNIST, handwritten character recognition. So it's
one -- two datasets. One is a dataset where we can iterate many times until we
reach some sort of local minimum, you know, basically zero training error. Very,
very fast on this. And InfiniteMNIST, which is basically a way, by Gaelle Loosli
and I think -- I think it's Leon Bottou, yeah, maybe Leon Bottou's student. It's a
way to generate elastic deformations on the fly from MNIST digits. So you can
sort of generate -- it's not a completely i.i.d. 10 million examples kind of dataset
but it's large and somewhat interesting compared to MNIST in that you can't
really go over it many times. So we tried 10 million examples. That's what I mean --
that's what I mean by large scale.
So we did three models. The two that I just presented, DBNs and stacked
denoising auto-encoders. DBNs were trained by the classical contrastive
divergence algorithm, with one step of kind of sampling. Stacked denoising
auto-encoders, again, the way that I just presented. And then models with
one to five layers without pre-training. So randomly initialized using, you know,
sort of the standard way of initializing neural networks as preached by Yann
LeCun maybe 10 years ago.
So then we tested very, very many hyperparameters with many, many seeds,
just to be sure of the -- yes?
>>: So one critical thing is are you using for the [inaudible] are you using
[inaudible]?
>> Dumitru Ehran: Percent. I think. Yeah. Yeah. We always use percent. So
I'm going to jump straight into the whole observing the effect. So if we study the
hypotheses, we now look at what pre-training does. So the very simple kind of
effect of pre-training is just observing the generalization error. So if you just
select the best validation error, you look at those models, you sample very many
instances of those models, so different random initialization seeds and you do
denoising auto-encoder pre-training and no pre-training, you vary the number of
layers and you look at the classification error for both of them, you can see that
qualitatively they behave differently, so as you add more layers with pre-training,
it's getting better. Though I would say statistically it's probably not better. If your
-- but at least from this point of view, it's very different than this one. And there is
a missing five-layer point, because we couldn't even -- weren't able to train the five
layer network. It just didn't converge to anything but garbage.
>>: So how does it compute the variance?
>> Dumitru Erhan: The variance is -- this is across 500 initialization seeds.
>>: [inaudible] how do you choose, there's a lot of choices of like -- this is
optimal and [inaudible].
>> Dumitru Ehran: Yes. Everything -- so this is for the models that -- for the
models that give us the best classification error, we -- and we estimated this over
maybe 50 samples per combination of hyperparameters, so --
>>: 50 [inaudible].
>> Dumitru Ehran: 50 initialization seeds. We choose the best model, we start
again with 500 initialization seeds for both of these. So for one, two, three, four,
and five --
>>: So for backpropagation what kind of optimization did you use?
>> Dumitru Ehran: It's the simple stochastic gradient descent.
>>: No [inaudible].
>> Dumitru Ehran: No [inaudible] no. We have never found it to work very well.
So -- yes. And this --
>>: So this is just a neural network.
>> Dumitru Ehran: This is just a normal neural network.
>>: So what is the optimal number of hidden units, you know, roughly for one or --
>> Dumitru Erhan: For one it's -- it's a thousand.
>>: That's interesting. I was getting 1.53.
>>: I thought the 1.06 by just changing the [inaudible].
>> Dumitru Ehran: We've really tried.
>>: No, no, I know.
>> Dumitru Ehran: It was a really -- I think, you know, if you look at some of the
papers that -- in this field there's a lot of these sort of arbitrary choices of like oh,
we do momentum this way for this first 10 epochs and then we do momentum
afterwards. So we really tried to make it as large scale as possible. So, you
know, this is like probably close to like a year of CPU time wasted just on this
experiment so.
>>: [inaudible] randomized the weights, did you choose [inaudible] randomize
[inaudible] choose the best one to randomize?
>> Dumitru Ehran: No, no. The seed was not validated.
>>: Were you using one over the square root of N?
>> Dumitru Ehran: Yes. Yes. Yeah, it's all -- yeah. So we wanted to make it as
black box as possible. So from that point.
>>: Is it still on training data? Like you see the difference between --
>> Dumitru Erhan: Yeah, what will -- it's a different experiment, so yeah.
>>: I think the initialization of the weight is too big.
>> Dumitru Ehran: It does make a difference, yeah. It does -- I'll grant you that I
think somebody in my lab, they've done some sort of -- but it will never get, you
know, it will never be comparable. We've done --
>>: [inaudible].
>> Dumitru Ehran: Yeah. Yeah. So it's the -[brief talking over].
>>: [inaudible].
>> Dumitru Ehran: I'm sorry?
>>: I mean, I used a different set of data. For this one it's probably okay.
>> Dumitru Ehran: Okay.
>>: But for some others, just using randomization, it didn't really get --
>> Dumitru Erhan: Well, I think you have to have the right kind of combination of
the -- inputs -- I mean, it's a choice -- you have to choose, so this sigmoidal
hidden units, so your -- basically your weights have to -- if your data is kind of
mean zero and variance one or something, your weights have to be chosen
such that the activation of your hidden units falls on average in the linear part
of the sigmoid.
So it's a choice that depends on the data. First you have to kind of make your
data play well with the sigmoids, and then your weight kind of initializations have
to be, you know, as you said, one over square root of the number of inputs.
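Concretely, the "standard" random initialization being referred to is something like the following sketch: zero biases, uniform weights on the order of one over the square root of the fan-in. The exact bound used in the experiments is not specified here.

    import numpy as np

    def init_layer(n_in, n_out, rng=np.random):
        # Uniform weights scaled so sigmoid units start in their roughly linear regime,
        # assuming inputs that are approximately standardized.
        bound = 1.0 / np.sqrt(n_in)
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
        b = np.zeros(n_out)
        return W, b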
>>: And one more nitpicky question. I'm sorry. Did you initialize all the layers
randomly or N minus one layers randomly?
>> Dumitru Erhan: But what's the Nth layer?
>>: You don't need to initialize all the layers randomly, you could choose one layer to be
zero.
>> Dumitru Ehran: Yeah, that's true. No, we initialized [inaudible].
>>: I have a question. So it looks like -- so you're adding [inaudible] you're
making the model more and more complex and this [inaudible] is getting worse.
>> Dumitru Ehran: Yeah.
>>: So that would be the classic effect of a regularization scheme. So the
question is, I don't know much about the left hand side. Is there any
regularization here at all?
>> Dumitru Ehran: We've validated that parameter also. We've done like L1 or
L2 kind of standard shrinking or just the L2 kind of regularization. So we varied
that parameter as well.
>>: [inaudible].
>> Dumitru Ehran: No, of the -- so you add a penalty to your cost which is
lambda times the L2 norm of your weights.
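In symbols, the penalized training criterion being described is of the form

    \mathcal{L}(\theta) = \mathrm{NLL}(\theta) + \lambda \lVert W \rVert_2^2

with lambda the validated hyperparameter; whether the biases are included and whether there is a factor of one half are details not specified here.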
>>: [inaudible].
[brief talking over].
>>: And the same and so that's nonzero, and it's the same for all weights?
>> Dumitru Ehran: It's the same for all weights and it's non -- you mean the
optimal one? So for each of them, it's going -- but it's going to be very small. But
yeah, there's a nonzero optimal lambda for.
>>: [inaudible].
>> Dumitru Ehran: Yeah. And it's early stopping. Yes. And I think mini batches
-- the mini batch size was also validated, but we didn't do like very fine grained
like it's either one or two -- one or 10 or 20.
>>: What's the [inaudible] for that --
>> Dumitru Erhan: Yeah, it's the same thing. And so we.
>>: [inaudible].
>> Dumitru Ehran: You mean the optimal one? I think it's 10.
>>: 10?
>> Dumitru Erhan: Yeah. No --
>>: I guess [inaudible] you get variance.
>> Dumitru Ehran: I think if you were -- you know, if you've got to pound on it
more. If you were really to find the truly optimal hyperparameter selection, I think
what we wanted is more of a qualitative effect of what's going on. So I think you
can -- you can make this lower. You can make this -- I think you can subtract
probably safely about, you know, .1 or .2. But not for these. Not for these.
>>: [inaudible].
>> Dumitru Ehran: Yeah, yeah, I think for one layer maybe you can -- you can
get -- I think 1.5 I'd say safe. But these can get, you know, 1.3 let's say. And I
haven't really tried too hard. I mean, it's -- it was a simple experiment where, you
know, I launch a batch of jobs.
>>: So you think that the reason why five layers --
>> Dumitru Ehran: Here?
>>: The right hand side.
>> Dumitru Erhan: Oh, here. It goes up because --
>> Dumitru Erhan: Well, there's no optimal kind of number. I mean obviously if
you add very, very many layers at some point in time. It's kind of a classical
tradeoff between the complexity of the model and the kind of data that you have.
So at some point in time you're going to overfit, even if you have a really good
initialization point.
>>: [inaudible] less severe for --
>> Dumitru Erhan: That is correct.
>>: [inaudible].
>> Dumitru Ehran: That's basically the conclusion from this model. And you can
see this also from the tail here at the -- so, you know, it's not like you
get a lot of these on this side of the distribution. So it's more that the errors, you
know, there's a bit of a skew towards bad errors. So it's more likely to actually
get into a really bad local minimum or generalization minimum if you want. If you
have one of these networks without pre-training than with one of these. And it -- you
know, once you get to, you know, worse results, you get the same kind of effect
here. But it's less pronounced than in -- so it's -- like you can see some of this of
the histogram of errors. So this is for one layer, you know, this is with
pre-training and without pre-training. So this looked like a bit of Gaussian. This
one has a bit more fat tails, so this is with four layers. This corresponds to this.
This distribution here.
>>: So just curious, I mean, when you [inaudible] either case, it's true that all the
training error is zero.
>> Dumitru Ehran: I'll get to that.
>>: Okay.
>> Dumitru Ehran: In one second. Yeah, it's true.
>>: [inaudible].
>> Dumitru Ehran: But the -- if you're looking at the classification error, which is
-- which all these get zero classification error for training. So it's hard to
compare. We're going to look at the actual training objective, which is more
interesting.
So, yeah, we've seen the effect of pre-training for the generalization error.
Another kind of effect that we're going to look at is more -- so a bit more
conceptual is we're going to look at the actual networks, the actual networks in
what we call the function space. So we're going to try to project the networks.
The outputs of the networks -- we're going to treat them as functions. And we're
going to try to project them via some nonlinear dimensionality reduction -- two
methods actually -- into a two dimensional space. So each of these points here, each
single point of the different colors represents one network at one point in time after
one epoch of training. So this is what we kind of call the flower.
That represents the neural networks pre-trained using unsupervised learning. I
think it's RBM. Two layers. So this is 50 networks after one epoch of supervised
training. So they all kind of start in the same point. And they all kind of move to
the local minimum as well as learning goes. Yes?
>>: What is the X and Y --
>> Dumitru Erhan: They are arbitrary. Yeah, they have no meaning. So this can
be rotated in any way. So there's no -- like it's just a -- you know, it's the result
of, you know, t-SNE, which is a -- it's a dimensionality reduction method that
kind of preserves local structure, or Isomap, which is a reduction that preserves kind of
global structure. And it's whatever output -- you know, whatever values these
methods give to you. I don't think they're meaningful in any way --
>>: [inaudible] network you compose with all the weights across different
networks [inaudible].
>> Dumitru Ehran: No, these are not the weights. So this is slightly different.
But we're going to get to the weights in the next slide. These are the actual
output. So you take a test set, you compute the output of your network on each
of these -- of your testing points, you concatenate this gigantic output vector.
This is your input space, if you like. So this is -- this vector represents one
function at one instance, at one point in time. You're going to do this -- you're
going to take a bunch of these. So 50 for 50 networks that are pre-trained, 50
networks without pre-training, so randomly initialized. You're going to put them in
a big bunch of basically vectors that have dimensionality I think 100,000 or so.
And you're going to project this. Of course, you know, it's impossible to preserve
all the right distances. But these kind of -- it preserves -- it presents an
interesting picture of what happens if you project all of these vectors into two
dimensions and you look at the evolution of the networks as you continue
supervised training.
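A sketch of how these function-space points can be built and projected. The predict_log_proba interface is an assumption for illustration, not the actual code from this work; scikit-learn's t-SNE and Isomap stand in for whatever implementations were used.

    import numpy as np
    from sklearn.manifold import TSNE, Isomap

    def function_space_points(network_snapshots, x_test):
        # One row per (network, epoch) snapshot: its outputs on the whole test set,
        # e.g. the 10 log class probabilities per test point, concatenated into one
        # long vector (10 * n_test dimensions, around 100,000 for MNIST's test set).
        return np.stack([net.predict_log_proba(x_test).ravel()
                         for net in network_snapshots])

    def project_2d(points):
        # Two nonlinear embeddings of the same points: t-SNE preserves local
        # structure, Isomap more global structure; the resulting axes are arbitrary.
        return (TSNE(n_components=2).fit_transform(points),
                Isomap(n_components=2, n_neighbors=10).fit_transform(points))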
So there's a couple of interesting things that you can observe here. One is that
they seem to explore different kind of regions of space. They don't seem to kind
of cross each other's paths, if you like. And with Isomap what's interesting, these
are networks without pre-training and this is with pre-training. From a point of
view of a network of one of these networks with random initialization, the volume
occupied by basically all the networks with pre-training is very small.
What else is here? There's many apparent local minima. So if you look at -- so
this is a network -- so this is a -- so it's a bit of an artifact of the method since it
preserves local structure. So it tries to kind of cluster things together a bit, but it
also -- also means that, you know, these kind of tiny things, this corresponds
basically to the trajectory of a network in this two dimensional space at the end of
training. So, you know, they all seem to go to the same kind of --
>>: [inaudible] data.
>> Dumitru Erhan: This is MNIST data, yes.
>>: So where is the [inaudible].
>> Dumitru Ehran: This is 10 points. Yes. This represents the class probability.
>>: [inaudible].
>> Dumitru Erhan: No, this is --
>>: Times 10,000. Oh, I see.
>> Dumitru Ehran: So the test set -- yeah, so the test set is 10,000. You
compute 10 output labels basically -- well, not labels, but just the classical
conditional probabilities or the log of these. You concatenate them. You project.
>>: So but I -- I question whether you can take the volume of the projection and
try to interpret that because it could be the volume how clustered -- how tightly
clustered they are. Could it be side effect of your -- of Isomap. So I mean, I
mean like.
>> Dumitru Ehran: So we tried to make one single component. So that's the
kind of parameter that we gave to the Isomap. So it's -- it will -- I agree, it
sucked.
>>: Right. But I think looking at how disjoint they are is probably still meaningful
in that it's probably not an artifact of your nonlinear embedding method. But like
looking at the volume, that seems kind of ->> Dumitru Ehran: Well, I mean the volume is -- yeah. So ->>: I've seen plenty of examples where you take -- you know, the uniform Swiss
roll and you unroll it and if you don't do it right, you can get this kind of [inaudible].
>> Dumitru Ehran: Well, that's why we did this -- you know, we tried a couple of
these. We tried PCA, also, we tried kind of a linear projection method and they
also, you know, they go different paths.
>>: So these are the same dataset, two different --
>> Dumitru Erhan: This is the same dataset, two different methods for nonlinear
projection, nonlinear embedding.
>>: So [inaudible] so the question is what kind of [inaudible].
>> Dumitru Erhan: Well --
>>: [inaudible].
>> Dumitru Erhan: So we've done -- as I said, we've -- I think the main thing
we tried to see, the motivation for this experiment, was: do they seem to
explore -- you know, basically the hypothesis is, well, if I try long
enough, you know, if I try hard enough and randomly sample from this uniform
maybe bad way of sampling you know weights in my randomly initialized
network, will I just get lucky and get one of these, you know, one of the pre-trained
networks? It's a hypothesis. It's a plausible way. Maybe -- maybe it's, you
know, pre-training is just a way to kind of choose some of the good random
initializations. And I think basically the result is the [inaudible] as far as you can
see. I mean, admittedly -- I'll admit, you know, that there cannot be
any perfect way to project from -- it's 100,000 [inaudible] to two, but at least from
this point of view they don't seem to explore the same kind of regions of space
from this function space approximation.
>>: [inaudible] different seeds, I suppose.
>> Dumitru Erhan: Different points of different seeds, yes. These are 50
networks. They start kind of in the same place -- because they were pre-trained. And
they all -- so these correspond -- so the 50 networks correspond to the same exact
configuration of hyperparameters, with different random initialization
seeds. So any other questions? Yes.
>>: Do you [inaudible] vectors of the distribution of the [inaudible] data being
correct? Do you have experiments where you tried to distort the distribution or
say noise [inaudible] data?
>> Dumitru Ehran: Well, you can -- I mean, you can always make it -- clearly you
can always be adversarial about it and make the unsupervised distribution clearly
unrelated to P of Y given X, so then you're going to be wasting time. But, yeah,
here it's --
>>: But how sensitive is the result of --
>> Dumitru Erhan: I don't know. I don't think we've got any confirmation of that.
Yes?
>>: Just to make sure I understand. These two flowers, they were -- so the
[inaudible] two dimensional space was [inaudible].
>> Dumitru Ehran: No, no, no, both of them at the same time.
>>: So why don't they start at the same place?
>> Dumitru Erhan: Because they -- well, one of them -- so this is -- these are all
networks that have been randomly initialized -- I'm sorry. Yeah, randomly initialized
the standard neural network way. So --
>>: So how does your random initialization all start at one point basically?
>> Dumitru Ehran: Well, they all start at the same kind of random point because
they all have the same hyperparameters. So they all have, you know, a
thousand or whatever hidden units and --
>>: It's not just the initialization point that's part of the data, it's also the output.
[brief talking over].
>> Dumitru Ehran: Yeah, yeah, they're not -- we're not projecting the weights
because it's not a good idea to do that. I mean, it's -- you can always permute
the kind of the hidden units in a network and it's the same solution. So that's not
-- so there's many, many configurations possible that -- it's misleading to project
the parameters. But the outputs should be independent of the --
>>: If I understand, why isn't it reasonable to ask what would a standard neural
network do if it started from that point with the [inaudible]?
>> Dumitru Ehran: Well, I don't know how you would make it start from there. I
mean this is -- so these are the standard networks that are randomly initialized -- of
course, you always randomly initialize. You do unsupervised pre-training.
Then you do supervised training -- supervised learning. So you've done the
unsupervised, you've modelled P of X. You've gotten your parameters. You
compute the output of these -- of this network on your test data. This is where
you are. At the beginning. And then you move. Here you've done no
unsupervised learning. You start randomly, you compute where you are in this
projected space on your test data in this function space, and then this is where
you are and this is where you go. Is that clear?
>>: Are you saying -- so this basically means no matter how you initialize
random kind of covers the [inaudible].
>> Dumitru Erhan: No, no, no, but it's --
>>: [inaudible].
>> Dumitru Ehran: It's random with -- so they all have in common that they have
the same hyperparameters. They all have the same architecture, the same
number of hidden units, the same learning rate, the same regularization factor,
the same mini batch size. The only thing in which they differ is the -- the initial
value -- the random seed of the weight matrix. So --
>>: [inaudible].
>>: [inaudible].
>>: I mean, you're right, a random network could eventually be there, but it's so
improbable that --
>> Dumitru Erhan: Yeah, yeah. I think that's the hypothesis -- that it's so
improbable that it was -- it could theoretically be here. Though it's not even clear
because the way we initialized this is uniform, so the actual values are bounded to
be between minus some small number and some small positive number.
>>: [inaudible] the other one doesn't have to start linear, right?
>> Dumitru Ehran: No.
>>: It's whatever crazy --
>> Dumitru Erhan: Whatever the DBN spits out is where we are here. And these
are big numbers.
>>: [inaudible]. Interesting. Now, for each point that you finally end up with after
supervision, do you have a measure to see how [inaudible].
>> Dumitru Ehran: Yes. So these are the networks -- these are the networks
that -- so for each -- for the class, you know, this is the best randomly initialized
network, best 50 for two layers. This is the best pre-trained network. So these
are best in class. So they go to the whatever --
>>: So you have a plot to show --
>> Dumitru Erhan: Yeah, yeah, we'll get there too.
>>: Another question. Is there a way for you to say like, you know, just say like
-- is there a way for you to say unproject two dimensions [inaudible].
>> Dumitru Ehran: [inaudible].
>>: [inaudible] like weight space and say what would happen if you say started
in one of the -- you know, like after you train at the little [inaudible] when you start
in that region?
>> Dumitru Ehran: Well, I think that the only thing that you can do is you -- I don't
think you can unproject, no. Not with any --
>>: So because we -- it will be interesting to see, though, if you started in the --
that bloomy region over there to start with.
>> Dumitru Erhan: You mean this one or --
>>: Yeah. But this one it goes --
>> Dumitru Ehran: Oh, yeah.
>>: If you started closer to like where the eventually ended up, to begin with, I
mean would you get better results?
>> Dumitru Ehran: We'll have some sort of experiment that is kind of like that
where we start with the same kind of values. So not -- obviously -- yeah. So
same kind of magnitude of the weights that is given to you by pre-training, but
random. So not -- no, not the result of the unsupervised learning. And we'll get
to see -- we'll test hypothesis that is very similar to what you just said.
>>: Okay.
>>: [inaudible].
[brief talking over].
>> Dumitru Ehran: This talk was supposed to be 45 minutes. But we have spent
10 minutes on this one.
Okay. So the other way of looking at it, the actual weight space. So you don't
take the output of the networks on your test data, you actually just look at the
filters, weights, the -- I don't know how they call those.
>>: Receptive fields.
>> Dumitru Erhan: Receptive fields, exactly. Yes. There's many words for that.
So basically you just -- since we're operating on images, you just take the
weights and the first layer weights and you -- you know, you make them a nice
image in the same space as your input space. You can -- you actually can see,
so white corresponds to kind of positive values of your -- of your weights; black
corresponds to negative values; and gray corresponds to zero.
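A sketch of the visualization being described, assuming MNIST-sized inputs and a first-layer weight matrix W of shape (784, n_hidden); the plotting details are my own.

    import numpy as np
    import matplotlib.pyplot as plt

    def show_receptive_fields(W, img_shape=(28, 28), n_show=100, n_cols=10):
        # One image per hidden unit, on a symmetric gray scale so that zero weights
        # map to gray, positive weights toward white and negative weights toward black.
        vmax = np.abs(W[:, :n_show]).max()
        fig, axes = plt.subplots(n_show // n_cols, n_cols, figsize=(n_cols, n_show // n_cols))
        for j, ax in enumerate(axes.ravel()):
            ax.imshow(W[:, j].reshape(img_shape), cmap="gray", vmin=-vmax, vmax=vmax)
            ax.axis("off")
        plt.show()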
So a couple of interesting things in here. So this is with a DBN. The first layer
weights. After pre-training, so unsupervised learning -- this is what they give
you. So it's some sort of edge detector, stroke detector, whatever they are. It's
not necessarily interpretable. This is what happens with the same network. So
after you've done supervised learning. So it's a pretty small difference between
at least kind of visually if you look at the -- have the kind of a visual inspection of
where these weights ended up after you've done supervised learning, they don't
seem to be different much at all.
Conversely, these weights, obtained with a one layer network -- well, with a three
layer network actually, without pre-training after supervised learning. So the
comparison is between these. So they're not necessarily in the same kind of
regions of space, visually speaking. It's not as strong a statement as the one I
made with those trajectories.
But this is interesting. This -- what this shows us is that unsupervised learning kind of
-- really constrains where you start from, the kind of features that you're going to
be starting with in your supervised learning and where you're going to end up,
which is basically in the same kind of basin of attraction.
So I have this cute little method for visualizing second layer weights and kind of
third layer weights. I don't think we're going to have time for that. This is gone.
But it's basically a way to figure out what is the kind of the maximal input that
you're going to get for each unit in your layer. So the interesting thing is there's a
couple of -- the difference between what happens after unsupervised learning
and after supervised learning is increasing as you add more layers. So it's
increasing and there's basically nothing in common here. So there's one thing
interesting here. They basically become kind of digit detectors here.
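One common way to realize "find the input that maximizes a unit's activation" is gradient ascent on the input under a norm constraint; a sketch follows. The unit_activation_and_grad callable is an assumed helper (it would come from whatever framework holds the trained network), and the details here may differ from the exact method behind these figures.

    import numpy as np

    def maximal_input(unit_activation_and_grad, dim=784, n_steps=200, lr=0.1, rng=np.random):
        # Gradient ascent on the input x; renormalize x every step, since without a
        # norm bound individual pixels can contribute arbitrarily much to the activation.
        x = rng.normal(scale=0.01, size=dim)
        for _ in range(n_steps):
            _, grad = unit_activation_and_grad(x)   # unit's activation and d(activation)/dx
            x += lr * grad
            x *= np.sqrt(dim) / np.linalg.norm(x)   # project back onto a fixed-norm sphere
        return x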
>>: Why should the higher layer weight give you something closer to the input?
I thought --
>> Dumitru Erhan: Well, they become -- so in the sense these are more
complicated features, right? So --
[brief talking over].
>>: The first one is the weight.
>> Dumitru Ehran: These are the weights, yeah. This is the second weight.
>>: It's a receptive field.
>> Dumitru Ehran: Yeah, it's a receptive field, but where you take each unit and
you say find me a configuration of the inputs that will maximize -- maximize the
activation of the --
>>: So there's no way it --
>> Dumitru Erhan: It's not a weight. But you can consider --
>>: [inaudible].
>> Dumitru Erhan: If this were a one-layer network, this is in a sense --
>>: Okay.
>>: [inaudible] all white?
>> Dumitru Ehran: Because we are really cranking, so it's kind of a -- the
optimization --
>>: [inaudible] I understand, but why aren't there black and white ones? I don't
know. You talk a fine --
>> Dumitru Ehran: Yeah. It's -- it's really just an artifact of the fact that we are
maximizing the function. So --
>>: Okay.
>> Dumitru Ehran: So it can put a lot into the white. So because it can basically
-- it's a maximal input so -- and since, you know, this pixel, you know, if you don't
bound the norm of the whole thing, it can contribute a lot to the activation of the
function.
>> Dumitru Ehran: By the way, there's some funny things happening. This is
what looks like a four detector, you know, it transforms into an eight after
supervised learning. So --
>>: These are just randomly chosen.
>> Dumitru Erhan: Yeah, this is just -- so these have like a --
>>: 50,000.
>> Dumitru Ehran: Yeah, this is like -- each layer has a thousand units. It just
shows them without any -- it's the first I don't know how many, a hundred or so.
So we've seen the effect of pre-training, now we're going to try to more -- get to
the -- what pre-training actually does. I think we've seen a lot of hints of, you
know, regularization, and we're going to see a lot of this. So I'm going to get
back to what you just said. So if you're [inaudible] about this hypothesis. The
simplest hypothesis actually -- neither the regularization nor the optimization hypothesis --
is what we call the conditioning hypothesis: perhaps what is going on is, well,
you're trying to simply -- what pre-training gives you is it gives you the right range
of values for your weights. I mean, it's -- if you think about it for more than five
seconds, I think it's a pretty silly hypothesis.
But perhaps what just happens is that it conditions your supervised learning in kind of the right way. So what we're going to do with this experiment is simply compute the marginal distribution of the weights that is given to us by unsupervised learning. It's some sort of Laplacian-looking distribution: it's fat-tailed, so a lot of large values are possible. And I'm going to just sample from it. Instead of randomly initializing in the standard kind of [inaudible] way, one over square root of [inaudible], we'll sample from this Laplace-like, fat-tailed distribution.
And it's interesting that it actually gives you a bit of a boost. So this is with random initialization, this is with this histogram initialization, as we call it, and this is with pre-training. This is just the classification error. But it clearly cannot account for the whole difference between what pre-training gives you and what --
>>: You're sampling the distribution independently, right?
>> Dumitru Erhan: Yes, yes. We didn't push it too hard. We didn't want to build, you know, a multivariate distribution of the weights -- it sort of defeats the purpose. I don't think it's a very interesting experiment; maybe it would be if you actually did it. But what pre-training really gives you is large values of the weights that are not randomly sampled large values: they have tight correlations between them, because they need to reproduce the data well.
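A minimal sketch of the "histogram initialization" control described above, under the assumption that pretrained_W holds weights from an unsupervised phase: fit the empirical marginal of those weights and sample new weights independently from it, discarding all the correlations that pre-training actually produces.

    import numpy as np

    rng = np.random.default_rng(0)

    def histogram_init(pretrained_W, shape, bins=100):
        """Sample weights i.i.d. from the empirical marginal of pre-trained weights."""
        counts, edges = np.histogram(pretrained_W.ravel(), bins=bins)
        probs = counts / counts.sum()
        idx = rng.choice(bins, size=int(np.prod(shape)), p=probs)   # pick a bin per weight
        return rng.uniform(edges[idx], edges[idx + 1]).reshape(shape)

    def standard_init(shape):
        """The usual baseline: uniform in +/- 1/sqrt(fan_in)."""
        bound = 1.0 / np.sqrt(shape[0])
        return rng.uniform(-bound, bound, size=shape)

    # Hypothetical demo: fat-tailed, Laplace-looking "pre-trained" weights.
    pretrained_W = rng.laplace(loc=0.0, scale=0.1, size=(784, 1000))
    W_hist = histogram_init(pretrained_W, (784, 1000))
    W_rand = standard_init((784, 1000))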
So this is the plot that many people have already asked me about. To more directly find the role of pre-training, you look at the training error versus the testing error over time. You take two networks that have the best hyperparameters, so the best validation error. You train them until they converge, until there's basically no more change in the training objective function -- or at least as far as we can see we're at kind of a local minimum, or we don't improve anymore. And you look at what happens with pre-training, in red, and without pre-training.
And it's interesting that you never quite get the same kind of training errors. So this is the training error -- lower is this way -- and you never get it. The networks with pre-training never seem to get to a better local minimum of the objective function. This is the cross entropy, by the way, not the classification error; the classification error all gets to zero very fast in training, so there's no point in actually measuring it. And this is the test NLL, the cross entropy.
So it does fit a regularization hypothesis pretty well, a regularization interpretation at least: it trades off the capacity to overfit for better generalization.
>>: [inaudible] for that experiment [inaudible] one single number. That is, if you do an infinite amount of, you know, randomization, one of them must be better than that one.
>> Dumitru Erhan: You mean which -- you're looking at this --
>>: Well, the one in blue.
>> Dumitru Erhan: Yeah?
>>: So the question is how many choices have you --
>> Dumitru Erhan: So.
>>: In randomization?
>>: [inaudible] not really infinite.
>>: The real question is what kind of exploration you might want to have in
[inaudible] that somehow will get something close to this --
>> Dumitru Erhan: But the point is that with pre-training you don't need to. I think that's the -- you don't need to infinitely sample --
>>: [inaudible]. Well, the question is how much, how much do you need to explore?
>> Dumitru Erhan: Well, we --
>>: The size of the network is small.
>> Dumitru Erhan: We tried more than what a normal, sane person would do. [laughter]. So I think that's --
>>: [inaudible].
>> Dumitru Erhan: Let's just say that, at one point, I think I took down Google Maps for a couple of hours with my exploration of these things. People were not happy. So I think that's about as -- 500 per [inaudible], and then you have the number of layers, hidden units per layer, learning rate, batch size, regularization parameter. Each of them --
>>: Nothing can get in the way of that.
>> Dumitru Erhan: Yeah. I think, at least -- well, there's also -- we haven't tried different non-linearities, sigmoid versus tanh or other types, different kinds of -- you can play with this a lot.
>>: [inaudible].
>> Dumitru Erhan: Yeah. I think you -- yeah. But the main argument is not the difference between these at the test level; I think you can make this a bit smaller, maybe. But this, really -- we tried hard to see whether a pre-trained network can actually achieve the same training error as one [inaudible] pre-training, and it cannot.
Another argument for regularization: we said, well, if you think of pre-training as, in some vague way, constraining the capacity of the network -- doing regularization -- maybe that's what it does. Another classical way of constraining the capacity of the network is to constrain the size of the hidden layers. So we just did that, and then we see what kind of effect pre-training has in that case.
So what we did is, as I just said, we took one-, two-, and three-layer networks -- it's basically the same result for all of them -- and we varied the number of units per layer, which is the same in every layer. You can see that in black is a network without pre-training, and in red and in blue is with DBN/RBM or denoising auto-encoder pre-training; it's the same kind of curve. You can see that if the number of units per layer is small enough, pre-training actually hurts. So it seems like pre-training acts as an additional regularizer on top of the one that's imposed on you by the capacity you're shrinking via the number of hidden units.
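A minimal sketch of this layer-size sweep, using scikit-learn on the small digits dataset (an assumption; the talk used MNIST-scale data). The "with pre-training" side here is only a crude stand-in: frozen RBM features feeding a linear classifier, rather than a greedily pre-trained and then fine-tuned deep network.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import BernoulliRBM, MLPClassifier
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)
    X = X / 16.0  # scale pixels to [0, 1] for the RBM
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for n_hidden in (10, 50, 200, 800):
        # randomly initialized one-hidden-layer network
        mlp = MLPClassifier(hidden_layer_sizes=(n_hidden,), max_iter=500,
                            random_state=0).fit(X_tr, y_tr)
        # unsupervised RBM features + linear classifier (crude stand-in for pre-training)
        rbm = Pipeline([
            ("rbm", BernoulliRBM(n_components=n_hidden, learning_rate=0.05,
                                 n_iter=20, random_state=0)),
            ("clf", LogisticRegression(max_iter=1000)),
        ]).fit(X_tr, y_tr)
        print(n_hidden,
              "no pre-training:", round(1 - mlp.score(X_te, y_te), 3),
              "RBM features:", round(1 - rbm.score(X_te, y_te), 3))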
So that's one way of doing it. We've had these results, and we've had some other results I'm not going to talk about, but they're interesting. As somebody pointed out, well, if pre-training is a regularizer and what it does is basically just define the starting point of your supervised optimization, then, like some other regularizers you might think of, perhaps its influence will disappear once you have enough supervised data. If you're a Bayesian kind of person, you think, well, you have a prior which is defined by the unsupervised learning, you have some likelihood term that comes from the data, and your prior, which is sort of maybe not data dependent, will be overwhelmed by the likelihood if you train long enough, if you have enough data.
So we said, let's just verify this hypothesis, let's see what happens in an online setting where we have enough supervised data that, if this is true, whatever starting point we start from will clearly not matter at some point in time.
So we used this InfiniteMNIST, this 10-million-example thing. And basically here it's the number of examples seen, in millions, so 10 million is here. We've tried with 30 million as well; it's the same story. There's nothing different, though we couldn't try that with some of the models -- too slow.
And we measured the online classification error. The only difference between the black curves -- the dashed one is one layer with random initialization and the solid line is three layers with random initialization -- and the red curve, which is a three-layer network with denoising auto-encoder initialization, so with pre-training, is their starting point, where they started from, which is the outcome of pre-training. Yes?
>>: So if online classification error is not [inaudible].
>> Dumitru Erhan: Oh, it's just the classification error on the next examples. It's moved over time. Over the next --
>>: Oh, it's moved over time. So you're not integrating?
>> Dumitru Erhan: No. Yeah -- I've trained on a million examples, I test on the next hundred thousand, I see what the classification error is, I go on. You can see you get some sort of ridiculous classification errors, 10 to the minus 5. There's so much data that you can get pretty low.
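A minimal sketch of the online error measurement just described: score each incoming block before training on it, so every number is computed on examples the model has not seen yet. An SGD linear classifier on the small digits dataset stands in for the deep networks and the InfiniteMNIST stream used in the talk.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.utils import shuffle

    X, y = load_digits(return_X_y=True)
    X, y = shuffle(X, y, random_state=0)
    classes = np.unique(y)

    clf = SGDClassifier(random_state=0)
    block = 100
    online_errors = []
    for start in range(0, len(X) - block, block):
        Xb, yb = X[start:start + block], y[start:start + block]
        if start > 0:                                   # can only score once trained
            online_errors.append(1.0 - clf.score(Xb, yb))
        clf.partial_fit(Xb, yb, classes=classes)        # then train on the block

    print("early blocks:", online_errors[:3], "late blocks:", online_errors[-3:])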
>>: I assume this is [inaudible] the same data that [inaudible] because otherwise
it would be the same [inaudible].
>> Dumitru Erhan: Yes. In a sense, yes. But -- so here we've done 2 and a half million examples of unsupervised learning for the red curve, and 7 and a half million of supervised learning. So we've set a budget of 10 million examples. And either way, even if you allow the network with random initialization to continue, it never quite gets to the same place. So clearly the starting point of this non-convex optimization problem -- optimizing this deep neural network -- matters.
So even in a scenario with essentially unbounded training data. What's a bit surprising is that it doesn't quite follow the standard interpretation of a regularizer. If you actually did this -- and I did this experiment -- with an L2 regularizer, the optimal value when you have a lot of data is zero. There's no reason to use L2 regularization or any of these data-independent regularization schemes -- well, the very simple canonical regularizers; maybe something like a penalty on the sparsity of your activations would behave differently. But it doesn't follow the same interpretation of a regularizer that people think of.
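A minimal sketch of the point about data-independent regularizers, on the small digits dataset (an assumption, and only suggestive at this scale): pick the L2 strength by validation at a few training-set sizes and watch the selected strength drift toward zero (larger C) as the data grows, unlike the pre-training effect in the online experiment.

    from sklearn.datasets import load_digits
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    X, y = load_digits(return_X_y=True)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

    for n in (50, 200, 800):
        scores = []
        for C in (0.01, 0.1, 1.0, 10.0, 100.0):         # C is the inverse L2 strength
            acc = LogisticRegression(C=C, max_iter=2000).fit(
                X_tr[:n], y_tr[:n]).score(X_val, y_val)
            scores.append((acc, C))
        best_acc, best_C = max(scores)
        print(f"n={n}: best C={best_C} (val acc {best_acc:.3f})")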
>>: I think it's pretty clear that this -- because your data isn't real data, I mean
otherwise there's no [inaudible].
>> Dumitru Erhan: Oh, yeah, yeah, yeah. So I've done some experiments with more real data, where we actually have kind of i.i.d. millions of examples, and it's the same kind of result. So there's no difference in that. But this was published and that one wasn't.
Any other questions about this part?
Okay. And it's funny -- you look at this and think, okay, maybe it's a regularizer, but it's not quite clear why. But the training error: if you take the models that you get at the end of training -- so you've done 10 million examples -- and then you test them on the same data that you trained with, in the same sequence, this is a network without pre-training and this is a network with pre-training. Even then, there are a couple of things. It's better at classifying things that it has just seen -- that's just how gradients work -- and for these examples, these are essentially like test examples: it has seen them so long ago that this error is basically the same as what you get at the end of training on the next examples.
So it shows, in a sense, an optimization effect: in the online setting it's better on this training error as well. Actually, some of us expected these lines to cross -- you would think that maybe a network without constraints, such as the one that's randomly initialized, would overfit more on the data it has just seen, especially since it has a lot of tunable parameters. But the pre-training seems to have had an effect here, too.
>>: So what is -- there a --
>> Dumitru Erhan: This is testing on the same sequence of data that it has seen during training.
>>: Oh, I see.
>> Dumitru Erhan: So these are the very last examples. This is where the gradient descent -- the direction of the gradient descent -- has just moved it, so it knows how to classify these pretty well. It's surprising that it's not zero, but it's computed over a large enough sample that it will not actually be zero.
I think -- so I'm getting to one of the last results, and the one that I think makes the clearest and strongest point for why pre-training can be seen as regularization. If the starting point of your optimization actually matters, why don't we do an actual test where you vary not the parameters you start from but the actual data that you look at at the beginning of training? So you take the same 10 million examples that you've seen, but you vary the first million: you keep the 9 million you've seen the same, but you vary the first million that you look at. And then you observe the variance of the output of your function at the end of training. So basically you're just computing a score of how sensitive your estimator, your classifier, is to the data that it sees in the beginning.
And what you should be comparing here is this point on the red line, which corresponds to the start of the supervised optimization for networks with pre-training, with this point, which is the randomly initialized network where we vary the first million examples.
And we can see that with pre-training -- sorry, it has done its unsupervised learning -- it kind of doesn't matter which supervised examples we see at the beginning of supervised learning after we've done unsupervised learning. The variance is going to be low, or much lower than the variance that you get with the network with random initial weights.
>>: This is the variance of the log of the output?
>> Dumitru Erhan: Yes. Log of the correct class.
>>: Log of the correct class.
>> Dumitru Erhan: Yes.
>>: You're using --
>> Dumitru Erhan: Both sometimes.
>>: So you put [inaudible].
>> Dumitru Erhan: Yes.
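A minimal sketch of this variance measurement: several runs share the same "late" data but see a different first block, and we compute the variance, across runs, of the final model's log-probability of the correct class on a fixed probe set. A small SGD logistic model on the digits dataset stands in for the deep networks and the million-example blocks used in the talk.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier
    from sklearn.utils import shuffle

    X, y = load_digits(return_X_y=True)
    X, y = shuffle(X, y, random_state=0)
    pool_X, pool_y = X[:600], y[:600]            # pool the varying first block is drawn from
    rest_X, rest_y = X[600:1500], y[600:1500]    # later data, identical across runs
    probe_X, probe_y = X[1500:], y[1500:]        # fixed probe set for the final model
    classes = np.unique(y)

    outputs = []
    for seed in range(10):
        idx = np.random.default_rng(seed).choice(len(pool_X), size=300, replace=False)
        # loss="log_loss" (called "log" in older scikit-learn) so we can get probabilities
        clf = SGDClassifier(loss="log_loss", random_state=0)
        clf.partial_fit(pool_X[idx], pool_y[idx], classes=classes)   # varied first block
        clf.partial_fit(rest_X, rest_y)                              # shared later data
        proba = clf.predict_proba(probe_X)
        outputs.append(np.log(proba[np.arange(len(probe_y)), probe_y] + 1e-12))

    print("mean variance across runs:", np.var(np.stack(outputs), axis=0).mean())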
>>: [inaudible].
>> Dumitru Erhan: It's future work. It's in one of my slides. Curriculum learning, for those who don't know, is Yoshua's idea that if you have some sort of non-convex optimization problem and you order the examples you look at from simple to hard, under some metric that is somehow defined, then on a couple of problems -- image recognition and NLP; I think it's joint work with Ronan Collobert and Jason Weston -- they obtained better results. With NLP they basically just increase the vocabulary size.
So, yeah, in a sense that's sort of what this says, basically.
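A minimal sketch of the curriculum idea mentioned here: order the training stream from "simple" to "hard" under some score and present it to SGD in that order. The difficulty score below (distance to the class centroid) is purely an illustrative assumption, not the criterion from the curriculum-learning work.

    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.linear_model import SGDClassifier

    X, y = load_digits(return_X_y=True)
    X_tr, y_tr, X_te, y_te = X[:1400], y[:1400], X[1400:], y[1400:]

    # Illustrative difficulty score: distance of each example to its class centroid.
    centroids = np.stack([X_tr[y_tr == c].mean(axis=0) for c in np.unique(y_tr)])
    difficulty = np.linalg.norm(X_tr - centroids[y_tr], axis=1)
    order = np.argsort(difficulty)                      # easy examples first

    clf = SGDClassifier(random_state=0)
    classes = np.unique(y_tr)
    for start in range(0, len(order), 100):             # present the stream easy -> hard
        idx = order[start:start + 100]
        clf.partial_fit(X_tr[idx], y_tr[idx], classes=classes)

    print("test error:", round(1.0 - clf.score(X_te, y_te), 3))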
>>: This one doesn't really order the importance --
>> Dumitru Erhan: Yeah, it doesn't really order. But this one basically says to you that stochastic gradient descent is very sensitive to where you start from, and the early examples influence more where you're going to end up. So you should pay attention to where you start from. And this reduces the variance of the output that you're going to get at the end of training. So you can see pre-training as variance reduction, which is just another quality of regularization.
>>: So [inaudible] obtained by [inaudible].
>> Dumitru Erhan: Yes. These are the networks that are best in class again. So, yeah?
>>: Can I take you back one slide, please.
>> Dumitru Erhan: Sure.
>>: We went kind of quickly on that. So you were saying, and I think very correctly so, that you were expecting the blue line to --
>> Dumitru Erhan: To intersect with the red line.
>>: To intersect, to get, you know, lower at some stage. You don't see that here.
>> Dumitru Erhan: Yeah.
>>: But then you said -- I don't know, you didn't really explain it. So this to me is a red flag. I mean, this is kind of the whole problem with an empirical observation: I have to trust that you did a good job in training your competitor well enough, and this kind of --
>> Dumitru Erhan: Well, this is the network that has obtained the best validation/test error, if you like.
So it's the one that -- well, I mean, you can trust my honesty or not, but I tried hard enough --
>>: I'm not trusting your integrity at all. I mean, I'm just saying this to me would be a red flag that would say something is wrong --
>>: [inaudible].
>> Dumitru Erhan: Well, why is it a red flag? I mean, for me they're both equally plausible alternatives, that they cross or not. I don't see why -- I mean --
>>: I think the natural intuition would be that the [inaudible].
>> Dumitru Erhan: Well, this is not -- but this is just on examples that it has just seen, right? So -- so it's --
>>: So I mean [inaudible].
>> Dumitru Erhan: But you only --
>>: [inaudible] can do better.
[brief talking over].
>>: Especially with these nonlinear models.
[brief talking over].
>>: [inaudible].
>>: Well, except for neural networks I know [inaudible] because the gradient
gets multiplied by the activation [inaudible] very little gradient affects the model.
Once you've trained them, they're wedged. So it's not truly an online experiment
because really the parameters of the first layer get set very early in the process,
and so those probably get screwed up. In fact, he showed pictures of them
screwed up and you're screwed. So it matches my a priori.
>> Dumitru Erhan: The reason I even mentioned the crossing is that it was one of those things where, before doing this experiment, we said: what will happen? I like to enforce this, because there's a lot of hypothesizing after the fact, which I don't like -- oh, we observed the data, let's make a hypothesis that really fits the data. So when somebody proposes an idea, I like to ask the person to show me what they think will happen, show me this graph, and I'll put down mine and try to make an argument for why I think this will happen. And one of those hypotheses was that, something along the lines of what you just said -- because it was unconstrained. I don't know. This is what we observed. And as far as I can tell you, I predicted this would happen -- but I did the experiment, so it's not quite fair. We can talk more.
So, does anybody have questions about this? All right.
So finally, a plot courtesy of Yoshua. He said that if I say this, I should not use it. [laughter]. So basically it's more of a mechanistic explanation, and it's along the lines of what John just said. The dynamics of unsupervised pre-training initialization are such that -- in 2D -- at the beginning of training you're basically choosing the quadrant, or hyperquadrant, that your weights are going to end up in, and it's going to be very hard for you to switch quadrants, to go somewhere else afterwards, once you've done this. So the initial updates have a crucial influence; call this a critical period, as in psychology. And we've seen in some of those plots that the initial updates really explain more of the variance of what's going on.
And this notion of basin of attraction has come up a couple times in the
conversations here. What seems to happen is unsupervised pre-training
initializes in this basin of attraction with good generalization properties.
Let's just go over this very quickly. Pre-training, we've seen, induces qualitatively different functions, in those flower-power pictures.
We've seen the weights, the weight matrices. They seem to be different; they seem to be exploring different parts of the network space.
Some of the first results that we've seen pointed towards a regularization hypothesis -- you know, the worse training error in the MNIST scenario, the layer-capacity-versus-test-error result that we've seen.
We tried to extend those results to an online setting to see whether this pre-training advantage would actually diminish as you add more data, and it doesn't seem to. There are some caveats, as we've seen.
Pre-training seems to be a variance reduction technique -- we literally measured the variance of the output.
And even in an online setting we can see pre-training as a regularizer. Because
it does define -- it does constrain where we start our supervised learning from,
and it seems like in a non-convex kind of setting, this kind of constraint actually
matters.
So the take-home messages are that the unsupervised and early updates have a crucial influence, explain more of the variance, and define the basin of attraction. As we've discussed
a bit, pre-training will have a positive effect as long as modeling P of X is useful
for modeling P of Y given X. If that is not true, I'm not sure what you're going to
get from pre-training.
The influence of early examples could be troublesome, if you think about it. If stochastic gradient descent applied to neural networks is truly subject to overfitting on the early data, could it actually hurt us in a large-scale setting? If your weights are really wedged -- stuck in the basin of attraction that you started in, whether with pre-training or not doesn't quite matter -- that could actually hurt you, because you're going to do online learning, and maybe at some point in time your distribution is going to shift somehow and you're not going to be able to get out of the minimum that you ended up in.
And -- I forgot to mention -- all of these results are with constant learning rates, and in theory stochastic gradient descent with a constant learning rate should be able to escape local minima, in a sense.
Anyway, we have a relatively fresh, just-published JMLR paper that has more.
There's some future work. We want to understand some other semi-supervised deep approaches; for instance, there's a lot of work, notably from Andrew Ng and others, on cases where there's much more unlabeled data than labeled data. It's unclear whether some of the results that we've described actually apply in that case.
There are a couple of people who combine supervised and unsupervised learning costs, so there's not a clear two-stage process; it's not quite clear what happens there either. We want to explore some connections with the generative-versus-discriminative work, some of Andrew Ng and Michael Jordan's kind of seminal work on that, and with Fisher kernels, which are a way of pre-training kernel machines with generative models.
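A minimal sketch of the Fisher-kernel idea mentioned here: fit a generative model (a single diagonal Gaussian, purely for illustration), use the gradient of its log-likelihood with respect to the mean as the feature map, and take inner products of those Fisher scores as the kernel. Real uses would employ a richer generative model and Fisher-information whitening, which is skipped here.

    import numpy as np

    def fisher_scores(X):
        """Gradient of log N(x | mu, diag(var)) w.r.t. mu, at the fitted parameters."""
        mu = X.mean(axis=0)
        var = X.var(axis=0) + 1e-6
        return (X - mu) / var                 # d/dmu log-likelihood = (x - mu) / var

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 5))              # toy data
    U = fisher_scores(X)
    K = U @ U.T                               # Fisher kernel Gram matrix (no whitening)
    print(K.shape)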
We've talked about curriculum learning; it's unclear how this ties in with our results, but there are some connections there.
We want to describe in more concrete terms the basins of attraction of
unsupervised learning. So what does it mean? What properties do they have,
apart from being good models of P of X?
So along these lines, I would like to better understand the kinds of invariances and features that are learned by unsupervised learning. I think that's a more concrete statement than just "basins of attraction," which is pretty vague as a term.
So I think we need a tool for visualizing or comparing the strategies that you use for pre-training, the costs, the architectures -- a tool for better understanding, for maybe poking at those networks. And I have a couple of those tools; you can talk with me if you want to.
It could be useful -- so this is one of those. We've seen those features that I just showed you; I've done some work on trying to visualize that a bit more.
And maybe it could give us a more definitive answer to why it is actually hard to train a deep network and why pre-training actually makes it easier. So I think that's it. Right on time.
[applause].
>>: Well, besides [inaudible], do you have another set of data that maybe corroborates all of the --
>> Dumitru Erhan: Yeah. We have --
>>: We just want more. [laughter].
>>: [inaudible].
>> Dumitru Erhan: Yeah. We did some -- I don't know if you know about ImageNet. This is -- I forget her name -- it's Stanford Computer Science; anyway, it's Stanford, somewhere in that vicinity, somewhere in the Bay Area. They collected -- it was on Mechanical Turk -- they labeled about 10 million or 12 million images. And we've done some experiments with a couple of million of those examples. And the curves look kind of the same.
>>: [inaudible].
>> Dumitru Erhan: No. It's not [inaudible] -- no, that was the JMLR paper we submitted eight months ago. It was accepted as is. They didn't want us to do more experiments, which surprised us.
>>: [inaudible].
>> Dumitru Erhan: Yeah. Yeah. Well, the online setting. We wanted to test that. The online setting is, I think, a bit more of a surprising result for many people. So maybe we'll get to publish this at NIPS this year.
>> Misha Bilenko: Let's thank Dumitru again.
[applause]