>> Misha Bilenko: All right. So we're delighted to host Dumitru Erhan here today, who is coming to us from the University of Montreal. And he will be talking to us about the latest and greatest from the world of deep architectures. He's been to a number of places, including our very own MSRC in the past, but mainly he's spent time in Montreal, and before that he did internships at MSRC and Google and Max Planck and Helsinki University of Technology. And here's Dumitru. >> Dumitru Erhan: All right. Let me just hold on until I get the timing right. All right. This was not purely for the animation. So I'm Dumitru Erhan. I'm from the LISA Lab that is headed by Yoshua Bengio at the University of Montreal, and I'm going to be presenting work on deep architectures, on trying to understand them and trying to understand the effect of unsupervised pre-training. And I've done this work jointly with a couple of people: Aaron Courville, Yoshua Bengio, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. So here's a brief outline of what I'll be talking about. I'll be starting with an intro to deep learning and why we want to do that. I'll be going on with a kind of brief motivation for my work, for the work that I've been doing over the last couple of years; namely, it's work that can be ruled by these two questions: why does unsupervised pre-training work for these deep architectures, and how does it work? And if we have some time, and I think we have some time, maybe I'll go over some more speculative work on analyzing deep architectures. And I'll show you more pretty pictures since you seem to like them. And I'll end with some discussion and concluding remarks. So, deep learning motivations. I think one of the first motivations for deep learning comes from the fact that we take a lot of hints from the way the brain works. We know it's kind of an intelligent machine. And machine learning researchers have, over the years, done a certain amount of copy-pasting from there. So, you know, why not take a hint from the fact that there are many layers of nonlinear processing units in the brain. And so we could call the brain a deep architecture, in a sort of vague, hand-waving way. A more -- perhaps a more convincing argument is the fact that we as humans tend to organize our ideas, the way we think about the world, in a kind of hierarchical fashion. So we tend to go through composition of simpler ideas. This can be seen in the way we learn things: we start with learning simple concepts, we kind of do a bit of scaffolding, we learn more complicated concepts. The way we think about abstraction in the real world: we tend to do some sort of decomposition into simpler, less abstract things. And the way we solve problems: we do problem decomposition. A kind of engineering approach to solving problems would be to solve simpler problems first, and to solve the complicated problem by building on the simpler ones, on the simpler solutions.
So perhaps a bit more of a machine learning argument, one that my advisor Yoshua Bengio seems to make quite often, is that in certain restricted cases -- if you make certain, perhaps unrealistic, assumptions about the classes of functions that you're trying to learn or trying to represent -- there are certain classes of functions which can be compactly represented with K layers of nonlinear processing units, which is what we're going to call a deep architecture, by the way, and which cannot be represented efficiently -- so they need an exponential number of these processing units -- when you restrict yourself to K minus 1 of these layers. >>: [inaudible]. >> Dumitru Erhan: That's an existence proof, yes. Yeah. Basically. >>: [inaudible]. >> Dumitru Erhan: That restricts the function -- yeah. If you restrict yourself to kind of linear threshold kind of units or binary units. It depends. It depends on the class of function. So this kind of result -- maybe an existence proof of this kind of result exists, but if you have like sigmoidal units or something like that and you don't restrict yourself in the number of these units, I don't think it's actually true. But I think for realistic kinds of functions that you're going to be trying to learn or represent, my intuition is that this is kind of true as well. >>: Does it depend on what kind of learning unit you have? >> Dumitru Erhan: Yeah. >>: [inaudible]. >> Dumitru Erhan: Yeah. It's a linear spatial function. So it's nothing too fancy. The other kind of more machine learning type of argument for doing deep learning is that local features -- the local representations that you're trying to learn of your data -- don't really seem to scale well to problems with many variations. And by that I mean problems where perhaps you're trying to learn a complicated decision boundary, where, if you have a complicated decision boundary and the kind of features that you're trying to learn are features that operate only in the vicinity of your training data, then you may need a lot of data. And these types of problems are what I would consider the interesting problems; if you have a simple decision boundary, then it's perhaps not an interesting machine learning problem. And conversely, distributed representations, as opposed to kind of local representations, seem to be necessary to achieve this kind of generalization beyond the local features that you're going to be getting if you're just looking at the training data in its vicinity. Yes? >>: I'm sorry, do you mean local in feature space or local in [inaudible], like over the image? >> Dumitru Erhan: I don't really know what local in the [inaudible] means, but local in the -- >>: [inaudible]. >> Dumitru Erhan: Local in the input space. >>: Okay. >> Dumitru Erhan: So there's a bunch of [inaudible] that Yann LeCun and Yoshua Bengio have made about this. The interesting thing, the nice thing about deep architectures that have these deep representations that are distributed, is that in the end what they allow you to do is share statistical strength, by kind of making it possible to reuse the features that you learn at level K minus 1 for learning the features at level K.
So you could see this as kind of multitask learning, if you want. If you were to do multitask learning in the sense that Rich Caruana has done, and you were to learn task one, it makes sense to reuse the features that you learned for task one to learn task two, if that task is somehow related. And the same kind of argument could be made for learning features. You consider learning a feature as a task, so if you learn a high-level feature it's -- >>: Can you elaborate on the non-local generalization? I mean, local to what? I understand it's local in the feature space, but local to what? >> Dumitru Erhan: Local -- close to the training data. >>: [inaudible] far away from the [inaudible]. >> Dumitru Erhan: Yes. >>: Okay. Far away from the training data, the assumption is the distribution [inaudible] because the training data is sampled i.i.d. from the distribution, so why do you care about generalizing where there's low mass? >> Dumitru Erhan: Well, because -- you want to generalize far from the training data because the training data cannot possibly, in general, capture all the possible variations of what you're trying to learn. So you're going to try to impose a complicated prior on the kind of distributions that could arise, and you want this prior -- the kind of class of functions that you're going to learn -- to not simply be restricted to the neighborhood regions that are close to your training data. Because you're not going to generalize to anything interesting if you don't have data there. So I think it's more of a, I don't know, maybe a philosophical kind of question of whether you should be aiming to generalize far away from the data or not. And I think the argument is, what assumptions are you making such that you should be able to? I don't think there will be an agreement. I understand your point. But I think in the limited-data scenario, we make some sort of assumption that you do want to generalize away from the data. You would want some sort of prior on the way in which you're going to be generalizing away from the data. It's not simply "I'll shrink my parameters to zero because I don't have any data there." So... Hopefully I convinced you that deep, distributed representations are desirable and have some more expressive power than if you just don't use deep, distributed representations. So before 2006, fully-connected deep networks were not especially popular. And it's not exactly a mystery to me why, but I wasn't there in the '90s, I didn't do machine learning. I was in high school. [laughter]. Sorry. It's like a dip in my chances. [laughter]. But I think there are a certain number of hypotheses. And you can feel free to add to these -- I mean, I'll add them to my list. But I think one of the main ones is that, in theory, the problem becomes more non-convex, harder to solve. If you want to train a deep, multi-layer neural network, there are simply many more local minima as you add more layers. And since you're using these sort of stochastic gradient methods, and by using gradient methods in a multi-layer neural network you have this problem of kind of vanishing gradients, your credit assignment problem becomes even more complicated.
So there is also the kind of universal approximation theorem, which states that for a certain class of functions -- not all possible functions in the world, but under a set of smoothness assumptions -- you can approximate, you can represent basically any of these functions with a one-hidden-layer neural network, so basically why bother, you know? And then there were people who invented SVMs, which were convex and were all nice and performed better than one-layer neural networks, so I guess people just kind of threw these things away. That's my history of machine learning over 20 years -- but for neural networks. I think the only kind of deep neural network variety that stayed out there and that kind of evolved was the convolutional neural networks of Yann LeCun. They were not fully connected. They had this constrained topology and connectivity, and that seemed to have made all the difference. It was easier to actually train them -- it was possible to actually train these networks even though they had a pretty large number of layers. So what happened in 2006 is that Geoff Hinton and collaborators used unsupervised pre-training -- so they just used unsupervised learning to initialize these networks, these fully connected networks. The kind of algorithm that they used was Restricted Boltzmann Machines. And they seemed to work. So the question is going to be why this actually happened. Before we get to that question, just a brief overview of what Restricted Boltzmann Machines are. It's a simple graphical model with visible units X and hidden units H. It's a bipartite graphical model in which there's efficient inference and sampling. Inference is basically just a matrix multiplication followed by a sigmoid. So that's nice for a graphical model. Since the conditionals of H given X are factorial, it's easy to sample and easy to do inference. You trade off the fact that learning it is very hard in principle, since you need to compute a normalization constant, which is going to be very hard for anything that's a non-trivial model. So what people like Geoff Hinton and others have done is use contrastive divergence, and many other variations from the MCMC literature -- there's a whole industry these days of taking ideas from that literature and applying them to RBMs. It's basically either approximations of the gradient of the likelihood, or some other ways of doing smart sampling. It's fast and it's simple. Contrastive divergence at its simplest involves three matrix multiplications, and you sample twice. So it's easy -- it's easy to do many, many, many iterations of this contrastive divergence. RBMs have been extended to real-valued units. This kind of model here presupposes that the Xs are binary valued and the Hs are binary valued; you can extend this to kind of arbitrary or real-valued visible units and hidden units. There have been extensions to semi-supervised learning. You could have the label here -- it's sort of efficient if you want to. Another way of -- so a deep belief network is simply a way of turning these Restricted Boltzmann Machines into an algorithm for iteratively learning the weights of your deep neural network in an unsupervised fashion.
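To make the contrastive divergence step concrete, here is a minimal numpy sketch of CD-1 for a binary RBM, matching the "three matrix multiplications, sample twice" description above; the variable names and learning rate are illustrative assumptions, not the exact code used in this work.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def cd1_update(W, b, c, x, lr=0.1, rng=np.random):
        """One CD-1 update for a binary RBM on a batch x of shape (batch, n_vis)."""
        # Positive phase: P(h=1 | x) is a matrix multiplication followed by a sigmoid.
        ph0 = sigmoid(x @ W + c)
        h0 = (rng.uniform(size=ph0.shape) < ph0).astype(x.dtype)   # first sample (hiddens)
        # Negative phase: reconstruct the visibles, then recompute hidden probabilities.
        pv1 = sigmoid(h0 @ W.T + b)
        v1 = (rng.uniform(size=pv1.shape) < pv1).astype(x.dtype)   # second sample (visibles)
        ph1 = sigmoid(v1 @ W + c)
        # Approximate likelihood gradient: positive minus negative statistics.
        W += lr * (x.T @ ph0 - v1.T @ ph1) / x.shape[0]
        b += lr * (x - v1).mean(axis=0)    # visible biases
        c += lr * (ph0 - ph1).mean(axis=0) # hidden biases
        return W, b, c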
>> Dumitru Erhan: So you're going to take this Restricted Boltzmann Machine and you're going to be training it on your training data. You're going to figure out some set of features that is interesting, that somehow perhaps captures the distribution of your training examples. You're going to fix the weights of this RBM. You're going to do a sort of forward propagation: you're going to find basically P of H given X for each of your training examples. You're going to learn another Restricted Boltzmann Machine on top of it that is basically going to model the features that you just learned. You're going to do this until you're finished, which in many cases is three layers. So a lot of people say it's not very deep, but I'm not going to go into that. Once you're done, you basically have a way to generate -- well, to generate as well, but to figure out what this deep distributed representation at level 3 is, given the input, for each of your inputs, not necessarily from the training data. And you can use these weights, the parameters that you just learned, to initialize the supervised network, and you can just do backprop afterwards. Or you can do like Geoff Hinton, you can do some sort of fancy RBM here that has a label. There are many ways to do this. But it usually involves a two-stage process where you do unsupervised learning using this kind of weird graphical model, stacking it greedily layer-wise and transforming this into a deep belief network. Another way of doing this is the so-called stacked denoising auto-encoders. This is what we use mostly in our lab. I personally find them quite an interesting model. They're inspired a bit by the classical auto-encoders from neural networks about 20, 25 years ago. And a classical auto-encoder is actually pretty similar, just mechanistically speaking, to a Restricted Boltzmann Machine. It's a one-layer neural network that takes as input some X and tries to predict X as well. So it will take X, compute some hidden representation, which is simply sigmoid of W X plus C. It will reconstruct, either using the transpose of the same matrix or some other matrix -- it doesn't really matter -- and it will try to minimize the loss between the reconstruction and the original input X. So that's a classical auto-encoder. It's nonlinear, so it's not quite the same thing as PCA; it will do something qualitatively different. The trick here, the denoising auto-encoder, is that instead of feeding the network the original input X, you're going to feed it a corrupted version of this input. So you're going to corrupt the X in some sort of stochastic way, and you're going to do the same thing, you're going to reconstruct. But the loss that you're going to be minimizing is the reconstruction error between the reconstruction from your X tilde and the original X. So you're going to try to make your network learn how to predict parts of your input from the others. So you're going to basically make your model more robust. And in theory this is going to give you a better model of your data -- I mean, if parts of your input are predictable from the others. If they're not, then your task is very easy in a sense.
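A minimal numpy sketch of one training step of a denoising auto-encoder as just described, with masking corruption, tied weights, and a cross-entropy reconstruction loss; the corruption level and variable names are illustrative assumptions rather than the exact setup from the talk.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def dae_step(W, b, c, x, corruption=0.25, lr=0.1, rng=np.random):
        """One SGD step of a denoising auto-encoder with tied weights (decoder = W.T)."""
        # Corrupt the input stochastically (here: zero out a random subset of entries).
        mask = rng.uniform(size=x.shape) > corruption
        x_tilde = x * mask
        h = sigmoid(x_tilde @ W + c)      # hidden code computed from the corrupted input
        x_hat = sigmoid(h @ W.T + b)      # reconstruction
        # Cross-entropy reconstruction loss against the *clean* input x.
        grad_out = x_hat - x              # gradient w.r.t. the output pre-activation
        grad_h = (grad_out @ W) * h * (1 - h)
        W -= lr * (x_tilde.T @ grad_h + grad_out.T @ h) / x.shape[0]  # tied-weight gradient
        b -= lr * grad_out.mean(axis=0)   # visible (reconstruction) biases
        c -= lr * grad_h.mean(axis=0)     # hidden biases
        return W, b, c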
>> Dumitru Erhan: So interestingly, this kind of model handles a variety of learning problems which, sort of as an afterthought -- oh, wow, this also handles missing values. If your inputs somehow have missing values -- or you can, you know, stochastically make missing values -- then your model is going to be able to predict the missing values. You can have occlusion, if you're trying to reconstruct images. Or kind of multi-modality: instead of having just an image, maybe you have an image and a text, and you corrupt the text part and you try to reproduce it. Anyways, depending on the kind of domain that you're working in, you're going to use some sort of reconstruction error, a mean squared error or cross entropy -- it doesn't really matter for this sort of analysis of the algorithm. And the stacking, you know, how to make this deep, is the same thing: you compute the representation for each of your inputs, you use that representation to train a second layer of denoising auto-encoder, you do this until you're satisfied, you add a supervised softmax layer, do backprop, and obtain as good or better results than if you do a deep belief network. So I find this model conceptually simpler to look at. People who do graphical models find Restricted Boltzmann Machines simpler to look at. It's really -- >>: The model isn't generative though, is it? >> Dumitru Erhan: You can make it generative, sort of, but yeah, you can't really sample from it. But if your goal in life is classification, then having a generative model is perhaps not necessarily useful. I mean -- >>: And when you [inaudible] I should go back to Pascal's paper, but actually when you -- you don't add noise to the H, you sort of propagate [inaudible]. >> Dumitru Erhan: When you compute -- you mean when you -- >>: [inaudible]. >> Dumitru Erhan: You forget the noise. So the noise is only during training. >>: Right. But now you're going to train the next layer. >> Dumitru Erhan: Oh, yeah. You add the noise here. >>: So you have a noise model at every layer? >> Dumitru Erhan: Yes. >>: Okay. >> Dumitru Erhan: But then, of course, there is no notion of occlusion or anything like that in the hidden layer, unless you somehow made your model like that. So it's less interpretable. But still, the main gist of the argument remains: you're making your model more robust, so you're trying to predict some parts of your input, be they actual inputs or representations of those inputs, from the other parts. So using these and other things -- there are plenty of other ways to do this kind of layer-wise learning of networks -- people have obtained a lot of good results on a bunch of datasets: from MNIST, to natural image patches learning and generation, to object classification with more or less realistic kinds of objects, to NLP tasks -- I don't know if Ronan and Jason Weston were here, or if you have seen them at NIPS; they've been quite forceful about their results on named entity recognition. Speech/music classification with convolutional deep belief networks is work by Honglak Lee and people from Andrew Ng's lab. MoCap, learning and generation -- motion capture, if you've seen Geoff Hinton's chicken-dinosaur walk generation at NIPS, it's also kind of fun. So people have applied this in a variety of settings.
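The greedy layer-wise stacking just described is essentially a loop; here is a schematic sketch, where pretrain_layer is a placeholder for any one-layer unsupervised learner (an RBM trained with contrastive divergence, a denoising auto-encoder, and so on), not a routine from the speaker's code.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def greedy_pretrain(X, layer_sizes, pretrain_layer):
        """Stack unsupervised layers greedily: train layer k on the codes from layer k-1.

        pretrain_layer(data, n_hidden) is assumed to return weights W and hidden
        biases c for one layer, trained unsupervised on `data`.
        """
        params, data = [], X
        for n_hidden in layer_sizes:
            W, c = pretrain_layer(data, n_hidden)   # unsupervised training of this layer
            data = sigmoid(data @ W + c)            # fixed forward pass: codes for the next layer
            params.append((W, c))
        # `params` then initializes a supervised network, to which a softmax output
        # layer is added before backprop fine-tuning.
        return params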
>> Dumitru Erhan: And it's not quite clear -- at least two years ago, when I started doing this kind of work -- whether this was well understood, or why this whole business about initializing the network actually works. So the recipe, as you've probably seen and as I've tried to describe, is layer-wise learning of unsupervised nonlinear features followed by supervised learning. That's the basic recipe. That's how you build deep neural networks most of the time. And this has been applied to Restricted Boltzmann Machines, more general Boltzmann Machines, auto-encoders, denoising auto-encoders, sparse auto-encoders, sparse denoising auto-encoders. There's even been a paper by Lawrence Saul and one of his students about using kind of deep kernels, or stacking kernel PCAs, which doesn't sound very efficient but it actually works. I was surprised. And the question is, why does this work? And some of the stuff that I was doing over the past couple of years is attempting to untangle the various effects that contribute to good performance. So I'm going to try to verify some of these hypotheses in a kind of large-scale setting -- many hyperparameters, with kind of bigger datasets. I mean, large scale is all relative. Large scale by our standards, maybe medium scale by your standards. And the goal is to try to kind of demystify deep architectures. There's been somewhat of a -- I wouldn't say backlash, but you can see sometimes in reviewers' comments that they understand that these things work, but they don't really understand why they work. So: try to demystify a bit what is going on, try to present some coherent arguments for why they work, and infer maybe something useful for future research in this field. So the plan -- it's a scientific kind of plan -- is to propose some explanatory hypotheses for why things work, observe the effects of pre-training in various kinds of settings, and infer the role of pre-training, since that seems to be the crucial ingredient in this whole recipe, from the level of agreement of the results that we obtain with these hypotheses. And I'm going to present the hypotheses straight away. They're very simple; there's nothing too complicated about them. The first one is the regularization hypothesis, which basically states that the unsupervised component constrains the network to model P of X -- to have good features of P of X. So it constrains the model, the parameters, to model P of X well, either with RBMs or with denoising auto-encoders or what have you. And it says that the representations that you get for P of X -- since you're basically sharing the parameters -- are going to be good for P of Y given X, in a generalization kind of way. The optimization hypothesis states that unsupervised initialization initializes the network near a better local minimum of P of Y given X -- a better local minimum of the training criterion, that kind of statement. And what pre-training helps you do is reach a lower local minimum, not achievable by random initialization, by the standard kind of way of doing deep networks. And an interesting thing that we're going to discover is that these hypotheses, though they sound like maybe they're at odds with each other, are not necessarily incompatible in certain scenarios. Yes? >>: Question about your use of the term local minimum.
>> Dumitru Erhan: Yes. >>: We often don't train neural networks until they're actually stuck in a local minimum. We often stop them earlier, to do something else. >> Dumitru Erhan: Yes. >>: By local minimum -- is it okay if I just replace that with basin of attraction? >> Dumitru Erhan: We're going to use -- >>: Is that going to work? >> Dumitru Erhan: No. We're actually going to use local minimum, and here what we mean is a better local minimum of the training criterion. We're going to use basin of attraction too -- it's kind of a weird notion, what is the basin of attraction of gradient descent. I guess we'll get to that. But literally, here what we mean is that unsupervised initialization gets us better training error. So it's a better local minimum from this point of view. >>: [inaudible]. >> Dumitru Erhan: Yeah. Because it's very hard to actually verify that we're at a local minimum in a neural network. >>: Well, it's very specific, right -- the optimization hypothesis assumes that deep networks trained without pre-initialization are underfitting, right, and therefore -- >> Dumitru Erhan: Yes. Yes. >>: So he's saying that they will be -- it's a hypothesis. >> Dumitru Erhan: So those are two rather different statements of what's going on, and we'll jump straight into that. Once I can get my slide working. >>: I have a question. >> Dumitru Erhan: Yes? >>: All right. So during the supervised training, all of the weights are manipulatable, right? >> Dumitru Erhan: Yes, everything is manipulatable, yes. >>: So how does it actually constrain [inaudible]? I mean, it doesn't sound like there's any constraint [inaudible]. >> Dumitru Erhan: Yes. We're going to see that this is a very, very tough constraint. Because one has to remember that you're not doing convex optimization here anymore. If you were doing convex optimization, it wouldn't really matter where you start from. One way or the other, if your optimization problem is well set up, you're going to reach the minimum. >>: The constraint is sort of like moving towards the minimum that's near P of X -- I mean, it's like picking -- >> Dumitru Erhan: It's picking -- it's picking the basin of attraction. So I'm sort of -- >>: You begin by just looking at P of X -- >> Dumitru Erhan: Yes. >>: [inaudible] optimize [inaudible]. Now, I'm not exactly sure how you define your model. If you said Y for exactly [inaudible], it seems that P of Y given X would be exactly the [inaudible] your minimum -- vice versa. But you're not using that information in your initialization, so your pre-learning step should be equally good for P of Y [inaudible] or P of minus Y? >> Dumitru Erhan: I mean, yeah. I'm not quite sure what -- are you saying that basically since we're doing P of X and we don't know anything about Y -- >>: You're claiming that you get a better starting point by looking only at X and not Y. >> Dumitru Erhan: Yes. >>: Which would mean that you're claiming that it's a good starting point for any given set of [inaudible]. So you're saying there's a region in the space where we have very, very high peaks and very, very low, and then there are other places that you -- so you want that very high coefficient -- >> Dumitru Erhan: I think -- I think I agree -- I think yes. But -- yeah. I didn't think of what you just said, but yeah, I think the statement also says that. All questions done.
>>: One more question then. >> Dumitru Erhan: Okay. >>: [inaudible] two kind of pieces of [inaudible] that actually does the [inaudible]. One is [inaudible] pre-training. >> Dumitru Erhan: Yeah. >>: And you just pull this entire weight into [inaudible]. And the other one is to do [inaudible] right within the [inaudible] -- >> Dumitru Erhan: You mean at the top layer. Yeah. >>: So do you support both of those? >> Dumitru Erhan: No, we don't. I mean, we've empirically sort of found that they work the same way -- I mean, they give the same kind of results. So we stuck with the simpler, from my point of view simpler kind of -- >>: [inaudible]. >> Dumitru Erhan: Yeah. I mean, it's basically just taking the same network and adding a softmax layer, so -- >>: So you think all the conclusions you get from here are [inaudible]. >> Dumitru Erhan: Yeah. >>: [inaudible] applied to the [inaudible]. >> Dumitru Erhan: DBNs, yes. Because the kind of conclusions -- the same kind of framework, we applied it for both DBNs and stacked denoising auto-encoders, and with stacked denoising auto-encoders there's no equivalent -- it's only backprop that you can do at the end. So the setup is simple. It's MNIST, handwritten digit recognition. So it's two datasets. One is a dataset where we can iterate many times until we reach some sort of local minimum -- basically zero training error -- very, very fast. And InfiniteMNIST, which is basically a dataset by Gaelle Loosli and, I think, Leon Bottou -- maybe a student of Leon Bottou. It's a way to generate elastic deformations on the fly from MNIST digits. So you can sort of generate -- it's not a completely i.i.d. 10-million-example kind of dataset, but it's large and somewhat interesting compared to MNIST in that you can't really go through it many times. So we tried 10 million examples. That's what I mean by large scale. So we did three models. The two that I just presented, DBNs and stacked denoising auto-encoders. DBNs were trained by the classical contrastive divergence algorithm, with one step of sampling. Stacked denoising auto-encoders, again, the way that I just presented. And models with one to five layers without pre-training -- so randomly initialized using, you know, the standard way of initializing neural networks as preached by Yann LeCun maybe 10 years ago. So then we tested very, very many hyperparameters with many, many seeds, just to be sure of the -- yes? >>: So one critical thing is, are you using for the [inaudible], are you using [inaudible]? >> Dumitru Erhan: Percent, I think. Yeah. Yeah. We always use percent. So I'm going to jump straight into observing the effect. So given the hypotheses, we now look at what pre-training does. The very simple kind of effect of pre-training is just the generalization error. So you select the best validation error, you look at those models, you sample very many instances of those models -- so different random initialization seeds -- and you do denoising auto-encoder pre-training and no pre-training, you vary the number of layers, and you look at the classification error for both of them. You can see that qualitatively they behave differently: as you add more layers with pre-training, it's getting better -- though I would say statistically it's probably not better -- but at least from this point of view, it's very different than this one. And there is a missing five-layer result.
Because we couldn't even -- we weren't able to train the five-layer network. It just didn't converge to anything but garbage. >>: So how do you compute the variance? >> Dumitru Erhan: The variance is across 500 initialization seeds. >>: [inaudible] how do you choose -- there's a lot of choices of, like -- this is optimal and [inaudible]. >> Dumitru Erhan: Yes. Everything -- so this is for the models that give us the best classification error, and we estimated this over maybe 50 samples per combination of hyperparameters, so -- >>: 50 [inaudible]. >> Dumitru Erhan: 50 initialization seeds. We choose the best model, we start again with 500 initialization seeds for both of these. So for one, two, three, four, and five -- >>: So for backpropagation, what kind of optimization did you use? >> Dumitru Erhan: It's simple stochastic gradient descent. >>: No [inaudible]. >> Dumitru Erhan: No [inaudible], no. We have never found it to work very well. So -- yes. And this -- >>: So this is just a neural network. >> Dumitru Erhan: This is just a normal neural network. >>: So what is the optimal number of hidden units, roughly, for one or -- >> Dumitru Erhan: For one it's a thousand. >>: That's interesting. I was getting 1.53. >>: I thought 1.06 by just changing the [inaudible]. >> Dumitru Erhan: We've really tried. >>: No, no, I know. >> Dumitru Erhan: It was a really -- I think, you know, if you look at some of the papers in this field, there's a lot of these sort of arbitrary choices, like, oh, we do momentum this way for the first 10 epochs and then we do momentum afterwards. So we really tried to make it as large scale as possible. So, you know, this is probably close to a year of CPU time spent just on this experiment. >>: [inaudible] randomized the weights, did you choose [inaudible] randomize [inaudible] choose the best one to randomize? >> Dumitru Erhan: No, no. The seed was not validated. >>: You're using one over the square root of N? >> Dumitru Erhan: Yes. Yes. Yeah, it's all -- yeah. So we wanted to make it as black box as possible. So from that point. >>: Is it still on training data? Like, you see the difference between -- >> Dumitru Erhan: Yeah, that will -- it's a different experiment, so yeah. >>: I think the initialization of the weights is too big. >> Dumitru Erhan: It does make a difference, yeah. It does -- I'll grant you that; I think somebody in my lab has done some sort of -- but it will never, you know, it will never be comparable. We've done -- >>: [inaudible]. >> Dumitru Erhan: Yeah. Yeah. So it's the -- [brief talking over]. >>: [inaudible]. >> Dumitru Erhan: I'm sorry? >>: I mean, I used a different set of data. For this one it's probably okay. >> Dumitru Erhan: Okay. >>: But for some others, just using randomization, it didn't really get -- >> Dumitru Erhan: Well, I think you have to have the right kind of combination of the inputs -- I mean, it's a choice. You have to choose -- so these are sigmoidal hidden units, so basically your weights have to -- if your data is kind of mean zero and variance one or something, your weights have to be chosen such that the activation of your hidden units falls on average in the linear part of the sigmoid. So it's a choice that depends on the data. First you have to kind of make your data play well with the sigmoids, and then your weight initializations have to be, you know, as you said, one over the square root of the number of inputs.
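For reference, the "standard" random initialization being discussed can be sketched as follows; the exact uniform range (scaled by one over the square root of the fan-in) and function name are illustrative assumptions, not the lab's actual code.

    import numpy as np

    def init_layer(n_in, n_out, rng=np.random):
        """Small uniform weights scaled by 1/sqrt(fan-in), zero biases.

        With inputs normalized to roughly zero mean and unit variance, this keeps
        the pre-activations near the linear part of the sigmoid at the start.
        """
        bound = 1.0 / np.sqrt(n_in)
        W = rng.uniform(-bound, bound, size=(n_in, n_out))
        b = np.zeros(n_out)
        return W, b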
>>: And one more nitpicky question, I'm sorry. Did you initialize all the layers randomly, or N minus one layers randomly? >> Dumitru Erhan: But what's the Nth layer? >>: You don't need to initialize all the layers randomly; you could choose one layer to be zero. >> Dumitru Erhan: Yeah, that's true. No, we initialized [inaudible]. >>: I have a question. So it looks like -- so you're adding [inaudible], you're making the model more and more complex, and this [inaudible] is getting worse. >> Dumitru Erhan: Yeah. >>: So that would be the classic effect of a regularization scheme. So the question is -- I don't know much about the left-hand side -- is there any regularization here at all? >> Dumitru Erhan: We've validated that parameter also. We've done like L1 or L2, kind of standard shrinking, or just the L2 kind of regularization. So we varied that parameter as well. >>: [inaudible]. >> Dumitru Erhan: No, of the -- so you add a penalty to your cost which is lambda times the L2 norm of your weights. >>: [inaudible]. [brief talking over]. >>: And so that's nonzero, and it's the same for all weights? >> Dumitru Erhan: It's the same for all weights and it's nonzero -- you mean the optimal one? So for each of them it's going to be very small. But yeah, there's a nonzero optimal lambda for each. >>: [inaudible]. >> Dumitru Erhan: Yeah. And it's early stopping, yes. And I think mini-batches -- the mini-batch size was also validated, but we didn't do it very fine-grained; it's either one or 10 or 20. >>: What's the [inaudible] for that -- >> Dumitru Erhan: Yeah, it's the same thing. And so we -- >>: [inaudible]. >> Dumitru Erhan: You mean the optimal one? I think it's 10. >>: 10? >> Dumitru Erhan: Yeah. No -- >>: I guess [inaudible] you get variance. >> Dumitru Erhan: I think if you were to pound on it more, if you were really to find the truly optimal hyperparameter selection -- but what we wanted is more of a qualitative picture of what's going on. So I think you can make this lower. You can make this -- I think you can subtract probably safely about, you know, .1 or .2. But not for these. Not for these. >>: [inaudible]. >> Dumitru Erhan: Yeah, yeah, I think for one layer maybe you can get -- I think 1.5 I'd say is safe. But these can get, you know, 1.3, let's say. And I haven't really tried too hard. I mean, it was a simple experiment where, you know, I launch a batch of jobs. >>: So you think that the reason why the five-layer -- >> Dumitru Erhan: Here? >>: The right-hand side. >> Dumitru Erhan: Oh, here. It goes up because -- well, there's no optimal kind of number. I mean, obviously if you add very, very many layers at some point in time -- it's kind of a classical tradeoff between the complexity of the model and the kind of data that you have. So at some point in time you're going to overfit, even if you have a really good initialization point. >>: [inaudible] less severe for -- >> Dumitru Erhan: That is correct. >>: [inaudible]. >> Dumitru Erhan: That's basically the conclusion from this model. And you can see this also from the tail here -- so, you know, it's not like you get a lot of these on this side of the distribution. So it's more the errors -- there's a bit of a skew towards bad errors. So you're more likely to actually get into a really bad local minimum -- or generalization minimum, if you want -- if you have one of these networks without pre-training than one of these.
And once you get to, you know, worse results, you get the same kind of effect here, but it's less pronounced than in -- so you can see some of this in the histogram of errors. So this is for one layer -- this is with pre-training and without pre-training. So this one looks like a bit of a Gaussian. This one has somewhat fatter tails; this is with four layers. This corresponds to this distribution here. >>: So just curious, I mean, when you [inaudible] either case, it's true that all the training error is zero. >> Dumitru Erhan: I'll get to that. >>: Okay. >> Dumitru Erhan: In one second. Yeah, it's true. >>: [inaudible]. >> Dumitru Erhan: But if you're looking at the classification error -- all of these get zero classification error in training. So it's hard to compare. We're going to look at the actual training objective, which is more interesting. So, yeah, we've seen the effect of pre-training on the generalization error. Another kind of effect that we're going to look at -- a bit more conceptual -- is we're going to look at the actual networks in what we call the function space. So we're going to try to project the networks. The outputs of the networks -- we're going to treat them as functions. And we're going to try to project them, via some nonlinear dimensionality reduction -- two methods actually -- into a two-dimensional space. So each of these points here, each single point for the different colors, represents one network at one point in time, after one epoch of training. So this thing, kind of called the flower, represents the neural networks pre-trained using unsupervised learning. I think it's RBMs, two layers. So this is 50 networks after one epoch of supervised training. So they all kind of start at the same point. And they all kind of move to a local minimum as learning goes on. Yes? >>: What are the X and Y -- >> Dumitru Erhan: They are arbitrary. Yeah, they have no meaning. So this can be rotated in any way. So there's no -- it's just the result of t-SNE, which is a dimensionality reduction method that kind of preserves local structure, or Isomap, which is a reduction method that preserves kind of global structure. And it's whatever values these methods give to you. I don't think they're meaningful in any way -- >>: [inaudible] network you compose with all the weights across different networks [inaudible]. >> Dumitru Erhan: No, these are not the weights. So this is slightly different. But we're going to get to the weights in the next slide. These are the actual outputs. So you take a test set, you compute the output of your network on each of your test points, you concatenate this into a gigantic output vector. This is your input space, if you like. So this vector represents one function, at one instance, at one point in time. You're going to take a bunch of these -- so 50 for the 50 networks that are pre-trained, and 50 networks without pre-training, so randomly initialized -- you're going to put them in a big bunch of vectors that have dimensionality, I think, 100,000 or so. And you're going to project this. Of course, you know, it's impossible to preserve all the right distances.
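A minimal sketch of the function-space projection just described: each network at each epoch is summarized by the concatenation of its outputs on the test set, and the collection of such vectors is embedded in two dimensions with t-SNE or Isomap. The scikit-learn classes are assumed here, and predict_proba is a placeholder for whatever model code is in use, not part of the original experiments.

    import numpy as np
    from sklearn.manifold import TSNE, Isomap

    def embed_function_space(snapshots, X_test, predict_proba, method="tsne"):
        """Each row = one network at one epoch, described by its test-set outputs.

        `snapshots` is a list of (network, epoch) pairs; predict_proba(net, X)
        is assumed to return an (n_test, n_classes) matrix of class probabilities.
        """
        vectors = np.stack([predict_proba(net, X_test).ravel()   # concatenate the 10 outputs
                            for net, _ in snapshots])            # per test example into one vector
        reducer = TSNE(n_components=2) if method == "tsne" else Isomap(n_components=2)
        # 2-D coordinates; the axes themselves have no intrinsic meaning.
        return reducer.fit_transform(vectors)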
>> Dumitru Erhan: But it presents an interesting picture of what happens if you project all of these vectors into two dimensions and you look at the evolution of the networks as you continue supervised training. So there are a couple of interesting things that you can observe here. One is that they seem to explore different kinds of regions of space. They don't seem to kind of cross each other's paths, if you like. And with Isomap, what's interesting -- these are networks without pre-training and this is with pre-training -- from the point of view of one of these networks with random initialization, the volume occupied by basically all the networks with pre-training is very small. What else is here? There are many apparent local minima. So if you look at -- this is a bit of an artifact of the method, since it preserves local structure, so it tries to kind of cluster things together a bit. But it also means that these kind of tiny things correspond basically to the trajectory of a network in this two-dimensional space at the end of training. So, you know, they all seem to go to the same kind of -- >>: [inaudible] data. >> Dumitru Erhan: This is MNIST data, yes. >>: So where is the [inaudible]. >> Dumitru Erhan: This is 10 points. Yes. This represents the class probabilities. >>: [inaudible]. >> Dumitru Erhan: No, this is -- >>: Times 10,000. Oh, I see. >> Dumitru Erhan: So the test set -- yeah, so the test set is 10,000. You compute 10 output labels basically -- well, not labels, but the class conditional probabilities, or the log of these. You concatenate them. You project. >>: But I question whether you can take the volume of the projection and try to interpret that, because it could be -- the volume, how tightly clustered they are, could be a side effect of your -- of Isomap. So I mean, I mean, like -- >> Dumitru Erhan: So we tried to make one single component. So that's the kind of parameter that we gave to the Isomap. So it's -- it will -- I agree, it sucked. >>: Right. But I think looking at how disjoint they are is probably still meaningful, in that it's probably not an artifact of your nonlinear embedding method. But looking at the volume, that seems kind of -- >> Dumitru Erhan: Well, I mean, the volume is -- yeah. So -- >>: I've seen plenty of examples where you take, you know, the uniform Swiss roll and you unroll it, and if you don't do it right, you can get this kind of [inaudible]. >> Dumitru Erhan: Well, that's why we did this -- you know, we tried a couple of these. We tried PCA also, we tried kind of a linear projection method, and they also, you know, they go different paths. >>: So these are the same dataset, two different -- >> Dumitru Erhan: This is the same dataset, two different methods for nonlinear projection, nonlinear embedding. >>: So [inaudible] so the question is what kind of [inaudible]. >> Dumitru Erhan: Well -- >>: [inaudible]. >> Dumitru Erhan: So we've done -- as I said, I think the main conclusion -- what we tried to see, the motivation for this experiment, was: do they seem to explore -- you know, basically the hypothesis is, well, if I try long enough, if I try hard enough and randomly sample from this uniform, maybe bad, way of sampling weights in my randomly initialized network, will I just get lucky and get one of these, you know, one of the pre-trained networks? It's a hypothesis. It's a plausible way.
Maybe pre-training is just a way to kind of choose some of the good random initializations. And I think basically the result is the [inaudible] as far as you can see. I mean, admittedly -- I'll admit, you know, there cannot be any perfect way to project from 100,000 [inaudible] to two, but at least from this point of view they don't seem to explore the same kinds of regions of space, from this function-space approximation. >>: [inaudible] different seeds, I suppose. >> Dumitru Erhan: Different points are different seeds, yes. These are 50 networks. They start kind of in the same place because they were pre-trained. And they all -- so the 50 networks correspond to the same exact configuration of hyperparameters, with different random initialization seeds. So any other questions? Yes. >>: Do you [inaudible] vectors of the distribution of the [inaudible] data being correct? Do you have experiments where you tried to distort the distribution, or say noise [inaudible] data? >> Dumitru Erhan: Well, you can -- I mean, you can always be adversarial about it and make the unsupervised distribution clearly unrelated to P of Y given X, and then you're going to be wasting time. But, yeah, here it's -- >>: But how sensitive is the result to -- >> Dumitru Erhan: I don't know. I don't think we've got any confirmation of that. Yes? >>: Just to make sure I understand. These two flowers, they were -- so the [inaudible] two-dimensional space was [inaudible]. >> Dumitru Erhan: No, no, no. Both of them at the same time. >>: So why don't they start at the same place? >> Dumitru Erhan: No, no, no, both of them at the same time. >>: So why don't they start at the same place? >> Dumitru Erhan: Because they -- well, one of them -- so these networks have been randomly initialized -- I'm sorry, yeah, randomly initialized the standard neural network way. So -- >>: So how do your random initializations all start at one point, basically? >> Dumitru Erhan: Well, they all start at the same kind of random point because they all have the same hyperparameters. So they all have, you know, a thousand or whatever hidden units and -- >>: It's not just the initialization point that's part of the data, it's also the output. [brief talking over]. >> Dumitru Erhan: Yeah, yeah, we're not projecting the weights, because it's not a good idea to do that. I mean, you can always permute the hidden units in a network and it's the same solution. So there are many, many configurations possible -- it's misleading to project the parameters. But the outputs should be independent of the -- >>: If I understand, why isn't it reasonable to ask what would a standard neural network do if it started from that point with the [inaudible]? >> Dumitru Erhan: Well, I don't know how you would make it start from there. I mean, these are the standard networks that are randomly initialized -- of course, you always randomly initialize, you do unsupervised pre-training, then you do supervised pre-training -- supervised learning. So you've done the unsupervised, you've modelled P of X, you've gotten your parameters. You compute the output of this network on your test data. This is where you are, at the beginning. And then you move. Here you've done no unsupervised learning.
You start randomly, you compute where you are in this projected space on your test data, in this function space, and then this is where you are and this is where you go. Is that clear? >>: Are you saying -- so this basically means no matter how you initialize, random kind of covers the [inaudible]. >> Dumitru Erhan: No, no, no, but it's -- >>: [inaudible]. >> Dumitru Erhan: It's random with -- so they all have in common that they have the same hyperparameters. They all have the same architecture, the same number of hidden units, the same learning rate, the same regularization factor, the same mini-batch size. The only thing in which they differ is the initial value -- the random seed of the weight matrix. So -- >>: [inaudible]. >>: [inaudible]. >>: I mean, you're right, a random network could eventually be there, but it's so improbable that -- >> Dumitru Erhan: Yeah, yeah. I think that's the hypothesis -- it's so improbable. It could theoretically be here, though it's not even clear, because the way we initialize is uniform, so the actual values are bounded between minus some small number and some small positive number. >>: [inaudible] the other one doesn't have to start linear, right? >> Dumitru Erhan: No. >>: It's whatever crazy -- >> Dumitru Erhan: Whatever the DBN spits out is where we are here. And these are big numbers. >>: [inaudible]. Interesting. Now, for each point that you finally end up with after supervised training, do you have a measure to see how [inaudible]. >> Dumitru Erhan: Yes. So these are the networks that -- so for each class, you know, this is the best randomly initialized network, the best 50 for two layers. This is the best pre-trained network. So these are best in class. So they go to whatever -- >>: So you have a plot to show -- >> Dumitru Erhan: Yeah, yeah, we'll get there too. >>: Another question. Is there a way for you to say, like, you know -- is there a way for you to unproject from two dimensions [inaudible]. >> Dumitru Erhan: [inaudible]. >>: [inaudible] like weight space, and say what would happen if you, say, started in one of the -- you know, like after you train a little [inaudible], when you start in that region? >> Dumitru Erhan: Well, I think the only thing that you can do is -- I don't think you can unproject, no. Not with any -- >>: Because it would be interesting to see, though, if you started in that -- that bloomy region over there to start with. >> Dumitru Erhan: You mean this one or -- >>: Yeah. But this one, it goes -- >> Dumitru Erhan: Oh, yeah. >>: If you started closer to, like, where they eventually ended up, to begin with, I mean, would you get better results? >> Dumitru Erhan: We'll have some sort of experiment that is kind of like that, where we start with the same kind of values. So not -- obviously -- yeah. So the same kind of magnitude of the weights that is given to you by pre-training, but random. So no, not the result of the unsupervised learning. And we'll get to see -- we'll test a hypothesis that is very similar to what you just said. >>: Okay. >>: [inaudible]. [brief talking over]. >> Dumitru Erhan: This talk was supposed to be 45 minutes. But we have spent 10 minutes on this one. Okay. So the other way of looking at it: the actual weight space. So you don't take the output of the networks on your test data; you actually just look at the filters, the weights, the -- I don't know what they call those. >>: Receptive fields. >> Dumitru Erhan: Receptive fields, exactly.
Yes. There are many words for that. So basically, since we're operating on images, you just take the first-layer weights and you make them a nice image in the same space as your input space. You actually can see -- so white corresponds to positive values of your weights, black corresponds to negative values, and gray corresponds to zero. So a couple of interesting things here. This is with a DBN: the first-layer weights after pre-training, so after unsupervised learning -- this is what they give you. So they're some sort of edge detectors, stroke detectors, whatever they are. It's not necessarily interpretable. This is what happens with the same network after you've done supervised learning. So it's a pretty small difference, at least visually -- if you do a kind of visual inspection of where these weights ended up after you've done supervised learning, they don't seem to be much different at all. Conversely, these weights, obtained with a one-layer network -- well, with a three-layer network actually -- without pre-training, after supervised learning. So the comparison is between these. They're not necessarily in the same kind of regions of space, visually speaking. It's not as strong a statement as the one I made with those trajectories. But this is interesting. What this shows us is that unsupervised learning really constrained where you start from -- the kind of features that you're going to be starting with in your supervised learning -- and where you're going to end up, which is basically in the same kind of basin of attraction. So I have this cute little method for visualizing second-layer weights and kind of third-layer weights. I don't think we're going to have time for that. This is gone. But it's basically a way to figure out what is the kind of maximal input that you're going to get for each unit in your layer. So the interesting thing is that the difference between what happens after unsupervised learning and after supervised learning is increasing as you add more layers. So it's increasing, and there's basically nothing in common here. So there's one thing interesting here: they basically become kind of digit detectors here. >>: Why should the higher-layer weights give you something closer to the input? I thought -- >> Dumitru Erhan: Well, they become -- so in a sense these are more complicated features, right? So -- [brief talking over]. >>: The first one is the weights. >> Dumitru Erhan: These are the weights, yeah. This is the second layer. >>: It's a receptive field. >> Dumitru Erhan: Yeah, it's a receptive field, but where you take each unit and you say, find me a configuration of the inputs that will maximize -- maximize the activation of the -- >>: So there's no way it -- >> Dumitru Erhan: It's not a weight. But you can consider -- >>: [inaudible]. >> Dumitru Erhan: If this were a one-layer network, this is in a sense -- >>: Okay. >>: [inaudible] all white? >> Dumitru Erhan: Because we are really cranking, so it's kind of a -- the optimization -- >>: [inaudible] I understand, but why aren't there black and white ones? I don't know. You talk a fine -- >> Dumitru Erhan: Yeah. It's really just an artifact of the fact that we are maximizing the function. So -- >>: Okay. >> Dumitru Erhan: So it can put a lot into the white.
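The "maximal input" visualization mentioned above can be sketched as gradient ascent on the input, with the input norm kept bounded (otherwise, as discussed, the maximizer piles a lot into a few pixels). This is a generic sketch of activation maximization for sigmoid layers under those assumptions, not the speaker's exact procedure.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def maximal_input(Ws, cs, unit, n_pixels, steps=200, lr=0.1, rng=np.random):
        """Gradient ascent on the input to maximize the activation of one unit.

        Ws, cs are the weights/biases of the sigmoid layers up to the layer of
        interest; `unit` indexes a unit in that layer. The norm constraint on x
        keeps the maximizer bounded.
        """
        x = rng.normal(scale=0.01, size=n_pixels)
        for _ in range(steps):
            # Forward pass, keeping activations for the backward pass.
            acts = [x]
            for W, c in zip(Ws, cs):
                acts.append(sigmoid(acts[-1] @ W + c))
            # Backprop d(activation of `unit`)/dx through the sigmoid layers.
            grad = np.zeros_like(acts[-1])
            grad[unit] = acts[-1][unit] * (1 - acts[-1][unit])
            for W, a in zip(reversed(Ws), reversed(acts[:-1])):
                grad = (grad @ W.T) * (a * (1 - a) if a is not acts[0] else 1.0)
            x = x + lr * grad
            x /= np.linalg.norm(x) + 1e-8   # project back onto the unit sphere
        return x   # reshape to image dimensions for display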
>> Dumitru Erhan: So because it's a maximal input, and since, you know, if you don't bound the norm, this pixel can contribute a lot to the activation of the function. By the way, there are some funny things happening. This is what looks like a four detector, you know; it transforms into an eight after supervised learning. So -- >>: These are just randomly chosen? >> Dumitru Erhan: Yeah, this is just -- so these have like a -- >>: 50,000. >> Dumitru Erhan: Yeah, each layer has a thousand units. It just shows them without any -- it's the first I don't know how many, a hundred or so. So we've seen the effect of pre-training; now we're going to try to get at what pre-training actually does. I think we've seen a lot of hints of, you know, regularization, and we're going to see a lot of this. So I'm going to get back to what you just said, if you're [inaudible] about this hypothesis. The simplest hypothesis, actually -- not the regularization or the optimization hypothesis -- is what we call the conditioning hypothesis: perhaps what is going on is that what pre-training gives you is simply the right range of values for your weights. I mean, if you think about it for more than five seconds, I think it's a pretty silly hypothesis. But perhaps what just happens is that it conditions your supervised learning in kind of the right way. So what we're going to do in this experiment is simply compute the marginal distribution of the weights that is given to us by unsupervised learning. It's some sort of Laplacian-looking kind of distribution. So it's fat-tailed; there are a lot of large values possible. And I'm going to just sample from it. Instead of randomly initializing in the standard kind of [inaudible] way, one over square root of [inaudible], we'll sample from this Laplace distribution, or fat-tailed distribution. And it's interesting that it actually gives you a bit of a boost. So this is with random initialization, this is with this histogram initialization, as we call it, and this is with pre-training. So this is just the classification error. But it clearly cannot account for the whole difference between what pre-training gives you and what -- >>: You're sampling the distribution independently, right? >> Dumitru Erhan: Yes, yes. Yeah. So this is not -- yeah. We didn't push it too hard, you know. We didn't want to build a, you know, multivariate distribution of the weights -- it sort of defeats -- I don't think it's a very interesting experiment. Maybe it actually -- if you actually did this, you know. But pre-training -- really, what it gives you is large values of the weights, but they're not randomly sampled large values. They have tight correlations between them, because they need to reproduce the data well. So this is the plot that many people have asked me about already. So this is to more directly find the role of pre-training: you look at the training error versus the testing error over time. So you take two networks that have the best hyperparameters, so the best validation error. You train them until they converge, until there's basically no more change in the training error, in the training objective function -- or at least as far as we can see there is no more; we are at kind of a local minimum, or we don't improve anymore.
And you look at what happens with pre-training, in red, and without pre-training. It's interesting that you never quite get the same training errors. So this is the training error, and lower is this way. The networks with pre-training never seem to get to a better local minimum of the objective function. This is the cross entropy, by the way, not the classification error -- the classification error on training gets to zero very fast for all of them, so there's no point in measuring it. And this is the test NLL, the test cross entropy. So it does fit a regularization interpretation pretty well: it trades off the capacity to overfit for better generalization. >>: [inaudible] for that experiment [inaudible] one single number. That is, if you do an infinite amount of randomization, one of them must be better than that one. >> Dumitru Erhan: You mean which -- you're looking at this -- >>: Well, the one in blue. >> Dumitru Erhan: Yeah? >>: So the question is how many choices have you -- >> Dumitru Erhan: So. >>: In randomization? >>: [inaudible] not really infinite. >>: The real question is what kind of exploration you might want to have in [inaudible] that somehow will get something close to this -- >> Dumitru Erhan: But the point is that with pre-training you don't need to. I think that's the point -- you don't need to sample infinitely -- >>: [inaudible]. Well, the question is how much do you need to explore? >> Dumitru Erhan: Well, we -- >>: The size of the network is small. >> Dumitru Erhan: We tried more than what a normal, sane person would do. [laughter]. So I think that's -- >>: [inaudible]. >> Dumitru Erhan: Let's just say that I think I took down Google Maps for a couple of hours with my exploration of these things. People were not happy. So I think that's about as -- 500 per [inaudible], and then you have number of layers, hidden units per layer, learning rate, batch size, regularization parameter. Each of them -- >>: Nothing can get in the way of that. >> Dumitru Erhan: Yeah. Though we haven't tried sigmoid versus tanh or other types of non-linearities, different kinds of -- you can play with this a lot. >>: [inaudible]. >> Dumitru Erhan: Yeah. But I think the main argument is not the difference between these at the test level -- maybe you can make that a bit smaller. It's that we really tried hard to see whether a pre-trained network can achieve the same training error as a network without pre-training, and it cannot. Another argument for regularization: if you think of pre-training as constraining the capacity of the network in some vague way, maybe that's what it does. A classical way of constraining the capacity of a network is to constrain the size of the hidden layers. So we just did that, and then we see what kind of effect pre-training has in that case. We took one, two, three layer networks -- it's basically the same result for all of them -- and we varied the number of units per layer, which is kept constant across layers.
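As a sketch of the layer-size experiment just described, with the training code abstracted away: build_network, pretrain, train_supervised, and test_error are hypothetical callables passed in by the caller, not functions from the paper's code; the experimental loop is the point.

def capacity_experiment(build_network, pretrain, train_supervised, test_error,
                        unlabeled_data, train_data, valid_data, test_data,
                        layer_sizes=(25, 50, 100, 200, 400, 800)):
    """Compare test error with and without pre-training as the number of
    hidden units per layer shrinks. All callables are supplied by the
    caller, so this is only the experimental loop."""
    results = {}
    for n_hidden in layer_sizes:
        for pretrained in (False, True):
            net = build_network(n_layers=3, n_hidden=n_hidden)
            if pretrained:
                pretrain(net, unlabeled_data)   # e.g. RBM / denoising auto-encoder
            train_supervised(net, train_data, valid_data)
            results[(n_hidden, pretrained)] = test_error(net, test_data)
    # observation from the talk: for small enough n_hidden, the pre-trained
    # network's test error is worse -- an extra regularizer on top of the
    # capacity constraint imposed by the small layers
    return results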
And you can see that -- in black is a network without pre-training, and in red and blue is with an RBM/DBN or denoising auto-encoders; it's the same kind of curve. You can see that if the number of units per layer is small enough, pre-training actually hurts. So it seems like pre-training acts as an additional regularizer on top of the one imposed on you by the capacity you're already shrinking with the number of hidden units. So it's one way of doing it. We've had some other results as well; I'm not going to talk about them, but they're interesting. As somebody pointed out, though, if pre-training is a regularizer and what it does is basically just define the starting point of your supervised optimization, then perhaps, like other regularizers you might think of, its influence will disappear once you have enough supervised data. If you're a Bayesian kind of person, you think: you have a prior, which is defined by the unsupervised learning, and you have a likelihood term that comes from the data; your prior is maybe not data dependent, so it will be overwhelmed by the likelihood if you train long enough, if you have enough data. So we said, let's just verify this hypothesis, let's see what happens in an online setting where we have enough supervised data that, if this is true, whatever starting point we start from will clearly stop mattering at some point. So we did this InfiniteMNIST thing, 10 million examples. Here the axis is the number of examples seen, in millions, so 10 million is here. We've tried with 30 million as well; it's the same, though we couldn't try that with some of the models -- too slow. And we measured the online classification error. The black curves are random initialization -- the dashed line is one layer and the solid line is three layers -- and the red curve is a three layer network with denoising auto-encoder initialization, so with pre-training. The only difference between them is their starting point, which for the red curve is the outcome of pre-training. Yes? >>: So the online classification error is not [inaudible]. >> Dumitru Erhan: Oh, it's just the classification error on the next examples, moving over time. >>: Oh, it's moved over time. So you're not integrating? >> Dumitru Erhan: No. I've trained on a million examples, I test on the next hundred thousand, I see what the classification error is, I go on. You can see you get some ridiculously low classification errors, 10 to the minus 5 -- there's so much data that you can get pretty low. >>: I assume this is [inaudible] the same data that [inaudible] because otherwise it would be the same [inaudible]. >> Dumitru Erhan: Yes, in a sense. Here with the red curve we've done 2 and a half million examples of unsupervised learning and 7 and a half million supervised, so we've set a budget of 10 million examples. Either way, even if you allow the network with random initialization to continue, it never quite gets to the same place.
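The online error being plotted is prequential: each block of examples is scored before the network trains on it. A sketch of that measurement, with predict and sgd_update as hypothetical callables standing in for the actual model code:

def prequential_error(net, stream, predict, sgd_update, block_size=100_000):
    """stream yields (x, y) pairs in order. Each example is scored before
    the SGD update on it, so every point on the curve is an error measured
    on data the network has not yet used for training."""
    curve, n_in_block, mistakes = [], 0, 0
    for x, y in stream:
        mistakes += int(predict(net, x) != y)
        sgd_update(net, x, y)
        n_in_block += 1
        if n_in_block == block_size:
            curve.append(mistakes / block_size)
            n_in_block, mistakes = 0, 0
    return curve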
So clearly the starting point of this non-convex optimization problem -- optimizing this deep neural network -- matters, even in a scenario with essentially unbounded training data. What's a bit surprising is that this doesn't quite follow the standard interpretation of a regularizer. If you actually did this with an L2 regularizer -- and I did this experiment -- the optimal value of the regularization coefficient when you have a lot of data is zero. There's no reason to use L2 regularization or any of these data-independent regularization schemes, at least the simple canonical ones -- maybe something like a penalty on the sparsity of your activations behaves differently. So pre-training doesn't follow the usual interpretation of a regularizer, which is what people think of. >>: I think it's pretty clear that this -- because your data isn't real data, I mean otherwise there's no [inaudible]. >> Dumitru Erhan: Oh, yeah, yeah. I've done some experiments with more real data, where we actually have an i.i.d. million examples, and we get the same kind of result, so there's no difference there. But this one was published and that one wasn't. Any other questions about this part? Okay. And it's funny -- you look at this and say okay, maybe it's a regularizer, but it's not quite clear why. But look at the training error: if you take the models that you get at the end of training, so after 10 million examples, and test them on the same data that you trained with, in the same sequence -- this is the network without pre-training and this is the network with pre-training -- a couple of things happen. The network is better at classifying things that it has just seen; that's just how gradient descent works. And the early examples are essentially like test examples -- the network saw them so long ago that the error on them is basically the same as the error you get at the end of training on the next, unseen examples. So in a sense it shows an optimization effect: in the online setting, pre-training is better on the training error as well, as defined here. Actually, some of us expected these lines to cross -- you would think that a network without constraints, such as the randomly initialized one, might overfit more on the data it has just seen, especially since it has a lot of tunable parameters. But pre-training seems to have had an effect here, too. >>: So what is -- there a -- >> Dumitru Erhan: This is testing on the same sequence of data that it has seen during training. >>: Oh, I see. >> Dumitru Erhan: So these are the very last examples. This is where gradient descent has just moved it, so it knows how to classify these pretty well. It's surprising that it's not zero, but it's computed over a large enough sample that it will not actually be zero. So I'm getting to one of the last results.
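For the curve just discussed -- the error of the final model replayed over the very sequence it was trained on -- the measurement is the same loop as before but without the updates; this reuses the hypothetical predict callable from the sketch above.

def error_over_training_sequence(trained_net, stored_stream, predict,
                                 block_size=100_000):
    """Score the *final* model on the stored training stream in its original
    order. Early blocks were seen millions of updates ago and behave almost
    like test data; the last blocks were just used for updates, so their
    error is low but not exactly zero over a block this large."""
    curve, n_in_block, mistakes = [], 0, 0
    for x, y in stored_stream:
        mistakes += int(predict(trained_net, x) != y)
        n_in_block += 1
        if n_in_block == block_size:
            curve.append(mistakes / block_size)
            n_in_block, mistakes = 0, 0
    return curve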
And it's one that I think makes the clearest and strongest point for why pre-training can be seen as regularization. If the starting point of your optimization actually matters, why don't we test that directly: vary where you start from -- not the parameters, but the actual data you look at at the beginning of training. So you take the same 10 million examples, but you vary the first million: you keep the other 9 million the same, and you vary the first million that you look at. And then you observe the variance of the output of your function at the end of training. So basically you're computing a score of how sensitive your estimator, your classifier, is to the data that it sees in the beginning. And what you should be comparing here is this point on the red line, which corresponds to the start of supervised optimization for networks with pre-training, with this point, which is the randomly initialized network where we vary the first million examples. And we can see that once the network has done its unsupervised learning, it kind of doesn't matter which supervised examples we see at the beginning of supervised learning: the variance is going to be low, or at least much lower than the variance you get with the network with random initial weights. >>: This is the variance of the log of the output? >> Dumitru Erhan: Yes. Log of the correct class. >>: Log of the correct class. >> Dumitru Erhan: Yes. >>: You're using -- >> Dumitru Erhan: Both sometimes. >>: So you put [inaudible]. >> Dumitru Erhan: Yes. >>: [inaudible]. >> Dumitru Erhan: It's future work. It's in one of my slides, but curriculum learning, for those who don't know, is Yoshua's idea that if you have some sort of non-convex optimization problem and you order the examples that you look at from simple to hard, under some metric that is somehow defined, then on a couple of problems -- image recognition and NLP, I think it's joint work with Ronan Collobert and Jason Weston -- they obtained better results. With NLP they basically just increase the vocabulary size. So in a sense this is sort of what our result says as well. >>: This one doesn't really order the importance -- >> Dumitru Erhan: Yeah, it doesn't really order. But this one basically tells you that stochastic gradient descent is very sensitive to where you start from, and the early examples influence more where you're going to end up. So you should pay attention to where you start from. And pre-training reduces the variance of the output that you're going to get at the end of training. So you can see pre-training as variance reduction, which is just another quality of a regularizer. >>: So [inaudible] obtained by [inaudible]. >> Dumitru Erhan: Yes. These are the networks that are best in class again. So, yeah? >>: Can I take you back one slide, please. >> Dumitru Erhan: Sure. >>: We went kind of quickly over that. So you were saying, and I think very correctly so, that you were expecting the blue line to -- >> Dumitru Erhan: To intersect with the red line. >>: To intersect, to get lower at some stage. You don't see that here. >> Dumitru Erhan: Yeah.
>>: But then you said -- I don't know, you didn't really explain it. So this to me is a red flag. I mean, this is kind of the whole problem with an empirical observation: I have to trust that you did a good enough job in treating your competitor fairly, and this kind of -- >> Dumitru Erhan: Well, this is the network that has obtained the best validation/test error, if you like. So it's the one that -- well, you can trust my honesty or not, but I tried hard enough -- >>: I'm not questioning your integrity at all. I'm just saying this to me would be a red flag that would say something is wrong -- >>: [inaudible]. >> Dumitru Erhan: Well, why is it a red flag? For me they're both plausible alternatives, that they cross or not. I don't see why -- >>: I think the natural intuition would be that the [inaudible]. >> Dumitru Erhan: But this is just on examples that it has just seen, right? So -- >>: So I mean [inaudible]. >> Dumitru Erhan: But you only -- >>: [inaudible] can do better. [brief talking over]. >>: Especially with these nonlinear models. [brief talking over]. >>: [inaudible]. >>: Well, except for neural networks, I know [inaudible] because the gradient gets multiplied by the activation [inaudible] very little gradient affects the model. Once you've trained them, they're wedged. So it's not truly an online experiment, because really the parameters of the first layer get set very early in the process, and those probably get screwed up. In fact, he showed pictures of them screwed up, and then you're screwed. So it matches my a priori expectation. >> Dumitru Erhan: The reason I even mentioned the crossing is that, before doing this experiment, we asked what would happen. I like to enforce this, because there's a lot of hypothesizing after the fact that I don't like -- oh, we observed the data, let's make a hypothesis that really fits the data. So when somebody proposes an idea, I like to ask the person: show me what you think will happen, show me this graph for both of them. And I'll put down mine and try to make an argument for why I think that will happen. One of those hypotheses was the crossing, along the lines of what you just said -- because it's unconstrained. I don't know. This is what we observed. And as far as I can tell you, I predicted this would happen -- but I did the experiment, so it's not quite fair. We can talk more. Does anybody have questions about this? All right. So finally, a plot courtesy of Yoshua. He said that if I say this, I should not use it. [laughter]. Basically it's more of a mechanism explanation, and it's along the lines of what John just said. The dynamics of unsupervised pre-training initialization are such that -- in 2D -- at the beginning of training you're basically choosing the quadrant, or hyperquadrant, that your weights are going to end up in, and it's going to be very hard for you to switch quadrants, to go somewhere else, once you've done this. So the initial updates have a crucial influence.
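One cheap way to probe this "you pick your quadrant early" picture -- my illustration, not an experiment from the talk -- is to snapshot the weights after the first few updates and check how rarely their signs flip by the end of training:

import numpy as np

def sign_flip_fraction(W_early, W_final, threshold=1e-3):
    """Fraction of weights whose sign changes between an early snapshot and
    the final weights, ignoring weights that stay near zero in either one.
    A small value is consistent with the 'hard to switch quadrants' picture."""
    W_early, W_final = np.asarray(W_early), np.asarray(W_final)
    active = (np.abs(W_early) > threshold) & (np.abs(W_final) > threshold)
    flips = np.sign(W_early) != np.sign(W_final)
    return float(np.mean(flips[active])) if active.any() else 0.0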
You can call it a kind of critical period, as in psychology. We've seen in some of those plots that the initial updates really explain more of the variance of what's going on. And this notion of basin of attraction has come up a couple of times in the conversations here. What seems to happen is that unsupervised pre-training initializes you in a basin of attraction with good generalization properties. Let's just go over this very quickly. Pre-training, we've seen, induces qualitatively different functions -- those flower power pictures. We've seen the weights, the weight matrices; they seem to be different, exploring different parts of the network's parameter space. Some of the first results that we've seen pointed towards a regularization hypothesis: the worse training error in the MNIST scenario, the layer capacity versus test error result. We tried to extend those results to an online setting to see whether this pre-training advantage would actually diminish as you add more data, and it doesn't seem to, with some caveats as we've seen. Pre-training seems to be a variance reduction technique -- we literally measured the variance of the output. And even in an online setting we can see pre-training as a regularizer, because it does constrain where we start our supervised learning from, and it seems like in a non-convex setting this kind of constraint actually matters. So the take-home messages are that the initial updates have a crucial influence: they explain more of the variance and they define the basin of attraction. As we've discussed a bit, pre-training will have a positive effect as long as modeling P of X is useful for modeling P of Y given X. If that is not true, I'm not sure what you're going to get from pre-training. The influence of early examples could be troublesome, if you think about it. If stochastic gradient descent applied to neural networks really is subject to overfitting on the early data, could that actually hurt us in a large scale setting? If your weights are stuck in the basin of attraction you started in -- whether you started with pre-training or not doesn't quite matter -- that could actually hurt you, because you're doing online learning and maybe at some point your distribution is going to shift somehow and you're not going to be able to get out of the minimum you ended up in. I forgot to mention that all these results are with constant learning rates, and in theory stochastic gradient descent with a constant learning rate should be able to escape local minima, in a sense. Anyway, we have a relatively fresh, just-published JMLR paper that has more. There's some future work. We want to understand some other semi-supervised deep approaches; for instance, there are a lot of people, namely Andrew Ng and others, who look at cases where there's much more unlabeled data than labeled data. It's unclear whether some of the results that we've described actually apply in that case. There are a couple of people who combine supervised and unsupervised learning costs, so there's not a clear two stage process; it's not quite clear what happens there either. We want to explore some connections with the generative versus discriminative work, like Andrew Ng and Michael Jordan's kind of seminal work on that.
Fisher kernels, which are a way of pre-training kernel machines with generative models. We've talked about curriculum learning; it's unclear how this ties in with our results, but there are some connections there. We want to describe in more concrete terms the basins of attraction of unsupervised learning. What does that mean? What properties do they have, apart from being good models of P of X? Along these lines, I would like to better understand the kind of invariances and features that are learned by unsupervised learning. I think that's a more concrete statement than just basins of attraction, which is pretty vague as a term. So I think we need a tool for visualizing or comparing the strategies that you use for pre-training, the costs, the architectures, and a tool for better understanding, maybe poking at, those networks. And I have a couple of those tools -- you can talk with me if you want to. The features I've just shown you are one of those: I've done some work on trying to visualize that a bit more, and maybe it could give us a more definitive answer to why it is actually hard to train a deep network and why pre-training makes it easier. So I think that's it. Right on time. [applause]. >>: Well, besides [inaudible] do you have another set of data that maybe corroborates all of the -- >> Dumitru Erhan: Yeah. We have -- >>: We just want more. [laughter]. >>: [inaudible]. >> Dumitru Erhan: Yeah, we did some -- I don't know if you know about ImageNet. So this is -- I forgot her name. It's Stanford Computer Science. Is it -- I forgot. Anyway, it's Stanford, somewhere in that vicinity, somewhere in the Bay area. They collected it on Mechanical Turk, so they labeled about 10 million or 12 million images. And we've done some experiments with a couple of million of those examples, and the curves look kind of the same. >>: [inaudible]. >> Dumitru Erhan: No. It's not [inaudible] -- no, that was the JMLR paper we submitted eight months ago. It was accepted as is; they didn't want us to do more experiments, which surprised us. >>: [inaudible]. >> Dumitru Erhan: Yeah. Well, the online setting -- we wanted to test that. The online setting is, I think, a bit more of a surprising result for many people. So maybe we'll get to publish this at NIPS this year. >> Misha Bilenko: Let's thank Dumitru again. [applause]