Deep Learning of Representations
>> Li Deng: So it is our great pleasure to have Yoshua Bengio here to give this talk. He gave four hours of tutorial just a few days ago, and he has compressed all of that to come here. He has done fantastic work on neural networks since the 80s, and it's only in maybe the last eight years, since 2006, that a lot of his work has been creating more impact than in all the previous 20 years combined. So we have an opportunity to hear him talk about some of the newest developments in his work; he told me a little bit about it at dinner last night, which was fantastic. So we'll give the floor to Yoshua. I hope we'll have some exciting discussions.
>> Yoshua Bengio: Thanks, thanks very much Li. So indeed I started working with neural nets in the late 80s; in fact I was working on speech recognition with convolutional and recurrent networks for speech, with joint training of these models with HMMs, in a way that's very similar to what's now the state of the art. The difference was we had smaller machines, we didn't have nets as deep, and we didn't know a lot of the tricks we've learned since then to train these things better.
But today, I'll be talking about something different. I will be first doing a kind of overview of some of the elements of deep learning and then I'll move into some more recent territory and some material that I'm very excited about concerning unsupervised learning which has been kind of one of the parts of deep learning that hasn't received the attention that it should.
So, where do I start from? My goal is to move us towards AI, in a significant way. I want to understand the principles that give rise to intelligence through learning. And how do we get machines to be intelligent? They need to have knowledge. So how is this knowledge going to get there? It's going to be there in great part because of learning. There is a lot we can tell machines directly and explicitly, but learning is essential. Two of the essential ingredients of learning are the priors that we put in the machine and the techniques we use for optimization and search. Now, for learning to be successful, a crucial ingredient is generalization. In this talk I'll talk about generalization from a kind of geometric perspective, with a connection to manifold learning, thinking about how you can deal with the curse of dimensionality by being smart about where you put probability mass, moving it from the training examples to other places so that you can generalize to new configurations.
One of the ideas that I've been pushing forward a lot in order to get really good generalization is the discovery of abstractions, and what that really means for me is learning algorithms that can discover the underlying explanatory factors implicitly and separate them out, so I call that disentangling the factors of variation. I won't talk a lot about that in this presentation today, but I'll touch on the subject somewhat.
So, the starting point of this talk is the notion of representation learning. It is something that has been around for a long time, but until the last few years it was not being given enough attention in machine learning. The typical way we use machine learning is by doing a lot of engineering to design good features that we feed as input to our learning algorithms. And that's, you know, useful, but wouldn't it be nice if machine learning included the learning of good features? And this is probably important for AI, where we would like the machine to cover a lot more ground than we could by hand-designing the features. So then we can ask the question: what are good features? What is a good representation for the data? And one of the ideas that has been around to answer that question is the idea that a good representation captures the underlying factors. This is already present in things like principal components analysis, which captures the main factors of variation, but we want more than that. We want these factors to actually represent the explanatory causes, and that's much harder. And we want to be able to capture factors that are more abstract, and that means more nonlinear, and that involves more difficult optimization.
Now among the representation learning methods, my focus has been on deep representation learning. In other words, having multiple levels of representation. Why would we have multiple levels? Because in this way, we can gradually come up with better representations that capture more abstract information about the data, and it can be more complex and more nonlinear.
Why would we want to do that? Well, there are some theoretical reasons that I won't talk about a lot today that have to do with the expressive power of deep circuits and deep neural nets, or deep architectures in general. Basically, there are families of functions that can be represented much more efficiently, in an exponential sense, if you allow yourself to have just enough depth than if you are limited to a shallow kind of architecture with only one or two levels.
There is also, of course, lots of motivation coming from biology and from cognition: what we know about the brain, in particular the visual and the auditory cortex, suggests that we have deep representations in our brains. More recently, at ICML, we showed that if you are able to put your data at a higher level of representation, then some operations, like those typical of Markov chains, become much easier, and in particular mixing between modes becomes much easier, and I will say a few words about that later. And finally, I guess the reason why many people are excited about this is just because it works. And up to now there have been a lot of applications, mostly using supervised learning of deep nets, for object recognition, speech recognition, language modeling and music modeling.
So, let's go back to the basics of machine learning and generalization. We get some data, and here I'm assuming each point represents an example in a kind of sketched 2D plane, but of course we're living in a high-dimensional space, and the empirical distribution puts mass at these examples. So the name of the game is taking that probability mass and putting it elsewhere, which is going to involve guesswork. The leading method that is behind most of machine learning is just to say, well, if there is probability mass at some point, then there's probably a good reason to believe that there should also be probability mass in the neighborhood. And that's just assuming that the density function is smooth. So that works remarkably well, and essentially all of classical nonparametric statistics and nonparametric machine learning is based on this.
Unfortunately, when you deal with AI problems, where we have high dimensionality and complicated manifold structure, there is never enough data to really do a good job of covering these manifolds and figuring out, for example, that here, you know, you should have high probability mass, and maybe you should extend this over there and have high probability mass there. There's no way that this local method is going to work there. So, we need to guess some structure that can explain the data, and that's in part what representation learning and deep learning are about. They give us some ways to guess what the structure is, and we let learning explore the space of these guesses so that we can generalize. So, even though maybe people 10 years ago thought that nonparametric statistics was the ultimate thing and we couldn't do much better, that you can't generalize to really new configurations that are very far from what you've seen, in fact you can; you just need more priors than the smoothness prior.
This is a review paper that I wrote that's been published in TPAMI, and you can find it on my webpage and on arXiv, that discusses in more detail these priors for learning representations. So with these priors we actually have a chance to bypass the curse of dimensionality. One of the ingredients which I think is really important is to have some kind of compositionality and reuse in these models. Just as humans do; we have to have that in many ways.
So, talking about humans, there's one thing that distinguishes current machine learning from how humans learn. And it's the ability of humans to learn a new task from very few examples.
So how do they do that? I mean, typical supervised machine learning requires millions of examples to do a good job. And it can do a really good job, but humans seem to be able to do this much faster. The amount of data that we currently use in, say, speech or language modeling to train our models is way, way bigger than the amount of data that a five-year-old sees in his lifetime, and humans still manage to do a very good job, much better than our current machines. And I think the reason is very simple. They take advantage of all the things that have been learned before, and that includes all the representations that have been learned, and in those representations somehow are encoded relationships and the existence of explanatory factors, some of which may be relevant to the new task at hand and allow the learner to generalize very quickly. So for this, what we need is essentially more multitask learning and more unsupervised learning. A lot of the data that learners can see will be unlabeled, or maybe labeled for other tasks. In 2011, we worked on using these unsupervised learning procedures for learning representations at different levels of depth, and we competed in two challenges, one whose results were presented at NIPS and the other at ICML, on the subject of transfer learning, and we won both of these competitions. And I'm showing here some illustrations of what happened on one of the data sets. Actually Xavier, who is sitting right here, participated in one of these challenges. And what we see on the x-axis of these pictures is the log of the number of examples used to train on the new task, where the representation has been learned on other training data. And the y-axis represents something like an accuracy, so we would like it to be high, and we would like the curve to be, you know, rising up as quickly as possible as we see more examples. And what we find is that when you move to the higher-level representations, you can do a much better job than if you work in the raw input space. For this task, at least, it was very clear.
One of the important priors is that, as I've said, these different tasks that we may care about refer to different subsets of factors. So in this picture I show how you can imagine having a situation where you have a bunch of tasks, things you want to predict, and somehow they share the same kind of inputs, in this case images. And from those inputs you would learn representations that might be useful for all of the tasks, and different subsets of these features might be useful for different tasks. And because you can share these subsets, you can generalize better.
This kind of idea has been used a lot before, in particular for deep learning and is, you know one of the priors we can put in the sauce.
Something kind of related to multitask learning is when your different tasks actually regard different variables. So in traditional machine learning we think of the data as one big table, like one Excel sheet, but actually in many applications of machine learning we have multiple sources of data.
So for example here we have two sources, each giving us different tuples of values of variables, like here (URL, words in the history), and here (person, URL, event). And sometimes the same variable occurs in different tables, and you can learn a model that spans all of that data; this is called essentially relational learning, and you can do it with deep learning. And the idea here is to exploit the fact that you can reuse the same representations across these different tables. So I can have a mapping that maps URLs to some representation that I can reuse for both of these learners that learn the joint or some conditional between the different variables. And we've done some of that with data sets like WordNet, Wikipedia, Freebase and ImageNet.
One of the things with which I had a bit of impact is work on so-called neural language models. At NIPS 2000, and in a subsequent journal paper, we showed how a neural net can learn language models, meaning predicting the next word given the previous words, purely from text, where the lower levels of the neural net learn a representation for each word that is a vector, a so-called embedding. We of course use the same embedding parameters independently of position, and then we have some kind of neural net that predicts the next word.
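To make the architecture concrete, here is a minimal numpy sketch of that idea: a shared embedding table looked up for each context position, followed by a small net with a softmax over the next word. The sizes, names and the tanh hidden layer are illustrative assumptions, not the exact model from the paper.

```python
import numpy as np

V, d, n_ctx, H = 10000, 50, 3, 200        # vocabulary size, embedding size, context length, hidden units
rng = np.random.default_rng(0)

C  = rng.normal(0, 0.01, (V, d))          # word embedding table, shared across positions
W1 = rng.normal(0, 0.01, (n_ctx * d, H))  # hidden layer weights
W2 = rng.normal(0, 0.01, (H, V))          # output weights over the vocabulary

def next_word_probs(context_word_ids):
    """P(next word | previous n_ctx words), for word ids indexing the embedding table."""
    x = C[context_word_ids].reshape(-1)   # look up and concatenate the context embeddings
    h = np.tanh(x @ W1)                   # intermediate representation
    logits = h @ W2
    e = np.exp(logits - logits.max())
    return e / e.sum()                    # softmax over all V candidate next words

p = next_word_probs(np.array([12, 7, 431]))   # a distribution over the 10,000 words
```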
So this is very simple, but a lot of variations of this have been proposed over the years, and one of the things that was very striking early on is that you can look at these representations in 2-D and try to make sense of them. So you probably can't read this, but when you zoom in you see the representations for words aggregating in a sort of semantically meaningful way. Like here you have countries; here you have things having to do with money: debt, cash, money, shares, stock and so on. Here are some more examples: here the verb "to be" being conjugated, here "to have", and here other verbs.
Something really exciting happened this year with some work by Tomas Mikolov and his colleagues at Google, where they show that these kinds of embeddings, when they are trained on large quantities of data, sort of magically give rise to an analogical representation. So what I mean is the following; let me just give you an example.
If you take the vector embedding that has been learned for the word king and you subtract the vector embedding that has been learned for the word queen you get a vector that's very close to the difference between man and woman. And this has not been trained in any way, it just comes out of unsupervised learning of looking at text. So for example, you can take the vector of Paris minus France and add Italy, and you get roughly Rome.
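Here is a small sketch of that vector arithmetic. The embedding table below is random, so it only demonstrates the operation and the nearest-neighbor lookup, not the semantic result; the king/queen and Paris/Rome effect needs embeddings trained on large corpora.

```python
import numpy as np

rng = np.random.default_rng(1)
words = ["king", "queen", "man", "woman", "paris", "france", "rome", "italy"]
emb = {w: rng.normal(size=50) for w in words}   # placeholder vectors; use trained embeddings in practice

def closest(vec, exclude=()):
    """Nearest word to `vec` by cosine similarity."""
    best, best_sim = None, -np.inf
    for w, v in emb.items():
        if w in exclude:
            continue
        sim = (v @ vec) / (np.linalg.norm(v) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# With well-trained embeddings this tends to return "rome":
query = emb["paris"] - emb["france"] + emb["italy"]
print(closest(query, exclude=("paris", "france", "italy")))
```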
So if you think graphically, this is what happens. These vectors line up in ways that are somehow convenient, and this is an example of how, when you learn good representations, very simple operations like addition and subtraction, linear operations, become meaningful, and I'll give you more examples of that later. You can also take these representations and learn different representations for different types of objects such that these representations live in the same space, and this has been done with great success by Google people, in particular my brother Samy Bengio, in the context of Google image search, where you learn a different mapping for images and for words. In the case of words it's just a table lookup, in the simplest case; I think they actually do more sophisticated things, but basically it's a function that maps keywords to a high-dimensional vector, for example, and for images you have some more complicated function that again has parameters and maps to the same space, and you train it such that when someone types this keyword and clicks on that image, then the probability that these two vectors will be close to each other is large. And of course once you have that, you can answer queries and do all kinds of fun stuff.
Now, let me talk about deep feedforward neural nets a little bit, and later I'll talk more about unsupervised learning. So there has been this revolution in deep nets; it started in 2006 based on unsupervised pre-training, but more recently, since 2010, we've made big progress in training deep supervised nets, where before we didn't know how to do it. This study with Xavier Glorot, who as I just mentioned is right there, was one of the early works showing that you can actually make a difference in training deeper nets just by playing with initialization, and in this work we tried to understand why different initializations make a difference, having to do with the Jacobians of the transformations going upwards and downwards and also with the choice of nonlinearities. Actually, the year after, with Xavier, we showed that a particular nonlinearity, the rectifier, also called the rectified linear unit, can make a big difference in the success of these very deep supervised nets. This has been very, very successful in recent years; in particular you're probably aware of the work by Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton in computer vision, in which they were able to very substantially reduce the error rates on one of the benchmarks for ImageNet, where you have a thousand classes. I'm showing some examples of the outputs of their network. This is the image, and it's trying to choose one class out of a thousand. One of the reasons, maybe, why this is working better, and this is I think the interesting part, not just the good results, is that somehow when you have these rectifiers, about half of the units will be off, and maybe more if you penalize for sparsity, and the information will flow only through these active units. That does a kind of symmetry breaking during training, meaning that for a particular example, only some of the units will take charge of trying to explain the error, and the others won't even try to budge. Ten years ago I thought having these hard nonlinearities would be bad because we need smoothness in the gradient, but actually it helps, in a way that we now think we understand but that was against common sense ten years ago.
>>: Can I ask you a question?
>> Yoshua Bengio: Yes, oh by the way, please interrupt me anytime.
>>: So, how do you know that this is the reason as opposed to having a better effect on the
Hessian because unlike---
>> Yoshua Bengio: It is the same; the two things are just two sides of the same coin, so let me give you an argument for this. Can I write somewhere? So, let's consider the Hessian. It's the second derivative of the loss with respect to some parameter i and some parameter j, right? So imagine in my neural net I have theta i here and theta j here. Now, if I have sparsity in the activations, say this guy's off. Then the first derivative with respect to theta i will be zero, and because the second derivative is the rate of change of the first derivative, and we're in a flat region of the nonlinearity, the second derivative will be zero as well. So what happens is that when you have these kinds of sparse activations, you also get sparse gradients and a sparse Hessian matrix. So now the Hessian matrix will be full of these zeros outside of the diagonal, and I suspect that this will make the conditioning problem much easier. I mean, you can imagine in the extreme case where it was only diagonal that things would be much easier and you'd get less interaction. Now, why it makes things easier is because gradient descent assumes that I'm going to change one parameter and keep the others fixed. Right, that's what gradient descent does: it's just looking at one parameter and thinks, okay, if I were to change that parameter a little bit, what would be the effect on the cost, ignoring the fact that I'm going to change the other guys. The second derivative is trying to take that into account. Now, if the other guys are mostly not going to change, then the first derivative is something I can trust more. So I think that's the connection.
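To restate the blackboard argument compactly (the notation below is mine, not from the slides), a sketch:

```latex
\text{Let } h_k = \max(0, a_k) \text{ be a rectifier unit and } \theta_i \text{ a weight feeding into } a_k.
\text{ If } a_k < 0 \text{ for the current example, then locally}
\frac{\partial L}{\partial \theta_i}
  = \frac{\partial L}{\partial h_k}\,
    \underbrace{\frac{\partial h_k}{\partial a_k}}_{=\,0}\,
    \frac{\partial a_k}{\partial \theta_i} = 0
\qquad\Longrightarrow\qquad
\frac{\partial^2 L}{\partial \theta_j\, \partial \theta_i} = 0 \ \ \text{for all } \theta_j ,
```

so sparse activations give sparse gradients and a sparse, more nearly diagonal Hessian for that example.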
>>: It's somewhat unclear, because the ratio of eigenvalues increases to infinity when some of them go to zero. Fortunately it's only the ratio of the eigenvalues that counts, and when one is zero you can ignore it. And the remaining eigenvalues now have a much shorter range.
>> Yoshua Bengio: That's right. It's not really the eigenvalues close to zero that we care about somehow. And I don't have a good intuition as to why this is true. Another thing that happens is symmetry breaking. In a regular neural net with all the units being active, the units will tend to kind of move together to try to solve the problem, and that's what I was talking about earlier.
When many of them are off, suddenly you get, you know much faster training because they will specialize faster.
>>: It's just the intuition that gradients only talk about one variable independently, that doesn't feel right, right? Coordinate descent is where you're doing one at a time, but a gradient says a bunch of things change together; it just doesn't have that second-order structure.
>> Yoshua Bengio: Yeah, I understand what you're saying, but that's what the gradient [indiscernible]. The gradient really is looking at the effect on the cost when you change one variable at a time, and we are using it as if we didn't care about what happens with the others, because we can make an infinitesimal step, in theory. In practice what's going to happen is there are going to be second-order and higher-order effects, but yeah, you're right.
>>: Actually many groups have tried similar nonlinearities and I think the conclusion is that if you don't do anything special, what you get is similar to what you have with sigmoids. You don't get any gain except that you have faster convergence because your gradient is always large, which is [indiscernible], but other than that [indiscernible].
>> Yoshua Bengio: That's not our experience. It depends, right. In some cases it can make a big difference, in some cases you're better off not using rectifiers. So rectifiers also have problems.
In particular because they are unbounded they fail miserably in the context of recurrent nets and I think that if we go to deeper and deeper nets, then the benefits will be outweighed by the fact that this unboundedness means that sometimes you will get very large gradients. So, things are more subtle than just, oh it's great and magical. There's something going on. You have to understand it. Sometimes it gives great benefit, it can also be hurtful and if we understand it better we can cook up something even better.
>>: So do you have any intuition under what conditions it may be helpful?
>> Yoshua Bengio: I just said: if you have too many steps in the computation, like in a recurrent net. So actually in the models I want to talk about later we have a kind of recurrent computation, and if we try it with rectifiers it just doesn't work. Things blow up, and you can see it in the Markov chains that we're generating that it just diverges.
>>: Well I tried. I had to reduce the learning rate by at least 10 times.
>> Yoshua Bengio: Yes, that's right. That's exactly for this reason because sometimes the gradient can be too large.
>>: The soft version of the rectified linear unit doesn't work as well.
>> Yoshua Bengio: No, for the exact reason it doesn't get the gradients to be zero.
>>: I see so, it has to be zero.
>> Yoshua Bengio: Yes, well, I mean, I think if it's near zero the effect is going to be there as well to some extent, but I don't fully understand why the softplus, which is the soft version, doesn't work as well. I suspect that the things happening near zero are really important and having this hard decision is somehow useful.
>>: Also, there's the possibility that outliers basically corrupt it. Haven't you seen all this?
>> Yoshua Bengio: The outliers, what happens?
>>: You know, things may not go well at all because the whole thing is unbounded.
>> Yoshua Bengio: Yeah, that's right.
>>: So what is the best analysis of this type of nonlinearity?
>> Yoshua Bengio: Well, the two papers by Xavier I think, are relevant, but we need more analysis.
>>: So in a sense, the most crucial part of this effect was just training them well rather---
>> Yoshua Bengio: Absolutely, absolutely. You've put your finger on the right question. The optimization part is the crucial part. Now, what you have to keep in mind is that the choice of architecture, like changing a nonlinearity, does matter, because it changes the difficulty of optimization. So it's not just choosing a better optimizer, it's playing with the architecture such that optimization becomes easier. If you don't mind I'm going to move forward, but please interrupt me again. I'm going to skip that.
So one of the things you've probably heard about, that goes hand in hand with these rectifier networks, is dropout. Geoff Hinton has been talking a lot about it. And for me the main lesson is: if you take these deep nets and inject noise, you get a very good regularizer, and that allows you to improve performance, not in all cases, but in many cases. So I'm going to move forward; we actually published a paper at this ICML on what we call maxout, which is a different nonlinearity that works even better with this dropout trick than the original rectifiers. And we did a bit of analysis to try to understand what is going on. Why is dropout working so well? I won't go into the details, but we've done a lot of experiments whose goal was to understand why this is working, and all the signs point towards a regularization effect and a kind of bagging effect. The bagging effect comes from this: imagine you have all of the possible neural nets you can obtain by shutting off some of the units with dropout, where dropout sets the outputs of some of the units to zero randomly, half of the time. So you have an exponential number of these neural nets, and you can imagine averaging all of their outputs in some way, and at test time what you get is an approximation of this when you remove the noise and divide the weights by two. The thing is, the way they are trained, if you look at the gradient, it's basically as if each network were trained separately, but because they share weights you can actually generalize across all of these networks, and you never have to actually instantiate all of them at the same time. If you try, you can change the training criterion so that it looks more like maximum likelihood, in other words so that you try to make their average give a better answer, and actually that works worse than dropout; it works about the same as if you train a regular neural net, so you lose the regularizer. The regularizer comes from the bagging effect.
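A minimal sketch of the mechanics described above (the standard dropout recipe; shapes and the 0.5 rate are illustrative): during training each unit's output is zeroed at random, which samples one member of the exponentially large ensemble of shared-weight networks, and at test time the noise is removed and the activations, equivalently the outgoing weights, are scaled by one half.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (784, 500))

def hidden_layer(x, train=True, p_drop=0.5):
    h = np.maximum(x @ W, 0.0)                   # rectifier hidden layer
    if train:
        mask = rng.random(h.shape) >= p_drop     # shut off each unit half of the time
        return h * mask                          # one member of the shared-weight ensemble
    return h * (1.0 - p_drop)                    # test time: no noise, scale to match expectations
```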
Okay, so I was talking about some of the challenges and new things that happened with supervised learning, and now let me start shifting gears towards unsupervised learning. One of the things I've understood, which has been a big insight for me over the last couple of years after working a lot with RBMs and Boltzmann machines, is that there is a fundamental difficulty with training models such as Boltzmann machines, including RBMs, that rely on Markov chain Monte Carlo to estimate the gradient.
So, to understand this, let me try to explain this picture. For any probabilistic model, when you start training, the model will assign probability kind of uniformly. You don't know anything, so you put probability mass everywhere. So this is supposed to be input space, and the y-axis is probability.
Now as training progresses, the model starts to put probability mass more and more near where the data is, and this is what we want. Now, if the data that you have actually has a distribution like this, where the modes are not very separated, and it's easy to go from one mode to the other while staying in regions of high probability, then things will work out well. But in typical AI problems we actually want to learn these manifolds I'll talk about, which means there are really high concentrations of probability density or [indiscernible] in different areas of input space, and these actually occupy a very, very small fraction of the total volume. So even this picture is not doing justice to what happens, and as training progresses these modes will become sharper and sharper, and the regions between those modes will get closer and closer to zero probability. Okay, so that's what happens during training, or at least what you would like to see happening, because at the end of the day, if you want a good model, it has to be certain about the right things and certain that the wrong things won't be happening, or will have very low probability.
Now, if you use Markov chain Monte Carlo to estimate the gradient, you're going to be in trouble. Why? So how do Markov chain Monte Carlo, MCMC, methods work? They basically tell you how to move from one configuration in input space or state space to another configuration that's going to be both sort of nearby and of high probability. All the MCMC methods essentially do that.
And now you can see why this is going to be a problem, right? So here I can do small moves, going each time through a reasonably high-probability configuration, that take me from this mode to this mode. But here, once I have my state in this region, there is no way I'm going to be able to cross this desert of low probability, because I would have to take, you know, a series of steps, each of which has small probability, and then the probability of going from here to here would be exponentially small. One way to think about it is Gibbs sampling: if you are Gibbs sampling pixels of images, and you imagine you start with an image of a five, and you say I'm allowed to change one pixel at a time such that each image still looks like one of the MNIST digits, but I have to go from 5 to 7, or even worse 5 to 1, or any of the other modes or classes, it's very unlikely you'll be able to do that. Okay, so that's really a problem, because we want to train models that capture complex, very sharp densities, and I'll come back to that a lot.
>>: A lot of learning now in deep learning, we don't do any of that.
>> Yoshua Bengio: So, in supervised learning you don't need that, and this is going to be an ingredient of my answer to this problem, as you'll see. What I'm going to be proposing today is a way to train unsupervised learners basically by backprop, without having to do this MCMC or approximate inference in the middle of training. So, here one of the ingredients is to think about this in terms of what I call the computational graph. What's the difference between supervised learning and unsupervised learning methods like Boltzmann machines? The difference is that in neural nets, supervised neural nets, we have what I call a computational graph which essentially corresponds directly to the model. We go from inputs to outputs using some parameters and we do some computations. So what I call the computational graph is just the graph of the computation. Each node tells you, you know, add something, multiply something, blah, blah, blah. If you use graphical models like RBMs or Boltzmann machines, you actually have to generate a new computational graph depending on what questions you want to answer. And that makes sense, because you may have to answer all kinds of questions; with these unsupervised learning models you can answer questions like: I give you some variables, tell me what the others should look like, and you want a sample from their conditional probability. And in recurrent nets you also get something like this, where you have a computational graph and you can actually change the computational graph depending on the length of the sequence; it's the unfolding of the graph of the recurrent net. But in graphical models the parameterization of the model is through this probability function that you define analytically, whereas in traditional feedforward neural nets the parameterization is directly the computational graph itself, and we just tune it so it does the right thing. So the question I'm asking is: can we do the same thing as we're doing for deep supervised nets, but for unsupervised learning? Can we have a parameterized family of computational graphs that actually defines the model directly? I'll come back to that, but before that I'm going to do a little detour to explain what auto-encoders and denoising auto-encoders do, which turns out to be useful in this discussion. Yes.
>>: So when you're talking about unsupervised, is that in the context of pre-training for a supervised task, or is that the goal in itself?
>> Yoshua Bengio: It's the goal in itself, because information comes to us, comes to my brain from my senses, and my brain is just trying to make sense of it; that's just unsupervised learning. Even when you tell me something I should learn as a target, it's unsupervised learning; it's just that one of the things that gets to my ears is your words. And I could be more discriminative by saying I'm going to pay more attention to predicting some of the things more than others, and you can have reinforcement learning and so on, but unsupervised learning, learning the relationships between variables, is fundamental to doing things more complicated than what we are used to with supervised learning. Basically it allows us to answer new questions about the data, and not just always the same question, which is, you know, predict this one output given this input.
>>: [Inaudible] has there been interest in that recently, like, you know, other examples of that, has that been applied?
>> Yoshua Bengio: Unsupervised learning, no, it's the other way around. If we were able to do a good job of unsupervised learning, there would be lots of things we could do better, because the idea is that what we want is models that can understand the world around us. This is what unsupervised learning is about: making sense of the relationships between the things we observe. And in particular, what I claim is that it would be crucial for generalizing better and doing tasks from fewer examples. Multitask learning gives us a bit of that, but unsupervised learning, better unsupervised learning, can open even more doors. So we already have some unsupervised learning, and we can use unsupervised learning as pre-training, but there is much more power behind it.
Okay, so auto-encoders map inputs to a reconstruction, going through some intermediate representation, and this representation is the thing we care about. The way we train them is that we ask them to reconstruct their input, and this is the traditional good old auto-encoder. Now, to prevent the auto-encoder from learning the identity, you can put a bottleneck here, you can make this a small number of units, or you can introduce some regularizer. You can stack these things as we do with RBMs. I'm going to skip that.
Now, when you look at how you train them, you can think of the reconstruction error as a log-likelihood: you have a representation H of your input X, and you are trying to reconstruct X, and all of the ways we train them, like cross-entropy and squared error, can basically be thought of as maximizing the probability of reconstructing the right input, the original input, given the representation. So we're going to get these conditional probabilities, and it turns out they are useful because we're going to be sampling from them. Now, to understand what they do, let me talk a little bit about manifold learning. I mentioned this, that in many AI tasks the data concentrate in a very, very tiny volume that has structure. In vision it's pretty obvious: if you take a random configuration of pixels, you get white noise, but the configurations of pixels that give rise to natural images are actually a tiny, tiny, tiny subset of all the possible configurations. And those tiny subsets are organized in a way that makes sense. For example, Patrice Simard, who is sitting right over there in the back, studied many years ago the notion of tangent planes to these manifolds and how we could take advantage of them for doing better classification. And actually the idea behind auto-encoders is that we're going to be able to learn these tangent planes, we're going to learn the manifold structure.
And now let me explain why this is going to work. So again I have the same cartoon of training data, and we're going to think about what a regularized auto-encoder is going to do with it. There are two things happening, two pressures on the auto-encoder while it's training. One is the regularization, and there are various ways of regularizing, but you can think of regularization as basically trying to learn a function that is as constant as possible. If you set all the weights to zero, the function is constant, right? So we have some kind of regularization that is going to try to make the representation and the reconstruction constant. On the other hand, it wants to minimize reconstruction error, so it can't throw away all the information.
So let's consider a particular point and think about the representation that is learned in the auto-encoder as we move that point in the direction of the manifold. Now, in order to be able to distinguish that point from that other example, we want to make sure that the reconstruction function and the representation will be sensitive to changes in this direction; otherwise, if we map these two things to the same place, we will have a reconstruction error, right, one of these two or both will be slightly wrong. So the auto-encoder is forced to have a representation that is sensitive to changes in this direction. On the other hand, if you consider that direction, orthogonal to the manifold, it can be thrown away. We don't need it to get good reconstruction, and since we have a regularizer that tries to push everything towards a constant, it's going to just remove that information. It's the same thing as in PCA, right: if you think about doing a little PCA locally here, that's exactly what you would be doing. You would keep the leading directions of variation and you would throw away the directions that you don't see in the data. The only difference is that at different places on your manifold you will find different directions of variation that matter, and these are the tangent planes that Patrice Simard worked on a few years ago. In his case, those tangent planes were crafted based on prior knowledge of translation, rotation and so on. And here we can learn them. So what you can do is actually take an example and look at those directions of variation that the model thinks are likely, according to the auto-encoder, by just looking at the leading directions of variation or the hidden units that respond the most to the particular input. And what you can see is that it learns that images can be translated, you know, some pieces can be moved a little bit, and the way to think about what this means is that if you take an example and you add a little bit of one of these guys, or some linear combination of them, you get something that the model thinks also has high probability. So it's doing exactly what I was talking about at the beginning. It's filling in the holes between the examples in order to guess the structure of the manifold. Okay, you can do that with other digits, and you can do that with PCA and it fails miserably, blah, blah, blah.
Now, let me talk about the denoising auto-encoder. So we're getting closer to the main thing I want to talk about. It's a variation on auto-encoders which works a lot better than regular auto-encoders, and works better than or about the same as RBMs as far as learning representations. And the way it works is extremely simple. Instead of just reconstructing the original input, we give the auto-encoder a corrupted, stochastically corrupted input, and we try to reconstruct the clean input. So we add Gaussian noise, or we set some inputs to zero like in dropout, and we train the model so that it denoises, so that it guesses the clean thing from the corrupted thing. That's very simple, it's just a slight change in the training criterion, and now suddenly we have guarantees that this will actually capture the distribution of the input, and I'll tell you more about that.
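To make the training criterion concrete, here is a minimal numpy sketch of one denoising auto-encoder update on a binary input vector. The tied weights, sigmoid units, masking noise and learning rate are illustrative choices, not the exact setup of the original papers.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 784, 500
W = rng.normal(0, 0.01, (n_in, n_hid))   # tied encoder/decoder weights
b = np.zeros(n_hid)                      # hidden biases
c = np.zeros(n_in)                       # reconstruction biases

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dae_step(x, lr=0.1, p_corrupt=0.3):
    """One stochastic gradient step: corrupt x, reconstruct, penalize error against the CLEAN x."""
    global W, b, c
    x_tilde = x * (rng.random(x.shape) >= p_corrupt)     # corruption: zero out some inputs
    h = sigmoid(x_tilde @ W + b)                         # representation of the corrupted input
    r = sigmoid(h @ W.T + c)                             # mean of the reconstruction distribution
    loss = -np.sum(x * np.log(r + 1e-9) + (1 - x) * np.log(1 - r + 1e-9))
    dout = r - x                                         # gradient at the output pre-activation
    dh_pre = (dout @ W) * h * (1 - h)                    # backprop into the hidden pre-activation
    W -= lr * (np.outer(dout, h) + np.outer(x_tilde, dh_pre))   # two terms because weights are tied
    c -= lr * dout
    b -= lr * dh_pre
    return loss
```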
There are also advantages compared to RBMs. We can, for example, measure the training criterion, which is something you can't do with RBMs. So it's much more convenient for training, and again it's just backprop; there's no MCMC or variational approximation that's needed. This little bit of visualization here is to help understand what is going on. So again I have a manifold near which the data, which are these crosses, concentrate. Imagine doing a Gaussian corruption: that's going to move my clean example away, you know, isotropically, and that means in general it's going to move away from the manifold, because in high dimension most of the directions point away from the manifold, and then the model is going to learn that from any of these corrupted places it has to move back towards the manifold. So it's going to learn this vector field. Let me show you a vector field here. So you can do the experiment. The data lie on this 1-D manifold in 2-D. You train the denoising auto-encoder and it learns to point towards the manifold. So what does it mean? It means that if I start at one of these points, the reconstruction is going to be in that direction; it's going to move towards the manifold. In general it's not going to be able to get there directly in one step, but it actually gets there pretty quickly.
So this was the beginning of the understanding of what auto-encoders do. The old understanding I had, which, you know, made a lot of sense, is that if you think about the reconstruction log-likelihood, or the squared error, what the auto-encoder does is carve holes in that function so that you can have small reconstruction error at the training examples, or large reconstruction probability at these examples. It was never clear, you know, how it would manage to have high reconstruction error in other places, but the regularizer clearly is important, and now we actually have answers to this.
The first answer came this year, when, actually, I'll, do I have this coming, yes. So let me go directly to this. We had a theorem this year, which follows up on earlier work by Pascal Vincent a couple of years ago, that says the following: suppose you minimize the denoising squared reconstruction error. So I have an input, I add noise to it, I learn a function R, and I minimize the squared reconstruction error on average, in expectation. Then the difference between the reconstruction and the input, which is these little vectors pointing towards the manifold, actually estimates the score of the true density, meaning the derivative of the log-density with respect to the input. So in words, these vectors point in the direction of increasing probability. Now, unfortunately, this theorem is a bit limited, because it only works for the case of continuous input, Gaussian corruption and squared reconstruction error, and the worst thing is that the theorem says we recover an estimator of the score only in the limit of the noise going to zero. So this is nice theoretically, but probably useless in practice. However, we have a new theorem, unpublished but on arXiv where you can look it up, which basically generalizes this completely and is based on the idea of associating a Markov chain to the denoising auto-encoder.
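As I understand it, the limiting statement can be written out like this (my notation, assuming Gaussian corruption with standard deviation sigma and expected squared reconstruction error):

```latex
r^{*} = \arg\min_{r}\;
  \mathbb{E}_{x \sim p(x)}\,
  \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\, \sigma^{2} I)}
  \big[\, \lVert r(\tilde{x}) - x \rVert^{2} \,\big],
\qquad
\frac{r^{*}(x) - x}{\sigma^{2}}
  \;\xrightarrow[\;\sigma \to 0\;]{}\;
  \frac{\partial \log p(x)}{\partial x},
```

i.e. the reconstruction-minus-input vectors estimate the score and point towards higher probability.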
>>: So how does it compare with RBM now?
>> Yoshua Bengio: Right, so I'll tell you about some experiments we've done to compare with RBMs. But these are very preliminary; this is less than two months old work. I had the idea at the ICLR conference at the beginning of May. Two weeks later I got the math figured out, one week later the code was written, and one week later it was the NIPS deadline and two papers were sent. So we didn't have a lot of time to play with this. Like the good old days, Patrice, right?
But that's what happens, I think when you have a good idea, things just tumble and results just come quickly.
>>: So one big difference is you don't have a denominator.
>> Yoshua Bengio: That's right, there's no partition function. There's no partition function.
>>: They solve that problem by doing contrastive divergence, so is this equivalent to some approximation?
>> Yoshua Bengio: There's no approximation as you'll see. Well, we'll talk more about that. So let's talk about this Markov chain. So I'm going to define the Markov chain which goes like this.
We're going to have two sources of uncertainty or noise coming into the system. One is the corruption that I was talking about earlier: we take an example and corrupt it stochastically. So we have some fixed process here that takes the input and, say, adds Gaussian noise or sets some inputs to zero, but the important point is that it has to destroy information. Destroying information means that from this corrupted thing I can't recover the original exactly; there's going to be uncertainty about where it came from. The second part is that we're going to have this auto-encoder, and it's going to output a distribution that tells me where it thinks the original clean input was. It's not just a point now, I'm going to have a whole distribution. To understand that, I'm going to show you this picture.
So imagine that the data are these crosses, and I have an example X here. I corrupt it and I get this X tilde. Now the denoising auto-encoder sees this. It doesn't know that it came from there, but it is trying to guess that it came from there, and it's going to be trained, for this particular example, to predict high probability for this point. But it could have equally come from any of these guys with high probability. So it's going to learn a conditional distribution over where the clean input was, and typically here it's going to be Gaussian-like; it's going to be very unimodal and local. So this is the kind of thing it's going to be trying to do: produce a probability distribution for the clean input given the corrupted input. Okay, so we're going to take that distribution and sample from it. That's going to be the second step of the Markov chain, I mean the second component of it. And then we can repeat: corrupt stochastically, sample from the reconstruction distribution, corrupt, sample, blah, blah, blah, and that's the Markov chain.
>>: [Indiscernible], is there any connection?
>> Yoshua Bengio: The connection is that it's a Markov chain. So, the really important central result, which allows us to do the more fancy things that I'll be telling you about later, is this theorem that says that denoising auto-encoders are consistent estimators of the data-generating distribution, through this Markov chain. And the estimator is the stationary distribution of the Markov chain. So, in traditional graphical models we write down an equation for the probability function, and from this we can derive the Markov chain for sampling from it, or for doing all kinds of things. Here we're doing something different. We're learning a transition operator for the Markov chain. In other words, it's a machine that takes the current state of the Markov chain and stochastically produces a sample for the next step of the Markov chain. And it turns out we have a recipe for training that transition operator, and the recipe is extremely simple, and if we follow the recipe, which is just the denoising reconstruction error, we get a consistent estimator of the data-generating distribution. Now, it's consistent in the sense that if the denoising auto-encoder has enough expressive power, so that the family of stationary distributions it can represent includes the true distribution, then we will get it, assuming again that we optimize properly. The usual things you get in machine learning. And the crux of the theorem is that we're going to be learning this denoising distribution that takes the corrupted input back to the clean input, and the way it's trained is basically to match this conditional distribution, which is the truth that comes from the data, right. The data give us pairs of X and X tilde, and we learn this conditional distribution so that it matches that one. It turns out that once you match this, or as you approach it, the stationary distribution of the Markov chain, that is the distribution of samples you get if you run the chain long enough, will converge to the P of X of the data, the true data-generating distribution, assuming of course that this one can do a good job. In other words, as this one does a good job of matching that, this one does a good job of matching the data distribution. Another way to think about it is that the denoising auto-encoder maps a distribution, which is what it sees before the corruption, to one that is closer to the data distribution, because it's always trying to produce something that looks like the data.
>>: You can do it for any function.
>> Yoshua Bengio: It doesn't work. So the theorem has conditions, and the main condition is that the Markov chain converges, and for the Markov chain to converge you need noise: you need that this distribution, this one or this one, has entropy, so that it is impossible to predict exactly what the source was. So if you don't have any corruption, then you can easily learn the identity function, which is the usual case where auto-encoders don't work.
>>: Does the theorem give you insight as to what kind of noise and the amount of noise you need to make [inaudible].
>> Yoshua Bengio: The theorem doesn't give you that insight, but I can give you that insight.
>>: [laughter]
>> Yoshua Bengio: Right, yes.
>>: It seems like this was to solve the MCMC problem of getting through the valley of low probability, right?
>> Yoshua Bengio: No, this was to get rid of the problem of requiring MCMC or variational inference, or both, in the middle of training. Actually, I was just motivated to try to understand what these auto-encoders learn. I'll show you later, if I have time, how this can be used to generate even more interesting and complicated distributions when we go to deeper models. Yes.
>>: So, like a sensitivity question. If my denoising distribution is only close to the true conditional, how far off is the stationary distribution?
>> Yoshua Bengio: I didn't understand the beginning of the question. The corruption distribution could be anything so long as it puts entropy in its reconstruction.
>>: Right, but let's say I estimate that denoising distribution.
>> Yoshua Bengio: Yes, yes, yes.
>>: So that it's only close to the truth---
>> Yoshua Bengio: So, the theorem will tell you the effect on that. Right, so I didn't do that analysis but if you look at the math you should be able to get some kind of bounds on the error you get between this and this, given the error between this and this. Yes.
>>: [Indiscernible]
>> Yoshua Bengio: Oh, the more overcomplete, usually the better.
>>: Do you think they found it to be not ---
>> Yoshua Bengio: It depends what your input is. I mean, if your input is like very, very high dimensional already.
>>: It's small, it's very, very small. Like 5 by 5 patches.
>> Yoshua Bengio: Then overcomplete usually works better. Now, it depends on how much data you have. Think of it, it's just like a regular neural net; it's just that what it's learning is to denoise. So the number of hidden units only depends on the ratio of, you know, how much data you have and how much complexity you need to actually do that job of capturing those conditional distributions.
>>: What I want to ask is that, you know, in general when you [indiscernible] in audio or computer vision representations you can just find the work or stuff like that. But using denoising auto-encoders on some 5 x 5 or 7 x 7 patches, when I learn [indiscernible], I have not seen anything like that. So do you think that sparsity has a role to play, or do you think that the thing I was looking at, which doesn't make any visual sense [indiscernible], was still a good thing?
>> Yoshua Bengio: So there are many reasons why your experiments could have not given you what you expected. First, the right thing may not be [indiscernible]. Second, it could be that your optimization didn't really do a good job, and we have a lot of experience where, you know, depending on how you choose your learning rates and initializations, it could make a huge difference. There are lots of issues, you know, that are in the way of understanding what could have happened in this case.
>>: But in your experience did you see Gabor-like filters or something like that when you use denoising auto-encoders?
>> Yoshua Bengio: You can. You get more Gabor-like filters with the contractive auto-encoder, which is another sort of beast which is related. It seems to give nicer filters, and we don't completely understand why. It turns out there is a nice mathematical relationship between the two types of models. But, yeah, we don't understand everything.
Okay, so very quickly now, I'm going to skip this. So you can generalize this theorem to another one which includes in the state of the Markov chain not just a vector representing the data, but other things, like the hidden layers in these deep recurrent stochastic networks. So we're going to create a Markov chain where at each step we start from the previous visible vector, the previous image say, and some previous state of the hidden units. And from this we're going to go to the next state, where we're going to have not just another image but some other internal vector that is also going to be a random variable depending on the other things. And in this case we use the same computational graph, at least in structure, as the Boltzmann machine, where we have a number of layers and each layer receives as input the outputs of the layer above and the layer below, and sends its outputs to the layer above and below at the next time step. There's noise injected everywhere in this thing, not just the usual corruption that you have in the denoising auto-encoders. So, the reason I think it works: the theorem says, you know, you can add noise wherever you want, but in practice adding noise, especially at the top level, is going to make the Markov chain mix better, and I'll say more about that later.
>>: So this reminds me, there are some MCMC-type methods where you have these sorts of auxiliary variables that you let change. This kind of reminds me of that. Did you look at how it [inaudible].
>> Yoshua Bengio: No, I didn't. That's an interesting connection.
>>: Yes, except those don't have the noise I don't think.
>> Yoshua Bengio: Well, if they are stochastic, I mean, the noise basically means these are random variables, okay. They are not deterministic functions of the previous value, they are stochastic functions. So basically the state variable is like this and this, and from one step to the next we are updating them stochastically, and we have a machine that will go from one step to the next step, and this machine has parameters, weight matrices, biases and so on, and the great thing is that the theorem I was talking to you about before, in its generalization for this case, tells you how you can train that transition operator. Basically you train it as a denoising auto-encoder. At each step, if you start a sequence at a training example, you run the chain a few steps, basically a number proportional to the depth, so that you have at least enough steps to go up and down at least once or twice. And then for each of these probabilistically constructed steps, you ask the system to produce the original clean input with high probability. You just do maximum likelihood on this thing.
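Here is a sketch of that recipe for the shallow case, continuing the earlier numpy denoising auto-encoder example: start the chain at a training example, run a few noisy steps, and at every step push the reconstruction distribution to put probability on the original clean input. The deep version described above does the same thing with stochastic hidden layers, which is not shown here.

```python
def walkback_step(x_clean, n_steps=3, lr=0.1, p_corrupt=0.3):
    """Run the chain a few steps from x_clean and reconstruct the clean input at each step."""
    global W, b, c
    x, total = x_clean.copy(), 0.0
    for _ in range(n_steps):
        x_tilde = x * (rng.random(x.shape) >= p_corrupt)   # corrupt the current state
        h = sigmoid(x_tilde @ W + b)
        r = sigmoid(h @ W.T + c)
        # maximize the probability of the ORIGINAL clean example under the reconstruction
        total += -np.sum(x_clean * np.log(r + 1e-9) + (1 - x_clean) * np.log(1 - r + 1e-9))
        dout = r - x_clean
        dh_pre = (dout @ W) * h * (1 - h)
        W -= lr * (np.outer(dout, h) + np.outer(x_tilde, dh_pre))
        b -= lr * dh_pre
        c -= lr * dout
        x = (rng.random(r.shape) < r).astype(float)        # move to the next state of the chain
    return total
```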
>>: So it looks like this is very different from earlier work you did.
>> Yoshua Bengio: This is very different from everything that has been done before. It's a completely different way of doing probabilistic models.
>>: I see, okay. So the old way of constructing each one is to have a denoising auto-encoder and then take the output to get the input and then ---
>> Yoshua Bengio: This is very different; in fact one of the great things is that unlike Boltzmann machines, where typically you need to pre-train each layer as an RBM, this can be trained from scratch as a deep model, and the main reason is that we're just doing backprop. We don't have this ugly, very noisy and poor estimator of the gradient coming from both the variational approximation and the MCMC step that are needed in the Boltzmann machine to estimate the gradients.
>>: So there's no more layer by layer, you know construction.
>> Yoshua Bengio: You don't need that. At least in our experiments it worked right away.
Okay, so here are some experiments on toy data to show that, you know, you just train a denoising auto-encoder and then you apply this recipe for sampling from it, and if these are the samples from the data, you recover samples that look very much like it. This actually was in 10-D with a 1-D manifold, so you can look at other projections onto two dimensions, and from all angles it's doing a good job. So that's artificial data; it's kind of, you know, a check that the math, the theorem, actually does what it should. You can also do it with a deeper architecture, so that it's like the second theorem, and you get better samples. So this is trained on MNIST, and these are samples that we get from the shallow version, which is just like this, right, just like the regular denoising auto-encoder, and these are coming from the model that has multiple levels.
And I'll show you some numbers later, but basically the samples we get here are about the same quality as those you get from an RBM, and the samples you get from this guy are better than those that we get from the Boltzmann machine. I don't claim that they are substantially better, and visually I don't see any difference. But in the quantitative experiments, so we've run experiments where we try to evaluate the quality of the samples using a procedure that was introduced a few years ago, where we generate say 10,000 samples from the models, consecutive samples, and then we use them as training examples for a non-parametric density estimator. And then we use that density estimator to score the test data, and we get a log-likelihood. And so, when you don't have a formula for getting the likelihood, this is a cheap alternative, but I don't like it very much because I don't think it's going to scale to very high dimensions, but for now that's one thing we know how to do. So the numbers here are log-likelihoods, so you want them to be large. This is the single denoising auto-encoder, so it's like minus 150, and the RBM gets minus 240, so this one is slightly better, and this is the deeper model, which does a bit better than the Boltzmann machine or the deep belief net, which are two well-known alternatives for doing things like that.
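That evaluation is essentially a Parzen-window estimate. A minimal sketch of it, with scikit-learn's kernel density estimator standing in for the non-parametric model; the data, dimensions and bandwidth here are placeholders:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def parzen_log_likelihood(generated, test, bandwidth=0.2):
    """Fit a Gaussian Parzen window (KDE) to generated samples and score
    held-out test data with it; a larger mean log-likelihood is better."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(generated)
    return kde.score_samples(test).mean()

# e.g. 10,000 consecutive samples from the chain versus the test set
# (random stand-ins here; the bandwidth is normally tuned on a validation set).
rng = np.random.default_rng(0)
print(parzen_log_likelihood(rng.random((10000, 50)), rng.random((500, 50))))
```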
There's another little corollary of these theorems that says that once you have trained this machine, which you can train on fully observed data or data with missing values by allowing the missing values to be free rather than clamped during training, you can also sample conditional distributions. So if you just clamp some of the variables, the usual way it's done in Boltzmann machines, and you let the Markov chain resample only the ones that are free, you get samples of the conditional distribution. And so you can try this in practice. Again we did it with MNIST; remember, this was done in only one week's time. So what you do is, you clamp the right-hand side of these images and you initialize the left-hand side with white noise, and then you run one step of the stochastic chain I was talking about. You can see right away it's quickly moving toward something that looks like a digit, and after two steps it's right on a digit. In the case where there's uncertainty about what the left-hand side should be, like this seven which may be a nine, it actually goes between nines and sevens, sampling from a nice posterior. Same thing for this three which could have been an eight, so it does what we would expect.
We've tried it on slightly bigger images, a bit more complicated than the digits which are face images and although I can't show you here this compares favorably with previous work we've done with deep belief nets for that data.
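Here is a minimal sketch of that clamping idea, again reusing the hypothetical corrupt and reconstruct_probs helpers from the training sketch earlier; the mask layout and step count are illustrative:

```python
def inpaint(x_obs, observed, n_steps=10, rng=np.random.default_rng(3)):
    """Sample the free pixels given the clamped ones: run the chain, but after
    every step reset the observed coordinates to their clamped values."""
    x_t = np.where(observed, x_obs, (rng.random(x_obs.shape) < 0.5).astype(float))
    for _ in range(n_steps):
        p = reconstruct_probs(corrupt(x_t))
        x_t = (rng.random(p.shape) < p).astype(float)
        x_t = np.where(observed, x_obs, x_t)   # clamp the observed part back
    return x_t

# e.g. clamp the right half of a 28x28 digit and let the chain fill in the left half.
observed = np.zeros((28, 28), dtype=bool)
observed[:, 14:] = True
observed = observed.reshape(1, -1)
x_obs = (np.random.default_rng(4).random((1, n_vis)) < 0.5).astype(float)  # stand-in digit
sample = inpaint(x_obs, observed)
```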
Okay, so I'm going to do a little bit of publicity for a paper that contains all kinds of practical recommendations for gradient-based training of deep architectures, where we talk about some of the issues that were raised here, like setting learning rates, dealing with different hyperparameters, and more importantly how to search for hyperparameter configurations. There's been a lot of progress in that area recently. There's another paper that I'm going to do publicity for, which is this looking-forward paper, "Deep Learning of Representations: Looking Forward". It's not really a review paper; it's about what I see as some of the main challenges for deep learning. And I'm going to quickly go through some of them; in particular I'm going to come back to this question of approximate inference and sampling, which is what my main contribution today is about.
So computational scaling. I think for practical industrial applications, this is where we can have our biggest bang for the buck quickly. We've seen a lot of the progress in applications come from just being able to train bigger models on bigger data sets. And to go forward without having to wait another 10 years for faster machines, we need new tricks to train bigger models as quickly as we can now train smaller ones. And I have some ideas for doing that.
There are two main ideas. One is conditional computation. The idea of conditional computation is that in traditional neural nets, for any given example, we actually visit all the parameters of the model, and that's very, very expensive. If you look at something such as a decision tree, for any example we only look at on the order of the log of the number of examples, those on the path from the root to a leaf. So we need to do something like that in neural nets to make them really faster, so that for a given example we only need to visit a very small subset of the parameters. In this way the computation can be much smaller, and we can still have really big models, which we need for AI. So there are different recipes for doing that, but some of the ingredients are sparsity, things like the rectifiers which we talked about before, and multiplicative connections, so that when a unit is off it can turn off a whole bunch of other units. Another ingredient is distributed training, and there's been a lot of work recently at Google including, I mean, I actually started this many years ago in this JMLR paper I mentioned earlier, where you can exploit the idea of asynchronous stochastic gradient descent, but there's more that needs to be done because you have to realize that the training procedure for these nets is essentially sequential and it's very hard to parallelize.
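A toy sketch of the conditional-computation idea, not anything from the talk itself: a cheap gater picks a small subset of hidden units per example, so only the corresponding weight columns are touched (here the gater is itself dense, which a real system would also have to avoid, for instance with a tree-structured or sparse gater):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, k_active = 512, 4096, 128   # only k_active of n_hid units fire per example

W = 0.01 * rng.standard_normal((n_in, n_hid))
W_gate = 0.01 * rng.standard_normal((n_in, n_hid))   # hypothetical gater parameters

def conditional_layer(x):
    # The gater scores all units; a real system would make this step itself
    # sparse or hierarchical (decision-tree-like) to stay sub-linear.
    scores = x @ W_gate
    active = np.argpartition(scores, -k_active)[-k_active:]   # indices of active units
    h = np.maximum(0.0, x @ W[:, active])                     # rectifier; only those columns touched
    return active, h

x = rng.standard_normal(n_in)
idx, h = conditional_layer(x)
print(idx.shape, h.shape)   # (128,), (128,)
```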
Another important element is optimization and underfitting; we've been talking about that a lot. One interesting result related to this is by my student Yann Dauphin, who is actually doing an internship at Microsoft right now, where we show that as you train bigger and bigger nets, so this is the number of hidden units, they are less and less able to use the extra capacity to knock off training errors. So this curve shows what we call a marginal utility, which you can just think of as the number of training errors knocked off by the addition of one hidden unit. So per hidden unit, how many examples can you knock off? And when you go below one, it means that you are doing worse than a stupid algorithm that takes one of the training errors, dedicates one hidden unit to it, and just outputs the right thing there. So something really bad is going on that we have to understand.
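The quantity itself is simple to compute. A minimal sketch, with made-up numbers purely to show the computation, not the actual results:

```python
import numpy as np

def marginal_utility(hidden_sizes, train_errors):
    """Training errors knocked off per added hidden unit, between successive
    model sizes; below 1.0 an extra unit removes less than one error, i.e.
    worse than dedicating one unit to memorizing one residual example."""
    h = np.asarray(hidden_sizes, dtype=float)
    e = np.asarray(train_errors, dtype=float)
    return -np.diff(e) / np.diff(h)

# Made-up numbers purely to show the computation:
print(marginal_utility([500, 1000, 2000, 4000], [4000, 2600, 1800, 1500]))
# -> [2.8, 0.8, 0.15]: the larger nets fall below one error per added unit.
```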
So, approximate inference and sampling, that's the main thing that I've been talking about today. Burn-in, the time it takes to go from a random pattern to a highly probable one, is one of the challenges of MCMC, and the nice thing with these denoising criteria is that the model is trained to burn in quickly. But the most important thing really is mixing, I've talked about this earlier, and especially mixing between modes. So I've shown that picture before. At the last ICML we had a paper showing that if you sample from high-level representations, higher levels like in a deep belief net or these stacks of auto-encoders, you can mix much faster between classes. And we've done a bit of visualization to try to understand that, and I think it's really interesting to consider that for a second and think about the geometry.
So here we have two examples, a nine and a three. By the way, this work was done in great part by Grégoire Mesnil, who is also an intern at Microsoft this summer. And here we're doing a linear interpolation in pixel space. You see the intermediate images are visibly adding up, you know, the nine and the three, this image and that image, and you get something in the middle that doesn't look like anything and that shouldn't have high probability. On the other hand, if you do the linear interpolation not in pixel space but at the level of the first layer or the second layer, you get these other images, and what you see is that they look much more like natural examples. So what's going on is that in the high-dimensional space, when you move on a straight line between examples, that trajectory, which is now going to be nonlinear in input space, stays close to those manifolds of high probability. So I've drawn these manifolds like this, like these colored regions. So what's happening? One way to think about it is that in the original pixel space the manifolds are very thin regions, and when you move to these high-level representation spaces, those regions that were very thin become much larger, and in fact on the convex combination between any two points, even between points of different classes, most of the points in between happen to be likely points, whereas here in the original input space, even if you take two threes and you take a linear interpolation, you might get some garbage.
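A minimal sketch of the two interpolation schemes, with a stand-in one-layer encoder and decoder in place of the trained deep model; everything here is a placeholder, just to show where the interpolation happens:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical trained encoder/decoder weights (random stand-ins).
n_vis, n_hid = 784, 256
W = 0.01 * rng.standard_normal((n_vis, n_hid))
b_h, b_v = np.zeros(n_hid), np.zeros(n_vis)

encode = lambda x: sigmoid(x @ W + b_h)
decode = lambda h: sigmoid(h @ W.T + b_v)

def interpolate(x_a, x_b, n=9, in_representation=True):
    """Linear interpolation either in pixel space or at the hidden layer;
    the claim is that the latter stays close to the data manifold."""
    alphas = np.linspace(0.0, 1.0, n)[:, None]
    if in_representation:
        h_a, h_b = encode(x_a), encode(x_b)
        return decode((1 - alphas) * h_a + alphas * h_b)
    return (1 - alphas) * x_a + alphas * x_b

x_nine, x_three = rng.random(n_vis), rng.random(n_vis)   # stand-ins for a nine and a three
frames = interpolate(x_nine, x_three)                    # decoded images along the hidden-space line
```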
Okay, so that's cool, and that justifies, you know, having depth in the representation even when you want to do unsupervised learning in [indiscernible] models.
So, inference and sampling. There's a problem with that. I have a love-hate relationship with latent variables. They are really appealing for explaining what's going on, but the inference problem is really a problem. The reason it's a problem is that the way we deal with these latent variables is we use approximations like MAP or variational methods or MCMC, and all of these approximations basically will only work if the number of probable explanations, values of the latent variables H, is small. Here we are assuming there's only one dominant mode; here we're assuming that there are basically independent factors; and here there is an implicit assumption that the number of modes is small enough that by running a short chain we can recover them. And all of these assumptions can fail.
So here's an example where it would fail. Let's say someone tells you a sentence in a foreign language that you don't understand very well, and you have to answer maybe yes or no. It turns out, if you do a bit of counting, that if there is a lot of uncertainty, like a hundred words are possible explanations for each of the words you heard, then the number of highly probable explanations for that sound you just heard could be huge, and could be much, much larger than the number of modes you could hope to visit with MCMC. So how do we get around that? Well, with deep supervised learners, we just ignore the latent variables. We don't try to sum over them; we just directly learn a function from X to Y. So these GSNs that I was talking about, they are interesting because they skip these latent variables. They directly learn to sample; they basically generalize what we usually do with neural nets. They don't have any partition function, they don't need any partition function gradient, and there is no iterative inference, no hopelessly approximate inference in the training loop or when you use the model. You don't need an MCMC to mix in the middle of the training loop; there is an MCMC to sample from the model, but that's different. They can be trained by plain backprop, so training is very fast. The training criterion can be monitored, unlike with RBMs and Boltzmann machines. Mixing is fast because we have these high-level representations. The burn-in is fast because it's trained to do so. There is a clear probabilistic justification from the theorems I was mentioning, which allows us to sample and answer questions. The experiments, as I said, worked really quickly, which is very encouraging, and I think most importantly, why this is important is that although there has been a lot of progress with supervised learning recently, unsupervised learning has kind of stalled and we haven't seen, you know, very big applications, and I think one of the reasons is we didn't have the right tools, and this may be an opportunity for exploring a very different avenue that could have a large impact.
So, I'm essentially done. I want to mention the International Conference on Learning Representations, ICLR, which was created this year by Yann LeCun and me and has been a huge success, sort of a rallying point for people interested in deep learning. We also have regular deep learning workshops at NIPS and ICML, every year now for the last few years. And it's an area that's very exciting. People at the conference have a lot of drive and energy, but there is a lot of research to do. I mean, we are far from having completely understood everything. And what's exciting in a new field like this is that questions and answers come one after the other very quickly, because there are just many avenues which can be very fruitful. So I mentioned some of those challenges: computation, optimization, inference. I didn't talk much about the disentangling of the factors in our representations, but the visualizations I showed at the end, I think, were going in that direction. One element I didn't talk about, which I think is very important for AI, is incorporating the ability to reason and to deal with incrementally added facts, as we would need for AI. Thank you very much.
>>: [applause]
>>: You talked about representation. Do you have [inaudible] of a good representation?
>> Yoshua Bengio: Of good representation?
>>: Yeah, can you tell, if I give you a representation, can you tell ---
>> Yoshua Bengio: No, I don't. I have lots of intuitions about what's a good representation, but I don't have a mathematical formalization, and if I had one I could just, you know, use that as a training criterion, but I don't. I wish I had; I've been thinking a lot about it.
>>: Well, I have a stupid question that I should've asked a lot earlier. Could you go back and show the recipe that generates these patterns, the distribution, then. Can you actually show the simplicity of the code, so that we can actually see it?
>> Yoshua Bengio: Okay, so you understand how we go through the graph and it computes, you know, unit outputs as usual. It's like a recurrent net, but I add Gaussian noise before and after the activation in these experiments. All right, so I get this graph, but the weights are reused, so this layer is the same as that layer; it has different inputs but the same parameters.
>>: So you train this on what particular pattern?
>> Yoshua Bengio: During training, yes, I present a training example here and send it through the graph. At each step, say, imagine these are images, so I'm going to have a probability for each pixel here, so the sigmoid output, the usual thing, which you can think of as trying to guess what the reconstruction should be, but actually it's a probability, so you can sample zeros and ones from there.
>>: But then you're going to take a gradient from the first pattern you presented.
>> Yoshua Bengio: Right, so I'm going to use that as a target for these sigmoid outputs, and I'm going to backprop through the same thing at every step, that's it. Very, very simple.
>>: You could have taken your closest pattern.
>> Yoshua Bengio: You could, and that's a reasonable thing that I've been thinking about, but then you need to do nearest-neighbor kind of stuff, and it turns out this is much simpler. But yeah, it's a good line of thought.
>>: You know, so minibatches, the choice of the minibatch size and how they're chosen seems to be important. Does it matter in practice?
>> Yoshua Bengio: It matters only from a numerical point of view, for getting speed-ups; I don't think it matters much otherwise.
>>: [Indiscernible] having all kinds of problems, but you know. And I haven't seen the minibatch worked out, and actually, you know, it seems that maybe there's a way that I can choose ---
>> Yoshua Bengio: Sixty-four, power of two on GPUs.
>>: But would you think that the way that you, how fast you, you know sort of ---
>> Yoshua Bengio: Oh, well, maybe, we haven't explored that. I think some dose of active learning or curriculum learning would eventually help a lot here, but I don't think it's been explored much.
>>: Yes, so going back to, several years after your student invented this denoising [indiscernible].
>> Yoshua Bengio: Yes, there are a number of papers along the way. It's not a one-day thing.
>>: But now, do you have a good understanding as to, if you use this new type of learning to initialize, pre-training ---
>> Yoshua Bengio: Yeah, you can do that. You can do that.
>>: You think it's better than using a DBN to do ---
>> Yoshua Bengio: I don't know. I think what matters to me is that it's going to learn a better model of the data. How you can exploit that, we'll have to play and explore. But what matters in the end is to do a better job of capturing the true data-generating distribution. The better you can do that, the better you can answer any question about your data.
>>: Suppose you have like infinite amount of data, large amount of data.
>> Yoshua Bengio: Yes.
>>: So then do you have to ---
>> Yoshua Bengio: But I don't have infinite amount of computation. That's the problem.
>>: Suppose you have both of them.
>> Yoshua Bengio: No, I can't do that.
>>: [laughter]
>> Yoshua Bengio: My prior is that I have a finite amount of computation. But okay, let me entertain your question.
>>: So then for your final task of classification, would you go for pre-training and then train the supervised one?
>> Yoshua Bengio: No, no, no. None of these things. That would just be pure Bayesian inference if I had an infinite amount of data and computation. Just an infinite amount of computation, period.
>>: So with the power of unsupervised representations, this stuff, you can combine data from many places, and data that in a sense may not be relevant for this particular task can be helpful. For example, you mentioned the joining of data; however, most of the challenges have a particular data set and people are really just ---
>> Yoshua Bengio: Yeah, that's just our culture. It's because we're not trying to face AI, we're trying to solve a particular industrial problem, and we have blinders on like this, and, okay, okay, we have to give results in six months or even less in industry, and you know, we don't have time to solve all the problems, right. But that's wrong. I mean, it's good for industry because we have to deliver, but for academia, if you want to think long-term, really the right thing to do is to think: okay, we have a machine that needs to understand all the things we understand. What does it need to see as examples?
>>: Yes, so if you look at, for a generative model, you can pretend the label to be something so that the label can become part of the data.
>> Yoshua Bengio: Sure.
>>: And for each you can do, you know unsupervised---
>> Yoshua Bengio: Yes, exactly, joint training of X and Y.
>>: Correct, so do you think that paradigm in this new framework will work better than DBNs?
>> Yoshua Bengio: That's what I hope. That's what I hope.
>>: Sounds terrible when you use DBNs.
>> Yoshua Bengio: Yeah, but there's nothing that allows you to inject a grain of preference for some of the variables being predicted right compared to others, and that's what, for example, we've done with RBMs, with Hugo Larochelle a few years ago, where we learned both the joint of X and Y and the conditional, but in a sense, in that joint we give more weight to the conditional of Y given X. And you can crank that up and it becomes more discriminant, and that works better than either just the joint or the pure conditional.
>>: Yeah, I know that works, but ---
>> Yoshua Bengio: So we can do the same thing here.
>>: Okay, but that has never done better than using, you know, this one to pre-train, you know, the training of the neural network. Do you have an explanation for that?
>> Yoshua Bengio: I don't understand your question. In our experiments it was doing a lot better, but we're not talking about the same experiments, that's all. I think Patrice was ---
>>: Yes, so I was trying to think about the noise, and at first I thought, well, instead of having random noise on every pixel I could maybe convolve it with a Gaussian and now it's going to be correlated, but then I realized if you do that too much, then you get closer to the manifold.
>> Yoshua Bengio: That's right, right thinking.
>>: It's helping more, right?
>> Yoshua Bengio: No, it hurts actually. We've tried it. So, the idea is you want the noise to destroy information and you want it to be ideally orthogonal to the manifold. So I'll see if I have this picture here.
>>: But it seems grossly inefficient as well. So there may be something in between.
>> Yoshua Bengio: Well, I can say it's a fascinating mystery that somehow destroying information helps learning. It's like, you're forcing yourself, it's not just generalization, it's you're forcing yourself into a corner by destroying information but then trying to recover it. I don't know, it may be counterintuitive, but it works, it has mathematical foundations and it has sort of a nice geometry. So the geometry is this: if I were to make my noise just move along the manifold, then actually what would happen is nothing, because sometimes I would move left, sometimes I would move right, and on average the gradients would cancel out. It's kind of useless. What you really want is to move orthogonally to the manifold, because then you learn where to go in order to get higher probability, and on average you actually learn to point, from where you are, orthogonally back onto the manifold, and that works, right.
>>: The point is that the orthogonal space is so much bigger than the manifold.
>> Yoshua Bengio: The what?
>>: The orthogonal space is so much bigger than the manifold.
>> Yoshua Bengio: That's right, that's right. So if you just think of isotropic noise, most of the time it will move you orthogonally to the manifold, and that's why it works. Exactly.
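That geometric claim is easy to check numerically. A small sketch under the illustrative assumption of a 1-D manifold (a circle) embedded in 100 dimensions, showing that isotropic noise is almost entirely orthogonal to the manifold:

```python
import numpy as np

rng = np.random.default_rng(0)
D, n, sigma = 100, 2000, 0.5

# A 1-D manifold (a circle) embedded in D dimensions: x(t) = (cos t, sin t, 0, ..., 0).
t = rng.uniform(0, 2 * np.pi, n)
x = np.zeros((n, D))
x[:, 0], x[:, 1] = np.cos(t), np.sin(t)

# Unit tangent direction of the manifold at each point.
tangent = np.zeros((n, D))
tangent[:, 0], tangent[:, 1] = -np.sin(t), np.cos(t)

# Isotropic corruption noise, split into along-manifold and orthogonal parts.
eps = sigma * rng.standard_normal((n, D))
along = np.sum(eps * tangent, axis=1, keepdims=True) * tangent
ortho = eps - along

print(np.mean(np.linalg.norm(along, axis=1)),   # small: on the order of sigma
      np.mean(np.linalg.norm(ortho, axis=1)))   # large: about sigma * sqrt(D - 1)
```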
>>: [indiscernible] would be very small, right? If not you noise.
>> Yoshua Bengio: No, no, no. On the contrary. What I said is that an earlier paper showed, what's going on here, okay, an earlier paper showed that we can prove that the denoising auto-encoder estimates something important about the density. It allows us to recover the density in the special case where the noise is small. It doesn't say that it's better to have the noise small; it says that in that case we have an estimator, and that result has been superseded by the more recent stuff I was talking about, which says that if you do this Markov chain, then I don't need to assume anything about the noise being small; in fact it's better if the noise is not small. In experiments it's better if the noise is not small. It would have to be pretty large for the model to learn efficiently; otherwise it would take forever to learn, and you can see why it wouldn't work, right. Because if we go back to this picture, if the amount of noise is very small, it would only learn the structure of this locally, but it wouldn't know that when you're here, the density should be small. In fact if you do that you'll see it will learn spurious modes; in other words it will think that in some weird place, having to do with the parametrization, the probability should be high. And so, to clean up the spurious modes of the density, you need the noise to be large enough that once in a while you go to these places far from the data and you learn that there shouldn't be probability mass in those regions.
>>: So it looks like there's a connection between this model and dropout ---
>> Yoshua Bengio: Yes, yes, there's a direct connection in the sense that the training procedure looks like training deep nets with dropout, except it's now kind of recurrent, and it learns a different thing, which is this Markov chain that produces samples from the right distribution.
>>: So do you use the same noise during training as you do when you're actually running the chain at test time?
>> Yoshua Bengio: Yeah, you do have to use the same corruption process.
>>: Because I'm thinking about Patrice's comment, that if you do the training with the orthogonal noise so that it learns to shrink back to the manifold, then during test you actually kind of move more along the manifold and it pushes you back. Could it be a more efficient way to move along the manifold?
>> Yoshua Bengio: I don't know. The theorem tells me I should be using the same corruption process. Now, I think there's a case where you don't want to use the noise: it's when you actually want to use it for, like, classification. So if I learn the joint between X and Y and I want to predict Y, the clean thing to do would be to run a Markov chain for a long time, sample the Y, and see what's the probability of the Y's. But because I'm busy, I'm just going to do like in dropout, which is, instead of sampling lots of configurations, I'm just going to do kind of a mean-field thing: remove the noise and look at the output probabilities. But that's just a heuristic that I'm thinking, you know, would be useful to do.
>>: I have a question about the rectified linear function. I remember that Yann LeCun used another rectifier ---
>> Yoshua Bengio: Yes.
>>: The absolute value function. It doesn't have a flat part, so is the flat ---
>> Yoshua Bengio: Yeah, the flat thing is important I think.
>>: So is it important?
>> Yoshua Bengio: Yes.
>>: But they have gotten very good results with the absolute value function as well.
>> Yoshua Bengio: Right.
>>: Would you publish those results?
>> Yoshua Bengio: Right. Yes, they also tried all kinds of things. For example they use this thing and this thing, so the rectifier is this thing, and you know, I think these are interesting as well. I don't have a strong opinion about which is better, but yeah, there is something going on with the sharp nonlinearities that seems to be useful, and we don't fully understand everything.
>> Li Deng: Okay, so thank you very much.
>>: [applause]
>>: Thank you very much. It was a wonderful, wonderful lecture.
>> Yoshua Bengio: You're welcome.