>> Dengyong Zhou: It's my pleasure to introduce Razvan Pascanu, who's a fourth-year PhD student at the University of Montréal in Yoshua Bengio's lab and has done a lot of great work in recurrent neural nets and deep learning in general. Razvan.

>> Razvan Pascanu: Thank you, and good morning everybody. The topic today will be On the Difficulty of Training Deep Models. I have one slide to introduce myself as well, ha, ha. My name is Razvan. I'm doing an internship right now at Microsoft, but I'm a PhD student with Yoshua. Our lab is pretty big, and most of the work I'll present here is a mixture of ideas that came from many people, so it's not just me behind all of this.

The outline of the talk: I have four different points that I'll try to go through. Most probably I will not end up going through all of them, because there are quite a few slides, so we'll just stop when the time expires. The first thing will be a quick overview of deep learning, and I'll not go into details; I'll assume most people know a bit about deep learning and what it's all about. Then I'll go to one specific topic. The two topics in the middle are things I've worked on during my PhD, and I'm still working on them. The first one is training recurrent neural nets, and these are recent results that will appear at ICML this year. The second topic is other learning algorithms, specifically second-order methods and things similar to second-order methods, and whether we can use them for deep learning. This is also recent work: I had a workshop poster at ICLR about it, and I have a submission at NIPS about it that I don't know what's going to happen with. Lastly, if there's time, we'll talk a bit about Theano, which is how we scale things up in our lab. I'm a [indiscernible], so I know a few things about how Theano works internally. If we have time we can go over that.

This slide is actually taken from Yoshua, and it's one definition of deep learning. First of all, Yoshua likes to call it deep representation learning and not just deep learning. The idea is that when you're doing deep learning, what you're learning is multiple levels of representation, each of them basically preprocessing the data, finding a better encoding of it. One thing we're looking for, for example, is this disentangling of the factors of variation that are in the input data. So the last representation, the one you get on the top layer, is a representation that is very useful, or very easy to use, with normal classifiers like SVMs or logistic [indiscernible], because that's usually what you have on top of your deep network. One thing to notice in this definition is that pre-training doesn't appear. Pre-training is not really part of it anymore; deep learning started with this pre-training followed by fine-tuning of deep models, but that's not what people do now most of the time. They just do supervised fine-tuning of the whole model.

So this is a typical deep model; again, a slide that I took from Yoshua. You have the input, which is, or is usually referred to as, the bottom layer, where you put the data in, in a raw format usually: for an image you just put in the pixel colors and so on. Some pre-processing usually helps, depending on the task.
But the hope is that for a really well-trained deep model you can just put the data in the way you have it, without pre-processing. Then you learn these multiple layers, all of them jointly at one point, and you learn these different representations of the data that are better and better disentangled versions of your input. The idea is that you get this final representation at the end, on which you can train a classifier, if that's the task you're after, or a regressor or whatever. While this is the generic scheme, there are many variants of models and of learning rules out there.

One observation I'll make is that the idea of having deep models is not really new. What people want to get from deep learning is some sort of efficiency: the efficiency comes from the fact that a deep model is able to represent, in a more compact way, distributions that a shallow model could not. This is not really a new idea. So the question is, what's new now about deep learning, or why is deep learning working now when it didn't work before? There are multiple reasons, and I've listed two that come to my mind.

One of them is just the kind of computation that we can now put into our models. We have libraries like Theano or Torch that make it easy for people who are not really experienced in writing fast code to get models that run fairly fast on things like GPUs or clusters. We have consortia; for example, our lab has SHARCNET and the Quebec version of it, which basically give us clusters of GPUs and CPUs. Our lab has access to GPU clusters, each of them having around fifty GPUs, and then at least four CPU clusters where you can easily get a few hundred jobs running in parallel to explore your hyper-parameters. We can run things from a few days to two weeks, which is what we need to do. Another thing is that big companies like Microsoft or Baidu or Google are starting to look at this deep learning strategy, and they're kind of changing the game, because they have access to a lot more data than universities usually do, and to a lot more computational power than universities usually do. So they're able to train models for longer on larger data, and this is a key component to getting deep learning to work well.

But computational power alone is not enough to explain why deep learning works now, and a second reason, I think, is that we've kind of figured out the root reasons behind some of the difficulties you have when you try to train these deep models. We have some intuitions now, and you can use those intuitions to better initialize your model, or better set up your model, so it can train successfully.

One of these intuitions is unsupervised pre-training. This is where everything started. The idea behind unsupervised pre-training is that learning P of X can be quite useful for learning P of Y given X; there is structure that is shared between P of X and P of Y given X. The difference between the two is that when you're trying to learn P of X, when you're doing the unsupervised part, you can do that layer-wise. You can train each layer separately, and when you're training each layer separately you don't have a deep model anymore, you have a shallow model, for which some of these difficulties just disappear, because certain difficulties come from the depth. I'll talk a bit more about it on the next slide. So this is where everything started.
This started in two thousand six, and the people behind it were Geoff Hinton, Yoshua Bengio, and Yann LeCun. They were the ones who started publishing about pre-training and deep learning, and since then it has spread.

A more specific problem with deep learning is weight initialization. Specifically, if you look at a deep model with small weights, and you look at the gradients as they're coming from the cost down towards the input, they get smaller and smaller. If you look at the formula for the gradients, you basically get a W transpose multiplied in there for every layer. The idea is that if that W is made out of small numbers, every time you multiply by that W your gradient just gets smaller, shrinks down. If you have a deep model, like a six-layer model, there's a good chance that by the time you get to the bottom layer the gradients are almost zero. So it's really hard to learn the first layer, and without learning the first layer you have no chance to learn the entire model, because you need to learn the first layer in order to get a representation there that can be used by the second layer, and so on, to get to a good result.

These are some observations that have been made. One of the first people who started working on this was Xavier from our lab; he had a paper about this with Yoshua. But Ilya also talks about it in his PhD thesis, and he has an ICML submission this year about properly initializing recurrent neural nets, where he shows that initializing recurrent neural nets properly helps a lot with some of the difficulties that come with training these models. When I get to my approach to training recurrent neural nets I'll talk a bit more about this issue. A person people don't usually think about when they talk about weight initialization is Herbert Jaeger. He works on Echo State Networks, which are a special kind of recurrent neural net, and he's the first one, as far as I know, who came up with the formula that Ilya has been using to initialize his recurrent neural nets. Ilya actually cites Herbert Jaeger in his PhD thesis, and I think in his paper as well. The famous initialization is: you try to have your largest eigenvalue be one, and you also want your weight matrix to be sparse. The large eigenvalue helps you avoid the effect of the gradients vanishing, because having large eigenvalues means that when you multiply some random vector by that weight matrix, things will not go as quickly to zero.

Another problem with deep learning is the activation function. The usual activation functions are sigmoids and [indiscernible], and every time you compute a gradient and you go through one of these activation functions the gradient just gets smaller. You just have to look at the formula; that's just the way the formula is set up for the sigmoid function. Whenever you have to go through a non-linearity the gradient just gets smaller. One solution for this is to use piecewise linear functions, and the most famous one is rectifier units. Rectifier units just take the max between the incoming value and zero, so it's max of zero and x. Vinod and Xavier were the first people who started talking about this.
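To make the two effects described above concrete, gradients shrinking layer by layer through small weights and saturating units versus rectifiers letting the signal through, here is a minimal NumPy sketch. The depth, widths, and weight scale are illustrative assumptions, not numbers from the talk:

```python
import numpy as np

def backprop_norm(activation, n_layers=6, n_hidden=100, weight_scale=0.1, seed=0):
    """Norm of an error signal pushed backwards through a stack of layers."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_hidden)
    weights, acts = [], []
    for _ in range(n_layers):                            # forward pass
        W = rng.normal(0.0, weight_scale, size=(n_hidden, n_hidden))
        pre = W @ x
        x = np.maximum(0.0, pre) if activation == "relu" else 1.0 / (1.0 + np.exp(-pre))
        weights.append(W)
        acts.append(x)
    grad = rng.normal(size=n_hidden)                      # gradient arriving at the top layer
    for W, a in zip(reversed(weights), reversed(acts)):   # backward pass
        local = (a > 0).astype(float) if activation == "relu" else a * (1.0 - a)
        grad = W.T @ (local * grad)                       # the W transpose and the activation derivative
    return np.linalg.norm(grad)

print(backprop_norm("sigmoid"))  # much smaller: the signal shrinks at every sigmoid layer
print(backprop_norm("relu"))     # orders of magnitude larger with rectifiers
```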
It turns out that choosing rectifiers helps with the problem of the gradient shrinking, because the function is linear when you go backwards; it's just piecewise linear, and the zero part of your rectifier unit is sufficient non-linearity to get something different from a simple linear model. Ian Goodfellow has a paper at ICML this year where he proposes maxout, which is an extension of rectifier units, and he has some experiments showing that it works; at least it seems to work better than rectifiers. Just a note: Alex Krizhevsky and Ilya's results on ImageNet basically use rectifiers as one key ingredient in their huge convolutional net. It seems that if you just use sigmoids it doesn't work as well. Especially dropout, which I'm not going to talk about in this presentation, works much, much better with rectifiers and not as well with sigmoids and other things. So there seem to be some correlations there.

So the first part of the talk will be about recurrent neural networks, and this is joint work with Tomas Mikolov and Yoshua Bengio. As I said, this is a submission I have at ICML this year. By the way, if you have questions just feel free to ask at any point.

The way this work started is that Tomas came to visit our lab, and I guess you guys know he has the state of the art on language modeling. One thing we wanted to know when he got there was: how did he get those results? Why did nobody else get those results with recurrent neural nets? Recurrent neural nets have been around for a long time; what was his secret sauce in getting those results compared to other people?

First of all, what's a recurrent neural net? I have a schematic here. You can think of a recurrent neural net as a feed-forward network where you have cycles: you basically have connections between your hidden units. Through those connections the recurrent neural net is able to exhibit memory, and for that reason it is a very compact and theoretically powerful model that should be very well suited to modeling time series. The model itself has been around since the eighties, but it never got really successful.

How would you train such a model? Again, since the eighties there are a bunch of algorithms to train recurrent neural nets, but the one that is the most popular, and probably the most successful, is just backpropagation through time. The idea of backpropagation through time is that you take the recurrent model and you unroll it into a deep model, a possibly infinitely deep model, and then you just apply backpropagation to this deep model. You can see here, these are the thick lines, this is the forward pass, this is how you go forward in time. You start at some initial state h zero, you get the first input and the first prediction, and you just go forward and forward and forward. This constructs a deep model that reaches as deep as the length of the sequence you're going over. Then, once you've computed the forward pass, you can go backwards and compute the gradients in the same way you do for a normal deep model. So this is just a rewrite of the gradients that you get for a recurrent net: what I've done here is I wrote the gradient as a sum of products.
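Written out (up to notation), the sum-of-products form of the BPTT gradient being described on the slide, with h_t the hidden state, C_t the cost at step t, theta the parameters, and the plus superscript denoting the "immediate" derivative that treats the earlier state as a constant:

$$
\frac{\partial C}{\partial \theta} \;=\; \sum_{t}\,\sum_{k \le t} \frac{\partial C_t}{\partial h_t}\,\frac{\partial h_t}{\partial h_k}\,\frac{\partial^{+} h_k}{\partial \theta},
\qquad
\frac{\partial h_t}{\partial h_k} \;=\; \prod_{i=k+1}^{t}\frac{\partial h_i}{\partial h_{i-1}}.
$$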
You can see that, first of all, this sum goes over the cost at each time step, because you have a cost at each time step, so you sum over those gradients. Then, for the cost at each time step, you have different contributions: how the state at time t minus k affects the cost at time t. This factor, this Jacobian that goes from the state at time t minus k to the state at time t, transports the error signal in time from t minus k to t. We usually call the components for which t minus k is far away from t the long-term components, and when t minus k is close to t we say those are the short-term components.

The key observation here is to look at this factor that transports the error in time. These factors, in their turn, are just a product of Jacobians. Yoshua made an important observation in ninety-four, in a paper where he said that this product of matrices behaves exactly the way a product of numbers would. If all the numbers you're multiplying are smaller than one, their product goes quickly to zero; if all the numbers are larger than one, their product gets larger and larger. The same thing happens here. If all these Jacobians have their spectral radius, their largest eigenvalue, smaller than one, then when you multiply them they just shrink, they go to zero: any vector multiplied through that product of matrices will go to zero. That's what's usually called the vanishing gradient problem. The vanishing gradient problem says that, because these products go to zero, the gradient between events that happen far apart in time is almost zero, so you're not able to learn the correlations between events that are far away in time. You can only learn correlations between events that are close in time; basically you just use your short-term memory, which kind of defeats the point of a recurrent net, because you want the recurrent net precisely to capture this temporal correlation between events. And when I say short term: in practice, if you use sigmoids and the standard initialization and so on, short term means something like less than ten steps. If you're able to see only ten steps back in time, then it's probably much more efficient and useful to just use a time window of ten frames and a feed-forward model, because you're not able to see anything beyond those ten frames anyway; you're just implementing a much more complicated model with a much more expensive learning rule for no reason.

So that's the vanishing gradient problem, and that's one side of the coin. The other side of the coin is when things explode. That can also happen, and it's called the exploding gradient problem. What you see is that during training, at one point, your gradients just spike and go to really large values. Basically learning fails at that point: once you take this large step, with values like ten to the nine or so, you end up in a place from which you cannot recover, learning has failed, and you have to start training all over. So that is a short description of the problems.

Now, this is a different view, one that Doya proposed in a paper in nineteen ninety-two and that we kind of extended here, of the exploding gradient problem. Basically he started to use some tools, some intuition, from dynamical systems theory to describe what's going on.
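Before the dynamical-systems view, a small NumPy illustration of the product-of-matrices argument above. The Jacobian product is stood in for by repeated multiplication with a single matrix (a real RNN Jacobian also carries the derivative of the non-linearity, so this is only the intuition, not the exact quantity from the slide):

```python
import numpy as np

def product_norm(largest_eigenvalue, n_hidden=50, n_steps=100, seed=0):
    """Norm of the n_steps-fold product of one matrix rescaled to a target largest eigenvalue."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n_hidden, n_hidden))
    W *= largest_eigenvalue / np.max(np.abs(np.linalg.eigvals(W)))
    P = np.eye(n_hidden)
    for _ in range(n_steps):
        P = W @ P          # the same factor multiplied over and over, as in BPTT
    return np.linalg.norm(P)

print(product_norm(0.8))   # largest eigenvalue below one: the product shrinks toward zero (vanishing)
print(product_norm(1.2))   # largest eigenvalue above one: the product blows up (exploding)
```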
So first of all, let me try to explain the plot. What we have here is a bifurcation diagram. We have a single-unit recurrent neural net, a recurrent net that's trying to learn something trivial; I don't remember exactly what, I should have written it somewhere, but it doesn't matter. The key point is that, in order to easily and simply use dynamical systems theory, at least the basic dynamical systems theory we are familiar with to some extent, we need this model to not have an input, because most of the theory for dynamical systems is for autonomous systems. When you have an input-driven dynamical system everything becomes more complicated and it's much harder to analyze. So this is a recurrent neural net that doesn't have any input.

What happens is, on this axis you're changing the bias, and this is the value of the state after you go an infinite number of steps, so this is basically the point where you converge. So, with w the weight, the formula looks something like x t equals sigmoid of x t minus one times zero point five, plus b, I think. The idea is that if b is minus three, no matter where you start you'll converge to this value. The dark line here shows the point where you converge after running an infinite number of steps. The gray arrows show the direction in which you move in time: if you start somewhere here, the arrow shows that you'll go down, down, until you reach the convergence point. Basically what you can see is that no matter where you start, and the starting point is x zero, so no matter where you start with x zero on the real numbers, you'll always converge to this point, and as you change b the position of your single point attractor changes, until you reach this point, which I call b two, and this is a bifurcation boundary. What happens at the bifurcation boundary is that you suddenly have two fixed-point attractors, one down at the bottom here and one up above here. Depending on where you start in your domain, you end up at one or the other. The dotted thick line here is an unstable attractor, so the idea is that if you start exactly on it you'll stay on this dotted line, but if you start nearby you'll converge to one of the point attractors.

The observation Doya made was that if you are near a bifurcation boundary and you make a small change in your parameters, which is what you normally do when you're training recurrent nets, you can cross the boundary. What happens is that if you were, for example, here, you were converging down here, and then you take a small step and now you're converging up here. So you suddenly converge to a different point than the one you were converging to before, and there is a sudden change in your cost. That sudden change in your cost will lead to a large gradient, possibly a very large gradient. So that's the idea Doya had in his paper. We tried to extend this and say that it's not just about crossing bifurcation boundaries: it's sufficient to cross boundaries between basins of attraction. In our case, this dotted line here is just the boundary between basins of attraction. If you're on this side you converge here; if you're on this side you converge up there. So it's sufficient to cross this boundary to get this effect of the gradients exploding.
This observation is important because here you have just two points where you have a bifurcation boundary, but you have an infinite number of points where you have a boundary between basins of attraction. So there is also another intuition we're building here: what happens when the gradients explode? The gradients explode when you are in this rich regime where you have multiple types of attractors. Attractors describe different behaviors of your network, and you're crossing between these behaviors of your network by crossing either some bifurcation or some boundary between basins of attraction.

Okay, but the problem is that this view was for autonomous recurrent nets that have no input. So how can we carry this intuition over to the neural recurrent nets we actually have, which are input-driven, and for which you cannot just use the previous intuition directly? One way to do it: you have the normal recurrent net, which is just this map that you apply recurrently. You can think of it as having a different map at each point in time, where the maps are indexed by the input, if you will. Then you can describe the exploding gradients something like this: you make a small change, and then all your maps suddenly point in the opposite direction from where they were pointing before. While this is accurate, it doesn't really let you use the ideas you had before and it doesn't really tell you much; it's just another way of expressing the same problem. It almost feels like you're moving in a circle.

But what you can do is rewrite the recurrent neural net slightly differently. This is an equivalent formulation of a recurrent net: we first apply the non-linearity to the previous hidden state and then do the linear part; we just reversed the order. This recurrent net has the same computational power as the previous one; they're essentially the same model, just rewritten differently. If you do this, then you can decompose each step into two components, one being a constant map, which we call F bar here, and the other being the input part, the variable one. What this tells you is that there is a map you can analyze using the dynamical systems perspective, and it's F bar. If you study this map, then when you're close to the bifurcation boundaries of F bar, or when you're between basins of attraction of F bar, you know those are regions where the gradients are likely to explode. All you can say is "likely", because you also have the input maps, which can oppose it: for example, if you cross here, the input might just oppose it, might be much stronger, and push you back into the previous regime you were in, so you don't see the bifurcation anymore. But it's likely that around the bifurcation boundaries and similar structures of F bar you can see the exploding gradient problem.

One note about the vanishing gradient: the way to understand the vanishing gradient from this perspective is that you see it when your model converges.
So the idea is that if, in, I don't know, one hundred steps, or however many steps you're running the recurrent net, the state of the recurrent net gets very close to the convergence state, it has converged to the point attractor or whatever kind of attractor is there, then the position where it started doesn't matter, right, because you end up at the same convergence point regardless of where you started. So there is no gradient flowing from the starting position to the cost you get at the end. That's how the vanishing gradient shows up, just in a different notation.

Now, based on these intuitions, what we tried to get at is this: what we think happens when the gradients explode is that the curvature changes quickly. This is just another interpretation of what we're thinking; we basically have these different views of the exploding gradient problem because we started from that one. This is just another way of looking at the problem, looking at it geometrically. While we don't have a proof for this, we have simulations, and this is one of those simulations on a toy model, a one-hidden-unit auto-encoder, a recurrent net that does fifty steps. If you look at the error surface of this model, you can see these kinds of steep walls, and these steep walls are where the gradient explodes. What happens is you're moving slowly, slowly, slowly, you reach this high-curvature region, you step into it, and then you get this huge step telling you to go far away.

From this view, specifically, we can kind of figure out a solution, what to do. Our hypothesis is that we have these steep walls, but we also have this valley around them where the curvature is not as bad. So what we want to do is: if the gradient is large, we want to ignore the norm of the gradient and just take a small step in that direction. Because what the gradient will do, when you are at the curvature wall, is be perpendicular to the wall and try to push you backwards, and that's something you do want, you just don't want to go too far, because you don't want to jump out of the valley you are in while you're looking for a solution. So what you do is ignore the norm and just take a small step.

This solution of clipping gradients, which I have on the next slide, is basically the solution Tomas used in his code when he got the state-of-the-art results on language modeling. He had this basically hard-coded into his library. I think in his original paper he doesn't even mention it, but he does mention it in his PhD thesis and talks a bit about it. The clipping strategy he used is slightly different from the one we propose. We started from his clipping strategy, which was: instead of clipping the norm, you just clip the gradients element-wise. You have the gradient, which is a matrix; you look at each entry of the matrix, and if it's larger than some threshold you just fix it to that threshold. So we started from that, and we were trying to look for a theoretical justification of what he was doing, because the way it looked, we didn't know exactly what that clipping was doing. And basically we came up with this story.
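A minimal sketch of the two clipping variants discussed here: rescaling the whole gradient when its norm exceeds a threshold, and the element-wise clipping from Mikolov's code. The threshold values are illustrative assumptions:

```python
import numpy as np

def clip_by_norm(grad, threshold=1.0):
    """Norm clipping: keep the direction, ignore the (possibly huge) norm."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

def clip_elementwise(grad, threshold=1.0):
    """Element-wise clipping: cap each entry of the gradient at plus or minus threshold."""
    return np.clip(grad, -threshold, threshold)

g = np.array([3.0, -40.0, 0.5])
print(clip_by_norm(g))        # rescaled so its norm equals the threshold
print(clip_elementwise(g))    # [ 1. -1.  0.5]
```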
We came up with the geometrical interpretation story I just gave on the previous slide, where the norm clipping means you're just trying to ignore the norm when you're at those high-curvature walls. Both clipping strategies, Tomas' and the one we propose here, work equally well. In fact, in some experiments, once the gradients explode, instead of assuming that learning failed you just take a small random step, and that also works. Basically what happens is you take a random step, and with high probability you'll end up in a lower-curvature region, and then SGD can take it from there again, until you get to the wall again and take another small step. That's how it goes, basically. So this is one component: the norm clipping, based on our intuition, is the solution we propose for the exploding gradient.

Then the question is, what can we do about the vanishing gradient? The answer is to use a regularization term. There are previous solutions that address the vanishing gradient problem. One of them is the LSTM. Another is a recent result that seems to suggest that Hessian-Free, which is a particular second-order method, might also kind of solve the vanishing gradient problem to a certain extent. Our take is this: think of your recurrent net as having a certain distance, and it can only see information up to that distance. What we want to do is increase that distance; we want to look further and further into the past. Whenever you try to do that, you will have input sequences for which the inputs far away in the past actually carry no information useful for the cost at time t. So you'd basically get an increase in the error, because no matter what, the model needs to figure out for which sequences it is useful to look that far back in time, and which inputs it is useful to ignore. A recurrent net has a bounded amount of memory, so it needs to be able not only to memorize things but also to ignore and to forget things, and it can only do this through learning. So our take is: try to increase this distance over which the network can look; when you do this you actually increase your error, and then you just need to wait for the network to learn to ignore the inputs that are not relevant and use the ones that are, and so actually get a better error.

The way you do this is you add a regularization term, because if the regularization weight is high, it doesn't matter that the error increases as long as the regularization term goes down; it depends on the ratio between the two terms when you sum them in the cost. The regularization term we propose is a very simple one: we basically ask the model for exactly what we want. You want the gradients not to vanish as they go back in time, as you're doing backpropagation through time. So you want the norm of the gradient up to state k plus one, multiplied by the Jacobian that transports it to state k, to be the same as, or close to, the norm of the gradient at state k plus one. What this means is that now we need to compute gradients through the gradients of a recurrent neural net, which in itself feels like something pretty crazy.
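The regularizer just described, written out from that description (up to notation), with C the cost and h_k the hidden state at step k; the ratio of the transported gradient norm to the original gradient norm is pushed toward one:

$$
\Omega \;=\; \sum_{k}\left(\frac{\left\lVert \dfrac{\partial C}{\partial h_{k+1}}\,\dfrac{\partial h_{k+1}}{\partial h_{k}} \right\rVert}{\left\lVert \dfrac{\partial C}{\partial h_{k+1}} \right\rVert} \;-\; 1\right)^{2}
$$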
You can do this, and we've done experiments with this exact formula, if you use something like Theano that automatically computes the gradients for you. Otherwise you'll probably get the gradients wrong, because the gradients themselves already look pretty complicated, and if you try to compute gradients of them by hand, unless you're really good at it, you're bound to make an error. But you can slightly simplify the formula, and this is what we do here. When you look at the previous formula, you ignore that this is a function of [indiscernible] and you ignore that this is a function of [indiscernible]; you take them as constants, and you consider this Jacobian to be a function of [indiscernible] only. You're basically throwing paths away: if you look at the computational graph of your gradients, your gradient comes from different paths, and you chop off the paths that are too complicated to compute and keep the short ones that are easy to implement and compute. That's basically what we do here. Then you get this formula, which still looks ugly but is actually much, much easier to implement, and it's cheap to compute. There are more details in the paper, which I don't think is up on arXiv right now, but it should be very soon; I'll put it up very soon. There's code for this available as well, which uses Theano. So this turns out to be cheap to compute, and not much more expensive than just doing normal SGD if you add this term. The gain is that you can force your network to look further and further into the past.

Just to mention, there are other things that seem to help, other ways around this problem that people have proposed in the past. One of them is LSTMs; LSTM is short for long short-term memory networks. They were proposed in ninety-one by [indiscernible], and these models look specifically at the vanishing gradient problem only. Their solution is very much an engineering solution: they just change the structure of the recurrent net so that you always have a path through which the gradients do not vanish. They insert a special memory unit, with an input gate and an output gate that protect the information inside the unit, and the unit itself is linear, so information doesn't vanish in time. These units seem to do pretty well in practice. Specifically, Alex Graves has some great results using LSTMs on a bunch of tasks, online handwriting recognition among them, and I think recently he's been working with LSTMs on speech.

Another approach is the Echo State Networks I mentioned before. They're not as popular. The solution of Echo State Networks is: just do not learn the recurrent weights at all, because you cannot learn them properly anyway. You randomly initialize your recurrent weights, you randomly initialize your input weights, you leave them random, and you only train your output weights. If you look at the gradients for the output weights there's no issue there, and the key to getting an Echo State Network to work properly is how you initialize those recurrent weights and input weights.
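A minimal sketch of the Echo State Network recipe just described: fixed random recurrent and input weights, rescaled to a spectral radius below one, with only the output weights trained, here by ridge regression on the collected states. The task, sizes, scales, and solver are illustrative assumptions, not the setups from the literature:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_steps = 1, 200, 500

# Fixed random weights; only the readout below is learned.
W_in = rng.normal(0.0, 0.5, size=(n_hidden, n_in))
W_rec = rng.normal(size=(n_hidden, n_hidden))
W_rec *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_rec)))   # spectral radius around 0.9

# Toy task: predict the input delayed by five steps.
u = rng.normal(size=(n_steps, n_in))
target = np.roll(u[:, 0], 5)

# Run the reservoir and collect the hidden states.
h = np.zeros(n_hidden)
states = np.zeros((n_steps, n_hidden))
for t in range(n_steps):
    h = np.tanh(W_rec @ h + W_in @ u[t])
    states[t] = h

# Train only the output weights (ridge regression on the states).
lam = 1e-2
W_out = np.linalg.solve(states.T @ states + lam * np.eye(n_hidden), states.T @ target)
pred = states @ W_out
print(np.mean((pred - target) ** 2))
```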
That's why I'm saying that Herbert Jaeger was the first one who actually looked at this initialization, because initialization is extremely important in Echo State Networks. They have different formulas for how to do it depending on the task: you basically look at your task, figure out how much memory you need and how complicated the task is, and depending on these qualitative things you decide from which distribution you're going to sample your weights.

L1 and L2 norms help with the exploding gradient in practice. If you look at the dynamical systems perspective on the exploding gradient problem, the gradient explodes when you cross a bifurcation or a boundary between basins of attraction, and that can only happen when you have a rich regime, when you have multiple attractors. But that doesn't happen when you start training, because you start with this small weight matrix, and because of the small weight matrix you have only a single point attractor in your region. If you keep your weights small by using L1 and L2, you stay in that regime. Hessian-Free is a recently proposed second-order method; it seems to work to some extent on the vanishing gradient problem, and on the exploding gradient because that's a curvature issue. But there is one observation about Hessian-Free: at some point they need to add extra terms to the problem they're solving, to the cost, in order to deal with the exploding gradient problem, among other things. Truncated BPTT, truncated backpropagation through time, is just a way of making your computations faster: you ignore a bunch of the gradients because you know they're going to be small anyway.

Here are some results I've got with this. The most interesting results were on synthetic data. These are the tasks proposed by [indiscernible] to test the vanishing gradient problem. They're really hard problems to solve, but they probably don't have great practical importance. The temporal order task is: you have two symbols, A and B, and then you have some distractor symbols, C, D, E, F, whatever. You have a long sequence; at the beginning of the sequence you see one of the two symbols, either A or B, then you have some noise, and then somewhere in the middle you see another symbol. At the end you need to say in which order you saw the symbols: did you see AA, AB, BA, BB. So you have four classes and you need to predict this. The idea is that you have all this noise, and the relevant inputs are far apart in time, so what you need to do is memorize what you've seen at this point in time, go on until the middle of the sequence, see the second symbol, and then decide.

These are the kinds of tests on which Hessian-Free and the other algorithms are benchmarked to see how they deal with the vanishing gradient problem. What Hessian-Free previously managed to do was increase the success rate in solving this task. I don't have the plots here, but basically they couldn't solve the task with one hundred percent accuracy when you have sequences of up to two hundred steps. It turns out that if we add our regularization term and our gradient clipping strategy, we're able to solve this task consistently at one hundred percent for sequences of up to two hundred steps.
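As an aside, here is a toy generator consistent with the description of the temporal order task above: relevant symbols A and B at two positions, distractor symbols elsewhere, and a four-way label for the order. The exact positions, lengths, and alphabet are illustrative assumptions, not necessarily the setup used in these experiments:

```python
import numpy as np

def temporal_order_example(length=200, n_distractors=4, seed=None):
    """One sequence: two relevant symbols (A=0, B=1) far apart, the rest is noise."""
    rng = np.random.default_rng(seed)
    # Symbols 0 and 1 are the relevant ones (A and B); 2 and above are distractors.
    seq = rng.integers(2, 2 + n_distractors, size=length)
    pos1 = rng.integers(0, length // 10)                      # near the beginning
    pos2 = rng.integers(4 * length // 10, 6 * length // 10)   # near the middle
    first, second = rng.integers(0, 2), rng.integers(0, 2)
    seq[pos1], seq[pos2] = first, second
    label = 2 * first + second        # four classes: AA, AB, BA, BB
    return seq, label

seq, label = temporal_order_example(seed=0)
print(seq[:20], label)
```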
The three plots are different initializations of the model. This is the most adversarial initialization, where you have a sigmoid activation function and really small weights. But if you use something like a tanh with a slightly better initialization of the weights, which is what people usually use, or if you use something like the initialization that Ilya has been using, that Herbert Jaeger proposed, and that Hessian-Free also used, then you get a success rate of one hundred percent. What's more interesting than this is that if you train the model on sequences of up to two hundred fifty steps and then test it on sequences that are much longer, for this specific problem on sequences of up to five thousand steps, the network is still able to solve the task. What this means is that it learns to generalize: it actually learned to put the bit into memory and keep it there, somehow, for an almost unbounded amount of time. Because you go from fifty steps to keeping it in memory for two thousand steps, and that's a big jump.

We tried this on some natural tasks as well. There is some improvement compared to a vanilla recurrent net trained with SGD, but the state of the art for many of these tasks is held by other models. For the music tasks the state of the art is the RNN-RBM, which is a combination of a recurrent neural net with a probabilistic RBM output; that just does much better and we're not able to reach it. For the other task it didn't seem to make a big difference. I think the reason is that you need to be careful with that regularization term, because most of the time it actually makes you end up with worse solutions. You need to have a schedule for it, because forcing the network to see far into the past goes against minimizing the cost. It probably helps in the beginning, when you're trying to discover things, but at some point you need to shut the regularization term down somehow and let the model settle on something that minimizes the cost. That's at least my intuition, but I don't have a schedule strategy that works well in practice or anything like that. That's about all the results I have. Wow, and this took a while.

I'll talk briefly about this next topic as well, because I still have about half an hour. The other thing we've been trying to look at is: can we use other algorithms to train deep models besides [indiscernible] gradients? This is joint work with Guillaume, Aaron, and Yoshua from our lab. What we're looking for is maybe better convergence, but not only that; we're actually looking to better understand how you train deep models, for one thing, and for another, there seem to be certain limitations of SGD that people are starting to reach, and maybe we need to find solutions for those. The algorithm of choice we started working with is natural gradient, which was proposed by Amari. Amari uses an information-geometry description of the algorithm, which is usually really hard to parse, but you can derive it by just saying that at each step, when you're trying to minimize your cost, what you're trying to do is pick the descent direction such that the KL divergence between p theta and p theta plus delta theta is constant.
So p is the model you're trying to learn, your parametrized model p theta. You basically want to take constant-size steps, and the KL divergence is like a distance between different models. For different values of theta you get a different model, and you need a way to compare different models; if the models are nearby, you can compare them using the KL divergence between the two. Then you're saying that at every step I want to make a constant amount of progress in my model. If you start with this, and then you do a first-order approximation of the cost and a second-order approximation of the KL, you get the formula that Amari proposed. I think this is a much easier way of explaining this algorithm.

What we did is we applied natural gradient to Boltzmann machines. For Boltzmann machines it turns out that this metric, this quantity you need to compute here, is just the expectation of the Hessian of the log partition function, which turns out to be the covariance of the negative-phase gradients. This is an important difference, because many people, when they think of natural gradient, think it is the covariance of the gradients. I'm going to come back to that; it's a misunderstanding that has appeared in the literature in several places. There's even a paper that basically defines the algorithm as using the covariance of the gradients and names it natural gradient, which makes the confusion even bigger. It turns out that even if you use natural gradient you still need to use centering in order to get joint training of DBMs, because that's what we're after; centering is just a [indiscernible] trick. So it seems that for natural gradient, centering is still a big issue and you always need to take care of it. We have some intuition of what's going on here; this is mostly something Yann is working on, but we're not going to go over it right now. Oh, and this is just training on MNIST; these are some plots that Yann did. It turns out that our natural gradient, for now, is actually not faster than just using the normal [indiscernible], which is kind of annoying, but that's usually how it goes with second-order methods: it's really hard to get them to actually be faster in terms of time. In terms of iterations they're better, but in terms of time they're not.

You can extend this to neural networks. Very straightforwardly, what you need to do is have a probabilistic interpretation of your model and use the KL divergence between those probabilistic interpretations of the model, and then you get this formula, which looks a bit ugly but really isn't; it's basically the same thing. The idea here is that the probabilistic interpretation of a neural net is a conditional probability density function. What that means is that for every input value x you get a different family of probability density functions; for every x you basically get a different manifold. So what you do is take an expectation over all those x: you take the expected KL divergence, where you do the expectation over x, sampling x from your empirical distribution or something like this, and then you do the [indiscernible].
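Written out (up to notation), the step and the metric being described: the natural gradient rescales the ordinary gradient by the inverse of the Fisher-information metric, and for the conditional, neural-network case the inner expectation is taken under the model's own output distribution, with only the inputs drawn from the data:

$$
\Delta\theta \;\propto\; F(\theta)^{-1}\,\nabla_{\theta}C,
\qquad
F(\theta) \;=\; \mathbb{E}_{x \sim \tilde{p}(x)}\;\mathbb{E}_{t \sim p(t\mid x;\theta)}\!\left[\nabla_{\theta}\log p(t\mid x;\theta)\,\nabla_{\theta}\log p(t\mid x;\theta)^{\top}\right].
$$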
The difference here is that the metric results in an expectation which is over the model probability: it's over P of t given x, and it's not over your empirical distribution. This is a big difference, and I'll go over it once again, because this is something people usually get confused about. I've seen it in a few papers: they take the covariance of the gradients and they sometimes say that this is the natural gradient algorithm that Amari proposed. This is not actually true; Amari's formula is this one, and they turn out to be different, they just don't evaluate to the same thing. TONGA is an algorithm that justifies doing this. So that is a valid approach as well, and it is yet another learning rule, kind of like a second-order method without quite being one, and it works. Its justification is by saying that your gradients are distributed according to a normal distribution centered around the true gradient with some noise, and then you're trying to account for that noise, for that covariance in the noise. So that's the approach TONGA takes; it's a valid algorithm, but a different one from the natural gradient that Amari proposed.

As it turns out, the Hessian-Free algorithm and [indiscernible], if you guys are familiar with any of those, which are recently proposed second-order methods that seem to do very well on a bunch of tasks, turn out to actually be natural gradient. The reason is that they don't use the Hessian, even though they say they're second-order methods; they actually approximate the Hessian by the Gauss-Newton approximation of the Hessian. And it turns out that for the kinds of activation functions and error functions they use, and that people usually use, like sigmoid with cross-entropy, or softmax with negative log likelihood, or squared error with a linear activation, the Gauss-Newton approximation of the Hessian equals the natural gradient metric. From that perspective you can see those methods as doing natural gradient. This is a nice new contribution we had in our paper.

I'm going to skip over this plot and talk a bit about implementation. The way we implemented this method, we did the same thing people did for Hessian-Free, namely use a truncated Newton approach. What that means is that you write something like: your Hessian H times x equals your gradient. Then you use a linear solver to solve the system, and you get your x as the Hessian inverse times g. Then, to make things faster, you don't actually let the linear solver run to convergence; you just do a few steps of it, or something like this, which is what you end up doing as well. One key note is that when you're doing this natural gradient you have two expectations to compute; actually not these two, because this one you can compute analytically. You need to compute this expectation over x, and then you have another expectation when you compute the gradient, because you compute the gradient over mini-batches. Different from a second-order method, what you can justify doing here, and what's actually useful, is computing this expectation on different data, because what you're trying to do is reduce the bias you get from the specific set of examples you use. That bias can be quite big if you're trying to run this algorithm on small mini-batches.
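A minimal sketch of the truncated-Newton idea just described: solve metric-times-direction equals gradient with a few conjugate-gradient iterations rather than inverting the metric. The metric-vector product here is built from a stand-in matrix of model-sampled gradients; all names, sizes, and the damping value are illustrative assumptions, not the actual implementation from the paper:

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n_examples, n_params = 512, 100
# Stand-in for per-example gradients with targets sampled from the model,
# so that F is approximately (1/n) G^T G, an empirical estimate of the metric.
G = rng.normal(size=(n_examples, n_params))
grad = rng.normal(size=n_params)      # mini-batch gradient of the cost
damping = 1e-3                        # small damping, as truncated-Newton methods typically use

def metric_vec(v):
    # F v computed without ever forming F explicitly.
    return G.T @ (G @ v) / n_examples + damping * v

F_op = LinearOperator((n_params, n_params), matvec=metric_vec)

# Truncated Newton: run only a few CG iterations instead of solving exactly.
direction, _ = cg(F_op, grad, maxiter=20)
learning_rate = 0.1
# The parameter update would then be: theta -= learning_rate * direction
```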
One thing Hessian-Free needs to do is run on batches that are like half of your dataset, or something like this, because otherwise you get a lot of bias from the mini-batch you're looking at: you move too far in that direction, you're kind of overfitting the mini-batch you're looking at, at that moment in time, and you end up getting worse results in the end. If you estimate those expectations on different batches of data, you get better generalization and better training error, just because you're avoiding this overfitting problem. In these plots, the green line is when you're computing them, the gradient and the metric, on the same data. The blue one is when you're computing them on different data samples. The red one, which is doing better, is when you're using unlabeled data to compute the metric. The insight, another contribution of this work, is that when you're computing the metric you only need the inputs, not the targets, just because of how the formula works out. That means you can use unlabeled data to get an even better estimate of your metric, and that seems to help when you're dealing with things like small datasets that probably have some high bias in their training sets. In our case we worked on the Toronto Face dataset, which is kind of small, especially if you're looking at [indiscernible], so there's a lot of overfitting that can happen there. It seems that you get this kind of robustness by using unlabeled data when it's available. I don't think it pays off, though, when you get to really large datasets, because there the training set is good enough.

>>: [inaudible]

>> Razvan Pascanu: The green is when you're computing the metric and the gradient on the same batch of data, when you're estimating them: you pick a mini-batch, you compute the gradient on it, and then you approximate the metric based on that same mini-batch as well. The blue is when you use different random batches of data from your training set.

Another problem you might have with SGD, and this is a result Dumitru Erhan had in a JMLR paper a few years ago, two thousand nine I think; this is just a reproduction of his experiment, and then we added natural gradient as well. What you do here is consider the online regime: you take a large amount of data and you divide it into portions, ten portions in our case. Then you keep nine of them fixed, and you just resample, you use different samples for, the first portion. We did this five times in this case, so you train five models that see the same data except for one fraction. At the end you look at the variance of the models: you compute the output of each model on some validation set and then you take the variance of that, the mean variance. What you see is that if you change the data at the beginning you get a much higher variance, while if you change the data somewhere in the middle the variance is much smaller. What this means is that the model probably takes some really big decisions in the first steps it does, in the early stages of training, and then whatever comes afterwards is not as important. And this can be a problem when you have very large datasets.
Because what it means is that you don't get to see all of your dataset before you fix on the kind of solution you decide on. So you have a lot of data, but you're not actually using it properly, because the most important data you see is the data you see at the beginning, and afterwards it doesn't really matter. That was the experiment he did, and he showed that you can reduce this curve by using pre-training. In our case we tried natural gradient, and it seems that natural gradient, which is the red curve, can reduce this variance by quite a bit. It seems that natural gradient kind of always picks the direction that all the examples agree on, and it avoids making these decisions early. You still have to make important decisions in the beginning, so the variance is still larger in the beginning, but overall the variance you get from natural gradient is much lower than the variance you get from SGD. Which just means that natural gradient more consistently gets you to the same solution, compared to SGD, for which small changes in your input set can result in different solutions at the end.

Yann Dauphin is also doing an internship at Microsoft now, and if people want to talk to him they can contact him. He's been doing these very interesting experiments that he started after some experience he had at Google. Basically, what seems to happen is that if you have a very large dataset and you're trying to train a model with SGD: you start with a small model, and a small model would just underfit because it's too small, so that model would just have a large training error at the end, you cannot reach zero training error. But as you increase your model size, it seems that you're still not able to train even very large models down to zero training error. What he did here is he took ImageNet, the blue line is the experiment he ran, and he tried MLPs of various sizes: two hundred, five hundred, two thousand, five thousand, seven thousand, ten thousand, fifteen thousand hidden units. What he looked at is the number of training errors you can remove by increasing the size. So this is the return on the investment: you add three thousand hidden units and you reduce your training error by this many training errors. At one point you see that no matter how many hidden units you add, the number of training errors you're removing is very small, and at one point it actually even starts to hurt. Another important thing is this green line: the green line is removing one training error per added hidden unit. This should always be achievable, because one hidden unit could just memorize one training example, ideally; you'd expect that by adding a hidden unit you could reduce the number of training errors by at least one. But it turns out that this doesn't happen after some point: you need to add several hidden units just to be able to learn one extra training example, because there's some kind of underfitting that he seems to observe. I guess this is the kind of experiment that only places like here or Google can see, because they usually have the large amounts of data; you really need lots of data to see this.
What he does is he takes ImageNet and adds permutations of it, so he gets something like a hundred million examples, in order to start seeing this kind of effect. So this happens only at large scale. I've tried to replicate his result to a certain extent using natural gradient, and I got this red line. This is a very painful experiment to run: it takes about a week to two weeks. I didn't get the chance to do much hyper-parameter optimization at all; I just had to fix something and start running, and I couldn't run models of more than seven thousand units. But it seems that up to seven thousand units, at least, I'm still at about one training error per hidden unit, so natural gradient might help. It probably still needs a bit more experimentation to be sure.

In the end we wanted to do a benchmark, and in order to have a good benchmark we wanted to add something besides natural gradient. What we decided to add is natural conjugate gradient. It turns out you can extend natural gradient to also incorporate second-order information besides the manifold information. The way you do that is you take the directions that come from natural gradient and you do nonlinear conjugate gradient on top of them. You can justify this theoretically pretty well using the information geometry perspective; this is done in a paper by Gonzalez and Dorronsoro. This is not the only way to incorporate second-order information, but it's the only one we've tried in this paper. There are a bunch of other ways to incorporate second-order information, and depending on which one you use we expect to get quite different results. One thing to keep in mind about this method is that it makes a strong assumption: that the metric doesn't really change too much when you take your step. That kind of goes against what you're trying to do, right, because you're adding the second-order information in order to take even larger steps, yet somewhere in the algorithm you assume that if you take this large step things don't really change too much. I think this is an important assumption that probably keeps the algorithm from doing as well as we hoped.

So this is the benchmark. This is on the [indiscernible] dataset, this axis is the number of iterations, and these ones here are SGD. Because it's plotted against the number of iterations, SGD looks really bad; when we look at it time-wise, SGD will start looking pretty good. The red one is the nonlinear conjugate gradient, and these two are natural gradient, which we ran with different mini-batch sizes. We ran it with batches of five thousand, which is what people doing things like Hessian-Free usually do, and then with batches of five hundred examples, which is ten times smaller. We show that, because we're using different data to estimate the different expectations, we can actually run the algorithm in this small-batch regime, and they work about the same. You can see that nonlinear conjugate gradient starts out slowly, but at one point its slope is actually much better than natural gradient's. The reason is that in the first stages of learning, using second-order information might actually hurt, because it's only when you're kind of close to convergence that the second-order approximation is a good approximation of your model.
When you're just starting out, you're probably better off not relying too much on the second order approximation when you have this non-convex function; at least that's our assumption. If we move to time, we see SGD doing about as well as the other methods. SGD might also be able to do much better: what we did for SGD was force it to work with mini-batches of one hundred examples, and if you go to smaller mini-batches you might get a speed up, but we wanted to run things with about the same kind of mini-batch sizes. And you see the other two algorithms there. I guess what I want to say is that I don't think this is proof that natural gradient is faster than SGD yet; I don't think we are there. I think natural gradient probably converges slower than SGD, but we are on the same time scale using this approximation. Because we're on the same time scale, we might be able to explore other properties natural gradient can have, like the robustness to the order of the training examples, or the underfitting issue at really large scale. If we can explore those properly now and show that natural gradient helps on those issues, that might make natural gradient an appealing algorithm compared to SGD even though it's not faster than SGD. At least that's my hope.

So I have twenty minutes left, so I don't know, do you guys want to do questions and skip over the Theano part, or do you want me to go over the Theano slides? It's up to you. Do you have questions?
>>: Quick question.
>> Razvan Pascanu: Yes.
>>: In the first part you mentioned that you're using backpropagation through time, right?
>> Razvan Pascanu: Yes.
>>: How many [indiscernible] does it take to get those results?
>> Razvan Pascanu: I think I had a hard limit of ten to the five iterations, but I don't think I reached that; probably it was about, I don't know, fifty thousand iterations, something like that.
>>: [inaudible]
>> Razvan Pascanu: But it depends on which task we're talking about. If we're talking about the temporal order one it's about fifty thousand, that's the one I'm thinking of. But it was also kind of fast because we're training a model of fifty hidden units, so in terms of time I usually started them in the evening and in the morning I got some results; that's roughly how it worked. But if you go to large things, if you try to train a recurrent net on Wikipedia, then you have to expect it to take a few days, there's no way around it.
>>: Is there any reason you're not using, for instance, an extended Kalman filter, [inaudible]…
>> Razvan Pascanu: No, and an extended Kalman filter might do better. It's kind of known; I think the extended Kalman filter goes in the same category as Hessian-Free. Because they're using second order information they probably do better than plain backpropagation through time. It's just that I don't have an implementation for it, and it doesn't feel like a really easy algorithm to implement, so I don't think it's going to take me just a few hours to get it running, so I didn't go for it. But yeah, a Kalman filter probably will do better. I'm not sure it would be able to solve those problems with one hundred percent accuracy, the vanishing gradient problem ones, because on the other tasks I didn't see a huge gain from my regularization term. But on those tasks that specifically target the vanishing gradient problem I get like one hundred percent accuracy.
I know Hessian-Free is not able to get there, and I'm expecting the extended Kalman filter to do about the same. I might be wrong, I never tried, but that's kind of what I expect; probably much better than SGD, though. Yeah, other questions, yes?
>>: So I'm pretty interested in language modeling. You mentioned [indiscernible]…
>> Razvan Pascanu: Yes.
>>: Some time ago, you know, you explained that he sure has a lot of interesting methods. I was just interested in whether you had a chance to try some of these methods, [inaudible]?
>> Razvan Pascanu: So basically this work started from his stuff, right; the gradient clipping is basically just his approach, and we just tried to figure out what it is doing. I've managed to reproduce his results. I didn't manage to get better results than he did with these things up to now, so at most I can do as well as he does, basically using something that's very similar to what he did. In terms of his results, I think there are two components that he uses that make a big difference. One of them is this clipping strategy. The other thing that I've noticed is the way he does early stopping, because I used to do early stopping differently and I just couldn't get his results until I changed my strategy. I used to have this patience: if the error would increase I'd just wait for, I don't know, several steps and then see if that helps. He has this hard rule: whenever the error goes up he divides the learning rate by two. It seems that if you don't do that you don't really get where he gets. I don't have a really good intuition of why you need to do that, and I don't know if it's specific to language modeling. For music, I had a few tasks on music, and it didn't seem that I needed that early stopping strategy to get those results. But it seems I need to use it when I do language modeling; I'm not sure why.
>>: So I had [indiscernible], so you touched on a schedule [indiscernible] improvement. But I think he measures the improvement based on the validation set, not on the training set.
>> Razvan Pascanu: Yes, yes, he also does that.
>>: Okay, [indiscernible] the language [indiscernible] generalized by the…
>> Razvan Pascanu: Yes, yes…
>>: Versus just over [indiscernible].
>> Razvan Pascanu: Yeah, yeah, you're right. Once he starts dividing the learning rate by two he just keeps dividing it at every [indiscernible]; it doesn't stop until it goes below [indiscernible] or something, which is also something that I usually don't do, right. What I do is: I see the error increasing, I make the learning rate smaller, and then I wait until I see the error increasing again and make the learning rate smaller once more. That's how I usually do early stopping. But that doesn't seem to do as well as his. That's an interesting question, why exactly is that important? I don't know. It might be nothing, right, it might be just the way it is and there's no explanation. Or it might be that there's something interesting about it, I don't know.
>>: So he was interested in [indiscernible] your average [indiscernible] experiments by a mesh of complexity of the [indiscernible] and [indiscernible] helps [indiscernible] complexity because, yeah, if you don't improve [indiscernible] anymore [indiscernible].
>> Razvan Pascanu: Yeah.
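To make the two ingredients discussed here concrete, below is a minimal sketch of a training loop that combines gradient norm clipping with the "halve the learning rate whenever the error goes up, then keep halving" schedule; the threshold, the halving factor, and the helper names are assumptions for illustration, not the exact recipe from the original code.

    import numpy as np

    def train(params, compute_grad, evaluate_error, n_epochs=50, lr=0.1, clip_threshold=1.0):
        # compute_grad and evaluate_error are placeholders for the model at hand;
        # evaluate_error would be measured on held-out data in the setup discussed above.
        prev_error = float('inf')
        keep_halving = False
        for epoch in range(n_epochs):
            g = compute_grad(params)
            norm = np.linalg.norm(g)
            if norm > clip_threshold:              # rescale the whole gradient, don't truncate elementwise
                g = g * (clip_threshold / norm)
            params = params - lr * g
            error = evaluate_error(params)
            if error > prev_error:
                keep_halving = True                # once triggered, keep halving every epoch
            if keep_halving:
                lr = lr / 2.0
            prev_error = error
        return params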
>>: But actually, sorry, my question was whether you had a chance to try things like natural gradient and things like this, because he's [indiscernible] SGD.
>> Razvan Pascanu: Okay, not really, not extensively. For natural gradient, well, it took me a while to just get it running, so now I have code for it, but I only got to run a few experiments and they were not extremely successful, ha, ha. So for now the experiments that I've done are not extremely successful, but my hope is that I just didn't explore hyperparameters enough. Also, the way I'm using natural gradient, I'm not really using it as a black box optimizer. For example I'm not doing a line search, right; I have a fixed learning rate in front of that gradient. So in some sense my natural gradient, the way I'm using it right now, has more hyperparameters than SGD, because I have the damping and some other things that I need to set, and it seems that setting all of those can change things quite a bit. It's new work, but my hope is that maybe it will do better, and probably in the large scale regime, that's what I think. If I run it on [indiscernible] it's probably not going to do any better, so that's not what I'll be running it on. But if I run it on something like Wikipedia or some really huge dataset, then maybe I can see some gain out of it. That's where I think natural gradient might have a chance to do better.
>>: [inaudible]
>> Razvan Pascanu: Yeah, yes?
>>: Did you try the regularization term on the [indiscernible] task?
>> Razvan Pascanu: I tried it on some tasks, yes, but not fully on that one. I've tried it at the character level, because in order to get it working at the word level you have the issue of a large output layer, so you need to divide your words into classes or something to get it to run fast, because otherwise it's very slow. I didn't have time to let it run as much as I wanted. At the character level it didn't really improve things too much, but as I said, I think that's just because I don't have the right annealing scheme for the regularization term. I hope that's the reason, I don't know. On music I got better results, but the reason is that the music datasets I've been running on are quite small, so it takes about two to three hours to train on them, so I had quite a few iterations of playing around with hyperparameters and things. On [indiscernible] it's not a big dataset, but still I didn't have enough time to play with it too much; I had to use hyperparameters that I thought were good and just leave it at that.

Yes? So I still have ten minutes, I can go over a few of the Theano slides. So, yes, Theano is the solution we use to scale things up. It's been around for a while, there are a lot of papers written about it, it's kind of stable, and I think it's a good alternative. There are a few other libraries out there that do these kinds of things. TORCH is probably the one most similar to Theano, if you know about it; it comes from [indiscernible] lab. I think both TORCH and Theano have started to be adopted by start-ups and such, so there are actually companies using this software. Then you have things like [indiscernible]; that's what people in Geoff Hinton's lab have been using. [indiscernible] is kind of low level, so Theano tries to do a lot more than [indiscernible]; with [indiscernible] you still have to do a lot of things yourself by hand.
I'm pretty sure there are other libraries out there that I don't know of; almost everybody is writing their own library. The way Theano works, and I think this is the main message I want to get across, is that in Theano what you do is basically construct a computational graph of what you're trying to do. First you construct your variables, which are your starting nodes, the green things here. Then you write down the computation you want, say A plus A times B to the tenth power. This gives you a computational graph, which is this one; and this is the optimized computational graph, so it doesn't look exactly like what you wrote; when you constructed it, it looked like A, B, A times ten and so on. Then you call theano.function, and what theano.function does is first of all look at this computational graph and do graph substitutions to optimize it. So if you do something like A minus A, it can find that in the graph, remove it and replace it with zero. That's a very basic substitution, but it can do a lot of smart things; for example, log of sigmoid can be replaced with something more numerically stable. So there are optimizations to make computations stable and optimizations to make things fast. There are quite a few optimizations; currently we're having the issue that there are too many of them and it takes a long time to compile a function. So we apply all these optimizations on the graph and we get a pretty efficient graph out of it. Then what we do is write custom code for each of the nodes, either in C or for the GPU and so on; we write or generate custom code for each node and then we compile that code. What we provide back is a [indiscernible] object that goes through this optimized computational graph and calls the compiled code. This object that walks the graph is also written in C. So basically you start writing things in Python, but what you end up running is just C code that uses some Python API calls and uses NumPy as an interface.

I think this is a nice strategy because it allows you to do things like computing gradients and so on automatically. You have a fixed set of operations, and you can automatically compute the gradient of any expression, because you just have to go through the graph and apply backpropagation. It also gives you the R op, if you know about it; it's the thing you need for that kind of implementation, and it's just like a forward pass in automatic differentiation language, that's how they usually call it: you go forward instead of going backwards. You can do this automatically because Theano has access to the graph and each op knows what it does. It's also not only meant for machine learning; it's actually meant more like a linear algebra library. So Theano doesn't really give you an MLP or a recurrent net or something like this; it just gives you variables and the operations you normally do, and you write your formula the way you want it. You don't need to worry about writing it in an optimized way or anything like that. You just write it the way you want and then you call theano.function, and most of the time Theano will figure out how to optimize it.
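As a concrete illustration of this graph-then-compile workflow, a minimal Theano snippet might look like the following; the exact expression from the slide isn't recoverable from the transcript, so the formula below is just an example.

    import theano
    import theano.tensor as T

    # symbolic starting nodes (the "green things")
    a = T.vector('a')
    b = T.vector('b')

    # build a symbolic expression; nothing is computed yet
    y = a + a * b ** 10

    # compiling runs the graph optimizations and generates the C/GPU code
    f = theano.function([a, b], y)
    print(f([1.0, 2.0], [3.0, 4.0]))

    # gradients come from walking the same graph (reverse mode)
    cost = y.sum()
    grads = T.grad(cost, [a, b])

    # the R operator: a forward-mode directional derivative of y w.r.t. a in direction v
    v = T.vector('v')
    jv = T.Rop(y, a, v)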
So it rearranges all your computations so that you get something that's pretty fast, then you can run that on the GPU and you can do a lot of stuff with it. These are some timings. Especially this one: TORCH7 used to be faster than us, then, I mean, it's like a competition, we got to be faster, we changed a few things around. In Theano it's also easy to reuse code that other people wrote; for example here what we did is basically reuse some of TORCH's way of generating C code. In Theano right now you can use Alex Krizhevsky's kernels to do convolutions, so you can easily incorporate this kind of stuff, and then you don't need to worry about the specifics of how those kernels look; you just use the same op that you usually use, like [indiscernible] this and this, you set some parameters, and it does all the magic behind the scenes.

This is for recurrent nets, and here I'm comparing with Tomas's library. His is much faster in the regime that he usually runs in, where you have few hidden units, but as you increase the number of hidden units you can see Theano, especially Theano on the GPU, doing much better. That's simply because there are more things that can get parallelized. A recurrent net itself is a sequential model, so it's usually hard to parallelize. This is without mini-batches, so this is a single sample, but you can still parallelize a few things in there as well if you have a lot of hidden units and a lot of inputs. That's why Theano is able to do better on the GPU. That's the whole presentation, thank you so much, and if you have more questions. [applause]
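Since recurrent nets came up in these timings, here is a rough sketch of how a simple recurrent layer is typically written in Theano with scan, for a single sequence without mini-batches; the sizes, the tanh transition, and the variable names are illustrative, not the particular model benchmarked here.

    import numpy as np
    import theano
    import theano.tensor as T

    n_in, n_hid = 100, 50
    W_in = theano.shared(0.01 * np.random.randn(n_in, n_hid).astype(theano.config.floatX))
    W_rec = theano.shared(0.01 * np.random.randn(n_hid, n_hid).astype(theano.config.floatX))

    x = T.matrix('x')          # shape (time, n_in): one sequence, no mini-batch
    h0 = T.zeros((n_hid,))     # initial hidden state

    def step(x_t, h_tm1):
        # one step of the recurrence; scan turns this into a loop over time
        return T.tanh(T.dot(x_t, W_in) + T.dot(h_tm1, W_rec))

    h, _ = theano.scan(step, sequences=x, outputs_info=h0)
    run = theano.function([x], h[-1])   # compile; returns the last hidden state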