>> Jasha Droppo: Well, good morning, everybody. We're here for Nicolas' talk about his studies at the University of Montreal. I first became aware of him about 18 months ago when we were looking for an intern, and he was doing some very good work in RNN modeling for music analysis. He's one of the pioneers in that field, and he's well published. I found out recently that even before that he was doing work in modern physics, on electron tunneling microscope software for inferring the shape of molecules based on something I don't even understand. So if anybody understands that, they can ask him questions about that too. He works most recently at the University of Montreal in Yoshua Bengio's group, doing good work, like I said, in recurrent neural networks, which he's here to talk about today. And please welcome him and treat him nicely, because we really like him. Come on up, Nicolas. That's it. It's all for you now. >> Nicolas Maurice Boulanger-Lewandowski: Thank you. Hello everyone. I'm very happy to be here presenting work I've done during my PhD, so thank you for having me and for attending. It will be about modeling high-dimensional sequences, and I'll tell you in more detail what this is exactly. First, just a slide to introduce how this all fits into the big picture. The work was motivated because, if we ever want to have strong AIs [inaudible], it's probably necessary to have different modalities and interactions between them. Deep learning is very popular these days. It's a way to learn different levels of abstraction, abstractions that come from knowledge in different domains, so very different in nature. And this work can be seen as a way, in fact multiple different ways, to interface knowledge that can be time-independent and time-dependent. So if we have sequential knowledge, for example language, then we can use that and try to make it influence other types of knowledge such as acoustics. The kinds of sequences that we'll try to model I call high-dimensional, simply because at each time step we have a complicated, high-dimensional object, such as an image in a video. A property of those sequences is that the conditional distribution is often multimodal. So if we try to predict a single time step given the previous ones, it's often the case that it can go in two or more different possible ways, the modes of the conditional distribution, and it would be unacceptable to just predict the expectation of all the possibilities. So we really want to capture the richness of a conditional distribution that is multimodal. The general framework we'll work in is that we start, at the bottom, with an input matrix X that has time and feature-space dimensions. In general the task, in audio processing for example, would be to simply assign to a column of this matrix a label, so a column of the output matrix, the sequence that we want to predict. This is the most simple case. It can be, for example, a deep neural network that acts as a classifier for that. Now, this output sequence I call z, with a time index t, just to introduce at the same time the notation that I will use throughout. We also have a symbolic model, so a model that has some knowledge about what the output is supposed to look like. It can be an HMM in the simplest case, or here we'll try to have more complicated and more powerful models for that.
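As a concrete reference for this framework, here is a minimal sketch (an illustration, not code from the talk) of the simplest case just described: a frame-level classifier that maps each column x_t of the input matrix X to a distribution over the output labels z_t. The dimensions, names and tiny network are made up.

    import numpy as np

    def softmax(a):
        e = np.exp(a - a.max())
        return e / e.sum()

    # Illustrative dimensions: 5 input features per frame, 4 possible output labels.
    n_feat, n_hid, n_lab = 5, 50, 4
    rng = np.random.RandomState(0)
    W1 = rng.randn(n_hid, n_feat) * 0.1; b1 = np.zeros(n_hid)
    W2 = rng.randn(n_lab, n_hid) * 0.1; b2 = np.zeros(n_lab)

    def classify_frame(x_t):
        """Map one column of X to a distribution over the output labels z_t."""
        h = np.tanh(W1 @ x_t + b1)
        return softmax(W2 @ h + b2)

    X = rng.rand(n_feat, 100)                               # input matrix: features x time
    Z = np.array([classify_frame(X[:, t]) for t in range(X.shape[1])]).T

The symbolic model mentioned above is what adds knowledge about the output sequence z itself, on top of this purely frame-by-frame mapping.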
Also, I mentioned here the z-bar indexed by u. That's because it's not always the case that the time steps in the output are already aligned with the time steps of the input. If they're unaligned, and in this case you can see they're unaligned because there's a lot of repetition in the output symbols, then we have the u index in place of t. So that's the general framework. Yes? >>: Can I just clarify that? Can you go back, please? In the bottom matrix, does this mean that at every time step in this particular example there are five scalar features? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. >>: Okay. And then in the output labeling, I guess there are four different labels? Is that how I should think of this? >> Nicolas Maurice Boulanger-Lewandowski: Yes. >>: Four different labels? >> Nicolas Maurice Boulanger-Lewandowski: Four different labels. And in the high-dimensional case there can be more than one active at the same time. In some tasks there can only be one, for example if it's a [indiscernible], and then there's only one. But in many cases we have a full vector that is not a one-hot vector, and we want to predict its full distribution. >>: So for instance, maybe I would have one label per pixel if I want to go really far. >> Nicolas Maurice Boulanger-Lewandowski: Yeah, yeah. We don't have to call them labels; it can be a vector of real values, which would be the pixel intensities, for example. So if we only have a one-hot vector the distribution is very easy to model. We don't have to take correlations into account because there are no correlations; only one unit is active. But in the high-dimensional case it's more complicated. If we output an image we want to capture the full distribution. >>: So how about for music? [inaudible] is complicated or simpler? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. I will go through the applications at the end; but for music, for example, if we have polyphonic music then we can have many notes at the same time. So this would be a high-dimensional representation. The number of all possible configurations is in fact very large, so that's why we call it high-dimensional. So just a brief outline of the presentation. I'll start with model definitions, introducing the RNNs and how we can make it so that we predict an actual conditional distribution, and also how we combine the symbolic and the acoustic models. There are many ways to do that: the input/output architecture is one, and we have other hybrid architectures, basically graphical models. Then we have an inference section. This is when we know the input X and we want to infer z, which is a random variable in our model; we want to infer the z that has maximal probability, so we know what to actually output from our algorithm. It will be done with beam search, a high-dimensional variant of it, and pruning techniques; also an alignment algorithm that is very similar to inference. And then in the applications I will go over polyphonic music generation, transcription, chord recognition, source separation, and speech recognition. So we start with the most basic version of the RNN. As you can see it models the output distribution, so the z indexed by t. This is unrolled in time, but in fact all parameters are shared across time steps, so all weight matrices W are shared. We have a layer of hidden units h, also indexed by t, and this layer is connected to the past state, so the state at t minus one.
And it can also be influenced by z, the z from the past, which acts as the current input. And from this state we hope to be able to make a prediction for the current value of z. >>: And just to make sure I understand this: in, like, a basic RNN language model there would not be the arrows from z to z, and there would not be the arrows from z to h. Is that right? >> Nicolas Maurice Boulanger-Lewandowski: You can remove the arrows from z at t minus one to z at t, but in a more general view you can also keep them. I mean, if the function is powerful enough it can simply go through h and we don't need a direct connection. In the same way we can also have skip connections from z at t minus 2, t minus 3, to help learning, but in principle it's not necessary. So we try to do without them, but in practice it's often useful to add them because- >>: And there could be, like, an input layer down there below the h. >> Nicolas Maurice Boulanger-Lewandowski: Well, that will come in the second part. But for now it's only the output modeling. >>: I see. >> Nicolas Maurice Boulanger-Lewandowski: Now, when we want to include X things get more complicated. And there are many ways to do it. >>: Can you explain, [inaudible] the difference between the way you use the arrow here, [inaudible] arrow, this is the use of arrows as the dependency [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: I think the two views are compatible. Here the arrow just means that it's a deterministic function, so h_t depends in a deterministic way on z at t minus one and h at t minus one. So we have an explicit function with model parameters that gives this value, and all the h are deterministic given z. And then we output z, but in fact we output a distribution. In the simplest case the distribution we output can be a softmax layer or a [inaudible] layer, so it's not a very general distribution, but it's still a distribution. So yes, if the conditional distributions are multimodal we don't want a simple layer like that; we'll use a Restricted Boltzmann Machine, RBM, or in practice more commonly the NADE, which is a tractable variant. And this can be seen, in fact, as a distribution estimator. So it can model a multimodal distribution very easily. It's an energy-based model, and it also has inside of it another hidden layer. It's different from the hidden layer of the RNN, but it's used to model the distribution. So it has a visible layer, which is z_t, a column of z if we put it in the big picture, but in the RBM it's just a visible vector, and the hidden layer helps to find patterns in this data. If we take the example of polyphonic music, we have many notes occurring at the same time on the keyboard, for example, and the hidden units would represent the possible chords that we can have. One way to think about it is that one unit would encode the C major chord, for example, and then, given that this unit is active, we have a full probabilistic description of the visible vector. Now, in practice it doesn't really happen that way, we don't have one unit for one chord, but it's useful to think about it this way. It can be groups of chords, or it can be a lower-level representation, but something similar to that. >>: Is there any relation between this and the [inaudible] you showed previously, or is it just another way of modeling z and h?
>> Nicolas Maurice Boulanger-Lewandowski: Until now there is no relation, but I will combine them in the next slide; for now it's only a frame-level model. So there's no time evolution at all. In video, for example, it would just model the distribution of the images in the video. So in the RNN-RBM we want to exploit the RBM to capture and model the conditional distribution. The RNN stage is exactly the same, except now I've renamed the hidden layer of the RNN h-hat, to distinguish it from the hidden layers of the RBMs. And the idea is that the RNN is still deterministic, and instead of predicting z_t directly at each time step we in fact output the parameters of the conditional distribution, so the parameters of the RBM. So at each time step there is one RBM with varying parameters, and that's exactly how we predict the distribution. In fact, we only predict the biases of the RBM, the visible and the hidden biases. We could also predict the weight matrix. >>: So the motivation for this is trying to use the RBM as an approximation, and then using the RNN to incorporate some dependency across that? >> Nicolas Maurice Boulanger-Lewandowski: Right. >>: The arrows, like from h-hat to h, those being solid, does that indicate that it's a deterministic function, or that there's a deterministic contribution to the energy term that comes from h-hat, and then the double arrow going between h and z means it's the RBM model? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. In fact, if I wanted to be really rigorous there would have to be another [inaudible] bubble on this arrow, because what we predict is the bias of the hidden layer of the RBM. So this function is deterministic. This is the hidden bias, and we also predict the visible bias, so we have all the parameters of the RBM, and this RBM is in fact our prediction of the actual z_t, of its actual distribution. So we predict the distribution, in fact. The distribution parameters are deterministic. Training it is not that much more complicated than the regular RNN, except that we first have to make a pass to get the parameters of each RBM, and then we can estimate the gradient of the cost by contrastive divergence; so it's pretty similar to having a simple output layer, except that we have the CD approximation. Then we backpropagate the gradient through the other parameters, and we can do training with stochastic gradient descent in the usual way. >>: So qualitatively, what does this get you that just the regular RNN can't provide? [inaudible] intuitive understanding, but qualitatively what- >> Nicolas Maurice Boulanger-Lewandowski: It's the multimodal conditional distribution; it all comes down to that. But to give more intuition about it in polyphonic music: if we have a score, we have a time evolution, the rhythm in music, and we also have the vertical dimension, the chords, the note co-occurrence. This is the harmony component in music. And we know that if we predict a note it will affect the probability that other notes are present at the same time. So, for example, at a given point the music can go in two different directions; we have a turning point. It can be a C major chord or a D major chord. If we don't have the conditional distribution we would predict a blend of the two. Let's say the two each have a 50 percent chance of occurring.
If we predict a blend of the two, in fact the blend is not probable at all, because if you played those two chords at the same time it would sound very dissonant, very bad. So it's one or- >>: [inaudible] separation issue, source separation. So you get many more notes correct at the same time, and then you use the RBM to model your distribution. >> Nicolas Maurice Boulanger-Lewandowski: Yeah. >>: At what level? At the note level or at the acoustic level? >> Nicolas Maurice Boulanger-Lewandowski: At the symbolic level. You can do it for acoustics too: if instead of having z here you have X, for example a spectrogram matrix for audio, for music or for speech, you can model this directly with the RNN-RBM and you have a generative model. But I will use this model for source separation later, and what it does is model each source separately. So we have a prior when we want to make a separation; we know what each source is supposed to look like. There will be experimental results later that explain how this really works. And yes, there's also the NADE, the tractable variant of the RBM, so we have an exact probability and we can replace the RBMs with NADEs. And this is very useful because now we have a joint model that is tractable, so we can train it with second-order optimization methods. >>: [inaudible] example of the application, the training [inaudible] is provided when all the notes are different every time [inaudible]. So you could conceivably think about a Gaussian network to do these types of probabilities, these types of [inaudible]. So this is the [inaudible] model when you use the RBM. So why do you choose this indirectly for your model rather than directly [inaudible] question? [inaudible] prediction [inaudible] so you could do it either way. >> Nicolas Maurice Boulanger-Lewandowski: It can be both. This can be used for generation, just as a generative model; we can generate sequences of music with that, and I have a few that I can show. But it can be used for prediction too. The thing is, if it's used for prediction, you could say: why predict two modes if in the long run we will make only one prediction, so we might as well just make one prediction right away, right? But it's still important to capture the different modes of the distribution, because we don't know which one we'll end up picking right away. I think it will become clearer in the following. Okay. So this was the RNN model, but only for z. Now we want to incorporate this with the acoustic modeling, so we want a model not of z alone but of z given X. We can do it with an input/output architecture, which is the simplest conceivable way to do it. So we have the same thing, the same RNN at the top, and you notice that whenever there is a dashed line, that is the prediction; from now on I won't draw it every time, but you should see it as if there were an RBM there and we predict the parameters of that RBM. So now I just draw a single arrow towards z to say that we predict z, but in fact we still predict the parameters of the RBM that describes z. And the X is now part of the input that is fed into the RNN, so the whole RNN state will now depend on the input acoustic observations. So now, instead of just describing the density of the output z, we describe the conditional distribution of z given X. What else? There's also the dotted line, W_zh, which I drew dotted because it's an optional connection.
I call it the temporal smoothing connection. It's optional: if you remove it, then all the z become conditionally independent from each other. But adding it is very useful, because it forces the output to be consistent with itself. So z_1, z_2, up to z_t must be consistent. In particular it will introduce some temporal smoothing if the output is supposed to be smooth. >>: Why does that do something that the RBM doesn't do? >> Nicolas Maurice Boulanger-Lewandowski: Because the RBM is only the distribution of a single frame, a single time step. If we didn't feed back the z that we end up picking into the RNN, we wouldn't know which mode we picked. Let me just come back to this picture. Let's say that we have a turning point; we have two options. If we pick one, say the one where it doesn't fall, there was a 50 percent chance of it occurring. The RNN needs to know that we ended up picking this one. If not, the distribution at the third time step will still be uncertain about which mode we're in. But when we have feedback, we know that, okay, at the second time step we picked the first mode, so we should stay in the first mode. So there's some form of temporal smoothing here. Otherwise we could just switch modes randomly, simply because there is a 50 percent chance. >>: So how does this model differ from this [inaudible], is this about the sequential RBM? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. It's very similar. >>: But then training is much easier than this. It doesn't involve any [inaudible] through time. It's just using this sequence [inaudible] bias towards the next RBM. >> Nicolas Maurice Boulanger-Lewandowski: Yeah, but you need to backpropagate. There is the temporal RBM and there is the [inaudible] RBM; those are two different models, but basically we use those as a baseline. They're very similar to this model, but they don't have a separate hidden layer; they don't have this h-hat. Instead they just connect the previous h directly into the z, but they use the mean field to provide it, so it's deterministic. So in fact, this one is a generalization of those RBMs, and we have more power to describe the temporal dependencies involved. It can get quite a bit better on some tasks, but it's similar in spirit. >>: Yeah. But I don't remember whether earlier they emphasized the fact that the problem to solve is multimodal [inaudible] to solve. >> Nicolas Maurice Boulanger-Lewandowski: Using the mean field approximation simplifies the training procedure a lot. Otherwise, if they don't use the mean field, it's very hard; they have to sample, I think, and then the training is not very efficient. >>: I think [inaudible] mostly this is a generative model to do these [inaudible] sequences; unless you conditioned your new sampling on the previous input, or your previous guess of what the input was, you wouldn't get a smooth sequence. >>: I see. [inaudible] for prediction as well? >>: They didn't do prediction. >> Nicolas Maurice Boulanger-Lewandowski: It could be used for- >>: [inaudible]. >>: Right. That would do, like, missing feature prediction [inaudible]. >>: I see. >>: [inaudible] try to sample the new ones and stuff. >>: I see. So this [inaudible]. >> Nicolas Maurice Boulanger-Lewandowski: And the idea was that if we have a separate RNN we can pre-train it in a better way, so without the stochastic signal from the RBMs, and also we can replace the RBM with the NADE, something that you couldn't do otherwise. Okay.
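To make the RNN-RBM idea concrete, here is a minimal single-step sketch in Python/NumPy; it is an illustration of the description above, not the speaker's code, and all weight names and sizes are invented. The RNN state is updated deterministically from its previous state, the previously chosen frame z and the acoustic input x, and it outputs the time-dependent visible and hidden biases of the conditional RBM; the RBM free energy can then score a candidate frame.

    import numpy as np

    # Illustrative sizes: n_v visible units (e.g. 88 piano keys), n_h RBM hidden
    # units, n_r RNN hidden units.  All parameter names below are made up.
    n_v, n_h, n_r = 88, 150, 100
    rng = np.random.RandomState(0)
    W    = rng.randn(n_h, n_v) * 0.01      # shared RBM weight matrix
    Wzh  = rng.randn(n_r, n_v) * 0.01      # feedback: previous frame z into the RNN state
    Whh  = rng.randn(n_r, n_r) * 0.01      # recurrent connection
    Wxh  = rng.randn(n_r, 5)  * 0.01       # acoustic input x_t into the RNN state
    Whbv = rng.randn(n_v, n_r) * 0.01      # RNN state -> visible bias of the RBM
    Whbh = rng.randn(n_h, n_r) * 0.01      # RNN state -> hidden bias of the RBM

    def rnn_step(h_prev, z_prev, x_t):
        """Deterministic RNN update; outputs the parameters of the conditional RBM."""
        h = np.tanh(Whh @ h_prev + Wzh @ z_prev + Wxh @ x_t)
        bv = Whbv @ h          # time-dependent visible bias
        bh = Whbh @ h          # time-dependent hidden bias
        return h, bv, bh

    def free_energy(z, bv, bh):
        """Negative log of the unnormalized probability of a frame z under the conditional RBM."""
        return -(bv @ z) - np.sum(np.log1p(np.exp(W @ z + bh)))

During training, the gradient of the contrastive-divergence cost with respect to these biases would be backpropagated through the RNN parameters, as described above.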
So with this input/output architecture there are several problems. Some of them are fundamental, like the teacher forcing problem; some of them are more things that happen in practice and limit the applicability. The label bias problem occurs when there is a lot of smoothing involved, for example, and the RNN just uses the previous label as the most discriminative feature and pretty much ignores the acoustic observation. So if 90 percent of the time you just repeat the previously emitted symbol, the classifier will have a bias: it will always try to repeat this symbol, and the contribution from the input will not be high enough. You can regularize; it's an option that can work sometimes. Regularize so that the weight coming from the previous label versus the one coming from the input is not too large or too small. There is also the realignment option. It all depends on the temporal resolution that you pick for the time steps: if you pick a coarser temporal resolution you increase the entropy of each conditional distribution, so you have less bias towards just repeating what came before. That can work, but it's still problematic. This is also related to the probability flow problem: when you go into a path that is bad, you incur a cost in probability at that time step, but all children of this node, so all the completions of this path, will still sum to one. So overall the sequence can still have a pretty high probability, and this is another problem. The other issue is the teacher forcing problem, which is that during training we have the perfect z, the ground-truth z that we feed as input to the RNN, so perfect training conditions. But in testing it's not the case: during inference we'll explore many configurations and try to pick the one that is best. So we have to make sure that the RNN can model even the bad cases, and it doesn't have training examples for that. You can train by adding noise, giving a z-tilde, or with z-star, where z-star is the prediction that you would have made at this time step if you didn't know the answer. We can use that z-star as an input to the RNN during training; it's another strategy, but there are some problems with that too. Just to give an intuition of how this works: if you're training a model to drive a car and you're always at the center of the road in perfect conditions, then yes, the model will be able to predict where we are going. But if something bad happens and you go off the road, you won't necessarily know how to come back on track to the center. So you want to make sure that during training you go off track and you see these examples, so you know what to do in those cases. This is why it can be useful to add noise to z: we just go off track and we see if we can recover. >>: So when you talk just about [inaudible] so why does this [inaudible] require this particular type of [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: [inaudible] version. >>: I have a question. So the example that you said [inaudible], for example z_1 to the h_1, and then you copy this link, is the- >> Nicolas Maurice Boulanger-Lewandowski: You mean the dotted arrow? >>: [inaudible] h. >> Nicolas Maurice Boulanger-Lewandowski: Yeah. If we don't have this dotted arrow then we don't have any of these problems.
>>: In the usual regular RNN formulation, having the feedback from [inaudible] back to the [inaudible] doesn't create any difference in terms of training; but now, is the problem here actually due to the RBM? >> Nicolas Maurice Boulanger-Lewandowski: No. It depends whether you use the prediction or the actual random variable value for the feedback. If you use only the prediction, so the mean field or expectation- >>: I see. >> Nicolas Maurice Boulanger-Lewandowski: Then you don't have a problem, because it doesn't depend on your choice of z. But as soon as it does depend, and that is the goal here, we want to make it depend on z, because otherwise, if you don't have these temporal smoothing connections, you have to do some smoothing after the fact, [inaudible] processing using HMMs, for example. And that's what we want to avoid and replace. >>: I have two questions. So can you, for example, have a link that connects z_1 with h_2? And then the second question is, can you use the prediction of z_1 as the [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. If you use the prediction of z_1 as feedback, the only thing that changes is the way it's trained, because otherwise it's completely equivalent to just having another hidden layer, since it's all deterministic. If you just have a hidden layer between the two, you don't have to call it the output in that case; you can just call it a hidden layer. But it's trained differently, because you still want it to be close to your target, so that can work. Now, for the arrangement of the arrows, there are many, many possibilities. You see that now the prediction is made using h at t minus one, but you can just take the whole hidden layer and shift it by one time step to the right, keeping the arrows attached. Then the prediction is made using the same time step, h_t, but now you have to use z at t minus one. So there are many configurations; the only thing you don't want is z_t going to h and back to z_t, which would be useless because it would be trivial. As long as the prediction only depends on the previous values of z it's fine, and it can depend on all of X. That's the only constraint. I don't know if that answered the question. Yeah. So, the probability flow problem is something that was solved with conditional random fields for linear chains. It would maybe be possible here; I'm currently working on a similar model with RNNs, but it's a bit hard to train, because with CRFs you have an exact dynamic programming formulation, and with an RNN it's not so obvious. One very easy way to bypass this problem and make it work is what I call the hybrid architecture. It's a generalization of the HMM. Now you have the underlying sequence z, which is the output, but we see it as the underlying random state, a random variable. And then, conditioned on z_t, we emit the observation x_t. If I just go back to see the difference: there, the X are inputs to the RNN, so in the upper right we modeled P of z given X, but here we in fact model P of X given z. Now, in practice we'll invert this relation using Bayes' rule, by having a classifier that goes from X to z. But the important part is that if we assume this model, we can, to simplify, just multiply the two probabilities of the two models. So for the two models, the first conditional probability is P of z_t given X.
This is our acoustic classifier, so the term on the left, the first factor; and the other is P of z_t given A, where A is the sequence history. So P of z_t given A is purely the symbolic model. In this case we can just multiply the two, and this is true because we assume independence, that each x_t is emitted given only z_t. In practice, and in theory too, it's not exactly true, because X, for example, can contain a window aggregation of features; I said that X could contain anything, not only- >>: Over here z, in your model, is the symbol. >> Nicolas Maurice Boulanger-Lewandowski: Yes. >>: Listen to what I'm saying. Now you make z generate X, which is the observation, rather than having X be emitted from a hidden state H as in the normal HMM. So how do you justify that? I think I kind of got lost about the multiplication [inaudible]. >> Nicolas Maurice Boulanger-Lewandowski: In fact, this model is only a backward rationalization of why it's okay to just multiply the two probabilities of the two models; I think it came from that. What we want to do is have our classifier that goes from X to z, our regular frame-level classifier, and we have our symbolic predictor, and we just multiply the two. No renormalization; it looks like a product of experts but it's not, there's no renormalization. We can just multiply because we have the independence assumption of the graphical model. That's the motivation. If we have this model we get an easy formula. But in fact the arrows, as I said, are backwards. So if z is a complicated object, like a binary vector with many ones in it, then the classifier from X to z would be a multi-label classifier, for example, so that we can have multiple labels at the same time, or many different binary classifiers, one for each unit. There are some issues in principle, like counting factors twice with this approach, because we just multiply the two probabilities. If there is something in x_t that can give us any indication of what could come after, I can use this, maybe. In this term we assume that it's independent, but in fact inside x_t there is some indication that lets us recover z at t minus one, t minus two, so to recover other z_t's that are close. If we have some z_t's in there, then we will count the same factors here as in our purely symbolic model. For example, if there is a phone transition that we know should occur because of the previous symbol, then this term will be included here, but it could also be included there if the acoustic window is wide enough to encompass it and we can recover the z at t minus one. Anyway, it's a problem, but this architecture eliminates the probability flow problem, so it's still very worthwhile to use it instead. In practice this is the one that works best. And finally, this is the one used in the source separation. So, incorporating that into NMF, I get this matrix factorization. If you're not familiar: we have our acoustic observation, and what we're trying to do is the same thing as in sparse coding. We try to find a dictionary W of basis elements that can be reused throughout the whole data set to explain the observations. And the activity matrix H is simply the coefficients in this new basis, basically. >>: C is a spectrogram here? >> Nicolas Maurice Boulanger-Lewandowski: Here? >>: Yeah. >> Nicolas Maurice Boulanger-Lewandowski: No. The spectrogram is X. C here is the cost of the decomposition.
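For reference, here is a minimal sketch of the plain NMF decomposition just described, with a Euclidean reconstruction cost and the standard multiplicative updates; it is illustrative only, not the talk's code. The additional cost terms discussed next (sparsity, temporal smoothing, and the RNN prior on H) would simply be added to this reconstruction cost.

    import numpy as np

    def nmf(X, n_components, n_iter=200, eps=1e-9):
        """Plain NMF with Euclidean cost and multiplicative updates (Lee & Seung):
        X (features x time) is approximated by W (dictionary) @ H (activities)."""
        rng = np.random.RandomState(0)
        F, T = X.shape
        W = rng.rand(F, n_components)
        H = rng.rand(n_components, T)
        for _ in range(n_iter):
            H *= (W.T @ X) / (W.T @ W @ H + eps)   # update activities
            W *= (X @ H.T) / (W @ H @ H.T + eps)   # update dictionary
        return W, H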
So this cost, usually, would only be the [inaudible] norm squared, but we don't have to have only this reconstruction cost, where we would find coefficients purely to reconstruct the observation. We can add sparsity, we can add temporal smoothing to encourage the coefficients to be smooth in time [inaudible], and we can also have a full model of the density of H. In fact, H here is the same as z; they are the same matrix. >>: So does this mean that when you learn the dictionary you need to somehow feed back information from the RNN? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. >>: Into NMF? >> Nicolas Maurice Boulanger-Lewandowski: That can get complicated. What we do for source separation is supervised source separation, so we train with isolated sources. We use the activity matrices as targets and we learn the distribution of the isolated sources. Once we have one RNN that describes each isolated source, we observe the mix, and then, using those priors, we know what each source is supposed to look like, and we can add this extra term to influence the decomposition and assign each source into the right bin. Otherwise there are some problems that are just impossible without it. I'll show you examples of that later. >>: So I have a question. Basically this RNN is a prior for the density [inaudible]. So is there some relationship with the [inaudible] divergence that you introduce in the cost function for [inaudible] reconstruction? >> Nicolas Maurice Boulanger-Lewandowski: Well, the [inaudible] is a measure of how well we reconstruct X. Our reconstruction is simply W times H, so the matrix- >>: [inaudible] PDF as the reference, what is the reference for the PDF, the density? >> Nicolas Maurice Boulanger-Lewandowski: It's a mix of, you mean in the source separation context? >>: When you have [inaudible] you have the opposite of density and reference density, so that the reference density- >> Nicolas Maurice Boulanger-Lewandowski: So in the source separation context the reference density is the mix, and in fact, well, this term is for all sources together. We have the constraint that all sources must sum to the observation, and this RNN term is only for the individual sources. So the H here should be the H for source one, plus the same term for source two; but this reconstruction term occurs only once. So this is it for the model part; this was the longest section. Now, inference. As I said before, we want to find z-star, the z that is most likely given X, which in most tasks is the output of our program. This is what we are searching for, the globally most likely sequence. Just to give an intuition: let's say we are in this model, we make a prediction of z, and all of this is conditioned on X. We make a prediction of z and then we have to commit to a decision. Say we pick the most likely configuration of that z, then we feed it back here, we continue and we pick the most likely one here, and we continue; this is a greedy algorithm. It doesn't guarantee that overall we'll get something good. What we want is to search for the full sequence that is consistent with itself. For example, here we might take a locally very bad configuration that has low probability, but in the long run it allows us to be more consistent globally. That's what we want, and doing that is not easy.
We can use a greedy algorithm, but in fact beam search is a generalization of the greedy algorithm where, instead of keeping only the one best candidate at each time step, we keep the w best, where w is the width of the beam. Here, for example, we have three candidates at this time step, in red, so we kept those three. Now what happens is that we find all the children of each node; it's a tree search. The children of a node are all the possible continuations of this partial sequence by one time step, so all the possible configurations that can occur at time step t. We append that to the sequence and those are the children. Now we analyze all the many, many resulting children and we still keep only the w best. That's beam search. Now, in high-dimensional beam search the children are exponentially many, because there are exponentially many configurations. This is in the binary case: if each column is a binary vector it can have many ones in it, so we can't enumerate all of them. Ideally we would like to enumerate them in decreasing likelihood, but for an RBM or for the NADE it's not possible to do that. So what we do in the high-dimensional case is sample K elements and then find the unique configurations and keep the most likely ones; well, we will sample more than K, until we have the K most likely elements. It doesn't give a guarantee, but the more we sample the more likely we are to find them. >>: Can you explain here, how does this kind of search [inaudible] different sources [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: Okay. How it fits in is through the likelihood; the likelihood is the criterion that we use to judge whether a sequence is plausible. So if we try to jump to something that doesn't make sense given the past, something that is not continuous, for example, we'll incur a large cost for that. In the long term it might be worth it, but every time we jump from one symbol to the next we pay a high cost, and this is given by the RNN density. The RNN will predict staying in the same state, and if we change we have a large cost there. >>: You [inaudible] a minute ago you were setting up the cost [inaudible] from one track to another. >> Nicolas Maurice Boulanger-Lewandowski: But the constraint is implicit in the RNN model. The RNN will just predict something that is similar to what came before; if it's not similar, it will have a low likelihood. So implicitly this sequence will get relegated [inaudible]. But we need to have a sufficient beam width so that we can make some transitions that globally will still be worthwhile. A big problem that we have with that is beam saturation. Here I've made an example of the top three most likely sequences, and you can see that they are almost the same; only the transition time is shifted by one time step. So basically it's pretty much the same output, right? In principle the RNN state is different, so we have to consider it, but if the difference is very far in the past we don't need to consider all the possible combinations of all these little variations. It will saturate the width of the beam, and the beam won't be very effective. So what we do is prune the beam. The way to prune it is that we'll have, in general, some hashing function that assigns a hash to a sequence; it's an approximate hashing function, and we have to design it so that equal hashes correspond to similar sequences that should be pruned.
What it means to be pruned is that, from all the sequences that share the same hash, we keep only one, the one that is most likely. We are not sure it's really the best one, but we keep that one. An example of a hash function is one that keeps only the underlying sequence: all the little variations in the precise alignment are lost, and if the emitted underlying sequence is the same, we consider the sequences the same and we prune them. This is one possibility. Another one that is very fast and works well in practice is to use only the previous time step as the hash; in fact, when I say hash, here it's a function of the previous time step. What this gives us is that we have at most one sequence ending at each possible configuration. So in speech recognition, if these are the phone labels, for each possible label we keep only one sequence ending at that label, the best one. We prune a lot; it's a very strong approximation, but it makes the search very similar to [inaudible] and it works well in practice. >>: So my real question here is, [inaudible] understood, this model actually is a similar model to this temporal [inaudible] RBM in terms of [inaudible], except you've relaxed the mean field approximation in that model? >> Nicolas Maurice Boulanger-Lewandowski: No. [inaudible] RBM [inaudible]. What I'm saying is that this one is a generalization of the [inaudible] RBM, but that one is only a model of z. There is no X at all in their model. That model is basically for human motion and videos, but the task is only predicting video, it's not annotating video, it's not- >>: Hidden layer. >> Nicolas Maurice Boulanger-Lewandowski: Yeah. It's only a generation task, or a modeling task. That's it. And the RNN- >>: But you said, if you were to use [inaudible] approximation for this model [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: In fact this is a generalization. There is a way to come back to the RTRBM, but you shouldn't do it, because it's not more efficient and the results are poorer. The way to do it is to impose that W from z to h-hat is the same as the W of the RBM. If you do that, you see that h-hat is exactly the mean field of the hidden units of the RBM, so you only have to compute it once. That's what it comes down to; the RTRBM is like a version of the RNN-RBM with tied parameters. Those weights are tied, and I think the biases are tied too. What happens when you train the RTRBM is that the hidden layer gets split into hidden units that model the RBM density and hidden units that model the temporal evolution in the RNN. Here we just split them explicitly, and it's more flexible. >>: So is there a reason why the [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: Here? >>: No, the inference, the [inaudible]. These parts. So [inaudible] there's no integrated dependency. It's in the sample? >> Nicolas Maurice Boulanger-Lewandowski: No, but there is an [indiscernible]. >>: So there's a dependency. Is there a reason why the [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: But there is a backward pass. It's just that, if I use the pruning solution that makes it very similar to Viterbi, it's the same thing as Viterbi, so if you want to code it with a forward-backward you can, but you don't have to. The backward pass, in fact, is only used to reconstruct your final output. Just to find the solution and its likelihood you don't need to go backward; you can do one forward pass and save all the backpointers correctly.
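Here is a minimal sketch, under stated assumptions, of the pruned beam search just described: candidates are scored by the sum of the acoustic log-probability and the symbolic RNN log-probability, as in the hybrid architecture, and hypotheses sharing the same last symbol are merged, keeping only the best one, which is what makes it Viterbi-like. This is an illustration, not the actual implementation; `acoustic_logp` and `symbolic_step` are hypothetical placeholder interfaces.

    def pruned_beam_search(acoustic_logp, symbolic_step, labels, T, width=5):
        """Beam search where hypotheses sharing the same last label are merged,
        keeping only the most likely survivor (the 'hash' is the previous frame).
        acoustic_logp(t, z): log P(z_t = z | x), from the frame-level classifier.
        symbolic_step(state, z): returns (new_state, log P(z | history)) from the RNN."""
        beam = [(0.0, None, [])]                   # (log-prob, RNN state, partial sequence)
        for t in range(T):
            children = []
            for logp, state, seq in beam:
                for z in labels:                   # in the high-dimensional case: sample candidates
                    new_state, lp_sym = symbolic_step(state, z)
                    children.append((logp + acoustic_logp(t, z) + lp_sym,
                                     new_state, seq + [z]))
            best = {}                              # prune: keep one hypothesis per last label
            for cand in children:
                key = cand[2][-1]
                if key not in best or cand[0] > best[key][0]:
                    best[key] = cand
            beam = sorted(best.values(), key=lambda c: -c[0])[:width]
        return beam[0][2]                          # most likely sequence found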
Going backwards is just to reconstruct the output. Conceptually I find it simpler to explain it with just one forward pass, and that works for Viterbi too. >>: In a forward-backward [inaudible], the forward pass computes the forward [inaudible] and the backward [inaudible] computes the [inaudible], so that in no, and one instance is of [inaudible] for it, so combination [inaudible]. >> Nicolas Maurice Boulanger-Lewandowski: We would have to go into more detail [inaudible]. >>: For decoding or for [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: [inaudible]. >>: [inaudible]. >> Nicolas Maurice Boulanger-Lewandowski: We'll go into more detail [inaudible]. >>: One more quick question. Does this method require you to know that the total number of tracks is fixed? >> Nicolas Maurice Boulanger-Lewandowski: Tracks? >>: Yeah. Do you have to know- >> Nicolas Maurice Boulanger-Lewandowski: So everything is mixed. But for source separation I don't use this algorithm, because there is a simpler method, gradient descent inference. We can use it because the coefficients H, or z, are real-valued. So we can simply take the RNN model and differentiate, not with respect to the parameters but with respect to the visible layers, and then we find an optimal solution for the output by gradient descent. >>: It's automated. >> Nicolas Maurice Boulanger-Lewandowski: Yeah, it's automated. So, if I come back quickly to this pruning thing: it's a lot faster, and we can wonder how much precision we lose; it's surprising, in fact we gain some accuracy. This is on a chord recognition task, and the way we gain accuracy is that we can reduce the beam width very, very low. This is the accuracy: you see that we reach it with a beam width of five or something ridiculously low. And if we don't prune we have to go very high in width, something like 1000, and even then we don't reach the full accuracy. So there's something about the hashing function that is very important; we have to prune correctly if we want to do beam search, otherwise you can see that it doesn't perform as well. >>: [inaudible] correct decoding procedure, right? >> Nicolas Maurice Boulanger-Lewandowski: Okay. So DP is with pruning; I call it DP because it's like dynamic programming, if you like. And beam search is without pruning, so without pruning is this curve: you need to increase the beam width a lot, whereas if you prune you can reduce the beam width. That's why it's faster: even if you reduce the width very low you still have good performance, and the running time is directly related to the width, because if we have a beam width of 1000 it means that we need to keep track of 1000 RNN states. So it's proportional to w. Okay, a small side note on sequence alignment. If our outputs during training and during testing are not necessarily aligned with the input matrix, so if only the underlying sequences are available, what we can do is a version of hard expectation-maximization. In the E step we find the optimal alignment A-star, where A-star is the alignment that has the highest probability according to the current model. And in the M step we assume that this alignment is right and we update the model parameters under this assumption. It's a hard EM because in the expectation step, in fact, it's not the real expectation; we don't compute the expectation, we compute the most probable element.
This is similar to how, if you use regular EM to train a Gaussian mixture model, the hard version takes you to a K-means-like model. So by using only the optimal alignments it's a hard EM. Yes. To find the actual alignments we use a strategy that is very similar to the one for inference, based on beam search. What we want to find are the u_t indices that map from the unaligned sequence to the aligned sequence in time. These go from one to U, and we know that at each time step the index can only increase by one or stay the same. So u is the number of emitted symbols since the beginning of the sequence: we start, we emit the first one, and then either we stay at the same one or we output another one, and so on. So to find the optimal alignment A-star we use a similar strategy, beam search: a full search is intractable, but we can use beam search. And the pruning strategy that is equivalent to the other one is that the hashing function is, in fact, the ending symbol index u_t. So it's the same thing, but it's important that you use u_t. Also, for each candidate alignment, so each partial sequence, we of course store the associated alignment history, because the RNN still depends on everything that came before. So that's the [inaudible] performing algorithm for this task. Now there is a way, since we still have some time, to increase the speed of this alignment. In a first pass we make an approximate alignment A-prime that comes only from the acoustic model, so you can just discard the symbolic model for that, because we know that anyway, during training, we already have the correct output sequence, so the alignment can't deviate too much from it. The acoustic model alone will do a good job of finding the approximate boundaries for the transitions, the state transitions. This is very fast to do; an exact algorithm by dynamic programming is possible in this case. And in the second pass we come back, assuming that the optimal alignment A-star doesn't deviate from the A-prime alignment by more than delta steps, and this allows us to prune the space of admissible sequence candidates even more, so it's even faster. It yields alignments identical to doing the full search, in all the cases we tried, using a very low [inaudible] delta. So, in short, it takes the approximate alignment from the acoustic model and then uses the RNN to refine the transition boundaries, to get an alignment that the model believes is even better. All right, so I'll jump into the applications; there are some specifics for each one. For music, we first only want to model the sequences of polyphonic music, so we only want to model z. There's no audio, it's only the symbolic form, so it's a musical score. Many notes can occur at the same time. What we'll actually model is the piano roll representation: a binary matrix with time on one axis and pitch, the note number in fact, on the other. If you look at one column you have 88 notes, one for each key of a piano keyboard, and it represents the notes that are active at the current time. This is a little limiting as a representation, because we exclude score annotations such as dynamics, but it still has the two most important aspects of music, namely the temporal dependencies, the rhythm, and the high-dimensionality. The task is to predict the current time step given the past. >>: Is it one of the conditions of [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: There is no competition for that, in fact, but that's why, when we designed this task, we chose a full high-dimensional representation that is very general, because it's a good benchmark for machine learning algorithms, and it has been used that way since. Most previous work that tried to model music did it in a reduced space, a reduced representation: they tried to infer features that were psychologically relevant, or something like that, and then model that space, but that doesn't guarantee that we'll actually be able to generate music, or that the feature space is actually relevant. Here, with the RBM, we'll try to discover the space that is the most relevant possible to describe the full input. Something people could do before was model only a sequence of chords, so premade chords like C major, D major, with one melody note at a time. As you can see this is low-dimensional, because you only have a limited number of possible chords and they're only premade chords. So you don't necessarily generalize well: if you have a chord with some added notes and some removed notes, here we're more flexible with that. In this work we implemented many popular models of polyphonic music. You see that the first one is the simplest, the previous-frame-plus-Gaussian baseline: simply predicting exactly the same thing as the previous frame, with a Gaussian, so the probability is centered around the previous frame. We have some N-grams with smoothing and backoff; the N-gram here models patterns of notes, so it's an N-gram of patterns, whereas for the note N-gram we have one N-gram for each possible pitch, but it's only a binary N-gram. That model is very often used; it's often used as an implicit model for smoothing, just smoothing each note independently with an HMM. >>: So what exactly is the input here? I guess music as well as- >> Nicolas Maurice Boulanger-Lewandowski: There's no input. It's only modeling. >>: It's only modeling. >> Nicolas Maurice Boulanger-Lewandowski: Like this piano roll, so- >>: But building up to it, does it get the music or does it just get the notes? >> Nicolas Maurice Boulanger-Lewandowski: There's no notes here. >>: So just at the end. >> Nicolas Maurice Boulanger-Lewandowski: So in the notation I introduced, it's the output z, and it's this whole matrix. >>: But it does get the previous score, right? >> Nicolas Maurice Boulanger-Lewandowski: Yes. >>: Okay. And it gets the correct previous ones. >> Nicolas Maurice Boulanger-Lewandowski: Yes. >>: And how many notes into the past does it get? >> Nicolas Maurice Boulanger-Lewandowski: It gets everything that came before. >>: Everything. And is it predicting just what the next note is going to be, or does it have to go, like, two or three notes into the future? >> Nicolas Maurice Boulanger-Lewandowski: So in fact it's a very general model. To compete in this task, what we need to do is build a probability distribution over this matrix; that's it. Now, with the RNN, it goes through it sequentially: we first have the probability of the first column, and then we multiply this probability by the probability of the rest of the sequence, recursively. Any probability distribution can be expressed in this way, like P of z, where z is the whole sequence: you can express it as P of z_1 times P of everything that comes after, given z_1.
The "given z_1" means that the RNN knows what came before when predicting z_2, because we've already taken into account the probability of having the correct z_1 in the first frame. >>: So in the next slide, when it says there's 40 percent accuracy, that means 40 percent accuracy in predicting the very next note? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. >>: Okay. Given the correct history. >> Nicolas Maurice Boulanger-Lewandowski: The real metric is really this one. >>: Why would that be the real metric? >> Nicolas Maurice Boulanger-Lewandowski: That's the log-likelihood. That's the one we're trying to optimize in training, by maximum likelihood. This one is supposed to be musically relevant, but- >>: Is it like a language model prediction's perplexity [inaudible]? >>: [inaudible] given a sentence on your model versus some other models. >>: Now the difference with multimodal prediction is that you get multiple notes coming out rather than [inaudible]. >> Nicolas Maurice Boulanger-Lewandowski: For this task, yes. It's high-dimensional because if you were to just enumerate the possible configurations there would be too many, so you have to find something else. The accuracy is the [inaudible] accuracy, and it's the expectation of that accuracy: if we emit a conditional distribution we can compute the expectation of the accuracy under that conditional distribution that we predicted. For each model it has been evaluated with the same mathematical definition, of course. But pretty much all models use the same sequential strategy of predicting one time step and then, assuming it's correct, finding the next. >>: So it's [inaudible] that will work in terms of accuracy. Up here it's not really. There's no continuous solution of the acoustics. There's no audio here. So what does [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: So even if there's no audio you can still make an HMM with a GMM just to predict the symbolic- >>: Oh, that could be [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: It's not the same, but you would just try to find a hidden state that doesn't correspond to any observable quantity ever, but- >>: GMM [inaudible] Gaussians [inaudible]? It doesn't make sense. It's Gaussian. >> Nicolas Maurice Boulanger-Lewandowski: It's all continuous, but you can still say that it's a Gaussian, so the mean would say that it's closer to one than to zero. So it's just a form that is Gaussian, but you're right. >>: That's a very good question. I want to make you aware, Nicolas, that we have about 10 minutes. You're scheduled to end at noon. Make sure you have enough time to at least give an overview of the breadth of your results. >> Nicolas Maurice Boulanger-Lewandowski: Okay. I'll give a brief overview. So, polyphonic transcription, if you're not aware of this task, means starting from an audio signal, an MP3, and directly outputting the musical score, which is very interesting from a musician's point of view. Here the piano roll representation is a useful intermediate step for that. The input/output RNN-NADE does very well on this. You see that here we only use the NADE, not the RBM, because in the inference step we want to be able to compare the probabilities of candidates, and with the RBM that's intractable; we don't even have a probability. The NADE solves that problem.
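For reference, this is what makes the NADE usable during inference: it gives an exact log-probability for a binary frame, so candidate frames can be compared in the beam search. A minimal sketch follows, with illustrative names and shapes; in the RNN-NADE, the biases b_v and b_h would be the time-dependent values predicted by the RNN, while W and V are shared.

    import numpy as np

    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

    def nade_logp(v, b_v, b_h, W, V):
        """Exact log-probability of a binary frame v under a NADE.
        v: binary vector of length n_v; W: (n_h x n_v); V: (n_v x n_h)."""
        a = b_h.copy()                      # running hidden pre-activation
        logp = 0.0
        for i in range(len(v)):
            h_i = sigmoid(a)
            p_i = sigmoid(b_v[i] + V[i] @ h_i)      # P(v_i = 1 | v_<i)
            logp += np.log(p_i if v[i] else 1.0 - p_i)
            a += W[:, i] * v[i]             # condition the remaining units on the observed v_i
        return logp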
>>: So do you have a sort of concise [inaudible]? I would've thought that LSTM or BLSTM was really well suited for music, because of things that I've heard [inaudible]. So, a short version, and we can talk more: where are you getting the gain [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: I haven't tried LSTM personally, but I know it has been applied to onset detection, which is very similar to this, except the task is only evaluated on the onsets, the beginnings and endings of notes, where the smoothing is less important. Here it's a frame-level evaluation. But LSTM could in fact be integrated into this; in fact it should be. An interesting point here is that this is very robust to noise. The reason is that, of course, with the model we can learn the temporal evolution, and we have a learned musical [inaudible] model that acts as a prior for the transcription. So whenever we have noise, just as in speech recognition, this symbolic model becomes very important to fill in what is missing or what is too noisy or ambiguous. And having conditional RBMs is also very important, because with noise we don't know which mode we want to jump into right now; we have to wait until we are later in the sequence and backtrack, so it's important to describe the different modes. If the audio is very clean there won't necessarily be many modes, the first mode will be very dominant, but the more noise we have the more multimodal it becomes, because we just don't know which mode we are in purely from the audio. The curves illustrate these principles, in short, that we can do better, and the advantage becomes even larger at high levels of noise. This is my toy example for source separation. We want to separate this kind of source from this one, and now you have to imagine them being mixed together, so just added, like this in fact. If we use only NMF it's impossible to separate them, purely impossible, because looking only at the content of each frame is completely insufficient to know which source it belongs to; each frame looks exactly the same. So we need to have some temporal constraints, and if we integrate a temporal constraint then we can solve this problem. As you can see it's not perfect, even though it's a very simple problem, but we are in the right ballpark at least. This uses the gradient descent inference that I mentioned. >>: So [inaudible] as a prior [inaudible]? >> Nicolas Maurice Boulanger-Lewandowski: Yeah. So this was for audio separation, but on real-world data sets we can do a lot better. The baselines here are NMF alone and NMF with smoothing, which is often used for source separation. If you listen to the results of source separation, it's still very bad, not very usable in any real application, because there are a lot of artifacts. There are artifacts because when we force the audio to be constrained to a single source, it looks more like this source, but it will also have artifacts, many things that sound weird. >>: So the RNN-RBM, is that learned? Is that using the RNN in conjunction with NMF and learning the dictionary as we talked about before, or is it just the RNN as in the previous overhead, or just the RNN-RBM by itself? >> Nicolas Maurice Boulanger-Lewandowski: It's using the NMF dictionary. >>: It is. Okay. >> Nicolas Maurice Boulanger-Lewandowski: It should read NMF plus RNN-RBM. Everything is based on NMF, but we either use this prior or not. >>: I see. And what's SAR?
>> Nicolas Maurice Boulanger-Lewandowski: It's the signal-to-artifacts ratio. As you can see we are worse in SAR, so it means that we have more artifacts than the other models. But the separation quality is better, and this is an overall measure, the trade-off between artifacts and quality. So overall we do better; we can look mostly at the SDR. >>: And how do you measure, what's SING, S-I-N-G? >> Nicolas Maurice Boulanger-Lewandowski: These are the accompaniment and singing tracks. These are the two tracks that we want to separate; in this data set we separate karaoke songs. >>: So this is your combination? >> Nicolas Maurice Boulanger-Lewandowski: It's a public data set. >>: So if you used this for, like, the PASCAL challenge, how would it do? >> Nicolas Maurice Boulanger-Lewandowski: I don't know. Now, this one is actually for the MIREX competition. It uses the MIREX data set and it's for chord recognition, so it's very similar in spirit to speech recognition. It is in fact the second best result that I've seen in the literature, so it's very competitive with the state of the art. It's based on the system I've presented. And for speech recognition, well, here the benchmarks are already higher, so it's a little tougher. What we did, in fact last summer during the internship here, is just compare a baseline HMM with an RNN phonetic model. We can see that even a strong deep neural network can be improved by replacing the HMM with the RNN, and the improvement is more significant than what you typically get with CRFs. So this is encouraging. Another thing is that the RNN can be used in conjunction with other classifiers. I know that, for example, this particular DNN is not the strongest: if you use dropout you can gain something like three percent accuracy on it, and this can also be tried with the RNN. So these results are encouraging for now. Also, if we combine this with a word language model we see that we can still get an improvement. So it seems useful to have an RNN model to replace the HMM, even in complementarity with word language models that already model the temporal evolution. By modeling the phones, the sequences of phones, it can still be useful in itself, it seems. I have a few research perspectives. I mentioned the gradient descent inference; this can be used as an approximation even if we don't have a real-valued vector. Then it could be useful to have a different training procedure, with backpropagation through the inference: if the inference itself is differentiable, we would directly optimize the z-star, the one that we find during testing with inference. That's a very promising option, I think. Purely for RNN training, there is some current work with stochastic methods similar to dropout and active learning, trying to take a few examples at a time. Metronome intermediate targets would be used to get a better temporal description for music. For speech recognition, it would be having an end-to-end system instead of combining a phone language model with a word language model in two separate steps. And another very interesting one would be to use the RNN-RBM to model context-dependent phones: we would be able to treat phones or context-dependent states with the RBM. I'm not very clear on this yet, but the idea is using the RBM to capture or discover what the useful [inaudible] phones could be. That's it. Thank you also to the co-authors of this work, and thank you.