>> Jasha Droppo: Well good morning, everybody. We're here for Nicolas’ talk about his studies
at the University of Montreal. I first became aware of him about 18 months ago when we were
looking for an intern, and he was doing some very good work in RNN modeling for music
analysis. He's one of the pioneers in that field, and he's well published. I found out recently
that even before that he was doing work in modern physics and electron tunneling
microscope software for inferring the shape of molecules based on something I don't even
understand. So if anybody understands that they can ask him questions about that too. He
works most recently at the University of Montreal in Yoshua Bengio's group doing good work, like I
said in recurrent neural networks which he's here to talk about today. And please welcome him
and treat him nicely because we really like him. Come on up Nicolas. That's it. It's all for you
now.
>> Nicolas Maurice Boulanger-Lewandowski: Thank you. Hello everyone. I'm very happy to be
here presenting work I’ve done during my PhD, and so thank you for having me and for
attending. So it will be about modeling high-dimensional sequences, and I'll tell you in more detail what this is exactly.
So just first a slide to introduce how this all fits into the big picture. It was motivated because if
we want to ever have strong AIs on [inaudible] it's probably necessary to have different
modalities and interactions between them. Deep learning is very popular these days. It’s a way
to learn different levels of abstraction, abstractions that come from knowledge in different domains, so very different in nature. And this work can be seen as a way, in fact multiple different ways, to interface knowledge that can be time-independent and time-dependent. So if we have sequential knowledge, for example language, then we can use that and try to make it influence other types of knowledge such as acoustics.
So the kinds of sequences that we'll try to model I call high-dimensional simply because at each time step we have a complicated, high-dimensional object such as an image in a video, and a property of those sequences is that the conditional distribution is often multimodal. So if we try to predict a single time step given the previous ones it's often the case that it can go in two or more different possible ways, the modes of the conditional distribution, and it would be unacceptable to just predict the expectation of all the possibilities. So we really want to capture the richness of the conditional distribution that is multimodal.
So the general framework we'll work in is that we start with, at the bottom, an input matrix X that has time and feature space dimensions. So in general the task, in audio processing for example, would be to simply go from a column of this matrix to the label, so a column of the output matrix, the sequence that we want to predict. This is the most simple case. It can be, for example, a deep neural network that acts as a classifier for that.
Now this output sequence I call zed with a time index T, just to introduce the notation that I will use throughout. We also have a symbolic model, so a model that has some knowledge about what the output is supposed to look like. It can be an HMM in the simplest case, or here we'll try to have more complicated and more powerful models for that. Also I mention here the zed bar indexed by U; it's because it's not always the case that the time steps in the output are already aligned with the time steps of the input. If they're unaligned, and in this case you can see they're unaligned because there's a lot of repetition in the output symbols, then we have the U index in place of T. So that's the general framework. Yes?
>>: Can I just clarify that? Can you go back please? In the bottom matrix does this mean that
at every time step in this particular example there are five scalar features?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah.
>>: Okay. And then in the output labeling, I guess there, are four different labels? Is that how I
should think of this?
>> Nicolas Maurice Boulanger-Lewandowski: Yes.
>>: Four different labels?
>> Nicolas Maurice Boulanger-Lewandowski: Four different labels. And in the high-dimensional case there can be more than one active at the same time. So in some tasks there can only be one, for example if it's a [indiscernible], and we would just want to, there's only one. But in many cases we have a full vector that is not a one-hot vector and we want to predict this full distribution.
>>: So for instance, maybe I would have one label per pixel if I want to go really far.
>> Nicolas Maurice Boulanger-Lewandowski: Yeah, yeah. We don't have to call them labels but
it can be a vector of real values that will be the pixel intensities, for example. So if we only have a one-hot vector the distribution is very easy to model. We don't have to take correlations into account because there are no correlations, only one active unit. But in the high-dimensional case it's more complicated. If we output an image we want to capture the full distribution.
>>: So how about for music? [inaudible] is complicated or simpler?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. I will go through the applications at the end; but for music, for example, if we have polyphonic music then we can have many notes at the same time. So this would be a high-dimensional representation. The number of all possible configurations is in fact very large, so that's why we call it high-dimensional.
So just a brief outline of the presentation. I'll start with model definitions, so introducing the RNNs and how we can make it so that we predict an actual conditional distribution; and also how we combine the symbolic and the acoustic models. There are many ways to do that. The input/output architecture is one, and we have other hybrid architectures, as graphical models basically. And then we have an inference section. So this is when we know the input X and we want to infer the zed, which is a random variable in our model, but we want to infer the zed that has maximal probability so we know what to actually output from our algorithm. It will be by beam search, a high-dimensional variant, and pruning techniques. Also an alignment algorithm that is very similar to inference. And then in the applications I will go over polyphonic music generation, transcription, chord recognition, source separation, and speech recognition.
So we start with the most basic version of the RNN. So as you can see it models the output distribution, so the zed indexed by T. This is unrolled in time, but in fact all parameters are shared across time steps, so all weight matrices W are shared. We have a layer of hidden units H, also indexed by T, and this layer is connected to the past state, so the state at T minus one. And it can also be influenced by zed, the zed from the past, which acts as the current input. And from this state we are able, we hope to be able, to make a prediction for the current value of zed.
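To make the recurrence concrete, here is a minimal NumPy sketch of the deterministic state update and per-step prediction being described; the variable names, the tanh and sigmoid choices, and the exact arrangement of the arrows are illustrative assumptions, not the precise parameterization from the talk.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rnn_forward(z, Whh, Wzh, Whz, bh, bz, h0):
    """Basic output-only RNN sketch: h_t is a deterministic function of
    h_{t-1} and z_{t-1}, and the distribution of z_t is read out from the state."""
    h = h0
    predictions = []
    for t in range(len(z)):
        # predict the distribution of z_t from the previous state
        predictions.append(sigmoid(Whz @ h + bz))
        # update the hidden state using the actual z_t (teacher forcing)
        h = np.tanh(Whh @ h + Wzh @ z[t] + bh)
    return predictions
```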
>>: And just to make sure I understand this, in like a basic RNN language model there would not
be the arrows from the Z to Z and there would not be the arrows from Z to H. Is that right?
>> Nicolas Maurice Boulanger-Lewandowski: You can remove the arrows from zed T minus one to zed T, but in a more general view, you can also keep them. I mean if the function is powerful enough it can simply go through H and we don't have to have a direct connection. In the same way we can also have skip connections from zed T minus 2, minus 3 to help learning, but in principle it's not necessary. So we try to do without them, but in practice it's often useful to add them because-
>>: And there could be like an input layer down there below the H.
>> Nicolas Maurice Boulanger-Lewandowski: Well that would be in the second part. But for
now it's only the output modeling.
>>: I see.
>> Nicolas Maurice Boulanger-Lewandowski: Now when we want to include X things get more
complicated. And there are many ways to do it.
>>: Can you explain, [inaudible] the difference between the way you use the arrow here,
[inaudible] arrow, this is the use of arrows as the dependency [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: I think the two views are compatible. Here the arrow just means that it's a deterministic function, so H, T depends in a deterministic way on zed, T minus one and H, T minus one. So we have an explicit function with model parameters that gives this value, and so all the H are deterministic given zed. And then we output zed, but in fact we output the distribution. In the simplest case the distribution we output can be a softmax layer or [inaudible] layer, so it's not a very general distribution, but it's still considered a distribution. So yes. If the conditional distributions are multimodal we don't want to have such a simple layer; we'll use a Restricted Boltzmann Machine, RBM, or also, in practice more commonly, the NADE, which is a tractable variant.
And this can be seen, in fact, as a distribution estimator. So it can model multimodal distributions very easily. It's an energy-based model, and it also has inside of it another hidden layer. So it's different from the hidden layer of the RNN, but it's used to model the distribution. So it has a visible layer, which is the zed T, so it's a column of zed if we put it in the big picture, but in the RBM it's just a visible vector, and the hidden layer helps to find patterns in this data.
So if we take the example of polyphonic music we have many notes occurring at the same time
on the keyboard, for example, and the H would represent the possible chords that we can have.
So one way to think about it is that one unit would encode the C major chord, for example. And
then given that this unit is active we have a full probabilistic description of the visible vector.
Now in practice it doesn't really happen that way. We don't have one unit for one chord, but it's useful to think about it this way. It can be groups of chords or it can be a lower-level representation, but similar to that.
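As a rough illustration of the frame-level, energy-based model being described, here is a hedged sketch of the free energy of a binary RBM; up to the intractable partition function, lower free energy means higher probability for a visible vector such as one column of the piano roll. Names and shapes are assumptions.

```python
import numpy as np

def rbm_free_energy(v, W, bv, bh):
    """Free energy F(v) of a binary RBM, with P(v) proportional to exp(-F(v)).
    v: visible vector (e.g. one frame of notes), W: visible-to-hidden weights,
    bv/bh: visible and hidden biases. The hidden units are summed out analytically."""
    hidden_term = np.sum(np.log1p(np.exp(bh + v @ W)))
    return -(v @ bv) - hidden_term
```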
>>: Is there any relation between this and the [inaudible] you showed previously or is it just
another way of modeling Z and H?
>> Nicolas Maurice Boulanger-Lewandowski: Until now there is not, but I will combine them in the next slide; for now it's only a frame-level model. So there's no time evolution at all. It would, for example, in video just model the distribution of images in the video. So in the RNN-RBM we want to exploit the RBM to capture and model the conditional distribution. So the RNN stage is exactly the same except now I've renamed the hidden layer of the RNN H-hat to distinguish it from the hidden layer of the RBMs. And the idea is that the RNN is still deterministic, and instead of outputting zed T and predicting zed T directly at each time step we in fact output the parameters of the conditional distribution, so the parameters of the RBM. So at each time step there is one RBM with varying parameters. And that's exactly how we want to predict the distribution. In fact, we only predict the biases of the RBM, so the visible and the hidden biases. We could also predict the weight matrix.
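A minimal sketch of what one RNN-RBM time step might look like under that description: the deterministic RNN state produces the time-varying visible and hidden biases of the RBM for that frame, while the RBM weight matrix stays shared. All parameter names are assumptions for illustration.

```python
import numpy as np

def rnn_rbm_step(h_hat_prev, z_prev, p):
    """One RNN-RBM step (sketch): h_hat is the deterministic RNN state,
    and only the RBM biases vary with time; the RBM weight matrix is shared."""
    # deterministic RNN update driven by the previously emitted output
    h_hat = np.tanh(p['Whh'] @ h_hat_prev + p['Wzh'] @ z_prev + p['bh_hat'])
    # predicted parameters of the conditional RBM describing z_t
    bv_t = p['bv'] + p['Whv'] @ h_hat   # visible bias at time t
    bh_t = p['bh'] + p['Whb'] @ h_hat   # hidden bias at time t
    return h_hat, bv_t, bh_t
```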
>>: So the motivation for this is trying to use the RBM as an approximation and then using the
RNN to incorporate some candidacy across that?
>> Nicolas Maurice Boulanger-Lewandowski: Right.
>>: The arrows like from H-hat to H, those being solid, does that indicate that it's a
deterministic function or that there’s a deterministic contribution to the energy term that
comes from H-hat and then the double arrow is going between H and Z means it’s the RBM
model?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. In fact if I wanted to be really rigorous it would have to be another [inaudible] bubble just on this arrow, because what we predict is the bias of the hidden layer of the RBM. So this function is deterministic. So this is the hidden bias, and we also predict the visible bias, so we have all the parameters of the RBM, and this RBM is in fact our prediction of the actual zed T, of its actual distribution. So we predict the distribution in fact. The distribution parameters are deterministic. So to train that, it's not that much more complicated than the regular RNN except that we first have to make a first pass to get those parameters of each RBM, and then we can estimate the gradient of the cost by contrastive divergence; so it's pretty similar, except that where we had the simple output layer we now have the CD approximation. And then we back propagate the gradient through the other parameters, and we can do training with stochastic gradient descent in the usual way with that.
>>: So qualitatively what does this get you that just the regular RNN can't provide? [inaudible] intuitive understanding, but qualitatively what-
>> Nicolas Maurice Boulanger-Lewandowski: It's the multimodal conditional distribution. So it
all comes down to that. But if I can give more intuition about it in polyphonic music: if we have a score, we have a time evolution, the rhythm in music, and we also have the vertical, so the chords, the note co-occurrence. So this is the harmony component in music. And we know that if we predict a note it will affect the probability that other notes are present at the same time. So, for example, at a given point the music can go in two different directions. So we have a turning point. It can be a C major chord or a D major chord.
If we don't have the conditional distribution we would predict a blend of the two. Let's say the two have a 50 percent chance of occurring. We would predict a blend of the two, but in fact the blend is not probable at all, because if you played those two chords at the same time it sounds very dissonant, very bad. So it's one or the other.
>>: [inaudible] separation issue, source separation. So you get much more notes correct at the
same time and then you use RBM to model your distribution.
>> Nicolas Maurice Boulanger-Lewandowski: Yeah.
>>: At what level? At the note level or at the acoustic level?
>> Nicolas Maurice Boulanger-Lewandowski: At the symbolic level. You can do it for acoustics too. If you want to model, instead of having zed here, you have X; or if you have, for example, a spectrogram matrix for audio or for music or for speech, you can model this directly with the RNN-RBM and you have a generative model. But I will use this model for source separation later, and what it does is it models each source separately. So we have a prior when we want to make a separation. We know what each source is supposed to look like. So again, there will be experimental results later that explain really how this works. And yes, there's also the NADE, the tractable variant of the RBM, so we have an exact probability and we can replace the RBMs with NADEs. And this is very useful because now we have a joint model that is tractable, so we can train it with second order optimization methods.
>>: [inaudible] example of the application the training [inaudible] is provided when all the notes
are different every time [inaudible]. So you can considerably think about Gaussian network to
do these type of probabilities, these type of [inaudible]. So this is the [inaudible] model when
you use the RBM. So why do you choose this indirectly for your model rather than directly
[inaudible] question? [inaudible] prediction [inaudible] so you could do either way.
>> Nicolas Maurice Boulanger-Lewandowski: It can be both. This can be used for generation or just as a generative model. We can generate sequences of music with that. I have a few that I can show. But it can be used for prediction too. The thing is, if it's used for prediction, you can say why predict two modes if anyway in the long run we will make only one prediction, so we might as well just make one prediction right away, right? But it's still important to capture the different modes of the distribution because we don't know which one we'll end up picking right away. I think it will become more clear in the following.
Okay. So this was the RNN model, but only for zed; now we want to incorporate this with the acoustic modeling, so we want to have a model not only of zed alone but of zed given X. We can do it with an input/output architecture, which is the simplest conceivable way to do it. So we have the same thing, the same RNN at the top, and you notice that whenever there is a dashed line that is the prediction; but from now on I won't draw it every time, you should see it as if there was an RBM there and we predict the parameters of the RBM. So now I just draw a single arrow towards zed to say that we predict zed, but in fact we still predict the parameters of the RBM that describes zed. And so the X is now part of the input that is fed into the RNN, so the whole RNN state will now depend on the input acoustic observations. So now, instead of just describing the density of the output zed, we describe the conditional distribution of zed given X.
So what else? There's also the dotted line, W zed H, that I drew dotted because it's an optional connection. I call it the temporal smoothing connection. If you remove it then all the zed become conditionally independent from each other. But it is very useful because it forces the output to be consistent with itself. So zed 1, zed 2, zed T must be consistent. So in particular it will introduce some temporal smoothing if the output is supposed to be smooth.
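A hedged sketch of one step of this input/output architecture, with the optional temporal smoothing connection as a flag; the sigmoid readout stands in for the RBM/NADE parameters that would really be predicted, and all names are assumptions.

```python
import numpy as np

def io_rnn_step(h_prev, x_t, z_prev, p, use_smoothing=True):
    """Input/output RNN step (sketch): the state sees the acoustic observation x_t,
    and optionally the previously chosen output z_{t-1} (temporal smoothing)."""
    pre = p['Whh'] @ h_prev + p['Wxh'] @ x_t + p['bh']
    if use_smoothing:
        # dotted connection: feeding back the chosen z keeps the output self-consistent
        pre += p['Wzh'] @ z_prev
    h_t = np.tanh(pre)
    # stand-in for predicting the RBM/NADE parameters of P(z_t | x, z_{<t})
    z_probs = 1.0 / (1.0 + np.exp(-(p['Whz'] @ h_t + p['bz'])))
    return h_t, z_probs
```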
>>: Why does that do something that the RBM doesn't do?
>> Nicolas Maurice Boulanger-Lewandowski: Because for the RBM it’s only the distribution of a
single frame, a single time. So if we didn’t feedback the zed that we end up picking into the
RNN we wouldn't know which mode we picked. So I’ll just come back to this picture. So let's
say that we have a turning point, we have two options. If we pick one, say the first one, there was a 50 percent chance of it occurring. The RNN needs to know that we ended up picking this one. If not, the distribution at the third time step will still be uncertain about which mode we're in. But when we have feedback we know that okay, at the second time step we picked the first mode so we should stay in the first mode. So there's some form of temporal smoothing here. Otherwise we could just switch modes randomly just because there is a 50 percent chance.
>>: So how does this model differ from this [inaudible] is this about sequential RBM?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. It's very similar.
>>: But then training is much easier than this. It doesn't involve any [inaudible] through time.
It’s just using this sequence [inaudible] bias towards the next RBM.
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. But you need to back propagate, there is a
temporal RBM, there is [inaudible] RBM. So those are two different, but basically we use those
as a baseline because they're very smooth in that model and they don't have a separate hidden
layer. So at the H-hat they don't have this. Instead they just connect the previous H directly
into the zed but they use the mean field to provide it so it's deterministic. So in fact, this one is
a generalization of the RBM, and we have more power to describe the temporal dependencies
involved. So it can get quite better in fact on some tasks, but similar in spirit.
>>: Yeah. But I don't remember, earlier they emphasized the fact that the problem to solve
multimodals [inaudible] to solve.
>> Nicolas Maurice Boulanger-Lewandowski: Using the mean field approximation simplifies a
lot of the training procedure. Otherwise if they don't use the mean field it's very hard, they
have to sample I think and then the training is not very efficient.
>>: I think [inaudible] mostly to this is a generative model to do these [inaudible] sequences
unless you conditioned your new sampling on the previous input where your previous guess of
what the input was you wouldn’t get a smooth sequence.
>>: I see. [inaudible] for prediction as well?
>>: They didn't do a prediction.
>> Nicolas Maurice Boulanger-Lewandowski: It could be used for-
>>: [inaudible].
>>: Right. That would do like missing feature prediction [inaudible].
>>: I see.
>>: [inaudible] try to sample the new ones and stuff.
>>: I see. So this [inaudible].
>> Nicolas Maurice Boulanger-Lewandowski: And the idea was that if we have a separate RNN
we can pre-train it in a better way, so without the stochastic signal from the RBMs, and also we can replace the RBMs with NADEs, something that you couldn't do otherwise.
Okay. So with this input/output architecture there are many problems; some of them are fundamental, like the teacher forcing problem, and some of them are more something that happens in practice and limits the applicability. So the label bias problem occurs when there is a lot of smoothing involved, for example, and the RNN just uses the previous label as the most discriminative feature and pretty much ignores the acoustic observation. So if 90 percent of the time you just repeat the previously emitted symbol, the classifier will have a bias because it will always try to repeat this symbol and the contribution from the input will not be high enough. So you can regularize, it's an option that can work sometimes; regularize so as to make the weight coming from the previous label versus the one that comes from the input not too large and not too small. And there is also the realignment thing.
So it all depends on the temporal resolution that you pick for the time steps. So if you pick a coarser temporal resolution you increase the entropy of each conditional distribution. So you have less bias toward just repeating what comes before. So that can work, but it's still problematic. Also there's, it's related to, the probability flow problem: when you go into a path that is bad you will incur a cost for the probability at this time step, but all children of this node, so all the completions of this path, will still sum to one. So overall in the sequence you can still have a pretty high probability, and this is another problem. The other is the teacher forcing problem, which is that during training we have the perfect zed, the zed that we give as input to the RNN, so perfect training conditions. But in testing it's not the case. During inference we'll explore many configurations and try to pick the one that is the best. So we have to make sure that the RNN can model even the bad cases. And it doesn't have training examples for that, so you can train by adding noise, so a zed-tilde, or with zed star, where zed star is the prediction that you would have made at this time step if you didn't know the answer. So we can use that zed star as an input to the RNN during training. It's another strategy. But there are some problems with that.
So just to give an intuition of how this works: if you're training a model to drive a car and you're always at the center of the road in perfect conditions, then yes, the model will be able to predict where we are going. But if something bad happens and you're going off road, you won't necessarily know how to come back on track in the center. So you want to make sure that during training you go off track and you see these examples so you know what to do in those cases. So this is why it can be useful to add noise to zed. We just go off track and we see if we can recover.
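A small sketch of the two training strategies just mentioned for the teacher forcing problem: corrupting the ground-truth feedback with noise, or feeding back the model's own prediction zed star. The flip rate is an arbitrary illustrative value.

```python
import numpy as np

def corrupt_feedback(z_true, z_pred, noise_prob=0.1, use_prediction=False, rng=None):
    """Return the vector to feed back into the RNN during training:
    either the model's own prediction z*, or the true z_t with a few bits flipped
    so the model also sees "off track" situations."""
    rng = rng or np.random.default_rng()
    if use_prediction:
        return z_pred
    mask = rng.random(z_true.shape) < noise_prob
    return np.where(mask, 1 - z_true, z_true)  # flip a random subset of binary units
```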
>>: So when you talk just about [inaudible] so why this [inaudible] require this particular type of
[inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: [inaudible] version.
>>: I have a question. So the example that you said [inaudible], for example Z, 1 to the H, 1 and
then you copy this link, is the-
>> Nicolas Maurice Boulanger-Lewandowski: You mean the dotted arrow?
>>: [inaudible] H.
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. If we don’t have this dotted arrow then we
don't have any of these problems.
>>: Usually regular RNN formulation having the feedback from [inaudible] back to the increased
[inaudible] but it doesn't create any difference in terms of training, but now is the problem
actually due to the RBM here?
>> Nicolas Maurice Boulanger-Lewandowski: No. Even, it depends if you use a prediction or if
you use the actual random variable value for feedback. If you use only the prediction, so mean
field or expectation-
>>: I see.
>> Nicolas Maurice Boulanger-Lewandowski: Then you don't have a problem because it doesn't
depend on your choice of zed. But as soon as it does depend, and this is the goal here that we
want to make it depend on zed because otherwise if you don't have these temporal smoothing
connections you have to make some smoothing after the fact [inaudible] processing using
HMMs, for example. And that's what we want to avoid and replace.
>>: I have two questions. So can you, for example, have a link that is connected to Z, with H, 2?
And then the second question is can you use this the Z, 1, the prediction of Z, 1, as the
[inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. If you use the prediction of Z, 1 as feedback, the only thing it changes is the way it's trained, because otherwise it's completely equivalent to just having another hidden layer, because it's all deterministic. So if you just have a hidden layer that is between the two you don't have to call it the output in this case; you can just call it a hidden layer. But it's trained differently because you still want it to be close to your target, so that can work. Now for the arrangement of the arrows, there are many, many, many possibilities. You see now that the prediction is made using H, T minus one, but you can just take all the hidden layers and shift them by one time step to the right by just keeping the arrows attached. So now the prediction is made using the same time step, so H, T, but now you have to use the zed T minus one. So there are many configurations; the only thing that you don't want is having zed, T going to H and going back to zed T, which would just be useless because it would be trivial. So as long as the prediction only depends on the previous values of zed, T, and it can depend on all of X. That's the only thing. I don't know if that answered the question.
Yeah. So, the probability flow problem is something that was solved with conditional random fields for linear chains. Now it would be possible maybe, I was working recently on a similar model with RNNs, but it's a bit hard to train because with CRFs you have an exact dynamic programming formulation, but with an RNN it's not so obvious. So one very easy way to bypass this problem and make it work is what I call the hybrid architecture. It's a generalization of the HMM. So now you have the underlying sequence zed, which is the output, but we see it as the only random state, random variable. And then, conditioned on zed, T, we emit the observation X, T. So if I just go back to see the difference: here X is an input to the RNN, so on the upper right we modeled P of zed given X, but here we in fact model P of X given zed. Now in practice we'll invert this relation using Bayes' rule and by having a classifier that goes from X to zed. But the important part is that if we assume this model we can, to simplify, just multiply the two probabilities of the two models.
So the two models, the conditional probabilities, are P of zed given X, this is our acoustic classifier, so the term on the left of the product, and P of zed given A, where A is the sequence history. So P of zed given A is the purely symbolic model. So in this case we can just multiply the two. And this is true because we assume that we have independence and that the X, T are emitted given only the zed, T. In practice, and in theory too, it's not true, because X could contain, for example, a window aggregation of features; I said that X could contain anything, not only-
>>: Over here Z is the matching model, is the symbol.
>> Nicolas Maurice Boulanger-Lewandowski: Yes.
>>: Listen to what I'm saying. Now you make Z generate X, which is the observation, rather than having X be equal to H as in the normal case. So how do you justify that? I think I kind of got lost about the multiplication [inaudible].
>> Nicolas Maurice Boulanger-Lewandowski: In fact the multiplication is only the backward rationalization of why it's okay to just multiply the two probabilities of the two models. I think it came from that. So what we want to do is have our classifier that goes from X to zed, our regular frame-level classifier, and we have our symbolic predictor. We want to just multiply the two. Not to renormalize; it looks like a product of experts but it's not. There's no renormalization, it's just that we can multiply because we have the independence assumption in the graphical model. So that's the motivation. If we have this model we get an easy formula for it. But in fact the arrows, as I said, are backwards. So if zed is a complicated object, like a binary vector with many ones in it, then the classifier from X to zed would be a multi-label classifier, for example, so that we can have multiple labels at the same time, or many, many different binary classifiers, one for each unit.
So there is a problem in principle, which is that we count factors twice with this approach, because we just multiply the two probabilities. So if there is something in X, T that can give us any indication of what could come after, I can use this maybe. So in this term we assume that it's independent, but in fact inside the X, T there is some indication that lets us find zed, T minus one, T minus two, so find other zed, T's that are close. So if we have some zed, T's in there then we will count the same factors here as when we have our purely symbolic model. For example, if there is a phone transition that we know should occur because of the previous symbol, then this term will be included here, but it could also be included here if the acoustic window is wide enough to encompass it and we can recover zed, T minus one from it. Anyway, so it's a problem, but it eliminates the probability flow problem, so it's still very worthwhile to use that instead. In practice this is the one that works best.
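Written as a formula, the hybrid scoring described here multiplies, at each time step, the acoustic classifier's probability and the symbolic model's probability; this is how I read the slide, and the exact normalization may differ in the actual system:

```latex
P(z_{1:T} \mid x_{1:T}) \;\propto\; \prod_{t=1}^{T}
    \underbrace{P(z_t \mid x_t)}_{\text{acoustic classifier}}\;
    \underbrace{P(z_t \mid \mathcal{A}_t)}_{\text{symbolic model}},
\qquad \mathcal{A}_t = \{z_1, \dots, z_{t-1}\}.
```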
And finally, this is the one used in the source separation. So incorporating that into NMF I get this matrix factorization. If you're not familiar, we have our acoustic observation and what we're trying to do is the same thing as in sparse coding. So we try to find a dictionary W of basis elements that can be reused throughout the whole data set to explain the observations. And the activity matrix H is simply the coefficients in this new basis, basically.
>>: C is a spectrogram here.
>> Nicolas Maurice Boulanger-Lewandowski: Here?
>>: Yeah.
>> Nicolas Maurice Boulanger-Lewandowski: No. This is, the spectrogram is X. Here is the cost of the decomposition. So this cost, usually the cost would only be the [inaudible] norm squared, and we would find coefficients purely to reconstruct the observation. Now we can add sparsity, we can have temporal smoothing to try to have the coefficients here be smooth in time, and we can also have a full model of the density of H. In fact H here is the same as zed, so they are the same matrix.
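Roughly, the decomposition cost being described is a reconstruction term plus optional sparsity and temporal smoothing penalties, minus a log-prior on the activities given by the RNN; the weights lambda, mu, and alpha below are hypothetical hyperparameters, not values from the talk:

```latex
C(H) \;=\; \bigl\| X - W H \bigr\|^2
\;+\; \lambda \sum_{k,t} |H_{kt}|
\;+\; \mu \sum_{t} \bigl\| H_{\cdot t} - H_{\cdot (t-1)} \bigr\|^2
\;-\; \alpha \log P_{\text{RNN}}(H).
```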
>>: So does this mean that when you learn the dictionary you need to somehow feedback
information from the RNN?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah.
>>: Into MMF?
>> Nicolas Maurice Boulanger-Lewandowski: That can get complicated. What we do in the source separation is supervised source separation. So we train with isolated sources. So we use the activity matrices as targets and we learn the distribution of isolated sources. And now, once we have one RNN that describes each isolated source, we observe the mix, and then using those priors we know what each source is supposed to look like, and then we can add this extra term to influence the decomposition and assign each source into the right bin. Otherwise there are some problems that are just impossible without it. I'll show you examples of that later.
>>: So I have a question. So basically these RNN is a prior for the density [inaudible]. So is
there some relationship with [inaudible] divergence that you introduce the cost function
[inaudible] reconstruction?
>> Nicolas Maurice Boulanger-Lewandowski: Well the [inaudible] is a measure of how well we
reconstruct X. So our reconstruction is simply W, H. So the matrix-
>>: [inaudible] PDF as the reference, what is the reference for the PDF, the density?
>> Nicolas Maurice Boulanger-Lewandowski: it's a mix of, you mean in the source separation
context?
>>: When you have [inaudible] you have the opposite of density and reference density so that
the reference density-
>> Nicolas Maurice Boulanger-Lewandowski: So in the source separation context the reference
density is the mix and in fact all of this, well this term is for the full, for all sources together. So
we have the constraint that all sources must sum to the observation and then this RNN is only
for the individual sources. So the H here should be H for source one plus the same for source
two. But this one occurs only once.
So this is it for the model part. This was the longest section. Now inference. So as I said before, we want to find a zed star, the zed that is the most likely given X, which in most tasks is the output of our program. This is what we are searching for, the globally most likely. So just to give an intuition: let's say we are in this model, we make a prediction of zed, all of this is conditioned on X. We make a prediction of zed and then we have to commit to a decision. So let's say we pick the most likely configuration of that zed, and then we feed this back here and we continue, and we pick the most likely here, we continue; so this is a greedy algorithm. It doesn't guarantee that overall we'll get something good. What we want is a search for the full sequence that is consistent with itself. For example, here we might take a locally very bad configuration that has low probability, but in the long run it will allow us to be more consistent globally. So that's what we want.
So doing that is not easy. We can use a greedy algorithm, but in fact beam search is a generalization of the greedy algorithm where, instead of keeping only the one best candidate at each time step, we'll keep the W best, where W is the width of the beam. So here, for example, we have three candidates at this time step, in red. So we kept those three. And now what happens here is that we find all the children of these nodes. It's a tree search. So the children of a node are all the possible continuations of this partial sequence by one time step, so all possible configurations that can occur at time step T. We append that to the sequence and those are the children. And now we analyze all the many, many resulting children and we still keep only the W best ones. So that's beam search.
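A generic sketch of that beam search loop; `step_fn` is a hypothetical callback that enumerates (or, in the high-dimensional case, samples) candidate continuations together with their log-probability increments and the updated RNN state.

```python
import heapq

def beam_search(initial_state, step_fn, num_steps, beam_width):
    """Keep the W most likely partial sequences at each time step.
    step_fn(state) yields (log_prob_increment, output, next_state) candidates."""
    beam = [(0.0, [], initial_state)]            # (cumulative log-prob, sequence, state)
    for _ in range(num_steps):
        candidates = []
        for logp, seq, state in beam:
            for dlogp, output, next_state in step_fn(state):
                candidates.append((logp + dlogp, seq + [output], next_state))
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return max(beam, key=lambda c: c[0])         # best full sequence found
```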
Now in high-dimensional beam search the children are exponentially many because there are exponentially many configurations. This is in the binary case. So if each column is a binary vector it can have many ones in it, so we can't enumerate all of them; ideally we would like to enumerate them in decreasing likelihood, but for an RBM or for NADE it's not possible to do that. So what we do in the high-dimensional case is sample K elements and then find the unique configurations, find the K most likely; well, we will sample more than K, but until we have K of the most likely elements. So it doesn't guarantee anything, but the more we sample the more likely we are to find them.
>>: Can you explain here, how does this kind of search [inaudible] different sources [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: Okay. How it fits is that when we find the
likelihood, and the likelihood is the criterion that we use to see if a sequence is likely, so if we
try here to jump from something that doesn't make sense given the past, so something that is
not continuous, for example, we'll incur a large cost for that. Now in the long-term it might be
worth it, but every time we jump from one symbol to the next we have a high cost for that.
And this is given by the RNN density. The RNN will predict to stay the same state and if we
change we have a large cost here.
>>: You [inaudible] a minute ago you setting up the cost [inaudible] from one track to another.
>> Nicolas Maurice Boulanger-Lewandowski: But the constraint is implicit in the RNN model.
The RNN will just predict something that is similar to what it was before. If it's not it will have a
low likelihood. So implicitly this sequence will get [inaudible]. But we need
to have a sufficient beam width so that we can make some transitions that globally will still be
worthwhile.
So a big problem that we have with that is the beam saturation. So here I've made an example
of the top three most likely sequences, and you can see that they are almost the same except
only the transition time is shifted by one time step. So basically it's pretty much the same output, right? But in principle the RNN state is different so we have to consider it, but if it's very far in the past we don't need to consider all the possible combinations of all these little variations. So it will saturate the beam width, and the beam won't be very effective.
So what we do is prune the beam. The way to prune it is that we'll have, in general, some hashing function that assigns a hash to a sequence, but it's an approximate hashing function, and we have to design it so that equal hashes correspond to similar sequences that should be pruned. So what it means to be pruned is that from all the sequences that share the same hash we keep only one, the one that is most likely. So we are not sure if it's really the right one, but we keep this one. So an example of a hash function is that we would only keep the underlying sequence. So all the little variations in precise alignment will be lost, and if the emitted, underlying sequence is the same we consider it the same and we prune it. This is one possibility.
Another one that is very fast and works well in practice is to use only the previous time step as a hash. In fact when I say hash, it's a function of the previous time step. So what this means is that we have at most one sequence ending at each possible configuration. So in speech recognition, if these are the full labels, then for each possible full label we keep only one sequence ending at this label, the best one; so we prune a lot. This is a very strong approximation, but it makes it very similar to Viterbi and it works well in practice.
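A sketch of that pruning step: partial sequences are grouped by an approximate hash and only the most likely one per hash survives. Passing a hash of the last emitted configuration gives the fast, Viterbi-like variant; passing no hash function disables pruning. Names are illustrative.

```python
def prune_beam(candidates, beam_width, hash_fn=None):
    """candidates: list of (log_prob, sequence, state) tuples.
    Keep the best candidate per hash, then the beam_width best overall."""
    if hash_fn is not None:
        best_per_hash = {}
        for cand in candidates:
            key = hash_fn(cand[1])   # e.g. tuple(cand[1][-1]) for the last time step
            if key not in best_per_hash or cand[0] > best_per_hash[key][0]:
                best_per_hash[key] = cand
        candidates = list(best_per_hash.values())
    return sorted(candidates, key=lambda c: c[0], reverse=True)[:beam_width]
```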
>>: So my real question here is [inaudible] understood this model actually is the similar model
to this temporal [inaudible] RBM in terms of [inaudible] except you've relaxed the mean field
approximation in that model?
>> Nicolas Maurice Boulanger-Lewandowski: No. [inaudible] RBM [inaudible]. What I'm saying
is this one is a generalization of the [inaudible] RBM but this one is only a model of zed. So
there is no X at all in their model. That model is basically for human motion and videos, but the task is only predicting video, it's not annotating video, it's not-
>>: Hidden layer.
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. It's only a generation task or modeling task.
That's it. And the RNN-
>>: But you said if you were to use [inaudible] approximation for this model [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: In fact this is a generalization. So there is a way to come back to the RT-RBM, but you shouldn't do it because it's not more efficient and the results are poorer. And the way to do that is to impose that W, zed, H-hat is the same as the W of the RBM. So if you do that you see that H-hat would be exactly the mean field of the hidden units of the RBM, so you only have to compute it once. So this is what it comes down to; the RT-RBM is like a version of the RNN-RBM with tied parameters. So those parameters are tied, and I think the biases are tied too.
So what happens compared to the RT-RBM is that the hidden layer gets split into hidden units that model the RBM density and hidden units that model the temporal evolution in the RNN. So here we just split them and it's more flexible.
>>: So is there a reason why the [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: Here?
>>: No, the imprints, the [inaudible]. These parts. So [inaudible] there's no integrated
dependency. It’s in the sample?
>> Nicolas Maurice Boulanger-Lewandowski: No, but there is an [indiscernible].
>>: So there's a dependency. Is there a reason why the [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: But there is a backward pass. It's just that, if I use the pruning solution that makes it very similar to Viterbi, it's the same thing as Viterbi, so if you want to code it with a forward-backward you can, but you don't have to. The backward,
in fact, is only used to reconstruct your final output. But just to find the solution and its
likelihood you don't need to go backward. You can just do one forward pass and save all the
pointers correctly. But when you go backwards it’s just to reconstruct the output. But
conceptually I find it simpler to explain it just with one forward pass. And it works for Viterbi
too.
>>: In a backward [inaudible] the forward pass computes the forward [inaudible] and the
backward [inaudible] computes the [inaudible] so that in no, and one instance is of [inaudible]
for it so combination [inaudible].
>> Nicolas Maurice Boulanger-Lewandowski: We would like to go into more details [inaudible].
>>: For decoding or for [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: [inaudible].
>>: [inaudible].
>> Nicolas Maurice Boulanger-Lewandowski: We'll go in more details [inaudible].
>>: One more quick question. Does this method require you to know that the total number of
tracks are fixed?
>> Nicolas Maurice Boulanger-Lewandowski: Tracks?
>>: Yeah. Do you have to know-
>> Nicolas Maurice Boulanger-Lewandowski: So everything is mixed. But for the source
separation I don't use this algorithm because there is a simpler method and it's gradient
descent inference. And we can use that because the coefficients H, or zed, are real values. So we can simply take the RNN model and then differentiate, not with respect to the parameters but with respect to the visible layers, and then we'll find a globally optimal solution of the output by gradient descent.
>>: It’s automated.
>> Nicolas Maurice Boulanger-Lewandowski: Yeah, it's automated. So if I come back quickly to this pruning thing: it's a lot faster, and we can wonder how much precision we lose; it's surprising, in fact we gain some accuracy. This is with a chord recognition task, and the way we gain accuracy is because we can reduce the beam width very, very, very low. This is the accuracy. You see that we can get to a beam width of five or something ridiculously low. And if we don't prune we have to go very high in width, something like 1000, and even then we don't reach the full accuracy. So there's something about the hashing function that is very important: to prune correctly if we want to do beam search, otherwise you can see that it doesn't perform as well.
>>: [inaudible] correct decoding procedure, right?
>> Nicolas Maurice Boulanger-Lewandowski: Okay. So DP is with pruning. I call it DP because it's dynamic programming, if you like. And beam search is without pruning. So without pruning is this curve. You need to increase the beam width a lot, whereas if you prune you can reduce the beam width. So that's why it's faster: because even if you reduce the width very low you still have good performance; and the running time is directly related to the width, because if we have a beam width of 1000 it means that we need to keep track of 1000 RNN states. And so it's proportional to W.
Okay. So a small side note on sequence alignment. So if our outputs during training and during test are not necessarily aligned with the input matrix, so if only underlying sequences are available, what we can do is a version of hard expectation maximization: in the E step we find optimal alignments A star, so A star is the alignment that has the highest probability according to the current model. And in the M step we assume that this alignment is right and then we update the model parameters assuming this. So it's a hard-EM because in the expectation step, in fact it's not the real expectation; we don't compute the expectation, we compute the most probable element. So this is similar to how, if you have regular EM to train a Gaussian mixture model, then with hard-EM you go to a K-means model. So this is, by using the optimal alignments, hard-EM.
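A hedged sketch of that hard-EM loop; `model.best_alignment` and `model.fit_aligned` are assumed interfaces standing in for the beam-search alignment and the usual supervised training step.

```python
def expand(unaligned, alignment):
    """Map the unaligned symbols zed-bar through the u_t indices to frame level."""
    return [unaligned[u] for u in alignment]

def train_hard_em(model, inputs, unaligned_targets, num_iterations=10):
    """E-step: pick the single most likely alignment A* under the current model.
    M-step: update the parameters as if A* were the ground-truth alignment."""
    for _ in range(num_iterations):
        alignments = [model.best_alignment(x, z)
                      for x, z in zip(inputs, unaligned_targets)]
        aligned = [expand(z, a) for z, a in zip(unaligned_targets, alignments)]
        model.fit_aligned(inputs, aligned)
    return model
```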
Yes. So to find the actual alignments we use a strategy that is very similar to the one for inference, based on beam search. So what we want to find is the U, T indices that will just map from the unaligned sequence to the aligned sequence in time. So these go from one to U and we know that at each time step they can only increase by one or stay the same. So U is the number of emitted symbols since the beginning of the sequence: we start and we emit the first one, and then either we stay at the same one or we output another one, and so on. So if we want to find optimal alignments, A star, we use a similar strategy, so beam search; a full search is intractable but we can use beam search. And then the pruning strategy that is equivalent to the other one is, instead of keeping one, the hashing function would in fact be the ending symbol U, T. So it's the same thing, but it's important to use U, T. And also for each candidate alignment, so each partial sequence, we store of course the associated alignment history. So the RNN still depends on everything that came before. So there's the [inaudible] performing algorithm for this task.
Now there is a way, since we still have some time, to increase the speed of this alignment. In the first pass we'll make an approximate alignment A prime that comes only from the acoustic model, so you can just discard the symbolic model for that, because we know that anyway for the alignments, during training, we already have the correct output so we can't deviate too much from that. So the acoustic model alone will do a good job of finding the approximate boundaries for the transitions, the state transitions. So this is very fast; an exact algorithm by dynamic programming is possible in this case. And in the second pass we'll come back by assuming that the optimal alignment A star doesn't deviate from the A prime alignment by more than Delta steps, and this allows us to prune even more the space of admissible sequence candidates, so it's even faster. And it yields identical alignments to doing the full search in all cases, using a very low [inaudible] Delta. So what it does, in short, is just take the approximate alignment from the acoustic model and then use the RNN to refine the transition boundaries, to get an alignment that the model believes is even better.
All right. So I'll jump into the applications. There are some specifics for each one. Now for music, if we only want to model the sequences of polyphonic music, we only want to model zed. There's no audio, it's only symbolic form, so it's a musical score. Many notes can occur at the same time. What we'll actually model is the piano roll representation. So it's a binary matrix with time, and here it's pitch, the note number in fact, so the pitch. So if you look at one column you have 88 notes, one for each key of a piano keyboard, and it represents the notes that are active at the current time. So this is a little limiting as a representation because we exclude all score annotations such as dynamics, but it still has the two most important aspects of music, that is, the temporal dependencies, rhythm, and high-dimensionality. We want to predict, just as in the other tasks, the current time step given the past.
>>: Is it one of the conditions of [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: There is no competition for that, in fact, but that's why when we designed this task we chose a full high-dimensional representation that is very general, because it's a good benchmark for machine learning algorithms that have been used since. And because most work that tried to model music before did it in a reduced space, a reduced representation. So they tried to infer features that were psychologically relevant or something like that and then model that space, but it doesn't guarantee that we'll actually be able to generate music, or that the feature space is actually relevant. Here with the RBM we'll try to discover the space that is the most relevant possible to describe the full input. So something that people could do before was model only a sequence of chords, so premade chords like C major, D major, with one melody note at the same time. So as you can see this is low-dimensional, because you only have a limited number of possible chords and it's only premade chords. So you don't necessarily generalize well to other cases; here, if you have a chord but with some added notes and some removed notes, then you're more flexible with that.
So in this work we implemented many popular models of polyphonic music. So you see that the first one is the simplest one, the previous-frame Gaussian, which simply predicts exactly the same thing as the previous frame, with a Gaussian, so the probability would be centered around the previous frame. We have some N-grams with smoothing and backoff; the N-gram here models patterns of notes, so it's an N-gram of patterns, but for the note N-gram we would have one N-gram for each possible pitch, though it could be only a binary N-gram. So this model is very often used. It's often used as an implicit model for smoothing, just smoothing each note independently with an HMM.
>>: So what exactly is the input here? I guess music as well as-
>> Nicolas Maurice Boulanger-Lewandowski: There's no input. It's only modeling.
>>: It's only modeling.
>> Nicolas Maurice Boulanger-Lewandowski: Like this whole piano roll, so-
>>: But building up to it does it get the music or does it just get the notes?
>> Nicolas Maurice Boulanger-Lewandowski: There's no notes here.
>>: So just at the end.
>> Nicolas Maurice Boulanger-Lewandowski: So in the notation I introduced, it's the output zed, and it's this whole matrix.
>>: But it does get the previous score, right?
>> Nicolas Maurice Boulanger-Lewandowski: Yes.
>>: Okay. And it gets there correct previous.
>> Nicolas Maurice Boulanger-Lewandowski: Yes.
>>: And how much, how many notes in the past does it go, does it get?
>> Nicolas Maurice Boulanger-Lewandowski: It gets everything that came before.
>>: Everything. And is it predicting just what the next note is going to be or does it have to go
like two or three notes in the future?
>> Nicolas Maurice Boulanger-Lewandowski: So in fact it's a very general model. To compete in this task what we need to do is build a probability distribution over this matrix. That's it. Now, with an RNN it goes through it sequentially, and we'll first have the probability of the first column, and then if you multiply this probability by the probability of the rest of the sequence recursively, then you get the right value. Any probability distribution can be expressed in this way, like P of zed, where zed is the whole sequence. You can express it as P of zed, 1 times P of everything that comes after one, given zed, 1. So the given zed, 1 means that the RNN knows what comes before when predicting zed, 2, because we've already taken into account the probability of having the correct zed, 1 in the first frame.
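The factorization he is describing is just the chain rule for the sequence distribution, with each factor supplied by the RNN's conditional prediction:

```latex
P(z) \;=\; P(z_1)\,\prod_{t=2}^{T} P\!\left(z_t \mid z_1, \dots, z_{t-1}\right).
```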
>>: So in the next slide when it says there's 40 percent accuracy that means 40 percent
accuracy in predicting the very next note?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah.
>>: Okay. Given the correct history.
>> Nicolas Maurice Boulanger-Lewandowski: The real metric is really this one.
>>: Why would that be the real metric?
>> Nicolas Maurice Boulanger-Lewandowski: That's the log-likelihood. So that's the one that we're trying to optimize in training by maximum likelihood. This one is supposed to be musically relevant, but-
>>: Is it called like an end model prediction's perplexity [inaudible]?
>>: [inaudible] given a sentence on your model versus some other models.
>>: Now the difference between multimodal prediction is that you get multiple notes coming
out rather than [inaudible].
>> Nicolas Maurice Boulanger-Lewandowski: For this task, yes. So it's high-dimensional because if you were to just enumerate the possible configurations it would be too large. So you have to find something else. So the accuracy is the [inaudible] of all accuracy, and it's the expectation of that accuracy. So if we emit a conditional distribution we can compute the expectation of the accuracy under that conditional distribution that we predicted. So each model has been evaluated with the same mathematical definition, of course. But pretty much all models use the same sequential strategy of predicting one time step and then assuming that it's the one when finding the next.
>>: So it’s [inaudible] that will work in terms of accuracy. Up here it’s not really. There’s no
continuous solution of the acoustic. There’s no audio here. So what does [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: So even if there's no audio you can still make an HMM and GMM just to predict the symbolic-
>>: Oh, that could be [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: It’s not the same but if you would just try to find a
hidden state that doesn't correspond to any observable quantities ever, but-
>>: GMM [inaudible] Gaussian's [inaudible]? It doesn't make sense. It's Gaussian.
>> Nicolas Maurice Boulanger-Lewandowski: It's all continuous, but you can still say that it's a Gaussian, so the mean would say that it's closer to one than zero. So it's just a form that is Gaussian, but you're right.
>>: That's a very good question. I want to make you aware, Nicolas, that we have about 10
minutes. You’re scheduled to end at noon. Make sure you have enough time to at least give an
overview of the breadth of your results.
>> Nicolas Maurice Boulanger-Lewandowski: Okay. I'll give a brief overview. So for polyphonic transcription, if you're not aware of this task, it is starting from an audio signal, an MP3, and directly outputting the musical score, which is very interesting from a musician's point of view. And here we have the piano roll representation that is useful as an intermediate step for that. So the input/output RNN-NADE does very well on that. So you see here we only use NADE, not RBM, because in the inference step we want to be able to compare probabilities of candidates, but with the RBM it's intractable, so we don't even have a probability. NADE solves that problem.
>>: So do you have a sort of concise, [inaudible] I would've thought that LSTM or BLSTM was
really good, well suited for music because it’s sort of based things that I've heard [inaudible].
So a short version, we can talk more where you’re getting the gain [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: I haven't tried LSTM personally, but I know it has been applied to onset detection, very similar to that, but the task is only evaluated on the onsets, so the beginning and ending of notes. Here the smoothing is less important. It's a frame-level evaluation. But LSTM could in fact be integrated into that; in fact it should. Now an interesting point with that is that it is very robust to noise. And a reason is that, well, of course with the model we can learn the temporal evolution and we have like a learned musical [inaudible] model that acts as a prior for the transcription. So whenever we have noise, just as in speech recognition, this symbolic model becomes very important to fill in what is missing or what is too noisy or ambiguous. And also having conditional RBMs is very important, because with noise we don't know which mode we want to jump into right now. We have to wait until we are later in the sequence to backtrack, so it's important to describe the different modes. If the audio is very clean there won't necessarily be many modes. The first mode will be very dominant, but the more noise we have the more it becomes multimodal, because we don't know in which mode we are purely from the audio. So the curves illustrate these principles, in short, that we can do better. The advantage becomes even higher at a high level of noise.
So this is my example for source separation. If we want to separate this kind of source from this one, and now you have to imagine them being mixed together, so just added, like this in fact, if we use only NMF it's impossible to separate them, purely impossible, because if we only look at the content of each frame it's completely insufficient to know which source it belongs to, because each frame looks exactly the same. So we need to have some temporal constraints. And if we integrate temporal constraints then we can solve this problem. As you can see it's not perfect even though it's a very simple problem, but we can be in the right ballpark at least. So this uses the gradient descent inference that I mentioned.
>>: So [inaudible] as a prior [inaudible]?
>> Nicolas Maurice Boulanger-Lewandowski: Yeah. So this is for audio separation, but for real world data sets we can do a lot better, in fact, than the baselines, which are only NMF, and NMF with smoothing, that are often used for source separation. So if you listen to the results of source separation it's still very bad, not very usable in any real application, because there are a lot of artifacts. So there are artifacts because when we force the audio to be constrained to a single source it looks more like this source, but it will also have artifacts, so many things that sound weird.
>>: So the RNN, RBM is that learned? Is that using the RNN in conjunction with NMF and
learning the dictionary as we talked about before or is that just the RNN as in the previous
overhead or is that just the RNN, RBM by itself?
>> Nicolas Maurice Boulanger-Lewandowski: It's using the NMF dictionary.
>>: It is. Okay.
>> Nicolas Maurice Boulanger-Lewandowski: It should be NMF, RNN, RBM. Everything is based
on NMF, but we use this prior or not.
>>: I see. And what's SAR?
>> Nicolas Maurice Boulanger-Lewandowski: It's the signal-to-artifacts ratio. So as you can see we are worse in SAR. So it means that we have more artifacts than other models. The separation quality is better, and this is like an overall measure that is the trade-off between artifacts and the quality. So overall we do better. So we can look at the SDR more.
>>: And how do you measure, what’s SING, S-I-N-G?
>> Nicolas Maurice Boulanger-Lewandowski: So these are the accompaniment and singing
tracks. So these are two tracks that we want to, in this data set we separate karaoke songs.
>>: So this is your combination?
>> Nicolas Maurice Boulanger-Lewandowski: It’s a public data set.
>>: So if you use this for like the Pascal Challenge how would it do?
>> Nicolas Maurice Boulanger-Lewandowski: I don't know. So this one is actually for the MIREX
competition. It uses the MIREX data set and it’s for chord recognition so it’s very similar in
spirit to speech recognition. So it is in fact the second best that I've seen in the literature. So it's very competitive with the state-of-the-art. It's based on the system I've presented. And for speech recognition, well, for this one the benchmarks are already higher. So it's a little tougher. So what we did, and it was in fact last summer during the internship here, is compare, just compare, a baseline HMM with an RNN phonetic model. And we can see that even a strong deep neural network can be improved by replacing the HMM with the RNN.
And the improvement is more significant than what you get, for instance, with CRFs. So this is encouraging. Another thing is that the RNN can be used in conjunction with other classifiers. So I know that, for example, this particular DNN is not the strongest. For example, if you use Dropout you can gain like three percent accuracy on that, and this can also be tried with the RNN. So these results are encouraging for now. And also, if we combine this with a word language model, we see that we can still get an improvement. So it seems useful to have an RNN model to replace the HMM, even in complementarity with word language models that already model the temporal evolution. So by modeling the phones, the sequences of phones, it can still be useful in itself, it seems.
I guess I have a few research perspectives. So I mentioned the gradient descent inference; this can be used as an approximation even if we don't have a real-valued vector. Then it could be useful to have a different training procedure by having back propagation through the inference. So if the inference itself is derivable, differentiable, it's like we would optimize directly the zed star, the one that we find during testing with inference, so that's a very promising option I think. Purely for RNN training there is some current work with stochastic methods similar to Dropout, and active learning, so trying to take a few examples at a time. Metronome intermediate targets would be to get a better temporal description for music. For speech recognition it would be having an end-to-end system instead of combining a phone language model with a word language model in two separate steps. And also another very interesting one would be to use the RNN-RBM to model the context-dependent phones. So with this we would be able to treat phones or context-dependent states with the RBM. I'm not very clear, but using the RBM to capture or discover what could be the useful [inaudible] phones. That's it. Thanks also to the co-authors of this work, and thank you.