>> Jasha Proppo: Hi. I'm Jasha and we're welcoming Jonathan Le Roux here
today to give us a talk. Jonathan has degrees from various universities in Paris
in Mathematics, differential equations, stochastic processes and lately he's at the
University of Tokyo where he's studying computational auditory scene analysis and
speech processing. It's a pleasure to welcome you here today.
>> Jonathan Le Roux: Thank you.
>> Jasha Proppo: Talking about a topic that many people have dashed
themselves upon trying to solve and hopefully we'll see some good work.
>> Jonathan Le Roux: Thanks. So thanks for having me here. It's a pleasure
to be here. And I'll be talking today about -- well, you could say it's about phase, but certainly not about power.
>> Jasha Proppo: Jonathan...
>> Jonathan Le Roux: Yeah, let me start the slides.
>> Jasha Proppo: Yeah.
>> Jonathan Le Roux: Okay. So the (inaudible) of the talk was trying to be a bit funny and/or provocative: power is not everything.
So we'll talk mainly about two things I've been working on this year. And they have a sort of common point, which is the fact that they don't use the power domain, or they try to cope with issues that come up when you work in the power domain.
So the (inaudible) motivation of this talk is -- well, as I'm sure you are aware, many, if not most, (inaudible) processing methods work in the power time-frequency domain. So we like power spectra and we use them to do noise canceling.
So source separation, decomposing signals using nonnegative matrix factorization, for example, which has been done very often recently. We can use them to do modifications to a signal, like pitch-scale or time-scale modification. We also use the power time-frequency domain to perform multi-pitch estimation, for example.
And so these are the applications. Usually, how we do things is we like to design time-frequency masks, which can be binary or continuous, and then we use some sort of (inaudible) filtering to get back an output which is supposed to be cleaner, or, well, where there's only one source inside; so for example when you start with a mixture, that is what you want to do.
The thing is, by working in the power domain we sort of (inaudible) -- we discard phase information. And people usually do that because phase information is believed to be harder to model, first, and also you could argue that the human ear is sort of phase blind, so it would make sort of sense to not look too much at phase information.
Let -- yeah?
>> Question: It's not really -- I mean we can distinguish -- surely there's a sense in which that is not true? I mean, if you scramble the phase of the recording it sounds different.
>> Jonathan Le Roux: Yeah. I will show that. So it's actually not exactly true, but people say that the ear is (inaudible) phase blind based on psychoacoustic experiments, trying to see if we can tell the difference between two vowels which have been pronounced several times and which have a different phase -- can you tell the difference between them or not. Yeah. I mean, I'm not really (inaudible) about that, but I think people have been looking into that.
But I think for signal processing, obviously, we're not phase blind to the extent that we could scramble the phase of the short-time Fourier transform, and that's exactly what I'm going to talk about today.
So again, the second part of the motivation is that this actually raises issues, and there are mainly three of them. First there is resynthesis: if you want to resynthesize, to come back to the time domain, then you need phase, and the phase information is missing. If you have the mixture, then you could use the phase of the mixture, but it is only approximately correct and it can create artifacts. So the idea is that if you have a bad estimate of the phase information and you try to couple it with your magnitude or power estimate, then it will most probably lead to artifacts which you can perceive.
The second issue, which we usually don't look at too much, is that additivity of powers is only approximately true. Many people who work in the power domain usually assume that the power of a sum is roughly equal to the sum of the powers, which is not true, because the cross terms, even if they are 0 in expectation, can be non-zero almost everywhere.
And I think -- some people say that because they are 0 in expectation you can assume they're 0, but a better way to justify this approximation is to say that the signals are sparse, so there's only a small chance that they will overlap in the time-frequency domain. In most bins either of them is going to be 0, and in that case the cross term is, of course, indeed zero. But still, this approximation of additivity of powers is only as true as the sparsity assumption is.
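To make the cross term explicit (this is standard, not something from the speaker's slides): for two complex STFT coefficients $X_1$ and $X_2$ of the sources in a given time-frequency bin,

$$|X_1 + X_2|^2 = |X_1|^2 + |X_2|^2 + 2\,\mathrm{Re}\!\left(X_1 \overline{X_2}\right),$$

so additivity of powers holds exactly only when the cross term $2\,\mathrm{Re}(X_1\overline{X_2})$ vanishes, for example when at least one of the two sources is (near) zero in that bin.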
>> Question: There's also --
>> Jasha Proppo: Yeah...
>> Question: -- another factor here where your two sources need to have
almost equal power before the cross term is really important.
>> Jasha Proppo: Yes.
>> Question: Because it's related -- it's related in magnitude to the smallest
source contributing to time frequency.
>> Jasha Proppo: Uh-huh. That's right.
>> Question: So --
>> Jasha Proppo: Selecting better --
>> Question: What's that?
>> Jasha Proppo: It's --
>> Question: Better than (inaudible) --
>> Jonathan Le Roux: It's better, yeah. You're right. But still, if you want to do things really cleanly then you need to think about this, I mean, take this potential issue into account. And in some situations, phase may actually be a relevant cue, and throwing it out means that you're not using everything you could use.
For example, in electronic music, samples are exactly reproduced -- really, like you're playing a keyboard and the exact same sample is reused with different amplitudes. So it's not only the power spectrum, it's really the phase information that is exactly the same. So maybe you could use that, as well.
And this might be true, as well, for real instruments -- maybe for piano, maybe some percussive instruments, as well. I mean, we have good reproducibility: if you hit the drums pretty much at the same place every time, you can expect that the waveform is going to be reproduced quite well. But all this needs more investigation.
And this is actually not only true in audio, but in neural data; we had some extracellular recordings, and in that situation, also, you have good reproducibility, so phase is relevant there, as well.
So all these are motivations to investigate models and/or structures either in the
complex time frequency domain or the time domain. And this talk is going to be
on both. So the first part will be on complex time frequency domain and the
second on the time domain.
Okay. So let me start. Very good. The first slide is going to be -- I'm going to talk about what we call consistency, which I'm going to define. So, just to illustrate a problem here: what if you want to resynthesize a signal from a magnitude spectrogram? A very classical problem which has already been looked at many times, as early as the end of the '70s and early '80s by Griffin and Lim, for example, the most famous method.
So if you have a power spectrum what phase information should you use to get
back to the time domain? So if you had a good phase then this is what you
would get.
>>: Do you understand what I'm trying to say?
>>: That's the absolute correct --
>> Jonathan Le Roux: That's a good phase. If you scramble the phase randomly, here is what you get.
>>: Do you understand what I'm trying to say?
>> Jonathan Le Roux: So it's obviously not very nice. Well, it's --
>> Question: Can you describe how you scramble the phase for this?
>> Jonathan Le Roux: I use random uniform -- random numbers between 0 and 2π for the phase.
>> Question: Are there any windows?
>> Jonathan Le Roux: Yes. So this is a short-time Fourier transform spectrogram, and I used square-root Hann windows for analysis and synthesis and 50% overlap. It's actually not -- you would expect even more artifacts. It already sounds pretty bad.
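As a rough illustration of the demo just described -- a minimal sketch, not the speaker's code; the window choice and parameters are assumptions based on what he says here:

```python
import numpy as np
from scipy.signal import stft, istft

def scramble_phase(x, fs=16000, nperseg=512, noverlap=256):
    """Keep the STFT magnitude, replace the phase with uniform random values."""
    win = np.sqrt(np.hanning(nperseg))          # square-root Hann, 50% overlap (assumed)
    _, _, H = stft(x, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
    phase = np.random.uniform(0, 2 * np.pi, H.shape)
    _, y = istft(np.abs(H) * np.exp(1j * phase),
                 fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
    return y
```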
>> Question: (Inaudible) -- sound okay.
>> Jonathan Le Roux: --
>> Question: Do you understand what I'm trying to say?
>> Question: Yeah.
>> Question: It seems with the low frequency domain it gets really -- I don't
know.
>> Jonathan Le Roux: So that's the --
>> Question: If you have a continuous phase (inaudible) --
>> Question: You mean --
(talking over each other)
>> Jonathan Le Roux: Okay.
>> Question: So he chose a random two-pole all-pass filter for every window. It would be smooth over phase.
>> Question: Then probably sound like it should.
>> Question: I don't know.
>> Jonathan Le Roux: You mean using a different signal as an example, or using a different way to randomize the phase?
>> Question: Different way to get --
>> Question: Different way to randomize the phase.
>> Jonathan Le Roux: Okay.
>> Question: But we have to talk about things that you could actually apply without having the original phase, and that's a little bit harder than just --
>> Jonathan Le Roux: Yes.
>> Question: -- scrambling the phase.
>> Question: Oh, you're right.
>> Jonathan Le Roux: Um, well, if I didn't have any phase information, I could
use random phase and that is what I did.
>> Question: Or you could use (inaudible) phase and then --
>> Jonathan Le Roux: Right.
>> Question: -- and you could get the perceived pitch out there at the analysis window rate.
>> Question: In random phase, assuming everything is unphased (inaudible) --
>> Question: Yes.
(talking over each other)
>> Question: What's that?
>> Question: Perceived (inaudible) at about four, All right, something like that.
No, I mean it may be related to the rate at which he was doing analysis.
>> Question: Yeah. Maybe to --
>> Jonathan Le Roux: Related to the overlap, you mean? Yeah. So that's -- that's even when you have a power spectrogram which comes from a real signal. But maybe something more interesting is that if you knew how to do that efficiently, maybe you could, like, time stretch and then reconstruct in some way to get slowed-down music, which is often done. Or you could cancel some (inaudible) or separate speakers, and you could do anything you want, like draw a spectrogram. It could be fun.
So the key idea here is what happens if you use the wrong phase and you couple it with your magnitude spectrum. You start with a good signal, you take its short-time Fourier transform, you only keep the magnitude, you use another phase, and you get a signal back. But if you look at the magnitude spectrogram of that new signal, it's actually different from the one you had before.
So if you don't take care of this, you might be spending a lot of time designing really cool magnitude spectra, and when you resynthesize you just lose part of the job you did. So the idea would be to try to use a phase which doesn't make the spectrogram change -- to sort of ensure that the spectrogram stays the same, so that the magnitude and the phase are consistent with each other. That is what I would call consistency here.
And we're going to do this by deriving a general criterion for consistency, and we'll more specifically talk here about short-time Fourier transform spectrograms.
Okay. So first, a short review -- I don't need to tell you much about it, but the STFT is very often used for time-frequency analysis in audio, and it's expected to be invertible. So if you start with a signal and you perform the STFT, you get a time-frequency representation, and you expect that by performing what we call the inverse STFT you get back the same signal.
In the two boxes I will review in the next slide, we use windows to do analysis and synthesis. To actually get the same signal back, you need to have some constraints on the windows, which are called the perfect reconstruction constraints. And the cool thing with the STFT is that it's a linear operator, so additivity is still true in the complex time-frequency domain. So it's a good domain to work in. There's no approximation.
Okay. So a little bit more on the relation between the short-time Fourier transform and its inverse -- and when I say inverse, I mean the overlap-add version. Obviously (inaudible) that procedure.
So there is an exact equivalence between a waveform and its STFT when we have the perfect reconstruction constraint. So you can go from one to the other and back. If you start with a signal, you perform the STFT, you have a spectrogram, and with the constraint you can get back here. And you have a correspondence between time signals and STFT spectrograms. But actually, as there is overlap between frames, the dimension of the space where these STFT spectrograms live is actually bigger.
So if you have 50% overlap, you have twice as many coefficients. So you could imagine that -- so not every set of complex numbers is actually coming from a time signal. To ensure that the set of complex numbers you have on the right actually corresponds to a time signal, it must satisfy consistency constraints.
Okay.
So the problem is that if you start from such a set and perform the inverse STFT, you indeed get a time signal back, but if you then perform the STFT you don't get back to the same thing. So the idea would be that the consistent spectrograms are the ones for which, when you start from the right, go left, and come back right again, you get back to the same thing.
Okay. Why is that important? First, if you are designing an algorithm which works in the complex time-frequency domain -- you could do source separation or many, many different things -- then it would be cool to be sure that each time you have a spectrogram, the spectrogram you're working on actually corresponds to a time signal. So you could use it as a constraint to lower the indeterminacy of your problem. In that situation you could use the criterion I'm going to present as a sort of cost function in your optimization algorithm.
For example, if you're trying to separate sources you could add a penalty for each of the two separated spectrograms to be consistent.
If you're working in the power domain, then you need to find the best corresponding phase, and in that case you could use the criterion as an objective function on the phase: you try to optimize the phase so as to minimize the consistency cost and get back a good estimate of the phase.
Okay. So I think everybody knows this, so we'll be pretty quick. For the perfect reconstruction constraint: you build a spectrogram with an analysis window and an FFT, you shift, same thing, shift again, same thing. So you have several frames -- that's the analysis by STFT. Now synthesis: again, inverse FFT, you get short-time signals, you apply a synthesis window and you add them, and you get your overlap-add time-domain signal. And for the signals on top and bottom to be the same, we have constraints on the analysis and synthesis windows, which basically can be summarized by saying that the composition of the STFT with window W and the inverse STFT with window S should be the identity.
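In the usual notation (not necessarily the notation on the slide), with analysis window $w$, synthesis window $s$ and frame shift $R$, this perfect reconstruction constraint is the weighted overlap-add condition

$$\sum_{m} w(t - mR)\, s(t - mR) = 1 \quad \text{for all } t,$$

which is satisfied, for example, by square-root Hann analysis and synthesis windows at 50% overlap.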
Okay. Now if we start from the middle, so we start from the time-frequency domain: we can perform overlap-add and we get a time-domain signal. And if this set of complex numbers H was actually coming from a real signal, then, because the windows are supposed to respect the perfect reconstruction constraint, by doing the inverse STFT you would get that signal again, and by doing the STFT again -- I guess you got it now -- you would get the same H. So the constraint is just writing down that the composition of the ISTFT with window S and the STFT with window W is the identity on the H.
Okay. And so the cool thing is that you don't actually need to go back to the time domain to do that. You can write down everything in the time-frequency domain, because it's just a linear operator from the space of complex coefficients of size M by N to itself. So if you define the operator F, which is the composition of the STFT with the ISTFT, minus the identity, you are just looking for spectrograms, or sets of complex numbers, which are in the kernel of that operator.
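A minimal sketch of that operator and of the consistency criterion used later in the talk (my own illustration, not the speaker's code; the window and frame parameters are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_cost(H, fs=16000, nperseg=512, noverlap=256):
    """||F(H)||^2 where F(H) = STFT(ISTFT(H)) - H; zero when H is a consistent spectrogram."""
    win = np.sqrt(np.hanning(nperseg))          # sqrt-Hann analysis/synthesis, 50% overlap (assumed)
    _, x = istft(H, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
    _, _, H2 = stft(x, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
    F = H2[:, :H.shape[1]] - H                  # residual of the linear operator (frames aligned by slicing)
    return np.sum(np.abs(F) ** 2)
```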
And you can actually write it down mathematically so it looks like this. To give you an idea of what it is actually computing: if you have 50% overlap, to compute the value of the image of H by F at frame index m and frequency index n, you basically need to perform a weighted sum over the overlapping frames around that index and over all frequency bins. So the sum only spans the overlapping frames. Okay.
And it's sort of, it's almost -- yeah?
>> Question: So how constraining is that? Does that mean if I have magnitude spec -- STFT from a signal --
>> Jonathan Le Roux: Uh-huh.
>> Question: And I delete one of the bins, it's perfectly reconstructible from the
bins that remain?
>> Jonathan Le Roux: Um, if you only delete one then I would say it is, but I haven't analyzed it yet.
>> Question: How redundant is the --
>> Jonathan Le Roux: I mean, depends on -- it's half of them, so --
>> Question: Half of them. You could do the other half?
>> Jonathan Le Roux: No, it's -- I --
>> Question: Yeah. Intelligently select some path.
>> Jonathan Le Roux: Yeah. It's --
>> Question: You couldn't delete half because --
>> Question: It's actually a factor of two redundancy (inaudible) --
>> Question: Yeah. So if you dropped half the frames you might be able to still get back to the original signal.
>> Question: Maybe if it -- reach out to the (inaudible) I would imagine.
>> Jonathan Le Roux: Yeah.
>> Question: It's the same rate, it's the same all --
(talking over each other) --
>> Question: The equations just because (inaudible), I mean it's just a set of equations defining the consistency, right? So some of the equations won't match until (inaudible) --
>> Question: Or it is linear, so is there just one value that's correct once you've
deleted the bin?
>> Jonathan Le Roux: Mmm, let me think about it.
>> Question: I guess you're just doing --
>> Question: It's complex.
>> Jonathan Le Roux: It's complex. Complex picture.
>> Question: Right.
>> Jonathan Le Roux: So you have double the dimensions, so you need to think about the number of unknowns and the number of equations you have. So if you have only one unknown, basically you can definitely recover it. But I need to look into how far you can go.
>> Question: It probably also depends on what window you're using and overlap and --
>> Jonathan Le Roux: Maybe not too much on the window; it depends a lot on the overlap.
>> Question: The window has a 0 at the end.
>> Jonathan Le Roux: Yeah, okay. You don't want to have --
>> Question: If your window has no zeros -- now, after I asked the question -- you can drop every other frame and get back the original signal.
>> Jonathan Le Roux: Yes, yes, yes.
>> Question: For sure?
>> Jonathan Le Roux: Yes. So it depends on how you drop the bins.
>> Question: Okay.
>> Jonathan Le Roux: But I guess the most you can drop is half.
>> Question: Okay.
>> Jonathan Le Roux: So, yes. The idea is actually to be able to do these kinds of things. And it depends on the overlap: the more you overlap, the more constrained it is.
Okay. So it's almost a convolution; I will not say too much about it. Basically you have some weights which span the overlapping frames, so if you have 50% overlap, you have three frames of bins, centered around the bin you want to compute. And if you have a spectrogram, the idea is just that for this bin here, you multiply everything by the weights and then sum up. So in the complex domain you have, say, 0.5 times this complex number, then you sum everyone up, and at the end -- this is taking too long, sorry -- you should get 0. And you do that again for the next bin. The only thing which makes it not a convolution is that you have a different set of weights for odd and even bins, in the 50% overlap case. If there's more overlap, then you would have three different cases and so on. So it's not exactly a convolution, but it's really close to a convolution.
Okay. So the interesting thing is that the weights are concentrated near the center. For example, here we had a Hann window for analysis and a rectangular window for synthesis and 50% overlap, and it looks a bit like this (I only show the significant ones). This allows us, if we want, to simplify the residual in two ways. First, we can notice that in the image most of the contribution comes from the bin itself, the same position in the original spectrogram. And nonessential coefficients could also be neglected here, so if you want to do things faster, you could maybe drop some coefficients. But you don't have to.
And so you could also optimize the windows to have a better concentration, so that your approximation is actually better. For example, you could try to maximize the L2 norm inside the central block. And actually if you do that (inaudible), you end up with something which is close to the square root of a Hann window. And if we use square-root Hann and do it again, it's much more concentrated, so in the phase reconstruction experiments we did, we used square-root Hann windows.
Okay. So now we applied this to phase restoration. So we have a consistency criterion -- this is actually a pretty big point because it's maybe the main thing of the talk. You have this linear operator; you just take its norm -- you could take any norm, for example the L2 norm -- and it can act as a consistency criterion. It's supposed to be zero if your spectrogram is consistent.
So why do you need to know this? Okay, I erased the slide, but I can explain to you why the L2 norm is a good idea. It's because of the algorithm by Griffin and Lim, which reconstructs the phase from the magnitude by going back to the time domain and again to the time-frequency domain, keeping the phase, then taking the magnitude you want, and doing the same thing again and again -- so trying to update a phase which fits your magnitude best. And this is actually -- if you try to minimize this criterion with the L2 norm, it corresponds to doing exactly the same thing as they do. So the distance that they show their algorithm minimizes is intimately related to that criterion, so at convergence they are equivalent -- that is the message. It's slightly different, but they are supposed to be (inaudible).
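For concreteness, a minimal sketch of the Griffin & Lim style iteration just described (my own illustration, not the speaker's code; the window choice and parameters are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

def griffin_lim(mag, n_iter=100, fs=16000, nperseg=512, noverlap=256):
    """Iteratively estimate a phase consistent with the magnitude spectrogram `mag`."""
    win = np.sqrt(np.hanning(nperseg))
    rng = np.random.default_rng(0)
    H = mag * np.exp(1j * rng.uniform(0, 2 * np.pi, mag.shape))   # random initial phase
    for _ in range(n_iter):
        _, x = istft(H, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
        _, _, H2 = stft(x, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
        H = mag * np.exp(1j * np.angle(H2[:, :mag.shape[1]]))     # keep target magnitude, adopt new phase
    return H
```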
Okay. And where am I? Yeah.
>> Question: So you mean they would have the same stationary points?
>> Jonathan Le Roux: Yes, yes.
>> Question: Are they both guaranteed to converge to some global optimum or
local optimum?
>> Jonathan Le Roux: Um, the --
>> Question: It makes a difference if they tend to go to different local -- you know --
>> Jonathan Le Roux: Yes.
>> Question: I guess you're taking L2 norm of the linear operator.
>> Jonathan Le Roux: Which is good -- yeah, yeah.
>> Question: Okay.
>> Jonathan Le Roux: Yeah. So to apply this to phase restoration, you can minimize this criterion only with respect to the phase, while fixing the magnitude. And the fact that the weights are concentrated allows you to make the optimization more local if you want. The fact that mostly the central bin contributes to the image justifies updating only the phase at bin (m, n), not to minimize the whole L2 norm, but only the residual of the bin which corresponds to it.
So actually this would mean that you do not minimize the L2 norm; you minimize the bins one after another. So it is a sort of weird optimization, because it is not related to a single objective function. But actually if you do that, experimentally it also minimizes the global norm, because it goes over all the bins basically.
And with the second point, which was that only the central coefficients have significant weight to compute that coefficient, you can replace the sum over all the bins with only the sum over the central block, which is much faster, so you can be faster than performing FFTs to compute things.
The advantages of this are that you can use sparseness: if you only have a few parts of the spectrogram with significant amplitude, then you only need to update the phase for those terms. You're also more flexible, because if you already know some of the phase -- you think some parts of the phase are reliable -- you don't need to update it; you just keep it as it is and only update the other terms. And it's supposed to have lower computational cost if you use the local approximation.
Comparison with previous work. So their idea is slightly different: it's to find the signal x in the time domain whose magnitude is closest to the one you have estimated, the argument being that its phase would then be the best phase. It's actually equivalent, but they iterate between STFT and inverse STFT, keeping only the phase, so they have to constantly alternate between the time and time-frequency domains, which means that it is costly, because you have to do many STFTs. And it's also not flexible, because you have to update all the bins at the same time and you cannot choose which bins you're going to update.
And for our method, as I already mentioned, it's direct optimization in the time-frequency domain, so you only manipulate the spectrogram, and it is not limited to phase, so you can deal with complex spectrograms, as I mentioned at the beginning of the talk. And with the approximated version you have lower cost. Yeah.
So the idea for this approximation is that when you try to minimize over the phase, you get a nonlinear problem. If you're minimizing over bins, over complex bins, it's linear, and so simple L2 optimization techniques would work; but over the phase you need to do something else, because it is much harder. That's why we came up with this approximation.
Okay. So here is an example. I'm running a bit late, right? I'm sorry.
>> Jasha Proppo: I think we have at least 23 seconds left.
>> Jonathan Le Roux: (Laughter) Okay. Okay. So --
>> Jasha Proppo: (Inaudible) 32 seconds. We have the room for another 45 minutes.
>> Jonathan Le Roux: Oh, that's fine. Okay. So --
>> Jasha Proppo: Actually have time for questions at the end, but --
>> Jonathan Le Roux: Okay.
>> Jasha Proppo: Your audience is rude, they never let you get in a word.
>> Jonathan Le Roux: So I won't have many questions at the end. (Music playing) Okay. Sorry, I just pressed that.
So do you know that?
>> Question: Is that the second -- is that the (inaudible) one, the original?
>> Jonathan Le Roux: No, that's the original. And that's from the (inaudible)
database, music database. I don't know who is playing actually.
>> Question: I thought it was (inaudible) because it does sound very slow, that's
the way it's supposed to sound.
>> Jonathan Le Roux: It's going to be slower. So we slow it down by 30%, for a final length of 32 seconds, and we ran the original method by Griffin and Lim, then we used our method, and we also used sparsity, so we only update bins which have significant amplitudes.
And so in terms of number of iterations, we see that we have faster convergence: without sparsity we need much fewer steps to reach the same level. With sparsity we first need fewer steps and then it's pretty much the same as the original method in blue. And if you include the speed of each step, then here is what you get: the time to reach a certain level. You see that, for example, to reach minus 15 (inaudible) of consistency, so to lower the criterion down to that level, we only need 2.1 seconds with our method against about 40 to 45 seconds for the original algorithm. So it's much faster.
(Inaudible) supposed to be pretty much the same, because, I mean, we wait long enough so that we have the same inconsistency.
>> Question: Does the window affect the final quality?
>> Jonathan Le Roux: I think it does; especially the overlap affects the final quality. So here I will let you hear the 50% overlap results, which don't actually sound that good. But if you wanted to make a product, you would want to go to 75% overlap, sorry, or more.
But the thing is, our algorithm outperforms the original state-of-the-art method more when it's 50% overlap and less and less with more overlap, because it doesn't scale so well: we don't use FFTs, so if you have more overlap, you have many more coefficients and it slows down a bit.
But it's an important issue, definitely. So I'll let you listen to the result without processing. Without processing, we just read the original wave file slowly: we read it with a frame shift chosen such that the final speed will be slower when written out with the original frame shift. So we make the incoming frame shift larger and keep the outgoing frame shift the same, so it feels like it is slower. Okay.
(music playing slower)
So, well, you get -- it's not great. I'll play you one of the -- the (inaudible) one; they all sound pretty much the same.
(music playing)
>> Jonathan Le Roux: So you can hear it goes through some modulation in the (inaudible). It's not very clear, but with headphones (inaudible), and I think it's mostly due to the fact that it's 50% overlap, and it gets better with more overlap.
Okay. So now to -- yeah?
>> Question: Sorry. Is there a reason that you use music rather than speech
for this?
>> Jonathan Le Roux: No, not really.
>> Question: It would also slow down in speech?
>> Jonathan Le Roux: It works -- yeah, anything -- the thing is, I think we noticed at some point, I'm not sure, that for speech and music the window length that you want to use might be different. It may sound better with a shorter window (inaudible) for speech, or the opposite (inaudible). And I don't really understand why, actually. But it's --
>> Question: The sound you cannot produce, right, with the piano.
>> Jonathan Le Roux: Uh-huh.
>> Question: Even if you play slower it is not going to sound the same. It's
never going to be that (inaudible).
>> Jonathan Le Roux: Yeah. Yes. It's supposed to be -- yeah, not natural. If you think of how the sound is produced, maybe someone doesn't know the piano. I mean, it also slows down the attack, for example, and maybe you don't want to slow down the attack. So there are methods, I think, like sparse coding methods, trying to do that actually: separate the attack from the rest and slow down only the rest. I think some people are actually trying to do that.
>> Question: Does the attack take more than a frame or two?
(talking over each other)
>> Jonathan Le Roux: So this is more like a funny -- I don't know if (inaudible),
I found it pretty cool when I found it out. And it's kind of simple, yet makes a cool
demo idea.
So you could actually use the inconsistency to your advantage -- I mean, to do funny things. You could make silent spectrograms. And how you do that is, if you're given analysis and synthesis windows which satisfy perfect reconstruction, you can take a set of complex numbers -- anything, maybe inconsistent or not. If you go back to the time domain, you get a signal, and then if you look at its STFT, you get something different in general.
Now, still, the two have the same inverse STFT. So if you look at the difference, then you have something which is non-zero, but which, for that particular synthesis window, gives you silence. And that's very cool, because if you use a different synthesis window, it's not silence. So you could think of using this in an audio encryption scheme in the time-frequency domain.
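A minimal sketch of that construction (my own illustration, not the speaker's code; window choice, sizes and all parameters are assumptions):

```python
import numpy as np
from scipy.signal import stft, istft

nperseg, noverlap, fs = 512, 256, 16000
win = np.sqrt(np.hanning(nperseg))                     # the "correct" window, acting as the key

rng = np.random.default_rng(0)
shape = (nperseg // 2 + 1, 200)                        # arbitrary spectrogram size
H = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# D = H - STFT(ISTFT(H)) is purely inconsistent: by linearity its inverse STFT is (near)
# zero for this window, while a different synthesis window turns it into noise.
_, x = istft(H, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
_, _, H2 = stft(x, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
D = H - H2[:, :shape[1]]

_, silent = istft(D, fs=fs, window=win, nperseg=nperseg, noverlap=noverlap)
_, noisy = istft(D, fs=fs, window=np.hanning(nperseg), nperseg=nperseg, noverlap=noverlap)
print(np.max(np.abs(silent)), np.max(np.abs(noisy)))   # the first value should be much smaller
```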
For example, since it is only for the correct window that you get silence, the window could act as a sort of key. So if you have a signal that you want to encrypt, you could add a pseudo-spectrogram which gives you silence to the real spectrogram of that signal. So you start with a signal -- I'm sorry for the -- a funnier piece of music.
(music playing)
>> Jonathan Le Roux: Like, from a mix of trumpet and piano, you have a spectrogram, and you create a pseudo-spectrogram by, for example, taking random complex numbers and doing the procedure I explained: you look at the difference between the complex numbers and the STFT of their inverse STFT. You add it with a large coefficient. I -- I also note you might have some problems with dynamic range when doing this, but, well, you know, it's just for a demo.
So here, look at the power spectrum. Here's what it looks like: basically you don't see much. Say here I used square-root Hann windows. If you use Hann to resynthesize, which is the incorrect window, here is what you get. So you get minus 30 dB SNR. So, if I lower the volume a bit, here is what you get.
(making static sounds)
>> Jonathan Le Roux: Okay. So I overdid it a little bit. So like 10,000
something. So dynamic range is really an issue I think when you do that.
Well, the thing is, if you use the incorrect window to resynthesize the original spectrogram, so of the music, you still have plus 18 dB, so it sounds a bit fuzzy, but still not too bad.
(music playing)
>> Jonathan Le Roux: And if you use the correct window, square-root Hann, you have perfect reconstruction up to quantization level. So -- well, I don't know if it has any application, but I thought it was pretty fun to --
>> Question: So what you're doing is, you're in the complex time frequency domain adding something that is orthogonal to --
>> Jonathan Le Roux: Yeah.
>> Question: -- linearly orthogonal to your signal.
>> Jonathan Le Roux: Yes. Only it is only orthogonal for that particular window. So if you don't have the window -- a way to crack it, I haven't tested it, would be maybe to minimize the energy with respect to the window, so you try to find the window which gives you the least energy. That could maybe be a (inaudible). So I'm not saying it is robust or anything, but I just thought it was fun.
Okay.
Okay.
Okay. So, a quick summary of that part, because it's getting a bit long: we have a consistency criterion. We can use it for phase resynthesis. I'm planning to use it for missing data reconstruction -- so if you reconstructed something in the power domain and have some parts of the phase which are known and you want to reconstruct the other parts, maybe you could use this. Potential applications are as a cost function on complex (inaudible): if you do separation in the complex domain, maybe you could use this as the cost.
Okay. Now to part two. Yeah, I'm trying to be faster on this one. This is the work I presented at NIPS, on template matching in the time domain. So again, an illustrative problem: you want to find (inaudible) in the time domain, the conditions being a single-channel observation, you don't know what the sounds look like, you don't know their timings, and you have no training data. The goal is to retrieve the sound patterns, their timings and their amplitudes. So here I have, for example, a two-second electronic drum loop, which is -- well, let me play it first.
(drums playing)
>> Jonathan Le Roux: Okay. So I've downsampled it quite a lot to make it tractable in terms of (inaudible) cost, so it sounds a bit muffled. But this has been built by using two templates, a bass drum and a snare drum, which are taken from real sounds, then placing them with different amplitudes at random timings and merging them into a single channel. So the goal would be, from that signal, to re-estimate these templates and to re-estimate their timings and amplitudes.
So, the motivation in general: compared to the power domain, the advantages of the time domain are that you don't need to determine analysis parameters, like frame length or frame shift, you don't need to resynthesize, and additivity is true. And you can also exploit waveform regularity when it holds. So in the (inaudible) recordings, as I said, we have good reproducibility of waveforms across examples, and to a certain extent for musical instruments, maybe like piano and drums, you would also be able to use that.
More generally, we wanted to design a method to perform template matching but with unknown templates, and also to investigate signal decompositions where the amplitudes are supposed to be sparse and nonnegative.
Okay. So our goal will be to explain a time series with a decomposition which detects the timings of events, where the events are characterized by their time course, this time course being discovered from the data. And we suppose these events can have variable amplitudes, but only positive, so we cannot have cancellations. And we also allow for overlapping events.
Not sure what I meant by this sentence, but anyway... okay. So here is what the model looks like. It's basically a convolutive factorization where you have -- so here are the different events. K stands for the event type, and you have the time courses, so a short waveform. And here are the amplitudes at different lags. The amplitudes are supposed to be nonnegative, and on them we put a sparse prior, which is a generalized Gaussian. The idea is to put as much information as possible in the event itself. If you didn't assume any sparse prior, for example if your events were also nonnegative, you could imagine having a template which is just a Kronecker delta, 1 somewhere and zero everywhere else, and all the time course pushed into the amplitudes. That is not really informative.
So to get more meaningful results we wanted to pack all the information into the event itself and keep only, like, the onset timings in the amplitudes; that is the reason why we use the sparse prior.
Okay. So we perform maximum a posteriori estimation by -- it should be a minimum here -- we try to minimize the L2 norm between the model and the observed data, plus a penalty term. And we normalize the templates to avoid scaling indeterminacies: otherwise you could reduce the sparseness penalty by putting more weight into the template and rescaling, but that is not very interesting. And -- okay.
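In symbols (my notation, not necessarily the speaker's): with unit-norm templates $b_k$ and nonnegative amplitudes $a_k[\tau] \ge 0$, this MAP estimate amounts to minimizing something of the form

$$\sum_t \Big( x(t) - \sum_k \sum_\tau a_k[\tau]\, b_k[t-\tau] \Big)^{\!2} \;+\; \lambda \sum_{k,\tau} \big| a_k[\tau] \big|^{p},$$

where the $\ell^p$-type penalty comes from the generalized Gaussian prior on the amplitudes, and $\lambda$ and $p$ are prior parameters.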
A brief review of existing algorithms which are close to this one: nonnegative matrix factorization treats the case of the decomposition of a signal into a product of two nonnegative terms, nonnegative matrices. Semi-nonnegative matrix factorization treats the case where the amplitudes are nonnegative and B can be of any sign, so it's close to what we want to do, but it doesn't allow for shifts, so it doesn't allow for different onset timings. Shift-invariant NMF treats the different onset timings, so it's close too, but both templates and time courses need to be nonnegative. So we need to combine these two models.
And so here it looks like we have this convolution with nonnegative amplitudes, and by using this notation, so writing down the sums in this way, we can rewrite the model as a product: we convert the convolution into a product with a matrix, which allows us to take the updates for semi-NMF and NMF and convert them into updates for our model pretty easily.
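A minimal sketch of that equivalence (my own illustration, not the speaker's code or notation): the convolutive model evaluated directly as a sum of convolutions, and the same model rewritten as a single matrix product by stacking shifted copies of the templates.

```python
import numpy as np

def model_convolution(A, B):
    """A: (K, T) nonnegative amplitudes; B: (K, L) templates. Returns x_hat of length T+L-1."""
    x_hat = np.zeros(A.shape[1] + B.shape[1] - 1)
    for a_k, b_k in zip(A, B):
        x_hat += np.convolve(a_k, b_k)            # one convolution per event type
    return x_hat

def model_matrix(A, B):
    """Same model as a single matrix product, with shifted template copies as columns."""
    K, T = A.shape
    L = B.shape[1]
    M = np.zeros((T + L - 1, K * T))
    for k in range(K):
        for t in range(T):
            M[t:t + L, k * T + t] = B[k]          # template k placed at onset t
    return M @ A.reshape(K * T)
```

Both functions return the same signal (up to floating-point error); the matrix form is what makes it possible to reuse NMF-style multiplicative updates.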
So, yeah, I don't have much time, so maybe I won't insist too much on this. But these are the original update equations for NMF; you can modify them -- the difference is basically that in this kind of correlations you allow for shifts -- and, well, believe me, this should be correct. If you introduce the sparsity term, then you have the derivative of that term which comes up in the updates, and if you finally add the normalization of the templates, you could enforce it by using Lagrange multipliers, although we didn't use that in the first implementation; we just renormalized every few steps because it's much simpler.
Okay. Some results. First we wanted to test it on synthetic data, where we know the correct answer, to see how the performance is. We used two (inaudible) templates and random timings and amplitudes. So if you take the templates and add them up into a single channel, and you add some Gaussian noise as well, here is what you would get. From this noisy data, we were able to reconstruct the templates quite well, and their timings as well. And here is what the reconstructed waveform looks like: it sort of looks like a denoised version of the original. And, well, I guess the denoising performance comes from the averaging in the estimation of the template: if you have many noisy versions of your template, by sort of taking the mean you reduce the amount of noise.
Okay. And we also looked at the evolution of performance with respect to template similarity, similarity being how alike the templates look (inaudible). So it's our own definition, but here 0 means that we use the two templates I showed you, and 1 means they are, like, identical to each other. So it gets easier again when you have the same template, because you only have one template to guess, but at some point it gets harder in the middle. Okay. And real data: so this is neural data, and (inaudible) so I'm not too knowledgeable about this.
These data are composed of two types of spikes: the (inaudible) segment spike, which is in red, and the (inaudible) spike, which is in blue. The red spike may appear by itself sometimes, but the blue spike always appears with the red spike in front -- slightly around here and smaller. They are shown normalized here, but usually -- so the blue one always overlaps with the red, and the red sometimes appears alone.
So this sort of spike sorting -- I mean, it has been studied a lot, but usually conventional methods cannot really deal with overlapping spikes. So that's why we tried to design a new method. And so we built these reference spikes by hand: we try to find regions where the red one is alone, align them together and try to find a typical waveform, and then subtract this spike in the blue-and-red complexes to cancel out the red and estimate the blue. So it sort of takes a lot of time and it's not easy to do. But the automatically retrieved spikes look like this. They're quite good, they're quite alike. Not perfect, especially the blue, because of this bump here, which is stronger than in the original. And we think that it comes from the fact that the red spike usually always comes around here, so it's been learned a lot as part of the blue.
Okay. And on the drum loop data that I showed you earlier: to this data we added some noise. So from the noisy data here...
(music beating)
>> Jonathan Le Roux: I applied the algorithm to this data and was able to get these templates back and their timings. And so here it's been normalized, so the values are slightly smaller, but you see that the evolution is very similar. And here is what the reconstructed waveform looks like, on the top right.
(music beating)
>> Jonathan Le Roux: No, sorry, not sure it's...play it again.
(music beating)
>> Jonathan Le Roux: Not sure I played the right file. So the original...
(music beating)
>> Jonathan Le Roux: And reconstructed...
(music beating)
>> Jonathan Le Roux: Yeah. It's supposed to sound slightly less noisy, but just a bit, because there are only four examples here (inaudible), so these are kind of difficult conditions to learn the template from only that few examples. So it still has quite some noise. And so now I'll play only the bass drum part and then only the snare drum part.
(music beating)
>> Jonathan Le Roux: And the snare drum...
(music beating)
>> Jonathan Le Roux: Okay. So well --
>> Question: That sounds a little more like a series of snare drums than a
single snare drum.
>> Jonathan Le Roux: Actually the original sound is a double stroke, so it's -- which makes it even harder to learn, because it's highly correlated with itself.
So --
>> Question: You were playing (inaudible), you were playing the --
>> Jonathan Le Roux: I was playing this.
>> Question: All right.
>> Jonathan Le Roux: Sorry. I was playing the re -- like a separated --
>> Question: Convolution.
>> Jonathan Le Roux: Yeah, like a convolution already.
Okay. So, a short discussion: the spikes or events can be overlapping, they do not need to be orthogonal as in (inaudible) PCA or ICA. And the sparse prior is the key criterion. We had a proof of convergence for the algorithm, and the decomposition is shift invariant. We could decompose the signal into spikes (inaudible) -- so I don't know if you know the Nature paper by Smith and Lewicki. They're trying to do sparse coding of natural sounds, and they retrieve natural basis sounds, and they actually found out that they look like the cochlear filters of a cat, which is kind of interesting.
So it seems that cochlear filters actually implement a really very sparse code. And I was actually hoping to get the same kind of result with our algorithm by training on a large amount of data, but I haven't been able to get results yet. I hope maybe at some point, considering the models are supposed to be quite close and that this model is supposed to really do the right thing, we should -- you know, with a lot of tweaking of parameters I hope we can get something.
And okay. Yeah, the model has the form of a convolutive mixture, but as we have the nonnegativity constraint, and we also have sparseness, and also we need to be careful with the lengths of A and B, the interpretation is slightly different from that of, well, source separation.
Okay. That's going to be my last slide. So I presented complements and/or alternatives to power domain modeling. First, in the complex time-frequency domain, we could actually recover phase from power. We could potentially use the criterion as a guide, so sort of perform consistency-aware modeling, and maybe in the future use it as a cost function for complex time-frequency domain modeling. And in the time domain, we exploited the reproducibility of waveforms and we were able to deal with overlapping events when decomposing signals. And that's the end.
Yes?
>> Question: Are you familiar with Barry (inaudible) work?
>> Jonathan Le Roux: Yes. Yes.
>> Question: Because they also use a convolutive mixture. I think they also
have sparseness with --
>> Jonathan Le Roux: Yes, yes.
>> Question: With entropic trial.
>> Jonathan Le Roux: Yes. No, it's in power spectrum.
>> Question: Power spectrum?
>> Jonathan Le Roux: I think so.
>> Question: Because I think they were both nonnegative. Both --
>> Jonathan Le Roux: Yes, yes. That is the main difference.
But I mean, it's a hot topic recently, so they're definitely using sparseness on the amplitudes.
>> Question: Yeah, I recall it was on spectrum.
>> Jonathan Le Roux: Yeah, I mean the main difference between this last work and what other people have done is that we use slightly different conditions.
Yes?
>> Question: It reminded me of that, that is why I'm asking that.
>> Jonathan Le Roux: Yes, definitely.
>> Question: You know, the second method, for things like drum rolls, where --
>> Jonathan Le Roux: Uh-huh.
>> Question: -- copies are exact.
>> Jonathan Le Roux: Yeah, that's the easiest.
>> Question: There's a lot more where they're like similar, but not exact, right?
Like --
>> Jonathan Le Roux: Yeah.
>> Question: Real drums -- (talking over each other) -- or a vowel or whatever.
>> Jonathan Le Roux: So for vowels and stuff I think it would probably be hopeless. For real drums I need to try. I haven't been able to find time to record real drums, and first I wanted to use a pen to do things, like on a table. It's just not that easy to get a clean signal, and especially no double strokes, with a pen on a table. So I guess it would have been too different --
>> Question: Real conditions, maybe that is the kind of data that you need.
>> Jonathan Le Roux: Yeah, I think it would have been really cool -- if we were able to separate these two, it would be very cool. I don't see any application, but, well, if we had made some real data it would have been interesting; I just didn't have much time to do it. The few times I tried it, I was never able to get like only-one-stroke kind of data. But maybe it means that in reality it's just hard to have this kind of data. We definitely need to try (inaudible) drums on real data.
>> Question: What about the drums? They are going to be difficult --
>> Jonathan Le Roux: You expect so?
>> Question: Yeah. It's -- yeah, it's a pretty random thing. It's not reproducible
in the time to (inaudible) --
>> Jonathan Le Roux: You think so. Yeah, have you looked into it?
>> Question: Yes.
>> Jonathan Le Roux: Okay. That's the kind of --
>> Question: They have this thing that kind of clatters against the snare.
>> Question: That's a random vibration, even though the motion of the drum head is very complex. Uh-huh. And once you get something that's above your sampling rate -- what is your sampling rate -- (inaudible) --
>> Question: Yeah.
>> Jonathan Le Roux: You could make it higher.
>> Question: Anything above your sample rate comes out as random once you get the --
>> Jonathan Le Roux: Uh-huh.
>> Question: Folded down a few times. My guess is template matching in the time domain is destined to fail.
>> Jonathan Le Roux: Okay, for drums, yeah. I mean, the first motivation was not (inaudible). It was, like, from neural data, on which it's supposed to work better. And the question was to what extent can we use that model.
>> Question: There are plenty of things where you are sending out, like, known impulses and things. Plenty of applications, (inaudible) or whatever, so I was trying to figure out where it would apply. Even for the (inaudible) data, it seems like the sparse prior is biased to put the red and blue spikes together into one prototype rather than generate a new spike for it.
>> Jonathan Le Roux: Yeah. It seems so in the end --
>> Question: It's an interpretation of the same data: if A sometimes comes alone, and B always comes with A, then you learn that A and B come together and that's your --
>> Jonathan Le Roux: They come together with different amplitudes. They could come together with different amplitudes. So then you would have to run both. Good. Maybe they come really often with the same amplitude and sometimes with different amplitudes, in which case it's hard to guess.
Well, I think it's a difficult task, very underdetermined, and we try to come up with constraints to make it slightly more (inaudible). So we need to figure out: is it really important to ensure the amplitudes are nonnegative? Because the sparseness should enforce it -- if a signal is really made of nonnegative amplitudes times events, times these templates, then you should end up learning good templates and you would only get nonnegative amplitudes, because that is all there is in your data.
So you shouldn't have to enforce it in the constraints.
>> Question: Right.
>> Jonathan Le Roux: But well, as in underdetermined problems, it's just better if you can enforce it. So that is why we did it. I think we need to compare without the constraint to see, and if it works better then...
>>: Cool. Thank you very much, Jon --
>> Jonathan Le Roux: Thank you.
>>: -- for a provoking talk.
(applause)