>> Jasha Proppo: Hi, I'm Jasha, and we're welcoming Jonathan Le Roux here today to give us a talk. Jonathan has degrees from various universities in Paris in mathematics, differential equations and stochastic processes, and lately he has been at the University of Tokyo, where he studies computational auditory scene analysis and speech processing. It's a pleasure to welcome you here today.
>> Jonathan Le Roux: Thank you.
>> Jasha Proppo: Talking about a topic that many people have dashed themselves upon trying to solve, and hopefully we'll see some good work.
>> Jonathan Le Roux: Thanks. So thanks for having me here, it's a pleasure to be here. I'll be talking today about -- well, you could say it's about phase, but certainly not about power.
>> Jasha Proppo: Jonathan...
>> Jonathan Le Roux: Yeah, let me start the slides.
>> Jasha Proppo: Yeah.
>> Jonathan Le Roux: Okay. So the title of the talk was trying to be a bit funny and/or provocative: power is not everything. We'll talk mainly about two things I've been working on this year. They have a common point, which is that they don't use the power domain, or they try to cope with issues that come up when you work in the power domain.

So the motivation of this talk: as I'm sure you are aware, many, if not most, audio processing methods work in the power time-frequency domain. We like power spectra and we use them to do noise canceling and source separation, to decompose signals using nonnegative matrix factorization, for example, which has been done very often recently. We can use them to modify a signal, like pitch-scale or time-scale modification. We also use the power time-frequency domain to perform multipitch estimation, for example. So those are the applications. Usually how we do things is that we design time-frequency masks, binary or continuous, and then use some sort of Wiener-like filtering to get an output which is supposed to be cleaner, or which contains only one source -- for example when you start from a mixture, that's what you want to do.

The thing is, by working in the power domain we essentially discard the phase information. People usually do that because phase information is believed to be harder to model, and also because the human ear is said to be more or less phase blind, so it would make some sense not to look too much at the phase information. Let -- yeah?
>> Question: Is that really true? I mean, surely there is a sense in which that is not true? If you scramble the phase of a recording it sounds different.
>> Jonathan Le Roux: Yeah, I will show that. It's actually not exactly true. People say the ear is phase blind based on psychoacoustic experiments, trying to see whether you can tell the difference between two instances of a vowel pronounced with different phases. I'm not really sure about that myself, but people have been looking into it. For signal processing, though, we are obviously not phase blind to the extent that we could scramble the phase of a short-time Fourier transform -- and that's exactly what I'm going to talk about today. So again, the second part of the motivation is that working in the power domain raises issues, and there are mainly three of them.
First, there is a resynthesis issue: if you want to resynthesize, to come back to the time domain, then you need a phase, and the phase information is missing. If you have the mixture, you could use the phase of the mixture, but that is only approximately right and it can create artifacts: if you have a bad estimate of the phase and you couple it with your magnitude or power estimate, it will most probably lead to artifacts that you can perceive.

The second issue, which we usually don't look at too much, is that additivity of powers is only approximately true. Many people who work in the power domain assume that the power of the sum is roughly equal to the sum of the powers, which is not true, because the cross terms, even if they are zero in expectation, can be nonzero almost everywhere. Some people say that because they are zero in expectation you can assume they are zero; a better way to justify the approximation is to say that the signals are sparse, so there is only a small chance that they overlap in the time-frequency domain. In that case one of them is close to zero, and the cross term is then indeed close to zero. But still, the additivity-of-powers approximation is only as true as the sparsity assumption.
>> Question: There's also --
>> Jasha Proppo: Yeah...
>> Question: -- another factor here, where your two sources need to have almost equal power before the cross term is really important.
>> Jasha Proppo: Yes.
>> Question: Because it's related in magnitude to the smallest source contributing to that time-frequency bin.
>> Jasha Proppo: Uh-huh. That's right.
>> Question: So --
>> Jasha Proppo: So it's actually better --
>> Question: What's that?
>> Jasha Proppo: It's --
>> Question: Better than (inaudible) --
>> Jonathan Le Roux: It's better, yeah, you're right. But still, if you want to do things really cleanly, then you need to take this potential issue into account.

And in some situations, phase may actually be a relevant cue, and throwing it out means that you're not using everything you could use. For example, in electronic music, samples are exactly reproduced: when you play a keyboard, the exact same sample is triggered every time, just with different amplitudes. So it's not only the power spectrum -- the phase information is exactly the same -- and maybe you could use that as well. This might be true for real instruments too, maybe for piano, maybe for some percussive instruments: you have good reproducibility, and if you hit a drum at pretty much the same place every time, you can expect the waveform to be reproduced quite well. All this needs more investigation. And it's not only true in audio but also in neural data: we had some extracellular recordings, and in that situation you also have good reproducibility, so phase is actually relevant.

So all these are motivations to investigate models and/or structures either in the complex time-frequency domain or in the time domain, and this talk is going to be about both. Okay, so let me start. The first part is about what we call consistency, which I'm going to define. Just to illustrate the problem: what if you want to resynthesize a signal from a magnitude spectrogram?
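(A minimal numpy sketch of the additivity point discussed above, before turning to that resynthesis problem; the arrays and the 10% activity rate are illustrative, not from the talk. The power of a sum equals the sum of powers only up to a cross term, which vanishes in expectation and becomes negligible when the sources rarely overlap.)

```python
import numpy as np

rng = np.random.default_rng(0)
# Two sets of complex time-frequency coefficients standing in for two sources
x1 = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)
x2 = rng.standard_normal(1000) + 1j * rng.standard_normal(1000)

mix_power = np.abs(x1 + x2) ** 2
sum_power = np.abs(x1) ** 2 + np.abs(x2) ** 2
cross = 2 * np.real(x1 * np.conj(x2))             # the term the power-domain approximation ignores

print(np.allclose(mix_power, sum_power + cross))  # True: exact identity
print(np.mean(cross), np.mean(np.abs(cross)))     # ~0 in expectation, but not pointwise

# With a sparse second source (active in only 10% of the bins), the cross term is mostly zero
x2_sparse = x2 * (rng.random(1000) < 0.1)
print(np.mean(np.abs(2 * np.real(x1 * np.conj(x2_sparse)))))
```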
Resynthesizing from a magnitude spectrogram is a very classical problem which has been looked at many times, as early as the end of the '70s and early '80s, by Griffin and Lim for example, which is the most famous method. So if you have a magnitude spectrogram, what phase information should you use to get back to the time domain? If you had a good phase, this is what you would get.
>>: (Recording) Do you understand what I'm trying to say?
>>: That's the absolutely correct --
>> Jonathan Le Roux: That's a good phase. If you randomly scramble the phase, here is what you get.
>>: (Recording) Do you understand what I'm trying to say?
>> Jonathan Le Roux: So it's obviously not very nice. Well, it's --
>> Question: Can you describe how you scrambled the phase for this?
>> Jonathan Le Roux: I used uniform random numbers between 0 and 2 pi for the phase.
>> Question: What were the windows?
>> Jonathan Le Roux: This is a short-time Fourier transform spectrogram, and I used square-root Hanning windows for analysis and synthesis, with 50% overlap. You would actually expect even more artifacts; it already sounds pretty bad.
>> Question: (Inaudible) -- sounds okay.
>> Jonathan Le Roux: --
>> Question: (Recording) Do you understand what I'm trying to say?
>> Question: Yeah.
>> Question: It seems in the low frequencies it gets really -- I don't know.
>> Jonathan Le Roux: So that's the --
>> Question: If you had a continuous phase (inaudible) --
>> Question: You mean --
(talking over each other)
>> Jonathan Le Roux: Okay.
>> Question: Say he chose a random two-pole all-pass filter for every window; the phase would then be smooth.
>> Question: Then it would probably sound like it should.
>> Question: I don't know.
>> Jonathan Le Roux: You mean using a different signal as an example, or using a different way to reconstruct the phase?
>> Question: A different way to get --
>> Question: A different way to randomize the phase.
>> Jonathan Le Roux: Okay.
>> Question: But we have to talk about things that you could actually apply without having the original phase, and that's a little bit harder than just --
>> Jonathan Le Roux: Yes.
>> Question: -- scrambling the phase.
>> Question: Oh, you're right.
>> Jonathan Le Roux: Well, if I didn't have any phase information, I could use a random phase, and that is what I did.
>> Question: Or you could use (inaudible) phase, and then --
>> Jonathan Le Roux: Right.
>> Question: -- you would get a perceived pitch at the analysis window rate.
>> Question: With random phase, assuming everything is unvoiced (inaudible) --
>> Question: Yes.
(talking over each other)
>> Question: What's that?
>> Question: A perceived (inaudible) at about four (inaudible), something like that. No, I mean, it may be related to the rate at which he was doing the analysis.
>> Question: Yeah, maybe to --
>> Jonathan Le Roux: Related to the overlap, you mean? Yeah. So that's what happens even when you have a magnitude spectrogram that comes from a real signal. But maybe something more interesting: if you knew how to do this efficiently, maybe you could time-stretch the spectrogram and then reconstruct a signal from it, to get a slowed-down version, which is something that is often done. Or you could cancel some (inaudible), or separate speakers -- you could do anything you want, even draw a spectrogram, it could be fun. So the key idea here is that if you use the wrong phase and couple it with your magnitude spectrogram -- you take a good signal, you take its short-time Fourier transform, you keep only the magnitude and you use another phase -- you get a signal back.
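(A minimal scipy sketch of the scrambling just described; the random test signal, sample rate and FFT size are placeholders. It keeps the STFT magnitude, draws uniform random phases in [0, 2 pi), resynthesizes by weighted overlap-add with square-root Hann windows at 50% overlap, and then checks the STFT magnitude of the result, anticipating the next point.)

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
fs, nperseg = 16000, 512
x = rng.standard_normal(fs)                                # stand-in for a real recording
win = np.sqrt(signal.get_window("hann", nperseg))          # square-root Hann analysis/synthesis windows

# Analysis: STFT with 50% overlap
_, _, X = signal.stft(x, fs=fs, window=win, nperseg=nperseg, noverlap=nperseg // 2)

# Keep the magnitude, replace the phase with uniform random values in [0, 2*pi)
X_scrambled = np.abs(X) * np.exp(1j * rng.uniform(0, 2 * np.pi, size=X.shape))

# Resynthesize by inverse STFT (weighted overlap-add)
_, y = signal.istft(X_scrambled, fs=fs, window=win, nperseg=nperseg, noverlap=nperseg // 2)

# The STFT magnitude of the resynthesized signal differs from the magnitude we imposed
_, _, Y = signal.stft(y, fs=fs, window=win, nperseg=nperseg, noverlap=nperseg // 2)
n = min(X.shape[1], Y.shape[1])
print(np.max(np.abs(np.abs(Y[:, :n]) - np.abs(X[:, :n]))))  # clearly nonzero
```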
But if you look at the magnitude spectrogram of the resynthesized signal, it is actually different from the one you had before. So if you don't take care of this, you might spend a lot of time designing really nice magnitude spectra, and when you resynthesize you just lose part of the job you did. The idea is then to use a phase which doesn't make the spectrogram change -- to ensure that the spectrogram stays the same, so that the magnitude and the phase are consistent with each other. That is what I call consistency here. We're going to do this by deriving a general criterion for consistency, and we'll talk more specifically about short-time Fourier transform spectrograms.

Okay, first a short review -- I don't need to tell you much about it. The STFT is very often used for time-frequency analysis in audio, and it's expected to be invertible: if you start with a signal and perform the STFT, you get a time-frequency representation, and by performing what we call the inverse STFT you expect to get back the same signal. In the two boxes, which I will detail on the next slide, we use windows for analysis and synthesis; to actually get the same signal back, you need some constraints on the windows, which are called the perfect reconstruction constraints. And the nice thing with the STFT is that it's a linear operator, so additivity is exactly true in the complex time-frequency domain. It's a good domain to work in -- there's no approximation.

Okay, a little bit more on the relation between the STFT and its inverse -- and when I say inverse, I mean the overlap-add procedure. There is an exact equivalence between a waveform and its STFT when the perfect reconstruction constraint holds: you can go from one to the other and back. If you start with a signal and perform the STFT, you get a spectrogram, and with this constraint you can come back, so you have a correspondence between time signals and STFT spectrograms. But because there is overlap between frames, the dimension of the space where STFT spectrograms live is actually bigger: with 50% overlap, you have twice as many coefficients. So not just any set of complex numbers actually comes from a time signal. To ensure that the set of complex numbers you have on the right actually corresponds to a time signal, it must satisfy what I call consistency constraints.

The point is that if you start from an arbitrary set of complex numbers and apply the inverse STFT, you do get a time signal, but if you then perform the STFT you don't get back the same thing. The consistent spectrograms are the ones for which, if you start from the right, go to the left and come back to the right again, you get back the same thing.

Okay, why is that important? First, if you are designing an algorithm which works in the complex time-frequency domain -- source separation or many other things -- it would be nice to be sure that each spectrogram you're working on actually corresponds to a time signal. You could use this as a constraint to reduce the indeterminacy of your problem. In that situation you can use the criterion I'm going to present as a sort of cost function in your optimization algorithm.
For example, if you're trying to separate sources, you could add a penalty for each of the separated spectrograms to be consistent. If you're working in the power domain, then you need to find the best corresponding phase, and in that case you can use the criterion as an objective function on the phase: you optimize the phase so as to minimize the consistency cost and get back a good phase estimate.

Okay, I think everybody knows this, so I'll be pretty quick. Perfect reconstruction constraints: you build a spectrogram with an analysis window and an FFT, you shift, do the same thing, shift again, and so on over several frames -- that's the analysis by STFT. Now the synthesis: you apply the inverse FFT to get short-time signals, you apply a synthesis window, you add them up, and you get your overlap-added time-domain signal. For the signal at the top and the signal at the bottom to be the same, you have constraints on the analysis and synthesis windows, which can basically be summarized by saying that the composition of the STFT with window W and the inverse STFT with window S should be the identity.

Okay. Now, if we start from the middle, that is from the time-frequency domain, we can perform overlap-add and we get a time-domain signal. If this set of complex numbers H actually came from a real signal, then, because the windows are supposed to satisfy the perfect reconstruction constraint, by doing the inverse STFT you would get that signal again, and by doing the STFT again -- I guess you got it by now -- you would get the same H back. So the constraint is just writing down that the composition of the inverse STFT with window S followed by the STFT with window W is the identity; it is an equality on H.

And the nice thing is that you don't actually need to go back to the time domain to do that. You can write everything down in the time-frequency domain, because it's just a linear operator from the space of complex coefficients of size M by N to itself. So if you define the operator F as the composition of the STFT with the inverse STFT, minus the identity, you are just looking for spectrograms -- sets of complex numbers -- which are in the kernel of that operator. And you can actually write it down explicitly, and it looks like this. To give you an idea of what it actually computes: if you have 50% overlap, to compute the value of the image of H under F at time frame m and frequency bin n, you basically need to perform a weighted sum over the overlapping frames around that index and over all frequency bins. So the sum only spans the overlapping frames.

Okay. And it's sort of, it's almost -- yeah?
>> Question: So how constraining is that? Does that mean that if I have the magnitude -- the STFT of a signal --
>> Jonathan Le Roux: Uh-huh.
>> Question: -- and I delete one of the bins, it's perfectly reconstructible from the bins that remain?
>> Jonathan Le Roux: If you only delete one, I would say it is, but I haven't analyzed that yet.
>> Question: How redundant is it?
>> Jonathan Le Roux: It depends on -- it's half of them, so --
>> Question: Half of them. Could you drop the other half?
>> Jonathan Le Roux: No, it's -- I --
>> Question: Yeah, if you intelligently select some of them.
>> Jonathan Le Roux: Yeah, it's --
>> Question: You couldn't delete just any half, because --
>> Question: It's actually a redundancy of two (inaudible) --
>> Question: Yeah. So if you dropped half the frames you might still be able to get back to the original signal.
>> Question: Maybe if it -- reach out to the (inaudible) I would imagine.
>> Jonathan Le Roux: Yeah.
>> Question: It's the same rate, it's the same all --
(talking over each other)
>> Question: The equations -- just because (inaudible), I mean, it's just a set of equations defining the consistency, right? So some of the equations won't be satisfied until (inaudible) --
>> Question: Or, it is linear, so is there just one value that's correct once you've deleted the bin?
>> Jonathan Le Roux: Mmm, let me think about it.
>> Question: I guess you're just doing --
>> Question: It's complex.
>> Jonathan Le Roux: It's complex, complex-valued. So you have double the dimensions, so you need to think about the number of unknowns and the number of equations you have. If you have only one unknown, you can definitely recover it, but I need to look into how far you can go.
>> Question: It probably also depends on which window you're using, and the overlap, and --
>> Jonathan Le Roux: Maybe not too much on the window; it depends a lot on the overlap.
>> Question: The window has a zero at the end.
>> Jonathan Le Roux: Yeah, okay, you don't want to have --
>> Question: If your window has no zeros -- now, after I asked the question -- you can drop every other frame and get back the original signal.
>> Jonathan Le Roux: Yes, yes, yes.
>> Question: For sure?
>> Jonathan Le Roux: Yes. So it depends on how you drop the bins.
>> Question: Okay.
>> Jonathan Le Roux: But I guess the most you can drop is half.
>> Question: Okay.
>> Jonathan Le Roux: So, yes, the idea is actually to be able to do these kinds of things. And it depends on the overlap: the more overlap you have, the more constrained it is.

Okay. So it's almost a convolution; I will not say too much about it. Basically you have some weights which span the overlapping frames -- with 50% overlap, three frames -- and the frequency bins, centered around the bin you want to compute. If you have a spectrogram, you take this bin here, you multiply everything by the weights and you sum up. So in the complex domain you have, say, 0.5 times this complex number, you sum everything up and, if the spectrogram is consistent, at the end you should get 0 -- this is taking too long, sorry. And you do that again for every bin. The only thing which makes it not a convolution is that you have a different set of weights for odd and even bins, in the 50% overlap case; with more overlap you would have three or more different cases, and so on. So it's not exactly a convolution, but it's close to one.

Okay. An interesting point is that the weights are concentrated near the center. For example, here we have a certain analysis window and a rectangular synthesis window with 50% overlap, and it looks a bit like this -- I only show the most significant weights. This suggests that, if we want, we can simplify the residual in two ways. First, we can notice that most of the contribution to the image comes from the bin itself, that is, the same position in the original spectrogram. Second, the nonessential coefficients can be neglected, so if you want to compute things faster you can drop some coefficients -- but you don't have to. And you can also optimize the windows to get a better concentration, so that the approximation is actually better: for example, you can try to maximize the L2 norm inside the central block.
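(Before looking at the optimized window, a minimal numpy/scipy sketch of the consistency residual described above, F(H) = STFT(ISTFT(H)) - H; the window, sizes and random test data are illustrative. The residual is numerically zero for a spectrogram computed from a real signal and clearly nonzero for an arbitrary array of complex numbers.)

```python
import numpy as np
from scipy import signal

nperseg, hop = 512, 256
win = np.sqrt(signal.get_window("hann", nperseg))      # an analysis/synthesis pair with perfect reconstruction

def stft(x):
    return signal.stft(x, window=win, nperseg=nperseg, noverlap=nperseg - hop)[2]

def istft(H):
    return signal.istft(H, window=win, nperseg=nperseg, noverlap=nperseg - hop)[1]

def residual(H):
    """F(H) = STFT(ISTFT(H)) - H, with the frame count aligned to H's."""
    G = stft(istft(H))[:, :H.shape[1]]
    if G.shape[1] < H.shape[1]:                        # round-trip padding can change the frame count
        G = np.pad(G, ((0, 0), (0, H.shape[1] - G.shape[1])))
    return G - H

rng = np.random.default_rng(0)
H_consistent = stft(rng.standard_normal(16000))        # comes from a real time signal
H_random = rng.standard_normal(H_consistent.shape) + 1j * rng.standard_normal(H_consistent.shape)

for H in (H_consistent, H_random):
    r = residual(H)
    # consistent: huge negative dB (numerical zero); random: only a few dB below the spectrogram energy
    print(10 * np.log10(np.sum(np.abs(r) ** 2) / np.sum(np.abs(H) ** 2)))
```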
And if you actually do that window optimization (inaudible), what you get is something close to the square-root Hanning window. And if we use square-root Hanning windows and compute the weights again, they are much more concentrated; so in the phase reconstruction experiments we did, we used square-root Hanning windows.

Okay. So now we apply this to phase reconstruction. We have a consistency criterion -- it's written pretty big here because it's maybe the main thing of the talk. You take this linear operator and you take a norm of it, for example the L2 norm -- you could take any norm -- and it acts as a consistency criterion: it is supposed to be zero if your spectrogram is consistent. Why is the L2 norm a good idea? I erased it from the slide, but let me explain. There is the algorithm by Griffin and Lim, which reconstructs the phase from the magnitude by going back to the time domain and again to the time-frequency domain, keeping the phase, re-imposing the magnitude you want, and doing the same thing again and again -- trying to update the phase so that it best fits your magnitude. And if you try to minimize this criterion with the L2 norm, it corresponds exactly to doing the same thing as they do. The distance their algorithm is shown to minimize is directly related to this criterion, so in terms of convergence they are equivalent -- that is the message. They are written slightly differently, but they are supposed to be (inaudible).

Okay. And where am I? Yeah.
>> Question: So you mean they have the same stationary points?
>> Jonathan Le Roux: Yes, yes.
>> Question: Are they both guaranteed to converge to some global optimum, or to a local optimum?
>> Jonathan Le Roux: Um, the --
>> Question: It makes a difference if they tend to go to different local optima, you know --
>> Jonathan Le Roux: Yes.
>> Question: I guess you're taking the L2 norm of the output of the linear operator.
>> Jonathan Le Roux: Which is -- yeah, yeah.
>> Question: Okay.
>> Jonathan Le Roux: Yeah. So to apply this to phase reconstruction, you minimize this criterion with respect to the phase only, while fixing the magnitude. And the fact that the weights are concentrated allows you to make the optimization more local if you want. The fact that mostly the central bin contributes to the image justifies updating the phase at bin (m, n) so as to minimize not the whole L2 norm, but only the term corresponding to that bin. Strictly speaking this means you are not minimizing the L2 norm -- you minimize bin by bin -- so it is a somewhat strange optimization, because it is not tied to a single objective function. But if you do that, experimentally it also brings the global criterion down, because it goes through all the bins basically. And with the second approximation -- only the central coefficients have significant weight -- to compute that coefficient you can replace the sum over all the bins with a sum over the central block only, which is much faster; you can be faster than computing FFTs.

The advantages of this: you can exploit sparseness, so if only a few parts of the spectrogram have significant amplitude, you only need to update the phase for those bins. You're also more flexible, because if you already know some of the phase -- if you think some part of the phase is reliable -- you don't need to update it.
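(A minimal scipy sketch of the Griffin-and-Lim-style iteration just described: alternate inverse STFT and STFT, keep the resulting phase and re-impose the target magnitude, monitoring the consistency level. The parameters and the random target magnitude are placeholders; note that this baseline updates all bins at once, unlike the local, sparse updates of the method in the talk.)

```python
import numpy as np
from scipy import signal

def reconstruct_phase(mag, win, nperseg, hop, n_iter=100, seed=0):
    """Estimate a phase for a target magnitude spectrogram by alternating
    ISTFT and STFT, re-imposing the magnitude at every step."""
    rng = np.random.default_rng(seed)
    X = mag * np.exp(1j * rng.uniform(0, 2 * np.pi, size=mag.shape))
    for _ in range(n_iter):
        _, x = signal.istft(X, window=win, nperseg=nperseg, noverlap=nperseg - hop)
        _, _, Y = signal.stft(x, window=win, nperseg=nperseg, noverlap=nperseg - hop)
        Y = Y[:, :mag.shape[1]]
        if Y.shape[1] < mag.shape[1]:                   # round-trip padding can change the frame count
            Y = np.pad(Y, ((0, 0), (0, mag.shape[1] - Y.shape[1])))
        # Consistency level in dB: distance between the current spectrogram and STFT(ISTFT(.))
        level = 10 * np.log10(np.sum(np.abs(Y - X) ** 2) / np.sum(mag ** 2) + 1e-300)
        X = mag * np.exp(1j * np.angle(Y))              # keep the phase, restore the target magnitude
    return X, level

# Toy usage: a magnitude spectrogram taken from a random signal
nperseg, hop = 512, 256
win = np.sqrt(signal.get_window("hann", nperseg))
target = np.abs(signal.stft(np.random.default_rng(1).standard_normal(16000),
                            window=win, nperseg=nperseg, noverlap=nperseg - hop)[2])
X_hat, level = reconstruct_phase(target, win, nperseg, hop)
print(level)
```

The approach in the talk instead works directly on the consistency criterion in the time-frequency domain, updating selected bins from their neighboring frames only, which is what makes the sparse and local variants possible.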
Those bins you just keep as they are, and you only update the other terms. And the computational cost is supposed to be lower if you use the local approximation.

Comparison with previous work. In the Griffin-and-Lim approach the idea is slightly different: it is to find the signal x in the time domain whose STFT magnitude is closest to the one you have estimated, the argument being that its phase would then be the best phase. It's actually equivalent, but they iterate between the STFT and the inverse STFT, updating only the phase, so they have to constantly alternate between the time and time-frequency domains, which means it is costly, because you have to do many STFTs. It's also not flexible, because you have to update all the bins at the same time; you cannot choose which bins you're going to update. Our method, as I already mentioned, is a direct optimization in the time-frequency domain, and it is not limited to the phase, so you can deal with full complex spectrograms, as I mentioned at the beginning of the talk. And with the approximated version you have lower cost. The reason for this approximation is that when you minimize over the phase, you get a nonlinear problem. If you were minimizing over the complex bins it would be linear, and simple L2 optimization techniques would work; but over the phase you need to do something else, because it's much harder. That's why we came up with this approximation.

Okay, so here is an example. I'm running a bit late, right? I'm sorry.
>> Jasha Proppo: I think we have at least 23 seconds left.
>> Jonathan Le Roux: (Laughter) Okay. Okay. So --
>> Jasha Proppo: (Inaudible) 32 seconds. We have the room for another 45 minutes.
>> Jonathan Le Roux: Oh, that's fine. Okay. So --
>> Jasha Proppo: We actually have time for questions at the end, but --
>> Jonathan Le Roux: Okay.
>> Jasha Proppo: Your audience is rude, they never let you get in a word.
>> Jonathan Le Roux: So I won't have many questions at the end. (Music playing) Okay, sorry, I just pressed that. So, do you know this?
>> Question: Is that the second -- is that the (inaudible) one, the original?
>> Jonathan Le Roux: No, that's the original. It's from the (inaudible) music database; I don't know who is playing, actually.
>> Question: I thought it was (inaudible), because it does sound very slow -- that's the way it's supposed to sound.
>> Jonathan Le Roux: It's going to get slower. We slow it down by 30%, for a final length of 32 seconds, and we compare the original method of Griffin and Lim with our method, and also with a sparse version where we only update the bins that have significant amplitude. In terms of number of iterations, we see that we have faster convergence: without sparsity we need much fewer steps to reach the same level; with sparsity we first need fewer steps and then it's pretty much the same as the original method, in blue. And if you take into account the speed of each step, here is what you get: the time needed to reach a certain level. You see that, for example, to reach minus 15 (inaudible) of consistency, that is to bring the criterion down to that level, we only need 2.1 seconds with our process, against about 40 to 45 seconds for the original algorithm. So it's much faster. The quality is supposed to be pretty much the same, because we wait long enough to reach the same inconsistency.
>> Question: Does the window affect the final quality?
>> Jonathan Le Roux: I think it does, and especially the overlap affects the final quality. So here I'll let you hear the 50% overlap results, which don't actually sound that good. If you wanted to make a product, you would want to go to 75% overlap -- a 25% frame shift, sorry -- or more. But the thing is, our algorithm outperforms the original, state-of-the-art method by more at 50% overlap and by less and less as the overlap grows, because it doesn't scale as well: we don't use FFTs, so if you have more overlap you have many more coefficients and it slows down a bit. But it's definitely an important issue. So I'll let you listen to the result without processing. Without processing means we just read the original waveform slowly: we read the frames with a modified frame shift and write them back out with the original frame shift, so that the final result plays slower. Okay.
(music playing slower)
>> Jonathan Le Roux: So, well, it's not great. I'll play you one of the processed ones -- the (inaudible) one; they all sound pretty much the same.
(music playing)
>> Jonathan Le Roux: You can hear it goes through some modulation in the (inaudible). It's not very clear, but with headphones (inaudible), and I think it's mostly due to the fact that it's 50% overlap; it gets better with more overlap. Okay, so now to -- yeah?
>> Question: Sorry, is there a reason why you used music rather than speech for this?
>> Jonathan Le Roux: No, not really.
>> Question: Would it also work to slow down speech?
>> Jonathan Le Roux: It works on anything. The thing is, I think we noticed at some point -- I'm not sure -- that for speech and music the window length you want to use might be different. It may sound better with a shorter window (inaudible) for speech, or the opposite (inaudible). And I don't really understand why, actually. But it's --
>> Question: That's a sound you cannot produce, right, with a piano.
>> Jonathan Le Roux: Uh-huh.
>> Question: Even if you play slower, it is not going to sound the same. It's never going to be that (inaudible).
>> Jonathan Le Roux: Yeah, yes. It's supposed to be -- yeah, not natural. If you think of how the sound is produced -- it also slows down the attack, for example, and maybe you don't want to slow down the attack. There are methods, I think, like sparse coding methods, trying to do exactly that: separate the attack from the rest and slow down only the rest. I think some people are actually trying to do that.
>> Question: Does the attack take more than a frame or two?
(talking over each other)
>> Jonathan Le Roux: So this next one is more of a fun thing -- I found it pretty cool when I figured it out, and it's kind of simple, yet it makes a nice demo. You can actually use inconsistency to do funny things: you can make silent spectrograms. How you do that: you are given analysis and synthesis windows which satisfy perfect reconstruction, and you take a set of complex numbers -- anything, consistent or not. If you go back to the time domain, you get a signal, and then if you look at its STFT, you get something different in general. But the two still have the same inverse STFT.
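(A minimal scipy sketch of that construction; the window choice, sizes and random spectrogram are illustrative. Take any array of complex numbers H and form the difference D = H - STFT(ISTFT(H)): D is nonzero, yet it resynthesizes to near-silence with the matched window pair, and to something clearly audible with a mismatched synthesis window.)

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)
nperseg, hop = 512, 256
win = np.sqrt(signal.get_window("hann", nperseg))        # square-root Hann analysis/synthesis pair

# Any set of complex numbers, consistent or not
shape = (nperseg // 2 + 1, 80)
H = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

# Its inverse STFT, and the STFT of that resynthesized signal
_, x = signal.istft(H, window=win, nperseg=nperseg, noverlap=nperseg - hop)
_, _, H2 = signal.stft(x, window=win, nperseg=nperseg, noverlap=nperseg - hop)
H2 = H2[:, :H.shape[1]]                                   # align frame counts

# The difference is nonzero, but it gives (near) silence for this particular window
D = H - H2
_, d_matched = signal.istft(D, window=win, nperseg=nperseg, noverlap=nperseg - hop)
_, d_wrong = signal.istft(D, window="hann", nperseg=nperseg, noverlap=nperseg - hop)
print(np.max(np.abs(D)), np.max(np.abs(d_matched)), np.max(np.abs(d_wrong)))
```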
So if you look at the difference, you have something which is nonzero, but which gives you, for that particular synthesis window, silence. That's nice, because if you use a different synthesis window, it is not silence. So you could think of using this in an audio encryption scheme in the time-frequency domain: since it is only with the correct window that you get silence, the window could act as a sort of key. If you have a signal that you want to hide, you can add to its real spectrogram a pseudo-spectrogram which resynthesizes to silence. So you start with a signal -- sorry, a funnier piece of music this time.
(music playing)
>> Jonathan Le Roux: It's a mix of trumpet and piano. You have its spectrogram, and you create a pseudo-spectrogram by, for example, taking random complex numbers and applying the procedure I explained: you take the difference between those complex numbers and the STFT of their inverse STFT. You add that difference with a large coefficient -- note that you may run into dynamic range problems when doing this, but, well, it's just a demo. So here, look at the power spectrogram: here is how it looks, and basically you don't see much. Here I used square-root Hanning windows. If you use a Hanning window to resynthesize, which is the incorrect window, here is what you get: minus 30 dB of signal-to-noise ratio. Let me lower the volume a bit.
(making static sounds)
>> Jonathan Le Roux: Okay, so I overdid the coefficient a little -- it was 10,000 or something -- so dynamic range really is an issue when you do this. And the thing is, if you use the incorrect window to resynthesize the original spectrogram of the music alone, you still have plus 18 dB, so it sounds a bit fuzzy, but still not too bad.
(music playing)
>> Jonathan Le Roux: And if you use the correct window, square-root Hanning, you have perfect reconstruction up to the quantization level. Well, I don't know if it has any application, but I thought it was pretty fun.
>> Question: So what you're doing is, in the complex time-frequency domain, you're adding something that is orthogonal to --
>> Jonathan Le Roux: Yeah.
>> Question: -- linearly orthogonal to your signal.
>> Jonathan Le Roux: Yes -- but only for that particular window. If you don't know the window, a way to crack it, which I haven't tested, would maybe be to minimize the energy of the output with respect to the window: you try to find the window which gives you the least energy. It could maybe be a (inaudible). So I'm not saying it is robust or anything, I just thought it was fun.

Okay, so a summary of that part -- it was a bit slow, sorry. We have a consistency criterion, and we can use it for phase reconstruction. I'm planning to use it for missing data reconstruction: if you reconstructed something in the power domain, part of the phase is known and you want to reconstruct the other parts, maybe you could use this. A potential application is as a cost function on complex spectrograms: if you do separation in the complex domain, maybe you could use this as the cost. Okay, now to part two -- I'm trying to be faster on this one. This is work on template matching in the time domain, which I presented the other week at NIPS.
So again an illustrative problem: you want to (inaudible) in the time domain, the conditions being that you have a single-channel observation, you don't know what the sounds look like, you don't know their timings, and you have no training data. The goal is to recover the patterns, their timings and their amplitudes. So here I have, for example, a two-second electronic drum loop -- well, let me play it first.
(drums playing)
>> Jonathan Le Roux: Okay. I've downsampled it quite a lot to make it tractable in terms of computational cost, so it sounds a bit muffled. But it has been built using two templates, a bass drum and a snare drum, taken from real sounds, and then placing them with different amplitudes at random timings and merging them into a single channel. The goal would be, from that signal, to re-estimate the templates and to re-estimate their timings and amplitudes.

As for the motivation in general: compared to the power domain, the advantages of the time domain are that you don't need to choose analysis parameters, like the frame length or the frame shift; you don't need to resynthesize; additivity is exactly true; and you can also exploit waveform regularity when it holds. In extracellular recordings, as I said, we have good reproducibility of waveforms across events, and to some extent, for musical instruments, maybe piano and drums, you would also be able to use that. More generally, we wanted to design a method to perform template matching, but with unknown templates, and also to investigate signal decompositions where the amplitudes are assumed to be sparse and nonnegative.

Okay. So our goal will be to explain a time series with a decomposition which detects the timings of events, where the events are characterized by their time course, and this time course is discovered from the data. We suppose these events can have variable amplitudes, but only positive ones, so we cannot have cancellations. And we also allow for overlapping events. Not sure what I meant by this sentence, but anyway... okay.

So here is what the model looks like. It's basically a convolutive factorization: here are the different events, where k stands for the event type, and you have the time courses, short waveforms. And here are the amplitudes at the different lags; the amplitudes are supposed to be nonnegative, and we put a sparse prior on them, a generalized Gaussian prior. The idea is to pack as much information as possible into the event itself. If you didn't assume any sparse prior -- for example, if your events were also nonnegative -- you could imagine having an event which is just a Kronecker delta, 1 at one point and zero everywhere else, with the whole time course pushed into the amplitudes; that is not really informative. So to get more meaningful results, we want to pack all the information into the event itself and keep only things like onset timings in the amplitudes; that's the reason we use the sparse prior.

Okay. So we perform maximum a posteriori estimation by -- it should say minimum here -- trying to minimize the L2 norm between the model and the observed data, plus a penalty term. And we normalize the templates to avoid a scaling indeterminacy, where you could reduce the sparseness penalty by putting more weight into the templates; that is not very interesting. Okay.
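(A minimal numpy sketch of the generative model just described; the sizes, number of events and noise level are made up. Each event type k has a short template b_k, and the signal is the sum over k of the convolution of b_k with a sparse, nonnegative amplitude train a_k, plus noise. The estimation problem in the talk goes the other way: recover the templates and amplitudes from the mixture alone.)

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, K = 4000, 200, 2            # signal length, template length, number of event types

# Templates b_k: short waveforms (random stand-ins for bass drum / snare drum strokes)
B = rng.standard_normal((K, L))
B /= np.linalg.norm(B, axis=1, keepdims=True)          # unit-norm templates, as in the talk

# Sparse, nonnegative amplitudes a_k(t): a few events at random onsets
A = np.zeros((K, T))
for k in range(K):
    onsets = rng.choice(T - L, size=5, replace=False)
    A[k, onsets] = rng.uniform(0.5, 1.5, size=5)

# Convolutive model: x(t) = sum_k sum_tau a_k(tau) * b_k(t - tau), plus Gaussian noise
x = sum(np.convolve(A[k], B[k])[:T] for k in range(K)) + 0.01 * rng.standard_normal(T)
```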
A brief review of existing algorithms which are close to this one. Nonnegative matrix factorization (NMF) treats the case of decomposing a signal into a product of two nonnegative terms, two nonnegative matrices. Semi-nonnegative matrix factorization treats the case where only the amplitudes are nonnegative and B can have any sign, so it's close to what we want to do, but it doesn't allow for shifts, so it doesn't allow for different onset timings. Shift-NMF handles the different onset timings, so it's also close, but both the templates and the amplitudes need to be nonnegative. So we need to combine these two models. Our model looks like this convolution with nonnegative amplitudes, and by writing down the sums in this notation we can rewrite the model as a product -- we convert the convolution into a product of a (inaudible) and a matrix -- which allows us to take the updates for semi-NMF and shift-NMF and convert them into updates for our model pretty easily.

So, yeah, I don't have much time, so maybe I won't insist too much on this, but these are the original update equations for semi-NMF. You can modify them: the difference is basically that in these correlations you allow for shifts -- well, believe me, this should be correct. If you introduce the sparsity term, then the derivative of the penalty term comes up in the updates. And if you finally add the normalization of the templates, you could enforce it by using Lagrange multipliers, although we didn't do that in the first implementation; we just renormalized every few steps, because it's much simpler.

Okay, some results. First we wanted to test it on synthetic data, where we know the correct answer, to see how it performs. We used two templates and random timings and amplitudes. If you take those templates, add them up in a single channel and also add some Gaussian noise, here is what you get. From this noisy data we were able to reconstruct the templates quite well, and their timings as well. And here is what the reconstructed waveform looks like -- it looks like a denoised version of the original. I guess averaging over the many noisy occurrences improves the estimation of a template: if you have many noisy versions of your template, by effectively taking their mean you reduce the amount of noise.

Okay. We also looked at the evolution of performance with respect to template similarity, similarity being how alike the templates are. It's our own definition, but here 0 means that we use the two templates I showed you, and 1 means the templates are identical. It gets easier again when you have the same template, because then you only have one template to estimate, but at some point in the middle it gets harder.

Okay, and real data. This is neural data, (inaudible) recordings, so I'm not too knowledgeable about this. These data are composed of two types of spikes: the (inaudible) spike, which is in red, and the (inaudible) spike, which is in blue. The red spike may appear by itself sometimes, but the blue spike always appears with a red spike just before it -- slightly around here, and smaller. Here they are normalized, but usually the blue one always overlaps with the red, and the red sometimes appears alone.
Doing this sort of spike sorting has been studied a lot, but usually conventional methods cannot really deal with overlapping spikes; that's why we tried to design a new method. The reference spikes here were built by hand: you try to find regions where the red one is alone, align them together and find a typical waveform, and then, in the blue-and-red complexes, you cancel out the red and estimate the blue. That sort of thing takes a lot of time and it's not easy to do. The automatically retrieved spikes look like this. They're quite good, quite alike -- not perfect, especially for the blue one, because of this bump here, which is stronger than in the original. We think it comes from the fact that the red spike almost always appears around here, so it has partly been learned as part of the blue one.

Okay. And on the drum loop data that I showed you earlier: we added some noise, so from the noisy data here...
(drums playing)
>> Jonathan Le Roux: ...I ran the algorithm on this data and was able to get these templates back, and their timings. Here they have been normalized, so the amplitudes are slightly smaller, but you can see that the evolution is very similar. And here is what the reconstructed waveform looks like, on the top right.
(drums playing)
>> Jonathan Le Roux: No, sorry, not sure that's... let me play it again.
(drums playing)
>> Jonathan Le Roux: Not sure I played the right file. So, the original...
(drums playing)
>> Jonathan Le Roux: ...and the reconstruction...
(drums playing)
>> Jonathan Le Roux: Yeah. It's supposed to sound slightly less noisy, but only a bit, because there are only four examples of each event here, so these are difficult conditions to learn the templates from so few occurrences; it still has quite some noise. And now we play only the bass drum part, and then only the snare drum part.
(drums playing)
>> Jonathan Le Roux: And the snare drum...
(drums playing)
>> Jonathan Le Roux: Okay. So, well --
>> Question: That sounds a little more like a series of snare drums than a single snare drum.
>> Jonathan Le Roux: Actually the original sound is a double stroke, which makes it even harder to learn, because it's highly correlated with itself. So --
>> Question: You were playing (inaudible), you were playing the --
>> Jonathan Le Roux: I was playing this.
>> Question: All right.
>> Jonathan Le Roux: Sorry, I was playing the re -- like the separated --
>> Question: The convolution.
>> Jonathan Le Roux: Yeah, the convolution already.

Okay, so a short discussion. The spikes or events can be overlapping; they do not need to be isolated as in (inaudible) PCA or ICA. Sparseness is the key criterion. We have a proof of convergence for the algorithm, and the decomposition is shift-invariant. We could decompose a signal into spike trains -- I don't know if you know the Nature paper by Smith and Lewicki on sparse coding of natural sounds. They retrieve natural basis sounds, and they actually found that these look like the cochlear filters of a cat, which is a quite interesting result -- it suggests that the cochlear filters actually realize a very sparse code. I was hoping to get the same kind of result with our algorithm by training on a large amount of data, but I haven't been able to get results yet.
I hope maybe at some point, considering the models are supposed to be quite close and that this model is supposed to do the right thing, that with a lot of tweaking of parameters we can get something. And, okay, yeah, the model has the form of a convolutive mixture, but since we have the nonnegativity constraint, and sparseness, and we also need to be careful with the lengths of A and B, the interpretation is slightly different from that of blind source separation.

Okay, that's going to be my last slide. I presented complements and/or alternatives to power-domain modeling. First, in the complex time-frequency domain, we could actually recover phase from power; we could potentially use the criterion as a guide, so as to perform consistency-aware modeling; and maybe in the future use it as a cost function for complex time-frequency domain modeling. And in the time domain, we exploited the reproducibility of waveforms and were able to deal with overlapping events while decomposing signals. And that's the end. Yes?
>> Question: Are you familiar with Barry (inaudible)'s work?
>> Jonathan Le Roux: Yes, yes.
>> Question: Because they also use a convolutive mixture. I think they also have sparseness with --
>> Jonathan Le Roux: Yes, yes.
>> Question: -- with an entropic prior.
>> Jonathan Le Roux: Yes. But no, it's in the power spectrum domain.
>> Question: Power spectrum?
>> Jonathan Le Roux: I think so.
>> Question: Because I think they were both nonnegative. Both --
>> Jonathan Le Roux: Yes, yes, that is the main difference. But I mean, it's been a hot topic recently, so people are definitely using sparseness on the amplitudes.
>> Question: Yeah, I recall it was on the spectrum.
>> Jonathan Le Roux: Yeah. I mean, the main difference between this last work and what other people have done is that we use slightly different conditions. Yes?
>> Question: It reminded me of that, that is why I'm asking.
>> Jonathan Le Roux: Yes, definitely.
>> Question: You know, the second method finds things like drum rolls, where --
>> Jonathan Le Roux: Uh-huh.
>> Question: -- the copies are exact.
>> Jonathan Le Roux: Yeah, that's the easiest case.
>> Question: There's a lot more where they're similar, but not exact, right? Like --
>> Jonathan Le Roux: Yeah.
>> Question: Real drums --
(talking over each other)
>> Question: -- or a vowel or whatever.
>> Jonathan Le Roux: For vowels and such I think it would probably be hopeless. For real drums, I need to try. I haven't been able to find the time to record real drums, and at first I wanted to do things like tapping with a pen on a table. It's just not that easy to get a clean signal, and especially to avoid double strokes with a pen on a table. So I guess we would have been too far from --
>> Question: Real conditions. Maybe that is the kind of data that you need.
>> Jonathan Le Roux: Yeah, I think it would have been really cool if we were able to separate these two. I don't see an application yet, but with some real data it would have been interesting; I just didn't have much time to do it. The few times I tried, I was never able to get single-stroke data. Maybe that means that in reality it's just hard to get this kind of data. We definitely need to try real drums.
>> Question: What -- the drums are going to be difficult.
>> Jonathan Le Roux: You expect so?
>> Question: Yeah. It's -- yeah, it's a pretty random thing.
>> Question: It's not reproducible in the time domain (inaudible) --
>> Jonathan Le Roux: You think so? Yeah, have you looked into it?
>> Question: Yes.
>> Jonathan Le Roux: Okay. That's the kind of --
>> Question: They have this thing that kind of clatters against the snare.
>> Question: That's a random vibration, even though the motion of the drum head is very complex. Uh-huh. And once you get something that's above your sampling rate -- what is your sampling rate? (inaudible) --
>> Question: Yeah.
>> Jonathan Le Roux: You could make it higher.
>> Question: Anything above your sampling rate comes out as random once it gets --
>> Jonathan Le Roux: Uh-huh.
>> Question: -- folded down a few times past Nyquist. I would guess template matching in the time domain is destined to fail.
>> Jonathan Le Roux: Okay, for drums, yeah. I mean, the first motivation was not drums; it was the neural data, on which it's supposed to work better. And the question was to what extent we can use that model.
>> Question: There are plenty of things where you're sending out known impulses and such -- plenty of applications, (inaudible) or whatever -- so I'm trying to figure out where it would apply. Even for the neural data, it seems like the sparse prior is biased to put the red and blue spikes together into one prototype rather than generate a new spike for it.
>> Jonathan Le Roux: Yeah, it seems so in the end --
>> Question: It's an interpretation of the same data: if A sometimes comes alone and B always comes with A, then you learn that A-and-B come together, and that's your --
>> Jonathan Le Roux: They could come together with different amplitudes, in which case you would have to learn both. Or maybe they come really often with the same amplitude and only sometimes with different amplitudes, in which case it's hard to guess. Well, I think it's a difficult task, very underdetermined, and we try to come up with constraints to make it slightly more determined. We also need to figure out whether it's really important to enforce that the amplitudes are nonnegative, because the sparseness should take care of it: if a signal really is made of nonnegative amplitudes times these templates, then you should end up learning good templates and getting only nonnegative amplitudes, because that is all there is in your data. So you shouldn't have to enforce it as a constraint.
>> Question: Right.
>> Jonathan Le Roux: But, as with underdetermined problems in general, it's just better if you can enforce it, so that is why we did it. I think we need to compare with and without the constraint to see whether it works better.
>>: Cool. Thank you very much, Jon --
>> Jonathan Le Roux: Thank you.
>>: -- for a thought-provoking talk.
(applause)