>> Ivan Tashev: It's my pleasure to present Ngoc Q.K. Duong, who is going to give a talk about probabilistic spatial modeling of reverberant and diffuse sources and how to apply it to source separation. He took his Bachelor's degree in his native Vietnam, took his Master's degree in Korea, and is going to finish his Ph.D. at INRIA in France. So this alone is an interesting path of doing research, most of it in the area of audio signal processing. So without further ado, you have the floor.

>> Ngoc Q.K. Duong: Okay. Yes. Thank you for the introduction, and good morning, everyone. First, let me briefly talk a bit about my background. I'm Ngoc Q.K. Duong. I'm Vietnamese. For four and a half years I took my Bachelor's degree in electrical engineering at the Posts and Telecommunications Institute in Vietnam. Then I spent almost one and a half years working as a system engineer in industry, mainly on Polycom video conferencing products, audio and audiovisual. For two years I moved to South Korea to pursue a Master's degree in electrical engineering, mainly in signal processing. After getting the degree, I had a happy year working as a research engineer at an [inaudible] company, also because that company works on sound and speech technology, which it transfers to [inaudible] industry. Then, not really finishing with that, I got funding for a Ph.D. and moved to France, to the French national research institute INRIA, to pursue my Ph.D. It will finish in October of this year, so still four months to go, exactly a three-year Ph.D. That is briefly my background. Today I'm happy to present my Ph.D. work, a probabilistic spatial model for reverberant and diffuse sources applied to audio source separation. Please feel free to interrupt me at any time with your questions or comments. Okay, here we go.

So the problem we're looking at is the cocktail party effect, where we have many sound sources mixed together, in a meeting room for instance, or at a party, which prevents people from understanding each other. For instance, here's a real recording I made in a meeting room at INRIA. [recording]. Okay, that's the director of INRIA. We agree it's very difficult to understand, especially for an acoustic recognizer to recognize the talk with lots of background noise and interference like that. So our goal is to separate the sources from the recorded signal, which is a mixture, and this problem is known as blind source separation, BSS. BSS has lots of practical applications, for instance for hearing aid systems, for speech recognition, which is very familiar to you, and also for robotics or music information retrieval, where people try to separate the different components, like the [inaudible] or speech component, in order to do some music information retrieval task.

Considering the recording environment, in reality we have two types of environment: one is anechoic and the other is reverberant. By anechoic I mean, say, a professional studio or an outdoor recording, while the reverberant environment comes from a meeting room or a [inaudible]. For instance, if you record a mixture in an anechoic environment you will hear quite clear sounds like this. [recording].

>> Ngoc Q.K. Duong: The main voice you can understand clearly. But if you do the same thing in a reverberant room, with a lot of reverberation, what you will hear is this. [recording]. >> Ngoc Q.K.
Duong: You hear a lot of echo, and you feel that this one is actually very difficult to process. The reason you have such a difference is that in an anechoic environment you only record the direct path from the source to the microphone, without any reflection, while in the reverberant case you have a lot of sound reflections from the source to the microphone. This is a nice picture from the cafeteria at INRIA. As you hear from the sounds and see from here, it's very difficult to model exactly the acoustic path from the source to the microphone in a reverberant environment, and source separation there remains very challenging, while state-of-the-art approaches already provide some good results in the anechoic case. That's why in my Ph.D. work I focus more on the modeling of the reverberant environment, which has more practical applications.

Okay. Now, if you look at state-of-the-art approaches, there are two kinds of information we can exploit. The first is to model and exploit the spatial information, that is, the spatial position and location of the sources. The other is to model and exploit the spectral cues, which relate to the spectral information of the sources. In more detail, as spatial cues you can imagine interchannel phase and intensity differences. For instance, look at an anechoic recording environment: if the source is at a position closer to microphone 1, the signal at microphone 1 will be larger than at microphone 2, so you have a very simple amplitude relation between the two channels. If the source moves to a position closer to microphone 2, the cue flips and the level at microphone 2 is larger than at microphone 1. This is very simple but gives you the basic idea of the spatial cues. For the spectral cues you can exploit, for instance, if you look at the spectrum of speech, its smoothness, the spectrum being quite smooth over the frequency range, and its sparsity, because at a lot of time-frequency points the coefficient is close to zero. If you look at the state of the art for source separation, we find there is much work on spectral cues, based on statistical models like GMM, HMM or nonnegative matrix factorization, and there is actually very little work on spatial cues, most of it based on a deterministic model. For instance, for an anechoic recording the deterministic mixing model represents the source position by the phase and intensity differences, which is far from the actual characteristics of a reverberant environment and also of diffuse background noise, where you cannot model the spatial information of the sources deterministically.

So in my Ph.D., looking at that problem, we shift from the deterministic model to a probabilistic modeling framework for the spatial cues; that is our main contribution. Then we design a general architecture for model parameter estimation and source separation. We also propose several probabilistic priors over the spatial position of the sources and over the spectral information of the sources, and [inaudible] the potential of the proposed framework in various practical settings that I'll show later in the simulations. So here is a summary of the contributions, and I will address these problems later on. Okay. Given those very general ideas and the overview, here are the contents of my presentation.
In the next part, I will focus on the general framework and parameterization, where I propose the general Gaussian modeling framework, the spatial covariance parameterization, and also the source separation architecture for the general case. Then in the next part I move to the center of the problem, which is how to estimate the model parameters, right? We have the framework, we have the parameterization, but it's very important to estimate the parameters in order to perform the separation. Then I will present several probabilistic priors that help enhance source separation, and finally I will come to some conclusions.

Okay. Now we look at the problem in a more mathematical way; we have to formulate it properly. You have the original sources, denoted s_j, where j indexes the sources in the room, and what you observe at the microphones is not actually the source itself; it's the recorded signal, and c_j is the recorded source image, meaning the contribution of source j at the microphones. The purpose of source separation is that, given the observed mixture, you need to estimate the source images c_j. Since we have a mixture of several sources, what you really observe at the microphone array is a mixture x(t), an I-by-1 vector where I is the number of microphones, and x(t) is the sum of the c_j(t) for j from 1 to the total number of sources. Okay. So the problem is: given x, estimate the c_j. Let me emphasize that we are not estimating the original sources but the source images.

Okay. So first, about the modeling framework: most approaches use a deterministic model for the spatial cues, which relies on two main assumptions. The first is the point source assumption, where the acoustic path from the source to the microphones is modeled by the mixing filter h_j; what you observe at the microphones, c_j, is the convolution of the original source with the mixing filter h_j. That is a convolution in the time domain; if you assume that the window length is much larger than the actual length of h_j, you get the narrowband approximation, where the convolution in the time domain is approximated by a complex-valued multiplication in the frequency domain. Given those two basic assumptions, most state-of-the-art approaches build their source separation systems. But we see that in practice these two equations do not hold, for a couple of reasons. For instance, the point source assumption: when I talk I'm not still like this, sometimes I move, sometimes I move my head, so you cannot model this with a deterministic mixing filter h_j. And also, because of diffuse sources and reverberation, you have very diffuse sound and lots of reflections, so modeling with a single mixing vector may not be enough. For those reasons we switch to a probabilistic framework, and here it is. In our proposed framework, we model the source image c_j as a zero-mean Gaussian random vector with covariance matrix Sigma_j, and here is the form of the probability density function for the Gaussian; it's simple. Let me emphasize a bit why we use Gaussian modeling here. In source separation many people use a Laplacian distribution for the sources, for instance, but instead of Laplacian we use Gaussian. There are a couple of reasons for this: first, if you consider the physical acoustics, the reverberation is generally modeled as Gaussian from the acoustic point of view, and second, it is easy to handle the computation.
So, as I will show later, with the Gaussian modeling we get closed-form updates for the parameter estimation, which is good. We showed, of course, that it works. So this is a very important point. Okay. Furthermore, now what you have is this: you have zero mean already, and you have the unknown covariance Sigma_j, and in our framework we factorize Sigma_j into two parameters, v_j and R_j, where v_j are the time-varying source variances, which encode the spectro-temporal power of the sources, and R_j is what we call the spatial covariance matrix, which encodes the spatial information of the sources. By spatial information we mean the spatial direction of the sources and the reverberation in the room. So the spatial information of the sources is encoded here, and all the spectral information of the sources, which can be modeled by a GMM or NMF for example, is encoded in v_j. So you separate it into two components, and all of this is encoded in the covariance matrix Sigma_j.

>>: Can you go back to the previous slide? So the dimension of the vector here is just frequency?

>> Ngoc Q.K. Duong: The vector here, c?

>>: You have n and f; when you look at one frequency, what is the dimension of the matrix?

>> Ngoc Q.K. Duong: The dimension of the matrix is the number of microphones. If you have two microphones, you have a two-by-two matrix. The c here is a vector: say I have I microphones, then c is a vector of size I by 1, right? You have one source here, and what you observe at the I microphones is an I-by-1 vector.

>>: So the frequencies, one frequency dependent on the next one, the adjacent one, right?

>> Ngoc Q.K. Duong: No, not really. Here we model each time-frequency point independently.

>>: Right. That's what I'm saying. We're making the assumption that what happens at this frequency and at the next frequency is independent.

>> Ngoc Q.K. Duong: Yes.

>>: Sigma_j is separate for each frequency but not [inaudible].

>> Ngoc Q.K. Duong: Yes, true, in general. But later on I'll make this less strict, where I assume some dependencies across frequency points. But for now, for the general framework, let me first assume that it is independent and model each frequency point like this. Okay. So, again, v will be a scalar and R will be an I-by-I square matrix, where I is the number of microphones. So in the stereo case you have a two-by-two matrix, for each time-frequency point and for each source j. Is that clear? Okay. And in most of the following we will further assume that R_j is time-invariant, so we will not have the time index here. The reason we assume that is, for instance, that for now we don't deal with the very challenging case of moving sources; we assume static source locations, or [inaudible], so we assume that R_j, which models the spatial information, is time-invariant. But let me come back to the time-varying case later in my presentation. This assumption is also made by most state-of-the-art approaches, where the spatial location of the sources is modeled by a mixing vector which is time-invariant. So here, for instance, the spatial information is modeled by a time-invariant mixing vector h(f), and we make the same kind of assumption here. Okay.
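To make the factorization just described concrete, here is a minimal numpy sketch of the local Gaussian model Sigma_j(n,f) = v_j(n,f) R_j(f); the array names and shapes are assumptions made for the example, not the speaker's code.

```python
import numpy as np

def model_covariance(v, R):
    """Local Gaussian model of the source spatial image c_j(n, f).

    v: source variances, shape (n_frames, n_freq), real nonnegative
       (the spectro-temporal power of source j).
    R: spatial covariance matrices, shape (n_freq, I, I), Hermitian PSD
       (spatial direction plus reverberation, time-invariant here).
    Returns Sigma_j with shape (n_frames, n_freq, I, I), where
       Sigma_j[n, f] = v[n, f] * R[f].
    """
    return v[:, :, None, None] * R[None, :, :, :]

# Under the model, c_j(n, f) ~ CN(0, Sigma_j[n, f]) independently across
# time-frequency points, which is the independence assumption raised in
# the question above.
```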
So now, as I said before, our work is mainly focused on modeling the spatial cues, which means we focus on the model for R_j, and in the next slides I'll present the parameterizations for R_j. Okay. First, look at the state-of-the-art parameterizations for R_j. In an anechoic recording, where there is no reverberation, R_j can be computed directly from the mixing vector. For instance, in the anechoic environment, if you know the source and microphone positions, the knowledge from room acoustics allows you to compute the mixing vector from the source to the microphones directly, like this, where the superscript "an" refers to the anechoic environment. If you then compute R_j, the covariance matrix is h^an times its conjugate transpose, and the reason I call it rank one is that this is the outer product of two vectors, so the matrix is a rank-1 matrix. Okay. If you look at the popular parameterization based on the narrowband approximation I presented before, c_j is approximated by the product of the mixing vector h_j and the original source. If you compute the covariance matrix of c_j, you get Sigma_j as the expectation of c_j c_j^H, and it reduces to this form, where this part is a scalar, v_j, which encodes the spectro-temporal information of the source, and the spatial covariance matrix R_j is h_j h_j^H, which again is a rank-1 matrix. Right? So what I am saying in this slide is that whether you consider the anechoic recording environment or the reverberant one, the state-of-the-art parameterization for R_j is a rank-1 matrix in both cases. What we want is not a rank-1 matrix; we want a full-rank matrix with more free parameters to better model the reverberation.

So first we look at what we call the full-rank direct-plus-diffuse parameterization, where we exploit knowledge from room acoustics. This model is partly new: it was considered in the context of source localization before, but this is the first time it is looked at in the context of source separation. Okay. If you assume that the direct path from the source to the microphones and the reverberation are uncorrelated, the covariance is the sum of the covariance of the direct path, computed as before from the mixing vector, and the covariance of the diffuse reverberant part. So the underlying assumption is that the direct path and the reverberant part are uncorrelated, where h_j^an is computed as before. So the state-of-the-art modeling is the rank-1 one, and here we look at a full-rank parameterization where we add some contribution from the reverberation. Okay.

>>: I have a question here. If you go back to the previous slide: built into the lower equation, some of your channel characteristics are now being captured in the covariance only in expectation, statistics over the expectation of the outer product. That's going to lose something; it's not the entire channel characterization, it's some characteristics in expectation. My question is, when you go to the next slide and you say okay, now we're going to open up this matrix to be full rank, do you have some intuition behind why the covariance is a good means for capturing the reverberant channel characteristics? Because it's not 100 percent; it's capturing it in expectation. What do you lose in that representation?

>> Ngoc Q.K. Duong: Okay.
So what I said before is that in the state of the art most approaches use a deterministic model for the spatial information, the mixing vector. If you apply that kind of deterministic model directly in our framework, the Gaussian modeling framework, which is our proposed modeling framework, and you bring in the state-of-the-art parameterization directly, you get the rank-1 one.

>>: I understand that part. But what I'm saying is, here you're saying that you're going to expand that by using full rank, right? But there's still this question of why the covariance structure is the right way to represent reverberation. Is there enough information in the covariance to really tell you about the reverberant channels?

>>: Because the question is, is it Gaussian or not. Because if it is, then that's probably fine. But maybe it isn't.

>>: I was wondering also what's being lost when you turn this into an expectation quantity over --

>>: But if it is Gaussian, then you should not just be adding that term; maybe some other combination.

>>: That's right.

>>: But it may not be Gaussian, we don't know.

>>: Yes.

>> Ngoc Q.K. Duong: Okay. I will demonstrate -- I understand your question as why we need the full rank and why we need more parameters in the covariance matrix.

>>: That's not really my question. It's okay, I can ask you later, too. Sorry. I'll say it again, but you don't need to answer it now; maybe it will become clearer in your talk. My question is, is the Gaussian approximation a good approximation for what's actually happening?

>> Ngoc Q.K. Duong: Okay. I would not say Gaussian is the best way, but it's a good way, and I pointed out several reasons why we consider Gaussian on the previous slide. Of course, you can consider another distribution, for instance Laplacian, but it may be more difficult to estimate the parameters later on. But I haven't been doing this.

>>: My point is, if it turns out that it's not very Gaussian, then adding more parameters to a Gaussian representation won't necessarily help. But if you have a strong argument for why it is Gaussian --

>> Ngoc Q.K. Duong: Okay, now I understand your question. For sure, from physical acoustics, if you look at the acoustics books, most people in the physical acoustics community model the reverberation as Gaussian. I mean, not modeling both parts as Gaussian: if you consider the direct path, okay, it's not Gaussian, but the diffuse reverberant part they really do model as Gaussian. So that is one of the reasons we consider it.

>>: There is a staging: as the sound bounces off the walls, the speech signal distribution moves from the sharper [inaudible] to more and more Gaussian, and the tail of it is completely Gaussian, pretty much an infinite number of [inaudible]. But between the direct path and the tail it goes from the sharper distribution, and this is what is actually missing there.

>> Ngoc Q.K. Duong: Yeah. For example, if you look at the modeling of the original source itself -- okay, let me come back to this one -- if you look at the model for the source here, it's obviously not Gaussian, it's more like Laplacian or something. But now we're looking at the model for its image at the microphones; there's a lot of reverberation, so it could be a less sharp distribution.

>>: Okay. Thanks.
>> Ngoc Q.K. Duong: Now, let's stop here. This model had been considered in the context of source localization, 2003 is the first paper I saw, and now we bring it into the context of source separation and see how it performs there.

>>: So the second matrix -- sorry.

>> Ngoc Q.K. Duong: So, yeah, if that assumption is satisfied and if you know the room setting, then you know everything: the knowledge from room acoustics allows you to compute omega and sigma directly. For instance, if the room is rectangular, if you know the reverberation time and you know the wall area, you can compute the proportion between the reverberant part and the direct path, and it's very nice that this covariance matrix here is like a sinc function. So, yeah, it's proved already; it's kind of deterministic. If you know everything, you have this model.

>>: So the second term -- all I can think of is a 1954 paper, I can't find it, I believe. Where did you get the second term?

>> Ngoc Q.K. Duong: This one?

>>: Yeah.

>> Ngoc Q.K. Duong: From a textbook.

>>: Okay.

>> Ngoc Q.K. Duong: It's a very old textbook, like from the '60s or something, where it's proved already. And this equation as written -- it's my mistake, I should mention a reference here -- appears in [inaudible]'s work on beamforming.

>>: I think his paper [inaudible] is the second one, I think.

>> Ngoc Q.K. Duong: [inaudible] is easier to model --

>>: That's the earlier paper. The 1954 one is the computation of the [indiscernible] covariance matrix, but no one uses it.

>> Ngoc Q.K. Duong: That's why it's not really new, but it's new in the context of source separation, so I consider it here. But this is still not what I want; I just move from the rank-1 model toward the one I want. What I really want: here you see that it is full rank, but it is too constrained. You have to know everything, and the contribution of this part and the contribution of that part are fixed, which is too much constraint. What we want is to relax all the constraints and represent R_j as a truly full-rank matrix. So what does this look like? If you consider the state-of-the-art rank-1 anechoic or rank-1 convolutive models, the spatial covariance matrix is the outer product of two vectors; we say a rank-1 matrix. What I propose in my work is to represent R_j as an unconstrained matrix where we release all the constraints. So you see, over these slides: this one is too constrained; this one is less constrained, with more parameters; and this one is what we propose here, really unconstrained. So we have more parameters to model the reverberation. Okay.
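To illustrate the parameterizations just discussed, here is a small numpy sketch that builds the rank-1 anechoic covariance and the full-rank direct-plus-diffuse covariance whose diffuse part uses the sinc coherence mentioned in the discussion. The geometry helpers, the 1/(4*pi*d) attenuation, and the free reverberant-power parameter are simplifications for illustration, not the exact formulas from the thesis.

```python
import numpy as np

C_SOUND = 343.0  # speed of sound in air, m/s

def steering_vector(f_hz, mic_pos, src_pos):
    """Anechoic mixing vector h^an(f): pure delays and 1/(4*pi*d) attenuations."""
    d = np.linalg.norm(mic_pos - src_pos[None, :], axis=1)          # (I,)
    return np.exp(-2j * np.pi * f_hz * d / C_SOUND) / (4 * np.pi * d)

def rank1_anechoic_cov(f_hz, mic_pos, src_pos):
    """Rank-1 parameterization: R(f) = h^an(f) h^an(f)^H (outer product)."""
    h = steering_vector(f_hz, mic_pos, src_pos)
    return np.outer(h, h.conj())

def diffuse_coherence(f_hz, mic_pos):
    """Spatial coherence of an ideal diffuse field: sinc(2 f d / c) between mics."""
    d = np.linalg.norm(mic_pos[:, None, :] - mic_pos[None, :, :], axis=-1)  # (I, I)
    return np.sinc(2.0 * f_hz * d / C_SOUND)   # np.sinc(x) = sin(pi x)/(pi x)

def fullrank_direct_diffuse_cov(f_hz, mic_pos, src_pos, sigma_rev2):
    """Full-rank direct+diffuse model: rank-1 direct part plus a scaled diffuse part.

    In the talk sigma_rev2 is derived from the room geometry and reverberation
    time; here it is just a free parameter.
    """
    return rank1_anechoic_cov(f_hz, mic_pos, src_pos) \
        + sigma_rev2 * diffuse_coherence(f_hz, mic_pos)

# Example: two mics 20 cm apart, one source roughly 1.2 m away, at 1 kHz.
mics = np.array([[-0.1, 0.0, 0.0], [0.1, 0.0, 0.0]])
src = np.array([0.5, 1.1, 0.0])
R_dd = fullrank_direct_diffuse_cov(1000.0, mics, src, sigma_rev2=0.01)
```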
So now we already have the Gaussian modeling framework and the parameterization of the spatial covariance matrix, and it's time to describe the source separation architecture. Okay, look at it: you have the observed signal x(t), you transform it into the time-frequency domain and you get x(n,f). Given the Gaussian framework, the parameters you need to estimate are the source variances for each source and the spatial covariance matrix for each source. Once you have estimated them, you can perform multichannel Wiener filtering to estimate the c_j, and then you convert back to the time domain to get what we want. For the multichannel filtering you have the equation here: if you know v_j and R_j for all time-frequency points, for all sources, this is simply an equation. But let me emphasize that here we assume the sources are uncorrelated, so the covariance matrix of x is the sum of the covariance matrices of the individual sources; this uncorrelatedness assumption is weaker than independence. So this part and this part are ready. What we want now is this part: how to estimate v_j and R_j. Okay. This will be our focus.
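Before moving on, here is a minimal sketch of that multichannel Wiener filtering step for a single time-frequency point, assuming v_j and R_j are already available; it is an illustration of the formula just described, not the speaker's implementation.

```python
import numpy as np

def wiener_separate_tf(x_f, v, R):
    """Multichannel Wiener estimate of each source image at one TF point.

    x_f : observed mixture, shape (I,), complex.
    v   : source variances at this TF point, shape (J,).
    R   : spatial covariance matrices, shape (J, I, I).
    Returns c_hat of shape (J, I): c_hat[j] = v[j] R[j] Sigma_x^{-1} x,
    where Sigma_x = sum_j v[j] R[j] (sources assumed uncorrelated).
    """
    Sigma_x = np.einsum('j,jab->ab', v, R)                    # mixture covariance
    Sigma_x_inv = np.linalg.inv(Sigma_x + 1e-10 * np.eye(R.shape[-1]))
    W = v[:, None, None] * R @ Sigma_x_inv                    # (J, I, I) filters
    return W @ x_f                                            # (J, I)
```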
But before actually estimating these parameters, let me first demonstrate the potential of the proposed framework in the oracle and semi-blind settings. In the oracle parameter estimation, all the parameters are estimated from the known source images. This means it provides the upper bound on the source separation performance: if you know everything and can compute the parameters, it shows you the upper bound. I just want to show that the upper bound of the proposed framework is higher than the upper bound of the state-of-the-art approaches; if we succeed in this, then later on we will try to estimate the parameters blindly. That is the goal of this work. In the semi-blind parameter estimation, we estimate the spatial parameters from the known source images but estimate the source spectral parameters blindly. I will skip the details and present them later, but the parameters are estimated in the maximum likelihood sense using the EM, expectation-maximization, algorithm. And we evaluate the source separation performance using well-known criteria like the signal-to-distortion ratio, SDR, the signal-to-interference ratio, and so on. Are you familiar with these evaluation metrics?

>>: The signal-to-distortion ratio and the [inaudible] decomposition.

>> Ngoc Q.K. Duong: Okay. For these criteria, the estimated signal can be decomposed into the sum of several components: the original signal, plus some artifacts, plus the interference from the other sources, right? So we have several components. The signal-to-interference ratio is the ratio between the target source signal and the contribution of the other sources in that decomposition. The signal-to-distortion ratio measures the overall distortion: the ratio of the good source to all the other distortions. And the signal-to-artifact ratio measures the ratio between the signal and the artifacts, the musical noise for instance. So SDR measures the overall performance of the system, while the others measure specific aspects.

>>: So are you trying to estimate the actual signal, or a filtered version of the signal?

>> Ngoc Q.K. Duong: A filtered version of the signal.

>>: I have a question about the number of parameters here. So the covariance matrix is full rank and unconstrained, on the order of N squared over 2 parameters for each covariance matrix, right? And then you have that separately for each frequency band, right?

>> Ngoc Q.K. Duong: I'll come back to that later in my slides. Thank you for that. Okay. Yeah. So, as I said, these criteria are well known. Here are the results for the two cases. This is the average over ten speech [inaudible] mixtures with different [inaudible]; the microphone spacing we use is 20 centimeters, and the distance from the source to the microphones is 1.2 meters. We compare the results of our proposed full-rank unconstrained parameterization with the full-rank direct-plus-diffuse one, which is partly new as I presented before, and we compare with the state-of-the-art rank-1 convolutive and rank-1 anechoic models, and with two well-known baseline approaches, binary masking and l1-norm minimization. And these are the results we have. Regarding your question about the number of parameters: of course, you see that the full-rank unconstrained model has the most parameters, because you have more parameters for each source at each frequency point; if you multiply by the number of frequency points and by the number of sources, three, you get the total number of parameters that need to be estimated. For the rank-1 convolutive model, which is the most used, we have half the parameters, because it is rank one. And it's very interesting that for the rank-1 anechoic and the direct-plus-diffuse models, too, you have very few parameters.

>>: Rank 1 would be the square root, right? The 3,000, is that coming from the filter length of the convolutive model, the estimation of the convolutive filter? Because if you have a rank-one matrix versus a full-rank matrix, it's the square root of the number of parameters, right, not half? Rank one is just a vector --

>> Ngoc Q.K. Duong: Okay. Now you consider the --

>>: When you say convolutive, is that the number of filter taps that you're estimating for the filter response, or why is it so large?

>> Ngoc Q.K. Duong: No, no, okay. For instance, here I use 1,024 taps for the filter, so we only need to estimate half of that plus one, right? So 513 is the number of frequency bins. We multiply 513 by the three sources, and since this is the stereo case you have a two-by-one vector for each mixing vector, so it's 513 multiplied by three, multiplied by two, and we get this number; and for the full rank you get that.

>>: How many microphones?

>> Ngoc Q.K. Duong: Two microphones are used.

>>: So is this a real recording, or is it simulated?

>> Ngoc Q.K. Duong: It is simulated.

>>: Using the image method?

>> Ngoc Q.K. Duong: Yes, the image method.

>>: And the RT60 of the room?

>> Ngoc Q.K. Duong: It is 250 milliseconds.

>>: Okay.

>> Ngoc Q.K. Duong: Yeah.

>>: What's the sampling rate for this?

>> Ngoc Q.K. Duong: It's 16 kilohertz.

>>: How long a reverberation can you handle with a thousand taps, given that limitation?

>> Ngoc Q.K. Duong: Okay. So for 16,000 hertz and only about 1,000 taps, it's like a sixteenth of a second or so.

>>: [inaudible] milliseconds.

>>: Much less than 250.

>>: So past that point the reverberation decay, in dB, explains the numbers there.

>> Ngoc Q.K. Duong: Okay. Again, so if we have 16 --

>>: 16 kilohertz, 1,000 taps. It's a fraction of a second, which is 86.66 milliseconds. This is roughly --

>> Ngoc Q.K. Duong: Okay.

>>: What you quote is the RT60, which is where the reverberation goes down by 60 dB, so in effect you're capturing only the part of the reverberation where it goes down by maybe 20.

>> Ngoc Q.K. Duong: That's on purpose, because we take a short filter compared to the actual reverberation; if you take the full length it has no meaning.

>>: If you have [inaudible].

>> Ngoc Q.K. Duong: There is a misunderstanding here.

>>: [inaudible] work on it anymore.

>> Ngoc Q.K. Duong: Okay. So, yeah, here are the results for the oracle setting. For the oracle setting, of course, you get much larger numbers, because everything is estimated from the known source images, and the separation performance degrades in the semi-blind setting, but overall you can see that the proposed full-rank unconstrained model outperforms the others.
>>: So you said that your matrices were number-of-microphones by number-of-microphones. Here they're just two-by-two.

>> Ngoc Q.K. Duong: Yeah.

>>: So then the explosion of parameters is very small, because N squared is just four, as opposed to if you had ten mics. So I'm just wondering, could you actually break down the 4,617, where those parameters are coming from, for the full-rank case?

>> Ngoc Q.K. Duong: For the full-rank case?

>>: No, no, just for your case; I just want to know the breakdown of the 4,617, where those parameters are coming from.

>> Ngoc Q.K. Duong: In terms of the spatial parameters?

>>: Just what they are, because you broke it down for the rank-one convolutive one, you said 512 taps, times --

>> Ngoc Q.K. Duong: You want --

>>: In this case.

>> Ngoc Q.K. Duong: Okay. So for this one we also have 513 frequency bins, and you also have three sources. Let me emphasize that we are focusing on the underdetermined case, where we use a smaller number of microphones to separate a larger number of sources, which is much more challenging. That's why I used stereo with three sources. So let's come back: you have 513, you multiply by three, and you multiply by four for each point, so you get that. But what we are really happy and surprised about is that this model, with very few parameters, provides quite good results, a bit below the state of the art here, but with very few parameters. Let me explain why it is eight. If you look here, what you need to know is the microphone spacing, which is one parameter, and you need to know the distance from each source to the microphones. So if you have two microphones and three sources, you have six parameters for the distances, plus the microphone spacing is seven, and then you need to know a bit about the reverberation time, and that's eight.

>>: So are you estimating the room parameters as well in those figures?

>> Ngoc Q.K. Duong: No, for those I assumed they are known. Yeah.

>>: But you could estimate those in principle.

>> Ngoc Q.K. Duong: That is one of the pieces of future work I will mention, yeah. It would be nice if we could estimate those parameters and incorporate them. But at this time, let me first assume we know them --

>>: So actually, if you look at the upper table, the full-rank unconstrained and the full-rank direct-plus-diffuse are fairly close. It's only when you do the estimation --

>> Ngoc Q.K. Duong: Yeah, yeah.

>>: -- that they all --

>> Ngoc Q.K. Duong: Yeah. That is really amazing. It means that if we know everything exactly, this model is quite good at handling the reverberant part.

>>: Even before you do this future work of trying to estimate those, it might be interesting just to see what the sensitivity of the results would be with respect to small perturbations in the room estimate. Because if it's robust to --

>> Ngoc Q.K. Duong: I see what you mean. When I presented at some conferences, some people asked me about that kind of variation, to explain it. But with the short time of my thesis I actually haven't had much time to investigate that; I invested more in this part, because in France we have almost two and a half years for doing research and half a year to write the thesis, so I only have about two years for these things, which is shorter than in the U.S. But you're right. Okay, let me continue.
So now I have shown that the upper bound on the source separation performance is good. So let's move to the real parameter estimation.

>>: Don't we get to hear some examples?

>> Ngoc Q.K. Duong: Okay. Nice. [laughter] We have like 15 minutes more. Okay. Nice. Okay. Here is just one specific case; I made it a long time ago, but recently, writing my thesis, I tested on a very large database and we get similar results with different source-to-microphone distances and different reverberation times. So, for instance, you have this. [recording]. Okay, there are three sources. And for the rank-1 convolutive case, state of the art, one voice. [recording]. Okay. And in our case you hear. [recording]. Okay. Again. [recording].

>>: The full-rank one from the thesis?

>> Ngoc Q.K. Duong: Okay. Yeah, once more. [recording]. So in this one you hear a lot of musical noise, right? There are artifacts; that's why we use SAR. You hear more musical noise.

>>: So in the oracle results, what you mean by oracle is just the geometry?

>> Ngoc Q.K. Duong: It means we know everything; we know the sources.

>>: You know the whole signal.

>> Ngoc Q.K. Duong: We know the whole signal, yes.

>>: Not just the geometry. [laughter].

>> Ngoc Q.K. Duong: It gives the upper bound on performance.

>>: I know the signals.

>> Ngoc Q.K. Duong: Okay. [laughter]. That's why I didn't want to play those out here, because we know everything; I just wanted to put the numbers up here, to show what you get if the parameters are estimated from the known signals.

>>: In the lower case the geometry is known but not the source images, right?

>> Ngoc Q.K. Duong: Yeah, yeah.

>>: I have one question about this. Your figures of merit, do they depend on frequency as well?

>> Ngoc Q.K. Duong: Yeah.

>>: It seemed like for the full-rank unconstrained, most of the interference was low frequency.

>> Ngoc Q.K. Duong: Yeah.

>>: There wasn't a lot of leakage of high frequencies from the other sources.

>> Ngoc Q.K. Duong: Uh-huh.

>>: Is that -- first of all, is that truly happening, and what's causing it? Why would the low frequencies leak more than the high frequencies from the interfering sources?

>> Ngoc Q.K. Duong: Of course, for speech sources most of the energy lies in the low frequencies, right? So of course in the low frequency range you have more interference; I mean, you can hear more interference, because the mixture itself has more energy in the low frequency range. And in mathematical terms I don't see a difference in my framework between the low and the high frequencies, because we treat them equally and apply the same parameterization to all frequency bins.

>>: The interference, when you played the full-rank unconstrained result, wasn't low-volume male speech; it was muffled male speech where the low frequencies were leaking through. So have you looked into what could be causing that in your framework, or --

>> Ngoc Q.K. Duong: Yes, actually I haven't really looked closely inside the spectral features. Does that really affect a speech recognition system or something?

>>: Yeah.

>> Ngoc Q.K. Duong: Okay. Okay. Thank you for that. I will look more, or maybe it is just this one specific case, because we have three sources and I played only one; when you hear another source maybe it's different, or with different sources, I don't know. But in the framework I treat them equally.
>>: I guess a similar thought. How much of the difference -- maybe you haven't done this analysis, but a particular spacing, you know, is going to make some difference; you are not going to have the same estimation capabilities for different frequencies given a different aperture. That might be where the limitations are coming from.

>>: Either that or the framework --

>>: Yeah, exactly.

>> Ngoc Q.K. Duong: Regarding the frequency resolution, let me point to this slide, because I also have some work on frequency resolution where, instead of the [inaudible] transform, which provides a linear resolution, I use a more acoustically motivated representation that provides more resolution in the low frequency range. I'll present that later on in this talk. So let me move on. Okay. Now we move to the next part, where I present the general framework to estimate the parameters from the mixture. Okay, here is the structure. I chose expectation-maximization, EM; it is well known for the Gaussian mixture model. And for EM, you may know that it's very sensitive to the initialization, so we need to design a good parameter initialization scheme. And in the maximum likelihood case, since we estimate independently in each frequency bin, there may be a permutation problem across frequencies, so later on we need to solve the permutation problem. What is new in this framework is that in most signal processing approaches the input to the parameter estimation is the signal x itself, but in our case we take as input the empirical mixture covariance, sigma hat of x, instead of the signal itself; I will explain this later. So here we have several blocks for the blind parameter estimation, and we also consider parameter estimation in the maximum a posteriori sense -- okay, that's a mathematical term -- for the case where we have some prior knowledge about the parameters; then we can estimate the parameters jointly, so we don't need to solve the permutation problem. This is the general block diagram, and in the following I will present each step in detail. For these two steps I just use existing methods, which are not our main contribution; our main work focuses on these three parts. So here is the parameter initialization. I mainly rely on the work of Winter [phonetic] presented in [inaudible], where he uses hierarchical clustering of the mixture at each frequency bin after some [inaudible] and amplitude normalization. For the sake of time, let me skip this slide, because it's not our main contribution; it's just for parameter initialization. Okay. Let me explain more about the empirical mixture covariance, why the input is sigma hat of x instead of the signal itself. We need to estimate v_j and R_j using this relation with sigma_x, but in practice sigma_x is not observed; what we really observe is the empirical mixture covariance sigma hat of x, and we compute sigma hat of x as an average over the neighboring time-frequency points, like this. This can be interpreted as [inaudible]. Later on I will present results with the equivalent rectangular bandwidth, ERB, which is a more acoustically motivated time-frequency representation. The reason we want to estimate the parameters from sigma hat of x instead of the signal itself is that, compared to x, sigma hat of x provides information about the interchannel correlation, and it takes into account the neighboring frequencies, as you mentioned before. So we relax the strict independence and rely on the neighboring time-frequency points.
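A small sketch of the local empirical mixture covariance just described, averaging outer products over a neighborhood of time-frequency points; the uniform averaging window here is a simplification of the weighted window discussed in the talk.

```python
import numpy as np

def empirical_mixture_cov(X, wn=1, wf=1):
    """Local empirical covariance Sigma_hat_x(n, f) of a multichannel STFT.

    X : mixture STFT, shape (I, n_freq, n_frames), complex.
    wn, wf : half-widths of the averaging window in frames / frequency bins.
    Returns Sigma_hat of shape (n_frames, n_freq, I, I); each entry is the
    average of x x^H over the (2*wn+1) x (2*wf+1) neighborhood, which keeps
    the interchannel correlation information the talk refers to.
    """
    I, F, N = X.shape
    outer = np.einsum('afn,bfn->nfab', X, X.conj())          # (N, F, I, I)
    Sigma = np.zeros_like(outer)
    for n in range(N):
        for f in range(F):
            n0, n1 = max(0, n - wn), min(N, n + wn + 1)
            f0, f1 = max(0, f - wf), min(F, f + wf + 1)
            Sigma[n, f] = outer[n0:n1, f0:f1].mean(axis=(0, 1))
    return Sigma
```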
>>: Wait, go back. So you're averaging between neighboring frames and neighboring frequencies here, is that what you're saying?

>> Ngoc Q.K. Duong: Not channels. Only time-frequency regions.

>>: No, I understand. Time-frequency regions.

>> Ngoc Q.K. Duong: Yeah.

>>: So this is time n and this is frequency. So at the same frequency bin, going back one frame.

>> Ngoc Q.K. Duong: Yeah.

>>: And the previous frame, plus the two neighboring frequencies.

>> Ngoc Q.K. Duong: Yeah, in the general case. If you set the length of that window to two, you have two-by-two. I tested several cases with different window lengths, and two-by-two turned out to provide the best performance. The center point is the main contribution and the neighbors contribute half of that, so you have a window shape. Okay. And here is the general expectation-maximization algorithm. It includes an E step and an M step. In the E step we compute the multichannel Wiener [inaudible] and the estimated covariance for each source, sigma hat of c_j; I will skip the details. In the M step the parameters are updated iteratively, like this: v_j and R_j, where the update of v_j depends on R_j and the update of R_j depends on v_j. That is in the maximum likelihood sense. In the maximum a posteriori sense, v_j and R_j can be estimated by closed-form updates combining the contribution of the likelihood and the contribution of the prior, which I will present later on, where I introduce some [inaudible] parameter to take into account --

>>: You estimate both these parameters simultaneously?

>> Ngoc Q.K. Duong: Yeah, because we don't know these two parameters, right? Yeah.

>>: From the same amount of data.

>> Ngoc Q.K. Duong: Yes, from the same amount of data.

>>: Because the v, in theory, you could learn with a lot more data; there's nothing spatial about it, right?

>> Ngoc Q.K. Duong: Yes, I see. Many approaches learn v_j. But to learn v_j, the training data should be close to the actual source data. For instance, in the general source separation context, when people go into a room and talk, you don't really have training data. Also, in the state of the art many models for v_j exist: you can represent v_j with an NMF model or a mixture of Gaussians, and train the Gaussian mixture model to get an estimate of v_j. But here, because my framework is focused on R_j, we assume there is no prior information on v_j.

>>: But v_j is indexed by n and f, so it's a ton of parameters, right?

>> Ngoc Q.K. Duong: Uh-huh.

>>: As long as the transform length?

>> Ngoc Q.K. Duong: Yes. If you do not put any prior information, yes, it's a ton of parameters. But if you put a model on v_j, for instance an NMF model, then v_j would be some activations times some spectral patterns, so you have fewer parameters.

>>: Are you doing this estimation separately for --

>> Ngoc Q.K. Duong: Yes, it is done separately for each source, each frequency.

>>: But not each frame. You're doing this tied across the entire input.

>> Ngoc Q.K. Duong: Yes, because you see here that R is the sum over all frames, weighted by v_j. Okay. Is that okay? Yes. For the permutation problem, let me skip the details, because I just reproduce the work of [inaudible] in Japan, where he exploits the DOA, the source direction, to solve the permutation problem. Okay.
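For concreteness, here is a sketch of the two interleaved M-step updates just described (v_j given R_j, then R_j given the new v_j), starting from the per-source posterior covariances produced by the E step. The variable names are assumptions, and the formulas are my reconstruction of the standard maximum-likelihood updates for this model, consistent with the description that each update uses the other parameter.

```python
import numpy as np

def m_step(Sigma_c_hat, R_old):
    """One maximum-likelihood M-step for a single source.

    Sigma_c_hat : E-step posterior covariances of the source image,
                  shape (n_frames, n_freq, I, I).
    R_old       : current spatial covariance, shape (n_freq, I, I).
    Returns (v_new, R_new): v_new has shape (n_frames, n_freq),
    R_new has shape (n_freq, I, I). As in the talk, v uses the current R,
    then R uses the new v.
    """
    N, F, I, _ = Sigma_c_hat.shape
    R_inv = np.linalg.inv(R_old)                                    # (F, I, I)
    # v_j(n, f) = (1/I) * trace( R_j(f)^{-1} Sigma_hat_cj(n, f) )
    v_new = np.einsum('fab,nfba->nf', R_inv, Sigma_c_hat).real / I
    v_new = np.maximum(v_new, 1e-12)
    # R_j(f) = (1/N) * sum_n Sigma_hat_cj(n, f) / v_j(n, f)
    R_new = (Sigma_c_hat / v_new[:, :, None, None]).mean(axis=0)
    return v_new, R_new
```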
So here are some results. This is the average SDR, which measures overall distortion, over 10 simulated speech mixtures. We compare the results of the full-rank unconstrained model with the rank-1 convolutive case and with the baseline approaches, binary masking and l1-norm minimization; at this stage we don't include the other models like the full-rank direct-plus-diffuse, because we already showed they provide a lower upper-bound performance, so here I only consider the full-rank unconstrained. Here are the results for a very low reverberation time of 50 milliseconds. You see that for low reverberation the rank-1 convolutive and the full-rank models lead to about the same results, because the reverberation is very low. But as we move to more highly reverberant rooms, our proposed parameterization clearly outperforms the rank-1 approaches and the baselines.

>>: Is this oracle [inaudible]?

>> Ngoc Q.K. Duong: No, it's really blind.

>>: Static source locations, or are they moving?

>> Ngoc Q.K. Duong: The source locations are fixed, but we don't know them. Yeah. For instance, now you hear the sounds, at 250 milliseconds. [recording]. So rank one. [recording]. Okay. For our proposed approach. [recording].

>>: I thought I had heard results from the rank-1 convolutive model that sounded better than this.

>> Ngoc Q.K. Duong: In terms of?

>>: From other researchers.

>> Ngoc Q.K. Duong: In terms of interference, or artifacts, or overall?

>>: Overall.

>> Ngoc Q.K. Duong: Okay. We can hear it again. [recording].

>>: Just my recollection of some comparisons here --

>>: Maybe they weren't in a reverberant room.

>>: That's probably the biggest difference, yes.

>>: So it depends --

>>: How reverberant.

>>: Go back one second: what reverberation level were you showing those samples at?

>> Ngoc Q.K. Duong: 2,000 -- I'm sorry, 250 milliseconds.

>>: Okay.

>>: So maybe that's the difference, yes. It's possible.

>> Ngoc Q.K. Duong: But even for this one, I tested many cases, and the full rank clearly matters. I don't know.

>>: I'm saying that in what I've heard at other conferences, the separation was much better, but perhaps it was with much less reverberation.

>> Ngoc Q.K. Duong: Yeah.

>>: Because there they sound almost the same.

>> Ngoc Q.K. Duong: At low reverberation times it's almost the same. But as I said before, our gain is in the reverberant environment.

>>: But compared at lower reverberation times, you do as well as the others.

>> Ngoc Q.K. Duong: Yes. Even recently, for the thesis, I tested a larger number of datasets with different scenarios, and there I observed that the rank 1 sometimes performed better at low reverberation times.

>>: It was even better than 10 dB. Maybe at 50 milliseconds, better than 10 dB. That's my recollection.

>>: 17.

>>: I'm sorry?

>>: 17. You said [inaudible]; this is what they used when they --

>>: Like the guys from ATR: the demos I listened to, to me they sound better than 10 dB. I know that reverberation is --

>>: They have different definitions.

>>: Did you use headphones when you listened to them?

>>: I tried both.

>> Ngoc Q.K. Duong: Okay. So this is one case.

>>: I have a question for the room. I'm not familiar with the data you're using for this, and to me both examples sounded unintelligible. I was maybe able to pick out two or three words from either the rank-1 convolutive or the full-rank unconstrained, so it doesn't really seem like the difference is useful at 250. Is this the biggest difference that you see with this technique, or are we going to see better results?

>> Ngoc Q.K. Duong: Yes, if you look at the numbers you do see differences, but here, as I mentioned, I display one specific case.
>>: Here I see a 2 dB difference at 250 milliseconds.

>> Ngoc Q.K. Duong: Okay, it's like one dB or something.

>>: One dB. Quantitatively they're different, but qualitatively I found them both to be unintelligible. Do you have some experience that shows there's some domain where you use this technique and the qualitative results are positive?

>> Ngoc Q.K. Duong: Okay, I will show you later on. But let me emphasize that in the source separation community, one dB of SDR is really a large improvement. If you look at the performance year after year, the progress is like 1.3 or 1.5 dB of SDR, so it's not easy to gain one dB.

>>: This is a significant change.

>> Ngoc Q.K. Duong: Yeah, yeah. Let me emphasize that this is really a significant change. And in this slide I show some more results compared with state-of-the-art approaches. In the source separation community we have an evaluation campaign, organized every one and a half years by a steering committee, associated with the source separation conference, known as the [inaudible] conference. For this campaign the organizers provide some common data, the participants run their algorithms to separate the files and submit the results, and the organizers evaluate the performance. So the data is blind: the participants don't know anything, they give back the wave files, and the results are evaluated by the organizers. We participated in the campaign -- recently it was the 2010 edition already -- and we compared with five state-of-the-art approaches: for instance, [inaudible] came from NTT in Japan, well known for source separation, some come from Columbia University with Dan Ellis, so very famous, and [inaudible] came from the University of Tokyo. So these are all state-of-the-art approaches. Here I tested with the five-centimeter microphone spacing dataset, and here are the results we have, with three sources and two sources, compared to the state-of-the-art approaches.

>>: So those were the other entries, those numbers?

>> Ngoc Q.K. Duong: Yeah, I compare against all the numbers.

>>: At 250 milliseconds it looks like the [inaudible] method is 3.7 versus your 3.8. It's close to --

>> Ngoc Q.K. Duong: Here?

>>: Yeah.

>> Ngoc Q.K. Duong: Yeah, sure. I don't claim that I outperform everything; this is really hard. And let me also emphasize that on the data with larger microphone spacing, some methods even outperform ours, because our method needs to solve the permutation problem; with a large microphone distance you get a lot of phase wrapping, which degrades the performance a lot. That's why I chose the five-centimeter spacing. The permutation problem is not the main focus of our approach.

>>: So are these your implementations of their algorithms, or is this what they published?

>> Ngoc Q.K. Duong: Yeah, the results were run and submitted by the participants themselves, not by me; we all submitted to the same organizers. So that's another thing. And actually, recently, in 2010, I also took part, and I didn't write the results here, but let me summarize. In 2010 the group from NTT outperformed the others, and there were two other entries that outperformed this method, one from EPFL [phonetic] in Switzerland and one from [inaudible], but those two used our framework. That means they used exactly the full-rank model I proposed here in 2009, but with more constraints on the source variances. Here the source variances are free;
I don't use any model for them, but those two approaches use an [inaudible] model and a Gaussian model for the source variances, and they get better performance. So that's expected. Yeah.

>>: One other question, and you may have said this before, but in all these cases the number of sources is known, right?

>> Ngoc Q.K. Duong: Yeah, sure.

>>: It's always specified.

>> Ngoc Q.K. Duong: Yes. Estimating the number of sources remains a challenge, I think.

>>: But in all these experiments --

>> Ngoc Q.K. Duong: It's known. Okay. So, regarding the earlier question about frequency resolution: I compared the standard STFT representation, where the frequency bandwidth is linearly spaced over the whole frequency range, with the ERB representation, which is more acoustically motivated and provides narrower bands in the low frequency range, where most of the source energy lies. Our results show that the ERB representation performs better than the standard STFT. The reason is that the difficulty in estimating the model parameters is the large amount of time-frequency overlap between sources: if the sources are really sparse in the time-frequency plane it is easier to estimate the parameters, but if they really overlap it is more difficult. With ERB we have narrower bands in the low frequency range, so we have less overlap; that's why it outperforms the standard STFT. These results were published in our 2010 paper. Okay.

And now, in the short time left, I'll move quickly to another good contribution of the thesis, which is defining some probabilistic priors. Up to now I presented a general framework where the parameters are estimated in the maximum likelihood sense, without knowing anything, right? But in practice, in many cases, for instance if you look at speech enhancement in your car, you may know exactly the positions of the driver and the co-driver, or in a formal meeting the position of each delegate is fixed. It means that in such situations the geometry is known, so you can exploit this knowledge in order to enhance source separation. With that motivation, we design the spatial location prior. Okay. Let me also consider another one. Up to now I just assumed that R_j(f) is time-invariant. But if you consider moving sources, for instance if I talk like this, I keep moving, or if you consider [inaudible], then R_j should not be time-invariant, right? In that case we should consider a time-varying spatial position of the sources, and we can design a continuity prior: we assume that it is time-varying, but it should have some structure over time. That's why we design a spatial continuity prior, and these two different priors were published in two different ICASSP papers this year. Okay. Or, if you look at the spectral information of the sources: for instance, the spectrogram of a drum is quite smooth along the frequency axis, while the spectrogram of a harmonic instrument like a piano shows harmonicity and is smooth over time. Given that, you can also design a spectral continuity prior to enforce smoothness over the time or frequency range. So these are a series of priors we consider in order to enhance source separation in specific settings where we know something. Let me now detail the spatial location prior. So here is the model you have seen before.
The full-rank direct-plus-diffuse model: we tested this model with a large dataset and found that it's quite good as the mean of the spatial covariance matrices. It means that if you average over a lot of source positions, the mean follows this model, but in practice individual matrices vary around the mean. That's why we can define a distribution for R_j with an inverse-Wishart prior, where the mean of R_j is given by this model, and we need to estimate or learn the variance. Then we know the distribution and we can estimate the parameters in the maximum a posteriori sense, right? Let me go a little faster and just give you the basic idea. We also get closed-form updates in the MAP sense. And here are some results. Here I just want to show that if we have some knowledge about the geometric setting, we can get better performance compared to the blind initialization in the maximum likelihood sense, right? Okay.

For the spatial continuity prior, we consider this prior in the context of music source separation. We define a probability distribution for R_j where the mean at the current time frame depends on the value at the previous time frame, so we have a continuity constraint. Now let me come back with experiments to show some practical cases where our framework provides good performance. For instance, a music mixture can be considered as a mixture of a harmonic component and a percussive component, and in the MIR community people try to separate the harmonic component and the drum component in order to perform some other music information retrieval tasks. We apply our framework with the continuity prior, so we get results for multichannel harmonic and percussive separation, and we compare this result with the existing single-channel HPSS proposed by the group at the University of Tokyo. Okay. And here are the results we get. Here is, for example, one music mixture. [music]. Okay. Here is the harmonic component we separate. [music]. Here's the percussive component. [music]. Okay. If you compare with the state of the art for the percussive part you will hear this. [music]. So you hear a few differences now. Regarding the earlier question: for speech it's difficult, but I have some joint work with a group at the University of Tokyo where we try to exploit this framework in the context of music information retrieval. Yeah. So here, although you hear a lot of artifacts in the separated result, [inaudible]; let me play it again. [music].

>>: That might not be the important thing. If the goal is to find features for MIR kinds of tasks, you might care about tempo and [inaudible] and stuff like that; the actual acoustic artifacts only matter if humans are listening to the output. Another question I have in this case: are you manually defining what the priors are for the different types of sources here, or are you estimating them from data? Where are they coming from?

>> Ngoc Q.K. Duong: For the prior there are only two parameters, one is the mean and one is the variance. The mean is well defined, and the variance I choose manually.

>>: Okay.

>> Ngoc Q.K. Duong: But I observe that the results are not very sensitive to the value of the variance.

>>: To what extent --

>> Ngoc Q.K. Duong: Yeah. If it varies between 10 and 100, it's okay. But of course we can also learn it from training data.
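Following up on that exchange about the prior's mean and variance, here is a deliberately simplified sketch of the idea behind the spatial location prior: shrink the data-driven estimate of R toward the geometric prior mean, with a weight playing the role of the inverse-Wishart strength. This illustrates the effect only; it is not the exact MAP update from the thesis.

```python
import numpy as np

def shrink_toward_prior(R_ml, R_prior_mean, strength):
    """Blend the data-driven spatial covariance with the geometric prior mean.

    R_ml         : maximum-likelihood estimate, shape (n_freq, I, I).
    R_prior_mean : prior mean, e.g. the direct+diffuse model built from known
                   positions and reverberation time, same shape.
    strength     : scalar >= 0; larger values trust the prior more (it plays
                   the role of the inverse-Wishart concentration).
    """
    w = strength / (1.0 + strength)
    return (1.0 - w) * R_ml + w * R_prior_mean
```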
>>: Is the source a regular stereo recording?

>> Ngoc Q.K. Duong: A stereo recording.

>>: Stereo, or monophonic, single channel?

>> Ngoc Q.K. Duong: For the single-channel method we separate the sources in each stereo channel separately.

>>: So you took one of the stereo channels and then ran it for each of the results?

>> Ngoc Q.K. Duong: Yeah. So this is not a fair comparison, I will say, because I compare a multichannel method.

>>: With the single-channel one; yours is the multichannel one.

>> Ngoc Q.K. Duong: I will say it's not fair, but for the state of the art I didn't use any multichannel approach, because for this kind of music separation most people use a single channel and exploit models of the spectro-temporal structure of the sources to separate them. So, yeah, I tried to look for a state-of-the-art multichannel method, but I didn't find one. Okay.

So here's the conclusion. In this work we introduced a probabilistic spatial model for reverberant and diffuse sources. We proposed an unconstrained parameterization of the spatial covariance matrices, which overcomes the narrowband assumption to a certain extent and better accounts for the reverberation, as shown in the experimental results. We considered the local empirical covariance, sigma hat of x, for the parameter estimation, which provides additional information. And recently we proposed several probabilistic priors which enhance the separation in certain contexts where we know some information, yeah. Okay. For future work, we can consider blind source separation that exploits the spatial location prior. It's very related to the earlier question about the full-rank direct-plus-diffuse model: even for this prior we need to know the mean, which means we need to know those matrices, and it would be nice if we could find a way to estimate parameters like the source DOA and the reverberation time. We can also consider the use of the proposed framework in other fields of signal processing, for instance in MIR, which is what we are doing now. And recently I found that some people already use our framework: for instance, I was told Hitachi uses our modeling in their acoustic [inaudible] system, and Anarchi [phonetic] also combined our approach with time-frequency masking to enhance source separation.

>>: Is this the [inaudible] 2011 paper?

>> Ngoc Q.K. Duong: Yeah. These two groups are from industry, so they consider the practical application. They use the idea, but they do not use the same estimation technique, because ours requires more time; they combined it with their own, faster parameter estimation. We can also consider the proposed framework in spatial audio, where we analyze and render the spatial [inaudible], yeah. Okay. I think that is the end of my presentation, and here are some references you can see.

>> Ivan Tashev: Any more questions?

>> Ngoc Q.K. Duong: Thank you very much. [applause].

>> Ngoc Q.K. Duong: Thank you very much for your attention.