>> Ivan Tashev: Good morning, everyone. It's my great pleasure to introduce Lae-Hoon
Kim. He's a Ph.D. student in the department -- or in the Statistical Speech Processing
group in the Department of Electrical and Computer Engineering, University of Illinois at
Urbana-Champaign. And his advisor is --
>> Lae-Hoon Kim: Mark Hasegawa-Johnson.
>> Ivan Tashev: -- Mark Hasegawa-Johnson, a well-known name in signal processing.
And the topic of his three-month-long intern project was speech signal separation in a
reverberant environment.
Without further introduction, Lae-Hoon, you have the floor.
>> Lae-Hoon Kim: Thank you for the kind introduction.
Hi everyone. Good morning. Today I would like to talk about my summer internship
with this title, "Reverberated Speech Signal Separation With Microphone Array."
Before starting this talk, I would like to, you know, express my deep gratitude to, first of all,
my mentor, Ivan Tashev, and to Alex Kipman and the Xbox team for funding my summer
internship, and to Microsoft Research for excellent working conditions and equipment, and
to the whole speech research group members and my fellow interns for the wonderful
summer. Thank you so much.
Let me start this presentation, then. This is our -- today's agenda. First of all, I'm going
to briefly talk about what this project is like from the project outline. And then we are
going to see what kind of state of the art is out there, and then touch on the
background.
And in this background section we are going to have the -- we are going to see the
building blocks of our proposed algorithm. And then after that, you know, the proposed
method will be explained in a technical manner. And then experimental results will be
given with some numbers, obviously. And then after that we are going to have some
listening examples to listen to here. And then I will conclude this talk with some future
work.
Here is the project outline. The objective of this project is [inaudible] speech source
separation using a microphone array in a reverberant and noisy environment, like this kind of
space.
And the main application of this technology is for, you know, interactive game console,
like an Xbox.
So here are our challenges for this project. We have to deal with shoulder-by-shoulder
speaker distance -- you know, angular distance. And the working distance should be like two to
three meters from the microphone to the speakers.
And we have to deal with the reverberation and also background noise. And this kind of
technology should be integrated into the frequency domain implementation, in short time -- short frame -- processing units.
So this is our -- these are our challenges we have to solve here.
And here are the existing approaches to these kinds of problems.
First of all, there is microphone array processing, which includes beamforming,
nullforming, or, you know, post-processing, something like that.
And it aims to suppress sources depending on the direction of arrival -- I mean, the
spatial information you might be able to get. You know?
And there is another way of doing these kinds of things, which is called blind source
separation. And normally people tend to use a technology called
independent component analysis -- in short, ICA.
This algorithm tries to separate sources by maximizing higher-order mutual
independence among the separated signals. So note that those two different technologies
have been built on different kinds of objective functions.
Also, there have been some attempts to combine those two different approaches so far. And
most of them report that the converged ICA is really similar to nullforming on the
interfering speech.
And, you know, it does not produce a, you know, significant
improvement in source separation compared with beamforming plus nullforming.
And here are some important questions we asked, you know, when we started this
project, and some observations we have made here.
First, No. 1: why does ICA just converge to nullforming? The answer is because that is just the
best it can do given the current frames only. And usually the reverberation time is longer than
the frame length. For example, if we take 256 samples [inaudible] 16 kilohertz, that's just
16 milliseconds. And really, in this kind of room, the reverberation time would be
more than 200 milliseconds or 300 milliseconds.
So this is part of the reason. And here you can see the picture like this: usually, given
just the current frame only, the best it can do to maximize the independence of the separated
channels is just to nullify the direction of the interference. That's the best, actually.
But in a given situation, this kind of situation, there should be some kind of direction of a
major reflection of the interference. Then what can it do? Do we want to do this? No.
Because the main energy is over here, we don't want to lose this kind of, you know,
rejection power just to, you know, suppress this reflection a little bit.
So we would like to maintain this. But we also would like to cancel this part. This is
what we would like to do really actually.
And how can we do this? So this is very simple. If we just use the [inaudible] only, then
this is the best. Why not -- why don't we use the previous frames? I mean,
why don't we use a multi-tap structure to cancel out this part? At the same time it can
maintain this, you know, [inaudible] power.
So this is our basic, simple motivation for using a multi-tap structure to cancel out these
kinds of major reflections.
And obviously this is the way we can deal with this long reverberation, given the short
frame length.
And, again, the beamforming technology is usually based on prior knowledge about the
source direction, right? But we don't know the reflection -- I mean, the significant
echo directions -- as prior information.
But we don't have to worry about that, because the unknown DOA -- direction of arrival -- of the
echoes will be dealt with by the ICA.
The objective of ICA is to maximize the independence of the separated signals. And by
doing that we can, you know, automatically have this kind of nullifying in a
multi-tap structure, which will give us, you know, some kind of separation,
even using the multi-tap effect.
Okay. Here's the proposed approach, which is, you know, titled as regularized multi-tap
ICA with IDOA-based spatial filtering.
This -- I will start from the good initialization, which is based on the spatial information
we've already got, like, you know, nullifying the interference direction and beamforming
towards the target direction, something like that.
And we also have the regularization on the big deviation from the spatial filtering first
stage.
So actually this is composed of two stages. The first stage is beamforming followed by the
IDOA-based spatial filtering, and the second stage is this multi-tap structure ICA. And this
second stage will also be accompanied by this spatial filtering too. So we are going to
look at that in a detailed manner soon.
But here we would like to maximize independence. And we want to use the previous
frame as well, not only just current frames.
And this will cancel out the interference leakage in the current frames.
And after that there could be some kind of slightly increased reverberation in the given
target direction, which is made of the target speech only.
So I would like to suppress this slightly increased reverberation of the target speech.
So I just use this spatial filtering again to suppress that reverberation.
So you're going to see that we can use both of these different kinds of technology to the
maximum.
So here's the results summary. It could give us better than 20 dB C-weighted separation. 20 dB
means the interference is down to one-tenth -- one over ten -- in amplitude, for two
shoulder-by-shoulder speakers up to 2.7 meters distance.
The improvement in PESQ points -- PESQ is the perceptual measure -- was 0. to
0.6.
And we could outperform the already published, you know, algorithms by 10 to 20 dB C-weighted.
So here's the state of the art. In 2006 Saruwatari, et al., proposed this algorithm which is
titled as "Blind Source Separation Based on Fast-Convergence Algorithm Combining
ICA and Beamforming."
And in this paper they report that the converged ICA is really like nullforming on the
interference direction.
And the second state of the art is our MSR result in 2008. And this is titled "Maximum a
Posteriori ICA: Applying Prior Knowledge to the Separation of Acoustic Sources."
And this technology tries to utilize the a priori information about the source direction. And
actually this is a beamformer-based trained prior distribution on the ICA filter given the DOA, and
after having this training they want to build some kind of MAP ICA.
Originally, ICA is kind of maximum likelihood; if we can have some kind of
prior knowledge about that, then we can convert that into a MAP problem. So
this is a way of, you know, incorporating the prior knowledge, instead of just
going for, you know, the optimum without it.
And this way we have [inaudible] solving the permutation problem. But still, this is also
some kind of way of doing the best using just the current frames only, like just
beamforming.
And third, this is Sawada, Mukai, Araki, and Makino. And this article
is titled "Frequency Domain Blind Source Separation."
And in this paper, they actually introduced time-frequency binary mask post-processing,
which is somewhat different from the other algorithms. This is kind of, you
know, similar to our IDOA-based post-filtering. But we are going to see what
the difference is.
And they actually report that they can additionally suppress the interference beyond
just using ICA. But still, this kind of algorithm does not discriminate the
reverberated interference in the target direction.
Which means, given this direction, if the signal is coming from the interference, they
cannot differentiate that. So they just accept that as, you know,
target.
So this is just the same problem as beamforming.
>>: So maybe [inaudible] do you know whether this is the technique that is used by
[inaudible] after [inaudible] acquired that company?
>> Lae-Hoon Kim: Oh, maybe.
>>: I don't know. It looks like they are very [inaudible].
>> Ivan Tashev: [inaudible] senior director of [inaudible], but after two telephone
conversations with him last month, but of course he doesn't speak about the technology.
>>: [inaudible] do you have notes about whether this type of algorithm is something that
underlies the technologies?
>> Ivan Tashev: They have a two-microphone -- two-microphone technology for speech
as language. They offer us [inaudible] their latest chip. This is what I know for sure.
What they actually use --
>>: But whether it's frequency domain or --
>> Ivan Tashev: It's quite possible. But I don't know for sure.
>>: You know, all the [inaudible] but they're different.
>> Lae-Hoon Kim: So, in short, if you want me to say something about the difference, then I
would say that this [inaudible] is based on some kind of narrow bin search: if there is
some kind of source toward that direction -- you can, you know, search through this
direction, and this is just the direction for the source, something like that -- then it says one,
a mask of one, and if not, zero. This is just how this works.
But our case, we can provide some kind of [inaudible] information about [inaudible].
>>: So in general the -- see, this is -- the book is a [inaudible] it's about getting to
[inaudible]. The chapter there is from [inaudible] people [inaudible].
>> Lae-Hoon Kim: If you're interested, I can -- yeah.
So let's go into the background part. This is our -- the building blocks for our proposed
algorithm.
First of all, this is the subband domain speech separation. You know, typical situation for
the speech separation. And this can be done by this kind of, you know, process.
First of all, the time domain signal is captured by the multiple microphones, and then
we convert this time domain signal into the short-time frequency domain using
the MCLT. And we do the separation per each frequency bin separately, and then we just
do the, you know, IMCLT to get the time domain signal back again.
And this kind of speech separation processing can be described like this:
the separated speech can be obtained by applying this separating filter to the
measured frequency domain response.
And I would like to, you know, note that N and K -- K means the frequency bin
number and N is the frame number. So basically that can be, you know,
time varying, to reflect the change of the acoustic environment or
something like that.
So, first of all, beamforming. There are two different kinds of beamforming technology in
general. The first is the time-invariant beamformer, which is to fix the [inaudible],
and then we can calculate the coefficients offline. And this assumes the
ambient isotropic noise assumption.
And second of all, there is another technology, the adaptive beamformer, which can track
the variation of the source positions or of the environment itself.
And we utilize this Minimum Power Distortionless Response beamformer, which is
slightly different from the Minimum Variance Distortionless Response beamformer in the
sense that we utilize the, you know, power instead of the variance of the noise.
Because, you know, calculating -- estimating -- the variance is also another
problem.
So basically, this is the optimization problem we would like
to solve here. Basically, what we want to do is minimize the output
power after having this, you know, filtering.
There are some constraints we would like to meet at the same time. One of them is we
would like to maintain the beam direction without any distortion, like this one.
And we would like to cancel all the
signal coming from the direction of the interference.
We added this part because, you know, as I told all of you, the
converged ICA is just like, you know, beamforming plus nullforming. So
why not? We can add this part to this beamformer.
So we added this. And after solving this problem, we get this.
And another thing I would like to mention is this part, lambda I, which is
called diagonal loading, to prevent divergence in this part.
So this is a very realistic implementation. And this is the directivity pattern of the
implementation: this is the time-invariant version, this is the adaptive version.
So you can see that this blue line is the null and this red line is the beam. And towards the
designated directions of arrival for the target and interference, we could make the nulls
and the beams in an appropriate way.
And, again, this beamforming has some very good things, like, you know, it can boost
or suppress depending on this spatial information.
But as I already told you, this technology has some kind of demerit: you know,
it cannot discriminate between sources in a given direction. So we need some
source discrimination power.
This is a spatial filter, which is called IDOA-based post-processing. This was
proposed by Dr. Ivan Tashev and Dr. Alex Acero over here. And IDOA-based means an
instantaneous direction of arrival-based approach.
And this is nothing but this kind of space -- a space constructed from
pairs of the microphones. Like, for example, if there are four microphones,
we can construct three pairs, like 1-2, 1-3, and 1-4, something like that. And after that,
based on the phase information between those pairs, we can draw some line like
this. If there's no reverberation and no noise, then the [inaudible] would follow this
kind of line.
But because there is some kind of reverberation or noise, something like that, there is a
kind of spread around this, you know, designated line.
So based on that information, for each frequency bin we can construct the probability
of the direction -- of the membership of this frequency bin.
Like, for example, if in a specific frequency bin all the signal comes from
some direction, then we could have some kind of very narrow, sharp probability like this.
But if not, then it could be, you know, kind of broad.
So basically, once we have some kind of algorithm like
this, then we can just multiply by this gain after the beamforming to
additionally suppress some parts.
So, for example, beamforming just sees the direction and just wants to boost that
direction, even though there could be nothing over there, right?
For example, speech is kind of a very sparse signal, so there could be
nothing in some specific frequency bin.
But the beamforming just tries to boost over there, even though there
is nothing like that.
That is not exactly what we want to do. And this kind of probability
will give us some kind of information that there's nothing there, and then we can just
suppress additionally up to there. This is a very cool idea. But still, this
is, you know, spatial filtering only.
And here is the background for independent component analysis. And the objective of
this algorithm is to maximize the higher-order mutual independence, not just second
order. In this case it can be interpreted as trying to maximize the super-Gaussianity.
So, for example, in this figure you can see that this is the clean speech probability and this is
the mixed one -- I mean, the speech mixture. And this is the speech mixture plus noise. And
this is just noise.
So as you see here, you know, clean speech really, really has this kind of
peaky PDF.
You know, so basically that means that by maximizing this kind of super-Gaussianity --
pushing the peakedness up -- it could achieve some kind of separation. That's the whole basic idea
behind this ICA.
And that can be formulated like this. Actually, this maximization of the mutual
independence would be like maximizing the entropy itself after this nonlinear
mapping of the separated signals.
And this can also be interpreted using the [inaudible] divergence. So basically we
want to make this probability match this probability. In this case,
"unif" means the probability of the uniform distribution.
So basically, after convergence, the probability of the nonlinear mapping
of the separated signals would be a signal with just an independent uniform distribution.
>>: Can I ask a quick question?
>> Lae-Hoon Kim: Yes.
>>: Many years ago working [inaudible] we were trying to just reduce reverberation and
we used the same approach and we were trying to minimize just the kurtosis metric.
>> Lae-Hoon Kim: Yes.
>>: Do you think that there's -- kurtosis seems to be a little less computational cost than
doing that.
>> Lae-Hoon Kim: Yes.
>>: But do you think you would get similar results if you used just a simple metric like
kurtosis?
>> Lae-Hoon Kim: Yeah. Yeah, why not? Because, you know, in the kurtosis case it's
the fourth-order [inaudible], right?
>>: Right, exactly.
>> Lae-Hoon Kim: So in this case, you know, kind of more generalized version of that, I
guess.
So we're not --
>>: But, again, you're trying to measure [inaudible].
>> Lae-Hoon Kim: Yes. How -- how -- yes, yeah, yes. Yes, yes. It could be used --
>>: Okay.
>> Lae-Hoon Kim: But, you know, here I just want to, you know, highlight that I'm now
using the maximum information I'm able to use. So we can use that kind of thing if it is
helpful enough. Okay.
And then this kind of objective function can be maximized or minimized using some
sophisticated algorithm like this kind of update [inaudible]. This was [inaudible] proposed
by [inaudible], et al.
And there is some kind of modification of the original [inaudible] algorithm by
multiplying by this part. This is called the natural gradient, which was proposed by [inaudible].
And the good thing about that is, you know, by multiplying by this, we, you know --
we don't need to worry about the specific point in the subspace of this
W.
By multiplying by this, it doesn't depend on the starting point in this subspace. It just
guarantees us the same convergence rate [inaudible] of the starting point of W.
So this is very good -- this has a very good convergence
property. And, you know, this is just for the current
frame. So later on we extended this current-frame algorithm into the multi-tap structure too.
And after convergence, because this probability will have this [inaudible]
distribution, that means -- according to the simple change of variables for a
[inaudible] random variable -- if we have [inaudible], then this S is WY, and
the probability of WY has just this kind of form.
So, for example, if we utilize a nonlinear function like the sigmoid or [inaudible] or
something like that, then the probability of S, which means just the separated speech, will
look like this.
So basically what we can say is that by choosing the proper nonlinear
mapping we could achieve this kind of, you know, probability distribution of the speech itself.
So, for example, if we want to achieve this super-Gaussianity, then we could choose
something like [inaudible] or something like that.
Let's see the pros and cons over here. The pro: it separates by maximizing higher-order
mutual independence. We do not need to worry about anything else at all. We just need to maximize
the independence itself, and that will guarantee us the separated speech. This is
cool, we know.
But the con: after having this independence-based separation, we do not know the membership of the
separated speech. How can we deal with that? This is a problem. Even though we could,
you know, achieve really good separation performance, if we do not
know the membership of the separated speech, then there is nothing we can do, right? So
this is the kind of problem.
Actually, originally this ICA is -- it was ideal for the time-domain instantaneous
mixtures. What I mean by instantaneous is no multi-tap. That was ideal for that kind of
situation. But we would like to deal with the convolutive situation, the multi-tap situation, like
this reverberant environment, right?
And a convolutive mixture can be reformulated as an instantaneous mixture in the
frequency domain. This is a very well-known fact. But this is only true for a sufficiently
long frame, right? As I already told you, if the reverberation is much longer
than the frame length, this doesn't hold at all.
And we also have to deal with the permutation and arbitrary scaling -- as I already told you,
this membership thing -- which should be fixed by additional post-processing. And this is very hard.
Known to be very hard.
So this is the reality: with shorter frame lengths, it just converges to DOA-based
nullforming itself. And this is quite disappointing.
So from this observation, Ivan, my mentor, asked: are we doing our best here? So this
little analogy gives us some idea.
So, beamforming: we know the direction, we know the membership, we know the road.
But we do not know the identity of the reverberated speech. Right?
ICA: we know the identity. We know what the speech is, how we could differentiate
them. But we do not know where to go. We do not know the membership itself.
And the thing is that the current approaches do not even know the identity of the
reflection part, because the current approaches just deal with the
current frame only. We do not know the identity either, you know.
So here's our approach. We would like to let each, you know, separated car go
the right way -- like A goes to A, B goes to B.
How can we do that? First of all, we would like to solve this problem. So we propose
source identification for the [inaudible] path, which can be dealt with by the feedforward ICA and
multi-tap ICA.
And we utilize a guide. We would like to guide this kind of thing to go in the right
direction. So we initialize the filters based on the spatial information. And we also
regularize so that they cannot go a long way off.
And after that we also introduce IDOA-based post-processing to narrow down this road.
No wandering. Okay? We can see this.
Here is the proposed algorithm [inaudible]. This algorithm is composed of two stages.
The first stage is nothing but just what we already have: beamforming followed by
IDOA-based post-processing.
And this technology gives us some kind of baseline -- I mean, a bottom line in terms of
separation performance.
And it also helps the second-stage algorithm go in the right direction.
And after this first stage, we apply the second stage. The second stage is composed of
two parts. The first part is the feedforward ICA and the second part is this, you know, IDOA-based
post-processing again. But those are different.
So in this case, this is more or less trying to suppress the reverberated sound in the
target itself.
And finally we hope we can get the speech separated correctly.
>>: Can you say a few words about the prior from video?
>> Lae-Hoon Kim: Okay, okay. So basically, as of now, I mean, for this current
situation, we cannot have the prior information about the source position, right? So we
actually use some kind of, you know, technology called MUSIC or something like that to
find the source direction.
But in our case, like in the Xbox case, because we have some video signal, right, and it is
known to be very accurate for tracking the position of the mouth and the speech source, then
that kind of prior information about the direction can be used -- you know,
combined here easily enough.
>> Ivan Tashev: If you use just [inaudible], there is a [inaudible] seconds delay after
somebody starts to speak before we pinpoint the speaker, let's say plus, minus a couple of
degrees. And because in the [inaudible] we have those [inaudible] 3D cameras, you can
get the direction right away.
So this uses the transition of the beginning. Otherwise [inaudible] every prior
information we have. Not that [inaudible] can deal with it separately, but just helps the
initial -- when this person starts to speak.
>> Lae-Hoon Kim: Yep. So basically if you can have, you know, better estimation
about the source direction, that would be really helpful.
And up to now, just by using some kind of well-known, you know,
source [inaudible] algorithm, we could achieve, you know, kind of
good results, as you can see here.
And the first stage is nothing but beamforming followed by IDOA-based
post-processing. And here, you know, a time-invariant or adaptive beamformer can be
utilized, and then, you know, the same structure.
And, as I would like to mention again, it produces a safe guideline for the second stage.
And in the second stage we do this kind of update. And this update is composed of two parts:
this part for the, you know, multi-tap ICA, and this part for the regularization.
So basically what it does is try to update each multi-tap -- i goes from zero to M minus 1,
where M is the length of the filter. Each time we want to update this part in a way that
maximizes independence, but we would also like to penalize big deviations from the first-stage
result.
So each part can be expressed like this. And this does nothing but maximize
the same idea -- same idea, but in this case we extend the original situation into the
multi-tap case.
And here is the regularization. This is just like [inaudible] or something like that. And
what I want to mention over here is that there is some kind of case where we are just
interested in a few sources rather than having all the channels.
Like, for example, if there are four different microphones and you just [inaudible] --
this is very simple. We can deal with it using this
[inaudible]. We just need to regularize on the specific two channels so that we can have
just two, only two, separated speech signals.
And this kind of thing can also be dealt with by the initialization too. Because we
know the directions of arrival of the speech -- I mean, interference and target --
they can be utilized for initializing the filter taps.
And, first of all, we just initialize the taps as just an identity matrix multiplied by some
kind of [inaudible] function, which is very similar to the lambda k.
And then the first tap -- because we know that the first tap should be like nullforming or
beamforming, right? And they can be constructed by this kind of, you know,
[inaudible].
And this should guarantee us the nullforming and beamforming for each [inaudible]
in each [inaudible]. Like, if you want to assign the first channel to have the first source and
the second channel to have the second source, then you can utilize this kind of thing on the
first and second channels.
And the [inaudible] channels are just like, you know, identity, even for the [inaudible].
And after that, after convergence, we also do this kind of scaling. This is very
similar to the IDOA-based post-processing, but it's just based on the
converged first tap. So it's a little bit different, but it is also helpful to have
this kind of functionality together with the IDOA-based post-processing.
So let's -- this is a time for me to show the experimental results.
Actually, we chose two different evaluation criteria. The first is [inaudible] and the second
is [inaudible]. And [inaudible] is the SIR, the signal-to-interference ratio, which is
measured in C-weighted dB.
And a perceptual measure. For the perceptual measure we have used PESQ scores, which
are a computational proxy for actual mean opinion scores.
And to calculate the SIR, we actually use some technology for the measurement of the
[inaudible]. Basically, because we know the original clean speech for
the interference and the target in this case, we can utilize that to, you know, calculate this
SIR. Why not?
And that can be done using the same technology as for, you know, calculating [inaudible] --
just playing the [inaudible], and just by having an inverse filter you
could have some, you know, nice-looking [inaudible]; we use the same
technology.
And that can be, you know, weighted with the C-weighting. And C-weighting is
good because it just guarantees us some kind of flat response up
to about 8 kilohertz at the 16 kilohertz sampling [inaudible].
And this is the test result for the measurement tool. This is the original SIR -- you can
manipulate that -- in a reverberant and noisy condition. And this is the measured one. So I would like to
highlight that up to 10 -- 12 -- 20 dB we could have quite a [inaudible] measurement.
And beyond that, it is underestimated. So basically, if we can get a number better than
20, then that's good.
And we have collected the data -- actually measured data. We measured the room
impulse response in the audio lab there, and we measured 18 different
positions.
And this is just kind of -- imagine that there is some kind of couch; the couch can be placed
like this. Okay. And this is just kind of shoulder-by-shoulder distance, actually. And
you see this is 2 feet, 0.6 meter, which is very, very near.
And we can see over here the angle, you know, goes from 6 degrees, which is very
narrow over here, okay, to 79 degrees like this.
And 1.2 meters to, you know, 4.3 meters. And, again, the reverberation time was 375
milliseconds, which is very large. And that's 15 dB SNR. A very realistic situation.
And this is a working distance. We would like to have some kind of really good
performance over here, like 2 to 3 meters, something like that.
Okay. So, as a clean speech corpus we have used the TIMIT database.
We simply convolved this clean speech with the measured impulse responses, and we
added noise -- natural noise recorded in the same space with the same device. It is,
like, three minutes long, and we selected a random segment of that
three-minute-long, you know, noise and added it.
And that's how we generated the corpus.
>>: So you simulate the reverberation by putting the TIMIT clean speech into the room
while you record the --
>> Lae-Hoon Kim: No, no, we actually convolve.
>>: It's basically done.
>> Lae-Hoon Kim: Yeah, [inaudible] but it's [inaudible].
>> Ivan Tashev: So in each point you place a [inaudible] and play a shared signal, record
this one [inaudible] then compute the impulse responses.
>>: Oh. So it's not actually [inaudible]. It's just relation of --
[multiple people speaking at once]
>>: But it's an actual room.
>>: It's an actual room with natural voice. And you have the database with the noises.
So practically what happens is you take the clean TIMIT voice [inaudible] impulse
response from the point to each of the microphones [inaudible] segment of the noise
which is added to it to achieve test.
And now this is your corpus. It's very realistic. The only thing which is missing
is that when we talk we slightly move our heads, so that there is a [inaudible]
reverberation in the impulse response. That's it.
>>: The -- two questions based on that. One is the device noise, is that the electrical
noise picked up by the device or is that the -- some sitting in an office or a -- with HVAC
and reverberating noise around the room that's actually being picked up by the
microphones?
>> Lae-Hoon Kim: Actually being picked up by microphones. Because we have
[inaudible].
>>: It's not in the soundproof room?
>> Lae-Hoon Kim: Huh?
>> Ivan Tashev: No. This is the -- this is the room in front of [inaudible] chamber.
>> Lae-Hoon Kim: Not [inaudible].
[multiple people speaking at once]
>>: I don't care where the noise is coming from. So then the other question -- the other
question I have is you probably have a strong reason to believe that this measured
impulse response, which is deterministic in the tail is a good approximation for the real
impulse response which is going to be stochastic in the tail.
>> Ivan Tashev: So in general, I would just say that that stochastic thing is mostly because
humans move. We'll see some results at the end [inaudible] which incorporate that
[inaudible].
But for now this is a good -- very close to the realistic test corpus which incorporates the
long reverberation tail and actually noise recorded with the device.
>>: Well, just like Aurora 2 or Aurora 3 --
>> Ivan Tashev: Yes.
>>: This is about as good as you can get with prerecorded clean signals and mixing.
>> Ivan Tashev: Yes. So we have enough experience with this. This is how we
synthesize the corpus for ABU. So it's very close to the reality, including compensation
for [inaudible] effect, et cetera, et cetera, that's very close.
The only thing that is missing is the reverberation tail, which is due because we kind of
move our heads when we speak.
>>: Although, one point, if you don't mind my mentioning, this could actually make
some difference. I remember when I was working with Brad, we did the same thing: use
the artificial mouth, measure the impulse response, then use the measured ones.
But then when we went in and put real people, those levels, the stochastic difference, it
actually does make a difference.
>> Ivan Tashev: We are aware of that. Because --
[multiple people speaking at once]
>>: [inaudible] the rest of the algorithm, right?
>> Ivan Tashev: The major improvement we get because we deal in the reverberation.
Freezing the tail is kind of cheating. But you'll see the results.
>>: So is it very difficult to create Aurora 3-like data for the natural case?
>> Ivan Tashev: In Aurora 3 it's just clean speech and noise. There is no impulse
response from the microphone to the mouth. So [inaudible] --
>>: Aurora 3 is actual data.
>> Ivan Tashev: Is actual data.
[multiple people speaking at once]
>>: [inaudible] in the car and the microphone is on the dashboard, they're actually --
>> Ivan Tashev: So it acquires the impulses mostly in the car.
[multiple people speaking at once]
>>: [inaudible] there's no impulse responses on those.
>>: Real [inaudible], real noise, real impulses. So, I mean, the equivalent -- it's much
more time-consuming, you're just going to get users into the lab and sit in different places
and play a game. I mean, that's the --
>> Ivan Tashev: So our reason to use a synthetic corpus is mostly because PESQ
requires a clean speech [inaudible].
>>: I see. Okay.
>> Ivan Tashev: And you saw that our tool, which we wanted to make it noise
independent, actually tries to estimate the impulse response of the source without -- in the
mixture to extract it, source [inaudible] in the mixture so we can check out at the end we
have just noise, and then to measure the extraction [inaudible] independent [inaudible]
the separation.
And it is very convenient to have the clean speech sources. We can do it with humans,
but we have to equip them with close-talk microphones, and that's kind of more
complex.
>>: So we actually are gathering data, so that actually will be able to [inaudible] it's a
good start [inaudible].
[multiple people speaking at once]
>> Lae-Hoon Kim: I do have one thing. I have information about that [inaudible] then
we should have some kind of grade functionality to track the position of the speech.
>>: [inaudible]
>>: Be a real fun game [inaudible].
>>: So why are the individual microphone locations like that, not symmetrical [inaudible]?
>> Lae-Hoon Kim: Individual -- this one? Well, you know, if we have just the same
distance, then we -- there is some kind of, you know, possibility to have some [inaudible]
frequency, high-frequency [inaudible]
>> Ivan Tashev: [inaudible] those four microphones, it's optimized to cover evenly the
frequent -- the frequency range. Each microphone pair is good at one frequency, and
then its performance goes down. So the distances are set in a way that those six
microphone pairs [inaudible] to cover [inaudible] frequency bands.
>> Lae-Hoon Kim: Okay.
>> Ivan Tashev: But this is not something which was in the [inaudible] I want to change.
He was just given a microphone and this is the device.
>> Lae-Hoon Kim: Actually, that microphone was cool. The performance was really
good. Was much better than I expected.
>>: The prior from the video is simulated as well I guess.
>> Lae-Hoon Kim: As I told you, there is no information like that up to now. Up to
now, we actually used just a kind of blind way of estimating the source direction. Okay.
But in this case, we are now using some kind of synthetic impulses. But that means we
can do that quite easily.
Okay. Let me go to the next slide. These are the configurations. We have tested 18
different configurations. And as you see over here, this is the wide-seating situation and this is
the narrow-seating situation, varying the distance between the microphones and the speakers.
And this is the results. Let me explain in a detailed manner.
This is distance, so basically this is close. This is far. This is angle. This is narrow.
This is very wide.
And our proposed algorithm gives us from 10 dB to 20 dB. And from my
personal listening experience, I think that if we can achieve more than 20 dB,
that's really good. This is really like [inaudible] -- we are going to hear that soon.
And what I want to mention here is that even at 2.7 meters and 26 degrees, we could
get that 20 dB separation. This is very interesting.
And this is the PESQ score. This is the angle, this is the distance, from near to far,
narrower to wider. And this is the PESQ score.
As we'd expect, with better SIR we could have better MOS, which is, you know,
expected. So basically, this tells us that by having better SIR we do not hurt
the perceptual quality. It hurts a little bit, but it is still better than having just the mixed
[inaudible].
>>: So is the PESQ improvement over no processing?
>> Lae-Hoon Kim: Over no processing, yeah. No processing. So that's meaningful,
right? Over no processing.
And, again, we would like to compare our approach with the, you know,
current state of the art. So we actually picked the situation with the most similar
conditions to the published state of the art.
So as you see over here, you know, this plus and [inaudible] and star. And this is located
here, in the upper left corner, which is the best condition, right? And we actually tested
in a somewhat harder situation rather than this best condition.
>>: So, sorry, we're looking at the dB reduction?
>> Lae-Hoon Kim: Yeah, dB reduction. Average.
>>: [inaudible] like one is this 3-D thing and one is --
>> Lae-Hoon Kim: Actually, I have 3-D plot for this, and 2-D plot for this. But I just
only, you know, combined those two things like this way. If you want -- if you are
interested, I can show this -- everything afterwards.
So I want to compare our proposed algorithm with the state-of-the-art. This is
[inaudible]. So 1, 2, 3, 4, 5, 6 is the method, and this is SIR suppression. And this blue
box means the -- the state-of-the-art we have already discussed, and this red box is our
implemented version.
So basically first [inaudible] is the Saruwatari one, and second one is Taylor and third
one is Sawada. And you can see the spec.
>>: Doing quick geometry in my head, at 1.22 meters and an angle of 54 degrees, how far
apart [inaudible]?
>> Lae-Hoon Kim: Huh?
>>: At 1.22 meters and 54 degree angle, how far apart are the two sources?
>> Lae-Hoon Kim: Like this.
>>: About two feet?
>> Lae-Hoon Kim: Yeah. Kind of a little bit of distance between the --
>> Ivan Tashev: A person between them.
>> Lae-Hoon Kim: Not just sitting like this, but kind of --
>> Ivan Tashev: [inaudible] two other persons.
>> Lae-Hoon Kim: Something like that.
>>: About a meter, right?
>> Ivan Tashev: Huh?
>>: About a meter.
>> Ivan Tashev: Yes.
>>: So we're seeing almost a 30 dB [inaudible]?
>> Lae-Hoon Kim: Yeah. This is very [inaudible] because, you know what? This is the best
condition in our situation.
>>: It's incredible.
>> Lae-Hoon Kim: This is incredible. You want to hear that.
>>: So the question here is, I'm not quite sure whether different algorithms require
different optimal configurations of the microphones. If that aspect isn't optimized, then you
cannot make a fair comparison.
>>: These numbers are taken, it looks like, from papers, not from --
[multiple people speaking at once]
>>: Because, like, our algorithm was compared against one of those other two, and
basically the conclusion was that our algorithm gave the same result as the state of the art
while solving the permutation -- without having to do the permutation. It wasn't a better
algorithm. But in this case, the fact that it's like 4 dB worse than his is a little bit
[inaudible].
[multiple people speaking at once]
>> Ivan Tashev: The only check we did -- so, look, No. 5 is our implementation point by
point of the algorithm described in [inaudible]. So you see that there are slightly different
conditions here. We still are in a more difficult situation. So [inaudible] reports on 70
degrees and 1.15, but because we have a discrete [inaudible] second reverberation
[inaudible].
In our case we're kind of close, but the [inaudible] -- our reverberation time is kind of
[inaudible] -- but, look, we get a comparable result. This was our check that we are not
completely off. At least we captured the idea of the single-tap ICA.
So this leads us to believe that at least for those two we have something
close. And then those three are already our approaches. The only way to do a perfect
comparison is to [inaudible] implement all of those separately and do the [inaudible], and
we just had no time.
So this was to find the closest point to all those three and to see how much better or
worse we are doing.
>>: [inaudible] use the same kind of database that you have?
>> Ivan Tashev: That's correct. That's correct.
>> Lae-Hoon Kim: But the way of getting the data is same [inaudible].
>> Ivan Tashev: They report the reverberation time. In most of the cases here, they deal
with a reverberation time from an impulse response generated with the image method, which is
less -- it's more sparse than any other [inaudible] reverberation [inaudible] impulse
response. So in all cases those three papers work in way better conditions than us.
>>: I think all three of those also have two microphones instead of four.
>> Ivan Tashev: That's correct. That's correct.
>> Lae-Hoon Kim: So I should say that all the conditions --
>>: But the bottom one is the meaningful one. I mean, 5 is the meaningful comparison.
>> Lae-Hoon Kim: Yeah. Actually, I should say this --
>>: [inaudible] the implementation on the same [inaudible].
>> Lae-Hoon Kim: In all the conditions, actually, our condition is worse than --
>>: So the bottom four you actually recorded with -- were done with an implementation
[inaudible] microphone.
>> Lae-Hoon Kim: Yes, yes, yes. Yes.
>>: Can I ask a quick question?
>> Lae-Hoon Kim: Yes.
>>: Because many of the others use two microphones. You're probably going to cover
that later, but kind of a question that comes to mind: even though you have four available
microphones, why don't [inaudible] you turn off one and say, if I only use three or if I only
use two, so that you have kind of a return on investment for the cost of adding each
microphone. So if you have, for example, only three microphones.
>> Lae-Hoon Kim: I didn't have enough time to do that. But I have some kind of
guess about that, actually. The thing is that, you know what, to implement this
kind of beamforming and nullforming we added one extra constraint, right? That means
a decrease in the degrees [inaudible].
>>: So if you were to guess, for example, if you were to reduce from 4 to 3 --
>> Lae-Hoon Kim: That can work.
>>: -- the DBs you would lose in terms of separation?
>> Lae-Hoon Kim: I think there is no -- seriously, no degradation from that. If I just
use two rather than three, then there could be some kind of degradation, because we
cannot fulfill this nullforming -- we don't have the degrees of freedom [inaudible].
>>: [inaudible] in terms of how many CPU side processes you're going to have to run?
>> Lae-Hoon Kim: Yeah. I would like to investigate those kinds of situations, I really
would like to, but, you know --
>>: I hear you.
>> Lae-Hoon Kim: Yeah. First of all, I would like to make sure this kind of algorithm
gives us some kind of interesting --
>>: [inaudible] just couldn't resist.
>> Lae-Hoon Kim: Yeah.
>>: So what kind of dB difference you might get for different microphone
configuration?
>> Ivan Tashev: For different microphone arrays. So you see that --
>>: [inaudible]
>> Ivan Tashev: Number 6 is just the first stage, which is the existing sound-capturing
system to date. This is what we proposed to the Xbox team to take. And we already have 19.
>>: [inaudible]
>> Ivan Tashev: And then on top of this, the ICA gives us an additional -- technically it
should be an additional around [inaudible] dB, but you know the good things [inaudible].
>> Lae-Hoon Kim: Basically this algorithm always outperforms the [inaudible]
algorithm by like 10 to 12, 20 dBC. And it's always better than this first stage. But, you
know, we'll see.
>>: Is the ICA stage online or batch?
>> Lae-Hoon Kim: Batch. But it can be converted to online easily. Yeah.
But the thing is that, you know -- yeah, I really would like to test the online version of
this. But, yeah, that would have --
>>: [inaudible] 20 dB online or 30 dB?
>> Lae-Hoon Kim: Yeah. That --
>> Ivan Tashev: [inaudible]
>>: [inaudible] doing that stuff online is much harder.
>> Lae-Hoon Kim: Yeah. If we would like to do this kind of thing online, then we
have to have, you know, tons of different, you know, optimization
processes first to fully optimize these design parameters, actually.
But, you know, I only had ten weeks over here, so I didn't have enough time to
run all these kinds of optimization processes. So I just tested a couple of different kinds of
settings, and these are the results up there.
Okay. Let me give the SIR summary. Let me summarize this.
The proposed approach outperforms the others by 10 to 20 dB C-weighted, and it achieves a good
improvement in PESQ scores. And I highlighted all of them already.
Okay. This is the kind of experiment for a realistic implementation. Actually, the
[inaudible] previous slide was with a very conservative setting, like the
number of taps was 20 and the iteration number was 1,000. Huge. But I have
tested whether these kinds of [inaudible] can be decreased.
And as you see over here, the iteration number can be decreased, you
know, to 200. And the tap number can also be decreased to 5.
>>: [inaudible] tap number means how many previous [inaudible]?
>> Lae-Hoon Kim: Yes. Yes.
>> Ivan Tashev: [inaudible] the reverberation.
>> Lae-Hoon Kim: This is detailed iteration, but --
>>: So how long was the frame?
>> Ivan Tashev: [inaudible]
>> Lae-Hoon Kim: Yeah, 16 milliseconds. There are more details, but I will not cover this. Yeah.
This is the thing that I would like to play over here. So let me --
This is the synthetic mixture first. Let me try to play this.
[audio playing]
>>: So why this constant background noise?
>> Lae-Hoon Kim: This is computer.
[audio playing]
>> Lae-Hoon Kim: There is some speech over here. There should be. But we cannot
hear that.
>>: Where is the mechanical aspect of the voice coming from? It sounds -- sounded like there's a little bit of nonlinear noise in there.
>>: This sounds more like missing spectral components. It's just the same as you get
from noise suppression when you're too aggressive and you're missing --
[multiple people speaking at once]
>>: There was an alpha parameter that traded off how much you trusted the prior from the
IDOA or how much you trusted the ICA.
>> Ivan Tashev: We didn't have time to tune this. We can squeeze a little bit.
>>: But if you relax that, maybe ICA would make things -- try to make it more linear,
less --
>> Lae-Hoon Kim: Yeah.
>>: In terms of recording and playback, you can attenuate a lot of that by just a little bit
of noise fill. I guess you do a tiny bit of noise fill, and then it --
>> Lae-Hoon Kim: Yeah [inaudible].
>>: [inaudible] speech recognition.
[multiple people speaking at once]
>>: But you can do it artificially too.
>>: Oh, I see. You can do the engine and you have that, okay.
>> Lae-Hoon Kim: Yeah. And, well, as you already pointed out, this is a kind of synthetic
corpus, because we need to, like, make sure that this algorithm is working quite well.
And obviously we also have some curiosity about how it works in a realistic situation.
So I invited one of my mentors at MSR to come to this audio lab, which is the original
place where we measured the [inaudible], and I asked her to sit -- just to stand -- at
one of those predefined configuration areas, and we just
talked spontaneously. And, yeah, you know, you'll see the result.
>>: Would male versus male be harder than male versus female?
>> Lae-Hoon Kim: No such thing, actually. I actually -- I tested --
>>: Generally it is male versus [inaudible].
>> Lae-Hoon Kim: But the thing is that, you know what, I tested all kinds of different
configurations, and the [inaudible] there depended more on the distance
and angle or the way of speaking or something like that.
>>: What was the content of the speech being said? Because if they're both saying the
same thing at the same time, that's way different than if they're saying something
completely different.
>> Lae-Hoon Kim: You know what?
>>: If one person's saying consonants and one person's saying the vowels, it's really easy to
separate them [inaudible].
>> Lae-Hoon Kim: That's a very interesting point. Actually, for the synthetic part, we
have used this TIMIT corpus. And all of the TIMIT corpus starts with the same
two sentences.
>>: [inaudible] but in the rest of it, I mean, you're averaging over the entire length of the
thing and got one of the speakers that speaks 20 sentences, your average is going to be
they're not saying the same thing, they're not even talking at the same time, if you don't
control for the open areas. So you're going to end up with a situation where they're never
going to collide anyway, so it's like --
>> Ivan Tashev: So hold on. When we said batch processing, this doesn't mean that you
take [inaudible]. Every frame is processed using the current frame and the number of
taps before. That's it. Yes, inside [inaudible] but not in taps. So it cannot go ten frames
or 160 milliseconds, so you are in the boundaries [inaudible]. That's it. Not to go
[inaudible]. You don't have to -- we don't do 40 seconds of an utterance. That could
get us way better results, actually.
>>: It seemed like his objection was that each TIMIT utterance is 30 percent silence.
>> Ivan Tashev: Human speech is 30 percent --
>>: Yeah. And that -- that -- when you have a 30 percent chance of overlapping with
absolutely nothing and you get really good results as opposed to trying to overlap speech
and speech and just measuring there.
>>: I think the game show scenario is where both people get the exact correct answer at
the same time. Right? That would be an interesting [inaudible].
>> Ivan Tashev: [inaudible] of the same nasty phrase which most of us have had to sit and listen
to over and over and over.
>> Lae-Hoon Kim: Actually, there is a kind of scenario where one person dominates the
speech; then we can just use that kind of speech, you know, nothing else. And the
situation can happen where there is some simultaneous speech; we
can, you know, detect that, too, and we can utilize a different kind of mode to
deal with a different kind of situation, I guess.
>>: So I guess what I'm hearing is that in running speech where two people are just
actually talking -- the -- they're not going to overlap enough that your history is going to
be able to distinguish the two. And you didn't test the case where they're saying the exact
same thing at the exact same point, same rate, same accent and all that stuff and you don't
know what will happen in the next situation.
>> Lae-Hoon Kim: Okay, okay.
>> Ivan Tashev: [inaudible] distinguished it [inaudible] separate two things which are
the same.
>>: Well, he [inaudible] pretty good job of separating two people saying the exact same
thing, even if they're both male and both French and both speaking in Russian. You
know, your voice can still [inaudible].
>> Ivan Tashev: [inaudible] the difference still.
>> Lae-Hoon Kim: Yeah. I wish I could have the opportunity to test that kind of hard
situation.
>>: I mean, there is this game show concept of two people buzzing in and having the same
answer at the same time.
My guess is -- and that's a viable scenario we're looking at. My guess, though, is even
there there's quite a bit of statistical [inaudible]; people don't speak at the same rate, even
in that capacity, right? There's still -- I think it's really -- it's fairly artificial and probably
very short time bound to have two people literally speaking the same thing at the same
time at the same rate.
Like even in this real-world scenario, if I say who was the first president of America, and I
ask everybody to answer at the same time, people speak at different rates. I would think
that there would be some statistical [inaudible].
>>: Even if you do the extreme of a chorus where people are actually trying to sing in
synchrony, it is still going to end up with [inaudible].
[multiple people speaking at once]
>> Lae-Hoon Kim: On top of that, speech is very sparse, you know. So basically there
is no speech which [inaudible] exactly. So that means it can be [inaudible].
>> Ivan Tashev: [inaudible] mean the same thing the speech will be slightly different
and the harmonics will be slightly off [inaudible].
>>: And so your [inaudible] is 256?
>> Lae-Hoon Kim: Yes. Yeah.
>>: Okay.
>> Lae-Hoon Kim: And here you're going to hear my voice, my own voice, and one of
the interns' voices -- speaking Chinese and Korean. And because we don't have enough
time -- so let me go to the harder situation, you know, the No. 9 configuration, which is --
>> Ivan Tashev: [inaudible] so go and show the easier part.
>> Lae-Hoon Kim: Easier work first?
>> Ivan Tashev: Yeah.
>> Lae-Hoon Kim: Okay. This is a mix.
[audio playing]
>> Lae-Hoon Kim: Okay.
>>: I don't think I can understand that [inaudible].
>> Lae-Hoon Kim: In fact, I don't know what she's talking about right now. I just asked her to talk about whatever she wanted. So, anyway, I'm talking myself. I mean -- what I'm saying over here is, my name is Lae-Hoon Kim, I'm working in MSR with this kind of -- you know, I'm doing some kind of project like this, something like that.
Let's hear our results.
[audio playing]
>> Lae-Hoon Kim: Okay. Let's go to a somewhat -- no, a harder situation, like, you know, [inaudible] the 2.444, which was reported to have 20 dB separation.
[audio playing]
>> Lae-Hoon Kim: Okay. We do the different speech. I mean, we are speaking different kinds of content. So this is the result.
[audio playing]
>> Lae-Hoon Kim: Okay. So this environment is much better than [inaudible] because this environment attenuates some kind of --
>>: [inaudible]
>> Lae-Hoon Kim: Yeah. But you can see here, and you can hear, the [inaudible] of this algorithm. This is what I got, actually. And as I already said, above 20 dB I think it's going to be really good in terms of having good separation.
So, okay. Let me then conclude my talk, and future work will be [inaudible].
This is a summary of my talk. A subband-domain, short-time frequency-domain, regularized feed-forward multi-tap ICA with IDOA-based post-processing has been proposed and evaluated.
And this approach produces breakthrough -- I should say -- breakthrough results in terms of improvement in SIR, and the PESQ scores were also really good.
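As a rough illustration of what a per-bin, regularized, feed-forward multi-tap ICA update might look like, here is a minimal numerical sketch. The natural-gradient rule, the regularization toward the initial demixer, the tap count, and all names are assumptions for illustration, not the implementation from the talk; permutation alignment across bins and the IDOA-based post-filter are omitted.

    import numpy as np

    def multitap_ica_bin(X, n_src=2, n_taps=4, n_iter=200, lr=0.1, reg=1e-3, seed=0):
        """Toy feed-forward multi-tap ICA for one STFT frequency bin.

        X : complex array (n_mics, n_frames) -- one bin of the array STFT.
        Returns (n_src, n_frames) separated bin signals.
        """
        rng = np.random.default_rng(seed)
        n_mics, n_frames = X.shape

        # "Multi-tap": stack the current frame and (n_taps - 1) past frames so the
        # demixer can model reverberation that spills across STFT frames.
        Z = np.zeros((n_mics * n_taps, n_frames), dtype=complex)
        for tap in range(n_taps):
            Z[tap * n_mics:(tap + 1) * n_mics, tap:] = X[:, :n_frames - tap]

        # Feed-forward demixer, initialized near a pass-through of the first
        # n_src microphones (a stand-in for a beamformer-style initialization).
        W0 = np.zeros((n_src, n_mics * n_taps), dtype=complex)
        W0[:, :n_src] = np.eye(n_src)
        W = W0 + 0.01 * (rng.standard_normal(W0.shape) + 1j * rng.standard_normal(W0.shape))

        for _ in range(n_iter):
            Y = W @ Z                                   # current source estimates
            phi = Y / (np.abs(Y) + 1e-12)               # score function for super-Gaussian (speech-like) sources
            grad = (np.eye(n_src) - (phi @ Y.conj().T) / n_frames) @ W   # natural-gradient direction
            W = W + lr * grad - reg * (W - W0)          # small pull back toward the initial demixer
        return W @ Z

A full system would run something like this independently in each subband of the microphone-array STFT, resolve the permutation ambiguity across bins, and only then apply the spatial post-filter.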
But as I pointed out -- as [inaudible] pointed out -- we need to do some more tuning to get better perceptual quality, I guess. And that would be one of our future works.
And another thing is, as we already discussed, we need to have some online
realtime-based algorithm to work here.
And this should be tested, I guess. In terms of the technology, we already [inaudible], because the algorithm can be done either batch or online. This can be done. But the thing is, as I already mentioned, there are some design parameters we have to optimize first.
>>: So by online you really mean just realtime frame-by-frame processing?
>> Lae-Hoon Kim: Yeah. There could be two different ways. Realtime means processing just one frame at a time, as they come in. Or a kind of block-based thing, like, you know, 40 milliseconds at a time, something like that.
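The two online schedules being contrasted here could look roughly like this; update_separator stands in for whatever adaptive separation step is used, and the 4-frame (~40 ms) block size is just an assumption.

    def online_frame_by_frame(stft_frames, update_separator):
        """Update the separator after every incoming STFT frame."""
        outputs = []
        for frame in stft_frames:              # frame: one multichannel STFT frame
            outputs.append(update_separator([frame]))
        return outputs

    def online_block_based(stft_frames, update_separator, block_len=4):
        """Buffer a small block (e.g. 4 frames ~ 40 ms) before each update."""
        outputs, buf = [], []
        for frame in stft_frames:
            buf.append(frame)
            if len(buf) == block_len:
                outputs.append(update_separator(buf))
                buf = []
        if buf:                                # flush any trailing partial block
            outputs.append(update_separator(buf))
        return outputs

Either schedule can reuse the same update rule; the difference is only how much data is buffered before each adaptation step, trading latency against the stability of each estimate.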
>>: Is it that the algorithms themselves are fundamentally different?
>> Lae-Hoon Kim: I don't know. I would have to have an opportunity to test that kind of thing. But if I could have like two or three weeks more, then I would do that. [inaudible] I really would like to try that. Because, you know, this was kind of a compromise for me -- I had to stop at some point to organize all the --
And also I would like to mention that it may be worthwhile to combine the presented algorithm with speech-motivated stuff. Here I didn't use any kind of [inaudible] like computational analysis or something like that, or speech [inaudible] like a harmonic-based one or something like that.
If we can utilize [inaudible], then I'm sure that there could be some kind of benefit, like,
you know, improving the perceptual quality or something like that I guess.
>>: But then would your demo with two different languages have a harder time?
[laughter]
>>: If you change to language-specific speech models, then if you don't speak [inaudible].
>> Lae-Hoon Kim: That's going to be another, different kind of story, I guess, yeah. But there is a kind of [inaudible], like, you know, based on computational analysis, something like that, which just depends on some kind of major [inaudible] of the voice, not the language. So kind of harmonicity or something like that. They can be -- they don't have to depend on the language.
>>: So on one of your slides you showed the separation, because basically you can -- at the output of your diagram, there is a speaker one and a speaker two.
>> Lae-Hoon Kim: Yeah. You want to --
>>: Do you really reconstruct speaker one and speaker two after your algorithm? Can you really do that?
>> Lae-Hoon Kim: [inaudible] what in my speaker?
>>: The speaker one sum, and you showed one [inaudible].
>> Lae-Hoon Kim: Yes, yes.
>>: You really reconstructed the other speakers.
>> Lae-Hoon Kim: Yes. Kind of, you know. Because you can suppress the others. Like in this case I just played a single one -- one of the two, right? If you want, I can let you hear the other one. And the other one has, you know, that speech only, suppressing the first one.
>> Ivan Tashev: [inaudible]
>> Lae-Hoon Kim: Just symmetric. So I just --
>>: [inaudible]
>> Lae-Hoon Kim: Yeah. So that --
>>: So it's not that you simultaneously reconstruct the two, but you just decide to reconstruct one and the other gets suppressed?
>> Ivan Tashev: No, it's simultaneous [inaudible]
>> Lae-Hoon Kim: Simultaneous separation of two different mixtures.
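To make the "simultaneous" point concrete, here is a toy instantaneous (non-reverberant) example: a single 2x2 demixing matrix recovers both speakers in one step, so neither output is a by-product of suppressing the other. The mixing matrix is made up, and the ideal demixer is used in place of a blind ICA estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    s = rng.laplace(size=(2, 1000))            # two sparse, speech-like sources
    A = np.array([[1.0, 0.6],                  # illustrative 2x2 mixing matrix
                  [0.5, 1.0]])
    x = A @ s                                  # two microphone mixtures

    W = np.linalg.inv(A)                       # ideal demixer (ICA would estimate this blindly)
    y = W @ x                                  # y[0] and y[1] come out simultaneously
    print(np.allclose(y, s))                   # True: both speakers are reconstructed at once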
>>: But if you do the simultaneous separation -- in our scenario we probably just need one, right? We just need one?
>> Lae-Hoon Kim: No, no, no, no. In some specific applications we need to [inaudible]. Like, you know, for example in the game scenario, if there are two people sitting together playing a game, and both of them talk something, then those kinds of things have to be equalized at the same time. And [inaudible].
>>: So for this kind of algorithm it looks like there's not much difference in terms of use in the gaming environment versus in the automobile, except that the reverberation we have is a different one.
>> Lae-Hoon Kim: Yes. But the main difference of this kind of algorithm is, I think, up to now, to the best of my knowledge, there is no approach that actually incorporates this multi-tap approach in this kind of ICA thing.
And we actually know what beamforming is good for and what ICA is good for, so why not? This is kind of, you know, a way of combining them.
[multiple people speaking at once]
>>: You have much wider range of the reverberation.
>> Lae-Hoon Kim: Sure, sure, sure, sure.
>>: So that may require more robustness.
>> Lae-Hoon Kim: Yeah. Yeah. I don't know much about the [inaudible] environment, but in that case there is [inaudible] some kind of difference. Like, you know, in this case, because we are showing the game scenario, we could have this kind of big aperture for the microphone array.
But for the [inaudible] scenario, there is a limitation on the arrangement of the microphones, right? So we cannot place this kind of big microphone array onto [inaudible], so that can be different.
>>: [inaudible]
>>: [inaudible] because of the rest of the audio stack, right? But other people who are just, you know, researchers on ICA can just use a bigger window, right, so they could take it up to a frame size of 2,000 --
>> Lae-Hoon Kim: Yeah, yeah, that's true.
>>: [inaudible].
>> Lae-Hoon Kim: Yeah. Yeah. There is a paper called Frequency Domain ICA [inaudible] limitation [inaudible], written by [inaudible], something like that. And she analyzed the dependency of this kind of performance on the frame length. And she reported that there is a kind of optimal length of the frame.
So just having a longer frame will not guarantee us good results. It will hurt, I think, because having this longer frame means having a longer average in the frequency domain. This is not good.
So basically we just want to stick to this short frame, because we would like to track the change in this kind of situation too.
So this is the way we can deal with a changing environment, changing positions, something like that, together with this.
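The trade-off being described can be made concrete with a quick calculation (assuming a 16 kHz sample rate, which is not stated in the talk): longer frames narrow the frequency bins, which helps the per-bin convolutive model, but each demixing estimate then averages over a longer stretch of a possibly changing scene.

    fs = 16000                         # assumed sample rate
    for n_fft in (256, 1024, 4096):    # candidate frame lengths
        frame_ms = 1000.0 * n_fft / fs
        bin_hz = fs / n_fft
        print(f"N = {n_fft:4d}: frame {frame_ms:6.1f} ms, bin width {bin_hz:6.2f} Hz")
    # Longer frames give finer frequency resolution but react more slowly to
    # moving talkers and changing rooms -- the reason for sticking to short frames.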
>> Ivan Tashev: Plus this approach is one of the reasons [inaudible], so basically you already have it.
>>: That was the first part of my question.
>> Lae-Hoon Kim: Yeah.
[multiple people speaking at once]
>>: [inaudible] having a side conversation, your source separation [inaudible].
>> Lae-Hoon Kim: Actually, that was one of the challenges I have addressed in the
beginning.
>>: Can I ask a quick one?
>> Lae-Hoon Kim: Yes.
>>: I didn't see it -- maybe I missed something, but you don't have in your algorithm a step of trying to estimate how many people are speaking, right?
>> Lae-Hoon Kim: Yeah, no.
>>: [inaudible].
>> Lae-Hoon Kim: Yeah, yeah. I just tested that scenario, like two speakers. But as I already mentioned at the beginning of this talk, if there is a kind of multi-functionality like [inaudible] or something like that.
>>: Right.
>> Lae-Hoon Kim: They can be used.
>>: Help you try to estimate how many --
>>: I mean, we kind of have [inaudible] how many people in the room, where they are, and we can track --
>> Lae-Hoon Kim: Yeah, I'm sure. I'm sure.
>>: -- over time where they're moving. So --
[multiple people speaking at once]
>>: [inaudible] that's captured by the device but is making noise.
>> Lae-Hoon Kim: You know what? We still have some functionality for doing that. Because in the case of [inaudible] you see here, we don't use the video information, but we can localize the source, right? [inaudible] compared with the [inaudible]. But we can do that. Even for the kind of dog under this.
>>: So how would this work in a situation where you're at a party playing Rock Band
and you're singing? Because there's going to be, you know, white babble, background
noise.
>> Lae-Hoon Kim: Yeah.
>>: How is this going to perform? Are you going to lose your dBC very quickly? Is it
going to work well and people have to be screaming in order for it not to work?
>> Lae-Hoon Kim: Yeah. That could be very -- that is very interesting, good question.
>> Ivan Tashev: [inaudible]
>>: Yeah, okay.
>> Lae-Hoon Kim: I have a guess, actually. I don't know what happens, you know, but I have a guess.
You see, here we can concentrate on a specific position and a specific angle of the source.
So, for example, there is a rock band playing something over there and here I try to speak something. If it is [inaudible] -- if this is too annoying and too loud -- that would be very bad. But if I can have, you know, a very small distance between this microphone and me, I can concentrate on myself using this. This is the guideline -- this is kind of a way of guiding the separation of the signal. This is the way we can do it.
But I'm not sure, actually, because, you know, that will depend on the signal-to-interference ratio of this rock band and myself.
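The "concentrate on a specific angle" idea is essentially spatial steering. A minimal delay-and-sum sketch for a linear array is shown below; the geometry, sample rate, and far-field assumption are illustrative, not the array or beamformer actually used.

    import numpy as np

    def delay_and_sum(X, mic_pos, angle_deg, fs=16000, n_fft=256, c=343.0):
        """Steer a delay-and-sum beam toward angle_deg (far-field, linear array).

        X       : complex STFT, shape (n_mics, n_bins, n_frames), n_bins = n_fft // 2 + 1
        mic_pos : microphone positions along the array axis, in metres
        Returns the beamformed STFT, shape (n_bins, n_frames).
        """
        freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)                           # bin centre frequencies
        delays = np.asarray(mic_pos) * np.cos(np.deg2rad(angle_deg)) / c   # per-mic delays (s)
        steer = np.exp(-2j * np.pi * np.outer(delays, freqs))              # steering phases (n_mics, n_bins)
        # Align every channel toward the target direction, then average across mics.
        return (steer.conj()[:, :, None] * X).mean(axis=0)

In the rock-band scenario this kind of steering would emphasize the talker's direction before or alongside the separation, but how much it helps still depends on the signal-to-interference ratio, as noted above.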
>>: So I know the exact hardware configuration for the microphones has been evolving in parallel, like where you put the microphone housing and the acoustics. Are these measurements with the latest one, with the latest configuration?
>> Ivan Tashev: [inaudible] generally we have in the top?
>> Lae-Hoon Kim: Let me go to the last slide.
Thank you for your attention. Any questions?
>>: You want more questions?
>>: I think this is something we definitely want to try. So we're doing work now to basically do identity tracking, so basically mapping sound source localization [inaudible], so we'd say, all right, there's a person here, we're hearing sound from here, and we can see everyone's face. And it would be interesting to tie that into this and say, all right, we always output a unique speaker stream for the number of skeletons we see in view and then track with that. And just see how it works end to end.
You know, it offers value in some different ways. If nothing else, it improves the chance
in multiplayer scenarios that when someone is trying to do speech recognition it will be
successful [inaudible] suppressing background. It might be usable in the game show
game analogy where you do have a lot of literal synchronized cross-talk.
I don't think we ever expected that, you know, if you have 50 people in your house and
people playing Rock Band that you were ever going to have viable speech recognition
or --
>>: You know someone is going to want to do that, though.
>>: They are, but, you know, [inaudible] but I can at least see [inaudible] how we can
provide something as a platform that might actually be beneficial to use the [inaudible]
scenario.
>>: How sophisticated is the facial expression tracking? Do you have that tracking?
>>: We've done a little bit of work early in the project. The basic feeling was that, at the distances we're talking about and with the resolution of the camera, we won't be able to do actual lip tracking.
The camera we use today is a stopgap that's intended to be swapped out with a future version of a high-resolution camera. So hopefully [inaudible], but at three or four meters with something like a 640-by-480 camera, [inaudible] we wouldn't be able to be that sort of accurate.
>>: Uh-huh.
>> Ivan Tashev: Any more questions?
Let's thank Lae-Hoon.
[applause]