>> Ivan Tashev: Good morning, everyone. It's my great pleasure to introduce Lae-Hoon
Kim. He's a Ph.D. student in the department -- or in the Statistical Speech Processing
group in the Department of Electrical and Computer Engineering, University of Illinois at
Urbana-Champaign. And his advisor is --
>> Lae-Hoon Kim: Mark Hasegawa-Johnson.
>> Ivan Tashev: -- Mark Hasegawa-Johnson, a well-known name in signal processing.
And the topic of his three-month-long intern project was speech signal separation in a
reverberant environment.
Without further introduction, Lae-Hoon, you have the floor.
>> Lae-Hoon Kim: Thank you for the kind introduction.
Hi everyone. Good morning. Today I would like to talk about my summer internship
with this title, "Reverberated Speech Signal Separation With Microphone Array."
Before starting this talk, I would like to, you know, express my deep gratitude to, first of all,
my mentor, Ivan Tashev, and to Alex Kipman and the Xbox team for funding my summer
internship, and to Microsoft Research for excellent working conditions and equipment, and
to the whole speech research group members and my fellow interns for the wonderful
summer. Thank you so much.
Let me start this presentation, then. This is our -- today's agenda. First of all, I'm going
to briefly talk about what this project is like from the project outline. And then we are
going to see what kind of state of the art is out there, and then touch on the
background.
And in this background section we are going to have the -- we are going to see the
building blocks of our proposed algorithm. And then after that, you know, the proposed
method will be explained in a technical manner. And then experimental results will be
given with some numbers, obviously. And then after that we are going to have some
listening examples to listen to here. And then I will conclude this talk with some future
work.
Here is the project outline. The objective of this project is [inaudible] speech source
separation using a microphone array in a reverberant and noisy environment, like this kind of
space.
And the main application of this technology is for, you know, interactive game console,
like an Xbox.
So here are our challenges for this project. We have to deal with shoulder-by-shoulder
speaker distance -- you know, angular distance. And the working distance should be like two to
three meters from the microphone to the speakers.
And we have to deal with the reverberation and also background noise. And this kind of
technology should be integrated into the frequency domain implementation, in short time -- short frame -- processing units.
So this is our -- these are our challenges we have to solve here.
And here are the existing approaches to these kinds of problems.
First of all, there is microphone array processing, which includes beamforming,
nullforming, or, you know, post-processing, something like that.
And it aims to suppress sources depending on the direction of arrival -- I mean, the
spatial information you might be able to get. You know?
And there is another way of doing these kinds of things, which is called blind source
separation. And normally people tend to use a technology called
independent component analysis -- in short, ICA.
This algorithm tries to separate sources by maximizing higher-order mutual
independence among the separated signals. So note that those two different technologies
have been built on different kinds of objective functions.
Also, there have been some attempts to combine those two different approaches so far. And
most of them report that the converged ICA is really similar to nullforming on the
interfering speech.
And, you know, it does not produce a, you know, significant
improvement in source separation compared with beamforming plus nullforming.
And here are some important questions we asked, you know, when we started this
project, and some observations we have made here.
First, No. 1: why does ICA just converge to nullforming? The answer is because that is just the
best it can do given the current frames only. And usually the reverberation time is longer than
the frame length. For example, if we take 256 samples [inaudible] 16 kilohertz, that's just
16 milliseconds. And really, in this kind of room, the reverberation time would be
more than 200 milliseconds or 300 milliseconds.
So this is part of the reason. And here you can see the picture like this: usually, given
just the current frame only, the best it can do to maximize the independence of the separated
channels is just to nullify the direction of the interference. That's the best, actually.
But in a given situation, this kind of situation, there should be some kind of direction of a
major reflection of the interference. Then what can it do? Do we want to do this? No.
Because the main energy is over here, we don't want to lose this kind of, you know,
rejection power just to, you know, suppress this reflection a little bit.
So we would like to maintain this. But we also would like to cancel this part. This is
what we would like to do really actually.
And how can we do this? So this is very simple. If we just use the [inaudible] only, then
this is the best. Why not -- why don't we use the previous frames? I mean,
why don't we use a multi-tap structure to cancel out this part? At the same time it can
maintain this, you know, [inaudible] power.
So this is our basic, simple motivation for using a multi-tap structure to cancel out these
kinds of major reflections.
And obviously this is the way we can deal with this long reverberation, given the short
frame length.
And, again, the beamforming technology is usually based on prior knowledge about the
source direction, right? But we don't know the reflection -- I mean, the significant
echo directions -- as prior information.
But we don't have to worry about that, because the unknown DOA -- direction of arrival -- of the
echoes will be dealt with by the ICA.
The objective of ICA is to maximize the independence of the separated signals. And by
doing that we can, you know, automatically have this kind of nullifying in a
multi-tap structure, which will give us, you know, some kind of separation,
even using the multi-tap effect.
Okay. Here's the proposed approach, which is, you know, titled as regularized multi-tap
ICA with IDOA-based spatial filtering.
This -- I will start from the good initialization, which is based on the spatial information
we've already got, like, you know, nullifying the interference direction and beamforming
towards the target direction, something like that.
And we also have the regularization on the big deviation from the spatial filtering first
stage.
So actually this is composed of two stages. The first stage is beamforming followed by the
IDOA-based spatial filtering, and the second stage is this multi-tap structure ICA. And this
second stage will also be accompanied by this spatial filtering too. So we are going to
look at that in a detailed manner soon.
But here we would like to maximize independence. And we want to use the previous
frame as well, not only just current frames.
And this will cancel out the interference leakage in the current frames.
And after that there could be some kind of slightly increased reverberation in the given
target direction, which is made of the target speech only.
So I would like to suppress this slightly increased reverberation of the target speech.
So I just use this spatial filtering again to suppress that reverberation.
So you're going to see that we can use both of these different kinds of technology to the
maximum.
So here's the results summary. It could give us better than 20 dB C-weighted separation. 20 dB
means the interference is down to one-tenth -- one over ten -- in amplitude, for two
shoulder-by-shoulder speakers up to 2.7 meters distance.
The improvement in PESQ points -- PESQ is the perceptual measure -- was 0. to
0.6.
And we could outperform the already published, you know, algorithms by 10 to 20 dB C-weighted.
So here's the state of the art. In 2006 Saruwatari, et al., proposed this algorithm which is
titled as "Blind Source Separation Based on Fast-Convergence Algorithm Combining
ICA and Beamforming."
And in this paper they report that the converged ICA is really like nullforming on the
interference direction.
And the second state of the art is our MSR result in 2008. And this is titled "Maximum a
Posteriori ICA: Applying Prior Knowledge to the Separation of Acoustic Sources."
And this technology tries to utilize the a priori information about the source direction. And
actually this is a beamformer-based trained prior distribution on the ICA filter given the DOA, and
after having this training they want to build some kind of MAP ICA.
Originally, ICA is kind of maximum likelihood; if we can have some kind of
prior knowledge about that, then we can convert that into a MAP problem. So
this is a way of, you know, incorporating the prior knowledge, instead of just
going for, you know, the optimum without it.
And this way we have [inaudible] solving the permutation problem. But still, this is also
some kind of way of doing the best using just the current frames only, like just
beamforming.
And third, this is Sawada, Mukai, Araki, and Makino. And this article
is titled "Frequency Domain Blind Source Separation."
And in this paper, they actually introduced time-frequency binary mask post-processing,
which is somewhat different from the other algorithms. This is kind of, you
know, similar to our IDOA-based post-filtering. But we are going to see what
the difference is.
And they actually report that they can additionally suppress the interference beyond
just using ICA. But still, this kind of algorithm does not discriminate the
reverberated interference in the target direction.
Which means, given this direction, if the signal is coming from the interference, they
cannot differentiate that. So they just accept that as, you know,
target.
So this is just the same problem as beamforming.
>>: So maybe [inaudible] do you know whether this is the technique that is used by
[inaudible] after [inaudible] acquired that company?
>> Lae-Hoon Kim: Oh, maybe.
>>: I don't know. It looks like they are very [inaudible].
>> Ivan Tashev: [inaudible] senior director of [inaudible], but after two telephone
conversations with him last month, but of course he doesn't speak about the technology.
>>: [inaudible] do you have notes about whether this type of algorithm is something that
underlies the technologies?
>> Ivan Tashev: They have a two-microphone -- two-microphone technology for speech
as language. They offer us [inaudible] their latest chip. This is what I know for sure.
What they actually use --
>>: But whether it's frequency domain or --
>> Ivan Tashev: It's quite possible. But I don't know for sure.
>>: You know, all the [inaudible] but they're different.
>> Lae-Hoon Kim: So, in short, if you want me to say something about the difference, then I
would say that this [inaudible] is based on some kind of narrow bin search: if there is
some kind of source toward that direction -- you can, you know, search through this
direction, and this is just the direction for the source, something like that -- then it says one,
a mask of one, and if not, zero. This is just how this works.
But our case, we can provide some kind of [inaudible] information about [inaudible].
>>: So in general the -- see, this is -- the book is a [inaudible] it's about getting to
[inaudible]. The chapter there is from [inaudible] people [inaudible].
>> Lae-Hoon Kim: If you're interested, I can -- yeah.
So let's go into the background part. This is our -- the building blocks for our proposed
algorithm.
First of all, this is the subband domain speech separation. You know, typical situation for
the speech separation. And this can be done by this kind of, you know, process.
First of all, the time domain signal is captured by the multiple microphones, and then
we convert this time domain signal into the short-time frequency domain using
the MCLT. And we do the separation per each frequency bin separately, and then we just
do the, you know, IMCLT to get the time domain signal back again.
And this kind of speech separation processing can be described like this:
the separated speech can be obtained by applying this separating filter to the
measured frequency domain response.
And I would like to, you know, note that N and K -- K means the frequency bin
number and N is the frame number. So basically that can be, you know,
time varying, to reflect the change of the acoustic environment or
something like that.
So, first of all, beamforming. There are two different kinds of beamforming technology in
general. The first is the time-invariant beamformer, which is to fix the [inaudible],
and then we can calculate the coefficients offline. And this assumes the
ambient isotropic noise assumption.
And second of all, there is another technology, the adaptive beamformer, which can track
the variation of the source positions or of the environment itself.
And we utilize this Minimum Power Distortionless Response beamformer, which is
slightly different from the Minimum Variance Distortionless Response beamformer in the
sense that we utilize the, you know, power instead of the variance of the noise.
Because, you know, calculating -- estimating -- the variance is also another
problem.
So basically, this is the optimization problem we would like
to solve here. Basically, what we want to do is minimize the output
power after having this, you know, filtering.
There are some constraints we would like to meet at the same time. One of them is we
would like to maintain the beam direction without any distortion, like this one.
And we would like to cancel all the
signal coming from the direction of the interference.
We added this part because, you know, as I told all of you, the
converged ICA is just like, you know, beamforming plus nullforming. So
why not? We can add this part to this beamformer.
So we added this. And after solving this problem, we get this.
And another thing I would like to mention is this part, lambda I, which is
called diagonal loading, to prevent divergence in this part.
So this is a very realistic implementation. And this is the directivity pattern of the
implementation: this is the time-invariant version, this is the adaptive version.
So you can see that this blue line is the null and this red line is the beam. And towards the
designated directions of arrival for the target and interference, we could make the nulls
and the beams in an appropriate way.
And, again, this beamforming has some very good things, like, you know, it can boost
or suppress depending on this spatial information.
But as I already told you, this technology has some kind of demerit: you know,
it cannot discriminate between sources in a given direction. So we need some
source discrimination power.
This is a spatial filter, which is called IDOA-based post-processing. This was
proposed by Dr. Ivan Tashev and Dr. Alex Acero over here. And IDOA-based means an
instantaneous direction of arrival-based approach.
And this is nothing but this kind of space -- a space constructed from
pairs of the microphones. Like, for example, if there are four microphones,
we can construct three pairs, like 1-2, 1-3, and 1-4, something like that. And after that,
based on the phase information between those pairs, we can draw some line like
this. If there's no reverberation and no noise, then the [inaudible] would follow this
kind of line.
But because there is some kind of reverberation or noise, something like that, there is a
kind of spread around this, you know, designated line.
So based on that information, for each frequency bin we can construct the probability
of the direction -- of the membership of this frequency bin.
Like, for example, if in a specific frequency bin all the signal comes from
some direction, then we could have some kind of very narrow, sharp probability like this.
But if not, then it could be, you know, kind of broad.
So basically, once we have some kind of algorithm like
this, then we can just multiply by this gain after the beamforming to
additionally suppress some parts.
So, for example, beamforming just sees the direction and just wants to boost that
direction, even though there could be nothing over there, right?
For example, speech is kind of a very sparse signal, so there could be
nothing in some specific frequency bin.
But the beamforming just tries to boost over there, even though there
is nothing like that.
That is not exactly what we want to do. And this kind of probability
will give us some kind of information that there's nothing there, and then we can just
suppress additionally up to there. This is a very cool idea. But still, this
is, you know, spatial filtering only.
And here is the background for independent component analysis. And the objective of
this algorithm is to maximize the higher-order mutual independence, not just second
order. In this case it can be interpreted as trying to maximize the super-Gaussianity.
So, for example, in this figure you can see that this is the clean speech probability and this is
the mixed one -- I mean, the speech mixture. And this is the speech mixture plus noise. And
this is just noise.
So as you see here, you know, clean speech really, really has this kind of
peaky PDF.
You know, so basically that means that by maximizing this kind of super-Gaussianity --
pushing the peakedness up -- it could achieve some kind of separation. That's the whole basic idea
behind this ICA.
And that can be formulated like this. Actually, this maximization of the mutual
independence would be like maximizing the entropy itself after this nonlinear
mapping of the separated signals.
And this can also be interpreted using the [inaudible] divergence. So basically we
want to make this probability match this probability. In this case,
"unif" means the probability of the uniform distribution.
So basically, after convergence, the probability of the nonlinear mapping
of the separated signals would be a signal with just an independent uniform distribution.
>>: Can I ask a quick question?
>> Lae-Hoon Kim: Yes.
>>: Many years ago working [inaudible] we were trying to just reduce reverberation and
we used the same approach and we were trying to minimize just the kurtosis metric.
>> Lae-Hoon Kim: Yes.
>>: Do you think that there's -- kurtosis seems to be a little less computational cost than
doing that.
>> Lae-Hoon Kim: Yes.
>>: But do you think you would get similar results if you used just a simple metric like
kurtosis?
>> Lae-Hoon Kim: Yeah. Yeah, why not? Because, you know, in the kurtosis case it's
the fourth-order [inaudible], right?
>>: Right, exactly.
>> Lae-Hoon Kim: So in this case, you know, kind of more generalized version of that, I
guess.
So we're not --
>>: But, again, you're trying to measure [inaudible].
>> Lae-Hoon Kim: Yes. How -- how -- yes, yeah, yes. Yes, yes. It could be used --
>>: Okay.
>> Lae-Hoon Kim: But, you know, here I just want to, you know, highlight that I'm now
using the maximum information I'm able to use. So we can use that kind of thing if it is
helpful enough. Okay.
And then this kind of objective function can be maximized or minimized using some
sophisticated algorithm like this kind of update [inaudible]. This was [inaudible] proposed
by [inaudible], et al.
And there is some kind of modification of the original [inaudible] algorithm by
multiplying by this part. This is called the natural gradient, which was proposed by [inaudible].
And the good thing about that is, you know, by multiplying by this, we, you know --
we don't need to worry about the specific point in the subspace of this
W.
By multiplying by this, it doesn't depend on the starting point in this subspace. It just
guarantees us the same convergence rate [inaudible] of the starting point of W.
So this is very good -- this has a very good convergence
property. And, you know, this is just for the current
frame. So later on we extended this current-frame algorithm into the multi-tap structure too.
And after convergence, because this probability will have this [inaudible]
distribution, that means -- according to the simple change of variables for a
[inaudible] random variable -- if we have [inaudible], then this S is WY, and
the probability of WY has just this kind of form.
So, for example, if we utilize a nonlinear function like the sigmoid or [inaudible] or
something like that, then the probability of S, which means just the separated speech, will
look like this.
So basically what we can say is that by choosing the proper nonlinear
mapping we could achieve this kind of, you know, probability distribution of the speech itself.
So, for example, if we want to achieve this super-Gaussianity, then we could choose
something like [inaudible] or something like that.
Let's see the pros and cons over here. The pro: it separates by maximizing higher-order
mutual independence. We do not need to worry about anything else at all. We just need to maximize
the independence itself, and that will guarantee us the separated speech. This is
cool, we know.
But the con: after having this independence-based separation, we do not know the membership of the
separated speech. How can we deal with that? This is a problem. Even though we could,
you know, achieve really good separation performance, if we do not
know the membership of the separated speech, then there is nothing we can do, right? So
this is the kind of problem.
Actually, originally this ICA is -- it was ideal for the time-domain instantaneous
mixtures. What I mean by instantaneous is no multi-tap. That was ideal for that kind of
situation. But we would like to deal with the convolutive situation, the multi-tap situation, like
this reverberant environment, right?
And a convolutive mixture can be reformulated as an instantaneous mixture in the
frequency domain. This is a very well-known fact. But this is only true for a sufficiently
long frame, right? As I already told you, if the reverberation is much longer
than the frame length, this doesn't hold at all.
And we also have to deal with the permutation and arbitrary scaling -- as I already told you,
this membership thing -- which should be fixed by additional post-processing. And this is very hard.
Known to be very hard.
So this is the reality: with shorter frame lengths, it just converges to DOA-based
nullforming itself. And this is quite disappointing.
So from this observation, Ivan, my mentor, asked: are we doing our best here? So this
little analogy gives us some idea.
So, beamforming: we know the direction, we know the membership, we know the road.
But we do not know the identity of the reverberated speech. Right?
ICA: we know the identity. We know what the speech is, how we could differentiate
them. But we do not know where to go. We do not know the membership itself.
And the thing is that the current approaches do not even know the identity of the
reflection part, because the current approaches just deal with the
current frame only. We do not know the identity either, you know.
So here's our approach. We would like to let each, you know, separated car go
the right way -- like A goes to A, B goes to B.
How can we do that? First of all, we would like to solve this problem. So we propose
source identification for the [inaudible] path, which can be dealt with by the feedforward ICA and
multi-tap ICA.
And we utilize a guide. We would like to guide this kind of thing to go in the right
direction. So we initialize the filters based on the spatial information. And we also
regularize so that they cannot go a long way off.
And after that we also introduce IDOA-based post-processing to narrow down this road.
No wandering. Okay? We can see this.
Here is the proposed algorithm [inaudible]. This algorithm is composed of two stages.
The first stage is nothing but just what we already have: beamforming followed by
IDOA-based post-processing.
And this technology gives us some kind of baseline -- I mean, a bottom line in terms of
separation performance.
And it also helps the second-stage algorithm go in the right direction.
And after this first stage, we apply the second stage. The second stage is composed of
two parts. The first part is the feedforward ICA and the second part is this, you know, IDOA-based
post-processing again. But those are different.
So in this case, this is more or less trying to suppress the reverberated sound in the
target itself.
And finally we hope we can get the speech separated correctly.
>>: Can you say a few words about the prior from video?
>> Lae-Hoon Kim: Okay, okay. So basically, as of now, I mean, for this current
situation, we cannot have the prior information about the source position, right? So we
actually use some kind of, you know, technology called MUSIC or something like that to
find the source direction.
But in our case, like in the Xbox case, because we have some video signal, right, and it is
known to be very accurate for tracking the position of the mouth and the speech source, then
that kind of prior information about the direction can be used -- you know,
combined here easily enough.
>> Ivan Tashev: If you use just [inaudible], there is a [inaudible] seconds delay after
somebody starts to speak before we pinpoint the speaker, let's say plus, minus a couple of
degrees. And because in the [inaudible] we have those [inaudible] 3D cameras, you can
get the direction right away.
So this uses the transition of the beginning. Otherwise [inaudible] every prior
information we have. Not that [inaudible] can deal with it separately, but just helps the
initial -- when this person starts to speak.
>> Lae-Hoon Kim: Yep. So basically if you can have, you know, better estimation
about the source direction, that would be really helpful.
And up to now, just by using some kind of well-known, you know,
source [inaudible] algorithm, we could achieve, you know, kind of
good results, as you can see here.
And the first stage is nothing but beamforming followed by IDOA-based
post-processing. And here, you know, a time-invariant or adaptive beamformer can be
utilized, and then, you know, the same structure.
And, as I would like to mention again, it produces a safe guideline for the second stage.
And in the second stage we do this kind of update. And this update is composed of two parts:
this part for the, you know, multi-tap ICA, and this part for the regularization.
So basically what it does is try to update each multi-tap -- i goes from zero to M minus 1,
where M is the length of the filter. Each time we want to update this part in a way that
maximizes independence, but we would also like to penalize big deviations from the first-stage
result.
So each part can be expressed like this. And this does nothing but maximize
the same idea -- same idea, but in this case we extend the original situation into the
multi-tap case.
And here is the regularization. This is just like [inaudible] or something like that. And
what I want to mention over here is that there is some kind of case where we are just
interested in a few sources rather than having all the channels.
Like, for example, if there are four different microphones and you just [inaudible] --
this is very simple. We can deal with it using this
[inaudible]. We just need to regularize on the specific two channels so that we can have
just two, only two, separated speech signals.
And this kind of thing can also be dealt with by the initialization too. Because we
know the directions of arrival of the speech -- I mean, interference and target --
they can be utilized for initializing the filter taps.
And, first of all, we just initialize the taps as just an identity matrix multiplied by some
kind of [inaudible] function, which is very similar to the lambda k.
And then the first tap -- because we know that the first tap should be like nullforming or
beamforming, right? And they can be constructed by this kind of, you know,
[inaudible].
And this should guarantee us the nullforming and beamforming for each [inaudible]
in each [inaudible]. Like, if you want to assign the first channel to have the first source and
the second channel to have the second source, then you can utilize this kind of thing on the
first and second channels.
And the [inaudible] channels are just like, you know, identity, even for the [inaudible].
And after that, after convergence, we also do this kind of scaling. This is very
similar to the IDOA-based post-processing, but it's just based on the
converged first tap. So it's a little bit different, but it is also helpful to have
this kind of functionality together with the IDOA-based post-processing.
So let's -- this is a time for me to show the experimental results.
Actually, we chose two different evaluation criteria. The first is [inaudible] and the second
is [inaudible]. And [inaudible] is the SIR, the signal-to-interference ratio, which is
measured in C-weighted dB.
And a perceptual measure. For the perceptual measure we have used PESQ scores, which
are a computational proxy for actual mean opinion scores.
And to calculate the SIR, we actually use some technology for the measurement of the
[inaudible]. Basically, because we know the original clean speech for
the interference and the target in this case, we can utilize that to, you know, calculate this
SIR. Why not?
And that can be done using the same technology as for, you know, calculating [inaudible] --
just playing the [inaudible], and just by having an inverse filter you
could have some, you know, nice-looking [inaudible]; we use the same
technology.
And that can be, you know, weighted with the C-weighting. And C-weighting is
good because it just guarantees us some kind of flat response up
to about 8 kilohertz at the 16 kilohertz sampling [inaudible].
And this is the test result for the measurement tool. This is the original SIR -- you can
manipulate that -- in a reverberant and noisy condition. And this is the measured one. So I would like to
highlight that up to 10 -- 12 -- 20 dB we could have quite a [inaudible] measurement.
And beyond that, it is underestimated. So basically, if we can get a number better than
20, then that's good.
And we have collected the data -- actually measured data. We measured the room
impulse response in the audio lab there, and we measured 18 different
positions.
And this is just kind of -- imagine that there is some kind of couch; the couch can be placed
like this. Okay. And this is just kind of shoulder-by-shoulder distance, actually. And
you see this is 2 feet, 0.6 meter, which is very, very near.
And we can see over here the angle, you know, goes from 6 degrees, which is very
narrow over here, okay, to 79 degrees like this.
And 1.2 meters to, you know, 4.3 meters. And, again, the reverberation time was 375
milliseconds, which is very large. And that's 15 dB SNR. A very realistic situation.
And this is a working distance. We would like to have some kind of really good
performance over here, like 2 to 3 meters, something like that.
Okay. So, as a clean speech corpus we have used the TIMIT database.
We simply convolved this clean speech with the measured impulse responses, and we
added noise -- natural noise recorded in the same space with the same device. It is,
like, three minutes long, and we selected a random segment of that
three-minute-long, you know, noise and added it.
And that's how we generated the corpus.
>>: So you simulate the reverberation by putting the TIMIT clean speech into the room
while you record the --
>> Lae-Hoon Kim: No, no, we actually convolve.
>>: It's basically done.
>> Lae-Hoon Kim: Yeah, [inaudible] but it's [inaudible].
>> Ivan Tashev: So in each point you place a [inaudible] and play a shared signal, record
this one [inaudible] then compute the impulse responses.
>>: Oh. So it's not actually [inaudible]. It's just relation of --
[multiple people speaking at once]
>>: But it's an actual room.
>>: It's an actual room with natural voice. And you have the database with the noises.
So practically what happens is you take the clean TIMIT voice [inaudible] impulse
response from the point to each of the microphones [inaudible] segment of the noise
which is added to it to achieve test.
And now this is your corpus. It's very realistic. The only thing which is missing
is that when we talk we slightly move our heads, so that there is a [inaudible]
reverberation in the impulse response. That's it.
>>: The -- two questions based on that. One is the device noise, is that the electrical
noise picked up by the device or is that the -- some sitting in an office or a -- with HVAC
and reverberating noise around the room that's actually being picked up by the
microphones?
>> Lae-Hoon Kim: Actually being picked up by microphones. Because we have
[inaudible].
>>: It's not in the soundproof room?
>> Lae-Hoon Kim: Huh?
>> Ivan Tashev: No. This is the -- this is the room in front of [inaudible] chamber.
>> Lae-Hoon Kim: Not [inaudible].
[multiple people speaking at once]
>>: I don't care where the noise is coming from. So then the other question -- the other
question I have is you probably have a strong reason to believe that this measured
impulse response, which is deterministic in the tail is a good approximation for the real
impulse response which is going to be stochastic in the tail.
>> Ivan Tashev: So in general, I would just say that that stochastic thing is mostly because
humans move. We'll see some results at the end [inaudible] which incorporate that
[inaudible].
But for now this is a good -- very close to the realistic test corpus which incorporates the
long reverberation tail and actually noise recorded with the device.
>>: Well, just like Aurora 2 or Aurora 3 --
>> Ivan Tashev: Yes.
>>: This is about as good as you can get with prerecorded clean signals and mixing.
>> Ivan Tashev: Yes. So we have enough experience with this. This is how we
synthesize the corpus for ABU. So it's very close to the reality, including compensation
for [inaudible] effect, et cetera, et cetera, that's very close.
The only thing that is missing is the reverberation tail, which is due because we kind of
move our heads when we speak.
>>: Although, one point, if you don't mind my mentioning, this could actually make
some difference. I remember when I was working with Brad, we did the same thing: use
the artificial mouth, measure the impulse response, then use the measured ones.
But then when we went in and put real people, those levels, the stochastic difference, it
actually does make a difference.
>> Ivan Tashev: We are aware of that. Because --
[multiple people speaking at once]
>>: [inaudible] the rest of the algorithm, right?
>> Ivan Tashev: The major improvement we get because we deal in the reverberation.
Freezing the tail is kind of cheating. But you'll see the results.
>>: So is it very difficult to create Aurora 3-like data for the natural case?
>> Ivan Tashev: In Aurora 3 it's just clean speech and noise. There is no impulse
response from the microphone to the mouth. So [inaudible] --
>>: Aurora 3 is actual data.
>> Ivan Tashev: Is actual data.
[multiple people speaking at once]
>>: [inaudible] in the car and the microphone is on the dashboard, they're actually --
>> Ivan Tashev: So it acquires the impulses mostly in the car.
[multiple people speaking at once]
>>: [inaudible] there's no impulse responses on those.
>>: Real [inaudible], real noise, real impulses. So, I mean, the equivalent -- it's much
more time-consuming, you're just going to get users into the lab and sit in different places
and play a game. I mean, that's the --
>> Ivan Tashev: So our reason to use a synthetic corpus is mostly because PESQ
requires a clean speech [inaudible].
>>: I see. Okay.
>> Ivan Tashev: And you saw that our tool, which we wanted to make it noise
independent, actually tries to estimate the impulse response of the source without -- in the
mixture to extract it, source [inaudible] in the mixture so we can check out at the end we
have just noise, and then to measure the extraction [inaudible] independent [inaudible]
the separation.
And it is very convenient to have the clean speech sources. We can do it with humans,
but we have to equip them with close-talk microphones, and that's kind of more
complex.
>>: So we actually are gathering data, so that actually will be able to [inaudible] it's a
good start [inaudible].
[multiple people speaking at once]
>> Lae-Hoon Kim: I do have one thing. I have information about that [inaudible] then
we should have some kind of grade functionality to track the position of the speech.
>>: [inaudible]
>>: Be a real fun game [inaudible].
>>: So why are the individual microphone locations like that, not symmetrical [inaudible]?
>> Lae-Hoon Kim: Individual -- this one? Well, you know, if we have just the same
distance, then we -- there is some kind of, you know, possibility to have some [inaudible]
frequency, high-frequency [inaudible]
>> Ivan Tashev: [inaudible] those four microphones, it's optimized to cover evenly the
frequent -- the frequency range. Each microphone pair is good at one frequency, and
then its performance goes down. So the distances are set in a way that those six
microphone pairs [inaudible] to cover [inaudible] frequency bands.
>> Lae-Hoon Kim: Okay.
>> Ivan Tashev: But this is not something which was in the [inaudible] I want to change.
He was just given a microphone and this is the device.
>> Lae-Hoon Kim: Actually, that microphone was cool. The performance was really
good. Was much better than I expected.
>>: The prior from the video is simulated as well I guess.
>> Lae-Hoon Kim: As I told you, there is no information like that up to now. Up to
now, we actually used just a kind of blind way of estimating the source direction. Okay.
But in this case, we are now using some kind of synthetic impulses. But that means we
can do that quite easily.
Okay. Let me go to the next slide. These are the configurations. We have tested 18
different configurations. And as you see over here, this is the wide-seating situation and this is
the narrow-seating situation, varying the distance between the microphones and the speakers.
And this is the results. Let me explain in a detailed manner.
This is distance, so basically this is close. This is far. This is angle. This is narrow.
This is very wide.
And our proposed algorithm gives us from 10 dB to 20 dB. And from my
personal listening experience, I think that if we can achieve more than 20 dB,
that's really good. This is really like [inaudible] -- we are going to hear that soon.
And what I want to mention here is that even at 2.7 meters and 26 degrees, we could
get that 20 dB separation. This is very interesting.
And this is the PESQ score. This is the angle, this is the distance, from near to far,
narrower to wider. And this is the PESQ score.
As we'd expect, with better SIR we could have better MOS, which is, you know,
expected. So basically, this tells us that by having better SIR we do not hurt
the perceptual quality. It hurts a little bit, but it is still better than having just the mixed
[inaudible].
>>: So is the PESQ improvement over no processing?
>> Lae-Hoon Kim: Over no processing, yeah. No processing. So that's meaningful,
right? Over no processing.
And, again, we would like to compare our approach with the, you know,
current state of the art. So we actually picked the situation with the most similar
conditions to the published state of the art.
So as you see over here, you know, this plus and [inaudible] and star. And this is located
here, in the upper left corner, which is the best condition, right? And we actually tested
in a somewhat harder situation rather than this best condition.
>>: So, sorry, we're looking at the dB reduction?
>> Lae-Hoon Kim: Yeah, dB reduction. Average.
>>: [inaudible] like one is this 3-D thing and one is --
>> Lae-Hoon Kim: Actually, I have 3-D plot for this, and 2-D plot for this. But I just
only, you know, combined those two things like this way. If you want -- if you are
interested, I can show this -- everything afterwards.
So I want to compare our proposed algorithm with the state-of-the-art. This is
[inaudible]. So 1, 2, 3, 4, 5, 6 is the method, and this is SIR suppression. And this blue
box means the -- the state-of-the-art we have already discussed, and this red box is our
implemented version.
So basically first [inaudible] is the Saruwatari one, and second one is Taylor and third
one is Sawada. And you can see the spec.
>>: Doing quick geometry in my head, at 1.22 meters and an angle of 54 degrees, how far
apart [inaudible]?
>> Lae-Hoon Kim: Huh?
>>: At 1.22 meters and 54 degree angle, how far apart are the two sources?
>> Lae-Hoon Kim: Like this.
>>: About two feet?
>> Lae-Hoon Kim: Yeah. Kind of a little bit of distance between the --
>> Ivan Tashev: A person between them.
>> Lae-Hoon Kim: Not just sitting like this, but kind of --
>> Ivan Tashev: [inaudible] two other persons.
>> Lae-Hoon Kim: Something like that.
>>: About a meter, right?
>> Ivan Tashev: Huh?
>>: About a meter.
>> Ivan Tashev: Yes.
>>: So we're seeing almost a 30 dB [inaudible]?
>> Lae-Hoon Kim: Yeah. This is very [inaudible] because, you know what? This is the best
condition in our situation.
>>: It's incredible.
>> Lae-Hoon Kim: This is incredible. You want to hear that.
>>: So the question here is, I'm not quite sure whether different algorithms require
different optimal configurations of the microphones. If that aspect isn't optimized, then you
cannot make a fair comparison.
>>: These numbers are taken, it looks like, from papers, not from --
[multiple people speaking at once]
>>: Because, like, our algorithm was compared against one of those other two, and
basically the conclusion was that our algorithm gave the same result as the state of the art
while solving the permutation -- without having to do the permutation. It wasn't a better
algorithm. But in this case, the fact that it's like 4 dB worse than his is a little bit
[inaudible].
[multiple people speaking at once]
>> Ivan Tashev: The only check we did -- so, look, No. 5 is our implementation point by
point of the algorithm described in [inaudible]. So you see that there are slightly different
conditions here. We still are in a more difficult situation. So [inaudible] reports on 70
degrees and 1.15, but because we have a discrete [inaudible] second reverberation
[inaudible].
In our case we're kind of close, but the [inaudible] -- our reverberation time is kind of
[inaudible] -- but, look, we get a comparable result. This was our check that we are not
completely off. At least we captured the idea of the single-tap ICA.
So this leads us to believe that at least for those two we have something
close. And then those three are already our approaches. The only way to do a perfect
comparison is to [inaudible] implement all of those separately and do the [inaudible], and
we just had no time.
So this was to find the closest point to all those three and to see how much better or
worse we are doing.
>>: [inaudible] use the same kind of database that you have?
>> Ivan Tashev: That's correct. That's correct.
>> Lae-Hoon Kim: But the way of getting the data is same [inaudible].
>> Ivan Tashev: They report the reverberation time. In most of the cases here, they deal
with a reverberation time from an impulse response generated with the image method, which is
less -- it's more sparse than any other [inaudible] reverberation [inaudible] impulse
response. So in all cases those three papers work in way better conditions than us.
>>: I think all three of those also have two microphones instead of four.
>> Ivan Tashev: That's correct. That's correct.
>> Lae-Hoon Kim: So I should say that all the conditions --
>>: But the bottom one is the meaningful one. I mean, 5 is the meaningful comparison.
>> Lae-Hoon Kim: Yeah. Actually, I should say this --
>>: [inaudible] the implementation on the same [inaudible].
>> Lae-Hoon Kim: In all the conditions, actually, our condition is worse than --
>>: So the bottom four you actually recorded with -- were done with an implementation
[inaudible] microphone.
>> Lae-Hoon Kim: Yes, yes, yes. Yes.
>>: Can I ask a quick question?
>> Lae-Hoon Kim: Yes.
>>: Because many of the others use two microphones. You're probably going to cover
that later, but kind of a question that comes to mind: even though you have four available
microphones, why don't [inaudible] you turn off one and say, if I only use three or if I only
use two, so that you have kind of a return on investment for the cost of adding each
microphone. So if you have, for example, only three microphones.
>> Lae-Hoon Kim: I didn't have enough time to do that. But I have some kind of
guess about that, actually. The thing is that, you know what, to implement this
kind of beamforming and nullforming we added one extra constraint, right? That means
a decrease in the degrees [inaudible].
>>: So if you were to guess, for example, if you were to reduce from 4 to 3 --
>> Lae-Hoon Kim: That can work.
>>: -- the DBs you would lose in terms of separation?
>> Lae-Hoon Kim: I think there is no -- seriously, no degradation from that. If I just
use two rather than three, then there could be some kind of degradation, because we
cannot fulfill this nullforming -- we don't have the degrees of freedom [inaudible].
>>: [inaudible] in terms of how many CPU side processes you're going to have to run?
>> Lae-Hoon Kim: Yeah. I would like to investigate those kinds of situations, I really
would like to, but, you know --
>>: I hear you.
>> Lae-Hoon Kim: Yeah. First of all, I would like to make sure this kind of algorithm
gives us some kind of interesting --
>>: [inaudible] just couldn't resist.
>> Lae-Hoon Kim: Yeah.
>>: So what kind of dB difference you might get for different microphone
configuration?
>> Ivan Tashev: For different microphone arrays. So you see that --
>>: [inaudible]
>> Ivan Tashev: Number 6 is just the first stage, which is the existing sound-capturing
system to date. This is what we proposed to the Xbox team to take. And we already have 19.
>>: [inaudible]
>> Ivan Tashev: And then on top of this, the ICA gives us an additional -- technically it
should be an additional around [inaudible] dB, but you know the good things [inaudible].
>> Lae-Hoon Kim: Basically this algorithm always outperforms the [inaudible]
algorithm by like 10 to 12, 20 dBC. And it's always better than this first stage. But, you
know, we'll see.
>>: Is the ICA stage online or batch?
>> Lae-Hoon Kim: Batch. But it can be converted to online easily. Yeah.
But the thing is that, you know -- yeah, I really would like to test the online version of
this. But, yeah, that would have --
>>: [inaudible] 20 dB online or 30 dB?
>> Lae-Hoon Kim: Yeah. That --
>> Ivan Tashev: [inaudible]
>>: [inaudible] doing that stuff online is much harder.
>> Lae-Hoon Kim: Yeah. If we would like to do this kind of thing online, then we
have to have, you know, tons of different, you know, optimization
processes first to fully optimize these design parameters, actually.
But, you know, I only had ten weeks over here, so I didn't have enough time to
run all these kinds of optimization processes. So I just tested a couple of different kinds of
settings, and these are the results up there.
Okay. Let me give the SIR summary. Let me summarize this.
The proposed approach outperforms the others by 10 to 20 dB C-weighted, and it achieves a good
improvement in PESQ scores. And I highlighted all of them already.
Okay. This is the kind of experiment for a realistic implementation. Actually, the
[inaudible] previous slide was with a very conservative setting, like the
number of taps was 20 and the iteration number was 1,000. Huge. But I have
tested whether these kinds of [inaudible] can be decreased.
And as you see over here, the iteration number can be decreased, you
know, to 200. And the tap number can also be decreased to 5.
>>: [inaudible] tap number means how many previous [inaudible]?
>> Lae-Hoon Kim: Yes. Yes.
>> Ivan Tashev: [inaudible] the reverberation.
>> Lae-Hoon Kim: This is detailed iteration, but --
>>: So how long was the frame?
>> Ivan Tashev: [inaudible]
>> Lae-Hoon Kim: Yeah, 16 milliseconds. There are more details, but I will not cover this. Yeah.
This is the thing that I would like to play over here. So let me --
This is the synthetic mixture first. Let me try to play this.
[audio playing]
>>: So why this constant background noise?
>> Lae-Hoon Kim: This is computer.
[audio playing]
>> Lae-Hoon Kim: There is some speech over here. There should be. But we cannot
hear that.
>>: Where is the mechanical aspect of the voice coming from? It sounds -- sounded like there's a little bit of nonlinear noise in there.
>>: This sounds more like missing spectral components. It's just the same as you get
from noise suppression when you're too aggressive and you're missing --
[multiple people speaking at once]
>>: There was an alpha parameter that traded off how much you trusted the prior from the
IDOA or how much you trusted the ICA.
>> Ivan Tashev: We didn't have time to tune this. We can squeeze a little bit.
>>: But if you relax that, maybe ICA would make things -- try to make it more linear,
less --
>> Lae-Hoon Kim: Yeah.
>>: In terms of recording and playback, you can attenuate a lot of that by just a little bit
of noise fill. I guess you do a tiny bit of noise fill, and then it --
>> Lae-Hoon Kim: Yeah [inaudible].
>>: [inaudible] speech recognition.
[multiple people speaking at once]
>>: But you can do it artificially too.
>>: Oh, I see. You can do the engine and you have that, okay.
>> Lae-Hoon Kim: Yeah. And, well, as you already pointed out, this is a kind of synthetic
corpus, because we need to, like, make sure that this algorithm is working quite well.
And obviously we also have some curiosity about how it works in a realistic situation.
So I invited one of my mentors at MSR to come to this audio lab, which is the original
place where we measured the [inaudible], and I asked her to sit -- just to stand -- at
one of those predefined configuration areas, and we just
talked spontaneously. And, yeah, you know, you'll see the result.
>>: Would male versus male be harder than male versus female?
>> Lae-Hoon Kim: No such thing, actually. I actually -- I tested --
>>: Generally it is male versus [inaudible].
>> Lae-Hoon Kim: But the thing is that, you know what, I tested all kinds of different
configurations, and the [inaudible] there depended more on the distance
and angle or the way of speaking or something like that.
>>: What was the content of the speech being said? Because if they're both saying the
same thing at the same time, that's way different than if they're saying something
completely different.
>> Lae-Hoon Kim: You know what?
>>: If one person's saying consonants and one person's saying the vowels, it's really easy to
separate them [inaudible].
>> Lae-Hoon Kim: That's a very interesting point. Actually, for the synthetic part, we
have used this TIMIT corpus. And all of the TIMIT corpus starts with the same
two sentences.
>>: [inaudible] but in the rest of it, I mean, you're averaging over the entire length of the
thing and got one of the speakers that speaks 20 sentences, your average is going to be
they're not saying the same thing, they're not even talking at the same time, if you don't
control for the open areas. So you're going to end up with a situation where they're never
going to collide anyway, so it's like --
>> Ivan Tashev: So hold on. When we said batch processing, this doesn't mean that you
take [inaudible]. Every frame is processed using the current frame and the number of
taps before. That's it. Yes, inside [inaudible] but not in taps. So it cannot go ten frames
or 160 milliseconds, so you are in the boundaries [inaudible]. That's it. Not to go
[inaudible]. You don't have to -- we don't do 40 seconds of an utterance. That could
get us way better results, actually.
>>: It seemed like his objection was that each TIMIT utterance is 30 percent silence.
>> Ivan Tashev: Human speech is 30 percent --
>>: Yeah. And that -- that -- when you have a 30 percent chance of overlapping with
absolutely nothing and you get really good results as opposed to trying to overlap speech
and speech and just measuring there.
>>: I think the game show scenario is where both people get the exact correct answer at
the same time. Right? That would be an interesting [inaudible].
>> Ivan Tashev: [inaudible] of the same nasty phrase which most of us have had to sit and listen
to over and over and over.
>> Lae-Hoon Kim: Actually, there is a kind of scenario where one person dominates the
speech; then we can just use that kind of speech, you know, nothing else. And the
situation can happen where there is some simultaneous speech; we
can, you know, detect that, too, and we can utilize a different kind of mode to
deal with a different kind of situation, I guess.
>>: So I guess what I'm hearing is that in running speech where two people are just
actually talking -- the -- they're not going to overlap enough that your history is going to
be able to distinguish the two. And you didn't test the case where they're saying the exact
same thing at the exact same point, same rate, same accent and all that stuff and you don't
know what will happen in the next situation.
>> Lae-Hoon Kim: Okay, okay.
>> Ivan Tashev: [inaudible] distinguished it [inaudible] separate two things which are
the same.
>>: Well, he [inaudible] pretty good job of separating two people saying the exact same
thing, even if they're both male and both French and both speaking in Russian. You
know, your voice can still [inaudible].
>> Ivan Tashev: [inaudible] the difference still.
>> Lae-Hoon Kim: Yeah. I wish I could have the opportunity to test that kind of hard
situation.
>>: I mean, there is this game show concept of two people buzzing in and having the same
answer at the same time.
My guess is -- and that's a viable scenario we're looking at. My guess, though, is even
there there's quite a bit of statistical [inaudible]; people don't speak at the same rate, even
in that capacity, right? There's still -- I think it's really -- it's fairly artificial and probably
very short time bound to have two people literally speaking the same thing at the same
time at the same rate.
Like even in this real-world scenario, if I say who was the first president of America, and I
ask everybody to answer at the same time, people speak at different rates. I would think
that there would be some statistical [inaudible].
>>: Even if you do the extreme of a chorus where people are actually trying to sing in
synchrony, it is still going to end up with [inaudible].
[multiple people speaking at once]
>> Lae-Hoon Kim: On top of that, speech is very sparse, you know. So basically there
is no speech which [inaudible] exactly. So that means it can be [inaudible].
>> Ivan Tashev: [inaudible] mean the same thing the speech will be slightly different
and the harmonics will be slightly off [inaudible].
>>: And so your [inaudible] is 256?
>> Lae-Hoon Kim: Yes. Yeah.
>>: Okay.
>> Lae-Hoon Kim: And here you're going to hear my voice, my own voice, and one of
the interns' voices -- speaking Chinese and Korean. And because we don't have enough
time -- so let me go to the harder situation, you know, the No. 9 configuration, which is --
>> Ivan Tashev: [inaudible] so go and show the easier part.
>> Lae-Hoon Kim: Easier work first?
>> Ivan Tashev: Yeah.
>> Lae-Hoon Kim: Okay. This is a mix.
[audio playing]
>> Lae-Hoon Kim: Okay.
>>: I don't think I can understand that [inaudible].
>> Lae-Hoon Kim: In fact, I don't know what she's talking about right now. I just asked her to talk about whatever she wanted. So, anyway, I'm talking myself. I mean -- what I'm saying over here is, my name is Lae-Hoon Kim, I'm working in MSR with this kind of -- you know, I'm doing some kind of project like this, something like that.
Let's hear our results.
[audio playing]
>> Lae-Hoon Kim: Okay. Let's go to a somewhat -- no, a harder situation, like, you know, [inaudible] the 2.444, which was reported to have 20 dB separation.
[audio playing]
>> Lae-Hoon Kim: Okay. We do the different speech. I mean, we are speaking different kinds of content. So this is the result.
[audio playing]
>> Lae-Hoon Kim: Okay. So this environment is much better than [inaudible] because this environment attenuates some kind of --
>>: [inaudible]
>> Lae-Hoon Kim: Yeah. But you can see here, and you can hear, the [inaudible] of this algorithm. This is what I got, actually. And as I already said, above 20 dB I think it's going to be really good in terms of having good separation.
So, okay. Let me then conclude my talk, and future work will be [inaudible].
This is a summary of my talk. A subband-domain, short-time frequency-domain, regularized feed-forward multi-tap ICA with IDOA-based post-processing has been proposed and evaluated.
And this approach produces breakthrough -- I should say -- breakthrough results in terms of improvement in SIR, and the PESQ scores were also really good.
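As a rough illustration of what a per-bin, regularized, feed-forward multi-tap ICA update might look like, here is a minimal numerical sketch. The natural-gradient rule, the regularization toward the initial demixer, the tap count, and all names are assumptions for illustration, not the implementation from the talk; permutation alignment across bins and the IDOA-based post-filter are omitted.

    import numpy as np

    def multitap_ica_bin(X, n_src=2, n_taps=4, n_iter=200, lr=0.1, reg=1e-3, seed=0):
        """Toy feed-forward multi-tap ICA for one STFT frequency bin.

        X : complex array (n_mics, n_frames) -- one bin of the array STFT.
        Returns (n_src, n_frames) separated bin signals.
        """
        rng = np.random.default_rng(seed)
        n_mics, n_frames = X.shape

        # "Multi-tap": stack the current frame and (n_taps - 1) past frames so the
        # demixer can model reverberation that spills across STFT frames.
        Z = np.zeros((n_mics * n_taps, n_frames), dtype=complex)
        for tap in range(n_taps):
            Z[tap * n_mics:(tap + 1) * n_mics, tap:] = X[:, :n_frames - tap]

        # Feed-forward demixer, initialized near a pass-through of the first
        # n_src microphones (a stand-in for a beamformer-style initialization).
        W0 = np.zeros((n_src, n_mics * n_taps), dtype=complex)
        W0[:, :n_src] = np.eye(n_src)
        W = W0 + 0.01 * (rng.standard_normal(W0.shape) + 1j * rng.standard_normal(W0.shape))

        for _ in range(n_iter):
            Y = W @ Z                                   # current source estimates
            phi = Y / (np.abs(Y) + 1e-12)               # score function for super-Gaussian (speech-like) sources
            grad = (np.eye(n_src) - (phi @ Y.conj().T) / n_frames) @ W   # natural-gradient direction
            W = W + lr * grad - reg * (W - W0)          # small pull back toward the initial demixer
        return W @ Z

A full system would run something like this independently in each subband of the microphone-array STFT, resolve the permutation ambiguity across bins, and only then apply the spatial post-filter.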
But as I pointed out -- as [inaudible] pointed out -- we need to do some more tuning to get better perceptual quality, I guess. And that would be one of our future works.
And another thing is, as we already discussed, we need to have some online
realtime-based algorithm to work here.
And this should be tested, I guess. In terms of the technology, we already [inaudible], because the algorithm can be done either batch or online. This can be done. But the thing is, as I already mentioned, there are some design parameters we have to optimize first.
>>: So by online you really mean just realtime frame-by-frame processing?
>> Lae-Hoon Kim: Yeah. There could be two different ways. Realtime means processing just one frame at a time, as they come in. Or a kind of block-based thing, like, you know, 40 milliseconds at a time, something like that.
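The two online schedules being contrasted here could look roughly like this; update_separator stands in for whatever adaptive separation step is used, and the 4-frame (~40 ms) block size is just an assumption.

    def online_frame_by_frame(stft_frames, update_separator):
        """Update the separator after every incoming STFT frame."""
        outputs = []
        for frame in stft_frames:              # frame: one multichannel STFT frame
            outputs.append(update_separator([frame]))
        return outputs

    def online_block_based(stft_frames, update_separator, block_len=4):
        """Buffer a small block (e.g. 4 frames ~ 40 ms) before each update."""
        outputs, buf = [], []
        for frame in stft_frames:
            buf.append(frame)
            if len(buf) == block_len:
                outputs.append(update_separator(buf))
                buf = []
        if buf:                                # flush any trailing partial block
            outputs.append(update_separator(buf))
        return outputs

Either schedule can reuse the same update rule; the difference is only how much data is buffered before each adaptation step, trading latency against the stability of each estimate.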
>>: Is it that the algorithms themselves are fundamentally different?
>> Lae-Hoon Kim: I don't know. I would have to have an opportunity to test that kind of thing. But if I could have like two or three weeks more, then I would do that. [inaudible] I really would like to try that. Because, you know, this was kind of a compromise for me -- I had to stop at some point to organize all the --
And also I would like to mention that it may be worthwhile to combine the presented algorithm with speech-motivated stuff. Here I didn't use any kind of [inaudible] like computational analysis or something like that, or speech [inaudible] like a harmonic-based one or something like that.
If we can utilize [inaudible], then I'm sure that there could be some kind of benefit, like,
you know, improving the perceptual quality or something like that I guess.
>>: But then would your demo with two different languages have a harder time?
[laughter]
>>: If you change to language-specific speech models, then if you don't speak [inaudible].
>> Lae-Hoon Kim: That's going to be another, different kind of story, I guess, yeah. But there is a kind of [inaudible], like, you know, based on computational analysis, something like that, which just depends on some kind of major [inaudible] of the voice, not the language. So kind of harmonicity or something like that. They can be -- they don't have to depend on the language.
>>: So on one of your slides you showed the separation, because basically you can -- at the output of your diagram, there is a speaker one and a speaker two.
>> Lae-Hoon Kim: Yeah. You want to --
>>: Do you really reconstruct speaker one and speaker two after your algorithm? Can you really do that?
>> Lae-Hoon Kim: [inaudible] what in my speaker?
>>: The speaker one sum, and you showed one [inaudible].
>> Lae-Hoon Kim: Yes, yes.
>>: You really reconstructed the other speakers.
>> Lae-Hoon Kim: Yes. Kind of, you know. Because you can suppress the others. Like in this case I just played a single one -- one of the two, right? If you want, I can let you hear the other one. And the other one has, you know, that speech only, suppressing the first one.
>> Ivan Tashev: [inaudible]
>> Lae-Hoon Kim: Just symmetric. So I just --
>>: [inaudible]
>> Lae-Hoon Kim: Yeah. So that --
>>: So it's not that you simultaneously reconstruct the two, but you just decide to reconstruct one and the other gets suppressed?
>> Ivan Tashev: No, it's simultaneous [inaudible]
>> Lae-Hoon Kim: Simultaneous separation of two different mixtures.
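To make the "simultaneous" point concrete, here is a toy instantaneous (non-reverberant) example: a single 2x2 demixing matrix recovers both speakers in one step, so neither output is a by-product of suppressing the other. The mixing matrix is made up, and the ideal demixer is used in place of a blind ICA estimate.

    import numpy as np

    rng = np.random.default_rng(0)
    s = rng.laplace(size=(2, 1000))            # two sparse, speech-like sources
    A = np.array([[1.0, 0.6],                  # illustrative 2x2 mixing matrix
                  [0.5, 1.0]])
    x = A @ s                                  # two microphone mixtures

    W = np.linalg.inv(A)                       # ideal demixer (ICA would estimate this blindly)
    y = W @ x                                  # y[0] and y[1] come out simultaneously
    print(np.allclose(y, s))                   # True: both speakers are reconstructed at once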
>>: But if you do the simultaneous separation -- in our scenario we probably just need one, right? We just need one?
>> Lae-Hoon Kim: No, no, no, no. In some specific applications we need to [inaudible]. Like, you know, for example in the game scenario, if there are two people sitting together playing a game, and both of them talk something, then those kinds of things have to be equalized at the same time. And [inaudible].
>>: So for this kind of algorithm it looks like there's not much difference in terms of use in the gaming environment versus in the automobile, except that the reverberation we have is a different one.
>> Lae-Hoon Kim: Yes. But the main difference of this kind of algorithm is, I think, up to now, to the best of my knowledge, there is no approach that actually incorporates this multi-tap approach in this kind of ICA thing.
And we actually know what beamforming is good for and what ICA is good for, so why not? This is kind of, you know, a way of combining them.
[multiple people speaking at once]
>>: You have much wider range of the reverberation.
>> Lae-Hoon Kim: Sure, sure, sure, sure.
>>: So that may require more robustness.
>> Lae-Hoon Kim: Yeah. Yeah. I don't know much about the [inaudible] environment, but in that case there is [inaudible] some kind of difference. Like, you know, in this case, because we are showing the game scenario, we could have this kind of big aperture for the microphone array.
But for the [inaudible] scenario, there is a limitation on the arrangement of the microphones, right? So we cannot place this kind of big microphone array onto [inaudible], so that can be different.
>>: [inaudible]
>>: [inaudible] because of the rest of the audio stack, right? But other people who are just, you know, researchers on ICA can just use a bigger window, right, so they could take it up to a frame size of 2,000 --
>> Lae-Hoon Kim: Yeah, yeah, that's true.
>>: [inaudible].
>> Lae-Hoon Kim: Yeah. Yeah. There is a paper called Frequency Domain ICA [inaudible] limitation [inaudible], written by [inaudible], something like that. And she analyzed the dependency of this kind of performance on the frame length. And she reported that there is a kind of optimal length of the frame.
So just having a longer frame will not guarantee us good results. It will hurt, I think, because having this longer frame means having a longer average in the frequency domain. This is not good.
So basically we just want to stick to this short frame, because we would like to track the change in this kind of situation too.
So this is the way we can deal with a changing environment, changing positions, something like that, together with this.
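The trade-off being described can be made concrete with a quick calculation (assuming a 16 kHz sample rate, which is not stated in the talk): longer frames narrow the frequency bins, which helps the per-bin convolutive model, but each demixing estimate then averages over a longer stretch of a possibly changing scene.

    fs = 16000                         # assumed sample rate
    for n_fft in (256, 1024, 4096):    # candidate frame lengths
        frame_ms = 1000.0 * n_fft / fs
        bin_hz = fs / n_fft
        print(f"N = {n_fft:4d}: frame {frame_ms:6.1f} ms, bin width {bin_hz:6.2f} Hz")
    # Longer frames give finer frequency resolution but react more slowly to
    # moving talkers and changing rooms -- the reason for sticking to short frames.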
>> Ivan Tashev: Plus this approach is one of the reasons [inaudible], so basically you already have it.
>>: That was the first part of my question.
>> Lae-Hoon Kim: Yeah.
[multiple people speaking at once]
>>: [inaudible] having a side conversation, your source separation [inaudible].
>> Lae-Hoon Kim: Actually, that was one of the challenges I have addressed in the
beginning.
>>: Can I ask a quick one?
>> Lae-Hoon Kim: Yes.
>>: I didn't see it -- maybe I missed something, but you don't have in your algorithm a step of trying to estimate how many people are speaking, right?
>> Lae-Hoon Kim: Yeah, no.
>>: [inaudible].
>> Lae-Hoon Kim: Yeah, yeah. I just tested that scenario, like two speakers. But as I already mentioned at the beginning of this talk, if there is a kind of multi-functionality like [inaudible] or something like that.
>>: Right.
>> Lae-Hoon Kim: They can be used.
>>: Help you try to estimate how many --
>>: I mean, we kind of have [inaudible] how many people in the room, where they are, and we can track --
>> Lae-Hoon Kim: Yeah, I'm sure. I'm sure.
>>: -- over time where they're moving. So --
[multiple people speaking at once]
>>: [inaudible] that's captured by the device but is making noise.
>> Lae-Hoon Kim: You know what? We still have some functionality for doing that. Because in the case of [inaudible] you see here, we don't use the video information, but we can localize the source, right? [inaudible] compared with the [inaudible]. But we can do that. Even for the kind of dog under this.
>>: So how would this work in a situation where you're at a party playing Rock Band
and you're singing? Because there's going to be, you know, white babble, background
noise.
>> Lae-Hoon Kim: Yeah.
>>: How is this going to perform? Are you going to lose your dBC very quickly? Is it
going to work well and people have to be screaming in order for it not to work?
>> Lae-Hoon Kim: Yeah. That could be very -- that is very interesting, good question.
>> Ivan Tashev: [inaudible]
>>: Yeah, okay.
>> Lae-Hoon Kim: I have a guess, actually. I don't know what happens, you know, but I have a guess.
You see, here we can concentrate on a specific position and a specific angle of the source.
So, for example, there is a rock band playing something over there and here I try to speak something. If it is [inaudible] -- if this is too annoying and too loud -- that would be very bad. But if I can have, you know, a very small distance between this microphone and me, I can concentrate on myself using this. This is the guideline -- this is kind of a way of guiding the separation of the signal. This is the way we can do it.
But I'm not sure, actually, because, you know, that will depend on the signal-to-interference ratio of this rock band and myself.
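The "concentrate on a specific angle" idea is essentially spatial steering. A minimal delay-and-sum sketch for a linear array is shown below; the geometry, sample rate, and far-field assumption are illustrative, not the array or beamformer actually used.

    import numpy as np

    def delay_and_sum(X, mic_pos, angle_deg, fs=16000, n_fft=256, c=343.0):
        """Steer a delay-and-sum beam toward angle_deg (far-field, linear array).

        X       : complex STFT, shape (n_mics, n_bins, n_frames), n_bins = n_fft // 2 + 1
        mic_pos : microphone positions along the array axis, in metres
        Returns the beamformed STFT, shape (n_bins, n_frames).
        """
        freqs = np.fft.rfftfreq(n_fft, 1.0 / fs)                           # bin centre frequencies
        delays = np.asarray(mic_pos) * np.cos(np.deg2rad(angle_deg)) / c   # per-mic delays (s)
        steer = np.exp(-2j * np.pi * np.outer(delays, freqs))              # steering phases (n_mics, n_bins)
        # Align every channel toward the target direction, then average across mics.
        return (steer.conj()[:, :, None] * X).mean(axis=0)

In the rock-band scenario this kind of steering would emphasize the talker's direction before or alongside the separation, but how much it helps still depends on the signal-to-interference ratio, as noted above.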
>>: So I know the exact hardware configuration for the microphones has been evolving in parallel, like where you put the microphone housing and the acoustics. Are these measurements with the latest one, with the latest configuration?
>> Ivan Tashev: [inaudible] generally we have in the top?
>> Lae-Hoon Kim: Let me go to the last slide.
Thank you for your attention. Any questions?
>>: You want more questions?
>>: I think this is something we definitely want to try. So we're doing work now to basically do identity tracking, so basically mapping sound source localization [inaudible], so we'd say, all right, there's a person here, we're hearing sound from here, and we can see everyone's face. And it would be interesting to tie that into this and say, all right, we always output a unique speaker stream for the number of skeletons we see in view and then track with that. And just see how it works end to end.
You know, it offers value in some different ways. If nothing else, it improves the chance
in multiplayer scenarios that when someone is trying to do speech recognition it will be
successful [inaudible] suppressing background. It might be usable in the game show
game analogy where you do have a lot of literal synchronized cross-talk.
I don't think we ever expected that, you know, if you have 50 people in your house and
people playing Rock Band that you were ever going to have viable speech recognition
or --
>>: You know someone is going to want to do that, though.
>>: They are, but, you know, [inaudible] but I can at least see [inaudible] how we can
provide something as a platform that might actually be beneficial to use the [inaudible]
scenario.
>>: How sophisticated is the facial expression tracking? Do you have that tracking?
>>: We've done a little bit of work early in the project. The basic feeling was that, at the distances we're talking about and with the resolution of the camera, we won't be able to do actual lip tracking.
The camera we use today is a stopgap that's intended to be swapped out with a future version of a high-resolution camera. So hopefully [inaudible], but at three or four meters with something like a 640-by-480 camera, [inaudible] we wouldn't be able to be that sort of accurate.
>>: Uh-huh.
>> Ivan Tashev: Any more questions?
Let's thank Lae-Hoon.
[applause]