>> Ivan Tashev: Good morning. It's my pleasure to introduce Dr. Nikolay
Gaubitch from Imperial College London. He will present his work in
the area of noise-robust blind system identification and subband equalization
of room transfer functions, with applications to speech dereverberation. So
without too long an introduction, Nikolay, you have the floor.
>> Nikolay: Thank you, Ivan. So good morning, everyone. Thanks, everybody, for
coming along, and especially thanks to Ivan for inviting me over here. It is
a great honor to give a talk at Microsoft Research. It's quite a long
title; it's almost like the whole talk is in the title.
So I guess many of you are familiar with this, but for completeness, the
dereverberation problem: reverberation is illustrated with a talker at some
distance from a microphone, where hopefully you get the direct-path sound
plus some reflections, which are delayed and attenuated versions of this
direct sound. And this causes a lot of trouble in hands-free telephony and
sound reproduction. And I say that it can have adverse effects because in
some situations it is actually good to have reverberation. In music, for
example, people like it.
So my talk will be generally on two related, interlinked but separate blocks.
One is adaptive system identification, and one is equalization. I'll start
with a quick introduction, just formulating the problem, then a few classes
that I believe you can split the existing dereverberation methods into, and
then the two main bits on system identification and equalization.
So this is, I guess, familiar to many of you who work with this. You have a
speech signal s(n) produced in a reverberant room, and often you model the
room acoustics as an FIR filter. The observed signal is the convolution of
this FIR filter, the room impulse response, with the speech signal, plus some
additive noise. And the aim of dereverberation is to find some kind of
estimate of this speech signal s(n), possibly a delayed or scaled version of
it, using the observations x only. So this is known as a blind problem,
because we only have x and none of the other signals available, generally.
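For reference, the model being described can be written in standard notation (the symbols here are chosen for illustration rather than taken from the slides) as

$$
x_m(n) = \sum_{k=0}^{L-1} h_m(k)\, s(n-k) + v_m(n), \qquad m = 1, \dots, M,
$$

where $h_m$ is the $L$-tap room impulse response from the talker to microphone $m$, $v_m(n)$ is the additive noise, and the aim is to recover $\alpha\, s(n-\tau)$, for some scale $\alpha$ and delay $\tau$, from the observations $x_m(n)$ alone.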
So I believe, at least from my thesis work, that you can divide the currently
existing dereverberation algorithms into three large classes. One is
beamforming, where you have some kind of array of microphones and you steer a
beam towards the desired speaker, the source, and exclude any other
interfering sources. And this is exclusive to multiple microphones.
Then you have some speech enhancement methods, where often you have a model of
your speech signal, perhaps based on the speech production model of humans,
and you try to process your reverberant observation so that it better fits
this speech model that you have.
Alternatively, you can also have a model of the room impulse response, which
gives (inaudible) kind of spectral-subtraction fashion algorithms for
dereverberation.
And finally you have the most notorious (inaudible), the blind system
identification and equalization type of methods, where you try to explicitly
estimate the room impulse responses and design some equalization filters
based on these estimates so that you can remove the (inaudible)
reverberation. And the bits I'm going to talk about fall into this last
category.
So -- yes, so this is the blind system identification issue: to try and
estimate your impulse responses, h, using only the observed signals, x. The
rest of it is sort of in the dark. Many of the algorithms which exist for
this, the multichannel system identification algorithms, are based on the
cross-relation between two microphones. So basically, the observation of the
first microphone convolved with the impulse response of the second microphone
is the same as the observation of the second microphone convolved with the
impulse response of the first microphone.
And using this, you can form a system of equations which gives you the
opportunity to find the solution for h -- which is actually a scaled version
of h -- by finding the eigenvector corresponding to the smallest eigenvalue
of R, which is a correlation matrix.
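A sketch of the two-channel formulation being described, which is standard in the literature (notation chosen here): in the noise-free case $x_1 = h_1 * s$ and $x_2 = h_2 * s$, so

$$
x_1(n) * h_2(n) = s(n) * h_1(n) * h_2(n) = x_2(n) * h_1(n).
$$

Stacking the channels as $h = [h_1^{\mathsf T}, h_2^{\mathsf T}]^{\mathsf T}$ and accumulating these linear equations over time yields a correlation matrix $R$ satisfying $Rh = 0$ at the true solution, so a scaled estimate is obtained as

$$
\hat{h} = \arg\min_{\|h\|=1} h^{\mathsf T} R\, h,
$$

i.e., the eigenvector of $R$ associated with its smallest eigenvalue.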
And this works provided that the channels are significantly different, that
is, you have no common zeros between the room transfer functions, and that
your excitation signal contains enough spectral variety.
One step further: if you take this cross-relation, you can formulate an error
signal which extends to any number of channels by taking various combinations
of these channels. And you can use some kind of adaptive filter to minimize
this error.
The good thing with an adaptive filter is that, in theory at least, it's able
to track changes in the system, which occur quite often with room impulse
responses. As you move around the room, the impulse response changes, for
example.
So you can have this cost function based on the errors of all the
combinations of channels, and there are a few different implementations,
mainly by Huang and Benesty, which came out both in the time domain and in
the frequency domain, and in particular this normalized multichannel
frequency-domain LMS, which is quite an efficient version of these
algorithms.
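The multichannel cost function referred to has the general form below (a reconstruction from the description; the exact normalization used in the NMCFLMS papers differs in detail):

$$
J(n) = \sum_{i=1}^{M-1} \sum_{j=i+1}^{M} e_{ij}^2(n), \qquad
e_{ij}(n) = x_i^{\mathsf T}(n)\, \hat{h}_j(n) - x_j^{\mathsf T}(n)\, \hat{h}_i(n),
$$

where $x_i(n)$ holds the most recent samples at microphone $i$ and $\hat{h}_j(n)$ is the current estimate of channel $j$.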
And before I move on, I'll just introduce briefly this normalized projection
misalignment, which is often used for measuring the misalignment between your
estimated channel and your true channel. And the projection bit actually
helps avoid any dependence on the scaling factor which occurs with these
algorithms.
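For reference, the normalized projection misalignment has a standard definition, which might be computed as in this minimal sketch (function name mine):

```python
import numpy as np

def npm_db(h, h_hat):
    """Normalized projection misalignment in dB.

    Projecting h onto h_hat removes the arbitrary scaling inherent
    to blind system identification, so only the shape of the
    estimated channel is compared against the true channel h.
    """
    scale = (h @ h_hat) / (h_hat @ h_hat)  # optimal scaling of the estimate
    return 20 * np.log10(np.linalg.norm(h - scale * h_hat) / np.linalg.norm(h))
```

Here h and h_hat are the concatenated true and estimated channel vectors.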
So now, if you look at a quite ideal example with this frequency-domain
adaptive algorithm: you have three randomly generated channels, fairly short,
32 taps. The input is white noise, and there is no additive noise, so --
>> Question: What (inaudible)?
>> Nikolay: It just means you just have some random taps that you generate.
There's no --
>> Question: (Inaudible).
>> Nikolay: The impulse response -- exactly.
So -- yes, so these are quite ideal conditions for the algorithm, and it
works well, both in terms of minimizing the error -- here you have the cost
function on the Y axis and the number of iterations on the X -- and here is
the projection misalignment versus the iterations, so that's fine.
However, when you have some additive noise, for example white noise, the
problem becomes that the cross-relation doesn't hold anymore, so your
cross-relation error actually consists of two terms. One is the original
desired cross-relation error, which is the good bit; but there is also a
component which includes the noise and the channel estimates.
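Schematically, writing each observation as signal plus noise, $x_i = x_{s,i} + v_i$, the cross-relation error splits into the two terms described (notation mine):

$$
e_{ij}(n) = \underbrace{x_{s,i}^{\mathsf T}\hat{h}_j - x_{s,j}^{\mathsf T}\hat{h}_i}_{\text{desired cross-relation error}}
\; + \; \underbrace{v_i^{\mathsf T}\hat{h}_j - v_j^{\mathsf T}\hat{h}_i}_{\text{noise-dependent term}},
$$

and minimizing the total squared error trades the first term off against the second.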
And trying to minimize these two components together kind of messes things up
a bit.
And you can see here an example of what happens. The (inaudible) have this
strange behavior: you start converging towards the right solution, and then
they misconverge into something else.
And this seems to be unavoidable. If you just try a smaller step size, the
only thing that changes is this point of misconvergence; it just moves a bit
later on.
And the other interesting point is that this is 40 dB SNR, which is a pretty
high SNR, and you still get this effect. So very, very small amounts of
noise disrupt this algorithm.
And the question is, what can we do to avoid this effect?
One thing we thought about is to try and include some kind of constraints
into the adaptation: use more information about the impulse responses or
about your environment to try and stop this misconvergence. Of course, one
obvious thing to do is that if you have one known tap in your impulse
response, you try and force the algorithm to always keep that tap the same.
And one tap you could use is the direct-path component. So the direct path
of the impulse response, which you could, say, estimate with another
algorithm for time delay of arrival estimation.
So this is one way to do it. Another observation that we had is that this
misconverged solution has a strange low-pass effect: it always tends to this
low-pass tilted spectrum. And what we thought of doing is to impose a
constraint that your spectrum should have a somewhat uniform energy
distribution across all frequencies. It's a debatable assumption whether
this is the case, but I guess some parts of room acoustics support the idea
that you have a uniform distribution.
Either way, whatever constraint you take, you can put it in via a Lagrange
multiplier, and it essentially just adds this penalty term into the
adaptation process.
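In update form, and only as a sketch of the idea (the precise constraint functions are on the slides), the penalized adaptation looks like

$$
\hat{h}(n+1) = \hat{h}(n) - \mu \left[ \nabla J(n) + \lambda\, \nabla C\big(\hat{h}(n)\big) \right],
$$

where $C(\cdot)$ encodes either the fixed direct-path tap or the flat-spectrum assumption, and $\lambda$ is the Lagrange multiplier weighting the penalty against the cross-relation cost $J$.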
And here is just one example with, again, quite short -- these are shortened
room impulse responses, simulated as such. There are five channels, so each
bit here represents one channel of the true ones. The room impulse responses
are truncated to 128 taps.
Here is the misconverged solution, which was --
>> Question: What was that again?
>> Nikolay: -- eight kilohertz, here. So it is quite heavily truncated; it's
mainly the initial parts of the impulse response.
So here you have the misconverged solution, using just the
standard (inaudible) domain adaptive algorithm. And if you impose the
spectrum constraint, you actually manage to stabilize it, although you delay
the convergence a little bit. And the same effect you can also see with the
direct-path component constraint. They both work in the same way. And here
you see the estimated channels, which correspond quite well to the original
ones.
Another thing to do to try and improve things with the adaptive filter is to
try and get some kind of control of the step size, rather than having a fixed
one. And this is another bit we looked at. So we tried to derive a step size
we call optimal. And it's optimal in the sense that you want the step size
to minimize the error between the true solution and your estimate at the next
step, given your current estimate of the impulse response.
And from this, you obtain a step size which at each iteration can be
calculated using this.
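A reconstruction consistent with the description that follows (notation mine): if the update is $\hat{h}(n+1) = \hat{h}(n) - \mu\, \nabla J(n)$, then minimizing $\|h - \hat{h}(n+1)\|^2$ over $\mu$ gives

$$
\mu_{\mathrm{opt}}(n) = \frac{\nabla J^{\mathsf T}(n)\, \hat{h}(n)}{\|\nabla J(n)\|^2} - \gamma(n),
\qquad
\gamma(n) = \frac{\nabla J^{\mathsf T}(n)\, h}{\|\nabla J(n)\|^2},
$$

where the first term is computable from available quantities while $\gamma(n)$ depends on the true channel $h$.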
Now, this contains two terms. One is this bit here, which has the gradient
and your channel estimate, which is fine; this is available to you, so you
can calculate it. But then you have this gamma term, which depends on the
true channel.
So of course it defeats the point a little bit if you need the true channel
to calculate it.
If you look at the noise-free case to start with, this term is actually zero,
because the true solution and the gradient will be orthogonal to each other
at all points. So your optimal step size is fine; you can just get it by
calculating this bit.
However, in noise you don't have that. And what happens in noise is actually
that this component -- sorry, just going back to the noise-free case: as you
see, as the estimate comes close to the true solution, it drives the step
size to zero, which just halts the adaptation, which is what should happen.
In the noisy case, this is what gamma does: it helps to support this driving
of the optimal step size to zero. And we thought of trying to find some kind
of approximation of this gamma term based on the bits that we have available.
And, yes, you can have some approximation of it; this is the one we used.
And we plug that into the algorithm.
So here I can show you just a couple of examples of what happens. So first,
one of the issues the optimal step size gets around is that of selecting a
step size manually, which can be quite troublesome. Here you see that if you
have a step size of 0.02, the algorithm goes crazy. If you have 0.01, it is
fine. But it is kind of difficult to tweak it and so on. So the optimal
step size selects it for you; it just converges, no problems.
And here also, you see that the two lines are overlapping. So whether you
have this approximation of the gamma term or not, there is no benefit when
there is no noise, which is okay.
On this figure -- so here you have the misalignment on the Y axis and the
iterations on the X axis -- you see that if you introduce this approximation
of the gamma term, you can actually gain some extra performance in the
estimation compared to assuming that it's zero all the time.
And another interesting bit is that if you do use the true solution in the
step size, you get some really, really good convergence, as you would expect.
I mean, it is not a useful thing, but it gives you some kind of performance
limit of what you can get with this step size.
So just to sum up this part of the talk: generally, when you have some
measurement noise, it disrupts this cross-relation error, and your
performance degrades with these algorithms. However, if you try and impose
some constraints based on additional information that you have, you can
increase the robustness to noise quite a bit.
>> Question: Just a question to you. So when you say measurement noise, what
exactly are you --
>> Nikolay: I mean --
>> Question: -- (inaudible).
>> Nikolay: Measurement noise is generally any noise. So it could come
partly from your equipment, or it could be noise from other sources in the
room.
>> Question: But those noises, both noises, are different.
>> Nikolay: Yes. Yes.
>> Question: So in the (inaudible) --
>> Nikolay: Yes.
>> Question: -- and the external noise, it's correlated, because it is still
valid noise sources (inaudible) that channel, the microphones and system
over.
>> Nikolay: Yes, yes. So, I mean, everything I looked at here is not
correlated -- uncorrelated noise, yes, generated.
So, do you have any other questions about this part before I proceed to the
second bit?
Okay. So I'll look at the other side of the issue: the equalization. So
let's assume that you have some kind of estimate of your room impulse
response, h. And you want to design some equalizing filters so that when you
convolve your impulse response with these filters, you get an impulse, which
can be a delayed or scaled version of the true impulse. And that's sort of
the ideal equalization, I guess.
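In symbols, the ideal equalization being described asks for a filter $g$ such that (notation mine)

$$
g(n) * h(n) = \alpha\, \delta(n - \tau)
$$

for some scale $\alpha$ and delay $\tau$; the multichannel variants discussed below replace the left-hand side with a sum of such terms over the channels.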
As many of you know, this comes with quite a few practical problems. One is
that room impulse responses are non-minimum phase, so you cannot achieve this
with a single-channel stable, causal inverse filter. And the second problem
is that h is normally several thousand taps long, which causes problems when
you try and calculate the filters.
A third problem is that you normally have errors in your estimates of h, as
you can see from the previous part, so it can distort the equalized signal if
you design your equalizers based on inaccurate estimates.
And finally you have this quite large dynamic range of the room transfer
function, so that equalizing it exactly could boost quite a lot of narrowband
noise.
So the one issue I'm going to look at here is multichannel least-squares
equalization. So instead of having one filter per channel for equalizing,
you can have a combination of filters, which all together equalize the
signals and give you one output. And this is generally known as the MINT
algorithm, which is just a least-squares solution to the problem formed from
this relation.
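Written out, in a standard formulation (notation mine): with $\mathbf{H} = [\mathbf{H}_1 \cdots \mathbf{H}_M]$ the concatenated convolution matrices of the estimated channels and $\mathbf{d}$ a delayed unit impulse, MINT seeks stacked equalizers $\mathbf{g}$ satisfying

$$
\mathbf{H}\,\mathbf{g} = \mathbf{d}, \qquad \mathbf{g} = \mathbf{H}^{+}\,\mathbf{d},
$$

where $\mathbf{H}^{+}$ is the least-squares pseudo-inverse.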
And the good bit with this is that, first of all, it eliminates this
non-minimum phase problem, so you can actually get perfect equalization,
provided, of course, that you have no common zeros between the room transfer
functions. So the transfer functions have to be different.
And the problem with this, on the other hand, is that it is quite sensitive
to wrong estimates of the room impulse responses.
Some attempts to improve that were made, with some good results, I guess, by
introducing a regularization into the minimization here.
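As an illustration of how such a regularized multichannel least-squares equalizer could be computed, here is a minimal numpy sketch (function and parameter names are mine, and the regularization shown is plain Tikhonov, one of several variants in the literature):

```python
import numpy as np
from scipy.linalg import toeplitz

def conv_matrix(h, Li):
    """Convolution matrix C of filter h, so that C @ g == np.convolve(h, g)."""
    col = np.concatenate([h, np.zeros(Li - 1)])
    row = np.zeros(Li)
    row[0] = h[0]
    return toeplitz(col, row)

def mint_equalizers(h_list, Li, delay=0, reg=0.0):
    """MINT-style multichannel least-squares equalizers (sketch).

    h_list : M estimated room impulse responses, length L each
    Li     : length of each equalizing filter
    delay  : target delay of the overall equalized impulse
    reg    : Tikhonov regularization weight (0 gives plain least squares)
    """
    H = np.hstack([conv_matrix(h, Li) for h in h_list])  # (L+Li-1, M*Li)
    d = np.zeros(H.shape[0])
    d[delay] = 1.0  # desired overall response: a delayed unit impulse
    g = np.linalg.solve(H.T @ H + reg * np.eye(H.shape[1]), H.T @ d)
    return np.split(g, len(h_list))  # one equalizer per channel
```

With gs the returned list, the equalized overall impulse is sum(np.convolve(h, g) for h, g in zip(h_list, gs)); it approaches the delayed delta as the channel estimates improve and, with reg > 0, degrades more gracefully when they do not.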
Okay. I can just show you a couple of examples of what happens if you
compare this perfect multichannel equalization with an approximate single
channel least-squares (inaudible) equalization. So you have some kind of
magnitude distortion and phase distortion of the equalized output. I'll go
through these measures in more detail later on. And you have the system
mismatch along the X axis.
So here, plot A on both sides is the exact equalization with two channels,
whilst plot B is the approximate equalization with one filter.
And as you see, if you have perfect estimates of your channels, you get
perfect equalization. If not, you start to get quite bad performance from
these exact equalizers, whereas this more approximate single channel one
degrades much more gracefully. Unfortunately, to get this, you need very
large filters.
In this case, I've used filters 15 times the length of the impulse response
for the single channel. So that's not ideal.
And another interesting observation, I think, is if you look at the length.
So if you vary the length of your filter and you equalize again with the
multichannel and the single channel equalizers, you see that the error
actually grows with the length of the filter that you are trying to equalize,
although you are keeping the misalignment the same. So you get more and more
trouble. So the conclusion from this is actually that it is best to try and
do some short-filter equalization rather than these long impulse responses.
>> Question: (Inaudible) Y axis (inaudible) --
>> Nikolay: Sorry, it is not really in dB. It is kind of dB. I'll show you
the expression a bit later on. But this just demonstrates that you get some
problem with this. It's essentially just the deviation of the magnitude from
its mean. That's what it's supposed to be.
Yes, so to try and shorten things, one obvious thing to do is to try to go to
subbands, which people do. So you have -- this is a figure of the fullband
multichannel case equalizers. You have the signal, some channels, the
equalizers. And you sum those up, and you get your equalized output.
Now, if you want to do this in subbands, one way to look at it is this, which
is a somewhat conceptual version. You plug this into the subband structure.
So you have the signal, and in each subband you have the channel, the
equalizer, and the output.
I know this is quite a messy figure, but I think it does explain what I'm
trying to get to eventually.
So -- yes, so it just changes the order a little bit. This figure raises a
few questions. One is how you choose your filter bank. The second is, if
you have only the fullband estimates of your room impulse responses, for
example, how do you relate those to these subband filters so that you can
design these subband equalizers?
So, starting with the filter bank, the one we chose to use was the
oversampled filter bank and, in particular, the generalized discrete Fourier
transform (GDFT) structure. The reason I chose that is partly that it's
fairly straightforward to implement fractional oversampling, and also there
are some efficient implementations of this.
And the reason we want the oversampling is that we can perform filtering in
the subbands approximately by using only one filter per subband, rather than
the (inaudible) cross-filters which you normally have to have otherwise. So
there are two properties which are helpful for this. One is that, in these
filter banks, you can suppress the aliasing in the subbands significantly.
And also there is very little magnitude distortion at the output of the
filter bank.
And having that, you can relate your fullband filter to the subband
structure. Essentially, what we want to do now is find the relation between
the fullband transfer function and the subband filters. So I want the
transfer function of this to be equal to the total transfer function of the
filter bank, so that if you input the signal here, you get an output here
which has the same relation as these two.
And one way to do it is this: some people have used this approach, and you
can arrive at a least-squares version to approximate these filters. They are
not exact, but you get pretty good results in terms of the error between this
signal and this signal by using this decomposition.
And the other major advantage is that your subband filters of the room
transfer functions are now the length of the fullband filter divided by your
decimation ratio, which is what we wanted to get to.
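One way the decomposition step can be posed, per subband, is as a least-squares fit (a reconstruction of the idea rather than the exact design equations): for each subband $k$ with analysis filter $a_k$ and decimation ratio $N$, choose the short subband filter $\tilde{h}_k$ to minimize

$$
\big\| a_k * h \;-\; \tilde{h}_k^{\uparrow N} * a_k \big\|^2,
$$

where $\tilde{h}_k^{\uparrow N}$ denotes $\tilde{h}_k$ upsampled by $N$, so that filtering the decimated subband signal with $\tilde{h}_k$ approximates the decimated, fullband-filtered subband signal; each $\tilde{h}_k$ then needs only roughly $1/N$ of the fullband length.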
And if you now have these estimates, you can design your equalizing filters
for each subband based on these subband-equivalent filters. And you achieve
overall equalization by applying the filters to each subband and then
reconstructing the equalized fullband signal. And of course, since the
impulse responses of the room transfer functions have been reduced by a
factor of N, the decimation ratio, so have the equalizers.
So let me show you just a couple of results with this approach for
equalization. For this example we used a filter bank with 32 subbands, which
are decimated by a factor of 24, and a 512-tap prototype filter for the
design of the filter bank. We simulate the system mismatch by just adding
some noise to the impulse response that we're trying to equalize, so that you
get some desired misalignment.
And these are the two measures that I had previously also: the magnitude
deviation and the linear phase deviation. So these are the two things we
measure separately. The magnitude deviation, as I said, is the deviation of
the equalized signal's magnitude from its mean, while the phase deviation is
the deviation of the equalized signal's phase from a linear phase.
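One plausible formalization of these two measures (my notation; the exact definitions are on the slides): with $E(e^{j\omega})$ the overall equalized frequency response, the magnitude deviation is the standard deviation of $20\log_{10}\lvert E(e^{j\omega}) \rvert$ across frequency, hence the earlier remark that it is "kind of dB", and the phase deviation is the standard deviation of $\arg E(e^{j\omega})$ about its best linear-phase fit.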
So in the first case, this is just randomly generated channels with randomly
generated taps. The channels are 512 taps long, and there are five different
channels. So you can see here the magnitude deviation on the Y axis, the
phase deviation on the Y axis down here, and the system mismatch across the X
axis. So for the fullband multichannel equalization, you get what we got
before: pretty bad distortion as you start getting inaccuracies in your
estimates. Whereas with the subband case, you get a much more graceful
degradation again, now that the filters are much shorter.
And here is the case with longer impulse responses. So these are simulated
room impulse responses of 4,800 taps. Again, there are five microphones, and
the results are averaged over 10 different locations of the source and the
microphones. So you get pretty similar performances as for the random
channels before.
And another nice thing with -- sorry, yes, I'll just show you a couple of
examples with this: one case when you are here, and one case when you are
here, so that you can see what the output looks like. So here is the impulse
response of the output with nearly noise-free estimates of your impulse
responses, and here is the magnitude. So you can see that it is pretty well
equalized. On the other hand, when you have the minus 10 dB misalignment,
you start to get some rubbish in the equalized output.
However, if you listen to it, it doesn't distort the signal, which is quite
interesting. It changes the spectrum, as you can see from this, but it
doesn't distort it. So I do have some samples. I'm not playing any samples
here, because generally with reverberation, playing through the room doesn't
really get the effect across. But I have some samples on the computer if
anybody wants to listen with headphones. Later on you can listen to what
these things sound like.
And another good benefit from this subband equalizer is that you can actually
achieve quite a lot of savings in computation. So this is floating-point
operations versus your impulse response length, and it's somewhat consistent.
You get a factor of 120 reduction in computational complexity, which allows
you to invert quite long room impulse responses fairly easily.
And again, just to summarize this bit of the talk: I think the key points
with this subband equalizer are that you get a more efficient version of an
equalizer, and you can improve your robustness to noisy estimates of your
impulse responses. And you get quite near-perfect equalization if the
mismatch is smaller than minus 40 dB.
And also, I think I like this complex subband decomposition for relating the
fullband and the subband impulse responses as a general tool for any subband
development of equalizers or (inaudible) identification.
So just finally, I'll show you a couple of figures on applying this equalizer
to speech signals, and some results on the dereverberation in terms of Bark
spectral distortion and segmental signal-to-reverberation ratio.
So -- I know that the choice of measurement for evaluating dereverberated
signals is a debatable issue, but this is what I used here. And here you
have several plots. So this is segmental signal-to-reverberation ratio
versus various T60s. This also uses simulated room transfer
functions (inaudible) microphones.
So here, at the bottom, is the unprocessed reverberant signal. The first
plot is just what a delay-and-sum beamformer would do. And then each of
these plots is for zero dB, minus 30, minus 60, and noise-free exact
estimates of your impulse responses. So you see that if you have your
perfect estimate, you can get perfect dereverberation, more or less. And
with noise you are still doing quite well with bad estimates of your impulse
responses, at least in terms of signal similarity.
So the debatable issue with such measurements is whether they represent the
audible differences. I believe that if we have some kind of signal
similarity which reaches points down to here, it probably sounds pretty good.
And a similar thing you can also get in terms of Bark spectral distortion.
Again, here is the unprocessed signal, this is the beamformer, and down here
are all the lines for the equalized signals.
>> Question: So in this case we can say the (inaudible) distortion is mostly
(inaudible) while the segmental (inaudible) reverberation ratio (inaudible)
variable. I think (inaudible) what we are doing (inaudible) down there.
>> Nikolay: Yes.
>> Question: (Inaudible).
>> Nikolay: Yes. Well, yes. And if you can relate, I guess, the Bark
spectrum more to the audible difference, it probably shows that the audible
difference between these three is not huge.
Although you may have a slight difference -- so the point is that going from,
you know, 15 dB to 30 probably doesn't make a huge difference when you listen
to it, which I think can be true to some extent.
So just to sum up the overall talk so far: I think that this adaptive system
identification and equalization, (inaudible) equalization, can provide really
good dereverberation in theory, which is good to have; it is possible to do.
However, in practice it is difficult, because you have a lot of noise and
other issues, so it takes quite a bit of tweaking to get things working.
However, I believe, and I think that is what I tried to get across, that if
you try and collect as much information about the environment as possible and
put it in as constraints to the algorithm, you can gain quite a lot of
robustness and, in a sense, reduce the blindness of the problem. And the
other bit is to try and reduce the dimensionality of the problem by
divide-and-conquer approaches. And this hopefully will lead to some more
practical algorithms using these types of approaches.
And just before finishing off, I would like to thank a few of the people I've
had the pleasure to meet and work with in relation to all this. In
particular, I'd mention Patrick Naylor (phonetic), who was my PhD supervisor
and continues to support me in the things that I do.
So thank you for listening.
Thanks for your attention.
[APPLAUSE]
>>> (Inaudible). So Nikolay (inaudible) this afternoon. If there are no
more questions, we can take a break. (Inaudible) signal processing talk
here, which is organized for today. And in this case we'll hear three
persons from the (inaudible) processing center. And then, from my
understanding, the audience will be (inaudible); we can take a break of 15
minutes and be back here again in two hours.
Thank you, all, for stopping by.
>> Nikolay: Thank you. Thanks.
>>> Thank you, Nikolay.
[APPLAUSE]