>> Ivan Tashev: Well, good morning everyone, those who are present in person
in the lecture room and those who are watching us online. It's my pleasure this
morning to introduce Mark Thomas, who received his bachelor's degree in 2006
and his PhD in computer science in 2010, both from Imperial College in London.
And this morning he's going to talk about Microphone Array Signal Processing
Beyond the Beamformer. Without further ado, Mark.
>> Mark Thomas: Thank you, Ivan. Okay. So thank you for your kind
introduction, Ivan. My name is Mark Thomas, and I'm currently a post-doctoral
researcher at the Communications and Signal Processing Department at Imperial
College, London.
And I'd like to thank you for coming to my talk entitled Microphone Array Signal
Processing Beyond the Beamformer.
I'd like to welcome any questions throughout the talk, so feel free to interrupt me
at any point.
So array signal processing has been around for quite some time. In the '50s
there was quite a lot of research into phased antenna arrays for steering a beam.
But it's only relatively recently that we've started seeing multiple microphones in
consumer devices. And I'm sure I don't need to point out the RoundTable device
and the Kinect device. There are now spherical microphone arrays,
including the Eigenmike and the 64-channel VisiSonics array.
Imperial College has developed a number of planar and linear microphone arrays
for research purposes. And the motivation behind these is beamforming, which
is a very powerful technique for spatial filtering. But what I would like to talk
about in this talk is other ways of using multimicrophone measurements. So the
outline of my talk is that I will begin with an introduction showing the kind of
application scenario that I would expect to be working in and talk about some of
the notation.
Now, beamforming is a method that makes some assumptions about the way
waves propagate in space. What I would like to look at are some other kind of
models that we could impose. And the main topic of this talk is multichannel
dereverberation.
I would first like to talk about spatiotemporal averaging, which applies models of
the voice. And equalization by channel shortening, which applies a channel
model. The acoustic rake receiver, which applies a geometric model in addition to
beamforming.
And then in the second part I'll talk about geometric inferences. And what we
mean by inference is the estimation of the position of line reflectors in two
dimensions in an acoustic environment using acoustic measurements.
I will then conclude at the very end.
So the application scenario, as I'm sure is very familiar to many of you, is that we
have an acoustic enclosure. And inside this enclosure there's a talker at some
point in space. Somewhere else is a microphone or an array of microphones.
And this will receive the wanted direct path signal. Slightly after that will be early
reflections from hard surfaces like walls and tables. A bit later on, the late
reflections caused by the high-order reflections, or the reverberant tail, as it's
sometimes referred to. And there will also be some additive noise. And this
could be from acoustic noise sources like an unwanted talker or some other
noises within a room. In a domestic environment you'd expect to hear clattering
and people walking around.
And there's also the problem of additive noise coming from the measurement
apparatus. This might be within the microphone or within some code in the
[inaudible] and all of this adds to the observed signal at the microphone.
So put a bit more formally, we can say that a speech signal, S of N, is convolved
with a filtering matrix H. This produces an observation Y. Added to this is
a noise signal B to then produce an observation X.
Now, what we can do with this X is perhaps apply a beamformer. And the purpose
of this is to try to point a lobe of sensitivity at the wanted talker and attenuate the unwanted talkers
that are spatially distributed elsewhere.
Another class of dereverberation and noise reduction algorithms, of which beamforming is
part, will also aim to estimate the speech signal, perhaps by some different
means. And this I'll talk about a little bit later in the dereverberation part of the
talk.
Geometric inference aims to estimate the parameters L, which are parameters
for the line reflectors in space. And I've also added blind channel
identification. Now, although I don't talk about this explicitly in this talk, it's
nevertheless been a part of my research as a post-doctoral researcher. And
without multichannel observations and the spatial diversity that they provide, blind
channel identification can't be achieved.
The other thing that I would like to say in the introduction is some notation that I'll
use. The channel index is denoted by M and its length is L. And the observation
at channel M is XM of N.
And what is often done, and I will be doing in this talk, is to drop the channel
index and stack all of the channels into one vector. So if there's no
subscript, then it can be assumed to be a multichannel observation.
So I'd like to now move on to the first dereverberation algorithm, which is called
spatiotemporal averaging. And this relies on voice modeling. So I'd like to begin
by looking at the source filter model of speech.
And what this says is that air passing from the lungs passes through the glottis.
And as it does so, it causes the glottis to vibrate periodically. And these
excitations are then filtered by the pharyngeal, oral, and nasal cavities to produce
sounds that we interpret as voiced phonemes. What we often do is lump the
transfer function of the pharyngeal, oral, and nasal cavities into one function
called the vocal tract. And this in the Z domain is represented by V of Z and is
often modeled as a Pth order AR process. This is excited by the excitation signal
or sometimes called the error signal from the glottis, E of Z, such that the signal
that's produced, S of Z, is the product of V of Z and E of Z.
And the motivation for separating this out is very, very common in the case of
coding, where it's possible to achieve a very efficient coding by taking the V of Z
and the E of Z, parameterizing them differently and then recombining them again
at the receiver.
And it's possible to estimate these transfer functions and the excitation signal
blindly using linear predictive coding. What this does is try to minimize the
squared error between the speech signal and the predicted speech signal using
the prediction weights A. And these As are found by solving this Wiener solution
involving an autocorrelation matrix capital R, and a cross-correlation vector lower
case R.
And as I said, this is very common in the field of coding and works very well for
single-channel speech, provided that there's no reverberation.
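To make the estimation step concrete, here is a minimal Python sketch of single-channel LPC via the normal equations; the frame length, order, and function names are illustrative assumptions rather than anything from the talk.

```python
import numpy as np

def lpc(frame, order=12):
    """Estimate AR coefficients a (Wiener solution R a = r) and the prediction residual."""
    # Autocorrelation of the speech frame
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    # Toeplitz autocorrelation matrix R and cross-correlation vector r
    R = np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
    r = acf[1:order + 1]
    a = np.linalg.solve(R, r)                              # prediction weights
    # Residual e(n) = s(n) - sum_k a_k s(n-k): inverse filter [1, -a_1, ..., -a_P]
    e = np.convolve(frame, np.concatenate(([1.0], -a)))[:len(frame)]
    return a, e
```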
But what then happens if we have multiple observations in a reverberant
environment? Well, we have two options. We could try to find a multichannel
solution to the problem or perhaps more straightforwardly, we could take a single
channel linear predictive coding and apply it to the output of a beamformer.
Now, let's say we take the former case. We can come up with these
autocorrelation matrices and cross-correlation vectors for every channel M in turn
and then average them to form a more robust estimate R hat and lower case R
hat. Using these two we can then find the optimal coefficients B hat opt. And
with some analysis it can be shown, by spatial expectation, that the B hat opt
is an unbiased estimator of A opt.
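A sketch of the multichannel variant just described, under the assumption that the per-channel autocorrelation statistics are simply averaged before solving (building on the single-channel sketch above):

```python
import numpy as np

def multichannel_lpc(frames, order=12):
    """frames: list of M time-aligned frames, one per microphone.
    Average R_m and r_m over channels, then solve for one set of coefficients."""
    R_avg = np.zeros((order, order))
    r_avg = np.zeros(order)
    for x in frames:
        acf = np.correlate(x, x, mode="full")[len(x) - 1:]
        R_avg += np.array([[acf[abs(i - j)] for j in range(order)] for i in range(order)])
        r_avg += acf[1:order + 1]
    R_avg /= len(frames)
    r_avg /= len(frames)
    return np.linalg.solve(R_avg, r_avg)    # the spatially averaged LPC weights, B hat opt
```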
Now, if, instead, we were to opt for the second case, where we take the single
channel LPC and apply it to the output of a beamformer, it can be shown analytically
that these coefficients are no longer an unbiased estimate of the A opt for the
clean speech.
So this is a fairly straightforward example to show that it can sometimes be a bit
better to come up with a dedicated multichannel algorithm as opposed to taking a
single channel algorithm and applying it to the output of a beamformer.
So let's take a look at the --
>>: [inaudible] actually by cranking through the minimum [inaudible] error on
[inaudible].
>> Mark Thomas: Yes.
>>: Okay. It truly is the Wiener optimal solution? It is truly the optimal solution?
>> Mark Thomas: It is truly the optimum solution.
>>: Okay.
>> Mark Thomas: Yes. But only by spatial expectation. So the effect of
reverberation will of course add some error. But if you were to find the
expectation at every point in the space, then it provides an unbiased estimator.
So this analysis was done by Gaubitch, Ward, and Naylor in a paper in JASA in
2006.
So the algorithm I'd like to look at is called spatiotemporal averaging. So as
we've seen at microphone M, the output is given by XM of N. And we've seen
already that we can estimate these optimal coefficients B.
Now, what we also have to do is try to estimate an enhanced linear prediction
residual that, when resynthesized, gives us a more accurate estimate of the
speech signal.
And the motivation for this is as follows: in the top left here we have the clean
speech signal, which shows this nice periodic nature and these extra wiggles due
to the spectral shaping by the vocal tract.
Beneath it is the clean linear prediction residual. And the features that we
see are these periodic impulsive events caused by the rapid closure of the glottis.
So it sort of snaps together like a hand clap.
On the top right is the reverberant speech, in which we can still see some of this
periodic nature, but there's some additional noise caused by the presence of
reverberation. But if we then look beneath this, we see that the residual has
been masked very heavily by noise, and it's not really possible to pick out any
impulsive features.
And it's been shown in previous works that in the presence of reverberation the
effect is to place a lot more distortion on the residual than on the
AR coefficients.
So given that we know that the reverberant residual is so badly corrupted by the
reverberation, we want to do as much as we can to try to fix this. So let's just
look at a system diagram of what we've done so far.
The speech signal passes through the acoustic impulse responses which then
produces these microphone observations. We then apply the multichannel
LPC to give these optimal coefficients B.
So the next thing is to apply some spatial averaging. And we do this with a
delay-and-sum beamformer. And the aim is to take the output from the
delay-and-sum beamformer which turns a multichannel observation into a single
channel and then find a residual by inverse filtering the output of the beamformer
with these coefficients B.
So the idea of the delay-and-sum beamformer is to coherently sum the observations in
such a way that these delays tau allow the coherent sum of the wanted signal,
which is the direct-path speech, causing the attenuation of the unwanted
terms. And there are various ways of estimating this tau that I won't go into any
detail about here.
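As a rough illustration only, a delay-and-sum beamformer with integer-sample delays might look like the following; the delays are assumed known and non-negative, and fractional-delay filtering is omitted.

```python
import numpy as np

def delay_and_sum(x, delays):
    """x: (M, N) array of microphone signals; delays: per-channel advances in samples
    (assumed non-negative) that time-align the direct-path component across channels."""
    M, N = x.shape
    out = np.zeros(N)
    for m in range(M):
        d = int(round(delays[m]))
        aligned = np.roll(x[m], -d)       # advance channel m by d samples
        if d > 0:
            aligned[-d:] = 0.0            # zero the samples wrapped around by np.roll
        out += aligned
    return out / M
```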
And as I said on the previous slide, the aim is to then take this output, X bar of N,
inverse filter it with these multichannel LPC coefficients B hat to produce a
prediction residual E bar. And the way we enhance it is then to exploit the
pseudoperiodicity of the voiced speech. A few slides ago we saw that the
speech signal is very periodic. And therefore we can expect that any features
due to the voice source or this excitation signal are going to be common from one
cycle to the next. And if we then perform an intercycle averaging, we should be
able to boost those features that are caused by the voiced source in the glottis
and attenuate all those terms caused by the reverberation.
So we call this larynx cycle temporal averaging, which is applied to this DSB
residual.
So the way we do it is we consider a weighted temporal averaging over 2I
neighboring cycles. So we take a cycle and then we take I neighboring cycles on
either side of it. And I is typically in the range of two or three.
The signal is segmented into individual glottal cycles by finding the glottal closure
instants and using these to delimit one cycle from the next. And then by
performing this temporal averaging it's possible to attenuate those unwanted
terms.
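A highly simplified sketch of this larynx-synchronous averaging, assuming the GCIs are already known and cycles are simply truncated to a common length; the actual method uses a weighted (Tukey-windowed) average rather than this plain mean.

```python
import numpy as np

def cycle_average(residual, gcis, I=2):
    """Average each glottal cycle of the residual with its I neighbours on either side.
    gcis: sample indices of glottal closure instants delimiting the cycles."""
    cycles = [residual[gcis[k]:gcis[k + 1]] for k in range(len(gcis) - 1)]
    enhanced = np.copy(residual)
    for k in range(len(cycles)):
        lo, hi = max(0, k - I), min(len(cycles), k + I + 1)
        L = len(cycles[k])                               # keep the current cycle's length
        stack = [c[:L] for c in cycles[lo:hi] if len(c) >= L]
        enhanced[gcis[k]:gcis[k] + L] = np.mean(stack, axis=0)
    return enhanced
```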
But a problem arises: we don't know where these glottal closure instants
occur. Now, this is a bit of a problem. And this is where the idea of multichannel
DYPSA comes from. Now, DYPSA was an algorithm that was developed by
Imperial College shortly before I arrived. And it's a method for taking a single
channel piece of speech and estimating the glottal closure instants, those
periodic impulses within the LPC residual.
And it does this by first calculating the LP residual and then using a technique
called the phase slope function. And this is essentially a measure of energy
calculated over a sliding window. And the feature that we find with the phase
slope function is that the positive-going zero crossings locate the positions of
impulsive features within the signal. We'll see a bit more of this on the next slide.
Once we've located these features, there will be a candidate set, and within that
set are the true GCIs that we want. But there will be a lot of additional incorrect
estimates that need to be removed. And we do this by applying a dynamic
programming algorithm that I'll explain in a couple of slides' time.
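Purely as an illustration of why positive-going zero crossings mark impulsive events, the sketch below uses the offset of the frame energy centroid from the window centre as a stand-in for the average phase slope; this is an assumed proxy, not the exact DYPSA formulation.

```python
import numpy as np

def candidate_gcis(residual, win=200):
    """Illustrative phase-slope-style detector: negate the offset of the frame energy
    centroid from the window centre, then return positive-going zero crossings."""
    N = len(residual)
    half = win // 2
    centre = (win - 1) / 2.0
    psf = np.zeros(N)
    for n in range(half, N - half):
        frame = residual[n - half:n - half + win]
        e = frame ** 2
        if e.sum() > 0:
            centroid = np.dot(np.arange(win), e) / e.sum()
            psf[n] = centre - centroid        # negative before an impulse, positive after it
    return [n for n in range(1, N) if psf[n - 1] < 0 <= psf[n]]
```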
So just to give an example of what I'm talking about here, on the top is a speech
signal, which shows this clear periodic nature.
Underneath it is the linear prediction residual showing these impulsive features
and a few bits of noise in between.
And the aim is to locate these impulsive events.
Now, if we look at the phase-slope function at the bottom, we see that the
positive-going zero crossings of this function identify the location of the true
GCIs. What sometimes happens is a GCI is missed, as shown here, due to
some additional random noise in the residual. And we have techniques to deal
with this. We call it phase slope projection. Yes, question please.
>>: [inaudible] you're whispering?
>> Mark Thomas: I will discuss this later. Yes. So -- yes. So the question there
is that there are -- there are modes of speech, fricatives, whispering, unvoiced
sounds, where there is no such periodicity. And there's -- I'll talk very briefly
about how we deal with this.
Please, yes?
>>: I want to ask a question a little bit about the nature of the distortion that's
happening from the reverberation. So it seemed like when you showed the
earlier pictures and that I'm guessing it's also happening from the very noisy
residual that you showed that happened at your reverberant speech is then, you
know, essentially, you know, if you think about the wave forms it's like pieces are
being moved around a little bit because, you know, reflections from -- so the
baseline -- because reflections are coming in and affecting the magnitude, you're
basically seeing that like what looks like the start of the career has kind of been
moved over a little bit over here and moved over a little bit over there.
And I guess what I'm wondering is that -- is that if you now apply this type of
analysis, you know, to look at where the period's actually starting, you're not
going to have like a super uniform.
>> Mark Thomas: Absolutely.
>>: [inaudible] there's going to be a little bit of --
>> Mark Thomas: Yeah. I don't know if that was picked up by the microphone.
But the question was that when we add reverberation the linear prediction
residual is spoiled and as well as having these wanted peaks we get lots of
unwanted peaks, and it becomes very ambiguous as to where these occur.
So right now I'm talking about the single channel DYPSA algorithm. I'll then talk
about the multichannel algorithm that tries to deal with this problem.
>>: Oh, I see.
>> Mark Thomas: Yeah. So what we're seeing here is the clean case. There's
no reverberation. Any errors here are due to framing errors. And there
may be some additional aspiration noise in here that will manifest itself in the
residual. So, yes, we'll talk about what we do in the reverberant case in just a
minute.
But, yeah, so the reverberant case is a particular problem, especially when we
consider the erroneous zero crossings. Right now, in the clean case, this will
happen every now and then. And we don't know what zero crossing is
due to the true GCI and what is due to the erroneous GCI.
So we need to apply an algorithm that tries to get rid of these unwanted zero crossings. And what DYPSA does is apply a dynamic
programming algorithm that finds a series of costs for each of the candidates in
turn.
And for each cost, some things like waveform similarity and pitch deviation are
found. And it's expected that because of the pseudoperiodicity of voiced speech,
the pitch deviation would be very low, the closures would happen very regularly, and that the
waveform from one cycle to the next should also be very similar.
So we make costs that should be low for these terms. And the dynamic programming then finds a path
through all of the candidates within some region of support that minimizes this
cost function using some additional weights. And it just so happens that these
two, waveform similarity and pitch deviation, are the most important. And
this has been found with training data.
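A toy version of that dynamic-programming selection is sketched below; the real DYPSA cost vector has six elements with trained weights, so the two costs, the weights, and the segment length here are purely illustrative, and the path is forced to start and end at the first and last candidates for simplicity.

```python
import numpy as np

def select_gcis(cands, residual, w_pitch=0.6, w_wave=0.4, seg=100, max_skip=3):
    """Choose a subset of candidate GCIs minimising a weighted sum of
    pitch-deviation and waveform-dissimilarity costs."""
    def wave_cost(a, b):
        xa, xb = residual[a:a + seg], residual[b:b + seg]
        L = min(len(xa), len(xb))
        if L < 2:
            return 1.0
        c = np.corrcoef(xa[:L], xb[:L])[0, 1]
        return 1.0 - (0.0 if np.isnan(c) else c)    # low when neighbouring cycles look alike

    n = len(cands)
    best = np.full(n, np.inf)
    prev = np.full(n, -1, dtype=int)
    best[0] = 0.0
    for j in range(1, n):
        for i in range(max(0, j - max_skip), j):    # allow skipping a few spurious candidates
            period = cands[j] - cands[i]
            prev_period = cands[i] - cands[prev[i]] if prev[i] >= 0 else period
            pitch = abs(period - prev_period) / max(prev_period, 1)
            cost = best[i] + w_pitch * pitch + w_wave * wave_cost(cands[i], cands[j])
            if cost < best[j]:
                best[j], prev[j] = cost, i
    path, j = [], n - 1                             # backtrack from the final candidate
    while j >= 0:
        path.append(cands[j])
        j = prev[j]
    return path[::-1]
```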
So, providing we have clean speech the single channel DYPSA algorithm works
very well. And usually it will produce about 96 percent identification accuracy.
This actually formed quite a large topic within my PhD. And there are lots of
new algorithms that have superseded it and get much more
accurate estimates of the glottal closures. But we'll stick with DYPSA for now.
So going back to your question, the problem arises when we have reverberation.
And it's a particular problem because the reverberation produces all sorts of
nasty spurious peaks in the LPC residual. So in much the same way as with
multichannel LPC we have two options. We could take DYPSA and apply it to
the output of a beamformer or we could extend the algorithm in some way to
make a multichannel variant of it.
Now, let's say we do the latter. One thing we could do is try to generate a series
of candidates for every channel in turn. Now, we know that there are going to be
lots of additional spurious zero crossings due to the reverberation. But due to the
spatial diversity we should expect that only the wanted candidates are the ones that
show some kind of coherence across
channels.
So we augment the cost vector of the dynamic programming algorithm with a term that
measures the interchannel correlation between the candidates. It then penalizes
those with low interchannel correlation and encourages those with high
interchannel correlation. So we have a separate candidate generator for every
channel in turn. And we had a paper at the EUSIPCO conference in
2007 about this.
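The extra interchannel term could be sketched roughly as below, assuming the per-channel LP residuals are available; the window length and normalisation are made-up choices.

```python
import numpy as np

def interchannel_cost(cand, residuals, win=40):
    """Penalty that is low when the residual around a candidate GCI is coherent across channels.
    residuals: (M, N) array of per-channel LP residuals; cand: candidate sample index."""
    segs = residuals[:, max(0, cand - win):cand + win]
    ref = segs[0]
    corrs = []
    for m in range(1, segs.shape[0]):
        denom = np.linalg.norm(ref) * np.linalg.norm(segs[m])
        corrs.append(np.dot(ref, segs[m]) / denom if denom > 0 else 0.0)
    return 1.0 - float(np.mean(corrs))   # low cost for high interchannel correlation
```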
So if we look at what happens when we apply reverberation, in the clean case
with a large database of speech we get about a 96% identification accuracy. An
identification accuracy is defined as the number of correct identifications
divided by the total number of cycles within the voiced speech.
>>: [inaudible].
>> Mark Thomas: The ground truth for this algorithm was done with
synchronous EGG recordings. So an EGG is an electroglottograph. This
involves placing a pair of electrodes on either side of the larynx. And a recording
was then made with a head held at a specific distance from the microphone. And
then time aligned. And it's much more straightforward to pick out the glottal
closure instants from this signal because it doesn't contain any nonstructured
noise from acoustic noise sources. It doesn't have any filtering by the vocal tract.
So we don't have any errors in removing the resonances.
And this too was actually quite a large part of my PhD. And I've published quite a
few papers on GCI and GOI (glottal opening instant) detection from voiced speech.
>>: You said EEG?
>> Mark Thomas: The EGG. The electroglottograph. Or electroglottogram.
>>: [inaudible] what you get?
>> Mark Thomas: Right.
>>: So [inaudible].
>> Mark Thomas: Well, it's a measure of the conductance. So across the
electrodes is an RF source of about one megahertz.
>>: [inaudible].
>> Mark Thomas: It's --
>>: [inaudible] is measuring the [inaudible].
>> Mark Thomas: Yes.
>>: Okay. Thanks.
>> Mark Thomas: The conductance. And the reason why this changes is that if
we're measuring the cross-section of the glottis, it closes in this zipper-like
fashion, and as it comes close together there's a lot higher
conductance, and a lot less when it opens. And if anyone's interested about this later on, then I
can show some examples and how we deal with it.
Yes, so back to the performance. We've got this ground truth that we form from the
EGG estimate. In the clean case we get about 96% accuracy. If we then apply
the single channel algorithm to the output of a beamformer -- sorry, apply the
single channel algorithm to one of the observations, we see that the
identification accuracy drops very, very quickly. And if we observe as well the
time alignment of each identification, that also gets worse. But I haven't shown
an example here.
So one of the possibilities was to take a single channel algorithm and apply it to
the output of a beamformer. And this is shown in yellow and not surprisingly it
does a bit better because the beamformer has helped to attenuate the
reverberant components.
If we then take the same dataset and apply it to the multichannel algorithm it
does better still. So we have another example of how sometimes it's better to
come up with a multichannel variant of an algorithm rather than taking the single
channel algorithm and applying it to the output of a beamformer.
Another plug for a paper is -- has just been accepted on a number of different
algorithms for glottal closure detection from speech signals in lots of different
noise and reverberation environments.
So that was a bit of a sideline on DYPSA. Now, let's get back to the
spatiotemporal averaging. We have everything we need. We've got our
observations. We have an algorithm that finds an optimal set of LP coefficients
to give us residual. We apply that to a delay-and-sum beamformer to get a DSB
residual. We now know where the glottal closure instants occur and are able to
perform this larynx synchronous temporal averaging.
And then this gives us the following. On the top is the clean residual, showing
this nice periodic train of impulses.
In the middle is the reverberant residual which is not quite as bad in this case as
we saw earlier. There are some peaks in there, but some of them are buried in
noise.
Then having performed the spatiotemporal averaging we get this enhanced
residual, which looks much more like the residual in the clean case.
>>: I have a question about the state vector of your dynamic programming
algorithm. Is it -- is it just the location of the glottal closures or is it like the entire
vector of the period. Like what's the actual [inaudible].
>> Mark Thomas: It's -- the internal state is based on the absolute timing of the --
>>: I see.
>> Mark Thomas: Of the GCI.
>>: [inaudible] end up with the set of time --
>> Mark Thomas: So we have a -- yeah. A set of time instants. And as you
pass through time, the instants of glottal closure from the candidates are found.
And with this is associated those six cost vectors. So the vector of six cost
elements. And then it may jump some candidates that are incorrect and then find
this path that, within some region of support, maximizes the likelihood and
therefore minimizes the costs.
There is a trade-off. If you were to take lots and lots of cycles, it becomes quite
computationally intractable and there comes a point where there are too many
different possible paths to choose. What can also happen is if you take lots and
lots of cycles, the pitch period will start changing slightly and if it's different at one
end or the other, then the pitch consistency cost is not going to give you a
reliable estimate. So there is this trade-off. If you make it too short then there
comes a point when you're not benefitting from these similarity features. So --
>>: Isn't [inaudible] from frame to frame? Like presumably, like a standard DP
setup, you have all these candidates at each time step and then you have the
transition costs between them. So the pitch transition cost would be one that
from one state to the next would increase or decrease based on
whether they were consistent or not, right? But if it changed slowly
through a long period, that shouldn't be so bad, should it?
>> Mark Thomas: I'm trying to think how we achieve this. So, yeah, the question
is that if we take lots and lots of cycles then the pitch consistency will change
slowly. And it shouldn't matter from one end to the next.
Now --
>>: We can talk later offline.
>> Mark Thomas: Yeah. I actually forget what -- exactly how we approach this.
I forget whether it's done on a per-cycle basis or whether it's the variation of the pitch period compared with the overall pitch period. If it's the
latter case, which I think it might be, then there's a problem. Because you will
start seeing more deviation. But maybe we can talk about this later.
>>: [inaudible] some sort of larynx center filter. But these are not exactly the
same. So how do you get from reverberant residual to enhanced residual?
>> Mark Thomas: So the reverberant residual first passes through the spatial
filtering with the beamformer. And that will then help to enhance these ever so
slightly.
The larynx cycle temporal averaging is weighted using a Tukey window. So
what this does is prevent it from performing any averaging at the glottal closure
instant -- or it's weighted in such a way that there's less averaging at the GCI. But
there's lots more averaging happening between them. And that's -- I'll go back a
few slides.
>>: So you essentially designed a sort of on the fly a filter that has sort of a -- I
guess it's a spike at the frequency of the glottal impulses. And then is it -- is a stop
band everywhere else, as much as you could. And then you apply it at the right
phase to --
>> Mark Thomas: Yeah. We don't -- we don't think of it so much in the
frequency domain. This is all a time domain averaging. So there's no sort of
pass band and stop band at this point.
>>: [inaudible].
>> Mark Thomas: But, yeah, it's effectively a linear filter. So, yes, you could look
at it in those terms. Actually this brings us very neatly on to what happens in the
case of unvoiced speech. So thank you for pointing that out. And so if I just go
back to where we were. Yeah.
So we have the enhanced residual. So this is dealt with what happens in the
case of voiced speech. Now, there are times in speech that don't have any kind
of periodicity: unvoiced fricatives and so forth, and indeed silence,
don't have this periodicity. So we don't really want to apply this temporal
averaging during these periods. The spatial averaging is okay. But not so much
the temporal averaging.
So that brings on the very last part of this that involves a voiced, unvoiced, and
silence classification. I didn't actually put any slides on this. But the idea is that
we only perform the larynx cycle temporal averaging during the voiced speech.
And at the same time, we make an adaptive filter which we call G, which looks at
the difference between the output of the DSB and the enhanced residual and then
forms a filter that effectively achieves the same as the larynx cycle temporal
averaging.
This is updated during voiced speech, and it's done slowly on every cycle. And
then during unvoiced speech, this filter is then applied instead of the temporal
averaging. So this helps to give a little bit more robustness during the unvoiced
speech. Although I don't have any slides on this, I -- if you -- if we have some
time afterwards I can walk through the way in which we form this adaptive filter.
>>: [inaudible] adaptive trying to get what the larynx cycle temporal averaging is
attempting to do --
>> Mark Thomas: Precisely.
>>: In time? In space? Both?
>> Mark Thomas: Well, both.
>>: One channel?
>> Mark Thomas: Well, it's actually in time because it's driven by the output of
the DSB. So it's only -- it's doing the same thing that the temporal averaging is
doing but not by temporal averaging but instead by an adaptive filter. So it's a
convolutive process that mimics this. I can -- it's -- unfortunately I don't have any
information on this at the moment, but perhaps if we get some time at the end I
can exit this and I can look a little bit closer at the approach that we use. Yes.
But let's say that this works. We have then found an enhanced residual E hat of
N. We can then resynthesize the speech using these multichannel LP
coefficients to give an estimated speech signal S hat of N.
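The resynthesis step is just the all-pole filter driven by the enhanced residual; per frame it amounts to something like the following (SciPy assumed, framing and overlap-add omitted).

```python
import numpy as np
from scipy.signal import lfilter

def resynthesize(e_hat, b_hat):
    """Pass the enhanced residual through 1/B(z), where b_hat are the multichannel LPC weights."""
    a = np.concatenate(([1.0], -np.asarray(b_hat)))   # denominator [1, -b_1, ..., -b_P]
    return lfilter([1.0], a, e_hat)                   # s_hat(n)
```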
And we evaluated this by taking a real room with a T60 of about 300
milliseconds. We placed an eight-microphone linear array with five-centimeter
spacing between the elements. And varied the source-array distance from half a
meter to two meters.
The performance was then measured using the segmental SNR, which is a
means of estimating the SNR on a frame-by-frame basis, and the bark spectral
distortion score. With the segmental SNR, higher is better. With the bark
spectral distortion, lower is better. And what we see is that if we look at one of the observed channels, in this case channel one, we get the worst
segmental SNR and the worst bark spectral distortion. A bit better than that is
the delay-and-sum beamformer, which is performing a spatial averaging only. And
then better than that is the spatiotemporal averaging, which is doing both the
spatial averaging of the beamformer and the temporal averaging over larynx
cycles.
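Segmental SNR as used here is roughly the following frame-based average; the frame length and the clamping values are typical choices, not necessarily those used in the evaluation.

```python
import numpy as np

def segmental_snr(clean, estimate, frame=256, floor=-10.0, ceil=35.0):
    """Average per-frame SNR in dB between the clean and processed speech."""
    snrs = []
    for start in range(0, len(clean) - frame, frame):
        s = clean[start:start + frame]
        e = s - estimate[start:start + frame]
        if np.sum(s ** 2) > 0 and np.sum(e ** 2) > 0:
            snr = 10.0 * np.log10(np.sum(s ** 2) / np.sum(e ** 2))
            snrs.append(np.clip(snr, floor, ceil))    # clamp silent or near-perfect frames
    return float(np.mean(snrs))
```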
So I'd like to play some example sounds that were recorded. So the first is the
output of microphone one.
>>: George made the girl measure a good blue --
>> Mark Thomas: It always cuts off at that point. So again.
>>: George made the girl measure a good blue vase.
>> Mark Thomas: And then the output from the delay-and-sum beamformer.
>>: George made the girl measure a good blue vase.
>> Mark Thomas: So perceptually it sounds like the microphone is closer, but
there's some evidence of the reverberant tail in the background.
If we then apply the spatiotemporal averaging.
>>: George made the girl measure a good blue vase.
>> Mark Thomas: Hopefully it should sound like the components of
reverberation have been reduced. But we're in a fairly reverberant room here,
and I think maybe to appreciate this properly you need to listen on headphones.
>>: [inaudible] play the last two again.
>> Mark Thomas: Yes, please.
>>: What is the distance between the speaker and the microphones in the
examples?
>> Mark Thomas: In this example, it's two meters. Yeah.
>>: George made the girl measure a good blue vase. George made the girl
measure a good blue vase.
>> Mark Thomas: So it's done its job. There are some artifacts. This is a fairly
good example. Maybe if we have time at the end, I can play some bad examples
where what tends to go wrong is the glottal closure instants are identified
incorrectly and then the wrong amount of temporal averaging is applied. And
that then produces the musical noise.
So I would like to summarize this section by saying that multichannel LPC
decomposes a reverberant speech signal into an excitation signal and a vocal
tract filter and that reverberation has a very significant effect upon the LP
residual.
Multichannel DYPSA then segments the LP residual into glottal cycles, allowing
us to form an enhanced LP residual estimate by this technique called
spatiotemporal averaging.
The speech is then resynthesized with the LP coefficients which is what we just
heard. And the result of this is that the dereverberation performance of the
spatiotemporal averaging is better than performing the spatial averaging alone.
And there's a paper that we published on this at WASPAA in 2007. Yes, please.
>>: So see -- it seems like you use a lot of knowledge of speech signal to -- as
predicting what happens at excitation to do [inaudible]. And to do that, you sort
of have to do a lot of the speech production analysis stuff.
So there's some other techniques where you take the same basic approach but
it's a little bit more -- it's more a question of changing objective function like for
example let's say, you know, only [inaudible] come up with some [inaudible] filter
so that the residual is peakier than it is for [inaudible] or minimizing how
[inaudible] do this.
So do you have any sense how -- you know, obviously [inaudible] very good. So
do you have any sense -- does this -- do you get something out of the fact that
you're exploiting a more probability speech knowledge or how this was prepared
and things like that?
>> Mark Thomas: Okay. So the question is that we at the moment exploit the
speech signal, and we know lots of things about how the periodicity of the
speech signal varies over time. And that there are other techniques that do
things like maximizing the kurtosis of the LP residual. Although we've not made
any direct comparisons between them, what we have done in the past is use the
costs of the -- of DYPSA to give us an estimation of the reliability of the voicing.
And rather than applying this technique exclusively to those periods in which
we've found GCIs, we apply it only to those periods where we're pretty confident
we can find it.
Then during those times when we aren't confident in the GCIs then we want to
apply something else. And that something else could include something like
kurtosis maximization and such forth so, yeah, I think there are lots of interesting
extensions that can be done here based on how reliable the estimation of voicing
is and using alternative techniques to try to make this robust.
Does that -- that's about all I can say on that. But thank you for the question. It's
a good question.
Okay. So moving on. The next algorithm I'd like to talk about is very different.
And this uses not voice modeling but channel modeling. And I would like to talk
specifically about an approach called channel shortening, which is quite
popular in the field of RF but not so common in the field of acoustics.
Now, the aim of acoustic channel equalization is to use an estimate of the
acoustic system H and remove it by some means. Unfortunately, although there
are some existing optimal techniques, the effect of noise
and estimation error in finding this H has a very significant effect upon the
quality of the equalization. And it really then limits how applicable it is to the real
world scenario.
But what I would like to talk about here is a technique that we've been working on
called channel shortening and see how this helps to increase the robustness. So
let's jump straight into a system diagram. This is very similar to the previous
case when we have a speech signal that passes through this convolutive
process H with some noise to produce the observations X.
And the aim is to find some filters G that when convolved with the observations
and then summed over all channels produces an estimate of the speech signal.
And we find these Gs using an equalization algorithm. And this is driven by a
system identification algorithm that aims to estimate these signals -- these
impulse responses H.
So we can formulate the problem as taking a channel H and convolving it with an
equalizer G that when summed over all channels produces a target function.
And this target function is usually an impulse or a delayed impulse where there's
just one nonzero tap.
This can be represented in matrix form, where we have a filtering matrix H and
the coefficients G that we want to find that then produce this desired response B.
And the ultimate aim is to use these filters to form a filtering matrix, a convolution
matrix that when applied to the observations gives us an estimate of the speech
signal.
Now, one way of doing this is to use the MINT algorithm, the multiple-input/output
inverse theorem. And this can be formulated as a multichannel least squares
problem that aims to minimize the difference between the equalized response
and the desired response. And this can be found fairly straightforwardly using
the Moore-Penrose pseudo-inverse or some other kind of pseudoinverse of the
filtering matrix multiplied by the desired signal. This thing gives us the G hat.
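In code, that least-squares formulation could be sketched as below, with one convolution matrix per channel stacked side by side; the equalizer length Li and delay tau are assumed to satisfy the MINT length criterion.

```python
import numpy as np

def conv_matrix(h, Li):
    """(L + Li - 1) x Li convolution matrix of channel h."""
    L = len(h)
    H = np.zeros((L + Li - 1, Li))
    for i in range(Li):
        H[i:i + L, i] = h
    return H

def mint(channels, Li, tau=0):
    """Least-squares multichannel equalizer: stack per-channel convolution matrices
    side by side and match a (possibly delayed) unit impulse."""
    H = np.hstack([conv_matrix(h, Li) for h in channels])
    d = np.zeros(H.shape[0])
    d[tau] = 1.0                                      # target: delta delayed by tau samples
    g, *_ = np.linalg.lstsq(H, d, rcond=None)         # Moore-Penrose pseudo-inverse solution
    return np.split(g, len(channels))                 # one equalizer per channel
```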
And this can provide exact solutions. It can give an exact inverse providing there
are no common zeros in the acoustic impulse responses H.
So if we were to take the impulse response from the source to one of the
microphones and factorize it, no zeros should be in common between the channels. If
there are common zeros, this produces an ambiguity because we don't know
whether a zero in the received signal is due to the source or due to
the acoustic impulse response. So it's important then that there are no
common zeros.
The other criteria are that we have to have at least two channels and that the equalizer
has to fulfill some length criterion. Now, all three of these are possible. The
problem comes in that the filters H used to estimate the inverse filters G must
contain no errors. Now, for a practical scenario, this can't possibly be the case.
There will always be some degree of error in the measurement apparatus or the
system identification approach.
So we could relax the problem. It may not be necessary to equalize the signal
entirely to produce a delta function. What we could do is appeal to a branch of
psychoacoustics that says that the ear cannot hear low-order reflections. So
those low-order reflections from walls and tables and so forth are perceived not so
much as reverberation but perhaps as coloration. This has been shown
subjectively not to be so detrimental to the intelligibility of the speech.
So we have in the middle here a stylized diagram that shows the direct path
impulse response which is what we would really like to get at, the low order
reflections which then combine into lots of high-order reflections that produce this
reverberant tail.
If we then make a weighting function that incorporates a relaxation window, as we
call it, of length LR, we can say, okay, these low-order reflections, let's not bother to try to
remove them. Let's just let the equalizer do whatever it likes at that point
and only try to attenuate the reverberant tail.
For this we use an algorithm that we call the relaxed multichannel least
squares (RMCLS) approach. And this involves a modified cost function that contains this
weighting matrix W that has zero entries within the relaxation window, and
therefore, by finding the minimum squared error solution, those taps are
completely unconstrained.
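A sketch of the relaxed cost, reusing the conv_matrix helper from the MINT sketch above: rows inside the relaxation window simply get zero weight so those taps are left unconstrained (the placement of the window directly after the target tap is a simplification here).

```python
import numpy as np

def conv_matrix(h, Li):
    """(L + Li - 1) x Li convolution matrix of channel h (same helper as in the MINT sketch)."""
    H = np.zeros((len(h) + Li - 1, Li))
    for i in range(Li):
        H[i:i + len(h), i] = h
    return H

def rmcls(channels, Li, tau, relax_len):
    """Relaxed multichannel least squares: don't-care region of relax_len taps after the target."""
    H = np.hstack([conv_matrix(h, Li) for h in channels])
    n_rows = H.shape[0]
    d = np.zeros(n_rows)
    d[tau] = 1.0
    w = np.ones(n_rows)
    w[tau + 1:tau + 1 + relax_len] = 0.0              # W has zero entries inside the relaxation window
    g, *_ = np.linalg.lstsq(H * w[:, None], w * d, rcond=None)
    return np.split(g, len(channels))
```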
Now, what --
>>: [inaudible] the previous slide. [inaudible] you care about the tail?
>> Mark Thomas: Yes.
>>: So just that you want to get rid of the dependence in the middle part?
>> Mark Thomas: Yeah.
>>: Then what's the head? I'm kind of confused?
>> Mark Thomas: I'm sorry. What's the problem, sir?
>>: So what are the rows -- what are the rows of the error function? Go back.
You're saying that -- in terms of -- there's a column -- there's an
error, an error vector. It's a column that's H G minus D -- H tilde G minus D is the
error, right?
>> Mark Thomas: That's right.
>>: So what does each row in that error vector, in that residual represent?
>> Mark Thomas: That represents the magnitude -- so the amplitude of any one
of the resulting taps from the equalization. So G is a vector that is then
convolved with H -- using this convolution matrix H. And this then
produces a vector which is the response. So if you were to then apply this to a
speech signal ideally that would just be a delta function or a delayed delta
function. So all you would get out from the reverberant speech is the
non-reverberant speech. So ideally we just want a delta.
>>: So why do you have [inaudible] that says yes, I want a -- I want a -- yes, I
want zero -- yes, I want one, yes, I want zero, I want zero, I want zero, I don't
care, I don't care, I don't care, I don't care, I want zero, I want zero, I want zero,
and I want zero, right? So why is it -- why is there a tau, why [inaudible] equal
one?
>> Mark Thomas: Okay. So the reason why tau is not equal to one is because
we are working with finite lengths of impulse response. Now, let's -- let's say
that we were to take the single channel case. It's not possible to form a
complete inverse. It's only possible to form a least squares inverse. And if we
have a finite number of taps, and that number of taps is in the order of the
lengths of the channel, it forms very tight constraints on the values of the taps.
And the result is getting some very, very large taps that are far too
large to be used for any kind of practical system.
If we increase the length, say we were to increase the length of the equalizer to
infinity, then the magnitude of those taps is no longer so great.
So the -- you're always forced by this length criteria. Now, one way of getting
round the -- this magnitude, this tap amplitude problem is to find a target function
that is not a delta but a delayed delta. And this then relaxes the causality
constraints. So let's say that we've got a direct path signal and then a reflected
signal, that reflected gallon is always going to come later on. And what we really
want to do is take that reflected signal and sort of time align it with the direct
path. This implies semantic causality.
Now, if we have a target function that is a delta with no taps before it, that
[inaudible] can't happen. And the only way it can get round it is to use what little
bit of information is in other taps to try to do whatever it can to get this target.
>>: So it's really I want zero, I want zero, I want zero, I want zero, I want one, I
don't care, I don't care, I don't care, I don't care.
>> Mark Thomas: Precisely.
>>: Okay.
>> Mark Thomas: Maybe I should have explained in a little bit more detail about
what this tau is meant for.
There was a paper actually recently by an author whom I forget that looks at
what happens with the filter gain. So this is the L2 norm of the filter. And looks
at the gain as a function of this tau. And it shows that the robustness can be
improved by increasing the tau and introducing this non-causality problem. And
by increasing this tau we can reduce the gain and then make it more robust.
This RMCLS algorithm does it in two ways. It can exploit the tau. It can also
exploit the fact that we don't care what these taps are, so it can then take
whatever value it likes.
And if we then take some Monte Carlo simulations where we've just taken the
room and made some random locations of source and receiver and then look at
the mean filter gain, we see that when there is no relaxation window at all, which
is equivalent to the MINT approach, the gains are massive. They can be in
excess of 50 dB. And for such a large amount of gain, any tiny amount of noise
or system error is going to manifest itself as a huge error in the output.
So while this could work very nicely in the no error, no noise environment, as
soon as you come out with a practical scenario, this is not suitable.
But by having this relaxation and allowing those taps within the relaxation window
to take whatever value they like, the gain is significantly reduced. And even for a
relaxation window of 50 milliseconds, perceptually this is actually not too
bad. But it allows us to reduce this gain a long, long way and therefore make it
much more robust to real world deployment.
I'll give you an example now of what happens when we introduce a little bit of
error. So we simulated a room of 10 by 10 by 3 meters with a T60 of 600
milliseconds and placed two microphones within there.
We forget about noise for now, but instead we add Gaussian errors to the
channels H to yield a mismatch between H and the perturbed system of minus
40 dB. This is a very small amount of error.
On the top left is channel one. And here we can just about make out these
small taps due to the low-order reflections. And this is then followed by the
reverberant tail.
The second is the equalized impulse response on the output of the MINT
algorithm. And although we can't see it here at the zero tap, so here tau is zero,
there is a value of one or very close to it, and this is followed by lots of very small
taps. So it's done what it should do by suppressing those taps.
In the third and fourth case are two examples of channel shortening, RMCLS and
another algorithm that is based on maximizing the Rayleigh quotient. I won't
talk in any great detail about that now.
But in both cases, the relaxation window has been set to 50 milliseconds. And
we can see that these taps are taking any nonzero value they like followed by a
very much more attenuated tail.
Now, if we then look at the energy decay curve, which is a measure of how the
energy in the impulse response decays -- and, in the last three cases, the
equalized impulse response -- we see that in blue is the decay of the first channel.
The MINT output is shown in red. Now, it's a bit deceptive because in B it looks
like there are lots of very small taps compared with the direct path. But actually if
you look at the energy decay curve we see that it has actually increased the level
of reverberation. So rather than performing dereverberation it's just replaced it
with another kind of reverberation. So it's failed in this case to achieve anything
useful. And I'd like to remind you that this is actually with minus 40 dB error and
no noise. So it's a really [inaudible] scenario and still it's gone wrong.
But if we then look at the RMCLS and other channel shortening approach, we
see that there's a rapid decay within the first 50 milliseconds and after that
point the energy level is very much below that of the original channel.
So in this case, the channel shortening approaches have performed
dereverberation.
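For reference, the energy decay curve used in these plots is just the backwards-integrated energy of the (equalized) impulse response expressed in dB, e.g.:

```python
import numpy as np

def energy_decay_curve(h):
    """Schroeder-style backward integration of an impulse response, in dB."""
    energy = np.cumsum(h[::-1] ** 2)[::-1]            # remaining energy from each tap onwards
    return 10.0 * np.log10(np.maximum(energy, 1e-12) / energy[0])
```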
>>: [inaudible].
>> Mark Thomas: Yes. In this case, it's equal to one, yes. And we mention this
in a paper that I think I'll plug in a minute for the forthcoming WASPAA
conference. It gives some more experimental results on this and looks at what
happens when we take lots of different scenarios with varying SNR and varying
system mismatch. And there are audio examples that I'm going to show in a
minute.
>>: [inaudible].
>> Mark Thomas: This is synthetic data at the moment, yes.
>>: [inaudible].
>> Mark Thomas: With the image method, yeah. But we can take the real world
scenario. So we took a real room of five by three by three meters with a T60 of
just under 200 milliseconds. So this is like a typical office environment. We then
performed a supervised system identification for two channels. This was done
using the maximum length sequence approach. And it's using the same setups.
We had a pair of microphones and a loudspeaker. Using the same setup as we
did to estimate the impulse response we then played some speech through the
loudspeaker, recorded it with the microphones, and then attempted to use these
algorithms to perform dereverberation on the recorded signal.
Now, there is always going to be some degree of error in the system
identification because of noise and finite length errors and all that sort of thing.
There is also some amount of noise within the room due to the air ducting. And
there were people there sort of shuffling around.
And just as an additional objective measure, we used the PESQ approach.
PESQ is an algorithm that gives an estimated predicted mean opinion score, just
as a sort of rough-and-ready approach to give an objective measure as to how well
this has worked.
So I'd like to play the clean speech signal.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: And then the recorded signal.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: Unfortunately, once again, with this room being a bit
reverberant, it's not very easy to tell. And on headphones you can tell the
difference. But what should be noticeable is when we take this recording
and then apply the MINT algorithm.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: So the reverberation is certainly increased. And if you listen
on headphones, there's a lot more noise in the background. So it hasn't
performed any kind of dereverberation.
And I'll just play one more with a channel shortening of 12 and a half
milliseconds, which is pretty short. And this yields a perceptual MOS score
of 3.1.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: And if we compare that again with the recorded.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: I hope.
>>: [inaudible] clean speech?
>> Mark Thomas: Yes, clean speech.
>>: In language, infinitely many words can be written with a small set of letters.
>>: So remind me again, LR is the amount of the delay?
>> Mark Thomas: So L --
>>: [inaudible].
>> Mark Thomas: Okay. So LR is not delay. LR is the length of the relaxation
window. It's the number of samples at the beginning of the target impulse
response in which we say take whatever value you like. Don't try to attenuate it
to zero, just take it to whatever value.
What happens is when you increase the LR a bit further, some degree of spectral
distortion is introduced because the low order taps are then being included in the
[inaudible] this is spectral distortion. So maybe you'd like to listen to that as well.
>>: Language infinitely many words can be written with a small set of letters.
>> Mark Thomas: We missed the beginning. Let's play that again.
>>: In language infinitely many words can be written with a small set of letters.
>> Mark Thomas: And then compare that with the best case.
>>: In language, infinitely many words can be written with a small set of letters.
>> Mark Thomas: There's a small spectral tilt. There's a bit of high frequency
information missing when we make the relaxation window too long.
>>: [inaudible].
>> Mark Thomas: And there are some more results in the forthcoming WASPAA
paper on this. And if anyone's interested, then we can talk about those a bit
later.
>>: [inaudible].
>> Mark Thomas: In this case, it is the optimal. The optimal will vary depending
on the room and the amount of mismatch in the system and the amount of noise.
And the -- can I show you a slide? Ah, yes. So this was a hidden slide. This
shows the case where we took a simulated room and performed some Monte
Carlo simulations where we had one case that only considers noise and one that only
considers channel error. And the noise was varied between 10 dB SNR and an
infinite SNR, and the channel error was changed between minus 10 dB and minus
infinity. So the minus infinity channel error is the best case.
And I think we can just about make out here on the bottom the dashed line. And
this dashed line is the unequalized response. And there are other dashed lines to
show the unequalized response for the different noise
scenarios.
On the X axis is the length of relaxation and on the Y axis is the perceptual MOS
score. And what we see is that provided the SNR is in excess of about 30 dB,
which I think is about reasonable for an office environment, then you can benefit
from introducing the channel shortening. If you do no channel shortening at all
then this drops very, very much. And most of the time you're actually below the
dashed line. So there's no point in performing the MINT algorithm on those
cases. Only in the case of no noise can the channel shortening -- sorry, the
MINT algorithm produce the best score. And here four and a half is the highest
score which says that there is no error in the equalized signal.
Interestingly, on the right-hand side is the case where we consider no channel
error and the dashed line is therefore in the same place all the time because
there's no noise.
We see that actually the algorithm is more sensitive to a certain number of dB
of additive noise than it is to a certain number of dB of channel error. So while
there are lots of papers that look at channel error, the real killer here is actually
the amount of noise that's been added not the channel error itself. And it also
shows that in every case, if we introduce a small amount of channel shortening,
then there is some empirical optimum length, and it seems to vary somewhere
between naught and 20 milliseconds. So on the whole I think in every case if you
pick a channel shortening length of about 10 milliseconds then that's in the area
of this empirical optimum. But some more analysis needs to be done on this to
find an analytic expression for the expected performance as a function of noise,
as a function of channel error, and as a function of the shortening length.
So if there are no further questions on this section, I would like to summarize by
saying that perfect dereverberation can be achieved by the MINT algorithm,
provided the channel is known with no error and provided there's no noise. If we
introduce an error or noise then this is very detrimental to performance. And
we can approach the problem by using the channel shortening technique that
appeals to psychoacoustics and says that the ear cannot distinguish low order
reflections and therefore we aim to equalize only the reverberant tail and to leave
those low order reflections alone.
This achieves improved robustness because the gain is reduced. And it then
allows it to be applied to real world recordings. So the final algorithm I'd like to
talk about in dereverberation applies geometric modeling in addition to the
propagation modeling that is applied in beamforming.
Now, in overview, the idea of a beamformer is to coherently sum or filter and sum
or sum and filter the direct-path component of the multi-microphone
observations. It's this direct path that we want to get at.
If we then consider the image model of a room, we can say that if there is this
room, which is shown in solid black, and then inside it is the receiver or an array of
receivers, shown by a circle, and a source shown in green by the X, then the reflections
produce image sources outside the room; in red here are the first-order image
sources. In this 2D case there are four, and if you take an extension to 3D where
we consider the floor and ceiling, there will be six first-order image sources in this
shoebox environment.
There are also high order reflections where reflections have been reflected and
this produces image sources that are further away, and this produces the
reverberant tail that we talked about in the previous section.
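For the 2D case described here, the four first-order image source positions can be written down directly; a minimal sketch, assuming an axis-aligned room with one corner at the origin:

```python
def first_order_images(src, room):
    """src = (sx, sy) inside an axis-aligned room of size room = (Lx, Ly) with a corner at the origin.
    Returns the four first-order image source positions (one reflection per wall)."""
    sx, sy = src
    Lx, Ly = room
    return [(-sx, sy),            # reflection in the wall x = 0
            (2 * Lx - sx, sy),    # reflection in the wall x = Lx
            (sx, -sy),            # reflection in the wall y = 0
            (sx, 2 * Ly - sy)]    # reflection in the wall y = Ly
```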
So if we can identify where these image sources are, is it then possible to use
them to our advantage? Can we actually then perform some kind of coherent
averaging of not just the direct path signal but the signal from the low-order
reflections? And this approach is called a rake receiver. A rake receiver is a
technique used commonly in communications where reflections from hills and tall
buildings are time-aligned and then coherently summed with the direct path
signal.
What we want to do here is apply the same approach but using acoustics. Now,
this section's a little bit shorter because this is a preliminary investigation, so I
won't go into quite so much detail. But the idea is that if we take this stylized
image of a single channel, there is the large direct-path tap, there are some small
sparse taps caused by the low-order reflections, and this is then followed by the
reverberant tail.
If we take the single channel and then align it with the first order reflections and
then sum, it should then be possible to reinforce the direct path with this sort of
pseudodirect path caused by the image sources. And here we've shown the
single channel case, but it should then be possible to then take the multichannel
case and align the direct path and first order taps for all of the channels.
So let's just look at what a delay-and-sum beamformer does. In this case, we're
looking at it in the frequency domain.
The steering filters can be thought of as a complex exponential, including a term
tau_m, where tau compensates for the propagation delay from the source
to microphone m. And there are some weights which just preserve
some amplitude criteria.
The response is given by the product and sum of the steering weights, which include
this complex exponential, and the transfer function from the source to microphone
m. And when summed over all microphones, we then get the transfer function of the
delay-and-sum beamformer.
And the idea of this complex exponential is to cancel the complex
exponential caused by the delay in propagation. So all we're left with then is
not a delay but just some amplitude term.
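A minimal reconstruction of the expressions being described here, with symbol names assumed from context rather than taken from the slide:

\[
W_m(\omega) = a_m\, e^{j\omega\tau_m}, \qquad
H_{\mathrm{DSB}}(\omega) = \sum_{m=1}^{M} W_m(\omega)\, H_m(\omega),
\]

where H_m(ω) is the transfer function from the source to microphone m and a_m is an amplitude weight. If the direct-path part of H_m(ω) behaves like A_m e^{-jωτ_m}, the exponentials cancel and only an amplitude term a_m A_m remains.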
But at the moment, these taus are a function of m alone. If we consider the rake
case, then we need to think about more sources, which we'll index by lower-case r.
If we limit ourselves to the first-order reflections, the total number of sources, capital R,
is seven, because there's one source within the room and then six image sources
outside the room.
So tau then becomes a function of both r and m. And the response is very
similar, except it includes this additional sum over all channels and all
sources.
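Again as a sketch under the same assumed notation, the rake version simply extends the sum over the R real and image sources:

\[
H_{\mathrm{RDSB}}(\omega) = \sum_{r=1}^{R} \sum_{m=1}^{M} a_{r,m}\, e^{j\omega\tau_{r,m}}\, H_m(\omega),
\]

where τ_{r,m} is the propagation delay from source r (the true source or one of its images) to microphone m.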
So it's natural then to ask what the expected performance of this rake
delay-and-sum beamformer is. Well, one means of doing this is to use the
direct-to-reverberant ratio, which is a measure of the ratio of the energy in the direct
path to the energy in the reverberation. And this is very closely related to the
directivity index, which is sometimes used in the context of beamforming.
Now, it's fairly straightforward to find the direct-path amplitude, as we can use a
3D Green's function for this, where here we've replaced the taus with the distances d
and the omega with a wave number k. The product kd is
actually equal to the product of omega and tau.
There's also a term for the reflection coefficient, which is one for the direct-path
signal and less than one for signals emanating from the image sources
outside the room.
There's a slight simplification in here, in that we're assuming that this beta is not
frequency dependent, but in the real-world scenario there will be some frequency
dependence in the reflections. We don't consider this right now.
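As a sketch of the Green's function term just described, with the notation assumed from context:

\[
G_{r,m}(\omega) = \frac{\beta_r}{4\pi d_{r,m}}\, e^{-jk d_{r,m}}, \qquad k\, d_{r,m} = \omega\, \tau_{r,m},
\]

where d_{r,m} is the distance from source r to microphone m, and β_r is the reflection coefficient: one for the true source and less than one for the image sources. The DRR is then the ratio of direct-path energy to reverberant energy at the beamformer output, usually expressed in decibels.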
For the reverberant component it's a little bit more tricky, as we don't necessarily
have an expression for the expected value of the reverberation. But what we can
do is appeal to statistical room acoustics, which allows us to find the expected value of
the cross-correlation between one microphone and a neighboring microphone.
And this is a function of the area of the walls, the average absorption coefficient,
and the spatial location of the receiver relative to another receiver.
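For reference, the statistical room acoustics result being appealed to is usually written in a form along these lines (up to the exact constant):

\[
E\!\left\{ H^{(\mathrm{rev})}_m(\omega)\, H^{(\mathrm{rev})*}_n(\omega) \right\}
= \frac{1-\bar{\alpha}}{\pi A \bar{\alpha}}\,
\frac{\sin\!\big(k\,\|\mathbf{r}_m-\mathbf{r}_n\|\big)}{k\,\|\mathbf{r}_m-\mathbf{r}_n\|},
\]

where A is the total surface area of the walls, ᾱ is the average absorption coefficient, and r_m, r_n are the positions of the two receivers.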
And now that we have this expression for the expected value of the
cross-correlation and we have an expression for the expected DRR, we can work this
through, and it comes out as this rather nasty-looking expression. But
there is some validity in this, as has been shown by some experimental results.
And what we did was take a room of four by five by 6.4 meters, simulated
using the source-image method with a T60 of half a second. So it's quite a
large, reverberant room where the field is quite diffuse. And it's this diffuseness
that the statistical room acoustics relies upon.
Within the room, we placed a linear array of M microphones spaced by 20
centimeters, and then ran 20 Monte Carlo runs with a random source-receiver
configuration.
And on the bottom here is the output from the theoretical case,
using the expression on the previous slide, and the simulated direct-to-reverberant ratio
from the output of the beamformer. It's plotted as a function of the
number of microphones and the number of sources. And what it shows is that,
unsurprisingly, as we increase the number of microphones we improve the
performance, because we increase the amount of spatial diversity of the receiver.
But also that, provided we have more than about two microphones, if we introduce
additional sources, the output DRR improves as well. So without any additional
hardware, this rake paradigm gives us an improved DRR almost for free, with the
exception of some additional computational overhead.
So the output of the rake delay-and-sum beamformer is shown here for a case of
six microphones and seven sources, where we see a very large tap caused by
the [inaudible] here and some of the direct-path and first-order reflection
sources; some anti-causal taps due to the way in which the signals are
aligned; and this reverberant distribution on the right-hand side. This is
really only a preliminary investigation, but it does show that there is some merit in
applying this rake delay-and-sum beamformer, at least in the simulated
environment. So that's all I would like to say on the rake
delay-and-sum beamformer.
And I would like to summarize by saying that reverberant environments can be
modeled as a distribution of image sources and that low-order reflections form
sparse impulses in the acoustic impulse response. And what we can then do is
perform a temporal alignment of the direct path and first order reflections to form
what we call an acoustic rake receiver. And this preliminary analysis reveals that
improved dereverberation can be achieved compared with a conventional
beamformer.
But there's quite a lot of future work to be done. It would be nice to make an
extension whereby we use optimal beamforming as opposed to straightforward
delay-and-sum. And it needs to be evaluated against alternative approaches
such as the maximum SNR beamformer, which in some sense considers all
reflections based on some estimate of the noise space.
So I'd now like to move on to the last topic of the talk, which we call geometric
inference. And this, like the rake receiver, applies some geometric modeling.
Now, this falls into the category of acoustic scene reconstruction, which aims to
estimate some parameters of the acoustic environment. Specifically, we
want to know where the reflecting boundaries are, based on some acoustic
measurements. And this has been shown to be useful for things such as source
localization and wavefield rendering. The way we do this is to use a geometric
approach based on the time of arrival of reflections.
So if we look at the diagram on the left, we can see a source r_s, some
receivers r_1 through r_3, and a single line reflector.
Now, I'll limit myself to the 2D case for this discussion and just talk briefly about
the extension to 3D at the end. But we can say that the time of arrival of a
reflection is the composite time from the source to the reflector and from the reflector
to the receiver. Looked at in the acoustic impulse response, the
time of arrival is the absolute time at which the peak occurs. And the time
difference of arrival is the difference between, for example, the direct-path
signal in channel one and the direct-path signal in channel two.
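Written out with some hypothetical notation, the two quantities are

\[
\tau_n^{\mathrm{(refl)}} = \frac{\|\mathbf{r}_s - \mathbf{p}_n\| + \|\mathbf{p}_n - \mathbf{r}_n\|}{c},
\qquad
\mathrm{TDOA}_{n,1} = \frac{\|\mathbf{r}_s - \mathbf{r}_n\| - \|\mathbf{r}_s - \mathbf{r}_1\|}{c},
\]

where p_n is the point on the reflector at which the path to receiver n is reflected and c is the speed of sound.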
Now, estimating these TOAs is actually a bit of a problem. On the top left
here is a supervised system identification of a real room, and this is a fairly
ideal case: we had a good loudspeaker, a good microphone, low noise. But
what we see is that rather than having a nicely defined, sinc-like peak as seen
on the previous slide, where we used the image method, we have lots of little peaks
of some compact support caused by the impulse response of the measurement
apparatus. And most of this was in the loudspeaker itself.
And this produces an ambiguity as to where this time occurs. The problem
gets even worse when we consider an acoustic impulse such as a click. So
here we've got a finger click. This has got a slightly wider support, but lots of
little peaks, and it becomes very ambiguous as to where these peaks actually
occur. So it then becomes very difficult to pinpoint an exact time.
So the way we approach this problem is to estimate the impulse response, on the
left for the supervised case and on the right with a click, and then identify the
direct-path signal. We then draw a window around the direct path, time-reverse it,
and then convolve it with the whole impulse response. This is a matched
filter approach.
And the ultimate outcome of this is not to reduce the support but instead to give a
nicely defined single peak that we can identify very well for the direct path and the
first-order reflections. The same approach can be applied to the finger click
case.
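As a minimal sketch of that matched-filtering step, in Python; the window length and peak threshold here are illustrative assumptions rather than the values we used:

```python
import numpy as np
from scipy.signal import fftconvolve, find_peaks

def sharpen_toas(h, direct_idx, half_win=64, rel_height=0.3):
    """Matched-filter an impulse response estimate to sharpen TOA peaks.

    h          : estimated (possibly smeared) acoustic impulse response
    direct_idx : sample index of the direct-path arrival
    half_win   : half-width of the window around the direct path (illustrative)
    rel_height : peak-picking threshold relative to the largest peak (illustrative)
    """
    # Window out the direct-path segment of the response.
    lo, hi = max(direct_idx - half_win, 0), direct_idx + half_win
    template = h[lo:hi]

    # Time-reverse the windowed direct path and convolve it with the whole
    # response: a matched filter that concentrates each smeared arrival
    # into a single well-defined peak.
    matched = fftconvolve(h, template[::-1], mode="same")

    # Pick peaks above a fraction of the largest (direct-path) peak.
    peaks, _ = find_peaks(np.abs(matched),
                          height=rel_height * np.abs(matched).max())
    return matched, peaks
```

The peak times of the sharpened response are then taken as the TOAs of the direct path and the early reflections.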
So that's the first problem out of the way. The second problem is that usually we
don't have exact synchronization between the stimulus and the recording, so we
don't know exactly when the stimulus occurred. All we have is the time
difference of arrival. But this is okay, because we can use existing source
localization algorithms based on the time difference of arrival that localize the
source relative to the array, and then, making some assumptions about the speed of
sound, we can estimate the time of arrival.
So let's say we then have this time of arrival. What can we do with it? Well, a
rather neat result occurs, in that if we know the spatial location of the receiver
relative to the source and we know the time of arrival, then this
parameterizes an ellipse, and we know therefore that somewhere tangential to
the ellipse is the location of the reflector. If we then consider multiple
observations where the array geometry is known, we can form multiple
ellipses and find the common tangent, and this common tangent corresponds to
the location of the reflector.
Now, the way we solve this analytically is to define the line in homogeneous
coordinates, and it's this line that corresponds to the reflector. It is
characterized by the fact that if you multiply this tuple with (x, y, 1), any
point on the line gives zero. We can also define a conic with this matrix,
and there are some constraints in defining an ellipse using this notation. And
it can be proven that a line is tangential to the conic provided that l transpose
C adjoint l equals zero, where the adjoint of C is denoted [inaudible] as C star.
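As a reminder of the standard projective-geometry statement behind this: a point x = (x, y, 1)^T lies on the line l = (l_1, l_2, l_3)^T when l^T x = 0, the ellipse is the set of points with x^T C x = 0 for a symmetric 3-by-3 matrix C, and the line is tangent to the conic if and only if

\[
\mathbf{l}^{\mathsf T}\, C^{*}\, \mathbf{l} = 0,
\]

where C* = adj(C) is the adjugate (adjoint) of C.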
So then, considering all of the channels, we can form a cost function based on the
squared magnitude of this l transpose C star l. Unfortunately, solving this
problem is a little bit tricky because it's a nonlinear least-squares optimization problem,
and if we apply iterative techniques they tend to become trapped in local
minima. Only recently have we come up with a closed-form solution, by
making some assumptions about the nature of the cost surface and taking slices
through that cost surface. I won't go into the detail now, because we've not
got much time left.
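A minimal numerical sketch of that cost function, assuming the conic matrices of the ellipses have already been constructed (the construction from the source, receivers, and TOAs is omitted, and the function names here are illustrative):

```python
import numpy as np

def adjugate_3x3(C):
    """Adjugate (classical adjoint) of a 3x3 matrix.

    For an invertible matrix, adj(C) = det(C) * inv(C); a non-degenerate
    ellipse always gives an invertible conic matrix.
    """
    return np.linalg.det(C) * np.linalg.inv(C)

def reflector_cost(l, conics):
    """Sum of squared tangency residuals |l^T C*_n l|^2 over all ellipses.

    l      : candidate reflector line in homogeneous coordinates (3-vector)
    conics : list of symmetric 3x3 conic matrices, one ellipse per channel
    """
    l = np.asarray(l, dtype=float)
    return sum(float(l @ adjugate_3x3(C) @ l) ** 2 for C in conics)
```

In practice the cost has to be normalized against the scale of l, since l and any nonzero multiple of it describe the same line.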
The case of multiple reflectors is an additional problem, because we've got all
these TOAs and they then need to be grouped somehow to find the individual
reflectors; we've got these TOAs and we don't know which reflector each one
comes from. We have some techniques that don't give a global optimization,
because that is too big a problem and is also non-convex. But that's all I'll say
about that now.
But let's say we have a line l as a ground truth and we have an estimate of the line,
l hat. We could evaluate it by looking at the alignment error, which is just the
normalized inner product between the line and its estimate, and this is
proportional to the cosine of the angle between them. Alternatively, we can just
find the angle between them directly.
This can be misleading, however, because any parallel lines in space will give a
low alignment error even if they're a long, long way from one another. So we
have another error that we call the distance error, where we take the zeroth
microphone, project it onto the line and onto the estimated line, and
look at the distance between the projections.
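A sketch of those two error measures, with each line stored in homogeneous coordinates (a, b, c) for the equation ax + by + c = 0; the exact normalization conventions here are my own assumptions:

```python
import numpy as np

def alignment_error(l, l_hat):
    """One minus the normalized inner product of the two homogeneous line
    vectors, so a perfectly estimated line gives an error of zero.
    (Normalization convention assumed, not taken from the talk.)"""
    l, l_hat = np.asarray(l, float), np.asarray(l_hat, float)
    return 1.0 - abs(l @ l_hat) / (np.linalg.norm(l) * np.linalg.norm(l_hat))

def distance_error(l, l_hat, mic0):
    """Project the zeroth microphone onto the true and estimated lines and
    return the distance between the two projected points."""
    def project(line, p):
        a, b, c = line
        n = np.array([a, b], dtype=float)
        # Orthogonal projection of point p onto the line ax + by + c = 0.
        return p - ((n @ p + c) / (n @ n)) * n
    p0 = np.asarray(mic0, dtype=float)
    return np.linalg.norm(project(l, p0) - project(l_hat, p0))
```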
So we can set up some experiments where we simulate a room with dimensions of
between three and five meters by between four and six meters using the image method. We
placed four microphones and one source in random locations. And for each
Monte Carlo run, we formed five seconds of observations using a Gaussian
noise stimulus.
Noise was then added to the observations to generate an SNR between minus
5 and plus 40 dB. So the first thing we did was take these
observations and try to estimate the channel blindly using the robust NMCFLMS
algorithm, which is a frequency-domain blind system identification algorithm.
The time differences of arrival were then found by a group-delay-based peak
detector, in much the same way as we used the group delay function to work out
the location of GCIs in the first dereverberation algorithm.
The source was then located relative to the array using these TDOAs, and then
the proposed algorithm was applied to all four walls over 200 runs.
And the results are as follows. On the left is an example room where we have
the location of the source and four receivers placed randomly in the room. We
can just about make out the common tangent of the blue ellipses forming the blue
line, and the common tangent of the red ellipses forming the red line. So this is an
example of where the algorithm has worked well.
And then on the right here is a plot of the localization accuracy as a function of
input SNR. By ranking the estimated locations we can see that the best
wall, even in the case of minus 5 dB, still does pretty well. And provided we have
an SNR of greater than about 5 dB, we can estimate even the worst wall.
And this is because the BSI algorithm is able to identify the sparse taps very quickly.
The amplitudes of the sparse taps may not be identified very well, and the
amplitude and location of the tail may not be very good. But provided we can
pick out the time of arrival from the sparse taps, this is enough
to perform the inference.
What we can also see in this dashed line is the source error, the localization error of
the source in meters, and provided the SNR is greater than about 0 dB,
it does okay.
>>: [inaudible] impulses.
>> Mark Thomas: Well, the impulse is over all frequency and up [inaudible].
>>: [inaudible] wide band signal?
>> Mark Thomas: It's a wide band signal.
>>: Okay.
>> Mark Thomas: So it's wideband Gaussian noise that it was stimulated with.
>>: So that's interesting.
>> Mark Thomas: The blind system identification algorithm that we used, the
robust NMCFLMS, has been applied to speech. Now, although we haven't
applied it here, a speech signal can be used to identify the transfer function, the
impulse response, up to a degree of error. And given that the location in time
of the sparse low-order reflections can be found fairly quickly, it may be possible
to perform inference on speech alone. There will inevitably be an
increased error in the source localization, so that would need to be
looked at as well.
And we actually have a paper in the upcoming WASPAA conference that looks at
what happens when we have errors in the source localization and how they
propagate through to errors in the location of the reflectors.
So that was a simulated case. We also took a real-world recording of a room of about
five by six meters with a high ceiling and placed a mini microphone array in the
corner, made up of four microphones as shown here; I think they were placed at
16-centimeter intervals. We placed an additional microphone just slightly off
the line of the array, and the reason for this is to resolve the front-back ambiguity
that's inherent in linear array processing.
We placed a loudspeaker at one of four different locations and performed a
supervised system identification using the maximum length sequence method.
The ground truth is shown with a circle and the estimate with an X. And I think
the error is actually due in part to estimating where the speaker was, because if you've
got a bit of tape, there's only a finite region within the loudspeaker and
it's very difficult to know exactly where the equivalent point source is,
because this is not a point source.
But nevertheless the algorithm was applied, and in green and
blue are the estimated walls; the true walls pass through the origin, parallel to
the axes. And this seems to have done its job quite well. We
can see for one of the measurements, shown in blue, the ellipses, and
there's no ambiguity here because the array is pretty
much normal to the wall. But in the case where the wall is parallel to the array,
this front-back ambiguity comes in, and only the ellipse due to this fifth
microphone in the middle allows us to disambiguate between the front and the
back. But here it seems to have done its job.
So I'd like to summarize this section by saying that localization of 2D reflectors
can be achieved with a geometric approach, and that it can be applied in
very low SNR environments and to real recordings. This is an ongoing
piece of work in the project that I'm working on at the moment. There is the
question of robustness enhancement using the Hough transform, and we have a
paper at the upcoming EUSIPCO conference on this.
As I mentioned earlier, there's the error propagation analysis due to things like
errors in localizing the source and temporal errors caused by finite frequency and
[inaudible] and so forth.
There's also the problem of the extension to 3D, where an ellipse is no longer
the solution; instead the solution lies somewhere on an ellipsoid, and this makes it
computationally much more demanding. But the theory actually carries over quite
neatly from 2D to 3D.
And there's also the application to wavefield rendering. Now, I've not been involved in
this myself, but within our project we've used knowledge of the array of sources
in wavefield rendering relative to a reflector, and then used the image sources as
part of the wavefield rendering. And this has been achieved in a real-world
scenario.
So I would now like to bring my talk to a conclusion. I'd like to summarize by
saying that there are many algorithms that use multichannel observations without
relying exclusively on beamforming techniques, and that some multichannel
algorithms can outperform single-channel algorithms applied to the output of a
beamformer. But beamforming remains an extremely important component of
the algorithms discussed, so we would like to think of these algorithms as
complementary to, rather than a replacement for, beamforming.
The dereverberation algorithms that we've discussed exploit multiple channels by
appealing not only to wave propagation modeling but also to voice modeling, to
channel modeling, and to geometric modeling. And finally, the estimation of
room geometry from multichannel observations can be achieved for practical
environments, and this is useful in many areas of acoustic signal processing.
That includes my talk. I thank you for your attention. And I welcome any further
questions.
>> Ivan Tashev: Thank you, Mark.
[applause].
>>: [inaudible] one, maybe two questions. There were a lot of questions during
your talk. So questions? Thank you again.
>> Mark Thomas: No further questions. Thank you very much.
[applause]