>> Ivan Tashev: Good morning, everyone, those who are in the room and those
who are attending remote -- attending this talk remotely. Good evening to
those who are going to watch the recording this evening.
It's my pleasure to present Nikolay Gaubitch, who did his master's degree at
Queen Mary University of London in 2002 and obtained his Ph.D. at Imperial
College London in 2006.
Nikolay presented here, I think in 2008, some results from his Ph.D.
thesis, and today he's going to present multi-microphone dereverberation and
intelligibility estimation in speech processing.
Without further ado, Nikolay, you have the floor.
>> Nikolay Gaubitch: Thank you.
Thanks. Thank you to those standing up and those watching remotely.
So, yeah, basically, as the title suggests, there are two separate topics, the
dereverberation and the intelligibility, and this sort of represents the two areas of
my life in this field. So one is more from my Ph.D. time and the other bit is
something that I've been working on more recently.
So to begin with, I just wanted to sort of have an overall look at what I'm trying
to solve, or what we're trying to solve. And basically the key idea is that we have some
wanted speech signals somewhere produced by a talker, and we have a
recorded sound at the end, and the idea is that this wanted speech should be
understood by a listener at either a different time or different location.
And in the meantime, this wanted speech passes through a lot of things. You
have an acoustic channel, some additional noise due to other sources possibly
with their own channel effects, microphones, amplifiers, maybe some codecs, and so
on and so forth, and by the time you've got there, there can be
a lot of things happening which, in general, if you have a human listener,
reduce the intelligibility, so what the talker was trying to convey is
not really understood by the listener. There's also lower quality of the signal, so
it's not that pleasant to listen to.
And then I have this productivity thing in question marks which is something that
people worked on related to the project that I'm working on now, to try to see whether
the productivity, say, for example, of a transcriber is affected by noise and noise
reduction algorithms. But there's no sort of conclusive evidence of any of that
yet, so that's why I have the question mark.
Or you can have a machine listener at the other end like speech recognition or
speaker identification. They also perform badly once you have this scenario.
So as I said, the talk will be in two parts. In the first part I will talk about
multi-microphone dereverberation, one method which is a source-filter model
based method and one which uses blind channel identification and inversion, and
in the second part I'll talk about intelligibility estimation. And this is mainly
subject-based intelligibility estimation, so we have subjects in the loop. And I'll
talk about how we use these methods for evaluating speech enhancement
algorithms.
So let's just quickly go over what the problem of dereverberation is that we're
trying to solve. We have some speech signal, s of n, that is produced in a room.
Actually, before I continue, if you have any questions, please interrupt me
because there's quite a few different topics, and if you wait until the end it gets
messy.
So we have this speech signal, s of n, produced in some reverberant room, and
it's observed by a bunch of microphones, and the talker is at some distance from
the microphone array, so we formulate the observed signal at each
microphone as the convolution between the speech signal and the room impulse
response plus some additive noise.
And the aim of dereverberation is to find some estimate of the speech signal s,
possibly delayed or attenuated; that's not a problem. And in most cases
dereverberation is a blind problem, so we don't know anything else but x, just
the observed signal. So we try to do everything based on that.
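As a minimal sketch of the signal model being described here (the function and variable names, the microphone index m, and the noise level are illustrative assumptions, not the speaker's code), the observation at one microphone could be simulated as:

```python
import numpy as np
from scipy.signal import fftconvolve

def observe(s, h_m, noise_std=0.01, rng=np.random.default_rng(0)):
    """Simulate x_m(n) = s(n) * h_m(n) + v_m(n) for one microphone.

    s         : clean speech samples
    h_m       : room impulse response from the talker to microphone m
    noise_std : standard deviation of the additive sensor noise
    """
    x = fftconvolve(s, h_m)[: len(s)]              # convolution with the room impulse response
    v = rng.normal(0.0, noise_std, size=len(x))    # additive noise
    return x + v
```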
So I think there are generally three classes of methods for doing dereverberation.
One is beamforming, where you have some kind of array of
microphones and you do some processing to all these microphones and you sum
them in a way to try and form a beam in the direction of the talker that you want.
And beamforming is obviously a multi-channel, multi-microphone approach
exclusively.
Then there's something I like to call speech enhancement methods where you
may use some model of the speech signal or -- and then modify your reverberant
speech observation in a way so that you better represent this model, or you can
have some model of impulse response and then use that to process your
reverberant observation. An example of this is spectral subtraction type of
dereverberation methods.
And, finally, you have blind system identification and equalization where you
would try and actually estimate the impulse responses and then form some
inverse filters based on this. And, of course, you can have a combination of all of
these things to form methods.
So the first method that I will talk about is based on the speech enhancement
methods where it uses a model of the speech signal. It's quite a straightforward
method really. It starts by looking at the LPC formulation of speech. So if you
have a clean speech signal, you can write it in terms of a linear predictor like this
where a are the prediction coefficients and e is the prediction residual that you
get from this.
And you could have a reverberant observation which was formulated in the same
way. And then what you can look at is the relationship now between the
different -- the two different parameters of LPC. So what is the relationship
between the coefficients a and b and what is the relationship between the
prediction residuals of the two.
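To make the LPC relationship concrete, here is a rough numpy sketch of the autocorrelation method (order 13 is the order quoted later in the talk; everything else, including the helper name, is my own illustration rather than the speaker's code):

```python
import numpy as np
from scipy.signal import lfilter

def lpc_analysis(s, order=13):
    """Estimate LP coefficients a_1..a_p by the autocorrelation method and
    return the prediction residual e(n) = s(n) - sum_i a_i s(n-i)."""
    r = np.correlate(s, s, mode="full")[len(s) - 1:len(s) + order]   # autocorrelation lags 0..p
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])                 # Yule-Walker equations
    e = lfilter(np.concatenate(([1.0], -a)), [1.0], s)     # inverse filter A(z) gives the residual
    return a, e
```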
So here is an example of a bit of voiced speech which could be quite familiar. So
you have -- at the top left is a clean speech signal, some reverberant speech
signal, and here are the residuals for the two signals. So you can see this one is
probably the sort of familiar structure where you have peaks in the residual that
approximately represent the instants of glottal closure and you have the
reverberant residual -- it becomes messy because this structure is destroyed due
to the reverberation.
So reverberation affects the residual quite a lot, but, on the other hand, it's been
found by several people generally that the LP coefficients are much less affected
by the reverberation.
So looking at these things, you can think of a way of doing dereverberation in
three steps. You do LPC analysis of your reverberant observations, you get a
bunch of residuals, you process the residuals, and then you use the LPC
coefficients that you got from the reverberant speech and then you
resynthesize the speech.
So one of the advantages of this is partly that you have much simpler structure of
the clean speech residual and you know what it is, so you can maybe take
advantage of that to some extent. And, yeah, it assumes that the coefficients
will not be as affected.
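A schematic of that three-step analysis-modify-resynthesis loop, reusing the lpc_analysis helper sketched above (process_residual stands in for whatever enhancement is applied to the residual; this illustrates the structure only and is not the speaker's implementation):

```python
import numpy as np
from scipy.signal import lfilter

def dereverb_by_residual_processing(x, order=13, process_residual=lambda e: e):
    """1) LPC-analyse the reverberant observation x,
       2) process the residual,
       3) resynthesize with the (reverberant) LPC coefficients."""
    b, e = lpc_analysis(x, order)                               # step 1: reverberant LPC analysis
    e_hat = process_residual(e)                                 # step 2: enhance the residual
    s_hat = lfilter([1.0], np.concatenate(([1.0], -b)), e_hat)  # step 3: synthesis filter 1/B(z)
    return s_hat
```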
So the way we looked at this thing is something called -- something that
combines spatial and temporal averaging of the residuals. So this is some
motivation behind what the algorithm that I want to talk about is doing.
So, again, if you have a clean speech residual and you have a reverberant
speech residual, you see that the peaks that are from the original residual are
kind of obscured and it's difficult to see where they are.
If you were to take a microphone array and get the residual from the output of
just a delay-and-sum beamformer, you start to get a much better picture of the
structure, right? So you can see these peaks much more clearly, but there's still a lot of
rubbish in between, so the structure between the peaks is quite far from this
thing.
So the first observation is that if we do spatial averaging, we should be able to
identify these instants of glottal closure.
>>: The last one is [inaudible] and then doing LPC or doing LPC [inaudible]?
>> Nikolay Gaubitch: No, this is doing [inaudible] some beamformer and then
doing an LPC.
Yeah, the other observation you can see is that actually the signal between these
peaks is pretty similar, so it changes quite slowly, while the output of the
beamformer residual seemed to be much less correlated in between.
So the idea could be that you could segment these larynx cycles between the
peaks and then do some kind of moving average operation so that you would
take this one and average it with its neighbors and it should reduce the effects of
this and get back to this structure a bit more.
So that's really the idea behind this approach. And in order to do this larynx
averaging you need to be able to identify these peaks, and for that I've used
something called DYPSA, which is -- it's very robust for identifying GCIs in clean
speech, and at the output of a beamformer it performs pretty well as well,
satisfyingly enough.
And also when you do the averaging of these larynx cycles, you don't want to
include the actual peaks in the averaging because they are quite sensitive. So if
you distort some of these peaks you can hear it in the resynthesized speech
afterwards.
So to avoid this, basically what you can do is use some kind of windowing
function around the cycles. So if you take -- for example, what I was using
is just a time-domain Tukey window. You can center it within the larynx cycle
and that would exclude -- basically it would exclude the glottal closure instant
with some margin on the side so that you have some allowance for errors.
So then you basically segment your larynx cycles, average across them without
the peaks, and then put the peaks back. That's what this does. And in the second
step what you can do is instead of actually using these averaged larynx cycles
you can try and calculate -- you can calculate an inverse filter based on your
original larynx cycle and the one that you've just averaged and you update this
filter slowly. So in this way you can partly reduce the sensitivity of GCI detection
errors, but also basically this whole thing that I've described mainly works for
voiced speech, and when you come to unvoiced segments you can continue using
this filter because it should be a relatively good estimate of what happens to
reduce the reverberation.
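A rough sketch of the larynx-cycle averaging step as I read it from the description above: given detected GCIs (from DYPSA or any other detector), each cycle is averaged with its neighbors under a Tukey window that protects the glottal peaks. The window parameter, the number of neighbor cycles, and the resampling of neighbors to a common length are my assumptions, not necessarily how SMERSH does it.

```python
import numpy as np
from scipy.signal.windows import tukey

def average_larynx_cycles(e, gci, n_neighbors=1, alpha=0.5):
    """Moving average over larynx cycles of a residual, excluding the peaks.

    e           : prediction residual (e.g., of the beamformer output)
    gci         : sample indices of the detected glottal closure instants
    n_neighbors : cycles averaged on each side of the current one
    alpha       : taper fraction of the Tukey window (keeps the peaks intact)
    """
    out = e.astype(float).copy()
    cycles = [e[gci[i]:gci[i + 1]].astype(float) for i in range(len(gci) - 1)]
    for i, cyc in enumerate(cycles):
        w = tukey(len(cyc), alpha)              # ~0 at the cycle edges (the peaks), ~1 in between
        acc, count = np.zeros_like(cyc), 0
        for j in range(i - n_neighbors, i + n_neighbors + 1):
            if 0 <= j < len(cycles):
                # resample the neighbor cycle to the current cycle length
                neigh = np.interp(np.linspace(0, 1, len(cyc)),
                                  np.linspace(0, 1, len(cycles[j])), cycles[j])
                acc += neigh
                count += 1
        avg = acc / count
        # keep the original samples near the peak, use the average in between
        out[gci[i]:gci[i + 1]] = (1 - w) * cyc + w * avg
    return out
```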
Yes?
>>: So is the averaging done over a fixed number of windows or are you doing
that adaptively based on how much the residual is brought down?
>> Nikolay Gaubitch: I did it as a fixed. So normally sort of one or two larynx
cycles on each side of the bit you are looking at.
>>: And the other question I have is that it seems like a lot of times if you have a
very reverberant environment, the peaks around the cycles and the larynx
signature get kind of jittered around in time a little bit, because there are
reflections that are coming at different times, so you see, like, this reflection is a bit
higher. And I wonder to some degree, you know, is this averaging time-aligned
in any way beyond the glottal peak? Do you try to do some alignment within the
larynx signature as well after you've done the windowing or is it --
>> Nikolay Gaubitch: No, it's just aligned with the peak itself, yeah. I see what
you mean, because you normally wouldn't have exactly the same --
>>: Right. And you might be zeroing out the signal by just having slight shifts in
the --
>> Nikolay Gaubitch: Yeah. But you try to basically align them with the peak. And
also you keep the original peak position. That's why I don't include the
peak in the averaging process itself. So you just -- you just average with
everything between. So, yeah, I just found that that works best, anyway, for the
different ones that I looked at.
>>: So g is just a filter that approximates the difference between e and
[inaudible]?
>> Nikolay Gaubitch: Yeah. Exactly.
>>: And then that's what you actually use to get --
>> Nikolay Gaubitch: That's what you actually apply, yeah. It's just partly to
know what to do when you have unvoiced speech coming, for example,
because -- yeah. And also it reduces errors that [inaudible].
So this is a diagram basically of -- sorry.
>>: So what does g actually do? Is it basically a [inaudible] filter or what --
>> Nikolay Gaubitch: G is just -- it's a --
>>: What does it come up to be?
>> Nikolay Gaubitch: Well, it should be something which relates to the inverse
filter of the room transfer function to some extent. It's not really a --
>>: You're operating at a smaller --
>> Nikolay Gaubitch: Yeah, yeah, exactly. So it's not going to be enough
to invert the full thing, but it will be kind of a very smoothed version of
that. If you just have a very small size it will --
>>: And the other thing is that that filter is not being applied to the peaks, it's
being applied to the whole signal?
>> Nikolay Gaubitch: It's applied to the peaks as well, yeah. It's applied to
everything.
>>: And I wonder if you might have some -- because there is something like --
you can imagine there might be certain pathologies -- and I'm not saying this
necessarily happens, but you can imagine certain pathologies where the filter
ends up doing a good job of modeling whatever the averaging does, you know,
but then that resulting filter might damage the peaks as well because they weren't
part of the formulation.
>> Nikolay Gaubitch: Well, the peaks are part of the formulation of calculating
the filter.
>>: That's true.
>> Nikolay Gaubitch: So they will hopefully stay the same.
>>: Oh, they are? Okay. So I see. So you keep that in there -- I see. It's
still in there.
>> Nikolay Gaubitch: Yeah.
>>: You don't apply the windowing for when you compute the [inaudible] the
entire thing in that case?
>> Nikolay Gaubitch: When you apply -- yeah, when you apply the filter, yes. It's
only when you do the averaging that you exclude it, so you don't average -- and
then you get it back in.
>>: If you're going to do analysis of this filter and estimate the frequencies
[inaudible], what would it look like? High pass? Low pass? That was the question.
>> Nikolay Gaubitch: Yeah, I understood the question, but --
>>: Do you suppress the high part more than the lower part?
>>: I mean, it seems like you have this low pass kind of thing that's your pitch
periods and this very high frequency noise, so my guess is it's just going to be a
high pass filter, but --
>> Nikolay Gaubitch: Yeah, but it's not really. I mean, it doesn't have any sort of
high pass/low pass structure from what I've seen. It looks like a --
>>: [inaudible] of the glottals are a series of Dirac pulses; they have a flat
frequency spectrum. And you model the noise, which is closer to white noise;
that thing also has a flat frequency spectrum. So what is the [inaudible] here by applying --
>>: [inaudible].
>>: A sequence of deltas in time will also be [inaudible]. So you'd see --
>>: If your frame is longer [inaudible] --
>>: [inaudible].
>>: If it's a short frame it's flat and the noise is flat.
>> Nikolay Gaubitch: Yeah, but the noise isn't really flat. I mean, if you look at it,
the noise isn't really flat. To some extent this noise should have some relation --
>>: If it contains more lower frequencies, then Mike is right, this is -- if you apply a
kind of high pass filter you can improve the --
>>: I see what you're saying.
>> Nikolay Gaubitch: Yeah. But the noise isn't really flat in between. It should
be --
>>: [inaudible] filter applied to just [inaudible] periods [inaudible].
>> Nikolay Gaubitch: No, no, you wouldn't. No.
>>: So it's doing something that's helping dereverberation, but it's doing
[inaudible].
>> Nikolay Gaubitch: Yeah. It's still directly connected to this. But these peaks
here are connected to the reverberation, right? Because you sort of excite the
room with -- that's the way I see it. So every little clap of the glottis sort of excites
the room to some extent.
>>: Maybe a different way to look at the question would be that once you do the
averaging kind of, you know, get your signal into the nice temporal residual that
you want to see, you could actually estimate that g filter over a much longer
window. There's more research that you'd need to do [inaudible].
>> Nikolay Gaubitch: No, no. You could do it over a longer period.
>>: [inaudible]
>> Nikolay Gaubitch: That's true. I mean, the motivation behind this is that
people that did this before, what they did was to try and somehow identify the
peaks and then just flatten out whatever's in between and --
>>: I mean, you had this Gillespie citation which I assume is, like, not trying to
look at the glottis but just trying to say we want a peaky signal.
>> Nikolay Gaubitch: Yeah.
>>: And that's the [inaudible] thing.
>> Nikolay Gaubitch: Exactly. But even that actually flattens out the portions in
between quite a lot when it works well. That's what I find. Because you just look
at a peaky signal, right? So the ideal solution would be --
>>: It seems like that's maybe a little more robust because you don't have to
estimate all this stuff like glottal closure and stuff.
>> Nikolay Gaubitch: Yeah. Yeah. No, it can be. But the problem is once -- if
your ideal is that you just have a peak and flatness in between, at best you get
some kind of robotic solution to the whole thing. That's the problem. So that's
what I was trying to avoid here is to try and sort of get back to the naturalness of
the speech.
So, yeah -- so that's the overall -- the overall idea here. So just to show you a
couple of results of that with five male, five female --
>>: Did you spend a lot of time on acronyms?
>> Nikolay Gaubitch: Sorry?
>>: Does your group spend a lot of time coming up with acronyms?
>> Nikolay Gaubitch: Actually, we do, yeah. This wasn't from our group. I think
it was done in [inaudible].
>>: [inaudible].
>> Nikolay Gaubitch: Yeah, we have a bit of -- there's going to be a lot of
acronyms, actually. That's a good early point.
Anyway, so [inaudible] database with a bunch of male and five female talkers,
and this is with simulated room impulse responses with the image method with
varying T60. And you see there the talker is about two and a half meters away.
So this is an output if you compare just the delay-and-sum beamformer on its own
versus this SMERSH. Yeah, that's actually the acronym for spatiotemporal
averaging method for enhancement of reverberant speech. So, yeah, this is in
terms of segmental signal-to-reverberation ratio. You see that you get some
benefits over just the delay-and-sum beamformer.
>>: So what's the order of the LPC analysis and the [inaudible]?
>> Nikolay Gaubitch: The LPC analysis, 13 here is for 7 kilohertz frequency, and
g is 256 [inaudible] filter.
>>: How are you guys measuring the [inaudible]?
>> Nikolay Gaubitch: So in this case you just take the direct path, which is quite
easy since it's a simulated thing. So I know that --
>>: Whatever is left is your reverberation?
>> Nikolay Gaubitch: Exactly. So when you simulate it, it's actually quite easy to
do.
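For reference, a segmental signal-to-reverberation ratio can be sketched along these lines when the direct-path component is available from the simulation (the frame length and exact normalization are my assumptions; the numbers in these plots may use a slightly different definition):

```python
import numpy as np

def segmental_srr(direct, processed, frame_len=256, eps=1e-12):
    """Mean frame-wise SRR in dB: direct-path energy over everything else.

    direct    : known direct-path component, time-aligned with `processed`
    processed : dereverberated output signal
    """
    n_frames = len(direct) // frame_len
    srr = []
    for i in range(n_frames):
        d = direct[i * frame_len:(i + 1) * frame_len]
        r = processed[i * frame_len:(i + 1) * frame_len] - d   # residual reverberation (+ noise)
        srr.append(10 * np.log10((np.sum(d ** 2) + eps) / (np.sum(r ** 2) + eps)))
    return float(np.mean(srr))
```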
And also in terms of bark spectral distortion, which is something I used to use
quite a lot and supposedly should be closer to what we hear, so, again, you see
there seems to be no -- so you get more benefit when there's more reverberation
from this than when there's nothing, when there's little reverberation.
>>: So go back a second [inaudible].
>> Nikolay Gaubitch: Oh, this is in bark spectral distortion --
>>: Can we see the previous? I'm guessing my next question. If not, I'm going
to ask it.
>> Nikolay Gaubitch: Okay.
>>: So here you're getting [inaudible]?
>> Nikolay Gaubitch: Yeah.
>>: So is that audible?
>> Nikolay Gaubitch: You can hear it, actually, yeah. I mean, yeah, you can
definitely hear the reduction in space. I have some samples on my laptop. So if
you're interested in hearing them later -- there's no point in playing it -- it wouldn't
be audible here, so it's just -- but if you listen to it on headphones, you can
actually hear -- yeah.
>>: [inaudible] if you apply something more sophisticated like let's say [inaudible]
are audibly better than the delay-and-sum beamformer, then what would be that
difference?
>> Nikolay Gaubitch: I don't know. I haven't checked it. This is just compared to
a delay-and-sum beamformer. Yeah, I haven't really --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Yeah.
>>: I have a question about your -- we can go on, but -- so you have these -- it
seemed like you did this study [inaudible] that the LPC coefficients for
reverberant speech are not quite as affected by reverberation -- the LPC
coefficients of speech are not affected by reverberation. I imagine it's -- well, let's
assume that's true. This is a noise-free case, so in the real world there's noise,
and generally noise does have some ability to [inaudible].
>> Nikolay Gaubitch: Absolutely, yeah.
>>: So how -- in the noise case, would you have to change something? You
have to have the room with noise and reverberation, so you change [inaudible]?
>> Nikolay Gaubitch: Well, actually -- yeah, you're right that it wouldn't -- that the
noise does other things, but the way -- in this case the way I calculate the LPC
coefficients is just using multi-channel LPC. So you basically get all the
correlation matrices based on the multiple microphones. So it's eight
microphones here, and that helps actually in noise as well.
And actually -- yeah, that's --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. So you have either the beamformer, but actually I
think when you do the multi-channel LPC it's even better than doing beamforming
first and then -- so actually, yeah, you can hear. The interesting thing is that this
averaging of the larynx cycles actually helps also reduce noise if you have it
because -- so that comes there. So the combined effects you can hear quite well
in --
>>: One more question. So this is assuming that you know precisely where the
source is for the beamformer?
>> Nikolay Gaubitch: In this case, yes.
>>: Did you have a chance to do any sensitivity analysis to how error in that
sound source estimate might --
>> Nikolay Gaubitch: I haven't done that. But we actually -- a bit later after this came
up, we implemented this and did some tests with -- well, we set up the whole
system with -- you know, when you do an actual estimation of the source, actual
source localization using just GCC-PHAT and things like this and real
voice activity detection and all this. And you get pretty close to these results
as well.
So I didn't do much further analysis of this. It was just an idea. We tried it, and it
worked nicely to some extent. And it was kind of fun to try it in a more realistic
scenario, so it seems to perform more or less the same when you do that.
>>: [inaudible]
>> Nikolay Gaubitch: Great. So I'll move on to the next bit, which is on the
channel estimation and equalization, starting with this cross relation between two
channels, basically, which often happens to be the beginning of multi-channel
estimation.
So all it says is that if you have an observation at one microphone and you
convolve it with the second microphone's impulse response, it is the same as taking
the second microphone's observation and convolving it with the first microphone's
impulse response.
So starting from that, you can set up a set of linear equations with R being a
correlation matrix and h all the impulse responses that you're looking at. And you
can identify h by just finding the eigenvector corresponding to the smallest
eigenvalue of R.
And so this works really well if you have no common zeros between the channels
and if the autocorrelation matrix of the source signal is of full rank. So that's the
basic idea of this.
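A small two-channel sketch of that cross-relation idea (x1 * h2 = x2 * h1), assuming a known channel length L and equal-length observations; the data-matrix construction and the function name are my own illustration of the standard formulation, not code from the talk:

```python
import numpy as np
from scipy.linalg import convolution_matrix

def cross_relation_identify(x1, x2, L):
    """Blind two-channel identification from the cross relation x1*h2 = x2*h1.

    x1, x2 : equal-length observations at the two microphones
    L      : assumed channel length (number of taps)
    Returns h1, h2 up to a common scale factor.
    """
    X1 = convolution_matrix(x1, L, mode="full")   # X1 @ h2 == convolve(x1, h2)
    X2 = convolution_matrix(x2, L, mode="full")
    A = np.hstack([X1, -X2])                      # A @ [h2; h1] should be (close to) zero
    R = A.T @ A                                   # correlation-like matrix
    eigvals, eigvecs = np.linalg.eigh(R)
    v = eigvecs[:, 0]                             # eigenvector of the smallest eigenvalue
    return v[L:], v[:L]                           # h1, h2
```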
And then you have -- you can continue with this and do an adaptive formulation,
which is good since acoustics generally change quite a lot when you move
around, and trying to do it adaptively seems like a good idea. And by minimizing
this error you can get to something which looks quite similar to an LMS filter, and
there are also other versions that implement this.
So this is all good, and it works well if your conditions are very ideal, but once
you start having noise, this whole formulation becomes quite messy. So
basically what your error function -- if you're trying to minimize this, it also tries to --
it tries to minimize two things. One is this original thing that you had and the
other one is this bit which includes noise.
And it causes -- it's one of the causes of something which makes the convergence
behavior of these filters very strange, so you start converging, and then at some
point it misconverges into something else. And it does this independently of how
good your signal is, more or less, unless it's perfect. So you can see that even
things like 35dB and 40dB SNR, which is way too good for any realistic thing, it
has this strange behavior.
So one way we looked at trying to [inaudible] this is to add some constraints,
some extra knowledge, about what you're trying to estimate, and one of them
would be to just assume that you know the direct path, say that you've estimated
it from something else, so you add the constraint that you keep the direct path in
your estimate as the correct one.
Or another one which could be -- is to assume that your energy distribution in
your room transfer function is uniformly distributed over frequencies, so it's
relatively flat, and keep that as a constraint.
So one of the misconvergence things that you see is that it actually produces
some kind of extra filtering, and it makes the estimates sort of skewed here and
there. So that's where this constraint came from.
>>: K is frequency or --
>> Nikolay Gaubitch: K is the frequency bin, yeah. So this would be the
magnitude -- sorry, the power per band in the room transfer function.
>>: Isn't this assumption that you have a basically cut off [inaudible]?
>> Nikolay Gaubitch: Yeah.
>>: So pretty much this means that these surfaces [inaudible]?
>> Nikolay Gaubitch: Yeah, more or less. Yeah. So overall it would be kind of
flattish when you look at it, especially within the frequency bands that you're
looking at --
>>: [inaudible]
>> Nikolay Gaubitch: Well, also in -- yeah. I think in common rooms it's fine if
you don't look at too high frequencies. So if you're within the telephone
bandwidth, you'll be okay, generally, with this. So this was all done with sort of
telephone bandwidth in mind, so I think it's fine. Yeah, once you start going into
higher frequencies, then you get the frequency dependence and it's different.
>>: [inaudible]
>> Nikolay Gaubitch: Yes, exactly. So we did telephone bandwidth, which is -- I
think 8 kilohertz is still pretty hard-core for these algorithms to try and estimate
impulse responses in terms of the number of taps that you get.
And either way, I mean, the idea is that you try and add some extra information
about what you're trying to estimate, and you add some extra -- basically it can
result in an extra penalty term in your adaptation algorithm to try and control this.
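Schematically, and in my own notation rather than the talk's, the constrained adaptation minimizes the usual cross-relation error plus a weighted penalty, for example a spectral-flatness penalty on each estimated channel:

\[
J(\hat{h}) \;=\; \sum_{i<j}\big\| x_i * \hat{h}_j - x_j * \hat{h}_i \big\|^2
\;+\; \lambda \sum_{m}\sum_{k}\Big( |\hat{H}_m(k)|^2 - \bar{P}_m \Big)^2,
\qquad \bar{P}_m = \frac{1}{K}\sum_{k} |\hat{H}_m(k)|^2 .
\]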
And if you take -- it looks pretty much the same. Whichever constraint you take
of these, you have this behavior which was without the -- I don't know if you can
see the plots here. So this was without the constraint and then if you do have the
constraint it sort of stops where it should, although it slows it down a little bit.
And here is what the channels look like for this case. You have five artificial
channels. So each of these columns is one channel, basically, and this was the
original bit that we had, and when you're trying to estimate it, this is what it looks like
without any constraints, and if you do put some constraints in, it helps, so you get
a pretty good estimation. But, yeah, these are truncated to 128 taps, these
channels.
>>: A 128-tap reverberation?
>> Nikolay Gaubitch: Yeah. It's pretty short.
>>: So and the MC -- the red [inaudible] is --
>> Nikolay Gaubitch: So the red [inaudible] is the misconvergence. So it's at this
point.
>>: So what does it look like at the [inaudible]?
>> Nikolay Gaubitch: At this point it looks like this. The problem is, though, how
you find this point and stop.
>>: What's MPM?
>> Nikolay Gaubitch: Oh, yes, I should have mentioned. So MPM is
misalignment, but it's normalized --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah, it's used quite often. So it's the same as
misalignment as used in normal channel estimation, but it normalizes so that you
avoid any scaling effects. So if the channel is scaled differently from the
estimate, it takes that into account.
Yeah, so that's -- so, yeah, I mean, I have looked at some ways of trying to find
where this point is, because that would be a good idea as well, if you can find
this and stop here, but -- yeah, it becomes quite tricky.
Okay. So that's -- I think that's about all I was thinking of saying about channel
ID. It's still a very tricky problem, and to do it with very long channels, it becomes
difficult.
What I want to look at quickly is also, say that you do have an estimate of the
channel. What can we do to actually remove this channel from the reverberant
speech? And the sort of original work on the multi-channel thing was this -- trying
to equalize all the channels simultaneously using an equalizer g, and you have
some desired output of what you want to get once you've equalized your
channels, which is an impulse that can have arbitrary amplitude and some
arbitrary delay.
And in this you can set up some least squares formulation on that, and you can
calculate estimates of your inverse filters. So that's the fundamental idea, which
sounds all good, but as everything else with this, it's full of -- it can be full of
problems.
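A toy sketch of that least-squares design (in the spirit of MINT-style multichannel equalization; the target delay, equalizer length, and matrix construction here are my illustrative choices, not the exact formulation on the slide):

```python
import numpy as np
from scipy.linalg import convolution_matrix

def design_equalizers(channels, Lg, delay=0):
    """Least-squares multichannel equalizer design.

    channels : list of M (equal-length) estimated room impulse responses h_m
    Lg       : length of each equalizer g_m
    delay    : delay of the desired impulse target
    Minimizes || sum_m h_m * g_m - d ||^2 over the stacked equalizers g.
    """
    M = len(channels)
    # Each block maps g_m to h_m * g_m; stack the blocks side by side.
    H = np.hstack([convolution_matrix(h, Lg, mode="full") for h in channels])
    d = np.zeros(H.shape[0])
    d[delay] = 1.0                                  # desired equalized response: a delayed impulse
    g, *_ = np.linalg.lstsq(H, d, rcond=None)       # least-squares inverse filters
    return g.reshape(M, Lg)
```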
So one of the problems, of course, is even if you have a good estimate of your
channels, you have several thousand taps in your channels, so you'll make the
actual design of the filters very long. You can have -- you usually wouldn't have
perfectly known channels. You have inaccuracies in your channels, and trying to
equalize these perfectly with inaccurate estimates, you'll get into a lot of
problems as well.
And you can also boost noise. So if you have noise and you have your -- you
have your channels trying to invert it like this, it will also increase the noise that
you had in the beginning.
So one of the things I looked at was to try and do something about these two
problems. So it was basically just quite a straightforward translation of the
problem into subbands. So if you have -- this is the full band formulation of the
equalization, so you have your speech signal going through the transfer functions
which you now know, and you design the inverse filters.
You can just translate the whole thing by assuming that this whole thing happens
in the subbands. So the only trick that is needed in this is to find the relationship
really between the full band estimates and the subband estimates of these -- of
the room transfer functions and then you can design the equalizers.
So you basically design and equalize in each subband separately and then you
reconstruct. So that reduces both the complexity quite a bit and also it makes it
less sensitive, to some extent, to inaccuracies on the channel estimates that you
may have.
So here's an example of how that works. You have -- in this case we had,
like -- we had a GDFT filterbank with 32 subbands decimated by a factor of 24,
and with this you get a reduction of about 120 times in terms of the floating point
operations that you would need to design the filters. So it's quite an
improvement.
So here we have simulated the mismatch. So if you have an estimation error, it's
just simulated by adding some noise to the true channels and evaluate the
outcome in terms of magnitude and -- magnitude deviation and linear phase
deviation. So what you look at is if your -- how far your equalized spectrum is
from flat and how far your equalized phase is from linear.
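These two measures can be sketched as follows (a rough reading of "distance from flat magnitude" and "distance from linear phase"; the exact definitions behind the plots may differ):

```python
import numpy as np

def equalization_deviations(equalized_ir, nfft=4096):
    """Deviation of an equalized response from an ideal (scaled, delayed) impulse.

    Returns the magnitude deviation in dB and the linear-phase deviation in
    radians, both as standard deviations across frequency bins.
    """
    H = np.fft.rfft(equalized_ir, nfft)
    mag_db = 20 * np.log10(np.abs(H) + 1e-12)
    mag_dev = np.std(mag_db)                        # spread around a flat magnitude response
    phase = np.unwrap(np.angle(H))
    k = np.arange(len(phase))
    slope, intercept = np.polyfit(k, phase, 1)      # best linear-phase fit
    phase_dev = np.std(phase - (slope * k + intercept))
    return mag_dev, phase_dev
```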
So this is the outcome of that, which -- so this was done with five microphones,
4800 tap channels, again, simulated with image method, and this is -- the results
average over 100 channel realizations, so basically you have the room and you
keep your microphones and your source at the same relative position and you spin
them around in the room to get various realizations of an impulse response.
So you can see the circles are the full band version of the equalization in terms of
magnitude and phase, distortion, and if you do it in the subbands you get some
improvement, and you have system mismatch along the x axis.
So there's still an improvement, and especially if you go sort of above minus 32 --
below minus 32dB system mismatch you can get quite decent results with this.
And if you look at this in terms of dereverberation, just going back to repeat the
same slide, it's exactly the same setup as the dereverberation with
SMERSH, just to see what you could achieve.
In this case the impulse responses are again not -- they're not fully estimated, so
you assume that you know the impulse response, but you add some distortion to
sort of simulate different mismatch that you get.
And this is what you get in terms of segmental SRR with that different MPM, so
different misalignment. So this one hundred is the reverberant case, then you
have the beamformer, so this is the [inaudible] beamformer just as before, and
these are with progressively smaller estimation error of the channel; the better the
channel estimate you have, the better dereverberation you get. And so, yeah, if you had the
perfect channel and everything was great, you could actually achieve pretty good results.
But even down here, that's sort of zero to minus 32dB, you still perform quite
well with that.
And in terms of bark spectral distortion, it's pretty much the same. So, again, this
is the reverberant one, the delay-and-sum beamformer, and in here all the versions, where you
see that with anything above minus 30 -- below minus 30dB misalignment you get
pretty good audible results. And it's something you can hear as well.
So just to conclude that part of the talk -- yes?
>>: Do you know if, like -- I'm just trying to get a sense of the realisticness of
using all these techniques. So if I am here and I [inaudible] and I move my head
six inches to the left, how much -- do you have any idea how many MPM -- dB of
MPM between those two filters there will be?
>> Nikolay Gaubitch: I don't know it in terms of MPM, but I know that even if you
move very little, trying to equalize with what you had here already, you can
provide quite a lot of distortion. But I don't know in terms of MPM exactly what it
is.
>>: See it seems like by the time you figure out what the filter is, unless the
person is really staying still, the chances --
>>: [inaudible].
>>: No, right. But I'm saying in a real scenario it seems --
>> Nikolay Gaubitch: In a real scenario, yeah, it's -- it's very difficult, and that's --
yeah. That was -- as I said, this was one part of my Ph.D. work, and one of the
conclusions that I did get to is actually -- I think this is nice because it gives you
hope that at some point you might be able to do perfect dereverberation, you
know, that it's there in a very artificial scenario. But at the moment, the way
things work, it's not really possible because you need to be able to track the
impulse response within seconds -- less than seconds.
So I think in general, between these two algorithms, you have on one end of the
spectrum this SMERSH which gives a little bit of dereverberation, but you could
actually implement it on a computer and it will do something, and then you have
this blind system identification and equalization which potentially can give you
really good dereverberation, but at the moment, yeah, you need to calculate for a
few days at one location or something to actually get something out of it.
So I'm going to move on to the next bit, which is intelligibility estimation.
So I have until 12, right? Okay.
So just with a bit of background, intelligibility estimation is -- what we're trying to
do is -- going into noise now -- is to find a relationship between SNR and how
much you can understand what is being said --
>>: You mean a human being?
>> Nikolay Gaubitch: In my case it's human being, yes. Yeah. So, yeah, this
stems from the project that I work on now.
And so the question is how you estimate this. Of course, one way is to have
automatic estimation, so that would be really good if you could have things. And
there are some methods to do that, some of which are intrusive. So intrusive
estimation means that you have the clean signal and the noisy signal, and you
can plug them into an algorithm and it gives you out a number.
And then you have the sort of oracle of everything which is non-intrusive where
you should be able to just give it a noisy signal and it gives you out a number of
how good the intelligibility is.
And, yeah, there's quite a few methods out there for intrusive and very little for
non-intrusive. There's been some recent attempts on that, but not fully verified
yet.
And then it's the subject-based estimation where you, instead, have listeners,
people sitting there and listening and somehow indicating what they hear.
So in that case you can have two ways of looking at it. One is that you fix your
SNR for a given noise. So you play samples at the fixed SNR and you count
how well these subjects understand what's being said, then you get out the
percentage of correct recognition. Or you can have a variable SNR and a fixed
performance that you want. So you set it and you say I want to find the SNR for
which the recognition rate or the intelligibility rate is 75 percent, and then you
vary the SNR until you find the right spot.
So this constant stimuli version where you fix the SNR is quite slow and this one
is somewhat faster, and in the work that we were trying to do this subject-based
SNR is what we went for and what we actually needed. So I'm going to talk a
little bit about that.
So the objectives of what we were trying to do is basically trying to assess the
effect of speech enhancement algorithms on the intelligibility and also to maybe
find -- see if there are safe regions of operation in these methods, so if somebody
gives you some speech cleaning algorithm, to tell them does it harm
intelligibility or not. And often you get the speech enhancement algorithm with a
slider where you can adjust some parameters. So you can try and look at the
parameter space and see if there are some regions in this parameter space
where it could be relatively safe to use this.
Now, although these automatic estimation algorithms exist, they're good at
estimating the intelligibility of speech and noise when nothing else has
happened, so it's just speech and noise. But they're not very good at predicting
what happens after you've processed the speech. So that's one of the problems.
And this is ongoing work to try and actually come up with good automatic
methods to predict intelligibility for speech enhancement methods.
The subject-based methods, on the other hand, for this type of problem result in
a lot of people having to listen to a lot of data, which takes a lot of time. And if
you pay these people, which you should for ethical reasons, it gets quite
expensive.
So we needed something that you can do quickly with people, and -- so this is
the basic procedure of this adaptive test for intelligibility. So in general that's
what you do. You have a trial, a trial n, you present the listener with something,
a sample, and the listener has to indicate what they think they heard. And you
present this at a certain SNR.
The listener indicates in one way or the other I heard this or that, and you score
that with 1 if it was correct or 0 if it wasn't.
Then you take this knowledge, you have the knowledge of your SNR and what
the subject indicated, and you have to adjust the SNR somehow to present it
back to the subject.
And typically this would be -- what people use is a fixed step-size way of doing it.
So you basically say your step-size is 1dB, and if you're looking for intelligibility
performance of 50 percent, you play the sample, and if the subject is right you
decrease the SNR by 1dB, and if the subject is wrong you increase it by 1dB,
and at some point you will converge hopefully to something which gives you 50
percent intelligibility points in terms of SNR.
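A minimal sketch of that fixed-step, one-up/one-down staircase, which hovers around the 50 percent point (the respond callback, starting SNR, and trial count are placeholders, not details from the talk):

```python
def fixed_step_staircase(respond, snr_start=0.0, step_db=1.0, n_trials=30):
    """Simple 1-up/1-down adaptive procedure.

    respond(snr) : callback returning True if the listener answered correctly
                   at the given SNR in dB, False otherwise.
    """
    snr, history = snr_start, []
    for _ in range(n_trials):
        correct = respond(snr)
        history.append((snr, correct))
        snr += -step_db if correct else step_db    # harder if right, easier if wrong
    return snr, history
```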
>>: Are you just giving one probe and getting their response or are you doing an
average over --
>> Nikolay Gaubitch: No, you have to do an average over quite a few. So that's
the whole point. So because you don't know what the SNR you're looking for is,
you have to start somewhere arbitrarily.
>>: No, no, I meant that let's say right now I'm at SNR equals, you know, 2.
Then are you giving, like, 10 probes to the user at that point and then averaging
to get their performance rate or are you just doing one and getting back a single
response?
>> Nikolay Gaubitch: You get -- well, it depends a little bit on the data. In our
case you do one and you get a single response. So you do --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. And then you can step after that.
So, really, actually the question that I'm getting at is how to best adjust the SNR
at each probe so that you get to the right SNR for a given intelligibility as quickly as possible.
That's really the fundamental problem here.
And one way to try and do this is to look at something -- this is something called
the psychometric function of the speech, which is quite commonly used in
psychometrics and psychoacoustics.
So basically this is what the relationship between SNR and intelligibility looks
like. You have a sigmoid function which links the two, and that sigmoid has kind
of a point of intelligibility that you're interested in, say 75 percent, you have a
slope of that function, and you would have a threshold, which is the SNR.
And, also, there's shifts and scaling. So, for example, you have this one, which is
the guess rate, so people can sometimes guess the right answer depending on
how many options you have, and you have the lapse rate, which is basically a point
when you should hear what was being said but you don't because you're thinking
about lunch or something, for example.
So starting with this thing, we use this background model rather than just doing a
step-by-step adjustment of the SNR trying to select the SNR so that we get the
most information about the underlying psychometric function at a given
intelligibility level.
So you can take the psychometric function and with all these parameters of --
yeah, the lapse rate and the guess rate, so you have the [inaudible] this, and
then this is your sigmoid function, and you can take more or less anything which
represents a sigmoid function. In this case we used the cumulative normal
distribution function, which is a nice sigmoid, and basically the slope and the
threshold are controlled by the mean and the variance here.
So now trying to estimate the underlying psychometric function is really just by
estimating the mean and the variances of that and find where it fits best.
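Written out in my own notation, the psychometric function described here, with guess rate gamma, lapse rate lambda, and a cumulative normal Phi whose mean and standard deviation set the threshold and slope, is:

\[
\psi(\mathrm{SNR}) \;=\; \gamma + (1 - \gamma - \lambda)\,
\Phi\!\left(\frac{\mathrm{SNR}-\mu}{\sigma}\right)
\]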
So using this as a background, what we can do is say, well, select the next SNR
value of the next probe so that we minimize the variance of the slope and these
threshold estimates. Right?
And just to briefly sort of go through how we do this, you start off with a
2-dimensional PDF where -- yeah, so you have the slope and the thresholds for
all possible psychometric functions, you initialize this PDF to something. In the
next step you can calculate the probability of getting a response r at a given
SNR. Then you estimate the posterior probabilities for each of these
psychometric functions that you have underlying it. And using this, you can find
the variance of these two parameters, the parameters of slope and threshold,
and also the expected variance that we were looking for.
So then you basically select the next SNR that gives you the smallest expected
variance for the estimation of the two parameters. Then you present --
>>: So you don't know that the probe will be -- oh, I see. You're looking at either
if it's a [inaudible] --
>> Nikolay Gaubitch: Exactly. Yeah. [inaudible] because I don't know what it's
going to be.
So what you do then is you present it to the listener, you get the answer, and
then you update your PDF based on the answer and the SNR that you tested.
And, yeah, this is called BASIE, speaking of acronyms. It took a while.
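A compact sketch of this kind of Bayesian adaptive procedure, in the spirit of BASIE as described above (the grid discretization, the guess and lapse values, and the function names are mine, not the published implementation):

```python
import numpy as np
from scipy.stats import norm

GUESS, LAPSE = 0.001, 0.02   # example guess and lapse rates (assumptions)

def psi(snr, mu, sigma):
    """Psychometric function: probability of a correct response at a given SNR."""
    return GUESS + (1 - GUESS - LAPSE) * norm.cdf((snr - mu) / sigma)

def choose_next_snr(prior, mus, sigmas, candidate_snrs):
    """Pick the SNR that minimizes the expected posterior variance of the
    threshold (mu) and slope (sigma) estimates. `prior` is a 2-D grid PDF
    over (mu, sigma)."""
    MU, SG = np.meshgrid(mus, sigmas, indexing="ij")
    best_snr, best_cost = None, np.inf
    for snr in candidate_snrs:
        p_correct = psi(snr, MU, SG)
        cost = 0.0
        for lik in (p_correct, 1 - p_correct):       # hypothetical correct / incorrect response
            post = prior * lik
            p_r = post.sum()                         # probability of that response
            post = post / p_r
            var_mu = (post * MU**2).sum() - (post * MU).sum()**2
            var_sg = (post * SG**2).sum() - (post * SG).sum()**2
            cost += p_r * (var_mu + var_sg)          # expected posterior variance
        if cost < best_cost:
            best_snr, best_cost = snr, cost
    return best_snr

def update_posterior(prior, mus, sigmas, snr, correct):
    """Bayes update of the (mu, sigma) grid after observing one response."""
    MU, SG = np.meshgrid(mus, sigmas, indexing="ij")
    lik = psi(snr, MU, SG) if correct else 1 - psi(snr, MU, SG)
    post = prior * lik
    return post / post.sum()
```

In use, the prior would start as a uniform grid over plausible thresholds and slopes, choose_next_snr would be called before each trial, and update_posterior after each response.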
So, yeah. So that's the idea. And in this way you basically get the quicker -- the
idea is that with this one you should get a quicker estimate and a more accurate
estimate of the SNR you're looking at.
So the other thing -- sorry?
>>: [inaudible]
>> Nikolay Gaubitch: At the moment I just pick a bunch of SNRs and find the best
one. Yeah, exactly.
And, actually -- well, actually I didn't add that on the slide, but in practice you also
resample and rescale the PDF at every iteration so that you get closer to the
actual one. It's quite -- yeah, I'm not sure about a closed form, because you need
the inputs of the --
>>: [inaudible]
>> Nikolay Gaubitch: No. No. Because it's more of an inductive filter with a
person sitting there. So you need this interaction. You just have this underlying
model of the person that you're trying to match.
>>: What I'm saying is you can choose the -- depending on what the form of the
[inaudible] variance was, you can just take the derivative of that expression with
respect to [inaudible].
>>: [inaudible]
>> Nikolay Gaubitch: Yeah, actually --
>>: [inaudible].
>>: In fact, it might even be quite convex. In this case it's not clear that you
actually need to just sample a bunch of points. You might be able to use, you
know, [inaudible]
>> Nikolay Gaubitch: Yeah, maybe. I haven't looked at that yet, but, yeah, that's
one of the ideas, to look at the logistic function to see if you can do better.
>>: And one thing that might be interesting is that -- I mean, I would suspect
[inaudible] is fine because I expect that that bound is probably going to recognize
we have steps of like .1dB and then you look around that to see how much
variation you'd get in the bound based on a small movement, it probably wouldn't
be that big a deal. So it probably just [inaudible] it's probably pretty smooth, I'm
guessing --
>> Nikolay Gaubitch: Yeah. No, it looks -- yeah, [inaudible].
>>: So the question about that -- so you're saying that you choose a bunch of
points. Are you changing the sampling of that, so maybe you did it at some fixed,
you know, .1dB steps or something like that, but do you change the spacing of
that grid as you move along in your algorithm, or is it always the same?
>> Nikolay Gaubitch: It's always the same spacing. It's a fixed --
>>: [inaudible] some fixed kind of sampling along the entire range from 0 to --
>> Nikolay Gaubitch: Yeah.
>>: -- [inaudible].
>> Nikolay Gaubitch: Yeah, you need to sort of fix the ranges as well with the
SNRs, so yeah. But normally, you know, it's quite reasonable to fix it, yeah. You
can have a pretty fine grid which is more or less enough. I mean, you will never
really get that accurate estimates based on the humans anyway, so -- yeah.
And then the other bit which you could do with this is, of course, you could make
it estimate more than one psychometric function at once, which is -- I'll show you
later why that's a good idea.
Yeah, actually -- so the idea of this using -- estimating more than one
psychometric function is that you can simultaneously estimate processed and
unprocessed speech with one subject if you try and speed these things up, and
it's also -- while doing this, you're playing samples to the listener which have
been processed and which have been not tampered with, so it sort of helps with
the attention and keeps the listener happier. That's the one of the thoughts that
we've had.
>>: So what would be the word clue in that case? So they would hear, like, type
one sample and they would give a response, and then they'd hear, like, type two
sample and give another response?
>> Nikolay Gaubitch: No. So that's the next point, actually. The answer to your
question is that you choose which sample to play which gives you the best
minimization of the cost function that you have.
>>: So you're choosing now not only between SNR points but also between
which PF you're choosing at every point?
>> Nikolay Gaubitch: I mean, so if you've already converged very quickly with
one of them, there's no point in keeping playing back samples on that, right?
>>: But is this -- but it seems like these could be completely independent
problems. Like, I mean, does it make a difference to --
>> Nikolay Gaubitch: No, no, it doesn't make any difference to --
>>: To serialize them and do, like, first PF1 and then PF2?
>> Nikolay Gaubitch: You could do that, yeah. Exactly. The only difference it
makes is that there's no point in playing -- in collecting more data if you know that
you've already converged quite well. So if your error is minimized, then there's
another -- so if you have two conditions, unprocessed and processed, for
instance, then if you've done it quite quickly -- if you've got to the right point quite
quickly with the processed one, but the unprocessed one hasn't, you want to use more
time to try and converge with the unprocessed one. So that's the only reason. But it
doesn't actually do anything else --
>>: They don't really help each other?
>> Nikolay Gaubitch: No. The only help -- the idea is that by interleaving
samples in some kind of --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Exactly. That's the idea.
>>: Okay.
>> Nikolay Gaubitch: Yeah. So here is just one example of a simulated version
of this BASIE. So this is just -- just to see that it works if your underlying model is
correct. So this is simulated with the psychometric function model, so you just do
the inverse of that and you generate user responses according to it. That's the
idea with this.
And it's compared with a fixed step-size of 0.5dB, which is pretty good. Normally
people use around sort of 1, 2dB step-sizes for these.
And it's averaged over 500 runs. So here is the BASIE one. So you can see
here is the number of trials that you run, and here are the -- here's the threshold
estimation bias that you get. So you can see that you get pretty accurate
estimation pretty quickly with BASIE, but with the fixed step-size you're going to
have quite a large bias, and also the variance is pretty large with the fixed
step-size, something you can't get away from.
>>: I'm surprised with the fixed step -- I would have expected to kind of more
jump around back and forth across those [inaudible] bias point. Why is it
always -- because, I mean, if you go with the fixed step procedure as you
described earlier, they're getting a probe and then the person says something
and you take a step, you give a probe and they take a step, so you can see how
that would result in continuous jumping around. But I'm surprised that it's always
biased in one direction here. Do you have any thoughts about why that is?
>> Nikolay Gaubitch: I think -- yeah, I'm not sure exactly why. I mean, I think the
reason is probably that you have a very ideal user which does something strange
with that. So you might converge -- because if you look at it --
>>: [inaudible]
>> Nikolay Gaubitch: Because in reality, it wouldn't be like that. In reality you do
get actually those jumps that you're talking about.
So this is really with a very ideal user, so in some sense BASIE is more suitable
to test with this ideal user because it's trying to estimate exactly that function,
while this fixed step-size, when you do that you don't really necessarily have this
underlying assumption of the threshold.
So, I mean, one of the main things that I wanted to see with this is how many
trials, as a minimum, you need to get some convergence which is satisfactory,
and the answer was that you get it at around 30 iterations. You get an answer
which is sort of between plus and minus 1dB, which is usually good enough for
estimation.
>>: I guess one thing that I'm [inaudible] about the simulation here is that what is
the model for the human response here? They'll just respond exactly the
probability that the model says, so if you're -- okay, I see. So you have, like, this
sigmoid here --
>> Nikolay Gaubitch: Exactly.
>>: -- and you know the model is correct, you have some parameters that are
predefined for the simulation, and the next thing is they'll report with 70 percent
accuracy here, and then they just report with 70 percent accuracy and you just
sample from that?
>> Nikolay Gaubitch: Exactly. It's a very ideal condition for BASIE. That's it,
yeah, exactly.
>>: And this still says -- this assumes a single user?
>> Nikolay Gaubitch: Yeah. Well -- yeah. Because you have one model. So it
doesn't really incorporate anything that is more user specific than that.
>>: So one trial is typically, what, a phrase?
>> Nikolay Gaubitch: So -- well, that's a long ongoing discussion in this
community about what one trial is. So it can be a sentence or it
can be a word. As I'll mention a little bit later, the thing I was using for a while
was digit triplets. So you hear --
>>: Three numbers.
>> Nikolay Gaubitch: -- three numbers, and then you have to enter the three
numbers that you heard.
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Exactly. That's the idea. So I'll show you how I use
this for evaluating a couple of methods.
Yeah, so this is where it actually goes to when it was applied for estimating
intelligibility. So this interleaved bit, it doesn't really help, as I said, with the
estimation anyway, but you can sit down one user and he does one trial and you
get information about several different settings, for example, in one algorithm or
any other thing.
Because, actually, this you can apply with anything. You can -- all we do is vary
the noise and you can have -- so you can have reverberation in the background,
you can have a codec after the noise. So I looked at various things with this. So,
you know, the noise is just a variable. And very often it's a very sort of, you
know, decisive variable.
And the way we measure this is something called -- well, it's not called -- I've
called it relative tolerance to added noise, RTAN. People call this kind of thing by different
names, but what it means really is that you look at the threshold SNR for the noisy
speech and the threshold SNR for the processed one, take the difference, and
that gives you sort of the tolerance. So if you have a positive RTAN of 1dB, it
means that you can lower your SNR by 1dB and still get the same performance
in intelligibility. So that's really the whole thing.
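In formula form (my notation), with both thresholds measured at the same target intelligibility level, here 75 percent:

\[
\mathrm{RTAN} \;=\; \mathrm{SNR}^{75\%}_{\text{noisy}} - \mathrm{SNR}^{75\%}_{\text{processed}}
\]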
And this is how people were using this. So it's digit triplets that we're using, and
you have an interface, so you listen to stuff and you just enter your three
numbers.
Yeah, there's a long talk just on how you would actually select all -- the data that
you're going to listen to but I'm not going to go into that.
So here is a relatively small study with this, but it's just to sort of demonstrate
what the idea was. You have six people. The samples -- yeah, these digit
samples were normalized, and you have car noise in the background, and people
listened to this with headphones. And we tested this with noise reduction from a
commercial audio workstation, which I won't mention -- it's not good -- and just
standard spectral subtraction with minimum statistics noise estimates.
And you have to choose the parameters of this commercial audio workstation -- for
example, it has five, six, seven parameters that you can vary, but actually the one
that makes the most difference when you listen to it is the maximum amount of
noise reduction that you apply.
And also it's similar with the spectral subtraction. That's where you can really
hear things happening.
So I took a bunch of settings for that between minus 1dB, so you have hardly any
subtraction that you can hear, to minus 40dB where it really tries to destroy
everything. And if there were any other parameters in the algorithm, I just kept
them fixed to whatever was recommended.
And so, yeah -- so this is really pushing it to 150 iterations, so you do 150
presentations of these samples to the people, which include the estimation of all
of these things, which is pretty quick. So it's 10 minutes --
>>: [inaudible]
>> Nikolay Gaubitch: And you get -- yeah. You get quite a lot of information in
10 minutes from one person, which is pretty good.
And this is the outcome of this. So really the absolute values here are maybe not
as important as the highlights -- well, first of all, generally the spectral
subtraction and this one as well, they destroyed the intelligibility quite a lot
straightaway. Even with small settings, it seems like -- yeah, there's quite a large
effect on the intelligibility. But this commercial system one is definitely a very -has a very strong effect. And, I mean, you can hear it if you set this parameter to
minus 40dB.
Of course, this maximum noise attenuation, I know what it means in this one because
it's a MATLAB implementation, and I know what's done in there. With this one,
I have no idea what it means. So anyway, when you do it at minus 40 it
destroys more or less anything that you can hear and --
>>: So a positive RTAN is when it's making things better and a negative RTAN is
when it's making things worse?
>> Nikolay Gaubitch: Exactly, yeah. So I highlighted these two things here,
which are the positive RTANs.
>>: [inaudible]
>> Nikolay Gaubitch: Yeah. Well, so this RTAN is actually the SNRs, so you
vary the SNR all the time, right? So --
>>: But, I mean, what was your range?
>> Nikolay Gaubitch: Well, roughly between plus 10dB and minus 20dB. Because
what I'm looking for here is 75 percent intelligibility. So for car noise, for that you
need to be somewhere at minus -- I think it's minus 11dB or something like this to
actually get -- unless it's a very good car.
>>: Did you count r as one [inaudible]?
>> Nikolay Gaubitch: No, I did all three digits. So all three digits have to be
correct, yeah.
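In other words, a trial is scored all-or-nothing; a hypothetical scoring function for one trial (names mine) would simply be:

```python
def triplet_correct(presented, response):
    """Score one digit-triplet trial: correct only if all three digits match, in order."""
    presented, response = list(presented), list(response)
    return len(presented) == 3 and presented == response

print(triplet_correct([4, 7, 2], [4, 7, 2]))  # True
print(triplet_correct([4, 7, 2], [4, 1, 2]))  # False -- one wrong digit fails the whole trial
```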
>>: So let me make sure I understand. So it seems like -- in your formulation
you said that, like, I have a target which is, like, 70 percent recognition, and then
your RTAN is measuring that, like, okay, if I care about having that 70 percent
accuracy, this number of dB can drop in the process and still have this thing. So
an alternative way to look at it -- I mean, that makes sense. I think that -- but in
an alternative way you could say -- so the thing is that, like, there would be a
different chart essentially for each target recognition rate that you have. It could
be said that, okay, now you're at about 80 percent accuracy and intelligibility rather
than 70 percent, and it's a different chart.
>> Nikolay Gaubitch: Yeah.
>>: And I wonder -- so let's say now [inaudible] wants to characterize the system
or something like that. Like an alternative way would be that you could say that,
like, each of these settings of dB actually just measure -- instead of [inaudible]
counting the actual intelligibility rate, you know, in each case -- because then you
have a fixed table -- I mean, it's not as good in some ways -- I'm just curious what
your -- do you see what I'm saying as the alternative?
>> Nikolay Gaubitch: Yeah, yeah. Basically you keep fixed SNRs and then you measure the --
>>: Fixed SNRs. Like you just walk through this and now you get down to your
table and then you say that, like, well, given this, what is the -- I'm just curious as
to your thoughts around it. Because I like what you're saying about the RTAN,
but I wonder what your thoughts are about, like, you know, doing RTAN versus
just measuring, like, a fixed, you know, [inaudible], what the pluses and minuses
might be around doing that.
>> Nikolay Gaubitch: So if I understand you correctly it's that -- it's really looking
at the alternative that you have fixed SNRs, a bunch of SNRs, and you have to
measure that.
>>: [inaudible]
>> Nikolay Gaubitch: So, first, it takes longer to do it. The second one is that
you wouldn't really -- it's kind of difficult to know what SNR range to look at, right?
So in this case you actually -- yeah, I mean, it's easier to know what intelligibility
level you're looking for than knowing what SNR range you're looking into,
because the SNR range will change --
>>: But there is this psychometric function that gets you above -- I don't want to
say [inaudible] prior distribution on what these thresholds are.
>> Nikolay Gaubitch: No, but they change for every noise type. So the intelligibility level --
>>: Hugely?
>> Nikolay Gaubitch: Yeah. So if you have babble noise, for example, to get
75 percent intelligibility you have to be somewhere, well, in the vicinity of 0dB, sort of
minus 5, 6dB, I think it is. And if you have car noise, you can be down
to minus 12.
>>: I misunderstood one part of your table. I see. So this noise attenuation,
that's a setting in the algorithm?
>> Nikolay Gaubitch: Yeah, yeah.
>>: Oh, okay. So given a setting in the algorithm --
>> Nikolay Gaubitch: That's it.
>>: -- then you're looking across a bunch of different --
>> Nikolay Gaubitch: Exactly.
>>: Oh. I missed that part. Sorry. I thought this was just a fixed --
>> Nikolay Gaubitch: No, no, no.
>>: I see. This is just --
>> Nikolay Gaubitch: Basically you change the setting of the algorithm and then
you find the SNR for which your intelligibility level is 75 percent.
>>: And now you're looking across a bunch of -- I see.
>> Nikolay Gaubitch: Exactly.
>>: So I think it's kind of [inaudible].
>> Nikolay Gaubitch: Yeah, yeah.
>>: [inaudible]
>> Nikolay Gaubitch: No, I wasn't trying to prove that at all.
>>: [inaudible]
>> Nikolay Gaubitch: Right.
>>: And so -- but there are speech processing algorithms that claim at least to
improve intelligibility like these binary [inaudible] things and these other things
[inaudible].
>> Nikolay Gaubitch: Yeah. Exactly.
>>: [inaudible]. So have you tried -- it would be interesting to see if you could
verify something that correlates with some published results with a positive
outcome rather than a negative outcome.
>> Nikolay Gaubitch: I mean, the trials I've done are something that I cannot
present here, but I have correlated it to -- so I correlated this way to really more
robust intelligibility estimates. So there's actually -- this is quite a weak test for
intelligibility, but it's fast.
So another way to do it is you have a listener who listens to a whole sentence
with keywords and so on and so forth, which is -- [inaudible] does that a lot with
his tests.
And I know that for spectral subtraction and a few other algorithms, we
definitely get really good correlation between this more rigorous test and this
simple test.
We have looked at -- we have actually looked at some of Philip [inaudible]'s
methods and tried to reproduce some of the results, but we haven't really been
able to get this intelligibility improvement that he claims, to be honest, yet.
Although -- yeah, I mean, binary masks, for example, if you have the ideal binary
mask you can easily demonstrate something which sounds really good, right,
because you have this oracle and it sounds fantastic. And it's clear that you do
improve intelligibility, but to actually do something with it, it gets tricky.
I haven't -- because -- yeah, it's because of the nature of the work that we've
been doing. We've tried different things.
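For context, the oracle being referred to is typically the ideal binary mask, which requires the clean speech and the noise separately and keeps only the time-frequency cells where the speech dominates. A minimal sketch, with the 0 dB local-SNR criterion as an illustrative choice rather than anything stated in the talk:

```python
import numpy as np
from scipy.signal import stft, istft

def ideal_binary_mask(clean, noise, fs, nperseg=512, lc_db=0.0):
    """Oracle mask: keep a time-frequency cell only where the local SNR
    (known clean speech vs. known noise) exceeds lc_db; zero it otherwise."""
    _, _, S = stft(clean, fs, nperseg=nperseg)
    _, _, N = stft(noise, fs, nperseg=nperseg)
    local_snr_db = 10.0 * np.log10((np.abs(S) ** 2 + 1e-12) / (np.abs(N) ** 2 + 1e-12))
    mask = (local_snr_db > lc_db).astype(float)
    _, masked = istft(mask * (S + N), fs, nperseg=nperseg)  # S + N is the mixture spectrum
    return masked
```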
>>: So one other question. There's a subtlety here around averaging over users
because -- are you just, like, taking the scores over all the users and just taking
the mean and variance for each, or are you doing an ANOVA --
>> Nikolay Gaubitch: In this case it's just the mean and variance, but, yeah, we tend
to do an ANOVA --
>>: Because especially in this one -- let's say that you have listener A and
listener B, and listener A just kind of has crappy hearing and listener B is good.
You want to account for that in the variation so that --
>> Nikolay Gaubitch: In general we do that, so --
>>: Oh, you do?
>> Nikolay Gaubitch: Yeah. This is -- as I said, this is just a case to sort of
demonstrate what we're trying to do. But, yeah, in reality we have many more
listeners and --
>>: So you are [inaudible]?
>> Nikolay Gaubitch: Exactly. So --
>>: [inaudible].
>>: I'm sorry?
>>: [inaudible] [laughter]
>> Nikolay Gaubitch: So, yeah -- so basically, just to sum up that part, that was the
presentation of the BASIE tool, which can help you estimate
the intelligibility relatively quickly; in about 30 trials you can get an accuracy of
plus-minus 1dB. And I showed you how you can actually use this to try to search
in a sort of space of settings with subjects relatively quickly.
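The talk doesn't spell out the BASIE procedure itself, so the following is only a generic sketch of the adaptive idea -- a weighted up-down track on SNR that converges near 75 percent correct -- and not the Bayesian method the tool actually uses; `run_trial`, the step sizes, and the trial count are all illustrative:

```python
def adaptive_srt_track(run_trial, start_snr_db=0.0, step_down=1.0, n_trials=30, target=0.75):
    """Weighted up-down track (Kaernbach): step up = step down * target / (1 - target),
    so the SNR track converges towards the target proportion correct.
    run_trial(snr_db) should present one digit triplet at that SNR and return
    True if the listener entered all three digits correctly."""
    step_up = step_down * target / (1.0 - target)   # 3 dB up-step for a 1 dB down-step at 75%
    snr, track = start_snr_db, []
    for _ in range(n_trials):
        correct = run_trial(snr)
        track.append(snr)
        snr = snr - step_down if correct else snr + step_up
    # Crude threshold estimate: mean SNR over the second half of the track.
    half = track[n_trials // 2:]
    return sum(half) / len(half)
```

In practice `run_trial` would drive the listening interface described above; a Bayesian procedure additionally maintains a posterior over the threshold and slope and picks each trial's SNR to be maximally informative.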
So thank you for listening.
[applause].
>> Ivan Tashev: Do you actually have more questions?
>>: Yeah, I have --
>> Ivan Tashev: Go [laughter].
>>: So one question is -- I really like this subject. I think that was quite
interesting [inaudible] at this point now you can say, oh, well, we don't want to
use this test, we want to use, you know, more -- the stronger sentence-based
keyword testing and just slip that in [inaudible]. But one question I was going to
ask is what do you know about -- or can you characterize how good that model
is, like that -- you know, that particular sigmoidal model using the normal -- are
there other measurements or is there evidence that shows that that is a really
good model across a variety of situations to use?
>> Nikolay Gaubitch: There is quite -- yeah, there's quite a large amount of
research around that, yeah. Yeah, there's a vast amount of literature on
intelligibility testing, and so --
>>: [inaudible]
>> Nikolay Gaubitch: Yeah, it looks like it. And, actually, the other thing which
we looked at which is less obvious is if you have a signal processing algorithm,
would it just shift the whole psychometric function if you have an improvement or
would it act differently in different parts?
>>: So that's a good question [laughter]
>> Nikolay Gaubitch: [inaudible].
>>: That's okay. That's the wrong answer. Now you're making your life more
difficult [laughter]. Right. Because you could have a certain kind of algorithm
now which effectively changes the shape of what [inaudible] --
>>: [inaudible].
>>: Right. But then that sigmoidal one is no longer the right one.
>> Nikolay Gaubitch: Well, I didn't say that. So, actually, the interesting thing
is --
>>: [inaudible].
>>: Yeah, well -- I mean, but it could -- so it could be different, right? It could be
the case that something -- it might not be monotonic anymore because it could
be the case that, you know, [inaudible] which is really great under a certain level
of noise, but if things get too clean, then it actually does worse, you know,
[inaudible].
>> Nikolay Gaubitch: But so far what we've found, actually, is that -- so doing
this with more rigorous testing, we actually found that with quite a few of the
standard sort of speech enhancement methods that exist, there is just a shift. So
so far, so good. I haven't seen any evidence --
>>: He's talking about minus 11dB for [inaudible]. That's relatively --
>>: [inaudible].
>>: No, that's okay. It's just that, I mean, like -- and it doesn't even have to just
be a shift, right, because you're modeling both the [inaudible]. My concern was
just that if something -- if there were some algorithm -- there may not be. I may
just be positing, you know, something that doesn't exist, but if there were
something where the actual, you know, overall profile of the response with
respect to noise level changed completely -- like if it actually got worse under low
noise, then you might no longer have this thing. You might actually --
>> Nikolay Gaubitch: In some sense, in defense of the case, you actually just -- although we have the whole model of the function, you're still kind of looking at
one point. So I don't know what the answer is, but you look at some kind of
slope. So that kind of shift -- I'm not sure, but --
>>: One thing that I would see is, you know -- another thing that I would say
which is good about your method which you didn't highlight, but maybe you
should, is that if you believe the model and you do your [inaudible], you're not just
getting a point, right? You're actually getting at the curve. So one thing that I
would say is that, like, you know, that you could argue -- and I was wondering if
you would say that when I asked the earlier question -- you could, but maybe you
feel like you'd be reaching too far -- you could say that -- you know how I was
asking that, like, if I changed my threshold and made a whole different chart, you
could argue that if you fit the model well, you could just create that new chart
without running any more samples because you've actually modeled the entire
curve.
>> Nikolay Gaubitch: No, no, exactly. That's right. That's it.
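A rough sketch of that idea (my own parametrization, ignoring guess and lapse rates, and not necessarily the exact model used in the tool): fit a sigmoidal psychometric function to the per-trial outcomes once, then read the SNR off the fit for any target level without new listening trials:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(snr_db, srt_db, slope):
    """Cumulative-Gaussian psychometric function: P(correct) versus SNR,
    with midpoint srt_db (the 50 percent point) and slope in 1/dB."""
    return norm.cdf(slope * (snr_db - srt_db))

def threshold_from_fit(snrs_db, correct, target=0.75):
    """Least-squares fit to per-trial 0/1 outcomes (a maximum-likelihood fit
    would be more principled), then invert for the SNR at the target level."""
    (srt, slope), _ = curve_fit(psychometric, np.asarray(snrs_db, dtype=float),
                                np.asarray(correct, dtype=float), p0=[-5.0, 0.3])
    return srt + norm.ppf(target) / slope
```

So the same fitted curve gives, say, the 70 and 80 percent thresholds as well, which is the point being made above.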
>>: That's a great benefit --
>> Nikolay Gaubitch: No, no, it is --
>>: -- over kind of --
>> Nikolay Gaubitch: I didn't talk much about the slope estimate because I don't
really use it for anything yet. But, yeah, that's another good advantage of this
because if you do the fixed step-size and if you didn't need the whole curve, you
have to sort of sample a few points along the curve, which is basically, to some
extent, what you said. And then you can sort of fit --
>>: I mean, there is an issue of -- you could argue that -- so because you're
focusing on that point, it might only be really valid around that, that region, and
maybe not [inaudible].
>> Nikolay Gaubitch: Yes.
>>: I see. That's what you're saying about the fact that it's region 2. So this
curve is valid only within that --
>> Nikolay Gaubitch: Exactly.
>>: -- the neighborhood of where you care about, that's all that really matters.
>> Nikolay Gaubitch: Because what could happen is that if you have this
algorithm, you can still be valid at the top and [inaudible] --
>>: The problem is that your -- I mean, I think your setup might be convex-ish
right now, but then if things were really weird, like you might end up in some state
where there might be multiple places, you know, which satisfy your
[inaudible].
>> Nikolay Gaubitch: It's very tricky. And once you have humans involved in it,
it's quite difficult, yeah, and the processing. But I think so far it's been fine,
and so far the only evidence that we do have from a lot of people is that this
model actually holds.
>>: The other thing that might be interesting -- maybe I'll save it for [inaudible].
>>: [inaudible].
>>: I'll stop talking. Go ahead.
>>: [inaudible].
>>: Oh, okay. All right. Okay. Good.
>> Ivan Tashev: No more questions? Thank you, Nikolay.
>> Nikolay Gaubitch: Thank you very much.
[applause]