>>Mike Seltzer: Good morning. Welcome. My name is Mike Seltzer, and I'm here to
happily introduce Kshitiz Kumar, who is a Ph.D. student from Carnegie Mellon working
in the robust speech recognition group with Rich Stern, and he's going to tell us all
about the work he's been doing, and I think specifically mostly in reverberation --
>>Kshitiz Kumar: Reverberation, right.
>>Mike Seltzer: -- which, as we all know is a very challenging problem. So here we go.
>>Kshitiz Kumar: Thanks, Mike.
So as Mike said, the objective of this talk is to study the problem of reverberation and to compensate for it in speech recognition applications. As Mike said, I'm Kshitiz Kumar. I'm from CMU, and my colleagues are my advisor, Professor Richard Stern, and Bhiksha Raj, Rita Singh and Chanwoo Kim.
So going forward, we know that speech recognition technologies have considerably matured in the last decade -- in the last 20 years -- but the problem of robustness of these technologies to noise and reverberation still exists. Even though many technologies work great in clean conditions or in matched conditions, whenever there is a mismatch between training and testing data, the performance is not as good. The objective of this talk is to make speech technologies robust to reverberation; we will then also extend them to noise and, finally, to joint noise and reverberation.
Now, before going forward I will briefly show what reverberation is, what it means, and how it impacts speech recognition accuracy.
So reverberation can be thought of as a superposition of delayed and attenuated signals. We can think of a rectangular room where the microphone here will [inaudible] the direct component from the source, but also the reflections off the walls and then the reflections of reflections. And over a period of time we see that we can model this as a linear filter with a room impulse response.
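[Illustrative sketch, not from the talk: the "superposition of delayed and attenuated signals" view is equivalent to convolving the clean signal with a room impulse response (RIR). The synthetic RIR, the T60-based envelope and all parameter values below are my own assumptions.]

```python
import numpy as np

def toy_rir(t60=0.3, fs=16000, length=0.4):
    """Synthetic RIR: white noise under an exponential envelope chosen so the
    power decays by 60 dB after t60 seconds (illustrative only)."""
    t = np.arange(int(length * fs)) / fs
    envelope = 10.0 ** (-3.0 * t / t60)          # amplitude giving -60 dB power at t = t60
    rir = np.random.randn(len(t)) * envelope
    rir[0] = 1.0                                 # direct-path component
    return rir / np.max(np.abs(rir))

def reverberate(x, rir):
    """y[n] = sum_m h[m] x[n-m]: superposition of delayed, attenuated copies of x."""
    return np.convolve(x, rir)[: len(x)]

# Example: reverberate one second of a placeholder signal at a 300 ms reverb time.
fs = 16000
x = np.random.randn(fs) * np.hanning(fs)         # stand-in for clean speech
y = reverberate(x, toy_rir(t60=0.3, fs=fs))
```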
And reverberation is characterized by a parameter called the room reverberation time: the time taken for the signal power to decay by 60 decibels. And as we can see, the higher the reverberation time of a room, the longer the echoes are going to persist, and they will lead to greater self-interference with the original signal.
Now, having said that, here I have tried to show, figuratively, a spectrogram for clean speech and for reverberated speech. One clear distinction we see is that the energy spreads -- the energy overlaps into future time segments, as we expect it will -- and we see that the silence segments here are filled in by energy. And the strong voiced [inaudible] -- the energy from [inaudible] -- is going to overlap into softer sounds.
So the problem is that if you train on this, we're really not going to do a good job testing here. And even if you do matched-condition training, because the phenomenon of reverberation depends on the sound which comes before the current sound, even the matched condition is not going to do very well. We will see that later.
Now, this is a speech recognition experiment on the Resource Management [inaudible], just to see what the impact of reverberation is, how much it affects speech recognition accuracy. We see that for the clean condition here -- training was done on clean speech -- the WER increases from about seven percent to about 50 percent as the reverb time reaches 300 milliseconds, which I think is typical of this room here right now.
So at least for the clean condition, this is really a very big problem. In past speech research people mostly worked on noise compensation, but there have not been many solutions and not a lot of study of reverberation. The focus of this work is primarily on studying the pattern of [inaudible] reverberation and deriving algorithms to compensate for it.
Now, here I present a brief outline of this talk. First of all, we are going to spend some time studying reverberation, how reverberation affects speech recognition accuracy and how we can model that phenomenon directly on the speech features.
My claim is that if we can do a good job of modeling reverberation in the feature domain -- in the spectral-feature and the [inaudible]-feature domain -- and compensate directly in the feature domain, then we can build a good system, because the features are what go to the speech recognition engine. The features are closest to the speech recognition engine, so if we can compensate the features, this will directly benefit the speech recognition system.
So based on that, we first propose a new framework to study reverberation, and then I propose two algorithms, one called LIFE and the other NMF; they use different properties of speech. Then I propose a noise compensation algorithm called delta-spectral features, and finally I propose a joint noise and reverberation model and apply my individual compensations for noise and for reverberation to jointly compensate for both.
And, finally, if time permits, I'll talk about an audio visual feature integration.
So before going on to present my model of reverberation, I will present what has conventionally been done in the past. In the past we treated reverberation as filtering in the time domain, [inaudible] the filter is linear time-invariant. Because of that, in the spectral domain this is multiplicative, in the log spectral domain [inaudible] this is additive, and likewise in the cepstral domain. So what we are saying is that convolution in time is multiplication in frequency, and because of that it becomes additive in the cepstral domain.
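[Written out, the chain being described is, in my notation:]

$$
y[n] = x[n] * h[n]
\;\Rightarrow\;
Y(\omega) = X(\omega)\,H(\omega)
\;\Rightarrow\;
\log|Y(\omega)| = \log|X(\omega)| + \log|H(\omega)|
\;\Rightarrow\;
y^{c} = x^{c} + h^{c},
$$

[so under this classical view reverberation appears as a constant additive shift $h^{c}$ in the cepstral domain.]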
And the model that we arrive at is an additive shift model: we are saying that reverberation introduces an additive shift on the cepstra. Based on that, the first algorithm, way back in 1995 or so, was cepstral mean normalization, which tries to subtract the cepstral mean from the cepstral sequence, and that will try to remove the reverberation. CMN is now used in almost all the systems; it is so rudimentary.
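[A minimal sketch of per-utterance CMN, not from the talk; `cepstra` is an assumed (frames x coefficients) array of MFCCs for one utterance.]

```python
import numpy as np

def cmn(cepstra):
    # Under the additive-shift model, subtracting the per-utterance mean removes
    # the constant cepstral offset h^c introduced by a short channel/room filter.
    return cepstra - np.mean(cepstra, axis=0, keepdims=True)
```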
But the problem is that the speech analysis windows are typically around 25 milliseconds, while the impact of reverberation extends over hundreds of milliseconds, so the model does not apply for such a short duration. To answer that, what people did was speech analysis over seconds -- one- or two-second window durations -- and, based on the CMN framework, they subtract the log spectral mean and do speech reconstruction from there.
But a key problem with this method is that because it uses one [inaudible] seconds of speech, the algorithm cannot really be made online, because we have to wait for that many seconds before doing further processing. So in our work we ensure that we work with short segments so that the methods can still be applied in an online fashion, and we can learn parameters which can be used to compensate.
Now, next I'll specifically talk about the model that I propose here, the feature-domain reverberation model. To present my model, I present this framework here. This is [inaudible] MFCC feature extraction. What we do is take a speech signal s[n], analyze it through a filter bank -- they can be, like, [inaudible] filters -- take a short-duration power here, apply a log, and then do a DCT, and these features go to the ASR system here.
And just to note the convention, the output here at the filter bank is labeled little x, the output at the power stage is labeled capital X, and cap XL here and cap XC here indicate those two later stages.
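[A minimal sketch, not the talk's actual front end, of the pipeline just described: filter bank, short-duration power, log, DCT. The simple Butterworth band-pass bank and all parameter values are assumptions standing in for the real filters.]

```python
import numpy as np
from scipy.signal import butter, lfilter
from scipy.fftpack import dct

def filterbank_cepstra(s, fs=16000, n_filters=40, frame_len=0.025,
                       frame_shift=0.010, n_ceps=13):
    edges = np.linspace(100, fs / 2 - 100, n_filters + 1)   # toy band edges
    frame, shift = int(frame_len * fs), int(frame_shift * fs)
    n_frames = 1 + (len(s) - frame) // shift
    powers = np.zeros((n_frames, n_filters))
    for j in range(n_filters):
        b, a = butter(2, [edges[j], edges[j + 1]], btype="band", fs=fs)
        xj = lfilter(b, a, s)                                # "little x": filter-bank output
        for k in range(n_frames):
            seg = xj[k * shift: k * shift + frame]
            powers[k, j] = np.sum(seg ** 2)                  # "capital X": short-time power
    log_powers = np.log(powers + 1e-12)                      # "cap XL": log spectral output
    return dct(log_powers, type=2, axis=1, norm="ortho")[:, :n_ceps]  # "cap XC": cepstra
```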
What I'm going to do next is study how the reverberation phenomenon, which is happening here, affects these outputs, these consecutive outputs, and try to derive an approximate model to represent reverberation.
Now, as I already mentioned, reverberation happens here, and then we'll study the impact of this reverberation at the later stages.
Now, the first thing we can immediately do, because it's a linear filter, is study reverberation at these filter banks. What we are saying is that the signal first passes through this filter and then goes to the filter bank. And now note that the labels have changed: this signal is no longer little x, it is little y, and similarly capital Y. So the purpose is to relate little y with little x and capital Y with capital X, as mentioned before.
Now, because these two are filters, we can switch the two -- we can commute them -- and then we have an immediate relationship between little y and little x here: it's just filtered by this room-response filter. So the representation that we have is like this, and next we will study the effect of this h filter in the power domain.
And, similarly, we'll study the effect of this filter in each of these different domains. So, to do that -- sorry -- this is where we are right now: we are studying the effect of the h filter in the power domain here. We start from this equation, which is a linear convolution expression, and we take y squared, because we have to take powers, and then sum y squared over a short duration, 25 milliseconds or so.
Now, we can split this into two components: these are the self terms of x and these are the cross terms of x. Then we take an expectation, and finally we still have two terms, one corresponding to x squared and another corresponding to the cross terms of x. What we find is that in principle we can ignore this cross term -- we can treat it as an approximation [inaudible] in my model -- and just retain this term here.
And what that effectively does is linearize the problem: we are saying that y squared is a convolution over x squared, and if you take a short-duration power, it will still remain so.
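[Written out, in my notation, the step being described is:]

$$
y[n]=\sum_m h[m]\,x[n-m]
\;\Rightarrow\;
y^2[n]=\sum_m h^2[m]\,x^2[n-m]
\;+\;\underbrace{\sum_{m\neq m'} h[m]\,h[m']\,x[n-m]\,x[n-m']}_{\text{cross terms, dropped}}
$$

[so that, after summing $y^2[n]$ over a 25 ms frame and taking expectations,]

$$
Y[k]\;\approx\;\sum_m \tilde{h}[m]\,X[k-m],
$$

[i.e. the frame powers $Y[k]$ of the reverberated channel are approximately the clean frame powers $X[k]$ convolved with a non-negative filter $\tilde{h}$ derived from $h^2$.]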
And essentially what I'm doing is replacing this filter here with a [inaudible] filter here, but this filter operates on the capital X sequences: I'm saying that the capital Y sequence is a filtering operation over the capital X sequence, up to an approximation error.
Now, that is all said and done, but I'd like to show how good this approximation is. So here I plot -- this line here is for the clean signal without reverberation, this blue line here is the actual capital Y for the reverberated sequence, and the red line is my approximation.
So if my approximation is good, it should hold up and the red line should follow the blue line here. We see that even though there is an error here, with respect to speech recognition accuracy this approximation is not that bad, and I'm still able to follow the contour.
And the big benefit that I get is that I have linearized the problem and I have a representation of capital Y, the spectral power of y, in terms of the spectral power of the clean speech.
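[A rough numerical check, my own rather than the talk's, mirroring the plot just described: per-channel frame powers of a reverberated noise signal compared against the clean frame powers convolved with the frame-aggregated energy of the impulse response. All signals and parameters are toy assumptions.]

```python
import numpy as np

rng = np.random.default_rng(0)
fs, frame = 16000, 400                                # 25 ms frames at 16 kHz
x = rng.standard_normal(2 * fs) * np.hanning(2 * fs)  # stand-in for one band of clean speech
t = np.arange(int(0.3 * fs)) / fs                     # toy 300 ms decaying impulse response
h = rng.standard_normal(len(t)) * 10.0 ** (-3.0 * t / 0.3)
h[0] = 1.0
y = np.convolve(x, h)[: len(x)]                       # reverberated signal

def frame_powers(sig):
    n = len(sig) // frame
    return np.array([np.sum(sig[i * frame:(i + 1) * frame] ** 2) for i in range(n)])

X, Y = frame_powers(x), frame_powers(y)
H = frame_powers(h)                                   # filter energy aggregated per frame
Y_hat = np.convolve(X, H)[: len(X)]                   # model: Y[k] ~ sum_m H[m] X[k-m]
err_db = 10 * np.log10(np.sum((Y - Y_hat) ** 2) / np.sum(Y ** 2))
print(f"relative approximation error: {err_db:.1f} dB")
```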
Now, the model that we have so far says that I can [inaudible] obtain my capital Y by filtering capital X. So this is the model that we have achieved now, and next we will further study the effect of the log and of the DCT.
Now, starting from this equation here, which I derived as a linear convolution expression, I can formulate the problem in a Jensen's inequality framework. Without the details, what I obtain is this expression here: I'm saying that cap YL can be [inaudible] as a constant plus a convolution over cap XL plus an approximation error.
This also linearizes the problem -- well, the problem is not completely linear because of this term here -- and this is, again, the model that I have derived so far, and this is, again, the same kind of plot with the approximation error. I would like the red line to follow the blue line here.
So, again, there is an approximation error, but it does a good job of following the contour, and it formulates the problem so that we can derive [inaudible] algorithms in an easier framework.
Now, this is the model that I have so far: this is the log spectral model that was derived, and it operates on this log spectral output.
The same thing can be done with the DCT. The DCT is a [inaudible] linear operation, so there is no problem. And, finally, what we arrive at is this model here: the cepstral sequences can be obtained by passing the original cepstral sequence through a filter and then adding a constant. And the next work will build on this model.
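[Putting the pieces together, the feature-domain model just described can be written, in my notation, as]

$$
y^{c}[k] \;\approx\; \sum_{m} h^{c}[m]\, x^{c}[k-m] \;+\; c_{\Delta},
$$

[where $x^{c}$ and $y^{c}$ are the clean and reverberated cepstral sequences, the constant $c_{\Delta}$ is removable by ordinary CMN, and the filter $h^{c}$ is what the LIFE and NMF algorithms below try to undo.]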
Now, we can compare this with the previous model, what was done previously. Previously, as I said before, it's an additive model, so we can see that I'm introducing this new term here: a filtering first and then a constant additive shift.
This can also be compared with past work from our group, but that model applied to the time-domain sequence. What I'm saying now is that I can [inaudible] have the same framework -- a filtering plus a noise plus an error term -- on the cepstral sequence as well, where the approximation errors are small.
The constant c delta can be removed by plain cepstral mean normalization, and next I have to worry about mitigating this filter here. Now, I'm going to propose two algorithms: LIFE, a likelihood-based inverse filtering approach, and NMF.
So, first of all, the LIFE approach. As I said before, the LIFE approach works on the cepstral model here, and because we have, like, 13 cepstral sequences, this approach will be applied individually to all those 13 cepstral sequences.
Now, before going into the details of this algorithm, what I'd like to say is that we now have a model to represent reverberation in the cepstral domain, and next we need to find a compensation algorithm. So the question is how we guide our approach. The problem is that we don't know the [inaudible] response, and we do not know the clean speech spectral or cepstral values, so it's a very hard problem. In order to guide our approach, we need to study some properties of speech which we can apply to guide our optimization towards a direction where speech features lie. And specifically this algorithm is based on maximum likelihood.
So I use knowledge of the distribution of speech features: I can learn the distribution of clean speech features, and then I can guide my optimization in the direction that maximizes the likelihood of the compensated features under the clean speech distribution. That is the speech knowledge I'm using in this algorithm. Otherwise it does not use any knowledge of the room impulse response or of the clean features directly, so it works in a blind fashion.
So what I'm saying is that reverberation takes me from a clean speech feature distribution to a reverberated feature distribution, and then I apply a filter to go back from here to here so as to maximize my likelihood function.
Now, before going forward and showing the method, I'd like to study it with a small example: if I take a simple example, do I get the best answer, or how close am I to the best answer? So here I take a zero-mean, unit-variance signal -- I'm just assuming the signal to be Gaussian like this -- and I apply this simple filter with one single delay tap. The question that I ask is: if I try to find this filter, what is the relationship between this little p and this little h?
And what I would expect is that if I do this, then little p should be able to cancel the effect of this little h here. If I can do that, then at least in a simplified setting, theoretically, I can show that the method is working.
Now, this can be done simply by taking the log likelihood here and differentiating it with respect to the unknown p parameter. What I find is that this little p is approximately equal to minus h. And if we multiply the two filters, we would like the product to be unity so that the effect is canceled out, and we see that the error is only in the second order; it's not in the first order.
So at least to first order, this method is working fine. I can ask another question: what if I have a filter like this, with exponential filter taps, and I do the same thing? What I would expect is that this p should be equal to minus h, and in fact it can be shown that it is. Similarly, in this case I can also have an IIR filter, and again I can show that this little p is equal to little h here.
So at least in simple settings I can show, without doing a speech recognition experiment, that the method is doing what it is expected to do.
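[A small numerical check, my own rather than the talk's: for white Gaussian data filtered by (1 + h z^-1), the maximum-likelihood single-tap inverse filter (1 + p z^-1) under a fixed unit-variance Gaussian comes out with p close to -h, so the cascade of the two filters is unity up to second-order terms in h.]

```python
import numpy as np

rng = np.random.default_rng(0)
h = 0.3
x = rng.standard_normal(200_000)      # zero-mean, unit-variance signal
y = x.copy()
y[1:] += h * x[:-1]                   # y[n] = x[n] + h x[n-1]

# For a fixed zero-mean Gaussian, maximizing the likelihood of the inverse-filtered
# output z[n] = y[n] + p y[n-1] reduces to minimizing its power, which has the
# closed form p = -r1 / r0 in terms of the sample autocorrelations of y.
r0 = np.mean(y * y)
r1 = np.mean(y[1:] * y[:-1])
p = -r1 / r0
print(f"h = {h:.3f}, estimated p = {p:.3f}, residual h + p = {h + p:.4f}")  # ~ h^3
```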
Now, next I'll apply this method to speech recognition. To apply it to speech recognition, I make the assumption that the speech features are distributed according to a Gaussian mixture model. I can learn these parameters from data which I already have available, and then I want to estimate this filter here; it has an unknown [inaudible] term and these little p terms here, so I need to estimate these parameters.
So, again, I set up a log likelihood expression here, and then I do gradient descent -- gradient ascent, rather, maximizing the log likelihood -- and try to obtain the parameters in a step-by-step fashion.
>>: So one equation is different from an LPC?
>>Kshitiz Kumar: This expression?
>>: Uh-huh.
>>Kshitiz Kumar: Here I'm obtaining the filter taps, and I'm applying this on the cepstral sequences. In LPC, what they do is apply this on the speech waveform by using the correlation terms there. But, yes, it is not the same. And --
>>: [inaudible].
>>Kshitiz Kumar: It is. But I'm not -- I'm actually not -- yeah, but LPC is not based on maximum likelihood, right? It is based on squared-error minimization. And in LPC we get terms like the autocorrelation. So, I mean, you raise this point: I would expect that if I have a single Gaussian density here, then I might get the LPC term, because a single Gaussian density reduces to squared error, but because I have a Gaussian mixture model here, I think the answer will be different. Yeah, a single Gaussian density might reduce to LPC. Yeah.
Okay. So as I said before -- I'll skip the details here -- I set up a log likelihood maximization criterion and do gradient ascent from there, and the approach is applied to the different cepstral features.
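[A hedged sketch, not the author's code, of the LIFE idea as I read it: estimate the taps of a short inverse filter applied to the cepstral sequences so that the compensated features maximize their likelihood under a GMM trained on clean speech. The talk uses an IIR filter, per-sequence filters and its own gradient updates; this sketch uses a single shared FIR filter and hands the negative log likelihood to a generic optimizer, so every name and setting here is an assumption.]

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.mixture import GaussianMixture

def life_like_filter(cepstra_reverb, clean_gmm, n_taps=5):
    """cepstra_reverb: (frames x coeffs) reverberant cepstra (after CMN).
    clean_gmm: GaussianMixture fit on clean cepstra of the same dimension."""
    T, _ = cepstra_reverb.shape

    def apply_filter(p):
        taps = np.concatenate(([1.0], p))             # leading tap fixed to 1
        out = np.zeros_like(cepstra_reverb)
        for m, tap in enumerate(taps):
            out[m:] += tap * cepstra_reverb[: T - m]  # same FIR filter on every coefficient
        return out

    def neg_log_lik(p):
        return -np.sum(clean_gmm.score_samples(apply_filter(p)))

    res = minimize(neg_log_lik, np.zeros(n_taps - 1), method="Nelder-Mead")
    return apply_filter(res.x), res.x

# Usage sketch: fit the GMM on clean training cepstra, then compensate test data.
# clean_gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(clean_cepstra)
# compensated, taps = life_like_filter(reverb_cepstra, clean_gmm)
```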
Now, before doing the speech recognition experiments, I'd like to just see how the method is working. Here on the left panel is a cepstral scatter plot, with C1 on the x axis and C2 on the y axis. We see that the cepstral scatter plot is distorted and clean does not match reverberated, because reverberation is going to smear the spectrum and spread it out.
But after applying the LIFE processing we see that, at least in the scatter plot, these two match better, so I would expect that the performance will improve.
>>: So the likelihood assumption [inaudible].
>>Kshitiz Kumar: No -- the GMM is, of course, trained using EM, but when I do my gradient descent, I have the GMM parameters fixed, and then I just do a gradient descent -- a gradient ascent -- later.
>>: Without updating the [inaudible].
>>Kshitiz Kumar: GMMs? Yeah, GMMs are not changed. GMM is trained just from
clean speech features. So that's not changed.
>>: [inaudible].
>>Kshitiz Kumar: You mean --
>>: [inaudible].
>>Kshitiz Kumar: Oh, you mean for the GMMs? Okay.
>>: [inaudible].
>>Kshitiz Kumar: Right.
>>: [inaudible].
>>Kshitiz Kumar: Right.
>>: [inaudible].
>>Kshitiz Kumar: I did not do that because I didn't want to really change the GMMs, because that would then probably change the clean speech feature distribution -- it would not exactly match the distribution of clean speech features.
>>: You maximize the likelihood over the GMM, not [inaudible], because you don't know which Gaussian in the GMM --
>>Kshitiz Kumar: That's true.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, actually, in practice what I do is classify each single feature vector into one of those Gaussians in the GMM, like a posterior, and then I use just that -- just the top one, yeah. Top one.
>>: [inaudible].
>>Kshitiz Kumar: Because that really simplifies the problem a lot. I mean, it makes the
calculations much faster.
>>: So you take the --
>>Kshitiz Kumar: I take the most likely Gaussian for each feature and then just use that information and discard the rest. I did experiment with that, and I found that just that works almost as well in my experiments, so I don't care about the rest. Yeah.
>>: And you're doing this independently across the cepstral?
>>Kshitiz Kumar: No, it's actually done jointly, because the GMM --
>>: You're able to leverage information from one domain --
>>Kshitiz Kumar: Yeah, right. Exactly. I'm not doing it independently. It's all joint. Even though the covariance matrices are diagonal, it's still a GMM, a combined GMM, so I'm [inaudible].
So next I'll present speech recognition results on different databases. So this is one of
those on the resource management database. And I compare with some of the
baseline algorithms which I worked on and are also from different research
communities.
So some of the baselines are MFCC; CPF, which is cepstral post-filtering; FDLP; the long-term log spectral subtraction which I talked about before; and the LIFE filter.
So we see that, going from seven percent to 50 percent, some of the baseline algorithms do not really provide a lot of benefit. This is with training done on clean speech; I'll also talk about the matched condition later on.
The improvement here is limited to about 15 percent at most, but the improvement from LIFE processing is up to about 40 percent here -- a 43 percent relative reduction in error rate -- and similarly here.
So even though the problem formulation is simple and the analysis is simple, I'm able to leverage the fact that I represent reverberation in the cepstral feature domain and directly apply my techniques there. And I'll next show some examples of the --
Yeah?
>>: So even though the training was clean, did you put any of the [inaudible] through
any processing?
>>Kshitiz Kumar: Yes. So I feed clean speech also through my processing, yeah.
>>: [inaudible].
>>Kshitiz Kumar: Yes. Yes. All of them.
>>: [inaudible].
>>Kshitiz Kumar: On the clean also, yes. Yeah.
I mean, I haven't especially looked for LDLSs, but in general I find that ->>: [inaudible].
>>Kshitiz Kumar: Yeah, right. In general I find that it is also important to pass clean speech through the processing, even if the processing does not do much, even if it just reconstructs a little bit. It provides a better match. So, yeah, all of these were passed through this processing.
Okay. So in this example I plot the frequency responses of the filters that I get. As I said before, I'm trying to estimate the filter parameters. These are four different utterances for the same reverberation time, 300 milliseconds. The first thing I expect is that the frequency response should have a high-pass nature, because the room impulse response has a low-pass nature -- it smears the spectrum, it's like adding things -- so the inverse filter should have a high-pass nature, which I do find here.
The second thing I expect is that even if the utterances are different, because the room impulse response is the same, these filter responses should be almost the same for all the different utterances. If I could show that, it means what I'm learning is not so much dependent on the speech utterance; it is more dependent on the room acoustics.
So as we see here, figuratively looking at these utterances, we see that it is trying to do that. And to test this specifically, what I did is take 20 different utterances, compute an average filter over those 20 utterances, and then apply that single average filter to all the rest of the utterances. So here is an example of the average room impulse response for a particular condition, and here are the results.
So this is MFCC, this is LIFE obtained by working on the utterances individually, and this line is obtained by applying the single average filter, learned from 20 utterances, to all the rest of the utterances.
So this does suggest that even though there is a bit of error -- of mismatch -- in general I'm learning the room acoustics more and the speech less, because just one filter applied to that particular condition works about as well.
Okay. So I did experiments on different databases. This is a large room, but still simulated, and this result is from the ATR database. The ATR database was collected in Japan, where they played out audio from the [inaudible] database and collected it with a microphone in different rooms.
So this experiment is a little bit different. They also collected room impulse responses in different rooms, so this experiment was done by taking a room impulse response from the ATR database, applying it to my clean speech and getting reverberated speech. So this is still a [inaudible] result; it compares the difference between a simulated room impulse response and an actual room impulse response. And we do see that the improvements definitely hold up even for an actual room impulse response.
So, so far I talked about the clean condition, where training was done on clean speech and testing across different reverberation times. But as we have noted recently -- actually not very recently -- that is not the best thing to do, and people in industry especially are training across all the different conditions they can find, across all noise and reverberation preferably. They mix it into the training database, so it's like multi-style training.
So the question is whether the algorithms hold up in the matched condition or not. That is what I'm going to evaluate next.
So to do that, I did this experiment here. I took two different rooms and got room impulse responses from them -- this is from the ATR database -- and I have this whole bunch of conditions. From these I did multi-style conditions, so the training database has room impulse responses from different rooms and also different conditions.
Actually, there are more results, but I just present one result here. This experiment was specifically done by training on room 1 with these reverberation conditions, and then testing was done on a large number of different conditions here: these are on room 1, these on room 2, and this is from the ATR database.
So the first thing which is very noticeable here is that the baseline performance improves drastically even for MFCC -- I mean, not the clean, the baseline MFCC performance shows a huge improvement. When I did this experiment first, I think I was surprised to see that, because just the matched condition shows about 60 to 70 percent relative improvement just with MFCC.
But the good thing is that applying the LIFE processing helps further. These are highly matched conditions -- this data is from room 1, with nearly the same reverb times -- and in these highly matched conditions I'm demonstrating about 15 percent relative improvement. These are slightly less matched, but we can still see that the MFCC performance shows a huge improvement even though the matching is not strict, and in these cases the LIFE filter provided about 20 percent relative improvement.
So the point is that the improvement is slightly smaller if the matching is very strong, but in general we cannot really expect that in a practical setting, because we cannot have a perfect match in any case, right? So whenever the matching is not totally perfect, we'll still get a very significant improvement, even in the matched condition.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, this is still resource management.
>>: [inaudible].
>>Kshitiz Kumar: On the same database, yeah. On the same set, right, yeah.
Actually, I'll answer that later.
Okay. So this is sort of the same experiment but on Wall Street Journal. In this experiment I train on room R1 with these reverb times, and I test across room R1 here. Again, one thing that's noticeable is that the clean error actually increases here, as in the previous experiment, because now the training is done across a variety of room impulse responses, so the clean performance degrades for the baseline MFCC. But, again, the LIFE processing shows improvement on the clean condition here.
Yeah?
>>: [inaudible].
>>Kshitiz Kumar: This is word error rate.
>>: [inaudible].
>>Kshitiz Kumar: Actually -- okay, let me. Sorry, I should have explained this better. There are actually two sets of experiments here: one is the matched condition and one is the clean condition. So here, as you can see, these two experiments are training on clean and these two experiments are training on the matched condition.
So we actually need to compare how much improvement we get over the matched condition and then also how much further we improve over the clean training. So as I said before --
>>: You were probably going to say it, but the matched condition, it's matched on the zero reverb time?
>>: [inaudible].
>>Kshitiz Kumar: It's multi-style, actually. There are three sets here, and this covers all three sets, right. It's not just one. Actually, I'll just show that.
But anyway, so I said that the clean error increases, but the processing improves things here, and here also the processing is improving. On this Wall Street Journal experiment it's about 15 to 19 percent relative improvement on average.
>>: So can you make a comparison [inaudible].
>>Kshitiz Kumar: That's true. Actually, I'm still to do that. I mean, I'm scheduled to do
that sometime before my -- yeah, I have to do that, actually. I haven't been able to do
that so far.
Yeah, exactly. Because this is like adaptation where given some knowledge about the
speech features, we are adapting it. But I think the key difference is that in MLLR they
do a linear transformation, but here I'm doing a filtering transformation. That's the key
difference.
>>: [inaudible].
>>Kshitiz Kumar: Uh-huh.
>>: [inaudible].
>>Kshitiz Kumar: Okay. Yeah. That's true. I'll look into that.
>>: So if you look only at the clean in non-matched training at R1 500, your error rate basically goes from a 10, you know, now it goes to --
>>Kshitiz Kumar: Yeah.
>>: After LIFE it's 19 percent.
>>Kshitiz Kumar: I mean, that is training on clean. But if you do --
>>: That means that if your LIFE, it's doing a perfect job.
>>Kshitiz Kumar: It's not doing a perfect job.
>>: [inaudible] it's not. It seems to me it's very far from perfect.
>>Kshitiz Kumar: That's true.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, that's true.
>>: How do you interpret this? Does that -- you know, your model is too [inaudible] or --
>>Kshitiz Kumar: Yeah, I think one of the reasons is that my model fits well when the reverberation time is not very high, but if the features are very, very highly reverberated, then the model is not doing a great job, I think. That is how I interpret it. And especially if the database is very difficult -- I mean, if you do the same experiment for Resource Management, we see that here it goes from about 70 to about 40 percent. Actually, this experiment is the same thing for Resource Management; here it was from less than 80 to about 40, so the reduction is very large.
So there are two things here: the database itself is very difficult, and there is the complexity of the reverberation. My model is not perfect, of course, I think. That's the point.
>>: There's two pieces to [inaudible], so I imagine that as the data gets more and more reverberated, it's harder to find the right [inaudible]. So I was wondering if you ever ran -- or since all your data is synthetic, you can actually, like, cheat and figure out what is the optimal Gaussian for that, and then -- like, what's [inaudible] use the clean speech as the target rather than [inaudible], and then say that's the best your model could be likely to be if the target observations are perfect.
>>Kshitiz Kumar: Right.
>>: [inaudible].
>>Kshitiz Kumar: No, I haven't done that, actually, but I did a slightly different version of it. What I did was look at the Gaussian component that I pick in the clean case and the Gaussian component that I pick before LIFE processing and after LIFE processing. What I find is that the Gaussian components that get picked after LIFE processing match better with the clean-condition data. So, yeah, I did that experiment.
But I don't remember exactly -- I would expect that the same thing holds if the reverberation time is small, but it will probably not hold that well if the reverb is very high. Yeah.
Actually, I will extend this method later on also. But anyway, this is where training was done on all three different reverb times. Next, I was also curious about another experiment where I did training on just one condition: I reverberated my database with just 300 milliseconds of reverb time, just one condition, and I compared this with the case where we have three different conditions, just to see if there is additional improvement in a perfectly matched condition.
So, actually, this experiment here is a perfectly matched condition, because the model is trained on this specific condition and tested on the same specific condition here.
Yeah?
>>: So in this experiment here, it seems like previously this MFCC number would be, like --
>>Kshitiz Kumar: Actually, it will be this one. This is MFCC and this is trained --
>>: But the order seems to be that, like, MFCC and then LIFE was a little better and then the matched condition was even better than that. Am I remembering the slide wrong?
>>Kshitiz Kumar: Actually, there is a bit of a difference here --
>>: Yeah, yeah. It's --
>>Kshitiz Kumar: These are on clean training, actually. These two are on clean training.
>>: Oh, okay.
>>Kshitiz Kumar: So now --
>>: [inaudible].
>>Kshitiz Kumar: Actually, yeah, these two are -- right.
>>: The blue and light blue --
>>Kshitiz Kumar: Yeah, yeah, yeah. Right, right. Yeah. So these two become here.
So the real difference is that here the matched condition was a multi-style match, and in these two cases the matched condition is a very strict match.
>>: Okay. Okay. That makes sense.
>>: So the gain is just from --
>>Kshitiz Kumar: Yeah, the gain is just from [inaudible]. And note that this shoots up very high -- the clean error shoots up very high, as expected here.
But anyway, the point is that on this database, even with a very strict match in reverberation, there is like a 15 percent reduction here.
Okay. So, finally, in conclusion: LIFE processing is an adaptive solution. It does not require any prior information; it builds on just information about the clean speech distribution, and we get high-pass [inaudible] mitigating the room impulse response.
Now, next I'll talk about the non-negative matrix factorization, which --
>>: In the other one -- so what you're estimating is the [inaudible].
>>Kshitiz Kumar: In my LIFE processing I estimate a [inaudible]. I also did it for an FIR filter; I get good benefit with FIR also, but IIR provides a little bit more benefit. The reason is that with IIR I need fewer parameters -- if I have more parameters in the FIR, I think the --
>>: Do you have any way to control for --
>>Kshitiz Kumar: Parameters?
>>: Stability.
>>Kshitiz Kumar: Oh, yes.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, that's true. It is possible to do that, and what I currently do is check for that, and if I violate my stability condition, I just pause. That is what I do right now. But I can do that because in my experiments I find that the filters are stable. Actually, in all the experiments so far where I checked for that, very rarely did I find that it's not stable. So there is probably some reason why, but it just experimentally turns out that it's stable.
>>: So back to the question [inaudible] what's the comparison -- you know, of course
you compare with a basic MFCC.
>>Kshitiz Kumar: Uh-huh.
>>: But suppose, you know, you use a much simpler high pass filter rather than doing
the, you know, estimation.
>>Kshitiz Kumar: Oh, yes, uh-huh.
>>: And that would be a, you know, a different baseline. Do you have any thoughts on
that?
>>Kshitiz Kumar: Actually, Professor Stern did ask me that question a long time back, and I didn't do those experiments. Because if you look at the filters here, they look high-pass -- they are high-pass filters, right. So he did ask me to try to fit this filter using one pole or two poles -- to find empirically which single one-pole or two-pole filter fits -- and I tried to do a good job of fitting them, and the result is actually somewhat intermediate. It does not really achieve this much, but it was somewhat intermediate -- less, not even half. Yeah, it does provide benefit, but it does not reach this level.
>>: So why would you limit yourself to one or two poles? Why not go further, like doing a full [inaudible] normalization or --
>>Kshitiz Kumar: Okay, I -- I don't know. I think I did not do that.
>>: You didn't do a -- design a, you know, multiple, whatever, coefficient filter and, you
know, minimize the [inaudible] with respect to filtering coefficients.
>>Kshitiz Kumar: I think the reason I did not do that was that, had I done that, it would be very, very specific to one particular reverberation time. I wanted to do that to study how it works, but in practice I can't really do that, because I don't know the room acoustics. So that is why I did not take it up further.
But even for a two-pole filter, I was getting this fit to a considerable extent. But, yeah.
Okay. So the NMF also works in a similar framework to what I talked about, but instead of working in the cepstral domain it works in the spectral domain. Again, the model is a convolution in the spectral domain.
So, again, notice that what we get is the cap Y sequence. We don't know the cap H or the cap X sequence, and we have to split the terms. There are obviously many ways in which the cap Y can be split into these two terms here, so, again, to do a better job we need to specify some constraint, something that we know about the signal of interest here.
So in the non-negative matrix factorization I use the sparsity of the signal -- the sparsity of the spectra -- as a property of speech spectra, and then I use the non-negativity of the spectral values as an additional constraint, which is obvious.
So, again, here the problem formulation is to minimize a mean squared error objective function. Given an initial estimate of the cap X and the cap H -- actually, I think there is a multiplication here; sorry, this is not a minus -- I'm trying to minimize this squared-error criterion subject to a sparsity constraint: I want the spectra to have sparse values, so only a few of the values should be large and the rest should be small.
And, again, this is solved in a gradient descent fashion, but we need to ensure that the values stay non-negative. I've skipped the equations here, but it eventually turns out to be a multiplicative update.
>>: [inaudible].
>>Kshitiz Kumar: In the spectral domain.
>>: Not in the time domain?
>>Kshitiz Kumar: Not in the time domain.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, so I discussed the feature domain, and when I was discussing this I said I'm trying to represent reverberation not in the time domain but in the feature domain, because speech recognition actually works on the features. And then I showed that I can represent reverberation, which actually happens in the time domain, in the spectral domain with a small error.
>>: [inaudible].
>>Kshitiz Kumar: Actually, it's both. Initially it is in the spectral domain and then also in
the Mel spectra.
>>: And you're definite the convolution is not -- it's across the [inaudible].
>>Kshitiz Kumar: No, it is on one frequency channel, yeah.
>>: If you're treating samples -- it's still a time domain convolution using a single
[inaudible] over time [inaudible].
>>Kshitiz Kumar: No, no, no. I'm treating each of those -- each of the short-duration powers in a particular frequency channel. They are convolved.
>>: With each other over --
>>Kshitiz Kumar: With each other, yeah.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, it's a time sequence, right.
>>: [inaudible] sequence of the log --
>>Kshitiz Kumar: Not log. The Mel spectra, yeah.
>>: So what is the, you know, theoretical justification for this? Why is convolution the best model there?
>>Kshitiz Kumar: The theoretical [inaudible] to some extent lies here. So this is what I'm doing: I am representing the blue line with the red line. I'm saying that I'm representing a convolution in the time domain as a convolution in the spectral domain. There will be some error, but the error, as we can see in a practical application, is small, and this gives us a simpler model. And it lets us do parameter estimation and the dereverberation application.
>>: But the thing is you don't have to do the convolution. You can do it -- you say it's
multiplication or additive. You can also move the two lines, you know, closer.
>>Kshitiz Kumar: Yeah.
>>: So why the convolution is particularly indicated?
>>Kshitiz Kumar: I can use multiplication, but then I have to use a very long analysis window -- the duration of the analysis window should be up to a second. That was what was done in the LDLSS work; the analysis window was one second or so. But right now I'm working with a small window, so, as you can see, because the reverberation will smear and reach into future frames, the energy will overlap -- energy from past samples will overlap into future segments, right? So this is an additive phenomenon.
I mean, this is not completely linear. There is also a noise term here which I'm ignoring right now; as you can see, if there were no noise, these two lines would match perfectly. But there is an approximation error. Still, the good thing is that even with a small approximation error, this is a good model.
Okay. So coming to your point -- this is right now my NMF compensation framework. What I do is window -- I mean [inaudible] per-frame windowing -- in either the magnitude or the power domain; I can do it in either. And then I can apply the NMF processing on the [inaudible] frequency channels, or I can apply the same processing in the gammatone filter bank. So there are different flavors of this algorithm here.
And to go into the gammatone filter bank I have to apply this gammatone filtering, then apply the NMF here, and then I reconstruct by applying an inverse transformation here. So the question is which will be better: will working in the magnitude domain be better or the power domain, and will working in the gammatone domain be better or the Fourier [phonetic] domain? These are the questions I'm going to take up.
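[A hedged sketch, not the author's code, of the per-channel non-negative deconvolution idea described above: in each frequency channel, the reverberant frame-energy sequence y is modeled as a convolution of a non-negative clean sequence x with a short non-negative filter h, and both are estimated with multiplicative updates that minimize squared error with an L1 sparsity penalty on x. Filter length, penalty weight and iteration count below are guesses, not the talk's settings.]

```python
import numpy as np

def nmf_deconv_channel(y, n_taps=10, lam=0.1, n_iter=200, eps=1e-12):
    """y: non-negative frame energies of one frequency channel (length T)."""
    T = len(y)
    x = np.array(y, dtype=float)                    # init clean estimate with the observation
    h = np.zeros(n_taps); h[0] = 1.0                # init filter as identity
    xcorr = lambda a, v: np.convolve(a, v[::-1])[len(v) - 1: len(v) - 1 + len(a)]
    for _ in range(n_iter):
        yhat = np.convolve(x, h)[:T]
        # multiplicative update for x (keeps x >= 0), with sparsity penalty lam
        x *= xcorr(y, h) / (xcorr(yhat, h) + lam + eps)
        yhat = np.convolve(x, h)[:T]
        # multiplicative update for h (keeps h >= 0)
        num = np.convolve(y, x[::-1])[T - 1: T - 1 + n_taps]
        den = np.convolve(yhat, x[::-1])[T - 1: T - 1 + n_taps]
        h *= num / (den + eps)
    return x, h

# Usage sketch: apply channel by channel to a (channels x frames) Mel or gammatone
# spectrogram Y, then continue with log/DCT as usual.
# X_clean = np.vstack([nmf_deconv_channel(Y[c])[0] for c in range(Y.shape[0])])
```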
Yeah?
>>: [inaudible].
>>Kshitiz Kumar: It's [inaudible] actually.
>>: [inaudible].
>>Kshitiz Kumar: I mean, I did do experiments with changing that parameter, but what I
find is that it's -- I mean, it does not really change a lot. I mean, it's -- yeah, maybe I
need to do more experiment with that. I have not explored that a lot. I mean, this is
what we just chose and I chose some by changing small parameters, but doesn't really
change a lot.
Okay. So this is an example of applying this processing in the magnitude domain and in the gammatone filter domain. This is the reverberated sequence here. We see that applying this processing helps to remove the spectral smearing and we get sharper, cleaner spectra.
Okay. And in this experiment --
Yeah?
>>: [inaudible].
>>Kshitiz Kumar: Oh, sure.
>>: This is before and after?
>>Kshitiz Kumar: Yes, this is before and after. Yeah.
[Demonstration played]
>>Kshitiz Kumar: But if you listen to it carefully I think -- it's not perfect, of course, but it
does help in that direction.
>>: [inaudible].
>>Kshitiz Kumar: Okay. So in this experiment I'm comparing different versions of this algorithm: applying it in the Fourier domain versus the gammatone domain, and in the magnitude domain versus the power domain. The first thing I would expect is that if I apply the processing in the gammatone filter domain -- because I have only 40 gammatone filters, versus about 256 Fourier frequencies -- the dimensionality is reduced a lot. And because the gammatone filters are designed to focus on the frequency components which are more useful for speech recognition, I would expect the performance to be better.
And additionally, what I find is that if I do the same processing in the gammatone filter domain, the approximation errors are smaller. Earlier I talked about the approximation error; it is smaller in the gammatone domain because it does an averaging, right? It does an averaging over the frequencies, and because of that some of the noise gets smoothed out. So there are already lots of benefits in the gammatone domain. Starting here, this is the MFCC baseline and these are the NMF algorithms.
So we see that this bunch is for the Fourier frequencies and this bunch is for the gammatone frequencies, and on average the gammatone error is much smaller.
And then I also compared -- let's talk about the gammatone frequencies here -- the power domain versus the magnitude domain. I find that applying the processing in the magnitude domain gives a smaller error. And, again, I did try to look into that, and I found that the approximation error was smaller in the magnitude domain.
And there is one more experiment here which is labeled dash h. Dash h means I apply the same filter across all the frequency bands. I was interested in comparing that and seeing whether I can apply the same filter to all the frequency bands, and if I apply an individual filter to each frequency band, how much additional improvement we get -- of course we expect improvement, but the question is how much.
So as you can see, we go from here to here, which is still a significant improvement, by applying individual filters in the different channels on top of the rest of the improvement.
>>: So is there a theoretical basis why NMF would work better in this case for magnitude versus power? Or is this just an empirical --
>>Kshitiz Kumar: This is sort of an empirical result. What I tried to see was the approximation error -- you remember that my model has an error. What I empirically found was that the noise power was about 13 decibels below the signal power in the power domain, but in the magnitude domain it's like 17 dB down. So the approximation error is empirically smaller; I think that is one of the reasons.
Okay. So these are, again, on the Resource Management database, and here again I see that there is a very sharp improvement in performance. This is clean-condition training.
And these experiments have been done on ATR. This experiment is done with a simulated room impulse response, using an ATR room impulse response, and this one is using the actual ATR database, which was collected in a room. So we see that the improvement holds up in both cases.
Actually, they're not directly comparable, because this room impulse response is not the same as the room impulse response here; I think the database didn't have that, probably.
But anyway, we see that the improvement holds up on real [inaudible] speech as well.
And another thing I wanted to show is that --
[Demonstration played]
>>Kshitiz Kumar: Actually, this has noise, if you listen to it carefully; the database also has additive noise. So the point is that even if there is some additive noise, the performance does not blow up -- the algorithms still hold up even if there is additive noise.
[Demonstration played]: She had your dark suit in greasy wash water all year.
>>Kshitiz Kumar: I think the speech may not sound that great, but, still, the speech recognition accuracy definitely holds up. Because the problem is that many of the algorithms designed for dereverberation are so specific to the reverberation task that even with a small amount of noise they don't really hold up.
I don't want to go into those directions, I mean, so, okay.
These are, again, RM matched-condition experiments. Again we see that the baseline improves a lot, but we still get additional improvement -- this is, again, a multi-style condition; it has data from all these conditions here. So, again, the improvement is about a 15 or 20 percent relative reduction in error rate here.
Okay. So I talked about the dereverberation algorithms, I talked about the spectral deconvolution representation, and now I'll briefly talk about a noise compensation algorithm which I find very useful. It works on an additive noise model. Before talking about that, I'll briefly talk about the delta cepstral features.
So conventionally we initially have, like, 13 MFCC features here, and then we append delta cepstral and double-delta cepstral features to get a 39-dimensional feature. What we find is that those features are very important for speech recognition, and this is an example of that. This is on real-world collected noise -- it's still simulated speech in the sense that the speech is added to the noise -- but we see that adding delta features improves things by like five or seven [inaudible]. So the point is that delta features are very important for speech recognition accuracy, and even the clean performance improves here.
So the next question is: is that the best we can do? What is the robustness of the delta features? To look at that, here is an example. On the top panel, this is clean speech and this is noisy speech at zero decibels. This is the spectral power plot, this is the log spectral power plot, and this is the delta feature that we obtain from it.
What we would expect is that if the delta features are robust, these two should match very well. But, as we can understand, because we're applying a log compression here, the high-energy values will usually be well matched, but the lower-energy values will be very poorly matched. Because of that, the delta features are not really robust to noise and they don't really follow the clean speech features. So there is a problem here.
So the next question is what better we can do. As we can see, this blue line is the power plot for speech and this one is for a noise, a real-world noise, and what we find is that for a lot of the noises we have in the real world the dynamic range is not that high, but the speech power has a very high dynamic range. The speech energy values change by like 30 dB between speaking and non-speaking, but the noise energy changes within plus or minus 5 decibels.
So we can use this cue. And also note that speech is highly non-stationary, while in general -- the same point -- noise is relatively stationary. So what we can do is apply a temporal difference operation directly on the spectral values here, and what I do is take the delta operation here on the Mel filter outputs.
Now, if I do that, I'm going to have some more things that I need to work on; I'll talk about those later. In particular, a non-linearity will have to be applied, because the values become negative if I apply a difference operation on the spectral values, and spectral values are normally non-negative, so I need to do something about that.
But anyway, I'd like to show figuratively whether this is helpful or not. This is the same plot that I had before, and this is the plot that I get by applying the temporal difference operation directly on the spectra here. As we naturally expect, if we apply a temporal difference here, the speech energy changes very fast and the speech power dominates, so we'll get strong tracking of the speech energy. But when the speech power is not changing, things are relatively stationary here, and for the things that are stationary the values will be nearly zero.
So, as we can see here, applying a temporal difference here shows a lot of benefit. This plot was done by taking a short-duration power over the entire bandwidth of the signal, and these other plots are for doing the same operation over the output of [inaudible] Mel channel, so just one frequency band.
So, again, this is speech, this is noise, and this is speech added to noise. This is the conventional thing -- applying the log and then taking the delta of the log spectral values -- and this is applying the delta directly on the spectral sequence here.
And then from here to here -- okay, from here to here, this is applying a non-linearity. The non-linearity is needed because these features have a super-Gaussian distribution; they don't have a Gaussian characteristic, so we can't model them with a Gaussian mixture model, and we have to apply a non-linearity to let them be modeled in a [inaudible] framework.
So what I do is take the sequence here and make its histogram Gaussian. That is what I do. Later I need to improve this a little bit, but this is what I do right now: I make them Gaussian so that they can be modeled by your GMM.
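[A hedged sketch of my own reading of the delta-spectral idea: take the temporal difference directly on the Mel spectral values, then pass the (possibly negative) result through a rank-based Gaussianizing non-linearity so each dimension's histogram looks Gaussian for the GMM/HMM back end. The delta window width and the exact non-linearity are assumptions, not the talk's settings.]

```python
import numpy as np
from scipy.stats import norm

def delta(feats, width=2):
    """Regression-style delta over +/- width frames; feats is (frames x dims)."""
    T = len(feats)
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    num = np.zeros_like(feats, dtype=float)
    for w in range(1, width + 1):
        num += w * (padded[width + w: width + w + T] - padded[width - w: width - w + T])
    return num / (2 * sum(w * w for w in range(1, width + 1)))

def gaussianize(x):
    """Rank-based non-linearity: map each column's empirical CDF onto a standard normal."""
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    return norm.ppf(ranks / (len(x) + 1.0))

def delta_spectral_features(mel_spectra):
    """mel_spectra: (frames x channels) non-negative Mel filter-bank energies.
    The difference is taken on the spectra themselves (not on log/cepstra), and
    the result is Gaussianized before GMM modeling."""
    return gaussianize(delta(mel_spectra))
```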
Okay. So what we have achieved is that by doing delta-spectral features we are still able to track the changes in the speech and we are able to attenuate the changes in the noise, and it also emphasizes the speech onsets -- it has been shown that the speech onset is very important for speech recognition -- because it's a difference operation, so it is going to emphasize the speech onsets. And there is a non-linearity here so they can be modeled by a Gaussian mixture model.
Now, next I will compare results using my delta features against the conventional way of doing delta features. As we notice, we haven't added anything fundamentally new; what we have done is a temporal difference operation in a domain which is better, and then we apply a non-linearity to make the features Gaussian. That's it. And these are a number of different conditions: white noise, music, real-world noise and reverberation.
So we see that just doing that provides a lot of improvement here. Especially for music
noise and for real-world collected noise, at 50 dB we have something like a 5 or 7 dB
threshold shift here. And it also helps a lot in reverberation.
>>: [inaudible].
>>Kshitiz Kumar: Right now actually it is reference specific, but we can learn that
Gaussianization from clean speech and then apply the same, because actually it's not
fundamentally different across different references. And we can learn the non-linearity
shape, basically, for a clean [inaudible] and then apply the same for all the different
conditions. I haven't done that, but --
>>: [inaudible].
>>Kshitiz Kumar: Yes. Yeah.
>>: Someone from, like, ATR and [inaudible] basically proposed this reverberation
[inaudible], so I'm just wondering if you're familiar with that work and the differences.
>>Kshitiz Kumar: I don't know it exactly. I mean, I have done some research. What I
see is that there's one paper [inaudible], I forget the name -- Kuldip Paliwal [phonetic].
He has a paper, and what he does is something similar, but then he applies a modulus
operation to make the values non-negative. So he takes the difference and then applies
a modulus operation to make them non-negative. I did do an experimental comparison
with that, and it showed some improvement, but not a lot of improvement, because once
you take the modulus, you mitigate the differences between a rise of energy and a fall of
energy; they become the same. So I think that was the closest work I found.
>>: I'll see if I can find it.
>>Kshitiz Kumar: Yeah, sure.
Okay. And this is the same thing, where I compare MFCC with the delta cepstral feature
versus the delta spectral feature, and I also compare with the advanced front end. This
is on the ETSI database, from the ETSI standards institute.
So comparing this red line versus this blue line, this is the improvement that we get from
the advanced front end. The advanced front end shows a lot of improvement over the
baseline, but if I replace the delta cepstral feature with the delta spectral feature, we see
that even the baseline MFCC feature works almost as well as the advanced front end
system here.
So the point is that we are getting a lot of improvement for very little work, just by
changing the delta operation, and we are approximately obtaining the gain which the
advanced front end provides across different noise and reverberation conditions.
Okay. So --
>>: What's the number of Gaussian mixtures you're using for this experiment?
>>Kshitiz Kumar: Actually here there is no Gaussian mixture. There is a Gaussian
mixture for speech recognition --
>>: [inaudible].
>>Kshitiz Kumar: In the HMM.
>>: [inaudible].
>>Kshitiz Kumar: I think eight. For resource management, we use eight.
>>: [inaudible].
>>Kshitiz Kumar: That's true.
>>: [inaudible].
>>Kshitiz Kumar: Yeah, that is possible, but I just use the baseline. The reason is that if
I increase the number of distributions, I'm actually adding a lot more parameters for my
data and I might overfit. And that is where I think I can more easily apply a non-linearity.
It's just one non-linearity -- just like the log non-linearity but a different one, which makes
the features closer to Gaussian. So that is the approach I took. Rather than increasing
the number of parameters, I thought it was better to do it on the features themselves.
Okay. So right now what I'll try to do is also understand how much benefit this noise
compensation can provide without doing any speech recognition experiment. I'm
interested in knowing how much benefit we expect from doing this compensation and
how much noise is actually reduced, analytically.
So to do that, I can consider white noise, for example. I can take a segment of white
noise and find its power P; this will be a random variable. This P follows a chi-squared
distribution with N degrees of freedom, but we can treat it as approximately Gaussian if
N is large, so this P is approximately Gaussian with a mean and a variance in terms of
sigma squared, as shown here.
Now, this can be split into an AC power and a DC power. The DC power is just the
mean squared, and the AC power is the variance here. To an approximation, what the
algorithm will do is definitely remove the DC power, because we are subtracting, so the
DC component of the signal will be totally gone. To that approximation, the reduction in
distortion power can be obtained from this expression here, which turns out to be this,
and for a 25-millisecond window, N is 400, and we can say that we expect that, for a
white noise condition, the noise power associated with the white noise signal will be
reduced by 23 decibels.
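That 23 dB figure can be reconstructed as follows, assuming the frame power of zero-mean white Gaussian noise is a scaled chi-squared variable (the exact normalization on the slide may differ). If $x_i \sim \mathcal{N}(0, \sigma^2)$ and $P = \sum_{i=1}^{N} x_i^2$, then $P \sim \sigma^2 \chi^2_N$, so $\mathbb{E}[P] = N\sigma^2$ and $\operatorname{Var}(P) = 2N\sigma^4$. Taking the DC power as $\mathbb{E}[P]^2$ and the AC power as $\operatorname{Var}(P)$, removing the DC component reduces the distortion power by
$$10\log_{10}\frac{N^2\sigma^4 + 2N\sigma^4}{2N\sigma^4} = 10\log_{10}\frac{N+2}{2} \approx 23\ \mathrm{dB} \quad (N = 400,\ \text{i.e. a 25 ms frame at 16 kHz}).$$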
So this gives us some sense of where we're heading. And we can do the same thing for
white noise, real-world noise, babble noise, music noise, and speech. Actually, the
white noise case is theoretical, but the others are experimental: I can take a segment,
do the same thing, and see what happens.
And the blue line here is what I obtained by doing a speech recognition experiment. I
would expect that at least the trend should be followed. The two are not exactly the
same, because I'm not comparing the same quantity in the two cases, but at least I
would expect the trend to be followed, which to some extent I do see: these values are
high, these values are also high, and they sort of decrease together.
And the good part of this analysis is that it tells us, without doing a speech recognition
experiment, roughly how much benefit we can expect to obtain from the processing.
So, finally, a joint noise and reverberation model. As I said before, this reverberation
model has an approximation error, and we can account for that approximation error
here by an additive noise term here. This also acts like a unified model for joint noise
and reverberation: if the data also has noise in it, we can account for that noise here as
well. So this acts as a better model for reverberation and also as a joint noise and
reverberation model.
Comparing this with the prior model that people worked on: previously people worked
with this multiplicative model and an additive noise here. What I'm doing now is
replacing this multiplication by a convolution here, so I have more parameters to
estimate and work with.
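In feature-domain notation the contrast is roughly the following; the symbols are my own shorthand, not the notation from the slides:
$$\text{prior model:}\quad Y_t(f) \approx H(f)\, S_t(f) + N_t(f),$$
$$\text{proposed model:}\quad Y_t(f) \approx \sum_{k=0}^{K-1} h_k(f)\, S_{t-k}(f) + N_t(f),$$
where $S_t(f)$ and $Y_t(f)$ are the clean and observed spectral values at frame $t$ and channel $f$, the sum over past frames models the energy smeared into frame $t$ by reverberation, and $N_t(f)$ absorbs both additive noise and the approximation error.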
So next, to compensate for this filter I'll apply NMF, and to compensate for this noise I
will apply the delta spectral features. Anyway, this is the framework I'm working on. I
have this capital Y here; this is what I get, and it has both reverberation and noise. I first
apply NMF, and NMF reconstructs the speech signal. Then I pass it through the Mel
spectra, and from the Mel spectra I get the DSCC features, and I append these two. So
these are, again, 39-dimensional features, but they have compensation for both noise
and reverberation -- not totally, but, yeah, they do. And so the question is how well it
will work.
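A schematic of that front end might look like the sketch below; the four stage functions are hypothetical placeholders for the components named in the talk (NMF dereverberation, Mel analysis, static features, DSCC), not actual library calls:

    import numpy as np

    def joint_front_end(y, nmf_dereverberate, mel_spectra, static_features, dscc):
        """Sketch of the joint noise + reverberation front end: dereverberate
        with NMF, compute Mel spectra, then append delta-spectral (DSCC)
        features to the static features.  All four callables are assumed to be
        provided elsewhere."""
        s_hat = nmf_dereverberate(y)          # reverberation compensation
        mel = mel_spectra(s_hat)              # frames x Mel channels
        static = static_features(mel)         # conventional static features
        deltas = dscc(mel)                    # delta-spectral features (noise compensation)
        return np.hstack([static, deltas])    # combined feature vector per frame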
So this is an experiment with only reverberation, the same setup I showed before.
These are the results I showed before for just applying NMF, and now if I also add the
DSCC features, I get additional benefit here.
Actually, I notice that even the clean performance increases. I need to look into that
more carefully and see what's going on there. But at least in reverberation, we see that
there is a very sharp improvement at all these conditions here.
Now, this is the same experiment where I passed the signal through a reverberation and
then added real-world noise on top of that, so it is both reverberation and noise. We can
see that the baseline MFCC feature degrades very heavily if the data has both noise
and reverberation.
Applying NMF alone shows improvement, applying only DSCC shows a lot of
improvement, but combining them in a joint fashion shows additional improvement over
the rest.
>>: You say that you -- forward one. There. You say that you created this data by first
reverberating the clean speech and then adding noise?
>>Kshitiz Kumar: Yes.
>>: Is there any issue here with the noise not having the reverberation as if it was from
the same room? That is, if you added noise and you reverberated it, would the
algorithm work the same?
>>Kshitiz Kumar: Actually, I haven't done that experiment. But this noise was collected
in a real-world setting -- I mean, it was collected in a restaurant and it was collected in a
factory. So the noise does have some reverberation in it, but it's not exactly the same;
it's different from the h filter. That's true. I would probably need to do that experiment. I
did not do that.
>>: So do you think it would affect the experiment or do you think it would not? What's
your intuition?
>>Kshitiz Kumar: I don't think that it will affect the experiment. I think that I will still get
significant improvement from that. I think so.
Okay. So, finally, I'll very briefly present a profile-view lip-reading experiment. The idea
is that we can combine audio features with image features. People used to work on
frontal-view images and extract features from those, but in one of my experiments I
collected profile-view images and tried to extract features from the profile view of a
person and to integrate them.
So these are the images, which were collected from the side as well as from the front.
I'd like to briefly show them -- these are the images that were collected, and I'm working
on these images.
And then everything is extracted automatically. The features defined here are the lip
protrusion features: because I have a profile image, I can find out the movement along
the protrusion direction, the horizontal direction here, and I can also find out the lip
height, the movement in the vertical direction. Note that the lip protrusion information
cannot be obtained from a frontal image, because that movement is toward the camera
and cannot be measured in that view. So the profile-view images give additional
information, and these are the processing steps.
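As a toy illustration of the two profile-view measurements (not the actual processing pipeline, and assuming a segmented lip-region mask is already available from some upstream step):

    import numpy as np

    def profile_lip_features(lip_mask):
        """Given a boolean image of the segmented lip region in a profile-view
        frame, return a crude lip-protrusion measure (horizontal extent) and a
        lip-height measure (vertical extent)."""
        rows, cols = np.nonzero(lip_mask)
        protrusion = cols.max() - cols.min()   # extent along the protrusion (horizontal) axis
        height = rows.max() - rows.min()       # mouth-opening (vertical) extent
        return protrusion, height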
Anyway, these are the speech recognition experiments done on a different database.
We see that this is the frontal-view vert parameter, the frontal-view height parameter,
the profile-view height parameter and the profile-view protrusion parameter.
The key thing I take from this is that the profile-view protrusion feature seems to be very
important for the speech recognition experiment here, and this feature is not present in
the frontal-view images used in previous experiments. So it adds a lot of information.
And then I can also combine all this with the audio features. This is the line for the
audio features, so the performance degrades as the SNR approaches zero, but the
profile-view features stay the same -- they don't get affected by noise. And if I combine
the two, I get this intermediate curve here.
So overall, in summary: a key purpose of my Ph.D. work was to study reverberation in
the feature domain, making approximations and trying to work directly in the feature
domain so that we have a tractable, simple model, and then we can do dereverberation
on those simple models. Then I proposed two ways in which we can use speech
knowledge: one via the likelihood parameter and the other via sparsity information. I
proposed the delta spectral features to attenuate noise, and then I studied a joint noise
and reverberation framework and then audio/visual integration. So these are the
different ways in which speech recognition performance can be improved in a real
environment.
And then, anyway, thank you for coming; it was a pleasure to come here. The papers
are available on my website, and the algorithms are available on the robust group
website, so you can take a look. And, yeah, thank you.
[applause]
>>Kshitiz Kumar: Thanks. We can have questions.
>>: So I never saw a slide that compared, like, LIFE with NMF.
>>Kshitiz Kumar: Okay. It was not there. I did compare LIFE and NMF. They provide
comparable improvements: LIFE is a little better when the reverberation is high, and
NMF is a little better when the reverberation is small. But the good thing is that I can do
one after the other -- I can first pass through NMF and then apply the LIFE processing.
I did not show that slide; maybe I should have. And the thing is that I get further
additional improvement if I do that.
So the improvement is to some extent additive if I do both. It's not huge -- it's not one
plus the other -- but there is an additional relative reduction in error rate of around 10
percent, in that range, for those experiments. Yeah.
>>: So actually you don't need to assume that [inaudible] just use the linear [inaudible]
that being a combination -- you have a matrix, and then you optimize the NMF objective
function given this assumption of the [inaudible] all the energies are mixed, and then
given the constraint, you try to minimize the [inaudible] of the recovered X signal. So
then you can do some kind of recovery of the original signal. I think probably the
performance might be the same as where you are. So given the assumption of the
convolution that you have, I think [inaudible].
>>Kshitiz Kumar: Yeah, that's --
>>: [inaudible].
>>Kshitiz Kumar: Yeah, that's true. I mean, I did start by making the convolution
assumption, but as you saw later, I also added a noise term there to account for the
approximation error, and to handle that noise term I added the noise compensation
algorithm. So finally I represent reverberation not as a convolution but as a convolution
plus noise. And by working in this framework, if I do something to compensate for
reverberation and something to compensate for noise, the improvements are additive.
Yeah.
>>: I think if you assume convolution, then [inaudible] for ease of computation, you can
transform that to another domain, like the Fourier domain, something like that. You can
get some [inaudible], I mean, it's a simple assumption on convolution without doing any
further transformation to the domain [inaudible].
>>Kshitiz Kumar: Yeah. But I think the issue is that we also have to be careful about the
window size. We have to work with small window sizes and do the analysis on small
windows, so that is also an important consideration. And that is why I needed to have a
convolution, because then it accounts for the additivity of energy from previous
segments into the current segment. Is that what you're asking?
>>: [inaudible].
>>Kshitiz Kumar: I don't need to assume convolution, but the point is that the additive
part is correlated with the speech, so there is a convolution. I mean, the additive part is
not uncorrelated with the speech. If the additive components were completely
uncorrelated with the speech, then it would be noise, right? But the additive
components are correlated with the speech, and in that case I can't treat them as
completely additive, because the general noise algorithms assume that the noise is
uncorrelated, so I think it would violate that assumption. That is why I think the best
thing is probably to have both convolution and noise, because this is very generic: then
we can apply the same model to only noise, to only reverberation, or to both noise and
reverberation. So it extends.
So there are some other experiments here, as Mike pointed out. I didn't show many
experiments combining, like, NMF, LIFE or DSCC. I did those experiments as well, and
there is further improvement if I do some combinations, though it is small.
>>Mike Seltzer: Any other questions? Let's thank our speaker again.
>>Kshitiz Kumar: Thank you.
[applause]