>>Mike Seltzer: Good morning. Welcome. My name is Mike Seltzer, and I'm happy to introduce Kshitiz Kumar, a Ph.D. student from Carnegie Mellon working in the robust speech recognition group with Rich Stern. He's going to tell us about the work he's been doing, mostly in reverberation -- >>Kshitiz Kumar: Reverberation, right. >>Mike Seltzer: -- which, as we all know, is a very challenging problem. So here we go. >>Kshitiz Kumar: Thanks, Mike. As Mike said, the object of this talk is to study the problem of reverberation and to compensate for it in speech recognition applications. I'm Kshitiz Kumar, from CMU, and my colleagues are my advisor, Professor Richard Stern, and Bhiksha Raj, Rita Singh and Chanwoo Kim. We know that speech recognition technologies have matured considerably in the last 20 years, but the problem of robustness of these technologies to noise and reverberation still exists. Even though many technologies work great in clean or matched conditions, whenever there is a mismatch between training and testing data the performance is not as good. The object of this talk is to make speech technologies robust to reverberation, then to extend them to noise, and finally to joint noise and reverberation. Before going forward, I will briefly show what reverberation is, what it means, and how it impacts speech recognition accuracy. Reverberation can be thought of as a superposition of delayed and attenuated signals. Think of a rectangular room where the microphone will [inaudible] the direct component from the source, then the reflections off the walls, and then the reflections of reflections. Over a period of time we can model this as a linear filter with a room impulse response. Reverberation is characterized by a parameter called the room reverberation time: the time taken for the signal power to decay by 60 decibels. The higher the reverberation time of a room, the longer the echoes persist, and the greater the self-interference they cause with the original signal. Having said that, here I try to show, figuratively, a spectrographic plot for clean speech and reverberated speech. One clear distinction is that the energy overlaps into future time segments, as we expect it will, and the silence segments here are filled up by energy. The energy from strong voiced [inaudible] sounds is going to overlap into softer sounds. So the problem is that if we train on this, we're really not going to do a good job testing here. And even if we do matched condition training, because the phenomenon of reverberation depends on the sounds which come before the current sound, even the matched condition is not going to do very well. We will see that later. Now, this is a speech recognition experiment on a Resource Management [inaudible], just to see what the impact of reverberation is, how much it affects speech recognition accuracy. For the clean condition here, training was done on clean speech.
So for reverberated speech, the WER increases from about seven percent to about 50 percent as the reverb time goes to 300 milliseconds, which I think is typical of this room right now. So at least for the clean training condition, this is really a very big problem. In past speech research people mostly worked on noise compensation; there have not been many solutions and not a lot of study of reverberation. The focus of this work is primarily on studying how reverberation [inaudible] and deriving algorithms to compensate for it. Here is a brief outline of this talk. First, we are going to spend some time studying reverberation, how it affects speech recognition accuracy, and how we can model that phenomenon directly on the speech features. My claim is that if we can do a good job of modeling reverberation in the feature domain, in the spectral and the [inaudible] feature domains, and compensate directly in the feature domain, then we can build a good system, because the features are what go to the speech recognition engine; they are closest to it. If we can compensate the features, this will radically benefit the speech recognition system. Based on that, I first propose a new framework to study reverberation, and then I propose two algorithms, one called LIFE and the other NMF; they use different properties of speech. Then I propose a noise compensation algorithm called delta spectral features, and then a joint noise and reverberation model, where I apply my individual compensations for noise and reverberation to jointly compensate for both. Finally, if time permits, I'll talk about audio-visual feature integration. Before presenting my model of reverberation, I will present what has conventionally been done in the past. In the past we treated reverberation as filtering in the time domain, [inaudible] the filter is linear time invariant. Because of that, in the spectral domain it is multiplicative, in the log spectral domain [inaudible] it is additive, and likewise in the cepstral domain. So we are saying that convolution in time is multiplication in frequency, and because of that it becomes additive in the cepstral domain. The model we arrive at is an additive shift model: reverberation introduces an additive shift on the cepstra. Based on that, the first algorithm, way back around 1995, was cepstral mean normalization, which subtracts the cepstral mean from the cepstral sequence and thereby tries to remove the reverberation. CMN is so rudimentary that it is now used in almost all systems. But the problem is that speech analysis windows are typically about 25 milliseconds, while the impact of reverberation extends over hundreds of milliseconds, so the model does not apply over such a short duration. To answer that, what people did was speech analysis over seconds, like one- or two-second window durations, and in that framework, based on CMN, they subtract the log spectral mean and do speech reconstruction from there.
But a key problem in this method is that because it uses one [inaudible] seconds of speech, the algorithm cannot really be made online, because we have to wait for those many seconds before doing any further processing. So in our work we ensure that we work from short segments and that the methods can still be applied in an online fashion, and we can learn parameters which can be used to compensate. Next I'll specifically talk about the model that I propose here, the feature domain reverberation model. To present my model, I use this framework here. This is [inaudible] MFCC feature extraction. We take a speech signal s[n], we analyze it through a filter bank -- they can be, like, [inaudible] filters -- we take a short duration power here, we apply a log, then we do a DCT, and these features go to the ASR system. Just to note the convention, the output at the filter bank is labeled little x, the output at the power stage is labeled capital X, and cap XL here and cap XC here indicate the log and cepstral outputs. What I'm going to do next is study how the reverberation phenomenon, which happens here, affects these consecutive outputs, and try to derive an approximate model to represent reverberation. As I already mentioned, reverberation happens here, and we'll study its impact at the later stages. The first thing we can immediately do, because reverberation is a linear filter, is to study it at the filter bank. The signal first passes through the room filter and then goes to the filter bank. Note that the labels have changed: this signal is no longer little x, it is little y, and similarly capital Y. The purpose is to relate little y with little x and capital Y with capital X, as mentioned before. Because these two are filters, we can switch the two -- we can commute them -- and then we have an immediate relationship between little y and little x here: it's just a filtering by this room response filter. So the representation we have is like this, and next we will study the effect of this h filter in the power domain, and similarly we'll study its effect in each of the later domains. So this is the problem we are at right now: we are studying the effect of the h filter in the power domain. We start from this equation, which is a linear convolution expression, and we take y squared because we have to take powers, and then sum y squared over a short duration, 25 milliseconds or so. We can split this into two components: these are the self terms of x and these are the cross terms of x. Then we take an expectation, and finally we still have two terms, one corresponding to x squared and another corresponding to the cross terms of x. What we find is that in principle we can ignore this cross term -- we can treat it as an approximation [inaudible] in my model -- and just retain this term here. What that effectively does is linearize the problem. We are saying that y squared is a convolution over x squared, and if you take a short duration power, it still remains so.
Essentially what I'm doing is replacing this time-domain filter with a [inaudible] filter that operates on the cap Xs sequences, and I'm saying that the cap Ys sequence is a filtering operation over the cap Xs sequence, with an approximation error. That was all said and done, but I'd like to show how good this approximation is. Here I plot -- this line is for the clean signal without reverberation, this blue line is the actual cap Ys for the reverberated sequence, and the red line is my approximation. If my approximation is good, the red line should follow the blue line. We see that even though there is an error, with respect to speech recognition accuracy this approximation is not that bad, and I'm still able to follow the contour. The big benefit I get is that I have linearized the problem and I have a representation of cap Ys, the spectral power of y, in terms of the spectral power of the clean speech. So the model we have so far says that I can [inaudible] obtain my cap Ys by filtering cap Xs. Now we will further study the log operation and the DCT. Starting from this equation, which I derived as a linear convolution expression, I can formulate the problem in a Jensen's inequality framework, and without going into the details, what I get is this expression here: the cap YL can be [inaudible] as a constant plus a convolution over cap XL plus an approximation error. This also roughly linearizes the problem -- it is not completely linear because of this term here -- and this is again the model I have derived so far, shown with the approximation error, where I would like the red line to follow the blue line. Again there is an approximation error, but it does a good job of following the contour, and it lets us formulate the problem so that we can derive [inaudible] compensation algorithms in an easier framework. This is the model I have so far: the log spectral model that was derived, operating on the log spectral output. The same thing can be done for the DCT; the DCT is a [inaudible] linear operation, so there is no problem. Finally, we arrive at this model here: the cepstral sequences of reverberated speech can be obtained by passing the original cepstral sequence through a filter and then adding a constant. The next work will build on this model. We can compare this with the previous model, what was done previously. Previously, as I said before, it was an additive model, so you can see that I'm introducing this new term here: a filtering first and then a constant additive shift. This can also be compared with past work from our group, where this kind of model was applied to the time domain sequence, but now I'm saying that I can [inaudible] have the same framework -- a filtering plus a constant plus an error term, where the approximation errors are small -- in the cepstral sequence as well. The constant c delta can be removed by plain cepstral mean normalization, and next I have to worry about mitigating this filter here.
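To make the feature-domain model concrete, here is a minimal numpy sketch of the idea just described. It is my own illustration, not code from the talk: the array shapes, the stand-in clean Mel powers and the stand-in power-domain room response are all made up for the example.

    import numpy as np

    # Hypothetical shapes: X_s is a (T, M) matrix of short-time Mel power
    # values for clean speech, and h_s is a short non-negative filter that
    # stands in for the (unknown) power-domain room response per channel.
    rng = np.random.default_rng(0)
    T, M = 200, 40
    X_s = rng.gamma(shape=2.0, scale=1.0, size=(T, M))   # stand-in clean Mel powers
    h_s = np.array([1.0, 0.6, 0.35, 0.2, 0.1])           # stand-in power-domain RIR

    # Approximate feature-domain reverberation model: each Mel channel of the
    # reverberated power sequence is (approximately) the clean power sequence
    # convolved with h_s, i.e. Y_s[:, m] ~ (h_s * X_s[:, m])(t).
    Y_s = np.stack([np.convolve(X_s[:, m], h_s)[:T] for m in range(M)], axis=1)

    # In the log/cepstral domain the talk's model becomes "filtering plus a
    # constant shift"; the constant part is what plain CMN removes.
    def cmn(C):
        return C - C.mean(axis=0, keepdims=True)

    X_log, Y_log = np.log(X_s), np.log(Y_s)
    print(np.abs(cmn(Y_log) - cmn(X_log)).mean())  # residual left for the filter term

The only point of the sketch is that, under this approximate model, the reverberant Mel power sequence is a per-channel convolution of the clean one, and CMN takes out only the constant part of the shift, leaving the filter term for the algorithms that follow.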
Now I'm going to propose two algorithms: LIFE, which is a likelihood-based inverse filtering approach, and NMF. First, the LIFE approach. As I said before, LIFE works in the cepstral domain, and because we have, like, 13 cepstral sequences, the approach is applied individually to all those 13 cepstral sequences. Before going into the details of the algorithm, let me say this: we now have a model to represent reverberation in the cepstral domain, and next we need to find a compensation algorithm. So the question is how we guide our approach. The problem is that we don't know the [inaudible] response and we do not know the clean speech spectral or cepstral values, so it's a very hard problem. In order to guide the optimization, we need to use some property of speech that can steer the problem toward a direction where speech features lie. Specifically, this algorithm is based on maximum likelihood. I use knowledge of the distribution of speech features -- I can learn the distribution of clean speech features -- and then I guide my optimization in a direction that maximizes the likelihood under that clean speech feature distribution. That is the speech knowledge I'm using in this algorithm. Otherwise it does not use any knowledge of the room impulse response or of the clean features directly, so it works in a blind fashion. What I'm saying is that reverberation takes me from a clean speech feature distribution to a reverberated feature distribution, and then I apply a filter to go back from here to here so as to maximize my likelihood function. Before going forward and showing the method, I'd like to study it with small examples, to see whether, in a simple setting, I get the best answer or how close I am to it. Here I take a zero mean, unit variance signal -- I'm just assuming the signal is Gaussian -- and apply this simple filter with one single delay tap. The question I ask is: if I try to find the inverse filter, what is the relationship between this little p and this little h? What I would expect is that little p should be able to cancel the effect of little h. If I can show that, then at least in a simplified setting I can show theoretically that the method is working. Now, this can be done simply by taking the log likelihood and differentiating it with respect to the unknown p parameter. What I find is that little p is approximately equal to minus h. If you multiply the two filters, we would like the product to be unity so that the effect is canceled out, and we see that the error is only in the second order, not in the first order. So at least to first order, the method is working fine. I can ask another question: what if I have a filter with exponentially decaying filter taps, and do the same thing? What I would expect is that p should be equal to minus h, and in fact it can be shown that p is equal to minus h. Similarly, in this case I can also have an IIR filter, and again I can show that little p is equal to little h here.
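As a quick numerical check of the single-delay-tap example above, here is a small sketch, again my own and not the talk's code. It assumes unit-variance Gaussian samples, distorts them with one delay tap h, and picks the inverse tap p that maximizes the Gaussian log-likelihood of the inverse-filtered signal; p comes out close to minus h, as claimed, with the mismatch only of higher order in h.

    import numpy as np

    # Toy check of the LIFE idea: clean samples x ~ N(0, 1), distorted by
    # y[n] = x[n] + h*x[n-1]; choose the inverse tap p that maximizes the
    # N(0, 1) log-likelihood of z[n] = y[n] + p*y[n-1].
    rng = np.random.default_rng(1)
    h = 0.3
    x = rng.standard_normal(200_000)
    y = x + h * np.concatenate(([0.0], x[:-1]))

    # The log-likelihood is -0.5 * sum(z**2) up to a constant, which is
    # quadratic in p, so the maximizer has the closed form p = -r1 / r0.
    r0 = np.dot(y, y)
    r1 = np.dot(y[1:], y[:-1])
    p_hat = -r1 / r0
    print(p_hat, -h)   # p_hat approximately cancels h; the residual is higher order in h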
So at least in simple settings I can show, without doing a speech recognition experiment, that the method is doing what it is expected to do. Next I'll apply this method to speech recognition. To do that, I make the assumption that the speech features are distributed according to a Gaussian mixture model. I can learn those parameters from data I already have available, and then I want to estimate this filter here. It has an unknown [inaudible] term and these little p terms here, so I need to estimate these parameters. Again, I set up a log likelihood expression and then do gradient ascent, maximizing the log likelihood, to obtain the parameters in a step-by-step fashion. >>: So is this equation different from an LPC? >>Kshitiz Kumar: This expression? >>: Uh-huh. >>Kshitiz Kumar: Here I'm obtaining the filter taps and applying them to the cepstral sequences. In LPC they apply this to the speech waveform, using the correlation terms there. But, yes, it is not the same. And -- >>: [inaudible]. >>Kshitiz Kumar: It is. But I'm not -- actually, LPC is not based on maximum likelihood, right? It is based on squared error minimization, and in LPC you get terms like autocorrelation. Since you raise the point, I would expect that if I had a single Gaussian density here, I might get the LPC solution, because a single Gaussian density reduces to squared error, but because I have a Gaussian mixture model here, I think the answer will be different. Yeah, a single Gaussian density might reduce to LPC. Okay. So, as I said before, I'll skip the details, but I set up a log likelihood maximization criterion and do gradient ascent from there, and the approach is applied to the different cepstral features. Before doing speech recognition experiments, I'd like to just see how the method is working. Here on the left panel is a cepstral scatter plot, with C1 on the x axis and C2 on the y axis. We see that the clean and reverberated scatter plots do not match, because reverberation smears the spectrum and spreads it out. But after applying the LIFE processing we see that, at least in the scatter plot, the two match better, so I would expect the performance to improve. >>: So the likelihood assumption [inaudible]. >>Kshitiz Kumar: No, the GMM is, of course, trained using EM, but when I do my optimization I keep the GMM parameters fixed and just do the gradient ascent on the filter. >>: Without updating the [inaudible]. >>Kshitiz Kumar: The GMMs? Yeah, the GMMs are not changed. The GMM is trained just from clean speech features, so that's not changed. >>: [inaudible]. >>Kshitiz Kumar: You mean -- >>: [inaudible]. >>Kshitiz Kumar: Oh, you mean for the GMMs? Okay. >>: [inaudible]. >>Kshitiz Kumar: Right. >>: [inaudible]. >>Kshitiz Kumar: Right. >>: [inaudible]. >>Kshitiz Kumar: I did not do that because I didn't want to change the GMMs, because that would change the clean speech feature distribution; it would no longer exactly match the distribution of clean speech features.
>>: You maximize the likelihood over the GMM, but [inaudible] because I don't know which Gaussian in the GMM -- >>Kshitiz Kumar: That's true. >>: [inaudible]. >>Kshitiz Kumar: Yeah, in practice what I do is classify each feature vector against the GMM components using the posterior, and then I use just the top one. >>: [inaudible]. >>Kshitiz Kumar: Because that really simplifies the problem a lot; it makes the calculations much faster. >>: So you take the -- >>Kshitiz Kumar: I take the most likely component for each feature, use that information, and discard the rest. I did experiment with that, and I found that just the top component works almost as well in my experiments, so I don't worry about the rest. >>: And you're doing this independently across the cepstra? >>Kshitiz Kumar: No, it's actually done jointly, because the GMM -- >>: You're able to leverage information from one domain -- >>Kshitiz Kumar: Yeah, right. Exactly. I'm not doing it independently; it's all joint. Even though the covariance matrices are diagonal, it's still a combined GMM, so I'm [inaudible]. Next I'll present speech recognition results on different databases. This one is on the Resource Management database, and I compare with some baseline algorithms which I worked on and which also come from different research communities. Some of the baselines are MFCC, CPF, which is cepstral post-filtering, FDLP, the long-term log spectral subtraction I talked about before, and the LIFE filter. We see that -- going from seven percent to 50 percent error -- some of the baseline algorithms do not really provide a lot of benefit. This training was done on clean speech; I'll also talk about matched conditions later on. The improvement from the baselines is limited to about 15 percent at most, but the improvement from LIFE processing is up to about 40 percent here -- a 43 percent relative reduction in error rate -- and similarly here. So even though the problem formulation and the analysis are simple, I'm able to leverage the fact that I represent reverberation in the cepstral feature domain and apply my techniques there directly. I'll next show some examples. Yeah? >>: So even though the training was clean, did you put any of the [inaudible] through any processing? >>Kshitiz Kumar: Yes, I feed clean speech also through my processing, yeah. >>: [inaudible]. >>Kshitiz Kumar: Yes. Yes. All of them. >>: [inaudible]. >>Kshitiz Kumar: On the clean also, yes. I haven't especially looked at LTLSS, but in general I find that -- >>: [inaudible]. >>Kshitiz Kumar: Yeah, right. In general I find that it is important to pass the clean speech through the processing as well, even if the processing does not do much, even if it just reconstructs a little bit, because it provides a better match. So, yeah, all of these were passed through the processing. Okay. In this example I plot the frequency responses of the filters that I estimate. These are four different utterances for the same reverberation time, 300 milliseconds. The first thing I expect is that the frequency response should have a high pass nature, because the room impulse response has a low pass nature -- it smears the spectrum, it's like adding things -- so the inverse filter should have a high pass nature, which I do find here.
The second thing I expect is that even though the utterances are different, because the room impulse response is the same, the filter responses should be almost the same for all the different utterances. If that holds, it means the algorithm is not so much dependent on the speech utterance; it is more dependent on the room acoustics. As we see here, looking at these utterances, it is trying to do that. To test this specifically, I took 20 different utterances, averaged the filters learned from them into a single average filter, and then applied that single average filter to all the rest of the utterances. Here is an example of the average filter for a particular condition, and here are the results. This is MFCC, this is LIFE obtained by working on the utterances individually, and this line is obtained by applying the average filter learned from 20 utterances to all the rest of the utterances. So even though there is a bit of mismatch error, this does suggest that in general I'm learning the room acoustics more than the speech, because just one filter applied to that particular condition works nearly as well. Okay. I did experiments on different databases. This is on a large room, still simulated, and this result is from the ATR database. The ATR database was collected in Japan, where they played out audio from the [inaudible] database and re-recorded it in different rooms, and they also collected room impulse responses in different rooms. So this experiment is a little bit different: it was done by taking a room impulse response from the ATR database, applying it to my clean speech, and getting reverberated speech. This is still a [inaudible] result; it compares the difference between a simulated room impulse response and an actual room impulse response. We do see that the improvements definitely hold up even for an actual room impulse response. So far I have talked about the clean condition, where training was done on clean speech and testing was across different reverberation times. But as we have noted recently -- actually, not so recently -- that is not the best thing to do. People in industry especially are training across all the different conditions they can find, across all the noise and reverberation conditions preferably; they mix it all into a training database, and then it's like multi-style training. So the question is whether the algorithms hold up in matched conditions or not. That is what I'm going to evaluate next. To do that, I did this experiment: I took two different rooms and got room impulse responses from them, and this is from the ATR database, and I have all these conditions. From these I did multi-style training, so the training database has room impulse responses from different rooms and different conditions. There are more results, but I present just one here. This experiment was specifically done by training on room 1 with these reverberation conditions here, and then testing was done on a large number of different conditions: these are on room 1, these on room 2, and this is from the ATR database here.
So the first thing that is very noticeable here is that the baseline MFCC performance improves drastically. When I first did this experiment I was surprised to see that, because just the matched conditions show about 60 to 70 percent relative improvement on MFCC alone. But a good thing is that applying the LIFE processing helps further. These are highly matched conditions -- this data is from room 1, with nearly the same reverb times -- and in these highly matched conditions I'm demonstrating about 15 percent relative improvement. These are slightly less closely matched, and we can see that the MFCC performance still shows a huge improvement even when the matching is not strict, but in these cases the LIFE filter provided about 20 percent relative improvement. So the point is that the improvement is somewhat smaller if the matching is very strong, but in general we cannot expect that in a practical setting, because we cannot really have a perfect match in any case, right? So whenever the matching is not totally perfect, we will still get a very significant improvement, even in matched conditions. >>: [inaudible]. >>Kshitiz Kumar: Yeah, this is still Resource Management. >>: [inaudible]. >>Kshitiz Kumar: On the same database, yeah. On the same set, right. Actually, I'll answer that later. Okay. This is essentially the same experiment but on Wall Street Journal. In this experiment I train on room R1 at these reverb times, and I test across these room R1 conditions here. Again, one noticeable thing is that the clean error actually increases here, as in the previous experiment, because the training is now done across a variety of room impulse responses, so the clean performance actually degrades for the baseline MFCC. But the LIFE processing again shows improvement on the clean condition here. Yeah? >>: [inaudible]. >>Kshitiz Kumar: This is word error rate. >>: [inaudible]. >>Kshitiz Kumar: Actually -- okay, let me -- sorry, I should have explained this better. There are two sets of experiments here: one is the matched condition and one is the clean condition. As you can see, these two experiments are trained on clean and these two are trained on the matched condition. So we need to compare how much improvement we get in the matched condition, and also how much better we do than the clean training. So as I said before -- >>: You were probably going to say it, but the matched condition, is it matched on the zero reverb time? >>: [inaudible]. >>Kshitiz Kumar: It's multi-style, actually. There are three sets here, and this uses all three sets, right; it's not just one. I'll show that. But anyway, I said that the clean error increases, but the processing improves things here, and here also the processing is improving. On this Wall Street Journal experiment it's about 15 to 19 percent relative improvement on average. >>: So can you make a comparison [inaudible]. >>Kshitiz Kumar: That's true. Actually, I still have to do that. I'm scheduled to do that sometime before my -- yeah, I have to do that. I haven't been able to do it so far. Yeah, exactly.
Because this is like adaptation, where given some knowledge about the speech features, we adapt to them. But I think the key difference is that in MLLR they do a linear transformation, whereas here I'm doing a filtering transformation. That's the key difference. >>: [inaudible]. >>Kshitiz Kumar: Uh-huh. >>: [inaudible]. >>Kshitiz Kumar: Okay. Yeah. That's true. I'll look into that. >>: So if you look only at the clean training, non-matched, at R1 500, your error rate basically goes from 10 to -- >>Kshitiz Kumar: Yeah. >>: After LIFE it's 19 percent. >>Kshitiz Kumar: I mean, that is training on clean. But if you do -- >>: That means that even if your LIFE is doing a perfect job -- >>Kshitiz Kumar: It's not doing a perfect job. >>: [inaudible] it's not. It seems to me it's very far from perfect. >>Kshitiz Kumar: That's true. >>: [inaudible]. >>Kshitiz Kumar: Yeah, that's true. >>: How do you interpret this? Is your model too [inaudible] or -- >>Kshitiz Kumar: Yeah, I think one of the reasons is that my model fits well when the reverberation time is not very high, but if the features are very, very highly reverberated, then the model is not doing a great job, I think. That is how I interpret it. And especially if the database itself is difficult: if you do the same experiment for Resource Management -- this experiment here is the same thing for Resource Management -- there it goes from a bit less than 80 percent down to about 40 percent, so the reduction is very large. So there are two things here: the database itself is very difficult, and then the complexity of the reverberation -- yeah, my model is not perfect, of course. That's the point. >>: There's two pieces to [inaudible], so I imagine that as the data gets more and more reverberated, it's harder to find the right [inaudible], so I was wondering if you ever ran -- or since all your data is synthetic, you can actually, like, cheat and figure out what is the optimal Gaussian for that, and then -- like, use the clean speech as the target rather than [inaudible], and then say that's the best your model could likely do if the target observations are perfect. >>Kshitiz Kumar: Right. >>: [inaudible]. >>Kshitiz Kumar: No, I haven't done that, but I did a slightly different version of it. What I did was look at the Gaussian component that gets picked in the clean case and the Gaussian component that gets picked before and after LIFE processing. What I find is that the Gaussian components that get picked after LIFE processing match better with the clean-condition data. So, yeah, I did that experiment. I don't remember exactly, but I would expect the same thing to hold if the reverberation time is small, and probably not to hold as well if the reverb is very high. Actually, I will extend this method later on as well. But anyway, that was training on all three different reverb times. Next, I was also curious about another experiment where I did training on just one condition.
So I just reverberated my database with 300 milliseconds of reverb time, just one condition, and I compared this with the case where we have three different conditions, just to see if there is additional improvement in a perfectly matched condition. This experiment here is a perfect matched condition, because it is trained on this specific condition and tested on the same specific condition. Yeah? >>: So in this experiment here, it seems like previously this MFCC number would be, like -- >>Kshitiz Kumar: Actually, it will be this one. This is MFCC and this is trained -- >>: But the order seems to be that, like, MFCC and then LIFE was a little better and then matched condition was even better than that. Am I remembering the slide wrong? >>Kshitiz Kumar: Actually, there is a bit of a difference here -- >>: Yeah, yeah. It's -- >>Kshitiz Kumar: These two are on clean training, actually. >>: Oh, okay. >>Kshitiz Kumar: So now -- >>: [inaudible]. >>Kshitiz Kumar: Actually, yeah, these two are -- right. >>: The blue and light blue -- >>Kshitiz Kumar: Yeah, right, right. So these two become these here. The real difference is that here the matched condition was a multi-style match, and in these two cases the match is a very strict match. >>: Okay. That makes sense. >>: So the gain is just from -- >>Kshitiz Kumar: Yeah, the gain is just from [inaudible]. And note that the clean error shoots up very high, as expected here. But anyway, the point is that on this database, even with a very strict match in reverberation, there is like a 15 percent reduction here. Okay. So, finally, in conclusion: LIFE processing is an adaptive solution. It does not require any prior information; it builds only on knowledge of the clean speech feature distribution, and we get high pass [inaudible] mitigating the room impulse response. Next I'll talk about non-negative matrix factorization, which -- >>: In the other one -- so what you're estimating is the [inaudible]. >>Kshitiz Kumar: In my LIFE processing I estimate an [inaudible]. I also did it with an FIR filter. I get good benefit with FIR as well, but IIR provides a little more benefit. The reason is that with IIR I need fewer parameters; if I have more parameters in the FIR, I think the -- >>: Do you have any way to control for -- >>Kshitiz Kumar: Parameters? >>: Stability. >>Kshitiz Kumar: Oh, yes. >>: [inaudible]. >>Kshitiz Kumar: Yeah, that's true. It is possible to do that, and what I currently do is check for it, and if I violate my stability condition, I just stop there. This is what I do right now. But in my experiments I find that the filters are stable; in all the experiments so far where I checked for it, very rarely did I find one that was not stable. There is probably some reason why, but experimentally it just turns out to be stable. >>: So back to the question [inaudible], what's the comparison -- of course you compare with a basic MFCC. >>Kshitiz Kumar: Uh-huh. >>: But suppose you use a much simpler high pass filter rather than doing the estimation. >>Kshitiz Kumar: Oh, yes, uh-huh. >>: That would be a different baseline. Do you have any thoughts on that? >>Kshitiz Kumar: Actually, Professor Rich Stern did ask me that question a long time back, and I didn't do those experiments.
Because if you look at the filters here, they look high pass -- they are high pass filters, right. So he asked me to try to fit this response empirically with a one-pole or two-pole filter. I tried to do a good job of fitting them, and the result is actually somewhat intermediate: it does not really achieve this much -- not even half -- but it does provide a benefit. It just does not reach this level. >>: So why would you limit yourself to one or two poles? Why not go further, like doing a full [inaudible] normalization or -- >>Kshitiz Kumar: Okay, I don't know. I think I did not do that. >>: You didn't design a multiple-coefficient filter and minimize the [inaudible] with respect to the filter coefficients. >>Kshitiz Kumar: I think the reason I did not do that is that, had I done it, it would be very specific to one particular reverberation time. I wanted to do it to study how it works, but in practice I can't really do that because I don't know the room acoustics. That is why I did not take it further. But even with a two-pole filter I was getting the fit to a considerable extent. Okay. So the NMF approach also works in the framework that I talked about, but instead of working in the cepstral domain it works in the spectral domain. Again, the model is a convolution in the spectral domain. Notice that what we get is the cap Ys sequence; we don't know the cap H or the cap X sequences, and we have to split the observation into those two terms. There are obviously many ways in which cap Ys can be split into these two terms, so to do a better job we need to specify some constraint, something that we know about the signal of interest. In the non-negative matrix factorization I use the sparsity of the signal -- the sparsity of the speech spectra -- and I use the non-negativity of the spectral values as an additional, obvious constraint. So here the problem formulation is to minimize a mean squared error objective function. Given an initial estimate of the cap X and the cap H -- actually, I think there is a multiplication here; sorry, this is not a minus -- I'm trying to minimize this squared error criterion subject to a sparsity constraint: I want the spectra to be sparse, so only a few of the values should be large and the rest should be small. Again, this is solved in a gradient descent fashion, but we need to ensure that the values stay non-negative. I've skipped the equations here, but eventually it turns out to be a multiplicative update. >>: [inaudible]. >>Kshitiz Kumar: In the spectral domain. >>: Not in the time domain? >>Kshitiz Kumar: Not in the time domain. >>: [inaudible]. >>Kshitiz Kumar: Yeah, so I discussed the feature domain earlier; I said I'm trying to represent reverberation not in the time domain but in the feature domain, because speech recognition actually works on the features. And then I showed that I can represent the reverberation that is actually happening in the time domain in the spectral domain, with a small error. >>: [inaudible].
>>Kshitiz Kumar: Actually, it's both. Initially it is in the spectral domain and then also on the Mel spectra. >>: And to be definite, the convolution is not -- it's across the [inaudible]. >>Kshitiz Kumar: No, it is within one frequency channel, yeah. >>: If you're treating samples -- it's still a time domain convolution using a single [inaudible] over time [inaudible]. >>Kshitiz Kumar: No, no. I'm treating each of the short duration powers in a particular frequency channel; they are convolved. >>: With each other over -- >>Kshitiz Kumar: With each other, yeah. >>: [inaudible]. >>Kshitiz Kumar: Yeah, it's a time sequence, right. >>: [inaudible] sequence of the log -- >>Kshitiz Kumar: Not the log. The Mel spectra, yeah. >>: So what is the theoretical justification for this? Why is convolution the best model there? >>Kshitiz Kumar: The theoretical [inaudible] to some extent lies here. This is what I'm doing: I am representing the blue line with the red line, so I'm representing a convolution in the time domain as a convolution in the spectral domain. There will be some error, but the error, as we can see in a practical application, is small, and it gives us a simpler model that lets us do parameter estimation for the reverberation compensation. >>: But the thing is you don't have to do the convolution. You could say it's multiplicative or additive. You can also move the two lines closer. >>Kshitiz Kumar: Yeah. >>: So why is the convolution particularly indicated? >>Kshitiz Kumar: I can use multiplication, but then I have to use a very long analysis window -- the duration of the analysis window would have to be up to a second. That is what was done in the LTLSS work; the analysis window was one second or so. But right now I'm working with a small window, and because the reverberation will smear and reach into future frames, energy from past samples will overlap into future segments, right? So this is an additive phenomenon. And it is not completely linear: there is also a noise term here which I'm ignoring right now, because if there were no noise, these two lines would match perfectly. There is an approximation error. But still, the good thing is that even with a small approximation error, this is a good model. Okay. So, coming to your point, this is my NMF compensation framework right now. What I do is window the signal -- [inaudible] per framing window -- in either the magnitude or the power domain; I can do it in either. Then I can apply the NMF processing on the [inaudible] frequency channels, or I can apply the same processing in the gammatone filter bank. So there are different flavors of this algorithm. To go into the gammatone filter bank, I apply the gammatone filtering, then apply the NMF, and then I reconstruct by applying an inverse transformation. So the question is which will be better: working in the magnitude domain or the power domain, in the gammatone domain or the Fourier domain. These are the questions I'm going to take up. Yeah? >>: [inaudible]. >>Kshitiz Kumar: It's [inaudible] actually. >>: [inaudible].
>>Kshitiz Kumar: I did do experiments changing that parameter, but I find that it does not really change things a lot. Maybe I need to do more experiments with it; I have not explored it a lot. This is just what we chose, and changing the parameter a little doesn't really change much. Okay. So this is an example of applying this processing in the magnitude domain and in the gammatone filter domain. This is the reverberated sequence here, and we see that the processing helps to remove the spectral smearing and we get sharper, cleaner spectra. Okay. And in this experiment -- Yeah? >>: [inaudible]. >>Kshitiz Kumar: Oh, sure. >>: This is before and after? >>Kshitiz Kumar: Yes, this is before and after. Yeah. [Demonstration played] >>Kshitiz Kumar: If you listen to it carefully -- it's not perfect, of course, but it does help in that direction. >>: [inaudible]. >>Kshitiz Kumar: Okay. So in this experiment I'm comparing different versions of this algorithm: applying it in the Fourier domain versus the gammatone domain, and in the magnitude domain versus the power domain. The first thing I would expect is that if I apply the processing in the gammatone filter domain, because I have only 40 gammatone filters, compared with about 256 Fourier frequencies, the dimensionality is reduced a lot. And because the gammatone filters are designed to focus on the frequency components which are more useful for speech recognition, I would expect the performance to be better. Additionally, I find that if I do the same processing in the gammatone filter domain, the approximation errors are smaller -- earlier I talked about the approximation error -- because the gammatone filters average over frequencies, and because of that some of the noise gets smoothed out. So there are already lots of benefits in the gammatone domain. Starting here, this is the MFCC baseline and these are the NMF algorithms. This bunch is for the Fourier frequencies and this bunch is for the gammatone frequencies, and on average the error with the gammatone frequencies is much smaller. Then, within the gammatone results, I also compared the power domain versus the magnitude domain, and I find that applying the processing in the magnitude domain gives smaller error. Again, I did look into that, and I found that the approximation error was smaller in the magnitude domain. And there is one more experiment here, labeled dash H. Dash H means I apply the same filter across all the frequency bands. I was interested in seeing whether I can apply the same filter to all the frequency bands and, if I instead apply an individual filter to each frequency band, how much additional improvement we get. Of course we expect improvement; the question is how much. As you can see, we go from here to here, which is still a significant improvement, by applying individual filters in the different channels on top of the rest of the improvement.
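For readers who want to see what a multiplicative-update solution of this kind can look like, here is a rough sketch of sparse non-negative deconvolution on a single Mel channel. It is my own illustrative code under the sparsity and non-negativity constraints stated above, with made-up parameter values and a toy signal; it is not claimed to be the exact update rule used in the talk, since those equations were skipped.

    import numpy as np

    # Sketch of per-channel non-negative deconvolution with a sparsity penalty:
    # model the observed reverberated power sequence y as h * x, with x, h >= 0
    # and x encouraged to be sparse, solved by multiplicative updates.
    def nmf_deconvolve(y, K=8, lam=0.1, n_iter=200, eps=1e-9):
        T = len(y)
        x = np.full(T, y.mean())          # non-negative initial estimates
        h = np.full(K, 1.0 / K)
        for _ in range(n_iter):
            y_hat = np.convolve(x, h)[:T]
            # update x: ratio of cross-correlations of h with y and with y_hat
            num_x = np.convolve(y, h[::-1])[K - 1:K - 1 + T]
            den_x = np.convolve(y_hat, h[::-1])[K - 1:K - 1 + T] + lam + eps
            x *= num_x / den_x
            # update h: correlate the (updated) estimate with y and y_hat
            y_hat = np.convolve(x, h)[:T]
            num_h = np.array([np.dot(y[k:], x[:T - k]) for k in range(K)])
            den_h = np.array([np.dot(y_hat[k:], x[:T - k]) for k in range(K)]) + eps
            h *= num_h / den_h
            # pin down the scale ambiguity between x and h (h sums to one)
            s = h.sum() + eps
            h /= s
            x *= s
        return x, h

    # Toy usage: a sparse non-negative "clean" sequence smeared by a decaying filter.
    rng = np.random.default_rng(2)
    x_true = rng.random(300) * (rng.random(300) < 0.15)
    h_true = 0.6 ** np.arange(6)
    y = np.convolve(x_true, h_true)[:300]
    x_est, h_est = nmf_deconvolve(y)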
>>: So is there a theoretical basis why NMF would work better in this case for magnitude versus power? Or is this just an empirical -- >>Kshitiz Kumar: This is largely an empirical result. What I did was look at the approximation error -- remember that my model has an error -- and what I empirically found was that the noise power was about 13 decibels below the signal power in the power domain, but in the magnitude domain it's about 17 dB down. So empirically the approximation error is smaller, and I think that is one of the reasons. Okay. These results are again on the Resource Management database, and here again I see a very sharp improvement in performance. This is clean condition training. And these experiments were done on ATR: this experiment uses a simulated room impulse response taken from the ATR room impulse responses, and this one uses the actual ATR database, which was collected in a room. We see that the improvement holds up in both cases. They're not directly comparable, because this room impulse response is not the same as the room impulse response here -- I think the database just didn't have that. But anyway, we see that the improvement holds up on real [inaudible] speech as well. Another thing I wanted to show -- [Demonstration played] >>Kshitiz Kumar: Actually, this has noise in it if you listen carefully; the database also has additive noise. So the point is that even if there is some additive noise, the performance does not fall apart; the algorithms still hold up. [Demonstration played]: She had your dark suit in greasy wash water all year. >>Kshitiz Kumar: The speech may not sound that great, but the speech recognition accuracy definitely holds up. The problem is that many of the algorithms designed for dereverberation are so specific to the reverberation task that even with a small amount of noise they don't really hold up. I don't want to go into those directions here. Okay. These are again RM matched condition experiments. Again we see that the baseline improves a lot, but we still get additional improvement -- this is again a multi-style condition, with data from all these conditions here -- of about 15 or 20 percent relative reduction in error rate. Okay. So I talked about the dereverberation algorithms and the spectral deconvolution representation, and now I'll briefly talk about a noise compensation algorithm which I find very useful. It works on an additive noise model. Before talking about that, I'll briefly review delta cepstral features. Conventionally we have, like, 13 MFCC features, and then we append delta cepstral and double delta cepstral features to get a 39-dimensional feature. What we find is that those features are very important for speech recognition, and this is an example: this is on real-world collected noise -- it's still simulated in the sense that the noise is added to the speech -- and we see that adding the delta features improves things by like five or seven [inaudible]. So the point is that delta features are very important for speech recognition accuracy, and even the clean performance improves. So the next question is: is that the best we can do?
And the question is: how robust are the delta features? Here is an example. On the top panel, this is clean speech and this is noisy speech at zero dB. This is the spectral power plot, this is the log spectral power plot, and this is the delta feature we obtain from it. What we would expect is that if the delta features are robust, these two should match very well. But, as we can understand, because we are applying a log compression here, the high energy regions will usually match well while the lower energy values will match very little. Because of that, the delta features are not really robust to noise and they don't really follow the clean speech features. So there is a problem here, and the next question is what better we can do. In my work -- this blue line is the power plot for speech and this is for a real-world noise -- what we find is that for a lot of the noises we have in the real world the dynamic range is not that high, but speech power has a very high dynamic range. The speech energy values change by something like 30 dB between speaking and non-speaking, but the noise energy changes by only about plus or minus 5 decibels. So we can use this cue. Also note that speech is highly non-stationary while, to the same point, noise is relatively stationary. So what we can do is apply a temporal difference operation directly on the spectral values here. What I do is apply the delta operation on the Mel filter outputs. If I do that, there are some more things I need to work on, which I'll talk about later -- in particular, I'll have to apply a non-linearity, because the values become negative when I apply a difference operation to the spectral values, which are normally non-negative, so I need to do something about that. But anyway, I'd like to show figuratively whether this is helpful. This is the same plot I had before, and this is the plot I get by applying the temporal difference operation directly on the spectra. As we would naturally expect, when we apply a temporal difference, the speech energy changes very fast and the speech power dominates, so we'll get a strong following of the speech energy, but where the speech power is not changing, things are relatively stationary, and for the stationary parts the values will be nearly zero. So, as we can see here, applying the temporal difference shows a lot of benefit. This plot was done by taking a short duration power over the entire bandwidth of the signal, and these plots are for doing the same operation on the output of a [inaudible] Mel channel, just one frequency band. Again, this is speech, this is noise, this is speech added to noise; this is the conventional approach, applying the log and then the delta, and this is applying the delta directly on the spectral sequence here. And from here to here, this is applying a non-linearity. For the non-linearity, what I find is that these features have a super-Gaussian distribution.
They don't have a Gaussian characteristic, so we can't model them with a Gaussian mixture model; we have to apply a non-linearity to let them be modeled in a [inaudible] framework. What I do is take the sequence here and make its histogram Gaussian. Later I need to improve this a little, but this is what I do right now: I make them Gaussian so they can be modeled by a GMM. Okay. So what we have achieved is that by taking the delta on the spectra we are still able to track the changes in the speech, we are able to attenuate the changes in the noise, and it also emphasizes the speech onsets -- it has been shown that onsets are very important for speech recognition -- because a difference operation is going to emphasize the speech onset. And there is a non-linearity to make the features modelable by a Gaussian mixture model. Next I will compare results using my delta features and the conventional way of doing delta features. Notice that we haven't added anything fundamentally new: what we have done is a temporal difference operation in a better domain, followed by a non-linearity to make the features Gaussian. That's it. These are a number of different conditions -- white noise, music, real-world noise and reverberation -- and we see that just doing that provides a lot of improvement. Especially for music noise and for real-world collected noise, at 50 percent error we have something like a 5 or 7 dB threshold shift here, and it also helps a lot in reverberation. >>: [inaudible]. >>Kshitiz Kumar: Right now it is actually reference specific, but we can learn that Gaussianization from clean speech and then apply the same one, because it's not fundamentally different across different references. We can learn the non-linearity shape, basically, for clean [inaudible] and then apply the same for all the different conditions. I haven't done that, but -- >>: [inaudible]. >>Kshitiz Kumar: Yes. Yeah. >>: Someone from, like, ATR [inaudible] basically proposed this reverberation [inaudible], so I'm just wondering if you're familiar with that work and the differences. >>Kshitiz Kumar: Not exactly. I have done some searching, and what I see is that there's one paper by, I forget the name, Kuldip Paliwal [phonetic]. He has a paper where he does something similar, but then he applies a modulus operation to make the values non-negative: he takes the difference and then applies a modulus to make it non-negative. I did an experimental comparison with that; it showed some improvement, but not a lot, because once you take the modulus you lose the difference between a rise in energy and a fall in energy -- they become the same. So I think that was the closest work I found. >>: I'll see if I can find it. >>Kshitiz Kumar: Yeah, sure. Okay. And this is the same comparison, where I compare MFCC with delta cepstral features versus delta spectral features, and I also compare with the ETSI advanced front end. So comparing this red line versus this blue line, this is the improvement that we get from the advanced front end.
And this is the same comparison, where I compare MFCC with the delta cepstral feature versus the delta spectral feature, and I also compare with the advanced front end from ETSI, the European telecommunications standards institute. Comparing this red line against this blue line, this is the improvement that the advanced front end provides. So the advanced front end shows a lot of improvement over the baseline, but if I replace the delta cepstral feature with the delta spectral feature, we see that even the baseline MFCC features work almost as well as the advanced front end system. So the point is that we are getting a lot of improvement from very little work -- just changing the domain of the delta operation -- and we approximately obtain the gain that the advanced front end provides across different noise and reverberation conditions. Okay. So -- >>: What's the number of Gaussian mixtures you're using for this experiment? >>Kshitiz Kumar: Actually, here there is no Gaussian mixture. There is a Gaussian mixture for speech recognition -- >>: [inaudible]. >>Kshitiz Kumar: In the HMM. >>: [inaudible]. >>Kshitiz Kumar: I think eight. For Resource Management, we use eight. >>: [inaudible]. >>Kshitiz Kumar: That's true. >>: [inaudible]. >>Kshitiz Kumar: Yeah, that is possible, but I just used the baseline. The reason is that if I increase the number of mixture components, I am adding many more parameters to my model and I might overfit. Whereas I can easily apply the non-linearity: it is just one non-linearity, like a log non-linearity but a different one, which brings the features closer to Gaussian. So that is the approach I took -- rather than increasing the number of parameters, I thought it better to work on the features themselves. Okay. What I will try to do now is understand how much benefit this noise compensation can provide without running a speech recognition experiment: how much benefit should we expect from this compensation, and how much noise is actually reduced, analytically. To do that, consider white noise, for example. I take a segment of white noise and compute its power p, which is a random variable. This p follows a (scaled) chi-squared distribution with N degrees of freedom, but we can treat it as approximately Gaussian when N is large. So p is approximately Gaussian with mean sigma squared and the variance shown here. Now, this can be split into an AC power and a DC power: the DC power is just the mean squared, and the AC power is the variance. To an approximation, the algorithm will definitely remove the DC power, because we are subtracting, so the DC component will be totally gone. Under that approximation, the reduction in distortion power can be obtained from this expression, which turns out to be this; for a 25 millisecond window, N is 400, and we expect the noise power associated with a white noise signal to be reduced by about 23 decibels. This gives us some sense of where we are heading. We can do the same thing for white noise, real-world noise, babble noise, music noise, and speech -- the white noise case is theoretical, but the others are experimental: I take a segment, do the same computation, and see what happens. The blue line here is what I obtain by doing a speech recognition experiment. I would expect that at least the trend should be followed; the two are not exactly the same, because I am not comparing like with like. But at least I would expect the trend to be followed in the two, which, to some extent, I do see: where these values are high, those values are also high, and they decrease together. The good part of this analysis is that it tells us, without doing a speech recognition experiment, roughly how much benefit we can expect from the processing.
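The expression on the slide is not reproduced in the transcript, so here is one way to recover the quoted 23 dB figure, under my reading of the argument: the frame power of zero-mean white Gaussian noise has mean sigma squared and a small variance, mean subtraction removes the DC (mean-squared) part of the distortion, and only the AC (variance) part remains.

```latex
% Frame power of N samples of zero-mean white Gaussian noise x[n] ~ N(0, \sigma^2):
p = \frac{1}{N}\sum_{n=1}^{N} x^2[n], \qquad
\frac{Np}{\sigma^2} \sim \chi^2_N
\;\Rightarrow\;
\mathbb{E}[p] = \sigma^2, \quad \operatorname{Var}(p) = \frac{2\sigma^4}{N}.

% Total distortion power in this domain = DC part + AC part:
\mathbb{E}[p^2]
= \underbrace{\bigl(\mathbb{E}[p]\bigr)^2}_{\text{DC}}
+ \underbrace{\operatorname{Var}(p)}_{\text{AC}}
= \sigma^4 + \frac{2\sigma^4}{N}.

% The subtraction removes the DC part, leaving only the AC part, so the reduction is
10\log_{10}\frac{\sigma^4 + 2\sigma^4/N}{2\sigma^4/N}
\approx 10\log_{10}\frac{N}{2}
\approx 23\ \text{dB}
\quad\text{for } N = 400 \ (\text{25 ms at 16 kHz}).
```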
So, finally, a joint noise and reverberation model. As I said before, this reverberation model has an approximation error, and we can account for that approximation error with an additive noise term here. This also acts as a unified model for joint noise and reverberation: if the data also contains noise, we can account for that noise in the same term. So this is a better model for reverberation and also a model for joint noise and reverberation. Comparing this with the prior model that people have worked on: previously people used this multiplicative model with an additive noise term, and what I am doing now is replacing the multiplication by a convolution, so I have more parameters to estimate and work with. Next, to compensate for the filter I apply NMF, and to compensate for the noise I apply the delta spectral features. Anyway, this is the framework I am working on. I have this observed signal Y -- this is what I get; it has both reverberation and noise. I first apply NMF, and NMF reconstructs the speech signal. Then I pass it through the Mel spectra, and from the Mel spectra I get the DSCC features. I append these two, so these are again 39-dimensional features, but they have compensation for both noise and reverberation -- not totally correct, but they do. So the question is how well it works. This is an experiment with only reverberation, the same setup I showed before, and these are the results I showed before from just applying NMF. Now, if I also add the DSCC features, I get additional benefit here. Actually, I notice that the clean performance increases; I need to look into that more carefully and see what is going on there. But at least in reverberation we see a very sharp improvement across all of these conditions. Now, this is the same experiment where I pass the signal through reverberation and then add real-world noise on top, so it has both reverberation and noise. We can see that the baseline MFCC features degrade very heavily when the data has both noise and reverberation. Applying NMF shows improvement, applying only DSCC shows a lot of improvement, and combining them with NMF in a joint fashion shows additional improvement over the rest.
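A minimal sketch of the joint compensation pipeline described above, reusing simple_delta and histogram_gaussianize from the earlier sketches: dereverberate the observed signal, compute Mel spectra, derive DSCC-style delta-spectral features, and append them to the static stream. The NMF dereverberation is only a named placeholder here (the actual NMF method is described earlier in the talk), the toy Mel computation and frame sizes are my assumptions, and the resulting dimensionality differs from the 39-dimensional features the talk quotes, since how the streams are combined there is not specified.

```python
import numpy as np

def nmf_dereverb(signal):
    """Placeholder for the NMF-based dereverberation from earlier in the talk;
    it returns the input unchanged so the pipeline sketch stays runnable."""
    return signal

def mel_spectra(signal, frame_len=400, hop=160, n_mels=40):
    """Toy Mel-power computation: framewise FFT power pooled into equal-width
    bands, standing in for a real Mel filterbank."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    power = np.abs(np.fft.rfft(np.stack(frames), axis=1)) ** 2
    bands = np.array_split(np.arange(power.shape[1]), n_mels)
    return np.stack([power[:, b].mean(axis=1) for b in bands], axis=1)

def dscc_features(mel_power, k=2):
    """Delta taken directly on Mel power, then the Gaussianizing non-linearity
    (see the two sketches earlier in this document)."""
    return histogram_gaussianize(simple_delta(mel_power, k))

def joint_features(observed):
    """Dereverberate, then append the static log-Mel and DSCC streams."""
    clean_est = nmf_dereverb(observed)
    mel = mel_spectra(clean_est)
    static = np.log(np.maximum(mel, 1e-10))
    return np.concatenate([static, dscc_features(mel)], axis=1)
```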
>>: You say that you -- forward one. There. You say that you created this data by first reverberating the clean speech and then adding noise? >>Kshitiz Kumar: Yes. >>: Is there any issue here with the noise not having the reverberation, as if it were from the same room? That is, if you added the noise and then reverberated it, would the algorithm work the same? >>Kshitiz Kumar: Actually, I haven't done that experiment. But this noise was collected in a real-world setting -- it was collected in a restaurant and in a factory -- so the noise already has some reverberation in it. But it's not exactly the same; it's different from the H filter. That's true. I will probably need to do that experiment. I did not do it. >>: So do you think it would affect the experiment, or do you think it would not? What's your intuition? >>Kshitiz Kumar: I don't think it will affect the experiment. I think I would still get significant improvement. I think so. Okay. So, finally, I'll very briefly present a profile lip-reading experiment. The idea is that we can combine audio features with image features. People have usually worked on frontal-view images and extracted features from them, but in one of my experiments I collected profile-view images, extracted features from the profile view of a person, and tried to integrate them. These are the images, which were collected from the side as well as from the front, and I am working on these images. Everything is extracted automatically. The features defined here are the lip protrusion features: because I have a profile image, I can measure the movement along the protrusion direction, the horizontal direction here. I can also measure the lip height, the movement in the vertical direction. Note that the lip protrusion information cannot be obtained from a frontal image, because the movement in that direction cannot be extracted there. So this gives additional information from the profile-view images, and these are the processing steps. Anyway, these are speech recognition experiments done on a different database. This is the frontal-view width parameter, the frontal-view height parameter, the profile-view height parameter, and the profile-view protrusion parameter. The key thing I take away from this is that the profile-view protrusion feature seems to be very important for the speech recognition experiment here, and this feature is not present in the frontal-view images used in previous experiments, so it adds a lot of information. Then I can also combine all of this with the audio features: this is the curve for the audio features, whose performance degrades as the SNR approaches zero, but the profile-view features stay the same -- they are not affected by noise. And if I combine the two, I get this intermediate curve here.
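For the profile-view visual features, here is a small illustrative sketch of the kind of per-frame measurements described above -- lip protrusion along the horizontal axis and lip opening along the vertical axis -- computed from tracked lip-contour points. The contour representation, the axis conventions, and the lack of any normalization for head motion or scale are all my assumptions; the thesis' actual extraction steps are only summarized on the slide.

```python
import numpy as np

def profile_lip_features(contour):
    """contour: (K, 2) array of (x, y) lip-contour points from one
    profile-view frame, with x pointing forward out of the face and y downward.
    Returns (protrusion, height) -- purely illustrative definitions."""
    xs, ys = contour[:, 0], contour[:, 1]
    protrusion = xs.max()           # forward-most lip point along the horizontal axis
    height = ys.max() - ys.min()    # vertical lip opening / extent
    return protrusion, height

def feature_track(contours):
    """Per-frame (protrusion, height) values for a sequence of contours;
    the frame-to-frame movement of these values carries the protrusion and
    height information mentioned in the talk."""
    return np.array([profile_lip_features(c) for c in contours])
```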
So, overall, in summary: a key purpose of my Ph.D. work was to study reverberation in the feature domain, making approximations and working directly in the feature domain so that we have a tractable, simple model, and then doing dereverberation on those simple models. Then I proposed two ways in which we can use speech knowledge, one based on the likelihood parameter and the other on sparsity information. I proposed the delta spectral features to attenuate noise, then I studied a joint noise and reverberation framework, and then audio-visual integration. These are the different ways in which speech recognition performance can be improved in a real environment. Anyway, thank you for coming; it was a pleasure to be here. The papers are available on my website, and the algorithms are available on the robust speech group website, so you can take a look. And, yeah, thank you. [applause] >>Kshitiz Kumar: Thanks. We can take questions. >>: So I never saw a slide that compared with, like, NMF. >>Kshitiz Kumar: Okay, it was not there. I did compare LIFE and NMF; they provide comparable improvement. LIFE is a little better when the reverberation is high, and NMF is a little better when the reverberation is small. But the good thing is that I can do one after the other: I can first pass the signal through NMF and then apply the LIFE processing. I did not show that slide -- maybe I should have -- but I get further improvement if I do that. So the improvement is, to some extent, additive if I do both. It's not huge -- it's not one plus the other -- but there is an additional relative reduction in error rate of around 10 percent for those experiments. Yeah. >>: So actually you don't need to assume that [inaudible], just use the linear [inaudible], that being a combination -- you have a matrix, and then you optimize the NMF objective given this assumption that [inaudible] all the energies are mixed, and then, given the constraint, you try to minimize the [inaudible] of the recovered X signal. So then you can do some kind of recovery of the original signal. I think the performance might be about the same as where you are. So given the assumption of the convolution that you have, I think [inaudible]. >>Kshitiz Kumar: Yeah, that's -- >>: [inaudible]. >>Kshitiz Kumar: Yeah, that's true. I did start by making the convolution assumption, but as you saw later, I added a noise term as well, to account for the approximation error, and to handle that noise term I added the noise compensation algorithm. So, in the end, I represent reverberation not as a convolution but as a convolution plus noise. By working in this framework, if I do something to compensate for the reverberation and something to compensate for the noise, the improvements are additive. Yeah. >>: I think if you assume convolution and then [inaudible] reduction for ease of computation, you can transform to another domain, like the Fourier domain, something like that. You can get some [inaudible]; I mean, it's a simple assumption of convolution without doing any further transformation to another domain [inaudible]. >>Kshitiz Kumar: Yeah. But I think the issue is that we also have to be careful about the window size: we have to work with small windows and do the analysis on small windows, and that is also important. That is why I need the convolution, because it accounts for the additivity of energy from previous segments into the current segment. Is that what you're asking? >>: [inaudible]. >>Kshitiz Kumar: I don't need to assume convolution, but the point is that the additive component is correlated with the speech -- so there is a convolution. The additive part is not uncorrelated with the speech. If the additive components were completely uncorrelated with the speech, then it would be noise, right? But the additive components are correlated with the speech, and in that case I can't treat them as purely additive, because the usual noise algorithms assume that the noise is uncorrelated, so it would violate that assumption. So I think the best thing is probably to have both the convolution and the noise. This is very generic: we can then apply the same framework to only noise, to only reverberation, or to both noise and reverberation. So it extends.
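As an editorial aside on the modeling discussion above, the contrast between the prior feature-domain model and the convolution-plus-noise model can be written out as follows. The notation (frame index t, filter taps h_k, noise term n_t) is mine, since the exact symbols on the slides are not in the transcript.

```latex
% Prior feature-domain model: a fixed multiplicative channel plus additive noise,
% applied independently at each analysis frame t:
y_t \approx h \, x_t + n_t

% Model used here: the channel acts as a convolution across frames, so energy from
% earlier frames leaks into the current one, while the additive term absorbs both
% ambient noise and the approximation error of the reverberation model:
y_t \approx \sum_{k=0}^{K-1} h_k \, x_{t-k} + n_t
```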
So there are some other experiments here, as Mike pointed out -- I didn't show many experiments combining, say, NMF, LIFE, or DSCC. I did those experiments as well, and there is further improvement, though small, if I do some combination. >>Mike Seltzer: Any other questions? Let's thank our speaker again. >>Kshitiz Kumar: Thank you. [applause]